<a href="https://colab.research.google.com/github/Sahithi530/Sahithi_INFO5731_Fall2024/blob/main/Tummala_Sahithi_Exercise_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from the top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
# write your answer here


**Research Question:** How does the distribution of electric vehicle (EV) charging stations correlate with the adoption rate of electric vehicles in different states of the U.S.?

**Objective:** To analyze whether the availability of EV charging infrastructure influences the adoption rate of electric vehicles across various states.

**Data to be Collected:**

*   EV Charging Stations Data
    *   Source: Electric Vehicle Population Data on https://catalog.data.gov/dataset/electric-vehicle-population-data
    *   Data points:
        *   Location: Address or coordinates of the charging stations.
        *   Type: Level 1, Level 2, DC Fast Charger.
        *   Status: Operational or inactive.
*   Electric Vehicle Adoption Data
    *   Source: Electric Vehicle Population Data or related datasets.
    *   Data points:
        *   State: U.S. state where the vehicles are registered.
        *   Number of EVs: Total number of electric vehicles registered.
        *   Year: Year of registration.

**Amount of Data Needed:**

*   EV Charging Stations Data: Full dataset from the source, including all records of charging stations.
*   EV Adoption Data: Aggregated data per state, preferably yearly for the last 5-10 years.

**Detailed Steps for Collecting and Saving the Data:**

*   Download EV Charging Stations Data:
    *   Go to the Electric Vehicle Population Data page.
    *   Download the dataset in CSV format.
*   Download EV Adoption Data:
    *   Search for EV adoption datasets on Data.gov or related sites. Download the dataset if available. If not available, check state-specific transportation departments for similar data.
*   Preprocess the Data:
    *   Load the datasets into a data processing tool (e.g., Python with Pandas).
    *   Clean the data (remove duplicates, handle missing values).
*   Save the Data:
    *   Save the cleaned datasets as CSV files for further analysis.
    *   Example filenames: ev_charging_stations.csv, ev_adoption_data.csv.
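The preprocessing steps above, together with the per-state comparison the research question calls for, could look roughly like the following. This is a minimal sketch on a handful of made-up rows: the `Station ID` column, the column names, and all values are hypothetical placeholders, not taken from the real datasets.

```python
import pandas as pd

# Hypothetical station records; the last row duplicates the first on purpose
stations = pd.DataFrame({
    "Station ID": [1, 2, 3, 4, 5, 6, 7, 1],
    "State": ["WA", "WA", "OR", "CA", "CA", "CA", "TX", "WA"],
    "Status": ["Operational", "Operational", "Operational", "Operational",
               "Operational", "Inactive", None, "Operational"],
})

# Hypothetical per-state EV registration totals
adoption = pd.DataFrame({
    "State": ["WA", "OR", "CA", "TX"],
    "Number of EVs": [150, 60, 900, 200],
})

# Clean: remove exact duplicate rows, then rows with a missing status
stations = stations.drop_duplicates().dropna(subset=["Status"])

# Keep operational stations only and count them per state
operational = stations[stations["Status"] == "Operational"]
station_counts = operational.groupby("State").size().reset_index(name="Stations")

# Join station counts with adoption totals on the state code
merged = station_counts.merge(adoption, on="State", how="inner")

# Pearson correlation between station count and EV count across states
r = merged["Stations"].corr(merged["Number of EVs"])
print(merged)
print(f"Pearson r = {r:.3f}")
```

With only three states the coefficient is not meaningful; the point is the shape of the pipeline (clean, aggregate per state, join, correlate), which would stay the same on the full data.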

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
# write your answer here
# Install pandas if not already installed
!pip install pandas

import pandas as pd

# Upload files to Google Colab
from google.colab import files

# Upload the files
uploaded = files.upload()

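# Load the uploaded file into a pandas DataFrame.
# Colab may rename a re-uploaded file (e.g. "name (1).csv"), so read the bytes
# returned by files.upload() instead of relying on a hard-coded filename.
# Note: this file is the Electric Vehicle Population dataset (one row per
# registered vehicle), despite the variable name.
import io
uploaded_name, uploaded_bytes = next(iter(uploaded.items()))
charging_stations_df = pd.read_csv(io.BytesIO(uploaded_bytes))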


# Display the first few rows of each DataFrame to understand their structure
print("Charging Stations Data:")
print(charging_stations_df.head())


# Sample up to 1000 distinct records; sampling without replacement avoids duplicate rows
sample_size = min(1000, len(charging_stations_df))
sampled_charging_stations_df = charging_stations_df.sample(n=sample_size, random_state=1)

# Save the sampled data to CSV files
sampled_charging_stations_df.to_csv('sampled_ev_charging_stations.csv', index=False)

# Provide download links for the CSV files
files.download('sampled_ev_charging_stations.csv')






Saving Electric_Vehicle_Population_Da.csv to Electric_Vehicle_Population_Da (1).csv
Charging Stations Data:
   VIN (1-10)     County          City State  Postal Code  Model Year    Make  \
0  5YJSA1E28K  Snohomish      Mukilteo    WA      98275.0      2019.0   TESLA   
1  1C4JJXP68P     Yakima        Yakima    WA      98901.0      2023.0    JEEP   
2  WBY8P6C05L     Kitsap      Kingston    WA      98346.0      2020.0     BMW   
3  JTDKARFP1J     Kitsap  Port Orchard    WA      98367.0      2018.0  TOYOTA   
4  5UXTA6C09N  Snohomish       Everett    WA      98208.0      2022.0     BMW   

         Model                   Electric Vehicle Type  \
0      MODEL S          Battery Electric Vehicle (BEV)   
1     WRANGLER  Plug-in Hybrid Electric Vehicle (PHEV)   
2           I3          Battery Electric Vehicle (BEV)   
3  PRIUS PRIME  Plug-in Hybrid Electric Vehicle (PHEV)   
4           X5  Plug-in Hybrid Electric Vehicle (PHEV)   

  Clean Alternative Fuel Vehicle (CAFV) Eligibility  Ele
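The columns visible in this preview (`State`, `Model Year`, `Electric Vehicle Type`) are enough to build the aggregated counts that Question 1 asks for. A small sketch, using only the five rows printed above rather than the full file:

```python
import pandas as pd

# The five rows shown in the preview, restricted to three of the columns
preview = pd.DataFrame({
    "State": ["WA", "WA", "WA", "WA", "WA"],
    "Model Year": [2019, 2023, 2020, 2018, 2022],
    "Electric Vehicle Type": [
        "Battery Electric Vehicle (BEV)",
        "Plug-in Hybrid Electric Vehicle (PHEV)",
        "Battery Electric Vehicle (BEV)",
        "Plug-in Hybrid Electric Vehicle (PHEV)",
        "Plug-in Hybrid Electric Vehicle (PHEV)",
    ],
})

# Count vehicles per state and vehicle type
counts = (
    preview.groupby(["State", "Electric Vehicle Type"])
    .size()
    .reset_index(name="Vehicles")
)
print(counts)
```

On the full dataset the same `groupby` could include `Model Year` to obtain yearly counts per state.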


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or the ACM Digital Library (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [None]:
# write your answer
# Import the required modules
import requests
import pandas as pd
import time
from google.colab import files

# Constants for API parameters and configuration
API_ENDPOINT = 'https://api.crossref.org/works'
SEARCH_TERM = 'XYZ'
TOTAL_ARTICLES = 1000 # As per the instructions in class
START_DATE = 2014
END_DATE = 2024
MAX_RETRIES = 5  # Maximum retry attempts
INITIAL_DELAY = 5  # Initial delay between retries

# Function to retrieve articles from CrossRef API, handling rate limits
def retrieve_articles(keyword, total_articles, start_year, end_year):
    articles_list = []
    params = {
        'query': keyword,
        'rows': 100,  # Number of results per request
        'filter': f'from-pub-date:{start_year}-01-01,until-pub-date:{end_year}-12-31',
        'select': 'title,container-title,published-print,author,abstract'
    }

    retries = 0
    while len(articles_list) < total_articles:
        try:
            response = requests.get(API_ENDPOINT, params=params, timeout=30)

            # Check for rate limiting before raise_for_status(), which would otherwise raise on 429
            if response.status_code == 429:  # Rate limit exceeded
                retries += 1
                if retries > MAX_RETRIES:
                    print("Exceeded maximum retries due to rate limits. Terminating process.")
                    break
                delay = INITIAL_DELAY * (2 ** (retries - 1))  # Exponential backoff
                print(f"Rate limit exceeded. Retrying in {delay} seconds...")
                time.sleep(delay)
                continue

            response.raise_for_status()  # Raise an exception for any other HTTP error
            retries = 0  # Reset the retry counter after a successful request

            json_data = response.json()
            article_items = json_data.get('message', {}).get('items', [])

            if not article_items:
                break  # Stopping if no more data is returned

            for article in article_items:
                article_info = {
                    # `or ['']` also covers fields that are present but empty lists
                    'Title': (article.get('title') or [''])[0],
                    'Journal/Conference': (article.get('container-title') or [''])[0],
                    'Publication Year': article.get('published-print', {}).get('date-parts', [[0]])[0][0],
                    'Authors': ', '.join(author.get('family', '') + ' ' + author.get('given', '') for author in article.get('author', [])),
                    'Abstract': article.get('abstract', '')
                }
                articles_list.append(article_info)

                if len(articles_list) >= total_articles:
                    break

            # Implementing pagination by adjusting the 'offset' parameter for the next set of results
            params['offset'] = params.get('offset', 0) + 100

        except requests.RequestException as error:
            print(f"Error occurred while fetching data: {error}")
            break

    return articles_list

# Fetching articles and storing them in a DataFrame
article_data = retrieve_articles(SEARCH_TERM, TOTAL_ARTICLES, START_DATE, END_DATE)
df_articles = pd.DataFrame(article_data)

# Saving the DataFrame to a CSV file
df_articles.to_csv('articles_crossref.csv', index=False)

# Downloading the CSV file
files.download('articles_crossref.csv')

# Previewing the first five rows of the DataFrame
df_articles.head()




|   | Title | Journal/Conference | Publication Year | Authors | Abstract |
|---|---|---|---|---|---|
| 0 | Testing metadata xyz |  | 0 |  |  |
| 1 | Testing metadata xyz |  | 0 |  |  |
| 2 | XYZ arm (XYZ robotic arm) | The Dictionary of Genomics, Transcriptomics an... | 0 |  |  |
| 3 | PEG‐XYZ, peg‐XYZ | Catalysis from A to Z | 0 | Noir B.L.C. |  |
| 4 | Testing metadata xyz |  | 0 |  |  |
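The preview above reports a Publication Year of 0 for every row, which is consistent with records that carry no `published-print` date. One possible remedy is to fall back to other date fields that Crossref records commonly include. The helper below is a hypothetical sketch, not part of the submitted cell: the name `extract_year` and the order of fields tried are assumptions, and any extra field would also need to be listed in the request's `select` parameter.

```python
def extract_year(item):
    """Return the first available publication year from a Crossref work item, or None."""
    # Candidate date fields, checked in order (the order is an assumption)
    for field in ("published-print", "published-online", "issued"):
        date_parts = (item.get(field) or {}).get("date-parts") or [[None]]
        # date-parts is a list of [year, month, day] lists; the year can be missing or None
        if date_parts[0] and date_parts[0][0]:
            return int(date_parts[0][0])
    return None


# Hypothetical items shaped like entries in a Crossref response
print(extract_year({"published-print": {"date-parts": [[2019, 5]]}}))  # 2019
print(extract_year({"issued": {"date-parts": [[2021]]}}))              # 2021
print(extract_year({"issued": {"date-parts": [[None]]}}))              # None
```

Returning `None` instead of `0` keeps missing years distinguishable from real values when the results are loaded into a DataFrame.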


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
# write your answer here

# Installing the praw library
!pip install praw pandas

# Importing praw and pandas
import praw
import pandas as pd

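# Setting up Reddit API credentials
# Read them from environment variables rather than hard-coding secrets in the notebook
# (the variable names REDDIT_CLIENT_ID and REDDIT_SECRET are assumptions; set them before running)
import os
REDDIT_CLIENT_ID = os.environ['REDDIT_CLIENT_ID']
REDDIT_SECRET = os.environ['REDDIT_SECRET']
REDDIT_USER_AGENT = 'I need to collect data to complete an assignment'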

# Initializing Reddit API client
reddit = praw.Reddit(
    client_id=REDDIT_CLIENT_ID,
    client_secret=REDDIT_SECRET,
    user_agent=REDDIT_USER_AGENT
)

# Defining a function to fetch posts from Reddit
def fetch_reddit_posts(query, limit=100):
    posts = []
    # Searching within the subreddit named `query` for posts that match the same `query` string
    for submission in reddit.subreddit(query).search(query, limit=limit):
        post = {
            'Title': submission.title,
            'Score': submission.score,
            'Author': str(submission.author),
            'Created Date': submission.created_utc,
            'URL': submission.url,
            'Content': submission.selftext
        }
        posts.append(post)
    return posts

# Fetching Reddit posts
query = 'Python'  # Replacing with your keyword or subreddit
post_limit = 100  # Number of posts to fetch
posts_data = fetch_reddit_posts(query, limit=post_limit)

# Converting to DataFrame
df_posts = pd.DataFrame(posts_data)

# Saving the DataFrame to a CSV file
csv_file = 'reddit_posts.csv'
df_posts.to_csv(csv_file, index=False)

# Displaying the first few rows of the DataFrame
# (a bare expression is only rendered when it is the last line of a cell)
print(df_posts.head())

# Enabling file download from Google Colab
from google.colab import files
files.download(csv_file)










It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.
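The `Created Date` column in the cell above stores `submission.created_utc`, which PRAW returns as Unix epoch seconds. A readable timestamp is usually easier to work with in the saved CSV. A minimal sketch of the conversion, using a hypothetical helper name `epoch_to_iso` and a made-up sample value:

```python
from datetime import datetime, timezone


def epoch_to_iso(created_utc):
    """Convert Unix epoch seconds (as in created_utc) to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).isoformat()


# Made-up sample value, not taken from a real submission
print(epoch_to_iso(1700000000.0))  # 2023-11-14T22:13:20+00:00
```

Inside `fetch_reddit_posts`, the same call could wrap `submission.created_utc` when building each row.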




## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF file) to any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the code cell below.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here
https://myunt-my.sharepoint.com/:f:/r/personal/sahithitummala_my_unt_edu/Documents/5731?csf=1&web=1&e=ZjbIFp


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
In working on web scraping and data collection, I found the learning experience both challenging and rewarding.
Key concepts like handling APIs, dealing with rate limits, and using libraries such as praw and asyncpraw were crucial.
When using tools like ParseHub, I found them not user-friendly and less flexible than coding.

In my field, the ability to gather and analyze data from various online sources can significantly enhance research and decision-making. It allows for the extraction of valuable insights and trends, which is essential for data-driven projects.
'''
