
# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
'''
Question - When is the best time of day for outdoor activities, so that poor air quality does not affect health?

Data Needed for Analysis:
Air Quality Data: Measurements of air pollutants, including PM2.5, NO2, SO2, CO, and O3, from monitoring stations across the city, obtained through a source such as the Air Quality Open Data Platform. Including timestamps in the data makes it possible to analyze temporal variations.

Amount of Data Needed:
Hourly readings for every day over multiple days or months, to capture fluctuations.

Detailed Steps for Collecting and Saving the Data:
Obtaining API Access:
Register for an API key with AQICN or a similar air quality data provider that offers hourly air quality index (AQI) and pollutant measurements.
Collecting Air Quality Data:
Using the API key, request hourly air quality data for the city, making sure to include all relevant pollutants.

Data Storage:
Store the collected data in a structured format, such as a CSV file or a database. For example, load the data into a pandas DataFrame for manipulation and save it to a CSV file.

Data Analysis:
Analyze the data to identify the times of day with the lowest pollution levels. Visualizations or statistical analysis can then show whether air quality differs significantly between times of day.

'''
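
The storage and analysis steps described above could be sketched as follows. This is only a minimal illustration: it uses the standard-library `csv` module rather than pandas to stay dependency-free, the helper names `write_csv` and `best_hour` are hypothetical, and the readings are made-up values, not collected data.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime
from statistics import mean


def write_csv(rows, fileobj):
    """Write a list of dicts to an open text file object as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


def best_hour(rows, pollutant="PM2.5"):
    """Return the hour of day (0-23) with the lowest mean pollutant value."""
    by_hour = defaultdict(list)
    for row in rows:
        hour = datetime.strptime(row["DateTime"], "%Y-%m-%d %H:%M").hour
        by_hour[hour].append(float(row[pollutant]))
    return min(by_hour, key=lambda h: mean(by_hour[h]))


# Hypothetical readings, for illustration only
readings = [
    {"DateTime": "2024-01-01 06:00", "PM2.5": 20},
    {"DateTime": "2024-01-01 18:00", "PM2.5": 55},
    {"DateTime": "2024-01-02 06:00", "PM2.5": 30},
    {"DateTime": "2024-01-02 18:00", "PM2.5": 45},
]

# An in-memory buffer stands in for a real file such as open("data.csv", "w", newline="")
buffer = io.StringIO()
write_csv(readings, buffer)

# Read the CSV back and find the cleanest hour of the day
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(best_hour(loaded))
```

Grouping by hour of day, rather than by full timestamp, is what lets readings from different days be compared against each other.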

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [16]:
import requests
from datetime import datetime, timedelta
import pandas as pd

#Setting up token or API key
api_key = '5969732f8309906220fb5ad5fe280210c2a8b89c'
base_url = 'https://api.waqi.info/feed/'
city = 'Irving'
start_datetime = datetime(2024, 1, 1)  # Setting the start date to January 1, 2024
days_to_collect = 42  # Collecting data for 42 days
hours_to_collect = 24 * days_to_collect

data = []

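# Iterate over each hour in the range.
# Note: the URL below carries no date parameter, so each request returns the
# station's latest reading; current_datetime is only a label from the loop counter.
for hour_offset in range(hours_to_collect):
    current_datetime = start_datetime + timedelta(hours=hour_offset)

    full_url = f"{base_url}{city}/?token={api_key}"
    response = requests.get(full_url, timeout=10)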

    if response.status_code == 200:
        json_data = response.json()
        if json_data['status'] == 'ok':
            # Extracting values for each pollutant
            pm25 = json_data['data'].get('iaqi', {}).get('pm25', {}).get('v', None)
            no2 = json_data['data'].get('iaqi', {}).get('no2', {}).get('v', None)
            so2 = json_data['data'].get('iaqi', {}).get('so2', {}).get('v', None)
            co = json_data['data'].get('iaqi', {}).get('co', {}).get('v', None)
            data.append({
                'DateTime': current_datetime.strftime('%Y-%m-%d %H:%M'),
                'City': city,
                'PM2.5': pm25,
                'NO2': no2,
                'SO2': so2,
                'CO': co
            })
        else:
            print(f"Data retrieval failed for {current_datetime}")
    else:
        print(f"Failed to retrieve data for {current_datetime}")

# Convert to a DataFrame and save it as a CSV file
df = pd.DataFrame(data)
df.to_csv('air_quality_data.csv', index=False)

# Display the DataFrame
print(df)


              DateTime    City  PM2.5   NO2  SO2   CO
0     2024-01-01 00:00  Irving     37  10.9  0.9  2.3
1     2024-01-01 01:00  Irving     37  10.9  0.9  2.3
2     2024-01-01 02:00  Irving     37  10.9  0.9  2.3
3     2024-01-01 03:00  Irving     37  10.9  0.9  2.3
4     2024-01-01 04:00  Irving     37  10.9  0.9  2.3
...                ...     ...    ...   ...  ...  ...
1003  2024-02-11 19:00  Irving     37  10.9  0.9  2.3
1004  2024-02-11 20:00  Irving     37  10.9  0.9  2.3
1005  2024-02-11 21:00  Irving     37  10.9  0.9  2.3
1006  2024-02-11 22:00  Irving     37  10.9  0.9  2.3
1007  2024-02-11 23:00  Irving     37  10.9  0.9  2.3

[1008 rows x 6 columns]
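
Every row in the output above carries the same pollutant values, which suggests that each request returned the same latest reading while the timestamps came from the loop counter. One possible adjustment, sketched below, is to record the observation time reported in the response and to skip readings that are already stored. The `time` / `s` field names are an assumption about the response layout and should be checked against the WAQI API documentation; the helper names and the sample payload are hypothetical.

```python
def parse_reading(json_data, city):
    """Turn one feed response into a row, or None if the request failed."""
    if json_data.get("status") != "ok":
        return None
    payload = json_data.get("data", {})
    iaqi = payload.get("iaqi", {})
    return {
        # Assumed field: observation time as reported by the station
        "DateTime": payload.get("time", {}).get("s"),
        "City": city,
        "PM2.5": iaqi.get("pm25", {}).get("v"),
        "NO2": iaqi.get("no2", {}).get("v"),
        "SO2": iaqi.get("so2", {}).get("v"),
        "CO": iaqi.get("co", {}).get("v"),
    }


def add_if_new(rows, reading):
    """Append the reading unless its timestamp is already stored."""
    if reading is None:
        return False
    if any(row["DateTime"] == reading["DateTime"] for row in rows):
        return False
    rows.append(reading)
    return True


# Hypothetical payload shaped like a feed response (illustrative values)
sample = {
    "status": "ok",
    "data": {
        "time": {"s": "2024-02-11 19:00:00"},
        "iaqi": {"pm25": {"v": 37}, "no2": {"v": 10.9}},
    },
}

rows = []
add_if_new(rows, parse_reading(sample, "Irving"))
add_if_new(rows, parse_reading(sample, "Irving"))  # same timestamp, skipped
print(len(rows))
```

With this approach, a dataset of distinct hourly readings would have to be built up by polling over time, or taken from a source that serves historical data.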


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [37]:
import requests
import time

def get_articles(api_key, keyword, start_year, end_year, max_results=1000):
    base_url = "https://api.semanticscholar.org/graph/v1/paper/search/bulk"
    headers = {"x-api-key": api_key}
    params = {
        "query": keyword,
        "fields": "title,venue,year,authors,abstract",
        "limit": 100,
    }

    articles = []
    total_fetched = 0

    while total_fetched < max_results:
        response = requests.get(base_url, headers=headers, params=params)
        if response.status_code == 200:
            data = response.json()
            papers = data.get("data", [])
            for paper in papers:
                if paper.get("year") is not None and start_year <= int(paper["year"]) <= end_year:
                    articles.append({
                        "title": paper["title"],
                        "venue": paper.get("venue", ""),
                        "year": paper["year"],
                        "authors": [author["name"] for author in paper.get("authors", [])],
                        "abstract": paper.get("abstract", "")
                    })
                    total_fetched += 1
                    if total_fetched >= max_results:
                        break
            # The bulk search endpoint paginates with a continuation token
            if total_fetched < max_results and data.get("token"):
                params["token"] = data["token"]
            else:
                break
        else:
            print(f"Failed to fetch articles: {response.text}")
            break

        time.sleep(1)  # Respect the API rate limit (one request per second for this key)

    return articles

# API key obtained from Semantic Scholar
api_key = "34lYGGaAFf7fEZDcAdipa9qlsw3bYIE01OxVzY5Y"
keyword = "XYZ"
start_year = 2014 #setting start year to 2014
end_year = 2024  #setting end year to 2024

articles = get_articles(api_key, keyword, start_year, end_year)
for article in articles:
    print(f"Title: {article['title']}")
    print(f"Venue: {article['venue']}")
    print(f"Year: {article['year']}")
    print(f"Authors: {', '.join(article['authors'])}")
    print(f"Abstract: {article['abstract']}\n")


Title: Constrained Quaternions Using Euler Angles
Venue: 
Year: 2017
Authors: D. Eberly
Abstract: 2 The Unconstrained Problem 3 2.1 Rotation X (The Analysis) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Rotation X, Y, or Z . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.3 Rotation XY (The Analysis) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.3.1 Distinct Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3.2 Equal Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.4 Rotation XY, YX, YZ, ZY, ZX, XZ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4.1 Rotation YX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4.2 Rotation ZX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4.3 Rotation XZ . . . . . . . . . . 
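
The collected records could also be saved for later analysis instead of only being printed. The sketch below assumes each article is a dict with the five fields gathered above; the helper names `to_row` and `write_articles` are hypothetical, and the sample record reuses the first title from the output with its abstract set to `None` purely to show how a missing value is handled.

```python
import csv
import io

FIELDS = ["title", "venue", "year", "authors", "abstract"]


def to_row(article):
    """Flatten one article dict into CSV-friendly values."""
    return {
        "title": article.get("title") or "",
        "venue": article.get("venue") or "",
        "year": article.get("year") or "",
        # Join the author list into a single cell
        "authors": "; ".join(article.get("authors") or []),
        # A missing abstract (None) becomes an empty string
        "abstract": article.get("abstract") or "",
    }


def write_articles(articles, fileobj):
    """Write all articles to an open text file object as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    for article in articles:
        writer.writerow(to_row(article))


# Illustrative record; the None abstract is a stand-in for a missing value
sample_articles = [
    {
        "title": "Constrained Quaternions Using Euler Angles",
        "venue": "",
        "year": 2017,
        "authors": ["D. Eberly"],
        "abstract": None,
    }
]

# An in-memory buffer stands in for a real file opened with newline=""
out = io.StringIO()
write_articles(sample_articles, out)
print(out.getvalue())
```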

## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


## Question 4B (10 Points)
If you encounter challenges with web scraping in Python for Question 4A, use an online tool such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in a format like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [1]:
# write your answer here
# https://drive.google.com/drive/folders/1aLM-0kUvW9oOtDp2-kdfyMXJOdyJ0TI3?usp=sharing

# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
Write your response here.
This assignment was the most challenging one. It pushed me to explore many resources to learn about web scraping and data collection.
I learned how to access other websites and interact with them through their APIs, retrieving the data and converting it into the
format I need.

For Question 3, fetching articles, I first thought of using BeautifulSoup to get the data because I had no access to those websites.
Instead I applied for a Semantic Scholar API key and got access. After getting the key, I had to study the documentation
to find the URL used for fetching data in bulk.

For Question 4, I applied for a Twitter developer account and got credentials for analyzing social media platforms,
but the basic plan did not allow me to analyze tweets (the requests returned a forbidden-access error).
So I decided to try a no-code web scraping platform, Octoparse. It was easy to get the tweets just by giving the URL
of the desired page, and easy to export the results to formats such as CSV or Excel.

This assignment really improved my research ability and my approach to solving problems. It made me think outside the box to find answers.

'''