<a href="https://colab.research.google.com/github/NityaVattam2002/Nitya_INFO5731_Fall2024/blob/main/Vattam_Nitya_Exercise_02_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or a practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [1]:
# write your answer here
'''Research Question: How do search trends related to climate change differ across regions, and what factors influence these differences?

Data Needed:
Search terms related to climate change ("global warming," "carbon footprint").
Geographic location (country, city, or region).
Date and time of searches.
Search frequency (number of searches for each term).
Potentially related keywords and topics.

Amount of Data:
A dataset of 1000 samples of search trends, with at least 100 samples from each of 10 different regions, would be sufficient for initial analysis.

Steps for Collecting Data:

(1) Choose a scraping method: the pytrends library for Google Trends.
(2) Define search keywords: select a list of search terms related to climate change, such as "climate change," "global warming," "renewable energy," etc.
(3) Select regions for comparison.
(4) Set a time frame (monthly or yearly trends).
(5) Collect data: use pytrends to collect search trend data for the specified keywords, regions, and time frame.
(6) Save data: store the collected data in a CSV file with the following columns:

Region
Keyword
Date/Time
Search Volume
Related Keywords
'''

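The CSV layout described in the answer above can be sketched as follows. This is a minimal illustration using only the standard library. The column names come from the answer, while the `FIELDNAMES` constant, the `write_trend_rows` helper, and the sample row are hypothetical placeholders rather than collected data.

```python
import csv
import io

# Column order taken from the Question 1 answer.
FIELDNAMES = ["Region", "Keyword", "Date/Time", "Search Volume", "Related Keywords"]


def write_trend_rows(rows, handle):
    """Write trend records (dicts keyed by FIELDNAMES) to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder record for illustration only; not real search data.
example_rows = [
    {
        "Region": "US",
        "Keyword": "climate change",
        "Date/Time": "2024-01-07",
        "Search Volume": 0,
        "Related Keywords": "global warming; carbon footprint",
    }
]

# An in-memory buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_trend_rows(example_rows, buffer)
csv_text = buffer.getvalue()
```

Keeping one row per region, keyword, and date (a "long" layout) should make it straightforward to count samples per region when checking the 100-per-region target.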

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [7]:
!pip install pytrends --upgrade




In [8]:
# write your answer here
from pytrends.request import TrendReq
from pytrends.exceptions import TooManyRequestsError
import pandas as pd
import time

# Initialize pytrends
pytrends = TrendReq(hl='en-US', tz=360)

# Define the search terms and regions
keywords = ["climate change", "global warming", "carbon footprint", "renewable energy"]
regions = ['US', 'GB', 'IN', 'DE', 'FR', 'BR', 'CA', 'AU', 'ZA', 'RU']

# List to store all dataframes
all_data = []

# Function to handle requests, retrying after a delay when rate limited
def safe_request(pytrends, keyword, region, max_retries=5):
    for attempt in range(max_retries):
        try:
            pytrends.build_payload([keyword], cat=0, timeframe='today 12-m', geo=region, gprop='')
            return pytrends.interest_over_time()
        except TooManyRequestsError:
            # Imported from pytrends.exceptions: the `pytrends` name here is a
            # TrendReq instance, which has no `exceptions` attribute
            print("Hit rate limit. Waiting before retrying...")
            time.sleep(60)  # Wait for 60 seconds before retrying
        except Exception as e:
            print(f"An error occurred: {e}")
            return pd.DataFrame()
    # Stop retrying instead of looping forever
    print(f"Giving up on {keyword!r} in {region} after {max_retries} attempts")
    return pd.DataFrame()

# Collect search data for each region and keyword
total_rows = 0
for region in regions:
    for keyword in keywords:
        data = safe_request(pytrends, keyword, region)

        # Only keep data with actual results (no missing data)
        if not data.empty:
            data = data.reset_index()
            data['Region'] = region
            data['Keyword'] = keyword
            all_data.append(data)
            total_rows += len(data)

        # Stop once we reach 1000 samples
        if total_rows >= 1000:
            break
    if total_rows >= 1000:
        break

if all_data:
    # Concatenate all dataframes into one
    all_data = pd.concat(all_data, ignore_index=True)

    # Save to CSV
    all_data.to_csv('climate_change_search_trends.csv', index=False)

    print(f"Collected {len(all_data)} samples and saved to 'climate_change_search_trends.csv'")
else:
    # pd.concat raises ValueError when given an empty list
    print("No data collected; nothing was saved.")



Collected 1007 samples and saved to 'climate_change_search_trends.csv'
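The rate-limit handling in the cell above waits a fixed interval before each retry. A common alternative is exponential backoff with a cap on the number of attempts. The sketch below is one possible shape for it, written against a generic callable rather than pytrends itself; `retry_with_backoff`, `is_rate_limit`, `FakeRateLimit`, and `flaky_fetch` are hypothetical names introduced for illustration.

```python
import time


def retry_with_backoff(fetch, is_rate_limit, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_retries rate-limit errors occur.

    The wait doubles after each rate-limit error: base_delay, 2x, 4x, ...
    Any other exception propagates to the caller unchanged.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception as error:
            if not is_rate_limit(error):
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"Still rate limited after {max_retries} attempts")


# Stand-in for a request that is rate limited twice, then succeeds.
class FakeRateLimit(Exception):
    pass


calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeRateLimit()
    return "ok"


waits = []
result = retry_with_backoff(
    flaky_fetch,
    is_rate_limit=lambda error: isinstance(error, FakeRateLimit),
    sleep=waits.append,  # record the waits instead of actually sleeping
)
```

Passing the `sleep` function as a parameter keeps the helper testable without real delays; in the notebook it would default to `time.sleep`.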


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or the ACM Digital Library (https://dl.acm.org/) with the keyword "XYZ". The articles should have been published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [4]:
# write your answer here
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

def fetch_scholar_data(keyword, num_papers, start_year, end_year):
    base_url = "https://scholar.google.com/scholar"
    collected_data = []
    params = {
        'q': keyword,
        'hl': 'en',
        'as_ylo': start_year,
        'as_yhi': end_year
    }

    # Loop to paginate through results
    for start in range(0, num_papers, 10):
        params['start'] = start
        # A timeout keeps the loop from hanging indefinitely on a stalled connection
        response = requests.get(base_url, params=params, timeout=30)
        if response.status_code != 200:
            print(f"Failed to retrieve data: Status code {response.status_code}")
            break

        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all result containers
        results = soup.find_all('div', class_='gs_ri')
        if not results:
            print("No more results found or error in parsing.")
            break

        for result in results:
            title_elem = result.find('h3', class_='gs_rt')
            title = title_elem.text if title_elem else 'N/A'

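            # Byline, which typically has the form "authors - venue, year - publisher"
            byline_elem = result.find('div', class_='gs_a')
            byline = byline_elem.text.replace('\xa0', ' ') if byline_elem else ''
            parts = [part.strip() for part in byline.split(' - ')]

            # Abstract
            abstract_elem = result.find('div', class_='gs_rs')
            abstract = abstract_elem.text if abstract_elem else 'N/A'

            # Authors (splitting on ' - ' keeps hyphenated names intact)
            authors = parts[0] if parts[0] else 'N/A'

            # Venue and year share the middle segment, e.g. "Journal name, 2019"
            venue, year = 'N/A', 'N/A'
            if len(parts) > 1:
                venue_part, _, year_part = parts[1].rpartition(',')
                year_part = year_part.strip()
                if year_part.isdigit() and len(year_part) == 4:
                    venue = venue_part.strip() or 'N/A'
                    year = year_part
                else:
                    venue = parts[1]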

            collected_data.append({
                'Title': title,
                'Venue': venue,
                'Year': year,
                'Authors': authors,
                'Abstract': abstract
            })

        # Progress reporting
        print(f"Retrieved {len(collected_data)}/{num_papers} papers.")

        # Delay to avoid hitting rate limits
        time.sleep(10)

    return collected_data

# Collect 1000 papers published between 2014 and 2024 with the keyword "XYZ"
papers = fetch_scholar_data(keyword="XYZ", num_papers=1000, start_year=2014, end_year=2024)

# Check if any data was collected before saving
if papers:
    # Save to CSV
    df = pd.DataFrame(papers)
    df.to_csv('google_scholar_articles_data.csv', index=False)
    print("Data collection complete. Saved to google_scholar_articles_data.csv")
else:
    print("No data collected. Please check the scraping process and response.")



Retrieved 10/1000 papers.
Retrieved 20/1000 papers.
Retrieved 30/1000 papers.
Retrieved 40/1000 papers.
Retrieved 50/1000 papers.
Retrieved 60/1000 papers.
Retrieved 70/1000 papers.
Retrieved 80/1000 papers.
Retrieved 90/1000 papers.
Retrieved 100/1000 papers.
Retrieved 110/1000 papers.
Retrieved 120/1000 papers.
Retrieved 130/1000 papers.
Retrieved 140/1000 papers.
Retrieved 150/1000 papers.
Retrieved 160/1000 papers.
Retrieved 170/1000 papers.
Retrieved 180/1000 papers.
Retrieved 190/1000 papers.
Retrieved 200/1000 papers.
Retrieved 210/1000 papers.
Retrieved 220/1000 papers.
Retrieved 230/1000 papers.
Retrieved 240/1000 papers.
Retrieved 250/1000 papers.
Retrieved 260/1000 papers.
Retrieved 270/1000 papers.
Retrieved 280/1000 papers.
Retrieved 290/1000 papers.
Retrieved 300/1000 papers.
Retrieved 310/1000 papers.
Retrieved 320/1000 papers.
Retrieved 330/1000 papers.
Retrieved 340/1000 papers.
Retrieved 350/1000 papers.
Retrieved 360/1000 papers.
Retrieved 370/1000 papers.
Retrieved 
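To make the pagination in the cell above easier to follow, the sketch below builds the URL that each request resolves to, without sending anything over the network. The query parameters (`q`, `hl`, `as_ylo`, `as_yhi`, `start`) are the ones used in the cell; the `build_page_urls` helper and the `page_size` argument are hypothetical names, and the value 10 mirrors the step used in the loop.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/scholar"


def build_page_urls(keyword, num_papers, start_year, end_year, page_size=10):
    """Return one search URL per results page, mirroring the loop above."""
    urls = []
    for start in range(0, num_papers, page_size):
        params = {
            "q": keyword,
            "hl": "en",
            "as_ylo": start_year,   # earliest publication year
            "as_yhi": end_year,     # latest publication year
            "start": start,         # offset of the first result on the page
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


page_urls = build_page_urls("XYZ", num_papers=1000, start_year=2014, end_year=2024)
```

Collecting 1000 papers at 10 per page therefore takes 100 requests, with offsets running from 0 to 990.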

## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [5]:
!pip install praw

Collecting praw
  Downloading praw-7.7.1-py3-none-any.whl.metadata (9.8 kB)
Collecting prawcore<3,>=2.1 (from praw)
  Downloading prawcore-2.4.0-py3-none-any.whl.metadata (5.0 kB)
Collecting update-checker>=0.18 (from praw)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Downloading praw-7.7.1-py3-none-any.whl (191 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 191.0/191.0 kB 1.8 MB/s eta 0:00:00
Downloading prawcore-2.4.0-py3-none-any.whl (17 kB)
Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: update-checker, prawcore, praw
Successfully installed praw-7.7.1 prawcore-2.4.0 update-checker-0.18.0


In [6]:
# write your answer here
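import os
import praw
import pandas as pd
import datetime

# Initialize PRAW with Reddit API credentials read from environment variables,
# so that secrets are not stored in the notebook
# (REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are assumed variable names)
reddit = praw.Reddit(
    client_id=os.environ['REDDIT_CLIENT_ID'],
    client_secret=os.environ['REDDIT_CLIENT_SECRET'],
    user_agent='dev1'
)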

# Define search parameters
keywords = ["climate change", "global warming", "carbon footprint", "renewable energy"]
subreddits = ['all']
limit = 100  # Number of posts to fetch per keyword

# List to store data
data_list = []

# Function to fetch posts for a given keyword
def fetch_reddit_data(keyword):
    for subreddit in subreddits:
        for submission in reddit.subreddit(subreddit).search(keyword, limit=limit):
            data = {
                'Title': submission.title,
                'Author': submission.author.name if submission.author else 'N/A',
                'Score': submission.score,
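                # created_utc is a Unix timestamp in UTC; pass tz to avoid a local-time conversion
                'Created_At': datetime.datetime.fromtimestamp(submission.created_utc, tz=datetime.timezone.utc),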
                'URL': submission.url,
                'Keyword': keyword
            }
            data_list.append(data)

# Collect data for each keyword
for keyword in keywords:
    fetch_reddit_data(keyword)

# Convert to DataFrame
df = pd.DataFrame(data_list)

# Save to CSV
df.to_csv('reddit_data.csv', index=False)

print(f"Collected {len(df)} samples and saved to 'reddit_data.csv'")


It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.




Collected 400 samples and saved to 'reddit_data.csv'
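Because each keyword is searched separately, the same submission may be returned under more than one keyword, so the 400 rows are not necessarily 400 distinct posts. The sketch below shows one way such rows could be merged, assuming the URL identifies a post; `merge_duplicate_posts` and the sample rows are hypothetical and use only a subset of the columns collected above.

```python
def merge_duplicate_posts(posts):
    """Collapse rows that share a URL, joining their keywords with '; '."""
    merged = {}
    for post in posts:
        url = post["URL"]
        if url not in merged:
            # Copy so the caller's dictionaries are left untouched.
            merged[url] = dict(post)
        elif post["Keyword"] not in merged[url]["Keyword"].split("; "):
            merged[url]["Keyword"] += "; " + post["Keyword"]
    return list(merged.values())


# Placeholder rows for illustration only; not real Reddit posts.
sample_posts = [
    {"Title": "Post A", "URL": "https://example.com/a", "Keyword": "climate change"},
    {"Title": "Post A", "URL": "https://example.com/a", "Keyword": "global warming"},
    {"Title": "Post B", "URL": "https://example.com/b", "Keyword": "climate change"},
]

unique_posts = merge_duplicate_posts(sample_posts)
```

Applied to the notebook's data, the helper could be called on `data_list` before the DataFrame is built.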


## Question 4B (10 Points)
If you encounter challenges with web scraping using Python in Question 4A, employ an online tool such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in a format such as CSV or Excel.



Upload a document (Word or PDF file) to any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the code cell below.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
I appreciate the way the course is taught, with clear explanations and examples that make complex concepts easier to understand. The practical insights into web scraping and data collection have been particularly valuable. However, it would be helpful to see more examples during lectures to solidify understanding. For in-class tasks, additional explanations and ideas on how to approach the questions would be beneficial.
'''
