<a href="https://colab.research.google.com/github/Madhu-3499/DataScienceEssentials/blob/main/Surisetti_Madhu_Exercise_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing permissions from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

An interesting line of inquiry is how sleep habits relate to productivity across different age groups. To investigate this, we need to gather each person's age, sleep duration, sleep quality, and daily productivity level, along with other pertinent variables such as occupation and lifestyle choices. A thorough analysis would require a large dataset that spans multiple demographics, with age groups ranging from young adults to older adults.

We might use wearable technologies in conjunction with surveys to gather the required data. First, surveys aimed at people in various age groups might be sent out to collect data on self-reported productivity levels, usual sleep patterns, and perceived sleep quality. These questionnaires could be given out in person or via internet platforms.

Additionally, we may use wearable technology, such as fitness trackers or smartwatches, to get objective measurements of sleep length and quality. Users' sleep patterns throughout the night, including the length of various sleep stages and disruptions, can be monitored by these devices. We can learn more about how differences in sleep patterns affect productivity in various age groups by combining these data sets.
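The combining step described above can be sketched with pandas. This is only an illustration under stated assumptions: the column names, the 1-10 productivity scale, the sample values, and the output filename `sleep_productivity.csv` are all hypothetical and do not come from any real survey or device export.

```python
import pandas as pd

# Hypothetical survey responses: one row per participant
survey = pd.DataFrame({
    "participant_id": [1, 2, 3],
    "age": [24, 41, 67],
    "self_reported_productivity": [7, 5, 6],  # assumed 1-10 scale
})

# Hypothetical wearable export: one row per participant per night
wearable = pd.DataFrame({
    "participant_id": [1, 1, 2, 2, 3],
    "sleep_hours": [7.5, 6.5, 5.0, 6.0, 8.0],
    "awakenings": [1, 2, 4, 3, 1],
})

# Average the nightly readings so each participant has a single summary row
nightly_means = wearable.groupby("participant_id", as_index=False).mean()

# Join the self-reported and measured data on the shared identifier
combined = survey.merge(nightly_means, on="participant_id", how="inner")

# Save the merged dataset for later analysis
combined.to_csv("sleep_productivity.csv", index=False)
print(combined)
```

An inner join keeps only participants who appear in both sources, which avoids rows with missing sleep measurements at the cost of dropping survey-only respondents.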

After gathering the data, we can examine it to find any trends or correlations between productivity levels and sleep patterns. Regression analysis is one statistical tool that might be used in this investigation to assess the significance and strength of the associations found. Moreover, age-demographic subgroup analysis might shed light on how these associations change as people go through different phases of life.

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [1]:
import random
import pandas as pd

# Initialize empty lists for data
age_data = []
weight_data = []
calorie_intake_data = []
activity_level_data = []

# Generate random data for 1000 individuals
for _ in range(1000):
    age = random.randint(18, 80)  # Random age between 18 and 80 (inclusive)
    weight = random.uniform(40, 120)  # Random weight in kilograms (40 kg to 120 kg)

    # Estimate calorie intake from weight using an age-dependent linear formula
    if age <= 30:
        calorie_intake = 15.3 * weight + 679
    elif age <= 60:
        calorie_intake = 11.6 * weight + 879
    else:
        calorie_intake = 13.5 * weight + 487

    # Determine Activity Level based on Calorie Intake
    if calorie_intake < 1800:
        activity_level = "Sedentary"
    elif calorie_intake < 2200:
        activity_level = "Lightly Active"
    elif calorie_intake < 2600:
        activity_level = "Moderately Active"
    else:
        activity_level = "Very Active"

    # Append data to respective lists
    age_data.append(age)
    weight_data.append(weight)
    calorie_intake_data.append(calorie_intake)
    activity_level_data.append(activity_level)

# Create a DataFrame
data = pd.DataFrame({
    'Age': age_data,
    'Weight (kg)': weight_data,
    'Calorie Intake': calorie_intake_data,
    'Activity Level': activity_level_data  # One label per row, not the last label repeated
})

# Save the data to a CSV file
data.to_csv('calorie_intake.csv', index=False)

print("Data collection and saving complete.")


Data collection and saving complete.


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where it was published

(3) Year

(4) Authors

(5) Abstract

In [3]:

# Import necessary libraries
import requests
from bs4 import BeautifulSoup
import json

# Function to fetch articles from Google Scholar
def fetch_google_scholar_articles(query, start_year, end_year, num_articles):
    # Base URL for Google Scholar
    base_url = "https://scholar.google.com/scholar"
    articles = []  # List to store the collected articles

    # Loop to paginate through search results (10 results per page)
    for start in range(0, num_articles, 10):
        # Parameters for the search query, including keywords and date range
        params = {
            "q": query,             # Search query
            "as_ylo": start_year,   # Start year of publication range
            "as_yhi": end_year,     # End year of publication range
            "start": start          # Pagination offset
        }

        # User-Agent header to mimic a web browser
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.1234.0 Safari/537.36"
        }

        # Send a GET request to the Google Scholar search URL with parameters and headers
        response = requests.get(base_url, params=params, headers=headers)

        # Check if the response is successful (HTTP status code 200)
        if response.status_code == 200:
            # Parse the HTML content of the response using BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')

            # Find all the search result div elements
            results = soup.find_all('div', {'class': 'gs_ri'})

            # Iterate through each search result
            for result in results:
                article = {}  # Dictionary to store article information

                # Extract the title of the article (inside an h3 element with class 'gs_rt')
                title = result.find('h3', {'class': 'gs_rt'})
                if title:
                    article['title'] = title.text  # Store the title in the dictionary

                # The 'gs_a' div is assumed to read "authors - venue, year - publisher"
                meta = result.find('div', {'class': 'gs_a'})
                if meta:
                    # Normalize non-breaking spaces, then split on the " - " separators
                    parts = [part.strip() for part in meta.text.replace('\xa0', ' ').split(' - ')]
                    article['authors'] = parts[0]  # First part: the authors
                    if len(parts) > 1:
                        article['venue'] = parts[1]  # Middle part: venue and year
                        # Take the last 4-digit token of the middle part as the year
                        tokens = [token.strip(',') for token in parts[1].split()]
                        years = [token for token in tokens if len(token) == 4 and token.isdigit()]
                        if years:
                            article['year'] = years[-1]

                # Extract the abstract (inside a div element with class 'gs_rs')
                abstract = result.find('div', {'class': 'gs_rs'})
                if abstract:
                    article['abstract'] = abstract.text  # Store the abstract in the dictionary

                # Append the article dictionary to the list of articles
                articles.append(article)

                # Check if the desired number of articles has been collected
                if len(articles) >= num_articles:
                    return articles

    return articles

# Main program
if __name__ == "__main__":
    keyword = "Artificial intelligence"  # Keyword for the search
    start_year = 2014                # Start year of publication range (last 10 years)
    end_year = 2024                  # End year of publication range
    num_articles = 1000              # Desired number of articles to collect

    # Call the fetch_google_scholar_articles function to collect articles
    articles = fetch_google_scholar_articles(keyword, start_year, end_year, num_articles)

    # Save the collected articles to a JSON file
    with open("articles.json", "w", encoding="utf-8") as json_file:
        json.dump(articles, json_file, indent=4, ensure_ascii=False)

    # Print the number of collected articles and a confirmation message
    print(f"Collected {len(articles)} articles and saved to 'articles.json'.")


Collected 0 articles and saved to 'articles.json'.
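The run above collected no articles; one plausible cause is that the site declined the automated requests. Because the question also allows Semantic Scholar, a possible alternative is its public search API, sketched below with only the standard library. The endpoint URL, the parameter names (`query`, `year`, `fields`, `offset`, `limit`), and the response layout are assumptions recalled from the Academic Graph API and should be checked against the current API documentation before use. `flatten_paper` and `search_papers` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Assumed endpoint and field list; verify against the current API documentation
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,venue,year,authors,abstract"

def flatten_paper(paper):
    """Reduce one API record to the five fields the question asks for."""
    authors = paper.get("authors") or []
    return {
        "title": paper.get("title"),
        "venue": paper.get("venue"),
        "year": paper.get("year"),
        "authors": ", ".join(author.get("name", "") for author in authors),
        "abstract": paper.get("abstract"),
    }

def search_papers(query, year_range="2014-2024", max_results=1000, page_size=100):
    """Page through search results until max_results records are collected."""
    articles = []
    offset = 0
    while len(articles) < max_results:
        params = urllib.parse.urlencode({
            "query": query,
            "year": year_range,
            "fields": FIELDS,
            "offset": offset,
            "limit": page_size,
        })
        try:
            with urllib.request.urlopen(f"{SEARCH_URL}?{params}") as response:
                payload = json.load(response)
        except urllib.error.URLError:
            break  # Stop on network errors or rejected requests
        batch = payload.get("data", [])
        if not batch:
            break  # No more results
        articles.extend(flatten_paper(paper) for paper in batch)
        offset += page_size
    return articles[:max_results]
```

Calling `search_papers("Artificial intelligence")` would then return a list of dictionaries that can be written out with `json.dump`, in the same way as the cell above.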


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [5]:
import tweepy
import pandas as pd

# Twitter API credentials
consumer_key = 'YourConsumerKey'
consumer_secret = 'YourConsumerSecret'
access_token = 'YourAccessToken'
access_token_secret = 'YourAccessTokenSecret'

# Authenticate with Twitter API
auth = tweepy.OAuth1UserHandler(consumer_key, consumer_secret, access_token, access_token_secret)
api = tweepy.API(auth)

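# Define search query parameters
query = 'world cup 2023'  # Example search query
tweets_per_request = 100  # Page size; the standard search endpoint returns at most 100 tweets per request
max_tweets = 1000  # Total number of tweets to collect

# Collect tweets
tweets_data = []
for tweet in tweepy.Cursor(api.search_tweets, q=query, count=tweets_per_request, lang='en').items(max_tweets):
    tweets_data.append([
        tweet.id_str,
        tweet.user.screen_name,
        tweet.created_at,
        tweet.text,
        tweet.retweet_count,
        tweet.favorite_count,
    ])

# Create DataFrame (six columns, so the "more than four columns" requirement is met)
columns = ['Tweet ID', 'Username', 'Created At', 'Text', 'Retweets', 'Likes']
tweets_df = pd.DataFrame(tweets_data, columns=columns)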

# Save data to CSV file
tweets_df.to_csv('twitter_data.csv', index=False)

print("Data collection and saving complete.")



Unauthorized: 401 Unauthorized
89 - Invalid or expired token.

## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

Working on these web scraping tasks taught me how to collect data from a range of online sources. Understanding HTML structure, CSS selectors, and XPath was essential for locating and extracting the desired information efficiently. The main difficulties were dynamic content, anti-scraping measures, and rate limiting. Getting past them required adjusting the scraping approach, adding delays between requests, and rotating user agents. In my own work, being able to gather and evaluate data from online sources improves my capacity to offer pertinent facts and insights, supporting research, analysis, and decision-making across a wide range of fields.
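The delay and user-agent rotation techniques mentioned above can be sketched as follows. This is a minimal illustration, not the code used in the exercise: `polite_fetch`, the two User-Agent strings, and the two-second default delay are all hypothetical choices. The function takes the fetcher as a parameter so it can be tried without network access.

```python
import itertools
import time

# Hypothetical pool of User-Agent strings to rotate through
USER_AGENTS = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
)

def polite_fetch(urls, fetch, delay_seconds=2.0, user_agents=USER_AGENTS, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests and rotating the User-Agent."""
    agent_cycle = itertools.cycle(user_agents)
    responses = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # Wait before every request except the first
        headers = {"User-Agent": next(agent_cycle)}
        responses.append(fetch(url, headers=headers))
    return responses
```

With the `requests` library installed, a call such as `polite_fetch(urls, requests.get)` would work, since `requests.get` accepts a `headers` keyword argument.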