<a href="https://colab.research.google.com/github/Bhavyamadhuri/Bhavya_INFO5731_Fall2024/blob/main/Bhavya_Devarakonda_Exercise_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission. , and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind, what kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [1]:
# write your answer here
#How does daily lifestyle factors vary the blood sugar levels of individuals at risk of developing type 2 Diabetes? We will need blood sugar levels monitoring continuously, dietary, physical activity, Sleep patterns, demographic information. We need a sample size of 200 individuals who are at risk of Type 2 Diabetes. For collecting data we can recruit participants through medical clinics, community health centers making sure that the sample is unbiased and diversified, making sure that consent has been obtained. All the data collection need to be integrated for collecting continuously and saving it for analysis.


## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [2]:
# write your answer here
import numpy as np
import pandas as pd
import random

# Set random seed for reproducibility
np.random.seed(53)

# Number of samples to generate
num_samples = 1000

# Function to generate random blood sugar levels (mg/dL)
def generate_blood_sugar():
    # Normal blood sugar levels range from 70-140 mg/dL, with spikes up to 200 mg/dL
    return np.random.normal(loc=100, scale=20)  # Mean 100, std 20

# Function to generate random dietary intake (grams of carbohydrates per day)
def generate_dietary_intake():
    # Average daily intake can range from 150 to 300 grams
    return random.randint(150, 300)

# Function to generate random physical activity (minutes per day)
def generate_physical_activity():
    # Activity can range from 0 (sedentary) to 120 minutes (very active)
    return random.randint(0, 120)

# Function to generate random sleep duration (hours per night)
def generate_sleep_duration():
    # Sleep duration typically ranges from 4 to 10 hours
    return round(np.random.normal(loc=7, scale=1.5), 2)  # Mean 7 hours, std 1.5 hours

# Function to generate random stress level (scale from 1 to 10)
def generate_stress_level():
    # Stress level scale from 1 (low) to 10 (high)
    return random.randint(1, 10)

# Collect data into a DataFrame
data = {
    "Participant_ID": range(1, num_samples + 1),
    "Blood_Sugar_Level": [generate_blood_sugar() for _ in range(num_samples)],
    "Dietary_Intake_Carbs": [generate_dietary_intake() for _ in range(num_samples)],
    "Physical_Activity_Minutes": [generate_physical_activity() for _ in range(num_samples)],
    "Sleep_Duration_Hours": [generate_sleep_duration() for _ in range(num_samples)],
    "Stress_Level": [generate_stress_level() for _ in range(num_samples)]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save to a CSV file
df.to_csv('blood_sugar_lifestyle_data.csv', index=False)

# Display the first few rows of the DataFrame
print(df.head())

   Participant_ID  Blood_Sugar_Level  Dietary_Intake_Carbs  \
0               1         104.117297                   213   
1               2         123.335234                   268   
2               3          58.547204                   299   
3               4          87.346257                   273   
4               5         119.942529                   253   

   Physical_Activity_Minutes  Sleep_Duration_Hours  Stress_Level  
0                         84                  5.89             1  
1                        116                  8.72             1  
2                         39                  1.72             1  
3                        115                  4.12             6  
4                        115                  5.07            10  


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [4]:
# write your answer here
#installing the package 'requests'
!pip install requests

#importing modules 'requests', 'pandas' and 'time'
import requests
import requests
import pandas as pd
import time

# Replace with your Semantic Scholar API key if you have one
api_key = "YOUR_API_KEY"  # Optional

# Define the search parameters
keyword = "XYZ"
max_results = 1000  # Total number of articles to collect
year_start = 2014  # start year for the search
year_end = 2024   # end year for the search

# Define the base URL for Semantic Scholar API
base_url = "https://api.semanticscholar.org/graph/v1/paper/search"

# Parameters for API requests
params = {
    "query": keyword,
    "fields": "title,authors,year,venue,abstract",
    "limit": 100,  # The maximum number of articles to retrieve per request (API limit)
    "offset": 0
}

headers = {}
if api_key:
    headers["x-api-key"] = api_key

# List to store article data
articles = []

# Collect articles in batches until we reach max_results
while len(articles) < max_results:
    # Make the API request
    response = requests.get(base_url, headers=headers, params=params)

    # Check if the request was successful
    if response.status_code == 200:
        data = response.json()
        papers = data.get("data", [])

        # Collect relevant information from each paper
        for paper in papers:
            # Ensure the paper falls within the desired year range
            if year_start <= paper.get("year", 0) <= year_end:
                article_info = {
                    "Title": paper.get("title"),
                    "Venue": paper.get("venue"),
                    "Year": paper.get("year"),
                    "Authors": ", ".join(author.get("name", "") for author in paper.get("authors", [])),
                    "Abstract": paper.get("abstract", "")
                }
                articles.append(article_info)

                # Stop if we've collected the desired number of articles
                if len(articles) >= max_results:
                    break

        # Update offset for the next batch
        params["offset"] += params["limit"]

        # Small delay to avoid hitting API rate limits
        time.sleep(1)

    else:
        print(f"Error: {response.status_code}")
        break

# Save the articles to a CSV file
df = pd.DataFrame(articles)
df.to_csv('articles_data.csv', index=False)

print(f"Collected {len(articles)} articles and saved to 'articles_data.csv'")


Error: 403
Collected 0 articles and saved to 'articles_data.csv'


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, Twitter (formerly known as X), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [8]:
# write your answer here
# write your answer here

!pip install praw # Install the praw module
import praw
import pandas as pd

# Reddit API credentials
client_id = 'YOUR_ACTUAL_CLIENT_ID' # Replace with your actual client ID
client_secret = 'YOUR_ACTUAL_CLIENT_SECRET' # Replace with your actual client secret
user_agent = 'YOUR_ACTUAL_USER_AGENT' # Replace with your actual user agent


# Initialize PRAW
reddit = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)

# Define the subreddit and keyword to search
subreddit_name = 'learnpython'
search_query = 'web scraping'
num_posts = 100  # Number of posts to retrieve

# List to store the data
reddit_data = []

# Fetch submissions from the subreddit
subreddit = reddit.subreddit(subreddit_name)
for submission in subreddit.search(search_query, limit=num_posts):
    reddit_data.append({
        'Title': submission.title,
        'Author': submission.author.name if submission.author else 'N/A',
        'Score': submission.score,
        'Number of Comments': submission.num_comments,
        'URL': submission.url
    })

# Create a DataFrame and save to CSV
df_reddit = pd.DataFrame(reddit_data)
df_reddit.to_csv('reddit_data.csv', index=False)

print(f"Collected {len(reddit_data)} posts from r/{subreddit_name} and saved to 'reddit_data.csv'")




It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



ResponseException: received 401 HTTP response

## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please only choose one option for question 4. If you do both options, we will grade only the first one

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [5]:
'''
Write your response here.

Web scraping helped me understand how to extract live data from public sources and run analysis. I found it challenging  to view the file from one drive to extract data, by making it public view and access it. It looks like the web scraping technique can be useful in real time situations to extract data continuously.


'''

'\nWrite your response here.\n\nWeb scraping helped me understand how to extract live data from public sources and run analysis. I found it challenging  to view the file from one drive to extract data, by making it public view and access it. It looks like the web scraping technique can be useful in real time situations to extract data continuously.\n\n\n'