
# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

How does physical exercise affect resting heart rate and cardiovascular health in various age groups?

This topic seeks to examine the association between physical activity and cardiovascular health, split by age, in order to find trends or patterns that may guide public health recommendations or individual health decisions.


Data needed:

To answer this question, the dataset should contain:

*   Age: to divide the data into age groups.
*   Weekly physical exercise hours: to quantify the amount of physical exercise.
*   Resting heart rate (RHR): a measure of cardiovascular efficiency and health.
*   Cardiovascular health status: a categorical variable derived from RHR and physical activity level that indicates overall cardiovascular health (Excellent, Good, Average, Poor).

To support statistically meaningful comparisons across several age groups, a dataset of at least 1000 samples is recommended. More samples would allow finer age-group segmentation and more robust findings.
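
As a rough illustration of why sample size matters for segmentation, the sketch below counts how many of 1000 randomly drawn ages land in each age band. The four bands in `AGE_BANDS` are hypothetical and chosen only for illustration; the answer above does not fix any particular grouping.

```python
import random
from collections import Counter

# Hypothetical age bands (both bounds inclusive); an assumption for illustration only.
AGE_BANDS = [(18, 29), (30, 44), (45, 59), (60, 80)]


def age_band(age):
    """Return the label of the band that contains `age`."""
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    raise ValueError(f"age {age} is outside every band")


def band_counts(ages):
    """Count how many ages fall into each band."""
    return Counter(age_band(age) for age in ages)


rng = random.Random(0)  # Fixed seed so the sketch is repeatable
ages = [rng.randint(18, 80) for _ in range(1000)]
counts = band_counts(ages)

for band in sorted(counts):
    print(band, counts[band])
```

Splitting each band further by health status divides these counts again, which is why finer segmentation calls for more samples.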

To gather and store the data required for analysing the impact of physical exercise on cardiovascular health across age groups, we first generate synthetic data with a Python script. The script randomly generates records for 1000 people, each with an age, weekly physical activity hours, resting heart rate (RHR), and cardiovascular health status. The data is then saved to a CSV file named 'cardiovascular_health_data.csv' using the pandas 'to_csv' method, which makes it easy to load and manipulate in later analysis. To preserve the dataset's integrity, data types and formats must be consistent across records. In a real-world study, this stage would be preceded by a complete data collection plan covering participant recruitment and ethical considerations.
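
The consistency requirement mentioned above could be checked with a small validation helper like the sketch below. The function name `validate_health_data` and the specific checks are assumptions made for illustration; the column names mirror those used in the Question 2 script.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_numeric_dtype

REQUIRED_COLUMNS = [
    "Age",
    "Weekly Physical Activity (hours)",
    "Resting Heart Rate (bpm)",
    "Cardiovascular Health Status",
]
VALID_STATUSES = {"Excellent", "Good", "Average", "Poor"}


def validate_health_data(df):
    """Return a list of problems found; an empty list means all checks passed."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    if df[REQUIRED_COLUMNS].isna().any().any():
        problems.append("missing values present")
    if not is_integer_dtype(df["Age"]):
        problems.append("Age is not an integer column")
    for col in REQUIRED_COLUMNS[1:3]:
        if not is_numeric_dtype(df[col]):
            problems.append(f"{col} is not numeric")
    unknown = set(df["Cardiovascular Health Status"].dropna()) - VALID_STATUSES
    if unknown:
        problems.append(f"unexpected health status labels: {sorted(unknown)}")
    return problems


# Tiny hand-written frame used only to exercise the checks
sample = pd.DataFrame({
    "Age": [25, 60],
    "Weekly Physical Activity (hours)": [3.5, 0.0],
    "Resting Heart Rate (bpm)": [75, 95],
    "Cardiovascular Health Status": ["Good", "Poor"],
})
print(validate_health_data(sample))
```

In practice the same helper could be run on the frame returned by `pd.read_csv('cardiovascular_health_data.csv')` before any analysis.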


## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [1]:
import random
import pandas as pd

# Initialize empty lists for data collection
age_data = []
activity_level_data = []
rhr_data = []  # Resting Heart Rate (RHR)
health_status_data = []  # One cardiovascular health label per individual

# Generate data for exactly 1000 individuals
for _ in range(1000):  # Ensures loop runs 1000 times for 1000 samples
    age = random.randint(18, 80)  # Age between 18 and 80
    activity_level = random.uniform(0, 10)  # Weekly physical activity in hours (0 to 10 hours)
    rhr = random.randint(60, 100)  # Resting heart rate in bpm (60 to 100)

    # Determine cardiovascular health status based on RHR and physical activity
    if rhr < 70 and activity_level >= 5:
        health_status = "Excellent"
    elif 70 <= rhr <= 80 and activity_level >= 3:
        health_status = "Good"
    elif 80 < rhr <= 90 and activity_level < 3:
        health_status = "Average"
    else:
        # Every combination not matched above falls through to "Poor"
        health_status = "Poor"

    # Append the generated data to the lists
    age_data.append(age)
    activity_level_data.append(activity_level)
    rhr_data.append(rhr)
    health_status_data.append(health_status)  # Store each person's own label

# Create a DataFrame from the collected data
data = pd.DataFrame({
    'Age': age_data,
    'Weekly Physical Activity (hours)': activity_level_data,
    'Resting Heart Rate (bpm)': rhr_data,
    'Cardiovascular Health Status': health_status_data
})

# Save the DataFrame to a CSV file named 'cardiovascular_health_data.csv'
data.to_csv('cardiovascular_health_data.csv', index=False)

print("1000 samples collected and saved to cardiovascular_health_data.csv successfully.")


1000 samples collected and saved to cardiovascular_health_data.csv successfully.


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference in which it was published

(3) Year

(4) Authors

(5) Abstract

In [2]:
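import json
import re
import time

import requests
from bs4 import BeautifulSoup

def retrieve_scholar_articles(search_term, year_start, year_end, total_articles, pause_seconds=2):
    scholar_url = "https://scholar.google.com/scholar"
    fetched_articles = []  # To hold article data
    request_headers = {"User-Agent": "Mozilla/5.0"}  # Simulate browser request

    # Iterate through pages, considering Google Scholar shows 10 articles per page
    for offset in range(0, total_articles, 10):
        query_params = {
            "q": search_term,
            "as_ylo": year_start,
            "as_yhi": year_end,
            "start": offset
        }

        # Execute GET request with provided parameters and headers
        request_response = requests.get(scholar_url, params=query_params,
                                        headers=request_headers, timeout=30)

        # Google Scholar may reject automated requests (for example with HTTP 429)
        if request_response.status_code != 200:
            print(f"Request failed at offset {offset} with status {request_response.status_code}; stopping.")
            break

        parsed_html = BeautifulSoup(request_response.text, 'html.parser')
        search_results = parsed_html.find_all('div', class_='gs_ri')
        if not search_results:
            print(f"No results parsed at offset {offset} (possibly a CAPTCHA page); stopping.")
            break

        for result in search_results:
            extracted_article = {}

            article_title = result.find('h3', class_='gs_rt')
            if article_title:
                extracted_article['title'] = article_title.get_text()

            article_info = result.find('div', class_='gs_a')
            if article_info:
                # Typical layout: "Authors - Venue, Year - Publisher"
                info_text = article_info.get_text().replace('\xa0', ' ')
                info_parts = [part.strip() for part in info_text.split(' - ')]
                extracted_article['info'] = info_text
                extracted_article['authors'] = info_parts[0]
                if len(info_parts) > 1:
                    # Take the first four-digit year in the middle segment
                    year_match = re.search(r'\b(?:19|20)\d{2}\b', info_parts[1])
                    extracted_article['year'] = year_match.group(0) if year_match else None
                    # Whatever remains once the year is removed is treated as the venue
                    venue = re.sub(r',?\s*\b(?:19|20)\d{2}\b', '', info_parts[1]).strip()
                    extracted_article['venue'] = venue or None

            article_abstract = result.find('div', class_='gs_rs')
            if article_abstract:
                # The results page shows only a truncated snippet, not the full abstract
                extracted_article['abstract'] = article_abstract.get_text()

            fetched_articles.append(extracted_article)

            if len(fetched_articles) >= total_articles:
                return fetched_articles

        time.sleep(pause_seconds)  # Pause between pages to limit the request rate

    return fetched_articles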

def main():
    search_query = "information retrieval"
    publication_start_year = 2014  # The question asks for 2014-2024
    publication_end_year = 2024
    articles_needed = 1000
    scholar_articles = retrieve_scholar_articles(search_query, publication_start_year, publication_end_year, articles_needed)

    with open("scholar_articles.json", "w", encoding="utf-8") as file:
        json.dump(scholar_articles, file, indent=4, ensure_ascii=False)

    print(f"Saved {len(scholar_articles)} articles to 'scholar_articles.json'.")

if __name__ == "__main__":
    main()


Saved 0 articles to 'scholar_articles.json'.
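
A count of 0 like the one above is consistent with Google Scholar rejecting automated requests, although the output alone does not confirm the cause. Since the question also allows Semantic Scholar, the sketch below shows how the same five fields might be collected from its public Graph API using only the standard library. The endpoint URL, the query parameters (`query`, `year`, `fields`, `offset`, `limit`), and the shape of the JSON response are assumptions based on the public API description and should be checked against the current documentation, including any cap on the number of results a search may return. The helper names `normalize_paper` and `fetch_papers` are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed endpoint and field list; verify both against the Semantic Scholar API docs.
API_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,venue,year,authors,abstract"


def normalize_paper(paper):
    """Map one API record (a dict) onto the five fields the exercise asks for."""
    authors = [author.get("name", "") for author in paper.get("authors") or []]
    return {
        "title": paper.get("title"),
        "venue": paper.get("venue") or None,  # Empty strings become None
        "year": paper.get("year"),
        "authors": ", ".join(authors),
        "abstract": paper.get("abstract"),
    }


def fetch_papers(query, year_range="2014-2024", total=1000, page_size=100, pause=1.0):
    """Page through search results until `total` records are collected or results run out."""
    papers = []
    for offset in range(0, total, page_size):
        params = urllib.parse.urlencode({
            "query": query,
            "year": year_range,
            "fields": FIELDS,
            "offset": offset,
            "limit": page_size,
        })
        with urllib.request.urlopen(f"{API_URL}?{params}", timeout=30) as response:
            batch = json.load(response).get("data", [])
        if not batch:
            break  # No more results available
        papers.extend(normalize_paper(paper) for paper in batch)
        time.sleep(pause)  # Stay well below the rate limit
    return papers[:total]
```

Calling `fetch_papers("information retrieval")` and passing the result to `json.dump` would produce a file with the same layout as 'scholar_articles.json'. Unauthenticated requests may be rate limited, so a longer `pause` or an API key may be needed.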


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [9]:
!pip install --upgrade asyncpraw

Collecting asyncpraw
  Downloading asyncpraw-7.7.1-py3-none-any.whl (196 kB)
Collecting aiofiles<1 (from asyncpraw)
  Downloading aiofiles-0.8.0-py3-none-any.whl (13 kB)
Collecting aiosqlite<=0.17.0 (from asyncpraw)
  Downloading aiosqlite-0.17.0-py3-none-any.whl (15 kB)
Collecting asyncprawcore<3,>=2.1 (from asyncpraw)
  Downloading asyncprawcore-2.4.0-py3-none-any.whl (19 kB)
Installing collected packages: aiosqlite, aiofiles, asyncprawcore, asyncpraw
Successfully installed aiofiles-0.8.0 aiosqlite-0.17.0 asyncpraw-7.7.1 asyncprawcore-2.4.0


In [13]:
import asyncio
import asyncpraw
import pandas as pd

async def fetch_reddit_posts(client_id, client_secret, username, password, user_agent, subreddits, limit_per_subreddit):
    reddit = asyncpraw.Reddit(client_id=client_id,
                              client_secret=client_secret,
                              username=username,
                              password=password,
                              user_agent=user_agent)

    posts_data = {
        'Post ID': [],
        'Author': [],
        'Title': [],
        'Comments Count': [],
        'Score': [],
        'Upvote Ratio': [],
        'Flair': [],
    }

    for subreddit_name in subreddits:
        subreddit = await reddit.subreddit(subreddit_name)
        async for post in subreddit.hot(limit=limit_per_subreddit):
            posts_data['Post ID'].append(post.id)
            posts_data['Author'].append(str(post.author))
            posts_data['Title'].append(post.title)
            posts_data['Comments Count'].append(post.num_comments)
            posts_data['Score'].append(post.score)
            posts_data['Upvote Ratio'].append(post.upvote_ratio)
            posts_data['Flair'].append(post.link_flair_text)

        print(f"Completed scraping {subreddit_name}; total posts collected: {len(posts_data['Post ID'])}")

    await reddit.close()
    return pd.DataFrame(posts_data)

async def main():
    client_id = "YOUR_CLIENT_ID"
    client_secret = "YOUR_CLIENT_SECRET"
    username = "YOUR_USERNAME"
    password = "YOUR_PASSWORD"
    user_agent = "YOUR_USER_AGENT"

    subreddits_to_scrape = ['india', 'worldnews', 'announcements', 'funny', 'AskReddit',
                            'gaming', 'pics', 'science', 'movies', 'todayilearned']
    limit_per_subreddit = 100  # Adjust as needed for your use case

    df = await fetch_reddit_posts(client_id, client_secret, username, password, user_agent, subreddits_to_scrape, limit_per_subreddit)
    df.to_csv('async_reddit_data.csv', index=False)
    print("Data collection complete and saved to 'async_reddit_data.csv'.")

# Manual event loop handling
if __name__ == "__main__":
    try:
        loop = asyncio.get_event_loop()
        if loop.is_running():
            task = loop.create_task(main())
        else:
            loop.run_until_complete(main())
    except RuntimeError as e:
        print(f"Error running async main: {e}")


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

Working on the web scraping and data collection tasks has been a valuable learning experience that greatly improved my grasp of how to programmatically access and extract data from internet sources. Key concepts such as navigating HTML structure with BeautifulSoup and handling dynamic content with Selenium have proven essential. The main challenges came from websites that rely heavily on JavaScript to render content dynamically, which calls for a more sophisticated approach that uses Selenium to emulate browser interactions. Overcoming these challenges by learning to emulate user behaviour programmatically was very satisfying. In my field, the ability to collect and analyse data from a variety of internet sources opens up new research opportunities: it allows the creation of large datasets that can provide deeper insights and support evidence-based conclusions, improving the quality and scope of my work.





