<a href="https://colab.research.google.com/github/KrinalM/Krinalben_INFO5731_Spring2020/blob/main/Monpara_Krinalben_Exercise_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

Research Question:

What impact does teen use of social media have on mental health in urban settings?

Data Collection:

Define Variables: List the key variables: social media use (frequency, duration, and platforms used); mental health indicators (stress, depression, and anxiety); demographic data (age, gender, and socioeconomic status); and possible confounding variables such as family dynamics and peer pressure.

Survey Design: Create a thorough questionnaire that combines questions on social media usage with validated scales for assessing mental health indicators (such as the PHQ-9 for depression and the GAD-7 for anxiety). Make sure the questions are age-appropriate, concise, and clear.
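As a rough illustration of how the two scales named above could be scored, the sketch below sums the item responses (each scored 0 to 3) and maps the total to the severity bands from the published PHQ-9 and GAD-7 scoring guides. The function name `score_scale` and the sample answers are made up for this example.

```python
# Severity bands as (upper bound of total score, label).
PHQ9_BANDS = [(4, "minimal"), (9, "mild"), (14, "moderate"),
              (19, "moderately severe"), (27, "severe")]
GAD7_BANDS = [(4, "minimal"), (9, "mild"), (14, "moderate"), (21, "severe")]


def score_scale(item_responses, n_items, bands):
    """Sum the item responses (each 0-3) and map the total to a severity band."""
    if len(item_responses) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(item_responses)}")
    if any(r not in (0, 1, 2, 3) for r in item_responses):
        raise ValueError("each item must be scored 0, 1, 2, or 3")
    total = sum(item_responses)
    for upper, label in bands:
        if total <= upper:
            return total, label


# Hypothetical respondent (made-up answers, for illustration only)
phq9_answers = [1, 2, 0, 1, 3, 0, 1, 2, 0]
gad7_answers = [0, 1, 1, 0, 2, 1, 0]

phq9_result = score_scale(phq9_answers, 9, PHQ9_BANDS)
gad7_result = score_scale(gad7_answers, 7, GAD7_BANDS)
print(phq9_result, gad7_result)
```

Storing the item-level answers rather than only the totals keeps it possible to re-check the scoring later.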

Sampling Strategy: Use stratified random sampling to ensure representation across age groups, genders, and socioeconomic backgrounds within metropolitan areas. Aim for a sample of at least 500 participants to have adequate statistical power.
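A minimal sketch of stratified random sampling with pandas, assuming a sampling frame is available as a DataFrame. The frame below is entirely hypothetical (12 strata of 50 people each), and the 20% sampling fraction is only an example value.

```python
import pandas as pd

# Hypothetical sampling frame: every combination of stratum values, 50 teens each
frame = pd.DataFrame(
    [(age_group, gender, ses, i)
     for age_group in ["13-15", "16-19"]
     for gender in ["Female", "Male"]
     for ses in ["Low", "Middle", "High"]
     for i in range(50)],
    columns=["age_group", "gender", "ses", "person"],
)

strata = ["age_group", "gender", "ses"]

# Proportional allocation: draw the same fraction at random from every stratum
sample = frame.groupby(strata).sample(frac=0.2, random_state=42)

# Each stratum contributes 20% of its members to the sample
print(sample.groupby(strata).size())
```

Sampling the same fraction within each stratum keeps the sample's composition aligned with the frame, which simple random sampling only achieves on average.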

Data Collection Procedure: Distribute the survey electronically through school networks, community centers, social media sites, and other relevant channels. Obtain informed consent from participants (and, where participants are minors, from a parent or guardian as well) while protecting their privacy and confidentiality.

Data Management and Storage: Store the data in secure systems that comply with applicable data protection laws such as the GDPR. Encrypt sensitive data and assign each participant a unique identification number. Back up the data regularly to guard against loss.
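One way to assign unique identification numbers is sketched below, assuming the raw responses arrive with a direct identifier such as an email address. The column names and values are hypothetical. Encryption of the stored files is not shown and would be handled by whatever storage solution is chosen.

```python
import uuid

import pandas as pd

# Hypothetical raw responses (made-up values)
raw = pd.DataFrame({
    "email": ["a@example.org", "b@example.org", "c@example.org"],
    "age": [14, 17, 16],
    "phq9_total": [6, 12, 3],
})

# Random UUIDs carry no information about the person they are assigned to
raw["participant_id"] = [uuid.uuid4().hex for _ in range(len(raw))]

# Linking key: kept separately from the analysis data, with restricted access
key_table = raw[["participant_id", "email"]]

# Analysis file: direct identifiers removed
analysis_table = raw.drop(columns=["email"])

print(analysis_table.columns.tolist())
```

Keeping the linking key in a separate, access-controlled location means the analysis file alone does not reveal who the participants are.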

Data Cleaning and Preprocessing: Clean the dataset thoroughly to find and fix errors, inconsistencies, and missing values. Recode variables and standardize responses as needed for analysis.
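The cleaning steps might look like the following sketch, which assumes survey totals are stored in columns named `phq9_total` and `gad7_total` (hypothetical names, with made-up rows). It removes duplicate submissions, standardizes a text field, and treats scores outside the valid scale range as missing.

```python
import numpy as np
import pandas as pd

# Made-up responses containing typical problems
raw = pd.DataFrame({
    "participant_id": [1, 2, 2, 3, 4],
    "gender": [" female", "Male", "Male", "FEMALE ", "male"],
    "phq9_total": [6, 12, 12, 31, np.nan],  # 31 is outside the 0-27 range
    "gad7_total": [4, 9, 9, 7, 15],
})

# Keep one row per participant
clean = raw.drop_duplicates(subset="participant_id").copy()

# Standardize free-text categories
clean["gender"] = clean["gender"].str.strip().str.capitalize()

# Treat impossible scores as missing rather than guessing a correction
clean["phq9_total"] = clean["phq9_total"].where(clean["phq9_total"].between(0, 27))
clean["gad7_total"] = clean["gad7_total"].where(clean["gad7_total"].between(0, 21))

# Report how much is missing before deciding how to handle it
n_missing = clean[["phq9_total", "gad7_total"]].isna().sum()
print(n_missing)

# Simplest option: drop incomplete rows (imputation is an alternative)
clean = clean.dropna(subset=["phq9_total", "gad7_total"])
```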

Data Analysis: Use statistical methods such as correlation analysis, regression analysis, and structural equation modeling to investigate the associations between social media use and mental health outcomes while controlling for confounding variables.
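A small sketch of the correlation and regression steps, using NumPy's least-squares solver rather than a dedicated statistics package. The data are simulated with a built-in effect purely to show the mechanics; they are not survey results, and the variable names are hypothetical. Structural equation modeling is not shown.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Simulated data with a known relationship (NOT real survey results)
age = rng.integers(13, 20, size=n)          # ages 13 to 19
hours = rng.uniform(0, 8, size=n)           # daily social media hours
phq9 = 2.0 + 1.5 * hours + 0.2 * age + rng.normal(0, 2, size=n)
df = pd.DataFrame({"age": age, "hours": hours, "phq9": phq9})

# Correlation between usage and the depression score
r = df["hours"].corr(df["phq9"])

# Ordinary least squares: phq9 ~ intercept + hours + age
# Including age in the model adjusts the usage coefficient for that confounder
X = np.column_stack([np.ones(n), df["hours"], df["age"]])
coef, *_ = np.linalg.lstsq(X, df["phq9"].to_numpy(), rcond=None)
intercept, b_hours, b_age = coef

print(f"r = {r:.2f}, slope for hours = {b_hours:.2f}")
```

A dedicated package would additionally report standard errors and p-values, which a real analysis would need.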

Reporting and Interpretation: Interpret the results in the context of existing research and theoretical frameworks. Discuss the implications for future research, intervention measures, and policy. Present the results at relevant conferences or seminars and publish them in peer-reviewed journals.

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [62]:
import random
import pandas as pd

# Generate synthetic data for social media usage and mental health indicators
data = []

for _ in range(1000):
    age = random.randint(13, 19)
    gender = random.choice(['Male', 'Female'])
    socio_economic_status = random.choice(['Low', 'Middle', 'High'])
    social_media_usage = random.randint(0, 10)  # Assuming a scale of 0 to 10 for frequency/duration
    stress_level = random.randint(1, 10)  # Assuming a scale of 1 to 10 for stress level
    depression_score = random.randint(0, 27)  # PHQ-9 scale ranges from 0 to 27
    anxiety_score = random.randint(0, 21)  # GAD-7 scale ranges from 0 to 21

    data.append([age, gender, socio_economic_status, social_media_usage, stress_level, depression_score, anxiety_score])

# Create a DataFrame
columns = ['Age', 'Gender', 'Socioeconomic Status', 'Social Media Usage', 'Stress Level', 'Depression Score', 'Anxiety Score']
df = pd.DataFrame(data, columns=columns)

# Save DataFrame to CSV file
df.to_csv('social_media_mental_health_dataset.csv', index=False)

print("Dataset saved successfully.")

Dataset saved successfully.


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [63]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

def scrape_google_scholar(keyword, start_year, end_year, num_articles):
    base_url = "https://scholar.google.com/scholar"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"}
    params = {
        "q": keyword,
        "as_ylo": start_year,
        "as_yhi": end_year,
        "hl": "en",
        "as_sdt": "0,5",
        "num": 10  # Scraping 10 articles per page
    }

    articles = []

    for page in range(num_articles // 10):
        params["start"] = page * 10
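        response = requests.get(base_url, params=params, headers=headers, timeout=30)
        if response.status_code != 200:
            # Google Scholar often rejects automated requests (e.g. HTTP 429); stop rather than parse an error page
            print(f"Stopping at page {page}: HTTP {response.status_code}")
            break
        soup = BeautifulSoup(response.text, 'html.parser')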

        for result in soup.find_all('div', class_='gs_r'):
            title_tag = result.find('h3', class_='gs_rt')
            title = title_tag.text.strip() if title_tag else "N/A"

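            # The 'gs_a' line usually reads "Authors - Venue, Year - Publisher"
            meta_tag = result.find('div', class_='gs_a')
            meta_text = meta_tag.text.replace('\xa0', ' ') if meta_tag else ""
            meta_parts = [part.strip() for part in meta_text.split(' - ')]
            authors = meta_parts[0] or "N/A"

            # The middle segment holds "Venue, Year"; the venue may be absent
            venue_year = meta_parts[1] if len(meta_parts) > 1 else ""
            venue, _, year = venue_year.rpartition(',')
            venue, year = venue.strip() or "N/A", year.strip()
            if not year.isdigit():
                venue, year = venue_year or "N/A", "N/A"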

            abstract_tag = result.find('div', class_='gs_rs')
            abstract = abstract_tag.text.strip() if abstract_tag else "N/A"

            articles.append([title, venue, year, authors, abstract])

        time.sleep(1)  # Pause between requests to reduce load on the server

    return articles

# Parameters
keyword = "XYZ"
start_year = 2014
end_year = 2024
num_articles = 1000

# Scraping Google Scholar
articles = scrape_google_scholar(keyword, start_year, end_year, num_articles)

# Creating DataFrame
df = pd.DataFrame(articles, columns=['Title', 'Venue/Journal/Conference', 'Year', 'Authors', 'Abstract'])

# Saving DataFrame to CSV
df.to_csv('google_scholar_articles.csv', index=False)

print("Articles saved successfully.")

Articles saved successfully.
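Semantic Scholar, one of the sources the question allows, offers a JSON search API that avoids HTML parsing. The sketch below is an assumption-laden outline rather than a tested collector: the endpoint, the `fields` list, and the `year` range filter should be checked against the current API documentation, and `flatten_paper` and `fetch_page` are hypothetical helper names. Only the offline part runs here; `fetch_page` performs a network call and is defined but not invoked.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint and field names; verify against the API documentation
API_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,venue,year,authors,abstract"


def flatten_paper(paper):
    """Turn one API record into the five columns the exercise asks for."""
    authors = ", ".join(a.get("name", "") for a in paper.get("authors") or [])
    return {
        "Title": paper.get("title") or "N/A",
        "Venue/Journal/Conference": paper.get("venue") or "N/A",
        "Year": paper.get("year") or "N/A",
        "Authors": authors or "N/A",
        "Abstract": paper.get("abstract") or "N/A",
    }


def fetch_page(keyword, offset, limit=100, year_range="2014-2024"):
    """Request one page of search results (performs a network call)."""
    query = urlencode({"query": keyword, "fields": FIELDS, "year": year_range,
                       "offset": offset, "limit": limit})
    with urlopen(f"{API_URL}?{query}", timeout=30) as response:
        payload = json.load(response)
    return [flatten_paper(p) for p in payload.get("data", [])]


# Offline check with a hand-written record shaped like an API result
fake = {
    "title": "An example paper",
    "venue": "",
    "year": 2020,
    "authors": [{"authorId": "1", "name": "A. Author"},
                {"authorId": "2", "name": "B. Author"}],
    "abstract": None,
}
row = flatten_paper(fake)
print(row)
```

Collecting 1000 articles would mean calling `fetch_page` repeatedly with increasing `offset` values and a pause between calls, subject to whatever rate and result limits the service applies.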


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


## Question 4B (10 Points)
If you encounter challenges with the Question 4A web scraping using Python, employ any online tool such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [64]:
'''
I used the online tool ParseHub for data extraction.

SELECTED TOOL:
ParseHub is a powerful and user-friendly web scraping tool that allows you to extract data from websites easily.

STEPS FOR WEB SCRAPING:
1. Downloaded and installed the ParseHub app.
2. Opened ParseHub, clicked "New Project", and entered the URL of the website to scrape (https://www.imdb.com/chart/top/).
3. Set up selectors for the data to extract.
4. Refined the selections as needed.
5. Clicked the "Get Data" button to start the scraping process. ParseHub loaded the page and extracted the data.
6. Once the scraping completed, reviewed the extracted data in the ParseHub interface and exported it as CSV.

Below is the link to the CSV with data collected using ParseHub:

Link: https://myunt-my.sharepoint.com/:x:/g/personal/krinalbenmonpara_my_unt_edu/EcABEDyXVqRMsCWkBUiY3LUBl5e-mNm9qGZfOGvpDSHGNg?e=SuWa6L

'''

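After exporting from ParseHub, a quick check with pandas can confirm that the file loads and has more than four columns, as the question requires. The sketch below uses an in-memory stand-in for the export: the column names and rows are placeholders, since the real file will use whatever names were set in the ParseHub selectors.

```python
import io

import pandas as pd

# Stand-in for the exported file; column names and rows are placeholders
exported_csv = io.StringIO(
    "rank,title,year,rating,url\n"
    "1,Film A,1994,9.0,https://example.org/film-a\n"
    "2,Film B,1972,8.9,https://example.org/film-b\n"
)

# With a real export, pass the file path to read_csv instead
df = pd.read_csv(exported_csv)

has_enough_columns = df.shape[1] > 4
has_missing_values = bool(df.isna().any().any())
has_duplicate_rows = bool(df.duplicated().any())

print(df.shape, has_enough_columns, has_missing_values, has_duplicate_rows)
```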

# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [65]:
'''
Learning experience: Web scraping involves a combination of technical skills, domain knowledge, and ethical considerations.
Practicing on various online sources and real-world projects helps in honing these skills and becoming proficient in extracting data from the web.
Challenges: Question 4 was very challenging for me, so I used ParseHub for data extraction.
Relevance: The ability to gather and analyze data from online sources enhances research,
decision-making, and insights across diverse fields, making web scraping a valuable tool for professionals and academics alike.

'''
