
# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind, what kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
# write your answer here
The research question I keep coming back to is how social media usage relates to mental health in adolescents.
The study examines the relationship between teenage mental health and social media use, taking demographic variables into account.
Data would be collected from a sample of at least 1000 teenagers and would cover demographics, social media usage patterns, and mental health evaluations.
Demographic variables include age, gender, ethnicity, household income, and parental education. Social media usage covers time spent, types of content viewed, and frequency of use.
Mental health is assessed with standardised instruments such as the PHQ-9 for depression and the GAD-7 for anxiety.
Data is collected through surveys distributed via community centres, schools, or online platforms, with responses kept anonymous.
Mental health assessments are administered by qualified specialists to ensure accuracy.
The social media, mental health, and demographic data are then combined into one integrated dataset for analysis.
The dataset is saved, together with documentation, in an organised format such as CSV.
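
As a small illustration of the assessment step above, the sketch below maps raw PHQ-9 and GAD-7 totals to the commonly cited severity bands. The function names `phq9_severity` and `gad7_severity` are hypothetical helpers, not part of any library, and a real study would confirm the cut-offs against the scoring guidance it follows.

```python
from bisect import bisect_right

# Lower bound of each severity band (commonly cited cut-offs; confirm before use)
PHQ9_BANDS = [(0, "minimal"), (5, "mild"), (10, "moderate"),
              (15, "moderately severe"), (20, "severe")]
GAD7_BANDS = [(0, "minimal"), (5, "mild"), (10, "moderate"), (15, "severe")]


def _band(score, bands, max_score):
    """Return the label of the band whose range contains score."""
    if not 0 <= score <= max_score:
        raise ValueError(f"score must be between 0 and {max_score}")
    lower_bounds = [lower for lower, _ in bands]
    # bisect_right finds how many lower bounds are <= score
    return bands[bisect_right(lower_bounds, score) - 1][1]


def phq9_severity(score):
    """Severity label for a PHQ-9 total (0 to 27). Hypothetical helper."""
    return _band(score, PHQ9_BANDS, 27)


def gad7_severity(score):
    """Severity label for a GAD-7 total (0 to 21). Hypothetical helper."""
    return _band(score, GAD7_BANDS, 21)


print(phq9_severity(12))
print(gad7_severity(16))
```

Storing the band next to the raw score in the CSV would keep both the original measurement and a categorical variable available for analysis.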

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [29]:
import pandas as pd
from random import choice, randint

N_SAMPLES = 1000  # number of synthetic participants

# Generate synthetic demographic data (one dict per participant)
demographic_data = pd.DataFrame([{
    'Age': randint(13, 19),
    'Gender': choice(['Male', 'Female']),
    'Ethnicity': choice(['Caucasian', 'African American', 'Hispanic', 'Asian', 'Other']),
    'Household Income': choice(['< $25,000', '$25,000 - $50,000', '$50,000 - $75,000', '> $75,000']),
    'Parental Education': choice(['High School or Less', 'Some College', 'Bachelor\'s Degree', 'Graduate Degree'])
} for _ in range(N_SAMPLES)])

# Generate synthetic social media usage data
social_media_data = pd.DataFrame([{
    'Frequency of Use': choice(['Low', 'Medium', 'High']),
    'Time Spent': randint(1, 5),  # Assuming hours per day
    'Content Type': choice(['Text', 'Images', 'Videos', 'Mixed'])
} for _ in range(N_SAMPLES)])

# Generate synthetic mental health assessment data
mental_health_data = pd.DataFrame([{
    'Depression Score': randint(0, 27),  # PHQ-9 score ranges from 0 to 27
    'Anxiety Score': randint(0, 21)  # GAD-7 score ranges from 0 to 21
} for _ in range(N_SAMPLES)])

# Combine all data into a single dataset (rows are aligned by position)
dataset = pd.concat([demographic_data, social_media_data, mental_health_data], axis=1)

# Save the dataset to a CSV file
dataset.to_csv('adolescent_mental_health_dataset.csv', index=False)

print(dataset)


    Age  Gender         Ethnicity   Household Income   Parental Education  \
0    16    Male             Other          > $75,000    Bachelor's Degree   
1    14  Female             Other  $50,000 - $75,000      Graduate Degree   
2    18    Male             Other  $50,000 - $75,000    Bachelor's Degree   
3    14    Male  African American  $50,000 - $75,000  High School or Less   
4    16    Male             Asian          > $75,000  High School or Less   
..   ..     ...               ...                ...                  ...   
995  15    Male         Caucasian          > $75,000         Some College   
996  14  Female             Other  $50,000 - $75,000  High School or Less   
997  16  Female         Caucasian          > $75,000         Some College   
998  13    Male         Caucasian          > $75,000         Some College   
999  16    Male         Caucasian          < $25,000  High School or Less   

    Frequency of Use Time Spent Content Type Depression Score Anxiety Score

## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [31]:
# write your answer here
import re
import time

import requests
from bs4 import BeautifulSoup

def scrape_google_scholar(keyword, max_articles=1000):
    """Scrape title, venue, year, authors, and abstract snippet from Google Scholar results."""
    url = "https://scholar.google.com/scholar"
    articles = []
    start = 0  # offset of the first result on the current page

    while len(articles) < max_articles:
        # Passing params lets requests URL-encode the keyword;
        # as_ylo and as_yhi restrict the results to 2014-2024
        params = {'hl': 'en', 'as_sdt': '0,5', 'q': keyword,
                  'as_ylo': 2014, 'as_yhi': 2024, 'start': start}
        response = requests.get(url, params=params)
        if response.status_code != 200:
            break  # request failed or was blocked, so there is no page to parse

        soup = BeautifulSoup(response.text, 'html.parser')
        results = soup.find_all('div', class_='gs_r gs_or gs_scl')
        if not results:
            break  # no further results

        for result in results:
            article = {}
            title_tag = result.find('h3', class_='gs_rt')
            if title_tag:
                article['title'] = title_tag.text.strip()

            # The gs_a byline holds authors, year, and source separated by hyphens
            byline_tag = result.find('div', class_='gs_a')
            if byline_tag:
                years = re.findall(r'\d{4}', byline_tag.text)
                if years:
                    article['year'] = int(years[0])
                article['venue'] = byline_tag.text.split('-')[-1].strip()
                article['authors'] = byline_tag.text.split('-')[0].strip()

            abstract_tag = result.find('div', class_='gs_rs')
            if abstract_tag:
                article['abstract'] = abstract_tag.text.strip()

            articles.append(article)
            if len(articles) >= max_articles:
                break

        start += len(results)  # move on to the next page of results
        time.sleep(2)  # Add a delay to avoid being blocked

    return articles

# Example usage:
keyword = input("Enter the Keyword which you want to: ")
articles = scrape_google_scholar(keyword, max_articles=1000)

# Print the first few articles
for i, article in enumerate(articles[:5]):
    print(f"Article {i+1}:")
    print("Title:", article.get('title'))
    print("Venue:", article.get('venue'))
    print("Year:", article.get('year'))
    print("Authors:", article.get('authors'))
    print("Abstract:", article.get('abstract'))
    print()


Enter the Keyword which you want to: political
Article 1:
Title: [BOOK][B] The political persuaders
Venue: books.google.com
Year: 2020
Authors: D Nimmo
Abstract: For better or worse, political image is now more … Political estrangement, as illustrated by 
declining voting levels, may well be a by-product of deceptive political consultant and political …

Article 2:
Title: [BOOK][B] Political influence
Venue: taylorfrancis.com
Year: 2017
Authors: E Banfield
Abstract: In government, influence denotes one's ability to get others to act, think, or feel as one 
intends. A mayor who persuades voters to approve a bond issue exercises influence. A …

Article 3:
Title: Political powerlessness
Venue: HeinOnline
Year: 2015
Authors: NO Stephanopoulos
Abstract: … My primary goal in this Article, then, is to offer a definition of political powerlessness that … 
political processes ordinarily to be relied upon to protect minorities." Second, "those political …

Article 4:
Title: [BOOK][B] The state a
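
In the output above, the "Venue" field holds the hosting site (for example `books.google.com`) rather than a journal or conference name, because the scraper keeps only the last hyphen-separated segment of the byline. The sketch below shows one possible way to split such a byline further. It assumes the byline has the shape `authors - venue, year - publisher`, which is an assumption about the page layout and may not hold for every result; `parse_byline` is a hypothetical helper and the sample strings are illustrative.

```python
import re


def parse_byline(byline):
    """Split 'authors - venue, year - publisher' into separate fields."""
    # Normalise non-breaking spaces, then split on the spaced hyphen separators
    parts = [p.strip() for p in byline.replace("\xa0", " ").split(" - ")]
    authors = [a.strip() for a in parts[0].split(",") if a.strip()]
    middle = parts[1] if len(parts) > 1 else ""
    publisher = parts[2] if len(parts) > 2 else None

    # A four-digit year starting with 19 or 20 in the middle segment
    year_match = re.search(r"\b(19|20)\d{2}\b", middle)
    year = int(year_match.group()) if year_match else None
    # Whatever precedes the year in the middle segment is treated as the venue
    venue = middle[:year_match.start()].rstrip(", ") if year_match else middle
    return {"authors": authors, "venue": venue or None,
            "year": year, "publisher": publisher}


# Illustrative bylines, not copied from a real results page
print(parse_byline("A Author, B Author - Journal of Examples, 2019 - example.org"))
print(parse_byline("D Nimmo - 2020 - books.google.com"))
```

When the middle segment contains only a year, as in the second sample, the venue comes back as `None` instead of being filled with the publisher.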

In [None]:
pip install tweepy



## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
import os

import tweepy
import pandas as pd

# Twitter API credentials, read from environment variables so that secrets
# are not stored in the notebook (the variable names below are assumptions)
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]

# Authenticate with Twitter API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Function to collect tweets based on keywords
def collect_tweets(keyword, count=10):
    tweets = []
    for tweet in tweepy.Cursor(api.search_tweets, q=keyword, lang="en", tweet_mode='extended').items(count):
        # Five fields per tweet, so the dataset has more than four columns
        tweets.append([tweet.user.screen_name, tweet.created_at, tweet.full_text,
                       tweet.retweet_count, tweet.favorite_count])
    return tweets

# Main function to collect and display tweets
def main():
    keyword = input("Enter a keyword to search for: ")
    count = int(input("Enter the number of tweets to collect: "))
    tweets_data = collect_tweets(keyword, count)
    if tweets_data:
        df = pd.DataFrame(tweets_data, columns=['Username', 'Timestamp', 'Tweet', 'Retweet Count', 'Like Count'])
        print(df)
    else:
        print("No tweets found for the given keyword.")

if __name__ == "__main__":
    main()


Enter a keyword to search for: snap
Enter the number of tweets to collect: 5


Forbidden: 403 Forbidden
453 - You currently have access to a subset of Twitter API v2 endpoints and limited v1.1 endpoints (e.g. media post, oauth) only. If you need access to this endpoint, you may need a different access level. You can learn more here: https://developer.twitter.com/en/portal/product

## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here
I am using ParseHub as the web scraping tool, because extracting tweets through the Twitter API requires the Basic access level.
So I used ParseHub to extract data on literature books from Amazon.

ParseHub is a web scraping tool designed to extract data from websites efficiently.
It aims to make web scraping easy for anyone without coding skills.
It comes with an intuitive interface and advanced capabilities for handling complex web scraping tasks.

Steps used for web scraping with ParseHub:
1. In the text box, add the following URL.
2. Click the plus button next to the "Select page" command and choose the "Select" tool from the tool menu.
3. Click on a book name; this extracts all the book names on the page.
4. Use the "Relative Select" option to select the author of the selected book.
5. Similarly, use "Relative Select" to select the date and review of the selected book.
6. Add a "Click" command to the selected book name; this creates a new template for the book page.
7. On the book page, select the price and the ratings of the corresponding book.
8. Test run the project, and then run it.
9. After the run finishes, download the generated CSV file.
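
Once the CSV file from the last step is downloaded, a few lines of Python can check and tidy it before analysis. This is only a sketch: the column names `book_name`, `book_author`, and `book_price` are hypothetical, since the real headers depend on the selection names chosen in the ParseHub project, and the sample rows are made up for illustration.

```python
import csv
import io

# Illustrative stand-in for the downloaded file; real headers depend on the
# selection names chosen in the ParseHub project
SAMPLE_EXPORT = """book_name,book_author,book_price
Example Book One,A. Author,$12.99
Example Book Two,B. Writer,
"""


def load_books(csv_file):
    """Read exported rows and convert the price text to a float (None if missing)."""
    books = []
    for row in csv.DictReader(csv_file):
        # Strip the currency symbol and thousands separators before converting
        price_text = (row.get("book_price") or "").replace("$", "").replace(",", "").strip()
        books.append({
            "name": row["book_name"].strip(),
            "author": row["book_author"].strip(),
            "price": float(price_text) if price_text else None,
        })
    return books


books = load_books(io.StringIO(SAMPLE_EXPORT))
print(books)
```

To use it on the real export, the `io.StringIO` object would be replaced by a file opened with `open(path, newline="", encoding="utf-8")`, where `path` points at the downloaded CSV.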


Link for the files : https://myunt-my.sharepoint.com/:x:/g/personal/tharunsaivt_my_unt_edu/EeThhueH_rVKgFOOJu7aeTEBC-YcIzSrU2fC5TIb6V7CKg?e=065i0I

# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
Write your response here.
'''
First of all, it was a good experience, as it was my first time doing web scraping with both a tool and Python code.
One of the key concepts I found beneficial was understanding HTML structure and CSS selectors.
I also learned about the different Python libraries for web scraping.
It was challenging to scrape social media platforms, because accessing their data requires various API tokens.
Additionally, some websites have strict anti-scraping measures, which require careful handling to avoid detection and potential blocking.
The ability to gather and analyze data from online sources is highly relevant to various fields, including mine.
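
One practice that goes along with the careful handling mentioned above is to check a site's `robots.txt` rules before scraping it and to respect any crawl delay it asks for. The sketch below uses the standard-library `urllib.robotparser` module; the helper name `check_access`, the user agent string, and the example rules are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def check_access(robots_lines, user_agent, url):
    """Return (allowed, crawl_delay) for url according to the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # rules supplied as a list of text lines
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


# Illustrative rules; a real script would load them from the site
# with parser.set_url(...) followed by parser.read()
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]

print(check_access(EXAMPLE_RULES, "my-course-bot", "https://example.com/private/data"))
print(check_access(EXAMPLE_RULES, "my-course-bot", "https://example.com/articles"))
```

A scraper could skip any URL for which the first value is `False` and pass the second value to `time.sleep` between requests.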
