# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**



## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

**Data Collection:**

To answer the research question of how social media usage relates to mental health over time, we would need to collect longitudinal data on individuals' social media usage and their mental health indicators. This data could include:

**Social Media Usage Data:**

Frequency of social media use (e.g., daily, weekly)

Duration of social media use per session

Types of social media platforms used (e.g., Facebook, Instagram, Twitter)

Content consumption patterns (e.g., scrolling, posting, interacting with others)

Engagement metrics (e.g., likes, comments, shares)

Self-reported reasons for using social media (e.g., social connection, entertainment, information seeking)

**Mental Health and Well-being Indicators:**

Self-reported mental health assessments (e.g., depression, anxiety)

Perceived stress levels

Quality of sleep

Self-esteem and self-worth

Social support and loneliness measures

**Demographic and Contextual Variables:**


Age, gender, and other demographic information

Socioeconomic status

Life events and stressors

Other relevant contextual factors (e.g., pandemic-related factors, changes in job status)
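
The variables listed above can be gathered into one record per participant per time point. The following is a minimal sketch of such a record. The class name `SurveyResponse`, the field names, and the example values are illustrative assumptions, not a validated instrument.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class SurveyResponse:
    """One participant's answers at one time point (hypothetical schema)."""
    participant_id: str
    wave: int                      # 1 = first collection round, 2 = second, ...
    # Social media usage
    frequency: str                 # e.g. "daily", "weekly"
    minutes_per_session: int
    platforms: List[str]
    main_reason: str               # e.g. "entertainment"
    # Mental health and well-being (the scoring scale is an assumption)
    depression_score: int
    anxiety_score: int
    stress_level: int
    sleep_quality: int
    self_esteem: int
    social_support: int
    # Demographic and contextual variables
    age: int
    gender: str
    socioeconomic_status: Optional[str] = None
    life_events: Optional[str] = None


# Made-up example values, for illustration only
example = SurveyResponse(
    participant_id="P001", wave=1, frequency="daily", minutes_per_session=45,
    platforms=["Instagram", "Twitter"], main_reason="entertainment",
    depression_score=3, anxiety_score=4, stress_level=5, sleep_quality=6,
    self_esteem=7, social_support=8, age=24, gender="female",
)

# asdict() turns the record into a plain dict, i.e. one row of a table
row = asdict(example)
print(row)
```

Keeping optional contextual fields as `None` by default lets a participant skip them without breaking the record.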

**Data Collection Steps:**

Ethical Considerations: Ensure that the data collection process adheres to ethical guidelines and obtains informed consent from participants. Protect participants' privacy and confidentiality throughout the data collection and analysis process.

Participant Recruitment: Recruit a diverse sample of participants representing different demographics (e.g., age, gender, socioeconomic status) to capture a comprehensive understanding of the relationship between social media usage and mental health.

Data Collection Instruments: Develop or utilize validated survey instruments to collect data on social media usage, mental health indicators, and demographic/contextual variables. These instruments can be administered through online surveys or mobile applications for convenient data collection.

Longitudinal Data Collection: Collect data at multiple time points over an extended period (e.g., months or years) to capture changes in social media usage and mental health outcomes over time. Ensure consistent data collection procedures and minimize attrition to maintain data quality.

Data Storage: Store collected data securely in compliance with data protection regulations. Utilize encrypted storage methods and access controls to safeguard participant information.
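
As a rough illustration of the privacy and longitudinal steps above, the sketch below replaces each raw identifier with a keyed hash before writing the data in long format (one row per participant per wave). This is pseudonymization, not encryption, and would complement encrypted storage rather than replace it. The function names, the key, and the sample responses are assumptions made for the example.

```python
import csv
import hashlib
import hmac
import io

# Assumption: in practice the key is generated randomly and stored apart from the data.
SECRET_KEY = b"replace-with-a-randomly-generated-key"


def pseudonymize(participant_id: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable keyed hash so waves can be linked without storing the raw ID."""
    digest = hmac.new(key, participant_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def write_wave(rows, out_file):
    """Write survey rows in long format: one row per participant per wave."""
    writer = csv.DictWriter(out_file, fieldnames=["pid", "wave", "stress_level"])
    writer.writeheader()
    for r in rows:
        writer.writerow({
            "pid": pseudonymize(r["participant_id"]),
            "wave": r["wave"],
            "stress_level": r["stress_level"],
        })


# Hypothetical responses from two waves of the same participant
responses = [
    {"participant_id": "alice@example.com", "wave": 1, "stress_level": 6},
    {"participant_id": "alice@example.com", "wave": 2, "stress_level": 4},
]

buffer = io.StringIO()  # stands in for a file on access-controlled storage
write_wave(responses, buffer)
print(buffer.getvalue())
```

Because the same identifier always maps to the same pseudonym, rows from different waves can still be joined for longitudinal analysis.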

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
import pandas as pd
import numpy as np

# Simulate data for social media usage
social_media_data = {
    'user_id': np.arange(1, 1001),
    'frequency': np.random.choice(['daily', 'weekly', 'monthly'], size=1000),
    'duration_minutes': np.random.randint(1, 601, size=1000),
    'platform': np.random.choice(['Facebook', 'Instagram', 'Twitter', 'Snapchat'], size=1000),
    'engagement_likes': np.random.randint(0, 100, size=1000),
    'engagement_comments': np.random.randint(0, 50, size=1000),
    'engagement_shares': np.random.randint(0, 20, size=1000),
    'reason': np.random.choice(['social connection', 'entertainment', 'information seeking'], size=1000)
}

# Simulate data for mental health indicators
mental_health_data = {
    'user_id': np.arange(1, 1001),
    'depression_score': np.random.randint(0, 11, size=1000),
    'anxiety_score': np.random.randint(0, 11, size=1000),
    'stress_level': np.random.randint(0, 11, size=1000),
    'quality_of_sleep': np.random.randint(0, 11, size=1000),
    'self_esteem': np.random.randint(0, 11, size=1000),
    'social_support': np.random.randint(0, 11, size=1000)
}

# Combine social media and mental health data into a single DataFrame
df = pd.DataFrame(social_media_data)
df = df.merge(pd.DataFrame(mental_health_data), on='user_id')

# Save the dataset to a CSV file
df.to_csv('social_media_mental_health_dataset.csv', index=False)
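
The simulation above draws from NumPy's global random state, so every run produces a different dataset. A possible variant, sketched below with a seeded generator, gives the same data on each run. The function name `simulate`, the seed value, and the reduced set of columns are assumptions chosen to keep the example short.

```python
import numpy as np
import pandas as pd


def simulate(n: int = 1000, seed: int = 42) -> pd.DataFrame:
    """Simulate n users with a fixed seed so reruns give identical data."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "user_id": np.arange(1, n + 1),
        "frequency": rng.choice(["daily", "weekly", "monthly"], size=n),
        "duration_minutes": rng.integers(1, 601, size=n),   # 1..600 inclusive
        "platform": rng.choice(["Facebook", "Instagram", "Twitter", "Snapchat"], size=n),
        "depression_score": rng.integers(0, 11, size=n),    # 0..10 inclusive
        "stress_level": rng.integers(0, 11, size=n),        # 0..10 inclusive
    })


df_a = simulate()
df_b = simulate()
print(df_a.equals(df_b))  # same seed, same data
print(df_a.shape)
```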


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Library (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [None]:
# Your code here (please add comments in the code):

from urllib.request import Request, urlopen

import pandas as pd
from bs4 import BeautifulSoup

# Columns collected from each CiteSeerX search result
result_df = {"Title": [], "Authors": [], "Year": [], "Abstract": []}
main_url = "https://citeseerx.ist.psu.edu/search?q=information+retrieval&t=doc&sort=rlv&start={}"

# `start` is the offset of the first result on each page; step through in tens
for page_num in range(0, 5000, 10):
    # Send a browser-like User-Agent so the request is not rejected
    link1 = Request(main_url.format(page_num), headers={'User-Agent': 'Mozilla/5.0'})
    data1 = urlopen(link1).read()
    # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
    data1_soup = BeautifulSoup(data1, "html.parser")

    # Titles are the link text of each result
    for i in data1_soup.find_all("a", attrs={'class': 'remove doc_details'}):
        result_df["Title"].append(i.text.strip())

    # Authors and year are read from the <span> elements inside each "pubinfo" block
    for i in data1_soup.find_all("div", attrs={'class': 'pubinfo'}):
        spans = i.find_all("span")
        # First span: drop spaces, turn newlines into separators, keep the second token
        result_df["Authors"].append(spans[0].text.replace(" ", "").replace("\n", " ").split()[1])
        if len(spans) < 2:
            # No year span present; 0 marks a missing year
            result_df['Year'].append(0)
        elif len(spans) > 2:
            # With more than two spans, the year is read from the third
            result_df['Year'].append(spans[2].text.split()[1])
        else:
            # With exactly two spans, the year is read from the second
            result_df['Year'].append(spans[1].text.split()[1])

    # The results page shows only a snippet of each abstract
    for i in data1_soup.find_all("div", attrs={'class': 'snippet'}):
        result_df["Abstract"].append(i.text)

# All four lists must have the same length to form a DataFrame
print(len(result_df["Title"]))
print(len(result_df["Authors"]))
print(len(result_df["Year"]))
print(len(result_df["Abstract"]))
df = pd.DataFrame(result_df)
print(df.head())
print(df.shape)
# To keep only articles from 2014 onward, convert Year to int and filter:
#df['Year'] = df['Year'].astype("int")
#df[df['Year'] >= 2014]

500
500
500
500
                                               Title  \
0                              Information Retrieval   
1                       Modern Information Retrieval   
2                      Private Information Retrieval   
3           An Introduction to Information Retrieval   
4  Naive (Bayes) at Forty: The Independence Assum...   

                                   Authors  Year  \
0                        C.J.vanRijsbergen  1979   
1  RicardoBaeza-Yates,BerthierRibeiro-Neto  1999   
2                          BennyChor,etal.     0   
3               ChristopherD.Manning,etal.  2007   
4                             DavidD.Lewis  1998   

                                            Abstract  
0                                        "...   ..."  
1  "... Information retrieval (IR) has changed co...  
2  "...   We describe schemes that enable a user ...  
3                                        "...   ..."  
4  "... The naive Bayes classifier, currently exp...  
(500
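
The cell above ends with a commented-out year filter, and the output shows that most of the top results predate 2014. A minimal sketch of the filtering step the question asks for (2014-2024) is given below. It reuses a few rows from the output above and adds one hypothetical 2016 record so the filter has something to keep.

```python
import pandas as pd

# Rows taken from the output above, plus one hypothetical 2016 record for illustration
df = pd.DataFrame({
    "Title": ["Information Retrieval", "Modern Information Retrieval",
              "Private Information Retrieval", "A hypothetical recent article"],
    "Year": ["1979", "1999", 0, "2016"],   # scraped years are strings; 0 marks a missing year
})

# Convert to numbers; anything unparseable becomes NaN instead of raising an error
df["Year"] = pd.to_numeric(df["Year"], errors="coerce")

# Keep only articles published in the requested window (both ends inclusive)
recent = df[df["Year"].between(2014, 2024)]
print(recent)
```

Using `pd.to_numeric` with `errors="coerce"` is one way to tolerate malformed year strings; `astype("int")` would raise on them.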

## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
import os

import tweepy

# Read the credentials from environment variables instead of hard-coding them.
# The variable names below are placeholders; use whatever names your setup defines.
key = os.environ['TWITTER_API_KEY']
secret = os.environ['TWITTER_API_SECRET']
token = os.environ['TWITTER_ACCESS_TOKEN']
access = os.environ['TWITTER_ACCESS_TOKEN_SECRET']

# Create the authentication object
auth = tweepy.OAuthHandler(key, secret)

# Set the access token and secret
auth.set_access_token(token, access)

# Create the API object, passing in the auth information
api = tweepy.API(auth)

# Search for recent English-language tweets that mention the keyword
for tweet in api.search_tweets(q="iphone", lang="en"):
    print(tweet.text)
    print(tweet.created_at)
    print()

RT @polo_kimani: Brudda,usiwahi dhania hawa chiles wa Twitter for iphone wako out of your league,( financially they are) but the iphone huw…
2022-09-25 15:16:34+00:00

@thismikael @KnobSlobberz @CaliMOfficial_ @ATLONIKA How she aggressively throwing that big burnt brownie on my iPho… https://t.co/LjLmQHfXT8
2022-09-25 15:16:30+00:00

RT @nftbadger: When you drop your friend’s new iphone https://t.co/aZvg2VcdV8
2022-09-25 15:16:29+00:00

RT @nftbadger: When you drop your friend’s new iphone https://t.co/aZvg2VcdV8
2022-09-25 15:16:29+00:00

RT @nftbadger: When you drop your friend’s new iphone https://t.co/aZvg2VcdV8
2022-09-25 15:16:26+00:00

RT @alhajinuell: I never see affiliate marketer wey buy iPhone 14. I thought y’all were making over 200k everyday while we were sleeping? 😂…
2022-09-25 15:16:26+00:00

I'm using #Watusi on iPhone by @FouadRaheb to add new features for #WhatsApp! https://t.co/aRK1rm8WFC
2022-09-25 15:16:24+00:00

RT @nftbadger: When you drop your friend’s new iphon
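
The loop above prints only the text and timestamp, while the question asks for more than four columns. One possible way to flatten each result into a wider row is sketched below. The helper name `status_to_row` is hypothetical, the attribute names are assumed to follow the Twitter API v1.1 tweet fields, and a stand-in object with made-up values replaces a live API call.

```python
import csv
import io
from types import SimpleNamespace


def status_to_row(status) -> dict:
    """Flatten one status object into a row with six columns."""
    return {
        "id": status.id,
        "created_at": status.created_at,
        "user": status.user.screen_name,
        "text": status.text,
        "retweet_count": status.retweet_count,
        "favorite_count": status.favorite_count,
    }


# Stand-in for one element returned by api.search_tweets(...); all values are made up
fake_status = SimpleNamespace(
    id=1, created_at="2022-09-25 15:16:34+00:00",
    user=SimpleNamespace(screen_name="example_user"),
    text="Example tweet about iphone", retweet_count=2, favorite_count=5,
)

rows = [status_to_row(s) for s in [fake_status]]

buffer = io.StringIO()  # swap for open("tweets.csv", "w", newline="") to save to disk
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

With real credentials, the list comprehension would iterate over the search results instead of the stand-in list.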

## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

I had no issues with Question 4A, so I did not use this option.

# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**


**Learning Experience:**

Overall, working on web scraping tasks provided a valuable learning experience. I gained a better understanding of HTML structure, CSS selectors, and how to navigate and extract data from web pages using libraries like BeautifulSoup in Python. Learning about APIs and how to interact with them using tools like Tweepy for Twitter data collection was particularly beneficial. Additionally, understanding the importance of respecting website terms of service and handling rate limits was crucial in ensuring ethical data collection practices.

**Challenges Encountered:**

One challenge I encountered was handling dynamic content loaded via JavaScript on some websites. This required me to use more advanced techniques like Selenium WebDriver to interact with the page and extract data. Another challenge was dealing with rate limits and ensuring that my scraping activities did not overwhelm the servers or violate website policies. By implementing rate limiting and error handling mechanisms, I was able to overcome these challenges effectively.
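
A minimal sketch of what such rate limiting and error handling can look like is shown below. The helper names `default_fetch` and `polite_get`, the retry count, and the delays are illustrative assumptions, not code from the cells above.

```python
import time
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def default_fetch(url: str) -> bytes:
    """Download a page with a browser-like User-Agent header."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return response.read()


def polite_get(url, fetch=default_fetch, retries=3, delay=1.0, sleep=time.sleep):
    """Fetch a URL, pausing after each request and backing off after failures.

    Returns the page bytes, or None if every attempt failed.
    """
    for attempt in range(retries):
        try:
            data = fetch(url)
            sleep(delay)  # fixed pause so consecutive requests are spaced out
            return data
        except (HTTPError, URLError) as err:
            wait = delay * (2 ** attempt)  # 1x, 2x, 4x ... the base delay
            print(f"attempt {attempt + 1} failed ({err}); waiting {wait}s")
            sleep(wait)
    return None
```

Passing `fetch` and `sleep` as parameters keeps the function easy to try out without touching the network, for example by supplying a function that fails twice and then returns data.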

**Relevance to My Field of Study:**

As a student of this field, the ability to gather and analyze data from online sources is incredibly relevant. Web scraping allows me to collect large datasets for analysis, conduct sentiment analysis on social media data, track trends, and gather insights into user behavior. This skill enables me to augment traditional research methods with data-driven approaches and extract valuable information from diverse online sources. Overall, web scraping and data collection techniques enhance my ability to conduct comprehensive and insightful research in my field.