# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission. , and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind, what kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
#"Is it possible that the communal emotion of an area impact its everyday power usage?"

#Theory- The hypothesis is that variations in the city's energy consumption could be correlated with the emotional state of the population as a whole, as gathered from multiple social media platforms. For instance, people may use more electricity on really depressing days (emotionally as well as physically), either as a means of self-comfort or as a result of changing their behaviour and staying inside.

#Information Required- Social Media Sentiment Data: Gather sentiment ratings from sites like Reddit, Instagram, and Twitter to determine the general attitude of the city. This information ought to contain:The posts' date and time,to make sure the posts are from the desired city, use geolocation, Sentiment ratings (positive, neutral, and negative, for example).
#Weather Information: Past weather information to account for how the weather affects energy use-daylight hours, temperature, humidity, and precipitation.
#Data on Energy Consumption: The city provides hourly data on energy usage.This ought to have the overall amount of energy used as well as a breakdown by the commercial, industrial, and residential sectors.
#Event Data: Significant city events that may have an impact on energy use and mood, such as concerts, sporting events, and protests.Time, place, and nature of the occurrence,Number of Data Required
#Sentiment on Social Media: A whole year's worth of data, ideally thousands of posts every day, to capture daily and seasonal fluctuations.Hourly weather information for the same year is provided.
#Energy Consumption Data: Hourly data for the same year, if possible, preferably at the minute level of detail.
#Event Data: An exhaustive inventory of noteworthy occurrences

#Procedures for Gathering and Preserving the Information Social Media Sentiment Analysis:
#To compile postings, use APIs from websites like Reddit (Pushshift API) and Twitter (Twitter API).Use geolocation to filter posts and make sure they come from the desired city.To analyse sentiment, use Natural Language Processing (NLP) approaches. Sentiment scores can be obtained using programs like TextBlob or VADER.Store the information in a structured manner with fields for the timestamp, location, and sentiment score, such as a CSV file or database.
#Gathering Weather Data:Get weather information from NOAA or OpenWeatherMap, among other sources.To obtain hourly data for the desired city, use APIs.Include the timestamp and weather information in the data and store it in a database or CSV file.
#Data Collection on Energy Consumption: Work along with the city's energy supplier to obtain information about use. Use publicly accessible datasets or, if available, extrapolate from smart meter data if direct collaboration isn't feasible. Make that the information is anonymised and conforms with privacy laws.With fields for the timestamp, total consumption, and sector breakdown, save the data in a database or CSV file.
#Gathering Event Data:Create an event list by hand using social media, city event calendars, and news sources.Add information such as the event's date, location, and nature.Store this information in a database or CSV file.
#Data Organisation and Storage:For ease of access and analysis, keep all datasets in a single, centralised database (such as PostgreSQL or MySQL).If you have access to minute-level granularity for the energy consumption data, use a time-series database (such as InfluxDB).To preserve data integrity, make sure you do regular backups and data validation checks.
#Privacy of Data and Ethical Issues:To preserve user privacy, anonymise any personal information gathered via social media.Acquire the required authorisations and abide by data protection laws, such as the CCPA or GDPR.
#Preprocessing and Data Cleaning:To manage missing values, outliers, and data normalisation, preprocess the data.Sync up the time intervals between various datasets to ensure precise correlation analysis.

#This study may shed light on how social behaviour patterns and emotional states influence pragmatic issues like energy use, which may have implications for resource management and urban planning.

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
import random
import pandas as pd
from datetime import datetime, timedelta

def generate_sentiment_scores(num_samples):
    sentiment_data = []
    for _ in range(num_samples):
        timestamp = datetime.now() - timedelta(days=random.randint(0, 365))
        sentiment_score = round(random.uniform(-1, 1), 2)
        sentiment_data.append({'timestamp': timestamp, 'sentiment_score': sentiment_score})
    return sentiment_data

def generate_weather_data(num_samples):
    weather_data = []
    for _ in range(num_samples):
        timestamp = datetime.now() - timedelta(days=random.randint(0, 365))
        temperature = round(random.uniform(-10, 35), 2)
        humidity = random.randint(10, 100)
        weather_data.append({'timestamp': timestamp, 'temperature': temperature, 'humidity': humidity})
    return weather_data

sentiment_samples = generate_sentiment_scores(1000)

weather_samples = generate_weather_data(1000)

sentiment_df = pd.DataFrame(sentiment_samples)
weather_df = pd.DataFrame(weather_samples)

sentiment_df.to_csv('sentiment_data.csv', index=False)
weather_df.to_csv('weather_data.csv', index=False)

print("Sentiment data:")
print(sentiment_df.head())

print("\nWeather data:")
print(weather_df.head())


Sentiment data:
                   timestamp  sentiment_score
0 2024-01-19 19:07:26.961782            -0.49
1 2023-10-25 19:07:26.961826             0.71
2 2024-04-07 19:07:26.961835            -0.97
3 2024-03-04 19:07:26.961841            -0.01
4 2024-02-06 19:07:26.961848             0.67

Weather data:
                   timestamp  temperature  humidity
0 2024-02-29 19:07:26.969884        24.55        91
1 2023-11-07 19:07:26.969893        20.47        68
2 2024-08-03 19:07:26.969898        10.66        34
3 2023-12-02 19:07:26.969902        27.37        65
4 2023-12-13 19:07:26.969906         4.72        47


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [None]:
pip install beautifulsoup4 requests



In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

query = 'machine learning'
max_results = 100

url = f'http://export.arxiv.org/api/query?search_query=all:{query}&start=0&max_results={max_results}'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'xml')

entries = soup.find_all('entry')
papers_data = []

for entry in entries:
    title = entry.title.text
    authors = ', '.join([author.find('name').text for author in entry.find_all('author')])
    summary = entry.summary.text.strip()
    published = entry.published.text
    link = entry.id.text

    papers_data.append({
        'Title': title,
        'Authors': authors,
        'Summary': summary,
        'Published': published,
        'Link': link
    })

papers_df = pd.DataFrame(papers_data)
papers_df.to_csv('arxiv_papers.csv', index=False)

print("ArXiv Papers data:")
print(papers_df.head())


ArXiv Papers data:
                                               Title  \
0   Lecture Notes: Optimization for Machine Learning   
1  An Optimal Control View of Adversarial Machine...   
2  Minimax deviation strategies for machine learn...   
3  Machine Learning for Clinical Predictive Analy...   
4  Towards Modular Machine Learning Solution Deve...   

                                    Authors  \
0                                Elad Hazan   
1                               Xiaojin Zhu   
2  Michail Schlesinger, Evgeniy Vodolazskiy   
3                             Wei-Hung Weng   
4       Samiyuru Menik, Lakshmish Ramaswamy   

                                             Summary             Published  \
0  Lecture notes on optimization for machine lear...  2019-09-08T21:49:42Z   
1  I describe an optimal control view of adversar...  2018-11-11T14:28:34Z   
2  The article is devoted to the problem of small...  2017-07-16T09:15:08Z   
3  In this chapter, we provide a brief overview o

## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, Twitter (formerly known as X), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [121]:
pip install asyncpraw



In [122]:
pip install nest_asyncio



In [133]:
import asyncpraw
import nest_asyncio
import asyncio

nest_asyncio.apply()

async def fetch_reddit_data():
    client_id = 'your_client_id'
    client_secret = 'your_client_secret'
    user_agent = 'your_user_agent'

reddit = asyncpraw.Reddit(client_id=client_id,
                          client_secret=client_secret,
                          user_agent=user_agent)
try:
    await reddit.user.me()
    print("Authentication successful!")
except Exception as e:
    print(f"Authentication failed: {e}")

    data = []
    async for submission in subreddit.search(keyword, limit=10):
        data.append({
            'Title': submission.title,
            'Score': submission.score,
            'URL': submission.url,
            'Comments': submission.num_comments
        })

    await reddit.close()

    for post in data:
        print(post)

await fetch_reddit_data()


Authentication successful!


In [140]:
pip install instaloader



In [144]:
import instaloader

L = instaloader.Instaloader()
username = 'kavya_likitha'
password = 'Kavya2703'

L.login(username, password)

hashtag = 'nature'

data = []
try:
    for post in instaloader.Hashtag.from_name(L.context, hashtag).get_posts():
        data.append({
            'Post URL': post.url,
            'Likes': post.likes,
            'Comments': post.comments,
            'Caption': post.caption,
            'Date': post.date
        })
        if len(data) >= 10:  # Limit to 10 posts
            break

except instaloader.exceptions.QueryReturnedNotFoundException as e:
    print(f"Error: {e}")

for post in data:
    print(post)


JSON Query to explore/tags/nature/: 404 Not Found when accessing https://www.instagram.com/explore/tags/nature/?__a=1&__d=dis [retrying; skip with ^C]
JSON Query to explore/tags/nature/: 404 Not Found when accessing https://www.instagram.com/explore/tags/nature/?__a=1&__d=dis [retrying; skip with ^C]


Error: JSON Query to explore/tags/nature/: 404 Not Found when accessing https://www.instagram.com/explore/tags/nature/?__a=1&__d=dis


In [135]:
pip install --upgrade instaloader



In [None]:
import praw
import pandas as pd

reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_CLIENT_SECRET',
    user_agent='YOUR_USER_AGENT'
)

subreddit_name = 'python'
search_term = 'data science'
post_limit = 100

posts_data = []

for submission in reddit.subreddit(subreddit_name).search(search_term, limit=post_limit):
    posts_data.append({
        'Title': submission.title,
        'Author': submission.author.name if submission.author else 'N/A',
        'Score': submission.score,
        'URL': submission.url,
        'Content': submission.selftext
    })

posts_df = pd.DataFrame(posts_data)
posts_df.to_csv('reddit_data.csv', index=False)

print("Reddit data:")
print(posts_df.head())


JSON Query to explore/tags/picture/: 404 Not Found when accessing https://www.instagram.com/explore/tags/picture/?__a=1&__d=dis [retrying; skip with ^C]
JSON Query to explore/tags/picture/: 404 Not Found when accessing https://www.instagram.com/explore/tags/picture/?__a=1&__d=dis [retrying; skip with ^C]


An error occurred: JSON Query to explore/tags/picture/: 404 Not Found when accessing https://www.instagram.com/explore/tags/picture/?__a=1&__d=dis
Instagram data:
Empty DataFrame
Columns: []
Index: []


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please only choose one option for question 4. If you do both options, we will grade only the first one

In [None]:
# Used ParseHub to extract the data and the project was successful but unable to download the csv/excel file due to some legal issues of the site.. attached the proofs in the comment section.


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:

#Reflections on Web Scraping and Data Collection

#Learning Experience: Diving into web scraping has been a fascinating journey into the realm of data extraction. I discovered that understanding HTML and CSS structures is crucial for navigating web pages effectively. Techniques like using BeautifulSoup for parsing and `requests` for handling HTTP requests became essential tools in my toolkit. The experience highlighted the importance of adapting to different website layouts and learning to handle dynamic content, which proved invaluable in mastering the nuances of scraping.

#Challenges Encountered: One notable challenge was dealing with sites that employ dynamic content loading, which sometimes rendered static scraping methods ineffective. For example, handling JavaScript-heavy sites required switching to Selenium for browser automation. Additionally, websites often employ measures to prevent scraping, such as CAPTCHA or rate limiting, which necessitated implementing thoughtful delays and retry logic to mitigate these barriers.

#Relevance to Your Field of Study: For a data science student, the ability to gather and analyze data from various online sources opens up a treasure trove of research opportunities. This skill enhances the ability to perform comprehensive analyses and gain insights from real-time data. Whether it’s for academic research, market analysis, or data-driven decision-making, mastering web scraping provides a robust foundation for extracting valuable information from the vast expanse of the web.
