# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

# write your answer here
Research question: How do movie ratings affect the popularity of movies over time?
Data required: For each movie, the name, release date, and rating. At least 1,000 records are needed
for statistical analysis; ideally, 2,000 or more records should be collected
for a more reliable analysis.
Steps for collecting and saving the data:
1. Setup: Install all the necessary dependencies and identify a reliable website for the analysis.
2. Web scraping: Write a script that scrapes data from multiple pages of the website by building each page's link dynamically.
3. Saving data: After combining the data from the multiple pages, convert it
into a data frame, save it in Excel format, and then visualize the data to extract insights.
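
The three steps above can be sketched as a small pipeline that runs offline. This is only an illustration under stated assumptions, not the scraper itself: `fetch_page` is a hypothetical callable standing in for the request-and-parse step, `fake_fetch` merely simulates a site that returns 20 records per page, and CSV text is used in place of an Excel file to keep the sketch dependency-free.

```python
import csv
import io


def collect_records(fetch_page, limit=1000, max_pages=50):
    """Call fetch_page(page_number) page by page until `limit` records are collected."""
    records = []
    for page in range(1, max_pages + 1):
        page_records = fetch_page(page)
        if not page_records:  # stop early when a page returns nothing
            break
        records.extend(page_records)
        if len(records) >= limit:
            break
    return records[:limit]


def to_csv_text(records, fieldnames):
    """Serialize the records to CSV text; write this text to a file to save it."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def fake_fetch(page):
    # Hypothetical stand-in for the real request + parse step: 20 records per page.
    return [
        {"Movie_name": f"Movie {page}-{i}", "Release_date": "unknown", "Rating": "N/A"}
        for i in range(20)
    ]


movies = collect_records(fake_fetch, limit=1000)
csv_text = to_csv_text(movies, ["Movie_name", "Release_date", "Rating"])
print(len(movies))
```

Passing the fetch step in as a function keeps the paging and record-limit logic separate from any particular website, so it can be checked without network access.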




## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
# imports
import requests
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import files

base_url = 'https://www.themoviedb.org/movie?page='
all_page_data = []

for num in range(1, 51):  # Scraping the data for the first 50 pages
    if len(all_page_data) >= 1000:
        break

    resp = requests.get(base_url + str(num), timeout=30)  # timeout avoids hanging on a stalled request
    soup = BeautifulSoup(resp.text, 'html.parser')
    all_div = soup.find_all('div', class_='card style_1')

    for item in all_div:
        if len(all_page_data) >= 1000:
            break  # Breaking the loop after collecting the 1000 records

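        content_div = item.find('div', class_='content')
        if content_div is None:
            continue  # Skip cards that lack the expected content block

        title_tag = content_div.find('h2')
        date_tag = content_div.find('p')
        movie_name = title_tag.text.strip() if title_tag else 'N/A'
        movie_date = date_tag.text.strip() if date_tag else 'N/A'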

        rating_div = item.find('div', class_='user_score_chart')
        rating = rating_div.get("data-percent", 'N/A') if rating_div else 'N/A'
        movie_data = {
            'Movie_name': movie_name,
            'Release_date': movie_date,
            'Rating': rating
        }
        all_page_data.append(movie_data)
    print(f"Completed page {num}. Total movies collected: {len(all_page_data)}")
df = pd.DataFrame(all_page_data)
excel_filename = 'tmdb_movies_data.xlsx' # Saving data into excel sheet
df.to_excel(excel_filename, index=False)
files.download(excel_filename) # Downloading the Excel sheet
print("Scraping completed and Excel file downloaded.")

Completed page 1. Total movies collected: 20
Completed page 2. Total movies collected: 40
Completed page 3. Total movies collected: 60
Completed page 4. Total movies collected: 80
Completed page 5. Total movies collected: 100
Completed page 6. Total movies collected: 120
Completed page 7. Total movies collected: 140
Completed page 8. Total movies collected: 160
Completed page 9. Total movies collected: 180
Completed page 10. Total movies collected: 200
Completed page 11. Total movies collected: 220
Completed page 12. Total movies collected: 240
Completed page 13. Total movies collected: 260
Completed page 14. Total movies collected: 280
Completed page 15. Total movies collected: 300
Completed page 16. Total movies collected: 320
Completed page 17. Total movies collected: 340
Completed page 18. Total movies collected: 360
Completed page 19. Total movies collected: 380
Completed page 20. Total movies collected: 400
Completed page 21. Total movies collected: 420
Completed page 22. Total m

Scraping completed and Excel file downloaded.
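
The scraper above stores each rating as a string, or as 'N/A' when no score is found, so the values need converting before any statistics can be computed. The following is a minimal sketch of that cleaning step. It assumes the scraped score is a percentage between 0 and 100; `parse_rating`, `clean_records`, and the sample records are hypothetical and not part of the scraped dataset.

```python
def parse_rating(raw):
    """Convert a scraped score string such as '75' to a float, or None if missing."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None  # covers 'N/A', None, and other non-numeric text
    # Assumption: valid scores are percentages in the range 0-100.
    return value if 0 <= value <= 100 else None


def clean_records(records):
    """Return copies of the records with 'Rating' converted to float or None."""
    cleaned = []
    for record in records:
        updated = dict(record)
        updated["Rating"] = parse_rating(record.get("Rating"))
        cleaned.append(updated)
    return cleaned


# Hypothetical sample rows in the same shape the scraper produces.
sample_records = [
    {"Movie_name": "Example A", "Release_date": "unknown", "Rating": "75"},
    {"Movie_name": "Example B", "Release_date": "unknown", "Rating": "N/A"},
    {"Movie_name": "Example C", "Release_date": "unknown", "Rating": "62.0"},
]

cleaned = clean_records(sample_records)
rated = [row["Rating"] for row in cleaned if row["Rating"] is not None]
print(f"{len(rated)} of {len(cleaned)} records have a usable rating")
```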


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or the ACM Digital Library (https://dl.acm.org/) with the keyword "XYZ". The articles should have been published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [None]:
# imports
import requests
import pandas as pd
from bs4 import BeautifulSoup
import time
import logging
import random
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import os
# Set up logging for output
logging.basicConfig(filename='scraping_log.txt', level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
# Function for scraping up to 1000 articles from the ACM Digital Library
def scrape_acm_articles(search_query, limit=1000):
    base_url = "https://dl.acm.org/action/doSearch" #base URL
    params = {
        "AllField": search_query,
        "startPage": 1,
        "pageSize": 50
    }
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36"
    ]
    collected_articles = []
    current_page = 1
    total_collected = 0
    logging_interval = 100
    # Retrying for blocked requests
    retry_strategy = Retry(
        total=5,
        backoff_factor=2,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET", "HEAD", "OPTIONS"]
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    # Keep requesting pages until the number of collected records reaches the limit
    while total_collected < limit:
        params["startPage"] = current_page
        headers = {
            "User-Agent": random.choice(user_agents)
        }
        try:
            response = session.get(base_url, params=params, headers=headers, timeout=20)
            response.raise_for_status()
        except requests.HTTPError as http_err:
            logging.error(f"HTTP error on page {current_page}: {http_err}")
            time.sleep(120)
            current_page += 1
            continue
        except requests.RequestException as req_err:
            logging.error(f"Request error on page {current_page}: {req_err}")
            time.sleep(120)
            current_page += 1
            continue
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all("div", class_="issue-item")
        if not articles:
            logging.info(f"No more articles found on page {current_page}.")
            break
        page_articles_count = 0
        for article in articles:
            title_elem = article.find("h5", class_="issue-item__title")
            title = title_elem.get_text(strip=True) if title_elem else "Title unavailable"
            conference_elem = article.find("div", class_="issue-item__detail")
            conference = "Unknown"
            if conference_elem:
                conf_link = conference_elem.find("a")
                if conf_link:
                    conference = conf_link.get_text(strip=True)
            year_elem = article.find("div", class_="bookPubDate")
            year = year_elem.get_text(strip=True) if year_elem else "Year unknown"
            authors_elem = article.find("ul", class_="rlist--inline")
            authors = authors_elem.get_text(strip=True) if authors_elem else "Authors unavailable"
            abstract_elem = article.find("div", class_="issue-item__abstract")
            abstract = abstract_elem.get_text(strip=True) if abstract_elem else "No abstract provided"
            collected_articles.append({
                "Title": title,
                "Conference": conference,
                "Year": year,
                "Authors": authors,
                "Abstract": abstract
            })
            page_articles_count += 1
            total_collected += 1
            if total_collected % logging_interval == 0:
                logging.info(f"Collected {total_collected} articles")
            if total_collected >= limit: # condition to break out of loop after reaching 1000 records
                break
        logging.info(f"Collected {page_articles_count} articles on page {current_page}.")
        current_page += 1
        time.sleep(random.uniform(10, 20))
    return collected_articles
def save_to_excel(data, file_name):
    df = pd.DataFrame(data)
    df.to_excel(file_name, index=False)
    logging.info(f"Saved {len(data)} records to {file_name}")
def run_scraper():
    search_query = "machine learning" #searching articles regarding machine learning
    article_limit = 1000
    logging.info(f"Starting scraping: {search_query}") #logging info for scraping
    scraped_data = scrape_acm_articles(search_query, article_limit)
    excel_filename = 'acm_articles_ml.xlsx'
    save_to_excel(scraped_data, excel_filename) #saving to excel
    print(f"Scraping complete; articles saved to {excel_filename}.")
    print("For detailed logs, check 'scraping_log.txt'.")
    try:
        from google.colab import files
        files.download(excel_filename) # using google colab's download function
        print(f"{excel_filename} is ready for download in Colab.")
    except ImportError:
        print(f"File saved locally: {os.path.abspath(excel_filename)}")
if __name__ == "__main__":
    run_scraper()

ERROR:root:Request error on page 11: HTTPSConnectionPool(host='dl.acm.org', port=443): Read timed out.
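
The question asks for articles published in 2014-2024, while the scraper above stores whatever publication date text it finds. One possible way to enforce the range afterwards is sketched below. It assumes the "Year" field contains a four-digit year somewhere in its text; `extract_year`, `filter_by_year`, and the sample rows are hypothetical illustrations.

```python
import re

# Matches a four-digit year starting with 19 or 20.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_year(text):
    """Return the first four-digit year found in the text, or None."""
    match = YEAR_PATTERN.search(text or "")
    return int(match.group(0)) if match else None


def filter_by_year(articles, start=2014, end=2024):
    """Keep only the articles whose 'Year' field falls within [start, end]."""
    kept = []
    for article in articles:
        year = extract_year(article.get("Year"))
        if year is not None and start <= year <= end:
            kept.append(article)
    return kept


# Hypothetical rows in the same shape the scraper collects.
sample_articles = [
    {"Title": "Example paper 1", "Year": "June 2019"},
    {"Title": "Example paper 2", "Year": "Year unknown"},
    {"Title": "Example paper 3", "Year": "2009"},
    {"Title": "Example paper 4", "Year": "12 March 2024"},
]

recent = filter_by_year(sample_articles)
print([article["Title"] for article in recent])
```

Filtering after collection means fewer than 1000 records may remain, so the collection limit would likely need to be raised to compensate.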


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
!pip install ntscraper
from ntscraper import Nitter # importing nitter
import pandas as pd
from google.colab import files
scraper = Nitter(log_level=1, skip_instance_check=False)
tweets = scraper.get_tweets("MrBeast", mode="user", number=1000)
# Initializing a dictionary of lists to store the tweet data
tweet_data = {
    'URL': [],
    'Tweet Text': [],
    'Username': [],
    'Likes Count': [],
    'Quotes Count': [],
    'Retweets Count': [],
    'Comments Count': []
}
# Limit to 40 tweets
for i, tweet in enumerate(tweets['tweets']):
    if i >= 40:
        break
    tweet_data['URL'].append(tweet['link'])
    tweet_data['Tweet Text'].append(tweet['text'])
    tweet_data['Username'].append(tweet['user']['name'])
    tweet_data['Likes Count'].append(tweet['stats']['likes'])
    tweet_data['Quotes Count'].append(tweet['stats']['quotes'])
    tweet_data['Retweets Count'].append(tweet['stats']['retweets'])
    tweet_data['Comments Count'].append(tweet['stats']['comments'])
tweet_df = pd.DataFrame(tweet_data)
output_filename = "MrBeast_tweets_data.xlsx"
tweet_df.to_excel(output_filename, index=False)
files.download(output_filename)
print("Scraping completed")

Collecting ntscraper
  Downloading ntscraper-0.3.17-py3-none-any.whl.metadata (7.4 kB)
Downloading ntscraper-0.3.17-py3-none-any.whl (12 kB)
Installing collected packages: ntscraper
Successfully installed ntscraper-0.3.17


Testing instances: 100%|██████████| 16/16 [00:13<00:00,  1.19it/s]
INFO:root:No instance specified, using random instance https://nitter.privacydev.net
INFO:root:Current stats for MrBeast: 17 tweets, 0 threads...
INFO:root:Current stats for MrBeast: 33 tweets, 0 threads...
INFO:root:Current stats for MrBeast: 48 tweets, 0 threads...
INFO:root:Current stats for MrBeast: 64 tweets, 0 threads...
INFO:root:Current stats for MrBeast: 81 tweets, 0 threads...
INFO:root:Current stats for MrBeast: 94 tweets, 0 threads...
INFO:root:Current stats for MrBeast: 105 tweets, 0 threads...
INFO:root:Current stats for MrBeast: 121 tweets, 0 threads...


Scraping completed
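
As a row-oriented alternative to the column lists used above, the nested tweet dictionaries can be flattened one record at a time, which also makes it easy to confirm the "more than four columns" requirement. This is a sketch under assumptions: the key names mirror those accessed in the code above, while `flatten_tweets` and `sample_result` are hypothetical and the sample is hand-written rather than real scraper output.

```python
def flatten_tweets(result, max_rows=40):
    """Turn nested tweet dictionaries into flat rows, one dictionary per tweet."""
    rows = []
    for tweet in result.get("tweets", [])[:max_rows]:
        stats = tweet.get("stats", {})
        rows.append({
            "URL": tweet.get("link"),
            "Tweet Text": tweet.get("text"),
            "Username": tweet.get("user", {}).get("name"),
            "Likes Count": stats.get("likes", 0),
            "Quotes Count": stats.get("quotes", 0),
            "Retweets Count": stats.get("retweets", 0),
            "Comments Count": stats.get("comments", 0),
        })
    return rows


# Hand-written sample in the assumed shape; not real scraper output.
sample_result = {
    "tweets": [
        {
            "link": "https://example.com/status/1",
            "text": "First example tweet",
            "user": {"name": "Example User"},
            "stats": {"likes": 10, "quotes": 1, "retweets": 2, "comments": 3},
        },
        {
            "link": "https://example.com/status/2",
            "text": "Second example tweet",
            "user": {"name": "Example User"},
            "stats": {"likes": 5},  # some counts left out on purpose
        },
    ]
}

rows = flatten_tweets(sample_result)
print(f"{len(rows)} rows with {len(rows[0])} columns each")
```

A list of row dictionaries like this can be passed directly to `pd.DataFrame`, and missing counts default to 0 instead of raising a `KeyError`.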


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
I enjoyed solving these tasks. They were a bit challenging because I kept getting a CAPTCHA on the Google Scholar website; I tried to resolve it by using an API, but that did not work. I then tried the ACM Digital Library and got the expected output.
Web scraping is very useful for getting live data and extracting insights by analyzing it. In my research, it can be useful for finding reviews of a particular film across different websites.
'''