<a href="https://colab.research.google.com/github/ChristianBridge/Christian_INFO5731_SPRING2025/blob/main/Bridge_Christian_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a Python program to collect text data from **any of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5) **Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [1]:
#Required for Densho Digital Repository
!pip install cloudscraper



In [2]:
#Required for Twitter
!pip install tweepy



In [3]:
#Libraries
#How to get url responses
import requests
#Basic webscraping
from bs4 import BeautifulSoup
#Helps get past JavaScript anti-bot checks
import cloudscraper
#Used for config file
import json
#Data management and manipulation
import numpy as np
import pandas as pd
#Used to delay the requests to urls
import time
#used for twitter
import tweepy
#Used for reading/writing files to operating system
import os
#Used for processing text
import string
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
#Used for counting POS tags
from collections import Counter

In [4]:
#Config contains APIs/Secrets
with open("config.json", "r") as config_file:
    config = json.load(config_file)

In [5]:
#Question 1-3 - Semantic Scholar

#url contains query information and field information to retrieve
url = "https://api.semanticscholar.org/graph/v1/paper/search/bulk?query=machine+learning&fields=url,abstract,authors"

#Placeholders/configuration for loop
sem_api_key = config["sem_api_key"]
token = None
papers = []
delay = 1
headers = {"x-api-key": sem_api_key}

#Range of 10 needed as 1,000 papers are requested each time
for i in range(1,11):

    #No continuation token is given for the first request;
    #all subsequent requests pass the token so papers are not repeated
    params = {"token": token} if token else {}
    response = requests.get(url,params=params,headers=headers)

    #if response is bad, report it and stop instead of repeating the same request
    if response.status_code != 200:
        print(f"Request failed with status code {response.status_code}: {response.text}")
        break

    #if response is good, add the new json data to the papers list
    response_data = response.json()
    papers.extend(response_data['data'])

    #Store continuation token for next run; no token means no more results
    token = response_data.get('token')
    if not token:
        break

    #Need to delay between requests to avoid being blocked
    print(f"Waiting {delay} seconds before the next request.")
    time.sleep(delay)

#Ensure that we have retrieved 10000 papers
print(f"Total papers retrieved: {len(papers)}")

Waiting 1 seconds before the next request.
Waiting 1 seconds before the next request.
Waiting 1 seconds before the next request.
Waiting 1 seconds before the next request.
Waiting 1 seconds before the next request.
Waiting 1 seconds before the next request.
Waiting 1 seconds before the next request.
Waiting 1 seconds before the next request.
Waiting 1 seconds before the next request.
Waiting 1 seconds before the next request.
Total papers retrieved: 10000
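
The token-based pagination in the loop above can be tried without network access. The following is a minimal sketch, not the Semantic Scholar client itself: `collect_with_token` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real HTTP request by returning three hard-coded pages.

```python
def collect_with_token(fetch_page, max_pages):
    """Collect items page by page, passing each page's token to the next call."""
    items = []
    token = None  # the first request carries no continuation token
    for _ in range(max_pages):
        page = fetch_page(token)
        items.extend(page.get("data", []))
        token = page.get("token")
        if not token:  # no token means there is nothing left to fetch
            break
    return items


def fake_fetch(token):
    """Hypothetical stand-in for the API: three pages of two items each."""
    pages = {
        None: {"data": ["p1", "p2"], "token": "t1"},
        "t1": {"data": ["p3", "p4"], "token": "t2"},
        "t2": {"data": ["p5", "p6"], "token": None},
    }
    return pages[token]


papers_demo = collect_with_token(fake_fetch, max_pages=10)
print(len(papers_demo))
```

Keeping the request behind a function argument means the stopping logic (page limit reached, or no token returned) can be checked separately from rate limits and API keys.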


In [6]:
#Create a dataframe using the retrieved papers json data
semantic_df = pd.DataFrame(papers)

In [7]:
#Look at new dataframe
semantic_df

Unnamed: 0,paperId,url,abstract,authors
0,00000c33779acab142af6c7a6dae8b36fac0805d,https://www.semanticscholar.org/paper/00000c33...,In the era of burgeoning electric vehicle (EV)...,"[{'authorId': '66925799', 'name': 'Ahmad Almag..."
1,0000238f07f151172cf2602588ba762b55c8464b,https://www.semanticscholar.org/paper/0000238f...,Background Meditation apps have surged in popu...,"[{'authorId': '31717083', 'name': 'Christian A..."
2,00002d31a8c758062a51d9a259313d81a5eaf399,https://www.semanticscholar.org/paper/00002d31...,,"[{'authorId': '2088997835', 'name': 'L. Szczyr..."
3,0000315635be19f6278dbc72597b3065fac405f0,https://www.semanticscholar.org/paper/00003156...,Background Humans must be able to cope with th...,"[{'authorId': '2201143605', 'name': 'Nida Shaf..."
4,00005d68c6c7eb4d3c27da8242a30b9a498f991e,https://www.semanticscholar.org/paper/00005d68...,The growing number of cloud-based services has...,"[{'authorId': '2085549408', 'name': 'Iehab Alr..."
...,...,...,...,...
9995,032b27475daba55376171d41629e8986f031431b,https://www.semanticscholar.org/paper/032b2747...,,"[{'authorId': '69923982', 'name': 'Seyed Mosta..."
9996,032b3905a76c716fba857c7878209840b9e46ada,https://www.semanticscholar.org/paper/032b3905...,In the context of increasing negative environm...,"[{'authorId': '2267532012', 'name': 'Son Vu Ho..."
9997,032b3db9b36592cf639f9e46a501b527f7d4bc86,https://www.semanticscholar.org/paper/032b3db9...,,"[{'authorId': '2092892151', 'name': 'P. Monika..."
9998,032b423054d52846f9306f23bdd07eccee35a5c8,https://www.semanticscholar.org/paper/032b4230...,Some of the simplest tasks for a human to acco...,"[{'authorId': '2200544713', 'name': 'Eshani Ak..."


In [8]:
#Authors are stored as a list of dicts; keep only a list of the author names
semantic_df['authors'] = semantic_df['authors'].apply(lambda x: [author['name'] for author in x])

In [9]:
#Check out dataframe again
semantic_df

Unnamed: 0,paperId,url,abstract,authors
0,00000c33779acab142af6c7a6dae8b36fac0805d,https://www.semanticscholar.org/paper/00000c33...,In the era of burgeoning electric vehicle (EV)...,"[Ahmad Almaghrebi, Kevin James, Fares al Juhes..."
1,0000238f07f151172cf2602588ba762b55c8464b,https://www.semanticscholar.org/paper/0000238f...,Background Meditation apps have surged in popu...,"[Christian A. Webb, M. Hirshberg, R. Davidson,..."
2,00002d31a8c758062a51d9a259313d81a5eaf399,https://www.semanticscholar.org/paper/00002d31...,,"[L. Szczyrba, Yang Zhang, D. Pamukçu, D. Eroglu]"
3,0000315635be19f6278dbc72597b3065fac405f0,https://www.semanticscholar.org/paper/00003156...,Background Humans must be able to cope with th...,"[Nida Shafiq, Isma Hamid, Muhammad Asif, Qamar..."
4,00005d68c6c7eb4d3c27da8242a30b9a498f991e,https://www.semanticscholar.org/paper/00005d68...,The growing number of cloud-based services has...,"[Iehab Alrassan, Asma Alqahtani]"
...,...,...,...,...
9995,032b27475daba55376171d41629e8986f031431b,https://www.semanticscholar.org/paper/032b2747...,,"[Seyed Mostafa Pourhashemi, M. Mosleh, Y. Erfani]"
9996,032b3905a76c716fba857c7878209840b9e46ada,https://www.semanticscholar.org/paper/032b3905...,In the context of increasing negative environm...,"[Son Vu Hong Pham, Khoi Van Tien Nguyen, Long ..."
9997,032b3db9b36592cf639f9e46a501b527f7d4bc86,https://www.semanticscholar.org/paper/032b3db9...,,"[P. Monika, G. Raju]"
9998,032b423054d52846f9306f23bdd07eccee35a5c8,https://www.semanticscholar.org/paper/032b4230...,Some of the simplest tasks for a human to acco...,"[Eshani Akanksha Bisht, Purnendu Prabhat, Sach..."


In [10]:
#Create .csv file from data frame
semantic_df.to_csv("semantic_scholar_10000.csv", index = False)
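
One thing to keep in mind when a column holds Python lists: a CSV cell can only hold text, so a list is written as its string form and comes back as a string when the file is read again. The sketch below shows one possible workaround using only the standard library (`csv`, `json`, `io`) rather than pandas. The `paperId` values are made up for illustration; the author names are taken from the table above.

```python
import csv
import io
import json

rows = [
    {"paperId": "a1", "authors": ["L. Szczyrba", "Yang Zhang"]},  # hypothetical id
    {"paperId": "b2", "authors": ["P. Monika", "G. Raju"]},       # hypothetical id
]

buffer = io.StringIO()  # in-memory stand-in for a .csv file
writer = csv.DictWriter(buffer, fieldnames=["paperId", "authors"])
writer.writeheader()
for row in rows:
    # Encode the list as JSON text so it can be decoded unambiguously later
    writer.writerow({"paperId": row["paperId"], "authors": json.dumps(row["authors"])})

buffer.seek(0)
restored = [
    {"paperId": r["paperId"], "authors": json.loads(r["authors"])}
    for r in csv.DictReader(buffer)
]
print(restored == rows)
```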

In [11]:
#Question 1-4 - Densho Digital Repository

# Create a cloudscraper instance
scraper = cloudscraper.create_scraper()

#Create a list for the narrator descriptions
densho_narrators = []

#Base URL to iterate on, narrators stored numerically
base_url = "https://ddr.densho.org/narrators/"

#We need about 1000 narrators, assuming they're stored 1-1000
for i in range(1,1001):
    #Create URL to scrape e.g. "https://ddr.densho.org/narrators/1"
    url = base_url+str(i)
    #return response
    response = scraper.get(url)

    #if response is good
    if response.status_code == 200:
        #Parse the html; information is stored in the first paragraph
        soup = BeautifulSoup(response.content, 'html.parser')
        paragraph = soup.find('p')
        #Skip pages without a paragraph instead of raising an AttributeError
        if paragraph is None:
            print(f"Narrator{i} has no description paragraph")
            continue
        #Add narrator information to list
        densho_narrators.append(paragraph.get_text(strip = True))

    #Some narrator #s are missing, if so list status code and narrator #
    else:
        print(f"Narrator{i} unavailable due to status code {response.status_code}")

Narrator4 unavailable due to status code 500
Narrator6 unavailable due to status code 500
Narrator22 unavailable due to status code 500
Narrator36 unavailable due to status code 500
Narrator37 unavailable due to status code 500
Narrator40 unavailable due to status code 500
Narrator127 unavailable due to status code 500
Narrator128 unavailable due to status code 500
Narrator137 unavailable due to status code 500
Narrator203 unavailable due to status code 500
Narrator204 unavailable due to status code 500
Narrator227 unavailable due to status code 500
Narrator228 unavailable due to status code 500
Narrator229 unavailable due to status code 500
Narrator230 unavailable due to status code 500
Narrator240 unavailable due to status code 500
Narrator253 unavailable due to status code 500
Narrator261 unavailable due to status code 500
Narrator262 unavailable due to status code 500
Narrator263 unavailable due to status code 500
Narrator270 unavailable due to status code 500
Narrator271 unavailab

In [12]:
#Ensure we get approx. 1000 narrators
len(densho_narrators)

929

In [13]:
#Return the first 5 narrators' information
print(densho_narrators[:5])

["Nisei male. Born September 23, 1925, in Seattle, Washington. Spent prewar childhood in Seattle's Nihonmachi. Incarcerated at Puyallup Assembly Center, Washington, and Minidoka concentration camp, Idaho. Refused to participate in the draft and was imprisoned at McNeil Island Penitentiary, Washington, for draft resistance. Resettled in Seattle.", "Nisei male. Born January 25, 1920, in Seattle, Washington. Incarcerated at the Puyallup Assembly Center, Washington, and Minidoka concentration camp, Idaho. Resisted the draft, with the rationale that the U.S. government had classified him 4-C, an enemy alien, and he was therefore under no obligation to serve. Imprisoned at McNeil Island Penitentiary, Washington. Vocal critic of the Japanese American Citizens League. Resettled in Seattle, Washington. Thought by some to be the model for the main character in John Okada's No-No Boy.", 'Nisei male. During World War II, served with I Company, part of the 442nd Regimental Combat Team, an all-Japan

In [14]:
#Create a dataframe from the list
densho_df = pd.DataFrame(densho_narrators)

In [15]:
#View dataframe
densho_df

Unnamed: 0,0
0,"Nisei male. Born September 23, 1925, in Seattl..."
1,"Nisei male. Born January 25, 1920, in Seattle,..."
2,"Nisei male. During World War II, served with I..."
3,"Nisei male. Born September 19, 1917, in Hanape..."
4,"Nisei female. Born July 15, 1906, in Bedderavi..."
...,...
924,"Born December 8, 1934, in Pittsburgh, Pennsylv..."
925,"Sansei female. Born December 10, 1937, in Stoc..."
926,"Sansei male. Born April 20, 1939, in Salinas, ..."
927,"Sansei female. Born December 27, 1942, in the ..."


In [16]:
#Create CSV from dataframe
densho_df.to_csv("densho_narrators_929.csv", index = False)

# Question 2 (15 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [17]:
#Download the relevant resources
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [18]:
# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [19]:
# Define preprocessing function
def preprocess_text(text):
    #Convert to lowercase (str() turns a missing abstract, None, into the text 'none')
    text = str(text).lower()
    #Remove special characters and punctuation; digits are kept by this pattern,
    #so numbers such as '4week' or 'n662' remain in the output
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    #Tokenization
    tokens = word_tokenize(text)
    #Import set of stopwords from NLTK
    stop_words = set(stopwords.words('english'))
    #Using the stop word set, remove them from the text
    tokens = [word for word in tokens if word not in stop_words]
    #Stem words
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    #Lemmatize the stemmed tokens
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in stemmed_tokens]
    return lemmatized_tokens
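
Step (2) of the question asks for numbers to be removed, and some abstracts are missing. The sketch below isolates just those two cases using only `re`. `clean_tokens` is a hypothetical helper, not part of the NLTK pipeline, and it splits on whitespace instead of using `word_tokenize`.

```python
import re


def clean_tokens(text):
    """Lowercase, keep only letters and whitespace, then split on whitespace."""
    if not isinstance(text, str):  # None or NaN: nothing to clean
        return []
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # drops punctuation, symbols, and digits
    return text.split()


print(clean_tokens("A 4-week trial (n=662) during COVID-19."))
print(clean_tokens(None))
```

Because the digits are deleted before splitting, a token like `4-week` collapses to `week` rather than disappearing entirely; whether that is desirable depends on the analysis that follows.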

In [20]:
#semantic_df text pre processing
semantic_df['processed abstract'] = semantic_df['abstract'].apply(preprocess_text)

In [21]:
#Check output of processing text
semantic_df['processed abstract'][1]

['background',
 'medit',
 'app',
 'surg',
 'popular',
 'recent',
 'year',
 'increas',
 'number',
 'individu',
 'turn',
 'app',
 'cope',
 'stress',
 'includ',
 'covid19',
 'pandem',
 'medit',
 'app',
 'commonli',
 'use',
 'mental',
 'health',
 'app',
 'depress',
 'anxieti',
 'howev',
 'littl',
 'known',
 'well',
 'suit',
 'app',
 'object',
 'studi',
 'aim',
 'develop',
 'test',
 'datadriven',
 'algorithm',
 'predict',
 'individu',
 'like',
 'benefit',
 'appbas',
 'medit',
 'train',
 'method',
 'use',
 'random',
 'control',
 'trial',
 'data',
 'compar',
 '4week',
 'medit',
 'app',
 'healthi',
 'mind',
 'program',
 'hmp',
 'assessmentonli',
 'control',
 'condit',
 'school',
 'system',
 'employe',
 'n662',
 'develop',
 'algorithm',
 'predict',
 'like',
 'benefit',
 'hmp',
 'baselin',
 'clinic',
 'demograph',
 'characterist',
 'submit',
 'machin',
 'learn',
 'model',
 'develop',
 'person',
 'advantag',
 'index',
 'pai',
 'reflect',
 'individu',
 'expect',
 'reduct',
 'distress',
 'primari',

In [22]:
#Create .csv file from data frame
semantic_df.to_csv("semantic_scholar_10000.csv", index = False)

In [23]:
densho_df[0][0]

"Nisei male. Born September 23, 1925, in Seattle, Washington. Spent prewar childhood in Seattle's Nihonmachi. Incarcerated at Puyallup Assembly Center, Washington, and Minidoka concentration camp, Idaho. Refused to participate in the draft and was imprisoned at McNeil Island Penitentiary, Washington, for draft resistance. Resettled in Seattle."

In [24]:
#densho_df text preprocessing
densho_df[1] = densho_df[0].apply(preprocess_text)

In [25]:
#Check results
densho_df

Unnamed: 0,0,1
0,"Nisei male. Born September 23, 1925, in Seattl...","[nisei, male, born, septemb, 23, 1925, seattl,..."
1,"Nisei male. Born January 25, 1920, in Seattle,...","[nisei, male, born, januari, 25, 1920, seattl,..."
2,"Nisei male. During World War II, served with I...","[nisei, male, world, war, ii, serv, compani, p..."
3,"Nisei male. Born September 19, 1917, in Hanape...","[nisei, male, born, septemb, 19, 1917, hanapep..."
4,"Nisei female. Born July 15, 1906, in Bedderavi...","[nisei, femal, born, juli, 15, 1906, bedderavi..."
...,...,...
924,"Born December 8, 1934, in Pittsburgh, Pennsylv...","[born, decemb, 8, 1934, pittsburgh, pennsylvan..."
925,"Sansei female. Born December 10, 1937, in Stoc...","[sansei, femal, born, decemb, 10, 1937, stockt..."
926,"Sansei male. Born April 20, 1939, in Salinas, ...","[sansei, male, born, april, 20, 1939, salina, ..."
927,"Sansei female. Born December 27, 1942, in the ...","[sansei, femal, born, decemb, 27, 1942, topaz,..."


In [26]:
#Create CSV from dataframe
densho_df.to_csv("densho_narrators_929.csv", index = False)

# Question 3 (15 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
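
To make the difference between the two tree types in part (2) concrete, here is a small hand-annotated sketch. It is not parser output: the sentence, the bracketing, and the relation labels are written by hand for illustration, and `leaves` is a hypothetical helper. A constituency tree groups words into nested phrases, while a dependency tree links each word directly to the word it depends on.

```python
# Hand-annotated example (not parser output) for: "Researchers train models"

# Constituency tree: nested (label, children...) tuples; leaves are plain strings.
constituency = (
    "S",
    ("NP", ("NNS", "Researchers")),
    ("VP", ("VBP", "train"), ("NP", ("NNS", "models"))),
)


def leaves(tree):
    """Return the words of a constituency tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # tree[0] is the phrase or POS label
        words.extend(leaves(child))
    return words


# Dependency tree: one (word, head_index, relation) triple per word.
# head_index is the position of the governing word; -1 marks the root.
dependency = [
    ("Researchers", 1, "nsubj"),
    ("train", -1, "root"),
    ("models", 1, "obj"),
]

print(leaves(constituency))
for word, head, relation in dependency:
    governor = "ROOT" if head == -1 else dependency[head][0]
    print(f"{relation}({governor}, {word})")
```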

In [27]:
#Part 1 - Part of Speech Tagging
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [28]:
#Using pos_tag apply for processed text column and save as new column
semantic_df['POS'] = semantic_df['processed abstract'].apply(nltk.pos_tag)

In [29]:
#Use List Comprehension to iterate through tuples skipping the first word
#selecting the tag, and then creating a massive list of the dataset
all_pos_tags = [tag for sublist in semantic_df['POS'] for _, tag in sublist]

#Use Counter to then count all POS in the massive list
pos_counts = Counter(all_pos_tags)

#Make a dataframe to easily see the counts of each POS
pos_counts_df = pd.DataFrame(pos_counts.items(), columns=['POS', 'Count'])

# Print the POS counts
print(pos_counts_df)

     POS   Count
0     NN  653599
1    VBP   46530
2     JJ  182743
3    NNS   23040
4     RB   15842
5     RP     224
6    VBN    6073
7    VBZ    7813
8     VB   14222
9     IN   16795
10    CD   27779
11    MD    3249
12   JJR    3026
13   VBD   13881
14     $     194
15    FW    5523
16   NNP     864
17   JJS    3698
18   RBR    1482
19    DT     349
20    CC     746
21    TO      44
22   RBS     259
23   WDT      21
24   VBG     296
25    WP      67
26   POS       6
27   PRP      78
28   SYM       3
29  PRP$      15
30   WP$     114
31   WRB      53
32    EX       7
33  NNPS       2
34    ''       1


# **The following questions must be answered using AI assistance**

# Question 4 (20 points)

Q4. (PART-1)
Scrape data from the GitHub Marketplace to gather details about popular actions. Using Python, send HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details to extract are the product name, a short description, and the URL.

Store the extracted data in a structured CSV format with columns for product name, description, URL, and page number. Introduce a time delay between requests to avoid server overload. ChatGPT can assist with parsing the HTML, handling errors, and generating reports based on the data collected.

The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.

(PART -2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.
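
The checks described above (missing values and duplicates) can be sketched with plain dictionaries before moving to a DataFrame. This is a minimal illustration under stated assumptions: `quality_report` is a hypothetical helper, duplicates are identified by URL, and the third sample row is invented to show a missing description.

```python
rows = [
    {"Name": "act", "Description": "Run your GitHub Actions locally 🚀",
     "URL": "https://github.com/nektos/act"},
    {"Name": "act", "Description": "Run your GitHub Actions locally 🚀",
     "URL": "https://github.com/nektos/act"},  # exact duplicate
    {"Name": "example-action", "Description": None,
     "URL": "https://github.com/example/example-action"},  # hypothetical row
]


def quality_report(rows, required=("Name", "Description", "URL")):
    """Count missing required fields and drop rows whose URL was already seen."""
    missing = {
        column: sum(1 for row in rows if not str(row.get(column) or "").strip())
        for column in required
    }
    seen_urls = set()
    unique_rows = []
    for row in rows:
        url = row.get("URL")
        if url in seen_urls:
            continue  # same repository listed twice
        seen_urls.add(url)
        unique_rows.append(row)
    report = {
        "rows": len(rows),
        "missing": missing,
        "duplicates": len(rows) - len(unique_rows),
    }
    return report, unique_rows


report, unique_rows = quality_report(rows)
print(report)
```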


GitHub Marketplace page:
https://github.com/marketplace?type=actions

In [30]:
#Part 1 - Webscraping and saving as CSV

BASE_URL = "https://api.github.com/search/repositories"
HEADERS = {"Accept": "application/vnd.github.v3+json"}
QUERY = "topic:github-actions"
PER_PAGE = 100  # Max allowed per page
TOTAL_RESULTS = 1000
MAX_PAGES = TOTAL_RESULTS // PER_PAGE

actions = []

for page in range(1, MAX_PAGES + 1):
    params = {"q": QUERY, "sort": "stars", "per_page": PER_PAGE, "page": page}
    response = requests.get(BASE_URL, headers=HEADERS, params=params)

    if response.status_code == 200:
        data = response.json()
        actions.extend([(repo["name"], repo["description"], repo["html_url"]) for repo in data.get("items", [])])
    else:
        print(f"Failed to retrieve page {page}: {response.status_code}")
        break  # Stop if we hit an error

    time.sleep(2)  # Respect GitHub API rate limits

# Save results to a file; csv.writer quotes fields that contain commas,
# quotes, or newlines so each repository stays on one well-formed row
import csv

with open("github_actions_1000.csv", "w", encoding="utf-8", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["Name", "Description", "URL"])
    writer.writerows(actions)

print(f"Collected {len(actions)} GitHub Actions. Data saved to 'github_actions_1000.csv'")

Collected 1000 GitHub Actions. Data saved to 'github_actions_1000.csv'


In [31]:
# Load the dataset safely
try:
    df = pd.read_csv("github_actions_1000.csv", delimiter=",", on_bad_lines="skip", encoding="utf-8")
except pd.errors.ParserError as e:
    # Raise instead of calling exit(), which would shut down the notebook kernel
    raise RuntimeError(f"CSV Parsing Error: {e}") from e

# Check the first few rows to understand the structure
print(df.head())

# Ensure the correct columns exist
if "Name" not in df.columns:
    raise KeyError("'Name' column not found in the CSV file.")

# Apply preprocessing to the 'Name' and 'Description' columns
df['Processed_Name'] = df['Name'].apply(preprocess_text)
df['Processed_Description'] = df['Description'].apply(preprocess_text)

# Save the processed data
df.to_csv("processed_github_actions.csv", index=False, encoding="utf-8")

print("Preprocessing complete. Data saved to 'processed_github_actions.csv'")

              Name                                        Description  \
0              act                  Run your GitHub Actions locally 🚀   
1  awesome-actions  A curated list of awesome actions to use on Gi...   
2  Actions-OpenWrt  A template for building OpenWrt with GitHub Ac...   
3      OpenWrt-Rpi  Raspberry Pi & NanoPi R2S/R4S & G-Dock & x86 O...   
4        danger-js    ⚠️ Stop saying "you forgot to …" in code review   

                                         URL  
0              https://github.com/nektos/act  
1   https://github.com/sdras/awesome-actions  
2  https://github.com/P3TERX/Actions-OpenWrt  
3    https://github.com/SuLingGG/OpenWrt-Rpi  
4        https://github.com/danger/danger-js  
Preprocessing complete. Data saved to 'processed_github_actions.csv'


# Question 5 (20 points)

PART 1:
Scrape tweets from Twitter using the Tweepy API, specifically targeting hashtags related to subtopics (machine learning or artificial intelligence).
The extracted data includes the tweet ID, username, and text.

PART 2:
Perform data cleaning procedures.

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.


**Note**

1.   Follow the tutorials provided in Canvas to obtain API keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide the GPT code as shown in the tutorials; don't make multiple requests.


In [35]:
#Part 1 - Webscraping Tweets

# Set up your API credentials
API_KEY = config["API_KEY"]
API_SECRET = config["API_SECRET"]
ACCESS_TOKEN = config["ACCESS_TOKEN"]
ACCESS_SECRET = config["ACCESS_SECRET"]
BEARER_TOKEN = config["BEARER_TOKEN"]

# OAuth 1.0a user-context authentication (not used by the search below)
auth = tweepy.OAuth1UserHandler(
    consumer_key=API_KEY,
    consumer_secret=API_SECRET,
    access_token=ACCESS_TOKEN,
    access_token_secret=ACCESS_SECRET
)
api = tweepy.API(auth)

# Create the client used for the search; it only needs the bearer token
client = tweepy.Client(bearer_token=BEARER_TOKEN)

# Search query for tweets containing "#deepseek"
query = "#deepseek -is:retweet"  # Excludes retweets
max_results = 10  # Adjust as needed

# Fetch tweets
tweets = client.search_recent_tweets(query=query, max_results=max_results, tweet_fields=["id", "text"], user_fields=["username"], expansions="author_id")

# Extract tweet details
tweet_data = []
# Map author ids to usernames; "users" is absent when nothing is returned
user_dict = {user["id"]: user["username"] for user in tweets.includes.get("users", [])}

# tweets.data is None when the search returns no tweets
for tweet in tweets.data or []:
    tweet_data.append({
        "tweet_id": tweet.id,
        "username": user_dict.get(tweet.author_id, "Unknown"),
        "text": tweet.text
    })

# Print results
for tweet in tweet_data:
    print(tweet)

{'tweet_id': 1892399026341036371, 'username': 'newscastjp', 'text': '【3/14開催】「DeepSeek-R1」GPUクラウド活用ウェビナー\u3000日本語モデル解説！ローカル環境で構築する方法について解説し...\n#AI #AIエージェント #ChatGPT #GPU #LLM #RAG #チャットボット #生成AI #DeepSeek #GPUクラウド\nhttps://t.co/Xh3XLgF8DT @AIsmiley_inc'}
{'tweet_id': 1892398964059586575, 'username': 'yogiliman', 'text': '🚀 Exciting news in the tech world! DeepSeek seeks funding and has caught the attention of Alibaba and state investors! What do you think this means for the future of AI and investment trends? 🤔 Dive into the details here: https://t.co/IYMeuABy7E #DeepSeek #AI #Investment #Tec… https://t.co/Aro17xErqJ'}
{'tweet_id': 1892397746524447191, 'username': 'TechHome100', 'text': "Realme Neo7 SE\nDeepSeek-R1 × AI Master Assists \nThe industry's first game function is connected to DeepSeek\n#realme #realmeNeo7SE #DeepSeek https://t.co/jGBKe4GDKe"}
{'tweet_id': 1892396801510297995, 'username': 'TongzhouNantong', 'text': "Here's a poem by #deepseek for #Tongzhou! Check it out!\nN

In [36]:
#Part 2 - Clean text and save as CSV
# Convert the list of tweet data to a pandas DataFrame
df = pd.DataFrame(tweet_data)

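# Preprocess text using previously created function
# Note: this overwrites the raw columns, and tweet_id and username are also
# tokenized, so each of them ends up as a one-element list
df["tweet_id"] = df["tweet_id"].apply(preprocess_text)
df["username"] = df["username"].apply(preprocess_text)
df["text"] = df["text"].apply(preprocess_text)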

print(df)

# Save DataFrame to a CSV file
df.to_csv("tweets.csv", index=False)

                tweet_id           username  \
0  [1892399026341036371]       [newscastjp]   
1  [1892398964059586575]        [yogiliman]   
2  [1892397746524447191]      [techhome100]   
3  [1892396801510297995]  [tongzhounantong]   
4  [1892396001182568826]        [yilien000]   
5  [1892395697468477731]       [simone9989]   
6  [1892395419692650888]     [metaversehub]   
7  [1892394811568607689]         [3phonehk]   
8  [1892394650583069084]         [joelmavi]   
9  [1892394570387742937]        [shahramma]   

                                                text  
0  [314deepseekr1gpu, ai, ai, chatgpt, gpu, llm, ...  
1  [excit, news, tech, world, deepseek, seek, fun...  
2  [realm, neo7, se, deepseekr1, ai, master, assi...  
3  [here, poem, deepseek, tongzhou, check, nanton...  
4  [elonmuskil, beaucoup, dingnieur, tr, intellig...  
5  [grok3, gt, altr, ia, pensa, come, un, umano, ...  
6  [5, wechat, ai, search, start, grey, test, dee...  
7  [ai, 3phoneai, ai, deepseek, chatgppt, 
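
Tweets carry noise that abstracts do not, such as links, mentions, and hashtag symbols, and identifiers like the tweet ID generally should stay untouched. The following sketch shows one possible tweet-specific cleaning step using only `re`. `clean_tweet` and the `clean_text` key are hypothetical names, and the sample record is one of the tweets printed earlier. The letter-only filter also discards non-Latin text, which is an assumption that would need revisiting for multilingual data.

```python
import re


def clean_tweet(text):
    """Remove links and mentions, keep hashtag words, keep only lowercase letters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = re.sub(r"@\w+", " ", text)          # drop @mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # drop '#', digits, punctuation, emoji
    return " ".join(text.split())              # collapse repeated whitespace


record = {
    "tweet_id": 1892397746524447191,
    "username": "TechHome100",
    "text": "Realme Neo7 SE\nDeepSeek-R1 × AI Master Assists \nThe industry's first "
            "game function is connected to DeepSeek\n#realme #realmeNeo7SE #DeepSeek "
            "https://t.co/jGBKe4GDKe",
}

# Add the cleaned text as a new field; id, username, and raw text stay unchanged
record["clean_text"] = clean_tweet(record["text"])
print(record["clean_text"])
```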

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

Accessing websites and scraping data from them without an API is extremely time consuming and difficult to understand, and it often leads to issues such as Amazon denying access to the program, or needing to find creative ways to iterate through a website when it does not store reviews on sequentially numbered pages. Initially I thought we had to scrape all 5 of the listed resources for part 1 and found myself getting extremely frustrated by Amazon, IMDB, and G2. However, once I had found websites with available APIs like Semantic Scholar, and websites that could easily be navigated with cloudscraper and BeautifulSoup (because their HTML was not extremely condensed), I found it much easier to begin gathering data. I have a lot more experience dealing with data that is stored in CSV files, dataframes, or lists, so when it came to preprocessing text, moving things into a dataframe, and saving as a CSV, it was significantly easier for me. I do not have much experience working with POS tagging, Named Entity Recognition, and similar tasks, so unfortunately I had to skip those for the sake of time. Lastly, I did enjoy working with APIs like Semantic Scholar and Twitter. These APIs make it significantly easier to extract data in a structured format, iterate, and search compared to HTML and BeautifulSoup. In my free time I have tried web scraping Reddit using Async PRAW, and found that it can be difficult because you need to understand how each website stores its data and how to navigate the website through programming instead of a user interface.

I believe the time provided to complete the assignment was enough; however, I really struggled on question 3 and was unable to complete parts 2 & 3. I have missed the last two classes, so it is possible that information provided in class would have enabled me to complete those questions.

# Write your response below
Fill out survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog