<a href="https://colab.research.google.com/github/KruthiReddyKasarla/Kruthi_INFO5731_Spring-2025/blob/main/Kasarla_Kruthi_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5)**Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [6]:
import csv
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

# Initialize variables
count = 0
main_url = "https://ddr.densho.org/narrators/?page={}"

# File to store results
OUTPUT_FILE = "densho_narrators.csv"

# Define the CSV header
HEADER = ["Narrator_Name", "Bio", "Title", "Interviewer", "Location", "Date", "Densho_ID", "Transcript"]

# Open CSV file to write results
with open(OUTPUT_FILE, "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(HEADER)

    # Loop through pages of narrators
    page_num = 1
    while count < 904:
        # Opening URL and Parsing with BeautifulSoup
        link1 = Request(main_url.format(page_num), headers={'User-Agent': 'Mozilla/5.0'})
        url1 = urlopen(link1)
        data1 = url1.read()
        data1_soup = BeautifulSoup(data1, 'html.parser')

        print(f"*** Page {page_num} ***")

        # Iterating through each narrator's link on the page
        for narrator_link in data1_soup.find_all('h4'):
            if count >= 904:  # Stop if we've reached 904 narrators
                break

            # Going to narrator transcript page
            link2 = Request(narrator_link.a.get('href'), headers={'User-Agent': 'Mozilla/5.0'})
            url2 = urlopen(link2)
            data2 = url2.read()
            data2_soup = BeautifulSoup(data2, 'html.parser')

            count += 1
            print(f"Processing narrator {count}...")

            # Extract narrator details
            narrator = data2_soup.find_all("div", attrs={'class':'col-sm-8 col-md-8'})[0]
            result = {}

            # Narrator name and Bio
            try:
                result['Narrator_Name'] = narrator.h1.text.strip().replace('"', "")
            except:
                result['Narrator_Name'] = ""

            try:
                result['Bio'] = narrator.p.text
            except:
                result['Bio'] = ""

            # Initialize other fields as empty
            result['Title'] = ""
            result['Interviewer'] = ""
            result['Location'] = ""
            result['Date'] = ""
            result['Densho_ID'] = ""
            result['Transcript'] = ""

            # If interview transcripts are available
            if data2_soup.find_all("div", attrs={'class':'url'}):
                interview_count = 1
                for interviews in data2_soup.find_all("div", attrs={'class':'url'}):
                    try:
                        link3 = Request(interviews.a.get('href'), headers={'User-Agent': 'Mozilla/5.0'})
                        url3 = urlopen(link3)
                        data3 = url3.read()
                        data3_soup = BeautifulSoup(data3, 'html.parser')

                        # Extract title, interviewer, location, date, densho ID
                        for data in data3_soup.find_all('div', attrs={'class':'segmentHead'}):
                            for attr in data.text.splitlines():
                                if 'Title' in attr:
                                    result['Title'] = ' '.join(attr.strip().split()[1:])
                                if 'Interviewer' in attr:
                                    result['Interviewer'] = ' '.join(attr.strip().split()[1:])
                                if 'Location' in attr:
                                    result['Location'] = ' '.join(attr.strip().split()[1:])
                                if 'Date' in attr:
                                    result['Date'] = ' '.join(attr.strip().split()[1:])
                                if 'Densho ID' in attr:
                                    result['Densho_ID'] = ' '.join(attr.strip().split()[2:])

                        # Collect transcript
                        transcript = []
                        for seg in data3_soup.find_all('div', attrs={'class':'segmentBody'}):
                            for p in seg.find_all('p')[1:-1]:
                                if '?' in p.text:
                                    transcript.append(re.sub('\s+', ' ', re.sub("  +", "", p.text)))
                        result['Transcript'] = " | ".join(transcript)

                    except Exception as e:
                        print(f"Error processing interview {interview_count}: {e}")
                        continue
                    interview_count += 1

            # Write result to CSV
            writer.writerow([result['Narrator_Name'], result['Bio'], result['Title'], result['Interviewer'],
                             result['Location'], result['Date'], result['Densho_ID'], result['Transcript']])

        page_num += 1

print(f"Total count of narrators is: {count}")



*** Page 1 ***
Processing narrator 1...
Processing narrator 2...
Processing narrator 3...
Processing narrator 4...
Processing narrator 5...
Processing narrator 6...
Processing narrator 7...
Processing narrator 8...
Processing narrator 9...
Processing narrator 10...
Processing narrator 11...
Processing narrator 12...
Processing narrator 13...
Processing narrator 14...
Processing narrator 15...
Processing narrator 16...
Processing narrator 17...
Processing narrator 18...
Processing narrator 19...
Processing narrator 20...
Processing narrator 21...
Processing narrator 22...
Processing narrator 23...
Processing narrator 24...
Processing narrator 25...
*** Page 2 ***
Processing narrator 26...
Processing narrator 27...
Processing narrator 28...
Processing narrator 29...
Processing narrator 30...
Processing narrator 31...
Processing narrator 32...
Processing narrator 33...
Processing narrator 34...
Processing narrator 35...
Processing narrator 36...
Processing narrator 37...
Processing narrat

# Question 2 (15 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [7]:
import csv
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download required NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

# Define text cleaning functions:

# (1)
def remove_noise(text):
    return re.sub(r'[^\w\s]', '', text)  # Removes all non-word characters and punctuation

# (2)
def remove_numbers(text):
    return re.sub(r'\d+', '', text)  # Removes all digits

# (3)
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word.lower() not in stop_words])

# (4)
def lowercase_text(text):
    return text.lower()  # Converts text to lowercase

# (5)
ps = PorterStemmer()

def stemming(text):
    return ' '.join([ps.stem(word) for word in text.split()])

# (6)
lemmatizer = WordNetLemmatizer()

def lemmatization(text):
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

# Input and output file names
input_file = 'densho_narrators.csv'
output_file = 'densho_narrators_cleaned.csv'

# Open the input CSV and create the output CSV
with open(input_file, 'r', encoding='utf-8') as infile, open(output_file, 'w', newline='', encoding='utf-8') as outfile:
    reader = csv.DictReader(infile)
    fieldnames = reader.fieldnames + ['Cleaned_Transcript']  # Adding new column for cleaned data
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()


    for row in reader:
        # Get the 'Transcript' or 'Narrator_Name' to clean
        raw_text = row['Transcript'] if row['Transcript'] else row['Narrator_Name']

        # Apply the cleaning steps
        cleaned_text = remove_noise(raw_text)
        cleaned_text = remove_numbers(cleaned_text)
        cleaned_text = remove_stopwords(cleaned_text)
        cleaned_text = lowercase_text(cleaned_text)
        cleaned_text = stemming(cleaned_text)
        cleaned_text = lemmatization(cleaned_text)

        # Add the cleaned text to the row
        row['Cleaned_Transcript'] = cleaned_text

        # cleaned data to the output CSV
        writer.writerow(row)

print(f"Data cleaned and saved in {output_file}")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Data cleaned and saved in densho_narrators_cleaned.csv


# Question 3 (15 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [8]:
import spacy
import pandas as pd
from collections import Counter
from spacy import displacy

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Load the cleaned data from the CSV file
df = pd.read_csv('densho_narrators_cleaned.csv')

# Ensure the column exists
if 'Bio' not in df.columns:
    print("Error: 'Bio' column not found! Available columns are:", df.columns)
else:
    # Extract the first non-null cleaned text entry
    cleaned_narrators_data = df['Bio'].dropna().iloc[0] if not df['Bio'].dropna().empty else ""

    if cleaned_narrators_data == "":
        print("Error: No valid text data found in 'Bio' column!")
    else:
        # Process the text using spaCy
        doc = nlp(cleaned_narrators_data)

        # 1.
        pos_counts = Counter(token.pos_ for token in doc)

        print("\nParts of Speech (POS) Counts:")
        print(f"Nouns (NOUN): {pos_counts['NOUN']}")
        print(f"Verbs (VERB): {pos_counts['VERB']}")
        print(f"Adjectives (ADJ): {pos_counts['ADJ']}")
        print(f"Adverbs (ADV): {pos_counts['ADV']}")

        # 2.
        print("\nDependency Parsing Visualization:")
        displacy.render(doc, style="dep", jupyter=True)

        # 3.
        ner_counts = Counter(ent.label_ for ent in doc.ents)

        print("\nNamed Entity Recognition (NER) Counts:")
        for entity, count in ner_counts.items():
            print(f"{entity}: {count}")

        # Display Named Entity Recognition Visualization
        print("\nNamed Entity Recognition Visualization:")
        displacy.render(doc, style="ent", jupyter=True)

        # 4.
        example_sentence = "Apple was founded by Steve Jobs in Cupertino in 1976."
        doc_example = nlp(example_sentence)

        print("\nDependency Parsing of Example Sentence:")
        for token in doc_example:
            print(f"Word: {token.text}, Dependency: {token.dep_}, Head: {token.head.text}")

        # Dependency visualization for example sentence
        displacy.render(doc_example, style="dep", jupyter=True)



Parts of Speech (POS) Counts:
Nouns (NOUN): 12
Verbs (VERB): 8
Adjectives (ADJ): 4
Adverbs (ADV): 0

Dependency Parsing Visualization:



Named Entity Recognition (NER) Counts:
PERSON: 1
DATE: 1
GPE: 8
NORP: 1
EVENT: 1
ORG: 2

Named Entity Recognition Visualization:



Dependency Parsing of Example Sentence:
Word: Apple, Dependency: nsubjpass, Head: founded
Word: was, Dependency: auxpass, Head: founded
Word: founded, Dependency: ROOT, Head: founded
Word: by, Dependency: agent, Head: founded
Word: Steve, Dependency: compound, Head: Jobs
Word: Jobs, Dependency: pobj, Head: by
Word: in, Dependency: prep, Head: founded
Word: Cupertino, Dependency: pobj, Head: in
Word: in, Dependency: prep, Head: founded
Word: 1976, Dependency: pobj, Head: in
Word: ., Dependency: punct, Head: founded


# **Following Questions must answer using AI assitance**

#Question 4 (20 points).

Q4. (PART-1)
Web scraping data from the GitHub Marketplace to gather details about popular actions. Using Python, the process begins by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details extracted include the product name, a short description, and the URL.

 The extracted data is stored in a structured CSV format with columns for product name, description, URL, and page number. A time delay is introduced between requests to avoid server overload. ChatGPT can assist by helping with the parsing of HTML, error handling, and generating reports based on the data collected.

 The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub‚Äôs usage guidelines.

(PART -2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.


Github MarketPlace page:
https://github.com/marketplace?type=actions

In [11]:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib.error import HTTPError
import pandas as pd

base_url = "https://github.com/marketplace?type=actions"
total_pages = 500

# List to store scraped data
scraped_data = []

for page in range(1, total_pages + 1):
    try:
        request = Request(base_url, headers={'User-Agent': 'Mozilla/5.0'})
        response = urlopen(request)
        html_content = response.read()

        soup = BeautifulSoup(html_content, "html.parser")
        container = soup.find('div', class_="mt-4 marketplace-common-module__marketplace-list-grid--vCk7D")

        if container:
            items = container.find_all('div', class_="position-relative border rounded-2 d-flex marketplace-common-module__marketplace-item--MohVH gap-3 p-3")

            for item in items:
                header = item.find('div', class_="d-flex flex-justify-between flex-items-start gap-3")
                product_name = header.find('a').text.strip()
                product_desc = item.find('p').text.strip()

                # Append to list
                scraped_data.append([product_name, product_desc])
                print(product_name)
                print(product_desc)
        else:
            print(f"No data found on page {page}")

    except HTTPError as e:
        print(f"HTTP Error on page {page}: {e}")

# Convert to DataFrame
df = pd.DataFrame(scraped_data, columns=["Product Name", "Description"])

# Save as CSV
output_file = "github_marketplace_actions.csv"
df.to_csv(output_file, index=False, encoding="utf-8")

print(f"CSV file '{output_file}' has been successfully generated!")


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Scan Github Actions with TruffleHog
Metrics embed
An infographics generator with 40+ plugins and 300+ options to display stats about your GitHub account
yq - portable yaml processor
create, read, update, delete, merge, validate and do more with yaml
Super-Linter
Super-linter is a ready-to-run collection of linters and code analyzers, to help validate your source code
Gosec Security Checker
Runs the gosec security checker
Rebuild Armbian and Kernel
Support Amlogic, Rockchip and Allwinner boxes
OpenCommit ‚Äî improve commits with AI üßô
Replaces lame commit messages with meaningful AI-generated messages when you push to remote
Checkout
Checkout a Git repository at a particular version
SSH Remote Commands
Executing remote ssh commands
GitHub Pages action
GitHub Actions for GitHub Pages üöÄ Deploy static files and publish your site easily. Static-Site-Generators-friendly
Cache
Cache artifacts like dependencies and build out

#Question 5 (20 points)

PART 1:
Web Scrape  tweets from Twitter using the Tweepy API, specifically targeting hashtags related to subtopics (machine learning or artificial intelligence.)
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.


**Note**

1.   Follow tutorials provided in canvas to obtain api keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide GPT code as shown in tutorials, dont make multiple requestes.


In [1]:
pip install tweepy



In [3]:
import tweepy

# Set your keys and tokens here
api_key = 'ZwdEHbvpbvLq5Km3j0Ymuz2BM'
api_key_secret = 'jBAZxUuKGv0mCe8KWolbf9GlPV0ofSSwkyLpQ2XsDShvwLQqF2'
access_token = '1892334348478697474-HmOIXxFbTlUogKdd6vqGJtSQzf22Rw'
access_token_secret = 'YHXrwNmuWnhEbxKNvzC9k0Ek6edmbFqk6Uleu2uSJc2fc'

# Authenticate with Twitter
auth = tweepy.OAuth1UserHandler(
    consumer_key=api_key,
    consumer_secret=api_key_secret,
    access_token=access_token,
    access_token_secret=access_token_secret
)
api = tweepy.API(auth)

In [4]:
import tweepy

# Set up bearer token
client = tweepy.Client(bearer_token='AAAAAAAAAAAAAAAAAAAAAGWgzQEAAAAAwsv1s%2BcZK9CxY6zCXcBj57RcEzE%3D4rJGRU0DAx7yyE2TqYNzRwcnv7RMGRFQUcOdKEMtb3y0KjIRZZ')



query = "#llm -is:retweet"
tweets = client.search_recent_tweets(query=query, tweet_fields=["created_at", "text", "author_id"], max_results=100)

# Print tweets
for tweet in tweets.data:
    print(f"User: {tweet.author_id}, Tweet: {tweet.text}")


User: 1835547425446207488, Tweet: #DeveloperWeek2025 #ai #softwareengineer #aiagents #machinelearning #llm #deeplearning #machinelearning #100Devs https://t.co/TPBr5AUDpI
User: 1484066855044063238, Tweet: üë©‚Äçüíª It's now easier than ever to integrate Visual AI into your production applications! If you're looking to build with state-of-the-art Vision-Language Models (VLMs), check out our new Python SDK and Node SDKs.

https://t.co/iYFPQ2GRxr

#vlm #llm #genai
User: 1640297006, Tweet: We're never going back to a world with no #GenerativeAI - let's not forget #SymbolicAI could be the best of both worlds - ML &amp; #GenAI working together ! https://t.co/DXMWy3jz6T #innovation #disruption #LLM #SLM #AgenticAI #AgenticWorkflow #KnowledgeGraphs
User: 3138657423, Tweet: I made it full screen now! 
Not sure why the AI likes to play in a small rectangle...
#AI #Grok3 #GROK3AI #grok #indiegame #indiedev #IndieGameDev #JRPG #rpg #javascript #html5 #CSS #HTML #LLM #programming #code https://t.

In [5]:
import pandas as pd

# Create a list to store tweet data
tweet_data = []

if tweets.data:
    for tweet in tweets.data:
        tweet_data.append({
            'tweet_id': tweet.id,
            'author_id': tweet.author_id,
            'created_at': tweet.created_at,
            'text': tweet.text
        })

# Convert tweet data to pandas DataFrame
df = pd.DataFrame(tweet_data)

# Display DataFrame
print(df)

# Save to a CSV file
df.to_csv("tweets_llm.csv", index=False)

               tweet_id            author_id                created_at  \
0   1892340124542754970  1835547425446207488 2025-02-19 22:26:58+00:00   
1   1892338804088045996  1484066855044063238 2025-02-19 22:21:43+00:00   
2   1892338301001957515           1640297006 2025-02-19 22:19:43+00:00   
3   1892333861008241128           3138657423 2025-02-19 22:02:04+00:00   
4   1892333338557632966            416378712 2025-02-19 22:00:00+00:00   
..                  ...                  ...                       ...   
95  1892202476654461088           3018375669 2025-02-19 13:20:00+00:00   
96  1892202225025388583  1813879763007606785 2025-02-19 13:19:00+00:00   
97  1892201764406931602  1086871886590488577 2025-02-19 13:17:10+00:00   
98  1892197093516718354   933974788002889729 2025-02-19 12:58:37+00:00   
99  1892197010008379820            550268745 2025-02-19 12:58:17+00:00   

                                                 text  
0   #DeveloperWeek2025 #ai #softwareengineer #aiag...  

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

Thoughts on the Assignment:
This assignment was a great opportunity to apply web scraping, API usage, and text processing techniques to real-world datasets. I found it insightful to work with different sources like GitHub Marketplace, Twitter, and review sites, as it provided hands-on experience in data collection and cleaning.

Challenges Faced:
One of the biggest challenges was handling pagination efficiently while scraping large datasets. Additionally, cleaning text data required careful preprocessing to remove noise while preserving meaningful information. Setting up API access, especially for Twitter, required proper authentication and managing rate limits.

Aspects Enjoyed:
I enjoyed working with real-world text data and applying NLP techniques such as POS tagging, dependency parsing, and named entity recognition. It was interesting to analyze how different preprocessing steps affected the quality of the final dataset.

Time to Complete the Assignment:
The given time was manageable, but some tasks, like collecting large datasets and cleaning them, required significant effort. Fortunately, the one-day deadline extension helped me complete the assignment more effectively, allowing extra time for debugging and improving data quality. The structured approach in the assignment also helped in completing it efficiently.



# Write your response below
Fill out survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog