# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released recently, in 2023 or 2024 (you can choose any movie), from IMDb. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
# Collect narrator records from the Densho Digital Repository API and save them to a CSV file

import csv
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
}

# The API is paginated: request 25 records at a time at offsets 0, 25, ..., 975
url = 'https://ddr.densho.org/api/0.2/narrator/?limit=25&offset='
offsets = list(range(0, 980, 25))

finalData = {}
k = 0
for offset in offsets:
    resp = requests.get(url + str(offset), headers=headers)
    resp.raise_for_status()  # stop early if a request fails
    data = resp.json()
    for narrator in data['objects']:
        finalData[str(k)] = {
            'name': narrator['display_name'],
            'birthday': narrator.get('b_date', ''),  # some records have no birth date
            'bio': narrator['bio'],
        }
        k += 1

# Write one CSV row per narrator
csv_file = 'data.csv'

with open(csv_file, 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['ID', 'Name', 'Birthday', 'Bio'])
    writer.writeheader()

    for key, values in finalData.items():
        writer.writerow({
            'ID': key,
            'Name': values['name'],
            'Birthday': values['birthday'],
            'Bio': values['bio']
        })
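
The cell above hard-codes the list of offsets. As a hedged alternative, the sketch below keeps requesting pages until one comes back short, so the total number of records does not need to be known in advance. The `paginate` helper, `fake_fetch`, and `FAKE_RECORDS` are hypothetical names introduced for illustration; the only assumption carried over from the cell above is that each page is a dict with an `objects` list.

```python
from typing import Callable, Iterator


def paginate(fetch_page: Callable[[int, int], dict], limit: int = 25) -> Iterator[dict]:
    """Yield records page by page until a page holds fewer than `limit` items."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        objects = page.get("objects", [])
        yield from objects
        if len(objects) < limit:
            break  # a short (or empty) page means there is nothing left
        offset += limit


# Hypothetical in-memory stand-in for the HTTP call: 60 fake records
FAKE_RECORDS = [{"display_name": f"Narrator {i}"} for i in range(60)]


def fake_fetch(limit: int, offset: int) -> dict:
    return {"objects": FAKE_RECORDS[offset:offset + limit]}


records = list(paginate(fake_fetch, limit=25))
print(len(records))
```

To use this against a real endpoint, `fake_fetch` would be swapped for a function that calls `requests.get(...)` with the same `limit` and `offset` query parameters and returns `resp.json()`.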



# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [28]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

df = pd.read_csv("/content/data.csv")

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # (1) Replace punctuation and special characters with spaces.
    # Note: isalnum() keeps digits, so step (2), removing numbers, is not applied here.
    filtered_text = ''.join([char if char.isalnum() or char.isspace() else ' ' for char in str(text)])
    tokens = word_tokenize(filtered_text)
    # (3) + (4) Lowercase each token and drop stopwords
    filtered_tokens = [word.lower() for word in tokens if word.lower() not in stop_words]
    # (5) Stemming
    stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
    # (6) Lemmatization, applied to the already-stemmed tokens
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in stemmed_tokens]
    preprocessed_text = ' '.join(lemmatized_tokens)
    return preprocessed_text

# Store the cleaned text in a new column and save the result
df['Cleaned_Bio'] = df['Bio'].apply(preprocess_text)
df.to_csv("cleaned_data.csv", index=False)
print(df.head())

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


   ID                   Name             Birthday  \
0   0           Kay Aiko Abe  1927-05-09T00:00:00   
1   1                Art Abe  1921-06-12T00:00:00   
2   2  Sharon Tanagi Aburano  1925-10-31T00:00:00   
3   3        Toshiko Aiboshi  1928-04-08T00:00:00   
4   4      Douglas L. Aihara  1950-03-15T00:00:00   

                                                 Bio  \
0  Nisei female. Born May 9, 1927, in Selleck, Wa...   
1  Nisei male. Born June 12, 1921, in Seattle, Wa...   
2  Nisei female. Born October 31, 1925, in Seattl...   
3  Nisei female. Born July 8, 1928, in Boyle Heig...   
4  Sansei male. Born March 15, 1950, in Torrance,...   

                                         Cleaned_Bio  
0  nisei femal born may 9 1927 selleck washington...  
1  nisei male born june 12 1921 seattl washington...  
2  nisei femal born octob 31 1925 seattl washingt...  
3  nisei femal born juli 8 1928 boyl height calif...  
4  sansei male born march 15 1950 torranc califor...  
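
The output above still contains digits ("9 1927"), because `isalnum()` treats digits as valid characters. The sketch below shows one possible way to apply steps (1) to (4) separately, including number removal, and to keep each intermediate result so that output can be shown for every part. It uses only the standard library; `clean_steps` and the tiny `STOPWORDS` set are illustrative assumptions, not the NLTK stopword list used in the cell above, and stemming and lemmatization are left out.

```python
import re

# Tiny illustrative stopword set (an assumption); NLTK's English list is much longer
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to", "was"}


def clean_steps(text):
    """Return the text after each cleaning step so every stage can be inspected."""
    steps = {}
    # (1) Replace punctuation, special characters, and underscores with spaces
    steps["no_noise"] = re.sub(r"[^\w\s]|_", " ", str(text))
    # (2) Replace runs of digits with spaces
    steps["no_numbers"] = re.sub(r"\d+", " ", steps["no_noise"])
    # (4) Lowercase everything
    steps["lowercase"] = steps["no_numbers"].lower()
    # (3) Drop stopwords; split() also collapses the extra spaces
    tokens = [word for word in steps["lowercase"].split() if word not in STOPWORDS]
    steps["no_stopwords"] = " ".join(tokens)
    return steps


sample = "Born May 9, 1927, in Selleck, Washington."
for name, value in clean_steps(sample).items():
    print(f"{name}: {value!r}")
```

With pandas, a single stage could be stored as a column with something like `df['Bio'].apply(lambda t: clean_steps(t)['no_stopwords'])`.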


# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities, such as person names, organizations, locations, product names, and dates, from the clean texts, and calculate the count of each entity.

In [31]:
df

| | ID | Name | Birthday | Bio | Cleaned_Bio |
|---|---|---|---|---|---|
| 0 | 0 | Kay Aiko Abe | 1927-05-09T00:00:00 | Nisei female. Born May 9, 1927, in Selleck, Wa... | nisei femal born may 9 1927 selleck washington... |
| 1 | 1 | Art Abe | 1921-06-12T00:00:00 | Nisei male. Born June 12, 1921, in Seattle, Wa... | nisei male born june 12 1921 seattl washington... |
| 2 | 2 | Sharon Tanagi Aburano | 1925-10-31T00:00:00 | Nisei female. Born October 31, 1925, in Seattl... | nisei femal born octob 31 1925 seattl washingt... |
| 3 | 3 | Toshiko Aiboshi | 1928-04-08T00:00:00 | Nisei female. Born July 8, 1928, in Boyle Heig... | nisei femal born juli 8 1928 boyl height calif... |
| 4 | 4 | Douglas L. Aihara | 1950-03-15T00:00:00 | Sansei male. Born March 15, 1950, in Torrance,... | sansei male born march 15 1950 torranc califor... |
| ... | ... | ... | ... | ... | ... |
| 972 | 972 | Karen Yoshitomi | 1962-11-01T00:00:00 | Sansei female. Born 1962 in Spokane, Washingto... | sansei femal born 1962 spokan washington fathe... |
| 973 | 973 | John Young | 1923-05-22T00:00:00 | Chinese American male. Born May 22, 1923, in L... | chine american male born may 22 1923 los angel... |
| 974 | 974 | Sharon Yuen | 1945-07-01T00:00:00 | Sansei female. Born July 1945 in Seattle, Wash... | sansei femal born juli 1945 seattl washington ... |
| 975 | 975 | Lois Yuki | 1944-09-13T00:00:00 | Nisei female. Born September 13, 1944, in the ... | nisei femal born septemb 13 1944 tule lake con... |


In [32]:
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

def syntax_structure_analysis(text):
    """Print POS counts and selected named-entity counts for one text."""
    doc = nlp(text)

    # Count tokens by coarse-grained POS tag
    noun_count = len([token for token in doc if token.pos_ == 'NOUN'])
    verb_count = len([token for token in doc if token.pos_ == 'VERB'])
    adj_count = len([token for token in doc if token.pos_ == 'ADJ'])
    adv_count = len([token for token in doc if token.pos_ == 'ADV'])

    # Track only entity labels that spaCy's NER emits ('ID' is not one of them)
    entities = {'PERSON': 0, 'DATE': 0}

    for ent in doc.ents:
        if ent.label_ in entities:
            entities[ent.label_] += 1

    print(f"Total Nouns: {noun_count}")
    print(f"Total Verbs: {verb_count}")
    print(f"Total Adjectives: {adj_count}")
    print(f"Total Adverbs: {adv_count}")

    print("\nNamed Entity Recognition (NER) entities:")
    for entity_type, count in entities.items():
        print(f"{entity_type.capitalize()}: {count}")

# Analyze the first cleaned bio only
text = df['Cleaned_Bio'][0]
syntax_structure_analysis(text)


Total Nouns: 13
Total Verbs: 3
Total Adjectives: 2
Total Adverbs: 0

Named Entity Recognition (NER) entities:
Person: 1
Date: 0
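
The counts above cover only the first bio, while the question asks for totals. A minimal sketch of the aggregation step is shown below. It assumes each document has already been tagged and is available as a list of `(token, tag)` pairs; with spaCy these pairs could be built from `token.text` and `token.pos_`. The `count_pos` helper and the hand-written `sample_docs` are hypothetical and are not output from any tagger.

```python
TARGET_TAGS = ("NOUN", "VERB", "ADJ", "ADV")


def count_pos(tagged_docs):
    """Sum the number of nouns, verbs, adjectives, and adverbs over all documents."""
    totals = {tag: 0 for tag in TARGET_TAGS}
    for doc in tagged_docs:
        for _token, tag in doc:
            if tag in totals:
                totals[tag] += 1
    return totals


# Hand-written example tags (an assumption for illustration, not real tagger output)
sample_docs = [
    [("nisei", "PROPN"), ("female", "ADJ"), ("born", "VERB")],
    [("grew", "VERB"), ("up", "ADP"), ("farm", "NOUN"), ("quickly", "ADV")],
]

print(count_pos(sample_docs))
```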


In [36]:
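import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

# Parse the first cleaned bio and draw its dependency tree inline
sentence = df['Cleaned_Bio'][0]
doc = nlp(sentence)
print("Dependency Parsing Tree:")
displacy.render(doc, style="dep", jupyter=True)

# Note: spaCy has no built-in constituency parser. The string below is a hand-written
# example tree for an unrelated sentence; it is not computed from the data and is never printed.
constituency_tree = "(S (NP (DT The) (NN cat)) (VP (VBD chased) (NP (DT the) (NN mouse))) (. .))"
print("\nConstituency Parsing Tree:")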

Dependency Parsing Tree:



Constituency Parsing Tree:
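
No constituency tree is printed above. To help with reading the bracketed notation used in the cell's `constituency_tree` string, the sketch below parses such a string into nested `(label, children)` tuples and prints it as an indented tree. It is a standard-library illustration of the notation only; it does not compute a parse from text, and `parse_bracketed`, `leaves`, and `format_tree` are hypothetical helper names.

```python
import re


def parse_bracketed(s):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "(", "each node must start with '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())  # nested phrase
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return parse_node()


def leaves(tree):
    """Return the words of the sentence in left-to-right order."""
    _label, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def format_tree(tree, depth=0):
    """Return the tree as a list of lines, indented two spaces per level."""
    label, children = tree
    lines = ["  " * depth + label]
    for child in children:
        if isinstance(child, tuple):
            lines.extend(format_tree(child, depth + 1))
        else:
            lines.append("  " * (depth + 1) + child)
    return lines


example = "(S (NP (DT The) (NN cat)) (VP (VBD chased) (NP (DT the) (NN mouse))) (. .))"
tree = parse_bracketed(example)
print("\n".join(format_tree(tree)))
```

In the printed tree, the sentence node `S` splits into a noun phrase (`NP`) and a verb phrase (`VP`), and the words sit at the deepest level under their part-of-speech labels. A dependency tree, by contrast, links words directly to their head words.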


In [44]:
import spacy
import pandas as pd

# Load English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# Function to perform Named Entity Recognition and count entities
def extract_entities(text):
    doc = nlp(text)
    entity_counts = {"PERSON": 0, "ORG": 0, "GPE": 0, "PRODUCT": 0, "DATE": 0}
    for ent in doc.ents:
        if ent.label_ in entity_counts:
            entity_counts[ent.label_] += 1
    return entity_counts

# Apply entity extraction function to the 'Cleaned_Bio' column in the DataFrame
entity_counts = df['Cleaned_Bio'].apply(extract_entities)

# Initialize counts for total entities
total_entity_counts = {"PERSON": 0, "ORG": 0, "GPE": 0, "PRODUCT": 0, "DATE": 0}

# Sum the entity counts across all rows
for counts in entity_counts:
    for entity_type, count in counts.items():
        total_entity_counts[entity_type] += count

# Print the total counts of each entity type
print("Total Entity Counts:")
for entity_type, count in total_entity_counts.items():
    print(f"{entity_type}: {count}")


Total Entity Counts:
PERSON: 1327
ORG: 970
GPE: 3822
PRODUCT: 8
DATE: 1471


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also give your opinion on the time provided to complete the assignment.

In [None]:
# Write your response below
""" The assignment was very challenging, but I learned a lot. For the first question, I first tried IMDB, but after
many attempts I could only get the first 10 reviews because of the 'load more' option. I searched for many methods, but access to the data was often denied.
I then tried Densho, where the data is split across pages, and after a lot of hurdles I was able to get the narrators' data.
The second question went quickly. The third was very difficult for me, and I was not able to get the output until the last moment. """