
# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **any one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie from 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.
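
Option (2) allows falling back to additional movies when a single title has too few reviews. The sketch below shows one way to structure that loop. It assumes a hypothetical `fetch_reviews(movie_id)` callable that returns a list of review strings for one title; the canned dictionary stands in for a real scraper, and its IDs and texts are made up for illustration.

```python
import csv
import io


def collect_reviews(movie_ids, fetch_reviews, limit=1000):
    """Gather up to `limit` unique reviews, moving to the next movie when one runs out."""
    collected = []
    seen = set()
    for movie_id in movie_ids:
        for text in fetch_reviews(movie_id):
            text = text.strip()
            if not text or text in seen:
                continue  # skip blanks and exact duplicates
            seen.add(text)
            collected.append({"movie_id": movie_id, "review": text})
            if len(collected) >= limit:
                return collected
    return collected


def rows_to_csv(rows):
    """Serialise the collected rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["movie_id", "review"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical stand-in for a real scraper: canned reviews per (made-up) ID.
FAKE_DATA = {
    "movie_a": ["Great film.", "Great film.", "  "],
    "movie_b": ["Too long.", "Loved the score.", "Never shown."],
}

rows = collect_reviews(["movie_a", "movie_b"], FAKE_DATA.get, limit=3)
print(rows_to_csv(rows))
```

Keeping the movie ID next to each review makes it possible to tell afterwards which title a row came from when several movies were needed to reach the target.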


In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_imdb_movie_reviews(movie_id, num_reviews=1000):
    base_url = f"https://www.imdb.com/title/{movie_id}/reviews"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"}

    reviews = []
    page_num = 1
    while len(reviews) < num_reviews:
        url = f"{base_url}?sort=submissionDate&dir=desc&ratingFilter=0&page={page_num}"
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code != 200:
            break  # stop instead of parsing an error page
        soup = BeautifulSoup(response.content, 'html.parser')
        review_containers = soup.find_all('div', class_='review-container')
        if not review_containers:
            break
        for container in review_containers:
            text_div = container.find('div', class_='text')
            if text_div is None:
                continue  # skip containers without a review body
            reviews.append(text_div.text.strip())
            if len(reviews) >= num_reviews:
                break
        page_num += 1

    return reviews

def save_reviews_to_csv(reviews, filename):
    df = pd.DataFrame(reviews, columns=['Review'])
    df.to_csv(filename, index=False)

def main():
    # IMDB ID of the movie to collect reviews for (found in the movie's IMDB URL)
    movie_id = "tt0172495"  # Example movie: Gladiator (2000)
    num_reviews = 1000  # Number of reviews to collect
    filename = "movie_reviews.csv"

    reviews = scrape_imdb_movie_reviews(movie_id, num_reviews)
    save_reviews_to_csv(reviews, filename)
    print(f"{len(reviews)} reviews saved to {filename}.")

if __name__ == "__main__":
    main()


1000 reviews saved to movie_reviews.csv.


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
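
The order of these steps affects the result. For example, stripping punctuation before removing stopwords turns "don't" into "dont", which no longer matches a stopword list that stores the contraction with its apostrophe. Below is a small sketch of steps (1) to (4) using only the standard library. It assumes a tiny hypothetical stopword set in place of a full list, and it leaves out stemming and lemmatization because they need an external library.

```python
import re

# Hypothetical mini stopword list for illustration; a real list is much longer.
STOPWORDS = {"i", "the", "a", "it", "in", "is", "was", "don't"}


def basic_clean(text):
    """Lowercase, drop stopwords, then strip digits and punctuation."""
    tokens = text.lower().split()
    # Trim punctuation at token edges so "it," still matches "it",
    # while the apostrophe inside "don't" is preserved for the lookup.
    tokens = [t.strip(".,!?;:\"()") for t in tokens]
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    text = " ".join(tokens)
    text = re.sub(r"\d+", " ", text)          # (2) remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)     # (1) remove remaining special characters
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace


print(basic_clean("I don't think it was released in 2000, the film is a 10/10!"))
```

With this ordering the contraction is removed as a stopword instead of surviving as a stray token such as "dont".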

In [None]:
# Write code for each of the sub parts with proper comments.
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

def remove_noise(text):
    # Remove special characters and punctuations
    text = ''.join([char for char in text if char not in string.punctuation])
    return text

def remove_numbers(text):
    # Remove numbers
    text = ''.join([char for char in text if not char.isdigit()])
    return text

def remove_stopwords(text):
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    filtered_text = ' '.join([word for word in words if word.lower() not in stop_words])
    return filtered_text

def lowercase(text):
    # Lowercase all texts
    return text.lower()

def stemming(text):
    # Stemming
    stemmer = PorterStemmer()
    words = word_tokenize(text)
    stemmed_text = ' '.join([stemmer.stem(word) for word in words])
    return stemmed_text

def lemmatization(text):
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    words = word_tokenize(text)
    lemmatized_text = ' '.join([lemmatizer.lemmatize(word) for word in words])
    return lemmatized_text

def clean_text(text):
    text = remove_noise(text)
    text = remove_numbers(text)
    text = remove_stopwords(text)
    text = lowercase(text)
    text = stemming(text)
    text = lemmatization(text)
    return text

def clean_and_save_reviews(filename):
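    df = pd.read_csv(filename)
    # Treat missing reviews as empty strings so clean_text always receives a str
    df['Cleaned_Review'] = df['Review'].fillna('').astype(str).apply(clean_text)
    # Split on the last dot only, so paths containing other dots keep their full stem
    cleaned_filename = filename.rsplit('.', 1)[0] + '_cleaned.csv'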
    df.to_csv(cleaned_filename, index=False)
    print(f"Cleaned data saved to {cleaned_filename}.")

if __name__ == "__main__":
    filename = "movie_reviews.csv"  # Change this to your CSV filename
    clean_and_save_reviews(filename)



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Cleaned data saved to movie_reviews_cleaned.csv.


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
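
Part (1) asks for four totals, not a full tag histogram. The sketch below shows that aggregation, assuming the tagger emits Universal POS labels such as `NOUN`, `PROPN`, `VERB`, `ADJ`, and `ADV`. Counting `PROPN` as a noun and leaving `AUX` out of the verb total are choices made here for illustration, not requirements of the assignment. The tagged pairs are hand-written sample data, not real tagger output.

```python
from collections import Counter

# Assumed grouping of Universal POS tags into the four requested classes.
TAG_GROUPS = {
    "NOUN": "Noun",
    "PROPN": "Noun",
    "VERB": "Verb",
    "ADJ": "Adjective",
    "ADV": "Adverb",
}


def summarize_pos(tagged_tokens):
    """Return totals for Noun, Verb, Adjective, Adverb from (word, tag) pairs."""
    totals = Counter({name: 0 for name in set(TAG_GROUPS.values())})
    for _, tag in tagged_tokens:
        group = TAG_GROUPS.get(tag)
        if group is not None:
            totals[group] += 1
    return dict(totals)


sample = [
    ("film", "NOUN"), ("look", "VERB"), ("great", "ADJ"),
    ("never", "ADV"), ("scott", "PROPN"), ("one", "NUM"),
]
pos_totals = summarize_pos(sample)
print(pos_totals)
```

Tags outside the four groups, such as `NUM`, are ignored, and a class with no matching tokens is still reported with a total of zero.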

In [None]:
import spacy
from spacy import displacy
import pandas as pd
from collections import Counter

# Load the English language model
nlp = spacy.load("en_core_web_sm")

def pos_tagging_and_count(text):
    # Tag Parts of Speech of each word in the text and calculate the count of each POS
    doc = nlp(text)
    pos_tags = [token.pos_ for token in doc]
    pos_counts = Counter(pos_tags)
    return pos_counts

def constituency_and_dependency_parsing(text):
    # Print out the constituency parsing trees and dependency parsing trees of all the sentences
    doc = nlp(text)
    for sentence in doc.sents:
        print(f"Sentence: {sentence.text}")
        print("Parts of Speech:")
        for token in sentence:
            print(f"{token.text}/{token.pos_}", end=" ")
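        # NOTE: spaCy's default pipeline has no constituency parser; noun chunks are
        # flat noun phrases only, printed here as a rough stand-in for a full tree.
        print("\nConstituency Parsing Tree:")
        for chunk in sentence.noun_chunks:
            print(f"{chunk.text}/{chunk.root.dep_}", end=" ")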
        print("\nDependency Parsing Tree:")
        displacy.render(sentence, style="dep", jupyter=True, options={"distance": 120})

def named_entity_recognition(text):
    # Extract named entities and count each (entity text, entity label) pair
    doc = nlp(text)
    entities = Counter([(ent.text, ent.label_) for ent in doc.ents])
    return entities

def main():
    filename = "movie_reviews_cleaned.csv"  # Change this to your cleaned CSV filename
    df = pd.read_csv(filename)
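    # Rows whose cleaned text is empty are read back as NaN, so drop them before joining
    text = ' '.join(df['Cleaned_Review'].dropna().astype(str))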

    # Parts of Speech Tagging and Count
    pos_counts = pos_tagging_and_count(text)
    print("Parts of Speech (POS) Tagging:")
    print(pos_counts)

    # Constituency and Dependency Parsing
    print("\nConstituency and Dependency Parsing:")
    constituency_and_dependency_parsing(text)

    # Named Entity Recognition
    entities = named_entity_recognition(text)
    print("\nNamed Entity Recognition (NER):")
    print(entities)

if __name__ == "__main__":
    main()


Sentence: one man serv roman armi movi show gladiat live struggl surviv fight fear anger lot mighti arena rome coliseum also seen movi show mighti ancient rome recommend movi everyon like histori action adventur best movi time let prefac mention someth ridley scott fanboy hold probabl higher pantheon modern filmmak dont mind admit howev one film seem step direct audienc feel neg toward posit probabl one film he prais gladiat ive sever view past coupl decad begin sort im disengag larg swath film think write downth stori spanish gener maximu russel crow built well ill say script john logan david franzoni william nicholson structur point need tell stori reveng there earli glori emperor marcu aureliu richard harri unjust fall hand commodu joaquin phoenix rebirth gladiatori game provinc ownership proximo oliv reed revel back forth toward reveng end yet cant connect ive lay blame film tone recent view refin criticismfirstli tone remark dour bare happi moment throughout exceedingli rare leav 

Sentence: she terribl feminin hard cold even begin given one moment smile laugh maximu expens otherwis much respit form miseri around even maximu she never potenti alli dark danger gameth time film lighten tone action sequenc gener pretti great scott qualiti action filmmak understand point spectacl achiev battl germania fight africa gladiatori bout coliseum expertli stage film film begin strong recreat roman battl tactic help han zimmer lisa gerrard rous score get lost dour tone machin roman senateand lead second main critic polit there point later commodu quizz lucilla peopl rome like distant victori repli glori rome idea definit larger plot film return glori rome preimperi state republ without emperor
Parts of Speech:
she/PRON terribl/VERB feminin/ADJ hard/ADJ cold/NOUN even/ADV begin/VERB given/VERB one/NUM moment/NOUN smile/NOUN laugh/NOUN maximu/NOUN expens/NOUN otherwis/ADP much/ADJ respit/NOUN form/NOUN miseri/ADJ around/ADP even/ADV maximu/VERB she/PRON never/ADV potenti/VERB a

Sentence: okay fine never make case preimperi state better way current state never dig maximu care beyond fact die wish marcu aureliu would real one want doesnt matter movieand lead think core issu film maximu strong reason want plot actual want reveng commodu enough there effort wrap effort senat led gracchu derek jacobi commodu return power senat pervers incent commodu point suppos ignor reason maximu care marcu aureliu told care essenti real disconnect film present charact interact plot film never actual reconcilesand yet think
Parts of Speech:
okay/ADJ fine/NOUN never/ADV make/VERB case/NOUN preimperi/ADJ state/NOUN better/ADJ way/NOUN current/ADJ state/NOUN never/ADV dig/VERB maximu/NOUN care/NOUN beyond/ADP fact/NOUN die/NOUN wish/VERB marcu/NOUN aureliu/NOUN would/AUX real/VERB one/NUM want/NOUN does/AUX nt/PART matter/VERB movieand/NOUN lead/NOUN think/VERB core/PROPN issu/PROPN film/PROPN maximu/PROPN strong/ADJ reason/NOUN want/VERB plot/NOUN actual/ADJ want/VERB reveng/PROPN

Sentence: film pretti
Parts of Speech:
film/NOUN pretti/PROPN 
Constituency Parsing Tree:
film/nsubj 
Dependency Parsing Tree:


Sentence: okay
Parts of Speech:
okay/INTJ 
Constituency Parsing Tree:

Dependency Parsing Tree:


Sentence: mostli spectacl front give film full mark look great begin end there painterli approach visual help small part scott penchant use smoke machin everywher ever make film im use smoke machin ridley scott fault action scene topnotch perform everyon commit effect
Parts of Speech:
mostli/PROPN spectacl/PROPN front/NOUN give/VERB film/NOUN full/ADJ mark/NOUN look/VERB great/ADJ begin/NOUN end/VERB there/ADV painterli/ADJ approach/NOUN visual/ADJ help/NOUN small/ADJ part/NOUN scott/PROPN penchant/PROPN use/NOUN smoke/NOUN machin/NOUN everywher/NOUN ever/ADV make/VERB film/NOUN i/PRON m/AUX use/VERB smoke/NOUN machin/PROPN ridley/NOUN scott/PROPN fault/PROPN action/NOUN scene/NOUN topnotch/NOUN perform/VERB everyon/NOUN commit/VERB effect/NOUN 
Constituency Parsing Tree:
mostli/nsubj spectacl front/appos film/dative full mark/nsubj painterli approach visual help/dative small part/dobj scott penchant use smoke/appos machin everywher/nsubj film/dobj i/nsubj smoke machin ridley scott fau

Sentence: special mention phoenix goe commodu demonstr descent mad paranoia welli engag stori much weird combin superseri roman polit assum presumpt like shout democraci crowd expect cheer odd disconnect main charact plot that unfold around know film legion fan will go arena defend film honor anyon would besmirch besmirch done weird stori man dont think realli work wellmad fun spot look great that far noth movi masterclass act cast direct write perform russel crow joaquin phoenix set bar extrem high isnt singl dull moment throughout movi runtim plot well written proper backstori defin protagonist veng motiv despit long run time movi never feel elong take perfect amount time convey stori violent scene movi scari remind brutal use exist recent past fight movi feel realist although protagonist alway shown slight edg howev thoroughli justifi get see past well movi deserv five oscar receiv sequel come shall go high bar term audienc expect gladiat epic masterpiec captiv start finish set back

Sentence: wheatscott make keep pace way want scene action fight blood everywher follow slow sad thought momentsaaaaand got mani great quotesw need movi like first one best movi open histori cinema start truli epic unleash sheer visual brillianc let hour russel crow perfect cast he eighti buff action hero know hold fair fight command respect everi minut
Parts of Speech:
wheatscott/NOUN make/VERB keep/VERB pace/NOUN way/NOUN want/VERB scene/NOUN action/NOUN fight/VERB blood/NOUN everywher/NOUN follow/VERB slow/ADJ sad/ADJ thought/NOUN momentsaaaaand/PROPN got/VERB mani/PROPN great/ADJ quotesw/NOUN need/VERB movi/NOUN like/ADP first/ADJ one/NUM best/ADJ movi/NOUN open/ADJ histori/NOUN cinema/NOUN start/PROPN truli/PROPN epic/PROPN unleash/PROPN sheer/ADJ visual/ADJ brillianc/NOUN let/VERB hour/NOUN russel/NOUN crow/NOUN perfect/ADJ cast/NOUN he/PRON eighti/VERB buff/PROPN action/NOUN hero/NOUN know/VERB hold/VERB fair/ADJ fight/NOUN command/NOUN respect/NOUN everi/PROPN minut/NOUN 
Consti

Sentence: he screen even enemi support cast oliv reed richard harri conni nelson djimon hounsou joaquin phoenix hold add stori give best line ever said film rare film everyth cast costum fit perfectlywow say ive seen film number time rare beast dont make anymor never u want watch film roman empir set gladiat arena ancient colosseum noth like good hundr year cinema better spartacu ben hur well salut u truli entertain gladiat histor epic direct ridley scott showcas power perform russel crow gener maximu decimu meridiu film also featur strong support role joaquin phoenix cun corrupt emperor commodu conni nielsen dignifi lucillaset ancient rome movi follow maximu seek vengeanc treacher commodu betray forc slaveri grip storylin breathtak action sequenc intens charact dynam make gladiat timeless classic film stun visual evoc score elev impact immers audienc grandeur brutal ancient romei highli recommend gladiat compel narr except perform emot depth remain cinemat triumph continu captiv audie

Sentence: visual becom sensori feast bittersweet end linger transcend mere sword sandal tell power tale resili wit crow epic roar betray gener turn gladiat vengeanc simmer unforgiv colosseum rome underneath honor glow experi epic battl mesmer score visual transport ancient rome cinemat journey histor spectacl profound narr strength honor reson long final scene super good movi go see best act everyth good senc work amaz stori good love amaz area movi domain havent seen yet see stun plea go watch worth wont disappoint trust good tri spoil anyth go watch amaz rullel cruel phinx super good movi seen agre movi movi good sorta hard explain go watch see mean sweep saga ancient rome reveng redempt cinemat triumph boast breathtak action sequenc unforgett perform timeless soundtrack rank among greatest cinema historyset turbul world roman empir gladiat tell stori maximu decimu meridiu rever roman gener betray murder powerhungri emperor commodu maximu left dead enslav vow aveng famili emperora ma
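
The counter returned by `named_entity_recognition` is keyed by `(text, label)` pairs. To report how many entities of each type were found, those counts can be folded by label. The sketch below uses hand-written sample pairs, not values from the run above; the label names follow the `PERSON`/`GPE`/`TIME` style used by spaCy's English models.

```python
from collections import Counter


def counts_by_label(entity_counts):
    """Fold a Counter keyed by (text, label) into per-label totals."""
    totals = Counter()
    for (_, label), count in entity_counts.items():
        totals[label] += count
    return totals


# Hand-written sample in the same shape as the (text, label) counter.
sample_entities = Counter({
    ("ridley scott", "PERSON"): 3,
    ("joaquin phoenix", "PERSON"): 2,
    ("rome", "GPE"): 4,
    ("hour", "TIME"): 1,
})
label_totals = counts_by_label(sample_entities)
print(label_totals.most_common())
```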

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
# Write your response below
