# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [16]:
# Your code here
import requests
from bs4 import BeautifulSoup
import csv

def scrape_reviews(url, num_pages, output_file):
    reviews_data = []

    for page_num in range(1, num_pages + 1):
        page_url = f"{url}&page={page_num}"

        response = requests.get(page_url)
        if response.status_code != 200:
            print(f"Failed to retrieve page {page_num}. Exiting.")
            break

        soup = BeautifulSoup(response.text, 'html.parser')
        review_containers = soup.find_all('div', class_='lister-item-content')

        for container in review_containers:
            review_text = container.find('div', class_='text').get_text()
            username = container.find('span', class_='display-name-link').get_text()
            review_date = container.find('span', class_='review-date').get_text()

            reviews_data.append([username, review_date, review_text])

    if reviews_data:
        with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
            csv_writer = csv.writer(csvfile)
            csv_writer.writerow(['Username', 'Review Date', 'Review Text'])
            csv_writer.writerows(reviews_data)

        print(f"{len(reviews_data)} reviews have been successfully scraped and saved to '{output_file}'.")
    else:
        print("No reviews found on the pages.")

if __name__ == "__main__":
    movie_url = 'https://www.imdb.com/title/tt9362722/reviews?ref_=tt_urv'
    num_pages_to_scrape = 100
    output_csv_file = 'Spiderman_reviews.csv'

    scrape_reviews(movie_url, num_pages_to_scrape, output_csv_file)





2500 reviews have been successfully scraped and saved to 'Spiderman_reviews.csv'.


In [17]:
import pandas as pd
pd.read_csv('Spiderman_reviews.csv')

Unnamed: 0,Username,Review Date,Review Text
0,MiroslavKyuranov,31 May 2023,"It's honestly absurd how good the ""Spider-Vers..."
1,UniqueParticle,2 June 2023,"The animation, flow of everything, genius char..."
2,pugpool10,2 June 2023,If it wasn't already obvious in the first film...
3,rickothan,2 June 2023,This film is a visual concert. The animation a...
4,cricketbat,1 June 2023,Spider-Man: Into the Spider-Verse is probably ...
...,...,...,...
2495,pressboard,15 June 2023,I generally liked the animation but it constan...
2496,srdjankostic91,31 May 2023,"I rarely write reviews, but this movie is trul..."
2497,georgetbarber,2 June 2023,"Adjusted rating - 8.5/10: a genuine spectacle,..."
2498,matris1,3 June 2023,The good: This is a gorgeous and competently m...


# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [18]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

#1
df=pd.read_csv('Spiderman_reviews.csv')

import re

df['Reviews after Noise Removal'] = df['Review Text'].str.replace('[^\w\s]','')
df['Reviews after Noise Removal'] = df['Reviews after Noise Removal'].apply(lambda x: ''.join(re.sub(r"[^a-zA-Z0-9]+", ' ', charctr) for charctr in x ))
#2

df['After digits removal'] = df['Reviews after Noise Removal'].apply(lambda y: ''.join([i for i in y if not i.isdigit()]))
#3
from nltk.corpus import stopwords
s = stopwords.words('english')
df['Stopwords Removal'] = df['After digits removal'].apply(lambda x: " ".join(x for x in x.split() if x not in s))
#4
df['Lower Case'] = df['Stopwords Removal'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df
#5
from nltk.stem import PorterStemmer
s = PorterStemmer()
df['After Stemming'] = df['Lower Case'].apply(lambda x: " ".join([s.stem(word) for word in x]))
#6
from textblob import Word
import nltk
nltk.download('wordnet')

df['After Lemmatization'] = df['After Stemming'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df.to_csv('Spiderman_reviews.csv', index=False)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
  df['Reviews after Noise Removal'] = df['Review Text'].str.replace('[^\w\s]','')
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [19]:
import spacy
from spacy import displacy
from nltk import pos_tag, word_tokenize, RegexpParser
import nltk
import pandas as pd

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Read CSV file
df = pd.read_csv('Spiderman_reviews.csv')
print(df.head())

# Extract a sample sentence from a specific column
column_name = 'After Lemmatization'  # Change this to the column containing the clean text
sample_sentence = df[column_name].iloc[0]

# (1) Parts of Speech (POS) Tagging
def pos_tagging(text):
    doc = nlp(text)
    pos_counts = {'Noun': 0, 'Verb': 0, 'Adjective': 0, 'Adverb': 0}

    for token in doc:
        if token.pos_ == 'NOUN':
            pos_counts['Noun'] += 1
        elif token.pos_ == 'VERB':
            pos_counts['Verb'] += 1
        elif token.pos_ == 'ADJ':
            pos_counts['Adjective'] += 1
        elif token.pos_ == 'ADV':
            pos_counts['Adverb'] += 1

    return pos_counts

pos_result = pos_tagging(sample_sentence)
print("\n(1) Parts of Speech (POS) Tagging:")
print(pos_result)

# (2) Constituency Parsing
# Tokenize and process the sentence with NLTK
words = word_tokenize(sample_sentence)
tagged_words = pos_tag(words)

# Define a simple grammar for NP (Noun Phrase) and VP (Verb Phrase)
grammar = r"""
    NP: {<DT>?<JJ>*<NN>}    # Noun Phrase
    VP: {<VB.*><NP|PP>}     # Verb Phrase
    PP: {<IN><NP>}          # Prepositional Phrase
"""

# Using RegexpParser from NLTK to perform constituency parsing
parser = RegexpParser(grammar)
tree = parser.parse(tagged_words)

print("\nConstituency Parsing Tree:")
tree.pretty_print()

# (3) Dependency Parsing
doc = nlp(sample_sentence)

# Display Dependency Parsing Tree using spaCy
print("\nDependency Parsing Tree:")
displacy.render(doc, style="dep", options={'distance': 80})

# (4) Named Entity Recognition (NER)
def named_entity_recognition(text):
    doc = nlp(text)
    entities = {}

    for ent in doc.ents:
        entities[ent.label_] = entities.get(ent.label_, 0) + 1

    return entities

ner_result = named_entity_recognition(sample_sentence)
print("\n(4) Named Entity Recognition (NER):")
print(ner_result)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


           Username  Review Date  \
0  MiroslavKyuranov  31 May 2023   
1    UniqueParticle  2 June 2023   
2         pugpool10  2 June 2023   
3         rickothan  2 June 2023   
4        cricketbat  1 June 2023   

                                         Review Text  \
0  It's honestly absurd how good the "Spider-Vers...   
1  The animation, flow of everything, genius char...   
2  If it wasn't already obvious in the first film...   
3  This film is a visual concert. The animation a...   
4  Spider-Man: Into the Spider-Verse is probably ...   

                         Reviews after Noise Removal  \
0  Its honestly absurd how good the SpiderVerse m...   
1  The animation flow of everything genius charac...   
2  If it wasnt already obvious in the first film ...   
3  This film is a visual concert The animation an...   
4  SpiderMan Into the SpiderVerse is probably my ...   

                                After digits removal  \
0  Its honestly absurd how good the SpiderVerse m... 


(4) Named Entity Recognition (NER):
{'ORG': 18, 'EVENT': 2}



Constituency parsing analyzes sentence structure by grouping words into constituents like phrases or clauses. In "The animation, flow of everything, genius character development, and action were all electrifying," it identifies these elements. Dependency parsing illustrates word relationships, revealing how each contributes to the sentence's meaning and structure.

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
'''The assignment is bit challenging. The webscrapping and cleaning the text is like practise session, which is quite easy. The most difficulty i found is in parsing. It took lot of time to write.
