<a href="https://colab.research.google.com/github/Akarsh-20/Akarsh_INFO5731_UNT/blob/main/Doddi_Akarsh_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data to a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10,000 user reviews of a film released in 2022 or 2023 (you can choose any film) from IMDb.

(3) Collect all the reviews of the 1,000 most popular software products from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/).

(4) Collect the abstracts of the top 10,000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10,000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


In [1]:
# Write your code here
import requests
from bs4 import BeautifulSoup
import csv

def scrape_imdb_reviews(movie_id, num_reviews=10000):
    base_url = f'https://www.imdb.com/title/{movie_id}/reviews?ref_=tt_ql_3'

    reviews = []
    page_number = 1

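    seen = set()  # review texts collected so far, used to detect repeated pages

    while len(reviews) < num_reviews:
        # Whether the site honors the `start` parameter is not verified here
        url = f"{base_url}&start={page_number}"
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors
        soup = BeautifulSoup(response.text, 'html.parser')
        review_elements = soup.find_all('div', class_='text show-more__control')

        page_reviews = [review.text.strip() for review in review_elements]
        new_reviews = [review for review in page_reviews if review not in seen]

        # Stop when a page yields nothing new: either the last page was
        # reached or the server returned the same page again
        if not new_reviews:
            break

        seen.update(new_reviews)
        reviews.extend(new_reviews)
        page_number += 1

    return reviews[:num_reviews]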

def save_reviews_to_csv(data, filename='imdb_reviews.csv'):
    with open(filename, 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['Review'])
        writer.writerows([[review] for review in data])

# Here you can change the movie ID
movie_id = 'tt15398776'
num_reviews_to_scrape = 10000

movie_reviews = scrape_imdb_reviews(movie_id, num_reviews_to_scrape)
save_reviews_to_csv(movie_reviews)




# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the cleaned data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [2]:
# Write your code here
# importing required libraries
import requests
from bs4 import BeautifulSoup
import csv
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Build these once; constructing them on every call is wasteful
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()
LEMMATIZER = WordNetLemmatizer()

def clean_text(text):
    # (1) Remove noise: keep only letters, digits, and whitespace
    text = ''.join(char for char in text if char.isalnum() or char.isspace())

    # (2) Remove numbers
    text = ''.join(char for char in text if not char.isdigit())

    # (4) Lowercase all text, then tokenize once
    tokens = nltk.word_tokenize(text.lower())

    # (3) Remove stopwords (tokens are already lowercase, so matching is case-insensitive)
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # (5) Stemming
    tokens = [STEMMER.stem(word) for word in tokens]

    # (6) Lemmatization (has little effect on tokens that are already stemmed)
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]

    return ' '.join(tokens)

def scrape_imdb_reviews(movie_id, num_reviews=10000):
    base_url = f'https://www.imdb.com/title/{movie_id}/reviews?ref_=tt_ql_3'

    reviews = []
    page_number = 1

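    seen = set()  # review texts collected so far, used to detect repeated pages

    while len(reviews) < num_reviews:
        url = f"{base_url}&start={page_number}"
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors
        soup = BeautifulSoup(response.text, 'html.parser')
        review_elements = soup.find_all('div', class_='text show-more__control')

        # Keep the raw text here; cleaning happens once, in save_reviews_to_csv,
        # so the 'Original Review' column holds the unmodified review
        page_reviews = [review.text.strip() for review in review_elements]
        new_reviews = [review for review in page_reviews if review not in seen]

        # Stop when a page yields nothing new
        if not new_reviews:
            break

        seen.update(new_reviews)
        reviews.extend(new_reviews)
        page_number += 1

    return reviews[:num_reviews]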

def save_reviews_to_csv(data, filename='imdb_reviews_cleaned.csv'):
    with open(filename, 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['Original Review', 'Cleaned Review'])
        for original_review, cleaned_review in zip(data, [clean_text(review) for review in data]):
            writer.writerow([original_review, cleaned_review])

# IMDb title ID of the film to scrape; change it to collect reviews for another film
movie_reviews = scrape_imdb_reviews('tt15398776', num_reviews=10000)
save_reviews_to_csv(movie_reviews)





[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
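
The task asks for the cleaned text to be stored as a new column in the csv file. A minimal sketch of that step, using only the standard library, might look like the following. The names `add_cleaned_column` and `simple_clean` are hypothetical and introduced here for illustration; `simple_clean` is a stand-in that covers only noise removal, number removal, and lowercasing, and it assumes the input file has a `Review` column.

```python
import csv
import re

def simple_clean(text):
    """Illustrative stand-in for clean_text: drop non-letters, lowercase."""
    # Replace anything that is not an ASCII letter or whitespace with a space
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Lowercase and collapse runs of whitespace
    return " ".join(text.lower().split())

def add_cleaned_column(in_path, out_path, source_col="Review",
                       target_col="Cleaned Review", cleaner=simple_clean):
    """Copy a CSV, appending a column with the cleaned version of source_col."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + [target_col])
        writer.writeheader()
        for row in reader:
            row[target_col] = cleaner(row[source_col])
            writer.writerow(row)
```

Passing `cleaner=clean_text` and the `imdb_reviews.csv` file from Question 1 would reuse the reviews already collected instead of scraping them a second time.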


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) Constituency Parsing and Dependency Parsing: Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities, such as person names, organizations, locations, product names, and dates, from the clean texts, and calculate the count of each entity.

In [3]:
# Write your code here
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('punkt')
nltk.download('words')
import requests
from bs4 import BeautifulSoup
import csv
from nltk import pos_tag, ne_chunk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tree import Tree
from nltk.chunk import tree2conlltags
from collections import Counter

def pos_tagging(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    pos_counts = Counter(tag for word, tag in pos_tags)
    return pos_tags, pos_counts

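def constituency_parsing(text):
    # NOTE: ne_chunk builds a shallow chunk tree (named-entity chunks over
    # POS-tagged tokens). It is only an approximation: a full constituency
    # parse requires a dedicated parser.
    sentences = sent_tokenize(text)
    for sentence in sentences:
        words = word_tokenize(sentence)
        tagged = pos_tag(words)
        parsing_tree = ne_chunk(tagged)
        print("Constituency Parsing Tree (shallow chunk tree):")
        print(parsing_tree)

def dependency_parsing(text):
    # NOTE: tree2conlltags returns (word, POS tag, IOB chunk tag) triples.
    # These are chunk tags, not head-dependent relations: a true dependency
    # parse requires a dedicated parser.
    sentences = sent_tokenize(text)
    for sentence in sentences:
        words = word_tokenize(sentence)
        tagged = pos_tag(words)
        parsing_tree = ne_chunk(tagged)
        conll_tags = tree2conlltags(parsing_tree)
        print("Dependency Parsing Tree (CoNLL chunk tags):")
        for tag in conll_tags:
            print(tag)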

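def named_entity_recognition(text):
    sentences = sent_tokenize(text)
    entities = []
    for sentence in sentences:
        words = word_tokenize(sentence)
        tagged = pos_tag(words)
        # Without binary=True, ne_chunk labels each entity with its type
        # (e.g. PERSON, ORGANIZATION, GPE) instead of a generic NE tag
        parsing_tree = ne_chunk(tagged)
        for subtree in parsing_tree:
            # Entities are subtrees; ordinary tokens are plain (word, tag) tuples
            if isinstance(subtree, Tree):
                entity_text = ' '.join(word for word, tag in subtree.leaves())
                entities.append((entity_text, subtree.label()))
    # Count each (entity text, entity type) pair
    entity_counts = Counter(entities)
    return entity_counts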

# Load cleaned reviews from the CSV file
cleaned_reviews = []
with open('imdb_reviews_cleaned.csv', 'r', encoding='utf-8') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # Skip header
    for row in reader:
        cleaned_reviews.append(row[1])

# Performing analyses on a sample review
sample_review = cleaned_reviews[0]

# (1) Parts of Speech (POS) Tagging
pos_tags, pos_counts = pos_tagging(sample_review)
print("\nParts of Speech (POS) Tagging:")
print(pos_tags)
print("POS Counts:", pos_counts)

# (2) Constituency Parsing and Dependency Parsing
constituency_parsing(sample_review)
dependency_parsing(sample_review)

# (3) Named Entity Recognition
entity_counts = named_entity_recognition(sample_review)
print("\nNamed Entity Recognition:")
print(entity_counts)





[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.



Parts of Speech (POS) Tagging:
[('youll', 'NN'), ('wit', 'NN'), ('brain', 'NN'), ('fulli', 'JJ'), ('switch', 'NN'), ('watch', 'NN'), ('oppenheim', 'NN'), ('could', 'MD'), ('easili', 'VB'), ('get', 'VB'), ('away', 'RB'), ('nonatt', 'JJ'), ('viewer', 'NN'), ('intellig', 'NN'), ('filmmak', 'NN'), ('show', 'NN'), ('audienc', 'VBZ'), ('great', 'JJ'), ('respect', 'JJ'), ('fire', 'NN'), ('dialogu', 'NN'), ('pack', 'NN'), ('inform', 'NN'), ('relentless', 'NN'), ('pace', 'NN'), ('jump', 'NN'), ('differ', 'NN'), ('time', 'NN'), ('oppenheim', 'JJ'), ('life', 'NN'), ('continu', 'VB'), ('hour', 'NN'), ('runtim', 'NN'), ('visual', 'JJ'), ('clue', 'NN'), ('guid', 'NN'), ('viewer', 'NN'), ('time', 'NN'), ('youll', 'JJ'), ('get', 'NN'), ('grip', 'NN'), ('quit', 'NN'), ('quickli', 'NN'), ('relentless', 'NN'), ('help', 'NN'), ('express', 'VB'), ('urgenc', 'JJ'), ('u', 'JJ'), ('attack', 'NN'), ('chase', 'NN'), ('atom', 'NN'), ('bomb', 'NN'), ('germani', 'NN'), ('could', 'MD'), ('absolut', 'VB'), ('career
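
The task asks for totals of nouns, verbs, adjectives, and adverbs, whereas the counts above are per fine-grained tag. One way to aggregate them, sketched below, is to group Penn Treebank tags by their two-letter prefix. The helper name `coarse_pos_counts` and the mapping `COARSE_PREFIXES` are hypothetical names introduced for this illustration; modal verbs (`MD`) are deliberately left out of the verb total, which is an assumption about how the categories should be drawn.

```python
from collections import Counter

# Penn Treebank tag prefixes mapped to the four coarse categories in the task:
# NN* (nouns), VB* (verbs), JJ* (adjectives), RB* (adverbs)
COARSE_PREFIXES = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}

def coarse_pos_counts(tagged_tokens):
    """Count nouns, verbs, adjectives, and adverbs in (word, tag) pairs."""
    counts = Counter({name: 0 for name in COARSE_PREFIXES.values()})
    for _word, tag in tagged_tokens:
        category = COARSE_PREFIXES.get(tag[:2])
        if category is not None:
            counts[category] += 1
    return counts

# First few pairs copied from the tagger output above
sample = [("youll", "NN"), ("wit", "NN"), ("brain", "NN"), ("fulli", "JJ"),
          ("switch", "NN"), ("watch", "NN"), ("oppenheim", "NN"),
          ("could", "MD"), ("easili", "VB"), ("get", "VB"), ("away", "RB")]

for category, total in coarse_pos_counts(sample).items():
    print(category, total)
```

Calling `coarse_pos_counts(pos_tags)` on the full tagger output would give the four totals for the whole sample review.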

**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

Constituency parsing and dependency parsing are two methods used in natural language processing to extract syntactic information from sentences.

**Constituency Parsing**:

Based on the formalism of context-free grammars

Divides the sentence into constituents: sub-phrases that each belong to a specific grammatical category, such as a noun phrase (NP) or a verb phrase (VP)

The parse tree has the words as its leaves and the phrase categories as its internal nodes

Focuses on the hierarchical structure of the sentence

Can be used in word processing systems for grammar checking
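
To make the idea of constituents concrete, here is a small sketch that stores a constituency tree as nested tuples and reads phrases off it. The sentence "The film respects the audience" and its tree are hand-written for illustration and were not produced by a parser or taken from the reviews; the helper names `leaves`, `bracketed`, and `phrases` are likewise hypothetical.

```python
# A hand-written constituency tree: (label, child, child, ...); leaves are strings
TREE = ("S",
        ("NP", ("DT", "The"), ("NN", "film")),
        ("VP", ("VBZ", "respects"),
               ("NP", ("DT", "the"), ("NN", "audience"))))

def leaves(node):
    """Return the words covered by a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def bracketed(node):
    """Render the tree in the usual bracketed notation."""
    if isinstance(node, str):
        return node
    return "(" + node[0] + " " + " ".join(bracketed(c) for c in node[1:]) + ")"

def phrases(node, label):
    """Collect the word spans of every constituent with the given label."""
    if isinstance(node, str):
        return []
    found = [" ".join(leaves(node))] if node[0] == label else []
    for child in node[1:]:
        found.extend(phrases(child, label))
    return found

print(bracketed(TREE))
# (S (NP (DT The) (NN film)) (VP (VBZ respects) (NP (DT the) (NN audience))))
print(phrases(TREE, "NP"))
# ['The film', 'the audience']
```

The sentence (S) splits into a noun phrase and a verb phrase, and the verb phrase in turn contains a second noun phrase: this nesting is the hierarchical structure that constituency parsing captures.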

**Dependency Parsing**:

Based on dependencies between words in a sentence

The parse tree connects each word to its head with a labeled relation, such as subject or object

Focuses on the relations between individual words rather than on phrase structure

Can be more useful for downstream tasks such as information extraction or question answering

Can be used to extract subject-verb-object triples, which often indicate semantic relations between a predicate and its arguments
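
As a rough illustration of the subject-verb-object idea, the sketch below stores a dependency parse as a list of (index, word, head index, relation) rows and pulls out the triples. The parse of "The film respects the audience" is hand-annotated with Universal Dependencies style labels (`nsubj`, `obj`, `det`, `root`), not produced by a parser, and `svo_triples` is a hypothetical helper name.

```python
# Hand-annotated dependency parse: (index, word, head index, relation).
# A head index of 0 marks the root of the sentence.
PARSE = [
    (1, "The", 2, "det"),
    (2, "film", 3, "nsubj"),
    (3, "respects", 0, "root"),
    (4, "the", 5, "det"),
    (5, "audience", 3, "obj"),
]

def svo_triples(parse):
    """Return (subject, verb, object) triples found in one parsed sentence."""
    words = {idx: word for idx, word, _head, _rel in parse}
    subjects, objects = {}, {}
    for _idx, word, head, rel in parse:
        # Group subjects and objects by the index of the word they attach to
        if rel == "nsubj":
            subjects.setdefault(head, []).append(word)
        elif rel == "obj":
            objects.setdefault(head, []).append(word)
    triples = []
    for verb_idx, subject_words in subjects.items():
        for subject in subject_words:
            for obj in objects.get(verb_idx, []):
                triples.append((subject, words[verb_idx], obj))
    return triples

print(svo_triples(PARSE))
# [('film', 'respects', 'audience')]
```

Each row points directly from a word to its head, so there are no phrase nodes: "film" and "audience" both attach to "respects", which is what makes the triple easy to read off.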
