
# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a film released in 2022 or 2023 (you can choose any film) from IMDb.

(3) Collect all the reviews of the top 1000 most popular software products from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/).

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


In [1]:
# Write your code here


import requests
from bs4 import BeautifulSoup
import csv

# Define the URL of the IMDb page with user reviews
url = "https://www.imdb.com/title/tt6791350/reviews/?ref_=tt_ql_2"

# Create a CSV file to store the data
csv_filename = "imdb_reviews.csv"
csv_file = open(csv_filename, 'w', encoding='utf-8', newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["Title", "User", "Date", "Rating", "Review"])

# Function to scrape one page of reviews and append them to the CSV
def scrape_reviews(url, csv_writer):
    def text_of(parent, tag, css_class):
        # Return the stripped text of the first match, or "" if the element is missing
        node = parent.find(tag, class_=css_class)
        return node.text.strip() if node else ""

    # A timeout keeps a stalled request from hanging the whole loop
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        reviews = soup.find_all("div", class_="lister-item-content")

        for review in reviews:
            # A review may lack a field such as the rating, in which case find() returns None
            title = text_of(review, "a", "title")
            user = text_of(review, "span", "display-name-link")
            date = text_of(review, "span", "review-date")
            rating = text_of(review, "span", "rating-other-user-rating")
            review_text = text_of(review, "div", "text").replace("\n", " ")
            csv_writer.writerow([title, user, date, rating, review_text])

# Scrape multiple pages of reviews (adjust the range accordingly).
# This assumes the page honors a `start` offset; starting at 0 includes the first page.
for page_num in range(0, 500):
    page_url = f"{url}&start={page_num * 10}"
    scrape_reviews(page_url, csv_writer)

csv_file.close()
print("Reviews have been scraped and saved to", csv_filename)

Reviews have been scraped and saved to imdb_reviews.csv
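
The scraping cell above opens the file with `open` and closes it only at the end, so an exception in the middle of the run would leave it unclosed. A minimal sketch of the same CSV-writing step inside a `with` block follows. The helper `write_reviews` and the sample rows are hypothetical, and `io.StringIO` stands in for the real file so the example runs without network access.

```python
import csv
import io

def write_reviews(rows, handle):
    """Write a header plus review rows to an open text handle; return the row count."""
    writer = csv.writer(handle)
    writer.writerow(["Title", "User", "Date", "Rating", "Review"])
    writer.writerows(rows)
    return len(rows)

# Hypothetical sample rows standing in for scraped reviews.
sample_rows = [
    ["Great film", "user_a", "1 January 2023", "9/10", "Loved it, start to finish."],
    ["Too long", "user_b", "2 January 2023", "5/10", 'A "classic"? Not quite.'],
]

# With a real file, the same pattern would use
# open("imdb_reviews.csv", "w", encoding="utf-8", newline="") in the with-statement.
with io.StringIO(newline="") as buffer:
    written = write_reviews(sample_rows, buffer)
    csv_text = buffer.getvalue()

print(written)
print(csv_text)
```

The `with` block closes the handle even if writing fails, and `csv.writer` takes care of quoting commas and double quotes inside the review text.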


# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the data in a new column in the CSV file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [3]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\soumya nanditha\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [5]:
# Write your code here

import csv
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download NLTK data (if not already downloaded)
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize the stopwords, stemmer, and lemmatizer
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Define a function to clean the text
def clean_text(text):
    # Replace special characters, punctuation, and numbers with spaces so that
    # adjacent words are not glued together and contractions split into
    # tokens that the stopword list can match
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    # Lowercase, tokenize on whitespace, and remove stopwords
    words = [word for word in text.lower().split() if word not in stop_words]
    # Stem each word, then lemmatize the stem
    words = [lemmatizer.lemmatize(stemmer.stem(word)) for word in words]
    # Join the cleaned words back into a string
    return ' '.join(words)

# Load the CSV data into a pandas DataFrame
csv_filename = "imdb_reviews.csv"
df = pd.read_csv(csv_filename)

# Clean and preprocess the 'Review' column (pandas reads empty cells back as NaN)
df['Cleaned Review'] = df['Review'].fillna('').astype(str).apply(clean_text)

# Save the DataFrame with cleaned data to a new CSV file
cleaned_csv_filename = "imdb_reviews_cleaned.csv"
df.to_csv(cleaned_csv_filename, index=False, encoding='utf-8')

print("Text data has been cleaned and saved to", cleaned_csv_filename)



[nltk_data] Downloading package stopwords to C:\Users\soumya
[nltk_data]     nanditha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\soumya
[nltk_data]     nanditha\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\soumya
[nltk_data]     nanditha\AppData\Roaming\nltk_data...


Text data has been cleaned and saved to imdb_reviews_cleaned.csv
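
One detail of step (1) is worth checking on a small input: whether punctuation is deleted outright or replaced by a space changes which tokens reach the stopword filter. The sketch below uses only the standard library; the tiny stopword set and the helper `tokens_after_cleaning` are hypothetical stand-ins for the NLTK list and the notebook's `clean_text`.

```python
import re

# Hypothetical mini stopword set for illustration only.
STOP_WORDS = {"don", "t", "the", "it", "is", "i"}

def tokens_after_cleaning(text, replacement):
    """Strip non-letters using `replacement`, lowercase, and drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z\s]", replacement, text)
    return [w for w in letters_only.lower().split() if w not in STOP_WORDS]

sample = "I don't like it.The plot is thin"
print(tokens_after_cleaning(sample, ""))   # punctuation deleted
print(tokens_after_cleaning(sample, " "))  # punctuation replaced by a space
```

With deletion, `don't` becomes `dont` and `it.The` becomes `itthe`, neither of which matches a stopword; with a space, both split into tokens that the filter removes.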


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [None]:
# Write your code here
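
One possible starting point for part (1), not a full solution: a sketch of the counting step, assuming the text has already been tagged with Penn Treebank tags (for example by `nltk.pos_tag`, which uses the tagger downloaded earlier). The names `COARSE_CLASSES` and `count_pos` and the hand-tagged sample are hypothetical.

```python
from collections import Counter

# Penn Treebank tag prefixes mapped to the four coarse classes from part (1).
# NN* covers nouns, VB* verbs, JJ* adjectives, RB* adverbs.
COARSE_CLASSES = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}

def count_pos(tagged_tokens):
    """Count nouns, verbs, adjectives, and adverbs in (word, tag) pairs."""
    counts = Counter({name: 0 for name in COARSE_CLASSES.values()})
    for _word, tag in tagged_tokens:
        coarse = COARSE_CLASSES.get(tag[:2])
        if coarse:
            counts[coarse] += 1
    return counts

# Hand-tagged sample (hypothetical); in the notebook the pairs would come from
# a tagger applied to each cleaned review.
sample = [
    ("movie", "NN"), ("really", "RB"), ("great", "JJ"),
    ("actors", "NNS"), ("played", "VBD"), ("well", "RB"),
]
pos_counts = count_pos(sample)
print(pos_counts)
```

Matching on the first two characters of the tag groups variants such as `NNS` and `VBD` under their base class; tags outside the four classes (for example `WRB` or `DT`) are ignored.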






**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**
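
As a reference point for that explanation, the sketch below represents one toy sentence, "The cat sat", in both forms. The annotations are written by hand for illustration (they are not parser output and not taken from the scraped reviews), and the helper names are hypothetical.

```python
# Constituency: nested (label, children...) tuples; leaves are plain strings.
# The sentence (S) splits into a noun phrase (NP) and a verb phrase (VP).
constituency = ("S", ("NP", ("DT", "The"), ("NN", "cat")), ("VP", ("VBD", "sat")))

# Dependency: (head, relation, dependent) triples; the verb is the root.
dependencies = [
    ("ROOT", "root", "sat"),
    ("sat", "nsubj", "cat"),
    ("cat", "det", "The"),
]

def to_brackets(node):
    """Render a nested-tuple constituency tree in bracketed notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(to_brackets(c) for c in children) + ")"

def dependents_of(head, triples):
    """List the words directly governed by `head`."""
    return [dep for h, _rel, dep in triples if h == head]

print(to_brackets(constituency))
print(dependents_of("sat", dependencies))
```

The constituency tree groups words into nested phrases, while the dependency triples link each word directly to the word it depends on, with no phrase nodes in between.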