
# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released recently, in 2023 or 2024 (you can choose any movie), from IMDb. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [15]:
# Import necessary libraries
!pip install semanticscholar
import pandas as pd
import warnings
from semanticscholar import SemanticScholar



In [16]:
# Suppressing warnings and setting display options
warnings.filterwarnings("ignore")
pd.set_option('display.width', None)

# Initializing SemanticScholar
sem_sch = SemanticScholar()

# Specifying the query
query = "data science"

# Collect abstracts in a plain list (DataFrame.append was removed in pandas 2.0)
abstracts = []

# Fetching 10,000 abstracts took too long, so only 1,000 are collected.
target = 1000

# Search for papers related to the query. A single pass is enough:
# repeating the same query in an outer loop would only re-fetch the same results.
res = sem_sch.search_paper(query, fields=['abstract'])

# Iterate through each paper in the search results
for paper in res:
    # Keep only papers that have an abstract
    if paper.abstract:
        abstracts.append(paper.abstract)

    # Stop once the target number of abstracts is reached
    if len(abstracts) >= target:
        break

# Build the DataFrame once and save it to a CSV file
df_paper = pd.DataFrame({'abstract': abstracts})
df_paper.to_csv('1000_abstracts.csv', index=False)

# Print a message indicating that data collection and saving is complete
print("Data collection and saving complete.")

Data collection and saving complete.
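
The collection loop above depends on the live Semantic Scholar API, which makes it hard to check offline. As a minimal sketch, the filtering and capping logic can be separated into a helper that accepts any iterable of paper-like objects. The helper name `collect_abstracts`, the duplicate filter, and the `SimpleNamespace` stand-ins are assumptions for illustration; they are not part of the `semanticscholar` package.

```python
from types import SimpleNamespace


def collect_abstracts(papers, limit):
    """Return up to `limit` unique, non-empty abstracts from `papers`.

    `papers` is any iterable of objects with an `abstract` attribute
    (hypothetical stand-ins are used below instead of real API results).
    """
    seen = set()
    abstracts = []
    for paper in papers:
        text = getattr(paper, "abstract", None)
        # Skip papers without an abstract and abstracts already collected
        if not text or text in seen:
            continue
        seen.add(text)
        abstracts.append(text)
        if len(abstracts) >= limit:
            break
    return abstracts


# Hypothetical sample data standing in for search results
sample = [
    SimpleNamespace(abstract="First abstract."),
    SimpleNamespace(abstract=None),
    SimpleNamespace(abstract="First abstract."),
    SimpleNamespace(abstract="Second abstract."),
    SimpleNamespace(abstract="Third abstract."),
]
result = collect_abstracts(sample, limit=2)
print(result)
```

With the real client, the object returned by `sem_sch.search_paper(query, fields=['abstract'])` could be passed as `papers`.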


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [17]:
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

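# Function to remove everything except ASCII letters and whitespace
# (strips punctuation and special characters, and digits as well)
def remove_noise(text):
    return re.sub(r'[^a-zA-Z\s]', '', text)

# Function to remove numbers (kept as an explicit step; a no-op after remove_noise)
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

# Function to remove stopwords (tokens are lowercased before the lookup)
def remove_stopwords(text):
    tokens = nltk.word_tokenize(text)
    return ' '.join([word.lower() for word in tokens if word.lower() not in stop_words])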

# Function to stem text
def stem_text(text):
    tokens = nltk.word_tokenize(text)
    return ' '.join([stemmer.stem(word) for word in tokens])

# Function to lemmatize text
def lemmatize_text(text):
    tokens = nltk.word_tokenize(text)
    return ' '.join([lemmatizer.lemmatize(word) for word in tokens])

# Read the CSV file into a DataFrame
df = pd.read_csv('1000_abstracts.csv')

# Apply preprocessing steps and store them in separate columns
df['noise_removed'] = df['abstract'].apply(remove_noise)
df['numbers_removed'] = df['noise_removed'].apply(remove_numbers)
df['stopwords_removed'] = df['numbers_removed'].apply(remove_stopwords)
df['lowercased'] = df['stopwords_removed'].str.lower()
df['stemmed'] = df['lowercased'].apply(stem_text)
df['lemmatized'] = df['lowercased'].apply(lemmatize_text)

# Write the cleaned DataFrame to a new CSV file
df.to_csv('abstracts_cleaned.csv', index=False)

# Print a message indicating that cleaning and saving is complete
print("Data cleaning and saving complete.")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Data cleaning and saving complete.
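
To make the effect of each cleaning step visible, the sketch below runs a simplified version of the pipeline on one sentence. It uses only the standard library: the sample sentence and the tiny `STOP_WORDS` set are assumptions for illustration and stand in for the NLTK stopword list. Stemming and lemmatization are left out because they need the NLTK resources.

```python
import re

# Tiny illustrative stopword set (an assumption; the notebook uses NLTK's English list)
STOP_WORDS = {"the", "of", "in", "a", "is", "and", "on"}


def remove_noise(text):
    # Keep ASCII letters and whitespace only; this also drops digits
    return re.sub(r"[^a-zA-Z\s]", "", text)


def remove_numbers(text):
    # Remove any remaining runs of digits
    return re.sub(r"\d+", "", text)


def remove_stopwords(text):
    # Whitespace tokenization keeps the sketch dependency-free
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)


sample = "In 2023, the model achieved 95% accuracy on the ImageNet-1k benchmark!"
no_noise = remove_noise(sample)
no_numbers = remove_numbers(no_noise)
cleaned = remove_stopwords(no_numbers)

print(no_noise)
print(cleaned)
```

Two effects are worth noticing: `remove_numbers` changes nothing after `remove_noise`, because the noise pattern already strips digits, and `ImageNet-1k` collapses into the single token `imagenetk`, since deleting the hyphen and digit fuses the surrounding letters.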


In [18]:
# Import necessary libraries
import pandas as pd
from google.colab import files

# df is the cleaned DataFrame from the previous cell and contains the lemmatized column

# Select the lemmatized column from df
df_cleaned_data = df['lemmatized']
# Save the entire cleaned DataFrame to a CSV file
df.to_csv('df_clean.csv', index=False)
# Save only the lemmatized column to a separate CSV file
df_cleaned_data.to_csv('df_cleaned_data.csv', index=False)
# Download both CSV files using Google Colab's files.download() function
files.download("df_clean.csv")
files.download("df_cleaned_data.csv")


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities, such as person names, organizations, locations, product names, and dates, from the clean texts, and calculate the count of each entity.
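
As a rough illustration of the counting asked for in items (1) and (3), the sketch below tallies coarse POS categories from Penn Treebank tags and counts entity labels. The tagged tokens and entity pairs are hand-written sample data (an assumption), so the example runs without NLTK or spaCy; in the notebook they would come from a tagger and an NER model. The function names are hypothetical.

```python
from collections import Counter

# Map the two-letter prefix of a Penn Treebank tag to a coarse category
PREFIX_TO_CATEGORY = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}


def count_pos(tagged_tokens):
    """Count nouns, verbs, adjectives, and adverbs in (word, tag) pairs."""
    counts = dict.fromkeys(PREFIX_TO_CATEGORY.values(), 0)
    for _, tag in tagged_tokens:
        category = PREFIX_TO_CATEGORY.get(tag[:2])
        if category:
            counts[category] += 1
    return counts


def count_entity_labels(entities):
    """Count how often each entity label occurs in (text, label) pairs."""
    return Counter(label for _, label in entities)


# Hand-written sample data (not produced by a real tagger or NER model)
tagged = [
    ("researchers", "NNS"),
    ("quickly", "RB"),
    ("built", "VBD"),
    ("accurate", "JJ"),
    ("models", "NNS"),
    ("up", "RP"),  # particle: not counted as an adverb
]
entities = [("Google", "ORG"), ("2023", "DATE"), ("Stanford", "ORG"), ("London", "GPE")]

pos_counts = count_pos(tagged)
label_counts = count_entity_labels(entities)
print(pos_counts)
print(label_counts)
```

Matching on the two-letter prefix rather than the first letter keeps particles (`RP`) out of the adverb count.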

In [19]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag import pos_tag
import spacy
nltk.download('averaged_perceptron_tagger')

# Load SpaCy English model for Named Entity Recognition
nlp = spacy.load("en_core_web_sm")

# Read the CSV file containing the clean text
try:
    df = pd.read_csv('df_cleaned_data.csv')
except FileNotFoundError:
    # Raise instead of calling exit(), which can shut down the notebook kernel
    raise FileNotFoundError("df_cleaned_data.csv not found. Run the Question 2 cells first.")

# Function to conduct Parts of Speech (POS) Tagging and calculate counts
def pos_tagging_and_count(text):
    nouns = verbs = adjectives = adverbs = 0
    if isinstance(text, str):  # Check if text is a string
        tokens = word_tokenize(text)
        pos_tags = pos_tag(tokens)
        for _, tag in pos_tags:
            if tag.startswith('N'):  # Noun
                nouns += 1
            elif tag.startswith('V'):  # Verb
                verbs += 1
            elif tag.startswith('J'):  # Adjective
                adjectives += 1
            elif tag.startswith('RB'):  # Adverb (RB, RBR, RBS); excludes particles tagged RP
                adverbs += 1
    return nouns, verbs, adjectives, adverbs

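# Function to print the parse of each sentence.
# Note: spaCy's en_core_web_sm provides a dependency parse only, so the rows printed
# under the first header are dependency relations (token, relation, head, head POS,
# children), and the second header is followed by the plain sentence text.
# A true constituency tree would need a separate constituency parser.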
def parse_trees(text):
    if isinstance(text, str):  # Check if text is a string
        sentences = sent_tokenize(text)
        for sentence in sentences:
            print("Constituency Parsing Tree:")
            parsed_sentence = nlp(sentence)
            for token in parsed_sentence:
                print(token.text, token.dep_, token.head.text, token.head.pos_,
                      [child for child in token.children])
            print("\nDependency Parsing Tree:")
            print(parsed_sentence)
            print("="*50)

# Function to perform Named Entity Recognition (NER) and calculate counts
def ner_and_count(text):
    entities = []
    if isinstance(text, str):  # Check if text is a string
        doc = nlp(text)
        for ent in doc.ents:
            entities.append(ent.text + " - " + ent.label_)
    return entities

# Apply functions to each row of the DataFrame
df['noun'], df['verb'], df['adjective'], df['adverb'] = zip(*df['lemmatized'].apply(pos_tagging_and_count))

df['lemmatized'].head(2).apply(parse_trees)
#df['lemmatized'].apply(parse_trees)
df['entities'] = df['lemmatized'].apply(ner_and_count)

# Display DataFrame with added columns
print(df)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Constituency Parsing Tree:
cambridge compound university PROPN []
university compound press NOUN [cambridge]
press nsubj let VERB [university]
let ROOT let VERB [press, summarize, n, onto, preserve]
u nsubj summarize VERB []
summarize ccomp let VERB [u, finding]
finding xcomp summarize VERB [projection]
random amod projection NOUN []
projection dobj finding VERB [random, set]
set acl projection NOUN []
r compound n NOUN []
n ccomp let VERB [r]
onto prep let VERB [subspace]
mdimensional amod subspace NOUN []
subspace pobj onto ADP [mdimensional]
approximately advmod preserve VERB []
preserve ccomp let VERB [approximately, geometry]
geometry dobj preserve VERB []

Dependency Parsing Tree:
cambridge university press let u summarize finding random projection set r n onto mdimensional subspace approximately preserve geometry
Constituency Parsing Tree:
chatgpt compound conversational PROPN []
conversational nsubj utilizes VERB [chatgpt]
ai compound interface PROPN []
interface nsubj utilizes
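
The rows printed for the first sentence are dependency relations (token, relation, head), even though they appear under the constituency header, so they can be arranged into a tree. The sketch below rebuilds an indented tree from a subset of those rows. The helper name `format_tree` is hypothetical, and linking tokens by head text only works here because every token in this sentence is unique.

```python
from collections import defaultdict

# (token, relation, head) rows copied from the output above (a subset)
ROWS = [
    ("cambridge", "compound", "university"),
    ("university", "compound", "press"),
    ("press", "nsubj", "let"),
    ("let", "ROOT", "let"),
    ("approximately", "advmod", "preserve"),
    ("preserve", "ccomp", "let"),
    ("geometry", "dobj", "preserve"),
]


def format_tree(rows):
    """Return the dependency tree as a list of indented lines."""
    children = defaultdict(list)
    relation = {}
    root = None
    for token, dep, head in rows:
        relation[token] = dep
        if dep == "ROOT":
            root = token  # the root token is its own head
        else:
            children[head].append(token)

    lines = []

    def walk(token, depth):
        # Indent each token by its depth below the root
        lines.append("  " * depth + f"{token} ({relation[token]})")
        for child in children[token]:
            walk(child, depth + 1)

    walk(root, 0)
    return lines


tree_lines = format_tree(ROWS)
print("\n".join(tree_lines))
```

Read this way, `let` is the root verb, `press` (modified by `university` and `cambridge`) is its subject, and `preserve` is a clausal complement with its own adverb and object. A constituency tree would instead group words into nested phrases such as NP and VP.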

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [20]:
# Write your response below
'''
To be honest, this assignment was very tough for me. It took 7-8 hours to complete, and I had to use many references.
I found Questions 2 and 3 challenging.
However, Question 1 was easy.
Fetching 10,000 abstracts was taking too long, so I collected only 1,000 abstracts.
'''