
# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [49]:
!pip install semanticscholar



In [50]:
import pandas as pd
import warnings
import time
from semanticscholar import SemanticScholar

# Ignoring warnings
warnings.filterwarnings("ignore")
pd.set_option('display.width', None)

# Initialization
sem_sch = SemanticScholar()

# Function to fetch abstracts, cycling through the keywords until `count` is reached
def fetch_abstracts(keywords, count):
    total_abstracts = []
    while count > 0:
        for keyword in keywords:
            start_time = time.time()
            # search_paper expects the query as a string
            res = sem_sch.search_paper(keyword, fields=['abstract'])
            for paper in res:
                # paper.abstract is None when no abstract is available
                total_abstracts.append({'abstract': paper.abstract})
                count -= 1
                if count == 0:
                    break
                # Move on to the next keyword once 10 seconds have elapsed
                if time.time() - start_time > 10:
                    break
            if count == 0:
                break
    return total_abstracts

# Search articles using multiple topics and fetch 10000 abstracts
abstracts = fetch_abstracts(['machine learning', 'data science', 'artificial intelligence', 'information extraction'], 10000)

# Create DataFrame
df_paper = pd.DataFrame(abstracts)

# Save to CSV
df_paper.to_csv('output.csv', index=False)

# Print DataFrame
print(df_paper)


                                               abstract
0     We present Fashion-MNIST, a new dataset compri...
1     TensorFlow is a machine learning system that o...
2     TensorFlow is an interface for expressing mach...
3     The goal of precipitation nowcasting is to pre...
4                                                  None
...                                                 ...
9995  The image displayed in computed tomography is ...
9996  Automatically acquiring synonymous collocation...
9997  The past decade has seen an explosion in the a...
9998  We have recently completed the sixth in a seri...
9999  Face is a complex multidimensional visual mode...

[10000 rows x 1 columns]
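
Row 4 in the output above is `None`, and rerunning the same queries (or a paper matching more than one query) can return the same abstract more than once. A minimal sketch of filtering such records before cleaning, assuming the list-of-dicts shape that `fetch_abstracts` returns; the helper name `drop_missing_and_duplicates` and the sample records are hypothetical and not part of the notebook:

```python
def drop_missing_and_duplicates(records):
    """Keep the first occurrence of each non-empty abstract, preserving order."""
    seen = set()
    kept = []
    for record in records:
        abstract = record.get('abstract')
        # Skip papers for which no abstract was returned
        if not abstract or not abstract.strip():
            continue
        # Normalise whitespace so trivially different copies compare equal
        key = ' '.join(abstract.split())
        if key in seen:
            continue
        seen.add(key)
        kept.append({'abstract': abstract})
    return kept


# Hypothetical sample: one missing abstract and one near-identical duplicate
sample = [
    {'abstract': 'We present Fashion-MNIST.'},
    {'abstract': None},
    {'abstract': 'We present  Fashion-MNIST.'},
    {'abstract': 'TensorFlow is a machine learning system.'},
]
print(len(drop_missing_and_duplicates(sample)))
```

Filtering this way reduces the row count, so more papers would need to be fetched to still end up with 10000 usable abstracts.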


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [51]:
import pandas as pd
import re
import warnings
from semanticscholar import SemanticScholar


# Function to clean text by removing special characters and punctuations
def clean_text(text):
    if text is None:
        return ''
    # Remove special characters and punctuations using regular expressions
    cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return cleaned_text

# Clean the abstracts and save the clean data in a new column
df_paper['cleaned_abstract'] = df_paper['abstract'].apply(clean_text)

# Save DataFrame to CSV with both original and cleaned abstracts
df_paper.to_csv('output_with_cleaned_abstract.csv', index=False)

# Optionally, you can also return or further manipulate the DataFrame
df_paper


Unnamed: 0,abstract,cleaned_abstract
0,"We present Fashion-MNIST, a new dataset compri...",We present FashionMNIST a new dataset comprisi...
1,TensorFlow is a machine learning system that o...,TensorFlow is a machine learning system that o...
2,TensorFlow is an interface for expressing mach...,TensorFlow is an interface for expressing mach...
3,The goal of precipitation nowcasting is to pre...,The goal of precipitation nowcasting is to pre...
4,,
...,...,...
9995,The image displayed in computed tomography is ...,The image displayed in computed tomography is ...
9996,Automatically acquiring synonymous collocation...,Automatically acquiring synonymous collocation...
9997,The past decade has seen an explosion in the a...,The past decade has seen an explosion in the a...
9998,We have recently completed the sixth in a seri...,We have recently completed the sixth in a seri...


In [52]:
import pandas as pd
import re
import warnings
from semanticscholar import SemanticScholar


# Function to clean text by removing numbers
def clean_text(text):
    if text is None:
        return ''
    # Remove numbers by dropping every digit character
    cleaned_text = ''.join([i for i in text if not i.isdigit()])
    return cleaned_text

# Clean the abstracts and save the clean data in a new column
df_paper['cleaned_abstract'] = df_paper['abstract'].apply(clean_text)

# Save DataFrame to CSV with both original and cleaned abstracts
df_paper.to_csv('output_with_cleaned_abstract.csv', index=False)

# Optionally, you can also return or further manipulate the DataFrame
df_paper


Unnamed: 0,abstract,cleaned_abstract
0,"We present Fashion-MNIST, a new dataset compri...","We present Fashion-MNIST, a new dataset compri..."
1,TensorFlow is a machine learning system that o...,TensorFlow is a machine learning system that o...
2,TensorFlow is an interface for expressing mach...,TensorFlow is an interface for expressing mach...
3,The goal of precipitation nowcasting is to pre...,The goal of precipitation nowcasting is to pre...
4,,
...,...,...
9995,The image displayed in computed tomography is ...,The image displayed in computed tomography is ...
9996,Automatically acquiring synonymous collocation...,Automatically acquiring synonymous collocation...
9997,The past decade has seen an explosion in the a...,The past decade has seen an explosion in the a...
9998,We have recently completed the sixth in a seri...,We have recently completed the sixth in a seri...


In [53]:
import pandas as pd
import re
import warnings
from semanticscholar import SemanticScholar


# Function to lower the text
def clean_text(text):
    if text is None:
        return ''
    # Lowercase all the letters
    cleaned_text = text.lower()
    return cleaned_text

# Clean the abstracts and save the clean data in a new column
df_paper['cleaned_abstract'] = df_paper['abstract'].apply(clean_text)

# Save DataFrame to CSV with both original and cleaned abstracts
df_paper.to_csv('output_with_cleaned_abstract.csv', index=False)

# Optionally, you can also return or further manipulate the DataFrame
df_paper


Unnamed: 0,abstract,cleaned_abstract
0,"We present Fashion-MNIST, a new dataset compri...","we present fashion-mnist, a new dataset compri..."
1,TensorFlow is a machine learning system that o...,tensorflow is a machine learning system that o...
2,TensorFlow is an interface for expressing mach...,tensorflow is an interface for expressing mach...
3,The goal of precipitation nowcasting is to pre...,the goal of precipitation nowcasting is to pre...
4,,
...,...,...
9995,The image displayed in computed tomography is ...,the image displayed in computed tomography is ...
9996,Automatically acquiring synonymous collocation...,automatically acquiring synonymous collocation...
9997,The past decade has seen an explosion in the a...,the past decade has seen an explosion in the a...
9998,We have recently completed the sixth in a seri...,we have recently completed the sixth in a seri...


In [54]:
import pandas as pd
import re
import warnings
import nltk
from semanticscholar import SemanticScholar
from nltk.corpus import stopwords


stop_words = set(stopwords.words('english'))

# Function to clean text by removing stopwords
def remove_stopwords(text):
    if text is None:
        return ''
    # Tokenize on whitespace (punctuation stays attached to the words)
    words = text.split()
    # Remove stopwords
    filtered_words = [word for word in words if word.lower() not in stop_words]
    # Join the filtered words back into a string
    cleaned_text = ' '.join(filtered_words)
    return cleaned_text

# Clean the abstracts by removing stopwords and save the clean data in a new column
df_paper['abstract_without_stopwords'] = df_paper['abstract'].apply(remove_stopwords)

# Save DataFrame to CSV with both original and cleaned abstracts
df_paper.to_csv('output_without_stopwords.csv', index=False)

# Optionally, you can also return or further manipulate the DataFrame
df_paper


Unnamed: 0,abstract,cleaned_abstract,abstract_without_stopwords
0,"We present Fashion-MNIST, a new dataset compri...","we present fashion-mnist, a new dataset compri...","present Fashion-MNIST, new dataset comprising ..."
1,TensorFlow is a machine learning system that o...,tensorflow is a machine learning system that o...,TensorFlow machine learning system operates la...
2,TensorFlow is an interface for expressing mach...,tensorflow is an interface for expressing mach...,TensorFlow interface expressing machine learni...
3,The goal of precipitation nowcasting is to pre...,the goal of precipitation nowcasting is to pre...,goal precipitation nowcasting predict future r...
4,,,
...,...,...,...
9995,The image displayed in computed tomography is ...,the image displayed in computed tomography is ...,image displayed computed tomography scaled rep...
9996,Automatically acquiring synonymous collocation...,automatically acquiring synonymous collocation...,Automatically acquiring synonymous collocation...
9997,The past decade has seen an explosion in the a...,the past decade has seen an explosion in the a...,past decade seen explosion amount digital info...
9998,We have recently completed the sixth in a seri...,we have recently completed the sixth in a seri...,"recently completed sixth series ""Message Under..."


In [55]:
import pandas as pd
import re
import warnings
import nltk
from semanticscholar import SemanticScholar
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Create a stemmer object
porter_stemmer = PorterStemmer()

# Function to perform stemming on text
def perform_stemming(text):
    if text is None:
        return ''
    # Tokenize the text
    words = word_tokenize(text)
    # Perform stemming on each word
    stemmed_words = [porter_stemmer.stem(word) for word in words]
    # Join the stemmed words back into a string
    stemmed_text = ' '.join(stemmed_words)
    return stemmed_text

# Perform stemming on the abstracts and save the clean data in a new column
df_paper['stemmed_abstract'] = df_paper['abstract'].apply(perform_stemming)

# Save DataFrame to CSV with both original and stemmed abstracts
df_paper.to_csv('output_with_stemmed_abstract.csv', index=False)

# Optionally, you can also return or further manipulate the DataFrame
df_paper


Unnamed: 0,abstract,cleaned_abstract,abstract_without_stopwords,stemmed_abstract
0,"We present Fashion-MNIST, a new dataset compri...","we present fashion-mnist, a new dataset compri...","present Fashion-MNIST, new dataset comprising ...","we present fashion-mnist , a new dataset compr..."
1,TensorFlow is a machine learning system that o...,tensorflow is a machine learning system that o...,TensorFlow machine learning system operates la...,tensorflow is a machin learn system that oper ...
2,TensorFlow is an interface for expressing mach...,tensorflow is an interface for expressing mach...,TensorFlow interface expressing machine learni...,tensorflow is an interfac for express machin l...
3,The goal of precipitation nowcasting is to pre...,the goal of precipitation nowcasting is to pre...,goal precipitation nowcasting predict future r...,the goal of precipit nowcast is to predict the...
4,,,,
...,...,...,...,...
9995,The image displayed in computed tomography is ...,the image displayed in computed tomography is ...,image displayed computed tomography scaled rep...,the imag display in comput tomographi is a sca...
9996,Automatically acquiring synonymous collocation...,automatically acquiring synonymous collocation...,Automatically acquiring synonymous collocation...,automat acquir synonym colloc pair such as and...
9997,The past decade has seen an explosion in the a...,the past decade has seen an explosion in the a...,past decade seen explosion amount digital info...,the past decad ha seen an explos in the amount...
9998,We have recently completed the sixth in a seri...,we have recently completed the sixth in a seri...,"recently completed sixth series ""Message Under...",we have recent complet the sixth in a seri of ...


In [56]:
import pandas as pd
import re
import warnings
import nltk
from semanticscholar import SemanticScholar
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Create a lemmatizer object
lemmatizer = WordNetLemmatizer()

# Function to perform lemmatization on text
def perform_lemmatization(text):
    if text is None:
        return ''
    # Tokenize the text
    words = word_tokenize(text)
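    # Perform lemmatization on each word.
    # lemmatize() defaults to pos='n', so every token is treated as a noun;
    # this is why 'has' appears as 'ha' in the output below.
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]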
    # Join the lemmatized words back into a string
    lemmatized_text = ' '.join(lemmatized_words)
    return lemmatized_text

# Perform lemmatization on the abstracts and save the clean data in a new column
df_paper['lemmatized_abstract'] = df_paper['abstract'].apply(perform_lemmatization)

# Save DataFrame to CSV with both original and lemmatized abstracts
df_paper.to_csv('output_with_lemmatized_abstract.csv', index=False)

# Optionally, you can also return or further manipulate the DataFrame
df_paper


Unnamed: 0,abstract,cleaned_abstract,abstract_without_stopwords,stemmed_abstract,lemmatized_abstract
0,"We present Fashion-MNIST, a new dataset compri...","we present fashion-mnist, a new dataset compri...","present Fashion-MNIST, new dataset comprising ...","we present fashion-mnist , a new dataset compr...","We present Fashion-MNIST , a new dataset compr..."
1,TensorFlow is a machine learning system that o...,tensorflow is a machine learning system that o...,TensorFlow machine learning system operates la...,tensorflow is a machin learn system that oper ...,TensorFlow is a machine learning system that o...
2,TensorFlow is an interface for expressing mach...,tensorflow is an interface for expressing mach...,TensorFlow interface expressing machine learni...,tensorflow is an interfac for express machin l...,TensorFlow is an interface for expressing mach...
3,The goal of precipitation nowcasting is to pre...,the goal of precipitation nowcasting is to pre...,goal precipitation nowcasting predict future r...,the goal of precipit nowcast is to predict the...,The goal of precipitation nowcasting is to pre...
4,,,,,
...,...,...,...,...,...
9995,The image displayed in computed tomography is ...,the image displayed in computed tomography is ...,image displayed computed tomography scaled rep...,the imag display in comput tomographi is a sca...,The image displayed in computed tomography is ...
9996,Automatically acquiring synonymous collocation...,automatically acquiring synonymous collocation...,Automatically acquiring synonymous collocation...,automat acquir synonym colloc pair such as and...,Automatically acquiring synonymous collocation...
9997,The past decade has seen an explosion in the a...,the past decade has seen an explosion in the a...,past decade seen explosion amount digital info...,the past decad ha seen an explos in the amount...,The past decade ha seen an explosion in the am...
9998,We have recently completed the sixth in a seri...,we have recently completed the sixth in a seri...,"recently completed sixth series ""Message Under...",we have recent complet the sixth in a seri of ...,We have recently completed the sixth in a seri...
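
Each cell above starts again from the raw `abstract` column, so no single column holds text that has passed through every step. A minimal sketch of chaining steps (1) to (4) in one function, using only the standard library. The name `clean_pipeline` and the tiny `DEMO_STOPWORDS` set are hypothetical stand-ins (the notebook uses NLTK's English stopword list), and stemming or lemmatization would be applied to the returned text afterwards. Unlike the notebook's regex, this sketch replaces removed characters with a space so that hyphenated terms are split rather than fused:

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only
DEMO_STOPWORDS = {'a', 'an', 'the', 'is', 'of', 'we', 'that', 'in', 'to'}


def clean_pipeline(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, strip punctuation and digits, then drop stopwords."""
    # Missing abstracts arrive as None (in memory) or NaN (after read_csv)
    if not isinstance(text, str):
        return ''
    # Lowercase first so the lowercase stopword list matches every token
    text = text.lower()
    # Replace anything that is not a letter or whitespace with a space
    text = re.sub(r'[^a-z\s]', ' ', text)
    tokens = [token for token in text.split() if token not in stop_words]
    return ' '.join(tokens)


print(clean_pipeline('We present Fashion-MNIST, a new dataset of 70,000 images.'))
# prints: present fashion mnist new dataset images
```

With a DataFrame, this could be applied as `df_paper['abstract'].apply(clean_pipeline)` and stored in a single cleaned column.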


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
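
For part (1), the coarse counts can be derived from Penn Treebank tags by prefix. A minimal sketch that takes already-tagged `(word, tag)` pairs, such as those returned by `nltk.pos_tag`, so it runs without NLTK; the name `count_coarse_pos` and the hand-written sample tags are illustrative assumptions. It matches the two-letter prefix `RB` for adverbs because a bare `R` prefix would also count the particle tag `RP`:

```python
from collections import Counter

# Two-letter Penn Treebank prefixes mapped to the four requested categories
PREFIX_TO_CATEGORY = {
    'NN': 'noun',       # NN, NNS, NNP, NNPS
    'VB': 'verb',       # VB, VBD, VBG, VBN, VBP, VBZ
    'JJ': 'adjective',  # JJ, JJR, JJS
    'RB': 'adverb',     # RB, RBR, RBS (but not RP, the particle tag)
}


def count_coarse_pos(tagged_tokens):
    """Count nouns, verbs, adjectives and adverbs in (word, tag) pairs."""
    counts = Counter({category: 0 for category in PREFIX_TO_CATEGORY.values()})
    for _, tag in tagged_tokens:
        category = PREFIX_TO_CATEGORY.get(tag[:2])
        if category is not None:
            counts[category] += 1
    return counts


# Hand-written sample tags, not output from the notebook
tagged = [('We', 'PRP'), ('quickly', 'RB'), ('gave', 'VBD'),
          ('up', 'RP'), ('old', 'JJ'), ('models', 'NNS')]
print(dict(count_coarse_pos(tagged)))
```

Summing the per-abstract counters gives the corpus-wide totals that the question asks for.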

In [57]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag import pos_tag
import spacy

# Load the spaCy English model (used for both dependency parsing and NER)
nlp = spacy.load("en_core_web_sm")

# Read the CSV file containing the clean text
try:
    df = pd.read_csv('output_with_lemmatized_abstract.csv')
except FileNotFoundError:
    print("File not found. Please ensure the file path is correct.")
    raise  # re-raise to stop the cell; exit() can shut down the notebook kernel
except Exception as e:
    print("An error occurred while reading the CSV file:", e)
    raise

# Function to conduct Parts of Speech (POS) Tagging and calculate counts
def pos_tagging_and_count(text):
    nouns = verbs = adjectives = adverbs = 0
    if isinstance(text, str):  # Check if text is a string
        tokens = word_tokenize(text)
        pos_tags = pos_tag(tokens)
        for _, tag in pos_tags:
            if tag.startswith('N'):  # Noun
                nouns += 1
            elif tag.startswith('V'):  # Verb
                verbs += 1
            elif tag.startswith('J'):  # Adjective
                adjectives += 1
            elif tag.startswith('R'):  # Adverb
                adverbs += 1
    return nouns, verbs, adjectives, adverbs

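# Function to print parsing information for each sentence.
# Note: the spaCy pipeline loaded above has no constituency parser, so the
# token listing printed under "Constituency Parsing Tree:" is actually the
# dependency parse (token, relation, head, head POS, children), and the line
# printed under "Dependency Parsing Tree:" is just the sentence text.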
def parse_trees(text):
    if isinstance(text, str):  # Check if text is a string
        sentences = sent_tokenize(text)
        for sentence in sentences:
            print("Constituency Parsing Tree:")
            parsed_sentence = nlp(sentence)
            for token in parsed_sentence:
                print(token.text, token.dep_, token.head.text, token.head.pos_,
                      [child for child in token.children])
            print("\nDependency Parsing Tree:")
            print(parsed_sentence)
            print("="*50)

# Function to perform Named Entity Recognition (NER) and collect "text - LABEL" strings
def ner_and_count(text):
    entities = []
    if isinstance(text, str):  # Check if text is a string
        doc = nlp(text)
        for ent in doc.ents:
            entities.append(ent.text + " - " + ent.label_)
    return entities

# Apply functions to each row of the DataFrame
df['noun'], df['verb'], df['adjective'], df['adverb'] = zip(*df['lemmatized_abstract'].apply(pos_tagging_and_count))
df['lemmatized_abstract'].apply(parse_trees)
df['entities'] = df['lemmatized_abstract'].apply(ner_and_count)

# Display DataFrame with added columns
print(df)


Streaming output truncated to the last 5000 lines.
private amod communication NOUN []
communication pobj for ADP [private, between]
between prep communication NOUN [device]
two nummod device NOUN []
wireless amod device NOUN []
device pobj between ADP [two, wireless]
, punct for ADP []
from prep evaluate VERB [strength]
the det strength NOUN []
received amod strength NOUN []
signal compound strength NOUN []
strength pobj from ADP [the, received, signal, variation]
( punct variation NOUN []
RSS nmod variation NOUN []
) punct variation NOUN []
variation appos strength NOUN [(, RSS, ), on]
on prep variation NOUN [channel]
the det channel NOUN []
wireless amod channel NOUN []
channel pobj on ADP [the, wireless, between]
between prep channel NOUN [device]
the det device NOUN []
two nummod device NOUN []
device pobj between ADP [the, two]
. punct evaluate VERB []

Dependency Parsing Tree:
We evaluate the effectiveness of secret key extraction , for private communication between

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [58]:
# Write your response below

I find it ch