
# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software products from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [None]:
!pip install semanticscholar
import pandas as pd
import warnings
from semanticscholar import SemanticScholar



In [None]:

warnings.filterwarnings("ignore")
pd.set_option('display.width', None)
sem_sch = SemanticScholar()

queries = ["machine learning", "data science", "artificial intelligence", "information extraction"]
abstracts = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0
total_papers = 0

for query in queries:

    print(f"Processing query: {query}")
    count = 0

    # Keep searching until 2500 abstracts have been collected for this query
    while count < 2500:

        res = sem_sch.search_paper(query, fields=['abstract'])

        if not res:
            break

        for paper in res:

            # Skip papers that have no abstract
            if paper.abstract:
                count += 1
                total_papers += 1
                abstracts.append({'abstract': paper.abstract})

            if count == 2500:
                break

# Build the DataFrame once, after all queries have been processed
df_paper = pd.DataFrame(abstracts, columns=['abstract'])
df_paper.to_csv('research_papers_abstracts_2500_per_query.csv', index=False)


Processing query: machine learning
Processing query: data science
Processing query: artificial intelligence
Processing query: information extraction
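
If a search yields fewer than 2500 abstracts, the `while` loop above issues the same query again, so the same abstract may be collected more than once; a paper can also match several queries. The following is a minimal sketch of one way to cap and deduplicate the results. The helper `collect_abstracts` and the stand-in `fake_results` list are hypothetical and not part of the `semanticscholar` library; the sketch only assumes that each result object exposes an `abstract` attribute, as in the cell above.

```python
from types import SimpleNamespace


def collect_abstracts(results, limit, seen=None):
    """Return up to `limit` unique, non-empty abstracts from an iterable of papers.

    `seen` is an optional set shared across queries so that a paper matching
    several queries is stored only once. (Hypothetical helper for illustration.)
    """
    seen = set() if seen is None else seen
    collected = []
    for paper in results:
        abstract = getattr(paper, "abstract", None)
        # Skip missing or empty abstracts and anything already collected
        if not abstract or abstract in seen:
            continue
        seen.add(abstract)
        collected.append(abstract)
        if len(collected) == limit:
            break
    return collected


# Stand-in data instead of a live API call (assumption for the example)
fake_results = [SimpleNamespace(abstract=a) for a in ["A", None, "B", "A", "", "C", "D"]]
unique_abstracts = collect_abstracts(fake_results, limit=3)
print(unique_abstracts)
```

In the notebook, the same `seen` set could be passed for every query so that the final CSV contains each abstract at most once.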


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def remove_noise(text):
    return re.sub(r'[^a-zA-Z\s]', '', text)

def remove_numbers(text):
    return re.sub(r'\d+', '', text)

# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    tokens = nltk.word_tokenize(text)
    return ' '.join([word.lower() for word in tokens if word.lower() not in STOP_WORDS])

def stem_text(text):
    tokens = nltk.word_tokenize(text)
    return ' '.join([stemmer.stem(word) for word in tokens])

def lemmatize_text(text):
    tokens = nltk.word_tokenize(text)
    return ' '.join([lemmatizer.lemmatize(word) for word in tokens])

df_clean = pd.read_csv('research_papers_abstracts_2500_per_query.csv')

df_clean['noise_removed'] = df_clean['abstract'].apply(remove_noise)
df_clean['numbers_removed'] = df_clean['noise_removed'].apply(remove_numbers)
df_clean['stopwords_removed'] = df_clean['numbers_removed'].apply(remove_stopwords)
df_clean['lowercased'] = df_clean['stopwords_removed'].str.lower()
df_clean['stemmed'] = df_clean['lowercased'].apply(stem_text)
df_clean['lemmatized'] = df_clean['lowercased'].apply(lemmatize_text)

df_clean.to_csv('research_papers_abstracts_cleaned_separate_steps.csv', index=False)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
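
One detail of the pipeline above is worth checking: the pattern in `remove_noise` keeps only letters and whitespace, so digits are already gone before `remove_numbers` runs, and deleting punctuation outright fuses hyphenated words. The sketch below reproduces the two functions with `re` only and compares them with a hypothetical variant, `remove_noise_keep_digits`, which is an assumption for illustration and not part of the original notebook. The sample sentence is made up.

```python
import re


def remove_noise(text):
    # Same pattern as in the cell above: keep only ASCII letters and whitespace
    return re.sub(r'[^a-zA-Z\s]', '', text)


def remove_numbers(text):
    # Same pattern as in the cell above: drop runs of digits
    return re.sub(r'\d+', '', text)


def remove_noise_keep_digits(text):
    # Hypothetical variant: replace punctuation with a space and leave digits
    # for the separate number-removal step
    return re.sub(r'[^a-zA-Z0-9\s]', ' ', text)


sample = "Fashion-MNIST has 70,000 images of 28x28 pixels."

step1 = remove_noise(sample)
step2 = remove_numbers(step1)
print(step1)
print(step1 == step2)  # True: remove_numbers has nothing left to remove

alt1 = remove_noise_keep_digits(sample)
# Collapse the extra whitespace left behind by the replacements
alt2 = re.sub(r'\s+', ' ', remove_numbers(alt1)).strip()
print(alt2)
```

With the variant, "Fashion-MNIST" stays as two tokens and the number-removal step has a visible effect of its own, which may make the per-step columns easier to compare.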


In [None]:
df_clean

| Unnamed: 0 | abstract | noise_removed | numbers_removed | stopwords_removed | lowercased | stemmed | lemmatized |
|---|---|---|---|---|---|---|---|
| 0 | We present Fashion-MNIST, a new dataset compri... | We present FashionMNIST a new dataset comprisi... | We present FashionMNIST a new dataset comprisi... | present fashionmnist new dataset comprising x ... | present fashionmnist new dataset comprising x ... | present fashionmnist new dataset compris x gra... | present fashionmnist new dataset comprising x ... |
| 1 | TensorFlow is a machine learning system that o... | TensorFlow is a machine learning system that o... | TensorFlow is a machine learning system that o... | tensorflow machine learning system operates la... | tensorflow machine learning system operates la... | tensorflow machin learn system oper larg scale... | tensorflow machine learning system operates la... |
| 2 | TensorFlow is an interface for expressing mach... | TensorFlow is an interface for expressing mach... | TensorFlow is an interface for expressing mach... | tensorflow interface expressing machine learni... | tensorflow interface expressing machine learni... | tensorflow interfac express machin learn algor... | tensorflow interface expressing machine learni... |
| 3 | The goal of precipitation nowcasting is to pre... | The goal of precipitation nowcasting is to pre... | The goal of precipitation nowcasting is to pre... | goal precipitation nowcasting predict future r... | goal precipitation nowcasting predict future r... | goal precipit nowcast predict futur rainfal in... | goal precipitation nowcasting predict future r... |
| 4 | Machine learning addresses the question of how... | Machine learning addresses the question of how... | Machine learning addresses the question of how... | machine learning addresses question build comp... | machine learning addresses question build comp... | machin learn address question build comput imp... | machine learning address question build comput... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9992 | Most modern Information Extraction (IE) system... | Most modern Information Extraction IE systems ... | Most modern Information Extraction IE systems ... | modern information extraction ie systems imple... | modern information extraction ie systems imple... | modern inform extract ie system implement sequ... | modern information extraction ie system implem... |
| 9993 | Documents contain information that can be used... | Documents contain information that can be used... | Documents contain information that can be used... | documents contain information used various app... | documents contain information used various app... | document contain inform use variou applic ques... | document contain information used various appl... |
| 9994 | We develop CALM, a coordination analyzer that ... | We develop CALM a coordination analyzer that i... | We develop CALM a coordination analyzer that i... | develop calm coordination analyzer improves up... | develop calm coordination analyzer improves up... | develop calm coordin analyz improv upon conjun... | develop calm coordination analyzer improves up... |
| 9995 | The goal of Open Information Extraction (OIE) ... | The goal of Open Information Extraction OIE is... | The goal of Open Information Extraction OIE is... | goal open information extraction oie extract s... | goal open information extraction oie extract s... | goal open inform extract oie extract surfac re... | goal open information extraction oie extract s... |


In [67]:
df_cleaned_data = df_clean['lemmatized']
df_cleaned_data

df_clean.to_csv('df_clean.csv', index=False)
df_cleaned_data.to_csv('df_cleaned_data.csv', index=False)

from google.colab import files
files.download("df_clean.csv")
files.download("df_cleaned_data.csv")


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
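
For part (1), the task asks for corpus-wide totals rather than per-document counts. The following is a minimal sketch of how per-document tags could be rolled up into totals with `collections.Counter`. It assumes Penn Treebank tags, as produced by NLTK's default English tagger; the tagged tokens are hand-written stand-ins for the tagger's output, and the names `PREFIX_TO_CATEGORY` and `count_pos` are hypothetical.

```python
from collections import Counter

# Two-letter Penn Treebank prefixes mapped to the four requested categories.
# "RB" is used rather than "R" so that particles (RP) are not counted as adverbs.
PREFIX_TO_CATEGORY = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}


def count_pos(tagged_tokens):
    """Count nouns, verbs, adjectives and adverbs in a list of (token, tag) pairs."""
    counts = Counter()
    for _, tag in tagged_tokens:
        category = PREFIX_TO_CATEGORY.get(tag[:2])
        if category:
            counts[category] += 1
    return counts


# Hand-tagged stand-ins for tagger output (assumption for the example)
tagged_docs = [
    [("machine", "NN"), ("learning", "NN"), ("addresses", "VBZ"),
     ("hard", "JJ"), ("questions", "NNS")],
    [("models", "NNS"), ("train", "VBP"), ("quickly", "RB"), ("up", "RP")],
]

# Sum the per-document counts into corpus-wide totals
totals = Counter()
for doc in tagged_docs:
    totals.update(count_pos(doc))

print(dict(totals))
```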

In [None]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag import pos_tag
import spacy
nltk.download('averaged_perceptron_tagger')

# Load SpaCy English model for Named Entity Recognition
nlp = spacy.load("en_core_web_sm")

# Read the CSV file containing the clean text
try:
    df = pd.read_csv('df_cleaned_data.csv')
except FileNotFoundError:
    print("File not found. Please ensure the file path is correct.")
    raise  # re-raise to stop the cell; exit() would shut down the notebook kernel
except Exception as e:
    print("An error occurred while reading the CSV file:", e)
    raise

# Function to conduct Parts of Speech (POS) Tagging and calculate counts
def pos_tagging_and_count(text):
    nouns = verbs = adjectives = adverbs = 0
    if isinstance(text, str):  # Check if text is a string
        tokens = word_tokenize(text)
        pos_tags = pos_tag(tokens)
        for _, tag in pos_tags:
            if tag.startswith('N'):  # Noun
                nouns += 1
            elif tag.startswith('V'):  # Verb
                verbs += 1
            elif tag.startswith('J'):  # Adjective
                adjectives += 1
            elif tag.startswith('RB'):  # Adverb (RB, RBR, RBS); excludes RP particles
                adverbs += 1
    return nouns, verbs, adjectives, adverbs

# Function to print the dependency parse of each sentence.
# Note: the spaCy en_core_web_sm pipeline provides a dependency parser only;
# constituency trees would need a separate constituency parser.
def parse_trees(text):
    if isinstance(text, str):  # Check if text is a string
        sentences = sent_tokenize(text)
        for sentence in sentences:
            parsed_sentence = nlp(sentence)
            print("Dependency Parsing Tree:")
            # One line per token: text, relation to head, head text, head POS, children
            for token in parsed_sentence:
                print(token.text, token.dep_, token.head.text, token.head.pos_,
                      [child for child in token.children])
            print("="*50)

# Function to perform Named Entity Recognition (NER); returns "text - label" strings
def ner_and_count(text):
    entities = []
    if isinstance(text, str):  # Check if text is a string
        doc = nlp(text)
        for ent in doc.ents:
            entities.append(ent.text + " - " + ent.label_)
    return entities

# Apply functions to each row of the DataFrame
df['noun'], df['verb'], df['adjective'], df['adverb'] = zip(*df['lemmatized'].apply(pos_tagging_and_count))

df['lemmatized'].head(2).apply(parse_trees)
#df['lemmatized'].apply(parse_trees)
df['entities'] = df['lemmatized'].apply(ner_and_count)

# Display DataFrame with added columns
print(df)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Dependency Parsing Tree:
present amod image NOUN []
fashionmnist amod image NOUN []
new amod dataset NOUN []
dataset nmod product NOUN [new, comprising]
comprising acl dataset NOUN []
x punct product NOUN []
grayscale compound image NOUN []
image compound product NOUN [grayscale]
fashion compound product NOUN []
product compound image NOUN [dataset, x, image, fashion]
category compound image NOUN []
image nsubj serve VERB [present, fashionmnist, product, category, per]
per prep image NOUN [fashionmnist]
category compound training NOUN []
training compound set VERB [category]
set amod test NOUN []
image compound test NOUN []
test compound set VERB [set, image]
set amod fashionmnist NOUN [training, test]
image compound fashionmnist NOUN []
fashionmnist pobj per ADP [set, image, intended]
intended acl fashionmnist NOUN []
serve ROOT serve VERB [image, learning, dataset, url]
direct amod replacement NOUN []
dropin nmod replacement NOUN []
replacement nmod learning VERB [direct, dropin]
o
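
The cell above stores the entities of each row but does not yet compute the count of each entity that the question asks for. The following is a minimal sketch of that counting step with `collections.Counter`. The helper `count_entities` and the sample rows are hypothetical; in the notebook, the `(text, label)` pairs would come from `ent.text` and `ent.label_` on `doc.ents`.

```python
from collections import Counter


def count_entities(rows):
    """Count entities per label and per (text, label) pair.

    `rows` is an iterable with one list of (entity_text, entity_label)
    pairs per document. (Hypothetical helper for illustration.)
    """
    by_label = Counter()
    by_entity = Counter()
    for entities in rows:
        for text, label in entities:
            by_label[label] += 1
            by_entity[(text, label)] += 1
    return by_label, by_entity


# Made-up entity pairs standing in for NER output (assumption for the example)
sample_rows = [
    [("google", "ORG"), ("2015", "DATE")],
    [("google", "ORG"), ("stanford", "ORG")],
    [],
]

by_label, by_entity = count_entities(sample_rows)
print(by_label.most_common())
print(by_entity.most_common(1))
```

Because the cleaned text is lowercased and stripped of punctuation, the NER model may recognize fewer entities than it would on the original abstracts, so it could be worth comparing the counts from both columns.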

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [70]:
# Write your response below

"""

This assignment was tough because the data scraping did not work well; I seriously tried every one of the listed sources.
I was not happy with this kind of assignment: it may be easy once the material has been fully learned, but it is not easy while we are still learning.
The last step, Question 3, challenged me in many ways, but I understand the concepts now, thanks to you.
I cannot complain about the time, but doing this without any help from experts did not go my way.
I would like to understand more about how to collect data through hyperlinks nested inside other hyperlinks.

"""
