<a href="https://colab.research.google.com/github/ManoharRavula/Manohar_INFO5731_-Spring2024/blob/main/Ravulapalli_Manohar_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
import requests
import time
import csv
def get_articles(api_key, keyword, max_results=10000):
    base_url = "https://api.semanticscholar.org/graph/v1/paper/search/bulk"
    headers = {"x-api-key": api_key}
    params = {
        "query": keyword,
        "fields": "abstract",
        "limit": 100,
    }

    articles = []
    total_fetched = 0
    # checking max results
    while total_fetched < max_results:
        response = requests.get(base_url, headers=headers, params=params)
        if response.status_code == 200:
            data = response.json()
            papers = data.get("data", [])
            # looping through papers to get abstract
            for paper in papers:
                    articles.append({
                        "abstract": paper.get("abstract", "")
                    })
                    total_fetched += 1
                    if total_fetched >= max_results:
                        break
            if total_fetched < max_results and "next" in data and data["next"]:
                params["offset"] = data["next"]
            else:
                break
        else:
            print(f"Failed to fetch articles: {response.text}")
            break

        time.sleep(1)

    return articles

#API key
semantic_key = "34lYGGaAFf7fEZDcAdipa9qlsw3bYIE01OxVzY5Y"
keyword = ["machine learning", "data science", "artificial intelligence", "information extraction"]
articles = get_articles(semantic_key, keyword)
filename = 'abstracts.csv'
#writing the file and saving
with open(filename, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['abstract'])
    writer.writeheader()
    for article in articles:
        writer.writerow({'abstract': article['abstract']})

print(f"Data saved to {filename}")

Data saved to abstracts.csv


# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [4]:
# Write code for each of the sub parts with proper comments.
!pip install pandas nltk
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Load the data
df = pd.read_csv('abstracts.csv')

# Function to clean the text
def clean_text(text):
    # Ensuring that input is string
    text = str(text)

    # 1. Removing special characters and punctuations
    text = re.sub(r'\W', ' ', text)

    # 2. Removing any numbers
    text = re.sub(r'\d', '', text)

    # 3. Remove the stopwords
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if not word in stop_words]

    # 4. Lowercasing all texts
    filtered_text = [word.lower() for word in filtered_text]

    # 5. Stemming
    # ps = PorterStemmer()
    # stemmed_text = [ps.stem(word) for word in filtered_text]

    # # 6. Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_text = [lemmatizer.lemmatize(word) for word in filtered_text]

    # Join the words back into one string
    cleaned_text = ' '.join(lemmatized_text)
    return cleaned_text


# Applying the cleaning function to the abstracts column and saving the result in a new column cleaned_abstract
df['cleaned_abstract'] = df['abstract'].apply(clean_text)

# Save the dataframe to a new CSV file
df.to_csv('cleaned_abstracts.csv', index=False)
from google.colab import files

files.download('cleaned_abstracts.csv')
print("Data cleaned and saved to cleaned_abstracts.csv")





[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Data cleaned and saved to cleaned_abstracts.csv


# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [6]:
!pip install nltk stanza pandas
!pip install benepar
#Importing necessary models
import pandas as pd
import nltk
from nltk import Tree #needed for displaying Constituency Parsing
from nltk.tokenize import word_tokenize
from collections import Counter
import spacy
from spacy import displacy
#Import needed for Constituency Parsing
import benepar
benepar.download('benepar_en3')
parser = benepar.Parser("benepar_en3")

# Loading Spacy NLP model
nlp = spacy.load("en_core_web_sm")

#Models necessary for tokenization and part-of-speech (POS) tagging,
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Loading the cleaned abstracts
df = pd.read_csv('cleaned_abstracts.csv')
df['cleaned_abstract'] = df['cleaned_abstract'].fillna('').astype(str)
text = ' '.join(df['cleaned_abstract'].tolist())

#1)Parts of Speech (POS) Tagging
tokens = word_tokenize(text)
tags = nltk.pos_tag(tokens)

# Count specific POS tags
pos_counts = Counter(tag for word, tag in tags)
nouns = sum(count for tag, count in pos_counts.items() if tag.startswith('N'))
verbs = sum(count for tag, count in pos_counts.items() if tag.startswith('V'))
adjectives = sum(count for tag, count in pos_counts.items() if tag.startswith('J'))
adverbs = sum(count for tag, count in pos_counts.items() if tag.startswith('R'))

print(f"Nouns: {nouns}, Verbs: {verbs}, Adjectives: {adjectives}, Adverbs: {adverbs}")

#3)Named Entity Recognition
def named_entity_recognition(text):
    doc = nlp(text)
    entities = [ent.label_ for ent in doc.ents]
    return Counter(entities)

ner_counts = named_entity_recognition(text)
print("Named Entity Counts:", ner_counts)

#2)Dependency Parsing
example_sentence = df['cleaned_abstract'][0]
doc = nlp(example_sentence)
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

#Constituency Parsing
tree = parser.parse(example_sentence)
tree.pretty_print()



Collecting stanza
  Downloading stanza-1.8.0-py3-none-any.whl (970 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m970.4/970.4 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
Collecting emoji (from stanza)
  Downloading emoji-2.10.1-py2.py3-none-any.whl (421 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m421.5/421.5 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: emoji, stanza
Successfully installed emoji-2.10.1 stanza-1.8.0
Collecting benepar
  Downloading benepar-0.2.0.tar.gz (33 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting torch-struct>=0.5 (from benepar)
  Downloading torch_struct-0.5-py3-none-any.whl (34 kB)
Collecting accelerate>=0.21.0 (from transformers[tokenizers,torch]>=4.2.2->benepar)
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m280.0/280.0 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
Building wheels for coll

[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Unzipping models/benepar_en3.zip.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Nouns: 56530, Verbs: 18615, Adjectives: 23560, Adverbs: 4764
Named Entity Counts: Counter({'CARDINAL': 682, 'ORG': 558, 'PERSON': 450, 'DATE': 249, 'ORDINAL': 221, 'GPE': 154, 'NORP': 144, 'TIME': 21, 'LANGUAGE': 19, 'LOC': 13, 'PRODUCT': 13, 'QUANTITY': 8, 'FAC': 8, 'PERCENT': 2, 'WORK_OF_ART': 1, 'LAW': 1, 'MONEY': 1})


You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TOP                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                       



# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
# Write your response below
This assignment was very challenging. Getting semantic scholor Api key, including key words and setting late limit to get accurate results, encounterting errors and excecuting was challenging, Learned how to write files , using the written file to further cleaning process using nltk library and again saving the the results to new column and again performing further syntax and structure analysis of the clean text required me to
understand all the libraries, exploring which libraries would support my analysis, understanding Dependency and Constituency Parsing. All this made my understand how Natural language processing works and what are the steps invloved in understanding in depth, this assigment gave me a solid foundation!