
# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform a syntactic analysis of it.

In [1]:
%%capture
!pip install -U nltk 
!pip install -U spacy
!pip install benepar
!python -m spacy download en_core_web_sm

In [2]:
import re
import string
from collections import Counter
from typing import Any, Dict, List, Tuple

import benepar
import nltk
import pandas as pd
import requests
import spacy
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

benepar.download("benepar_en3")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')


[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Unzipping models/benepar_en3.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


True

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of the product [Apple iPhone 11](https://www.amazon.com/Apple-iPhone-11-64GB-Unlocked/dp/B07ZPKF8RG/ref=sr_1_13?dchild=1&keywords=iphone+12&qid=1631721363&sr=8-13) on amazon.

(2) Collect the top 10000 User Reviews of the film [Shang-Chi and the Legend of the Ten Rings](https://www.imdb.com/title/tt9376612/reviews?ref_=tt_sa_3) from IMDB.

(3) Collect all the reviews of the top 100 most popular software from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/)

(4) Collect the abstracts of the top 10000 research papers by using the query [natural language processing](https://citeseerx.ist.psu.edu/search?q=natural+language+processing&submit.x=0&submit.y=0&sort=rlv&t=doc) from CiteSeerX.

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 tweets by using hashtag ["#blacklivesmatter"](https://twitter.com/hashtag/blacklivesmatter) from Twitter. 


In [3]:
BASE_URL = "https://www.imdb.com/title/tt9376612/reviews"
FIRST_PAGE_URL = BASE_URL + "?ref_=tt_urv"
AJAX_PAGE_URL = BASE_URL + "/_ajax"


def fetch_soup(url: str, session: requests.Session,
               params: Dict[str, Any] = None) -> BeautifulSoup:
    response = session.get(url, params=params)
    # Fail early on HTTP errors instead of parsing an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    return soup


def extract_content(soup: BeautifulSoup, query: Dict[str, Any]) -> List[str]:
    items = soup.find_all(query['name'], query['attrs'])
    content = [item.text.strip() for item in items]

    return content


def extract_reviews(soup: BeautifulSoup):
    titles = extract_content(soup, {'name': 'a', 'attrs': {'class': 'title'}})
    contents = extract_content(soup, {
        'name': 'div',
        'attrs': {
            'class': 'text show-more__control'
        }
    })

    return zip(titles, contents)


def create_next_request_params(soup: BeautifulSoup) -> Dict[str, Any]:
    # The div carrying `data-key` holds the key of the next page;
    # return an empty dict when it is missing (no further pages)
    pagination_div = soup.find('div', {'data-key': True})
    if pagination_div is None:
        return {}
    return {'ref_': 'tt_ql_urv', 'paginationKey': pagination_div['data-key']}


def fetch_reviews(pages: int = 10):
    # Holds all the extracted reviews
    reviews = []
    # Create a session for the requests
    session = requests.Session()
    # Fetch contents of the first page
    soup = fetch_soup(FIRST_PAGE_URL, session)
    reviews.extend(extract_reviews(soup))
    # Fetch contents of the following pages
    for _ in range(pages):
        params = create_next_request_params(soup)
        if not params:
            # No pagination key found: the last page has been reached
            break
        soup = fetch_soup(AJAX_PAGE_URL, session, params)
        reviews.extend(extract_reviews(soup))

    return reviews

In [13]:
reviews = fetch_reviews(80)
raw_reviews_df = pd.DataFrame(reviews, columns=['title', 'review'])
raw_reviews_df.to_csv('raw-reviews.csv')

# **Question 2**

(30 points). Write a python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all texts

(5) Stemming. 

(6) Lemmatization.

In [14]:
# Load the downloaded reviews
reviews_df = pd.read_csv('raw-reviews.csv')

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()


def remove_punctuation(text: str) -> str:
    chars_no_punct = [char for char in text if char not in string.punctuation]
    text_no_punct = "".join(chars_no_punct)

    return text_no_punct


def remove_numbers(text: str) -> str:
    return re.sub(r"[0-9]", "", text)


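def remove_stopwords(text: str) -> str:
    # Build the stopword set once per call; a set lookup is much faster than
    # reloading the stopword list for every word
    stop_words = set(stopwords.words("english"))
    words_no_stopwords = [
        word for word in text.split() if word not in stop_words
    ]
    text_no_stopwords = " ".join(words_no_stopwords)

    return text_no_stopwords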


def apply_stemmer(text: str) -> str:
    words_stemmed = [stemmer.stem(word) for word in text.split()]
    text_stemmed = " ".join(words_stemmed)

    return text_stemmed


def apply_lemmatizer(text: str) -> str:
    words_lemmatized = [lemmatizer.lemmatize(word) for word in text.split()]
    text_lemmatized = " ".join(words_lemmatized)

    return text_lemmatized


def clean_text(text: str) -> str:
    normalized_text = text.lower()
    text_no_punct = remove_punctuation(normalized_text)
    text_no_numbers = remove_numbers(text_no_punct)
    text_no_stopwords = remove_stopwords(text_no_numbers)
    # The stemmer has weird cases where it mangles a word (people -> peopl),
    # thus it is off by default
    # text_stemmed = apply_stemmer(text_no_stopwords)
    text_lemmatized = apply_lemmatizer(text_no_stopwords)

    return text_lemmatized

In [15]:
raw_reviews_df = pd.read_csv('raw-reviews.csv')
clean_reviews_df = raw_reviews_df[['title', 'review']].applymap(clean_text)
clean_reviews_df.to_csv('clean-reviews.csv')
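
Fused tokens such as `itand`, `itselfi`, and `mepossibly` show up in the parsing output further down. They most likely come from `remove_punctuation` deleting a punctuation mark that was the only separator between two words. The sketch below is one possible alternative, not the version used to produce the saved CSV: it replaces punctuation with a space instead. The name `remove_punctuation_spaced` and the sample strings are made up for illustration.

```python
import string

# Map every ASCII punctuation character to a space instead of deleting it
PUNCT_TO_SPACE = str.maketrans({char: " " for char in string.punctuation})


def remove_punctuation_spaced(text: str) -> str:
    spaced = text.translate(PUNCT_TO_SPACE)
    # Collapse the runs of whitespace that the replacement may create
    return " ".join(spaced.split())


print(remove_punctuation_spaced("story itself.I could talk"))
print(remove_punctuation_spaced("watch it...and overall"))
```

The trade-off is that contractions are split as well ("wasn't" becomes "wasn t"), so this variant would need a contraction-handling step if those fragments matter.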

# **Question 3**

(30 points). Write a python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes: 

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [16]:
clean_reviews_df = pd.read_csv('clean-reviews.csv')

## Parts Of Speech

In [17]:
PARTS_OF_SPEECH = ["NOUN", "VERB", "ADJ", "ADV"]


def tag_pos(text: str) -> List[Tuple[str, str]]:
    word_tokens = nltk.word_tokenize(text)
    pos = nltk.pos_tag(word_tokens, tagset="universal")

    return pos


def count_pos(tagged_words: List[Tuple[str, str]], pos: List[str]) -> Dict[str, int]:
    counts = Counter(tag for _, tag in tagged_words)
    # Filter out only the required pos
    counts = {tag: counts[tag] for tag in pos}

    return counts


def total_pos_counts(pos_counts: List[Dict[str, int]],
                     pos: List[str]) -> Dict[str, int]:
    total_counts = {}
    for pos_count in pos_counts:
        for part in pos:
            if part in total_counts:
                total_counts[part] = total_counts[part] + pos_count[part]
            else:
                total_counts[part] = pos_count[part]

    return total_counts


pos_counts = [
    count_pos(tag_pos(text), PARTS_OF_SPEECH)
    for text in clean_reviews_df['review'].to_list()
]
# Sum pos counts of all reviews
total_pos_counts(pos_counts, PARTS_OF_SPEECH)

{'ADJ': 29832, 'ADV': 11461, 'NOUN': 57267, 'VERB': 23968}
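
The totals above are obtained by adding up one count dictionary per review. As a minimal sketch of the same aggregation, `collections.Counter` can do the summing directly. The function name `sum_pos_counts` and the two sample dictionaries are made up for illustration and are not taken from the review data.

```python
from collections import Counter
from typing import Dict, Iterable, List


def sum_pos_counts(per_review_counts: Iterable[Dict[str, int]],
                   tags: List[str]) -> Dict[str, int]:
    total = Counter()
    for counts in per_review_counts:
        # Counter.update adds the incoming counts to the running totals
        total.update(counts)
    # Report only the requested tags; a missing tag counts as 0
    return {tag: total[tag] for tag in tags}


# Hypothetical per-review counts, in the shape returned by count_pos
sample_counts = [
    {"NOUN": 3, "VERB": 1, "ADJ": 2, "ADV": 0},
    {"NOUN": 5, "VERB": 2, "ADJ": 0, "ADV": 1},
]
totals = sum_pos_counts(sample_counts, ["NOUN", "VERB", "ADJ", "ADV"])
print(totals)
```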

## Constituency Parsing and Dependency Parsing

In [18]:
MAX_LENGTH = 200 

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('benepar', config={'model': 'benepar_en3'})


def parse_dependencies(docs: List):
    def make_relationship(doc):
        return [(token, token.dep_, token.head) for token in doc]

    relationships = [make_relationship(doc) for doc in docs]

    return relationships


def parse_constituents(docs: List):
    # Only the first sentence that spaCy detects in each review is parsed
    tree = [list(doc.sents)[0]._.parse_string for doc in docs]

    return tree

def trim_text(text: str, max_length: int) -> str:
    words = text.split()[:max_length]
    text = " ".join(words)

    return text

docs = [nlp(trim_text(review, MAX_LENGTH)) for review in clean_reviews_df['review'].to_list()]



### Dependency Parsing

In [19]:
print("Dependency Parsing")
dependencies = parse_dependencies(docs)
dependencies[:2] # show only a subset to prevent clutter

Dependency Parsing


[[(big, 'amod', fan),
  (fan, 'nsubj', saw),
  (many, 'amod', reason),
  (many, 'amod', reason),
  (marvel, 'amod', reason),
  (film, 'compound', reason),
  (reason, 'nsubj', saw),
  (saw, 'ROOT', saw),
  (one, 'nummod', daughter),
  (oldest, 'amod', daughter),
  (daughter, 'nsubj', insisted),
  (insisted, 'ccomp', saw),
  (watch, 'xcomp', insisted),
  (itand, 'nmod', movie),
  (overall, 'amod', movie),
  (impressed, 'amod', movie),
  (movie, 'dobj', watch),
  (though, 'mark', think),
  (think, 'advcl', saw),
  (enjoyable, 'amod', itselfi),
  (insane, 'amod', candy),
  (eye, 'compound', candy),
  (candy, 'compound', story),
  (story, 'compound', itselfi),
  (itselfi, 'nsubj', talk),
  (could, 'aux', talk),
  (talk, 'ccomp', think),
  (plot, 'npadvmod', noticed),
  (noticed, 'amod', say),
  (review, 'compound', ill),
  (film, 'compound', ill),
  (ill, 'compound', say),
  (say, 'dobj', talk),
  (looked, 'ccomp', think),
  (great, 'acomp', looked),
  (would, 'aux', seen),
  (great, 'advmo
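
Each triple in the output above reads as (word, relation, head). To see how the triples form a tree, they can be grouped by head word. The sketch below uses the first eight triples of that output written as plain strings; `group_by_head` is a hypothetical helper, not part of spaCy.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# The first triples of the output above, written as plain strings
TRIPLES = [
    ("big", "amod", "fan"),
    ("fan", "nsubj", "saw"),
    ("many", "amod", "reason"),
    ("many", "amod", "reason"),
    ("marvel", "amod", "reason"),
    ("film", "compound", "reason"),
    ("reason", "nsubj", "saw"),
    ("saw", "ROOT", "saw"),
]


def group_by_head(
        triples: List[Tuple[str, str, str]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Map each head word to its (dependent, relation) pairs."""
    children = defaultdict(list)
    for word, relation, head in triples:
        # The root points at itself, so it is nobody's dependent
        if relation != "ROOT":
            children[head].append((word, relation))
    return dict(children)


tree = group_by_head(TRIPLES)
for head, dependents in tree.items():
    print(head, "<-", dependents)
```

Keying by word text is a simplification: repeated words in a review collapse into one entry. With real spaCy tokens, the token index `token.i` would be a safer key.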

### Constituency Parsing

In [20]:
print("Constituency Parsing")
constituents = parse_constituents(docs)
constituents[:2] # show only a subset to prevent clutter

Constituency Parsing


['(S (NP (NP (JJ big) (NN fan)) (NP (JJ many) (JJ many) (NN marvel) (NN film) (NN reason))) (VP (VBD saw) (S (NP (CD one) (JJS oldest) (NN daughter)) (VP (VBD insisted) (S (VP (VB watch)))))) (CC itand) (RB overall) (VBN impressed) (NP (NN movie)) (IN though) (S (VP (VB think) (NP (ADJP (JJ enjoyable)) (NML (JJ insane) (NML (NN eye) (NN candy))) (NN story)))) (NP (PRP itselfi)) (VP (MD could) (VP (VB talk) (NP (NN plot)) (VP (VBN noticed) (SBAR (S (NP (NN review) (NN film)) (ADVP (RB ill)) (VB say) (SBAR (S (VBD looked) (ADJP (ADJP (JJ great)) (MD would) (ADJP (JJ great))) (VBN seen) (NP (NML (JJ big) (NN screen)) (NN story)) (SBAR (IN though) (S (ADJP (JJ enjoyable)) (VP (VBD was) (RB nt) (NP (NP (PRP one)) (VP (ADJP (ADJP (RB particularly) (VBN grabbed)) (VBN impacted) (ADJP (RB mepossibly) (JJ violent))) (JJ frenetic) (NN film) (NN consequence) (VP (VBD lacked) (NP (NN intimacy) (NN humanity))))))))))))))))',
 '(SINV (JJ okay) (NN movie) (FW simu) (FW liu) (NN son) (ADJP (ADJP (RB s
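
The bracketed strings above nest phrases inside phrases. As a rough sketch of how to read them, the helper below walks a bracketed string and collects the words under every node with a given label. `extract_phrases` is a hypothetical helper, and `EXAMPLE` is a shortened form of the first tree above, cut off after `saw` and closed by hand.

```python
import re
from typing import List


def extract_phrases(parse_string: str, label: str) -> List[str]:
    """Return the words under every node with the given label,
    in the order in which the nodes' brackets close."""
    tokens = re.findall(r"\(|\)|[^\s()]+", parse_string)
    stack = []  # open nodes as [label, words collected so far]
    found = []
    position = 0
    while position < len(tokens):
        token = tokens[position]
        if token == "(":
            # The token right after "(" is the node label
            stack.append([tokens[position + 1], []])
            position += 2
        elif token == ")":
            node_label, words = stack.pop()
            if node_label == label:
                found.append(" ".join(words))
            if stack:
                # Pass the words up so the parent phrase contains them too
                stack[-1][1].extend(words)
            position += 1
        else:
            stack[-1][1].append(token)
            position += 1
    return found


EXAMPLE = ("(S (NP (NP (JJ big) (NN fan)) (NP (JJ many) (JJ many) (NN marvel) "
           "(NN film) (NN reason))) (VP (VBD saw)))")
noun_phrases = extract_phrases(EXAMPLE, "NP")
print(noun_phrases)
```

The sketch assumes every opening bracket is followed by a label, as in the strings above. `nltk.Tree.fromstring` can parse the same bracketed format if a full tree object is needed.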

## Named Entity Recognition

In [21]:
def recognize_entities(docs: List):
    def extract_entities(doc):
        return [(ent.text, ent.label_) for ent in doc.ents]

    return [extract_entities(doc) for doc in docs]


entities = recognize_entities(docs)
entities[:2] # show only a subset to prevent clutter

[[('one', 'CARDINAL'), ('one', 'CARDINAL')],
 [('ten', 'CARDINAL'),
  ('tony leung', 'PERSON'),
  ('michelle', 'PERSON'),
  ('last couple decade', 'DATE'),
  ('twenty minute', 'QUANTITY'),
  ('chinese', 'NORP'),
  ('ben kingsley', 'PERSON'),
  ('mandarin iron', 'LOC')]]
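
The question also asks for the count of each entity, which the cell above does not compute. A minimal sketch of that step follows, assuming the nested list-of-tuples shape that `recognize_entities` returns. The `ENTITIES` sample copies the two reviews printed above, and `count_entities` is a hypothetical helper.

```python
from collections import Counter
from typing import List, Tuple

# The two reviews printed above, copied as sample input
ENTITIES = [
    [("one", "CARDINAL"), ("one", "CARDINAL")],
    [("ten", "CARDINAL"), ("tony leung", "PERSON"), ("michelle", "PERSON"),
     ("last couple decade", "DATE"), ("twenty minute", "QUANTITY"),
     ("chinese", "NORP"), ("ben kingsley", "PERSON"),
     ("mandarin iron", "LOC")],
]


def count_entities(
        entities_per_review: List[List[Tuple[str, str]]]
) -> Tuple[Counter, Counter]:
    entity_counts = Counter()  # keyed by (text, label)
    label_counts = Counter()   # keyed by label only
    for review_entities in entities_per_review:
        entity_counts.update(review_entities)
        label_counts.update(label for _, label in review_entities)
    return entity_counts, label_counts


entity_counts, label_counts = count_entities(ENTITIES)
print(entity_counts.most_common(3))
print(label_counts)
```

Passing the full `entities` list instead of the sample would give the counts over all reviews.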

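A constituency parse tree breaks a sentence into nested phrases (constituents), such as noun phrases (NP) and verb phrases (VP), down to the part-of-speech tag of each word. A dependency tree instead links each word directly to its head word with a labeled grammatical relation, such as `amod` (adjectival modifier) or `nsubj` (nominal subject).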

As an example, the constituency tree of the first review begins:
```
'(S (NP (NP (JJ big) (NN fan)) (NP (JJ many) (JJ many) (NN marvel) (NN film) (NN reason))) (VP (VBD saw) (S (NP (CD one) (JJS oldest) (NN daughter)) (VP (VBD insisted) (S (VP (VB watch)))))) (CC itand) (RB overall) (VBN impressed) (NP (NN movie)) (IN though), ...)
```
Its dependency tree begins:
```
[(big, 'amod', fan),
  (fan, 'nsubj', saw),
  (many, 'amod', reason),
  (many, 'amod', reason),
  (marvel, 'amod', reason),
  (film, 'compound', reason),...]
```
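A constituency tree is thus useful for inspecting the full phrase structure of the text, since it is information-rich: `(NP (JJ big) (NN fan))`, for instance, groups "big" and "fan" into a single noun phrase. A dependency tree is useful when concise, word-to-word information is needed: the triple `(big, 'amod', fan)` states only that "big" modifies "fan".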