## Homework 5
*Author: Puri Rudick*

##### 1. Compile a list of static links (permalinks) to individual user movie reviews from one particular website. This will be your working dataset for this assignment, as well as for assignments 7 and 8.   
- It does not matter if you use a crawler or if you manually collect the links, but you will need at least 100 movie review links. Note that, as of this writing, the robots.txt file of IMDB.com allows the crawling of user reviews.
- Each link should be to a web page that has only one user review of only one movie, e.g., the user review permalinks on the IMDB site.
- Choose reviews of movies that are all in the same genre, e.g., sci-fi, mystery, romance, superhero, etc.  
- Make sure your collection includes reviews of several movies in your chosen genre and that it includes a mix of negative and positive reviews.  


In [83]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import nltk

In [80]:
movies = {
    'Avengers: Endgame': 'tt4154796',
    'Shang-Chi and the Legend of the Ten Rings': 'tt9376612',
    'Doctor Strange in the Multiverse of Madness': 'tt9419884',
    'Guardians of the Galaxy Vol. 2': 'tt3896198',
    'Spider-Man: No Way Home': 'tt10872600'
}

I picked five movies in the adventure, fantasy, and superhero genre, all from Marvel Studios.

I obtained each movie's title ID from the IMDB website and stored them in the dictionary above.

---
##### 2. Extract noun phrase (NP) chunks from your reviews using the following procedure:
- In Python, use BeautifulSoup to grab the main review text from each link.  
- Next run each review text through a tokenizer, and then try to NP-chunk it with a shallow parser. 
- You probably will have too many unknown words, owing to proper names of characters, actors, and so on that are not in your working dictionary. Make sure the main names that are relevant to the movies in your collection of reviews are added to the working lexicon, and then run the NP chunker again.
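
One way to approach the last step is to retag known names before chunking. The sketch below is only an illustration and is not part of the original notebook: `NAME_LEXICON` and `retag_known_names` are hypothetical names, and it assumes the input is a list of `(word, tag)` pairs such as those returned by `nltk.pos_tag`.

```python
# Hypothetical lexicon of names relevant to the chosen movies (lowercased).
NAME_LEXICON = {"thanos", "shang-chi", "wanda", "groot", "spider-man"}


def retag_known_names(tagged_sentence, lexicon=NAME_LEXICON):
    """Force the proper-noun tag (NNP) on any token found in the lexicon.

    tagged_sentence is a list of (word, tag) pairs; every other token
    keeps the tag it already had.
    """
    return [
        (word, "NNP") if word.lower() in lexicon else (word, tag)
        for word, tag in tagged_sentence
    ]


# Example: a lowercase name that a tagger might label as a common noun.
tagged = [("thanos", "NN"), ("was", "VBD"), ("a", "DT"),
          ("great", "JJ"), ("villain", "NN")]
print(retag_known_names(tagged))
```

The retagged sentence could then be passed to the chunker in place of the raw tagger output, so that a rule such as `{<NNP>+}` picks up the name.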


In [81]:
def get_movie_reviews(movie_name, movie_tt):
    # Build the AJAX review endpoint for this title; `{}` takes the pagination key
    url_text = 'https://www.imdb.com/title/' + movie_tt
    url = (url_text + "/reviews/_ajax?ref_=undefined&paginationKey={}")
    key = ""
    data = {"movie_name":[], "review_title": [], "review_txt": []}

    # Request at most 10 pages of reviews per movie
    for i in range(0,10):
        response = requests.get(url.format(key))
        soup = BeautifulSoup(response.content, "html.parser")

        # Collect the reviews on this page before looking for the next page,
        # so that the final page is not skipped
        for title, review in zip(soup.find_all(class_="title"), soup.find_all(class_="text show-more__control")):
            data["movie_name"].append(movie_name)
            data["review_title"].append(title.get_text(strip=True))
            data["review_txt"].append(review.get_text())

        # Find the pagination key; stop when there are no more pages
        pagination_key = soup.find("div", class_="load-more-data")
        if not pagination_key:
            break

        # Update the `key` variable in order to scrape more reviews
        key = pagination_key["data-key"]

    review = pd.DataFrame(data)
    return review


In [82]:
df = pd.DataFrame()
for m in movies:
    review = get_movie_reviews(m, movies[m])
    df = pd.concat([df, review])

df

| | movie_name | review_title | review_txt |
|---|---|---|---|
| 0 | Avengers: Endgame | Not as good as infinity war.. | But its a pretty good film. A bit of a mess in... |
| 1 | Avengers: Endgame | Crazy in every sense | This film is an emotional rollercoaster with s... |
| 2 | Avengers: Endgame | Not as good as infinity war but a great movie | Rating: 8.6Not as good as Infinity war pacing-... |
| 3 | Avengers: Endgame | Time travel is such a lazy way to write stories | Only a month or so back I was talking to a fri... |
| 4 | Avengers: Endgame | The writers got carried away, the directors ov... | I've just come from watching Endgame and I mus... |
| ... | ... | ... | ... |
| 245 | Spider-Man: No Way Home | Pretty Darn Good | Peter Parker, outed as Spider-Man, and framed ... |
| 246 | Spider-Man: No Way Home | 90 mins in and already the best spiderman movi... | I won't spoil anything, but this story is very... |
| 247 | Spider-Man: No Way Home | Best movie I've ever seen. | Yes, it doesn't make sense to only bring 2 spi... |
| 248 | Spider-Man: No Way Home | A soft reset movie that traps audience with no... | This movie is like the JW:Fallen kingdom for T... |


In [84]:
# Returns the NP chunks found in one POS-tagged sentence
def getMovieReviewTags(reviewSentence):
    nps = []
    # NOTE: For purposes of the exercise I'm treating
    # the grammar variable as a global so that the regular
    # expression patterns can be changed as needed
    cp = nltk.RegexpParser(grammar)
    tree = cp.parse(reviewSentence)
    # Loop through the subtrees produced and pull out only the
    # NP subtrees
    for subtree in tree.subtrees():
        if subtree.label() == 'NP':
            t = subtree
            t = ' '.join(word for word, tag in t.leaves())
            nps.append(t)
    return nps

# Returns the NP chunks of a review, as one list per sentence
def processReviewText(document):
    sentences = nltk.sent_tokenize(document)                       # split into sentences
    sentences = [nltk.word_tokenize(sent) for sent in sentences]   # split into tokens
    sentences = [nltk.pos_tag(sent) for sent in sentences]         # POS-tag each token
    sentences = [getMovieReviewTags(sent) for sent in sentences]   # keep only NP chunks
    return sentences


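# Initial grammar. An NP is any of:
#   1. an optional determiner/possessive, any number of adjectives, then a noun
#   2. a sequence of proper nouns
#   3. two consecutive nouns
# NOTE: PP$ is the Brown-corpus possessive tag; nltk.pos_tag uses the Penn Treebank
# tag PRP$ by default, so the PP\$ alternative is not expected to match here.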
grammar = r'''
     NP: {<DT|PP\$>?<JJ>*<NN>}  
         {<NNP>+}               
         {<NN><NN>}               
    '''

df['proc_review'] = df['review_txt'].apply(processReviewText)
df.head()

| | movie_name | review_title | review_txt | proc_review |
|---|---|---|---|---|
| 0 | Avengers: Endgame | Not as good as infinity war.. | But its a pretty good film. A bit of a mess in... | [[good film], [A bit, a mess, effortless feel,... |
| 1 | Avengers: Endgame | Crazy in every sense | This film is an emotional rollercoaster with s... | [[This film, an emotional rollercoaster, super... |
| 2 | Avengers: Endgame | Not as good as infinity war but a great movie | Rating: 8.6Not as good as Infinity war pacing-... | [[Infinity, war, pacing-wise, the saga], [High... |
| 3 | Avengers: Endgame | Time travel is such a lazy way to write stories | Only a month or so back I was talking to a fri... | [[a month, a friend, time, travel], [no proble... |
| 4 | Avengers: Endgame | The writers got carried away, the directors ov... | I've just come from watching Endgame and I mus... | [[Endgame, Civil War, Infinity War], [], [Endg... |


In [86]:
# Second grammar: broader patterns that also cover plural and proper nouns,
# including nouns joined by a preposition (IN) or a conjunction (CC)
grammar = """
    NP:    {<DT><WP><VBP>*<RB>*<VBN><IN><NN>}
           {<NN|NNS|NNP|NNPS><IN>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS><CC>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS>+}
    """   
df['proc_review2'] = df['review_txt'].apply(processReviewText)
df.head()

| | movie_name | review_title | review_txt | proc_review | proc_review2 |
|---|---|---|---|---|---|
| 0 | Avengers: Endgame | Not as good as infinity war.. | But its a pretty good film. A bit of a mess in... | [[good film], [A bit, a mess, effortless feel,... | [[good film], [bit, mess, parts, feel infinity... |
| 1 | Avengers: Endgame | Crazy in every sense | This film is an emotional rollercoaster with s... | [[This film, an emotional rollercoaster, super... | [[film, emotional rollercoaster, superhero plo... |
| 2 | Avengers: Endgame | Not as good as infinity war but a great movie | Rating: 8.6Not as good as Infinity war pacing-... | [[Infinity, war, pacing-wise, the saga], [High... | [[Infinity war pacing-wise, saga], [production... |
| 3 | Avengers: Endgame | Time travel is such a lazy way to write stories | Only a month or so back I was talking to a fri... | [[a month, a friend, time, travel], [no proble... | [[month, friend, serious movies, time travel, ... |
| 4 | Avengers: Endgame | The writers got carried away, the directors ov... | I've just come from watching Endgame and I mus... | [[Endgame, Civil War, Infinity War], [], [Endg... | [[Endgame, Civil War, Infinity War], [films], ... |


I obtained the first 250 user reviews for each movie and combined them into a single dataframe.

I then used the processReviewText function to tokenize each review, run it through the NP parser, and return only the NP subtree values, which I added to a new column called proc_review.

I then changed the grammar to add additional patterns to see whether it would yield different results, and stored the output in a new dataframe column called proc_review2.
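
To make the effect of a tag pattern more concrete, here is a simplified, pure-Python stand-in for how a single chunk rule can be matched against a tagged sentence. It is a sketch, not NLTK's implementation: `tag_pattern_to_regex` and `chunk` are hypothetical helpers, and the tagged sentence is hand-written using Penn Treebank tags. It also illustrates why the possessive tag in a rule has to match the tagset actually in use.

```python
import re


def tag_pattern_to_regex(tag_pattern):
    """Wrap each <...> element so that ?, * and + apply to the whole tag."""
    return re.sub(r"<([^>]+)>", r"(?:<(?:\1)>)", tag_pattern)


def chunk(tagged, tag_pattern):
    """Return the word sequences whose tags match tag_pattern, left to right."""
    regex = re.compile(tag_pattern_to_regex(tag_pattern))
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)

    # Character offset at which each token's tag starts in tag_string.
    offsets, pos = [], 0
    for _, tag in tagged:
        offsets.append(pos)
        pos += len(tag) + 2  # the tag plus its two angle brackets

    chunks, i = [], 0
    while i < len(tagged):
        m = regex.match(tag_string, offsets[i])
        if m and m.end() > m.start():
            n = m.group(0).count("<")  # number of tags consumed
            chunks.append(" ".join(word for word, _ in tagged[i:i + n]))
            i += n
        else:
            i += 1
    return chunks


# Hand-tagged example sentence (Penn Treebank tags).
tagged = [("my", "PRP$"), ("favorite", "JJ"), ("movie", "NN"),
          ("is", "VBZ"), ("a", "DT"), ("classic", "NN")]

print(chunk(tagged, r"<DT|PP\$>?<JJ>*<NN>"))   # ['favorite movie', 'a classic']
print(chunk(tagged, r"<DT|PRP\$>?<JJ>*<NN>"))  # ['my favorite movie', 'a classic']
```

With the Brown-style `PP\$` tag the possessive is left out of the chunk, whereas `PRP\$` lets the rule capture "my favorite movie" as a single NP.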

---
##### 3. Output all the chunks in a single list for each review, and submit that output for this assignment. Also submit a brief written summary of what you did (describe your selection of genre, your source of reviews, how many you collected, and by what means).


In [88]:
df.to_csv('movies_reviews')
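
The task asks for a single list of chunks per review, while processReviewText returns one list per sentence. A minimal sketch of how the nested lists could be flattened before export is shown below; `flatten_chunks` is a hypothetical helper that is not in the original notebook, and the sample input is made up for illustration.

```python
from itertools import chain


def flatten_chunks(per_sentence_chunks):
    """Collapse a list of per-sentence chunk lists into one flat list."""
    return list(chain.from_iterable(per_sentence_chunks))


# Illustrative input shaped like one row of the proc_review column.
example = [["Endgame", "Civil War", "Infinity War"], [], ["films"]]
print(flatten_chunks(example))  # ['Endgame', 'Civil War', 'Infinity War', 'films']
```

Under these assumptions it could be applied to a chunk column with something like `df['proc_review'].apply(flatten_chunks)` before writing the CSV.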

For this analysis:
1. I pulled the first 250 user reviews for each of 5 movies in the adventure, fantasy, and superhero genre, all from Marvel Studios, for a total of 1,250 reviews. *The review data was pulled on 06/26/2022.*
    - Avengers: Endgame
    - Shang-Chi and the Legend of the Ten Rings
    - Doctor Strange in the Multiverse of Madness
    - Guardians of the Galaxy Vol. 2
    - Spider-Man: No Way Home
2. I ran each review through an NP chunker (shallow parsing) twice, using two grammars with different regular-expression patterns.
3. The parsing results show that both parsers captured NPs quite well, except when a review lacks proper sentence breaks (for example, `Rating: 8.6Not as good as Infinity war`).