## Homework 5
*Author: Puri Rudick*

##### 1. Compile a list of static links (permalinks) to individual user movie reviews from one particular website. This will be your working dataset for this assignment, as well as for assignments 7 and 8.   
- It does not matter if you use a crawler or if you manually collect the links, but you will need at least 100 movie review links. Note that, as of this writing, the robots.txt file of IMDB.com allows the crawling of user reviews.
- Each link should be to a web page that has only one user review of only one movie, e.g., the user review permalinks on the IMDB site.
- Choose reviews of movies that are all in the same genre, e.g., sci-fi, mystery, romance, superhero, etc.  
- Make sure your collection includes reviews of several movies in your chosen genre and that it includes a mix of negative and positive reviews.  


In [1]:
import re
import nltk
from nltk import pos_tag
from nltk.tag import UnigramTagger
from nltk.corpus import brown
from nltk import word_tokenize

In [31]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import itertools

In [28]:
def urlToSoup(url): 
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup


def moviesLink(movieSoup):
    movie_tags = movieSoup.find_all('a', attrs={'class': None})
    movie_tags = [tag.attrs['href'] for tag in movie_tags 
                  if tag.attrs['href'].startswith('/title') & tag.attrs['href'].endswith('/')]
    # remove duplicate links
    movie_tags = list(dict.fromkeys(movie_tags))
    return movie_tags

def ReviewLinks(movie_tags):
    movie_links = [base_url + tag + 'reviews' for tag in movie_tags]
    return movie_links

In [30]:
# The url is a IMDB web page. Displaying first 50 movies with the filters of:
# Feature Film
# Superhero
# No more than 2021 Release year or range
# Sorted by Popularity

url = '''https://www.imdb.com/search/keyword/?keywords=superhero&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=a581b14c-5a82-4e29-9cf8-54f909ced9e1&pf_rd_r=HP5TEQ42K8KKB6H9G3TM&
pf_rd_s=center-5&pf_rd_t=15051&pf_rd_i=genre&ref_=kw_ref_yr&mode=detail&page=1&title_type=movie&sort=moviemeter,asc&release_date=%2C2021'''

movies_soup = urlToSoup(url)
movie_links = moviesLink(movies_soup)
review_links = ReviewLinks(movie_links)

print("There are a total of " + str(len(movie_links)) + " movies display on this web page.")
print("First 10 superhero movies, wich released before 2021 from IMDB reviews links:")
review_links[:10]


There are a total of 50 movies display on this web page.
First 10 superhero movies, wich released before 2021 from IMDB reviews links:


['https://www.imdb.com/title/tt10872600/reviews',
 'https://www.imdb.com/title/tt4154796/reviews',
 'https://www.imdb.com/title/tt9032400/reviews',
 'https://www.imdb.com/title/tt0468569/reviews',
 'https://www.imdb.com/title/tt1477834/reviews',
 'https://www.imdb.com/title/tt7097896/reviews',
 'https://www.imdb.com/title/tt6334354/reviews',
 'https://www.imdb.com/title/tt4154756/reviews',
 'https://www.imdb.com/title/tt3501632/reviews',
 'https://www.imdb.com/title/tt2015381/reviews']

In [32]:
# Function that returns the index of negative and positive review.
def minMax(a):   
    # get the index of least rated user review
    minpos = a.index(min(a))
    # get the index of highest rated user review
    maxpos = a.index(max(a))
    return minpos, maxpos

# Function returns a negative and positive review for each movie.
def getMovieReviews(soup):    
    # get a list of user ratings
    user_review_ratings = [tag.previous_element for tag in 
                           soup.find_all('span', attrs={'class': 'point-scale'})]
    
    # find the index of negative and positive review
    n_index, p_index = minMax(list(map(int, user_review_ratings)))
    
    # get the review tags
    user_review_list = soup.find_all('a', attrs={'class':'title'})
    
    # get the negative and positive review tags
    n_review_tag = user_review_list[n_index]
    p_review_tag = user_review_list[p_index]
    
    # return the negative and positive review link
    n_review_link = base_url + n_review_tag['href']
    p_review_link = base_url + p_review_tag['href']
    
    return n_review_link, p_review_link

In [34]:
# get all movie review links
movie_review_list = [getMovieReviews(movie_soup) for movie_soup in movies_soup]

movie_review_list = list(itertools.chain(*movie_review_list))

# For a check make a listing of the review URLs
print("There are a total of " + str(len(movie_review_list)) + " individual movie reviews")
print("Displaying 10 reviews")
movie_review_list[:10]

There are a total of 100 individual movie reviews
Displaying 10 reviews


['https://www.imdb.com/review/rw7648266/',
 'https://www.imdb.com/review/rw8021381/',
 'https://www.imdb.com/review/rw5066919/',
 'https://www.imdb.com/review/rw7585695/',
 'https://www.imdb.com/review/rw7758165/',
 'https://www.imdb.com/review/rw7535063/',
 'https://www.imdb.com/review/rw1917099/',
 'https://www.imdb.com/review/rw5478826/',
 'https://www.imdb.com/review/rw5232016/',
 'https://www.imdb.com/review/rw4561489/']

In [None]:
def get_review_from_url(url):
    html = get_txt(url)
    tags = BeautifulSoup(html, 'html.parser', parse_only=SoupStrainer(strain))
    review = clean_txt(tags.text)
    return review

def get_review_from_site(url):
    reviews = []

    reviews_home_text = get_txt(url)
    all_links = get_links_from(reviews_home_text)
    links = get_links(all_links)

    review_urls = get_review_urls(links)
    for url in review_urls:
        reviews.append(get_review_from_url(url))
    return reviews


def get_reviews_from_all_sites(url_list):
    all_reviews = []
    for review in review_links:
        review_url = review_links[review]
        all_reviews = all_reviews + get_review_from_site(review_url)
    return all_reviews

In [6]:
# function to build the list of movie review links
def buildReviewLinks(movie_tags):
    movie_links = [base_url + tag + 'reviews' for tag in movie_tags]
    return movie_links


# build out the list of reviews
base_url = "https://www.imdb.com"
review_links = buildReviewLinks(movie_tags)

print("There are a total of " + str(len(review_links)) + " movie user reviews")
print("Displaying 10 user reviews links")
review_links[:10]

There are a total of 50 movie user reviews
Displaying 10 user reviews links


['https://www.imdb.com/title/tt10872600/reviews',
 'https://www.imdb.com/title/tt4154796/reviews',
 'https://www.imdb.com/title/tt9032400/reviews',
 'https://www.imdb.com/title/tt0468569/reviews',
 'https://www.imdb.com/title/tt1477834/reviews',
 'https://www.imdb.com/title/tt7097896/reviews',
 'https://www.imdb.com/title/tt6334354/reviews',
 'https://www.imdb.com/title/tt4154756/reviews',
 'https://www.imdb.com/title/tt3501632/reviews',
 'https://www.imdb.com/title/tt2015381/reviews']

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup


url = (
    "https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey={}"
)
key = ""
data = {"title": [], "review": []}

while True:
    response = requests.get(url.format(key))
    soup = BeautifulSoup(response.content, "html.parser")
    # Find the pagination key
    pagination_key = soup.find("div", class_="load-more-data")
    if not pagination_key:
        break

    # Update the `key` variable in-order to scrape more reviews
    key = pagination_key["data-key"]
    for title, review in zip(
        soup.find_all(class_="title"), soup.find_all(class_="text show-more__control")
    ):
        data["title"].append(title.get_text(strip=True))
        data["review"].append(review.get_text())

df = pd.DataFrame(data)
print(df)


                                                  title  \
0                                      A step backwards   
1                          Teenage Spider-Man in Europe   
2                            Far better from Homecoming   
3                              Spider man Far from home   
4     Spidey is back in form with this new and one o...   
...                                                 ...   
2320                              Last feelings said!!!   
2321                        Jake Gyllenhaal? Seriously?   
2322                                            Peurile   
2323                        It Just killed ironman More   
2324      An unfavorable start of the post-Endgame era.   

                                                 review  
0     And so my long Marvel-watching journey comes t...  
1     "Spider-Man: Far from Home" is a typical film ...  
3     Spider man far from home\n2019\n12A\ndirector:...  
4     This film is even better than Homecoming. It t...  
.

---
##### 2. Extract noun phrase (NP) chunks from your reviews using the following procedure:
- In Python, use BeautifulSoup to grab the main review text from each link.  
- Next run each review text through a tokenizer, and then try to NP-chunk it with a shallow parser. 
- You probably will have too many unknown words, owing to proper names of characters, actors, and so on that are not in your working dictionary. Make sure the main names that are relevant to the movies in your collection of reviews are added to the working lexicon, and then run the NP chunker again.


---
##### 3. Output all the chunks in a single list for each review, and submit that output for this assignment. Also submit a brief written summary of what you did (describe your selection of genre, your source of reviews, how many you collected, and by what means).


My news sentence is from this [link](https://www.nbcnews.com/tech/internet/internet-explorers-run-finally-comes-end-rcna33628 "Title"). It talks about how Microsoft Edge will completely replace Internet Explorer.

My manual tagging will be:
'As':IN, 'of':IN, 'Wednesday':NNP, 'Microsoft':NNP, 'will':MD, 'no':RB, 'longer':RBR, 'support':VB, 'the':DT, 'once':RB, 'dominant':JJ 'browser':NN, 'that':WDT, 'legions':NNS, 'of':IN, 'web':NN, 'surfers':NNS, 'loved':VBD, 'to':TO, 'hate':VB, 'and':CC, 'a':DT, 'few':JJ, 'still':RB, 'claim':VBP, 'to':TO, 'adore':VB

The pos_tag and spaCy taggers did a good job for tagging this sentence. The results from both are relatively the same as what I, a human, tagged. The UnigramTagger did a poor job on this sentence. As you can see that some words like 'Microsoft', 'browser', and 'surfers' do not get any tags. I'm not sure if these words are considered as quite new words since they started to be used after internet has become popular.