# Cameron Stewart
# HW 5

## 1.	Compile a list of static links (permalinks) to individual user movie reviews from one particular website. This will be your working dataset for this assignment, as well as for assignments 7 and 8 
- It does not matter if you use a crawler or if you manually collect the links, but you will need at least 100 movie review links. Note that, as of this writing, the robots.txt file of IMDB.com allows the crawling of user reviews.
- Each link should be to a web page that has only one user review of only one movie, e.g., the user review permalinks on the IMDB site.
- Choose reviews of movies that are all in the same genre, e.g., sci-fi, mystery, romance, superhero, etc.  
- Make sure your collection includes reviews of several movies in your chosen genre and that it includes a mix of negative and positive reviews. 

In [1]:
#Load Required Libraries
from bs4 import BeautifulSoup
import nltk
import requests
import random
import spacy

The base URL is from an advanced search that filtered for the following items on IMDb:
- Feature Film
- Marvel comics
- Superhero
- At least 7.0 IMDb user rating (average)
- 2015-2021 Release year or range

In addition, I filtered in code to keep only movies with more than 3,000 votes, so that each film was relevant and had enough reviews.

After gathering the movie_id of every film that met the above criteria, I built links to each film's reviews page. I created two links per movie: one that lists the reviews in ascending order of rating and another that lists them in descending order, to ensure a mix of positive and negative reviews.

In [2]:
# Downloading html for movie list from filter
url = 'https://www.imdb.com/search/keyword/?keywords=marvel-comics%2Csuperhero&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=a581b14c-5a82-4e29-9cf8-54f909ced9e1&pf_rd_r=DFH913Q4P4WKBQ2SK3ZW&pf_rd_s=center-5&pf_rd_t=15051&pf_rd_i=genre&ref_=kw_ref_rt_usr&sort=num_votes,desc&mode=detail&page=1&title_type=movie&user_rating=7.0%2C&release_date=2015%2C2021'
response = requests.get(url)

#Put into soup format
soup = BeautifulSoup(response.text, 'lxml')

#Get vote count for each movie on the page (skip the dollar amounts that share the same span name)
spans = soup.find_all('span',  {'name': 'nv'})
vote_count=[int(x.get_text().replace(',', '')) for x in spans if x.get_text()[0]!='$']

#Set vote count threshold for movies to include in the study
vote_count_boolean=[x>3000 for x in vote_count]

#Grab movie titles of all movies on page
movie_headers=soup.find_all('h3', class_='lister-item-header')
movie_titles=[x.a.contents[0] for i,x in enumerate(movie_headers) if vote_count_boolean[i]]           

#Get movie ids to link to the review websites
movie_ribbon=soup.find_all('div',class_='lister-item-image ribbonize')
movie_id=[x.get('data-tconst') for i,x in enumerate(movie_ribbon) if vote_count_boolean[i]]

#Create website links for each movie's review list
movie_review_links=['https://www.imdb.com/title/'+x+'/reviews' for x in movie_id]
len(movie_review_links) #count of movies in study

#Review site with high reviews at top
high_movie_review_links=[x+'?sort=userRating&dir=desc&ratingFilter=0' for x in movie_review_links]

#Review site with low reviews at top
low_movie_review_links=[x+'?sort=userRating&dir=asc&ratingFilter=0' for x in movie_review_links]

#Combine high and low review page links
all_review_page_links=high_movie_review_links+low_movie_review_links

print('Genre of focus in Marvel Movies')
print('Movies being analyzed that meet criteria:',movie_titles)
print('Number of movies being analyzed that meet criteria:',len(movie_titles))
print('Number of review page links gathered:',len(all_review_page_links))

Genre of focus in Marvel Movies
Movies being analyzed that meet criteria: ['Avengers: Endgame', 'Avengers: Infinity War', 'Deadpool', 'Avengers: Age of Ultron', 'Captain America: Civil War', 'Logan', 'Black Panther', 'Thor: Ragnarok', 'Doctor Strange', 'Guardians of the Galaxy Vol. 2', 'Ant-Man', 'Spider-Man: Homecoming', 'Deadpool 2', 'Spider-Man: No Way Home', 'Spider-Man: Into the Spider-Verse', 'Spider-Man: Far from Home', 'Ant-Man and the Wasp', 'Shang-Chi and the Legend of the Ten Rings']
Number of movies being analyzed that meet criteria: 18
Number of review page links gathered: 36
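
As a small illustration of how the two sorted review-page links are put together, here is a minimal sketch that builds the same query string with `urllib.parse.urlencode` instead of string concatenation. The helper name `review_page_urls` and the ID `tt0000000` are placeholders made up for this example; only the URL pattern and the query parameters come from the cell above.

```python
from urllib.parse import urlencode

# URL pattern taken from the notebook cell above
BASE = "https://www.imdb.com/title/{movie_id}/reviews"


def review_page_urls(movie_id):
    """Return (descending, ascending) review-page URLs for one movie ID."""
    base = BASE.format(movie_id=movie_id)
    urls = []
    for direction in ("desc", "asc"):
        # Same three parameters the notebook appends by hand
        query = urlencode(
            {"sort": "userRating", "dir": direction, "ratingFilter": 0}
        )
        urls.append(f"{base}?{query}")
    return tuple(urls)


# "tt0000000" is a placeholder ID, not one of the 18 films
high, low = review_page_urls("tt0000000")
print(high)
print(low)
```

Building the query from a dictionary keeps the parameter names in one place, which makes it easier to change the sort order later.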


Next, I stepped through each review page and scraped the link to every individual review whose body has more than 10 characters, to ensure each review has some content to analyze. I also scraped the body text of each of those reviews.

In [3]:
#Grab links to individual reviews and the review text, keeping only reviews with more than 10 characters
all_review_links=[]
all_reviews=[]
for i in all_review_page_links:
    #Go to review page for each movie
    response2=requests.get(i)
    soup2 = BeautifulSoup(response2.text, 'lxml')

    #Check whether each review body has more than 10 characters
    review_body=soup2.find_all('div',class_='content')
    review_body_clean=[x.div.get_text().replace('\n',' ') for x in review_body]
    review_body_length_check=[len(x)>10 for x in review_body_clean]

    #Grab link to individual review
    review_header=soup2.find_all('a',class_='title')
    ind_review_link=['https://www.imdb.com'+x.get('href') for i,x in enumerate(review_header) if review_body_length_check[i]]

    #Paste review link and review text to lists
    all_review_links.extend(ind_review_link)
    all_reviews.extend([x for i,x in enumerate(review_body_clean) if review_body_length_check[i]])
    
print("Count of reviews collected that meet criteria:",len(all_review_links))
print('Example individual review link',all_review_links[2])
print('Example Review:',all_reviews[2])

Count of reviews collected that meet criteria: 894
Example individual review link https://www.imdb.com/review/rw4804358/
Example Review: Thank you Marvel, End Game is an ending that give me speechless.
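
The length filter above relies on the list of links and the list of review bodies lining up index by index. A minimal sketch of the same idea, filtering the two lists as pairs so they cannot drift apart, is shown below. The function name `keep_long_reviews` and the sample data are hypothetical; only the threshold of more than 10 characters comes from the notebook.

```python
MIN_CHARS = 10  # notebook threshold: keep reviews longer than 10 characters


def keep_long_reviews(links, texts, min_chars=MIN_CHARS):
    """Filter (link, text) pairs together so the two lists stay aligned."""
    if len(links) != len(texts):
        raise ValueError("links and texts must have the same length")
    kept = [
        (link, text)
        for link, text in zip(links, texts)
        if len(text) > min_chars
    ]
    kept_links = [link for link, _ in kept]
    kept_texts = [text for _, text in kept]
    return kept_links, kept_texts


# Hypothetical sample data, not taken from the scraped pages
sample_links = ["/review/rw0000001/", "/review/rw0000002/", "/review/rw0000003/"]
sample_texts = ["Great!", "A fun ride from start to finish.", ""]

links, texts = keep_long_reviews(sample_links, sample_texts)
print(links)
print(texts)
```

Raising an error on a length mismatch makes a page whose link count and body count differ fail loudly instead of silently pairing a link with the wrong text.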


## 2.	Extract noun phrase (NP) chunks from your reviews using the following procedure:
- In Python, use BeautifulSoup to grab the main review text from each link.  
- Next run each review text through a tokenizer, and then try to NP-chunk it with a shallow parser. 
- You probably will have too many unknown words, owing to proper names of characters, actors, and so on that are not in your working dictionary. Make sure the main names that are relevant to the movies in your collection of reviews are added to the working lexicon, and then run the NP chunker again.


Next, I gathered the top actor and character names for the movies being reviewed. These names are used later to catch any key names that the chunker misses.

In [4]:
#Grab all top actor and character names for lexicon
actor_holder=[]
character_holder=[]
for i in movie_id:
    url = 'https://www.imdb.com/title/'+ i
    response3=requests.get(url)
    soup3 = BeautifulSoup(response3.text, 'lxml')
    actors=soup3.find_all('a',{'data-testid':"title-cast-item__actor"})
    actors_clean=[x.get_text() for x in actors]
    actor_holder.extend(actors_clean)

    characters=soup3.find_all('a',{'data-testid':"cast-item-characters-link"})
    characters_split=[x.get_text().split() for x in characters]
    #Keep only the last half of the words in each link's text
    characters_clean=[' '.join(x[int(-len(x)/2):]) for x in characters_split]
    character_holder.extend(characters_clean)

key_names=list(set(actor_holder+character_holder))
print('Amount of unique character and actor names gathered:',len(key_names))

Amount of unique character and actor names gathered: 475
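
The slice `x[int(-len(x)/2):]` in the cell above is easy to misread, so here is a minimal sketch of what it does to a list of words. The inputs are made up for illustration; whether the scraped link text really contains a repeated character name is an assumption about the page markup, not something shown in the notebook output.

```python
def second_half(words):
    """Join the trailing half of a word list, mirroring x[int(-len(x)/2):]."""
    # int() truncates toward zero, so a one-word list gives index 0 (whole list)
    return " ".join(words[int(-len(words) / 2):])


# Hypothetical inputs chosen only to show the slice behaviour
even_case = second_half(["Tony", "Stark", "Tony", "Stark"])
single_case = second_half(["Thor"])
odd_case = second_half(["Peter", "Parker", "Spider-Man"])

print(even_case)
print(single_case)
print(odd_case)
```

With an even number of words the slice keeps exactly the second half, a single word is kept whole, and an odd count keeps the smaller trailing part.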


Because 894 reviews were gathered but only 100 are needed for the final output, I shuffled the list so that the first 100 include a mix of positive and negative reviews from multiple movies.

In [5]:
#Make a copy of raw review list and shuffle it so that positive and negative reviews from all the movies are mixed together
#I did this because we will only be taking 100 out of >800 reviews
review_list=all_reviews.copy()
random.seed(4)
random.shuffle(review_list)
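
Fixing the seed is what makes the selection of 100 reviews repeatable from run to run. The sketch below illustrates that property with a private `random.Random` instance rather than the module-level generator used above; the helper name `shuffled_copy` and the stand-in data are hypothetical.

```python
import random


def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a private, seeded generator."""
    rng = random.Random(seed)
    result = list(items)  # copy so the original order is preserved
    rng.shuffle(result)
    return result


# Hypothetical stand-in for the scraped review texts
reviews = [f"review {n}" for n in range(10)]

first = shuffled_copy(reviews, seed=4)
second = shuffled_copy(reviews, seed=4)

print(first == second)                   # same seed gives the same order
print(sorted(first) == sorted(reviews))  # shuffling loses no reviews
```

A private generator avoids side effects on other code that also draws from the `random` module, at the cost of passing the seed around explicitly.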

Using spaCy, I chunked the noun phrases in each review. Then, using the NLTK word tokenizer and the list of key actor and character names gathered above, I searched each review for key names that the chunker had missed and added them to that review's noun phrases.

In [6]:
#Use spaCy to output noun phrases and NLTK to tokenize for name searches
nlp = spacy.load("en_core_web_sm")
changes=[]
review_chunks=[]
for i,v in enumerate(review_list[0:100]): #loop through review list (can change from first 100)
    doc = nlp(v)
    noun_phrases=[str(x).lower() for x in doc.noun_chunks]
    tokenized_review=nltk.word_tokenize(v.lower())
    tokenized_noun_phrases=nltk.word_tokenize(' '.join(noun_phrases))
    for name in key_names: #loop through name list
        split_name=name.lower().split()
        name_length=len(split_name)
        for index,tag in enumerate(tokenized_review): #loop through tokenized review list to determine if name is in the review
            tag_subset=tokenized_review[index:index+name_length]
            tag_clean=' '.join(tag_subset)
            if tag_clean==name.lower() and name.lower() not in ' '.join(noun_phrases): #If name is in review and name is not in chunked NP, then append
                noun_phrases.append(name)
                changes.append('index'+str(i)+':'+tag_clean)
    review_chunks.append(noun_phrases)

print('Amount of missed names identified and corrected:',len(changes))

Amount of missed names identified and corrected: 4
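
The name search above slides a window the width of each name across the tokenized review and compares it with the name. A minimal, dependency-free sketch of that idea follows. It is not the notebook's exact code: it uses `str.split` in place of `nltk.word_tokenize`, it records each name at most once per review, and the function name `find_missed_names` and the sample data are hypothetical.

```python
def find_missed_names(tokens, noun_phrases, key_names):
    """Return key names found in tokens but absent from the noun phrases."""
    phrase_text = " ".join(p.lower() for p in noun_phrases)
    missed = []
    for name in key_names:
        target = name.lower().split()
        if not target:
            continue  # skip empty names
        width = len(target)
        # Slide a window of `width` tokens across the review
        found = any(
            tokens[i:i + width] == target
            for i in range(len(tokens) - width + 1)
        )
        # Substring check against the joined chunks, as in the notebook
        if found and name.lower() not in phrase_text and name not in missed:
            missed.append(name)
    return missed


# Hypothetical review, chunks, and names for illustration only
tokens = "i think tom holland was great and tom holland is funny".split()
chunks = ["i", "great"]
names = ["Tom Holland", "Zendaya", "Great"]

missed_names = find_missed_names(tokens, chunks, names)
print(missed_names)
```

Comparing token lists rather than raw strings means a name only matches on whole-word boundaries, so a short name does not match inside a longer word.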


The cell below prints the noun phrase chunks extracted from the first 100 reviews in the shuffled review list.

In [7]:
for i,v in enumerate(review_chunks):
    print('REVIEW ',i+1,': ',v,'\n',sep='')

REVIEW 1: ['what a movie', 'i', 'it', 'i', 'it', '100%.great movie']

REVIEW 2: ['i', 'deadpool', 'i', 'deadpool', 'the only thing', 'i', 'morena baccarin', 'vanessa', 'she', 'she', 'the beginning', 'the end', 'it', 'senseless violence', 'it', 'the worst superhero movies', 'i', 'batman', 'robin', 'i', 'it', 'one star', 'vanessa']

REVIEW 3: ['i', 'i', 'the best marvel film', 'i', 'other marvel films', 'i', 'this film', 'its own merits', 'i', 'they', 'ragnarok', 'what', 'a 10star']

REVIEW 4: ['the marvel movies', 'the issues', 'a second viewing', 'many more flaws', 'the way', 'you', 'decent special effects', 'this one', 'a shot', 'you', 'this film', 'its sequel', 'the lower echelon', 'mcu films', 'the likes', 'age', 'ultron', 'thor', 'the dark world', 'this film', 'me', 'moments', 'excessive and wholly inappropriate instances', 'tonal whiplash', 'embarrassing dialogue', 'pym', 'his daughter', 'a very well-acted emotional exchange', 'a lazy one-liner', 'scott lang', 'the most egregious 

## 3.	Output all the chunks in a single list for each review, and submit that output for this assignment. Also submit a brief written summary of what you did (describe your selection of genre, your source of reviews, how many you collected, and by what means).

I used BeautifulSoup to go from an advanced search page and automatically pull individual reviews from IMDb. The advanced search filtered for the following items:
- Feature Film
- Marvel comics
- Superhero
- At least 7.0 IMDb user rating (average)
- 2015-2021 Release year or range
- In addition, I filtered in code to keep only movies with more than 3,000 votes, so that each film was relevant and had enough reviews

Using the movie_id of each of the 18 films that met the criteria, I built links to each film's reviews page and to its title page, which lists the top actors and characters. I created two review links per movie: one that lists the reviews in ascending order of rating and another that lists them in descending order, to ensure a mix of positive and negative reviews. Each of these links gave me 25 reviews to evaluate (50 per movie). Any review whose body did not have more than 10 characters was skipped for lack of content. Of the 900 reviews evaluated, 894 met the criteria, and I collected their URLs along with the text of each review's body. In the first 100 reviews, 4 key names that the chunker had missed were identified and added to the noun phrase chunk lists.

With all the needed information collected, I used spaCy to chunk the noun phrases in each review. To check whether any key character or actor names were missed, I used the NLTK word tokenizer to search each review for a match to any of the key names. If a name was found in the review but was missing from the chunked noun phrases, it was added to the noun phrase list for that review.

Because I pulled over 800 reviews, I shuffled them so that the first 100 elements contain a mix of positive and negative reviews from multiple movies. Finally, I printed the noun phrase chunks for the first 100 reviews.