## Eliminate Duplicates and Extract Movie Features from Scraped Data
#### In this notebook, we take the movies scraped in the previous notebook, identify and drop duplicates, and then extract and rearrange the important movie features to facilitate analysis. The extracted features are saved for further cleaning in Notebook 4.

In [1]:
import pandas as pd
import seaborn as sns
import requests, re, json, time
from bs4 import BeautifulSoup as bs4

### Import movies_features files

In [3]:
# Load the JSON file of movies_features. N.B., this notebook uses previously scraped data
# that is unfortunately too large to push to GitHub. If you'd like to replicate this process,
# download the file from https://drive.google.com/open?id=19VPd3pdjrwVV56OVDNtKo2o_V3afwrvU
# and put it in the data/ folder.
with open('data/movies_features_0_to_8764new.json') as json_file:  
    movies_features = json.load(json_file)

In [5]:
# Run cell if you want to see movies_features of the first 2 movies
movies_features[0:2]

[{'age': 'age 10+',
  'family_topics': 'Families can talk about how The Music of Silence portrays Bocelli\'s blindness. How does it compare to other representations of disabilities you\'ve seen on-screen? Does it feel realistic? Do you think it gave a real picture of Bocelli\'s personal struggle?\xa0\n\nHow accurate do you think the movie is to actual events overall? Why might filmmakers decide to change the facts in a movie that\'s based on real life?\n\nHow does Bocelli exhibit courage and perseverance? Why are those important character strengths?\n\nOne of the most important aspects of any drama is conflict. What or who causes the conflict in this film? Is there an antagonist? What is the struggle?\xa0\n\nAre you an opera fan? Why or why not? How can you distinguish among operatically trained voices to tell which are the "most beautiful"? Are your standards for opera singers different from your standards for pop singers?\xa0\n\n',
  'is_it_any_good': 'Nearly everything about this bi

### Identify and Eliminate Duplicate Movies
#### The data are placed into a pandas DataFrame to make it easier to examine features and identify duplicate movies. Of the 8765 movies in our collection, 8714 have unique titles (99.4%), so 51 duplicates are dropped.

In [6]:
df = pd.DataFrame(movies_features)

In [7]:
df.head()

Unnamed: 0,age,family_topics,is_it_any_good,movie_details_raw,movie_id,one_line_description,overall_rating,parental_rating_and_spoilers_raw,slug,title,what_is_the_story_raw,what_parents_need_to_know_raw
0,age 10+,Families can talk about how The Music of Silen...,Nearly everything about this biopic (which was...,"\nMovie details\n\n\nIn theaters: February 2, ...",0,Mild Andrea Bocelli biopic lacks momentum.,2,"[<div class=""field field-name-field-collection...",the-music-of-silence,The Music of Silence,"In THE MUSIC OF SILENCE, an Italian boy with f...","<div class=""field field-name-field-parents-nee..."
1,age 16+,Families can talk about how The Miseducation o...,"With an earthy, realistic tone, this timely dr...","\nMovie details\n\n\nIn theaters: August 3, 20...",1,"Timely, effective YA-based tale about LGBTQ te...",4,"[<div class=""field field-name-field-collection...",the-miseducation-of-cameron-post,The Miseducation of Cameron Post,"In THE MISEDUCATION OF CAMERON POST, it's 1993...","<div class=""field field-name-field-parents-nee..."
2,age 16+,Families can talk about whether The Spy Who Du...,This often hilarious action comedy is a great ...,"\nMovie details\n\n\nIn theaters: August 3, 20...",2,Bullets and profanity fly in feminist friendsh...,4,"[<div class=""field field-name-field-collection...",the-spy-who-dumped-me,The Spy Who Dumped Me,"In THE SPY WHO DUMPED ME, Audrey (Mila Kunis) ...","<div class=""field field-name-field-parents-nee..."
3,age 13+,Families can talk about The Darkest Minds and ...,Despite starring a YA-film veteran (Stenberg) ...,"\nMovie details\n\n\nIn theaters: August 3, 20...",3,"Skilled actors can't save derivative, violent ...",2,"[<div class=""field field-name-field-collection...",the-darkest-minds,The Darkest Minds,THE DARKEST MINDS is based on the first book i...,"<div class=""field field-name-field-parents-nee..."
4,age 15+,Families can talk about the elements that make...,"Appealing lead performances, especially by Kri...",\nMovie details\n\n\nOn DVD or streaming: Augu...,4,Goodhearted but predictable father-daughter co...,2,"[<div class=""field field-name-field-collection...",like-father,Like Father,"In LIKE FATHER, Rachel Hamilton (Kristen Bell)...","<div class=""field field-name-field-parents-nee..."


In [8]:
len(df)

8765

In [9]:
len(df.title.unique())

8714

In [10]:
# 99.4% of movies are unique.
8714/8765

0.9941814033086138

In [11]:
df.drop_duplicates('title', inplace=True)

In [12]:
len(df) 

8714
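
Before dropping rows, it can help to see which titles actually repeat. The sketch below mirrors, with the standard library only, what `drop_duplicates('title')` does under its default of keeping the first occurrence. The sample records and the helper name `dedupe_by_title` are hypothetical and not part of the scraped data.

```python
from collections import Counter


def dedupe_by_title(records):
    """Keep the first record seen for each title (hypothetical helper)."""
    seen = set()
    unique = []
    for record in records:
        if record['title'] not in seen:
            seen.add(record['title'])
            unique.append(record)
    return unique


# Made-up sample records, for illustration only
sample = [
    {'movie_id': 0, 'title': 'Movie A'},
    {'movie_id': 1, 'title': 'Movie B'},
    {'movie_id': 2, 'title': 'Movie A'},
]

# Titles that occur more than once
title_counts = Counter(record['title'] for record in sample)
repeated_titles = [title for title, count in title_counts.items() if count > 1]

unique_sample = dedupe_by_title(sample)
print(repeated_titles)
print(len(unique_sample))
```

Note that matching on title alone would also drop distinct movies that happen to share a title. Comparing the `slug` values of the repeated titles is one possible way to check for that.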

In [13]:
movies_features = df.to_dict('records')

In [14]:
# Run cell if you want to see the first 2 movies from movies_features
movies_features[0:2]

[{'age': 'age 10+',
  'family_topics': 'Families can talk about how The Music of Silence portrays Bocelli\'s blindness. How does it compare to other representations of disabilities you\'ve seen on-screen? Does it feel realistic? Do you think it gave a real picture of Bocelli\'s personal struggle?\xa0\n\nHow accurate do you think the movie is to actual events overall? Why might filmmakers decide to change the facts in a movie that\'s based on real life?\n\nHow does Bocelli exhibit courage and perseverance? Why are those important character strengths?\n\nOne of the most important aspects of any drama is conflict. What or who causes the conflict in this film? Is there an antagonist? What is the struggle?\xa0\n\nAre you an opera fan? Why or why not? How can you distinguish among operatically trained voices to tell which are the "most beautiful"? Are your standards for opera singers different from your standards for pop singers?\xa0\n\n',
  'is_it_any_good': 'Nearly everything about this bi

## Partial Cleaning and Extracting of Movie Features:
#### Some of the features scraped in the previous notebook need to be extracted in order to flatten movie ratings and other details that may be of importance to family members (e.g., cast, director, movie run time, etc.). We also prepare the text that will be used to build a corpus of words describing each movie. This will be useful in determining which movies are similar to one another for our recommender.

### Helper Functions to Extract Movie Details, Parental Ratings, etc.:

In [15]:
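def extract_movie_details(movie_details):
    # Parse the "Name: value" lines of the raw details text into a dict.
    details = []
    values = []
    details_lines = re.split('\n', movie_details)
    for line in details_lines:
        try:
            # Split on the first colon only; the capture group keeps the rest of the line as the value.
            detail_name, detail_value, temp = re.split(':(.*)', line)
            details.append(detail_name)
            values.append(detail_value)
        except ValueError:   # no ':' in the line, so there is nothing to unpack; skip the line
            pass
    
    return dict(zip(details, values))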
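
To make the colon-splitting behavior concrete, here is a small self-contained sketch that restates the helper compactly and runs it on a made-up details string. The field names and values in `raw` are hypothetical; the real scraped text may contain different fields.

```python
import re


def extract_movie_details(movie_details):
    """Compact restatement of the helper: map 'Name: value' lines to a dict."""
    details = {}
    for line in movie_details.split('\n'):
        parts = re.split(':(.*)', line)
        # A line containing a colon splits into [name, value, '']; others yield one part
        if len(parts) == 3:
            details[parts[0]] = parts[1]
    return details


# Hypothetical sample text, shaped like the raw details field
raw = '\nMovie details\n\n\nIn theaters: January 1, 2000\nTagline: Love: a story\n'
parsed = extract_movie_details(raw)
print(parsed)
# {'In theaters': ' January 1, 2000', 'Tagline': ' Love: a story'}
```

Two things are visible here: only the first colon on a line separates name from value, and each value keeps its leading space, so a later cleaning step may want to call `.strip()` on the values.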

In [16]:
def extract_parental_ratings_and_spoilers(parental_ratings):
    soup = bs4(parental_ratings, 'lxml')
    ratings = [re.findall(r'\d', str(x))[0] for x in soup.find_all('div', attrs={'class':'content-grid-rating'})]
    labels = [x.get_text() for x in soup.find_all('span', attrs={'class':"sprite-cover"})]
    spoilers = [x.get_text() for x in soup.find_all('div', attrs={'class': 'field-type-text-long'})]
    return dict(zip(labels, ratings)), spoilers   
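
The rating step above takes the first digit found in each rating tag's markup and pairs it with a label. The sketch below isolates just that step on plain strings, without parsing HTML. The markup in `rating_divs` and the label texts are hypothetical stand-ins; the real page structure may differ.

```python
import re

# Hypothetical stand-ins for str(tag) of each rating <div>; real markup may differ
rating_divs = [
    '<div class="content-grid-rating content-grid-3"></div>',
    '<div class="content-grid-rating content-grid-0"></div>',
]
labels = ['Violence', 'Language']  # hypothetical label texts

# First digit appearing anywhere in each tag's markup
ratings = [re.findall(r'\d', div)[0] for div in rating_divs]

# Pair labels with ratings positionally
parent_ratings = dict(zip(labels, ratings))
print(parent_ratings)
# {'Violence': '3', 'Language': '0'}
```

The ratings come back as strings, not integers. Because `zip` stops at the shorter list, a movie with a different number of labels and rating tags would silently lose entries, and a tag containing no digit at all would raise an `IndexError`.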

In [17]:
def text_what_parents_need_to_know(what_parents_need_to_know):
    soup = bs4(what_parents_need_to_know, 'lxml')
    what_parents_need_to_know_text = soup.get_text()
    
    return what_parents_need_to_know_text
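
For readers without BeautifulSoup installed, the idea behind `get_text()` can be approximated with the standard library's `html.parser`. This is a swapped-in technique, not what the notebook uses, and its output may differ from `soup.get_text()` on malformed markup. The names `TextCollector` and `html_to_text` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush any buffered text
    return ''.join(collector.chunks)


# Made-up sample markup, for illustration only
sample_html = '<div class="field"><p>Parents need to know that <em>this</em> is a sample.</p></div>'
plain = html_to_text(sample_html)
print(plain)
# Parents need to know that this is a sample.
```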

In [18]:
def extract_features_from_a_movie(movie_num):
    extracted_feat = {}
    movie_id = movies_features[movie_num]['movie_id']
    movie_details_dict = extract_movie_details(movies_features[movie_num]['movie_details_raw'])
    parent_ratings_dict, spoilers = extract_parental_ratings_and_spoilers(movies_features[movie_num]['parental_rating_and_spoilers_raw'])
    what_parents_need_to_know_text = text_what_parents_need_to_know(movies_features[movie_num]['what_parents_need_to_know_raw'])
    
    extracted_feat['movie_id'] = movie_id
    extracted_feat['movie_details_dict'] = movie_details_dict
    extracted_feat['parent_ratings_dict'] = parent_ratings_dict
    extracted_feat['spoilers'] = spoilers
    extracted_feat['what_parents_need_to_know_text'] = what_parents_need_to_know_text
    extracted_feat['family_topics'] = movies_features[movie_num]['family_topics']
    extracted_feat['is_it_any_good'] = movies_features[movie_num]['is_it_any_good']
    extracted_feat['what_is_the_story'] = movies_features[movie_num]['what_is_the_story_raw']
    extracted_feat['age'] = movies_features[movie_num]['age']
    extracted_feat['one_line_description'] = movies_features[movie_num]['one_line_description']
    extracted_feat['overall_rating'] = movies_features[movie_num]['overall_rating']
    extracted_feat['slug'] = movies_features[movie_num]['slug']
    extracted_feat['title'] = movies_features[movie_num]['title']
    return extracted_feat

In [21]:
# Run cell if you want to see example of features extracted from a single movie.
extract_features_from_a_movie(0)

{'age': 'age 10+',
 'family_topics': 'Families can talk about how The Music of Silence portrays Bocelli\'s blindness. How does it compare to other representations of disabilities you\'ve seen on-screen? Does it feel realistic? Do you think it gave a real picture of Bocelli\'s personal struggle?\xa0\n\nHow accurate do you think the movie is to actual events overall? Why might filmmakers decide to change the facts in a movie that\'s based on real life?\n\nHow does Bocelli exhibit courage and perseverance? Why are those important character strengths?\n\nOne of the most important aspects of any drama is conflict. What or who causes the conflict in this film? Is there an antagonist? What is the struggle?\xa0\n\nAre you an opera fan? Why or why not? How can you distinguish among operatically trained voices to tell which are the "most beautiful"? Are your standards for opera singers different from your standards for pop singers?\xa0\n\n',
 'is_it_any_good': 'Nearly everything about this biopi

### Extract Features from all movies

In [20]:
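def extract_features_from_many_movies(first_movie, num_movies_to_extract):
    '''
    Extract features from a consecutive run of movies in movies_features.

    first_movie: index of the first movie to extract.
    num_movies_to_extract: number of movies to extract, starting at first_movie.
    Returns a list with one dict of extracted features per movie.
    '''
    extracted_movies = []   # built locally so re-running the cell does not append duplicates
    for offset in range(num_movies_to_extract):
        movie = offset + first_movie
        if movie % 100 == 0:
            print(movie)   # progress indicator
        extracted_movies.append(extract_features_from_a_movie(movie))
    return extracted_movies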

In [22]:
# Comment out unless you want to extract features from movies again.
# If you have run previous cells and now have a different number of movies to extract,
# adjust the num_movies_to_extract appropriately.
extracted_movies = extract_features_from_many_movies(0, len(df))

0
100
200
...
8600
8700


In [23]:
# Comment out cell to suppress extra text. Run if you'd like to see extracted features
extracted_movies[-1]

{'age': 'age 8+',
 'family_topics': "Families can talk about silent movies. What did you expect going in? Were any parts surprising? Did you ever forget you were watching a silent film and just get into the story?\n\nFamilies can also talk about technology and filmmaking. Buster Keaton didn't have any of the tools we have today and still managed to make the action exciting. Do you think not relying on technology somehow made this filmmaker more inventive? Or do you think he was limited by the lack of CGI and other effects common today?\n\n",
 'is_it_any_good': "Even viewers who normally don't seek out silent movies or classics in general are in for a treat. SHERLOCK JR. is clever, charming, inventive, and full of surprises. There's so much packed into 44 minutes, it's hard to believe that there's a movie within a movie and a love story and a frame-up and it all ties together and makes perfect sense with just the occasional pithy caption.\nThe runaway moped scene had to take so much pla

### Export Extracted Movie Features for Further Cleaning and NLP

In [24]:
# Uncomment if you have run prior notebooks and want to update the movie recommender system
# with open('data/extracted_movies.json', 'w') as output:
#     json.dump(extracted_movies, output) 

#### Extracted features from movies saved to json file for final cleaning and division into text and non-text features in the next notebook.
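
As a quick sanity check on the export step, the sketch below writes a tiny list of extracted-feature dicts to JSON and reads it back, confirming that nested dicts and lists survive the round trip. It writes to a temporary directory rather than `data/extracted_movies.json`, and the record in `extracted_sample` is made up for illustration.

```python
import json
import os
import tempfile

# Made-up record shaped like one entry of extracted_movies
extracted_sample = [
    {
        'movie_id': 0,
        'title': 'Movie A',
        'parent_ratings_dict': {'Violence': '3'},
        'spoilers': ['Some spoiler text.'],
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'extracted_movies_sample.json')

    # Write the list of dicts to disk
    with open(path, 'w') as output:
        json.dump(extracted_sample, output)

    # Read it back to confirm nothing was lost
    with open(path) as json_file:
        reloaded = json.load(json_file)

print(reloaded == extracted_sample)
```

This works because every value in the extracted records is a string, number, list, or dict with string keys. Values such as raw BeautifulSoup tags would need converting to strings before `json.dump` could serialize them.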