## Eliminate Duplicates and Extract Movie Features from Scraped Data
#### In this notebook, we take the movies scraped in the previous notebook, identify and remove duplicates, and then extract and rearrange the important movie features to facilitate analysis. The extracted features are saved for further cleaning in Notebook 4.

In [1]:
import pandas as pd
import seaborn as sns
import requests, re, json, time
from bs4 import BeautifulSoup as bs4

### Import movies_features files

In [2]:
# Load the JSON of movies_features. N.B.: this notebook uses previously scraped data
# that is unfortunately too large to push to GitHub. To replicate this process,
# download the file from https://drive.google.com/open?id=19VPd3pdjrwVV56OVDNtKo2o_V3afwrvU
# and put it in the data/ folder.
with open('data/movies_features_0_to_8779_less_2.json') as json_file:  
    movies_features = json.load(json_file)

In [3]:
# Run cell if you want to see movies_features of the first 3 movies
# movies_features[0:3]

### Identify and Eliminate Duplicate Movies
#### The data is placed into a pandas DataFrame to make it easier to examine features and identify duplicate movies. Of the 8778 movies in our collection, 8625 have unique titles (98.3%); the 153 movies with repeated titles are dropped.

In [4]:
df = pd.DataFrame(movies_features)

In [5]:
df.head()

Unnamed: 0,age,family_topics,is_it_any_good,movie_details_raw,movie_id,one_line_description,overall_rating,parental_rating_and_spoilers_raw,slug,title,what_is_the_story_raw,what_parents_need_to_know_raw
0,age 17+,Families can talk about Sicario: Day of the So...,This sequel to Sicario is solid enough in its ...,"\nMovie details\n\n\nIn theaters: June 29, 201...",0,"Solid drug war sequel has intense violence, la...",3,"[<div class=""field field-name-field-collection...",sicario-day-of-the-soldado,Sicario: Day of the Soldado,"In SICARIO: DAY OF THE SOLDADO, the drug wars ...","<div class=""field field-name-field-parents-nee..."
1,age 16+,Families can talk about Damsel's use of violen...,"This very odd Western, which is peppered with ...","\nMovie details\n\n\nIn theaters: June 22, 201...",1,"Unusual, uneven Western has violence, language...",3,"[<div class=""field field-name-field-collection...",damsel,Damsel,"DAMSEL takes place in the Old West, with Samue...","<div class=""field field-name-field-parents-nee..."
2,age 15+,Families can talk about the rapid-fire disturb...,This thriller owes obvious debts to movies lik...,"\nMovie details\n\n\nIn theaters: June 22, 201...",2,Rapid-fire disturbing images in paranoid thril...,2,"[<div class=""field field-name-field-collection...",distorted,Distorted,"In DISTORTED, a disturbed woman named Lauren (...","<div class=""field field-name-field-parents-nee..."
3,age 15+,Families can talk about Berg's sexual orientat...,"This thriller is very well written and acted, ...","\nMovie details\n\n\nIn theaters: June 22, 201...",3,"Well-made spy thriller has some war violence, ...",4,"[<div class=""field field-name-field-collection...",the-catcher-was-a-spy,The Catcher Was a Spy,THE CATCHER WAS A SPY is based on the true sto...,"<div class=""field field-name-field-parents-nee..."
4,age 16+,Families can talk about how Boundaries portray...,"Though blessed with outstanding performers, th...","\nMovie details\n\n\nIn theaters: June 22, 201...",4,Indie road-trip dramedy has pot dealing/smoking.,2,"[<div class=""field field-name-field-collection...",boundaries,Boundaries,"In BOUNDARIES, single mom Laura Jaconi (Vera F...","<div class=""field field-name-field-parents-nee..."


In [6]:
len(df)

8778

In [7]:
len(df.title.unique())

8625

In [8]:
# 98.26% of movies are unique.
8625/8778

0.98257006151743

In [9]:
df.drop_duplicates('title', inplace=True)

In [10]:
len(df) 

8625
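
Before dropping rows, it can help to look at which titles repeat, since two entries that share a title are not necessarily the same movie. The sketch below uses a small made-up DataFrame (the titles and slugs are illustrative, not taken from the scraped data) to show one way to list every member of a duplicate group and to count what `drop_duplicates` removes.

```python
import pandas as pd

# Toy stand-in for the scraped data; titles and slugs are made up for illustration.
toy = pd.DataFrame({
    'title': ['Alpha', 'Beta', 'Alpha', 'Gamma'],
    'slug': ['alpha', 'beta', 'alpha-2010', 'gamma'],
})

# keep=False marks every member of a duplicate group, not just the later copies.
dupes = toy[toy.duplicated('title', keep=False)].sort_values('title')
print(dupes)

# drop_duplicates keeps the first occurrence of each title by default.
deduped = toy.drop_duplicates('title').reset_index(drop=True)
n_dropped = len(toy) - len(deduped)
print(n_dropped, len(deduped) / len(toy))
```

If the repeated titles turn out to have different slugs, deduplicating on `slug` instead of `title` may be worth considering; that choice is an assumption to check against the real data, not something this notebook verifies.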

In [11]:
movies_features = df.to_dict('records')

In [12]:
# Run cell if you want to see the first 3 movies from movies_features
# movies_features[0:3]

## Partial Cleaning and Extracting of Movie Features:
#### Some of the features scraped in the previous notebook need to be extracted in order to flatten movie ratings and other details that may be important to family members (e.g., cast, director, movie run time, etc.). We will also prepare the text used to build a corpus of words describing each movie. This corpus will be useful in determining which movies are similar to one another for our recommender.

### Helper Functions to Extract Movie Details, Parental Ratings, etc.:

In [13]:
def extract_movie_details(movie_details):
    """Parse the 'name: value' lines of the raw details text into a dict."""
    details = []
    values = []
    details_lines = re.split('\n', movie_details)
    for line in details_lines:
        try:
            # Splits on the first ':' only; re.split leaves a trailing '' we discard
            detail_name, detail_value, _ = re.split(':(.*)', line)
            details.append(detail_name)
            values.append(detail_value)
        except ValueError:   # no ':' in the line, so the unpacking fails; skip the line
            pass

    return dict(zip(details, values))
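
To make the splitting behaviour concrete, here is a small self-contained sketch of the same `re.split(':(.*)', line)` idea. The sample string is hypothetical; it only mimics the general shape of `movie_details_raw` (a heading line followed by `name: value` lines).

```python
import re

# Hypothetical sample; the field names and values are made up for illustration.
sample = "\nMovie details\n\n\nGenre: Drama\nRun time: 95 minutes\nTagline: Part two: the return\n"

parsed = {}
for line in sample.split('\n'):
    parts = re.split(':(.*)', line)
    # A line with a colon yields [name, value, '']; a line without one yields [line].
    if len(parts) == 3:
        name, value, _ = parts
        parsed[name] = value

print(parsed)
# {'Genre': ' Drama', 'Run time': ' 95 minutes', 'Tagline': ' Part two: the return'}
```

Two things stand out: only the first colon on a line acts as the separator, and each value keeps its leading space, so a later cleaning step would probably want to call `.strip()` on it.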

In [14]:
def extract_parental_ratings_and_spoilers(parental_ratings):
    soup = bs4(parental_ratings, 'lxml')
    ratings = [re.findall(r'\d', str(x))[0] for x in soup.find_all('div', attrs={'class':'content-grid-rating'})]
    labels = [x.get_text() for x in soup.find_all('span', attrs={'class':"sprite-cover"})]
    spoilers = [x.get_text() for x in soup.find_all('div', attrs={'class': 'field-type-text-long'})]
    return dict(zip(labels, ratings)), spoilers   
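
The sketch below walks through the same three lookups on a hand-written HTML fragment. The markup is an assumption: it reuses the class names the function searches for, but the real page structure is not shown in this notebook and may differ. It uses Python's built-in `html.parser` so that `lxml` is not required.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup built around the class names used above; not the real page.
html = '''
<div class="item">
  <div class="content-grid-rating rating-4"></div>
  <span class="sprite-cover">Violence</span>
  <div class="field-type-text-long">Shootings and explosions.</div>
</div>
<div class="item">
  <div class="content-grid-rating rating-2"></div>
  <span class="sprite-cover">Language</span>
  <div class="field-type-text-long">Occasional swearing.</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# The first digit found in each rating tag's markup is taken as the rating.
ratings = [re.findall(r'\d', str(tag))[0]
           for tag in soup.find_all('div', attrs={'class': 'content-grid-rating'})]
labels = [tag.get_text() for tag in soup.find_all('span', attrs={'class': 'sprite-cover'})]
spoilers = [tag.get_text() for tag in soup.find_all('div', attrs={'class': 'field-type-text-long'})]

ratings_by_label = dict(zip(labels, ratings))
print(ratings_by_label)
print(spoilers)
```

Because labels and ratings are paired by position with `zip`, a rating tag with no digit (which raises `IndexError` at `[0]`) or a missing label would need separate handling.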

In [15]:
def text_what_parents_need_to_know(what_parents_need_to_know):
    soup = bs4(what_parents_need_to_know, 'lxml')
    what_parents_need_to_know_text = soup.get_text()
    
    return what_parents_need_to_know_text

In [16]:
def extract_features_from_a_movie(movie_num):
    extracted_feat = {}
    movie_id = movies_features[movie_num]['movie_id']
    movie_details_dict = extract_movie_details(movies_features[movie_num]['movie_details_raw'])
    parent_ratings_dict, spoilers = extract_parental_ratings_and_spoilers(movies_features[movie_num]['parental_rating_and_spoilers_raw'])
    what_parents_need_to_know_text = text_what_parents_need_to_know(movies_features[movie_num]['what_parents_need_to_know_raw'])
    
    extracted_feat['movie_id'] = movie_id
    extracted_feat['movie_details_dict'] = movie_details_dict
    extracted_feat['parent_ratings_dict'] = parent_ratings_dict
    extracted_feat['spoilers'] = spoilers
    extracted_feat['what_parents_need_to_know_text'] = what_parents_need_to_know_text
    extracted_feat['family_topics'] = movies_features[movie_num]['family_topics']
    extracted_feat['is_it_any_good'] = movies_features[movie_num]['is_it_any_good']
    extracted_feat['what_is_the_story'] = movies_features[movie_num]['what_is_the_story_raw']
    extracted_feat['age'] = movies_features[movie_num]['age']
    extracted_feat['one_line_description'] = movies_features[movie_num]['one_line_description']
    extracted_feat['overall_rating'] = movies_features[movie_num]['overall_rating']
    extracted_feat['slug'] = movies_features[movie_num]['slug']
    extracted_feat['title'] = movies_features[movie_num]['title']
    return extracted_feat
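
Most of the assignments in `extract_features_from_a_movie` copy a field unchanged; only `what_is_the_story_raw` is renamed. As a possible alternative, those copies could be driven by a mapping. This is a sketch only: `PASS_THROUGH` and `copy_pass_through` are hypothetical names, and the record is made up.

```python
# Source key -> output key; only what_is_the_story_raw is renamed.
PASS_THROUGH = {
    'movie_id': 'movie_id',
    'family_topics': 'family_topics',
    'is_it_any_good': 'is_it_any_good',
    'what_is_the_story_raw': 'what_is_the_story',
    'age': 'age',
    'one_line_description': 'one_line_description',
    'overall_rating': 'overall_rating',
    'slug': 'slug',
    'title': 'title',
}

def copy_pass_through(movie):
    """Copy the fields that need no parsing from one scraped movie record."""
    return {out_key: movie[src_key] for src_key, out_key in PASS_THROUGH.items()}

# Made-up record for illustration only.
record = {
    'movie_id': 0,
    'family_topics': 'Families can talk about ...',
    'is_it_any_good': 'A short review.',
    'what_is_the_story_raw': 'A short synopsis.',
    'age': 'age 12+',
    'one_line_description': 'A one-line summary.',
    'overall_rating': 3,
    'slug': 'example-movie',
    'title': 'Example Movie',
    'movie_details_raw': 'ignored here',
}

copied = copy_pass_through(record)
print(sorted(copied))
```

The parsed fields (`movie_details_dict`, `parent_ratings_dict`, `spoilers`, `what_parents_need_to_know_text`) would still be added separately, as in the function above.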

In [17]:
# Run cell if you want to see example of features extracted from a single movie.
# extract_features_from_a_movie(0)

### Extract Features from All Movies

In [18]:
def extract_features_from_many_movies(first_movie, num_movies_to_extract):
    '''
    Extract features from a consecutive run of movies in movies_features.

    first_movie: index of the first movie to extract.
    num_movies_to_extract: number of movies to extract, starting at first_movie.
    Returns a list with one dict of extracted features per movie.
    '''
    extracted_movies = []
    for movie in range(first_movie, first_movie + num_movies_to_extract):
        if movie % 100 == 0:
            print(movie)   # progress indicator
        extracted_movies.append(extract_features_from_a_movie(movie))
    return extracted_movies

In [19]:
# Leave this cell commented out unless you want to extract features from movies again.
# If you have run previous cells and now have a different number of movies to extract,
# adjust the num_movies_to_extract appropriately.
# extracted_movies = extract_features_from_many_movies(0,8625)

In [20]:
# Cell commented out to suppress extra text. Run if you'd like to see extracted features
# extracted_movies[-1]

### Export Extracted Movie Features for Further Cleaning and NLP

In [21]:
# Uncomment if you have run prior notebooks and want to update movie recommender system
# with open('data/extracted_movies.json', 'w') as output:
#     json.dump(extracted_movies, output)   

#### Extracted features from movies saved to json file for final cleaning and division into text and non-text features in the next notebook.
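
A quick way to confirm the export is usable is to write the list and read it straight back. The sketch below does this with a made-up record and a temporary directory, so it does not touch `data/extracted_movies.json`.

```python
import json
import os
import tempfile

# Made-up record in the general shape of one extracted movie.
extracted = [{
    'movie_id': 0,
    'title': 'Example Movie',
    'parent_ratings_dict': {'Violence': '4'},
    'spoilers': ['Some description.'],
}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'extracted_movies.json')
    with open(path, 'w') as output:
        json.dump(extracted, output)
    with open(path) as json_file:
        reloaded = json.load(json_file)

# Dicts, lists, strings and ints survive the round trip unchanged.
round_trip_ok = (reloaded == extracted)
print(round_trip_ok)
```

The check works here because every value is a plain JSON type; objects such as BeautifulSoup tags would have to be converted to strings before `json.dump` could serialize them.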