## Cleaning and Organizing Text and Non-Text Features
#### Movie data is imported from the previous notebook for final preprocessing and division into text and non-text features. Text features are first cleaned with regular expressions and concatenated into a single feature; this text will be processed further in Notebook 5. Non-text features (MPAA rating, genre, and "parental concerns ratings"; see below) are extracted into a separate list for exploratory data analysis in Notebook 6.

### Importing Extracted Features for Preprocessing

In [1]:
import pandas as pd
import seaborn as sns
import requests, re, json, copy
from bs4 import BeautifulSoup as bs4

In [2]:
# Load json of extracted movie data. Unless you have run prior notebooks, extracted_movies
# currently has data from an early version that contains 8625 movies.
with open('data/extracted_movies.json') as json_file:  
    extracted_movies = json.load(json_file)

In [3]:
extracted_movies[0]['title']

'Sicario: Day of the Soldado'

### Clean and Concatenate Text: Remove Artifacts, Contractions, Punctuation

#### Clean and combine the text features into a single feature called text. These features include what_parents_need_to_know_text, family_topics, what_is_the_story, is_it_any_good, one_line_description, and spoilers, plus the MPAA explanation, Character Strengths, and Topics entries of movie_details_dict when a movie has them. The combined text is stored alongside the movie title, the movie_id, and the slug so that the correct web page can be accessed if the movie is recommended.

In [4]:
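def text_cleaner(text):
    # Replace contraction suffixes ('s, 'd, 't, 'm, 're), non-breaking spaces,
    # newlines, and slashes with a single space.
    return re.sub(r"(\'s|\'d|\'t|\'m|\'re|\xa0|\n|/)", ' ', text)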
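
To make the substitution concrete, here is a small sketch on a made-up sentence. The sample string and the `text_cleaner_sketch` name are illustrative and not part of the dataset, and the pattern assumes the intent is to match `'d` without any stray brace. Each match becomes a single space, so runs of spaces can appear; the last step, which is optional and not part of the original cleaner, collapses them.

```python
import re

# Assumed intent of the notebook's pattern: strip common contraction
# suffixes, non-breaking spaces, newlines, and slashes.
PATTERN = re.compile(r"('s|'d|'t|'m|'re|\xa0|\n|/)")

def text_cleaner_sketch(text):
    """Replace every match of PATTERN with a single space."""
    return PATTERN.sub(' ', text)

sample = "It's scary/violent,\nbut they're brave."  # made-up example text
cleaned = text_cleaner_sketch(sample)

# Optional extra step (an addition, not in the notebook): collapse repeated spaces.
collapsed = ' '.join(cleaned.split())
print(collapsed)
```

Note that the contraction stem is kept while the suffix is dropped, so "It's" becomes "It" and "they're" becomes "they".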

In [5]:
def clean_text_for_movie(movie_num):
    cleaned_movies = {}
    clean_text = []
    clean_text.append(text_cleaner(extracted_movies[movie_num]['family_topics']))
    clean_text.append(text_cleaner(extracted_movies[movie_num]['is_it_any_good']))
    clean_text.append(text_cleaner(extracted_movies[movie_num]['one_line_description']))
    clean_text.append(text_cleaner(' '.join(extracted_movies[movie_num]['spoilers'])))
    clean_text.append(text_cleaner(extracted_movies[movie_num]['what_is_the_story']))
    clean_text.append(text_cleaner(extracted_movies[movie_num]['what_parents_need_to_know_text']))
    try:
        clean_text.append(text_cleaner(extracted_movies[movie_num]['movie_details_dict']['MPAA explanation']))
    except KeyError:  # some are rated NR, so have no MPAA explanation
        pass    
    try:
        clean_text.append(text_cleaner(extracted_movies[movie_num]['movie_details_dict']['Character Strengths']))
    except KeyError:  # not every movie has character strengths listed
        pass
    try:
        clean_text.append(text_cleaner(extracted_movies[movie_num]['movie_details_dict']['Topics']))
    except KeyError:  # not every movie has topics listed
        pass
                                       
    cleaned_movies['text'] = ' '.join(clean_text)
    cleaned_movies['movie_id'] = extracted_movies[movie_num]['movie_id']
    cleaned_movies['title'] = extracted_movies[movie_num]['title']
    cleaned_movies['slug'] = extracted_movies[movie_num]['slug']
    return cleaned_movies
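
The repeated `try`/`except` blocks above could also be expressed as a loop over key names. The following is only a sketch under that assumption: `combine_text_sketch`, the two key lists, and `toy_movie` are hypothetical names with made-up content, and the pieces are joined in a slightly different order than in the cell above.

```python
import re

def text_cleaner_sketch(text):
    """Stand-in for the notebook's text_cleaner."""
    return re.sub(r"('s|'d|'t|'m|'re|\xa0|\n|/)", ' ', text)

# Keys every movie is assumed to have, and keys that may be absent.
TOP_LEVEL_KEYS = ['family_topics', 'is_it_any_good', 'one_line_description',
                  'what_is_the_story', 'what_parents_need_to_know_text']
OPTIONAL_DETAIL_KEYS = ['MPAA explanation', 'Character Strengths', 'Topics']

def combine_text_sketch(movie):
    """Clean and join all text fields of one movie record."""
    parts = [movie[key] for key in TOP_LEVEL_KEYS]
    parts.append(' '.join(movie['spoilers']))  # spoilers is a list of strings
    details = movie['movie_details_dict']
    # Skip optional fields a movie does not have instead of catching KeyError.
    parts.extend(details[key] for key in OPTIONAL_DETAIL_KEYS if key in details)
    return ' '.join(text_cleaner_sketch(part) for part in parts)

# Made-up record for illustration only.
toy_movie = {
    'family_topics': 'Talk about courage.',
    'is_it_any_good': "It's fun.",
    'one_line_description': 'A dog/cat tale.',
    'spoilers': ['The dog wins.', 'The cat leaves.'],
    'what_is_the_story': "A dog's journey.",
    'what_parents_need_to_know_text': 'Mild peril.',
    'movie_details_dict': {'Topics': 'Cats, Dogs, and Mice'},
}
combined = combine_text_sketch(toy_movie)
print(combined)
```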

In [6]:
def clean_text_for_many_movies(first_movie, num_movies_to_clean):
    # Build the list inside the function so re-running the cell
    # does not append duplicate entries to a global list.
    movies_features_text = []
    for offset in range(num_movies_to_clean):
        movie = first_movie + offset
        if movie % 1000 == 0:
            print(movie)  # progress indicator
        movies_features_text.append(clean_text_for_movie(movie))
    return movies_features_text

In [7]:
movies_features_text = clean_text_for_many_movies(0, len(extracted_movies))

0
1000
2000
3000
4000
5000
6000
7000
8000


In [8]:
len(movies_features_text)

8625

#### movies_features_text is a list of dictionaries, each of which contains the cleaned text associated with a particular movie. Alongside the text are three movie identifiers: the title, a slug that can be used to pull up the movie's Common Sense Media web page, and a movie_id number.
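
A quick structural check on a list like this can catch missing keys or repeated identifiers before the data moves to the next notebook. The sketch below is an optional addition, not part of the original workflow; `check_text_features` and the toy records are hypothetical.

```python
def check_text_features(records):
    """Return a list of problems found; an empty list means the checks passed."""
    required = {'text', 'movie_id', 'title', 'slug'}
    problems = []
    seen_ids = set()
    for index, record in enumerate(records):
        missing = required - set(record)
        if missing:
            problems.append(f"record {index} is missing {sorted(missing)}")
        movie_id = record.get('movie_id')
        if movie_id is not None:
            if movie_id in seen_ids:
                problems.append(f"record {index} repeats movie_id {movie_id}")
            seen_ids.add(movie_id)
    return problems

# Made-up records for illustration only.
good_records = [
    {'text': 'a b c', 'movie_id': 1, 'title': 'Example One', 'slug': 'example-one'},
    {'text': 'd e f', 'movie_id': 2, 'title': 'Example Two', 'slug': 'example-two'},
]
bad_records = good_records + [
    {'text': 'g h i', 'movie_id': 2, 'title': 'Example Three'},  # no slug, repeated id
]
good_problems = check_text_features(good_records)
bad_problems = check_text_features(bad_records)
print(bad_problems)
```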

## Create Other Features List
#### This second list of dictionaries will contain the non-text features associated with each movie, including Age, Overall Movie Rating, Genre, MPAA rating, and Parental Concerns Ratings, plus the In Theatres and DVD/Streaming release dates where available. Each entry is also marked with movie_id, title, and slug.

In [9]:
extracted_movies[0]['parent_ratings_dict'].keys()

dict_keys(['Positive Messages', 'Positive Role Models & Representations', 'Violence', 'Sex', 'Language', 'Consumerism', 'Drinking, Drugs & Smoking'])

### Create lists of features that exist only in certain movies, not all
#### Two groups of features exist only in some of the movies. The first group is the parental concerns ratings, which score aspects of a movie that matter to parents (e.g., Positive Messages, Language, Sex, Violence) on a scale of 0 to 5. Because the data comes from Common Sense Media, which advises parents of children up to 17 years of age, movies aimed at teenagers are rated on "Sex" and "Violence", whereas movies aimed at younger children are rated on "Sexy Stuff" and "Violence & Scariness."
#### The second group consists of features such as 'in_theatres' and 'DVD/Streaming', which hold the corresponding release dates.
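
One way to see how unevenly the rating categories are distributed is to count how many movies carry each one. The sketch below uses `collections.Counter` on three made-up records; `toy_movies` is hypothetical, and storing the ratings as strings is an assumption.

```python
from collections import Counter

# Made-up records that mimic the shape of extracted_movies entries.
toy_movies = [
    {'parent_ratings_dict': {'Violence': '4', 'Sex': '2', 'Language': '3'}},
    {'parent_ratings_dict': {'Violence & Scariness': '1', 'Sexy Stuff': '0', 'Language': '1'}},
    {'parent_ratings_dict': {'Violence': '5', 'Language': '4'}},
]

# Count, for each category name, the number of movies that include it.
category_counts = Counter(
    category
    for movie in toy_movies
    for category in movie['parent_ratings_dict']
)

for category, count in category_counts.most_common():
    print(f"{category}: {count} of {len(toy_movies)} movies")
```

Run against the full dataset, a count like this would show which categories are present for nearly every movie and which apply only to one age group.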

In [10]:
parental_categories_set = set()

In [11]:
for movie in extracted_movies:
    parental_categories_set.update(set(movie['parent_ratings_dict'].keys()))

In [12]:
parental_categories = list(parental_categories_set)
parental_categories

['Consumerism',
 'Violence & Scariness',
 'Language',
 'Educational Value',
 'Positive Role Models & Representations',
 'Sex',
 'Positive Messages',
 'Violence',
 'Sexy Stuff',
 'Drinking, Drugs & Smoking']

In [13]:
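# Names must match movie_details_dict keys exactly; clean_text_for_movie reads 'MPAA explanation' (with a space).
occasional_features_list = ['in_theatres', 'DVD/Streaming', 'MPAA_explanation']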

### Create Other_features list
#### movies_other_features is a list that contains features for each movie, such as the MPAA rating (G, PG, PG-13, etc.), the genre, the recommended minimum viewer age, a one-line description of the movie, and the overall_rating given by Common Sense Media's movie reviewers. The list also holds identifiers for each movie: the title, an arbitrary "movie_id" number, and a slug that can be used to call up more information about each recommended movie from Common Sense Media's web page.

In [14]:
def make_other_features_dict_for_movie(movie_num):
    movie_features_other = {}
    movie_features_other['age'] = int(extracted_movies[movie_num]['age'][4:-1])
    movie_features_other['MPAA_rating'] = extracted_movies[movie_num]['movie_details_dict']['MPAA rating']
    movie_features_other['overall_rating'] = int(extracted_movies[movie_num]['overall_rating'])
    movie_features_other['one_line_description'] = extracted_movies[movie_num]['one_line_description']
    movie_features_other['genre'] = extracted_movies[movie_num]['movie_details_dict']['Genre']
    movie_features_other['movie_id'] = extracted_movies[movie_num]['movie_id']
    movie_features_other['title'] = extracted_movies[movie_num]['title']
    movie_features_other['slug'] = extracted_movies[movie_num]['slug']
    for category in parental_categories:
        try:
            movie_features_other[category] = int(extracted_movies[movie_num]['parent_ratings_dict'][category])
        except KeyError:  # to skip parental categories not included in movie
            pass
    for category in occasional_features_list:
        try:
            movie_features_other[category] = extracted_movies[movie_num]['movie_details_dict'][category]
        except KeyError:  # if no DVD/Streaming release date planned or if movie is unrated
            pass
     
    return movie_features_other
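
The slice `[4:-1]` on the `age` field implies that every age string has a four-character prefix and a one-character suffix around the number, for example `age 13+`. That shape is inferred from the slice, since the raw strings are not shown here. A regex-based variant, sketched below with a hypothetical `parse_age` helper, extracts the digits directly and raises a clear error on unexpected input.

```python
import re

def parse_age(age_text):
    """Extract the first run of digits from an age string such as 'age 13+'."""
    match = re.search(r'\d+', age_text)
    if match is None:
        raise ValueError(f"no age found in {age_text!r}")
    return int(match.group())

# Assumed input shape; both approaches agree when the shape holds.
example = 'age 13+'
print(parse_age(example), int(example[4:-1]))
```

The slice is shorter, but it fails with a less helpful `ValueError` from `int()` if a string deviates from the assumed shape.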

In [15]:
def make_other_features_list_for_many_movies(first_movie, num_movies):
    # Build the list inside the function so re-running the cell
    # does not append duplicate entries to a global list.
    movies_other_features = []
    for offset in range(num_movies):
        movie = first_movie + offset
        if movie % 1000 == 0:
            print(movie)  # progress indicator
        movies_other_features.append(make_other_features_dict_for_movie(movie))
    return movies_other_features

In [16]:
movies_other_features = make_other_features_list_for_many_movies(0,len(extracted_movies))

0
1000
2000
3000
4000
5000
6000
7000
8000


In [17]:
# Run cell if you want to see the non-text features as list of dicts
# movies_other_features[0:1]

#### movies_other_features contains specific features of movies to be used in building our movie recommender system in the next notebook.

### Save feature lists as JSON files for export to the next notebooks

In [18]:
with open('data/movies_features_text.json', 'w') as output:
    json.dump(movies_features_text, output)

In [19]:
with open('data/movies_other_features.json', 'w') as output:
    json.dump(movies_other_features, output)
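
Since later notebooks depend on these files, it can be worth confirming that a saved list reloads unchanged. The sketch below is an optional check, not part of the original workflow: `save_and_verify` and `toy_records` are hypothetical, and it writes to a temporary directory rather than to `data/`.

```python
import json
import os
import tempfile

def save_and_verify(records, path):
    """Write records as JSON, read them back, and report whether they match."""
    with open(path, 'w') as output:
        json.dump(records, output)
    with open(path) as json_file:
        reloaded = json.load(json_file)
    return reloaded == records

# Made-up record for illustration only.
toy_records = [{'movie_id': 1, 'title': 'Example Movie', 'slug': 'example-movie', 'age': 8}]

with tempfile.TemporaryDirectory() as tmp_dir:
    ok = save_and_verify(toy_records, os.path.join(tmp_dir, 'toy_features.json'))
print(ok)
```

The comparison holds for lists of dictionaries with string keys and string or integer values, which is the shape used here; tuples or non-string keys would not survive a JSON round trip unchanged.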

#### The movie features are now saved to two separate files. The first, movies_features_text, contains the text associated with the 8625 unique movies in my dataset. The second, movies_other_features, contains features such as the recommended age of movie viewers, the movie genre, and a one-line description of the movie. Text features will be analyzed in the next notebook, where I will tokenize and vectorize the text associated with each movie and prepare it for Natural Language Processing. Non-text features will undergo exploratory data analysis in Notebook 6.