## Tokenizing and "Count Vectorizing" Text
#### In this notebook, I will process the text data for each movie, doing final cleaning of punctuation, making all words lowercase and splitting text into individual word "tokens." (N.B., for those who are text processing-savvy: this was done outside of the Vectorizers in order to retain the '-', since curse words are represented in these reviews as s--t and f--k. The frequency with which curse words appear in reviews may be an important text feature for parents.)

#### The first vectorizer I will explore is Count Vectorizer. Count Vectorizer will take the simple frequency of words found in text associated with each movie and turn it into that movie's "word vector." Movies with similar word vectors will be judged to be similar. To favor words that appear to be associated with movies in general and to reduce the dimensionality of our word vector space, I will run the word vectors through a process called Truncated SVD. Truncated SVD is a method designed to capture the most variance in our collected movie word vectors in the fewest dimensions possible by taking orthogonal components through our vector space, each of which is a combination of words. I will then analyze these components to see if I can derive some meaning for some of them (see below).

#### Once we get our text data into truncatedSVD format, we will then use cosine similarity to determine which movies are most similar to which other movies in our data set. We will then also incorporate non-text data to see how this improves our cosine similarity-based similarity matrix (see Notebook 7).
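
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths, so it measures direction rather than magnitude. A minimal sketch on made-up three-word count vectors (the numbers and the names `toy_vectors` and `cosine` are illustrative, not taken from the dataset), checked against scikit-learn's `cosine_similarity`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy "word vectors" for three hypothetical movies (counts of three words)
toy_vectors = np.array([
    [2.0, 0.0, 1.0],
    [4.0, 0.0, 2.0],   # same direction as the first row, twice as long
    [0.0, 3.0, 0.0],   # shares no words with the other two
])

def cosine(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

manual = cosine(toy_vectors[0], toy_vectors[1])
toy_sim = cosine_similarity(toy_vectors)   # 3 x 3 matrix, all pairs at once
```

Because only direction matters, a longer review that uses the same words in the same proportions as a shorter one is treated as identical to it.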

### Load Movie Text Data

In [1]:
import pandas as pd
import numpy as np
import requests, re, json, copy
import matplotlib.pyplot as plt
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from fuzzywuzzy import process



In [2]:
# Load json of movies_features_text
with open('data/movies_features_text.json') as json_file:  
    movies_features_text = json.load(json_file)

In [3]:
df = pd.DataFrame(movies_features_text)
df.head()

Unnamed: 0,movie_id,slug,text,title
0,0,sicario-day-of-the-soldado,Families can talk about Sicario: Day of the So...,Sicario: Day of the Soldado
1,1,damsel,Families can talk about Damsel use of violenc...,Damsel
2,2,distorted,Families can talk about the rapid-fire disturb...,Distorted
3,3,the-catcher-was-a-spy,Families can talk about Berg sexual orientati...,The Catcher Was a Spy
4,4,boundaries,Families can talk about how Boundaries portray...,Boundaries


In [4]:
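def get_movie_list(last_movie):
    '''Returns the titles of the first last_movie movies as a list.'''
    # Build the full list; returning inside the loop would yield only the first title
    return [movies_features_text[movie_num]['title'] for movie_num in range(last_movie)]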

In [6]:
movies_features_text[0]['text'][:380]

'Families can talk about Sicario: Day of the Soldado  violence. Which parts were gruesome, and which were exciting? How did the movie achieve these effects? What  the impact of media violence on kids?  How are drinking, smoking, and drugs depicted? Are they glamorized? Does the movie make the drug business look alluring?  What does the movie have to say about law versus justice?'

In [7]:
stop_words = set(stopwords.words('english'))  # separate name, so the imported nltk stopwords module is not overwritten
# Remove punctuation from all text of each movie and remove stopwords
def clean_text_for_movie(text):
    '''
    Takes in all text of a single movie, makes lowercase and removes punctuation and stopwords
    from text. Returns words in input text as a single string, w/o English stopwords.
    '''
    words = re.sub(r"[^a-zA-Z\-]", " ", text).lower().split()  # replaces everything except letters and '-' with spaces, makes lowercase
    cleantext = [w for w in words if w not in stop_words]  # eliminates common "stop words"
    return " ".join(cleantext)  # returns words as a string, each word separated by a space

In [8]:
clean_text_test = clean_text_for_movie(movies_features_text[0]['text'])

In [9]:
clean_text_test[:259]

'families talk sicario day soldado violence parts gruesome exciting movie achieve effects impact media violence kids drinking smoking drugs depicted glamorized movie make drug business look alluring movie say law versus justice difference two end justify means'

In [10]:
clean_text = []
def clean_text_for_movies(first_movie, num_movies_to_clean):
    print("Number of movies cleaned so far:")
    for movie in range(num_movies_to_clean):
        movie = (movie + first_movie)
        if movie % 1000 == 0:
            print(movie)
        clean_txt = clean_text_for_movie(movies_features_text[movie]['text'])
        movies_features_text[movie]['clean_text'] = clean_txt
        clean_text.append(clean_txt)
    return clean_text

In [11]:
clean_text = clean_text_for_movies(0,len(movies_features_text))

Number of movies cleaned so far:
0
1000
2000
3000
4000
5000
6000
7000
8000


In [12]:
df = pd.DataFrame(movies_features_text)
df.head()

Unnamed: 0,clean_text,movie_id,slug,text,title
0,families talk sicario day soldado violence par...,0,sicario-day-of-the-soldado,Families can talk about Sicario: Day of the So...,Sicario: Day of the Soldado
1,families talk damsel use violence intense freq...,1,damsel,Families can talk about Damsel use of violenc...,Damsel
2,families talk rapid-fire disturbing images dis...,2,distorted,Families can talk about the rapid-fire disturb...,Distorted
3,families talk berg sexual orientation presente...,3,the-catcher-was-a-spy,Families can talk about Berg sexual orientati...,The Catcher Was a Spy
4,families talk boundaries portrays drugs drug u...,4,boundaries,Families can talk about how Boundaries portray...,Boundaries


#### movies_features_text now has a new feature, clean_text, which contains the cleaned words used in movie reviews and other text associated with each of our 8625 unique movies. movies_features_text is now ready for vectorization.

### Vectorize text for NLP
#### I will initially use a tool called Count Vectorizer to establish an easily interpreted simple count of unigram frequency in my dataset (bigrams and larger n-grams produce far too many features). I will evaluate the predictive value of Count Vectorization before and after combining it with my Non-Text Features (see Notebook 7).
#### I will also use a process called TF-IDF (Term Frequency-Inverse Document Frequency) Vectorization on my text data for comparison. TF-IDF takes the frequency of each word in the text associated with each movie (termed a "document") and down-weights it according to how many of the documents contain that word.
#### After Count or TF-IDF vectorization, I will use truncated SVD on the text data alone to reduce the number of features and thereby reduce overfitting. The components that result from truncated SVD will be examined to identify discernible patterns.
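
To make the difference between raw counts and TF-IDF concrete, here is a small sketch on a made-up three-document corpus (the documents and all `toy_`/`idf_` names are illustrative). It uses scikit-learn's default `TfidfVectorizer` settings, which smooth the inverse document frequency and L2-normalize each row; other settings would give different numbers.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up mini corpus: "violence" appears in every document, "zombies" in only one
toy_docs = [
    "violence zombies zombies",
    "violence romance",
    "violence comedy",
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()          # defaults: smoothed idf, L2-normalized rows
counts = count_vec.fit_transform(toy_docs).toarray()
tfidf = tfidf_vec.fit_transform(toy_docs).toarray()

vocab = tfidf_vec.vocabulary_          # maps each word to its column index
idf_violence = tfidf_vec.idf_[vocab["violence"]]   # word found in all documents
idf_zombies = tfidf_vec.idf_[vocab["zombies"]]     # word found in one document
```

In the second document both words occur once, so their counts are equal, but the TF-IDF weight of "romance" comes out higher than that of "violence" because "violence" appears in every document.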

In [13]:
vec = CountVectorizer(analyzer = "word",
                      tokenizer = None,      # tokenized in preprocessing
                      preprocessor = None,
                      stop_words = None,     # english stop words already removed, to retain -
                      min_df = 2,            # to eliminate typos
                      max_df = .9,           # to eliminate the word "movie"
                      max_features = 100000) 

data_features = pd.SparseDataFrame(vec.fit_transform(clean_text),
                                   columns=vec.get_feature_names(),
                                   default_fill_value=0)

In [14]:
data_features.shape   # unigrams only (far too many features with larger ngrams)

(8625, 42188)
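
One caveat worth checking: with `tokenizer=None`, `CountVectorizer` still applies its default `token_pattern`, which splits on hyphens and drops one-letter tokens, so strings like `s--t` may not survive vectorization even though the preprocessing kept them. The sketch below compares the default with a whitespace-only pattern on a made-up string. Note that `token_pattern=r"\S+"` is a suggestion, not what this notebook ran, and using it would change the feature count shown above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example of already-cleaned text containing hyphenated tokens
toy_clean_text = ["rapid-fire s--t scene"]

# Default analyzer: tokens are runs of two or more word characters
default_tokens = CountVectorizer().build_analyzer()(toy_clean_text[0])

# Whitespace-only pattern: every space-separated chunk is kept as one token
whitespace_tokens = CountVectorizer(token_pattern=r"\S+").build_analyzer()(toy_clean_text[0])
```

Here `default_tokens` contains `rapid`, `fire` and `scene`, while `whitespace_tokens` keeps `rapid-fire` and `s--t` intact.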

In [15]:
feature_names = vec.get_feature_names()

## Code bits for Future Pre-Modeling and Analysis
### Investigative EDA - Pre-Modeling
    Things to look for:
      Global token counts
      Select words of interest

In [16]:
# vectorizer = CountVectorizer(min_df=2, max_df=.9)  # min_df = 2 eliminates typos,
#             # max_df = .9 eliminates the word "movie"

### Put Vectorized Data in DF for Analysis
- Sum aggregate token counts
- Plot / investigate
  - Histogram
  - Horizontal Barplot

In [17]:
# text = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())

In [18]:
# sum_text = text.sum()

In [19]:
# sum_text.sort_values(ascending=False).head(30)

In [20]:
# sum_text.hist()

In [21]:
# mask = sum_text.between(3000, 15000)
# sum_text[mask].sort_values(ascending=False).hist()

In [22]:
# sum_text[mask].sort_values(ascending=True).head(20).plot(kind="barh", figsize=(5, 15))

### Count Vectorize Text

In [23]:
data_features.shape  # unigrams only (far too many features with larger ngrams)

(8625, 42188)

In [25]:
# text = pd.DataFrame(data_features.toarray(), columns=vectorizer.get_feature_names())

### TF-IDF Vectorizer
#### This vectorizer vectorizes words in text by count, then normalizes word frequency with a factor that decreases the effect of words that occur commonly across the documents analyzed. This is done for reasons similar to why we discount words that occur frequently in the English language generally: their appearance may obscure more important, differentiating words.

In [26]:
# tvec = TfidfVectorizer(analyzer = "word",
#                        tokenizer = None,      # tokenized in preprocessing
#                        preprocessor = None,
#                        stop_words = None,     # english stop words already removed, to retain -
#                        min_df = 2,            # to eliminate typos
#                        max_df = .9,           # to eliminate the word "movie"
#                        max_features = 42164) 
# 
# data_features_tfidf = pd.SparseDataFrame(tvec.fit_transform(clean_text),
#                                           columns=tvec.get_feature_names(),
#                                           default_fill_value=0)

In [27]:
# data_features_tfidf.shape

In [28]:
# feature_names = tvec.get_feature_names()

In [29]:
### Consider stemming, to avoid 'abandon', 'abandoned', 'abandoning', 'abandonment',
### and 'abandons' all ending up as separate words, etc...
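
As a sketch of what stemming would do here, the snippet below runs NLTK's `PorterStemmer` over the five word forms mentioned above. The Porter stemmer is only one option (an assumption on my part, not a choice made in this notebook), and it would have to be applied to `clean_text` before vectorizing.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
word_forms = ["abandon", "abandoned", "abandoning", "abandonment", "abandons"]

# Reduce each form to its stem, then see how many distinct stems remain
stems = [stemmer.stem(w) for w in word_forms]
unique_stems = set(stems)
```

All five forms should collapse to a single stem, which would shrink the vocabulary at the cost of some readability in the feature names.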

In [30]:
### investigate top word choices--- how to 

In [31]:
# feature_names

In [32]:
# len(feature_names)

### Truncated SVD
#### To generate vectors that encapsulate the most variance in our text data in the fewest number of components.

In [33]:
svd = TruncatedSVD(n_components=1000)

#### Run the cells below to find TruncatedSVD1000 results for CountVectorized words

In [34]:
# countvec_truncated is fit_transformed w/1000 components
countvec_truncated = svd.fit_transform(data_features)

In [35]:
countvec_truncated.shape

(8625, 1000)

In [36]:
components_countvec = pd.DataFrame(svd.components_.T, index=feature_names)

In [37]:
components_countvec   # columns are svd components, 0 - 999, for count vectorized words

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
aa,0.000448,-0.000821,0.000569,-1.353337e-04,-0.001204,-0.000548,-0.000352,-0.001033,-0.000927,0.000093,...,-0.003093,0.001430,0.001368,-0.002153,-0.001275,0.002374,0.002212,-0.002156,0.002821,0.000926
aaa,0.000019,0.000083,0.000021,4.687004e-05,0.000049,0.000100,0.000036,-0.000075,0.000013,-0.000060,...,0.000210,-0.000133,-0.000732,-0.000731,-0.000122,-0.000702,-0.000025,-0.000328,0.000110,-0.000011
aaah,0.000030,0.000111,-0.000162,1.731217e-04,-0.000123,0.000005,0.000290,0.000041,0.000199,-0.000049,...,0.000025,0.000257,0.000003,0.000723,0.000505,-0.000267,-0.000028,-0.000103,0.001282,-0.000609
aardman,0.000073,0.000223,-0.000152,1.887633e-04,-0.000358,0.000288,-0.000170,-0.000134,0.000038,-0.000019,...,0.000114,0.000713,-0.000741,-0.000088,0.001815,-0.002386,-0.000801,-0.001680,-0.001817,0.001346
aaron,0.001849,-0.000661,0.002027,-1.548583e-03,0.000165,-0.000165,-0.000261,-0.002525,-0.004896,-0.003108,...,0.020806,0.021525,-0.042632,-0.023021,0.026861,0.034714,-0.006847,0.045742,0.027622,-0.011065
aasif,0.000027,0.000028,-0.000061,1.865180e-05,0.000028,0.000072,0.000002,-0.000183,0.000002,0.000014,...,-0.000089,0.000079,0.000013,0.000473,-0.000068,-0.000410,0.000627,0.000389,0.000515,-0.000534
aback,0.000056,0.000057,0.000176,-9.248011e-05,-0.000315,-0.000192,-0.000009,-0.000045,-0.000474,-0.000059,...,-0.000286,-0.000165,-0.001071,0.000317,-0.000029,-0.000096,-0.000535,0.000679,-0.001238,-0.000142
abacus,0.000085,0.000285,0.000165,-2.790680e-04,0.000162,0.000041,0.000313,-0.000530,-0.000304,-0.000897,...,0.009591,-0.001279,-0.000199,-0.002889,-0.004094,-0.003352,-0.003707,0.004018,-0.005480,-0.001547
abandon,0.000894,0.000066,0.000131,-9.637767e-04,-0.000758,0.000358,0.000639,0.001310,-0.000053,0.000402,...,-0.000465,-0.002807,-0.002287,-0.001965,0.003613,0.003613,0.001290,0.001326,-0.001432,0.008653
abandoned,0.002946,0.003778,-0.000260,-2.802035e-03,-0.000387,-0.002803,-0.001340,0.003820,-0.000805,0.004354,...,0.014273,-0.003137,0.002386,0.001670,0.000964,-0.010389,-0.021008,0.026051,0.000421,0.001902


In [38]:
components_countvec.shape

(42188, 1000)

In [39]:
# tfidf_truncated = svd.fit_transform(data_features_tfidf)

In [40]:
# tfidf_truncated.shape

In [41]:
# components_tfidf = pd.DataFrame(svd.components_.T, index=feature_names)

In [42]:
# components_tfidf   # columns are svd components, 0 - 999, for tf-idf vectorized words

In [43]:
# components_tfidf.shape

#### Explore components: list actual numerical values, but print only head(10) and tail(10) for important components

In [44]:
word_importance_by_component = []

In [45]:
# word_importance_component_1 = components_tfidf[0].abs()
# word_importance_component_1

In [46]:
# type(word_importance_component_1)

In [47]:
# word_importance_component_1.index   ### Take word order from abs(), but list actual value!!
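
One way to take the word order from the absolute values while still listing the actual signed loadings is sketched below. The helper name `top_words_for_component` and the toy loadings are made up for illustration; in this notebook, `components_countvec` would be passed in place of `toy_components`.

```python
import pandas as pd

def top_words_for_component(components, component, n=10):
    """Return the n words with the largest absolute loading, keeping signed values."""
    loadings = components[component]
    order = loadings.abs().sort_values(ascending=False).index[:n]
    return loadings.loc[order]

# Made-up loadings standing in for components_countvec (rows: words, columns: components)
toy_components = pd.DataFrame(
    {0: [0.10, -0.70, 0.05, 0.40], 1: [0.60, 0.02, -0.30, 0.01]},
    index=["family", "violence", "puppy", "drugs"],
)
top_two = top_words_for_component(toy_components, 0, n=2)
```

Keeping the sign matters because words with strongly negative loadings pull a movie away from a component just as strongly as positive ones pull it toward it.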

In [48]:
# word_importances_tfidf = []

In [49]:
# svd_explained_variance = svd.explained_variance_

In [50]:
def cum_sum_explained_var(vect, total_comp):
    '''
    Returns the running total of explained variance for the first total_comp
    components of a fitted decomposition such as TruncatedSVD.
    '''
    max_comp = len(vect.explained_variance_)
    if total_comp > max_comp:
        print("That's too many components. Max_components is {}.\n".format(max_comp))
        total_comp = int(input("Enter new total_components:"))
    # np.cumsum gives the cumulative sum over the first total_comp components
    return list(np.cumsum(vect.explained_variance_[:total_comp]))

In [51]:
# cum_sum_explained_var(svd, 1000)
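
If the goal is to decide how many components to keep, the cumulative *ratio* of explained variance is often easier to read than the raw values. The following is a sketch on random made-up counts; the 50% target and all `toy_` names are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
toy_counts = rng.poisson(1.0, size=(50, 20)).astype(float)   # made-up document-term counts

toy_svd = TruncatedSVD(n_components=10, random_state=0)
toy_svd.fit(toy_counts)

# Fraction of total variance captured by the first k components, for k = 1..10
cum_ratio = np.cumsum(toy_svd.explained_variance_ratio_)

# Smallest number of components reaching the target, or None if it is never reached
target = 0.5
reached = cum_ratio >= target
n_needed = int(np.argmax(reached)) + 1 if reached.any() else None
```

Applied to the fitted `svd` object above, the same two lines would show how much of the variance the 1000 components retain.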

In [52]:
# data_features_tfidf.shape[1]

In [53]:
## pd.DataFrame(index=feature_names, columns=components_tfidf)
## tfidf_features['feature names'] = vocab
## #pd.DataFrame(features_components_tfidf

In [54]:
# print(vocab)

### Cosine Similarity: CountVec_truncSVD1000

In [55]:
### Calculate the similarity matrix of all movies to all movies from countvec_truncated
sim_matrix_countvec_truncSVD1000 = cosine_similarity(countvec_truncated, countvec_truncated)

In [56]:
similarity_matrix_countvec_truncSVD1000 = pd.DataFrame(sim_matrix_countvec_truncSVD1000,
                                                columns=df['title'],
                                                index=df['title'])

In [57]:
similarity_matrix_countvec_truncSVD1000.head()

title,Sicario: Day of the Soldado,Damsel,Distorted,The Catcher Was a Spy,Boundaries,Izzy Gets the F*ck Across Town,Jurassic World: Fallen Kingdom,Brothers of the Wind,Unsane,Flower,...,Live and Let Die,Tintin: The Lake of Sharks,Tales of Beatrix Potter,Tintin: The Prisoners of the Sun,Gentle Giant,Tintin: The Calculus Affair,Visit to a Small Planet,Zoo Baby,Driftwood,Sherlock Jr.
Sicario: Day of the Soldado,1.0,0.361348,0.225723,0.229002,0.306837,0.294909,0.305781,0.198979,0.299826,0.29051,...,0.239472,0.255223,0.204917,0.179374,0.168951,0.213407,0.229971,0.122875,0.183902,0.239167
Damsel,0.361348,1.0,0.279621,0.259207,0.438147,0.416179,0.274745,0.18731,0.361267,0.297415,...,0.213792,0.230755,0.196302,0.17318,0.215048,0.176575,0.219622,0.102794,0.264133,0.222598
Distorted,0.225723,0.279621,1.0,0.214984,0.19919,0.28636,0.206954,0.23588,0.459011,0.228103,...,0.213765,0.171376,0.152187,0.170625,0.15816,0.211158,0.180736,0.094415,0.28975,0.203684
The Catcher Was a Spy,0.229002,0.259207,0.214984,1.0,0.190891,0.204458,0.239723,0.2702,0.314329,0.234218,...,0.164965,0.1724,0.105151,0.129007,0.185724,0.248948,0.183909,0.139663,0.168943,0.177274
Boundaries,0.306837,0.438147,0.19919,0.190891,1.0,0.39093,0.187024,0.182267,0.284135,0.354171,...,0.212422,0.167183,0.184331,0.163967,0.174966,0.135271,0.228074,0.178154,0.287889,0.20798


In [58]:
similarity_matrix_countvec_truncSVD1000['Damsel']

title
Sicario: Day of the Soldado                             0.361348
Damsel                                                  1.000000
Distorted                                               0.279621
The Catcher Was a Spy                                   0.259207
Boundaries                                              0.438147
Izzy Gets the F*ck Across Town                          0.416179
Jurassic World: Fallen Kingdom                          0.274745
Brothers of the Wind                                    0.187310
Unsane                                                  0.361267
Flower                                                  0.297415
Midnight Sun                                            0.251928
Paul, Apostle of Christ                                 0.179320
The Death of Stalin                                     0.281319
Pacific Rim: Uprising                                   0.366675
The Swap                                                0.145450
Class Rank         

In [59]:
### Calculate as matrix of all movies to all movies of tfidf_truncated 
# sim_matrix_tfidf_truncSVD700 = cosine_similarity(tfidf_truncated, tfidf_truncated)

In [60]:
# similarity_matrix_tfidf_truncSVD700 = pd.DataFrame(sim_matrix_tfidf_truncSVD700,
#                                                    columns=df['title'],
#                                                    index=df['title'])


In [61]:
# similarity_matrix_tfidf_truncSVD700.head()

In [62]:
# similarity_matrix_tfidf_truncSVD700['Damsel']

### Find Similar Movies

In [65]:
movie_list = df['title']

In [67]:
def title_recommender(movie_name, movie_list, limit=3):
    results = process.extract(movie_name, movie_list, limit=limit)
    return results

In [82]:
def find_similar_movies():
    movie_name = input("Give me a movie title and I'll give you five titles you might also like:")
    titles = set(df['title'])
    limit = 3
    # Keep asking until the input exactly matches a title, offering fuzzy matches as hints
    while movie_name not in titles:
        results = title_recommender(movie_name, df['title'], limit=limit)
        print("Sorry, that movie title isn't in my list. Did you mean", results, "?")
        movie_name = input("(I need the exact title, please...)")
        limit = min(limit + 1, 10)  # widen the list of suggestions, up to 10
    sim_movies_text = similarity_matrix_countvec_truncSVD1000[movie_name]
    print("Thanks! Here are my recommendations, along with review text similarity scores:")
    # Position 0 is the movie itself (similarity 1.0), so skip it
    return sim_movies_text.sort_values(ascending=False)[1:6]

In [83]:
find_similar_movies()

Give me a movie title and I'll give you five titles you might also like:incredibles
Sorry, that movie title isn't in my list. Did you mean [('Incredibles 2', 95, 17), ('The Incredibles', 95, 6296), ('Red', 90, 4028)] ?
(I need the exact title, please...)The Incredibles
Thanks! Here are my recommendations, along with review text similarity scores:


title
Incredibles 2                               0.632292
My Happy Family                             0.462751
Who Is Simon Miller?                        0.434455
Mystery Men                                 0.426819
Ultimate Avengers 2: Rise of the Panther    0.419790
Name: The Incredibles, dtype: float64

In [84]:
find_similar_movies()

Give me a movie title and I'll give you five titles you might also like:ParaNorman
Thanks! Here are my recommendations, along with review text similarity scores:


title
Norman                            0.556554
Floyd Norman: An Animated Life    0.547010
On Golden Pond                    0.402824
Miss Potter                       0.355590
What Lies Beneath                 0.342667
Name: ParaNorman, dtype: float64

#### As you can see above, recommendations are usually good, with some notable exceptions. Informal analyses show recommendations are accurate approximately 80-85% of the time. Unfortunately, this recommender will not work without the cosine similarity matrix, which is prohibitively large to upload to GitHub. In order to generate this file, you can either run the notebooks in this repo in order, or recompile the matrix from the 15 pieces that I've generated and uploaded as simmat0, simmat1, ..., simmat14. You can do so by executing the function at the bottom of notebook 5.1.
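
Since the full similarity matrix is so large, one possible workaround (not what this notebook does) is to keep only the truncated SVD vectors and compute a single movie's similarities on demand. The sketch below uses made-up titles and two-dimensional vectors; the helper `similar_titles` is hypothetical and assumes the rows of `vectors` are in the same order as `titles`, with a default integer index.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def similar_titles(title, titles, vectors, n=5):
    """Compare one movie's vector against all others instead of storing every pair."""
    idx = titles[titles == title].index[0]            # first exact match
    scores = cosine_similarity(vectors[idx:idx + 1], vectors)[0]
    ranked = pd.Series(scores, index=titles).drop(index=title)   # leave out the movie itself
    return ranked.sort_values(ascending=False).head(n)

# Made-up stand-ins for df['title'] and countvec_truncated
toy_titles = pd.Series(["A", "B", "C", "D"], name="title")
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
toy_recs = similar_titles("A", toy_titles, toy_vectors, n=2)
```

With 8625 movies and 1000 components, the vectors take far less space than the 8625 x 8625 similarity matrix, at the cost of a small computation per query.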