## Tokenizing and TF-IDF Vectorization: Text-based Cosine Similarity
#### In this notebook, I will process the text data for each movie: doing a final cleaning of punctuation, making all words lowercase, and splitting the text into individual word "tokens." (N.B., for those who are text processing-savvy, this was done outside of the vectorizers in order to retain the '-', since curse words are represented in these reviews as s--t and f--k. The frequency with which these curse words appear in reviews may be an important text feature for parents.)

#### The vectorizer I will use in this notebook is the TF-IDF Vectorizer. It takes the frequency with which each word appears in the text associated with a movie and scales it down by the number of documents in which that word appears across the whole corpus. This technique is designed to give greater weight to terms that occur frequently in a document but not in other documents, thus controlling for words that appear frequently in all movies, such as the word "movie." The resulting TF-IDF statistics form the movie's "word vector." These word vectors will then be run through Truncated SVD as described in Notebook 5.
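
To make the weighting concrete, here is a minimal sketch of one common TF-IDF variant (term frequency times the log of inverse document frequency), computed by hand on a made-up three-document corpus. The corpus and the `tf_idf` helper are assumptions for illustration only; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math

# Toy corpus (hypothetical, not from the data set): each string is one "document"
docs = [
    "movie christmas santa movie",
    "movie war violence",
    "movie dog family",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf_idf(term, doc_tokens):
    # Term frequency: share of this document's tokens that are `term`
    tf = doc_tokens.count(term) / len(doc_tokens)
    # Document frequency: number of documents that contain `term`
    df = sum(term in tokens for tokens in tokenized)
    # Plain (unsmoothed) inverse document frequency
    idf = math.log(n_docs / df)
    return tf * idf

print(tf_idf("movie", tokenized[0]))      # 0.0, because "movie" is in every document
print(tf_idf("christmas", tokenized[0]))  # positive, because only one document contains it
```

The word "movie" is the most frequent token in the first document, yet it scores zero because it appears everywhere; "christmas" is rarer in the corpus and therefore carries the weight.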

#### Once the text data have been reduced with truncated SVD, I will use cosine similarity to determine which movies are most similar to which other movies in our data set. We will then also incorporate non-text data to see how this improves our cosine similarity-based similarity matrix (see below).
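
As a quick illustration of what cosine similarity measures, the sketch below computes it with plain NumPy on three made-up vectors. The function name `cosine_sim_matrix` and the vectors are hypothetical; the notebook itself uses `sklearn.metrics.pairwise.cosine_similarity`. This version assumes no row is all zeros.

```python
import numpy as np

def cosine_sim_matrix(X):
    # Scale every row to unit length, then take all pairwise dot products
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms
    return unit @ unit.T

# Three hypothetical movie vectors (rows); values are made up for illustration
vectors = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],   # same direction as the first row, twice the length
    [0.0, 0.0, 3.0],   # orthogonal to the first two rows
])
sims = cosine_sim_matrix(vectors)
print(np.round(sims, 3))
```

The first two rows point in the same direction, so their similarity is 1 regardless of length, while the third row shares no components with them and scores 0.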

### Load Movie Text Data

In [1]:
import pandas as pd
import numpy as np
import requests, re, json, copy, pickle
import matplotlib.pyplot as plt
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from fuzzywuzzy import process



In [2]:
# Load json of movies_features_text
with open('data/movies_features_text.json') as json_file:  
    movies_features_text = json.load(json_file)

In [3]:
df = pd.DataFrame(movies_features_text)
df.head()

Unnamed: 0,movie_id,slug,text,title
0,0,sicario-day-of-the-soldado,Families can talk about Sicario: Day of the So...,Sicario: Day of the Soldado
1,1,damsel,Families can talk about Damsel use of violenc...,Damsel
2,2,distorted,Families can talk about the rapid-fire disturb...,Distorted
3,3,the-catcher-was-a-spy,Families can talk about Berg sexual orientati...,The Catcher Was a Spy
4,4,boundaries,Families can talk about how Boundaries portray...,Boundaries


In [4]:
def get_movie_list(last_movie):
    # Collect the titles of the first `last_movie` movies
    movie_titles = []
    for movie_num in range(last_movie):
        movie_titles.append(movies_features_text[movie_num]['title'])
    return movie_titles

In [5]:
range(3)

range(0, 3)

In [6]:
movies_features_text[0]['text'][:380]

'Families can talk about Sicario: Day of the Soldado  violence. Which parts were gruesome, and which were exciting? How did the movie achieve these effects? What  the impact of media violence on kids?  How are drinking, smoking, and drugs depicted? Are they glamorized? Does the movie make the drug business look alluring?  What does the movie have to say about law versus justice?'

In [7]:
stop_words = set(stopwords.words('english'))  # separate name, so the nltk stopwords module is not overwritten
# Remove punctuation from all text of each movie and remove stopwords
def clean_text_for_movie(text):
    '''
    Takes in all text of a single movie, makes lowercase and removes punctuation and stopwords
    from text. Returns words in input text as a single string, w/o English stopwords.
    '''
    words = re.sub(r"[^a-zA-Z\-]", " ", text).lower().split()  # removes punctuation (keeps '-'), makes lowercase
    cleantext = [w for w in words if w not in stop_words]  # eliminates common "stop words"
    return " ".join(cleantext)  # returns words as a string, each word separated by a space

In [8]:
clean_text_test = clean_text_for_movie(movies_features_text[0]['text'])

In [9]:
clean_text_test[:259]

'families talk sicario day soldado violence parts gruesome exciting movie achieve effects impact media violence kids drinking smoking drugs depicted glamorized movie make drug business look alluring movie say law versus justice difference two end justify means'

In [10]:
clean_text = []
def clean_text_for_movies(first_movie, num_movies_to_clean):
    print("Number of movies cleaned so far:")
    for movie in range(num_movies_to_clean):
        movie = (movie + first_movie)
        if movie % 1000 == 0:
            print(movie)
        clean_txt = clean_text_for_movie(movies_features_text[movie]['text'])
        movies_features_text[movie]['clean_text'] = clean_txt
        clean_text.append(clean_txt)
    return clean_text

In [11]:
clean_text = clean_text_for_movies(0,len(movies_features_text))

Number of movies cleaned so far:
0
1000
2000
3000
4000
5000
6000
7000
8000


In [12]:
df = pd.DataFrame(movies_features_text)
df.head()

Unnamed: 0,clean_text,movie_id,slug,text,title
0,families talk sicario day soldado violence par...,0,sicario-day-of-the-soldado,Families can talk about Sicario: Day of the So...,Sicario: Day of the Soldado
1,families talk damsel use violence intense freq...,1,damsel,Families can talk about Damsel use of violenc...,Damsel
2,families talk rapid-fire disturbing images dis...,2,distorted,Families can talk about the rapid-fire disturb...,Distorted
3,families talk berg sexual orientation presente...,3,the-catcher-was-a-spy,Families can talk about Berg sexual orientati...,The Catcher Was a Spy
4,families talk boundaries portrays drugs drug u...,4,boundaries,Families can talk about how Boundaries portray...,Boundaries


#### movies_features_text now has a new feature, clean_text, which holds the cleaned, lowercased, stopword-free words from the reviews and other text associated with each of our 8625 unique movies. movies_features_text is now ready for vectorization.

### Vectorize text for NLP:  TF-IDF
#### In this notebook, I will use a process called TF-IDF (Term Frequency Inverse Document Frequency) Vectorization on my text data to compare with my Count Vectorized predictor. TF-IDF gives the frequency of each word in the words associated with each movie (termed a "document") normalized by the frequency with which that word appears in all of the documents combined. In other words, words that appear frequently in text associated with all movies in general are not going to be counted as important as words that appear frequently in a small subset of documents.
#### After TF-IDF vectorization, I will use truncated SVD on the text data alone to reduce the number of features and thereby limit overfitting. The components that result from truncated SVD will be examined to identify discernible patterns.

In [16]:
#Run if you want to TF-IDF Vectorize features. Will take considerable time.
tvec = TfidfVectorizer(analyzer = "word",
                       tokenizer = None,      # tokenized in preprocessing
                       preprocessor = None,
                       stop_words = None,     # english stop words already removed, to retain -
                       min_df = 2,            # to eliminate typos
                       max_df = .9,           # to eliminate the word "movie"
                       max_features = 50000)

# pd.SparseDataFrame was removed in pandas 1.0 and get_feature_names() in scikit-learn 1.2;
# the calls below are their current equivalents.
data_features_tfidf = pd.DataFrame.sparse.from_spmatrix(tvec.fit_transform(clean_text),
                                                        columns=tvec.get_feature_names_out())
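
One thing worth checking here: the vectorizer above is configured with `tokenizer = None`, and as far as I understand it then still applies its default `token_pattern`, which treats '-' as a separator. The sketch below compares the default against a custom pattern on a made-up sample; the pattern `[a-zA-Z\-]+` is my assumption (it mirrors the cleaning regex), not a setting used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned text, in the same style as clean_text
sample = ["s--t rapid-fire violence", "rapid-fire action violence", "s--t language"]

# Default token_pattern: words of two or more letters/digits, so '-' splits tokens
default_vec = TfidfVectorizer()
default_vec.fit(sample)

# Custom pattern (an assumption): runs of letters and hyphens count as one token
hyphen_vec = TfidfVectorizer(token_pattern=r"[a-zA-Z\-]+")
hyphen_vec.fit(sample)

print(sorted(default_vec.vocabulary_))
print(sorted(hyphen_vec.vocabulary_))
```

With the default pattern, "s--t" is dropped entirely (each half is a single character) and "rapid-fire" becomes two words; with the custom pattern both survive as single tokens.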

In [17]:
# data_features_tfidf.shape

In [18]:
feature_names = list(tvec.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
feature_names

['aa',
 'aaa',
 'aaah',
 'aardman',
 'aaron',
 'aasif',
 'aback',
 'abacus',
 'abandon',
 'abandoned',
 'abandoning',
 'abandonment',
 'abandons',
 'abashed',
 'abba',
 'abbate',
 'abbess',
 'abbey',
 'abbi',
 'abbie',
 'abbot',
 'abbott',
 'abbreviated',
 'abby',
 'abc',
 'abdalla',
 'abdellatif',
 'abdi',
 'abdicate',
 'abdication',
 'abdomen',
 'abduct',
 'abducted',
 'abducting',
 'abduction',
 'abductions',
 'abductors',
 'abducts',
 'abdul',
 'abe',
 'abel',
 'abell',
 'abercrombie',
 'aberrant',
 'abetted',
 'abhor',
 'abhorrent',
 'abhorrently',
 'abhors',
 'abi',
 'abide',
 'abiding',
 'abigail',
 'abilities',
 'ability',
 'abin',
 'abject',
 'ablaze',
 'able',
 'abled',
 'ably',
 'abnegation',
 'abner',
 'abnormal',
 'abnormalities',
 'abnormally',
 'abo',
 'aboard',
 'abode',
 'abolish',
 'abolished',
 'abolishing',
 'abolition',
 'abolitionist',
 'abolitionists',
 'abominable',
 'abominably',
 'abomination',
 'abominations',
 'aboriginal',
 'aborigine',
 'aborigines',
 'abo

### Truncated SVD
#### Truncated SVD generates vectors that capture as much of the variance in our text data as possible in a small number of components.

In [19]:
svd = TruncatedSVD(n_components=1000)

In [20]:
# tfidfvec_truncated is fit_transformed w/1000 components
tfidfvec_truncated = svd.fit_transform(data_features_tfidf)

In [21]:
tfidfvec_truncated.shape

(8625, 1000)

In [22]:
# columns are svd components, 0 - 999, for tf-idf vectorized words; index is words
components_tfidfvec = pd.DataFrame(svd.components_.T, index=feature_names)

In [23]:
components_tfidfvec.shape

(42188, 1000)
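
The two shapes above follow a general pattern: `fit_transform` returns one row per document, while `components_` has one row per component and one column per term. A minimal sketch on a small random sparse matrix (a hypothetical stand-in for the TF-IDF matrix) shows the same relationship:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for the TF-IDF matrix: 20 documents x 50 terms, mostly zeros
rng = np.random.RandomState(0)
dense = rng.rand(20, 50) * (rng.rand(20, 50) < 0.2)
X = csr_matrix(dense)

svd_demo = TruncatedSVD(n_components=5, random_state=0)
docs_reduced = svd_demo.fit_transform(X)

print(docs_reduced.shape)            # one row per document, one column per component
print(svd_demo.components_.shape)    # one row per component, one column per term
print(svd_demo.components_.T.shape)  # transposed: one row per term, as in components_tfidfvec
```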

#### Explore components to identify meanings of components

In [70]:
word_importance_component_1 = components_tfidfvec[0].sort_values(ascending=False)
word_importance_component_1[0:6]

characters    0.145347
one           0.140572
sex           0.115348
character     0.111659
film          0.106805
violence      0.106770
Name: 0, dtype: float64

In [71]:
word_importance_component_1[-7:-1]

prejudge            0.000025
beliebers           0.000025
persson             0.000023
sundberg            0.000023
rebranding          0.000023
anthropomorphize    0.000020
Name: 0, dtype: float64

In [None]:
word_importances = []  # a list, since dicts have no append()
word_importances.append(word_importance_component_1[0:10])

In [72]:
word_importance_component_2 = components_tfidfvec[1].sort_values(ascending=False)
word_importance_component_2[0:6]

christmas    0.303203
kids         0.204020
santa        0.142073
holiday      0.137228
dog          0.133801
family       0.129417
Name: 1, dtype: float64

In [73]:
word_importance_component_2[-7:-1]

nudity     -0.098234
violence   -0.102057
women      -0.103081
shown      -0.109350
drug       -0.118820
sexual     -0.125117
Name: 1, dtype: float64

In [74]:
word_importance_component_3 = components_tfidfvec[2].sort_values(ascending=False)
word_importance_component_3[0:6]

school       0.203101
sex          0.173508
teen         0.153460
girls        0.117679
christmas    0.116954
teens        0.115779
Name: 2, dtype: float64

In [75]:
word_importance_component_3[-7:-1]

evil     -0.092100
horror   -0.095386
scary    -0.100603
blood    -0.114092
war      -0.148512
action   -0.171642
Name: 2, dtype: float64

In [82]:
word_importance_component_6 = components_tfidfvec[5].sort_values(ascending=False)
word_importance_component_6[0:6]

dog        0.436106
dogs       0.203989
animals    0.180841
animal     0.124016
family     0.122604
man        0.099306
Name: 5, dtype: float64

In [83]:
word_importance_component_6[-7:-1]

war         -0.121366
santa       -0.124257
teen        -0.175223
girls       -0.177246
high        -0.185521
christmas   -0.207730
Name: 5, dtype: float64

In [86]:
word_importance_component_8 = components_tfidfvec[7].sort_values(ascending=False)
word_importance_component_8[0:6]

dog       0.436691
war       0.210951
dogs      0.194946
team      0.167513
school    0.156317
sports    0.142064
Name: 7, dtype: float64

In [87]:
word_importance_component_8[-7:-1]

magic     -0.071185
fantasy   -0.077160
horror    -0.083386
fairy     -0.092961
scary     -0.110499
love      -0.114002
Name: 7, dtype: float64
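
The cells above repeat the same two steps for each component. A small helper could do this for any component; the sketch below uses a made-up four-word table, and `extreme_words` is a hypothetical name. Note that `tail(n)` includes the very last (most negative) word, whereas a `[-7:-1]` slice stops one short of it.

```python
import pandas as pd

def extreme_words(components, component, n=6):
    # Sort one component's word weights from most positive to most negative
    weights = components[component].sort_values(ascending=False)
    # Return the n most positive and the n most negative words
    return weights.head(n), weights.tail(n)

# Hypothetical word-by-component table (rows are words, columns are components)
toy = pd.DataFrame(
    {0: [0.5, 0.1, -0.4, 0.0], 1: [-0.2, 0.6, 0.3, -0.7]},
    index=["christmas", "santa", "war", "dog"],
)
top, bottom = extreme_words(toy, 0, n=2)
print(top)
print(bottom)
```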

In [88]:
word_importance_df = {}

In [90]:
# word_importance_df = comp_1:word_importance_component_1[0:6]

In [28]:
# svd_explained_variance = svd.explained_variance_

In [29]:
def cum_sum_explained_var(vect, total_comp):
    '''Returns a list of (component index, cumulative explained variance) tuples.'''
    max_comp = len(vect.explained_variance_)
    if total_comp > max_comp:
        print("That's too many components. Max_components is", max_comp)
        total_comp = int(input("Enter new total_components:"))
    cum_sum_explained_variance = []
    cum_sum_var = 0
    for i in range(total_comp):
        cum_sum_var += vect.explained_variance_[i]
        cum_sum_explained_variance.append((i, cum_sum_var))  # tuple keeps (index, value) order
    return cum_sum_explained_variance

In [30]:
# Run cell to get cumulative list of explained variances. Only 40% of variance explained by
# text features...

# cum_sum_explained_var(svd, 1000)
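
To express the running total as a share of the total variance, one option is to accumulate `explained_variance_ratio_` instead of `explained_variance_`. The sketch below works on a made-up array of ratios rather than the fitted `svd` object; `components_for_variance` is a hypothetical helper.

```python
import numpy as np

def components_for_variance(ratios, target):
    # Running total of the explained-variance ratios
    cumulative = np.cumsum(ratios)
    if cumulative[-1] < target:
        return None  # the target is never reached with these components
    # Index of the first component at which the running total reaches the target
    return int(np.searchsorted(cumulative, target) + 1)

# Made-up ratios for five components; in the notebook this would be
# svd.explained_variance_ratio_
ratios = np.array([0.20, 0.10, 0.05, 0.03, 0.02])
print(np.cumsum(ratios))
print(components_for_variance(ratios, 0.25))
print(components_for_variance(ratios, 0.90))
```

In this toy case two components are enough to pass 25%, and 90% is never reached, which is the kind of result the "only 40% of variance explained" comment above describes.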

In [31]:
# data_features_tfidf.shape[1]

In [32]:
## pd.DataFrame(index=feature_names, columns=components_tfidf)
## tfidf_features['feature names'] = vocab
## #pd.DataFrame(features_components_tfidf

### Cosine Similarity:  TFIDF_truncSVD1000

In [33]:
### Calculate as a matrix of all movies to all movies of tfidfvec_truncated
sim_matrix_tfidfvec_truncSVD1000 = cosine_similarity(tfidfvec_truncated, tfidfvec_truncated)

In [34]:
similarity_matrix_tfidfvec_truncSVD1000 = pd.DataFrame(sim_matrix_tfidfvec_truncSVD1000,
                                                       columns=df['title'], index=df['title'])

In [47]:
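# Write to data/, where the load cell below looks for it; the with block closes the file
with open("data/similarity_matrix_tfidfvec_truncSVD1000.pkl", "wb") as f:
    pickle.dump(similarity_matrix_tfidfvec_truncSVD1000, f)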

In [35]:
similarity_matrix_tfidfvec_truncSVD1000.head()

title,Sicario: Day of the Soldado,Damsel,Distorted,The Catcher Was a Spy,Boundaries,Izzy Gets the F*ck Across Town,Jurassic World: Fallen Kingdom,Brothers of the Wind,Unsane,Flower,...,Live and Let Die,Tintin: The Lake of Sharks,Tales of Beatrix Potter,Tintin: The Prisoners of the Sun,Gentle Giant,Tintin: The Calculus Affair,Visit to a Small Planet,Zoo Baby,Driftwood,Sherlock Jr.
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Sicario: Day of the Soldado,1.0,0.191863,0.121058,0.105026,0.190047,0.118741,0.133465,0.084356,0.119499,0.148502,...,0.160482,0.065431,0.080594,0.049716,0.082526,0.055607,0.120428,0.034805,0.063027,0.073078
Damsel,0.191863,1.0,0.155643,0.12972,0.253115,0.173843,0.107899,0.098123,0.173747,0.122491,...,0.0931,0.051277,0.057892,0.037933,0.07835,0.039502,0.109071,0.041265,0.084943,0.07252
Distorted,0.121058,0.155643,1.0,0.134186,0.085488,0.111801,0.07905,0.142336,0.330883,0.1137,...,0.078046,0.050507,0.058628,0.037584,0.060264,0.058182,0.101859,0.023467,0.111956,0.078928
The Catcher Was a Spy,0.105026,0.12972,0.134186,1.0,0.096177,0.109292,0.084539,0.099543,0.164536,0.114867,...,0.056778,0.063964,0.042044,0.030727,0.113818,0.069232,0.088568,0.046777,0.037457,0.054134
Boundaries,0.190047,0.253115,0.085488,0.096177,1.0,0.170149,0.07711,0.079705,0.111558,0.21497,...,0.100153,0.02271,0.058466,0.030532,0.057677,0.033926,0.138365,0.057276,0.118458,0.065227


In [36]:
pd.DataFrame(similarity_matrix_tfidfvec_truncSVD1000['Damsel'].sort_values(ascending=False)[1:2].index)
            

Unnamed: 0,title
0,Diablo


In [37]:
# with open('data/similarity_matrix_tfidfvec_truncSVD1000.pkl', 'rb') as f:
#     similarity_matrix_tfidfvec_truncSVD1000 = pickle.load(f)

EOFError: Ran out of input

In [41]:
pd.DataFrame(similarity_matrix_tfidfvec_truncSVD1000['Damsel'].sort_values(ascending=False)[1:1000]).index[2]

'The Killer (O Matador)'

### Export Similarity Matrix:

In [38]:
type(similarity_matrix_tfidfvec_truncSVD1000)

pandas.core.frame.DataFrame

In [39]:
np.save('data/sim_matrix_tfidfvec_truncSVD1000.npy', sim_matrix_tfidfvec_truncSVD1000)

In [40]:
similarity_matrix_tfidfvec_truncSVD1000.to_csv('data/similarity_matrix_tfidfvec_truncSVD1000.csv')

### Find Similar Movies

In [49]:
movie_list = df['title']

In [50]:
movie_list[0:5]

0    Sicario: Day of the Soldado
1                         Damsel
2                      Distorted
3          The Catcher Was a Spy
4                     Boundaries
Name: title, dtype: object

In [43]:
def title_recommender(movie_name, movie_list, limit=3):
    results = process.extract(movie_name, movie_list, limit=limit)
    return results
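
If fuzzywuzzy is not installed, a rough substitute can be built on the standard library's `difflib`. This is a swapped-in technique, not what the notebook uses, and its scores are not comparable to fuzzywuzzy's; `suggest_titles` and the short title list are hypothetical.

```python
import difflib

def suggest_titles(movie_name, titles, limit=3):
    # Compare in lowercase so capitalization does not affect the match
    lowered = {t.lower(): t for t in titles}
    matches = difflib.get_close_matches(
        movie_name.lower(), list(lowered), n=limit, cutoff=0.0
    )
    # Map the lowercase matches back to the original titles
    return [lowered[m] for m in matches]

titles = ["Damsel", "Distorted", "Boundaries", "Unsane"]
print(suggest_titles("damsle", titles, limit=1))
```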

In [44]:
def find_similar_movies():
    movie_name = input("Give me a movie title and I'll give you five titles you might also like:")
    limit = 3
    # Keep asking until the input exactly matches a title in the data set
    while movie_name not in df['title'].values:
        results = title_recommender(movie_name, df['title'], limit=limit)
        print("Sorry, that movie title isn't in my list. Did you mean", results, "?")
        movie_name = input("(I need the exact title, please...)")
        limit = min(limit + 1, 10)  # offer one more suggestion each round, up to 10
    sim_movies_text = similarity_matrix_tfidfvec_truncSVD1000[movie_name]
    print("Thanks! Here are my recommendations, along with review text similarity scores:")
    # Position 0 is the movie itself (similarity 1.0), so take positions 1-5
    recommendations = pd.DataFrame(sim_movies_text.sort_values(ascending=False)[1:6])
    return recommendations

In [45]:
def find_all_similar_movies():
    movie_name = input("Give me a movie title and I'll rank every other title by similarity:")
    limit = 3
    # Keep asking until the input exactly matches a title in the data set
    while movie_name not in df['title'].values:
        results = title_recommender(movie_name, df['title'], limit=limit)
        print("Sorry, that movie title isn't in my list. Did you mean", results, "?")
        movie_name = input("(I need the exact title, please...)")
        limit = min(limit + 1, 10)  # offer one more suggestion each round, up to 10
    sim_movies_text = similarity_matrix_tfidfvec_truncSVD1000[movie_name]
    print("Thanks! Here are my recommendations, along with review text similarity scores:")
    # Passing index=df['title'] here would undo the sort, so the sorted order is kept as-is
    recommendations = pd.DataFrame(sim_movies_text.sort_values(ascending=False))
    return recommendations
In [46]:
find_similar_movies()

Give me a movie title and I'll give you five titles you might also like:Sicario
Thanks! Here are my recommendations, along with review text similarity scores:


Unnamed: 0_level_0,Sicario
title,Unnamed: 1_level_1
Sicario: Day of the Soldado,0.615682
Survivor,0.411374
Kate and Leopold,0.406113
Savages,0.400888
Smashed,0.392463


### Recommender System Evaluation
#### To evaluate my recommender system, I will have my colleagues test it out, recording what they think of each recommendation on a 6-point scale (5 = excellent recommendation, 4 = good, 3 = fair, 2 = poor, 1 = unrelated, 0 = IDK, that's a movie I've never seen).

In [None]:
def system_test(trials):
    rate_recs = []
    for trial in reversed(range(trials)):
        print("Thank you for trying out MovieRec4Parents(tm)! You have", trial+1, "tries to go.")
        recommendations = find_similar_movies()
        print(recommendations)
        rate_recs.append(recommendations)
        for rec in range(len(recommendations)):
            print("On a scale of 1-5, how good is recommendation", rec+1, "? (If you don't know the movie, enter 0.)")
            rating = input()
            rate_recs.append((trial, rec+1, rating))
    print("You're done! I hope you enjoyed using MovieRec4Parents(tm). Tell your friends!")
    return rate_recs

In [None]:
system_test(5)

In [None]:
system_test(5)

In [None]:
recommendations_results = []
rec_results = []

In [None]:
rec_results = system_test(5)

In [None]:
recommendations_results.append(rec_results)
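
Once ratings have been collected, they could be summarized along these lines. This is a sketch under assumptions: it expects the `(trial, rec, rating)` tuples that `system_test` appends, skips any other items in the list, treats 0 as "don't know", and uses made-up ratings rather than real results. `mean_rating` is a hypothetical helper.

```python
def mean_rating(rate_recs):
    # Keep only the (trial, recommendation number, rating) tuples;
    # ratings arrive as strings from input(), so convert them to int
    ratings = [int(item[2]) for item in rate_recs if isinstance(item, tuple)]
    # A rating of 0 means the tester did not know the movie, so leave it out
    known = [r for r in ratings if r != 0]
    return sum(known) / len(known) if known else None

# Made-up ratings for illustration only
sample_results = [(4, 1, "5"), (4, 2, "0"), (4, 3, "3"), (3, 1, "4")]
print(mean_rating(sample_results))
```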

### Function to divide CosSimMatrix into 15 pieces so it can be uploaded to GitHub:

In [None]:
def decompose_sim_matrix(simmatrix=similarity_matrix_tfidfvec_truncSVD1000, n_pieces=15):
    '''
    Splits the similarity matrix row-wise into n_pieces DataFrames
    (8625 movies / 15 pieces = 575 rows each) and returns them as a list.
    '''
    rows_per_piece = -(-len(simmatrix) // n_pieces)  # ceiling division, so no rows are dropped
    return [simmatrix.iloc[start:start + rows_per_piece]
            for start in range(0, len(simmatrix), rows_per_piece)]

### Function to Compile Cosine Similarity Matrix:

In [None]:
# To build the cosine similarity matrix from scratch, you will need to run my notebooks in order.
# Please allow hours to do so. Or, you can contact me directly!
def recompile_sim_matrix(pieces):
    '''
    Takes the list of row-wise pieces of the similarity matrix, in their original order,
    and stacks them back into a single DataFrame.
    '''
    return pd.concat(pieces)
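
Before relying on a split-and-recombine scheme, it is worth confirming that it round-trips. The sketch below does this on a small made-up matrix; `split_rows` and the labels are hypothetical, and the real matrix would be 8625 x 8625 rather than 6 x 6.

```python
import numpy as np
import pandas as pd

def split_rows(matrix, n_pieces):
    # Ceiling division, so a remainder of rows is never dropped
    size = -(-len(matrix) // n_pieces)
    return [matrix.iloc[i:i + size] for i in range(0, len(matrix), size)]

# Small stand-in for the similarity matrix, with the same row and column labels
labels = [f"movie_{i}" for i in range(6)]
toy = pd.DataFrame(np.arange(36).reshape(6, 6), index=labels, columns=labels)

pieces = split_rows(toy, 3)
rebuilt = pd.concat(pieces)  # stack the pieces back together in order
print(len(pieces), rebuilt.equals(toy))
```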