## Tokenizing and TF-IDF Vectorization:  Text-based Cosine Similarity
#### In this notebook, I will process the text data for each movie: a final cleaning of punctuation, making all words lowercase, and splitting the text into individual word "tokens." (N.B. for those who are text processing-savvy: this was done outside of the vectorizers in order to retain the '-', since curse words are represented in these reviews as s--t and f--k. The frequency with which these curse words appear in reviews may be an important text feature for parents.)

#### The vectorizer I will use in this notebook is the TF-IDF Vectorizer. It takes the frequency of each word in the text associated with a movie and scales it down according to how many of the documents in the whole collection contain that word. This technique is designed to give greater weight to terms that occur frequently in a document but not in other documents, thus controlling for words that appear frequently for all movies, such as the word "movie." The TF-IDF statistics for a movie then form that movie's "word vector." These word vectors will then be run through Truncated SVD as described in Notebook 5.
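
For reference, the weighting can be written out as below. This is the smoothed form that scikit-learn's `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `norm='l2'`). The symbols are notation introduced here, not taken from the notebook: $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the count of $t$ in document $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
w(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t), \qquad
v_d = \frac{\big(w(t, d)\big)_t}{\big\lVert \big(w(t, d)\big)_t \big\rVert_2}
$$

A term that appears in every document has $\mathrm{idf}(t) = 1$, the smallest possible value, so it contributes the least per occurrence to the movie's word vector $v_d$.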

#### Once the text data has been reduced with truncated SVD, I will use cosine similarity to determine which movies are most similar to which other movies in our data set. We will then also incorporate non-text data to see how this improves our cosine similarity matrix (see below).

### Load Movie Text Data

In [62]:
import pandas as pd
import numpy as np
import requests, re, json, copy, pickle
import matplotlib.pyplot as plt
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from fuzzywuzzy import process

In [4]:
# Load json of movies_features_text
with open('data/movies_features_text.json') as json_file:  
    movies_features_text = json.load(json_file)

In [5]:
df = pd.DataFrame(movies_features_text)
df.head()

Unnamed: 0,movie_id,slug,text,title
0,0,sicario-day-of-the-soldado,Families can talk about Sicario: Day of the So...,Sicario: Day of the Soldado
1,1,damsel,Families can talk about Damsel use of violenc...,Damsel
2,2,distorted,Families can talk about the rapid-fire disturb...,Distorted
3,3,the-catcher-was-a-spy,Families can talk about Berg sexual orientati...,The Catcher Was a Spy
4,4,boundaries,Families can talk about how Boundaries portray...,Boundaries


In [6]:
def get_movie_list(last_movie):
    # Collect the titles of the first last_movie movies
    movie_titles = [movies_features_text[movie_num]['title'] for movie_num in range(last_movie)]
    return movie_titles

In [7]:
range(3)

range(0, 3)

In [8]:
movies_features_text[0]['text'][:380]

'Families can talk about Sicario: Day of the Soldado  violence. Which parts were gruesome, and which were exciting? How did the movie achieve these effects? What  the impact of media violence on kids?  How are drinking, smoking, and drugs depicted? Are they glamorized? Does the movie make the drug business look alluring?  What does the movie have to say about law versus justice?'

In [9]:
stop_words = set(stopwords.words('english'))  # separate name, so the nltk stopwords module is not overwritten
# Remove punctuation from all text of each movie and remove stopwords
def clean_text_for_movie(text):
    '''
    Takes in all text of a single movie, makes lowercase and removes punctuation and stopwords
    from text. Returns words in input text as a single string, w/o English stopwords.
    '''
    words = re.sub(r"[^a-zA-Z\-]", " ", text).lower().split()  # removes punctuation except '-', makes lowercase
    cleantext = [w for w in words if w not in stop_words]  # eliminates common "stop words"
    return " ".join(cleantext)  # returns words as a string, each word separated by a space

In [10]:
clean_text_test = clean_text_for_movie(movies_features_text[0]['text'])

In [11]:
clean_text_test[:259]

'families talk sicario day soldado violence parts gruesome exciting movie achieve effects impact media violence kids drinking smoking drugs depicted glamorized movie make drug business look alluring movie say law versus justice difference two end justify means'

In [12]:
clean_text = []
def clean_text_for_movies(first_movie, num_movies_to_clean):
    print("Number of movies cleaned so far:")
    for movie in range(num_movies_to_clean):
        movie = (movie + first_movie)
        if movie % 1000 == 0:
            print(movie)
        clean_txt = clean_text_for_movie(movies_features_text[movie]['text'])
        movies_features_text[movie]['clean_text'] = clean_txt
        clean_text.append(clean_txt)
    return clean_text

In [13]:
clean_text = clean_text_for_movies(0,len(movies_features_text))

Number of movies cleaned so far:
0
1000
2000
3000
4000
5000
6000
7000
8000


In [14]:
df = pd.DataFrame(movies_features_text)
df.head()

Unnamed: 0,clean_text,movie_id,slug,text,title
0,families talk sicario day soldado violence par...,0,sicario-day-of-the-soldado,Families can talk about Sicario: Day of the So...,Sicario: Day of the Soldado
1,families talk damsel use violence intense freq...,1,damsel,Families can talk about Damsel use of violenc...,Damsel
2,families talk rapid-fire disturbing images dis...,2,distorted,Families can talk about the rapid-fire disturb...,Distorted
3,families talk berg sexual orientation presente...,3,the-catcher-was-a-spy,Families can talk about Berg sexual orientati...,The Catcher Was a Spy
4,families talk boundaries portrays drugs drug u...,4,boundaries,Families can talk about how Boundaries portray...,Boundaries


#### movies_features_text now has a new feature, clean_text, which contains the cleaned words (lowercase, with punctuation and stop words removed) from the movie reviews and other text associated with each of our 8625 unique movies. movies_features_text is now ready for vectorization.

### Vectorize text for NLP:  TF-IDF
#### In this notebook, I will use a process called TF-IDF (Term Frequency-Inverse Document Frequency) vectorization on my text data to compare with my Count Vectorized predictor. TF-IDF gives the frequency of each word in the text associated with each movie (termed a "document"), weighted down by how many of the documents that word appears in. In other words, words that appear frequently in the text associated with movies in general are not going to count as much as words that appear frequently in a small subset of documents.
#### After TF-IDF vectorization, I will then use truncated SVD on the text data alone to reduce the number of features and so reduce overfitting. The components that result from truncated SVD will be examined to identify discernible patterns.
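
The weighting described above can be worked through by hand on a tiny corpus. This is only a sketch of the arithmetic: the three documents are made up, and the smoothed IDF and L2 normalization are assumed to match scikit-learn's default settings rather than taken from this notebook.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not from the data set); each string is one "document"
docs = [
    "movie violence drugs movie",
    "movie christmas santa",
    "movie dog dog animals",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Return {term: weight} using smoothed IDF and L2 normalization."""
    counts = Counter(tokens)
    raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
           for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, "->", {t: round(w, 3) for t, w in vec.items()})
```

In the second document every word occurs once, yet "movie" ends up with a lower weight than "christmas" because "movie" appears in all three documents.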

In [15]:
# Run if you want to TF-IDF Vectorize features. Will take considerable time.
# tvec = TfidfVectorizer(analyzer = "word",
#                        tokenizer = None,      # tokenized in preprocessing
#                        preprocessor = None,
#                        stop_words = None,     # english stop words already removed, to retain -
#                        min_df = 2,            # to eliminate typos
#                        max_df = .9,           # to eliminate the word "movie"
#                        max_features = 50000) 
# 
# data_features_tfidf = pd.SparseDataFrame(tvec.fit_transform(clean_text),
#                                          columns=tvec.get_feature_names(),
#                                          default_fill_value=0)

In [16]:
# data_features_tfidf.shape

(8625, 42188)
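
One thing that may be worth checking: with `tokenizer = None`, scikit-learn's vectorizers still split each string using their `token_pattern` parameter, whose documented default is `(?u)\b\w\w+\b`. The sketch below applies that default pattern with plain `re` to a made-up sample string, next to a hypothetical alternative pattern that allows hyphens inside a token. Neither the sample string nor `HYPHEN_TOKEN_PATTERN` comes from the notebook.

```python
import re

# Default token_pattern documented for scikit-learn's text vectorizers
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical alternative that allows hyphens inside a token
HYPHEN_TOKEN_PATTERN = r"(?u)\b\w[\w-]*\w\b"

# Made-up sample in the style of the cleaned review text
sample = "families talk rapid-fire disturbing images s--t f--k"

default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, sample)
hyphen_tokens = re.findall(HYPHEN_TOKEN_PATTERN, sample)

print(default_tokens)
print(hyphen_tokens)
```

If hyphenated tokens turn out to be missing from `feature_names`, passing a pattern like the second one as `token_pattern` to `TfidfVectorizer` is one possible way to keep them.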

In [17]:
feature_names = tvec.get_feature_names()
feature_names

['aa',
 'aaa',
 'aaah',
 'aardman',
 'aaron',
 'aasif',
 'aback',
 'abacus',
 'abandon',
 'abandoned',
 'abandoning',
 'abandonment',
 'abandons',
 'abashed',
 'abba',
 'abbate',
 'abbess',
 'abbey',
 'abbi',
 'abbie',
 'abbot',
 'abbott',
 'abbreviated',
 'abby',
 'abc',
 'abdalla',
 'abdellatif',
 'abdi',
 'abdicate',
 'abdication',
 'abdomen',
 'abduct',
 'abducted',
 'abducting',
 'abduction',
 'abductions',
 'abductors',
 'abducts',
 'abdul',
 'abe',
 'abel',
 'abell',
 'abercrombie',
 'aberrant',
 'abetted',
 'abhor',
 'abhorrent',
 'abhorrently',
 'abhors',
 'abi',
 'abide',
 'abiding',
 'abigail',
 'abilities',
 'ability',
 'abin',
 'abject',
 'ablaze',
 'able',
 'abled',
 'ably',
 'abnegation',
 'abner',
 'abnormal',
 'abnormalities',
 'abnormally',
 'abo',
 'aboard',
 'abode',
 'abolish',
 'abolished',
 'abolishing',
 'abolition',
 'abolitionist',
 'abolitionists',
 'abominable',
 'abominably',
 'abomination',
 'abominations',
 'aboriginal',
 'aborigine',
 'aborigines',
 'abo

### Truncated SVD
#### Truncated SVD generates vectors that capture as much of the variance in our text data as possible in as few components as possible.

In [29]:
svd = TruncatedSVD(n_components=1000)

In [30]:
# tfidfvec_truncated is fit_transformed w/1000 components
tfidfvec_truncated = svd.fit_transform(data_features_tfidf)

In [31]:
tfidfvec_truncated.shape

(8625, 1000)
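
To make the shapes above concrete, here is a minimal NumPy sketch of what a truncated SVD does, using a small random matrix as a hypothetical stand-in for the TF-IDF matrix. It uses a full SVD and keeps the top `k` components, which is fine for a toy example; `TruncatedSVD` uses solvers suited to large sparse matrices, and the signs of its components may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a TF-IDF matrix: 6 documents x 8 terms
X = rng.random((6, 8))
k = 3  # number of components to keep

# Full SVD, then keep only the k largest singular values and their vectors
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = U[:, :k] * s[:k]   # documents in component space, shape (6, k)
components = Vt[:k]            # term loadings per component, shape (k, 8)

# Projecting X onto the kept components gives the same reduced matrix
projected = X @ components.T

print(X_reduced.shape, components.shape)
```

Here `X_reduced` plays the role of `tfidfvec_truncated` and `components` the role of `svd.components_`, which is why transposing the components gives one row per word.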

In [32]:
# columns are svd components, 0 - 999, for tf-idf vectorized words; index is words
components_tfidfvec = pd.DataFrame(svd.components_.T, index=feature_names)

In [34]:
components_tfidfvec.shape

(42188, 1000)

#### Explore components to identify meanings of components

In [127]:
word_importance_component_1 = components_tfidfvec[0].sort_values(ascending=False)
word_importance_component_1[0:10]

characters    0.145347
one           0.140572
sex           0.115348
character     0.111659
film          0.106805
violence      0.106770
man           0.096029
also          0.092475
kids          0.091393
shown         0.090240
Name: 0, dtype: float64

In [119]:
word_importance_component_2 = components_tfidfvec[1].sort_values(ascending=False)
word_importance_component_2[0:10]

christmas    0.303203
kids         0.204020
santa        0.142073
holiday      0.137228
dog          0.133801
family       0.129417
animals      0.123768
children     0.087443
story        0.078297
animated     0.077408
Name: 1, dtype: float64

In [120]:
word_importance_component_3 = components_tfidfvec[2].sort_values(ascending=False)
word_importance_component_3[0:10]

school       0.203101
sex          0.173508
teen         0.153460
girls        0.117679
christmas    0.116954
teens        0.115779
high         0.104160
girl         0.090488
family       0.087580
comedy       0.086355
Name: 2, dtype: float64

In [128]:
# Column 5 is the sixth SVD component (columns are numbered from 0)
word_importance_component_6 = components_tfidfvec[5].sort_values(ascending=False)
word_importance_component_6[0:10]

dog        0.436106
dogs       0.203988
animals    0.180841
animal     0.124016
family     0.122604
man        0.099306
love       0.088713
comedy     0.088291
humor      0.087954
woman      0.080714
Name: 5, dtype: float64

In [40]:
# svd_explained_variance = svd.explained_variance_

In [41]:
def cum_sum_explained_var(vect, total_comp):
    '''
    Returns a list of (component index, cumulative explained variance) tuples
    for the first total_comp components of a fitted TruncatedSVD.
    '''
    max_comp = len(vect.explained_variance_)
    if total_comp > max_comp:
        print("That's too many components. Max_components is", max_comp)
        total_comp = int(input("Enter new total_components:"))
    cum_sum_explained_variance = []
    cum_sum_var = 0
    for i in range(total_comp):
        cum_sum_var += vect.explained_variance_[i]
        cum_sum_explained_variance.append((i, cum_sum_var))  # tuple (not a set) keeps index and value in order
    return cum_sum_explained_variance

In [129]:
# Run cell to get cumulative list of explained variances. Only 40% of variance explained by
# text features...

# cum_sum_explained_var(svd, 1000)
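
When the goal is a percentage of variance explained, a cumulative sum over explained-variance ratios (the `explained_variance_ratio_` attribute of a fitted `TruncatedSVD`) reads more directly. The sketch below uses made-up ratios, not this notebook's values, and `components_needed` is a hypothetical helper.

```python
import numpy as np

# Hypothetical explained-variance ratios for five components (not the notebook's values)
explained_variance_ratio = np.array([0.20, 0.10, 0.05, 0.03, 0.02])

# Running total of the fraction of variance explained
cumulative = np.cumsum(explained_variance_ratio)

def components_needed(cumulative, threshold):
    """Smallest number of components whose cumulative ratio reaches threshold, or None."""
    reached = np.nonzero(cumulative >= threshold)[0]
    return int(reached[0]) + 1 if reached.size else None

print(cumulative)
print(components_needed(cumulative, 0.25))
print(components_needed(cumulative, 0.50))
```

With these made-up numbers, two components are enough to pass 25%, while 50% is never reached, so the helper returns `None`.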

In [43]:
# data_features_tfidf.shape[1]

In [44]:
## pd.DataFrame(index=feature_names, columns=components_tfidf)
## tfidf_features['feature names'] = vocab
## #pd.DataFrame(features_components_tfidf

### Cosine Similarity:  TFIDF_truncSVD1000

In [46]:
# Calculate cosine similarity of all movies to all movies from tfidfvec_truncated
sim_matrix_tfidfvec_truncSVD1000 = cosine_similarity(tfidfvec_truncated, tfidfvec_truncated)

In [47]:
similarity_matrix_tfidfvec_truncSVD1000 = pd.DataFrame(sim_matrix_tfidfvec_truncSVD1000,
                                                       columns=df['title'], index=df['title'])

In [48]:
similarity_matrix_tfidfvec_truncSVD1000.head()

title,Sicario: Day of the Soldado,Damsel,Distorted,The Catcher Was a Spy,Boundaries,Izzy Gets the F*ck Across Town,Jurassic World: Fallen Kingdom,Brothers of the Wind,Unsane,Flower,...,Live and Let Die,Tintin: The Lake of Sharks,Tales of Beatrix Potter,Tintin: The Prisoners of the Sun,Gentle Giant,Tintin: The Calculus Affair,Visit to a Small Planet,Zoo Baby,Driftwood,Sherlock Jr.
Sicario: Day of the Soldado,1.0,0.185875,0.120836,0.076947,0.179054,0.123449,0.128679,0.095827,0.141957,0.146939,...,0.15741,0.071288,0.060127,0.046608,0.075211,0.057875,0.114417,0.030031,0.073789,0.059615
Damsel,0.185875,1.0,0.142226,0.124921,0.256099,0.182127,0.111982,0.112171,0.152846,0.132687,...,0.098654,0.060338,0.051493,0.04248,0.077435,0.046213,0.108209,0.040167,0.095263,0.081859
Distorted,0.120836,0.142226,1.0,0.147642,0.086119,0.100196,0.068447,0.116645,0.320688,0.10908,...,0.087186,0.047032,0.064578,0.036709,0.048456,0.065025,0.099285,0.025055,0.099473,0.077244
The Catcher Was a Spy,0.076947,0.124921,0.147642,1.0,0.075768,0.103934,0.074313,0.111247,0.151493,0.098537,...,0.061006,0.066348,0.053344,0.021037,0.093828,0.070085,0.08747,0.024565,0.043113,0.051983
Boundaries,0.179054,0.256099,0.086119,0.075768,1.0,0.166313,0.076284,0.073455,0.138091,0.231099,...,0.099723,0.024769,0.058123,0.021221,0.058384,0.023954,0.131997,0.051351,0.112004,0.063471
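
For readers who want to see what each entry of this matrix means, cosine similarity is the dot product of two vectors after each has been scaled to unit length. The following is a small NumPy sketch on made-up vectors; `cosine_similarity_matrix` is a hypothetical helper, not the scikit-learn function used above.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms          # assumes no all-zero rows
    return unit @ unit.T

# Hypothetical example: 3 movies described by 2 components
X = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [0.0, 3.0]])
sim = cosine_similarity_matrix(X)
print(np.round(sim, 3))
```

The first two rows point in the same direction, so their similarity is 1 even though their lengths differ. The third row is orthogonal to both and scores 0.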


In [99]:
pd.DataFrame(similarity_matrix_tfidfvec_truncSVD1000['Damsel'].sort_values(ascending=False)[1:2].index)
            

Unnamed: 0,title
0,Once Upon a Time in the West


In [100]:
with open('data/similarity_matrix_tfidfvec_truncSVD1000.pkl', 'rb') as f:
    similarity_matrix_tfidfvec_truncSVD1000 = pickle.load(f)

In [113]:
pd.DataFrame(similarity_matrix_tfidfvec_truncSVD1000['Damsel'].sort_values(ascending=False)[1:1000]).index[2]

'The Killer (O Matador)'

### Export Similarity Matrix:

In [85]:
type(similarity_matrix_tfidfvec_truncSVD1000)

pandas.core.frame.DataFrame

In [75]:
np.save('data/sim_matrix_tfidfvec_truncSVD1000.npy', sim_matrix_tfidfvec_truncSVD1000)

In [87]:
similarity_matrix_tfidfvec_truncSVD1000.to_csv('data/similarity_matrix_tfidfvec_truncSVD1000.csv')

### Find Similar Movies

In [49]:
movie_list = df['title']

In [50]:
def title_recommender(movie_name, movie_list, limit=3):
    results = process.extract(movie_name, movie_list, limit=limit)
    return results
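
If `fuzzywuzzy` is not available, a rough stand-in for the title suggestions can be built with the standard library's `difflib`. This swaps in a different matching method (`difflib.get_close_matches`), so the scores and ordering will not match `process.extract`. The function name, the `cutoff` value and the title list are illustrative assumptions.

```python
import difflib

def suggest_titles(movie_name, titles, limit=3, cutoff=0.4):
    """Return up to `limit` titles that most resemble movie_name."""
    return difflib.get_close_matches(movie_name, titles, n=limit, cutoff=cutoff)

# Hypothetical short list of titles
titles = ["Jaws", "Jaws 2", "Jaws: The Revenge", "The Shallows", "Damsel"]
suggestions = suggest_titles("Jaws II", titles)
print(suggestions)
```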

In [51]:
def find_similar_movies():
    movie_name = input("Give me a movie title and I'll give you five titles you might also like:")
    titles = set(df['title'])
    limit = 3
    # Keep asking until the user enters an exact title, offering fuzzy matches as hints
    while movie_name not in titles:
        results = title_recommender(movie_name, df['title'], limit=limit)
        print("Sorry, that movie title isn't in my list. Did you mean", results, "?")
        movie_name = input("(I need the exact title, please...)")
        limit = min(limit + 1, 10)  # offer one more suggestion after each miss, up to 10
    sim_movies_text = similarity_matrix_tfidfvec_truncSVD1000[movie_name]
    print("Thanks! Here are my recommendations, along with review text similarity scores:")
    # The top match is normally the movie itself (similarity 1.0), so skip position 0
    recommendations = pd.DataFrame(sim_movies_text.sort_values(ascending=False)[1:6])
    return recommendations

In [78]:
def find_all_similar_movies():
    movie_name = input("Give me a movie title and I'll give you five titles you might also like:")
    titles = set(df['title'])
    limit = 3
    # Keep asking until the user enters an exact title, offering fuzzy matches as hints
    while movie_name not in titles:
        results = title_recommender(movie_name, df['title'], limit=limit)
        print("Sorry, that movie title isn't in my list. Did you mean", results, "?")
        movie_name = input("(I need the exact title, please...)")
        limit = min(limit + 1, 10)  # offer one more suggestion after each miss, up to 10
    sim_movies_text = similarity_matrix_tfidfvec_truncSVD1000[movie_name]
    print("Thanks! Here are my recommendations, along with review text similarity scores:")
    recommendations = pd.DataFrame(sim_movies_text.sort_values(ascending=False), index=df['title'])
    return recommendations

In [52]:
find_similar_movies()

Give me a movie title and I'll give you five titles you might also like:Jaws II
Sorry, that movie title isn't in my list. Did you mean [('W.', 90, 4752), ('Jaws', 90, 7989), ('The Last Exorcism Part II', 86, 2965)] ?
(I need the exact title, please...)Jaws
Thanks! Here are my recommendations, along with review text similarity scores:


title,Jaws
Jaws 2,0.857806
Jaws: The Revenge,0.793216
Shark Night 3D,0.656341
Jaws 3,0.649864
The Shallows,0.58702


### Recommender System Evaluation
#### To evaluate my recommender system, I will have my colleagues test out the system, recording what they thought about each recommendation on a 6-point scale (5 = excellent recommendation, 4 = good, 3 = fair, 2 = poor, 1 = unrelated, 0 = don't know, a movie I've never seen).

In [54]:
def system_test(trials):
    rate_recs = []
    for trial in reversed(range(trials)):
        print("Thank you for trying out MovieRec4Parents(tm)! You have", trial+1, "tries to go.")
        recommendations = find_similar_movies()
        print(recommendations)
        rate_recs.append(recommendations)
        for rec in range(len(recommendations)):
            print("On a scale of 1-5, how good is recommendation", rec+1, "? (If you don't know the movie, enter 0.)")
            rating = input()
            rate_recs.append((trial, rec+1, rating))
    print("You're done! I hope you enjoyed using MovieRec4Parents(tm). Tell your friends!")
    return rate_recs
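
Once ratings have been collected, one possible way to summarize them is to average the 1-5 scores and count the 0 ("never seen") answers separately. This is only a sketch: `summarize_ratings` is a hypothetical helper, and it assumes the returned list mixes recommendation tables with `(trial, recommendation number, rating)` tuples whose ratings are strings.

```python
def summarize_ratings(rate_recs):
    """Average the 1-5 ratings, counting 0 ("never seen") answers separately."""
    # Keep only the (trial, rec, rating) tuples; skip the recommendation tables
    ratings = [int(item[2]) for item in rate_recs
               if isinstance(item, tuple) and len(item) == 3]
    known = [r for r in ratings if r > 0]
    return {
        "n_rated": len(known),
        "n_unknown": len(ratings) - len(known),
        "mean_rating": sum(known) / len(known) if known else None,
    }

# Tuples in the same shape as system_test output; the string stands in for a DataFrame
example = ["<recommendations table>",
           (1, 1, '5'), (1, 2, '0'), (1, 3, '2'), (1, 4, '2'), (1, 5, '5')]
summary = summarize_ratings(example)
print(summary)
```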

In [159]:
system_test(5)

Thank you for trying out MovieRec4Parents(tm)! You have 2 tries to go.
Give me a movie title and I'll give you five titles you might also like:Ponyo
Thanks! Here are my recommendations, along with review text similarity scores:
                                      Ponyo
title                                      
My Neighbor Totoro                 0.485625
The Kingdom of Dreams and Madness  0.460105
From up on Poppy Hill              0.445626
The Secret World of Arrietty       0.412183
Kiki's Delivery Service            0.401037
How would you rank recommendation 1 ?
5
How would you rank recommendation 2 ?
0
How would you rank recommendation 3 ?
2
How would you rank recommendation 4 ?
2
How would you rank recommendation 5 ?
5
Thank you for trying out MovieRec4Parents(tm)! You have 1 tries to go.
Give me a movie title and I'll give you five titles you might also like:Kiki's Delivery Service
Thanks! Here are my recommendations, along with review text similarity scores:
                  

[                                      Ponyo
 title                                      
 My Neighbor Totoro                 0.485625
 The Kingdom of Dreams and Madness  0.460105
 From up on Poppy Hill              0.445626
 The Secret World of Arrietty       0.412183
 Kiki's Delivery Service            0.401037,
 (1, 1, '5'),
 (1, 2, '0'),
 (1, 3, '2'),
 (1, 4, '2'),
 (1, 5, '5'),
                                    Kiki's Delivery Service
 title                                                     
 From up on Poppy Hill                             0.418501
 Castle in the Sky                                 0.412754
 My Neighbor Totoro                                0.412105
 Ponyo                                             0.401037
 The Kingdom of Dreams and Madness                 0.391930,
 (0, 1, '2'),
 (0, 2, '5'),
 (0, 3, '5'),
 (0, 4, '5'),
 (0, 5, '0')]

In [160]:
system_test(5)

Thank you for trying out MovieRec4Parents(tm)! You have 5 tries to go.
Give me a movie title and I'll give you five titles you might also like:Star Wars
Sorry, that movie title isn't in my list. Did you mean [('Solo: A Star Wars Story', 90, 58), ('Star Wars: Episode VIII: The Last Jedi', 90, 208), ('Rogue One: A Star Wars Story', 90, 807)] ?
(I need the exact title, please...)Solo: A Star Wars Story
Thanks! Here are my recommendations, along with review text similarity scores:
                                           Solo: A Star Wars Story
title                                                             
Rogue One: A Star Wars Story                              0.764448
Star Wars: Episode VII: The Force Awakens                 0.731855
Star Wars: Episode IV: A New Hope                         0.682414
Star Wars: Episode VIII: The Last Jedi                    0.681480
Star Wars: Clone Wars                                     0.646752
How would you rank recommendation 1 ?
5
How would

[                                           Solo: A Star Wars Story
 title                                                             
 Rogue One: A Star Wars Story                              0.764448
 Star Wars: Episode VII: The Force Awakens                 0.731855
 Star Wars: Episode IV: A New Hope                         0.682414
 Star Wars: Episode VIII: The Last Jedi                    0.681480
 Star Wars: Clone Wars                                     0.646752,
 (4, 1, '5'),
 (4, 2, '5'),
 (4, 3, '5'),
 (4, 4, '5'),
 (4, 5, '5'),
                                Arrival
 title                                 
 Alien Trespass                0.491427
 Men in Black III              0.437827
 Prometheus                    0.436605
 The War of the Worlds (1953)  0.430676
 Wing Commander                0.429188,
 (3, 1, '2'),
 (3, 2, '3'),
 (3, 3, '4'),
 (3, 4, '3'),
 (3, 5, '3'),
                                                Iron Man
 title                                       

In [162]:
recommendations_results = []
rec_results = []

In [163]:
rec_results = system_test(5)

Thank you for trying out MovieRec4Parents(tm)! You have 5 tries to go.
Give me a movie title and I'll give you five titles you might also like:Solo: A Star Wars Story
Thanks! Here are my recommendations, along with review text similarity scores:
                                           Solo: A Star Wars Story
title                                                             
Rogue One: A Star Wars Story                              0.764448
Star Wars: Episode VII: The Force Awakens                 0.731855
Star Wars: Episode IV: A New Hope                         0.682414
Star Wars: Episode VIII: The Last Jedi                    0.681480
Star Wars: Clone Wars                                     0.646752
How would you rank recommendation 1 ?
5
How would you rank recommendation 2 ?
5
How would you rank recommendation 3 ?
5
How would you rank recommendation 4 ?
5
How would you rank recommendation 5 ?
5
Thank you for trying out MovieRec4Parents(tm)! You have 4 tries to go.
Give me a movi

In [165]:
recommendations_results.append(rec_results)

Thank you for trying out MovieRec4Parents(tm)! You have 5 tries to go.
Give me a movie title and I'll give you five titles you might also like:Won't You Be My Neighbor
Sorry, that movie title isn't in my list. Did you mean [("Won't You Be My Neighbor?", 100, 35), ('Forever My Girl', 86, 136), ('My Friend Dahmer', 86, 171)] ?
(I need the exact title, please...)Won't You Be My Neighbor?
Thanks! Here are my recommendations, along with review text similarity scores:
                                                    Won't You Be My Neighbor?
title                                                                        
The Swan Princess: Princess Tomorrow, Pirate To...                   0.344002
Into the Arms of Strangers: Stories of the Kind...                   0.341754
War Dance                                                            0.338886
Joan Didion: The Center Will Not Hold                                0.324683
Chillar Party                                                    

[                                                    Won't You Be My Neighbor?
 title                                                                        
 The Swan Princess: Princess Tomorrow, Pirate To...                   0.344002
 Into the Arms of Strangers: Stories of the Kind...                   0.341754
 War Dance                                                            0.338886
 Joan Didion: The Center Will Not Hold                                0.324683
 Chillar Party                                                        0.316977,
 (4, 1, '2'),
 (4, 2, '0'),
 (4, 3, '0'),
 (4, 4, '0'),
 (4, 5, '0'),
                         Muppet Treasure Island
 title                                         
 Treasure Island (1950)                0.678964
 Long John Silver                      0.642197
 Treasure Island (1934)                0.636264
 Treasure Planet                       0.581655
 The Muppet Movie                      0.531864,
 (3, 1, '5'),
 (3, 2, '3'),
 (3, 3, '4'

### Function to divide CosSimMatrix into 15 pieces so it can be uploaded to GitHub:

In [None]:
def decompose_sim_matrix(simmatrix=similarity_matrix_tfidfvec_truncSVD1000, n_pieces=15):
    '''
    Splits the similarity matrix into n_pieces blocks of consecutive rows
    (575 rows each for 8625 movies and 15 pieces). Returns the blocks as a list of DataFrames.
    '''
    rows_per_piece = int(np.ceil(len(simmatrix) / n_pieces))
    # .iloc selects rows by position; plain [] on a DataFrame would look up column labels
    return [simmatrix.iloc[start:start + rows_per_piece]
            for start in range(0, len(simmatrix), rows_per_piece)]

### Function to Compile Cosine Similarity Matrix:

In [None]:
# This is only a sketch so far. Right now, to get the cosine similarity matrix you will need
# to run my notebooks in order. Please allow hours to do so. Or, you can contact me directly!
def recompile_sim_matrix(pieces):
    '''
    Takes a list of row blocks of the similarity matrix, in their original order,
    and stacks them back into a single DataFrame.
    '''
    similarity_matrix_tfidfvec_truncSVD1000 = pd.concat(pieces)
    return similarity_matrix_tfidfvec_truncSVD1000
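
The split-and-rebuild idea can be checked on a small array before trying it on the full matrix. The sketch below uses NumPy only, with a made-up 6 x 6 array standing in for the 8625 x 8625 similarity matrix. For the real matrix, 15 pieces would hold 575 rows each, since 15 x 575 = 8625.

```python
import numpy as np

# Hypothetical small matrix standing in for the full similarity matrix
sim = np.arange(36, dtype=float).reshape(6, 6)

# Split into 3 blocks of consecutive rows
pieces = np.array_split(sim, 3, axis=0)

# Stack the blocks back together in the same order
rebuilt = np.vstack(pieces)

print([p.shape for p in pieces])
print(np.array_equal(rebuilt, sim))
```

Each piece could be saved to its own file (for example with `np.save`) and loaded again before stacking. The order of the pieces has to be preserved for the rows to line up with the titles.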