<a href="https://colab.research.google.com/github/TD1138/Interactive-Movie-Recommender/blob/main/notebooks/Interactive_Movie_Recommender_TF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An Interactive Movie Recommender using Tensorflow

### Most movie recommender systems I've seen before require a starting point.
### This can be in the form of:
* ###   a reference movie (if you enjoyed X you will enjoy Y or Z)
* ###   a taste profile ('you liked A, B & C so how about D or E')

### I have 2 problems with these kinds of recommender systems:
### The first issue is that open-ended questions are hard to answer. If I ask you where you want to go for dinner, you often can't come up with anything. If I ask which of 3 options you'd prefer, that is a much easier question to answer! Once we have a starting point we can start to narrow things down. This is how I'd like a recommender system to work: give me some options and I'll be able to narrow things down!
### The second issue is the need for a starting point: I either need to think of a film that I want the recommendations to be similar to, or I need to have already listed various films I enjoy. In reality, I often don't know exactly which film I want my next one to resemble, I just know what kind of thing I'm after!


### So my idea is to make a recommender that is more interactive than a classic one, taking input from the user as it goes.


### For example:
#### - User inputs 'Sci-Fi'
#### * Algorithm outputs 10 'Sci-Fi movies'
#### - User sees that the movies skew quite modern - enters '80s'
#### * Algorithm outputs 10 '80s Sci-Fi movies'
#### - User sees most of these movies feature male leads - adds 'female lead'
#### * Algorithm outputs 10 '80s Sci-Fi movies with Female Leads'

### My idea is to create a neural network whose inputs are various features like year, director etc., but also some more 'meta' features, like the tags you see in Netflix's micro-genres ('gory', 'visually striking' etc.).
### The output of the neural network would be a softmax over all the films in our dataset. The films with the top 10 probabilities would then be served up as the recommendations.
### It will require a layer of natural language processing to get from the user inputs to the feature space - this will be a second phase of this project!
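
### To make the softmax idea concrete, here is a minimal sketch in plain Python (not the eventual TensorFlow model). The scores, the film titles and the helper names `softmax` and `top_k` are all made up for illustration; in the real network the scores would come from the final layer.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probabilities, titles, k=10):
    """Return the k (title, probability) pairs with the highest probability."""
    ranked = sorted(zip(titles, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical scores for a toy catalogue of four films (made-up numbers).
toy_titles = ['Film A', 'Film B', 'Film C', 'Film D']
toy_logits = [2.0, 0.5, 3.0, -1.0]
toy_probs = softmax(toy_logits)
recommendations = top_k(toy_probs, toy_titles, k=2)
print(recommendations)
```

### With the full dataset the list of titles would have one entry per film and `k` would be 10.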

### But before I can get started with this grand ambition, I need some data! 


# The Data

### GroupLens is a computer science research lab at the University of Minnesota, specializing in recommender systems, among other things.
### They have a dataset which is ideal for my use case: 'MovieLens 25M'.
### Released December 2019, it contains 62k movies, with tags like the ones I was talking about above! See further documentation below:
### https://grouplens.org/datasets/movielens/


### The data is hosted on the GroupLens website, as a zip file.
### First I need to save the ZIP file to the local machine, then use a package to extract the data:

In [1]:
dataset_url = 'http://files.grouplens.org/datasets/movielens/ml-25m.zip'

In [2]:
import os
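# The archive is extracted into a folder named 'ml-25m.zip'; the CSV files end up in its 'ml-25m' subfolder
local_path_zip = os.path.join('..', 'dataset', 'ml-25m.zip')
local_path = os.path.join('..', 'dataset', 'ml-25m.zip', 'ml-25m')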

In [3]:
import requests, zipfile, io
r = requests.get(dataset_url)
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall(local_path_zip)
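
### The archive is large, so it can be worth skipping the extraction when the files are already on disk. Here is a minimal sketch of that idea; the helper name `extract_zip_bytes` and the tiny stand-in archive are my own assumptions, used so the example runs without downloading anything.

```python
import io
import os
import tempfile
import zipfile

def extract_zip_bytes(zip_bytes, target_dir):
    """Extract an in-memory ZIP into target_dir; do nothing if it is already populated."""
    if os.path.isdir(target_dir) and os.listdir(target_dir):
        return False  # already extracted on an earlier run
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(target_dir)
    return True

# Build a tiny stand-in archive (hypothetical contents, not the real dataset).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('ml-toy/movies.csv', 'movieId,title,genres\n1,Toy Story (1995),Comedy\n')

demo_dir = os.path.join(tempfile.mkdtemp(), 'dataset')
first_run = extract_zip_bytes(buffer.getvalue(), demo_dir)   # extracts the files
second_run = extract_zip_bytes(buffer.getvalue(), demo_dir)  # finds them and skips
print(first_run, second_run)
```

### In the notebook the bytes would be `r.content` and the target would be `local_path_zip`.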

#### Let's have a look at the files we have downloaded:

In [147]:
import pandas as pd  # must be imported before pd is first used
pd.set_option("display.max_rows", 500, "display.max_columns", 50)
for file in os.listdir(local_path):
  if file[-4:] == '.csv':
    print(file)
    display(pd.read_csv(local_path+'/'+file, nrows=5))
  else:
    print(file+' is not a csv')
  print('\n')

genome-tags.csv


Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s




links.csv


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862
1,2,113497,8844
2,3,113228,15602
3,4,114885,31357
4,5,113041,11862




movies.csv


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy




tags.csv


Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455




genome-scores.csv


Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02875
1,1,2,0.02375
2,1,3,0.0625
3,1,4,0.07575
4,1,5,0.14075




README.txt is not a csv


ratings.csv


Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510






### Let's read in the 'movies.csv' file for later use, and do some basic processing of the data:

In [111]:
movies_df = pd.read_csv(local_path+'/movies.csv')
#movies_df = movies_df.head(10)
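# Take the four-digit year from the trailing '(YYYY)' so other parenthesised text is not mistaken for it
movies_df['year'] = movies_df['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)
movies_df['title'] = movies_df['title'].str.split(r' \(').str[0]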
#movies_df['genre_count']
movies_df.head(10)

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
5,6,Heat,Action|Crime|Thriller,1995
6,7,Sabrina,Comedy|Romance,1995
7,8,Tom and Huck,Adventure|Children,1995
8,9,Sudden Death,Action,1995
9,10,GoldenEye,Action|Adventure|Thriller,1995
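
### Here is a small sketch of the same title/year split using only the standard library's `re` module. The pattern is anchored to the end of the title so that an earlier parenthesised part (an alternative title, say) cannot be mistaken for the year. The helper name `split_title_year` and the sample titles are illustrative assumptions, and unlike the cell above this version keeps any alternative title as part of the title.

```python
import re

# A four-digit year in parentheses at the very end of the string.
YEAR_AT_END = re.compile(r'\s*\((\d{4})\)\s*$')

def split_title_year(raw_title):
    """Return (title, year); year is None when there is no trailing '(YYYY)'."""
    match = YEAR_AT_END.search(raw_title)
    if match is None:
        return raw_title.strip(), None
    return raw_title[:match.start()].strip(), int(match.group(1))

print(split_title_year('Toy Story (1995)'))
print(split_title_year('City of Lost Children, The (Cité des enfants perdus, La) (1995)'))
print(split_title_year('An Untitled Example'))
```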


### Let's do some splitting/tokenising of the genres:

In [137]:
genres_processing = movies_df['genres'].str.split('|', expand=True)
#genres_processing.columns = ['genre_'+str(i) for i in genres_processing.columns]
genres_processing
genres_list = []
for col in genres_processing.columns:
  genres_counts = genres_processing[col].value_counts()
  genres_list.append(genres_counts)

genres_table = pd.concat(genres_list).reset_index().groupby('index').sum()
genres_table.reset_index(inplace=True)
genres_table.columns = ['genre', 'movie_count']
genres_table = genres_table[genres_table['genre'] != '(no genres listed)'].reset_index(drop=True)
display(genres_table)

Unnamed: 0,genre,movie_count
0,Action,7348
1,Adventure,4145
2,Animation,2929
3,Children,2935
4,Comedy,16870
5,Crime,5319
6,Documentary,5605
7,Drama,25606
8,Fantasy,2731
9,Film-Noir,353


In [138]:
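import numpy as np  # np is used here but is otherwise only imported in a later cell
for genre in genres_table['genre'].unique():
  # regex=False treats the genre name as literal text rather than as a pattern
  movies_df[genre] = np.where(movies_df['genres'].str.contains(genre, regex=False), 1, 0)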
movies_df.head()

Unnamed: 0,movieId,title,genres,year,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji,Adventure|Children|Fantasy,1995,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men,Comedy|Romance,1995,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II,Comedy,1995,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
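
### As an alternative sketch, pandas can build the same kind of 0/1 genre columns in one step with `Series.str.get_dummies`, which splits on a separator and matches whole genre names. The toy frame below reuses a few rows from the preview above; the variable names are my own. On the full file, '(no genres listed)' would presumably become a column too and would need dropping.

```python
import pandas as pd

# Toy frame shaped like movies.csv, with genres copied from the preview above.
toy_movies = pd.DataFrame({
    'movieId': [1, 3, 6],
    'genres': [
        'Adventure|Animation|Children|Comedy|Fantasy',
        'Comedy|Romance',
        'Action|Crime|Thriller',
    ],
})

# One indicator column per genre, split on the pipe separator.
genre_flags = toy_movies['genres'].str.get_dummies(sep='|')
toy_encoded = pd.concat([toy_movies, genre_flags], axis=1)
print(toy_encoded)
```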


### Let's take a closer look at 'tags.csv'. This file contains every tag that a user has applied to a film:

In [223]:
tags = pd.read_csv(local_path+'/tags.csv')
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455


In [229]:
print('There are {} unique tags in the dataset'.format(len(tags['tag'].unique())))
tag_counts = tags[['tag', 'movieId']].groupby('tag').count().sort_values('tag').reset_index()
tag_counts.head(50)

There are 73051 unique tags in the dataset


Unnamed: 0,tag,movieId
0,Alexander Skarsgård,1
1,Difficult to find it,1
2,Filmes Antigos,2
3,Filmes Antigos,2
4,Kartik Aaryan,1
5,Kriti Sanon,1
6,Laurel Canyon,1
7,Luis Brandoni,1
8,Masami Nagasawa,1
9,O'Shea Jackson Jr.,1


### As we can see, there are a lot of standard text-cleansing steps we can apply to make things a bit clearer, including stripping surrounding spaces, lower-casing and getting rid of odd characters. Let's do this now:

In [163]:
def string_cleanse(string):
  string_clean = str(string).strip().lower().replace('"','').replace('#','').replace('\'','').replace('(','').replace(')','').replace('*','').replace('-','')
  return string_clean
tags['tag'] = tags['tag'].apply(string_cleanse)
clean_tag_counts = tags[['tag', 'movieId']].groupby('tag').count().sort_values('tag').reset_index()
clean_tag_counts.head(50)

Unnamed: 0,tag,movieId
0,,11
1,!950s superman tv show,1
2,&suspense,1
3,+++++++++++++++,1
4,",music business",1
5,.,1
6,...and this film has got nothing to with the a...,1
7,...livin in an amish paradise,1
8,...that its finally over,1
9,...to pay me to watch this again.,1
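
### The first row of the output above is an empty tag: some tags consist only of the characters we strip out. Missing values are a related case, since `str()` turns them into the text 'nan'. Here is a hedged sketch of a stricter cleaner that returns `None` in both cases so those rows can be dropped afterwards. The name `clean_tag` is hypothetical, and the extra final `strip()` is my addition rather than part of the function above.

```python
# Characters removed by the cleaning step above: " # ' ( ) * -
_DROP_CHARS = str.maketrans('', '', '"#\'()*-')

def clean_tag(value):
    """Clean one tag; return None when the input is missing or nothing is left."""
    if value is None or value != value:  # value != value is True only for NaN
        return None
    cleaned = str(value).strip().lower().translate(_DROP_CHARS).strip()
    return cleaned or None

print(clean_tag('  Sci-Fi '))
print(clean_tag('***'))
print(clean_tag(float('nan')))
```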


Next let's start to get some counts of how often these tags appear in the dataset.

Each row of the dataset is a tag applied by a user to a film. Let's filter to a single, popular film to get a feel for what this looks like. I have chosen 2001: A Space Odyssey, as it is one of my favourites and is also quite popular. This film has movieId 924.

In [233]:
chosen_film_id = 924
chosen_film_tags = tags[tags['movieId'] == chosen_film_id]
print('Film ID {} has {} total tags, of which {} are unique, from {} different users\n'.format(chosen_film_id, chosen_film_tags['tag'].count(), chosen_film_tags['tag'].nunique(), chosen_film_tags['userId'].nunique()))
chosen_film_tags['tag'].value_counts()

Film ID 924 has 1749 total tags, of which 191 are unique, from 323 different users



Stanley Kubrick                          105
artificial intelligence                  103
sci-fi                                    97
space                                     89
atmospheric                               84
philosophical                             77
visually appealing                        74
slow                                      72
surreal                                   70
masterpiece                               61
cinematography                            52
cult film                                 43
meditative                                40
confusing ending                          40
future                                    38
music                                     38
classic                                   37
futuristic                                35
aliens                                    31
space travel                              26
soundtrack                                26
boring                                    26
robots    

### Let's look at the top 50 most used tags.
### This ranking, however, doesn't account for the fact that some tags are applied over and over to a handful of films, e.g. the tag 'Harry Potter' might be used repeatedly across the Harry Potter franchise, yet it isn't all that useful for our recommender.

In [165]:
tags[['tag', 'movieId']].groupby('tag').count().sort_values('movieId', ascending=False).reset_index().head(50)

Unnamed: 0,tag,movieId
0,scifi,9141
1,atmospheric,7053
2,action,6783
3,comedy,6368
4,surreal,5584
5,funny,5354
6,based on a book,5194
7,twist ending,4904
8,visually appealing,4691
9,romance,4482


### There also seems to be some further text clean-up required - look at all the variations of 'sci-fi'!

In [166]:
temp = tags[['tag', 'movieId']].groupby('tag').count().sort_values('movieId', ascending=False).reset_index()
temp[temp['tag'].str.lower().str.contains('sci')].head(20)

Unnamed: 0,tag,movieId
0,scifi,9141
80,science fiction,1635
250,science,651
278,bad science,604
544,scientist,329
607,mad scientist,298
758,classic scifi,238
761,sci fi,237
1112,fascism,160
1279,thoughtful scifi,136


### Let's create a function which takes in a list of variant tags that we want to standardise to one tag...

In [158]:
def condenser(string, from_list, to_value):
  """Replace every occurrence of each variant in from_list with to_value (substring match)."""
  for i in from_list:
    string = str(string).replace(i, to_value)
  return string

### ...and let's use it on our scifi variants!

In [168]:
scifi_variants = ['science fiction', 'sci fi']
tags['tag'] = tags['tag'].apply(condenser, from_list=scifi_variants, to_value='scifi')
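
### Note that `condenser` replaces substrings, so it also rewrites longer tags that merely contain a variant. Whether that is wanted is a design choice. For comparison, here is a sketch of an exact-match version built on a dictionary lookup; the helper names `build_variant_map` and `standardise_exact` are hypothetical.

```python
def build_variant_map(groups):
    """Flatten {canonical: [variants]} into {variant: canonical} for exact lookups."""
    return {
        variant: canonical
        for canonical, variants in groups.items()
        for variant in variants
    }

def standardise_exact(tag, variant_map):
    """Replace the tag only when the whole tag is a known variant."""
    return variant_map.get(tag, tag)

variant_map = build_variant_map({'scifi': ['science fiction', 'sci fi']})
print(standardise_exact('science fiction', variant_map))      # whole tag matches
print(standardise_exact('bad science fiction', variant_map))  # left untouched
```

### With the substring approach, 'bad science fiction' would become 'bad scifi' instead.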

### Here we can see that some of the most popular films can have hundreds of unique tags, while many have just 1

In [11]:
import numpy as np
unique_tags_by_movie = tags.groupby('movieId')['tag'].nunique().reset_index()
unique_tags_by_movie.columns = ['movieId', 'unique_tag_count']
unique_tags_by_movie.sort_values('unique_tag_count', ascending=False, inplace=True)
unique_tags_by_movie.reset_index(inplace=True, drop=True)
unique_tags_by_movie

Unnamed: 0,movieId,unique_tag_count
0,260,698
1,356,521
2,296,498
3,318,384
4,1103,282
...,...,...
45246,158805,1
45247,176869,1
45248,176867,1
45249,176865,1


### Let's create a frequency table of unique tag counts. We can see that about 15.6% of our films have just 1 tag, while over half have 5 tags or fewer.

In [12]:
tagcount_freqtable = unique_tags_by_movie.groupby('unique_tag_count').count().reset_index()
tagcount_freqtable.columns = ['unique_tag_count', 'frequency']
tagcount_freqtable['percent'] = tagcount_freqtable['frequency']/(tagcount_freqtable['frequency'].sum())
tagcount_freqtable['cumulative_percent'] = tagcount_freqtable['percent'].cumsum()
display(tagcount_freqtable.head(20))

Unnamed: 0,unique_tag_count,frequency,percent,cumulative_percent
0,1,7047,0.155731,0.155731
1,2,5379,0.11887,0.274602
2,3,4919,0.108705,0.383306
3,4,3976,0.087865,0.471172
4,5,3228,0.071335,0.542507
5,6,2561,0.056595,0.599103
6,7,2002,0.044242,0.643345
7,8,1694,0.037436,0.680781
8,9,1280,0.028287,0.709067
9,10,1165,0.025745,0.734812
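
### To show how the 'percent' and 'cumulative_percent' columns are derived, here is a small standard-library sketch of the same calculation. The input numbers are made up and the function name `frequency_table` is my own.

```python
from collections import Counter
from itertools import accumulate

def frequency_table(values):
    """Return rows of (value, frequency, percent, cumulative_percent), sorted by value."""
    counts = Counter(values)
    total = sum(counts.values())
    ordered = sorted(counts.items())
    percents = [freq / total for _, freq in ordered]
    return [
        (value, freq, pct, cum)
        for (value, freq), pct, cum in zip(ordered, percents, accumulate(percents))
    ]

# Made-up unique-tag counts for ten hypothetical films.
toy_counts = [1, 1, 1, 2, 2, 3, 5, 5, 8, 13]
toy_table = frequency_table(toy_counts)
for row in toy_table:
    print(row)
```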


### Let's take a look at that film with 698 tags - id 260

### I think we can all guess what film this is!
### As we can see, there are a lot of meaningless tags ('bite me'), a lot of tags with character names ('yoda', only really relevant to this particular film/franchise), a lot of tags all meaning pretty much the same thing ('science-fiction', 'sf,science fiction', 'Science Fiction', 'sci-fi') and a lot of free-form opinions ('birth of great scifi ideas', 'best movie ever').

### We can also see that there are a lot of upper/lower case versions of similar things.

### I think next we should aggregate the other way round, and look at how many tags are used just once/very infrequently

In [13]:
tags[tags['movieId'] == 260].tag.unique()

array(['classic', 'scifi', 'action', 'adventure', 'fantasy',
       'space adventure', 'classic scifi', 'good vs evil', 'aliens',
       'oldie but goodie', 'scifi cult', 'space', 'cult classic',
       'futuristic', 'space action', 'space opera', 'old movie', 'epic',
       'harrison ford', 'space epic', 'action comedy', 'romance',
       'entertaining', 'good story', 'imdb top 250', 'darth vader',
       'luke skywalker', 'the death star', 'religion', 'heros journey',
       'george lucas', 'star wars', 'epic adventure',
       'birth of great scifi ideas', 'old fx quality', 'story driven',
       'bite me',
       'episode what? its cut off, so i dont even know what movie it is',
       'action, scifi', 'quotable', 'exciting', 'fun', 'heroic journey',
       'inventive', 'amazing', 'masterpiece', 'must see', 'universe',
       'sf,scifi', 'incest', 'james earl jones', 'jedi', 'john williams',
       'robots', 'atmospheric', 'future', 'great soundtrack',
       'science fantasy', 'sp

### Here we have the top 20 tags in the dataset, ranked by how many unique films the tag is applied to.
### The top one here is 'bdr', which means 'Blu-ray Disc - Recordable' and has nothing to do with the film itself.
### The same logic applies to 'betamax' and 'dvdvideo'.
### However, some of these tags are indeed useful.

In [172]:
def calc_unique_movies_by_tag(tags_df, tag_col='tag', movie_col='movieId'):
  df_agg = tags_df.groupby(tag_col)[movie_col].nunique().reset_index()  # use the argument, not the global 'tags'
  df_agg.columns = ['tag', 'unique_movie_count']
  df_agg.sort_values('unique_movie_count', ascending=False, inplace=True)
  df_agg.reset_index(inplace=True, drop=True)
  return df_agg

movie_count_by_tag = calc_unique_movies_by_tag(tags)
movie_count_by_tag.head(20)

Unnamed: 0,tag,unique_movie_count
0,bdr,3948
1,woman director,3493
2,murder,2271
3,independent film,1838
4,comedy,1624
5,nudity topless,1474
6,based on a book,1395
7,clv,1357
8,drama,1307
9,romance,1296


### Let's also construct a frequency table as before:
### An incredible 57.6% of tags (37,498) only apply to one film!

In [173]:
def calc_tag_freq_table(tags_df, tag_col='tag', movie_col='movieId'):
  df_agg = calc_unique_movies_by_tag(tags_df, tag_col, movie_col)
  df_freq = df_agg.groupby('unique_movie_count').count().reset_index()
  df_freq.columns = ['unique_movie_count', 'frequency']
  df_freq['percent'] = df_freq['frequency']/(df_freq['frequency'].sum())
  df_freq['cumulative_percent'] = df_freq['percent'].cumsum()
  return df_freq

moviecount_freqtable = calc_tag_freq_table(tags)
display(moviecount_freqtable.head(20))

Unnamed: 0,unique_movie_count,frequency,percent,cumulative_percent
0,1,37498,0.576059,0.576059
1,2,8345,0.128199,0.704258
2,3,4157,0.063861,0.76812
3,4,2604,0.040004,0.808124
4,5,1795,0.027576,0.835699
5,6,1402,0.021538,0.857237
6,7,1027,0.015777,0.873014
7,8,821,0.012613,0.885627
8,9,633,0.009724,0.895351
9,10,539,0.00828,0.903632


### Creating a list of tags that apply to just one film and scanning through it, we can see that these tags are not particularly useful:

In [16]:
movie_count_by_tag[movie_count_by_tag['unique_movie_count']==1]['tag'].to_list()

['yihad',
 'the virgin suicides',
 'tray',
 'wicked stepmother',
 'travelling players',
 'what the hell did i just watch?',
 'this movie is perfect',
 'travis tope',
 'polemic',
 'the voiceovers here arent as corny as people say',
 'plant food',
 'theological debate',
 'plc hacks',
 'theeffects',
 'psycholgical thriller',
 'theranos',
 'wrongful sentence',
 'transcending abilities',
 'transcendental',
 'pol',
 'the villain from karate kid at the bar',
 'travis oates',
 'yellowstone national park',
 'this movie scares the stuffing out of me!',
 'what are you stuffed with?',
 'playwright:william gibson',
 'plot light',
 'plot point:photo booth',
 'track star',
 'tracking',
 'theology student',
 'this is what they say the notebook was',
 'travis cluff',
 'theone',
 'yegor letov',
 'tracking camera',
 'psychobiddy',
 'transcendentialist',
 'track down',
 'tracking shot',
 'psychobabble',
 'yiddish writer',
 'the views',
 'provoking',
 'yigal amir',
 'this movie is pretty...pretty gay',
 'p

Let's take only the tags that apply to more than 100 films.
As previously discussed, there are some format tags in here which we can immediately erase, such as 'bdr' (Blu-ray disc), 'clv' (LaserDisc), 'betamax', 'dvdvideo', 'bdvideo', 'tumeys dvds' etc., as well as some other undesirable tags.

In [175]:
movie_count_by_tag = calc_unique_movies_by_tag(tags)
movie_count_by_tag.head(250)

Unnamed: 0,tag,unique_movie_count
0,bdr,3948
1,woman director,3493
2,murder,2271
3,independent film,1838
4,comedy,1624
5,nudity topless,1474
6,based on a book,1395
7,clv,1357
8,drama,1307
9,romance,1296
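
### The threshold itself can be applied by counting distinct films per tag and keeping the tags above the cut-off. With the table above that would presumably be `movie_count_by_tag[movie_count_by_tag['unique_movie_count'] > 100]`. Here is the same logic as a standard-library sketch on made-up rows; the function name `tags_above_threshold` is hypothetical.

```python
from collections import defaultdict

def tags_above_threshold(tag_rows, min_movies):
    """Return {tag: distinct film count} for tags applied to more than min_movies films."""
    movies_per_tag = defaultdict(set)
    for movie_id, tag in tag_rows:
        movies_per_tag[tag].add(movie_id)  # a set ignores repeat tags on the same film
    return {
        tag: len(movies)
        for tag, movies in movies_per_tag.items()
        if len(movies) > min_movies
    }

# Made-up (movieId, tag) pairs; the threshold here is 1 rather than 100.
toy_rows = [(1, 'surreal'), (2, 'surreal'), (2, 'surreal'), (3, 'bdr'), (4, 'yoda')]
kept = tags_above_threshold(toy_rows, min_movies=1)
print(kept)
```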


### Let's create a helper function to explore different tags:

In [180]:
def tag_explorer(tag, tags_df, movies_df):
  temp = tags_df[tags_df['tag'] == tag][['movieId', 'tag', 'userId']]
  temp2 = temp.groupby(['movieId', 'tag'], as_index=False).count()
  temp2.sort_values('userId', ascending=False, inplace=True)
  temp2.columns = ['movieId', 'tag', 'user_count']
  temp2.reset_index(drop=True, inplace=True)
  temp2 = temp2.merge(movies_df[['movieId', 'title']], how='left', on='movieId')
  temp3 = temp2.reindex(['title', 'tag', 'user_count'], axis=1)
  print(temp3.head(20))  # display the top 20 films for this tag

In [222]:
temp = tags[tags['tag'] == 'surreal']#[['movieId', 'tag', 'userId']]
temp[temp['movieId']==7147]

Unnamed: 0,userId,movieId,tag,timestamp,tag_condenser_test
509,93,7147,surreal,1496543859,surreal
510,93,7147,surreal,1496543872,surrealism
4192,1541,7147,surreal,1433455810,surreal
4193,1541,7147,surreal,1433455818,surrealism
7625,2730,7147,surreal,1450767760,surreal
7626,2730,7147,surreal,1450767764,surrealism
19350,4117,7147,surreal,1442297742,surreal
218777,7335,7147,surreal,1502931705,surreal
218778,7335,7147,surreal,1502931712,surrealism
259550,15204,7147,surreal,1306924418,surreal


In [221]:
temp.groupby('movieId').count().loc[7147,:]

tag       100
userId    100
Name: 7147, dtype: int64

### The tag 'r' has no discernible information behind it:

In [181]:
tag_explorer('r', tags, movies_df)

                                                title tag  user_count
0   Borat: Cultural Learnings of America for Make ...   r           8
1                               Letters from Iwo Jima   r           5
2                  Before the Devil Knows You're Dead   r           5
3                                           In Bruges   r           5
4                               Science of Sleep, The   r           5
5                                       Departed, The   r           5
6                      Blind Swordsman: Zatoichi, The   r           5
7                                      Twelve Monkeys   r           5
8                                            Hot Fuzz   r           5
9                    Girl with the Dragon Tattoo, The   r           5
10                              Thank You for Smoking   r           4
11                                            Paprika   r           4
12                                     Gone Baby Gone   r           4
13                  

### 'ridiculous' is a matter of opinion and probably not that useful:

In [182]:
tag_explorer('ridiculous', tags, movies_df)

                                         title         tag  user_count
0                                  Pacific Rim  ridiculous          23
1                    Grand Budapest Hotel, The  ridiculous          20
2                                    Kung Fury  ridiculous          18
3                                        Signs  ridiculous          16
4                                     Face/Off  ridiculous          13
5                 Kingsman: The Secret Service  ridiculous          12
6           Monty Python's The Meaning of Life  ridiculous          10
7                                 Shoot 'Em Up  ridiculous          10
8                               The Babysitter  ridiculous          10
9                                    True Lies  ridiculous           9
10                             Jennifer's Body  ridiculous           9
11               You Don't Mess with the Zohan  ridiculous           8
12         Transformers: Revenge of the Fallen  ridiculous           8
13  Tw

### The 'too long' tag, while I may agree with some of these, is not really something I think anyone would ask for in a movie recommendation!

In [190]:
tag_explorer('too long', tags, movies_df)

                                     title       tag  user_count
0                        The Hateful Eight  too long          32
1                                   Avatar  too long          29
2       Hobbit: An Unexpected Journey, The  too long          28
3                           Shutter Island  too long          27
4                         Django Unchained  too long          21
5     Hobbit: The Desolation of Smaug, The  too long          21
6                Blue Is the Warmest Color  too long          19
7                                 Watchmen  too long          19
8                               Mr. Nobody  too long          17
9                              Cloud Atlas  too long          17
10                          Enter the Void  too long          15
11  Lord of the Rings: The Two Towers, The  too long          14
12                                  Zodiac  too long          13
13                               Boot, Das  too long          13
14                       

### Just to finish on a positive, I'd definitely agree with the surreal tag - it shows that these tags are indeed going to be useful!

In [189]:
tag_explorer('surreal', tags, movies_df)

                                    title      tag  user_count
0   Eternal Sunshine of the Spotless Mind  surreal         225
1                               Inception  surreal         196
2                            Donnie Darko  surreal         175
3                              Fight Club  surreal         159
4                        Mulholland Drive  surreal         124
5                         Pan's Labyrinth  surreal         113
6                                  Amelie  surreal         112
7                             Matrix, The  surreal         102
8                    Being John Malkovich  surreal          95
9                              Mr. Nobody  surreal          93
10                              Dark City  surreal          72
11                                 Brazil  surreal          71
12                  2001: A Space Odyssey  surreal          70
13                           Annihilation  surreal          68
14                               Coraline  surreal     

### Let's gather all the useless tags we've discovered above, and remove them from the tags dataframe:

In [207]:
format_tags = ['bdr', 'clv', 'betamax', 'dvdvideo', 'bdvideo', 'tumeys dvds', 'less than 300 ratings', 'dvd', 'dvdr', 'dvdram', 'vhs']
unknown_tags = ['prospect preferred', 'r']
undesirable_tags = ['rape', 'suicide', 'nudity full frontal', 'sex', 'torture', 'incest', 'nudity', 'prostitution', 'nudity topless', 'nudity topless  brief', 'nudity full frontal  notable', 'child abuse', 'nudity rear', 'nudity topless  notable']
genre_tags = ['scifi', 'action', 'comedy', 'romance', 'fantasy', 'adventure', 'thriller', 'drama', 'animation', 'horror', 'musical', 'crime', 'mystery', 'war', 'documentary', 'animated', 'british comedy']
time_tags = ['1930s', '1950s', '1960s', '1970s', '1980s', '1990s']
useless_opinion_tags = ['seen more than once', 'movie to see', 'reviewed', 'to see', 'ridiculous', 'boring', 'overrated', 'might like']
deletion_lists = [format_tags, unknown_tags, undesirable_tags, genre_tags, time_tags, useless_opinion_tags]
tag_deletions = [tag for tag_list in deletion_lists for tag in tag_list]
tags = tags[~tags['tag'].isin(tag_deletions)].reset_index(drop=True)

### Let's also clean up some of the similar ones in the top 500:

In [208]:
literature_variants = ['based on a book', 'based on novel or book', 'adapted from:book']
tags['tag'] = tags['tag'].apply(condenser, from_list=literature_variants, to_value='book adaptation')

philosophy_variants = ['philosophical', 'philosophy']
tags['tag'] = tags['tag'].apply(condenser, from_list=philosophy_variants, to_value='philosophy')

nyc_variants = ['new york city', 'new york']
tags['tag'] = tags['tag'].apply(condenser, from_list=nyc_variants, to_value='new york')

surreal_variants = ['surreal', 'surrealism']
tags['tag'] = tags['tag'].apply(condenser, from_list=surreal_variants, to_value='surreal')

alien_variants = ['alien', 'aliens']
tags['tag'] = tags['tag'].apply(condenser, from_list=alien_variants, to_value='alien')

cult_variants = ['cult classic', 'cult', 'cult film']
tags['tag'] = tags['tag'].apply(condenser, from_list=cult_variants, to_value='cult')

dance_variants = ['dance', 'dancing']
tags['tag'] = tags['tag'].apply(condenser, from_list=dance_variants, to_value='dance')

antihero_variants = ['antihero', 'dark hero']
tags['tag'] = tags['tag'].apply(condenser, from_list=antihero_variants, to_value='antihero')
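
### The repeated calls above could also be driven from one dictionary, which makes new variant groups easier to add. Here is a sketch under that assumption, showing three of the groups. `condense` mirrors the `condenser` function from earlier, while `VARIANT_GROUPS` and `condense_all` are hypothetical names.

```python
def condense(tag, from_list, to_value):
    """Same substring replacement as the condenser function defined earlier."""
    for variant in from_list:
        tag = str(tag).replace(variant, to_value)
    return tag

# Variant groups copied from the cell above, keyed by the tag they collapse to.
VARIANT_GROUPS = {
    'book adaptation': ['based on a book', 'based on novel or book', 'adapted from:book'],
    'surreal': ['surreal', 'surrealism'],
    'cult': ['cult classic', 'cult', 'cult film'],
}

def condense_all(tag, groups=VARIANT_GROUPS):
    """Apply every variant group to one tag, in dictionary order."""
    for to_value, from_list in groups.items():
        tag = condense(tag, from_list, to_value)
    return tag

sample = ['based on a book', 'surrealism', 'cult film', 'atmospheric']
condensed = [condense_all(t) for t in sample]
print(condensed)
```

### On the dataframe this would be applied with `tags['tag'].apply(condense_all)`.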

In [209]:
top500tags = tags['tag'].value_counts().head(500)
top500tags = pd.DataFrame(top500tags).reset_index()
top500tags.columns = ['tag', 'tag_count']
top500tags.sort_values('tag')

Unnamed: 0,tag,tag_count
315,19th century,474
439,3d,359
282,70mm,520
279,absurd,532
143,acting,933
438,adam sandler,360
478,adapted from:comic,331
379,addiction,408
205,adultery,667
258,africa,564
