In [1]:
import pandas as pd

IMDB (https://datasets.imdbws.com/) makes its data available for use, but it requires some preprocessing. For starters, a movie's title and its rating are not in the same dataset, so we will combine the two into one pandas dataframe. Any of the IMDB datasets can be combined this way for further data mining, because every row carries a shared identifier (tconst).
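As a minimal sketch of what combining on the shared identifier could look like, the toy frames below stand in for the two TSV files. The frame names `basics` and `ratings` and the rating for tt0000003 are assumptions made up for illustration; only the `tconst` key and the column names come from the IMDB files.

```python
import pandas as pd

# Toy stand-ins for title.basics.tsv and title.ratings.tsv (values are illustrative)
basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primaryTitle": ["Carmencita", "Le clown et ses chiens", "Pauvre Pierrot"],
})
ratings = pd.DataFrame({
    "tconst": ["tt0000003", "tt0000001"],
    "averageRating": [6.5, 5.7],
})

# Inner join on the shared key: only titles that also have a rating survive,
# and each title is paired with its own rating regardless of row order
combined = pd.merge(basics, ratings, on="tconst", how="inner")
print(combined)
```

A key-based join like this keeps a single `tconst` column and does not depend on the two files listing titles in the same order.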


In [2]:
imdb_basics = pd.read_csv('./test-data/title.basics.tsv', sep='\t')
imdb_basics.shape


(9786353, 9)

In [3]:
imdb_ratings = pd.read_csv('./test-data/title.ratings.tsv', sep='\t')
imdb_ratings.shape

(1302133, 3)

In [4]:
# DEPRECATED DATASET WE ORIGINALLY WANTED TO USE
#netflix = pd.read_csv('netflix_titles.csv')
#netflix.shape

In [5]:
# DEPRECATED DATASET
# NETFLIX:\n{netflix.iloc[:1]}\n\n\n
print(f"IMDB BASICS:\n{imdb_basics.iloc[:5]}\n\n\nIMDB RATINGS:\n{imdb_ratings.iloc[:5]}")

IMDB BASICS:
      tconst titleType            primaryTitle           originalTitle   
0  tt0000001     short              Carmencita              Carmencita  \
1  tt0000002     short  Le clown et ses chiens  Le clown et ses chiens   
2  tt0000003     short          Pauvre Pierrot          Pauvre Pierrot   
3  tt0000004     short             Un bon bock             Un bon bock   
4  tt0000005     short        Blacksmith Scene        Blacksmith Scene   

  isAdult startYear endYear runtimeMinutes                    genres  
0       0      1894      \N              1         Documentary,Short  
1       0      1892      \N              5           Animation,Short  
2       0      1892      \N              4  Animation,Comedy,Romance  
3       0      1892      \N             12           Animation,Short  
4       0      1893      \N              1              Comedy,Short  


IMDB RATINGS:
      tconst  averageRating  numVotes
0  tt0000001            5.7      1965
1  tt0000002            

In [6]:
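# NOTE: with axis=1, pd.concat pairs rows by index label, not by tconst,
# so the two tconst columns in a row are not guaranteed to refer to the same title.
imdb_combined = pd.concat([imdb_basics, imdb_ratings], axis=1, join='inner')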
print(f"{imdb_combined.shape}")

print(f"\nIMDB COMBINED:\n{imdb_combined.iloc[:5]}")

(1302133, 12)

IMDB COMBINED:
      tconst titleType            primaryTitle           originalTitle   
0  tt0000001     short              Carmencita              Carmencita  \
1  tt0000002     short  Le clown et ses chiens  Le clown et ses chiens   
2  tt0000003     short          Pauvre Pierrot          Pauvre Pierrot   
3  tt0000004     short             Un bon bock             Un bon bock   
4  tt0000005     short        Blacksmith Scene        Blacksmith Scene   

  isAdult startYear endYear runtimeMinutes                    genres   
0       0      1894      \N              1         Documentary,Short  \
1       0      1892      \N              5           Animation,Short   
2       0      1892      \N              4  Animation,Comedy,Romance   
3       0      1892      \N             12           Animation,Short   
4       0      1893      \N              1              Comedy,Short   

      tconst  averageRating  numVotes  
0  tt0000001            5.7      1965  
1  tt0000002
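The combined frame carries two tconst columns, so one way to check whether each row really describes a single title is to compare them. The sketch below does that on toy data; the frame names and values are assumptions for illustration, not rows from the IMDB files.

```python
import pandas as pd

# Toy frames whose rows are deliberately in a different order
basics = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"],
                       "primaryTitle": ["A", "B", "C"]})
ratings = pd.DataFrame({"tconst": ["tt3", "tt1"],
                        "averageRating": [8.0, 5.0]})

# Column-wise concat pairs rows by index label (0 with 0, 1 with 1, ...)
side_by_side = pd.concat([basics, ratings], axis=1, join="inner")

# The name 'tconst' is duplicated, so select the two columns by position
left_ids = side_by_side.iloc[:, 0]
right_ids = side_by_side.iloc[:, 2]

# Count rows where the title identifier and the rating identifier disagree
mismatched = int((left_ids != right_ids).sum())
print(f"{mismatched} of {len(side_by_side)} rows pair different titles")
```

If the count is not zero on the real data, the ratings in those rows belong to a different title than the one named in the row.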

The data is now in two sets: Netflix information and IMDB information. Let's create one dataset that contains the Netflix and IMDB movies together.

In [86]:
# DEPRECATED DATASET
#netflix_imdb_combined = pd.merge(netflix,imdb_combined, suffixes=['_netflix','_imdb'], left_on='title', right_on='originalTitle')
#print(f"{netflix_imdb_combined.shape}\n\nNETFLIX IMDB COMBINED:\n{netflix_imdb_combined.iloc[:1]}")

Now we have a combined dataframe of the shows available on Netflix and their IMDB information. Let's filter out the TV shows.

In [87]:
# DEPRECATED DATASET
#netflix_imdb_combined_no_tv = netflix_imdb_combined[(netflix_imdb_combined['type'] == 'Movie')]
#print(f"{netflix_imdb_combined_no_tv.shape}\n\nCOMBINED NO TV:\n{netflix_imdb_combined_no_tv.iloc[:1]}")

Let's also filter out any rows containing NaN values.

In [7]:

imdb_combined = imdb_combined.dropna()
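One thing to keep in mind: the IMDB files mark missing values with the literal string \N (visible in the endYear and runtimeMinutes columns above), and dropna() does not treat that string as missing. The sketch below shows one possible way to handle it on a toy frame. The values and the choice of columns in `subset` are assumptions for illustration. Passing `na_values='\\N'` to `pd.read_csv` is another option.

```python
import numpy as np
import pandas as pd

# Toy frame using IMDB's "\N" placeholder (values are illustrative)
df = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "endYear": ["\\N", "\\N", "\\N"],
    "runtimeMinutes": ["95", "\\N", "20"],
    "genres": ["Horror", "Drama", "\\N"],
})

# dropna() alone removes nothing here, because "\N" is an ordinary string
assert len(df.dropna()) == 3

# Convert the placeholder to a real missing value first
with_nan = df.replace("\\N", np.nan)

# Only require the columns we actually need; endYear is missing for every row,
# so dropping on all columns would discard the whole frame
cleaned = with_nan.dropna(subset=["runtimeMinutes", "genres"])
print(cleaned)
```

Restricting `subset` matters on this kind of data: a column that is almost always missing would otherwise remove nearly every row.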

Let's start by creating our training/test data split (85/15).

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
train, test = train_test_split(imdb_combined, test_size=0.15)

print(f"TRAINING:\t{train.shape}\n{train.iloc[:1]}\n\n\nTESTING:\t{test.shape}\n{test.iloc[:1]}")

TRAINING:	(1106810, 12)
             tconst  titleType    primaryTitle   originalTitle isAdult   
1124461  tt10282106  tvEpisode  Episode #1.722  Episode #1.722       0  \

        startYear endYear runtimeMinutes               genres     tconst   
1124461      2003      \N             \N  Comedy,Drama,Family  tt5866648  \

         averageRating  numVotes  
1124461            6.2         8  


TESTING:	(195320, 12)
           tconst  titleType                   primaryTitle   
633557  tt0655757  tvEpisode  Episode dated 12 January 2002  \

                        originalTitle isAdult startYear endYear   
633557  Episode dated 12 January 2002       0      2002      \N  \

       runtimeMinutes genres      tconst  averageRating  numVotes  
633557             \N  Music  tt13689534            7.8        53  
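The split above is made without a seed, so each run produces different training and test rows. Here is a rough sketch of a repeatable 85/15 split using only pandas. Note that this swaps in `DataFrame.sample` in place of `train_test_split`, and the toy frame and the seed value 42 are assumptions for illustration. `train_test_split` itself also accepts a `random_state` argument for the same purpose.

```python
import pandas as pd

# Toy frame with 100 rows (identifiers and ratings are illustrative)
df = pd.DataFrame({
    "tconst": [f"tt{i:07d}" for i in range(100)],
    "averageRating": [5.0 + (i % 50) / 10 for i in range(100)],
})

# Sample 15% of the rows for testing; a fixed seed makes the split repeatable
test = df.sample(frac=0.15, random_state=42)

# Every row that was not sampled becomes training data
train = df.drop(test.index)

print(train.shape, test.shape)
```

With a fixed seed, the per-genre averages computed later stay the same from one run to the next.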


In [11]:
# Generate list of all genres
genre_list = []

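genres = train['genres'].unique()
for genresgroup in genres:
  # NaN is the only value that is not equal to itself; report it and skip it,
  # because a float has no .split() method
  if genresgroup != genresgroup:
    print(genresgroup)
    continue
  genre_sep = genresgroup.split(',')
  genre_list = genre_list + genre_sep

# Find unique genres from our data.
list_set = set(genre_list)
unique_list_genre = (list(list_set))
# "\N" is IMDB's placeholder for a missing value, not a genre
unique_list_genre.remove("\\N")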
for genre in unique_list_genre:
  print(genre)

Musical
Adult
Horror
Western
Adventure
Crime
Mystery
War
Action
Talk-Show
Animation
Family
Music
Short
History
Thriller
Film-Noir
Sport
Comedy
Biography
Romance
Sci-Fi
Game-Show
Reality-TV
News
Documentary
Drama
Fantasy
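The same list can probably be built without an explicit loop. The sketch below uses pandas string methods on a toy `train` frame; the rows are illustrative and the sorting step is an addition so that the order is stable between runs.

```python
import pandas as pd

# Toy stand-in for the training frame (rows are illustrative)
train = pd.DataFrame({"genres": ["Documentary,Short", "Animation,Short",
                                 "Animation,Comedy,Romance", "\\N"]})

unique_genres = (
    train["genres"]
    .str.split(",")   # "Animation,Short" -> ["Animation", "Short"]
    .explode()        # one row per individual genre
    .unique()         # distinct genre names
)

# Drop the "\N" placeholder and sort so the order is the same on every run
unique_list_genre = sorted(g for g in unique_genres if g != "\\N")
print(unique_list_genre)
```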


In [14]:
# Create dictionary for all genres
genre_split = {}
# Add each genre as a key, and the dataframe of its titles as the value
# NOTE: str.contains does a substring (regex) match, so 'Music' also matches 'Musical'
for genre in unique_list_genre:
  genre_split[genre] = train.loc[(train['genres'].str.contains(genre))]

genre_split['Horror'].head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,tconst.1,averageRating,numVotes
61225,tt0062446,movie,The Rape of the Vampire,Le viol du vampire,0,1968,\N,95,Horror,tt0086100,7.7,25
834762,tt0861988,tvEpisode,Episode #1.354,Episode #1.354,0,1967,\N,23,"Drama,Fantasy,Horror",tt2088583,8.2,6
797628,tt0823500,short,My Tumour & I,My Tumour & I,0,2005,\N,20,"Horror,Short",tt1884299,8.1,9
1173680,tt10371140,tvEpisode,Mércores de fariñada,Mércores de fariñada,0,2020,\N,\N,"Crime,Horror,Mystery",tt6741930,9.5,10244
679262,tt0702191,tvEpisode,Episode #4.5,Episode #4.5,0,2001,\N,\N,"Drama,Horror,Mystery",tt14707412,5.5,22


Now all of our data is split by genre as well.
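Because the genres column is a comma-separated string, a substring test can put a title in a genre it does not belong to (Music versus Musical is the case in our genre list). A minimal sketch of an exact-match alternative follows; the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the training frame (rows are illustrative)
train = pd.DataFrame({
    "primaryTitle": ["A", "B", "C"],
    "genres": ["Musical,Romance", "Music", "Drama,Music"],
})

# Substring match: "Music" also hits the "Musical,Romance" row
substring_hits = train[train["genres"].str.contains("Music")]

# Exact match: split each string into a list and test membership
genre_lists = train["genres"].str.split(",")
exact_hits = train[genre_lists.apply(lambda parts: "Music" in parts)]

print(len(substring_hits), len(exact_hits))
```

The exact-match version keeps Music and Musical as separate groups, which also keeps their averages separate.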

In [23]:
genre_split_avg = {}
# for each genre
for genre in unique_list_genre:
  # initialize the running total and the row count
  avg = 0
  count = 0
  # announce the genre being processed
  print(f"GENRE: {genre}:")
  # iterate over the dataframe to accumulate the rating total and the number of titles
  for index, row in genre_split[genre].iterrows():
    #print(row['primaryTitle'], row['averageRating'])
    avg = avg + row['averageRating']
    count = count + 1
  # turn the total into the mean rating
  avg = avg/count
  print(f"\nAverage: {avg}, Number: {count}\n\n")
  # split the dataframe into at-or-below average ('B' prefix) and above average ('A' prefix)
  genre_split_avg['B'+genre] = genre_split[genre][genre_split[genre]['averageRating'] <= avg]
  genre_split_avg['A'+genre] = genre_split[genre][genre_split[genre]['averageRating'] > avg]

GENRE: Musical:

Average: 6.759743099207445, Number: 10977


GENRE: Adult:

Average: 6.880579460700019, Number: 38346


GENRE: Horror:

Average: 6.877658637658623, Number: 19305


GENRE: Western:

Average: 6.774837569816526, Number: 17546


GENRE: Adventure:

Average: 6.941617168938354, Number: 63976


GENRE: Crime:

Average: 6.936362195701784, Number: 82033


GENRE: Mystery:

Average: 6.92305374350145, Number: 30199


GENRE: War:

Average: 6.784001220504465, Number: 9832


GENRE: Action:

Average: 6.957743654302119, Number: 68314


GENRE: Talk-Show:

Average: 7.053274607430112, Number: 80941


GENRE: Animation:

Average: 6.928214331495986, Number: 54607


GENRE: Family:

Average: 7.019810944391495, Number: 90238


GENRE: Music:

Average: 6.963460965188307, Number: 60382


GENRE: Short:

Average: 6.921279570262586, Number: 145205


GENRE: History:

Average: 6.913391485963519, Number: 16279


GENRE: Thriller:

Average: 6.855765357502525, Number: 23832


GENRE: Film-Noir:

Average: 6.204
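Iterating row by row over more than a million rows is slow. The same average and split can likely be computed with pandas column operations, as in this sketch. The toy `genre_split` dictionary and its ratings are assumptions for illustration, and the last digits of a mean computed this way may differ slightly from the loop because the additions happen in a different order.

```python
import pandas as pd

# Toy stand-in for genre_split (ratings are illustrative)
genre_split = {
    "Horror": pd.DataFrame({"primaryTitle": ["A", "B", "C", "D"],
                            "averageRating": [7.7, 8.2, 8.1, 5.5]}),
    "War": pd.DataFrame({"primaryTitle": ["E", "F"],
                         "averageRating": [6.0, 7.0]}),
}

genre_split_avg = {}
genre_stats = {}
for genre, frame in genre_split.items():
    # Column-wise mean replaces the manual running total
    avg = frame["averageRating"].mean()
    count = len(frame)
    genre_stats[genre] = (avg, count)
    # 'B' prefix: at or below the average; 'A' prefix: above the average
    genre_split_avg["B" + genre] = frame[frame["averageRating"] <= avg]
    genre_split_avg["A" + genre] = frame[frame["averageRating"] > avg]

for genre, (avg, count) in genre_stats.items():
    print(f"GENRE: {genre}: Average: {avg}, Number: {count}")
```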

In [29]:
genre_split_avg['AAdult'].head()


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,tconst.1,averageRating,numVotes
290571,tt0303630,video,Amateur Hours 5,Amateur Hours 5,1,1989,\N,120,Adult,tt0560663,7.2,19
1224644,tt10463200,tvEpisode,Classic,Classic,1,2019,\N,\N,Adult,tt7854360,8.0,5
449913,tt0468379,video,Pussyman's Blow Bang 1,Pussyman's Blow Bang 1,1,2005,\N,68,Adult,tt10059680,8.7,113
417420,tt0434989,video,Hairy Assed Daddies,Hairy Assed Daddies,1,2002,\N,82,Adult,tt0885522,7.6,474
134134,tt0138095,movie,Sexdance Fever,Sexdance Fever,1,1984,\N,\N,Adult,tt0206920,7.0,13
