In [7]:
import pandas as pd

IMDB (https://datasets.imdbws.com/) publishes its data for download, but the files need some preprocessing. For starters, the movie titles and the ratings are in separate datasets, so we will combine them into one pandas DataFrame.

In [8]:
imdb_basics = pd.read_csv('data_imdb_basics.tsv', sep='\t')
imdb_basics.shape

(9621894, 9)
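The IMDB files mark missing values with \N (visible in the endYear column of the output further down), so numeric columns can end up being read as strings. A minimal sketch, assuming a tiny inline sample in place of the real file, of telling read_csv about that marker; low_memory=False asks pandas to infer each column's type from the whole file rather than chunk by chunk. The sample rows and the variable names are illustrative only.

```python
import io
import pandas as pd

# Tiny stand-in for data_imdb_basics.tsv (illustrative sample rows only).
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\truntimeMinutes\n"
    "tt0000001\tshort\tCarmencita\t1894\t\\N\t1\n"
    "tt0000002\tshort\tLe clown et ses chiens\t1892\t\\N\t5\n"
)

basics_sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    na_values="\\N",    # treat the \N marker as a missing value
    low_memory=False,   # infer dtypes from the full column, not per chunk
)
print(basics_sample.dtypes)
```

With the marker declared, startYear and runtimeMinutes come back as numbers and endYear as missing values, which makes later comparisons on years straightforward.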

In [9]:
imdb_ratings = pd.read_csv('data_imdb_ratings.tsv', sep='\t')
imdb_ratings.shape

(1280237, 3)

In [10]:
netflix = pd.read_csv('netflix_titles.csv')
netflix.shape

(8807, 12)

In [11]:
print(f"NETFLIX:\n{netflix.iloc[:1]}\n\n\nIMDB BASICS:\n{imdb_basics.iloc[:5]}\n\n\nIMDB RATINGS:\n{imdb_ratings.iloc[:5]}")

NETFLIX:
  show_id   type                 title         director cast        country  \
0      s1  Movie  Dick Johnson Is Dead  Kirsten Johnson  NaN  United States   

           date_added  release_year rating duration      listed_in  \
0  September 25, 2021          2020  PG-13   90 min  Documentaries   

                                         description  
0  As her father nears the end of his life, filmm...  


IMDB BASICS:
      tconst titleType            primaryTitle           originalTitle  \
0  tt0000001     short              Carmencita              Carmencita   
1  tt0000002     short  Le clown et ses chiens  Le clown et ses chiens   
2  tt0000003     short          Pauvre Pierrot          Pauvre Pierrot   
3  tt0000004     short             Un bon bock             Un bon bock   
4  tt0000005     short        Blacksmith Scene        Blacksmith Scene   

  isAdult startYear endYear runtimeMinutes                    genres  
0       0      1894      \N              1        

In [12]:
imdb_combined = pd.concat([imdb_basics, imdb_ratings], axis=1, join='inner')
print(f"{imdb_combined.shape}")

print(f"\nIMDB COMBINED:\n{imdb_combined.iloc[:5]}")

(1280237, 12)

IMDB COMBINED:
      tconst titleType            primaryTitle           originalTitle  \
0  tt0000001     short              Carmencita              Carmencita   
1  tt0000002     short  Le clown et ses chiens  Le clown et ses chiens   
2  tt0000003     short          Pauvre Pierrot          Pauvre Pierrot   
3  tt0000004     short             Un bon bock             Un bon bock   
4  tt0000005     short        Blacksmith Scene        Blacksmith Scene   

  isAdult startYear endYear runtimeMinutes                    genres  \
0       0      1894      \N              1         Documentary,Short   
1       0      1892      \N              5           Animation,Short   
2       0      1892      \N              4  Animation,Comedy,Romance   
3       0      1892      \N             12           Animation,Short   
4       0      1893      \N              1              Comedy,Short   

      tconst  averageRating  numVotes  
0  tt0000001            5.7      1952  
1  tt0000002
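Note that pd.concat with axis=1 lines rows up by index position, not by tconst. The two files have different lengths (9621894 and 1280237 rows), so a rating is paired with the right title only where the row positions happen to coincide, and the result carries two tconst columns. One alternative, sketched here on made-up rows and ratings, is to join on the shared tconst key with pd.merge:

```python
import pandas as pd

# Made-up rows: the second title has no rating, so row positions drift apart.
basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primaryTitle": ["Carmencita", "Le clown et ses chiens", "Pauvre Pierrot"],
})
ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000003"],
    "averageRating": [5.7, 6.5],  # illustrative values
})

# Inner join on the key keeps only titles that have a rating,
# and each rating lands on the row with the same tconst.
merged = pd.merge(basics, ratings, on="tconst", how="inner")
print(merged)
```

A key-based join yields a single tconst column, so the combined frame would have 11 columns rather than 12.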

The data is now in two sets: Netflix information and IMDB information. Let's create one dataset that holds the Netflix titles together with their IMDB information.

In [13]:
netflix_imdb_combined = pd.merge(netflix,imdb_combined, suffixes=['_netflix','_imdb'], left_on='title', right_on='originalTitle')
print(f"{netflix_imdb_combined.shape}\n\nNETFLIX IMDB COMBINED:\n{netflix_imdb_combined.iloc[:1]}")

(13351, 24)

NETFLIX IMDB COMBINED:
  show_id     type          title       director  \
0      s6  TV Show  Midnight Mass  Mike Flanagan   

                                                cast country  \
0  Kate Siegel, Zach Gilford, Hamish Linklater, H...     NaN   

           date_added  release_year rating  duration  ...   primaryTitle  \
0  September 24, 2021          2021  TV-MA  1 Season  ...  Midnight Mass   

   originalTitle isAdult startYear endYear runtimeMinutes genres     tconst  \
0  Midnight Mass       0      1999      \N             \N  Drama  tt0216854   

  averageRating numVotes  
0           5.3       11  

[1 rows x 24 columns]
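The merge returns more rows (13351) than there are Netflix titles (8807) because one title string can match several IMDB entries; the sample row above pairs the 2021 "Midnight Mass" with an entry whose startYear is 1999. A possible refinement, shown on hypothetical rows, is to require the years to agree as well. This assumes release_year and startYear describe the same year, which will not hold for every title, and the tconst values below are placeholders.

```python
import pandas as pd

netflix_s = pd.DataFrame({"title": ["Midnight Mass"], "release_year": [2021]})
imdb_s = pd.DataFrame({
    "originalTitle": ["Midnight Mass", "Midnight Mass", "Midnight Mass"],
    "startYear": ["1999", "2021", "\\N"],           # strings, as in the raw file
    "tconst": ["tt_example_a", "tt_example_b", "tt_example_c"],  # placeholders
})

# Convert the year to a number; the \N marker becomes NaN.
imdb_s["startYear"] = pd.to_numeric(imdb_s["startYear"], errors="coerce")

# Match on title AND year instead of title alone.
matched = pd.merge(
    netflix_s,
    imdb_s,
    left_on=["title", "release_year"],
    right_on=["originalTitle", "startYear"],
)
print(matched[["title", "release_year", "tconst"]])
```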


Now we have a combined DataFrame of the titles available on Netflix and their IMDB information. Let's filter out the TV shows.

In [14]:
netflix_imdb_combined_no_tv = netflix_imdb_combined[(netflix_imdb_combined['type'] == 'Movie')]
print(f"{netflix_imdb_combined_no_tv.shape}\n\nCOMBINED NO TV:\n{netflix_imdb_combined_no_tv.iloc[:1]}")

(9317, 24)

COMBINED NO TV:
  show_id   type                             title  \
2      s7  Movie  My Little Pony: A New Generation   

                        director  \
2  Robert Cullen, José Luis Ucha   

                                                cast country  \
2  Vanessa Hudgens, Kimiko Glenn, James Marsden, ...     NaN   

           date_added  release_year rating duration  ...  \
2  September 24, 2021          2021     PG   91 min  ...   

                       primaryTitle                     originalTitle isAdult  \
2  My Little Pony: A New Generation  My Little Pony: A New Generation       0   

  startYear endYear runtimeMinutes                      genres     tconst  \
2      2021      \N             90  Adventure,Animation,Comedy  tt4485950   

  averageRating numVotes  
2           7.9       12  

[1 rows x 24 columns]


Let's start by creating our training/test data split (85/15).

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
train, test = train_test_split(netflix_imdb_combined_no_tv, test_size=0.15)

print(f"TRAINING:\t{train.shape}\n{train.iloc[:1]}\n\n\nTESTING:\t{test.shape}\n{test.iloc[:1]}")

TRAINING:	(7919, 24)
     show_id   type     title    director  \
1020    s594  Movie  Snow Day  Chris Koch   

                                                   cast        country  \
1020  Chris Elliott, Mark Webber, Jean Smart, Schuyl...  United States   

        date_added  release_year rating duration  ... primaryTitle  \
1020  July 1, 2021          2000     PG   89 min  ...     Snow Day   

     originalTitle isAdult startYear endYear runtimeMinutes  \
1020      Snow Day       0      2007      \N             22   

                    genres     tconst averageRating numVotes  
1020  Family,Fantasy,Music  tt3027054           7.5        8  

[1 rows x 24 columns]


TESTING:	(1398, 24)
     show_id   type        title        director  \
9347   s6398  Movie  Cabin Fever  Travis Zariwny   

                                                   cast        country  \
9347  Gage Golightly, Matthew Daddario, Samuel Davis...  United States   

        date_added  release_year rating durati
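Without a seed, train_test_split shuffles differently on every run, so the rows shown above change each time the notebook is executed. A small sketch on a toy frame of how a fixed random_state makes the split repeatable; the value 42 is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for netflix_imdb_combined_no_tv.
toy = pd.DataFrame({"title": [f"movie_{i}" for i in range(20)]})

# The same seed gives the same split on every call.
train_a, test_a = train_test_split(toy, test_size=0.15, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.15, random_state=42)

print(len(train_a), len(test_a))
print(test_a.index.equals(test_b.index))
```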

In [17]:
# Generate list of all genres
genre_list = []

for genres in train['genres']:
  genre_sep = genres.split(',')
  genre_list = genre_list + genre_sep

# Find unique genres from our data.
list_set = set(genre_list)
unique_list_genre = (list(list_set))
unique_list_genre.remove("\\N")
for genre in unique_list_genre:
  print(genre)

Thriller
Comedy
Western
Adventure
History
News
Music
Fantasy
Animation
Horror
Sci-Fi
Mystery
Short
Game-Show
Drama
Crime
Family
Action
Film-Noir
War
Documentary
Biography
Sport
Talk-Show
Romance
Reality-TV
Adult
Musical
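The same list can be built without an explicit loop. A sketch on a toy frame, using str.split and explode to put one genre on each row; the set difference drops the \N marker and, unlike list.remove, does not raise an error when the marker is absent.

```python
import pandas as pd

# Toy stand-in for the training frame.
toy_train = pd.DataFrame(
    {"genres": ["Horror,Sci-Fi,Short", "Comedy,Short", "\\N"]}
)

# Split each comma-separated string into a list, then one genre per row.
all_genres = toy_train["genres"].str.split(",").explode()

# Deduplicate, drop the missing-value marker, and sort for stable output.
unique_genres = sorted(set(all_genres) - {"\\N"})
print(unique_genres)
```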


In [18]:
# Create dictionary for all genres
genre_split = {}
# Add each genre as a key, and the DataFrame of matching rows as its value
for genre in unique_list_genre:
  genre_split[genre] = train.loc[(train['genres'].str.contains(genre))]

print(genre_split['Horror'].iloc[:1])

    show_id   type         title     director  \
777    s493  Movie  Midnight Sun  Scott Speer   

                                                  cast        country  \
777  Bella Thorne, Patrick Schwarzenegger, Rob Rigg...  United States   

       date_added  release_year rating duration  ...  primaryTitle  \
777  July 8, 2021          2018  PG-13   91 min  ...  Midnight Sun   

    originalTitle isAdult startYear endYear runtimeMinutes  \
777  Midnight Sun       0      2003      \N             18   

                  genres     tconst averageRating numVotes  
777  Horror,Sci-Fi,Short  tt0704362           6.8        8  

[1 rows x 24 columns]


Now all of our data is split by genre as well.
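One caveat: str.contains(genre) is a substring match, so with the genre list above the Music frame also picks up rows tagged Musical. A sketch on toy rows of an exact-match alternative that compares against the individual comma-separated genres; rows_with_genre is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "title": ["A", "B", "C"],  # hypothetical titles
    "genres": ["Music,Drama", "Musical,Comedy", "Drama"],
})

# Pre-split each genre string into a list of exact genre names.
genre_tokens = toy_train["genres"].str.split(",")

def rows_with_genre(genre):
    """Return rows whose genre list contains exactly this genre."""
    mask = genre_tokens.apply(lambda parts: genre in parts)
    return toy_train[mask]

substring_hits = toy_train[toy_train["genres"].str.contains("Music")]
exact_hits = rows_with_genre("Music")
print(len(substring_hits), len(exact_hits))
```

Either way, a movie with several genres appears in the frame of each of its genres.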

In [19]:
genre_split_avg = {}
# For each genre
for genre in unique_list_genre:
  # Initialize the running total and the movie count
  avg = 0
  count = 0
  # Announce the genre being processed
  print(f"GENRE: {genre}:")
  # Iterate over the DataFrame to find the average rating and the number of movies
  for index, row in genre_split[genre].iterrows():
    print(row['title'], row['averageRating'])
    avg = avg + row['averageRating']
    count = count + 1
  avg = avg/count
  print(f"\nAverage: {avg}, Number: {count}\n\n")
  # Split the DataFrame into at-or-below-average ('B') and above-average ('A') rows
  genre_split_avg['B'+genre] = genre_split[genre][genre_split[genre]['averageRating'] <= avg]
  genre_split_avg['A'+genre] = genre_split[genre][genre_split[genre]['averageRating'] > avg]

Streaming output truncated to the last 5000 lines.
Alive and Kicking 4.2
Fracture 7.1
Valentine's Day 6.1
The Trap 5.8
The Assignment 7.6
Paranoia 5.9
The Promise 6.9
Mariposa 7.6
Phir Bhi Dil Hai Hindustani 7.8
Raja Hindustani 6.5
Sarajevo 7.4
Badland 6.4
The Last Summer 7.8
Destiny 5.8
While We're Young 7.4
The Four Seasons 6.5
Honeytrap 7.9
Mad World 6.5
Hot Rod 7.8
Fools Rush In 6.5
The Disciple 9.3
Consequences 8.4
Monster 7.6
Seven 7.6
Stardust 8.1
The Dancer 7.4
Paid in Full 6.9
The Bounty Hunter 7.6
Deep 6.7
Standoff 8.5
Bright Star 6.9
The Bridge 8.4
Blind Date 4.2
Mrs. Serial Killer 6.5
Killers 5.4
Valentine's Day 5.2
Bedtime Stories 6.4
Christine 5.8
The Dancer 7.7
365 Days 7.5
Compulsion 7.6
Skin Trade 5.0
The Killer 8.6
Consequences 7.2
The First Lady 7.9
Savages 7.1
The Con Is On 7.0
Shadow 7.3
Blind Date 7.6
Game Over 8.2
Child's Play 7.4
Search Party 7.8
Blood Will Tell 8.4
The Dig 6.8
A Dangerous Woman 6.6
The Competition 6.6
Child's Play 8.9
Catch Me If 
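The per-row loop prints every title, which is why the output is truncated. If only the average and the split are needed, the same numbers can be computed directly on the column. A sketch on a toy genre frame with made-up ratings:

```python
import pandas as pd

# Toy stand-in for one entry of genre_split (made-up ratings).
toy_genre = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "averageRating": [4.0, 6.0, 7.0, 7.0],
})

# Mean rating and number of movies, without iterating over rows.
avg = toy_genre["averageRating"].mean()
count = len(toy_genre)

# Same split rule as the loop: at or below the mean vs. above it.
below = toy_genre[toy_genre["averageRating"] <= avg]
above = toy_genre[toy_genre["averageRating"] > avg]
print(avg, count, len(below), len(above))
```

For an empty genre frame, mean() returns NaN, whereas the loop version would divide by zero.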

In [20]:
print(f"{genre_split_avg['AAction'].iloc[:1]}\n\n\n{genre_split_avg['BAction'].iloc[:1]}")

     show_id   type      title      director  \
1556    s813  Movie  Swordfish  Dominic Sena   

                                                   cast        country  \
1556  John Travolta, Hugh Jackman, Halle Berry, Don ...  United States   

        date_added  release_year rating duration  ... primaryTitle  \
1556  June 2, 2021          2001      R   99 min  ...    Swordfish   

     originalTitle isAdult startYear endYear runtimeMinutes  \
1556     Swordfish       0      2001      \N             99   

                     genres     tconst averageRating numVotes  
1556  Action,Crime,Thriller  tt0429172           7.0       37  

[1 rows x 24 columns]


     show_id   type   title        director  \
9879   s6736  Movie  Fallen  Gregory Hoblit   

                                                   cast        country  \
9879  Denzel Washington, John Goodman, Donald Suther...  United States   

            date_added  release_year rating duration  ... primaryTitle  \
9879  November 