In [5]:
import pandas as pd

IMDb (https://datasets.imdbws.com/) makes its data available for use, but the data requires some preprocessing. For starters, a movie's title and its rating are not in the same dataset, so we will combine the two into one pandas DataFrame.
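Both IMDb files carry a `tconst` identifier column, so one way to pair each title with its own rating is a key-based join. The sketch below uses made-up rows (the `basics_toy` and `ratings_toy` frames are hypothetical stand-ins, not the real files) to show that a join on `tconst` does not depend on the two files having the same row order or the same number of rows.

```python
import pandas as pd

# Hypothetical toy frames standing in for the two IMDb files (made-up rows)
basics_toy = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primaryTitle": ["Title A", "Title B", "Title C"],
})
ratings_toy = pd.DataFrame({
    "tconst": ["tt0000003", "tt0000001"],  # different order, one title unrated
    "averageRating": [6.5, 5.7],
    "numVotes": [100, 1952],
})

# Inner join on the shared key: every kept row has a rating for that exact title
combined_toy = pd.merge(basics_toy, ratings_toy, on="tconst", how="inner")
print(combined_toy)
```

With an inner join, titles that have no rating are dropped, and `tconst` appears only once in the result.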

In [6]:
imdb_basics = pd.read_csv('data_imdb_basics.tsv', sep='\t')
imdb_basics.shape

(186750, 9)

In [7]:
imdb_ratings = pd.read_csv('data_imdb_ratings.tsv', sep='\t')
imdb_ratings.shape

(964913, 3)

In [8]:
netflix = pd.read_csv('netflix_titles.csv')
netflix.shape

(8807, 12)

In [9]:
print(f"NETFLIX:\n{netflix.iloc[:1]}\n\n\nIMDB BASICS:\n{imdb_basics.iloc[:5]}\n\n\nIMDB RATINGS:\n{imdb_ratings.iloc[:5]}")

NETFLIX:
  show_id   type                 title         director cast        country  \
0      s1  Movie  Dick Johnson Is Dead  Kirsten Johnson  NaN  United States   

           date_added  release_year rating duration      listed_in  \
0  September 25, 2021          2020  PG-13   90 min  Documentaries   

                                         description  
0  As her father nears the end of his life, filmm...  


IMDB BASICS:
      tconst titleType            primaryTitle           originalTitle  \
0  tt0000001     short              Carmencita              Carmencita   
1  tt0000002     short  Le clown et ses chiens  Le clown et ses chiens   
2  tt0000003     short          Pauvre Pierrot          Pauvre Pierrot   
3  tt0000004     short             Un bon bock             Un bon bock   
4  tt0000005     short        Blacksmith Scene        Blacksmith Scene   

   isAdult startYear endYear runtimeMinutes                    genres  
0        0      1894      \N              1      

In [10]:
# Note: concat with axis=1 aligns rows on the index (here the default row number), not on tconst
imdb_combined = pd.concat([imdb_basics, imdb_ratings], axis=1, join='inner')
print(f"{imdb_combined.shape}")

print(f"\nIMDB COMBINED:\n{imdb_combined.iloc[:5]}")

(186750, 12)

IMDB COMBINED:
      tconst titleType            primaryTitle           originalTitle  \
0  tt0000001     short              Carmencita              Carmencita   
1  tt0000002     short  Le clown et ses chiens  Le clown et ses chiens   
2  tt0000003     short          Pauvre Pierrot          Pauvre Pierrot   
3  tt0000004     short             Un bon bock             Un bon bock   
4  tt0000005     short        Blacksmith Scene        Blacksmith Scene   

   isAdult startYear endYear runtimeMinutes                    genres  \
0        0      1894      \N              1         Documentary,Short   
1        0      1892      \N              5           Animation,Short   
2        0      1892      \N              4  Animation,Comedy,Romance   
3        0      1892      \N             12           Animation,Short   
4        0      1893      \N              1              Comedy,Short   

      tconst  averageRating  numVotes  
0  tt0000001            5.7    1952.0  
1  tt00

The data is now in two sets: the Netflix information and the IMDb information. Let's create one dataset that contains the Netflix titles together with their IMDb information.

In [11]:
netflix_imdb_combined = pd.merge(netflix,imdb_combined, suffixes=['_netflix','_imdb'], left_on='title', right_on='originalTitle')
print(f"{netflix_imdb_combined.shape}\n\nNETFLIX IMDB COMBINED:\n{netflix_imdb_combined.iloc[:1]}")

(2301, 24)

NETFLIX IMDB COMBINED:
  show_id     type          title       director  \
0      s6  TV Show  Midnight Mass  Mike Flanagan   

                                                cast country  \
0  Kate Siegel, Zach Gilford, Hamish Linklater, H...     NaN   

           date_added  release_year rating  duration  ...   primaryTitle  \
0  September 24, 2021          2021  TV-MA  1 Season  ...  Midnight Mass   

   originalTitle isAdult startYear endYear runtimeMinutes  genres     tconst  \
0  Midnight Mass       0      1999      \N             \N   Drama  tt0216854   

  averageRating numVotes  
0           5.3     11.0  

[1 rows x 24 columns]
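In the sample row above, a Netflix title released in 2021 is paired with an IMDb entry whose `startYear` is 1999, so matching on title alone can pair different works that share a name. A minimal sketch of a stricter match on title and year follows. The `netflix_toy` and `imdb_toy` frames are hypothetical, and treating `release_year` and `startYear` as the same year is an assumption that may not hold for every title.

```python
import pandas as pd

# Hypothetical toy frames (made-up rows) with the same column names as above
netflix_toy = pd.DataFrame({
    "title": ["Same Name", "Other Film"],
    "release_year": [2021, 1993],
})
imdb_toy = pd.DataFrame({
    "originalTitle": ["Same Name", "Same Name", "Other Film"],
    "startYear": ["1999", "2021", "\\N"],  # IMDb uses "\N" for missing values
    "averageRating": [5.3, 7.7, 5.7],
})

# Convert startYear to integers; rows whose year is missing ("\N") are dropped
imdb_toy = (
    imdb_toy
    .assign(startYear=pd.to_numeric(imdb_toy["startYear"], errors="coerce"))
    .dropna(subset=["startYear"])
    .astype({"startYear": "int64"})
)

# Match on both title and year so same-named works from other years are excluded
matched = pd.merge(
    netflix_toy,
    imdb_toy,
    left_on=["title", "release_year"],
    right_on=["originalTitle", "startYear"],
)
print(matched)
```

The trade-off is that titles with a missing or differing year no longer match at all, so the combined set gets smaller but cleaner.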


Now we have a combined DataFrame of the shows available on Netflix and their IMDb information. Let's filter out the TV shows.

In [12]:
netflix_imdb_combined_no_tv = netflix_imdb_combined[(netflix_imdb_combined['type'] == 'Movie')]
print(f"{netflix_imdb_combined_no_tv.shape}\n\nCOMBINED NO TV:\n{netflix_imdb_combined_no_tv.iloc[:1]}")

(1673, 24)

COMBINED NO TV:
  show_id   type    title      director  \
1      s8  Movie  Sankofa  Haile Gerima   

                                                cast  \
1  Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...   

                                             country          date_added  \
1  United States, Ghana, Burkina Faso, United Kin...  September 24, 2021   

   release_year rating duration  ... primaryTitle originalTitle isAdult  \
1          1993  TV-MA  125 min  ...      Sankofa       Sankofa       0   

  startYear endYear runtimeMinutes  genres     tconst averageRating numVotes  
1      1993      \N            125   Drama  tt0150986           5.7     51.0  

[1 rows x 24 columns]


Let's start by creating our training/test data split (85/15).
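`train_test_split` shuffles the rows before splitting, so each run produces a different split unless a seed is fixed. The sketch below, on a hypothetical toy frame, passes `random_state` to make the 85/15 split repeatable; the seed value 42 is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the combined movie data
toy = pd.DataFrame({"title": [f"movie_{i}" for i in range(20)]})

# The same random_state yields the same shuffle, hence the same split
train_a, test_a = train_test_split(toy, test_size=0.15, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.15, random_state=42)

print(train_a.shape, test_a.shape)
print(test_a.index.equals(test_b.index))
```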

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
train, test = train_test_split(netflix_imdb_combined_no_tv, test_size=0.15)

print(f"TRAINING:\t{train.shape}\n{train.iloc[:1]}\n\n\nTESTING:\t{test.shape}\n{test.iloc[:1]}")

TRAINING:	(1422, 24)
   show_id   type          title    director  \
61    s162  Movie  Mars Attacks!  Tim Burton   

                                                 cast        country  \
61  Jack Nicholson, Glenn Close, Annette Bening, P...  United States   

           date_added  release_year rating duration  ...   primaryTitle  \
61  September 1, 2021          1996  PG-13  106 min  ...  Mars Attacks!   

    originalTitle isAdult startYear endYear runtimeMinutes         genres  \
61  Mars Attacks!       0      1996      \N            106  Comedy,Sci-Fi   

       tconst averageRating numVotes  
61  tt0167872           7.7   4706.0  

[1 rows x 24 columns]


TESTING:	(251, 24)
    show_id   type    title       director  \
844   s3204  Movie  Why Me?  Tudor Giurgiu   

                                                  cast  \
844  Emilian Oprea, Mihai Constantin, Andreea Vasil...   

                        country        date_added  release_year rating  \
844  Romania, Bulgaria, H

In [23]:
# Collect every genre that appears in the training data
genre_list = []

for genres in train['genres']:
  # Each entry is a comma-separated string such as "Drama,Horror,Thriller"
  genre_list.extend(genres.split(','))

# Find the unique genres in our data.
unique_genres = set(genre_list)
# Drop IMDb's missing-value marker; discard() does not fail if it is absent
unique_genres.discard("\\N")
unique_list_genre = list(unique_genres)
for genre in unique_list_genre:
  print(genre)

Biography
Game-Show
Animation
Horror
Musical
Sci-Fi
Sport
Film-Noir
Western
Talk-Show
Documentary
Family
Music
Drama
Adventure
Short
Adult
Mystery
History
Reality-TV
War
Crime
Fantasy
Romance
Comedy
Thriller
Action


In [34]:
# Create a dictionary keyed by genre
genre_split = {}
# Add each genre as a key, and the DataFrame of matching training rows as the value
for genre in unique_list_genre:
  genre_split[genre] = train.loc[(train['genres'].str.contains(genre))]

print(genre_split['Horror'].iloc[:1])

    show_id   type     title      director  \
209    s620  Movie  Deranged  Jameel Buari   

                                                  cast country     date_added  \
209  Nadia Buari, Ramsey Nouah, Zynnell Zuh, Prisci...     NaN  June 30, 2021   

     release_year rating duration  ...                           primaryTitle  \
209          2020  TV-14   98 min  ...  Deranged: Confessions of a Necrophile   

    originalTitle isAdult startYear endYear runtimeMinutes  \
209      Deranged       0      1974      \N             84   

                    genres     tconst averageRating numVotes  
209  Drama,Horror,Thriller  tt0096398           4.6     18.0  

[1 rows x 24 columns]


Now all of our data is split by genre as well.
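One caveat with `str.contains` is that it matches substrings, and the genre list includes both `Music` and `Musical`, so the `Music` group also picks up `Musical` titles. A minimal sketch of an exact match on a hypothetical toy frame splits the comma-separated string first and then tests membership:

```python
import pandas as pd

# Hypothetical toy frame (made-up rows) with the same 'genres' format
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Music,Drama", "Musical,Comedy", "\\N"],
})

# Substring match: "Music" is also found inside "Musical"
substring_match = toy[toy["genres"].str.contains("Music")]

# Exact match: split into a list of genres, then test list membership
genre_lists = toy["genres"].str.split(",")
exact_match = toy[genre_lists.apply(lambda genres: "Music" in genres)]

print(substring_match["title"].tolist())
print(exact_match["title"].tolist())
```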

In [49]:
genre_split_avg = {}
# For each genre
for genre in unique_list_genre:
  # Initialize the running rating total and the movie count
  avg = 0
  count = 0
  # Announce the genre being processed
  print(f"GENRE: {genre}:")
  # Iterate over the genre's DataFrame to sum the ratings and count the movies
  for index, row in genre_split[genre].iterrows():
    print(row['title'], row['averageRating'])
    avg = avg + row['averageRating']
    count = count + 1
  # Turn the running total into the mean rating for this genre
  avg = avg/count
  print(f"\nAverage: {avg}, Number: {count}\n\n")
  # Split the genre's DataFrame: 'B' + genre holds ratings at or below the average,
  # 'A' + genre holds ratings above it
  genre_split_avg['B'+genre] = genre_split[genre][genre_split[genre]['averageRating'] <= avg]
  genre_split_avg['A'+genre] = genre_split[genre][genre_split[genre]['averageRating'] > avg]

GENRE: Biography:
Wyatt Earp 6.6
Compulsion 7.6
Searching for Bobby Fischer 6.5
Mandela 7.8
Kurt & Courtney 9.0
La Bamba 6.2
The Long Riders 5.2
Tyson 6.9
Roots 8.4
Trash 7.3
Winnie 4.2
Amy 6.4
Blaze 4.6
Lincoln 6.1
American Me 5.8
Haywire 3.3
Raging Bull 8.6
GoodFellas 7.1
Belmonte 4.6
Act of Vengeance 5.3
Jimi Hendrix 6.6
Unspeakable Acts 4.8
Mutiny on the Bounty 5.4
The Ryan White Story 7.6
Sylvia 7.0
Too Young the Hero 7.0
Donnie Brasco 6.5
Schindler's List 4.5
Big Bear 7.0
I'll See You in My Dreams 5.8
Queen 5.7
Why Do Fools Fall in Love 6.8
Brothers 7.2

Average: 6.345454545454545, Number: 33


GENRE: Game-Show:
Hugo 7.7
Click 4.6
Blind Date 6.9
Blind Date 7.5

Average: 6.675000000000001, Number: 4


GENRE: Animation:
Milk 5.7
Happy Go Lucky 5.3
Much Ado About Nothing 8.6
Evolution 6.6
Bebe's Kids 5.7
School Daze 6.3
An American Tail 7.5
Hugo 7.7
Dennis the Menace 5.0
Chicken Little 7.1
The Karate Kid 6.0
The Duel 8.1
Blind Date 6.6
Tarzan 4.9
Woody Woodpecker 4.4
Hide and Seek 5
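The per-row loop above can also be written with pandas' built-in `mean`, which gives the same above/below split without iterating over rows. A minimal sketch on a hypothetical toy frame for a single genre:

```python
import pandas as pd

# Hypothetical toy frame standing in for one genre's rows
genre_toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "averageRating": [4.0, 6.0, 7.0, 7.0],
})

# Mean rating for the genre
avg = genre_toy["averageRating"].mean()

# Same rule as the loop: at or below the mean goes to "below", the rest to "above"
below = genre_toy[genre_toy["averageRating"] <= avg]
above = genre_toy[genre_toy["averageRating"] > avg]

print(avg, len(below), len(above))
```

Unlike the manual sum, `mean` skips missing ratings by default, so the two approaches can differ if a rating is NaN.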

In [53]:
print(f"{genre_split_avg['AAction'].iloc[:1]}\n\n\n{genre_split_avg['BAction'].iloc[:1]}")

     show_id   type          title director  \
1748   s6968  Movie  Hide and Seek  Liu Jie   

                                                   cast country  \
1748  Wallace Huo, Qin Hailu, Regina Wan, Jessie Li,...   China   

           date_added  release_year rating duration  ...   primaryTitle  \
1748  August 19, 2017          2016  TV-14  104 min  ...  Hide and Seek   

      originalTitle isAdult startYear endYear runtimeMinutes  \
1748  Hide and Seek       0      1964      \N             90   

                  genres     tconst averageRating numVotes  
1748  Action,Crime,Drama  tt0254646           7.4     16.0  

[1 rows x 24 columns]


    show_id   type     title       director  \
413   s1334  Movie  War Dogs  Todd Phillips   

                                                  cast  \
413  Jonah Hill, Miles Teller, Ana de Armas, Kevin ...   

                              country        date_added  release_year rating  \
413  United States, Cambodia, Romania  February 8, 