In [1]:
import pandas as pd

IMDB (https://datasets.imdbws.com/) makes its data available for use, but it requires some preprocessing. For starters, the movie titles and the ratings aren't in the same dataset, so we will combine them into one pandas dataframe.

In [2]:
imdb_basics = pd.read_csv('data_imdb_basics.tsv', sep='\t')
imdb_basics.shape

(473433, 9)

In [3]:
imdb_ratings = pd.read_csv('data_imdb_ratings.tsv', sep='\t')
imdb_ratings.shape

(1280237, 3)
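
IMDB's TSV files use the literal string \N to mark a missing value, so a column such as endYear is read as text unless pandas is told about the placeholder. A minimal sketch of that option, using a made-up two-row sample in place of the real files (the rows and IDs below are illustrative only):

```python
import io
import pandas as pd

# Toy stand-in for an IMDB TSV file; these rows are invented for illustration.
sample_tsv = (
    "tconst\tprimaryTitle\tstartYear\tendYear\n"
    "tt9000001\tExample Movie\t1999\t\\N\n"
    "tt9000002\tExample Series\t2005\t2007\n"
)

# na_values=r"\N" maps the placeholder to NaN, so year columns parse as numbers.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values=r"\N")
print(sample.dtypes)
```

The same `na_values` argument could be passed to the `read_csv` calls above, though the printed outputs in this notebook were produced without it.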

In [4]:
netflix = pd.read_csv('netflix_titles.csv')
netflix.shape

(8807, 12)

In [5]:
print(f"NETFLIX:\n{netflix.iloc[:1]}\n\n\nIMDB BASICS:\n{imdb_basics.iloc[:5]}\n\n\nIMDB RATINGS:\n{imdb_ratings.iloc[:5]}")

NETFLIX:
  show_id   type                 title         director cast        country  \
0      s1  Movie  Dick Johnson Is Dead  Kirsten Johnson  NaN  United States   

           date_added  release_year rating duration      listed_in  \
0  September 25, 2021          2020  PG-13   90 min  Documentaries   

                                         description  
0  As her father nears the end of his life, filmm...  


IMDB BASICS:
      tconst titleType            primaryTitle           originalTitle  \
0  tt0000001     short              Carmencita              Carmencita   
1  tt0000002     short  Le clown et ses chiens  Le clown et ses chiens   
2  tt0000003     short          Pauvre Pierrot          Pauvre Pierrot   
3  tt0000004     short             Un bon bock             Un bon bock   
4  tt0000005     short        Blacksmith Scene        Blacksmith Scene   

   isAdult startYear endYear runtimeMinutes                    genres  
0        0      1894      \N              1      

In [6]:
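# Caution: with axis=1, concat pairs rows by index position, not by tconst.
# The two files differ in length (473433 vs 1280237 rows), so a title can end up beside another title's rating.
imdb_combined = pd.concat([imdb_basics, imdb_ratings], axis=1, join='inner')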
print(f"{imdb_combined.shape}")

print(f"\nIMDB COMBINED:\n{imdb_combined.iloc[:5]}")

(473433, 12)

IMDB COMBINED:
      tconst titleType            primaryTitle           originalTitle  \
0  tt0000001     short              Carmencita              Carmencita   
1  tt0000002     short  Le clown et ses chiens  Le clown et ses chiens   
2  tt0000003     short          Pauvre Pierrot          Pauvre Pierrot   
3  tt0000004     short             Un bon bock             Un bon bock   
4  tt0000005     short        Blacksmith Scene        Blacksmith Scene   

   isAdult startYear endYear runtimeMinutes                    genres  \
0        0      1894      \N              1         Documentary,Short   
1        0      1892      \N              5           Animation,Short   
2        0      1892      \N              4  Animation,Comedy,Romance   
3        0      1892      \N             12           Animation,Short   
4        0      1893      \N              1              Comedy,Short   

      tconst  averageRating  numVotes  
0  tt0000001            5.7      1952  
1  tt00
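
Because the basics and ratings files share the tconst identifier, a key-based join is one way to make sure each title is paired with its own rating. The sketch below contrasts the two approaches on small invented frames; the names `basics`, `ratings`, `by_position` and `by_key` are hypothetical and the values are not real IMDB data.

```python
import pandas as pd

# Toy frames standing in for imdb_basics and imdb_ratings (values are made up).
basics = pd.DataFrame({
    "tconst": ["tt9000001", "tt9000002", "tt9000003"],
    "primaryTitle": ["Alpha", "Beta", "Gamma"],
})
ratings = pd.DataFrame({
    "tconst": ["tt9000003", "tt9000001"],
    "averageRating": [7.5, 6.0],
})

# Positional concat: row 0 of basics sits beside row 0 of ratings, whatever the IDs say.
by_position = pd.concat([basics, ratings], axis=1, join="inner")

# Key-based merge: rows are paired only when the tconst values match.
by_key = pd.merge(basics, ratings, on="tconst", how="inner")

print(by_position)
print(by_key)
```

In the positional result, "Alpha" (tt9000001) is shown next to the rating that belongs to tt9000003. The merge keeps a single tconst column and drops titles that have no rating.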

The data is now in two sets: Netflix information and IMDB information. Let's create one dataset that holds the Netflix titles together with their IMDB information.

In [11]:
netflix_imdb_combined = pd.merge(netflix, imdb_combined, suffixes=['_netflix', '_imdb'], left_on='title', right_on='originalTitle')
print(f"{netflix_imdb_combined.shape}\n\nNETFLIX IMDB COMBINED:\n{netflix_imdb_combined.iloc[:1]}")

(5195, 24)

NETFLIX IMDB COMBINED:
  show_id     type          title       director  \
0      s6  TV Show  Midnight Mass  Mike Flanagan   

                                                cast country  \
0  Kate Siegel, Zach Gilford, Hamish Linklater, H...     NaN   

           date_added  release_year rating  duration  ...   primaryTitle  \
0  September 24, 2021          2021  TV-MA  1 Season  ...  Midnight Mass   

   originalTitle isAdult startYear endYear runtimeMinutes  genres     tconst  \
0  Midnight Mass       0      1999      \N             \N   Drama  tt0216854   

  averageRating numVotes  
0           5.3       11  

[1 rows x 24 columns]
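
Joining on the title alone can pair a Netflix entry with any IMDB entry that happens to share its name. One possible refinement, sketched here on invented data, is to match on the release year as well. This assumes IMDB's startYear has already been converted to a numeric type; the frame names and titles below are hypothetical.

```python
import pandas as pd

# Made-up examples: two different IMDB entries share the title "Star Voyage".
netflix_toy = pd.DataFrame({
    "title": ["Star Voyage", "Quiet Town"],
    "release_year": [2009, 2015],
})
imdb_toy = pd.DataFrame({
    "originalTitle": ["Star Voyage", "Star Voyage", "Quiet Town"],
    "startYear": [1973, 2009, 2015],
    "averageRating": [8.3, 7.9, 6.1],
})

# Title-only join: the 2009 film also matches the unrelated 1973 entry.
title_only = pd.merge(netflix_toy, imdb_toy, left_on="title", right_on="originalTitle")

# Title-and-year join: only entries from the same year are kept.
title_and_year = pd.merge(
    netflix_toy, imdb_toy,
    left_on=["title", "release_year"],
    right_on=["originalTitle", "startYear"],
)

print(len(title_only), len(title_and_year))
```

The stricter join trades recall for precision: titles whose Netflix and IMDB years disagree would be dropped, so it is a judgment call whether it suits this analysis.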


Now we have a combined dataframe of the titles available on Netflix and their IMDB information. Let's filter out the TV shows.

In [12]:
netflix_imdb_combined_no_tv = netflix_imdb_combined[(netflix_imdb_combined['type'] == 'Movie')]
print(f"{netflix_imdb_combined_no_tv.shape}\n\nCOMBINED NO TV:\n{netflix_imdb_combined_no_tv.iloc[:1]}")

(3729, 24)

COMBINED NO TV:
  show_id   type    title      director  \
2      s8  Movie  Sankofa  Haile Gerima   

                                                cast  \
2  Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...   

                                             country          date_added  \
2  United States, Ghana, Burkina Faso, United Kin...  September 24, 2021   

   release_year rating duration  ... primaryTitle originalTitle isAdult  \
2          1993  TV-MA  125 min  ...      Sankofa       Sankofa       0   

  startYear endYear runtimeMinutes  genres     tconst averageRating numVotes  
2      1993      \N            125   Drama  tt0150986           5.7       51  

[1 rows x 24 columns]


This has removed about 1,500 titles from our dataframe (5,195 down to 3,729); hopefully n=3729 is sufficient for our data mining. Let's start by creating our training/test split (85/15).

In [13]:
from sklearn.model_selection import train_test_split

In [18]:
train, test = train_test_split(netflix_imdb_combined_no_tv, test_size=0.15)

print(f"TRAINING:\t{train.shape}\n{train.iloc[:1]}\n\n\nTESTING:\t{test.shape}\n{test.iloc[:1]}")

TRAINING:	(3169, 24)
    show_id   type      title     director  \
395    s595  Movie  Star Trek  J.J. Abrams   

                                                  cast  \
395  Chris Pine, Zachary Quinto, Karl Urban, Zoe Sa...   

                    country    date_added  release_year rating duration  ...  \
395  United States, Germany  July 1, 2021          2009  PG-13  128 min  ...   

                       primaryTitle originalTitle isAdult startYear endYear  \
395  Star Trek: The Animated Series     Star Trek       0      1973    1975   

    runtimeMinutes                      genres     tconst averageRating  \
395             30  Action,Adventure,Animation  tt0094423           8.3   

    numVotes  
395       42  

[1 rows x 24 columns]


TESTING:	(560, 24)
     show_id   type       title        director  \
4784   s8228  Movie  The Bridge  Kunle Afolayan   

                                                   cast  country  \
4784  Chidinma Ekile, Ademola Adedoyin, Kunle Afolay.

After our split, we have 3169 items to train our model with and 560 to test it.
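
The two counts can be checked by hand. The sketch below assumes scikit-learn rounds the test share up and gives the remainder to training, which is consistent with the shapes printed above; the variable names are illustrative.

```python
import math

n_rows = 3729      # movies left after dropping TV shows
test_size = 0.15   # fraction passed to train_test_split

n_test = math.ceil(n_rows * test_size)  # test share, rounded up
n_train = n_rows - n_test               # everything else goes to training

print(n_train, n_test)  # 3169 560
```

Since the split is shuffled, the rows landing in each set change between runs; passing a fixed `random_state` to `train_test_split` would make the split repeatable.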