<a href="https://colab.research.google.com/github/Marcelo0479/machinelearn/blob/main/Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To use this recommendation system you must download the following datasets:

[Netflix](https://www.kaggle.com/shivamb/netflix-shows/download),
[Prime video](https://www.kaggle.com/shivamb/amazon-prime-movies-and-tv-shows/download),
[Disney +](https://www.kaggle.com/shivamb/disney-movies-and-tv-shows/download),
[Hulu](https://www.kaggle.com/shivamb/hulu-movies-and-tv-shows/download)

In this recommendation system we are using the machine learning tecnich of NLP

# Recommendation system for a specific streaming service

First of all we need to import the libraries and the choosed database, then we need to treat these datas to be more apropriated to the algorithm.

In [70]:
import pandas as pd
import numpy as np

In [71]:
# Creating a function to choose the streaming service database
def database(csv_file):
  return pd.read_csv(csv_file)

In [72]:
df = database('hulu_titles.csv')

In [73]:
# Checking if theres is null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3073 entries, 0 to 3072
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   show_id       3073 non-null   object 
 1   type          3073 non-null   object 
 2   title         3073 non-null   object 
 3   director      3 non-null      object 
 4   cast          0 non-null      float64
 5   country       1620 non-null   object 
 6   date_added    3045 non-null   object 
 7   release_year  3073 non-null   int64  
 8   rating        2553 non-null   object 
 9   duration      2594 non-null   object 
 10  listed_in     3073 non-null   object 
 11  description   3069 non-null   object 
dtypes: float64(1), int64(1), object(10)
memory usage: 288.2+ KB


For NLP based prediction, natural language processing, the most critical datas in this database is in the rating, listed_in and description columns. However, we are going to do some treatment for the other columns as well to make them look better.

In [74]:
# Handling missing values
for c in df.columns:
  if c in ['rating', 'listed_in']:
    df[c].fillna(df.rating.mode()[0], inplace=True)
  elif c == 'description':
    df.drop(df[df[c].isnull()].index, inplace=True)
  elif df[c].isnull().sum() > 0:
    df[c].fillna('NoDataAvaliable', inplace=True)
df.isnull().sum()

show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64

For the purposes of this analysis, we need to pay more attention to the data in the description columns. obviously we can't have duplicate lines, but we also can't have descriptions that are too short and that match the title.

In [75]:
# Handling duplicate values
df.duplicated().sum()

0

In [76]:
df.drop_duplicates(inplace=True)

Descripiton column problems


In [77]:
# Description egual title
df[df.title == df.description]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
717,s718,Movie,UFC 262,NoDataAvaliable,NoDataAvaliable,NoDataAvaliable,"May 15, 2021",2021,TV-14,NoDataAvaliable,Sports,UFC 262
768,s769,Movie,UFC 261: Early Prelims and Prelims,NoDataAvaliable,NoDataAvaliable,NoDataAvaliable,"April 24, 2021",2021,TV-14,NoDataAvaliable,Sports,UFC 261: Early Prelims and Prelims
2146,s2147,Movie,NASA's Giant Leaps: Past and Future - Celebrat...,NoDataAvaliable,NoDataAvaliable,NoDataAvaliable,"July 19, 2019",2019,TV-14,NoDataAvaliable,"Documentaries, Science & Technology",NASA's Giant Leaps: Past and Future - Celebrat...


In [78]:
df.drop(df[df.title == df.description].index, inplace=True)

In [79]:
# Noise in description
smalls_descrip = []
for i in df.index:
  if len(df.description[i]) < 20:
    smalls_descrip.append(i)
df.loc[smalls_descrip]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
1596,s1597,Movie,Soccer,NoDataAvaliable,NoDataAvaliable,NoDataAvaliable,"June 12, 2020",2017,TV-14,NoDataAvaliable,Sports,Soccer games.


In [80]:
df.drop(df.loc[smalls_descrip].index, inplace=True)

In [81]:
# Resetting the index to avoid problems with algorithm results
df.reset_index(inplace=True)

In [82]:
# Creating some new columns to help the prediction algorithm.
ratings_ages = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}
df["ratings_ages"]= df["rating"].replace(ratings_ages)
df['gender_ratings_ages_and_description'] = df.listed_in + ', ' + df.ratings_ages + ', ' + df.description

In [83]:
# Importing the algorithms
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [84]:
# Applying the algorithms
tfidf = TfidfVectorizer(stop_words='english')

tfidf_matrix = tfidf.fit_transform(df['gender_ratings_ages_and_description'])

cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

index = pd.Series(df.index, index=df.title)

In [85]:
# Creating a function to calculate 10 recommendation based in a title
def recommendations(title, cosine_sim=cosine_sim):
  idx = index[title]

  sim_scores = list(enumerate(cosine_sim[idx]))
  sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse =True)
  sim_scores = sim_scores[1:11]
  movie_index = [i[0] for i in sim_scores]

  df_titles = df.title.iloc[movie_index]

  df_sim_scores = pd.DataFrame(sim_scores).set_index(0)
  df_sim_scores.rename(columns = {1 : 'sim_score'}, inplace=True)
  
  return pd.concat([df_titles, df_sim_scores], axis=1)

In [86]:
title_test = df.title[np.random.randint(0, len(df))]
title_test

'Tokyo Ghoul'

In [87]:
recommendations(title_test)

Unnamed: 0,title,sim_score
3016,Chrome Shelled Regios,0.118608
2797,In the Flesh,0.116681
1668,Digimon Frontier,0.098791
2976,Elfen Lied,0.097593
2761,Terraformars,0.097217
2106,BEM,0.096044
817,28 Days Later,0.094423
2887,Parasyte: The Maxim,0.090183
2591,Akame ga Kill!,0.085423
296,Solace,0.081027


# Recommendation system for all the streaming services.

For this porpose we need to join all the databases

In [88]:
df_netflix = pd.read_csv('netflix_titles.csv')
df_prime = pd.read_csv('amazon_prime_titles.csv')
df_disney = pd.read_csv('disney_plus_titles.csv')
df_hulu = pd.read_csv('hulu_titles.csv')

In [89]:
df_netflix['streaming'] = 'netflix'
df_prime['streaming'] = 'prime'
df_disney['streaming'] = 'disney +'
df_hulu['streaming'] = 'hulu'

In [90]:
df = pd.concat([df_netflix, df_prime, df_disney, df_hulu])

In [91]:
# Checking if theres is null values
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22998 entries, 0 to 3072
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       22998 non-null  object
 1   type          22998 non-null  object
 2   title         22998 non-null  object
 3   director      14739 non-null  object
 4   cast          17677 non-null  object
 5   country       11499 non-null  object
 6   date_added    13444 non-null  object
 7   release_year  22998 non-null  int64 
 8   rating        22134 non-null  object
 9   duration      22516 non-null  object
 10  listed_in     22998 non-null  object
 11  description   22994 non-null  object
 12  streaming     22998 non-null  object
dtypes: int64(1), object(12)
memory usage: 2.5+ MB


In [92]:
# handling missing values
for c in df.columns:
  if c in ['rating', 'listed_in']:
    df[c].fillna(df.rating.mode()[0], inplace=True)
  elif c == 'description':
    df.drop(df[df[c].isnull()].index, inplace=True)
  elif df[c].isnull().sum() > 0:
    df[c].fillna('NoDataAvaliable', inplace=True)
df.isnull().sum()

show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
streaming       0
dtype: int64

In [93]:
# Handling duplicate values
df.duplicated().sum()

0

In [94]:
df.drop_duplicates(inplace=True)

In [95]:
df.title.duplicated().sum()

881

In [96]:
df.drop_duplicates(subset='title', inplace=True)

In [97]:
# Description egual title
df[df.title == df.description]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,streaming
2563,s2564,TV Show,Elfen Lied,NoDataAvaliable,"Kira Vincent-Davis, Adam Conlon",NoDataAvaliable,NoDataAvaliable,2005,TV-NR,1 Season,Anime,Elfen Lied,prime
6416,s6417,Movie,Title before 1C onboarding - 5,NoDataAvaliable,"1, 2, 3",NoDataAvaliable,NoDataAvaliable,2021,18+,3 min,Drama,Title before 1C onboarding - 5,prime
6930,s6931,Movie,Title Post onboarding 8,1,1,NoDataAvaliable,NoDataAvaliable,2021,18+,61 min,Action,Title Post onboarding 8,prime
8501,s8502,Movie,Act 6 - Title 1,1,1,NoDataAvaliable,NoDataAvaliable,2021,ALL,61 min,Action,Act 6 - Title 1,prime
8502,s8503,Movie,Act 5 - Title 1,1,1,NoDataAvaliable,NoDataAvaliable,2021,ALL,61 min,Action,Act 5 - Title 1,prime
9001,s9002,TV Show,Roadkill Garage,NoDataAvaliable,NoDataAvaliable,NoDataAvaliable,NoDataAvaliable,2017,TV-NR,1 Season,"Sports, Unscripted",Roadkill Garage,prime
9557,s9558,Movie,Date Night: World Premiere,Shawn Levy,"Steve Carell, Tina Fey, Mark Wahlberg, Tara...",NoDataAvaliable,NoDataAvaliable,2010,NR,5 min,Comedy,Date Night: World Premiere,prime
9558,s9559,Movie,Date Night: Making a Scene,Shawn Levy,"Steve Carell, Tina Fey, Mark Wahlberg, Tara...",NoDataAvaliable,NoDataAvaliable,2010,NR,10 min,Comedy,Date Night: Making a Scene,prime
717,s718,Movie,UFC 262,NoDataAvaliable,NoDataAvaliable,NoDataAvaliable,"May 15, 2021",2021,TV-MA,NoDataAvaliable,Sports,UFC 262,hulu
768,s769,Movie,UFC 261: Early Prelims and Prelims,NoDataAvaliable,NoDataAvaliable,NoDataAvaliable,"April 24, 2021",2021,TV-MA,NoDataAvaliable,Sports,UFC 261: Early Prelims and Prelims,hulu


In [98]:
df.drop(df[df.title == df.description].index, inplace=True)

In [99]:
# Noise in description
smalls_descrip = []
for i in df.index:
  if len(df.description[i]) < 20:
    smalls_descrip.append(i)
df.loc[smalls_descrip]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,streaming
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,NoDataAvaliable,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",netflix
0,s1,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,"March 30, 2021",2014,TV-MA,113 min,"Comedy, Drama",A small fishing village must procure a local d...,prime
0,s1,Movie,Duck the Halls: A Mickey Mouse Christmas Special,"Alonso Ramirez Ramos, Dave Wasson","Chris Diamantopoulos, Tony Anselmo, Tress MacN...",NoDataAvaliable,"November 26, 2021",2016,TV-G,23 min,"Animation, Family",Join Mickey and the gang as they duck the halls!,disney +
0,s1,Movie,Ricky Velez: Here's Everything,NoDataAvaliable,NoDataAvaliable,NoDataAvaliable,"October 24, 2021",2021,TV-MA,NoDataAvaliable,"Comedy, Stand Up",​Comedian Ricky Velez bares it all with his ho...,hulu
1,s2,TV Show,Blood & Water,NoDataAvaliable,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",netflix
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3071,s3072,Movie,Bigfoot: The Conspiracy,Chris Simoes,"Chris Simoes, Dave Watkins, Betsy Mitchell, Jo...",NoDataAvaliable,NoDataAvaliable,2020,16+,77 min,"Action, Horror, Suspense",This film follows a retired Border Patrol agen...,prime
3071,s3072,TV Show,The Twilight Zone,NoDataAvaliable,NoDataAvaliable,United States,NoDataAvaliable,1959,TV-PG,5 Seasons,"Classics, Science Fiction, Thriller",Rod Serling's seminal anthology series focused...,hulu
3072,s3073,Movie,Riot,John Lyde,"Matthew Reese, Dolph Lundgren, Danielle Chuchr...",NoDataAvaliable,"January 1, 2020",2015,TV-MA,88 min,Action & Adventure,"Seeking vengeance for the murder of his wife, ...",netflix
3072,s3073,Movie,Big Sur,Michael Polish,"Jean-Marc Barr, Kate Bosworth, Josh Lucas, Rad...",NoDataAvaliable,NoDataAvaliable,2013,R,80 min,"Drama, Romance",A recounting of Jack Kerouac's three sojourns ...,prime


In [100]:
df.drop(df.loc[smalls_descrip].index, inplace=True)

In [101]:
# Development remains in the description
df[df.cast.apply(lambda x: 'Test' in x)]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,streaming


In [102]:
df_tests = df[(df.cast.apply(lambda x: 'Test' in x)) & (df.country == 'NoDataAvaliable')]
df_tests

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,streaming


In [103]:
df.drop(df_tests.index, inplace=True)

In [104]:
# Resetting the index to avoid problems with algorithm results
df.reset_index(inplace=True)

In [105]:
# Creating some new columns to help the prediction algorithm.
ratings_ages = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}
df["ratings_ages"]= df["rating"].replace(ratings_ages)
df['gender_ratings_ages_and_description'] = df.listed_in + ', ' + df.ratings_ages + ', ' + df.description

In [132]:
tfidf = TfidfVectorizer(stop_words='english')

tfidf_matrix = tfidf.fit_transform(df['gender_ratings_ages_and_description'])

cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

index = pd.Series(df.index, index=df.title)

In [133]:
def recommendations(title, cosine_sim=cosine_sim):
  idx = index[title]

  sim_scores = list(enumerate(cosine_sim[idx]))
  sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse =True)
  sim_scores = sim_scores[1:11]
  movie_index = [i[0] for i in sim_scores]

  df_titles_streamming = df[['title', 'streaming']].iloc[movie_index]

  df_sim_scores = pd.DataFrame(sim_scores).set_index(0)
  df_sim_scores.rename(columns = {1 : 'sim_score'}, inplace=True)
  
  return pd.concat([df_titles_streamming, df_sim_scores], axis=1)

In [146]:
title_test = df.title[np.random.randint(0, len(df))]
title_test

'What Happens in Vegas (Extended Edition)'

In [147]:
recommendations(title_test)

Unnamed: 0,title,streaming,sim_score
1018,What Happens in Vegas,prime,0.874151
934,The Wedding Trip,prime,0.129801
61,Sam Kinison: Live in Vegas,netflix,0.122722
216,Shark Night,netflix,0.116078
56,A Second Chance,netflix,0.111976
998,John Tucker Must Die,prime,0.098686
1016,White Night,prime,0.096693
994,Knight and Day,prime,0.09628
818,Two and a Half Men,prime,0.093694
1050,Knight and Day (Extended Edition),prime,0.092833


I will come back to this recommender system in the future to try to improve its results by using other NLP algorithms and changing their hyperparameters.