# Netflix Movies & TV Shows
In this notebook we will use machine learning to predict the content rating of a TV show or movie.

## Imports

In [1]:
%config IPCompleter.greedy=True
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

### CSS for markdown

In [2]:
%%html
<!--just some css for table in markdown-->
<style>
table {float:left;}
</style>

## Our features are:
<br>

| Feature       | Description                                                               |
|:--------------|:--------------------------------------------------------------------------|
| Type:         | Movie or TV Show?                                                         | 
| Title:        | Title of the movie or TV Show                                             |
| Director:     | Shows who directed the movie or TV Show                                   |
| Cast:         | Shows all actors that played in the movie or TV Show                      | 
| Country:      | Country that produced the movie or TV Show                                |
| Date_added:   | When it became available on Netflix                                       |
| Release_year: | Year of release                                                           |
| Rating:       | Motion picture content rating                                             |
| Duration:     | Duration of the movie in minutes or the TV Show in seasons                |
| Listed_in:    | Netflix categories it is listed under.                                    |
| Description:  | Short description of the movie or TV Show                                 |

<br style="clear:both" />

#### Next we'll check which columns we need to convert to numeric values and which columns we can drop:

- Actors usually have a preference in roles, so we can use the actor names in cast to predict the kind of movie or TV show and therefore its rating.
- listed_in contains the categories, which should be very helpful for predicting the rating of a movie or TV show.

#### We can drop the remaining columns: show_id, type, title, director, country, date_added, release_year, duration and description. We'll also fill the null values with "NA" strings, except in the rating column, as this is our target column. We will drop rows with null ratings instead:

In [3]:
contentDf = pd.read_csv("netflix_titles.csv")

#deleting unnecessary columns:
del contentDf["show_id"]
del contentDf["type"]
del contentDf["country"]
del contentDf["description"]
del contentDf["date_added"]
del contentDf["release_year"]
del contentDf["duration"]
del contentDf["director"]
del contentDf["title"]

#taking all rows where rating is not null:
contentDf= contentDf[contentDf["rating"].notna()]
#replacing all null values with a string value of "NA" instead
contentDf = contentDf.fillna("NA")
print(contentDf["rating"].value_counts())

TV-MA       2027
TV-14       1698
TV-PG        701
R            508
PG-13        286
NR           218
PG           184
TV-Y7        169
TV-G         149
TV-Y         143
TV-Y7-FV      95
G             37
UR             7
NC-17          2
Name: rating, dtype: int64


We have 14 different ratings, but 10 of them each account for less than 5% of the rows, so we don't have enough data to predict them accurately.
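As a quick check of the 5% cutoff, the shares can be computed directly from the counts printed above. This is a minimal sketch that re-enters those counts by hand; the names ratingCounts, shares and rareRatings are only illustrative and are not used elsewhere in the notebook.

```python
import pandas as pd

# Rating counts copied from the value_counts() output above
ratingCounts = pd.Series({
    "TV-MA": 2027, "TV-14": 1698, "TV-PG": 701, "R": 508,
    "PG-13": 286, "NR": 218, "PG": 184, "TV-Y7": 169,
    "TV-G": 149, "TV-Y": 143, "TV-Y7-FV": 95, "G": 37,
    "UR": 7, "NC-17": 2,
})

# Share of each rating as a percentage of all rows that have a rating
shares = ratingCounts / ratingCounts.sum() * 100

# Ratings that make up less than 5% of the rows
rareRatings = shares[shares < 5]
print(shares.round(2))
print("Ratings below 5%:", len(rareRatings))
```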

#### Therefore we will only keep the four most common ratings:

In [4]:
topRatings = ["TV-MA", "TV-14", "TV-PG", "R"]
contentDf = contentDf[contentDf["rating"].isin(topRatings)]
contentDf["rating"].value_counts()

TV-MA    2027
TV-14    1698
TV-PG     701
R         508
Name: rating, dtype: int64

#### We still have to convert all columns to numerical values. Let's assign a numerical value to each rating and add a binary column for each actor and genre:

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

#get all actors and genres as an array:
actors = contentDf["cast"].values
genres = contentDf["listed_in"].values

#raw strings so the backslash in the pattern is not treated as a string escape
vect = CountVectorizer(token_pattern=r"(?u)[a-zA-Z][a-zA-Z '\-&.]+")
vect2 = CountVectorizer(token_pattern=r"(?u)[a-zA-Z][a-zA-Z '\-&.]+")

#get vocabularies (All unique actors and all unique genres):
vect.fit(actors)
vect2.fit(genres)

#variables for readability
#get_feature_names() was removed in scikit-learn 1.2, get_feature_names_out() replaces it
actorNames = vect.get_feature_names_out()
actorValuesArray = vect.transform(actors).toarray()
genreNames = vect2.get_feature_names_out()
genreValuesArray = vect2.transform(genres).toarray()

#Creating numpy ndarrays for column names and values:
columnNames = np.hstack([actorNames,genreNames])
columnValues = np.hstack([actorValuesArray,genreValuesArray])

#making new dataframe with our new columns:
dataprepped = pd.DataFrame(columnValues, columns=columnNames)

contentDf = contentDf.reset_index()
dataprepped = dataprepped.reset_index()
dataprepped["rating"] = contentDf["rating"]
dataprepped = dataprepped.replace({"rating":{"TV-MA":0,"TV-14":1,"TV-PG":2,"R":3}})

#Let's check our shape:
print("Totals:")
print(dataprepped.shape)

Totals:
(4934, 23451)


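So we now have a total of 4934 rows and 23451 columns: one column per actor and per genre, plus the index column added by reset_index() and the rating column.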
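To see what the token pattern does, here is a small sketch on two made-up cast strings (the names are invented for illustration and do not come from the dataset). Because the pattern allows spaces inside a token, each full name becomes one column instead of being split into separate words. CountVectorizer also lowercases the text by default.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up cast strings, only for illustration
sampleCast = ["Ann Lee, Bob Stone", "Ann Lee, Carl J. West"]

# Same idea as the pattern used above: a letter followed by letters,
# spaces, apostrophes, hyphens, ampersands or dots
sampleVect = CountVectorizer(token_pattern=r"(?u)[a-zA-Z][a-zA-Z '\-&.]+")
sampleCounts = sampleVect.fit_transform(sampleCast).toarray()

# vocabulary_ maps each token to its column index; the columns are in
# alphabetical order, so sorting the keys gives the column order
sampleNames = sorted(sampleVect.vocabulary_)
print(sampleNames)
print(sampleCounts)
```

Each row of sampleCounts belongs to one cast string and each column to one name, which is the same layout as the actor and genre columns built above.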

## Logistic Regression
Our target is a category and we have labelled examples, so we will use logistic regression, a fast supervised learning algorithm for classification. The next step is to divide our dataset into a training set and a test set.

#### We will split our data: 70% for the training set and 30% for the test set.

In [6]:
featuredColNames = columnNames
predictedColName= ["rating"]

#x = dataframes with featured columns
#y = dataframe with correct result column
x = dataprepped[featuredColNames]
y = dataprepped[predictedColName]
testSize=0.3

xTrain, xTest, yTrain, yTest = train_test_split(x,y,test_size=testSize, random_state=26)
yTest = yTest.rating.values
yTrain = yTrain.rating.values
#Let's check if that worked correctly:
print("training set: {0:0.2f}%".format(len(xTrain)/len(dataprepped)*100))
print("test set: {0:0.2f}%".format(len(xTest)/len(dataprepped)*100))

training set: 69.98%
test set: 30.02%
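The percentages are not exactly 70% and 30% because 4934 rows cannot be divided into whole rows at that ratio. A small sketch of the arithmetic, assuming train_test_split rounds the test set size up to the next whole row and gives the remaining rows to the training set (rowCount and testFraction are illustrative names):

```python
import math

rowCount = 4934
testFraction = 0.3

# Assumption: the test set size is rounded up, the training set gets the rest
testRows = math.ceil(rowCount * testFraction)
trainRows = rowCount - testRows

print("training rows: {0} ({1:0.2f}%)".format(trainRows, trainRows / rowCount * 100))
print("test rows: {0} ({1:0.2f}%)".format(testRows, testRows / rowCount * 100))
```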


#### Great, that's correct. Now we have to check whether each rating has about the same share in the training set and the test set as in the original data:

In [7]:
totalRows = len(dataprepped.index)
totalTrain = len(yTrain)
totalTest = len(yTest)

#Original values
originalMA = len(dataprepped[dataprepped["rating"]==0])
original14 = len(dataprepped[dataprepped["rating"]==1])
originalPG = len(dataprepped[dataprepped["rating"]==2])
originalR = len(dataprepped[dataprepped["rating"]==3])

#Training values
trainingMA = len(yTrain[yTrain==0])
training14 = len(yTrain[yTrain==1])
trainingPG = len(yTrain[yTrain==2])
trainingR = len(yTrain[yTrain==3])

#Test values
testMA = len(yTest[yTest==0])
test14 = len(yTest[yTest==1])
testPG = len(yTest[yTest==2])
testR = len(yTest[yTest==3])

#Prints
print("Original TV-MA: {0} ({1:0.2f}%)".format(originalMA, originalMA / totalRows * 100))
print("Original TV-14: {0} ({1:0.2f}%)".format(original14, original14 / totalRows * 100))
print("Original TV-PG: {0} ({1:0.2f}%)".format(originalPG, originalPG / totalRows * 100))
print("Original R: {0} ({1:0.2f}%)".format(originalR, originalR / totalRows * 100))
print()
print("Training TV-MA: {0} ({1:0.2f}%)".format(trainingMA, trainingMA / totalTrain * 100))
print("Training TV-14: {0} ({1:0.2f}%)".format(training14, training14 / totalTrain * 100))
print("Training TV-PG: {0} ({1:0.2f}%)".format(trainingPG, trainingPG / totalTrain * 100))
print("Training R: {0} ({1:0.2f}%)".format(trainingR, trainingR / totalTrain * 100))
print()
print("Test TV-MA: {0} ({1:0.2f}%)".format(testMA, testMA / totalTest * 100))
print("Test TV-14: {0} ({1:0.2f}%)".format(test14, test14 / totalTest * 100))
print("Test TV-PG: {0} ({1:0.2f}%)".format(testPG, testPG / totalTest * 100))
print("Test R: {0} ({1:0.2f}%)".format(testR, testR / totalTest * 100))

Original TV-MA: 2027 (41.08%)
Original TV-14: 1698 (34.41%)
Original TV-PG: 701 (14.21%)
Original R: 508 (10.30%)

Training TV-MA: 1390 (40.25%)
Training TV-14: 1189 (34.43%)
Training TV-PG: 513 (14.86%)
Training R: 361 (10.45%)

Test TV-MA: 637 (43.01%)
Test TV-14: 509 (34.37%)
Test TV-PG: 188 (12.69%)
Test R: 147 (9.93%)
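The shares are close but not identical: TV-MA makes up 43.01% of the test set against 41.08% of the original data. If a closer match is wanted, train_test_split accepts a stratify argument that keeps the class shares nearly equal in both sets. The sketch below uses made-up labels with the same class counts as our data and a placeholder feature column. It is an optional variation, not what was run in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up labels with the same class counts as our four ratings
labels = np.repeat([0, 1, 2, 3], [2027, 1698, 701, 508])
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

# stratify=labels keeps the share of each class nearly equal in both sets
xTr, xTe, yTr, yTe = train_test_split(
    features, labels, test_size=0.3, random_state=26, stratify=labels)

# Percentage of each class in the training set and the test set
trainShares = pd.Series(yTr).value_counts(normalize=True).sort_index() * 100
testShares = pd.Series(yTe).value_counts(normalize=True).sort_index() * 100
print(pd.DataFrame({"train %": trainShares.round(2), "test %": testShares.round(2)}))
```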


#### Now that we have seen that everything is working correctly, we can move on to importing our algorithm and training it:

In [8]:
#Logistic regression 
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression(penalty= 'l2', C=1, solver='liblinear').fit(xTrain, yTrain)
print("Model created!")

Model created!


#### Now let's put our test data in and take a sneak peek at the predictions:

In [23]:
from sklearn.metrics import accuracy_score
result = logistic_model.predict(xTest)
predictedResults = pd.DataFrame({'yTest': yTest, 'yPrediction': result})
predictedResults.head(10)

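|   | yTest | yPrediction |
|:--|:------|:------------|
| 0 | 3     | 3           |
| 1 | 1     | 1           |
| 2 | 2     | 1           |
| 3 | 1     | 0           |
| 4 | 1     | 1           |
| 5 | 1     | 0           |
| 6 | 1     | 0           |
| 7 | 2     | 1           |
| 8 | 0     | 0           |
| 9 | 1     | 1           |

<br style="clear:both" />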


Looks okay. 
#### Let's calculate the percentage:

In [24]:
acc = accuracy_score(yTest, result)
print('accuracy score: {0:0.2f}%'.format(acc * 100))

accuracy score: 60.36%
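To put this number in context, it helps to compare it with a naive baseline that always predicts the most common rating. A minimal sketch using the test set counts printed earlier (the variable names are only illustrative):

```python
# Test set counts per rating, copied from the split check above
testCounts = {"TV-MA": 637, "TV-14": 509, "TV-PG": 188, "R": 147}

totalTestRows = sum(testCounts.values())
mostCommon = max(testCounts, key=testCounts.get)

# Accuracy of a model that always predicts the most common rating
baselineAcc = testCounts[mostCommon] / totalTestRows
print("baseline ({0}): {1:0.2f}%".format(mostCommon, baselineAcc * 100))
```

Always predicting TV-MA would already score about 43% on this test set, so the 60.36% above is a gain of roughly 17 percentage points over that baseline.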


#### That was logistic regression. <br> 60% is not too impressive though. Let's try Gaussian naive Bayes. First we need to create and train the model:

In [20]:
#train_test_split is already imported at the top, so we only need GaussianNB here
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB().fit(xTrain,yTrain)
print("Model ready for use!")

Model ready for use!


#### Success. Now we can input the data and check the accuracy:

In [26]:
result = gnb.predict(xTest)

predictedResults = pd.DataFrame({'yTest': yTest, 'yPrediction': result})
predictedResults.head(10)


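|   | yTest | yPrediction |
|:--|:------|:------------|
| 0 | 3     | 3           |
| 1 | 1     | 1           |
| 2 | 2     | 1           |
| 3 | 1     | 2           |
| 4 | 1     | 1           |
| 5 | 1     | 2           |
| 6 | 1     | 2           |
| 7 | 2     | 1           |
| 8 | 0     | 2           |
| 9 | 1     | 1           |

<br style="clear:both" />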


That one looks worse already. 
#### Let's look at the percentage:

In [27]:
acc = accuracy_score(yTest, result)
print('accuracy score: {0:0.2f}%'.format(acc * 100))

accuracy score: 47.27%
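Accuracy alone does not show which ratings get confused with each other. A confusion matrix makes that visible. The sketch below uses only the ten preview rows shown above for both models, so it illustrates the technique rather than the full test results. The preview lists and variable names are re-entered by hand for this example.

```python
from sklearn.metrics import confusion_matrix

# The ten preview rows printed above
previewTrue = [3, 1, 2, 1, 1, 1, 1, 2, 0, 1]
previewLogReg = [3, 1, 1, 0, 1, 0, 0, 1, 0, 1]
previewGnb = [3, 1, 1, 2, 1, 2, 2, 1, 2, 1]

# Rows are the true ratings, columns the predicted ratings,
# in the order 0=TV-MA, 1=TV-14, 2=TV-PG, 3=R
ratingLabels = [0, 1, 2, 3]
cmLogReg = confusion_matrix(previewTrue, previewLogReg, labels=ratingLabels)
cmGnb = confusion_matrix(previewTrue, previewGnb, labels=ratingLabels)

print("Logistic regression:")
print(cmLogReg)
print("Gaussian naive Bayes:")
print(cmGnb)
```

On the full test set the same call would be confusion_matrix(yTest, result), run once after each model's predict step.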


With 60.36% accuracy against 47.27%, logistic regression seems to be the better option here.
#### Thanks for reading :)