# Netflix Shows
Introduction thing

## Data Description

Our dataset is a collection of information about Netflix shows as of 2019. Each observation contains a show's unique ID, its type (i.e., whether it is a TV show or a movie), title, director, cast, country (where it was produced), date added, release year, TV rating, duration, genre, and description.

The dataset, 'Netflix Movies and TV Shows', was collected from Flixable, a third-party Netflix search engine. Flixable was created in 2018 by Ville Salminen and offers advanced search functionality that is missing from Netflix's own search engine. The data was extracted from the Flixable database through API calls.

Currently, Netflix does not make its API publicly available, and Flixable has not disclosed how it acquired the data for its database. Despite the site's popularity, we cannot confirm whether the dataset extracted from Flixable is reliable. Moreover, since the dataset was last updated in November 2019, the conclusions made in this case study may not be representative of the Netflix catalog as of September 2020.

## Exploratory Data Analysis

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from rule_miner import RuleMiner
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Load the dataset.

In [None]:
netflix_df = pd.read_csv("./netflix_titles.csv")
netflix_df.head()

In [None]:
netflix_df['director'].unique()

Should duplicates be removed here? Not sure.

#### Checking for `NaN`s

In [None]:
netflix_df.isnull().any()

We see here that the `description` and `listed_in` variables do not have `NaN`/`null` values. Hence, there is no need to drop rows (for this question, at least) or to impute values.
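A per-column count can be more informative than a yes/no flag. The sketch below uses a small made-up `DataFrame` (`toy_df` and its values are illustrative and not taken from the dataset) to show how `isnull().sum()` reports the number of missing values in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for netflix_df
toy_df = pd.DataFrame({
    "director": ["A. Director", np.nan, np.nan],
    "listed_in": ["Dramas", "Comedies", "Dramas, Comedies"],
    "description": ["First synopsis", "Second synopsis", "Third synopsis"],
})

# Number of missing values per column
null_counts = toy_df.isnull().sum()
print(null_counts)

# Names of the columns with at least one missing value
columns_with_nulls = null_counts[null_counts > 0].index.tolist()
print(columns_with_nulls)
```

Columns that do not show up in `columns_with_nulls` can be used without dropping rows or imputing.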

#### Checking for duplicates

The `description` variable is a good feature for detecting duplicates, since each show's synopsis is expected to be unique.

In [None]:
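# Count how often each description occurs and keep those that occur more than once.
# rename_axis puts the description text in a column named "index" regardless of pandas version.
netflix_desc_duplicate = netflix_df['description'].value_counts().rename_axis("index").reset_index(name="description").query("description > 1")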
print(netflix_desc_duplicate)
print("Repeated descriptions:", len(netflix_desc_duplicate))
print("Repeated shows:", netflix_desc_duplicate['description'].sum())

There are 7 repeated descriptions, and a total of 15 repeated shows based on our assumption that observations with the same description are the same show. To further confirm this, we can look at the other variables of the observations corresponding to these descriptions.

In [None]:
netflix_df.loc[netflix_df["description"].isin(netflix_desc_duplicate["index"])]

Closely examining the observations reveals that they differ only in the `date_added` and `title` variables. The date a show was added offers little insight, since it says nothing about whether two entries are the same show (i.e., different versions of a show can be added at different times, and vice versa). The titles, however, do describe the differences, at least in the results above.

The titles show that there are multiple versions due to language. The show *The Incredibles 2*, for instance, has both the original and the Spanish version. Meanwhile, the movie *Oh! Baby* has two other versions: Malayalam and Tamil. The only exception is the movie *Sarkar*, whose titles (there are 2 observations for this movie) do not indicate versions. This may be a data collection error, or the title may simply not show which version it is.

Nonetheless, all the duplicates will be cleaned in the same way. The "original" version (i.e., the one without a parenthetical stating the version or, if not applicable, the first entry that appears in the dataset, for simplicity) will be retained.

This step is important because duplicates affect the weight of words during feature extraction. A duplicated show would be over-represented, which would produce results that are not representative of the show's listed genre.

Because there are only a few duplicates, we can manually remove the offending observations via `show_id`.

In [None]:
show_id_remove_list = [81186758, 81186757, 81072516, 81046962, 81059388, 81151877, 81074135, 81091424]

netflix_df.loc[(netflix_df["description"].isin(netflix_desc_duplicate["index"])) & (~netflix_df["show_id"].isin(show_id_remove_list))]

The above shows the observations that will be retained from all the duplicates.

In [None]:
# Take a copy so that columns added later do not trigger SettingWithCopyWarning
netflix_df_clean = netflix_df[~netflix_df["show_id"].isin(show_id_remove_list)].copy()
netflix_df_clean
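Listing IDs by hand does not scale to a larger number of duplicates. As a rough alternative, and assuming it is acceptable to keep whichever entry appears first for each description (which is not always the "original" version), `drop_duplicates` can do a similar job. The rows in `toy_df` below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two entries share the same synopsis
toy_df = pd.DataFrame({
    "show_id": [1, 2, 3, 4],
    "title": ["Show A", "Show A (Spanish Version)", "Show B", "Show C"],
    "description": ["Synopsis A", "Synopsis A", "Synopsis B", "Synopsis C"],
})

# Flag every row whose description occurs more than once
is_repeated = toy_df.duplicated(subset="description", keep=False)
print(toy_df[is_repeated])

# Keep only the first row for each description
toy_df_clean = toy_df.drop_duplicates(subset="description", keep="first")
print(toy_df_clean)
```

Inspecting the rows flagged by `is_repeated` before dropping anything is still worthwhile, since the first entry may be a translated version.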

## Exploratory questions:

#### 1.) What genres appear most frequently among Netflix shows?

For our first exploratory question, we decided to determine which genres are the most common among Netflix shows. First, we take the genres column and isolate the genres per Netflix show, because a single show can have multiple genres associated with it.

In [None]:
listed_in_series = netflix_df_clean['listed_in']
genre_matrix = []

for string in listed_in_series:
    split_str = string.split(', ')
    genre_matrix.append(split_str)

genre_df = pd.DataFrame(genre_matrix)
genre_list = pd.concat([genre_df[0], genre_df[1], genre_df[2]])
genre_list.dropna(inplace = True)
genre_df
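The same isolation step can also be sketched with pandas string methods, which avoids assuming a fixed maximum number of genres per show. The strings in `listed_in` below are made up for illustration.

```python
import pandas as pd

# Hypothetical genre strings in the same format as the listed_in column
listed_in = pd.Series([
    "Dramas, International Movies",
    "Dramas",
    "Dramas, Comedies, International Movies",
])

# Split each string into a list, then give every genre its own row
genres = listed_in.str.split(", ").explode()

# Count how many shows each genre appears in
genre_counts = genres.value_counts()
print(genre_counts)
```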


Now that the genres have been isolated, we can use a bar plot to visualize the number of times each genre appears among Netflix shows.

In [None]:
genre_count = genre_list.value_counts()

genre_count.plot.bar()
plt.xlabel('Genre')
plt.ylabel('Count')
plt.title('Bar plot of genre count in Netflix shows')

Here we can see the ranking of the genres based on their frequency among Netflix shows. 'International Movies' appears most frequently, with almost 2,000 appearances in the dataset.

#### 2.) What is the trend in the number of International Movies per year?

As we can see from the bar plot of genre frequency, 'International Movies' is the most frequent genre in the dataset. Knowing this, the second exploratory question focuses on the trend of 'International Movies' by release year. First, we need to isolate the shows listed as 'International Movies'. No additional data cleaning is needed, since the dataset has no null values for `listed_in` or `release_year`.

In [None]:
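# Flag each show that lists 'International Movies' among its genres.
# The flag is stored on netflix_df_clean itself because later cells read it from there.
netflix_df_clean = netflix_df_clean.copy()
netflix_df_clean["isInternationalMV"] = netflix_df_clean["listed_in"].str.contains("International Movies", regex=False)

# Keep only the flagged shows
netflix_df_international = netflix_df_clean[netflix_df_clean["isInternationalMV"]]
netflix_df_international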

Now that we have a dataframe containing only the shows listed as 'International Movies', we can use a histogram to visualize the trend of these shows in the Netflix dataset.

In [None]:
netflix_df_international['release_year'].hist(bins=60)
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.title('Histogram of International Movies per year')

Based on the histogram above, we can see that within the Netflix catalog, a large number of the shows listed as 'International Movies' were released in the latter half of the 2010s.

## Research Question

## Data Modelling

### What are the most common association rules among Netflix genres?

The `listed_in` column in the dataset refers to what genres the shows are listed in on the streaming platform. The values in this column are represented as strings where each string can contain more than one genre, and each genre is separated by commas. To parse these genres we need to create a matrix that is a list of genre lists, where each row represents the genres of one show. We can then use this matrix to create a dataframe that we can manipulate.

In [None]:
listed_in_series = netflix_df_clean['listed_in']
genre_matrix = []

for string in listed_in_series:
    split_str = string.split(', ')
    genre_matrix.append(split_str)

genre_df = pd.DataFrame(genre_matrix)
genre_df

We also need a value dictionary to assign a number to each genre.

In [None]:
values = genre_df.values.ravel()
values = [value for value in pd.unique(values) if not pd.isnull(value)]

value_dict = {}
for i, value in enumerate(values):
    value_dict[value] = i
value_dict

After generating the dictionary, we can create a dataframe where each row is a Netflix show and each column is a genre. The value in a column is `1` if the show is listed in that genre and `0` if it is not.

In [None]:
genre_df = genre_df.stack().map(value_dict).unstack()

baskets = []
for i in range(genre_df.shape[0]):
    basket = np.sort([int(x) for x in genre_df.iloc[i].values.tolist() if str(x) != 'nan'])
    baskets.append(basket)

shows_df = pd.DataFrame([[0 for _ in range(len(value_dict))] for _ in range(len(baskets))], columns=values)

for i, basket in enumerate(baskets):
    shows_df.iloc[i, basket] = 1
shows_df
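pandas also offers a shortcut for building the same kind of indicator table directly from the comma-separated strings. The sketch below uses made-up genre strings; note that `str.get_dummies` orders the columns alphabetically, so the column order may differ from the table built above.

```python
import pandas as pd

# Hypothetical genre strings in the same format as the listed_in column
listed_in = pd.Series([
    "Dramas, International Movies",
    "Comedies",
    "Dramas, Comedies",
])

# One column per genre: 1 if the show is listed in it, 0 otherwise
toy_shows_df = listed_in.str.get_dummies(sep=", ")
print(toy_shows_df)
```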

After preparing the dataframe, we can finally use a Rule Miner to find the common genre association rules of Netflix shows. We first initialize the Rule Miner with a **support threshold** of `100` and a **confidence** threshold of `60%`.

In [None]:
rule_miner = RuleMiner(100, 0.6)
frequent_itemsets = rule_miner.get_frequent_itemsets(shows_df)
assoc_rules = rule_miner.get_association_rules(shows_df)
assoc_rules
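`RuleMiner` comes from a local module whose internals are not shown here, so the following does not reproduce it. It is only a sketch of what the two thresholds mean, assuming that support is the raw number of shows containing an itemset and that the confidence of a rule X → Y is support(X ∪ Y) / support(X). The baskets and both helper functions are hypothetical.

```python
# Hypothetical baskets: each set holds the genres of one show
baskets = [
    {"Dramas", "International Movies"},
    {"Dramas", "International Movies"},
    {"Dramas", "Comedies"},
    {"Comedies"},
    {"International Movies"},
]

def support(itemset, baskets):
    """Count the baskets that contain every item in the itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(antecedent, consequent, baskets):
    """Share of baskets with the antecedent that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

pair_support = support({"Dramas", "International Movies"}, baskets)
rule_confidence = confidence({"Dramas"}, {"International Movies"}, baskets)
print(pair_support)
print(rule_confidence)
```

Under these assumptions, a rule is reported only if its itemset appears in at least 100 shows and its confidence is at least 0.6.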

From the result of the rule miner's `get_association_rules` function, we can see that the most common genres that appear together are:
 - 
 - 
 - 

### Are Netflix descriptions effective classifiers for the International Movies genre?

The `description` variable in the dataset refers to Netflix's synopsis of each show. For this research question, the idea is to determine whether these synopses can effectively predict whether a show is classified under *International Movies* or not.

The *International Movies* genre is chosen because 1,922 shows are classified under it, which makes it the most prominent genre. Choosing a prominent genre is important because a good number of samples is needed; this is especially true for classification, which requires good representatives for each category. In this case, the categories are "International Movies" and "Not International Movies".

- Get not LGBT shows same size as LGBT
- Get features (tfidf, or whatever)
- Get top features (chi2)
- Then repeat the sampling

For this classification, two sets of data are needed: data containing only shows with *International Movies* as one of their genres (regardless of whether they are also classified under other genres), and data containing shows which are not classified under *International Movies*.

The exploratory data analysis already produced a `DataFrame` for the first set. What we still need is a `DataFrame` for shows that aren't classified under *International Movies*. To build it, we can use the `isInternationalMV` variable from `netflix_df_clean`, where a `False` value indicates that the show is not under said classification.

In [None]:
# Select the shows whose isInternationalMV flag is False
netflix_df_non_international = netflix_df_clean[~netflix_df_clean["isInternationalMV"]]
netflix_df_non_international

The next step is to concatenate both `DataFrame`s to prepare the data for classification.

In [None]:
netflix_df_classifier = pd.concat([netflix_df_non_international, netflix_df_international])
netflix_df_classifier

In [None]:
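# Learn the vocabulary and IDF weights from the descriptions, then vectorize them
vectorizer = TfidfVectorizer()
vectorizer.fit(netflix_df_classifier["description"])
vector = vectorizer.transform(netflix_df_classifier["description"])

# get_feature_names() was removed in newer scikit-learn releases;
# get_feature_names_out() is its replacement (scikit-learn 1.0 and later)
features = list(vectorizer.get_feature_names_out())

# toarray() returns a plain ndarray; todense() returns np.matrix, which newer scikit-learn rejects
xdata = vector.toarray()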

In [None]:
linearSVC=svm.LinearSVC()

netflix_reset = netflix_df_classifier.reset_index()

# Turn the boolean flag into a text label for every show
ydata = [
    "International Movie" if is_international else "Not International Movie"
    for is_international in netflix_reset["isInternationalMV"]
]

ydata

skf=StratifiedKFold(n_splits=5)

for train_index, test_index in skf.split(xdata, ydata):
    x_train=xdata[train_index]
    x_test=xdata[test_index]
    y_train=np.array(ydata)[train_index]
    y_test=np.array(ydata)[test_index]
    
    linearSVC = linearSVC.fit(x_train,y_train)

    y_pred = linearSVC.predict(x_test)

    confusion=confusion_matrix(y_test, y_pred, labels=['Not International Movie', 'International Movie'])
    print('CONFUSION MATRIX: \n', confusion) #order: tn, fp, fn, tp

    print("Accuracy: ", accuracy_score(y_test, y_pred))

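    # Pass labels explicitly; otherwise target_names are matched to the sorted labels and end up swapped
    print(classification_report(y_test, y_pred, labels=['Not International Movie', 'International Movie'], target_names=['Not International Movie', 'International Movie']))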
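The figures in the classification report follow directly from the four cells of the confusion matrix. The sketch below uses made-up counts (not results from this dataset) and treats "International Movie" as the positive class, in the same `tn, fp, fn, tp` order as the printout above.

```python
# Hypothetical counts: tn, fp, fn, tp
tn, fp, fn, tp = 700, 100, 150, 250

# Share of all predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Of the shows predicted as International Movie, how many really are
precision = tp / (tp + fp)

# Of the actual International Movies, how many were found
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

When the two classes differ in size, precision and recall for the smaller class are usually more telling than accuracy alone.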

In [None]:
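# Work on a plain ndarray in case xdata is still a np.matrix
chi_x = np.asarray(xdata)

# Keep the 20 terms with the highest chi-squared statistic
selector = SelectKBest(chi2, k=20)
selector.fit(chi_x, ydata)

# Get idxs of columns to keep
idxs_selected = selector.get_support(indices=True)

# The fitted selector already stores the chi-squared scores and p-values
scores, pval = selector.scores_, selector.pvalues_

# Print each selected term with its score, followed by its p-value
for i in idxs_selected:
    print((features[i], scores[i]))
    print(pval[i])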

# pvalue_continue= True
# while pvalue_continue==True:
#     try:
#         pvalue_evaluation=input("what word")
#         pvalue_index=features.index(pvalue_evaluation)
#         print("SCORE:", scores[pvalue_index])
#         print ("PVALUE:", (pval[pvalue_index]))

#     except:
#         pass

## Insights and Conclusions