# The Rise of LGBT shows
Introduction thing

## Data Description

Our dataset is a collection of information about Netflix shows as of 2019. Each observation contains a show's unique ID, its type (TV show or movie), title, director, cast, country of production, date added to Netflix, release year, TV rating, duration, genre, and description.

The dataset 'Netflix Movies and TV Shows' was collected from Flixable, a third-party Netflix search engine. Flixable was created in 2018 by Ville Salminen and offers advanced search functionality that Netflix's built-in search lacks. The data was extracted from the Flixable database through API calls.

Currently, Netflix does not make its API publicly available, and Flixable has not disclosed how it acquires the data for its database. Despite the site's popularity, we cannot confirm whether the dataset extracted from it is reliable. Moreover, because the dataset was last updated in November 2019, the conclusions of this case study may not be representative of the Netflix catalog as of September 2020.

## Exploratory Data Analysis

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Load the dataset.

In [None]:
netflix_df = pd.read_csv("./netflix_titles.csv")
netflix_df.head()

In [None]:
netflix_df['director'].unique()

Should duplicates be removed here? Not sure.

#### Checking for `NaN`s

In [None]:
netflix_df.isnull().any()

We see here that the `description` and `listed_in` variables do not have `NaN`/`null` values. Hence, there is no need to drop rows (for this question, at least) or to impute values.
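As a small illustration, the check can be narrowed to the two columns this question relies on. The sketch below uses a made-up toy frame; only the column names mirror the dataset, and `text_cols` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "director": ["A", np.nan, "C"],
    "listed_in": ["Dramas", "Comedies", "Dramas, Thrillers"],
    "description": ["x", "y", "z"],
})

# Hypothetical helper list: the columns that matter for this question.
text_cols = ["description", "listed_in"]

# Count missing values per column instead of only flagging their presence.
null_counts = toy[text_cols].isnull().sum()
print(null_counts)
```

Counting with `sum()` rather than `any()` also shows how many rows would be affected if a column did contain missing values.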

#### Checking for duplicates

The `description` variable is a good feature for detecting duplicates because each show's synopsis is expected to be unique.

In [None]:
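# Count how often each description occurs and keep those that appear more than once.
# Columns: "index" holds the description text, "description" holds its count.
desc_counts = netflix_df['description'].value_counts()
netflix_desc_duplicate = desc_counts[desc_counts > 1].rename_axis("index").reset_index(name="description")
print(netflix_desc_duplicate)
print("Repeated descriptions:", len(netflix_desc_duplicate))
print("Repeated shows:", netflix_desc_duplicate['description'].sum())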

There are 7 repeated descriptions, and a total of 15 repeated shows based on our assumption that observations with the same description are the same show. To further confirm this, we can look at the other variables of the observations corresponding to these descriptions.

In [None]:
netflix_df.loc[netflix_df["description"].isin(netflix_desc_duplicate["index"])]

Closely examining the observations reveals that they differ only in the `date_added` and `title` variables. The date a show was added offers little insight, since it says nothing about how the entries relate to one another (i.e. different versions of a show can be added at different times, and vice versa). The titles, however, at least in the results above, do describe the differences.

The titles show that the multiple versions exist because of language. The show *The Incredibles 2*, for instance, has both the original and a Spanish version. Meanwhile, the movie *Oh! Baby* has two other versions, in Malayalam and Tamil. The only exception is the movie *Sarkar*, whose two observations have titles that do not indicate a version. This may be a data collection error, or the titles may simply not state which version each one is.

Nonetheless, all the duplicates will be cleaned in the same way. The "original" version is retained, that is, the one without a parenthetical stating the version or, where no such entry exists, the first entry that appears in the dataset, for simplicity.
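The retention rule above can also be expressed programmatically. The following is only a sketch on a made-up toy frame: the titles, IDs, and the `has_version_tag` helper column are hypothetical, and the regular expression assumes that version labels always appear as a trailing parenthetical.

```python
import pandas as pd

# Toy stand-in for the duplicated rows; titles and IDs are made up.
toy = pd.DataFrame({
    "show_id": [1, 2, 3, 4, 5],
    "title": ["Movie A (Spanish Version)", "Movie A", "Movie B", "Movie B", "Movie C"],
    "description": ["desc a", "desc a", "desc b", "desc b", "desc c"],
})

# Flag titles that end with a parenthetical such as "(Spanish Version)".
toy["has_version_tag"] = toy["title"].str.contains(r"\(.+\)\s*$", regex=True)

# A stable sort puts untagged titles first while preserving the original
# row order among ties, so keep="first" retains the "original" version,
# or the first entry when no title carries a version tag.
deduped = (
    toy.sort_values("has_version_tag", kind="mergesort")
       .drop_duplicates(subset="description", keep="first")
       .sort_index()
       .drop(columns="has_version_tag")
)
print(deduped)
```

With only a handful of duplicates, removing rows by hand is equally valid; a rule like this mainly helps if the dataset is refreshed and new duplicates appear.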

This step matters because duplicates distort the term weights computed during feature extraction. A show that appears several times is counted several times, which biases the results toward that show and makes them less representative of its listed genre.
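To make the effect concrete, the sketch below compares the inverse document frequency (IDF) that `TfidfVectorizer` assigns to a term before and after one synopsis is duplicated. The synopses are made up, and `idf_of` is a helper defined here for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up synopses, used only to illustrate the effect.
docs = [
    "a detective hunts a killer",
    "a chef opens a restaurant",
    "two friends travel abroad",
]

def idf_of(term, corpus):
    """Return the IDF weight TfidfVectorizer assigns to `term` in `corpus`."""
    vec = TfidfVectorizer().fit(corpus)
    return vec.idf_[vec.vocabulary_[term]]

# IDF of a term from the first synopsis when every synopsis is unique.
idf_unique = idf_of("detective", docs)

# Duplicate the first synopsis twice, as if the same show were listed three times.
corpus_with_duplicates = docs + [docs[0]] * 2
idf_duplicated = idf_of("detective", corpus_with_duplicates)

# IDF of a term from a different synopsis, before and after the duplication.
idf_other_unique = idf_of("chef", docs)
idf_other_duplicated = idf_of("chef", corpus_with_duplicates)

print(idf_unique, idf_duplicated)
print(idf_other_unique, idf_other_duplicated)
```

Terms from the duplicated synopsis look more common and receive a lower IDF, while terms from the other synopses receive a higher one, so a few repeated rows shift the weights of the whole vocabulary.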

Because there are only a few duplicates, we can manually remove the offending observations via `show_id`.

In [None]:
show_id_remove_list = [81186758, 81186757, 81072516, 81046962, 81059388, 81151877, 81074135, 81091424]

netflix_df.loc[(netflix_df["description"].isin(netflix_desc_duplicate["index"])) & (~netflix_df["show_id"].isin(show_id_remove_list))]

The above shows the observations that will be retained from all the duplicates.

In [None]:
netflix_df_clean = netflix_df[~netflix_df["show_id"].isin(show_id_remove_list)]
netflix_df_clean

Exploratory questions:

1. Most recurring genres in the dataset

In [None]:
# Split each comma-separated genre string and flatten the result into one genre per row.
# explode() handles any number of genres per show, so no column indices are hard-coded.
genre_list = netflix_df_clean['listed_in'].str.split(', ').explode()
genre_list


In [None]:
genre_count = genre_list.value_counts()

genre_count.plot.bar()
plt.xlabel('Genre')
plt.ylabel('Count')
plt.title('Bar plot of genre count in Netflix shows')

## Research Question

## Data Modelling

### What are the most common association rules among Netflix genres?

#### Subheader

### Are Netflix descriptions effective classifiers for the LGBTQ genre?

The `description` variable in the dataset refers to Netflix's synopsis of each show. DESCRIBE MORE

- Sample non-LGBT shows so that their number matches the number of LGBT shows
- Extract features (TF-IDF or similar)
- Select the top features (chi-squared)
- Then repeat the sampling several times
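The plan above could be sketched as follows. This is an illustration under stated assumptions, not the notebook's implementation: the toy descriptions are made up, the boolean `is_lgbtq` column and the `balanced_sample` helper are hypothetical (on the real data such a flag would have to be derived from `listed_in`), and the positive class is assumed to be the minority.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up data: 6 positive and 18 negative synopses.
toy = pd.DataFrame({
    "description": [f"a story about pride and identity {i}" for i in range(6)]
                 + [f"a thriller about a heist and a chase {i}" for i in range(18)],
    "is_lgbtq": [True] * 6 + [False] * 18,  # hypothetical label column
})

def balanced_sample(df, label_col, seed):
    """Keep all positive rows and draw an equally sized random set of negatives."""
    pos = df[df[label_col]]
    neg = df[~df[label_col]].sample(n=len(pos), random_state=seed)
    return pd.concat([pos, neg])

# Feature extraction and chi-squared selection live inside the pipeline, so both
# are refit on the training folds only during cross-validation.
model = make_pipeline(TfidfVectorizer(), SelectKBest(chi2, k=3), LinearSVC())

# Repeat the undersampling with different seeds and record the mean accuracy.
mean_accuracies = []
for seed in range(5):
    sample = balanced_sample(toy, "is_lgbtq", seed)
    scores = cross_val_score(model, sample["description"], sample["is_lgbtq"], cv=3)
    mean_accuracies.append(scores.mean())

print(mean_accuracies)
```

Repeating the draw shows how much the score depends on which non-LGBT shows happen to be sampled. Keeping the selector inside the pipeline avoids choosing features with knowledge of the test folds.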

In [None]:
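# Assumption: `isInternationalMV` is not created anywhere above, so if it is missing
# it is derived here from the "International Movies" genre in `listed_in`.
if "isInternationalMV" not in netflix_df_clean.columns:
    netflix_df_clean = netflix_df_clean.assign(
        isInternationalMV=netflix_df_clean["listed_in"].str.contains("International Movies")
    )

# Select the positive class from the cleaned frame
netflix_df_international = netflix_df_clean[netflix_df_clean["isInternationalMV"] == True]
netflix_df_international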

In [None]:
netflix_df_non_international = netflix_df_clean[netflix_df_clean["isInternationalMV"] == False]
netflix_df_non_international

In [None]:
netflix_df_classifier = pd.concat([netflix_df_non_international, netflix_df_international])
netflix_df_classifier

In [None]:
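vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights, then encode each description as a TF-IDF vector
vector = vectorizer.fit_transform(netflix_df_classifier["description"])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
features = list(vectorizer.get_feature_names_out())
# toarray() returns a plain ndarray, unlike the np.matrix produced by todense()
xdata = vector.toarray()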

In [None]:
# Build the class labels before training: one label per row of the classifier frame
labels = ['Not International Movie', 'International Movie']
ydata = np.where(netflix_df_classifier["isInternationalMV"].astype(bool), labels[1], labels[0])
ydata

In [None]:
linearSVC = svm.LinearSVC()
skf = StratifiedKFold(n_splits=5)

# Make sure the features are a plain ndarray before indexing and fitting
xdata = np.asarray(xdata)

for train_index, test_index in skf.split(xdata, ydata):
    x_train, x_test = xdata[train_index], xdata[test_index]
    y_train, y_test = ydata[train_index], ydata[test_index]

    linearSVC.fit(x_train, y_train)
    y_pred = linearSVC.predict(x_test)

    # Rows are true labels and columns are predictions, so the order is tn, fp, fn, tp
    confusion = confusion_matrix(y_test, y_pred, labels=labels)
    print('CONFUSION MATRIX: \n', confusion)

    print("Accuracy: ", accuracy_score(y_test, y_pred))

    # Pass `labels` explicitly so each name in the report matches its class
    print(classification_report(y_test, y_pred, labels=labels, target_names=labels))

In [None]:
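# chi2 expects a plain ndarray (or a sparse matrix), not np.matrix
chi_x = np.asarray(xdata)

# Keep the 20 terms most strongly associated with the class label
selector = SelectKBest(chi2, k=20)
selector.fit(chi_x, ydata)
# Get idxs of columns to keep
idxs_selected = selector.get_support(indices=True)

# SelectKBest stores the chi-squared statistics and p-values it computed
scores, pval = selector.scores_, selector.pvalues_

# Print each selected term with its chi-squared score, followed by its p-value
for i in idxs_selected:
    print((features[i], scores[i]))
    print(pval[i])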

# pvalue_continue= True
# while pvalue_continue==True:
#     try:
#         pvalue_evaluation=input("what word")
#         pvalue_index=features.index(pvalue_evaluation)
#         print("SCORE:", scores[pvalue_index])
#         print ("PVALUE:", (pval[pvalue_index]))

#     except:
#         pass

## Insights and Conclusions