# Netflix Shows
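This notebook explores a dataset of Netflix shows, then uses it to answer two research questions: which association rules are most common among Netflix genres, and whether show descriptions can classify a show as an *International Movie*.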

## Data Description

Our dataset is a collection of information about Netflix shows as of 2019. Each observation contains a show's unique ID, its type (i.e. whether it is a TV show or a movie), title, director, cast, country (where it was produced), the date it was added to Netflix, release year, TV rating, duration, genre, and description.

The dataset, 'Netflix Movies and TV Shows', was collected from Flixable, a third-party Netflix search engine. Flixable was created in 2018 by Ville Salminen and offers advanced search functionality that Netflix's built-in search lacks. The data was extracted from the Flixable database through API calls.

Currently, Netflix does not make its API publicly available, and Flixable has not disclosed how it acquired the data for its database. Despite the site's popularity, we cannot confirm that the dataset extracted from it is reliable. Moreover, because the dataset was last updated in November 2019, the conclusions of this case study may not be representative of the Netflix catalog as of September 2020.

## Exploratory Data Analysis

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from rule_miner import RuleMiner
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Load the dataset from the CSV file and preview the first few rows.

In [None]:
netflix_df = pd.read_csv("./netflix_titles.csv")
netflix_df.head()

In [None]:
netflix_df['director'].unique()

Duplicates are checked and removed below, after the check for missing values.

#### Checking for `NaN`s

In [None]:
netflix_df.isnull().any()

We see here that the `description` and `listed_in` variables do not have `NaN`/`null` values. Hence, there is no need to drop rows (for this question, at least) or to impute values.
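A per-column count is often more informative than a yes/no flag. The sketch below uses a small made-up frame (`toy_df` is hypothetical, not the Netflix data) to show how `isnull().sum()` reports how many values are missing in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Netflix data.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["X", np.nan, np.nan],
    "description": ["d1", "d2", "d3"],
})

# Number of missing values in each column.
missing_counts = toy_df.isnull().sum()
print(missing_counts)
```

Applied to `netflix_df`, the same call would show which columns need attention before they are used in a later question.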

#### Checking for duplicates

The `description` variable is a good feature for detecting duplicates, since each show's synopsis is expected to be unique.

In [None]:
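# Count each description and keep the repeated ones. The columns are named explicitly
# ("index" = text, "description" = count) so the result does not depend on the pandas version.
desc_counts = netflix_df['description'].value_counts()
netflix_desc_duplicate = desc_counts[desc_counts > 1].rename_axis("index").reset_index(name="description")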
print(netflix_desc_duplicate)
print("Repeated descriptions:", len(netflix_desc_duplicate))
print("Repeated shows:", netflix_desc_duplicate['description'].sum())

There are 7 repeated descriptions, and a total of 15 repeated shows based on our assumption that observations with the same description are the same show. To further confirm this, we can look at the other variables of the observations corresponding to these descriptions.

In [None]:
netflix_df.loc[netflix_df["description"].isin(netflix_desc_duplicate["index"])]

A close look at the observations reveals that they differ only in the `date_added` and `title` variables. The date a show was added gives little insight, as it says nothing about whether two entries are the same show (different versions of a show can be added at different times, and vice versa). The titles, however, at least in the results above, do describe the differences.

The titles show that the multiple versions exist because of language. The show *The Incredibles 2*, for instance, has both the original and a Spanish version. Meanwhile, the movie *Oh! Baby* has two other versions: Malayalam and Tamil. The only case that showed no difference was the movie *Sarkar*, whose two titles do not indicate versions. This may be either a data collection error or a title that simply does not state which version it is.

Nonetheless, all the duplicates will be cleaned in the same way. The "original" version (i.e. the one without a parenthetical stating the version or, where that does not apply, the first entry that appears in the dataset, for simplicity) will be retained.

This step is important because duplicates would affect the weight of words during feature extraction. A duplicated show would be over-represented, producing results that are biased toward it rather than representative of its listed genre.

Because there are only a few duplicates, we can manually remove the offending observations via `show_id`.

In [None]:
show_id_remove_list = [81186758, 81186757, 81072516, 81046962, 81059388, 81151877, 81074135, 81091424]

netflix_df.loc[(netflix_df["description"].isin(netflix_desc_duplicate["index"])) & (~netflix_df["show_id"].isin(show_id_remove_list))]

The above shows the observations that will be retained from all the duplicates.

In [None]:
netflix_df_clean = netflix_df[~netflix_df["show_id"].isin(show_id_remove_list)]
netflix_df_clean
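When the list of duplicates is too long to curate by hand, pandas can drop them automatically. The sketch below is an alternative, not what the cell above does: it keeps the first row for each description, which only approximates the "original version" rule because it depends on row order. `toy_df` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame: the first two rows share a description.
toy_df = pd.DataFrame({
    "show_id": [1, 2, 3, 4],
    "title": ["Movie", "Movie (Spanish Version)", "Other", "Third"],
    "description": ["same text", "same text", "unique text", "another text"],
})

# Keep only the first occurrence of each description.
deduped = toy_df.drop_duplicates(subset="description", keep="first")
print(deduped["show_id"].tolist())
```

The manual list remains the safer choice here, since it lets us decide case by case which version counts as the original.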

## Exploratory questions:

#### 1.) What genres appear most frequently among Netflix shows?

For our first exploratory question, we determine which genres are the most frequent among Netflix shows.

The genres of each Netflix show are stored in the `listed_in` column as a single string. Each string can contain more than one genre, separated by commas. To count the frequency of each genre, the `listed_in` column must be parsed into another table in which each row holds all the genres of one show and each cell holds a single genre.

First, we take the `listed_in` column and isolate the genres of each show by splitting the strings at every comma followed by a space (`, `).

In [None]:
listed_in_series = netflix_df_clean['listed_in']
genre_matrix = []

for string in listed_in_series:
    split_str = string.split(', ')
    genre_matrix.append(split_str)

genre_df = pd.DataFrame(genre_matrix)
genre_df

This results in a table with 6226 observations and 3 columns, which means that a Netflix show is listed in at most 3 genres. Knowing this, we can concatenate the columns into a single series and then drop the `None` values to clean it.

In [None]:
genre_list = pd.concat([genre_df[0], genre_df[1], genre_df[2]])
genre_list.dropna(inplace = True)
genre_list

Now that the genres have been isolated, we can use a bar plot to visualize how many times each genre appears among Netflix shows.

In [None]:
genre_count = genre_list.value_counts()

genre_count.plot.bar()
plt.xlabel('Genre')
plt.ylabel('Count')
plt.title('Bar plot of genre count in Netflix shows')
genre_count

The bar plot ranks the genres by how frequently they appear among Netflix shows. 'International Movies' appears most frequently, with 1,922 appearances in the dataset.
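The split, concatenate, and count steps above can also be written more compactly. The sketch below is an alternative on made-up data (`listed_in` here is a hypothetical stand-in for the real column), using `Series.explode` to give every genre its own row before counting.

```python
import pandas as pd

# Hypothetical stand-in for the `listed_in` column.
listed_in = pd.Series([
    "Dramas, International Movies",
    "Comedies",
    "Dramas, Comedies, International Movies",
])

# Split each string into a list, then give every genre its own row and count.
genre_counts = listed_in.str.split(", ").explode().value_counts()
print(genre_counts)
```

This avoids hard-coding the number of genre columns, which helps if a dataset allows more than 3 genres per show.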

#### 2.) What is the trend in the number of International Movies per year?

As the bar plot of genre frequency shows, 'International Movies' is the most frequent genre in the dataset. The second exploratory question therefore focuses on how 'International Movies' are distributed by release year. First, we need a dataframe that contains only the shows listed as 'International Movies'. No additional data cleaning is needed, since the dataset has no null values for `listed_in` or `release_year`.

In [None]:
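# Flag each show that lists 'International Movies' among its genres.
# The flag is stored on netflix_df_clean itself because a later section reuses it.
netflix_df_clean = netflix_df_clean.copy()
netflix_df_clean["isInternationalMV"] = netflix_df_clean["listed_in"].str.contains("International Movies", regex=False)
# Keep only the flagged shows.
netflix_df_international = netflix_df_clean[netflix_df_clean["isInternationalMV"]]
netflix_df_international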

Now that we have a dataframe that contains only the shows listed as 'International Movies', we can use a histogram to visualize their trend by release year in the Netflix dataset.

In [None]:
netflix_df_international['release_year'].hist(bins=60)
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.title('Histogram of International Movies per year')

Based on the histogram above, a large share of the shows listed as 'International Movies' in the Netflix catalog were released in the latter half of the 2010s.
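A histogram groups years into bins, so exact per-year counts can be easier to read off a table. The sketch below shows one way to get them, using made-up release years (`release_year` here is a hypothetical series, not the real column).

```python
import pandas as pd

# Hypothetical release years.
release_year = pd.Series([2016, 2017, 2017, 2018, 2018, 2018])

# Count shows per year and order the result chronologically.
per_year = release_year.value_counts().sort_index()
print(per_year)
```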

#### 3.) What countries produce the most 'International Movies'?

Since the first exploratory question showed that 'International Movies' is the most prominent genre in the Netflix dataset, the third exploratory question determines which countries produce these shows.

Since we already have a dataframe which contains all the shows listed as 'International Movies', we can use that dataframe for our analysis.

In [None]:
netflix_df_international['country']

As we can see, some shows in the `country` column were produced in more than one country, with the countries separated by commas. As such, we need to parse the `country` column into another table in which each row holds all the countries of one show and each cell holds a single country.

In [None]:
netflix_df_international_country = netflix_df_international.dropna(subset=['country'])
international_series = netflix_df_international_country['country']
international_matrix = []

for string in international_series:
    split_str = string.split(', ')
    international_matrix.append(split_str)

international_df = pd.DataFrame(international_matrix)
international_df

This results in a table with 1854 observations and 11 columns, which means that a show in our dataframe was produced in at most 11 countries. As such, we can concatenate the columns into a single series and then drop the `None` values to clean it.

In [None]:
# Concatenate every column, however many there are, instead of listing them by position.
country_list = pd.concat([international_df[col] for col in international_df.columns])
country_list.dropna(inplace = True)
country_list

Now that the countries have been separated, we can use a bar plot to visualize the distribution of shows listed as 'International Movies' by the country where they were produced.

In [None]:
country_count = country_list.value_counts()

country_count.plot.bar()
plt.xlabel('Country')
plt.ylabel('Count')
plt.title('Bar plot of distribution of International Movies based on country')
country_count

## Research Questions

### What are the most common association rules among Netflix genres?
Each show in the Netflix dataset has a `listed_in` column that records the genres the show is listed in on the streaming platform. A show is often listed in more than one genre, and some genre combinations are used frequently. [A chord diagram by Dr. Shahin Rostami (2020)](https://datacrayon.com/posts/statistics/data-is-beautiful/co-occurrence-of-movie-genres-with-chord-diagrams/#Conclusion) shows that some of the **most common genre combinations** of movies are as follows:
 - Romance and Comedy
 - Romance and Drama
 - Drama and Thriller
 - Drama and Action
 - Drama and Comedy
 - Crime and Thriller
 - Crime and Drama
 - Action and Adventure
 - Action and Thriller
 
Some genres in the Netflix shows dataset are not included in the chord diagram. Furthermore, some genres, such as 'International Movies', have a high frequency count in the dataset, which could increase the frequency of their genre combinations. Using **association rules**, we can find out whether the genre combinations above also apply to the Netflix dataset, or whether there are more rules associated with the most frequent genres. The only data preparation needed for the following steps is dropping duplicate shows: some observations in the dataset are the same show in different languages. These observations were already dropped during the earlier data cleaning, and the clean dataset will be used to generate the association rules.
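As a reminder of what an association rule measures, the sketch below computes support and confidence by hand on a few made-up genre baskets. It assumes the usual definitions, with support as a raw count (which matches the count-based thresholds used later); the `RuleMiner` class used in this notebook may differ in detail, and the helper names `support` and `confidence` are hypothetical.

```python
# Each basket is the set of genres of one made-up show.
baskets = [
    {"Dramas", "International Movies"},
    {"Dramas", "Independent Movies"},
    {"Dramas", "Independent Movies", "International Movies"},
    {"Comedies"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(antecedent, consequent, baskets):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

# Rule: ['Independent Movies'] -> ['Dramas']
rule_support = support({"Independent Movies", "Dramas"}, baskets)
rule_confidence = confidence({"Independent Movies"}, {"Dramas"}, baskets)
print(rule_support, rule_confidence)
```

A rule is kept only if its itemset meets the support threshold and its confidence meets the confidence threshold.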

#### Data Modelling
The values in the `listed_in` column are represented as strings where each string can contain more than one genre, and each genre is separated by commas. To parse these genres we need to create a matrix that is a list of genre lists, where each row represents the genres of one show. We can then use this matrix to create a dataframe that we can manipulate.

In [None]:
listed_in_series = netflix_df_clean['listed_in']
genre_matrix = []

for string in listed_in_series:
    split_str = string.split(', ')
    genre_matrix.append(split_str)

genre_df = pd.DataFrame(genre_matrix)
genre_df

We also need a value dictionary to assign a number to each genre.

In [None]:
values = genre_df.values.ravel()
values = [value for value in pd.unique(values) if not pd.isnull(value)]

value_dict = {}
for i, value in enumerate(values):
    value_dict[value] = i
value_dict

After generating the dictionary, we can create a dataframe where each row is a Netflix show and each column is a genre. The value for a column is `1` if the show is listed in that genre and `0` if it is not.

In [None]:
genre_df = genre_df.stack().map(value_dict).unstack()

baskets = []
for i in range(genre_df.shape[0]):
    basket = np.sort([int(x) for x in genre_df.iloc[i].values.tolist() if str(x) != 'nan'])
    baskets.append(basket)

shows_df = pd.DataFrame([[0 for _ in range(len(value_dict))] for _ in range(len(baskets))], columns=values)

for i, basket in enumerate(baskets):
    shows_df.iloc[i, basket] = 1
shows_df
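For comparison, pandas can build the same kind of 0/1 table directly from the comma-separated strings. The sketch below is an alternative on made-up data (`listed_in` is a hypothetical stand-in for the real column), not the method used above. Note that it orders the genre columns alphabetically rather than by first appearance.

```python
import pandas as pd

# Hypothetical stand-in for the `listed_in` column.
listed_in = pd.Series([
    "Dramas, International Movies",
    "Comedies",
    "Dramas, Comedies",
])

# One column per genre: 1 if the show is listed in it, 0 otherwise.
one_hot = listed_in.str.get_dummies(sep=", ")
print(one_hot)
```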

#### Rule Miner

After preparing the dataframe, we can finally use a **Rule Miner** to find the common genre association rules of Netflix shows. Different configurations of the Rule Miner produce different results, so several configurations will be tested on the dataset, each serving a different purpose.

The first configuration sets the support threshold to the **lowest occurrence** of a genre. This accommodates every genre in the dataset, since every genre appears at least that many times. To do this, we first need the total count of each genre and then the lowest of those counts. Since the total count of each genre was already generated in the Exploratory Data Analysis, all we need is the minimum value of that series.

In [None]:
print("Last 10 results of the series\n")
print(genre_count.tail(10),"\n")
print("Minimum no. of occurrences:", genre_count.min())

The minimum value of the total genre counts is `10`, which will be set as the support threshold.

The confidence level is then lowered, starting from `100%`, until at least 1 association rule is produced.

From this result, we initialize the first Rule Miner to have a **support threshold** of `10` and a **confidence** of `100%`.

In [None]:
rule_miner_lowest_occur = RuleMiner(10, 1)
assoc_rules_lowest_occur = rule_miner_lowest_occur.get_association_rules(shows_df)
assoc_rules_lowest_occur

From the result of the get_association_rules function of the rule miner, we can see that the genre association rules are:
 - ['Crime TV Shows', 'Korean TV Shows'] → ['International TV Shows']
 - ['International TV Shows', 'Science & Nature TV'] → ['Docuseries']
 - ['Science & Nature TV', 'International TV Shows'] → ['Docuseries']
 - ['Romantic TV Shows', 'Korean TV Shows'] → ['International TV Shows']
 - ['TV Dramas', 'Korean TV Shows'] → ['International TV Shows']
 - ['British TV Shows', 'Science & Nature TV'] → ['Docuseries']

The second configuration sets the support threshold to the **average occurrence** of all genres. This ensures that the genres included in the rules have occurred a sufficient number of times. We can reuse the total counts of each genre and take the mean of the series.

In [None]:
print("Average no. of occurrences:", genre_count.mean())

The average of the total genre counts is `324.93`, which we can round to `325` and set as the support threshold.

The confidence level is then lowered, starting from `100%`, until at least 1 association rule is produced.

From this result, we initialize this Rule Miner to have a **support threshold** of `325` and a **confidence** of `74%`.

In [None]:
rule_miner_mean_occur = RuleMiner(325, 0.74)
assoc_rules_mean_occur = rule_miner_mean_occur.get_association_rules(shows_df)
assoc_rules_mean_occur

From the result of the get_association_rules function of the rule miner, we can see that the only genre association rule produced is:
 - ['Independent Movies'] → ['Dramas']
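The procedure of lowering the confidence from `100%` until a rule appears can be sketched as a loop. The code below is only an illustration: `mine_rules` is a hypothetical brute-force miner written for this example, not the `RuleMiner` class used in this notebook, and the baskets are made up.

```python
from itertools import combinations

def mine_rules(baskets, min_support, min_confidence):
    """Brute-force miner for rules of the form antecedent -> single consequent."""
    items = sorted(set().union(*baskets))
    count = lambda s: sum(1 for b in baskets if s <= b)
    rules = []
    for size in (2, 3):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            if count(itemset) < min_support:
                continue  # itemset is not frequent enough
            for consequent in combo:
                antecedent = itemset - {consequent}
                conf = count(itemset) / count(antecedent)
                if conf >= min_confidence:
                    rules.append((sorted(antecedent), consequent, conf))
    return rules

# Made-up genre baskets, one per show.
baskets = [
    {"Dramas", "Independent Movies"},
    {"Dramas", "Independent Movies"},
    {"Dramas", "Independent Movies"},
    {"Independent Movies", "Comedies"},
    {"Dramas"},
    {"Dramas"},
]

# Lower the confidence one percentage point at a time until a rule appears.
for percent in range(100, 0, -1):
    rules = mine_rules(baskets, min_support=3, min_confidence=percent / 100)
    if rules:
        break

print(percent, rules)
```

On this toy data the loop stops at the highest confidence level that any frequent itemset can satisfy, which is the same stopping criterion described for the Netflix rules.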

The last configuration sets the support threshold to **1% of the total shows** in the dataset. This is done to generate as many association rules as possible. We can simply multiply the number of rows in our dataframe by 0.01 to get the support threshold.

In [None]:
print("1% of total number of shows:", len(shows_df.index)*0.01)

1% of the total number of all shows is `62.26`, which we can round off and set as the support threshold.

Additionally, we set the confidence level to `40%` to generate more rules than the previous tests.

From this result, we initialize this Rule Miner to have a support threshold of `62` and a confidence of `40%`.

In [None]:
rule_miner_1percent = RuleMiner(62, 0.4)
assoc_rules_1percent = rule_miner_1percent.get_association_rules(shows_df)
assoc_rules_1percent

From the result of the rule miner's `get_association_rules` function, we can see that the genre association rules produced are:
 - ['Comedies', 'Dramas'] → ['International Movies']
 - ['Comedies', 'Romantic Movies'] → ['International Movies']
 - ['International Movies', 'Romantic Movies'] → ['Comedies']
 - ['Comedies', 'Independent Movies'] → ['Dramas']
 - ['Crime TV Shows', 'TV Dramas'] → ['International TV Shows']
 - ['Crime TV Shows', 'International TV Shows'] → ['TV Dramas']
 - ['TV Comedies', 'Romantic TV Shows'] → ['International TV Shows']
 - ['TV Dramas', 'Romantic TV Shows'] → ['International TV Shows']
 - ['International Movies', 'Thrillers'] → ['Dramas']
 - ['Dramas', 'Thrillers'] → ['International Movies']
 - ['Action & Adventure', 'Dramas'] → ['International Movies']
 - ['International Movies', 'Independent Movies'] → ['Dramas']
 - ['Independent Movies', 'Dramas'] → ['International Movies']
 - ['International Movies', 'Romantic Movies'] → ['Dramas']
 - ['Dramas', 'Romantic Movies'] → ['International Movies']
 - ['Thrillers', 'International Movies'] → ['Dramas']
 - ['Thrillers', 'Dramas'] → ['International Movies']

#### Insights and Conclusions

One limitation encountered in this research question is that the `listed_in` column records at most 3 genres per show. On the streaming platform itself, a show can have more than 3 genre listings, and this is not reflected in the dataset.

The **first** rule miner with a support threshold of `10` and a confidence of `100%` produced the following association rules:
 - ['Crime TV Shows', 'Korean TV Shows'] → ['International TV Shows']
 - ['International TV Shows', 'Science & Nature TV'] → ['Docuseries']
 - ['Science & Nature TV', 'International TV Shows'] → ['Docuseries']
 - ['Romantic TV Shows', 'Korean TV Shows'] → ['International TV Shows']
 - ['TV Dramas', 'Korean TV Shows'] → ['International TV Shows']
 - ['British TV Shows', 'Science & Nature TV'] → ['Docuseries']

What is interesting about this result is that many association rules were generated even though the confidence level had not yet been lowered. For the dataset, this means that whenever the itemset on the left side of a rule is present in the `listed_in` column, the third genre is always the one on the right side of the rule. For Netflix as a whole, it means only that there is a very high chance of shows following these rules, since not all Netflix shows may be included in the dataset and each show is limited to 3 genres. For all shows, including those not on Netflix, the probability of these rules holding could be lower, since Netflix only adds shows that cater to the preferences of its target market.

None of the genres in these results are included in the genre chord diagram. However, it is notable that 5 out of the 6 generated rules contain the 'International TV Shows' genre. This genre is the 4th most frequent genre, with 1001 occurrences. Its frequency may explain why it appears in these rules, but that does not explain why the top 3 most frequent genres are absent from the results. This could be because the confidence level is very high and the rules associated with the top 3 genres do not hold 100% of the time.

The **second** rule miner with a support threshold of `325` and a confidence of `74%` produced the following association rule:
 - ['Independent Movies'] → ['Dramas']
 
This result is interesting because there are only two genres in the rule. The rule states that whenever a show is listed under 'Independent Movies', it has roughly a 74% chance of also being listed under 'Dramas'. This suggests that many independent, or 'indie', movies are dramas. Indie movies are currently popular for winning awards or simply for being good. There could be a trend of indie drama movies that supports the generated rule, but that is a different question requiring a different approach.

Another interesting observation is that none of the genres from the first set of association rules made it to the current results. This could mean that their itemsets had low support which removed them from the set of frequent itemsets used to generate the rules.

The **third** rule miner with a support threshold of `62` and a confidence of `40%` produced the following association rules:
 - ['Comedies', 'Dramas'] → ['International Movies']
 - ['Comedies', 'Romantic Movies'] → ['International Movies']
 - ['International Movies', 'Romantic Movies'] → ['Comedies']
 - ['Comedies', 'Independent Movies'] → ['Dramas']
 - ['Crime TV Shows', 'TV Dramas'] → ['International TV Shows']
 - ['Crime TV Shows', 'International TV Shows'] → ['TV Dramas']
 - ['TV Comedies', 'Romantic TV Shows'] → ['International TV Shows']
 - ['TV Dramas', 'Romantic TV Shows'] → ['International TV Shows']
 - ['International Movies', 'Thrillers'] → ['Dramas']
 - ['Dramas', 'Thrillers'] → ['International Movies']
 - ['Action & Adventure', 'Dramas'] → ['International Movies']
 - ['International Movies', 'Independent Movies'] → ['Dramas']
 - ['Independent Movies', 'Dramas'] → ['International Movies']
 - ['International Movies', 'Romantic Movies'] → ['Dramas']
 - ['Dramas', 'Romantic Movies'] → ['International Movies']
 - ['Thrillers', 'International Movies'] → ['Dramas']
 - ['Thrillers', 'Dramas'] → ['International Movies']
 
The rules generated by this third rule miner reflect the common genre combinations found in [Dr. Shahin Rostami's chord diagram](https://datacrayon.com/posts/statistics/data-is-beautiful/co-occurrence-of-movie-genres-with-chord-diagrams/#Conclusion). The combinations 'Romance and Comedy', 'Romance and Drama', 'Drama and Thriller', 'Drama and Action', 'Drama and Comedy', and 'Crime and Drama' each appear in at least one of the generated rules. The only combinations not present are 'Action and Thriller', 'Crime and Thriller', and 'Action and Adventure', the last of which is a single genre of its own in the Netflix dataset. The only genres in the results that are not in the chord diagram are 'International Movies', 'International TV Shows', and 'Independent Movies'. In addition, the top 4 most frequent genres all appear in these rules, along with other frequently listed genres.

To answer the question, "What are the most common association rules among Netflix genres?", we choose the association rules generated by the **third rule miner**, with a support of `62` and a confidence of `40%`. The first rule miner's itemsets were not frequent enough: none of its rules appear in the results of the other rule miners. A confidence of 100% is also too strict, as it excludes every rule that does not hold one hundred percent of the time. The second rule miner has a high support threshold, which eliminates some itemsets that are frequent enough, and its confidence level is too high for a rule miner with such a high support threshold. As a result, it eliminated too many rules that could have been meaningful. The third rule miner has a sufficiently low support threshold of `62`, which accommodates many itemsets. Its confidence level of `40%` also increases the number of association rules generated, yet it is not too low, since the itemsets are already frequent enough and the number of rules generated is manageable.

As a further improvement, a specific value for the support threshold could be determined to establish when a genre combination occurs frequently enough. It would also be better to find and use a dataset that does not limit the number of genres a show can be listed in. This could produce more meaningful rules, which could help in answering other questions, such as genre trends, or in creating Netflix user profiles for recommender systems.

### Are Netflix descriptions effective classifiers for the International Movies genre?

The `description` variable in the dataset refers to Netflix's synopsis of each show. For this research question, the idea is to determine whether these synopses can effectively classify a show as belonging to *International Movies* or not.

The *International Movies* genre is chosen because 1,922 shows are classified under it, which makes it the most prominent genre. Choosing a prominent genre is important because a good number of samples is needed; this is especially true for classification, which requires good representatives of each category. In this case, the categories are "International Movies" and "Not International Movies".

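- Gather shows that are not listed under the chosen genre, matching the size of the chosen-genre sample
- Extract features from the descriptions (TF-IDF)
- Select the top features (chi-squared)
- Then repeat the sampling several times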

### Data Preparation

For this classification, two sets of data are needed: data that contains only shows with *International Movies* as a genre (regardless of whether they are also classified under other genres), and data containing shows that are not classified under *International Movies*.

The exploratory data analysis already produced a `DataFrame` for the first set. What we still need is a `DataFrame` of shows that are not classified under *International Movies*. For this, we can use the `isInternationalMV` variable of `netflix_df_clean`, where a `False` value indicates that the show is not under that classification.

In [None]:
netflix_df_non_international = netflix_df_clean[~netflix_df_clean["isInternationalMV"]]
netflix_df_non_international

The next step is to concatenate both "Not International Movies" and "International Movies" data together to prepare it for feature extraction.

In [None]:
netflix_df_classifier = pd.concat([netflix_df_non_international, netflix_df_international])
netflix_df_classifier

### Feature Extraction
The `description` variable is in the form of sentences (naturally, because these are synopses). However, computers do not really understand these words. Hence, a mathematical model is important to transform this into a numerical representation. 

#### TF-IDF
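TF-IDF (term frequency-inverse document frequency) scores each word in a description by how often it appears in that description, scaled down by how many descriptions in the whole collection contain it. Words that are frequent in one synopsis but rare across the catalog receive high weights, while words that appear almost everywhere receive low ones. `TfidfVectorizer` builds the vocabulary from all descriptions and converts each description into a vector of these weights.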
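As a small illustration of TF-IDF weighting, the sketch below fits `TfidfVectorizer` with its default settings on three made-up synopses (`docs` and `toy_vectorizer` are hypothetical names for this example). Within the first synopsis, a word found in only one document should receive a larger weight than a word shared with another document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up synopses.
docs = [
    "a detective hunts a killer",
    "a detective falls in love",
    "two friends fall in love",
]

toy_vectorizer = TfidfVectorizer()
weights = toy_vectorizer.fit_transform(docs)

# Map each word to its column index, then pull out the first synopsis as a dense row.
vocab = toy_vectorizer.vocabulary_
first_doc = weights[0].toarray().ravel()

# "killer" appears in one synopsis and "detective" in two, so "killer" weighs more.
print(first_doc[vocab["killer"]], first_doc[vocab["detective"]])
```

Single-character tokens such as "a" are dropped by the default tokenizer, so they never enter the vocabulary.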

In [None]:
vectorizer=TfidfVectorizer()
vectorizer.fit(netflix_df_classifier["description"])
vector=vectorizer.transform(netflix_df_classifier["description"])

In [None]:
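# `get_feature_names_out` replaces `get_feature_names`, which newer scikit-learn releases removed.
features = list(vectorizer.get_feature_names_out())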
print(features)

In [None]:
print(len(features))

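Each of the 16411 words above becomes one column of the TF-IDF matrix, so every description is represented as a vector of 16411 weights. Most entries are zero, because a single synopsis uses only a small fraction of the vocabulary.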

### Preparing features and labels

In [None]:
# Dense feature matrix: one row per show, one column per vocabulary word.
xdata = vector.toarray()

linearSVC = svm.LinearSVC()

netflix_reset = netflix_df_classifier.reset_index()

# Map the boolean flag to a readable class label for each show.
labels = ['Not International Movie', 'International Movie']
ydata = np.where(netflix_reset["isInternationalMV"], labels[1], labels[0])

skf = StratifiedKFold(n_splits=5)

for train_index, test_index in skf.split(xdata, ydata):
    x_train, x_test = xdata[train_index], xdata[test_index]
    y_train, y_test = ydata[train_index], ydata[test_index]

    linearSVC = linearSVC.fit(x_train, y_train)
    y_pred = linearSVC.predict(x_test)

    # Rows are true classes and columns are predicted classes, in the order of `labels`,
    # so the matrix reads [[tn, fp], [fn, tp]] with 'International Movie' as the positive class.
    confusion = confusion_matrix(y_test, y_pred, labels=labels)
    print('CONFUSION MATRIX: \n', confusion)

    print("Accuracy: ", accuracy_score(y_test, y_pred))

    # Pass `labels` explicitly so each row of the report is matched to the right name.
    print(classification_report(y_test, y_pred, labels=labels, target_names=labels))
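The evaluation above relies on `StratifiedKFold`, which keeps the class proportions roughly equal across folds. The toy sketch below, with made-up labels, illustrates this: a class that makes up one third of the data also makes up one third of every test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 4 positives and 8 negatives.
y = np.array(["pos"] * 4 + ["neg"] * 8)
X = np.zeros((len(y), 1))  # the feature values do not affect how the folds are built

skf = StratifiedKFold(n_splits=4)

# Count how many positives land in each test fold.
pos_per_fold = [int((y[test_idx] == "pos").sum()) for _, test_idx in skf.split(X, y)]
print(pos_per_fold)
```

This matters here because the two classes are not the same size, so an unstratified split could leave a fold with too few *International Movies*.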

In [None]:
# Keep the 20 words whose TF-IDF weights depend most strongly on the class label.
selector = SelectKBest(chi2, k=20)
selector.fit(np.asarray(xdata), ydata)
# Get idxs of columns to keep
idxs_selected = selector.get_support(indices=True)

# After fitting, the selector holds the chi-squared score and p-value of every word.
scores, pval = selector.scores_, selector.pvalues_

for i in idxs_selected:
    print((features[i], scores[i]))
    print(pval[i])

# pvalue_continue= True
# while pvalue_continue==True:
#     try:
#         pvalue_evaluation=input("what word")
#         pvalue_index=features.index(pvalue_evaluation)
#         print("SCORE:", scores[pvalue_index])
#         print ("PVALUE:", (pval[pvalue_index]))

#     except:
#         pass
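To make the chi-squared scores easier to interpret, the sketch below runs `chi2` on a tiny made-up feature matrix (`X` and `y` are hypothetical). A column whose values are the same in both classes carries no information about the label and should score lowest.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Made-up non-negative feature matrix: rows are documents, columns are words.
X = np.array([
    [1.0, 0.0, 0.5],
    [0.9, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.0, 0.8, 0.5],
])
y = np.array(["A", "A", "B", "B"])

# One score and one p-value per column.
scores, pvalues = chi2(X, y)
print(scores, pvalues)
```

The first two columns each occur in only one class and receive high scores, while the third column is identical across classes. In the Netflix data, the top-scoring words are the ones whose TF-IDF weight differs most between the two classes.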

## Insights and Conclusions