<a href="https://www.kaggle.com/code/ishmaelgarcia/assignment-2?scriptVersionId=102950880" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.metrics import multilabel_confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Load csv files

In [None]:
train = pd.read_csv('/kaggle/input/assignment2/train.csv')
print(train.shape)
test = pd.read_csv('/kaggle/input/assignment2/eval.csv')
train.info()

**Let's print the head to see the columns and info we have**

In [None]:
train['esrb_rating'].value_counts()

In [None]:
sns.countplot(data = train, x = 'esrb_rating', hue = 'cartoon_violence')

In [None]:
print(train['esrb_rating'].unique())
train.head()

* So we see we have game titles and the content categories each game falls into: 1 for yes, 0 for no.

* There is also a console column, which I'm assuming just indicates whether the title is available on console. It shows no correlation with the rating, and I believe it does not affect the rating of the game itself in any way.

* The id, title, and console columns will be dropped as they don't predict the rating.

* Since the data is categorical, we can assume there are no outliers.

* Above we can confirm there are no null values, so no filling or deleting is necessary.
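
The two checks described above (no nulls, only 0/1 values) can also be run explicitly. This is a minimal sketch on a made-up toy frame; `toy`, `null_counts`, and `is_binary` are hypothetical names, and the values are not taken from the real data.

```python
import pandas as pd

# Toy stand-in for the training frame (hypothetical values, not the real data)
toy = pd.DataFrame({
    "blood": [0, 1, 0, 1],
    "violence": [1, 1, 0, 0],
    "esrb_rating": ["E", "M", "E", "T"],
})

# Count missing values per column; all zeros means nothing to fill or drop
null_counts = toy.isnull().sum()
print(null_counts)

# Check that every descriptor column holds only 0/1 values
feature_cols = toy.columns.drop("esrb_rating")
is_binary = toy[feature_cols].isin([0, 1]).all()
print(is_binary)
```

On the real data the same two calls would be applied to `train` instead of `toy`.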

In [None]:
sns.countplot(data = train, x = 'esrb_rating', hue = 'console')

* When printing all the columns, we notice columns that share similar keywords. Examples: "mild", "strong", "sexual", "Blood", "Alcohol", "Drugs", "Violence"

* Let's create new columns that indicate whether at least one feature with these keywords is set, then drop the original columns. This reduces the number of columns and makes the data more concise

* Below we see that "mild" steadily increases, mostly in the middle "ET" and "T" ratings, and is not really associated with "M"-rated games
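
As a rough illustration of the grouping idea, the sketch below builds a combined flag from every column whose name contains a keyword. It runs on a made-up toy frame, and the helper `any_flag` is a hypothetical name. The cells in this notebook list the columns by hand instead, which allows groupings that a plain keyword match would not produce.

```python
import pandas as pd

# Toy frame with a few descriptor columns (hypothetical values)
toy = pd.DataFrame({
    "mild_blood": [0, 1, 0],
    "mild_language": [0, 0, 1],
    "strong_sexual_content": [0, 0, 1],
    "blood": [1, 0, 0],
})

# Remember the original columns so new is_* columns are never matched
source_cols = list(toy.columns)

def any_flag(df, keyword, columns):
    """Return 1 where at least one column whose name contains `keyword` is set."""
    cols = [c for c in columns if keyword in c]
    return df[cols].any(axis=1).astype(int)

toy["is_mild"] = any_flag(toy, "mild", source_cols)
toy["is_bloody"] = any_flag(toy, "blood", source_cols)
print(toy[["is_mild", "is_bloody"]])
```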

In [None]:
train['is_mild'] = train.mild_cartoon_violence | train.mild_fantasy_violence | train.mild_language | train.mild_blood | train.mild_lyrics | train.mild_suggestive_themes | train.mild_violence
test['is_mild'] = test.mild_blood | test.mild_cartoon_violence | test.mild_fantasy_violence | test.mild_language | test.mild_lyrics | test.mild_suggestive_themes | test.mild_violence
sns.countplot(data = train, x = 'esrb_rating', hue = 'is_mild')

* Here we see "strong" is heavily associated with "M"-rated games

In [None]:
train['is_strong'] = train.strong_janguage | train.strong_sexual_content
test['is_strong'] = test.strong_janguage | test.strong_sexual_content
sns.countplot(data = train, x = 'esrb_rating', hue = 'is_strong')

* Now that we have grouped the violence keywords, we see a lot of "E" and "T" ratings but also "M", and the least is "E". We might have to make more specific columns later, such as "violent" and "bloody/gore". For now this will do

In [None]:
train['is_violent'] = train.fantasy_violence | train.violence | train.intense_violence | train.mild_cartoon_violence | train.mild_fantasy_violence | train.mild_violence | train.cartoon_violence
test['is_violent'] = test.fantasy_violence | test.violence | test.intense_violence | test.mild_cartoon_violence | test.mild_fantasy_violence | test.mild_violence | test.cartoon_violence
sns.countplot(data = train, x = 'esrb_rating', hue = 'is_violent')

* Let's just combine every column that mentions drugs or alcohol

In [None]:
train['is_drug'] = train.use_of_drugs_and_alcohol | train.use_of_alcohol | train.alcohol_reference | train.drug_reference
test['is_drug'] = test.use_of_drugs_and_alcohol | test.use_of_alcohol | test.alcohol_reference | test.drug_reference
sns.countplot(data = train, x = 'esrb_rating', hue = 'is_drug')

* Now we combine all sexual keywords, including nudity

In [None]:
#train['is_nsexual_and_bloody'] = ~(train.sexual_themes & train.sexual_content & train.blood & train.blood_and_gore)
#test['is_nsexual_and_bloody'] =  ~((test.sexual_themes |  test.sexual_content) & (test.blood | test.blood_and_gore))
#sns.countplot(data = train, x = 'esrb_rating', hue = 'is_nsexual_and_bloody')

In [None]:
train['is_nsfw'] = train.strong_sexual_content | train.sexual_themes | train.sexual_content | train.partial_nudity | train.nudity
test['is_nsfw'] = test.strong_sexual_content | test.sexual_themes | test.sexual_content | test.partial_nudity | test.nudity
sns.countplot(data = train, x = 'esrb_rating', hue = 'is_nsfw')

* And now we combine all columns with the blood keyword

In [None]:
train['is_bloody'] = train.mild_blood | train.blood | train.blood_and_gore
test['is_bloody'] = test.mild_blood | test.blood | test.blood_and_gore
sns.countplot(data = train, x = 'esrb_rating', hue = 'is_bloody')

* Not many features are heavily correlated with the "ET" rating, but animated or fictional themes seem to be among the few

* This category should help our "ET" predictions

In [None]:
train['is_animated_graphic'] = train.animated_blood | train.fantasy_violence | train.cartoon_violence
test['is_animated_graphic'] = test.animated_blood | test.fantasy_violence | test.cartoon_violence
sns.countplot(data = train, x = 'esrb_rating', hue = 'is_animated_graphic')

* Let's reduce our feature count by grouping crude/mature humor

In [None]:
train['is_bad_humor'] = train.crude_humor | train.mature_humor
test['is_bad_humor'] = test.crude_humor | test.mature_humor
sns.countplot(data = train, x = 'esrb_rating', hue = 'is_bad_humor')

* Let's reduce our feature count by grouping language/lyrics

In [None]:
train['is_bad_speech'] = train.language | train.lyrics
test['is_bad_speech'] = test.language | test.lyrics
sns.countplot(data = train, x = 'esrb_rating', hue = 'is_bad_speech')

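Now let's drop the columns we used to make our new columns. Note that use_of_alcohol, violence, intense_violence, and blood_and_gore are not in the drop list, so they stay as features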

In [None]:
mg = test['id']  # keep the ids for the submission file
# Source columns folded into the grouped is_* features, plus id and console
grouped_cols = ['mild_blood', 'mild_cartoon_violence', 'fantasy_violence', 'mild_lyrics', 'mild_suggestive_themes',
                'mild_fantasy_violence', 'mild_violence', 'strong_janguage', 'strong_sexual_content',
                'use_of_drugs_and_alcohol', 'alcohol_reference', 'sexual_themes', 'sexual_content', 'partial_nudity',
                'nudity', 'blood', 'animated_blood', 'id', 'console', 'cartoon_violence', 'drug_reference',
                'mild_language', 'crude_humor', 'mature_humor', 'language', 'lyrics']
train.drop(grouped_cols + ['title'], axis = 1, inplace = True)  # title is dropped from train only
test.drop(grouped_cols, axis = 1, inplace = True)
train.info()

In [None]:
print(train.shape)
print(test.shape)

* We will now split our training data 80/20 into train and test sets

* We drop esrb_rating from X and keep only esrb_rating for y, so the model predicts the rating from the features alone

In [None]:
y = train['esrb_rating']
X = train.drop(columns = ['esrb_rating'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 123)
print(y.value_counts())
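
Because the class counts printed above may be uneven, one option is to keep the class proportions the same in both parts of the split. This is only a sketch on made-up data, not what the cell above does; `stratify` is a real `train_test_split` parameter, while `toy_X`, `toy_y`, and the `*_s` names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and labels (hypothetical values): 10 "E" rows and 10 "M" rows
toy_X = pd.DataFrame({"is_mild": [0, 1] * 10, "is_strong": [1, 0] * 10})
toy_y = pd.Series(["E"] * 10 + ["M"] * 10, name="esrb_rating")

# stratify=toy_y keeps the E/M ratio identical in the train and test parts
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=123, stratify=toy_y
)
print(X_tr_s.shape, X_te_s.shape)
print(y_te_s.value_counts())
```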

* Below we build our models and fit them to our data. We print each score and a confusion matrix

* The multilabel confusion matrix gives one 2x2 matrix per rating, holding the "True Negatives" at [0, 0], "False Positives" at [0, 1], "False Negatives" at [1, 0], and "True Positives" at [1, 1]

* I will look at the False Negatives and False Positives of each to see which ESRB rating has a higher rate of bad predictions
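
To make the layout concrete, here is a small sketch with made-up labels and predictions (not results from this notebook). It unpacks each per-rating 2x2 matrix into its four counts; `y_true_demo`, `y_pred_demo`, and `mcm` are hypothetical names.

```python
from sklearn.metrics import multilabel_confusion_matrix

# Made-up true and predicted ratings, for illustration only
y_true_demo = ["E", "ET", "T", "M", "ET", "T"]
y_pred_demo = ["E", "T", "T", "M", "ET", "ET"]
labels = ["E", "ET", "T", "M"]

# One 2x2 matrix per label, laid out as [[TN, FP], [FN, TP]]
mcm = multilabel_confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
for label, matrix in zip(labels, mcm):
    tn, fp, fn, tp = matrix.ravel()
    print(f"{label}: TN={tn} FP={fp} FN={fn} TP={tp}")
```

Passing `labels` explicitly fixes the order of the matrices, which makes it easier to tell which one belongs to which rating.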

# LOGISTIC REGRESSION
* Since the logistic regression model does not have many parameters with a large impact, we will not do any hyperparameter search

In [None]:
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_predlr = lr.predict(X_test)
scorelr = lr.score(X_test, y_test)
print(scorelr)
print(multilabel_confusion_matrix(y_test, y_predlr))

In [None]:
lr_scores = cross_val_score(lr,X_train, y_train, cv = 20)
lr_scoresdf = pd.DataFrame(lr_scores, columns = ['CVS'])
print(lr_scoresdf.describe())
plt.hist(lr_scoresdf['CVS'])

# SUPPORT VECTOR MACHINE
* Hyperparameter search: Kernel & Penalty(C)

In [None]:
clf =  svm.SVC(C = 1, kernel = 'rbf')
kernel = ['poly', 'rbf', 'sigmoid']
C = [50, 10, 1.0, 0.1, 0.01]
grid = dict(kernel = kernel,C = C,gamma =['scale'])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator = clf, param_grid = grid, n_jobs = -1, cv = cv, scoring='accuracy',error_score = 0)
grid_result = grid_search.fit(X, y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
clf.fit(X_train, y_train)
y_predsvm = clf.predict(X_test)
scoresvm = clf.score(X_test, y_test)
print(scoresvm)
print(multilabel_confusion_matrix(y_test, y_predsvm))
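
Note that the cell above searches on all of `X, y` and then fits `clf` with its hand-set parameters. An alternative, sketched here on synthetic data from `make_classification`, is to search on the training split only and reuse the refit `best_estimator_`, so the held-out rows never influence the parameter choice. The `*_demo` names and the smaller grid are assumptions for illustration.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the ESRB data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

# Search only on the training split
search = GridSearchCV(
    estimator=svm.SVC(gamma="scale"),
    param_grid={"kernel": ["poly", "rbf"], "C": [0.1, 1.0, 10]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr_demo, y_tr_demo)

# With the default refit=True, best_estimator_ is already refit on the training split
best_svm = search.best_estimator_
demo_score = best_svm.score(X_te_demo, y_te_demo)
print(search.best_params_)
print(demo_score)
```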

In [None]:
svm_scores = cross_val_score(clf,X_train, y_train, cv = 20)
svm_scoresdf = pd.DataFrame(svm_scores, columns = ['CVS'])
# Describe and plot the SVM scores (not the logistic regression ones)
print(svm_scoresdf.describe())
plt.hist(svm_scoresdf['CVS'])

# DECISION TREE CLASSIFIER
* Hyperparameter search: max_depth

In [None]:
dtc =  tree.DecisionTreeClassifier(max_depth = 20)


params = {'max_depth': [2, 3, 5, 10, 20]}
grid_search = GridSearchCV(estimator = dtc, param_grid = params, n_jobs = -1, cv = cv, scoring='accuracy',error_score = 0)
grid_result = grid_search.fit(X, y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))


dtc.fit(X_train, y_train)
y_preddtc = dtc.predict(X_test)
scoredtc = dtc.score(X_test, y_test)
print(scoredtc)
print(multilabel_confusion_matrix(y_test, y_preddtc))

In [None]:
dtc_scores = cross_val_score(dtc,X_train, y_train, cv = 20)
dtc_scoresdf = pd.DataFrame(dtc_scores, columns = ['CVS'])
print(dtc_scoresdf.describe())
plt.hist(dtc_scoresdf['CVS'])

# RANDOM FOREST CLASSIFIER
* Hyperparameter search: n_estimators, max_features, max_depth

In [None]:
rf =  RandomForestClassifier(max_features = 2, n_estimators = 1600, max_depth = 50)



random_grid = {'n_estimators': [20, 200, 400, 1000, 1600, 2000],
               'max_features': [2,3,10],
               'max_depth': [50, 70, 100, 120]}
grid_search = GridSearchCV(estimator = rf, param_grid = random_grid, cv = 3, n_jobs = -1, verbose = 2)
grid_result = grid_search.fit(X, y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))


rf.fit(X_train, y_train)
y_predrf = rf.predict(X_test)
scorerf = rf.score(X_test, y_test)
print(scorerf)
print(multilabel_confusion_matrix(y_test, y_predrf))

In [None]:
rf_scores = cross_val_score(rf,X_train, y_train, cv = 20)
rf_scoresdf = pd.DataFrame(rf_scores, columns = ['CVS'])
print(rf_scoresdf.describe())
plt.hist(rf_scoresdf['CVS'])

# KNEIGHBORS CLASSIFIER
* Hyperparameter search: n_neighbors, leaf_size

In [None]:
kn =  KNeighborsClassifier(n_neighbors = 5, leaf_size = 10)


k_grid = {'n_neighbors': [1, 5, 10, 100],
            'leaf_size': [1,10,20,50]}
grid_search = GridSearchCV(estimator = kn, param_grid = k_grid, cv = 3, n_jobs = -1, verbose = 2)
grid_result = grid_search.fit(X, y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))


kn.fit(X_train, y_train)
y_predkn = kn.predict(X_test)
scorekn = kn.score(X_test, y_test)
print(scorekn)
print(multilabel_confusion_matrix(y_test, y_predkn))

In [None]:
kn_scores = cross_val_score(kn,X_train, y_train, cv = 20)
kn_scoresdf = pd.DataFrame(kn_scores, columns = ['CVS'])
print(kn_scoresdf.describe())
plt.hist(kn_scoresdf['CVS'])

* As I worried earlier, we have few features that accurately pick out "ET" ratings, and "ET" seems to be the rating with the most False Positives/Negatives. Otherwise, 85-87% accuracy is not bad at all

* Our Random Forest Classifier seems to do the best for now, so we will use it to predict our test data
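
One way to make this comparison less of an eyeball judgment is to put the cross-validation scores of each model side by side. The sketch below uses synthetic data and small default-ish models as assumptions; `candidates` and `summary` are hypothetical names, and the numbers it prints say nothing about the ESRB data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the ESRB data)
X_cmp, y_cmp = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "k_neighbors": KNeighborsClassifier(n_neighbors=5),
}

# One column of fold scores per model, then mean and std per model
fold_scores = pd.DataFrame({
    name: cross_val_score(model, X_cmp, y_cmp, cv=5)
    for name, model in candidates.items()
})
summary = fold_scores.agg(["mean", "std"]).T
print(summary.sort_values("mean", ascending=False))
```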

In [None]:
my_guess = rf.predict(test)

In [None]:
submission = pd.DataFrame({'id':mg, 'esrb_rating': my_guess})
submission.to_csv('csv_to_submit.csv', index = False)
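# `filename` was left over from the os.walk loop in the first cell, so print the output name directly
print('saved file: csv_to_submit.csv')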

In [None]:
print(submission.to_string())