# **Competition Overview:** [Tabular_Playground_Series](https://www.kaggle.com/c/tabular-playground-series-may-2021/overview)

## **Objective:** Predict, for each id in the test dataset, the probability that it belongs to each class

## **Dataset:** [Synthetic Dataset](https://www.kaggle.com/c/tabular-playground-series-may-2021/data)

In [None]:
# import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
%matplotlib inline

# import the library to ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# import the train dataset
train_original = pd.read_csv('../input/tabular-playground-series-may-2021/train.csv')

In [None]:
# view the train dataset
train_original.head()

In [None]:
# verify the shape of the train dataset
train_original.shape

**Comments**
> There are 100000 rows and 52 columns in the train dataset, including the 'target' variable.

In [None]:
# Check if there are any missing values
train_original.info()

**Comments**
> There are no null values. All the independent features are of integer datatype and 'target' is of object datatype.

In [None]:
# Describe the dataset
train_original.describe().T

In [None]:
#import the test dataset
test_original = pd.read_csv('../input/tabular-playground-series-may-2021/test.csv')

In [None]:
# View the test dataset
test_original.head()

In [None]:
# Check the null values
test_original.info()

**Comments**
> 1. There are 50000 rows and 51 columns in test dataset which excludes 'target' variable.
> 2. There are no null values in the test dataset and all the independent features are of integer datatype

In [None]:
# Check the distribution of the features in the train dataset using histograms
# Since there are 50 features, plot only a subset of them

features_to_plot = ['feature_0', 'feature_10', 'feature_20', 'feature_35', 'feature_49']
colors = ['red', 'green', 'blue', 'black', 'magenta']

plt.figure(figsize=(20, 10))
for position, (feature, color) in enumerate(zip(features_to_plot, colors), start=1):
    # one subplot per selected feature
    plt.subplot(2, 5, position)
    plt.title('Histogram for ' + feature)
    # sns.histplot replaces the deprecated sns.distplot(..., kde=False)
    sns.histplot(train_original[feature], color=color)
    plt.xlabel(feature)
    plt.ylabel('Frequency')

**Comments**
> It is evident that most of the feature distributions are not normal (they are right-skewed). So, normalizing or transforming the data is worth considering before training the model.
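
The right skew noted above can also be quantified rather than read off the plots. The sketch below is a minimal illustration on a made-up, count-like feature (`toy_feature` is hypothetical, not competition data). It uses pandas' `Series.skew()` and shows `np.log1p` as one possible transform; this is an assumption for illustration and not a step the notebook performs.

```python
import numpy as np
import pandas as pd

# Hypothetical count-like feature: many zeros and a few large values
toy_feature = pd.Series(np.repeat([0, 1, 2, 3, 10], [60, 25, 10, 4, 1]))

# A positive skewness means the distribution has a longer right tail
skewness = toy_feature.skew()

# log1p compresses large values, which shortens the right tail
log_skewness = np.log1p(toy_feature).skew()

print(f"skewness before log1p: {skewness:.2f}")
print(f"skewness after log1p:  {log_skewness:.2f}")
```

On the real data, `train_X.skew()` would give one skewness value per feature.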

In [None]:
# Check the balance of the classes in the 'target' variable
train_original['target'].value_counts()

In [None]:
# Check the balance of the data by plotting count of the target by their values
plt.figure(figsize=(5,2))
plt.title('Count plot for target')
# pass the column as x explicitly; recent seaborn versions no longer accept it positionally
sns.countplot(x=train_original['target'])

**Comments**
> 1. The count of the Class_2 category is higher than that of the other classes. Hence, this is an imbalanced multiclass classification problem.

> 2. A resampling technique can be used to deal with an imbalanced multiclass classification problem. It consists of removing samples from the majority class 'Class_2' (under-sampling) and/or adding more samples from the minority classes 'Class_1, Class_4, Class_3' (over-sampling).

> 3. Removing samples from the majority class leads to data loss, and adding more samples by duplicating minority-class rows can lead to overfitting.

> 4. Hence, we will use SMOTE (Synthetic Minority Oversampling Technique). It is an oversampling technique that generates synthetic samples for the minority classes instead of duplicating existing ones, which helps to reduce the overfitting problem.
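
To make the idea of a "synthetic sample" concrete, here is a simplified sketch of the interpolation step that SMOTE is built on: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line between them. The points, the helper `synthesize` and the value of `k` are all hypothetical, and this is not the `imblearn` implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synthesize(points, k, rng):
    """Create one synthetic point by interpolating towards a near neighbour."""
    i = rng.integers(len(points))                    # pick a random minority sample
    distances = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(distances)[1:k + 1]      # skip position 0 (the point itself)
    j = rng.choice(neighbours)                       # pick one of its k nearest neighbours
    gap = rng.random()                               # interpolation factor in [0, 1)
    return points[i] + gap * (points[j] - points[i]), i, j

new_point, i, j = synthesize(minority, k=2, rng=rng)
print("synthetic point:", new_point)
```

Because the new point lies between two existing minority samples, it is a plausible new observation rather than an exact copy of an existing row.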

In [None]:
# store the 'ID' values into the new object train_ID and test_ID and verify the shape

# Train dataset
train_ID = train_original.id
print('The shape of the ID feature in train dataset is:',train_ID.shape)

# Test dataset
test_ID = test_original.id
print('\nThe shape of the ID feature in test dataset is:',test_ID.shape)

In [None]:
# store the remaining values into the new object train_X and test_X and verify the shape

# Train dataset
train_X = train_original.drop(columns = ['id','target'])
print('The shape of the final train dataset is:',train_X.shape)

# Test dataset
test_X = test_original.drop(columns = ['id'])
print('\nThe shape of the final test dataset is:',test_X.shape)

In [None]:
# store the 'target' values into the new object train_y and verify the shape
train_y=train_original.target
print('The shape of the target feature of train dataset is:',train_y.shape)

In [None]:
# save these datasets to csv for future use if required
train_ID.to_csv('train_ID.csv',index=False)
test_ID.to_csv('test_ID.csv',index=False)
train_X.to_csv('train_X.csv',index=False)
test_X.to_csv('test_X.csv',index=False)
train_y.to_csv('train_y.csv',index=False)

In [None]:
# Perform correlation analysis between the features of train dataset and visualize using a heat map.
plt.figure(figsize=(20,10))
sns.heatmap(train_X.corr(),annot=True,cmap='RdYlGn');
##
plt.xticks(rotation=90, color='indigo', size=10)
plt.yticks(rotation=0, color='indigo', size=10)

**Comments** 
> It is observed that the pairwise correlations between the features do not exceed 0.25. So, the features are at most weakly linearly related to each other.
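
With 50 features the annotated heat map is hard to read, so the largest correlation can also be checked numerically. The sketch below runs on a small random stand-in (`toy_X` is hypothetical); on the notebook's data, `train_X` would take its place.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for train_X: 5 integer features, 1000 rows
toy_X = pd.DataFrame(rng.integers(0, 10, size=(1000, 5)),
                     columns=[f"feature_{i}" for i in range(5)])

corr = toy_X.corr().to_numpy()

# Drop the diagonal, which always holds the 1.0 self-correlations
off_diagonal = corr[~np.eye(len(corr), dtype=bool)]

max_abs_corr = np.abs(off_diagonal).max()
print(f"largest |correlation| between two different features: {max_abs_corr:.3f}")
```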

In [None]:
# import the required library
from sklearn.decomposition import PCA

In [None]:
# create a PCA object (instantiate)
pca=PCA()

In [None]:
# fit and transform the train dataset
# transform the test dataset

# fit the train dataset
pca_fit_train_X = pca.fit(train_X)

# transform the train dataset
pca_fit_transform_train_X = pca_fit_train_X.transform(train_X)

# transform the test dataset
pca_transform_test_X = pca.transform(test_X)

In [None]:
# Calculate the percentage of variation explained by each principal component
# Assuming we have two principal components PC1 and PC2, then
# explained_variance_ratio for PC1 = (Variation for PC1 / Total variation, i.e. (PC1 + PC2)) * 100
pca_train_X_variation = np.round(pca_fit_train_X.explained_variance_ratio_.cumsum()*100,decimals=1)
# .cumsum() converts the per-component percentages into a cumulative percentage

In [None]:
# Print the cumulative percentage of explained variance
pca_train_X_variation

In [None]:
# assign a label to each PC (PC1, PC2, etc.) for visualization in the scree plot
labels = ['PC' + str(x) for x in range(1, len(pca_train_X_variation)+1)]

In [None]:
# generate the scree plot
plt.figure(figsize=(25,5))
plt.bar(x=range(1, len(pca_train_X_variation)+1), height=pca_train_X_variation, tick_label=labels)
plt.xticks(rotation=90, color='indigo', size=15)
plt.yticks(rotation=0, color='indigo', size=15)
plt.title('Scree Plot', color='tab:orange', fontsize=25)
plt.xlabel('Principal Components', {'color': 'tab:orange', 'fontsize':15})
plt.ylabel('Cumulative percentage of explained variance', {'color': 'tab:orange', 'fontsize':15})

**Comments**: 
> From the array of cumulative percentages and the scree plot, it is observed that, among the 50 principal components, ~90% of the variation of the data in train_X is explained by the first 28 principal components alone.
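
The number of components needed to reach a variance threshold can also be computed instead of being read off the plot. The sketch below uses made-up explained-variance ratios for six components (`ratios` is hypothetical). In the notebook, `pca_fit_train_X.explained_variance_ratio_` would take its place.

```python
import numpy as np

# Hypothetical explained-variance ratios for 6 components (they sum to 1)
ratios = np.array([0.40, 0.25, 0.15, 0.12, 0.05, 0.03])
threshold = 0.90

# Running total of explained variance: 0.40, 0.65, 0.80, 0.92, ...
cumulative = ratios.cumsum()

# Index of the first cumulative value >= threshold, converted to a count
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print("components needed:", n_components)
```

As far as I know, scikit-learn's `PCA` also accepts a fraction such as `n_components=0.90` and then makes this choice itself.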

In [None]:
# create a PCA object again (instantiate) by considering first 28 principal components
pca_28=PCA(n_components=28)

In [None]:
# fit and transform the train dataset
# transform the test dataset

# fit the train dataset
pca_fit_train_X_28 = pca_28.fit(train_X)

# transform the train dataset
pca_fit_transform_train_X_28 = pca_fit_train_X_28.transform(train_X)

#transform the test dataset
pca_transform_test_X_28 = pca_28.transform(test_X)

In [None]:
# From PCA, the final train and test datasets are as follows

#train data
pca_fit_transform_train_X_28.shape

In [None]:
#test data
pca_transform_test_X_28.shape

In [None]:
#train label
train_y.shape

In [None]:
# import the library SMOTE to deal with Imbalanced Multiclass Classification
from imblearn.over_sampling import SMOTE

#instantiate SMOTE
smote = SMOTE(random_state=42)

In [None]:
# sample and fit the dataset to balance the classes,thereby increasing the observations
X_smote, y_smote = smote.fit_resample(pca_fit_transform_train_X_28,train_y)

print('Shape of independent features(X) before SMOTE :', pca_fit_transform_train_X_28.shape)
print('Shape of independent features(X) after SMOTE :' , X_smote.shape)

print('Shape of dependent feature(y) before SMOTE :', train_y.shape)
print('Shape of dependent feature(y) after SMOTE :' , y_smote.shape)

print('Count of the classes before SMOTE:\n',train_y.value_counts())
print('Count of the classes after SMOTE:\n',y_smote.value_counts())

**Comments**
> 1. After SMOTE, the number of observations increased from 100000 rows to 229988 rows.
> 2. After SMOTE, the count of the majority class Class_2 remains unchanged (57497), whereas the counts of the minority classes 1, 3 and 4 are each increased to 57497. This ensures that the classes are perfectly balanced.
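
The row count after resampling follows directly from the figures in the comments above: every one of the four classes ends up with as many rows as the majority class. A quick arithmetic check, using only those reported numbers:

```python
# Figures taken from the comments above
n_classes = 4
majority_count = 57_497
rows_before = 100_000

# After balancing, each class has as many rows as the majority class
rows_after = n_classes * majority_count

# Rows that SMOTE had to generate synthetically
synthetic_rows = rows_after - rows_before

print("rows after SMOTE:", rows_after)
print("synthetic rows added:", synthetic_rows)
```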

In [None]:
# I will use only the XGBoost Classifier for the second submission.
# Later, I will check how I can improve the score by trying different algorithms

# import the required libraries
import xgboost as xgb
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

In [None]:
# Parameters
params={ 'learning_rate' : [0.01, 0.05, 0.1] ,
'max_depth' : [3,4,5],
'min_child_weight': [ 0, 1, 3],
'gamma' : [ 0, 0.25,1],
'colsample_bytree': [ 0.2, 0.4, 0.6],
'reg_lambda' : [0,1,10]
}
## Explanations of the parameters
# 'max_depth' - Maximum depth of trees (default = 6, range: [0,∞])
# 'Learning rate'(eta) - scaling the tree by learning rate predicts the output in smaller steps close to the
# 'reg_lambda' - L2 regularization parameter on weights which estimate the mean of the data to avoid overfit
# 'gamma' - Minimum loss reduction required to make a further partition on a leaf node of the tree (pruning)
# 'min_child_weight' - default =1. If the weights of each leaf is less than the min_child weight, then the t
# So weights of the each leaf is > min_child_weight
# 'colsample_bytree': It is the subsample ratio of columns when constructing each tree.(default=1, range 0.5

In [None]:
# Optimize the Hyperparameter using RandomizedSearchCV

#import the required libraries
from sklearn.model_selection import RandomizedSearchCV

In [None]:
#Instantiate the classifier
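# Note: 'missing' is the value XGBoost treats as a missing value (default is np.nan),
# so missing=1 marks every entry equal to 1 as missing
XGB=xgb.XGBClassifier(missing=1)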
# Objective will be automatically set to 'multi:softprob' # Alternate is 'multi:softmax'
# 'multi:softmax' returns predicted class
# 'multi:softprob' returns predicted probabilities
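
The difference between the two objectives can be illustrated with plain NumPy. This is a sketch on made-up per-class scores (`raw_scores` is hypothetical), not XGBoost's internal code: the softmax probabilities correspond to what `multi:softprob` reports, and taking the most probable class corresponds to what `multi:softmax` reports.

```python
import numpy as np

classes = np.array(['Class_1', 'Class_2', 'Class_3', 'Class_4'])

# Hypothetical per-class scores for two rows
raw_scores = np.array([[0.2, 1.5, 0.1, -0.3],
                       [1.0, 0.9, -0.5, 0.0]])

# Softmax: subtract the row maximum first for numerical stability
shifted = raw_scores - raw_scores.max(axis=1, keepdims=True)
probabilities = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# The predicted class is the one with the highest probability in each row
predicted_class = classes[probabilities.argmax(axis=1)]

print(probabilities.round(3))
print(predicted_class)
```

Since the competition asks for a probability per class, the probability output is the one needed for the submission.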

In [None]:
# Use a random search over the parameters with 5-fold cross-validation
# to find a better-performing combination of hyperparameters
Random_Search=RandomizedSearchCV(XGB,param_distributions=params,n_iter=5,n_jobs=-1,cv=5,verbose=0)
# cv=5 - Number of folds in a `(Stratified)KFold`
# verbose=0 - print no progress messages (verbose must not be negative)
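
It helps to see how small a share of the search space `n_iter=5` covers. The sketch below copies the `params` dictionary from above and counts the possible combinations. It is plain arithmetic and says nothing about which settings the search will actually sample.

```python
import math

# Same grid as the params dictionary defined earlier
params = {
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 4, 5],
    'min_child_weight': [0, 1, 3],
    'gamma': [0, 0.25, 1],
    'colsample_bytree': [0.2, 0.4, 0.6],
    'reg_lambda': [0, 1, 10],
}

# Six parameters with three candidate values each
n_combinations = math.prod(len(values) for values in params.values())

n_iter = 5
fraction_tried = n_iter / n_combinations

print("possible combinations:", n_combinations)
print(f"share tried with n_iter={n_iter}: {fraction_tried:.2%}")
```

With fewer than 1% of the combinations tried, a larger `n_iter` may find better settings at the cost of a longer run.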

In [None]:
# Fit the train dataset to the Random_Search to obtain the best estimators and parameters.
Random_Search.fit(X_smote,y_smote)

In [None]:
# Print the best estimator
Random_Search.best_estimator_

In [None]:
#Print the best parameters
Random_Search.best_params_

In [None]:
# Instantiate the XGBoost classifier with the best estimators and parameters
XGB=xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.6, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=4,
              min_child_weight=1, missing=1, monotone_constraints='()',
              n_estimators=100, n_jobs=2, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
# leave the other parameters to default values

In [None]:
# Check the accuracy of the model with 5-fold cross-validation (cv=5 uses a `(Stratified)KFold`)
# import the library
from sklearn.model_selection import cross_val_score

Accuracy=cross_val_score(XGB,X_smote,y_smote,cv=5)
Accuracy

In [None]:
# Print the mean accuracy across the k folds
print("Accuracy of XGBoost Model with Cross Validation is:",Accuracy.mean() * 100)

In [None]:
# Fit the training data
XGB.fit(X_smote,y_smote)

In [None]:
#predict the test dataset
y_predict_xgb = XGB.predict(pca_transform_test_X_28)
y_predict_xgb

In [None]:
# predict the probability that each id of the test dataset belongs to each class
probability_xgb=XGB.predict_proba(pca_transform_test_X_28)
probability_xgb

In [None]:
# read the sample submission file
sample_submit=pd.read_csv('../input/tabular-playground-series-may-2021/sample_submission.csv')

In [None]:
# create a dataframe similar to sample submission file
pred_prob_xgb = pd.DataFrame(probability_xgb, columns = ['Class_1', 'Class_2', 'Class_3', 'Class_4'])
pred_prob_xgb['id'] = sample_submit['id']

In [None]:
#Set the index to 'id'
pred_prob_xgb=pred_prob_xgb.set_index('id')
pred_prob_xgb

In [None]:
# Export the predicted probabilities of each test id belonging to each class
pred_prob_xgb.to_csv('submission_20210510_V2.csv')
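
Before uploading, a few sanity checks on the submission frame can catch formatting mistakes. The sketch below runs on a tiny made-up frame (`toy_probabilities` and `toy_ids` are hypothetical). In the notebook, the same checks would apply to `pred_prob_xgb`.

```python
import numpy as np
import pandas as pd

class_columns = ['Class_1', 'Class_2', 'Class_3', 'Class_4']

# Hypothetical predicted probabilities and ids for two test rows
toy_probabilities = np.array([[0.1, 0.6, 0.2, 0.1],
                              [0.3, 0.4, 0.2, 0.1]])
toy_ids = pd.Series([100000, 100001], name='id')

submission = pd.DataFrame(toy_probabilities, columns=class_columns, index=toy_ids)

# Each row should be a probability distribution over the four classes
rows_sum_to_one = np.allclose(submission.sum(axis=1), 1.0)

# No probability should be missing
no_missing = not submission.isna().any().any()

print("rows sum to one:", rows_sum_to_one)
print("no missing values:", no_missing)
```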