<br>
<h1 style = "font-size:60px; font-family:Garamond ; font-weight : normal; background-color: #f6f5f5 ; color : #fe346e; text-align: center; border-radius: 100px 100px;"> Tabular Playground Series - June 2021</h1>
<br>

<a id = '1'></a>
<h2 style = "font-family:garamond; font-size:50px; background-color: #f6f6f6; color : #fe346e; border-radius: 100px 100px; text-align:center"> Introduction </h2>

<h3 style = "font-family:garamond; font-size:30px; background-color: white; color : #fe346e; border-radius: 100px 100px; text-align:left">Problem Statement</h3>

For the Tabular Playground Series - June 2021, we have a synthetic dataset generated using CTGAN. The task is to predict the category of an eCommerce product given various attributes of its listing. Although the features are anonymized, they have properties relating to real-world features.


<h3 style = "font-family:garamond; font-size:30px; background-color: white; color : #fe346e; border-radius: 100px 100px; text-align:left">Metric</h3>

Submissions are evaluated using multi-class logarithmic loss. Each row in the dataset has been labeled with one true Class. For each row, you must submit the predicted probabilities that the product belongs to each class label. The formula is:

$$ \text{log loss} = -\frac{1}{N}\sum_{i=1}^N\sum_{j=1}^My_{ij}\log(p_{ij}), $$

where $N$ is the number of rows in the test set, $M$ is the number of class labels, $\text{log}$ is the natural logarithm, $y_{ij}$ is 1 if observation $i$ is in class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that observation $i$ belongs to class $j$.
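
As a rough illustration of the formula above, here is a minimal NumPy sketch. The function name `multiclass_log_loss`, the clipping constant `eps`, and the toy data are assumptions made for this example; clipping and row-wise rescaling follow a common convention for this metric and are not taken from the competition's scoring code.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each row."""
    probs = np.asarray(probs, dtype=float)
    y_true = np.asarray(y_true, dtype=int)
    # Clip to avoid log(0), then rescale each row so it sums to 1 (assumed convention).
    probs = np.clip(probs, eps, 1 - eps)
    probs = probs / probs.sum(axis=1, keepdims=True)
    # y_ij is one-hot, so the inner sum over j reduces to log p of the true class.
    true_class_probs = probs[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(true_class_probs)))

# Toy data: 3 rows, 3 classes; labels are 0-based column indices (hypothetical values).
y_true = [0, 2, 1]
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6],
         [0.2, 0.5, 0.3]]

loss = multiclass_log_loss(y_true, probs)
# A uniform prediction over M classes scores log(M), a useful baseline.
uniform_loss = multiclass_log_loss(y_true, np.full((3, 3), 1 / 3))
print(loss, uniform_loss)
```

Only the probability assigned to the true class enters the score, so a confident wrong prediction is penalized far more heavily than a hesitant one.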

<a id = '1.1'></a>

<h2 style = "background-color: #f6f5f5; color : #fe346e; font-size: 35px; font-family:garamond; font-weight: normal; border-radius: 100px 100px; text-align: center">Libraries</h2>

In [None]:
import gc
import os
import wandb
import logging
import datetime
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import lightgbm as lgb
from tqdm import tqdm_notebook
import matplotlib.pyplot as plt
color = sns.color_palette()
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold
warnings.filterwarnings('ignore')

<img src="https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67">

In [None]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("api_key")
os.environ["WANDB_SILENT"] = "true"

In [None]:
wandb.login(key=secret_value_0)

<a id = '1.1'></a>

<h2 style = "background-color: #f6f5f5; color : #fe346e; font-size: 35px; font-family:garamond; font-weight: normal; border-radius: 100px 100px; text-align: center">Data Description</h2>



> - ```train.csv``` - the training data, one product (id) per row, with the associated features (feature_*) and class label (target)
> - ```test.csv``` - the test data; you must predict the probability that each id belongs to each class
> - ```sample_submission.csv``` - a sample submission file in the correct format

In [None]:
# Load csv data of this competition.

DATA = "../input/tabular-playground-series-jun-2021"
train_df = pd.read_csv(DATA + "/train.csv")
test_df = pd.read_csv(DATA + "/test.csv")

In [None]:
train_df.shape, test_df.shape

<div class="alert alert-block alert-warning">  
    
The train dataset contains 200000 rows and 77 columns, and the test dataset contains 100000 rows and 76 columns, so the train dataset has twice as many rows as the test dataset.
    

The columns in the train data are as follows:
    

```id:``` The id of the product (one per row)
    

```feature_*:``` The attributes of the product (75 columns, feature_0 to feature_74)
    

```target:``` The class label (Class_1 to Class_9)
    
    

The test dataset has the same columns except for the target.

</div>

<h4 style = "background-color: white; color : #fe346e; font-size: 30px; font-family:garamond; font-weight:normal; border-radius: 75px 150px; text-align: left"> W & B Artifacts</h4> 

Use W&B Artifacts for dataset versioning, model versioning, and tracking dependencies and results across machine learning pipelines. Think of an artifact as a versioned folder of data. You can store entire datasets directly in artifacts, or use artifact references to point to data in other systems like S3, GCP, or your own system.

[Source](https://docs.wandb.ai/guides/artifacts)

In [None]:
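# Start a run, attach the training CSV to a dataset artifact, and log it.
run = wandb.init(job_type="dataset-creation")
artifact = wandb.Artifact('my-dataset', type='dataset')
artifact.add_file('../input/tabular-playground-series-jun-2021/train.csv')
run.log_artifact(artifact)
# Close the run so the next wandb.init() starts a fresh one.
run.finish()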

<h4 style = "background-color: white; color : #fe346e; font-size: 30px; font-family:garamond; font-weight:normal; border-radius: 75px 150px; text-align: left"> W & B Tables</h4> 

A W&B Table (wandb.Table) is a two dimensional grid of data where each column has a single type of data—think of this as a more powerful DataFrame. Tables support primitive and numeric types, as well as nested lists, dictionaries, and rich media types. Log a Table to W&B, then query, compare, and analyze results in the UI.

Tables are great for storing, understanding, and sharing any form of data critical to your ML workflow—from datasets to model predictions and everything in between.

[Source](https://docs.wandb.ai/guides/data-vis)

In [None]:
# Initialize a new W&B run and log the training data as a Table
run = wandb.init(project='TPSJune')
train1_df = wandb.Table(dataframe=train_df)
wandb.log({'train1_df': train1_df})
run.finish()
run

Let's explore the first five rows of the train dataset.

In [None]:
train_df.head()

Let's explore the first five rows of the test dataset.

In [None]:
test_df.head()

<h4 style = "background-color: white; color : #fe346e; font-size: 30px; font-family:garamond; font-weight:normal; border-radius: 75px 150px; text-align: left"> Missing Data</h4> 

In [None]:
def missing_data(data):
    """Return the missing-value count, missing percentage, and dtype of each column (transposed)."""
    total = data.isnull().sum()
    percent = total / len(data) * 100
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    tt['Types'] = [str(data[col].dtype) for col in data.columns]
    return tt.T

In [None]:
%%time
missing_data(train_df)

In [None]:
%%time
missing_data(test_df)

### 🎯Observations :

📌 There are no missing values in either the train or the test dataset

In [None]:
%%time
train_df.describe().T.style.bar(subset=['mean'], color='#ea9999')\
                   .background_gradient(subset=['std'], cmap='YlOrBr')

In [None]:
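# Truncate both frames to 100000 rows: the rest of the analysis uses this subset of
# the training data, and plt.scatter in plot_feature_scatter needs equally sized inputs.
train_df = train_df[:100000]
test_df = test_df[:100000]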

<h4 style = "background-color: white; color : #fe346e; font-size: 30px; font-family:garamond; font-weight:normal; border-radius: 75px 150px; text-align: left"> Scatter Plot of Features</h4> 

In [None]:
def plot_feature_scatter(df1, df2, features):
    """Scatter each feature of df1 (x-axis) against the same feature of df2 (y-axis)."""
    sns.set_style('whitegrid')
    fig, axes = plt.subplots(4, 4, figsize=(14, 14))

    # One panel per feature; the 4x4 grid holds at most 16 features.
    for ax, feature in zip(axes.flatten(), features):
        ax.scatter(df1[feature], df2[feature], marker='+', color="#FFB14E")
        ax.set_xlabel(feature, fontsize=9)
    plt.show()

In [None]:
features = [feature for feature in train_df.columns if feature not in ['id', 'target']]
features = features[:16]

In [None]:
plot_feature_scatter(train_df[::20],test_df[::20], features)

<h4 style = "background-color: white; color : #fe346e; font-size: 30px; font-family:garamond; font-weight:normal; border-radius: 75px 150px; text-align: left">Target Class</h4> 

In [None]:
run = wandb.init(project='TPSJune', job_type='image-visualization',name='Target Feature Count')
targetcount_train = pd.DataFrame(train_df['target'].value_counts())
targetcount_train = targetcount_train.reset_index(drop=False)
targetcount_train.columns = ['TargetClass', 'Count']
table = wandb.Table(data=targetcount_train, columns = ["TargetClass", "Count"])
wandb.log({"my_bar_chart_id" : wandb.plot.bar(table, "TargetClass","Count", title="Target Feature Count")})


run.finish()

run

In [None]:

#sns.countplot(train_df['target'], palette='Set3')
#plt.xticks(rotation=45)



In [None]:
run = wandb.init(project='TPSJune', job_type='image-visualization',name='Unique Values')

features = [feature for feature in train_df.columns if feature not in ['id', 'target']]
unique_values_train = np.zeros(2)
for feature in features:
    temp = train_df[feature].unique()
    unique_values_train = np.concatenate([unique_values_train, temp])
unique_values_train = np.unique(unique_values_train)

unique_value_feature_train = pd.DataFrame(train_df[features].nunique())
unique_value_feature_train = unique_value_feature_train.reset_index(drop=False)
unique_value_feature_train.columns = ['Features', 'Count']

table = wandb.Table(data=unique_value_feature_train, columns = ["Features", "Count"])
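# Plot the count per feature as a bar chart; a histogram over the string-valued "Features" column would not show the counts.
wandb.log({"Unique Train Features" : wandb.plot.bar(table, "Features", "Count", title="Unique Train Features")})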


run.finish()

run

<h4 style = "background-color: white; color : #fe346e; font-size: 30px; font-family:garamond; font-weight:normal; border-radius: 75px 150px; text-align: left">Density Plot of Features</h4> 

In [None]:
def plot_feature_distribution(df1,df2,df3,df4,df5,df6,df7,df8,df9,features):
    """Overlay the per-class density of each feature; df1..df9 hold the rows of Class 1..9."""
    sns.set_style('whitegrid')
    fig, axes = plt.subplots(3,3,figsize=(18,22))
    class_dfs = [df1,df2,df3,df4,df5,df6,df7,df8,df9]

    # One panel per feature; the 3x3 grid holds at most 9 features.
    for ax, feature in zip(axes.flatten(), features):
        for k, df in enumerate(class_dfs, start=1):
            # kdeplot replaces the deprecated sns.distplot(..., hist=False)
            sns.kdeplot(df[feature], label=f"Class {k}", ax=ax)
        ax.set_xlabel(feature, fontsize=9)
        ax.legend()
        ax.tick_params(axis='x', which='major', labelsize=6, pad=-6)
        ax.tick_params(axis='y', which='major', labelsize=6)

    plt.show()

In [None]:
features = [feature for feature in train_df.columns if feature not in ['id', 'target']]
features = features[:9]

In [None]:
t1 = train_df.loc[train_df['target'] == 'Class_1']
t2 = train_df.loc[train_df['target'] == 'Class_2']
t3 = train_df.loc[train_df['target'] == 'Class_3']
t4 = train_df.loc[train_df['target'] == 'Class_4']
t5 = train_df.loc[train_df['target'] == 'Class_5']
t6 = train_df.loc[train_df['target'] == 'Class_6']
t7 = train_df.loc[train_df['target'] == 'Class_7']
t8 = train_df.loc[train_df['target'] == 'Class_8']
t9 = train_df.loc[train_df['target'] == 'Class_9']
plot_feature_distribution(t1, t2,t3,t4,t5,t6,t7,t8,t9,features)

In [None]:
features = [feature for feature in train_df.columns if feature not in ['id', 'target']]


<h4 style = "background-color: white; color : #fe346e; font-size: 30px; font-family:garamond; font-weight:normal; border-radius: 75px 150px; text-align: left"> Distribution of mean values</h4> 

In [None]:
run = wandb.init(project='TPSJune', job_type='image-visualization',name='Distribution of mean values')
mean_train = pd.DataFrame(train_df[features].mean())
mean_train = mean_train.reset_index(drop=True)
mean_train.columns = ['MeanDistribution']
table = wandb.Table(data=mean_train, columns = ["MeanDistribution"])
wandb.log({"Distribution of mean values" : wandb.plot.histogram(table, "MeanDistribution", title="Distribution of mean values")})


run.finish()

run

In [None]:
plt.figure(figsize=(16,6))
plt.title("Distribution of mean values")
sns.distplot(train_df[features].mean(axis=0),color="magenta",kde=True,bins=120, label='train')
sns.distplot(test_df[features].mean(axis=0),color="darkblue", kde=True,bins=120, label='test')
plt.legend()
plt.show()

<h4 style = "background-color: white; color : #fe346e; font-size: 30px; font-family:garamond; font-weight:normal; border-radius: 75px 150px; text-align: left">Distribution of Standard Deviation</h4> 

In [None]:
plt.figure(figsize=(16,6))
plt.title("Distribution of std values")
sns.distplot(train_df[features].std(),color="blue",kde=True,bins=120, label='train')
sns.distplot(test_df[features].std(),color="green", kde=True,bins=120, label='test')
plt.legend()
plt.show()

<h4 style = "background-color: white; color : #fe346e; font-size: 30px; font-family:garamond; font-weight:normal; border-radius: 75px 150px; text-align: left">Distribution of min values</h4> 

In [None]:
plt.figure(figsize=(16,6))
plt.title("Distribution of min values ")
sns.distplot(train_df[features].min(),color="red", kde=True,bins=120, label='train')
sns.distplot(test_df[features].min(),color="orange", kde=True,bins=120, label='test')
plt.legend()
plt.show()

<h4 style = "background-color: white; color : #fe346e; font-size: 30px; font-family:garamond; font-weight:normal; border-radius: 75px 150px; text-align: left">Distribution of max values</h4> 

In [None]:
plt.figure(figsize=(16,6))
plt.title("Distribution of max values ")
sns.distplot(train_df[features].max(),color="brown", kde=True,bins=120, label='train')
sns.distplot(test_df[features].max(),color="yellow", kde=True,bins=120, label='test')
plt.legend()
plt.show()

<h4 style = "background-color: white; color : #fe346e; font-size: 30px; font-family:garamond; font-weight:normal; border-radius: 75px 150px; text-align: left"> Distribution of Skewness</h4> 

In [None]:
plt.figure(figsize=(16,6))
plt.title("Distribution of skew ")
sns.distplot(train_df[features].skew(),color="red", kde=True,bins=120, label='train')
sns.distplot(test_df[features].skew(),color="orange", kde=True,bins=120, label='test')
plt.legend()
plt.show()

<h4 style = "background-color: white; color : #fe346e; font-size: 30px; font-family:garamond; font-weight:normal; border-radius: 75px 150px; text-align: left"> Distribution of Kurtosis</h4> 

In [None]:
plt.figure(figsize=(16,6))
plt.title("Distribution of kurtosis ")
sns.distplot(train_df[features].kurtosis(),color="darkblue", kde=True,bins=120, label='train')
sns.distplot(test_df[features].kurtosis(),color="yellow", kde=True,bins=120, label='test')
plt.legend()
plt.show()

<h4 style = "background-color: white; color : #fe346e; font-size: 30px; font-family:garamond; font-weight:normal; border-radius: 75px 150px; text-align: left"> Correlation of Features</h4> 

In [None]:
%%time
correlations = train_df[features].corr().abs().unstack().sort_values(kind="quicksort").reset_index()
correlations = correlations[correlations['level_0'] != correlations['level_1']]
correlations.head(10)

In [None]:
correlations.tail(10)

In [None]:
colormap = plt.cm.RdBu
plt.figure(figsize=(14,12))
# Use all rows of the first 19 feature columns; correlations from only 20 rows would be unreliable.
corr_train = train_df.iloc[:,1:20]
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(corr_train.corr().values,linewidths=0.1,vmax=1.0, 
            square=True, cmap=colormap, linecolor='white', annot=True)

<h4 style = "background-color: white; color : #fe346e; font-size: 30px; font-family:garamond; font-weight:normal; border-radius: 75px 150px; text-align: left"> Checking for Duplicate Values</h4> 

In [None]:
%%time
unique_max_train = []
unique_max_test = []
for feature in features:
    values = train_df[feature].value_counts()
    unique_max_train.append([feature, values.max(), values.idxmax()])
    values = test_df[feature].value_counts()
    unique_max_test.append([feature, values.max(), values.idxmax()])

In [None]:
pd.DataFrame(unique_max_train, columns=['Feature', 'Max duplicates', 'Value'])\
    .sort_values(by='Max duplicates', ascending=False).head(15).T

In [None]:
pd.DataFrame(unique_max_test, columns=['Feature', 'Max duplicates', 'Value'])\
    .sort_values(by='Max duplicates', ascending=False).head(15).T

Work in progress 🚧

References :

https://www.kaggle.com/dwin183287/tps-june-2021-eda

https://www.kaggle.com/bhuvanchennoju/data-storytelling-auc-focus-on-strokes

https://www.kaggle.com/c/santander-value-prediction-challenge