<a id='tab'></a>
## Table of Contents ⏩
* [Dataset 📋](#dat)
* [Basic Overview 📺](#1)
  * [Train Dataset](#1.1)
  * [Test Dataset](#1.2)

* [Exploratory Data Analysis 📊](#2)
  * [Target Distribution](#2.1)
  * [Unique value of each feature in datasets](#2.2)
  * [Percentage of zeroes in each feature of datasets](#2.3)
  * [Distribution of each feature in training data](#2.4)
  * [Feature Correlation of training dataset](#2.5)
  * [Distribution of each feature in test dataset](#2.6)
  * [Feature Correlation of test dataset](#2.7)
  * [Distribution of features according to targets](#2.8)
  
* [Conclusions 🤞](#3)
* [References 💫](#4)
  
   

<a id='dat'></a>
# **Dataset 📋**

* The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using a CTGAN. The original dataset deals with predicting the category of an eCommerce product given various attributes about the listing. Although the features are anonymized, they have properties relating to real-world features.

#### If you like this notebook, please share your feedback and upvote if you really like my work.
#### This notebook contains only the EDA of the dataset. I will publish my next notebook on predictions and different methods of dealing with this problem (hoping for good results 😶😏)
#### But for now, enjoy this notebook. I will be back soon 🔜😃

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df_train=pd.read_csv("../input/tabular-playground-series-jun-2021/train.csv")
df_test=pd.read_csv("../input/tabular-playground-series-jun-2021/test.csv")

<html>
<a id="1"></a>
</html>

# 1. **Basic Overview 📺**

<a id='1.1'></a>
### 1.1 Train Dataset

* There are 77 columns in the training dataset: two are the "id" and "target" columns, and the remaining 75 columns are our independent features

* There are no null values in the dataset, but there are a lot of 0 values in each feature. Whether this can affect our predictions is discussed later

* "feature_19" has the highest mean and standard deviation

* We have 9 classes to predict overall

* The data is imbalanced: class_6 (51811) and class_8 (51763) together account for more than 50% of the data

* All the features are of "int64" type except the "target" column, which is of "string" type
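
The null-value and class-balance observations above can be checked with a few lines of pandas. This is only a minimal sketch: `overview` is a hypothetical helper that is not part of the original notebook, and the tiny frame below is made up for illustration. In the notebook you would pass `df_train` instead.

```python
import pandas as pd

def overview(df: pd.DataFrame, target: str = "target") -> dict:
    """Return row count, total null count and per-class share (in %)."""
    class_share = df[target].value_counts(normalize=True).mul(100).round(2)
    return {
        "n_rows": len(df),
        "n_nulls": int(df.isnull().sum().sum()),
        "class_share_pct": class_share.to_dict(),
    }

# Made-up stand-in for df_train (values are illustrative only).
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "feature_0": [0, 0, 3, 1],
    "target": ["class_6", "class_8", "class_6", "class_2"],
})

summary = overview(toy)
print(summary)
```

Calling `overview(df_train)` would give the total number of nulls and the share of each class, which makes the imbalance easy to read off as percentages.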


In [None]:
print("***** Shape of training dataset *****")
print()
print(df_train.shape)


In [None]:
print("***** First five rows of training dataset *****")
print()
df_train.head()

In [None]:
print("***** dtypes of our training dataset *****")
print()
print(df_train.dtypes)

In [None]:
print("***** Basic stats about our training dataset *****")
print()
df_train.drop(['id'],axis=1).describe().T.style.bar(subset=['mean'],color='#205ff2').background_gradient(subset=['std'],cmap='coolwarm').background_gradient(subset=['50%'],cmap='coolwarm').background_gradient(subset=['75%'],cmap='coolwarm')

In [None]:
print("***** Value counts of target columns *****")
print()
df_train['target'].value_counts()

<a id='1.2'></a>
### 1.2 Test Dataset

* There are 76 columns in the test dataset: one is the "id" column, and the remaining 75 columns are our independent features

* There are no null values in the dataset, but there are a lot of 0 values in each feature. Whether this can affect our predictions is discussed later

* "feature_19" has the highest mean and standard deviation

* All the features are of "int64" type


In [None]:
print("***** Shape of test dataset *****")
print()
print(df_test.shape)


In [None]:
print("***** First five rows of test dataset *****")
print()
df_test.head()

In [None]:
print("***** dtypes of our test dataset *****")
print()
print(df_test.dtypes)

In [None]:
print("***** Basic stats about our test dataset *****")
print()
df_test.drop(['id'],axis=1).describe().T.style.bar(subset=['mean'],color='#205ff2').background_gradient(subset=['std'],cmap='coolwarm').background_gradient(subset=['50%'],cmap='coolwarm').background_gradient(subset=['75%'],cmap='coolwarm')

[slide to top](#tab)
<a id='2'> </a>
# 2. **Exploratory Data Analysis 📊**

In [None]:
def with_hue(data,feature,ax):
    
    # Number of categories (x == x filters out NaN values)
    num_of_cat=len([x for x in data[feature].unique() if x==x])
    
    bars=ax.patches
    
    for ind in range(num_of_cat):
        ##     Get every hue bar
        ##     ex. 8 X categories, 4 hues =>
        ##    [0, 8, 16, 24] are hue bars for 1st X category
        hueBars=bars[ind:][::num_of_cat] 
        # Get the total height (for percentages)
        total=sum([x.get_height() for x in hueBars])
        #Printing percentages on bar
        for bar in hueBars:
            percentage='{:.1f}%'.format(100 * bar.get_height()/total)
            ax.text(bar.get_x()+bar.get_width()/2.0,
                   bar.get_height(),
                   percentage,
                    ha="center",va="bottom",fontweight='bold',fontsize=14)
    

    
def without_hue(data,feature,ax):
    
    total=float(len(data))
    bars_plot=ax.patches
    
    for bars in bars_plot:
        percentage = '{:.1f}%'.format(100 * bars.get_height()/total)
        x = bars.get_x() + bars.get_width()/2.0
        y = bars.get_height()
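        # Label each bar with its percentage and raw count (a formatted string, not a tuple)
        ax.text(x, y,'{}\n({:.0f})'.format(percentage,bars.get_height()),ha='center',va='bottom',fontweight='bold',fontsize=14)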

[slide to top](#tab)
<a id='2.1'> </a>
 
###  2.1 Target Distribution

In [None]:
#setting theme
sns.set_theme(context='notebook',style='white',font_scale=3)

#setting the background and foreground color
fig=plt.figure(figsize=(24,12))
ax=plt.axes()
ax.set_facecolor("#F2EDD7FF")
fig.patch.set_color("#F2EDD7FF")

#Dealing with spines
for i in ['left','top','right']:
    ax.spines[i].set_visible(False)
    
ax.grid(linestyle="--",axis='y',color='gray')

#countplot
a=sns.countplot(data=df_train,x='target',saturation=3,palette='cool')

without_hue(df_train,'target',a)

plt.title("Target Distribution",weight='bold')

[slide to top](#tab)
<a id='2.2'> </a>

### 2.2 Unique value of each feature in datasets

* For most features, the number of unique values is the same in the 'train' and 'test' datasets
* feature_15, feature_28, feature_46, feature_59, feature_60 and feature_73 are the features whose number of unique values differs between the two datasets
* feature_60 has the largest difference in the number of unique values
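
The same comparison can also be done numerically instead of reading it off the bar charts. The following is a minimal sketch under assumptions: `unique_count_diff` is a hypothetical helper (not from the original notebook) and the two small frames are made up. In the notebook you would call it with `df_train`, `df_test` and the 75 feature names.

```python
import pandas as pd

def unique_count_diff(train: pd.DataFrame, test: pd.DataFrame, features) -> pd.DataFrame:
    """Table of unique-value counts per feature, keeping only features that differ."""
    out = pd.DataFrame({
        "train": train[features].nunique(),
        "test": test[features].nunique(),
    })
    out["diff"] = out["train"] - out["test"]
    # Keep rows with a non-zero difference, largest absolute difference first.
    return out[out["diff"] != 0].sort_values("diff", key=lambda s: s.abs(), ascending=False)

# Made-up stand-ins for df_train / df_test.
toy_train = pd.DataFrame({"feature_0": [0, 1, 2, 3], "feature_1": [0, 0, 1, 1]})
toy_test = pd.DataFrame({"feature_0": [0, 1, 1, 1], "feature_1": [0, 1, 0, 1]})

diff_table = unique_count_diff(toy_train, toy_test, ["feature_0", "feature_1"])
print(diff_table)
```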

In [None]:
fig,ax=plt.subplots(nrows=1,ncols=3,figsize=(18,12))

#Train Dataset
#ax=plt.axes()
ax[0].set_facecolor("#F2EDD7FF")
    
#ax.set_facecolor("#F2EDD7FF")
fig.patch.set_facecolor("#F2EDD7FF")
sns.set_palette(['#ff355d']*75)

for i in ['top','right']:
    ax[0].spines[i].set_visible(False)
    
ax[0].grid(linestyle="--",axis='x',color='gray')

y_col=df_train.columns[1:76]
y_col=list(y_col)
unique=[]
for i in y_col:
    unique.append(df_train[i].nunique())

a=sns.barplot(x=unique,y=y_col,orient='h',zorder=2,alpha=1,saturation=1,linewidth=0,ax=ax[0])
a.set_xlabel("Unique Values",fontsize=6, weight='bold')
a.set_ylabel("Features",fontsize=6, weight='bold')
a.tick_params(labelsize=6, width=1, length=1.5)
a.text(2,-1.9,"Unique values of each feature in training dataset",fontsize=10,fontweight='bold')
bars=a.patches
for bar in bars:
    x = bar.get_x() + bar.get_width()+2
    y = bar.get_y() + bar.get_height() / 2 
    a.text(x, y,bar.get_width(),ha='center',va='center',fontweight='bold',fontsize=6)
    


#Test Dataset
#ax=plt.axes()
ax[1].set_facecolor("#F2EDD7FF")
    
#ax.set_facecolor("#F2EDD7FF")
fig.patch.set_facecolor("#F2EDD7FF")
sns.set_palette(['#84DE02']*75)

for i in ['top','right']:
    ax[1].spines[i].set_visible(False)
    
ax[1].grid(linestyle="--",axis='x',color='gray')

y_col=df_test.columns[1:76]
y_col=list(y_col)
unique_test=[]
for i in y_col:
    unique_test.append(df_test[i].nunique())

b=sns.barplot(x=unique_test,y=y_col,orient='h',zorder=2,alpha=1,saturation=1,linewidth=0,ax=ax[1])
b.set_xlabel("Unique Values",fontsize=6, weight='bold')
b.set_ylabel("Features",fontsize=6, weight='bold')
b.tick_params(labelsize=6, width=1, length=1.5)
b.text(2,-1.9,"Unique values of each feature in test dataset",fontsize=10,fontweight='bold')
bars=b.patches
for bar in bars:
    x = bar.get_x() + bar.get_width()+2
    y = bar.get_y() + bar.get_height() / 2 
    b.text(x, y,bar.get_width(),ha='center',va='center',fontweight='bold',fontsize=6)
    

    
#Difference in unique values between features
ax[2].set_facecolor("#F2EDD7FF")
    
#ax.set_facecolor("#F2EDD7FF")
fig.patch.set_facecolor("#F2EDD7FF")
sns.set_palette(['#fefe22']*75)

for i in ['top','right']:
    ax[2].spines[i].set_visible(False)
    
ax[2].grid(linestyle="--",axis='x',color='gray')

y_col=df_test.columns[1:76]
y_col=list(y_col)
unique_diff=[]
for i in y_col:
    unique_diff.append(df_train[i].nunique()-df_test[i].nunique())

b=sns.barplot(x=unique_diff,y=y_col,orient='h',zorder=2,alpha=1,saturation=1,linewidth=0,ax=ax[2])
b.set_xlabel("Unique Values",fontsize=6, weight='bold')
b.set_ylabel("Features",fontsize=6, weight='bold')
b.tick_params(labelsize=6, width=1, length=1.5)
b.text(2,-1.9,"Difference of unique values in each features",fontsize=10,fontweight='bold')
bars=b.patches
for bar in bars:
    x = bar.get_x() + bar.get_width()+2
    y = bar.get_y() + bar.get_height() / 2 
    b.text(x, y,bar.get_width(),ha='center',va='center',fontweight='bold',fontsize=6)
    


[slide to top](#tab)
<a id='2.3'> </a>

### 2.3 Percentage of zeroes in each feature of datasets

* The majority of features in both datasets have more than 50% zero values
* The percentage of zeroes in each feature is almost the same in the train and test datasets
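
To put a number on "more than 50% zeroes", the share of zeroes can be computed directly. This is a minimal sketch, assuming a hypothetical helper `zero_share` and a made-up toy frame. In the notebook you would pass `df_train` or `df_test` with the feature columns.

```python
import pandas as pd

def zero_share(df: pd.DataFrame, features) -> pd.Series:
    """Percentage of zero entries in each of the given feature columns."""
    return (df[features] == 0).mean().mul(100)

# Made-up stand-in for the real data.
toy = pd.DataFrame({"feature_0": [0, 0, 0, 5], "feature_1": [1, 0, 2, 3]})

share = zero_share(toy, ["feature_0", "feature_1"])
# Features where more than half of the values are zero.
mostly_zero = share[share > 50].index.tolist()
print(share)
print(mostly_zero)
```

Subtracting `zero_share(df_test, features)` from `zero_share(df_train, features)` gives the per-feature difference shown in the third panel.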

In [None]:
fig,ax=plt.subplots(nrows=1,ncols=3,figsize=(15,12))

#Training dataset
zeroes=(((df_train.iloc[:,1:76]==0).sum())/len(df_train))*100

zero=np.array(zeroes)
hundred=[100]*75
# Reuse the figure created by plt.subplots above (a second plt.figure call would only add an empty figure)
ax[0].set_facecolor("#F2EDD7FF")

fig.patch.set_facecolor("#F2EDD7FF")
sns.set_palette(['#ff355d']*75)

for i in ['top','right']:
    ax[0].spines[i].set_visible(False)
    
ax[0].grid(linestyle="--",axis='x',color='gray')

sns.barplot(y=zeroes.index,x=hundred, color='#dadada',ax=ax[0])
barh = sns.barplot(y=zeroes.index, x=zero,zorder=2,alpha=1,saturation=1,linewidth=0,ax=ax[0])
barh.tick_params(labelsize=6, width=1, length=1.5)
barh.set_xlabel("% of zeroes",fontsize=4, weight='bold')
barh.set_ylabel("Features",fontsize=4, weight='bold')

barh.text(2,-1.9,"% of zeroes in training data of  each feature",fontsize=10,fontweight='bold')

bars=barh.patches
for bar in bars:
    x = bar.get_x() + bar.get_width()+2
    y = bar.get_y() + bar.get_height() / 2 
    percentage=str(bar.get_width())[:5]+"%"
    barh.text(x, y,percentage,ha='center',va='center',fontweight='bold',fontsize=6)

    
    
#Test dataset
zeroes_test=(((df_test.iloc[:,1:76]==0).sum())/len(df_test))*100

zero_test=np.array(zeroes_test)
hundred_test=[100]*75
# Reuse the figure created by plt.subplots above (a second plt.figure call would only add an empty figure)
ax[1].set_facecolor("#F2EDD7FF")

fig.patch.set_facecolor("#F2EDD7FF")
sns.set_palette(['#84DE02']*75)

for i in ['top','right']:
    ax[1].spines[i].set_visible(False)
    
ax[1].grid(linestyle="--",axis='x',color='gray')

sns.barplot(y=zeroes_test.index,x=hundred_test, color='#dadada',ax=ax[1])
barh = sns.barplot(y=zeroes_test.index, x=zero_test,zorder=2,alpha=1,saturation=1,linewidth=0,ax=ax[1])
barh.tick_params(labelsize=6, width=1, length=1.5)
barh.set_xlabel("% of zeroes",fontsize=4, weight='bold')
barh.set_ylabel("Features",fontsize=4, weight='bold')

barh.text(2,-1.9,"% of zeroes in test data of each feature",fontsize=10,fontweight='bold')

bars=barh.patches
for bar in bars:
    x = bar.get_x() + bar.get_width()+2
    y = bar.get_y() + bar.get_height() / 2 
    percentage=str(bar.get_width())[:5]+"%"
    barh.text(x, y,percentage,ha='center',va='center',fontweight='bold',fontsize=6)
    
    
    
#difference in % of zeroes
zeroes_test=(((df_train.iloc[:,1:76]==0).sum())/len(df_train))*100-(((df_test.iloc[:,1:76]==0).sum())/len(df_test))*100

zero_test=np.array(zeroes_test)
hundred_test=[100]*75
# Reuse the figure created by plt.subplots above (a second plt.figure call would only add an empty figure)
ax[2].set_facecolor("#F2EDD7FF")

fig.patch.set_facecolor("#F2EDD7FF")
sns.set_palette(['#84DE02']*75)

for i in ['top','right']:
    ax[2].spines[i].set_visible(False)
    
ax[2].grid(linestyle="--",axis='x',color='gray')

sns.barplot(y=zeroes_test.index,x=hundred_test, color='#dadada',ax=ax[2])
barh = sns.barplot(y=zeroes_test.index, x=zero_test,zorder=2,alpha=1,saturation=1,linewidth=0,ax=ax[2])
barh.tick_params(labelsize=6, width=1, length=1.5)
barh.set_xlabel("% of zeroes",fontsize=4, weight='bold')
barh.set_ylabel("Features",fontsize=4, weight='bold')

barh.text(2,-1.9,"Difference in % of zeroes in each feature",fontsize=10,fontweight='bold')

bars=barh.patches
for bar in bars:
    x = bar.get_x() + bar.get_width()+2
    y = bar.get_y() + bar.get_height() / 2 
    percentage=str(bar.get_width())[:5]+"%"
    barh.text(x, y,percentage,ha='center',va='center',fontweight='bold',fontsize=6)



[slide to top](#tab)
<a id='2.4'> </a>

### 2.4 Distribution of each feature in training data

* Most of the values in each feature are zero
* Almost all the features are right skewed
* Some features, such as feature_16, feature_18 and feature_22, show irregularities in their KDE plots
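
Right skew can also be checked numerically with the sample skewness, where a positive value indicates a longer right tail. This is a minimal sketch, assuming a hypothetical helper `skew_table` and made-up data. In the notebook you would pass `df_train` and the feature columns.

```python
import pandas as pd

def skew_table(df: pd.DataFrame, features) -> pd.Series:
    """Sample skewness per feature, most right-skewed first."""
    return df[features].skew().sort_values(ascending=False)

# Made-up data: feature_0 is mostly zero with a long right tail, feature_1 is symmetric.
toy = pd.DataFrame({
    "feature_0": [0, 0, 0, 0, 1, 2, 10],
    "feature_1": [1, 2, 3, 4, 5, 6, 7],
})

skews = skew_table(toy, ["feature_0", "feature_1"])
print(skews)
```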

In [None]:
print("***** Distribution of each feature in training dataset *****")
fig,ax=plt.subplots(nrows=19,ncols=4,figsize=(20,40))
fig.patch.set_facecolor("#F2EDD7FF")
cols=list(df_train.columns)[1:76]
for i in range(0,19):
    for j in range(0,4):
        try:
            ax[i][j].set_facecolor("#F2EDD7FF")
            a=sns.kdeplot(data=df_train,x=df_train[cols[i*4+j]],ax=ax[i][j],hue=df_train['target'],legend=False,palette="cool",fill=True)
        
            ax[i][j].spines['top'].set_visible(False)
            ax[i][j].spines['left'].set_visible(False)
            ax[i][j].spines['right'].set_visible(False)
            a.set(xticklabels=[])  
            a.set(xlabel=None)
            a.set(yticklabels=[])  
            a.set(ylabel=None)
            #a.set(title=cols[i*4+j])
            a.set_title(cols[i*4+j],size=8,fontweight='bold')
        except IndexError:
            # The 19x4 grid has 76 axes but there are only 75 features
            pass



[slide to top](#tab)
<a id='2.5'> </a>

### 2.5 Feature Correlation of training dataset

* As we can see, the highest correlation between any two features is 0.14, which means there is no significant correlation between any of the available features
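
The strongest pairwise correlation can be read from the correlation matrix directly rather than from the heatmap. The following is a minimal sketch, assuming a hypothetical helper `strongest_pair` and a made-up toy frame. In the notebook you would pass `df_train` with the 75 feature columns.

```python
import numpy as np
import pandas as pd

def strongest_pair(df: pd.DataFrame, features):
    """Return the feature pair with the largest absolute Pearson correlation and its value."""
    corr = df[features].corr().abs()
    # Keep the strict upper triangle so each pair appears once and the diagonal is excluded.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    stacked = upper.stack().dropna()
    pair = stacked.idxmax()
    return pair, float(stacked.loc[pair])

# Made-up data: feature_1 is exactly 2 * feature_0, feature_2 is only weakly related.
toy = pd.DataFrame({
    "feature_0": [1, 2, 3, 4],
    "feature_1": [2, 4, 6, 8],
    "feature_2": [1, 0, 1, 0],
})

pair, value = strongest_pair(toy, ["feature_0", "feature_1", "feature_2"])
print(pair, value)
```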

In [None]:
fig = plt.figure(figsize=(40,40))
fig.patch.set_facecolor("#F2EDD7FF")
ax=plt.axes()
ax.set_facecolor("#F2EDD7FF")

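# Drop the non-numeric 'target' column as well, since recent pandas versions raise on non-numeric data in corr()
corr = df_train.drop(['id','target'],axis=1).corr()

# np.bool was removed from NumPy; the built-in bool is equivalent here
mask = np.zeros_like(corr, dtype=bool)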
mask[np.triu_indices_from(mask)] = True



a=sns.heatmap(corr,
        square=True, center=0, linewidth=0.2,
        mask=mask) 

a.set_title('Feature Correlation of training dataset', loc='left', fontweight='bold')
plt.show()

[slide to top](#tab)
<a id='2.6'> </a>

### 2.6 Distribution of each feature in test dataset

* Most of the values in each feature are zero
* Almost all the features are right skewed
* Some features, such as feature_16, feature_18 and feature_22, show irregularities in their KDE plots

In [None]:
print("***** Distribution of each feature in test dataset *****")
print()
fig,ax=plt.subplots(nrows=19,ncols=4,figsize=(20,40))
fig.patch.set_facecolor("#F2EDD7FF")
cols=list(df_test.columns)[1:76]
for i in range(0,19):
    for j in range(0,4):
        try:
            ax[i][j].set_facecolor("#F2EDD7FF")
            a=sns.kdeplot(data=df_test,x=df_test[cols[i*4+j]],ax=ax[i][j],legend=False,palette="cool",fill=True)
        
            ax[i][j].spines['top'].set_visible(False)
            ax[i][j].spines['left'].set_visible(False)
            ax[i][j].spines['right'].set_visible(False)
            a.set(xticklabels=[])  
            a.set(xlabel=None)
            a.set(yticklabels=[])  
            a.set(ylabel=None)
            #a.set(title=cols[i*4+j])
            a.set_title(cols[i*4+j],size=8,fontweight='bold')
        except IndexError:
            # The 19x4 grid has 76 axes but there are only 75 features
            pass



[slide to top](#tab)
<a id='2.7'> </a>

### 2.7 Feature Correlation of test dataset

* As we can see, the highest correlation between any two features is 0.12, which means there is no significant correlation between any of the available features

In [None]:
fig = plt.figure(figsize=(40,40))
fig.patch.set_facecolor("#F2EDD7FF")
ax=plt.axes()
ax.set_facecolor("#F2EDD7FF")

corr = df_test.drop('id',axis=1).corr()

# np.bool was removed from NumPy; the built-in bool is equivalent here
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True



a=sns.heatmap(corr,
        square=True, center=0, linewidth=0.2,
        mask=mask) 

a.set_title('Feature Correlation of test dataset', loc='left', fontweight='bold')
plt.show()

[slide to top](#tab)
<a id='2.8'> </a>

### 2.8 Distribution of features according to targets

* From the strip plots below, we can observe that none of the features in our train dataset shows an obvious pattern or any clear relationship with any of the available classes
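
A numeric companion to the strip plots is the mean of each feature within each class. If the means barely differ between classes, the feature shows little class separation on average. This is a minimal sketch, assuming a hypothetical helper `per_class_means` and made-up data. In the notebook you would pass `df_train` and the feature columns.

```python
import pandas as pd

def per_class_means(df: pd.DataFrame, features, target: str = "target") -> pd.DataFrame:
    """Mean of every feature within each target class (rows: classes, columns: features)."""
    return df.groupby(target)[features].mean()

# Made-up data: feature_0 differs between classes, feature_1 does not.
toy = pd.DataFrame({
    "feature_0": [0, 2, 4, 6],
    "feature_1": [1, 1, 1, 1],
    "target": ["class_1", "class_1", "class_2", "class_2"],
})

means = per_class_means(toy, ["feature_0", "feature_1"])
# Range of the class means per feature: larger values suggest more class separation.
spread = means.max() - means.min()
print(means)
print(spread)
```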

In [None]:
fig,ax=plt.subplots(nrows=19,ncols=4,figsize=(20,40))
fig.patch.set_facecolor("#F2EDD7FF")
cols=list(df_train.columns)[1:76]
for i in range(0,19):
    for j in range(0,4):
        try:
            ax[i][j].set_facecolor("#F2EDD7FF")
            # Plot the training features against the training target (df_test has no target column)
            a=sns.stripplot(data=df_train,x=df_train['target'],y=df_train[cols[i*4+j]],ax=ax[i][j],palette="cool")
        
            ax[i][j].spines['top'].set_visible(False)
            ax[i][j].spines['left'].set_visible(False)
            ax[i][j].spines['right'].set_visible(False)
            a.set(xticklabels=[])  
            a.set(xlabel=None)
            a.set(yticklabels=[])  
            a.set(ylabel=None)
            #a.set(title=cols[i*4+j])
            a.set_title(cols[i*4+j],size=8,fontweight='bold')
        except IndexError:
            # The 19x4 grid has 76 axes but there are only 75 features
            pass



[slide to top](#tab)
<a id='3'> </a>

# 3. **Conclusions 🤞**

* In my opinion, the train and test datasets are very similar in nature, so we can apply the same preprocessing to both

* All features contain a lot of zeroes. As we don't have any description of the features, we can't say much about what they represent. Maybe not every feature from feature_0 to feature_74 contributes to every class, and that is why the dataset creator put zeroes instead of null values

* There are a lot of features, but none of them is strongly correlated with another, so there is little linear dependence between them



[slide to top](#tab)
<a id='4'> </a>

# 4. **References 💫**

* https://www.kaggle.com/subinium/tps-may-categorical-eda

* https://www.kaggle.com/dwin183287/tps-june-2021-eda

## Please share your feedback in the comment section, and give an upvote if you like my work 😀👍