# Tabular Playground Series - Dec 2021

**Kaggle competitions are incredibly fun and rewarding, but they can also be intimidating for people who are relatively new in their data science journey. In the past, we've launched many Playground competitions that are more approachable than our Featured competitions and thus, more beginner-friendly.**

For this competition, you will be predicting a categorical target based on a number of feature columns given in the data. The data is synthetically generated by a GAN that was trained on the data from the [Forest Cover Type Prediction](https://www.kaggle.com/c/forest-cover-type-prediction/overview) competition. Compared with the original, this dataset (a) is much larger and (b) may or may not have the same relationship to the target.

Please refer to this [data page](https://www.kaggle.com/c/forest-cover-type-prediction/data) for a detailed explanation of the features.

<div id='content'></div>

## Index of Content

* [**1.0 Importing the modules**](#Chapter1)
* [**2.0 Data Loading and Preparation**](#Chapter2)
 * [2.1 Exploring Train Data](#train)
 * [2.2 Exploring Test Data](#test)
* [**3.0 EDA**](#Chapter3)
 * [3.1 Cover_Type](#cover)
 * [3.2 Soil_Type](#soil)
 * [3.3 Wilderness_Area](#wild)
 * [3.4 Feature Distribution](#features)
* [**4.0 Model Building**](#Chapter4)
 * [4.1 splitting into train, val](#split)
 * [4.2 fitting model](#fit)
 * [4.3 validating model](#val)
* [**5.0 Confusion Matrix**](#Chapter5)
* [**6.0 submitting the predictions**](#Chapter6)

<div id='Chapter1'></div>

## 1.0 Importing the modules

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns 
import matplotlib

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

import warnings
warnings.filterwarnings('ignore')

<div id='Chapter2'></div>

## 2.0 Data Loading and Preparation

In [None]:
train = pd.read_csv('../input/tabular-playground-series-dec-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-dec-2021/test.csv')
sample_sub = pd.read_csv('../input/tabular-playground-series-dec-2021/sample_submission.csv')

* **Files**
 * **train.csv** - the training data with the target Cover_Type column
 * **test.csv** - the test set; you will be predicting the Cover_Type for each row in this file (the target integer class)
 * **sample_submission.csv** - a sample submission file in the correct format
 
 <div id='train'></div>

#### 2.1 Exploring Train Data

In [None]:
train.head()

In [None]:
train.describe().T[1:].sort_values(by='mean',ascending=False).style.background_gradient(cmap='YlOrRd')

In [None]:
train.info()

In [None]:
print(f'Number of rows       : {train.shape[0]}\nNumber of columns    : {train.shape[1]}\nNo of missing values : {sum(train.isna().sum())}')

<div id='test'></div>

#### 2.2 Exploring Test Data

In [None]:
test.head()

In [None]:
print(f'Number of rows       : {test.shape[0]}\nNumber of columns    : {test.shape[1]}\nNo of missing values : {sum(test.isna().sum())}') 

In [None]:
test.describe().T[1:].sort_values(by='mean',ascending=False).style.background_gradient(cmap='YlOrRd')

In [None]:
features = []
for i in train.columns:
    if len(train[i].value_counts())<=2:
        features.append(i)
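
The loop above collects the columns that hold at most two distinct values. A minimal sketch of the same check with `DataFrame.nunique`, run on a made-up frame (`toy` and its values are illustrative, not taken from the competition data):

```python
import pandas as pd

# Hypothetical toy frame standing in for `train`; the values are made up.
toy = pd.DataFrame({
    "Elevation": [2500, 2700, 3100, 2900],
    "Wilderness_Area1": [1, 0, 0, 1],
    "Soil_Type7": [0, 0, 0, 0],  # constant column: a single distinct value
})

# nunique() counts the distinct non-null values in each column.
n_unique = toy.nunique()

# Keep the columns with at most two distinct values, in column order.
binary_like = n_unique[n_unique <= 2].index.tolist()
print(binary_like)
```

Constant columns pass the `<= 2` test as well, so this selects "binary or constant" columns, exactly like the loop.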

<div id='Chapter3'></div>

## 3.0 EDA

<div id='cover'></div>

#### 3.1 Cover_Type

In [None]:
d = dict(train['Cover_Type'].value_counts())

fig = plt.figure(figsize=(20, 5), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 2)
gs.update(wspace=0.3, hspace=0.05)

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor('#f6f5f5')
ax1 = fig.add_subplot(gs[0, 1])

ax0.bar(d.keys(),d.values(),color='#ffd514',edgecolor='black')
ax1.set_facecolor('#f6f5f5')
height_per = [i/len(train) for i in d.values()]
ax1.bar(d.keys(),height_per,color='#ff355d',edgecolor='black')

ax0.set_xlabel('Cover_Type')
ax1.set_xlabel('Cover_Type')
ax0.set_ylabel('count')
ax1.set_ylabel('percentage')

for i in ['right','top']:
    ax0.spines[i].set_visible(False)
    ax1.spines[i].set_visible(False)
    
plt.show()

In [None]:
train_l_0,train_l_1 = [],[]
for i in train.columns:
    if len(train[i].value_counts())<=2:
        # value_counts() omits values that never occur, so default the count to 0
        temp = train[i].value_counts()
        train_l_0.append(temp.get(0, 0))
        train_l_1.append(temp.get(1, 0))
        
        
test_l_0,test_l_1 = [],[]
for i in test.columns:
    if len(test[i].value_counts())<=2:
        # value_counts() omits values that never occur, so default the count to 0
        temp = test[i].value_counts()
        test_l_0.append(temp.get(0, 0))
        test_l_1.append(temp.get(1, 0))
        
features = [i for i in train.columns  if len(train[i].value_counts())<30]
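
The per-column counts of zeros and ones gathered above can also be computed without an explicit loop. A small sketch on a made-up frame (the `toy` frame and the `counts` table are assumptions for illustration):

```python
import pandas as pd

# Hypothetical one-hot style columns; the values are made up.
toy = pd.DataFrame({
    "Soil_Type1": [0, 1, 1, 0],
    "Soil_Type2": [0, 0, 0, 0],  # never set to 1
})

# Comparing with 0 or 1 gives a boolean frame; summing counts the True cells per column.
zeros = (toy == 0).sum()
ones = (toy == 1).sum()

counts = pd.DataFrame({"zeros": zeros, "ones": ones})
print(counts)
```

A column that never takes the value 1 simply gets a count of 0, so no missing-key handling is needed.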

<div id='soil'></div>

#### 3.2 Soil_Type

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(6,10),facecolor='#f6f5f5')
gs = fig.add_gridspec(1,2)
gs.update(wspace=.35, hspace=0.05)

ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])

background_color = "#f6f5f5"
sns.set_palette(['#ffd514']*120)

for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)
ax0.set_facecolor(background_color)
ax0_sns = sns.barplot(ax=ax0, y=features[4:-1], x=train_l_0[4:], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
sns.barplot(ax=ax0,y=features[4:-1], x=train_l_1[4:], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1,color='#ff355d')
ax0_sns.set_xlabel("count",fontsize=4, weight='bold')
ax0_sns.set_ylabel("features",fontsize=4, weight='bold')
ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0.text(0, -1.8, 'Train Dataset', fontsize=5, ha='left', va='top', weight='bold')
ax0.text(0, -1.105, 'Number of records with different soil type', fontsize=3, ha='left', va='top')
ax0.get_xaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

for p in ax0.patches:
    value = f'{p.get_width():,.0f} | {(p.get_width()/train.shape[0]):,.1%}'
    x = p.get_x() + p.get_width() + 1000
    y = p.get_y() + p.get_height() / 2 
    ax0.text(x, y, value, ha='left', va='center', fontsize=3, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.2))
    
    
background_color = "#f6f5f5"
sns.set_palette(['#ffd514']*120)

for s in ["right", "top"]:
    ax1.spines[s].set_visible(False)
ax1.set_facecolor(background_color)
ax1_sns = sns.barplot(ax=ax1, y=features[4:-1], x=test_l_0[4:], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
sns.barplot(ax=ax1,y=features[4:-1], x=test_l_1[4:], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1,color='#ff355d')
ax1_sns.set_xlabel("count",fontsize=4, weight='bold')
ax1_sns.tick_params(labelsize=3, width=0.5, length=1.5)
ax1_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax1_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax1.text(0, -1.8, 'Test Dataset', fontsize=5, ha='left', va='top', weight='bold')
ax1.text(0, -1.105, 'Number of records with different soil type', fontsize=3, ha='left', va='top')
ax1.get_xaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))


for p in ax1.patches:
    value = f'{p.get_width():,.0f} | {(p.get_width()/test.shape[0]):,.1%}'
    x = p.get_x() + p.get_width()+1000
    y = p.get_y() + p.get_height() / 2 
    ax1.text(x, y, value, ha='left', va='center', fontsize=3, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.2))
    
plt.show()

In [None]:
d = dict()
d['train'],d['test'] = dict(),dict()
for i in features[:4]:
    d['train'][i] = dict(train[i].value_counts())
    d['test'][i] = dict(test[i].value_counts())
d

<div id='wild'></div>

#### 3.3 Wilderness_Area

In [None]:
fig = plt.figure(figsize=(13,3),facecolor='#f6f5f5')
gs = fig.add_gridspec(1,2)

ax4 = fig.add_subplot(gs[0,:])

x = np.arange(4)
ax4.set_facecolor(background_color)
ax4.bar(x-0.1,[i[0] for i in d['train'].values()],0.2,edgecolor='black')
ax4.bar(x+0.1,[i[1] for i in d['train'].values()],0.2,color='#ff355d',edgecolor='black')
ax4.bar(x,[i[0] for i in d['test'].values()],0.2,edgecolor='black')
ax4.bar(x+0.2,[i[1] for i in d['test'].values()],0.2,color='#ff355d',edgecolor='black')

ax4.set_xticklabels(['','Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3', 'Wilderness_Area4'])
ax4.xaxis.set_major_locator(mtick.MultipleLocator(1))

for i in ['top','right']:
    ax4.spines[i].set_visible(False)
    
for p in ax4.patches:
    value = f'{p.get_height():,.0f}'
    x = p.get_x() + p.get_width()-0.2
    y = p.get_y() + p.get_height()+140000
    ax4.text(x, y, value, ha='left', va='center', fontsize=8, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.2))
       
plt.legend([0,1])
plt.show()

<div id='features'></div>

#### 3.4 Feature Distribution

In [None]:
fig = plt.figure(figsize=(15,15),facecolor='#f6f5f5')
gs = fig.add_gridspec(5,2)
gs.update(wspace=.35, hspace=0.25)

ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])
ax2 = fig.add_subplot(gs[1,0])
ax3 = fig.add_subplot(gs[1,1])
ax4 = fig.add_subplot(gs[2,0])
ax5 = fig.add_subplot(gs[2,1])
ax6 = fig.add_subplot(gs[3,0])
ax7 = fig.add_subplot(gs[3,1])
ax8 = fig.add_subplot(gs[4,0])
ax9 = fig.add_subplot(gs[4,1])

background_color = '#f6f5f5'
ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)
ax2.set_facecolor(background_color)
ax3.set_facecolor(background_color)
ax4.set_facecolor(background_color)
ax5.set_facecolor(background_color)
ax6.set_facecolor(background_color)
ax7.set_facecolor(background_color)
ax8.set_facecolor(background_color)
ax9.set_facecolor(background_color)

# Axis labels must match the plotted column: ax0 shows Slope, ax2 shows Elevation
ax0.hist(train.Slope,bins=100)
ax0.hist(test.Slope,bins=100,color='#ff355d')
ax0.set_xlabel("Slope",fontsize=10, weight='bold')
ax0.text(80, 250000, 'Distribution of data', fontsize=20, fontweight='bold', fontfamily='serif', horizontalalignment='center')

ax1.hist(train.Aspect,bins=100)
ax1.hist(test.Aspect,bins=100,color='#ff355d')
ax1.set_xlabel("Aspect",fontsize=10, weight='bold')

ax2.hist(train.Elevation,bins=100)
ax2.hist(test.Elevation,bins=100,color='#ff355d')
ax2.set_xlabel("Elevation",fontsize=10, weight='bold')

ax3.hist(train.Horizontal_Distance_To_Hydrology,bins=100)
ax3.hist(test.Horizontal_Distance_To_Hydrology,bins=100,color='#ff355d')
ax3.set_xlabel("Horizontal_Distance_To_Hydrology",fontsize=10, weight='bold')

ax4.hist(train.Vertical_Distance_To_Hydrology,bins=100)
ax4.hist(test.Vertical_Distance_To_Hydrology,bins=100,color='#ff355d')
ax4.set_xlabel("Vertical_Distance_To_Hydrology",fontsize=10, weight='bold')

ax5.hist(train.Horizontal_Distance_To_Roadways,bins=100)
ax5.hist(test.Horizontal_Distance_To_Roadways,bins=100,color='#ff355d')
ax5.set_xlabel("Horizontal_Distance_To_Roadways",fontsize=10, weight='bold')

ax6.hist(train.Hillshade_9am,bins=100)
ax6.hist(test.Hillshade_9am,bins=100,color='#ff355d')
ax6.set_xlabel("Hillshade_9am",fontsize=10, weight='bold')

ax7.hist(train.Hillshade_Noon,bins=100)
ax7.hist(test.Hillshade_Noon,bins=100,color='#ff355d')
ax7.set_xlabel("Hillshade_Noon",fontsize=10, weight='bold')

ax8.hist(train.Hillshade_3pm,bins=100)
ax8.hist(test.Hillshade_3pm,bins=100,color='#ff355d')
ax8.set_xlabel("Hillshade_3pm",fontsize=10, weight='bold')

ax9.hist(train.Horizontal_Distance_To_Fire_Points,bins=100)
ax9.hist(test.Horizontal_Distance_To_Fire_Points,bins=100,color='#ff355d')
ax9.set_xlabel("Horizontal_Distance_To_Fire_Points",fontsize=10, weight='bold')

for i in ['top','right']:
    ax0.spines[i].set_visible(False)
    ax1.spines[i].set_visible(False)
    ax2.spines[i].set_visible(False)
    ax3.spines[i].set_visible(False)
    ax4.spines[i].set_visible(False)
    ax5.spines[i].set_visible(False)
    ax6.spines[i].set_visible(False)
    ax7.spines[i].set_visible(False)
    ax8.spines[i].set_visible(False)
    ax9.spines[i].set_visible(False)

ax1.legend(['train','test'],loc=0)
plt.show()

<div id='Chapter4'></div>

## 4.0 Model Building

In [None]:
features = ['Wilderness_Area1','Elevation','Wilderness_Area4','Cover_Type']

In [None]:
train[features].corr()

<div id='split'></div>

#### 4.1 splitting into train, val

In [None]:
x_train,x_val,y_train,y_val = train_test_split(train[features[:-1]],train['Cover_Type'])
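
The call above relies on the defaults of `train_test_split`: a shuffled 75/25 split with no fixed seed, so the validation rows change from run to run. A hedged sketch of a reproducible, stratified variant on made-up data (the arrays, `test_size`, and `random_state` value are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows with imbalanced labels (16 of class 1, 4 of class 2).
X = np.arange(40).reshape(20, 2)
y = np.array([1] * 16 + [2] * 4)

# random_state fixes the shuffle; stratify=y keeps the class ratio in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print("train class counts:", np.bincount(y_tr)[1:])
print("val class counts:  ", np.bincount(y_va)[1:])
```

Stratification raises a `ValueError` when a class has fewer than two rows, so very rare classes may need to be handled before splitting this way.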

In [None]:
model = RandomForestClassifier()

<div id='fit'></div>

#### 4.2 fitting model

In [None]:
model.fit(x_train,y_train)

In [None]:
pred = model.predict(x_val)

<div id='val'></div>

#### 4.3 validating model

In [None]:
print(classification_report(y_val, pred))

<div id='Chapter5'></div>

## 5.0 Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_val, pred, normalize='true')
sns.heatmap(cm, annot=True, cmap="YlOrRd")
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion matrix')
plt.show()
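
With `normalize='true'`, each row of the matrix is divided by the number of actual samples in that class, so the diagonal holds the per-class recall. A tiny sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: four samples of class 1, two samples of class 2.
y_true = [1, 1, 1, 1, 2, 2]
y_pred = [1, 1, 1, 2, 2, 1]

# Rows are actual classes, columns are predicted classes.
# normalize="true" divides every row by its total, so each row sums to 1.
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm)
```

Note that passing a bare array to `sns.heatmap` labels the ticks by position (starting at 0) rather than by class label; the `xticklabels` and `yticklabels` arguments can supply the class labels if needed.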

In [None]:
print('Accuracy : ', accuracy_score(y_val, pred))

In [None]:
final_pred = model.predict(test[features[:-1]])

<div id='Chapter6'></div>

## 6.0 submitting the predictions

In [None]:
submission = pd.DataFrame({'Id': test['Id'], 'Cover_Type': final_pred })
submission.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")
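
Before uploading, it can help to compare the submission against `sample_submission.csv`. A minimal sketch with a hypothetical helper, `check_submission`, run on made-up frames (neither the helper nor the values come from the competition):

```python
import pandas as pd

# Made-up stand-ins for the sample submission file and the generated submission.
sample = pd.DataFrame({"Id": [10, 11, 12], "Cover_Type": [1, 1, 1]})
submission = pd.DataFrame({"Id": [10, 11, 12], "Cover_Type": [2, 1, 3]})

def check_submission(sub, sample):
    """Hypothetical helper: True if `sub` has the same columns and Ids as `sample`."""
    same_columns = list(sub.columns) == list(sample.columns)
    same_ids = sub["Id"].tolist() == sample["Id"].tolist()
    no_missing = not sub.isna().any().any()
    return same_columns and same_ids and no_missing

print(check_submission(submission, sample))
```

In the notebook, the same check could be applied to `submission` and `sample_sub`, which are already loaded.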

In [None]:
submission

# If you find this notebook useful, support with an upvote 👍