---
# 1. Import the libraries
---

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# Load the Kaggle data
df_train=pd.read_csv("data/train.csv")
df_test=pd.read_csv("data/test.csv")

In [3]:
print("Train data shape : ",df_train.shape)
print("Test data shape  : ",df_test.shape)

Train data shape :  (188318, 132)
Test data shape  :  (125546, 131)


In [4]:
df_train.head()

Unnamed: 0,id,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,...,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13,cont14,loss
0,1,A,B,A,B,A,A,A,A,B,...,0.718367,0.33506,0.3026,0.67135,0.8351,0.569745,0.594646,0.822493,0.714843,2213.18
1,2,A,B,A,A,A,A,A,A,B,...,0.438917,0.436585,0.60087,0.35127,0.43919,0.338312,0.366307,0.611431,0.304496,1283.6
2,5,A,B,A,A,B,A,A,A,B,...,0.289648,0.315545,0.2732,0.26076,0.32446,0.381398,0.373424,0.195709,0.774425,3005.09
3,10,B,B,A,B,A,A,A,A,B,...,0.440945,0.391128,0.31796,0.32128,0.44467,0.327915,0.32157,0.605077,0.602642,939.85
4,11,A,B,A,B,A,A,A,A,B,...,0.178193,0.247408,0.24564,0.22089,0.2123,0.204687,0.202213,0.246011,0.432606,2763.85


### Check the null values

In [5]:
# Check for null values in the train dataset:
# count the nulls in each column
df_train.isna().sum()

id        0
cat1      0
cat2      0
cat3      0
cat4      0
         ..
cont11    0
cont12    0
cont13    0
cont14    0
loss      0
Length: 132, dtype: int64

In [6]:
# Total number of null values in the train dataset
df_train.isna().sum().sum()  # no null values in the train set

0

In [7]:
# Total number of null values in the test dataset
df_test.isna().sum().sum()  # no null values

0

In [8]:
print('First 10 columns: {0} \n Last 10 columns: {1}'.format(list(df_train.columns[:10]), list(df_train.columns[-10:])))


First 10 columns: ['id', 'cat1', 'cat2', 'cat3', 'cat4', 'cat5', 'cat6', 'cat7', 'cat8', 'cat9'] 
 Last 10 columns: ['cont6', 'cont7', 'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13', 'cont14', 'loss']


In [9]:
df_train.shape

(188318, 132)

In [10]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188318 entries, 0 to 188317
Columns: 132 entries, id to loss
dtypes: float64(15), int64(1), object(116)
memory usage: 189.7+ MB


### We can see that there are 116 categorical columns (as their names suggest) and 14 continuous (numerical) columns. The names and values of the variables have been standardized to anonymize the data. There are also id and loss columns, which brings the total to 132 columns. We will drop the id column, as it carries no predictive information, and **loss** is the target variable. That leaves 132-2=130 features.

### Let’s get the column names of all features, categorical features and continuous features separately. We can also see that the categorical variables are stored as text. Most machine learning models work only with numeric variables, so we need to encode them.

---
### Split the feature names

In [11]:
features=[x for x in df_train.columns if x not in ['id','loss']]
# features
cat_features=[x for x in df_train.select_dtypes(include=['object']).columns if x not in ['id','loss'] ]
# cat_features
num_features=[x for x in df_train.select_dtypes(exclude=['object']).columns if x not in ['id','loss']]
# num_features

#### Encode categorical features

In [12]:
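# Stack train and test features so both share one category-to-code mapping
df=pd.concat((df_train[features],df_test[features])).reset_index(drop=True)
# df

# Replace each categorical column with its integer category codes
for col in cat_features:
    df[col]=df[col].astype('category').cat.codes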
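
The encoding cell above concatenates train and test before calling `.cat.codes`. A minimal sketch of why that matters, using toy frames with made-up values (`toy_train` and `toy_test` are hypothetical stand-ins, not the Kaggle data): encoding the frames separately can assign different integers to the same letter.

```python
import pandas as pd

# Toy stand-ins for df_train / df_test (hypothetical values, not the Kaggle data).
toy_train = pd.DataFrame({"cat1": ["A", "B", "A"]})
toy_test = pd.DataFrame({"cat1": ["C", "A"]})

# Encoding the concatenated frame gives one shared mapping: A -> 0, B -> 1, C -> 2.
combined = pd.concat((toy_train, toy_test)).reset_index(drop=True)
combined_codes = combined["cat1"].astype("category").cat.codes.tolist()
print(combined_codes)  # [0, 1, 0, 2, 0]

# Encoding the test frame on its own maps C -> 1, which disagrees with the mapping above.
separate_test_codes = toy_test["cat1"].astype("category").cat.codes.tolist()
print(separate_test_codes)  # [1, 0]
```

Note that the codes follow the sorted order of the category labels, so they impose an arbitrary ordering that a linear model will treat as numeric.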


### Prepare the training data

In [13]:
n_train=df_train.shape[0]
n_train

188318

In [14]:
X_train,y_train=df.iloc[:n_train,:], df_train['loss']
X_test=df.iloc[n_train:,:]
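
The positional split above relies on the train rows coming first in the concatenated frame. A small sketch on toy frames (all names and values here are hypothetical) shows the slices recovering the original row counts and staying aligned with the target:

```python
import pandas as pd

# Toy stand-ins (hypothetical values): 3 labelled train rows, 2 unlabelled test rows.
toy_train = pd.DataFrame({"cont1": [0.1, 0.2, 0.3], "loss": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"cont1": [0.4, 0.5]})
toy_features = ["cont1"]

# Same pattern as the notebook: concatenate, then reset the index to 0..n-1.
combined = pd.concat((toy_train[toy_features], toy_test[toy_features])).reset_index(drop=True)
n_toy_train = toy_train.shape[0]

# The first n_toy_train rows are the train rows, the remainder are the test rows.
X_toy_train = combined.iloc[:n_toy_train, :]
X_toy_test = combined.iloc[n_toy_train:, :]
y_toy_train = toy_train["loss"]

print(X_toy_train.shape, X_toy_test.shape)  # (3, 1) (2, 1)
```

Because the train frame keeps its default 0..n-1 index, the feature rows and the target line up by position as well as by label.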

<img src="images/models.png">

### Let’s fit machine learning models. For the purpose of this post, we will use two basic estimators for the regression problem:

---
## **1. Linear regression**:
### simple linear estimator

In [15]:
from sklearn.linear_model import LinearRegression

lr=LinearRegression()
lr.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

### Submission

In [16]:
submission=pd.read_csv("data/sample_submission.csv")
submission.iloc[:,1]=lr.predict(X_test)

# save file
submission.to_csv('lr_starter.csv',index=None)

#### We submit the predictions in lr_starter.csv to Kaggle and get a private score of MSE=1326.67.
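
The score above is computed by Kaggle on hidden test labels. A rough local estimate is possible by holding out part of the labelled rows. The sketch below is an illustration on synthetic data (`X_toy`, `y_toy` and the 80/20 split are assumptions, not part of the original notebook); with the real data, `X_train` and `y_train` would take their place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_train / y_train (hypothetical data, not the Kaggle set).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 5))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y_toy = X_toy @ true_coef + rng.normal(scale=0.1, size=500)

# Hold out 20% of the labelled rows to score the model locally.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=2019
)

lr_toy = LinearRegression().fit(X_fit, y_fit)
val_pred = lr_toy.predict(X_val)

# Report both error measures; which one matches the leaderboard depends on the competition.
val_mse = mean_squared_error(y_val, val_pred)
val_mae = mean_absolute_error(y_val, val_pred)
print(f"validation MSE: {val_mse:.4f}, MAE: {val_mae:.4f}")
```

A hold-out score will not match the leaderboard exactly, but it allows models to be compared without spending a submission.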

### We can improve the result by applying a more sophisticated machine learning algorithm that captures non-linear effects in the data.
---
## **2. Random Forests:**
### non-linear estimator (averaged decision trees).

In [17]:
from sklearn.ensemble import RandomForestRegressor

rf=RandomForestRegressor(n_estimators=100,n_jobs=1,random_state=2019)
rf.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=1, oob_score=False,
                      random_state=2019, verbose=0, warm_start=False)
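
To make "averaged decision trees" concrete, the following sketch fits a small forest on synthetic data (the data and the reduced `n_estimators=20` are assumptions chosen for speed) and checks that the forest's prediction is the mean of the predictions of its individual trees, which scikit-learn exposes through the fitted `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic non-linear target (hypothetical data, not the Kaggle set).
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(300, 2))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1] ** 2

rf_toy = RandomForestRegressor(n_estimators=20, random_state=2019)
rf_toy.fit(X_toy, y_toy)

# Prediction of the whole forest for the first five rows.
forest_pred = rf_toy.predict(X_toy[:5])

# Prediction of each fitted tree for the same rows, stacked as (n_trees, n_rows).
per_tree_pred = np.stack([tree.predict(X_toy[:5]) for tree in rf_toy.estimators_])
mean_of_trees = per_tree_pred.mean(axis=0)

# The forest output is the average over its trees.
averages_match = np.allclose(forest_pred, mean_of_trees)
print(averages_match)
```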

### Submission

In [18]:
submission=pd.read_csv("data/sample_submission.csv")
submission.iloc[:,1]=rf.predict(X_test)
submission.to_csv('rf_starter.csv',index=None)

#### After submitting these predictions to Kaggle, we get an improved MSE=1247.88.

### Although we do not have the MSE of a traditional method to compare with, we can see that once the data is in a proper form, it takes just minutes to create the first machine learning estimators. Of course, we could spend more time on hyperparameter optimization, proper visualizations, or building and deploying the model, but the process is still largely automated. Moreover, because traditional methodologies use limited data and do not discover non-linear patterns in the data, we might guess that the two machine learning models above outperform them.
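
As one possible starting point for the hyperparameter optimization mentioned above, here is a minimal sketch using scikit-learn's `GridSearchCV`. The synthetic data, the tiny grid, and the choice of mean absolute error for scoring are all assumptions made for illustration; a real search on this dataset would use `X_train`, `y_train` and a wider grid.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical, not the Kaggle set).
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 3))
y_toy = np.sin(X_toy[:, 0]) + X_toy[:, 1] * X_toy[:, 2] + rng.normal(scale=0.1, size=200)

# A deliberately small grid: 2 x 2 = 4 candidate settings.
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=2019),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, hence the sign
    cv=3,
)
search.fit(X_toy, y_toy)

best_params = search.best_params_
best_cv_mae = -search.best_score_  # flip the sign back to a positive error
print(best_params, round(best_cv_mae, 3))
```

Each candidate is fitted once per fold, so the cost grows with the grid size times `cv`; on the full training set a randomized search over fewer candidates may be more practical.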