# <font color=blue> Forest Cover Type Analysis: </font>
<br>
<font color=teal>**Let us first import the required libraries**</font>

-  pandas to import csv files and handle DataFrames
-  numpy to hold our features and labels as numerical arrays
-  matplotlib.pyplot to visualize our data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

<font color= teal>Next let us examine our dataset.</font>

Here we will check for:
-  missing values
-  datatypes of features

In [2]:
train = pd.read_csv('train.csv', header=0, index_col='Id')
print(train.head())
print(train.info())

    Elevation  Aspect  Slope  Horizontal_Distance_To_Hydrology  \
Id                                                               
1        2596      51      3                               258   
2        2590      56      2                               212   
3        2804     139      9                               268   
4        2785     155     18                               242   
5        2595      45      2                               153   

    Vertical_Distance_To_Hydrology  Horizontal_Distance_To_Roadways  \
Id                                                                    
1                                0                              510   
2                               -6                              390   
3                               65                             3180   
4                              118                             3090   
5                               -1                              391   

    Hillshade_9am  Hillshade_Noon  Hill

In [3]:
test = pd.read_csv('test.csv', header=0, index_col='Id')
print(test.head())
print(test.info())

       Elevation  Aspect  Slope  Horizontal_Distance_To_Hydrology  \
Id                                                                  
15121       2680     354     14                                 0   
15122       2683       0     13                                 0   
15123       2713      16     15                                 0   
15124       2709      24     17                                 0   
15125       2706      29     19                                 0   

       Vertical_Distance_To_Hydrology  Horizontal_Distance_To_Roadways  \
Id                                                                       
15121                               0                             2684   
15122                               0                             2654   
15123                               0                             2980   
15124                               0                             2950   
15125                               0                             2920  
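The output of `info()` lists non-null counts and dtypes, but an explicit per-column count can be easier to scan. Here is a minimal sketch on a made-up three-row frame (the `toy` DataFrame and its values are illustrative only, not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value
toy = pd.DataFrame({
    'Elevation': [2596.0, 2590.0, np.nan],
    'Aspect': [51, 56, 139],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Datatype of each column
column_dtypes = toy.dtypes

print(missing_per_column)
print(column_dtypes)
```

The same two calls can be applied to `train` and `test` directly.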

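<font color=teal>First we will collapse the "one-hot" columns (Wilderness_Area1 to Wilderness_Area4 and Soil_Type1 to Soil_Type40) into one integer-coded column each. Otherwise we will end up with a lot of features. Note that integer codes impose an ordering on the categories, which a distance-based model will treat as meaningful.</font>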

In [4]:
train['Wilderness_Area'] = 0
test['Wilderness_Area'] = 0
for i in range(1,5):
    train['Wilderness_Area'] += i*train['Wilderness_Area{}'.format(i)]
    train.drop('Wilderness_Area{}'.format(i),axis=1,inplace=True)
    test['Wilderness_Area'] += i*test['Wilderness_Area{}'.format(i)]
    test.drop('Wilderness_Area{}'.format(i),axis=1,inplace=True)
train['Soil_Type'] = 0
test['Soil_Type'] = 0
for i in range(1,41):
    train['Soil_Type'] += i*train['Soil_Type{}'.format(i)]
    train.drop('Soil_Type{}'.format(i),axis=1,inplace=True)
    test['Soil_Type'] += i*test['Soil_Type{}'.format(i)]
    test.drop('Soil_Type{}'.format(i),axis=1,inplace=True)
print(train.head())
print(train.info())
print(test.head())
print(test.info())

    Elevation  Aspect  Slope  Horizontal_Distance_To_Hydrology  \
Id                                                               
1        2596      51      3                               258   
2        2590      56      2                               212   
3        2804     139      9                               268   
4        2785     155     18                               242   
5        2595      45      2                               153   

    Vertical_Distance_To_Hydrology  Horizontal_Distance_To_Roadways  \
Id                                                                    
1                                0                              510   
2                               -6                              390   
3                               65                             3180   
4                              118                             3090   
5                               -1                              391   

    Hillshade_9am  Hillshade_Noon  Hill
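The same collapse can also be written without an explicit loop over column indices. The sketch below is an alternative, not the method used above: the helper name `collapse_one_hot` and the toy frame are hypothetical, and it assumes exactly one flag is set per row (a row with no flag set would be coded 1 here, whereas the loop above would give 0).

```python
import numpy as np
import pandas as pd

def collapse_one_hot(df, prefix, n_categories):
    """Replace columns prefix1..prefixN with one integer column named prefix."""
    cols = ['{}{}'.format(prefix, i) for i in range(1, n_categories + 1)]
    # argmax returns the 0-based position of the set flag; add 1 for 1-based codes
    codes = df[cols].to_numpy().argmax(axis=1) + 1
    out = df.drop(columns=cols)
    out[prefix] = codes
    return out

# Hypothetical toy data with three wilderness areas
toy = pd.DataFrame({
    'Elevation': [2596, 2590, 2804],
    'Wilderness_Area1': [1, 0, 0],
    'Wilderness_Area2': [0, 0, 1],
    'Wilderness_Area3': [0, 1, 0],
})

collapsed = collapse_one_hot(toy, 'Wilderness_Area', 3)
print(collapsed)
```

Because the helper returns a new DataFrame instead of modifying its input, it can be applied to `train` and `test` in turn without repeating the loop body.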

## <font color=blue>Statistical Analysis:</font>
<br>
Let's check the various statistical measures of our training data such as mean, standard deviation, range, etc.

In [5]:
print(train.describe())

          Elevation        Aspect         Slope  \
count  15120.000000  15120.000000  15120.000000   
mean    2749.322553    156.676653     16.501587   
std      417.678187    110.085801      8.453927   
min     1863.000000      0.000000      0.000000   
25%     2376.000000     65.000000     10.000000   
50%     2752.000000    126.000000     15.000000   
75%     3104.000000    261.000000     22.000000   
max     3849.000000    360.000000     52.000000   

       Horizontal_Distance_To_Hydrology  Vertical_Distance_To_Hydrology  \
count                      15120.000000                    15120.000000   
mean                         227.195701                       51.076521   
std                          210.075296                       61.239406   
min                            0.000000                     -146.000000   
25%                           67.000000                        5.000000   
50%                          180.000000                       32.000000   
75%            

<font color=teal>As we can see above, our data is not standardized: the mean and variance differ greatly from feature to feature. This matters for distance-based methods such as the one we use below, because features with large ranges (e.g. Elevation) would dominate the distance computation.
<br><br>
Hence, let's use Scikit-learn's sklearn.preprocessing.StandardScaler to bring the mean of each feature to 0 and scale each feature to unit variance.</font>
<br><br>
But before that, we first have to extract the features and labels from the DataFrame and convert them to NumPy arrays so that we can use them with Scikit-learn.

In [6]:
#Training data features:
X_train = train.drop('Cover_Type', axis=1).values #First we drop the label column and then get only the values.
                                                  #Specify axis=1 so that pandas will search for column instead of index.

#Training data labels:
y_train = train['Cover_Type'].values #Get the values on the label column only.

#Recall that Machine Learning is basically a method to make a program find an optimal function mapping X to y.

#Test data features:
X_test = test.values #We can directly take the values as test DataFrame does not have label column.

In [7]:
#Import StandardScaler
from sklearn.preprocessing import StandardScaler

#Instantiate the scaler
scaler = StandardScaler()

#Fit the scaler to the train features
scaler.fit(X_train)

#Transform the train features
X_train_standard = scaler.transform(X_train)

#Print the per-feature mean and standard deviation to confirm
#Each mean should be approximately 0 and each standard deviation approximately 1
print(np.mean(X_train_standard, axis=0)) #Specify axis=0 to compute the mean of each column (feature), not of each row
print(np.std(X_train_standard, axis=0))
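To make explicit what the scaler computes, here is a minimal sketch on a made-up 4x2 array (`X_toy` is illustrative only). It compares StandardScaler against the formula (x - column mean) / column standard deviation and checks the statistics per column, i.e. with axis=0.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features: two columns on very different scales
X_toy = np.array([[2596.0, 51.0],
                  [2590.0, 56.0],
                  [2804.0, 139.0],
                  [2785.0, 155.0]])

scaler_toy = StandardScaler().fit(X_toy)
X_scaled = scaler_toy.transform(X_toy)

# The same transformation written out by hand (population standard deviation)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Per-column statistics after scaling
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(col_means, col_stds)
```

A scaler fitted this way stores the training means and standard deviations, so calling `transform` on new data reuses them instead of recomputing them.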




<font color=teal>Each feature now has zero mean and unit variance, so the data is **standardized**.
<br><br>
Next we will use the **K-Nearest Neighbors** classification technique to predict the cover type. The reasons we use KNN are:
<br></font>
-  We have labeled data.
-  We want to classify new data into known existing groups.
-  Our prediction depends on multiple features whose relationship to the label is not linear.

In [8]:
#Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

#Instantiate the classifier
knn = KNeighborsClassifier()  #We are not defining the number of neighbors (n_neighbors) as we will use GridSearchCV which is explained later on.

<font color=teal>Now that we have our classifier instantiated, we need to pick a good number of neighbors for our classifier.
<br><br>
We could take the manual approach of changing the number of neighbors (which is called a hyperparameter) and running the code again and again. But that is tedious and error-prone.
<br><br>
Instead, we will use something known as a Grid Search.
<br><br>
A grid search automates this: it evaluates every candidate value of each hyperparameter we specify and reports the best one, without our having to set aside extra training data for the comparison.
<br><br>
The grid search uses Cross Validation, which is a way of estimating how well a model generalizes (and so of detecting overfitting and underfitting) without permanently giving up training data. This is done by splitting the data into n parts, keeping one part for testing and training with the other parts. This happens n times, holding out a different part for testing each time. Grid search does this for every combination of hyperparameter values specified and picks the combination with the best average score across the n parts.
<br><br>
But then if the model is trained on the whole training set, we will not be able to measure the ability of the model to generalize to data it has never seen before. Hence we usually keep a holdout set from the training set. This can be done using the train_test_split function from sklearn.model_selection.</font>
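The cross-validation procedure described above can be sketched by hand. This is a simplified illustration on synthetic data (`X_demo`, `y_demo` and the helper `mean_cv_accuracy` are hypothetical names), and it uses a plain shuffled KFold, which is an assumption of this sketch and not necessarily the splitting strategy GridSearchCV applies internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, n_classes=3,
                                     random_state=0)

def mean_cv_accuracy(n_neighbors, X, y, n_splits=5):
    """Average validation accuracy of KNN over n_splits folds."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, val_idx in kf.split(X):
        model = KNeighborsClassifier(n_neighbors=n_neighbors)
        model.fit(X[train_idx], y[train_idx])                 # train on the other parts
        scores.append(model.score(X[val_idx], y[val_idx]))    # test on the held-out part
    return float(np.mean(scores))

# Evaluate every candidate value and keep the one with the best average score
cv_results = {k: mean_cv_accuracy(k, X_demo, y_demo) for k in range(1, 8)}
best_k = max(cv_results, key=cv_results.get)
print(cv_results)
print(best_k)
```

GridSearchCV wraps this loop, and additionally refits the model on all of the data it was given using the best value found.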

In [9]:
#Import train_test_split and GridSearchCV
from sklearn.model_selection import train_test_split, GridSearchCV

#Split the training data
#Specify random_state so that the data is split the same way every time the program is run.
X, X_holdout, y, y_holdout = train_test_split(X_train_standard, y_train, test_size=0.2, random_state=123)

#Define the hyperparameters and values as a dictionary
params = {'n_neighbors':range(1,15)}

#Instantiate GridSearch
gs = GridSearchCV(knn,params,cv=5) #cv defines the number of parts the data is split into

#Fit the model to the training data
gs.fit(X,y)

#Print the accuracy on training and holdout data
print(gs.score(X,y))
print(gs.score(X_holdout,y_holdout))

#Print the best hyperparameter value
print("The optimal number of neighbors is {}.".format(gs.best_params_['n_neighbors']))



1.0
0.7979497354497355
The optimal number of neighbors is 1.
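A training accuracy of 1.0 together with n_neighbors=1 deserves a closer look. The sketch below uses purely random, made-up data (`X_rand`, `y_rand` are illustrative, with 7 label values chosen arbitrarily) to show that a 1-nearest-neighbor model scores perfectly on the points it was fitted on even when the labels carry no signal at all.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Random features with random labels: there is nothing to learn here
X_rand = rng.normal(size=(200, 5))
y_rand = rng.integers(0, 7, size=200)

one_nn = KNeighborsClassifier(n_neighbors=1).fit(X_rand, y_rand)

# Each training point is its own nearest neighbor (distance 0)
train_acc = one_nn.score(X_rand, y_rand)

# Fresh random points expose the lack of real predictive power
X_new = rng.normal(size=(200, 5))
y_new = rng.integers(0, 7, size=200)
new_acc = one_nn.score(X_new, y_new)

print(train_acc, new_acc)
```

This is why the holdout score, not the training score, is the number to watch when the grid search selects a single neighbor.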


<font color=teal>As we can see above, we already get a holdout accuracy of about 0.80. (The training accuracy of 1.0 is expected with a single neighbor, because every training point is its own nearest neighbor, so it tells us little about generalization.) But can we improve the accuracy? Quite possibly. Also, we need to look into other issues such as the time consumed for training and prediction and the inability to visualize the data because it has too many dimensions to plot.
<br><br>
**One technique can help with all of these problems.** We will perform dimensionality reduction on our X dataset using PCA (Principal Component Analysis).
<br><br>
To know more about dimensionality reduction and PCA, watch [this](https://www.youtube.com/watch?v=jPmV3j1dAv4) video by Siraj Raval.</font>