# <font color=blue> Forest Cover Type Analysis: </font>
<br>
<font color=teal>**Let us first import the required libraries**</font>

-  pandas to read CSV files and handle DataFrames
-  numpy for array operations on our features and labels
-  matplotlib.pyplot to visualize our data

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

<font color= teal>Next let us examine our dataset.</font>

Here we will check for:
-  missing values
-  datatypes of features
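
The two checks listed above can be made explicit with `isnull().sum()` and `dtypes`. The sketch below is only an illustration on a hypothetical toy frame: the first three rows mirror the training data, and the missing `Slope` value in the last row is invented purely to show what a gap looks like.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (assumption): rows 1-3 mirror the training data,
# row 4 has a deliberately missing Slope value for demonstration.
toy = pd.DataFrame(
    {
        "Elevation": [2596, 2590, 2804, 2785],
        "Aspect": [51, 56, 139, 155],
        "Slope": [3, 2, 9, np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="Id"),
)

# Number of missing (NaN) entries in each column
missing_per_column = toy.isnull().sum()

# Datatype of each column; a NaN forces an integer column to float
column_dtypes = toy.dtypes

print(missing_per_column)
print(column_dtypes)
```

On the real data, the same two calls would be applied to `train` and `test` in place of the toy frame.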

In [4]:
train = pd.read_csv('train.csv', header=0, index_col='Id')
print(train.head())
train.info() #info() prints its summary directly and returns None, so no print() is needed

    Elevation  Aspect  Slope  Horizontal_Distance_To_Hydrology  \
Id                                                               
1        2596      51      3                               258   
2        2590      56      2                               212   
3        2804     139      9                               268   
4        2785     155     18                               242   
5        2595      45      2                               153   

    Vertical_Distance_To_Hydrology  Horizontal_Distance_To_Roadways  \
Id                                                                    
1                                0                              510   
2                               -6                              390   
3                               65                             3180   
4                              118                             3090   
5                               -1                              391   

    Hillshade_9am  Hillshade_Noon  Hill

In [5]:
test = pd.read_csv('test.csv', header=0, index_col='Id')
print(test.head())
test.info() #info() prints its summary directly and returns None, so no print() is needed

       Elevation  Aspect  Slope  Horizontal_Distance_To_Hydrology  \
Id                                                                  
15121       2680     354     14                                 0   
15122       2683       0     13                                 0   
15123       2713      16     15                                 0   
15124       2709      24     17                                 0   
15125       2706      29     19                                 0   

       Vertical_Distance_To_Hydrology  Horizontal_Distance_To_Roadways  \
Id                                                                       
15121                               0                             2684   
15122                               0                             2654   
15123                               0                             2980   
15124                               0                             2950   
15125                               0                             2920  

## <font color=blue>Statistical Analysis:</font>
<br>
Let's check summary statistics of our training data, such as the mean, standard deviation, minimum, and maximum of each feature.

In [6]:
print(train.describe())

          Elevation        Aspect         Slope  \
count  15120.000000  15120.000000  15120.000000   
mean    2749.322553    156.676653     16.501587   
std      417.678187    110.085801      8.453927   
min     1863.000000      0.000000      0.000000   
25%     2376.000000     65.000000     10.000000   
50%     2752.000000    126.000000     15.000000   
75%     3104.000000    261.000000     22.000000   
max     3849.000000    360.000000     52.000000   

       Horizontal_Distance_To_Hydrology  Vertical_Distance_To_Hydrology  \
count                      15120.000000                    15120.000000   
mean                         227.195701                       51.076521   
std                          210.075296                       61.239406   
min                            0.000000                     -146.000000   
25%                           67.000000                        5.000000   
50%                          180.000000                       32.000000   
75%            
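
`describe()` reports the minimum and maximum of each column but not the range itself. As a minimal sketch, the range can be derived from those two rows. The toy frame here is an assumption: it holds only the `Elevation` and `Slope` values of the first five training rows shown earlier, so its numbers differ from the full-data summary.

```python
import pandas as pd

# Toy frame (assumption): Elevation and Slope of the first five training rows
toy = pd.DataFrame(
    {
        "Elevation": [2596, 2590, 2804, 2785, 2595],
        "Slope": [3, 2, 9, 18, 2],
    }
)

summary = toy.describe()

# Range of each feature = max - min, taken from the describe() table
value_range = summary.loc["max"] - summary.loc["min"]

print(summary.loc[["mean", "std"]])
print(value_range)
```

Even on five rows, the two features sit on very different scales, which is the point the full summary makes.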

<font color=teal>As the summary above shows, our data is not standardized: the mean and variance differ greatly from one feature to the next. For example, Elevation has a mean of about 2749 and a standard deviation of about 418, while Slope has a mean of about 16.5 and a standard deviation of about 8.5. Left as is, the large-valued features would dominate any distance-based model.
<br><br>
Hence, let's use Scikit-learn's sklearn.preprocessing.StandardScaler to bring the mean of each feature to 0 and scale each feature to unit variance.</font>
<br><br>
But before that, we first have to extract the features and labels from the DataFrame and convert them to NumPy arrays so that we can use them with Scikit-learn.

In [8]:
#Training data features: drop the label column and keep only the values.
#axis=1 tells pandas to drop a column rather than a row.
X_train = train.drop('Cover_Type', axis=1).values

#Training data labels: the values of the label column only.
y_train = train['Cover_Type'].values

#Recall that Machine Learning is basically a method to make a program find an optimal function mapping X to y.

#Test data features:
X_test = test.values #We can take the values directly, as the test DataFrame has no label column.

In [13]:
#Import StandardScaler
from sklearn.preprocessing import StandardScaler

#Instantiate the scaler
scaler = StandardScaler()

#Fit the scaler to the train features
scaler.fit(X_train)

#Transform the train features
X_train_standard = scaler.transform(X_train)

#Print the per-feature mean and standard deviation to confirm
print(np.mean(X_train_standard, axis=0)) #axis=0 aggregates over rows, giving one value per feature column
print(np.std(X_train_standard, axis=0))  #Expect means near 0 and standard deviations near 1 (0 for any constant column)
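
To make the transformation concrete: standardization replaces each value x with (x - mean) / std, where the mean and standard deviation are computed per feature column on the training data. The sketch below is an illustration under assumptions, not the notebook's actual run: the toy arrays hold only the `Elevation` and `Slope` values of the first rows shown earlier, and all variable names are hypothetical. It also shows the test features being transformed with the scaler fitted on the training data, which is the usual practice.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays (assumption): Elevation and Slope from the first rows shown earlier
X_toy_train = np.array(
    [[2596.0, 3.0], [2590.0, 2.0], [2804.0, 9.0], [2785.0, 18.0], [2595.0, 2.0]]
)
X_toy_test = np.array([[2680.0, 14.0], [2683.0, 13.0]])

# Fit on the training features only, then transform them
toy_scaler = StandardScaler().fit(X_toy_train)
X_toy_train_std = toy_scaler.transform(X_toy_train)

# Manual version: z = (x - mean) / std, per column (axis=0), population std
manual = (X_toy_train - X_toy_train.mean(axis=0)) / X_toy_train.std(axis=0)

print(np.allclose(X_toy_train_std, manual))  # the two computations should agree
print(X_toy_train_std.mean(axis=0))          # per-column means, close to 0
print(X_toy_train_std.std(axis=0))           # per-column standard deviations, close to 1

# Reuse the training statistics for the test features; do not refit on test data
X_toy_test_std = toy_scaler.transform(X_toy_test)
```

Reusing the fitted scaler keeps the training and test features on the same scale, so distances between them remain comparable.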


<font color=teal>With the scaler applied, the training features are now **standardized** column by column.
<br><br>
Next we will use the **K-Nearest Neighbors** (KNN) classification technique to classify the data. We use KNN because:
<br></font>
-  We have labeled data.
-  We want to classify new data into known existing groups.
-  Our prediction depends on multiple features whose relationship to the label is not linear.

In [None]:
#Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

