## 1.1 Imports

In [15]:
import pandas as pd
import matplotlib.pyplot as plt
import os
import numpy as np

## 1.2  Load the Forest Cover Data

In [16]:
covtype = pd.read_csv('../data/covtype.csv')

In [17]:
covtype.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581012 entries, 0 to 581011
Data columns (total 55 columns):
 #   Column                              Non-Null Count   Dtype
---  ------                              --------------   -----
 0   Elevation                           581012 non-null  int64
 1   Aspect                              581012 non-null  int64
 2   Slope                               581012 non-null  int64
 3   Horizontal_Distance_To_Hydrology    581012 non-null  int64
 4   Vertical_Distance_To_Hydrology      581012 non-null  int64
 5   Horizontal_Distance_To_Roadways     581012 non-null  int64
 6   Hillshade_9am                       581012 non-null  int64
 7   Hillshade_Noon                      581012 non-null  int64
 8   Hillshade_3pm                       581012 non-null  int64
 9   Horizontal_Distance_To_Fire_Points  581012 non-null  int64
 10  Wilderness_Area1                    581012 non-null  int64
 11  Wilderness_Area2                    581012 non-null 

There are no missing values in the dataset. The dataset consists of 581012 rows and 55 columns: 54 features plus the target.
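The two claims above (no missing values, and the row and column counts) can also be checked programmatically instead of read off the `info()` listing. The sketch below runs on a small toy frame with hypothetical values standing in for `covtype`; the names `toy`, `n_missing`, `n_rows`, and `n_cols` are illustrative only.

```python
import pandas as pd

# Toy stand-in for covtype (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    "Elevation": [2596, 2590, 2804],
    "Slope": [3, 2, 9],
    "Cover_Type": [5, 5, 2],
})

# isna() flags missing cells; summing twice gives the total over the whole frame.
n_missing = int(toy.isna().sum().sum())

# shape is (rows, columns); the column count includes the target column.
n_rows, n_cols = toy.shape

print(n_missing, n_rows, n_cols)
```

On the real data, replacing `toy` with `covtype` would give the same kind of summary.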

In [18]:
covtype.head()

,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type
0,2596,51,3,258,0,510,221,232,148,6279,...,0,0,0,0,0,0,0,0,0,5
1,2590,56,2,212,-6,390,220,235,151,6225,...,0,0,0,0,0,0,0,0,0,5
2,2804,139,9,268,65,3180,234,238,135,6121,...,0,0,0,0,0,0,0,0,0,2
3,2785,155,18,242,118,3090,238,238,122,6211,...,0,0,0,0,0,0,0,0,0,2
4,2595,45,2,153,-1,391,220,234,150,6172,...,0,0,0,0,0,0,0,0,0,5


Our goal is to predict the forest cover type, so the target is the last column, **Cover_Type**.


## 1.3 Explore the Data

### 1.3.1 Numeric Features

Let's explore the spread of the potential features. We exclude the categorical features **Wilderness_Area** and **Soil_Type** here.

In [19]:
none_catcol = ~covtype.columns.str.match(pat = '(Soil_Type.*)|(Wilderness_Area.*)')


In [20]:
covtype.iloc[:,none_catcol].describe()

,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Cover_Type
count,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0
mean,2959.365301,155.656807,14.103704,269.428217,46.418855,2350.146611,212.146049,223.318716,142.528263,1980.291226,2.051471
std,279.984734,111.913721,7.488242,212.549356,58.295232,1559.25487,26.769889,19.768697,38.274529,1324.19521,1.396504
min,1859.0,0.0,0.0,0.0,-173.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,2809.0,58.0,9.0,108.0,7.0,1106.0,198.0,213.0,119.0,1024.0,1.0
50%,2996.0,127.0,13.0,218.0,30.0,1997.0,218.0,226.0,143.0,1710.0,2.0
75%,3163.0,260.0,18.0,384.0,69.0,3328.0,231.0,237.0,168.0,2550.0,2.0
max,3858.0,360.0,66.0,1397.0,601.0,7117.0,254.0,254.0,254.0,7173.0,7.0


The features have widely varying ranges and scales: **Slope** has a mean of about 14.10 degrees with a standard deviation of 7.49, while **Elevation** has a mean of about 2959.37 with a standard deviation of 279.98.
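Differences in scale like these are often removed by standardizing each column before fitting a scale-sensitive model. This notebook does not do that here; the following is only a minimal sketch of z-score standardization on a toy frame with illustrative values (`toy` and `standardized` are hypothetical names).

```python
import pandas as pd

# Toy frame with two columns on very different scales (illustrative values).
toy = pd.DataFrame({
    "Elevation": [2596.0, 2590.0, 2804.0, 2785.0],
    "Slope": [3.0, 2.0, 9.0, 18.0],
})

# z-score: subtract each column's mean and divide by its standard deviation.
# pandas' std() uses the sample standard deviation (ddof=1) by default.
standardized = (toy - toy.mean()) / toy.std()

# After standardization every column has mean ~0 and standard deviation ~1.
print(standardized.mean().round(6))
print(standardized.std().round(6))
```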

#### 1.3.1.1 Hillshade Index and Cover Type

In [21]:
# mean hillshade index by cover type (select the columns before aggregating)
covtype.groupby('Cover_Type')[['Hillshade_9am','Hillshade_Noon','Hillshade_3pm']].mean()

Cover_Type,Hillshade_9am,Hillshade_Noon,Hillshade_3pm
1,211.998782,223.430211,143.875038
2,213.844423,225.326596,142.983466
3,201.918415,215.826537,140.367176
4,228.345832,216.997088,111.392792
5,223.474876,219.035816,121.920889
6,192.844302,209.827662,148.284044
7,216.967723,221.746026,134.932033


In [22]:
# sum of the mean hillshade indices by cover type, sorted ascending
covtype.groupby('Cover_Type')[['Hillshade_9am','Hillshade_Noon','Hillshade_3pm'
                              ]].mean().sum(axis = 1).sort_values()

Cover_Type
6    550.956009
4    556.735712
3    558.112127
5    564.431581
7    573.645783
1    579.304031
2    582.154486
dtype: float64

#### 1.3.1.2 Explore Suspicious Values


As the summary table shows, every feature is non-negative except **Vertical_Distance_To_Hydrology**, whose minimum is -173. This feature is the vertical distance from the sample location to its nearest surface water, so a negative value is plausible: the sign indicates whether the water lies above or below the sample location. Let's check whether the minimum is an isolated negative value, which might point to an input error, or one of many.



In [23]:
covtype.sort_values('Vertical_Distance_To_Hydrology').Vertical_Distance_To_Hydrology

560962   -173
554446   -166
560804   -166
554253   -164
554637   -163
         ... 
224576    597
225047    598
223210    598
224112    599
223655    601
Name: Vertical_Distance_To_Hydrology, Length: 581012, dtype: int64
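The sorted output above already shows several negative values close to the minimum. A direct count of negative entries is a more compact check; the sketch below uses a toy Series with illustrative values in place of the real column (`s`, `n_negative`, and `share_negative` are hypothetical names).

```python
import pandas as pd

# Toy stand-in for covtype['Vertical_Distance_To_Hydrology'] (illustrative values).
s = pd.Series([-173, -6, 0, 65, 118, 601], name="Vertical_Distance_To_Hydrology")

# A boolean mask sums to the number of True entries, i.e. the negative values.
n_negative = int((s < 0).sum())

# Share of negative values among all rows.
share_negative = n_negative / len(s)

print(n_negative, round(share_negative, 3))
```

A large count would suggest that negative distances are a regular feature of the data and not a one-off entry error.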

### 1.3.2 Categorical Features

#### 1.3.2.1 Counts of Categorical Features

The categorical feature Wilderness_Area consists of 4 different wilderness areas, where:

- Wilderness_Area1 is Rawah Wilderness Area
- Wilderness_Area2 is Neota Wilderness Area
- Wilderness_Area3 is Comanche Peak Wilderness Area
- Wilderness_Area4 is Cache la Poudre Wilderness Area

Let's find out the count for each wilderness area in our dataset.

In [24]:
covtype.iloc[:,covtype.columns.str.match(pat = 'Wilderness_Area.*')].sum()

Wilderness_Area1    260796
Wilderness_Area2     29884
Wilderness_Area3    253364
Wilderness_Area4     36968
dtype: int64
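The four counts above add up to 581012, the number of rows, which is consistent with each row belonging to exactly one wilderness area. A row-level check of that one-hot property could look like the sketch below, shown on a toy frame with hypothetical values (`toy`, `area_cols`, `row_sums`, and `is_one_hot` are illustrative names).

```python
import pandas as pd

# Toy stand-in with the same indicator layout as covtype (hypothetical rows).
toy = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 0, 1],
    "Wilderness_Area2": [0, 1, 0, 0],
    "Wilderness_Area3": [0, 0, 1, 0],
    "Wilderness_Area4": [0, 0, 0, 0],
    "Cover_Type": [5, 2, 2, 1],
})

# Select the indicator columns by name prefix.
area_cols = toy.columns[toy.columns.str.match("Wilderness_Area")]

# For a valid one-hot encoding, each row's indicators sum to exactly 1.
row_sums = toy[area_cols].sum(axis=1)
is_one_hot = bool((row_sums == 1).all())

print(is_one_hot)
```

The same check applies to the `Soil_Type` columns by changing the prefix.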

Let's also check out the count for each soil type. 

In [25]:
covtype.iloc[:,covtype.columns.str.match(pat = 'Soil_Type.*')].sum()

Soil_Type1       3031
Soil_Type2       7525
Soil_Type3       4823
Soil_Type4      12396
Soil_Type5       1597
Soil_Type6       6575
Soil_Type7        105
Soil_Type8        179
Soil_Type9       1147
Soil_Type10     32634
Soil_Type11     12410
Soil_Type12     29971
Soil_Type13     17431
Soil_Type14       599
Soil_Type15         3
Soil_Type16      2845
Soil_Type17      3422
Soil_Type18      1899
Soil_Type19      4021
Soil_Type20      9259
Soil_Type21       838
Soil_Type22     33373
Soil_Type23     57752
Soil_Type24     21278
Soil_Type25       474
Soil_Type26      2589
Soil_Type27      1086
Soil_Type28       946
Soil_Type29    115247
Soil_Type30     30170
Soil_Type31     25666
Soil_Type32     52519
Soil_Type33     45154
Soil_Type34      1611
Soil_Type35      1891
Soil_Type36       119
Soil_Type37       298
Soil_Type38     15573
Soil_Type39     13806
Soil_Type40      8750
dtype: int64

#### 1.3.2.2 Counts of Forest Type by wilderness areas

In [26]:
def create_label(df):
    '''Add a Label column naming the active indicator column in each row.'''
    # idxmax(axis=1) returns, per row, the name of the column with the largest value
    df['Label'] = df.loc[:, df.columns != 'Cover_type'].idxmax(axis = 1)
    return df
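To illustrate what the labeling step produces, the sketch below applies the same idea to a toy frame with hypothetical rows. It compares a row-wise `argmax` lookup with `DataFrame.idxmax(axis=1)`, which gives the same labels without a Python-level loop over rows; `toy`, `indicators`, `labels_apply`, and `labels_idxmax` are illustrative names.

```python
import pandas as pd

# Toy one-hot frame (hypothetical rows) plus the target column.
toy = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 0],
    "Wilderness_Area2": [0, 0, 1],
    "Wilderness_Area3": [0, 1, 0],
    "Cover_type": [5, 2, 1],
})

# Keep only the indicator columns.
indicators = toy.loc[:, toy.columns != "Cover_type"]

# Row-wise version: position of the maximum, mapped back to a column name.
labels_apply = indicators.apply(lambda row: row.index[row.argmax()], axis=1)

# Vectorized version: idxmax returns the column name of each row's maximum.
labels_idxmax = indicators.idxmax(axis=1)

print(list(labels_idxmax))
```

On a frame with several hundred thousand rows the vectorized form is usually much faster, though the exact difference depends on the environment.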

In [27]:
# Extract all columns whose name contains 'Wilderness_Area'
# (.copy() avoids a SettingWithCopyWarning when the target column is added)
wild_cover = covtype.loc[:,covtype.columns[covtype.columns.str.contains(pat = 'Wilderness_Area.*')]].copy()
wild_cover['Cover_type'] = covtype.Cover_Type

In [28]:
wild_cover_label = create_label(wild_cover)

In [29]:
#counts of forest cover type by wilderness_area
wild_cover_label[['Label','Cover_type']]\
                                        .groupby(['Label','Cover_type'])\
                                        .size()\
                                        .to_frame('size')

Label,Cover_type,size
Wilderness_Area1,1,105717
Wilderness_Area1,2,146197
Wilderness_Area1,5,3781
Wilderness_Area1,7,5101
Wilderness_Area2,1,18595
Wilderness_Area2,2,8985
Wilderness_Area2,7,2304
Wilderness_Area3,1,87528
Wilderness_Area3,2,125093
Wilderness_Area3,3,14300
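
The long-format counts above can also be laid out as a wide table, one row per area and one column per cover type, which makes absent combinations show up as zeros. This is a sketch using `pd.crosstab` on a toy frame with hypothetical rows, not the notebook's own method; `toy`, `long_counts`, and `wide_counts` are illustrative names.

```python
import pandas as pd

# Toy stand-in for the Label / Cover_type pairs (hypothetical rows).
toy = pd.DataFrame({
    "Label": ["Wilderness_Area1", "Wilderness_Area1",
              "Wilderness_Area2", "Wilderness_Area1"],
    "Cover_type": [1, 2, 1, 1],
})

# Long format, as in the cell above: one row per observed (Label, Cover_type) pair.
long_counts = toy.groupby(["Label", "Cover_type"]).size().to_frame("size")

# Wide format: combinations that never occur appear explicitly as 0.
wide_counts = pd.crosstab(toy["Label"], toy["Cover_type"])

print(wide_counts)
```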


#### 1.3.2.3 Counts of Forest Type by soil type

In [30]:
# counts of cover_type by soil type
# (.copy() avoids a SettingWithCopyWarning when the target column is added)
soil_cover = covtype.loc[:,covtype.columns[covtype.columns.str.contains(pat = 'Soil_Type.*')]].copy()
soil_cover['Cover_type'] = covtype.Cover_Type
soil_cover_label = create_label(soil_cover)

soil_cover_label[['Label','Cover_type']]\
                                        .groupby(['Label','Cover_type'])\
                                        .size()\
                                        .to_frame('size')

Label,Cover_type,size
Soil_Type1,3,2101
Soil_Type1,4,178
Soil_Type1,6,752
Soil_Type10,1,956
Soil_Type10,2,10803
...,...,...
Soil_Type7,2,105
Soil_Type8,1,43
Soil_Type8,2,136
Soil_Type9,1,161


## 1.4 Save the Data

In [31]:
covtype.shape


(581012, 55)

In [32]:
# save the covtype data

datapath = '../data/'
covtype.to_csv(datapath + 'covtype_step1.csv',index=False)
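
As a sanity check on the save step, a CSV round trip can be sketched as follows. The example writes a toy frame with hypothetical values to a temporary directory and not to `../data/`, and builds the path with `os.path.join`; `toy`, `out_path`, `reloaded`, and `round_trip_ok` are illustrative names.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for covtype (hypothetical values).
toy = pd.DataFrame({"Elevation": [2596, 2590], "Cover_Type": [5, 5]})

with tempfile.TemporaryDirectory() as tmpdir:
    # os.path.join avoids relying on a trailing slash in the directory name.
    out_path = os.path.join(tmpdir, "covtype_step1.csv")

    # index=False keeps the row index out of the file, so no extra column
    # appears when the file is read back.
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# The reloaded frame should match the original in values, dtypes, and columns.
round_trip_ok = toy.equals(reloaded)
print(round_trip_ok)
```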