## 1.1 Imports

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import os
import numpy as np

## 1.2 Load the Data

In [2]:
train = pd.read_csv('../dataset/train.csv')
test = pd.read_csv('../dataset/test.csv')

In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15120 entries, 0 to 15119
Data columns (total 56 columns):
 #   Column                              Non-Null Count  Dtype
---  ------                              --------------  -----
 0   Id                                  15120 non-null  int64
 1   Elevation                           15120 non-null  int64
 2   Aspect                              15120 non-null  int64
 3   Slope                               15120 non-null  int64
 4   Horizontal_Distance_To_Hydrology    15120 non-null  int64
 5   Vertical_Distance_To_Hydrology      15120 non-null  int64
 6   Horizontal_Distance_To_Roadways     15120 non-null  int64
 7   Hillshade_9am                       15120 non-null  int64
 8   Hillshade_Noon                      15120 non-null  int64
 9   Hillshade_3pm                       15120 non-null  int64
 10  Horizontal_Distance_To_Fire_Points  15120 non-null  int64
 11  Wilderness_Area1                    15120 non-null  int64
 12  Wild

In [4]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 565892 entries, 0 to 565891
Data columns (total 55 columns):
 #   Column                              Non-Null Count   Dtype
---  ------                              --------------   -----
 0   Id                                  565892 non-null  int64
 1   Elevation                           565892 non-null  int64
 2   Aspect                              565892 non-null  int64
 3   Slope                               565892 non-null  int64
 4   Horizontal_Distance_To_Hydrology    565892 non-null  int64
 5   Vertical_Distance_To_Hydrology      565892 non-null  int64
 6   Horizontal_Distance_To_Roadways     565892 non-null  int64
 7   Hillshade_9am                       565892 non-null  int64
 8   Hillshade_Noon                      565892 non-null  int64
 9   Hillshade_3pm                       565892 non-null  int64
 10  Horizontal_Distance_To_Fire_Points  565892 non-null  int64
 11  Wilderness_Area1                    565892 non-null 

The training set and the test set share the same data structure, except that the training set has one extra column (56 columns versus 55), which is the target. Neither set has missing values.
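
The comparison above can also be checked programmatically rather than by reading the two `info()` listings. The following is a minimal sketch on two toy frames; `train_demo` and `test_demo` are hypothetical stand-ins for `train` and `test`, and their values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (illustrative values only).
train_demo = pd.DataFrame({"Id": [1, 2], "Elevation": [2596, 2590], "Cover_Type": [5, 5]})
test_demo = pd.DataFrame({"Id": [3, 4], "Elevation": [2680, 2683]})

# Columns that appear in one frame but not in the other.
only_in_train = train_demo.columns.difference(test_demo.columns).tolist()
only_in_test = test_demo.columns.difference(train_demo.columns).tolist()

# Total number of missing cells in each frame.
train_missing = int(train_demo.isna().sum().sum())
test_missing = int(test_demo.isna().sum().sum())

print(only_in_train, only_in_test, train_missing, test_missing)
```

Replacing the toy frames with `train` and `test` should list the target as the only column unique to the training set if the two files really share their feature columns.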

In [5]:
train.head()

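| | Id | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | ... | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2596 | 51 | 3 | 258 | 0 | 510 | 221 | 232 | 148 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 1 | 2 | 2590 | 56 | 2 | 212 | -6 | 390 | 220 | 235 | 151 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 2 | 3 | 2804 | 139 | 9 | 268 | 65 | 3180 | 234 | 238 | 135 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 3 | 4 | 2785 | 155 | 18 | 242 | 118 | 3090 | 238 | 238 | 122 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 5 | 2595 | 45 | 2 | 153 | -1 | 391 | 220 | 234 | 150 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |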


Our target value is the last column **Cover_Type**. 


## 1.3 Explore the Data

### 1.3.1 Numeric Features

Let's explore the spread of the potential features. We exclude the categorical features **Wilderness_Area** and **Soil_Type** here.

In [6]:
none_catcol = ~train.columns.str.match(pat = '(Soil_Type.*)|(Wilderness_Area.*)')


In [7]:
train.iloc[:,none_catcol].describe()

| | Id | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 15120.0 | 15120.0 | 15120.0 | 15120.0 | 15120.0 | 15120.0 | 15120.0 | 15120.0 | 15120.0 | 15120.0 | 15120.0 | 15120.0 |
| mean | 7560.5 | 2749.322553 | 156.676653 | 16.501587 | 227.195701 | 51.076521 | 1714.023214 | 212.704299 | 218.965608 | 135.091997 | 1511.147288 | 4.0 |
| std | 4364.91237 | 417.678187 | 110.085801 | 8.453927 | 210.075296 | 61.239406 | 1325.066358 | 30.561287 | 22.801966 | 45.895189 | 1099.936493 | 2.000066 |
| min | 1.0 | 1863.0 | 0.0 | 0.0 | 0.0 | -146.0 | 0.0 | 0.0 | 99.0 | 0.0 | 0.0 | 1.0 |
| 25% | 3780.75 | 2376.0 | 65.0 | 10.0 | 67.0 | 5.0 | 764.0 | 196.0 | 207.0 | 106.0 | 730.0 | 2.0 |
| 50% | 7560.5 | 2752.0 | 126.0 | 15.0 | 180.0 | 32.0 | 1316.0 | 220.0 | 223.0 | 138.0 | 1256.0 | 4.0 |
| 75% | 11340.25 | 3104.0 | 261.0 | 22.0 | 330.0 | 79.0 | 2270.0 | 235.0 | 235.0 | 167.0 | 1988.25 | 6.0 |
| max | 15120.0 | 3849.0 | 360.0 | 52.0 | 1343.0 | 554.0 | 6890.0 | 254.0 | 254.0 | 248.0 | 6993.0 | 7.0 |


**Id** is just a unique identifier for each observation, so it is not a feature of interest. The features have very different ranges and scales: **Slope** has a mean of about 16.5 degrees with a standard deviation of 8.45, while **Horizontal_Distance_To_Fire_Points** has a mean of about 1511.15 with a standard deviation of 1099.94. We may consider normalizing the data later.
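
As an illustration of what normalizing could look like, here is a minimal sketch of z-score standardization with plain pandas. This is one possible choice, not necessarily the method applied later in the notebook; the frame `demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical sample with two features on very different scales.
demo = pd.DataFrame({
    "Slope": [3, 2, 9, 18, 2],
    "Horizontal_Distance_To_Fire_Points": [100, 500, 1500, 3000, 6000],
})

# Z-score: subtract each column's mean and divide by its (population) std.
standardized = (demo - demo.mean()) / demo.std(ddof=0)

print(standardized.round(3))
```

After this transformation every column has mean 0 and standard deviation 1, so features measured in degrees and in distance units become comparable. In a modeling workflow the mean and standard deviation would normally be computed on the training set only and then reused for the test set.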

#### 1.3.1.1 Drop Unnecessary Columns

In [8]:
train.Id.is_unique

True

In [9]:
train = train.drop('Id',axis = 1)
test = test.drop('Id',axis = 1)

#### 1.3.1.2 Explore Suspicious Values


As the summary table shows, every feature has a non-negative minimum except **Vertical_Distance_To_Hydrology**, whose minimum is -146. This feature is the vertical distance from the sample to its nearest surface water, so a negative value is plausible: the sample and the water can sit at different elevations, and the sign records which of the two is higher. Let's check whether the minimum is an isolated negative value, which might point to an input error, or one of many.



In [10]:
train.sort_values('Vertical_Distance_To_Hydrology').Vertical_Distance_To_Hydrology

10626   -146
1528    -134
7119    -123
12484   -115
11856   -114
        ... 
9697     403
13705    411
1892     547
11938    547
1803     554
Name: Vertical_Distance_To_Hydrology, Length: 15120, dtype: int64
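
The five smallest values in the sorted output are all negative, so -146 is not an isolated entry. To quantify how common negative values are, one could count them directly. The sketch below uses a short hypothetical series `vdh` in place of `train.Vertical_Distance_To_Hydrology`.

```python
import pandas as pd

# Hypothetical stand-in for train.Vertical_Distance_To_Hydrology.
vdh = pd.Series([-146, -134, 0, 32, 554], name="Vertical_Distance_To_Hydrology")

# Number and share of strictly negative values.
n_negative = int((vdh < 0).sum())
share_negative = n_negative / len(vdh)

print(n_negative, share_negative)
```

A sizeable share of negative values on the real column would support reading them as a genuine property of the terrain rather than as input errors.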

### 1.3.2 Categorical Features

#### 1.3.2.1 Counts of Categorical Features

The categorical feature Wilderness_Area consists of 4 different wilderness areas, where:

- Wilderness_Area1 is Rawah Wilderness Area
- Wilderness_Area2 is Neota Wilderness Area
- Wilderness_Area3 is Comanche Peak Wilderness Area
- Wilderness_Area4 is Cache la Poudre Wilderness Area
    
Let's find out the count for each wilderness area in our dataset.

In [11]:
train.iloc[:,train.columns.str.match(pat = 'Wilderness_Area.*')].sum()

Wilderness_Area1    3597
Wilderness_Area2     499
Wilderness_Area3    6349
Wilderness_Area4    4675
dtype: int64
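
The four counts add up to 15120, the number of rows in the training set, which is consistent with every observation belonging to exactly one wilderness area. Matching totals alone do not prove this, so a row-wise check is a useful complement. The following is a minimal sketch on a hypothetical frame `demo`.

```python
import pandas as pd

# Hypothetical one-hot encoded sample (illustrative values only).
demo = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 0],
    "Wilderness_Area2": [0, 1, 0],
    "Wilderness_Area3": [0, 0, 0],
    "Wilderness_Area4": [0, 0, 1],
    "Elevation": [2596, 2590, 2804],
})

# Select the indicator columns and count how many are set in each row.
area_cols = demo.columns[demo.columns.str.match("Wilderness_Area")]
membership = demo[area_cols].sum(axis=1)

# True only if every row has exactly one indicator set.
is_one_hot = bool((membership == 1).all())
print(is_one_hot)
```

The same check applies to the `Soil_Type` columns by changing the pattern.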

Let's also check out the count for each soil type. 

In [12]:
train.iloc[:,train.columns.str.match(pat = 'Soil_Type.*')].sum()

Soil_Type1      355
Soil_Type2      623
Soil_Type3      962
Soil_Type4      843
Soil_Type5      165
Soil_Type6      650
Soil_Type7        0
Soil_Type8        1
Soil_Type9       10
Soil_Type10    2142
Soil_Type11     406
Soil_Type12     227
Soil_Type13     476
Soil_Type14     169
Soil_Type15       0
Soil_Type16     114
Soil_Type17     612
Soil_Type18      60
Soil_Type19      46
Soil_Type20     139
Soil_Type21      16
Soil_Type22     345
Soil_Type23     757
Soil_Type24     257
Soil_Type25       1
Soil_Type26      54
Soil_Type27      15
Soil_Type28       9
Soil_Type29    1291
Soil_Type30     725
Soil_Type31     332
Soil_Type32     690
Soil_Type33     616
Soil_Type34      22
Soil_Type35     102
Soil_Type36      10
Soil_Type37      34
Soil_Type38     728
Soil_Type39     657
Soil_Type40     459
dtype: int64

Soil_Type7 and Soil_Type15 have a count of 0, meaning that no observation in the training set belongs to either soil type.
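
Rather than scanning the list by eye, the empty soil types can be picked out in code. This is a minimal sketch on a hypothetical frame `demo`; whether such columns should later be dropped is a modeling decision that this notebook section does not settle.

```python
import pandas as pd

# Hypothetical one-hot soil columns (illustrative values only).
demo = pd.DataFrame({
    "Soil_Type6": [1, 0, 0],
    "Soil_Type7": [0, 0, 0],
    "Soil_Type8": [0, 1, 1],
    "Elevation": [2596, 2590, 2804],
})

# Sum each soil indicator column and keep the names whose total is zero.
soil_counts = demo.filter(regex=r"^Soil_Type").sum()
empty_soil_types = soil_counts[soil_counts == 0].index.tolist()

print(empty_soil_types)
```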


#### 1.3.2.2 Counts of Forest Type by Wilderness Area

In [13]:
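def create_label(df):   
    '''Add a Label column holding the name of the one-hot column that is set in each row.'''
    # Exclude the target column so that only the indicator columns are compared;
    # argmax returns the position of the (first) largest value in the row.
    df['Label'] = df.loc[:,df.columns != 'Cover_type'
                        ].apply(lambda row : row.index[row.argmax()], axis = 1)
    return df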

In [14]:
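#Extract all columns whose name contains the string 'Wilderness_Area'
#.copy() makes an independent frame, so adding a column does not trigger SettingWithCopyWarning
wild_cover = train.loc[:,train.columns[train.columns.str.contains(pat = 'Wilderness_Area.*')]].copy()
wild_cover['Cover_type'] = train.Cover_Type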

In [15]:
wild_cover_label = create_label(wild_cover)

In [16]:
#counts of forest cover type by wilderness_area
wild_cover_label[['Label','Cover_type']]\
                                        .groupby(['Label','Cover_type'])\
                                        .size()\
                                        .to_frame('size')

| Label | Cover_type | size |
|---|---|---|
| Wilderness_Area1 | 1 | 1062 |
| Wilderness_Area1 | 2 | 1134 |
| Wilderness_Area1 | 5 | 856 |
| Wilderness_Area1 | 7 | 545 |
| Wilderness_Area2 | 1 | 181 |
| Wilderness_Area2 | 2 | 66 |
| Wilderness_Area2 | 7 | 252 |
| Wilderness_Area3 | 1 | 917 |
| Wilderness_Area3 | 2 | 940 |
| Wilderness_Area3 | 3 | 863 |
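
The same counts can be laid out as a two-way table, with one row per wilderness area and one column per cover type, which makes absent combinations visible as zeros. The sketch below is an alternative formulation using `idxmax` and `pd.crosstab` on a hypothetical frame `demo`, not the code used above.

```python
import pandas as pd

# Hypothetical one-hot sample with a target column (illustrative values only).
demo = pd.DataFrame({
    "Wilderness_Area1": [1, 1, 0, 0],
    "Wilderness_Area2": [0, 0, 1, 0],
    "Wilderness_Area3": [0, 0, 0, 1],
    "Cover_Type": [1, 2, 7, 3],
})

# Collapse the indicator columns into one label per row:
# idxmax(axis=1) returns the name of the column holding the row maximum.
area_cols = [c for c in demo.columns if c.startswith("Wilderness_Area")]
label = demo[area_cols].idxmax(axis=1).rename("Label")

# Cross-tabulate area labels against cover types.
counts = pd.crosstab(label, demo["Cover_Type"])
print(counts)
```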


#### 1.3.2.3 Counts of Forest Type by Soil Type

In [19]:
#counts of cover_type by soil type
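#.copy() makes an independent frame, so adding a column does not trigger SettingWithCopyWarning
soil_cover = train.loc[:,train.columns[train.columns.str.contains(pat = 'Soil_Type.*')]].copy()
soil_cover['Cover_type'] = train.Cover_Type
soil_cover_label = create_label(soil_cover)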

soil_cover_label[['Label','Cover_type']]\
                                        .groupby(['Label','Cover_type'])\
                                        .size()\
                                        .to_frame('size')

| Label | Cover_type | size |
|---|---|---|
| Soil_Type1 | 3 | 121 |
| Soil_Type1 | 4 | 139 |
| Soil_Type1 | 6 | 95 |
| Soil_Type10 | 1 | 9 |
| Soil_Type10 | 2 | 81 |
| ... | ... | ... |
| Soil_Type6 | 4 | 244 |
| Soil_Type6 | 6 | 151 |
| Soil_Type8 | 2 | 1 |
| Soil_Type9 | 1 | 1 |
