## The Challenge

In this competition you’ll predict what types of trees there are in an area based on various geographic features.

The competition dataset comes from a study conducted in four wilderness areas within the beautiful Roosevelt National Forest of northern Colorado. These areas represent forests with very little human disturbance – the existing forest cover types there are more a result of ecological processes than of forest management practices.

The data is in raw form and contains categorical data such as wilderness areas and soil type.
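
In the CSV files the soil type is stored as a set of binary indicator columns (the `Soil_Type…` columns visible in the previews further down). The following is a minimal sketch, on a made-up three-row frame, of how such indicator columns could be collapsed into a single label column for exploration. The helper name `collapse_one_hot` and the toy values are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

def collapse_one_hot(df, prefix):
    """Return, per row, the number of the active indicator column named `prefix<N>`."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    # idxmax(axis=1) gives the name of the column holding the 1 in each row
    active = df[cols].idxmax(axis=1)
    # Strip the prefix so that 'Soil_Type2' becomes the integer 2
    return active.str[len(prefix):].astype(int)

# Toy data (hypothetical values, same column naming as the competition files)
toy = pd.DataFrame({
    "Elevation": [2596, 2590, 2804],
    "Soil_Type1": [1, 0, 0],
    "Soil_Type2": [0, 0, 1],
    "Soil_Type3": [0, 1, 0],
})
soil = collapse_one_hot(toy, "Soil_Type")
print(soil.tolist())
```

A single label column like this is convenient for group-bys and plots; tree models can keep using the indicator columns directly.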

## Import Packages

In [1]:
# Path inside a Kaggle kernel
DATA_DIR = '/kaggle/input/learn-together'
# Local override: the run recorded below read the files from ./data
DATA_DIR = 'data'

In [10]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import os
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import classification_report, accuracy_score

import warnings
warnings.filterwarnings('ignore')

# List every file available under DATA_DIR
for dirname, _, filenames in os.walk(DATA_DIR):
    for filename in filenames:
        print(os.path.join(dirname, filename))

data/train.csv
data/test.csv
data/sample_submission.csv
data/sample_submission.csv.zip
data/input
data/test.csv.zip
data/train.csv.zip


## Load Dataset

In [11]:
train_df=pd.read_csv(os.path.join(DATA_DIR, 'train.csv'))
test_df=pd.read_csv(os.path.join(DATA_DIR, 'test.csv'))

In [12]:
train_df.head()

Unnamed: 0,Id,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,...,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type
0,1,2596,51,3,258,0,510,221,232,148,...,0,0,0,0,0,0,0,0,0,5
1,2,2590,56,2,212,-6,390,220,235,151,...,0,0,0,0,0,0,0,0,0,5
2,3,2804,139,9,268,65,3180,234,238,135,...,0,0,0,0,0,0,0,0,0,2
3,4,2785,155,18,242,118,3090,238,238,122,...,0,0,0,0,0,0,0,0,0,2
4,5,2595,45,2,153,-1,391,220,234,150,...,0,0,0,0,0,0,0,0,0,5


In [13]:
test_df.head()

Unnamed: 0,Id,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,...,Soil_Type31,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40
0,15121,2680,354,14,0,0,2684,196,214,156,...,0,0,0,0,0,0,0,0,0,0
1,15122,2683,0,13,0,0,2654,201,216,152,...,0,0,0,0,0,0,0,0,0,0
2,15123,2713,16,15,0,0,2980,206,208,137,...,0,0,0,0,0,0,0,0,0,0
3,15124,2709,24,17,0,0,2950,208,201,125,...,0,0,0,0,0,0,0,0,0,0
4,15125,2706,29,19,0,0,2920,210,195,115,...,0,0,0,0,0,0,0,0,0,0


In [14]:
print("shape training csv: %s" % str(train_df.shape)) 
print("shape test csv: %s" % str(test_df.shape)) 

shape training csv: (15120, 56)
shape test csv: (565892, 55)


## Delete Ids
**Let's delete the Id column in the training set but store it for the test set before deleting**

In [15]:
train_df = train_df.drop(["Id"], axis = 1)

test_ids = test_df["Id"]
test_df = test_df.drop(["Id"], axis = 1)

### Linear relations are apparently very weak; some features are strongly correlated, others are detached. Besides, the numerical features alone seem to be strong predictors. Needs investigation.
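
One way to follow up on the correlation remark is to list the feature pairs whose absolute Pearson correlation exceeds a cutoff. The sketch below runs on synthetic columns rather than the competition data; the function name `strong_pairs` and the 0.8 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def strong_pairs(df, threshold=0.8):
    """List column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle: each pair once, diagonal excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    stacked = corr.where(mask).stack()
    return stacked[stacked > threshold].sort_values(ascending=False)

# Synthetic stand-in data (assumption): "b" is nearly a copy of "a", "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})
pairs = strong_pairs(toy)
print(pairs)
```

On the real data the same call could be applied to `train_df.drop(['Cover_Type'], axis=1)`, keeping in mind that Pearson correlation only captures linear relations.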

## Brute Force

In [16]:
from time import time
from sklearn.model_selection import RandomizedSearchCV

X_e, y_e = train_df.drop(['Cover_Type'], axis=1), train_df['Cover_Type']

In [76]:
param_dist = {"max_depth": [5, 10, 15, 25, 40, 80],
              'max_features': [0.2, 0.4, 0.6, 0.8],
              'n_estimators': [20, 50, 100, 200, 600, 1200, 2000]
             }


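# Named report_search so it does not clash with the report(y_true, y_pred) helper used for predictions
def report_search(results, n_top=3):
    """Print the n_top best-ranked parameter settings from a search's cv_results_."""
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

n_iter_search = 20
# The `iid` argument was removed in scikit-learn 0.24, so it is no longer passed
random_search = RandomizedSearchCV(ExtraTreesClassifier(bootstrap=False), param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5)

start = time()
# The recorded run fit on X_train_reduced / y_train, which are not defined in this notebook;
# the full feature matrix and labels are used here instead
random_search.fit(X_e, y_e)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report_search(random_search.cv_results_)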

RandomizedSearchCV took 443.10 seconds for 20 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.865 (std: 0.005)
Parameters: {'max_depth': 25, 'n_estimators': 2000, 'max_features': 0.6}

Model with rank: 2
Mean validation score: 0.864 (std: 0.006)
Parameters: {'max_depth': 25, 'n_estimators': 600, 'max_features': 0.8}

Model with rank: 3
Mean validation score: 0.863 (std: 0.006)
Parameters: {'max_depth': 30, 'n_estimators': 100, 'max_features': 0.8}



In [44]:
X_train, X_val, y_train, y_val = train_test_split(X_e, y_e, test_size=0.1)

In [46]:
etc = ExtraTreesClassifier(
    bootstrap=True, oob_score=True,
    class_weight= {1:1000, 2: 1000, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1},
    **{'max_depth': 32, 'n_estimators': 2200, 'max_features': 0.7}
    )
etc.fit(X_train, y_train)
etc.oob_score_

0.8716931216931217

In [47]:
y_pred = etc.predict(X_val)
report(y_val, y_pred)

Accuracy: 0.8796296296296297
              precision    recall  f1-score   support

           1       0.84      0.75      0.79       218
           2       0.78      0.70      0.74       193
           3       0.87      0.83      0.85       194
           4       0.95      0.98      0.96       236
           5       0.90      0.97      0.93       235
           6       0.88      0.91      0.89       220
           7       0.91      0.98      0.95       216

    accuracy                           0.88      1512
   macro avg       0.87      0.87      0.87      1512
weighted avg       0.88      0.88      0.88      1512

[[163  32   0   0   5   0  18]
 [ 28 135   2   0  19   7   2]
 [  0   3 161  10   1  19   0]
 [  0   0   4 231   0   1   0]
 [  0   3   3   0 228   1   0]
 [  0   1  15   3   1 200   0]
 [  4   0   0   0   0   0 212]]


In [48]:
# Retrain on all labelled data; bootstrap=False fits every tree on the full training set
etc = ExtraTreesClassifier(
    bootstrap=False,
    class_weight={1: 10000, 2: 10000, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1},
    max_depth=32, n_estimators=2200, max_features=0.7,
    )
etc.fit(X_e, y_e)
etc.score(X_e, y_e)

0.9986772486772487
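
The score above is measured on the same rows the model was fit on, so it says little about performance on unseen data. As a hedged illustration of the gap, the sketch below compares training accuracy with 5-fold cross-validated accuracy on a synthetic dataset. The data from `make_classification` and the small `n_estimators` value are assumptions made to keep the example fast; they are not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the forest cover data (assumption, not the real dataset)
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)

model = ExtraTreesClassifier(n_estimators=100, bootstrap=False, random_state=0)

# Accuracy on the rows the model was fit on
train_acc = model.fit(X, y).score(X, y)

# 5-fold cross-validation: each fold is scored on rows held out from fitting
cv_scores = cross_val_score(model, X, y, cv=5)

print("training accuracy: %.3f" % train_acc)
print("cross-validated accuracy: %.3f (std %.3f)" % (cv_scores.mean(), cv_scores.std()))
```

Fully grown trees fit without bootstrapping tend to memorise their training rows, which is why the held-out estimate is the more useful number when comparing models.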

In [24]:
from sklearn.metrics import classification_report, confusion_matrix
def report(y_true, y_pred):
    print('Accuracy: %s' % accuracy_score(y_true, y_pred))
    print(classification_report(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))

In [39]:
X_1_2 = X_e[y_e < 3]
y_1_2 = y_e[y_e < 3]
X_train, X_val, y_train, y_val = train_test_split(X_1_2, y_1_2, test_size=0.2)
etc = ExtraTreesClassifier(
    bootstrap=False,
    **{'max_depth': 20, 'n_estimators': 3000, 'max_features': 0.6}
    )

etc.fit(X_train, y_train)
y_pred = etc.predict(X_val)

In [40]:
print(etc.score(X_train, y_train))
report(y_val, y_pred)

0.9994212962962963
Accuracy: 0.8171296296296297
              precision    recall  f1-score   support

           1       0.81      0.82      0.81       421
           2       0.83      0.81      0.82       443

    accuracy                           0.82       864
   macro avg       0.82      0.82      0.82       864
weighted avg       0.82      0.82      0.82       864

[[346  75]
 [ 83 360]]


## Predictions

In [49]:
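# `etc` must be the seven-class model fit on all data (cell In [48]),
# not the classes 1-vs-2 experiment shown just above
test_pred = etc.predict(test_df)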

In [51]:
# Save test predictions to file; the column is named 'Id' to match the input data
output = pd.DataFrame({'Id': test_ids,
                       'Cover_Type': test_pred})
output.to_csv('submission.csv', index=False)