## Kaggle Forest Cover Type Prediction
### Logistic regression, Random Forest, and LightGBM

[Competition](https://www.kaggle.com/c/forest-cover-type-prediction). 
In this competition you are asked to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form (not scaled) and contains binary columns of data for qualitative independent variables such as wilderness areas and soil type.

Features (see the [competition data page](https://www.kaggle.com/c/forest-cover-type-prediction/data) for more detail):

* Elevation - Elevation in meters
* Aspect - Aspect in degrees azimuth
* Slope - Slope in degrees
* Horizontal_Distance_To_Hydrology - Horizontal distance to nearest surface water features
* Vertical_Distance_To_Hydrology - Vertical distance to nearest surface water features
* Horizontal_Distance_To_Roadways - Horizontal distance to nearest roadway
* Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice
* Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice
* Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice
* Horizontal_Distance_To_Fire_Points - Horizontal distance to nearest wildfire ignition points
* Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation
* Soil_Type (40 binary columns, 0 = absence or 1 = presence) - Soil Type designation
* Cover_Type (7 types, integers 1 to 7) - Forest Cover Type designation (target)
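
The binary indicator columns can be collapsed back into a single categorical value when that is more convenient for exploration. The sketch below uses a small toy frame rather than the real data, and it assumes the indicator columns are named `Wilderness_Area1` to `Wilderness_Area4` (the same idea would apply to the soil-type columns):

```python
import pandas as pd

# Toy stand-in for the real data; the column naming pattern is an assumption.
toy = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 0],
    "Wilderness_Area2": [0, 0, 1],
    "Wilderness_Area3": [0, 1, 0],
    "Wilderness_Area4": [0, 0, 0],
})

wilderness_cols = [c for c in toy.columns if c.startswith("Wilderness_Area")]

# Sanity check: every row should have exactly one indicator set to 1.
is_one_hot = bool((toy[wilderness_cols].sum(axis=1) == 1).all())

# idxmax returns the name of the column holding the 1;
# stripping the prefix leaves the area number.
wilderness_id = (
    toy[wilderness_cols]
    .idxmax(axis=1)
    .str.replace("Wilderness_Area", "", regex=False)
    .astype(int)
)
print(is_one_hot, wilderness_id.tolist())
```

The notebook itself keeps the indicator columns as they are, which is what the logistic regression below expects.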

**Import libs and load data**

In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

In [2]:
train = pd.read_csv('../../../../../../Documents/TMP/Natural Computing/forest-cover-type-prediction/train.csv',
                   index_col='Id')
test = pd.read_csv('../../../../../../Documents/TMP/Natural Computing/forest-cover-type-prediction/test.csv',
                  index_col='Id')

In [3]:
train.head(1).T

Id,1
Elevation,2596
Aspect,51
Slope,3
Horizontal_Distance_To_Hydrology,258
Vertical_Distance_To_Hydrology,0
Horizontal_Distance_To_Roadways,510
Hillshade_9am,221
Hillshade_Noon,232
Hillshade_3pm,148
Horizontal_Distance_To_Fire_Points,6279


In [4]:
train['Cover_Type'].value_counts()

7    2160
6    2160
5    2160
4    2160
3    2160
2    2160
1    2160
Name: Cover_Type, dtype: int64

In [5]:
def write_to_submission_file(predicted_labels, out_file,
                             target='Cover_Type', index_label="Id", init_index=15121):
    # turn predictions into data frame and save as csv file
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(init_index, 
                                                  predicted_labels.shape[0] + init_index),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)

In [6]:
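import warnings

def warn(*args, **kwargs):
    """No-op replacement for warnings.warn."""
    pass

# Monkey-patch warnings.warn so that library warnings are silenced
warnings.warn = warn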

**Perform train-test split**

In [7]:
Xtrain, Xvalid, ytrain, yvalid = train_test_split(
    train.drop('Cover_Type', axis=1), train['Cover_Type'],
    test_size=0.3, random_state=17)
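
The split above is purely random, so the class proportions in the validation set can drift slightly from the perfectly balanced training data. As an optional variation (not what this notebook does), `train_test_split` accepts a `stratify` argument that preserves the class proportions. A minimal sketch on synthetic labels, where `X` and `y` are made-up stand-ins for the real features and target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 7 balanced classes (labels 1..7), 100 rows each.
X = np.arange(700).reshape(-1, 1)
y = np.repeat(np.arange(1, 8), 100)

# stratify=y keeps the class proportions identical in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, random_state=17, stratify=y)

# Count validation rows per class (index 0 is unused, so drop it).
valid_counts = np.bincount(y_va)[1:]
print(valid_counts)
```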

**Test logistic regression**

In [8]:
logit = LogisticRegression(C=1, solver='lbfgs', max_iter=500,
                           random_state=17, n_jobs=4,
                          multi_class='multinomial')
logit_pipe = Pipeline(steps=[('scaler', StandardScaler()), 
                       ('logit', logit)])

In [9]:
%%time
logit_pipe.fit(Xtrain, ytrain)

Wall time: 5.42 s


Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('logit',
                 LogisticRegression(C=1, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=500,
                                    multi_class='multinomial', n_jobs=4,
                                    penalty='l2', random_state=17,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

In [10]:
# Validation predictions, used for the accuracy score below
logit_val_pred = logit_pipe.predict(Xvalid)
# Test-set predictions for the Kaggle submission file
logit_pred = logit_pipe.predict(test)
write_to_submission_file(logit_pred, 'LoRe_forest_cover_type.csv')

In [11]:
Original = accuracy_score(yvalid, logit_val_pred)
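
Accuracy alone does not show which cover types get confused with each other. The `confusion_matrix` function imported earlier can be used for that. The sketch below runs on hypothetical labels (`y_true` and `y_pred` are invented for illustration), but the same calls should work with `yvalid` and `logit_val_pred`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, standing in for yvalid and logit_val_pred.
y_true = np.array([1, 1, 2, 2, 3, 3])
y_pred = np.array([1, 2, 2, 2, 3, 1])

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])

# Per-class recall: correct predictions divided by the row total.
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(f"accuracy: {acc:.3f}")
print(cm)
print(per_class_recall)
```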

In [12]:
lore = LogisticRegression(C=1, solver='lbfgs', max_iter=500, n_jobs=4, multi_class='multinomial', verbose=0)
lore_pipe = Pipeline(steps=[('scaler', StandardScaler()), 
                       ('logit', lore)])

In [13]:
EstimatorDF = pd.DataFrame(columns=['n_estimators', 'accuracy_score', 'F1 micro', 'F1 macro', 'precision macro', 'precision micro', 'recall macro','recall micro'])

In [14]:
from tqdm.notebook import trange
import gc
from sklearn.ensemble import BaggingClassifier
from sklearn import metrics
N_E = [1, 10, 50, 100]
rows = []  # one dict of validation metrics per ensemble size
for est in trange(len(N_E), desc="Training"):
    estimators = N_E[est]
    # The base estimator is passed positionally because its keyword was renamed
    # from `base_estimator` to `estimator` in newer scikit-learn releases
    clf = BaggingClassifier(lore_pipe, n_estimators=estimators,
                            random_state=estimators, bootstrap=True).fit(Xtrain, ytrain)
    pred = clf.predict(Xvalid)
    rows.append({
        'n_estimators': estimators,
        'accuracy_score': metrics.accuracy_score(yvalid, pred),
        'F1 micro': metrics.f1_score(yvalid, pred, average='micro'),
        'F1 macro': metrics.f1_score(yvalid, pred, average='macro'),
        'precision macro': metrics.precision_score(yvalid, pred, average='macro'),
        'precision micro': metrics.precision_score(yvalid, pred, average='micro'),
        'recall macro': metrics.recall_score(yvalid, pred, average='macro'),
        'recall micro': metrics.recall_score(yvalid, pred, average='micro'),
    })
    # Write a separate test-set submission for each ensemble size
    logit_pred = clf.predict(test)
    write_to_submission_file(logit_pred, str(estimators) + '_LoRe_forest_cover_type.csv')
    gc.collect()
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
EstimatorDF = pd.DataFrame(rows, columns=EstimatorDF.columns)
EstimatorDF["n_estimators"] = EstimatorDF["n_estimators"].astype('int32')
print(EstimatorDF)



   n_estimators  accuracy_score  F1 micro  F1 macro  precision macro  precision micro  recall macro  recall micro
0             1        0.700397  0.700397  0.696990         0.697131         0.700397      0.700678      0.700397
1            10        0.705908  0.705908  0.702511         0.702331         0.705908      0.706090      0.705908
2            50        0.706129  0.706129  0.702627         0.703370         0.706129      0.706479      0.706129
3           100        0.706570  0.706570  0.703170         0.703846         0.706570      0.706917      0.706570


In [15]:
EstimatorDF.to_csv('forest.csv', index=False)
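
To make the effect of the ensemble size easier to read, the validation accuracies printed by the loop can be compared against the single-estimator run. The sketch below copies the accuracy values from that output into a small frame; the column name `gain_vs_single` is introduced here only for illustration:

```python
import pandas as pd

# Validation accuracies copied from the results printed by the bagging loop.
results = pd.DataFrame({
    "n_estimators": [1, 10, 50, 100],
    "accuracy_score": [0.700397, 0.705908, 0.706129, 0.706570],
})

# Accuracy of the single-estimator ensemble, used as the reference point.
baseline = results.loc[results["n_estimators"] == 1, "accuracy_score"].iloc[0]

# Absolute accuracy gain of each ensemble size over the single estimator.
results["gain_vs_single"] = results["accuracy_score"] - baseline
print(results)
```

On this validation split, going from 1 to 100 estimators gains roughly 0.006 in accuracy, and most of that gain is already reached with 10 estimators.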