# Aprendizagem - 2nd Assignment
## Canopy Height predicition

For this assignment we were assigned the task to, given a dataset, predict the canopy height from trees.  
The canopy height is a continous value, therefore, this is not a classification problem but a regression one.

In order to make this prediction pandas as scikit-learn will be used.

In [31]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.linear_model import Lasso

In [3]:
train_dataset = pd.read_csv('../datasets/train.csv')
test_dataset = pd.read_csv('../datasets/test.csv')

## Data Analysis

In [20]:
train_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29098 entries, 0 to 29097
Columns: 154 entries, id to rh98
dtypes: float64(152), int64(2)
memory usage: 34.2 MB


Given the values are only numeric, there is no need to encode the values.

In [4]:
train_stats = train_dataset.describe()
train_stats.drop(columns=['id'])

Unnamed: 0,aspect,elevation,slope,B2_ago,B2_jul,B2_jun,B2_may,B2_sep,B3_ago,B3_jul,...,NDWI_jul,NDWI_jun,NDWI_may,NDWI_sep,PSRI_ago,PSRI_jul,PSRI_jun,PSRI_may,PSRI_sep,rh98
count,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,...,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0
mean,-0.012836,200.365627,6.758245,0.058219,0.063284,0.061435,0.052699,0.054484,0.084856,0.090407,...,-0.482917,-0.499658,-0.52798,-0.483092,0.123394,0.123459,0.110181,0.062967,0.123459,7.049438
std,0.710396,112.360472,5.4284,0.022375,0.022177,0.022332,0.019686,0.021926,0.028212,0.027918,...,0.090114,0.08922,0.086544,0.096009,0.075911,0.075137,0.076391,0.069724,0.075049,2.695928
min,-1.0,4.0,0.0,0.01505,0.0176,0.01745,0.0152,0.0108,0.0265,0.0321,...,-0.80706,-0.798343,-0.794246,-0.792771,-0.113063,-0.096803,-0.145982,-0.118608,-0.09746,2.28
25%,-0.782304,119.0,2.780288,0.04225,0.0474,0.04655,0.0392,0.038763,0.0647,0.0705,...,-0.544885,-0.560014,-0.588149,-0.54986,0.072792,0.07329,0.057788,0.013658,0.072766,4.86
50%,0.0,175.0,5.039364,0.055,0.0601,0.0581,0.0489,0.051,0.08035,0.0863,...,-0.480677,-0.498645,-0.533046,-0.480849,0.127052,0.127164,0.112578,0.060357,0.126701,7.04
75%,0.723124,257.0,9.290806,0.0701,0.075,0.0719,0.0616,0.06625,0.1002,0.1056,...,-0.419831,-0.438565,-0.471949,-0.414111,0.178111,0.178028,0.165189,0.106852,0.177557,8.98
max,1.0,780.0,36.833881,0.26665,0.2684,0.30805,0.27855,0.2596,0.3121,0.31455,...,-0.137326,-0.128294,0.092509,-0.117401,0.370806,0.371316,0.365913,0.365601,0.384102,15.98


By observing the data I realized that the ranges of these values are very different, therefore, it would be a good idea to normalize them.

In [5]:
def normalize_dataset(ds: pd.DataFrame) -> pd.DataFrame:
    ids = pd.DataFrame(ds['id'])
    tmp_ds = ds.copy().drop(columns='id')
    normalized_fields = (tmp_ds-train_stats.loc['mean'] / train_stats.loc['std'])

    return ids.append(normalized_fields)

In [14]:
normalized_train_dataset = normalize_dataset(train_dataset)
normalized_test_dataset = normalize_dataset(test_dataset)

normalized_train_dataset.dropna()
normalized_test_dataset.dropna()

normalized_train_dataset.describe()

Unnamed: 0,id,B11_ago,B11_jul,B11_jun,B11_may,B11_sep,B12_ago,B12_jul,B12_jun,B12_may,...,S1_VV_may_mean,S1_VV_may_var,S1_VV_sep,S1_VV_sep_hom,S1_VV_sep_mean,S1_VV_sep_var,aspect,elevation,rh98,slope
count,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,...,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0,29098.0
mean,18207.199189,-4.189156,-4.339779,-4.377359,-4.377599,-3.964908,-3.193188,-3.270862,-3.229714,-3.078223,...,-12.590252,433.580155,-5.200836,-4.872793,-11.047957,417.556258,0.005233,198.582387,4.434591,5.513266
std,10478.535217,0.068267,0.068273,0.066092,0.058567,0.068785,0.060133,0.06093,0.060729,0.054462,...,0.025651,69.782099,1.849759,0.101497,0.028546,76.245264,0.710396,112.360472,2.695928,5.4284
min,2.0,-4.407091,-4.55728,-4.583392,-4.569434,-4.171577,-3.35539,-3.433186,-3.389532,-3.207626,...,-12.69232,199.182437,-12.674138,-5.265396,-11.164483,162.592362,-0.981932,2.216761,-0.334847,-1.244979
25%,9190.25,-4.237791,-4.38828,-4.424392,-4.420534,-4.014477,-3.236028,-3.313986,-3.272932,-3.118026,...,-12.607685,386.200618,-6.370302,-4.939903,-11.067044,365.646951,-0.764236,117.216761,2.245153,1.535309
50%,18201.5,-4.187791,-4.33748,-4.375342,-4.380534,-3.964777,-3.19539,-3.272586,-3.231782,-3.083676,...,-12.590541,430.545871,-5.243,-4.870458,-11.048815,412.998429,0.018068,173.216761,4.425153,3.794385
75%,27255.75,-4.141241,-4.29178,-4.331142,-4.338934,-3.917077,-3.15319,-3.230486,-3.190444,-3.045126,...,-12.573831,474.051921,-4.119455,-4.803791,-11.029283,463.944749,0.741192,255.216761,6.365153,8.045827
max,36371.0,-3.908391,-4.06308,-4.090243,-4.059134,-3.684877,-2.85649,-2.943136,-2.871882,-2.709026,...,-12.439282,932.578403,7.542948,-4.464903,-10.892999,943.172471,1.018068,778.216761,13.365153,35.588902


In [17]:
#y = normalized_train_dataset['rh98']
#X = normalized_train_dataset.drop(columns='rh98')
y = train_dataset['rh98']
X = train_dataset.drop(columns='rh98')
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [18]:
clf = Knn(n_estimators=1000, max_depth=100, n_jobs=-1)
clf.fit(X_train, y_train)

ValueError: continuous is not supported

In [19]:
clf.score(X_test, y_test)

0.5435824843146537

In [26]:
clf = GradientBoostingRegressor()
clf.fit(X_train, y_train)

GradientBoostingRegressor()

In [30]:
clf.score(X_test, y_test)

0.5254928842735072

In [39]:
clf = RandomForestRegressor(n_estimators=1000, max_depth=100, n_jobs=-1)
clf.fit(X_train, y_train)

RandomForestRegressor(max_depth=100, n_estimators=1000, n_jobs=-1)

In [61]:
feature_importances = pd.DataFrame(clf.feature_importances_, index=train_dataset.columns[:-1], columns=['importance']).sort_values('importance', ascending=False)
pd.set_option('display.max_rows', None)
feature_importances


Unnamed: 0,importance
NDVI_sep_var,0.292321
HV_mean,0.057491
B8a_ago,0.055806
B7_ago,0.024838
slope,0.020972
NDVI_sep_hom,0.011835
elevation,0.010903
HH_mean,0.008666
NDWI_sep,0.008274
B8a_jul,0.007311


In [63]:
clf_2 = RandomForestRegressor(n_estimators=500, max_depth=50, n_jobs=-1)
clf_2.fit(X_train[['NDVI_sep_var', 'HV_mean', 'B8a_ago', 'B7_ago', 'slope', 'NDVI_sep_hom', 'elevation', 'HH_mean', 'NDWI_sep', 'B8a_jul', 'HV_var']], y_train)
clf_2.score(X_test[['NDVI_sep_var', 'HV_mean', 'B8a_ago', 'B7_ago', 'slope', 'NDVI_sep_hom', 'elevation', 'HH_mean', 'NDWI_sep', 'B8a_jul', 'HV_var']], y_test)

0.5115910857294377

In [68]:
prediction = clf_2.predict(test_dataset[['NDVI_sep_var', 'HV_mean', 'B8a_ago', 'B7_ago', 'slope', 'NDVI_sep_hom', 'elevation', 'HH_mean', 'NDWI_sep', 'B8a_jul', 'HV_var']])
rh98_series = pd.Series(prediction)
submission = pd.concat([test_dataset['id'], rh98_series], axis=1)
submission.set_axis(['id', 'rh98'], axis='columns')
submission.to_csv('pred.csv', index=False)

## Strategies
After loading the data a set of alghoritms will be chosen.

## Random Forest
#### Tuned Parameters
- Number of Estimators = [100, 500, 1000]
- Max depth = [30, 50, 100]
- Max features = ['auto', 'sqrt']

Therefore, we'll be calculating 3 * 3 * 2 = 18 combinations.

In [28]:
params = {"n_estimators": [100, 500, 1000],
        "max_depth": [30, 50, 100],
        "max_features": ["auto", "sqrt"]}

clf = RandomForestRegressor(n_jobs=-1)
grid = GridSearchCV(estimator=clf, param_grid=params, cv=5)
grid.fit(X_train, y_train)
grid.best_params_


{'max_depth': 100, 'max_features': 'auto', 'n_estimators': 1000}

The best parameters for this experiment were {'max_depth': 100, 'max_features': 'auto', 'n_estimators': 1000}.

In [38]:
prediction = grid.predict(test_dataset)
rh98_series = pd.Series(prediction)

In [39]:
submission = pd.concat([test_dataset['id'], rh98_series], axis=1)
submission.set_axis(['id', 'rh98'], axis='columns')

Unnamed: 0,id,rh98
0,1,8.91603
1,11,10.19810
2,12,7.44876
3,13,7.16645
4,17,7.99893
...,...,...
7263,36336,9.79463
7264,36354,10.31577
7265,36360,9.05185
7266,36366,5.56809


In [40]:
submission.to_csv('pred.csv', index=False)