# Forest Cover Prediction

## Hello there :)

### In this notebook I will go through the process of creating a machine learning model to predict the forest cover type.

##### In reality, these are almost the same steps required for most problems that call for an ML solution.

## So let's start 🔥

#### First, let's import some useful libraries to process the data quickly

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score


#### Now let's read the dataset found on Kaggle

In [None]:
import zipfile
train_zip = zipfile.ZipFile('../input/forest-cover-type-kernels-only/train.csv.zip')
test_zip = zipfile.ZipFile('../input/forest-cover-type-kernels-only/test.csv.zip')

df = pd.read_csv(train_zip.open('train.csv'))
df_test = pd.read_csv(test_zip.open('test.csv'))
df.head()

#### Great! We have read our dataset.

#### Now we need to go through the dataset and spend some time understanding what every column represents.

#### So head over to this link: https://www.kaggle.com/c/forest-cover-type-kernels-only/data

#### Once you are done, we can proceed :)





#### Alright, now let's get some info about our dataset

In [None]:
df.info()

#### Now let's start the cleaning process

## Data Cleaning



#### 1- Check for nulls (a null is a missing value, so it carries no information that can help our model)

In [None]:
df.isnull().sum()

## Image of Thanos saying "Impossible" in Avengers: Endgame 🙂

#### It is rare to have a dataset that is free of nulls,

#### but since Kaggle provided this dataset without nulls, we do not need to worry about handling them here.

#### So what should you do in the future if you meet a dataset with nulls?

#### There are several techniques that can be used to impute the null values.

#### To save time, you can refer to this awesome Kaggle tutorial: https://www.kaggle.com/alexisbcook/handling-missing-values
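
#### As a quick illustration, here is a minimal sketch of one common technique: filling each missing value with the median of its column.
#### The toy DataFrame and its values are made up for the example (they are not the competition data), and the median is only one possible strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with missing values (hypothetical, not the competition data)
toy = pd.DataFrame({
    "Elevation": [2500.0, np.nan, 3100.0, 2800.0],
    "Slope": [10.0, 14.0, np.nan, 12.0],
})

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_imputed)
```

#### On real data, the imputer should be fitted on the training rows only and then applied to the validation and test rows.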

#### Now let's get a summary of our dataset and proceed to the next cleaning step

#### 2- Outlier detection

In [None]:
df.describe()

#### All features look credible except "Horizontal_Distance_To_Roadways":
#### its max is far above both the mean and the 75th percentile.

#### So let's draw a boxplot to check for outliers

In [None]:
sns.boxplot(x=df["Horizontal_Distance_To_Roadways"])

#### So it seems we have some abnormal instances. Let's see how frequent they are

In [None]:
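# Positions of the rows whose distance to roadways exceeds 4500
outliers = np.where(df['Horizontal_Distance_To_Roadways'] > 4500)

# np.where returns a tuple of arrays; the first one holds the row positions
len(outliers[0])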

In [None]:
# np.where returns positions, so select the rows by position rather than by label
outliers_df = df.iloc[outliers[0], :]

In [None]:
outliers_df.head(10)

In [None]:
outliers_df["Cover_Type"].unique()

#### Hmmm... I feel that this feature might be useful later for classifying the cover type, so the outlier values might be informative. I will not modify them.
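
#### As a side note, the cut-off of 4500 used above is a fixed number. The points a seaborn boxplot draws beyond its whiskers follow, by default, the 1.5 × IQR rule.
#### Here is a minimal sketch of that rule. The helper name `iqr_outlier_mask` and the toy values are made up for the example.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy distances (hypothetical values, not taken from the dataset)
distances = pd.Series([150, 300, 450, 600, 750, 6000])
mask = iqr_outlier_mask(distances)

# Only the values flagged by the rule are printed
print(distances[mask])
```

#### The same mask could be computed on `df["Horizontal_Distance_To_Roadways"]` to count the flagged rows without picking a threshold by hand.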

#### Since the dataset provider has already converted the categorical data to numeric, we do not need to do that step, but I mention it because it might be useful in other projects.
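
#### For those other projects, here is a minimal sketch of that conversion using one-hot encoding with `pd.get_dummies`.
#### The toy DataFrame is made up, and I am only assuming the provider did something similar to produce the Soil_Type columns.

```python
import pandas as pd

# Toy data with one categorical column (hypothetical values)
toy = pd.DataFrame({
    "Elevation": [2500, 3100, 2800],
    "Soil_Type": [1, 3, 1],
})

# One binary column per category, named Soil_Type1, Soil_Type3, ...
encoded = pd.get_dummies(
    toy, columns=["Soil_Type"], prefix="Soil_Type", prefix_sep="", dtype=int
)

print(encoded)
```

#### Each row ends up with exactly one 1 among the new columns, which is the same layout as the Soil_Type and Wilderness_Area columns in this dataset.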

#### Now I think the dataset is clean and ready for the next step in the process

## Data visualization

#### This section is very important for building intuition about the relations between variables and how they affect our prediction

#### So let's see the relation between Elevation and the cover type

In [None]:
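# Note: [:5] keeps only the five classes with the highest mean elevation (the dataset has 7 cover types)
df.groupby("Cover_Type").Elevation.mean().sort_values(ascending=False)[:5].plot.bar()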

#### Well, since no two of the classes shown have the same (or a very close) mean, this feature might be good for classification.

#### It also makes sense, since the air gets thinner as we move up (less O2), and this affects the type of forest.

#### Now let's repeat the same steps for the other variables

#### numeric_features are the columns that are numeric by nature (not encoded from categorical variables)

#### These numeric features are treated in a slightly different way

In [None]:
numeric_features = ['Aspect', 'Slope','Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology',
       'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon',
        'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points']

In [None]:
fig = plt.figure(figsize=(10, 10))
fig.subplots_adjust(hspace=0.5, wspace=0.5)
rows = 3
cols = 3

# One subplot per numeric feature: mean value per cover type (five highest classes)
for i, col in enumerate(numeric_features, start=1):
    ax = fig.add_subplot(rows, cols, i)
    df.groupby("Cover_Type")[col].mean().sort_values(ascending=False)[:5].plot.bar(ax=ax)
    ax.set_ylabel(col + " mean")

### Hmmm, it seems we have some interesting results

### All features seem to have a noticeable effect on the class except "Hillshade_Noon", "Hillshade_3pm"
### and "Hillshade_9am", so we may neglect them
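
### Reading bar charts by eye is subjective, so here is a minimal sketch of one way to put a number on it: the ANOVA F-test from scikit-learn (`f_classif`).
### This is not what I used above. The data below is synthetic, and the feature names are made up for the example.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Synthetic stand-in for the notebook's data: 3 classes, 2 features
y_demo = np.repeat([1, 2, 3], 200)
informative = y_demo * 3.0 + rng.normal(size=y_demo.size)  # mean shifts with the class
noise = rng.normal(size=y_demo.size)                       # unrelated to the class
X_demo = np.column_stack([informative, noise])

# ANOVA F-test: a larger F means the class means differ more relative to the spread
f_scores, p_values = f_classif(X_demo, y_demo)
for name, f in zip(["informative", "noise"], f_scores):
    print(f"{name}: F = {f:.1f}")
```

### On the real data, something like `f_classif(df[numeric_features], df["Cover_Type"])` should give one score per numeric feature, which can then be compared with the plots.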

In [None]:
categorical_features = ['Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3',
       'Wilderness_Area4', 'Soil_Type1', 'Soil_Type2', 'Soil_Type3',
       'Soil_Type4', 'Soil_Type5', 'Soil_Type6', 'Soil_Type7', 'Soil_Type8',
       'Soil_Type9', 'Soil_Type10', 'Soil_Type11', 'Soil_Type12',
       'Soil_Type13', 'Soil_Type14', 'Soil_Type15', 'Soil_Type16',
       'Soil_Type17', 'Soil_Type18', 'Soil_Type19', 'Soil_Type20',
       'Soil_Type21', 'Soil_Type22', 'Soil_Type23', 'Soil_Type24',
       'Soil_Type25', 'Soil_Type26', 'Soil_Type27', 'Soil_Type28',
       'Soil_Type29', 'Soil_Type30', 'Soil_Type31', 'Soil_Type32',
       'Soil_Type33', 'Soil_Type34', 'Soil_Type35', 'Soil_Type36',
       'Soil_Type37', 'Soil_Type38', 'Soil_Type39', 'Soil_Type40']

print(len(categorical_features))

In [None]:
fig = plt.figure(figsize=(20, 20))
fig.subplots_adjust(hspace=0.5, wspace=0.5)
rows = 11
cols = 4

# One subplot per binary feature: its mean (share of ones) within each cover type
for i, col in enumerate(categorical_features, start=1):
    ax = fig.add_subplot(rows, cols, i)
    sns.barplot(x="Cover_Type", y=col, data=df, ax=ax)
    ax.set_ylabel(col)


### Looks like we have some other interesting results here
### Some features should be removed because they add no information, such as Soil_Type7, Soil_Type15 and Soil_Type37
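
### Columns like these can also be found programmatically. Here is a minimal sketch that flags columns with a single distinct value.
### The toy DataFrame and its column names are made up, and I am not claiming that every column dropped above is exactly constant.

```python
import pandas as pd

# Toy frame (hypothetical values): "Soil_TypeA" is all zeros, so it cannot separate classes
toy = pd.DataFrame({
    "Soil_TypeA": [0, 0, 0, 0],
    "Soil_TypeB": [0, 1, 0, 1],
    "Cover_Type": [1, 2, 1, 2],
})

feature_cols = [c for c in toy.columns if c != "Cover_Type"]

# A column with a single distinct value carries no information for the model
constant_cols = [c for c in feature_cols if toy[c].nunique() <= 1]

print(constant_cols)
```

### The same check can be run over `categorical_features` on `df` to compare with what the bar plots suggest.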


### Now I think our dataset is clean and we have enough intuition about which features to use to build the model

In [None]:
useful_features = ['Elevation','Aspect', 'Slope','Horizontal_Distance_To_Hydrology', 
                   'Vertical_Distance_To_Hydrology','Horizontal_Distance_To_Roadways',
                   'Horizontal_Distance_To_Fire_Points',
       'Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3','Wilderness_Area4',
       'Soil_Type1', 'Soil_Type2', 'Soil_Type3',
       'Soil_Type4', 'Soil_Type5', 'Soil_Type6', 'Soil_Type8',
       'Soil_Type9', 'Soil_Type10', 'Soil_Type11', 'Soil_Type12',
       'Soil_Type13', 'Soil_Type14', 'Soil_Type16',
       'Soil_Type17', 'Soil_Type18', 'Soil_Type19', 'Soil_Type20',
       'Soil_Type21', 'Soil_Type22', 'Soil_Type23', 'Soil_Type24',
       'Soil_Type25', 'Soil_Type26', 'Soil_Type27', 'Soil_Type28',
       'Soil_Type29', 'Soil_Type30', 'Soil_Type31', 'Soil_Type32',
       'Soil_Type33', 'Soil_Type34', 'Soil_Type35', 'Soil_Type36',
       'Soil_Type38', 'Soil_Type39', 'Soil_Type40']

label = ["Cover_Type"]

### Now let's create our training data

In [None]:
from sklearn.model_selection import train_test_split

X = df.loc[:, useful_features]
y = df[label[0]]

# Hold out 10% of the rows for validation; the rest is used for training
x_train, x_val, y_train, y_val = train_test_split(X.values, y.values, test_size=0.1, random_state=5)

### Now comes the interesting part (choosing the model)

### Let's try KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=2)

model.fit(x_train, y_train)


val_pred = model.predict(x_val)


accuracy_score(y_val, val_pred)

#### I have tried other values of k and found that k = 2 gives the highest accuracy of them all
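
#### That search can be written as a small loop. Here is a minimal sketch, assuming the candidates are k = 1 to 10 and that validation accuracy is the score to maximize.
#### The data is synthetic (from `make_classification`) and the variable names are made up so they do not clash with the notebook's, so the best k printed here says nothing about the forest data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training and validation arrays (not the competition data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, n_informative=5,
                                     n_classes=3, random_state=5)
x_tr, x_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.1, random_state=5)

# Fit one model per candidate k and record its validation accuracy
scores = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(x_tr, y_tr)
    scores[k] = accuracy_score(y_va, knn.predict(x_va))

# Keep the k with the highest validation accuracy
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

#### Swapping in `x_train`, `x_val`, `y_train` and `y_val` from the cell above should reproduce the comparison on the real data.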

### Let's try RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier


model = RandomForestClassifier(n_estimators=1000, random_state=5)

model.fit(x_train, y_train)


val_pred = model.predict(x_val)


accuracy_score(y_val, val_pred)

### Slightly better than KNN
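
### A small difference on a single 10% validation split can be hard to judge. Here is a minimal sketch of comparing the two models with 5-fold cross-validation instead.
### This is not what I did above. The data is synthetic and the forest uses fewer trees to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data (not the competition data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, n_informative=5,
                                     n_classes=3, random_state=5)

candidates = {
    "knn": KNeighborsClassifier(n_neighbors=2),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=5),
}

# Mean accuracy over 5 folds for each candidate model
cv_means = {}
for name, estimator in candidates.items():
    fold_scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means[name] = fold_scores.mean()
    print(f"{name}: {cv_means[name]:.3f} (+/- {fold_scores.std():.3f})")
```

### If the gap between the two means is smaller than the spread across folds, the difference is probably not worth much.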

In [None]:
# The model was fitted on NumPy arrays (X.values), so pass an array here as well
X_test = df_test.loc[:, useful_features].values

In [None]:
predictions = model.predict(X_test)



### Let's write the submission file

In [None]:
submission = pd.DataFrame()

submission['Id'] = df_test["Id"]
submission['Cover_Type'] = predictions

submission.to_csv('submission.csv', index=False)

submission.head(5)

## Conclusion

### So, according to the data explored above, we may conclude that the forest cover type is affected by many variables, mainly:

### 1- Elevation
### 2- Horizontal and vertical distance to the nearest water source
### 3- Distance to wildfire ignition points
### 4- Others
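
### One way to check a ranking like this is to look at the impurity-based feature importances of a fitted random forest. Here is a minimal sketch.
### The data and feature names below are synthetic. On the real model, I would expect `model.feature_importances_` paired with `useful_features` to play the same role.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (not the competition data)
feature_names = [f"feature_{i}" for i in range(6)]
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=3,
                                     n_redundant=0, n_classes=3, random_state=5)

forest = RandomForestClassifier(n_estimators=100, random_state=5).fit(X_demo, y_demo)

# Impurity-based importances: one value per feature, summing to 1
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

### Keep in mind that these importances describe what the model relied on, which is not the same as proving what drives the cover type in nature.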

## Recommendations

### The features I used were the most effective from my point of view. However, I suggest spending more time engineering additional features that might improve the model's performance
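
### As one example of such a feature, here is a minimal sketch that combines the horizontal and vertical distances to water into a single straight-line distance.
### The new column name `Euclidean_Distance_To_Hydrology` and the toy values are made up, and I have not tested whether this feature actually helps.

```python
import numpy as np
import pandas as pd

# Toy rows using the dataset's column names (the values are made up)
toy = pd.DataFrame({
    "Horizontal_Distance_To_Hydrology": [3.0, 30.0],
    "Vertical_Distance_To_Hydrology": [4.0, -40.0],
})

# Straight-line distance to water, combining the horizontal and vertical parts
toy["Euclidean_Distance_To_Hydrology"] = np.hypot(
    toy["Horizontal_Distance_To_Hydrology"],
    toy["Vertical_Distance_To_Hydrology"],
)

print(toy)
```

### Any new column has to be added to both `df` and `df_test` and to `useful_features` before retraining.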

### That's it, and THANK YOU :)