Welcome to your DataCamp project audition! This notebook must be filled out and vetted before a contract can be signed and you can start creating your project.

The first step is forking the repository in which this notebook lives. After that, there are two parts to be completed in this notebook:

- **Project information**:  The title of the project, a project description, etc.

- **Project introduction**: The three first text and code cells that will form the introduction of your project.

When complete, please email the link to your forked repo to projects@datacamp.com with the email subject line _DataCamp project audition_. If you have any questions, please reach out to projects@datacamp.com.

# Project information

Cracking The Machine Learning Process


**Name:** AbdElRhman ElMoghazy.

**Email address associated with your DataCamp account:**  almoghazy1@gmail.com

**Project description**: This will be read by the students on the DataCamp platform **before** deciding to start the project. The description should be three paragraphs, written in Markdown.

- Any Machine Learning project must consist of some essential steps. Every step in the project will help you develop the following step confidently and finally will help you design and optimize your Machine Learning model.
In this project you will be able to perform the following:

    - Importing and handling the dataset
    - Data Exploration and Analysis (EDA)
    - Data Cleaning
    - Feature Engineering
    - Data Normalization
    - Model Creation
    - Optimization and Error Analysis


- In this Notebook, we will use Scikit-Learn, Pandas, Seaborn and matplotlib libraries besides some Classification and optimization techniques. It is recommended to take the following courses as prerequisites to this project:
    - [Supervised Learning With Scikit-Learn](https://www.datacamp.com/courses/supervised-learning-with-scikit-learn) 
    - [Preprocessing for Machine Learning in Python](https://www.datacamp.com/courses/preprocessing-for-machine-learning-in-python)


- The dataset for this project is collected by [Center for Machine Learning and Intelligent](https://cml.ics.uci.edu/) Systems. You can read about dataset [here](https://archive.ics.uci.edu/ml/datasets/covertype)

# Project introduction

***Note: nothing needs to be filled out in this cell. It is simply setting up the template cells below.***

The final output of a DataCamp project looks like a blog post: pairs of text and code cells that tell a story about data. The text is written from the perspective of the data analyst and *not* from the perspective of an instructor on DataCamp. So, for this blog post intro, all you need to do is pretend like you're writing a blog post -- forget the part about instructors and students.

Below you'll see the structure of a DataCamp project: a series of "tasks" where each task consists of a title, a **single** text cell, and a **single** code cell. There are 8-12 tasks in a project and each task can have up to 10 lines of code. What you need to do:
1. Read through the template structure.
2. As best you can, divide your project as it is currently visualized in your mind into tasks.
3. Fill out the template structure for the first three tasks of your project.

As you are completing each task, you may wish to consult the project notebook format in our [documentation](https://instructor-support.datacamp.com/projects/datacamp-projects-jupyter-notebook). Only the `@context` and `@solution` cells are relevant to this audition.

## 1. The Forest Covertype dataset

In this notebook, we are going to classify the cover_type dataset using Ensemble learning. We are going to perform some essential steps in the Machine Learning process. In many cases machine learning algorithms don't perform well without feature engineering which is the process of filling NaNs and missing values, creating new features and etc. We will also be performing some exploratory data analysis to be able to perform feature engineering before implementing the model itself.

Each sample in the cover_type dataset represents a 30*30 meter cell in a forest (in one of four wilderness areas in Roosevelt National Forest of northern Colorado) in the US. In this notebook, we will perform multi-class classification to classify the samples to one of seven cover types (classes).

To import the dataset use read_csv() function from pandas library to read the data.csv file from the datasets directory "./datasets/". After loading the dataset, we will split the dataset into 80% for the training data and 20% for the testing data. We will use the testing data later to see how well the model would perform in the future on new data.

In [12]:
import numpy as np # For linear algebra
import pandas as pd # For data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split # To split the dataset into training and testing data

# Loading the dataset into a Pandas dataframe
data = pd.read_csv("./datasets/data.csv")

# Splitting the dataset into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(data.drop(["target"], axis = 1), data["target"], random_state = 0)

## 2. Data Exploration

Next, We will need to know the number of training examples in the training set. Knowing the number of training examples is very important to determine which Machine Learning to use later in the notebook.
It is important to know if the number of points in the classes is balanced. If the data is skewed then we will not be able to use accuracy as a performance metric since it will be misleading but if it is skewed we may use F-beta score or precision and recall. Precision or recall or F1 score. the choice depends on the problem itself. Where high recall means a low number of false negatives, High precision means a low number of false positives and F1 score is a trade-off between them. [You can refer to this article for more about precision and recall](http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html)  

In [29]:
# Checking the number of examples in the training set
print("The number of traning examples(data points) = %i " % train.shape[0])
# Checking the number of training examples per class
print("The number of occurances of each class in the dataset is { %s " % train["target"].value_counts(), "}")
# Show the first 5 rows in the training set
train.head()

The number of traning examples(data points) = 15560 
The number of occurances of each class in the dataset is { 2.0    2273
5.0    2273
1.0    2247
7.0    2247
6.0    2218
3.0    2216
4.0    2086
Name: target, dtype: int64  }


Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Wilderness_Area1,Wilderness_Area2,Wilderness_Area3,Wilderness_Area4,Soil_Type1,Soil_Type2,Soil_Type3,Soil_Type4,Soil_Type5,Soil_Type6,Soil_Type7,Soil_Type8,Soil_Type9,Soil_Type10,Soil_Type11,Soil_Type12,Soil_Type13,Soil_Type14,Soil_Type15,Soil_Type16,Soil_Type17,Soil_Type18,Soil_Type19,Soil_Type20,Soil_Type21,Soil_Type22,Soil_Type23,Soil_Type24,Soil_Type25,Soil_Type26,Soil_Type27,Soil_Type28,Soil_Type29,Soil_Type30,Soil_Type31,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,target
0,2112.0,56.0,31.0,124.0,48.0,180.0,,159.0,48.0,485.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
1,2611.0,79.0,25.0,192.0,64.0,2078.0,243.0,186.0,57.0,685.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
2,2791.0,97.0,10.0,67.0,10.0,2285.0,236.0,226.0,120.0,1381.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
3,2832.0,335.0,28.0,285.0,126.0,2964.0,,191.0,176.0,1775.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
4,3330.0,287.0,10.0,60.0,3.0,4920.0,193.0,240.0,187.0,3428.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,7.0


## 3. Check for NaNs and Nulls

Next, We will check if any of the columns contains NaNs or Nulls so that we can fill those values if they are significant or just drop them. We may drop a whole column if most of its values are NaNs or fill its value according to its relation with other columns in the dataframe.

if you are dropping rows with NaNs and you notice that you need to drop a large portion of your dataset then you should think about filling the NaN values or drop a column that has most of its values missing.
When exploring the dataset, If a column of strings contains some "0" in it then probably it is considered as a Null value, that is why we are going to use the df.describe() to see if the range of numbers is reasonable or not.

In [30]:
# Using describe to see if the range of values in the dataset columns is reasonable
# For the data exploration we need to explore the whole training set including the target
train = pd.DataFrame(np.c_[x_train, y_train], columns = data.columns.values )
# Showing some statistics for each feature
print(train.describe() )
# Printing the number of nulls in each feature
train.isna().sum()

          Elevation        Aspect         Slope  \
count  15520.000000  15560.000000  15560.000000   
mean    2755.413918    155.528213     16.548329   
std      417.695114    109.705257      8.463855   
min     1872.000000      0.000000      0.000000   
25%     2387.000000     64.000000     10.000000   
50%     2757.000000    124.000000     15.000000   
75%     3113.000000    257.000000     22.000000   
max     3856.000000    360.000000     55.000000   

       Horizontal_Distance_To_Hydrology  Vertical_Distance_To_Hydrology  \
count                      15560.000000                    15560.000000   
mean                         229.597622                       50.827635   
std                          209.535727                       60.656991   
min                            0.000000                     -134.000000   
25%                           67.000000                        5.000000   
50%                          180.000000                       33.000000   
75%            

Elevation                               40
Aspect                                   0
Slope                                    0
Horizontal_Distance_To_Hydrology         0
Vertical_Distance_To_Hydrology           0
Horizontal_Distance_To_Roadways          0
Hillshade_9am                         7577
Hillshade_Noon                           0
Hillshade_3pm                            0
Horizontal_Distance_To_Fire_Points       0
Wilderness_Area1                         0
Wilderness_Area2                         0
Wilderness_Area3                         0
Wilderness_Area4                         0
Soil_Type1                               0
Soil_Type2                               0
Soil_Type3                               0
Soil_Type4                               0
Soil_Type5                               0
Soil_Type6                               0
Soil_Type7                               0
Soil_Type8                               0
Soil_Type9                               0
Soil_Type10

*Stop here! Only the three first tasks. :)*