Welcome to your DataCamp project audition! This notebook must be filled out and vetted before a contract can be signed and you can start creating your project.

The first step is forking the repository in which this notebook lives. After that, there are two parts to be completed in this notebook:

- **Project information**:  The title of the project, a project description, etc.

- **Project introduction**: The three first text and code cells that will form the introduction of your project.

When complete, please email the link to your forked repo to projects@datacamp.com with the email subject line _DataCamp project audition_. If you have any questions, please reach out to projects@datacamp.com.

# Project information

Cracking The Machine Learning Process


**Name:** AbdElRhman ElMoghazy.

**Email address associated with your DataCamp account:**  almoghazy1@gmail.com

**Project description**: This will be read by the students on the DataCamp platform **before** deciding to start the project. The description should be three paragraphs, written in Markdown.

- Every machine learning project consists of a few essential steps. Each step prepares the ground for the next, and together they help you design and optimize your machine learning model.
In this project, you will practice the following:

    - Importing and handling the dataset
    - Data Exploration and Analysis (EDA)
    - Data Cleaning
    - Feature Engineering
    - Data Normalization
    - Model Creation
    - Optimization and Error Analysis


- In this notebook, we will use the scikit-learn, pandas, seaborn, and matplotlib libraries, along with some classification and optimization techniques. We recommend taking the following courses before starting this project:
    - [Supervised Learning With Scikit-Learn](https://www.datacamp.com/courses/supervised-learning-with-scikit-learn) 
    - [Preprocessing for Machine Learning in Python](https://www.datacamp.com/courses/preprocessing-for-machine-learning-in-python)


- The dataset for this project is hosted by the [Center for Machine Learning and Intelligent Systems](https://cml.ics.uci.edu/). You can read about the dataset [here](https://archive.ics.uci.edu/ml/datasets/covertype).

# Project introduction

***Note: nothing needs to be filled out in this cell. It is simply setting up the template cells below.***

The final output of a DataCamp project looks like a blog post: pairs of text and code cells that tell a story about data. The text is written from the perspective of the data analyst and *not* from the perspective of an instructor on DataCamp. So, for this blog post intro, all you need to do is pretend like you're writing a blog post -- forget the part about instructors and students.

Below you'll see the structure of a DataCamp project: a series of "tasks" where each task consists of a title, a **single** text cell, and a **single** code cell. There are 8-12 tasks in a project and each task can have up to 10 lines of code. What you need to do:
1. Read through the template structure.
2. As best you can, divide your project as it is currently visualized in your mind into tasks.
3. Fill out the template structure for the first three tasks of your project.

As you are completing each task, you may wish to consult the project notebook format in our [documentation](https://instructor-support.datacamp.com/projects/datacamp-projects-jupyter-notebook). Only the `@context` and `@solution` cells are relevant to this audition.

## 1. Which type dominates the forest?

![alt text](./img/Rawah.png)


Ever wondered why the types of trees change from one forest to another? Every tree type has its own characteristics that let it survive under particular environmental conditions, such as the amount of sun the trees get during the day and the type of soil. So can we save the forests by identifying their types?

In this notebook, we will use the [forest cover type dataset](https://archive.ics.uci.edu/ml/datasets/covertype). Each sample in the dataset represents a 30 × 30 meter cell in one of four wilderness areas in the Roosevelt National Forest of northern Colorado, USA. We need to clean and tidy the dataset before performing multi-class classification, and to clean the data we first need to explore what it looks like.

Let's start by importing some main packages that we will need later and loading the dataset from data.csv.

In [6]:
import numpy as np # For linear algebra
import pandas as pd # For data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split # To split the dataset into training and testing data

# Loading the dataset into a Pandas dataframe
data = pd.read_csv("./datasets/data.csv")

# Display the first 5 rows of the dataset
data.head()

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,cover_type
0,,105.0,36.0,201.0,141.0,1211.0,,170.0,12.0,1584.0,...,0,1,0,0,0,0,0,0,0,Spruce/Fir
1,,32.0,14.0,379.0,43.0,5028.0,,208.0,125.0,2845.0,...,0,0,0,0,0,0,0,0,0,Spruce/Fir
2,,273.0,10.0,391.0,24.0,2797.0,,243.0,190.0,234.0,...,0,0,0,0,0,0,0,0,0,Spruce/Fir
3,,318.0,13.0,300.0,94.0,1482.0,,228.0,182.0,2930.0,...,0,0,0,0,0,0,0,1,0,Spruce/Fir
4,,101.0,12.0,90.0,-5.0,4168.0,,223.0,110.0,2026.0,...,0,0,0,0,0,0,0,0,0,Spruce/Fir
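
The empty fields in the preview above may indicate missing values, so a first inspection pass is worth a moment. Here is a minimal sketch of such a pass. It uses a small hypothetical frame (`toy`) rather than `data`; the column names are borrowed from the dataset, but every value is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; the values are invented for illustration
toy = pd.DataFrame({
    "Elevation": [np.nan, 2596.0, np.nan],
    "Aspect": [105.0, 32.0, 273.0],
    "cover_type": ["Spruce/Fir", "Spruce/Fir", "Aspen"],
})

# Number of rows and columns
n_rows, n_cols = toy.shape

# Missing values per column, largest first
missing_per_column = toy.isna().sum().sort_values(ascending=False)

# Data type of each column (numeric vs. text)
column_dtypes = toy.dtypes

print(n_rows, n_cols)
print(missing_per_column)
print(column_dtypes)
```

Running the same three calls on `data` would show which columns need attention during the cleaning step.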


## 2. So, should we count the tree types?

The first five rows of the dataset all have the same cover_type, which is the type of tree (Spruce/Fir). But how many types does the dataset really contain, what are their names, and how many data examples do we have for each type?

In [7]:
# Count the distinct tree types
print("The number of tree types in the dataset is { %s" % data["cover_type"].nunique(), "}\n")

# List the names of the tree types
print("The names of tree types in the dataset are %s" % data["cover_type"].unique(), "\n")

# Count the data examples per tree type
print("The number of data examples for each type in the dataset are { \n %s" % data["cover_type"].value_counts(), "}")

The number of tree types in the dataset is { 7 }

The names of tree types in the dataset are ['Spruce/Fir' 'Lodgepole Pine' 'Ponderosa Pine' 'Cottonwood/Willow'
 'Aspen' 'Douglas-fir' 'Krummholz'] 

The number of data examples for each type in the dataset are { 
 Lodgepole Pine       3000
Krummholz            3000
Spruce/Fir           3000
Ponderosa Pine       3000
Aspen                3000
Douglas-fir          3000
Cottonwood/Willow    2747
Name: cover_type, dtype: int64 }
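
The counts above look nearly balanced, which matters later when choosing evaluation metrics. As a rough sketch, class balance can be summarized with relative frequencies and a single ratio. The `labels` series below is a hypothetical stand-in for `data["cover_type"]`, and `balance_ratio` is a name introduced here for illustration only.

```python
import pandas as pd

# Hypothetical labels standing in for data["cover_type"]
labels = pd.Series(["Aspen", "Aspen", "Krummholz", "Krummholz", "Spruce/Fir"])

# Share of rows per class (fractions that sum to 1)
class_share = labels.value_counts(normalize=True)

# Rarest class relative to the most common one; 1.0 means perfectly balanced
balance_ratio = class_share.min() / class_share.max()

print(class_share)
print(round(balance_ratio, 2))
```

For the counts printed above, the same ratio would be 2747 / 3000, roughly 0.92.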


## 3. Strings won't work!

After exploring the cover_type column (which is the target), we can see that all the values are strings. String values are awkward for numerical analysis, and not every modeling tool handles string targets well.

We have two ways to get useful data from those string values. The first way is to one-hot encode the column, so instead of having one target column we will have 7 columns. The second way is to map the string names to integer values: for example, "Spruce/Fir" would be converted to 1, "Lodgepole Pine" would be converted to 2, and so on. In this notebook, we will use the second way and map the string values to integers.
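
To make the difference between the two options concrete, here is a small sketch on a hypothetical four-row series. `labels` and `toy_mapping` are illustrative names, and the integer codes follow the same convention used in this notebook.

```python
import pandas as pd

# Hypothetical labels standing in for data["cover_type"]
labels = pd.Series(
    ["Spruce/Fir", "Aspen", "Spruce/Fir", "Krummholz"], name="cover_type"
)

# Option 1: one-hot encoding, one indicator column per distinct label
one_hot = pd.get_dummies(labels)

# Option 2: integer mapping, a single numeric column
toy_mapping = {"Spruce/Fir": 1, "Aspen": 5, "Krummholz": 7}
as_integers = labels.map(toy_mapping)

print(one_hot.shape)         # (rows, number of distinct labels)
print(as_integers.tolist())  # one integer code per row
```

One-hot encoding widens the frame by one column per class, while the mapping keeps a single compact target column, which is the form classifiers usually expect for a multi-class target.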

In [8]:
# Construct the mapping dictionary
mapping_dict = {"Spruce/Fir": 1, "Lodgepole Pine": 2,  "Ponderosa Pine": 3, "Cottonwood/Willow": 4, "Aspen": 5, "Douglas-fir": 6, "Krummholz": 7}

# Use the map(dict) function to map the values in the dataset to the values of the dictionary
data['cover_type'] = data['cover_type'].map(mapping_dict)

# Print the first 5 rows to see the difference in the cover_type column
data.head()

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,cover_type
0,,105.0,36.0,201.0,141.0,1211.0,,170.0,12.0,1584.0,...,0,1,0,0,0,0,0,0,0,1
1,,32.0,14.0,379.0,43.0,5028.0,,208.0,125.0,2845.0,...,0,0,0,0,0,0,0,0,0,1
2,,273.0,10.0,391.0,24.0,2797.0,,243.0,190.0,234.0,...,0,0,0,0,0,0,0,0,0,1
3,,318.0,13.0,300.0,94.0,1482.0,,228.0,182.0,2930.0,...,0,0,0,0,0,0,0,1,0,1
4,,101.0,12.0,90.0,-5.0,4168.0,,223.0,110.0,2026.0,...,0,0,0,0,0,0,0,0,0,1
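
One thing to watch with `map`: any label that is missing from the dictionary silently becomes NaN. The sketch below shows one possible sanity check, along with a reverse dictionary for turning codes back into names. The data is hypothetical, and "Oak" is deliberately left out of `toy_mapping` to trigger the case.

```python
import pandas as pd

# Hypothetical labels; "Oak" is deliberately absent from the mapping
labels = pd.Series(["Spruce/Fir", "Aspen", "Oak"])
toy_mapping = {"Spruce/Fir": 1, "Aspen": 5}

mapped = labels.map(toy_mapping)

# Labels with no dictionary entry come back as NaN after mapping
unmapped = labels[mapped.isna()].unique().tolist()
print(unmapped)

# Reverse dictionary to recover readable names from the integer codes
inverse_mapping = {code: name for name, code in toy_mapping.items()}
recovered = mapped.dropna().astype(int).map(inverse_mapping).tolist()
print(recovered)
```

An empty `unmapped` list on the real column would confirm that every tree type has an entry in the dictionary.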


*Stop here! Only the three first tasks. :)*