# <div style="text-align: center">XGBoost Tutorial for Beginners</div>
<div style="text-align: center">One of the most common questions in <b>data science</b> is:
<br>
How can we build better solutions than <b>other machine learning algorithms</b> provide?
<br>
If you are unsure what to learn at this stage and ask experts, most of them will suggest that you move on to ensemble learning.
<br>
In this simple tutorial you can learn everything you need for <b>using XGBoost as a method</b>.</div>
<img src='http://s8.picofile.com/file/8341480568/XGBoost.png'>
<div style="text-align:center">last update: <b>11/01/2018</b></div>


you can follow me on:
> ###### [ GitHub](https://github.com/mjbahmani)
> ###### [Kaggle](https://www.kaggle.com/mjbahmani/)
-------------------------------------------------------------------------------------------------------------
 **I hope you find this kernel helpful and some <font color='red'> UPVOTES</font> would be very much appreciated**
 
 -----------


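## Notebook Content
1. [Introduction](#1)
1. [Why XGBoost?](#2)
1. [Installing XGBoost](#3)
1. Problem Definition
    1. [Problem Feature](#4)
    1. [Aim](#5)
    1. [Variables](#6)
1. [Inputs & Outputs](#7)
    1. [Inputs](#8)
    1. [Outputs](#9)
1. Import packages
    1. [Data Collection](#17)
1. [Model Deployment](#32)
    1. Families of ML algorithms
    1. [Prepare Features & Targets](#33)
    1. [RandomForest](#46)
    1. [Bagging classifier](#47)
    1. [AdaBoost classifier](#48)
    1. [Gradient Boosting Classifier](#49)
    1. [Linear Discriminant Analysis](#50)
    1. [Quadratic Discriminant Analysis](#51)
1. [Conclusion](#0)
1. [References](#31)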

<a id="1"></a> <br>
#  1- Introduction
* **XGBoost** is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data.
* **XGBoost** is an implementation of gradient boosted decision trees designed for speed and performance.
* **XGBoost** is short for e**X**treme **G**radient **Boost**ing package.


<a id="2"></a> <br>
## 2- Why XGBoost?

* Speed and performance: The core is written in C++, and it is comparatively fast next to other ensemble classifiers.

* Parallelizable core algorithm: Because the core XGBoost algorithm is parallelizable, it can use all the cores of a multi-core machine. It can also run on GPUs and across networks of computers, which makes training on very large datasets feasible.

* Strong results on benchmarks: It has shown competitive, often leading, performance on a variety of machine learning benchmark datasets, especially for structured data.

* Wide variety of tuning parameters: XGBoost has parameters for cross-validation, regularization, user-defined objective functions, missing values, and tree construction, and it offers a scikit-learn compatible API.
* Popular on Kaggle: many winning Kaggle solutions have used XGBoost.


<a id="3"></a> <br>
## 3- Installing XGBoost

There is a comprehensive installation guide on the [XGBoost documentation website](http://xgboost.readthedocs.io/en/latest/build.html).

### 3-1 XGBoost in R
If you are an R user, the best place to get started is the [CRAN page for the xgboost package](https://cran.r-project.org/web/packages/xgboost/index.html).

### 3-2 XGBoost in Python
Installation instructions are available in the Python section of the XGBoost installation guide.

The official Python Package Introduction is the best place to start when working with XGBoost in Python.

To get started quickly, you can type:

>pip install xgboost


<a id="3"></a> <br>
## 4- Problem Definition
One of the most important steps when you start a new machine learning project is defining your problem, which means understanding the underlying business problem (**Problem Formalization**).

Problem definition has four steps, illustrated in the picture below:
<img src="http://s8.picofile.com/file/8338227734/ProblemDefination.png">
<a id="4"></a> <br>
### 4-1 Problem Feature
We will use the classic Iris data set. This dataset contains information about three different types of Iris flowers:

* Iris Versicolor
* Iris Virginica
* Iris Setosa

The data set contains measurements of four variables :

* sepal length 
* sepal width
* petal length 
* petal width
 
The Iris data set has a number of interesting features:

1. One of the classes (Iris Setosa) is linearly separable from the other two. However, the other two classes are not linearly separable from each other.

2. There is some overlap between the Versicolor and Virginica classes, so a perfect classification rate is unlikely.

3. There is some redundancy in the four input variables, so it is possible to achieve a good solution with only three of them, or even (with difficulty) with two, but the precise choice of best variables is not obvious.
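
The first two properties can be checked directly. The sketch below assumes scikit-learn's bundled copy of Iris (where class 0 is setosa and column 2 is petal length) and looks at a single feature. Overlap on one feature only illustrates, and does not prove, that versicolor and virginica are hard to separate.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]      # column 2 is petal length (cm)
is_setosa = iris.target == 0        # class 0 is setosa in this copy

# Largest setosa petal vs. smallest petal among the other two species
setosa_max = petal_length[is_setosa].max()
others_min = petal_length[~is_setosa].min()
print('setosa max petal length:', setosa_max)
print('others min petal length:', others_min)

# Any threshold between the two values splits setosa from the rest with no errors
threshold = (setosa_max + others_min) / 2
errors = np.sum((petal_length < threshold) != is_setosa)
print('misclassified by the threshold rule:', errors)

# Versicolor (1) and virginica (2) overlap on the same feature
versicolor_max = petal_length[iris.target == 1].max()
virginica_min = petal_length[iris.target == 2].min()
print('versicolor/virginica ranges overlap:', versicolor_max > virginica_min)
```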

**Why am I using the Iris dataset?**

1- This is a good project because it is so well understood.

2- Attributes are numeric so you have to figure out how to load and handle data.

3- It is a classification problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.

4- It is a multi-class classification problem (multi-nominal) that may require some specialized handling.

5- It only has 4 attributes and 150 rows, meaning it is small and easily fits into memory (and a screen or A4 page).

6- All of the numeric attributes are in the same units and the same scale, not requiring any special scaling or transforms to get started.[5]

7- We can also frame the problem as a clustering (unsupervised learning) project.
<a id="5"></a> <br>
### 4-2 Aim
The aim is to classify iris flowers into three species (setosa, versicolor, or virginica) from measurements of the length and width of their sepals and petals.
<a id="6"></a> <br>
### 4-3 Variables
The variables are:

* **sepal_length**: Sepal length, in centimeters, used as input.
* **sepal_width**: Sepal width, in centimeters, used as input.
* **petal_length**: Petal length, in centimeters, used as input.
* **petal_width**: Petal width, in centimeters, used as input.
* **setosa**: Iris setosa, true or false, used as target.
* **versicolour**: Iris versicolour, true or false, used as target.
* **virginica**: Iris virginica, true or false, used as target.
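
The three true/false target variables correspond to a one-hot encoding of the species label. A minimal sketch with `pandas.get_dummies` follows; the four-row `species` series is a made-up sample for illustration, not data from the notebook.

```python
import pandas as pd

# Hypothetical mini-sample of species labels, one row per flower
species = pd.Series(['setosa', 'versicolour', 'virginica', 'setosa'], name='species')

# One-hot encode: one indicator column per species
targets = pd.get_dummies(species)
print(targets)

# Exactly one indicator is set in every row
rows_ok = (targets.sum(axis=1) == 1).all()
print('one species per row:', rows_ok)
```

Most scikit-learn classifiers accept the single label column directly, so this encoding is mainly useful for models that expect one output per class.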

**<< Note >>**
> You must answer the following question:
How does your company expect to use and benefit from your model?

<a id="7"></a> <br>
## 5- Inputs & Outputs
<a id="8"></a> <br>
### 5-1 Inputs
**Iris** is a very popular **classification** and **clustering** problem in machine learning; it plays the role of the "Hello world" program you write when learning a new programming language. That is why I decided to apply several machine learning methods to it.
The Iris flower data set or Fisher's Iris data set is a **multivariate data set** introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers in three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

As a result, **iris dataset is used as the input of all algorithms**.
<a id="9"></a> <br>
### 5-2 Outputs
The outputs of our algorithms depend on whether a classification or a clustering algorithm is used.
They can be the number of clusters or the predicted class for a new input.

* **setosa**: Iris setosa, true or false, used as target.
* **versicolour**: Iris versicolour, true or false, used as target.
* **virginica**: Iris virginica, true or false, used as target.

<a id="4"></a> <br>
## 6- Import packages

In [1]:
# Standard library
import csv
import json
import os
import sys
import warnings

# Numerical and data-handling stack
import numpy as np
import pandas as pd
from pandas import get_dummies
import scipy

# Plotting
import matplotlib
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import seaborn as sns

# scikit-learn: data, splitting, decomposition and evaluation metrics
import sklearn
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

In [2]:
print('matplotlib: {}'.format(matplotlib.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('scipy: {}'.format(scipy.__version__))
print('seaborn: {}'.format(sns.__version__))
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))

matplotlib: 2.2.3
sklearn: 0.20.0
scipy: 1.1.0
seaborn: 0.8.1
pandas: 0.23.4
numpy: 1.15.3
Python: 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16) 
[GCC 7.3.0]


<a id="17"></a> <br>
## 6-1 Data Collection
**Data collection** is the process of gathering and measuring data, information or any variables of interest in a standardized and established manner that enables the collector to answer questions, test hypotheses and evaluate outcomes of the particular collection.[techopedia]

The **Iris dataset** contains the petal and sepal length and width of 3 different types of irises (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray.

The rows are the samples and the columns are: Sepal Length, Sepal Width, Petal Length and Petal Width.[6]
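
The 150x4 layout described above is the one used by scikit-learn's bundled copy of the dataset. As a small check, assuming scikit-learn is installed, the shape and the number of samples per species can be printed like this:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Feature matrix: one row per flower, one column per measurement
print(type(iris.data), iris.data.shape)
print(iris.feature_names)

# Number of samples per species, in the order given by target_names
class_counts = np.bincount(iris.target)
print(dict(zip(iris.target_names, class_counts)))
```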


In [3]:
# import Dataset to play with it
dataset = pd.read_csv('../input/Iris.csv')

**<< Note 1 >>**

* Each row is an observation (also known as : sample, example, instance, record)
* Each column is a feature (also known as: Predictor, attribute, Independent Variable, input, regressor, Covariate)

<a id="32"></a> <br>
## 7- Model Deployment
This section applies several **learning algorithms** to the Iris data; working through them is a good way to build experience and deepen your knowledge of ML techniques.

> **<< Note 3 >>** : The results shown here may be slightly different for your analysis because, for example, the neural network algorithms use random number generators for fixing the initial value of the weights (starting points) of the neural networks, which often result in obtaining slightly different (local minima) solutions each time you run the analysis. Also note that changing the seed for the random number generator used to create the train, test, and validation samples can change your results.
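
The effect of the seed on the train/test split can be seen with a small experiment. This is a sketch on synthetic toy data (20 numbered samples), not on the notebook's dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with a single feature, labels alternate 0/1
X = np.arange(20).reshape(-1, 1)
y = np.arange(20) % 2

# Same seed twice, then a different seed; keep only the test features
_, test_a, _, _ = train_test_split(X, y, test_size=0.25, random_state=0)
_, test_b, _, _ = train_test_split(X, y, test_size=0.25, random_state=0)
_, test_c, _, _ = train_test_split(X, y, test_size=0.25, random_state=1)

same_seed_same_split = np.array_equal(test_a, test_b)
different_seed_same_split = np.array_equal(test_a, test_c)
print('seed 0 vs seed 0 identical:', same_seed_same_split)
print('seed 0 vs seed 1 identical:', different_seed_same_split)
```

Fixing `random_state` makes the split repeatable, which is why the split later in this notebook passes an explicit value.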

## 7-1 Families of ML algorithms
Machine learning algorithms fall into several categories; some of them are listed below:
* Linear
    * Linear Regression
    * Logistic Regression
    * Support Vector Machines
* Tree-Based
    * Decision Tree
    * Random Forest
    * GBDT
* KNN
* Neural Networks

-----------------------------
If we instead categorize ML algorithms by the type of learning task, we get the following groups:
* Classification

    * k-Nearest Neighbors
    * Logistic Regression
    * SVM
    * DT
    * NN

* Clustering

    * K-means
    * HCA
    * Expectation Maximization

* Visualization and dimensionality reduction

    * Principal Component Analysis (PCA)
    * Kernel PCA
    * Locally-Linear Embedding (LLE)
    * t-distributed Stochastic Neighbor Embedding (t-SNE)

* Association rule learning

    * Apriori
    * Eclat
* Semisupervised learning
* Reinforcement Learning
    * Q-learning
* Batch learning & Online learning
* Ensemble Learning

**<< Note >>**
> There is no method which outperforms all others for all tasks.



<a id="33"></a> <br>
## 7-2 Prepare Features & Targets
First of all, separate the data into independent (feature) and dependent (target) variables.

**<< Note 4 >>**
* X==>>Feature
* y==>>Target

In [None]:

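# Assumes the Kaggle Iris.csv layout: an 'Id' column, four measurements, then 'Species'.
# 'Id' is a row counter ordered by species, so drop it (if present) to avoid leaking the target.
features = dataset.drop(columns=['Id'], errors='ignore')
X = features.iloc[:, :-1].values  # the measurement columns
y = features.iloc[:, -1].values   # the species label

# Split into training (80%) and test (20%) sets; the fixed seed makes the split repeatable
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)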

After loading the data via **pandas**, we should check what it contains: its shape, column types, and summary statistics.
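
A minimal sketch of such an inspection is shown below. To stay self-contained it builds a stand-in DataFrame (`frame`, a name chosen here for illustration) from scikit-learn's copy of Iris; in the notebook the same calls would be applied to `dataset`.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for the notebook's `dataset`: built from scikit-learn's copy of Iris
iris = load_iris()
frame = pd.DataFrame(iris.data, columns=iris.feature_names)
frame['species'] = iris.target_names[iris.target]

print(frame.shape)                       # rows and columns
print(frame.head())                      # first five rows
print(frame.dtypes)                      # column types
print(frame.describe())                  # summary statistics of numeric columns
print(frame['species'].value_counts())   # samples per class
```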

<a id="46"></a> <br>
## 7-3 RandomForest
A random forest is a meta estimator that **fits a number of decision tree classifiers** on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. 

In the scikit-learn version used here, the sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (default).

In [None]:
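from sklearn.ensemble import RandomForestClassifier

# Shallow trees (max_depth=2) keep each tree simple; the forest combines their votes
model = RandomForestClassifier(max_depth=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision/recall/F1, then the confusion matrix (rows = true, columns = predicted)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_test, y_pred))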

<a id="47"></a> <br>
## 7-4 Bagging classifier 
A Bagging classifier is an ensemble **meta-estimator** that fits base classifiers each on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.

This algorithm encompasses several works from the literature. When random subsets of the dataset are drawn as random subsets of the samples, then this algorithm is known as Pasting. If samples are drawn with replacement, then the method is known as Bagging. When random subsets of the dataset are drawn as random subsets of the features, then the method is known as Random Subspaces. Finally, when base estimators are built on subsets of both samples and features, then the method is known as Random Patches.[http://scikit-learn.org]

In [None]:
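from sklearn.ensemble import BaggingClassifier

# Default settings: decision trees as base estimators, trained on bootstrap samples
model = BaggingClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision/recall/F1, then the confusion matrix (rows = true, columns = predicted)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_test, y_pred))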
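
The four variants described above (Pasting, Bagging, Random Subspaces, Random Patches) map onto the `max_samples`, `max_features` and `bootstrap` parameters of `BaggingClassifier`. The sketch below is one possible way to set them up; the fractions 0.7 and 0.5 and the number of estimators are illustrative assumptions, and the data comes from scikit-learn's copy of Iris so that the block runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each variant differs only in how the training subset for a base estimator is drawn.
# The fractions (0.7 and 0.5) are illustrative choices, not recommendations.
variants = {
    'Pasting':          dict(max_samples=0.7, bootstrap=False),
    'Bagging':          dict(max_samples=0.7, bootstrap=True),
    'Random Subspaces': dict(max_features=0.5, bootstrap=False),
    'Random Patches':   dict(max_samples=0.7, max_features=0.5, bootstrap=True),
}

scores = {}
for name, params in variants.items():
    # The default base estimator is a decision tree
    model = BaggingClassifier(n_estimators=25, random_state=0, **params)
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(name, scores[name])
```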

<a id="48"></a> <br>
##  7-5 AdaBoost classifier

An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
This class implements the algorithm known as **AdaBoost-SAMME** .

In [None]:
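from sklearn.ensemble import AdaBoostClassifier

# Default settings: a sequence of weak learners, each focusing on earlier mistakes
model = AdaBoostClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision/recall/F1, then the confusion matrix (rows = true, columns = predicted)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_test, y_pred))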
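
To make the reweighting step concrete, here is a sketch of a single textbook SAMME round in NumPy. The labels and predictions are hypothetical values invented for this example, and the code follows the published SAMME update rather than scikit-learn's internals (which, for instance, also apply a learning rate).

```python
import numpy as np

n_classes = 3                                   # K: Iris has three species
y_true = np.array([0, 0, 1, 1, 2, 2, 1, 0])     # hypothetical labels
y_pred = np.array([0, 0, 1, 2, 2, 2, 1, 1])     # hypothetical weak-learner output

# Start from uniform sample weights
weights = np.full(len(y_true), 1.0 / len(y_true))
missed = y_pred != y_true

# Weighted error of the weak learner
err = np.sum(weights[missed]) / np.sum(weights)

# SAMME learner weight; the log(K - 1) term is the multi-class correction
alpha = np.log((1.0 - err) / err) + np.log(n_classes - 1.0)

# Up-weight only the misclassified samples, then renormalize to sum to 1
new_weights = weights * np.exp(alpha * missed)
new_weights /= new_weights.sum()

print('weighted error:', err)
print('learner weight alpha:', alpha)
print('weights before:', weights)
print('weights after: ', new_weights)
```

After the update, the two misclassified samples carry more weight than before, so the next weak learner is pushed to get them right.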

<a id="49"></a> <br>
## 7-6 Gradient Boosting Classifier
GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions.

In [None]:
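from sklearn.ensemble import GradientBoostingClassifier

# Default settings: trees are added one stage at a time to reduce the loss
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision/recall/F1, then the confusion matrix (rows = true, columns = predicted)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_test, y_pred))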
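
The "forward stage-wise" idea can be sketched from scratch. The example below is a simplification under stated assumptions: it uses squared loss on a synthetic regression problem (the classifier above optimizes a classification loss instead), and the learning rate, tree depth and number of stages are arbitrary illustrative values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (synthetic data, for illustration only)
rng = np.random.RandomState(0)
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_stages = 50

# Stage 0: the best constant prediction under squared loss is the mean
prediction = np.full_like(y, y.mean())
mse_per_stage = [np.mean((y - prediction) ** 2)]

for _ in range(n_stages):
    # For squared loss, the negative gradient is simply the residual
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    # Add a shrunken copy of the new tree to the running model
    prediction += learning_rate * tree.predict(X)
    mse_per_stage.append(np.mean((y - prediction) ** 2))

print('training MSE at stage 0: ', mse_per_stage[0])
print('training MSE at stage 50:', mse_per_stage[-1])
```

Each stage leaves the earlier trees untouched and only adds a new one, which is what "additive model" means here. XGBoost builds on the same scheme and adds regularization and engineering optimizations on top.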

<a id="50"></a> <br>
## 7-7 Linear Discriminant Analysis
Linear Discriminant Analysis (discriminant_analysis.LinearDiscriminantAnalysis) and Quadratic Discriminant Analysis (discriminant_analysis.QuadraticDiscriminantAnalysis) are two classic classifiers, with, as their names suggest, a **linear and a quadratic decision surface**, respectively.

These classifiers are attractive because they have closed-form solutions that can be easily computed, are inherently multiclass, have proven to work well in practice, and have no **hyperparameters** to tune.

In [None]:
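from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Closed-form fit: class means plus one covariance matrix shared by all classes
model = LinearDiscriminantAnalysis()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision/recall/F1, then the confusion matrix (rows = true, columns = predicted)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_test, y_pred))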

<a id="51"></a> <br>
## 7-8 Quadratic Discriminant Analysis
A classifier with a quadratic decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule.

The model fits a **Gaussian** density to each class.

In [None]:
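from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Closed-form fit: a separate mean and covariance matrix for each class
model = QuadraticDiscriminantAnalysis()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision/recall/F1, then the confusion matrix (rows = true, columns = predicted)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is', accuracy_score(y_test, y_pred))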
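
As a sketch of what "fitting a Gaussian density to each class and using Bayes' rule" amounts to, the score that QDA assigns to class $k$ for an input $x$ can be written as follows. The symbols are notation introduced here rather than taken from the notebook: $\mu_k$ and $\Sigma_k$ are the mean and covariance of class $k$, $P(y=k)$ is its prior, and $C$ is a term shared by all classes.

$$
\log P(y=k \mid x) = -\frac{1}{2}\log\lvert\Sigma_k\rvert - \frac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k) + \log P(y=k) + C
$$

The predicted class is the $k$ with the largest score. Because each class has its own $\Sigma_k$, the score is quadratic in $x$. If all classes share one covariance matrix, the quadratic terms cancel between classes and the decision boundary becomes linear, which is the LDA case from the previous section.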

In [4]:
type(dataset)

pandas.core.frame.DataFrame

<a id="0"></a> <br>
# 9-Conclusion
In this tutorial you learned:
* That XGBoost is a library for developing fast, high-performance gradient boosted tree models.
* That XGBoost achieves strong performance on a range of difficult machine learning tasks.
* That you can use this library from the command line, Python, and R, and how to get started.



 

<a id="31"></a> <br>
# 10-References

* [1] [DataCamp](https://www.datacamp.com/community/tutorials/xgboost-in-python)
* [2] [XGBoost presentation](https://www.oreilly.com/library/view/data-science-from/9781491901410/ch04.html)
* [3] [Machine Learning Mastery](https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/)
* [4] [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)
* [5] [GitHub](https://github.com/mjbahmani)

