### 2. What is XGBoost?
* XGBoost is an incredibly popular machine learning library for good reason. It was developed originally as a C++ command-line application. After winning a popular machine learning competition, the package started being adopted within the ML community. As a result, bindings, or functions that tapped into the core C++ code, started appearing in a variety of other languages, including Python, R, Scala, and Julia. We will cover the Python API in this course.

### 3. What makes XGBoost so popular?
* What makes XGBoost so popular? Its speed and performance. Because the core XGBoost algorithm is parallelizable, it can harness all of the processing power of modern multi-core computers. Furthermore, it can be parallelized onto GPUs and across networks of computers, making it feasible to train models on very large datasets, on the order of hundreds of millions of training examples. However, XGBoost's speed isn't the package's real draw. Ultimately, a fast but poorly performing machine learning algorithm is not going to have wide adoption within the community. What makes XGBoost so popular is that it consistently outperforms almost all other single-algorithm methods in machine learning competitions and has been shown to achieve state-of-the-art performance on a variety of benchmark machine learning datasets. Here's an example of how we can use XGBoost on a classification problem.

In [1]:
import pandas as pd

In [2]:
churn_data = pd.read_csv("../chronic_kidney_disease.csv")
churn_data.head()

Unnamed: 0,48,80,1.020,1,0,?,normal,notpresent,notpresent.1,121,...,44,7800,5.2,yes,yes.1,no,good,no.1,no.2,ckd
0,7,50,1.02,4,0,?,normal,notpresent,notpresent,?,...,38,6000,?,no,no,no,good,no,no,ckd
1,62,80,1.01,2,3,normal,normal,notpresent,notpresent,423,...,31,7500,?,no,yes,no,poor,no,yes,ckd
2,48,70,1.005,4,0,normal,abnormal,present,notpresent,117,...,32,6700,3.9,yes,no,no,poor,yes,yes,ckd
3,51,80,1.01,2,0,normal,normal,notpresent,notpresent,106,...,35,7300,4.6,no,no,no,good,no,no,ckd
4,60,90,1.015,3,0,?,?,notpresent,notpresent,74,...,39,7800,4.4,yes,yes,no,good,yes,no,ckd


In [5]:
churn_data.columns

Index(['48', '80', '1.020', '1', '0', '?', 'normal', 'notpresent',
       'notpresent.1', '121', '36', '1.2', '?.1', '?.2', '15.4', '44', '7800',
       '5.2', 'yes', 'yes.1', 'no', 'good', 'no.1', 'no.2', 'ckd'],
      dtype='object')

### XGBoost: Fit/Predict

It's time to create your first XGBoost model! As Sergey showed you in the video, you can use the scikit-learn .fit() / .predict() paradigm that you are already familiar with to build your XGBoost models, as the xgboost library has a scikit-learn compatible API!

Here, you'll be working with churn data. This dataset contains imaginary data from a ride-sharing app with user behaviors over their first month of app usage in a set of imaginary cities as well as whether they used the service 5 months after sign-up. It has been pre-loaded for you into a DataFrame called churn_data - explore it in the Shell!

Your goal is to use the first month's worth of data to predict whether the app's users will remain users of the service at the 5-month mark. This is a typical setup for a churn prediction problem. To do this, you'll split the data into training and test sets, fit a small xgboost model on the training set, and evaluate its performance on the test set by computing its accuracy.

pandas and numpy have been imported as pd and np, and train_test_split has been imported from sklearn.model_selection. Additionally, the arrays for the features and the target have been created as X and y.

**Instructions**

* Import xgboost as xgb.
* Create training and test sets such that 20% of the data is used for testing. Use a random_state of 123.
* Instantiate an XGBClassifier as xg_cl using xgb.XGBClassifier(). Specify n_estimators to be 10 estimators and an objective of 'binary:logistic'. Do not worry about what this means just yet; you will learn about these parameters later in this course.
* Fit xg_cl to the training set (X_train, y_train) using the .fit() method.
* Predict the labels of the test set (X_test) using the .predict() method and hit 'Submit Answer' to print the accuracy.


In [3]:
%pip install xgboost

Collecting xgboost
  Downloading xgboost-1.5.1-py3-none-manylinux2014_x86_64.whl (173.5 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.5.1
Note: you may need to restart the kernel to use updated packages.


In [3]:
# Import xgboost, numpy, and train_test_split
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the training and test sets
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)

# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)

# Compute the accuracy: accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))




ValueError: DataFrame.dtypes for data must be int, float, bool or category.  When
categorical type is supplied, DMatrix parameter `enable_categorical` must
be set to `True`. Invalid columns:48, 80, 1.020, 1, 0, ?, normal, notpresent, notpresent.1, 121, 36, 1.2, ?.1, ?.2, 15.4, 44, 7800, 5.2, yes, yes.1, no, good, no.1, no.2
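The error above is raised because every column was read as text. Judging from the output of `head()`, the CSV seems to have no header row (its first data row became the column names) and uses `?` for missing values. The following is a minimal sketch of one way to get numeric features, under those assumptions. It uses a tiny made-up sample in place of the real file, and the column names are hypothetical.

```python
import io

import pandas as pd

# Tiny stand-in for the CSV: no header row, "?" marks missing values.
# (Made-up sample for illustration; the real file has 25 columns.)
raw = io.StringIO(
    "48,80,1.020,normal,yes,ckd\n"
    "7,50,1.020,?,no,ckd\n"
    "62,?,1.010,normal,no,notckd\n"
)

# header=None stops pandas from using the first data row as column names;
# na_values="?" turns the placeholder into NaN so numeric columns parse as numbers.
df = pd.read_csv(raw, header=None, na_values="?")
df.columns = ["age", "bp", "sg", "rbc", "htn", "label"]  # hypothetical names

X = df.drop(columns="label")
y = (df["label"] == "ckd").astype(int)  # 1 = ckd, 0 = notckd

# Encode the remaining text columns as integer codes and keep missing
# entries as NaN, which XGBoost treats as missing by default.
for col in X.select_dtypes(exclude="number").columns:
    codes = X[col].astype("category").cat.codes  # -1 marks a missing value
    X[col] = codes.where(codes >= 0)

print(X.dtypes)
print(y.tolist())
```

Integer codes impose an arbitrary order on the categories. That is usually tolerable for tree models, but one-hot encoding is a common alternative.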

In [4]:
# Import the necessary modules
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the classifier: dt_clf_4
dt_clf_4 = DecisionTreeClassifier()

# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)

# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("accuracy:", accuracy)


NameError: name 'X' is not defined

### 1. What is Boosting?
* Now that we've reviewed both supervised learning and the basics of decision trees, let's talk about the core concept that gives XGBoost its state-of-the-art performance: boosting.

### 2. Boosting overview
* At bottom, boosting isn't really a specific machine learning algorithm, but a concept that can be applied to a set of machine learning models. So, it's really a meta-algorithm. Specifically, it is an ensemble meta-algorithm primarily used to reduce a single learner's bias (and, to a lesser extent, its variance) and to convert many weak learners into an arbitrarily strong learner.

### 3. Weak learners and strong learners
* A weak learner is any machine learning algorithm that is just slightly better than chance. So, a decision tree that can predict some outcome slightly more frequently than pure randomness would be considered a weak learner. The principal insight that allows XGBoost to work is the fact that you can use boosting to convert a collection of weak learners into a strong learner, where a strong learner is any algorithm that can be tuned to achieve arbitrarily good performance on some supervised learning problem.

### 4. How boosting is accomplished

* How is this accomplished? By iteratively learning a set of weak models on subsets of the data you have at hand, and weighting each of their predictions according to each weak learner's performance. You then combine all of the weak learners' predictions multiplied by their weights to obtain a single final weighted prediction that is much better than any of the individual predictions themselves. It's kind of incredible that this works as well as it does.
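To make the weighting idea concrete, here is a minimal sketch of AdaBoost-style boosting with one-split decision stumps on a made-up 1-D dataset. AdaBoost is one classic boosting algorithm, not the gradient boosting that XGBoost implements. The sketch also reweights training examples between rounds instead of drawing subsets. All names and the data are assumptions made for illustration.

```python
import math

# Toy 1-D dataset (made up): labels are +1 / -1, and no single threshold
# separates them, so every individual stump is a weak learner here.
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]


def stump_predict(x, threshold, sign):
    """Decision stump: predict `sign` left of the threshold, `-sign` right of it."""
    return sign if x < threshold else -sign


def best_stump(weights):
    """Return (weighted_error, threshold, sign) of the lowest-error stump."""
    best = None
    for threshold in [v + 0.5 for v in range(0, 11)]:
        for sign in (1, -1):
            err = sum(w for xi, yi, w in zip(X, y, weights)
                      if stump_predict(xi, threshold, sign) != yi)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best


def boost(n_rounds):
    """Fit n_rounds stumps, each weighted by how well it performed."""
    weights = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(n_rounds):
        err, threshold, sign = best_stump(weights)
        err = max(err, 1e-10)  # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # this learner's vote weight
        ensemble.append((alpha, threshold, sign))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * math.exp(-alpha * yi * stump_predict(xi, threshold, sign))
                   for xi, yi, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted sum of all stump predictions."""
    score = sum(alpha * stump_predict(x, t, s) for alpha, t, s in ensemble)
    return 1 if score >= 0 else -1


def accuracy(ensemble):
    return sum(predict(ensemble, xi) == yi for xi, yi in zip(X, y)) / len(X)


for rounds in (1, 2, 3):
    print(rounds, accuracy(boost(rounds)))
```

On this toy data a single stump gets 7 of 10 examples right, while the weighted vote of three stumps classifies all 10 training examples correctly.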

### 6. Model evaluation through cross-validation

* Since we will be working with XGBoost's learning API for model evaluation next, it's worth briefly seeing how cross-validation works with that API. The learning API is different from the scikit-learn compatible API and has cross-validation capabilities built in. As a refresher, cross-validation is a robust method for estimating the expected performance of a machine learning model on unseen data: it generates many non-overlapping train/test splits of your training data and reports the average test set performance across all splits.
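The cross-validation procedure described above can be sketched in plain Python. This is only an illustration of the idea, not how `xgb.cv()` is implemented. The helper names and the majority-class "model" are hypothetical stand-ins.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=123):
    """Shuffle the row indices and deal them into n_folds disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(X, y, fit, score, n_folds=3):
    """Train on all folds but one, score on the held-out fold, average the scores."""
    scores = []
    for test_idx in k_fold_indices(len(X), n_folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model,
                            [X[i] for i in test_idx],
                            [y[i] for i in test_idx]))
    return sum(scores) / len(scores)


# Hypothetical stand-in model: always predict the training fold's majority class.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def accuracy(model, X_test, y_test):
    return sum(model == label for label in y_test) / len(y_test)


X = list(range(12))
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(cross_validate(X, y, fit_majority, accuracy))
```

Every row lands in exactly one test fold, so each example is used for evaluation exactly once.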

### Measuring accuracy

You'll now practice using XGBoost's learning API through its built-in cross-validation capabilities. As Sergey discussed in the previous video, XGBoost gets its lauded performance and efficiency gains by utilizing its own optimized data structure for datasets called a DMatrix.

In the previous exercise, the input datasets were converted into DMatrix data on the fly, but when you use the xgboost cv object, you have to first explicitly convert your data into a DMatrix. So, that's what you will do here before running cross-validation on churn_data.

**Instructions**

* Create a `DMatrix` called `churn_dmatrix` from churn_data using `xgb.DMatrix()`. The features are available in X and the labels in y.
* Perform 3-fold cross-validation by calling `xgb.cv()`. dtrain is your `churn_dmatrix`, `params` is your parameter dictionary, nfold is the number of cross-validation folds (`3`), num_boost_round is the number of trees we want to build (`5`), metrics is the metric you want to compute (this will be `"error"`, which we will convert to an accuracy).



In [None]:
# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the DMatrix from X and y: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:logistic", "max_depth":3}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, 
                  nfold=3, num_boost_round=5, 
                  metrics="error", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the accuracy
print(((1-cv_results["test-error-mean"]).iloc[-1]))

In [None]:
# Perform cross_validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, 
                  nfold=3, num_boost_round=5, 
                  metrics="auc", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the AUC
print((cv_results["test-auc-mean"]).iloc[-1])

### 2. When to use XGBoost
* Given that I've already talked a bit about when and where XGBoost shines, some of this shouldn't come as a surprise to you. You should consider using XGBoost for any supervised machine learning task that fits the following criteria: You have a large number of training examples. Although your definition of large can vary, I intend it to mean a dataset that has few features and at least 1000 examples. However, in general, as long as the number of features in your training set is smaller than the number of examples you have, you should be fine. Finally, XGBoost tends to do well when you have a mixture of categorical and numeric features, or when you have just numeric features.

### 3. When to NOT use XGBoost
* When should you not use XGBoost? The most important kinds of problems where XGBoost is a suboptimal choice are those that have found success using other state-of-the-art algorithms and those that suffer from dataset size issues. Specifically, XGBoost is not ideally suited for image recognition, computer vision, or natural language processing and understanding problems, as those kinds of problems can be tackled much better using deep learning approaches. In terms of dataset size problems, XGBoost is not suitable when you have very small training sets (fewer than 100 training examples) or when the number of training examples is significantly smaller than the number of features being used for training.