# What is XGBoost:

XGBoost is an optimized, distributed gradient boosting library designed to scale to large datasets. It is a popular choice for tabular data and has proven effective in many competitions and real-world applications. Here is a step-by-step explanation of how an XGBoost workflow proceeds, using a real-world example:

1. Data Preparation: The first step is to prepare the data. Let's say we have a dataset of customer information that includes demographic features such as age, gender, and income, as well as past purchasing behavior such as purchase history, amount spent, and time of purchase.

2. Splitting Data: The next step is to split the data into training and testing sets. We use the training set to train our XGBoost model and the testing set to evaluate its performance.

3. Model Initialization: We initialize an XGBoost model by specifying hyperparameters such as the learning rate, the number of trees, and the maximum tree depth. We can start with the default values and tune them later.

4. Training Model: We then train the model on the training data. XGBoost builds an ensemble of decision trees sequentially, with each new tree fitted to correct the errors of the ensemble built so far. When growing a tree, it picks the split that most reduces the loss function (such as squared error for regression or log loss for classification) on the training data.

5. Evaluating Model: After training, we evaluate the model on the testing data. For a classification task we can use metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.

6. Tuning Hyperparameters: If the performance is not satisfactory, we can tune the hyperparameters to improve it, using techniques such as grid search or random search to find the best combination.

7. Making Predictions: Once we have a tuned model, we can use it to make predictions on new data. For example, we can predict which customers are likely to make a purchase in the future based on their demographics and past purchasing behavior.
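
To make the "each tree corrects the previous errors" idea in step 4 concrete, here is a minimal sketch of gradient boosting for a regression problem with squared-error loss, written in plain Python. It is not how XGBoost is implemented: XGBoost adds regularization, uses second-order gradient information, and finds splits far more efficiently. The toy data, the function names (`fit_stump`, `stump_predict`, `boost`), and the hyperparameter values are all assumptions made for illustration.

```python
# Toy gradient boosting with squared-error loss (illustration only).
# Each "tree" is a one-split decision stump fitted to the current residuals.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) with the lowest squared error."""
    best = None
    # Candidate splits are "x <= threshold"; dropping the largest value
    # guarantees that the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Fit stumps one after another, each on the residuals of the ensemble so far."""
    base = sum(ys) / len(ys)          # initial prediction: the mean target
    preds = [base] * len(ys)
    stumps, losses = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate.
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, losses

# Made-up one-feature dataset with a jump between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

base, stumps, losses = boost(xs, ys)
print(f"training MSE after 1 tree:   {losses[0]:.4f}")
print(f"training MSE after {len(stumps)} trees: {losses[-1]:.4f}")
```

The training error shrinks as trees are added, which is the behavior the learning rate and number-of-trees hyperparameters control.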

In summary, XGBoost is a powerful machine learning algorithm that uses an ensemble of decision trees to make accurate predictions on large-scale datasets. It can be used in various applications such as fraud detection, customer churn prediction, and recommendation systems.

EXAMPLE:

Suppose you are a data scientist at a large e-commerce company, and your task is to build a machine learning model that predicts which products a customer is likely to purchase based on their past purchase history and demographic information. You have a dataset of millions of customers, each with hundreds of features, and you need a model that can handle data at this scale efficiently.

You decide to use XGBoost, as it is well-suited for handling large-scale datasets and has a proven track record in competitions and industry applications.

First, you split your dataset into training and testing sets. You use the training set to train an XGBoost model, tuning hyperparameters like learning rate, number of trees, and maximum depth to improve performance. You use evaluation metrics like accuracy, precision, recall, and ROC-AUC to measure the performance of the model on the testing set.

Once you have a tuned model, you deploy it in production to make predictions on new data. For example, when a customer visits your website, your XGBoost model can analyze their past purchase history and demographic information to recommend products that they are likely to be interested in. This can improve the customer experience and drive sales for your e-commerce company.

Overall, XGBoost is a powerful machine learning algorithm that can handle large-scale data and make accurate predictions in real-world applications like e-commerce, finance, and healthcare.

# Import Required Libraries:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Read dataset:

In [2]:
dataset = pd.read_csv("Data.csv")

In [3]:
# Recode the target labels: 2 -> 0 and 4 -> 1
dataset['Class'] = dataset['Class'].replace({2:0, 4:1})

In [4]:
dataset

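| | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 0 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 0 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 0 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 0 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 678 | 776715 | 3 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 0 |
| 679 | 841769 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 0 |
| 680 | 888820 | 5 | 10 | 10 | 3 | 7 | 3 | 8 | 10 | 2 | 1 |
| 681 | 897471 | 4 | 8 | 6 | 4 | 3 | 4 | 10 | 6 | 1 | 1 |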


In [5]:
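# Features: all columns except the last (this includes the "Sample code number" ID column).
x = dataset.iloc[:, :-1].values
# Target: the last column, "Class".
y = dataset.iloc[:, -1].values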

# Splitting the dataset into training and testing:

In [6]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

In [None]:
%pip install xgboost

# Training the XGBoost on the training set:

In [7]:
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train,y_train)

In [8]:
y_pred = classifier.predict(X_test)

# Making the confusion matrix:

In [10]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test,y_pred)
print(cm)
accuracy_score(y_test,y_pred)

[[85  2]
 [ 1 49]]


0.9781021897810219
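
The other classification metrics mentioned earlier (precision, recall, F1-score) can be read off the same confusion matrix. The sketch below hard-codes the matrix printed above and treats class 1 (originally labeled 4) as the positive class, which is an assumption about what the reader cares about. It relies on scikit-learn's layout, where rows are true labels and columns are predicted labels.

```python
# Confusion matrix copied from the output above:
# rows = true class (0, 1), columns = predicted class (0, 1).
cm = [[85, 2],
      [1, 49]]

(tn, fp), (fn, tp) = cm  # positive class assumed to be 1

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 134 / 137
precision = tp / (tp + fp)                   # 49 / 51
recall = tp / (tp + fn)                      # 49 / 50
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.4f}")   # 0.9781
print(f"precision: {precision:.4f}")  # 0.9608
print(f"recall:    {recall:.4f}")     # 0.9800
print(f"f1-score:  {f1:.4f}")         # 0.9703
```

In the notebook itself, `sklearn.metrics.precision_score`, `recall_score`, and `f1_score` applied to `y_test` and `y_pred` should give the same numbers.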

# Applying the K-fold cross validation:

In [11]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print(" Accuracy: {: .2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {: .2f} %".format(accuracies.std()*100))

 Accuracy:  96.53 %
Standard Deviation:  2.63 %
