In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv("./concrete_data.csv")
data.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


In [3]:
X = data.iloc[:, :-1].values
Y = data.iloc[:, -1].values.reshape(-1, 1)


In [4]:
print(np.shape(X))
print(np.shape(Y))

(1030, 8)
(1030, 1)


In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2021)
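
As a rough illustration of what an 80/20 split does to the row counts, the sketch below shuffles row indices in plain Python and cuts off a test set. The helper `simple_split` is hypothetical and does not reproduce scikit-learn's shuffling or its exact rounding rule; it only shows the sizes that result for the 1030 rows in this dataset.

```python
import random

def simple_split(n_rows, test_size=0.2, seed=2021):
    """Hypothetical helper: shuffle row indices and cut off a test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # seeded, so the split is repeatable
    n_test = round(n_rows * test_size)        # number of rows held out for testing
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(1030)
print(len(train_idx), len(test_idx))  # 824 206
```

The 824 training rows are what the later cross-validation folds are drawn from; the 206 test rows stay untouched until final evaluation.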

# XGBoost (Extreme Gradient Boosting)

XGBoost is an optimized implementation of gradient-boosted decision trees that is widely used for predictive modeling on tabular data.

---

## 🌱 What It Does

**Gradient Boosting**:  
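XGBoost builds decision trees sequentially, with each new tree fitted to correct the errors of the trees before it. Each tree is fitted to the gradient of the loss function with respect to the current predictions, so adding trees amounts to gradient descent on the prediction error.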
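
The sequential idea can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: a single feature, one-split trees (stumps), and squared-error loss, for which the negative gradient is simply the residual. XGBoost itself additionally uses second-order gradient information and regularization, which this sketch omits. The names `fit_stump`, `boost`, and `mse` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds, learning_rate=0.3):
    """Fit stumps one after another, each on the residuals of the current model."""
    base = sum(ys) / len(ys)            # start from the mean prediction
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of squared error
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return trees, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 2.0, 4.0, 4.5, 7.0, 7.5, 9.0]
trees_1, preds_1 = boost(xs, ys, n_rounds=1)
trees_50, preds_50 = boost(xs, ys, n_rounds=50)
print(mse(ys, preds_1), mse(ys, preds_50))  # training error shrinks as trees are added
```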

---

## 🔑 Key Features

- **High Performance**: Optimized for speed and efficiency  
- **Regularization**: Built-in L1 and L2 regularization to prevent overfitting  
- **Parallel Processing**: Can utilize multiple CPU cores  
- **Handling Missing Values**: Automatically handles missing data  
- **Flexibility**: Supports various objective functions and evaluation metrics  

---

## 🔍 Use Cases

- Classification (binary and multi-class)  
- Regression  
- Ranking  
- Time-series forecasting  

---

## ✅ Advantages

- Excellent accuracy  
- Fast computation  
- Built-in regularization helps limit overfitting  
- Handles large datasets  
- Provides feature importance ranking  

---


> XGBoost is widely used in machine learning competitions and real-world applications because it combines strong accuracy with efficient training.


In [7]:
# Define the base model using XGBoost

from xgboost import XGBRegressor
xgb_model = XGBRegressor(random_state=2021)

## 🌟 Common XGBoost Hyperparameters (shown for `XGBClassifier`; most also apply to `XGBRegressor`)

---

### ✅ 1. `n_estimators`

> **Number of boosting rounds / trees**

* Default: 100
* Higher → more trees and a closer fit to the training data, but slower training and a higher risk of overfitting
* Works with `learning_rate`: small LR → need more trees

---

### ✅ 2. `learning_rate` (aka `eta`)

> **How much each tree contributes to the final prediction**

* Default: 0.3
* Smaller = slower learning, better generalization
* Typical: `0.01` to `0.1`

✅ Rule of thumb:

> Use `learning_rate = 0.01` with `n_estimators = 500+` for strong models
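
To see why a small learning rate needs many trees, consider an idealized case (an assumption for illustration, not how real trees behave): if every tree could capture the whole remaining error, shrinkage would still leave a fraction `1 - learning_rate` of it after each round. The hypothetical helper below counts how many rounds it takes for the remaining fraction to drop below 1%.

```python
def rounds_to_reach(learning_rate, target=0.01):
    """Count boosting rounds until the remaining error fraction drops below target.

    Idealized assumption: each tree fits the remaining error exactly, so
    shrinkage leaves a fraction (1 - learning_rate) after every round.
    """
    remaining, rounds = 1.0, 0
    while remaining > target:
        remaining *= 1.0 - learning_rate
        rounds += 1
    return rounds

for lr in (0.3, 0.1, 0.01):
    print(lr, rounds_to_reach(lr))
# 0.3 13
# 0.1 44
# 0.01 459
```

Under this simplification, `learning_rate = 0.01` needs several hundred rounds, which is consistent with pairing it with `n_estimators = 500+`.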

---

### ✅ 3. `max_depth`

> **Maximum depth of each tree**

* Controls how complex a tree is
* Default: 6
* Lower → less overfitting, Higher → more expressive

📌 Try: `[3, 6, 9]`

---

### ✅ 4. `subsample`

> **Fraction of training rows to use for each tree**

* Valid range: `(0, 1]`; typical values: `0.5` to `1.0`
* Default: 1.0
* Lower values help regularize (avoid overfitting)

📌 Try: `0.8` or `0.9`
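
A minimal sketch of row subsampling, assuming rows are drawn without replacement for each tree. The helper `sample_rows` is hypothetical, and XGBoost's internal sampling procedure may differ in detail. The 824 rows correspond to 80% of the 1030 rows in this dataset.

```python
import random

def sample_rows(n_rows, subsample, seed):
    """Hypothetical helper: pick the row indices one tree would train on."""
    k = int(n_rows * subsample)  # number of rows kept for this tree
    return sorted(random.Random(seed).sample(range(n_rows), k))

rows_tree_1 = sample_rows(824, subsample=0.8, seed=1)
rows_tree_2 = sample_rows(824, subsample=0.8, seed=2)
print(len(rows_tree_1), len(rows_tree_2))  # 659 659
print(rows_tree_1 == rows_tree_2)          # each tree sees a different subset
```

Because every tree trains on a different subset, no single tree can memorize the full training set, which is where the regularizing effect comes from.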

---

### ✅ 5. `colsample_bytree`

> **Fraction of features (columns) to use per tree**

* Default: 1.0
* Like `subsample`, but for columns
* Helps diversify trees

📌 Try: `0.6`, `0.8`, `1.0`

---

### ✅ 6. `gamma`

> **Minimum loss reduction to split a node**

* Default: 0
* Higher → model is more conservative
* Helps prune branches in trees

📌 Try: `0`, `0.1`, `0.5`
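
The effect of `gamma` can be sketched with the split-gain formula from the XGBoost documentation's introduction to boosted trees, where `G` and `H` are the sums of first- and second-order gradients in each child and `gamma` is subtracted from the raw gain. The function name `split_gain` and the example numbers are hypothetical.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting a node into left/right children; a split is kept only if gain > 0."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    raw_gain = 0.5 * (score(g_left, h_left)
                      + score(g_right, h_right)
                      - score(g_left + g_right, h_left + h_right))
    return raw_gain - gamma

# Hypothetical node: 4 rows on each side, squared error (hessian 1 per row).
for gamma in (0.0, 0.1, 5.0):
    gain = split_gain(-4.0, 4.0, 4.0, 4.0, gamma=gamma)
    print(gamma, round(gain, 3), "split" if gain > 0 else "no split")
```

With `gamma = 5.0` the same candidate split has negative gain and would be pruned, which is what "more conservative" means in practice.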

---

### ✅ 7. `min_child_weight`

> **Minimum sum of instance weights (hessian) required in a child node**

* Default: 1
* Higher = more conservative splits
* A form of regularization (reduces overfitting)

📌 Try: `1`, `5`, `10`

---

### ✅ 8. `objective`

> **Defines the type of prediction task**

* For binary classification → `"binary:logistic"` ✅
* For multiclass → `"multi:softprob"`
* For regression → `"reg:squarederror"`

---

### ✅ 9. `eval_metric`

> **Metric used to evaluate the model on validation data (the loss that is optimized is set by `objective`)**

* Binary: `"logloss"`, `"error"`, `"auc"`
* Multi-class: `"mlogloss"`, `"merror"`
* Regression: `"rmse"`, `"mae"`

📌 Example:

```python
XGBClassifier(eval_metric='logloss')
```

---

### ✅ 10. `scale_pos_weight`

> **Helps with imbalanced datasets** (e.g. 90% class 0, 10% class 1)

* Typical value: (number of class 0 examples) / (number of class 1 examples)

📌 Try this if you have class imbalance!
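
A small sketch of computing this ratio from a label list, using the 90/10 example above. The `labels` list is made-up illustration data.

```python
from collections import Counter

labels = [0] * 90 + [1] * 10   # hypothetical imbalanced labels: 90% class 0, 10% class 1
counts = Counter(labels)

# Ratio of negative (class 0) to positive (class 1) examples
scale_pos_weight = counts[0] / counts[1]
print(scale_pos_weight)  # 9.0
```

The resulting value would then be passed as `scale_pos_weight=9.0` when constructing the classifier.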

---

## 🔁 Regularization Parameters

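| Parameter                                         | Purpose                                        | Typical Values |
| ------------------------------------------------- | ---------------------------------------------- | -------------- |
| `gamma`                                           | Minimum loss reduction to split (prunes trees) | 0 to 5         |
| `lambda` (`reg_lambda` in the scikit-learn API)   | L2 regularization (Ridge)                      | 0 to 1         |
| `alpha` (`reg_alpha` in the scikit-learn API)     | L1 regularization (Lasso)                      | 0 to 1         |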
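
To make the L1 and L2 terms concrete, the sketch below computes the weight of a single leaf from its gradient sum `G` and hessian sum `H`. The L2 part follows the leaf-weight formula in the XGBoost documentation; treating the L1 term as a soft threshold on `G` is an assumption about the implementation detail. The function name and numbers are hypothetical.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Leaf weight with L2 shrinkage (reg_lambda) and L1 soft-thresholding (reg_alpha)."""
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        shrunk = 0.0                     # L1 can zero the leaf out entirely
    return -shrunk / (hess_sum + reg_lambda)

# Hypothetical leaf with G = -10 and H = 4
print(leaf_weight(-10.0, 4.0, reg_lambda=0.0))                  # 2.5 (no regularization)
print(leaf_weight(-10.0, 4.0, reg_lambda=1.0))                  # 2.0 (L2 shrinks the weight)
print(leaf_weight(-10.0, 4.0, reg_lambda=1.0, reg_alpha=2.0))   # 1.6 (L1 shrinks it further)
```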

---

### 🛠 Sample Model Setup

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    gamma=0.1,
    min_child_weight=1,
    objective='binary:logistic',
    eval_metric='auc',
)
```

In [9]:
search_space = {
    'n_estimators': [100, 200, 500],
    'max_depth': [3, 6, 9],
    'gamma': [0.01, 0.1],
    'learning_rate': [0.001, 0.01, 0.1, 1],
}
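
Before launching a grid search it helps to know how many models will be trained. The sketch below enumerates every combination in this search space with `itertools.product` and multiplies by the number of cross-validation folds (5 is assumed here).

```python
from itertools import product

search_space = {
    'n_estimators': [100, 200, 500],
    'max_depth': [3, 6, 9],
    'gamma': [0.01, 0.1],
    'learning_rate': [0.001, 0.01, 0.1, 1],
}

# One dict per candidate: every combination of the listed values
candidates = [dict(zip(search_space, values))
              for values in product(*search_space.values())]

n_folds = 5  # assumed number of cross-validation folds
print(len(candidates), len(candidates) * n_folds)  # 72 360
```

Adding one more value to any list multiplies the total, so grids grow quickly; 3 × 3 × 2 × 4 = 72 candidates here.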

In [10]:
from sklearn.model_selection import GridSearchCV

GS = GridSearchCV(
    estimator=xgb_model,
    param_grid=search_space,
    scoring=['r2', 'neg_mean_absolute_error'],  # metrics computed on each validation fold
    cv=5,         # 5-fold cross-validation
    refit='r2',   # refit on the full training set using the params with the best mean R²
    verbose=4,    # print per-fold scores and timings
)
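
The two scoring names can be unpacked with a small plain-Python sketch. `r2` is the coefficient of determination, and `neg_mean_absolute_error` is the mean absolute error with its sign flipped so that, like R², a higher value is better. The function names and sample values below are hypothetical.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def neg_mean_absolute_error(y_true, y_pred):
    """Negated MAE, so that larger scores mean better predictions."""
    return -sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 10.0]
print(r2(y_true, y_pred))                       # approximately 0.925
print(neg_mean_absolute_error(y_true, y_pred))  # -0.5
```

This sign convention is why the mean absolute error appears as a negative number in grid-search logs.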

In [11]:
GS.fit(X_train, Y_train)

Fitting 5 folds for each of 72 candidates, totalling 360 fits
[CV 1/5] END gamma=0.01, learning_rate=0.001, max_depth=3, n_estimators=100; neg_mean_absolute_error: (test=-12.564) r2: (test=0.112) total time=   0.1s
[CV 2/5] END gamma=0.01, learning_rate=0.001, max_depth=3, n_estimators=100; neg_mean_absolute_error: (test=-12.236) r2: (test=0.103) total time=   0.0s
[CV 3/5] END gamma=0.01, learning_rate=0.001, max_depth=3, n_estimators=100; neg_mean_absolute_error: (test=-13.121) r2: (test=0.119) total time=   0.0s
[CV 4/5] END gamma=0.01, learning_rate=0.001, max_depth=3, n_estimators=100; neg_mean_absolute_error: (test=-12.548) r2: (test=0.106) total time=   0.0s
[CV 5/5] END gamma=0.01, learning_rate=0.001, max_depth=3, n_estimators=100; neg_mean_absolute_error: (test=-13.274) r2: (test=0.108) total time=   0.0s
[CV 1/5] END gamma=0.01, learning_rate=0.001, max_depth=3, n_estimators=200; neg_mean_absolute_error: (test=-11.879) r2: (test=0.211) total time=   0.1s
[CV 2/5] END gamma=0

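GridSearchCV parameters:

| Parameter | Value |
| --------- | ----- |
| `estimator` | `XGBRegressor(...ree=None, ...)` |
| `param_grid` | `{'gamma': [0.01, 0.1], 'learning_rate': [0.001, 0.01, ...], 'max_depth': [3, 6, ...], 'n_estimators': [100, 200, ...]}` |
| `scoring` | `['r2', 'neg_mean_absolute_error']` |
| `n_jobs` | |
| `refit` | `'r2'` |
| `cv` | `5` |
| `verbose` | `4` |
| `pre_dispatch` | `'2*n_jobs'` |
| `error_score` | |
| `return_train_score` | `False` |

XGBRegressor parameters:

| Parameter | Value |
| --------- | ----- |
| `objective` | `'reg:squarederror'` |
| `base_score` | |
| `booster` | |
| `callbacks` | |
| `colsample_bylevel` | |
| `colsample_bynode` | |
| `colsample_bytree` | |
| `device` | |
| `early_stopping_rounds` | |
| `enable_categorical` | `False` |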


In [12]:
print(GS.best_params_)

{'gamma': 0.1, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500}


In [13]:
print(GS.best_score_)

0.921563136246513


In [14]:
# Collect all cross-validation results, ranked by mean test R² (best first)
df = pd.DataFrame(GS.cv_results_)
df = df.sort_values("rank_test_r2")
df.to_csv("cv_results.csv")