<a href="https://colab.research.google.com/github/Jacobgokul/ML-Playground/blob/main/Gradient_Boosting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Boosting is an ensemble learning technique where multiple weak models (usually decision trees) are combined to form a strong model. It sequentially improves performance by focusing on the errors made by previous models.

# How Boosting Works
- Train a weak model (like a small decision tree).

- Identify the errors (misclassified samples or large residuals).

- Train a new model that focuses on correcting those errors.

- Repeat this process multiple times to reduce mistakes progressively.

Unlike Bagging (e.g., Random Forest), where trees are trained independently of one another, Boosting trains trees sequentially, so each tree depends on the ones before it.

# Gradient Boosting
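Gradient Boosting is a type of Boosting in which models are trained sequentially. Instead of reweighting misclassified samples (as AdaBoost does), each new model is fitted to the residual errors of the current ensemble. These residuals correspond to the negative gradient of the loss function, so adding models one at a time amounts to gradient descent on the loss.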
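
As a worked step, the update at stage $m$ can be written as follows. The notation is introduced here for illustration and is not from the original text: $L$ is a differentiable loss, $F_{m-1}$ the current ensemble, $h_m$ the new weak learner, and $\nu$ the learning rate.

```latex
\begin{aligned}
r_i^{(m)} &= -\left.\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right|_{F = F_{m-1}}
  && \text{(pseudo-residual for sample } i\text{)} \\
h_m &\approx \text{weak learner fitted to } \bigl\{(x_i, r_i^{(m)})\bigr\} \\
F_m(x) &= F_{m-1}(x) + \nu\, h_m(x)
\end{aligned}
```

For the squared-error loss $L = \tfrac{1}{2}(y - F)^2$, the pseudo-residual reduces to $r_i^{(m)} = y_i - F_{m-1}(x_i)$, which is the ordinary residual.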

📌 Gradient Boosting in Classification (GBC)

## For classification tasks:

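The model starts with a simple initial prediction (for example, a constant score such as the log-odds of the positive class).

It then trains subsequent weak learners (usually small trees) to correct the errors made by the current ensemble.

The final model is the sum of the initial prediction and the outputs of all weak learners, each scaled by the learning rate; this score is then converted to a class probability.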
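
The classification steps above can be sketched for the binary case as follows. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: it uses log loss, takes a plain gradient step per tree (real implementations typically also re-estimate leaf values), and the data and variable names are made up for the demo.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic two-feature data (an assumption for the demo)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(float)

# Stage 0: constant score equal to the log-odds of the positive class
p0 = y_demo.mean()
score = np.full(len(y_demo), np.log(p0 / (1 - p0)))

learning_rate = 0.1
log_losses = []
for _ in range(50):
    prob = sigmoid(score)
    # Negative gradient of log loss with respect to the score
    pseudo_residuals = y_demo - prob
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, pseudo_residuals)
    score = score + learning_rate * tree.predict(X_demo)
    prob = np.clip(sigmoid(score), 1e-12, 1 - 1e-12)
    log_losses.append(
        float(-np.mean(y_demo * np.log(prob) + (1 - y_demo) * np.log(1 - prob)))
    )

train_accuracy = float(np.mean((sigmoid(score) >= 0.5) == (y_demo == 1)))
print(f"log loss: {log_losses[0]:.3f} -> {log_losses[-1]:.3f}")
print(f"training accuracy: {train_accuracy:.2f}")
```

Note that the weak learners are regression trees even though the task is classification: they predict a correction to the score, not a class label.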

## For regression tasks:
Instead of classification errors, each new model is fitted to the residuals (the difference between the actual and predicted values).

Each stage further reduces the loss function (such as Mean Squared Error).

The final prediction is the sum of the initial prediction and the contributions of all weak learners.
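
The regression case can be sketched in a few lines. The helper names `fit_gradient_boosting` and `predict_gradient_boosting` are hypothetical, the data is synthetic, and the loss is assumed to be squared error, so the residuals are exactly the negative gradient.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_estimators=50, learning_rate=0.1, max_depth=2):
    """Toy gradient boosting for squared error (illustrative, not sklearn's code)."""
    baseline = float(np.mean(y))              # stage 0: constant prediction
    prediction = np.full(len(y), baseline)
    trees, mse_history = [], []
    for _ in range(n_estimators):
        residuals = y - prediction            # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
        mse_history.append(float(np.mean((y - prediction) ** 2)))
    return baseline, trees, mse_history

def predict_gradient_boosting(X, baseline, trees, learning_rate=0.1):
    # Final prediction: baseline plus the scaled sum of all tree outputs
    return baseline + learning_rate * sum(tree.predict(X) for tree in trees)

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

baseline, trees, mse_history = fit_gradient_boosting(X_demo, y_demo)
print(f"MSE after 1 tree: {mse_history[0]:.4f}")
print(f"MSE after {len(trees)} trees: {mse_history[-1]:.4f}")
```

With a learning rate between 0 and 1, the training MSE in this sketch cannot increase from one stage to the next, which is the "keeps minimizing the loss" behaviour described above.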


In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

In [7]:
dataset_url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"

df = pd.read_csv(dataset_url, sep='\t', header=None, names=['label', 'message'])

df

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [8]:
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
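
One caveat with `Series.map` and a dictionary: any value missing from the dictionary becomes `NaN` without raising an error. The sketch below shows a possible sanity check on a made-up frame (`toy` and its values are assumptions for illustration; whether the SMS data contains such values is not established here).

```python
import pandas as pd

# Toy labels, including one with unexpected capitalisation
toy = pd.DataFrame({"label": ["ham", "spam", "Spam", "ham"]})

mapped = toy["label"].map({"ham": 0, "spam": 1})

# Values missing from the mapping dictionary come back as NaN
unmapped = toy.loc[mapped.isna(), "label"].unique().tolist()
print(unmapped)
```

Running a check like this right after the mapping step makes a silent `NaN` in the target column easy to spot before training.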

In [9]:
df

| | label | message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | 1 | This is the 2nd time we have tried 2 contact u... |
| 5568 | 0 | Will ü b going to esplanade fr home? |
| 5569 | 0 | Pity, * was in mood for that. So...any other s... |
| 5570 | 0 | The guy did some bitching but I acted like i'd... |


In [13]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [15]:
# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [16]:
# Build the stopword set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize
    words = word_tokenize(text)
    # Remove stopwords (set lookup is O(1) per word)
    words = [word for word in words if word not in stop_words]
    return ' '.join(words)

# Apply preprocessing
df['cleaned_message'] = df['message'].apply(preprocess_text)
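
The order of the preprocessing steps matters. Because punctuation is removed before stopword filtering, a contraction such as "don't" becomes "dont" and no longer matches the stopword entry "don't". The sketch below illustrates this with a small hypothetical stopword set and a whitespace split standing in for `word_tokenize`; `demo_preprocess` is an illustrative helper, not part of the notebook.

```python
import string

# Small hypothetical stopword set; NLTK's English list is much longer
DEMO_STOPWORDS = {"i", "he", "to", "don't"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def demo_preprocess(text, strip_punctuation_first=True):
    text = text.lower()
    if strip_punctuation_first:
        # Same order as preprocess_text: punctuation goes first
        text = text.translate(PUNCT_TABLE)
    words = [w for w in text.split() if w not in DEMO_STOPWORDS]
    if not strip_punctuation_first:
        # Alternative order: filter stopwords, then strip punctuation
        words = [w.translate(PUNCT_TABLE) for w in words]
    return " ".join(w for w in words if w)

message = "Nah I don't think he goes to usf"
print(demo_preprocess(message, strip_punctuation_first=True))
print(demo_preprocess(message, strip_punctuation_first=False))
```

The first call keeps "dont" while the second drops the contraction entirely. Either choice can be reasonable; the point is only that the two orders produce different features.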

In [17]:
df.head()

| | label | message | cleaned_message |
|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | go jurong point crazy available bugis n great ... |
| 1 | 0 | Ok lar... Joking wif u oni... | ok lar joking wif u oni |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | free entry 2 wkly comp win fa cup final tkts 2... |
| 3 | 0 | U dun say so early hor... U c already then say... | u dun say early hor u c already say |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | nah dont think goes usf lives around though |


In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
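# Feature extraction using TF-IDF
# Note: the vectorizer is fitted on all messages, before the train/test split,
# so vocabulary and IDF statistics from the test messages leak into the features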
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['cleaned_message'])
y = df['label']

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
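
Since spam is the minority class in this dataset, one option worth considering is a stratified split, which keeps the class ratio the same in both parts. The sketch below uses made-up arrays (`X_toy`, `y_toy`) with 20% positives to show the effect of the `stratify` argument; it is a suggestion, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 80)   # 20% positive class

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both parts keep the original 20% share of positives
print(y_tr.mean(), y_te.mean())
```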

In [21]:
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbc.fit(X_train, y_train)
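
In the cells above, the vectorizer is fitted on all messages before `train_test_split`, so the test messages influence the vocabulary and IDF weights. A possible alternative, sketched here on made-up texts, is to split the raw text first and wrap both steps in a scikit-learn `Pipeline`, so the vectorizer only ever sees training data. The toy texts, labels, and the name `spam_pipeline` are assumptions for illustration.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up messages: 1 = spam, 0 = ham
toy_texts = [
    "free prize win now", "win cash free entry",
    "free entry prize", "claim free cash now",
    "see you at lunch", "ok talk later",
    "call me when home", "lunch at home later",
] * 5
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

# Split the raw text, not the TF-IDF matrix
texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

spam_pipeline = make_pipeline(
    TfidfVectorizer(),
    GradientBoostingClassifier(
        n_estimators=20, learning_rate=0.1, max_depth=3, random_state=42
    ),
)
spam_pipeline.fit(texts_train, labels_train)   # vectorizer fitted on train only

predictions = spam_pipeline.predict(texts_test)
print(spam_pipeline.score(texts_test, labels_test))
```

A side benefit is that the fitted pipeline accepts raw strings directly, so a separate `vectorizer.transform` call is not needed at prediction time.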

In [29]:
X_test[0]

<1x9430 sparse matrix of type '<class 'numpy.float64'>'
	with 14 stored elements in Compressed Sparse Row format>

In [22]:
y_pred = gbc.predict(X_test)

In [23]:
y_pred

array([0, 0, 0, ..., 0, 0, 0])

In [25]:
from sklearn.metrics import accuracy_score, classification_report

In [26]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Model Accuracy: 0.97

Classification Report:
              precision    recall  f1-score   support

           0       0.97      1.00      0.98       966
           1       0.97      0.79      0.87       149

    accuracy                           0.97      1115
   macro avg       0.97      0.89      0.93      1115
weighted avg       0.97      0.97      0.97      1115
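
In the report above, accuracy is high, but recall for class 1 (spam) is 0.79, which means roughly 30 of the 149 spam messages in the test set were labelled as ham. The sketch below shows how precision and recall follow from the confusion matrix, using small made-up label arrays rather than the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 8 ham (0) and 4 spam (1)
y_true_toy = np.array([0] * 8 + [1] * 4)
y_pred_toy = np.array([0] * 7 + [1] + [1, 1, 1, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = precision_score(y_true_toy, y_pred_toy)  # tp / (tp + fp)
recall = recall_score(y_true_toy, y_pred_toy)        # tp / (tp + fn)

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

On imbalanced data, per-class recall is usually more informative than overall accuracy, because a model that misses many spam messages can still score well on the majority class.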



In [32]:
feature_names = vectorizer.get_feature_names_out()

X_test_dense = X_test.toarray()

In [33]:
for i, row in enumerate(X_test_dense):
    words = [feature_names[j] for j in range(len(row)) if row[j] > 0]
    print(f"Message {i + 1}: {', '.join(words)}")

Message 1: back, christmas, cute, den, frndshp, get, hate, hug, lik, lucky, luvd, none, people, squeeeeeze
Message 2: also, blown, blue, couple, id, ive, looking, rather, recently, sorta, text, times, weed
Message 3: better, drinks, good, got, indian, mmm, roast, thats
Message 4: anything, dont, eat, heavy, kanji, mm, ok
Message 5: comes, costumes, future, gift, guys, hint, ring, theres, yowifes
Message 6: bollox, hurt, lot, need, sary, tim, tol
Message 7: could, decide, decision, feeling, isnt, less, life, love, magical, much, simpler, would
Message 8: aft, ask, find, havent, lor, one, students, supervisor, tell, thk, yet
Message 9: dear, good, morning
Message 10: chennai, im, velachery
Message 11: away, forever, grr, like, lol, minutes, mom, pharmacy, prescription, taking, ugh
Message 12: didnt, fb, glad, huh, im, page, proof, really, rupaul, show, tool, ugh, valentines, watch
Message 13: buying, im, lar, tix, wif
Message 14: aa, er, exhaust, followed, go, hanging, hello, home, limpi
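
The same word lists can be recovered without converting the sparse matrix to a dense array, using the vectorizer's `inverse_transform` method. The sketch below demonstrates it on three made-up messages; `toy_messages` and `recovered` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_messages = ["dear good morning", "chennai im velachery", "buying tix im"]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_messages)

# inverse_transform returns, per row, the terms with non-zero weight
recovered = [sorted(terms) for terms in toy_vectorizer.inverse_transform(toy_matrix)]

for i, terms in enumerate(recovered[:2]):
    print(f"Message {i + 1}: {', '.join(terms)}")
```

This avoids allocating a dense array with one column per vocabulary word, which matters more as the vocabulary grows.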

In [40]:
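# Predict the label of a single message
# Note: this text appears to match row 2 of the dataset, so it is not unseen data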

custom_message = "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
cleaned_message = preprocess_text(custom_message)
custom_message_vector = vectorizer.transform([cleaned_message])
custom_prediction = gbc.predict(custom_message_vector)
prediction_label = "spam" if custom_prediction[0] == 1 else "ham"

print(f"The message is predicted to be: {prediction_label}")

The message is predicted to be: spam
