# NLP Project

## Importing libraries

In [14]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
import joblib
import os

## Load the dataset

> In this step, I load a dataset of URLs, each labeled as spam or not spam. The dataset is publicly available on GitHub, so I use `pandas` to read the CSV directly from its raw URL.

- This lets me inspect the data before moving on to preprocessing.


In [5]:
# Load dataset from the provided GitHub URL
url = "https://raw.githubusercontent.com/4GeeksAcademy/NLP-project-tutorial/main/url_spam.csv"
df = pd.read_csv(url)

# Display the first few rows
df.head()

Unnamed: 0,url,is_spam
0,https://briefingday.us8.list-manage.com/unsubs...,True
1,https://www.hvper.com/,True
2,https://briefingday.com/m/v4n3i4f3,True
3,https://briefingday.com/n/20200618/m#commentform,False
4,https://briefingday.com/fan,True


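> Everything loaded correctly, so I'm moving on to the next step. The `Unnamed: 0` column appears to be a leftover row index from the CSV and is not used later.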
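
> Before preprocessing, a quick look at class balance, duplicates, and missing values can be useful. Below is a minimal sketch on a toy sample, not the full dataset: four rows are copied from the `head()` output above, and the last row is a hypothetical duplicate added only for illustration. In practice the same calls would run on `df`.

```python
import pandas as pd

# Toy sample: four rows copied from the head() output above, plus one
# hypothetical duplicate row added only to exercise the duplicate check.
sample = pd.DataFrame(
    {
        "url": [
            "https://www.hvper.com/",
            "https://briefingday.com/m/v4n3i4f3",
            "https://briefingday.com/n/20200618/m#commentform",
            "https://briefingday.com/fan",
            "https://www.hvper.com/",  # hypothetical duplicate
        ],
        "is_spam": [True, True, False, True, True],
    }
)

# Share of each label, number of repeated URLs, and number of missing cells
class_share = sample["is_spam"].value_counts(normalize=True).to_dict()
n_duplicates = int(sample["url"].duplicated().sum())
n_missing = int(sample.isna().sum().sum())

print(class_share)
print(n_duplicates, n_missing)
```

> A skewed label distribution is worth knowing about early, because it affects how accuracy should be read later on.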

## Preprocess the links

> In this step, I’m preparing the URLs for training a machine learning model. Since the URLs are just strings, I need to convert them into a format that the model can understand.

> Here's what I’ll do:
- **Tokenize** the URLs by breaking them into meaningful parts using common punctuation marks.
- **Remove stopwords** (even though they’re rare in URLs, this is good hygiene).
- **Lemmatize** the tokens to reduce them to their base forms.
- **Vectorize** the result using `TfidfVectorizer` so we can train an SVM later.

> Finally, I’ll split the dataset into training and testing sets.


In [7]:
# Download required NLTK data
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Custom preprocessing function for URLs
def preprocess_url(url):
    # Split on punctuation and non-alphanumeric characters
    tokens = re.split(r'\W+', url.lower())
    # Remove stopwords and lemmatize
    clean_tokens = [lemmatizer.lemmatize(token) for token in tokens if token and token not in stop_words]
    return ' '.join(clean_tokens)

# Apply preprocessing to the URL column
df['clean_url'] = df['url'].apply(preprocess_url)

# Vectorize the cleaned URLs
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['clean_url'])

# Target variable
y = df['is_spam']  # boolean labels: True = spam, False = not spam

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Confirm dimensions
X_train.shape, X_test.shape


[nltk_data] Downloading package stopwords to /home/vscode/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/vscode/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


((2399, 6335), (600, 6335))

> In this step, I prepared the dataset for machine learning by cleaning and transforming the URLs into numerical vectors. Here’s a breakdown of what I did:

- **Tokenization:** I lowercased each URL and split it on runs of non-word characters such as `.`, `/`, `-`, `:`, and `#`. The underscore `_` counts as a word character for `\W`, so it stays inside tokens.
- **Stopword Removal:** I removed common English stopwords that don’t help the model (e.g., “the”, “and”).
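- **Lemmatization:** I reduced each token to its dictionary form to normalize the vocabulary (e.g., “links” → “link”). `WordNetLemmatizer` treats every token as a noun unless it is given a part-of-speech tag, so verb forms such as “running” are left unchanged.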
- **TF-IDF Vectorization:** I converted the processed URLs into a matrix of numerical features based on how important each token is across the dataset.
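
> To make the tokenization step concrete, here is a small sketch of what the `\W+` split does to a URL. It uses only the `re` part of `preprocess_url` (no stopword removal or lemmatization), and the helper name `split_url` and the `example.com` URL are made up for illustration.

```python
import re

def split_url(url: str) -> list[str]:
    """Lowercase a URL and split it on runs of non-word characters."""
    return [tok for tok in re.split(r"\W+", url.lower()) if tok]

tokens = split_url("https://briefingday.com/n/20200618/m#commentform")
print(tokens)
# ['https', 'briefingday', 'com', 'n', '20200618', 'm', 'commentform']

# The underscore is a word character, so \W+ keeps it inside the token,
# while the hyphen acts as a separator.
underscore_tokens = split_url("https://example.com/my_page-v2")
print(underscore_tokens)
# ['https', 'example', 'com', 'my_page', 'v2']
```

> Note that `TfidfVectorizer` with its default `token_pattern` only keeps tokens of two or more characters, so single-character pieces such as `n` and `m` would not become features.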

> Finally, I split the dataset into:
- **Training set:** 2,399 URLs
- **Test set:** 600 URLs
- **Feature space:** 6,335 unique tokens

> This step ensures that my SVM model will receive clean, consistent, and meaningful inputs.
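
> One caveat about the cell above: the vectorizer is fitted on all 2,999 URLs before the split, so the vocabulary and IDF weights are partly learned from the test set. The effect may be small here, but a common way to avoid it is to fit the vectorizer on the training data only, for example with a scikit-learn pipeline. The sketch below is an alternative setup rather than what this notebook ran, and the toy URLs and labels are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up, already-cleaned URLs standing in for df['clean_url'] after a split.
train_urls = [
    "https briefingday com unsubscribe offer",
    "https briefingday com fan promo",
    "https hvper com deal",
    "https example org docs python",
    "https example org blog release",
    "https example org about team",
]
train_labels = [True, True, True, False, False, False]
test_urls = ["https example org newsletter", "https hvper com promo"]

# The pipeline fits TF-IDF on the training URLs only, then trains the SVM.
model = make_pipeline(TfidfVectorizer(), SVC())
model.fit(train_urls, train_labels)

# At prediction time the fitted vocabulary is reused; unseen tokens are ignored.
predictions = model.predict(test_urls)
vocabulary = model.named_steps["tfidfvectorizer"].vocabulary_
print(len(vocabulary), "newsletter" in vocabulary)
```

> With this arrangement, tokens that only appear in the test set (such as `newsletter` here) never enter the feature space, which mirrors what happens with truly new URLs.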



## Train a Support Vector Machine (SVM)

> In this step, I trained a Support Vector Machine (SVM) classifier using the default parameters.

- The goal here is to establish a strong baseline model that can classify whether a given URL is spam or not based on its tokenized and vectorized form.

> SVMs are effective for binary classification problems like this one and tend to perform well with high-dimensional data, which suits the TF-IDF representation we built in the previous step.

- Let's check the model's accuracy and classification report to evaluate its performance on the test set.


In [9]:
# Initialize the SVM with default parameters
svm_model = SVC()

# Train the model
svm_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svm_model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Detailed classification report
print(classification_report(y_test, y_pred))

Accuracy: 0.9583
              precision    recall  f1-score   support

       False       0.95      1.00      0.97       455
        True       0.99      0.83      0.91       145

    accuracy                           0.96       600
   macro avg       0.97      0.92      0.94       600
weighted avg       0.96      0.96      0.96       600



> In this step, I trained a Support Vector Machine (SVM) using default hyperparameters on the preprocessed URL data. The model was then evaluated on the test set, and the performance metrics are very promising.

- **Accuracy**: `95.83%`. The model correctly predicted whether a URL was spam or not in nearly 96% of the test cases.
- **Precision**:
  - **False (not spam)**: `95%`
  - **True (spam)**: `99%`. When the model flags a URL as spam, it is almost always right.
- **Recall**:
  - **False**: `100%` after rounding. The model recognized nearly all non-spam links; because spam precision is 99% rather than 100%, at least one legitimate URL was still flagged as spam.
  - **True**: `83%`. The model missed about 17% of the spam URLs, which I’ll try to improve in the next section through optimization.
- **F1-Score**:
  - **False**: `0.97`
  - **True**: `0.91`. A solid balance between precision and recall for both classes.

> Overall, the SVM model performs very well out of the box, but there's room to improve how much of the spam it catches. In the next section, I’ll optimize the model’s hyperparameters to boost its performance further.
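
> To show how the numbers in the report relate to each other, here is a small worked example. The four counts below are not notebook output: they are counts I reconstructed so that every rounded figure in the report above is reproduced (455 non-spam and 145 spam URLs in the test set), so treat them as an assumption.

```python
# Reconstructed confusion-matrix counts (an assumption, chosen to match the
# rounded classification report; the notebook did not print them).
tp, fn = 121, 24  # spam correctly flagged / spam missed
tn, fp = 454, 1   # non-spam correctly passed / non-spam wrongly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Metrics for the spam class (True)
spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

# Metrics for the non-spam class (False): the roles of the counts swap
ham_precision = tn / (tn + fn)
ham_recall = tn / (tn + fp)

print(f"accuracy={accuracy:.4f}")
print(f"spam: precision={spam_precision:.2f} recall={spam_recall:.2f} f1={spam_f1:.2f}")
print(f"non-spam: precision={ham_precision:.2f} recall={ham_recall:.2f}")
```

> Under these counts almost all of the errors are missed spam (false negatives), which is why recall for the spam class is the metric with the most room to improve.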


## Hyperparameter Optimization

> In this step, I optimized the Support Vector Machine by performing a grid search to find the best combination of hyperparameters. The goal is to improve the model's ability to detect spam links, especially those that were misclassified in the initial version.

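- I focused on tuning `C` (the inverse of regularization strength: a larger `C` penalizes misclassified training points more heavily), the `kernel` type, and `gamma` (the kernel coefficient, which only applies to the `rbf` and `poly` kernels in this grid). These parameters directly affect the decision boundary and the model's flexibility.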
- After finding the best combination, I retrained the model and evaluated its performance.


In [11]:
# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}

# Initialize SVM
svc = SVC()

# Run Grid Search
grid_search = GridSearchCV(svc, param_grid, cv=5, scoring='accuracy', verbose=2)
grid_search.fit(X_train, y_train)

# Get best model
best_svm = grid_search.best_estimator_

# Evaluate best model
y_pred_optimized = best_svm.predict(X_test)

# Print results
print("Best Parameters:", grid_search.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred_optimized))
print(classification_report(y_test, y_pred_optimized))

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV] END ..................C=0.1, gamma=scale, kernel=linear; total time=   0.2s
[CV] END ..................C=0.1, gamma=scale, kernel=linear; total time=   0.2s
[CV] END ..................C=0.1, gamma=scale, kernel=linear; total time=   0.2s
[CV] END ..................C=0.1, gamma=scale, kernel=linear; total time=   0.2s
[CV] END ..................C=0.1, gamma=scale, kernel=linear; total time=   0.2s
[CV] END .....................C=0.1, gamma=scale, kernel=rbf; total time=   0.3s
[CV] END .....................C=0.1, gamma=scale, kernel=rbf; total time=   0.3s
[CV] END .....................C=0.1, gamma=scale, kernel=rbf; total time=   0.3s
[CV] END .....................C=0.1, gamma=scale, kernel=rbf; total time=   0.3s
[CV] END .....................C=0.1, gamma=scale, kernel=rbf; total time=   0.3s
[CV] END ....................C=0.1, gamma=scale, kernel=poly; total time=   0.4s
[CV] END ....................C=0.1, gamma=scale,

> In this step, I used GridSearchCV to find the best combination of hyperparameters for the SVM model. I tested different values of `C`, `kernel`, and `gamma` using 5-fold cross-validation. The best parameters found were:

- **C**: 10  
- **Kernel**: 'rbf'  
- **Gamma**: 'scale'

> After training the model with these optimized parameters, I achieved the following results:

- **Accuracy**: 0.9683
- **Precision (spam)**: 0.96
- **Recall (spam)**: 0.90
- **F1-Score (spam)**: 0.93

> This is a clear improvement over the baseline SVM from the previous section: accuracy rose from 0.9583 to 0.9683, and spam recall rose from 0.83 to 0.90, so the optimized model catches more of the actual spam links. The trade-off is a small drop in spam precision, from 0.99 to 0.96, while the spam F1-score still improved from 0.91 to 0.93.
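
> A side note on the grid itself: the linear kernel does not use `gamma`, so pairing it with both `gamma` values repeats the same fit. `GridSearchCV` also accepts a list of dictionaries for `param_grid`, which lets each kernel get only the parameters it uses. The sketch below just counts candidates with the standard library; `count_candidates` and `split_grid` are illustrative names, not part of the notebook.

```python
from itertools import product

def count_candidates(grid):
    """Count parameter combinations in a dict grid or a list of dict grids."""
    grids = grid if isinstance(grid, list) else [grid]
    return sum(len(list(product(*g.values()))) for g in grids)

# The grid used in the cell above
full_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf", "poly"],
    "gamma": ["scale", "auto"],
}

# Same search space, but gamma is only varied for kernels that use it
split_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf", "poly"], "C": [0.1, 1, 10], "gamma": ["scale", "auto"]},
]

n_full = count_candidates(full_grid)
n_split = count_candidates(split_grid)
print(n_full, n_split)  # 18 15
```

> With 5-fold cross-validation that would be 75 fits instead of 90, without dropping any distinct model from the search.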


> Now I'll move on to the final step.

## Save the Optimized SVM Model

In [16]:
# Ensure the models folder exists
os.makedirs("models", exist_ok=True)

# Save the best model from GridSearchCV
joblib.dump(grid_search.best_estimator_, "models/best_svm_model.pkl")

# Also save the vectorizer so it can be reused for prediction
joblib.dump(vectorizer, "models/url_vectorizer.pkl")

['models/url_vectorizer.pkl']

> To finalize the project, I saved the optimized SVM model using the `joblib` library. This will allow me to reuse the trained model later for inference without needing to retrain it. I also saved the fitted `TfidfVectorizer`, which is essential to transform new input data in the same way as during training.

> Files saved:
- `models/best_svm_model.pkl`: the optimized SVM classifier.
- `models/url_vectorizer.pkl`: the fitted `TfidfVectorizer` used to turn preprocessed URLs into features.
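
> For completeness, here is a rough sketch of how the two saved files could be used together at inference time. To stay self-contained it trains a tiny model on made-up URLs, writes both objects to a temporary folder, and reloads them. `simple_clean` is a simplified stand-in for `preprocess_url` (no stopword removal or lemmatization); with the real files, the same preprocessing function used in training should be applied.

```python
import os
import re
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def simple_clean(url: str) -> str:
    """Simplified stand-in for preprocess_url: lowercase and split on \\W+."""
    return " ".join(tok for tok in re.split(r"\W+", url.lower()) if tok)

# Made-up training data, only to have something to save and reload
train_urls = [
    "https://example.com/unsubscribe/offer",
    "https://example.com/promo/deal",
    "https://example.org/docs/python",
    "https://example.org/blog/release",
]
labels = [True, True, False, False]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([simple_clean(u) for u in train_urls])
model = SVC().fit(X, labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "best_svm_model.pkl")
    vec_path = os.path.join(tmp, "url_vectorizer.pkl")
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vec_path)
    # Later, in a fresh session, only these two loads would be needed
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vec_path)

# Clean the new URL, transform (not fit) with the loaded vectorizer, predict
new_url = "https://example.com/unsubscribe/promo"
features = loaded_vectorizer.transform([simple_clean(new_url)])
reloaded_prediction = loaded_model.predict(features)
original_prediction = model.predict(vectorizer.transform([simple_clean(new_url)]))
print(reloaded_prediction)
```

> The key point is that new URLs go through `transform`, never `fit_transform`, so they are mapped onto the same feature columns the model was trained on.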


> Aaaaaand it's finished.