# Support Vector Machines - Exercise 1

In this exercise, we'll use support vector machines (SVMs) to build a spam classifier. We'll start with SVMs on some simple 2D data sets to see how they work. Then we'll pre-process a set of raw emails and train an SVM on the processed emails to determine whether they are spam.

The first thing we're going to do is look at a simple 2-dimensional data set and see how a linear SVM works on it for varying values of C. C plays a role similar to the regularization term in linear/logistic regression, but inversely: a larger C means less regularization. Let's load the data.
## Exercise 1
#### 1. Load libraries

In [1]:
import numpy as np
import pandas as pd
import scipy.io
import matplotlib.pyplot as plt

# Models
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# Metrics
from sklearn.metrics import accuracy_score, classification_report


#### 2. Load data
Load the file *ejer_1_data1.mat*. Find a way to load this kind of file (hint: **scipy.io.loadmat**).

In [2]:
loadmat = scipy.io.loadmat('ejer_1_data1.mat')

FileNotFoundError: [Errno 2] No such file or directory: 'ejer_1_data1.mat'

In [3]:
print(loadmat.keys())

NameError: name 'loadmat' is not defined
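
As a minimal sketch of what `scipy.io.loadmat` returns, the snippet below writes a tiny stand-in `.mat` file to a temporary directory and reads it back. The file name `example_data.mat` and its contents are hypothetical and are not the exercise data; the point is only that the result is a dictionary whose metadata keys start with `__` and whose arrays come back as 2D.

```python
import os
import tempfile

import numpy as np
import scipy.io

# Hypothetical stand-in for the exercise file: a tiny .mat file with 'X' and 'y'.
tmp_dir = tempfile.mkdtemp()
mat_path = os.path.join(tmp_dir, "example_data.mat")
scipy.io.savemat(mat_path, {"X": np.array([[1.0, 2.0], [3.0, 4.0]]),
                            "y": np.array([[1], [0]])})

mat = scipy.io.loadmat(mat_path)

# loadmat returns a dict; keys that start with '__' are file metadata.
variable_names = [k for k in mat.keys() if not k.startswith("__")]

X_demo = mat["X"]          # shape (n_samples, n_features)
y_demo = mat["y"].ravel()  # flatten the (n_samples, 1) column into 1D

print(variable_names, X_demo.shape, y_demo.shape)
```

If the real file lives in a subfolder, the path passed to `loadmat` has to include it; a `FileNotFoundError` usually means the path is relative to a different working directory than expected.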

#### 3. Create a DataFrame with the features and target

In [4]:
X = loadmat['X']  
y = loadmat['y']

NameError: name 'loadmat' is not defined

In [5]:
data = pd.DataFrame(X, columns=['feature1', 'feature2'])

NameError: name 'X' is not defined

In [6]:
data['target'] = y.ravel()

NameError: name 'y' is not defined

In [7]:
data.head()

NameError: name 'data' is not defined

In [8]:
data.info()

NameError: name 'data' is not defined

#### 4. Plot a scatterplot with the data

In [9]:
plt.figure(figsize=(8,6))

clase1 = data[data['target'] == 1]
plt.scatter(clase1['feature1'], clase1['feature2'], 
            c='red', marker='^', s=60, label='Class 1')

clase0 = data[data['target'] == 0]
plt.scatter(clase0['feature1'], clase0['feature2'], 
            c='blue', marker='o', s=60, label='Class 0')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Scatterplot 2D - Data for linear SVM')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

NameError: name 'data' is not defined

<Figure size 800x600 with 0 Axes>

Notice that there is one outlier positive example that sits apart from the others.  The classes are still linearly separable but it's a very tight fit.  We're going to train a linear support vector machine to learn the class boundary.

#### 5. LinearSVC
Declare a Linear SVC with the hyperparameters:

```Python
LinearSVC(C=1, loss='hinge', max_iter=10000)
```

In [10]:
model = LinearSVC(C=1, loss='hinge', max_iter=10000)
model.fit(data[['feature1', 'feature2']], data['target'])

NameError: name 'data' is not defined

In [11]:
print("Coefficients (w):", model.coef_)
print("Intercept (b):", model.intercept_)

AttributeError: 'LinearSVC' object has no attribute 'coef_'

#### 6. Try the performance (score)
For the first experiment we'll use C=1 and see how it performs.

In [12]:
train_score = model.score(data[['feature1', 'feature2']], data['target'])
print(f"Accuracy with C=1: {train_score:.4f}")

NameError: name 'data' is not defined

In [13]:
y_pred = model.predict(data[['feature1', 'feature2']])

NameError: name 'data' is not defined

In [14]:
print("\nClassification report:")
print(classification_report(data['target'], y_pred))


Classification report:


NameError: name 'data' is not defined

It appears that it mis-classified the outlier.

#### 7. Increase the value of C until you get a perfect classifier

In [15]:
C_values = [1, 10, 100, 1000, 10000]
results = []

In [16]:
for C in C_values:
    model = LinearSVC(C=C, loss='hinge', max_iter=10000, random_state=42)
    model.fit(data[['feature1', 'feature2']], data['target'])
    
    score = model.score(data[['feature1', 'feature2']], data['target'])
    results.append(score)
    
    print(f"C={C}: Accuracy = {score:.4f}")

NameError: name 'data' is not defined

This time we got a perfect classification of the training data. However, by increasing the value of C we've created a decision boundary that is no longer a natural fit for the data. We can visualize this by looking at the confidence level for each class prediction, which is a function of the point's distance from the hyperplane.
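
To make the link between confidence and distance concrete, here is a small sketch on synthetic data (not the exercise data; `X_toy` and `y_toy` are made up for illustration). For a linear model, `decision_function` returns `w·x + b`, so dividing by the norm of `w` gives the signed geometric distance to the hyperplane.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic, roughly separable toy data (an assumption for illustration only).
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(loc=-2.0, size=(20, 2)),
                   rng.normal(loc=2.0, size=(20, 2))])
y_toy = np.array([0] * 20 + [1] * 20)

clf = LinearSVC(C=1, loss="hinge", max_iter=10000, random_state=42)
clf.fit(X_toy, y_toy)

w = clf.coef_.ravel()   # weight vector of the separating hyperplane
b = clf.intercept_[0]   # bias term

scores = clf.decision_function(X_toy)  # raw scores: w.x + b
manual_scores = X_toy @ w + b          # same quantity computed by hand

# Signed geometric distance from each point to the hyperplane.
signed_distance = scores / np.linalg.norm(w)

print(np.allclose(scores, manual_scores))
```

Points with a score near zero lie close to the boundary, which is why coloring a scatterplot by `decision_function` highlights the examples the model is least sure about.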

#### 8. Plot Decision Function
Get the `decision_function()` output for the first model. Plot a scatterplot with X1, X2 and a range of colors based on `decision_function()`

In [17]:
model_C1 = LinearSVC(C=1, loss='hinge', max_iter=10000, random_state=42)
model_C1.fit(data[['feature1', 'feature2']], data['target'])

NameError: name 'data' is not defined

In [18]:
decision_values = model_C1.decision_function(data[['feature1', 'feature2']])

NameError: name 'data' is not defined

In [20]:
plt.figure(figsize=(10,6))

scatter = plt.scatter(data['feature1'], data['feature2'], 
                     c=decision_values, 
                     cmap='RdBu_r', 
                     s=60, 
                     edgecolors='black', 
                     linewidth=0.5)

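plt.colorbar(scatter, label='Decision Function (distance from hyperplane)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Decision Function - Model C=1 (shows the problematic outlier)')

# Mark the hyperplane (decision = 0) by evaluating the model on a grid
xx, yy = np.meshgrid(np.linspace(data['feature1'].min() - 1, data['feature1'].max() + 1, 500),
                     np.linspace(data['feature2'].min() - 1, data['feature2'].max() + 1, 500))
grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=['feature1', 'feature2'])
Z = model_C1.decision_function(grid).reshape(xx.shape)
plt.contour(xx, yy, Z, levels=[0], colors='black', linestyles='-', linewidths=2)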

plt.show()

NameError: name 'data' is not defined

<Figure size 1000x600 with 0 Axes>

#### 9. Do the same with the second model

https://www.svm-tutorial.com/2015/06/svm-understanding-math-part-3/

In [21]:
model_highC = LinearSVC(C=1000, loss='hinge', max_iter=10000, random_state=42)
model_highC.fit(data[['feature1', 'feature2']], data['target'])

NameError: name 'data' is not defined

In [22]:
decision_values_highC = model_highC.decision_function(data[['feature1', 'feature2']])

NameError: name 'data' is not defined

In [23]:
plt.figure(figsize=(10,6))

scatter = plt.scatter(data['feature1'], data['feature2'], 
                      c=decision_values_highC, 
                      cmap='RdBu_r', 
                      s=60, 
                      edgecolors='black', 
                      linewidth=0.5)

plt.colorbar(scatter, label='Decision Function (distance from hyperplane)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Decision Function - Model with C=1000 (perfect classification)')

NameError: name 'data' is not defined

<Figure size 1000x600 with 0 Axes>

In [24]:
x_min, x_max = data['feature1'].min() - 1, data['feature1'].max() + 1
y_min, y_max = data['feature2'].min() - 1, data['feature2'].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),
                     np.linspace(y_min, y_max, 500))

Z = model_highC.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contour(xx, yy, Z, levels=[0], colors='black', linestyles='-', linewidths=2)

plt.show()

NameError: name 'data' is not defined

Now we're going to move from a linear SVM to one that's capable of non-linear classification using kernels. We're first tasked with implementing a Gaussian kernel function. Although scikit-learn has a Gaussian kernel built in, for transparency we'll implement one from scratch.
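
A minimal sketch of such a kernel is shown here, assuming the common definition K(x1, x2) = exp(-||x1 - x2||² / (2σ²)). The function name `gaussian_kernel` and the input values are illustrative choices. The result is cross-checked against scikit-learn's `rbf_kernel`, which uses the equivalent parameter `gamma = 1 / (2σ²)`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def gaussian_kernel(x1, x2, sigma):
    """Gaussian (RBF) similarity: exp(-||x1 - x2||^2 / (2 * sigma^2))."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))


# Illustrative inputs (assumed values, chosen only to exercise the function).
x1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.0, 4.0, -1.0])
sigma = 2.0

k_manual = gaussian_kernel(x1, x2, sigma)

# scikit-learn parameterizes the same kernel with gamma = 1 / (2 * sigma^2).
gamma = 1.0 / (2.0 * sigma ** 2)
k_sklearn = rbf_kernel(x1.reshape(1, -1), x2.reshape(1, -1), gamma=gamma)[0, 0]

print(k_manual, k_sklearn)
```

The kernel equals 1 when the two points coincide and decays toward 0 as they move apart; a smaller `sigma` (larger `gamma`) makes that decay faster.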

## Exercise 2

That result matches the expected value from the exercise.  Next we're going to examine another data set, this time with a non-linear decision boundary.

#### 1. Load the data `ejer_1_data2.mat`

In [25]:
loadmat = scipy.io.loadmat('data/ejer_1_data2.mat')
print(loadmat.keys())


FileNotFoundError: [Errno 2] No such file or directory: 'data/ejer_1_data2.mat'

#### 2. Create a DataFrame with the features and target

In [26]:
X = loadmat['X']  
y = loadmat['y']

NameError: name 'loadmat' is not defined

In [27]:
data = pd.DataFrame(X, columns=['feature1', 'feature2'])
data['target'] = y.ravel()

NameError: name 'X' is not defined

#### 3. Plot a scatterplot with the data

For this data set we'll build a support vector machine classifier using the built-in RBF kernel and examine its accuracy on the training data.  To visualize the decision boundary, this time we'll shade the points based on the predicted probability that the instance has a negative class label.  We'll see from the result that it gets most of them right.
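
The same workflow can be sketched end to end on synthetic data. The snippet below uses scikit-learn's `make_moons` as a stand-in for the exercise file (an assumption, since the real data is not reproduced here) and looks up the negative-class column through `classes_` rather than assuming it is column 0.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Synthetic non-linearly separable data (stand-in for the exercise data).
X_toy, y_toy = make_moons(n_samples=200, noise=0.1, random_state=0)

# kernel='rbf' is the default for SVC; probability=True enables predict_proba.
clf = SVC(C=100, gamma=10, probability=True, random_state=42)
clf.fit(X_toy, y_toy)

train_accuracy = clf.score(X_toy, y_toy)

# Columns of predict_proba follow the order of clf.classes_.
neg_col = list(clf.classes_).index(0)
prob_negative = clf.predict_proba(X_toy)[:, neg_col]

print(f"Training accuracy: {train_accuracy:.3f}")
```

Note that this is accuracy on the training data only; with a large `gamma` the model can fit the training points very closely without generalizing as well.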

In [28]:
plt.figure(figsize=(8,6))

clase1 = data[data['target'] == 1]
plt.scatter(clase1['feature1'], clase1['feature2'], 
            c='red', marker='^', s=60, label='Class 1')

clase0 = data[data['target'] == 0]
plt.scatter(clase0['feature1'], clase0['feature2'], 
            c='blue', marker='o', s=60, label='Class 0')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Scatterplot 2D - Data for non-linear SVM')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

NameError: name 'data' is not defined

<Figure size 800x600 with 0 Axes>

#### 4. Declare an SVC with these hyperparameters
```Python
SVC(C=100, gamma=10, probability=True)
```

In [29]:
model = SVC(C=100, gamma=10, probability=True, random_state=42)

#### 5. Fit the classifier and get the score

In [30]:
model.fit(data[['feature1', 'feature2']], data['target'])

NameError: name 'data' is not defined

In [31]:
train_score = model.score(data[['feature1', 'feature2']], data['target'])

NameError: name 'data' is not defined

#### 6. Plot the scatter plot and probability of predicting 0 with a [sequential color](https://matplotlib.org/3.1.1/tutorials/colors/colormaps.html)

In [32]:
prob_class0 = model.predict_proba(data[['feature1', 'feature2']])[:, 0]


NameError: name 'data' is not defined

In [33]:
plt.figure(figsize=(10,6))

scatter = plt.scatter(data['feature1'], data['feature2'], 
                     c=prob_class0, 
                     cmap='YlOrRd',  # sequential colormap
                     s=60, 
                     edgecolors='black', 
                     linewidth=0.5)

plt.colorbar(scatter, label='P(Class 0)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Probability of predicting Class 0 (SVC C=100, gamma=10)')

plt.show()

NameError: name 'data' is not defined

<Figure size 1000x600 with 0 Axes>

#### Grid search

In [34]:
param_grid = {'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.001, 0.01, 0.1, 1, 10], 'kernel': ['rbf']}

In [35]:
svc = SVC(probability=True)

In [36]:
grid_search = GridSearchCV(estimator=svc, param_grid=param_grid, cv=5, verbose=2, n_jobs=-1)

In [37]:
grid_search.fit(data[['feature1', 'feature2']], data['target'])

NameError: name 'data' is not defined

In [38]:
print("Best hyperparameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

AttributeError: 'GridSearchCV' object has no attribute 'best_params_'
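
For reference, here is a self-contained sketch of the same grid search on synthetic data, with a held-out test split added so the selected model is scored on points it never saw. The data set (`make_moons`), the split sizes, and the reduced grid are assumptions for illustration, not part of the exercise. `best_params_` and `best_score_` only exist after `fit` has completed successfully.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data, split into train and held-out test sets.
X_toy, y_toy = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [0.01, 0.1, 1, 10],
              "kernel": ["rbf"]}

# 5-fold cross-validation over every (C, gamma) combination.
search = GridSearchCV(SVC(), param_grid=param_grid, cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_          # available only after fit()
cv_score = search.best_score_              # mean CV accuracy of the best combination
test_score = search.score(X_test, y_test)  # accuracy of the refit best model on held-out data

print(best_params, round(cv_score, 3), round(test_score, 3))
```

Leaving out `probability=True` during the search keeps it faster; it can be enabled afterwards on a final model if class probabilities are needed.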