### **Logistics**

1. **Team**: 

    - **Sepehr Akbari**

    - **Carson Pagel**

2. **Assignment**:

    - CSCI 250: Programming for Data Applications

        - Project 1

3. **Data**:

    - Name: *Ecommerce Customer Churn Analysis and Prediction*
    
    - Source: [Kaggle](https://www.kaggle.com/datasets/ankitverma2010/ecommerce-customer-churn-analysis-and-prediction)


---

# **Customer Churn Prediction**

**Objective:**

Customer churn prediction is a common problem in the e-commerce industry. It is important for companies to identify customers who are likely to churn so they can take preventive action to retain them. It is also beneficial for companies to understand the factors that lead to customer churn. In this project, we will use machine learning models to predict customer churn based on a dataset from an e-commerce company.

**Table of Contents:**

1. [Setup](#setup)

2. [Exploratory Data Analysis (EDA)](#exploratory-data-analysis-eda)

3. [Data Encoding & Imputation](#data-encoding--imputation)

4. [Correlation Analysis](#correlation-analysis)

5. [Splitting Data](#splitting-data)

6. [Handling Imbalanced Data & Scaling](#handling-imbalanced-data--scaling)

7. [Regression Models](#regression-models)

    - [Linear Regression](#linear-regression)

    - [Polynomial Regression](#polynomial-regression)

    - [Logistic Regression](#logistic-regression)

8. [K-Nearest Neighbors (kNN) Models](#k-nearest-neighbors-knn-models)

    - [Euclidean Distance](#euclidean-distance)

    - [Manhattan Distance](#manhattan-distance)

    - [Hamming Distance](#hamming-distance)

9. [Support Vector Machine (SVM) Models](#support-vector-machine-svm-model)

    - [Linear Kernel](#linear-svm)

    - [Non-Linear Kernel](#non-linear-svm)

## Setup

1. **Importing libraries**

In [None]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from catboost import CatBoostClassifier

from imblearn.over_sampling import SMOTE

import pickle

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, mean_squared_error, precision_score, r2_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC

from xgboost import XGBClassifier

In [None]:
warnings.filterwarnings('ignore')

This is the random state the models will use, for reproducibility.

In [None]:
rs = 250

2. **Importing the data**

In [None]:
df = pd.read_excel('data.xlsx', sheet_name='E Comm')
desc = pd.read_excel('data.xlsx', sheet_name='Data Dict', header=1, usecols=[1,2,3]).drop(columns="Data")

3. **Description of the data columns**

In [None]:
desc

4. **Peeking at the data**

In [None]:
df.head()

## Exploratory Data Analysis (EDA)

In this section we want to explore and understand the data. We will look at the distribution of the target variable, the distribution of the features, and the relationship between the target variable and the features.

1. **Analyzing the type of data**

In [None]:
df.info()

All columns appear to be in an appropriate format, so there is no need to change any data types.

2. **Dropping the `CustomerID` column**

We'll drop the `CustomerID` column, since it is only an identifier: it is not useful for our analysis and is not a feature for churn prediction.

In [None]:
df.drop(columns="CustomerID", inplace=True)

3. **Let's plot the counts for each variable to get an idea of what the data represents.**

In [None]:
plt.figure(figsize=(20, 40))
colors = sns.color_palette("plasma", len(df.columns))
# One histogram per column, laid out on a 10 x 2 grid
for i, col in enumerate(df.columns):
    plt.subplot(10, 2, i + 1)
    sns.histplot(df, x=col, bins=25, color=colors[i])
    plt.title(f"Distribution of {col}")
    plt.xlabel("")
    plt.ylabel("")
    plt.grid(True)

# Adjust the spacing once, after all subplots have been drawn
plt.tight_layout()
plt.show()

Here are a couple of important things these visualizations tell us:

- The data shows that the majority of customers do not churn, with only a smaller fraction leaving.

- There are fewer senior citizens than non-seniors, suggesting age might be a smaller but potentially significant factor.

- Tenure varies widely, with many customers having short tenure, which could indicate higher churn risk.

- Monthly charges are skewed, suggesting that a smaller subset of customers pay much higher fees.

- Total charges are also right-skewed, typically increasing with customer tenure.

- Certain internet service types and add-ons dominate, possibly influencing churn based on specific service preferences.

- Payment methods vary, with certain methods (like month-to-month) often linked to higher churn.

- Contract type, additional services, and senior citizen status stand out as potential predictors of churn.

- Outliers and missing values in total charges and other features may need special handling before modeling.

## Data Encoding & Imputation

1. **Checking for missing values**

In [None]:
print(f'Count of missing values in each column:\n\n{df.isnull().sum()}\n---')
print(f'Total missing: {df.isnull().sum().sum()}')
print(f'Total rows missing (at least one): {df[df.isnull().any(axis=1)].shape[0]}')
print(f'Percentage of missing values in db: {df.isnull().sum().sum() / df.shape[0] * 100:.2f}%')

This shows that no row has more than one missing value, which is actually bad news: if we dropped every row with a missing value, we would lose about 33% of our data.

So we need to impute the missing values. I'll use `SimpleImputer` for numeric columns and `IterativeImputer` with a `RandomForestRegressor` for categorical columns.

2. **Imputing missing values for numeric columns**

In [None]:
numerics = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
df[numerics] = SimpleImputer(strategy='mean').fit_transform(df[numerics])

3. **Encoding categorical columns (One-Hot Encoding)**

This allows us to convert categorical columns into numerical columns, which is necessary for most machine learning models, especially regressions which we'll be using. We can use `drop_first=True` to avoid multicollinearity which will help our model to generalize better.
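To see why dropping one dummy column matters, here is a minimal sketch in plain Python (not pandas). The `one_hot` helper and the sample values are made up for illustration and only mimic the idea behind `drop_first=True`.

```python
def one_hot(values, drop_first=False):
    """One-hot encode a list of labels; categories are sorted alphabetically."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the all-zeros baseline
        categories = categories[1:]
    return categories, [[int(v == c) for c in categories] for v in values]

# Hypothetical sample values, not taken from the dataset
status = ["Single", "Married", "Divorced", "Single"]

full_cols, full_rows = one_hot(status)
reduced_cols, reduced_rows = one_hot(status, drop_first=True)

# With every category kept, each row sums to 1, so any one column is fully
# determined by the others (perfect multicollinearity once an intercept is present).
full_row_sums = [sum(row) for row in full_rows]
```

With `drop_first=True` the "Divorced" row is encoded as all zeros, so no information is lost even though one column is gone.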

In [None]:
categoricals = df.select_dtypes(include=['object']).columns.tolist()
df = pd.get_dummies(df, columns=categoricals, drop_first=True)

4. **Imputing missing values for categorical columns**

In [None]:
rf_imputer = IterativeImputer(estimator=RandomForestRegressor(random_state=rs), max_iter=10)
df = pd.DataFrame(rf_imputer.fit_transform(df), columns=df.columns)

5. **Displaying the imputed and encoded data**

In [None]:
df.head()

## Correlation Analysis

Now that we have imputed and encoded the data, we can look at the correlation between the features and the target variable.

In [None]:
churn_corr = df.corr()["Churn"].sort_values(ascending=False)
churn_corr = churn_corr.drop("Churn")

plt.figure(figsize=(10, 8))
sns.barplot(x=churn_corr.values, y=churn_corr.index, palette="RdBu")

plt.xlabel("Correlation with Churn")
plt.ylabel("")
plt.title("Feature Correlations with Churn")
plt.tight_layout()
plt.show()


This correlation chart indicates:

- **Complain** has the highest positive correlation with churn, indicating that customers who lodge complaints are more likely to leave.

- **MaritalStatus_Single** also correlates positively with churn, suggesting single customers may have a higher tendency to churn.

- **PreferredOrderCat_Mobile Phone** shows a moderate positive correlation, implying those who primarily order mobile phones could be more prone to churn.

- **NumberOfDeviceRegistered** and **SatisfactionScore** both have moderate positive correlations, hinting that device usage and satisfaction levels relate to churn risk.

- **MaritalStatus_Married**, **CashbackAmount**, **DaySinceLastOrder**, and **Tenure** all show negative correlations, meaning married customers, those receiving higher cashback, those with more days since their last order, and longer-tenured customers are less likely to churn.

## Splitting Data

1. **Specifying the target variable (`y`) and the features (`X`)**

In [None]:
X = df.drop(columns=["Churn"])
y = df["Churn"]

2. **Splitting the data into training and testing sets**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=rs, stratify=y)

print(f"Training set's shape: {X_train.shape}")
print(f"Testing set's shape: {X_test.shape}")

## Handling Imbalanced Data & Scaling

1. **Evaluating the imbalance of `Churn`**

In [None]:
plt.figure(figsize=(5,4))

churn_percentages = df["Churn"].value_counts(normalize=True) * 100
ax = sns.barplot(x=churn_percentages.index, y=churn_percentages.values, palette="plasma")

plt.xlabel("Churn (0 = No, 1 = Yes)")
plt.ylabel("Percentage (%)")
plt.title("Churn Distribution (Scaled)")
plt.ylim(0, 100)
plt.show()

This bar chart shows that our data is split 80/20 which is very imbalanced. This is very reasonable as it simply implies that most customers do not leave. But for our model to be able to predict the minority class, we need to balance the data. I will use `SMOTE` to oversample the minority class (1s).
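The core idea behind SMOTE can be sketched in a few lines: a synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its minority-class neighbours. The snippet below is a simplified illustration in plain Python, with made-up 2-D points and a hypothetical `interpolate` helper. The real `SMOTE` class also takes care of the nearest-neighbour search and of how many samples to generate.

```python
import random

def interpolate(sample, neighbour, rng):
    """Return a synthetic point on the segment between `sample` and `neighbour`."""
    lam = rng.random()  # random fraction in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(250)     # fixed seed for reproducibility
minority_a = [1.0, 2.0]      # hypothetical minority-class points
minority_b = [3.0, 6.0]

synthetic = interpolate(minority_a, minority_b, rng)
```

Because the new point is interpolated rather than copied, oversampling this way adds variety to the minority class instead of duplicating existing rows.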

2. **Handling the imbalance**

In [None]:
smote = SMOTE(random_state=rs)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print("Before upsampling:")
print('count of Churn = 0: {}'.format(sum(y_train==0)))
print('count of Churn = 1: {}'.format(sum(y_train==1)))
print(f"Training set's shape: {X_train.shape}\n")

print('After upsampling')
print('count of Churn = 0: {}'.format(sum(y_train_resampled==0)))
print('count of Churn = 1: {}'.format(sum(y_train_resampled==1)))
print(f"Training set's shape: {X_train_resampled.shape}")

3. **Scaling the data**

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_resampled)
X_test_scaled = scaler.transform(X_test)

## Regression Models

This section will first focus on dismissing the linear approach. The main reasons are:

- A linear model is based on the assumption that the relationship between the input features and the "Churn" is linear. This is not the case.

- The nature of a linear model's output is continuous, while the target variable here is binary (not churn / churn).

Also note that quadratic, cubic, and higher-degree regressions are still linear models: the model remains linear in its coefficients, and only the features are raised to different powers. In other words, a linear regression is a polynomial regression of degree 1.

### **Linear Regression**

$$
y = \beta X + \epsilon
$$

1. **Performing Linear Regression**

In [None]:
model_linear = LinearRegression()
model_linear.fit(X_train_scaled, y_train_resampled)
y_pred_linear = model_linear.predict(X_test_scaled)

2. **Visualizing the model**

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(range(len(y_test)), y_test, color='tab:blue', label='True', alpha=0.5)
plt.scatter(range(len(y_pred_linear)), y_pred_linear, color='tab:red', label='Predicted', alpha=0.5)
plt.title("Linear Regression: True vs Predicted Churn")
plt.xlabel("Sample Index")
plt.ylabel("Churn (0 or 1)")
plt.legend()
plt.tight_layout()
plt.show()

3. **Evaluating Error**

In [None]:
mse_linear = mean_squared_error(y_test, y_pred_linear)
r2_linear = r2_score(y_test, y_pred_linear)

print(f"Error Metrics: \n\nMSE: {mse_linear:.4f} \nR^2: {r2_linear:.4f}")

As shown in the visualization of the predicted points, the predictions are continuous, which is not suitable for a binary classification problem. Moreover, an $R^2$ score of -0.02 indicates that the model not only fits the data poorly, but also performs worse than a horizontal line at the mean of the target.
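For intuition on how $R^2$ can go negative, recall that $R^2 = 1 - SS_{res} / SS_{tot}$. Always predicting the mean of the target gives exactly 0, so anything worse than that drops below 0. The sketch below computes this by hand on made-up labels and predictions (these are not the notebook's results).

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 1]
mean_baseline = [0.2] * 5                 # always predict the mean of y_true
poor_model = [0.5, 0.4, 0.6, 0.5, 0.3]    # continuous outputs far from the labels

r2_baseline = r2(y_true, mean_baseline)   # the horizontal-line reference
r2_poor = r2(y_true, poor_model)          # negative: worse than the baseline
```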

### **Polynomial Regression**

$$
y = \beta_0 + \beta_1 X + \beta_2 X^2 + \ldots + \beta_n X^n+ \epsilon
$$

As mentioned earlier, polynomial regression is still a linear model, since it is linear in its coefficients and only the features are raised to different powers. The value of $n$ is the `degree` we pass to `PolynomialFeatures`.

1. **Calculating $X^2$**

In [None]:
poly2 = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly2 = poly2.fit_transform(X_train_scaled)
X_test_poly2 = poly2.transform(X_test_scaled)

2. **Calculating $X^3$**

In [None]:
poly3 = PolynomialFeatures(degree=3, include_bias=False)
X_train_poly3 = poly3.fit_transform(X_train_scaled)
X_test_poly3 = poly3.transform(X_test_scaled)

3. **Performing Quadratic Regression**

In [None]:
model_quadratic = LinearRegression()
model_quadratic.fit(X_train_poly2, y_train_resampled)
y_pred_quadratic = model_quadratic.predict(X_test_poly2)

4. **Performing Cubic Regression**

In [None]:
model_cubic = LinearRegression()
model_cubic.fit(X_train_poly3, y_train_resampled)
y_pred_cubic = model_cubic.predict(X_test_poly3)

5. **Visualizing the model**

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 6))

ax1.scatter(range(len(y_test)), y_test, color='tab:blue', label='True', alpha=0.5)
ax1.scatter(range(len(y_pred_quadratic)), y_pred_quadratic, color='tab:red', label='Predicted', alpha=0.5)
ax1.set_title("Quadratic Regression: True vs Predicted Churn")
ax1.set_xlabel("Sample Index")
ax1.set_ylabel("Churn (0 or 1)")
ax1.legend()

ax2.scatter(range(len(y_test)), y_test, color='tab:blue', label='True', alpha=0.5)
ax2.scatter(range(len(y_pred_cubic)), y_pred_cubic, color='tab:red', label='Predicted', alpha=0.5)
ax2.set_title("Cubic Regression: True vs Predicted Churn")
ax2.set_xlabel("Sample Index")
ax2.set_ylabel("Churn (0 or 1)")
ax2.set_ylim(-0.75, 1.45)
ax2.legend()


plt.tight_layout()
plt.show()

In [None]:
mse_quadratic = mean_squared_error(y_test, y_pred_quadratic)
r2_quadratic = r2_score(y_test, y_pred_quadratic)

mse_cubic = mean_squared_error(y_test, y_pred_cubic)
r2_cubic = r2_score(y_test, y_pred_cubic)

print(f"Error Metrics of Quadratic Model: \n\nMSE: {mse_quadratic:.4f} \nR^2: {r2_quadratic:.4f}\n\n")
print(f"Error Metrics of Cubic Model: \n\nMSE: {mse_cubic:.4f} \nR^2: {r2_cubic:.4f}")

Let's evaluate the visualizations of the quadratic and cubic regressions. At first glance it may seem that as we increase the degree of the polynomial, the model fits the data better. However, this is not the case. The model is overfitting the data, which is evident from the fact that it is trying to fit every single point in the training set. Moreover, the $R^2$ score of $-1.15 \times 10^{21}$ indicates that the model is performing extremely poorly on the test set: it has learned the noise in the training data rather than a pattern that generalizes.
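One way to see why the higher-degree fits overfit so easily: `PolynomialFeatures` generates every power and interaction term up to the chosen degree, so the number of columns grows combinatorially. The sketch below counts those terms with the standard combinatorial formula. The figure of 30 input features is a hypothetical number used only for illustration, not the actual column count of our data.

```python
from math import comb

def n_poly_features(n_features, degree):
    """Number of polynomial terms of total degree 1..degree (bias column excluded)."""
    return comb(n_features + degree, degree) - 1

# 30 input features is a made-up count used only for illustration
counts = {degree: n_poly_features(30, degree) for degree in (1, 2, 3)}
# counts == {1: 30, 2: 495, 3: 5455}
```

With thousands of columns and only a few thousand rows, an unregularized linear fit has enough freedom to chase noise in the training set.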

### **Logistic Regression**

The first complex model we'll analyze is logistic regression, which tries to calculate the probability of some binary outcome:

$$
P(y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n)}}
$$

This model estimates the coefficients $\beta$ through maximum likelihood estimation, which fits a sigmoid function to the data. A regularization penalty on the coefficients can also be added to help prevent overfitting; scikit-learn's `LogisticRegression` applies an L2 penalty by default.
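To make the equation concrete, here is a small sketch in plain Python that evaluates the sigmoid for one observation and turns the probability into a class label. The intercept, coefficients, and feature values are hypothetical, and the 0.5 cut-off is the conventional default rather than something tuned for this dataset.

```python
import math

def sigmoid(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept, coefficients, and scaled feature values
beta_0 = -0.5
betas = [1.2, -0.8]
x = [1.0, 0.25]

z = beta_0 + sum(b * xi for b, xi in zip(betas, x))  # linear part of the model
p_churn = sigmoid(z)                                 # P(y = 1)
predicted_label = int(p_churn >= 0.5)                # same as checking z >= 0
```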

1. **Performing Logistic Regression**

In [None]:
model_logistic = LogisticRegression(random_state=rs, max_iter=1000)
model_logistic.fit(X_train_scaled, y_train_resampled)

y_pred_logistic = model_logistic.predict(X_test_scaled)
y_pred_proba = model_logistic.predict_proba(X_test_scaled)[:, 1]

2. **Visualizing the model**

Let's see what the output of a real binary classification model looks like.

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(range(len(y_test)), y_test, color='tab:blue', label='True', alpha=0.5)
plt.scatter(range(len(y_pred_logistic)), y_pred_logistic, color='tab:red', label='Predicted', alpha=0.5)
plt.title("Logistic Regression: True vs Predicted Churn")
plt.xlabel("Sample Index")
plt.ylabel("Churn (0 or 1)")
plt.legend()
plt.tight_layout()
plt.show()

3. **Performance**

In [None]:
cm_logistic = confusion_matrix(y_test, y_pred_logistic)

plt.figure(figsize=(5, 5))
sns.heatmap(cm_logistic, annot=True, fmt='d', cmap='Greys', cbar=False)
plt.title('Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.tight_layout()
plt.show()

4. **Evaluating Error**

In [None]:
mse_logistic = mean_squared_error(y_test, y_pred_logistic)
r2_logistic = r2_score(y_test, y_pred_logistic)
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
precision_logistic = precision_score(y_test, y_pred_logistic)
recall_logistic = recall_score(y_test, y_pred_logistic)
f1_logistic = f1_score(y_test, y_pred_logistic)

print(f"Error Metrics: \n\nMSE: {mse_logistic:.4f}\nR^2: {r2_logistic:.4f}\n\nAccuracy: {accuracy_logistic:.4f}\nPrecision: {precision_logistic:.4f}\nRecall: {recall_logistic:.4f}\nF1: {f1_logistic:.4f}")

As shown in the confusion matrix, the model performs decently on the majority class. However, as the $R^2$ score suggests, it does not perform well on the minority class, with an overall classification score of about 60%, as indicated by the F1 score.
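Since the classification metrics all derive from the confusion matrix, it helps to compute them by hand once. The counts below are hypothetical (not the notebook's results) and are chosen to show how accuracy can look fine on imbalanced data while the F1 score for the churn class stays low.

```python
# Hypothetical confusion-matrix counts, for illustration only
tn, fp, fn, tp = 80, 10, 4, 6

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted churners, how many actually churned
recall = tp / (tp + fn)      # of the actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```

Here accuracy is 0.86 because the majority class dominates, while F1 is below 0.5 because more than half of the predicted churners are false positives.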

**We can do better!**

## K-Nearest Neighbors (kNN) Models

The next model we'll analyze is the k-Nearest Neighbors (kNN) model. This model is based on the assumption that similar data points are close to each other in the feature space. The model calculates the distance between the data points and classifies the data point based on the majority class of the k-nearest neighbors.
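The idea can be sketched in plain Python as follows. This is a minimal illustration with made-up 2-D points and a hypothetical `knn_predict` helper, not what `KNeighborsClassifier` does internally (which uses more efficient neighbour searches).

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical points: label 0 clusters near the origin, label 1 near (5, 5)
train_X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
train_y = [0, 0, 0, 1, 1, 1]

label_near_origin = knn_predict(train_X, train_y, (0.5, 0.5))
label_near_five = knn_predict(train_X, train_y, (5.5, 5.2))
```

Swapping the `key` function for a different distance is all it takes to change the metric, which is exactly what the `metric` argument does in the models below.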

I'll use three different distance metrics: Euclidean, Manhattan, and Hamming. My hypothesis is that each will perform better than the one before it.

### **Euclidean Distance**

$$
d(x,y) = \sqrt{\sum_{i=1}^{n} (y_i - x_i)^2}
$$

1. **Performing Euclidean KNN**

In [None]:
model_euc_knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
model_euc_knn.fit(X_train_scaled, y_train_resampled)
y_pred_euc_knn = model_euc_knn.predict(X_test_scaled)
y_pred_euc_knn_proba = model_euc_knn.predict_proba(X_test_scaled)[:, 1]

2. **Performance**

In [None]:
cm_euc_knn = confusion_matrix(y_test, y_pred_euc_knn)

plt.figure(figsize=(5, 5))
sns.heatmap(cm_euc_knn, annot=True, fmt='d', cmap='Greys', cbar=False)
plt.title('kNN (Euclidean)')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.tight_layout()
plt.show()

3. **Evaluating Error**

In [None]:
mse_euc_knn = mean_squared_error(y_test, y_pred_euc_knn)
r2_euc_knn = r2_score(y_test, y_pred_euc_knn)
accuracy_euc_knn = accuracy_score(y_test, y_pred_euc_knn)
precision_euc_knn = precision_score(y_test, y_pred_euc_knn)
recall_euc_knn = recall_score(y_test, y_pred_euc_knn)
f1_euc_knn = f1_score(y_test, y_pred_euc_knn)

print(f"Error Metrics: \n\nMSE: {mse_euc_knn:.4f}\nR^2: {r2_euc_knn:.4f}\n\nAccuracy: {accuracy_euc_knn:.4f}\nPrecision: {precision_euc_knn:.4f}\nRecall: {recall_euc_knn:.4f}\nF1: {f1_euc_knn:.4f}")

Looking at the confusion matrix, the model is performing better than the logistic regression model, with a higher F1 score. However, the model is still not performing well on the minority class, with an overall classification score of about 74%.

One reason for this underperformance is that Euclidean distance assumes that all features are continuous, which is not the case in our dataset.

Next, we'll try the Manhattan distance, which also assumes that all features are continuous, but sums the absolute differences along each feature instead of measuring the straight-line distance. It is also more robust to outliers, since large differences are not squared.
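A quick side-by-side of the two metrics, as a sketch in plain Python with made-up points. The second pair has one large, outlier-like difference, which shows how much more that single coordinate dominates the Euclidean distance.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Axis-by-axis distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical points, for illustration only
p, q = (0, 0), (3, 4)
r, s = (0, 0, 0), (1, 1, 10)

d_euc_pq, d_man_pq = euclidean(p, q), manhattan(p, q)   # 5.0 and 7
d_euc_rs, d_man_rs = euclidean(r, s), manhattan(r, s)

# Share of each distance contributed by the single large coordinate
share_euc = 10 ** 2 / d_euc_rs ** 2   # share of the squared Euclidean distance
share_man = 10 / d_man_rs             # share of the Manhattan distance
```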

### **Manhattan Distance**

$$
d(x,y) = \sum_{i=1}^{n} |y_i - x_i|
$$

1. **Performing Manhattan KNN**

In [None]:
model_man_knn = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
model_man_knn.fit(X_train_scaled, y_train_resampled)
y_pred_man_knn = model_man_knn.predict(X_test_scaled)
y_pred_man_knn_proba = model_man_knn.predict_proba(X_test_scaled)[:, 1]

2. **Performance**

In [None]:
cm_man_knn = confusion_matrix(y_test, y_pred_man_knn)

plt.figure(figsize=(5, 5))
sns.heatmap(cm_man_knn, annot=True, fmt='d', cmap='Greys', cbar=False)
plt.title('kNN (Manhattan)')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.tight_layout()
plt.show()

3. **Evaluating Error**

In [None]:
mse_man_knn = mean_squared_error(y_test, y_pred_man_knn)
r2_man_knn = r2_score(y_test, y_pred_man_knn)
accuracy_man_knn = accuracy_score(y_test, y_pred_man_knn)
precision_man_knn = precision_score(y_test, y_pred_man_knn)
recall_man_knn = recall_score(y_test, y_pred_man_knn)
f1_man_knn = f1_score(y_test, y_pred_man_knn)

print(f"Error Metrics: \n\nMSE: {mse_man_knn:.4f}\nR^2: {r2_man_knn:.4f}\n\nAccuracy: {accuracy_man_knn:.4f}\nPrecision: {precision_man_knn:.4f}\nRecall: {recall_man_knn:.4f}\nF1: {f1_man_knn:.4f}")

**Much better!**

The model is performing better than the Euclidean KNN model, with a higher F1 score of about 82%. The model is still not performing perfectly on the minority class but the $R^2$ is much better and MSE is much lower.

But... we can make it a little better! (or can we?)

### **Hamming Distance**

Hamming distance is simple enough to explain in words: it counts the positions at which two sequences differ.

Let "ABCDEFG" and "YBCNEXM" be two strings of equal length 7. The Hamming distance between these two strings is the number of positions at which the corresponding characters are different. So: **A**BC**D**E**FG** and **Y**BC**N**E**XM** differ in 4 positions, so the Hamming distance is 4.
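The same comparison can be sketched in plain Python. The `hamming` helper is written for illustration only. As far as I know, scikit-learn's `hamming` metric reports the proportion of mismatched components rather than the raw count, which is what `fraction` mirrors here.

```python
def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

distance = hamming("ABCDEFG", "YBCNEXM")
fraction = distance / len("ABCDEFG")   # mismatches as a share of all positions
```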

1. **Performing Hamming KNN**

In [None]:
model_ham_knn = KNeighborsClassifier(n_neighbors=5, metric='hamming')
model_ham_knn.fit(X_train_scaled, y_train_resampled)
y_pred_ham_knn = model_ham_knn.predict(X_test_scaled)
y_pred_ham_knn_proba = model_ham_knn.predict_proba(X_test_scaled)[:, 1]

2. **Performance**

In [None]:
cm_ham_knn = confusion_matrix(y_test, y_pred_ham_knn)

plt.figure(figsize=(5, 5))
sns.heatmap(cm_ham_knn, annot=True, fmt='d', cmap='Greys', cbar=False)
plt.title('kNN (Hamming)')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.tight_layout()
plt.show()

3. **Evaluating Error**

In [None]:
mse_ham_knn = mean_squared_error(y_test, y_pred_ham_knn)
r2_ham_knn = r2_score(y_test, y_pred_ham_knn)
accuracy_ham_knn = accuracy_score(y_test, y_pred_ham_knn)
precision_ham_knn = precision_score(y_test, y_pred_ham_knn)
recall_ham_knn = recall_score(y_test, y_pred_ham_knn)
f1_ham_knn = f1_score(y_test, y_pred_ham_knn)

print(f"Error Metrics: \n\nMSE: {mse_ham_knn:.4f}\nR^2: {r2_ham_knn:.4f}\n\nAccuracy: {accuracy_ham_knn:.4f}\nPrecision: {precision_ham_knn:.4f}\nRecall: {recall_ham_knn:.4f}\nF1: {f1_ham_knn:.4f}")

Interesting...

So of course this is not a good model for this dataset. The Hamming distance assumes that all features are categorical, which is not the case here. It only checks whether each feature matches exactly and ignores how far apart the values are, so on scaled continuous features nearly every comparison counts as a mismatch. It also views each feature individually and does not take into account the relationships between them. This means that a mismatch in a single feature can influence the model greatly.

So the winner here is the Manhattan distance kNN.

## Support Vector Machine (SVM) Model

### **Linear SVM**

The linear SVM tries to find a linear hyperplane that separates the data points into two classes with the widest possible margin, subject to this condition holding for every training point $i$:

$$
y_i(w^T X_i + b) \geq 1
$$
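As a quick check of what this condition means, the sketch below evaluates $y_i(w^T X_i + b)$ for a few made-up points against a hypothetical hyperplane. Note that this formulation encodes the labels as -1 / +1 rather than 0 / 1, and that `SVC` fits a soft-margin variant, so some training points are allowed to violate the condition at a cost.

```python
def margin(w, b, x, y):
    """Functional margin y * (w . x + b); values >= 1 satisfy the constraint."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical hyperplane and points; labels are encoded as -1 / +1
w, b = [1.0, 1.0], -3.0
points = [((3.0, 2.0), +1), ((0.5, 0.5), -1), ((2.0, 1.5), +1)]

margins = [margin(w, b, x, y) for x, y in points]
satisfied = [m >= 1 for m in margins]
```

The third point is on the correct side of the hyperplane (its margin is positive) but sits inside the margin band, so it does not satisfy the condition.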


1. **Performing SVM**

In [None]:
model_svc = SVC(kernel='linear', random_state=rs, probability=True)
model_svc.fit(X_train_scaled, y_train_resampled)
y_pred_svc = model_svc.predict(X_test_scaled)
y_pred_svc_proba = model_svc.predict_proba(X_test_scaled)[:, 1]

2. **Performance**

In [None]:
cm_svc = confusion_matrix(y_test, y_pred_svc)

plt.figure(figsize=(5, 5))
sns.heatmap(cm_svc, annot=True, fmt='d', cmap='Greys', cbar=False)
plt.title('SVM (Linear)')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.tight_layout()
plt.show()

3. **Evaluating Error**

In [None]:
mse_svc = mean_squared_error(y_test, y_pred_svc)
r2_svc = r2_score(y_test, y_pred_svc)
accuracy_svc = accuracy_score(y_test, y_pred_svc)
precision_svc = precision_score(y_test, y_pred_svc)
recall_svc = recall_score(y_test, y_pred_svc)
f1_svc = f1_score(y_test, y_pred_svc)

print(f"Error Metrics: \n\nMSE: {mse_svc:.4f}\nR^2: {r2_svc:.4f}\n\nAccuracy: {accuracy_svc:.4f}\nPrecision: {precision_svc:.4f}\nRecall: {recall_svc:.4f}\nF1: {f1_svc:.4f}")

The negative $R^2$ alone indicates that the model is not performing well. Although the MSE is not terrible, probably because the model is predicting the majority class well, the F1 score is still not good.

This means our data is too complex for a linear hyperplane, and a more complex algorithm is needed.

### **Non-Linear SVM**

$$
K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)
$$

The non-linear SVM uses the kernel trick to implicitly map the data into a higher-dimensional space, where it can find a hyperplane that separates the data points into two classes. The higher the value of $\gamma$, the more complex and less smooth the decision boundary will be.
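To get a feel for the role of $\gamma$, the sketch below evaluates the RBF kernel for one made-up pair of points at several hypothetical $\gamma$ values. The larger $\gamma$ is, the faster the similarity decays with distance, so each training point only influences its immediate neighbourhood. The `SVC` call in this notebook leaves `gamma` at its default rather than setting it explicitly.

```python
import math

def rbf_kernel(x, x_prime, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)

# Hypothetical points at squared distance 2 from each other
x, x_prime = (0.0, 0.0), (1.0, 1.0)

# Similarity of the same pair of points under increasing gamma
similarities = {g: rbf_kernel(x, x_prime, g) for g in (0.01, 0.1, 1.0, 10.0)}
```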

In [None]:
# non-linear kernel
model_rbf_svc = SVC(kernel='rbf', random_state=rs, probability=True)
model_rbf_svc.fit(X_train_scaled, y_train_resampled)
y_pred_rbf_svc = model_rbf_svc.predict(X_test_scaled)
y_pred_rbf_svc_proba = model_rbf_svc.predict_proba(X_test_scaled)[:, 1]

In [None]:
cm_rbf_svc = confusion_matrix(y_test, y_pred_rbf_svc)

plt.figure(figsize=(5, 5))
sns.heatmap(cm_rbf_svc, annot=True, fmt='d', cmap='Greys', cbar=False)
plt.title('SVM (RBF)')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.tight_layout()
plt.show()

In [None]:
mse_rbf_svc = mean_squared_error(y_test, y_pred_rbf_svc)
r2_rbf_svc = r2_score(y_test, y_pred_rbf_svc)
accuracy_rbf_svc = accuracy_score(y_test, y_pred_rbf_svc)
precision_rbf_svc = precision_score(y_test, y_pred_rbf_svc)
recall_rbf_svc = recall_score(y_test, y_pred_rbf_svc)
f1_rbf_svc = f1_score(y_test, y_pred_rbf_svc)

print(f"Error Metrics: \n\nMSE: {mse_rbf_svc:.4f}\nR^2: {r2_rbf_svc:.4f}\n\nAccuracy: {accuracy_rbf_svc:.4f}\nPrecision: {precision_rbf_svc:.4f}\nRecall: {recall_rbf_svc:.4f}\nF1: {f1_rbf_svc:.4f}")

**Good!**

This model is fitting the data pretty decently with $R^2 = 0.5$ and an F1 score of 81%. This is a good model for this dataset overall.

It is important to mention that a non-linear SVM requires a lot of computational power, as it needs to evaluate the kernel between pairs of data points, which becomes very time-consuming as the dataset grows. It is also not the best-performing model for our data.