# Portfolio assignment week 5
1. SVC
The Scikit-learn library provides different kernels for the Support Vector Classifier, e.g. RBF or polynomial.

Based on the examples in the accompanying notebook, create your own SVC class and configure it with different kernels to see whether it can correctly separate the moons dataset. You can also use a precomputed kernel. In addition, there are several parameters you can tune for better results. Make sure to go through the documentation.

Hint:

- Plot the support vectors to understand how the classifier works.
- Give arguments why a certain kernel behaves a certain way.
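
As a starting point for the kernel comparison, here is a minimal sketch that fits scikit-learn's `SVC` with three kernels on the moons data and reports how many support vectors each one needs. The sample size, noise level, seeds and the `X_moons`/`kernel_results` names are illustrative assumptions, not values from the accompanying notebook.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-moons data; noise level and seed are illustrative choices
X_moons, y_moons = make_moons(n_samples=300, noise=0.2, random_state=0)
Xm_train, Xm_test, ym_train, ym_test = train_test_split(
    X_moons, y_moons, test_size=0.3, random_state=0, stratify=y_moons
)

kernel_results = {}
for kernel in ["linear", "poly", "rbf"]:
    # C=1.0, gamma="scale" and degree=3 are the scikit-learn defaults,
    # spelled out here so they are easy to change
    clf = SVC(kernel=kernel, C=1.0, gamma="scale", degree=3)
    clf.fit(Xm_train, ym_train)
    kernel_results[kernel] = {
        "test_accuracy": clf.score(Xm_test, ym_test),
        "n_support_vectors": int(clf.n_support_.sum()),
        # clf.support_vectors_ holds the coordinates of the support vectors
        "support_vectors_shape": clf.support_vectors_.shape,
    }

for kernel, res in kernel_results.items():
    print(kernel, res)
```

The rows of `clf.support_vectors_` can be passed to `plt.scatter` on top of the data points to see which samples define each decision boundary.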
2. Model Evaluation
Classification metrics are important for measuring the performance of your model. Scikit-learn provides several options, such as the classification_report and confusion_matrix functions. Other helpful options are the ROC curve with its AUC and the precision-recall curve. Try to understand what these metrics mean and give arguments why one metric would be more important than others.

For instance, if you have to predict whether a patient has cancer or not, the number of false negatives is probably more important than the number of false positives. This would be different if we were predicting whether a picture contains a cat or a dog – or not: it all depends on the context. Thus, it is important to understand when to use which metric.
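
To make the false negative versus false positive trade-off concrete, the sketch below computes a confusion matrix, precision and recall on ten hand-made labels. The labels are hypothetical and only serve to show how the counts map to the metrics.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = has the disease, 0 = healthy
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Recall = tp / (tp + fn): the share of sick patients that were found
recall_demo = recall_score(y_true_demo, y_pred_demo)
# Precision = tp / (tp + fp): the share of positive calls that were right
precision_demo = precision_score(y_true_demo, y_pred_demo)

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("recall:", recall_demo)
print("precision:", precision_demo)
```

Here two of the four sick patients are missed, so recall is 0.5 even though 7 of the 10 predictions are correct. In a screening context that recall value is the number to watch.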

For this exercise, you can use your own dataset if it is eligible for supervised classification. Otherwise, you can use the breast cancer dataset, which you can find on assemblix2019 (/data/datasets/DS3/). Go through the data science pipeline as you've done before:

- Try to understand the dataset globally.
- Load the data.
- Exploratory analysis
- Preprocess data (skewness, normality, etc.)
- Modeling (cross-validation and training)
- Evaluation

Create and train several LogisticRegression and SVM models with different values for their hyperparameters. Make use of the model evaluation techniques that were described during the plenary part to determine the best model for this dataset. Accompany your elaborations with a conclusion in which you explicitly interpret these evaluations and describe why the different metrics you are using are important or not. Make sure you take the context of this dataset into account.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

In [None]:
# Read the data file
data_brc = pd.read_csv('datasets_DS3\\breast-cancer.csv')

data_brc.head()

In [None]:
data_brc.columns

In [None]:
data_brc.info()

In [None]:
data_brc.shape

In [None]:
data_brc.dtypes == 'object'

In [None]:
data_brc.isna().sum().sum()

In [None]:
data_brc.describe()

In [None]:
data_brc['diagnosis'].value_counts()

In [None]:
# Cleaning and modifying the data
data_brc_clean = data_brc.drop('id',axis=1)

In [None]:
# Mapping Benign to 0 and Malignant to 1 
data_brc_clean['diagnosis'] = data_brc_clean['diagnosis'].map({'M':1,'B':0})

In [None]:
data_brc_clean

In [None]:
corr_mat = data_brc_clean.corr()
# Strip out the diagonal values for the next step
for x in range(len(data_brc_clean.columns)):
    corr_mat.iloc[x,x] = 0.0
    
corr_mat

In [None]:
# see which features are highly correlated
# Pairwise maximal correlations
corr_mat.abs().idxmax()

In [None]:
# how much are they correlated? Can we eliminate certain features based on high correlations
corr_mat.abs().max()

In [None]:
# .skew 0: no skew, + right skew, - left skew, look for above .75 
skew_columns = (data_brc_clean
                .skew()
                .sort_values(ascending=False))

skew_columns = skew_columns.loc[skew_columns > 0.75]
skew_columns

In [None]:
# Perform log transform on skewed columns
for col in skew_columns.index.tolist():
    data_brc_clean[col] = np.log1p(data_brc_clean[col])
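
To check what the log transform does to skewness, here is a small sketch on synthetic data. The log-normal sample, the seed and the `demo` frame are assumptions for illustration and are unrelated to the breast cancer columns.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed feature (assumption: log-normal values, fixed seed)
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {"right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=1000)}
)

skew_before = demo["right_skewed"].skew()
# log1p(x) = log(1 + x), which stays defined at x = 0
skew_after = np.log1p(demo["right_skewed"]).skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

Comparing `.skew()` before and after the transform in the same way on the real columns shows whether the 0.75 cut-off picked useful candidates.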

In [None]:
import seaborn as sns
sns.set_context('notebook')
sns.pairplot(data_brc_clean, 
             hue='diagnosis');

### the code above is taken from: https://github.com/fenna/BFVM23DATASCNC5/blob/main/Exercises/E_Clustering_breast_cancer_solution.ipynb
### the above code is also used for the subsequent assignments of week 6 and 7.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
scaler.fit(data_brc_clean)

In [None]:
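# Note: the scaler is fit on every column, including 'diagnosis'.
# scaled_features is only inspected below; X and y are built from the
# unscaled data_brc_clean, so the models are trained on unscaled features.
scaled_features = scaler.transform(data_brc_clean)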

In [None]:
scaled_features
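
The scaler above is fit on the whole table, so rows that later end up in the test set influence the scaling. One possible alternative is to put the scaler inside a pipeline so that it is refit on the training part of every fold. This is a sketch under assumptions: it uses scikit-learn's bundled breast cancer data as a stand-in for the CSV file, and the `X_demo`, `Xd_train` and `scaled_lr` names are made up for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): in the bundled set 0 = malignant, 1 = benign;
# flipped here so that malignant = 1, matching the mapping in this notebook
X_demo, y_demo = load_breast_cancer(return_X_y=True)
y_demo = 1 - y_demo

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is refit on the training part of every CV fold,
# so no statistics from held-out rows reach the model
scaled_lr = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=5000))
cv_scores = cross_val_score(scaled_lr, Xd_train, yd_train, cv=5)
scaled_lr.fit(Xd_train, yd_train)

print("Mean CV accuracy:", cv_scores.mean())
print("Test accuracy:", scaled_lr.score(Xd_test, yd_test))
```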

In [None]:
X = data_brc_clean.drop('diagnosis', axis=1)
y = data_brc_clean['diagnosis']

In [None]:
data_brc_clean

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print("X_train shape:",X_train.shape)
print("X_test shape:",X_test.shape)
print("y_train shape:",y_train.shape)
print("y_test shape:",y_test.shape)

In [None]:
# To store results of models
result_dict_train = {}
result_dict_test = {}

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings("ignore")

### Logistic Regression

In [None]:
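from sklearn.linear_model import LogisticRegression
lr_model_1 = LogisticRegression(max_iter=5000)
# 5-fold cross-validated accuracy on the training set
accuracies = cross_val_score(lr_model_1, X_train, y_train, cv=5)
# Refit on the full training set before scoring on the held-out test set
lr_model_1.fit(X_train, y_train)

# "Train Score" is the mean cross-validation accuracy on the training set
print("Train Score:",np.mean(accuracies))
print("Test Score:",lr_model_1.score(X_test,y_test))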

In [None]:
result_dict_train["Logistic regression Default Train Score"] = np.mean(accuracies)
result_dict_test["Logistic regression Default Test Score"] = lr_model_1.score(X_test,y_test)

In [None]:
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}
  
# Instantiating logistic regression classifier
logreg = LogisticRegression(max_iter=4000)
  
# Instantiating the GridSearchCV object
logreg_cv = GridSearchCV(logreg, param_grid, cv = 5)
logreg_cv.fit(X_train, y_train)

print("Best Parameters:",logreg_cv.best_params_)
print("Train Score:",logreg_cv.best_score_)
print("Test Score:",logreg_cv.score(X_test,y_test))

### Support Vector Machine (supervised learning)

In [None]:
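from sklearn.svm import SVC
svc_model_2 = SVC(random_state=101)
# 5-fold cross-validated accuracy on the training set
accuracies = cross_val_score(svc_model_2, X_train, y_train, cv=5)
# Refit on the full training set before scoring on the held-out test set
svc_model_2.fit(X_train,y_train)

# "Train Score" is the mean cross-validation accuracy on the training set
print("Train Score:",np.mean(accuracies))
print("Test Score:",svc_model_2.score(X_test,y_test))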

In [None]:
result_dict_train["SVM Default Train Score"] = np.mean(accuracies)
result_dict_test["SVM Default Test Score"] = svc_model_2.score(X_test,y_test)

In [None]:
grid = {
    'C': [0.01, 0.1, 1, 10],
    'kernel': ["linear", "poly", "rbf", "sigmoid"],
    'degree': [1, 3, 5, 7],  # only used by the 'poly' kernel
    'gamma': [0.01, 1]       # ignored by the 'linear' kernel
}

svm = SVC()
# Every combination is tried with 5-fold CV, scored by accuracy (the default)
svm_cv = GridSearchCV(svm, grid, cv=5)
svm_cv.fit(X_train,y_train)
print("Best Parameters:",svm_cv.best_params_)
print("Train Score:",svm_cv.best_score_)
print("Test Score:",svm_cv.score(X_test,y_test))
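
The grid above tries every `degree` for every kernel, although `degree` only affects the `poly` kernel. A possible variation is a list of per-kernel grids, scored by recall so that missed malignant cases weigh more heavily. This is a sketch under assumptions: the bundled scikit-learn breast cancer data stands in for the CSV file, and the parameter values and the `pipe`, `param_grid_demo` and `search` names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption); labels flipped so that malignant = 1
X_demo, y_demo = load_breast_cancer(return_X_y=True)
y_demo = 1 - y_demo
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# One dict per kernel, so each kernel is only combined with
# the hyperparameters it actually uses
param_grid_demo = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10],
     "svc__gamma": ["scale", 0.01]},
    {"svc__kernel": ["poly"], "svc__C": [0.1, 1, 10],
     "svc__degree": [2, 3]},
]

# scoring="recall" ranks candidates by recall on the positive class (1)
search = GridSearchCV(pipe, param_grid_demo, cv=5, scoring="recall")
search.fit(Xd_train, yd_train)

print("Best parameters:", search.best_params_)
print("Best mean CV recall:", search.best_score_)
print("Test recall:", search.score(Xd_test, yd_test))
```

This evaluates 15 candidates in total, because combinations that a kernel ignores are left out.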

### AUC ROC

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

In [None]:
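# logistic regression
model1 = LogisticRegression()
# support vector classifier with a polynomial kernel
model2 = SVC(kernel='poly')

# fit model
model1.fit(X_train, y_train)
model2.fit(X_train, y_train)

# continuous scores for the positive class (1 = malignant);
# hard labels from predict() would reduce the ROC curve to a single threshold
pred_prob1 = model1.predict_proba(X_test)[:, 1]
# SVC has no predict_proba unless probability=True, so use the signed
# distance to the decision boundary as the ranking score
pred_prob2 = model2.decision_function(X_test)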

In [None]:
# roc curve for models
fpr1, tpr1, thresh1 = roc_curve(y_test, pred_prob1, pos_label=1)
fpr2, tpr2, thresh2 = roc_curve(y_test, pred_prob2, pos_label=1)

# roc curve for tpr = fpr 
random_probs = [0 for i in range(len(y_test))]
p_fpr, p_tpr, _ = roc_curve(y_test, random_probs, pos_label=1)

In [None]:
# auc scores
auc_score1 = roc_auc_score(y_test, pred_prob1)
auc_score2 = roc_auc_score(y_test, pred_prob2)

print('Logistic regression AUC Score: ' ,auc_score1,'SVC AUC Score: ' ,auc_score2)

### the code used above is taken from: https://github.com/AnshulSaini17/Income_evaluation/blob/main/Income_Evalutation.ipynb
### and https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/

The AUC score of the logistic regression model is excellent on this dataset, which indicates good classification performance. The support vector machine (SVC) also scored very high when classifying tumors as malignant or benign. However, logistic regression performed slightly better than the SVC model.
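
The assignment also mentions the classification report and the precision-recall curve. The sketch below shows one way to compute them next to the ROC AUC. It rests on assumptions: the bundled scikit-learn breast cancer data stands in for the CSV file, a scaled logistic regression is used as the example model, and the `clf_demo` and `scores_demo` names are made up here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, classification_report,
                             precision_recall_curve, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption); labels flipped so that malignant = 1
X_demo, y_demo = load_breast_cancer(return_X_y=True)
y_demo = 1 - y_demo
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
clf_demo.fit(Xd_train, yd_train)

# Continuous scores: predicted probability of the malignant class
scores_demo = clf_demo.predict_proba(Xd_test)[:, 1]

# Precision and recall at every threshold; plot recall against precision
precision, recall, thresholds = precision_recall_curve(yd_test, scores_demo)
roc_auc_demo = roc_auc_score(yd_test, scores_demo)
avg_precision_demo = average_precision_score(yd_test, scores_demo)

print("ROC AUC:", roc_auc_demo)
print("Average precision:", avg_precision_demo)
# Per-class precision, recall and F1 at the default 0.5 threshold
print(classification_report(yd_test, clf_demo.predict(Xd_test),
                            target_names=["benign", "malignant"]))
```

For a cancer dataset, the recall of the malignant class in the report is the figure that reflects missed diagnoses most directly.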