## Binary Classification

In this notebook, we will work on a binary classification problem using the following algorithms, leveraging cross-validation:
- Logistic Regression
- Naive Bayes
- SVM
- KNN
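
As a rough sketch of what cross-validating these four models can look like, the snippet below scores each one with 5-fold cross-validation. It is not this notebook's actual pipeline: the synthetic data from `make_classification`, the `StandardScaler` step, the fold count, and the `models` / `cv_scores` names are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

## synthetic stand-in data (assumption): 300 samples, 9 features, 2 classes
X, y = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}

cv_scores = {}
for name, model in models.items():
    ## the scaler sits inside the pipeline so it is re-fit on each training fold
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Keeping the scaler inside the pipeline means the validation fold never influences the scaling statistics, which is the main reason to prefer this over scaling the full dataset up front.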

In [1]:
## the general packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## ML packages
from sklearn.preprocessing import StandardScaler, QuantileTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

### 1. Loading and Exploring the data

In [2]:
## we will be using the breast cancer data from the UCI archive
data = pd.read_csv('https://archive.ics.uci.edu/static/public/15/data.csv')
data.head()

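| | Sample_code_number | Clump_thickness | Uniformity_of_cell_size | Uniformity_of_cell_shape | Marginal_adhesion | Single_epithelial_cell_size | Bare_nuclei | Bland_chromatin | Normal_nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 | 2 |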


In [3]:
## checking the dtypes and the presence of nulls in the columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Sample_code_number           699 non-null    int64  
 1   Clump_thickness              699 non-null    int64  
 2   Uniformity_of_cell_size      699 non-null    int64  
 3   Uniformity_of_cell_shape     699 non-null    int64  
 4   Marginal_adhesion            699 non-null    int64  
 5   Single_epithelial_cell_size  699 non-null    int64  
 6   Bare_nuclei                  683 non-null    float64
 7   Bland_chromatin              699 non-null    int64  
 8   Normal_nucleoli              699 non-null    int64  
 9   Mitoses                      699 non-null    int64  
 10  Class                        699 non-null    int64  
dtypes: float64(1), int64(10)
memory usage: 60.2 KB


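The only column with null values is `Bare_nuclei` (683 of 699 entries are non-null, so 16 are missing), which is also why it is stored as `float64`. Every other column is `int64`. Judging by the preview above, the feature columns and `Class` hold small integers and could be downcast to `int8`, while `Sample_code_number` holds values above the `int16` range and would need a wider type such as `int32`.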
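
A minimal sketch of one way to downcast, using `pd.to_numeric` with `downcast="integer"`. The `toy` frame is a hypothetical stand-in built from three rows of the preview above, not the full dataset, and the names `toy` and `downcast` are introduced here for illustration only.

```python
import pandas as pd

## toy frame (assumption) with two of the columns, values taken from the preview
toy = pd.DataFrame({
    "Sample_code_number": [1000025, 1002945, 1015425],
    "Clump_thickness": [5, 5, 3],
})

## pd.to_numeric picks the smallest signed integer dtype that holds every value
downcast = toy.apply(pd.to_numeric, downcast="integer")
print(downcast.dtypes)
```

A column that contains missing values, such as `Bare_nuclei`, stays floating point under this approach unless it is first converted to a nullable integer dtype such as `Int8`.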

In [5]:
## checking to see if the missing values
## belong to a specific class
data[data['Bare_nuclei'].isnull()]['Class'].value_counts()

Class
2    14
4     2
Name: count, dtype: int64

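As it turned out, the majority of the missing values (14 of 16) belong to class `2`, which is the benign class in the UCI coding; the remaining 2 belong to class `4`.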
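
Raw counts like these are easier to interpret relative to the size of each class. The sketch below shows one way to compute the share of missing `Bare_nuclei` values per class; the `toy` frame is made-up illustrative data (an assumption), not the real dataset, and `missing_rate` is a name introduced here.

```python
import numpy as np
import pandas as pd

## toy data (assumption): 6 rows of class 2 and 4 rows of class 4
toy = pd.DataFrame({
    "Bare_nuclei": [1.0, np.nan, 2.0, np.nan, 1.0, 3.0, 10.0, np.nan, 8.0, 5.0],
    "Class": [2, 2, 2, 2, 2, 2, 4, 4, 4, 4],
})

## fraction of rows with a missing Bare_nuclei within each class
missing_rate = toy["Bare_nuclei"].isna().groupby(toy["Class"]).mean()
print(missing_rate)
```

On the real data, the same expression with `data` in place of `toy` would show whether the missing values are spread roughly in proportion to class size or concentrated in one class.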

In [None]:
## some data cleaning 
def data_cleaner(df):
    ## lowercasing the column names
    ## and removing the extra spaces
    df.columns = df.columns.str.strip().str.replace(r'\s+', '_', regex=True).str.lower()
    for col in df.columns:
        if df[col].dtype == ''