# Haberman-KNN-Classifier-Analysis

This notebook applies a K-Nearest Neighbors (KNN) classifier to the Haberman dataset, which records the survival of patients who underwent surgery for breast cancer. It loads the dataset, assigns column names, and splits the data into features (X) and the target variable (y). It then trains the classifier in two series of runs, one on the original data and one set up for min-max normalized data. Performance is evaluated with confusion matrices and accuracy scores for K = 3, 5, and 9 and for test sizes of 0.20, 0.25, 0.33, and 0.5, to show how these parameters affect the classifier on this dataset.

In [1]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import MinMaxScaler
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')

In [34]:
#Read data
df = pd.read_csv("haberman.data")

In [35]:
#Giving our dataset column names
col_names=['age','year','node','status']
df.columns=col_names
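
One detail worth checking at this step: `pd.read_csv` treats the first line of a file as the header by default. If `haberman.data` has no header line (an assumption here, since the file itself is not shown), the first record is lost once the columns are renamed. A minimal sketch of the difference, using an in-memory stand-in built from three of the rows that `df.head()` displays; `raw`, `df_default`, and `df_all` are illustrative names:

```python
import io
import pandas as pd

# Three records in a comma-separated, header-less layout (values taken from df.head())
raw = "30,62,3,1\n30,65,0,1\n31,59,2,1\n"
col_names = ['age', 'year', 'node', 'status']

# Default behaviour: the first record is consumed as the header row
df_default = pd.read_csv(io.StringIO(raw))
df_default.columns = col_names

# header=None keeps every record; names= assigns the column names in one step
df_all = pd.read_csv(io.StringIO(raw), header=None, names=col_names)

print("Rows with default header handling:", len(df_default))
print("Rows with header=None:", len(df_all))
```

Comparing `len(df)` with the number of lines in the file is a quick way to see which case applies.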

In [36]:
#Showing the first 5 rows of the data
df.head()

   age  year  node  status
0   30    62     3       1
1   30    65     0       1
2   31    59     2       1
3   31    65     4       1
4   33    58    10       1


In [37]:
#Group by Status 
df.groupby('status').count()

        age  year  node
status                 
1       224   224   224
2        81    81    81
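
The two classes are clearly imbalanced, so plain accuracy needs a reference point. As a rough sketch (the `counts` dictionary is typed in by hand from the table above, not read from the file), the accuracy of always predicting the majority class can be computed like this:

```python
# Class counts copied from the groupby output above
counts = {1: 224, 2: 81}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print("Majority class:", majority_class)
print("Baseline accuracy: {:.2f}".format(baseline_accuracy))
```

A test-set accuracy is easier to judge against this value, although the exact baseline on a given split depends on how many rows of each class land in the test part.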


In [38]:
#Features: every column except status
x=df.drop('status',axis=1)


In [39]:
#Target variable: the status column
y=df['status']

In [40]:
#Splitting the data into training and test sets (20% test)
X_train, X_test, y_train, y_test = train_test_split(x, y,random_state=123, test_size=0.20)
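
With imbalanced classes, a purely random split can shift the class proportions between the training and test parts. `train_test_split` accepts a `stratify` argument that keeps them roughly equal. The sketch below is only an illustration: `X_demo` and `y_demo` are stand-in arrays with the same class counts as the groupby table, not the real file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the same class counts as the groupby output (224 vs 81)
y_demo = np.array([1] * 224 + [2] * 81)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature, illustration only

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=123, stratify=y_demo
)

# Share of class 2 in the full data, the training part, and the test part
full_share = (y_demo == 2).mean()
train_share = (y_tr == 2).mean()
test_share = (y_te == 2).mean()
print("Class 2 share (full, train, test):",
      round(full_share, 3), round(train_share, 3), round(test_share, 3))
```

Whether stratification changes the results on this dataset would need to be checked by rerunning the notebook.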

In [41]:
#Describe what difference different values of K make

In [42]:
#Here we test different K values with the help of a loop.
#Task A and B I
Nr = 1
print("Three different K values with non-normalized data\n")
for K in [3, 5, 9]:
    print("Attempt Nr:", Nr)
    print("K number:",K)
    knn=KNeighborsClassifier(n_neighbors=K)
    knn.fit(X_train,y_train)
    pred= knn.predict(X_test)
    #Named cm so the imported sklearn confusion_matrix function is not overwritten
    cm = pd.crosstab(y_test, pred, rownames=['Actual'], colnames=['Predicted'])
    print("Confusion Matrix")
    print(cm)
    print("Accuracy")
    print("Test set accuracy: {:.2f}".format(knn.score(X_test, y_test)))
    print("\n---------------------\n")
    Nr += 1

Three different K values with non-normalized data

Attempt Nr: 1
K number: 3
Confusion Matrix
Predicted   1  2
Actual          
1          38  7
2          14  2
Accuracy
Test set accuracy: 0.66

---------------------

Attempt Nr: 2
K number: 5
Confusion Matrix
Predicted   1  2
Actual          
1          39  6
2          12  4
Accuracy
Test set accuracy: 0.70

---------------------

Attempt Nr: 3
K number: 9
Confusion Matrix
Predicted   1  2
Actual          
1          38  7
2          13  3
Accuracy
Test set accuracy: 0.67

---------------------
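
Accuracy alone hides how differently the two classes are treated. As a small worked example, the K=5 confusion matrix above can be turned into per-class recall and precision; the matrix values are copied by hand from the output, and `cm_k5` is an illustrative name:

```python
import numpy as np

# Confusion matrix for K=5 copied from the output above
# rows = actual class (1, 2), columns = predicted class (1, 2)
cm_k5 = np.array([[39, 6],
                  [12, 4]])

accuracy = np.trace(cm_k5) / cm_k5.sum()
recall_per_class = np.diag(cm_k5) / cm_k5.sum(axis=1)     # correct / actual count
precision_per_class = np.diag(cm_k5) / cm_k5.sum(axis=0)  # correct / predicted count

print("Accuracy: {:.2f}".format(accuracy))
print("Recall per class:", np.round(recall_per_class, 2))
print("Precision per class:", np.round(precision_per_class, 2))
```

For this matrix, only 4 of the 16 actual class 2 cases are predicted correctly, which the overall accuracy of 0.70 does not reveal.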



In [43]:
#Three different K values with normalized data

In [44]:
#Making a copy of our dataset for normalization 
normDf = df.copy()

In [45]:
print(df)
#Define the min-max scaler, normalize the dataset, and print the result
#Note: this scales every column, including 'status', using the full dataset
scaler = MinMaxScaler()
print("transforming data")
normdf = scaler.fit_transform(df)
print(normdf)

     age  year  node  status
0     30    62     3       1
1     30    65     0       1
2     31    59     2       1
3     31    65     4       1
4     33    58    10       1
..   ...   ...   ...     ...
300   75    62     1       1
301   76    67     0       1
302   77    65     3       1
303   78    65     1       2
304   83    58     2       2

[305 rows x 4 columns]
transforming data
[[0.         0.36363636 0.05769231 0.        ]
 [0.         0.63636364 0.         0.        ]
 [0.01886792 0.09090909 0.03846154 0.        ]
 ...
 [0.88679245 0.63636364 0.05769231 0.        ]
 [0.90566038 0.63636364 0.01923077 1.        ]
 [1.         0.         0.03846154 1.        ]]
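
The scaled values can be checked by hand. With its default settings, `MinMaxScaler` maps each column to [0, 1] using (value - min) / (max - min). A minimal sketch for the `age` column, whose minimum (30) and maximum (83) can be read off the printed frame because they scale to 0 and 1; `min_max` is a helper written for this illustration:

```python
def min_max(value, col_min, col_max):
    """Scale one value to [0, 1] using the default MinMaxScaler formula."""
    return (value - col_min) / (col_max - col_min)

# age spans 30..83 in the printed frame; row 2 has age 31
age_scaled = min_max(31, 30, 83)
print(round(age_scaled, 8))

# status takes the values 1 and 2, so it becomes 0 and 1
status_scaled = min_max(2, 1, 2)
print(status_scaled)
```

The first result matches the `0.01886792` shown for the third row of the transformed array.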


In [46]:
#Splitting with 0.25 test size
#Note: this split uses the unscaled df (not normdf) and keeps 'status' among the features,
#so the results below are not computed on normalized data
normX_train, normX_test, normY_train, normY_test = x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(
    df[['age','year','node','status']], df['status'], random_state=123, test_size=0.25)

In [47]:
#Task B II
#Training on normalized data with 3 different K values
print("Test size is 0.25")
Nr = 1
for K in [3, 5, 9]:
    print("Attempt Nr:", Nr)
    print("K number:",K)

    knn=KNeighborsClassifier(n_neighbors=K)
    knn.fit(normX_train,normY_train)
    pred= knn.predict(normX_test)

    print("Confusion Matrix")
    #Named cm so the imported sklearn confusion_matrix function is not overwritten
    cm = pd.crosstab(normY_test, pred, rownames=['Actual'], colnames=['Predicted'])
    print(cm)

    print("Accuracy")
    print("Test set accuracy: {:.2f}".format(knn.score(normX_test, normY_test)))
    print("\n---------------------\n")
    Nr += 1

Test size is 0.25
Attempt Nr: 1
K number: 3
Confusion Matrix
Predicted   1  2
Actual          
1          47  6
2          19  5
Accuracy
Test set accuracy: 0.68

---------------------

Attempt Nr: 2
K number: 5
Confusion Matrix
Predicted   1  2
Actual          
1          46  7
2          18  6
Accuracy
Test set accuracy: 0.68

---------------------

Attempt Nr: 3
K number: 9
Confusion Matrix
Predicted   1  2
Actual          
1          47  6
2          19  5
Accuracy
Test set accuracy: 0.68

---------------------
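
For a KNN run on normalized data, one common arrangement is to put the scaler and the classifier in a single pipeline, so that the scaler is fitted on the training features only and the target column never enters the feature matrix. The sketch below is an illustration under assumptions, not the notebook's own method: it uses synthetic stand-in data (`X_demo`, `y_demo`) with value ranges loosely modeled on the three feature columns, so its scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature columns and the status column
rng = np.random.default_rng(123)
X_demo = np.column_stack([
    rng.integers(30, 84, size=305),   # age-like values
    rng.integers(58, 70, size=305),   # year-like values
    rng.integers(0, 53, size=305),    # node-like values
])
y_demo = rng.choice([1, 2], size=305, p=[0.73, 0.27])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123
)

scores = {}
for k in [3, 5, 9]:
    # The scaler is fitted inside the pipeline, on the training features only
    model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_tr, y_tr)
    scores[k] = model.score(X_te, y_te)
    print("K = {}: test accuracy {:.2f}".format(k, scores[k]))
```

Replacing the synthetic arrays with `x` and `y` from the earlier cells would apply the same pattern to the Haberman data.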



In [48]:
#Task C, different test sizes

In [49]:
#Nested loops: the outer loop changes the test size, the inner loop changes K
#Original data with different test sizes and 3 different K values
print("Original data.")
for TestSize in [0.25, 0.33, 0.5]:
    X_train, X_test, y_train, y_test = train_test_split(x, y,random_state=123, test_size=TestSize)

    print("Test size is", TestSize)
    Nr = 1
    for K in [3, 5, 9]:
        print("Attempt Nr:", Nr)
        print("K number:",K)

        knn=KNeighborsClassifier(n_neighbors=K)
        knn.fit(X_train,y_train)
        pred= knn.predict(X_test)

        print("Confusion Matrix")
        #Named cm so the imported sklearn confusion_matrix function is not overwritten
        cm = pd.crosstab(y_test, pred, rownames=['Actual'], colnames=['Predicted'])
        print(cm)

        print("Accuracy")
        print("Test set accuracy: {:.2f}".format(knn.score(X_test, y_test)))
        print("\n---------------------\n")
        Nr += 1
    print("*"*50,"\n")

Original data.
Test size is 0.25
Attempt Nr: 1
K number: 3
Confusion Matrix
Predicted   1  2
Actual          
1          47  6
2          20  4
Accuracy
Test set accuracy: 0.66

---------------------

Attempt Nr: 2
K number: 5
Confusion Matrix
Predicted   1  2
Actual          
1          46  7
2          20  4
Accuracy
Test set accuracy: 0.65

---------------------

Attempt Nr: 3
K number: 9
Confusion Matrix
Predicted   1  2
Actual          
1          47  6
2          19  5
Accuracy
Test set accuracy: 0.68

---------------------

************************************************** 

Test size is 0.33
Attempt Nr: 1
K number: 3
Confusion Matrix
Predicted   1  2
Actual          
1          63  8
2          22  8
Accuracy
Test set accuracy: 0.70

---------------------

Attempt Nr: 2
K number: 5
Confusion Matrix
Predicted   1  2
Actual          
1          63  8
2          22  8
Accuracy
Test set accuracy: 0.70

---------------------

---------------------

Attempt Nr: 3
K number: 9
Confusion Matrix
Predicte
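
Each accuracy above comes from a single split, so differences of a few hundredths between K values or test sizes may reflect which rows happened to land in the test set. A possible complement, sketched here on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical and only mimic the class counts), is stratified k-fold cross-validation, which averages the score over several splits:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in features; labels use the class counts from the groupby table
rng = np.random.default_rng(123)
X_demo = np.column_stack([
    rng.integers(30, 84, size=305),   # age-like values
    rng.integers(58, 70, size=305),   # year-like values
    rng.integers(0, 53, size=305),    # node-like values
])
y_demo = np.array([1] * 224 + [2] * 81)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

mean_scores = {}
for k in [3, 5, 9]:
    fold_scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=cv)
    mean_scores[k] = fold_scores.mean()
    print("K = {}: mean accuracy {:.2f} (std {:.2f})".format(k, fold_scores.mean(), fold_scores.std()))
```

The standard deviation across folds gives a rough sense of how much a single-split accuracy can move by chance.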

In [50]:
#Normalized data with different test sizes and 3 different K values
#Note: the split below uses the unscaled df (not normdf) and keeps 'status' among the features
print("Normalized data")
for TestSize in [0.25, 0.33, 0.5]:
    normX_train, normX_test, normY_train, normY_test = x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(
        df[['age','year','node','status']], df['status'], random_state=123, test_size=TestSize)
    print("Test size is", TestSize)
    Nr = 1
    for K in [3, 5, 9]:
        print("Attempt Nr:", Nr)
        print("K number:",K)
        knn=KNeighborsClassifier(n_neighbors=K)
        knn.fit(normX_train,normY_train)
        pred= knn.predict(normX_test)
        print("Confusion Matrix")
        #Named cm so the imported sklearn confusion_matrix function is not overwritten
        cm = pd.crosstab(normY_test, pred, rownames=['Actual'], colnames=['Predicted'])
        print(cm)
        print("Accuracy")
        print("Test set accuracy: {:.2f}".format(knn.score(normX_test, normY_test)))
        print("\n---------------------\n")
        Nr += 1
    print("*"*50,"\n")

Normalized data
Test size is 0.25
Attempt Nr: 1
K number: 3
Confusion Matrix
Predicted   1  2
Actual          
1          47  6
2          19  5
Accuracy
Test set accuracy: 0.68

---------------------

Attempt Nr: 2
K number: 5
Confusion Matrix
Predicted   1  2
Actual          
1          46  7
2          18  6
Accuracy
Test set accuracy: 0.68

---------------------

Attempt Nr: 3
K number: 9
Confusion Matrix
Predicted   1  2
Actual          
1          47  6
2          19  5
Accuracy
Test set accuracy: 0.68

---------------------

************************************************** 

Test size is 0.33
Attempt Nr: 1
K number: 3
Confusion Matrix
Predicted   1  2
Actual          
1          65  6
2          21  9
Accuracy
Test set accuracy: 0.73

---------------------

Attempt Nr: 2
K number: 5
Confusion Matrix
Predicted   1  2
Actual          
1          63  8
2          21  9
Accuracy
Test set accuracy: 0.71

---------------------

Attempt Nr: 3
K number: 9
Confusion Matrix
Predict