
In [34]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

1. Title: Haberman's Survival Data

2. Sources:
   (a) Donor:   Tjen-Sien Lim (limt@stat.wisc.edu)
   (b) Date:    March 4, 1999

3. Past Usage:
   1. Haberman, S. J. (1976). Generalized Residuals for Log-Linear
      Models, Proceedings of the 9th International Biometrics
      Conference, Boston, pp. 104-122.
   2. Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984),
      Graphical Models for Assessing Logistic Regression Models (with
      discussion), Journal of the American Statistical Association 79:
      61-83.
   3. Lo, W.-D. (1993). Logistic Regression Trees, PhD thesis,
      Department of Statistics, University of Wisconsin, Madison, WI.

4. Relevant Information:
   The dataset contains cases from a study that was conducted between
   1958 and 1970 at the University of Chicago's Billings Hospital on
   the survival of patients who had undergone surgery for breast
   cancer.

5. Number of Instances: 306

6. Number of Attributes: 4 (including the class attribute)

7. Attribute Information:
   1. Age of patient at time of operation (numerical)
   2. Patient's year of operation (year - 1900, numerical)
   3. Number of positive axillary nodes detected (numerical)
   4. Survival status (class attribute)
         1 = the patient survived 5 years or longer
         2 = the patient died within 5 years

8. Missing Attribute Values: None
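
The description above states the row count, the attribute count, and that no values are missing, so these are easy to check right after loading. The following is a minimal sketch of such a check. The helper `summarize` and the inline `SAMPLE_CSV` are illustrative assumptions, not part of the notebook; the five sample rows are copied from the DataFrame preview shown in this notebook, and on the full file one would expect 306 rows.

```python
import io

import pandas as pd

# Five rows copied from the notebook's DataFrame preview; the real file has 306.
SAMPLE_CSV = """30,64,1,1
30,62,3,1
30,65,0,1
31,59,2,1
78,65,1,2
"""

COLUMNS = ['age', 'year_of_op', 'pos_axillary_nodes', 'survival_status']


def summarize(df):
    """Collect basic facts to compare against the dataset description."""
    return {
        'n_rows': len(df),
        'n_columns': df.shape[1],
        'n_missing': int(df.isna().sum().sum()),
        # Number of patients per survival class (1 = survived, 2 = died).
        'class_counts': df['survival_status'].value_counts().sort_index().to_dict(),
    }


sample = pd.read_csv(io.StringIO(SAMPLE_CSV), names=COLUMNS)
summary = summarize(sample)
print(summary)
```

Calling `summarize(df)` on the full DataFrame also exposes the class balance, which matters when reading the per-class scores later on.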

In [1]:
columns = ['age', 'year_of_op', 'pos_axillary_nodes', 'survival_status']

In [3]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/haberman.csv', names=columns)

In [4]:
df

     age  year_of_op  pos_axillary_nodes  survival_status
0     30          64                   1                1
1     30          62                   3                1
2     30          65                   0                1
3     31          59                   2                1
4     31          65                   4                1
..   ...         ...                 ...              ...
301   75          62                   1                1
302   76          67                   0                1
303   77          65                   3                1
304   78          65                   1                2


In [5]:
df2 = df.copy()

In [6]:
import random

In [27]:
# Shuffle the rows (unseeded, so the split changes between runs), then cut at 70% for train/test.
train, test = np.split(df2.sample(frac=1), [int(round(len(df2) * 0.7))])

In [28]:
train_labels, test_label = train['survival_status'], test['survival_status']

In [29]:
train.drop(['survival_status'], axis=1, inplace=True)
test.drop(['survival_status'], axis=1, inplace=True)
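
The manual shuffle-and-split above is not seeded and does not preserve the class ratio in each part. One possible alternative, sketched here under assumptions, is scikit-learn's `train_test_split` with `stratify` and `random_state`. The `toy` frame below uses made-up values (only the column names match the notebook) so that the sketch runs without the CSV file; it is not the notebook's method.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df: made-up values, 30 rows of class 1 and 10 of class 2.
toy = pd.DataFrame({
    'age': [30 + i for i in range(40)],
    'year_of_op': [58 + i % 12 for i in range(40)],
    'pos_axillary_nodes': [i % 7 for i in range(40)],
    'survival_status': [1] * 30 + [2] * 10,
})

X = toy.drop(columns='survival_status')
y = toy['survival_status']

# stratify=y keeps the 3:1 class ratio in both parts;
# random_state makes the split repeatable across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

print(len(X_train), len(X_test))
print(y_test.value_counts().sort_index().to_dict())
```

With the real data, `toy` would be replaced by `df`; the 70/30 proportion matches the split used in this notebook.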

In [30]:
print(train)

     age  year_of_op  pos_axillary_nodes
186   55          66                   0
197   57          61                   5
53    42          69                   1
69    43          60                   0
141   51          66                   1
..   ...         ...                 ...
46    41          58                   0
284   69          66                   0
88    45          67                   1
116   49          61                   1
195   56          67                   0

[214 rows x 3 columns]


In [43]:
knn_model = KNeighborsClassifier(n_neighbors=10)
knn_model.fit(train, train_labels)
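
k-nearest neighbours compares samples by distance, so features measured on larger numeric ranges can dominate the result. A common remedy, shown here as a hedged sketch rather than as what the notebook does, is to standardize the features inside a pipeline before the classifier. The names `toy_features`, `toy_labels`, and `scaled_knn` are hypothetical, and the values are made up so the block runs on its own.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the training data: made-up values, same column names.
toy_features = pd.DataFrame({
    'age': [30 + i for i in range(40)],
    'year_of_op': [58 + i % 12 for i in range(40)],
    'pos_axillary_nodes': [i % 7 for i in range(40)],
})
toy_labels = pd.Series([1] * 30 + [2] * 10, name='survival_status')

# The scaler is fitted on the training data only and is reapplied
# automatically whenever the pipeline predicts.
scaled_knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=10),
)
scaled_knn.fit(toy_features, toy_labels)

toy_prediction = scaled_knn.predict(toy_features.head(5))
print(toy_prediction)
```

In the notebook's terms this would be `scaled_knn.fit(train, train_labels)` followed by `scaled_knn.predict(test)`; whether scaling actually improves the scores on this dataset would need to be measured.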

In [44]:
prediction = knn_model.predict(test)

In [45]:
print(classification_report(test_label, prediction))

              precision    recall  f1-score   support

           1       0.77      0.97      0.86        67
           2       0.75      0.24      0.36        25

    accuracy                           0.77        92
   macro avg       0.76      0.61      0.61        92
weighted avg       0.77      0.77      0.73        92
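
To see where the per-class numbers in the report come from, the sketch below recomputes them from a confusion table. The four counts are an inference: they are the integer counts consistent with the rounded report and its supports (67 and 25), not output produced by the notebook. The helper names are hypothetical.

```python
# Counts inferred from the rounded report (an assumption, not notebook output).
# Keys are (true class, predicted class).
confusion = {
    (1, 1): 65, (1, 2): 2,
    (2, 1): 19, (2, 2): 6,
}
classes = (1, 2)


def precision(cls):
    """Share of samples predicted as cls that truly belong to cls."""
    predicted = sum(confusion[(true, cls)] for true in classes)
    return confusion[(cls, cls)] / predicted


def recall(cls):
    """Share of samples truly in cls that were predicted as cls."""
    actual = sum(confusion[(cls, pred)] for pred in classes)
    return confusion[(cls, cls)] / actual


def f1(cls):
    """Harmonic mean of precision and recall for cls."""
    p, r = precision(cls), recall(cls)
    return 2 * p * r / (p + r)


accuracy = sum(confusion[(c, c)] for c in classes) / sum(confusion.values())

for cls in classes:
    print(cls, round(precision(cls), 2), round(recall(cls), 2), round(f1(cls), 2))
print('accuracy', round(accuracy, 2))
```

Under these counts, only 6 of the 25 patients in class 2 (died within 5 years) are identified, which is the 0.24 recall in the report; the overall accuracy of 0.77 is carried mostly by the majority class.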

