### The following cell prepares the data. Put the 'adult.data' file in the same folder as this notebook and run this cell first.

In [1]:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain',
         'capital-loss', 'hours-per-week', 'native-country', 'income']
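# adult.data has no header row, so the column names are passed explicitly
df = pd.read_csv('adult.data', names=names, index_col=False)
# Keep the 7 features used in this exercise plus the target column
df = df[['age', 'workclass', 'sex', 'hours-per-week',
         'education', 'capital-gain', 'capital-loss', 'income']]
# Missing values are encoded as ' ?' (with a leading space); mark them as NaN
df = df.replace(' ?', np.nan)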
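
A possible alternative to the replace step, shown only as a sketch: `read_csv` can strip the space after each comma and treat `?` as missing while loading. The two rows below are made up to mimic the file layout, and `raw` and `toy` are hypothetical names. Note that this also removes the leading space from every string value, including the income labels.

```python
import io
import pandas as pd

# Two made-up rows in the same "comma + space" layout as adult.data.
raw = "39, State-gov, <=50K\n54, ?, >50K\n"

toy = pd.read_csv(
    io.StringIO(raw),
    names=['age', 'workclass', 'income'],
    skipinitialspace=True,  # strip the space that follows each comma
    na_values='?',          # read '?' as NaN directly
)
print(toy)
```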

### The dataset consists of 7 features describing a person and the income class that person belongs to: '>50K' or '<=50K'.

In [2]:
df.head()

| | age | workclass | sex | hours-per-week | education | capital-gain | capital-loss | income |
|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Male | 40 | Bachelors | 2174 | 0 | <=50K |
| 1 | 50 | Self-emp-not-inc | Male | 13 | Bachelors | 0 | 0 | <=50K |
| 2 | 38 | Private | Male | 40 | HS-grad | 0 | 0 | <=50K |
| 3 | 53 | Private | Male | 40 | 11th | 0 | 0 | <=50K |
| 4 | 28 | Private | Female | 40 | Bachelors | 0 | 0 | <=50K |


## 1) Perform k-nearest neighbors algorithm with two k parameters of your choice on a given dataset.

### 1.1) Preprocessing: the dataset contains missing values and categorical variables; you need to handle them before applying an algorithm to the data.

In [2]:
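# One-hot encode the categorical columns. get_dummies ignores NaN by default,
# so a row with a missing value gets 0 in every indicator column of that feature.
cat_columns = ['workclass', 'sex', 'education']
for cat_col in cat_columns:
    df = df.join(pd.get_dummies(df[cat_col]))
# Drop the original text columns once their indicator columns are in place
df = df.drop(labels=cat_columns, axis=1)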
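
The encoding cell above handles missing values only implicitly. As a small illustration of what `pd.get_dummies` does with a `NaN`, here is a sketch on a made-up three-row column (`toy`, `default_dummies`, and `with_na` are hypothetical names, not part of the notebook). It also shows the `dummy_na=True` option, which is one way to keep an explicit "missing" indicator if that is preferred.

```python
import numpy as np
import pandas as pd

# Toy column standing in for 'workclass'; the values are illustrative only.
toy = pd.DataFrame({'workclass': [' Private', np.nan, ' State-gov']})

# Default behaviour: the row with NaN gets 0 in every indicator column.
default_dummies = pd.get_dummies(toy['workclass'], dtype=int)
print(default_dummies)

# dummy_na=True adds one extra indicator column that flags missing values.
with_na = pd.get_dummies(toy['workclass'], dummy_na=True, dtype=int)
print(with_na)
```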

In [3]:
df.head()

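| | age | hours-per-week | capital-gain | capital-loss | income | Federal-gov | Local-gov | Never-worked | Private | Self-emp-inc | ... | 9th | Assoc-acdm | Assoc-voc | Bachelors | Doctorate | HS-grad | Masters | Preschool | Prof-school | Some-college |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 40 | 2174 | 0 | <=50K | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 50 | 13 | 0 | 0 | <=50K | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 38 | 40 | 0 | 0 | <=50K | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 53 | 40 | 0 | 0 | <=50K | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 28 | 40 | 0 | 0 | <=50K | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |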


### 1.2) You need to divide your dataset into two parts: training and test. Training subset should contain 80% of the whole dataset and target classes should be balanced in both subsets.

In [4]:
# Stratified 80/20 split: stratify keeps the income class proportions the same in both subsets
X = df.drop("income", axis=1)
y = df['income']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)
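
To check that a stratified split really preserves the class proportions, the label shares of the two subsets can be compared. The sketch below uses synthetic labels rather than the Adult data, and the names `X_tr`, `X_te`, `y_tr`, and `y_te` are chosen only for this example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (75% / 25%); not the Adult data.
y_demo = pd.Series(['<=50K'] * 75 + ['>50K'] * 25)
X_demo = pd.DataFrame({'feature': range(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)

# With stratify set, both subsets keep the original class proportions.
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

In the notebook itself, the same check would be `y_train.value_counts(normalize=True)` next to `y_test.value_counts(normalize=True)`.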

### 1.3) Apply k-nearest neighbors algorithm with two different k parameters of your choice:

In [5]:
# Importing the classifier from scikit learn
from sklearn.neighbors import KNeighborsClassifier
# Initializing first classifier with number of neighbours = 10
knn10 = KNeighborsClassifier(n_neighbors=10)
# Initializing second classifier with number of neighbours = 3
knn3 = KNeighborsClassifier(n_neighbors=3)

knn10.fit(X_train, y_train)
knn3.fit(X_train, y_train)

# Predicting the class of the test samples with the k-nearest neighbors algorithm, k = 10
result10 = knn10.predict(X_test)
# Predicting the class of the test samples with the k-nearest neighbors algorithm, k = 3
result3 = knn3.predict(X_test)
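
To make the role of k concrete, here is a minimal sketch on a made-up one-dimensional dataset (the points, the query, and the values of k are illustrative assumptions, not taken from the notebook). The classifier predicts the majority label among the k nearest training points, so a larger k can pull a prediction toward the more frequent class.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny 1-D toy set: four points of the frequent class, two of the rare class.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0]])
y_toy = np.array(['<=50K', '<=50K', '<=50K', '<=50K', '>50K', '>50K'])

# A query point that lies close to the two rare-class points.
query = np.array([[9.0]])

predictions = {}
for k in (3, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_toy, y_toy)
    _, idx = model.kneighbors(query)  # indices of the k nearest training points
    predictions[k] = model.predict(query)[0]
    print(f"k={k}: neighbour labels {list(y_toy[idx[0]])} -> {predictions[k]}")
```

With k=3 the two nearby rare-class points win the vote; with k=5 three frequent-class points outvote them, even though they are farther away.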

## 2) Evaluate and compare performance of two models:

### 2.1) Print performance metrics of your models:

## Results for k=3

In [6]:
from sklearn.metrics import classification_report  # Importing the reporting function
# The function takes the true labels and the predicted labels
rep10 = classification_report(y_test, result10)
rep3 = classification_report(y_test, result3)

print(rep3)

             precision    recall  f1-score   support

      <=50K       0.86      0.91      0.89      4945
       >50K       0.66      0.53      0.59      1568

avg / total       0.81      0.82      0.81      6513



## Results for k=10

In [7]:
print(rep10)

             precision    recall  f1-score   support

      <=50K       0.85      0.95      0.90      4945
       >50K       0.75      0.46      0.57      1568

avg / total       0.83      0.83      0.82      6513



### 2.2) In a few sentences argue which k performed better, based on performance metrics from the previous task. 

Based on the performance metrics in the classification reports, both classifiers performed more or less equally: the average f1-score is 0.81 for k=3 and 0.82 for k=10. The k=10 model does have slightly higher average precision (0.83 vs. 0.81) and recall (0.83 vs. 0.82).
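
The averaged row can hide differences between the classes. The sketch below recomputes two kinds of average from the per-class f1-scores and supports printed in the reports, assuming the 'avg / total' row is the support-weighted mean. Because the inputs are already rounded to two decimals, the weighted values may differ from the reports in the last digit. The unweighted (macro) mean is added only for comparison.

```python
# Per-class f1-scores and supports copied from the two reports above.
support = {'<=50K': 4945, '>50K': 1568}
f1 = {
    3: {'<=50K': 0.89, '>50K': 0.59},
    10: {'<=50K': 0.90, '>50K': 0.57},
}

total = sum(support.values())
weighted, macro = {}, {}
for k, scores in f1.items():
    # Support-weighted mean: each class counts in proportion to its size.
    weighted[k] = sum(scores[c] * support[c] for c in support) / total
    # Macro mean: both classes count equally, regardless of size.
    macro[k] = sum(scores.values()) / len(scores)
    print(f"k={k}: weighted f1 ~ {weighted[k]:.3f}, macro f1 ~ {macro[k]:.3f}")
```

The weighted mean is dominated by the larger '<=50K' class, so it slightly favors k=10, while the macro mean, which treats both classes equally, slightly favors k=3.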

### 2.3) Which class was classified better? Why do you think this is the case?

The '<=50K' class has a higher f1-score in both classifiers (0.89 and 0.90, versus 0.59 and 0.57 for '>50K'). Increasing k from 3 to 10 also slightly improves the f1-score for the '<=50K' class and lowers it for the '>50K' class. This can be explained by the class imbalance in the dataset: about 76% of the data points belong to '<=50K', so the neighbors of a query point are more likely to come from that class, and the recall for '>50K' drops (0.53 at k=3, 0.46 at k=10).

In [8]:
resdf = pd.DataFrame()
resdf['correct_labels'] = y_test
resdf['K=3'] = result3
resdf['K=10'] = result10
print("How classes are distributed in test data:")
resdf['correct_labels'].value_counts(normalize=True)

How classes are distributed in test data:


 <=50K    0.759251
 >50K     0.240749
Name: correct_labels, dtype: float64

In [9]:
print("How classes are distributed in results of knn with k=3:")
resdf['K=3'].value_counts(normalize=True)

How classes are distributed in results of knn with k=3:


 <=50K    0.807616
 >50K     0.192384
Name: K=3, dtype: float64

In [10]:
print("How classes are distributed in results of knn with k=10:")
resdf['K=10'].value_counts(normalize = True)

How classes are distributed in results of knn with k=10:


 <=50K    0.851835
 >50K     0.148165
Name: K=10, dtype: float64
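
The shrinking share of predicted '>50K' labels can be tied back to recall with a confusion matrix. The following is a minimal sketch on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical and not the notebook's results). It uses label strings without the leading space that the notebook's values carry.

```python
from sklearn.metrics import confusion_matrix, recall_score

labels = ['<=50K', '>50K']
# Made-up ground truth: 6 samples of the frequent class, 4 of the rare class.
y_true_demo = ['<=50K'] * 6 + ['>50K'] * 4
# Made-up predictions: half of the rare-class samples are assigned to the frequent class.
y_pred_demo = ['<=50K'] * 6 + ['<=50K', '<=50K', '>50K', '>50K']

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)

# Recall for '>50K' = correctly predicted '>50K' / all true '>50K'.
recall_rare = recall_score(y_true_demo, y_pred_demo, pos_label='>50K')
print(recall_rare)
```

In the notebook, the same calls could be applied to `y_test` with `result3` and `result10`, using the label strings exactly as they appear in `y_test`.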