# Classification with K-Nearest Neighbors

### 1. Explain the Dataset and Type of Information to Gain:

The UCI "Adult" dataset consists of 48,842 instances and 14 attributes, including age, workclass, education, marital status, occupation, race, sex, and native country. This notebook loads only the `adult.data` file, which holds 32,561 of those instances (see the `count` row in the summary statistics below). The target variable is income, which is divided into two classes: <=50K and >50K. The goal is to classify individuals into one of these two income categories based on the other features.

##### Objective: 
The main objective is to build a k-nearest neighbors classifier to predict whether an individual earns more than $50,000 based on their demographic and employment information.

### 2. Explain the k-Nearest Neighbors Algorithm:

kNN is a simple, instance-based learning algorithm. For a given data point, kNN looks at the k closest training examples in the feature space and assigns the most common class among those neighbors to the data point. The Euclidean distance metric will be used to determine the "closeness" between data points.
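
To make the voting rule concrete, here is a minimal from-scratch sketch of the procedure described above. It is not the implementation used later in this notebook (that is scikit-learn's `KNeighborsClassifier`); the function name `knn_predict` and the toy points are made up for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=5):
    """Classify one query point by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters (hypothetical values)
X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]]
y_toy = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"]

print(knn_predict(X_toy, y_toy, [2.0, 2.0], k=3))  # query near the first cluster
print(knn_predict(X_toy, y_toy, [8.0, 9.0], k=3))  # query near the second cluster
```

There is no training step beyond storing the data: all of the work happens at prediction time, which is why kNN is called instance-based.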

##### Value of k: 
Here, k = 5, meaning the algorithm considers the 5 nearest neighbors when making a classification. A smaller k makes the model sensitive to noisy or mislabeled points, while a larger k smooths the decision boundary but lets more distant points, including points from the other class, influence the vote. With two classes, an odd k also avoids tied votes.
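
The trade-off can be seen on a tiny one-dimensional example. This is only a sketch on made-up data (one deliberately mislabeled point at 2.1), not a result from the Adult dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data: class 0 around 0-4, class 1 around 10-14,
# plus one mislabeled "noise" point at 2.1 (hypothetical values)
X_toy = np.array([[0], [1], [2], [3], [4], [2.1],
                  [10], [11], [12], [13], [14]], dtype=float)
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

query = np.array([[2.2]])
predictions = {}
for k in (1, 5, 11):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_toy, y_toy)
    predictions[k] = int(model.predict(query)[0])

print(predictions)  # {1: 1, 5: 0, 11: 1}
```

With k = 1 the noise point decides the outcome, with k = 5 it is outvoted by its four class-0 neighbors, and with k = 11 every training point votes, so the slightly larger class wins regardless of where the query lies.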

### 3. Import Necessary Libraries and Perform Initial Exploration:

In [16]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Load and explore the dataset
data = pd.read_csv('adult.data', header=None, names=["age", "workclass", "fnlwgt", "education", 
                                                     "education-num", "marital-status", "occupation", 
                                                     "relationship", "race", "sex", "capital-gain", 
                                                     "capital-loss", "hours-per-week", "native-country", 
                                                     "income"])
data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [17]:
data.describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


### 4. Clean the Data and Address Unusual Phenomena:

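#### - Handling Missing Values and Encoding Categorical Data:
Drop rows with missing values, then convert the categorical variables into numerical format using one-hot encoding. Two details are worth noting. First, `adult.data` marks missing entries with `?` rather than leaving them empty, so `dropna` removes nothing when the file is read with default settings; the test set size implied by the confusion matrix in Section 10 (6,513 rows, 20% of 32,561) is consistent with no rows having been dropped. Second, `drop_first=True` drops one level of each categorical column, so the target becomes the single indicator column `income_ >50K`; the space in that name comes from the leading space in the raw values.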

In [18]:
data.dropna(inplace=True)
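
The sketch below shows, on three made-up rows laid out like `adult.data`, why `dropna` has no effect under default parsing and one way the `?` markers could be exposed as missing values. The `skipinitialspace` and `na_values` arguments are not used anywhere in this notebook; they are shown here only as a possible alternative.

```python
import io

import pandas as pd

# Three hypothetical rows in the same layout as adult.data:
# fields separated by ", " and a missing workclass written as "?"
raw = "39, State-gov, <=50K\n50, ?, >50K\n38, Private, <=50K\n"
cols = ["age", "workclass", "income"]

# Default parsing keeps " ?" as an ordinary string, so nothing is NaN
default_df = pd.read_csv(io.StringIO(raw), header=None, names=cols)
n_missing_default = int(default_df.isna().sum().sum())
n_rows_default = len(default_df.dropna())

# Stripping the leading space and declaring "?" as missing exposes the gap
clean_df = pd.read_csv(io.StringIO(raw), header=None, names=cols,
                       skipinitialspace=True, na_values="?")
n_missing_clean = int(clean_df.isna().sum().sum())
n_rows_clean = len(clean_df.dropna())

print(n_missing_default, n_rows_default)  # 0 3
print(n_missing_clean, n_rows_clean)      # 1 2
```

Reading the real file this way would change the row count and remove the leading spaces from every string value (so the target column would be named `income_>50K`), which means the results reported in this notebook would not be reproduced exactly.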

In [19]:
data = pd.get_dummies(data, drop_first=True)
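
As a small illustration of what `get_dummies` with `drop_first=True` produces, the following sketch encodes a made-up four-row frame whose string values keep the leading space found in `adult.data`.

```python
import pandas as pd

# Hypothetical rows; string values keep the leading space found in adult.data
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "sex": [" Male", " Female", " Male", " Female"],
    "income": [" <=50K", " >50K", " <=50K", " >50K"],
})

# Numeric columns pass through; each categorical column loses its first level
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))  # ['age', 'sex_ Male', 'income_ >50K']
```

The dropped level of each column (here ` Female` and ` <=50K`) becomes the implicit baseline, encoded as all indicators being false.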

### 5. Formulate Two Questions for Classification:

- Can we predict whether an individual earns more than $50K based on their demographic and employment information?

- Which features (e.g., education, occupation) have the most significant impact on predicting whether an individual earns more than $50K?

### 6. Split Data into Training and Testing Sets:

In [20]:
X = data.drop('income_ >50K', axis=1)
y = data['income_ >50K']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

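This splits the dataset into 80% for training and 20% for testing. `stratify=y` keeps the proportion of >50K earners the same in both sets, and `random_state=42` makes the split reproducible.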
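
The effect of `stratify` can be checked on a toy label vector. This is a sketch with made-up labels (25% positives), not the Adult data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 25 positives out of 100 (25%)
y_toy = np.array([1] * 25 + [0] * 75)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(len(y_tr), len(y_te))      # 80 20
print(y_tr.mean(), y_te.mean())  # 0.25 0.25
```

Without `stratify`, the positive rate in a small test set can drift away from the overall rate purely by chance, which makes accuracy figures harder to compare.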

### 7. Train the kNN Classifier:

In [22]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2))
])
pipeline.fit(X_train, y_train)

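This pipeline standardizes each feature to zero mean and unit variance and then trains the kNN classifier. Because the scaler is part of the pipeline, its statistics are learned from the training data only. `metric='minkowski'` with `p=2` is the Euclidean distance described in Section 2.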
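
Scaling matters for kNN because Euclidean distance adds up squared differences across features, so a feature with a large numeric range dominates. The sketch below compares rows 0 and 2 of `data.head()` on three numeric columns, using the standard deviations from `data.describe()` as an approximation of what `StandardScaler` would learn (the scaler is fitted on the training split and uses the population standard deviation, so its values differ slightly).

```python
import numpy as np

# Rows 0 and 2 from data.head(): age, fnlwgt, education-num
a = np.array([39.0, 77516.0, 13.0])
b = np.array([38.0, 215646.0, 9.0])

# Standard deviations from data.describe(); an approximation of the
# values StandardScaler would learn from the training split
std = np.array([13.640433, 105550.0, 2.57272])

raw_sq = (a - b) ** 2
scaled_sq = ((a - b) / std) ** 2  # the means cancel in a difference

# Share of the squared Euclidean distance contributed by each feature
raw_share = raw_sq / raw_sq.sum()
scaled_share = scaled_sq / scaled_sq.sum()

print(raw_share)
print(scaled_share)
```

On the raw values, `fnlwgt` accounts for essentially the entire distance; after standardization, the four-level gap in `education-num` contributes the largest share.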

### 8. Make Classification Predictions:

In [23]:
y_pred = pipeline.predict(X_test)

The model predicts the income category for the test data.

### 9. Interpret the Results:

If the model predicts that a person is likely to earn more than $50K, this insight can help in demographic studies, targeted marketing, or other strategic business decisions.

### 10. Model Validation

#### Confusion Matrix:

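The confusion matrix shows the number of true positives, true negatives, false positives, and false negatives. In scikit-learn's layout, rows are the true classes and columns are the predicted classes, so a binary matrix reads `[[TN, FP], [FN, TP]]`.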

In [24]:
from sklearn.metrics import confusion_matrix, accuracy_score

conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

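#### ROC-AUC Score:

The area under the ROC curve indicates how well the model ranks >50K earners above <=50K earners across all decision thresholds: 0.5 corresponds to random ranking and 1.0 to perfect separation. The cell below computes the score only; the curve itself is not plotted.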

In [25]:
from sklearn.metrics import roc_auc_score, roc_curve

roc_auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])
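
One way to read the AUC is as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below checks this on four made-up scores and compares the result with `roc_auc_score`; the variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and scores for four test points
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
pos = scores[y_true == 1]
neg = scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Points of the ROC curve, should a plot be wanted
fpr, tpr, thresholds = roc_curve(y_true, scores)

print(pairwise_auc)                    # 0.75
print(roc_auc_score(y_true, scores))   # 0.75
```

With k = 5 and uniform weights, `predict_proba` can only return multiples of 0.2 (the fraction of neighbors in the positive class), so many test points share the same score and the ROC curve has only a handful of steps.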

#### k-Fold Cross-Validation:

In [26]:
from sklearn.model_selection import cross_val_score

cross_val_accuracy = cross_val_score(pipeline, X, y, cv=10, scoring='accuracy').mean()
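
For a classifier, an integer `cv` makes `cross_val_score` use stratified folds without shuffling, and the pipeline (scaler included) is refitted on each training fold. The sketch below shows a variant with explicitly shuffled, seeded folds that also reports the spread across folds. It runs on synthetic data from `make_classification`, not on the Adult data, and the names `X_demo`, `y_demo`, and `demo_pipeline` are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (not the Adult data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Shuffled folds with a fixed seed so that the scores are reproducible
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=folds, scoring="accuracy")

print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean gives a sense of how much the estimate depends on which rows land in which fold.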

In [27]:
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Accuracy Score: {accuracy}')
print(f'ROC-AUC Score: {roc_auc}')
print(f'k-Fold Cross-Validation Score: {cross_val_accuracy}')

Confusion Matrix:
[[4456  489]
 [ 647  921]]
Accuracy Score: 0.8255796100107478
ROC-AUC Score: 0.8453981681145663
k-Fold Cross-Validation Score: 0.8216576525027554
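
The reported confusion matrix can be unpacked into per-class metrics. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and that the positive class is >50K, as implied by the target column `income_ >50K`.

```python
import numpy as np

# Confusion matrix reported above; rows = true class, columns = predicted class
conf = np.array([[4456, 489],
                 [647, 921]])
tn, fp, fn, tp = conf.ravel()

accuracy = (tp + tn) / conf.sum()
precision = tp / (tp + fp)         # of predicted >50K, how many are correct
recall = tp / (tp + fn)            # of actual >50K, how many are found
baseline = (tn + fp) / conf.sum()  # accuracy of always predicting <=50K

print(f"accuracy  {accuracy:.4f}")
print(f"precision {precision:.4f}")
print(f"recall    {recall:.4f}")
print(f"baseline  {baseline:.4f}")
```

The accuracy matches the reported 0.8256. Precision for the >50K class comes out at about 0.65 and recall at about 0.59, while always predicting <=50K would already reach roughly 76% accuracy on this test set, so the classifier's gain over that baseline is about 7 percentage points.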
