# K-Nearest Neighbors Classifier
---

### Objectives:

- Use K-Nearest Neighbors (KNN) to classify data

- Apply a KNN classifier to a real-world data set

### Installs:

In [0]:
%%capture
%pip install numpy==2.4.0
%pip install pandas==2.3.3
%pip install scikit-learn==1.8.0
%pip install matplotlib==3.10.8
%pip install seaborn==0.13.0

In [0]:
# Command to restart the kernel and update the installed libraries
%restart_python

### Imports:

In [0]:
# Data Analysis and Visualization
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Preprocessing / Modeling / Metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, ConfusionMatrixDisplay

# Warnings
import warnings
warnings.filterwarnings('ignore')

### Load the data

In [0]:
df = pd.read_csv('./data/teleCust1000t.csv')

### Verify successful load with some randomly selected records


In [0]:
df.sample(9)


### Understand the data

---

#### Scenario

A telecommunications provider has segmented its customer base by service usage patterns into four distinct groups. The main business goal is to leverage demographic data to predict group membership for prospective customers. By accurately classifying new customers, the company can customize offers and marketing strategies for each individual, thereby maximizing conversion rates and customer value.

#### Loading Telecommunications Data

The dataset contains demographic and service usage information for a set of customers. Each row represents a unique customer profile, linking their personal attributes to their subscribed service category.

#### About the Dataset

This is a classification problem where the objective is to build a model (specifically using **K-Nearest Neighbors**) to predict the class of a new or unknown case based on predefined labels.

The dataset structure includes:

* **Demographic Features**: The independent variables used to predict the customer profile. These include `region`, `tenure`, `age`, `marital` status, `address`, `income`, `ed` (education), `employ` (employment years), `retire` (retirement status), `gender`, and `reside` (number of people in household).
* **Target Variable (custcat)**: The dependent variable indicating the customer's service group. It has four possible values:

1. **Basic Service**
2. **E-Service**
3. **Plus Service**
4. **Total Service**
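
To make the idea behind K-Nearest Neighbors concrete, the following is a minimal from-scratch sketch, not the implementation used later in this notebook. It classifies a new point by majority vote among its `k` closest training points, using Euclidean distance. The function name `knn_predict` and the toy data are hypothetical and are not taken from the telecom dataset.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups labeled 1 and 3
X_toy = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1],
])
y_toy = np.array([1, 1, 1, 3, 3, 3])

pred_a = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3)
print(pred_a, pred_b)
```

The scikit-learn `KNeighborsClassifier` used below follows the same principle, with additional options such as distance-weighted votes and alternative metrics.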

### Explore the data
First, consider a statistical summary of the data.

In [0]:
df.describe()

In [0]:
df.info()

### Check distribution of target variable

In [0]:
level_order = [
    1, 
    2, 
    3, 
    4,     
]

# Figure
plt.figure(figsize = (8, 5))

# Count Plot
ax = sns.countplot(
    y = 'custcat',
    data = df,
    order = level_order,
    edgecolor = 'black',
    palette = 'RdYlBu_r',
)

plt.title('Distribution of custcat ', fontsize = 14)
plt.xlabel('Number of Customers', fontsize = 12)
plt.ylabel('Custcat', fontsize = 12)

plt.tight_layout()
plt.show()

In [0]:
df['custcat'].value_counts()

The data set contains 281 customers on Plus Service, 266 on Basic Service, 236 on Total Service, and 217 on E-Service. The classes are therefore roughly balanced, and no special handling of class imbalance is required.
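
As a quick check of the balance claim, the class counts quoted above can be turned into proportions and a largest-to-smallest ratio. This is a small sketch that hard-codes the four counts from the text instead of reading them from `df`; the variable names are illustrative.

```python
import pandas as pd

# Class counts as reported in the text above
counts = pd.Series({
    'Plus Service': 281,
    'Basic Service': 266,
    'Total Service': 236,
    'E-Service': 217,
})

# Share of each class and the ratio between the largest and smallest class
proportions = counts / counts.sum()
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(3))
print(f'Largest-to-smallest class ratio: {imbalance_ratio:.2f}')
```

A ratio this close to 1 is usually taken as a sign that resampling or class weighting is not needed.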

### Checking the correlations between the variables

In [0]:
df.corr()['custcat'].abs().sort_values(ascending = False)

In [0]:
plt.rc('font', size = 10)
fig, ax = plt.subplots(figsize = (10, 8))
sns.heatmap(df.corr(), annot = True, cmap = 'coolwarm', ax = ax)
ax.set_title('Correlation Features Matrix')
plt.tight_layout()
plt.show()

In [0]:
# Collecting data
correlation_values = df.corr()['custcat']

if 'custcat' in  correlation_values.index:
    plot_data = correlation_values.drop('custcat').sort_values()

else:
    plot_data = correlation_values.sort_values()


# Colors 
colors = ['#2196f3' if x > 0 else '#f44336' for x in plot_data]

# Figure
plt.figure(figsize = (10, 6))

# Plot
plot_data.plot(
    kind = 'barh', 
    color = colors, 
    edgecolor = 'black'
)

plt.title('Correlation of Features with CustCat ', fontsize = 10)
plt.xlabel('Correlation Coefficient (Pearson)', fontsize = 10)
plt.axvline(x = 0, color = 'black', linestyle = '--', linewidth = 1)
plt.grid(axis = 'x', linestyle = '--', alpha = 0.7)

sns.despine(left = True, bottom = True) 

plt.tight_layout()
plt.show()

The correlation analysis points to `ed` (education level), `tenure` (months with the company), `income`, and `employ` (employment years) as the features most strongly associated with `custcat`. Education level shows the strongest association, followed by tenure. This suggests that socioeconomic profile is the main driver of the segmentation. Because `custcat` is a nominal label encoded as the integers 1 to 4, Pearson correlation is only a rough screening tool here and should not be read as a measure of predictive power.
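
Pearson correlation only captures linear association with the integer class codes. As an illustration of what it can miss, the sketch below uses synthetic data (not the telecom dataset) in which a feature separates classes 1 and 4 from classes 2 and 3. Mutual information, computed with scikit-learn's `mutual_info_classif`, is used here as one possible alternative; it is not part of the original analysis.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic labels: four balanced classes coded 1..4
y_demo = np.repeat([1, 2, 3, 4], 100)

# Feature is high for classes 1 and 4, low for classes 2 and 3 (plus small noise)
x_demo = np.where(np.isin(y_demo, [1, 4]), 1.0, 0.0) + rng.normal(scale=0.1, size=y_demo.size)

# Pearson correlation with the integer class codes is close to zero
pearson = np.corrcoef(x_demo, y_demo)[0, 1]

# Mutual information still detects the dependence
mi = mutual_info_classif(x_demo.reshape(-1, 1), y_demo, random_state=0)[0]

print(f'Pearson correlation: {pearson:.3f}')
print(f'Mutual information:  {mi:.3f}')
```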

### Extract the input features and labels from the data set
Separate the input features (`X`) from the target labels (`y`). Both are kept as pandas objects so that the preprocessing step can select columns by name.


In [0]:
# Shape Dataset
df.shape

In [0]:
X = df.drop(columns = ['custcat']).copy()
y = df['custcat'].astype('int').copy()

print(f'Shape of X: {X.shape}')
print(f'Shape of y: {y.shape}')


### Preprocess selected features and train/test split

Create train and test datasets

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 33)
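
The split above is purely random. With roughly balanced classes that is usually sufficient, but `train_test_split` also accepts a `stratify` argument that keeps class proportions nearly identical in both subsets. The sketch below demonstrates this on synthetic features with the same class counts as this data set; `X_demo` and `y_demo` are illustrative names, and stratification is an optional variation, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features; labels mirror the class counts of custcat (classes 1..4)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.repeat([1, 2, 3, 4], [266, 217, 281, 236])

# Stratified 80/20 split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=33, stratify=y_demo
)

# Compare class shares in the full data and in the test subset
full_share = np.bincount(y_demo, minlength=5)[1:] / len(y_demo)
test_share = np.bincount(y_te, minlength=5)[1:] / len(y_te)

print(np.round(full_share, 3))
print(np.round(test_share, 3))
```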

In [0]:
df.head(10)

#### Defining the appropriate category for each feature

In [0]:
categorical_ordinals = ['ed']
categorical_nominals = ['region', 'marital', 'gender', 'reside']
numerical_features = ['tenure', 'age', 'address', 'income', 'employ', 'retire']

#### Creating a pipeline for processing the features

In [0]:
# Pipeline for nominal features: one-hot encode, then standardize
categorical_nominals_pipeline = Pipeline(
    steps = [
        ('one_hot_encoder', OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')),
        ('standard_scaler', StandardScaler())
    ]
)

preprocessor = ColumnTransformer(
    transformers = [
        ('categorical_nominals', categorical_nominals_pipeline, categorical_nominals),
        ('categorical_ordinals', StandardScaler(), categorical_ordinals),
        ('numerical_features', StandardScaler(), numerical_features)
    ],
    remainder = 'passthrough'
)

#### Train data Preprocessed

In [0]:
X_train_preprocessed = preprocessor.fit_transform(X_train)
pd.DataFrame(X_train_preprocessed, columns = preprocessor.get_feature_names_out(X_train.columns)).head()

#### Test data Preprocessed

In [0]:
X_test_preprocessed = preprocessor.transform(X_test)
pd.DataFrame(X_test_preprocessed, columns = preprocessor.get_feature_names_out(X_test.columns)).head()

A standardized variable has zero mean and a standard deviation of one.
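
Standardization applies z = (x - mean) / std to each column, with the mean and standard deviation estimated on the training data only and then reused for the test data. The sketch below checks this on synthetic values (not the telecom data); the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic single-feature train and test samples
train = rng.normal(loc=50, scale=10, size=(800, 1))
test = rng.normal(loc=50, scale=10, size=(200, 1))

# Fit on the training data only, then apply the same transform to both sets
scaler = StandardScaler().fit(train)
train_z = scaler.transform(train)
test_z = scaler.transform(test)

# Manual computation with the training mean and (population) standard deviation
manual_z = (train - train.mean(axis=0)) / train.std(axis=0)

print(f'Train mean: {train_z.mean():.3f}, train std: {train_z.std():.3f}')
print(f'Test mean:  {test_z.mean():.3f}, test std:  {test_z.std():.3f}')
```

The training set ends up with exactly zero mean and unit standard deviation, while the test set is only approximately standardized because it is transformed with the training statistics.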

### KNN Classification Model

In [0]:
# Create a model object
KNN = KNeighborsClassifier(n_neighbors = 20, weights = 'distance', metric = 'cosine', algorithm = 'brute')

# Train the model on the preprocessed training data
KNN.fit(X_train_preprocessed, y_train)

In [0]:
# Predict the target variable in the test data
y_pred = KNN.predict(X_test_preprocessed)
y_pred[: 10]

In [0]:
# Collecting the probabilities of the model classifications
y_prob = KNN.predict_proba(X_test_preprocessed)
y_prob[ : 10]

In [0]:
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
print(f'AUC-ROC: {roc_auc_score(y_test, y_prob, multi_class = "ovr"):.2f}')
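
The number of neighbors is fixed by hand in the model above. One common way to choose it is cross-validation over a few candidate values. The sketch below shows the idea on synthetic data generated with `make_classification`; the candidate values of `k` and all variable names are assumptions for illustration, and default KNN settings are used, not the cosine metric from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic four-class problem (stand-in for the real data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=4, random_state=0
)

cv_scores = {}
for k in (1, 5, 10, 20, 40):
    # Scaling sits inside the pipeline so each fold is scaled on its own training part
    model = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k)),
    ])
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy').mean()

# Candidate with the highest mean cross-validated accuracy
best_k = max(cv_scores, key=cv_scores.get)

for k, score in cv_scores.items():
    print(f'k = {k:>2}: mean CV accuracy = {score:.3f}')
print(f'Best k: {best_k}')
```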

### Model Evaluation

In [0]:
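# KNN has no coefficients, so feature relevance is estimated with permutation importance
from sklearn.base import clone
from sklearn.inspection import permutation_importance

# Refit copies of the preprocessor and KNN in one pipeline so raw columns can be shuffled
knn_pipeline = Pipeline(
    steps = [
        ('preprocessor', clone(preprocessor)),
        ('knn', clone(KNN))
    ]
)
knn_pipeline.fit(X_train, y_train)

# Mean drop in test accuracy when each feature is shuffled
result = permutation_importance(
    knn_pipeline, X_test, y_test, scoring = 'accuracy', n_repeats = 10, random_state = 33
)
importances = pd.Series(result.importances_mean, index = X_test.columns)
importances

In [0]:
# Data collect
data_ax = importances.sort_values().reset_index()
data_ax.columns = ['Feature', 'Importance']

# Category used to color the bars: does shuffling the feature reduce test accuracy?
data_ax['Impact'] = data_ax['Importance'].apply(lambda x: 'Reduces accuracy' if x > 0 else 'No measurable effect')

# Figure
plt.figure(figsize = (12, 8))
sns.set_style('whitegrid')

# Barplot
sns.barplot(
    data = data_ax,
    y = 'Feature',
    x = 'Importance',
    edgecolor = 'black',
    hue = 'Impact',
    dodge = False,
    palette = {'Reduces accuracy': 'teal', 'No measurable effect': 'firebrick'}
)

plt.axvline(x = 0, color = 'black', linestyle = '--', linewidth = 1)
plt.title('Permutation Feature Importance', fontsize = 15)
plt.xlabel('Mean decrease in accuracy when shuffled', fontsize = 12)
plt.ylabel('Variables', fontsize = 12)
plt.legend(title = 'Effect of shuffling', loc = 'lower right', fontsize = 12)

plt.tight_layout()
plt.show()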

In [0]:
print(classification_report(y_test, y_pred))

In [0]:
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, 
    display_labels = ['Basic Service', 'E-Service', 'Plus Service', 'Total Service'], 
    cmap = 'Blues', 
    values_format = 'd'
)

plt.grid(False) 
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()

In [0]:
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
print(f'AUC-ROC: {roc_auc_score(y_test, y_prob, multi_class = "ovr"):.2f}')
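
For a multiclass target, a one-vs-rest AUC-ROC with the default macro averaging is the unweighted mean of the binary AUCs obtained by treating each class in turn as the positive class. The sketch below verifies this on synthetic data; the data, the classifier settings, and the variable names are illustrative and do not reproduce the results of this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic four-class problem
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

clf = KNeighborsClassifier(n_neighbors=15).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# One-vs-rest AUC as computed by scikit-learn (macro average by default)
auc_ovr = roc_auc_score(y_te, proba, multi_class='ovr')

# Same quantity by hand: binary AUC per class, then an unweighted mean
per_class = [
    roc_auc_score((y_te == c).astype(int), proba[:, i])
    for i, c in enumerate(clf.classes_)
]
auc_manual = float(np.mean(per_class))

print(f'OvR AUC (sklearn): {auc_ovr:.4f}')
print(f'OvR AUC (manual):  {auc_manual:.4f}')
```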


### Conclusion

---

- The **K-Nearest Neighbors** classifier (`n_neighbors = 20`, distance-weighted votes, cosine metric) was built to predict the four-class `custcat` target from demographic features. Its accuracy and one-vs-rest AUC-ROC are printed in the evaluation cells above, and the classification report and confusion matrix break the results down by service category.

- Because the four classes are roughly balanced, accuracy is a meaningful summary metric. The baselines to beat are roughly 28% accuracy (always predicting the largest class, Plus Service) and an AUC-ROC of 0.5.

- KNN is distance-based, so its results are sensitive to feature scaling, which is why the features are standardized before modeling. It also has no coefficients, so it offers no direct measure of feature influence. The correlation analysis suggests that `ed`, `tenure`, `income`, and `employ` carry most of the signal.

- The number of neighbors, the distance metric, and the weighting scheme were fixed by hand. Tuning them with cross-validation is a natural next step.