# Goals and Overview

The Sure Tomorrow insurance company wants to solve several tasks with the help of machine learning, and you are asked to evaluate whether that is feasible.

- Task 1: Find customers who are similar to a given customer. This will help the company's agents with marketing.
- Task 2: Predict whether a new customer is likely to receive an insurance benefit. Can a prediction model do better than a dummy model?
- Task 3: Predict the number of insurance benefits a new customer is likely to receive using a linear regression model.
- Task 4: Protect clients' personal data without breaking the model from the previous task. It's necessary to develop a data transformation algorithm that would make it hard to recover personal information if the data fell into the wrong hands. This is called data masking, or data obfuscation. But the data should be protected in such a way that the quality of machine learning models doesn't suffer. You don't need to pick the best model, just prove that the algorithm works correctly.

# Project

## Initialization

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns

import sklearn.linear_model
import sklearn.metrics
import sklearn.neighbors
import sklearn.preprocessing

from sklearn.model_selection import train_test_split

from IPython.display import display

import math

## Reading Data

In [None]:
df = pd.read_csv('./datasets/insurance_us.csv')

In [None]:
df = df.rename(columns={'Gender': 'gender', 'Age': 'age', 'Salary': 'income', 'Family members': 'family_members', 'Insurance benefits': 'insurance_benefits'})

In [None]:
df.sample(10)

In [None]:
df.info()

In [None]:
df['age'] = df['age'].astype('int64')

In [None]:
df.info()

In [None]:
df.describe()

The `info()` and `describe()` output shows no obvious problems with the data.

## Data Preparation

## Data Analysis

In [None]:
g = sns.pairplot(df, kind='hist')
g.fig.set_size_inches(12, 12)

It is hard to spot obvious groups (clusters) in the pair plot, because each panel shows only two variables at a time and cannot reveal multivariate structure. That's where linear algebra (LA) and machine learning (ML) can be quite handy.

## Testing Statistical Hypothesis

In [None]:
feature_names = ['gender', 'age', 'income', 'family_members']

In [None]:
def get_knn(df, n, k, metric):
    
    """
    Returns k nearest neighbors

    :param df: pandas DataFrame used to find similar objects within
    :param n: object no for which the nearest neighbours are looked for
    :param k: the number of the nearest neighbours to return
    :param metric: name of distance metric
    """

    nbrs = sklearn.neighbors.NearestNeighbors(n_neighbors=k, metric=metric)
    nbrs.fit(df[feature_names])
    # query with a one-row DataFrame so the feature names match those seen in fit
    nbrs_distances, nbrs_indices = nbrs.kneighbors(df.iloc[[n]][feature_names], k, return_distance=True)
    
    df_res = pd.concat([
        df.iloc[nbrs_indices[0]], 
        pd.DataFrame(nbrs_distances.T, index=nbrs_indices[0], columns=['distance'])
        ], axis=1)
    
    return df_res

In [None]:
feature_names = ['gender', 'age', 'income', 'family_members']

transformer_mas = sklearn.preprocessing.MaxAbsScaler().fit(df[feature_names].to_numpy())

df_scaled = df.copy()
df_scaled.loc[:, feature_names] = transformer_mas.transform(df[feature_names].to_numpy())

In [None]:
df_scaled.sample(5)

In [None]:
result1 = get_knn(df, 0, 10, 'euclidean')
print(result1)

In [None]:
result2 = get_knn(df, 0, 10, 'manhattan')
print(result2)

In [None]:
result3 = get_knn(df_scaled, 0, 10, 'euclidean')
print(result3)

In [None]:
result4 = get_knn(df_scaled, 0, 10, 'manhattan')
print(result4)

In [None]:
# calculate the target

df['insurance_benefits_received'] = (df['insurance_benefits'] > 0).astype(int)

In [None]:
df_scaled['insurance_benefits_received'] = (df_scaled['insurance_benefits'] > 0).astype(int)

In [None]:
# check for the class imbalance with value_counts()

class_imbalance = df['insurance_benefits_received'].value_counts()
print(class_imbalance)

In [None]:
def eval_classifier(y_true, y_pred):
    
    f1_score = sklearn.metrics.f1_score(y_true, y_pred)
    print(f'F1: {f1_score:.2f}')
    
    # if you have an issue with the following line, restart the kernel and run the notebook again
    cm = sklearn.metrics.confusion_matrix(y_true, y_pred, normalize='all')
    print('Confusion Matrix')
    print(cm)

In [None]:
# generating output of a random model

def rnd_model_predict(P, size, seed=42):

    rng = np.random.default_rng(seed=seed)
    return rng.binomial(n=1, p=P, size=size)

In [None]:
for P in [0, df['insurance_benefits_received'].sum() / len(df), 0.5, 1]:
    print(f'The probability: {P:.2f}')
    y_pred_rnd = rnd_model_predict(P, size=len(df))
    eval_classifier(df['insurance_benefits_received'], y_pred_rnd)
    print()

In [None]:
for P in [0, df_scaled['insurance_benefits_received'].sum() / len(df_scaled), 0.5, 1]:
    print(f'The probability: {P:.2f}')
    y_pred_rnd = rnd_model_predict(P, size=len(df_scaled))
    eval_classifier(df_scaled['insurance_benefits_received'], y_pred_rnd)
    print()

In [None]:
features = df.drop(columns=['insurance_benefits_received', 'insurance_benefits'])
target = df['insurance_benefits_received']
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.3, random_state=42)

In [None]:
f1_scores_original = []
f1_scores_scaled = []

# Loop through different values of k for kNN
for k in range(1, 11):
    # kNN classifier for original data
    knn_original = sklearn.neighbors.KNeighborsClassifier(n_neighbors=k)
    knn_original.fit(features_train, target_train)
    
    target_pred_original = knn_original.predict(features_test)
    
    f1_original = sklearn.metrics.f1_score(target_test, target_pred_original)
    f1_scores_original.append(f1_original)

    # kNN classifier for scaled data
    scaler = sklearn.preprocessing.StandardScaler()
    
    features_train_scaled = scaler.fit_transform(features_train)
    features_test_scaled = scaler.transform(features_test)
    
    knn_scaled = sklearn.neighbors.KNeighborsClassifier(n_neighbors=k)
    knn_scaled.fit(features_train_scaled, target_train)
    
    target_pred_scaled = knn_scaled.predict(features_test_scaled)
    
    f1_scaled = sklearn.metrics.f1_score(target_test, target_pred_scaled)
    f1_scores_scaled.append(f1_scaled)
    
print('F1 scores for original DF:')
print(f1_scores_original)

print()

print('F1 scores for scaled DF:')
print(f1_scores_scaled)

Scaling the features substantially improved the F1 score for every value of k from 1 to 10. On the original data the F1 scores ranged from 0.0245 to 0.6523, which is poor. After scaling they ranged from 0.9128 to 0.9393, so the classifier is both much more accurate and far less sensitive to the choice of k. The likely reason is that kNN relies on distances, and without scaling the income column, whose values are much larger than those of the other features, dominates them.
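
To illustrate why scaling matters for a distance-based model, here is a minimal sketch with three made-up customers (the values are hypothetical, not taken from the dataset). It uses a hand-written max-abs scaling for brevity, whereas the loop above uses `StandardScaler`; the effect on the neighbor ranking is the point, not the exact scaler.

```python
import numpy as np

# Hypothetical customers: [gender, age, income, family_members]
a = np.array([1, 25, 40000, 1], dtype=float)
b = np.array([1, 60, 40100, 1], dtype=float)  # very different age, almost the same income
c = np.array([1, 26, 41000, 1], dtype=float)  # almost the same age, larger income gap


def euclidean(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


# Distances on raw features: the income gap dominates everything else
d_ab_raw = euclidean(a, b)
d_ac_raw = euclidean(a, c)

# Max-abs scaling: divide each column by its largest absolute value
scale = np.abs(np.vstack([a, b, c])).max(axis=0)
a_s, b_s, c_s = a / scale, b / scale, c / scale

d_ab_scaled = euclidean(a_s, b_s)
d_ac_scaled = euclidean(a_s, c_s)

print(f'raw:    d(a, b) = {d_ab_raw:.1f}, d(a, c) = {d_ac_raw:.1f}')
print(f'scaled: d(a, b) = {d_ab_scaled:.3f}, d(a, c) = {d_ac_scaled:.3f}')
```

On the raw features the 60-year-old `b` counts as the nearest neighbor of the 25-year-old `a`, because a 100-unit income gap outweighs a 35-year age gap. After scaling, `c` becomes the nearest neighbor.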

In [None]:
class MyLinearRegression:
    
    def __init__(self):

        self.weights = None
    
    def fit(self, X, y):
        
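        # prepend a column of ones so the first weight acts as the intercept
        X2 = np.append(np.ones([len(X), 1]), X, axis=1)
        # normal equation: w = (X^T X)^(-1) X^T y
        self.weights = np.linalg.inv(X2.T.dot(X2)).dot(X2.T).dot(y)

    def predict(self, X):
        
        # prepend the same column of ones that fit() uses
        X2 = np.append(np.ones([len(X), 1]), X, axis=1)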
        y_pred = X2.dot(self.weights)
        
        return y_pred

In [None]:
def eval_regressor(y_true, y_pred):
    
    rmse = math.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    print(f'RMSE: {rmse:.2f}')
    
    # report R2 as-is: taking its square root distorts the score and fails for negative values
    r2 = sklearn.metrics.r2_score(y_true, y_pred)
    print(f'R2: {r2:.2f}')

In [None]:
X = df[['age', 'gender', 'income', 'family_members']]
y = df['insurance_benefits']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12345)

lr = MyLinearRegression()

lr.fit(X_train, y_train)
print(lr.weights)
print()

y_test_pred = lr.predict(X_test)
eval_regressor(y_test, y_test_pred)

In [None]:
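# create a fresh scaler instead of reusing the one left over from the kNN loop
scaler = sklearn.preprocessing.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)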

lr.fit(X_train_scaled, y_train)
print(lr.weights)
print()

y_testscaled_pred = lr.predict(X_test_scaled)
eval_regressor(y_test, y_testscaled_pred)

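The RMSE and R2 scores are the same for scaled and unscaled data. This is expected: standardization is an invertible affine transformation of the features, and a linear regression with an intercept absorbs it into its weights, so the predictions do not change.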
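
The invariance can be checked numerically. The sketch below is only an illustration: it uses synthetic data as a stand-in for the customer table, a helper named `fit_predict` that is not part of the notebook, and `np.linalg.lstsq` rather than the explicit inverse used in `MyLinearRegression`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for [gender, age, income, family_members]
n = 200
X = np.column_stack([
    rng.integers(0, 2, n),
    rng.integers(18, 66, n),
    rng.normal(40000, 10000, n),
    rng.integers(0, 6, n),
]).astype(float)
y = 0.03 * X[:, 1] + 0.01 * X[:, 3] + rng.normal(0, 0.3, n)


def fit_predict(X_train, y_train, X_new):
    """Least-squares fit with an intercept, then predict for X_new."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ w


# Standardize each column: subtract the mean, divide by the standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

pred_raw = fit_predict(X, y, X)
pred_std = fit_predict(X_std, y, X_std)

# Largest absolute difference between the two sets of predictions
max_gap = float(np.max(np.abs(pred_raw - pred_std)))
print(f'max prediction gap: {max_gap:.2e}')
```

The weights themselves differ between the two fits, but the predictions agree up to floating-point error, which is why RMSE and R2 do not move.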

In [None]:
personal_info_column_list = ['gender', 'age', 'income', 'family_members']
df_pn = df[personal_info_column_list]

In [None]:
X = df_pn.to_numpy()

In [None]:
rng = np.random.default_rng(seed=42)
P = rng.random(size=(X.shape[1], X.shape[1]))

In [None]:
# Attempt to compute the inverse
try:
    P_inv = np.linalg.inv(P)
    print("Matrix P is invertible.")
except np.linalg.LinAlgError:
    print("Matrix P is not invertible.")

In [None]:
X_obfuscated = np.dot(X, P)
X_obfuscated

In [None]:
P_inv = np.linalg.inv(P)

X_recovered = np.dot(X_obfuscated, P_inv)

print("Recovered Data (X_recovered):")
print(X_recovered[:5])

In [None]:
df_recovered = pd.DataFrame(X_recovered, columns=personal_info_column_list)

In [None]:
X_obfuscated = pd.DataFrame(X_obfuscated, columns=personal_info_column_list)

In [None]:
df_pn.head(5)

In [None]:
X_obfuscated[:5]

In [None]:
df_recovered.head(5)

In [None]:
class LinearRegression:
    def fit(self, train_features, train_target):
        X = np.concatenate((np.ones((train_features.shape[0], 1)), train_features), axis=1)
        y = train_target
        w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
        self.w = w[1:]
        self.w0 = w[0]

    def predict(self, test_features):
        return test_features.dot(self.w) + self.w0
    
model = LinearRegression()
model.fit(features, target)
predictions = model.predict(features)
# r2_score is not imported as a bare name, so call it through sklearn.metrics
print(sklearn.metrics.r2_score(target, predictions))

In [None]:
class ObfuscatingLinearRegression:
    def __init__(self, obfuscate=False, noise_level=0.1):
        self.obfuscate = obfuscate
        self.noise_level = noise_level
        self.model = sklearn.linear_model.LinearRegression()
        self.P = None

    def generate_invertible_matrix(self, size):
        while True:
            P = np.random.rand(size, size)
            try:
                _ = np.linalg.inv(P)
                return P
            except np.linalg.LinAlgError:
                continue

    def fit(self, X, y):
        if self.obfuscate:
            if self.P is None:
                self.P = self.generate_invertible_matrix(X.shape[1])
            X = X @ self.P
        self.model.fit(X, y)

    def predict(self, features):
        if self.obfuscate and self.P is not None:
            features = features @ self.P
        return self.model.predict(features)

    def score(self, X, y):
        y_pred = self.predict(X)
        mse = sklearn.metrics.mean_squared_error(y, y_pred)
        r2 = sklearn.metrics.r2_score(y, y_pred)
        return mse, r2

In [None]:
model_original = ObfuscatingLinearRegression(obfuscate=False)
model_obfuscated = ObfuscatingLinearRegression(obfuscate=True)

model_original.fit(X_train, y_train)
model_obfuscated.fit(X_train, y_train)

predictions_original = model_original.predict(X_test)
predictions_obfuscated = model_obfuscated.predict(X_test)

mse_original, r2_original = model_original.score(X_test, y_test)
mse_obfuscated, r2_obfuscated = model_obfuscated.score(X_test, y_test)

print(f"Original Data - MSE: {mse_original}, R2: {r2_original}")
print(f"Obfuscated Data - MSE: {mse_obfuscated}, R2: {r2_obfuscated}")
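# use the largest absolute gap: a signed mean could hide differences that cancel out
print(f"Max Absolute Difference in Predictions: {np.max(np.abs(predictions_original - predictions_obfuscated))}")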

In [None]:
pd.DataFrame(predictions_original).describe()

Based on these results, it seems like obfuscating the feature matrix 𝑋 with matrix 𝑃 didn't noticeably affect how well the linear regression model performed. The original and obfuscated models had almost identical MSE and R² scores, and their predictions differed by an insignificant amount. This suggests that Linear Regression handles this type of obfuscation well from a computational perspective.
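
The reason can be made explicit. If the weights fitted on X are w, then the weights fitted on XP come out as P⁻¹w, so the prediction XP(P⁻¹w) = Xw is unchanged. The sketch below checks this on synthetic data; the data, the helper `fit`, and the use of `np.linalg.lstsq` are assumptions made for illustration and are not the notebook's own code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the feature matrix (income expressed in thousands)
n, d = 300, 4
X = np.column_stack([
    rng.integers(0, 2, n),
    rng.integers(18, 66, n),
    rng.normal(40, 10, n),
    rng.integers(0, 6, n),
]).astype(float)
y = 0.03 * X[:, 1] + rng.normal(0, 0.3, n)

# Random square matrix used for masking; it must be invertible
P = rng.random(size=(d, d))
assert np.linalg.matrix_rank(P) == d


def fit(features, target):
    """Least-squares fit with an intercept; returns (intercept, weights)."""
    A = np.column_stack([np.ones(len(features)), features])
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    return w[0], w[1:]


w0, w = fit(X, y)
w0_p, w_p = fit(X @ P, y)

pred_plain = X @ w + w0
pred_masked = (X @ P) @ w_p + w0_p

# Predictions agree, and the masked weights equal P^(-1) applied to the plain weights
pred_gap = float(np.max(np.abs(pred_plain - pred_masked)))
weight_gap = float(np.max(np.abs(w_p - np.linalg.inv(P) @ w)))
print(f'prediction gap: {pred_gap:.2e}, weight gap: {weight_gap:.2e}')
```

Both gaps are at the level of floating-point noise. Note that anyone who holds P can undo the masking by multiplying with its inverse, so P has to be kept secret.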

## Conclusion

A model has been made that retains its accuracy after scaling and obfuscation. The MSE suggests low prediction error, while the R² score indicates a moderate goodness of fit.