# Introduction

Sure Tomorrow is an insurance company looking to make predictions based on data it already has. This analysis finds customers similar to a given customer and predicts the benefits that a new customer is likely to receive. It also demonstrates that it is possible to protect customers' personal data (PII) without breaking the machine learning models in use.

Import statements necessary to complete the project.

In [119]:
import pandas as pd
import numpy as np
from IPython.display import display
import math
from scipy import linalg
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.linear_model
import sklearn.metrics
import sklearn.neighbors
import sklearn.preprocessing
from sklearn.model_selection import train_test_split
plt.style.use('ggplot')

## Load Data

This code loads the data and checks its properties and content, as well as updating the column names to be in the correct format.

In [120]:
insurance_data = pd.read_csv("/Users/leahdeyoung/Desktop/GitHub/sure_tomorrow_insurance/insurance_us.csv", encoding = "utf-8")


In [121]:
insurance_data = insurance_data.rename(columns={'Gender': 'gender', 'Age': 'age', 'Salary': 'income', 'Family members': 'family_members', 'Insurance benefits': 'insurance_benefits'})

In [122]:
display(insurance_data.sample(10))
insurance_data.info()

Unnamed: 0,gender,age,income,family_members,insurance_benefits
1235,0,47.0,42500.0,2,1
2210,0,38.0,39300.0,0,0
4628,1,41.0,32500.0,4,0
832,0,25.0,37800.0,3,0
1698,0,33.0,41200.0,1,0
4623,0,41.0,7400.0,0,0
3678,0,21.0,44300.0,2,0
4113,1,33.0,41800.0,0,0
4669,0,46.0,26900.0,1,1
2088,1,21.0,47900.0,2,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   gender              5000 non-null   int64  
 1   age                 5000 non-null   float64
 2   income              5000 non-null   float64
 3   family_members      5000 non-null   int64  
 4   insurance_benefits  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


## Data Preprocessing

This code converts the age column to integer.

In [123]:
insurance_data['age'] = insurance_data['age'].astype('int')


In [124]:
insurance_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   gender              5000 non-null   int64  
 1   age                 5000 non-null   int64  
 2   income              5000 non-null   float64
 3   family_members      5000 non-null   int64  
 4   insurance_benefits  5000 non-null   int64  
dtypes: float64(1), int64(4)
memory usage: 195.4 KB


This code checks the value counts of the different columns to make sure there are no strange values. Everything looks like it should, assuming that 0 family members means only the insured person is part of the family.

In [125]:
print(insurance_data['gender'].value_counts())
print(insurance_data['age'].value_counts())
print(insurance_data['income'].value_counts())
print(insurance_data['family_members'].value_counts())
print(insurance_data['insurance_benefits'].value_counts())

0    2505
1    2495
Name: gender, dtype: int64
19    223
25    214
31    212
26    211
22    209
27    209
32    206
28    204
29    203
30    202
23    202
21    200
20    195
36    193
33    191
24    182
35    179
34    177
37    147
39    141
38    139
41    129
18    117
40    114
42     93
43     77
44     74
45     73
46     60
47     47
49     37
50     27
48     26
52     22
51     21
53     11
55      9
54      7
56      5
59      3
57      2
58      2
60      2
61      1
65      1
62      1
Name: age, dtype: int64
45800.0    29
37100.0    28
41500.0    27
43200.0    27
46800.0    26
           ..
17700.0     1
70600.0     1
18100.0     1
13000.0     1
56800.0     1
Name: income, Length: 524, dtype: int64
1    1814
0    1513
2    1071
3     439
4     124
5      32
6       7
Name: family_members, dtype: int64
0    4436
1     423
2     115
3      18
4       7
5       1
Name: insurance_benefits, dtype: int64


## EDA

This code checks for trends in customer groups by using a pairplot. No obvious trends stand out; with this many variable pairs, the plot is hard to evaluate at a glance.

In [126]:
customer_groups = sns.pairplot(insurance_data, kind='hist')
customer_groups.fig.set_size_inches(12, 12)


# Task 1. Similar Customers

This code scales the data, then returns the k nearest neighbors (the most similar customers) for both the Manhattan and Euclidean distances.

In [127]:
feature_names = ['gender', 'age', 'income', 'family_members']

In [128]:
def get_knn(df, n, k, metric_name):
    
    """
    Returns the k nearest neighbors of one row, together with their distances

    :param df: pandas DataFrame used to find similar objects within
    :param n: positional index of the row for which the nearest neighbours are looked for
    :param k: the number of the nearest neighbours to return
    :param metric_name: name of distance metric
    """

    nbrs = sklearn.neighbors.NearestNeighbors(n_neighbors=k, metric=metric_name)
    nbrs.fit(df[feature_names])
    nbrs_distances, nbrs_indices = nbrs.kneighbors([df.iloc[n][feature_names]], k, return_distance=True)
    
    df_res = pd.concat([
        df.iloc[nbrs_indices[0]], 
        pd.DataFrame(nbrs_distances.T, index=nbrs_indices[0], columns=['distance'])
        ], axis=1)
    
    return df_res

Scaling the data.

In [129]:
feature_names = ['gender', 'age', 'income', 'family_members']

transformer_mas = sklearn.preprocessing.MaxAbsScaler().fit(insurance_data[feature_names].to_numpy())

insurance_data_scaled = insurance_data.copy()
insurance_data_scaled.loc[:, feature_names] = transformer_mas.transform(insurance_data[feature_names].to_numpy())

In [130]:
insurance_data_scaled.sample(5)

Unnamed: 0,gender,age,income,family_members,insurance_benefits
931,0.0,0.615385,0.653165,0.0,0
1694,1.0,0.507692,0.688608,0.0,0
2293,0.0,0.461538,0.536709,0.0,0
2304,0.0,0.6,0.651899,0.166667,0
1232,1.0,0.292308,0.436709,0.333333,0
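
As a minimal sketch of what the scaling step does, the snippet below checks that `MaxAbsScaler` is equivalent to dividing each column by its largest absolute value. The `toy` rows are made-up values in the same column order as `feature_names`, not rows taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Made-up rows: gender, age, income, family_members
toy = np.array([
    [1, 25, 36600.0, 1],
    [0, 41, 79000.0, 6],
    [1, 65, 20000.0, 0],
])

# Scale with sklearn, then reproduce the result by hand
scaled = MaxAbsScaler().fit_transform(toy)
manual = toy / np.abs(toy).max(axis=0)

print(np.allclose(scaled, manual))
print(scaled.round(3))
```

Because every column is divided by its own maximum, all features end up in the range 0 to 1 for non-negative data like this.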


Now, let's get similar records for a given customer for every combination of data (unscaled, scaled) and metric (Euclidean, Manhattan).

In [131]:
get_knn(insurance_data, 10, 5, 'euclidean')



Unnamed: 0,gender,age,income,family_members,insurance_benefits,distance
10,1,25,36600.0,1,0,0.0
4039,1,25,36600.0,2,0,1.0
3247,1,26,36600.0,2,0,1.414214
2037,1,26,36600.0,0,0,1.414214
1508,0,26,36600.0,0,0,1.732051


In [132]:
get_knn(insurance_data, 10, 5, 'cityblock')



Unnamed: 0,gender,age,income,family_members,insurance_benefits,distance
10,1,25,36600.0,1,0,0.0
4039,1,25,36600.0,2,0,1.0
3247,1,26,36600.0,2,0,2.0
2037,1,26,36600.0,0,0,2.0
1508,0,26,36600.0,0,0,3.0


In [133]:
get_knn(insurance_data_scaled, 10, 5, 'euclidean')



Unnamed: 0,gender,age,income,family_members,insurance_benefits,distance
10,1.0,0.384615,0.463291,0.166667,0,0.0
4377,1.0,0.384615,0.473418,0.166667,0,0.010127
1389,1.0,0.369231,0.464557,0.166667,0,0.015437
760,1.0,0.369231,0.462025,0.166667,0,0.015437
2254,1.0,0.4,0.455696,0.166667,0,0.017157


In [134]:
get_knn(insurance_data_scaled, 10, 5, 'cityblock')



Unnamed: 0,gender,age,income,family_members,insurance_benefits,distance
10,1.0,0.384615,0.463291,0.166667,0,0.0
4377,1.0,0.384615,0.473418,0.166667,0,0.010127
1389,1.0,0.369231,0.464557,0.166667,0,0.01665
760,1.0,0.369231,0.462025,0.166667,0,0.01665
2254,1.0,0.4,0.455696,0.166667,0,0.02298


Without scaling, income dominates the distance because its values are orders of magnitude larger than those of the other features: every unscaled neighbor has exactly the same income as the query customer (36600), while age, gender, and family size vary. After scaling, all features contribute on a comparable scale, the distances are much smaller, and a different set of neighbors is returned. The choice of metric matters less here: within each dataset, the Euclidean and Manhattan metrics return the same five neighbors in the same order, and only the distance values differ (a Manhattan distance is never smaller than the corresponding Euclidean distance).
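
The effect of scaling on the two metrics can be sketched by hand. The three customers below are made-up values, and `col_max` is an assumed set of per-column maxima used only for illustration.

```python
import numpy as np

# Made-up customers: gender, age, income, family_members
a = np.array([1, 25, 36600.0, 1])
b = np.array([1, 60, 36700.0, 1])  # 35 years older, income differs by 100
c = np.array([1, 25, 37000.0, 1])  # same age, income differs by 400

def euclidean(u, v):
    return float(np.sqrt(np.sum((u - v) ** 2)))

def manhattan(u, v):
    return float(np.sum(np.abs(u - v)))

# Unscaled: the income gap decides which customer looks closer to a
print(euclidean(a, b), euclidean(a, c))
print(manhattan(a, b), manhattan(a, c))

# Scaled by assumed per-column maxima: the age gap now decides
col_max = np.array([1, 65, 79000.0, 6])
print(euclidean(a / col_max, b / col_max), euclidean(a / col_max, c / col_max))
```

Before scaling, `b` looks closer to `a` than `c` does even though it is 35 years older; after scaling, the order flips.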

# Task 2. Is Customer Likely to Receive Insurance Benefit?

This code builds a kNN classification model and compares it to a dummy model to see whether it achieves a better F1 score.

This code creates the target column by assigning "1" to any value in insurance_benefits that is greater than 0 and "0" to any value that is equal to 0.

In [135]:
print(insurance_data['insurance_benefits'].value_counts())
insurance_data['insurance_benefits_received'] = (insurance_data['insurance_benefits'] > 0).astype('int64')
print(insurance_data['insurance_benefits_received'].sample(10))
print(insurance_data['insurance_benefits_received'].value_counts())



0    4436
1     423
2     115
3      18
4       7
5       1
Name: insurance_benefits, dtype: int64
3970    0
1145    0
2931    0
3111    0
2980    0
4455    0
1250    0
483     0
1286    0
2678    0
Name: insurance_benefits_received, dtype: int64
0    4436
1     564
Name: insurance_benefits_received, dtype: int64


This code splits the data into training (70%) and test (30%) and assigns the targets and features. 

In [136]:
insurance_data_train, insurance_data_test = train_test_split(insurance_data, test_size=0.30, train_size = 0.70, random_state=54321, shuffle=True)

unscaled_features_train = insurance_data_train.drop(['insurance_benefits_received', 'insurance_benefits'], axis=1)
unscaled_scaled_target_train = insurance_data_train['insurance_benefits_received']
unscaled_features_test = insurance_data_test.drop(['insurance_benefits_received', 'insurance_benefits'], axis=1)
unscaled_scaled_target_test = insurance_data_test['insurance_benefits_received']

scale_features_transformer = sklearn.preprocessing.MaxAbsScaler().fit(unscaled_features_train[feature_names].to_numpy())

scaled_features_train = unscaled_features_train.copy()
scaled_features_train.loc[:, feature_names] = scale_features_transformer.transform(unscaled_features_train[feature_names].to_numpy())

scaled_features_test = unscaled_features_test.copy()
scaled_features_test.loc[:, feature_names] = scale_features_transformer.transform(unscaled_features_test[feature_names].to_numpy())



This code looks for class imbalance in the target sets using the value_counts() method. The classes are imbalanced: only about 11% of customers received benefits in both the training set (402 of 3500) and the test set (162 of 1500). Scaling changes only the features, so the targets are the same for the scaled and unscaled sets.

In [137]:
#print(unscaled_features_train.value_counts())
print(unscaled_scaled_target_train.value_counts())
#print(unscaled_features_test.value_counts())
print(unscaled_scaled_target_test.value_counts())

0    3098
1     402
Name: insurance_benefits_received, dtype: int64
0    1338
1     162
Name: insurance_benefits_received, dtype: int64


This code creates the kNN classifier and evaluates the F1 score and confusion matrix for its predictions.

In [138]:
def eval_classifier(features_train, target_train, features_test, target_test, k):
    
    kNN = sklearn.neighbors.KNeighborsClassifier(n_neighbors=(k), weights='uniform', algorithm='auto', leaf_size=30, metric_params=None, n_jobs=None)
    kNN.fit(features_train, target_train)

    predictions = kNN.predict(features_test)
    f1_score_predict = sklearn.metrics.f1_score(target_test, predictions)
    print(f'K: {k}')
    print(f'F1 for KNN Predictions: {f1_score_predict:.2f}')
    cm_predict = sklearn.metrics.confusion_matrix(target_test, predictions, normalize='all')
    print('Confusion Matrix for KNN Predictions')
    print(cm_predict)

    return 

This code creates a random dummy model and evaluates its F1 score and confusion matrix.

In [139]:
def rnd_model_predict(y_true, P, size, seed=42):

    rng = np.random.default_rng(seed=seed)
    rnd_pred = rng.binomial(n=1, p=P, size=size)
    f1_score = sklearn.metrics.f1_score(y_true, rnd_pred)
    print(f'F1: {f1_score:.2f}')

    # if you have an issue with the following line, restart the kernel and run the notebook again
    cm = sklearn.metrics.confusion_matrix(y_true, rnd_pred, normalize='all')
    print('Confusion Matrix')
    print(cm)
    return rnd_pred



In [140]:
rng = np.random.default_rng(seed=42)
rng.binomial(n=1, p=0.5, size=len(insurance_data))

array([1, 0, 1, ..., 0, 0, 1])

In [141]:
for P in [0, insurance_data['insurance_benefits_received'].sum() / len(insurance_data), 0.5, 1]:

    print(f'The probability: {P:.2f}')
    y_pred_rnd = rnd_model_predict(insurance_data['insurance_benefits_received'], P, len(insurance_data), seed=42)


The probability: 0.00
F1: 0.00
Confusion Matrix
[[0.8872 0.    ]
 [0.1128 0.    ]]
The probability: 0.11
F1: 0.12
Confusion Matrix
[[0.7914 0.0958]
 [0.0994 0.0134]]
The probability: 0.50
F1: 0.20
Confusion Matrix
[[0.456  0.4312]
 [0.053  0.0598]]
The probability: 1.00
F1: 0.20
Confusion Matrix
[[0.     0.8872]
 [0.     0.1128]]
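
The F1 scores above can be recomputed from the normalized confusion matrices, which sklearn lays out as `[[TN, FP], [FN, TP]]`. This is a small sketch using the identity F1 = 2·TP / (2·TP + FP + FN); the helper name `f1_from_confusion` is made up for this example.

```python
def f1_from_confusion(cm):
    """F1 for the positive class from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    denom = 2 * tp + fp + fn
    # With no true positives, false positives, or false negatives, report 0
    return 0.0 if denom == 0 else 2 * tp / denom

# Matrices copied from the output above
cm_p_zero = [[0.8872, 0.0], [0.1128, 0.0]]
cm_p_half = [[0.456, 0.4312], [0.053, 0.0598]]
cm_p_one = [[0.0, 0.8872], [0.0, 0.1128]]

for name, cm in [('P=0', cm_p_zero), ('P=0.5', cm_p_half), ('P=1', cm_p_one)]:
    print(name, round(f1_from_confusion(cm), 2))
```

Predicting 1 for everyone gives perfect recall but a precision equal to the positive class share (about 0.11), which is why the F1 score of the dummy model tops out near 0.20.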


Unscaled:

In [142]:
for n in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:
    eval_classifier(unscaled_features_train, unscaled_scaled_target_train, unscaled_features_test, unscaled_scaled_target_test, n)

K: 1
F1 for KNN Predictions: 0.59
Confusion Matrix for KNN Predictions
[[0.86933333 0.02266667]
 [0.05266667 0.05533333]]
K: 2
F1 for KNN Predictions: 0.41
Confusion Matrix for KNN Predictions
[[0.88733333 0.00466667]
 [0.07933333 0.02866667]]
K: 3
F1 for KNN Predictions: 0.43
Confusion Matrix for KNN Predictions
[[0.87933333 0.01266667]
 [0.07533333 0.03266667]]
K: 4
F1 for KNN Predictions: 0.17
Confusion Matrix for KNN Predictions
[[0.89  0.002]
 [0.098 0.01 ]]
K: 5
F1 for KNN Predictions: 0.18
Confusion Matrix for KNN Predictions
[[0.88866667 0.00333333]
 [0.09666667 0.01133333]]
K: 6
F1 for KNN Predictions: 0.05
Confusion Matrix for KNN Predictions
[[0.892      0.        ]
 [0.10533333 0.00266667]]
K: 7
F1 for KNN Predictions: 0.05
Confusion Matrix for KNN Predictions
[[0.89066667 0.00133333]
 [0.10533333 0.00266667]]
K: 8
F1 for KNN Predictions: 0.01
Confusion Matrix for KNN Predictions
[[8.92000000e-01 0.00000000e+00]
 [1.07333333e-01 6.66666667e-04]]
K: 9
F1 for KNN Predictions:

Scaled:

In [143]:
for n in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:
    eval_classifier(scaled_features_train, unscaled_scaled_target_train, scaled_features_test, unscaled_scaled_target_test, n)
    

K: 1
F1 for KNN Predictions: 0.95
Confusion Matrix for KNN Predictions
[[0.88933333 0.00266667]
 [0.00866667 0.09933333]]
K: 2
F1 for KNN Predictions: 0.90
Confusion Matrix for KNN Predictions
[[8.91333333e-01 6.66666667e-04]
 [1.86666667e-02 8.93333333e-02]]


K: 3
F1 for KNN Predictions: 0.92
Confusion Matrix for KNN Predictions
[[0.888 0.004]
 [0.012 0.096]]
K: 4
F1 for KNN Predictions: 0.91
Confusion Matrix for KNN Predictions
[[0.88933333 0.00266667]
 [0.016      0.092     ]]
K: 5
F1 for KNN Predictions: 0.92
Confusion Matrix for KNN Predictions
[[0.888 0.004]
 [0.012 0.096]]
K: 6
F1 for KNN Predictions: 0.89
Confusion Matrix for KNN Predictions
[[0.88933333 0.00266667]
 [0.02       0.088     ]]
K: 7
F1 for KNN Predictions: 0.91
Confusion Matrix for KNN Predictions
[[0.888      0.004     ]
 [0.01533333 0.09266667]]
K: 8
F1 for KNN Predictions: 0.89
Confusion Matrix for KNN Predictions
[[0.89  0.002]
 [0.02  0.088]]
K: 9
F1 for KNN Predictions: 0.91
Confusion Matrix for KNN Predictions
[[0.88866667 0.00333333]
 [0.01533333 0.09266667]]
K: 10
F1 for KNN Predictions: 0.90
Confusion Matrix for KNN Predictions
[[0.89066667 0.00133333]
 [0.01866667 0.08933333]]


Scaling improves the F1 score substantially at every K shown. Without scaling, the F1 score falls from 0.59 at K=1 to 0.01 at K=8; with scaling, it stays between 0.89 and 0.95, with the best score at K=1.
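
One way to keep the scaler and the classifier together is an sklearn pipeline, so the scaler is always fit on the training split only. The sketch below runs on synthetic data, not the insurance dataset: the `age` and `income` columns and the age-based target are assumptions made purely for illustration, and `knn_f1` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic data: the target depends on age only (an assumption for the demo)
rng = np.random.default_rng(0)
n = 2000
age = rng.integers(18, 66, size=n)
income = rng.normal(40000, 10000, size=n).round(-2)
X_demo = np.column_stack([age, income])
y_demo = (age > 45).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

def knn_f1(k, scale):
    """Fit kNN with or without MaxAbsScaler and return the test F1 score."""
    steps = ([MaxAbsScaler()] if scale else []) + [KNeighborsClassifier(n_neighbors=k)]
    model = make_pipeline(*steps)
    model.fit(X_tr, y_tr)
    return f1_score(y_te, model.predict(X_te))

results = {(k, scale): knn_f1(k, scale) for k in (1, 5, 10) for scale in (False, True)}
for (k, scale), score in results.items():
    print(f'K: {k}, scaled: {scale}, F1: {score:.2f}')
```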

# Task 3. Regression (with Linear Regression)

This code builds a Linear Regression model with RMSE as an evaluation metric and insurance_benefits as the target for both the scaled and unscaled data.

In [144]:
class MyLinearRegression:
    
    def __init__(self):
        
        self.weights = None
    
    def fit(self, X, y):
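        # adding the unities (a column of ones for the intercept)
        X2 = np.append(np.ones([len(X), 1]), X, axis=1)
        # the normal equation below uses X rather than X2, so no intercept is fit
        # and weights[0] is the coefficient of the first feature column
        weights = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
        self.weights = weights[1:]
        self.weights0 = weights[0]
        
    def predict(self, X):
        
        # adding the unities
        X2 = np.append(np.ones([len(X), 1]), X, axis=1)
        # X2 is not used here either: X is multiplied by the full weight vector from fit
        y_pred = X.dot(np.append([self.weights0], self.weights))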
        
        return y_pred
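
`MyLinearRegression` builds the ones-augmented matrix `X2` but solves the normal equation with `X`, so the fitted model has no intercept. A variant that does use `X2` might look like the sketch below. The class name `LinearRegressionWithIntercept` and the demo data are hypothetical, and the outputs printed elsewhere in this notebook come from `MyLinearRegression`, not from this variant.

```python
import numpy as np

class LinearRegressionWithIntercept:
    """Hypothetical variant of MyLinearRegression that fits an intercept."""

    def fit(self, X, y):
        # Prepend a column of ones so the first weight acts as the intercept
        X2 = np.append(np.ones([len(X), 1]), X, axis=1)
        weights = np.linalg.inv(X2.T.dot(X2)).dot(X2.T).dot(y)
        self.weights0 = weights[0]
        self.weights = weights[1:]
        return self

    def predict(self, X):
        return X.dot(self.weights) + self.weights0

# Noise-free synthetic data with a known intercept and known weights
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 + X_demo @ np.array([1.0, -3.0, 0.5])

model = LinearRegressionWithIntercept().fit(X_demo, y_demo)
print(model.weights0, model.weights)
```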

In [145]:
def eval_regressor(y_true, y_pred):
    
    rmse = math.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    print(f'RMSE: {rmse:.2f}')
    
    # the value printed as R2 below is the square root of sklearn's r2_score
    r2_score = math.sqrt(sklearn.metrics.r2_score(y_true, y_pred))
    print(f'R2: {r2_score:.2f}')    
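
Since `eval_regressor` takes the square root of `r2_score`, the printed "R2" is larger than the coefficient of determination itself, and `math.sqrt` would raise a `ValueError` for a model with negative R2. A sketch of an evaluator that reports the plain value is shown below; `eval_regressor_plain` and the demo arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def eval_regressor_plain(y_true, y_pred):
    """Return RMSE and the unmodified R2 score."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    r2 = float(r2_score(y_true, y_pred))
    return rmse, r2

y_true_demo = np.array([0, 0, 1, 2, 0])
y_pred_demo = np.array([0.1, -0.1, 0.8, 1.5, 0.2])

rmse, r2 = eval_regressor_plain(y_true_demo, y_pred_demo)
print(f'RMSE: {rmse:.2f}')
print(f'R2: {r2:.2f}')

# A poor constant prediction gives a negative R2, which has no real square root
_, r2_bad = eval_regressor_plain(y_true_demo, np.full(5, 5.0))
print(f'R2 of a poor model: {r2_bad:.2f}')
```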

Evaluating RMSE with the unscaled data:

In [146]:
X = insurance_data[['age', 'gender', 'income', 'family_members']].to_numpy()
y = insurance_data['insurance_benefits'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12345)

lr = MyLinearRegression()

lr.fit(X_train, y_train)
print(lr.weights)

y_test_pred = lr.predict(X_test)
eval_regressor(y_test, y_test_pred)

[-4.11580814e-02 -1.19007179e-05 -4.53191847e-02]
RMSE: 0.38
R2: 0.56


Evaluating RMSE with the scaled data:

In [147]:
X = insurance_data_scaled[['age', 'gender', 'income', 'family_members']].to_numpy()
y = insurance_data_scaled['insurance_benefits'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12345)

lr = MyLinearRegression()

lr.fit(X_train, y_train)
print(lr.weights)

y_test_pred = lr.predict(X_test)
eval_regressor(y_test, y_test_pred)

[-0.04115808 -0.94015671 -0.27191511]
RMSE: 0.38
R2: 0.56


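Scaling leaves the RMSE and R2 values unchanged; only the weights differ, because each weight is multiplied by the scaling factor of its feature (for example, the income weight grows from about -1.19e-05 to about -0.94).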
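
This invariance can be checked numerically. The sketch below uses synthetic data with deliberately mismatched column scales and `np.linalg.lstsq` in place of the notebook's own class; both choices are assumptions for the demo.

```python
import numpy as np

# Synthetic features with very different magnitudes per column
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 1e4, 2.0])
y_demo = X_demo @ np.array([0.5, 0.03, 1e-5, -0.1]) + rng.normal(scale=0.1, size=300)

def ols_predict(X, y):
    """Least-squares weights and the fitted values they produce."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w, X @ w

# Same scaling rule as MaxAbsScaler: divide by the per-column max absolute value
scale = np.abs(X_demo).max(axis=0)
w_raw, pred_raw = ols_predict(X_demo, y_demo)
w_scaled, pred_scaled = ols_predict(X_demo / scale, y_demo)

print(np.allclose(pred_raw, pred_scaled))    # predictions agree
print(np.allclose(w_scaled, w_raw * scale))  # weights absorb the scale factors
```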

# Task 4. Obfuscating Data

The code below creates a random number matrix to obfuscate the data, then reverses the process to see if the data can be displayed accurately again.

In [148]:
personal_info_column_list = ['gender', 'age', 'income', 'family_members']
df_pn = insurance_data[personal_info_column_list]

In [149]:
X = df_pn.to_numpy()

Generating a random matrix $P$.

In [150]:
rng = np.random.default_rng(seed=42)
P = rng.random(size=(X.shape[1], X.shape[1]))

Checking that the matrix $P$ is invertible.

In [151]:
try:
    i = linalg.inv(P)
    print('The matrix is invertible')
except linalg.LinAlgError:
    # raised by scipy when the matrix is singular
    print('The matrix is not invertible')

The matrix is invertible
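
A try/except around the inversion only catches matrices that the routine detects as singular. As a complementary check, the rank and condition number can be inspected directly. This is a sketch with NumPy; `P_demo` is generated here with the same seed and shape as `P` above so that the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
P_demo = rng.random(size=(4, 4))

# Full rank means invertible; a very large condition number means the
# inverse is numerically unreliable even if it can be computed
rank = np.linalg.matrix_rank(P_demo)
cond = np.linalg.cond(P_demo)
print(f'rank: {rank}, condition number: {cond:.1f}')

# A singular matrix for contrast: the second row is twice the first
singular = np.array([[1.0, 2.0], [2.0, 4.0]])
print(np.linalg.matrix_rank(singular))
```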


Can you guess the customers' ages or income after the transformation?

In [152]:
ob_data = X @ P
ob_data_frame = pd.DataFrame(ob_data, columns=personal_info_column_list)
display (ob_data_frame.sample(10))

Unnamed: 0,gender,age,income,family_members
1589,2951.32758,10394.284994,8556.296459,21344.031601
382,5872.412582,20664.06,17011.252551,42475.059411
3759,5640.296045,19851.12805,16341.752956,40805.17176
1349,5116.903483,18004.678861,14821.704203,37005.015023
2580,3718.749157,13085.277036,10771.535744,26894.718633
821,5319.619878,18715.254121,15406.828773,38479.839819
1319,5704.561753,20070.906163,16523.443728,41264.535237
1967,6868.868436,24161.174341,19890.758033,49691.110753
2233,1451.000794,5116.14178,4211.427044,10494.365478
491,4436.409768,15608.030028,12849.181003,32085.858764


In [153]:
ob_data_revert = ob_data @ i

matrix_check = P @ i

matrix_check_2 = i @ P

print(np.round(matrix_check))

print(np.round(matrix_check_2))

ob_data_revert_frame = pd.DataFrame(ob_data_revert, columns=personal_info_column_list)
ob_data_revert_frame['gender'] = ob_data_revert_frame['gender'].astype('int')
ob_data_revert_frame['age'] = ob_data_revert_frame['age'].astype('int')
ob_data_revert_frame['family_members'] = ob_data_revert_frame['family_members'].astype('int')
display (ob_data_revert_frame.sample(10))


[[ 1.  0. -0. -0.]
 [-0.  1. -0.  0.]
 [-0. -0.  1.  0.]
 [ 0. -0. -0.  1.]]
[[ 1. -0.  0.  0.]
 [ 0.  1. -0.  0.]
 [-0.  0.  1. -0.]
 [ 0. -0. -0.  1.]]


Unnamed: 0,gender,age,income,family_members
4934,1,25,27100.0,3
4199,0,54,23500.0,1
4022,1,22,42700.0,1
2373,1,22,41500.0,0
99,0,36,42100.0,2
2914,0,40,52700.0,2
3526,0,18,38600.0,1
80,0,40,32700.0,0
3348,0,25,46800.0,1
557,0,32,23100.0,2


In [154]:
sample = insurance_data.sample(10)
display(sample)
sample_pn = sample[personal_info_column_list]
Y = sample_pn.to_numpy()

ob_sample = Y @ P
ob_sample_frame = pd.DataFrame(ob_sample, columns=personal_info_column_list)
display (ob_sample_frame)

ob_sample_revert = ob_sample @ i

ob_sample_revert_frame = pd.DataFrame(ob_sample_revert, columns=personal_info_column_list)
ob_sample_revert_frame['gender'] = ob_sample_revert_frame['gender'].astype('int')
ob_sample_revert_frame['age'] = ob_sample_revert_frame['age'].astype('int')
ob_sample_revert_frame['family_members'] = ob_sample_revert_frame['family_members'].astype('int')
display (ob_sample_revert_frame)


Unnamed: 0,gender,age,income,family_members,insurance_benefits,insurance_benefits_received
752,0,41,32100.0,0,0,0
2010,1,38,35900.0,1,0,0
4100,0,33,32400.0,0,0,0
1373,0,40,52000.0,0,0,0
16,1,26,48900.0,2,0,0
2635,0,23,32900.0,0,0,0
2584,1,20,33600.0,0,0,0
2770,0,38,34100.0,1,0,0
1274,0,36,52900.0,3,0,0
3804,0,34,43000.0,0,0,0


Unnamed: 0,gender,age,income,family_members
0,4116.30888,14497.389123,11933.823306,29781.384779
1,4604.275973,16207.19046,13341.874391,33301.65815
2,4153.989551,14624.699925,12038.973595,30053.125761
3,6665.675993,23459.093665,19311.942848,48223.221992
4,6269.266935,22051.322946,18153.558444,45340.397472
5,4217.104594,14840.136671,12216.76121,30508.647612
6,4307.275561,15152.918839,12474.895006,31155.722279
7,4372.897479,15396.056893,12673.579349,31632.783802
8,6782.533148,23863.006804,19643.946754,49054.847941
9,5512.088235,19399.766489,15970.193792,39877.620707


Unnamed: 0,gender,age,income,family_members
0,0,40,32100.0,0
1,1,37,35900.0,0
2,0,32,32400.0,0
3,0,40,52000.0,0
4,0,25,48900.0,2
5,0,22,32900.0,0
6,1,20,33600.0,0
7,0,37,34100.0,1
8,0,35,52900.0,3
9,0,33,43000.0,0


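When checking the matrices by multiplying $P$ by its inverse, the entries that should be exactly zero were only close to zero until rounded, which is why some of them print as -0. This is a limitation of floating-point arithmetic: the computed inverse is an approximation. The same effect most likely explains the mismatches in the recovered sample: astype('int') truncates rather than rounds, so a recovered value just below a whole number loses a full unit. For example, customer 752 has age 41 but is recovered as age 40.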
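
Rounding to the nearest integer before casting avoids the truncation problem. The sketch below uses two rows with values taken from the sample above and regenerates a matrix with the same seed as `P`; the names `P_demo` and `X_demo` are introduced only for this example.

```python
import numpy as np

# Truncation versus rounding on a value that is just below a whole number
almost_41 = np.array([40.999999999])
print(almost_41.astype(int), np.rint(almost_41).astype(int))

rng = np.random.default_rng(seed=42)
P_demo = rng.random(size=(4, 4))
X_demo = np.array([[1, 38, 35900.0, 1],
                   [0, 41, 32100.0, 0]])

# Obfuscate, then recover with the inverse matrix
recovered = X_demo @ P_demo @ np.linalg.inv(P_demo)

# Round first, then cast, so tiny floating-point errors do not cost a full unit
rounded = np.rint(recovered).astype(int)
print(np.array_equal(rounded, X_demo.astype(int)))
```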

## Test Linear Regression With Data Obfuscation

The following code follows the same procedures as above for data obfuscation and linear regression to test if the obfuscated data produces the same RMSE as the unobfuscated data.

In [155]:
rng2 = np.random.default_rng(seed=55)
P2 = rng2.random(size=(X.shape[1], X.shape[1]))

try:
    i2 = linalg.inv(P2)
    print('The matrix is invertible')
except linalg.LinAlgError:
    # raised by scipy when the matrix is singular
    print('The matrix is not invertible')

The matrix is invertible


In [156]:
ob_data_new = X @ P2
ob_data_frame_new = pd.DataFrame(ob_data_new, columns=personal_info_column_list)
ob_data_frame_new['insurance_benefits'] = insurance_data['insurance_benefits']
display (ob_data_frame_new.head(10))


Unnamed: 0,gender,age,income,family_members,insurance_benefits
0,3383.226743,24390.415516,16813.448193,29895.109743,0
1,2598.548994,18696.606475,12891.396449,22917.37912,1
2,1437.780366,10334.843009,7126.668035,12667.720242,0
3,2837.078227,20495.168201,14125.925081,25121.765713,0
4,1783.824788,12839.546035,8851.984594,15736.779406,0
5,2801.301133,20168.843099,13904.761713,24721.489876,1
6,2711.209301,19527.406426,13462.03436,23935.209572,0
7,2629.831664,18977.336563,13080.10661,23261.784264,0
8,3387.541555,24435.758924,16843.766294,29950.566003,0
9,3521.141137,25414.936465,17517.73463,31150.611444,0


In [157]:
X = ob_data_frame_new[['age', 'gender', 'income', 'family_members']].to_numpy()
y = ob_data_frame_new['insurance_benefits'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12345)

lr = MyLinearRegression()

lr.fit(X_train, y_train)
print(lr.weights)

y_test_pred = lr.predict(X_test)
eval_regressor(y_test, y_test_pred)

[-0.01464017  0.16214101 -0.03722125]
RMSE: 0.38
R2: 0.56


The RMSE and R2 are identical to those obtained from the unobfuscated data.
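
This result can also be derived analytically. As a sketch, assume $X^T X$ and $P$ are invertible, and let $w = (X^T X)^{-1} X^T y$ be the weights from the normal equation on the original data. Fitting on the obfuscated data $XP$ gives weights $w_P$:

$$
w_P = \left[(XP)^T (XP)\right]^{-1} (XP)^T y = \left[P^T X^T X P\right]^{-1} P^T X^T y = P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y = P^{-1} w
$$

The predictions on the obfuscated data are therefore

$$
\hat{y}_P = X P w_P = X P P^{-1} w = X w = \hat{y}
$$

Since the predictions are the same, any metric computed from them, including RMSE and R2, is the same as well. Reordering the columns, as the cell above does, is itself multiplication by an invertible permutation matrix, so the argument still applies.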

# Conclusions

Data obfuscation is a helpful way to hide PII from those who do not need to see it, especially in cases like insurance, where sensitive information can be involved. While we only tested it with Linear Regression, the obfuscated data produced the same RMSE and R2 as the original data, which suggests the approach may carry over to other linear models.

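Scaling has a large impact on KNN results: with scaled features the F1 score stayed between 0.89 and 0.95, while with unscaled features it never exceeded 0.59 and fell toward zero as K increased. This is something to keep in mind when working with distance-based models on numerical data going forward. The KNN model also did better than the random model, whose best F1 score was 0.20!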

Just because EDA is not terribly effective does not mean we cannot draw conclusions and make decisions based on the data.