<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Customer-personal-data-protection" data-toc-modified-id="Customer-personal-data-protection-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Project Description</a></span></li><li><span><a href="#Data-Preparation" data-toc-modified-id="Data-Preparation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Preparation</a></span></li><li><span><a href="#Transformation-algorithm-proposal" data-toc-modified-id="Transformation-algorithm-proposal-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Transformation algorithm proposal</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

# Customer personal data protection

Goal: protect customers' personal data.

Develop a data transformation method that makes it difficult to recover personal information from the transformed data. Verify that the method works correctly.

Encode customers' personal information in such a way that the performance of machine learning models does not degrade.

## Data Preparation

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

In [2]:
data = pd.read_csv('/datasets/insurance.csv')

In [3]:
data.head()

| | Gender | Age | Income | Family members | Insurance claim |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Gender           5000 non-null   int64  
 1   Age              5000 non-null   float64
 2   Income           5000 non-null   float64
 3   Family members   5000 non-null   int64  
 4   Insurance claim  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [5]:
data.describe()

| | Gender | Age | Income | Family members | Insurance claim |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.499 | 30.9528 | 39916.36 | 1.1942 | 0.148 |
| std | 0.500049 | 8.440807 | 9900.083569 | 1.091387 | 0.463183 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


In [6]:
data.duplicated().sum()

153

In [7]:
duplicate = data[data.duplicated()]

In [8]:
data = data.drop_duplicates()

In [9]:
data.shape

(4847, 5)

###### Summary 

The dataset was loaded and reviewed. No missing or negative values were found. 153 duplicate rows were deleted.

Separate the target from the features, then check linear regression performance on the original data.
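The checks behind this summary can be sketched as follows. This is a minimal illustration on a small made-up frame: `sample` and its values are assumptions, not the insurance dataset, and only standard pandas calls are used.

```python
import pandas as pd

# Hypothetical toy frame standing in for the insurance data.
sample = pd.DataFrame({
    "Gender": [1, 0, 0, 1, 0],
    "Age": [41.0, 46.0, 29.0, 41.0, 21.0],
    "Income": [49600.0, 38000.0, 21000.0, 49600.0, 41700.0],
    "Family members": [1, 1, 0, 1, 2],
    "Insurance claim": [0, 1, 0, 0, 0],
})

missing_per_column = sample.isna().sum()       # NaN count per column
negative_per_column = (sample < 0).sum()       # negative-value count per column
n_duplicates = int(sample.duplicated().sum())  # fully repeated rows

# Drop the repeated rows; resetting the index is optional.
deduplicated = sample.drop_duplicates().reset_index(drop=True)
```

In the toy frame, rows 0 and 3 are identical, so exactly one row is dropped.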

In [15]:
features = data.drop('Insurance claim', axis=1)

In [16]:
target = data['Insurance claim'] 

In [17]:
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345)

In [18]:
model = LinearRegression()

In [19]:
model.fit(features_train, target_train)

LinearRegression()

In [20]:
predictions = model.predict(features_valid)

In [21]:
r2_score(target_valid, predictions)

0.42307727492147573

In [22]:
coef_original = model.coef_

In [23]:
features_train.shape

(3635, 4)

In [24]:
matrixSize = 4
B = np.random.rand(matrixSize,matrixSize)

Matrix invertibility check: compute the inverse matrix C. If B is not invertible, `np.linalg.inv` raises a `LinAlgError`, which means a new matrix B has to be generated.

In [25]:
C = np.linalg.inv(B)
C

array([[ -0.57418168,  13.84993229, -18.56260765,   0.42623925],
       [  0.66022883,  -6.46483446,  10.5950053 ,  -0.74067942],
       [  8.94742362, -56.29310327,  73.94468647,  -4.70724359],
       [ -7.51666877,  47.15193914, -62.31256761,   5.29091294]])
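Because `np.linalg.inv` raises `numpy.linalg.LinAlgError` for a singular matrix, the check can be wrapped in a retry loop. The sketch below is one possible way to do that; the helper name `random_invertible_matrix`, the seed, and the condition-number threshold are illustrative assumptions and not part of the original notebook.

```python
import numpy as np


def random_invertible_matrix(size, rng, max_cond=1e8, max_tries=100):
    """Draw random square matrices until one is safely invertible (hypothetical helper)."""
    for _ in range(max_tries):
        candidate = rng.random((size, size))
        try:
            inverse = np.linalg.inv(candidate)  # raises LinAlgError if singular
        except np.linalg.LinAlgError:
            continue
        # Also reject badly conditioned matrices, whose inverse is numerically unreliable.
        if np.linalg.cond(candidate) < max_cond:
            return candidate, inverse
    raise RuntimeError("no well-conditioned matrix found")


rng = np.random.default_rng(12345)  # fixed seed so the key matrix is reproducible
B_demo, C_demo = random_invertible_matrix(4, rng)
```

Fixing the seed is a design choice: it lets the same matrix be regenerated later to decode the data, at the cost of the seed itself becoming a secret.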

Multiply the features by the invertible matrix B

In [26]:
features_multip_by_inv_matrix = features_train.values @ B

In [27]:
features_multip_by_inv_matrix.shape

(3635, 4)

In [28]:
model.fit(features_multip_by_inv_matrix, target_train)

LinearRegression()

In [29]:
predictions_multip_by_inv_matrix = model.predict(features_valid.values @ B)

In [30]:
type(predictions_multip_by_inv_matrix)

numpy.ndarray

In [31]:
r2_score(target_valid, predictions_multip_by_inv_matrix)

0.42307727492155545

Check that the predictions before the multiplication match the predictions after it

In [32]:
difference = predictions / predictions_multip_by_inv_matrix

In [33]:
difference = np.round(difference)

In [34]:
difference

array([1., 1., 1., ..., 1., 1., 1.])

In [35]:
set(difference)

{1.0}
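A rounded ratio equal to 1 only shows that each pair of predictions agrees to within roughly 50%, so a stricter comparison may be useful. The sketch below uses synthetic data (all names and values are assumptions) and `np.allclose`, which compares arrays element-wise within a tolerance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                   # synthetic features
y = X @ np.array([0.5, -1.0, 2.0, 0.1]) + rng.normal(scale=0.1, size=200)
B = rng.random((4, 4)) + 4 * np.eye(4)          # diagonally dominant, hence invertible

pred_plain = LinearRegression().fit(X, y).predict(X)
pred_masked = LinearRegression().fit(X @ B, y).predict(X @ B)

# Element-wise comparison within a small tolerance instead of a rounded ratio.
predictions_match = np.allclose(pred_plain, pred_masked, atol=1e-8)
max_abs_gap = np.max(np.abs(pred_plain - pred_masked))
```

Comparing the difference rather than the ratio also avoids division problems when a prediction is close to zero.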

In [36]:
coef__multip_by_inv_matrix = model.coef_

Multiplying the features by an invertible matrix did not change r2_score in any meaningful way: the two values differ by less than 1e-13.
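The equality of predictions can also be seen analytically. The following is a sketch for ordinary least squares without an intercept term, assuming $X^T X$ is invertible; $X$ is the feature matrix, $y$ the target, $w$ the weights and $a$ the predictions (notation introduced here for illustration only):

$$
w = (X^T X)^{-1} X^T y, \qquad a = X w
$$

$$
w' = \big((XB)^T XB\big)^{-1} (XB)^T y = B^{-1} (X^T X)^{-1} (B^T)^{-1} B^T X^T y = B^{-1} w
$$

$$
a' = X B w' = X B B^{-1} w = X w = a
$$

With an intercept, the same argument applies to the mean-centered features, so the intercept and the predictions stay the same.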

In [37]:
coef_original

array([ 1.45766002e-02,  3.64782926e-02,  1.79477716e-07, -1.23345013e-02])

In [38]:
coef__multip_by_inv_matrix

array([ 0.49159149, -0.21706442, -1.8649785 ,  1.5451828 ])

Multiplying the features by an invertible matrix changed the model coefficients, and the change is not a simple proportional rescaling. The performance of the linear regression model did not degrade.
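The two coefficient vectors are still related: for ordinary least squares, fitting on `X @ B` is expected to give coefficients equal to the inverse of `B` applied to the original ones, with an unchanged intercept. The sketch below checks this numerically on synthetic data; every name and value in it is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))                   # synthetic features
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.05, size=300)
B = rng.random((4, 4)) + 4 * np.eye(4)          # diagonally dominant, hence invertible

plain = LinearRegression().fit(X, y)
masked = LinearRegression().fit(X @ B, y)

# Expected relation for ordinary least squares: w_masked = B^{-1} @ w_plain.
expected_masked_coef = np.linalg.solve(B, plain.coef_)
coef_relation_holds = np.allclose(masked.coef_, expected_masked_coef)
intercept_unchanged = np.isclose(masked.intercept_, plain.intercept_)
```

`np.linalg.solve` is used instead of an explicit inverse because it is usually the more numerically stable way to apply $B^{-1}$ to a vector.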

## Transformation algorithm proposal

**Algorithm**

Use feature scaling or a per-feature linear transformation with a random vector

Method #1: Scaling

In [39]:
scaler = StandardScaler()

In [40]:
features_scaled = scaler.fit_transform(features)

In [41]:
features_scaled

array([[ 1.0030995 ,  1.1754362 ,  0.97315092, -0.18517565],
       [-0.99691008,  1.76456423, -0.19011493, -0.18517565],
       [-0.99691008, -0.23847105, -1.89490109, -1.09546611],
       ...,
       [-0.99691008, -1.29890149, -0.60126924,  0.7251148 ],
       [ 1.0030995 , -1.06325028, -0.72160708,  1.63540526],
       [ 1.0030995 , -0.35629665,  0.07061707, -0.18517565]])

In [42]:
features_train_scaled, features_valid_scaled, target_train, target_valid = train_test_split(
    features_scaled, target, test_size=0.25, random_state=12345)

In [43]:
model.fit(features_train_scaled, target_train)

LinearRegression()

In [44]:
predictions = model.predict(features_valid_scaled)

In [45]:
r2_score(target_valid, predictions)

0.4230772749214825

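Method #1
Feature standardization follows the formula:

$$
z = \frac{x - \mu}{\sigma}
$$

where $\mu$ is the mean of the feature and $\sigma$ is its standard deviation. This is an invertible linear (affine) transformation of each feature, so the model quality doesn't degrade.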
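To make the formula concrete, the sketch below compares `StandardScaler` with a manual z-score and shows that the mapping can be undone when the fitted scaler is available. The feature values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature values (age, income); not taken from the dataset.
X = np.array([[41.0, 49600.0],
              [46.0, 38000.0],
              [29.0, 21000.0],
              [21.0, 41700.0]])

scaler = StandardScaler().fit(X)
z_sklearn = scaler.transform(X)

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0).
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The mapping is invertible as long as the fitted mean_ and scale_ are kept.
X_recovered = scaler.inverse_transform(z_sklearn)
```

Here the fitted `mean_` and `scale_` attributes play the role of the decoding key.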

Method #2: Element-wise multiplication by a random vector

In [46]:
features.shape[1]

4

In [47]:
# low=1 excludes zero: a zero multiplier would erase a feature column
vector_random = np.random.randint(1, 1000, size=features.shape[1])

In [48]:
vector_random

array([182,   4, 225, 912])

Multiply each feature column by the corresponding element of the random vector. A constant offset of 12 is then added to every value to further obscure the original data.

In [49]:
features_transformed = features * vector_random + 12 

In [50]:
features_transformed

| | Gender | Age | Income | Family members |
|---|---|---|---|---|
| 0 | 194 | 176.0 | 11160012.0 | 924 |
| 1 | 12 | 196.0 | 8550012.0 | 924 |
| 2 | 12 | 128.0 | 4725012.0 | 12 |
| 3 | 12 | 96.0 | 9382512.0 | 1836 |
| 4 | 194 | 124.0 | 5872512.0 | 12 |
| ... | ... | ... | ... | ... |
| 4995 | 12 | 124.0 | 8032512.0 | 1836 |
| 4996 | 12 | 148.0 | 11790012.0 | 924 |
| 4997 | 12 | 92.0 | 7627512.0 | 1836 |
| 4998 | 194 | 100.0 | 7357512.0 | 2748 |


In [51]:
features_train_transformed, features_valid_transformed, target_train, target_valid = train_test_split(
    features_transformed, target, test_size=0.25, random_state=12345)

In [52]:
model.fit(features_train_transformed, target_train)

LinearRegression()

In [53]:
predictions = model.predict(features_valid_transformed)

In [54]:
r2_score(target_valid, predictions)

0.42307727492111646
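The Method #2 transformation is affine in each column, so it can be reversed by anyone who knows the multipliers and the offset. The sketch below illustrates encoding and decoding on a made-up frame; `features_demo`, `key`, `offset` and the seed are assumptions introduced for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features; the notebook itself uses the insurance dataset.
features_demo = pd.DataFrame({
    "Gender": [1, 0, 0],
    "Age": [41.0, 46.0, 29.0],
    "Income": [49600.0, 38000.0, 21000.0],
    "Family members": [1, 1, 0],
})

rng = np.random.default_rng(42)
# low=1 keeps every multiplier non-zero, so no column is erased and decoding stays possible.
key = rng.integers(1, 1000, size=features_demo.shape[1])
offset = 12

encoded = features_demo * key + offset   # broadcasts one multiplier per column
decoded = (encoded - offset) / key       # inverse of the affine encoding
```

Each column keeps its own shape up to scale and shift, which is why a linear model fits the encoded data equally well.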

## Conclusion

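Comparing the R² scores shows that model quality doesn't degrade after any of the transformations: the score stays at about 0.4231 for the original features, for multiplication by an invertible matrix, for standard scaling and for the random-vector transformation.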