# Protection of Customer Personal Data

You need to protect the data of the insurance company's customers. Develop a data transformation method that makes it difficult to recover personal information from the transformed data. Justify that the method works correctly.

The data must be protected in such a way that the quality of machine learning models does not deteriorate after the transformation. Selecting the best model is not required.

## Exploratory Data Analysis

In [3]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

In [4]:
data=pd.read_csv('/datasets/insurance.csv')

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [6]:
display(data.head())

| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


Let's convert the age and salary to the int data type.

In [8]:
data['Возраст'] = data['Возраст'].astype(int)
data['Зарплата'] = data['Зарплата'].astype(int)
data.dtypes

Пол                  int64
Возраст              int64
Зарплата             int64
Члены семьи          int64
Страховые выплаты    int64
dtype: object

Let's check for the presence of missing data.

In [11]:

data.isna().mean()

Пол                  0.0
Возраст              0.0
Зарплата             0.0
Члены семьи          0.0
Страховые выплаты    0.0
dtype: float64

### Conclusion

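The customer data consists of four features, Gender (`Пол`), Age (`Возраст`), Salary (`Зарплата`) and Family Members (`Члены семьи`), and one target, Insurance Claims (`Страховые выплаты`). There are 5000 rows and no missing values.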


To assess the degree of correlation between a pair of quantitative features, we will compute the Pearson correlation coefficient, which measures the linear relationship between them.

In [21]:
cor_matrix=data.corr()
print(cor_matrix)

                        Пол   Возраст  Зарплата  Члены семьи  \
Пол                1.000000  0.002074  0.014910    -0.008991   
Возраст            0.002074  1.000000 -0.019093    -0.006692   
Зарплата           0.014910 -0.019093  1.000000    -0.030296   
Члены семьи       -0.008991 -0.006692 -0.030296     1.000000   
Страховые выплаты  0.010140  0.651030 -0.014963    -0.036290   

                   Страховые выплаты  
Пол                         0.010140  
Возраст                     0.651030  
Зарплата                   -0.014963  
Члены семьи                -0.036290  
Страховые выплаты           1.000000  


If the absolute value of the coefficient exceeds a chosen threshold, the two features are considered strongly correlated. The threshold depends on the task and is commonly set somewhere between 0.6 and 1.0.
Here, insurance claims correlate noticeably with age (r ≈ 0.65), while all other pairwise correlations are close to zero (|r| < 0.04).
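As a brief illustration of what the coefficient measures, the sketch below computes Pearson's r by hand and compares it with `np.corrcoef`. The `pearson` helper and the toy values are hypothetical and are not taken from the insurance dataset.

```python
import numpy as np

def pearson(x, y):
    """Pearson's r: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centered x
    yc = y - y.mean()  # centered y
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Hypothetical toy values, chosen only for illustration
age = [21, 28, 35, 46, 60]
claims = [0, 0, 0, 1, 2]

r_manual = pearson(age, claims)
r_numpy = float(np.corrcoef(age, claims)[0, 1])
print(r_manual, r_numpy)
```

Both numbers should agree up to rounding, since `np.corrcoef` implements the same definition.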

## Matrix Multiplication

Notation:

$X$ — feature matrix (the zero column consists of ones)

$y$ — target feature vector

$P$ — matrix by which features are multiplied

$w$ — linear regression weights vector (the zero element corresponds to the intercept)

The formula for finding the parameters looks as follows:
$$
w=(X^TX)^{-1}X^Ty
$$
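
A minimal sketch of this formula in NumPy, assuming synthetic data (the sizes, the seed and `true_w` are made up for illustration). It solves the linear system $X^TXw = X^Ty$ rather than forming the inverse explicitly, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# Synthetic data (assumption): 100 objects, 3 features
features = rng.normal(size=(100, 3))
X = np.hstack([np.ones((100, 1)), features])  # zero column of ones for the intercept
true_w = np.array([0.5, 1.0, -2.0, 3.0])      # hypothetical "true" weights
y = X @ true_w + rng.normal(scale=0.1, size=100)

# w = (X^T X)^{-1} X^T y, computed by solving (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)
```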


Let's add multiplication by the invertible matrix $P$ to the formula for finding the parameters:

$$
w1 = ((XP)^T XP)^{-1} (XP)^T y
$$

Let's expand the transposes using $(XP)^T = P^TX^T$:
$$
w1=(P^TX^TXP)^{-1}P^TX^Ty
$$

Let's combine $X^T$ and $X$ into a single set of brackets:
$$
w1= (P^T(X^TX)P)^{-1}P^TX^Ty
$$

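Let's expand the inverse of the product using $(ABC)^{-1}=C^{-1}B^{-1}A^{-1}$. This is valid because $P$, $P^T$ and $X^TX$ are square invertible matrices ($X^TX$ is assumed invertible, as the formula for $w$ already requires):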
$$
w1= P^{-1}(X^TX)^{-1}(P^T)^{-1}P^TX^Ty
$$


Since $P$ is invertible by assumption, $P^T$ is invertible as well, so
$$
(P^T)^{-1}P^T=E
$$
where $E$ is the identity matrix. Then
$$
w1= P^{-1}(X^TX)^{-1}EX^Ty
$$

Multiplying by $E$ changes nothing, so we can drop it:


$$
w1= P^{-1}(X^TX)^{-1}X^Ty
$$

The right-hand side contains the formula for $w$, so
$$
w1= P^{-1}w
$$

Let's substitute the new weights ($w1=P^{-1}w$) and the new features ($X1=XP$) into the prediction formula.

Prediction:

$$
a = Xw
$$
New prediction:

$$
a1 = XPP^{-1}w
$$

Since $P$ is invertible, $PP^{-1}=E$, so
$$
a1=XEw=Xw=a
$$

### Conclusion

We have shown that the predictions $a1$ for features multiplied by an invertible matrix equal the predictions $a$ for the original features. Therefore, the quality of the linear regression does not change.
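
The result can also be checked numerically. The sketch below uses synthetic data and a hypothetical `fit` helper that implements the closed-form formula. It compares the weights and predictions before and after multiplying $X$ by a random matrix $P$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumption): 50 objects, a column of ones plus 3 features
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])
y = rng.normal(size=50)
P = rng.normal(size=(4, 4))  # a Gaussian matrix is invertible with probability 1

def fit(X, y):
    """Closed-form least squares: solve (X^T X) w = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

w = fit(X, y)        # weights on the original features
w1 = fit(X @ P, y)   # weights on the transformed features

a = X @ w            # original predictions
a1 = (X @ P) @ w1    # predictions on the transformed features

weights_match = np.allclose(w1, np.linalg.inv(P) @ w)
predictions_match = np.allclose(a, a1)
print(weights_match, predictions_match)
```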

## Transformation Algorithm

To protect customer data, we multiply the feature matrix by a randomly generated invertible matrix $P$. Each transformed column is a linear mix of all original columns, so the transformed values no longer correspond directly to individual attributes. Anyone who knows $P$ can still recover the original data by multiplying the transformed features by $P^{-1}$. The quality of the linear regression model does not change: we have shown that predictions from features multiplied by an invertible matrix equal the predictions from the original features.


<div class="alert alert-success">
<font size="5"><b>Reviewer's comment</b></font>

Success 👍:


Everything is correct.




</div>


In [7]:
features=data.drop('Страховые выплаты', axis=1)
target=data['Страховые выплаты']

Let's transform the features into a matrix and take a look at how our features appear.

In [8]:

X=np.array(features)
print (X)

[[    1    41 49600     1]
 [    0    46 38000     1]
 [    0    29 21000     0]
 ...
 [    0    20 33900     2]
 [    1    22 32700     3]
 [    1    28 40600     1]]


Let's generate a random square matrix with a size equal to the number of features (excluding the target feature).

In [9]:

P=np.random.normal(size = (features.shape[1],features.shape[1]))
print(P)

[[ 0.85611076 -2.89730574  1.38309474  2.74027142]
 [-0.30072361 -0.09563498 -1.38042554 -2.44621158]
 [ 0.54975212  0.63538208 -0.87752311  0.05166445]
 [ 0.65307087  1.37059703  1.23594953  0.22903134]]


Let's check its invertibility by attempting to compute its inverse matrix:

In [10]:

P_inv=np.linalg.inv(P)
print (P_inv)

[[ 0.4341737   0.56044884  0.6959967   0.63425545]
 [-0.26200062 -0.28226582  0.15179086  0.08570311]
 [ 0.07719282  0.11542316 -0.5797607   0.43999749]
 [-0.08669275 -0.53179324  0.23566944 -0.32961808]]


`np.linalg.inv` raises `LinAlgError` for a singular matrix, so the fact that the inverse was computed without errors indicates that the randomly generated matrix is invertible.
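
A matrix can pass this check and still be nearly singular, which hurts numerical accuracy. One possible safeguard, sketched below, is to redraw the matrix until its condition number is moderate. The helper name `random_invertible_matrix` and the thresholds `max_cond` and `max_tries` are illustrative assumptions, not part of the original project.

```python
import numpy as np

def random_invertible_matrix(n, rng, max_cond=1e6, max_tries=100):
    """Draw Gaussian matrices until one is comfortably well-conditioned.

    max_cond and max_tries are illustrative thresholds (assumptions).
    """
    for _ in range(max_tries):
        candidate = rng.normal(size=(n, n))
        if np.linalg.cond(candidate) < max_cond:
            return candidate
    raise RuntimeError("no well-conditioned matrix found")

rng = np.random.default_rng(42)
P = random_invertible_matrix(4, rng)
P_inv = np.linalg.inv(P)

# P @ P_inv should be the identity matrix up to rounding error
print(np.allclose(P @ P_inv, np.eye(4)))
```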

Let's multiply our features by the obtained invertible matrix:

In [11]:

features_new=X@P
print(features_new)



[[ 27256.88474525  31509.50340853 -43579.12450003   2465.23157015]
 [ 20877.40040587  24141.49041529 -33408.14168367   1850.95256884]
 [ 11536.07356907  13340.25025859 -18468.01758345   1014.01340822]
 ...
 [ 18631.88859193  21540.28099511 -29773.16993216   1702.95883794]
 [ 17973.09378042  20776.10452085 -28720.28401081   1639.03837217]
 [ 22313.02505771  25792.30794627 -35663.47100682   2032.05223039]]


To return to the original features, we multiply the new features by the inverse matrix $P^{-1}$. Note that `P**(-1)` takes the reciprocal of each element rather than inverting the matrix, so we use the `P_inv` computed earlier with `np.linalg.inv`:

In [12]:

features_restored=features_new@P_inv

Since `features_new @ P_inv` equals `X @ P @ P_inv`, the result reproduces the original features `X` up to floating-point rounding error.

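
The round trip can be sketched on the first three rows shown earlier. The snippet illustrates that the matrix inverse `np.linalg.inv(P)` undoes the transformation, whereas the element-wise reciprocal `P ** -1` does not. The seed and the tolerance `atol=1e-6` are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# First three rows of the feature matrix (gender, age, salary, family members)
X = np.array([[1, 41, 49600, 1],
              [0, 46, 38000, 1],
              [0, 29, 21000, 0]], dtype=float)

P = rng.normal(size=(4, 4))               # random masking matrix
X_masked = X @ P                          # transformed features

X_restored = X_masked @ np.linalg.inv(P)  # matrix inverse: undoes the masking
X_wrong = X_masked @ (P ** -1)            # element-wise reciprocal: not an inverse

restored_ok = np.allclose(X_restored, X, atol=1e-6)
wrong_ok = np.allclose(X_wrong, X, atol=1e-6)
print(restored_ok, wrong_ok)
```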


## Algorithm Validation

### Quality of linear regression before feature multiplication by invertible matrix

In [13]:
features_train, features_test, target_train, target_test=train_test_split(features, target, test_size=0.25, random_state=12345)

In [14]:
model=LinearRegression()
model.fit(features_train, target_train)

LinearRegression()

In [15]:
predictions=model.predict(features_test)
print(r2_score(target_test, predictions))

0.4352275684083322


### Quality of linear regression after feature multiplication by invertible matrix

In [16]:
features_train_new, features_test_new, target_train_new, target_test_new=train_test_split(features_new, target, test_size=0.25, random_state=12345)

In [17]:
model=LinearRegression()
model.fit(features_train_new, target_train_new)

LinearRegression()

In [18]:
predictions_new=model.predict(features_test_new)
print(r2_score(target_test_new, predictions_new))

0.4352275684082225


After multiplying the features by the random invertible matrix, the R2 score is unchanged: the two values differ only in the 13th decimal place (0.4352275684083 vs 0.4352275684082), which is floating-point rounding error.
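
The same comparison can be reproduced without the insurance file. The sketch below assumes a synthetic stand-in dataset with four numeric features. The `holdout_r2` helper and the coefficients are hypothetical, while the split parameters mirror the ones used above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12345)

# Synthetic stand-in for the features (assumption: 1000 rows, 4 columns)
X = rng.normal(size=(1000, 4))
y = X @ np.array([0.1, 2.0, -0.5, 0.3]) + rng.normal(size=1000)
P = rng.normal(size=(4, 4))  # random masking matrix

def holdout_r2(features, target):
    """Fit LinearRegression on a 75% split and return R2 on the remaining 25%."""
    f_train, f_test, t_train, t_test = train_test_split(
        features, target, test_size=0.25, random_state=12345)
    model = LinearRegression().fit(f_train, t_train)
    return r2_score(t_test, model.predict(f_test))

r2_original = holdout_r2(X, y)
r2_masked = holdout_r2(X @ P, y)
print(abs(r2_original - r2_masked))
```

The printed difference should be on the order of floating-point rounding error. Here `LinearRegression` fits the intercept separately, so only the feature columns are multiplied by $P$.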

### Conclusion

We addressed the task of protecting the personal data of an insurance company's customers. The dataset contains information about 5000 clients: gender, age, salary, family size, and the number of insurance claims over the last 5 years. The number of insurance claims is the target variable. Our goal was to propose a data protection algorithm that encrypts the data while preserving the quality of machine learning models.

We proposed a data protection algorithm that multiplies the features by an invertible matrix. First, we verified the idea analytically, using the formula for the linear regression parameters:
$$
w=(X^TX)^{-1}X^Ty
$$

By substituting the multiplication by the invertible matrix into the formula, we verified that the predictions using the original features are equal to the predictions using the features multiplied by the invertible matrix.

Next, we tested the hypothesis in practice. First, we generated a random square matrix whose size equals the number of features (excluding the target) using the `np.random.normal` function. We then checked its invertibility by computing its inverse with `np.linalg.inv`. Since no error was raised, the matrix is invertible. Then we multiplied the features by this matrix and trained a linear regression model on both the original and the encrypted data. The R2 score was 0.4352 in both cases, with the two values agreeing to 12 decimal places, so the quality did not change.

In summary, we devised a data encryption algorithm for customer data that does not alter the quality of the linear regression model. Anyone who holds the matrix $P$ can recover the original data by multiplying the transformed features by $P^{-1}$.





