# Protection of personal data of clients

We need to protect the customer data of the insurance company "Though the flood". Develop a data transformation method that makes it difficult to recover personal information from the transformed data, and justify that it works correctly.

The data must be protected in such a way that the quality of machine learning models does not deteriorate after the transformation. There is no need to select the best model.

## Data loading

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

In [2]:
data = pd.read_csv('insurance.csv')

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [4]:
data.head()

| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


**Conclusion:**

- The data consists of 5 columns and 5,000 rows.

- There are no missing values in the data.

In [5]:
# split the data into features and target
X = data.drop("Страховые выплаты", axis=1)
y = data["Страховые выплаты"]
X.shape, y.shape

((5000, 4), (5000,))

In [6]:
# fit a linear regression and return its R2 score on the same data
def get_predict_r2(x, y):
    model = LinearRegression()
    model.fit(x, y)
    return model.score(x, y)

In [7]:
# R2 on the original feature matrix
get_predict_r2(X,y)

0.4249455028666801

In [8]:
# R2 after multiplying the features by an invertible matrix:
# here, the inverse of the product of the transposed feature matrix and the feature matrix
X_inv = X @ np.linalg.inv(np.array(X).T @ np.array(X))
print(get_predict_r2(X_inv,y))

0.42494550286668


**Conclusion:**
    
- The data is loaded and has no anomalies;
- There are duplicates, but we do not delete them;
- The quality of the regression does not change when the feature matrix is multiplied by an invertible matrix.
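The remark about duplicates can be checked with pandas. The snippet below is only a minimal sketch on a toy frame (the column names and values are illustrative assumptions, not the notebook's data); the same calls could be applied to `data`.

```python
import pandas as pd

# Toy frame with hypothetical columns; the third row repeats the first one.
toy = pd.DataFrame({
    "age": [41.0, 46.0, 41.0],
    "salary": [49600.0, 38000.0, 49600.0],
})

# Count rows that are exact copies of an earlier row.
duplicate_count = int(toy.duplicated().sum())

# Dropping them is optional; the notebook keeps duplicates.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(duplicate_count, len(deduplicated))
```

Keeping the duplicates is a reasonable choice here: two different clients can legitimately share the same gender, age, salary, and family size.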

## Matrix multiplication

Notation:

- $X$: feature matrix (the zeroth column consists of ones)

- $y$: target vector

- $P$: the matrix by which the features are multiplied

- $w$: vector of linear regression weights (the zeroth element is the intercept)

Predictions:

$$
a = Xw
$$

Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
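As a quick sanity check of the training formula, the sketch below computes $w$ from the normal equation on synthetic data and compares it with scikit-learn's `LinearRegression`. The data, the seed, and the variable names are illustrative assumptions and do not come from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the real features: 100 rows, 3 features (assumption).
features = rng.normal(size=(100, 3))
target = features @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so that the zeroth weight acts as the intercept.
X = np.hstack([np.ones((features.shape[0], 1)), features])

# Normal equation: w = (X^T X)^{-1} X^T y.
w = np.linalg.inv(X.T @ X) @ X.T @ target

# Reference solution from scikit-learn (intercept first, then coefficients).
model = LinearRegression().fit(features, target)
w_sklearn = np.concatenate([[model.intercept_], model.coef_])

print(np.allclose(w, w_sklearn))
```

For larger or poorly conditioned problems, `np.linalg.lstsq` is usually a safer choice than forming the explicit inverse; the inverse is used here only to mirror the formula.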

**Answer:** The quality of the linear regression will not change when the features are multiplied by an invertible matrix.

**Rationale:**

Let's represent the new feature matrix as the product of the original feature matrix and some invertible matrix:

$$
D=XA
$$

The training formula becomes: $w_1 = (D^T D)^{-1} D^T y$

Substituting $D = XA$ and using $(XA)^T = A^T X^T$:

$$
w_1 = ((XA)^T XA)^{-1} (XA)^T y
$$
$$
w_1 = (A^T (X^T X) A)^{-1} A^T X^T y
$$

Since $A$, $A^T$ and $X^T X$ are square and invertible, the inverse of the product is the product of the inverses in reverse order:

$$
w_1 = A^{-1} (X^T X)^{-1} (A^T)^{-1} A^T X^T y
$$

The product $(A^T)^{-1} A^T$ is the identity matrix $E$:

$$
w_1 = A^{-1} (X^T X)^{-1} E X^T y
$$
$$
w_1 = A^{-1} (X^T X)^{-1} X^T y = A^{-1}w
$$

For the prediction formula $a_1 = Dw_1$, substituting $D = XA$ and $w_1 = A^{-1}w$ gives:
$$a_1 = XAA^{-1}w $$
$$a_1 = Xw = a$$

The predictions do not change, so the quality metric does not change either.
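The derivation can also be checked numerically. The following sketch uses synthetic data and a random Gaussian matrix as $A$ (both are assumptions made for illustration); the helper `fit_weights` is a hypothetical name for the training formula.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic feature matrix with a leading column of ones, and a random target.
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])
y = rng.normal(size=50)

# Random square matrix; verify that it has full rank, i.e. is invertible.
A = rng.normal(size=(4, 4))
assert np.linalg.matrix_rank(A) == A.shape[0]


def fit_weights(features, target):
    """Training formula: w = (X^T X)^{-1} X^T y."""
    return np.linalg.inv(features.T @ features) @ features.T @ target


w = fit_weights(X, y)      # weights on the original features
D = X @ A                  # transformed features
w_1 = fit_weights(D, y)    # weights on the transformed features

# w_1 should equal A^{-1} w, and both models should give the same predictions.
weights_match = np.allclose(w_1, np.linalg.inv(A) @ w)
predictions_match = np.allclose(D @ w_1, X @ w)
print(weights_match, predictions_match)
```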

## Conversion algorithm

**Algorithm**

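- Generate a random square matrix $A$ of size $n \times n$, where $n$ is the number of features;
- Check that $A$ is invertible by computing its determinant, and regenerate it if the determinant is zero;
- Multiply the feature matrix $X$ by $A$;
- Add 499 to every element of the result;
- Multiply by 25;
- Add 10;
- Train linear regression on the original and on the transformed data;
- Compare the resulting R2 metrics.

**Rationale**

The matrix $A$ must have dimension $n \times n$, where $n$ is the number of features used for regression, so that the product $XA$ is defined and has the same shape as $X$. An inverse matrix exists only for square matrices whose determinant is not equal to zero, which is why the determinant of $A$ is checked.
The element-wise steps (adding 499, multiplying by 25, adding 10) only shift and rescale each column by the same constants; because the model fits an intercept, they do not change the predictions either.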
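The invertibility requirement can be sketched as follows. This is a minimal example and not the notebook's own implementation: the function name `make_invertible_matrix`, its parameters, and the use of a rank check in place of the determinant are assumptions made for illustration.

```python
import numpy as np


def make_invertible_matrix(n, rng, low=1, high=55, max_tries=100):
    """Draw random integer n x n matrices until one has full rank."""
    for _ in range(max_tries):
        candidate = rng.integers(low, high, size=(n, n))
        # Full rank is equivalent to a nonzero determinant.
        if np.linalg.matrix_rank(candidate) == n:
            return candidate
    raise RuntimeError("no invertible matrix found")


rng = np.random.default_rng(12345)
A = make_invertible_matrix(4, rng)

# An invertible matrix multiplied by its inverse gives the identity matrix.
A_inv = np.linalg.inv(A)
identity_recovered = np.allclose(A @ A_inv, np.eye(4))
print(identity_recovered)
```

A rank check is used here because comparing a floating-point determinant with exactly zero can be unreliable; for small integer matrices both approaches usually agree.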

## Algorithm verification

In [9]:
# multiply the feature matrix by a random invertible matrix
def get_cipher_matrix(features):
    n = features.shape[1]
    # draw random integer matrices until one has a nonzero determinant
    while True:
        alter_matrix = np.random.randint(1, 55, (n, n))
        if not np.isclose(np.linalg.det(alter_matrix), 0):
            break
    crypted_features = features @ alter_matrix
    return crypted_features, alter_matrix

In [10]:
crypted_features,alter_matrix = get_cipher_matrix(X)

In [11]:
# to decode the data, multiply the encoded "crypted_features" matrix
# by the inverse of "alter_matrix", the matrix used for encoding
X_2 = crypted_features @ np.linalg.inv(alter_matrix)

In [12]:
X_2

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.000000e+00 | 41.0 | 49600.0 | 1.000000e+00 |
| 1 | -2.088241e-10 | 46.0 | 38000.0 | 1.000000e+00 |
| 2 | 4.193090e-12 | 29.0 | 21000.0 | -1.862484e-10 |
| 3 | -1.756648e-10 | 21.0 | 41700.0 | 2.000000e+00 |
| 4 | 1.000000e+00 | 28.0 | 26100.0 | -9.507861e-11 |
| ... | ... | ... | ... | ... |
| 4995 | 1.036735e-10 | 28.0 | 35700.0 | 2.000000e+00 |
| 4996 | -9.874324e-11 | 34.0 | 52400.0 | 1.000000e+00 |
| 4997 | -3.357492e-11 | 20.0 | 33900.0 | 2.000000e+00 |
| 4998 | 1.000000e+00 | 22.0 | 32700.0 | 3.000000e+00 |
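The decoded values differ from the originals only by floating-point round-off (for example, `4.193090e-12` where the original value is 0). The sketch below shows one way to confirm such a round trip on a toy frame; the column names, value ranges, and seed are hypothetical and were chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Toy stand-in for the feature table (hypothetical column names and ranges).
original = pd.DataFrame({
    "gender": rng.integers(0, 2, size=20),
    "age": rng.integers(18, 66, size=20).astype(float),
    "salary": rng.integers(200, 700, size=20) * 100.0,
    "family_members": rng.integers(0, 5, size=20),
})

# Random invertible integer matrix used as the key.
n = original.shape[1]
while True:
    key = rng.integers(1, 55, size=(n, n))
    if np.linalg.matrix_rank(key) == n:
        break

encoded = original.to_numpy() @ key
decoded = encoded @ np.linalg.inv(key)

# The round trip matches the original up to a small absolute tolerance.
round_trip_ok = np.allclose(decoded, original.to_numpy(), atol=1e-6)

# Rounding removes the round-off noise and restores readable values.
restored = pd.DataFrame(np.round(decoded, 6), columns=original.columns)
print(round_trip_ok)
```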


In [13]:
# remaining steps of the encryption algorithm: add 499, multiply by 25, add 10
X_1 = (crypted_features + 499) * 25 + 10
# check R2 after applying the encryption algorithm
print(get_predict_r2(X_1, y))

0.42494550286615207


In [14]:
# check R2 on the original data, WITHOUT the algorithm
get_predict_r2(X,y)

0.4249455028666801
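The two scores agree up to floating-point round-off. As an illustration of why the whole pipeline (matrix multiplication followed by the element-wise shift and scaling) leaves R2 unchanged, here is a small sketch on synthetic data; the data, the seed, and the helper name `r2_on_training_data` are assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic features and target (assumption: 200 rows, 4 features).
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(size=200)


def r2_on_training_data(features, target):
    """Fit a linear regression with an intercept and score it on the same data."""
    return LinearRegression().fit(features, target).score(features, target)


# Random invertible integer matrix, redrawn if it is not full rank.
A = rng.integers(1, 55, size=(4, 4))
while np.linalg.matrix_rank(A) < 4:
    A = rng.integers(1, 55, size=(4, 4))

# Same sequence of steps as in the notebook: multiply, add 499, scale by 25, add 10.
transformed = (X @ A + 499) * 25 + 10

r2_original = r2_on_training_data(X, y)
r2_transformed = r2_on_training_data(transformed, y)
print(np.isclose(r2_original, r2_transformed))
```

The shift and scaling are absorbed by the intercept and the weights, so only the invertibility of the matrix matters for the quality of the fit.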

## Conclusion

- Loaded and examined the data;
- Showed that the quality of linear regression does not change when the original feature matrix is multiplied by an invertible one;
- Created a data transformation algorithm;
- Tested the algorithm by comparing the R2 metric on the data with and without the transformation.

The results show that matrix operations can obfuscate the data without affecting R2, and that the original values can be recovered by multiplying by the inverse of the matrix used for the transformation.