# The task
We need to protect the personal data of an insurance company's clients. Develop a data transformation method that makes it difficult to recover personal information from the transformed data.
The transformation must not degrade the quality of machine learning models.

## 1. Start

In [1]:
import pandas as pd
import numpy as np
from numpy.linalg import inv
from sklearn.linear_model import LinearRegression 
from sklearn.metrics import r2_score

In [2]:
data=pd.read_csv('datasets/insurance.csv')
data

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
0,1,41.0,49600.0,1,0
1,0,46.0,38000.0,1,1
2,0,29.0,21000.0,0,0
3,0,21.0,41700.0,2,0
4,1,28.0,26100.0,0,0
...,...,...,...,...,...
4995,0,28.0,35700.0,2,0
4996,0,34.0,52400.0,1,0
4997,0,20.0,33900.0,2,0
4998,1,22.0,32700.0,3,0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [4]:
# cast all columns to integers
data = data.astype(int)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Пол                5000 non-null   int64
 1   Возраст            5000 non-null   int64
 2   Зарплата           5000 non-null   int64
 3   Члены семьи        5000 non-null   int64
 4   Страховые выплаты  5000 non-null   int64
dtypes: int64(5)
memory usage: 195.4 KB


In [6]:
# select the features and the target
# 'Страховые выплаты' (insurance payments) is the target column
features = data.drop('Страховые выплаты', axis=1)
target = data['Страховые выплаты']
print('Features', features.shape)
print('Target', target.shape)

Features (5000, 4)
Target (5000,)


<div class="alert alert-info" style="border:solid blue 2px; padding: 20px"> <b>Conclusion:</b> The data is of good quality. The dataset has 5 columns and 5000 rows with no missing values. All columns were cast to int. The features and the target were separated.</div>

## 2. Matrix multiplication

Legend:

- $ X $ - matrix of features (the zeroth column consists of ones)

- $ y $ - target feature vector

- $ P $ - matrix by which the features are multiplied

- $ w $ - vector of linear regression weights (the zeroth element is the bias, i.e. the intercept)

Predictions:

$$
a = Xw
$$

Learning Objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Learning formula:

$$
w = (X ^ T X) ^ {- 1} X ^ T y
$$
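
As an illustration, the learning formula can be evaluated directly with NumPy. This is a minimal sketch on synthetic data, not the insurance dataset; the helper name `fit_normal_equation` and the toy arrays are assumptions made for the example.

```python
import numpy as np

def fit_normal_equation(features, target):
    """Return w = (X^T X)^{-1} X^T y, where X gets a zeroth column of ones."""
    X = np.column_stack([np.ones(len(features)), features])
    # Solve (X^T X) w = X^T y instead of forming the inverse explicitly
    return np.linalg.solve(X.T @ X, X.T @ target)

# Synthetic, noise-free target: y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(50, 2))
toy_target = 1 + 2 * toy_features[:, 0] - 3 * toy_features[:, 1]

w = fit_normal_equation(toy_features, toy_target)
print(np.round(w, 6))  # bias first, then the two feature weights
```

Using `np.linalg.solve` rather than `inv` is a design choice for numerical stability; both correspond to the same formula.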

**Answer:** the quality of linear regression will not change when the features are multiplied by an invertible matrix.

**Rationale:** Linear regression models the dependence of a variable y on one or more other variables x as a linear function. The classical model assumes that the random errors have the same variance. A matrix multiplied by its inverse gives the identity matrix, so the invertible matrix cancels out of the prediction formula. Accordingly, we get the same predictions and therefore the same quality.

## 3. Conversion Algorithm

**Algorithm**

If $X$ is the feature matrix and $P$ is the invertible matrix by which the features are multiplied, then their product is the matrix $XP$. The learning formula for the new weights $w'$ becomes:

$$
w' = ((XP)^T XP)^{-1} (XP)^T y
$$

**Rationale**

Given the formula above, we compute the new predictions $a'$ using the rules $(AB)^T = B^T A^T$ and, for square invertible matrices, $(AB)^{-1} = B^{-1} A^{-1}$:

$$
a' = XPw'
$$

$$
a' = XP((XP)^T XP)^{-1}(XP)^T y = XP(P^T X^T X P)^{-1} P^T X^T y = XP(X^T X P)^{-1}(P^T)^{-1} P^T X^T y = XPP^{-1}(X^T X)^{-1}(P^T)^{-1} P^T X^T y = X(X^T X)^{-1} X^T y = Xw = a
$$

$P$ and $P^T$ are invertible square matrices, so the products $PP^{-1}$ and $(P^T)^{-1}P^T$ each equal the identity matrix $E$ and drop out of the expression.
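
The derivation can also be checked numerically. The sketch below uses synthetic data and a randomly drawn matrix `P` (both are assumptions for illustration, not part of the original notebook) and compares the predictions before and after the transformation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data, used only to illustrate the identity
X = np.column_stack([np.ones(200), rng.normal(size=(200, 4))])  # zeroth column of ones
y = rng.normal(size=200)

# A random square matrix is invertible with probability 1
P = rng.normal(size=(X.shape[1], X.shape[1]))

# Weights and predictions on the original features
w = np.linalg.solve(X.T @ X, X.T @ y)
a = X @ w

# Weights and predictions on the transformed features XP
XP = X @ P
w_p = np.linalg.solve(XP.T @ XP, XP.T @ y)
a_p = XP @ w_p

print(np.allclose(a, a_p))                       # predictions coincide
print(np.allclose(w_p, np.linalg.solve(P, w)))   # new weights equal P^{-1} w
```

The second check follows from the same algebra: substituting $(XP)^T = P^T X^T$ into the learning formula gives new weights equal to $P^{-1}w$.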

<div class="alert alert-info" style="border:solid blue 2px; padding: 20px"> <b>Conclusion:</b> It follows from the derivation that the predictions will not change. Let's compare the quality of the model on the original data and on the transformed data (after matrix multiplication). To do this, we convert the features into a matrix, multiply it by an invertible matrix, and convert the result back to a DataFrame. Then we train the model on each version and compare the quality using the r2_score metric.</div>

## 4. Algorithm check

In [7]:
#turn features into a matrix
matrix_features = features.values
print(matrix_features)

[[    1    41 49600     1]
 [    0    46 38000     1]
 [    0    29 21000     0]
 ...
 [    0    20 33900     2]
 [    1    22 32700     3]
 [    1    28 40600     1]]


In [8]:
# Only a square matrix with a nonzero determinant is invertible, and the feature matrix is not square (5000 x 4).
# So we transpose it (4 rows, 5000 columns) and multiply by the original matrix to get a square 4 x 4 matrix.
# The task requires multiplying the features by an invertible matrix, so we take the inverse of that product
# and use it as the encoding matrix.
# We assume the determinant of the product is nonzero.
matrix_inv = np.linalg.inv(matrix_features.T.dot(matrix_features))
print(matrix_inv)

[[ 7.79281208e-04 -5.06698571e-06 -5.24393614e-09 -9.67517664e-06]
 [-5.06698571e-06  1.63619393e-06 -1.05347762e-09 -2.87048239e-06]
 [-5.24393614e-09 -1.05347762e-09  1.01259380e-12 -2.27826976e-09]
 [-9.67517664e-06 -2.87048239e-06 -2.27826976e-09  1.60298184e-04]]
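
Instead of assuming that the determinant is nonzero, invertibility can be checked explicitly. The following sketch shows one alternative way to obtain an encoding matrix: draw a random square matrix and accept it only if its condition number is moderate. The helper `random_invertible_matrix` and the threshold `max_cond` are hypothetical and are not used in the cells of this notebook.

```python
import numpy as np

def random_invertible_matrix(size, seed=None, max_cond=1e6):
    """Draw random square matrices until one is well-conditioned (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    while True:
        candidate = rng.normal(size=(size, size))
        # A huge condition number means the matrix is numerically close to singular
        if np.linalg.cond(candidate) < max_cond:
            return candidate

P = random_invertible_matrix(4, seed=0)

# P multiplied by its inverse should reproduce the identity matrix up to rounding error
print(np.allclose(P @ np.linalg.inv(P), np.eye(4)))
```

The condition number is used here rather than the determinant because a determinant close to zero depends on the scale of the entries, while the condition number does not.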


In [9]:
# multiply the feature matrix by the invertible encoding matrix
matrix=matrix_features.dot(matrix_inv)
print(matrix)

[[ 3.01760385e-04  6.89399278e-06 -4.90135848e-10 -8.00689513e-05]
 [-4.42026093e-04  3.23622886e-05 -1.22596759e-08 -5.83182574e-05]
 [-2.57065245e-04  2.53265938e-05 -9.28638122e-09 -1.31087654e-04]
 ...
 [-2.98459503e-04 -8.72997767e-06  8.70083791e-09  1.85953375e-04]
 [ 4.67305281e-04 -1.21308848e-05 -2.14343580e-09  3.33569341e-04]
 [ 4.14826624e-04 -4.89522965e-06  4.09172903e-09 -2.22482523e-05]]


In [10]:
# get the table format
features_2=pd.DataFrame(matrix, columns = features.columns)
features_2

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи
0,0.000302,0.000007,-4.901358e-10,-0.000080
1,-0.000442,0.000032,-1.225968e-08,-0.000058
2,-0.000257,0.000025,-9.286381e-09,-0.000131
3,-0.000344,-0.000015,1.554559e-08,0.000165
4,0.000501,0.000013,-8.312611e-09,-0.000150
...,...,...,...,...
4995,-0.000348,0.000002,2.095686e-09,0.000159
4996,-0.000457,-0.000002,1.496341e-08,-0.000057
4997,-0.000298,-0.000009,8.700838e-09,0.000186
4998,0.000467,-0.000012,-2.143436e-09,0.000334


In [11]:
#let's check that we got a dataframe with the same structure as the original features
features_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Пол          5000 non-null   float64
 1   Возраст      5000 non-null   float64
 2   Зарплата     5000 non-null   float64
 3   Члены семьи  5000 non-null   float64
dtypes: float64(4)
memory usage: 156.4 KB


In [12]:
# check the quality of the model on the original data
# note: the `normalize` argument was removed in scikit-learn 1.2; drop it on newer versions
model = LinearRegression(normalize=True)
model.fit(features,target)
predictions = model.predict(features)
print('R2_score initial data',r2_score(target, predictions))


R2_score initial data 0.42494550308169177


In [13]:
# check the quality of the model on the encoded data
# note: the `normalize` argument was removed in scikit-learn 1.2; drop it on newer versions
model = LinearRegression(normalize=True)
model.fit(features_2,target)
predictions_2 = model.predict(features_2)
print('R2_score encoded data', r2_score(target, predictions_2))

R2_score encoded data 0.42494550308169177


<div class="alert alert-info" style="border:solid blue 2px; padding: 20px"> <b>Conclusion:</b> Using matrix operations, we obtained an encoded feature table. To recover the original data, it is enough to multiply the encoded features by the inverse of the encoding matrix: matrices have no division, and multiplication by the inverse plays that role. Having trained the model on the original data and on the encoded data, we got the same value of the r2_score metric. Thus, we consider our answer, 'the quality of the model will not change', confirmed.</div>
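
The recovery step mentioned in the conclusion can be sketched as follows. The three rows are taken from the feature table shown earlier; the matrix `P` is a randomly drawn stand-in (an assumption for illustration), whereas the notebook itself uses the inverse of the product of the transposed feature matrix and the feature matrix.

```python
import numpy as np

# First three feature rows from the table above: sex, age, salary, family members
rows = np.array([
    [1, 41, 49600, 1],
    [0, 46, 38000, 1],
    [0, 29, 21000, 0],
])

# Hypothetical stand-in for the encoding matrix
rng = np.random.default_rng(1)
P = rng.normal(size=(4, 4))

encoded = rows @ P                      # the transformed features
recovered = encoded @ np.linalg.inv(P)  # only possible if P is known

# Rounding removes floating-point noise and restores the integer values
print(np.array_equal(np.rint(recovered).astype(int), rows))
```

Without knowledge of `P`, the encoded values alone do not directly reveal the original columns, which is the point of the transformation.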