# Protection of customers' personal data

## Content
1. Loading data
2. Matrix multiplication
3. Conversion algorithm
4. Verification of the algorithm
5. Checklist

## Description
You need to protect the insurance company's customer data. Develop a data conversion method that makes it difficult to recover personal information. Justify its correctness.

Protect the data so that the quality of machine learning models does not degrade after the conversion. There is no need to select the best model.

## Loading data

In [1]:
import numpy as np
import pandas as pd
from numpy.linalg import inv
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [2]:
# Try the local path first, then fall back to the public copy
try:
    data = pd.read_csv('/datasets/insurance.csv')
except FileNotFoundError:
    data = pd.read_csv('https://code.s3.yandex.net/datasets/insurance.csv')

In [3]:
data.shape

(5000, 5)

In [4]:
data.isna().sum()

Пол                  0
Возраст              0
Зарплата             0
Члены семьи          0
Страховые выплаты    0
dtype: int64

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [6]:
data.head()

| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


In [7]:
data.describe()

| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.499 | 30.9528 | 39916.36 | 1.1942 | 0.148 |
| std | 0.500049 | 8.440807 | 9900.083569 | 1.091387 | 0.463183 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


**Conclusion:**

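The dataset has 5000 rows and 5 columns with no missing values. The column names are in Russian: Пол (sex), Возраст (age), Зарплата (salary), Члены семьи (family members), Страховые выплаты (insurance payouts).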

## Matrix multiplication

Notation:

- $X$ - feature matrix (the zeroth column consists of ones)

- $y$ - target vector

- $P$ - matrix by which the features are multiplied

- $w$ - vector of linear regression weights (the zeroth element is the intercept)

Predictions:

$$
a = Xw
$$

Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula (closed-form solution):

$$
w = (X^T X)^{-1} X^T y
$$
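The formula above can be tried directly in NumPy. The sketch below is an illustration on synthetic data, not part of the project: the data, the seed and the names `raw_features`, `true_w` and `w_lstsq` are assumptions made for the example. It builds $X$ with a zeroth column of ones, computes $w$ with the training formula, and cross-checks the result against NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; synthetic data for illustration only

# Hypothetical data: 100 objects, 3 features, known weights plus small noise
raw_features = rng.normal(size=(100, 3))
true_w = np.array([2.0, 1.0, -3.0, 0.5])  # zeroth element is the intercept
X = np.hstack([np.ones((100, 1)), raw_features])  # zeroth column of ones
y = X @ true_w + rng.normal(scale=0.01, size=100)

# Training formula: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Cross-check against NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predictions a = Xw and the resulting MSE
a = X @ w
mse = np.mean((a - y) ** 2)
print(w, mse)
```

In practice `np.linalg.lstsq` or `np.linalg.solve` is usually preferred over an explicit inverse for numerical stability; the explicit form is kept here only to mirror the formula.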



**Answer:** If the features are multiplied by an invertible matrix, the quality of the linear regression does not change.

**Rationale:** Multiplying the features by an invertible matrix $P$ replaces each feature with a linear combination of the original features. Because $P$ is invertible, the columns of $XP$ span the same space as the columns of $X$, so every prediction that can be written as $Xw$ can also be written as $(XP)w'$ with $w' = P^{-1}w$.

Linear regression looks for the weights that minimise the prediction error. After the transformation the optimal weights themselves change, but the optimal predictions, and therefore the quality of the model, stay the same.

If the features are multiplied by a non-invertible (singular) matrix, the quality of the linear regression may drop: such a matrix collapses some linear combinations of the features, information is lost, and the model may no longer be able to reconstruct the target variable as well.
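The effect of a singular matrix can be seen on synthetic data. The sketch below is an illustration under stated assumptions, not part of the project code: the data are random, the matrix `P_singular` is made singular by duplicating one of its columns, and `r2_of_least_squares` is a hypothetical helper written for this example. It compares the rank of the feature matrix and the R2 of a least-squares fit before and after the multiplication.

```python
import numpy as np

def r2_of_least_squares(features, target):
    """Fit least squares with an intercept and return R2 on the same data."""
    X = np.hstack([np.ones((features.shape[0], 1)), features])
    w, *_ = np.linalg.lstsq(X, target, rcond=None)
    residual = target - X @ w
    return 1 - residual @ residual / np.sum((target - target.mean()) ** 2)

rng = np.random.default_rng(1)  # fixed seed; synthetic data for illustration only
features = rng.normal(size=(200, 4))
target = features @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

# A singular matrix: the last column duplicates the first one, so det = 0
P_singular = rng.normal(size=(4, 4))
P_singular[:, 3] = P_singular[:, 0]

transformed = features @ P_singular

# The transformed features span fewer dimensions than the original ones
rank_before = np.linalg.matrix_rank(features)
rank_after = np.linalg.matrix_rank(transformed)

r2_before = r2_of_least_squares(features, target)
r2_after = r2_of_least_squares(transformed, target)
print(rank_before, rank_after, r2_before, r2_after)
```

The transformed features are linear combinations of the original ones, so the fit on them can never be better than the original fit; with a singular matrix it is generally worse.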

**Proof:**
 
1. Determinant of an inverse matrix:
   $$\det(A^{-1}) = 1/\det(A)$$
2. Properties of the identity matrix $E$:
   $$AE=EA=A$$
   $$AA^{-1}=A^{-1}A=E$$
3. Associativity of matrix multiplication:
   $$A(BC)=(AB)C$$

The formula for the linear regression weights:

$$
w = (X^T X)^{-1} X^T y
$$

If we multiply the feature matrix $X$ by an invertible matrix $P$, we get a new feature matrix $X'$:

$$
X' = XP
$$

The linear regression weights for the new feature matrix $X'$ are:

$$
w' = ((XP)^T XP)^{-1} (XP)^T y
$$

Expand the transposes using the property $(AB)^T = B^T A^T$:

$$
w' = (P^T X^T XP)^{-1} P^T X^T y
$$

Group the factors using associativity:

$$
w' = (P^T(X^T X)P)^{-1} P^T X^T y
$$

Denote $A = X^T X$:

$$
w' = (P^T AP)^{-1} P^T X^T y
$$

The matrix $A$ is assumed to be invertible (otherwise the original weights $w$ would not be defined), and $P^T$ is invertible because $P$ is. All three factors are square, so we can apply the property $(AB)^{-1} = B^{-1} A^{-1}$ twice:

$$
w' = P^{-1}(A^{-1}(P^T)^{-1})P^T X^T y
$$

By associativity, the factors $(P^T)^{-1} P^T$ can be grouped together, and their product is the identity matrix $E$:

$$
w' = P^{-1} A^{-1} ((P^T)^{-1} P^T) X^T y = P^{-1} A^{-1} E X^T y
$$

Substituting $A = X^T X$ back:

$$
w' = P^{-1} (X^T X)^{-1} X^T y
$$

The last three factors are exactly the original weights $w$, so:

$$
w' = P^{-1} w
$$

The new weights differ from the old ones, but the predictions on the transformed features are the same:

$$
a' = X'w' = XPP^{-1}w = XEw = Xw = a
$$

Thus the predictions $a'$ of the model trained on $X'$ coincide with the predictions $a$ of the model trained on $X$.

This means that the quality of the linear regression, whether measured by the mean squared error (MSE) or by R2, does not change when the features are multiplied by an invertible matrix.
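The result can be checked numerically. The sketch below is an illustration on synthetic data: the data, the hand-picked matrix `P` (its determinant is 5, so it is invertible) and the helper `fit_weights` are assumptions made for the example. For brevity $X$ is used without a column of ones. The sketch fits the weights with the formula $w = (X^T X)^{-1} X^T y$ before and after the transformation and checks that $w' = P^{-1}w$ and that the predictions coincide.

```python
import numpy as np

def fit_weights(X, y):
    """Solve (X^T X) w = X^T y, i.e. w = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(2)  # fixed seed; synthetic data for illustration only
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)

# Hand-picked invertible matrix (determinant equals 5)
P = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 3.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
])

w = fit_weights(X, y)          # weights on the original features
w_new = fit_weights(X @ P, y)  # weights on the transformed features

# The weights are related by w' = P^{-1} w ...
weights_match = np.allclose(w_new, np.linalg.inv(P) @ w)
# ... and the predictions are identical: X' w' = X w
predictions_match = np.allclose((X @ P) @ w_new, X @ w)
print(weights_match, predictions_match)
```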

## Conversion algorithm

**Algorithm**.

1. Separate the features from the target.
2. Create an arbitrary invertible matrix:
    - the matrix must be square;
    - its size must equal the number of features, in this case 4.
3. Train the model on the original features.
4. Compute the R2 metric.
5. Multiply the features by the invertible matrix.
6. Train the model on the transformed features.
7. Compute the R2 metric of the new model.
8. Compare the quality of the models before and after the transformation.

**Rationale**

To test the hypothesis, we compare the quality of the model before and after multiplying the feature matrix by the invertible matrix.
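The eight steps above can be sketched end to end on synthetic data. This is a minimal illustration under assumptions, not the notebook's implementation: the data are random stand-ins for the insurance dataset, and `r2` and `fit_predict` are hypothetical NumPy-only helpers that play the role of `r2_score` and `LinearRegression`.

```python
import numpy as np

def r2(target, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((target - predictions) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1 - ss_res / ss_tot

def fit_predict(features, target):
    """Least-squares linear regression with an intercept; returns fitted values."""
    X = np.hstack([np.ones((features.shape[0], 1)), features])
    w, *_ = np.linalg.lstsq(X, target, rcond=None)
    return X @ w

rng = np.random.default_rng(3)  # fixed seed; synthetic data for illustration only

# Step 1: features and target (synthetic stand-ins)
features = rng.normal(size=(300, 4))
target = features @ np.array([0.5, 1.5, -1.0, 2.0]) + rng.normal(size=300)

# Step 2: square random matrix with one row and column per feature
n = features.shape[1]
P = rng.normal(size=(n, n))
assert np.linalg.matrix_rank(P) == n  # full rank means the matrix is invertible

# Steps 3-4: train on the original features and compute R2
r2_before = r2(target, fit_predict(features, target))

# Steps 5-7: transform the features, retrain and compute R2 again
r2_after = r2(target, fit_predict(features @ P, target))

# Step 8: compare the two values
print(r2_before, r2_after)
```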

## Verification of the algorithm

**Separate the features and the target**

In [8]:
features = data.drop('Страховые выплаты', axis=1).values
target = data['Страховые выплаты'].values

**Find the number of features**

In [9]:
n = len(features[0])
print(n)

4


**Create an arbitrary matrix**

In [10]:
m_random = np.random.normal(size=(n,n))
display(m_random)

array([[ 0.12959788, -0.73613628,  1.07381142, -0.7538745 ],
       [ 0.7866127 ,  0.04204902, -2.24416515,  0.64525221],
       [-2.28954895,  1.00514094, -0.21122626,  0.71199664],
       [ 0.54888564, -1.48611529,  2.75022679, -1.2715076 ]])

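**Check it for invertibility**

In [11]:
# inv() raises LinAlgError if the matrix is singular, so a successful call confirms invertibility.
# The inverse is itself invertible and is used below as the transformation matrix.
m_random = inv(m_random)
display(m_random)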

array([[-0.90999305, -0.32889887, -0.59132454,  0.04150722],
       [ 0.3033701 , -1.55242537, -0.8641868 , -1.45158943],
       [-1.39694462, -0.28854281, -0.01703761,  0.67227858],
       [-3.76894294,  1.04835947,  0.71793073,  2.38215684]])
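A successful call to `inv` only rules out an exactly singular matrix; a nearly singular one passes the check but makes the transformation numerically fragile. One possible safeguard, sketched below as an assumption rather than as part of the notebook, is to redraw the matrix until its condition number is below a threshold. The helper `random_invertible_matrix` and its parameters `max_cond` and `max_tries` are hypothetical names introduced for this example.

```python
import numpy as np

def random_invertible_matrix(n, rng, max_cond=1e3, max_tries=100):
    """Draw n x n Gaussian matrices until one is comfortably invertible."""
    for _ in range(max_tries):
        candidate = rng.normal(size=(n, n))
        # A small condition number means the matrix is far from singular
        if np.linalg.cond(candidate) < max_cond:
            return candidate
    raise RuntimeError("no well-conditioned matrix found")

rng = np.random.default_rng(4)  # fixed seed; synthetic data for illustration only
key = random_invertible_matrix(4, rng)

# Round trip: whoever holds the matrix can restore the original features
features = rng.normal(size=(10, 4))
encoded = features @ key
decoded = encoded @ np.linalg.inv(key)
round_trip_ok = np.allclose(decoded, features)
print(round_trip_ok)
```

The round trip also shows that the matrix acts as a key: it has to be kept secret, because anyone who has it can undo the transformation.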

In [12]:
# Baseline: train and evaluate on the original features
model = LinearRegression()
model.fit(features, target)
predictions = model.predict(features)
print('R2 of the original model:', r2_score(target, predictions))

# Transform the features with the invertible matrix, then retrain and re-evaluate
new_matrix = features @ m_random
model.fit(new_matrix, target)
predictions = model.predict(new_matrix)
print('R2 of the transformed model:', r2_score(target, predictions))

R2 of the original model: 0.42494550286668
R2 of the transformed model: 0.42494550286668276


**Conclusions**

1. The input data were loaded and checked; there are no missing values.
2. It was shown theoretically that multiplying the features by an invertible matrix does not change the quality of linear regression, and the relation between the original and the transformed weight vectors was derived.
3. An algorithm for transforming the data and training the model on the original and the transformed data was developed. A random invertible matrix was generated to transform the data.
4. Models were trained on both datasets and their R2 metrics were compared.
5. The R2 values agree to within floating-point rounding (0.42494550286668 and 0.42494550286668276), which is consistent with the theoretical justification.

## Inspection checklist

Put an 'x' in the completed items. Then press Shift+Enter.

- [x] Jupyter Notebook open
- [x] All code runs without errors
- [x] Cells with code are in the order of execution
- [x] Step 1 is completed: data is loaded
- [x] Step 2 is completed: matrix multiplication question is answered
    - [x] Correct answer is given
    - [x] Option is justified
- [x] Step 3 is completed: transformation algorithm is proposed
    - [x] Algorithm is described
    - [x] Algorithm is justified
- [x] Step 4 is completed: Algorithm is verified
    - [x] Algorithm is implemented
    - [x] Comparison of model quality before and after transformation performed