# Data encryption for ML models

## Description 

The task is to protect the personal data of clients of the insurance company "Any Flood". We will develop a data transformation method that makes it difficult to recover personal information from the transformed data, and we will justify that the method works correctly.

The transformation must not degrade the quality of machine learning models. Selecting the best model is not required.

## Data description


 - **Features:** gender (`Пол`), age (`Возраст`) and salary (`Зарплата`) of the insured, number of family members (`Члены семьи`).
 - **Target:** number of insurance payments to the client over the last 5 years (`Страховые выплаты`).

## 1. Data loading

In [1]:
import pandas as pd

df = pd.read_csv('/datasets/insurance.csv')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Пол                  5000 non-null int64
Возраст              5000 non-null float64
Зарплата             5000 non-null float64
Члены семьи          5000 non-null int64
Страховые выплаты    5000 non-null int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [2]:
df.isna().sum()

Пол                  0
Возраст              0
Зарплата             0
Члены семьи          0
Страховые выплаты    0
dtype: int64

In [3]:
df.duplicated().sum()

153

In [4]:
duplicates = df[df.duplicated()]

duplicates

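| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| 281 | 1 | 39.0 | 48100.0 | 1 | 0 |
| 488 | 1 | 24.0 | 32900.0 | 1 | 0 |
| 513 | 0 | 31.0 | 37400.0 | 2 | 0 |
| 718 | 1 | 22.0 | 32600.0 | 1 | 0 |
| 785 | 0 | 20.0 | 35800.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 4793 | 1 | 24.0 | 37800.0 | 0 | 0 |
| 4902 | 1 | 35.0 | 38700.0 | 1 | 0 |
| 4935 | 1 | 19.0 | 32700.0 | 0 | 0 |
| 4945 | 1 | 21.0 | 45800.0 | 0 | 0 |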


## 2. Matrix multiplication

Notation:

- $X$ - feature matrix (the zeroth column consists of ones)

- $y$ - target vector

- $P$ - matrix by which the features are multiplied (called $Z$ in the proof and code that follow)

- $w$ - vector of linear regression weights (the zeroth element is the bias)

Prediction:

$$
a = Xw
$$

Learning target:

$$
w = \arg\min_w MSE(Xw, y)
$$

Learning formula:

$$
w = (X^T X)^{-1} X^T y
$$
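
As a rough illustration of the learning formula, the sketch below computes $w$ with NumPy. The data is synthetic and the name `fit_normal_equation` is hypothetical; neither comes from the insurance dataset or the original notebook.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Return w = (X^T X)^{-1} X^T y for a feature matrix X whose zeroth column is ones."""
    # Solving the linear system avoids forming the inverse explicitly
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
raw = rng.normal(size=(100, 3))
X = np.column_stack([np.ones(len(raw)), raw])  # zeroth column of ones for the bias
true_w = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ true_w                                 # noiseless target

w = fit_normal_equation(X, y)
a = X @ w                                      # prediction a = Xw
```

Because the target here is noiseless, the recovered `w` should match `true_w` up to floating-point error.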

**Answer:**
When the features are multiplied by an invertible matrix, the quality of the linear regression does not change.

**Justification:**

Squared error formula:

$$
SE = \sum_{i=1}^{n} (y_i - a_i)^2 
$$

Prediction formula in matrix form:
$$
a = Xw = X(X^T X)^{-1} X^T y
$$
Let $Z$ be an invertible square matrix whose size equals the number of columns of $X$.

Multiply the matrix $X$ by $Z$ in the prediction formula:
$$
a' = XZ((XZ)^T XZ)^{-1} (XZ)^T y
$$

Expand the transposes using the transpose-of-a-product property:
$$
(AB)^T = B^T A^T
$$

$$
a' = XZ(Z^T X^T X Z)^{-1} Z^T X^T y
$$

Expand the inverse using the inverse-of-a-product property, which holds only for square invertible matrices. Here the factors are $Z^T$, $X^T X$ and $Z$. The matrix $X$ itself is generally not square and has no inverse, so $X^T X$ is kept together; it is invertible whenever the learning formula is defined:
$$
(AB)^{-1} = B^{-1}A^{-1}
$$

$$
a' = XZZ^{-1}(X^T X)^{-1}(Z^T)^{-1}Z^T X^T y
$$

Apply the equality:
$$
AA^{-1} = A^{-1}A = E
$$
where:
- $E$ is the identity matrix

$$
a' = XE(X^T X)^{-1}EX^T y
$$

Apply the equality:

$$
AE = EA = A
$$
$$
a' = X(X^T X)^{-1} X^T y
$$

We get:

$$
a = a'
$$

Thus:

$$
\sum_{i=1}^{n} (y_i - a_i)^2 = \sum_{i=1}^{n} (y_i - a'_i)^2
$$

This proves that multiplying the features by an invertible matrix does not change the quality of the linear regression.
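
The result can also be checked numerically. The sketch below is a minimal check on synthetic data (an assumption, not the insurance dataset): it fits the weights with and without the transformation and compares the predictions. It also checks that the new weights equal $Z^{-1}w$, which follows from the same algebra. The helper name `normal_equation` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for X: 200 rows, a column of ones plus 4 features
X = np.column_stack([np.ones(200), rng.normal(size=(200, 4))])
y = rng.normal(size=200)

# Adding 6*I to entries in [-1, 1] makes Z strictly diagonally dominant, hence invertible
Z = rng.uniform(-1, 1, size=(5, 5)) + 6 * np.eye(5)

def normal_equation(X, y):
    """w = (X^T X)^{-1} X^T y, computed by solving the linear system."""
    return np.linalg.solve(X.T @ X, X.T @ y)

w = normal_equation(X, y)
w_z = normal_equation(X @ Z, y)

a = X @ w            # predictions on the original features
a_z = (X @ Z) @ w_z  # predictions on the transformed features

print(np.allclose(a, a_z))                      # predictions coincide
print(np.allclose(w_z, np.linalg.solve(Z, w)))  # new weights equal Z^{-1} w
```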

## 3. Conversion algorithm

**Algorithm**

1. Defining the features and the target
2. Training the model without transformation
3. Obtaining the R2 metric of the model without transformation
4. Creating a random invertible matrix
5. Multiplying the feature matrix by the invertible matrix
6. Training the model on the transformed features
7. Obtaining the R2 metric of the model on the transformed features
8. Comparison of metrics, conclusion

**Rationale**

The algorithm is justified by the analytical proof given in section 2.
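
Invertibility also matters for whoever holds the matrix: the original features can be restored by multiplying the transformed data by the inverse. The sketch below illustrates this on a few hypothetical rows; the feature values and the names `features_demo` and `Z_key` are assumptions, and the key is simply the example matrix printed in section 4.

```python
import numpy as np

# Hypothetical rows: gender, age, salary, number of family members
features_demo = np.array([
    [1, 41.0, 49600.0, 1],
    [0, 46.0, 38000.0, 1],
    [0, 29.0, 21000.0, 0],
])

# Example key matrix; its determinant is nonzero, so it is invertible
Z_key = np.array([
    [4, 4, 5, 3],
    [8, 2, 3, 0],
    [7, 1, 4, 7],
    [9, 0, 7, 3],
])

protected = features_demo @ Z_key             # transformed data
recovered = protected @ np.linalg.inv(Z_key)  # restoring requires the inverse of the key

print(np.allclose(recovered, features_demo))
```

Without the key matrix, the transformed columns are mixtures of all original features, which is what makes direct reading of the personal data difficult.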

## 4. Algorithm validation

In [5]:
import numpy as np

# Defining the features and the target for model training

features = df.drop(['Страховые выплаты'], axis=1)

target = df['Страховые выплаты']

In [6]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
    
# Training the model without transformation

model = LinearRegression()
model.fit(features, target)
prediction = model.predict(features)

r2 = r2_score(target,prediction).round(6)

r2

0.424946

In [7]:
from random import randint

# Function for creating a random invertible square matrix
def invertible_matrix(n):
    while True:
        matrix = [[randint(0, 9) for _ in range(n)] for _ in range(n)]
        try:
            np.linalg.inv(matrix)
        except np.linalg.LinAlgError:
            # Singular matrix: discard it and draw a new one
            continue
        return matrix

Z = invertible_matrix(4)
Z

[[4, 4, 5, 3], [8, 2, 3, 0], [7, 1, 4, 7], [9, 0, 7, 3]]
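
The generator above relies on `np.linalg.inv` raising an error for singular matrices. A possible variant, sketched below, rejects matrices by condition number instead, which also filters out nearly singular ones. The function name `random_invertible_matrix`, the `max_cond` threshold, and the seed are assumptions chosen for illustration.

```python
import numpy as np

def random_invertible_matrix(n, rng, max_cond=1e6):
    """Draw random n x n matrices with entries 0..9 until one is well conditioned."""
    while True:
        matrix = rng.integers(0, 10, size=(n, n))  # upper bound is exclusive
        # The condition number is huge (or inf) for singular and near-singular matrices
        if np.linalg.cond(matrix) < max_cond:
            return matrix

rng = np.random.default_rng(12345)
Z_alt = random_invertible_matrix(4, rng)
```

Passing a seeded generator makes the key reproducible, which is convenient for testing but would need care in a real deployment.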

In [8]:
# Transformation: multiplication of the feature matrix by an invertible matrix

features_transformed = np.dot(features, Z)

In [9]:
# Training the model after the transformation

model_transformed = LinearRegression()
model_transformed.fit(features_transformed, target)
prediction_transformed = model_transformed.predict(features_transformed)

r2_transformed = r2_score(target, prediction_transformed).round(6)

r2_transformed

0.424946

In [10]:
if r2 == r2_transformed:
    print("The hypothesis is correct: when the features are multiplied by an invertible matrix, the quality of the linear regression does not change.")
else:
    print("There seems to be a mistake somewhere")

The hypothesis is correct: when the features are multiplied by an invertible matrix, the quality of the linear regression does not change.
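
Comparing rounded metrics with `==` works here, but a tolerance-based comparison is a common alternative for floating-point results. The sketch below uses synthetic data and hypothetical names (`features_demo`, `target_demo`, `Z_demo`, `fit_predict`); it compares both the R2 values and the predictions themselves.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)

# Synthetic stand-ins (assumptions): 500 rows, 4 features, noisy linear target
features_demo = rng.normal(size=(500, 4))
target_demo = features_demo @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(size=500)

# Adding 5*I to entries in [-1, 1] makes the matrix strictly diagonally dominant, hence invertible
Z_demo = rng.uniform(-1, 1, size=(4, 4)) + 5 * np.eye(4)

def fit_predict(X, y):
    """Fit a linear regression on (X, y) and return its predictions on X."""
    return LinearRegression().fit(X, y).predict(X)

pred = fit_predict(features_demo, target_demo)
pred_z = fit_predict(features_demo @ Z_demo, target_demo)

same_r2 = np.isclose(r2_score(target_demo, pred), r2_score(target_demo, pred_z))
same_predictions = np.allclose(pred, pred_z)
print(same_r2, same_predictions)
```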
