<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Loading" data-toc-modified-id="Data-Loading-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Loading</a></span></li><li><span><a href="#Matrix-Multiplication" data-toc-modified-id="Matrix-Multiplication-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Matrix Multiplication</a></span></li><li><span><a href="#Transformation-Algorithm" data-toc-modified-id="Transformation-Algorithm-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Transformation Algorithm</a></span></li><li><span><a href="#Algorithm-Testing" data-toc-modified-id="Algorithm-Testing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Algorithm Testing</a></span></li></ul></div>

# Protection of personal data of clients

We need to protect the data of clients of the insurance company "Hot Pot." The task is to develop a data transformation method that makes it difficult to recover personal information, and to justify why the method works correctly. The transformation must not degrade the quality of machine learning models; finding the best model is not required.

The dataset columns are named in Russian:

- `Пол`: gender
- `Возраст`: age
- `Зарплата`: salary
- `Члены семьи`: family members
- `Страховые выплаты`: insurance payments

##  Data Loading

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
data = pd.read_csv("/datasets/insurance.csv")
data.head(15)

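| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |
| 5 | 1 | 43.0 | 41000.0 | 2 | 1 |
| 6 | 1 | 39.0 | 39700.0 | 2 | 0 |
| 7 | 1 | 25.0 | 38600.0 | 4 | 0 |
| 8 | 1 | 36.0 | 49700.0 | 1 | 0 |
| 9 | 1 | 32.0 | 51700.0 | 1 | 0 |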


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [6]:
data.duplicated().sum()

153

In [7]:
data.drop_duplicates(inplace=True)
data.duplicated().sum()

0

<b>Conclusion:</b>
There are no missing values, and the 153 fully duplicated rows have been removed. One nuance: the dataset has no unique customer identifier such as an ID or full name, so we cannot be 100% certain that all matching rows are true duplicates rather than different clients with identical attributes.
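
Before dropping rows, it can help to look at the matching records side by side. A minimal sketch on a toy frame (the column names and values here are illustrative assumptions, not the insurance data):

```python
import pandas as pd

# Toy frame with the same kind of columns as the dataset (values are illustrative).
toy = pd.DataFrame({
    "age": [41.0, 46.0, 41.0, 29.0],
    "salary": [49600.0, 38000.0, 49600.0, 21000.0],
    "family_members": [1, 1, 1, 0],
})

# keep=False marks every row that has an identical twin, not only the repeats.
twins = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each repeated row.
deduplicated = toy.drop_duplicates()

print(len(twins), len(deduplicated))
```

Without a customer identifier, rows flagged this way are only candidates for duplicates, which is the nuance noted above.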

## Matrix Multiplication

Notation:

- $X$ — feature matrix (the first column consists of ones)
- $y$ — target vector

- $P$ — matrix by which the features are multiplied

- $w$ — vector of weights of linear regression (the first element is the bias)


Predictions:

$$
a = Xw
$$


Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
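
The formula can be tried out directly in NumPy. The sketch below uses synthetic data (the sizes, seed, and `true_w` are assumptions for illustration) and compares the normal-equation solution with NumPy's least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 objects, 3 features, known weights plus a little noise.
raw = rng.normal(size=(50, 3))
X = np.hstack([np.ones((50, 1)), raw])   # first column of ones for the bias
true_w = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ true_w + 0.01 * rng.normal(size=50)

# Normal equation: solve (X^T X) w = X^T y instead of forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Solving the linear system rather than calling `np.linalg.inv` is a common choice for numerical stability; both express the same formula.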


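<b>Answer:</b> The quality of linear regression does not change when the features are multiplied by an invertible matrix $P$. (A square matrix is invertible exactly when its determinant is non-zero.)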

Let's consider the predictions before and after multiplying the feature matrix $X$ by $P$:

Before:

$$
a = Xw
$$

After:

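$$
a' = XPw' = XPP^{-1}w = Xw = a
$$

Here $w'$ denotes the weights trained on $XP$. The chain uses $w' = P^{-1}w$, which follows from the training formula.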

Therefore, the predictions are the same, which means that the quality of the linear regression is not affected by multiplying the feature matrix by an invertible matrix.

<b>Proof via the training formula:</b>

$$
w = (X^T X)^{-1} X^T y
$$
$$
a = Xw = X(X^T X)^{-1} X^T y
$$

After replacing $X$ with $XP$:

$$
a' = XP((XP)^T XP)^{-1} (XP)^T y
$$
$$
a' = XP(P^T X^T X P)^{-1} P^T X^T y
$$

The matrices $P^T$, $X^T X$ and $P$ are square and invertible, so we can apply

$$
(ABC)^{-1} = C^{-1}B^{-1}A^{-1}, \quad A = P^T,\ B = X^T X,\ C = P
$$
$$
a' = XPP^{-1}(X^TX)^{-1}(P^T)^{-1}P^T X^Ty
$$

Since

$$
PP^{-1} = E, \quad (P^T)^{-1}P^T = E
$$

we get

$$
a' = XE(X^TX)^{-1}EX^Ty = X(X^TX)^{-1}X^Ty = Xw = a
$$
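
The identity can also be checked numerically. A minimal sketch on synthetic data (the sizes, the seed, and the helper `fit` are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# 30 objects, a column of ones plus 3 random features.
X = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 3))])
y = rng.normal(size=30)
P = rng.normal(size=(4, 4))   # a random square matrix is almost surely invertible

def fit(features, target):
    """Least-squares weights from the training formula."""
    return np.linalg.solve(features.T @ features, features.T @ target)

w = fit(X, y)          # weights on the original features
w_p = fit(X @ P, y)    # weights on the transformed features

a = X @ w              # original predictions
a_p = (X @ P) @ w_p    # predictions after the transformation
```

Up to floating-point error, `a` and `a_p` coincide, and `w_p` equals $P^{-1}w$.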

## Transformation Algorithm

**Algorithm**

1. Create a random square matrix $P$ whose size equals the number of features.
2. Check that $P$ is invertible.
3. Multiply the feature matrix by it: $X' = XP$.
4. Train the model on the original features and on the transformed features, and calculate the metric for each.
5. Compare the metrics. If they are equal (up to floating-point error), the task is completed.
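
The first three steps could look roughly like this. The helper name `random_invertible_matrix` and the condition-number threshold are assumptions for illustration, not part of the original notebook:

```python
import numpy as np

def random_invertible_matrix(size, rng, max_tries=100):
    """Return a random size x size matrix that is safely invertible.

    Hypothetical helper: the condition-number threshold is an assumption.
    """
    for _ in range(max_tries):
        candidate = rng.normal(size=(size, size))
        # A huge condition number means the matrix is numerically close to singular.
        if np.linalg.cond(candidate) < 1e6:
            return candidate
    raise RuntimeError("could not generate a well-conditioned matrix")

rng = np.random.default_rng(12345)
P = random_invertible_matrix(4, rng)

# Stand-in for the feature matrix: 5 objects, 4 features.
X = rng.normal(size=(5, 4))
X_transformed = X @ P
```

Checking the condition number instead of comparing the determinant with zero is a design choice: a determinant can be tiny yet non-zero, which would still make the inverse numerically unreliable.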

**Justification**

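We multiply the feature matrix by an arbitrary invertible matrix and train the model on the transformed data. As shown in the previous section, this does not change the predictions of linear regression, so the quality of the model stays the same. At the same time, each transformed column is a mixture of all original features, so the clients' personal data become difficult to interpret without knowing the matrix.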
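
The matrix acts as a key: whoever holds it can undo the transformation by multiplying by its inverse. A small sketch using three rows from the table above (the seed and the matrix are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)

# Rows in the order: gender, age, salary, family members.
X = np.array([
    [1, 41.0, 49600.0, 1],
    [0, 46.0, 38000.0, 1],
    [0, 29.0, 21000.0, 0],
])
P = rng.normal(size=(4, 4))   # the key; assumed to be invertible

X_protected = X @ P                             # transformed features
X_recovered = X_protected @ np.linalg.inv(P)    # requires knowing P
```

`X_recovered` matches `X` up to floating-point error, while `X_protected` no longer shows the original values.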

## Algorithm Testing

In [10]:
features = data.drop('Страховые выплаты', axis=1)
target = data['Страховые выплаты']

In [11]:
features_train, features_test, target_train, target_test = train_test_split(features, target,
                                                                            test_size=0.4, random_state=12345)

In [12]:
model = LinearRegression().fit(features_train, target_train)
predictions = model.predict(features_test)

In [13]:
rscore = r2_score(target_test, predictions)
print("The coefficient of determination is equal to ", rscore)

The coefficient of determination is equal to  0.4272661343811538


In [14]:
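# Transform the features.
# NOTE: this multiplies the feature matrix by its own transpose (an n x n result),
# not by a random invertible matrix P as described in the algorithm above.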
matrix = features.values @ features.values.T

In [15]:
matrix_train, matrix_test, target_m_train, target_m_test = train_test_split(matrix, target, 
                                                                            test_size=0.4, random_state=12345)


model = LinearRegression().fit(matrix_train, target_m_train)
predictions = model.predict(matrix_test)

In [16]:
rscore_2 = r2_score(target_m_test, predictions)
print("The coefficient of determination is equal to ", rscore_2)

The coefficient of determination is equal to  0.4237704125997034


<b>Conclusion:</b>

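The transformation protects the data at almost no cost in model quality: $R^2$ is 0.4273 on the original features and 0.4238 on the transformed ones. Note that the test above multiplies the features by their own transpose rather than by a random invertible matrix $P$. For an invertible $P$, the derivation in the Matrix Multiplication section shows that the predictions are identical, so any remaining difference would come only from floating-point arithmetic.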
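
For comparison, here is a sketch of the same test with a random invertible matrix, as the algorithm describes. It runs on synthetic data rather than the insurance dataset, and the helper `r2_for`, the weights, and the seed are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12345)

# Synthetic stand-in for the features and target (not the insurance data).
features = rng.normal(size=(1000, 4))
target = features @ np.array([0.1, 0.5, -0.2, 0.3]) + rng.normal(scale=0.5, size=1000)

# Random square matrix; invertibility is checked via the condition number.
P = rng.normal(size=(4, 4))
assert np.linalg.cond(P) < 1e6

def r2_for(matrix):
    """Train on 60% of the rows and score on the remaining 40%."""
    x_train, x_test, y_train, y_test = train_test_split(
        matrix, target, test_size=0.4, random_state=12345
    )
    model = LinearRegression().fit(x_train, y_train)
    return r2_score(y_test, model.predict(x_test))

r2_original = r2_for(features)
r2_transformed = r2_for(features @ P)
print(r2_original, r2_transformed)
```

On such data the two scores should agree to many decimal places, which is the behaviour the proof predicts.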