# Personal data protection

We need to protect the data of customers of the insurance company "Though the Flood". To do this, we will develop a data transformation method that makes it difficult to recover personal information from the transformed data, and we will justify that the method works correctly.

The protection must not degrade the quality of machine learning models trained on the transformed data. Selecting the best model is not required.

We will proceed in the following steps:

1) load the data and check it for correctness; if any errors or anomalies are found, we will correct them

2) prove theoretically the answer to the question posed below

3) develop a transformation algorithm

4) implement this algorithm and check it on a test sample

5) draw a general conclusion.

## Loading data

In [1]:
# load libraries
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

In [2]:
# read data
data = pd.read_csv('....csv')

In [3]:
# display first 10 rows
data.head(10)

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
0,1,41.0,49600.0,1,0
1,0,46.0,38000.0,1,1
2,0,29.0,21000.0,0,0
3,0,21.0,41700.0,2,0
4,1,28.0,26100.0,0,0
5,1,43.0,41000.0,2,1
6,1,39.0,39700.0,2,0
7,1,25.0,38600.0,4,0
8,1,36.0,49700.0,1,0
9,1,32.0,51700.0,1,0


In [4]:
# display data shape
data.shape

(5000, 5)

In [5]:
# display general information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


Examining the data shows the following:

1) the table has 5 columns and 5000 rows

2) the table has the following columns:

    - gender (`Пол`): int
    - age (`Возраст`): float
    - salary (`Зарплата`): float
    - family members (`Члены семьи`): int
    - insurance payments, i.e. their number (`Страховые выплаты`): int

In [6]:
# display column names
data.columns

Index(['Пол', 'Возраст', 'Зарплата', 'Члены семьи', 'Страховые выплаты'], dtype='object')

In [7]:
# rename the columns to English snake_case names
data = data.rename(columns={
    'Пол': 'gender',
    'Возраст': 'age',
    'Зарплата': 'salary',
    'Члены семьи': 'family_members',
    'Страховые выплаты': 'insurance_payments',
})

In [8]:
# missing values
data.isna().sum()

gender                0
age                   0
salary                0
family_members        0
insurance_payments    0
dtype: int64

In [9]:
#  look at the age column
data['age'].value_counts()

19.0    223
25.0    214
31.0    212
26.0    211
27.0    209
22.0    209
32.0    206
28.0    204
29.0    203
30.0    202
23.0    202
21.0    200
20.0    195
36.0    193
33.0    191
24.0    182
35.0    179
34.0    177
37.0    147
39.0    141
38.0    139
41.0    129
18.0    117
40.0    114
42.0     93
43.0     77
44.0     74
45.0     73
46.0     60
47.0     47
49.0     37
50.0     27
48.0     26
52.0     22
51.0     21
53.0     11
55.0      9
54.0      7
56.0      5
59.0      3
60.0      2
58.0      2
57.0      2
65.0      1
61.0      1
62.0      1
Name: age, dtype: int64

In [10]:
#  look at the gender column
data['gender'].value_counts()

0    2505
1    2495
Name: gender, dtype: int64

In [11]:
#  look at the salary column
data['salary'].describe()

count     5000.000000
mean     39916.360000
std       9900.083569
min       5300.000000
25%      33300.000000
50%      40200.000000
75%      46600.000000
max      79000.000000
Name: salary, dtype: float64

In [12]:
#  look at the family_members column
data['family_members'].value_counts()

1    1814
0    1513
2    1071
3     439
4     124
5      32
6       7
Name: family_members, dtype: int64

In [13]:
#  look at the insurance_payments column
data['insurance_payments'].value_counts()

0    4436
1     423
2     115
3      18
4       7
5       1
Name: insurance_payments, dtype: int64

In [14]:
# convert the age and salary columns from float64 to int64
data['age'] = data['age'].astype('int64')
data['salary'] = data['salary'].astype('int64')

In [15]:
# display duplicates
data.duplicated().sum()

153

The data contains 153 fully duplicated rows. Let's remove them.

In [16]:
# removal of duplicates
data.drop_duplicates(inplace=True)

In [17]:
# check removal of duplicates
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4847 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   gender              4847 non-null   int64
 1   age                 4847 non-null   int64
 2   salary              4847 non-null   int64
 3   family_members      4847 non-null   int64
 4   insurance_payments  4847 non-null   int64
dtypes: int64(5)
memory usage: 227.2 KB


### Conclusion

In this section, we read the data from the file, checked it, and did the following:

1) renamed the columns for convenience in later work

2) determined the size of the table (5 columns and 4847 rows after removing 153 duplicates) and reviewed the columns (gender, age, salary, family_members, insurance_payments); no other errors were found

3) confirmed that there are no missing values

4) changed the data type of the age and salary columns from float64 to int64 (ages are whole numbers).

## Matrix multiplication


Notation:

- $X$: feature matrix (the zeroth column consists of ones)

- $y$: target vector

- $P$: matrix by which the features are multiplied

- $w$: vector of linear regression weights (the zeroth element is the bias)

Predictions:

$$
a = Xw
$$

Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Formula for the weights:

$$
w = (X^T X)^{-1} X^T y
$$

**Answer:** when the features are multiplied by an invertible matrix, the quality of linear regression does not change.

**Rationale:** Let's justify this statement using the formula for the linear regression weights.
The derivation relies on the following properties of matrices (properties 1 and 3 hold for square invertible $A$ and $B$):

1) $(AB)^{-1} = B^{-1}A^{-1}$

2) $(AB)^T = B^T A^T$

3) $AA^{-1} = A^{-1}A = E$
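
The three properties can be spot-checked numerically. The snippet below is a minimal sketch on random matrices, not part of the original notebook; the seed, the matrix size and the names `A`, `B`, `prop1`, `prop2`, `prop3` are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, used only for reproducibility

# Two random square matrices; Gaussian matrices are invertible with probability 1
A = rng.normal(size=(4, 4))
B = rng.normal(size=(4, 4))

# Property 1: (AB)^{-1} = B^{-1} A^{-1}
prop1 = np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A))

# Property 2: (AB)^T = B^T A^T
prop2 = np.allclose((A @ B).T, B.T @ A.T)

# Property 3: A A^{-1} = E
prop3 = np.allclose(A @ np.linalg.inv(A), np.eye(4))

print(prop1, prop2, prop3)
```

The comparisons use `np.allclose` rather than exact equality because floating-point matrix inversion is only accurate up to rounding error.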

Predictions:

$$
a = Xw
$$

Linear regression weight vector:

$$
w = (X^T X)^{-1}X^T y
$$

Let the new feature matrix $X1$ be the product of the original feature matrix $X$ and an invertible matrix $P$:

$$
X1=XP
$$

Next, we substitute $X1$ into the formula for the weights and obtain the new weight vector $w1$:

$$
w1 = ((XP)^T(XP))^{-1}(XP)^Ty
$$

We expand the brackets step by step using the matrix properties above. First, apply property 2 to the transpose $(XP)^T$ inside the first bracket:

$$
w1 = (P^T X^T XP)^{-1}(XP)^Ty
$$

For clarity, we regroup the factors in brackets:

$$
w1 = (P^T(X^T X)P)^{-1}(XP)^Ty
$$

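Apply property 1 to the inverse $(P^T(X^TX)P)^{-1}$. This step requires every factor to be square and invertible: $P$ is square and invertible by assumption (and so is $P^T$), and $X^TX$ is invertible whenever the weight formula itself is defined: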

$$
w1 = P^{-1}(X^TX)^{-1}(P^T)^{-1}(XP)^Ty
$$

Expand the last bracket $(XP)^T$:

$$
w1 = P^{-1}(X^TX)^{-1}(P^T)^{-1}P^TX^Ty
$$


Since $P$ is invertible by assumption, $P^T$ is invertible as well, and by property 3 the product of a matrix and its inverse is the identity matrix $E$. Hence $(P^T)^{-1}P^T = E$, and the expression simplifies to:

$$
w1 = P^{-1}(X^TX)^{-1}EX^Ty = P^{-1}(X^TX)^{-1}X^Ty
$$

The right-hand side contains the formula for the original weight vector $w = (X^T X)^{-1}X^T y$, so:

$$
w1 = P^{-1}(X^TX)^{-1}X^Ty = P^{-1}w
$$

Substituting $X1$ and $w1$ into the prediction formula gives the prediction $a1$ of the new model:

$$
a1 = X1w1 = XPP^{-1}w
$$

Since $P$ is invertible, property 3 lets us replace $PP^{-1}$ with the identity matrix $E$:

$$
a1 = XPP^{-1}w = XEw = Xw = a
$$

Thus, we have proved that the predictions $a1$ obtained from the feature matrix multiplied by the invertible matrix $P$ are equal to the original predictions $a$. Identical predictions give identical values of any quality metric, so multiplying the features by an invertible matrix does not change the quality of the linear regression model. The new weights are related to the original ones by $w1 = P^{-1}w$. This lets us develop an algorithm for encrypting customers' personal data.
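
The result of the derivation can also be checked numerically. The sketch below uses synthetic data rather than the insurance data set and computes the weights directly from the formula $w = (X^T X)^{-1} X^T y$; the helper `fit_weights`, the seed and the array sizes are assumptions made for illustration. The derivation holds for any $X$ with an invertible $X^TX$, so the column of ones is omitted here.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, used only for reproducibility

# Synthetic data (not the insurance data set): 100 objects, 4 features
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)
P = rng.normal(size=(4, 4))  # random square matrix, invertible with probability 1


def fit_weights(features, target):
    """Weights from the formula w = (X^T X)^{-1} X^T y."""
    return np.linalg.inv(features.T @ features) @ features.T @ target


w = fit_weights(X, y)        # weights on the original features
w1 = fit_weights(X @ P, y)   # weights on the transformed features

# w1 = P^{-1} w, up to rounding error
weights_match = np.allclose(w1, np.linalg.inv(P) @ w)
# a1 = X1 w1 equals a = X w, up to rounding error
predictions_match = np.allclose((X @ P) @ w1, X @ w)

print(weights_match, predictions_match)
```

Both comparisons are made with `np.allclose`, since the two sides agree only up to floating-point rounding.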

## Conversion algorithm

**Algorithm**

The following transformation algorithm is proposed:

1) split the data into training and test samples

2) create a random square matrix and check that it is invertible

3) train the model on the original features and calculate r2_score

4) multiply the feature matrix by the random matrix

5) train the model on the transformed features and calculate r2_score

6) compare the two metric values.

If the r2_score values match, this algorithm can be used to encrypt client data.
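
Steps 2 to 6 can be sketched as a single function. This is an illustrative outline on synthetic data, not the notebook's own implementation; the function name `r2_before_and_after`, the variable `key` and the demo arrays are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


def r2_before_and_after(features, target, rng):
    """Return R2 on the original features and on features multiplied by a random matrix."""
    n_features = features.shape[1]
    key = rng.normal(size=(n_features, n_features))  # step 2: random square matrix
    np.linalg.inv(key)  # raises numpy.linalg.LinAlgError if the matrix is exactly singular

    model = LinearRegression().fit(features, target)  # step 3
    r2_original = r2_score(target, model.predict(features))

    transformed = features @ key  # step 4
    model = LinearRegression().fit(transformed, target)  # step 5
    r2_transformed = r2_score(target, model.predict(transformed))
    return r2_original, r2_transformed


rng = np.random.default_rng(12345)
# synthetic stand-in for the real features and target
demo_features = rng.normal(size=(200, 4))
demo_target = demo_features @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=200)

r2_original, r2_transformed = r2_before_and_after(demo_features, demo_target, rng)
print(r2_original, r2_transformed)  # step 6: the two values should agree up to rounding error
```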

**Rationale**

Based on the theoretical proof above, we expect the proposed transformation to leave the metric unchanged. To test this, we compare the r2_score obtained on the original features with the r2_score obtained on the encrypted features (the original features multiplied by the random matrix). If the values coincide or differ only by rounding error, the experiment agrees with the proof, and the algorithm can be used to encrypt data.

## Algorithm verification

In [18]:
# split the data
features = data.drop('insurance_payments', axis=1)
target = data['insurance_payments']

In [19]:
# split the resulting sample into training and test
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.25, random_state=12345)

In [20]:
# check that the sizes of features and target features in the samples are the same
print(f'Training sample: {features_train.shape}, {target_train.shape}')
print(f'Test sample: {features_test.shape}, {target_test.shape}')

Training sample: (3635, 4), (3635,)
Test sample: (1212, 4), (1212,)


In [21]:
# create a random matrix
matrix = np.random.normal(size = (features.shape[1], features.shape[1]))

In [22]:
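# check invertibility: np.linalg.inv raises LinAlgError for a singular matrix
# note: the matrix is replaced by its inverse here, which is itself a random invertible matrix
matrix = np.linalg.inv(matrix)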
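
`np.linalg.inv` raises an error only for an exactly singular matrix, so a condition-number threshold is one possible stricter check. The sketch below is an alternative under that assumption, not the notebook's method; the helper `random_invertible_matrix`, the threshold `max_condition` and the names `key` and `key_inverse` are hypothetical. It also keeps the matrix and its inverse in separate variables, which makes it possible to restore the original values.

```python
import numpy as np


def random_invertible_matrix(n, rng, max_condition=1e3):
    """Draw n x n Gaussian matrices until one is well-conditioned enough to invert."""
    while True:
        candidate = rng.normal(size=(n, n))
        # np.linalg.cond is very large (or inf) for singular or near-singular matrices
        if np.linalg.cond(candidate) < max_condition:
            return candidate


rng = np.random.default_rng(12345)
key = random_invertible_matrix(4, rng)
key_inverse = np.linalg.inv(key)  # stored separately; `key` itself is not overwritten

# multiplying by the inverse restores the original values up to rounding error
sample = np.array([[1.0, 41.0, 49600.0, 1.0]])  # first row of the table shown earlier
restored = sample @ key @ key_inverse
print(np.allclose(restored, sample))
```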

In [23]:
# train the model on the original, unencrypted features
model = LinearRegression()
model.fit(features_train, target_train)
predictions_train = model.predict(features_train)
r2_score_train = r2_score(target_train, predictions_train)
print(f'r2_score on the original, unencrypted features: {r2_score_train}')

r2_score on the original, unencrypted features: 0.43215820265809746


In [24]:
# multiply the training features by the random matrix
transform_features_train = features_train @ matrix

In [25]:
# train the model on new, encrypted features
model.fit(transform_features_train, target_train)
predictions_train_transform = model.predict(transform_features_train)
r2_score_train_transform = r2_score(target_train, predictions_train_transform)
print(f'r2_score on the new, encrypted features: {r2_score_train_transform}')

r2_score on the new, encrypted features: 0.43215820265810523


In [26]:
# difference between the two r2_score values
print('Difference between the r2_score values', r2_score_train - r2_score_train_transform)

Difference between the r2_score values -7.771561172376096e-15


**Conclusion:** as we can see, there is practically no difference between the r2_score on the original features and the r2_score on the features multiplied by a random invertible matrix; the difference is on the order of floating-point rounding error. This confirms in practice the theoretical proof that multiplying the features by an invertible matrix does not change the quality of linear regression.

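### Checking on a test set

Let's repeat the check on the test sample. Note that the model below is re-fit on the test sample and scored on that same sample, so this verifies the invariance on a second data set rather than the model's ability to generalize.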

In [27]:
model = LinearRegression()
model.fit(features_test, target_test)
predictions_test = model.predict(features_test)
r2_score_test = r2_score(target_test, predictions_test)
print(f'r2_score on the original, unencrypted features: {r2_score_test}')

r2_score on the original, unencrypted features: 0.425552801076332


In [28]:
# multiply the test features by the same random matrix
transform_features_test = features_test @ matrix

In [29]:
# fit the model on the encrypted test features
model.fit(transform_features_test, target_test)
predictions_test_transform = model.predict(transform_features_test)
r2_score_test_transform = r2_score(target_test, predictions_test_transform)
print(f'r2_score on the new, encrypted features: {r2_score_test_transform}')

r2_score on the new, encrypted features: 0.4255528010763243


In [30]:
# difference between the two r2_score values
print('Difference between the r2_score values', r2_score_test - r2_score_test_transform)

Difference between the r2_score values 7.66053886991358e-15


As we can see, on the test sample the r2_score values also coincide, and the difference between them is again on the order of rounding error. This confirms once more that multiplying the features by an invertible matrix does not change the quality of linear regression. Our algorithm can be used to encrypt customer data.
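
The cells above re-fit the model on the test sample. A complementary check is to fit on the training sample only and score on the held-out test sample, applying the same matrix to both. The sketch below does this on synthetic data; the arrays, the seed and the names `r2_plain` and `r2_masked` are assumptions made for illustration, and the numbers it prints are not results for the insurance data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12345)

# synthetic stand-in for the insurance data: 1000 objects, 4 features
X = rng.normal(size=(1000, 4))
y = X @ np.array([0.5, 2.0, -1.0, 0.3]) + rng.normal(size=1000)
P = rng.normal(size=(4, 4))  # the same matrix is applied to both samples

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# fit on the training sample only, score on the held-out test sample
plain_model = LinearRegression().fit(X_train, y_train)
r2_plain = r2_score(y_test, plain_model.predict(X_test))

masked_model = LinearRegression().fit(X_train @ P, y_train)
r2_masked = r2_score(y_test, masked_model.predict(X_test @ P))

print(r2_plain, r2_masked)
```

Under the proof above the two values should agree up to rounding error, because the predictions on the test sample are the same with and without the transformation.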

## General conclusion

In the course of this work, we:

1) analyzed the data and found it to be in order, except for duplicates, which we removed

2) put forward the hypothesis that multiplying the features by an invertible matrix does not change the quality of linear regression

3) proved this statement theoretically

4) developed an algorithm that makes it possible to verify the statement in practice

5) prepared the data for training models

6) trained the models and checked their behavior when the original features are multiplied by a random invertible matrix

7) obtained matching r2_score values and thus confirmed in practice that multiplying the features by an invertible matrix does not change the quality of linear regression

8) repeated the check on the test sample and obtained the same result.

Thus, multiplying the features by an invertible matrix does not change the quality of linear regression, and the algorithm we developed can be used to encrypt customer data.