# Protection of clients' personal data

It is necessary to protect the data of clients of the insurance company "Though the Flood". 

<u>*Goal*</u>

Develop a method for converting data so that it is difficult to recover personal information from it. Justify the correctness of its operation.

It is necessary to protect the data so that the quality of machine learning models does not deteriorate during conversion. There is no need to select the best model.

<u>*Tasks*</u>
1. Download and explore data.

2. Answer the question and justify the decision. 

 The features are multiplied by an invertible matrix. Will the quality of linear regression change? (The model can be retrained.)
 
 a. Will change. Give examples of matrices.
 
 b. Will not change. Indicate how the linear regression parameters in the original problem and in the transformed one are related.
 
 3. Propose a data transformation algorithm to solve the problem. Justify why the quality of linear regression will not change.
 
 4. Program this algorithm using matrix operations. Check that the quality of the linear regression from sklearn does not differ before and after the transformation. Use the R2 metric.

<u>Features</u>: Пол (gender), Возраст (age), Зарплата (salary of the insured), Члены семьи (number of family members).

<u>Target</u>: Страховые выплаты (number of insurance payments to the client over the last 5 years).

<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Loading-Data" data-toc-modified-id="Loading-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Loading Data</a></span></li><li><span><a href="#Matrix-multiplication" data-toc-modified-id="Matrix-multiplication-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Matrix multiplication</a></span></li><li><span><a href="#Conversion-algorithm" data-toc-modified-id="Conversion-algorithm-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conversion algorithm</a></span><ul class="toc-item"><li><span><a href="#Algorithm" data-toc-modified-id="Algorithm-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Algorithm</a></span></li><li><span><a href="#Rationale" data-toc-modified-id="Rationale-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Rationale</a></span></li></ul></li><li><span><a href="#Algorithm-verification" data-toc-modified-id="Algorithm-verification-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Algorithm verification</a></span></li></ul></div>

## Loading Data

In [1]:
# import the libraries
import warnings
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [2]:
# load the dataset
df = pd.read_csv('C:/Users/hp/Documents/data_science/GitHub/Yandex_project/Датасеты/insurance.csv')
display(df.head(10))

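| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |
| 5 | 1 | 43.0 | 41000.0 | 2 | 1 |
| 6 | 1 | 39.0 | 39700.0 | 2 | 0 |
| 7 | 1 | 25.0 | 38600.0 | 4 | 0 |
| 8 | 1 | 36.0 | 49700.0 | 1 | 0 |
| 9 | 1 | 32.0 | 51700.0 | 1 | 0 |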


In [3]:
# dataset statistics 
df.describe()

| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.499 | 30.9528 | 39916.36 | 1.1942 | 0.148 |
| std | 0.500049 | 8.440807 | 9900.083569 | 1.091387 | 0.463183 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


In [4]:
# share of missing values in each column
df.isnull().mean()

Пол                  0.0
Возраст              0.0
Зарплата             0.0
Члены семьи          0.0
Страховые выплаты    0.0
dtype: float64

In [5]:
# information on the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


We loaded the data and checked for missing values; there are none. We also displayed summary statistics and the column types. The dataset has 5 columns and 5000 rows, and all values are numeric. The gender column takes the values 0 and 1. Age ranges from 18 to 65 with a mean of about 31. Salary ranges from 5300 to 79000 with a mean of about 39900 (median 40200). The number of family members ranges from 0 to 6 with a mean of about 1.2. The number of insurance payments ranges from 0 to 5.

## Matrix multiplication

Notation:

- $X$: feature matrix (the zeroth column consists of ones)

- $y$: target vector

- $P$: matrix by which the features are multiplied

- $w$: vector of linear regression weights (the zeroth element is the intercept)

Predictions:

$$
a = Xw
$$

Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
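
As a minimal sketch of the training formula above, the snippet below fits the weights on a small synthetic dataset. The function name `fit_normal_equation` and the toy data are assumptions made for illustration; they are not part of the project.

```python
import numpy as np

def fit_normal_equation(features, target):
    """Return w = (X^T X)^{-1} X^T y, where X is `features` with a leading column of ones."""
    X = np.column_stack([np.ones(len(features)), features])
    # solve (X^T X) w = X^T y instead of forming the inverse explicitly
    return np.linalg.solve(X.T @ X, X.T @ target)

# toy data generated from y = 1 + 2*x1 - 3*x2 without noise (an assumption)
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(50, 2))
toy_target = 1 + 2 * toy_features[:, 0] - 3 * toy_features[:, 1]

w = fit_normal_equation(toy_features, toy_target)
print(w)  # approximately [1, 2, -3]: intercept first, then the two slopes
```

Solving the linear system gives the same weights as the explicit inverse in the formula and is usually the more numerically stable choice.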

**Answer:** The quality will not change.

**Rationale:** To prove this, multiply the features by an invertible matrix $P$. Let the feature matrix $X$ have the same dimensions as the original dataset, $5000 \times 4$, and let $P$ be $4 \times 4$. The transformed feature matrix is $XP$, so the prediction formula becomes:
$$
a' = XPw'
$$
Substituting the training formula for the matrix $XP$ into the prediction formula gives:
$$
a' = XP((XP)^T(XP))^{-1}(XP)^T y
$$
We will use the following matrix properties (for square invertible $A$ and $B$):
$$
(AB)^{-1} = B^{-1}A^{-1}
$$

$$
(AB)^T = B^TA^T
$$

$$
AA^{-1} = E
$$

$$
AE = EA = A
$$

Expanding the transposes with the properties above:

$$
a' = XP(P^TX^TXP)^{-1}P^TX^T y
$$

The matrices $P^T$, $X^TX$ and $P$ are all square and invertible, so the inverse of their product is the product of their inverses in reverse order:

$$
a' = XPP^{-1}(X^TX)^{-1}(P^T)^{-1}P^TX^Ty
$$

The products $PP^{-1}$ and $(P^T)^{-1}P^T$ are identity matrices, so they drop out and what remains is:

$$
a' = X(X^TX)^{-1}X^Ty
$$

This is the original prediction formula, so the predictions do not change:

$$
a' = Xw = a
$$

The linear regression parameters in the original and transformed problems are related as follows:

$$
w' = ((XP)^TXP)^{-1}(XP)^Ty
$$
$$
w' = (P^T(X^TX)P)^{-1}P^TX^Ty
$$
$$
w' = P^{-1}(X^TX)^{-1}(P^T)^{-1}P^TX^Ty
$$
$$
w' = P^{-1}(X^TX)^{-1}X^Ty
$$
$$
w' = P^{-1}w
$$
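
The relation between the two weight vectors can be checked numerically. The sketch below uses synthetic data and a fixed matrix `P` chosen only for illustration (both are assumptions, not the project data); it compares the weights of the transformed problem with $P^{-1}w$ and the two sets of predictions.

```python
import numpy as np

def normal_equation(A, b):
    """Least-squares weights from the training formula (A^T A)^{-1} A^T b."""
    return np.linalg.inv(A.T @ A) @ A.T @ b

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))   # synthetic feature matrix
y = rng.normal(size=100)        # synthetic target vector

# fixed, strictly diagonally dominant matrix, hence invertible
P = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 3.0, 1.0, 0.0],
              [0.0, 1.0, 4.0, 1.0],
              [0.0, 0.0, 1.0, 5.0]])

w = normal_equation(X, y)            # weights for the original features
w_prime = normal_equation(X @ P, y)  # weights for the transformed features

# compare with a tolerance, since the results differ by rounding error
weights_related = np.allclose(w_prime, np.linalg.inv(P) @ w)
predictions_equal = np.allclose(X @ P @ w_prime, X @ w)
print(weights_related, predictions_equal)
```

Both comparisons use `np.allclose` rather than exact equality because the two computations follow different floating-point paths.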


## Conversion algorithm

### Algorithm

Multiply the features of the original dataset by a random invertible matrix of size $4 \times 4$. Train a linear regression model on the data before the transformation and calculate the R2 metric. Then calculate the R2 score of a linear regression trained on the features multiplied by the invertible matrix. The two results should be the same.
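
One way to package the transformation is sketched below. The helper names `generate_key`, `transform` and `restore` are hypothetical, and the rank check is just one possible invertibility test. The three sample rows are taken from the head of the dataset shown earlier.

```python
import numpy as np

def generate_key(n_features, rng):
    """Draw random square matrices until one has full rank (is invertible)."""
    while True:
        key = rng.normal(size=(n_features, n_features))
        if np.linalg.matrix_rank(key) == n_features:
            return key

def transform(features, key):
    """Mask the features by multiplying them by the key matrix."""
    return features @ key

def restore(transformed, key):
    """Recover the original features with the inverse of the key matrix."""
    return transformed @ np.linalg.inv(key)

rng = np.random.default_rng(1)
# first three rows of the dataset: gender, age, salary, family members
toy = np.array([[1, 41.0, 49600.0, 1],
                [0, 46.0, 38000.0, 1],
                [0, 29.0, 21000.0, 0]])

key = generate_key(toy.shape[1], rng)
masked = transform(toy, key)
recovered = restore(masked, key)
print(np.allclose(recovered, toy, atol=1e-6))
```

Only the holder of the key matrix can undo the transformation, which is why the matrix has to be kept separately from the masked data.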

### Rationale

In [6]:
# split the dataset into features and target
features = df.drop('Страховые выплаты', axis=1)
target = df['Страховые выплаты']
# check the sample sizes
print('Size of samples with features:', features.shape)
print('Size of samples with target attribute:', target.shape)

Size of samples with features: (5000, 4)
Size of samples with target attribute: (5000,)


In [7]:
# train linear regression on the original features
model = LinearRegression()
model.fit(features, target)
predictions = model.predict(features)
r2 = r2_score(target, predictions)
# print the R2 metric for the data before multiplication
print(f'The R2 metric for features in the initial state is equal to {r2}')

The R2 metric for features in the initial state is equal to 0.4249455028666801


## Algorithm verification

In [8]:
# generate a random 4x4 matrix (its invertibility is checked in the next cell)
matrix = np.random.randn(4, 4)

# display the matrix
display(matrix)

array([[-1.04972753, -0.50175608, -0.08553739,  0.21028805],
       [-0.99621138, -0.45652665,  0.38250587,  0.3052428 ],
       [ 0.01057047, -1.50215932, -0.53459244,  1.24041408],
       [ 0.42677732,  2.63490757, -0.19074775, -0.51384911]])

In [9]:
# compute the inverse (np.linalg.inv raises LinAlgError for a singular matrix)
matrix_b = np.linalg.inv(matrix)
# the product of the matrix and its inverse should be the identity matrix
matrix_1 = matrix @ matrix_b
# display the inverse and the rounded product
display(matrix_b)
display((matrix_1).round())

array([[-0.91870842, -0.08425907,  0.12825249, -0.11642851],
       [-0.17483477,  0.39666603,  0.13605692,  0.49251951],
       [-1.81458896,  1.84862134, -0.20651608, -0.14298649],
       [-0.9859482 ,  1.27780331,  0.88085227,  0.53581622]])

array([[ 1., -0.,  0.,  0.],
       [-0.,  1.,  0., -0.],
       [ 0., -0.,  1., -0.],
       [-0.,  0.,  0.,  1.]])

In [10]:
# alternative: draw random matrices in a loop until the determinant is non-zero
data = matrix.copy()
def get_rand_matrix():
    det = 0
    # the loop repeats only if a singular matrix is drawn
    while det == 0:
        # square matrix with one row and one column per feature
        matrix = np.random.normal(size=(data.shape[1], data.shape[1]))
        det = np.linalg.det(matrix)
    return matrix

P = get_rand_matrix()
print('Matrix P:\n', P)

print('\nTesting for identity matrix:')
display((P @ np.linalg.inv(P)).round(5))

Matrix P:
 [[-0.86205457  0.24583378  0.31897699  0.76149539]
 [ 0.67008099 -0.73145595  1.1385825  -0.33620908]
 [-0.92195419  1.03098871 -1.25736139 -0.60976184]
 [-0.49204387  0.0024066   0.05212793  0.02571277]]

Testing for identity matrix:


array([[ 1.,  0., -0.,  0.],
       [ 0.,  1.,  0., -0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

The product of the matrix and its inverse is the identity matrix, so the matrix is invertible.
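
A non-zero determinant is a weak test in floating-point arithmetic. The sketch below contrasts it with a condition-number check; the helper `is_well_conditioned` and the threshold `1e8` are assumptions chosen for illustration, not a standard.

```python
import numpy as np

def is_well_conditioned(matrix, max_cond=1e8):
    """Treat a matrix as safely invertible if its condition number is moderate."""
    return np.linalg.cond(matrix) < max_cond

# tiny determinant (about 1e-12) but perfectly conditioned
scaled_identity = 1e-3 * np.eye(4)

# non-zero determinant (about 1e-14) but two almost identical rows
nearly_singular = np.eye(4)
nearly_singular[3] = [0.0, 0.0, 1.0, 1e-14]

for name, m in [("scaled_identity", scaled_identity),
                ("nearly_singular", nearly_singular)]:
    # columns: name, determinant is non-zero, condition number is acceptable
    print(name, np.linalg.det(m) != 0, is_well_conditioned(m))
```

Both matrices pass the determinant test, but only the first one can be inverted without a large loss of precision.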

In [11]:
# multiply the original features by our matrix
df_pr = features @ matrix

# display the first rows of the transformed features
df_pr.head()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 482.82758 | -74523.686793 | -26500.378333 | 61536.749633 |
| 1 | 356.278825 | -57100.419539 | -20297.108018 | 47149.262261 |
| 2 | 193.089691 | -31558.585026 | -11215.34847 | 26057.547667 |
| 3 | 420.721618 | -62644.360955 | -22284.853423 | 51730.649429 |
| 4 | 246.94556 | -39219.642796 | -13942.237933 | 32383.564507 |


In [12]:
# train linear regression on the transformed features
model_pr = LinearRegression()
model_pr.fit(df_pr, target)
predictions_pr = model_pr.predict(df_pr)
r2_pr = r2_score(target, predictions_pr)
# print the R2 metric for the data after multiplication by an invertible matrix
print(f'The R2 metric for features in a changed state is equal to {r2_pr}')

The R2 metric for features in a changed state is equal to 0.4249455028666774
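
The two printed scores differ only in the last digits, so it is safer to compare them with a tolerance than by eye. The sketch below does this for the two values printed in this notebook and repeats the experiment on synthetic data. The helpers `r2` and `fit_predict`, the synthetic data and the matrix `P` are all assumptions made for illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

def fit_predict(features, target):
    """Least-squares fit with an intercept column, returning in-sample predictions."""
    X = np.column_stack([np.ones(len(features)), features])
    w, *_ = np.linalg.lstsq(X, target, rcond=None)
    return X @ w

# the two scores printed in this notebook
scores_match = np.isclose(0.4249455028666801, 0.4249455028666774)

# the same experiment on synthetic data
rng = np.random.default_rng(7)
features = rng.normal(size=(200, 4))
target = features @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(size=200)
# fixed, strictly diagonally dominant matrix, hence invertible
P = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 3.0, 1.0, 0.0],
              [0.0, 1.0, 4.0, 1.0],
              [0.0, 0.0, 1.0, 5.0]])

r2_before = r2(target, fit_predict(features, target))
r2_after = r2(target, fit_predict(features @ P, target))
print(scores_match, np.isclose(r2_before, r2_after))
```

The intercept column is added after the multiplication, so only the features themselves are masked.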


Conclusion:

No changes were made to the data during preparation, since the dataset has no missing values and the column types are appropriate.

We studied theoretically whether the R2 metric changes when the original features are multiplied by an invertible matrix. It does not; the derivation is in section 2, Matrix multiplication.

The R2 score on the original data is 0.4249455028666801.

The R2 score on the features multiplied by an invertible matrix is 0.4249455028666774.

The two values agree to within floating-point error, which confirms that the quality of linear regression from sklearn does not differ before and after the transformation.
