# Protection of personal data of clients

You need to protect the data of customers of the insurance company "Though the Flood". Develop a data transformation method that makes it difficult to recover personal information from the transformed data. Justify that the method works correctly.

You need to protect the data so that the quality of the machine learning models does not deteriorate during the transformation. There is no need to select the best model.

<font size = 3><b>Data Description</b></font>
- <b>Features</b>: sex, age and salary of the insured person, and the number of their family members.
- <b>Target feature</b>: the number of insurance payments to the client over the past 5 years.

## Loading data

In [1]:
#import the necessary libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [2]:
#load our dataset
df = pd.read_csv('/datasets/insurance.csv')

In [3]:
#look at the data
df.sample(5)

| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| 4251 | 0 | 18.0 | 25500.0 | 2 | 0 |
| 3011 | 1 | 36.0 | 68500.0 | 1 | 0 |
| 3828 | 0 | 32.0 | 32000.0 | 0 | 0 |
| 1357 | 0 | 29.0 | 38600.0 | 2 | 0 |
| 3060 | 0 | 20.0 | 36600.0 | 1 | 0 |


In [4]:
#and see general information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


We have 3 `int64` columns and 2 `float64` columns. In the sample above, the values in the `Возраст` (age) and `Зарплата` (salary) columns have zero fractional parts, so we can convert these columns to an integer type as well.

In [5]:
df = df.astype(int)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Пол                5000 non-null   int32
 1   Возраст            5000 non-null   int32
 2   Зарплата           5000 non-null   int32
 3   Члены семьи        5000 non-null   int32
 4   Страховые выплаты  5000 non-null   int32
dtypes: int32(5)
memory usage: 97.8 KB
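Since `astype(int)` silently truncates fractional parts, a check along the following lines could confirm that nothing is lost by the cast. This is only a sketch on a toy frame: the `sample` DataFrame and the helper name `is_integer_valued` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy data standing in for the float columns of the real dataset (illustrative values).
sample = pd.DataFrame({"Возраст": [18.0, 36.0, 32.0],
                       "Зарплата": [25500.0, 68500.0, 32000.0]})

def is_integer_valued(frame: pd.DataFrame) -> bool:
    """Return True if every value equals its rounded counterpart."""
    return bool((frame == frame.round()).all().all())

# Cast only when no fractional part would be dropped.
if is_integer_valued(sample):
    sample = sample.astype(int)

print(sample.dtypes)
```

The exact integer width (`int32` or `int64`) depends on the platform, which is why the notebook output above shows `int32`.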


In [6]:
#look at the number of duplicates
print('Number of duplicates:', df.duplicated().sum(),
      'which is', (df.duplicated().sum()/len(df))*100, 'percent of the dataset.')

Number of duplicates: 153 which is 3.06 percent of the dataset.


Because the data is anonymized, we cannot tell whether these rows are true duplicates or coincidental matches. There are few of them, so it is safer to delete them; removing about 3% of the rows should not noticeably affect the results of the study.

In [7]:
df.drop_duplicates(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4847 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Пол                4847 non-null   int32
 1   Возраст            4847 non-null   int32
 2   Зарплата           4847 non-null   int32
 3   Члены семьи        4847 non-null   int32
 4   Страховые выплаты  4847 non-null   int32
dtypes: int32(5)
memory usage: 132.5 KB


In [8]:
#now check the dataset for gaps
df.isna().sum()

Пол                  0
Возраст              0
Зарплата             0
Члены семьи          0
Страховые выплаты    0
dtype: int64

Let's look at a table with descriptive statistics and see if there are any anomalies in the data.

In [9]:
display(df.describe())

| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| count | 4847.0 | 4847.0 | 4847.0 | 4847.0 | 4847.0 |
| mean | 0.498453 | 31.023932 | 39895.811223 | 1.203425 | 0.152259 |
| std | 0.500049 | 8.487995 | 9972.952441 | 1.098664 | 0.468934 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33200.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


Based on the table, there are no anomalies in the data.
- The "Gender" column (`Пол`) contains only two values, 0 and 1, which encode the two sexes.
- The "Age" column (`Возраст`) contains ages from 18 to 65, which corresponds to the minimum and maximum age for applying for most insurance products.
- The "Salary" column (`Зарплата`) contains values from 5,300 to 79,000 rubles, which suggests that the data was either collected many years ago or mainly covers small regions with a low level of income.
- The "Family Members" column (`Члены семьи`) contains values from 0 to 6, which is also plausible.
- The "Insurance payments" column (`Страховые выплаты`) holds the number of payments over the past 5 years. Even the third quartile is 0, the mean is only about 0.152, and the maximum is 5, which indicates that most people do not apply for insurance payments.

In [10]:
#let's look at the number of insurance payments in more detail
df['Страховые выплаты'].value_counts()

0    4284
1     423
2     114
3      18
4       7
5       1
Name: Страховые выплаты, dtype: int64

### Section Conclusion
1. The original dataset has 5000 rows and 5 columns; 4847 rows remain after removing duplicates.
2. The entire dataset has been converted to `int32` format.
3. 153 duplicates were found in the dataset and subsequently removed.
4. There are no gaps in the dataset.

<b>Our dataset is ready for research.</b>

## Matrix multiplication

Notation:

- $X$ - feature matrix (the zeroth column consists of ones)

- $y$ - target feature vector

- $P$ - the matrix by which the features are multiplied

- $w$ - vector of linear regression weights (the zeroth element is the shift)

Predictions:

$$
a = Xw
$$

Learning task:

$$
w = \arg\min_w MSE(Xw, y)
$$

Learning Formula:

$$
w = (X^T X)^{-1} X^T y
$$
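As a quick illustration of the learning formula, here is a minimal NumPy sketch on synthetic data. The data, the true weights and the shift of 4.0 are made up for the example; the normal equations are solved with `np.linalg.solve` instead of forming the inverse explicitly, which gives the same $w$ when $X^T X$ is invertible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 objects, 3 features (illustrative only).
features = rng.normal(size=(50, 3))
y = features @ np.array([1.5, -2.0, 0.5]) + 4.0  # exact linear target with shift 4.0

# X gets a leading column of ones so that w[0] plays the role of the shift.
X = np.hstack([np.ones((features.shape[0], 1)), features])

# w = (X^T X)^{-1} X^T y, obtained by solving (X^T X) w = X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)

print(w)  # should be close to the shift followed by the three true weights
```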

**Question:**
The features are multiplied by an invertible matrix. Will the quality of linear regression change?

**Answer:**
No, the quality of the linear regression will not change.

**Rationale:**

Let's create a new feature matrix from our old $X$ matrix multiplied by the invertible $P$ matrix:

$$
X_p = XP
$$

Now let's derive the new weight vector. We use the properties $(AB)^T = B^T A^T$ and $(ABC)^{-1} = C^{-1} B^{-1} A^{-1}$ (applicable here because $P^T$, $X^T X$ and $P$ are square and invertible), as well as the identity matrix $E$:

$$
\begin{aligned}
w_p &= ((XP)^T XP)^{-1} (XP)^T y \\
&= (P^T X^T X P)^{-1} P^T X^T y \\
&= (P^T (X^T X) P)^{-1} P^T X^T y \\
&= P^{-1}(X^T X)^{-1}(P^T)^{-1} P^T X^T y \\
&= P^{-1}(X^T X)^{-1} E X^T y \\
&= P^{-1}(X^T X)^{-1} X^T y
\end{aligned}
$$

The final formula is $w_p = P^{-1}(X^T X)^{-1}X^T y$, where $(X^T X)^{-1}X^T y$ is exactly the learning formula for $w$.

So $w_p = P^{-1}w$.

Now let's build the prediction from the derived formula: $a_p = X_p w_p = X P P^{-1} w$.

$P P^{-1}$ equals the identity matrix $E$, so $a_p = XEw = Xw$.

Therefore $a_p = a$.
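The derivation can also be checked numerically. The sketch below uses synthetic data and a random Gaussian matrix as $P$ (all values are illustrative, and the helper name `fit` is an assumption for this example); it compares $w_p$ with $P^{-1}w$ and the two prediction vectors up to floating-point tolerance.

```python
import numpy as np

rng = np.random.default_rng(42)

X = rng.normal(size=(100, 4))   # synthetic feature matrix
y = rng.normal(size=100)        # synthetic target
P = rng.normal(size=(4, 4))     # random square matrix

# Guard: the argument only applies when P is invertible (full rank).
assert np.linalg.matrix_rank(P) == 4

def fit(X, y):
    """Solve the normal equations (X^T X) w = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

w = fit(X, y)        # weights on the original features
w_p = fit(X @ P, y)  # weights on the transformed features

print(np.allclose(w_p, np.linalg.inv(P) @ w))  # checks w_p = P^{-1} w
print(np.allclose(X @ w, (X @ P) @ w_p))       # checks a_p = a
```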

### Section Conclusion:
Multiplying the features by an invertible matrix leaves the linear regression predictions unchanged, so the quality of the model does not change.

## Conversion algorithm

**Algorithm**

To protect the information, we multiply the feature matrix by a randomly generated invertible matrix $P$.

Algorithm steps:
1. Create a random square matrix whose size equals the number of features.
2. Check that the matrix is invertible.
3. Multiply the feature matrix by this random invertible matrix.
4. Train and apply the model on the transformed features.
5. Compare the R2 metric on the original features with the R2 metric on the transformed features.

**Rationale**

As shown in the previous section, multiplying the features by an invertible matrix does not change the predictions of linear regression, so it should not degrade the quality of the model.
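Steps 1 to 3 could be sketched as follows. This is one possible variant, not the notebook's own code: the function names, the stand-in data and the `max_cond` threshold are assumptions for illustration. It checks invertibility through the condition number, since a nearly singular matrix can still pass `np.linalg.inv` without an error while being numerically unreliable.

```python
import numpy as np

def make_invertible_matrix(n, rng, max_cond=1e6):
    """Draw random n x n matrices until one is safely invertible.

    max_cond is a hypothetical threshold on the condition number.
    """
    while True:
        P = rng.normal(size=(n, n))
        # A moderate condition number means the inverse is numerically usable.
        if np.linalg.cond(P) < max_cond:
            return P

def transform(features, P):
    """Multiply the feature matrix by P (step 3 of the algorithm)."""
    return np.asarray(features, dtype=float) @ P

rng = np.random.default_rng(12345)
features = rng.integers(0, 100, size=(6, 4))  # stand-in for the real features
P = make_invertible_matrix(features.shape[1], rng)
protected = transform(features, P)

print(protected.shape)
```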

## Algorithm verification

In [11]:
#select the features and the target feature from the dataset
features = df.drop('Страховые выплаты', axis=1)
target = df['Страховые выплаты']

In [12]:
#divide into training and test sets
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=12345)
print(features_train.shape)
print(features_test.shape)

(3635, 4)
(1212, 4)


In [13]:
def score(features, target):
    """Fit a linear regression on the given sample and return R2 on that same sample."""
    model = LinearRegression()
    model.fit(features, target)
    predictions = model.predict(features)
    return r2_score(target, predictions)

In [14]:
print('R2_score of the standard model:', score(features_train, target_train))

R2_score of the standard model: 0.43215820265809757


Let's create a random 4x4 matrix because we have 4 features and the invertible matrix must be square.

In [15]:
def rnd_matrix(n):
    """Return a random n x n matrix, redrawing until it can be inverted."""
    while True:
        random_matrix = np.random.normal(size=(n, n))
        try:
            np.linalg.inv(random_matrix)  # raises LinAlgError if the matrix is singular
        except np.linalg.LinAlgError:  # no inverse exists, so draw another matrix
            continue
        return random_matrix

In [16]:
invert_matrix = rnd_matrix(features_train.shape[1])  # one row and column per feature (4 here)

The matrix is invertible. Now let's multiply the feature matrix by our random matrix $P$.

In [17]:
features_train_matrix = features_train @ invert_matrix
features_test_matrix = features_test @ invert_matrix
print(features_train_matrix.shape)
print(features_test_matrix.shape)

(3635, 4)
(1212, 4)


In [18]:
# let's see what the transformed matrices look like
display(features_train_matrix.sample(5))
display(features_test_matrix.sample(5))

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 3276 | -21252.989164 | 7897.305473 | -21162.869576 | -13262.665168 |
| 1181 | -16911.058479 | 6245.53262 | -16835.978179 | -10552.268596 |
| 1536 | -31165.140885 | 11577.573899 | -31034.47795 | -19448.268589 |
| 4563 | -36040.458276 | 13438.258338 | -35896.378786 | -22492.177977 |
| 3645 | -16358.61773 | 6075.479714 | -16287.603514 | -10208.146652 |


| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 561 | -25123.408109 | 9361.850884 | -25021.531045 | -15678.717988 |
| 4308 | -29197.69438 | 10838.886012 | -29074.072859 | -18220.37267 |
| 4646 | -30015.029767 | 11188.451856 | -29894.615317 | -18731.751846 |
| 3301 | -25734.863095 | 9552.87882 | -25626.548444 | -16059.551968 |
| 2926 | -25484.283457 | 9489.105735 | -25381.532894 | -15904.071324 |
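The transformed values no longer resemble the original columns, but whoever holds $P$ can map them back by multiplying by $P^{-1}$. A minimal sketch of that round trip, assuming a few made-up rows in the same column order (the values and variable names are illustrative, not taken from the dataset):

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative rows: sex, age, salary, family members (not real records).
original = np.array([[1, 41, 49600, 1],
                     [0, 46, 38000, 1],
                     [0, 29, 21000, 0]], dtype=float)

P = rng.normal(size=(4, 4))           # secret random matrix
encoded = original @ P                # what would be stored or shared
decoded = encoded @ np.linalg.inv(P)  # recovery is possible only with P

# Rounding removes the tiny floating-point error left by the round trip.
print(np.round(decoded).astype(int))
```

This is why the matrix $P$ has to be kept secret: the protection rests entirely on $P$ being unknown to third parties.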


The sample sizes are the same as the original ones. Let's check the quality of the model on new features.

In [19]:
#check r2 on the transformed training set
print('R2_score on the transformed training set:', score(features_train_matrix, target_train))

R2_score on the transformed training set: 0.432158202658095


In [20]:
#and check if r2 matches on test samples
print('R2_score on the test set:', score(features_test, target_test))

R2_score on the test set: 0.425552801076332


In [21]:
print('R2_score on the transformed test set:', score(features_test_matrix, target_test))

R2_score on the transformed test set: 0.4255528010763677


As you can see, the R2 metric has not changed: the values on the original and transformed samples agree up to floating-point rounding, for both the training and the test samples.
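The `score` helper fits and evaluates the model on the same sample. The same invariance can also be checked in a hold-out setting, where the model is fitted on the training part and scored on the test part. Below is a sketch on synthetic data; the data, the weights and the helper name `holdout_r2` are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic regression problem with 4 features (illustrative only).
X = rng.normal(size=(400, 4))
y = X @ np.array([0.5, 1.0, -1.0, 2.0]) + rng.normal(scale=0.5, size=400)
P = rng.normal(size=(4, 4))  # random transformation matrix

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345)

def holdout_r2(X_tr, X_te):
    """Fit on the training part and return R2 on the held-out part."""
    model = LinearRegression().fit(X_tr, y_train)
    return r2_score(y_test, model.predict(X_te))

r2_plain = holdout_r2(X_train, X_test)
r2_protected = holdout_r2(X_train @ P, X_test @ P)

print(r2_plain, r2_protected)
```

The same matrix $P$ must be applied to both the training and the test features, as the notebook does with `invert_matrix`.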

### Section Conclusion
The quality of the linear regression has not changed and the R2 metric is the same on the standard and transformed data, which means that we were able to protect customer data without compromising the quality of the model.

## Conclusion:
1. Getting to know the data:
    - The original dataset has 5000 rows and 5 columns; 4847 rows remain after removing duplicates.
    - The entire dataset has been converted to `int32` format.
    - 153 duplicates were found in the dataset and subsequently removed.
    - There are no gaps in the dataset.
2. During the study, it was shown that multiplying the features by an invertible matrix does not change the quality of linear regression.
3. To protect the information, we multiply the feature matrix by a randomly generated invertible matrix $P$. Conversion algorithm:
    1. Create a random square matrix whose size equals the number of features.
    2. Check that the matrix is invertible.
    3. Multiply the feature matrix by this random invertible matrix.
    4. Train and apply the model on the transformed features.
    5. Compare the R2 metric on the original features with the R2 metric on the transformed features.
4. After transforming the data by multiplying the features by the matrix, the quality of the linear regression has not changed and the R2 metric is the same on the standard and transformed data, which means that we managed to protect customer data without compromising the quality of the model.