# Protection of customer personal data

We need to protect the personal data of the customers of Flood Insurance Company. To do this, we develop a data transformation that makes it difficult to recover personal information from the transformed data.

The transformation must also not degrade the quality of machine learning models.

Work plan:
1. Explore the data from the dataframe.
2. Conduct preprocessing of the data.
3. Check analytically whether multiplying the original feature matrix by an invertible matrix affects the quality of linear regression.
4. Create a custom linear regression class and use the R2 metric to compare the quality before and after the transformation.
5. Measure the quality of scikit-learn's built-in linear regression model on the transformed feature matrix and compare it with the quality obtained on the original matrix.

<a id="0"></a> <br>
# Table of Contents  
1. [Loading data](#1)     
2. [Matrix multiplication](#2)
3. [Transformation Algorithm](#3)
4. [Algorithm validation](#4)

<a id="1"></a>
## Loading data
[Back to the top](#0)

In [3]:
#pip install nb_black

In [4]:
#%load_ext nb_black

In [5]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

Explore the data.

In [6]:
df = pd.read_csv("/datasets/insurance.csv", sep=",")
df.head(10)

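| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |
| 5 | 1 | 43.0 | 41000.0 | 2 | 1 |
| 6 | 1 | 39.0 | 39700.0 | 2 | 0 |
| 7 | 1 | 25.0 | 38600.0 | 4 | 0 |
| 8 | 1 | 36.0 | 49700.0 | 1 | 0 |
| 9 | 1 | 32.0 | 51700.0 | 1 | 0 |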


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


We see that all columns in the original dataset are numeric and that there are no missing values.
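
The same two facts can also be checked programmatically rather than read off `df.info()`. The sketch below is only an illustration: it uses a toy frame built from the first three rows shown above (the name `toy` is an assumption), not the full dataset.

```python
import pandas as pd

# Toy stand-in built from the first three rows printed above (not the full dataset).
toy = pd.DataFrame(
    {
        "Пол": [1, 0, 0],
        "Возраст": [41.0, 46.0, 29.0],
        "Зарплата": [49600.0, 38000.0, 21000.0],
        "Члены семьи": [1, 1, 0],
        "Страховые выплаты": [0, 1, 0],
    }
)

# Number of missing values in each column; all zeros means nothing is missing.
missing_per_column = toy.isna().sum()

# True only if every column has a numeric dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in toy.dtypes)

print(missing_per_column)
print("All columns numeric:", all_numeric)
```

On the real data the same two expressions would be applied to `df` instead of `toy`.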

In [8]:
df["Пол"].value_counts()

0    2505
1    2495
Name: Пол, dtype: int64

The `Пол` (gender) column takes two values, 0 and 1, so it is a binary categorical feature, as expected.

In [9]:
df["Возраст"].sort_values()

2688    18.0
3370    18.0
1159    18.0
2549    18.0
1693    18.0
        ... 
3117    60.0
2240    60.0
3907    61.0
4019    62.0
228     65.0
Name: Возраст, Length: 5000, dtype: float64

The visible values of `Возраст` (age) are non-negative whole numbers ranging from 18 to 65, so the data in this column looks correct.
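
Because the sorted output is truncated, the claim about whole-number values is easier to trust when it is tested on every row. A minimal sketch, assuming a toy series `ages` taken from the first rows of the table (the optional cast to `int64` is an illustration, not a step the notebook performs):

```python
import pandas as pd

# Toy stand-in for df["Возраст"], taken from the first rows shown earlier.
ages = pd.Series([41.0, 46.0, 29.0, 21.0, 28.0], name="Возраст")

# Every value has no fractional part and is not negative.
is_whole = bool((ages % 1 == 0).all())
is_non_negative = bool((ages >= 0).all())

# Only when both checks pass is it safe to store the column as integers.
if is_whole and is_non_negative:
    ages = ages.astype("int64")

print(is_whole, is_non_negative, ages.dtype)
```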

In [10]:
df["Зарплата"].sort_values()

726      5300.0
4164     6000.0
4623     7400.0
437      8900.0
483      9800.0
         ...   
2193    71400.0
3328    71600.0
4360    74800.0
4512    75200.0
3255    79000.0
Name: Зарплата, Length: 5000, dtype: float64

The `Зарплата` (salary) column contains only positive values (from 5300 to 79000 in the output above), so the data in this column looks plausible.

In [11]:
df["Члены семьи"].value_counts()

1    1814
0    1513
2    1071
3     439
4     124
5      32
6       7
Name: Члены семьи, dtype: int64

The counts show that the data in the `Члены семьи` (family members) column is realistic: most clients have no family members or only one.

In [12]:
df["Страховые выплаты"].value_counts()

0    4436
1     423
2     115
3      18
4       7
5       1
Name: Страховые выплаты, dtype: int64

We see that no client received more than 5 insurance payments over the last 5 years, and the majority of clients (4436 of 5000) received none at all.

<a id="2"></a>
## Matrix multiplication
[Back to the top](#0)

Let us introduce the notations:

- $X$: feature matrix (the zeroth column consists of ones)

- $y$: target feature vector

- $P$: matrix by which the features are multiplied

- $w$: vector of linear regression weights (the zeroth element is the intercept)

Predictions:

$$
a = Xw
$$

Training goal:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$

Let us check whether the quality of linear regression changes after multiplying the feature matrix $X$ by an invertible matrix $P$.

**Rationale:**

According to the training formula for the encrypted data matrix $Z = XP$ we obtain:
 
$$
w' = (Z^T Z)^{-1} Z^T y = ((XP)^T (XP))^{-1} (XP)^T y.
$$

Next, use the transpose property of a matrix product, $(AB)^T = B^T A^T$:

$$
w'= ((XP)^T (XP))^{-1} (XP)^T y=(P^T X^T X P)^{-1} P^T X^T y.
$$

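Now apply the inverse-of-a-product property $(AB)^{-1} = B^{-1} A^{-1}$, which holds when $A$ and $B$ are square invertible matrices. Here $P^T$, $X^T X$ and $P$ are all square, $P$ is invertible by assumption, and $X^T X$ must be invertible for the training formula to make sense: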

$$
w'=(P^T X^T X P)^{-1} P^T X^T y=P^{-1}(X^TX)^{-1}(P^T)^{-1} P^T X^T y.
$$

Since $(P^T)^{-1} P^T = E$, where $E$ is the identity matrix, we get:

$$
w'=P^{-1}(X^TX)^{-1}(P^T)^{-1} P^T X^T y=P^{-1}(X^TX)^{-1} X^T y=P^{-1}w.
$$

Then for the prediction $a'=Zw'$ we have:

$$
a'=Zw'=XPP^{-1}w=Xw.
$$

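**Outcome:** The derivation shows that $a' = a$: the predictions, and therefore the quality of the model, do not change when the features are multiplied by an invertible matrix $P$. The weights of the two models are related by $w' = P^{-1} w$.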
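
The two identities can also be checked numerically. The sketch below is an illustration on synthetic data, not the insurance dataset. The seed, the sizes, and the helper name `normal_equation` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Synthetic data: 100 objects, a column of ones plus 4 random features.
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 4))])
y = rng.normal(size=100)

# Random square matrix; a Gaussian matrix is invertible with probability 1.
P = rng.normal(size=(5, 5))


def normal_equation(A, b):
    """Return w that solves (A^T A) w = A^T b."""
    return np.linalg.solve(A.T @ A, A.T @ b)


w = normal_equation(X, y)        # weights on the original features
Z = X @ P                        # transformed features
w_prime = normal_equation(Z, y)  # weights on the transformed features

# w' = P^{-1} w and Z w' = X w, up to floating-point error.
weights_match = bool(np.allclose(w_prime, np.linalg.inv(P) @ w))
predictions_match = bool(np.allclose(Z @ w_prime, X @ w))

print(weights_match, predictions_match)
```

Here $P$ is $5 \times 5$ because it multiplies the full matrix $X$, including the column of ones, exactly as in the derivation.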

<a id="3"></a>
## Transformation Algorithm
[Back to the top](#0)

Let's check the conclusion obtained at the previous stage in practice.

**Algorithm**

In [13]:
RANDOM_STATE = 12345

Split the initial data into two samples: training (80%) and test (20%).
Store the features of the resulting samples in `features_train` and `features_test`.
Store the target feature in `target_train` and `target_test`.

In [14]:
target = df["Страховые выплаты"]
features = df.drop("Страховые выплаты", axis=1)
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=RANDOM_STATE
)

In [15]:
features_train.shape

(4000, 4)

In [16]:
target_train.shape

(4000,)

In [17]:
features_test.shape

(1000, 4)

In [18]:
target_test.shape

(1000,)

The number of rows in each sample confirms that the split was done correctly.

Create our own linear regression class.

In [19]:
class LinearRegression_custom:
    def fit(self, train_features, train_target):
        X = np.concatenate(
            (np.ones((train_features.shape[0], 1)), train_features), axis=1
        )
        y = train_target
        w = (np.linalg.inv(X.T @ X) @ X.T) @ y
        self.w = w[1:]
        self.w0 = w[0]

    def predict(self, test_features):
        return test_features.dot(self.w) + self.w0
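
The class above inverts $X^T X$ explicitly, which follows the training formula literally but can be numerically fragile when the columns of $X$ are nearly collinear. A possible alternative, shown here only as a sketch, solves the same least-squares problem with `np.linalg.lstsq`. The class name `LinearRegressionLstsq` and the synthetic check are assumptions made for the example, not part of the notebook.

```python
import numpy as np


class LinearRegressionLstsq:
    """Hypothetical variant of the custom class that avoids an explicit inverse."""

    def fit(self, train_features, train_target):
        X = np.asarray(train_features, dtype=float)
        # Prepend a column of ones so that w[0] plays the role of the intercept.
        X = np.hstack([np.ones((X.shape[0], 1)), X])
        y = np.asarray(train_target, dtype=float)
        # Least-squares solution of X w = y.
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        self.w0 = w[0]
        self.w = w[1:]
        return self

    def predict(self, test_features):
        return np.asarray(test_features, dtype=float) @ self.w + self.w0


# Noise-free synthetic check: the known coefficients should be recovered.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 2.0 + X_demo @ np.array([1.0, -2.0, 0.5])
demo_model = LinearRegressionLstsq().fit(X_demo, y_demo)

print(demo_model.w0, demo_model.w)
```

For well-conditioned data such as the scaled features used here, both approaches should give the same weights up to floating-point error.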

Perform feature scaling.

In [20]:
features_train.head()

| | Пол | Возраст | Зарплата | Члены семьи |
|---|---|---|---|---|
| 317 | 0 | 20.0 | 39800.0 | 2 |
| 4143 | 1 | 40.0 | 34200.0 | 0 |
| 4252 | 1 | 45.0 | 50800.0 | 1 |
| 710 | 0 | 28.0 | 39100.0 | 2 |
| 148 | 0 | 28.0 | 39000.0 | 1 |


In [21]:
numeric = ["Возраст", "Зарплата", "Члены семьи"]
scaler = StandardScaler()
scaler.fit(features_train[numeric])
pd.options.mode.chained_assignment = None
features_train[numeric] = scaler.transform(features_train[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])

In [22]:
features_train.head()

| | Пол | Возраст | Зарплата | Члены семьи |
|---|---|---|---|---|
| 317 | 0 | -1.306556 | -0.012614 | 0.74993 |
| 4143 | 1 | 1.050572 | -0.57884 | -1.079168 |
| 4252 | 1 | 1.639853 | 1.099616 | -0.164619 |
| 710 | 0 | -0.363705 | -0.083392 | 0.74993 |
| 148 | 0 | -0.363705 | -0.093503 | -0.164619 |


We can see that the scaling was successful.
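
Success of the scaling can be confirmed numerically: after `StandardScaler`, each scaled training column should have mean 0 and standard deviation 1. The sketch below uses toy frames (the names `train`, `test` and the choice of rows are assumptions) and works on explicit copies, which is one way to avoid the chained-assignment warning without switching it off.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for features_train / features_test (hypothetical split of the rows above).
train = pd.DataFrame(
    {
        "Пол": [0, 1, 1, 0],
        "Возраст": [20.0, 40.0, 45.0, 28.0],
        "Зарплата": [39800.0, 34200.0, 50800.0, 39100.0],
    }
)
test = pd.DataFrame({"Пол": [0], "Возраст": [28.0], "Зарплата": [39000.0]})

numeric = ["Возраст", "Зарплата"]

# Fit on the training part only, so no test information leaks into the scaler.
scaler = StandardScaler().fit(train[numeric])

# Work on explicit copies instead of slices of another frame.
train_scaled = train.copy()
test_scaled = test.copy()
train_scaled[numeric] = scaler.transform(train[numeric])
test_scaled[numeric] = scaler.transform(test[numeric])

# StandardScaler uses the population standard deviation, hence ddof=0.
means_ok = bool(np.allclose(train_scaled[numeric].mean(), 0.0))
stds_ok = bool(np.allclose(train_scaled[numeric].std(ddof=0), 1.0))

print(means_ok, stds_ok)
```

The test sample is transformed with the training statistics, so its columns are not expected to have exactly zero mean or unit variance.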

Train this model on the original training sample, get predictions on the test sample, and calculate the value of the `R2` metric.

In [23]:
model = LinearRegression_custom()
model.fit(features_train, target_train)
predictions = model.predict(features_test)
r2_score(target_test, predictions)

0.4117683956770476

Create a random invertible matrix $P$:

In [24]:
np.random.seed(RANDOM_STATE)
P = np.random.normal(1, 2, size=(4, 4))
P

array([[ 0.59058468,  1.95788668, -0.03887743, -0.11146061],
       [ 4.93156115,  3.78681167,  1.18581575,  1.56349231],
       [ 2.53804514,  3.49286947,  3.01437872, -1.59244222],
       [ 1.54998327,  1.45782576,  3.70583367,  2.77285868]])

In [25]:
np.linalg.det(P)

-97.05037974221308

Since the determinant of matrix $P$ is nonzero, the matrix is invertible.
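
A nonzero determinant proves invertibility, but its magnitude depends on the scale of the entries, so it says little about how close the matrix is to being singular. A condition-number check is a common complement. The sketch below is one possible way to generate a key matrix under that check. The function name `make_invertible_matrix` and the threshold `max_cond` are assumptions made for the example.

```python
import numpy as np


def make_invertible_matrix(n, rng, max_cond=1e6):
    """Draw random n x n matrices until one is comfortably far from singular."""
    while True:
        candidate = rng.normal(loc=1.0, scale=2.0, size=(n, n))
        # A huge condition number means the matrix is numerically close to singular.
        if np.linalg.cond(candidate) < max_cond:
            return candidate


rng = np.random.default_rng(12345)
P_demo = make_invertible_matrix(4, rng)

# Sanity check: P times its inverse should be the identity matrix.
identity_check = bool(np.allclose(P_demo @ np.linalg.inv(P_demo), np.eye(4)))

print(identity_check)
```

Note that `np.random.default_rng(12345)` does not reproduce the matrix printed above, which was generated with the legacy `np.random.seed` interface.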

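Build a linear regression model for the matrix $Z=XP$. Note that `features` in the cell below is the full, unscaled feature matrix, so $Z$ is computed before the train/test split and without the scaling step: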

In [26]:
features_Z = features @ P
features_Z

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 126091.373282 | 173405.000840 | 149565.469674 | -78918.369441 |
| 1 | 96674.116934 | 132904.691128 | 114604.644535 | -60438.110788 |
| 2 | 53441.963113 | 73460.076467 | 63336.341675 | -33395.945306 |
| 3 | 105943.144889 | 145735.095711 | 125731.906221 | -66366.461444 |
| 4 | 66381.652326 | 91271.881853 | 78708.448430 | -41519.075572 |
| ... | ... | ... | ... | ... |
| 4995 | 90749.395006 | 124804.386556 | 107653.934640 | -56800.863689 |
| 4996 | 133162.788148 | 183156.569796 | 157997.468244 | -83388.040638 |
| 4997 | 86141.461274 | 118486.927012 | 102218.566427 | -53946.975634 |
| 4998 | 83107.810801 | 114306.472980 | 98607.350556 | -52030.256590 |


In [27]:
features_train_Z, features_test_Z, target_train_Z, target_test_Z = train_test_split(
    features_Z, target, test_size=0.2, random_state=RANDOM_STATE
)

In [28]:
features_train_Z.shape

(4000, 4)

In [29]:
features_test_Z.shape

(1000, 4)

In [30]:
target_train_Z.shape

(4000,)

In [31]:
target_test_Z.shape

(1000,)

The number of rows in each sample confirms that the split was done correctly.

In [32]:
model.fit(features_train_Z, target_train_Z)
predictions = model.predict(features_test_Z)
r2_score(target_test_Z, predictions)

0.41176840477707777

**Outcome:**

The values of the `R2` metric for the original feature set $X$ and the encrypted one $Z=XP$ differ by less than $10^{-8}$, so the quality of the model does not change with this data transformation algorithm.
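
In the cells above, the first model is trained on scaled features, while `features_Z` is built from the unscaled `features`. The two `R2` values still agree because a linear model with an intercept is unaffected by any invertible linear change of the features, and standard scaling is such a change plus a shift. The sketch below illustrates this on synthetic data. The seed, sizes, coefficients, and the helper `r2_for` are assumptions, and `R2` is measured on the training data for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # arbitrary seed

# Synthetic features with a non-zero mean and a noisy linear target.
X = rng.normal(loc=10.0, scale=3.0, size=(200, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(size=200)
P = rng.normal(size=(4, 4))  # random matrix, invertible with probability 1


def r2_for(features):
    """Fit a linear regression with intercept and return R2 on the same data."""
    model = LinearRegression().fit(features, y)
    return r2_score(y, model.predict(features))


r2_raw = r2_for(X)
r2_scaled = r2_for(StandardScaler().fit_transform(X))
r2_encrypted = r2_for(X @ P)

print(r2_raw, r2_scaled, r2_encrypted)
```

All three values should coincide up to floating-point error.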

<a id="4"></a>
## Algorithm validation
[Back to the top](#0)

Verify with scikit-learn's built-in `LinearRegression` that the quality of linear regression does not change when the original data is encrypted.

For the original matrix we have:

In [33]:
model = LinearRegression()
model.fit(features_train, target_train)
predictions = model.predict(features_test)
r2_score(target_test, predictions)

0.4117683956770476

For the transformed matrix:

In [34]:
model.fit(features_train_Z, target_train_Z)
predictions = model.predict(features_test_Z)
r2_score(target_test_Z, predictions)

0.41176839567706636

The built-in model gives the same result: the `R2` values on the original and the transformed matrices coincide to within floating-point error (they differ by about $2 \cdot 10^{-14}$). Thus, multiplying the feature matrix by a random invertible matrix obscures the original values without reducing the quality of linear regression.
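
Because $P$ is invertible, whoever holds $P$ can restore the original features as $X = Z P^{-1}$, so the matrix has to be kept secret like a key. A minimal sketch on synthetic data (the seed, the sizes, and the variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

X = rng.normal(size=(5, 4))  # stand-in for the original feature matrix
P = rng.normal(size=(4, 4))  # secret key matrix, invertible with probability 1

Z = X @ P                          # encrypted features
X_recovered = Z @ np.linalg.inv(P)  # decryption with the key

recovered = bool(np.allclose(X, X_recovered))
print(recovered)
```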