# Protection of personal data of customers

You need to protect the data of customers of the insurance company "Though the Flood". Develop a data transformation method that makes it difficult to recover personal information from the transformed data. Justify the correctness of the method.

The data must be protected in such a way that the quality of machine learning models does not deteriorate after the transformation. There is no need to select the best model.

## 1. Loading data

In [5]:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from itertools import product

In [6]:
data = pd.read_csv('/datasets/insurance.csv')
data

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
0,1,41.0,49600.0,1,0
1,0,46.0,38000.0,1,1
2,0,29.0,21000.0,0,0
3,0,21.0,41700.0,2,0
4,1,28.0,26100.0,0,0
...,...,...,...,...,...
4995,0,28.0,35700.0,2,0
4996,0,34.0,52400.0,1,0
4997,0,20.0,33900.0,2,0
4998,1,22.0,32700.0,3,0


In [7]:
data.isna().sum()

Пол                  0
Возраст              0
Зарплата             0
Члены семьи          0
Страховые выплаты    0
dtype: int64

In [9]:
features = data.drop('Страховые выплаты', axis=1)  # all columns except "Insurance payouts"
target = data['Страховые выплаты']  # "Insurance payouts" is the target

## 2. Matrix multiplication

Notation:

- $X$: feature matrix (the zeroth column consists of ones)

- $y$: target vector

- $P$: the matrix by which the features are multiplied

- $w$: vector of linear regression weights (the zeroth element is the bias)

Predictions:

$$
a = Xw
$$

Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$

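**Answer:** The quality of linear regression will not change when the features are multiplied by an invertible matrix $P$. The weights of the model trained on $XP$ are related to the original weights by
$$
w' = P^{-1}w
$$

**Rationale:** The weight vectors differ by the factor $P^{-1}$, but the predictions are the same in both cases:
$$
a' = XPw' = XPP^{-1}w = Xw = a
$$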
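
The relation $w' = P^{-1}w$ can also be checked numerically. The sketch below is only an illustration: it uses synthetic data instead of the insurance dataset, and the helper `fit_weights` and the fixed seed are assumptions introduced for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for illustration only

# Synthetic stand-in for X: 100 rows, a column of ones plus 4 features.
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 4))])
y = rng.normal(size=100)

# Random square matrix P; np.linalg.inv raises LinAlgError if it is singular.
P = rng.normal(size=(5, 5))
P_inv = np.linalg.inv(P)

def fit_weights(A, b):
    """Normal-equation solution w = (A^T A)^{-1} A^T b (hypothetical helper)."""
    return np.linalg.inv(A.T @ A) @ A.T @ b

w = fit_weights(X, y)          # weights on the original features
w_new = fit_weights(X @ P, y)  # weights on the transformed features

weights_match = np.allclose(w_new, P_inv @ w)            # w' = P^{-1} w
predictions_match = np.allclose(X @ w, (X @ P) @ w_new)  # a' = a
print(weights_match, predictions_match)
```

The comparison uses `np.allclose` rather than `==` because the two sides are computed along different floating-point paths and agree only up to rounding.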

## 3. Conversion algorithm

**Algorithm**

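Linear regression finds its coefficients with the formula
$$
w = (X^T X)^{-1} X^T y
$$

where $X$ is the feature matrix, $y$ is the target vector, and $w$ is the vector of linear regression coefficients. The predictions of the model trained on $X$ are

$$
a = Xw
$$

To protect the data:

- generate a random square matrix $Y$ whose size equals the number of columns of $X$ (it plays the role of the matrix $P$ from the previous section);

- check that $Y$ is invertible, and generate a new one if it is not;

- replace the features $X$ with $XY$ and train the model on the transformed features.

The weights and predictions of the model trained on $XY$ are

$$
w' = ((XY)^{T}XY)^{-1}(XY)^{T}y
$$

$$
a' = XYw'
$$

The claim is that the predictions do not change:

$$
a' = a
$$

**Rationale**
The proof uses the identities $(AB)^T = B^T A^T$ and $(AB)^{-1} = B^{-1}A^{-1}$, and the fact that the product of a matrix and its inverse is the identity matrix $E$:
$$
w' = (Y^T X^T X Y)^{-1} Y^T X^T y = Y^{-1}(X^T X)^{-1}(Y^T)^{-1} Y^T X^T y = Y^{-1}(X^T X)^{-1} E X^T y = Y^{-1}w
$$

Substituting this into the predictions:

$$
a' = XYw' = XYY^{-1}w = XEw = Xw = a
$$

Since the predictions coincide, so does the error:

$$
MSE(a', y) = \frac1n\sum _{i=1}^{n}(a'_{i}-y_{i})^{2} = \frac1n\sum _{i=1}^{n}(a_{i}-y_{i})^{2} = MSE(a, y)
$$

Note that $X$ is not square, so $X(X^T X)^{-1}X^T$ is not the identity matrix and in general $a \neq y$. The MSE is therefore not zero; it is simply the same before and after the transformation.

$a' = a$
The predictions of the model trained on the transformed features are equal to the predictions of the model trained on the original features.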
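
The transformation step itself can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: `make_invertible_matrix`, `features_demo`, and the condition-number threshold `1e8` are hypothetical names and values chosen for the example.

```python
import numpy as np

def make_invertible_matrix(n, rng, max_tries=10):
    """Draw random n x n matrices until one is numerically invertible."""
    for _ in range(max_tries):
        candidate = rng.normal(size=(n, n))
        # A moderate condition number means the inverse can be computed reliably.
        if np.linalg.cond(candidate) < 1e8:
            return candidate
    raise RuntimeError("no invertible matrix found")

rng = np.random.default_rng(42)
features_demo = rng.uniform(0, 100, size=(6, 4))  # stand-in for 4 feature columns

Y = make_invertible_matrix(features_demo.shape[1], rng)
protected = features_demo @ Y            # what would be stored or shared
restored = protected @ np.linalg.inv(Y)  # possible only for someone who knows Y

print(np.allclose(restored, features_demo))
```

Checking the condition number instead of only catching `LinAlgError` is a design choice of this sketch: a nearly singular matrix may still be inverted without an exception, but the result would be inaccurate.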

In [11]:
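# Draw a random 5x5 matrix and keep the first one that can be inverted.
a = None
for attempt in range(10):
    x = np.random.rand(5, 5)
    try:
        inverse = np.linalg.inv(x)
    except np.linalg.LinAlgError:
        continue  # singular matrix: draw another one
    else:
        print(x)
        a = x
        break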

[[0.38147368 0.90000264 0.34198635 0.67299682 0.15591154]
 [0.5622855  0.38861365 0.85381575 0.53453338 0.27683409]
 [0.70382097 0.13411873 0.69792014 0.28437123 0.89200652]
 [0.57921709 0.62566856 0.40946283 0.71672926 0.15117336]
 [0.20181158 0.53300872 0.23111341 0.34903386 0.34381551]]


In [12]:
np.linalg.inv(x)

array([[  9.04536439,  -2.56125257,   4.00758646,  -2.69349239,
        -11.2526733 ],
       [  8.12474977,  -1.00370455,   2.03216551,  -4.73172342,
         -6.06801292],
       [  0.2660681 ,   2.34535564,  -0.52392141,  -1.87646183,
          0.17525351],
       [-13.66643613,   1.64577686,  -4.61520282,   8.43361272,
         13.13786388],
       [ -4.21000426,  -0.18789514,  -0.4653426 ,   1.61623574,
          5.46560818]])

In [13]:
np.eye(5)

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])
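
The two outputs above can be tied together by multiplying a matrix by its inverse and comparing the result with the identity matrix. The sketch below uses a freshly generated matrix `x_demo` (an assumption of this example, not the `x` from the cell above).

```python
import numpy as np

rng = np.random.default_rng(1)
x_demo = rng.random((5, 5))  # same shape as the matrix above, new values
product = x_demo @ np.linalg.inv(x_demo)

# Floating-point rounding leaves tiny off-diagonal residue,
# so the comparison is made with a tolerance.
is_identity = np.allclose(product, np.eye(5))
print(is_identity)
```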

## 4. Checking the algorithm

In [16]:
class LinearRegression:
    def fit(self, train_features, train_target):
        X = np.concatenate((np.ones((train_features.shape[0], 1)), train_features), axis=1)
        y = train_target
        w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
        self.w = w[1:]
        self.w0 = w[0]

    def predict(self, test_features):
        return test_features.dot(self.w) + self.w0
    
model = LinearRegression()
model.fit(features, target)
predictions = model.predict(features)
print('R2-', r2_score(target, predictions))

R2- 0.42494550286668


In [17]:
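# Draft variant that applies the matrix Y inside fit(); it is not used below,
# the following cells train LinearRegression on pre-transformed features instead.
class LinearRegression1: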
    def fit(self, train_features, train_target):
        X = np.concatenate((np.ones((train_features.shape[0], 1)), train_features), axis=1)
        Y = a
        y = train_target
        w  = np.dot(np.linalg.inv(np.dot((np.dot(X, Y)).T, np.dot(X, Y))), np.dot((np.dot(X, Y)).T, y))
        self.w = w[1:]
        self.w0 = w[0]

    def predict(self, test_features):
        print(test_features)
        return test_features.dot(self.w) + self.w0
    
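# a[1:] has shape (4, 5): it is not square, so this transformation is not invertible.
# The 5 resulting columns are linearly dependent, X^T X is singular, and R2 is distorted.
features_new = features.dot(a[1:])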
model1 = LinearRegression()
model1.fit(features_new, target)
predictions1 = model1.predict(features_new)
print('R2_encryption-', r2_score(target, predictions1))

R2_encryption- 0.35734804422455835
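
The lower score above is consistent with the shape of `a[1:]`, which is (4, 5): four features are mapped to five columns that cannot all be independent. The sketch below illustrates this on synthetic data; `F` and `A` are hypothetical stand-ins for `features` and `a`.

```python
import numpy as np

rng = np.random.default_rng(7)
F = rng.normal(size=(50, 4))  # stand-in for the 4 feature columns
A = rng.random((5, 5))        # plays the role of `a`

wide = F @ A[1:]        # (50, 4) @ (4, 5): 5 columns, only 4 independent
square = F @ A[1:, 1:]  # (50, 4) @ (4, 4): 4 columns, all independent

print(wide.shape, np.linalg.matrix_rank(wide))
print(square.shape, np.linalg.matrix_rank(square))
```

With linearly dependent columns, $X^T X$ is singular, so the result of `np.linalg.inv` in `fit` is not reliable even when no exception is raised.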


In [18]:
# Use the square 4x4 block of `a`, so that the 4 feature columns are mapped to 4 columns.
features_new = features.dot(a[1:, 1:])

In [19]:
model1 = LinearRegression()
model1.fit(features_new, target)
predictions1 = model1.predict(features_new)
print('R2_encryption-', r2_score(target, predictions1))

R2_encryption- 0.4249455028663408


********************************

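The coefficients of determination before and after the transformation agree (0.42494550286668 vs 0.4249455028663408). Their difference, about $3 \cdot 10^{-13}$, is close to zero and is attributable to floating-point rounding, so the transformation does not degrade the quality of the model.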