## Loading and analyzing initial data

In [24]:
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split 
from sklearn.metrics import r2_score

In [25]:
data = pd.read_csv('datasets/insurance.csv')

In [26]:
data.head()

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
0,1,41.0,49600.0,1,0
1,0,46.0,38000.0,1,1
2,0,29.0,21000.0,0,0
3,0,21.0,41700.0,2,0
4,1,28.0,26100.0,0,0


In [27]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [28]:
data.describe()

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
count,5000.0,5000.0,5000.0,5000.0,5000.0
mean,0.499,30.9528,39916.36,1.1942,0.148
std,0.500049,8.440807,9900.083569,1.091387,0.463183
min,0.0,18.0,5300.0,0.0,0.0
25%,0.0,24.0,33300.0,0.0,0.0
50%,0.0,30.0,40200.0,1.0,0.0
75%,1.0,37.0,46600.0,2.0,0.0
max,1.0,65.0,79000.0,6.0,5.0


In [29]:
data.groupby(by = 'Страховые выплаты')['Члены семьи'].count()

Страховые выплаты
0    4436
1     423
2     115
3      18
4       7
5       1
Name: Члены семьи, dtype: int64

In [30]:
data.columns = ('param_1', 'param_2', 'param_3', 'param_4', 'target')

In [31]:
data.loc[data['target'] > 1, 'target'] = 1

In [32]:
data.describe()

Unnamed: 0,param_1,param_2,param_3,param_4,target
count,5000.0,5000.0,5000.0,5000.0,5000.0
mean,0.499,30.9528,39916.36,1.1942,0.1128
std,0.500049,8.440807,9900.083569,1.091387,0.31638
min,0.0,18.0,5300.0,0.0,0.0
25%,0.0,24.0,33300.0,0.0,0.0
50%,0.0,30.0,40200.0,1.0,0.0
75%,1.0,37.0,46600.0,2.0,0.0
max,1.0,65.0,79000.0,6.0,1.0


**Results**

1. The data has no missing values.
2. Column names were anonymized.
3. The target was converted into a binary indicator, since the task is to predict whether the event occurs, not how many times.
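The two checks behind these results can be sketched on a toy frame. The values below are made up for illustration; only the column name `target` comes from the notebook, and `clip` is used here as one possible way to binarize the counts.

```python
import pandas as pd

# Hypothetical toy data standing in for the insurance table.
toy = pd.DataFrame({
    "param_2": [41.0, 46.0, 29.0, 21.0],
    "target": [0, 1, 3, 0],
})

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = toy.isna().sum()

# Collapse the event count into an indicator: 0 stays 0, anything >= 1 becomes 1.
toy["target"] = toy["target"].clip(upper=1)
```

This gives the same result as the `data.loc[...] = 1` assignment used above.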

## Matrix multiplication

Notation:

- $X$: feature matrix (column zero consists of ones)

- $y$: target vector

- $P$: the matrix by which the feature matrix is multiplied (called $Z$ in the rationale below)

- $w$: vector of linear regression weights (element zero is the intercept)

Predictions:

$$
a = Xw
$$

Learning task:

$$
w = \arg\min_w MSE(Xw, y)
$$

Learning formula:

$$
w = (X^T X)^{-1} X^T y
$$
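As a rough illustration of the learning formula, the sketch below computes $w$ on synthetic data and compares it with NumPy's least-squares solver. The data, the seed and the variable names are assumptions made for the example, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 observations, 3 features, plus a leading column of ones.
features = rng.normal(size=(50, 3))
X = np.column_stack([np.ones(50), features])
true_w = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)

# Normal equations: solve (X^T X) w = X^T y rather than forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predictions a = Xw.
predictions = X @ w
```

Solving the linear system is usually preferred over calling `np.linalg.inv`, but both correspond to the same formula.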

**Answer:** The quality of linear regression does not change when the feature matrix $X$ is multiplied on the right by an invertible matrix.

**Rationale:** 

Consider the formula $w = (X^T X)^{-1} X^T y$ and start with the dimensions of the matrices.

$X$ has dimension $m \times n$, where $m$ is the number of observations and $n$ is the number of features used in the analysis.

So $X^T$ has dimension $n \times m$.

Their product $X^T X$ is $n \times n$, and so is its inverse $(X^T X)^{-1}$.

Finally, $(X^T X)^{-1} X^T$ is an $n \times m$ matrix. Multiplying it by $y$, which is $m \times 1$, gives $w$ with dimension $n \times 1$.

This is exactly the dimension needed for the last multiplication $Xw$: $m \times n$ times $n \times 1$ gives $m \times 1$, the same shape as the target column that the predictions are compared with.

When multiplying $X$ by an invertible matrix, the dimensions on both sides matter.

Let $Z$ be our invertible matrix.

If the multiplication is on the right, $XZ$, then $Z$ must be $n \times n$ so that the product keeps the size of $X$.

Assume that we multiply on the right by $Z$. The dimensions are:

$X^T X$ and $Z$: $n \times n$

$X$: $m \times n$

$X^T$: $n \times m$

$$
a_1 = X(X^T X)^{-1} X^T y
$$

$$
a_2 = XZ((XZ)^T XZ)^{-1} (XZ)^T y
$$

The transpose of a product equals the product of the transposes taken in reverse order: $(AB)^T = B^T A^T$.

$$
a_2 = XZ(Z^T X^T X Z)^{-1} Z^T X^T y
$$

The inverse of a product of **square** invertible matrices equals the product of the inverses of the factors, taken in reverse order: $(AB)^{-1} = B^{-1} A^{-1}$.

Under the inverse we have three square matrices: $Z^T$, $X^T X$ and $Z$ (with $X^T X$ assumed invertible, as the learning formula already requires). First treat the expression as a product of two matrices, $Z^T$ and $X^T X Z$; then

$$
a_2 = XZ(Z^T X^T X Z)^{-1} Z^T X^T y
$$
equals
$$
a_2 = XZ(X^T X Z)^{-1}(Z^T)^{-1} Z^T X^T y
$$
and then
$$
a_2 = XZZ^{-1}(X^T X)^{-1}(Z^T)^{-1} Z^T X^T y
$$

As a result, $XZZ^{-1} = XE = X$, where $E$ is the identity matrix, and likewise $(Z^T)^{-1} Z^T X^T = E X^T = X^T$.

Thus, despite the multiplication by $Z$, $a_1 = a_2$:

$$
a_1 = X(X^T X)^{-1} X^T y  = XZZ^{-1}(X^T X)^{-1}(Z^T)^{-1} Z^T X^T y = a_2
$$
$$
a_1 = X(X^T X)^{-1} X^T y  = XE(X^T X)^{-1} E X^T y = a_2
$$
$$
a_1 = X(X^T X)^{-1} X^T y  = X(X^T X)^{-1} X^T y = a_2
$$
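The identity $a_1 = a_2$ can also be checked numerically. The sketch below uses synthetic data and a hand-picked invertible $Z$; the helper `fitted_values` is a hypothetical name introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

m, n = 100, 4
X = rng.normal(size=(m, n))
y = rng.normal(size=m)

# A fixed, strictly diagonally dominant matrix, hence invertible.
Z = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 3.0, 1.0, 0.0],
    [0.0, 1.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 5.0],
])

def fitted_values(A, y):
    # a = A (A^T A)^{-1} A^T y, computed via a linear solve.
    w = np.linalg.solve(A.T @ A, A.T @ y)
    return A @ w

a_1 = fitted_values(X, y)
a_2 = fitted_values(X @ Z, y)

# Largest absolute gap between the two prediction vectors.
max_difference = np.max(np.abs(a_1 - a_2))
```

Any remaining gap between `a_1` and `a_2` should be at the level of floating-point rounding.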

Can we multiply on the left instead? This is less obvious, so let's check.

In this case

$Z$: $m \times m$

$X$: $m \times n$

$X^T$: $n \times m$

$$
a_1 = X(X^T X)^{-1} X^T y
$$

$$
a_3 = ZX((ZX)^T ZX)^{-1} (ZX)^T y
$$

The transpose of a product equals the product of the transposes taken in reverse order: $(AB)^T = B^T A^T$.

$$
a_3 = ZX(X^T Z^T ZX)^{-1} X^T Z^T  y
$$

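Here the expression under the inverse, $X^T Z^T Z X$, is square ($n \times n$) only as a whole: $X^T$ and $X$ are not square, so the inverse cannot be split into a product of inverses and $Z$ does not cancel. Multiplying on the left therefore does not preserve the predictions in general.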
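A small numeric counterexample makes the same point. The data and the diagonal matrix $Z$ below are made up for illustration, and `fitted_values` is again a hypothetical helper.

```python
import numpy as np

# Three observations, two columns (ones plus one feature).
X = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [1.0, 2.0],
])
y = np.array([1.0, 0.0, 2.0])

# Invertible m x m matrix applied on the left: it rescales the observations.
Z = np.diag([1.0, 2.0, 3.0])

def fitted_values(A, y):
    # a = A (A^T A)^{-1} A^T y, computed via a linear solve.
    w = np.linalg.solve(A.T @ A, A.T @ y)
    return A @ w

a_1 = fitted_values(X, y)
a_3 = fitted_values(Z @ X, y)

# True when left multiplication changed the predictions.
left_changes_predictions = not np.allclose(a_1, a_3)
```

Left multiplication mixes or rescales rows, so the transformed rows no longer line up with the unchanged targets in `y`.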

## Encryption algorithm

**Algorithm**

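1. Take the current day of the month ($d$) and the month number ($m$).
2. For odd rows use
$$
a_{ij} = (d - 1) \oplus \big((i+1) \cdot j\big)
$$
3. For even rows use
$$
a_{ij} = (m - 1) \oplus \big((i+1) \cdot j\big)
$$
4. Check the determinant: if it equals 0, recalculate with $d + 1$.

Here $\oplus$ is bitwise XOR. The formulas describe what the implementation below actually computes: in Python `^` is XOR rather than exponentiation, and it binds more loosely than `-` and `*`.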

**Rationale**

As shown in the "Matrix multiplication" section, multiplying the features on the right by any invertible $n \times n$ matrix does not affect the quality of the model.

Following that section, we take an $n \times n$ matrix, where $n$ is the number of features; in our case it is $4 \times 4$. A constant matrix would let anyone who knows it recover the original data, which limits security. To avoid this, the matrix is built from the current day and month. Without knowing when the transformation was made, recovering the original data from the transformed data becomes harder.
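To illustrate why a known matrix is a weakness, here is a minimal sketch with a hypothetical $2 \times 2$ example: anyone holding $Z$ can undo the transformation by multiplying by $Z^{-1}$. The numbers are illustrative only.

```python
import numpy as np

# Hypothetical example: two rows of two features each.
X = np.array([
    [41.0, 49600.0],
    [46.0, 38000.0],
])

# A fixed matrix with determinant 1, so it is invertible.
Z = np.array([
    [2.0, 1.0],
    [1.0, 1.0],
])

X_encrypted = X @ Z

# Knowing Z is enough to recover the data: X = (X Z) Z^{-1}.
X_recovered = X_encrypted @ np.linalg.inv(Z)
```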

## Algorithm verification

In [33]:
n = 4
today = datetime.today()
day = today.day
month = today.month

In [34]:
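def create_matrix(n, day, month):
    # Build an n x n integer matrix; bump `day` until the matrix is invertible.
    # Note: `^` is bitwise XOR in Python (not exponentiation) and binds more
    # loosely than `-` and `*`, so each entry is (base - 1) XOR ((i + 1) * j).
    while True:
        rows = []
        for i in range(1, n + 1):
            # Even rows are seeded by the month, odd rows by the day.
            base = month if i % 2 == 0 else day
            rows.append([(base - 1) ^ ((i + 1) * j) for j in range(1, n + 1)])
        result = np.array(rows)
        # Use a tolerance: a floating-point determinant is rarely exactly 0.
        if not np.isclose(np.linalg.det(result), 0):
            return result
        day += 1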

In [35]:
encr_matrix = create_matrix(n, day, month)

Encryption matrix prepared

In [36]:
df_train, df_test = train_test_split(data, test_size=0.25, random_state=12345,
                                      stratify=data['target']) 

In [37]:
df_train_feature = df_train.drop(['target'], axis=1)
df_train_target = df_train['target']
df_test_feature = df_test.drop(['target'], axis=1)
df_test_target = df_test['target']

In [38]:
model = LinearRegression()
model.fit(df_train_feature, df_train_target)
train_predict = model.predict(df_train_feature)
print(r2_score(df_train_target, train_predict))
test_predict = model.predict(df_test_feature)
print(r2_score(df_test_target, test_predict))

0.43906860192530794
0.4207365055282475


We built a model on the initial data and obtained its coefficients.

Now let's encrypt the features and repeat the whole procedure.

In [39]:
# Multiply the feature matrix on the right by the encryption matrix
vector = data.drop(['target'], axis=1).values
vector = vector @ encr_matrix

scikit-learn's `LinearRegression` accepts NumPy arrays directly, but I convert the result back to a DataFrame so that both runs go through exactly the same steps.

In [40]:
data_enc = pd.DataFrame(vector, columns=['param_1', 'param_2', 'param_3', 'param_4'])
data_enc['target'] = data['target']
data_enc

Unnamed: 0,param_1,param_2,param_3,param_4,target
0,942629.0,1537631.0,1339841.0,347659.0,0
1,722233.0,1178012.0,1026699.0,266478.0,1
2,399145.0,651000.0,567435.0,147290.0,0
3,792411.0,1292724.0,1126233.0,292146.0,0
4,496061.0,809119.0,705137.0,183011.0,0
...,...,...,...,...,...
4995,678446.0,1106724.0,964338.0,250216.0,0
4996,995773.0,1624412.0,1415319.0,367158.0,0
4997,644206.0,1050924.0,915618.0,237536.0,0
4998,621440.0,1013755.0,883274.0,229205.0,0


In [41]:
df_train, df_test = train_test_split(data_enc, test_size=0.25, random_state=12345,
                                      stratify=data['target']) 

In [42]:
df_train_feature = df_train.drop(['target'], axis=1)
df_train_target = df_train['target']
df_test_feature = df_test.drop(['target'], axis=1)
df_test_target = df_test['target']

In [43]:
model_enc = LinearRegression()
model_enc.fit(df_train_feature, df_train_target)
train_predict = model_enc.predict(df_train_feature)
print(r2_score(df_train_target, train_predict))
test_predict = model_enc.predict(df_test_feature)
print(r2_score(df_test_target, test_predict))

0.43906860192529995
0.420736505528264


In [44]:
print(model_enc.coef_)
print(model.coef_)

[ 2.18909943e-05 -1.54847793e-03  1.83836534e-03 -2.92672521e-04]
[-6.78200728e-03  2.47582098e-02  2.69393022e-07 -7.23887956e-03]


**Results**

1. Two models were built: one on the unencrypted data and one on the encrypted data. Their coefficients differ.
2. $R^2$ is the same for both models (up to floating-point rounding), on both the training and the test set.
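The difference in coefficients is expected: if the features are replaced by $XZ$, the weights $Z^{-1} w$ give the same predictions, while the intercept stays unchanged. The sketch below checks this on synthetic data with plain NumPy least squares instead of the notebook's `LinearRegression`; the data, the matrix $Z$ and the helper `fit_with_intercept` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

m, n = 200, 4
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.7 + rng.normal(scale=0.1, size=m)

# A fixed, strictly diagonally dominant matrix, hence invertible.
Z = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 3.0, 1.0, 0.0],
    [0.0, 1.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 5.0],
])

def fit_with_intercept(features, y):
    # Least squares with an explicit column of ones; returns (intercept, weights).
    design = np.column_stack([np.ones(len(features)), features])
    solution, *_ = np.linalg.lstsq(design, y, rcond=None)
    return solution[0], solution[1:]

b, w = fit_with_intercept(X, y)
b_enc, w_enc = fit_with_intercept(X @ Z, y)

# Expected relation: w_enc = Z^{-1} w, with the intercept unchanged.
w_expected = np.linalg.solve(Z, w)
```

Under these assumptions, the weights of the model trained on the original data can be recovered from the encrypted model as `Z @ w_enc`.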