<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Loading-data" data-toc-modified-id="Loading-data-1"><span class="toc-item-num">1</span>Data loading</a></span></li><li><span><a href="#Matrix multiplication" data-toc-modified-id="Matrix multiplication-2"><span class="toc-item-num">2</span>Matrix multiplication</a></span></li><li><span><a href="#Transformation algorithm" data-toc-modified-id="Transformation algorithm-3"><span class="toc-item-num">3</span>Conversion algorithm</a></span></li><li><span><a href="#Verification of the algorithm" data-toc-modified-id="Algorithm-check-4"><span class="toc-item-num">4</span>Checking the algorithm</a></span></li><li><span><a href="#Final-output" data-toc-modified-id="Final-output-5"><span class="toc-item-num">5</span>Final output</a></span></li></div>

<div style="border:2px solid Black; padding:20px;">

<h1> Personal data protection </h1>

We need to protect the customer data of the insurance company "Though the flood". We will develop a data transformation method that makes it difficult to recover personal information from the transformed data.

The data must be protected in such a way that the quality of machine learning models does not deteriorate after the transformation.

Project execution plan:
1. Load and examine the data.
2. Answer the question and justify the answer. The features are multiplied by an invertible matrix.
Will the quality of linear regression change?
- It will change. Give examples of matrices.
- It won't change. Specify how the linear regression parameters in the original
problem and in the transformed one are related.
3. Propose a data transformation algorithm that solves the problem. Justify why
the quality of linear regression will not change.
4. Implement this algorithm using matrix operations. Check that
the quality of linear regression from sklearn does not differ before and after the transformation.
Use the R2 metric.

</div>

## Loading data

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [None]:
insurance_data = pd.read_csv('datasets/insurance.csv')
insurance_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [None]:
insurance_data.head()

| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


In [None]:
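def show_rows(data_frame):
    """Print the unique values of every column and the share of missing values."""
    for column in data_frame.columns:
        print('Unique column values', column)
        print(data_frame[column].unique())
    # isna().mean() is the fraction of missing values per column, not a count
    print('Share of missing values in each column')
    print(data_frame.isna().mean())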

In [None]:
show_rows(insurance_data)

Unique column values Пол
[1 0]
Unique column values Возраст
[41. 46. 29. 21. 28. 43. 39. 25. 36. 32. 38. 23. 40. 34. 26. 42. 27. 33.
 47. 30. 19. 31. 22. 20. 24. 18. 37. 48. 45. 44. 52. 49. 35. 56. 65. 55.
 57. 54. 50. 53. 51. 58. 59. 60. 61. 62.]
Unique column values Зарплата
[49600. 38000. 21000. 41700. 26100. 41000. 39700. 38600. 49700. 51700.
 36600. 29300. 39500. 55000. 43700. 23300. 48900. 33200. 36900. 43500.
 36100. 26600. 48700. 40400. 38400. 34600. 34800. 36800. 42200. 46300.
 30300. 51000. 28100. 64800. 30400. 45300. 38300. 49500. 19400. 40200.
 31700. 69200. 33100. 31600. 34500. 38700. 39600. 42400. 34900. 30500.
 24200. 49900. 14300. 47000. 44800. 43800. 42700. 35400. 57200. 29600.
 37400. 48100. 33700. 61800. 39400. 15600. 52600. 37600. 52500. 32700.
 51600. 60900. 41800. 47400. 26500. 45900. 35700. 34300. 26700. 25700.
 33300. 31100. 31500. 42100. 37300. 42500. 27300. 46800. 33500. 44300.
 41600. 53900. 40100. 44600. 45000. 32000. 38200. 33000. 38500

Let's convert the Возраст (Age) and Зарплата (Salary) columns to integer types, since their values appear to have no fractional parts. We will also reduce the memory footprint of the remaining columns.

In [None]:
insurance_data['Пол'] = pd.to_numeric(insurance_data['Пол'], downcast='integer')
insurance_data['Возраст'] = pd.to_numeric(insurance_data['Возраст'], downcast='integer')
insurance_data['Зарплата'] = pd.to_numeric(insurance_data['Зарплата'], downcast='integer')
insurance_data['Члены семьи'] = pd.to_numeric(insurance_data['Члены семьи'], downcast='integer')
insurance_data['Страховые выплаты'] = pd.to_numeric(insurance_data['Страховые выплаты'], downcast='integer')
insurance_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int8   
 1   Возраст            5000 non-null   int8   
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int8   
 4   Страховые выплаты  5000 non-null   int8   
dtypes: float64(1), int8(4)
memory usage: 58.7 KB


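Memory saved: 195.4 KB - 58.7 KB = 136.7 KB.

# Conclusion on step 1
The data is loaded and shows no anomalies. Downcasting the column types reduces the memory footprint of the table. Note that the Зарплата column remained `float64` in the output above, which suggests that some of its values are not exactly integer.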
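
To illustrate how `pd.to_numeric(..., downcast='integer')` treats float columns, here is a minimal sketch on two made-up series (they are not taken from the dataset). A float series is expected to be downcast to an integer type only when every value can be represented as an integer without loss; otherwise it should keep its float type.

```python
import pandas as pd

# Illustrative series (assumption): not taken from the insurance dataset.
whole = pd.Series([41.0, 46.0, 29.0])        # floats with no fractional part
fractional = pd.Series([49600.0, 38000.5])   # one value has a fractional part

whole_down = pd.to_numeric(whole, downcast="integer")
fractional_down = pd.to_numeric(fractional, downcast="integer")

# The first series can be stored as a small integer type,
# the second one cannot be converted without losing information.
print(whole_down.dtype)
print(fractional_down.dtype)
```

This matches the pattern in the `info()` output: Возраст was converted, while Зарплата kept its float type.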

In [None]:
features = insurance_data.drop('Страховые выплаты',axis=1)
target = insurance_data['Страховые выплаты']

In [None]:
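# Add a column of ones so that w[0] is the intercept
X = np.concatenate((np.ones((features.shape[0], 1)), features), axis=1)
y = target
# Normal equation: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y
display(w[1:])  # feature weights without the intercept
# Reference solution from sklearn for comparison
model = LinearRegression()
model.fit(features, target)
model.coef_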

array([ 7.92580543e-03,  3.57083050e-02, -1.70080492e-07, -1.35676623e-02])

array([ 7.92580543e-03,  3.57083050e-02, -1.70080492e-07, -1.35676623e-02])

The coefficients obtained from the normal equation match the coefficients of sklearn's `LinearRegression` to the displayed precision. Now let's move on to the theoretical part.

Notation:

- $X$ is the feature matrix (the zeroth column consists of ones)

- $y$ is the target vector

- $P$ is the matrix by which the features are multiplied

- $w$ is the vector of linear regression weights (the zeroth element is the intercept)

Predictions:

$$
a = Xw
$$

Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula (closed-form solution):

$$
w = (X^T X)^{-1} X^T y
$$

**Question:** The features are multiplied by an invertible matrix. Will the quality of linear regression change?
(If not, how are the linear regression parameters in the original and in the transformed problem related?)

**Answer:** It will not change.

**Justification:** Replace the matrix $X$ with the matrix $Z$:

$$
Z = XP \qquad (2.1)
$$

where $P$ is an invertible square matrix whose size equals the number of columns of $X$.

Let's compute the predictions and the weight vector for $Z$.

$$
a_1 = Zw_1 \qquad (2.2)
$$
$$
w_1 = (Z^T Z)^{-1} Z^T y \qquad (2.3)
$$

Substitute the right side of equation 2.3 into equation 2.2:

$$
a_1 = Z (Z^T Z)^{-1} Z^T y \qquad (2.4)
$$

Replace all $Z$ with the right side of equation 2.1:

$$
a_1 = XP ((XP)^T (XP))^{-1} (XP)^T y \qquad (2.5)
$$

For the next steps, we need two properties. The transpose of a product:

$$
(AB)^T = B^T A^T
$$

and the inverse of a product of square invertible matrices:

$$
(ABC)^{-1} = C^{-1} B^{-1} A^{-1}
$$

The matrix $X$ is generally not square, so $X^{-1}$ and $(XP)^{-1}$ do not exist and the inverse cannot be split between $X$ and $P$ directly. The matrix $X^T X$, however, is square, and it is invertible whenever the training formula is applicable. We therefore expand $(XP)^T (XP) = P^T (X^T X) P$ as a product of three square invertible matrices:

$$
((XP)^T (XP))^{-1} = (P^T (X^T X) P)^{-1} = P^{-1} (X^T X)^{-1} (P^T)^{-1} \qquad (2.6)
$$

Substitute equation 2.6 into equation 2.5 and expand $(XP)^T = P^T X^T$:

$$
a_1 = X P P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y \qquad (2.7)
$$

Here $P P^{-1} = E$ and $(P^T)^{-1} P^T = E$, and multiplying by the identity matrix changes nothing. What remains of equation 2.7 is:

$$
a_1 = X E (X^T X)^{-1} E X^T y = X (X^T X)^{-1} X^T y = Xw = a \qquad (2.8)
$$

As you can see, the prediction $a$ does not change if the feature matrix is multiplied by an invertible matrix.
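
The result can also be checked numerically. Below is a minimal sketch on synthetic data; the data, the random matrix `P`, and the helper `fit_predict` are illustrative assumptions, not part of the project code. It fits least squares through the normal equation on $X$ and on $XP$ and compares the predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 100 objects, a column of ones plus 3 features.
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])
y = rng.normal(size=100)

# Random square matrix; redraw in the unlikely case that it is singular.
P = rng.normal(size=(4, 4))
while np.linalg.matrix_rank(P) < 4:
    P = rng.normal(size=(4, 4))


def fit_predict(features, target):
    """Fit least squares via the normal equation and return the predictions."""
    weights = np.linalg.solve(features.T @ features, features.T @ target)
    return features @ weights


a = fit_predict(X, y)        # predictions on the original features
a_1 = fit_predict(X @ P, y)  # predictions on the transformed features

print(np.allclose(a, a_1))
```

`np.linalg.solve` is used here instead of an explicit inverse only for numerical stability; it solves the same normal equation.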

What is the relationship between $w$ and $w_P$ (the weight vector of the transformed problem, denoted $w_1$ above)?

Here

$$
w = (X^T X)^{-1} X^T y
$$

$$
w_P = ((XP)^T XP)^{-1} (XP)^T y
$$

Expand $w_P$ step by step:

$$
\begin{aligned}
w_P &= ((XP)^T XP)^{-1} (XP)^T y \\
&= (P^T X^T X P)^{-1} P^T X^T y \\
&= (P^T (X^T X) P)^{-1} P^T X^T y \\
&= P^{-1} (P^T (X^T X))^{-1} P^T X^T y \\
&= P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y \\
&= P^{-1} (X^T X)^{-1} E X^T y \\
&= P^{-1} (X^T X)^{-1} X^T y \\
&= P^{-1} w
\end{aligned}
$$

Hence $w_P = P^{-1} w$.

This relation gives a second, shorter proof that the predictions do not change: $a_1 = (XP) w_P = X P P^{-1} w = X E w = Xw = a$.
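
The relation $w_P = P^{-1} w$ can be checked in the same way. This is a sketch on synthetic data; the helper name `normal_equation` and all values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumption): a column of ones plus 2 features.
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
y = rng.normal(size=50)

# A random Gaussian matrix is invertible with probability 1.
P = rng.normal(size=(3, 3))


def normal_equation(features, target):
    """Return the least-squares weights (X^T X)^{-1} X^T y."""
    return np.linalg.solve(features.T @ features, features.T @ target)


w = normal_equation(X, y)        # weights of the original problem
w_P = normal_equation(X @ P, y)  # weights of the transformed problem

# P^{-1} w is computed by solving the system P z = w.
print(np.allclose(w_P, np.linalg.solve(P, w)))
```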
    

## Conversion algorithm

**Algorithm**


To protect the information when fitting the regression and making predictions, we multiply the feature matrix by a randomly generated invertible matrix $Y$.

Algorithm stages:
1. Generate the matrix $Y$.
2. Check that the matrix is invertible by computing the determinant of $Y$.
3. Obtain the matrix of transformed features $Z = X Y$.
4. Train linear regression on the transformed features $Z$.

**Justification**

The matrix $Y$ must have dimension $n \times n$, where $n$ is the number of features used in the regression.
Then the matrix $Z$ has the same dimension as the matrix $X$. The inverse matrix $Y^{-1}$ exists only for square non-degenerate matrices (those whose determinant is not zero).
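
Stages 1 to 3 can be sketched as follows. The helper `make_key`, the seed, and the use of `np.random.default_rng` are illustrative assumptions and differ from the function used later in the notebook; the three rows of `X` are the first rows of the dataset. The sketch also shows that whoever holds $Y$ can restore the original features.

```python
import numpy as np


def make_key(n, rng):
    """Stages 1-2: draw an n x n matrix and redraw until it is invertible."""
    while True:
        key = rng.integers(1, 10, size=(n, n)).astype(float)
        if np.linalg.matrix_rank(key) == n:
            return key


rng = np.random.default_rng(42)

# First three rows of the dataset: Пол, Возраст, Зарплата, Члены семьи.
X = np.array([[1.0, 41.0, 49600.0, 1.0],
              [0.0, 46.0, 38000.0, 1.0],
              [0.0, 29.0, 21000.0, 0.0]])

Y = make_key(X.shape[1], rng)  # stages 1-2
Z = X @ Y                      # stage 3: same shape as X

# Decoding with the inverse matrix; the tolerance absorbs rounding error.
X_restored = Z @ np.linalg.inv(Y)
print(np.allclose(X_restored, X, atol=1e-6))
```

The rank check replaces an exact comparison of the determinant with zero, which is fragile in floating-point arithmetic.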

Example:

$$
X = \begin{pmatrix}
1 & 2 \\
2 & 3 \\
4 & 5
\end{pmatrix}
\qquad
Y = \begin{pmatrix}
1 & 0 \\
2 & 3
\end{pmatrix}
\qquad \det Y = 1 \cdot 3 - 0 \cdot 2 = 3
$$

Find the value of $Z$:

$$
Z = \begin{pmatrix}
1 & 2 \\
2 & 3 \\
4 & 5
\end{pmatrix} \begin{pmatrix}
1 & 0 \\
2 & 3
\end{pmatrix} = \begin{pmatrix}
1 \cdot 1 + 2 \cdot 2 & 1 \cdot 0 + 2 \cdot 3 \\
2 \cdot 1 + 3 \cdot 2 & 2 \cdot 0 + 3 \cdot 3 \\
4 \cdot 1 + 5 \cdot 2 & 4 \cdot 0 + 5 \cdot 3
\end{pmatrix} = \begin{pmatrix}
5 & 6 \\
8 & 9 \\
14 & 15
\end{pmatrix}
$$

After that, we add the column of ones and fit linear regression on the transformed data.
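
The worked example can be reproduced with NumPy. This short sketch only repeats the arithmetic above.

```python
import numpy as np

X = np.array([[1, 2],
              [2, 3],
              [4, 5]])
Y = np.array([[1, 0],
              [2, 3]])

# A non-zero determinant means that Y is invertible.
det_Y = np.linalg.det(Y)

# Transformed features: each row of X is multiplied by Y.
Z = X @ Y

print(det_Y)
print(Z)
```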

## Checking the algorithm

We evaluate the model in two settings:
1. Without the transformation:
   - 1.1. On the original features without scaling
   - 1.2. On the original features with scaling
2. With the transformation:
   - 2.1. On the transformed features without scaling
   - 2.2. On the transformed features with scaling

Let's split the data into training and test sets.

In [None]:
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=12345)

In [None]:
model = LinearRegression()
model.fit(features_train, target_train)
R2_LR_origin_data = r2_score(target_test, model.predict(features_test))
print("w-vector coef",model.coef_)
print("R2 =", R2_LR_origin_data)

w-vector coef [ 1.79258369e-02  3.57228278e-02 -5.46000708e-07 -1.26186590e-02]
R2 = 0.435227571270266


In [None]:
regressor = LinearRegression()
scaller = StandardScaler()
pipeline = Pipeline([("standard_scaller", scaller),("linear_regression", regressor)])
pipeline.fit(features_train, target_train)
R2_LR_origin_data_scaled = r2_score(target_test, pipeline.predict(features_test))
# A Pipeline has no coef_ attribute of its own; the regression coefficients
# are available via pipeline.named_steps["linear_regression"].coef_
print("R2 =", R2_LR_origin_data_scaled)

R2 = 0.4352275712702667


The model scores the same on the original and on the scaled data. Differences from the 15th decimal place onward come from floating-point rounding, not from the models themselves. This is expected, since standardization is itself an invertible affine transformation of the features.
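
The same effect can be reproduced on synthetic data. In the sketch below the data and the step names `"scaler"` and `"regression"` are illustrative assumptions. It also shows how to read the coefficients of the final pipeline step through `named_steps` and how to convert them back to the original feature scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (assumption): three features on very different scales.
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 1000.0])
y = X @ np.array([0.5, -0.2, 0.001]) + rng.normal(size=200)

plain = LinearRegression().fit(X, y)
scaled = Pipeline([("scaler", StandardScaler()),
                   ("regression", LinearRegression())]).fit(X, y)

r2_plain = r2_score(y, plain.predict(X))
r2_scaled = r2_score(y, scaled.predict(X))
print(np.isclose(r2_plain, r2_scaled))

# Coefficients of the final step, expressed in the scaled feature space.
coef_scaled = scaled.named_steps["regression"].coef_
# Dividing by the per-feature standard deviation gives the original-scale weights.
coef_original = coef_scaled / scaled.named_steps["scaler"].scale_
print(np.allclose(coef_original, plain.coef_))
```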

Let's create a feature matrix transformation function.

In [None]:
def cipher_features(features):
    """Multiply the features by a random invertible matrix; return both."""
    n = features.shape[1]
    np.random.seed(12345)  # seed once so that the key matrix is reproducible
    cipher_matrix = np.random.randint(1, 10, (n, n))
    # Redraw until the matrix is invertible. The seed is not reset inside the
    # loop, otherwise the same singular matrix would be generated forever.
    while np.isclose(np.linalg.det(cipher_matrix), 0):
        cipher_matrix = np.random.randint(1, 10, (n, n))
    crypted_features = features @ cipher_matrix
    return crypted_features, cipher_matrix

Let's display the data before and after the transformation.

In [None]:
display(features.head())
features, cipher_matrix = cipher_features(features)
display(features.head())
cipher_matrix

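| | Пол | Возраст | Зарплата | Члены семьи |
|---|---|---|---|---|
| 0 | 1 | 41 | 49600.0 | 1 |
| 1 | 0 | 46 | 38000.0 | 1 |
| 2 | 0 | 29 | 21000.0 | 0 |
| 3 | 0 | 21 | 41700.0 | 2 |
| 4 | 1 | 28 | 26100.0 | 0 |


| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 99452.0 | 396931.0 | 347287.0 | 49899.0 |
| 1 | 76279.0 | 304140.0 | 266095.0 | 38329.0 |
| 2 | 42174.0 | 168087.0 | 147058.0 | 21203.0 |
| 3 | 83532.0 | 333667.0 | 291948.0 | 41861.0 |
| 4 | 52371.0 | 208890.0 | 182758.0 | 26301.0 |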


array([[3, 6, 2, 5],
       [6, 3, 2, 7],
       [2, 8, 7, 1],
       [3, 2, 3, 7]])

As you can see, the original values are no longer recognizable in the transformed data. Let's split the data into training and test samples and check the R2 value.

In [None]:
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=12345)

In [None]:
model = LinearRegression()
model.fit(features_train, target_train)
R2_LR_cipher_data = r2_score(target_test, model.predict(features_test))
print("w-vector coef",model.coef_)
print("R2 =", R2_LR_cipher_data)

w-vector coef [ 0.01280453  0.00328449 -0.00664342 -0.00538157]
R2 = 0.4352275712703063


In [None]:
regressor = LinearRegression()
scaller = StandardScaler()
pipeline = Pipeline([("standard_scaller", scaller),("linear_regression", regressor)])
pipeline.fit(features_train, target_train)
R2_LR_cipher_data_scaled = r2_score(target_test, pipeline.predict(features_test))
# A Pipeline has no coef_ attribute of its own; the regression coefficients
# are available via pipeline.named_steps["linear_regression"].coef_
print("R2 =", R2_LR_cipher_data_scaled)

R2 = 0.4352275712702772


Let's compare the quality indicators of the models.

In [None]:
# The order of the values must match the order of the index labels
result = pd.DataFrame(data=[R2_LR_origin_data,
                            R2_LR_origin_data_scaled,
                            R2_LR_cipher_data,
                            R2_LR_cipher_data_scaled],
                      columns=['R2'],
                      index=['Linear regression',
                             'Linear regression with scale',
                             'Linear regression on transformed features',
                             'Linear regression on transformed features with scale'])
result

| | R2 |
|---|---|
| Linear regression | 0.435228 |
| Linear regression with scale | 0.435228 |
| Linear regression on transformed features | 0.435228 |
| Linear regression on transformed features with scale | 0.435228 |
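
The table rounds the values to six decimal places. As a quick check of how close they really are, the sketch below compares the four unrounded R2 values copied from the cell outputs above.

```python
import numpy as np

# R2 values copied from the cell outputs above.
r2_values = np.array([
    0.435227571270266,   # original features
    0.4352275712702667,  # original features, scaled
    0.4352275712703063,  # transformed features
    0.4352275712702772,  # transformed features, scaled
])

# Largest difference between any two of the four scores.
spread = r2_values.max() - r2_values.min()

print(spread < 1e-12)
print(np.allclose(r2_values, r2_values[0]))
```

The spread stays far below any practically meaningful difference in R2, which is consistent with floating-point rounding being the only source of variation.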


<div style="border:2px solid Black; padding:20px;">

## Final conclusion
In the course of the work, the following was done:

1. The data was loaded and examined.
2. It was shown that the quality of linear regression does not change when the feature matrix is multiplied by an invertible matrix.
3. A data transformation algorithm was developed.
4. The algorithm was tested: the R2 metric is the same for the data with and without the transformation.

The results show that multiplying the features by an invertible matrix is a simple way to obscure the data, while anyone who holds the matrix can recover the original values.

</div>