
# Protection of personal data of clients

**Project Description**


The goal is to protect the data of the clients of the insurance company "We're not afraid of the flood". To do this, we need to develop a data transformation method that makes it difficult to recover personal information from the transformed data, and to justify that the method works correctly.
The transformation must not degrade the quality of machine learning models.

**Data description**


*Features:* gender, age, and salary of the insured person, and the number of family members.


*Target:* the number of insurance payments to the client over the past 5 years.

## 1.  Exploratory analysis

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [2]:
df = pd.read_csv('insurance.csv')

In [3]:
df.head()

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
0,1,41.0,49600.0,1,0
1,0,46.0,38000.0,1,1
2,0,29.0,21000.0,0,0
3,0,21.0,41700.0,2,0
4,1,28.0,26100.0,0,0


In [4]:
df.columns = ['gender', 'age', 'salary', 'family members number', 'insurance payments number']

In [5]:
df.tail()

Unnamed: 0,gender,age,salary,family members number,insurance payments number
4995,0,28.0,35700.0,2,0
4996,0,34.0,52400.0,1,0
4997,0,20.0,33900.0,2,0
4998,1,22.0,32700.0,3,0
4999,1,28.0,40600.0,1,0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   gender                     5000 non-null   int64  
 1   age                        5000 non-null   float64
 2   salary                     5000 non-null   float64
 3   family members number      5000 non-null   int64  
 4   insurance payments number  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [7]:
df['family members number'].value_counts()

1    1814
0    1513
2    1071
3     439
4     124
5      32
6       7
Name: family members number, dtype: int64

In [8]:
df['insurance payments number'].value_counts()

0    4436
1     423
2     115
3      18
4       7
5       1
Name: insurance payments number, dtype: int64


**Intermediate conclusion:**

No anomalies were detected after loading the data. No additional type conversion or handling of missing values is required.

## 2.  Matrix multiplication

Let's split the original dataframe into training and test samples.

In [9]:
# shuffle=False keeps the original row order, so random_state has no effect here
train, test = train_test_split(df, train_size=0.75, shuffle=False, random_state=12345)

Let's write a linear regression class that handles training, prediction, and calculation of the r2_score metric.

In [10]:
class LinearRegression1:
    def fit(self, train_features, train_target):
        # Prepend a column of ones so that w[0] is the intercept
        X = np.concatenate((np.ones((train_features.shape[0], 1)), train_features), axis=1)
        y = train_target
        # Normal equation: w = (X^T X)^{-1} X^T y
        w = np.dot(np.linalg.inv(X.T.dot(X)), X.T).dot(y)
        self.w = w[1:]   # feature weights
        self.w0 = w[0]   # intercept

    def predict(self, test_features):
        return test_features.dot(self.w) + self.w0

    def score(self, X, y, set_name):
        # Print the R2 metric for the given sample
        print('r2_score metric on {} sample: {:.3f}'.format(set_name, r2_score(y, self.predict(X))))

Let's fit a linear regression model on the original data as a baseline for later comparison, and also compare the coefficients produced by the class above with those of LinearRegression from the scikit-learn library.

In [11]:

model_1 = LinearRegression1()

train_features = train.drop('insurance payments number', axis=1)
train_target = train['insurance payments number']
test_features = test.drop('insurance payments number', axis=1)
test_target = test['insurance payments number']

model_1.fit(train_features, train_target)
model_1.score(train_features, train_target, 'train')
model_1.score(test_features, test_target, 'test')

r2_score metric on train sample: 0.426
r2_score metric on test sample: 0.423


In [12]:
# Checking the calculation of the coefficients used in the LinearRegression1 class
X = np.concatenate((np.ones((train_features.shape[0],1)), train_features),axis=1)
y = train_target
w = np.dot(np.linalg.inv(X.T.dot(X)),X.T).dot(y)
display(w[1:])

# Checking the results on a model from scikit-learn
model = LinearRegression()
model.fit(train_features, train_target)
model.coef_

array([ 1.03495034e-02,  3.61862478e-02, -9.06635933e-08, -1.35669330e-02])

array([ 1.03495034e-02,  3.61862478e-02, -9.06635933e-08, -1.35669330e-02])

The coefficients match. We introduce some notation for further theoretical justification of the algorithm.

Notation:

- $X$: feature matrix (column zero consists of ones)

- $y$: target vector

- $P$: the matrix by which the features are multiplied

- $w$: vector of linear regression weights (element zero is the intercept)

Predictions:

$$
\begin{equation}
\begin{split}
a = Xw
\end{split}
\end{equation}
$$

Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
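As a quick illustration of the training formula, here is a minimal sketch on synthetic data (the data, seed, and variable names are made up for this example and are not part of the insurance dataset). It computes the weights with the normal equation and compares them with NumPy's least-squares solver.

```python
import numpy as np

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4))
true_w = np.array([0.5, -1.0, 2.0, 0.0])
target = features @ true_w + 3.0 + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so that w[0] plays the role of the intercept
X = np.hstack([np.ones((features.shape[0], 1)), features])

# Normal equation: w = (X^T X)^{-1} X^T y
w_normal = np.linalg.inv(X.T @ X) @ X.T @ target

# Reference solution from a least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, target, rcond=None)

# Both approaches should agree up to floating-point error
print(np.allclose(w_normal, w_lstsq))
```

The explicit inverse mirrors the formula above; for ill-conditioned data a least-squares solver is usually the more numerically stable choice.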

**Question:** The features are multiplied by an invertible matrix. Will the quality of linear regression change?

**Answer:** No, it will not change.

Replace the matrix $X$ with the matrix $Z$:
$$
\begin{equation}
\begin{split}
Z = XP
\end{split}
\end{equation}
$$

Here $P$ is a square invertible matrix whose number of rows equals the number of columns of $X$, so that the product $XP$ is defined.

**Justification:**

Replace $X$ with $Z$ and calculate what the prediction and the vector of weights will be equal to.

$$
\begin{equation}
\begin{split}
a_1 = Zw_1 \\
w_1 = (Z^TZ)^{-1}Z^Ty
\end{split}
\end{equation}$$

Substitute the formula for $w_1$ into the expression for $a_1$:
$$
\begin{equation}
\begin{split}
a_1 = Z(Z^TZ)^{-1}Z^Ty \\
\end{split}
\end{equation}$$

Replace $Z$ with $XP$ from the first equation

$$
\begin{equation}
\begin{split}
a_1 = XP((XP)^T(XP))^{-1}(XP)^Ty \\
\end{split}
\end{equation}$$

The matrix $X$ is generally not square, so it has no inverse of its own. The matrices $P$, $P^T$ and $X^TX$ are square, and we assume they are invertible (the training formula already requires $X^TX$ to be invertible). We use two properties, the second of which holds for square invertible matrices:

$$(AB)^{T} = B^{T}A^{T}, \qquad (ABC)^{-1} = C^{-1}B^{-1}A^{-1}$$

Let's expand the transposes:

$$a_1 = XP((XP)^T(XP))^{-1}(XP)^{T}y = XP(P^{T}X^{T}XP)^{-1}P^{T}X^{T}y$$

Let's open the inverse, treating $P^{T}$, $X^{T}X$ and $P$ as three square factors:

$$
a_1 = XP\,P^{-1}(X^{T}X)^{-1}(P^{T})^{-1}P^{T}X^{T}y
$$

The products $PP^{-1}$ and $(P^{T})^{-1}P^{T}$ both equal the identity matrix $E$, and multiplying by the identity matrix does not change anything. Then, as a result of the transformations:

$$
a_1 = XE(X^{T}X)^{-1}EX^{T}y = X(X^{T}X)^{-1}X^{T}y = Xw = a
$$
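The result can also be checked numerically. The sketch below uses synthetic data (the values, seed, and the helper `normal_equation` are assumptions made for this illustration) and verifies that the predictions coincide. It also checks the relation $w_1 = P^{-1}w$, which follows from the same kind of expansion applied to the formula for $w_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features = 300, 4

# Synthetic features and target, assumed purely for illustration
features = rng.normal(size=(n_samples, n_features))
target = features @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.2, size=n_samples)

# X carries a leading column of ones, as in the notation above
X = np.hstack([np.ones((n_samples, 1)), features])

# Random square matrix with the same number of rows as X has columns
P = rng.normal(size=(n_features + 1, n_features + 1))
Z = X @ P


def normal_equation(A, y):
    """Hypothetical helper: least-squares weights via (A^T A)^{-1} A^T y."""
    return np.linalg.inv(A.T @ A) @ A.T @ y


w = normal_equation(X, target)
w_1 = normal_equation(Z, target)

a = X @ w        # predictions on the original features
a_1 = Z @ w_1    # predictions on the transformed features

print(np.allclose(a, a_1))                       # predictions coincide
print(np.allclose(w_1, np.linalg.inv(P) @ w))    # weights are related through P^{-1}
```

The comparison uses `np.allclose` rather than exact equality because the two computations accumulate different floating-point rounding errors.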


**Intermediate conclusion:**

The predictions do not change when the feature matrix is multiplied by an invertible matrix.

## 3.  Conversion algorithm

**Algorithm**

To protect the information while fitting the linear regression coefficients and making predictions, we multiply the feature matrix by an invertible matrix of random numbers.

Algorithm stages:

1. Generate a matrix of random numbers.

2. Check the matrix for invertibility (the existence of an inverse matrix).

3. Obtain the transformed feature matrices (for the training and test samples).

4. Check the algorithm on the transformed features.

**Justification**


Section 2 showed mathematically that the predictions do not change when the features are multiplied by the transformation matrix. When building the transformed feature matrix, two conditions must hold: the transformation matrix must have dimension $n \times n$, where $n$ is the number of features used in the regression, and it must be checked for the existence of an inverse.
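One detail is worth spelling out. The derivation in section 2 treats the column of ones as part of $X$, while the $n \times n$ matrix $P$ here multiplies only the feature columns, and the column of ones is added afterwards inside the `fit` method. A brief sketch of why the result still holds, writing $F$ for the feature matrix without the column of ones and $\tilde{P}$ for a block matrix (both symbols are introduced here for illustration only):

$$
\begin{pmatrix} \mathbf{1} & FP \end{pmatrix}
= \begin{pmatrix} \mathbf{1} & F \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & P \end{pmatrix}
= X\tilde{P},
\qquad
\det \tilde{P} = \det P \neq 0
$$

So the augmented transformed matrix is $X$ multiplied by an invertible $(n+1) \times (n+1)$ matrix, and the argument from section 2 applies unchanged.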

## 4.  Checking the algorithm

Let's create a function that generates an invertible matrix of pseudorandom numbers. We do not check the determinant separately: a square matrix has an inverse exactly when its determinant is nonzero, and that inverse is unique, so successfully computing the inverse is a sufficient check.

In [13]:
def generate_mtrx(features_train, reversibility):
    # The matrix is square, with one row and one column per feature
    n = features_train.shape[1]
    while reversibility:
        # Create a matrix of pseudorandom numbers
        random_mtrx = np.random.random((n, n))
        # np.linalg.inv raises LinAlgError for a singular matrix; retry in that case
        try:
            np.linalg.inv(random_mtrx)
            return random_mtrx
        except np.linalg.LinAlgError:
            continue
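
A caveat on the check above: `np.linalg.inv` raises an error only when the matrix is detected as exactly singular, so a nearly singular matrix can pass and still produce an inaccurate inverse. A stricter variant, sketched below, rejects candidates with a large condition number. The function name `generate_well_conditioned_matrix` and the threshold `max_cond` are assumptions made for this illustration, not part of the original algorithm.

```python
import numpy as np


def generate_well_conditioned_matrix(n, max_cond=1e6, seed=None):
    """Return a random n x n matrix whose condition number is below max_cond."""
    rng = np.random.default_rng(seed)
    while True:
        candidate = rng.random((n, n))
        # A huge condition number signals a numerically near-singular matrix
        if np.linalg.cond(candidate) < max_cond:
            return candidate


P = generate_well_conditioned_matrix(4, seed=12345)

# For a well-conditioned matrix, P times its inverse is close to the identity
print(np.allclose(P @ np.linalg.inv(P), np.eye(4)))
```

The threshold is a judgment call: a lower value gives a more numerically stable transformation but rejects more candidates.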

Let's build a linear regression model by multiplying the features by an invertible matrix.

In [14]:
rev_mtrx = generate_mtrx(train_features, True)

train_features_2 = train_features@rev_mtrx
test_features_2 = test_features@rev_mtrx

model_2 = LinearRegression1()
model_2.fit(train_features_2, train_target)
model_2.score(train_features_2, train_target, 'train')
model_2.score(test_features_2, test_target, 'test')

r2_score metric on train sample: 0.426
r2_score metric on test sample: 0.423
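
Because the transformation matrix is invertible, whoever holds it can restore the original features, while the transformed values on their own no longer resemble them. A minimal sketch, assuming a freshly generated matrix `P` (not the `rev_mtrx` from the cell above) and using the first three feature rows from the `df.head()` output:

```python
import numpy as np

rng = np.random.default_rng(42)

# First three feature rows from the df.head() output: gender, age, salary, family members
features = np.array([
    [1, 41.0, 49600.0, 1],
    [0, 46.0, 38000.0, 1],
    [0, 29.0, 21000.0, 0],
])

# Hypothetical transformation matrix for this example
P = rng.random((4, 4))

masked = features @ P                     # what would be shared with the model
recovered = masked @ np.linalg.inv(P)     # only possible when P is known

print(np.allclose(masked, features))                    # masked values differ from the originals
print(np.allclose(recovered, features, atol=1e-6))      # originals restored up to rounding error
```

The explicit `atol` accounts for rounding error, which grows with the magnitude of the salary column.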



## 5.  Final conclusions

One possible way to protect customer data is to multiply the feature matrix by an invertible matrix. The quality of linear regression does not change: the R2 metric on the transformed features (0.426 on the training sample, 0.423 on the test sample) matches the metric on the original features.

Thus, a data transformation algorithm for protecting the data was formulated, justified theoretically, and verified in practice.