The Sure Tomorrow insurance company wants to protect its clients' data. Your task is to develop a data transformation algorithm that makes it hard to recover personal information from the transformed data. Prove that the algorithm works correctly.

The data should be protected in such a way that the quality of machine learning models doesn't suffer. You don't need to pick the best model.

## 1. Data downloading

In [1]:
import pandas as pd
import numpy as np
from numpy.random import RandomState
state = RandomState(322)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [2]:
df = pd.read_csv('/datasets/insurance_us.csv')

In [3]:
features = df.drop('Insurance benefits',axis=1)
target = df['Insurance benefits']

## 2. Multiplication of matrices

In this task, you can write formulas in *Jupyter Notebook.*

To write a formula inline with the text, wrap it in dollar signs \$; to set it on its own line, wrap it in double dollar signs \$\$. These formulas are written in the *LaTeX* markup language.

For example, we wrote down linear regression formulas. You can copy and edit them to solve the task.

You don't have to use *LaTeX*.

Denote:

- $X$ — feature matrix (the zeroth column consists of ones)

- $y$ — target vector

- $P$ — matrix by which the features are multiplied

- $w$ — linear regression weight vector (the zeroth element is the bias, i.e. the intercept)

Predictions:

$$
a = Xw
$$

Training objective:

$$
\min_w d_2(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$

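Answer: Let's multiply $X$ by the matrix $P$ and follow the algebra. We assume that $P$ is square and invertible and that $X^TX$ is invertible, so the inverse of a product can be expanded as $(ABC)^{-1} = C^{-1}B^{-1}A^{-1}$.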

$$ w'=((XP)^TXP)^{-1}(XP)^Ty $$

$$ w'=(P^TX^T(XP))^{-1}(XP)^Ty $$

$$w'=(P^T(X^TX)P)^{-1}(XP)^Ty$$

$$w'=P^{-1}(X^TX)^{-1}(P^T)^{-1}(XP)^Ty $$

$$w'=P^{-1}(X^TX)^{-1}(P^T)^{-1}P^TX^Ty$$

$$w'=P^{-1}(X^TX)^{-1}X^Ty$$

$$w'=P^{-1}w$$

$$a = Xw, \quad a' = X'w'$$

$$X' = XP$$

Because $w' = P^{-1}w$, we can conclude that:

$$a' = XPP^{-1}w = Xw = a$$

where $a$ and $a'$ are the prediction vectors before and after the transformation.

Justification:
The weights change ($w' = P^{-1}w$), but the predictions stay the same ($a' = a$), so the quality of linear regression does not change.
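
The identities above can also be checked numerically. The following is a minimal sketch on synthetic data; the sizes, the seed, and the helper name `normal_equation` are illustrative assumptions, not part of the insurance dataset or the original notebook. It solves the normal equations before and after multiplying by a random matrix $P$.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for illustration

# Synthetic feature matrix with a leading column of ones, and a target vector
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 4))])
y = rng.normal(size=100)

# Random square matrix; a matrix with continuous random entries is
# invertible with probability 1
P = rng.normal(size=(5, 5))


def normal_equation(X, y):
    """Solve (X^T X) w = X^T y for w."""
    return np.linalg.solve(X.T @ X, X.T @ y)


w = normal_equation(X, y)
w_new = normal_equation(X @ P, y)

# w' = P^{-1} w, computed by solving P w' = w instead of forming the inverse
weights_related = np.allclose(w_new, np.linalg.solve(P, w))
# a' = X P w' should match a = X w
predictions_equal = np.allclose(X @ w, (X @ P) @ w_new)
print(weights_related, predictions_equal)
```

Both checks compare floating-point results with a tolerance, since the two solutions agree only up to rounding error.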

## 3. Transformation algorithm

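**Algorithm**

Generate a random square matrix $P$ whose size equals the number of features, check that it is invertible, and multiply the feature matrix by it.
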
In [4]:
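def transform(matrix):
    """Multiply the feature matrix by a random invertible square matrix P."""
    n_features = matrix.shape[1]
    while True:
        P = np.random.rand(n_features, n_features)
        try:
            np.linalg.inv(P)  # raises LinAlgError if P is singular
            break
        except np.linalg.LinAlgError:
            continue  # singular matrix, draw a new one
    return P, matrix @ P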

**Justification**

I multiplied the feature matrix by an invertible matrix $P$. According to the algebra in part 2, the predictions of a model trained on the transformed features are the same as those of a model trained on the original features.

In [5]:
P, new_features = transform(features)
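
Because $P$ is invertible, anyone who holds $P$ can undo the transformation, so $P$ effectively acts as a secret key and should be stored separately from the transformed data. A minimal sketch of that round trip, assuming a small made-up numeric table (the values below are hypothetical and are not rows from the dataset):

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for illustration

# Hypothetical numeric rows standing in for client features
X = np.array([[35.0, 52000.0, 2.0],
              [58.0, 31000.0, 0.0],
              [24.0, 44500.0, 3.0]])

P = rng.random((3, 3))                     # random square matrix
X_masked = X @ P                           # obfuscated features
X_restored = X_masked @ np.linalg.inv(P)   # undo the transformation with P^{-1}

# Compare with an absolute tolerance, since the round trip is exact
# only up to floating-point rounding
recovered = np.allclose(X, X_restored, atol=1e-6)
print(recovered)
```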

## 4. Algorithm test

In [6]:
# Same test_size and random_state, so both feature sets are split on identical rows
features_train, features_test, target_train, target_test = train_test_split(
    new_features, target, test_size=0.25, random_state=12345)
features_train_old, features_test_old = train_test_split(features, test_size=0.25, random_state=12345)

In [7]:
model = LinearRegression()
model.fit(features_train,target_train)
pred = model.predict(features_test)
r2_score(target_test,pred)

0.43522757126872025

In [8]:
model = LinearRegression()
model.fit(features_train_old,target_train)
pred = model.predict(features_test_old)
r2_score(target_test,pred)

0.435227571270266

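As you can see, the R2 scores with the original and the transformed features differ only by about $10^{-12}$, which is floating-point rounding error, so the transformation does not affect the quality of the model.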
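
One detail the derivation in part 2 does not cover: `LinearRegression` fits the intercept separately by default instead of using a column of ones in $X$. The predictions should still coincide, because centering the features commutes with right-multiplication by $P$. A minimal sketch that checks this on synthetic data (the shapes, coefficients, and seed are assumptions made for illustration, not the insurance dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)  # arbitrary seed for illustration

# Synthetic stand-in for the dataset (hypothetical shapes and coefficients)
X = rng.normal(size=(400, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(scale=0.5, size=400)
P = rng.normal(size=(4, 4))  # random square matrix, assumed invertible

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# Fit on the original features, then on the same rows multiplied by P
pred_old = LinearRegression().fit(X_train, y_train).predict(X_test)
pred_new = LinearRegression().fit(X_train @ P, y_train).predict(X_test @ P)

max_diff = np.max(np.abs(pred_old - pred_new))  # largest prediction gap
r2_old = r2_score(y_test, pred_old)
r2_new = r2_score(y_test, pred_new)
print(max_diff, r2_old, r2_new)
```

Comparing the predictions directly, rather than only the R2 scores, is a slightly stronger check, since equal scores alone do not guarantee equal predictions.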