### DATA MASKING

The goal of the task is to develop a data transformation algorithm that protects clients' personal information without harming the quality of the linear regression model.

### 1. Data loading

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

In [2]:
df=pd.read_csv('/datasets/insurance_us.csv')
df.head()

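|   | Gender | Age | Salary | Family members | Insurance benefits |
|---|--------|-----|--------|----------------|--------------------|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |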


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Gender                5000 non-null int64
Age                   5000 non-null float64
Salary                5000 non-null float64
Family members        5000 non-null int64
Insurance benefits    5000 non-null int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


The dataset has 5000 entries and 5 columns: two of type float64 and three of type int64. There are no missing values.

### 2. Multiplication of matrices

Denote:

- $X$: feature matrix (the zeroth column consists of ones)

- $y$: target vector

- $P$: matrix by which the features are multiplied

- $w$: linear regression weight vector (the zeroth element is the bias)

Predictions:

$$
a = Xw
$$

Training objective:

$$
\min_w d_2(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
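As a quick illustration of the training formula, here is a minimal sketch that computes $w$ on synthetic data and compares it with NumPy's least-squares solver. The data, the seed, and the names `true_w`, `w_normal`, and `w_lstsq` are assumptions made for this example only; they are not part of the insurance dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic data (an assumption for illustration, not the insurance dataset)
n_samples, n_features = 100, 3
features = rng.normal(size=(n_samples, n_features))
true_w = np.array([2.0, -1.0, 0.5, 3.0])  # [bias, w1, w2, w3]

# Prepend the column of ones so that w[0] plays the role of the bias
X = np.concatenate((np.ones((n_samples, 1)), features), axis=1)
y = X @ true_w + rng.normal(scale=0.1, size=n_samples)

# Training formula: w = (X^T X)^{-1} X^T y
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference solution from NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))
```

Both approaches minimize the same squared error, so the two weight vectors should agree up to rounding error whenever $X^TX$ is well conditioned.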

*Answer:* 
#### The predictions are the same with and without multiplication by the matrix P, provided that P is invertible.



*Justification:* 

Assume that $P$ is an invertible matrix and let $ X' = XP $. We'll first find the relationship between the new weight vector $w'$ and the original $w$.

$$
w' = ((XP)^T (XP))^{-1} (XP)^T y
$$

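Here we'll use the following matrix properties (the inverses exist because $P$ is invertible and $X^TX$ is assumed to be invertible, as the training formula requires):

$ (AB)^T = B^TA^T $, $ (AB)^{-1} = B^{-1}A^{-1} $ for square invertible $A$ and $B$, and $ AA^{-1} = I $, where $I$ is the identity matrix.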

So,

$$ w' = (P^TX^TXP)^{-1}P^TX^Ty $$

$$ w' = (X^TXP)^{-1}(P^T)^{-1}P^TX^Ty $$

$$ w' = (X^TXP)^{-1}X^Ty $$

$$ w' = P^{-1}(X^TX)^{-1}X^Ty $$

$$ w' = P^{-1}w $$

Now let's prove that the predictions are the same ($ a' = a $):

$$ a' = X'w' $$

$$ a' = XPP^{-1}w $$

$$ a' = Xw = a $$

$$ a' = a $$                                        
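The derivation above can also be checked numerically. The sketch below is an illustration under stated assumptions: `X` and `y` are random stand-ins rather than the insurance data, `fit_weights` is a hypothetical helper that applies the training formula, and `P` is the matrix printed in section 3.

```python
import numpy as np

rng = np.random.default_rng(42)

X = rng.normal(size=(50, 4))  # stand-in for the feature matrix
y = rng.normal(size=50)       # stand-in for the target vector

# The invertible matrix P printed in section 3
P = np.array([
    [ 1.27441429,  1.31479393,  0.28503923,  0.14161571],
    [-1.83615311,  0.58860192,  0.60888692, -0.30618003],
    [-0.49312915,  0.21603914,  0.81105444,  0.72070484],
    [ 0.84774782, -1.08708203,  0.16753061, -0.84089861],
])

def fit_weights(X, y):
    """Hypothetical helper: return w = (X^T X)^{-1} X^T y."""
    return np.linalg.inv(X.T @ X) @ X.T @ y

w = fit_weights(X, y)            # weights on the original features
w_prime = fit_weights(X @ P, y)  # weights on the transformed features

# Check w' = P^{-1} w and a' = a, up to floating-point rounding
weights_match = np.allclose(w_prime, np.linalg.inv(P) @ w)
predictions_match = np.allclose(X @ w, (X @ P) @ w_prime)
print(weights_match, predictions_match)
```

The comparison uses `np.allclose` rather than exact equality because the two computations round differently in floating-point arithmetic.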

### 3. Transformation algorithm

We'll generate a random matrix with the numpy.random.normal() function. The matrix must be square, of size (4, 4) to match the four features, and invertible.

In [4]:
P = np.random.normal(size=(4,4))
P

array([[ 1.27441429,  1.31479393,  0.28503923,  0.14161571],
       [-1.83615311,  0.58860192,  0.60888692, -0.30618003],
       [-0.49312915,  0.21603914,  0.81105444,  0.72070484],
       [ 0.84774782, -1.08708203,  0.16753061, -0.84089861]])

Let's check that our matrix is invertible by computing its inverse with the numpy.linalg.inv() function. If the call raises an error (numpy.linalg.LinAlgError), the matrix is singular, that is, non-invertible.

In [5]:
np.linalg.inv(P)

array([[ 0.27990232, -0.27416468,  0.06550233,  0.20310434],
       [ 0.45718526,  0.26895391, -0.30437796, -0.28180612],
       [ 0.2742896 ,  0.26866952,  0.80331652,  0.63686225],
       [-0.25420353, -0.57056474,  0.6195674 , -0.4932562 ]])

No error was raised, so P is an invertible matrix.
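A random normal matrix is almost always invertible, but it can be close to singular, which makes the inverse numerically unreliable. One possible safeguard, sketched below, is to redraw the matrix until its condition number is reasonably small. The helper name `generate_invertible_matrix` and the `max_cond` and `max_tries` parameters are assumptions introduced for this example; they are not part of the original notebook.

```python
import numpy as np

def generate_invertible_matrix(size, rng, max_cond=1e6, max_tries=100):
    """Draw random normal matrices until one is safely invertible.

    Hypothetical helper: the name and the thresholds are assumptions.
    """
    for _ in range(max_tries):
        candidate = rng.normal(size=(size, size))
        # A very large condition number means the matrix is singular or nearly so
        if np.linalg.cond(candidate) < max_cond:
            return candidate
    raise RuntimeError("no well-conditioned matrix found")

rng = np.random.default_rng(12345)
P = generate_invertible_matrix(4, rng)

# np.linalg.inv raises np.linalg.LinAlgError for an exactly singular matrix
P_inv = np.linalg.inv(P)

# P times its inverse should be the identity matrix, up to rounding
print(np.allclose(P @ P_inv, np.eye(4)))
```

Passing a seeded generator also makes the masking matrix reproducible, which matters if the same transformation has to be applied again later.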

To transform the data, we'll multiply the feature matrix by the matrix P. As shown in section 2, the linear regression predictions are the same after this transformation.
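To make the transformation concrete, the following sketch masks the first three feature rows from section 1 with the matrix P printed above, then recovers them with the inverse of P. It is only an illustration: the variable names `masked` and `recovered` are assumptions, and the recovery step shows that anyone who holds P can undo the masking.

```python
import numpy as np

# First three rows of the dataset from section 1, without the target column
# (columns: Gender, Age, Salary, Family members)
features = np.array([
    [1, 41.0, 49600.0, 1],
    [0, 46.0, 38000.0, 1],
    [0, 29.0, 21000.0, 0],
])

# The invertible matrix P printed above
P = np.array([
    [ 1.27441429,  1.31479393,  0.28503923,  0.14161571],
    [-1.83615311,  0.58860192,  0.60888692, -0.30618003],
    [-0.49312915,  0.21603914,  0.81105444,  0.72070484],
    [ 0.84774782, -1.08708203,  0.16753061, -0.84089861],
])

masked = features @ P                   # transformed (masked) features
recovered = masked @ np.linalg.inv(P)   # undo the masking with P^{-1}

print(masked[0])                        # no longer readable as gender, age, salary
print(np.allclose(recovered, features))
```

Each masked column is a mix of all four original columns, so individual values such as age or salary can't be read off directly without knowing P.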

### 4. Algorithm test


Let's check whether the quality of linear regression stays the same before and after the transformation, using the R2 metric.

* before transformation

In [6]:
features = df.drop('Insurance benefits', axis=1)
target = df['Insurance benefits']

# split the data into a training set and test set at a ratio of 75:25
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.25, random_state=12345)
    

In [7]:
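class LinearRegression:
    def fit(self, train_features, train_target):
        # Prepend a column of ones so that w[0] acts as the bias term
        X = np.concatenate((np.ones((train_features.shape[0], 1)), train_features), axis=1)

        y = train_target
        # Training formula: w = (X^T X)^{-1} X^T y
        w = ((np.linalg.inv(X.T @ X)).dot(X.T)).dot(y)
        self.w = w[1:]   # feature weights
        self.w0 = w[0]   # bias

    def predict(self, test_features):
        # Predictions a = Xw, with the bias added separately
        return test_features.dot(self.w) + self.w0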
    
model = LinearRegression()
model.fit(features_train, target_train)
predictions = model.predict(features_test)

print('R2 score before transformation:', r2_score(target_test, predictions))

R2 score before transformation: 0.43522757127026657


* after transformation

In [8]:
# multiplication of two matrices 
features_P = features @ P

# split the data into a training set and test set at a ratio of 75:25
features_train_P, features_test_P, target_train, target_test = train_test_split(features_P, target, test_size=0.25, random_state=12345)
    
    
model = LinearRegression()
model.fit(features_train_P, target_train)
predictions = model.predict(features_test_P)

print('R2 score after transformation:', r2_score(target_test, predictions))

R2 score after transformation: 0.4352275689346069


#### The data transformation algorithm, which multiplies the feature matrix by an invertible matrix, doesn't harm the quality of the linear regression model.

#### The R2 score is the same for the original features and for the features after multiplication, up to rounding error: the two values differ by only about $2 \cdot 10^{-9}$.
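One detail connects the test to the proof in section 2: the `LinearRegression` class adds the column of ones after the features have been multiplied by P, whereas the proof multiplies the whole matrix $X$ by P. The sketch below, on synthetic features and with the assumed names `F` and `P_ext`, suggests why the proof still applies: the design matrix built in the test equals the original one multiplied by a block-diagonal matrix made of 1 and P, which is also invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.normal(size=(20, 4))  # synthetic features, for illustration only

# The invertible matrix P printed in section 3
P = np.array([
    [ 1.27441429,  1.31479393,  0.28503923,  0.14161571],
    [-1.83615311,  0.58860192,  0.60888692, -0.30618003],
    [-0.49312915,  0.21603914,  0.81105444,  0.72070484],
    [ 0.84774782, -1.08708203,  0.16753061, -0.84089861],
])

ones = np.ones((F.shape[0], 1))

# What the test does: transform the features, then prepend the ones
X_after = np.concatenate((ones, F @ P), axis=1)

# Equivalent view: prepend the ones first, then multiply by blockdiag(1, P)
P_ext = np.eye(5)
P_ext[1:, 1:] = P
X_ext = np.concatenate((ones, F), axis=1) @ P_ext

print(np.allclose(X_after, X_ext))
```

Because `P_ext` is invertible whenever P is, the derivation from section 2 carries over with `P_ext` in place of P.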