<div class="alert alert-success">
    Hi! Thanks for taking the time to improve the project! Now it looks great! The project is accepted and you can move on to the next sprint. Good luck!
</div>

# Review
Hi, my name is Dmitry and I will be reviewing your project.

You can find my comments in colored markdown cells:

<div class="alert alert-success">
    If everything is done successfully.
</div>

<div class="alert alert-info">
    If I have some (optional) suggestions, or questions to think about, or general comments.
</div>

<div class="alert alert-danger">
    If a section requires some corrections. Work can't be accepted with red comments.
</div>

First of all, thank you for turning in the project! You did some good work, but unfortunately you didn't do the theoretical part. Please have another look at it!

The Sure Tomorrow insurance company wants to protect its clients' data. Your task is to develop a data transformation algorithm that makes it hard to recover personal information from the transformed data. Prove that the algorithm works correctly.

The data should be protected in such a way that the quality of machine learning models doesn't suffer. You don't need to pick the best model.

## 1. Data downloading

Import the required libraries.

In [1]:
import pandas as pd
import numpy as np
from numpy.linalg import inv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

Open and read the data.

In [2]:
data = pd.read_csv('/datasets/insurance_us.csv')

In [3]:
data.head()

| | Gender | Age | Salary | Family members | Insurance benefits |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Gender                5000 non-null int64
Age                   5000 non-null float64
Salary                5000 non-null float64
Family members        5000 non-null int64
Insurance benefits    5000 non-null int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


## 2. Multiplication of matrices

In this task, you can write formulas in *Jupyter Notebook.*

To write the formula in-between the text, frame it with dollar signs \\$; if it should be outside the text —  with double signs \\$\\$. These formulas are written in markup language *LaTeX.* 

For example, we wrote down linear regression formulas. You can copy and edit them to solve the task.

You don't have to use *LaTeX*.

Denote:

- $X$ — feature matrix (the zero column consists of ones)

- $y$ — target vector

- $P$ — matrix by which the features are multiplied

- $w$ — linear regression weight vector (zero element is equal to the shift)

Predictions:

$$
a = Xw
$$

Training objective:

$$
\min_w d_2(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$

**Answer:** ...

**Justification:** ...

*Split the data into train and test sets:*

In [5]:
features = data.drop('Insurance benefits', axis=1)
target = data['Insurance benefits']

In [6]:
#split features and target in one call so that their rows stay aligned
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=0)

*Original features:*

In [7]:
X = np.concatenate((np.ones((features_train.shape[0], 1)), features_train), axis=1)

In [8]:
original_w = inv(X.T.dot(X)).dot(X.T).dot(target_train)
print(original_w)

[-9.71916396e-01  9.29860938e-03  3.70151367e-02 -2.09123664e-07
 -1.51987617e-02]


In [9]:
ypred = features_test.dot(original_w[1:]) + original_w[0]

In [10]:
#mean_squared_error returns a single number, so np.argmin over it is always 0
w_min = np.argmin(mean_squared_error(ypred, target_test))

In [11]:
print(w_min)

0


In [12]:
original_mse = mean_squared_error(ypred, target_test)

In [13]:
original_mse

0.09783243165662998

*The minimum MSE value for original features is obtained when the weights are equal to:*

In [14]:
original_w

array([-9.71916396e-01,  9.29860938e-03,  3.70151367e-02, -2.09123664e-07,
       -1.51987617e-02])

*Features multiplied by an invertible matrix:*

I will create an arbitrary 4x4 matrix. The feature matrix has 4 columns, and the number of columns of the first matrix must be equal to the number of rows of the second matrix.

I will check that the matrix is invertible.

**An invertible matrix:**
   - is a square matrix that has an inverse
   - has a determinant that is not equal to zero.

In [15]:
A = np.matrix([[2, -1, 0, 1],[1, 2, -1, 0],[-1, 0, 1, 2], [0, 1, 2, -1]])

A is an invertible matrix if it has an inverse, i.e. *the `numpy.linalg.inv()` function does not raise a `LinAlgError`*.

In [16]:
inv(A)

matrix([[ 0.375,  0.125, -0.125,  0.125],
        [-0.125,  0.375,  0.125,  0.125],
        [ 0.125, -0.125,  0.125,  0.375],
        [ 0.125,  0.125,  0.375, -0.125]])
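The invertibility check can also be wrapped in a small helper. The sketch below is only an illustration: `is_invertible` and the `singular` example matrix are hypothetical names that are not part of the project, and `A` is re-created as a plain array with the same values as above.

```python
import numpy as np

def is_invertible(matrix):
    """Return True if np.linalg.inv succeeds on the given matrix."""
    try:
        np.linalg.inv(matrix)
    except np.linalg.LinAlgError:
        # Raised for singular (or non-square) matrices
        return False
    return True

# Same values as the matrix A defined above
A = np.array([[2, -1, 0, 1], [1, 2, -1, 0], [-1, 0, 1, 2], [0, 1, 2, -1]])

# Hypothetical counterexample: the second row is twice the first row
singular = np.array([[1.0, 2.0], [2.0, 4.0]])

print(is_invertible(A), np.linalg.det(A))
print(is_invertible(singular))
```

A nonzero determinant and a successful call to `inv` are two views of the same property, so either check can be used here.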

In [17]:
#multiply the features by the invertible matrix
#(stored under a new name so that the original features stay available)
features_masked = features @ A

In [18]:
#split the masked features
features_train, features_test = train_test_split(features_masked, test_size=0.25, random_state=0)

In [19]:
X = np.concatenate((np.ones((features_train.shape[0], 1)), features_train), axis=1)

In [20]:
masked_w = inv(X.T.dot(X)).dot(X.T).dot(target_train)
print(masked_w)

[-0.97191638  0.00621405  0.01081848 -0.00916413  0.00768899]


In [21]:
ypred = features_test.dot(masked_w[1:]) + masked_w[0]

In [22]:
#mean_squared_error returns a single number, so np.argmin over it is always 0
w_min = np.argmin(mean_squared_error(ypred, target_test))

In [23]:
print(w_min)

0


In [24]:
masked_mse = mean_squared_error(ypred, target_test)

In [25]:
masked_mse

0.09783243185307258

*The minimum MSE value for masked features is obtained when the weights are equal to:*

In [26]:
masked_w

array([-0.97191638,  0.00621405,  0.01081848, -0.00916413,  0.00768899])

In [27]:
original_mse, masked_mse

(0.09783243165662998, 0.09783243185307258)

In [28]:
#prediction bias values for the original and masked features.
original_w[0], masked_w[0]

(-0.9719163957736481, -0.9719163766746913)
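The printed weights can be used to check how the two weight vectors are related. The sketch below copies the rounded values from the outputs above and tests one hypothesis: because the column of ones is appended after the multiplication by `A`, the bias should stay the same and the remaining weights should be related through the inverse of `A`. The name `expected_masked` is introduced here only for illustration.

```python
import numpy as np

# Weights copied from the outputs above (index 0 is the bias)
original_w = np.array([-9.71916396e-01, 9.29860938e-03, 3.70151367e-02,
                       -2.09123664e-07, -1.51987617e-02])
masked_w = np.array([-0.97191638, 0.00621405, 0.01081848,
                     -0.00916413, 0.00768899])

# Same values as the matrix A used to mask the features
A = np.array([[2, -1, 0, 1], [1, 2, -1, 0], [-1, 0, 1, 2], [0, 1, 2, -1]])

# Hypothesis: masked feature weights = inverse of A times original feature weights
expected_masked = np.linalg.inv(A) @ original_w[1:]

print(expected_masked)
# The tolerance allows for the rounding of the printed values
print(np.allclose(expected_masked, masked_w[1:], atol=1e-7))
```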

$y$ is the dependent variable/feature

$X$ are the independent features

The terms $w$ and $w_0$ are the parameters of the model.

The parameter $w_0$ is the intercept, and the parameter $w$ is the slope parameter. 

Fitting the model $a = Xw + w_0$ means estimating $w$ and $w_0$. To do this, we add a column consisting only of ones to the $X$ matrix, so that the prediction bias $w_0$ becomes the zero element of the weight vector and the predictions are obtained by multiplying the extended $X$ matrix by that vector.

We calculate $w$ using this formula:
$$
w = (X^T X)^{-1} X^T y
$$

The objective is to find the coefficients ($w$) that minimize the distance between the predictions and the target vector ($y$):
$$
\min_w d_2(Xw, y)
$$

My justification is that, to find the minimum MSE of the model, we need to use the weights in $w$.

The weights in $w$ give us the model parameters for which the value of the loss function on the training set is minimal.

The MSE quality metric reaches the same value, about 0.098, for both sets of parameters.

The two models are also related: they have the same prediction bias value $w_0$.
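The claim that the training formula gives the weights with minimal training loss can be sanity-checked numerically. The sketch below uses synthetic data (not the insurance dataset) and compares the MSE of the closed-form weights with the MSE of randomly perturbed weights. All names and constants in it are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic design matrix with a leading column of ones, and a noisy target
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y = X @ np.array([0.5, 1.0, -2.0, 0.3]) + rng.normal(scale=0.1, size=100)

# Closed-form least-squares weights from the training formula
w = np.linalg.inv(X.T @ X) @ X.T @ y

def mse(weights):
    """Mean squared error of the predictions X @ weights on the training set."""
    return np.mean((X @ weights - y) ** 2)

best = mse(w)
# MSE for 20 random perturbations of the closed-form weights
perturbed = [mse(w + rng.normal(scale=0.05, size=w.shape)) for _ in range(20)]

print(best, min(perturbed))
```

Every perturbed weight vector should give a larger training MSE than the closed-form one, which is what the minimization objective predicts.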

<div class="alert alert-danger">
    <s>You didn't do step 2: "Provide a theoretical proof based on the equation of linear regression. The features are multiplied by an invertible matrix. Show that the quality of the model is the same for both sets of parameters: the original features and the features after multiplication. How are the weight vectors from MSE minimums for these models related?" Instead you do step 4 twice...</s>
</div>

<div class="alert alert-danger">
    <s>I'm sorry, but that is still not a proof. Let's start again:</s>
</div>

Denote:

- $X$ — feature matrix (the zero column consists of ones)

- $y$ — target vector

- $P$ — matrix by which the features are multiplied

- $w$ — linear regression weight vector (zero element is equal to the shift)

Predictions:

$$
a = Xw
$$

Training objective:

$$
\min_w d_2(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$

Suppose we transform the features like this:
$$ X' = XP $$

What are the new weights:
$$ w' = ((X')^T X')^{-1} (X')^T y = \ ?$$

What are the new predictions:
$$ a' = X'w' = \ ?$$

These are the new weights:
$$ w' = ((X')^T X')^{-1} (X')^T y $$
$$    = ((XP)^T XP)^{-1} (XP)^T y $$
$$    = (P^T X^T X P)^{-1} P^T X^T y $$
$$    = P^{-1}(X^T X)^{-1}(P^T)^{-1} P^T X^T y $$

The last step uses $(ABC)^{-1} = C^{-1} B^{-1} A^{-1}$ with $A = P^T$, $B = X^T X$ and $C = P$. It is valid because $P$ is invertible and $X^T X$ is assumed to be invertible, as the training formula already requires.

The product of a matrix and its inverse is the identity matrix $I$.

Therefore:
$$ (P^T)^{-1} P^T = I $$

Hence:
$$ w' = P^{-1}(X^T X)^{-1} X^T y $$
$$    = P^{-1}w $$

$w'$ is the inverse of $P$ multiplied by $w$, in that order.

The new predictions will be:
$$ a' = X'w' $$
$$  = XP P^{-1} w $$
$$   = Xw = a $$

The new predictions are also equal to the predictions of the original features.

Therefore, the quality of the model is the same for both sets of parameters.
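Both results of the derivation can be checked numerically. The sketch below is a minimal example under stated assumptions: the data are synthetic, there is no column of ones, the helper `fit` is a hypothetical name, and the 4x4 matrix from section 2 is reused as $P$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))   # synthetic feature matrix
y = rng.normal(size=50)        # synthetic target vector

# Invertible matrix with the same values as A in section 2
P = np.array([[2, -1, 0, 1], [1, 2, -1, 0], [-1, 0, 1, 2], [0, 1, 2, -1]])

def fit(features, target):
    """Training formula: w = (X^T X)^{-1} X^T y."""
    return np.linalg.inv(features.T @ features) @ features.T @ target

w = fit(X, y)
w_masked = fit(X @ P, y)

# w' = P^{-1} w
print(np.allclose(w_masked, np.linalg.inv(P) @ w))
# a' = X' w' = X w = a
print(np.allclose((X @ P) @ w_masked, X @ w))
```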

<div class="alert alert-danger">
    <s>The weights can't be equal to the original weights, unless $P$ is the identity matrix. You need to plug $X' = XP$ into the training formula and see what happens.</s>
</div>

<div class="alert alert-danger">
    <s>Thanks for adding a proof, that's a good start! The idea is correct, but unfortunately some steps are wrong. Here are a couple of pointers:<br>
    1. Matrix multiplication is not commutative, so the terms can't be rearranged like that. $AB \neq BA$ in general.<br>
    2. $(AB)^T = B^T A^T$.<br>
        3. $(AB)^{-1} = B^{-1} A^{-1}$.</s>
</div>

<div class="alert alert-success">
    The proof is correct now! Good job!
</div>

## 3. Transformation algorithm

**Algorithm**

...

**Justification**

...

The transformation algorithm generates a random square matrix and checks that it is invertible by calling `inv` from `np.linalg`, which raises a `LinAlgError` if the matrix is singular.

It then returns the features multiplied by that matrix (the masked features, which will be used for prediction) together with the matrix itself.

In [29]:
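def feature_transformer(features):
    #fix the seed so that the same matrix P is generated on every run
    np.random.seed(0)
    #random square matrix with one row and one column per feature
    P = np.random.normal(size=(features.shape[1], features.shape[1]))
    #inv raises numpy.linalg.LinAlgError if P is singular
    inv(P)
    #mask the features; P is returned too so the originals can be recovered
    X = features @ P
    return X, P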

<div class="alert alert-info">
    <s>The algorithm is fine, but you should probably save the random matrix, so that you could recover the original features in the future. Also, you shouldn't do the splitting inside this function. It should only do feature transformation.</s>
</div>
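Returning the matrix makes it possible to recover the original features later. The sketch below illustrates the round trip on the five rows shown by `data.head()` in section 1. It regenerates the matrix with the same seed as `feature_transformer` instead of calling that function, so it can run on its own. The names `masked` and `recovered` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Feature columns of the first five rows, copied from data.head() above
features = pd.DataFrame({
    'Gender': [1, 0, 0, 0, 1],
    'Age': [41.0, 46.0, 29.0, 21.0, 28.0],
    'Salary': [49600.0, 38000.0, 21000.0, 41700.0, 26100.0],
    'Family members': [1, 1, 0, 2, 0],
})

# Same seed and shape as in feature_transformer
np.random.seed(0)
P = np.random.normal(size=(features.shape[1], features.shape[1]))

masked = features.to_numpy() @ P          # what would be shared or stored
recovered = masked @ np.linalg.inv(P)     # undo the masking with the inverse

# The comparison uses a tolerance because of floating-point rounding
print(np.allclose(recovered, features.to_numpy(), atol=1e-6))
```

Anyone who holds `P` can undo the masking in the same way, so the matrix has to be stored as carefully as the original data.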

## 4. Algorithm test

*Model quality before transformation:*

In [30]:
features_train, features_test = train_test_split(features, test_size=0.25, random_state=0)

In [31]:
model_original = LinearRegression()
model_original.fit(features_train, target_train)
original_predictions = model_original.predict(features_test)
original_score = r2_score(target_test, original_predictions)

*Model quality after transformation:*

In [32]:
X, P = feature_transformer(features)

In [33]:
#the invertible matrix
print(P)

[[ 1.76405235  0.40015721  0.97873798  2.2408932 ]
 [ 1.86755799 -0.97727788  0.95008842 -0.15135721]
 [-0.10321885  0.4105985   0.14404357  1.45427351]
 [ 0.76103773  0.12167502  0.44386323  0.33367433]]


In [34]:
features_train, features_test = train_test_split(X, test_size=0.25, random_state=0)

In [35]:
model_masked = LinearRegression()
model_masked.fit(features_train, target_train)
masked_predictions = model_masked.predict(features_test)
masked_score = r2_score(target_test, masked_predictions)

In [36]:
original_score, masked_score

(0.3878739635059356, 0.38787396350593595)

**Conclusion:**

The R2 scores of the model before and after the transformation agree to within floating-point rounding error. This confirms empirically that the transformation does not change the quality of the model.
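The two scores differ only in their last digits, so a tolerance-based comparison is more reliable than comparing them by eye or with `==`. A minimal sketch, using the values printed above and an arbitrarily chosen relative tolerance:

```python
import math

# Scores copied from the output above
original_score = 0.3878739635059356
masked_score = 0.38787396350593595

# Exact equality fails because of floating-point rounding
print(original_score == masked_score)
# A relative tolerance (an assumed value here) treats the scores as equal
print(math.isclose(original_score, masked_score, rel_tol=1e-12))
print(abs(original_score - masked_score))
```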

<div class="alert alert-success">
    Great! You've empirically confirmed that the algorithm works.
</div>

## Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  Step 1 performed: the data was downloaded
- [x]  Step 2 performed: the answer to the matrix multiplication problem was provided
    - [x]  The correct answer was chosen
    - [x]  The choice was justified
- [x]  Step 3 performed: the transform algorithm was proposed
    - [x]  The algorithm was described
    - [x]  The algorithm was justified
- [x]  Step 4 performed: the algorithm was tested
    - [x]  The algorithm was realized
    - [x]  Model quality was assessed before and after the transformation