# Protecting clients' personal data
The insurance company "Want a Flood" needs to protect its clients' data. The task is to develop a data transformation that makes it difficult to recover personal information from the transformed data, and to justify that the method works correctly.

The transformation must not degrade the quality of machine learning models trained on the data.

In [1]:
import pandas as pd
import numpy as np
from numpy.random import RandomState
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

## Data overview

In [2]:
try:
    data = pd.read_csv('/datasets/insurance.csv')
except FileNotFoundError:
    # Fall back to a local copy when the hosted path is unavailable
    data = pd.read_csv('C:/Users/Ivan/datasetsYP/insurance.csv')
    
data.columns = ['gender', 'age', 'salary', 'relatives', 'insurance']
display(data.describe())
display(data.isnull().sum())
print(data.duplicated().sum())
data = data.drop_duplicates(keep='first')
display(data.describe())
display(data.info())

| | gender | age | salary | relatives | insurance |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.499 | 30.9528 | 39916.36 | 1.1942 | 0.148 |
| std | 0.500049 | 8.440807 | 9900.083569 | 1.091387 | 0.463183 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


gender       0
age          0
salary       0
relatives    0
insurance    0
dtype: int64

153


| | gender | age | salary | relatives | insurance |
|---|---|---|---|---|---|
| count | 4847.0 | 4847.0 | 4847.0 | 4847.0 | 4847.0 |
| mean | 0.498453 | 31.023932 | 39895.811842 | 1.203425 | 0.152259 |
| std | 0.500049 | 8.487995 | 9972.953985 | 1.098664 | 0.468934 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33200.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


<class 'pandas.core.frame.DataFrame'>
Int64Index: 4847 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   gender     4847 non-null   int64  
 1   age        4847 non-null   float64
 2   salary     4847 non-null   float64
 3   relatives  4847 non-null   int64  
 4   insurance  4847 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 227.2 KB


None

**Conclusions**: The data were loaded and examined. There are no missing values and no obvious outliers. However, the data contained 153 fully duplicated rows. Because salary takes many distinct values (unlike gender, age, number of relatives, and insurance benefits), exact duplicates are unlikely to occur by chance, so they were removed. The duplicates presumably stand in for missing values.

## Matrix Multiplication


Notation:

- $X$ - feature matrix (the zeroth column consists of ones)

- $y$ - target vector

- $P$ - matrix by which the features are multiplied

- $w$ - vector of linear regression weights (the zeroth element is the intercept)

Predictions:

$$
a = Xw
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$

Transformed features:
$$
X_n = X P
$$

The weights trained on the transformed features are:

$$
w_n = (X_n^T X_n)^{-1} X_n^T y = ((XP)^T (XP))^{-1} (XP)^T y
$$
Expanding the transpose with $(XP)^T = P^T X^T$:

$$
w_n = (P^T X^T X P)^{-1} P^T X^T y
$$

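Next, expand the inverse using the rule for the inverse of a product:

$$
(ABC)^{-1} = C^{-1} (AB)^{-1} = C^{-1} B^{-1} A^{-1}, \quad A = P^T, \quad B = X^T X, \quad C = P
$$
The rule applies because all three factors are square and invertible: $P$ is invertible by construction (and so is $P^T$), and $X^T X$ is an $n \times n$ matrix that is assumed to be invertible, as the training formula already requires. Being square alone would not be enough.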

$$
w_n = P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y
$$
Since $(P^T)^{-1} P^T$ is the identity matrix, it cancels:
$$
w_n = P^{-1} (X^T X)^{-1} X^T y
$$
Since:
$$
w = (X^T X)^{-1} X^T y
$$
We can simply substitute w:
$$
w_n = P^{-1} w
$$
Finally, let's substitute in the prediction:
$$
a_n = X_n w_n = X P P^{-1} w = X w = a
$$
The predictions before and after the transformation are identical, so the quality of linear regression does not change.

**Proved.**
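
The result can also be checked numerically. The sketch below is not part of the original analysis: it uses small synthetic arrays instead of the insurance data, and the names `rng` and `fit_weights` as well as the array sizes are illustrative assumptions. It solves the normal equations directly with NumPy and compares the two sets of weights and predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions, not the insurance data)
X = rng.normal(size=(50, 4))   # feature matrix with linearly independent columns
y = rng.normal(size=50)        # target vector
P = rng.normal(size=(4, 4))    # random square matrix, invertible with probability 1


def fit_weights(features, target):
    """Solve the normal equations (X^T X) w = X^T y for w."""
    return np.linalg.solve(features.T @ features, features.T @ target)


w = fit_weights(X, y)          # weights on the original features
w_n = fit_weights(X @ P, y)    # weights on the transformed features

# w_n should equal P^{-1} w, and both models should predict the same values
weights_match = np.allclose(w_n, np.linalg.inv(P) @ w)
predictions_match = np.allclose(X @ w, (X @ P) @ w_n)
print(weights_match, predictions_match)
```

Both comparisons use a floating-point tolerance, since the two sides are computed along different numerical paths and agree only up to rounding error.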

## Conversion Algorithm
**Algorithm**

Based on the proof above, we multiply the feature matrix by an additional matrix $P$ with fixed values. The values of $P$ are generated randomly once, and the same matrix is applied to both the training and the test samples.

Since the number of features and the number of users must stay the same, $P$ must be a square matrix of size $n \times n$, where $n$ is the number of features. Its columns must also be linearly independent so that the features do not degenerate. Equivalently, $P$ must be non-degenerate (invertible), which also makes it possible to reconstruct the original features.

Thus, it is necessary to:
1) Generate a random square matrix $P$ of size $n \times n$, where $n$ is the number of features
2) Check that it is invertible (i.e., non-degenerate)
3) Multiply the original feature matrix $X$ by the matrix $P$

**Rationale**

Squareness and size of the matrix:  
The matrix $X$ has size $m \times n$, where $m$ is the number of users and $n$ is the number of features.  
We need a new matrix $X'$ of the same size, obtained as  
$X' = X P$,  
hence $P$ must have size $n \times n$.
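
To illustrate why invertibility matters, here is a minimal sketch of masking features and then restoring them with $P^{-1}$. The helper `make_invertible_matrix`, the `max_cond` threshold, and the sample rows are assumptions made for illustration only. The condition number is used here in place of a determinant test because it is less sensitive to the scale of the matrix entries.

```python
import numpy as np


def make_invertible_matrix(n, rng, max_cond=1e3):
    """Draw random n x n matrices until one is well conditioned (hypothetical helper)."""
    while True:
        candidate = rng.random((n, n))
        if np.linalg.cond(candidate) < max_cond:
            return candidate


rng = np.random.default_rng(1)

# Made-up rows in the layout gender, age, salary, relatives
X = np.array([
    [1.0, 35.0, 42000.0, 2.0],
    [0.0, 52.0, 31500.0, 0.0],
    [1.0, 23.0, 58000.0, 1.0],
])

P = make_invertible_matrix(X.shape[1], rng)

X_masked = X @ P                           # same shape as X, values no longer readable
X_restored = X_masked @ np.linalg.inv(P)   # only possible because P is invertible

restored_ok = np.allclose(X_restored, X, atol=1e-6)
print(restored_ok)
```

Anyone who holds $P$ can undo the transformation this way, so in practice the matrix would need to be stored as carefully as the original data.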

**Feature transformation**

In [3]:
N_FEATURES = data.shape[1] - 1  # all columns except the target
our_random_state = RandomState(42)

def get_rand_matrix(n_features):
    """Draw random square matrices until one is non-degenerate."""
    while True:
        matrix = our_random_state.rand(n_features, n_features)
        # Compare with a tolerance: an exact `== 0` test on a float determinant is unreliable
        if not np.isclose(np.linalg.det(matrix), 0):
            return matrix

P_matrix = get_rand_matrix(N_FEATURES)

# Check for invertibility (the determinant is not close to 0)
if not np.isclose(np.linalg.det(P_matrix), 0):
    print('Matrix is not degenerate - everything is fine!')
else:
    print('It was not supposed to be like this...')

# Transform features using the pre-generated random matrix
def get_transformed_features(old_features):
    return old_features @ P_matrix

print('\nThe Matrix:')
display(P_matrix)

Matrix is not degenerate - everything is fine!

The Matrix:


array([[0.37454012, 0.95071431, 0.73199394, 0.59865848],
       [0.15601864, 0.15599452, 0.05808361, 0.86617615],
       [0.60111501, 0.70807258, 0.02058449, 0.96990985],
       [0.83244264, 0.21233911, 0.18182497, 0.18340451]])

**Conclusions**: the requirements on the transformation matrix are formulated, and the feature transformation function is prepared.

## Algorithm validation
Let's train a model and compute predictions for two data sets: the original one and the transformed one.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    data[data.columns[:-1]].values,
    data[data.columns[-1]].values,
    test_size=0.25,
    random_state=42,
)

# Get the transformed features
X_train_transformed = get_transformed_features(X_train)
X_test_transformed = get_transformed_features(X_test)

# Train a linear regression model and report its R2 on the test set
def linear_regression_checker(name, X_train, X_test, y_train, y_test):
    model = LinearRegression()
    model.fit(X_train, y_train)
    r2 = model.score(X_test, y_test)
    print(f'For model {name} r2 = {r2}\n')
    return r2

# Evaluate both cases
r2_not_transformed = linear_regression_checker('original features', X_train, X_test, y_train, y_test)
r2_transformed = linear_regression_checker('transformed features', X_train_transformed,
                                           X_test_transformed, y_train, y_test)
print(f'Difference in prediction quality = {r2_not_transformed - r2_transformed}')

For model original features r2 = 0.4434633083161058

For model transformed features r2 = 0.4434633083154019

Difference in prediction quality = 7.038813976123492e-13


The R2 scores for the original and the transformed features differ only by about $7 \cdot 10^{-13}$, which is within floating-point error, so the prediction quality can be considered unchanged.
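
One detail is worth spelling out. The proof treats the column of ones as part of $X$, while the code multiplies only the feature columns by $P$ and lets `LinearRegression` fit the intercept separately. The sketch below suggests why the result still holds: appending a column of ones to $XP$ gives the same matrix as multiplying the augmented $X$ by a block-diagonal matrix built from $1$ and $P$, which is invertible whenever $P$ is. The synthetic data and the names `with_ones` and `P_ext` are assumptions for illustration, and least squares is solved with NumPy rather than scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-ins (assumptions, not the insurance data)
X = rng.normal(size=(40, 4))
y = rng.normal(size=40)
P = rng.random((4, 4))


def with_ones(features):
    """Prepend a column of ones so the zeroth weight acts as the intercept."""
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])


# Block-diagonal extension of P that leaves the column of ones untouched
P_ext = np.block([
    [np.ones((1, 1)), np.zeros((1, 4))],
    [np.zeros((4, 1)), P],
])

same_matrix = np.allclose(with_ones(X @ P), with_ones(X) @ P_ext)

# Least-squares fits with an intercept, before and after the transformation
w, *_ = np.linalg.lstsq(with_ones(X), y, rcond=None)
w_n, *_ = np.linalg.lstsq(with_ones(X @ P), y, rcond=None)

same_predictions = np.allclose(with_ones(X) @ w, with_ones(X @ P) @ w_n)
print(same_matrix, same_predictions)
```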

## Conclusions
1) The data were loaded and examined; the duplicates found were removed.
2) It is shown that multiplying the feature matrix by a random invertible matrix does not affect the predictions of linear regression.
3) A transformation algorithm is designed and implemented: the feature matrix is multiplied by a square non-degenerate matrix filled with random numbers, whose size equals the number of features.
4) Validation showed that models trained and tested on the original and the transformed features have the same R2 score.