# Data
## Get data from the dataset
To reconstruct some of the steps in the paper by Yilmaz et al., one of the datasets from that paper was used: the Car Evaluation dataset from the UCI Machine Learning Repository.

In [1]:
import pandas as pd
import numpy as np
raw_data = pd.read_csv("./Dataset/car.data")

print(raw_data.describe())

       buying  maint doors persons lug_boot safety classification
count    1728   1728  1728    1728     1728   1728           1728
unique      4      4     4       3        3      3              4
top     vhigh  vhigh     2       2    small    low          unacc
freq      432    432   432     576      576    576           1210


## Change categorical data to numeric
The columns that hold string values are converted to numeric codes in the following manner:

|numeric val|buying|maint|doors|persons|lug_boot|safety|classification|
|-----------|------|-----|-----|-------|--------|------|--------------|
|0|high|high|2|2|big|high|acc|
|1|low|low|3|4|med|low|good|
|2|med|med|4|more|small|med|unacc|
|3|vhigh|vhigh|5more| | | |vgood|

The codes follow the sorted (alphabetical) order of the labels, which is how `pd.Categorical` assigns them by default.

In [2]:
# Replace the values of every column with their category codes
features = raw_data.columns

for feature_name in features:
    raw_data[feature_name] = pd.Categorical(raw_data[feature_name])
    raw_data[feature_name] = raw_data[feature_name].cat.codes

# Hold out the last row as a single test example and drop it from the data
test_row = raw_data.iloc[-1:]
raw_data = raw_data[:-1]

raw_data.head()

   buying  maint  doors  persons  lug_boot  safety  classification
0       3      3      0        0         2       1               2
1       3      3      0        0         2       2               2
2       3      3      0        0         2       0               2
3       3      3      0        0         1       1               2
4       3      3      0        0         1       2               2
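
By default `pd.Categorical` assigns codes in sorted (alphabetical) order of the labels, which is why `vhigh` shows up as 3 in the output above. The sketch below illustrates this on the `buying` labels. The explicit `buying_order` list is an assumption for illustration and is not used elsewhere in this notebook.

```python
import pandas as pd

labels = pd.Series(["vhigh", "high", "med", "low"])

# Default: categories are sorted alphabetically (high, low, med, vhigh)
default_codes = pd.Categorical(labels).codes.tolist()

# Explicit order (hypothetical): codes follow the list that is passed in
buying_order = ["vhigh", "high", "med", "low"]
explicit_codes = pd.Categorical(
    labels, categories=buying_order, ordered=True
).codes.tolist()

print(default_codes)   # [3, 0, 2, 1]
print(explicit_codes)  # [0, 1, 2, 3]
```

Passing the categories explicitly is one way to make the codes follow a meaningful order instead of the alphabet.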


# Classification
## Naive Bayes classification without LDP
Make a simple Naive Bayes classifier with the help of the scikit-learn Python library.
The same train/test ratio as in the paper was used (80%/20%).
The classifier is considerably less accurate (about 82%) than the one from the paper (97%).
The goal of this notebook, however, is to get an overview of all the steps involved in the process of using LDP data with ML classifiers.

In [3]:
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split data into features and classes
features = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']
X = raw_data[features]
y = raw_data.classification

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Initialize the Naive Bayes classifier
mnb = CategoricalNB()

# Fit data to classifier
y_pred = mnb.fit(X_train, y_train).predict(X_test)

print("Number of incorrectly predicted classifications: out of %d items, %d are incorrect." 
      %(X_test.shape[0], (y_test != y_pred).sum()))
print("Accuracy score: %f: "% (accuracy_score(y_test, y_pred)))

Number of incorrectly predicted classifications: out of 346 items, 63 are incorrect.
Accuracy score: 0.817919: 


# LDP
Let's try to make an LDP (local differential privacy) encoder to perturb the data in the dataset.
## Unary encoding
Unary encoding is a simple place to start and can be used on all of our features.
Let's start by changing just one column of the dataset: the classification.

The first two steps will perturb the data.

The last step will get an estimate of the frequency of each value in the domain of the classification.

In [4]:
# The domain to use is that of the classification column
p = .8
q = 1 - p

# Define 3 functions:
# First the encoder, which encodes the response.
def encode(response, domain):
    return [1 if d == response else 0 for d in domain]

# Second the perturbing of the data
def perturb(encoded_response):
    return [perturb_bit(b) for b in encoded_response]

def perturb_bit(bit):
    sample = np.random.random()
    if bit == 1:
        if sample <= p:
            return 1
        else:
            return 0
    elif bit == 0:
        if sample <= q:
            return 1
        else: 
            return 0
        
# Third, the aggregation, which corrects for the expected number of 'fake' responses in the perturbed data.
# The aggregate remains valuable, while each individual response stays anonymized.
def aggregate(responses):
    # Take the sum of all the columns (axis=0), ie. go over all the bits that represent the encoded response.
    sums = np.sum(responses, axis=0)
    n = len(responses)
    
    return [(v - n*q) / (p-q) for v in sums]

# Per row in the dataset: first encode the class into a vector of bits, then perturb that vector.
# unique() returns the values in order of first appearance, so the index printed below
# is the position in this domain, not the category code.
domain = raw_data.classification.unique()
responses = [perturb(encode(i, domain)) for i in raw_data.classification]

estimated_counts = aggregate(responses)
print("The estimated count per value in the domain:", list(zip(range(len(domain)), estimated_counts)))

actual_counts = np.sum([encode(i, domain) for i in raw_data.classification], axis=0)
print("The ACTUAL count per value in the domain:", list(zip(range(len(domain)), actual_counts)))

The estimated count per value in the domain: [(0, 1214.3333333333335), (1, 386.00000000000006), (2, 24.333333333333464), (3, 37.66666666666679)]
The ACTUAL count per value in the domain: [(0, 1210), (1, 384), (2, 64), (3, 69)]
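
The values of `p` and `q` also fix how much privacy each response gets. For unary encoding with independent bit flips, the usual privacy budget is ε = ln(p(1 − q) / ((1 − p)q)). This formula comes from the LDP literature rather than from this notebook, so treat the snippet as a side calculation; `unary_encoding_epsilon` is a hypothetical helper name.

```python
import math

def unary_encoding_epsilon(p, q):
    """Privacy budget of unary encoding where a 1-bit stays 1 with
    probability p and a 0-bit becomes 1 with probability q."""
    return math.log((p * (1 - q)) / ((1 - p) * q))

p = 0.8
q = 1 - p
print(round(unary_encoding_epsilon(p, q), 3))  # 2.773, i.e. ln(16)
```

A smaller ε means stronger privacy but noisier estimates, which is the trade-off visible in the estimated counts above.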


## Problem connecting to an ML classifier
The aggregate function corrects for the expected number of forced fake responses by rescaling the bit sums with p and q.
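
The correction follows from the expected value of each bit sum: if a value truly occurs `t` times among `n` responses, its bit sums to `t*p + (n - t)*q` on average, and solving for `t` gives `(v - n*q) / (p - q)`. A minimal simulation, assuming made-up counts and a fixed seed chosen only for illustration, shows the estimates landing near the true counts:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 0.8, 0.2

# Made-up true counts for a domain of three values
true_counts = np.array([7000, 2000, 1000])
n = true_counts.sum()

# One-hot encode every response: each row has a single 1 at its value's position
values = np.repeat(np.arange(len(true_counts)), true_counts)
encoded = np.eye(len(true_counts), dtype=int)[values]

# Keep each 1 with probability p, turn each 0 into a 1 with probability q
samples = rng.random(encoded.shape)
perturbed = np.where(encoded == 1, samples <= p, samples <= q).astype(int)

# Invert the expected bit sum t*p + (n - t)*q to recover t
estimates = (perturbed.sum(axis=0) - n * q) / (p - q)
print(np.round(estimates, 1))
```

The error of each estimate does not shrink for rare values, so small counts are relatively much noisier than large ones.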

How do we get this last step into an ML classification algorithm?

Let's recreate the Naive Bayes classifier from the Yilmaz et al. paper.

### Split data to fit the Bayes function
Two pieces of data need to be encoded:
- The classification
- The feature combined with the classification

The first is a straightforward encoding and perturbing of all the classifications.
For the second we need to make the feature dependent on the classification in some way. For this the following function can be used: `(input) * k + v`, where `k` is the number of different class values and `v` is the actual value of the class.
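
As a small worked example of this mapping, the sketch below combines a feature value and a class value into one position and splits it again. The helper names `pair_index` and `split_index` are hypothetical and only serve the illustration.

```python
def pair_index(feat_val, class_val, k):
    """Map a (feature value, class value) pair to a single position."""
    return feat_val * k + class_val

def split_index(index, k):
    """Recover (feature value, class value) from a combined position."""
    return divmod(index, k)

k = 4  # four class values, as in the classification column
index = pair_index(3, 2, k)
print(index)                  # 14
print(split_index(index, k))  # (3, 2)
```

A feature with `d` values therefore needs a unary vector of `d * k` bits, for example 16 bits for `buying`.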

In [5]:
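# Make an array of encoded class values.
# The domain is sorted so that bit position j corresponds to class code j,
# which is how the class frequencies are indexed later on.
perturbed_classes = [perturb(encode(i, sorted(y_train.unique().tolist()))) for i in y_train]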

# Make an array of all the encoded features.
# Each feature value is combined with the class value of its row into one
# position in a domain of size (number of feature values) * k.
k = len(y_train.unique())
perturbed_features = []
for i in range(len(X_train)):
    v = y_train.iloc[i]
    features_list = []
    for feature in features:
        input_val = X_train.iloc[i][feature]
        domain_size = X_train[feature].nunique() * k
        perturbed_input = perturb(encode(input_val * k + v, range(domain_size)))
        features_list.append(perturbed_input)
    perturbed_features.append(features_list)

print("An example row of features and class:")
print(perturbed_features[0],perturbed_classes[0])

An example row of features and class:
[[0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0], [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0]] [1, 1, 0, 0]


## Central aggregator
### Perform frequency estimation at the data aggregator
At the central data aggregator we must now estimate the frequency of each class value and of each feature|class pair.

We need two different functions that return the specific estimate for a requested feature|class pair or class.

In [20]:
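# Get frequencies for the classification
frequency_class = aggregate(perturbed_classes)
# The noise can produce non-positive estimates; replace those with 1 so every class keeps a usable count
frequency_class = [item if item > 0 else 1 for item in frequency_class]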

# Function to make a frequency estimate per feature|class value
def estimate(feat_val, class_val, feat_name):
    n_classes = len(y_train.unique().tolist())
    # Position of this (feature value, class value) pair in the combined domain
    index = feat_val * n_classes + class_val

    # Get the estimates for all positions in the domain of the requested feature
    df = pd.DataFrame(perturbed_features).set_axis(features, axis=1)
    estimates = aggregate(df[feat_name].values.tolist())
    # The noise can produce non-positive estimates; replace those with 1
    estimates = [item if item > 0 else 1 for item in estimates]

    # The estimate for this particular pair can now be read from the list
    return estimates[index]

# First do it for Fi = x | Cj, so for one feature value given one class value.
# pd.Categorical assigns codes in sorted order, so buying = 3 is 'vhigh' and class = 2 is 'unacc'.
estimate(3, 2, 'buying')

276.3333333333334

### Convert frequencies to probabilities
Now that we have the frequencies for each feature and class, we can convert them into probabilities that can be used in the Bayes function.

This is done by dividing each estimate by the sum of the estimates over its domain.

In [21]:
def probability_class(class_val):
    return frequency_class[class_val] / np.sum(frequency_class)

# Sum the estimates of all values in the domain of the feature
# Divide the specific estimate by that sum to get the probability

def probability_feat_class(feat_val, class_val, feat_name):
    sum_feat_class = np.sum([estimate(i, class_val, feat_name) for i in range(len(X_train[feat_name].unique().tolist()))])
    return estimate(feat_val, class_val, feat_name) / sum_feat_class

probability_feat_class(3,2,'buying')

0.29386742289968093
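
The same normalization can be pictured as a table with one row per feature value and one column per class, where each column is divided by its sum. The sketch below uses made-up estimates and an assumed `k = 2`; it is an illustration of the idea, not a drop-in replacement for the functions above.

```python
import numpy as np

k = 2  # number of class values (assumed for this toy example)

# Made-up aggregated estimates for a feature with 3 values, laid out at
# position feat_val * k + class_val; one estimate is negative due to noise
estimates = np.array([40.0, 10.0, 25.0, -3.0, 35.0, 9.0])

# Rows are feature values, columns are class values
table = estimates.reshape(-1, k)

# Replace non-positive estimates with 1, mirroring the clamping used above
table = np.where(table > 0, table, 1.0)

# Divide each column by its sum to get P(feature value | class)
cond_prob = table / table.sum(axis=0)
print(cond_prob)
```

Computing such a table once per feature would also avoid re-aggregating the perturbed data on every call.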

## Perform the Bayes function
The Bayes function can now give a score for each class, proportional to its probability given the feature values of a row, using the probabilities we found.

In [30]:
# We need to calculate the probability for each class with our input values
def bayes(input_val):
    probs = []
    for class_val in range(len(y_train.unique().tolist())):
        prob_prod = 1
        for feat_num in range(len(features)):
            feat_name = features[feat_num]
            feat_val = input_val[feat_num]
            prob_prod = prob_prod * probability_feat_class(feat_val, class_val, feat_name)
        probs.append(prob_prod * probability_class(class_val))
    return probs

def select_max_classification(results):
    # For each list of class scores, pick the index of the highest score
    return [int(np.argmax(item)) for item in results]

# print(pd.DataFrame(bayes(test_row.values.tolist())).transpose())
# print("Actual classification: ", test_row.classification)
y_predict = select_max_classification([bayes(row) for row in X_test.values.tolist()])

# accuracy_score expects the true labels first, then the predictions
print(accuracy_score(y_test, y_predict))

0.43352601156069365
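
A common variation on the product of probabilities is to add logarithms instead, which avoids numerical underflow when there are many features. The sketch below assumes made-up priors and conditional tables; `log_posterior_scores` is a hypothetical helper, not part of the notebook or of scikit-learn.

```python
import numpy as np

def log_posterior_scores(row, priors, cond_tables):
    """Unnormalized log-posterior per class for one row of feature codes.

    priors:      array of shape (k,), P(class)
    cond_tables: one array of shape (d_i, k) per feature,
                 holding P(feature value | class)
    """
    scores = np.log(priors)
    for feat_val, table in zip(row, cond_tables):
        scores = scores + np.log(table[feat_val])
    return scores

# Made-up probabilities for two classes and two features
priors = np.array([0.7, 0.3])
cond_tables = [
    np.array([[0.8, 0.1], [0.2, 0.9]]),              # feature 0: 2 values
    np.array([[0.5, 0.2], [0.3, 0.3], [0.2, 0.5]]),  # feature 1: 3 values
]

row = [1, 2]
scores = log_posterior_scores(row, priors, cond_tables)
print(int(np.argmax(scores)))  # 1
```

Because the logarithm is monotonic, the class with the highest log score is the same class the product would select.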
