# Data
## Get data from the dataset
To reproduce some of the steps in the paper by Yilmaz et al., one of the datasets from that paper is used: the Car Evaluation dataset from the UCI Machine Learning Repository.

In [2]:
import pandas as pd
import numpy as np
raw_data = pd.read_csv("./Dataset/car.data")

print(raw_data.describe())

       buying  maint doors persons lug_boot safety classification
count    1728   1728  1728    1728     1728   1728           1728
unique      4      4     4       3        3      3              4
top     vhigh  vhigh     2       2    small    low          unacc
freq      432    432   432     576      576    576           1210


## Change categorical data to numeric
The columns that hold string values are converted to numeric codes. `pd.Categorical` assigns the codes in sorted (alphabetical) order of the labels, which gives the following mapping:

|numeric val|buying|maint|doors|persons|lug_boot|safety|classification|
|-----------|------|-----|-----|-------|--------|------|--------------|
|0|high|high|2|2|big|high|acc|
|1|low|low|3|4|med|low|good|
|2|med|med|4|more|small|med|unacc|
|3|vhigh|vhigh|5more| | | |vgood|

In [4]:
# Replace the values in every column with their category codes
features = raw_data.columns

for feature_name in features:
    raw_data[feature_name] = pd.Categorical(raw_data[feature_name])
    raw_data[feature_name] = raw_data[feature_name].cat.codes

raw_data.head()

| |buying|maint|doors|persons|lug_boot|safety|classification|
|-|------|-----|-----|-------|--------|------|--------------|
|0|3|3|0|0|2|1|2|
|1|3|3|0|0|2|2|2|
|2|3|3|0|0|2|0|2|
|3|3|3|0|0|1|1|2|
|4|3|3|0|0|1|2|2|
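
To check which label each numeric code stands for, the categories can be read back from the `Categorical` object before the column is overwritten. The snippet below is a minimal sketch on a made-up toy column (the `toy_buying` series is hypothetical and not part of the dataset); it shows that the codes follow the sorted order of the labels, not their order of appearance.

```python
import pandas as pd

# Hypothetical toy column with the same labels as the 'buying' feature
toy_buying = pd.Series(["vhigh", "high", "med", "low", "vhigh"])

cat = pd.Categorical(toy_buying)

# cat.categories holds the labels in sorted order; the position is the code
mapping = dict(enumerate(cat.categories))
codes = list(cat.codes)

print("code -> label:", mapping)
print("codes for the toy column:", codes)
```

Applying the same idea to each column of `raw_data` before the conversion would give the full mapping for the dataset.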


# Classification
## Naive Bayes classification without LDP
A simple Naive Bayes classifier is built with the scikit-learn Python library.
The same train/test ratio as in the paper is used (80%/20%).
The accuracy of this classifier (83%) is much lower than the one reported in the paper (97%).
The goal of this notebook, however, is to get an overview of all the steps involved in the process of using local differential privacy (LDP) data with ML classifiers.

In [5]:
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split data into features and classes
features = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']
X = raw_data[features]
y = raw_data.classification

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Initialize the Naive Bayes classifier
mnb = CategoricalNB()

# Fit the classifier on the training set and predict the test set
y_pred = mnb.fit(X_train, y_train).predict(X_test)

print("Number of incorrectly predicted classifications: out of %d items, %d are incorrect."
      % (X_test.shape[0], (y_test != y_pred).sum()))
print("Accuracy score: %f" % accuracy_score(y_test, y_pred))

Number of incorrectly predicted classifications: out of 346 items, 58 are incorrect.
Accuracy score: 0.832370


# LDP
Let's build an LDP data encoder to perturb the data in the dataset.
## Unary encoding
Unary encoding is a simple starting point and can be applied to each of our features.
Let's start by perturbing just one column of the dataset: the classification.

The first two steps (encoding and perturbing) produce the perturbed data.

The last step estimates the frequency of each value in the domain of the classification.

In [87]:
# The domain to use is that of the classification column
domain = raw_data.classification.unique()

# Define 3 functions:
# First the encoder, which encodes the response.
def encode(response):
    return [1 if d == response else 0 for d in domain]

# Second the perturbing of the data
def perturb(encoded_response):
    return [perturb_bit(b) for b in encoded_response]

def perturb_bit(bit):
    p = .8
    q = .2
    
    sample = np.random.random()
    if bit == 1:
        if sample <= p:
            return 1
        else:
            return 0
    elif bit == 0:
        if sample <= q:
            return 1
        else: 
            return 0
        
# Third, the aggregation, which corrects for the 'fake' responses introduced by the perturbation.
# The aggregate of the data is still valuable, while each individual response stays anonymized.
def aggregate(responses):
    p = .8
    q = .2
    
    # Sum every column (axis=0), i.e. count the 1s reported for each bit position of the encoded responses.
    sums = np.sum(responses, axis=0)
    n = len(responses)
    
    return [(v - n*q) / (p-q) for v in sums]

# Per row in the dataset: first encode the classification into a vector of bits, then perturb that vector.
responses = [perturb(encode(i)) for i in raw_data.classification]

counts = aggregate(responses)
print("The estimated count per value in the domain:", list(zip(domain, counts)))
actual_counts = np.sum([encode(i) for i in raw_data.classification], axis=0)
print("The ACTUAL count per value in the domain:", list(zip(domain, actual_counts)))

The estimated count per value in the domain: [(2, 1175.6666666666665), (0, 318.99999999999994), (3, 95.66666666666661), (1, 103.99999999999994)]
The ACTUAL count per value in the domain: [(2, 1210), (0, 384), (3, 65), (1, 69)]
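
The choice of p and q determines how much privacy the perturbation gives. Under the usual analysis of unary encoding, two different inputs differ in two bits, so the mechanism satisfies ε-LDP with ε = ln(p(1-q) / ((1-p)q)). The sketch below computes this for the values used above; the helper name `unary_encoding_epsilon` is an assumption made for this illustration and does not appear elsewhere in the notebook.

```python
import math

def unary_encoding_epsilon(p, q):
    """Privacy budget of unary encoding with keep-probability p and flip-probability q."""
    return math.log((p * (1 - q)) / ((1 - p) * q))

# Values used in perturb_bit and aggregate
eps = unary_encoding_epsilon(0.8, 0.2)
print(f"epsilon = {eps:.3f}")
```

Moving p and q closer together lowers ε (more privacy) but makes the estimated counts noisier.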


## Problem: connecting to an ML classifier
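The aggregate function corrects for the randomly flipped ('fake') bits: with n responses, a value that is reported v times gets the estimate (v - n*q) / (p - q), where p is the probability that a 1 bit stays 1 and q is the probability that a 0 bit becomes 1.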
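
To see why this correction works: if a value truly occurs c times among n responses, its bit is expected to be reported c*p + (n - c)*q times, and solving that for c gives the estimator. The snippet below is a small vectorized sketch of the same encode, perturb and aggregate steps, using the actual class counts printed earlier. The seed and the use of `np.random.default_rng` are choices made for this illustration, so the numbers will not match the run above exactly.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
p, q = 0.8, 0.2

# Actual class counts as printed earlier in the notebook
true_counts = np.array([1210, 384, 65, 69])
n = true_counts.sum()

# One-hot encode: row i has a single 1 in the column of its true value
labels = np.repeat(np.arange(len(true_counts)), true_counts)
one_hot = np.eye(len(true_counts), dtype=int)[labels]

# Perturb: a 1 is reported as 1 with probability p, a 0 with probability q
report_prob = np.where(one_hot == 1, p, q)
perturbed = (rng.random(one_hot.shape) < report_prob).astype(int)

# Expected column sums, and the estimator that inverts them
sums = perturbed.sum(axis=0)
expected_sums = true_counts * p + (n - true_counts) * q
estimates = (sums - n * q) / (p - q)

print("expected sums:", expected_sums)
print("observed sums:", sums)
print("estimates:    ", estimates.round(1))
```

Applying the estimator to the expected sums returns the true counts exactly, which is what makes the estimate unbiased; the observed sums only add sampling noise around that.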

How can this last step be carried over into an ML classification algorithm?
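
One possible starting point, offered as an assumption and not as the method from the paper: a Naive Bayes classifier is built from frequencies (class priors and per-class feature frequencies), so estimated counts could stand in for exact counts. The sketch below only covers the class priors. The helper `counts_to_priors` is hypothetical, and clipping negative estimates to zero is one simple choice among several.

```python
import numpy as np

def counts_to_priors(estimated_counts):
    """Turn (possibly negative) estimated counts into probabilities that sum to 1."""
    counts = np.clip(np.asarray(estimated_counts, dtype=float), 0.0, None)
    total = counts.sum()
    if total == 0:
        raise ValueError("all estimated counts are zero or negative")
    return counts / total

# Estimated counts from the run above, rounded, in the order of `domain`
estimated = [1175.67, 319.0, 95.67, 104.0]
priors = counts_to_priors(estimated)
print("estimated class priors:", priors.round(3))
```

The conditional probabilities of feature values given a class would need estimated counts as well, which this sketch does not address.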