# Chapter 5: Differential Privacy

### Installing the SmartNoise SDK
SmartNoise is a toolkit from OpenDP, a joint project between researchers at Microsoft, Harvard University, and other contributors. It aims to provide building blocks for using differential privacy in data analysis and machine learning projects.

Let's start by installing the SmartNoise Python SDK package. 

In [1]:
pip install smartnoise-sql

Collecting smartnoise-sql
  Using cached smartnoise_sql-1.0.1-py3-none-any.whl (145 kB)
Collecting PyYAML<7.0.0,>=6.0.1 (from smartnoise-sql)
  Downloading PyYAML-6.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (705 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 705.5/705.5 kB 12.1 MB/s eta 0:00:00
Collecting antlr4-python3-runtime==4.9.3 (from smartnoise-sql)
  Using cached antlr4-python3-runtime-4.9.3.tar.gz (117 kB)
  Preparing metadata (setup.py) ... done
Collecting graphviz<0.18,>=0.17 (from smartnoise-sql)
  Using cached graphviz-0.17-py3-none-any.whl (18 kB)
Collecting opendp<0.8.0,>=0.7.0 (from smartnoise-sql)
  Using cached opendp-0.7.0-py3-none-any.whl (19.7 MB)
Building wheels for collected packages: antlr4-python3-runtime
  Building wheel for antlr4-python3-runtime (setup.py) ... done
  Created wheel for antlr4-python3-runtime: filename=antlr4_python3_ru

### Loading data
We will now load some mock data and analyze it. The mock dataset contains 1,000 records of random data, including a diabetic column that indicates whether the person is diabetic and an age column with values ranging from 15 to 80 years.

> **Note**: The data have been generated using a fake data generator and have no real application except to demonstrate the library capabilities.
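
The generator used to produce mockdata.csv is not shown in this chapter. If you need a comparable file to follow along, the sketch below writes one using only the standard library. The function name `write_mock_csv`, the seed, and the 50/50 split between diabetic and non-diabetic records are assumptions made for illustration; only the column names, the record count, and the age range come from the description above.

```python
import csv
import random

def write_mock_csv(path, n_rows=1000, seed=0):
    """Write n_rows of random records with id, age, and diabetic columns."""
    rng = random.Random(seed)  # seeded so the generated file is reproducible
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "age", "diabetic"])
        for i in range(1, n_rows + 1):
            age = rng.randint(15, 80)      # inclusive range, as described above
            diabetic = rng.random() < 0.5  # assumed 50/50 split, not taken from the source
            writer.writerow([i, age, diabetic])
```

Calling `write_mock_csv('mockdata_sketch.csv')` would produce a file that pandas can read in the same way as mockdata.csv, although the values will differ from the outputs shown in this chapter.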

In [16]:
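import pandas as pd

data_path = 'mockdata.csv'
mockdata = pd.read_csv(data_path)

# Summary statistics for the numeric columns (id and age)
print(mockdata.describe())

# Actual (non-private) mean age per diabetic group; to_markdown() requires the tabulate package
actualdata = mockdata[['age','diabetic']].groupby(['diabetic']).mean().to_markdown()
print(actualdata)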

                id          age
count  1000.000000  1000.000000
mean    500.500000    47.441000
std     288.819436    19.109256
min       1.000000    15.000000
25%     250.750000    31.000000
50%     500.500000    47.000000
75%     750.250000    64.000000
max    1000.000000    80.000000
| diabetic   |     age |
|:-----------|--------:|
| False      | 47.4101 |
| True       | 47.4741 |


### Perform the analysis
Run the following code and compare the results with the actual data above. Changing the epsilon value shifts the balance between privacy and accuracy.
> **Note**: A smaller value of epsilon gives a higher level of privacy and lower accuracy; a larger value shifts the balance the other way.

In [15]:
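import snsql
from snsql import Privacy
import pandas as pd

# Privacy parameters for the query; a smaller epsilon adds more noise
privacy = Privacy(epsilon=0.05, delta=0.01)

csv_path = 'mockdata.csv'
meta_path = 'mockdata.yaml'

mockdata = pd.read_csv(csv_path)
# Wrap the DataFrame in a private reader; the YAML metadata describes the table
reader = snsql.from_df(mockdata, privacy=privacy, metadata=meta_path)

# The query runs with differential privacy, so the averages returned are noisy
result = reader.execute('SELECT diabetic, AVG(age) AS age FROM mockdata.table GROUP BY diabetic')

print(result)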

[['diabetic', 'age'], [False, 54.2823275862069], [True, 42.19132149901381]]
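
To build some intuition for the epsilon trade-off described in the note above, here is a minimal sketch of the Laplace mechanism applied to a mean, using only the standard library. It is not how SmartNoise computes AVG internally. The helper names `laplace_noise` and `noisy_mean`, the synthetic ages, and the assumption that the record count is public (so that replacing one record changes the mean by at most (upper - lower) / n) are all illustrative choices.

```python
import random
import statistics

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def noisy_mean(values, lower, upper, epsilon, rng):
    """Return the mean of values clamped to [lower, upper], plus Laplace noise.

    Assumes the number of records is public, so replacing one record
    changes the mean by at most (upper - lower) / n.
    """
    clamped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clamped)
    return statistics.fmean(clamped) + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)
ages = [rng.randint(15, 80) for _ in range(1000)]  # synthetic stand-in for the age column
true_mean = statistics.fmean(ages)

# Average absolute error over repeated releases, for several epsilon values
mean_abs_error = {}
for epsilon in (0.05, 0.5, 5.0):
    errors = [abs(noisy_mean(ages, 15, 80, epsilon, rng) - true_mean) for _ in range(200)]
    mean_abs_error[epsilon] = statistics.fmean(errors)
    print(f"epsilon={epsilon}: mean absolute error = {mean_abs_error[epsilon]:.3f}")
```

The noise scale is the sensitivity divided by epsilon, so a tenfold increase in epsilon should shrink the typical error by roughly a factor of ten, at the cost of a weaker privacy guarantee.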
