**<center><h1>Introduction</h1></center>**

Data science projects, including machine learning projects, involve analysis of data, and that data often includes sensitive personal details that should be kept private. In practice, most reports published from the data contain only aggregations, which you might expect to provide some privacy – after all, the aggregated results do not reveal the individual data values.

However, consider a case where multiple analyses of the data result in reported aggregations that, when combined, could be used to work out information about individuals in the source dataset. Suppose 10 participants share data about their location and salary, from which two reports are produced:

- An aggregated salary report that tells us the average salaries in New York, San Francisco, and Seattle
- A worker location report that tells us that 10% of the study participants (in other words, a single person) is based in Seattle.

<img src = "images/09-reveal-analysis.png" />

From these two reports, we can easily determine the specific salary of the Seattle-based participant: because only one participant is based in Seattle, the reported Seattle average is that person's salary. Anyone reviewing both reports who happens to know a Seattle-based participant now knows that person's salary.
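
The inference above can be reproduced in a few lines of Python. This is a minimal sketch: the salary figures and the New York and San Francisco shares are hypothetical (the actual numbers in the two reports are not given here), and `exposed_salaries` is an illustrative helper, not part of any library.

```python
# Hypothetical published figures; only "10 participants" and
# "10% in Seattle" come from the scenario described above.
average_salary_by_city = {
    "New York": 71_000.0,
    "San Francisco": 84_000.0,
    "Seattle": 60_000.0,
}
share_of_participants_by_city = {
    "New York": 0.5,
    "San Francisco": 0.4,
    "Seattle": 0.1,
}
total_participants = 10


def exposed_salaries(averages, shares, total):
    """Return the cities whose reported average covers exactly one person."""
    exposed = {}
    for city, share in shares.items():
        headcount = round(share * total)
        if headcount == 1:
            # An average over a single person is that person's exact salary.
            exposed[city] = averages[city]
    return exposed


print(exposed_salaries(average_salary_by_city,
                       share_of_participants_by_city,
                       total_participants))
```

Neither report reveals an individual value on its own; the exposure only appears when the two are combined.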

In this module, you'll explore differential privacy, a technique that can help protect an individual's data against this kind of exposure.

**<h2>Learning objectives</h2>**

In this module, you will learn how to:

- Articulate the problem of data privacy
- Describe how differential privacy works
- Configure parameters for differential privacy
- Perform differentially private data analysis

<hr>

**<center><h1>Understand differential privacy</h1></center>**

Differential privacy seeks to protect individual data values by adding statistical "noise" to the analysis process. The math involved in adding the noise is complex, but the principle is fairly intuitive – the noise ensures that data aggregations stay statistically consistent with the actual data values, allowing for some random variation, but makes it impossible to reliably work out the individual values from the aggregated data. In addition, the noise is different for each analysis, so the results are non-deterministic – in other words, two analyses that perform the same aggregation may produce slightly different results.
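
One common way to add such noise is the Laplace mechanism, sketched below for a mean. This is an illustration under stated assumptions rather than the method used by any particular library: the salary list and the clipping bounds are hypothetical, the number of participants is treated as public, and `laplace_noise` and `private_mean` are names invented for this example.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centred on 0.

    The difference of two independent exponential draws, each with
    mean `scale`, follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Return the mean of `values` plus Laplace noise."""
    # Clip each value so one person's influence on the mean is bounded.
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Replacing one clipped value moves the mean by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical salaries for 10 participants.
salaries = [52_000, 61_000, 58_000, 75_000, 90_000,
            67_000, 83_000, 71_000, 64_000, 60_000]

rng = random.Random()  # unseeded, so every run draws fresh noise
first_run = private_mean(salaries, 0.0, 150_000.0, 1.0, rng)
second_run = private_mean(salaries, 0.0, 150_000.0, 1.0, rng)

print("true mean: ", sum(salaries) / len(salaries))
print("first run: ", first_run)
print("second run:", second_run)
```

The two noisy results differ from each other and from the true mean, which is the non-deterministic behavior described above.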

<img src="images/09-differential-privacy.png" />


<hr>

**<center><h1>Configure data privacy parameters</h1></center>**


One way that an individual can protect their personal data is simply to not participate in a study – this is known as their "opt-out" option. However, there are a few considerations for this as a solution:

- Even if you decide to opt out, a study may still produce results that affect you. For example, you may choose to opt out of a study that compares heart disease diagnoses across a group of people, on the basis that participating may reveal a heart disease diagnosis that causes your health insurance premiums to rise. If the study finds a correlation between drinking coffee and a higher risk of heart disease, and your insurance company knows that you are a coffee drinker, your rate may rise even though you didn't personally participate in the study.
- The benefits of participation in the study may outweigh any negative impact. For example, if you're paid $100 to participate in a study that results in your health insurance rate rising by $10 per year, it will be more than 10 years before you make a net loss. This may be a worthwhile tradeoff to you (particularly if your rate may rise as a result of the study even if you don't participate).
- The only way for the opt-out option to work for every individual is for every individual not to take part – which makes the whole study pointless!

The amount of variation caused by adding noise is configurable through a parameter called epsilon. This value governs the amount of additional risk that your personal data can be identified as a result of rejecting the opt-out option and participating in a study. The key thing is that the same privacy guarantee applies to everyone participating in the study. A low epsilon value provides the most privacy, at the expense of accuracy when aggregating the data. A higher epsilon value results in aggregations that are truer to the actual data distribution, but in which the contribution of a single individual to the aggregated value is less obscured by noise.
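
The tradeoff can be made concrete with a small simulation. The sketch below assumes the Laplace mechanism, where the noise scale is the query's sensitivity divided by epsilon, and a simple count query with a sensitivity of 1. The helper names, the trial count, and the seed are choices made for this example only.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace sample centred on 0 with the given scale."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def mean_absolute_error(epsilon, sensitivity=1.0, trials=5000, seed=0):
    """Average size of the noise added to a query at this epsilon."""
    rng = random.Random(seed)  # seeded so the comparison is repeatable
    scale = sensitivity / epsilon  # lower epsilon means a larger scale
    total = sum(abs(laplace_noise(scale, rng)) for _ in range(trials))
    return total / trials


for epsilon in (0.1, 1.0, 10.0):
    error = mean_absolute_error(epsilon)
    print(f"epsilon={epsilon:>4}: average error {error:.3f}")
```

The average error shrinks in proportion to 1/epsilon: a low epsilon buries each individual's contribution under more noise, while a high epsilon keeps results close to the true value.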

<img src="images/epsilon.png" />
<hr>

**<center><h1>Use differential privacy</h1></center>**

Now it's your chance to explore differential privacy for yourself by using the SmartNoise package.

In this exercise, you will:

- Use SmartNoise to generate differentially private analyses.
- Use SmartNoise to submit differentially private queries.

**<h2>Instructions</h2>**

Follow these instructions to complete the exercise.

1. If you do not already have an Azure subscription, sign up for a free trial at https://azure.microsoft.com.
2. View the exercise repo at https://aka.ms/mslearn-dp100.
3. If you have not already done so, complete the Create an Azure Machine Learning workspace exercise to provision an Azure Machine Learning workspace, create a compute instance, and clone the required files.
4. Complete the Explore differential privacy exercise.

<hr>

**<center><h1>Knowledge check</h1></center>**

1. How does differential privacy work?

- All numeric values in the dataset are encrypted and cannot be used in analysis.

- Noise is added to the data during analysis so that aggregations are statistically consistent with the data distribution but non-deterministic.

- All numeric column values in the dataset are converted to the mean value for the column. Analyses of the data use the mean values instead of the actual values.

2. In a differential privacy solution, what is the effect of setting an epsilon parameter?

- A lower epsilon reduces the impact of an individual's data on aggregated results, increasing privacy and reducing accuracy.

- A lower epsilon reduces the amount of noise added to the data, increasing accuracy and reducing privacy.

- Setting epsilon to 1 enables differential privacy. Setting it to 0 disables differential privacy.

<hr>

**<center><h1>Summary</h1></center>**

In this module, you learned how to:

- Articulate the problem of data privacy
- Describe how differential privacy works
- Configure parameters for differential privacy
- Perform differentially private data analysis

To learn more about differential privacy, see [Differential Privacy](https://docs.microsoft.com/en-us/azure/machine-learning/concept-differential-privacy) in the Azure Machine Learning documentation.

<hr>