# Part 2: Researcher proposes a research question
<center><img src="../CCAIO workshop imgs/steps/20.svg" align="center" style="width:90%"/></center>

This involves:
1. Understanding the proposed data (via the mock data)
2. Proposing questions to answer with the data
3. Building the audit code to answer them
4. Submitting it for review and approval

## 2.1 Log in to the Domain Server

<center><img src="../CCAIO workshop imgs/steps/21.svg" align="center" style="width:90%"/></center>


In [None]:
import syft as sy

In [None]:
researcher_client = sy.login(url='http://localhost:8087', email='oscarwilde@skywalker.net', password='oscars_house')
researcher_client

## 2.2 Propose a research question

<center><img src="../CCAIO workshop imgs/steps/22.svg" align="center" style="width:90%"/></center>

### Inspect and understand the data

In [None]:
researcher_client.datasets

In [None]:
researcher_client.datasets[0]

In [None]:
asset = researcher_client.datasets[0].assets[0]
asset

<br>
<br>
<br>
<br>

## Propose a research question

Once familiar with the dataset schema, the researcher proposes a question to answer. 

Today, let's focus on answering **"Is highly suggestive content amplified by the algorithm?"** In particular, we know from the dataset description that highly suggestive content refers to posts containing violent or sexual artifacts, as scored by an additional internal algorithm.

### Proposing a project

A **project** describes the intent behind the incoming audit code and sets the context for the admin.

Usually, it is derived from what the dataset offers for analysis, and it must be in alignment with the prior research agreements signed by both parties. 

In [None]:
project_proposal = sy.Project(
    name="Suggestive content analysis",
    description="""
    This project aims to study the relationship between suggestiveness scores and the degree to which the
    algorithms deployed by DailyMotion are amplifying such content.""",
    members=[researcher_client]  # A project could be conducted by multiple researchers
)

project = project_proposal.start()

In [None]:
project

<br>
<br>

<br>
<br>

## Proposing the audit code

A project consists of multiple audit code requests sent together or one by one by the researcher to make sense of the algorithm metadata and its properties. 

### How do audit code requests work?
An audit code request is a remote code execution request that is:
- designed by the researcher
- tested against the mock counterpart of the data to check its correctness
- submitted to the admin, with the purpose of being run against the private data
- required to adhere to the data owner organisation's use/mis-use policy

### How to ensure the audit code is compliant with the data usage policies?
The admin can specify guidelines on how the computational results can be released. Acceptable data uses include:
- releasing aggregate statistics
- releasing only aggregate statistics with added differential privacy noise (e.g. LinkedIn)
- releasing very small-scale samples of the private dataset

whereas data mis-use could be:
- releasing any information that allows a researcher to directly identify an individual
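
The idea of releasing an aggregate statistic with added differential privacy noise can be illustrated with a minimal sketch. It assumes the released quantity is a mean over values clipped to a known range, that the number of records is public, and that Laplace noise is used. The names `laplace_noise` and `dp_mean`, the bounds, and the `epsilon` values are illustrative only; this is not the PySyft or OpenDP API, and a real deployment should rely on a vetted library.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(values, lower, upper, epsilon, rng=None):
    """Return a noisy mean of `values` (hypothetical helper, for illustration)."""
    rng = rng or random.Random()
    # Clip every value so a single record has bounded influence on the mean.
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Assumption: neighbouring datasets differ in one record and n is public,
    # so the mean can change by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


# Toy suggestiveness scores in [0, 1]; smaller epsilon means more noise.
scores = [0.12, 0.55, 0.31, 0.90, 0.47]
noisy = dp_mean(scores, lower=0.0, upper=1.0, epsilon=1.0, rng=random.Random(0))
```

Only the noisy value would be released to the researcher; the clipping bounds and `epsilon` would be set by the data owner's policy.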

## 1st audit code: Are most popular videos more suggestive?

In [None]:
@sy.syft_function_single_use(df=asset)
def suggestiveness_in_top10pp_videos(df):
    from opendp.measurements import make_base_laplace
    from opendp.mod import enable_features
    from io import BytesIO
    import numpy as np
    import matplotlib.pyplot as plt

    def most_popular_videos_in_dataset(data):
        # Select the top 10% of the videos by number of recommendations.
        # len(data) counts rows; data.size would count rows * columns.
        count_top10p = max(1, int(0.1 * len(data)))
        top10_threshold = float(data['recommendations'].nlargest(count_top10p).iloc[-1])
        # Use >= so that the video sitting exactly at the threshold is kept
        top_posts = data[data['recommendations'] >= top10_threshold]
        return top_posts

    # Select most popular videos (top 10%)
    top_posts = most_popular_videos_in_dataset(df)
    top_posts_per_algo = [most_popular_videos_in_dataset(df[df['algo'] == x]) for x in ['A', 'B', 'C']]
    
    # DP noise is currently disabled: the Laplace measurement below is commented out,
    # so the plotted values are the raw suggestiveness scores.
    enable_features("contrib")
    # base_lap_vec_sugg = make_base_laplace(scale=1e-4, D="VectorDomain<AllDomain<float>>")

    to_plot = [x['suggestive'].to_list() for x in [df, top_posts] + top_posts_per_algo]
    # to_plot = [base_lap_vec_sugg(x['suggestive'].to_list()) for x in [df, top_posts] + top_posts_per_algo]
    
    # Plotting
    fig, ax = plt.subplots(figsize=(12,8))
    bp = ax.boxplot(to_plot, sym='k+', positions=np.arange(len(to_plot)) + 1, vert=True, patch_artist=True, notch=True)

    # One colour per box (five boxes are plotted)
    colors = ['pink', 'lightblue', 'lightgreen', 'lavender', 'lightyellow']
    for patch, color in zip(bp['boxes'], colors):
        patch.set_facecolor(color)

    ax.yaxis.grid(True)
    ax.set_title("Most popular videos' suggestiveness", size=18)
    ax.set_ylabel('Suggestiveness', fontsize=16)
    ax.set_xticks(range(7),['', 'All videos', 'Top posts', 'Top posts Alg.1', 'Top posts Alg.2', 'Top posts Alg.3', ''], size=14)

    figfile = BytesIO()
    plt.savefig(figfile, format='png')
    return figfile

## 1st audit code: Are most popular videos more suggestive?

We can run the above code against the mock data to check it for correctness. However, we need to see the output on the private dataset to form our initial hypothesis on whether amplification of suggestive content is prevalent in DailyMotion's algorithm.

In [None]:
suggestiveness_in_top10pp_videos(df=asset.mock)
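
The selection step inside `suggestiveness_in_top10pp_videos` keeps the videos with the most recommendations. As a minimal, pandas-free sketch of that idea, the snippet below picks the top fraction of a list of records by a numeric field. The helper name `top_fraction` and the toy records are hypothetical and not part of the audit code; ties at the threshold are kept, which is one possible design choice.

```python
def top_fraction(records, key, fraction=0.1):
    """Return the records whose `key` value is in the top `fraction` (illustrative helper)."""
    # Number of records to keep, based on the number of rows (at least one).
    count = max(1, int(fraction * len(records)))
    # The smallest value that still belongs to the top group.
    threshold = sorted((r[key] for r in records), reverse=True)[count - 1]
    # Records tied with the threshold are kept as well.
    return [r for r in records if r[key] >= threshold]


# Toy data: 20 videos whose recommendation count equals their id.
videos = [{"id": i, "recommendations": i} for i in range(20)]
top_videos = top_fraction(videos, key="recommendations", fraction=0.1)
top_ids = sorted(v["id"] for v in top_videos)
```

With 20 toy videos and a fraction of 0.1, two records are selected.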

## 2nd audit code: Are certain videos behaving very differently from the rest?

The code below checks whether the data contains outlier videos.

In [None]:
@sy.syft_function_single_use(df=asset)
def get_outliers(df, x_axis = 'suggestive', y_axis = 'recommendations', category_seriesname = 'algo', threshold = 6):
    import numpy as np
    import pandas as pd
    from sklearn.covariance import MinCovDet

    # Pick out outliers per category
    categories = sorted(df[category_seriesname].unique())
    results = []
    for category_label in categories:
        df_category = df[df[category_seriesname] == category_label]
        X = df_category[[x_axis, y_axis]].to_numpy()

        # Fit a Minimum Covariance Determinant (MCD) robust estimator to the data
        robust_cov = MinCovDet().fit(X)

        # Keep only points in the upper-right quadrant, i.e. above the robust
        # location on both axes; copy so that adding a column does not touch df
        in_quadrant = (df_category[x_axis] > robust_cov.location_[0]) & (df_category[y_axis] > robust_cov.location_[1])
        df_outliers = df_category[in_quadrant].copy()

        # Compute the squared Mahalanobis distance of each remaining point
        df_outliers['distance'] = robust_cov.mahalanobis(df_outliers[[x_axis, y_axis]].to_numpy())

        # Only keep points whose squared distance exceeds the threshold
        df_outliers = df_outliers[df_outliers['distance'] > threshold]

        results.append(df_outliers)
    
    df_results = pd.concat(results)
    
    # Sort by Mahalanobis distance
    df_results = df_results.sort_values('distance', ascending=False)
    
    return df_results

## 2nd audit code: Are certain videos behaving very differently from the rest? (outliers)

In [None]:
get_outliers(df=asset.mock)
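
To make the `distance` column returned by `get_outliers` more concrete, here is a minimal sketch of a squared Mahalanobis distance in two dimensions. Note the swap: it uses the plain sample mean and sample covariance rather than the robust MCD estimate used in the audit code, and the helper name `squared_mahalanobis_2d` and the toy points are hypothetical.

```python
def squared_mahalanobis_2d(points, query):
    """Squared Mahalanobis distance of `query` from the cloud of 2D `points`.

    Uses the ordinary sample mean and covariance (not a robust estimator).
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the 2x2 sample covariance matrix (unbiased, divides by n - 1)
    s_xx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    s_yy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    s_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    det = s_xx * s_yy - s_xy ** 2
    dx, dy = query[0] - mean_x, query[1] - mean_y
    # d^2 = v^T S^{-1} v, written out with the closed-form inverse of a 2x2 matrix
    return (s_yy * dx * dx - 2 * s_xy * dx * dy + s_xx * dy * dy) / det


# Four toy (suggestiveness, recommendations) points on a square
cloud = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
at_centre = squared_mahalanobis_2d(cloud, (1.0, 1.0))
far_right = squared_mahalanobis_2d(cloud, (3.0, 1.0))
```

A point at the centre of the cloud has distance zero, and the distance grows as a point moves away from the centre relative to the spread of the data. Because the value is squared, a cut-off applied to it is not a count of standard deviations.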

## Submit the audit code together with the project for review

In [None]:
project.create_code_request(suggestiveness_in_top10pp_videos, researcher_client)

In [None]:
project.create_code_request(get_outliers, researcher_client)

In [None]:
project.requests

## 2.3 Wait for review & approval

<center><img src="../CCAIO workshop imgs/steps/23.png" align="center" style="width:90%"/></center>


### Check whether any of the code requests were answered

In [None]:
researcher_client.code.suggestiveness_in_top10pp_videos(df=asset)

In [None]:
researcher_client.code.get_outliers(df=asset)

<br>
<br>
<br>

# Researcher awaits answers
<img src="../CCAIO workshop imgs/steps/w.svg" style="width:100%"/><br>
<br>
<br>

# Part 4: Researcher's questions got answered
<center><img src="../CCAIO workshop imgs/steps/40.svg" align="center" style="width:90%"/></center>


In [None]:
researcher_client = sy.login(url='http://localhost:8086', email='oscarwilde@skywalker.net', password='oscars_house')
researcher_client

## First request result

In [None]:
from PIL import Image

# Get the private asset reference
asset = researcher_client.datasets[0].assets[0]

# Compute the result on the private asset, as it was approved
result = researcher_client.code.suggestiveness_in_top10pp_videos(df=asset)

Image.open(result.get())

## Second request result

In [None]:
researcher_client.code.get_outliers(df=asset)

# Thank you!


## We now invite you to conduct your own audit on LinkedIn's and DailyMotion's nodes.

Steps:
1. Download the notebooks available here:
2. Follow the installation instructions
3. Creatively propose questions that can be answered with the data!