# <span style='background :yellow' > Assessing SMEs' GDPR-compliance Through Privacy Policies: A Machine Learning Approach </span>
<span style='background :yellow' >
The goal of this research is to explore, using natural language processing and machine learning techniques, how organisations differ in their approach to GDPR compliance. We intend to do this by assessing which GDPR user rights privacy policies *focus* on (rather than how complete the policies are) and by exploring whether this focus correlates with the corresponding organisation's metadata (e.g., country, service, whether it is data-driven). This will give us insight into organisations' interpretation of the GDPR (e.g., stressing a specific part) and into the factors (e.g., company size) that contribute to that interpretation.
</span>

---
## DATASET ##
This manually labeled set comprises 250 individual policies, containing over 18,300 natural sentences. For legal reasons, we have anonymized the data set, e.g. we have scrambled all numbers and substituted names, email addresses, companies and URLs with generic replacements (e.g. ‘company 42645’). <br>
Source: __On GDPR Compliance of Companies’ Privacy Policies__ _by Müller et al._

The five GDPR requirements chosen to evaluate privacy policy compliance:

|No.|	Category|Required content in privacy policy|
|---|---|---|
|1| DPO | Contact details for the data protection officer or equivalent |
|2| Purpose | Disclosure of the purposes for which personal data is or is not used |
|3| Acquired data | Disclosure that personal data is or is not collected, and/or which data is collected |
|4| Data sharing | Disclosure if 3rd parties can or cannot access a user’s personal data |
|5| Rights | Disclosure of the user’s right to rectify or erase personal data |

## Import labeled PPs (18,397 sentence snippets)

In [None]:
import pandas

# Let's load the training data from a csv file
dataset = pandas.read_csv('data/PP/GDPR.csv', sep='\t', encoding='utf-8')
dataset

#### Explore balance of dataset
Source: https://towardsdatascience.com/multi-label-text-classification-with-scikit-learn-30714b7819c5

In [None]:
import matplotlib.pyplot as plt

categ = list(dataset.columns)[1:]  # all columns except the first ('Text') are labels

# collect (column name, number of positively labeled sentences) for every label
counts = []
for column in categ:
    tmp_count = dataset[column].value_counts()
    # .get avoids a KeyError if a label has no positive sentences at all
    counts.append((column, tmp_count.get(1, 0)))

df_stats = pandas.DataFrame(counts, columns=['GDPR_criteria', 'number_of_pos_sen'])

df_stats.plot(x='GDPR_criteria', y='number_of_pos_sen', kind='bar', legend=False, grid=True, figsize=(8, 5))
plt.title("Number of positively labeled sentences per category")
plt.ylabel('# of Occurrences (of a total of 18,397)', fontsize=12)
plt.xlabel('GDPR Assessment Criteria', fontsize=12)

#### What is the number of multi-labeled sentences?
Source: https://towardsdatascience.com/multi-label-text-classification-with-scikit-learn-30714b7819c5

In [None]:
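import seaborn as sns

# number of positive labels per sentence (0 = no GDPR criterion applies);
# summing over `categ` keeps this consistent with the label columns used above
rowsums = dataset[categ].sum(axis=1)
x = rowsums.value_counts()

# plot (recent seaborn versions require x and y as keyword arguments)
plt.figure(figsize=(8, 5))
ax = sns.barplot(x=x.index, y=x.values)
plt.title("Multiple GDPR criteria per sentence")
plt.ylabel('# of Occurrences (of a total of 18,397)', fontsize=12)
plt.xlabel('# of GDPR criteria', fontsize=12)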

- The vast majority of the sentences are not labeled at all (almost 16,000).

#### Class imbalance, possible solutions:
- oversampling minority class
    - `resample` function from the scikit-learn package: randomly duplicate examples in the minority class.
    - generating synthetic samples using SMOTE functionality in Imblearn package
- undersampling majority class

##### Oversampling should be done on the training set only:
In class-imbalance settings, artificially balancing the test/validation set does not make sense: these sets must remain realistic. You want to test the classifier in a real-world setting where, say, the negative class makes up 99% of the samples, to see how well the model predicts the 1% positive class of interest without producing too many false positives. Artificially inflating the minority class or reducing the majority class leads to unrealistic performance metrics that bear no relation to the real-world problem you are trying to solve.

Max Kuhn, creator of the caret R package and co-author of the (highly recommended) Applied Predictive Modeling textbook, writes in Chapter 11, "Subsampling For Class Imbalances", of the caret ebook:

__You would never want to artificially balance the test set; its class frequencies should be in-line with what one would see “in the wild”.__

Re-balancing makes sense only in the training set, so as to prevent the classifier from simply and naively classifying all instances as negative for a perceived accuracy of 99%.

Hence, rebalancing should be applied only to the training set/folds.

sources: 
- https://imbalanced-learn.org/stable/over_sampling.html
- https://stackoverflow.com/questions/48805063/balance-classes-in-cross-validation/48810493#48810493
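
The order of operations described above can be illustrated with a minimal, standard-library-only sketch. The helper `random_oversample` is a hypothetical stand-in for a library oversampler, and the toy sentences and labels are made up; the only point is that the split happens first and that duplication touches the training part alone.

```python
import random
from collections import Counter


def random_oversample(x, y, seed=0):
    """Duplicate randomly chosen minority-class examples until both classes are equal in size.

    Hypothetical helper for illustration; assumes binary labels.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    minority, majority = sorted(counts, key=counts.get)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    extra = [rng.choice(minority_idx) for _ in range(counts[majority] - counts[minority])]
    keep = list(range(len(y))) + extra
    return [x[i] for i in keep], [y[i] for i in keep]


# Toy data: 100 sentences, 10% positive (made-up numbers)
x = [f"sentence {i}" for i in range(100)]
y = [1 if i % 10 == 0 else 0 for i in range(100)]

# 1. Split first (here: a simple shuffled 90/10 split)
rng = random.Random(42)
indices = list(range(len(y)))
rng.shuffle(indices)
train_idx, test_idx = indices[:90], indices[90:]
x_train, y_train = [x[i] for i in train_idx], [y[i] for i in train_idx]
x_test, y_test = [x[i] for i in test_idx], [y[i] for i in test_idx]

# 2. Oversample the training part only; the test part keeps its natural class frequencies
x_train_over, y_train_over = random_oversample(x_train, y_train)

print("train before:    ", Counter(y_train))
print("train after:     ", Counter(y_train_over))
print("test (untouched):", Counter(y_test))
```

After oversampling, both classes are equally frequent in the training part, while the test part still reflects the original imbalance.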


#### Oversampling

In [None]:
from collections import Counter
from imblearn.over_sampling import RandomOverSampler


def oversample_data(x, y):
    """Randomly duplicate minority-class samples until both classes are equally frequent."""
    oversample = RandomOverSampler(sampling_strategy='minority')
    x_over, y_over = oversample.fit_resample(x, y)
    return x_over, y_over

# Once the labels are available, Counter shows the class distribution, e.g.:
# print(Counter(y_train), Counter(y_train_over))

#### Preprocessing

In [None]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re

# characters that are stripped from every token
PUNCTUATION = re.compile(r'[,’\'.!?&“”():*_;"]')


def preprocessing(pps):
    """Clean a list of sentences: strip punctuation, drop tokens containing digits,
    remove English stopwords, and stem the remaining tokens."""
    # look the stopwords up once instead of once per token
    stop_words = set(stopwords.words('english'))
    porter = PorterStemmer()

    detokenized_pps = []
    for sent in pps:
        # tokenize on whitespace and remove punctuation
        tokens = [PUNCTUATION.sub('', token) for token in sent.split()]
        # remove words with numbers in them
        tokens = [token for token in tokens if not any(c.isdigit() for c in token)]
        # remove stopwords (the NLTK list is lowercase, so capitalized stopwords are kept)
        tokens = [token for token in tokens if token not in stop_words]
        # stem the remaining tokens and join them back into one string
        detokenized_pps.append(' '.join(porter.stem(token) for token in tokens))

    return detokenized_pps

In [None]:
# raw sentences, taken from the 'Text' column
pps = dataset['Text'].to_list()

print("Before preprocessing: ")
print(pps[0:3])

print("Post preprocessing: ")
print(preprocessing(pps[:3]))

## Feature engineering

#### TF-IDF
<img src="img/tfidfformula.png">
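
A small worked example may help to read the formula. The sketch below assumes the textbook definition, tf-idf(t, d) = tf(t, d) · log(N / df(t)), with tf as the relative term frequency; the formula in the image above and the smoothed, normalized variant that scikit-learn applies by default may differ in details. The three toy sentences are made up.

```python
import math
from collections import Counter

docs = [
    "we collect personal data",
    "we share personal data with third parties",
    "you can contact our data protection officer",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(term, tokens):
    """Textbook tf-idf: relative term frequency times log(N / df)."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / df[term])
    return tf * idf


for term in ("data", "personal", "officer"):
    print(term, [round(tf_idf(term, tokens), 3) for tokens in tokenized])
```

Because "data" occurs in every sentence, its idf is log(1) = 0 and it carries no weight, whereas "officer" occurs in a single sentence and receives the highest score there.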


#### TF-IDF Vectorizer
Convert a collection of raw documents to a matrix of TF-IDF features.
Equivalent to CountVectorizer followed by TfidfTransformer.
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

- fit: ...
- transform: ...
- fit transform: ...

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# initialise the vectorizer
# if we remove the stopwords here, the coef list is larger than vectorizer.get_feature_names()
# vectorizer = TfidfVectorizer(min_df = 2,  stop_words='english')
vectorizer = TfidfVectorizer()

# assumption: pps_n holds the preprocessed sentences from the 'Text' column
pps_n = preprocessing(dataset['Text'].to_list())
tfidf = vectorizer.fit_transform(pps_n)

# compressed sparse row matrix: list of rows
print(type(tfidf))

# dense array: easier to work with
tfidf = tfidf.toarray()
print(type(tfidf))

In [None]:
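# explore
# assumption: the label column for this criterion is named 'Purpose'
y_purpose = dataset['Purpose'].to_list()
print(tfidf.shape)
print(len(y_purpose))

# get_feature_names() was removed in scikit-learn 1.2; on older versions use vectorizer.get_feature_names()
words = vectorizer.get_feature_names_out()
print(len(words))
words[-10:]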

## Train model - Classification: Logistic Regression

For classification tasks, logistic regression models the probability of an event occurring (e.g., a sentence being labeled "DPO" or "Purpose") as a function of the independent variables (here, the TF-IDF features). The dependent variable is categorical, in our case even binary: "DPO" is 1 or 0.

Here z is the weighted sum of the evidence for the class. It is a raw score, not yet a probability.<br>


\begin{align}
z = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_n x_n
\end{align}

The larger the absolute value of a weight, the greater the impact of the corresponding feature on the final decision:<br>
- large positive values indicate a positive impact (for the event to occur)
- large negative values indicate a negative impact (for the event not to occur)

The value of z lies between -∞ and +∞.
We therefore apply the sigmoid (or logistic) function to z to obtain a probability between 0 and 1.
The model uses this probability to predict the label: if the probability of the positive class (e.g., "DPO" = 1) is higher than 0.5, the prediction is 1, otherwise 0.

More info: https://machinelearningmastery.com/logistic-regression-for-machine-learning/
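
The step from the weighted sum z to a predicted label can be sketched in a few lines of plain Python. The intercept, the three weights and the feature vector below are made-up numbers, not values learned from this dataset, and `predict_proba` is a hypothetical helper that only mirrors the idea, not the scikit-learn method of the same name.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(intercept, coefs, features):
    """Probability of the positive class for one sentence."""
    z = intercept + sum(c * x for c, x in zip(coefs, features))
    return sigmoid(z)


# Made-up weights for three TF-IDF features and one made-up sentence vector
intercept = -2.0
coefs = [3.5, 1.2, -0.8]
features = [0.6, 0.0, 0.3]

p = predict_proba(intercept, coefs, features)
label = 1 if p > 0.5 else 0
print(f"probability: {p:.3f}, predicted label: {label}")
```

Here z is slightly negative (about -0.14), so the probability falls just below 0.5 and the sentence is predicted as 0.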

### Optimize parameters
___max_df___ float in range [0.0, 1.0] or int, default=1.0. Used for removing terms that appear too frequently, also known as "corpus-specific stop words". For example:
- max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
- max_df = 25 means "ignore terms that appear in more than 25 documents".

The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents". Thus, the default setting does not ignore any terms.

___min_df___ float in range [0.0, 1.0] or int, default=1. Used for removing terms that appear too infrequently. For example:
- min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".
- min_df = 5 means "ignore terms that appear in fewer than 5 documents".

The default min_df is 1, which means "ignore terms that appear in fewer than 1 document". Thus, the default setting does not ignore any terms.
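
To make the two thresholds concrete, here is a minimal sketch of document-frequency filtering in plain Python. `filter_vocabulary` is a hypothetical helper written for this illustration; it follows the rule described above (floats as a fraction of the documents, ints as absolute counts) and is not the scikit-learn implementation. The toy sentences are made up.

```python
from collections import Counter


def filter_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Floats are read as a fraction of the documents, ints as absolute counts.
    """
    n_docs = len(tokenized_docs)
    df = Counter(term for tokens in tokenized_docs for term in set(tokens))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, count in df.items() if low <= count <= high)


docs = [
    "we collect personal data".split(),
    "we share personal data".split(),
    "we erase data on request".split(),
    "contact our officer".split(),
]

print(filter_vocabulary(docs))              # defaults: nothing is dropped
print(filter_vocabulary(docs, max_df=0.5))  # drops terms found in more than 50% of the documents
print(filter_vocabulary(docs, min_df=2))    # drops terms found in fewer than 2 documents
```

With `max_df=0.5` the frequent terms "we" and "data" disappear; with `min_df=2` only "data", "personal" and "we" remain.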

___ngram_range___ tuple (min_n, max_n), default=(1, 1)
- The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted.
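
As an illustration of what the range means for word n-grams, the following sketch enumerates them by hand. `word_ngrams` is a hypothetical helper, not part of scikit-learn, and the example sentence is made up.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = "you can erase your data".split()
print(word_ngrams(tokens, (1, 1)))  # unigrams only
print(word_ngrams(tokens, (1, 2)))  # unigrams and bigrams
print(word_ngrams(tokens, (2, 2)))  # bigrams only
```

A wider range quickly enlarges the vocabulary: for this five-word sentence, (1, 1) yields 5 features, (1, 2) yields 9 and (1, 3) yields 12.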

___max_features___ int, default=None
- If not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.

Source:
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://stackoverflow.com/questions/27697766/understanding-min-df-and-max-df-in-scikit-countvectorizer

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from pprint import pprint
from time import time


# 'UR_explicitly_mentioned' omitted
# categories = ['Purpose']
prep_dataset = preprocessing(pps = dataset['Text'].to_list())


for i, category in enumerate(categ):
    tfidf_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('lr', LogisticRegression()),
    ])

    # every additional parameter multiplies the number of combinations, and thus the run time
    parameters = {
#         'tfidf__min_df': (.05, .1, .15, .2), #best solution: (.75)
#         'tfidf__max_df': (0.75, .85), #best solution: (1.)
#         'tfidf__max_features': (None, 5000, 10000, 50000),
        'tfidf__ngram_range': ((1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)), #best solution: (1,2)
#         'dt__max_depth': np.arange(3,10)
    }

    y = dataset[category].to_list()
    
    x_train, x_test, y_train, y_test = train_test_split(prep_dataset, y, test_size=0.1)
    grid_search = GridSearchCV(tfidf_pipeline, parameters)

    print("Performing grid search for label: {}".format(category))
    print("tf-idf pipeline:", [name for name, _ in tfidf_pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(x_train, y_train)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))


#### Use optimized params

In [None]:
#In sklearn, all machine learning models are implemented as Python classes
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

prep_dataset = preprocessing(pps = dataset['Text'].to_list())

# n-gram range per label, in the same order as `categ`
params = [
    (1,1),
    (1,3),
    (1,1),
    (1,2),
    (1,1)
]

for i, category in enumerate(categ):
# for i, category in enumerate(["DPO"]):
    print("Label in progress:" + category)
    print()
    
    # initialise the vectorizer
#     vectorizer = TfidfVectorizer(max_df = .75, min_df = .05, ngram_range = (1,2))
    vectorizer = TfidfVectorizer(ngram_range = params[i])
    print("Ngram:", params[i])

    tfidf = vectorizer.fit_transform(prep_dataset)

    # dense array: easier to work with
#     tfidf = tfidf.toarray()

    x = tfidf
    y = dataset[category].to_list()
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
    
    x_train_over, y_train_over = oversample_data(x_train, y_train)

    # Make an instance of the Model
    # all parameters not specified are set to their defaults
    lr = LogisticRegression()

    # Train the model on the data, storing the information learned from the data
    # Model is learning the relationship between TF-IDF features (x_train_over) and labels (y_train_over)
    lr.fit(x_train_over, y_train_over)

    # Let's see what are the possible labels to predict (and in which order they are stored)
    print(lr.classes_)

    # We can get additional information about all the parameters used with LogReg model
    print(lr.get_params())

    y_pred = lr.predict(x_test)
    
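    # get_feature_names() was removed in scikit-learn 1.2; on older versions use vectorizer.get_feature_names()
    words = vectorizer.get_feature_names_out()

    print()
    print("Most important features:")
    # for a binary problem coef_ has a single row, holding the weights of the positive class (lr.classes_[1])
    for label, coefs, intercept in zip(lr.classes_[-len(lr.coef_):], lr.coef_, lr.intercept_):
        print(label)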
        sort_zipped_list = sorted(zip(words, coefs), key = lambda x: x[1], reverse = True) 
        for t, c in list(sort_zipped_list)[:10]:
            print(t, c)
        print("...")
        print("INTERCEPT:" +str(intercept))
        print("...")
        for t, c in list(sort_zipped_list)[-10:]:
            print(t, c)
        print()
        print()
        
    
    print()
    print("Confusion matrix:")
    print(confusion_matrix(y_test, y_pred))
    
    print()
    print("Classification report:")
    print(classification_report(y_test, y_pred))
    print("TFIDF ROC_AUC Score", roc_auc_score(y_test, lr.predict_proba(x_test)[:,1]))
    
    print("----------------------------------")
    print()

## Predict & Evaluate

### Accuracy ### 
For label A, accuracy is the number of correct predictions (both positive and negative) divided by the total number of predictions.<br>

\begin{align}
Accuracy = \frac{correct\ predictions}{all\ predictions\ made} = \frac{true\ positives\ +\ true\ negatives}{true\ positives\ +\ false\ positives\ +\ true\ negatives\ +\ false\ negatives} \\
\end{align}

true positive = correctly predicted as label A<br>
false positive = incorrectly predicted as label A<br>
true negative = correctly predicted as not label A<br>
false negative = predicted as another label, whereas it is actually label A

- ___Not very helpful in case of class imbalance: classifying everything as the majority class already yields a high accuracy.___
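
A tiny made-up example shows why. The sketch below assumes a hypothetical test set of 1,000 sentences with 10 positives and a "classifier" that always predicts the majority class.

```python
# Made-up test set: 1,000 sentences, of which 10 are positive
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # always predict the majority (negative) class

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The accuracy is 0.99 even though not a single positive sentence is found (recall 0.00), which is why precision, recall and F1 are reported per label.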


### Precision ### 
For label A, precision is the number of sentences correctly predicted as A __out of all sentences predicted as A__ (What percentage of the predicted labels is correct? The focus is on predictions.).<br>

\begin{align}
Precision(p) = \frac{correctly\ predicted\ as\ label\ A}{all\ predictions\ made\ as\ label\ A} = \frac{true\ positives}{true\ positives\ +\ false\ positives} \\
\end{align}

true positive = correctly predicted as label A<br>
false positive = incorrectly predicted as label A<br>

### Recall ### 
For label A, recall is the number of sentences correctly predicted as A (the same numerator as for precision) __out of the number of sentences that actually have label A__ (Out of all actual A's, what percentage did the model predict correctly? The focus is on actual labels.).<br>
In other words: r = true positives / (true positives + false negatives)

\begin{align}
Recall(r) = \frac{correctly\ predicted\ as\ label\ A}{all\ actual\ items\ with\ label\ A} = \frac{true\ positives}{true\ positives\ +\ false\ negatives} \\
\end{align}

true positive = correctly predicted as label A<br>
false negative = predicted as another label, whereas it is actually label A

### F1 Score ###
Precision and recall measure the performance of a model from two different perspectives.
We can combine the two measures to get a single, balanced score, which is called the __F1 score__.
A single score often makes it easier to compare different models.

\begin{align}
F1 = 2 * \frac{Precision * Recall}{Precision + Recall} \\
\end{align}

#### Predict

In [None]:
# y_pred = lr.predict(x_test)
# len words = 4706 - ngram(1,2)
# len words = 4706 - ngram(1,1)
print(len(words))

# from sklearn.metrics import roc_auc_score

# roc=roc_auc_score(y_test, lr.predict_proba(x_test)[:,1])
# print(roc)

#### Evaluate coefficients and intercept

In [None]:
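# print the 100 terms that have the largest weight (coefficients)
# get_feature_names() was removed in scikit-learn 1.2; on older versions use vectorizer.get_feature_names()
words = vectorizer.get_feature_names_out()

# for a binary problem coef_ has a single row, holding the weights of the positive class (lr.classes_[1])
for label, coefs, intercept in zip(lr.classes_[-len(lr.coef_):], lr.coef_, lr.intercept_):
    print(label)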
    sort_zipped_list = sorted(zip(words, coefs), key = lambda x: x[1], reverse = True) 
    for t, c in list(sort_zipped_list)[:100]:
        print(t, c)
    print("INTERCEPT:" +str(intercept))
    print()
    print()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

Xtesting = ['this is the first sentence', 'second sentence', 'third sentence']

tfidf_test = TfidfVectorizer(stop_words=None, ngram_range=(2, 2))

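sps = tfidf_test.fit_transform(Xtesting)
# with ngram_range=(2, 2) the features are word bigrams such as 'first sentence'
# get_feature_names() was removed in scikit-learn 1.2; on older versions use get_feature_names()
print(len(tfidf_test.get_feature_names_out()[0]))
print((tfidf_test.get_feature_names_out()[0]))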

#### Confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix

# Get the confusion matrix (rows: actual labels, columns: predicted labels)
confusion_matrix(y_test, y_pred)


|Model|	Predicted: No|Predicted: Yes|
|---|---|---|
|Actual: No| 1731 | 3 |
|Actual: Yes| 82 | 24 |
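
The four metrics defined earlier can be computed directly from the counts in this table. The sketch below only restates those definitions in plain Python; the variable names are chosen for this illustration.

```python
# Counts taken from the confusion matrix above
tn, fp = 1731, 3   # actual: No
fn, tp = 82, 24    # actual: Yes

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.3f}")
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"f1        = {f1:.3f}")
```

Accuracy is roughly 0.95, yet recall is only about 0.23: most positive sentences are missed, which the accuracy alone does not reveal.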

#### Classification report
- __macro avg__: Calculate precision, recall and f1 metrics for each label, and find their average. This does not take label imbalance into account: f1 scores are averaged (with equal weights)
- __weighted avg__: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters "macro" to account for label imbalance (it can result in an F-score that is not between precision and recall).
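
The difference between the two averages can be checked by hand. The sketch below uses the counts from the confusion matrix shown earlier in this notebook (1731 / 3 / 82 / 24); `f1_score` here is a small local helper written for this illustration, not the scikit-learn function of the same name.

```python
def f1_score(tp, fp, fn):
    """F1 expressed directly in counts: 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


# Each class is treated as "positive" in turn; support = number of true instances
per_class = {
    "No": {"f1": f1_score(tp=1731, fp=82, fn=3), "support": 1734},
    "Yes": {"f1": f1_score(tp=24, fp=3, fn=82), "support": 106},
}

# macro avg: every class counts equally
macro_f1 = sum(c["f1"] for c in per_class.values()) / len(per_class)

# weighted avg: every class counts in proportion to its support
total = sum(c["support"] for c in per_class.values())
weighted_f1 = sum(c["f1"] * c["support"] for c in per_class.values()) / total

print(f"macro avg f1    = {macro_f1:.3f}")
print(f"weighted avg f1 = {weighted_f1:.3f}")
```

Because the "No" class dominates the support, the weighted average stays close to the F1 of that class, while the macro average is pulled down by the weak "Yes" class.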

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

#### LABEL: DPO
___1. Ngram (1,1), preprocessing(+ remove words with digits)*___
<img src="img/dpo-ngram11.png">

2. Ngram (1,2), preprocessing(+ remove words with digits) <br>
<img src="img/dpo-ngram12.png">

3. Ngram (1,1), preprocessing(all + remove words with digits), oversampling <br>
<img src="img/dpo-ngram11-oversampling.png" height=500>

#### LABEL: PURPOSE
1. Ngram (1,1), preprocessing(+ remove words with digits)
<img src="img/purpose-ngram11.png">

___2. Ngram (1,3), preprocessing(+ remove words with digits)*___<br>
<img src="img/purpose-ngram13.png">

3. Ngram (1,3), preprocessing(all + remove words with digits), oversampling <br>
<img src="img/purpose-ngram13-oversampling.png" height=500>

#### LABEL: ACQUIRED DATA
___1. Ngram (1,1), preprocessing(+ remove words with digits)*___
<img src="img/acquireddata-ngram11.png">

2. Ngram (1,2), preprocessing(+ remove words with digits)<br>
<img src="img/acquireddata-ngram12.png">

3. Ngram (1,1), preprocessing(all + remove words with digits), oversampling <br>
<img src="img/acquireddata-ngram11-oversampling.png" height=500>

#### LABEL: DATA SHARING
1. Ngram (1,1), preprocessing(all + remove words with digits)
<img src="img/datasharing-ngram11.png">

___2. Ngram (1,2), preprocessing(all + remove words with digits)*___<br>
<img src="img/datasharing-ngram12.png">

3. Ngram (1,2), preprocessing(all + remove words with digits), oversampling <br>
<img src="img/datasharing-ngram12-oversampling.png" height=500>

#### LABEL: RIGHTS
___1. Ngram (1,1), preprocessing(+ remove words with digits)*___
<img src="img/rights-ngram11.png">

2. Ngram (1,2), preprocessing(all + remove words with digits)<br>
<img src="img/rights-ngram12.png">

3. Ngram (1,1), preprocessing(all + remove words with digits), oversampling <br>
<img src="img/rights-ngram11-oversampling.png" height=500>