### Conducting basic NLP on sentence and segment level text

Some ideas: 
1. ML model for high frequency practices

    a. Non-neural net models as last option (too complex)

    b. Experiment with a bunch of sklearn models.

2. rule based text models for lowest 10? (after all the focus is about explainability)
3. What is the best way to explain? How does that interact with the type of model used?

Questions:
1. How to test performance? What is the nature of the hold out data? 
2. How to balance explainability vs performance?

    a. Need to add some papers on this

In [1]:
SEED = 1

import pandas as pd
import sklearn
import seaborn as sns

PATH_SENTENCE_TEXT = r"../dataset/concat_sentence_text.csv"
PATH_SEGMENT_TEXT = r"../dataset/concat_segment_text.csv"

## Part 1: NLP on sentence level text

In [2]:
df = pd.read_csv(PATH_SENTENCE_TEXT)
df.head()

Unnamed: 0,sentence_text,practice,modality
0,"IP ADDRESS, COOKIES, AND WEB BEACONS",Identifier_Cookie_or_similar_Tech_1stParty,PERFORMED
1,"IP ADDRESS, COOKIES, AND WEB BEACONS",Identifier_IP_Address_1stParty,PERFORMED
2,"IP addresses will be collected, along with inf...",Identifier_IP_Address_1stParty,PERFORMED
3,The information that our products collect incl...,Identifier_Cookie_or_similar_Tech_1stParty,PERFORMED
4,The information that our products collect incl...,Identifier_IP_Address_1stParty,PERFORMED


## Try first with basic model: TfIDF, with logistic regression, SGDClassifier?

### Also todo: To try various word representations and tokenisation. With different stop words? Or n-grams?

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words={'english'}, ngram_range=(1,2), strip_accents='ascii', binary = True)
tfidf_vectors = vectorizer.fit_transform(df["sentence_text"])

In [4]:
# Sanity check: Number of rows in matrix same as number of sentences.
# We have 51747 unique tokens after tokenisation
print(len(df))
print(tfidf_vectors.shape)

18829
(18829, 51747)


In [5]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

In [6]:
# Train test split, 20% test size?
x_train, x_test, y_train, y_test = train_test_split(tfidf_vectors, df["practice"], test_size = 0.2, random_state = SEED)

## Testing with logistic regression

In [7]:
logistic_clf = LogisticRegression(random_state = SEED, max_iter = 500, n_jobs = -1, multi_class = "ovr").fit(x_train, y_train)
y_pred = logistic_clf.predict(x_test)

In [8]:
print(classification_report(y_test, y_pred))

                                            precision    recall  f1-score   support

                          Contact_1stParty       0.14      0.07      0.09        74
                          Contact_3rdParty       0.00      0.00      0.00        14
             Contact_Address_Book_1stParty       0.45      0.24      0.32        82
             Contact_Address_Book_3rdParty       0.00      0.00      0.00         5
                     Contact_City_1stParty       0.00      0.00      0.00        29
                     Contact_City_3rdParty       0.00      0.00      0.00         3
           Contact_E_Mail_Address_1stParty       0.25      0.62      0.36       415
           Contact_E_Mail_Address_3rdParty       0.15      0.04      0.06        52
                 Contact_Password_1stParty       0.00      0.00      0.00        77
                 Contact_Password_3rdParty       0.00      0.00      0.00         5
             Contact_Phone_Number_1stParty       0.18      0.15      0.16  

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### To try visualisation of logistic regression with interpret. At least we know how it works using a simple linear classifier.

## Testing with LinearSVC (used by the original authors)

In [9]:
from sklearn.svm import SVC

In [10]:
linearSVC_clf = SVC(kernel= "linear", class_weight="balanced").fit(x_train, y_train)
y_pred = linearSVC_clf.predict(x_test)

KeyboardInterrupt: 

In [None]:
print(classification_report(y_test, y_pred))

                                            precision    recall  f1-score   support

                          Contact_1stParty       0.30      0.26      0.28        74
                          Contact_3rdParty       0.06      0.07      0.06        14
             Contact_Address_Book_1stParty       0.40      0.50      0.45        82
             Contact_Address_Book_3rdParty       0.00      0.00      0.00         5
                     Contact_City_1stParty       0.09      0.24      0.13        29
                     Contact_City_3rdParty       0.00      0.00      0.00         3
           Contact_E_Mail_Address_1stParty       0.43      0.35      0.38       415
           Contact_E_Mail_Address_3rdParty       0.08      0.04      0.05        52
                 Contact_Password_1stParty       0.27      0.38      0.32        77
                 Contact_Password_3rdParty       0.01      0.20      0.03         5
             Contact_Phone_Number_1stParty       0.31      0.12      0.18  

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### To consider decision trees and boosting

## Part 2: NLP on segment level text

In [27]:
df_segments = pd.read_csv(PATH_SEGMENT_TEXT)
df_segments.head()

Unnamed: 0,segment_text,practice,modality
0,PRIVACY POLICY This privacy policy (hereafter ...,,
1,1. ABOUT OUR PRODUCTS 1.1 Our products offer a...,,
2,2. THE INFORMATION WE COLLECT The information ...,Identifier_Cookie_or_similar_Tech_1stParty,PERFORMED
3,2. THE INFORMATION WE COLLECT The information ...,Identifier_IP_Address_1stParty,PERFORMED
4,"2.2 In addition, we store certain information ...",Identifier_Cookie_or_similar_Tech_1stParty,PERFORMED


### Data cleaning: Drop segments without any practice. Instead of fillna with string None. Test the performance difference.
Tried with fillna as None, but performance was even worse than sentence level.

In [21]:
# Replace NaNs with string because sklearn does not accept NaN category
# df_segments["practice"] = df_segments["practice"].fillna("None")
# df_segments["modality"] = df_segments["modality"].fillna("None")

In [28]:
df_segments = df_segments.dropna()

In [29]:
df_segments.head()

Unnamed: 0,segment_text,practice,modality
2,2. THE INFORMATION WE COLLECT The information ...,Identifier_Cookie_or_similar_Tech_1stParty,PERFORMED
3,2. THE INFORMATION WE COLLECT The information ...,Identifier_IP_Address_1stParty,PERFORMED
4,"2.2 In addition, we store certain information ...",Identifier_Cookie_or_similar_Tech_1stParty,PERFORMED
8,2.3 6677g may also use ad network providers to...,Identifier_Cookie_or_similar_Tech_3rdParty,PERFORMED
10,2.5 6677g may share demographic information (c...,Demographic_3rdParty,PERFORMED


In [30]:
# Vectorize
vectorizer = TfidfVectorizer(stop_words={'english'}, ngram_range=(1,2), strip_accents='ascii', binary = True)
tfidf_vectors_segments = vectorizer.fit_transform(df_segments["segment_text"])

In [31]:
x_train, x_test, y_train, y_test = train_test_split(tfidf_vectors_segments, df_segments["practice"], test_size = 0.2, random_state = SEED)

In [32]:
logistic_clf = LogisticRegression(random_state = SEED, max_iter = 500, n_jobs = -1, multi_class = "ovr").fit(x_train, y_train)
y_pred = logistic_clf.predict(x_test)

In [33]:
print(classification_report(y_test, y_pred))

                                            precision    recall  f1-score   support

                          Contact_1stParty       0.11      0.03      0.04        37
                          Contact_3rdParty       0.00      0.00      0.00         5
             Contact_Address_Book_1stParty       0.33      0.23      0.27        40
             Contact_Address_Book_3rdParty       0.00      0.00      0.00         3
                     Contact_City_1stParty       0.00      0.00      0.00        20
                     Contact_City_3rdParty       0.00      0.00      0.00         3
           Contact_E_Mail_Address_1stParty       0.25      0.56      0.34       248
           Contact_E_Mail_Address_3rdParty       0.00      0.00      0.00        29
                 Contact_Password_1stParty       0.08      0.02      0.03        46
                 Contact_Password_3rdParty       0.00      0.00      0.00         2
             Contact_Phone_Number_1stParty       0.14      0.16      0.15  

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
