### Conducting basic NLP on sentence- and segment-level text

Some ideas: 
1. ML model for high frequency practices

    a. Neural net models as a last option (too complex)

    b. Experiment with a bunch of sklearn models.

2. Rule-based text models for the 10 lowest-frequency practices? (After all, the focus is on explainability.)
3. What is the best way to explain predictions? How does that interact with the type of model used?
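
To make idea 2 concrete, here is a minimal sketch of a rule-based model in which every prediction is explained by the text that triggered it. The `RULES` table and its patterns are hypothetical illustrations, not rules derived from the dataset.

```python
import re

# Hypothetical keyword rules; real patterns would be written per practice
# after inspecting the annotated sentences.
RULES = {
    "Identifier_IP_Address_1stParty": re.compile(r"\bIP address(es)?\b", re.IGNORECASE),
    "Identifier_Cookie_or_similar_Tech_1stParty": re.compile(r"\b(cookies?|web beacons?)\b", re.IGNORECASE),
    "Contact_E_Mail_Address_1stParty": re.compile(r"\be-?mail\b", re.IGNORECASE),
}


def predict_with_explanation(sentence):
    """Return (practice, matched_text) for every rule that fires on the sentence."""
    hits = []
    for practice, pattern in RULES.items():
        match = pattern.search(sentence)
        if match:
            hits.append((practice, match.group(0)))
    return hits


hits = predict_with_explanation("IP ADDRESS, COOKIES, AND WEB BEACONS")
for practice, evidence in hits:
    print(f"{practice}: matched {evidence!r}")
```

A sentence can trigger several rules at once, which suits data where one sentence is annotated with more than one practice.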

Questions:
1. How to test performance? What is the nature of the hold-out data?
2. How to balance explainability vs performance?

    a. Need to add some papers on this
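
One point about the hold-out data: the sentence-level file can list the same sentence once per practice (the first rows of the data show this), so a plain random row split may place copies of one sentence in both train and test. A possible sketch of a group-aware split, using scikit-learn's `GroupShuffleSplit` on toy data (the toy sentences and labels are made up):

```python
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the sentence-level data: some sentences carry two labels.
sentence_text = ["s1", "s1", "s2", "s3", "s3", "s4", "s5", "s6", "s7", "s8"]
practice = ["A", "B", "A", "B", "C", "A", "C", "B", "A", "C"]

# Group rows by sentence text so every copy of a sentence lands on one side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=1)
train_idx, test_idx = next(splitter.split(sentence_text, practice, groups=sentence_text))

train_texts = {sentence_text[i] for i in train_idx}
test_texts = {sentence_text[i] for i in test_idx}
print("sentences shared between train and test:", train_texts & test_texts)
```

Whether this is the right notion of hold-out depends on the goal; splitting by policy document would be stricter still, but needs a document identifier that this file may not contain.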

In [1]:
SEED = 1

import pandas as pd
import sklearn

PATH_SENTENCE_TEXT = r"../dataset/concat_sentence_text.csv"
PATH_SEGMENT_TEXT = r"../dataset/concat_segment_text.csv"

## Part 1: NLP on sentence-level text

In [2]:
df = pd.read_csv(PATH_SENTENCE_TEXT)
df.head()

Unnamed: 0,sentence_text,practice,modality
0,"IP ADDRESS, COOKIES, AND WEB BEACONS",Identifier_Cookie_or_similar_Tech_1stParty,PERFORMED
1,"IP ADDRESS, COOKIES, AND WEB BEACONS",Identifier_IP_Address_1stParty,PERFORMED
2,"IP addresses will be collected, along with inf...",Identifier_IP_Address_1stParty,PERFORMED
3,The information that our products collect incl...,Identifier_Cookie_or_similar_Tech_1stParty,PERFORMED
4,The information that our products collect incl...,Identifier_IP_Address_1stParty,PERFORMED


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18829 entries, 0 to 18828
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   sentence_text  18829 non-null  object
 1   practice       18829 non-null  object
 2   modality       18829 non-null  object
dtypes: object(3)
memory usage: 441.4+ KB


In [4]:
df["sentence_text"] = df["sentence_text"].astype("string")
df["practice"] = df["practice"].astype("category")
# modality is left as object here; cast it with .astype("category") if it is needed as a label.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18829 entries, 0 to 18828
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   sentence_text  18829 non-null  string  
 1   practice       18829 non-null  category
 2   modality       18829 non-null  object  
dtypes: category(1), object(1), string(1)
memory usage: 315.2+ KB


In [6]:
## Limit to top 5 categories
top_5_cats = ["Identifier_Cookie_or_similar_Tech_1stParty", "Contact_E_Mail_Address_1stParty", "Location_1stParty", "Identifier_Cookie_or_similar_Tech_3rdParty", "Identifier_IP_Address_1stParty"]

df = df[df["practice"].isin(top_5_cats)]
df.head()

Unnamed: 0,sentence_text,practice,modality
0,"IP ADDRESS, COOKIES, AND WEB BEACONS",Identifier_Cookie_or_similar_Tech_1stParty,PERFORMED
1,"IP ADDRESS, COOKIES, AND WEB BEACONS",Identifier_IP_Address_1stParty,PERFORMED
2,"IP addresses will be collected, along with inf...",Identifier_IP_Address_1stParty,PERFORMED
3,The information that our products collect incl...,Identifier_Cookie_or_similar_Tech_1stParty,PERFORMED
4,The information that our products collect incl...,Identifier_IP_Address_1stParty,PERFORMED
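
The category list above is hard-coded. A small sketch of deriving the most frequent practices from the data instead, on a toy frame (`TOP_K` and the toy labels are assumptions for illustration; this does not check whether the hard-coded list matches the real counts):

```python
import pandas as pd

toy = pd.DataFrame({
    "practice": pd.Series(["A", "A", "A", "B", "B", "C"], dtype="category"),
})

TOP_K = 2  # the notebook uses 5
top_cats = toy["practice"].value_counts().nlargest(TOP_K).index.tolist()

subset = toy[toy["practice"].isin(top_cats)].copy()
# Filtering rows does not drop unused category levels; remove them explicitly.
subset["practice"] = subset["practice"].cat.remove_unused_categories()

print(top_cats)
print(list(subset["practice"].cat.categories))
```

Dropping the unused levels keeps later reports and plots from listing practices that no longer have any rows.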


## First attempt with basic models: TF-IDF features with logistic regression and SGDClassifier

### TODO: try other word representations and tokenisation settings (different stop words, different n-gram ranges)
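
As a starting point for that comparison, the sketch below fits `TfidfVectorizer` under a few settings and reports the resulting vocabulary size. The toy sentences and the `configs` dictionary are assumptions for illustration; in the notebook the input would be `df["sentence_text"]`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences standing in for the real sentence column.
sentences = [
    "We collect your IP address and cookies.",
    "Cookies and web beacons are used by our site.",
    "Your email address is collected when you register.",
]

# Hypothetical candidate settings to compare.
configs = {
    "unigrams": dict(ngram_range=(1, 1)),
    "unigrams, English stop words": dict(ngram_range=(1, 1), stop_words="english"),
    "1- to 4-grams": dict(ngram_range=(1, 4)),
}

vocab_sizes = {}
for name, params in configs.items():
    matrix = TfidfVectorizer(**params).fit_transform(sentences)
    vocab_sizes[name] = matrix.shape[1]
    print(f"{name}: {matrix.shape[1]} features")
```

Vocabulary size alone says nothing about accuracy, but it shows quickly how much each setting inflates or shrinks the feature space before any model is trained.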

In [7]:
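from sklearn.feature_extraction.text import TfidfVectorizer
# NOTE: stop_words={'english'} is a set holding the single word "english", so only that
# word is filtered out; stop_words="english" selects the built-in English stop list.
# The outputs below were recorded with the set form.
vectorizer = TfidfVectorizer(stop_words={'english'}, ngram_range=(1,4), strip_accents='ascii', binary = True)
tfidf_vectors = vectorizer.fit_transform(df["sentence_text"])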

In [8]:
# Sanity check: the number of rows in the matrix equals the number of sentences.
# The number of columns is the vocabulary size: 218540 distinct 1- to 4-grams.
print(len(df))
print(tfidf_vectors.shape)

7982
(7982, 218540)


In [9]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

In [10]:
# Train test split, 20% test size?
x_train, x_test, y_train, y_test = train_test_split(tfidf_vectors, df["practice"], test_size = 0.2, random_state = SEED)

## Testing with logistic regression

In [11]:
logistic_clf = LogisticRegression(random_state = SEED, max_iter = 500, n_jobs = -1, multi_class = "ovr").fit(x_train, y_train)
y_pred = logistic_clf.predict(x_test)

In [12]:
print(classification_report(y_test, y_pred))

                                            precision    recall  f1-score   support

           Contact_E_Mail_Address_1stParty       0.66      0.85      0.74       413
Identifier_Cookie_or_similar_Tech_1stParty       0.54      0.70      0.61       418
Identifier_Cookie_or_similar_Tech_3rdParty       0.47      0.30      0.37       238
            Identifier_IP_Address_1stParty       0.56      0.37      0.45       214
                         Location_1stParty       0.60      0.45      0.52       314

                                  accuracy                           0.59      1597
                                 macro avg       0.57      0.53      0.54      1597
                              weighted avg       0.58      0.59      0.57      1597
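
To make the two summary rows concrete, the macro and weighted F1 averages can be recomputed from the per-class values in the report above. The inputs are the rounded figures as printed, so this only reproduces the report to two decimals.

```python
# Per-class F1 and support copied from the logistic regression report above.
f1 = [0.74, 0.61, 0.37, 0.45, 0.52]
support = [413, 418, 238, 214, 314]

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(f1) / len(f1)
# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f * n for f, n in zip(f1, support)) / sum(support)

print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```

The macro average is the lower of the two because the weakest classes (third-party cookies and IP address) are also the smallest, so the weighted average discounts them.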



### TODO: visualise the logistic regression with the interpret package. With a simple linear classifier we at least know how each prediction is produced.

## Testing with SGDClassifier

In [13]:
from sklearn.linear_model import SGDClassifier

In [16]:
clf_sgdclassifier = SGDClassifier(loss = "hinge", max_iter = 5000, random_state=SEED, n_jobs = -1).fit(x_train, y_train)
y_pred = clf_sgdclassifier.predict(x_test)

In [17]:
print(classification_report(y_test, y_pred))

                                            precision    recall  f1-score   support

           Contact_E_Mail_Address_1stParty       0.70      0.80      0.75       413
Identifier_Cookie_or_similar_Tech_1stParty       0.54      0.59      0.56       418
Identifier_Cookie_or_similar_Tech_3rdParty       0.41      0.38      0.39       238
            Identifier_IP_Address_1stParty       0.47      0.36      0.41       214
                         Location_1stParty       0.56      0.53      0.54       314

                                  accuracy                           0.57      1597
                                 macro avg       0.54      0.53      0.53      1597
                              weighted avg       0.56      0.57      0.56      1597



## Testing with a linear SVM (used by the original authors), here via SVC with a linear kernel

In [18]:
from sklearn.svm import SVC

In [19]:
linearSVC_clf = SVC(kernel= "linear", class_weight="balanced").fit(x_train, y_train)
y_pred = linearSVC_clf.predict(x_test)

In [20]:
print(classification_report(y_test, y_pred))

                                            precision    recall  f1-score   support

           Contact_E_Mail_Address_1stParty       0.76      0.77      0.76       413
Identifier_Cookie_or_similar_Tech_1stParty       0.59      0.57      0.58       418
Identifier_Cookie_or_similar_Tech_3rdParty       0.43      0.45      0.44       238
            Identifier_IP_Address_1stParty       0.46      0.48      0.47       214
                         Location_1stParty       0.58      0.55      0.56       314

                                  accuracy                           0.59      1597
                                 macro avg       0.56      0.56      0.56      1597
                              weighted avg       0.59      0.59      0.59      1597



## Testing with random forests

In [21]:
from sklearn.ensemble import RandomForestClassifier

In [22]:
clf_randomforest = RandomForestClassifier(n_jobs = -1, random_state = SEED).fit(x_train, y_train)
y_pred = clf_randomforest.predict(x_test)

In [23]:
print(classification_report(y_test, y_pred))

                                            precision    recall  f1-score   support

           Contact_E_Mail_Address_1stParty       0.65      0.75      0.70       413
Identifier_Cookie_or_similar_Tech_1stParty       0.48      0.57      0.52       418
Identifier_Cookie_or_similar_Tech_3rdParty       0.37      0.30      0.33       238
            Identifier_IP_Address_1stParty       0.44      0.34      0.38       214
                         Location_1stParty       0.52      0.44      0.48       314

                                  accuracy                           0.52      1597
                                 macro avg       0.49      0.48      0.48      1597
                              weighted avg       0.51      0.52      0.51      1597
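
The four experiments above each rely on a single 80/20 split. A possible way to compare the same model families more robustly is a cross-validated loop scored on macro F1. The sketch below uses synthetic data from `make_classification` as a stand-in for the TF-IDF matrix and labels, so its scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

SEED = 1

# Synthetic stand-in for the TF-IDF matrix and practice labels.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=SEED)

models = {
    "logistic regression": LogisticRegression(max_iter=500, random_state=SEED),
    "SGD (hinge)": SGDClassifier(loss="hinge", max_iter=5000, random_state=SEED),
    "linear-kernel SVC": SVC(kernel="linear", class_weight="balanced"),
    "random forest": RandomForestClassifier(random_state=SEED),
}

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
    scores[name] = fold_scores.mean()
    print(f"{name}: macro F1 = {scores[name]:.2f} (+/- {fold_scores.std():.2f})")
```

The spread across folds gives a rough sense of whether differences of two or three points between models, as seen in the reports above, are meaningful.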



### Summary (29/8/22): All models tried on TF-IDF features perform poorly (accuracy 0.52 to 0.59).
The issue is probably the feature engineering. It may be worth looking at word embeddings first, before deciding which models to use.

TODO: How does this affect use of the interpret package?

### Summary (3/9/22): Tried BERT on all categories and on the top 5 categories. Both yield low performance.
This might be because there is not enough training data to fit all the parameters.

Non-neural models perform better out of the box.
So stick to linear models, but also work out which categories to limit prediction to.