### Conducting basic NLP on sentence- and segment-level text

Some ideas: 
1. ML model for high-frequency practices

    a. Neural net models as a last option (too complex)

    b. Experiment with a bunch of sklearn models.

2. Rule-based text models for the 10 lowest-frequency practices? (After all, the focus is on explainability.)
3. What is the best way to explain predictions? How does that interact with the type of model used?
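
Idea 2 could be prototyped as a small keyword matcher. The sketch below is only an illustration: the `RULES` patterns and the `rule_based_predict` helper are hypothetical, and real rules would have to be derived from the annotated sentences. The appeal for explainability is that each prediction comes with the text that triggered it.

```python
import re

# Hypothetical keyword rules: practice label -> regex pattern.
# The labels come from the dataset; the patterns are illustrative guesses.
RULES = {
    "Identifier_IP_Address_1stParty": re.compile(r"\bip address(es)?\b", re.IGNORECASE),
    "Identifier_Cookie_or_similar_Tech_1stParty": re.compile(r"\b(cookies?|web beacons?)\b", re.IGNORECASE),
}


def rule_based_predict(sentence):
    """Return (label, matched_text) pairs for every rule that fires on the sentence."""
    hits = []
    for label, pattern in RULES.items():
        match = pattern.search(sentence)
        if match:
            hits.append((label, match.group(0)))
    return hits


predictions = rule_based_predict("IP ADDRESS, COOKIES, AND WEB BEACONS")
for label, evidence in predictions:
    print(f"{label}: matched {evidence!r}")
```

Rules this simple cannot tell 1stParty from 3rdParty practices, so they would need extra context patterns before being useful on the real labels.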

Questions:
1. How to test performance? What is the nature of the hold-out data?
2. How to balance explainability vs performance?

    a. Need to add some papers on this
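
One possible answer to question 1 is stratified k-fold cross-validation scored with macro-averaged F1, which weights rare practices as much as frequent ones. The sketch below runs on a made-up toy corpus; the sentences, the short labels, the pipeline, and the choice of two folds are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up sentences and labels standing in for the real dataset.
texts = [
    "we collect your ip address",
    "ip addresses are logged by our servers",
    "your ip address is stored",
    "the server records each ip address",
    "we use cookies on this site",
    "cookies are stored in your browser",
    "this site sets cookies",
    "cookies help us remember you",
]
labels = ["ip"] * 4 + ["cookie"] * 4

# Keeping the vectorizer inside the pipeline means the vocabulary and IDF
# weights are fitted on the training folds only.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=500))

# Stratification keeps the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1_macro")
print(scores, scores.mean())
```

On the real data, practices with fewer examples than folds would need to be merged, dropped, or handled separately before stratifying.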

In [1]:
SEED = 1

import pandas as pd
import sklearn
import seaborn as sns

PATH_SENTENCE_TEXT = r"../dataset/concat_sentence_text.csv"
PATH_SEGMENT_TEXT = r"../dataset/concat_segment_text.csv"

## Part 1: NLP on sentence-level text

In [2]:
df = pd.read_csv(PATH_SENTENCE_TEXT)
df.head()

Unnamed: 0,sentence_text,practice,modality
0,"IP ADDRESS, COOKIES, AND WEB BEACONS",Identifier_Cookie_or_similar_Tech_1stParty,PERFORMED
1,"IP ADDRESS, COOKIES, AND WEB BEACONS",Identifier_IP_Address_1stParty,PERFORMED
2,"IP addresses will be collected, along with inf...",Identifier_IP_Address_1stParty,PERFORMED
3,The information that our products collect incl...,Identifier_Cookie_or_similar_Tech_1stParty,PERFORMED
4,The information that our products collect incl...,Identifier_IP_Address_1stParty,PERFORMED
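
Rows 0 and 1 above carry the same sentence with two different practices, so one sentence can have several labels. A quick way to measure how common this is can be sketched on a toy frame that mirrors the rows shown (the `toy` frame and its placeholder sentences are assumptions, not the real data):

```python
import pandas as pd

# Toy frame mirroring the structure of df.head(); sentences are placeholders.
toy = pd.DataFrame(
    {
        "sentence_text": ["sentence A", "sentence A", "sentence B", "sentence C", "sentence C"],
        "practice": [
            "Identifier_Cookie_or_similar_Tech_1stParty",
            "Identifier_IP_Address_1stParty",
            "Identifier_IP_Address_1stParty",
            "Identifier_Cookie_or_similar_Tech_1stParty",
            "Identifier_IP_Address_1stParty",
        ],
    }
)

# Count distinct practices per unique sentence.
labels_per_sentence = toy.groupby("sentence_text")["practice"].nunique()

# Share of unique sentences that carry more than one practice label.
multi_label_share = (labels_per_sentence > 1).mean()
print(labels_per_sentence)
print(f"{multi_label_share:.0%} of unique sentences have more than one label")
```

If the share is high on the real data, a single-label classifier cannot score well, because identical inputs map to different targets.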


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18829 entries, 0 to 18828
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   sentence_text  18829 non-null  object
 1   practice       18829 non-null  object
 2   modality       18829 non-null  object
dtypes: object(3)
memory usage: 441.4+ KB


In [4]:
# Store the text with pandas' string dtype and the practice labels as a categorical.
df["sentence_text"] = df["sentence_text"].astype("string")
df["practice"] = df["practice"].astype("category")

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18829 entries, 0 to 18828
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   sentence_text  18829 non-null  string  
 1   practice       18829 non-null  category
 2   modality       18829 non-null  object  
dtypes: category(1), object(1), string(1)
memory usage: 315.2+ KB
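
Splitting the labels into high-frequency practices and the lowest 10 starts from the label counts. A minimal sketch on a made-up categorical series (the placeholder labels and the cut-off of one label per group are assumptions; the real notebook would use `df["practice"]` and a larger cut-off):

```python
import pandas as pd

# Made-up label column: three practices with very different frequencies.
practice = pd.Series(
    ["practice_a"] * 5 + ["practice_b"] * 3 + ["practice_c"] * 1,
    dtype="category",
    name="practice",
)

# Number of rows per practice, sorted from most to least frequent.
counts = practice.value_counts()

# Candidates for the ML model (most frequent) and for rules (least frequent).
most_frequent = counts.nlargest(1)
least_frequent = counts.nsmallest(1)
print(counts)
```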


## First attempt with a basic model: TF-IDF with logistic regression or SGDClassifier?

### TODO: Try various word representations and tokenisation schemes, e.g. different stop words or n-gram ranges.
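
The effect of stop words and n-gram ranges on the feature space can be compared before training anything. The sketch below uses a made-up three-sentence corpus in place of `df["sentence_text"]`, and the `settings` grid is a hypothetical starting point rather than a recommended configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for df["sentence_text"].
corpus = [
    "We collect your IP address and cookies.",
    "Your email address is shared with third parties.",
    "Cookies and web beacons are used by our products.",
]

# Hypothetical grid of tokenisation settings to compare.
settings = {
    "unigrams": dict(ngram_range=(1, 1)),
    "unigrams, English stop words": dict(ngram_range=(1, 1), stop_words="english"),
    "1- to 2-grams": dict(ngram_range=(1, 2)),
}

vocab_sizes = {}
for name, params in settings.items():
    # Fit each vectorizer on the same corpus and record the number of features.
    matrix = TfidfVectorizer(**params).fit_transform(corpus)
    vocab_sizes[name] = matrix.shape[1]
    print(f"{name}: {matrix.shape[1]} features")
```

Note that `stop_words="english"` is passed as a plain string here, which is how scikit-learn selects its built-in English stop-word list.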

In [18]:
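from sklearn.feature_extraction.text import TfidfVectorizer
# NOTE: stop_words={'english'} is a custom one-word stop list (only the token "english" is dropped);
# stop_words='english' would select scikit-learn's built-in English stop-word list instead.
vectorizer = TfidfVectorizer(stop_words={'english'}, ngram_range=(1,4), strip_accents='ascii', binary = True)
tfidf_vectors = vectorizer.fit_transform(df["sentence_text"])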

In [13]:
# Sanity check: the matrix has one row per sentence.
# The vocabulary has 369585 features (1- to 4-grams) after tokenisation.
print(len(df))
print(tfidf_vectors.shape)

18829
(18829, 369585)


In [14]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

In [19]:
# Train test split, 20% test size?
x_train, x_test, y_train, y_test = train_test_split(tfidf_vectors, df["practice"], test_size = 0.2, random_state = SEED)

## Testing with logistic regression

In [20]:
logistic_clf = LogisticRegression(random_state = SEED, max_iter = 500, n_jobs = -1, multi_class = "ovr").fit(x_train, y_train)
y_pred = logistic_clf.predict(x_test)

In [21]:
print(classification_report(y_test, y_pred))

                                            precision    recall  f1-score   support

                          Contact_1stParty       0.05      0.01      0.02        74
                          Contact_3rdParty       0.00      0.00      0.00        14
             Contact_Address_Book_1stParty       0.46      0.13      0.21        82
             Contact_Address_Book_3rdParty       0.00      0.00      0.00         5
                     Contact_City_1stParty       0.00      0.00      0.00        29
                     Contact_City_3rdParty       0.00      0.00      0.00         3
           Contact_E_Mail_Address_1stParty       0.24      0.65      0.35       415
           Contact_E_Mail_Address_3rdParty       0.10      0.02      0.03        52
                 Contact_Password_1stParty       0.00      0.00      0.00        77
                 Contact_Password_3rdParty       0.00      0.00      0.00         5
             Contact_Phone_Number_1stParty       0.11      0.08      0.09  



### TODO: Try visualising the logistic regression with the interpret package. With a simple linear classifier, at least we know how the model works.

## Testing with SGDClassifier

In [9]:
from sklearn.linear_model import SGDClassifier

In [17]:
clf_sgdclassifier = SGDClassifier(loss = "squared_hinge", max_iter = 5000, random_state=SEED, n_jobs = -1).fit(x_train, y_train)
y_pred = clf_sgdclassifier.predict(x_test)



In [18]:
print(classification_report(y_test, y_pred))

                                            precision    recall  f1-score   support

                          Contact_1stParty       0.16      0.04      0.06        74
                          Contact_3rdParty       0.00      0.00      0.00        14
             Contact_Address_Book_1stParty       0.23      0.39      0.29        82
             Contact_Address_Book_3rdParty       0.00      0.00      0.00         5
                     Contact_City_1stParty       0.05      0.07      0.06        29
                     Contact_City_3rdParty       0.00      0.00      0.00         3
           Contact_E_Mail_Address_1stParty       0.32      0.22      0.26       415
           Contact_E_Mail_Address_3rdParty       0.00      0.00      0.00        52
                 Contact_Password_1stParty       0.04      0.03      0.03        77
                 Contact_Password_3rdParty       0.04      0.20      0.06         5
             Contact_Phone_Number_1stParty       0.19      0.20      0.20  



## Testing with a linear SVC (used by the original authors)

In [9]:
from sklearn.svm import SVC

In [10]:
linearSVC_clf = SVC(kernel= "linear", class_weight="balanced").fit(x_train, y_train)
y_pred = linearSVC_clf.predict(x_test)

KeyboardInterrupt: 

In [None]:
print(classification_report(y_test, y_pred))

                                            precision    recall  f1-score   support

                          Contact_1stParty       0.30      0.26      0.28        74
                          Contact_3rdParty       0.06      0.07      0.06        14
             Contact_Address_Book_1stParty       0.40      0.50      0.45        82
             Contact_Address_Book_3rdParty       0.00      0.00      0.00         5
                     Contact_City_1stParty       0.09      0.24      0.13        29
                     Contact_City_3rdParty       0.00      0.00      0.00         3
           Contact_E_Mail_Address_1stParty       0.43      0.35      0.38       415
           Contact_E_Mail_Address_3rdParty       0.08      0.04      0.05        52
                 Contact_Password_1stParty       0.27      0.38      0.32        77
                 Contact_Password_3rdParty       0.01      0.20      0.03         5
             Contact_Phone_Number_1stParty       0.31      0.12      0.18  



### Random forests

In [19]:
from sklearn.ensemble import RandomForestClassifier

In [21]:
clf_randomforest = RandomForestClassifier(n_jobs = -1, random_state = SEED).fit(x_train, y_train)
y_pred = clf_randomforest.predict(x_test)

In [22]:
print(classification_report(y_test, y_pred))

                                            precision    recall  f1-score   support

                          Contact_1stParty       0.16      0.14      0.15        74
                          Contact_3rdParty       0.00      0.00      0.00        14
             Contact_Address_Book_1stParty       0.27      0.27      0.27        82
             Contact_Address_Book_3rdParty       0.00      0.00      0.00         5
                     Contact_City_1stParty       0.03      0.03      0.03        29
                     Contact_City_3rdParty       0.00      0.00      0.00         3
           Contact_E_Mail_Address_1stParty       0.25      0.34      0.29       415
           Contact_E_Mail_Address_3rdParty       0.02      0.02      0.02        52
                 Contact_Password_1stParty       0.07      0.06      0.07        77
                 Contact_Password_3rdParty       0.00      0.00      0.00         5
             Contact_Phone_Number_1stParty       0.09      0.09      0.09  

### Summary (29/8/22): Trying different models on TF-IDF features yields low performance.
The issue is probably feature engineering rather than model choice. It may be worth looking at word embeddings first, before deciding which models to use.

TODO: How does this affect the use of the interpret package?