# Baseline Models

- Three baseline models are used to develop an analysis that detects political language in court documents:
    1. Logistic regression
    2. Naive Bayes
    3. SVM
- The baselines serve as a benchmark for testing model performance and for building an accurate model that classifies language used in Bulgaria's Constitutional Court

## Next steps

- The data needs more sentences labelled as political: the classes are imbalanced, so the models see few political sentences
- Hyperparameters need to be optimized to improve model performance
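
Short of labelling more data, one common way to soften the imbalance noted above is to weight each class inversely to its frequency. The sketch below reproduces, in plain Python, the heuristic that scikit-learn applies for `class_weight="balanced"`, namely `n_samples / (n_classes * class_count)`. The function name and the label counts are hypothetical and are not taken from this dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# hypothetical label vector: 90 non-political (0) and 10 political (1) sentences
example_labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(example_labels)
print(weights)
```

The rare class receives the larger weight, so misclassifying a political sentence costs more during fitting. The same weighting can be requested directly by passing `class_weight="balanced"` to `LogisticRegressionCV` or `svm.SVC`.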

In [73]:
import pickle
import re

import nltk
import pandas as pd

nltk.download("stopwords")

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn import naive_bayes, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/paulj1989/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [74]:
dfD4 = pd.read_json("data/json/D4_060493.json")
dfD8 = pd.read_json("data/json/D8_081002.json")
dfD9 = pd.read_json("data/json/D9_241002.json")
dfD10 = pd.read_json("data/json/D10_100501.json")
dfD12 = pd.read_json("data/json/D12_220501.json")
dfD14 = pd.read_json("data/json/D14_120995.json")
dfD17 = pd.read_json("data/json/D17_241192.json")


In [75]:
# add column identifying documents
dfD4['doc_id'] = 'D4_060493'
dfD8['doc_id'] = 'D8_081002'
dfD9['doc_id'] = 'D9_241002'
dfD10['doc_id'] = 'D10_100501'
dfD12['doc_id'] = 'D12_220501'
dfD14['doc_id'] = 'D14_120995'
dfD17['doc_id'] = "D17_241192"

In [76]:
# merge all dataframes
df = pd.concat([dfD4, dfD8, dfD9, dfD10, dfD12, dfD14, dfD17], ignore_index=True)

## Cleaning Text

In [77]:
# create binary variable where POLITICAL (label_id 4) = 1, all else = 0
df["label_id"] = (df["label_id"] == 4).astype(int)


In [78]:
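# build the stop-word set once rather than on every call
STOP_WORDS = set(stopwords.words("english"))


def preprocessing(text):
    # strip HTML tags
    text = re.sub(r"<[^>]*>", "", text)
    # strip punctuation (anything that is not a word character or whitespace)
    text = re.sub(r"[^\w\s]", "", text)
    # lowercase and drop English stop words
    words = [word for word in text.lower().split() if word not in STOP_WORDS]
    return " ".join(words)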

In [79]:
df['text'] = df['text'].apply(preprocessing)

In [80]:
# pd.set_option('display.max_rows', 371)
df

| | paragraph_id | text | label | label_id | description | doc_id |
|---|---|---|---|---|---|---|
| 0 | 1 | solution ne 5 6 april 1993 cd ne 693 interpret... | BACKGROUND | 0 | CASE TITLE | D4_060493 |
| 1 | 2 | members asen manov chairman mladen danailov ts... | BACKGROUND | 0 | CASE TITLE | D4_060493 |
| 2 | 3 | proceedings instituted request 52 mps 36th nat... | SUMMARY | 0 | REFERRAL | D4_060493 |
| 3 | 4 | order 2 march 1993 constitutional court grante... | SUMMARY | 0 | REFERRAL | D4_060493 |
| 4 | 5 | authors request instructed clarify issues view... | SUMMARY | 0 | REFERRAL | D4_060493 |
| ... | ... | ... | ... | ... | ... | ... |
| 366 | 32 | provision art 2 para higher education act must... | CONSTITUTIONAL INTERPRETATION | 0 | | D17_241192 |
| 367 | 33 | national assembly authorized legalize public h... | CONSTITUTIONAL INTERPRETATION | 0 | | D17_241192 |
| 368 | 34 | provision art 2 para zvo unconstitutional alle... | POLITICAL | 1 | MORAL/POLITICAL JUDGEMENT OF NATIONAL ASSEMBLY... | D17_241192 |
| 369 | 35 | authors request opinions allege infringements ... | FACTUAL | 0 | ROLE/SCOPE OF THE COURT | D17_241192 |


In [81]:
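ps = PorterStemmer()

# tokenizer that splits on whitespace and reduces each word to its Porter stem
# note: defined here but not passed to TfidfVectorizer below, so no stemming is applied
def token_ps(text):
    return [ps.stem(word) for word in text.split()]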

## Logistic Regression

- The logistic regression model is fitted on features produced by a vectorizer called tf-idf, which stands for term frequency-inverse document frequency.
- tf-idf measures how distinctive a word is by weighing how often it appears in a document against the number of documents that contain it. A word that is frequent in one document but rare across the corpus gets a high weight; a word that appears in most documents gets a low one.
- The logistic regression below takes the tf-idf vector of each sentence, learns which weighted words characterise the political label, and uses them to predict which sentences are political.
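
To make the weighting concrete, here is a minimal sketch that computes tf-idf by hand on a toy corpus. It follows the smoothed formula scikit-learn uses with `smooth_idf=True` and `norm='l2'`, that is `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each document vector. The function name `tfidf_vectors`, the whitespace tokenisation, and the toy sentences are assumptions made for illustration and are not part of the notebook.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenised:
        term_counts = Counter(tokens)
        raw = {t: count * idf[t] for t, count in term_counts.items()}
        # scale the document vector to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


# toy corpus (made up for illustration, not drawn from the court documents)
toy_docs = [
    "court ruled provision unconstitutional",
    "court granted request",
    "court heard request parliament",
]
toy_vectors = tfidf_vectors(toy_docs)
print(toy_vectors[0])
```

In the first toy document, "court" appears in every document and so ends up with a lower weight than "ruled", which appears only once in the corpus.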

In [82]:
# transforming text into vectors
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None,
                        use_idf=True,
                        norm='l2',
                        smooth_idf=True)
# compute tfidf values for all words in 'text' column of df
X = tfidf.fit_transform(df['text'])
y = df.label_id.values

In [122]:
# splitting data into train and test splits in order to test predictive accuracy
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, test_size=0.3, shuffle=True
)

# fits a logistic regression with built-in cross-validation,
# which is used to select the regularisation strength C
# cv = number of cross-validation folds
log_reg = LogisticRegressionCV(
    cv=10, scoring="accuracy", n_jobs=-1, verbose=3, max_iter=500
).fit(X_train, y_train)

# predictions on the held-out test set
log_predictions = log_reg.predict(X_test)


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:    3.9s remaining:    9.0s
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    4.3s remaining:    1.8s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    4.8s finished


In [123]:
# defining a function that prints model prediction accuracy
def model_accuracy(name, preds):
    print("---{} Test Set Results---".format(name))
    print("Weighted F1 Average: {}".format(f1_score(y_test, preds, average="weighted")))
    # precision = share of predicted positives that are correct
    # recall = share of actual positives that are identified
    # f1-score = harmonic mean of precision & recall
    # weighted f1 avg (per-class f1 weighted by support) used for comparing classification models
    print(classification_report(y_test, preds))

In [124]:
model_accuracy("Logit", log_predictions)

---Logit Test Set Results---
Weighted F1 Average: 0.8372408293460925
              precision    recall  f1-score   support

           0       0.88      1.00      0.94        98
           1       1.00      0.07      0.13        14

    accuracy                           0.88       112
   macro avg       0.94      0.54      0.54       112
weighted avg       0.90      0.88      0.84       112
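
The weighted F1 in the report above can be reproduced by hand. The sketch below infers the per-class counts from the reported precision, recall and support (98 of 98 non-political sentences correct, 1 of 14 political sentences recovered, no false positives for class 1). Those counts are an inference from the rounded figures and are not printed by the notebook. The helper name `f1_from_counts` is hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = harmonic mean of precision and recall = 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# counts inferred from the Logit report (an assumption, see lead-in)
support = {0: 98, 1: 14}
f1_per_class = {
    0: f1_from_counts(tp=98, fp=13, fn=0),  # 13 political sentences predicted as 0
    1: f1_from_counts(tp=1, fp=0, fn=13),   # only 1 political sentence recovered
}

# weighted average: each class's F1 weighted by its share of the test set
total = sum(support.values())
weighted_f1 = sum(f1_per_class[c] * support[c] for c in support) / total
print(round(weighted_f1, 4))
```

Because 98 of the 112 test sentences are non-political, the weighted average is dominated by class 0: a score near 0.84 coexists with a recall of only 0.07 on the political class.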



## Naive Bayes

In [125]:
# compute tfidf values for all words in 'text' column of df
# .toarray() converts the sparse tf-idf matrix into a dense array
# (MultinomialNB also accepts sparse input, so this step is optional)
X = tfidf.fit_transform(df["text"]).toarray()
y = df.label_id.values

In [126]:
# splitting data into train and test splits in order to test predictive accuracy
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, test_size=0.3, shuffle=True
)

In [127]:
# fit the training dataset on the NB classifier
nb = naive_bayes.MultinomialNB()
nb.fit(X_train, y_train)

# model accuracy
nb_predictions = nb.predict(X_test)
model_accuracy("Naive Bayes", nb_predictions)

---Naive Bayes Test Set Results---
Weighted F1 Average: 0.8166666666666667
              precision    recall  f1-score   support

           0       0.88      1.00      0.93        98
           1       0.00      0.00      0.00        14

    accuracy                           0.88       112
   macro avg       0.44      0.50      0.47       112
weighted avg       0.77      0.88      0.82       112



## Support Vector Machines (SVM)

In [128]:
# fit the training dataset on the SVM classifier
# (degree and gamma have no effect with kernel="linear")
SVM = svm.SVC(C=1.0, kernel="linear", degree=3, gamma="auto")
SVM.fit(X_train, y_train)

# model accuracy
svm_predictions = SVM.predict(X_test)
model_accuracy("SVM", svm_predictions)

---SVM Test Set Results---
Weighted F1 Average: 0.8166666666666667
              precision    recall  f1-score   support

           0       0.88      1.00      0.93        98
           1       0.00      0.00      0.00        14

    accuracy                           0.88       112
   macro avg       0.44      0.50      0.47       112
weighted avg       0.77      0.88      0.82       112



## Pickling Models (for Future Use)

In [131]:
# saving tfidf (the context manager closes the file once the dump completes)
with open("tfidf.pickle", "wb") as f:
    pickle.dump(tfidf, f)

# saving models
for name, model in [("log_reg", log_reg), ("nb", nb), ("svm", SVM)]:
    with open(f"{name}.pickle", "wb") as f:
        pickle.dump(model, f)
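
For the "future use" side, a saved object is read back with `pickle.load`. The sketch below shows the round trip on a stand-in dictionary written to a temporary directory. The object, the file name `model.pickle`, and the variable names are placeholders chosen for illustration; in practice the same pattern would be applied to the vectorizer and model files saved above.

```python
import os
import pickle
import tempfile

# stand-in object; in the notebook this would be tfidf, log_reg, nb or SVM
artifact = {"name": "placeholder-model", "params": {"C": 1.0}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pickle")

    # write: the context manager closes the file even if dump() raises
    with open(path, "wb") as f:
        pickle.dump(artifact, f)

    # read back into a new object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == artifact)
```

When reloading fitted scikit-learn objects, the vectorizer has to be loaded alongside the model so that new text is transformed with the same vocabulary. Pickle files should only be loaded from trusted sources, ideally with the same library versions that wrote them.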