# Baseline Models

- This notebook fits three baseline models for an analysis that detects political language in court documents:
    1. Logistic regression
    2. Naive Bayes
    3. SVM
- The baselines serve as a reference point for testing model performance while building an accurate model for classifying the language used in Bulgaria's constitutional court.

## Next steps

- The data needs more sentences labelled as political: the classes are imbalanced, so the models see only a few political sentences.

In [1]:
import pickle
import re

import nltk
import pandas as pd

nltk.download("stopwords")

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn import naive_bayes, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/paulj1989/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
dfD4 = pd.read_json("data/json/D4_060493.json")
dfD8 = pd.read_json("data/json/D8_081002.json")
dfD9 = pd.read_json("data/json/D9_241002.json")
dfD10 = pd.read_json("data/json/D10_100501.json")
dfD12 = pd.read_json("data/json/D12_220501.json")
dfD14 = pd.read_json("data/json/D14_120995.json")
dfD17 = pd.read_json("data/json/D17_241192.json")


In [3]:
# add column identifying documents
dfD4['doc_id'] = 'D4_060493'
dfD8['doc_id'] = 'D8_081002'
dfD9['doc_id'] = 'D9_241002'
dfD10['doc_id'] = 'D10_100501'
dfD12['doc_id'] = 'D12_220501'
dfD14['doc_id'] = 'D14_120995'
dfD17['doc_id'] = "D17_241192"

In [4]:
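# merge all dataframes; ignore_index=True gives the merged frame a fresh 0..n-1 index
# instead of repeating each document's own row labels
df = pd.concat([dfD4, dfD8, dfD9, dfD10, dfD12, dfD14, dfD17], ignore_index=True)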

## Cleaning Text

In [5]:
# create binary variable where POLITICAL = 1, all else = 0
df.loc[df["label_id"] != 4, "label_id"] = 0

df.loc[df["label_id"] == 4, "label_id"] = 1


In [6]:
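# build the stop-word set once rather than on every call
stop_words = set(stopwords.words("english"))

def preprocessing(text):
    # strip HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # strip punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    # lowercase the text and drop English stop words
    words = [word for word in text.lower().split() if word not in stop_words]
    text = " ".join(words)

    return text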
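
As an illustrative sketch of what the cleaning function does, the snippet below applies the same three steps to one made-up sentence. The sample text and the tiny `STOP_WORDS` set are assumptions: they stand in for the real data and for NLTK's English stop-word list so that the snippet runs without any download.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (assumption)
STOP_WORDS = {"the", "is", "a", "of"}


def preprocessing_demo(text, stop_words=STOP_WORDS):
    # strip HTML tags
    text = re.sub(r"<[^>]*>", "", text)
    # strip punctuation
    text = re.sub(r"[^\w\s]", "", text)
    # lowercase and drop stop words
    words = [word for word in text.lower().split() if word not in stop_words]
    return " ".join(words)


# Made-up sentence, not taken from the court documents
sample = "<p>The Court's ruling is final, the petition is rejected.</p>"
cleaned = preprocessing_demo(sample)
print(cleaned)
```

Note that punctuation is removed before the stop-word filter runs, so the apostrophe in "Court's" disappears and the token becomes "courts".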

In [7]:
df['text'] = df['text'].apply(preprocessing)

In [8]:
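ps = PorterStemmer()

# split on whitespace and reduce each word to its Porter stem;
# passed to TfidfVectorizer below as its tokenizer
def token_ps(text):
    return [ps.stem(word) for word in text.split()]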

## Logistic Regression

- The logistic regression model is fitted on values produced by a vectorizer called tf-idf, which stands for term frequency-inverse document frequency.
- tf-idf weights a word by how often it appears in a document, scaled down by the number of documents that contain it. A word that is frequent in one document but rare across the corpus receives a high weight, so the weight measures how distinctive the word is for that document.
- The text is first converted into vectors of tf-idf word weights. The logistic regression then uses those vectors to learn which words characterise the political label and to predict which sentences are political.
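
To make the weighting concrete, here is a minimal pure-Python sketch of the calculation on three made-up, already tokenised sentences. It assumes the settings used in this notebook (`use_idf=True`, `smooth_idf=True`, `norm='l2'`), for which scikit-learn documents the smoothed formula idf = ln((1 + n) / (1 + df)) + 1. The toy tokens are hypothetical and are not drawn from the court documents.

```python
import math

# Hypothetical tokenised sentences (assumption, for illustration only)
docs = [
    ["court", "rules", "law"],
    ["court", "parliament", "vote"],
    ["court", "law", "article"],
]

n_docs = len(docs)
vocab = sorted({word for doc in docs for word in doc})

# document frequency: number of documents that contain each word
doc_freq = {word: sum(word in doc for doc in docs) for word in vocab}

# smoothed inverse document frequency
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}


def tfidf_row(doc):
    # raw term count times idf, then scale the row to unit L2 length
    raw = [doc.count(word) * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


weights = dict(zip(vocab, tfidf_row(docs[1])))
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    if weight > 0:
        print(f"{word:<12}{weight:.3f}")
```

In the second sentence, "court" appears in every document and so gets a lower weight than "parliament" or "vote", which each appear in only one.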

In [9]:
# transforming text into vectors
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None,
                        tokenizer=token_ps,
                        use_idf=True,
                        norm='l2',
                        smooth_idf=True)
# compute tfidf values for all words in 'text' column of df
X = tfidf.fit_transform(df['text'])
y = df.label_id.values

In [20]:
# split the data into train and test halves in order to test predictive accuracy
# shuffle=False keeps the original row order, so random_state has no effect here
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, test_size=0.5, shuffle=False
)

# fit a logistic regression that selects its regularisation strength by cross-validation
# cv = number of cross-validation folds
clf = LogisticRegressionCV(
    cv=10, scoring="accuracy", n_jobs=-1, verbose=3, max_iter=500
).fit(X_train, y_train)

# save the model so it doesn't need to be retrained every time
with open("logit_model.sav", "wb") as logit_model:
    pickle.dump(clf, logit_model)

# presenting model accuracy
y_pred = clf.predict(X_test)
print("---Test Set Results---")
print("Accuracy with logreg: {}".format(accuracy_score(y_test, y_pred)))
print(classification_report(y_test, y_pred))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:    2.1s remaining:    5.0s
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    2.7s remaining:    1.1s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    3.2s finished


---Test Set Results---
Accuracy with logreg: 0.8118279569892473
              precision    recall  f1-score   support

           0       0.81      1.00      0.90       151
           1       0.00      0.00      0.00        35

    accuracy                           0.81       186
   macro avg       0.41      0.50      0.45       186
weighted avg       0.66      0.81      0.73       186
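
The report above shows recall of 1.00 for class 0 and 0.00 for class 1, which is the pattern produced when every test sentence is predicted as class 0. As a rough check, the sketch below recomputes the headline figures for such a constant predictor from the two support counts in the report (151 and 35). The variable names are illustrative only.

```python
# Support counts taken from the classification report above
n_neg, n_pos = 151, 35
n_total = n_neg + n_pos

# A predictor that always outputs 0 gets every class-0 sentence right
# and every class-1 sentence wrong
accuracy = n_neg / n_total
precision_0 = n_neg / n_total  # all predictions are 0; n_neg of them are correct
recall_0 = 1.0
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

# Class 1 is never predicted, so its recall and f1 are 0
recall_1 = 0.0
f1_1 = 0.0

macro_recall = (recall_0 + recall_1) / 2
macro_f1 = (f1_0 + f1_1) / 2

print(f"accuracy     {accuracy:.2f}")
print(f"f1 (class 0) {f1_0:.2f}")
print(f"macro recall {macro_recall:.2f}")
print(f"macro f1     {macro_f1:.2f}")
```

These values agree with the report to two decimal places, so the 0.81 accuracy reflects the share of non-political sentences in the test set rather than any ability to detect political ones.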



## Naive Bayes

In [11]:
# compute tfidf values for all words in 'text' column of df
# .toarray() added in this instance to convert the sparse tf-idf matrix
# into a dense array so that the nb model runs without error
X = tfidf.fit_transform(df["text"]).toarray()
y = df.label_id.values

In [15]:
# splitting data into train and test splits in order to test predictive accuracy
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, test_size=0.2, shuffle=False
)
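
Both splits in this notebook pass `shuffle=False`, so the test set is simply the last rows of the concatenated frame and its share of political sentences depends on which documents come last. As a hedged sketch of one possible alternative, not something this notebook does, the snippet below compares an unshuffled split with a stratified one on made-up labels. `X_demo` and `y_demo` are hypothetical and deliberately ordered so that all positive labels sit at the end.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 80 non-political (0) followed by 20 political (1)
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Unshuffled split: the test set is the final 20% of rows
_, _, _, y_test_ordered = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

# Stratified split: train and test keep the overall class proportions
_, _, y_train_strat, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

print("political share, unshuffled test set:", y_test_ordered.mean())
print("political share, stratified test set:", y_test_strat.mean())
```

With stratification, `random_state` does matter, because the rows are shuffled before being divided.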

In [17]:
# fit the training dataset on the NB classifier
nb = naive_bayes.MultinomialNB()
nb.fit(X_train,y_train)

# presenting model accuracy
y_pred = nb.predict(X_test)
print("---Test Set Results---")
print("Naive Bayes Accuracy: {}".format(accuracy_score(y_test, y_pred)))
print(classification_report(y_test, y_pred))


---Test Set Results---
Naive Bayes Accuracy: 0.72
              precision    recall  f1-score   support

           0       0.72      1.00      0.84        54
           1       0.00      0.00      0.00        21

    accuracy                           0.72        75
   macro avg       0.36      0.50      0.42        75
weighted avg       0.52      0.72      0.60        75



## Support Vector Machines (SVM)

In [14]:
# fit the training dataset on the SVM classifier
# note: degree and gamma are ignored when kernel='linear'
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(X_train, y_train)

# presenting model accuracy
y_pred_svm = SVM.predict(X_test)
print("---Test Set Results---")
print("SVM Accuracy: {}".format(accuracy_score(y_test, y_pred_svm)))
print(classification_report(y_test, y_pred_svm))


---Test Set Results---
SVM Accuracy: 0.8035714285714286
              precision    recall  f1-score   support

           0       0.80      1.00      0.89        90
           1       0.00      0.00      0.00        22

    accuracy                           0.80       112
   macro avg       0.40      0.50      0.45       112
weighted avg       0.65      0.80      0.72       112
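
As a closing comparison, the sketch below sets each reported accuracy against the majority-class share of its own test set, using only the support counts and accuracies printed in the three reports in this notebook. The dictionary layout and names are illustrative.

```python
# (class-0 support, class-1 support, reported accuracy) from the reports above
reports = {
    "Logistic regression": (151, 35, 0.8118279569892473),
    "Naive Bayes": (54, 21, 0.72),
    "SVM": (90, 22, 0.8035714285714286),
}

gains = {}
for name, (n_neg, n_pos, reported) in reports.items():
    # accuracy of always predicting the majority class (0)
    majority_share = n_neg / (n_neg + n_pos)
    gains[name] = reported - majority_share
    print(f"{name:<20} reported={reported:.4f}  majority={majority_share:.4f}")
```

In each case the reported accuracy equals the majority-class share, so none of the three baselines improves on always predicting the non-political label. This is consistent with the next step listed at the top of the notebook: the models need more political sentences to learn from.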

