# 1. Preparation

This is a **Python Notebook**. To make it work, you need to press "play" on the code cells.  

*Remember to always run cells in the right order and never skip one!*  
In case of doubt, you can always restart from the beginning.

Before you start, you need to "clone" the course GitHub repository into the Python notebook.  




In [None]:
!git clone https://github.com/SimoneRebora/CMCLS.git

Then we can read the "sample" dataset, which contains sentences from the "corpus" dataset annotated for valence.  
Annotations were created during the "Distant Reading in R" workshop with the help of ten annotators: https://github.com/ABC-DH/Distant_Reading_in_R/tree/main/Machine_Learning_Files  
We can load the dataset and take a look at it.

In [None]:
import pandas as pd
my_df = pd.read_csv('CMCLS/materials/ML_sample_dataset.csv', index_col=0)
my_df

# 2. Machine Learning


## Create embeddings for texts

We'll be using SentiArt to create embeddings.  

In [None]:
# load libraries
import nltk
from nltk.tokenize import word_tokenize

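# Download NLTK tokenizer data if you haven't already
nltk.download('punkt')
nltk.download('punkt_tab') # recent NLTK releases look for this resource instead of 'punkt'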

# read the SentiArt dictionary
TC = 'CMCLS/materials/SentiArt.csv' # for English texts
sa = pd.read_csv(TC)

# tokenize sentences
sents = my_df['text']
tokens = [[t for t in word_tokenize(s) if t.isalpha()] for s in sents]

# compute the mean z-score of each emotion (anger, fear, etc.) per sentence
sent_mean_ang_z = []
sent_mean_fear_z = []
sent_mean_disg_z = []
sent_mean_hap_z = []
sent_mean_sad_z = []
sent_mean_surp_z = []
for t in tokens:
    dt = sa.query('word in @t')
    sent_mean_ang_z.append(dt.ang_z.mean())
    sent_mean_fear_z.append(dt.fear_z.mean())
    sent_mean_disg_z.append(dt.disg_z.mean())
    sent_mean_hap_z.append(dt.hap_z.mean())
    sent_mean_sad_z.append(dt.sad_z.mean())
    sent_mean_surp_z.append(dt.surp_z.mean())

# add results to table
my_df['anger'] = sent_mean_ang_z
my_df['fear'] = sent_mean_fear_z
my_df['disgust'] = sent_mean_disg_z
my_df['happiness'] = sent_mean_hap_z
my_df['sadness'] = sent_mean_sad_z
my_df['surprise'] = sent_mean_surp_z

# replace NAs with 0 (a sentence with no words in the dictionary has no mean)
my_df.fillna(0, inplace=True)

# round to three decimals and visualize
my_df = my_df.round(3)
my_df
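
The per-sentence averaging above can be sketched on toy data. This is only an illustration: the lexicon below is hypothetical (invented words and z-scores, with column names borrowed from the cell above), a plain `split()` stands in for `word_tokenize`, and `isin` is used as a stand-in for the `query('word in @t')` lookup.

```python
import pandas as pd

# Hypothetical toy lexicon: words and z-scores are invented for illustration.
toy_sa = pd.DataFrame({
    "word": ["storm", "smile", "grave"],
    "fear_z": [1.2, -0.8, 0.9],
    "hap_z": [-0.5, 1.5, -1.0],
})

# Whitespace tokenization stands in for nltk's word_tokenize here.
toy_tokens = "the storm passed and a smile returned".split()

# Keep only the lexicon rows whose word occurs in the sentence.
matched = toy_sa[toy_sa["word"].isin(toy_tokens)]

# Average each emotion column over the matched words.
toy_fear = matched["fear_z"].mean()
toy_hap = matched["hap_z"].mean()
print(round(toy_fear, 3), round(toy_hap, 3))

# A sentence with no lexicon words has no rows to average, so the mean is NaN.
no_match = toy_sa[toy_sa["word"].isin(["zzz"])]
no_match_fear = no_match["fear_z"].mean()
print(no_match_fear)
```

Because the lookup is a membership test, a word that is repeated within a sentence contributes to the mean only once, as long as each word appears once in the lexicon.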

## Logistic regression

Train a logistic regression model with SentiArt embeddings (to predict evaluation vs. other).

Split the dataset into train and test sets.

In [None]:
from sklearn.model_selection import train_test_split

# create tables for train and test
train, test = train_test_split(my_df, test_size=0.2, random_state=123) # split our main data into train and test
# the argument test_size=0.2 puts 20% of the rows in test and the remaining 80% in train
# the argument random_state=123 fixes the random seed, so the same split is produced on every run

# separate training embeddings from training labels
train_X = train[['anger','fear','disgust','happiness','sadness','surprise']] # training features
train_y = train.label # training labels

# separate test embeddings from test labels
test_X = test[['anger','fear','disgust','happiness','sadness','surprise']] # test features
test_y = test.label # test labels

# print shapes of train/test embedding tables
print(train_X.shape)
print(test_X.shape)

# show train material
train_X
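
To see what `test_size` and `random_state` do, here is a small sketch on a hypothetical ten-row table (the column names and values are invented and are not part of the course data).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy table with 10 rows; values are invented for illustration.
toy_df = pd.DataFrame({"feature": range(10), "label": ["a", "b"] * 5})

# Same arguments as above: 20% of the rows go to the test set.
toy_train, toy_test = train_test_split(toy_df, test_size=0.2, random_state=123)
print(toy_train.shape, toy_test.shape)

# Repeating the split with the same random_state selects the same rows.
toy_train_again, toy_test_again = train_test_split(toy_df, test_size=0.2, random_state=123)
same_rows = toy_test.index.equals(toy_test_again.index)
print(same_rows)

# Train and test never share a row.
overlap = set(toy_train.index) & set(toy_test.index)
print(len(overlap))
```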

Let's run logistic regression and see the results in the table.

In [None]:
from sklearn.linear_model import LogisticRegression

# fit a LR model to our train data
model = LogisticRegression()
model.fit(train_X, train_y)

# make prediction with model on test data
prediction = model.predict(test_X)

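# add results to a copy of the test table and show
# (writing to the split output directly can trigger pandas' SettingWithCopyWarning)
test = test.copy()
test['logistic regression'] = prediction
test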
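
Behind each predicted label, the model computes one probability per class and picks the most probable one. The sketch below illustrates this on a hypothetical one-feature dataset; the numbers and the label strings are invented and may not match the labels in the course data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, two invented class labels.
toy_X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
toy_y = np.array(["other", "other", "other",
                  "evaluation", "evaluation", "evaluation"])

toy_model = LogisticRegression()
toy_model.fit(toy_X, toy_y)

# Two unseen points, one on each side of the training data.
new_X = np.array([[-1.8], [1.8]])
toy_labels = toy_model.predict(new_X)
toy_probs = toy_model.predict_proba(new_X)

# classes_ gives the column order of the probability table.
print(toy_model.classes_)
print(toy_labels)
print(toy_probs.round(3))

# predict() returns the class with the highest probability in each row.
most_probable = toy_model.classes_[toy_probs.argmax(axis=1)]
```

In the notebook itself, `model.predict_proba(test_X)` would give the same kind of table for the test sentences.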

See the performance scores (precision, recall, and F1 for each class).

In [None]:
from sklearn.metrics import classification_report
print(classification_report(test_y, prediction))
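
The figures in the report can be reproduced by hand. The following sketch does so for one class on hypothetical labels and predictions (invented for illustration, not taken from the dataset) and compares the result with scikit-learn's own functions.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical true labels and predictions, invented for illustration.
toy_true = ["evaluation", "evaluation", "evaluation", "other", "other", "other"]
toy_pred = ["evaluation", "evaluation", "other", "evaluation", "other", "other"]

pairs = list(zip(toy_true, toy_pred))
target = "evaluation"

# Count the outcomes for the target class.
tp = sum(t == target and p == target for t, p in pairs)  # correctly predicted
fp = sum(t != target and p == target for t, p in pairs)  # predicted, but wrong
fn = sum(t == target and p != target for t, p in pairs)  # missed

# Precision: share of predicted "evaluation" items that are correct.
precision = tp / (tp + fp)
# Recall: share of true "evaluation" items that were found.
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# The same scores computed by scikit-learn.
sk_precision = precision_score(toy_true, toy_pred, pos_label=target)
sk_recall = recall_score(toy_true, toy_pred, pos_label=target)
sk_f1 = f1_score(toy_true, toy_pred, pos_label=target)

print(precision, recall, f1)
print(sk_precision, sk_recall, sk_f1)
```

The `support` column of the report is simply the number of true items of each class in the test set.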