## Interactive Model Training
Run the cells to reproduce the results.

### Data and Models Used in Hisia

**Data:** 254,464 Danish TrustPilot reviews from 2016 (review body and star rating), plus 8 fake reviews duplicated 20 times; see the note below for the explanation.<br>
&ensp; _Update_ (2021-10-02): political data from [Sentiment Analysis on Comments from Danish Political Articles on Social Media](https://github.com/steffan267/Sentiment-Analysis-on-Danish-Social-Media)

**Models**<br>
Hisia: _LogisticRegression_ with SAGA, a variant of Stochastic Average Gradient (SAG), as the solver. An L2 penalty was selected for the base model. The test **accuracy is ca. 93%** with a **recall of 93%**. SAGA is a fast solver for large datasets (large in both rows and columns). As a stochastic gradient method, it keeps a memory of previous gradients and feeds it into each update to achieve a faster convergence rate. A seed of 42 was set for the data split and for the model, for reproducibility.

HisiaTrain: _SGDClassifier_, a Stochastic Gradient Descent learner with the smooth 'modified_huber' loss function and an L2 penalty. The test **accuracy is 94%** with a **recall of 94%**. SGDClassifier was selected because of `partial_fit`, which allows batch/online training.

**Note:** These scores reflect how the models perform on the TrustPilot style of review writing.<br>
 >8*20 fake reviews. TrustPilot reviews are directed toward products and services. Words like 'elsker' (love) or 'hader' (hate) are rarely used. To make sure the model learns such relationships, I added 8 reviews and duplicated them 20 times. These new sentences did not increase or decrease the model accuracy but gave the correct coefficients for love, hate and 'ikke dårligt' (not bad).

In [1]:
%reload_ext watermark
%watermark -uniz --author "Author Prayson W. Daniel" -vm -p pandas,numpy,matplotlib,scikit-learn,lemmy,dill

Author: Author Prayson W. Daniel

Last updated: 2022-02-09T17:32:33.827667+01:00

Python implementation: CPython
Python version       : 3.8.12
IPython version      : 8.0.1

pandas      : 1.4.0
numpy       : 1.22.2
matplotlib  : 3.5.1
scikit-learn: 1.0.2
lemmy       : 2.1.0
dill        : 0.3.4

Compiler    : GCC 9.3.0
OS          : Linux
Release     : 5.10.16.3-microsoft-standard-WSL2
Machine     : x86_64
Processor   : x86_64
CPU cores   : 12
Architecture: 64bit



In [2]:
from collections import namedtuple
import joblib
import re
from pathlib import Path

import dill
import lemmy

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV, SGDClassifier
from loguru import logger

In [4]:
from helpers import show_diagram
from helpers import show_most_informative_features

In [5]:
%matplotlib inline
plt.rcParams['figure.figsize'] = (15,5)
plt.style.use('fivethirtyeight')

These stop words were custom-made from Danish stop words and from product- and service-related words such as delivery, package, and post office.

In [6]:
PATH_TO_DATA = '../data'
PATH_TO_STOPWORDS = '../hisia/models/data'
STOP_WORDS = joblib.load(f'{PATH_TO_STOPWORDS}/stops.pkl')

The tokenizer separates emoticons from the words, removes digits, repeated words and stop words, and lemmatizes the remaining words.

In [7]:
lemmatizer = lemmy.load('da')


# Add more stopwords
STOP_WORDS.update({"kilometer", "alme", "bank", "brand", "dansk", "presi"})

In [8]:
"kilometer" in STOP_WORDS

True

In [9]:
def tokenizer(blob, stop_words=STOP_WORDS, remove_digits=True):
    
    if stop_words is None:
        stop_words = set()
    
    blob = blob.lower()
    
    # eyes [nose] mouth | mouth [nose] eyes pattern
    emoticons = r"(?:[<>]?[:;=8][\-o\*\']?[\)\]\(\[dDpP/\:\}\{@\|\\]|[\)\]\(\[dDpP/\:\}\{@\|\\][\-o\*\']?[:;=8][<>]?)"
    emoticon_re = re.compile(emoticons, re.VERBOSE | re.I | re.UNICODE)
    
    text = re.sub(r'[\W]+', ' ', blob)
    
    # collapse runs of 3+ identical characters to two, e.g. hellllo -> hello, jaaaa -> jaa
    repetitions = re.compile(r'(.)\1{2,}')
    text = repetitions.sub(r'\1\1', text)
    
    # collapse consecutive repeated words, e.g. hej hej hej -> hej
    repetitions = re.compile(r'\b(\w+)\s+(\1\s*)+\b')
    text = repetitions.sub(r'\1 ', text)
    
    
    # 14år --> 14 år
    text = re.sub(r'([0-9]+(\.[0-9]+)?)', r' \1 ', text).strip()
    
    emoji = ''.join(re.findall(emoticon_re, blob))
    
       
    # remove stopwords
    text_nostop = [word for word in text.split() if word not in stop_words]
    
    # lemmatize each remaining token (the last candidate lemma is kept)
    lemmatized_text = [lemmatizer.lemmatize('', word)[-1]  
                                 for word in text_nostop]
    
    remove_stopwords = ' '.join(word for word in lemmatized_text if len(word)>1)
    
    if remove_digits:
        remove_stopwords = re.sub(r'\b\d+\b', '', remove_stopwords)
    

    # remove extra spaces
    remove_stopwords = ' '.join(remove_stopwords.split())
    result = f'{remove_stopwords} {emoji}'.encode('utf-8').decode('utf-8')
       
    
    return result.split()

In [10]:
tokenizer('Jeg er vred på, at jeg ikke fik min pakke :( kilometer')

['vred', 'ikke', ':(']
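
The regex clean-up steps inside `tokenizer` can be tried in isolation. The following is a minimal sketch that reuses the same regular expressions in the same order, but leaves out stop-word removal, emoticon extraction and lemmatization. The helper name `normalize` is hypothetical and not part of Hisia.

```python
import re


def normalize(text: str) -> str:
    """Apply the tokenizer's regex clean-up steps to a single string.

    Illustrative helper only; stop words, emoticons and lemmas are ignored.
    """
    # lower-case and replace every run of non-word characters with a space
    text = re.sub(r'[\W]+', ' ', text.lower())
    # collapse runs of 3+ identical characters to two
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)
    # collapse consecutive repeated words into one
    text = re.sub(r'\b(\w+)\s+(\1\s*)+\b', r'\1 ', text)
    # put spaces around numbers so that '14år' becomes '14 år'
    text = re.sub(r'([0-9]+(\.[0-9]+)?)', r' \1 ', text)
    # squeeze the extra whitespace introduced above
    return ' '.join(text.split())


print(normalize('Jaaaa!!! hej hej hej, jeg er 14år'))
# jaa hej jeg er 14 år
```

Note that the character rule keeps two repeated characters rather than one, so a legitimate double letter such as the `ll` in `hello` survives.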

In [None]:
df = pd.read_json(f'{PATH_TO_DATA}/data.json')

Fake reviews to teach our model the missing relationships that are not found in TrustPilot (TP) reviews.

In [None]:
dt = pd.DataFrame([('men elsker elsker', 1,5), 
                   ('elsker det ikke', 0, 1), 
                   ('ikke dårligt', 1, 5),
                   ('elsker skat, kæreste, tilbedte, dyrebare, elskling, darling, hjerte, hjertenskær; ven; veninde', 1, 5),
                   ('dårlig: syg, sløj, utilpas, ilde tilpas, upasselig, snavs, indisponeret;'
                    'sygelig, usund, ond, slet; arg, uheldig, umulig, elendig, under al kritik,'
                    'dødssyg, skidt, skral, krank, ussel, ikke noget at samle på, talentløs, uantagelig,'
                    'uacceptabel, forkastelig; ikke noget at råbe hurra for, ikke noget at skrive hjem om,'
                    'noget skidt (lort, pis), andenklasses, tredjeklasses (osv.), ringe, halvgod, ikke nogen '
                    'ørn til, ikke ens stærke side, ens svage punkt, som en brækket arm; sjusket; ufordelagtig,'
                    'ufyldestgørende, utilstrækkelig, utilfredsstillende, middelmådig, under lavmålet, uduelig,'
                    'udygtig, uhensigtsmæssig, forkert, tarvelig; skadelig, ødelæggende, fordærvet, ubrugelig;'
                    'ubehagelig, væmmelig, ulystbetonet; dys-; utiltalende, usympatisk, kedelig', 0, 1),
                    ('20.000 kroner. Det er, hvad man som arbejdstager burde få ekstra i lønningsposen,'
                    ' hvis man skal kunne acceptere at have en dårlig chef.', 0,1),
                   ('kærlighed, hvordan elsker vi hinanden godt – uanset hvem vi elsker?',1,5),
                   ('jeg hader dig', 0, 1),
                   
                  ]*20, 
                  columns='features target stars'.split())

New data from ["Sentiment Analysis on Comments from Danish Political Articles on Social Media"](https://github.com/lucaspuvis/SAM/blob/master/Thesis.pdf)

In [None]:
SAM = "https://raw.githubusercontent.com/steffan267/Sentiment-Analysis-on-Danish-Social-Media/master/all_sentences.csv"
ds = pd.read_csv(SAM, names=["target", "features"])

In [None]:
ds['target'].value_counts()

In [None]:
ds.to_json("../data/steffan267_SAM.json")

In [None]:
(
    ds
      .loc[lambda d: d['target'].ne(0), ["target", ]]
      .assign(target= lambda d: np.where(d["target"].gt(0), 1, 0))
      .value_counts()
      .rename(index ={0: "negative", 1:"positive"})
      .to_frame(name="observations")    
)


  

In [None]:
ds = (
    ds
      .loc[lambda d: d["target"].ne(0), ["features", "target"]]
      .assign(target= lambda d: np.where(d["target"].gt(0), 1, 0))
       
)
  

In [None]:
dt = pd.concat([dt, ds], ignore_index=True)

In [None]:
# dt.to_json('../data/data_custom.json')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['features'], 
                                                     df['target'],
                                                     test_size=.2,
                                                     random_state=42,
                                                     stratify=df['target'])

In [None]:
X_train, y_train = (pd.concat([X_train, dt['features']] ,ignore_index=True),
                    pd.concat([y_train, dt['target']] ,ignore_index=True)
)

In [None]:
print(f'Training Size: {X_train.shape[0]}\nTest Size: {X_test.shape[0]:>8}')
print(f'\nTraining Size\n\tPositive||Negative Samples\n\t  {y_train[y_train==1].shape[0]}||{y_train[y_train==0].shape[0]}')
print(f'\nTest Size\n\tPositive||Negative Samples\n\t  {y_test[y_test==1].shape[0]}||{y_test[y_test==0].shape[0]}')

In [None]:
hisia = Pipeline(steps =[
        ('count_verctorizer',  CountVectorizer(ngram_range=(1, 2), 
                                 max_features=150000,
                                 tokenizer=tokenizer, 
                                 stop_words=STOP_WORDS
                                )
        ),
        ('feature_selector', SelectKBest(chi2, k=10000)),
        ('tfidf', TfidfTransformer(sublinear_tf=True)),
        ('logistic_regression', LogisticRegressionCV(cv=5,
                                                    solver='saga',
                                                    scoring='accuracy',
                                                    max_iter=200,
                                                    n_jobs=-1,
                                                    random_state=42,
                                                    verbose=0))
])

In [None]:
%%time
hisia.fit(X_train, y_train)

In [None]:
hisia.score(X_test, y_test)

In [None]:
show_diagram(hisia, X_train, y_train, X_test, y_test, compare_test=True)

In [None]:
feature_names = hisia.named_steps['count_verctorizer'].get_feature_names_out()
best_features = [feature_names[i] for i in hisia.named_steps['feature_selector'].get_support(indices=True)]
predictor =  hisia.named_steps['logistic_regression']

In [None]:
N = 100
print(f"Showing {N} of the model's learned features for negative and positive decisions")
print('_'*70)
print('\n')
show_most_informative_features(best_features, predictor, n=N)

In [None]:
# [negative, positive] probability
hisia.predict_proba(['det er ikke okay!'])

In [None]:
hisia.predict_proba(['det er ikke dårligt!'])

In [None]:
(hisia.predict_proba(['jeg kan lide det!']), 
 hisia.predict_proba(['jeg kan ikke lide det!']),
 hisia.predict_proba(['jeg elsker det!']),
 hisia.predict_proba(['jeg elsker det slet ikke!'])
)

In [None]:
mad_max = ['Jeg er vred på, at jeg ikke fik min pakke :( elsker']

In [None]:
hisia.named_steps['logistic_regression'].random_state

In [None]:
hisia.predict_proba(['']) # model is positive :)

In [None]:
hisia.predict(mad_max)

In [None]:
res = hisia.predict_proba(mad_max)
res

In [None]:
hisia.decision_function(mad_max)

In [None]:
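v = hisia.named_steps['count_verctorizer'].transform(mad_max)
v = hisia.named_steps['feature_selector'].transform(v)
# keep only the selected features that actually occur in the text, so that
# iterating over `v` yields their column indices instead of all k columns
present = v.nonzero()[1]
v = pd.DataFrame.sparse.from_spmatrix(v[:, present], columns=present)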

In [None]:
look_up = {index:(token,coef) for index, coef, token in 
           zip(range(len(best_features)),
               hisia.named_steps['logistic_regression'].coef_[0], 
               best_features)}

In [None]:
{look_up[item] for item in v}

In [None]:
hisia.named_steps['logistic_regression'].intercept_[0]

In [None]:
g = [look_up[item] for item in v]

With decision function $f(\mathbf{x})$, the positive-class probability is $p(\mathbf{x}) = \dfrac{1}{1 + \exp(-f(\mathbf{x}))}$.

In [None]:
1/(1 + np.exp(-(g[0][1] + g[1][1] + hisia.named_steps['logistic_regression'].intercept_[0])))

In [None]:
hisia.decision_function(mad_max)[0]
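
The relation between the decision function and the predicted probability can be checked on a self-contained toy model. This is a sketch under stated assumptions: the numeric data below is made up, and the feature values are used as they are. In the Hisia pipeline the classifier receives tf-idf weighted values, so a sum of raw coefficients matches `decision_function` only when each present feature has the value 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy numeric data (assumption), two features and a binary target
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [0.0, 0.0], [2.0, 0.5], [0.5, 2.0]])
y = np.array([0, 1, 1, 0, 1, 0])

model = LogisticRegression(random_state=42).fit(X, y)

# f(x) = w . x + b, computed by hand from the fitted parameters
scores = X @ model.coef_[0] + model.intercept_[0]
# p(x) = 1 / (1 + exp(-f(x))), the positive-class probability
probabilities = 1.0 / (1.0 + np.exp(-scores))

print(np.allclose(scores, model.decision_function(X)))
print(np.allclose(probabilities, model.predict_proba(X)[:, 1]))
```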

In [None]:
df = pd.DataFrame(res)

In [None]:
df['sentiment'] = np.where(df[0] > .5, 'negative', 'positive')

df.columns = ['negative_probability','positive_probability','sentiment']

Sentiment = namedtuple('Sentiment', ['sentiment','positive_probability', 'negative_probability'])

df

In [None]:
b = Sentiment(**df.round(3).to_dict(orient='index')[0])

In [None]:
b

## Retrainable Model (SGD)

In [None]:
hisia_trainer =Pipeline(steps =[
                ('count_verctorizer',  CountVectorizer(ngram_range=(1, 2), 
                                         max_features=100000,
                                         tokenizer=tokenizer, 
                                        )
                ),
                ('feature_selector', SelectKBest(chi2, k=5000)),
                ('tfidf', TfidfTransformer(sublinear_tf=True)),
                ('modified_hubern', SGDClassifier(loss='modified_huber', 
                                                      random_state=7,
                                                      max_iter=1000))
])

In [None]:
%%time
hisia_trainer.fit(X_train, y_train)

In [None]:
# for partial_fit we have to split the pipeline into transformation and scoring steps
hisia_trainer.score(X_test,y_test)
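
One way such a split could look is sketched below, assuming the transformation steps are fitted once and only the classifier is updated afterwards. The texts, labels and variable names are made up for illustration, the feature selector is left out for brevity, and this is not necessarily how HisiaTrain does it.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# toy batches (assumption): 1 = positive, 0 = negative
first_texts = ['jeg elsker det', 'rigtig god service',
               'jeg hader det', 'meget dårlig service']
first_labels = np.array([1, 1, 0, 0])
new_texts = ['god levering', 'dårlig levering']
new_labels = np.array([1, 0])

# transformation part: fitted once, then reused for every later batch
transformer = Pipeline(steps=[
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer(sublinear_tf=True)),
])
# scoring part: the only piece that is updated incrementally
classifier = SGDClassifier(loss='modified_huber', random_state=7)

X_first = transformer.fit_transform(first_texts)
# the first partial_fit call must be told every class it may ever see
classifier.partial_fit(X_first, first_labels, classes=np.array([0, 1]))

# later batches are only transformed, never used to refit the vocabulary
X_new = transformer.transform(new_texts)
classifier.partial_fit(X_new, new_labels)

predictions = classifier.predict(
    transformer.transform(['god service', 'dårlig service']))
print(predictions)
```

Because the vocabulary is frozen after the first fit, words that appear only in later batches (here 'levering') are ignored by the classifier.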

In [None]:
show_diagram(hisia_trainer, X_train, y_train, X_test, y_test, compare_test=True)

In [None]:
feature_names = hisia_trainer.named_steps['count_verctorizer'].get_feature_names_out()
best_features = [feature_names[i] for i in hisia_trainer.named_steps['feature_selector'].get_support(indices=True)]
predictor =  hisia_trainer.named_steps['modified_hubern']

In [None]:
N = 100
print(f"Showing {N} of the model's learned features for negative and positive decisions")
print('_'*70)
print('\n')
show_most_informative_features(best_features, predictor, n=N)