# Unsupervised Semantic Analysis tutorial

The **SemanticDetector** class predicts a sentiment score for a document or email.

Two inputs are required:
- a list of seed words that characterize a sentiment  
  Example: the seed words ["mad", "furious", "insane"] characterize the sentiment "dissatisfaction"
- a trained embedding (Melusine **Embedding** class instance) to compute distances between words/tokens

Sentiment score prediction takes three steps:
- Instantiate a SemanticDetector object with a list of seed words as argument
- Call the SemanticDetector.fit method (with an embedding object as argument) to compute the lexicons
- Call the SemanticDetector.predict method on a document/email DataFrame to predict the sentiment score

## Minimal working example

In [2]:
import pandas as pd
import numpy as np

# Data
from melusine import load_email_data

# NLP tools
from melusine.nlp_tools.embedding import Embedding
from melusine.nlp_tools.tokenizer import Tokenizer

# Models
from melusine.models.modeler_semantic import SemanticDetector



### Load email data

In [5]:
df_emails_clean = load_email_data(type="full")

### Embedding

In [6]:
# Train an embedding using the tokens in the 'tokens' column
embedding = Embedding(tokens_column='tokens', size=300, min_count=2)
embedding.train(df_emails_clean)

In [7]:
# Print a list of words present in the Embedding vocabulary
list(embedding.embedding.key_to_index.keys())[:3]

['a', 'vehicule', 'flag_date_']

In [8]:
# Test the trained embedding : print most similar words
embedding.embedding.most_similar('client', topn=3)

[('bien', 0.13862471282482147),
 ('ci-joint', 0.13478754460811615),
 ('studio', 0.13079118728637695)]

### Tokenizer

In [9]:
# Tokenize the text in the clean_body column
tokenizer = Tokenizer(input_column='clean_body', stop_removal=True)
df_emails_clean = tokenizer.fit_transform(df_emails_clean)

In [10]:
# Test the tokenizer : print tokens
df_emails_clean['tokens'].head()

0    [client, chez, pouvez, etablir, devis, fils, s...
1    [informe, nouvelle, immatriculation, enfin, fa...
2    [suite, a, conversation, telephonique, flag_da...
3    [fais, suite, a, mail, envoye, bulletin, salai...
4    [voici, ci, joint, bulletin, salaire, comme, d...
Name: tokens, dtype: object

### Instantiate and fit the SemanticDetector

In [11]:
seed_word_list = ['immatriculation']

# Instantiate a SemanticDetector object
semantic_detector = SemanticDetector(base_seed_words=seed_word_list, tokens_column='tokens')

# Fit the SemanticDetector using the trained embedding
semantic_detector.fit(embedding=embedding)

In [12]:
print('List of seed words:')
print(semantic_detector.seed_list)

List of seed words:
['immatriculation']


In [13]:
seed_word = semantic_detector.seed_list[0]
lexicon = semantic_detector.lexicon
sorted_lexicon = dict(sorted(lexicon.items(), key = lambda x: x[0]))

print(f'(Part of) Lexicon associated with the seed words "{", ".join(semantic_detector.seed_list)}":')
for word, sentiment_score in list(sorted_lexicon.items())[:10]:
    print('  ' + word + ' : ' + str(sentiment_score))

(Part of) Lexicon associated with the seed words "immatriculation":
  00_rue : 0.047417573630809784
  1 : 0.013722526840865612
  2 : 0.10784198343753815
  a : 0.10627153515815735
  adresse : 0.03664073720574379
  afin : -0.038984864950180054
  ainsi : 0.003485708264634013
  assurance : 0.021390274167060852
  assurer : 0.05217312276363373
  attached : 0.04214974865317345


### Predict and print the sentiment score

**Warning:** In this example, the embedding is trained on a corpus of 40 emails, which is far too small to yield meaningful results.

In [14]:
# Choose the name of the column returned (default is "score")
return_column = "semantic_score"

# Predict the sentiment score on each email of the DataFrame
df_emails_clean = semantic_detector.predict(df_emails_clean, return_column=return_column)

# Print emails with the maximum sentiment score
df_emails_clean.sort_values(by=return_column, ascending=False).head()

Unnamed: 0,body,header,date,from,to,attachment,sexe,age,label,is_begin_by_transfer,...,stemmed_tokens,lemma_spacy_sm,lemma_lefff,extension,hour,min,dayofweek,attachment_type,keywords,semantic_score
21,"\n \n \n \n Bonjour, \n \n Pourriez vous ...",Fwd: Changement de vehicule,2018-02-06 11:07:00,monsieurdupont@extensionf.net,Societe Imaginaire <region@Societeimaginaire.fr>,[],M,64,vehicule,True,...,"[pourr, fair, suit, mail, suiv, dat, flag_date_]",pourriez vous faire suite au mail suivre en da...,pouvoir vous faire suite au mail suivant en da...,6,11,7,1,[4],"[pourriez, suite, suivant, date]",0.077974
36,\n \n \n Bonjour \n \n Je m'aperçois ce jo...,prélèvements bancaires,2018-06-07 15:16:00,Monsieur Dupont <monsieurdupont@extensiona.com>,demandes@societeimaginaire.fr,[],F,19,modification,False,...,"[apercois, jour, ete, preleve, plusieur, fois,...",je me apercois ce jour que je avoir ete prelev...,je clr apercois ce jour que j' avoir ete prele...,1,15,16,3,[4],"[fois, compte, bancaire, vehicule]",0.075932
11,"\n \n \n Bonjour, \n \n Suite à notre entr...",Numéro de téléphone,2018-05-31 12:44:00,monsieurdupont@extensionf.net,demandes@societeimaginaire.fr,[],M,23,modification,False,...,"[suite_, entretien_telephon, jour, join, numer...",suite avoir notre entretien telephoniqu de ce ...,suite avoir son entretien telephonique de ce j...,6,12,44,3,[4],"[numero, telephone, fils, flag_phone_]",0.070483
16,"\n \n \n \n Bonjour madame, \n Suite à not...",certificat de cession de véhicule,2018-06-04 15:39:00,Monsieur Dupont <monsieurdupont@extensionb.com>,demandes@societeimaginaire.fr,[Numériser.pdf],M,57,resiliation,False,...,"[suite_, entretien_telephon, jour, join, scan,...",suite avoir notre entretien telephoniqu de ce ...,suite avoir son entretien telephonique de ce j...,2,15,39,0,[5],"[entretien_telephonique, joins, certificat_ces...",0.070483
8,"\n \n \n Bonjour, \n \n Voici la copie du ...",Re: Virement,2018-05-31 17:10:00,Monsieur Dupont <monsieurdupont@extensione.com>,demandes@societeimaginaire.fr,[pj.pdf],M,38,autres,False,...,"[voic, cop, vir, effectu, a, jour, serait-il_p...",voici le copie de virement effectuer a ce jour...,voici le copie du virement effectuer avoir ce ...,5,17,10,3,[5],"[voici, copie, virement, attestation]",0.067231


## The SemanticDetector class

The SemanticDetector class provides an unsupervised method for assigning a sentiment score to each document/email in a corpus. The method works as follows:

1. **Define a list of seed words that characterize a sentiment**
    - Take a list of seed words as input
    - If the `extend_seed_word_list` parameter is set to True: extend the list of seed words with words sharing the same root (dance -> ["dancing", "dancer"])  
    
    
2. **Fit the model (= create a lexicon to assign a score to every word in the vocabulary)**
    - Create a lexicon for each seed word by computing the cosine similarity between the seed word and every word in the vocabulary.
    - Aggregate the similarity scores obtained for the different seed words into a single lexicon.
    - (Computing cosine similarities requires a trained embedding.)  
    
3. **Predict a sentiment score for emails/documents**
    - Filter out the tokens in the document that are not in the vocabulary.
    - For each remaining token, look up its sentiment score in the lexicon.
    - For each email, aggregate the scores across its tokens.
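
The three steps above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not Melusine's implementation: the toy vectors and the `cosine_similarity`, `build_lexicon` and `score_document` helpers are hypothetical, and the max and 60th-percentile aggregations simply mirror the defaults described in this tutorial.

```python
import numpy as np

# Toy embedding: hypothetical 3-dimensional word vectors, for illustration only
toy_vectors = {
    "furious": np.array([0.9, 0.1, 0.0]),
    "mad": np.array([0.8, 0.2, 0.1]),
    "angry": np.array([0.85, 0.15, 0.05]),
    "invoice": np.array([0.0, 0.9, 0.4]),
    "hello": np.array([0.1, 0.3, 0.9]),
}


def cosine_similarity(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def build_lexicon(seed_words, vectors, aggregate=np.max):
    """Step 2: score every word against each seed, then aggregate seed-wise."""
    lexicon = {}
    for word, vector in vectors.items():
        similarities = [cosine_similarity(vector, vectors[seed]) for seed in seed_words]
        lexicon[word] = float(aggregate(similarities))
    return lexicon


def score_document(tokens, lexicon, aggregate=lambda scores: np.percentile(scores, 60)):
    """Step 3: drop out-of-vocabulary tokens, then aggregate the token scores."""
    scores = [lexicon[token] for token in tokens if token in lexicon]
    if not scores:
        return float("nan")  # Assumption: an email with no known token gets no score
    return float(aggregate(scores))


# Step 1: seed words; Step 2: lexicon; Step 3: document scores
toy_lexicon = build_lexicon(["mad", "furious"], toy_vectors)
angry_score = score_document(["hello", "angry", "furious", "unknown"], toy_lexicon)
neutral_score = score_document(["hello", "invoice"], toy_lexicon)
print(angry_score, neutral_score)
```

In this toy setup the document containing "angry" and "furious" scores higher than the neutral one, which is the behaviour the lexicon is meant to capture.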

The arguments of a SemanticDetector object are:
    
- **base_seed_words :** the list of seed words that characterize a sentiment/theme
- **base_anti_seed_words :** the list of seed words that characterize undesired sentiments/themes
- **anti_weight :** the weight of anti seeds in the computation of the semantic score
- **tokens_column :** name of the column in the input DataFrame that contains tokens
- **extend_seed_word_list :** if True: complement seed words with words sharing the same root (dance -> ["dancing", "dancer"]). Default value False.
- **normalize_scores :** if True: normalize the lexicon scores of each word. Default value False.
- **aggregation_function_seed_wise :** function to aggregate the scores associated with a token across the different seeds. Default function is a max.
- **aggregation_function_email_wise :** function to aggregate the scores associated with the different tokens in an email. Default function is the 60th percentile.
- **n_jobs :** the number of cores used for computation. Default value 1.

## Filter out undesired themes using "anti seed words" and "anti_weight"

If you want to detect emergencies in your emails, you could use the seed word `"emergency"`.  
* "I need an answer, this is an emergency !!" => Semantic score = 0.98   

But you might also pick up unwanted sentences such as:
* "Yesterday I tested the emergency brake of my car" => Semantic score = 0.95  

You can prevent the detection of undesired themes using anti seed words:  
* `base_anti_seed_words = ['brake']`
* "Yesterday I tested the emergency brake of my car" => Semantic score = 0.50  

You can control the contribution of anti seed words using `anti_weight` (default 0.3):  
* `base_anti_seed_words = ['brake']`
* `anti_weight = 0.6`
* "Yesterday I tested the emergency brake of my car" => Semantic score = 0.30  

The formula used to compute the semantic score is:  
* semantic score = seed_word_contrib - anti_weight * anti_seed_word_contrib  

Warning: an `anti_weight` above 1 means that anti seeds contribute more (negatively) than regular seeds.
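
The formula can be tried out on a couple of numbers. The sketch below is a direct transcription of it; the `semantic_score` helper and the contribution values are hypothetical and do not come from a trained model.

```python
def semantic_score(seed_contrib, anti_seed_contrib, anti_weight=0.3):
    """semantic score = seed_word_contrib - anti_weight * anti_seed_word_contrib"""
    return seed_contrib - anti_weight * anti_seed_contrib


# Hypothetical contributions for a single email (illustrative values only)
seed_contrib = 0.8
anti_seed_contrib = 0.5

# A larger anti_weight pulls the score down further;
# above 1, the anti seeds weigh more than the regular seeds
for weight in (0.0, 0.3, 0.6, 1.5):
    print(weight, round(semantic_score(seed_contrib, anti_seed_contrib, weight), 2))
```

With `anti_weight=0.0` the anti seeds are ignored and the score equals the seed contribution.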

In [15]:
seed_word_list = ['immatriculation']
anti_seed_word_list = ['demandes']


# Instantiate SemanticDetector objects
regular_semantic_detector = SemanticDetector(base_seed_words=seed_word_list, tokens_column='tokens')
semantic_detector_with_anti = SemanticDetector(base_seed_words=seed_word_list, tokens_column='tokens',
                                               base_anti_seed_words=anti_seed_word_list)
semantic_detector_with_anti2 = SemanticDetector(base_seed_words=seed_word_list, tokens_column='tokens',
                                                base_anti_seed_words=anti_seed_word_list, anti_weight=0.5)


# Fit the SemanticDetectors using the trained embedding
regular_semantic_detector.fit(embedding=embedding)
semantic_detector_with_anti.fit(embedding=embedding)
semantic_detector_with_anti2.fit(embedding=embedding)

In [16]:
# Choose the name of the column returned (default is "score")
return_column1 = "semantic_score"
return_column2 = "semantic_score with anti (anti_weight=0.3)"
return_column3 = "semantic_score with anti (anti_weight=0.5)"



# Predict the sentiment score on each email of the DataFrame
df_emails_clean = regular_semantic_detector.predict(df_emails_clean, return_column=return_column1)
df_emails_clean = semantic_detector_with_anti.predict(df_emails_clean, return_column=return_column2)
df_emails_clean = semantic_detector_with_anti2.predict(df_emails_clean, return_column=return_column3)


# Print emails with the maximum sentiment score
df_emails_clean.sort_values(by=return_column1, ascending=False).head()

Unnamed: 0,body,header,date,from,to,attachment,sexe,age,label,is_begin_by_transfer,...,lemma_lefff,extension,hour,min,dayofweek,attachment_type,keywords,semantic_score,semantic_score with anti (anti_weight=0.3),semantic_score with anti (anti_weight=0.5)
21,"\n \n \n \n Bonjour, \n \n Pourriez vous ...",Fwd: Changement de vehicule,2018-02-06 11:07:00,monsieurdupont@extensionf.net,Societe Imaginaire <region@Societeimaginaire.fr>,[],M,64,vehicule,True,...,pouvoir vous faire suite au mail suivant en da...,6,11,7,1,[4],"[pourriez, suite, suivant, date]",0.077974,0.058752,0.056253
36,\n \n \n Bonjour \n \n Je m'aperçois ce jo...,prélèvements bancaires,2018-06-07 15:16:00,Monsieur Dupont <monsieurdupont@extensiona.com>,demandes@societeimaginaire.fr,[],F,19,modification,False,...,je clr apercois ce jour que j' avoir ete prele...,1,15,16,3,[4],"[fois, compte, bancaire, vehicule]",0.075932,0.072317,0.073576
11,"\n \n \n Bonjour, \n \n Suite à notre entr...",Numéro de téléphone,2018-05-31 12:44:00,monsieurdupont@extensionf.net,demandes@societeimaginaire.fr,[],M,23,modification,False,...,suite avoir son entretien telephonique de ce j...,6,12,44,3,[4],"[numero, telephone, fils, flag_phone_]",0.070483,0.051574,0.047799
16,"\n \n \n \n Bonjour madame, \n Suite à not...",certificat de cession de véhicule,2018-06-04 15:39:00,Monsieur Dupont <monsieurdupont@extensionb.com>,demandes@societeimaginaire.fr,[Numériser.pdf],M,57,resiliation,False,...,suite avoir son entretien telephonique de ce j...,2,15,39,0,[5],"[entretien_telephonique, joins, certificat_ces...",0.070483,0.08984,0.09682
8,"\n \n \n Bonjour, \n \n Voici la copie du ...",Re: Virement,2018-05-31 17:10:00,Monsieur Dupont <monsieurdupont@extensione.com>,demandes@societeimaginaire.fr,[pj.pdf],M,38,autres,False,...,voici le copie du virement effectuer avoir ce ...,5,17,10,3,[5],"[voici, copie, virement, attestation]",0.067231,0.055122,0.052213


## Find extra seed words with the `extend_seed_word_list` parameter

The SemanticDetector `extend_seed_word_list` parameter activates the search for extra seed words sharing the same root as the base seed words.  

For example, if "dance" is a base seed word, `extend_seed_word_list` loops through the words in the embedding vocabulary and finds new seed words such as "dancer" and "dancing".

In [17]:
# Instantiate a SemanticDetector object
semantic_detector_extended_seed = SemanticDetector(
    base_seed_words=['tel', 'assur'], tokens_column='tokens', extend_seed_word_list=True)

# Fit the SemanticDetector using the trained embedding
semantic_detector_extended_seed.fit(embedding=embedding)

In [18]:
# Print the extended list of seed words
print(semantic_detector_extended_seed.seed_dict)
print(semantic_detector_extended_seed.seed_list)

{'tel': ['telephone', 'tel'], 'assur': ['assurance', 'assurer']}
['telephone', 'tel', 'assurance', 'assurer']
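
The exact rule Melusine uses to decide that two words share a root is not stated in this tutorial. As a rough mental model, the sketch below assumes a simple prefix match over the vocabulary, which is consistent with the output above; the `extend_seed_words` helper and the vocabulary are hypothetical.

```python
def extend_seed_words(base_seed_words, vocabulary):
    """Map each base seed to the vocabulary words starting with it (assumed rule)."""
    return {
        seed: [word for word in vocabulary if word.startswith(seed)]
        for seed in base_seed_words
    }


# Hypothetical vocabulary, standing in for the embedding vocabulary
vocabulary = ["telephone", "client", "tel", "assurance", "vehicule", "assurer"]

seed_dict = extend_seed_words(["tel", "assur"], vocabulary)
# Flatten the dictionary into a single list of seed words
seed_list = [word for words in seed_dict.values() for word in words]

print(seed_dict)
print(seed_list)
```

A prefix match is crude: a short seed such as "tel" would also match unrelated words starting with the same letters, so short roots should be chosen with care.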


## Use a custom function to aggregate lexicon scores

### Aggregate token score over seeds
The SemanticDetector computes a similarity between a word and every seed word.  
An aggregation function is then used to keep a single score for each token.  

Example: 
- Seed word list: ["horse", "animal"]
- Embedding: simulated

Lexicon "horse" :
{
  "apple" : 0.2,
  ...
  "hello" : 0.1,
  ...
  "ponies" : 8.8,
  ...
  "zebra" : 1.2
}  
Lexicon "animal" :
{
  "apple" : 0.1,
  ...
  "hello" : 0.3,
  ...
  "ponies" : 4.8,
  ...
  "zebra" : 6.2
}

**Aggregated Lexicon :**  
{
  "apple" : 0.2,
  ...
  "hello" : 0.3,
  ...
  "ponies" : 8.8,
  ...
  "zebra" : 6.2
}
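
The aggregated lexicon above can be reproduced in a few lines. This is only a sketch of the idea with the simulated values from the example; the `aggregate_lexicons` helper is hypothetical and is not part of Melusine.

```python
import numpy as np

# Simulated per-seed lexicons from the example (elided words left out)
lexicon_horse = {"apple": 0.2, "hello": 0.1, "ponies": 8.8, "zebra": 1.2}
lexicon_animal = {"apple": 0.1, "hello": 0.3, "ponies": 4.8, "zebra": 6.2}


def aggregate_lexicons(lexicons, aggregation_function=np.max):
    """Keep a single score per word by aggregating its scores across seeds."""
    words = lexicons[0].keys()
    return {
        word: float(aggregation_function([lexicon[word] for lexicon in lexicons]))
        for word in words
    }


aggregated_lexicon = aggregate_lexicons([lexicon_horse, lexicon_animal])
print(aggregated_lexicon)
```

Passing `np.mean` instead of `np.max` would reward words that are close to all seeds rather than to at least one of them.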

### Aggregate semantic score over tokens
When evaluating an email, each word in the email has an associated score.  
An aggregation function is used to keep a single score for each email.  

Example: 
- Sentence: "Hello, I like ponies"
- Seed word list: ["horse", "animal"]
- Embedding: simulated

**Sentence score (using a sum as the email-wise aggregation function):**  
- score = score(Hello) + score(I) + score(like) + score(ponies)
- score = 0.3 + 0.1 + 0.2 + 8.8 = 9.4


The semantic score for the email is thus 9.4

### Default aggregation functions

The default aggregation method is the following:  
- Seed-wise aggregation: for a token, take the max score across seeds
  - Example:  
    ponies_score = 8.8 (lexicon "horse")   
    ponies_score = 4.8 (lexicon "animal")  
    => Score for the "ponies" token = np.max([8.8, 4.8]) = 8.8  
  
  
- Email-wise aggregation: given a list of token scores, take the 60th percentile as the sentiment_score for the email
  - Example:  
    token_score_list : [0.3 (hello), 0.3 (i), 0.2 (like), 8.8 (ponies)]  
    => sentiment_score = np.percentile([0.3, 0.3, 0.2, 8.8], 60) = 0.3  
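
Both default aggregations can be checked directly with NumPy. The snippet below only uses the simulated numbers from the examples; the variable names are illustrative. Note that `np.max` expects a sequence, so the two scores are passed as a list.

```python
import numpy as np

# Seed-wise aggregation: max of the scores of "ponies" across the two seeds
seed_scores_for_ponies = [8.8, 4.8]
ponies_score = np.max(seed_scores_for_ponies)

# Email-wise aggregation: 60th percentile of the token scores
token_scores = [0.3, 0.3, 0.2, 8.8]
email_score = np.percentile(token_scores, 60)

# For comparison: a sum is dominated by the single high-scoring token
email_sum = np.sum(token_scores)

print(ponies_score, email_score, email_sum)
```

The percentile is far less sensitive to one very high token score than the sum, which is one reason it can be a reasonable default for emails of varying length.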

In [19]:
# Instantiate a SemanticDetector object with custom aggregation functions:
# - A mean for the seed-wise aggregation
# - A 95th percentile for the email-wise aggregation

def aggregation_mean(x):
    return np.mean(x, axis=0)

def aggregation_percentile_95(x):
    return np.percentile(x, 95)

semantic_detector_custom_aggregation = SemanticDetector(
    base_seed_words=['client'], 
    tokens_column='tokens', 
    aggregation_function_seed_wise=aggregation_mean,
    aggregation_function_email_wise=aggregation_percentile_95
)

# Fit the SemanticDetector using the trained embedding
semantic_detector_custom_aggregation.fit(embedding=embedding)

# Predict the sentiment score on each email of the DataFrame
df_emails_clean_custom_aggregation = semantic_detector_custom_aggregation.predict(df_emails_clean)
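
The `axis=0` in `aggregation_mean` suggests that the seed-wise function receives the scores of all seeds at once and reduces them over the seed dimension. The sketch below assumes an array with one row per seed and one column per vocabulary word; that layout is an assumption made for illustration and is not confirmed by this tutorial.

```python
import numpy as np

# Hypothetical score matrix: one row per seed, one column per vocabulary word
vocabulary = ["apple", "hello", "ponies", "zebra"]
seed_scores = np.array([
    [0.2, 0.1, 8.8, 1.2],  # scores for the seed "horse"
    [0.1, 0.3, 4.8, 6.2],  # scores for the seed "animal"
])


def aggregation_mean(x):
    # Reduce over the seed dimension: one value per vocabulary word
    return np.mean(x, axis=0)


mean_lexicon = dict(zip(vocabulary, aggregation_mean(seed_scores)))
max_lexicon = dict(zip(vocabulary, np.max(seed_scores, axis=0)))

print(mean_lexicon)
print(max_lexicon)
```

Under this assumption, a custom seed-wise function has to return one value per word, which is why a plain `np.mean(x)` without `axis=0` would not be suitable.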

## Multiprocessing

In [21]:
semantic_detector_multiprocessing = SemanticDetector(
    base_seed_words=['client'],
    tokens_column='tokens',
    n_jobs=2
)

# Fit the SemanticDetector using the trained embedding
semantic_detector_multiprocessing.fit(embedding=embedding)

# Predict the sentiment score on each email of the DataFrame
df_emails_multiprocessing = semantic_detector_multiprocessing.predict(df_emails_clean)

In [22]:
df_emails_clean

Unnamed: 0,body,header,date,from,to,attachment,sexe,age,label,is_begin_by_transfer,...,extension,hour,min,dayofweek,attachment_type,keywords,semantic_score,semantic_score with anti (anti_weight=0.3),semantic_score with anti (anti_weight=0.5),score
0,\n \n \n \n Bonjour \n Je suis client chez...,Devis habitation,2018-05-24 11:36:00,Dupont <monsieurdupont@extensiona.com>,conseiller@Societeimaginaire.fr,[],F,35,habitation,True,...,1,11,36,3,[4],"[pouvez, fils, suivant, flag_cp_]",0.031198,0.055655,0.076051,-0.00083
1,"\n \n \n \n Bonsoir madame, \n \n Je vous...",Immatriculation voiture,2018-05-24 19:37:00,Dupont <monsieurdupont@extensiona.com>,conseiller@Societeimaginaire.fr,[pj.pdf],M,32,vehicule,True,...,1,19,37,3,[5],"[nouvelle, immatriculation, prie_trouver, faire]",0.037342,0.048318,0.06602,0.013508
2,"\n \n \n Bonjours, \n \n Suite a notre con...",Re: Envoi d'un document de la Société Imaginaire,2018-05-25 06:45:00,Monsieur Dupont <monsieurdupont@extensiona.com>,demandes@societeimaginaire.fr,[],M,66,compte,False,...,1,6,45,4,[4],"[conversation, telephonique, pourriez, afin]",0.067073,0.05848,0.056253,0.018664
3,"\n \n \n \n \n Bonjour, \n \n \n Je fai...",Re: Votre adhésion à la Société Imaginaire,2018-05-25 10:15:00,Monsieur Dupont <monsieurdupont@extensiond.com>,demandes@societeimaginaire.fr,[fichedepaie.png],M,50,adhesion,False,...,4,10,15,4,[6],"[fais, mail, bulletin_salaire, trouverez]",0.055584,0.03993,0.029495,0.065979
4,"\n \n \n Bonjour, \n Voici ci joint mon bul...",Bulletin de salaire,2018-05-25 17:30:00,Monsieur Dupont <monsieurdupont@extensiona.com>,demandes@societeimaginaire.fr,[pj.pdf],M,15,adhesion,False,...,1,17,30,4,[5],"[ci, joint, bulletin_salaire, comme]",-0.005475,0.017324,0.032524,0.041188
5,"Madame, Monsieur, \n \n Je vous avais contact...",Modification et extension de ma maison,2018-05-31 10:28:00,Monsieur Dupont <monsieurdupont@extensiona.com>,demandes@societeimaginaire.fr,[],F,22,habitation,False,...,1,10,28,3,[4],"[projet, suite, les, flag_name]",0.058675,0.050536,0.057572,0.035437
6,"\n \n \n \n Bonjour, \n \n J'emménage dan...",Assurance d'un nouveau logement,2018-05-30 15:56:00,Dupont <monsieurdupont@extensiona.com>,conseiller@Societeimaginaire.fr,[pj.pdf],F,28,resiliation,True,...,1,15,56,2,[5],"[nouveau, studio, afin, pouvoir]",0.052345,0.038161,0.039802,0.032487
7,"\n \n \n \n \n Bonjour, \n \n \n \n Je...",Assurance véhicules,2018-05-31 14:02:00,Monsieur Dupont <monsieurdupont@extensiona.com>,demandes@societeimaginaire.fr,[image001.png],M,39,vehicule,False,...,1,14,2,3,[6],"[dassurance, vehicule, merci, retour]",0.027087,0.042275,0.041786,0.028614
8,"\n \n \n Bonjour, \n \n Voici la copie du ...",Re: Virement,2018-05-31 17:10:00,Monsieur Dupont <monsieurdupont@extensione.com>,demandes@societeimaginaire.fr,[pj.pdf],M,38,autres,False,...,5,17,10,3,[5],"[voici, copie, virement, attestation]",0.067231,0.055122,0.052213,0.035498
9,\n \n \n \n \n \n \n \n BONJOUR \n \n...,Prêt véhicule,2018-05-31 08:54:00,Monsieur Dupont <monsieurdupont@extensionb.com>,demandes@societeimaginaire.fr,[pj.pdf],M,30,vehicule,False,...,2,8,54,3,[5],"[pret, vehicule]",-0.005622,-0.006581,-0.007219,0.055332
