# Unsupervised Semantic Analysis tutorial

The **SemanticDetector** class is used to predict a sentiment score for a document or email.

For that purpose, two inputs are required:
- a list of seed words that characterize a sentiment  
  Example: the seed words ["mad", "furious", "insane"] characterize the sentiment "dissatisfaction"
- a trained embedding (Melusine **Embedding** class instance) to compute distances between words/tokens

The three steps for sentiment score prediction are the following:
- Instantiate a SemanticDetector object with a list of seed words as argument
- Use the SemanticDetector.fit method (with an embedding object as argument) to compute the lexicons
- Use the SemanticDetector.predict method on a document/email DataFrame to predict the sentiment score

## Minimal working example

In [1]:
import pandas as pd
import numpy as np

# NLP tools
from melusine.nlp_tools.embedding import Embedding
from melusine.nlp_tools.tokenizer import Tokenizer

# Models
from melusine.models.modeler_semantic import SemanticDetector

### Load email data

In [2]:
df_emails_clean = pd.read_csv('./data/emails_preprocessed.csv', encoding='utf-8', sep=';')
df_emails_clean = df_emails_clean[['clean_body']]
df_emails_clean = df_emails_clean.astype(str)

### Embedding

In [3]:
# Train an embedding using the text data in the 'clean_body' column
embedding = Embedding(input_column='clean_body', vector_size=300, min_count=2)
embedding.train(df_emails_clean)

21/05 11:28 - melusine.nlp_tools.embedding - INFO - Start training for embedding
21/05 11:28 - melusine.nlp_tools.embedding - INFO - Done.


In [4]:
# Print a list of words present in the Embedding vocabulary
list(embedding.embedding.wv.key_to_index.keys())[:3]

['a', 'vehicule', 'flag_date_']

In [5]:
# Test the trained embedding : print most similar words
embedding.embedding.wv.most_similar('client', topn=3)

[('pouvoir', 0.14628919959068298),
 ('merci', 0.11492132395505905),
 ('studio', 0.1143675297498703)]

### Tokenizer

In [6]:
# Tokenize the text in the clean_body column
tokenizer = Tokenizer(input_column='clean_body', stop_removal=True, n_jobs=20)
df_emails_clean = tokenizer.fit_transform(df_emails_clean)

In [7]:
# Test the tokenizer : print tokens
df_emails_clean['tokens'].head()

0    [client, chez, pouvez, etablir, devis, fils, s...
1    [informe, nouvelle, immatriculation, enfin, fa...
2    [suite, a, conversation, telephonique, flag_da...
3    [fais, suite, a, mail, envoye, bulletin, salai...
4    [voici, ci, joint, bulletin, salaire, comme, d...
Name: tokens, dtype: object

### Instantiate and fit the SemanticDetector

In [8]:
seed_word_list = ['immatriculation']

# Instantiate a SemanticDetector object
semantic_detector = SemanticDetector(base_seed_words=seed_word_list, tokens_column='tokens')

# Fit the SemanticDetector using the trained embedding
semantic_detector.fit(embedding=embedding)

In [9]:
print('List of seed words:')
print(semantic_detector.seed_list)

List of seed words:
['immatriculation']


In [10]:
seed_word = semantic_detector.seed_list[0]
lexicon = semantic_detector.lexicon
sorted_lexicon = dict(sorted(lexicon.items(), key = lambda x: x[0]))

print(f'(Part of) Lexicon associated with the seed words "{", ".join(semantic_detector.seed_list)}":')
for word, sentiment_score in list(sorted_lexicon.items())[:10]:
    print('  ' + word + ' : ' + str(sentiment_score))

(Part of) Lexicon associated with the seed words "immatriculation":
  00 : -0.1417384296655655
  1 : -0.010289491154253483
  2 : -0.019492965191602707
  a : 0.0005874728667549789
  adresse : 0.0450427271425724
  afin : 0.08064097911119461
  ainsi : 0.08764959126710892
  assurance : -0.03274295851588249
  assurer : 0.02226885035634041
  attached : -0.05434000864624977


### Predict and print the sentiment score

**Warning:** In this example, the embedding is trained on a corpus of 40 emails, which is far too small to yield meaningful results.

In [11]:
# Choose the name of the column returned (default is "score")
return_column = "semantic_score"

# Predict the sentiment score on each email of the DataFrame
df_emails_clean = semantic_detector.predict(df_emails_clean, return_column=return_column)

# Print emails with the maximum sentiment score
df_emails_clean.sort_values(by=return_column, ascending=False).head()

| | clean_body | tokens | semantic_score |
|---|---|---|---|
| 1 | je vous informe que la nouvelle immatriculatio... | [informe, nouvelle, immatriculation, enfin, fa... | 0.072973 |
| 2 | suite a notre conversation telephonique de fl... | [suite, a, conversation, telephonique, flag_da... | 0.065651 |
| 4 | voici ci joint mon bulletin de salaire comme d... | [voici, ci, joint, bulletin, salaire, comme, d... | 0.0557 |
| 26 | ci-joint le rib du compte comme demande | [ci-joint, rib, compte, comme, demande] | 0.045614 |
| 9 | ci-joint pret vehicule | [ci-joint, pret, vehicule] | 0.044213 |


## The SemanticDetector class

The SemanticDetector class provides an unsupervised methodology to assign a sentiment score to a corpus of documents/emails. The methodology used to predict a sentiment score using the SemanticDetector is described below:

1. **Define a list of seed words that characterize a sentiment**
    - Take a list of seed words as input
    - If the `extend_seed_word_list` parameter is set to True: extend the list of seed words with words sharing the same root (dance -> ["dancing", "dancer"])  
    
    
2. **Fit the model (= create a lexicon to assign a score for every word in the vocabulary)**
    - Create a lexicon for each seed word by computing the cosine similarity between the seed word and every word in the vocabulary.
    - Aggregate the similarity score obtained for the different seed words in a unique lexicon
    - (To compute cosine similarities, a trained embedding is required.)  
    
3. **Predict a sentiment score for emails/documents**
    - Filter out the tokens in the document that are not in the vocabulary.
    - For each remaining token, compute its sentiment score using the lexicon.
    - For each email, aggregate the scores across the different tokens.
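
Step 2 can be illustrated with a minimal sketch. It is not the Melusine implementation: the three-dimensional vectors are made-up values standing in for a trained embedding, and `cosine_similarity` and `build_lexicon` are hypothetical helper names introduced only for this illustration.

```python
import numpy as np

# Toy 3-dimensional word vectors (made-up values, not a trained embedding)
toy_vectors = {
    "horse": np.array([1.0, 0.0, 0.0]),
    "pony": np.array([0.9, 0.1, 0.0]),
    "apple": np.array([0.0, 1.0, 0.0]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_lexicon(seed_words, vectors, aggregate=max):
    """Score each vocabulary word against every seed, then aggregate seed-wise."""
    lexicon = {}
    for word, vector in vectors.items():
        per_seed_scores = [cosine_similarity(vector, vectors[seed]) for seed in seed_words]
        lexicon[word] = aggregate(per_seed_scores)
    return lexicon

toy_lexicon = build_lexicon(["horse"], toy_vectors)
for word, score in toy_lexicon.items():
    print(f"{word}: {score:.3f}")
```

In this toy setting, "pony" scores close to 1 because its vector nearly points in the same direction as "horse", while "apple" scores 0 because its vector is orthogonal to the seed.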

The arguments of a SemanticDetector object are:

- **base_seed_words:** the list of seed words that characterize a sentiment/theme
- **base_anti_seed_words:** the list of seed words that characterize undesired sentiments/themes
- **anti_weight:** the weight of anti seed words in the computation of the semantic score
- **tokens_column:** name of the column in the input DataFrame that contains tokens
- **extend_seed_word_list:** if True, complement seed words with words sharing the same root (dance -> ["dancing", "dancer"]). Default value False.
- **normalize_scores:** if True, normalize the lexicon scores of each word. Default value False.
- **aggregation_function_seed_wise:** function used to aggregate the scores associated with a token across the different seeds. Default function is a max.
- **aggregation_function_email_wise:** function used to aggregate the scores associated with the different tokens in an email. Default function is the 60th percentile.
- **n_jobs:** the number of cores used for computation. Default value 1.

## Filter out undesired themes using "anti seed words" and `anti_weight`

If you want to detect emergencies in your emails, you could use the seed word `"emergency"`.  
* "I need an answer, this is an emergency !!" => Semantic score = 0.98   

But you might detect undesired sentences such as:
* "Yesterday I tested the emergency brake of my car" => Semantic score = 0.95  

You can prevent the detection of undesired themes using anti seed words:  
* `base_anti_seed_words = ['brake']`
* "Yesterday I tested the emergency brake of my car" => Semantic score = 0.50  

You can control the contribution of anti seed words using the `anti_weight` parameter (default 0.3):  
* `base_anti_seed_words = ['brake']`
* `anti_weight = 0.6`
* "Yesterday I tested the emergency brake of my car" => Semantic score = 0.30  

The formula used to compute the semantic score is:  
* semantic score = seed_word_contrib - anti_weight * anti_seed_word_contrib  

**Warning:** an `anti_weight` above 1 means anti seed words contribute more (negatively) than regular seed words.
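
The formula above can be tried out on a single token with a short sketch. The contribution values are made up for illustration, and `semantic_score` is a hypothetical helper, not a Melusine function; how the library combines the contributions internally before aggregating over an email is not shown here.

```python
def semantic_score(seed_contrib, anti_seed_contrib, anti_weight=0.3):
    """semantic score = seed_word_contrib - anti_weight * anti_seed_word_contrib"""
    return seed_contrib - anti_weight * anti_seed_contrib

# Made-up contributions for one token
seed_contrib = 0.9
anti_seed_contrib = 0.5

# Larger anti_weight values pull the score down further
for anti_weight in (0.0, 0.3, 0.6, 2.0):
    score = semantic_score(seed_contrib, anti_seed_contrib, anti_weight)
    print(f"anti_weight={anti_weight}: score={score:.2f}")
```

With these numbers, an `anti_weight` of 2.0 makes the score negative even though the seed contribution is larger than the anti seed contribution, which is the situation the warning describes.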

In [12]:
seed_word_list = ['immatriculation']
anti_seed_word_list = ['demandes']


# Instantiate SemanticDetector objects
regular_semantic_detector = SemanticDetector(base_seed_words=seed_word_list, tokens_column='tokens')
semantic_detector_with_anti = SemanticDetector(base_seed_words=seed_word_list, tokens_column='tokens',
                                               base_anti_seed_words=anti_seed_word_list)
semantic_detector_with_anti2 = SemanticDetector(base_seed_words=seed_word_list, tokens_column='tokens',
                                                base_anti_seed_words=anti_seed_word_list, anti_weight=0.5)


# Fit the SemanticDetectors using the trained embedding
regular_semantic_detector.fit(embedding=embedding)
semantic_detector_with_anti.fit(embedding=embedding)
semantic_detector_with_anti2.fit(embedding=embedding)

In [13]:
# Choose the name of the column returned (default is "score")
return_column1 = "semantic_score"
return_column2 = "semantic_score with anti (anti_weight=0.3)"
return_column3 = "semantic_score with anti (anti_weight=0.5)"



# Predict the sentiment score on each email of the DataFrame
df_emails_clean = regular_semantic_detector.predict(df_emails_clean, return_column=return_column1)
df_emails_clean = semantic_detector_with_anti.predict(df_emails_clean, return_column=return_column2)
df_emails_clean = semantic_detector_with_anti2.predict(df_emails_clean, return_column=return_column3)


# Print emails with the maximum sentiment score
df_emails_clean.sort_values(by=return_column1, ascending=False).head()

| | clean_body | tokens | semantic_score | semantic_score with anti (anti_weight=0.3) | semantic_score with anti (anti_weight=0.5) |
|---|---|---|---|---|---|
| 1 | je vous informe que la nouvelle immatriculatio... | [informe, nouvelle, immatriculation, enfin, fa... | 0.072973 | 0.052712 | 0.03622 |
| 2 | suite a notre conversation telephonique de fl... | [suite, a, conversation, telephonique, flag_da... | 0.065651 | 0.076134 | 0.075565 |
| 4 | voici ci joint mon bulletin de salaire comme d... | [voici, ci, joint, bulletin, salaire, comme, d... | 0.0557 | 0.056599 | 0.057199 |
| 26 | ci-joint le rib du compte comme demande | [ci-joint, rib, compte, comme, demande] | 0.045614 | 0.044363 | 0.043529 |
| 9 | ci-joint pret vehicule | [ci-joint, pret, vehicule] | 0.044213 | 0.041173 | 0.039147 |


## Find extra seed words with the `extend_seed_word_list` parameter

The SemanticDetector `extend_seed_word_list` parameter activates the search for extra seed words sharing the same root as the base seed words.  

For example, if "dance" is a base seed word, "extend_seed_word_list" will loop through the words in the embedding vocabulary and find new seed words such as "dancer", "dancing".

In [14]:
# Instantiate a SemanticDetector object
semantic_detector_extended_seed = SemanticDetector(
    base_seed_words=['tel', 'assur'], tokens_column='tokens', extend_seed_word_list=True)

# Fit the SemanticDetector using the trained embedding
semantic_detector_extended_seed.fit(embedding=embedding)

In [15]:
# Print the extended list of seed words
print(semantic_detector_extended_seed.seed_dict)
print(semantic_detector_extended_seed.seed_list)

{'tel': ['telephonique', 'telephone', 'tel'], 'assur': ['assurance', 'assurer']}
['telephonique', 'telephone', 'tel', 'assurance', 'assurer']
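
The output above is consistent with a simple prefix match against the embedding vocabulary. The sketch below assumes that rule (the library's actual matching logic may differ), uses a toy vocabulary, and introduces a hypothetical helper `extend_seed_words`.

```python
def extend_seed_words(base_seed_words, vocabulary):
    """Map each base seed to the vocabulary words that start with it (assumed rule)."""
    return {
        seed: [word for word in vocabulary if word.startswith(seed)]
        for seed in base_seed_words
    }

# Toy vocabulary standing in for the embedding vocabulary
vocabulary = ["telephonique", "telephone", "tel", "client", "assurance", "assurer"]

seed_dict = extend_seed_words(["tel", "assur"], vocabulary)
# Flatten the per-seed lists into a single list of seed words
seed_list = [word for words in seed_dict.values() for word in words]

print(seed_dict)
print(seed_list)
```

A prefix rule is cheap but coarse: a short seed such as "tel" would also match unrelated words that happen to start with the same letters, so the extended list is worth reviewing by hand.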


## Use a custom function to aggregate lexicon scores

### Aggregate token score over seeds
The SemanticDetector computes a similarity between a word and every seed word.  
An aggregation function is then used to keep a single score for each token.  

Example: 
- Seed word list : ["horse", "animal"]
- Embedding : simulated

Lexicon "horse" :
{
  "apple" : 0.2,
  ...
  "hello" : 0.1,
  ...
  "ponies" : 8.8,
  ...
  "zebra" : 1.2
}  
Lexicon "animal" :
{
  "apple" : 0.1,
  ...
  "hello" : 0.3,
  ...
  "ponies" : 4.8,
  ...
  "zebra" : 6.2
}

**Aggregated Lexicon :**  
{
  "apple" : 0.2,
  ...
  "hello" : 0.3,
  ...
  "ponies" : 8.8,
  ...
  "zebra" : 6.2
}
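
The aggregated lexicon above can be reproduced with NumPy by taking the element-wise maximum over the two per-seed lexicons. This is a sketch on the simulated values only; stacking the scores with one row per seed is an assumption about the layout, not a documented detail of the library.

```python
import numpy as np

# Simulated per-seed lexicons (only the words listed in the example)
lexicon_horse = {"apple": 0.2, "hello": 0.1, "ponies": 8.8, "zebra": 1.2}
lexicon_animal = {"apple": 0.1, "hello": 0.3, "ponies": 4.8, "zebra": 6.2}

words = list(lexicon_horse)

# One row per seed, one column per vocabulary word
score_matrix = np.array([
    [lexicon_horse[word] for word in words],
    [lexicon_animal[word] for word in words],
])

# Seed-wise aggregation: keep the maximum score of each column
aggregated_lexicon = dict(zip(words, np.max(score_matrix, axis=0).tolist()))
print(aggregated_lexicon)
# {'apple': 0.2, 'hello': 0.3, 'ponies': 8.8, 'zebra': 6.2}
```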

### Aggregate semantic score over tokens
When evaluating an email, each word in the email has an associated score.  
An aggregation function is used to keep a single score for each email.  

Example: 
- Sentence : "Hello, I like ponies"
- Seed word list : ["horse", "animal"]
- Embedding : simulated

**Sentence score (using a sum as the aggregation function):**  
- score = score(Hello) + score(I) + score(like) + score(ponies)
- score = 0.3 + 0.1 + 0.2 + 8.8 = 9.4

The semantic score for the email is thus 9.4.
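
A short sketch of this email-wise step, using the simulated token scores from the example and a sum as the aggregation function. `score_email` is a hypothetical helper, and the extra token "unknownword" is added only to show out-of-vocabulary tokens being dropped before aggregation.

```python
# Simulated lexicon restricted to the tokens of the example sentence
lexicon = {"hello": 0.3, "i": 0.1, "like": 0.2, "ponies": 8.8}

def score_email(tokens, lexicon, aggregate=sum):
    """Keep in-vocabulary tokens, look up their scores, and aggregate them."""
    token_scores = [lexicon[token] for token in tokens if token in lexicon]
    return aggregate(token_scores)

# "unknownword" is not in the lexicon, so it does not contribute to the score
tokens = ["hello", "i", "like", "ponies", "unknownword"]
email_score = score_email(tokens, lexicon)
print(round(email_score, 2))
```

Passing a different callable as `aggregate` (for example `max`) changes how the token scores are combined without touching the filtering step.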

### Default aggregation functions

The default aggregation methodology is the following:  
- Seed-wise aggregation: for a token, take the max score across seeds
  - Example:  
    ponies_score = 8.8 (lexicon "horse")  
    ponies_score = 4.8 (lexicon "animal")  
    => Score for the "ponies" token = np.max([8.8, 4.8]) = 8.8  


- Email-wise aggregation: given a list of token scores, take the 60th percentile as the sentiment_score for the email
  - Example:  
    token_score_list : [0.3 (hello), 0.3 (i), 0.2 (like), 8.8 (ponies)]  
    => sentiment_score = np.percentile([0.3, 0.3, 0.2, 8.8], 60) = 0.3  
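
To see what the 60th percentile does with an outlier, the email-wise example can be run directly with NumPy. This is only a sketch on the four token scores above; the comparison with the mean is added for illustration and is not part of the library defaults.

```python
import numpy as np

# Token scores from the email-wise example, including one very high score
token_scores = [0.3, 0.3, 0.2, 8.8]

# 60th percentile with NumPy's default linear interpolation
percentile_score = np.percentile(token_scores, 60)

# Mean of the same scores, shown only for comparison
mean_score = np.mean(token_scores)

print(f"60th percentile: {percentile_score:.2f}")
print(f"mean: {mean_score:.2f}")
```

On these values the percentile stays at the level of the typical tokens, while the mean is pulled up by the single high score. A higher percentile would make the email score more sensitive to a few strongly matching tokens.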

In [16]:
# Instantiate a SemanticDetector object with custom aggregation functions:
# - A mean for the seed-wise aggregation
# - A 95th percentile for the email-wise aggregation

def aggregation_mean(x):
    return np.mean(x, axis=0)

def aggregation_percentile_95(x):
    return np.percentile(x, 95)

semantic_detector_custom_aggregation = SemanticDetector(
    base_seed_words=['client'], 
    tokens_column='tokens', 
    aggregation_function_seed_wise=aggregation_mean,
    aggregation_function_email_wise=aggregation_percentile_95
)

# Fit the SemanticDetector using the trained embedding
semantic_detector_custom_aggregation.fit(embedding=embedding)

# Predict the sentiment score on each email of the DataFrame
df_emails_clean_custom_aggregation = semantic_detector_custom_aggregation.predict(df_emails_clean)

## Multiprocessing

The `n_jobs` argument sets the number of cores used for computation (default 1).

In [17]:
semantic_detector_multiprocessing = SemanticDetector(
    base_seed_words=['certificat'], 
    tokens_column='tokens', 
    n_jobs = 2
)

# Fit the SemanticDetector using the trained embedding
semantic_detector_multiprocessing.fit(embedding=embedding)

# Predict the sentiment score on each email of the DataFrame
df_emails_multiprocessing = semantic_detector_multiprocessing.predict(df_emails_clean)