# KeywordsGenerator class

The KeywordGenerator class extracts relevant keywords in the text data **based on a tf-idf score computed on the training dataset**.

#### The input dataframe

KeywordGenerator **requires a *tokens* column** fow which iach elements is a list of strings. The *tokens* column can be generated with a Tokenizer object

In [1]:
import pandas as pd
import ast
df_emails_preprocessed = pd.read_csv('./data/emails_preprocessed.csv', encoding='utf-8', sep=';')
df_emails_preprocessed = df_emails_preprocessed[['tokens']]
df_emails_preprocessed['tokens'] = df_emails_preprocessed['tokens'].apply(lambda x: ast.literal_eval(x))

In [2]:
df_emails_preprocessed.tokens[0]

['client',
 'chez',
 'pouvez',
 'etablir',
 'devis',
 'fils',
 'souhaite',
 'louer',
 'lappartement',
 'suivant',
 '25',
 'rue',
 'rueimaginaire',
 'flag_cp_']

#### Arguments 

The specific parameters of the KeywordGenerator class are:
- max_tfidf_features : size of vocabulary for tfidf
- keywords : list of keyword to be extracted in priority (this list can be defined in the conf file)
- stopwords : list of keywords to be ignored (this list can be defined in the conf file)
- resample : when DataFrame contains a ‘label’ column, balance the dataset by resampling
- n_max_keywords : maximum number of keywords to be returned for each email
- n_min_keywords : minimum number of keywords to be returned for each email
- threshold_keywords : minimum tf-idf score for a word to be selected as keyword

In [3]:
keywords = ['devis', 'contrat', 'resilitation']

In [4]:
stopwords = ["au", "aux", "avec", "ce", "ces", "dans", "de", "des", "du",
        "elle", "en", "et", "eux", "il", "je", "la", "le", "leur", "lui", "ma",
        "mais", "me", "même", "mes", "moi", "mon", "ne", "nos", "notre", "nous",
        "on", "ou","par", "pas", "pour", "qu", "que", "qui", "sa", "se", "ses",
        "son", "sur","ta", "te", "tes", "toi", "ton", "tu", "un", "une", "vos",
        "votre", "vous", "c", "d", "j", "l", "à", "m", "n", "s", "t", "y", "été",
        "étée", "étées", "étés", "étant", "étante", "étants", "étantes", "suis",
        "es", "est", "sommes", "êtes", "sont", "serai", "seras", "sera", "serons",
        "serez", "seront", "serais", "serait", "serions", "seriez", "seraient",
        "étais", "était", "étions", "étiez", "étaient", "fus", "fut", "fûmes",
        "fûtes", "furent", "sois", "soit", "soyons", "soyez", "soient", "fusse",
        "fusses", "fût", "fussions", "fussiez", "fussent", "ayant", "ayante",
        "ayantes", "ayants", "eu", "eue", "eues", "eus", "ai", "as", "avons",
        "avez", "ont", "aurai", "auras", "aura", "aurons", "aurez", "auront",
        "aurais", "aurait", "aurions", "auriez", "auraient", "avais", "avait",
        "avions", "aviez", "avaient", "eut", "eûmes", "eûtes", "eurent", "aie",
        "aies", "ait", "ayons", "ayez", "aient", "eusse", "eusses", "eût",
        "eussions", "eussiez", "eussent", "suivant"],

#### Defining the KeywordsGenerator

In [5]:
from melusine.summarizer.keywords_generator import KeywordsGenerator

keywords_generator = KeywordsGenerator(keywords = keywords,
                                       stopwords = stopwords,
                                       n_max_keywords=5,
                                       n_min_keywords=0,
                                       threshold_keywords=0.1,
                                       keywords_coef=10)

#### Training the KeywordsGenerator

In [6]:
keywords_generator.fit(df_emails_preprocessed) 

KeywordsGenerator(keywords=['devis', 'contrat', 'resilitation'],
                  n_max_keywords=5,
                  stopwords=(['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de',
                              'des', 'du', 'elle', 'en', 'et', 'eux', 'il',
                              'je', 'la', 'le', 'leur', 'lui', 'ma', 'mais',
                              'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos',
                              'notre', 'nous', ...],),
                  threshold_keywords=0.1)

#### Extracting keywords

In [7]:
df_emails_preprocessed = keywords_generator.transform(df_emails_preprocessed)

In [8]:
df_emails_preprocessed.head()

Unnamed: 0,tokens,keywords
0,"[client, chez, pouvez, etablir, devis, fils, s...","[pouvez, devis, fils, suivant, flag_cp_]"
1,"[informe, nouvelle, immatriculation, enfin, fa...","[nouvelle, immatriculation, prie, trouver, faire]"
2,"[suite, a, conversation, telephonique, flag_da...","[conversation, pourriez, dire, dois, afin]"
3,"[fais, suite, a, mail, envoye, bulletin, salai...","[suite, mail, bulletin, salaire, trouverez]"
4,"[voici, ci, joint, bulletin, salaire, comme, d...","[ci, joint, bulletin, salaire, comme]"


In [9]:
df_emails_preprocessed.tokens[1]

['informe',
 'nouvelle',
 'immatriculation',
 'enfin',
 'faite',
 'prie',
 'trouver',
 'donc',
 'carte',
 'grise',
 'ainsi',
 'nouvelle',
 'immatriculation',
 'demanderai',
 'faire',
 'les',
 'changements',
 'necessaires',
 'concernant',
 'lassurance']

In [10]:
df_emails_preprocessed.keywords[1]

['nouvelle', 'immatriculation', 'prie', 'trouver', 'faire']