# KeywordsGenerator class

The KeywordsGenerator class extracts relevant keywords from the text data **based on a tf-idf score computed on the training dataset**.

#### The input dataframe

KeywordsGenerator **requires a *tokens* column** in which each element is a list of strings. The *tokens* column can be generated with a Tokenizer object.
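As a quick sanity check of that requirement, the sketch below tests whether every element of a column is a list of strings. The helper `is_valid_tokens_column` is a hypothetical name, not part of Melusine; it accepts any iterable, so a pandas Series such as `df["tokens"]` could be passed as well.

```python
def is_valid_tokens_column(column):
    """Return True when every element is a list containing only strings."""
    return all(
        isinstance(tokens, list) and all(isinstance(tok, str) for tok in tokens)
        for tokens in column
    )


# Illustrative data: the first column is well formed, the second is not
# (a raw string instead of a list, and a non-string token).
good = [["client", "chez", "devis"], ["nouvelle", "immatriculation"]]
bad = ["client chez devis", ["nouvelle", 25]]

print(is_valid_tokens_column(good))  # True
print(is_valid_tokens_column(bad))   # False
```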

In [1]:
from melusine.data.data_loader import load_email_data
df_emails_preprocessed = load_email_data(type="preprocessed")

In [2]:
# Create a tokens column
from melusine.nlp_tools.tokenizer import WordLevelTokenizer
tokenizer = WordLevelTokenizer()
df_emails_preprocessed["tokens"] = df_emails_preprocessed["clean_body"].apply(tokenizer.tokenize)

In [3]:
df_emails_preprocessed.tokens[0]

['client',
 'chez',
 'pouvez',
 'etablir',
 'devis',
 'fils',
 'souhaite',
 'louer',
 'lappartement',
 'suivant',
 '25',
 'rue',
 'rueimaginaire',
 'flag_cp_']

#### Arguments 

The specific parameters of the KeywordsGenerator class are:
- max_tfidf_features : size of vocabulary for tfidf
- keywords : list of keywords to be extracted in priority (this list can be defined in the conf file)
- stopwords : list of words to be ignored (this list can be defined in the conf file)
- resample : when the DataFrame contains a 'label' column, balance the dataset by resampling
- n_max_keywords : maximum number of keywords to be returned for each email
- n_min_keywords : minimum number of keywords to be returned for each email
- threshold_keywords : minimum tf-idf score for a word to be selected as keyword
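To make the role of these parameters more concrete, here is a minimal pure-Python sketch of tf-idf keyword selection. It does not reproduce Melusine's implementation: `fit_idf` and `extract_keywords` are hypothetical helpers, the smoothed idf formula is one common choice, and two behaviours are assumptions, namely that `keywords_coef` multiplies the score of priority keywords and that `n_min_keywords` overrides the threshold when too few words pass it. `max_tfidf_features` and `resample` are left out.

```python
import math
from collections import Counter


def fit_idf(corpus):
    """Compute a smoothed inverse document frequency for every token."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for tokens in corpus for tok in set(tokens))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}


def extract_keywords(tokens, idf, keywords=(), stopwords=(), n_max_keywords=5,
                     n_min_keywords=0, threshold_keywords=0.1, keywords_coef=10):
    """Rank the tokens of one document by tf-idf and keep the best ones."""
    counts = Counter(t for t in tokens if t not in stopwords and t in idf)
    total = sum(counts.values())
    if total == 0:
        return []
    scores = {}
    for tok, n in counts.items():
        score = (n / total) * idf[tok]
        if tok in keywords:
            score *= keywords_coef  # assumed: priority keywords get a boost
        scores[tok] = score
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    selected = [t for t in ranked if scores[t] >= threshold_keywords]
    if len(selected) < n_min_keywords:
        selected = ranked[:n_min_keywords]  # assumed fallback below threshold
    return selected[:n_max_keywords]


# Tiny illustrative corpus, not the tutorial dataset.
corpus = [
    ["client", "devis", "louer", "appartement"],
    ["nouvelle", "immatriculation", "carte", "grise"],
    ["bulletin", "salaire", "suivant"],
]
idf = fit_idf(corpus)

print(extract_keywords(corpus[0], idf, keywords=["devis"], n_max_keywords=2))
# ['devis', 'appartement']
print(extract_keywords(corpus[2], idf, stopwords=["suivant"]))
# ['bulletin', 'salaire']
```

In this sketch the boosted word `devis` ranks first, and the stopword `suivant` never receives a score, so it cannot be returned.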

In [4]:
keywords = ['devis', 'contrat', 'resilitation']

In [5]:
stopwords = ["au", "aux", "avec", "ce", "ces", "dans", "de", "des", "du",
        "elle", "en", "et", "eux", "il", "je", "la", "le", "leur", "lui", "ma",
        "mais", "me", "même", "mes", "moi", "mon", "ne", "nos", "notre", "nous",
        "on", "ou","par", "pas", "pour", "qu", "que", "qui", "sa", "se", "ses",
        "son", "sur","ta", "te", "tes", "toi", "ton", "tu", "un", "une", "vos",
        "votre", "vous", "c", "d", "j", "l", "à", "m", "n", "s", "t", "y", "été",
        "étée", "étées", "étés", "étant", "étante", "étants", "étantes", "suis",
        "es", "est", "sommes", "êtes", "sont", "serai", "seras", "sera", "serons",
        "serez", "seront", "serais", "serait", "serions", "seriez", "seraient",
        "étais", "était", "étions", "étiez", "étaient", "fus", "fut", "fûmes",
        "fûtes", "furent", "sois", "soit", "soyons", "soyez", "soient", "fusse",
        "fusses", "fût", "fussions", "fussiez", "fussent", "ayant", "ayante",
        "ayantes", "ayants", "eu", "eue", "eues", "eus", "ai", "as", "avons",
        "avez", "ont", "aurai", "auras", "aura", "aurons", "aurez", "auront",
        "aurais", "aurait", "aurions", "auriez", "auraient", "avais", "avait",
        "avions", "aviez", "avaient", "eut", "eûmes", "eûtes", "eurent", "aie",
        "aies", "ait", "ayons", "ayez", "aient", "eusse", "eusses", "eût",
        "eussions", "eussiez", "eussent", "suivant"],
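
Note that the assignment above ends with a comma after the closing bracket. In Python that binds a one-element tuple wrapping the list, which matches the `stopwords=([...],)` form in the recorded `fit` output further down. A minimal sketch of the effect, using a shortened illustrative list (`short_stopwords` and `unwrapped` are hypothetical names):

```python
# Shortened, illustrative list; the comma after "]" is the point of interest.
short_stopwords = ["au", "aux", "avec"],

print(type(short_stopwords).__name__)  # tuple
print(len(short_stopwords))            # 1
print("au" in short_stopwords)         # False

# Unwrap the tuple (or drop the comma) to get the list itself.
unwrapped = short_stopwords[0]
print("au" in unwrapped)               # True
```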

#### Defining the KeywordsGenerator

In [6]:
from melusine.summarizer.keywords_generator import KeywordsGenerator

keywords_generator = KeywordsGenerator(keywords=keywords,
                                       stopwords=stopwords,
                                       n_max_keywords=5,
                                       n_min_keywords=0,
                                       threshold_keywords=0.1,
                                       keywords_coef=10)

#### Training the KeywordsGenerator

In [7]:
keywords_generator.fit(df_emails_preprocessed) 

KeywordsGenerator(keywords=['devis', 'contrat', 'resilitation'],
                  n_max_keywords=5,
                  stopwords=(['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de',
                              'des', 'du', 'elle', 'en', 'et', 'eux', 'il',
                              'je', 'la', 'le', 'leur', 'lui', 'ma', 'mais',
                              'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos',
                              'notre', 'nous', ...],),
                  threshold_keywords=0.1)

#### Extracting keywords

In [8]:
df_emails_preprocessed = keywords_generator.transform(df_emails_preprocessed)

In [9]:
df_emails_preprocessed.head()

| Unnamed: 0 | body | header | date | from | to | attachment | sexe | age | label | is_begin_by_transfer | is_answer | is_transfer | structured_historic | structured_body | last_body | clean_body | clean_header | tokens | keywords |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n \n \n \n Bonjour \n Je suis client chez... | Devis habitation | 24/05/2018 11:36 | Dupont <monsieurdupont@extensiona.com> | conseiller@Societeimaginaire.fr | [] | F | 35 | habitation | True | False | False | [{'text': ' \n \n \n \n Bonjour \n Je suis ... | [{'meta': {'date': None, 'from': None, 'to': N... | Je suis client chez vous Pouvez vous m établir... | je suis client chez vous pouvez vous m etablir... | devis habitation | [client, chez, pouvez, etablir, devis, fils, s... | [pouvez, devis, fils, suivant, flag_cp_] |
| 1 | \n \n \n \n Bonsoir madame, \n \n Je vous... | Immatriculation voiture | 24/05/2018 19:37 | Dupont <monsieurdupont@extensiona.com> | conseiller@Societeimaginaire.fr | ["pj.pdf"] | M | 32 | vehicule | True | False | False | [{'text': ' \n \n \n \n Bonsoir madame, \n ... | [{'meta': {'date': None, 'from': None, 'to': N... | Je vous informe que la nouvelle immatriculati... | je vous informe que la nouvelle immatriculatio... | immatriculation voiture | [informe, nouvelle, immatriculation, enfin, fa... | [nouvelle, immatriculation, prie, trouver, faire] |
| 2 | \n \n \n Bonjours, \n \n Suite a notre con... | Re: Envoi d'un document de la Société Imaginaire | vendredi 25 mai 2018 06 h 45 CEST | Monsieur Dupont <monsieurdupont@extensiona.com> | demandes@societeimaginaire.fr | [] | M | 66 | compte | False | True | False | [{'text': " \n \n \n Bonjours, \n \n Suite ... | [{'meta': {'date': None, 'from': None, 'to': N... | Suite a notre conversation téléphonique de Ma... | suite a notre conversation telephonique de mar... | envoi d'un document de la societe imaginaire | [suite, a, conversation, telephonique, flag_da... | [conversation, pourriez, dire, dois, afin] |
| 3 | \n \n \n \n \n Bonjour, \n \n \n Je fai... | Re: Votre adhésion à la Société Imaginaire | vendredi 25 mai 2018 10 h 15 CEST | Monsieur Dupont <monsieurdupont@extensiond.com> | demandes@societeimaginaire.fr | ["fichedepaie.png"] | M | 50 | adhesion | False | True | False | [{'text': " \n \n \n \n \n Bonjour, \n \n... | [{'meta': {'date': None, 'from': None, 'to': N... | Je fais suite à votre mail. J'ai envoyé mon... | je fais suite a votre mail. j'ai envoye mon bu... | votre adhesion a la societe imaginaire | [fais, suite, a, mail, envoye, bulletin, salai... | [suite, mail, bulletin, salaire, trouverez] |
| 4 | \n \n \n Bonjour, \n Voici ci joint mon bul... | Bulletin de salaire | vendredi 25 mai 2018 17 h 30 CEST | Monsieur Dupont <monsieurdupont@extensiona.com> | demandes@societeimaginaire.fr | ["pj.pdf"] | M | 15 | adhesion | False | False | False | [{'text': ' \n \n \n Bonjour, \n Voici ci jo... | [{'meta': {'date': None, 'from': None, 'to': N... | Voici ci joint mon bulletin de salaire comme d... | voici ci joint mon bulletin de salaire comme d... | bulletin de salaire | [voici, ci, joint, bulletin, salaire, comme, d... | [ci, joint, bulletin, salaire, comme] |


In [10]:
df_emails_preprocessed.tokens[1]

['informe',
 'nouvelle',
 'immatriculation',
 'enfin',
 'faite',
 'prie',
 'trouver',
 'donc',
 'carte',
 'grise',
 'ainsi',
 'nouvelle',
 'immatriculation',
 'demanderai',
 'faire',
 'changements',
 'necessaires',
 'concernant',
 'lassurance']

In [11]:
df_emails_preprocessed.keywords[1]

['nouvelle', 'immatriculation', 'prie', 'trouver', 'faire']
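
To relate the two outputs above, the following sketch checks that every extracted keyword is one of the email's tokens and that the count stays within `n_max_keywords=5`. The two lists are copied from the recorded outputs; `missing` and `positions` are illustrative names.

```python
# Lists copied from the recorded outputs for email 1.
tokens = ['informe', 'nouvelle', 'immatriculation', 'enfin', 'faite', 'prie',
          'trouver', 'donc', 'carte', 'grise', 'ainsi', 'nouvelle',
          'immatriculation', 'demanderai', 'faire', 'changements',
          'necessaires', 'concernant', 'lassurance']
keywords = ['nouvelle', 'immatriculation', 'prie', 'trouver', 'faire']

# Keywords that do not appear among the tokens (none are expected).
missing = [k for k in keywords if k not in tokens]
# Index of the first occurrence of each keyword in the token list.
positions = [tokens.index(k) for k in keywords]

print(missing)             # []
print(len(keywords) <= 5)  # True
print(positions)           # [1, 2, 5, 6, 14]
```

In this example the keywords are listed in the order in which they first appear in the email.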
