**Date :** Created on Thursday January 7 2021

**Group 8 - Innovation**

**Embedding_Update_v0** 

**@author :** Flora Estermann, Damien Sonneville. 

# Part 1 : Download / Import Library

## Download Library

In [1]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Import Libraries

### - Useful libraries :

In [2]:
import pandas as pd
from tqdm import tqdm
from google.colab import drive

### - Text library :

In [3]:
from nltk.corpus import stopwords

### - Machine Learning Library :

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Part 2 : Data Loading

## - Phase 1 : Load Abstracts Data

In [5]:
def Load_data(helper_path : str) -> pd.DataFrame :
    """Documentation
    
    Parameters :
        - helper_path : the file path

    Output (if exists) :
        - df : My Dataframe cleaned and reindexed

    """
    
    # Load the data with the pandas library
    df = pd.read_csv(helper_path)

    # Drop articles with no content (read_csv parses empty fields as NaN)
    df = df[df['art_content'].notna() & (df['art_content'] != '')]

    # Reset my dataframe index
    df = df.reset_index(drop = True)
    
    # Returns my clean dataframe
    return df
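
One detail worth checking in the loading step: by default `pd.read_csv` parses empty fields as `NaN` rather than as empty strings, so a comparison with `''` alone may leave articles without content in place. The following is a minimal sketch on an in-memory CSV; the sample rows and the variable names are made up for illustration and are not part of the project data.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the second article has no content.
sample_csv = io.StringIO("art_id,art_content\n1,first text\n2,\n3,third text\n")
df = pd.read_csv(sample_csv)

# The empty field is read as a missing value, not as ''.
n_missing = int(df['art_content'].isna().sum())

# Dropping missing values removes the empty article, then the index is rebuilt.
cleaned = df.dropna(subset=['art_content']).reset_index(drop=True)

print(n_missing, len(cleaned))
```

Here `dropna(subset=...)` plays the same role as the content filter in `Load_data`, applied to missing values instead of empty strings.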

## - Phase 2 : Load Requests Data

In [6]:
def Load_req(helper_path : str) -> list :
    """Documentation
    
    Parameters :
        - helper_path : the file path

    Output (if exists) :
        - req : My list of requests

    """

    # Load the data with the pandas library
    req = pd.read_pickle(helper_path)

    # Print my list length (Optional)
    # print('Number of requests :', len(req))

    # Returns my requests list
    return req

In [7]:
# Connect the drive folder
drive.mount('/content/drive')

# First file path (Function Data)
Helper_path_D : str = '/content/drive/MyDrive/data_interpromo/Data/abstract_v1.csv'

# Second file path (Function Requests)
Helper_path_R : str = '/content/drive/MyDrive/data_interpromo/Data/request_word_weight'

# My DataFrame variable
My_data : pd.DataFrame = Load_data(Helper_path_D)

# My request variable
My_request : list = Load_req(Helper_path_R)

# To show my DataFrame
My_data.head(10)

Mounted at /content/drive


Unnamed: 0,art_id,art_content,art_content_html,art_extract_datetime,art_lang,art_title,art_url,src_name,src_type,src_url,src_img,art_auth,art_tag,art_clean,abstract_sentence,abstract_words
0,1,le FNCDG et l’ andcdg avoir publier en septemb...,"<p style=""text-align: justify;"">La FNCDG et l’...",22 septembre 2020,fr,9ème édition du Panorama de l’emploi territorial,http://fncdg.com/9eme-edition-du-panorama-de-l...,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2020/09/im...,,,fncdg andcdg avoir publier septembre 9em editi...,fncdg andcdg avoir publier septembre 9em editi...,fncdg andcdg avoir publier septembre 9em editi...
1,2,malgré le levée un mesure de confinement le 11...,"<p style=""text-align: justify;"">Malgré la levé...",17 mars 2020,fr,ACTUALITÉS FNCDG / COVID19,http://fncdg.com/actualites-covid19/,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2020/03/co...,,,malgre levee mesure confinement 11 mai 2020 pl...,malgre levee mesure confinement 11 mai 2020 pl...,malgre levee mesure confinement 11 mai 2020 pl...
2,25,quel être le objectif poursuivre par le gouver...,"<p style=""text-align: justify;""><strong>Quels ...",24 octobre 2019,fr,"Interview de M. Olivier DUSSOPT, Secretaire d’...",http://fncdg.com/interview-de-m-olivier-dussop...,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2019/10/in...,,,quel etre objectif poursuivre gouvernement cad...,quel etre objectif poursuivre gouvernement cad...,quel etre objectif poursuivre gouvernement cad...
3,27,"le journée thématique , qui avoir lieu durant ...","<p style=""text-align: justify;""><strong>La jo...",31 mai 2017,fr,Journée Thématique FNCDG « Les services de san...,http://fncdg.com/journee-thematique-fncdg-les-...,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2017/05/pu...,,,journee thematique avoir lieu durant salon pre...,journee thematique avoir lieu durant salon pre...,journee thematique avoir lieu durant salon pre...
4,28,le 1ère journée thématique en région sur le th...,"<p style=""text-align: justify;"">La 1<sup>ère</...",13 mars 2017,fr,Journée Thématique FNCDG « Vers de nouveaux mo...,http://fncdg.com/journee-thematique-fncdg-vers...,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2017/03/Sa...,,,1ere journee thematique region theme vers nouv...,1ere journee thematique region theme vers nouv...,1ere journee thematique region theme vers nouv...
5,30,l’ un un innovation de le loi n degré 2019 - 8...,"<p style=""text-align: justify;"">L’une des inno...",22 octobre 2020,fr,La publication d’un guide d’accompagnement à l...,http://fncdg.com/la-publication-dun-guide-dacc...,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2020/10/LG...,,,innovation loi degre 2019 828 6 aout 2019 dire...,innovation loi degre 2019 828 6 aout 2019 dire...,innovation loi degre 2019 828 6 aout 2019 dire...
6,31,"le FNCDG mener , en collaboration avec d’ autr...","<p style=""text-align: justify;"">La FNCDG mène,...",10 décembre 2020,fr,La publication d’un guide de sensibilisation a...,http://fncdg.com/la-publication-dun-guide-de-s...,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2020/12/im...,,,fncdg mener collaboration autre partenaire cam...,fncdg mener collaboration autre partenaire cam...,fncdg mener collaboration autre partenaire cam...
7,32,"créer pour et par le décideur territorial , ét...","<p style=""text-align: justify;"">Créé pour et p...",24 février 2017,fr,Lancement du réseau Étoile,http://fncdg.com/lancement-du-reseau-etoile/,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2017/02/re...,,,creer decideur territorial etoile etre tout pr...,creer decideur territorial etoile etre tout pr...,creer decideur territorial etoile etre tout pr...
8,34,le décret n degré 2017 - 397 et n degré 2017 -...,"<p style=""text-align: justify;"">Les décrets n°...",5 avril 2017,fr,Le cadre d’emplois des agents de police munici...,http://fncdg.com/le-cadre-demplois-des-agents-...,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2017/04/po...,,,decret degre 2017 397 degre 2017 318 24 mars 2...,decret degre 2017 397 degre 2017 318 24 mars 2...,decret degre 2017 397 degre 2017 318 24 mars 2...
9,35,un candidat à un examen professionnel organise...,"<p style=""text-align: justify;"">Une candidate ...",6 juillet 2017,fr,Le Conseil d’Etat confirme la souveraineté des...,http://fncdg.com/le-conseil-detat-confirme-la-...,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2017/07/Co...,,,candidat a examen professionnel organiser cent...,candidat a examen professionnel organiser cent...,candidat a examen professionnel organiser cent...


# Part 3 : Word Embedding

**Function Description (To help understanding) :** 
- The goal is to use the text of the user requests stored in our database to update the embeddings so that they better reflect what users ask for (user-oriented).

- For instance, by increasing the weights of the words that appear in most requests, the sentences containing those words gain weight and are more likely to be selected when building the abstract.
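
To make the idea concrete, here is a minimal sketch with made-up numbers. It assumes, purely for illustration, that a sentence is scored by summing the weights of its words; the names `word_weights`, `request_boost` and `sentence_score` are hypothetical and do not appear in the notebook.

```python
# Hypothetical word weights before the update.
word_weights = {'emploi': 0.5, 'salon': 0.5, 'zone': 0.25}

# Hypothetical boost: 'zone' is assumed to appear in many user requests.
request_boost = {'zone': 1.0}

# Add the boost to the matching words, leave the others unchanged.
updated_weights = {w: v + request_boost.get(w, 0.0) for w, v in word_weights.items()}


def sentence_score(words, weights):
    """Score a sentence as the sum of the weights of its words (an assumption)."""
    return sum(weights.get(w, 0.0) for w in words)


sentence_a = ['emploi', 'zone']
sentence_b = ['emploi', 'salon']

# Before the update sentence_b scores higher; after it, sentence_a does.
print(sentence_score(sentence_a, word_weights), sentence_score(sentence_b, word_weights))
print(sentence_score(sentence_a, updated_weights), sentence_score(sentence_b, updated_weights))
```

In this toy case the sentence that contains the frequently requested word overtakes the other one once the boost is applied.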

In [8]:
def Get_TF_IDF(corpus : pd.Series, min_df : int, max_df : int, stop_words : set):
    """Documentation
  
    Parameters :
      - corpus : all sentences in the column ["art_clean"]
      - min_df : the minimum number of documents a word must appear in
      - max_df : the maximum number of documents a word may appear in
      - stop_words : the set of stopwords to remove from my data

    Output (if exists) :
      - text_tfidf : my DataFrame of document-term matrix
      - vocab : my vocabulary of document-term matrix 

    """

    # Text transformation TF-IDF
    # (recent scikit-learn versions expect stop_words as a list, not a set)
    vectorizer = TfidfVectorizer(min_df = min_df,
                                 max_df = max_df,
                                 stop_words = list(stop_words))
  
    # Type change in the fit
    tfidf = vectorizer.fit_transform(corpus.astype(str))

    # Get vocabulary (get_feature_names() was removed in scikit-learn 1.2)
    vocab = list(vectorizer.get_feature_names_out())

    # Convert sparse array to dense dataframe
    text_tfidf = pd.DataFrame(tfidf.toarray(), columns = vocab)

    # Return TF-IDF matrix and TF-IDF vocabulary list
    return text_tfidf, vocab

In [9]:
# My stopwords list
Stop_words = set(stopwords.words('french'))
Stop_words.add('pron')

# My corpus variable
Corpus = My_data['art_clean']

# Compute TF-IDF
Tfidf, Word_collection = Get_TF_IDF(Corpus, 20, 1000, Stop_words)
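
When `min_df` and `max_df` are passed as integers, as with `20` and `1000` in the call above, scikit-learn treats them as document counts: a word is kept only if the number of documents containing it lies between the two bounds. A minimal sketch on a made-up four-document corpus (the documents and thresholds are illustrative, not taken from the project data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: 'emploi' is in 4 documents, 'territorial' and
# 'zone' in 2 each, and the remaining words in 1 document each.
docs = [
    "emploi territorial zone",
    "emploi territorial salon",
    "emploi concours zone",
    "emploi guide",
]

# Keep the words found in at least 2 and at most 3 documents.
vectorizer = TfidfVectorizer(min_df=2, max_df=3)
matrix = vectorizer.fit_transform(docs)

# Words that survived both thresholds, in alphabetical order.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)          # ['territorial', 'zone']
print(matrix.shape)   # (4, 2)
```

The word present in every document is dropped by `max_df`, and the words present in a single document are dropped by `min_df`.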

# Part 4 : Word Filter (from the requests)

**Function Description (To help understanding) :**  
- We create a filter: one counter per vocabulary word, incremented each time that word appears in a request.

- By applying this filter to the document-term matrix (a simple addition, see Part 5), we update the weights of the words that occur frequently in the user requests.
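
The counting and normalization steps can be sketched on a toy vocabulary as follows. The words, the requests and the variable names are made up for illustration; the normalization shown is a z-score (subtract the mean, divide by the standard deviation).

```python
import pandas as pd

# Hypothetical vocabulary and user requests.
vocabulary = ['emploi', 'salon', 'zone', 'guide']
requests = ['zone emploi', 'zone rurale', 'zone']

# One counter per vocabulary word, starting at zero.
counts = {word: 0 for word in vocabulary}

# Increment the counter each time a vocabulary word appears in a request.
for request in requests:
    for word in request.split():
        if word in counts:
            counts[word] += 1

# Z-score normalization of the counts.
freq = pd.Series(counts, name='freq')
normalized = (freq - freq.mean()) / freq.std()

print(counts)
print(normalized)
```

Words that never appear in a request all share the same negative normalized value, which is presumably why a single value repeats in the filter displayed further down.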

In [10]:
def Create_filter(word_collection : list, requests : list, normalize : bool = True) -> pd.DataFrame :
    """Documentation
  
    Parameters :
      - word_collection: feature names from the TF-IDF.
      - requests: list of requests.
      - normalize: if `True`, returns the filter normalized.

    Output (if exists) :
      - filter: DataFrame indexed by the words of word_collection, with their
      frequency in the requests (z-scored if normalize is `True`).  

    """
    # My initialized filter (named word_filter to avoid shadowing the built-in filter)
    word_filter = {v: 0 for v in word_collection}

    # Step 1 : Loop over the requests
    for r in tqdm(requests):
        
        # Step 2 : Split my request
        for w in r.split():
            
            # Check whether the word belongs to the vocabulary (dictionary lookup)
            if w in word_filter:
              
                # Increment the term frequency 
                word_filter[w] += 1

    # Convert my dictionary to a DataFrame
    word_filter = pd.DataFrame.from_dict(word_filter,
                                         orient='index',
                                         columns=['freq'])
    # If normalized
    if normalize:

        # Return my normalized filter (z-score)
        return (word_filter - word_filter.mean()) / word_filter.std()

    # Return initial filter
    return word_filter

In [11]:
# My filter creation
Filter = Create_filter(Word_collection, My_request)

# To show my filter
Filter.head(10)

100%|██████████| 150/150 [00:00<00:00, 1712.52it/s]


Unnamed: 0,freq
0,-0.100582
1,-0.100582
2,-0.100582
3,-0.100582
4,-0.100582
5,-0.100582
6,-0.100582
7,-0.100582
8,-0.100582
9,-0.100582


# Part 5 : Embedding Update

In [17]:
def Embedding_update(embedding : pd.DataFrame, filter : pd.DataFrame) -> pd.DataFrame :
    """Documentation
  
    Parameters :
      - embedding: chosen representation of the document-term matrix.
      - filter: DataFrame mapping the words of the document-term matrix 
      to their frequency in the requests.

    Output (if exists) :
      - update: new embedding resulting from applying the filter 
      (simple addition, column by column).  

    """ 
  
    # Add each filter value to the column of the same word (aligned on labels)
    update = embedding.add(filter['freq'], axis = 'columns')

    # Return my new TF-IDF
    return update

In [18]:
# My new TF-IDF variable
New_tfidf = Embedding_update(Tfidf, Filter)

# To show my DataFrame
New_tfidf

Unnamed: 0,00,01,02,03,04,05,06,07,08,09,100,1000,10000,100000,105,10h00,10h30,11,110,112,115,117,11h00,11h30,12,120,1200,123,124,125,12h00,12h30,13,130,135,14,140,145,14h00,150,...,voyage,voyageur,voyez,voynet,vrai,vraie,vraiment,vraisemblablement,vs,vu,vulnerabilite,vulnerable,wargon,washington,way,we,web,webinaire,week,werquin,wifi,with,woerth,work,workflow,workplace,world,xavier,xx,xxi,xxiem,york,your,yourcegid,yvelines,yves,zero,zonage,zone,zoom
0,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,...,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,1.857089,-0.100582
1,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,0.013819,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,...,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,1.857089,-0.100582
2,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,...,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,1.857089,-0.100582
3,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,...,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,1.857089,-0.100582
4,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,...,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,1.857089,-0.100582
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7485,-0.100582,-0.058201,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.073284,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.071542,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.071227,-0.100582,-0.100582,-0.100582,-0.100582,...,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.008685,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,1.857089,-0.100582
7486,-0.100582,0.019679,-0.100582,-0.100582,-0.004466,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.074761,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.073114,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.072815,-0.100582,-0.100582,-0.100582,-0.100582,...,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.013658,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,1.857089,-0.100582
7487,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.060774,-0.100582,-0.100582,-0.085414,-0.100582,-0.100582,-0.100582,-0.100582,-0.059826,-0.100582,-0.100582,-0.084716,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.075635,-0.100582,-0.100582,-0.066336,-0.086532,-0.100582,-0.100582,-0.090322,...,-0.100582,-0.100582,-0.089641,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.091422,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.086233,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,1.857089,-0.100582
7488,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,...,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,-0.100582,1.857089,-0.100582
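
The update amounts to adding one constant per column to the whole document-term matrix. A minimal sketch with made-up numbers (the names `tfidf` and `word_filter` and their values are illustrative only) compares the positional addition with a label-aligned one:

```python
import pandas as pd

# Hypothetical TF-IDF matrix: two documents, two words.
tfidf = pd.DataFrame([[0.0, 0.5], [0.25, 0.0]], columns=['salon', 'zone'])

# Hypothetical normalized filter, indexed by word.
word_filter = pd.DataFrame({'freq': [-0.5, 1.5]}, index=['salon', 'zone'])

# Positional addition: the i-th filter value is added to the i-th column.
by_position = tfidf + word_filter['freq'].values

# Label-aligned addition: each filter value is matched to its column by name.
by_label = tfidf.add(word_filter['freq'], axis='columns')

print(by_label)
```

Both forms give the same result as long as the filter rows follow the column order of the matrix. Because every row receives the same shift, a cell whose TF-IDF weight is zero ends up equal to the filter value of its column, which is consistent with the repeated values in the output above.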
