## Tokenization, lemmatization, part-of-speech tagging
---

**Objective**: 

    - Build a table with word / lemma / POS / work / songwriter.
    
==> spaCy toolkit: https://spacy.io/api/tokenizer 
==> https://spacy.io/usage/linguistic-features

### A brief focus on tokenization, lemmatization and part-of-speech tagging

   - What are they?
   - What are they for?
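
As a rough illustration of the three notions, here is a toy sketch in plain Python. The regular expression and the `TOY_LEMMAS` / `TOY_POS` lookup tables are hypothetical, hand-written stand-ins used only to show what each step produces; spaCy relies on its own rule sets and trained models instead.

```python
import re

# Hypothetical, hand-written lookup tables (illustration only, not spaCy data).
TOY_LEMMAS = {"words": "word", "are": "be", "flowing": "flow"}
TOY_POS = {"words": "NOUN", "are": "AUX", "flowing": "VERB", "out": "ADP"}


def toy_tokenize(text):
    """Tokenization: split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def toy_annotate(text):
    """Return (token, lemma, POS) triples.

    Unknown words keep their lowercase form as lemma and get the tag 'X'.
    """
    rows = []
    for token in toy_tokenize(text):
        key = token.lower()
        rows.append((token, TOY_LEMMAS.get(key, key), TOY_POS.get(key, "X")))
    return rows


for row in toy_annotate("Words are flowing out"):
    print(row)
```

Each triple is one row of the target table: the surface form as it appears in the lyrics, its dictionary form (lemma), and its grammatical category (POS).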
   
---

In [1]:
# Install spaCy (run once, in a shell):

# pip install spacy

# pandas: manipulation of tabular data (CSV files, DataFrames, etc.)
import pandas as pd

# Load the cleaned corpus:
corpus = pd.read_csv('corpus_nettoye.csv')

In [2]:
# Build a blank English pipeline (tokenizer only, no trained components):
from spacy.lang.en import English
nlp = English()
# Create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.tokenizer

In [4]:
import spacy

# Silence pandas' SettingWithCopyWarning so that new columns can be
# added without warnings.

pd.options.mode.chained_assignment = None  # default='warn' (cf. https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas)

# Work on a copy so that the original DataFrame is left untouched.
corpus_test = corpus.copy()

# Tokenize the lyrics of each song; every cell of 'words' holds a spaCy Doc (a sequence of tokens).
corpus_test['words'] = corpus_test['Lyrics'].apply(lambda x: nlp.tokenizer(str(x)))
# df['sents'] = df['text'].apply(lambda x: list(nlp(x).sents))
# Ref: https://stackoverflow.com/questions/46981137/tokenizing-using-pandas-and-spacy

# Note that nlp(x) runs the entire spaCy pipeline by default, which includes
# part-of-speech tagging, parsing and named entity recognition. Using
# nlp.tokenizer(x) instead of nlp(x), or disabling parts of the pipeline when
# loading the model, is significantly faster.
# E.g. (spaCy v3): nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# Alternative: sentence segmentation with the full pipeline.
# nlp = spacy.load("en_core_web_sm")
# WARNING: takes a long time!
# corpus_test['words'] = corpus_test['Lyrics'].apply(lambda x: [sent.text for sent in nlp(x).sents])

corpus_test.head()

Unnamed: 0.1,Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words
0,1,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,"(Words, are, flowing, out, like, endless, rain..."
1,2,"""All I've Got to Do""",UK: With the Beatles\n US: Meet the Beatles!,Lennon,Lennon,1963,Whenever I want you around yeh All I gotta do...,"(Whenever, I, want, you, around, yeh, , All, ..."
2,3,"""All My Loving""",UK: With the Beatles\n US: Meet the Beatles!,McCartney,McCartney,1963,Close your eyes and I'll kiss you Tomorrow I'l...,"(Close, your, eyes, and, I, 'll, kiss, you, To..."
3,5,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four Can I have a little more Fi...,"(One, two, three, four, Can, I, have, a, littl..."
4,6,"""All You Need Is Love""",Magical Mystery Tour,Lennon,Lennon,1967,"Love, love, love Love, love, love Love, love, ...","(Love, ,, love, ,, love, Love, ,, love, ,, lov..."


In [5]:
# https://towardsdatascience.com/tokenize-text-columns-into-sentences-in-pandas-2c08bc1ca790
corpus_test = corpus_test.explode("words", ignore_index=True)
corpus_test.head()
# The explode() function is used to transform each element of a list-like to a row, 
# replicating the index values. Returns: Series- Exploded lists to rows; index will 
# be duplicated for these rows.

Unnamed: 0.1,Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words
0,1,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,Words
1,1,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,are
2,1,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,flowing
3,1,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,out
4,1,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,like
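
To see what `explode` does on a small scale, here is a minimal sketch on a hypothetical two-row DataFrame (the `toy` frame and its values are made up for illustration and are not part of the corpus).

```python
import pandas as pd

# Hypothetical miniature corpus: one row per song, one list of tokens per row.
toy = pd.DataFrame({
    "Song": ["A", "B"],
    "words": [["Love", "me", "do"], ["Hey", "Jude"]],
})

# One row per token; the values of the other columns are repeated.
# ignore_index=True renumbers the rows 0..n-1 instead of repeating
# the index of the original row.
exploded = toy.explode("words", ignore_index=True)
print(exploded)
```

Without `ignore_index=True`, the three rows coming from song "A" would all keep index 0, which is the duplicated index mentioned in the comment above.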


In [6]:
# Prerequisite: download the model (run once, in a shell):

# python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

## Method:

## Cf. https://stackoverflow.com/questions/44395656/applying-spacy-parser-to-pandas-dataframe-w-multiprocessing

## nlp.pipe processes texts in batches, which is faster than calling
## nlp on each row through .apply. The texts are therefore taken out
## of the DataFrame and passed to the spaCy pipeline as a sequence;
## the results are then collated and put back into the DataFrame.

## Note: each text passed to the pipeline here is a single token, so
## the tagger sees no surrounding context; some tags may differ from
## what it would assign within the full sentence.

lemma = []
pos = []

# astype(str) converts each spaCy Token to its text before it is passed to the pipeline.
for doc in nlp.pipe(corpus_test['words'].astype(str).values, batch_size=50):
    if doc.has_annotation("DEP"):
        # tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
    else:
        # Keep the result lists the same length as the DataFrame:
        # add a blank entry when the parse fails.
        # tokens.append(None)
        lemma.append(None)
        pos.append(None)
        
# corpus_test['tokens'] = tokens
corpus_test['lemma'] = lemma
corpus_test['pos'] = pos



In [7]:
# A look at the data:

corpus_test[0:5]

Unnamed: 0.1,Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words,lemma,pos
0,1,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,Words,[word],[NOUN]
1,1,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,are,[be],[AUX]
2,1,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,flowing,[flow],[VERB]
3,1,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,out,[out],[ADP]
4,1,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,like,[like],[INTJ]


In [8]:
len(corpus_test)

36079

In [9]:
# Check that words / lemma / pos are still aligned: the row count must be unchanged
corpus_test.tail()
len(corpus_test)

36079

---

### Conversion to POS three-grams

   - What for?
   - On this point, see in particular: ZHAO, Ying, ZOBEL, Justin, "Searching with style: authorship attribution in classic literature", in Proceedings of the thirtieth Australasian conference on Computer science, Volume 62, Australian Computer Society, Inc., AUS, 2007, pp. 59–68.
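
POS three-grams can serve as a stylistic feature, for instance by counting which tag sequences a songwriter favours. A minimal sketch in plain Python, assuming a flat list of tags for one passage; the `tags` list and the `pos_trigrams` helper are made up for illustration.

```python
from collections import Counter


def pos_trigrams(tags):
    """Return the consecutive 3-tuples of tags (sliding window of width 3)."""
    return list(zip(tags, tags[1:], tags[2:]))


# Hypothetical tag sequence for one short line of lyrics.
tags = ["PRON", "VERB", "PRON", "VERB", "PRON", "VERB"]

trigrams = pos_trigrams(tags)

# Frequency profile: how often each tag sequence occurs.
profile = Counter(trigrams)

print(len(tags), len(trigrams))
for trigram, count in profile.items():
    print(trigram, count)
```

Comparing such frequency profiles between songwriters is one possible way to look for differences in style; a sequence of N tags always yields N - 2 three-grams.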
    
---

In [10]:
# Conversion to 3-grams:

from nltk.util import ngrams
liste_pos = list(corpus_test['pos'])
# Note: the list spans the whole corpus, so some 3-grams straddle two songs.
three_grams_pos = list(ngrams(liste_pos, 3))

# Adding to the DataFrame (not possible yet: the lengths differ):
# corpus_test['3grams_pos'] = three_grams_pos
len(three_grams_pos)

36077

In [11]:
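# The list and the DataFrame differ in length, which is to be expected. 
# Any idea why? 

# Pad the end with placeholder 3-grams so that the list matches the
# DataFrame length; the last two tags are read from liste_pos instead
# of being hard-coded.
last, second_last = liste_pos[-1], liste_pos[-2]
three_grams_pos.append((second_last, last, ['NaN']))
three_grams_pos.append((last, ['NaN'], ['NaN']))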

len(three_grams_pos)

36079
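
The padding step can also be written generically, for any sequence. A minimal sketch in plain Python; `padded_trigrams` and the `PAD` placeholder are hypothetical names chosen for this illustration, not part of nltk.

```python
PAD = "NaN"  # placeholder for positions past the end of the sequence


def padded_trigrams(items, pad=PAD):
    """Return one 3-gram per item by padding the end with two placeholders."""
    padded = list(items) + [pad, pad]
    return list(zip(padded, padded[1:], padded[2:]))


tags = ["PRON", "NOUN", "ADV"]
result = padded_trigrams(tags)

print(len(tags), len(result))
print(result[-2:])
```

A window of width 3 cannot start on the last two items, hence the two missing 3-grams; padding the end with two placeholders restores one 3-gram per row. nltk's `ngrams` also accepts `pad_right=True` together with `right_pad_symbol`, which should give an equivalent result.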

In [12]:
corpus_test['3grams_pos'] = three_grams_pos

In [13]:
corpus_test.tail()

Unnamed: 0.1,Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words,lemma,pos,3grams_pos
36074,217,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand Turn my face to the ...,to,[to],[ADP],"([ADP], [VERB], [PRON])"
36075,217,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand Turn my face to the ...,hide,[hide],[VERB],"([VERB], [PRON], [NOUN])"
36076,217,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand Turn my face to the ...,your,[your],[PRON],"([PRON], [NOUN], [ADV])"
36077,217,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand Turn my face to the ...,love,[love],[NOUN],"([NOUN], [ADV], [NaN])"
36078,217,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand Turn my face to the ...,away,[away],[ADV],"([ADV], [NaN], [NaN])"


In [14]:
# Save to CSV:

corpus_test.to_csv('corpus_tokmorph.csv')