# **<span >Step 1 - Preprocessing</span>**

As you have seen, preprocessing can be a key step when working with text. You will now learn how to clean and normalize a corpus of documents in a few steps:
- Convert the text to lowercase
- Remove special characters and accents
- Stem the words
- Remove stopwords

## **<span >Importing the libraries used in this module</span>**


unidecode, re, nltk

unidecode: transliterates special characters to their closest ASCII equivalents  
nltk (Natural Language Toolkit): analyzes and manipulates natural language  
re (regular expressions): finds, extracts, or cleans specific text patterns.

In [41]:
from unidecode import unidecode
import re
from nltk.stem import SnowballStemmer


In [8]:
print("let's go")

let's go


## 1. Creating and exploring the corpus

The goal is to normalize this corpus step by step. It is a blog article, published in March 2019, about Lyft's investment in Chicago to extend its Divvy service to more areas of the city. The article discusses Lyft's growth and its relationship with Chicago's city hall.

The corpus consists of a single article. We open it so that we can work on it.

In [9]:
with open("Article de presse.txt","r",encoding="utf8") as f:
    r=f.read()
corpus=[r]
s=len(corpus)

print(f"The corpus contains {s} document{'' if s==1 else 's'}.")

The corpus contains 1 document.


In [10]:
texte=corpus[0]
print(type(corpus))
print(type(texte))

<class 'list'>
<class 'str'>


Since our corpus contains only one document, we extract it into a string variable (`texte`) so that we do not have to work with a list.

Let's measure the length of our document: splitting on whitespace gives an approximate count of the words it contains.

In [11]:


liste_mots=texte.split()

print(f"The corpus is {len(liste_mots)} words long")
print(texte)


The corpus is 839 words long


Lyft May Spend $50M to Expand Divvy to All Wards, Using Dockless-Option eBikes


By John Greenfield

6:03 PM UTC−5 on March 12, 2019


**************************************************************

url: https://chi.streetsblog.org/2019/03/12/lyft-may-spend-50m-to-expand-divvy-to-all-wards-using-dockless-option-ebikes

**************************************************************

Out of the blue, the Divvy bike system may be expanding citywide. The city of Chicago is expected to announce tomorrow that the existing Divvy contract will be amended to make Lyft, the parent company of Divvy concessionaire Motivate, the sponsor of the network, replacing current sponsor Blue Cross Blue Shield of Illinois. As part of the deal, which will require City Council approval, Lyft would spend $50 million on stations and bikes to bring the bike-share service to neighborhoods that currently lack docks. The city says it would also receive at least $77 million in 

Looking at the string that was generated, we notice that it is HTML text. It therefore needs to be processed to obtain usable text.

In [12]:
texte

'\n\nLyft May Spend $50M to Expand Divvy to All Wards, Using Dockless-Option eBikes\n\n\nBy John Greenfield\n\n6:03 PM UTC−5 on March 12, 2019\n\n\n**************************************************************\n\nurl: https://chi.streetsblog.org/2019/03/12/lyft-may-spend-50m-to-expand-divvy-to-all-wards-using-dockless-option-ebikes\n\n**************************************************************\n\nOut of the blue, the Divvy bike system may be expanding citywide. The city of Chicago is expected to announce tomorrow that the existing Divvy contract will be amended to make Lyft, the parent company of Divvy concessionaire Motivate, the sponsor of the network, replacing current sponsor Blue Cross Blue Shield of Illinois. As part of the deal, which will require City Council approval, Lyft would spend $50 million on stations and bikes to bring the bike-share service to neighborhoods that currently lack docks. The city says it would also receive at least $77 million in sponsorship money fro

We will use the "Beautiful Soup" package to convert the HTML text into plain text that can be used for text analysis.

In [13]:
import bs4
import html
import re

# Remove the HTML tags:

soup = bs4.BeautifulSoup(texte,"lxml")
text=soup.get_text(separator="\n")

# Decode HTML entities (&nbsp;, &amp;, etc.):

text=html.unescape(text)

# Remove URLs.
# re.sub is the tool for this:
# re.sub(pattern, repl, string, count=0, flags=0)

text=re.sub(r"http\S+","",text)

# Collapse consecutive line breaks into a single one:
text=re.sub(r"\n+","\n",text)


text=text.strip()
text

'Lyft May Spend $50M to Expand Divvy to All Wards, Using Dockless-Option eBikes\nBy John Greenfield\n6:03 PM UTC−5 on March 12, 2019\n**************************************************************\nurl: \n**************************************************************\nOut of the blue, the Divvy bike system may be expanding citywide. The city of Chicago is expected to announce tomorrow that the existing Divvy contract will be amended to make Lyft, the parent company of Divvy concessionaire Motivate, the sponsor of the network, replacing current sponsor Blue Cross Blue Shield of Illinois. As part of the deal, which will require City Council approval, Lyft would spend $50 million on stations and bikes to bring the bike-share service to neighborhoods that currently lack docks. The city says it would also receive at least $77 million in sponsorship money from Lyft over nine years.\nThe deal could be good news for making the system more equitable. There are currently a higher density of stat
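
The last three steps of the cell above (entity decoding, URL removal, line-break collapsing) can be tried in isolation. The sketch below uses only the standard library; the `clean_fragment` helper and the `sample` string are made up for illustration and are not part of the notebook.

```python
import html
import re


def clean_fragment(raw: str) -> str:
    """Decode HTML entities, drop URLs, and collapse repeated line breaks."""
    text = html.unescape(raw)              # "&amp;" becomes "&"
    text = re.sub(r"http\S+", "", text)    # drop "http..." up to the next whitespace
    text = re.sub(r"\n+", "\n", text)      # several consecutive "\n" become one
    return text.strip()


# Hypothetical input, for demonstration only.
sample = "\n\nBikes &amp; docks\n\n\nurl: https://example.org/page\n\nBody text.\n"
print(repr(clean_fragment(sample)))
```

Note that the `url: ` label survives because only the address itself matches `http\S+`. The same thing happens in the notebook output, which is why the header is removed by slicing a few cells later.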

In [14]:
# Print the cleaned text to inspect it:

print(text)

Lyft May Spend $50M to Expand Divvy to All Wards, Using Dockless-Option eBikes
By John Greenfield
6:03 PM UTC−5 on March 12, 2019
**************************************************************
url: 
**************************************************************
Out of the blue, the Divvy bike system may be expanding citywide. The city of Chicago is expected to announce tomorrow that the existing Divvy contract will be amended to make Lyft, the parent company of Divvy concessionaire Motivate, the sponsor of the network, replacing current sponsor Blue Cross Blue Shield of Illinois. As part of the deal, which will require City Council approval, Lyft would spend $50 million on stations and bikes to bring the bike-share service to neighborhoods that currently lack docks. The city says it would also receive at least $77 million in sponsorship money from Lyft over nine years.
The deal could be good news for making the system more equitable. There are currently a higher density of stations dow

In [15]:
# What is the type of the variable text?

type(text)

str

In [16]:
# Remove the remaining line breaks (\n):
# we convert them to spaces.
text_2=re.sub(r"\n"," ",text)
text_2=text_2.strip()
print(text_2)

Lyft May Spend $50M to Expand Divvy to All Wards, Using Dockless-Option eBikes By John Greenfield 6:03 PM UTC−5 on March 12, 2019 ************************************************************** url:  ************************************************************** Out of the blue, the Divvy bike system may be expanding citywide. The city of Chicago is expected to announce tomorrow that the existing Divvy contract will be amended to make Lyft, the parent company of Divvy concessionaire Motivate, the sponsor of the network, replacing current sponsor Blue Cross Blue Shield of Illinois. As part of the deal, which will require City Council approval, Lyft would spend $50 million on stations and bikes to bring the bike-share service to neighborhoods that currently lack docks. The city says it would also receive at least $77 million in sponsorship money from Lyft over nine years. The deal could be good news for making the system more equitable. There are currently a higher density of stations dow

In [17]:
# We want to remove everything between the article title and the start of the article body.
# Find the index where the title ends (where the byline starts) and the index where the body starts:
print(text_2.index("By John Greenfield 6:03"))
print(text_2.index("Out of the blue,"))

79
262


In [18]:
# To clean up, we build a new variable that keeps only the text we want (text_2[:78] is the title):

text_3= text_2[:78]+ ". " + text_2[262:]
text_3

'Lyft May Spend $50M to Expand Divvy to All Wards, Using Dockless-Option eBikes. Out of the blue, the Divvy bike system may be expanding citywide. The city of Chicago is expected to announce tomorrow that the existing Divvy contract will be amended to make Lyft, the parent company of Divvy concessionaire Motivate, the sponsor of the network, replacing current sponsor Blue Cross Blue Shield of Illinois. As part of the deal, which will require City Council approval, Lyft would spend $50 million on stations and bikes to bring the bike-share service to neighborhoods that currently lack docks. The city says it would also receive at least $77 million in sponsorship money from Lyft over nine years. The deal could be good news for making the system more equitable. There are currently a higher density of stations downtown and in more affluent neighborhoods, and many outlying communities don\'t have docks at all. The most obvious downside of the the deal is that, with Lyft replacing Blue Cross B
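
The slice above hard-codes the positions 78 and 262 found with `index`. A minimal sketch of the same idea without literal numbers is shown below; the helper name `drop_between`, its `joiner` parameter, and the demo string are assumptions introduced for illustration.

```python
def drop_between(text: str, start_marker: str, end_marker: str, joiner: str = ". ") -> str:
    """Remove everything from start_marker up to (not including) end_marker."""
    start = text.index(start_marker)       # raises ValueError if the marker is missing
    end = text.index(end_marker, start)    # search for the end marker after the start
    return text[:start].rstrip() + joiner + text[end:]


# Hypothetical miniature of the article header, for demonstration only.
demo = "Some Title By Jane Doe 6:03 PM **** url:  **** Out of the blue, it begins."
print(drop_between(demo, "By Jane Doe", "Out of the blue,"))
```

On the notebook's data, the equivalent call would presumably be `drop_between(text_2, "By John Greenfield 6:03", "Out of the blue,")`. Computing the indices at run time keeps the cell valid if the upstream cleaning changes the string length.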

## 2. Cleaning the corpus

### 2.1. Lowercasing:

We first convert the whole text to lowercase.

In [19]:
# The variable text_4 will hold our cleaned text:
text_4=text_3.lower()
text_4

'lyft may spend $50m to expand divvy to all wards, using dockless-option ebikes. out of the blue, the divvy bike system may be expanding citywide. the city of chicago is expected to announce tomorrow that the existing divvy contract will be amended to make lyft, the parent company of divvy concessionaire motivate, the sponsor of the network, replacing current sponsor blue cross blue shield of illinois. as part of the deal, which will require city council approval, lyft would spend $50 million on stations and bikes to bring the bike-share service to neighborhoods that currently lack docks. the city says it would also receive at least $77 million in sponsorship money from lyft over nine years. the deal could be good news for making the system more equitable. there are currently a higher density of stations downtown and in more affluent neighborhoods, and many outlying communities don\'t have docks at all. the most obvious downside of the the deal is that, with lyft replacing blue cross b

### <span >2.2. Removing accents:</span>

`unidecode` transliterates each character to its closest ASCII equivalent (for example, `é` becomes `e`).

In [20]:
text_4=unidecode(text_4)

text_4

'lyft may spend $50m to expand divvy to all wards, using dockless-option ebikes. out of the blue, the divvy bike system may be expanding citywide. the city of chicago is expected to announce tomorrow that the existing divvy contract will be amended to make lyft, the parent company of divvy concessionaire motivate, the sponsor of the network, replacing current sponsor blue cross blue shield of illinois. as part of the deal, which will require city council approval, lyft would spend $50 million on stations and bikes to bring the bike-share service to neighborhoods that currently lack docks. the city says it would also receive at least $77 million in sponsorship money from lyft over nine years. the deal could be good news for making the system more equitable. there are currently a higher density of stations downtown and in more affluent neighborhoods, and many outlying communities don\'t have docks at all. the most obvious downside of the the deal is that, with lyft replacing blue cross b
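
To see accent removal on text that actually contains accents, here is a small sketch. It deliberately uses the standard library's `unicodedata` instead of `unidecode`, so it is an alternative technique rather than the notebook's method: it only strips combining marks and, unlike `unidecode`, does not transliterate characters that have no decomposition. The function name `strip_accents` is an assumption.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks after compatibility decomposition (stdlib only)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("préprocessing à Chicago"))
```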

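### 2.3. Transforming expressions:

We replace every sequence of four digits (typically a year) with the placeholder token `annee` (French for "year").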

In [21]:
text_4=re.sub(r'[0-9]{4}','annee',text_4)
text_4

'lyft may spend $50m to expand divvy to all wards, using dockless-option ebikes. out of the blue, the divvy bike system may be expanding citywide. the city of chicago is expected to announce tomorrow that the existing divvy contract will be amended to make lyft, the parent company of divvy concessionaire motivate, the sponsor of the network, replacing current sponsor blue cross blue shield of illinois. as part of the deal, which will require city council approval, lyft would spend $50 million on stations and bikes to bring the bike-share service to neighborhoods that currently lack docks. the city says it would also receive at least $77 million in sponsorship money from lyft over nine years. the deal could be good news for making the system more equitable. there are currently a higher density of stations downtown and in more affluent neighborhoods, and many outlying communities don\'t have docks at all. the most obvious downside of the the deal is that, with lyft replacing blue cross b
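
One thing worth knowing about the pattern `[0-9]{4}` is that it also matches four digits inside a longer number. The sketch below compares it with a stricter variant; the stricter pattern and the sample sentence are assumptions for illustration, not what the notebook uses.

```python
import re

# Hypothetical sentence, for demonstration only.
sample = "in 2019 the city counted 12345 trips"

# Pattern used in the notebook: any run of four digits, even inside a longer number.
loose = re.sub(r"[0-9]{4}", "annee", sample)

# Assumed stricter variant: a standalone four-digit number starting with 19 or 20.
strict = re.sub(r"\b(?:19|20)[0-9]{2}\b", "annee", sample)

print(loose)
print(strict)
```

Whether the stricter form is preferable depends on the corpus; in this article the difference matters little because all remaining digits are removed in the next step.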

### 2.4. Removing digits and special characters:

In [22]:
text_4=re.sub(r'[^a-z]+',' ',text_4)
text_4

'lyft may spend m to expand divvy to all wards using dockless option ebikes out of the blue the divvy bike system may be expanding citywide the city of chicago is expected to announce tomorrow that the existing divvy contract will be amended to make lyft the parent company of divvy concessionaire motivate the sponsor of the network replacing current sponsor blue cross blue shield of illinois as part of the deal which will require city council approval lyft would spend million on stations and bikes to bring the bike share service to neighborhoods that currently lack docks the city says it would also receive at least million in sponsorship money from lyft over nine years the deal could be good news for making the system more equitable there are currently a higher density of stations downtown and in more affluent neighborhoods and many outlying communities don t have docks at all the most obvious downside of the the deal is that with lyft replacing blue cross blue shield of illinois as th
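
The pattern `[^a-z]+` replaces every run of characters other than lowercase letters with a single space. A short sketch on a made-up sentence shows why isolated letters such as `m` and `t` appear in the output above.

```python
import re

# Hypothetical sentence, for demonstration only.
sample = "don't spend $50m on bike-share"

cleaned = re.sub(r"[^a-z]+", " ", sample)
print(cleaned)

# Single letters left behind by the substitution.
leftovers = [token for token in cleaned.split() if len(token) == 1]
print(leftovers)
```

These leftover single letters are the reason the alphabet is added to the stopword list in the next section.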

### 2.5. Removing all stopwords:

We first retrieve a list of English stopwords, then remove them from the corpus.  
To do so, we rely on the stopword list defined by spaCy.  
https://github.com/explosion/spaCy/blob/master/spacy/lang/en/stop_words.py

In [23]:
# List of English stopwords.
# The list is stored as a single string variable:
stop_words=("""a about above across after afterwards again against all almost alone along
already also although always am among amongst amount an and another any anyhow
anyone anything anyway anywhere are around as at

back be became because become becomes becoming been before beforehand behind
being below beside besides between beyond both bottom but by

call can cannot ca could

did do does doing done down due during

each eight either eleven else elsewhere empty enough even ever every
everyone everything everywhere except

few fifteen fifty first five for former formerly forty four from front full
further

get give go

had has have he hence her here hereafter hereby herein hereupon hers herself
him himself his how however hundred

i if in indeed into is it its itself

keep

last latter latterly least less

just

made make many may me meanwhile might mine more moreover most mostly move much
must my myself

name namely neither never nevertheless next nine no nobody none noone nor not
nothing now nowhere

of off often on once one only onto or other others otherwise our ours ourselves
out over own

part per perhaps please put

quite

rather re really regarding

same say see seem seemed seeming seems serious several she should show side
since six sixty so some somehow someone something sometime sometimes somewhere
still such

take ten than that the their them themselves then thence there thereafter
thereby therefore therein thereupon these they third this those though three
through throughout thru thus to together too top toward towards twelve twenty
two

under until up unless upon us used using

various very very via was we well were what whatever when whence whenever where
whereafter whereas whereby wherein whereupon wherever whether which while
whither who whoever whole whom whose why will with within without would

yet you your yours yourself yourselves""")

type (stop_words)
print(stop_words)

a about above across after afterwards again against all almost alone along
already also although always am among amongst amount an and another any anyhow
anyone anything anyway anywhere are around as at

back be became because become becomes becoming been before beforehand behind
being below beside besides between beyond both bottom but by

call can cannot ca could

did do does doing done down due during

each eight either eleven else elsewhere empty enough even ever every
everyone everything everywhere except

few fifteen fifty first five for former formerly forty four from front full
further

get give go

had has have he hence her here hereafter hereby herein hereupon hers herself
him himself his how however hundred

i if in indeed into is it its itself

keep

last latter latterly least less

just

made make many may me meanwhile might mine more moreover most mostly move much
must my myself

name namely neither never nevertheless next nine no nobody none noone nor not
nothing now now

In [27]:
# This string has to be processed to turn it into a list:
# 1. Replace the line breaks with spaces:
sstop_words=stop_words.replace("\n"," ")

# 2. Build a list from the words in this variable:
stlist= sstop_words.split()

print(stlist)

['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'ca', 'could', 'did', 'do', 'does', 'doing', 'done', 'down', 'due', 'during', 'each', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon

In [39]:
# We can also remove the single letters of the alphabet that are not already in this list:
alphaB=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 
             'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

# Append alphaB to stlist, the list of English stopwords:
stlist=stlist+alphaB
stlist

['a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'call',
 'can',
 'cannot',
 'ca',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'front',
 'full',
 'further',
 'get',
 'give',
 'g

In [40]:
text_5=""
for word in text_4.split():
    if word not in stlist:
        text_5=text_5 + " "+ word

text_5

' lyft spend expand divvy wards dockless option ebikes blue divvy bike system expanding citywide city chicago expected announce tomorrow existing divvy contract amended lyft parent company divvy concessionaire motivate sponsor network replacing current sponsor blue cross blue shield illinois deal require city council approval lyft spend million stations bikes bring bike share service neighborhoods currently lack docks city says receive million sponsorship money lyft years deal good news making system equitable currently higher density stations downtown affluent neighborhoods outlying communities don docks obvious downside deal lyft replacing blue cross blue shield illinois sponsor bikes bear logo ride share service studies shown ride share increasing traffic deaths congestion cities decreasing transit ridership somewhat problematic thousands divvy bikes double ads lyft addition amendment freeze bike share providers operating chicago streets later administration priority create variety 
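
The loop above builds the result by repeated string concatenation, which leaves a leading space, and it tests membership in a list. A possible variant, sketched below with a hypothetical helper `remove_stopwords` and a tiny made-up stopword list, uses a set for faster lookups and `" ".join` to avoid the leading space.

```python
def remove_stopwords(text: str, stopwords) -> str:
    """Return text without the words found in stopwords, joined by single spaces."""
    stop_set = set(stopwords)   # set membership does not scan the whole collection
    kept = [word for word in text.split() if word not in stop_set]
    return " ".join(kept)


# Tiny hypothetical stopword list, for demonstration only.
demo_stopwords = ["the", "of", "to", "a"]
print(remove_stopwords("the city of chicago plans to expand a bike network", demo_stopwords))
```

With the notebook's variables, this would presumably be called as `remove_stopwords(text_4, stlist)`.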

### 2.6. Stemming:  

We finish by stemming the document.

In [48]:
stemmer = SnowballStemmer("english")
text_6=""

for word in text_5.split():
    text_6=text_6+" "+ stemmer.stem(word)

text_6

' lyft spend expand divvi ward dockless option ebik blue divvi bike system expand citywid citi chicago expect announc tomorrow exist divvi contract amend lyft parent compani divvi concessionair motiv sponsor network replac current sponsor blue cross blue shield illinoi deal requir citi council approv lyft spend million station bike bring bike share servic neighborhood current lack dock citi say receiv million sponsorship money lyft year deal good news make system equit current higher densiti station downtown affluent neighborhood out communiti don dock obvious downsid deal lyft replac blue cross blue shield illinoi sponsor bike bear logo ride share servic studi shown ride share increas traffic death congest citi decreas transit ridership somewhat problemat thousand divvi bike doubl ad lyft addit amend freez bike share provid oper chicago street later administr prioriti creat varieti high qualiti reliabl transport option chicagoan visitor want mayor rahm emanuel said statement divvi pro
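
The word-by-word pattern used above can be written as a small function that accepts any stemming callable. In the sketch below, `stem_text` is a hypothetical helper and `toy_stem` is a deliberately crude stand-in that only strips two suffixes; it is not the Snowball algorithm and is there only so the example runs without NLTK.

```python
from typing import Callable


def stem_text(text: str, stem: Callable[[str], str]) -> str:
    """Apply stem to every whitespace-separated word and re-join with spaces."""
    return " ".join(stem(word) for word in text.split())


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips 'ing' or a final 's'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(stem_text("expanding bikes stations", toy_stem))
```

To reproduce the notebook's behaviour, pass `SnowballStemmer("english").stem` as the `stem` argument instead of `toy_stem`.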

In [49]:
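# Vocabulary before (text_5) and after (text_6) stemming:
set1=set(text_5.split())
set2=set(text_6.split())

# Stems that do not appear as words in the unstemmed text, i.e. forms created by the stemmer:
diff =set2-set1
s3=" ".join(diff)
s3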

'reduc signific visitor product equiti substanti street assert ad say extens upcom account freez capabl thousand exceed opportun licens train continu dock motiv incom oper electr loss prioriti ebik revenu adapt price chicagoan hardwar cycl disabl heavi densiti target offici requir qualiti decreas equip concessionair death ward addit reliabl guarante restructur make chang help ensur peopl citi own park varieti provid hope remain citywid increas congest problemat promot standard illinoi equit offend elimin detail neighborhood penalti purchas divvi out creat anne servic hate financi exclus scrutini compani oblig transport day announc expect exist abil dramat replac downsid approv receiv amend communiti administr lock studi advertis doubl updat perform propos modern'
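
The set difference `set2 - set1` keeps the stems that do not occur as words in the unstemmed text, which is a quick way to see what the stemmer changed. A minimal sketch on two made-up word lists follows; `sorted` is used only to make the printed order deterministic, since sets are unordered.

```python
# Hypothetical vocabularies before and after stemming, for demonstration only.
before = set("city bike share service".split())
after = set("citi bike share servic".split())

new_forms = sorted(after - before)   # stems that are not words of the original text
unchanged = sorted(after & before)   # words the stemmer left as they were

print(new_forms)
print(unchanged)
```

A stem that happens to coincide with another word already present in the text would not show up in the difference, so this view slightly undercounts the modified words.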

### End of document cleaning.