# Text Preprocessing

In any machine learning task, cleaning or preprocessing the data is as important as building the model. Text is one of the least structured forms of data available, and human language is particularly complex to process. 
In this brief we work on preprocessing text data with [NLTK](http://www.nltk.org).

## Technology watch: Natural Language Processing (NLP)
1- NLP use cases in everyday life  
2- How Facebook, Google, and Amazon use NLP  
3- Preparing text data  

## Setup


In [10]:
# Import the required libraries
import nltk
from nltk.tokenize import word_tokenize

In [11]:
# Download the NLTK data (tokenizer models and stop-word lists)
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dhimb\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dhimb\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Data cleaning

In this part we use [NLTK](http://www.nltk.org) to clean a passage from [Wikipedia](https://en.wikipedia.org/wiki/Natural_language_processing) that defines NLP:  
"Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."

In [12]:
# Lowercase: convert the whole text to lowercase
text= 'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.'
# Tokenize the text into words
tokens = word_tokenize(text)

# Convert each token into lowercase
lowercase_tokens = [token.lower() for token in tokens]

# Join the lowercase tokens back into a single string
lowercase_text = ' '.join(lowercase_tokens)

print(lowercase_text)


natural language processing ( nlp ) is a subfield of linguistics , computer science , and artificial intelligence concerned with the interactions between computers and human language , in particular how to program computers to process and analyze large amounts of natural language data . the goal is a computer capable of `` understanding '' the contents of documents , including the contextual nuances of the language within them . the technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves .


In [13]:
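# Remove punctuation
import string
# Keep only tokens that are not found in string.punctuation.
# Note: `in` on a string is a substring test, so multi-character
# tokens made of quote marks are not removed by this filter.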
no_punctuation_tokens = [token for token in tokens if token not in string.punctuation]

# Join the tokens back into a single string
text_without_punctuation = ' '.join(no_punctuation_tokens)

print(text_without_punctuation)

Natural language processing NLP is a subfield of linguistics computer science and artificial intelligence concerned with the interactions between computers and human language in particular how to program computers to process and analyze large amounts of natural language data The goal is a computer capable of `` understanding '' the contents of documents including the contextual nuances of the language within them The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves
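
The output above still contains the two-character quote tokens (a pair of backticks and a pair of apostrophes). The test `token not in string.punctuation` checks whether the token is a substring of a string of single punctuation characters, so those two-character tokens never match. A minimal sketch of a stricter filter follows, assuming the goal is to drop every token made up entirely of punctuation characters. The helper name `is_punctuation` and the short token list are introduced here for illustration only.

```python
import string

# Set of single ASCII punctuation characters for fast membership tests
PUNCTUATION_CHARS = set(string.punctuation)


def is_punctuation(token: str) -> bool:
    """Return True when the token is non-empty and every character is ASCII punctuation."""
    return bool(token) and all(ch in PUNCTUATION_CHARS for ch in token)


# A few tokens like those produced by word_tokenize in the cells above
sample_tokens = ['capable', 'of', '``', 'understanding', "''", 'the', 'contents', ',']

kept = [token for token in sample_tokens if not is_punctuation(token)]
print(kept)
```

Words that merely contain punctuation, such as contractions, are kept because not all of their characters are punctuation.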


### Word Tokenization
Tokenization ([Tokenize](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual)) splits a character string into individual words, discarding spaces and tabs.


In [15]:
from nltk import word_tokenize
# Tokenize the text into words
tokens = word_tokenize(text)
print(tokens)

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.', 'The', 'goal', 'is', 'a', 'computer', 'capable', 'of', '``', 'understanding', "''", 'the', 'contents', 'of', 'documents', ',', 'including', 'the', 'contextual', 'nuances', 'of', 'the', 'language', 'within', 'them', '.', 'The', 'technology', 'can', 'then', 'accurately', 'extract', 'information', 'and', 'insights', 'contained', 'in', 'the', 'documents', 'as', 'well', 'as', 'categorize', 'and', 'organize', 'the', 'documents', 'themselves', '.']
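
To see what a tokenizer adds over plain whitespace splitting, here is a simplified sketch built on the standard `re` module. It is not the algorithm behind `word_tokenize`; the pattern and the function name `simple_tokenize` are assumptions made for this illustration.

```python
import re
from typing import List

# Simplified pattern: a run of word characters (optionally joined by
# apostrophes or hyphens), or any single character that is neither a
# word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")


def simple_tokenize(text: str) -> List[str]:
    """Split text into word tokens and single punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sentence = ("Natural language processing (NLP) is a subfield of linguistics, "
            "computer science, and artificial intelligence.")

# Whitespace splitting leaves punctuation glued to the words
print(sentence.split()[:4])

# The regex tokenizer separates the parentheses from the acronym
print(simple_tokenize(sentence)[:6])
```

With whitespace splitting the fourth item is `(NLP)`, whereas the regex version yields `(`, `NLP`, and `)` as separate tokens, which is closer to the NLTK output shown above.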


### Stopwords
Stop words are words that add little meaning to a text. Use NLTK to list the stop words and remove them from the text.

In [21]:
from nltk.corpus import stopwords # module for stop words that come with NLTK

# Get the English stop words
stop_words = set(stopwords.words('english'))
print(stop_words)


{'down', 'where', 'do', 'very', 'we', 'wouldn', 'too', 'into', 'shouldn', "doesn't", 'were', 'o', 're', 'such', 'don', 'below', 'only', 'myself', 'isn', 'own', 'or', 'our', 'doing', 'hers', 'd', 'now', 'are', 'did', 'up', 'until', 'y', 'should', 'not', 'am', "hasn't", 'they', 'a', 'few', 'each', 'her', 'at', 'him', 'ours', "it's", 'just', "hadn't", 'didn', 'nor', 'doesn', 'on', "you'll", 'he', 'other', 'ourselves', 's', 'having', 'i', 'here', 'you', 'this', 'can', 'about', "that'll", 'she', 'being', 'from', 'then', 'theirs', 'these', 'over', 'be', 'most', "needn't", "weren't", 'was', 'so', 'my', 'does', 'those', 'any', "don't", "couldn't", 'll', "won't", 'as', 'won', 'that', 'off', 'couldn', 'their', 'mightn', 'yours', 'above', 'its', 'before', 'been', "wasn't", 'why', 'in', "should've", 'mustn', 'further', 'have', 'wasn', 'whom', 'against', "isn't", "mustn't", 'both', 'yourself', 'himself', 'an', 'under', 'while', 'how', 'm', 'out', 'all', 'no', 'will', 'yourselves', "you'd", 'after',

In [22]:

# Remove the stop words (the comparison is done on the lowercased token)
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

# Join the tokens back into a string
filtered_text = ' '.join(filtered_tokens)

print(filtered_text)


Natural language processing ( NLP ) subfield linguistics , computer science , artificial intelligence concerned interactions computers human language , particular program computers process analyze large amounts natural language data . goal computer capable `` understanding '' contents documents , including contextual nuances language within . technology accurately extract information insights contained documents well categorize organize documents .
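
The filter above lowercases each token before the lookup because the entries in the stop-word list are lowercase. The sketch below shows the difference on a short token list. It uses a tiny hand-picked stop-word set, which is an assumption for illustration and much shorter than the NLTK list.

```python
# Hypothetical mini stop-word set; NLTK's English list is much longer
mini_stop_words = {"the", "is", "a", "of", "and"}

sample_tokens = ["The", "goal", "is", "a", "computer", "capable", "of", "understanding"]

# Case-sensitive lookup: the capitalized "The" slips through
case_sensitive = [t for t in sample_tokens if t not in mini_stop_words]

# Case-insensitive lookup: compare the lowercased token, keep the original form
case_insensitive = [t for t in sample_tokens if t.lower() not in mini_stop_words]

print(case_sensitive)
print(case_insensitive)
```

Only the second version removes the sentence-initial "The", which is why the notebook cell calls `token.lower()` inside the condition.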


### Stemming
Stemming is the process of reducing words to their root, base, or stem form ([Stemming](https://en.wikipedia.org/wiki/Stemming)). The resulting stem is not necessarily a dictionary word.

In [23]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))

Natural --> natur
language --> languag
processing --> process
( --> (
NLP --> nlp
) --> )
is --> is
a --> a
subfield --> subfield
of --> of
linguistics --> linguist
, --> ,
computer --> comput
science --> scienc
, --> ,
and --> and
artificial --> artifici
intelligence --> intellig
concerned --> concern
with --> with
the --> the
interactions --> interact
between --> between
computers --> comput
and --> and
human --> human
language --> languag
, --> ,
in --> in
particular --> particular
how --> how
to --> to
program --> program
computers --> comput
to --> to
process --> process
and --> and
analyze --> analyz
large --> larg
amounts --> amount
of --> of
natural --> natur
language --> languag
data --> data
. --> .
The --> the
goal --> goal
is --> is
a --> a
computer --> comput
capable --> capabl
of --> of
`` --> ``
understanding --> understand
'' --> ''
the --> the
contents --> content
of --> of
documents --> document
, --> ,
including --> includ
the --> the
contextual --> contextu
nuances 

In [24]:
# Snowball Stemmer 
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')


for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))

Natural --> natur
language --> languag
processing --> process
( --> (
NLP --> nlp
) --> )
is --> is
a --> a
subfield --> subfield
of --> of
linguistics --> linguist
, --> ,
computer --> comput
science --> scienc
, --> ,
and --> and
artificial --> artifici
intelligence --> intellig
concerned --> concern
with --> with
the --> the
interactions --> interact
between --> between
computers --> comput
and --> and
human --> human
language --> languag
, --> ,
in --> in
particular --> particular
how --> how
to --> to
program --> program
computers --> comput
to --> to
process --> process
and --> and
analyze --> analyz
large --> larg
amounts --> amount
of --> of
natural --> natur
language --> languag
data --> data
. --> .
The --> the
goal --> goal
is --> is
a --> a
computer --> comput
capable --> capabl
of --> of
`` --> ``
understanding --> understand
'' --> ''
the --> the
contents --> content
of --> of
documents --> document
, --> ,
including --> includ
the --> the
contextual --> contextu
nuances 
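
Both stemmers above map "computer" and "computers" to the same stem, "comput". The stem works as a grouping key and does not need to be a real word. The toy sketch below strips a fixed list of suffixes to show the idea. It is neither the Porter nor the Snowball algorithm; the suffix list, the minimum stem length, and the name `toy_stem` are made up for this illustration.

```python
# Suffixes are tried in order, longest variants first
SUFFIXES = ("ations", "ation", "ings", "ing", "ers", "er", "ed", "es", "s")


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Lowercase the word and strip the first matching suffix, if the
    remaining stem keeps at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for word in ["computer", "computers", "processing", "documents", "process"]:
    print(word + ' --> ' + toy_stem(word))
```

The toy rules turn "process" into "proces", while the Porter output above keeps "process". Handling cases like this one is what the additional rules in real stemmers are for.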

## Writing the functions

Implement each text preprocessing step as its own function.

In [8]:
# Lowercase: convert the whole text to lowercase

# Remove punctuation

# Tokenization

# Stopwords

# Stemming
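
One possible way to fill in the steps listed above is sketched below. To stay self-contained it relies only on the standard library: the tokenizer is a simple regular expression standing in for `word_tokenize`, and the stop-word set and the stemming function are passed in as parameters. All function names and the small demo stop-word set are assumptions for illustration, not a reference solution.

```python
import re
import string
from typing import Callable, Iterable, List


def to_lowercase(text: str) -> str:
    """Lowercase the whole text."""
    return text.lower()


def remove_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))


def tokenize(text: str) -> List[str]:
    """Split the text into runs of word characters (stand-in for word_tokenize)."""
    return re.findall(r"\w+", text)


def remove_stopwords(tokens: Iterable[str], stop_words: Iterable[str]) -> List[str]:
    """Drop tokens whose lowercase form is in stop_words."""
    stop_set = set(stop_words)
    return [t for t in tokens if t.lower() not in stop_set]


def stem_tokens(tokens: Iterable[str], stem: Callable[[str], str]) -> List[str]:
    """Apply the given stemming function to every token."""
    return [stem(t) for t in tokens]


def preprocess(text: str, stop_words: Iterable[str],
               stem: Callable[[str], str] = lambda w: w) -> List[str]:
    """Chain the five steps; by default the stemming step leaves tokens unchanged."""
    text = to_lowercase(text)
    text = remove_punctuation(text)
    tokens = tokenize(text)
    tokens = remove_stopwords(tokens, stop_words)
    return stem_tokens(tokens, stem)


# Hypothetical mini stop-word set, used only for this demo
demo_stop_words = {"the", "is", "a", "of"}
result = preprocess('The goal is a computer capable of "understanding" documents.',
                    demo_stop_words)
print(result)
```

In the notebook, the same `preprocess` function can be called with `set(stopwords.words('english'))` as `stop_words` and `PorterStemmer().stem` as `stem`, and the body of `tokenize` can be replaced by a call to `word_tokenize`.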

# What about Twitter messages !! :)

In this part we apply the text preprocessing steps to a dataset of Twitter messages.

In [27]:
import nltk                                # Python library for NLP
from nltk.corpus import twitter_samples    # sample Twitter dataset from NLTK
import matplotlib.pyplot as plt            # library for visualization
import random                              # pseudo-random number generator

In [28]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\dhimb\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [29]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

In [30]:
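# Print a random positive tweet in green.
# random.choice stays within the list bounds, whereas randint(0, 5000)
# can return 5000, which is one past the last valid index.
print('\033[92m' + random.choice(all_positive_tweets))

# Print a random negative tweet in red
print('\033[91m' + random.choice(all_negative_tweets))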

@Irenegilmour  Thanks for the flowers :) 
#flowers http://t.co/fh9a7oArCT
so many nasty, narrow minded people :(

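
Tweets like the ones printed above contain elements that ordinary text does not: user handles, hashtags, shortened URLs, and emoticons. A minimal cleaning sketch with the standard `re` module follows, assuming we want to drop URLs and handles, keep the hashtag word without its `#` sign, and leave emoticons untouched. The function name `clean_tweet` and the patterns are assumptions for illustration.

```python
import re


def clean_tweet(tweet: str) -> str:
    """Remove URLs and @handles, strip '#' signs, and collapse whitespace."""
    tweet = re.sub(r"https?://\S+", "", tweet)   # drop URLs
    tweet = re.sub(r"@\w+", "", tweet)           # drop user handles
    tweet = tweet.replace("#", "")               # keep the hashtag word, drop the sign
    return re.sub(r"\s+", " ", tweet).strip()    # collapse runs of whitespace


example = "@Irenegilmour  Thanks for the flowers :) \n#flowers http://t.co/fh9a7oArCT"
print(clean_tweet(example))
```

The emoticon `:)` survives this step. Removing all punctuation afterwards would destroy it, which matters for sentiment-labelled tweets; NLTK also provides `nltk.tokenize.TweetTokenizer`, which is designed to keep emoticons as single tokens.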