# Text Preprocessing

In any machine learning task, cleaning or preprocessing the data is as important as building the model. Text is one of the least structured forms of data available, and human language is particularly complex to process.  
In this brief we will work on preprocessing text data using [NLTK](http://www.nltk.org).

## Technology watch: Natural Language Processing (NLP)
1- Use cases of NLP in everyday life  
2- How Facebook, Google and Amazon use NLP  
3- Preparing text data  

## Setup


In [1]:
# Import the required libraries
import nltk

In [2]:
# Download the NLTK data (tokenizer models and stop word lists)
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hlakh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hlakh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

## Data cleaning

In this part we will use [NLTK](http://www.nltk.org) to clean a text from [Wikipedia](https://en.wikipedia.org/wiki/Natural_language_processing) that defines NLP:  
"Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."

In [18]:
# Lowercase: convert the whole text to lowercase
text= 'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.'

def lowercase_text(text):
    # Return a lowercase copy of the text (the input string is left unchanged)
    return text.lower()

# Using the function

lowercase_result = lowercase_text(text)
print(lowercase_result)

natural language processing (nlp) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. the goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. the technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.


In [14]:
# Remove punctuation
import string

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

# Example usage:

text_without_punctuation = remove_punctuation(text)
print(text_without_punctuation)

Natural language processing NLP is a subfield of linguistics computer science and artificial intelligence concerned with the interactions between computers and human language in particular how to program computers to process and analyze large amounts of natural language data The goal is a computer capable of understanding the contents of documents including the contextual nuances of the language within them The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves
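
The two steps above can be chained into a single helper. The sketch below is a minimal illustration that uses only the standard library; the name `clean_text` and the sample sentence are assumptions made for this example, not part of the original notebook. Note that `string.punctuation` only covers ASCII punctuation, so typographic quotes or dashes would be left in place.

```python
import string

def clean_text(text):
    """Lowercase the text, then strip ASCII punctuation."""
    lowered = text.lower()
    # Translation table that deletes every character in string.punctuation
    translator = str.maketrans('', '', string.punctuation)
    return lowered.translate(translator)

# Hypothetical sample sentence, shortened from the Wikipedia definition
sample = 'The goal is a computer capable of "understanding" documents.'
print(clean_text(sample))
```

Applying the lowercase step first keeps the helper easy to extend, since any later step sees the text in a single case.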


### Word Tokenization
Tokenization ([Tokenize](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual)) consists of splitting strings into individual words, without spaces or tabs.


In [24]:
from nltk import word_tokenize, sent_tokenize


def tokenize_text(text):
    # Use NLTK's tokenizer to split the text into words
    tokens = word_tokenize(text)
    return tokens

# Example usage of the function

tokens = tokenize_text(text)
print(tokens)

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.', 'The', 'goal', 'is', 'a', 'computer', 'capable', 'of', '``', 'understanding', "''", 'the', 'contents', 'of', 'documents', ',', 'including', 'the', 'contextual', 'nuances', 'of', 'the', 'language', 'within', 'them', '.', 'The', 'technology', 'can', 'then', 'accurately', 'extract', 'information', 'and', 'insights', 'contained', 'in', 'the', 'documents', 'as', 'well', 'as', 'categorize', 'and', 'organize', 'the', 'documents', 'themselves', '.']
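
As the output shows, `word_tokenize` keeps punctuation marks as separate tokens, including the quote tokens `` and ''. If only words are wanted, these can be filtered out afterwards. The sketch below is one possible way to do it with the standard library; the helper name `drop_punctuation_tokens` and the shortened token list are assumptions made for illustration.

```python
import string

def drop_punctuation_tokens(tokens):
    """Keep only tokens that contain at least one non-punctuation character."""
    return [tok for tok in tokens
            if not all(ch in string.punctuation for ch in tok)]

# Shortened, hand-copied excerpt of the token list printed above
sample_tokens = ['Natural', 'language', 'processing', '(', 'NLP', ')',
                 'of', '``', 'understanding', "''", 'them', '.']
print(drop_punctuation_tokens(sample_tokens))
```

Filtering on the tokens rather than on the raw string preserves word boundaries that the tokenizer has already worked out.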


### Stopwords
Stop words are words that do not add significant meaning to the text. Use NLTK to list the stop words and remove them from the text.

In [21]:
from nltk.corpus import stopwords        # stop word lists that ship with NLTK
from nltk.tokenize import word_tokenize

def remove_stopwords(tokens):
    # Load the French stop words.
    # Note: the sample text is English, so with this list only tokens that
    # also appear in the French list (here 'as') are removed;
    # stopwords.words('english') is the list suited to English text.
    french_stopwords = set(stopwords.words('french'))

    # Drop the stop words from the token list (case-insensitive comparison)
    filtered_tokens = [word for word in tokens if word.lower() not in french_stopwords]

    return filtered_tokens

# Using the function
tokens = word_tokenize(text)
filtered_tokens = remove_stopwords(tokens)
print(filtered_tokens)

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.', 'The', 'goal', 'is', 'a', 'computer', 'capable', 'of', '``', 'understanding', "''", 'the', 'contents', 'of', 'documents', ',', 'including', 'the', 'contextual', 'nuances', 'of', 'the', 'language', 'within', 'them', '.', 'The', 'technology', 'can', 'then', 'accurately', 'extract', 'information', 'and', 'insights', 'contained', 'in', 'the', 'documents', 'well', 'categorize', 'and', 'organize', 'the', 'documents', 'themselves', '.']
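
Because the cell above filters English text with the French list, almost every token survives. The sketch below shows the same filtering logic with the stop word set passed in as a parameter, so the list can match the language of the text. The function name `filter_stopwords` and the tiny hand-written stop word set are assumptions made for illustration; with NLTK, `set(stopwords.words('english'))` could be passed instead.

```python
def filter_stopwords(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

# Hand-written illustrative set, NOT the NLTK list.
# With NLTK: stop_words = set(stopwords.words('english'))
illustrative_stop_words = {'the', 'is', 'a', 'of', 'and'}

sample_tokens = ['The', 'goal', 'is', 'a', 'computer', 'capable',
                 'of', 'understanding', 'documents']
print(filter_stopwords(sample_tokens, illustrative_stop_words))
```

Comparing on `tok.lower()` means that capitalized stop words at the start of a sentence are removed as well, while the surviving tokens keep their original case.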


### Stemming
Stemming is the process of reducing words to their root, base or stem form ([Stemming](https://en.wikipedia.org/wiki/Stemming)).

In [23]:
from nltk.stem import SnowballStemmer

def stem_words(tokens):
    # Initialize the stemmer for French.
    # Note: the sample text is English, which explains stems such as 'docu'
    # for 'documents'; SnowballStemmer('english') is the stemmer suited to it.
    stemmer = SnowballStemmer('french')

    # Stem each word in the list
    stemmed_words = [stemmer.stem(word) for word in tokens]

    return stemmed_words

# Using the function
tokens = word_tokenize(text)
stemmed_tokens = stem_words(tokens)
print(stemmed_tokens)

['natural', 'languag', 'processing', '(', 'nlp', ')', 'is', 'a', 'subfield', 'of', 'linguistic', ',', 'comput', 'scienc', ',', 'and', 'artificial', 'intelligent', 'concerned', 'with', 'the', 'interact', 'between', 'computer', 'and', 'human', 'languag', ',', 'in', 'particular', 'how', 'to', 'program', 'computer', 'to', 'process', 'and', 'analyz', 'larg', 'amount', 'of', 'natural', 'languag', 'dat', '.', 'the', 'goal', 'is', 'a', 'comput', 'capabl', 'of', '``', 'understanding', "''", 'the', 'content', 'of', 'docu', ',', 'including', 'the', 'contextual', 'nuanc', 'of', 'the', 'languag', 'within', 'them', '.', 'the', 'technology', 'can', 'then', 'accurately', 'extract', 'inform', 'and', 'insight', 'contained', 'in', 'the', 'docu', 'as', 'wel', 'as', 'categoriz', 'and', 'organiz', 'the', 'docu', 'themselv', '.']
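
The stems above come from the French Snowball stemmer applied to English text. For comparison, here is a minimal sketch that uses the English variant of the same NLTK class; the function name `stem_english` and the three sample words are assumptions made for this example.

```python
from nltk.stem import SnowballStemmer

def stem_english(tokens):
    """Stem each token with the English Snowball stemmer."""
    stemmer = SnowballStemmer('english')
    return [stemmer.stem(tok) for tok in tokens]

# A few words taken from the sample text
print(stem_english(['documents', 'computers', 'languages']))
```

The stemmer works from suffix rules rather than a dictionary, so the result is not always a real word: 'computers' becomes 'comput', for instance.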


# What about Twitter messages !! :)

In this part we will apply the text preprocessing steps to a dataset of Twitter messages.

In [25]:
import nltk                                # Python library for NLP
from nltk.corpus import twitter_samples    # sample Twitter dataset from NLTK
import matplotlib.pyplot as plt            # library for visualization
import random                              # pseudo-random number generator

In [26]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\hlakh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\twitter_samples.zip.


True

In [27]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

In [28]:
# print a random positive tweet in green
# (random.choice never produces an out-of-range index, unlike randint(0, 5000))
print('\033[92m' + random.choice(all_positive_tweets))

# print a random negative tweet in red
print('\033[91m' + random.choice(all_negative_tweets))

@CjayBlanco  follow @jnlazts &amp; http://t.co/RCvcYYO0Iq follow u back :)
@misses0wl still sad that they kicked out the epic motherfucker :(
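
Tweets contain handles, links and emoticons that a general-purpose word tokenizer tends to split badly. The sketch below is one possible first step, assuming NLTK's `TweetTokenizer` is used for tokenization and a simple regular expression is enough to drop links; the function name `tokenize_tweet` and the URL pattern are assumptions made for illustration, not the notebook's own method.

```python
import re
from nltk.tokenize import TweetTokenizer

def tokenize_tweet(tweet):
    """Remove links, then tokenize with NLTK's tweet-aware tokenizer."""
    # Drop hyperlinks (simple pattern, assumed sufficient for this sketch)
    tweet = re.sub(r'https?://\S+', '', tweet)
    tokenizer = TweetTokenizer(
        preserve_case=False,   # lowercase the tokens
        strip_handles=True,    # remove @username mentions
        reduce_len=True,       # shorten long runs of repeated characters
    )
    return tokenizer.tokenize(tweet)

# Positive tweet shown in the output above
sample_tweet = '@CjayBlanco  follow @jnlazts &amp; http://t.co/RCvcYYO0Iq follow u back :)'
print(tokenize_tweet(sample_tweet))
```

The resulting tokens can then go through the same stop word removal and stemming steps as the Wikipedia text, while emoticons such as `:)` stay intact as single tokens.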
