# Text Preprocessing

In any machine learning task, cleaning and preprocessing the data is as important as building the model. Text is one of the least structured forms of data, and human language is complex and ambiguous, so it needs careful preparation before it can be modeled. 
In this brief we will preprocess textual data using [NLTK](http://www.nltk.org).

## Technology watch: Natural Language Processing (NLP)
1- How NLP is used in our lives  
2- How Facebook, Google and Amazon use NLP  
3- Text data preparation   

## Setup


In [None]:
# Import the necessary libraries

import nltk, re

In [None]:
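# Download the NLTK data used below
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('punkt_tab')  # looked up instead of 'punkt' by recent NLTK releases (3.9+)
nltk.download('stopwords')  # stop word lists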

## Data cleaning

In this section we will use [NLTK](http://www.nltk.org) to clean a text from [Wikipedia](https://en.wikipedia.org/wiki/Natural_language_processing) that defines NLP:

"Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."

In [None]:
# Lowercase: Put all text in lower case

text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."""

text = text.lower()

print(text)

In [None]:
# Remove punctuation: delete every character that is neither
# a word character (\w) nor whitespace (\s)

text = re.sub(r'[^\w\s]', '', text)
 
print(text)
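
It helps to see exactly what this pattern removes. The sketch below applies the same regular expression to a made-up sentence (the sentence is an assumption for illustration, not part of the brief). Apostrophes and hyphens disappear, so contractions and hyphenated words are fused, while the underscore survives because `\w` matches it.

```python
import re

# Hypothetical sample sentence, chosen only to show edge cases
sample = "Don't split state-of-the-art models, e.g. snake_case names!"

# Same pattern as above: drop everything that is not a word character or whitespace
cleaned = re.sub(r'[^\w\s]', '', sample.lower())

print(cleaned)  # dont split stateoftheart models eg snake_case names
```

If fused words such as `stateoftheart` are a problem for a given task, replacing punctuation with a space instead of an empty string is one possible alternative.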

### Word Tokenization

Tokenization ([Tokenize](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual)) consists of splitting a string into individual tokens, here words, with the spaces and tabs between them discarded.


In [None]:
# Word Tokenization

# text = text.split(' ')

# OR

text = nltk.tokenize.word_tokenize(text)
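
The two options in this cell are not interchangeable. As a rough illustration that needs no NLTK data, the sketch below compares `split(' ')`, `split()` and a simple regular-expression tokenizer on a made-up string. The regex is a stand-in chosen for the example, not what `word_tokenize` does internally.

```python
import re

# Hypothetical input containing a double space and a line break
sample = "natural  language\nprocessing"

# Splitting on a single space keeps an empty string and does not split on the newline
by_space = sample.split(' ')
print(by_space)        # ['natural', '', 'language\nprocessing']

# split() with no argument splits on any run of whitespace
by_whitespace = sample.split()
print(by_whitespace)   # ['natural', 'language', 'processing']

# A simple regex tokenizer: every run of word characters is one token
by_regex = re.findall(r'\w+', sample)
print(by_regex)        # ['natural', 'language', 'processing']
```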

### Stopwords
Stop words are very common words (such as "the", "is" or "of") that add little meaning to the text on their own and are usually removed before analysis. 

In [None]:
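# Use NLTK to list the English stop words and remove them from the text.
# Building the set once avoids reloading the list for every word.

stop_words = set(nltk.corpus.stopwords.words('english'))
text = [word for word in text if word not in stop_words] 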
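
To make the filtering step concrete without downloading the corpus, here is a minimal sketch with a tiny hand-picked stop list. The set `toy_stop_words` and the token list are assumptions for illustration; NLTK's English list is much longer.

```python
# Hypothetical, hand-picked stop list (NLTK's English list is much longer)
toy_stop_words = {"is", "a", "of", "and", "the", "with"}

# Tokens as they might look after lowercasing and tokenization
tokens = ["nlp", "is", "a", "subfield", "of", "linguistics", "and", "computer", "science"]

# A set gives fast membership tests, unlike scanning a list for every word
filtered = [word for word in tokens if word not in toy_stop_words]

print(filtered)  # ['nlp', 'subfield', 'linguistics', 'computer', 'science']
```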

### Stemming
Stemming is the process of reducing words to their root or base form ([Stemming](https://en.wikipedia.org/wiki/Stemming)).

In [None]:
# Stemming: reuse a single PorterStemmer instead of creating one per word

stemmer = nltk.stem.PorterStemmer()
text = [stemmer.stem(word) for word in text] 
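
To give an intuition for what a stemmer does, the sketch below is a deliberately naive suffix stripper. It is not the Porter algorithm used above: the function `naive_stem` and its suffix list are assumptions made up for this illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ["computers", "processing", "languages", "is"]
stems = [naive_stem(word) for word in words]

print(stems)  # ['computer', 'process', 'languag', 'is']
```

A real stemmer applies a longer, ordered set of rules, but the idea is the same: the stem is a shared key for related word forms and is not necessarily a dictionary word, as `languag` shows.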

## Function development

In [None]:
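# Group all the text preprocessing steps in a single function

def text_preproc(text):
    import nltk, re

    # Make sure the NLTK data is available (packages already up to date are skipped)
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)  # looked up by recent NLTK releases (3.9+)
    nltk.download('stopwords', quiet=True)

    # Lowercase, then remove punctuation
    text = re.sub(r'[^\w\s]', '', text.lower())

    # Split into word tokens
    tokens = nltk.tokenize.word_tokenize(text)

    # Remove stop words (set built once for fast lookup)
    stop_words = set(nltk.corpus.stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Reduce each remaining word to its stem
    stemmer = nltk.stem.PorterStemmer()
    return [stemmer.stem(word) for word in tokens]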

# What about Twitter messages !! :)

In this part we will apply the text preprocessing steps to a dataset of Twitter messages.

In [None]:
# Import Python library for NLP

import nltk, random
from nltk.corpus import twitter_samples

# Import Sample Twitter dataset from NLTK

nltk.download('twitter_samples') 

# Import library for visualization

import matplotlib.pyplot as plt 

In [None]:
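# Select the sets of positive and negative tweets

data_negative = twitter_samples.strings("negative_tweets.json")
data_positive = twitter_samples.strings("positive_tweets.json")

# Show the size of each set (a bare expression is only displayed when it is the last line of a cell)
print(len(data_negative), len(data_positive))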

In [None]:
# Print 5 random positive tweets in green

sample_positive = []
sample_negative = []


for _ in range(5):
    # random.choice avoids the IndexError that randint(0, len(...)) can cause (its upper bound is inclusive)
    sample_positive.append(random.choice(data_positive))

for i in range(0, 5):
    plt.text(0, 0.01 + (i * 0.1), sample_positive[i], color = 'green', fontsize = 7)

# print negative in red

for _ in range(5):
    # random.choice avoids the IndexError that randint(0, len(...)) can cause (its upper bound is inclusive)
    sample_negative.append(random.choice(data_negative))

for i in range(5, 10):
    plt.text(0, 0.01 + (i * 0.1), sample_negative[i - 5], color = 'red', fontsize = 7)

plt.show()
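
Tweets contain tokens that the cleaning steps above do not handle well, such as URLs, user handles and hashtags. One possible way to strip them before running the usual steps is sketched below. The helper `clean_tweet`, its regular expressions and the sample tweet are assumptions for illustration, not part of the NLTK dataset or API.

```python
import re

def clean_tweet(tweet):
    """Remove URLs and @handles, keep hashtag words, and normalize whitespace."""
    tweet = re.sub(r'https?://\S+', '', tweet)   # drop links
    tweet = re.sub(r'@\w+', '', tweet)           # drop user handles
    tweet = tweet.replace('#', '')               # keep the hashtag word, drop the sign
    return ' '.join(tweet.split())               # collapse repeated whitespace

# Made-up tweet used only for this example
sample = "@nltk_org Loving #NLP today :) https://example.com/post"
cleaned = clean_tweet(sample)

print(cleaned)  # Loving NLP today :)
```

The cleaned string can then be passed to `text_preproc`. Note that its punctuation step also removes emoticons such as `:)`, which may matter for sentiment tasks.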