# NLP preprocessing

This notebook gathers a few preprocessing techniques to use (or not) when working on text analysis.

## Lowercasing

Lowercasing the text is usually a good first step: it avoids treating the same word as several different tokens because of capitalization, such as:
> 'house', 'House', 'HOUSE', ...

In [1]:
example_1 = 'house HOUSE housE House'
example_1 = example_1.lower()
example_1

'house house house house'
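For comparisons that go beyond ASCII, Python's built-in `str.casefold()` is a more aggressive alternative to `lower()`. The sketch below is only an illustration: the sample words are assumptions and do not come from the original notebook.

```python
# str.casefold() applies Unicode case folding, which goes further than lower()
words = ['House', 'HOUSE', 'Straße', 'STRASSE']

lowered = [w.lower() for w in words]
folded = [w.casefold() for w in words]

print(lowered)  # ['house', 'house', 'straße', 'strasse']
print(folded)   # ['house', 'house', 'strasse', 'strasse']
```

With `lower()`, the two German spellings stay different, while `casefold()` maps them to the same string.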

## Removing all accents

In [2]:
import unidecode

In [3]:
example_accents = 'â ï î ô ê é è ù'
example_accents = unidecode.unidecode(example_accents)
example_accents

'a i i o e e e u'
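If installing `unidecode` is not an option, a similar result can be obtained for accented Latin letters with the standard library. This is a minimal sketch, not a drop-in replacement. The helper name `strip_accents` is hypothetical, and letters with no Unicode decomposition (such as 'ø') are left unchanged.

```python
import unicodedata

def strip_accents(text):
    # NFKD splits each accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop the combining marks and keep everything else
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('â ï î ô ê é è ù'))  # a i i o e e e u
```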

## Numbers and/or punctuation

Text often contains numbers and/or punctuation that you may want to clean beforehand.  
Regular expressions (Python's `re` module) are a good way to deal with them.

In [4]:
import re

### Finding numbers (and the words containing them)

In [5]:
get_number_words = r'(\S*\d+\S*)'
example_2 = 'Numbers can sometimes be important with text : 3x4, 150g, 74F, etc.'
re.findall(pattern=get_number_words, string=example_2)

['3x4,', '150g,', '74F,']
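Because `\S` matches any non-whitespace character, the matches above keep the trailing commas. If only the word itself is wanted, one possible variant is to use `\w` instead. The pattern name `get_number_tokens` is an assumption made for this sketch.

```python
import re

# \w*\d\w* keeps only word characters around the digits,
# so punctuation glued to the token (like a trailing comma) is excluded
get_number_tokens = r'\w*\d\w*'
text = 'Numbers can sometimes be important with text : 3x4, 150g, 74F, etc.'

number_tokens = re.findall(get_number_tokens, text)
print(number_tokens)  # ['3x4', '150g', '74F']
```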

### Finding plain numbers

In [6]:
get_any_number = r'(\d+)'
example_3 = 'This is a1 test fille235d with useless numbers 12. 456'
re.findall(pattern=get_any_number, string=example_3)

['1', '235', '12', '456']
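The pattern above also picks up digits embedded in words ('a1', 'fille235d'). When only standalone numbers matter, word boundaries can be added. This is a small sketch under that assumption, and the pattern name `get_standalone_number` is hypothetical.

```python
import re

# \b requires a word boundary on each side, so digits inside a word are skipped
get_standalone_number = r'\b\d+\b'
text = 'This is a1 test fille235d with useless numbers 12. 456'

standalone = re.findall(get_standalone_number, text)
print(standalone)  # ['12', '456']
```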

### Removing punctuation

In [7]:
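# Match any character that is neither a word character nor whitespace
get_punctuation = r'[^\w\s]'
# Raw string, so the backslash is kept as a literal character
example_4 = r'We \want to remo?ve punctuation ???in some cas.e.s'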
re.sub(pattern=get_punctuation, repl='', string=example_4)

'We want to remove punctuation in some cases'
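For plain ASCII punctuation, `str.translate` with `string.punctuation` is a regex-free alternative. The sketch below reuses the same sentence. It also removes underscores, which a `\w`-based pattern would keep, and it leaves non-ASCII punctuation such as '«' untouched.

```python
import string

# Translation table that deletes every character listed in string.punctuation
remove_punct = str.maketrans('', '', string.punctuation)
text = r'We \want to remo?ve punctuation ???in some cas.e.s'

cleaned = text.translate(remove_punct)
print(cleaned)  # We want to remove punctuation in some cases
```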

## Stopwords

Some words appear often in sentences but don't carry any relevant information for the current analysis.  
For a pandas Series of sentences, you can remove them as follows:

In [8]:
import pandas as pd

In [9]:
example_series = pd.Series(['le chat mange une souris', 'une gallette des rois', 'le crocodile du samedi'])
stopwords = ['le', 'une', 'des', 'du']
example_series = example_series.apply(lambda x: ' '.join([word for word in re.split(r"\W+", x) if word not in stopwords]))
example_series

0    chat mange souris
1        gallette rois
2     crocodile samedi
dtype: object
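The steps covered in this notebook can be chained into a single helper. The following is only a sketch using the standard library: the function name `preprocess`, the order of the steps, and the sample sentences are assumptions, and accents are stripped with `unicodedata` rather than `unidecode` to keep the example dependency-free.

```python
import re
import unicodedata

STOPWORDS = {'le', 'une', 'des', 'du'}

def preprocess(text, stopwords=STOPWORDS):
    # 1. Lowercase
    text = text.lower()
    # 2. Strip accents (decompose, then drop the combining marks)
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(ch for ch in text if not unicodedata.combining(ch))
    # 3. Replace punctuation with spaces so neighbouring words stay separated
    text = re.sub(r'[^\w\s]', ' ', text)
    # 4. Tokenize on whitespace and drop stopwords
    return ' '.join(word for word in text.split() if word not in stopwords)

print(preprocess('Le chat mange une souris !'))         # chat mange souris
print(preprocess('Une galette des rois, évidemment.'))  # galette rois evidemment
```

It can be applied to a pandas Series with `example_series.apply(preprocess)`.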