<div style="width: 100%; overflow: hidden;">
    <div style="width: 150px; float: left;"> <img src="data/D4Sci_logo_ball.png" alt="Data For Science, Inc" align="left" border="0" width=160px> </div>
    <div style="float: left; margin-left: 10px;"> <h1>Natural Language Processing</h1>
<h1>Text Cleaning</h1>
        <p>Bruno Gonçalves<br/>
        <a href="http://www.data4sci.com/">www.data4sci.com</a><br/>
            @bgoncalves, @data4sci</p></div>
</div>

In [1]:
from collections import Counter
from pprint import pprint

import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt 

import re

import tqdm as tq
from tqdm import tqdm

import string
import nltk

import watermark

%load_ext watermark
%matplotlib inline

We start by printing the versions of the libraries we're using, for future reference

In [2]:
%watermark -n -v -m -g -iv

Python implementation: CPython
Python version       : 3.11.7
IPython version      : 8.12.3

Compiler    : Clang 14.0.6 
OS          : Darwin
Release     : 24.3.0
Machine     : arm64
Processor   : arm
CPU cores   : 16
Architecture: 64bit

Git hash: 9c8c00758f3a2fa8e55e08f5aad405a157ca5dd2

numpy     : 1.26.4
pandas    : 2.2.3
tqdm      : 4.66.4
watermark : 2.4.3
re        : 2.2.1
nltk      : 3.8.1
matplotlib: 3.8.0



Load default figure style

In [3]:
plt.style.use('./d4sci.mplstyle')
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

# Stemming

In [4]:
words = ['playing', 'loved', 'ran', 'river', 'friendships', 
         'misunderstanding', 'trouble', 'troubling']

stemmers = { 
    'LancasterStemmer' : nltk.stem.LancasterStemmer(),
    'PorterStemmer' : nltk.stem.PorterStemmer(),
    'RegexpStemmer' : nltk.stem.RegexpStemmer('ing$|s$|e$|able$'),
    'SnowballStemmer' : nltk.stem.SnowballStemmer('english')
}

In [5]:
matrix = []

for word in words:
    # One row per word, one column per stemmer (in dictionary order)
    row = [stemmer.stem(word) for stemmer in stemmers.values()]
    matrix.append(row)

comparison = pd.DataFrame(matrix, index=words, columns=list(stemmers))

In [6]:
comparison

| | LancasterStemmer | PorterStemmer | RegexpStemmer | SnowballStemmer |
|---|---|---|---|---|
| playing | play | play | play | play |
| loved | lov | love | loved | love |
| ran | ran | ran | ran | ran |
| river | riv | river | river | river |
| friendships | friend | friendship | friendship | friendship |
| misunderstanding | misunderstand | misunderstand | misunderstand | misunderstand |
| trouble | troubl | troubl | troubl | troubl |
| troubling | troubl | troubl | troubl | troubl |
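
To make the RegexpStemmer column less of a black box, here is a minimal sketch of suffix stripping using only the standard library `re` module. The helper name `strip_suffix` is hypothetical and not part of NLTK. It assumes that stemming amounts to deleting whatever the pattern matches, which reproduces the RegexpStemmer column above for these words.

```python
import re

# Same suffix pattern that was passed to RegexpStemmer
suffixes = re.compile(r'ing$|s$|e$|able$')

def strip_suffix(word):
    """Hypothetical helper: delete a matching suffix, if there is one."""
    return suffixes.sub('', word)

words = ['playing', 'loved', 'ran', 'river', 'friendships',
         'misunderstanding', 'trouble', 'troubling']

stems = {word: strip_suffix(word) for word in words}

for word, stem in stems.items():
    print(word, '->', stem)
```

Since the pattern knows nothing about English morphology, 'loved' is left untouched while 'trouble' loses its final 'e'.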


# Lemmatization

In [7]:
wordnet = nltk.stem.WordNetLemmatizer()

results_n = [wordnet.lemmatize(word, 'n') for word in words]
results_v = [wordnet.lemmatize(word, 'v') for word in words]

In [8]:
comparison['WordNetLemmatizer Noun'] = results_n
comparison['WordNetLemmatizer Verb'] = results_v

In [9]:
comparison

| | LancasterStemmer | PorterStemmer | RegexpStemmer | SnowballStemmer | WordNetLemmatizer Noun | WordNetLemmatizer Verb |
|---|---|---|---|---|---|---|
| playing | play | play | play | play | playing | play |
| loved | lov | love | loved | love | loved | love |
| ran | ran | ran | ran | ran | ran | run |
| river | riv | river | river | river | river | river |
| friendships | friend | friendship | friendship | friendship | friendship | friendships |
| misunderstanding | misunderstand | misunderstand | misunderstand | misunderstand | misunderstanding | misunderstand |
| trouble | troubl | troubl | troubl | troubl | trouble | trouble |
| troubling | troubl | troubl | troubl | troubl | troubling | trouble |


# Regular Expressions

## Basics

The first step is to compile the pattern we're interested in

In [10]:
regex = re.compile(r'\d+')
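
As a side note, compiling is a convenience rather than a requirement. The sketch below, which uses a made-up example string, shows that the module-level functions in `re` accept the pattern directly and give the same result as a compiled pattern object.

```python
import re

text = "Room 12 is on floor 3"

# Compile once and reuse the pattern object
digits = re.compile(r'\d+')
compiled_result = digits.findall(text)

# Module-level function: the pattern string is passed on every call
direct_result = re.findall(r'\d+', text)

print(compiled_result)
print(direct_result)
```

Compiling up front is mostly useful when the same pattern is applied many times, as in the rest of this section.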

In [11]:
sentence = "The Earth is estimated to be 4.54 billion years old, plus or minus about 50 million years."

__match()__ only looks for the pattern at the very start of the string, so here it fails to find anything

In [12]:
res = regex.match(sentence)
print(res)

None
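
For contrast, here is a small sketch with two made-up strings showing when __match()__ does succeed: only when the digits sit at position 0.

```python
import re

digits = re.compile(r'\d+')

# match() only succeeds if the pattern matches at position 0
at_start = digits.match("42 is the answer")
not_at_start = digits.match("The answer is 42")

print(at_start)
print(not_at_start)
```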


__search()__ scans the sentence and stops at the first match it finds

In [13]:
res = regex.search(sentence)
print(res)

<re.Match object; span=(29, 30), match='4'>


We can easily extract information from the resulting match object

In [14]:
res.span()

(29, 30)

The match starts at position 29 and ends just before position 30. We can recover the matched text by slicing the original string using __start()__ and __end()__

In [15]:
sentence[res.start():res.end()]

'4'
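
The match object also offers a shortcut for this. The following sketch, on a shortened example string, suggests how __group()__, __span()__, __start()__ and __end()__ relate to each other.

```python
import re

sentence = "plus or minus about 50 million years"
m = re.search(r'\d+', sentence)

# group() returns the matched text directly
print(m.group())

# span() is simply the pair (start(), end())
print(m.span() == (m.start(), m.end()))

# Slicing with the span gives the same text as group()
print(sentence[m.start():m.end()] == m.group())
```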

__findall()__ returns all non-overlapping matching substrings

In [16]:
res_lst = regex.findall(sentence)

We found 3 results

In [17]:
len(res_lst)

3

They are returned as a list of strings

In [18]:
for i, res in enumerate(res_lst):
    print(i, res)

0 4
1 54
2 50


You'll note that our regex matches only decimal digits, so the value 4.54 was split in two because '.' didn't match. To capture floating point values we must instead use

In [19]:
regex = re.compile(r'\d+\.?\d+')

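Here we allow an optional '.' between two runs of decimal digits. Note that this pattern requires at least two digits overall, so a lone digit such as 7 would not match.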

In [20]:
[match for match in regex.finditer(sentence)]

[<re.Match object; span=(29, 33), match='4.54'>,
 <re.Match object; span=(73, 75), match='50'>]
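
A possible variant, offered as a sketch rather than the pattern used in this notebook, makes the whole fractional part optional with a non-capturing group. This way single-digit integers are accepted too. The example string is made up for illustration.

```python
import re

# The fractional part (a '.' followed by digits) is optional as a whole
number = re.compile(r'\d+(?:\.\d+)?')

text = "Version 3 weighs 4.54 kg and costs 50 dollars"
found = number.findall(text)

# The matches are strings, so convert them if numeric values are needed
values = [float(x) for x in found]

print(found)
print(values)
```

The group is written as `(?:...)` so that __findall()__ keeps returning the full match instead of just the captured fraction.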

## Groups

We can refer to the match in a previous group by its number

In [21]:
regex = re.compile(r'\b(\w+)\s+\1\b')
regex.search('Paris in the the spring').group()

'the the'
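
Numbered backreferences can become hard to read in longer patterns. As an illustrative sketch, the same repeated-word check can be written with a named group, where the group name `word` is an arbitrary choice.

```python
import re

# (?P<word>...) names the group; (?P=word) must repeat the same text
repeated = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b')

m = repeated.search('Paris in the the spring')
print(m.group())
print(m.group('word'))

# No word is repeated here, so there is nothing to find
no_repeat = repeated.search('Paris in the spring')
print(no_repeat)
```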

You'll note that this is different from just duplicating the pattern, as then the result might be different:

In [22]:
regex = re.compile(r'\b(\w+)\s+(\w+)\b')
regex.findall('Paris in the the spring')

[('Paris', 'in'), ('the', 'the')]

Here we're allowing any word to follow any word, not just a repetition. Also, regex matches are non-overlapping: each match consumes both of its words, so 'spring' can only be matched if another word follows it

In [23]:
regex = re.compile(r'\b(\w+)\s+(\w+)\b')
regex.findall('Paris in the the spring weather')

[('Paris', 'in'), ('the', 'the'), ('spring', 'weather')]
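
If every consecutive pair of words is wanted, one possible approach (a sketch, not part of the original notebook) is to move the second word into a lookahead. The lookahead captures the word without consuming it, so it can start the next match.

```python
import re

# The second word sits inside (?=...), so it is captured but not consumed
pairs = re.compile(r'\b(\w+)(?=\s+(\w+)\b)')

result = pairs.findall('Paris in the the spring')
print(result)
```

Every word except the last one now appears as the first element of a pair.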

## Modifying strings

We can use __.sub()__ to replace any matches with a pre-defined string

In [24]:
regex = re.compile('(blue|white|red)')
regex.sub('colour', 'blue socks and red shoes')

'colour socks and colour shoes'

The count argument specifies the maximum number of replacements we allow

In [25]:
regex.sub('colour', 'blue socks and red shoes', count=1)

'colour socks and red shoes'

Interestingly, within the replacement string we can also refer to the match results with group numbers, so if we just want to add quotation marks to our matches

In [26]:
regex.sub(r'"\1"', 'blue socks and red shoes')

'"blue" socks and "red" shoes'
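
The replacement does not have to be a string. As a further sketch, __.sub()__ also accepts a function that receives each match object and returns the replacement text, and __.subn()__ additionally reports how many replacements were made. The function name `shout` is purely illustrative.

```python
import re

colours = re.compile(r'(blue|white|red)')

def shout(match):
    """Replacement callback: receives the match object, returns a string."""
    return match.group(1).upper()

result = colours.sub(shout, 'blue socks and red shoes')
print(result)

# subn() returns the new string together with the number of replacements
text, n = colours.subn('colour', 'blue socks and red shoes')
print(text, n)
```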

<center>
     <img src="data/D4Sci_logo_full.png" alt="Data For Science, Inc" align="center" border="0" width=300px> 
</center>