## Handling Text

### Cleaning text
You have some unstructured text data and want to complete some basic cleaning.

Most basic text cleaning operations only require Python’s core string operations,
in particular strip, replace, and split:

In [3]:
text_data = [" Interrobang. By Aishwarya Henriette ",
 "Parking And Going. By Karl Gautier",
 " Today Is The night. By Jarek Prakash "]

#Strip whitespace
strip_whitespace = [string.strip() for string in text_data]

strip_whitespace

['Interrobang. By Aishwarya Henriette',
 'Parking And Going. By Karl Gautier',
 'Today Is The night. By Jarek Prakash']

In [4]:
# Remove periods

remove_periods = [string.replace('.','') for string in strip_whitespace]
remove_periods

['Interrobang By Aishwarya Henriette',
 'Parking And Going By Karl Gautier',
 'Today Is The night By Jarek Prakash']
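The third core operation, split, is not shown in the cells above. A minimal sketch, assuming each cleaned string follows a "title By author" pattern (the names `title` and `author` and the `" By "` separator are illustrative choices, not part of the original recipe):

```python
# Cleaned strings, as produced by the strip and replace steps above
remove_periods = ['Interrobang By Aishwarya Henriette',
                  'Parking And Going By Karl Gautier',
                  'Today Is The night By Jarek Prakash']

# Split each string once on the literal separator " By "
# (maxsplit=1 keeps any later " By " inside the author part)
split_titles = [string.split(" By ", 1) for string in remove_periods]

# Unpack the first result into hypothetical title/author variables
title, author = split_titles[0]

print(split_titles)
```

Note that this only works if the separator appears in every string; a string without `" By "` would come back as a one-element list and the unpacking would fail.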

In [7]:
# We can also create and apply a custom transformation function.
# The "-> str" annotation documents the return type; Python does not enforce it at runtime.

def capitalizer(string: str) -> str:
    return string.upper()

#Apply function
[capitalizer(string) for string in remove_periods]

['INTERROBANG BY AISHWARYA HENRIETTE',
 'PARKING AND GOING BY KARL GAUTIER',
 'TODAY IS THE NIGHT BY JAREK PRAKASH']

Finally, we can use regular expressions to perform powerful string operations:

In [12]:
import re

#Create function
def replace_letters(string: str) -> str:
    return re.sub(r'[a-zA-Z]', 'X', string)

#Apply function
[replace_letters(string) for string in remove_periods]

['XXXXXXXXXXX XX XXXXXXXXX XXXXXXXXX',
 'XXXXXXX XXX XXXXX XX XXXX XXXXXXX',
 'XXXXX XX XXX XXXXX XX XXXXX XXXXXXX']
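Replacing every letter is mostly a demonstration. A more typical cleaning use of `re.sub` is collapsing irregular whitespace, which strip alone cannot do because it only touches the ends of a string. A small sketch (the function name `normalize_whitespace` and the sample strings are assumptions made for illustration):

```python
import re

def normalize_whitespace(string: str) -> str:
    # Replace each run of whitespace (spaces, tabs, newlines) with one space,
    # then trim whatever is left at the ends
    return re.sub(r'\s+', ' ', string).strip()

# Hypothetical messy input with doubled spaces, a tab, and a newline
messy_text = ["  Interrobang.   By  Aishwarya Henriette ",
              "Today Is\tThe night.\nBy Jarek Prakash"]

normalized = [normalize_whitespace(string) for string in messy_text]
print(normalized)
```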

### Parsing and cleaning HTML
You have text data with HTML elements and want to extract just the text.

Use Beautiful Soup’s extensive set of options to parse and extract from HTML:

In [15]:
from bs4 import BeautifulSoup

#Create some html code

html = """

<div class='full_name'><span style='font-weight:bold'>Masego</span>Azra</div>

"""
#Parse html
soup = BeautifulSoup(html, 'lxml')

# Find the div with the class 'full_name', show text
soup.find("div", {"class":"full_name"}).text

'MasegoAzra'
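The `.text` attribute joins the text of nested elements with nothing in between, which is why the two names run together. One way to keep them apart is `get_text` with a separator. This is a sketch rather than the recipe's own method; it assumes Beautiful Soup is installed and uses Python's built-in `'html.parser'` so that lxml is not required:

```python
from bs4 import BeautifulSoup

html = "<div class='full_name'><span style='font-weight:bold'>Masego</span>Azra</div>"

# Parse with the standard-library parser (no lxml dependency)
soup = BeautifulSoup(html, 'html.parser')

# Join the text of each nested node with a space and trim each piece
full_name = soup.find('div', class_='full_name').get_text(separator=' ', strip=True)

print(full_name)
```

If the markup already contains whitespace between elements, `strip=True` prevents it from doubling up with the separator.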

### Removing Punctuation
You have a feature of text data and want to remove punctuation.

Use the string method translate with a dictionary of punctuation characters:

In [17]:
#Load libraries
import unicodedata
import sys

#Create text
text_data = ['Hi!!!!!!!!! I. Love. This. Song...',
            '1000% Agree!!!! #LoveIT', 'Right??']

# Create a dictionary that maps every Unicode punctuation code point to None
# (translate deletes any character whose code point maps to None)

punctuation = dict.fromkeys(i for i in range(sys.maxunicode)
                           if unicodedata.category(chr(i)).startswith('P'))

# For each string, remove any punctuation characters
[string.translate(punctuation) for string in text_data]

['Hi I Love This Song', '1000 Agree LoveIT', 'Right']
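Building the dictionary over the whole Unicode range takes a noticeable moment. When the text is known to contain only ASCII punctuation, a lighter alternative is a translation table built from `string.punctuation`. This is a sketch of that swapped-in technique, not the recipe's approach, and it behaves differently: `string.punctuation` also contains symbols such as `$` and `+`, which the Unicode 'P' categories do not cover.

```python
import string

text_data = ['Hi!!!!!!!!! I. Love. This. Song...',
             '1000% Agree!!!! #LoveIT', 'Right??']

# Build a table that deletes every character in string.punctuation
# (the third argument of str.maketrans lists characters to remove)
ascii_punctuation = str.maketrans('', '', string.punctuation)

no_punctuation = [s.translate(ascii_punctuation) for s in text_data]
print(no_punctuation)
```

Non-ASCII punctuation, such as curly quotes or `¿`, passes through this table unchanged, so the Unicode-category dictionary remains the safer choice for multilingual text.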