<center>
    <h1>
        <b>Chapter 6. Handling Text</b>
    </h1>
</center>

---

### **6.0 Introduction**

---

### **6.1 Cleaning Text**


**Problem**
- You have some unstructured text data and want to do some basic cleaning

**Solution**
- Use Python's core string operations, in particular `strip`, `replace`, and `split`

In [3]:
#Create text
text_data = ["   Interrobang. By Aishwarya Henriette    ",
            "Parking And Going. By Karl Gautier  ",
            "  Today Is The night. By Jarek Prakash  "]

#Strip leading and trailing whitespace
strip_whitespaces = [string.strip() for string in text_data]

#Show text
strip_whitespaces

['Interrobang. By Aishwarya Henriette',
 'Parking And Going. By Karl Gautier',
 'Today Is The night. By Jarek Prakash']

In [4]:
#Remove periods
remove_periods = [string.replace(".", "") for string in strip_whitespaces]

#Show text
remove_periods

['Interrobang By Aishwarya Henriette',
 'Parking And Going By Karl Gautier',
 'Today Is The night By Jarek Prakash']

Create and apply a custom transformation function

In [5]:
#Create function
def capitalizer(string: str) -> str:
    return string.upper()

#Apply function
[capitalizer(string) for string in remove_periods]

['INTERROBANG BY AISHWARYA HENRIETTE',
 'PARKING AND GOING BY KARL GAUTIER',
 'TODAY IS THE NIGHT BY JAREK PRAKASH']

Use regular expressions to perform more powerful string operations

In [6]:
#Import library
import re

#Create function
def replace_letters_with_x(string: str) -> str:
    return re.sub(r"[a-zA-Z]", "X", string)

#Apply function
[replace_letters_with_x(string) for string in remove_periods]

['XXXXXXXXXXX XX XXXXXXXXX XXXXXXXXX',
 'XXXXXXX XXX XXXXX XX XXXX XXXXXXX',
 'XXXXX XX XXX XXXXX XX XXXXX XXXXXXX']
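The Solution also names `split`, which none of the cells above exercise. As a minimal sketch, assuming every string follows the `<title>. By <author>` pattern seen in this sample data (the variable names `cleaned` and `title_author_pairs` are illustrative, not part of the original recipe), `split` can separate each title from its author:

```python
# Strings in the same shape as the whitespace-stripped data above
cleaned = ["Interrobang. By Aishwarya Henriette",
           "Parking And Going. By Karl Gautier",
           "Today Is The night. By Jarek Prakash"]

# Split each string once on the literal separator ". By "
# (assumes the separator appears in every string)
title_author_pairs = [tuple(string.split(". By ", 1)) for string in cleaned]

title_author_pairs
# [('Interrobang', 'Aishwarya Henriette'),
#  ('Parking And Going', 'Karl Gautier'),
#  ('Today Is The night', 'Jarek Prakash')]
```

Passing `1` as the second argument limits the operation to a single split, so a title that itself contained ". By " later in the string would not be broken further.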

---

### **6.2 Parsing and Cleaning HTML**


**Problem**
- You have text data with HTML elements and want to extract just the text

**Solution**
- Use Beautiful Soup's extensive set of options to parse and extract from HTML

In [12]:
#Load library
from bs4 import BeautifulSoup

#Create some HTML code
html = """
        <div class='full_name'><span style='font-weight:bold'>
        Masego</span> Azra</div>
    """
    
#Parse html, naming the parser explicitly to avoid a "no parser specified" warning
soup = BeautifulSoup(html, "html.parser")

#Find the div with the class "full_name", show text
soup.find("div", {"class": "full_name"}).text

'\n        Masego Azra'
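The extracted text still carries the newline and indentation from the source markup. One possible way to tidy it, sketched here with plain string methods rather than any Beautiful Soup feature (the names `extracted` and `normalized` are illustrative):

```python
# Text as returned by the lookup above
extracted = '\n        Masego Azra'

# split() with no arguments breaks on any run of whitespace and discards
# leading and trailing whitespace; join() rebuilds with single spaces
normalized = " ".join(extracted.split())

normalized
# 'Masego Azra'
```

This collapses every run of whitespace to one space, which suits short fields like names but would also flatten meaningful line breaks in longer passages.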

---

### **6.3 Removing Punctuation**


**Problem**
- You have a feature of text data and want to remove punctuation

**Solution**
- Use `translate` with a dictionary that maps every punctuation character to `None`

In [14]:
#Load libraries
import unicodedata
import sys

#Create text
text_data = ['Hi!!!! I. Love. This. Song....',
             '10000% Agree!!!! #LoveIT',
             'Right?!?!']

#Create a dictionary of punctuation characters
punctuation = dict.fromkeys(
    i for i in range(sys.maxunicode)
    if unicodedata.category(chr(i)).startswith('P')
)

#For each string, remove any punctuation characters
[string.translate(punctuation) for string in text_data]

['Hi I Love This Song', '10000 Agree LoveIT', 'Right']
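When the text is known to contain only ASCII punctuation, a lighter alternative is to build the translation table from the standard library's `string.punctuation`. This is a sketch of a different technique from the Unicode-category approach above, and it covers a slightly different character set: `string.punctuation` includes symbols such as `$` and `+`, which do not belong to a Unicode `P` category, and it misses non-ASCII punctuation such as curly quotes.

```python
import string

# Same sample text as above
text_data = ['Hi!!!! I. Love. This. Song....',
             '10000% Agree!!!! #LoveIT',
             'Right?!?!']

# Build a table that deletes every character in string.punctuation
ascii_punctuation_table = str.maketrans("", "", string.punctuation)

# Apply the table to each string
ascii_cleaned = [s.translate(ascii_punctuation_table) for s in text_data]

ascii_cleaned
# ['Hi I Love This Song', '10000 Agree LoveIT', 'Right']
```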

---


### **6.4 Tokenizing Text**

**Problem**
- You have text and want to break it up into individual words

**Solution**
- Use the Natural Language Toolkit for Python (NLTK), which has a powerful set of text manipulation operations, including word tokenizing:

In [3]:
#Load library 
from nltk.tokenize import word_tokenize

#Create text
string = 'The science of today is the technology of tomorrow'

#Tokenize words
word_tokenize(string)

['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow']

Tokenize into sentences

In [5]:
#Load library
from nltk.tokenize import sent_tokenize

#Create text
string = "The science of today is the technology of tomorrow. Tomorrow is today."

#Tokenize sentences
sent_tokenize(string)

['The science of today is the technology of tomorrow.', 'Tomorrow is today.']
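To see why a dedicated tokenizer helps, compare plain whitespace splitting with a small regular-expression tokenizer. The sketch below is illustrative only and is not how NLTK works internally. The pattern and the variable names are assumptions made for this example, and NLTK's tokenizers handle many cases (such as contractions and abbreviations) that this pattern does not.

```python
import re

text = "The science of today is the technology of tomorrow. Tomorrow is today."

# Whitespace splitting leaves punctuation attached to the neighboring word
whitespace_tokens = text.split()

# Illustrative regex tokenizer: runs of word characters become tokens,
# and each remaining non-space character (punctuation) is its own token
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

whitespace_tokens[-1]   # 'today.'
regex_tokens[-2:]       # ['today', '.']
```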

---

### **6.5 Removing Stop Words**

---

### **6.6 Stemming Words**

---

### **6.7 Tagging Parts of Speech**

---

### **6.8 Encoding Text as a Bag of Words**

---

### **6.9 Weighting Word Importance**

---