# Chapter 8 Cleaning Your Dirty Data

## Cleaning in Code

Let's look at a simple function, `getNgrams()`, which splits text into n-grams (runs of `n` consecutive words):

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [2]:
def getNgrams(content, n):
    content = content.split(' ')
    output = []
    for i in range(len(content)-n+1):
        output.append(content[i:i+n])
    return output

In [3]:
html = urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
bs = BeautifulSoup(html, 'html.parser')
content = bs.find('div', {'id':'mw-content-text'}).get_text()
ngrams = getNgrams(content, 2)
print(ngrams[:5]) # first five 2-grams
print('2-grams count is: '+str(len(ngrams)))

[['General-purpose,', 'high-level'], ['high-level', 'programming'], ['programming', 'language\n\n\nPythonParadigmMulti-paradigm:'], ['language\n\n\nPythonParadigmMulti-paradigm:', 'functional,'], ['functional,', 'imperative,']]
2-grams count is: 8800


As shown above, we do get useful 2-grams like `['General-purpose,', 'high-level']` and `['functional,', 'imperative,']`, but we also get noise such as `['programming', 'language\n\n\nPythonParadigmMulti-paradigm:']`, where newlines end up inside a token because the text is split only on spaces. A regular expression can clean this up.
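
To see where such tokens come from, the following minimal sketch reproduces the effect offline. The `sample` string is made up for illustration and is not taken from the page; it only assumes that the scraped text contains runs of newlines between words.

```python
import re

# Hypothetical sample text standing in for the scraped page content.
sample = 'programming language\n\n\nPython is dynamically typed'

# Splitting on a single space keeps the newlines glued to the neighbouring words.
tokens = sample.split(' ')
print(tokens)

# Replacing each newline with a space separates the words, but a run of
# spaces then produces empty strings that still have to be filtered out.
spaced = re.sub(r'\n', ' ', sample).split(' ')
print(spaced)

cleaned = [word for word in spaced if word != '']
print(cleaned)
```

The empty strings in the second list are why a cleaning function needs a filtering step after the substitution.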

In [4]:
import re

def getNgrams(content, n):
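    # Replace newlines with spaces. Note that the bracketed part is a character
    # class: it also replaces every '[', ']', '+' and digit individually,
    # not only citation markers such as [12].
    content = re.sub(r'\n|[\[\d+\]]', ' ', content)
    # Drop non-ASCII characters by round-tripping through bytes.
    content = bytes(content, 'UTF-8')
    content = content.decode('ascii', 'ignore')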
    content = content.split(' ')
    content = [word for word in content if word != '']
    output = []
    for i in range(len(content)-n+1):
        output.append(content[i:i+n])
    return output

In [6]:
html = urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
bs = BeautifulSoup(html, 'html.parser')
content = bs.find('div', {'id':'mw-content-text'}).get_text()
ngrams = getNgrams(content, 2)
print(ngrams[:5]) # first five 2-grams
print('2-grams count is: '+str(len(ngrams)))

[['General-purpose,', 'high-level'], ['high-level', 'programming'], ['programming', 'language'], ['language', 'PythonParadigmMulti-paradigm:'], ['PythonParadigmMulti-paradigm:', 'functional,']]
2-grams count is: 9567
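
The substitution pattern used above is a character class, so it replaces every bracket, digit and plus sign rather than only citation markers. The sketch below compares it with a stricter pattern on a made-up string. The `sample` text and the `strict` pattern are assumptions for illustration, not part of the original code, and switching to the stricter pattern would change the counts shown in this chapter.

```python
import re

# Hypothetical sentence containing a citation marker, a year and 'C++'.
sample = 'Python 3 was released in 2008.[37] It borrows from C++ and [others].'

# Same character class as above, with the inner bracket escaped:
# matches any single '[', ']', '+' or digit.
loose = re.sub(r'\n|[\[\d+\]]', ' ', sample)

# Stricter alternative (an assumption about the intent):
# matches only a bracketed number such as [37].
strict = re.sub(r'\n|\[\d+\]', ' ', sample)

print(loose)
print(strict)
```

With the character class, the year and the `++` disappear along with the citation; with the stricter pattern, only `[37]` is removed.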


Now we have a new problem: some 2-grams like `['programming', 'language']` make sense, whereas others like `['language', 'PythonParadigmMulti-paradigm:']` don't. To clean up these meaningless combinations, we split the work across four functions:

In [7]:
import string

def cleanSentence(sentence):
    """Split sentence into words, strip punctuation and whitespace, and remove single character words besides I and a"""
    sentence = sentence.split(' ')
    sentence = [word.strip(string.punctuation+string.whitespace) for word in sentence]
    sentence = [word for word in sentence if len(word) > 1 or (word.lower() == 'a' or word.lower() == 'i')]
    return sentence

def cleanInput(content):
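    """Uppercase the text, remove newlines and citations, drop non-ASCII characters, and split the text into 'sentences' based on the location of periods"""
    content = content.upper()
    # The bracketed part is a character class: it also replaces every
    # '[', ']', '+' and digit individually, not only markers such as [12].
    content = re.sub(r'\n|[\[\d+\]]', ' ', content)
    content = bytes(content, "UTF-8")
    content = content.decode("ascii", "ignore")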
    sentences = content.split('. ')
    return [cleanSentence(sentence) for sentence in sentences]

def getNgramsFromSentence(content, n):
    """Ensure that n-grams are not created that span multiple sentences"""
    output = []
    for i in range(len(content)-n+1):
        output.append(content[i:i+n])
    return output

def getNgrams(content, n):
    content = cleanInput(content)
    ngrams = []
    for sentence in content:
        ngrams.extend(getNgramsFromSentence(sentence, n))
    return ngrams

In [8]:
html = urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
bs = BeautifulSoup(html, 'html.parser')
content = bs.find('div', {'id':'mw-content-text'}).get_text()
print(len(getNgrams(content, 2)))

7391
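
The effect of building n-grams sentence by sentence can be checked offline on a short string. The sketch below is a compact, hypothetical variant of the four functions above (the name `sentence_ngrams` and the `sample` text are made up for illustration); it skips the uppercasing, regex and ASCII steps and keeps only the sentence split and the word cleaning.

```python
import string

def sentence_ngrams(text, n):
    """Hypothetical compact variant: build n-grams within each sentence only."""
    ngrams = []
    for sentence in text.split('. '):
        # Strip punctuation and whitespace from both ends of each word.
        words = [w.strip(string.punctuation + string.whitespace)
                 for w in sentence.split(' ')]
        # Keep multi-character words, plus the one-letter words 'a' and 'I'.
        words = [w for w in words if len(w) > 1 or w.lower() in ('a', 'i')]
        for i in range(len(words) - n + 1):
            ngrams.append(words[i:i + n])
    return ngrams

sample = 'Python is a language. It is dynamically typed.'
result = sentence_ngrams(sample, 2)
print(result)
```

No pair such as `['language', 'It']` appears in the result, because the two sentences are processed separately.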


We have reduced the number of 2-grams on this page from 9567 to 7391. However, the current code does not account for duplicate 2-grams.

In [9]:
from collections import Counter

def getNgrams(content, n):
    """Count how often each n-gram occurs, keyed by the space-joined n-gram"""
    content = cleanInput(content)
    ngrams = Counter()
    for sentence in content:
        # Join each n-gram into one string so it can be used as a Counter key.
        newNgrams = [' '.join(ngram) for ngram in getNgramsFromSentence(sentence, n)]
        ngrams.update(newNgrams)
    return ngrams

In [10]:
html = urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
bs = BeautifulSoup(html, 'html.parser')
content = bs.find('div', {'id':'mw-content-text'}).get_text()
two_grams = getNgrams(content, 2)
print(len(two_grams))

5545


Counting each distinct 2-gram only once reduces the number further, to 5545. To see how often "programming language" occurs, index `two_grams` like a dictionary. The key is uppercase because `cleanInput()` uppercases the text.

In [11]:
two_grams["PROGRAMMING LANGUAGE"]

15
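
Beyond single lookups, a `Counter` can also rank its keys by frequency. The following minimal sketch uses a small made-up list (`sample_ngrams` is hypothetical and stands in for the joined 2-grams) to show `most_common()` and what happens when a key is missing.

```python
from collections import Counter

# Hypothetical joined 2-grams, standing in for the strings counted above.
sample_ngrams = [
    'PROGRAMMING LANGUAGE', 'PYTHON IS', 'PROGRAMMING LANGUAGE',
    'IS A', 'PYTHON IS', 'PROGRAMMING LANGUAGE',
]
counts = Counter(sample_ngrams)

# The two most frequent 2-grams, as (ngram, count) pairs.
top = counts.most_common(2)
print(top)

# A missing key yields 0 instead of raising KeyError.
missing = counts['NOT PRESENT']
print(missing)
```

Applied to `two_grams`, the same `most_common()` call would list the most frequent 2-grams on the page.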