Errant punctuation, inconsistent capitalization, stray line breaks, and misspellings make dirty data a big problem on the web. This chapter covers a few tools and techniques that help you prevent the problem at the source, by changing the way you write code, and clean the data once it’s in the database.

## Cleaning in Code
An N-gram is a sequence of n consecutive words. In this example, we want to build a function that returns a Python list of all the N-grams it finds in a text document:

In [9]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getNgrams(content, n):
    """ 
    Takes a string content and the number of 'grams', n.
    Returns a Python list of all the N-grams found in content.
    """
    
    content = content.split(' ')
    output = list()
    for i in range(len(content)-n+1):
        output.append(content[i:i+n])
    return output

html = urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
bs = BeautifulSoup(html, 'html.parser')
content = bs.find('div', {'id':'mw-content-text'}).get_text()
print(content)

General-purpose programming language


.mw-parser-output .infobox-subbox{padding:0;border:none;margin:-3px;width:auto;min-width:100%;font-size:100%;clear:none;float:none;background-color:transparent}.mw-parser-output .infobox-3cols-child{margin:auto}PythonParadigmMulti-paradigm: object-oriented,[1] procedural (imperative), functional, structured, reflectiveDesigned byGuido van RossumDeveloperPython Software FoundationFirst appeared20 February 1991; 30 years ago (1991-02-20)[2]Stable release3.10.0[3] 
   / 4 October 2021; 2 months ago (4 October 2021)Preview release3.11.0a2[4] 
   / 5 November 2021; 31 days ago (5 November 2021)
Typing disciplineDuck, dynamic, strong typing;[5] gradual (since 3.5, but ignored in CPython)[6]OSWindows, Linux/UNIX, macOS and more[7]LicensePython Software Foundation LicenseFilename extensions.py, .pyi, .pyc, .pyd, .pyo (prior to 3.5),[8] .pyw, .pyz (since 3.5)[9]Websitewww.python.orgMajor implementationsCPython, PyPy, Stackless Python, MicroPython, CircuitP

In [10]:
ngrams = getNgrams(content, 2)
print('2-grams count is: '+str(len(ngrams)))

2-grams count is: 11908


In [11]:
ngrams[:5]

[['General-purpose', 'programming'],
 ['programming', 'language\n\n\n.mw-parser-output'],
 ['language\n\n\n.mw-parser-output',
  '.infobox-subbox{padding:0;border:none;margin:-3px;width:auto;min-width:100%;font-size:100%;clear:none;float:none;background-color:transparent}.mw-parser-output'],
 ['.infobox-subbox{padding:0;border:none;margin:-3px;width:auto;min-width:100%;font-size:100%;clear:none;float:none;background-color:transparent}.mw-parser-output',
  '.infobox-3cols-child{margin:auto}PythonParadigmMulti-paradigm:'],
 ['.infobox-3cols-child{margin:auto}PythonParadigmMulti-paradigm:',
  'object-oriented,[1]']]
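
To see the sliding window at work without the noise of a real page, here is a minimal sketch on a short, made-up sentence. The helper name `sliding_ngrams` and the sample text are illustrative assumptions; the logic mirrors the loop in `getNgrams`.

```python
def sliding_ngrams(words, n):
    """Return every run of n consecutive items from the list words."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]

# Hypothetical sample sentence, not taken from the scraped page.
sample = 'the quick brown fox jumps'
words = sample.split(' ')

bigrams = sliding_ngrams(words, 2)
trigrams = sliding_ngrams(words, 3)

print(bigrams)   # 4 overlapping pairs, starting with ['the', 'quick']
print(trigrams)  # 3 overlapping triples
print(sliding_ngrams(words, 6))  # n longer than the text gives an empty list
```

A text of k words yields k - n + 1 N-grams, which is why the loop runs over `range(len(words) - n + 1)`.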

The initial function returns a lot of junk. Using regular expressions and other tools, we can clean the text before extracting the N-grams. The version below replaces every newline with a space, removes citation markers like [123], drops non-ASCII characters by encoding the text as UTF-8 and decoding it as ASCII with errors ignored, and filters out the empty strings produced by runs of spaces. Note that the citation pattern as written is a character class, so it also strips every other digit and square bracket, which is visible in the output. These steps greatly improve the output of the function, but some issues still remain:

In [15]:
import re

def getNgrams(content, n):
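    # Replace newlines and citation markers with spaces.
    # NOTE: '[[\d+\]]' is a character class, so it also blanks out every other
    # digit, '+', '[' and ']' (see 'padding:' in the output below);
    # r'\[\d+\]' would match only whole markers such as [123].
    content = re.sub(r'\n|[[\d+\]]', ' ', content)
    # Drop non-ASCII characters: encode as UTF-8, decode as ASCII, ignore errors.
    content = bytes(content, 'UTF-8')
    content = content.decode('ascii', 'ignore')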
    
    content = content.split(' ')
    content = [word for word in content if word != '']
    output = list()
    for i in range(len(content)-n+1):
        output.append(content[i:i+n])
    return output

ngrams = getNgrams(content, 2)
ngrams[:5]

[['General-purpose', 'programming'],
 ['programming', 'language'],
 ['language', '.mw-parser-output'],
 ['.mw-parser-output', '.infobox-subbox{padding:'],
 ['.infobox-subbox{padding:', ';border:none;margin:-']]
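
In the output above, `padding:0` has lost its `0`. The following sketch, run on a made-up sample string, suggests why: the bracket expression in the pattern acts as a character class rather than matching a whole citation marker. The sample text and variable names are assumptions for illustration, and the class is written with the inner `[` escaped, which should behave the same as the pattern in `getNgrams`. The last lines show what the UTF-8 to ASCII round trip does to accented characters.

```python
import re

# Hypothetical sample text, not taken from the scraped page.
sample = 'Python 3.10 was released in 2021.[3]\nIt is developed by the PSF.[12]'

# Same character class as in getNgrams (inner '[' escaped): each '[', ']',
# '+' and digit is replaced individually, so version numbers vanish too.
char_class = re.sub(r'\n|[\[\d+\]]', ' ', sample)

# Escaped brackets match a whole citation marker such as [3] or [12].
whole_marker = re.sub(r'\n|\[\d+\]', ' ', sample)

print(repr(char_class))
print(repr(whole_marker))

# Encoding to UTF-8 and decoding as ASCII with errors ignored drops
# every non-ASCII character instead of transliterating it.
text = 'naïve café'
ascii_only = bytes(text, 'UTF-8').decode('ascii', 'ignore')
print(ascii_only)  # nave caf
```

Whether digits should survive depends on the analysis; if version numbers and dates matter, the escaped pattern is the safer choice.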

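Punctuation still clings to many of the tokens. We can improve on this by lowercasing the text, splitting it into sentences so that N-grams don't span sentence boundaries, stripping punctuation from the start and end of each word, and dropping one-character words other than "a" and "i".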

In [26]:
import re
import string


from urllib.request import urlopen
from bs4 import BeautifulSoup

def cleanSentence(sentence):
    """Strip punctuation from each word; drop one-character words other than 'a' and 'i'."""
    sentence = sentence.split(' ')
    sentence = [word.strip(string.punctuation+string.whitespace) for word in sentence]
    sentence = [word for word in sentence if len(word) > 1 or (word.lower() == 'a' or word.lower() == 'i')]
    return sentence

def cleanInput(content):
    content = content.lower()
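    # NOTE: '[[\d+\]]' is a character class, so this blanks out every digit
    # and square bracket, not only citation markers like [123].
    content = re.sub(r'\n|[[\d+\]]', ' ', content)
    # Drop non-ASCII characters: encode as UTF-8, decode as ASCII, ignore errors.
    content = bytes(content, "UTF-8")
    content = content.decode("ascii", "ignore")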
    sentences = content.split('. ')
    return [cleanSentence(sentence) for sentence in sentences]

def getNgramsFromSentence(content, n):
    output = []
    for i in range(len(content)-n+1):
        output.append(content[i:i+n])
    return output

def getNgrams(content, n):
    content = cleanInput(content)
    ngrams = []
    for sentence in content:
        ngrams.extend(getNgramsFromSentence(sentence, n))
    return ngrams
        

In [27]:
html = urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
bs = BeautifulSoup(html, 'html.parser')
content = bs.find('div', {'id':'mw-content-text'}).get_text()

ngrams = getNgrams(content, 2)
print(len(ngrams))
ngrams[:5]

9766


[['general-purpose', 'programming'],
 ['programming', 'language'],
 ['language', 'mw-parser-output'],
 ['mw-parser-output', 'infobox-subbox{padding'],
 ['infobox-subbox{padding', 'border:none;margin']]
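
The output above still contains tokens such as `mw-parser-output` and `infobox-subbox{padding`. A small sketch of why: `str.strip` only removes characters from the two ends of a string, so punctuation inside a token is left alone. The token list here is made up for illustration.

```python
import string

# Hypothetical tokens, chosen to resemble the ones in the output above.
tokens = ['(language),', 'mw-parser-output', 'infobox-subbox{padding:', '...', 'a', 'x']

chars_to_strip = string.punctuation + string.whitespace
stripped = [token.strip(chars_to_strip) for token in tokens]
print(stripped)

# Keep words longer than one character, plus the one-letter words 'a' and 'i'.
kept = [word for word in stripped if len(word) > 1 or word.lower() in ('a', 'i')]
print(kept)
```

Internal hyphens are usually worth keeping ("general-purpose"), but leftover stylesheet text like `infobox-subbox{padding` would need a different fix, such as removing the relevant elements from the parsed page before calling `get_text()`.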

Finally, we use a Counter object to count how many times each N-gram appears in the text:

In [28]:
from collections import Counter

def getNgrams(content, n):
    # Clean the input text
    content = cleanInput(content)

    # Instantiate a Counter object to tally each N-gram
    ngrams = Counter()
    for sentence in content:
        # Join each N-gram into one string so it can be used as a Counter key
        newNgrams = [' '.join(ngram) for ngram in getNgramsFromSentence(sentence, n)]
        ngrams.update(newNgrams)
    return ngrams

ngrams = getNgrams(content, 2)
print(ngrams.most_common(10))

[('from the', 220), ('the original', 211), ('original on', 209), ('archived from', 206), ('on june', 62), ('software foundation', 40), ('python software', 40), ('of the', 39), ('of python', 34), ('in python', 32)]
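
The `' '.join(ngram)` step matters because a Counter is a dictionary underneath, and dictionary keys must be hashable. A minimal sketch with made-up 2-grams (the variable names are illustrative):

```python
from collections import Counter

# Hypothetical 2-grams in the list-of-lists form that getNgramsFromSentence returns.
bigrams = [['from', 'the'], ['the', 'original'], ['from', 'the']]

# Lists are unhashable, so counting them directly raises TypeError.
try:
    Counter(bigrams)
    list_counting_failed = False
except TypeError as error:
    list_counting_failed = True
    print('cannot count lists:', error)

# Joining each N-gram into a single string gives a hashable key.
counts = Counter(' '.join(bigram) for bigram in bigrams)
print(counts.most_common(1))  # [('from the', 2)]

# A tuple also works as a key and keeps the words separate.
tuple_counts = Counter(tuple(bigram) for bigram in bigrams)
print(tuple_counts[('from', 'the')])  # 2
```

The four most frequent 2-grams in the output above appear to come from the repeated "Archived from the original on" text in the page's reference list, so a real analysis would probably filter out such boilerplate before counting.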
