# ADDRESS_CLEANER
Written in Python 3.6 by Erin Ochoa

Jupyter Notebook written by Jerry Shi

In [1]:
from ADDRESS_CLEANER import *

This function cleans addresses by replacing forbidden characters with spaces or deleting them. It also standardizes addresses to some extent.

In [2]:
import re
import numpy as np
import pandas as pd

First, re (Python's regular expression module), numpy, and pandas are imported. Then the function address_cleaner is defined.

```
def address_cleaner(string):
    # Contains non-printing characters that will be cleaned out of strings
    cleaner = ['\n','\r','\t']
```
\n, \r, and \t mean new line, carriage return, and tab respectively.
```
    # Strip leading and trailing spaces, commas, periods, & slashes; convert to uppercase
    string = string.strip(' ,./').upper()
```
strip is a Python string method that removes any of the given characters from the **beginning and end** of the string. It stops at the first character on each side that is not in the given set. For example,

In [3]:
def ex_strip(string):
    return string.strip( '0' )
ex_strip("0000000this is string example....wow!!!0000000")

'this is string example....wow!!!'

However, this means matching characters in the middle of the string are left untouched.

In [4]:
ex_strip("0000000this 0 is 0 string 0 example....wow!!!0000000")

'this 0 is 0 string 0 example....wow!!!'
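The same behavior applies to the call used in address_cleaner. The sketch below runs that call on a made-up address (the sample string is an assumption, not taken from the original data) to show that only the ends are trimmed, while the period inside the address survives.

```python
# Hypothetical sample address, invented for illustration only
raw = " ./123 n. main st., "

# Same call as in address_cleaner: trim spaces, commas, periods, and slashes
# from both ends, then convert to uppercase
stripped = raw.strip(' ,./').upper()

print(repr(stripped))  # '123 N. MAIN ST'
```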

```
    # If present in the string, replace each NPC from cleaner with a space
    for npc in cleaner:
        string = string.replace(npc, ' ')
```
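Next, a for loop iterates over cleaner, so the variable npc takes the value of each non-printing character (\n, \r, \t) in turn. On each pass, the replace method swaps every occurrence of that character for a space. The characters are not simply deleted, so words that were separated only by a newline or tab stay separated.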
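As a minimal sketch of this loop outside the function, the cell below applies it to a made-up string (the sample value is an assumption for illustration).

```python
cleaner = ['\n', '\r', '\t']

# Hypothetical sample containing one of each non-printing character
sample = "123 MAIN ST\nAPT 4\tCHICAGO\r"

# npc is bound to '\n', then '\r', then '\t'
for npc in cleaner:
    sample = sample.replace(npc, ' ')

print(repr(sample))  # '123 MAIN ST APT 4 CHICAGO '
```

The trailing space left by the carriage return remains at this point, because the loop only substitutes characters.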

```
    # Replace multiple spaces with a single space
    string = re.sub(r'\s+',' ',string)
```
In this case, \s matches Unicode whitespace characters, which include \t, \n, \r, \f, and \v, as well as many other characters, such as the non-breaking spaces mandated by typography rules in many languages. The + appended to \s means the pattern matches one or more consecutive whitespace characters, so each run of whitespace is replaced by a single space.
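A short sketch of this substitution on a made-up string (an assumption for illustration) that mixes ordinary spaces, a tab, and non-breaking spaces:

```python
import re

# Hypothetical sample: double space, a tab between spaces,
# and two non-breaking spaces (U+00A0)
sample = "123  MAIN \t ST\u00a0\u00a0APT 4"

# Each run of one or more whitespace characters becomes one space
collapsed = re.sub(r'\s+', ' ', sample)

print(repr(collapsed))  # '123 MAIN ST APT 4'
```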

Lastly, the r preceding '\s+' is not part of the pattern, nor is it regex syntax. It is Python's raw string prefix. Regular expressions use the backslash character ('\') to indicate special forms (\s, for example) or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal. Prefixing the string literal with r means backslashes are not treated as escape characters by Python, so the pattern reaches the regex engine exactly as typed.
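To make the difference concrete, the sketch below compares regular and raw literals for the same patterns. The variable names and the sample string are assumptions chosen for illustration.

```python
import re

# Without the r prefix, each backslash must be doubled in the literal
regular = '\\s+'
raw = r'\s+'
same_pattern = (regular == raw)  # both are the three characters \ s +

# Matching one literal backslash: four backslashes without r, two with it
regular_backslash = '\\\\'
raw_backslash = r'\\'

# Hypothetical sample string containing a single backslash
sample = 'UNIT 4\\B'
converted = re.sub(raw_backslash, '/', sample)

print(same_pattern, regular_backslash == raw_backslash, converted)
# True True UNIT 4/B
```

Both spellings produce the same pattern; the raw form is simply easier to read and harder to get wrong.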
```
    # Remove replacement character, which will break things later on
    string = string.replace('�','')
```
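str.replace treats its arguments as literal text, so it avoids the pitfalls of regular expressions (such as escaping) and is generally faster. It is therefore preferred over re.sub when a plain literal substitution is enough, as it is here for the Unicode replacement character (U+FFFD).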
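The sketch below illustrates the escaping pitfall on a made-up address (an assumption for illustration): a period is literal text to str.replace, but an unescaped period in a regular expression matches any character.

```python
import re

# Hypothetical sample containing periods and a replacement character (U+FFFD)
sample = "123 N. MAIN\ufffd ST."

# str.replace treats '.' as a literal period
literal = sample.replace('.', '')

# re.sub treats an unescaped '.' as "any character", so everything is removed
as_regex = re.sub('.', '', sample)

# The step from address_cleaner: drop the replacement character
cleaned = sample.replace('\ufffd', '')

print(repr(literal), repr(as_regex), repr(cleaned))
```

Here as_regex ends up empty, which is why the literal method is the safer choice for fixed text.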
```
    # Replace each parenthesis with a space
    string = re.sub(r'\(',' ',string)
    string = re.sub(r'\)',' ',string)
```
Escaping means placing a backslash before a character that has a special meaning in regular expressions so that it is matched literally. It is needed here because parentheses normally delimit groups; \( and \) match literal parentheses, each of which is replaced with a space.
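A minimal sketch of these two substitutions on a made-up address (an assumption for illustration). The one-pass character class shown last is an alternative spelling, not part of the original function; inside square brackets, parentheses need no escaping.

```python
import re

# Hypothetical sample address with a parenthesized note
sample = "123 MAIN ST (REAR)"

# The two steps from address_cleaner
step = re.sub(r'\(', ' ', sample)
step = re.sub(r'\)', ' ', step)

# Alternative (not in the original): one pass with a character class
one_pass = re.sub(r'[()]', ' ', sample)

print(repr(step))      # '123 MAIN ST  REAR '
print(step == one_pass)  # True
```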

```
    # Standardize P.O. BOX
    string = re.sub(r'^P\.?\s?O\.?\s?BOX','P.O. BOX',string)
    string = re.sub(r'^POST\s?OFFICE\s?BOX','P.O. BOX',string)
    string = re.sub(r'^POST\s?OFC\s?BOX','P.O. BOX',string)
```
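Next, P.O. Box has to be standardized from its various spellings. The ^ at the start of each pattern anchors the match to the beginning of the string, so only addresses that start with a P.O. Box are changed. In addition to escapes, **optional quantifiers** (called conditionals here) are now used. An optional quantifier is a question mark placed after a character or escape sequence. It tells the regex engine to match zero or one instance of the preceding item. For example,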

In [5]:
def whatmakesuparainbow(string):
    string = re.sub('colou?r','color',string)
    return string
whatmakesuparainbow("colour")

'color'
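
The three P.O. BOX patterns can be tried in isolation. The sketch below wraps them in a helper named standardize_po_box, which is a hypothetical name introduced here for illustration, and runs it on made-up inputs.

```python
import re

def standardize_po_box(string):
    # Hypothetical helper: the same three substitutions as in address_cleaner
    string = re.sub(r'^P\.?\s?O\.?\s?BOX', 'P.O. BOX', string)
    string = re.sub(r'^POST\s?OFFICE\s?BOX', 'P.O. BOX', string)
    string = re.sub(r'^POST\s?OFC\s?BOX', 'P.O. BOX', string)
    return string

# Made-up inputs covering the optional periods and spaces
samples = ["PO BOX 12", "P.O.BOX 12", "P O BOX 12",
           "POST OFFICE BOX 12", "POSTOFC BOX 12",
           "123 MAIN ST PO BOX 12"]

results = [standardize_po_box(s) for s in samples]
for before, after in zip(samples, results):
    print(repr(before), '->', repr(after))
```

The first five inputs all become 'P.O. BOX 12'. The last one is unchanged because ^ only matches at the start of the string.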