# Cleaning Data with Python: Basics

## Basic text cleaning

We'll spend a lot of time cleaning up text.  Mostly this is because:

* A) although you see 'Capital' and 'capital' as the same word, an algorithm will treat them as different strings because one of them starts with a capital letter
* B) people leave a lot of invisible characters in their data (NB they do this to string representations of numerical data too, and many spreadsheet programs will store numbers as strings). These invisible characters (spaces, tabs, newlines) are known as "whitespace", and they can really mess up your day because "whitespace" and "whitespace " look the same to you, but an algorithm will see them as different.

In the example below, lower() converts a string to lowercase (upper() converts it to uppercase, but the convention in data science is to use all-lowercase, probably because it's less shouty to read), and strip() removes any whitespace before the first non-whitespace character and after the last one. Neither method touches whitespace inside the string, and both return a new string rather than changing the original, which is why the result is assigned back to mystring.

In [4]:
mystring = ' CApiTalIsaTion  Sucks  '
print('original text is -{}-'.format(mystring))
mystring = mystring.lower()
print('lowercased text is -{}-'.format(mystring))
mystring = mystring.strip()
print('Text without whitespace is -{}-'.format(mystring))

original text is - CApiTalIsaTion  Sucks  -
lowercased text is - capitalisation  sucks  -
Text without whitespace is -capitalisation  sucks-
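
As a small sketch of why this matters in practice, the snippet below applies both steps to a list of values that a human would read as the same word. The helper name `clean_text` and the sample values are made up for illustration; they are not part of any library.

```python
def clean_text(value):
    """Trim leading/trailing whitespace, then lowercase (hypothetical helper)."""
    return value.strip().lower()

# Four spellings a person would treat as the same word
raw_values = ['Capital', 'capital ', ' CAPITAL', '\tcapital\n']
cleaned = [clean_text(v) for v in raw_values]

print('distinct values before cleaning: {}'.format(len(set(raw_values))))
print('distinct values after cleaning: {}'.format(len(set(cleaned))))
print(cleaned)
```

Counting distinct values before and after cleaning is a quick way to check whether case or whitespace differences were hiding duplicates in a column.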


## Regular Expressions

You'll notice that, even after we've run lower() and strip() on the example above, it still has two spaces together in it.  This is another common human error when entering text.  In a word processor, you'd probably do a search and replace at this point.  Python's (and many other languages') equivalent of search and replace is "regular expressions", also known as regex. You can use these through Python's built-in "re" library.

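[^\w] (or \W) = any single character that isn't a letter, a digit or an underscore.
[] = a set of possible characters; a ^ at the start of the set negates it, e.g. [^\w ] = anything that isn't a letter, digit, underscore or space.
\s+,\s+ = one or more whitespace characters, followed by a comma, then one or more whitespace characters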
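
Before deleting anything, it can help to check what a pattern actually matches. The sketch below uses re.findall for that, on two made-up sample strings; it is only one way of inspecting a pattern.

```python
import re

# List every character that is NOT a letter, digit, underscore or space
junk = re.findall(r'[^\w ]', 'This is a! sentence&& with junk!@')
print(junk)

# List every comma together with the whitespace around it
separators = re.findall(r'\s+,\s+', 'comma , list ,  with')
print(separators)
```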

In [1]:
import re
string1 = 'This is a! sentence&& with junk!@'
cleanstring1 = re.sub(r'[^\w ]', '', string1)
print('{}'.format(string1))
print('{}'.format(cleanstring1))

string2 = 'comma , list ,  with , extra , spaces'
cleanstring2 = re.sub(r'\s+,\s+', ',', string2)
print('{}'.format(string2))
print('{}'.format(cleanstring2))

This is a! sentence&& with junk!@
This is a sentence with junk
comma , list ,  with , extra , spaces
comma,list,with,extra,spaces
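
The doubled space that started this section can be handled the same way. A minimal sketch, reusing the cleaned string from the first example: replace any run of spaces with a single space. Using \s+ instead also collapses tabs and newlines, which may or may not be what you want for your data.

```python
import re

mystring = 'capitalisation  sucks'

# Replace one or more consecutive spaces with a single space
single_spaced = re.sub(r' +', ' ', mystring)
print('-{}-'.format(single_spaced))

# \s+ treats tabs and newlines as whitespace too
messy = 'tabs\tand\n newlines   too'
collapsed = re.sub(r'\s+', ' ', messy)
print('-{}-'.format(collapsed))
```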


## Cleaning Dates and Times

Dates can also be a headache, especially if you're mixing data from different sources, or have both Europeans and Americans inputting data for you (e.g. Americans write 2/14/16, Europeans write 14/2/16).

You can fix some of these differences using the datetime python library: for instance, the code below converts European-formatted dates into American-formatted ones.

* We start by telling datetime about the format our date string is in.  Here, we've said that it's day (%d), then month (%m), then two-digit year (%y), all separated by '/'. 

* strptime converts that string into a datetime structure. The structure itself contains elements for day, month, year, hour, minute and second.

* strftime converts a datetime structure into a string, in the format that we ask for.  Here, we've asked for month (%m), day (%d), year (%y), all separated by '-'. There are many other options available (see the "strftime() and strptime() Behavior" section in https://docs.python.org/3/library/datetime.html), only some of which are explored below.

Datetime also allows you to ask for things like the day of the week (weekday() returns a number from 0 to 6, where 0 is Monday). You can access each of these from the date structure individually, as you can see below. If you want to know more, look at the datetime documentation (https://docs.python.org/3/library/datetime.html)... but for now, note how when you convert the year to 4 characters, you get '2048' (even though you might have been expecting '1948' here). This happens because %y follows the convention that two-digit years 00 to 68 mean 2000 to 2068, and 69 to 99 mean 1969 to 1999: yes, you too can introduce data errors if you're not careful!

In [26]:
import datetime

# Day comes first here, so this is a European-style date
EU_date_string = '14/03/48'

date_struct = datetime.datetime.strptime(EU_date_string, '%d/%m/%y')
print("Date_structure is {}".format(date_struct))

US_date_string = date_struct.strftime('%m-%d-%y')
print("US version is {}".format(US_date_string))

print("Weekday was {}".format(date_struct.weekday()))
# the shorter version of this is 
#datetime.datetime.strptime(EU_date_string, '%d/%m/%y').strftime('%m-%d-%y')

longer_date_string = date_struct.strftime('%A %B %d %Y')
print("Date was {}".format(longer_date_string))

Date_structure is 2048-03-14 00:00:00
US version is 03-14-48
Weekday was 5
Date was Saturday March 14 2048
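
One possible guard against the 2048 surprise, sketched under a loud assumption: every date in your data is known to be in the past (birth dates, for example). If the parsed date lands in the future, it is moved back a century. The function name `parse_past_date` and its `today` parameter are hypothetical, introduced here only for illustration.

```python
import datetime

def parse_past_date(date_string, fmt='%d/%m/%y', today=None):
    """Parse a two-digit-year date, ASSUMING it cannot lie in the future."""
    if today is None:
        today = datetime.date.today()
    parsed = datetime.datetime.strptime(date_string, fmt)
    if parsed.date() > today:
        # %y guessed the wrong century, so step back 100 years
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed

# A fixed reference date keeps the example reproducible
reference = datetime.date(2024, 1, 1)
print(parse_past_date('14/03/48', today=reference))
print(parse_past_date('14/03/20', today=reference))
```

If your data can legitimately contain future dates, this rule is wrong for it; the safer fix is to get four-digit years from the source whenever you can.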


Aside: You might be wondering why printing the date structure gave you a string, not a structure. That's because print() asks an object for its string form, and datetime objects supply a human-readable one. The object itself is still a structure, and you can access each part of it individually, as seen below with "month".

In [27]:
EU_date_string = '14/03/48'

date_structure = datetime.datetime.strptime(EU_date_string, '%d/%m/%y')
print("Type is {}".format(type(date_structure)))
print("Month is {}".format(date_structure.month))

Type is <class 'datetime.datetime'>
Month is 3
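
To make the aside concrete, the short sketch below compares the string that print() shows with the more literal repr() view of the same object, and reads a few more fields directly.

```python
import datetime

date_structure = datetime.datetime.strptime('14/03/48', '%d/%m/%y')

# str() is what print() uses: a human-readable date and time
as_text = str(date_structure)
print(as_text)

# repr() shows the object as a constructor call instead
as_repr = repr(date_structure)
print(as_repr)

# Individual fields are plain integers
print(date_structure.year, date_structure.day, date_structure.hour)
```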


## Dealing with Language Encodings

AKA what about Côte d'Ivoire?