# MAST4016 Data Science Practice

## Week 2: Data Cleaning and Preprocessing in Python

This week you'll be learning key data science skills: data cleaning and preprocessing.

If you work as a data scientist, you're likely to be given a dataset and asked to analyse it. You won't necessarily know much about where the dataset has come from, and even if you do, you may not be sure whether it contains errors. If you don't clean your data, the results of your analyses may be misleading.

As we'll see, sometimes the errors will be obvious. At other times they won't be, and you will need to judge whether there is a problem with the data. In those cases, you may need to consult other people.

### Simple example

Imagine that we have a dataset that we know is on ages, but we don't have any other details.

In [1]:
ages = [5, 0.25, -99, 167]

Here, we know for sure that we can't have a negative age because that doesn't make sense. We should remove that data point.

We need to consider what data type our ages are stored in. In this case, we see that we have square brackets, so this is a list in Python. Lists are ordered and each value has an index which starts at 0. In this case, there are four values, so the final index is 3.

We can remove the negative entry using the code below.

In [3]:
del ages[2]
ages

[5, 0.25, 167]
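
Deleting by index works when we already know where the bad value sits. As a minimal sketch of an alternative, a list comprehension can keep only the values that pass a check, so we don't have to count positions by hand. The name `non_negative_ages` is just an illustrative choice.

```python
ages = [5, 0.25, -99, 167]

# Keep only the values that are zero or greater
non_negative_ages = [age for age in ages if age >= 0]
print(non_negative_ages)  # [5, 0.25, 167]
```

This leaves the original `ages` list unchanged, which makes it easier to check later what was removed.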

<span style="color:blue">Task 1: Now you need to consider the rest of the data. Discuss what your views are: should you remove anything else?</span>

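Not necessarily. Without more context we can't be sure what is being measured. If the ages are in years, 0.25 could be a valid age (three months), so a fractional value is not a reason on its own to remove a data point. An age of 167 would be implausible for a person but possible for, say, a tree or a building. This is a case where we would need to ask whoever supplied the data before removing anything else.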

### Second example

For the data below, you are told that the ages are in years and are for tortoises. <span style="color:blue">Task 2: Carry out similar data cleaning and calculate the mean age for the tortoises with plausible values. You may need to do a search on how to calculate a mean in Python.</span>

In [8]:
tortoises = [-9, -0.01, 105, 2250, 52, 12, -999]
del tortoises[0]
del tortoises[0]
del tortoises[-1]
del tortoises[1]
tortoises

[105, 52, 12]

In [12]:
mean = sum(tortoises)/len(tortoises)
mean

56.333333333333336
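
The same cleaning can be sketched as a single filtering step followed by `mean` from the standard-library `statistics` module. The upper limit of 200 years is an assumption made purely for illustration, not an authoritative figure; choose a limit that you can justify for your own data.

```python
from statistics import mean

tortoises = [-9, -0.01, 105, 2250, 52, 12, -999]

# Hypothetical plausibility limit, chosen only for this illustration
MAX_PLAUSIBLE_AGE = 200

# Keep ages that are non-negative and no larger than the assumed limit
plausible = [age for age in tortoises if 0 <= age <= MAX_PLAUSIBLE_AGE]
print(plausible)  # [105, 52, 12]

mean_age = mean(plausible)
print(mean_age)
```

Writing the rule as a condition, rather than deleting by index, documents why each value was dropped.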

#### Optional task

Use a search engine to find out which formats the data below are stored in. Also find out how to convert these data to a more familiar format.

In [14]:
test1 = '0x1F'
test1
solution1 = (1*16)+(15*1)
solution1

31

In [15]:
test2 = '0x31'
test2
solution2 = (3*16)+(1*1)
solution2

49

In [48]:
test3 = '0b101110111'
test3
solution3 = (1*2**8)+(0*2**7)+(1*2**6)+(1*2**5)+(1*2**4)+(0*2**3)+(1*2**2)+(1*2**1)+(1*2**0)
solution3

375

In [54]:
test4 = '0o1642145'
test4
solution4 = (1*8**6)+(6*8**5)+(4*8**4)+(2*8**3)+(1*8**2)+(4*8**1)+(5*8**0)
solution4

476261
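
The strings above are written in hexadecimal (`0x`), binary (`0b`) and octal (`0o`) notation. As a sketch of one way to convert them without doing the arithmetic by hand, Python's built-in `int` accepts the base as its second argument. Passing a base of 0 asks Python to infer the base from the prefix.

```python
test1 = '0x1F'
test3 = '0b101110111'
test4 = '0o1642145'

# Give the base explicitly: 16 for hexadecimal, 2 for binary, 8 for octal
value1 = int(test1, 16)
value3 = int(test3, 2)
value4 = int(test4, 8)
print(value1, value3, value4)  # 31 375 476261

# Base 0 tells int() to work out the base from the prefix
value2 = int('0x31', 0)
print(value2)  # 49
```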

### Strings in Python

Data can also be stored as strings in Python. These are quite straightforward as you will see below. 

You can access particular parts of strings using their index. Don't forget that in Python the index starts at zero! In this case, if we want to remove the 'x' characters below, we can extract a slice of str1. This slice starts at the 5th entry (which has the index 4) and goes up to but not including the 11th entry, which has the index 10.

In [54]:
str1 = "xxxxabcdefxxxx"
str1[4:10]

'abcdef'

A slice can also include steps as a third value as shown below. In this case, we start at the 5th entry and then go forwards in steps of two until we reach the 16th entry which isn't included.

In [68]:
str2 = 'xxxxaxbxcxdxexfxxx'
str2[4:15:2]

'abcdef'

<span style="color:blue">Task 4: Now you need to write code to clean the data by removing the 'x' characters in the examples below.</span>

<span style="color:blue">(Extra challenge: convert the cleaned data to integer format.)</span>

In [16]:
str3 = "xxxxxx123xx"
str3[6:9]

'123'

In [3]:
str4 = '12345xxxx'
str4[0:5]

'12345'

In [4]:
str5 = "xx1xx2xx3xxxx"
str5[2:9:3]

'123'

In [5]:
str6 = 'xxx1x2x3x4x5'
str6[3::2]

'12345'

In [13]:
str7 = 'xxxx1xxx2xxx3xxxxx'
str7[4:13:4]

'123'

In [14]:
str8 = "x1x2x3x4x5x6x7"
str8[1::2]

'1234567'
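
Slicing relies on knowing exactly where the unwanted characters are. When the 'x' characters could appear anywhere, one possible approach is the string method `replace`, followed by `int` for the extra challenge. This is a sketch rather than the only valid solution, and the names `cleaned7` and `number7` are illustrative.

```python
str7 = 'xxxx1xxx2xxx3xxxxx'

# Replace every 'x' with an empty string, wherever it appears
cleaned7 = str7.replace('x', '')
print(cleaned7)  # 123

# Convert the cleaned string to an integer so we can do arithmetic with it
number7 = int(cleaned7)
print(number7 + 1)  # 124
```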

## Lists

In Python, we will want to work with more complex data types than just numbers or strings. There are various data types in Python that allow us to store multiple numbers, strings, or both. They have different characteristics and need to be handled in different ways.

We can create a list using square brackets as in the example below. Each entry in a list has an index, again starting with zero. We are likely to want to clean lists by removing particular entries.

In [63]:
list1 = [1, 'abc', 3, 5]
list1[2]

3

The strings in our list also have their own index, so we can extract a string from a list and then part of the string.

In [64]:
list1[1][1:3]

'bc'

We can also change part of the list using code like that below.

In [65]:
list1[2] = 'replaced!'
list1

[1, 'abc', 'replaced!', 5]

We might try to copy a list by assigning it to a new name and then making changes to the "copy".

In [66]:
list2 = list1
list2[0] = "also replaced!"
list2

['also replaced!', 'abc', 'replaced!', 5]

**Assigning a list to a new name does not copy it: both names refer to the same list, so this also changes the original list! You need to be careful when making changes to lists in Python!**

In [67]:
list1

['also replaced!', 'abc', 'replaced!', 5]
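
If we want an independent copy, a minimal sketch is to use the list's `copy` method (`list(list1)` has the same effect). Note that this is a shallow copy: any lists nested inside the list would still be shared between the two.

```python
list1 = [1, 'abc', 3, 5]

# list2 is a new list with the same entries, not a second name for list1
list2 = list1.copy()
list2[0] = 'only in the copy'

print(list1)           # [1, 'abc', 3, 5]
print(list2)           # ['only in the copy', 'abc', 3, 5]
print(list1 is list2)  # False
```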

<span style="color:blue">Task 5: For the examples below, clean the data by removing the 'x' characters.</span> 

<span style="color:blue">Challenging: calculate the mean value for each list.</span>

In [46]:
example1 = [123, 'xx456', '1264', 'x4x5x6']
cleaned1 = [str(number).replace('x','') for number in example1]
print(cleaned1)
integers1 = [int(number) for number in cleaned1]
mean_value1 = sum(integers1)/len(integers1)
print(mean_value1)


example2 = ['1xxx3xxx5xxx6', 5.75, 1.23, 'x3.573']
cleaned2 = [float(str(number).replace('x','')) for number in example2]
print(cleaned2)
# Keep the floats as they are: int() would truncate 5.75 to 5 and distort the mean
mean_value2 = sum(cleaned2)/len(cleaned2)
print(round(mean_value2, 5))

example3 = ['xx123xxxx', 953, 0x1F]
cleaned3 = [str(number).replace('x','') for number in example3]
print(cleaned3)
integers3 = [int(number) for number in cleaned3]
mean_value3 = sum(integers3)/len(integers3)
print(mean_value3)

example4 = ['xx14xx57xx41', 3.57, 1245]
cleaned4 = [float(str(number).replace('x','')) for number in example4]
print(cleaned4)
# As above, keep the floats so that 3.57 is not truncated to 3
mean_value4 = sum(cleaned4)/len(cleaned4)
print(round(mean_value4, 5))

['123', '456', '1264', '456']
574.75
[1356.0, 5.75, 1.23, 3.573]
341.63825
['123', '953', '31']
369.0
[145741.0, 3.57, 1245.0]
48996.52333
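
The four examples repeat the same steps, so they could be wrapped in a small function. The sketch below uses a hypothetical helper named `mean_without_x`; it converts every entry to a float, so whole numbers and decimals are handled in the same way.

```python
def mean_without_x(values):
    """Strip 'x' characters from each entry and return the mean as a float."""
    cleaned = [float(str(value).replace('x', '')) for value in values]
    return sum(cleaned) / len(cleaned)

result1 = mean_without_x([123, 'xx456', '1264', 'x4x5x6'])
result3 = mean_without_x(['xx123xxxx', 953, 0x1F])
print(result1)  # 574.75
print(result3)  # 369.0
```

A function like this also avoids the copy-and-paste mistakes that creep in when variable names such as `cleaned1` and `cleaned2` are edited by hand.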


<span style="color:blue">Task 6: As a final test on lists, calculate the mean value for example5 but without using the text 'example5'!</span>

In [47]:
example5 = [456, 1.35, 'x1x3x5']
example6 = example5

cleaned6 = [float(str(number).replace('x','')) for number in example6]
print(cleaned6)
# Use the values from this example (cleaned6), kept as floats so 1.35 is not truncated
mean_value6 = sum(cleaned6)/len(cleaned6)
print(round(mean_value6, 5))

[456.0, 1.35, 135.0]
197.45


## Tuples

Tuples are similar to lists but are created using parentheses rather than square brackets.

In [79]:
tuple1 = (1, 'abc', 3, 5)

Tuples are **immutable**, as shown below. This behaviour is different from lists.

In [81]:
tuple1[2] = 'attempted change!'

TypeError: 'tuple' object does not support item assignment

To change an entry, we can convert the tuple to a list, make the change, and then convert it back to a tuple.

In [82]:
temp = list(tuple1)
temp[2] = 'change!'
tuple1 = tuple(temp)
tuple1

(1, 'abc', 'change!', 5)
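
Converting to a list is one option. Another possible sketch builds a new tuple from slices of the old one. The original tuple is never modified; the name `tuple1` is simply bound to a new tuple.

```python
tuple1 = (1, 'abc', 3, 5)

# Join the part before index 2, the new value, and the part after index 2.
# ('change!',) is a one-element tuple: the trailing comma is required.
tuple1 = tuple1[:2] + ('change!',) + tuple1[3:]
print(tuple1)  # (1, 'abc', 'change!', 5)
```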

<span style="color:blue">Task 7: Clean the tuples below by removing the 'x' characters.</span>

In [59]:
tupleexample1 = (123, 'xx456', '1264', 'x4x5x6')
list1 = list(tupleexample1)
removed1 = [str(number).replace('x','') for number in list1]
tupleexample1 = tuple(removed1)
print(tupleexample1)

tupleexample2 = ('1xxx3xxx5xxx6', 5.75, 1.23, 'x3.573')
list2 = list(tupleexample2)
removed2 = [str(number).replace('x','') for number in list2]
tupleexample2 = tuple(removed2)
print(tupleexample2)

tupleexample3 = ('xx123xxxx', 953, 0x1F)
list3 = list(tupleexample3)
removed3 = [str(number).replace('x','') for number in list3]
tupleexample3 = tuple(removed3)
print(tupleexample3)

tupleexample4 = ('xx14xx57xx41', 3.57, 1245)
list4 = list(tupleexample4)
removed4 = [str(number).replace('x','') for number in list4]
tupleexample4 = tuple(removed4)
print(tupleexample4)

('123', '456', '1264', '456')
('1356', '5.75', '1.23', '3.573')
('123', '953', '31')
('145741', '3.57', '1245')
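
The intermediate list is not strictly needed. As a sketch, `tuple` can consume a generator expression directly, which produces the cleaned tuple in one step. The name `cleaned_tuple1` is illustrative.

```python
tupleexample1 = (123, 'xx456', '1264', 'x4x5x6')

# Build the cleaned tuple directly, without creating a list first
cleaned_tuple1 = tuple(str(value).replace('x', '') for value in tupleexample1)
print(cleaned_tuple1)  # ('123', '456', '1264', '456')
```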


## Dictionaries

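Dictionaries are another way for Python to store data. Like lists, dictionaries are mutable, but their values are accessed by **keys** rather than by position. (Since Python 3.7, dictionaries keep the order in which entries were inserted, but they still have no numeric index.) Dictionaries are created using curly brackets.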

In [84]:
tortoise = {'name': 'George', 'age': 57, 'eats': 'lettuce'}
tortoise

{'name': 'George', 'age': 57, 'eats': 'lettuce'}

The dictionary can be changed using code such as the below.

In [87]:
tortoise['name'] = tortoise['name'][:5] + 'ina'
tortoise

{'name': 'Georgina', 'age': 57, 'eats': 'lettuce'}
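
Values are looked up by key rather than by position. The short sketch below shows a few other common operations: reading a value, adding a new key, and removing one. The `'weight_kg'` key and its value are made up for illustration.

```python
tortoise = {'name': 'Georgina', 'age': 57, 'eats': 'lettuce'}

print(tortoise['age'])        # 57
tortoise['weight_kg'] = 4.2   # add a new key (illustrative value)
del tortoise['eats']          # remove a key and its value
print(tortoise.get('eats'))   # None, because the key no longer exists
print(list(tortoise.keys()))  # ['name', 'age', 'weight_kg']
```

Using `get` rather than square brackets avoids a `KeyError` when a key might be missing.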

<span style="color:blue">Task 8: Clean the dictionaries below by removing 'x' characters.</span>

In [1]:
dictexample1 = {'key1': 123, 'key2':'xx456', 'key3':'1264', 'key4':'x4x5x6'}
cleaned1 = {key: str(value).replace('x','') for key, value in dictexample1.items()}
print(cleaned1)

dictexample2 = {'key1': '1xxx3xxx5xxx6', 'key2':5.75, 'key3':1.23, 'key4':'x3.573'}
cleaned2 = {key: str(value).replace('x','') for key, value in dictexample2.items()}
print(cleaned2)

dictexample3 = {'key1':'xx123xxxx', 'key2':953, 'key3':0x1F}
cleaned3 = {key: str(value).replace('x','') for key, value in dictexample3.items()}
print(cleaned3)

dictexample4 = {'key1':'xx14xx57xx41', 'key3':3.57, 'keyx':1245}
cleaned4 = {key: str(value).replace('x','') for key, value in dictexample4.items()}
print(cleaned4)

{'key1': '123', 'key2': '456', 'key3': '1264', 'key4': '456'}
{'key1': '1356', 'key2': '5.75', 'key3': '1.23', 'key4': '3.573'}
{'key1': '123', 'key2': '953', 'key3': '31'}
{'key1': '145741', 'key3': '3.57', 'keyx': '1245'}
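
After cleaning, every value is a string. If numbers are needed for later calculations, one possible sketch converts each value to a float inside the same dictionary comprehension. The name `numeric2` is illustrative.

```python
dictexample2 = {'key1': '1xxx3xxx5xxx6', 'key2': 5.75, 'key3': 1.23, 'key4': 'x3.573'}

# Remove the 'x' characters, then convert each cleaned value to a float
numeric2 = {key: float(str(value).replace('x', ''))
            for key, value in dictexample2.items()}
print(numeric2)  # {'key1': 1356.0, 'key2': 5.75, 'key3': 1.23, 'key4': 3.573}
```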


## GitHub

Now that you've completed this notebook, you should upload it to GitHub. Try to do this yourself, but if you get stuck then please ask!