# Data Cleaning in Pandas

> This lesson outlines some of the techniques for cleaning different data types in Pandas. It draws on the methods discussed in previous lessons, but with a focus on how to apply them to specific tasks.

## Cleaning Text Data


### Common Text Operations

In data cleaning, one often deals with text data that requires standardization and formatting. Pandas offers several built-in methods that make these tasks straightforward. 

#### String Trimming 
The `.str.strip()` method removes leading and trailing whitespace from every string in a `Series`. It's particularly useful when your dataset contains stray spaces that would otherwise cause identical values to be treated as different.


In [42]:
import pandas as pd
example_series = pd.Series([' Horse', 'Horse   ', '   HORSE', 'HORSE  ', 'H@RSE'])
example_series.head()

0       Horse
1    Horse   
2       HORSE
3     HORSE  
4       H@RSE
dtype: object

In [38]:
example_series = example_series.str.strip()
example_series.head()

0    Horse
1    Horse
2    HORSE
3    HORSE
4    H@RSE
dtype: object


### Converting Case

>When handling categorical variables it is often necessary to standardise the case of your text data. In such instances, it's not uncommon to find the same category represented in varying cases - some in uppercase, others in lowercase, or even a mix. A common solution is to force the text into either upper or lower case, and this can be achieved with the `.str.lower()` and `.str.upper()` methods:


In [39]:
# Convert all to Uppercase
example_series.str.upper()


0    HORSE
1    HORSE
2    HORSE
3    HORSE
4    H@RSE
dtype: object

In [40]:
# Convert all to Lowercase
example_series = example_series.str.lower()
example_series.head()



0    horse
1    horse
2    horse
3    horse
4    h@rse
dtype: object
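To see why case standardisation matters for categorical data, the sketch below counts the distinct categories before and after lowercasing. The `animals` Series is invented for illustration and is not part of the lesson's data.

```python
import pandas as pd

# Hypothetical category column with inconsistent casing (illustrative data only)
animals = pd.Series(['Horse', 'HORSE', 'horse', 'Cat', 'cat'])

# Before normalisation, every casing variant counts as its own category
unique_before = animals.nunique()

# After lowercasing, the variants collapse into the two real categories
animals_lower = animals.str.lower()
unique_after = animals_lower.nunique()

print(unique_before, unique_after)
print(animals_lower.value_counts())
```

Comparing `nunique()` before and after a cleaning step is a quick way to check that the step merged the variants you expected.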

### Fixing Incorrect Values with the `replace()` Method

> Another common scenario is that some values in a column are simply incorrect. Fixing them is heavily data-dependent and requires an understanding of what the column is supposed to contain.

The simplest use of `.str.replace()` is to swap one character string for another:


In [41]:
example_series = example_series.str.replace('@', 'o')
example_series.head()

0    horse
1    horse
2    horse
3    horse
4    horse
dtype: object
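The three operations shown so far (trimming, case conversion and replacement) can also be chained into a single expression. The following is a minimal sketch using the same example values; the names `raw` and `cleaned` are chosen here for illustration.

```python
import pandas as pd

# Same values as the example series used above
raw = pd.Series([' Horse', 'Horse   ', '   HORSE', 'HORSE  ', 'H@RSE'])

# Chain the steps: trim whitespace, lowercase, then fix the stray '@'
cleaned = (
    raw.str.strip()
       .str.lower()
       .str.replace('@', 'o', regex=False)  # regex=False treats '@' as a literal string
)

print(cleaned.unique())
```

Passing `regex=False` explicitly avoids depending on the default, which has differed between Pandas versions.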

## Advanced String Manipulation

>Aside from the simple examples described above, Pandas is capable of much more advanced string manipulations. Let's consider some situations and how they are handled.


### Scenario: Cleaning a Boolean Column

In this example, we will look at a column called `CANCELLED`, which is intended to be a Boolean column indicating whether a service or order has been cancelled:

In [45]:
cancelled = pd.read_csv('cancellations.csv')
cancelled.value_counts()

CANCELLED
False        19
0            13
F            13
True          3
1             1
T             1
dtype: int64

From this we can see that the column is intended to be Boolean, but the values have been expressed in a variety of ways. One way to fix this is to use the `.replace()` method to swap each variant for the correct value. For example, we can replace all the `0` values with `False` as follows:

In [47]:
cancelled.replace({'0': False}, inplace=True)
cancelled.value_counts()

CANCELLED
False        19
False        13
F            13
True          3
1             1
T             1
dtype: int64

Note that `False` appears twice in the `value_counts` result. This is because Pandas distinguishes between the string `"False"` and the Boolean value `False`. If we want to convert the column to a Boolean type, we need to ensure that every value in it is a Boolean.

The dictionary passed to `.replace()` can hold several mappings at once: its keys are the values to match and its values are the replacements. For example, `df.replace({'0': False, '1': True})` replaces every `'0'` string with `False` and every `'1'` string with `True`.

In [48]:
mapping_dictionary = {'0': False, '1': True, 'F': False, 'T': True, 'True': True, 'False': False}
cancelled.replace(mapping_dictionary, inplace=True)
cancelled.value_counts()

CANCELLED
False        45
True          5
dtype: int64
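Once every value is a Boolean, it is worth confirming the column's dtype, since a column of Python `True`/`False` objects may still be stored as `object`. The sketch below is one possible way to do this under stated assumptions: it uses a small stand-in Series (`cancelled_demo`, invented here because the CSV is not reproduced) and swaps in `.map()` instead of `.replace()`, because `.map()` turns any value missing from the dictionary into `NaN`, which makes leftovers easy to spot.

```python
import pandas as pd

# Stand-in for the CANCELLED column (illustrative values, not the real CSV)
cancelled_demo = pd.Series(['False', '0', 'F', 'True', '1', 'T'], name='CANCELLED')

mapping = {'0': False, '1': True, 'F': False, 'T': True, 'True': True, 'False': False}

# .map() returns NaN for any value that is not a key in the dictionary
mapped = cancelled_demo.map(mapping)

# Check for unmapped values first: astype(bool) would silently turn NaN into True
unmapped = mapped.isna().sum()
print(unmapped)

# Safe here only because no values were left unmapped
cancelled_bool = mapped.astype(bool)
print(cancelled_bool.dtype)
```

If `unmapped` were greater than zero, the mapping dictionary would need extending before the cast.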

### Scenario: Forcing Values to Adhere to a Pattern

In some cases, it is clear from the data in a column that all values should follow a particular pattern, in which case it may make sense to remove or replace any values that do not adhere to it. An example might be a column containing UK phone numbers. There are multiple ways to represent a UK phone number, for example `+44 7555 555 555` or `07555 555555`. When several valid formats exist, one solution is to apply a *regular expression* that covers as many of them as possible.

>A **regular expression**, often abbreviated as *regex*, is a sequence of characters that defines a search pattern that can be used for matching, allowing for complex search, replace, and validation operations.

**Regex** is an extensive topic, and the details of constructing **regex** patterns are beyond the scope of this lesson. As with much in the world of data, however, the work has often already been done for you: ready-made patterns can be found on sites such as [Stack Overflow](https://stackoverflow.com/), or in searchable **regex** repositories such as [regexlib](https://regexlib.com/). Searching **regexlib** for `UK Phone Number` provides this option:

 `^((\(?0\d{4}\)?\s?\d{3}\s?\d{3})|(\(?0\d{3}\)?\s?\d{3}\s?\d{4})|(\(?0\d{2}\)?\s?\d{4}\s?\d{4}))(\s?\#(\d{4}|\d{3}))?$`

This pattern covers many UK phone number variants, including area codes in brackets (e.g. `(020)`) and extensions following a `#` symbol. It does not, however, accept numbers written with the international `+44` prefix.
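Before applying a borrowed pattern to real data, it helps to try it on a handful of values whose expected result you already know. The sketch below does this for the regexlib pattern quoted above; the sample numbers and the variable names are made up for illustration.

```python
import pandas as pd

# The regexlib pattern quoted above, split across lines for readability.
# Raw strings (r'...') keep the backslashes intact for the regex engine.
uk_phone_pattern = (
    r'^((\(?0\d{4}\)?\s?\d{3}\s?\d{3})'
    r'|(\(?0\d{3}\)?\s?\d{3}\s?\d{4})'
    r'|(\(?0\d{2}\)?\s?\d{4}\s?\d{4}))'
    r'(\s?\#(\d{4}|\d{3}))?$'
)

# Illustrative sample values (not taken from a real dataset)
samples = pd.Series(['07555 555555', '(020) 7946 0958', '+44 7555 555 555', '12345'])

# str.match returns True where the pattern matches from the start of the string
matches = samples.str.match(uk_phone_pattern)
print(pd.DataFrame({'phone': samples, 'matches': matches}))
```

The `+44` form is rejected because every alternative in this pattern must begin with `0` (optionally preceded by an opening bracket).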

Let's try this approach on an example column of phone numbers. In the code blocks below, we create an example dataframe of phone numbers, including some invalid ones, then apply a **regex** to each row in the column and replace any value that does not comply with `NaN`. The code uses a longer pattern than the one quoted above, which also accepts the `+44` prefix and hyphen separators. We use the `str.match()` method to apply the pattern, and then use logical indexing to replace the non-matching values.

In [49]:
# Creating a sample dataframe
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eva', 'Frank', 'Grace', 'Hank', 'Ivy', 'Jack'],
    'Phone': ['0123456789', '01234 567890', '+441234567890', '0123-456-789', 
              '(0123) 456789', '1234567890', '0123456789a', '01234-567-890', 
              '+44 1234 567890', '01234']
}

phone_df = pd.DataFrame(data)

phone_df.head(10)

|   | Name | Phone |
|---|---|---|
| 0 | Alice | 0123456789 |
| 1 | Bob | 01234 567890 |
| 2 | Charlie | +441234567890 |
| 3 | Diana | 0123-456-789 |
| 4 | Eva | (0123) 456789 |
| 5 | Frank | 1234567890 |
| 6 | Grace | 0123456789a |
| 7 | Hank | 01234-567-890 |
| 8 | Ivy | +44 1234 567890 |
| 9 | Jack | 01234 |


In [50]:
import numpy as np # We will need the `nan` constant from the numpy library to apply to missing values

# Regular expression to match (a raw string, so the backslashes reach the regex engine unchanged)
regex_expression = r'^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))(?:(?:\d{5}\)?[\s-]?\d{4,5})|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$'
# For every row where the Phone column does not match the pattern, replace the value with NaN
phone_df.loc[~phone_df['Phone'].str.match(regex_expression), 'Phone'] = np.nan
phone_df.head(10)

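|   | Name | Phone |
|---|---|---|
| 0 | Alice | 0123456789 |
| 1 | Bob | 01234 567890 |
| 2 | Charlie | +441234567890 |
| 3 | Diana | 0123-456-789 |
| 4 | Eva | (0123) 456789 |
| 5 | Frank | NaN |
| 6 | Grace | NaN |
| 7 | Hank | 01234-567-890 |
| 8 | Ivy | +44 1234 567890 |
| 9 | Jack | NaN |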
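One caveat with the logical-indexing approach is that the column may already contain missing values. Depending on the Pandas version and the column's dtype, `str.match()` can return a missing value for those rows, and negating such a mask with `~` can raise an error. The sketch below shows one way to guard against this with the `na=False` argument; the data and the deliberately simplified pattern are assumptions made for illustration, not the pattern used above.

```python
import pandas as pd

# Illustrative column that already contains a missing value
phones = pd.Series(['01234 567890', None, 'not a number'])

# Deliberately simplified pattern for this sketch: 0, four digits,
# an optional space, then six digits
simple_pattern = r'^0\d{4} ?\d{6}$'

# na=False makes missing entries count as non-matching, so the mask is purely Boolean
is_valid = phones.str.match(simple_pattern, na=False)

# Series.where keeps values where the mask is True and inserts NaN elsewhere
cleaned = phones.where(is_valid)
print(cleaned)
```

Using `.where()` returns a new Series rather than modifying the original, which makes it easy to compare the data before and after cleaning.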


### Scenario: Cleaning Numeric Columns with `.replace()`

The `.replace()` method can also be used to clean up numeric data, for example if you have a column of prices that contain the `£` symbol, thereby preventing the column from being cast to a numeric data type.
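As a minimal sketch of the price case, the example below strips the currency symbol and thousands separators and then casts the result with `pd.to_numeric()`. The `prices` values are invented for illustration.

```python
import pandas as pd

# Hypothetical price column stored as text (illustrative values only)
prices = pd.Series(['£1,200.50', '£15', ' £0.99 '])

# Remove surrounding whitespace, the currency symbol and thousands separators,
# then convert the remaining digit strings to numbers
prices_numeric = pd.to_numeric(
    prices.str.strip()
          .str.replace('£', '', regex=False)
          .str.replace(',', '', regex=False)
)

print(prices_numeric)
print(prices_numeric.dtype)
```

If some values might still fail to convert, `pd.to_numeric(..., errors='coerce')` turns them into `NaN` instead of raising an error.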

In the example of the phone numbers `DataFrame`, we still have a variety of non-numeric characters in the data which should be replaced in order to regularise the numbers. To rectify this the following actions are needed:

- Replace any instances of `+44` with `0`, as this is how to write the number for calling within the UK
- Replace the `(`, `)` and `-` characters with nothing (i.e. remove them)
- Remove all spaces

The code block below shows how to achieve this:

In [None]:
# You can do each step one by one, for example replacing `+44` with `0` using the following syntax:

phone_df['Phone'] = phone_df['Phone'].str.replace('+44', '0', regex=False)
phone_df

# Or by setting `regex=True`, you can do it all in one step:

phone_df['Phone'] = phone_df['Phone'].replace({r'\+44': '0', r'\(': '', r'\)': '', r'-': '', r' ': ''}, regex=True)
phone_df