# Data Cleaning in Pandas

> This lesson outlines some of the techniques for cleaning different data types in Pandas. It draws on the methods discussed in previous lessons, but with a focus on how to apply them to specific tasks.

## Cleaning Text Data

### Common Text Operations

> In data cleaning, one often deals with text data that requires standardisation and formatting. Pandas offers several built-in methods that make these tasks straightforward. 

#### 1. String Trimming 
The `.str.strip()` method is used to remove whitespace from the beginning and end of a string. It's particularly useful when your dataset contains extra spaces that may affect data quality or analysis. 


In [1]:
import pandas as pd
example_series = pd.Series([' Horse', 'Horse   ', '   HORSE', 'HORSE  ', 'H@RSE'])
example_series.head()

0       Horse
1    Horse   
2       HORSE
3     HORSE  
4       H@RSE
dtype: object

In [2]:
example_series = example_series.str.strip()
example_series.head()

0    Horse
1    Horse
2    HORSE
3    HORSE
4    H@RSE
dtype: object


#### 2. Converting Case

>When handling categorical variables it is often necessary to standardise the case of your text data. In such instances, it's not uncommon to find the same category represented in varying cases - some in uppercase, others in lowercase, or even a mix. A common solution is to force the text into either upper or lower case, and this can be achieved with the `.str.lower()` and `.str.upper()` methods:


In [3]:
# Convert all to Uppercase
example_series.str.upper()


0    HORSE
1    HORSE
2    HORSE
3    HORSE
4    H@RSE
dtype: object

In [4]:
# Convert all to Lowercase
example_series = example_series.str.lower()
example_series.head()



0    horse
1    horse
2    horse
3    horse
4    h@rse
dtype: object

#### 3. Fixing Incorrect Values with the `replace()` Method
> Another common scenario is a column containing values that are systematically incorrect in some way, for example a word that is often misspelled. This is heavily data-dependent, and requires an understanding of what the column is supposed to contain. 

The simplest use of `.str.replace()` is to swap one substring for another:


In [5]:
example_series = example_series.str.replace('@', 'o')
example_series.head()

0    horse
1    horse
2    horse
3    horse
4    horse
dtype: object
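The three steps above (trimming, case conversion and replacement) can also be chained into a single expression. The following is a minimal sketch using the same example values; the variable names `raw_animals` and `clean_animals` are chosen for illustration only.

```python
import pandas as pd

raw_animals = pd.Series([' Horse', 'Horse   ', '   HORSE', 'HORSE  ', 'H@RSE'])

# Chain the three steps: trim whitespace, lower the case, then fix the stray '@'.
# regex=False treats '@' as literal text rather than as a regex pattern.
clean_animals = (
    raw_animals
    .str.strip()
    .str.lower()
    .str.replace('@', 'o', regex=False)
)

print(raw_animals.nunique())    # 5 distinct raw spellings
print(clean_animals.nunique())  # 1 distinct value after cleaning
```

Comparing `nunique()` before and after is a quick sanity check that the variants really have collapsed into one value.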

## Advanced String Manipulation
>Aside from the simple examples described above, Pandas is capable of much more advanced string manipulations. Let's consider some situations and how they are handled.

### Scenario: Cleaning a Boolean Column
In this example, we will look at a column called `CANCELLED`, which is intended to be a Boolean column indicating whether a service or order has been cancelled:

In [6]:
cancelled = pd.read_csv('https://cdn.theaicore.com/content/lessons/b17e0a6b-68db-4a1f-9433-04ab57d6da3a/cancellations.csv')

cancelled.value_counts()

CANCELLED
False        19
0            13
F            13
True          3
1             1
T             1
dtype: int64

From this we can see that the column is intended as a Boolean, but the values have been expressed in a variety of ways. One way to fix this is to use the `.replace()` method to replace one value with another. For example, we can replace all the `'0'` strings with `False` as follows:

In [7]:
cancelled.replace({'0': False}, inplace=True)
cancelled.value_counts()

CANCELLED
False        19
False        13
F            13
True          3
1             1
T             1
dtype: int64

Note that `False` appears twice in the `value_counts` result. This is because Pandas is distinguishing between the string  `"False"` and the Boolean value `False`. If we want to convert the column to a Boolean type, we will need to ensure that all values in it are of Boolean type.
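The distinction between the string `"False"` and the Boolean `False` can be made visible by inspecting the Python type of each value. This is a small sketch on a made-up series, not on the `cancelled` data itself.

```python
import pandas as pd

# A made-up column mixing the string 'False' with the Boolean False
mixed = pd.Series(['False', False, 'False', False], dtype='object')

# Record the name of each value's Python type
type_names = mixed.map(lambda value: type(value).__name__)

print(type_names.value_counts())  # two 'str' entries and two 'bool' entries
print(mixed.nunique())            # 2, because 'False' and False are different values
```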

The dictionary passed to `.replace()` can contain several entries: the dictionary keys are the values to match, and the dictionary values are the replacements. For example, `df.replace({'0': False, '1': True})` replaces all instances of the strings `'0'` and `'1'` with `False` and `True` respectively. The code below uses this to map every variant in the column to a Boolean:

In [8]:
mapping_dictionary = {'0': False, '1': True, 'F': False, 'T': True, 'True': True, 'False': False}
cancelled.replace(mapping_dictionary, inplace=True)
cancelled = cancelled.astype('bool')
cancelled.value_counts()

CANCELLED
False        45
True          5
dtype: int64
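It is worth seeing why the mapping step has to come before `.astype('bool')`. Casting strings directly follows Python truthiness, so every non-empty string becomes `True`. The sketch below uses a made-up series, and uses `.map()` rather than `.replace()` purely to keep the example short; note that `.map()` turns any value missing from the dictionary into `NaN`.

```python
import pandas as pd

# Made-up values, stored as Python objects like a mixed CSV column
raw_flags = pd.Series(['False', '0', 'F', 'True'], dtype='object')

# Casting strings straight to bool: every non-empty string is truthy
naive = raw_flags.astype('bool')

# Mapping each string to a real Boolean first gives the intended result
mapping = {'0': False, '1': True, 'F': False, 'T': True, 'True': True, 'False': False}
mapped = raw_flags.map(mapping).astype('bool')

print(naive.tolist())   # [True, True, True, True]
print(mapped.tolist())  # [False, False, False, True]
```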

### Scenario: Forcing Values to Adhere to a Pattern

In some cases, it might be clear from the data in a column that a particular pattern should be expected for all values, in which case it may make sense to remove or replace any values that do not adhere to this pattern. An example might be a column containing UK phone numbers. There are multiple ways to represent a UK phone number, for example `+44 7555 555 555` or `07555 555555`. To handle the situation where multiple possible formats exist, the solution is to apply a *regular expression* to handle as many cases as possible. 

>A **regular expression**, often abbreviated as *regex*, is a sequence of characters that defines a search pattern that can be used for matching, allowing for complex search, replace, and validation operations.

**Regex** is an extensive topic, and the details of constructing **regex** patterns are beyond the scope of this lesson. As with much in the world of data, however, the work has often already been done for you: suitable patterns can be found on sites such as [Stack Overflow](https://stackoverflow.com/), or in searchable **regex** repositories such as [regexlib](https://regexlib.com/). Searching **regexlib** for `UK Phone Number` provides this option:

 `^((\(?0\d{4}\)?\s?\d{3}\s?\d{3})|(\(?0\d{3}\)?\s?\d{3}\s?\d{4})|(\(?0\d{2}\)?\s?\d{4}\s?\d{4}))(\s?\#(\d{4}|\d{3}))?$`

This pattern covers many UK phone number variants, including area codes in brackets (e.g. `(020)`) and extensions following a `#` symbol. It does not, however, accept the international `+44` prefix.
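Before applying a borrowed pattern to a whole column, it can help to try it on a handful of strings with Python's built-in `re` module. The sketch below does this for the regexlib pattern quoted above; the candidate numbers are made up for illustration.

```python
import re

# The regexlib pattern quoted above, written as a raw string
uk_phone_pattern = re.compile(
    r'^((\(?0\d{4}\)?\s?\d{3}\s?\d{3})|(\(?0\d{3}\)?\s?\d{3}\s?\d{4})'
    r'|(\(?0\d{2}\)?\s?\d{4}\s?\d{4}))(\s?\#(\d{4}|\d{3}))?$'
)

# Made-up candidate numbers in a few different formats
candidates = [
    '07555 555555',
    '(020) 7946 0958',
    '020 7946 0958 #123',
    '+44 7555 555 555',
    '12345',
]

results = {number: bool(uk_phone_pattern.match(number)) for number in candidates}
for number, matched in results.items():
    print(f'{number!r}: {matched}')
```

The first three candidates match. The `+44` form and the short number do not, which shows the kind of gap worth checking for before relying on a pattern.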

Let's try a pattern out on an example column of phone numbers. In the code block below, we create an example `DataFrame` of phone numbers, including some invalid ones, and then apply a **regex** to each row in the column, replacing any value that does not comply with `NaN`. The pattern used in the code is longer and more permissive than the example above: it also accepts the `+44` prefix and hyphens as separators. We use the `str.match()` method to apply the **regex**, and then use logical indexing to replace the non-matching values.

In [9]:
# Creating a sample dataframe
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eva', 'Frank', 'Grace', 'Hank', 'Ivy', 'Jack'],
    'Phone': ['0123456789', '01234 567890', '+441234567890', '0123-456-789', 
              '(0123) 456789', '1234567890', '0123456789a', '01234-567-890', 
              '+44 1234 567890', '01234']
}

phone_df = pd.DataFrame(data)

phone_df.head(10)

      Name            Phone
0    Alice       0123456789
1      Bob     01234 567890
2  Charlie    +441234567890
3    Diana     0123-456-789
4      Eva    (0123) 456789
5    Frank       1234567890
6    Grace      0123456789a
7     Hank    01234-567-890
8      Ivy  +44 1234 567890
9     Jack            01234


In [10]:
import numpy as np # We need the `nan` constant from the numpy library to mark invalid values as missing

# Our regular expression to match, written as a raw string (r'...') so that Python leaves the backslashes untouched
regex_expression = r'^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))(?:(?:\d{5}\)?[\s-]?\d{4,5})|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$'
# For every row where the Phone column does not match our regular expression, replace the value with NaN
phone_df.loc[~phone_df['Phone'].str.match(regex_expression), 'Phone'] = np.nan
phone_df.head(10)

      Name            Phone
0    Alice       0123456789
1      Bob     01234 567890
2  Charlie    +441234567890
3    Diana     0123-456-789
4      Eva    (0123) 456789
5    Frank              NaN
6    Grace              NaN
7     Hank    01234-567-890
8      Ivy  +44 1234 567890
9     Jack              NaN
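If the column already contains missing values, `str.match()` returns a missing result for those rows, which makes the mask awkward to negate. The sketch below shows one possible way around this, using the `na=False` argument and `Series.where()`. The pattern here is a deliberately simple, hypothetical one (a leading `0` followed by exactly ten digits), not the full UK pattern used above.

```python
import pandas as pd

# Made-up phone numbers, including one missing entry
phones = pd.Series(['01234567890', '1234', None, '01234 567890'])

# Hypothetical, deliberately simple pattern: a leading 0 followed by exactly ten digits
simple_pattern = r'^0\d{10}$'

# na=False makes missing entries count as "no match" instead of leaving gaps in the mask
is_valid = phones.str.match(simple_pattern, na=False)

# Series.where keeps the values where the mask is True and sets the rest to NaN
validated = phones.where(is_valid)
print(validated)
```

Only the first entry survives here; the last one is rejected because this simple pattern does not allow a space.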


### Scenario: Cleaning Numeric Columns with `.replace()`

The `.replace()` method can also be used to clean up numeric data, for example if you have a column of prices that contain the `£` symbol, thereby preventing the column from being cast to a numeric data type. 
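As a minimal sketch of the price example, assuming a made-up series of price strings that contain a `£` symbol and a thousands separator, the unwanted characters can be removed before converting the column with `pd.to_numeric()`:

```python
import pandas as pd

# Made-up price strings; the '£' and ',' characters prevent a direct numeric cast
prices = pd.Series(['£1.50', '£20', '£3,000.00'])

# Strip the currency symbol and thousands separator as literal text
cleaned = (
    prices
    .str.replace('£', '', regex=False)
    .str.replace(',', '', regex=False)
)

# Convert the cleaned strings to a numeric dtype
numeric_prices = pd.to_numeric(cleaned)
print(numeric_prices.tolist())  # [1.5, 20.0, 3000.0]
```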

In the example of the phone numbers `DataFrame`, we still have a variety of non-numeric characters in the data which should be replaced in order to regularise the numbers. To rectify this the following actions are needed:

- Replace any instances of `+44` with `0`, as this is how the number is written for calls within the UK
- Replace the `(`, `)` and `-` characters with nothing (i.e. remove them)
- Remove all spaces

The code block below shows how to achieve this:

In [11]:
# You can do each step one by one, for example replacing `+44` with `0`
# (`regex=False` treats `+44` as literal text rather than as a pattern):

phone_df['Phone'] = phone_df['Phone'].str.replace('+44', '0', regex=False)
phone_df

# Or, by passing a dictionary of patterns to `.replace()` with `regex=True`, you can do it all in one step:

phone_df['Phone'] = phone_df['Phone'].replace({r'\+44': '0', r'\(': '', r'\)': '', r'-': '', r' ': ''}, regex=True)
phone_df

      Name        Phone
0    Alice   0123456789
1      Bob  01234567890
2  Charlie  01234567890
3    Diana   0123456789
4      Eva   0123456789
5    Frank          NaN
6    Grace          NaN
7     Hank  01234567890
8      Ivy  01234567890
9     Jack          NaN


## Unique Values



It is sometimes necessary to find the number of unique values in a column. For example, we might be working with a column of product IDs, where it would negatively affect our analysis to have multiple products with the same ID. To check whether an issue like this exists, we can use the methods `unique` and `nunique`.

- The `unique` method returns all the unique (i.e. distinct) values in the data series. For example, given a series of `[ 1, 1, 2, 3, 4]` it would return `[1, 2, 3, 4]`.
- The `nunique` method returns the **count** of unique values in the series. For example, from a series of `[ 1, 1, 2, 3, 4]` it would return `4`.

In [13]:
# Creating a sample dataframe with a column of product IDs
data = {'product_ids': ['P001', 'P002', 'P003', 'P001', 'P004', 'P005', 'P003', 'P006', 'P002']}
products_df = pd.DataFrame(data)

# Using `unique` to get unique product IDs
unique_ids = products_df['product_ids'].unique()

# Using `nunique` to get the number of unique product IDs
num_unique_ids = products_df['product_ids'].nunique()

# Displaying the original DataFrame
print("Original dataframe:")
print(products_df)

# Displaying the unique product IDs
print("\nUnique product IDs:")
print(unique_ids)

# Displaying the number of unique product IDs and the total number of rows in the DataFrame
print("\nNumber of unique product IDs:")
print(num_unique_ids)


print("\nTotal number of rows in the dataframe:")
print(len(products_df))


Original dataframe:
  product_ids
0        P001
1        P002
2        P003
3        P001
4        P004
5        P005
6        P003
7        P006
8        P002

Unique product IDs:
['P001' 'P002' 'P003' 'P004' 'P005' 'P006']

Number of unique product IDs:
6

Total number of rows in the dataframe:
9
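Knowing that there are fewer unique IDs than rows tells us that repeats exist, but not which IDs are repeated. One possible follow-up, sketched below on the same product IDs, is to use `is_unique` as a quick check and `value_counts()` to list the offending values.

```python
import pandas as pd

product_ids = pd.Series(['P001', 'P002', 'P003', 'P001', 'P004', 'P005', 'P003', 'P006', 'P002'])

# is_unique is True only when no value is repeated
print(product_ids.is_unique)  # False

# Count the occurrences of each ID and keep those that appear more than once
id_counts = product_ids.value_counts()
repeated_ids = id_counts[id_counts > 1]
print(sorted(repeated_ids.index))  # ['P001', 'P002', 'P003']
```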


## Handling Duplicates


> *Duplicates* in data refer to two or more rows that are identical across all columns, or, depending on the context, identical in a subset of columns, which can lead to redundancy and inaccuracies in data analysis and interpretation. The presence of duplicates is a common data cleaning issue, as duplicated data can distort descriptive statistics and data visualisations, leading to inaccurate insights and misinformed decisions. For instance, duplicate entries can artificially inflate the count of a category, skewing measures of central tendency like the mean and median, and affecting the distribution of data in visual representations.
Duplicates can either be *exact*, where a row is identical to another row across all columns, or *fuzzy*, which is where the two rows differ in some columns, but appear to describe the same entity. 

Exact duplicates are straightforward to handle in Pandas. You can find all the duplicated rows in a `DataFrame` using the `.duplicated()` method, or drop them using the `.drop_duplicates()` method. The code blocks below generate some example duplicate data, and then use `.drop_duplicates()` to drop the duplicate rows.

In [14]:

# Creating a sample dataframe
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Eva', 'Charlie'],
    'Age': [28, 34, 45, 28, 23, 45],
    'Phone': ['123-456', '456-789', '789-012', '123-456', '345-678', '789-012'],
    'Email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 
              'alice@email.com', 'eva@email.com', 'charlie@email.com']
}
duplicate_df = pd.DataFrame(data)
duplicate_df


      Name  Age    Phone              Email
0    Alice   28  123-456    alice@email.com
1      Bob   34  456-789      bob@email.com
2  Charlie   45  789-012  charlie@email.com
3    Alice   28  123-456    alice@email.com
4      Eva   23  345-678      eva@email.com
5  Charlie   45  789-012  charlie@email.com


We can find and remove the exact duplicates in the `DataFrame` as follows:

In [15]:
# Find duplicates based on all columns
print("result of duplicate_df.duplicated():")
print(duplicate_df.duplicated())

# Identifying and dropping exact duplicates
df_no_duplicates = duplicate_df.drop_duplicates()

# Displaying the dataframe after removing duplicates
print("\ndataframe After Removing Duplicates:")
print(df_no_duplicates)

result of duplicate_df.duplicated():
0    False
1    False
2    False
3     True
4    False
5     True
dtype: bool

dataframe After Removing Duplicates:
      Name  Age    Phone              Email
0    Alice   28  123-456    alice@email.com
1      Bob   34  456-789      bob@email.com
2  Charlie   45  789-012  charlie@email.com
4      Eva   23  345-678      eva@email.com
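Both methods accept optional arguments that change which rows count as duplicates. The sketch below, on a small made-up `DataFrame`, shows `keep=False` (flag every copy, not just the later ones) and `subset` (compare only the listed columns).

```python
import pandas as pd

# Made-up data: rows 0 and 2 are exact copies, row 3 shares only the name
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Alice'],
    'Age': [28, 34, 28, 29],
})

# keep=False flags every member of a duplicate group, including the first occurrence
all_copies = people.duplicated(keep=False)
print(all_copies.tolist())  # [True, False, True, False]

# subset restricts the comparison to 'Name'; keep='last' retains the final occurrence
last_per_name = people.drop_duplicates(subset=['Name'], keep='last')
print(last_per_name)
```

Matching on a subset of columns is also the simplest way to treat rows as duplicates when they are identical only in the columns that matter for the analysis.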


The possible existence of **fuzzy** duplicates can present a greater challenge. Consider the following example:

In [16]:
data = {
    'First_Name': ['Alice', 'Alice', 'Alice',  'Alice'],
    'Last_Name': ['Smith', 'Smith', 'Smith', 'Smith'],
    'Age': [28, 34, 45, 45],
    'Phone': ['123-456', '456-789', '123-456', '123-456'],
    'Email': ['alice@email.com', 'alice@smith.com', 
              'alice@theinternet.com',  'Alice@theinternet.com']
}

fuzzy_duplicates_df = pd.DataFrame(data)
fuzzy_duplicates_df

   First_Name  Last_Name  Age    Phone                  Email
0       Alice      Smith   28  123-456        alice@email.com
1       Alice      Smith   34  456-789        alice@smith.com
2       Alice      Smith   45  123-456  alice@theinternet.com
3       Alice      Smith   45  123-456  Alice@theinternet.com


In this case, each row is a partial match for the other rows. Handling this kind of partial (**fuzzy**) matching requires consideration of what each column represents. For example, the name `Alice Smith` is relatively common in English-speaking countries, so we should not assume that all of the entries are the same person. Meanwhile, ages can be mistyped, and it's even possible that there are multiple people called Alice Smith at the same address, sharing the same phone number. On the other hand, rows `2` and `3` are probably the same person, given that the only difference is the capitalisation in the `Email` column.
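When the only difference is formatting, as with the capitalisation of the email address, one possible approach is to normalise the column first so that the fuzzy duplicate becomes an exact one. The sketch below assumes that email addresses should be compared case-insensitively, which is a judgement about this particular data.

```python
import pandas as pd

fuzzy = pd.DataFrame({
    'First_Name': ['Alice', 'Alice', 'Alice', 'Alice'],
    'Last_Name': ['Smith', 'Smith', 'Smith', 'Smith'],
    'Age': [28, 34, 45, 45],
    'Phone': ['123-456', '456-789', '123-456', '123-456'],
    'Email': ['alice@email.com', 'alice@smith.com',
              'alice@theinternet.com', 'Alice@theinternet.com'],
})

# Assumption: emails differing only in case or surrounding whitespace are the same address
normalised = fuzzy.assign(Email=fuzzy['Email'].str.strip().str.lower())

# After normalisation, rows 2 and 3 are exact duplicates and one of them is dropped
deduplicated = normalised.drop_duplicates()
print(deduplicated)
```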

How you choose to handle **fuzzy** duplicates depends on your analysis goals. In some cases it might make sense to take a conservative approach to avoid losing data unnecessarily, whereas in other cases it might be necessary to apply more stringent measures to avoid duplicates. You will also need to decide whether to simply drop the **fuzzy** duplicates you identify, or average the values of certain columns (e.g. a product price column).

Generally, the more columns you have for reference, the easier it is to determine whether a partial match is a duplicate. There are also software tools to help you, for example this Python [library](https://pypi.org/project/fuzzywuzzy/) that uses the [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to calculate the differences between strings.
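To give a feel for what the Levenshtein distance measures, here is a small pure-Python sketch of the standard dynamic-programming calculation. It is an illustration only, not the implementation used by the library linked above, and the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


# The two email addresses from rows 2 and 3 differ by a single character
print(levenshtein('alice@theinternet.com', 'Alice@theinternet.com'))  # 1
```

A small distance relative to the length of the strings suggests a likely fuzzy match, although the threshold to use remains a judgement call.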

### Using GroupBy to Handle Duplicates



> In certain scenarios involving fuzzy duplicates, simply dropping the duplicates might not be sufficient. We may want to average the results of certain columns of interest to avoid biasing the data. This can be achieved using the `groupby()` method. It allows us to group data by certain criteria and then apply aggregate functions like `mean()`, `max()`, or `first()`.


#### Example: Aggregating Customer Reviews

Imagine a dataset of customer reviews where the same customer might have submitted multiple reviews for the same product, possibly with slight variations in their contact details. Our goal is to consolidate these reviews to maintain one entry per customer per product.


In [18]:


# Sample review data
data = {
    'CustomerID': ['C01', 'C01', 'C02', 'C02', 'C03'],
    'ProductID': [101, 101, 102, 102, 103],
    'ReviewScore': [4, 5, 3, 3, 4],
    'CustomerEmail': ['customer1@email.com', 'customer1@domain.com', 'customer2@email.com', 'customer2@domain.com', 'customer3@email.com']
}
reviews_df = pd.DataFrame(data)

# Grouping by CustomerID and ProductID, then taking the mean review score
aggregated_reviews = reviews_df.groupby(['CustomerID', 'ProductID']).agg({'ReviewScore': 'mean'})

aggregated_reviews


                      ReviewScore
CustomerID ProductID
C01        101                4.5
C02        102                3.0
C03        103                4.0





In this code block, we've grouped the reviews by both `CustomerID` and `ProductID`. We then used the `agg()` function to take the mean review score for each customer-product pair. This gives a single, representative review score for each product from each customer, reducing redundancy while preserving the information from every review.
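The same grouping can carry several aggregations at once. The sketch below uses named aggregation on the same review data to keep the mean score, the number of reviews and the first email address seen; the output column names `MeanScore`, `NumReviews` and `FirstEmail` are chosen for illustration.

```python
import pandas as pd

reviews = pd.DataFrame({
    'CustomerID': ['C01', 'C01', 'C02', 'C02', 'C03'],
    'ProductID': [101, 101, 102, 102, 103],
    'ReviewScore': [4, 5, 3, 3, 4],
    'CustomerEmail': ['customer1@email.com', 'customer1@domain.com', 'customer2@email.com',
                      'customer2@domain.com', 'customer3@email.com'],
})

# Each keyword names an output column and gives (input column, aggregation) as a tuple
summary = reviews.groupby(['CustomerID', 'ProductID'], as_index=False).agg(
    MeanScore=('ReviewScore', 'mean'),
    NumReviews=('ReviewScore', 'count'),
    FirstEmail=('CustomerEmail', 'first'),
)
print(summary)
```

Keeping a review count alongside the mean makes it clear which rows were consolidated from more than one entry.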


### Understanding the `agg` Function

>The `agg` function is a tool for performing aggregate operations on data. It stands for 'aggregate' and is used to apply one or more operations over the specified axis. This function is particularly useful in scenarios involving `groupby` operations. It allows you to apply a function or a list of function names to be executed along one axis of the `DataFrame` (by default the `0` or row axis).



When given a single function, `agg` applies that function to every column. Here we pass the string `'sum'`, which applies the `sum` method to each column.


In [20]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

print("Original dataframe:")
print(df)
# Applying a single aggregate function
result = df.agg('sum')
print("\nResult of using the `agg` function with `sum`:")
result


Original dataframe:
   A  B
0  1  2
1  3  4

Result of using the `agg` function with `sum`:


A    4
B    6
dtype: int64
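As a brief sketch of the other forms `agg` accepts, a list of function names produces one result row per function, while a dictionary applies a different function to each column. The example reuses the same two-column data under the name `numbers`.

```python
import pandas as pd

numbers = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

# A list of function names returns a DataFrame with one row per function
multi = numbers.agg(['sum', 'min'])
print(multi)

# A dictionary maps each column to its own aggregation and returns a Series
per_column = numbers.agg({'A': 'sum', 'B': 'max'})
print(per_column)
```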

## Categorical Data

> Categorical columns are those in which the values are drawn from a predefined set of categories. Examples include the possible colour schemes of a product like a laptop, the country of residence of a customer, or the manufacturer of a car.  We will learn more about the different types of categories elsewhere in this course. 

The primary aim in cleaning categorical data is to modify the column so that each real world entity has only one corresponding unique value in the column. If you know something about the data the column is describing, you can get an idea of whether the column needs to be cleaned to meet this requirement.

For example, when dealing with countries, there should be somewhere in the region of 193 to 237 possible entries, depending on the definition of the word "country" that is being used. It is possible for the same country to go by multiple names in your column, in which case it would be helpful to regularise the names such that each country is represented by only a single value in the column. 

Look at the `DataFrame` below. The column `Postal Region` contains the names `UK`, `England`, `Wales`, `Cymru` and `Scotland`, among others. For the purposes of the cost of sending mail, these are all one region, the `United Kingdom`. It would therefore be preferable to set them all to the same value. As with missing values, we can fix this representation with the `.replace()` method:


In [25]:
# Creating a sample dataframe
data = {
    'Customer Name': ['Amina', 'Bahru', 'Charlie', 'Dion', 'Ebo', 'Frank', 'Giana'],
    'Postal Region': ['UK', 'England', 'Wales', 'Cymru', 'Scotland', 'USA', 'Canada']
}
df = pd.DataFrame(data)

# Displaying the original dataframe
print("Original dataframe:")
print(df)



Original dataframe:
  Customer Name Postal Region
0         Amina            UK
1         Bahru       England
2       Charlie         Wales
3          Dion         Cymru
4           Ebo      Scotland
5         Frank           USA
6         Giana        Canada


In [26]:
# Creating a mapping dictionary to unify the country names
country_mapping = {
    'UK': 'United Kingdom',
    'England': 'United Kingdom',
    'Wales': 'United Kingdom',
    'Cymru': 'United Kingdom',
    'Scotland': 'United Kingdom'
}

# Replacing the region names in the 'Postal Region' column
df['Postal Region'] = df['Postal Region'].replace(country_mapping)

# Displaying the DataFrame after cleaning the 'Postal Region' column
print("\nDataFrame After Cleaning 'Postal Region' Column:")
print(df)



DataFrame After Cleaning 'Postal Region' Column:
  Customer Name   Postal Region
0         Amina  United Kingdom
1         Bahru  United Kingdom
2       Charlie  United Kingdom
3          Dion  United Kingdom
4           Ebo  United Kingdom
5         Frank             USA
6         Giana          Canada


### Creating Categorical Columns from Continuous Data

Continuous variables are those which can take any value on a spectrum. For example, the price of an item can potentially take any value greater than zero, usually with a granularity of 1 penny. You may encounter cases where you want to generate new categories from continuous data. This can be achieved through a process called *binning*, which means dividing the spectrum of possible values into regions, known as **bins**. 

As an example, consider a `DataFrame` of flight routes, together with their distances in miles. An airline might want to divide them into `short haul`, `medium haul` and `long haul` based on threshold values. To achieve this, we can use the `pd.cut()` function:

In [27]:

# Creating a sample dataframe
data = {
    'Route': ['NYC-LON', 'LON-PAR', 'NYC-TOK', 'LON-SYD', 'PAR-BER'],
    'Distance': [3461, 214, 6749, 10562, 546]
}
flights = pd.DataFrame(data)

# Displaying the original dataframe
print("Original dataframe:")
print(flights)

# Defining the bin edges and labels
bin_edges = [0, 1500, 4000, 12000]  # in miles
bin_labels = ['short haul', 'medium haul', 'long haul']

# Creating a new categorical column 'Flight Type' by binning the 'Distance' column
flights['Flight Type'] = pd.cut(flights['Distance'], bins=bin_edges, labels=bin_labels, right=False)

# Displaying the dataframe with the new 'Flight Type' column
print("\ndataframe with 'Flight Type' Column:")
print(flights)

Original dataframe:
     Route  Distance
0  NYC-LON      3461
1  LON-PAR       214
2  NYC-TOK      6749
3  LON-SYD     10562
4  PAR-BER       546

dataframe with 'Flight Type' Column:
     Route  Distance  Flight Type
0  NYC-LON      3461  medium haul
1  LON-PAR       214   short haul
2  NYC-TOK      6749    long haul
3  LON-SYD     10562    long haul
4  PAR-BER       546   short haul


In the above example:
- We create a `DataFrame` `flights` with columns `Route` and `Distance`, containing various flight routes and their distances in miles
- We define the bin edges `bin_edges` and labels `bin_labels` to specify the ranges and labels for our new categorical data. Note that there is one more element in the `bin_edges` list than the number of bins we need. The bin edges define the lower and upper bounds of each bin. So a short haul flight will be from `0` up to, but not including, `1500` miles in this case.
- We use `pd.cut()` to create a new column `Flight Type` by binning the `Distance` column based on the defined bins and labels. The argument `right=False` is used to specify that the bins are *left-closed*, meaning that the left bin edge is included in the bin, but the right bin edge is not.
- Finally, we display the original and modified `DataFrame` to observe the changes
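The effect of `right=False` is easiest to see on values that sit exactly on a bin edge. The sketch below bins three made-up boundary distances with the same edges and labels, once left-closed and once with the default right-closed bins.

```python
import pandas as pd

bin_edges = [0, 1500, 4000, 12000]  # in miles
bin_labels = ['short haul', 'medium haul', 'long haul']

# Made-up distances that fall exactly on the bin edges
boundary = pd.Series([0, 1500, 4000])

# Left-closed bins: [0, 1500), [1500, 4000), [4000, 12000)
left_closed = pd.cut(boundary, bins=bin_edges, labels=bin_labels, right=False)
print(left_closed.tolist())  # ['short haul', 'medium haul', 'long haul']

# Default right-closed bins: (0, 1500], (1500, 4000], (4000, 12000]
right_closed = pd.cut(boundary, bins=bin_edges, labels=bin_labels)
print(right_closed.tolist())  # [nan, 'short haul', 'medium haul']
```

With the default setting, a distance of `0` falls outside every bin and becomes `NaN`, so the choice of closed side matters whenever values can land on an edge.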

This approach allows you to categorise continuous data into discrete bins, simplifying analysis and enabling you to gain insights into the distribution and frequency of the data across different categories. This is particularly useful when you want to analyse or visualise your data at a higher or more generalised level than the raw, continuous data allows.

## Key Takeaways

- Use Pandas' `.str.strip()` method to remove whitespace from the start and end of strings in a dataset
- Standardise case of categorical variables in Pandas using `.str.lower()` or `.str.upper()` methods
- Use the `.str.replace()` method in Pandas to correct incorrect values in a text column
- Use the `.replace()` method in Pandas to standardise values in a column, and `.astype('bool')` to convert the column to Boolean type
- Regular expressions (regex) can be used in data cleaning to match and replace values in a column that don't adhere to a specific pattern
- Use the `.replace()` method in Pandas to clean and regularise numeric data, including removing or replacing non-numeric characters
- Use `unique` to get distinct values and `nunique` to count unique values in a Pandas series
- In Pandas, use `.duplicated()` to find and `drop_duplicates()` to remove exact duplicates; handling fuzzy duplicates requires careful consideration of data context
- Use Pandas' `groupby()` and `agg()` functions to consolidate fuzzy duplicates and reduce bias in data
- The `agg` function in Pandas performs aggregate operations, often used with `groupby` on a specified axis of a `DataFrame`
- Use the `.replace()` method to standardise categorical data, ensuring each category is represented by a single unique value
- Binning in Pandas allows categorisation of continuous data into discrete bins using the `pd.cut()` function, simplifying analysis and visualisation