# <center> Pandas Data Cleaning </center>

- [Split DataFrame Columns](#section_1)
- [Text Cleaning with Regular Expressions](#section_2)
- [Update Column Datatypes](#section_3)
- [Drop Rows and Columns](#section_4)
- [Rename Columns](#section_5)

<hr>

Many real-life datasets come with problems such as missing values, wrong data types, and bad formatting. Data professionals usually need to spend a lot of time correcting these issues before a dataset is ready for analysis. Luckily, the Pandas library comes with a set of built-in functions to help users fix these issues. In this section, we will learn how to use Pandas to identify and correct some common data quality issues.

## Data Cleaning with Pandas

To demonstrate the data cleaning process, we will use a toy DataFrame about countries. Each country has several pieces of information, such as name, population, area, and independence date, as shown in the code below:

In [1]:
# Import pandas
import pandas as pd

In [2]:
# Create a list of dictionaries. Refer to lesson video for details.

list_of_countries = [{
    'Country Name':'China','ISO Code':'CN','Country Population':1433783686,'Country Area km2 (mi2)':'9,596,961 (3,705,407)','Independence Day':'1 October 1949'},
    {'Country Name':'New Zealand','ISO Code':'NZ','Country Population':5122600,'Country Area km2 (mi2)':'270,467 (104,428)','Independence Day':'26 September 1907'},
    {'Country Name':'India','ISO Code':'IN','Country Population':1406631776,'Country Area km2 (mi2)':'3,287,263 (1,269,219)','Independence Day':'15 August 1947'},
    {'Country Name':'Australia','ISO Code':'AU','Country Population':25763300,'Country Area km2 (mi2)':'7,692,024 (2,969,907)', 'Independence Day':'1 January 1901'},
    {'Country Name':'United States','ISO Code':'US','Country Population':329064917,'Country Area km2 (mi2)':'9,525,067 (3,677,649)','Independence Day':'4 July 1776'},
    {'Country Name':'New Zealand','ISO Code':'NZ','Country Population':5122600,'Country Area km2 (mi2)':'270,467 (104,428)','Independence Day':'26 September 1907'}]

# Create a Pandas DataFrame from a list of dictionaries
df_countries = pd.DataFrame(list_of_countries)

# Display the DataFrame
df_countries

Unnamed: 0,Country Name,ISO Code,Country Population,Country Area km2 (mi2),Independence Day
0,China,CN,1433783686,"9,596,961 (3,705,407)",1 October 1949
1,New Zealand,NZ,5122600,"270,467 (104,428)",26 September 1907
2,India,IN,1406631776,"3,287,263 (1,269,219)",15 August 1947
3,Australia,AU,25763300,"7,692,024 (2,969,907)",1 January 1901
4,United States,US,329064917,"9,525,067 (3,677,649)",4 July 1776
5,New Zealand,NZ,5122600,"270,467 (104,428)",26 September 1907


Looking at our DataFrame, we notice that the record for `New Zealand` appears twice, in rows 1 and 5. We also notice that the `Country Area km2 (mi2)` column combines values in both `km2` and `mi2` in a single string.
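
Spotting repeated rows by eye only works on tiny tables. As a minimal sketch, the `duplicated()` method can flag them programmatically. The `sample` frame below is a hypothetical stand-in for part of the tutorial data, not the full DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the countries data
sample = pd.DataFrame({
    'Country Name': ['China', 'New Zealand', 'India', 'New Zealand'],
    'ISO Code': ['CN', 'NZ', 'IN', 'NZ'],
})

# duplicated() marks every row that repeats an earlier row
duplicate_mask = sample.duplicated()

# Count the repeated rows and show them
print(duplicate_mask.sum())
print(sample[duplicate_mask])
```

By default only the later occurrences are flagged, so the first `New Zealand` row is treated as the original.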

In [3]:
# Display summary of the DataFrame columns

df_countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Country Name            6 non-null      object
 1   ISO Code                6 non-null      object
 2   Country Population      6 non-null      int64 
 3   Country Area km2 (mi2)  6 non-null      object
 4   Independence Day        6 non-null      object
dtypes: int64(1), object(4)
memory usage: 368.0+ bytes


Examining the DataFrame using the [`info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) method shows that both the `Country Area km2 (mi2)` and `Independence Day` columns were assigned the `object` data type, which here means their values are stored as text strings rather than as numbers and dates.

In order to clean up the data for further analysis, we need to perform the following steps:

* Split the `Country Area` values into two columns for `km2` and `mi2` respectively
* Remove any non-numeric characters from the area values
* Change the `Country Area` and `Independence Day` columns to the correct data types
* Drop unwanted rows and columns
* Rename all the columns to have lower case letters separated by underscores

### Split DataFrame Columns <a class="anchor" id="section_1"></a>

The `Country Area km2 (mi2)` column holds each area in both square kilometers and square miles. The `mi2` value is enclosed in parentheses `()`, and there is a single space between the square-kilometer and square-mile values. Therefore, we can use the Pandas built-in [`split()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html) function to separate these values into two new columns, as shown in the following code:

In [4]:
# Apply split() function to separate values into 2 different columns
df_countries[['Area km2', 'Area mi2']] = df_countries['Country Area km2 (mi2)'].str.split(' ', expand = True)

# Display the DataFrame
df_countries

Unnamed: 0,Country Name,ISO Code,Country Population,Country Area km2 (mi2),Independence Day,Area km2,Area mi2
0,China,CN,1433783686,"9,596,961 (3,705,407)",1 October 1949,"9,596,961","(3,705,407)"
1,New Zealand,NZ,5122600,"270,467 (104,428)",26 September 1907,"270,467","(104,428)"
2,India,IN,1406631776,"3,287,263 (1,269,219)",15 August 1947,"3,287,263","(1,269,219)"
3,Australia,AU,25763300,"7,692,024 (2,969,907)",1 January 1901,"7,692,024","(2,969,907)"
4,United States,US,329064917,"9,525,067 (3,677,649)",4 July 1776,"9,525,067","(3,677,649)"
5,New Zealand,NZ,5122600,"270,467 (104,428)",26 September 1907,"270,467","(104,428)"


The first part of the code, `df_countries[['Area km2', 'Area mi2']]` on the left side of the equal sign `=`, creates two new columns in our DataFrame. On the right side of the equal sign, we first select the `Country Area km2 (mi2)` column, and then apply the [`split()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html) function to split each string around the given separator.

Notice that any time we need to handle a column's values as strings, we select the column, add the `.str` accessor, and then call one of the many available text functions.

In this case, the function splits the text into two pieces, and one of the important parameters to pass is the string used to split each value. Since there is always a space between the two values, we use it as the separator.

The last parameter, `expand = True`, returns the pieces as separate columns so that we can assign them to the two new columns at the same time.

As a result, the column `Country Area km2 (mi2)` was split into two new columns: `Area km2` and `Area mi2`.

However, the new columns still contain extra characters, such as parentheses and commas, that need further cleaning.
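
To make the effect of `expand` concrete, here is a small sketch on a standalone Series. The names `areas`, `as_lists`, and `as_columns` are illustrative and not part of the tutorial's DataFrame.

```python
import pandas as pd

# Two area strings in the same format as the tutorial column
areas = pd.Series(['9,596,961 (3,705,407)', '270,467 (104,428)'])

# Without expand, each element becomes a Python list of pieces
as_lists = areas.str.split(' ')

# With expand=True, the pieces are returned as DataFrame columns 0 and 1
as_columns = areas.str.split(' ', expand=True)
as_columns.columns = ['Area km2', 'Area mi2']

print(as_lists)
print(as_columns)
```

Only the expanded form can be assigned directly to two new columns, which is why the tutorial code passes `expand = True`.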

### Text Cleaning with Regular Expressions <a class="anchor" id="section_2"></a>

After we split the `Country Area` into two separate columns, we need to continue our work to convert these values into numeric format by removing any non-numeric characters such as parentheses and commas from the newly created columns `Area km2` and `Area mi2`.

To do that, we can use the built-in string [`replace()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html) function, available through the `.str` accessor, to replace occurrences of a specific pattern in a given Series with another string. The function takes the target string or regular expression pattern as its first parameter and the replacement value as its second. The following code demonstrates how we remove any non-numeric characters using the regular expression `\D+`, which matches one or more characters that are not digits.

In [5]:
# Apply a regular expression pattern to remove any non-numeric characters
df_countries['Area km2'] = df_countries['Area km2'].str.replace(r'\D+', '', regex=True)
df_countries['Area mi2'] = df_countries['Area mi2'].str.replace(r'\D+', '', regex=True)

Note that when the `regex` parameter is omitted, older versions of Pandas show a [`warning`](https://stackoverflow.com/questions/66603854/futurewarning-the-default-value-of-regex-will-change-from-true-to-false-in-a-fu) about a future change in how the `replace()` function handles regular expressions. Since Pandas 2.0, the default is `regex=False`, which treats the pattern as literal text, so we pass `regex=True` explicitly.

In [6]:
# Display the dataset
df_countries

Unnamed: 0,Country Name,ISO Code,Country Population,Country Area km2 (mi2),Independence Day,Area km2,Area mi2
0,China,CN,1433783686,"9,596,961 (3,705,407)",1 October 1949,9596961,3705407
1,New Zealand,NZ,5122600,"270,467 (104,428)",26 September 1907,270467,104428
2,India,IN,1406631776,"3,287,263 (1,269,219)",15 August 1947,3287263,1269219
3,Australia,AU,25763300,"7,692,024 (2,969,907)",1 January 1901,7692024,2969907
4,United States,US,329064917,"9,525,067 (3,677,649)",4 July 1776,9525067,3677649
5,New Zealand,NZ,5122600,"270,467 (104,428)",26 September 1907,270467,104428


So far, we have split the country area column into two new columns and removed all non-numeric characters from them. The values are still stored as strings, which we address next.
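
The difference between treating the pattern as a regular expression and as literal text can be checked on a small Series. This is a sketch with made-up variable names (`raw`, `digits_only`, `unchanged`).

```python
import pandas as pd

# Strings with commas and parentheses, as produced by the split step
raw = pd.Series(['9,596,961', '(3,705,407)'])

# regex=True: \D+ matches any run of non-digit characters
digits_only = raw.str.replace(r'\D+', '', regex=True)

# regex=False: the pattern is searched for as literal text,
# and no value contains the literal characters \D+
unchanged = raw.str.replace(r'\D+', '', regex=False)

print(digits_only.tolist())  # ['9596961', '3705407']
print(unchanged.tolist())    # ['9,596,961', '(3,705,407)']
```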

### Update Column Datatypes <a class="anchor" id="section_3"></a>

Now that the country area values are separated and cleaned, we can assign the area and `Independence Day` columns the correct data types. To do that, we pass the Pandas [`astype()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html) function a Python dictionary that maps each column name to its target data type, as shown in the code below:

In [7]:
# Change specific columns' data types

df_countries = df_countries.astype({'Area km2': 'int64',
                                    'Area mi2': 'int64',
                                    'Independence Day': 'datetime64[ns]'})

The two new area columns are assigned the `int64` (integer) data type, while the `Independence Day` column is assigned a `datetime` data type.

The code runs without raising an error, which means every value could be converted to its new data type.
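
`astype()` raises an error as soon as one value cannot be converted. As an alternative sketch, not used in this tutorial, `pd.to_numeric()` and `pd.to_datetime()` accept `errors='coerce'`, which turns unparseable values into missing values instead. The sample values below, including the deliberately bad entries, are made up for illustration, and the date format string assumes English month names.

```python
import pandas as pd

# Made-up samples with one unparseable entry each
areas = pd.Series(['9596961', '270467', 'unknown'])
dates = pd.Series(['1 October 1949', '26 September 1907', 'not a date'])

# Unparseable numbers become NaN (the result is float64 because of the NaN)
area_numbers = pd.to_numeric(areas, errors='coerce')

# Unparseable dates become NaT; the format matches strings like '1 October 1949'
parsed_dates = pd.to_datetime(dates, format='%d %B %Y', errors='coerce')

print(area_numbers)
print(parsed_dates)
```

Counting the missing values afterwards, for example with `isna().sum()`, shows how many entries failed to convert.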

### Drop Rows and Columns <a class="anchor" id="section_4"></a>

Any data processing task would often include removing duplicate records or unwanted columns. In our countries DataFrame, we notice such cases as repeated rows for `New Zealand`. As we already split the country area into two new columns, there is no need to keep the old country area column too.

The Pandas library provides us with built-in functions to remove any unwanted records and columns.

The [`drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) function needs the axis on which the action will be applied:

* `0`: the action is taken at the row level
* `1`: the action is taken at the column level

The following code demonstrates how to use the [`drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) function to remove unwanted rows and columns:

In [8]:
# To remove the old country area column
df_countries.drop('Country Area km2 (mi2)', axis = 1, inplace = True)

Note that if we omit the `inplace` parameter, `drop()` returns a new DataFrame and leaves `df_countries` unchanged. That is why we set `inplace = True` here. Alternatively, we could assign the returned DataFrame back to `df_countries`.
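
The behavior of `drop()` without `inplace` can be verified on a throwaway frame. This is a minimal sketch, and the names `original` and `trimmed` are illustrative.

```python
import pandas as pd

original = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Without inplace=True, drop() returns a new DataFrame
trimmed = original.drop('b', axis=1)

print(list(original.columns))  # ['a', 'b']  (unchanged)
print(list(trimmed.columns))   # ['a']
```

Assigning the result to a variable, as done here, is a common alternative to `inplace = True`.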

Let's see how we can drop unwanted rows below:

In [9]:
# To remove duplicate row for New Zealand
df_countries.drop(5, axis = 0, inplace = True)

In [10]:
# Display the dataset
df_countries

Unnamed: 0,Country Name,ISO Code,Country Population,Independence Day,Area km2,Area mi2
0,China,CN,1433783686,1949-10-01,9596961,3705407
1,New Zealand,NZ,5122600,1907-09-26,270467,104428
2,India,IN,1406631776,1947-08-15,3287263,1269219
3,Australia,AU,25763300,1901-01-01,7692024,2969907
4,United States,US,329064917,1776-07-04,9525067,3677649


This time we specified the row index `5` and set the axis value to `0` to indicate that we wanted to remove a row. To apply the change to `df_countries` directly, we set the `inplace` parameter to `True`. The output above confirms that the duplicate row is gone.

Dropping rows by index is practical for a small number of records that we already know about. However, imagine you have 10 repeated records: dropping each one by its index would be tedious.

The Pandas library provides another way to remove large numbers of duplicate records.

Instead of calling the [`drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) function multiple times, we can use the Pandas [`drop_duplicates()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) function to remove all duplicate rows at once. In our case the duplicate row is already gone, so the call leaves the DataFrame unchanged.

In [11]:
# Remove duplicated records using drop_duplicates() function
df_countries.drop_duplicates(inplace = True)
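
By default, `drop_duplicates()` only removes rows that match in every column. As a sketch of a common variation, the `subset` and `keep` parameters restrict the comparison to chosen columns. The frame below is hypothetical: the second New Zealand population figure is deliberately altered so that the two rows are not exact copies.

```python
import pandas as pd

# Hypothetical data: the two NZ rows differ in population
df = pd.DataFrame({
    'Country Name': ['China', 'New Zealand', 'New Zealand'],
    'ISO Code': ['CN', 'NZ', 'NZ'],
    'Country Population': [1433783686, 5122600, 5122601],
})

# Full-row comparison: nothing is removed because no row is an exact copy
exact = df.drop_duplicates()

# Compare on 'ISO Code' only and keep the first occurrence
by_code = df.drop_duplicates(subset=['ISO Code'], keep='first')

print(len(exact))    # 3
print(len(by_code))  # 2
```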

### Rename Columns <a class="anchor" id="section_5"></a>

Now we move to the last item on our to-do list: changing the column labels to follow a naming convention.

This typically means writing all column names in lowercase letters, with words separated by underscores.

To achieve this, we can use the Pandas [`rename()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) function and pass it a dictionary that maps current column names to new ones, as shown in the code below:

In [12]:
# Rename columns
df_countries.rename(columns = {'Country Name': 'country_name', 
                               'ISO Code': 'country_code',
                               'Country Population': 'country_population',
                               'Independence Day': 'independence_date',
                               'Area km2': 'area_km2',
                               'Area mi2': 'area_mi2'}, inplace = True)
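
For wider tables, writing the dictionary by hand becomes tedious. As a sketch of an alternative, `rename()` also accepts a function that is applied to every column label. The helper `to_snake_case` below is hypothetical, and its output differs from the manual mapping above (for example, it produces `iso_code` rather than `country_code`).

```python
import pandas as pd

# Empty frame with a few of the tutorial's column labels
df = pd.DataFrame(columns=['Country Name', 'ISO Code', 'Area km2'])

def to_snake_case(label: str) -> str:
    """Lowercase a label and replace spaces with underscores."""
    return label.strip().lower().replace(' ', '_')

# rename() calls the function once per column label
renamed = df.rename(columns=to_snake_case)

print(list(renamed.columns))
# ['country_name', 'iso_code', 'area_km2']
```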

Now we can run the [`info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) method to print a summary of the DataFrame and examine all the changes applied in this tutorial.

In [13]:
# Display DataFrame information
df_countries.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   country_name        5 non-null      object        
 1   country_code        5 non-null      object        
 2   country_population  5 non-null      int64         
 3   independence_date   5 non-null      datetime64[ns]
 4   area_km2            5 non-null      int64         
 5   area_mi2            5 non-null      int64         
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 280.0+ bytes


It looks like we have successfully changed the column names. The columns have the correct data types, and there are no unnecessary columns.

Let's have a final look at our DataFrame.

In [14]:
# Display DataFrame
df_countries

Unnamed: 0,country_name,country_code,country_population,independence_date,area_km2,area_mi2
0,China,CN,1433783686,1949-10-01,9596961,3705407
1,New Zealand,NZ,5122600,1907-09-26,270467,104428
2,India,IN,1406631776,1947-08-15,3287263,1269219
3,Australia,AU,25763300,1901-01-01,7692024,2969907
4,United States,US,329064917,1776-07-04,9525067,3677649


We can see now it looks squeaky clean!

In this lesson, we have learned several common data cleaning techniques that data professionals use in their daily work. Want to learn more? Stay tuned!