# <center> Pandas Data Cleaning </center>

- [Split DataFrame Columns](#section_1)
- [Text Cleaning with Regular Expressions](#section_2)
- [Update Column Datatypes](#section_3)
- [Drop Rows and Columns](#section_4)
- [Rename Columns](#section_5)

<hr>

Previously, we learned how to use different Pandas tools and functions to load external datasets into a DataFrame object. Many real-life datasets come with problems such as missing values, wrong datatypes, and bad formatting. Data professionals usually need to spend a lot of time correcting these issues before a dataset is ready for analysis. Luckily, the Pandas library comes with a set of built-in functions to help users fix these issues. In this section, we will learn how to use Pandas to identify and correct some common data quality issues.

We start by importing the Pandas library:

In [1]:
import pandas as pd

## Data Cleaning with Pandas

To demonstrate the data cleaning process, we will use a toy DataFrame about countries. Each country has different pieces of information such as name, population, size, and independence date as shown in the code below:

In [2]:
# Create a list of dictionaries
list_of_countries = [{'Country Name':'China','ISO Code':'CN','Country Population':1433783686,'Country Area km2 (mi2)':'9,596,961 (3,705,407)','Independence Day':'1 October 1949'},
{'Country Name':'New Zealand','ISO Code':'NZ','Country Population':4783063,'Country Area km2 (mi2)':'270,467 (104,428)','Independence Day':'26 September 1907'},
{'Country Name':'South Africa','ISO Code':'ZA','Country Population':58558270,'Country Area km2 (mi2)':'1,221,037 (471,445)','Independence Day':'31 May 1910'},
{'Country Name':'Australia','ISO Code':'AU','Country Population':25763300,'Country Area km2 (mi2)':'7,692,024 (2,969,907)', 'Independence Day':'1 January 1901'},
{'Country Name':'United States','ISO Code':'US','Country Population':329064917,'Country Area km2 (mi2)':'9,525,067 (3,677,649)','Independence Day':'4 July 1776'},
{'Country Name':'New Zealand','ISO Code':'NZ','Country Population':4783063,'Country Area km2 (mi2)':'270,467 (104,428)','Independence Day':'26 September 1907'}]

# Create a Pandas DataFrame from a list of dictionaries
countries = pd.DataFrame(list_of_countries)

# Display the DataFrame
countries

Unnamed: 0,Country Name,ISO Code,Country Population,Country Area km2 (mi2),Independence Day
0,China,CN,1433783686,"9,596,961 (3,705,407)",1 October 1949
1,New Zealand,NZ,4783063,"270,467 (104,428)",26 September 1907
2,South Africa,ZA,58558270,"1,221,037 (471,445)",31 May 1910
3,Australia,AU,25763300,"7,692,024 (2,969,907)",1 January 1901
4,United States,US,329064917,"9,525,067 (3,677,649)",4 July 1776
5,New Zealand,NZ,4783063,"270,467 (104,428)",26 September 1907


Looking at our DataFrame, we notice that the row for New Zealand appears twice, at index 1 and index 5. We also notice that the Country Area column holds values in both square kilometers and square miles.
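Spotting repeated rows by eye only works for small tables. As a minimal sketch, assuming a hypothetical miniature of the countries data named `sample`, the Pandas duplicated() function can flag repeats programmatically:

```python
import pandas as pd

# Hypothetical miniature of the countries data, with one repeated row
sample = pd.DataFrame({
    'Country Name': ['China', 'New Zealand', 'New Zealand'],
    'ISO Code': ['CN', 'NZ', 'NZ'],
})

# duplicated() marks each repeat of an earlier row as True
duplicate_mask = sample.duplicated()

# Count the repeats, then list every row involved in a duplication
n_duplicates = int(duplicate_mask.sum())
repeated_rows = sample[sample.duplicated(keep=False)]
```

With keep=False, every member of a duplicated group is marked, which makes it easier to inspect the rows before deciding which one to drop.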

In [3]:
# Display summary of the DataFrame columns
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Country Name            6 non-null      object
 1   ISO Code                6 non-null      object
 2   Country Population      6 non-null      int64 
 3   Country Area km2 (mi2)  6 non-null      object
 4   Independence Day        6 non-null      object
dtypes: int64(1), object(4)
memory usage: 368.0+ bytes


Examining the DataFrame with the info() method shows that both the country area and independence day columns were stored with the object datatype, which Pandas uses for text.

In order to clean up the data for further analysis, we need to perform the following steps:

* Split the Country Area values into two columns, one for square kilometers and one for square miles
* Remove any non-numeric characters from the area values
* Change the country area and independence date columns to the correct data types
* Drop unwanted rows and columns
* Rename all the columns to have lower case letters separated by underscores

### Split DataFrame Columns <a class="anchor" id="section_1"></a>

The country area column holds values in both square kilometers and square miles. The square miles value is enclosed in parentheses (), and a single space separates it from the square kilometers value. Therefore, we can use the Pandas built-in split() function to separate these values into two new columns, as shown in the following code:

In [4]:
# Apply split() to separate the values into two new columns
countries[['Area km2', 'Area mi2']] = countries['Country Area km2 (mi2)'].str.split(' ', expand=True)

# Display the DataFrame
countries

Unnamed: 0,Country Name,ISO Code,Country Population,Country Area km2 (mi2),Independence Day,Area km2,Area mi2
0,China,CN,1433783686,"9,596,961 (3,705,407)",1 October 1949,"9,596,961","(3,705,407)"
1,New Zealand,NZ,4783063,"270,467 (104,428)",26 September 1907,"270,467","(104,428)"
2,South Africa,ZA,58558270,"1,221,037 (471,445)",31 May 1910,"1,221,037","(471,445)"
3,Australia,AU,25763300,"7,692,024 (2,969,907)",1 January 1901,"7,692,024","(2,969,907)"
4,United States,US,329064917,"9,525,067 (3,677,649)",4 July 1776,"9,525,067","(3,677,649)"
5,New Zealand,NZ,4783063,"270,467 (104,428)",26 September 1907,"270,467","(104,428)"


The expression countries[['Area km2', 'Area mi2']] on the left side of the equals sign (=) creates two new columns in the DataFrame. On the right side, the split() function is applied to the Country Area km2 (mi2) Series to separate each value into its square kilometer and square mile parts. Note that we pass a single space ' ' as the separator, and we set the expand parameter to True so that each piece is placed in its own column.

split() is one of many built-in string methods that can be applied to a Pandas Series through the .str accessor, using the form Series.str.<function/property>. The output above shows the two new columns, Area km2 and Area mi2, produced by split().
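To make the effect of the expand parameter more concrete, here is a small sketch on a hypothetical Series named `areas` that uses the same "km2 (mi2)" layout. The column names `km2` and `mi2` are chosen only for illustration:

```python
import pandas as pd

# Hypothetical area strings in the same "km2 (mi2)" layout
areas = pd.Series(['270,467 (104,428)', '1,221,037 (471,445)'])

# Without expand=True, the result is a single Series holding lists
as_lists = areas.str.split(' ')

# expand=True spreads the pieces across DataFrame columns;
# n=1 limits the split to the first space only
as_columns = areas.str.split(' ', n=1, expand=True)
as_columns.columns = ['km2', 'mi2']
```

Limiting the split with n=1 is a defensive choice: if a value ever contained a second space, the result would still have exactly two columns.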

### Text Cleaning with Regular Expressions <a class="anchor" id="section_2"></a>

After we split the country area into two separate columns for square kilometers and square miles, we need to continue our work to convert these values into numeric format by removing any non-numeric characters such as parentheses and commas from the newly created columns Area km2 and Area mi2.

To do that, we can use the built-in Series.str.replace() function, which replaces occurrences of a pattern in a Series with another string. The first parameter is the target string or regular expression pattern, and the second parameter is the replacement value. Passing regex=True tells Pandas to treat the pattern as a regular expression; this is required in recent Pandas versions, where the pattern is treated as a literal string by default. The following code replaces every run of non-digit characters, matched by the regular expression \D+, with an empty string:

In [5]:
# Apply a regular expression pattern to remove any non-digit characters
countries['Area km2'] = countries['Area km2'].str.replace(r'\D+', '', regex=True)
countries['Area mi2'] = countries['Area mi2'].str.replace(r'\D+', '', regex=True)



In [6]:
countries

Unnamed: 0,Country Name,ISO Code,Country Population,Country Area km2 (mi2),Independence Day,Area km2,Area mi2
0,China,CN,1433783686,"9,596,961 (3,705,407)",1 October 1949,9596961,3705407
1,New Zealand,NZ,4783063,"270,467 (104,428)",26 September 1907,270467,104428
2,South Africa,ZA,58558270,"1,221,037 (471,445)",31 May 1910,1221037,471445
3,Australia,AU,25763300,"7,692,024 (2,969,907)",1 January 1901,7692024,2969907
4,United States,US,329064917,"9,525,067 (3,677,649)",4 July 1776,9525067,3677649
5,New Zealand,NZ,4783063,"270,467 (104,428)",26 September 1907,270467,104428
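One caveat worth illustrating: \D matches every non-digit character, so it also strips decimal points and minus signs. That is harmless for the whole-number areas here, but not in general. The sketch below uses a hypothetical Series named `raw` and compares \D+ with a narrower character class, which is an alternative and not the pattern used in this tutorial:

```python
import pandas as pd

# Hypothetical values: one area-style string, one decimal, one negative
raw = pd.Series(['(3,705,407)', '1,234.5', '-42'])

# \D+ removes every run of non-digit characters, including "." and "-"
digits_only = raw.str.replace(r'\D+', '', regex=True)

# A narrower pattern removes only commas and parentheses,
# so decimal points and signs survive
narrow = raw.str.replace(r'[(),]', '', regex=True)
```

Here `digits_only` turns '1,234.5' into '12345', while `narrow` keeps it as '1234.5'.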


### Update Column Datatypes <a class="anchor" id="section_3"></a>

Now that the country area values are separated and cleaned, we can assign the country area and independence day columns the correct data types. To do that, we pass the Pandas astype() function a Python dictionary that maps each column name to its target data type, as shown in the code below:

In [7]:
# Change specific columns data types
countries = countries.astype({'Area km2': 'int64', 
                              'Area mi2': 'int64', 
                              # Recent Pandas versions require an explicit unit such as [ns]
                              'Independence Day': 'datetime64[ns]'})
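As an alternative to astype(), the dedicated converters pd.to_datetime() and pd.to_numeric() give more control over parsing. The following is a minimal sketch on hypothetical Series named `dates` and `areas`; the format string assumes every date follows the "day month-name year" layout used in this dataset:

```python
import pandas as pd

# Hypothetical values in the same layout as the tutorial's columns
dates = pd.Series(['1 October 1949', '4 July 1776'])
areas = pd.Series(['9596961', '9525067'])

# An explicit format documents the expected layout:
# day of month, full month name, four-digit year
parsed_dates = pd.to_datetime(dates, format='%d %B %Y')

# to_numeric infers an integer dtype for digit-only strings
parsed_areas = pd.to_numeric(areas)
```

Both functions also accept errors='coerce', which converts unparseable values to missing values instead of raising an error. That can be useful when a column is only mostly clean.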

### Drop Rows and Columns <a class="anchor" id="section_4"></a>

Any data processing task would often include removing duplicate records or unwanted columns. In our countries DataFrame, we notice such cases as repeated rows for New Zealand. As we already split the country area into two new columns, there is no need to keep the old country area column too.

To remove unwanted rows or columns, we can use the Pandas drop() function, specifying the label names and the corresponding axis. The axis parameter can take two values:

* 0: the action is applied at the row level
* 1: the action is applied at the column level

The following code demonstrates how to use the drop() function to remove unwanted rows and columns:

In [8]:
# To remove the old country area column
countries.drop('Country Area km2 (mi2)', axis = 1, inplace = True)

# To remove duplicate row for New Zealand
countries.drop(5, axis = 0, inplace = True)
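The drop() function also accepts the index and columns keywords, which some readers find clearer than remembering axis numbers. The sketch below uses a hypothetical DataFrame named `sample` and returns new DataFrames rather than modifying in place:

```python
import pandas as pd

# Hypothetical DataFrame used only to illustrate drop()
sample = pd.DataFrame({'name': ['a', 'b', 'c'],
                       'old': [1, 2, 3],
                       'new': [4, 5, 6]})

# columns= is equivalent to passing the label with axis=1
without_old = sample.drop(columns=['old'])

# index= is equivalent to passing the label with axis=0
without_last = sample.drop(index=[2])

# errors='ignore' skips labels that are absent instead of raising KeyError
unchanged = sample.drop(columns=['missing'], errors='ignore')
```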

In [9]:
countries

Unnamed: 0,Country Name,ISO Code,Country Population,Independence Day,Area km2,Area mi2
0,China,CN,1433783686,1949-10-01,9596961,3705407
1,New Zealand,NZ,4783063,1907-09-26,270467,104428
2,South Africa,ZA,58558270,1910-05-31,1221037,471445
3,Australia,AU,25763300,1901-01-01,7692024,2969907
4,United States,US,329064917,1776-07-04,9525067,3677649


If a DataFrame has several duplicate rows, we can instead use the Pandas drop_duplicates() function to keep only the unique rows. The following code achieves the same result as the second drop() call above.

In [10]:
# Remove duplicated records using drop_duplicates() function
countries.drop_duplicates(inplace = True)
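drop_duplicates() has a few parameters that matter in practice. As a sketch on a hypothetical DataFrame named `sample`, subset restricts the comparison to chosen columns, keep selects which occurrence survives, and ignore_index renumbers the remaining rows:

```python
import pandas as pd

# Hypothetical data with New Zealand repeated at index 0 and 2
sample = pd.DataFrame({
    'Country Name': ['New Zealand', 'China', 'New Zealand', 'South Africa'],
    'ISO Code': ['NZ', 'CN', 'NZ', 'ZA'],
})

# Default behaviour: compare all columns, keep the first occurrence,
# and preserve the original index labels
deduped = sample.drop_duplicates()

# Compare only the ISO code and renumber the result 0..n-1
by_code = sample.drop_duplicates(subset=['ISO Code'], keep='first',
                                 ignore_index=True)
```

Note that `deduped` keeps the index labels 0, 1, and 3, leaving a gap where the duplicate was removed, while `by_code` is renumbered.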

### Rename Columns <a class="anchor" id="section_5"></a>

Another common data processing task is to standardize any DataFrame column names by using lower case letters separated by underscores. To achieve this, we can make use of the Pandas rename() function by passing a dictionary with current and new column names as shown in the code below:

In [11]:
# Rename columns
countries.rename(columns = {'Country Name': 'country_name', 
                          'ISO Code': 'country_code',
                          'Country Population': 'country_population',
                          'Independence Day': 'independence_date',
                          'Area km2': 'area_km2',
                          'Area mi2': 'area_mi2'}, inplace = True)
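When a DataFrame has many columns, writing the dictionary by hand becomes tedious. One possible alternative, sketched here on a hypothetical empty DataFrame named `sample`, is to transform all column names at once with string methods:

```python
import pandas as pd

# Hypothetical DataFrame with only column names, no rows
sample = pd.DataFrame(columns=['Country Name', 'ISO Code', 'Area km2'])

# Trim whitespace, lower-case, and replace runs of spaces with underscores
sample.columns = (
    sample.columns
    .str.strip()
    .str.lower()
    .str.replace(r'\s+', '_', regex=True)
)
```

This approach only changes case and spacing, so ISO Code becomes iso_code rather than the country_code name chosen above. A dictionary passed to rename() is still needed when a column should receive a different name altogether.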

Finally, we can run the info() method to print a concise summary of the DataFrame and review all the changes applied in this tutorial. Compare the summary below with the one at the start of this section:

In [12]:
# Display DataFrame information
countries.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   country_name        5 non-null      object        
 1   country_code        5 non-null      object        
 2   country_population  5 non-null      int64         
 3   independence_date   5 non-null      datetime64[ns]
 4   area_km2            5 non-null      int64         
 5   area_mi2            5 non-null      int64         
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 280.0+ bytes


In [13]:
# Display DataFrame
countries

Unnamed: 0,country_name,country_code,country_population,independence_date,area_km2,area_mi2
0,China,CN,1433783686,1949-10-01,9596961,3705407
1,New Zealand,NZ,4783063,1907-09-26,270467,104428
2,South Africa,ZA,58558270,1910-05-31,1221037,471445
3,Australia,AU,25763300,1901-01-01,7692024,2969907
4,United States,US,329064917,1776-07-04,9525067,3677649
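For reuse on similar data, the steps of this tutorial can be collected into a single function. The sketch below is one possible arrangement and not the only way to organize the work. The function name `clean_countries` and the miniature input `raw` are hypothetical, and the date format assumes the "day month-name year" layout seen above:

```python
import pandas as pd

def clean_countries(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper applying the tutorial's steps to a copy of raw."""
    df = raw.copy()

    # Split the combined area column on the space between the two values
    area = df['Country Area km2 (mi2)'].str.split(' ', expand=True)

    # Strip non-digit characters and convert to integers
    df['area_km2'] = area[0].str.replace(r'\D+', '', regex=True).astype('int64')
    df['area_mi2'] = area[1].str.replace(r'\D+', '', regex=True).astype('int64')

    # Parse the date strings (assumed layout: day, month name, year)
    df['Independence Day'] = pd.to_datetime(df['Independence Day'],
                                            format='%d %B %Y')

    # Drop the original area column and any fully duplicated rows
    df = df.drop(columns=['Country Area km2 (mi2)']).drop_duplicates()

    # Standardize the remaining column names
    return df.rename(columns={'Country Name': 'country_name',
                              'ISO Code': 'country_code',
                              'Country Population': 'country_population',
                              'Independence Day': 'independence_date'})

# Miniature input taken from two of the tutorial's rows, one repeated
nz = {'Country Name': 'New Zealand', 'ISO Code': 'NZ',
      'Country Population': 4783063,
      'Country Area km2 (mi2)': '270,467 (104,428)',
      'Independence Day': '26 September 1907'}
us = {'Country Name': 'United States', 'ISO Code': 'US',
      'Country Population': 329064917,
      'Country Area km2 (mi2)': '9,525,067 (3,677,649)',
      'Independence Day': '4 July 1776'}
raw = pd.DataFrame([nz, us, nz])

cleaned = clean_countries(raw)
```

Working on a copy and returning a new DataFrame, instead of using inplace = True, leaves the original data untouched so the function can be rerun safely.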
