# <center> Pandas Data Cleaning </center>

- [Split DataFrame Columns](#section_1)
- [Text Cleaning with Regular Expressions](#section_2)
- [Update Column Datatypes](#section_3)
- [Drop Rows and Columns](#section_4)
- [Rename Columns](#section_5)

<hr>

In [1]:
import pandas as pd

## Data Cleaning with Pandas

To demonstrate the data cleaning process, we will use a toy DataFrame about countries. Each record holds a country's name, ISO code, population, area, and independence date, as shown in the code below. Note that the New Zealand record appears twice; we remove the duplicate later.

In [2]:
# Create a list of dictionaries
list_of_countries = [
    {'Country Name': 'China','ISO Code':'CN','Country Population':1433783686,'Country Area km2 (mi2)':'9,596,961 (3,705,407)','Independence Day':'1 October 1949'},
    {'Country Name': 'New Zealand','ISO Code':'NZ','Country Population':5122600,'Country Area km2 (mi2)':'270,467 (104,428)','Independence Day':'26 September 1907'},
    {'Country Name': 'India','ISO Code':'IN','Country Population':1406631776,'Country Area km2 (mi2)':'3,287,263 (1,269,219)','Independence Day':'15 August 1947'},
    {'Country Name': 'Australia','ISO Code':'AU','Country Population':25763300,'Country Area km2 (mi2)':'7,692,024 (2,969,907)', 'Independence Day':'1 January 1901'},
    {'Country Name': 'United States','ISO Code':'US','Country Population':329064917,'Country Area km2 (mi2)':'9,525,067 (3,677,649)','Independence Day':'4 July 1776'},
    {'Country Name': 'New Zealand','ISO Code':'NZ','Country Population':5122600,'Country Area km2 (mi2)':'270,467 (104,428)','Independence Day':'26 September 1907'}]

# Create a Pandas DataFrame from a list of dictionaries
df_countries = pd.DataFrame(list_of_countries)

# Display the DataFrame
df_countries

Unnamed: 0,Country Name,ISO Code,Country Population,Country Area km2 (mi2),Independence Day
0,China,CN,1433783686,"9,596,961 (3,705,407)",1 October 1949
1,New Zealand,NZ,5122600,"270,467 (104,428)",26 September 1907
2,India,IN,1406631776,"3,287,263 (1,269,219)",15 August 1947
3,Australia,AU,25763300,"7,692,024 (2,969,907)",1 January 1901
4,United States,US,329064917,"9,525,067 (3,677,649)",4 July 1776
5,New Zealand,NZ,5122600,"270,467 (104,428)",26 September 1907


In [3]:
# Display summary of the DataFrame columns
df_countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Country Name            6 non-null      object
 1   ISO Code                6 non-null      object
 2   Country Population      6 non-null      int64 
 3   Country Area km2 (mi2)  6 non-null      object
 4   Independence Day        6 non-null      object
dtypes: int64(1), object(4)
memory usage: 368.0+ bytes
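
Before changing anything, it can help to confirm whether any rows are repeated. The sketch below is a minimal, self-contained illustration on a hypothetical mini-frame named `df_check` that reuses three rows from the data above; it is not part of the original notebook.

```python
import pandas as pd

# Hypothetical mini-frame reusing three rows from the data above
df_check = pd.DataFrame({
    'Country Name': ['China', 'New Zealand', 'New Zealand'],
    'ISO Code': ['CN', 'NZ', 'NZ'],
})

# duplicated() marks every row that repeats an earlier row
duplicate_mask = df_check.duplicated()
print(duplicate_mask.tolist())    # [False, False, True]

# Summing the boolean mask counts the repeated rows
print(int(duplicate_mask.sum()))  # 1
```

Run against `df_countries`, the same call would flag only the second New Zealand row.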



### Split DataFrame Columns <a class="anchor" id="section_1"></a>

The country area column holds both square kilometers and square miles in a single string. We can use the Pandas string method `str.split()` with `expand = True` to separate it into two new columns, as shown in the following code:

In [4]:
# Split each value on the space and expand the two parts into new columns
df_countries[['Area km2', 'Area mi2']] = df_countries['Country Area km2 (mi2)'].str.split(' ', expand = True)

# Display the DataFrame
df_countries

Unnamed: 0,Country Name,ISO Code,Country Population,Country Area km2 (mi2),Independence Day,Area km2,Area mi2
0,China,CN,1433783686,"9,596,961 (3,705,407)",1 October 1949,"9,596,961","(3,705,407)"
1,New Zealand,NZ,5122600,"270,467 (104,428)",26 September 1907,"270,467","(104,428)"
2,India,IN,1406631776,"3,287,263 (1,269,219)",15 August 1947,"3,287,263","(1,269,219)"
3,Australia,AU,25763300,"7,692,024 (2,969,907)",1 January 1901,"7,692,024","(2,969,907)"
4,United States,US,329064917,"9,525,067 (3,677,649)",4 July 1776,"9,525,067","(3,677,649)"
5,New Zealand,NZ,5122600,"270,467 (104,428)",26 September 1907,"270,467","(104,428)"
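
As an alternative to splitting on the space, `Series.str.extract()` with two capture groups can separate the values and drop the parentheses in one step. This is only a sketch on a hypothetical two-row Series named `area`, not the approach this notebook follows, and the regular expression assumes every value has the `km2 (mi2)` layout shown above.

```python
import pandas as pd

# Hypothetical sample reusing two values from the area column
area = pd.Series(['9,596,961 (3,705,407)', '270,467 (104,428)'])

# Group 1: digits and commas before the space
# Group 2: digits and commas inside the parentheses
parts = area.str.extract(r'^([\d,]+) \(([\d,]+)\)$')
parts.columns = ['Area km2', 'Area mi2']
print(parts)
```

The thousands separators are still present after this step, so the values remain strings until the commas are removed.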


### Text Cleaning with Regular Expressions <a class="anchor" id="section_2"></a>

The following code removes every non-digit character (here, the commas and parentheses) by passing the regular expression `(\D+)` to the Pandas `str.replace()` method with `regex=True`.

In [10]:
# Apply a regular expression to strip every non-digit character from both area columns
# (raw strings keep Python from treating \D as a string escape sequence)
df_countries['Area km2'] = df_countries['Area km2'].str.replace(r'(\D+)', '', regex=True)
df_countries['Area mi2'] = df_countries['Area mi2'].str.replace(r'(\D+)', '', regex=True)

# Display the DataFrame
df_countries

Unnamed: 0,Country Name,ISO Code,Country Population,Country Area km2 (mi2),Independence Day,Area km2,Area mi2
0,China,CN,1433783686,"9,596,961 (3,705,407)",1 October 1949,9596961,3705407
1,New Zealand,NZ,5122600,"270,467 (104,428)",26 September 1907,270467,104428
2,India,IN,1406631776,"3,287,263 (1,269,219)",15 August 1947,3287263,1269219
3,Australia,AU,25763300,"7,692,024 (2,969,907)",1 January 1901,7692024,2969907
4,United States,US,329064917,"9,525,067 (3,677,649)",4 July 1776,9525067,3677649
5,New Zealand,NZ,5122600,"270,467 (104,428)",26 September 1907,270467,104428
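
To see what the pattern does in isolation, the sketch below applies the same regular expression to two sample strings with Python's standard `re` module. The list name `raw_values` is made up for this illustration.

```python
import re

# Hypothetical sample values in the same shape as the two area columns
raw_values = ['9,596,961', '(3,705,407)']

# \D matches any character that is not a digit; + extends the match to a run of them
cleaned = [re.sub(r'\D+', '', value) for value in raw_values]
print(cleaned)  # ['9596961', '3705407']
```

The results are still strings, which is why the data types are converted in a separate step.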


### Update Column Datatypes <a class="anchor" id="section_3"></a>

The following code uses the Pandas `astype()` method, passing a Python dictionary that maps each column name to its target data type.

In [8]:
# Change specific columns' data types
# (recent Pandas versions reject the unit-less 'datetime64', so the unit is spelled out)
df_countries = df_countries.astype({'Area km2':'int64',
                                   'Area mi2':'int64',
                                   'Independence Day':'datetime64[ns]'})
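
For the date column, `pd.to_datetime()` with an explicit `format` is a common alternative that states exactly how the strings should be read. The sketch below is an illustration on a hypothetical Series named `dates`, not the notebook's own method, and it assumes English month names.

```python
import pandas as pd

# Hypothetical sample reusing three dates from the data above
dates = pd.Series(['1 October 1949', '26 September 1907', '4 July 1776'])

# %d = day of month, %B = full month name, %Y = four-digit year
parsed = pd.to_datetime(dates, format='%d %B %Y')
print(parsed.dt.year.tolist())  # [1949, 1907, 1776]
```

Stating the format makes the parsing rule visible to the reader and raises an error if a value does not match it.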


### Drop Rows and Columns <a class="anchor" id="section_4"></a>

The following code uses the Pandas `drop()` and `drop_duplicates()` methods. `drop()` removes rows or columns by label along the given axis, while `drop_duplicates()` removes rows that repeat an earlier row.

In [9]:
# Remove the old country area column
df_countries.drop('Country Area km2 (mi2)', 
                 axis = 1, inplace = True)


In [11]:
# Remove duplicated records
df_countries.drop_duplicates(inplace = True)


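In [10]:
# Alternative: remove the duplicate New Zealand row by its index label (5).
# errors = 'ignore' skips the label if drop_duplicates() has already removed it.
df_countries.drop(5, axis = 0, inplace = True, errors = 'ignore')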


In [12]:
# Display the DataFrame
df_countries


Unnamed: 0,Country Name,ISO Code,Country Population,Independence Day,Area km2,Area mi2
0,China,CN,1433783686,1949-10-01,9596961,3705407
1,New Zealand,NZ,5122600,1907-09-26,270467,104428
2,India,IN,1406631776,1947-08-15,3287263,1269219
3,Australia,AU,25763300,1901-01-01,7692024,2969907
4,United States,US,329064917,1776-07-04,9525067,3677649
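
`drop_duplicates()` can also return a new DataFrame instead of modifying the original, and it can compare only selected columns. The sketch below shows this on a hypothetical mini-frame named `df_demo`; the `subset` and `keep` values are choices made for this illustration.

```python
import pandas as pd

# Hypothetical mini-frame with one repeated country
df_demo = pd.DataFrame({
    'Country Name': ['New Zealand', 'New Zealand', 'China'],
    'ISO Code': ['NZ', 'NZ', 'CN'],
})

# subset limits the comparison to the listed columns;
# keep='first' retains the first occurrence of each repeated key
deduplicated = df_demo.drop_duplicates(subset=['ISO Code'], keep='first')

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels
deduplicated = deduplicated.reset_index(drop=True)
print(deduplicated)
```

Because `inplace` is not set, `df_demo` keeps all three rows, which can be convenient when the original data is needed again.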


### Rename Columns <a class="anchor" id="section_5"></a>

The following code uses the Pandas `rename()` method to change the DataFrame column labels, passing a dictionary that maps each old label to its new one.

In [13]:
# Rename columns
df_countries.rename(columns = {'Country Name':'country_name',
                              'ISO Code':'country_code',
                              'Country Population':'country_population',
                              'Independence Day':'independence_date',
                              'Area km2':'area_km2',
                              'Area mi2':'area_mi2'}, inplace = True)
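
When every label follows the same pattern, `rename()` also accepts a function in place of a dictionary. The sketch below uses a hypothetical helper named `to_snake_case` on an empty demo frame; it is an illustration only.

```python
import pandas as pd

# Hypothetical empty frame that only carries column labels
df_demo = pd.DataFrame(columns=['Country Name', 'ISO Code', 'Area km2'])

def to_snake_case(label):
    # Lowercase the label and replace spaces with underscores
    return label.lower().replace(' ', '_')

# rename() calls the function once for each column label
renamed = df_demo.rename(columns=to_snake_case)
print(list(renamed.columns))  # ['country_name', 'iso_code', 'area_km2']
```

A function can only reshape the existing labels, so a dictionary is still needed for changes such as `ISO Code` to `country_code`.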


In [15]:
# Display DataFrame information
df_countries.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   country_name        5 non-null      object        
 1   country_code        5 non-null      object        
 2   country_population  5 non-null      int64         
 3   independence_date   5 non-null      datetime64[ns]
 4   area_km2            5 non-null      int64         
 5   area_mi2            5 non-null      int64         
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 280.0+ bytes



In [17]:
# Display DataFrame
df_countries


Unnamed: 0,country_name,country_code,country_population,independence_date,area_km2,area_mi2
0,China,CN,1433783686,1949-10-01,9596961,3705407
1,New Zealand,NZ,5122600,1907-09-26,270467,104428
2,India,IN,1406631776,1947-08-15,3287263,1269219
3,Australia,AU,25763300,1901-01-01,7692024,2969907
4,United States,US,329064917,1776-07-04,9525067,3677649
