# Data Cleaning in `Pandas`



In [None]:
#@title ### Run the following cell to download the necessary files for this lesson { display-mode: "form" } 
#@markdown Don't worry about what's in this collapsed cell

print('Downloading flights_sample.csv...')
!wget https://s3-eu-west-1.amazonaws.com/aicore-portal-public-prod-307050600709/lesson_files/3fad431d-917f-4989-abc8-6d22d2961e37/flights_sample.csv -q -O flights_sample.csv


> The goal of data cleaning is to convert your raw data into a format that is ready to be used for analysis. The nature and extent of the cleaning you need to do is therefore a function of both the dataset and your analysis goals. However, it typically includes the following:

- Converting columns into the correct **data types**, for example ensuring numbers are represented by a numeric format like `int` or `float` that can support computation
<br><br>
- Handling data issues that could impact your analysis, for example the presence of `NaN` or `NULL` values or *duplicated* data
<br><br>
- Creating new columns from existing columns to support your analysis, such as categorising a list of products as `High`, `Mid` or `Low` price, based on a price column
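
As a rough illustration of the last point, the sketch below derives a price band from a price column using `pd.cut`. The product data and the bin edges are invented for this example, so treat it as one possible approach rather than a fixed recipe.

```python
import pandas as pd

# Hypothetical product data, invented for illustration
products = pd.DataFrame({
    'product': ['pen', 'lamp', 'desk', 'chair'],
    'price': [2.50, 35.00, 240.00, 95.00],
})

# Bin edges are arbitrary assumptions: (0, 20] -> Low, (20, 100] -> Mid, above 100 -> High
products['price_band'] = pd.cut(
    products['price'],
    bins=[0, 20, 100, float('inf')],
    labels=['Low', 'Mid', 'High'],
)
print(products)
```

The new `price_band` column has the `category` dtype, which is described later in this lesson.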


This lesson will address how to perform these tasks using the `Pandas` library in Python, and provide examples for how to address certain common data issues. As we will see, there is no specific order in which we should approach the tasks, as they are interdependent. The order in which you perform them will depend on your analysis goals and the specifics of your dataset.

## Data Types

Once you have imported your data, a common first step is to check the data type of each column, and adjust any that aren't optimal for your analysis. When you import your raw data, `Pandas` will attempt to automatically assign a data type to each column, but it doesn't always make the best choice.

Let's import some example data to take a look at how to do this. Run the code block below to import the `flights_sample.csv` file and print the head of the dataframe.


In [37]:
import pandas as pd

pd.set_option('display.max_columns', None) # You can use this setting to display all the columns in your dataframe
flights_df = pd.read_csv("flights_sample.csv") 
flights_df.head(5)

Unnamed: 0,TRANSACTIONID,FLIGHTDATE,AIRLINECODE,TAILNUM,FLIGHTNUM,ORIGINAIRPORTCODE,DESTAIRPORTCODE,CRSDEPTIME,DEPTIME,DEPDELAY,ARRTIME,ARRDELAY,CANCELLED,DIVERTED,DISTANCE
0,117865800,20110511,YV,N77302,1026,HNL,KOA,1305,1250.0,-15.0,1336.0,-9.0,F,False,163 miles
1,146323500,20160520,B6,N203JB,1068,CHS,BOS,700,649.0,-11.0,847.0,-27.0,False,F,818 miles
2,71623600,20040922,XE,N14939,2024,CLE,ROC,945,937.0,-8.0,1045.0,-2.0,F,False,245 miles
3,147762700,20160527,EV,N695CA,5325,ATL,OMA,850,847.0,-3.0,955.0,-16.0,False,F,821 miles
4,29225500,19970511,US,N284AU,1001,CMH,PHL,755,754.0,-1.0,912.0,1.0,0,0,405 miles


We can see from the first few rows of data that there is a variety of different types of information in the table. Some columns contain text and others numerals, but there is also variation in terms of the type of information that is being encoded.

For example the `FLIGHTNUM` and `DISTANCE` columns both contain numeric information, but while the `DISTANCE` column contains information that we might want to perform arithmetic on, the numeric information in `FLIGHTNUM` is presumably just a set of unique identifiers for different flight routes. 

Complications like these indicate why it is always advisable to do at least some manual cleaning of the data. Even though packages exist to automate many of the cleaning tasks, the cleaning process is also an important step in your EDA, allowing you to get familiar with your dataset and anticipate any issues that might arise later in your analysis.
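
A minimal sketch of how these two columns might be treated is shown below. It uses a small hand-made sample that mirrors the rows printed above, and it assumes that every `DISTANCE` value ends in the suffix ` miles`, which has not been verified for the full file.

```python
import pandas as pd

# Small hand-made sample mirroring the first rows shown above
sample = pd.DataFrame({
    'FLIGHTNUM': [1026, 1068, 2024],
    'DISTANCE': ['163 miles', '818 miles', '245 miles'],
})

# Flight numbers are identifiers, so store them as text rather than integers
sample['FLIGHTNUM'] = sample['FLIGHTNUM'].astype('string')

# Strip the unit suffix (assumed to always be ' miles'), then convert to a numeric dtype
sample['DISTANCE'] = pd.to_numeric(
    sample['DISTANCE'].str.replace(' miles', '', regex=False)
)
print(sample.dtypes)
```

After this step `DISTANCE` supports arithmetic, while `FLIGHTNUM` can no longer be summed or averaged by accident.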

### Checking the Data Type

The data types of your columns can be accessed via the `.dtypes` attribute, or by calling the `.info()` method. 

- The `.dtypes` attribute only returns the data type of each column
- The `.info()` method returns both the data type and some additional information: the number of rows and the memory usage of the dataframe, as well as the number of non-null values in each column. We will deal with handling `NULL` values later in the lesson.

Run the two code blocks below to see the difference in output:

In [38]:
flights_df.dtypes

TRANSACTIONID          int64
FLIGHTDATE             int64
AIRLINECODE           object
TAILNUM               object
FLIGHTNUM              int64
ORIGINAIRPORTCODE     object
DESTAIRPORTCODE       object
CRSDEPTIME             int64
DEPTIME              float64
DEPDELAY             float64
ARRTIME              float64
ARRDELAY             float64
CANCELLED             object
DIVERTED              object
DISTANCE              object
dtype: object

In [39]:
flights_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119180 entries, 0 to 119179
Data columns (total 15 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   TRANSACTIONID      119180 non-null  int64  
 1   FLIGHTDATE         119180 non-null  int64  
 2   AIRLINECODE        119180 non-null  object 
 3   TAILNUM            103353 non-null  object 
 4   FLIGHTNUM          119180 non-null  int64  
 5   ORIGINAIRPORTCODE  119180 non-null  object 
 6   DESTAIRPORTCODE    119180 non-null  object 
 7   CRSDEPTIME         119180 non-null  int64  
 8   DEPTIME            116328 non-null  float64
 9   DEPDELAY           116328 non-null  float64
 10  ARRTIME            116129 non-null  float64
 11  ARRDELAY           116040 non-null  float64
 12  CANCELLED          119180 non-null  object 
 13  DIVERTED           119180 non-null  object 
 14  DISTANCE           119180 non-null  object 
dtypes: float64(4), int64(4), object(7)
memory usage: 13

### Data Types in `Pandas`




Below is a table that shows various `Pandas` data types, their corresponding Python data types, and a brief description of each.

| Pandas Dtype | Python Dtype        | Description                                           |
|--------------|---------------------|-------------------------------------------------------|
| object       | str                 | Used for text or mixed types of data                  |
| int64        | int                 | Integer numbers                                       |
| float64      | float               | Floating-point numbers                                |
| bool         | bool                | Boolean values (True/False)                           |
| datetime64   | datetime.datetime   | Date and time values                                  |
| timedelta64  | datetime.timedelta  | Differences between two datetimes                     |
| category     | (special type)      | Finite list of text values                            |
| period       | pd.Period           | Periods of time, useful for time-series data          |
| sparse       | (special type)      | Sparse array to contain mostly NaN values             |
| string       | str                 | Text                                                  |

Note that:

- The `int64` and `float64` data types indicate 64-bit storage for integer and floating-point numbers, respectively. `Pandas` also supports other sizes (like `int32` and `float32`) to save memory when the larger sizes are not necessary.
- The `category` dtype is not a native Python data type but is provided by `Pandas` to optimize memory usage and performance for data with a small number of distinct values
- The `sparse` dtype is used for data that is mostly missing to save memory by only storing the non-missing values
- The `period` dtype is specific to `Pandas` and is used for handling period data, which is not directly analogous to a native Python type
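
The memory claims above can be checked with `memory_usage(deep=True)`. The sketch below uses made-up data (three repeated airline-style codes and a range of integers), so the exact byte counts will differ from any real dataset.

```python
import pandas as pd

# Hypothetical column with many rows but only three distinct values
codes = pd.Series(['AA', 'B6', 'US'] * 10_000)

as_text = codes.memory_usage(deep=True)
as_category = codes.astype('category').memory_usage(deep=True)

# Integers that fit in 32 bits can be stored in half the space of int64
counts = pd.Series(range(30_000), dtype='int64')
as_int64 = counts.memory_usage(deep=True)
as_int32 = counts.astype('int32').memory_usage(deep=True)

print(as_text, as_category, as_int64, as_int32)
```

The saving from `category` shrinks as the number of distinct values approaches the number of rows, so it is best suited to columns with heavy repetition.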



### Changing the Data Type

You can change the data type of a column using the `astype()` method. Let's look at a simple example:



In [40]:
# Create a simple dataframe of names and ages
data = {'Name': ['Alice', 'Bashar', 'Carlos', 'Diana', "Ephraim", "Frank", "Gina"],
        'Age': [21, 22, 'n/a', 24, 25, 'missing', 27]}
age_df = pd.DataFrame(data)
age_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    7 non-null      object
 1   Age     7 non-null      object
dtypes: object(2)
memory usage: 244.0+ bytes


Looking at the output of `info()`, both columns have defaulted to the `object` data type. We can easily change the `Name` column to the `string` type:

In [41]:
age_df.Name = age_df.Name.astype('string')
age_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    7 non-null      string
 1   Age     7 non-null      object
dtypes: object(1), string(1)
memory usage: 244.0+ bytes


But when we try to cast the `Age` column as `int64`, the method throws a `ValueError`, because it encountered some values (`missing` and `n/a`) that it could not cast as an integer.

In [42]:
age_df.Age = age_df.Age.astype('int64')


ValueError: invalid literal for int() with base 10: 'n/a'

By default, the `astype()` method throws an error when it encounters a non-convertible value, as this prevents accidental data loss. We can override this behaviour by setting the `errors` flag to `ignore` rather than `raise`:

In [None]:
age_df.Age = age_df.Age.astype('int64', errors='ignore')
age_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    7 non-null      string
 1   Age     7 non-null      object
dtypes: object(1), string(1)
memory usage: 244.0+ bytes


But this actually fails to alter the datatype, as it leaves the unconvertible values unchanged, as can be seen in the output below.

In [None]:
age_df.head(10)

Unnamed: 0,Name,Age
0,Alice,21
1,Bashar,22
2,Carlos,n/a
3,Diana,24
4,Ephraim,25
5,Frank,missing
6,Gina,27


In this scenario, we are happy to lose the information in the non-numeric values, for the sake of being able to treat the column as integers. To do this, we can use a separate function, `pd.to_numeric()`, which we can use with the `errors` parameter set to `coerce` to force the conversion:

In [None]:
age_df.Age = pd.to_numeric(age_df.Age, errors='coerce')
age_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    7 non-null      string 
 1   Age     5 non-null      float64
dtypes: float64(1), string(1)
memory usage: 244.0 bytes


In [None]:
age_df.head(10)

Unnamed: 0,Name,Age
0,Alice,21.0
1,Bashar,22.0
2,Carlos,
3,Diana,24.0
4,Ephraim,25.0
5,Frank,
6,Gina,27.0


We have now successfully converted our column to numeric values, with non-numeric values being replaced with `NaN`, but the data type is `float64`, not `int64`. Attempting to cast it to `int64` will still fail, as that data type doesn't support `NaN` values. We will need to return to this issue in a later section of our data cleaning process.

In [None]:
age_df.Age = age_df.Age.astype('int64', errors='ignore')
age_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    7 non-null      string 
 1   Age     5 non-null      float64
dtypes: float64(1), string(1)
memory usage: 244.0 bytes
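
As an aside, `Pandas` also provides a nullable integer dtype, written `Int64` with a capital `I`, which can hold missing values. It is not used in the rest of this lesson, but the sketch below shows how it could be applied to a standalone copy of the age data.

```python
import pandas as pd

ages = pd.Series([21, 22, 'n/a', 24, 25, 'missing', 27])

# Coerce to numbers (invalid entries become NaN), then cast to the nullable 'Int64' dtype
ages_nullable = pd.to_numeric(ages, errors='coerce').astype('Int64')

print(ages_nullable)
print(ages_nullable.dtype)
```

Missing entries are displayed as `<NA>` rather than `NaN`, and the remaining values stay as whole numbers.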


### The `datetime64` Data Type




In `Pandas`, the `datetime64` data type provides a memory-efficient structure for working with date and time data, allowing for operations like time-based indexing, slicing, and resampling to be performed. This data type is necessary for effective time-series data analysis, as it allows complex temporal computations and aggregations to be performed with relative ease.

Date-time columns can be challenging to convert correctly. Dates and times can be formatted in a very large number of ways, and there is no guarantee that a column uses only one of them. It is therefore important to determine which formats are present in your data before attempting to convert a column to `datetime64`.
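
To make the time-based slicing and resampling mentioned above more concrete, here is a minimal sketch on invented daily readings. The column names and values are hypothetical; `'MS'` is the `Pandas` frequency alias for month start.

```python
import pandas as pd

# Hypothetical daily readings, invented for illustration
readings = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-30', '2023-01-31', '2023-02-01', '2023-02-02']),
    'value': [10, 20, 30, 40],
}).set_index('date')

# Partial-string slicing: select every row in January 2023
january = readings.loc['2023-01']

# Resampling: sum the values within each calendar month ('MS' labels each bin by month start)
monthly_totals = readings['value'].resample('MS').sum()

print(january)
print(monthly_totals)
```

Neither operation is available while the dates are stored as plain strings or integers, which is why the conversion steps below matter.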


### Casting a Column to Datetime

The `pd.to_datetime` function can be used to cast a column to `datetime64`. In order to ensure that the conversion is accurate, it is necessary to consider the format that the date/time values are in.

Run the three code blocks below to perform a simple example conversion, and confirm that the `datetime64`-encoded dates are correct. In this case, the data are initially formatted as strings, and the date format is unambiguous, and so the conversion works without any additional work:


In [None]:
data = {
    'date_strings': ['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01']
}
date_df = pd.DataFrame(data)

date_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date_strings  4 non-null      object
dtypes: object(1)
memory usage: 164.0+ bytes


In [None]:
date_df.date_strings = pd.to_datetime(date_df.date_strings)
date_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date_strings  4 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 164.0 bytes


In [None]:
date_df.head()

Unnamed: 0,date_strings
0,2023-01-01
1,2023-02-01
2,2023-03-01
3,2023-04-01



### Common Issues: *Epoch* Time

Date and time are typically represented in computers using a system known as *epoch time* or *Unix time*, which counts the number of seconds that have elapsed since a predefined point in time, known as the *epoch*. The **epoch** is set at `00:00:00 UTC on Thursday, January 1, 1970` and the count of seconds (or milliseconds in some applications) from this point is used to represent subsequent points in time, allowing for a standardized, system-independent representation of time that can be easily computed and converted into various human-readable formats.

This convention is the source of a common issue encountered when converting columns to datetime. For an example, return to `flights_df`, and look at the `FLIGHTDATE` column:



In [None]:
flights_df['FLIGHTDATE'].head(5)

0    20110511
1    20160520
2    20040922
3    20160527
4    19970511
Name: FLIGHTDATE, dtype: int64



`Pandas` has detected the `dtype` of the  `FLIGHTDATE` column as `int64`, whereas it would be more helpful to have it as `datetime64`, so that it can be used in time-series analysis.

Run the code block below to see what happens when we use the `pd.to_datetime` function on the `FLIGHTDATE` column:







In [None]:
pd.to_datetime(flights_df['FLIGHTDATE']).head(5)


0   1970-01-01 00:00:00.020110511
1   1970-01-01 00:00:00.020160520
2   1970-01-01 00:00:00.020040922
3   1970-01-01 00:00:00.020160527
4   1970-01-01 00:00:00.019970511
Name: FLIGHTDATE, dtype: datetime64[ns]

It looks like the first few values of the column are in January 1970. This should arouse suspicion, given what we know about **epoch time**. We can see from the head of the original column that the first 5 values should be between 1997 and 2016, not in 1970. The function has interpreted the integer values in `FLIGHTDATE` as a number of nanoseconds after the **epoch** (the default unit for integer input), when each value is actually a date in the format `YYYYmmdd`.

To work around this, we can specify a format as an argument to the `to_datetime` function, as follows:


In [None]:
# Assign date_format: four-digit year, two-digit month, two-digit day
date_format = "%Y%m%d"
pd.to_datetime(flights_df["FLIGHTDATE"], format=date_format)

0        2011-05-11
1        2016-05-20
2        2004-09-22
3        2016-05-27
4        1997-05-11
            ...    
119175   2016-05-18
119176   2014-09-25
119177   2016-01-13
119178   1997-01-04
119179   2012-09-07
Name: FLIGHTDATE, Length: 119180, dtype: datetime64[ns]
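
For comparison, when a column really does hold epoch values, the `unit` argument of `pd.to_datetime` states what the integers count. The sketch below uses three made-up values measured in seconds.

```python
import pandas as pd

# Hypothetical epoch timestamps: seconds elapsed since 1970-01-01 00:00:00 UTC
epoch_seconds = pd.Series([0, 86_400, 1_700_000_000])

# unit='s' tells pandas the integers are seconds rather than the default nanoseconds
converted = pd.to_datetime(epoch_seconds, unit='s')
print(converted)
```

Choosing between `format` and `unit` therefore depends on whether the integers encode calendar digits, as in `FLIGHTDATE`, or an elapsed count.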

### Common Issues: Mixed Date Formats

Handling multiple date formats in a single column can be tricky. `pd.to_datetime` can infer a format automatically, but it does not always cope when several formats are mixed in one column. Below is a simple example of a column called `mixed_dates`, which has dates in multiple formats:


In [None]:
# Create a sample dataframe with multiple date formats
data = {
    'mixed_dates': ['01/02/2023', '2023-03-01', '04-Apr-2023', '20230505']
}
mixed_date_df = pd.DataFrame(data)

# Displaying the original dataframe
print("Original DataFrame:")
print(mixed_date_df)
print("\nData types:")
print(mixed_date_df.dtypes)

Original DataFrame:
   mixed_dates
0   01/02/2023
1   2023-03-01
2  04-Apr-2023
3     20230505

Data types:
mixed_dates    object
dtype: object


In [None]:
# Converting the 'mixed_dates' column to datetime
# Note: infer_datetime_format=True asks pandas to guess the format, but it might not handle all cases
# (the argument is deprecated from pandas 2.0 onwards, where passing it has no effect)
mixed_date_df['dates'] = pd.to_datetime(mixed_date_df['mixed_dates'], infer_datetime_format=True, errors='coerce')

# Displaying the modified dataframe
print("\nModified DataFrame:")
print(mixed_date_df)
print("\nData types:")
print(mixed_date_df.dtypes)


Modified DataFrame:
   mixed_dates      dates
0   01/02/2023 2023-01-02
1   2023-03-01        NaT
2  04-Apr-2023        NaT
3     20230505        NaT

Data types:
mixed_dates            object
dates          datetime64[ns]
dtype: object


Unfortunately in this case, automatic conversion has not been very effective, and we can see multiple values have been returned as `NaT` (Not a Time).

A more effective approach is to use the `parse` function from the `dateutil` library, in conjunction with the `.apply` method:

In [None]:
from dateutil.parser import parse
mixed_date_df['dates'] = mixed_date_df['mixed_dates'].apply(parse)
mixed_date_df['dates'] = pd.to_datetime(mixed_date_df['dates'], errors='coerce')
print("\nModified DataFrame:\n")
print(mixed_date_df)
print("\nData types:\n")
print(mixed_date_df.dtypes)


Modified DataFrame:

   mixed_dates      dates
0   01/02/2023 2023-01-02
1   2023-03-01 2023-03-01
2  04-Apr-2023 2023-04-04
3     20230505 2023-05-05

Data types:

mixed_dates            object
dates          datetime64[ns]
dtype: object
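
If the set of formats in a column is known in advance, another option is to try each explicit format in turn and keep the first successful parse for each row. The sketch below assumes the four formats listed in `candidate_formats`, which were chosen by eye for this sample and are not detected automatically.

```python
import pandas as pd

raw = pd.Series(['01/02/2023', '2023-03-01', '04-Apr-2023', '20230505'])

# Candidate formats are an assumption about this data; '%m/%d/%Y' reads 01/02 as 2 January
candidate_formats = ['%m/%d/%Y', '%Y-%m-%d', '%d-%b-%Y', '%Y%m%d']

# Parse with the first format, then fill the remaining gaps using each later format
parsed = pd.to_datetime(raw, format=candidate_formats[0], errors='coerce')
for fmt in candidate_formats[1:]:
    attempt = pd.to_datetime(raw, format=fmt, errors='coerce')
    parsed = parsed.fillna(attempt)

print(parsed)
```

Unlike a general-purpose parser, this makes the interpretation of ambiguous values such as `01/02/2023` an explicit decision, at the cost of having to list the formats yourself.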


### The `timedelta64` Data Type

The `timedelta64` data type is used to represent differences in `datetime64` objects. While a `datetime64` object represents a specific point in time, with a defined year, month, day, hour, minute, and so on, a `timedelta64` object represents a duration that is not anchored to a specific start or end point. It tells you how much time is between two points, without specifying what those points are.

The distinction between `timedelta64` and `datetime64` data types in `Pandas` (and similarly, `timedelta` and `datetime` in Python's `datetime` module) is crucial due to the inherent differences in representing and utilizing points in time versus durations of time, which are fundamentally different concepts.

**Arithmetic Operations:**
   - When you subtract one `datetime64` object from another, the result is a `timedelta64` object because subtracting one point in time from another gives you a duration
   - Conversely, when you add or subtract a `timedelta64` from a `datetime64` object, you get another `datetime64` object because you're shifting a point in time by a certain duration

By having separate data types, `Pandas` (and Python more broadly) allows for clear, intuitive operations on time data, ensuring that the operations are semantically meaningful and that the results are what users expect when performing arithmetic or comparisons with time-related data. This distinction also helps prevent misinterpretation of the data and ensures that operations are performed with the appropriate level of precision and efficiency for each type of data.

For example, the code block below creates a new `timedelta64` column by subtracting a specific timestamp from the `dates` column of the `mixed_date_df` dataframe:

In [None]:
# Subtracting a single date from the 'dates' column
single_date = pd.Timestamp('2023-01-01')  # Creating a Timestamp object
mixed_date_df['date_difference'] = mixed_date_df['dates'] - single_date  # Subtracting the single date

# Displaying the modified dataframe
print("\nModified DataFrame:")
print(mixed_date_df)


Modified DataFrame:
   mixed_dates      dates date_difference
0   01/02/2023 2023-01-02          1 days
1   2023-03-01 2023-03-01         59 days
2  04-Apr-2023 2023-04-04         93 days
3     20230505 2023-05-05        124 days
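
The reverse operation, shifting a point in time by a duration, can be sketched in the same way. The example below uses a standalone two-row series so that it does not depend on the dataframe above; `.dt.days` is used to extract the whole number of days from a `timedelta64` column.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2023-01-02', '2023-03-01']))

# datetime64 + timedelta64 -> datetime64 (each date shifted forward by one week)
one_week_later = dates + pd.Timedelta(days=7)

# datetime64 - datetime64 -> timedelta64; .dt.days extracts whole days as integers
days_since_new_year = (dates - pd.Timestamp('2023-01-01')).dt.days

print(one_week_later)
print(days_since_new_year)
```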


## Selecting a Subset of Data

>The process of selecting a subset of your data based on the evaluation of a logical expression is known as *logical indexing* or *logic masking*. There may be times when you wish to separate out sections of your data, based on the value taken by a specific column. For example if you are working with a dataframe of customers for an international business, you might want to separate out just those customers where the `country` column contains the value `Switzerland`. Alternatively you might wish to send out an email to only those of your customers who are within a certain age range. These are both situations where **logical indexing** is useful.

You have already met the `loc` indexer in a previous lesson, in the context of selecting a row based on an index. You can also use it to select rows based on a logical expression.

Let's take the example of selecting the entries in a `customers` dataframe where the country matches `Switzerland`. We use `loc` to select the rows for which a logical expression evaluates to `True`. The syntax to assign matches to a new dataframe is as follows:

`new_df = df.loc[<your logical expression>]`


In [None]:


# Creating a sample dataframe
data = {
    'Name': ['Alansana', 'Briana', 'Chanmony', 'Dietrich', 'Eva'],
    'Age': [23, 34, 45, 36, 50],
    'Country': ['USA', 'Switzerland', 'UK', 'Switzerland', 'Canada']
}
customers = pd.DataFrame(data)

# Displaying the original dataframe
print("Original DataFrame:")
print(customers)

# Using `loc` to select rows based on a logical expression
swiss_customers = customers.loc[customers['Country'] == 'Switzerland']

# Displaying the dataframe with only Swiss customers
print("\nSwiss Customers:")
print(swiss_customers)


Original DataFrame:
       Name  Age      Country
0  Alansana   23          USA
1    Briana   34  Switzerland
2  Chanmony   45           UK
3  Dietrich   36  Switzerland
4       Eva   50       Canada

Swiss Customers:
       Name  Age      Country
1    Briana   34  Switzerland
3  Dietrich   36  Switzerland


If you want to select multiple values from your column for a subset of the data, you can use the `.isin()` method to achieve this. The `isin()` method returns `True` if a row's value is a member of a list:

In [45]:

# Selecting rows where 'Country' is either 'USA' or 'UK'
selected_countries = ['USA', 'UK']
mask = customers['Country'].isin(selected_countries)

# Using the mask to get the subset of the DataFrame
subset_df = customers[mask]

# Displaying the subset of the DataFrame
print("\nSubset DataFrame (Countries: USA, UK):")
print(subset_df)




Subset DataFrame (Countries: USA, UK):
       Name  Age Country
0  Alansana   23     USA
2  Chanmony   45      UK


Logical indexing can also be used in conjunction with other methods, such as dropping values. In many cases this requires the creation of a *logical mask*, which is a Boolean data series of the same length as the number of rows in the dataframe. This can then be passed to other functions, or used to selectively apply a function to the logical subset of the indices.

Let's look at an example, where we want to create a new column called `marketing_email` which is set to `True` if the customer age is less than 40 and greater than or equal to 30:

In [26]:
# Creating a logical mask for ages between 30 and 40
age_mask = (customers['Age'] >= 30) & (customers['Age'] < 40)

print(age_mask)

0    False
1     True
2    False
3     True
4    False
Name: Age, dtype: bool


The logical mask is constructed by enclosing each condition (e.g. `customers['Age'] >= 30`) in parentheses and chaining the conditions together with element-wise logical operators (e.g. `&` for "and", `|` for "or"). Python's `and` and `or` keywords do not work on a series. Each condition must evaluate to a Boolean series of the same length as the number of rows in the dataframe.

Once the logical mask has been generated, it can be used to create a new column as follows:

In [27]:
# Using the logical mask to create a new column 'marketing_email'
# Set to True where the condition is met, and False otherwise
customers['marketing_email'] = False  # Initialize column with False
customers.loc[age_mask, 'marketing_email'] = True  # Apply mask to set True where condition is met

# Displaying the dataframe with the new column
print("\nDataFrame with 'marketing_email' Column:")
print(customers)


DataFrame with 'marketing_email' Column:
       Name  Age      Country  marketing_email
0  Alansana   23          USA            False
1    Briana   34  Switzerland             True
2  Chanmony   45           UK            False
3  Dietrich   36  Switzerland             True
4       Eva   50       Canada            False
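
The same approach extends to the other element-wise operators. The sketch below rebuilds the customer data under a different name (`people`, chosen only to avoid overwriting `customers`) and shows `|` for "or", `~` for "not", and assigning a mask directly as a new column.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alansana', 'Briana', 'Chanmony', 'Dietrich', 'Eva'],
    'Age': [23, 34, 45, 36, 50],
    'Country': ['USA', 'Switzerland', 'UK', 'Switzerland', 'Canada'],
})

# `|` is element-wise OR: under 25, or based in Canada
young_or_canadian = (people['Age'] < 25) | (people['Country'] == 'Canada')

# `~` is element-wise NOT: everyone who is not in Switzerland
not_swiss = ~(people['Country'] == 'Switzerland')

# A Boolean mask can also be assigned directly as a new column
people['marketing_email'] = (people['Age'] >= 30) & (people['Age'] < 40)

print(people.loc[young_or_canadian])
print(people.loc[not_swiss])
```

Assigning the mask directly gives the same `marketing_email` column as the two-step version above, in a single line.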


## Handling Missing and Incorrect Values



### Missing Values and `NaN`s

>It is typically necessary to find a way to handle situations where some columns contain rows without data. Missing values can cause errors or incorrect responses, and will need to be dealt with by removing the rows that contain missing data, or by imputing an appropriate value. 

An obvious example of where missing values can interfere with analysis is the presence of `NaN` values in a numeric column. While many `Pandas` aggregation functions can handle `NaN`s by skipping them, even a single `NaN` in a numeric column can prevent some aggregation functions from returning a usable result, and will also break many machine learning algorithms. 
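
The difference between skipping and propagating `NaN` can be seen in a small sketch. The values are invented; `skipna` is the parameter that controls this behaviour in `Pandas` aggregation methods such as `mean()`.

```python
import numpy as np
import pandas as pd

values = pd.Series([21.0, 22.0, np.nan, 24.0])

mean_skipping_nan = values.mean()            # NaN ignored by default (skipna=True)
mean_with_nan = values.mean(skipna=False)    # NaN propagates into the result
numpy_mean = np.mean(values.to_numpy())      # plain NumPy arrays also propagate NaN

print(mean_skipping_nan, mean_with_nan, numpy_mean)
```

Only the first call returns a usable number, which is why code that hands the raw values to other libraries often needs the missing values handled first.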

Returning to the earlier example of the column of ages in `age_df`, we saw earlier that we could not assign the data type of `int64`, because we had `NaN` values in the column:


In [28]:
age_df.head(10)

Unnamed: 0,Name,Age
0,Alice,21.0
1,Bashar,22.0
2,Carlos,
3,Diana,24.0
4,Ephraim,25.0
5,Frank,
6,Gina,27.0


There are two `NaN` values in the column, and we have the choice to either drop or impute the values to resolve this. The decision of which action to perform depends on your data and analysis goals. For example, in this case we cannot be certain of the ages of `Frank` and `Carlos`, so if our analysis depends on the `Age` column, it might be better to drop the missing values. This can be achieved with the `.dropna()` method.

In [29]:
# Drop rows with any column having NA/null data. If we wanted to apply this result permanently,
# we would need to assign it to a new dataframe or use the `inplace=True` argument
age_df.dropna()

Unnamed: 0,Name,Age
0,Alice,21.0
1,Bashar,22.0
3,Diana,24.0
4,Ephraim,25.0
6,Gina,27.0


Alternatively, we might know in advance that all of the students are between the ages of 20 and 30, and so it might be acceptable to replace the missing values with the mean or median. This can be achieved with the `fillna()` method:

In [30]:
value_to_impute = age_df.Age.mean().round()  # Calculate the mean of the Age column and round it to the nearest integer

age_df['Age'] = age_df['Age'].fillna(value_to_impute)  # Fill NaNs in the Age column with the value we calculated
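
The median works the same way and is less sensitive to extreme values. The sketch below uses a standalone copy of the ages (named `ages` here so that `age_df` is left untouched) and counts the missing values before imputing.

```python
import numpy as np
import pandas as pd

ages = pd.Series([21.0, 22.0, np.nan, 24.0, 25.0, np.nan, 27.0], name='Age')

missing_count = ages.isna().sum()   # number of missing values before imputing
median_age = ages.median()          # median of the non-missing values
ages_imputed = ages.fillna(median_age)

print(missing_count, median_age)
print(ages_imputed)
```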

Handling the `NaN` values will now permit the column to be cast to an `int64`:

In [31]:
age_df.Age = age_df.Age.astype('int64')  # Cast the Age column to a 64-bit integer type
age_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    7 non-null      string
 1   Age     7 non-null      int64 
dtypes: int64(1), string(1)
memory usage: 244.0 bytes


### Detecting and Fixing Incorrect Values

Another scenario that occurs often is the case where some values in a column are incorrect in some way. This is heavily data dependent, and will require an understanding of what the column is supposed to contain. 

Let's consider a couple of examples. Our `flights_df` contains two columns that appear to be intended as Boolean values: `CANCELLED` and `DIVERTED`. We can see what values are in a column, and the number of instances of each value, by using the `value_counts()` method:


In [32]:
flights_df['CANCELLED'].value_counts()

False    63587
0        34964
F        17752
True      1619
1          851
T          407
Name: CANCELLED, dtype: int64

From this we can see that the column is intended as a Boolean, but the values have been expressed in a variety of ways. There are a couple of techniques to fix this situation. The first is to use the `.replace()` method to replace one value with another. For example we can replace all the `0` values with `False` as follows:

In [33]:
flights_df['CANCELLED'] = flights_df['CANCELLED'].replace('0', False)
flights_df['CANCELLED'].value_counts()

False    63587
False    34964
F        17752
True      1619
1          851
T          407
Name: CANCELLED, dtype: int64

Note that `False` appears twice in the `value_counts` result. This is because `Pandas` is distinguishing between the string  `"False"` and the Boolean value `False`. If we want to convert the column to a Boolean type, we will need to ensure that all values in it are of Boolean type.

The `.replace()` method can also accept a dictionary, where the dictionary keys are the values to match, and the dictionary values are the replacement values, e.g. `df.replace({'0': False, '1': True})` to replace all instances of `0` or `1` with `False` and `True` respectively. This can also be achieved using the `map` method, which can apply a mapping (or even a function) to all the elements in a column or series.



In [34]:
mapping_dictionary = {'0': False, '1': True, 'F': False, 'T': True, 'True': True, 'False': False}
flights_df['CANCELLED'] = flights_df['CANCELLED'].replace(mapping_dictionary)
flights_df['CANCELLED'].value_counts()

False    116303
True       2877
Name: CANCELLED, dtype: int64
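
For comparison, here is a sketch of the `map` alternative on a small invented series. One difference worth knowing is that `map` returns `NaN` for any value missing from the dictionary, whereas `replace` leaves such values unchanged; the `'unknown'` entry is included to show this.

```python
import pandas as pd

flags = pd.Series(['0', 'F', 'False', '1', 'T', 'True', 'unknown'])
to_bool = {'0': False, '1': True, 'F': False, 'T': True, 'True': True, 'False': False}

# Values that are not keys of the dictionary (here 'unknown') become NaN
mapped = flags.map(to_bool)
print(mapped)
```

This makes `map` a convenient way to surface unexpected values, since they show up as missing data rather than passing through silently.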

Now that our column contains just `True` and `False` values, we can convert it to the `bool` type:

In [35]:
flights_df['CANCELLED'] = flights_df['CANCELLED'].astype('bool')

print('datatype of flights_df["CANCELLED"]:')
print(flights_df['CANCELLED'].dtype)

flights_df['CANCELLED'].value_counts() # note that the result of value_counts will always show the data type as an integer, as it is a count.

datatype of flights_df["CANCELLED"]:
bool


False    116303
True       2877
Name: CANCELLED, dtype: int64

### Forcing Values to Adhere to a Pattern

In some cases, it might be clear from the data in a column that a particular pattern should be expected for all values, in which case it may make sense to remove or replace any values that do not adhere to this pattern. An example might be a column containing UK phone numbers. There are multiple ways to represent a UK phone number, for example `+44 7555 555 555` or `07555 555555`. To handle the situation where multiple possible formats exist, the solution is to apply a *regular expression* to handle as many cases as possible. 

>A **regular expression**, often abbreviated as *regex*, is a sequence of characters that defines a search pattern that can be used for matching, allowing for complex search, replace, and validation operations.

**Regex** is an extensive topic, and the details of constructing **regex** patterns are beyond the scope of this lesson. As with much in the world of data, however, the work has often already been done for you, and can be found on websites such as [Stack Overflow](https://stackoverflow.com/), or in searchable **regex** repositories such as [regexlib](https://regexlib.com/). Searching **regexlib** for `UK Phone Number` provides this option:

 `^((\(?0\d{4}\)?\s?\d{3}\s?\d{3})|(\(?0\d{3}\)?\s?\d{3}\s?\d{4})|(\(?0\d{2}\)?\s?\d{4}\s?\d{4}))(\s?\#(\d{4}|\d{3}))?$`

This pattern covers the majority of UK phone number variants, including area codes with brackets (e.g. `(020)`), and extensions following a `#` symbol.

Let's try it out on an example column of phone numbers. In the code block below, we will create an example dataframe of phone numbers, including some invalid numbers, and then write code to apply a **regex** to each row in the column, replacing any non-matching values with `NaN`. Note that the code uses a longer pattern than the one shown above, which also accepts the `+44` international prefix and hyphens as separators. We will use the `str.match()` method to apply the **regex**, and then use logical indexing to replace the non-matching values.

In [70]:
# Creating a sample dataframe
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eva', 'Frank', 'Grace', 'Hank', 'Ivy', 'Jack'],
    'Phone': ['0123456789', '01234 567890', '+441234567890', '0123-456-789', 
              '(0123) 456789', '1234567890', '0123456789a', '01234-567-890', 
              '+44 1234 567890', '01234']
}

phone_df = pd.DataFrame(data)

phone_df.head(10)

Unnamed: 0,Name,Phone
0,Alice,0123456789
1,Bob,01234 567890
2,Charlie,+441234567890
3,Diana,0123-456-789
4,Eva,(0123) 456789
5,Frank,1234567890
6,Grace,0123456789a
7,Hank,01234-567-890
8,Ivy,+44 1234 567890
9,Jack,01234


In [71]:
import numpy as np # We will need the `nan` constant from the numpy library to apply to missing values

regex_expression = r'^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))(?:(?:\d{5}\)?[\s-]?\d{4,5})|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$' # Our regular expression to match (a raw string, so the backslashes are passed to the regex engine unchanged)
phone_df.loc[~phone_df['Phone'].str.match(regex_expression), 'Phone'] = np.nan # For every row  where the Phone column does not match our regular expression, replace the value with NaN
phone_df.head(10)

Unnamed: 0,Name,Phone
0,Alice,0123456789
1,Bob,01234 567890
2,Charlie,+441234567890
3,Diana,0123-456-789
4,Eva,(0123) 456789
5,Frank,
6,Grace,
7,Hank,01234-567-890
8,Ivy,+44 1234 567890
9,Jack,
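
One thing to watch for: once a column already contains missing values, `str.match()` returns a missing value for those rows, which cannot be used directly as a boolean mask. A minimal sketch of one way around this is shown here, using `str.fullmatch()` with `na=False` together with `where()`. The data and the deliberately simplified `simple_pattern` are hypothetical and are not the pattern used above:

```python
import pandas as pd

# Illustrative data, including a value that is already missing
phones = pd.Series(['01234 567890', None, '12345', '01234567890x'], dtype=object)

# Hypothetical simplified pattern: a 0, four digits, an optional space, six digits
simple_pattern = r'0\d{4}\s?\d{6}'

# fullmatch requires the whole string to match; na=False treats missing values as non-matching
is_valid = phones.str.fullmatch(simple_pattern, na=False)

# where() keeps values where the mask is True and sets everything else to NaN
cleaned = phones.where(is_valid)
print(cleaned)
```

Because the mask is always a plain boolean series, this version can be re-run safely on a column that has already been partially cleaned.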


### Cleaning Numeric Columns with `.replace()`

The `.replace()` method can also be used to clean up numeric data, for example if you have a column of prices that contain the `£` symbol, thereby preventing the column from being cast to a numeric data type. 

In the example of the phone numbers dataframe, we still have a variety of non-numeric characters in the data which should be replaced in order to regularise the numbers. To rectify this the following actions are needed:

- Replace any instances of `+44` with `0`, as this is how to write the number for calling within the UK
- Replace the `(`, `)` and `-` characters with nothing (i.e. remove them)
- Remove all spaces

The code block below shows how to achieve this:

In [72]:
# You can do each step one by one, for example replacing `+44` with `0` using the following syntax:

phone_df['Phone'] = phone_df['Phone'].str.replace('+44', '0', regex=False)
phone_df

# Or by setting `regex=True`, you can do it all in one step:

phone_df['Phone'] = phone_df['Phone'].replace({r'\+44': '0', r'\(': '', r'\)': '', r'-': '', r' ': ''}, regex=True)
phone_df

Unnamed: 0,Name,Phone
0,Alice,0123456789
1,Bob,01234567890
2,Charlie,01234567890
3,Diana,0123456789
4,Eva,0123456789
5,Frank,
6,Grace,
7,Hank,01234567890
8,Ivy,01234567890
9,Jack,
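
For the price example mentioned at the start of this section, a possible sketch is shown below. The `prices` series is made up for illustration, and the thousands separator is an assumption about how such data might be formatted:

```python
import pandas as pd

# Hypothetical price strings containing a currency symbol and a thousands separator
prices = pd.Series(['£1,200.50', '£15', '£0.99'], dtype=object)

# Remove the characters that prevent the strings from being parsed as numbers
cleaned = (
    prices
    .str.replace('£', '', regex=False)
    .str.replace(',', '', regex=False)
)

# Convert the cleaned strings to a numeric data type
numeric_prices = pd.to_numeric(cleaned)
print(numeric_prices.dtype)
print(numeric_prices.tolist())
```

Unlike prices, phone numbers are usually best kept as strings after cleaning, since casting them to a numeric type would drop the leading `0`.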


## Unique Values

It is sometimes necessary to find the number of unique values in a column. For example we might be working with a column of product IDs, where it would negatively affect our analysis to have multiple products with the same ID. To check whether an issue like this exists, we can use the methods `unique` and `nunique`.

- The `unique` method returns all the unique (i.e. distinct) values in the data series. For example from a series of `[1, 1, 2, 3, 4]` it would return `[1, 2, 3, 4]`.
- The `nunique` method returns the **count** of unique values in the series. For example from a series of `[1, 1, 2, 3, 4]` it would return `4`.

In [153]:
# Creating a sample dataframe with a column of product IDs
data = {'product_ids': ['P001', 'P002', 'P003', 'P001', 'P004', 'P005', 'P003', 'P006', 'P002']}
products_df = pd.DataFrame(data)

# Using `unique` to get unique product IDs
unique_ids = products_df['product_ids'].unique()

# Using `nunique` to get the number of unique product IDs
num_unique_ids = products_df['product_ids'].nunique()

# Displaying the original dataframe
print("Original DataFrame:")
print(products_df)

# Displaying the unique product IDs
print("\nUnique product IDs:")
print(unique_ids)

# Displaying the number of unique product IDs
print("\nNumber of unique product IDs:")
print(num_unique_ids)

# Display the total number of rows in the dataframe
print("\nTotal number of rows in the DataFrame:")
print(len(products_df))


Original DataFrame:
  product_ids
0        P001
1        P002
2        P003
3        P001
4        P004
5        P005
6        P003
7        P006
8        P002

Unique product IDs:
['P001' 'P002' 'P003' 'P004' 'P005' 'P006']

Number of unique product IDs:
6

Total number of rows in the DataFrame:
9
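
Comparing `nunique()` with the number of rows tells us that repeats exist, but not which IDs are affected. As a small sketch, the `is_unique` attribute and the `value_counts()` method can be combined to find them (the variable names here are illustrative):

```python
import pandas as pd

product_ids = pd.Series(['P001', 'P002', 'P003', 'P001', 'P004', 'P005', 'P003', 'P006', 'P002'])

# True only if no value appears more than once
all_unique = product_ids.is_unique

# Count how many times each ID appears, then keep the IDs that appear more than once
counts = product_ids.value_counts()
repeated_ids = sorted(counts[counts > 1].index)

print(all_unique)
print(repeated_ids)
```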


## Duplicates

> *Duplicates* in data refer to two or more rows that are identical across all columns, or, depending on the context, identical in a subset of columns, which can lead to redundancy and inaccuracies in data analysis and interpretation. The presence of duplicates is a common data cleaning issue, as duplicated data can distort descriptive statistics and data visualizations, leading to inaccurate insights and misinformed decisions. For instance, duplicate entries can artificially inflate the count of a category, skewing measures of central tendency like the mean and median, and affecting the distribution of data in visual representations.

Duplicates can either be *exact*, where a row is identical to another row across all columns, or *fuzzy*, which is where the two rows differ in some columns, but appear to describe the same entity. 

Exact duplicates are trivial to handle in `Pandas`. You can find all the duplicated rows in a dataframe using the `.duplicated()` method, or drop them using the `drop_duplicates()` method. Run the two code blocks below to generate some example duplicate data, and then use the `drop_duplicates()` method to drop the duplicate rows.

In [154]:
import pandas as pd

# Creating a sample dataframe
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Eva', 'Charlie'],
    'Age': [28, 34, 45, 28, 23, 45],
    'Phone': ['123-456', '456-789', '789-012', '123-456', '345-678', '789-012'],
    'Email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 
              'alice@email.com', 'eva@email.com', 'charlie@email.com']
}
duplicate_df = pd.DataFrame(data)
duplicate_df




Unnamed: 0,Name,Age,Phone,Email
0,Alice,28,123-456,alice@email.com
1,Bob,34,456-789,bob@email.com
2,Charlie,45,789-012,charlie@email.com
3,Alice,28,123-456,alice@email.com
4,Eva,23,345-678,eva@email.com
5,Charlie,45,789-012,charlie@email.com


In [155]:

# Identifying and dropping exact duplicates
df_no_duplicates = duplicate_df.drop_duplicates()

# Displaying the dataframe after removing duplicates
print("\nDataFrame After Removing Duplicates:")
print(df_no_duplicates)


DataFrame After Removing Duplicates:
      Name  Age    Phone              Email
0    Alice   28  123-456    alice@email.com
1      Bob   34  456-789      bob@email.com
2  Charlie   45  789-012  charlie@email.com
4      Eva   23  345-678      eva@email.com
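
Before dropping anything, it can be useful to inspect which rows would be treated as duplicates. The sketch below uses the `.duplicated()` method with its `keep` and `subset` parameters on a small made-up dataframe:

```python
import pandas as pd

# Illustrative data: rows 0 and 2 are exact copies, row 3 shares only the name
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Alice'],
    'Age': [28, 34, 28, 30],
})

# Default behaviour: flag every repeat after the first occurrence
later_copies = people.duplicated()

# keep=False flags every row that has an exact copy, including the first occurrence
all_copies = people.duplicated(keep=False)

# subset restricts the comparison to the listed columns
same_name = people.duplicated(subset=['Name'])

print(people[all_copies])
```

The same `subset` and `keep` parameters are accepted by `drop_duplicates()`, so the mask can be checked first and the rows dropped afterwards.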


The possible existence of **fuzzy** duplicates can present a greater challenge. Consider the following example:

In [164]:
data = {
    'First_Name': ['Alice', 'Alice', 'Alice',  'Alice'],
    'Last_Name': ['Smith', 'Smith', 'Smith', 'Smith'],
    'Age': [28, 34, 45, 45],
    'Phone': ['123-456', '456-789', '123-456', '123-456'],
    'Email': ['alice@email.com', 'alice@smith.com', 
              'alice@theinternet.com',  'Alice@theinternet.com']
}

fuzzy_duplicates_df = pd.DataFrame(data)
fuzzy_duplicates_df

Unnamed: 0,First_Name,Last_Name,Age,Phone,Email
0,Alice,Smith,28,123-456,alice@email.com
1,Alice,Smith,34,456-789,alice@smith.com
2,Alice,Smith,45,123-456,alice@theinternet.com
3,Alice,Smith,45,123-456,Alice@theinternet.com


In this case, each row is a partial match for all the other rows. Handling this kind of partial (**fuzzy**) matching requires consideration of what each column represents. For example the name `Alice Smith` is relatively common in English-speaking countries, so we should not assume that all of the entries are the same person. Meanwhile ages can be mistyped, and it's even possible that there are multiple people called Alice Smith at the same address, sharing the same phone number. On the other hand, rows `2` and `3` are probably the same person, given that the only difference is the capitalisation in the `Email` column.

How you choose to handle **fuzzy** duplicates depends on your analysis goals. In some cases it might make sense to take a conservative approach to avoid losing data unnecessarily, whereas in other cases it might be necessary to apply more stringent measures to avoid duplicates. You will also need to decide whether to simply drop the **fuzzy** duplicates you identify, or average the values of certain columns (e.g. a product price column).

Generally, the more columns you have for reference, the easier it is to determine whether a partial match is a duplicate. There are also software tools to help you, for example the Python library [fuzzywuzzy](https://pypi.org/project/fuzzywuzzy/), which uses the [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to calculate the differences between strings.
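
As a rough sketch of one conservative approach, the example below normalises the capitalisation of the `Email` column so that the two near-identical rows become exact duplicates. It also scores string similarity using `difflib.SequenceMatcher` from the standard library. Note that this is a stand-in to keep the example dependency-free: it computes a similarity ratio, not the Levenshtein distance used by the library linked above.

```python
from difflib import SequenceMatcher

import pandas as pd

fuzzy_df = pd.DataFrame({
    'First_Name': ['Alice', 'Alice', 'Alice', 'Alice'],
    'Last_Name': ['Smith', 'Smith', 'Smith', 'Smith'],
    'Age': [28, 34, 45, 45],
    'Phone': ['123-456', '456-789', '123-456', '123-456'],
    'Email': ['alice@email.com', 'alice@smith.com',
              'alice@theinternet.com', 'Alice@theinternet.com'],
})

# Lower-case the emails so rows that differ only by capitalisation become identical
normalised = fuzzy_df.assign(Email=fuzzy_df['Email'].str.lower())
deduplicated = normalised.drop_duplicates()

# Similarity ratio between 0 and 1 (a stand-in measure, not Levenshtein distance)
def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

score = similarity('alice@theinternet.com', 'Alice@theinternet.com')
print(len(deduplicated), round(score, 2))
```

Whether a given similarity score is high enough to treat two rows as the same entity is a judgement call that depends on the column and on your analysis goals.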

## Categorical Data

> Categorical columns are those in which the values are drawn from a predefined set of categories. Examples include the possible colour schemes of a product like a laptop, the country of residence of a customer, or the manufacturer of a car.  We will learn more about the different types of categories elsewhere in this course. 

In the context of data cleaning, the thing that the categorical column is describing can inform you as to whether the column requires cleaning. Ideally you want each type of thing that the column is describing to have a single entry in the set of unique values in the column. For example, when dealing with countries, there should be somewhere in the region of 193 to 237 possible entries, depending on the definition of the word "country" that is being used. It is possible for the same country to go by multiple names in your column, in which case it would be helpful to regularise the names such that each country is represented by only a single value in the column. 

Look at the dataframe below. The column `Postal Region` contains the country names `UK`, `England`, `Wales`, `Cymru` and `Scotland`, among others. For the purposes of the cost of sending mail, these are all one region, the `United Kingdom`. It would therefore be preferable to set them all to the same value. As with missing values, we can fix this representation with the `.replace()` method.


In [36]:
# Creating a sample dataframe
data = {
    'Customer Name': ['Amina', 'Bahru', 'Charlie', 'Dion', 'Ebo', 'Frank', 'Giana'],
    'Postal Region': ['UK', 'England', 'Wales', 'Cymru', 'Scotland', 'USA', 'Canada']
}
df = pd.DataFrame(data)

# Displaying the original dataframe
print("Original DataFrame:")
print(df)



Original DataFrame:
  Customer Name Postal Region
0         Amina            UK
1         Bahru       England
2       Charlie         Wales
3          Dion         Cymru
4           Ebo      Scotland
5         Frank           USA
6         Giana        Canada


In [173]:
# Creating a mapping dictionary to unify the country names
country_mapping = {
    'UK': 'United Kingdom',
    'England': 'United Kingdom',
    'Wales': 'United Kingdom',
    'Cymru': 'United Kingdom',
    'Scotland': 'United Kingdom'
}

# Replacing the country names in the 'Postal Region' column
df['Postal Region'] = df['Postal Region'].replace(country_mapping)

# Displaying the DataFrame after cleaning the 'Postal Region' column
print("\nDataFrame After Cleaning 'Postal Region' Column:")
print(df)



DataFrame After Cleaning 'Postal Region' Column:
  Customer Name   Postal Region
0         Amina  United Kingdom
1         Bahru  United Kingdom
2       Charlie  United Kingdom
3          Dion  United Kingdom
4           Ebo  United Kingdom
5         Frank             USA
6         Giana          Canada
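
Once the labels have been regularised, one optional follow-up is to convert the column to the `category` data type, which makes the set of distinct labels easy to inspect. This is a sketch of that idea on a standalone series, not a required step:

```python
import pandas as pd

regions = pd.Series(['UK', 'England', 'Wales', 'Cymru', 'Scotland', 'USA', 'Canada'])

# Build the same kind of mapping as above from a list of alternative names
uk_names = ['UK', 'England', 'Wales', 'Cymru', 'Scotland']
mapping = {name: 'United Kingdom' for name in uk_names}

# Regularise the labels, then store the column as categorical data
cleaned = regions.replace(mapping).astype('category')

# The categories attribute lists each distinct label once
print(list(cleaned.cat.categories))
print(cleaned.value_counts())
```

Checking the list of categories after cleaning is a quick way to spot any alternative spellings that the mapping dictionary missed.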


### Creating Categorical Columns from Continuous Data

Continuous variables are those which can take any value on a spectrum. For example, the price of an item can potentially take any value greater than zero, usually with a granularity of 1 penny. You may sometimes want to generate new categories from continuous data in your dataset. This can be achieved through a process called *binning*, which means dividing the range of possible values into intervals, known as **bins**.

As an example, consider a dataframe of flight routes, together with their distances in miles. An airline might want to divide them into `short haul`, `medium haul` and `long haul` based on threshold values. To achieve this, we can use the `pd.cut()` function:

In [174]:

# Creating a sample dataframe
data = {
    'Route': ['NYC-LON', 'LON-PAR', 'NYC-TOK', 'LON-SYD', 'PAR-BER'],
    'Distance': [3461, 214, 6749, 10562, 546]
}
flights = pd.DataFrame(data)

# Displaying the original dataframe
print("Original DataFrame:")
print(flights)

# Defining the bin edges and labels
bin_edges = [0, 1500, 4000, 12000]  # in miles
bin_labels = ['short haul', 'medium haul', 'long haul']

# Creating a new categorical column 'Flight Type' by binning the 'Distance' column
flights['Flight Type'] = pd.cut(flights['Distance'], bins=bin_edges, labels=bin_labels, right=False)

# Displaying the dataframe with the new 'Flight Type' column
print("\nDataFrame with 'Flight Type' Column:")
print(flights)

Original DataFrame:
     Route  Distance
0  NYC-LON      3461
1  LON-PAR       214
2  NYC-TOK      6749
3  LON-SYD     10562
4  PAR-BER       546

DataFrame with 'Flight Type' Column:
     Route  Distance  Flight Type
0  NYC-LON      3461  medium haul
1  LON-PAR       214   short haul
2  NYC-TOK      6749    long haul
3  LON-SYD     10562    long haul
4  PAR-BER       546   short haul


In the above example:
- We create a dataframe `flights` with columns `Route` and `Distance`, containing various flight routes and their distances in miles
- We define the bin edges `bin_edges` and labels `bin_labels` to specify the ranges and labels for our new categorical data. Note that the `bin_edges` list has one more element than the number of bins we need, because the edges define the lower and upper bounds of each bin. So a short haul flight is one of at least `0` and less than `1500` miles in this case.
- We use `pd.cut()` to create a new column `Flight Type` by binning the `Distance` column based on the defined bins and labels. The argument `right = False` is used to specify that the bins are *left-closed*, meaning that the left bin edge is included in the bin, but the right bin edge is not.
- Finally, we display the original and modified dataframes to observe the changes

This approach allows you to categorize continuous data into discrete bins, simplifying analysis and enabling you to gain insights into the distribution and frequency of the data across different categories. This is particularly useful when you want to analyze or visualize your data at a higher or more generalized level than the raw, continuous data allows.
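
The effect of the `right` argument is easiest to see on values that sit exactly on a bin edge. The sketch below reuses the same edges and labels on three made-up distances and compares `right=False` with the default `right=True`:

```python
import pandas as pd

# Distances chosen to fall exactly on the bin edges
distances = pd.Series([0, 1500, 12000])
bin_edges = [0, 1500, 4000, 12000]
bin_labels = ['short haul', 'medium haul', 'long haul']

# Left-closed bins: [0, 1500), [1500, 4000), [4000, 12000)
left_closed = pd.cut(distances, bins=bin_edges, labels=bin_labels, right=False)

# Default right-closed bins: (0, 1500], (1500, 4000], (4000, 12000]
right_closed = pd.cut(distances, bins=bin_edges, labels=bin_labels)

comparison = pd.DataFrame({
    'Distance': distances,
    'right=False': left_closed,
    'right=True': right_closed,
})
print(comparison)
```

Values that fall outside every bin, such as `12000` with left-closed bins or `0` with right-closed bins, are assigned `NaN`, so it is worth checking that the outermost edges cover the full range of your data.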

## Key Takeaways

- You can check the data type of your dataframe columns using the `dtypes` attribute or the `.info()` method
- You can change the data type of a column using the `astype()` method, and convert string data to numeric data using the `pd.to_numeric()` function
- The `pd.to_datetime()` function is used to convert a column to the `datetime64` data type
- Common issues with converting to `datetime64` include `Pandas` assuming an integer representation is encoding **Epoch** time, and handling columns with multiple date formats
- The `timedelta64` data type is used to represent time differences between datetime objects
- The `.dropna()` and `.fillna()` methods can be used to handle `NaN` and `NULL` values
- The `.replace()` method can be used to handle incorrect values
- You can force a column's values to stick to a pattern using **regular expressions**
- The `.unique()` and `.nunique()` methods can be used to identify and count the unique values in a column
- **Exact** duplicates can be removed using the `.drop_duplicates()` method, whereas **fuzzy** duplicates might need more advanced methods to identify and handle
- **Categorical** data should be **regularised**, so that each relevant entity is described by a single category label
- You can create **categorical** data from **continuous** data by **binning**