# Data Cleaning

## Learning Objectives

- Identify what dirty data is
- Understand the side effects of working with dirty data
- Learn how to clean data

A critical step before we run our data through a model is **Exploratory Data Analysis**, or EDA. EDA, as the name suggests, is an in-depth analysis of our data. The EDA process intertwines data cleaning, dealing with missing values, and visualising data and its statistical properties. Typically these processes are all done together, but to keep the concepts clear we will introduce them separately here. I will be providing toy examples through screenshots, while you will be applying the concepts to a messy dataset!

## Prerequisites
[flights.txt](https://drive.google.com/file/d/1cVV3TZcxS31fk9JrskaRP1pbOfaoNcwe/view?usp=sharing)
(source: https://www.kaggle.com/mmetter/flights/data).
In most cases you will receive data that comes with documentation. **Reading data documentation is important!** This particular dataset doesn't have any documentation, however, so we'll have to use our intuition about the column/variable names.

Run the following command to download the file into your directory:

In [1]:
!wget "https://aicore-files.s3.amazonaws.com/Data-Eng/flights.txt"

--2021-10-12 16:02:17--  https://aicore-files.s3.amazonaws.com/Data-Eng/flights.txt
Resolving aicore-files.s3.amazonaws.com (aicore-files.s3.amazonaws.com)... 52.217.164.25
Connecting to aicore-files.s3.amazonaws.com (aicore-files.s3.amazonaws.com)|52.217.164.25|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 314586167 (300M) [text/plain]
Saving to: 'flights.txt'


2021-10-12 16:04:12 (2.63 MB/s) - 'flights.txt' saved [314586167/314586167]



## Data types
Before we get started, let's talk about some different data types we can expect to come across:
<p align=center>
	<table >
		<tbody>
			<tr>
				<td><b>Data type</b></td>
				<td><b>Python data type</b></td>
				<td><b>Examples</b></td>
			</tr>
			<tr>
				<td>Text data</td>
				<td>str</td>
				<td>Names, addresses</td>
			</tr>
			<tr>
				<td>Integers</td>
				<td>int</td>
				<td># items, # people</td>
			</tr>
			<tr>
				<td>Floats/Decimals</td>
				<td>float</td>
				<td>Currency, distances</td>
			</tr>
			<tr>
				<td>Binary/Boolean</td>
				<td>bool</td>
				<td>Is married, yes/no</td>
			</tr>
			<tr>
				<td>Date (and times)</td>
				<td>datetime</td>
				<td>Dispatch date, arrival time</td>
			</tr>
			<tr>
				<td>Categories</td>
				<td>category</td>
				<td>States, colours, gender</td>
			</tr>
		</tbody>
	</table>
</p>

In [2]:
import pandas as pd
pd.set_option('display.max_columns', None)
flights_df = pd.read_csv("flights.txt", sep="|") # Make sure flights.txt is in the same directory

In [3]:
flights_df.head()

Unnamed: 0,TRANSACTIONID,FLIGHTDATE,AIRLINECODE,AIRLINENAME,TAILNUM,FLIGHTNUM,ORIGINAIRPORTCODE,ORIGAIRPORTNAME,ORIGINCITYNAME,ORIGINSTATE,ORIGINSTATENAME,DESTAIRPORTCODE,DESTAIRPORTNAME,DESTCITYNAME,DESTSTATE,DESTSTATENAME,CRSDEPTIME,DEPTIME,DEPDELAY,TAXIOUT,WHEELSOFF,WHEELSON,TAXIIN,CRSARRTIME,ARRTIME,ARRDELAY,CRSELAPSEDTIME,ACTUALELAPSEDTIME,CANCELLED,DIVERTED,DISTANCE
0,54548800,20020101,WN,Southwest Airlines Co.: WN,N103@@,1425,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,DAL,DallasTX: Dallas Love Field,Dallas,TX,Texas,1425,1425.0,0.0,8.0,1433.0,1648.0,4.0,1655,1652.0,-3.0,90.0,87.0,F,False,580 miles
1,55872300,20020101,CO,Continental Air Lines Inc.: CO,N83872,150,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,IAH,HoustonTX: George Bush Intercontinental/Houston,Houston,TX,Texas,1130,1136.0,6.0,12.0,1148.0,1419.0,16.0,1426,1435.0,9.0,116.0,119.0,False,F,744 miles
2,54388800,20020101,WN,Southwest Airlines Co.: WN,N334@@,249,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,MCI,Kansas CityMO: Kansas City International,Kansas City,MO,Missouri,1215,1338.0,83.0,7.0,1345.0,1618.0,2.0,1500,1620.0,80.0,105.0,102.0,F,False,718 miles
3,54486500,20020101,WN,Southwest Airlines Co.: WN,N699@@,902,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,LAS,Las VegasNV: McCarran International,Las Vegas,NV,Nevada,1925,1925.0,0.0,5.0,1930.0,1947.0,1.0,1950,1948.0,-2.0,85.0,83.0,0,0,487 miles
4,55878700,20020103,CO,Continental Air Lines Inc.: CO,N58606,234,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,IAH,HoustonTX: George Bush Intercontinental/Houston,Houston,TX,Texas,1455,1453.0,-2.0,11.0,1504.0,1742.0,5.0,1750,1747.0,-3.0,115.0,114.0,F,False,744 miles


There are many issues that I can see with the data. We'll tackle them one by one in an arbitrary order, but let's first try to decide what each of the non-trivial columns is meant to represent.

- **TRANSACTIONID**: Unique identifier

- **FLIGHTDATE**: Date of the flight. Looks like it's encoded as a number instead of a date object

- **TAILNUM**: Looks like it contains @@ in some of its rows.

- **ORIGAIRPORTNAME** and **DESTAIRPORTNAME**: Look like they have the city name and state concatenated and prepended to the actual name of the airport

- **CRSDEPTIME** and **DEPTIME**: Look like they represent incorrectly formatted (military) times. It also seems that **CRSDEPTIME** + **DEPDELAY** = **DEPTIME**

- **DEPDELAY**: Departure delay in minutes?

- **TAXIOUT**: How long it took from departure to wheels off. Looks like **DEPTIME** + **TAXIOUT** = **WHEELSOFF**

- **WHEELSOFF**: The (military) time when wheels left the ground

- **WHEELSON**: Military time when wheels touched the ground on descent

- **TAXIIN**: Looks like the number of minutes since the wheels touched the ground to "parking"

- **CRSARRTIME**: The military encoded expected arrival time

- **ARRTIME**: Actual arrival time

- **ARRDELAY**: Difference between **CRSARRTIME** and **ARRTIME**

- **CRSELAPSEDTIME**: Planned journey time (minutes)

- **ACTUALELAPSEDTIME**: Actual journey time (minutes)

- **CANCELLED**: Whether the flight was cancelled or not. Looks like some values are False, others are F, and others 0. Possibly a similar variation for True

- **DIVERTED**: Whether the plane was diverted. Similar issues regarding True/False as above?

- **DISTANCE**: The (integer) distance the plane travelled, encoded as a string with "miles" concatenated to it
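The two guesses about the time columns can be sanity-checked by hand. The sketch below uses only the values from row 2 of the preview above; the helper name `hhmm_to_minutes` is made up for illustration, and it assumes the times are HHMM values that do not cross midnight. Note that adding the numbers directly would not work (1215 + 83 is 1298, not 1338), so the times are first converted to minutes since midnight.

```python
def hhmm_to_minutes(hhmm: float) -> int:
    """Convert a military-style time such as 1338 (13:38) to minutes since midnight."""
    hours, minutes = divmod(int(hhmm), 100)
    return hours * 60 + minutes


# Values copied from row 2 of the preview above
crs_dep_time, dep_delay, dep_time = 1215, 83.0, 1338.0
taxi_out, wheels_off = 7.0, 1345.0

# Guess 1: CRSDEPTIME + DEPDELAY = DEPTIME
print(hhmm_to_minutes(crs_dep_time) + dep_delay == hhmm_to_minutes(dep_time))

# Guess 2: DEPTIME + TAXIOUT = WHEELSOFF
print(hhmm_to_minutes(dep_time) + taxi_out == hhmm_to_minutes(wheels_off))
```

Both checks hold for this row, which supports the guesses but does not prove them for the whole dataset.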

In [4]:
flights_df

Unnamed: 0,TRANSACTIONID,FLIGHTDATE,AIRLINECODE,AIRLINENAME,TAILNUM,FLIGHTNUM,ORIGINAIRPORTCODE,ORIGAIRPORTNAME,ORIGINCITYNAME,ORIGINSTATE,ORIGINSTATENAME,DESTAIRPORTCODE,DESTAIRPORTNAME,DESTCITYNAME,DESTSTATE,DESTSTATENAME,CRSDEPTIME,DEPTIME,DEPDELAY,TAXIOUT,WHEELSOFF,WHEELSON,TAXIIN,CRSARRTIME,ARRTIME,ARRDELAY,CRSELAPSEDTIME,ACTUALELAPSEDTIME,CANCELLED,DIVERTED,DISTANCE
0,54548800,20020101,WN,Southwest Airlines Co.: WN,N103@@,1425,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,DAL,DallasTX: Dallas Love Field,Dallas,TX,Texas,1425,1425.0,0.0,8.0,1433.0,1648.0,4.0,1655,1652.0,-3.0,90.0,87.0,F,False,580 miles
1,55872300,20020101,CO,Continental Air Lines Inc.: CO,N83872,150,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,IAH,HoustonTX: George Bush Intercontinental/Houston,Houston,TX,Texas,1130,1136.0,6.0,12.0,1148.0,1419.0,16.0,1426,1435.0,9.0,116.0,119.0,False,F,744 miles
2,54388800,20020101,WN,Southwest Airlines Co.: WN,N334@@,249,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,MCI,Kansas CityMO: Kansas City International,Kansas City,MO,Missouri,1215,1338.0,83.0,7.0,1345.0,1618.0,2.0,1500,1620.0,80.0,105.0,102.0,F,False,718 miles
3,54486500,20020101,WN,Southwest Airlines Co.: WN,N699@@,902,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,LAS,Las VegasNV: McCarran International,Las Vegas,NV,Nevada,1925,1925.0,0.0,5.0,1930.0,1947.0,1.0,1950,1948.0,-2.0,85.0,83.0,0,0,487 miles
4,55878700,20020103,CO,Continental Air Lines Inc.: CO,N58606,234,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,IAH,HoustonTX: George Bush Intercontinental/Houston,Houston,TX,Texas,1455,1453.0,-2.0,11.0,1504.0,1742.0,5.0,1750,1747.0,-3.0,115.0,114.0,F,False,744 miles
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1191800,126750200,20130106,EV,ExpressJet Airlines Inc.: EV,N683BR,5272,ATL,AtlantaGA: Hartsfield-Jackson Atlanta Internat...,Atlanta,GA,Georgia,DAL,DallasTX: Dallas Love Field,Dallas,TX,Texas,1357,1348.0,-9.0,22.0,1410.0,1500.0,3.0,1523,1503.0,-20.0,146.0,135.0,0,0,721 miles
1191801,127294500,20130106,DL,Delta Air Lines Inc.: DL,N949DL,1711,ATL,AtlantaGA: Hartsfield-Jackson Atlanta Internat...,Atlanta,GA,Georgia,DFW,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,Dallas/Fort Worth,TX,Texas,2150,2147.0,-3.0,23.0,2210.0,2307.0,10.0,2321,2317.0,-4.0,151.0,150.0,False,F,731 miles
1191802,127294900,20130106,DL,Delta Air Lines Inc.: DL,N907DE,1810,ATL,AtlantaGA: Hartsfield-Jackson Atlanta Internat...,Atlanta,GA,Georgia,DFW,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,Dallas/Fort Worth,TX,Texas,1617,1617.0,0.0,18.0,1635.0,1728.0,9.0,1750,1737.0,-13.0,153.0,140.0,F,False,731 miles
1191803,126594900,20130106,EV,ExpressJet Airlines Inc.: EV,N855AS,5208,ATL,AtlantaGA: Hartsfield-Jackson Atlanta Internat...,Atlanta,GA,Georgia,FWA,Fort WayneIN: Fort Wayne International,Fort Wayne,IN,Indiana,1516,1514.0,-2.0,21.0,1535.0,1651.0,4.0,1658,1655.0,-3.0,102.0,101.0,False,F,508 miles


So walking through each column identifies the issues that this dataset might have. Let's tackle the **distance** issue first. It looks like this is meant to be _integer_ encoded (notice that I said distances could be floats in the table at the beginning of this lecture. Why the change of mind?), and the values have 'miles' appended to the end of the number. We can do a lot more with numeric data than with text data that represents numbers, so let's first convert this to an int.

<table>
    <tr>
        <td> </td>
        <td><b>OrderID</b></td>
        <td><b>Cost</b></td>
        <td><b>Quantity</b></td>
        <td><b>Address</b></td>
    </tr>
    <tr>
        <td>0</td>
        <td>1234</td>
        <td>£1000.00</td>
        <td>10</td>
        <td>123 Fake Street</td>
    </tr>
    <tr>
        <td>1</td>
        <td>7890</td>
        <td>£35.50</td>
        <td>3</td>
        <td>789 Real Road</td>
    </tr>
    
</table>

In the above table, we can see that cost should be a float; however, it has a £ symbol attached to it. To use this column as a float, we need to remove the £. Before doing this, however, let's take a look at the datatypes of the columns. This is done by accessing the `.dtypes` attribute of our dataframe. For the example above, this would return:

<table>
    <tr>
        <td>OrderID</td>
        <td>int64</td>
    </tr>
    <tr>
        <td>Cost</td>
        <td>object</td>
    </tr>
    <tr>
        <td>Quantity</td>
        <td>int64</td>
    </tr>
    <tr>
        <td>Address</td>
        <td>object</td>
    </tr>
</table>

We can also use `.info()`, which additionally reports the number of non-null values in each column (we'll cover how to deal with null/missing values soon).
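As a rough sketch of both calls, here is the toy sales table rebuilt by hand (the variable name `sales` and the values are taken from the example above, not from the flights data). Depending on the pandas version, text columns are reported as `object` or as a dedicated string dtype; either way they are not numeric.

```python
import pandas as pd

# Toy version of the sales table above
sales = pd.DataFrame(
    {
        "OrderID": [1234, 7890],
        "Cost": ["£1000.00", "£35.50"],
        "Quantity": [10, 3],
        "Address": ["123 Fake Street", "789 Real Road"],
    }
)

# One dtype per column
print(sales.dtypes)

# dtypes plus non-null counts and memory usage
sales.info()
```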

In [7]:
## Find the data types for each column using .dtypes

TRANSACTIONID          int64
FLIGHTDATE             int64
AIRLINECODE           object
AIRLINENAME           object
TAILNUM               object
FLIGHTNUM              int64
ORIGINAIRPORTCODE     object
ORIGAIRPORTNAME       object
ORIGINCITYNAME        object
ORIGINSTATE           object
ORIGINSTATENAME       object
DESTAIRPORTCODE       object
DESTAIRPORTNAME       object
DESTCITYNAME          object
DESTSTATE             object
DESTSTATENAME         object
CRSDEPTIME             int64
DEPTIME              float64
DEPDELAY             float64
TAXIOUT              float64
WHEELSOFF            float64
WHEELSON             float64
TAXIIN               float64
CRSARRTIME             int64
ARRTIME              float64
ARRDELAY             float64
CRSELAPSEDTIME       float64
ACTUALELAPSEDTIME    float64
CANCELLED             object
DIVERTED              object
DISTANCE              object
dtype: object

In [8]:
## Find the data types and number of non-null values using .info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1191805 entries, 0 to 1191804
Data columns (total 31 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   TRANSACTIONID      1191805 non-null  int64  
 1   FLIGHTDATE         1191805 non-null  int64  
 2   AIRLINECODE        1191805 non-null  object 
 3   AIRLINENAME        1191805 non-null  object 
 4   TAILNUM            1034988 non-null  object 
 5   FLIGHTNUM          1191805 non-null  int64  
 6   ORIGINAIRPORTCODE  1191805 non-null  object 
 7   ORIGAIRPORTNAME    1191805 non-null  object 
 8   ORIGINCITYNAME     1191805 non-null  object 
 9   ORIGINSTATE        1180963 non-null  object 
 10  ORIGINSTATENAME    1180963 non-null  object 
 11  DESTAIRPORTCODE    1191805 non-null  object 
 12  DESTAIRPORTNAME    1191805 non-null  object 
 13  DESTCITYNAME       1191805 non-null  object 
 14  DESTSTATE          1180967 non-null  object 
 15  DESTSTATENAME      1180967 non-n

If we were to sum our above <b>cost</b> column (`sales['cost'].sum()`), something akin to the following would be returned:
```£1000.00£35.50£46.10£76.35```...

Obviously this isn't what we want; we'd rather have all our costs added together.

Try the same with the 'DISTANCE' column of our flights data.

In [9]:
## Sum the first 10 instances of 'DISTANCE' column in the flights data.
# Be careful about where you slice ;). What's the technical difference between slicing before the .sum() and after?

'580 miles744 miles718 miles487 miles744 miles289 miles569 miles1240 miles223 miles677 miles'

So to resolve the issue with our sales data, we need to do two things:
1. Remove the '£'
2. Convert the column to a float data type

This would be done as follows:
```python
sales['cost'] = sales['cost'].str.strip('£')
sales['cost'] = sales['cost'].astype('float64')
```
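To see the before and after side by side, here is a minimal self-contained sketch on a two-row toy `sales` frame (made up for illustration; the column is forced to `object` so that it behaves like text read from a file).

```python
import pandas as pd

sales = pd.DataFrame({"cost": pd.Series(["£1000.00", "£35.50"], dtype="object")})

# Summing text concatenates the strings instead of adding the amounts
text_total = sales["cost"].sum()
print(text_total)

# Strip the currency symbol, then convert to a numeric dtype
sales["cost"] = sales["cost"].str.strip("£").astype("float64")
numeric_total = sales["cost"].sum()
print(numeric_total)
```

Note that `.str.strip()` only removes characters from the start and end of each value, which is enough here because the symbol always comes first.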

Armed with this knowledge, let's convert the distance column to an int!

In [11]:
## Remove the ' miles' from the dataframe

## Convert the column to an int64 type

## Verify that the column has converted to an int successfully
flights_df["DISTANCE"]

0          580
1          744
2          718
3          487
4          744
          ... 
1191800    721
1191801    731
1191802    731
1191803    508
1191804    306
Name: DISTANCE, Length: 1191805, dtype: int64

Cool! Ok, so that's how we can convert messy text data to numbers. Let's look at converting data to categorical values now.

In our dataset, we have many columns which could be categorical. Can you identify which ones they are?
<br>
<details>
    <summary><b>></b> Categorical variables (click to reveal)</summary>
    <ul>
        <li>AIRLINECODE</li>
        <li>AIRLINENAME</li>
        <li>ORIGINAIRPORTCODE</li>
        <li>ORIGAIRPORTNAME</li>
        <li>ORIGINCITYNAME</li>
        <li>ORIGINSTATE</li>
        <li>ORIGINSTATENAME</li>
        <li>DESTAIRPORTCODE</li>
        <li>DESTAIRPORTNAME</li>
        <li>DESTCITYNAME</li>
        <li>DESTSTATE</li>
        <li>DESTSTATENAME</li>
    </ul>
</details>

Using the `.describe()` method, we can find out more about a particular column. Let's use `AIRLINECODE` as an example.

In [12]:
flights_df['AIRLINECODE'].describe()

count     1191805
unique         26
top            WN
freq       189985
Name: AIRLINECODE, dtype: object

In [13]:
flights_df['AIRLINECODE']

0          WN
1          CO
2          WN
3          WN
4          CO
           ..
1191800    EV
1191801    DL
1191802    DL
1191803    EV
1191804    EV
Name: AIRLINECODE, Length: 1191805, dtype: object

We get some (mildly) useful statistics returned when we run `.describe()` over this variable. However, we can see that the datatype of this column has been interpreted as `object`. From the data types table we introduced earlier, we can see that support for categories exists. Let's convert this column to a category and see what difference it makes to the output of `.describe()`.

In [14]:
flights_df['AIRLINECODE'] = flights_df['AIRLINECODE'].astype('category')
flights_df['AIRLINECODE']

0          WN
1          CO
2          WN
3          WN
4          CO
           ..
1191800    EV
1191801    DL
1191802    DL
1191803    EV
1191804    EV
Name: AIRLINECODE, Length: 1191805, dtype: category
Categories (26, object): ['9E', 'AA', 'AS', 'B6', ..., 'VX', 'WN', 'XE', 'YV']

In [15]:
flights_df['AIRLINECODE'].describe()

count     1191805
unique         26
top            WN
freq       189985
Name: AIRLINECODE, dtype: object

There is no visible difference in the `.describe()` output (oddly, it still reports a dtype of object 🤔)! We do, however, see a new `Categories` line when displaying the column itself. Either way, let's take a quick detour under the hood and look at the advantages of representing things as categories. We'll use the `.info()` method to look at our memory consumption.

In [16]:
flights_df['AIRLINECODE'] = flights_df['AIRLINECODE'].astype('object')
flights_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1191805 entries, 0 to 1191804
Data columns (total 31 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   TRANSACTIONID      1191805 non-null  int64  
 1   FLIGHTDATE         1191805 non-null  int64  
 2   AIRLINECODE        1191805 non-null  object 
 3   AIRLINENAME        1191805 non-null  object 
 4   TAILNUM            1034988 non-null  object 
 5   FLIGHTNUM          1191805 non-null  int64  
 6   ORIGINAIRPORTCODE  1191805 non-null  object 
 7   ORIGAIRPORTNAME    1191805 non-null  object 
 8   ORIGINCITYNAME     1191805 non-null  object 
 9   ORIGINSTATE        1180963 non-null  object 
 10  ORIGINSTATENAME    1180963 non-null  object 
 11  DESTAIRPORTCODE    1191805 non-null  object 
 12  DESTAIRPORTNAME    1191805 non-null  object 
 13  DESTCITYNAME       1191805 non-null  object 
 14  DESTSTATE          1180967 non-null  object 
 15  DESTSTATENAME      1180967 non-n

In [17]:
flights_df['AIRLINECODE'] = flights_df['AIRLINECODE'].astype('category')
flights_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1191805 entries, 0 to 1191804
Data columns (total 31 columns):
 #   Column             Non-Null Count    Dtype   
---  ------             --------------    -----   
 0   TRANSACTIONID      1191805 non-null  int64   
 1   FLIGHTDATE         1191805 non-null  int64   
 2   AIRLINECODE        1191805 non-null  category
 3   AIRLINENAME        1191805 non-null  object  
 4   TAILNUM            1034988 non-null  object  
 5   FLIGHTNUM          1191805 non-null  int64   
 6   ORIGINAIRPORTCODE  1191805 non-null  object  
 7   ORIGAIRPORTNAME    1191805 non-null  object  
 8   ORIGINCITYNAME     1191805 non-null  object  
 9   ORIGINSTATE        1180963 non-null  object  
 10  ORIGINSTATENAME    1180963 non-null  object  
 11  DESTAIRPORTCODE    1191805 non-null  object  
 12  DESTAIRPORTNAME    1191805 non-null  object  
 13  DESTCITYNAME       1191805 non-null  object  
 14  DESTSTATE          1180967 non-null  object  
 15  DESTSTATENAME  

We see that our memory usage has dropped by around 8 MB just by converting this one column to a category! Admittedly, this isn't a big deal when working with data of this size, but remember that this saving comes from just one of the many categorical columns we have.
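The same effect can be reproduced on a small scale. This is a sketch on a hypothetical toy column rather than the flights data; it uses `Series.memory_usage(deep=True)` so that the size of the strings themselves is counted too.

```python
import pandas as pd

# Hypothetical toy column: three airline codes repeated many times
codes = pd.Series(["WN", "CO", "DL"] * 100_000)

as_object = codes.astype("object")
as_category = codes.astype("category")

# deep=True counts the Python string objects, not just the pointers to them
print(as_object.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))

# The categorical column stores one small integer code per row
print(as_category.cat.codes.dtype)
```

The exact byte counts depend on the platform and pandas version, but the categorical version should be considerably smaller.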

So why is this? Well, under the hood, Pandas represents categories as integer types. In fact, when working with other datasets you may come across a category column that is explicitly encoded as integers. Let's modify our dataframe to see what happens in that case.

In [18]:
flights_df['AIRLINECODE_ASINT'] = flights_df['AIRLINECODE'].cat.codes.astype('int64')
flights_df['AIRLINECODE_ASINT']

0          23
1           4
2          23
3          23
4           4
           ..
1191800     7
1191801     6
1191802     6
1191803     7
1191804     7
Name: AIRLINECODE_ASINT, Length: 1191805, dtype: int64

When we run `.describe()` we can see statistics returned which don't exactly make sense for our column:

In [19]:
flights_df['AIRLINECODE_ASINT'].describe()

count    1.191805e+06
mean     1.305387e+01
std      8.211102e+00
min      0.000000e+00
25%      6.000000e+00
50%      1.500000e+01
75%      2.100000e+01
max      2.500000e+01
Name: AIRLINECODE_ASINT, dtype: float64

It doesn't make sense for a categorical column to have a mean or any of those other statistical properties. We'll look at why in a bit more detail further down the line.

In [20]:
# Drop the helper column; passing the axis positionally is deprecated, so name the columns explicitly
flights_df = flights_df.drop(columns='AIRLINECODE_ASINT')


Which data type (numeric, datetime, text, or categorical) would you assign to each of the following examples?

- Description of an item
- Yearly income
- Size of clothing
- Arrival time of a plane
- Birthdays of this cohort
- Flavours of milkshakes at McDonalds
- First half of a postcode
- Full postcode
- The time it took for runners to complete a 5K

## Duplicate Values

Another common issue that we might face is **duplicate values**. As the name suggests, this occurs when we have the same values repeated across multiple rows or columns:
<table>
    <tr>
        <td><b>first_name</b></td>
        <td><b>last_name</b></td>
        <td><b>address</b></td>
        <td><b>age</b></td>
        <td><b>income</b></td>
    </tr>
    <tr>
        <td>John</td>
        <td>Doe</td>
        <td>123 Real Street</td>
        <td>25</td>
        <td>£28000</td>
    </tr>
    <tr>
        <td>Jane</td>
        <td>Smith</td>
        <td>789 Fake Road</td>
        <td>29</td>
        <td>£32000</td>
    </tr>
    <tr>
        <td>Jane</td>
        <td>Smith</td>
        <td>789 Fake Road</td>
        <td>29</td>
        <td>£32000</td>
    </tr>
    <tr>
        <td>Mark</td>
        <td>Smith</td>
        <td>789 Fake Road</td>
        <td>31</td>
        <td>£32000</td>
    </tr>
</table>

In the above example, we can see that Jane Smith has two entries directly duplicated. However, in some cases, we might see extremely similar entries:

<table>
    <tr>
        <td><b>first_name</b></td>
        <td><b>last_name</b></td>
        <td><b>address</b></td>
        <td><b>age</b></td>
        <td><b>income</b></td>
    </tr>
    <tr>
        <td>John</td>
        <td>Doe</td>
        <td>123 Real Street</td>
        <td>25</td>
        <td>£28000</td>
    </tr>
    <tr>
        <td>Jane</td>
        <td>Smith</td>
        <td>789 Fake Road</td>
        <td><b>28</b></td>
        <td>£32000</td>
    </tr>
    <tr>
        <td>Jane</td>
        <td>Smith</td>
        <td>789 Fake Road</td>
        <td>29</td>
        <td>£32000</td>
    </tr>
    <tr>
        <td>Mark</td>
        <td>Smith</td>
        <td>789 Fake Road</td>
        <td>31</td>
        <td>£32000</td>
    </tr>
</table>

(Note the age difference between the two Jane Smiths.) This type of duplicate is most likely due to a data entry issue or a resubmission of whatever form Jane had submitted, which was entered into the database without removing her old entry.

More often than not though, duplicate data arises from either bugs/design patterns in data pipelines, or most commonly, due to database joins and data consolidation from various datasets/databases, which may retain the duplicate values.

Pandas provides us with a [`.duplicated()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) method. Let's call it on our dataframe to see what it returns.

In [21]:
flights_df.duplicated()

0          False
1          False
2          False
3          False
4          False
           ...  
1191800    False
1191801    False
1191802    False
1191803    False
1191804    False
Length: 1191805, dtype: bool

Note that we can `.sum()` over boolean values: each False is interpreted as 0 and each True as 1. So by summing the result of `.duplicated()`, we get the total number of duplicated rows!
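A minimal sketch of both steps on the first toy table from this section (the frame name `people` is made up for illustration):

```python
import pandas as pd

# First toy table: Jane Smith appears twice with identical values
people = pd.DataFrame(
    {
        "first_name": ["John", "Jane", "Jane", "Mark"],
        "last_name": ["Doe", "Smith", "Smith", "Smith"],
        "address": ["123 Real Street", "789 Fake Road", "789 Fake Road", "789 Fake Road"],
        "age": [25, 29, 29, 31],
        "income": ["£28000", "£32000", "£32000", "£32000"],
    }
)

# True marks a row that repeats an earlier row exactly
is_duplicate = people.duplicated()
print(is_duplicate.tolist())

# Summing the booleans counts the flagged rows
print(is_duplicate.sum())
```

By default only the second Jane Smith row is flagged, because the first occurrence is treated as the original.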

In [22]:
flights_df.duplicated().sum()

0

No duplicate values! This is good, right? Well... not necessarily. Remind yourself of the second Jane Smith example above. The duplicated method would not have returned True there because the whole row wasn't an exact duplicate. To tackle this issue, the `.duplicated()` method takes two arguments: `subset` and `keep`. For `subset`, we provide a list of the column names we want to check for duplicates over; `keep` takes one of three values: `first`, `last`, or `False`. From the documentation:
- `first` : Mark duplicates as True except for the first occurrence.
- `last` : Mark duplicates as True except for the last occurrence.
- `False` : Mark all duplicates as True.

In many cases, picking the subset is more of an art form than a science. Your intuition is going to be king.
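Here is a sketch of how `subset` and `keep` behave on the second toy table, where the two Jane Smith rows differ only in age. The frame name `people` and the choice of `key_columns` are assumptions made for this illustration.

```python
import pandas as pd

# Second toy table: the two Jane Smith rows differ only in age
people = pd.DataFrame(
    {
        "first_name": ["John", "Jane", "Jane", "Mark"],
        "last_name": ["Doe", "Smith", "Smith", "Smith"],
        "address": ["123 Real Street", "789 Fake Road", "789 Fake Road", "789 Fake Road"],
        "age": [25, 28, 29, 31],
    }
)

# Comparing whole rows finds nothing
exact_count = people.duplicated().sum()
print(exact_count)

# Comparing only the identifying columns finds the near-duplicate
key_columns = ["first_name", "last_name", "address"]
flags = {
    keep: people.duplicated(subset=key_columns, keep=keep).tolist()
    for keep in ("first", "last", False)
}
for keep, flagged in flags.items():
    print(keep, flagged)
```

With `keep=False` both Jane Smith rows are flagged, which is the most convenient setting when you want to inspect every member of a suspected duplicate group.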

In [23]:
## Find the duplicates on the flights dataframe over the following columns with keep = False:
 # "ORIGAIRPORTNAME", "DESTAIRPORTNAME", "AIRLINECODE", "FLIGHTDATE", "CRSDEPTIME", "DEPTIME", "ARRTIME"
 # Assign this to the variable 'duplicates'
 # Why did I choose these column names? Would you have chosen others?
    
# Using df[duplicates], we are returned the data points where duplicates exist.
## Return the duplicates for the flights dataframe


Unnamed: 0,TRANSACTIONID,FLIGHTDATE,AIRLINECODE,AIRLINENAME,TAILNUM,FLIGHTNUM,ORIGINAIRPORTCODE,ORIGAIRPORTNAME,ORIGINCITYNAME,ORIGINSTATE,ORIGINSTATENAME,DESTAIRPORTCODE,DESTAIRPORTNAME,DESTCITYNAME,DESTSTATE,DESTSTATENAME,CRSDEPTIME,DEPTIME,DEPDELAY,TAXIOUT,WHEELSOFF,WHEELSON,TAXIIN,CRSARRTIME,ARRTIME,ARRDELAY,CRSELAPSEDTIME,ACTUALELAPSEDTIME,CANCELLED,DIVERTED,DISTANCE
34300,1974100,19920528,AS,Alaska Airlines Inc.: AS,,88,ANC,AnchorageAK: Ted Stevens Anchorage International,Anchorage,AK,Alaska,SEA,SeattleWA: Seattle/Tacoma International,Seattle,WA,Washington,950,950.0,0.0,,,,,1400,1405.0,5.0,190.0,195.0,False,F,1448
103931,32273300,19980108,AA,American Airlines Inc.: AA,UNKNOW,496,LAX,Los AngelesCA: Los Angeles International,Los Angeles,CA,California,ORD,ChicagoIL: Chicago O'Hare International,Chicago,IL,Illinois,0,,,,,,,0,,,,,True,False,1745
125014,22096600,19960118,AA,American Airlines Inc.: AA,UNKNOW,1388,MCI,Kansas CityMO: Kansas City International,Kansas City,MO,Missouri,DFW,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,Dallas/Fort Worth,TX,Texas,0,,,,,,,0,,,,,True,False,460
125015,22097200,19960118,AA,American Airlines Inc.: AA,UNKNOW,1988,MCI,Kansas CityMO: Kansas City International,Kansas City,MO,Missouri,DFW,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,Dallas/Fort Worth,TX,Texas,0,,,,,,,0,,,,,True,False,460
162444,27454200,19970113,AA,American Airlines Inc.: AA,UNKNOW,2087,IAH,HoustonTX: George Bush Intercontinental/Houston,Houston,TX,Texas,DFW,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,Dallas/Fort Worth,TX,Texas,0,,,,,,,0,,,,,True,False,224
162445,27453500,19970113,AA,American Airlines Inc.: AA,UNKNOW,2056,IAH,HoustonTX: George Bush Intercontinental/Houston,Houston,TX,Texas,DFW,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,Dallas/Fort Worth,TX,Texas,0,,,,,,,0,,,,,True,False,224
192865,21755500,19960108,AA,American Airlines Inc.: AA,UNKNOW,613,LGA,New YorkNY: LaGuardia,New York,NY,New York,MIA,MiamiFL: Miami International,Miami,FL,Florida,0,,,,,,,0,,,,,True,False,1097
377880,21910100,19960107,AA,American Airlines Inc.: AA,UNKNOW,883,DCA,WashingtonDC: Ronald Reagan Washington National,Washington,VA,Virginia,ORD,ChicagoIL: Chicago O'Hare International,Chicago,IL,Illinois,0,,,,,,,0,,,,,True,False,612
377882,21910200,19960107,AA,American Airlines Inc.: AA,UNKNOW,1009,DCA,WashingtonDC: Ronald Reagan Washington National,Washington,VA,Virginia,ORD,ChicagoIL: Chicago O'Hare International,Chicago,IL,Illinois,0,,,,,,,0,,,,,True,False,612
453974,1977500,19920528,AS,Alaska Airlines Inc.: AS,,118,ANC,AnchorageAK: Ted Stevens Anchorage International,Anchorage,AK,Alaska,SEA,SeattleWA: Seattle/Tacoma International,Seattle,WA,Washington,950,950.0,0.0,,,,,1410,1405.0,-5.0,200.0,195.0,False,F,1448


As a secondary observation, we see that `TAILNUM` also takes on a value of `UNKNOW` for missing values. We'll make a note of this so we can deal with it later.

We can use the [`.sort_values()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) method to sort our dataframe. Read the documentation and use this method to sort the dataframe on a column name that you think is appropriate (one that allows you to easily verify whether the entries returned are valid duplicates).
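As a small illustration of why sorting helps, the sketch below sorts a made-up frame of flagged rows (the name `flagged` and its values are hypothetical) so that suspected duplicates end up next to each other.

```python
import pandas as pd

# Hypothetical rows already flagged as possible duplicates
flagged = pd.DataFrame(
    {
        "first_name": ["Jane", "Ann", "Jane", "Ann"],
        "last_name": ["Smith", "Lee", "Smith", "Lee"],
        "age": [29, 41, 28, 40],
    }
)

# Sorting on the identifying columns places each suspected pair on adjacent rows
ordered = flagged.sort_values(["last_name", "first_name", "age"])
print(ordered)
```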

In [24]:
## Sort the duplicated values by an appropriate column


Unnamed: 0,TRANSACTIONID,FLIGHTDATE,AIRLINECODE,AIRLINENAME,TAILNUM,FLIGHTNUM,ORIGINAIRPORTCODE,ORIGAIRPORTNAME,ORIGINCITYNAME,ORIGINSTATE,ORIGINSTATENAME,DESTAIRPORTCODE,DESTAIRPORTNAME,DESTCITYNAME,DESTSTATE,DESTSTATENAME,CRSDEPTIME,DEPTIME,DEPDELAY,TAXIOUT,WHEELSOFF,WHEELSON,TAXIIN,CRSARRTIME,ARRTIME,ARRDELAY,CRSELAPSEDTIME,ACTUALELAPSEDTIME,CANCELLED,DIVERTED,DISTANCE
34300,1974100,19920528,AS,Alaska Airlines Inc.: AS,,88,ANC,AnchorageAK: Ted Stevens Anchorage International,Anchorage,AK,Alaska,SEA,SeattleWA: Seattle/Tacoma International,Seattle,WA,Washington,950,950.0,0.0,,,,,1400,1405.0,5.0,190.0,195.0,False,F,1448
453974,1977500,19920528,AS,Alaska Airlines Inc.: AS,,118,ANC,AnchorageAK: Ted Stevens Anchorage International,Anchorage,AK,Alaska,SEA,SeattleWA: Seattle/Tacoma International,Seattle,WA,Washington,950,950.0,0.0,,,,,1410,1405.0,-5.0,200.0,195.0,False,F,1448
769567,22394500,19960107,AA,American Airlines Inc.: AA,UNKNOW,2,LAX,Los AngelesCA: Los Angeles International,Los Angeles,CA,California,JFK,New YorkNY: John F. Kennedy International,New York,NY,New York,0,,,,,,,0,,,,,True,False,2475
769566,22395100,19960107,AA,American Airlines Inc.: AA,UNKNOW,32,LAX,Los AngelesCA: Los Angeles International,Los Angeles,CA,California,JFK,New YorkNY: John F. Kennedy International,New York,NY,New York,0,,,,,,,0,,,,,True,False,2475
377880,21910100,19960107,AA,American Airlines Inc.: AA,UNKNOW,883,DCA,WashingtonDC: Ronald Reagan Washington National,Washington,VA,Virginia,ORD,ChicagoIL: Chicago O'Hare International,Chicago,IL,Illinois,0,,,,,,,0,,,,,True,False,612
377882,21910200,19960107,AA,American Airlines Inc.: AA,UNKNOW,1009,DCA,WashingtonDC: Ronald Reagan Washington National,Washington,VA,Virginia,ORD,ChicagoIL: Chicago O'Hare International,Chicago,IL,Illinois,0,,,,,,,0,,,,,True,False,612
1175821,20838000,19960108,AA,American Airlines Inc.: AA,UNKNOW,1920,ORD,ChicagoIL: Chicago O'Hare International,Chicago,IL,Illinois,BDL,HartfordCT: Bradley International,Hartford,CT,Connecticut,0,,,,,,,0,,,,,True,False,783
810576,21756100,19960108,AA,American Airlines Inc.: AA,UNKNOW,1485,LGA,New YorkNY: LaGuardia,New York,NY,New York,MIA,MiamiFL: Miami International,Miami,FL,Florida,0,,,,,,,0,,,,,True,False,1097
603813,21097100,19960108,AA,American Airlines Inc.: AA,UNKNOW,1684,DFW,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,Dallas/Fort Worth,TX,Texas,DCA,WashingtonDC: Ronald Reagan Washington National,Washington,VA,Virginia,0,,,,,,,0,,,,,True,False,1192
192865,21755500,19960108,AA,American Airlines Inc.: AA,UNKNOW,613,LGA,New YorkNY: LaGuardia,New York,NY,New York,MIA,MiamiFL: Miami International,Miami,FL,Florida,0,,,,,,,0,,,,,True,False,1097


In [25]:
flights_df["CRSARRTIME"].isna().sum()

0

## Dealing with duplicate values

So, how do we deal with duplicate values? Well, we only need one entry, not both, so we have one of two options:
1. Average over duplicate values where possible
2. Drop one of the duplicated rows (or many in the case of one entry having multiple duplicates)


### Averaging

Averaging over duplicate values only makes sense for data types that can be averaged, such as numeric values. In the above table, the first two entries have valid times that we can average over. In general, we average by grouping on the relevant columns (via `.groupby()`) and chaining this with the `.agg()` method. In this case, we group on our subset of identifying columns, leaving out the columns of interest (i.e. the times). The argument to `.agg()` is a dictionary whose keys are column names and whose values are the aggregation functions we want to apply to them (e.g. `"sum"`, `"mean"`, `"min"`).
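
To make the `.groupby()` and `.agg()` pattern concrete, here is a minimal sketch on a toy dataframe. The column names and values are made up for illustration and are not taken from the flights data.

```python
import pandas as pd

# Toy data: the first two rows describe the same flight with slightly different values.
toy = pd.DataFrame({
    "FLIGHTDATE": [20200101, 20200101, 20200102],
    "AIRLINECODE": ["AA", "AA", "DL"],
    "ARRDELAY": [10.0, 20.0, 5.0],
    "ELAPSED": [100.0, 110.0, 90.0],
})

# Group on the identifying columns, then average the numeric columns of interest.
summaries = {"ARRDELAY": "mean", "ELAPSED": "mean"}
averaged = toy.groupby(["FLIGHTDATE", "AIRLINECODE"]).agg(summaries).reset_index()
print(averaged)
```

The two `AA` rows collapse into a single row whose `ARRDELAY` and `ELAPSED` are the means of the originals (15.0 and 105.0), while the `DL` row is left as it was.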


In [26]:
summaries = {"CRSARRTIME": "mean", "ARRTIME": "mean", "ARRDELAY": "mean", "CRSELAPSEDTIME": "mean", "ACTUALELAPSEDTIME": "mean"}

grouped_duplicates = flights_df[duplicates].groupby(["FLIGHTDATE", "AIRLINECODE", "ORIGAIRPORTNAME", "DESTAIRPORTNAME"])
grouped_duplicates_min_transactionid = grouped_duplicates["TRANSACTIONID"].min().reset_index()

f_df_duplicates = pd.merge(
    grouped_duplicates_min_transactionid,
    grouped_duplicates.agg(summaries).reset_index(),
    how="inner"
).sort_values("TRANSACTIONID")

f_df_duplicates


Unnamed: 0,FLIGHTDATE,AIRLINECODE,ORIGAIRPORTNAME,DESTAIRPORTNAME,TRANSACTIONID,CRSARRTIME,ARRTIME,ARRDELAY,CRSELAPSEDTIME,ACTUALELAPSEDTIME
151,19920528,AS,AnchorageAK: Ted Stevens Anchorage International,SeattleWA: Seattle/Tacoma International,1974100.0,1405.0,1405.0,0.0,195.0,195.0
3827,19960108,AA,ChicagoIL: Chicago O'Hare International,HartfordCT: Bradley International,20837400.0,0.0,,,,
3842,19960108,AA,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,WashingtonDC: Ronald Reagan Washington National,21095900.0,0.0,,,,
3873,19960108,AA,New YorkNY: LaGuardia,MiamiFL: Miami International,21755500.0,0.0,,,,
5701,19960111,AA,ChicagoIL: Chicago O'Hare International,MinneapolisMN: Minneapolis-St Paul International,21808100.0,0.0,,,,
...,...,...,...,...,...,...,...,...,...,...
16843,19980108,YV,WashingtonDC: Ronald Reagan Washington National,MinneapolisMN: Minneapolis-St Paul International,,,,,,
16844,19980108,YV,WashingtonDC: Ronald Reagan Washington National,New YorkNY: John F. Kennedy International,,,,,,
16845,19980108,YV,WashingtonDC: Ronald Reagan Washington National,New YorkNY: LaGuardia,,,,,,
16846,19980108,YV,WashingtonDC: Ronald Reagan Washington National,SeattleWA: Seattle/Tacoma International,,,,,,


In [27]:
# Why are there so many new NaN's in the TRANSACTIONID field now?
## How should we get rid of them?


## Re-encode TRANSACTIONID to int64


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)


Unnamed: 0,FLIGHTDATE,AIRLINECODE,ORIGAIRPORTNAME,DESTAIRPORTNAME,TRANSACTIONID,CRSARRTIME,ARRTIME,ARRDELAY,CRSELAPSEDTIME,ACTUALELAPSEDTIME
151,19920528,AS,AnchorageAK: Ted Stevens Anchorage International,SeattleWA: Seattle/Tacoma International,1974100,1405.0,1405.0,0.0,195.0,195.0
3827,19960108,AA,ChicagoIL: Chicago O'Hare International,HartfordCT: Bradley International,20837400,0.0,,,,
3842,19960108,AA,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,WashingtonDC: Ronald Reagan Washington National,21095900,0.0,,,,
3873,19960108,AA,New YorkNY: LaGuardia,MiamiFL: Miami International,21755500,0.0,,,,
5701,19960111,AA,ChicagoIL: Chicago O'Hare International,MinneapolisMN: Minneapolis-St Paul International,21808100,0.0,,,,
2007,19960107,AA,WashingtonDC: Ronald Reagan Washington National,ChicagoIL: Chicago O'Hare International,21910100,0.0,,,,
7597,19960118,AA,Kansas CityMO: Kansas City International,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,22096600,0.0,,,,
1994,19960107,AA,Los AngelesCA: Los Angeles International,New YorkNY: John F. Kennedy International,22394500,0.0,,,,
11317,19970126,AA,ChicagoIL: Chicago O'Hare International,MinneapolisMN: Minneapolis-St Paul International,27174500,0.0,,,,
9460,19970113,AA,HoustonTX: George Bush Intercontinental/Houston,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,27453500,0.0,,,,
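
As a hint for the question posed above, here is a small sketch on made-up IDs (only the column name is borrowed from our data). It shows that a column containing `NaN` is stored as `float64`, and one possible way of dropping the missing rows and casting back to `int64`. Taking an explicit `.copy()` is my own assumption about how you might avoid the `SettingWithCopyWarning` shown above; other approaches work too.

```python
import numpy as np
import pandas as pd

# Toy IDs with one missing value
ids = pd.DataFrame({"TRANSACTIONID": [1974100, np.nan, 20837400]})
print(ids["TRANSACTIONID"].dtype)  # float64, because NaN is a floating point value

# Drop rows with a missing ID, then work on an explicit copy before re-encoding
cleaned = ids.dropna(subset=["TRANSACTIONID"]).copy()
cleaned["TRANSACTIONID"] = cleaned["TRANSACTIONID"].astype("int64")
print(cleaned)
```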


In [29]:
# The .update() method allows us to update records in one dataframe with values from another
# Some way of 'linking' which records to overwrite/update is needed if we do not want to use the default dataframe index
## So, using the .set_index() method, set the index of both flights_df and f_df_duplicates to a unique identifier key they both share


# Now we can update the flights_df dataframe with the new dataframe


## And finally, we may optionally reset the index to obtain the default dataframe indexing

flights_df

Unnamed: 0,TRANSACTIONID,FLIGHTDATE,AIRLINECODE,AIRLINENAME,TAILNUM,FLIGHTNUM,ORIGINAIRPORTCODE,ORIGAIRPORTNAME,ORIGINCITYNAME,ORIGINSTATE,ORIGINSTATENAME,DESTAIRPORTCODE,DESTAIRPORTNAME,DESTCITYNAME,DESTSTATE,DESTSTATENAME,CRSDEPTIME,DEPTIME,DEPDELAY,TAXIOUT,WHEELSOFF,WHEELSON,TAXIIN,CRSARRTIME,ARRTIME,ARRDELAY,CRSELAPSEDTIME,ACTUALELAPSEDTIME,CANCELLED,DIVERTED,DISTANCE
0,54548800,20020101.0,WN,Southwest Airlines Co.: WN,N103@@,1425,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,DAL,DallasTX: Dallas Love Field,Dallas,TX,Texas,1425,1425.0,0.0,8.0,1433.0,1648.0,4.0,1655.0,1652.0,-3.0,90.0,87.0,F,False,580
1,55872300,20020101.0,CO,Continental Air Lines Inc.: CO,N83872,150,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,IAH,HoustonTX: George Bush Intercontinental/Houston,Houston,TX,Texas,1130,1136.0,6.0,12.0,1148.0,1419.0,16.0,1426.0,1435.0,9.0,116.0,119.0,False,F,744
2,54388800,20020101.0,WN,Southwest Airlines Co.: WN,N334@@,249,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,MCI,Kansas CityMO: Kansas City International,Kansas City,MO,Missouri,1215,1338.0,83.0,7.0,1345.0,1618.0,2.0,1500.0,1620.0,80.0,105.0,102.0,F,False,718
3,54486500,20020101.0,WN,Southwest Airlines Co.: WN,N699@@,902,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,LAS,Las VegasNV: McCarran International,Las Vegas,NV,Nevada,1925,1925.0,0.0,5.0,1930.0,1947.0,1.0,1950.0,1948.0,-2.0,85.0,83.0,0,0,487
4,55878700,20020103.0,CO,Continental Air Lines Inc.: CO,N58606,234,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,IAH,HoustonTX: George Bush Intercontinental/Houston,Houston,TX,Texas,1455,1453.0,-2.0,11.0,1504.0,1742.0,5.0,1750.0,1747.0,-3.0,115.0,114.0,F,False,744
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1191800,126750200,20130106.0,EV,ExpressJet Airlines Inc.: EV,N683BR,5272,ATL,AtlantaGA: Hartsfield-Jackson Atlanta Internat...,Atlanta,GA,Georgia,DAL,DallasTX: Dallas Love Field,Dallas,TX,Texas,1357,1348.0,-9.0,22.0,1410.0,1500.0,3.0,1523.0,1503.0,-20.0,146.0,135.0,0,0,721
1191801,127294500,20130106.0,DL,Delta Air Lines Inc.: DL,N949DL,1711,ATL,AtlantaGA: Hartsfield-Jackson Atlanta Internat...,Atlanta,GA,Georgia,DFW,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,Dallas/Fort Worth,TX,Texas,2150,2147.0,-3.0,23.0,2210.0,2307.0,10.0,2321.0,2317.0,-4.0,151.0,150.0,False,F,731
1191802,127294900,20130106.0,DL,Delta Air Lines Inc.: DL,N907DE,1810,ATL,AtlantaGA: Hartsfield-Jackson Atlanta Internat...,Atlanta,GA,Georgia,DFW,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,Dallas/Fort Worth,TX,Texas,1617,1617.0,0.0,18.0,1635.0,1728.0,9.0,1750.0,1737.0,-13.0,153.0,140.0,F,False,731
1191803,126594900,20130106.0,EV,ExpressJet Airlines Inc.: EV,N855AS,5208,ATL,AtlantaGA: Hartsfield-Jackson Atlanta Internat...,Atlanta,GA,Georgia,FWA,Fort WayneIN: Fort Wayne International,Fort Wayne,IN,Indiana,1516,1514.0,-2.0,21.0,1535.0,1651.0,4.0,1658.0,1655.0,-3.0,102.0,101.0,False,F,508


In [30]:
flights_df[flights_df["TRANSACTIONID"]==1974100]

Unnamed: 0,TRANSACTIONID,FLIGHTDATE,AIRLINECODE,AIRLINENAME,TAILNUM,FLIGHTNUM,ORIGINAIRPORTCODE,ORIGAIRPORTNAME,ORIGINCITYNAME,ORIGINSTATE,ORIGINSTATENAME,DESTAIRPORTCODE,DESTAIRPORTNAME,DESTCITYNAME,DESTSTATE,DESTSTATENAME,CRSDEPTIME,DEPTIME,DEPDELAY,TAXIOUT,WHEELSOFF,WHEELSON,TAXIIN,CRSARRTIME,ARRTIME,ARRDELAY,CRSELAPSEDTIME,ACTUALELAPSEDTIME,CANCELLED,DIVERTED,DISTANCE
34300,1974100,19920528.0,AS,Alaska Airlines Inc.: AS,,88,ANC,AnchorageAK: Ted Stevens Anchorage International,Anchorage,AK,Alaska,SEA,SeattleWA: Seattle/Tacoma International,Seattle,WA,Washington,950,950.0,0.0,,,,,1405.0,1405.0,0.0,195.0,195.0,False,F,1448
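
The `.update()` workflow from the cells above can be sketched on a toy example. The two small dataframes here are hypothetical; only the idea of using `TRANSACTIONID` as the shared key comes from our data.

```python
import pandas as pd

# Toy "main" table and a smaller table of corrected values
main = pd.DataFrame({
    "TRANSACTIONID": [1, 2, 3],
    "ARRDELAY": [10.0, 20.0, 30.0],
})
fixes = pd.DataFrame({
    "TRANSACTIONID": [2],
    "ARRDELAY": [25.0],
})

# Use the shared key as the index so .update() knows which rows line up
main = main.set_index("TRANSACTIONID")
fixes = fixes.set_index("TRANSACTIONID")

# .update() modifies main in place, aligning on index and column labels
main.update(fixes)

# Optionally go back to the default integer index
main = main.reset_index()
print(main)
```

Only the row with `TRANSACTIONID` 2 changes; rows that have no match in `fixes` keep their original values.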


### Dropping duplicates

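For dropping duplicates, Pandas provides us with a `.drop_duplicates()` method. Its three main arguments are:
1. `subset` - the column label(s) to consider when identifying duplicates (all columns by default)
2. `keep` - which of the duplicated rows to keep: `'first'`, `'last'`, or `False` to drop them all
3. `inplace` - a boolean value of whether we want to perform the operation inplace or not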
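
Here is a minimal sketch of how the `keep` argument changes the result. The toy dataframe and its column names are made up for illustration.

```python
import pandas as pd

# The first two rows are duplicates with respect to ORIGIN and DEST
toy = pd.DataFrame({
    "ORIGIN": ["LAX", "LAX", "ORD"],
    "DEST": ["JFK", "JFK", "MIA"],
    "FLIGHTNUM": [2, 32, 613],
})

cols = ["ORIGIN", "DEST"]
first = toy.drop_duplicates(subset=cols, keep="first")   # keeps FLIGHTNUM 2 and 613
last = toy.drop_duplicates(subset=cols, keep="last")     # keeps FLIGHTNUM 32 and 613
neither = toy.drop_duplicates(subset=cols, keep=False)   # keeps only FLIGHTNUM 613

print(first, last, neither, sep="\n\n")
```

Without `inplace=True` the method returns a new dataframe and leaves `toy` untouched, which is usually the easier behaviour to reason about.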

In [31]:
subset = ["ORIGAIRPORTNAME", "DESTAIRPORTNAME", "AIRLINECODE", "FLIGHTDATE", "CRSDEPTIME", "DEPTIME", "ARRTIME"]
## Using inplace = True, drop the duplicates. Think about what value we should provide to the keep argument


flights_df.drop_duplicates(subset=subset, keep="first", inplace=True)


In [32]:
flights_df[duplicates]


Unnamed: 0,TRANSACTIONID,FLIGHTDATE,AIRLINECODE,AIRLINENAME,TAILNUM,FLIGHTNUM,ORIGINAIRPORTCODE,ORIGAIRPORTNAME,ORIGINCITYNAME,ORIGINSTATE,ORIGINSTATENAME,DESTAIRPORTCODE,DESTAIRPORTNAME,DESTCITYNAME,DESTSTATE,DESTSTATENAME,CRSDEPTIME,DEPTIME,DEPDELAY,TAXIOUT,WHEELSOFF,WHEELSON,TAXIIN,CRSARRTIME,ARRTIME,ARRDELAY,CRSELAPSEDTIME,ACTUALELAPSEDTIME,CANCELLED,DIVERTED,DISTANCE
34300,1974100,19920528.0,AS,Alaska Airlines Inc.: AS,,88,ANC,AnchorageAK: Ted Stevens Anchorage International,Anchorage,AK,Alaska,SEA,SeattleWA: Seattle/Tacoma International,Seattle,WA,Washington,950,950.0,0.0,,,,,1405.0,1405.0,0.0,195.0,195.0,False,F,1448
103931,32273300,19980108.0,AA,American Airlines Inc.: AA,UNKNOW,496,LAX,Los AngelesCA: Los Angeles International,Los Angeles,CA,California,ORD,ChicagoIL: Chicago O'Hare International,Chicago,IL,Illinois,0,,,,,,,0.0,,,,,True,False,1745
125014,22096600,19960118.0,AA,American Airlines Inc.: AA,UNKNOW,1388,MCI,Kansas CityMO: Kansas City International,Kansas City,MO,Missouri,DFW,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,Dallas/Fort Worth,TX,Texas,0,,,,,,,0.0,,,,,True,False,460
162444,27454200,19970113.0,AA,American Airlines Inc.: AA,UNKNOW,2087,IAH,HoustonTX: George Bush Intercontinental/Houston,Houston,TX,Texas,DFW,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,Dallas/Fort Worth,TX,Texas,0,,,,,,,0.0,,,,,True,False,224
192865,21755500,19960108.0,AA,American Airlines Inc.: AA,UNKNOW,613,LGA,New YorkNY: LaGuardia,New York,NY,New York,MIA,MiamiFL: Miami International,Miami,FL,Florida,0,,,,,,,0.0,,,,,True,False,1097
377880,21910100,19960107.0,AA,American Airlines Inc.: AA,UNKNOW,883,DCA,WashingtonDC: Ronald Reagan Washington National,Washington,VA,Virginia,ORD,ChicagoIL: Chicago O'Hare International,Chicago,IL,Illinois,0,,,,,,,0.0,,,,,True,False,612
583756,27175600,19970126.0,AA,American Airlines Inc.: AA,UNKNOW,2085,ORD,ChicagoIL: Chicago O'Hare International,Chicago,IL,Illinois,MSP,MinneapolisMN: Minneapolis-St Paul International,Minneapolis,MN,Minnesota,0,,,,,,,0.0,,,,,True,False,334
585484,21808100,19960111.0,AA,American Airlines Inc.: AA,UNKNOW,1311,ORD,ChicagoIL: Chicago O'Hare International,Chicago,IL,Illinois,MSP,MinneapolisMN: Minneapolis-St Paul International,Minneapolis,MN,Minnesota,0,,,,,,,0.0,,,,,True,False,334
585779,20837400,19960108.0,AA,American Airlines Inc.: AA,UNKNOW,218,ORD,ChicagoIL: Chicago O'Hare International,Chicago,IL,Illinois,BDL,HartfordCT: Bradley International,Hartford,CT,Connecticut,0,,,,,,,0.0,,,,,True,False,783
603812,21095900,19960108.0,AA,American Airlines Inc.: AA,UNKNOW,236,DFW,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,Dallas/Fort Worth,TX,Texas,DCA,WashingtonDC: Ronald Reagan Washington National,Washington,VA,Virginia,0,,,,,,,0.0,,,,,True,False,1192


## Categorical data

We touched on categorical data earlier on in this notebook (with categories), but here we define the concept more rigorously. Categorical variables take their values from a predefined set of categories. We saw an example above with the airline codes.

Would you say the following variables are categorical or not?
- TAILNUM
- FLIGHTNUM
- ORIGINAIRPORTCODE
- ORIGAIRPORTNAME
- CANCELLED

What about the columns in the following table?

<table>
    <tr>
        <td><b>First Name</b></td>
        <td><b>Last Name</b></td>
        <td><b>Age</b></td>
        <td><b>Address</b></td>
        <td><b>District Postcode</b></td>
        <td><b>Full Postcode</b></td>
        <td><b>Married</b></td>
    </tr>
    <tr>
        <td>John</td>
        <td>Doe</td>
        <td>31</td>
        <td>123 Fake Street, Realtown</td>
        <td>RT1</td>
        <td>RT1 3NV</td>
        <td>True</td>
    </tr>
    <tr>
        <td>Diane</td>
        <td>Smith</td>
        <td>31</td>
        <td>42 World Road, Realtown</td>
        <td>RT2</td>
        <td>RT2 7XU</td>
        <td>False</td>
    </tr>
    <tr>
        <td>Kate</td>
        <td>Doe</td>
        <td>29</td>
        <td>123 Fake Street, Realtown</td>
        <td>RT1</td>
        <td>RT1 3NV</td>
        <td>False</td>
    </tr>
    <tr>
        <td>Charlie</td>
        <td>Doe</td>
        <td>33</td>
        <td>789 Real Road, Fakecity</td>
        <td>FC2</td>
        <td>FC2 9ER</td>
        <td>True</td>        
    </tr>    
</table>

Categorical data can only take on one of a finite set of values, and in principle it cannot go beyond these predefined categories. However, noise can creep into our data during the data collection process (e.g. if our categorical data was collected via a free-entry text box).

There are a few ways to deal with inconsistent categories:
1. Dropping data
2. Remapping the categories
3. Inferring the categories

### Dropping Data

The first approach we'll look at is dropping data, using membership constraints on the `ORIGINSTATENAME` variable. Dropping data is appropriate when an entry has a value which isn't (conceptually) in the predefined set of categories. We'll start by returning all the unique values in the variable.

In [33]:
## Construct a set of the unique values in ORIGINSTATENAME
states = ### YOUR CODE HERE

{'Alabama',
 'Alaska',
 'Arizona',
 'Arkansas',
 'California',
 'Colorado',
 'Connecticut',
 'Delaware',
 'Florida',
 'Georgia',
 'Hawaii',
 'Idaho',
 'Illinois',
 'Indiana',
 'Iowa',
 'Kentucky',
 'Louisiana',
 'Maine',
 'Maryland',
 'Massachusetts',
 'Michigan',
 'Minnesota',
 'Mississippi',
 'Missouri',
 'Montana',
 'Nebraska',
 'Nevada',
 'New Hampshire',
 'New Jersey',
 'New Mexico',
 'New York',
 'North Carolina',
 'North Dakota',
 'Ohio',
 'Oregon',
 'Pennsylvania',
 'Puerto Rico',
 'Rhode Island',
 'South Carolina',
 'South Dakota',
 'Tennessee',
 'Texas',
 'U.S. Pacific Trust Territories and Possessions',
 'U.S. Virgin Islands',
 'Utah',
 'Vermont',
 'Virginia',
 'Washington',
 'West Virginia',
 'Wisconsin',
 'Wyoming',
 nan}

Now let's say we received some new entries which had statenames not present in that predefined set of categories (for example, "Fakestate"). 

In [34]:
## using the .at[] or .iat[] accessors on flights_df, modify one of the rows 
## in our table to have its ORIGINSTATENAME as Fakestate
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html


In [35]:
# Displays the unique entries in ORIGINSTATENAME in our modified dataframe
flights_df["ORIGINSTATENAME"].unique()
# OR set(flights_df["ORIGINSTATENAME"])

array(['New Mexico', 'Georgia', 'New York', 'Texas', 'Ohio', 'Alabama',
       'Montana', 'Oregon', 'Alaska', 'Louisiana', 'Washington',
       'Minnesota', 'Maine', 'Maryland', 'Illinois', 'California',
       'New Jersey', 'Wisconsin', 'North Carolina', 'Michigan',
       'Pennsylvania', 'Colorado', 'West Virginia', 'Wyoming', 'Utah',
       'South Carolina', 'Iowa', 'Vermont', 'Puerto Rico', 'South Dakota',
       'Tennessee', 'Missouri', 'Massachusetts', 'North Dakota', 'Idaho',
       'Florida', 'Nevada', 'Hawaii',
       'U.S. Pacific Trust Territories and Possessions', 'Kentucky', nan,
       'Arkansas', 'Nebraska', 'Mississippi', 'Indiana', 'New Hampshire',
       'Virginia', 'Connecticut', 'Arizona', 'Delaware', 'Rhode Island',
       'U.S. Virgin Islands', 'Fakestate'], dtype=object)

In [36]:
## Using set operations, find the difference between the origin states in our dataframe and our predefined list  
inconsistent_categories = ### YOUR CODE HERE
inconsistent_categories

{'Fakestate'}

In [37]:
# .isin() returns a boolean Series which is True wherever the value is in the collection it is passed; we use it as a mask to select those rows
inconsistent_rows = flights_df["ORIGINSTATENAME"].isin(inconsistent_categories)
flights_df[inconsistent_rows]

Unnamed: 0,TRANSACTIONID,FLIGHTDATE,AIRLINECODE,AIRLINENAME,TAILNUM,FLIGHTNUM,ORIGINAIRPORTCODE,ORIGAIRPORTNAME,ORIGINCITYNAME,ORIGINSTATE,ORIGINSTATENAME,DESTAIRPORTCODE,DESTAIRPORTNAME,DESTCITYNAME,DESTSTATE,DESTSTATENAME,CRSDEPTIME,DEPTIME,DEPDELAY,TAXIOUT,WHEELSOFF,WHEELSON,TAXIIN,CRSARRTIME,ARRTIME,ARRDELAY,CRSELAPSEDTIME,ACTUALELAPSEDTIME,CANCELLED,DIVERTED,DISTANCE
1191793,127265600,20130106.0,DL,Delta Air Lines Inc.: DL,N385DN,1033,ATL,AtlantaGA: Hartsfield-Jackson Atlanta Internat...,Atlanta,GA,Fakestate,BHM,BirminghamAL: Birmingham-Shuttlesworth Interna...,Birmingham,AL,Alabama,1840,1900.0,20.0,32.0,1932.0,1859.0,4.0,1838.0,1903.0,25.0,58.0,63.0,False,F,134


In [38]:
# Nifty trick we can use to drop rows
# What do you think ~ means?
flights_df = flights_df[~inconsistent_rows]
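
The whole membership-constraint workflow can be sketched end to end on toy data. The `valid_states` set and the tiny dataframe below are made up for illustration; the pattern (set difference, `.isin()`, then `~` to invert the mask) is the same one used in the cells above.

```python
import pandas as pd

# The predefined set of allowed categories (toy version)
valid_states = {"Georgia", "Texas", "Nevada"}

toy = pd.DataFrame({"STATE": ["Georgia", "Texas", "Nevada", "Texas"]})

# .at sets a single cell, addressed by row label and column label
toy.at[3, "STATE"] = "Fakestate"

# Values present in the data but absent from the allowed set
inconsistent = set(toy["STATE"]) - valid_states

# Boolean mask of offending rows; ~ flips True and False so we keep the rest
mask = toy["STATE"].isin(inconsistent)
kept = toy[~mask]
print(inconsistent)
print(kept)
```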

### Remapping Categories

What we saw above was data that was not present in the predefined set of categories. However, we may also come across other kinds of categorical data issues which are better solved by remapping categories than by dropping data. Appropriate places to perform this remapping would be when:
1. **Inconsistency in values**: `married`, `not married`, `unmarried`, ` Maried`
 1. Be careful of leading and trailing white space too!
2. **Converting data to categories or too many categories**: Let's say we had a household income column in our dataframe. 
 1. We could change this type of data to be categorical by grouping the income (e.g. `0 - 20k`, `20k - 40k`, `40k - 60k`, `60k +` etc).
 2. We could also reduce this further to `low_class`, `middle_class`, `upper_class`
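
As a rough sketch of the first case, here is one way to tidy up inconsistent values on a toy `Series`. The values and the `mapping` dictionary are made up for illustration, and deciding that `not married` should become `unmarried` is my own assumption about the categories we want to end up with.

```python
import pandas as pd

status = pd.Series(["married", "not married", "unmarried", " Maried", "Married "])

# Step 1: remove surrounding white space and normalise the case
normalised = status.str.strip().str.lower()

# Step 2: map the remaining variants and typos onto the categories we want
mapping = {"not married": "unmarried", "maried": "married"}
cleaned = normalised.replace(mapping)

print(cleaned.value_counts())
```

Normalising white space and case first keeps the mapping dictionary short, since it only has to cover genuine variants and typos.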
 
Let's tackle these in order. In our flights dataframe, the columns `CANCELLED` and `DIVERTED` both take on inconsistent values. A safe first step is to run `.value_counts()` on each of these columns to see which values occur and how often.

In [39]:
flights_df["CANCELLED"].value_counts()

False    637287
0        347545
F        178357
True      16359
1          8160
T          4084
Name: CANCELLED, dtype: int64

Great! So we see that our falsy values take one of three forms, and likewise for the truthy values. We can arbitrarily choose whichever of these we want to use moving forward. For explicitness, let's choose False and True respectively.

In [40]:
## Use the replace method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.replace.html
## to replace 0, F, 1 and T with their relevant values in the CANCELLED column

### YOUR CODE HERE
flights_df["CANCELLED"].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return self._update_inplace(result)


False    1163189
True       28603
Name: CANCELLED, dtype: int64

In [41]:
flights_df["DIVERTED"].value_counts()

F        426570
False    407670
0        354906
T           966
True        881
1           799
Name: DIVERTED, dtype: int64

In [42]:
# We can alternatively use a dictionary to "reduce" our categories.
mapping = {"F": "False", "0": "False", "1": "True", "T": "True"}
flights_df["DIVERTED"] = flights_df["DIVERTED"].replace(mapping)
flights_df["DIVERTED"].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)


False    1189146
True        2646
Name: DIVERTED, dtype: int64

As previously mentioned, another situation in which we may want to remap categories is when we want to reduce the number of values in a column. In our case, let's say a flight company would like to categorise the flights based on how far they've travelled. So, anything up to 1000 miles is `short`, anything above 1000 and up to 2500 miles is `medium`, and anything above 2500 miles is `long`.

Here, we can use the [`pd.cut`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) function to segment our data. We will provide three arguments to the function:
1. The `Series` we want to segment
2. The bins - that is, a list of ranges we'll be wanting to segment against
3. The labels - that is, a list of labels we want to assign to each of our bins
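
One detail worth knowing is what happens exactly on the bin edges. By default each bin includes its right edge but not its left one, so a value equal to the lowest edge falls outside every bin. The toy distances below are chosen by me to sit on those edges.

```python
import numpy as np
import pandas as pd

distances = pd.Series([0, 1000, 1001, 2500, 4502])
bins = [0, 1000, 2500, np.inf]
labels = ["short", "medium", "long"]

# Default behaviour: bins are (0, 1000], (1000, 2500], (2500, inf]
default = pd.cut(distances, bins=bins, labels=labels)

# include_lowest=True also places the lowest edge (0) in the first bin
inclusive = pd.cut(distances, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"distance": distances, "default": default, "inclusive": inclusive}))
```

With the default settings the distance of 0 is left as `NaN`, which may or may not be what you want for your data.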

In [43]:
import numpy as np

bins = [0, 1000, 2500, np.inf]
labels = ["short", "medium", "long"]
flights_df["DISTANCE_CATEGORY"] = pd.cut(flights_df["DISTANCE"], bins=bins, labels=labels)

flights_df[["DISTANCE", "DISTANCE_CATEGORY"]]

Unnamed: 0,DISTANCE,DISTANCE_CATEGORY
0,580,short
1,744,short
2,718,short
3,487,short
4,744,short
...,...,...
1191800,721,short
1191801,731,short
1191802,731,short
1191803,508,short


In [44]:
flights_df[flights_df["DISTANCE_CATEGORY"] == "long"]

Unnamed: 0,TRANSACTIONID,FLIGHTDATE,AIRLINECODE,AIRLINENAME,TAILNUM,FLIGHTNUM,ORIGINAIRPORTCODE,ORIGAIRPORTNAME,ORIGINCITYNAME,ORIGINSTATE,ORIGINSTATENAME,DESTAIRPORTCODE,DESTAIRPORTNAME,DESTCITYNAME,DESTSTATE,DESTSTATENAME,CRSDEPTIME,DEPTIME,DEPDELAY,TAXIOUT,WHEELSOFF,WHEELSON,TAXIIN,CRSARRTIME,ARRTIME,ARRDELAY,CRSELAPSEDTIME,ACTUALELAPSEDTIME,CANCELLED,DIVERTED,DISTANCE,DISTANCE_CATEGORY
3660,120678800,20120103.0,DL,Delta Air Lines Inc.: DL,N810NW,837,ATL,AtlantaGA: Hartsfield-Jackson Atlanta Internat...,Atlanta,GA,Georgia,HNL,HonoluluHI: Honolulu International,Honolulu,HI,Hawaii,1050,1047.0,-3.0,20.0,1107.0,1518.0,5.0,1615.0,1523.0,-52.0,625.0,576.0,False,False,4502,long
5518,117053600,20110506.0,AS,Alaska Airlines Inc.: AS,N566AS,897,BLI,BellinghamWA: Bellingham International,Bellingham,WA,Washington,HNL,HonoluluHI: Honolulu International,Honolulu,HI,Hawaii,1720,1715.0,-5.0,5.0,1720.0,1953.0,6.0,2035.0,1959.0,-36.0,375.0,344.0,False,False,2716,long
5519,117079900,20110523.0,AS,Alaska Airlines Inc.: AS,N589AS,897,BLI,BellinghamWA: Bellingham International,Bellingham,WA,Washington,HNL,HonoluluHI: Honolulu International,Honolulu,HI,Hawaii,1720,1713.0,-7.0,14.0,1727.0,2017.0,4.0,2035.0,2021.0,-14.0,375.0,368.0,False,False,2716,long
5520,120549200,20120102.0,AS,Alaska Airlines Inc.: AS,N559AS,897,BLI,BellinghamWA: Bellingham International,Bellingham,WA,Washington,HNL,HonoluluHI: Honolulu International,Honolulu,HI,Hawaii,1805,1801.0,-4.0,11.0,1812.0,2227.0,5.0,2225.0,2232.0,7.0,380.0,391.0,False,False,2715,long
5521,120578800,20120122.0,AS,Alaska Airlines Inc.: AS,N529AS,897,BLI,BellinghamWA: Bellingham International,Bellingham,WA,Washington,HNL,HonoluluHI: Honolulu International,Honolulu,HI,Hawaii,1805,1749.0,-16.0,18.0,1807.0,2154.0,5.0,2225.0,2159.0,-26.0,380.0,370.0,False,False,2715,long
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1187039,11875100,19940523.0,AA,American Airlines Inc.: AA,,73,ORD,ChicagoIL: Chicago O'Hare International,Chicago,IL,Illinois,HNL,HonoluluHI: Honolulu International,Honolulu,HI,Hawaii,915,913.0,-2.0,,,,,1256.0,1311.0,15.0,521.0,538.0,False,False,4244,long
1187465,8514000,19930925.0,AA,American Airlines Inc.: AA,,73,ORD,ChicagoIL: Chicago O'Hare International,Chicago,IL,Illinois,HNL,HonoluluHI: Honolulu International,Honolulu,HI,Hawaii,914,914.0,0.0,,,,,1245.0,1252.0,7.0,511.0,518.0,False,False,4244,long
1189798,119348600,20110910.0,DL,Delta Air Lines Inc.: DL,N554NW,1088,ANC,AnchorageAK: Ted Stevens Anchorage International,Anchorage,AK,Alaska,MSP,MinneapolisMN: Minneapolis-St Paul International,Minneapolis,MN,Minnesota,2140,2140.0,0.0,29.0,2209.0,544.0,8.0,555.0,552.0,-3.0,315.0,312.0,False,False,2519,long
1189810,118919800,20110920.0,AA,American Airlines Inc.: AA,N625AA,278,ANC,AnchorageAK: Ted Stevens Anchorage International,Anchorage,AK,Alaska,DFW,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,Dallas/Fort Worth,TX,Texas,2030,2035.0,5.0,22.0,2057.0,543.0,4.0,540.0,547.0,7.0,370.0,372.0,False,False,3043,long


### Dealing with Datetimes

One common issue you're going to come across is dealing with dates and datetimes (a datetime is simply a date and a time). Why? Because there are many ways that we can format a date, such as `DD/MM/YYYY`, `MM/DD/YY`, `Xth MONTH YEAR` etc. In our dataframe above, our dates are actually formatted as just one number. Pandas provides us with a useful helper for constructing datetimes: the [`pd.to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function.

Before we look into this, it's valuable to quickly introduce how dates are typically stored in computers. Typically, dates are stored as the number of seconds elapsed since **1st January 1970** (UTC). When we want to, let's say, find the difference in time between 3/2/2013 16:00 and 21/1/2013 09:00, the program performs its operations on the **Epoch/Unix/POSIX time** for those values, and we can subsequently code something up to give us the value back in a format we want (e.g. 13 days, 7 hours). Speaking with numbers:
- **3/2/2013 16:00** = 1,359,907,200
- **21/1/2013 09:00** = 1,358,758,800

Difference in dates = 1,359,907,200 - 1,358,758,800 = 1,148,400 seconds

`format(1148400) = 13 days, 7 hours`
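
The arithmetic above can be checked with Python's standard library. This is a minimal sketch that assumes both datetimes are in UTC, which is what makes the epoch values match the numbers quoted above.

```python
from datetime import datetime, timezone

later = datetime(2013, 2, 3, 16, 0, tzinfo=timezone.utc)
earlier = datetime(2013, 1, 21, 9, 0, tzinfo=timezone.utc)

# Seconds elapsed since 1st January 1970 (UTC)
later_epoch = int(later.timestamp())      # 1359907200
earlier_epoch = int(earlier.timestamp())  # 1358758800

# Do the subtraction on the raw numbers, then format the result
seconds = later_epoch - earlier_epoch     # 1148400
days, remainder = divmod(seconds, 86400)  # 86400 seconds in a day
hours = remainder // 3600
print(f"{days} days, {hours} hours")      # 13 days, 7 hours
```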

Great! Ok, so to solidify how some common-looking dates are formatted:
<table>
    <tr>
        <td><b>Date</b></td>
        <td><b>Datetime format</b></td>
    </tr>
    <tr>
        <td>15th June 2020</td>
        <td>%dth %B %Y</td>
    </tr>
    <tr>
        <td>15/06/2020</td>
        <td>%d/%m/%Y</td>
    </tr>
    <tr>
        <td>06-15-2020</td>
        <td>%m-%d-%Y</td>
    </tr>
</table>
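
To see format strings like these in action, here is a small sketch using `datetime.strptime` from the standard library, which uses the same `%` codes. The third example string is my own addition, chosen to resemble a date stored as one run of digits.

```python
from datetime import datetime

# (text, format) pairs; each one describes the 15th of June 2020
examples = [
    ("15/06/2020", "%d/%m/%Y"),
    ("06-15-2020", "%m-%d-%Y"),
    ("20200615", "%Y%m%d"),
]

parsed = [datetime.strptime(text, fmt) for text, fmt in examples]
for (text, fmt), value in zip(examples, parsed):
    print(f"{text!r} with {fmt!r} -> {value.date()}")
```

All three strings parse to the same date, which shows that the format string, not the text itself, decides which digits are the day, the month and the year.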

Let's use the method!!

In [45]:
flights_df = pd.read_csv('flights.txt', sep='|')

In [46]:
flights_df["FLIGHTDATE"]

0          20020101
1          20020101
2          20020101
3          20020101
4          20020103
             ...   
1191800    20130106
1191801    20130106
1191802    20130106
1191803    20130106
1191804    20130106
Name: FLIGHTDATE, Length: 1191805, dtype: int64

In [47]:
pd.to_datetime(flights_df["FLIGHTDATE"])

0         1970-01-01 00:00:00.020020101
1         1970-01-01 00:00:00.020020101
2         1970-01-01 00:00:00.020020101
3         1970-01-01 00:00:00.020020101
4         1970-01-01 00:00:00.020020103
                       ...             
1191800   1970-01-01 00:00:00.020130106
1191801   1970-01-01 00:00:00.020130106
1191802   1970-01-01 00:00:00.020130106
1191803   1970-01-01 00:00:00.020130106
1191804   1970-01-01 00:00:00.020130106
Name: FLIGHTDATE, Length: 1191805, dtype: datetime64[ns]

Uhhh... what happened there? Why are all our dates 1970-01-01 now?

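Well, it's because, as I mentioned, dates are internally stored as numbers counted from the epoch. Our `FLIGHTDATE` column also stores the flight dates as plain integers. Thus, when we run `pd.to_datetime()` on it, each integer is interpreted as an offset from 1st January 1970 (in nanoseconds by default), so `20020101` becomes roughly 0.02 seconds past midnight rather than 1st January 2002.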
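
A quick sketch of how `pd.to_datetime()` treats plain integers may help here. The second call reuses the epoch value from the worked example earlier in this section and tells pandas that the number is in seconds.

```python
import pandas as pd

# With no unit given, an integer is read as nanoseconds since the epoch
as_nanoseconds = pd.to_datetime(20020101)
print(as_nanoseconds)  # 1970-01-01 00:00:00.020020101

# With unit="s", the same kind of number is read as seconds since the epoch
as_seconds = pd.to_datetime(1359907200, unit="s")
print(as_seconds)      # 2013-02-03 16:00:00
```

Neither interpretation is what we want for `FLIGHTDATE`, because its integers are not offsets from the epoch at all.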

One simple fix is to explicitly specify our datetime format. Given the examples above, what do you think the date format is going to be?

In [48]:
## Assign date_format
date_format = "%Y%m%d"
# date_format = "%d-%m-%Y"
pd.to_datetime(flights_df["FLIGHTDATE"], format=date_format)

0         2002-01-01
1         2002-01-01
2         2002-01-01
3         2002-01-01
4         2002-01-03
             ...    
1191800   2013-01-06
1191801   2013-01-06
1191802   2013-01-06
1191803   2013-01-06
1191804   2013-01-06
Name: FLIGHTDATE, Length: 1191805, dtype: datetime64[ns]

Better! Cool! Ok, so this solution was quite specific to the problem we had at hand. But in the real world you may often encounter mixed formats for dates in one dataframe. For example:

<table>
    <tr>
        <td><b>Name</b></td>
        <td><b>Date of Birth</b></td>
        <td><b>Age</b></td>
    </tr>
    <tr>
        <td>John</td>
        <td>01/07/1995</td>
        <td>25</td>
    </tr>
    <tr>
        <td>Jane</td>
        <td>20-04-1992</td>
        <td>28</td>
    </tr>
    <tr>
        <td>Mark</td>
        <td>3rd January 1990</td>
        <td>30</td>
    </tr>
    </table>
    
`pd.to_datetime()` once again comes to the rescue here! In the previous code cell, we explicitly set the date format (because of the unusual way this date was stored in the dataframe), but more generally we can ask `pd.to_datetime()` to infer the format of the dates for us.

```python
# errors='coerce' means invalid dates are returned as NaT (missing) rather than raising an error
# Note: infer_datetime_format is deprecated from pandas 2.0 onwards, where format="mixed" can be passed to parse each date individually
df["DATE"] = pd.to_datetime(df["DATE"], infer_datetime_format=True, errors='coerce')
```
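
If you'd rather not rely on inference, a more explicit alternative is to try a list of known formats one by one. Note that this swaps pandas' inference for a hand-written fallback loop using only the standard library; the helper `parse_date` and the `KNOWN_FORMATS` list are hypothetical names of my own, and the formats assume that the day comes before the month.

```python
import re
from datetime import datetime

# Assumed formats, all day-first; extend this list to suit your own data
KNOWN_FORMATS = ["%d/%m/%Y", "%d-%m-%Y", "%d %B %Y"]

def parse_date(text):
    """Return a datetime for text, or None if no known format matches."""
    # Strip ordinal suffixes so that "3rd January 1990" becomes "3 January 1990"
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None  # plays the same role as errors='coerce'

dates = ["01/07/1995", "20-04-1992", "3rd January 1990", "not a date"]
parsed = [parse_date(d) for d in dates]
print(parsed)
```

The trade-off is that you have to list the formats yourself, but an ambiguous date such as `01/07/1995` is always read the same way.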

## Challenge

Modify CRSDEPTIME and CRSARRTIME to be in a datetime format. Notice that the days might tick over on to the next day... Which column can you use to help you infer whether this could be the case?

In [50]:
flights_df = flights_df.convert_dtypes()
flights_df.head()

Unnamed: 0,TRANSACTIONID,FLIGHTDATE,AIRLINECODE,AIRLINENAME,TAILNUM,FLIGHTNUM,ORIGINAIRPORTCODE,ORIGAIRPORTNAME,ORIGINCITYNAME,ORIGINSTATE,ORIGINSTATENAME,DESTAIRPORTCODE,DESTAIRPORTNAME,DESTCITYNAME,DESTSTATE,DESTSTATENAME,CRSDEPTIME,DEPTIME,DEPDELAY,TAXIOUT,WHEELSOFF,WHEELSON,TAXIIN,CRSARRTIME,ARRTIME,ARRDELAY,CRSELAPSEDTIME,ACTUALELAPSEDTIME,CANCELLED,DIVERTED,DISTANCE
0,54548800,2002-01-01,WN,Southwest Airlines Co.: WN,N103@@,1425,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,DAL,DallasTX: Dallas Love Field,Dallas,TX,Texas,1425,1425,0,8,1433,1648,4,1655,1652,-3,90,87,F,False,580 miles
1,55872300,2002-01-01,CO,Continental Air Lines Inc.: CO,N83872,150,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,IAH,HoustonTX: George Bush Intercontinental/Houston,Houston,TX,Texas,1130,1136,6,12,1148,1419,16,1426,1435,9,116,119,False,F,744 miles
2,54388800,2002-01-01,WN,Southwest Airlines Co.: WN,N334@@,249,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,MCI,Kansas CityMO: Kansas City International,Kansas City,MO,Missouri,1215,1338,83,7,1345,1618,2,1500,1620,80,105,102,F,False,718 miles
3,54486500,2002-01-01,WN,Southwest Airlines Co.: WN,N699@@,902,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,LAS,Las VegasNV: McCarran International,Las Vegas,NV,Nevada,1925,1925,0,5,1930,1947,1,1950,1948,-2,85,83,0,0,487 miles
4,55878700,2002-01-03,CO,Continental Air Lines Inc.: CO,N58606,234,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,IAH,HoustonTX: George Bush Intercontinental/Houston,Houston,TX,Texas,1455,1453,-2,11,1504,1742,5,1750,1747,-3,115,114,F,False,744 miles


In [51]:
## Create a new variable CRSDEPDATETIME which concatenates the FLIGHTDATE and CRSDEPTIME columns
# If you're unsure how... try googling it
# You'll probably also need to type convert the relevant columns
flights_df["FLIGHTDATE"] = ### Your code here
CRSDEPDATETIME = ### Your Code here
CRSDEPDATETIME

0          2002-01-01 1425
1          2002-01-01 1130
2          2002-01-01 1215
3          2002-01-01 1925
4          2002-01-03 1455
                ...       
1191800    2013-01-06 1357
1191801    2013-01-06 2150
1191802    2013-01-06 1617
1191803    2013-01-06 1516
1191804    2013-01-06 1452
Length: 1191805, dtype: object

In [52]:
## Convert CRSDEPDATETIME to a datetime object and overwrite the CRSDEPTIME with the new series
flights_df["CRSDEPTIME"] = pd.to_datetime(CRSDEPDATETIME, infer_datetime_format=True, errors='coerce')
flights_df.head()

Unnamed: 0,TRANSACTIONID,FLIGHTDATE,AIRLINECODE,AIRLINENAME,TAILNUM,FLIGHTNUM,ORIGINAIRPORTCODE,ORIGAIRPORTNAME,ORIGINCITYNAME,ORIGINSTATE,ORIGINSTATENAME,DESTAIRPORTCODE,DESTAIRPORTNAME,DESTCITYNAME,DESTSTATE,DESTSTATENAME,CRSDEPTIME,DEPTIME,DEPDELAY,TAXIOUT,WHEELSOFF,WHEELSON,TAXIIN,CRSARRTIME,ARRTIME,ARRDELAY,CRSELAPSEDTIME,ACTUALELAPSEDTIME,CANCELLED,DIVERTED,DISTANCE
0,54548800,2002-01-01,WN,Southwest Airlines Co.: WN,N103@@,1425,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,DAL,DallasTX: Dallas Love Field,Dallas,TX,Texas,2002-01-01 14:25:00,1425,0,8,1433,1648,4,1655,1652,-3,90,87,F,False,580 miles
1,55872300,2002-01-01,CO,Continental Air Lines Inc.: CO,N83872,150,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,IAH,HoustonTX: George Bush Intercontinental/Houston,Houston,TX,Texas,2002-01-01 11:30:00,1136,6,12,1148,1419,16,1426,1435,9,116,119,False,F,744 miles
2,54388800,2002-01-01,WN,Southwest Airlines Co.: WN,N334@@,249,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,MCI,Kansas CityMO: Kansas City International,Kansas City,MO,Missouri,2002-01-01 12:15:00,1338,83,7,1345,1618,2,1500,1620,80,105,102,F,False,718 miles
3,54486500,2002-01-01,WN,Southwest Airlines Co.: WN,N699@@,902,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,LAS,Las VegasNV: McCarran International,Las Vegas,NV,Nevada,2002-01-01 19:25:00,1925,0,5,1930,1947,1,1950,1948,-2,85,83,0,0,487 miles
4,55878700,2002-01-03,CO,Continental Air Lines Inc.: CO,N58606,234,ABQ,AlbuquerqueNM: Albuquerque International Sunport,Albuquerque,NM,New Mexico,IAH,HoustonTX: George Bush Intercontinental/Houston,Houston,TX,Texas,2002-01-03 14:55:00,1453,-2,11,1504,1742,5,1750,1747,-3,115,114,F,False,744 miles
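
As a hedged sketch of the two cells above, the concatenation and parsing can be tried on a toy frame first. The frame below is made up for illustration (the `5` stands for a hypothetical 00:05 departure). Zero-padding the time with `str.zfill(4)` and passing an explicit `format` are my own choices, not the only way to do it. Recent pandas releases deprecate `infer_datetime_format`, so an explicit format is arguably the safer habit.

```python
import pandas as pd

# Toy data: the first row mirrors the flights output, the second is hypothetical
toy = pd.DataFrame({
    "FLIGHTDATE": ["2002-01-01", "2002-01-03"],
    "CRSDEPTIME": [1425, 5],
})

# Zero-pad the integer times so that 5 becomes "0005" (i.e. 00:05)
hhmm = toy["CRSDEPTIME"].astype(str).str.zfill(4)

# Concatenate date and time as strings, then parse with an explicit format
combined = toy["FLIGHTDATE"].astype(str) + " " + hhmm
dep_datetime = pd.to_datetime(combined, format="%Y-%m-%d %H%M", errors="coerce")
print(dep_datetime)
```

With `errors="coerce"`, values that don't fit the format become `NaT` rather than raising, so it is worth counting the nulls afterwards.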


In [53]:
# Now we want to convert CRSARRTIME to a datetime object.
# What would be the issue with just doing the same trick as above and using FLIGHTDATE and appending the CRSARRTIME to it?
# So we know that we need to keep the answer to the above question in mind when trying to construct the new datetime object

# I found this stackoverflow link to guide me through how to solve my problem:
# https://stackoverflow.com/questions/34519536/convert-integer-series-to-timedelta-in-pandas
# What do you think I googled to find this page?
# You may need to research the method the answer provides to find the correct arguments to pass in

## Assign time_to_shift using the timedelta method
# I imagine you will run into a type error. Take a couple of minutes to see if you can find how to resolve the issue
# Otherwise: https://stackoverflow.com/a/21290084/3297011 should provide you with the conceptual solution

time_to_shift = ### Your Code Here

## Assign CRSARRDATETIME to CRSDEPTIME + time_to_shift

CRSARRDATETIME = ### Your Code Here
CRSARRDATETIME


0         2002-01-01 15:55:00
1         2002-01-01 13:26:00
2         2002-01-01 14:00:00
3         2002-01-01 20:50:00
4         2002-01-03 16:50:00
                  ...        
1191800   2013-01-06 16:23:00
1191801   2013-01-07 00:21:00
1191802   2013-01-06 18:50:00
1191803   2013-01-06 16:58:00
1191804   2013-01-06 16:09:00
Length: 1191805, dtype: datetime64[ns]

In [54]:
flights_df.tail()

Unnamed: 0,TRANSACTIONID,FLIGHTDATE,AIRLINECODE,AIRLINENAME,TAILNUM,FLIGHTNUM,ORIGINAIRPORTCODE,ORIGAIRPORTNAME,ORIGINCITYNAME,ORIGINSTATE,ORIGINSTATENAME,DESTAIRPORTCODE,DESTAIRPORTNAME,DESTCITYNAME,DESTSTATE,DESTSTATENAME,CRSDEPTIME,DEPTIME,DEPDELAY,TAXIOUT,WHEELSOFF,WHEELSON,TAXIIN,CRSARRTIME,ARRTIME,ARRDELAY,CRSELAPSEDTIME,ACTUALELAPSEDTIME,CANCELLED,DIVERTED,DISTANCE
1191800,126750200,2013-01-06,EV,ExpressJet Airlines Inc.: EV,N683BR,5272,ATL,AtlantaGA: Hartsfield-Jackson Atlanta Internat...,Atlanta,GA,Georgia,DAL,DallasTX: Dallas Love Field,Dallas,TX,Texas,2013-01-06 13:57:00,1348,-9,22,1410,1500,3,1523,1503,-20,146,135,0,0,721 miles
1191801,127294500,2013-01-06,DL,Delta Air Lines Inc.: DL,N949DL,1711,ATL,AtlantaGA: Hartsfield-Jackson Atlanta Internat...,Atlanta,GA,Georgia,DFW,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,Dallas/Fort Worth,TX,Texas,2013-01-06 21:50:00,2147,-3,23,2210,2307,10,2321,2317,-4,151,150,False,F,731 miles
1191802,127294900,2013-01-06,DL,Delta Air Lines Inc.: DL,N907DE,1810,ATL,AtlantaGA: Hartsfield-Jackson Atlanta Internat...,Atlanta,GA,Georgia,DFW,Dallas/Fort WorthTX: Dallas/Fort Worth Interna...,Dallas/Fort Worth,TX,Texas,2013-01-06 16:17:00,1617,0,18,1635,1728,9,1750,1737,-13,153,140,F,False,731 miles
1191803,126594900,2013-01-06,EV,ExpressJet Airlines Inc.: EV,N855AS,5208,ATL,AtlantaGA: Hartsfield-Jackson Atlanta Internat...,Atlanta,GA,Georgia,FWA,Fort WayneIN: Fort Wayne International,Fort Wayne,IN,Indiana,2013-01-06 15:16:00,1514,-2,21,1535,1651,4,1658,1655,-3,102,101,False,F,508 miles
1191804,126620300,2013-01-06,EV,ExpressJet Airlines Inc.: EV,N138EV,5549,ATL,AtlantaGA: Hartsfield-Jackson Atlanta Internat...,Atlanta,GA,Georgia,GSO,Greensboro/High PointNC: Piedmont Triad Intern...,Greensboro/High Point,NC,North Carolina,2013-01-06 14:52:00,1458,6,27,1525,1611,4,1609,1615,6,77,77,False,False,306 miles


Interesting! Compare the `CRSARRDATETIME` values we just computed with the `CRSARRTIME` column: what do you notice? Some of the scheduled arrival times are offset from our computed values by whole hours. Why do you think this is?

It occurs because the destination can be in a different timezone from the origin, and `CRSARRTIME` is given in the destination's local time. As data scientists we have to ensure that manipulations of our data maintain its correctness and integrity.

I won't overwrite the CRSARRTIME variable here but have a think about how you could correctly modify CRSARRTIME to be a datetime object. Click the arrow below to reveal the answer:

<details>
    <summary><b>> How I would modify CRSARRTIME</b></summary>
    <ol>
        <li>Build a dictionary of US state timezone offsets. I would probably anchor the dictionary on the 'earliest' state we have in the dataframe (e.g. if Hawaii is present, then I'd use Hawaiian time (UTC-10) as the 0 offset). Every other state would have a value relative to the 0 offset (e.g. New York's offset would be +5).</li>
        <li>We know the origin and destination states, so the only thing left to do is take the difference in offsets between the origin state and the destination state and add it to the <code>time_to_shift</code> variable we calculated above.</li>
        <li>Use <code>time_to_shift</code> as above!</li>
    </ol>
</details>
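
The steps in the answer above can be sketched on two rows taken from the flights output (ABQ to DAL and ABQ to LAS). The `state_utc_offset_hours` dictionary is a hypothetical stand-in that I filled with standard-time UTC offsets. It ignores daylight saving and states that span more than one timezone, so treat it as an illustration of the idea, not a finished solution.

```python
import pandas as pd

# Two rows copied from the flights output shown earlier
toy = pd.DataFrame({
    "ORIGINSTATE": ["NM", "NM"],
    "DESTSTATE": ["TX", "NV"],
    "CRSDEPTIME": pd.to_datetime(["2002-01-01 14:25", "2002-01-01 19:25"]),
    "CRSELAPSEDTIME": [90, 85],  # scheduled elapsed time in minutes
})

# Hypothetical lookup: standard-time UTC offsets, one per state (an assumption)
state_utc_offset_hours = {"NM": -7, "TX": -6, "NV": -8}

# Scheduled flight duration as a timedelta
time_to_shift = pd.to_timedelta(toy["CRSELAPSEDTIME"], unit="m")

# Difference in offsets between destination and origin, as a timedelta
offset_diff = (toy["DESTSTATE"].map(state_utc_offset_hours)
               - toy["ORIGINSTATE"].map(state_utc_offset_hours))
tz_shift = pd.to_timedelta(offset_diff, unit="h")

# Arrival expressed in the destination's local time
crs_arr_local = toy["CRSDEPTIME"] + time_to_shift + tz_shift
print(crs_arr_local.dt.strftime("%H%M").tolist())
```

For these two rows the result lines up with the `CRSARRTIME` values in the data (1655 and 1950), which supports the timezone explanation.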

## Cross Field Validation

What does it mean to check the integrity of our data? Essentially, we need to confirm that the values in one column are consistent with the values in other, related columns. We stumbled on an example of this in the previous exercise, where comparing our computed arrival times against `CRSARRTIME` revealed the timezone offsets. This is what **cross field validation** investigates. Before expanding on some of the cross field checks on this dataset, I will provide a simpler example to demonstrate how skipping such checks could skew analysis:

The dummy table below shows entries of some student finance undergraduate (U.G) and postgraduate (P.G) loan holders. The dataset consists of a loan holder's name, date of birth (D.O.B), current age (or deceased age if relevant), whether they are deceased or not, their U.G and P.G loan amounts, and the total amount that they owe - which should be the summation of the previous two fields. In the table below, I have italicised the questionable fields.
<table>
    <tr>
        <td><b>Name</b></td>
        <td><b>D.O.B</b></td>
        <td><b>Age</b></td>
        <td><b>Deceased</b></td>
        <td><b>U.G Loan (£)</b></td>
        <td><b>P.G Loan (£)</b></td>
        <td><b>Total Loan (£)</b></td>
    </tr>
    <tr>
        <td>Idaline</td>
        <td>1971-04-27</td>
        <td>50</td>
        <td>F</td>
        <td>24100</td>
        <td>11900</td>
        <td>36000</td>
    </tr>
    <tr>
        <td>Freddie</td>
        <td>1962-12-27</td>
        <td>58</td>
        <td>F</td>
        <td>26600</td>
        <td>12600</td>
        <td>39200</td>
    </tr>
    <tr>
        <td>Debee</td>
        <td>1970-11-19</td>
        <td>49</td>
        <td>F</td>
        <td>32400</td>
        <td>97000</td>
        <td><i>42100</i></td>
    </tr>
    <tr>
        <td>Joyann</td>
        <td>1957-01-24</td>
        <td><i>41</i></td>
        <td>T</td>
        <td>24400</td>
        <td>11500</td>
        <td>35900</td>
    </tr>
    <tr>
        <td>Ajay</td>
        <td>1960-05-12</td>
        <td><i>50</i></td>
        <td>F</td>
        <td>25500</td>
        <td>18800</td>
        <td>44300</td>
    </tr>
    <tr>
        <td>Emelia</td>
        <td>1957-11-23</td>
        <td><i>57</i></td>
        <td>T</td>
        <td>34000</td>
        <td>17500</td>
        <td><i>0</i></td>
    </tr>
            
</table>

In [55]:
html_table = """
<table>
    <tr>
        <td><b>Name</b></td>
        <td><b>D.O.B</b></td>
        <td><b>Age</b></td>
        <td><b>Deceased</b></td>
        <td><b>U.G Loan (£)</b></td>
        <td><b>P.G Loan (£)</b></td>
        <td><b>Total Loan (£)</b></td>
    </tr>
    <tr>
        <td>Idaline</td>
        <td>19710427</td>
        <td>50</td>
        <td>F</td>
        <td>24100</td>
        <td>11900</td>
        <td>36000</td>
    </tr>
    <tr>
        <td>Freddie</td>
        <td>19621227</td>
        <td>58</td>
        <td>F</td>
        <td>26600</td>
        <td>12600</td>
        <td>39200</td>
    </tr>
    <tr>
        <td>Debee</td>
        <td>19701119</td>
        <td>49</td>
        <td>F</td>
        <td>32400</td>
        <td>97000</td>
        <td><i>42100</i></td>
    </tr>
    <tr>
        <td>Joyann</td>
        <td>19570124</td>
        <td><i>41</i></td>
        <td>T</td>
        <td>24400</td>
        <td>11500</td>
        <td>35900</td>
    </tr>
    <tr>
        <td>Ajay</td>
        <td>19600512</td>
        <td><i>50</i></td>
        <td>F</td>
        <td>25500</td>
        <td>18800</td>
        <td>44300</td>
    </tr>
    <tr>
        <td>Emelia</td>
        <td>19571123</td>
        <td><i>57</i></td>
        <td>T</td>
        <td>34000</td>
        <td>17500</td>
        <td><i>0</i></td>
    </tr>
            
</table>
"""

html_df = pd.read_html(html_table, header=0)[0]
html_df

Unnamed: 0,Name,D.O.B,Age,Deceased,U.G Loan (£),P.G Loan (£),Total Loan (£)
0,Idaline,19710427,50,F,24100,11900,36000
1,Freddie,19621227,58,F,26600,12600,39200
2,Debee,19701119,49,F,32400,97000,42100
3,Joyann,19570124,41,T,24400,11500,35900
4,Ajay,19600512,50,F,25500,18800,44300
5,Emelia,19571123,57,T,34000,17500,0


Let's work on the **Age** variable first. According to our data documentation, the age in the field should reflect the current age of the loan holders. The exception to this is when the loan holder is deceased, in which case the age should contain the loan holder's age at the time of their passing. Let's first work out which rows break this condition.

In [56]:
# Before we attempt this, we'll rename the columns to something that's easier to work with
html_df.columns = ["name", "dob", "age", "deceased", "ug_loan", "pg_loan", "total_loan"]
html_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        6 non-null      object
 1   dob         6 non-null      int64 
 2   age         6 non-null      int64 
 3   deceased    6 non-null      object
 4   ug_loan     6 non-null      int64 
 5   pg_loan     6 non-null      int64 
 6   total_loan  6 non-null      int64 
dtypes: int64(5), object(2)
memory usage: 464.0+ bytes


In [57]:
## Convert 'dob' into a date object
html_df['dob'] = ### Your Code Here
html_df

Unnamed: 0,name,dob,age,deceased,ug_loan,pg_loan,total_loan
0,Idaline,1971-04-27,50,F,24100,11900,36000
1,Freddie,1962-12-27,58,F,26600,12600,39200
2,Debee,1970-11-19,49,F,32400,97000,42100
3,Joyann,1957-01-24,41,T,24400,11500,35900
4,Ajay,1960-05-12,50,F,25500,18800,44300
5,Emelia,1957-11-23,57,T,34000,17500,0


In [59]:
# Creates a new column 'now_date' populated with the current datetime
html_df["now_date"] = pd.Timestamp.now()

## Calculate the difference between 'now_date' and 'dob' (a timedelta, which we convert to years below)
now_date_dob_difference = ### Your Code Here


In [60]:
now_date_dob_difference

0   18431 days 16:21:40.930590
1   21474 days 16:21:40.930590
2   18590 days 16:21:40.930590
3   23637 days 16:21:40.930590
4   22433 days 16:21:40.930590
5   23334 days 16:21:40.930590
dtype: timedelta64[ns]

In [61]:
import numpy as np
# This line converts the timedelta objects to a floating point number of years, which we then truncate to an int
now_date_dob_difference = (now_date_dob_difference / np.timedelta64(1, 'Y')).astype("int64")
now_date_dob_difference

0    50
1    58
2    50
3    64
4    61
5    63
dtype: int64
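
Dividing by an average year length is an approximation, and some newer pandas versions reject ambiguous units such as `'Y'`. As an alternative sketch, the age can be computed from calendar fields directly. The fixed reference date below is an assumption chosen to match the run shown above; in practice you would use `pd.Timestamp.now()`.

```python
import pandas as pd

# Dates of birth from the loan table, parsed from their YYYYMMDD integer form
dob = pd.to_datetime(
    pd.Series([19710427, 19621227, 19701119, 19570124, 19600512, 19571123]).astype(str),
    format="%Y%m%d",
)

# Assumed reference date, matching the 'now_date' in the output above
ref = pd.Timestamp("2021-10-12")

# True where the birthday has already happened in the reference year
had_birthday = (dob.dt.month < ref.month) | (
    (dob.dt.month == ref.month) & (dob.dt.day <= ref.day)
)

# Year difference, minus one if the birthday is still to come
age = ref.year - dob.dt.year - (~had_birthday).astype("int64")
print(age.tolist())
```

For this table the calendar-based ages agree with the values from the division approach.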

In [63]:
# By eye, we can see which ages do not match the dataframe we showed previously. 
# For generality however, let's code this up with pandas logic.
## Return rows where 'now_date_dob_difference' is different to the dataframe's age variable
is_diff = ### Your Code Here

In [64]:
is_diff

0    False
1    False
2     True
3     True
4     True
5     True
dtype: bool

In [65]:
html_df[is_diff]

Unnamed: 0,name,dob,age,deceased,ug_loan,pg_loan,total_loan,now_date
2,Debee,1970-11-19,49,F,32400,97000,42100,2021-10-12 16:21:40.930590
3,Joyann,1957-01-24,41,T,24400,11500,35900,2021-10-12 16:21:40.930590
4,Ajay,1960-05-12,50,F,25500,18800,44300,2021-10-12 16:21:40.930590
5,Emelia,1957-11-23,57,T,34000,17500,0,2021-10-12 16:21:40.930590


Let's take a look into why the above rows were returned. As mentioned previously, if a loan holder is deceased, then their age should reflect their age at the time of their passing. This means that Joyann's and Emelia's ages can be treated as correct. Using boolean logic, let's filter out these rows to return only the rows which have mathematically incorrect ages.

In [66]:
## Filter out the relevant loan holders using boolean logic (hint: &)
incorrect_age_rows = ### Your Code Here
incorrect_age_rows

Unnamed: 0,name,dob,age,deceased,ug_loan,pg_loan,total_loan,now_date
2,Debee,1970-11-19,49,F,32400,97000,42100,2021-10-12 16:21:40.930590
4,Ajay,1960-05-12,50,F,25500,18800,44300,2021-10-12 16:21:40.930590


In [67]:
## Update incorrect_age_rows dataframe with the corrected ages
incorrect_age_rows["age"] = ### Your Code Here
incorrect_age_rows

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)


Unnamed: 0,name,dob,age,deceased,ug_loan,pg_loan,total_loan,now_date
2,Debee,1970-11-19,50,F,32400,97000,42100,2021-10-12 16:21:40.930590
4,Ajay,1960-05-12,61,F,25500,18800,44300,2021-10-12 16:21:40.930590
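
The warning above appears because `incorrect_age_rows` is a filtered copy of `html_df`. One way to sidestep it, sketched here on a cut-down toy frame with hypothetical variable names, is to assign through `.loc` on the original frame with a boolean mask. In my experience this also keeps the column's integer dtype, so no float round trip is needed.

```python
import pandas as pd

# Cut-down version of the loan table (values taken from the table above)
df = pd.DataFrame({
    "name": ["Idaline", "Debee", "Joyann", "Ajay"],
    "age": [50, 49, 41, 50],
    "deceased": ["F", "F", "T", "F"],
})

# Ages computed from the dates of birth, as in the earlier cells
computed_age = pd.Series([50, 50, 64, 61])

# Rows whose age disagrees with the computed age and who are not deceased
needs_fix = (df["age"] != computed_age) & (df["deceased"] == "F")

# Assign on the original frame via .loc, so no intermediate copy is modified
df.loc[needs_fix, "age"] = computed_age[needs_fix]
print(df)
```

Only Debee and Ajay are updated; Joyann's recorded age is left alone because she is marked as deceased.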


In [68]:
## Now update the relevant entries in html_df with the age column from the incorrect_age_rows dataframe
html_df.update(incorrect_age_rows["age"])
html_df

Unnamed: 0,name,dob,age,deceased,ug_loan,pg_loan,total_loan,now_date
0,Idaline,1971-04-27,50.0,F,24100,11900,36000,2021-10-12 16:21:40.930590
1,Freddie,1962-12-27,58.0,F,26600,12600,39200,2021-10-12 16:21:40.930590
2,Debee,1970-11-19,50.0,F,32400,97000,42100,2021-10-12 16:21:40.930590
3,Joyann,1957-01-24,41.0,T,24400,11500,35900,2021-10-12 16:21:40.930590
4,Ajay,1960-05-12,61.0,F,25500,18800,44300,2021-10-12 16:21:40.930590
5,Emelia,1957-11-23,57.0,T,34000,17500,0,2021-10-12 16:21:40.930590


In [69]:
## Convert age back to an int
html_df["age"] = ### Your Code Here
## Drop the now_date column
html_df.drop(### Your Code Here)
html_df

Unnamed: 0,name,dob,age,deceased,ug_loan,pg_loan,total_loan
0,Idaline,1971-04-27,50,F,24100,11900,36000
1,Freddie,1962-12-27,58,F,26600,12600,39200
2,Debee,1970-11-19,50,F,32400,97000,42100
3,Joyann,1957-01-24,41,T,24400,11500,35900
4,Ajay,1960-05-12,61,F,25500,18800,44300
5,Emelia,1957-11-23,57,T,34000,17500,0


Let's work on the loan amounts now. Return all the rows where `ug_loan` + `pg_loan` is not equal to `total_loan`.

In [71]:
## Subset `ug_loan` and `pg_loan` from our dataframe, and then sum along the column axis
sum_loans = ### Your Code Here
html_df['computed'] = sum_loans
html_df

Unnamed: 0,name,dob,age,deceased,ug_loan,pg_loan,total_loan,computed
0,Idaline,1971-04-27,50,F,24100,11900,36000,36000
1,Freddie,1962-12-27,58,F,26600,12600,39200,39200
2,Debee,1970-11-19,50,F,32400,97000,42100,129400
3,Joyann,1957-01-24,41,T,24400,11500,35900,35900
4,Ajay,1960-05-12,61,F,25500,18800,44300,44300
5,Emelia,1957-11-23,57,T,34000,17500,0,51500


In [73]:
## Return the rows which have incorrect sum values
incorrect_loan_rows = ### Your Code Here
incorrect_loan_rows

Unnamed: 0,name,dob,age,deceased,ug_loan,pg_loan,total_loan,computed
2,Debee,1970-11-19,50,F,32400,97000,42100,129400
5,Emelia,1957-11-23,57,T,34000,17500,0,51500
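
As a minimal sketch of this cross field check, here is the same comparison on three rows from the table. The variable names `loans`, `computed` and `mismatch` are my own.

```python
import pandas as pd

# Three rows from the loan table above
loans = pd.DataFrame({
    "name": ["Idaline", "Debee", "Emelia"],
    "ug_loan": [24100, 32400, 34000],
    "pg_loan": [11900, 97000, 17500],
    "total_loan": [36000, 42100, 0],
})

# Sum the two loan columns row by row (axis=1 sums across columns)
computed = loans[["ug_loan", "pg_loan"]].sum(axis=1)

# Boolean mask of rows where the recorded total disagrees with the sum
mismatch = loans["total_loan"] != computed
print(loans[mismatch])
```

The mask flags Debee and Emelia, the same two loan holders returned above.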


### How should we deal with fields which fail validation?

Here we see two rows which don't have the correct total loan amounts. Looking at each one individually, the first row has most likely had its `pg_loan` value entered incorrectly (£97,000 is implausibly large for a postgraduate loan). In the second, for some reason the `total_loan` value has not been calculated. A naive strategy would be to overwrite every total loan amount with the sum of `ug_loan` and `pg_loan`. This fixes the kind of error seen in the second row, but it mishandles the first: summing `ug_loan` and `pg_loan` there would create an **outlier** (these will be covered in detail later). In a real dataset, mistakes like these are a genuine threat to the integrity of the data, and they are easy to overlook, so make sure you take the time to think through how your actions are going to affect your data.

As I mentioned, some aspects of data science are an art - but whatever heuristic decision we make, we have to find a strong justification for it. In this particular case, I am going to drop the rows where `total_loan` is non-zero but incorrect, as this error probably occurred due to a mistake in human data entry. The rows with `total_loan = 0` probably occurred due to some systematic error - perhaps an import from some other database where the total loan amount wasn't present. Provided the other checks pass, one solution we could opt for there is to sum the two columns together.

In [74]:
## Identify the rows where total_loan is NOT 0, but is incorrect
incorrect_loan_but_not_zero_rows = html_df[(html_df["total_loan"] != sum_loans) & (html_df["total_loan"] != 0)]
print(incorrect_loan_but_not_zero_rows)

## Drop these rows from the original dataframe
html_df = html_df.drop(incorrect_loan_but_not_zero_rows.index)

html_df

    name        dob  age deceased  ug_loan  pg_loan  total_loan  computed
2  Debee 1970-11-19   50        F    32400    97000       42100    129400


Unnamed: 0,name,dob,age,deceased,ug_loan,pg_loan,total_loan,computed
0,Idaline,1971-04-27,50,F,24100,11900,36000,36000
1,Freddie,1962-12-27,58,F,26600,12600,39200,39200
3,Joyann,1957-01-24,41,T,24400,11500,35900,35900
4,Ajay,1960-05-12,61,F,25500,18800,44300,44300
5,Emelia,1957-11-23,57,T,34000,17500,0,51500


In [75]:
# Assuming that we're happy with all the other entries for our loan holders, we can directly compute and overwrite total_loan in our dataframe
## Overwrite total_loan with the sum of ug_loan and pg_loan
html_df["total_loan"] = ### Your Code Here
html_df

Unnamed: 0,name,dob,age,deceased,ug_loan,pg_loan,total_loan,computed
0,Idaline,1971-04-27,50,F,24100,11900,36000,36000
1,Freddie,1962-12-27,58,F,26600,12600,39200,39200
3,Joyann,1957-01-24,41,T,24400,11500,35900,35900
4,Ajay,1960-05-12,61,F,25500,18800,44300,44300
5,Emelia,1957-11-23,57,T,34000,17500,51500,51500


## Working with text data and strings

Text data is obviously an extremely common type of data, and it can take on many forms - ranging from free unstructured text to emails, names, phone numbers etc. There are many types of problems we can encounter with text data:
- Data inconsistency (e.g. +86 195 448 8582 vs 0086-195-448-8582)
- Text violations (e.g. illegal characters, input field errors, text typos)
- "Structured" typos (e.g. +86.1954.48858.2)

In the example table below, we see a list of people with their names and phone numbers. The names and phone numbers have been entered in a variety of different formats, most likely because they came from free text fields. Our job is to standardise these fields so they're consistent throughout the dataframe:

<table>
    <tr>
        <td><b>Name</b></td>
        <td><b>Phone Number</b></td>
    </tr>
    <tr>
        <td>Dr Darci Abela</td>
        <td>+86-185-338-1819</td>
    </tr>
    <tr>
        <td>Mr Patten St. Queintain</td>
        <td>00865872411917</td>
    </tr>
    <tr>
        <td>mr conant burden</td>
        <td>0086-289-702-0948</td>
    </tr>
    <tr>
        <td>miss marcia Dutnell</td>
        <td>0668</td>
    </tr>
    <tr>
        <td>dr Greggory lurner</td>
        <td>+31 778 813 8432</td>
    </tr>
    <tr>
        <td>MS Doe Beavan</td>
        <td>+420-731-276-7633</td>
    </tr>
    <tr>
        <td>Tamarah Delgado</td>
        <td>+868431029051</td>
    </tr>
    <tr>
        <td>Miss Arlee daborne</td>
        <td>+33-307-220-2746</td>
    </tr>
    <tr>
        <td>Ly b. Grima</td>
        <td>+238-863-946-4232</td>
    </tr>
</table>

I have created a small dummy csv we can work with for this task

In [76]:
# np = names_phones
np_df = pd.read_csv("https://aicore-files.s3.amazonaws.com/Data-Eng/mock_names_phones.csv", header=0, index_col=0)
np_df

Unnamed: 0,name,phone number
0,Dr rafaello Vlasenko,215-183-0246
1,miss Dido Rosbrough,913 520 0662
2,MISS Drusie Merriday,0355
3,MRS Jillayne Kiloh,0087-407-5997
4,Rev VALERIA SHEVLANE,916-308-0837
...,...,...
95,REV gun Ornillos,0036 831 9242
96,Rev Peg Splevins,212-524-1998
97,Mr Rosalind Beyer,+ 447 730 3458
98,Dr Wenonah O' Hern,0036-477-2705


Ok - there are 4 tasks for this dataframe:
1. Create a 'title' column which contains each individual's title (e.g. Mrs, Miss etc). This column should be standardized and categorical
2. Split the actual name into a first name and last name column. Both columns should have a capital letter for the first letter of the name
3. Drop the `name` column
4. Standardise the phone numbers with the format `00XXXXXXXXX`. That is - two zeros prepended to the rest of the actual number

Let's tackle these in order

In [77]:
# First, we want to create a new title column which takes on the honorifics in the name column
# To obtain this, we have to split the name on whitespace and take the first element from the split list
example_string = "This string will be split"
print(example_string.split())
print(example_string.split()[0])

['This', 'string', 'will', 'be', 'split']
This


In [78]:
# To perform some string operations on string columns in pandas, we need to prepend our string function with '.str'
np_df["name"].str.split()


0     [Dr, rafaello, Vlasenko]
1      [miss, Dido, Rosbrough]
2     [MISS, Drusie, Merriday]
3       [MRS, Jillayne, Kiloh]
4     [Rev, VALERIA, SHEVLANE]
                ...           
95        [REV, gun, Ornillos]
96        [Rev, Peg, Splevins]
97       [Mr, Rosalind, Beyer]
98     [Dr, Wenonah, O', Hern]
99       [Ms, PATTIE, SOPPETT]
Name: name, Length: 100, dtype: object

In [80]:
## Create and populate a title column.
# This task can be solved in a couple of different ways.
# See how many solutions you can come up with in your groups
np_df["title"] = ### Your Code Here
np_df

Unnamed: 0,name,phone number,title
0,Dr rafaello Vlasenko,215-183-0246,Dr
1,miss Dido Rosbrough,913 520 0662,miss
2,MISS Drusie Merriday,0355,MISS
3,MRS Jillayne Kiloh,0087-407-5997,MRS
4,Rev VALERIA SHEVLANE,916-308-0837,Rev
...,...,...,...
95,REV gun Ornillos,0036 831 9242,REV
96,Rev Peg Splevins,212-524-1998,Rev
97,Mr Rosalind Beyer,+ 447 730 3458,Mr
98,Dr Wenonah O' Hern,0036-477-2705,Dr


In [81]:
# We want our title to be standardised and categorical.
## Convert the column to a categorical column and return all the categories that currently exist in the column
np_df["title"] = ### Your Code here
set(np_df["title"])

{'DR',
 'Dr',
 'MISS',
 'MR',
 'MRS',
 'Miss',
 'Mr',
 'Mrs',
 'Ms',
 'REV',
 'Rev',
 'dr',
 'miss',
 'mr',
 'mrs',
 'ms',
 'rev'}

In [82]:
# We see many different variants. Let's select a method to normalise the entries (e.g. uppercase all).
## Standardise the title column
np_df["title"] = ### Your Code Here
np_df

Unnamed: 0,name,phone number,title
0,Dr rafaello Vlasenko,215-183-0246,DR
1,miss Dido Rosbrough,913 520 0662,MISS
2,MISS Drusie Merriday,0355,MISS
3,MRS Jillayne Kiloh,0087-407-5997,MRS
4,Rev VALERIA SHEVLANE,916-308-0837,REV
...,...,...,...
95,REV gun Ornillos,0036 831 9242,REV
96,Rev Peg Splevins,212-524-1998,REV
97,Mr Rosalind Beyer,+ 447 730 3458,MR
98,Dr Wenonah O' Hern,0036-477-2705,DR


In [83]:
## In somewhat of a similar fashion to the above, create a new column for first name and one for last name.
# Ensure for both new columns, the names are in lower case form, apart from the first letter which is upper cased
np_df["first_name"] = ### Your Code Here
np_df["last_name"] = ### Your Code Here
np_df["first_name"] = ### Your Code Here
np_df["last_name"] = ### Your Code Here

np_df

Unnamed: 0,name,phone number,title,first_name,last_name
0,Dr rafaello Vlasenko,215-183-0246,DR,Rafaello,Vlasenko
1,miss Dido Rosbrough,913 520 0662,MISS,Dido,Rosbrough
2,MISS Drusie Merriday,0355,MISS,Drusie,Merriday
3,MRS Jillayne Kiloh,0087-407-5997,MRS,Jillayne,Kiloh
4,Rev VALERIA SHEVLANE,916-308-0837,REV,Valeria,Shevlane
...,...,...,...,...,...
95,REV gun Ornillos,0036 831 9242,REV,Gun,Ornillos
96,Rev Peg Splevins,212-524-1998,REV,Peg,Splevins
97,Mr Rosalind Beyer,+ 447 730 3458,MR,Rosalind,Beyer
98,Dr Wenonah O' Hern,0036-477-2705,DR,Wenonah,Hern
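
The title and name steps above can be sketched on three names from the example table. This is one of several possible approaches, and it assumes every name starts with a title. A name without one (such as "Tamarah Delgado") would need separate handling.

```python
import pandas as pd

# Three names taken from the example table earlier in this section
names = pd.Series(["Dr Darci Abela", "mr conant burden", "MS Doe Beavan"])

# Split each name on whitespace into a list of parts
parts = names.str.split()

# First part is the title: upper-case it and store it as a categorical
title = parts.str[0].str.upper().astype("category")

# Second part is the first name, last part is the last name
first_name = parts.str[1].str.capitalize()
last_name = parts.str[-1].str.capitalize()

print(pd.DataFrame({"title": title, "first_name": first_name, "last_name": last_name}))
```

Taking the last element as the surname drops any middle parts, which is why "O' Hern" appears as "Hern" in the output above. Whether that is acceptable depends on what the names will be used for.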


In [84]:
## Drop the name column
np_df.drop(### Your Code Here)
np_df

Unnamed: 0,phone number,title,first_name,last_name
0,215-183-0246,DR,Rafaello,Vlasenko
1,913 520 0662,MISS,Dido,Rosbrough
2,0355,MISS,Drusie,Merriday
3,0087-407-5997,MRS,Jillayne,Kiloh
4,916-308-0837,REV,Valeria,Shevlane
...,...,...,...,...
95,0036 831 9242,REV,Gun,Ornillos
96,212-524-1998,REV,Peg,Splevins
97,+ 447 730 3458,MR,Rosalind,Beyer
98,0036-477-2705,DR,Wenonah,Hern


Great! This brings us up to the 4th part of the tasks - standardising the phone numbers. Note that we keep them as strings: converting them to an int datatype would strip the leading zeros. Recall how we want our phone numbers to look afterwards: start with 00, followed by the rest of the number.

In [85]:
# Returns all the (unique) phone numbers so we can see the different types of issues they contain
set(np_df["phone number"])

{' 215-183-0246',
 ' 308-345-9376',
 ' 372 875 9524',
 ' 575-722-2771',
 ' 639 907 8152',
 ' 731-174-0498',
 ' 748 111 3081',
 ' 913 520 0662',
 '+  103-221-6651',
 '+  254-790-3115',
 '+  265-672-9320',
 '+  878-773-9133',
 '+ 116-341-3550',
 '+ 161-935-4043',
 '+ 170-474-6125',
 '+ 307-275-6256',
 '+ 397-504-0587',
 '+ 405-653-2307',
 '+ 426-945-0672',
 '+ 429-936-7430',
 '+ 447 730 3458',
 '+ 469-180-8942',
 '+ 477-531-0695',
 '+ 545-108-6180',
 '+ 580-877-2968',
 '+ 616-211-4527',
 '+ 677-669-1931',
 '+ 695-276-8827',
 '+ 711-239-5519',
 '+ 758-850-2395',
 '+ 790-814-7053',
 '+ 824 616 7624',
 '+ 832 867 6108',
 '+ 843-164-5335',
 '+ 873-735-5162',
 '+ 950-576-6969',
 '+ 984-358-0707',
 '0006-275-2109',
 '0012 717 7388',
 '00207 236 3362',
 '0026-481-0716',
 '0033-814-2371',
 '0036 831 9242',
 '0036-477-2705',
 '0041-913-3917',
 '0044-670-6014',
 '0057-698-8700',
 '00622-354-4677',
 '00675-449-3771',
 '0070-431-3190',
 '00800-106-0189',
 '0087-407-5997',
 '0096-920-9952',
 '0355',


Ok, so what issues do you see?
<details>
    <summary><b>> Click here to see issues</b></summary>
    <ul>
        <li>Numbers start differently - some start with <code>+</code>, others with <code>00</code></li>
        <li>Some numbers have spaces between groups of digits, others are hyphenated. Some of the numbers don't have 'groups' either</li>
        <li>Some numbers start with a whitespace, others with a <code>+ </code>, others with a <code>+  </code>.</li>
        <li>Some numbers are only four digits long</li>
    </ul>
</details>

There are a couple of ways we could go about formatting these strings into our desired output. Here I will guide you through a method where we iterate through the rows and apply a function to reassign the variable. Let's start by creating an intermediate function which takes a phone number and manipulates it to our desired output.

In [86]:
def standardise_phone_number(phone_number):
    
    ## if the first character is a "+", remove it.
    
    ## remove all whitespace from the phone number
    
    ## remove hyphens from the phone number
    
    ## if the number doesn't start with 00, prepend 00 to the beginning of the number
    
    ## return the phone number


In [87]:
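# We'll iterate over the rows of the dataframe, and reassign each phone number to its standardised variant
for index, row in np_df.iterrows():
    
    ## Call our standardisation function on the phone number for the current loop
    # Note: iterrows may yield a copy of each row, so we write back through np_df.at rather than through row
    np_df.at[index, "phone number"] = ### Your Code Here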
np_df

Unnamed: 0,phone number,title,first_name,last_name
0,002151830246,DR,Rafaello,Vlasenko
1,009135200662,MISS,Dido,Rosbrough
2,000355,MISS,Drusie,Merriday
3,00874075997,MRS,Jillayne,Kiloh
4,009163080837,REV,Valeria,Shevlane
...,...,...,...,...
95,00368319242,REV,Gun,Ornillos
96,002125241998,REV,Peg,Splevins
97,004477303458,MR,Rosalind,Beyer
98,00364772705,DR,Wenonah,Hern
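
Here is one possible way to fill in `standardise_phone_number`, following the commented steps in the skeleton above. The leading `strip()` is my own addition to cope with numbers that begin with a space. The sample numbers are taken from the set printed earlier.

```python
def standardise_phone_number(phone_number):
    # Trim surrounding whitespace so a leading space doesn't hide a "+"
    number = phone_number.strip()

    # If the first character is a "+", remove it
    if number.startswith("+"):
        number = number[1:]

    # Remove all whitespace from the phone number
    number = "".join(number.split())

    # Remove hyphens from the phone number
    number = number.replace("-", "")

    # If the number doesn't start with 00, prepend 00
    if not number.startswith("00"):
        number = "00" + number

    return number


examples = [" 215-183-0246", "+ 447 730 3458", "0087-407-5997", "0355"]
standardised = [standardise_phone_number(p) for p in examples]
print(standardised)
```

On a dataframe, the same function could be applied without an explicit loop via `np_df["phone number"].apply(standardise_phone_number)`, which is usually tidier than iterating over rows.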


In [88]:
# We still have some invalid numbers in our dataframe (i.e. those which were originally of length 4)
## Replace all phone numbers under 10 numbers/characters long with pd.NA
# Hint: The .loc method will be needed
np_df.loc[### Your Code Here] = ### Your Code Here
np_df

Unnamed: 0,phone number,title,first_name,last_name
0,002151830246,DR,Rafaello,Vlasenko
1,009135200662,MISS,Dido,Rosbrough
2,,MISS,Drusie,Merriday
3,00874075997,MRS,Jillayne,Kiloh
4,009163080837,REV,Valeria,Shevlane
...,...,...,...,...
95,00368319242,REV,Gun,Ornillos
96,002125241998,REV,Peg,Splevins
97,004477303458,MR,Rosalind,Beyer
98,00364772705,DR,Wenonah,Hern
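
As a hedged sketch of the `.loc` hint, the same replacement can be tried on a small Series of already-standardised numbers. The three values are taken from the output above, and the threshold of 10 characters follows the task description.

```python
import pandas as pd

# Standardised numbers from the output above, including one that is too short
phones = pd.Series(["002151830246", "000355", "00874075997"])

# Boolean mask of entries with fewer than 10 characters
too_short = phones.str.len() < 10

# Replace the short entries with a missing-value marker
phones.loc[too_short] = pd.NA
print(phones)
```

On the dataframe itself, the column name goes in the second position of `.loc`, for example `np_df.loc[mask, "phone number"] = pd.NA`, where `mask` is built the same way as `too_short`.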


In [89]:
# We'll be focusing on missing data in the next section, but let's count the number of rows with NA's in them and drop them too
null_phone_numbers = np_df["phone number"].isnull()
print("Number of null phone numbers:", null_phone_numbers.sum())

# Drop the rows which have null phone numbers
np_df = np_df.dropna(subset=["phone number"])
np_df

Number of null phone numbers: 10


Unnamed: 0,phone number,title,first_name,last_name
0,002151830246,DR,Rafaello,Vlasenko
1,009135200662,MISS,Dido,Rosbrough
3,00874075997,MRS,Jillayne,Kiloh
4,009163080837,REV,Valeria,Shevlane
5,008737355162,REV,Debera,Stryde
...,...,...,...,...
95,00368319242,REV,Gun,Ornillos
96,002125241998,REV,Peg,Splevins
97,004477303458,MR,Rosalind,Beyer
98,00364772705,DR,Wenonah,Hern


More complicated string manipulations can be performed via the use of **Regular Expressions**, otherwise known as [regex](https://docs.python.org/3/howto/regex.html). We won't look at regex here but it's important to know about its power. Essentially, regex allows us to specify rules for strings that we want to match against. It has a very wide range of use cases. Some examples include:
- identify emails in spans of text 
- validate whether a url has a correct format
- extract only the digits from a text string

When you come across tasks which non-trivially require you to clean text data, regex is the tool for the job. 
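
To give a flavour of what this looks like, here is a small sketch using Python's built-in `re` module. The sample sentence and email address are made up, and the email pattern is deliberately loose: it is an illustration, not a complete validator.

```python
import re

# Hypothetical text containing a phone number and an email address
text = "Call +86 195 448 8582 or email jo.bloggs@example.com today"

# Extract only the digits from a string: \D matches any non-digit character
digits_only = re.sub(r"\D", "", "+86 195 448 8582")
print(digits_only)

# Find email-like substrings: some characters, an @, then a dotted domain
email_pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
emails = re.findall(email_pattern, text)
print(emails)
```

The digit extraction in particular could replace several of the manual steps we wrote for the phone numbers.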