<div class="alert alert-block alert-info">
<h3> <b> Part 3. Data Quality Check </b></h3>
<div>

❓ **Your challenge**: 

- Make sure you can successfully import all the datasets: `Flights`, `Tickets` and `Airport Codes`
- Read the case statement carefully and restrict each dataset to the scope it requests
- Conduct a detailed, logically ordered data quality check, including (but not limited to): `basic information understanding`, `duplicates`, `abnormalities`, `data type correction`, `missing values`, `skewness and distribution` and `correlation`

💡 Suggested methodology:
- Use the notebook below to write and test your code step-by-step first
- Create any functions and put them in a .py file to clean/inspect data if necessary 

In [32]:
# Add any packages you need here
import pyforest  # lazy-imports common aliases such as pd and np on first use
import copy
import string
import missingno as msno
import requests
import zipfile
import os

from Airport.get_data import GetData
from Airport import check_data, fix_data

import warnings
warnings.filterwarnings("ignore")

# pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.float_format", lambda x: "%.2f" %x) # suppress scientific notation

# This can help to autoreload the packages you create
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


<div class="alert alert-block alert-success">
<h4> <b> 1. Define Data Scope </b></h4>
<div>

👉 Our goal is to define the data scope clearly and remove any data that is not relevant to the current case study. Here are the steps you can take:

> 1. Import the three datasets (`Flights`, `Tickets` and `Airport Codes`) using the `get_data.py` we created
2. Read the case statement carefully and identify the research scope for each dataset. Note that the scopes might not all be in the same paragraph
3. Use pandas to handle the processing flexibly

<details>
    <summary>💡Hint for functions or operations you may need</summary>

- pandas subset - `[]`
- reset_index()

</details>
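
As a minimal sketch of the two operations hinted at above, the snippet below filters a small made-up table with a boolean mask and then resets the index. The toy `airports` frame and its values are illustrative assumptions, not the real `Airport Codes` data.

```python
import pandas as pd

# Toy stand-in for the Airport Codes table (values are made up for illustration)
airports = pd.DataFrame({
    "TYPE": ["large_airport", "heliport", "medium_airport", "large_airport"],
    "ISO_COUNTRY": ["US", "US", "CA", "US"],
    "NAME": ["A", "B", "C", "D"],
})

# Build a boolean mask; each condition needs its own parentheses when combined with &
in_scope = airports["TYPE"].isin(["large_airport", "medium_airport"]) & (
    airports["ISO_COUNTRY"] == "US"
)

# Subset with [] and drop the old row labels so the index runs 0..n-1 again
subset = airports[in_scope].reset_index(drop=True)
print(subset)
```

Passing `drop=True` keeps the old index from being added back as a column.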

In [33]:
# Import the datasets - write your code here
data = GetData().get_data()

airport_codes = data.get("Airport_Codes")
flights = data.get("Flights")
tickets = data.get("Tickets")

Data dictionary generated


In [34]:
# Define data scope - write your code here
airport_codes = airport_codes[(airport_codes["TYPE"].isin(["large_airport", "medium_airport"])) & (airport_codes["ISO_COUNTRY"] == "US")].reset_index(drop = True)
flights = flights[flights["CANCELLED"] == 0].reset_index(drop = True)
tickets = tickets[tickets["ROUNDTRIP"] == 1].reset_index(drop = True)

<div class="alert alert-block alert-success">
<h4> <b> 2. Data Quality Check </b></h4>
<div>

#### a) Basic Information Understanding

👉 Our goal is to get a first glimpse of the data. Feel free to use any technique in your arsenal to understand the data you are dealing with

In [35]:
# Write your code here
flights.info()
tickets.info()
airport_codes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1864272 entries, 0 to 1864271
Data columns (total 16 columns):
 #   Column             Dtype  
---  ------             -----  
 0   FL_DATE            object 
 1   OP_CARRIER         object 
 2   TAIL_NUM           object 
 3   OP_CARRIER_FL_NUM  object 
 4   ORIGIN_AIRPORT_ID  int64  
 5   ORIGIN             object 
 6   ORIGIN_CITY_NAME   object 
 7   DEST_AIRPORT_ID    int64  
 8   DESTINATION        object 
 9   DEST_CITY_NAME     object 
 10  DEP_DELAY          float64
 11  ARR_DELAY          float64
 12  CANCELLED          float64
 13  AIR_TIME           object 
 14  DISTANCE           object 
 15  OCCUPANCY_RATE     float64
dtypes: float64(4), int64(2), object(10)
memory usage: 227.6+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 708600 entries, 0 to 708599
Data columns (total 12 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   ITIN_ID            708600 non-nu

#### b) Remove Duplicates

👉 It's important to remove duplicated records, as this is real-world data. A good practice is to document the number of duplicates removed

In [36]:
# Write your code here

# Check duplicated observation 
print("flights duplicates removed:", flights.shape[0] - flights.drop_duplicates().shape[0])
print("tickets duplicates removed:", tickets.shape[0] - tickets.drop_duplicates().shape[0])
print("airport_codes duplicates removed:", airport_codes.shape[0] - airport_codes.drop_duplicates().shape[0])

flights duplicates removed: 4410
tickets duplicates removed: 47564
airport_codes duplicates removed: 0


In [37]:
# Remove duplicated values
flights.drop_duplicates(inplace = True)
tickets.drop_duplicates(inplace = True)
airport_codes.drop_duplicates(inplace = True)

#### c) Abnormal Value Check & Processing

👉 One of the biggest challenges in dealing with real-world data is messiness in the dataset. To keep the data tidy, here are some steps you might need to take:

> 1. Inspect your data carefully: are there any abnormalities in terms of data type?
2. You might find some untidy values due to human or system errors; create a function to identify them
3. Based on your understanding and data cleaning experience, handle these abnormal values appropriately
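
The implementation of `check_data.check_str_punc` is not shown in this notebook, so the following is only a hypothetical sketch of what such a check could look like. It assumes a value is suspicious when its string form contains a letter or any punctuation other than a decimal point; the name `has_suspect_chars` is made up for illustration.

```python
import string

# Hypothetical helper: the notebook's real check_data.check_str_punc may differ.
# Assumption: letters and all punctuation except "." indicate a non-numeric value.
_SUSPECT_CHARS = set(string.ascii_letters) | (set(string.punctuation) - {"."})


def has_suspect_chars(value: str) -> bool:
    """Return True if the string contains a letter or punctuation other than '.'."""
    return any(ch in _SUSPECT_CHARS for ch in value)


samples = ["1947.0", "****", "-198", "Hundred", "200 $"]
flags = [has_suspect_chars(s) for s in samples]
print(dict(zip(samples, flags)))
```

Under this assumption a negative number such as `-198` is flagged because of the minus sign, which suits columns like distance or air time where negative values make no sense.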

In [38]:
# Write your code here

# Flights - DISTANCE
# Identify abnormal values
flights["DISTANCE_check"] = flights["DISTANCE"].map(lambda x: check_data.check_str_punc(str(x)))
flights[flights["DISTANCE_check"] == True]["DISTANCE"].value_counts()

****       203
-198        15
-1947        7
NAN          2
Hundred      1
Twenty       1
Name: DISTANCE, dtype: int64

In [39]:
# Process abnormal values
flights["DISTANCE"] = flights["DISTANCE"].replace({"****": np.nan,
                                                   "NAN": np.nan,
                                                   "-198": np.nan,
                                                   "-1947": np.nan,
                                                   "Hundred": 100,
                                                   "Twenty": 20})

flights["DISTANCE"] = flights["DISTANCE"].astype(float)
flights.drop("DISTANCE_check", axis = 1, inplace = True)

In [40]:
# Flights - AIR_TIME
# Identify abnormal values
flights["AIR_TIME_check"] = flights["AIR_TIME"].map(lambda x: check_data.check_str_punc(str(x)))
flights[flights["AIR_TIME_check"] == True]["AIR_TIME"].value_counts()

$$$     181
-121     10
Two       1
NAN       1
Name: AIR_TIME, dtype: int64

In [41]:
# Process abnormal values
flights["AIR_TIME"] = np.where(flights["AIR_TIME_check"] == True, np.nan, flights["AIR_TIME"])

flights["AIR_TIME"] = flights["AIR_TIME"].astype(float)
flights.drop("AIR_TIME_check", axis = 1, inplace = True)

In [42]:
# Tickets - ITIN_FARE
# Identify abnormal values
tickets["ITIN_FARE_check"] = tickets["ITIN_FARE"].map(lambda x: check_data.check_str_punc(str(x)))
tickets[tickets["ITIN_FARE_check"] == True]["ITIN_FARE"].value_counts()

200 $       677
$ 100.00    273
820$$$      256
Name: ITIN_FARE, dtype: int64

In [43]:
# Process abnormal values
tickets["ITIN_FARE"] = tickets["ITIN_FARE"].replace({"$ 100.00": 100,
                                                     "200 $": 200,
                                                     "820$$$": 820})

tickets["ITIN_FARE"] = tickets["ITIN_FARE"].astype(float)
tickets.drop("ITIN_FARE_check", axis = 1, inplace = True)

#### d) Unify Data Type

👉 Make sure all the data types make sense

In [44]:
# Write your code here
flights_cate_cols = ["ORIGIN_AIRPORT_ID", "DEST_AIRPORT_ID", "CANCELLED"]
tickets_cate_cols = ["ITIN_ID", "YEAR", "QUARTER", "ROUNDTRIP"]

# Convert to categorical 
for col in flights_cate_cols:
    flights[col] = flights[col].astype(object)

for col in tickets_cate_cols:
    tickets[col] = tickets[col].astype(object)

# Convert to datetime
flights["FL_DATE"] = pd.to_datetime(flights["FL_DATE"])

#### e) Missing Value Detection & Processing

👉 Another challenge in real-world data is missing values, here is what you can do: 

> 1. Create a function `check_missing_values` that returns a missing value table containing the column name, the absolute number of missing values and the missing percentage
2. Build a `check_data.py` module that contains the function `check_missing_values` and test it in the Jupyter environment
3. Deal with the missing values based on your understanding of the data; you decide whether to delete them or impute them with an appropriate technique
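
One possible shape for `check_missing_values` is sketched below. The actual function in `check_data.py` is not shown here, so treat this as an assumption. The column names `Attribute`, `Missing#` and `Missing%` mirror the output table in this notebook, while sorting by missing count is a guess based on that output.

```python
import pandas as pd


def check_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its missing count and missing percentage."""
    missing_count = df.isna().sum()
    missing_pct = (missing_count / len(df) * 100).round(2)
    table = pd.DataFrame({
        "Attribute": df.columns,
        "Missing#": missing_count.values,
        "Missing%": missing_pct.values,
    })
    # Assumption: show the columns with the most missing values first
    return table.sort_values(
        "Missing#", ascending=False, kind="stable"
    ).reset_index(drop=True)


demo = pd.DataFrame({
    "a": [1, None, None, 4],
    "b": [1, 2, 3, 4],
    "c": [None, "x", "y", "z"],
})
missing_table = check_missing_values(demo)
print(missing_table)
```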

In [45]:
# Write your code here

# Check missing values
check_data.check_missing_values(flights)
check_data.check_missing_values(tickets)
check_data.check_missing_values(airport_codes)

| | Attribute | Missing# | Missing% |
|---|---|---|---|
| 0 | CONTINENT | 858 | 100.0 |
| 1 | IATA_CODE | 37 | 4.31 |
| 2 | ELEVATION_FT | 3 | 0.35 |
| 3 | MUNICIPALITY | 3 | 0.35 |
| 4 | TYPE | 0 | 0.0 |
| 5 | NAME | 0 | 0.0 |
| 6 | ISO_COUNTRY | 0 | 0.0 |
| 7 | COORDINATES | 0 | 0.0 |


In [46]:
# Count the rows that would be dropped because they contain missing values
print("flights missing record removed:", flights.shape[0] - flights.dropna().shape[0])
print("tickets missing record removed:", tickets.shape[0] - tickets.dropna().shape[0])

flights missing record removed: 4832
tickets missing record removed: 1406


In [47]:
# Remove rows with missing values
flights.dropna(inplace = True)
tickets.dropna(inplace = True)

#### f) Outliers Detection & Processing

👉 You should also pay heed to outliers: values that are outside your business scope or simply make no sense. Here is what you can do: 

> 1. Check important attributes carefully: `AIR_TIME`, `DISTANCE`, `DEP_DELAY`, `ARR_DELAY` from `Flights` data and `PASSENGERS`, `ITIN_FARE` from `Tickets` data
2. Based on your observation and common sense, define upper and lower bounds and mark values outside them as outliers
3. Build and test a function `replace_outliers` that replaces all outliers with the median, and put it in a new module `fix_data.py`
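
The real `fix_data.replace_outliers` is not shown in this notebook, so the sketch below is only one way it might be written. The argument order (DataFrame, column, lower bound, upper bound), the use of `False` to mean "no bound" and the in-place update are inferred from how the function is called in this notebook. Taking the median over the in-range values only, and returning the number of replaced rows, are assumptions.

```python
import pandas as pd


def replace_outliers(df, column, lower=False, upper=False):
    """Replace values outside [lower, upper] with the in-range median, in place.

    A bound of False (or None) is treated as "no bound on this side".
    Returns the number of values replaced (an addition made for this sketch).
    """
    values = df[column]
    outlier_mask = pd.Series(False, index=df.index)
    # Identity checks so that a legitimate bound of 0 is not mistaken for False
    if lower is not False and lower is not None:
        outlier_mask |= values < lower
    if upper is not False and upper is not None:
        outlier_mask |= values > upper
    # Assumption: compute the median from the values that are not outliers
    median = values[~outlier_mask].median()
    df.loc[outlier_mask, column] = median
    return int(outlier_mask.sum())


demo = pd.DataFrame({"AIR_TIME": [10.0, 60.0, 90.0, 120.0, 5000.0]})
n_replaced = replace_outliers(demo, "AIR_TIME", 50, 1000)
print(n_replaced, demo["AIR_TIME"].tolist())
```

Missing values are left untouched by this version, since comparisons against `NaN` evaluate to `False`.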

In [48]:
# Write your code here

flights.describe()
tickets.describe()

| | PASSENGERS | ITIN_FARE |
|---|---|---|
| count | 659630.0 | 659630.0 |
| mean | 1.96 | 472.94 |
| std | 5.15 | 344.14 |
| min | 1.0 | 0.0 |
| 25% | 1.0 | 279.0 |
| 50% | 1.0 | 415.0 |
| 75% | 1.0 | 595.0 |
| max | 681.0 | 38400.0 |


In [49]:
fix_data.replace_outliers(flights, 'AIR_TIME', 50, 1000)
fix_data.replace_outliers(flights, 'DISTANCE', 50, 6000)
fix_data.replace_outliers(flights, 'DEP_DELAY', False, 1750)
fix_data.replace_outliers(flights, 'ARR_DELAY', False, 2000)

fix_data.replace_outliers(tickets, 'PASSENGERS', 0, 300)
fix_data.replace_outliers(tickets, 'ITIN_FARE', 20, 15000)

<div class="alert alert-block alert-success">
<h4> <b> 3. Save Cleaned Data </b></h4>
<div>

👉 Save all the datasets as `.csv` files with the suffix `_cleaned` in the `Raw Data` folder

In [51]:
# Write your code here
# Define the path up front so it exists whether or not the folder had to be created
import_file_path = os.path.join(os.getcwd(), "Raw Data")

if os.path.exists(import_file_path):
    print("Import file path is:", import_file_path)
else: 
    os.mkdir(import_file_path)
    print("Raw Data folder created")

# flights.to_csv(os.path.join(import_file_path, "flights_cleaned.csv"), index = False)
# tickets.to_csv(os.path.join(import_file_path, "tickets_cleaned.csv"), index = False)
# airport_codes.to_csv(os.path.join(import_file_path, "airport_codes_cleaned.csv"), index = False)

Import file path is: F:\Anaconda Files\Datark\商业分析实战项目\Code\Raw Data
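
The saving step could also be wrapped in a small helper, sketched below under a few assumptions. The function name `save_cleaned` and the dictionary-of-DataFrames input are hypothetical, and the demo writes to a temporary directory rather than the project's `Raw Data` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_cleaned(datasets, folder):
    """Write each DataFrame to <folder>/<name>_cleaned.csv and return the paths."""
    folder = Path(folder)
    # Create the folder if needed; no error if it already exists
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, df in datasets.items():
        path = folder / f"{name}_cleaned.csv"
        df.to_csv(path, index=False)
        paths.append(path)
    return paths


# Demo in a temporary directory so nothing in the project is overwritten
with tempfile.TemporaryDirectory() as tmp:
    demo = {"flights": pd.DataFrame({"DISTANCE": [100.0, 250.0]})}
    written = save_cleaned(demo, Path(tmp) / "Raw Data")
    written_names = [p.name for p in written]
    reloaded = pd.read_csv(written[0])

print(written_names)
```

In the notebook itself, the call might look like `save_cleaned({"flights": flights, "tickets": tickets, "airport_codes": airport_codes}, "Raw Data")`.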
