<div class="alert alert-block alert-info">
<h3> <b> Part 2. Data Import </b></h3>
</div>

❓ **Your challenge**: 

- Once your data has been downloaded from GitHub and unzipped into the `Raw Data` folder, import it into your Jupyter notebook environment for inspection
- Because there are multiple .csv files, a good practice is to store them all in a `dict`, a data structure made of `key: value` pairs. This lets us access any dataset simply by specifying its key name
- Our goal is to build a `get_data.py` module containing a function called `get_data` that retrieves the data for us
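
To make the `dict` idea concrete, here is a minimal sketch using two tiny hand-made DataFrames. The column names and values are invented for illustration and are not taken from the real csv files.

```python
import pandas as pd

# Toy stand-ins for the real datasets (contents are made up for illustration)
toy_data = {
    "Airport_Codes": pd.DataFrame({"code": ["AAA", "BBB"]}),
    "Flights": pd.DataFrame({"origin": ["AAA"], "destination": ["BBB"]}),
}

# Access a DataFrame by its key name
flights = toy_data["Flights"]
print(flights.shape)

# dict.get returns None instead of raising KeyError when the key is missing
missing = toy_data.get("Tickets")
print(missing is None)
```

Square-bracket access raises a `KeyError` for an unknown key, while `.get` returns `None`, so a typo in a key name fails quietly with `.get` and only surfaces later.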

💡 Suggested methodology:
- Use the notebook below to write and test your code step-by-step first
- Then copy the code into `get_data.py` once you are confident the logic is correct
- Finally, import the `get_data.py` module to confirm that it works

In [1]:
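# Add any packages you need here
import pyforest  # lazy-imports common libraries such as pandas (pd) on first use
import pandas as pd  # explicit import so `pd` also works outside the notebook
import copy
import string
import missingno as msno
import requests
import zipfile
import os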

import warnings
warnings.filterwarnings("ignore")

# pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.float_format", lambda x: "%.2f" % x)  # suppress scientific notation

# This can help to autoreload the packages you create
%load_ext autoreload
%autoreload 2

<div class="alert alert-block alert-success">
<h4> <b> 1. Code get_data.py </b></h4>
</div>

#### a) `get_data` Function

👉 Our goal is to create a function called `get_data` that returns a data dict containing 3 `key: value` pairs. The keys are the dataset names (`Airport_Codes`, `Flights` and `Tickets`) and the values are the corresponding DataFrames. You may follow the steps below:

> 1. Set your default working directory to the `Raw Data` folder
> 2. Check whether the `Raw Data` folder is empty; if it is, print a warning message
> 3. Store your data as `key: value` pairs in a dict called `data`

<details>
    <summary>💡Hint for functions or operations you may need</summary>

- os.path.join() 
- os.listdir()
- list comprehension
- zip

</details>
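
The hinted operations can be tried in isolation before writing the full function. The sketch below is only an illustration: it creates a throwaway folder with placeholder files (the file names are assumptions, and the files contain no real data) so that it runs anywhere.

```python
import os
import tempfile

# Build a throwaway folder with a few files so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["Flights.csv", "Tickets.csv", "notes.txt"]:
        with open(os.path.join(tmp_dir, name), "w") as f:
            f.write("placeholder\n")

    # os.listdir + list comprehension: keep only the csv files (sorted for a stable order)
    file_names = sorted([f for f in os.listdir(tmp_dir) if f.endswith(".csv")])

    # os.path.splitext drops the extension to give a clean key name
    key_names = [os.path.splitext(f)[0] for f in file_names]

    # zip pairs each key with its file; os.path.join builds the full path
    paths = {k: os.path.join(tmp_dir, f) for k, f in zip(key_names, file_names)}

print(key_names)
```

Filtering on the `.csv` extension first means stray files in the folder, such as the `notes.txt` placeholder here, never reach the loading step.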

In [6]:
# Create get_data function - write your code here
def get_data():
    """
    Return a Python dict containing the required csv files.
    Keys are the file names without the .csv extension;
    values are pandas.DataFrames loaded from those files.
    """
    # root_dir = os.path.dirname(os.path.dirname(__file__))
    # csv_path = os.path.join(root_dir, "data")

    csv_path = os.path.join(os.getcwd(), "Raw Data")

    # Generate file name list (csv files only)
    file_names = [f for f in os.listdir(csv_path) if f.endswith(".csv")]

    # Check that the data folder contains at least one csv file
    assert len(file_names) > 0, "No csv file in data folder, please check!"

    # Drop the extension to get the key names
    key_names = [os.path.splitext(f)[0] for f in file_names]

    # Create the dictionary
    data = {}
    for k, f in zip(key_names, file_names):
        data[k] = pd.read_csv(os.path.join(csv_path, f))

    print("Data dictionary generated")
    return data

#### b) Test Your `get_data` Function

In [7]:
# Test get_data function with the following code here
data = get_data()
data.keys()

Data dictionary generated


In [14]:
# Get datasets
airport_codes = data.get("Airport_Codes")
flights = data.get("Flights")
tickets = data.get("Tickets")

print("Airport_Codes shape: ", airport_codes.shape)
print("Flights shape: ", flights.shape)
print("Tickets shape: ", tickets.shape)

Airport_Codes shape:  (55369, 8)
Flights shape:  (1915886, 16)
Tickets shape:  (1167285, 12)
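
Since all datasets live in one dict, the same inspection can also be written as a loop rather than one line per dataset. This is a minimal sketch using made-up toy frames in place of the real data; `summarize_shapes` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd


def summarize_shapes(data):
    """Return a {name: (rows, columns)} dict for every DataFrame in `data`."""
    return {name: df.shape for name, df in data.items()}


# Toy frames standing in for the real datasets (values are invented)
toy_data = {
    "Airport_Codes": pd.DataFrame({"code": ["AAA", "BBB", "CCC"]}),
    "Tickets": pd.DataFrame({"fare": [100.0, 250.5], "passengers": [1, 2]}),
}

for name, shape in summarize_shapes(toy_data).items():
    print(f"{name} shape: {shape}")
```

The loop keeps working unchanged if more csv files are added to the folder later.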


<div class="alert alert-block alert-success">
<h4> <b>3. Build & Test get_data.py </b></h4>
</div>

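👉 Convert your Jupyter notebook code into a .py file for later use, then test it below. Note that the test cell imports a `GetData` class from the `Airport` package and calls its `get_data` method, so your module needs to expose the function that way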

In [3]:
# Test get_data.py below
from Airport.get_data import GetData

data = GetData().get_data()
data.keys()
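
The import above expects a `GetData` class with a `get_data` method inside an `Airport` package. One possible layout for `Airport/get_data.py` is sketched below. The `csv_path` argument and its default are assumptions made for illustration; adjust them to match where your `Raw Data` folder actually sits relative to the module.

```python
import os

import pandas as pd


class GetData:
    """Load every csv file in a folder into a dict of DataFrames."""

    def __init__(self, csv_path=None):
        # Assumption: default to a "Raw Data" folder in the current working directory
        self.csv_path = csv_path or os.path.join(os.getcwd(), "Raw Data")

    def get_data(self):
        """Return {file name without extension: DataFrame} for each csv file."""
        file_names = [f for f in os.listdir(self.csv_path) if f.endswith(".csv")]
        if not file_names:
            raise FileNotFoundError(f"No csv file found in {self.csv_path!r}")

        return {
            os.path.splitext(f)[0]: pd.read_csv(os.path.join(self.csv_path, f))
            for f in file_names
        }
```

This sketch raises an exception rather than using `assert`, because assertions are stripped when Python runs with the `-O` flag, which would silently disable the check.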