# Lab: Load Data

## Prerequisites
* Copy the `data/hotels/reservations.csv` file into your S3 bucket.
* Create a new Data Connection in RHODS and specify the S3 connection parameters.



In [None]:
import os
import pandas as pd

## Read Local Files

You can read a local CSV file with Pandas by using the `read_csv` function.
This function loads the file contents into a `DataFrame` object.
A data frame is the main data abstraction in Pandas.

Read the `data/hotels/reservations.csv` file.

In [None]:
pd.read_csv("data/hotels/reservations.csv")

You can specify how Pandas reads the CSV file.

Reload the CSV by parsing the dates and specifying the index column.

In [None]:
reservations = pd.read_csv(
    "data/hotels/reservations.csv",
    index_col=["id"],
    parse_dates=["reservation_date", "arrival_date"],
)
reservations
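
To see what these two parameters change, the following sketch reads the same small CSV twice from an in-memory string. The sample rows and the `guests` column are hypothetical; only the `id`, `reservation_date`, and `arrival_date` column names come from this lab.

```python
import io

import pandas as pd

# Hypothetical sample rows; not the contents of the real reservations file.
csv_text = """id,reservation_date,arrival_date,guests
101,2023-01-05,2023-02-01,2
102,2023-01-07,2023-02-03,4
"""

# Default parsing: "id" is an ordinary column and the dates stay as text.
raw = pd.read_csv(io.StringIO(csv_text))

# Same data, with "id" as the index and both date columns parsed.
parsed = pd.read_csv(
    io.StringIO(csv_text),
    index_col=["id"],
    parse_dates=["reservation_date", "arrival_date"],
)

print(raw.dtypes)
print(parsed.dtypes)
print(parsed.index)
```

Comparing the two `dtypes` listings shows that only the second call produces datetime columns, and that `id` moves from the columns to the index.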

Inspect the index of the data frame.

In [None]:
reservations.index

Inspect the type of the `reservation_date` column.
The type is `Series`.
A series is a one-dimensional array of data.
This array is stored in the `values` property of the series.

A series also includes the name, which in this case corresponds to the column name, and the index that identifies each row.
In this case, the series index is the same as the data frame index.

In [None]:
print(type(reservations.reservation_date))

print(type(reservations.reservation_date.values))

print(reservations.reservation_date.name)

print(reservations.reservation_date.index)

Inspect the column. The data type (`dtype`) of its values is `datetime64`.

In [None]:
reservations.reservation_date
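
The relationship between a data frame and its columns can also be checked on a small, self-contained example. This is a minimal sketch with hypothetical values standing in for the real reservations data.

```python
import pandas as pd

# Hypothetical two-row data frame that mimics the reservations structure.
frame = pd.DataFrame(
    {"reservation_date": pd.to_datetime(["2023-01-05", "2023-01-07"])},
    index=pd.Index([101, 102], name="id"),
)

# Selecting a column returns a Series.
column = frame["reservation_date"]

print(type(column))                      # the Series class
print(type(column.values))               # the underlying NumPy array
print(column.name)                       # the column name
print(column.index.equals(frame.index))  # the index matches the data frame index
print(column.dtype)                      # a datetime64 dtype
```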

## Read Remote Files

You can also read a remote CSV file, such as from S3.
If you need to read remote files, then you must install the `fsspec` library.
If you want to read from an S3 bucket, then you must also install the `s3fs` library.

<div class="alert alert-block alert-info">
<b>Tip:</b> You can skip this part if you have not configured an S3 data connection.
</div>

Install the `fsspec` and `s3fs` libraries.

In [None]:
%pip install fsspec s3fs

Read the CSV file from the S3 bucket.
To configure the connection, use the environment variables injected by the RHODS data connection.

In [None]:
s3_bucket = os.getenv("AWS_S3_BUCKET")
s3_key = os.getenv("AWS_ACCESS_KEY_ID")
s3_secret = os.getenv("AWS_SECRET_ACCESS_KEY")
s3_endpoint = os.getenv("AWS_S3_ENDPOINT")
storage_options = {
    "key": s3_key,
    "secret": s3_secret,
    "endpoint_url": s3_endpoint,
}

pd.read_csv(f"s3://{s3_bucket}/reservations.csv", storage_options=storage_options)
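
Because `os.getenv` returns `None` for a variable that is not set, a missing data connection only shows up later as a confusing S3 error. One possible safeguard, sketched below, is to check the variables before building the options. The `load_s3_settings` helper and the example values are hypothetical and not part of RHODS or Pandas.

```python
import os

# Environment variables used in this lab, mapped to storage option names.
OPTION_VARIABLES = {
    "key": "AWS_ACCESS_KEY_ID",
    "secret": "AWS_SECRET_ACCESS_KEY",
    "endpoint_url": "AWS_S3_ENDPOINT",
}
BUCKET_VARIABLE = "AWS_S3_BUCKET"


def load_s3_settings(env=None):
    """Return (bucket, storage_options), or fail early if a variable is unset."""
    env = os.environ if env is None else env
    required = [BUCKET_VARIABLE, *OPTION_VARIABLES.values()]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing data connection variables: " + ", ".join(missing))
    options = {option: env[name] for option, name in OPTION_VARIABLES.items()}
    return env[BUCKET_VARIABLE], options


# Hypothetical values, used only to show the shape of the result.
example_env = {
    "AWS_S3_BUCKET": "example-bucket",
    "AWS_ACCESS_KEY_ID": "example-key",
    "AWS_SECRET_ACCESS_KEY": "example-secret",
    "AWS_S3_ENDPOINT": "https://s3.example.com",
}
bucket, options = load_s3_settings(example_env)
print(bucket)
print(options)
```

In the notebook, calling `load_s3_settings()` without arguments would read the real environment and raise an error that names the missing variables.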

### Reading Large Files in Chunks

If the CSV file is large, then you can read the file in chunks, by using the `chunksize` parameter.

Read and print the file one row at a time by setting `chunksize=1`.

In [None]:
url = f"s3://{s3_bucket}/reservations.csv"

with pd.read_csv(url, storage_options=storage_options, chunksize=1) as reader:
    for chunk in reader:
        print(chunk)
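
Chunked reading is mostly useful when each chunk is reduced to a small result, so that the whole file never has to fit in memory. The following sketch illustrates the idea with a hypothetical in-memory CSV in place of the S3 URL; the `guests` column is an assumption made for illustration.

```python
import io

import pandas as pd

# Hypothetical data; an in-memory file replaces the S3 URL.
csv_text = "id,guests\n1,2\n2,4\n3,1\n4,3\n5,2\n"

total_rows = 0
total_guests = 0

# Each chunk is a DataFrame with at most two rows.
with pd.read_csv(io.StringIO(csv_text), chunksize=2) as reader:
    for chunk in reader:
        total_rows += len(chunk)
        total_guests += int(chunk["guests"].sum())

print(total_rows, total_guests)
```

The same loop should work with the S3 URL and `storage_options` from the previous cells, provided that the columns you aggregate exist in the file.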

For more details, see the Pandas user guide section on reading and writing remote files: https://pandas.pydata.org/docs/user_guide/io.html?highlight=storage_options#reading-writing-remote-files