# Feature Engineering with Pandas

This notebook demonstrates how to use AI Platform notebooks to perform feature engineering on a dataset using `pandas`.

You will load the data into a `Pandas DataFrame`, clean up the columns into a usable format, and then restructure the data into feature and target data columns. 

Before you jump in, let's cover some of the different tools you'll be using:

+ [AI Platform](https://cloud.google.com/ai-platform) consists of tools that allow machine learning developers and data scientists to run their ML projects quickly and cost-effectively.

+ [Cloud Storage](https://cloud.google.com/storage/) is a unified object storage service for developers and enterprises, from live data serving to data analytics/ML to data archiving.

+ [Cloud SDK](https://cloud.google.com/sdk/) is a command-line tool that lets you interact with Google Cloud products. This notebook introduces several `gcloud` and `gsutil` commands, which are part of the Cloud SDK. Note that shell commands in a notebook must be prefixed with a `!`.

+ [Pandas](https://pandas.pydata.org/) is a data analysis and manipulation tool built on top of the Python programming language.

# Set up your environment

## Enable the required APIs

To use AI Platform, run the following command to enable the required API:

In [None]:
!gcloud services enable ml.googleapis.com

# Load the data

## Import libraries
Run the following cell to import the required libraries.

+ `Pandas`: to store and manipulate the dataset.
+ `Google Cloud Storage`: to retrieve the dataset from the GCS bucket where the dataset is stored.

In [None]:
import pandas as pd
from google.cloud import storage

## Define constants
Define the constants used to locate and download the cleaned data.

+ `BUCKET_NAME`: the name of your Google Cloud Storage bucket where the cleaned data is stored.
+ `BLOB_PREFIX`: indicates the folder where the files are stored.
+ `DIR_NAME`: holds the name of the local folder where the files will be downloaded to.

In [None]:
BUCKET_NAME = 'your-bucket-name'
BLOB_PREFIX = 'clean_citibike_data.csv.gz/part'
DIR_NAME = 'citibike_data'

## Load the data

Run the following command to create a local folder where the dataset files will be stored.

In [None]:
!mkdir -p $DIR_NAME

Since the data cleaning job wrote multiple partitioned files to the GCS bucket, you will need to loop through each file to access its contents. The following cell retrieves all of the files that match the `BLOB_PREFIX` defined above and downloads them. It also builds a list of the local file names so they can be referenced later when loading the data into a dataframe.

In [None]:
# Create storage client
storage_client = storage.Client()

# List files in the bucket with the specified prefix
blobs = storage_client.list_blobs(BUCKET_NAME, prefix=BLOB_PREFIX)

# Go through the files and save them into the local folder
filenames = []
for i, blob in enumerate(blobs):
    filename = f'{DIR_NAME}/citibike{i}.csv.gz'
    blob.download_to_filename(filename)
    filenames.append(filename)
    print('Downloaded file: ' + str(blob.name))

Now, you can load the files into a dataframe. 

First, define the schema. From this dataset, you will need 4 columns:

+ **starttime**: to extract the date and day of the week on which the trip started
+ **stoptime**: to extract the date and day of the week on which the trip ended
+ **start_station_id**: to find out how many trips started at a station
+ **end_station_id**: to find out how many trips ended at a station

In [None]:
COLUMNS = (
    'starttime',
    'stoptime',
    'start_station_id',
    'end_station_id',
)

Next, run the following cell to loop through the files downloaded to the local folder, create a `Pandas DataFrame`, and view the first ten rows.

In [None]:
# Collect one dataframe per file
frames = []

# Loop through the files
# For each file: load the contents into a dataframe and add it to the list
for file in filenames:
    print('Processing file: ' + file)
    frames.append(pd.read_csv(file, compression='gzip', usecols=[1, 2, 3, 7], header=None, names=COLUMNS, low_memory=False))

# Concatenate all of the dataframes once, which avoids re-copying the data on every iteration
training_data = pd.concat(frames)

training_data.head(10)
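The way `usecols` and `names` interact on a headerless file can be checked on a small in-memory sample. The sketch below uses two made-up rows with eight fields each; the values and the unused fields are illustrative assumptions, not the real Citibike layout.

```python
import io

import pandas as pd

COLUMNS = ('starttime', 'stoptime', 'start_station_id', 'end_station_id')

# Two made-up, headerless rows with eight fields each (positions 0-7)
raw = (
    "100,2019-01-01T08:00:00,2019-01-01T08:10:00,72.0,a,b,c,79.0\n"
    "200,2019-01-02T09:00:00,2019-01-02T09:30:00,79.0,a,b,c,72.0\n"
)

# usecols picks positions 1, 2, 3 and 7; names labels them in that order
sample = pd.read_csv(io.StringIO(raw), header=None, usecols=[1, 2, 3, 7], names=COLUMNS)
print(sample)
```

Because `header=None` is set, the first row is treated as data, and the four names are assigned to the four selected positions in ascending order.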

## Extracting features

The following cell will clean up the dataset in a few ways:

+ Any rows with NaN values will be dropped
+ The station IDs will be converted from floats to integers
+ The time of day will be removed from the start time and stop time columns, since only the date is needed

In [None]:
# Drop rows with NaN values
training_data = training_data.dropna()

# Convert station IDs to integers
training_data['start_station_id'] = training_data['start_station_id'].astype('int32')
training_data['end_station_id'] = training_data['end_station_id'].astype('int32')

# Remove time from the time columns
training_data['starttime'] = training_data['starttime'].apply(lambda t: t.split("T")[0])
training_data['stoptime'] = training_data['stoptime'].apply(lambda t: t.split("T")[0])

training_data.head(10)
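To see what each cleaning step does, here is a minimal sketch on a three-row toy dataframe (the values are invented for illustration). It uses the vectorized `.str.split` accessor in place of `apply` with a lambda; that is an alternative spelling, not the approach used in the cell above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'starttime': ['2019-01-01T08:00:00', '2019-01-01T09:15:00', None],
    'stoptime': ['2019-01-01T08:10:00', '2019-01-01T09:40:00', '2019-01-02T07:00:00'],
    'start_station_id': [72.0, 79.0, 72.0],
    'end_station_id': [79.0, np.nan, 72.0],
})

# Rows 1 and 2 each contain a missing value, so only row 0 survives
toy = toy.dropna()

# Station IDs were read as floats because of the missing values
toy['start_station_id'] = toy['start_station_id'].astype('int32')
toy['end_station_id'] = toy['end_station_id'].astype('int32')

# Keep only the date part that precedes the "T" separator
toy['starttime'] = toy['starttime'].str.split('T').str[0]
toy['stoptime'] = toy['stoptime'].str.split('T').str[0]

print(toy)
```

Dropping the NaN rows first matters: casting a column that still contains NaN to an integer type raises an error.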

Next, you will count the number of trips taken from each station per day. The `groupby` function from `Pandas`, followed by `size`, counts how many rows share each combination of start date and start station ID. Then, the `pivot` function from `Pandas` converts the station IDs into columns (since they are the target data), with the counts as the values.

You will also use the `add_prefix` function to rename the columns so it is clear that the values count trips that started from the station.

In [None]:
# Count the trips for each combination of start date and start station ID
bikes_taken = training_data.groupby(['starttime', 'start_station_id']).size().reset_index().rename(columns={0: 'count'})

# Pivot to make station ID the columns and rename them
bikes_taken = bikes_taken.pivot(index='starttime', columns='start_station_id', values='count').add_prefix('started_at_')

bikes_taken.head(10)
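The count-then-pivot pattern is easier to follow on a tiny example. The sketch below applies the same calls to four made-up trips; the dates and station IDs are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    'starttime': ['2019-01-01', '2019-01-01', '2019-01-01', '2019-01-02'],
    'start_station_id': [72, 72, 79, 72],
})

# One row per (date, station) pair, with the number of trips in 'count'
counts = toy.groupby(['starttime', 'start_station_id']).size().reset_index().rename(columns={0: 'count'})

# One row per date, one column per station
wide = counts.pivot(index='starttime', columns='start_station_id', values='count').add_prefix('started_at_')

print(wide)
```

Station 79 has no trips on 2019-01-02, so that cell is NaN after the pivot. This is why the combined dataframe is filled with zeros later in the notebook.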

Running the following cell will repeat the same process as above, but will generate values for the number of trips that have ended at the station.

In [None]:
# Count the trips for each combination of stop date and end station ID
bikes_deposit = training_data.groupby(['stoptime', 'end_station_id']).size().reset_index().rename(columns={0: 'count'})

# Pivot to make station ID the columns and rename them
bikes_deposit = bikes_deposit.pivot(index='stoptime', columns='end_station_id', values='count').add_prefix('ending_at_')

bikes_deposit.head(10)

The following cell will combine the dataframes for bikes taken and bikes deposited. The NaN values are then filled with 0, since a missing value indicates that no trips started or ended at that station on that date.

In [None]:
# Combine the dataframes
# Set the index as row number instead of the date
# Fill the NaN values with 0's
training_df = pd.concat([bikes_taken, bikes_deposit], axis=1)\
                .reset_index()\
                .fillna(0)

# Rename the column with start and end dates
training_df.rename(columns={'index': 'date'}, inplace=True)

training_df.head(10)
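As a rough illustration of how the column-wise concatenation aligns rows by date, the sketch below combines two hand-built dataframes whose dates only partly overlap. The values are made up, and `rename_axis('date')` is used here as one possible way to name the date column explicitly before resetting the index; it is an alternative to renaming the column afterwards.

```python
import pandas as pd

taken = pd.DataFrame(
    {'started_at_72': [2, 1]},
    index=pd.Index(['2019-01-01', '2019-01-02'], name='starttime'),
)
deposit = pd.DataFrame(
    {'ending_at_72': [1, 3]},
    index=pd.Index(['2019-01-02', '2019-01-03'], name='stoptime'),
)

# axis=1 places the columns side by side and aligns rows on the index (the date)
combined = (
    pd.concat([taken, deposit], axis=1)
    .rename_axis('date')   # name the index explicitly
    .reset_index()         # move the date into a regular column
    .fillna(0)             # dates missing from one side get a count of 0
)

print(combined)
```

Dates that appear in only one of the inputs still get a row in the result, with 0 in the columns that came from the other input.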

Now, you can separate the date column into more features: the year, month, and day. Then, the date column can be dropped.

In [None]:
# Split the date into its parts and define the names of the new columns
date_columns = training_df['date'].str.split('-', expand=True)
date_names = ['year', 'month', 'day']

# Insert each column at position 0, so the final column order is day, month, year
for i in range(3):
    training_df.insert(0, date_names[i], date_columns[i])
    training_df[date_names[i]] = training_df[date_names[i]].astype('int32')

# Remove the date column from the dataframe
training_df = training_df.drop('date', axis=1)

training_df.head(10)
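The split-and-insert step can be traced on a two-row toy dataframe (values invented for illustration). Because each new column is inserted at position 0, the columns end up in the reverse of the order in which they were added.

```python
import pandas as pd

toy = pd.DataFrame({
    'date': ['2019-01-01', '2019-12-31'],
    'started_at_72': [2.0, 5.0],
})

# expand=True returns one column per date part, labelled 0, 1, 2
parts = toy['date'].str.split('-', expand=True)

for i, name in enumerate(['year', 'month', 'day']):
    toy.insert(0, name, parts[i].astype('int32'))

toy = toy.drop('date', axis=1)

print(list(toy.columns))  # ['day', 'month', 'year', 'started_at_72']
```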

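The following cell will extract the day of the week from the date information using the `datetime` module from the Python standard library. The `weekday()` method returns an integer, where Monday is 0 and Sunday is 6.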

In [None]:
import datetime


def find_weekday(row):
    ''' Creates a datetime object from a row and returns the day of the week (Monday=0, Sunday=6) '''
    date = datetime.datetime(int(row['year']), int(row['month']), int(row['day']))
    return date.weekday()

# Apply the find_weekday() function to every row of the dataset
weekday_col = training_df.apply(find_weekday, axis=1)

# Insert the weekday column at the start
training_df.insert(0, 'weekday', weekday_col)

training_df.head(10)
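Applying a Python function to every row can be slow on large dataframes. As a possible alternative, the sketch below builds the dates with `pd.to_datetime`, which accepts a dataframe with `year`, `month`, and `day` columns, and reads the weekday through the `.dt` accessor. It is shown on a made-up three-row dataframe and compared against the row-wise approach.

```python
import datetime

import pandas as pd

toy = pd.DataFrame({'year': [2019, 2019, 2020], 'month': [1, 1, 2], 'day': [1, 6, 29]})


def find_weekday(row):
    ''' Row-wise version, matching the approach used in the notebook '''
    return datetime.datetime(int(row['year']), int(row['month']), int(row['day'])).weekday()


row_wise = toy.apply(find_weekday, axis=1)

# Vectorized version: assemble the dates in one call, then read the weekday
vectorized = pd.to_datetime(toy[['year', 'month', 'day']]).dt.weekday

print(vectorized.tolist())  # [1, 6, 5] -> Tuesday, Sunday, Saturday
```

Both versions use the same convention, with Monday as 0 and Sunday as 6.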

You are done with feature engineering for the Citibike dataset! Now, you can move on to the external datasets you ingested into BigQuery to obtain more features.