# Python

## JupyterLab and notebooks

[Jupyter](https://docs.jupyter.org/en/latest/) is an open source software project which includes [JupyterLab](https://jupyterlab.readthedocs.io/en/latest/). JupyterLab is the notebook editing environment that we are working in. A [notebook](https://jupyterlab.readthedocs.io/en/latest/user/notebook.html) is a document which combines runnable code, text and images. Notebooks are great for beginners as they allow rapid experimentation.

To run a cell of code, you can use the keyboard shortcut: `Shift-Enter`

## Importing libraries

We will be working with [`pandas`](https://pandas.pydata.org/docs/user_guide/index.html#user-guide).

In [None]:
import pandas as pd

## Reading the CSV file using pandas

Pandas has a function called [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) which can be used to read CSV files. We set the `dtype` argument for the `store_and_fwd_flag` column because it has some missing values, and pandas needs an explicit type to read it without raising a warning. We set the `parse_dates` argument to a list of the columns that we want to convert into datetime objects.

In [None]:
# Avoid warning by setting type on store_and_fwd_flag

df = pd.read_csv(
    filepath_or_buffer='taxi_tripdata.csv', 
    dtype={'store_and_fwd_flag': str}, 
    parse_dates=['lpep_pickup_datetime', 'lpep_dropoff_datetime']
)
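
To see what these two arguments do without the full dataset, here is a minimal sketch that reads a made-up three-row sample from memory. The sample rows and the names `sample_csv`, `sample`, `missing_flags` and `pickup_is_datetime` are illustrative assumptions and are not part of `taxi_tripdata.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for taxi_tripdata.csv
sample_csv = io.StringIO(
    "store_and_fwd_flag,lpep_pickup_datetime,lpep_dropoff_datetime\n"
    "N,2021-07-01 00:30:52,2021-07-01 00:35:36\n"
    ",2021-07-01 00:25:36,2021-07-01 01:01:31\n"
    "Y,2021-07-01 00:05:58,2021-07-01 00:12:00\n"
)

sample = pd.read_csv(
    sample_csv,
    dtype={'store_and_fwd_flag': str},
    parse_dates=['lpep_pickup_datetime', 'lpep_dropoff_datetime'],
)

# The empty flag is read as a missing value
missing_flags = sample['store_and_fwd_flag'].isna().sum()

# The pickup column is parsed into a datetime column rather than plain text
pickup_is_datetime = pd.api.types.is_datetime64_any_dtype(
    sample['lpep_pickup_datetime']
)

print(missing_flags, pickup_is_datetime)
```

Running `sample.dtypes` on the result is a quick way to check the type of every column after loading.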

## Data exploration

Have a look at the first 5 rows using [DataFrame.head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html).

In [None]:
df.head()

## Calculating the duration of trips

Subtract `lpep_pickup_datetime` from `lpep_dropoff_datetime`, then use [Series.dt.total_seconds](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.total_seconds.html) to get the duration in seconds, and finally divide by 60 to convert seconds to minutes.

In [None]:
duration = df['lpep_dropoff_datetime'] - df['lpep_pickup_datetime']
duration_seconds = duration.dt.total_seconds()
duration_minutes = duration_seconds / 60

df['duration'] = duration_minutes
df[['VendorID', 'duration']].head()
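
The same three steps can be checked by hand on a couple of made-up timestamps. This is only a sketch: the values and the names `pickup`, `dropoff`, `elapsed` and `minutes` are assumptions chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Two hypothetical trips: one of 15 min 30 s and one of 45 min
pickup = pd.Series(pd.to_datetime(['2021-07-01 00:30:00', '2021-07-01 08:15:00']))
dropoff = pd.Series(pd.to_datetime(['2021-07-01 00:45:30', '2021-07-01 09:00:00']))

# Subtracting two datetime Series gives a Series of time differences
elapsed = dropoff - pickup

# Convert each difference to seconds, then to minutes
minutes = elapsed.dt.total_seconds() / 60

print(minutes.tolist())  # [15.5, 45.0]
```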


## Calculating the speed of trips

Divide `trip_distance` by `duration` and then multiply by 60 to convert miles per minute to miles per hour.

In [None]:
df['speed'] = df['trip_distance'] / df['duration'] * 60
df[['VendorID', 'duration', 'speed']].head()
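
One thing to be aware of with this calculation is that pandas does not raise an error when a floating-point column is divided by zero. The sketch below uses a made-up `trips` table (not the real dataset) to show what happens to rows with a zero duration.

```python
import numpy as np
import pandas as pd

# Hypothetical trips: a normal trip, a zero-duration trip,
# and a trip with zero distance and zero duration
trips = pd.DataFrame({
    'trip_distance': [3.0, 1.2, 0.0],
    'duration': [12.0, 0.0, 0.0],
})

trips['speed'] = trips['trip_distance'] / trips['duration'] * 60

normal_speed = trips['speed'].iloc[0]                     # 15.0 mph
zero_duration_is_inf = np.isinf(trips['speed'].iloc[1])   # distance / 0 gives inf
zero_over_zero_is_nan = np.isnan(trips['speed'].iloc[2])  # 0 / 0 gives NaN

print(normal_speed, zero_duration_is_inf, zero_over_zero_is_nan)
```

Values like these are one reason to look at the summary statistics before trusting a derived column.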

## Clean the data

Have a look at the summary statistics for the `duration`, `speed`, and `trip_distance` columns.


In [None]:
df[['duration', 'speed', 'trip_distance']].describe().round(1).astype(str)

Notice that the minimum duration and trip distance are both zero. This suggests that some rows are not real trips, so we can remove them from the dataset to prevent them from affecting any later analysis.

DataFrames can be filtered using [boolean vectors from expressions](https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing).


In [None]:
df = df[df.duration > 0]
df = df[df.trip_distance > 0]
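
As an alternative sketch, the two conditions can be combined into a single boolean Series with `&` before filtering. The `trips` table and the names `is_real_trip` and `real_trips` are made up for illustration. Note that each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical trips: only the first and last rows have a positive
# duration and a positive distance
trips = pd.DataFrame({
    'duration': [10.0, 0.0, 5.0, 7.5],
    'trip_distance': [2.1, 1.0, 0.0, 3.3],
})

# Each comparison produces a Series of True/False values;
# & keeps a row only when both conditions are True
is_real_trip = (trips['duration'] > 0) & (trips['trip_distance'] > 0)
real_trips = trips[is_real_trip]

print(is_real_trip.tolist())  # [True, False, False, True]
print(len(real_trips))        # 2
```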


Have another look at the summary statistics.

In [None]:
df[['duration', 'speed', 'trip_distance']].describe().round(1).astype(str)

Notice that the maximum speed is over 5 million mph. Trips with a speed of over 100 mph are probably not real trips so let's get rid of these.

In [None]:
df = df[df.speed < 100]
df["speed"].max()

Finally, remove any trips which occurred outside of July 2021.

In [None]:
df = df[df['lpep_pickup_datetime'] >= 'July 2021']
df = df[df['lpep_pickup_datetime'] < 'August 2021']
print(f'First Date: {df["lpep_pickup_datetime"].min()}. Last Date: {df["lpep_pickup_datetime"].max()}')
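
The filter keeps pickups from the start of July up to, but not including, the start of August. The sketch below shows the same idea with explicit `pd.Timestamp` bounds on a made-up `pickups` table, which makes it easier to see which boundary values are kept. The names `start`, `end`, `in_july` and `july` are illustrative.

```python
import pandas as pd

# Hypothetical pickups on either side of the month boundaries
pickups = pd.DataFrame({
    'lpep_pickup_datetime': pd.to_datetime([
        '2021-06-30 23:59:59',
        '2021-07-01 00:00:00',
        '2021-07-31 23:59:59',
        '2021-08-01 00:00:00',
    ])
})

start = pd.Timestamp('2021-07-01')
end = pd.Timestamp('2021-08-01')

# Include the start bound, exclude the end bound
in_july = (pickups['lpep_pickup_datetime'] >= start) & (
    pickups['lpep_pickup_datetime'] < end
)
july = pickups[in_july]

print(july['lpep_pickup_datetime'].min())  # 2021-07-01 00:00:00
print(july['lpep_pickup_datetime'].max())  # 2021-07-31 23:59:59
```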

## Find hourly means

Create an `hour` column using [Series.dt.hour](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.hour.html), then group the values using [DataFrame.groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html).

In [None]:
df['hour'] = df['lpep_pickup_datetime'].dt.hour
means = df.groupby('hour')[['duration', 'speed', 'trip_distance']].mean()
means
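
To make the grouping step concrete, here is a small sketch on three made-up trips. Two start in the 08:00 hour and one starts in the 17:00 hour, so the result has one row per hour. The `trips` table and the name `hourly` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical trips: two picked up during hour 8, one during hour 17
trips = pd.DataFrame({
    'lpep_pickup_datetime': pd.to_datetime([
        '2021-07-01 08:05:00',
        '2021-07-01 08:40:00',
        '2021-07-02 17:10:00',
    ]),
    'duration': [10.0, 20.0, 30.0],
})

# Extract the hour of day (0 to 23) from each pickup time
trips['hour'] = trips['lpep_pickup_datetime'].dt.hour

# Average the duration of all trips that share the same hour
hourly = trips.groupby('hour')[['duration']].mean()

print(hourly)
```

The hour 8 row averages 10 and 20 minutes to give 15, regardless of which day each trip took place on.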

## Plot the results

In [None]:
plot_axes = means.plot()
plot_axes.legend(["Duration (minutes)", "Speed (mph)", "Trip Distance (miles)"])