# Python

## JupyterLab and notebooks

[Jupyter](https://docs.jupyter.org/en/latest/) is an open source software project which includes [JupyterLab](https://jupyterlab.readthedocs.io/en/latest/). JupyterLab is the notebook editing environment that we are working in. A [notebook](https://jupyterlab.readthedocs.io/en/latest/user/notebook.html) is a document which combines runnable code, text and images. Notebooks are great for beginners as they allow rapid experimentation.

To run a cell of code, select it and press `Shift-Enter`.

## Importing libraries

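We will be working with [`pandas`](https://pandas.pydata.org/docs/user_guide/index.html#user-guide), a library for loading and analysing tabular data. By convention it is imported under the short name `pd`.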

In [None]:
import pandas as pd

## Reading the CSV file using pandas

Pandas has a function called [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) which can be used to read CSV files. We set the `dtype` argument for the `store_and_fwd_flag` column because that column has some missing values and pandas needs to know how to handle them; giving the type explicitly also avoids a warning when the file is loaded. We set the `parse_dates` argument to a list of the columns that we want to convert into datetime objects.
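
To see what these two arguments do in isolation, here is a minimal sketch that reads a tiny in-memory CSV instead of the real file. The three rows and the `sample` name are made up for illustration; only the column names are taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the real file has many more rows and columns.
sample_csv = io.StringIO(
    "store_and_fwd_flag,lpep_pickup_datetime\n"
    "N,2021-07-01 00:30:52\n"
    ",2021-07-01 00:25:36\n"
    "Y,2021-07-01 00:05:58\n"
)

sample = pd.read_csv(
    sample_csv,
    dtype={'store_and_fwd_flag': str},
    parse_dates=['lpep_pickup_datetime'],
)

# The empty field is read as a missing value; the date column becomes a datetime type.
print(sample.dtypes)
print(sample['store_and_fwd_flag'].isna().sum())
```

The second row has an empty `store_and_fwd_flag`, so it is counted as missing, while the other two values are kept as text.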

In [None]:
# Avoid warning by setting type on store_and_fwd_flag

df = pd.read_csv(
    filepath_or_buffer='taxi_tripdata.csv', 
    dtype={'store_and_fwd_flag': str}, 
    parse_dates=['lpep_pickup_datetime', 'lpep_dropoff_datetime']
)

## Calculating the duration of trips
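
Subtracting one datetime column from another gives a column of time differences, which can then be converted to a number of minutes. The sketch below shows this on two made-up trips; the `trips`, `pickup` and `dropoff` names are hypothetical and stand in for the real column names.

```python
import pandas as pd

# Hypothetical pickup and dropoff times for two trips.
trips = pd.DataFrame({
    'pickup': pd.to_datetime(['2021-07-01 08:00:00', '2021-07-01 09:15:00']),
    'dropoff': pd.to_datetime(['2021-07-01 08:12:30', '2021-07-01 09:45:00']),
})

# Subtracting datetime columns gives timedeltas; convert to seconds, then to minutes.
trips['duration'] = (trips['dropoff'] - trips['pickup']).dt.total_seconds() / 60
print(trips['duration'].tolist())  # [12.5, 30.0]
```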

In [None]:
# Trip duration in minutes: dropoff time minus pickup time, in seconds, divided by 60
df['duration'] = (df['lpep_dropoff_datetime'] - df['lpep_pickup_datetime']).dt.total_seconds() / 60

## Calculating the speed of trips

In [None]:
# Average speed in miles per hour: distance per minute, multiplied by 60
df['speed'] = df['trip_distance'] / df['duration'] * 60

## Clean the data

Notice that there are some rows which have a duration or trip distance of zero, and some with a speed of over 100 mph. These are probably not real trips, so we remove them from the dataset to prevent them from affecting any later analysis. We also keep only the trips that were picked up in July 2021.
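
Filtering like this relies on boolean masks: a comparison on a column gives one `True` or `False` per row, and indexing the DataFrame with that mask keeps only the `True` rows. As a rough sketch on made-up data (the `trips`, `keep` and `cleaned` names are hypothetical), the three conditions can also be combined into a single mask with `&`:

```python
import pandas as pd

# Hypothetical trips: the second has zero duration, the third is implausibly fast.
trips = pd.DataFrame({
    'duration': [10.0, 0.0, 1.0],      # minutes
    'trip_distance': [2.0, 1.5, 5.0],  # miles
})
trips['speed'] = trips['trip_distance'] / trips['duration'] * 60

# Each comparison yields a boolean Series; & keeps rows where every condition is True.
keep = (trips['duration'] > 0) & (trips['trip_distance'] > 0) & (trips['speed'] < 100)
cleaned = trips[keep]
print(cleaned)
```

Only the first trip survives here. Applying the filters one at a time gives the same result; which form to use is mostly a matter of readability.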


In [None]:
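# Drop trips with no duration, no distance, or an implausibly high speed
df = df[df.duration > 0]
df = df[df.trip_distance > 0]
df = df[df.speed < 100]

# Keep only trips picked up in July 2021
df = df[df['lpep_pickup_datetime'] >= '2021-07-01']
df = df[df['lpep_pickup_datetime'] < '2021-08-01']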

## Find hourly means
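
To find hourly means, we take the hour of day from each pickup time, group the rows by that hour, and average each group. A small sketch of the idea on made-up data (the `trips`, `pickup` and `hourly` names are hypothetical):

```python
import pandas as pd

# Hypothetical trips: two picked up in the 08:00 hour, one in the 17:00 hour.
trips = pd.DataFrame({
    'pickup': pd.to_datetime([
        '2021-07-01 08:05:00', '2021-07-02 08:40:00', '2021-07-01 17:20:00',
    ]),
    'duration': [10.0, 20.0, 30.0],
})

# .dt.hour extracts the hour of day (0-23); groupby then averages the rows in each hour.
trips['hour'] = trips['pickup'].dt.hour
hourly = trips.groupby('hour')[['duration']].mean()
print(hourly)
```

The two 08:00 trips are averaged to 15 minutes, and the single 17:00 trip keeps its value of 30 minutes.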


In [None]:
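# Hour of day (0-23) in which each trip was picked up
df['hour'] = df['lpep_pickup_datetime'].dt.hour

# Mean duration, speed and distance of the trips in each hour
means = df.groupby('hour')[['duration', 'speed', 'trip_distance']].mean()
means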

In [None]:
# Plot each mean as a line, with the hour of day on the x-axis
means.plot()