# Challenge Questions - TfL Dataset

# Instructions:
• Please ensure you don't overwrite any existing cells. Add new cells below by pressing ALT+ENTER

• Attempt all of the questions

• You are encouraged to look online for help should you need it

# Dataset overview:
There are three datasets stored in the same directory as this Notebook, and they are all related to each other:

• **tfl-daily-cycle-hires.csv**: This dataset contains bike hire data from Transport for London for the period 30th July 2010 to 30th September 2021. 'Day' is the date in '%d/%m/%Y' format. 'Number of Bicycle Hires' is the total number of bikes hired that day.



## Import pandas, numpy and datetime

In [1]:
import pandas as pd
import numpy as np
import datetime as dt

## Load the files:
• "tfl-daily-cycle-hires.csv" should be assigned to the variable **tfl**

In [27]:
tfl=pd.read_csv('tfl-daily-cycle-hires.csv')

In [3]:
tfl

,Day,Number of Bicycle Hires,Unnamed: 2
0,30/07/2010,6897.0,
1,31/07/2010,5564.0,
2,01/08/2010,4303.0,
3,02/08/2010,6642.0,
4,03/08/2010,7966.0,
...,...,...,...
4076,26/09/2021,45120.0,
4077,27/09/2021,32167.0,
4078,28/09/2021,32539.0,
4079,29/09/2021,39889.0,


## Check the head of the DataFrame

In [5]:
tfl.head()

,Day,Number of Bicycle Hires,Unnamed: 2
0,30/07/2010,6897.0,
1,31/07/2010,5564.0,
2,01/08/2010,4303.0,
3,02/08/2010,6642.0,
4,03/08/2010,7966.0,


## Check the data types of the DataFrame columns

In [7]:
tfl.dtypes

Day                         object
Number of Bicycle Hires    float64
Unnamed: 2                 float64
dtype: object

## Change the data types and remove unnecessary columns 

• 'Day' should be a datetime64 data type

• 'Number of Bicycle Hires' should be float64

• Any other columns should be deleted

In [28]:
tfl['Day']=pd.to_datetime(tfl['Day'],format='%d/%m/%Y')

In [13]:
tfl.dtypes

Day                        datetime64[ns]
Number of Bicycle Hires           float64
Unnamed: 2                        float64
dtype: object

In [29]:
tfl.drop(columns='Unnamed: 2', inplace=True)

In [15]:
tfl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4081 entries, 0 to 4080
Data columns (total 2 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Day                      4081 non-null   datetime64[ns]
 1   Number of Bicycle Hires  4081 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 63.9 KB
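
As a possible shortcut, the empty trailing column can be skipped at load time instead of being dropped afterwards. The sketch below assumes the file has the layout shown above; `sample_csv` and `sample` are hypothetical names, and a three-row stand-in replaces the real file so that the cell runs on its own.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the real CSV; the trailing comma
# reproduces the empty column that pandas names 'Unnamed: 2'.
sample_csv = (
    "Day,Number of Bicycle Hires,\n"
    "30/07/2010,6897,\n"
    "31/07/2010,5564,\n"
    "01/08/2010,4303,\n"
)

sample = pd.read_csv(
    io.StringIO(sample_csv),
    usecols=["Day", "Number of Bicycle Hires"],    # skip the empty column
    dtype={"Number of Bicycle Hires": "float64"},  # keep the counts as floats
)

# Parse the day-first date strings explicitly so 01/08/2010 is 1 August.
sample["Day"] = pd.to_datetime(sample["Day"], format="%d/%m/%Y")

print(sample.dtypes)
```

Passing an explicit `format` avoids any ambiguity between day-first and month-first dates.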


## What is the average number of bicycle hires per day across the entire dataset?

In [119]:
# The phrase "per day" may tempt you to group by day. However, each row of the dataset
# already holds the total for one day, so the mean of the whole column is the answer.

tfl['Number of Bicycle Hires'].mean()

26261.932124479295

## Create a new column called 'Year' which contains the 4 digit year

In [130]:
tfl['Year']=tfl['Day'].dt.strftime('%Y')

In [125]:
tfl.dtypes

Day                        datetime64[ns]
Number of Bicycle Hires           float64
Year                               object
Category                           object
dtype: object

In [128]:
# Alternatively
tfl['Year'] = tfl['Day'].dt.year

In [129]:
# However, the above method turns year into an integer not a string
tfl.dtypes

Day                        datetime64[ns]
Number of Bicycle Hires           float64
Year                                int32
Category                           object
dtype: object

In [127]:
tfl

,Day,Number of Bicycle Hires,Year,Category
0,2010-07-30,6897.0,2010,Low
1,2010-07-31,5564.0,2010,Low
2,2010-08-01,4303.0,2010,Low
3,2010-08-02,6642.0,2010,Low
4,2010-08-03,7966.0,2010,Low
...,...,...,...,...
4076,2021-09-26,45120.0,2021,High
4077,2021-09-27,32167.0,2021,Medium
4078,2021-09-28,32539.0,2021,Medium
4079,2021-09-29,39889.0,2021,Medium
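
The difference between the two approaches can be checked on a small example. This is a minimal sketch using a hypothetical two-row Series (`days`), not the tfl data itself.

```python
import pandas as pd

# Hypothetical two-row example in the same day-first format as the dataset.
days = pd.to_datetime(pd.Series(["30/07/2010", "01/01/2011"]), format="%d/%m/%Y")

year_str = days.dt.strftime("%Y")  # strings such as '2010'
year_int = days.dt.year            # integers such as 2010

print(year_str.tolist())
print(year_int.tolist())

# The two representations carry the same information.
print((year_str.astype(int) == year_int).all())
```

Which one to prefer depends on later use: integers sort and compare numerically, while strings are convenient for labels.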


## What is the average number of bicycle hires per Year across the entire dataset

In [41]:
tfl.groupby(by='Year').mean()

Year,Day,Number of Bicycle Hires
2010,2010-10-15 00:00:00,14069.76129
2011,2011-07-02 00:00:00,19568.353425
2012,2012-07-01 12:00:00,26008.969945
2013,2013-07-02 00:00:00,22042.353425
2014,2014-07-02 00:00:00,27462.731507
2015,2015-07-02 00:00:00,27046.134247
2016,2016-07-01 12:00:00,28152.013661
2017,2017-07-02 00:00:00,28619.29863
2018,2018-07-02 00:00:00,28952.164384
2019,2019-07-02 00:00:00,28561.520548


In [43]:
#Removing the Day column
tfl[['Year','Number of Bicycle Hires']].groupby(by='Year').mean()

Year,Number of Bicycle Hires
2010,14069.76129
2011,19568.353425
2012,26008.969945
2013,22042.353425
2014,27462.731507
2015,27046.134247
2016,28152.013661
2017,28619.29863
2018,28952.164384
2019,28561.520548


In [137]:
# via pivot_table
tfl.pivot_table(index='Year', aggfunc='mean', values='Number of Bicycle Hires')

Year,Number of Bicycle Hires
2010,14069.76129
2011,19568.353425
2012,26008.969945
2013,22042.353425
2014,27462.731507
2015,27046.134247
2016,28152.013661
2017,28619.29863
2018,28952.164384
2019,28561.520548
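
Another option is to select the column after grouping, which returns a Series and leaves the other columns out of the aggregation. The sketch below uses a hypothetical four-row frame (`sample`) with made-up values so that it runs without the CSV.

```python
import pandas as pd

# Hypothetical four-row frame with the same column names as tfl.
sample = pd.DataFrame({
    "Day": pd.to_datetime(["2010-07-30", "2010-07-31", "2011-01-01", "2011-01-02"]),
    "Number of Bicycle Hires": [6897.0, 5564.0, 4000.0, 6000.0],
})
sample["Year"] = sample["Day"].dt.year

# Selecting the column after groupby aggregates only that column.
yearly_mean = sample.groupby("Year")["Number of Bicycle Hires"].mean()
print(yearly_mean)
```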


## What is the total number of bicycle hires per Year across the entire dataset

In [60]:
# Datetimes cannot be summed, so select only the columns we need (without 'Day') before grouping
tfl[['Year','Number of Bicycle Hires']].groupby(by='Year').sum()

Year,Number of Bicycle Hires
2010,2180813.0
2011,7142449.0
2012,9519283.0
2013,8045459.0
2014,10023897.0
2015,9871839.0
2016,10303637.0
2017,10446044.0
2018,10567540.0
2019,10424955.0


In [138]:
#via pivot_table
tfl.pivot_table(index='Year', aggfunc='sum', values = 'Number of Bicycle Hires')

Year,Number of Bicycle Hires
2010,2180813.0
2011,7142449.0
2012,9519283.0
2013,8045459.0
2014,10023897.0
2015,9871839.0
2016,10303637.0
2017,10446044.0
2018,10567540.0
2019,10424955.0
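
The yearly mean and total can also be computed in one pass by giving `agg` a list of function names. This is a sketch on a hypothetical four-row frame (`sample`) with made-up values, not output from the real dataset.

```python
import pandas as pd

# Hypothetical four-row frame with the same column names as tfl.
sample = pd.DataFrame({
    "Year": [2010, 2010, 2011, 2011],
    "Number of Bicycle Hires": [6897.0, 5564.0, 4000.0, 6000.0],
})

# One row per year, one column per aggregation.
summary = sample.groupby("Year")["Number of Bicycle Hires"].agg(["mean", "sum"])
print(summary)
```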


## Create a new column called 'Category' on the tfl DataFrame that classifies the number of bike hires per day as:
* 'Low' if the 'Number of Bicycle Hires' is below 10,000
* 'Medium' if the 'Number of Bicycle Hires' is below 40,000 but greater than or equal to 10,000
* 'High' if the 'Number of Bicycle Hires' is greater than or equal to 40,000

In [73]:
def category_func(x):
    # Classify a daily hire count as 'Low', 'Medium' or 'High'
    if x < 10000:
        return 'Low'
    elif x < 40000:  # x >= 10000 is already guaranteed at this point
        return 'Medium'
    else:
        return 'High'

In [74]:
tfl['Category']=tfl['Number of Bicycle Hires'].apply(category_func)

In [75]:
tfl

,Day,Number of Bicycle Hires,Year,Category
0,2010-07-30,6897.0,2010,Low
1,2010-07-31,5564.0,2010,Low
2,2010-08-01,4303.0,2010,Low
3,2010-08-02,6642.0,2010,Low
4,2010-08-03,7966.0,2010,Low
...,...,...,...,...
4076,2021-09-26,45120.0,2021,High
4077,2021-09-27,32167.0,2021,Medium
4078,2021-09-28,32539.0,2021,Medium
4079,2021-09-29,39889.0,2021,Medium
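
A vectorised alternative to `apply` is `pd.cut`, which bins a whole column at once. The sketch below assumes the same thresholds as above and uses a hypothetical Series (`hires`) chosen to sit on the bin edges.

```python
import numpy as np
import pandas as pd

# Hypothetical values placed on and around the two thresholds.
hires = pd.Series([6897.0, 10000.0, 39999.0, 40000.0, 45120.0])

# right=False makes each bin closed on the left: [-inf, 10000), [10000, 40000), [40000, inf)
category = pd.cut(
    hires,
    bins=[-np.inf, 10_000, 40_000, np.inf],
    labels=["Low", "Medium", "High"],
    right=False,
)
print(category.tolist())
```

The result is an ordered categorical, so later group-bys list the categories as Low, Medium, High rather than alphabetically.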


## For each year in the tfl DataFrame how many days are classed as 'Low', 'Medium' or 'High'?

<b>*If the column Year were a string*</b>

In [131]:
tfl.groupby(['Year','Category']).count()

Year,Category,Day,Number of Bicycle Hires
2010,Low,44,44
2010,Medium,111,111
2011,Low,30,30
2011,Medium,335,335
2012,High,27,27
2012,Low,19,19
2012,Medium,320,320
2013,Low,25,25
2013,Medium,340,340
2014,High,27,27


In [133]:
# We can exclude Number of bicycle Hires
tfl.groupby(['Year','Category']).count().drop(columns='Number of Bicycle Hires')

Year,Category,Day
2010,Low,44
2010,Medium,111
2011,Low,30
2011,Medium,335
2012,High,27
2012,Low,19
2012,Medium,320
2013,Low,25
2013,Medium,340
2014,High,27


In [139]:
# via pivot_table
# We need to pass a values column, either 'Day' or 'Number of Bicycle Hires'.
# ... The count aggregation counts non-null values, and both of these columns get counted unless we specify one
tfl.pivot_table(columns = 'Category', index='Year', aggfunc='count',fill_value=0, values='Day')

Category,High,Low,Medium
Year,,,
2010,0,44,111
2011,0,30,335
2012,27,19,320
2013,0,25,340
2014,27,10,328
2015,12,9,344
2016,37,6,323
2017,28,13,324
2018,52,16,297
2019,20,3,342


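<b>*If Year were an integer instead: the code above should still run, and the size-based approach below is an alternative that does not depend on the column data types.*</b>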

In [136]:
# size() returns the number of rows in each group, regardless of the column data types
# unstack() turns the Category level into columns, and its fill_value argument sets the resulting NaN to 0
# Finally, reindex() re-orders the columns to run from Low to High
tfl.groupby(['Year','Category']).size().unstack(fill_value=0).reindex(columns=['Low','Medium','High'])

Category,Low,Medium,High
Year,,,
2010,44,111,0
2011,30,335,0
2012,19,320,27
2013,25,340,0
2014,10,328,27
2015,9,344,12
2016,6,323,37
2017,13,324,28
2018,16,297,52
2019,3,342,20


The benefit of this method (size instead of count) is that it works for both numeric and non-numeric columns. If Year were a string, this method would still work.
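
`pd.crosstab` is another way to build the same Year by Category table directly from two columns. The sketch below uses a hypothetical five-row frame (`sample`) and compares the result with the `size().unstack()` approach.

```python
import pandas as pd

# Hypothetical five-row frame with an integer Year column.
sample = pd.DataFrame({
    "Year": [2010, 2010, 2010, 2011, 2011],
    "Category": ["Low", "Medium", "Medium", "Medium", "High"],
})

order = ["Low", "Medium", "High"]

# crosstab counts the rows for each (Year, Category) pair and fills gaps with 0.
counts = pd.crosstab(sample["Year"], sample["Category"]).reindex(columns=order, fill_value=0)

# The same table built with groupby, size and unstack.
by_size = sample.groupby(["Year", "Category"]).size().unstack(fill_value=0).reindex(columns=order)

print(counts)
print((counts.values == by_size.values).all())
```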