# Python Mini project 8


### Bicycle rental data in Chicago:

- trip_id - trip id
- start_time - date and time of the start of the trip
- end_time - date and time of the end of the trip
- bikeid - bike id
- tripduration - trip duration in seconds
- from_station_id - id of the departure station
- from_station_name - name of the departure station
- to_station_id - id of the arrival station
- to_station_name - name of the arrival station
- usertype - user type
- gender - gender (if subscriber)
- birthyear - year of birth (if subscriber)

Use the Q1 data only. Before calling .resample(), the data needs some preparation: set the start_time column as the index and save the change to the original dataset. Check the column's type first and cast it to the correct one if necessary.


**Note:** To use the resample() method, the index must consist of valid datetime objects (that is, the index has a datetime type and is sorted).
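A minimal sketch of this preparation on made-up rows (the `toy` frame and its values are assumptions for illustration, not the course data): convert the text column to datetime, move it into the index, sort, and only then resample.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from bikes_q1_sample.csv)
toy = pd.DataFrame({
    "start_time": ["2018-01-02 09:15:00", "2018-01-01 08:00:00", "2018-01-02 18:30:00"],
    "trip_id": [3, 1, 2],
})

# The column is plain text after loading, so convert it first
toy["start_time"] = pd.to_datetime(toy["start_time"])

# Move the timestamps into the index and sort them chronologically
toy = toy.set_index("start_time").sort_index()

# With a sorted DatetimeIndex in place, daily resampling works
daily = toy.resample("D").size()
print(daily)
```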


In [43]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [44]:
df = pd.read_csv('bikes_q1_sample.csv')

In [45]:
df.head(5)

Unnamed: 0,trip_id,start_time,end_time,bikeid,tripduration,from_station_id,from_station_name,to_station_id,to_station_name,usertype,gender,birthyear
0,17617135,2018-01-22 20:04:31,2018-01-22 20:11:53,1131,442.0,471,Francisco Ave & Foster Ave,468,Budlong Woods Library,Subscriber,Female,1949.0
1,17897619,2018-03-16 19:47:59,2018-03-16 20:04:00,6146,961.0,296,Broadway & Belmont Ave,253,Winthrop Ave & Lawrence Ave,Subscriber,Male,1988.0
2,17881307,2018-03-14 18:49:20,2018-03-14 18:54:38,3847,318.0,260,Kedzie Ave & Milwaukee Ave,503,Drake Ave & Fullerton Ave,Subscriber,Male,1987.0
3,17881130,2018-03-14 18:33:48,2018-03-14 19:07:40,1483,2032.0,199,Wabash Ave & Grand Ave,199,Wabash Ave & Grand Ave,Subscriber,Male,1990.0
4,17686289,2018-02-05 17:39:14,2018-02-05 17:46:13,6391,419.0,596,Benson Ave & Church St,605,University Library (NU),Subscriber,Male,1992.0


In [46]:
df.dtypes

trip_id                int64
start_time            object
end_time              object
bikeid                 int64
tripduration          object
from_station_id        int64
from_station_name     object
to_station_id          int64
to_station_name       object
usertype              object
gender                object
birthyear            float64
dtype: object

In [47]:
df.start_time = pd.to_datetime(df.start_time) # convert start_time from object to datetime
df.set_index(keys='start_time', drop=True, inplace=True) # use start_time as the index and drop the now-redundant column

In [48]:
df.head(5)

start_time,trip_id,end_time,bikeid,tripduration,from_station_id,from_station_name,to_station_id,to_station_name,usertype,gender,birthyear
2018-01-22 20:04:31,17617135,2018-01-22 20:11:53,1131,442.0,471,Francisco Ave & Foster Ave,468,Budlong Woods Library,Subscriber,Female,1949.0
2018-03-16 19:47:59,17897619,2018-03-16 20:04:00,6146,961.0,296,Broadway & Belmont Ave,253,Winthrop Ave & Lawrence Ave,Subscriber,Male,1988.0
2018-03-14 18:49:20,17881307,2018-03-14 18:54:38,3847,318.0,260,Kedzie Ave & Milwaukee Ave,503,Drake Ave & Fullerton Ave,Subscriber,Male,1987.0
2018-03-14 18:33:48,17881130,2018-03-14 19:07:40,1483,2032.0,199,Wabash Ave & Grand Ave,199,Wabash Ave & Grand Ave,Subscriber,Male,1990.0
2018-02-05 17:39:14,17686289,2018-02-05 17:46:13,6391,419.0,596,Benson Ave & Church St,605,University Library (NU),Subscriber,Male,1992.0


The data records the rental date together with the exact start and end times, to the second. Apply the .resample() method to aggregate the data by day, and give the maximum number of rentals in a single day as your answer.

In [49]:
# resample by day (rule='D'), count the rows that fall on each day, and take the maximum
df.resample(rule='D').size().max()

4196
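The same pattern can be sketched on a toy index (all timestamps below are made up). It also shows two details: days without any rows still appear in the result with a count of 0, and `idxmax()` recovers which day holds the maximum.

```python
import pandas as pd

# Hypothetical trips: two on Jan 1, none on Jan 2, three on Jan 3
idx = pd.to_datetime([
    "2018-01-01 08:00", "2018-01-01 09:30",
    "2018-01-03 07:45", "2018-01-03 12:00", "2018-01-03 19:10",
])
trips = pd.DataFrame({"trip_id": range(5)}, index=idx)

per_day = trips.resample("D").size()   # one row per calendar day, empty days count as 0
busiest_count = per_day.max()          # highest number of trips in one day
busiest_day = per_day.idxmax()         # the day on which that maximum occurs

print(per_day)
print(busiest_day, busiest_count)
```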


Let's look at how the number of rentals is distributed across user groups (usertype), customers and subscribers, in the data for **April**.
The data for this period is in bikes_april.csv.

Resample by day for each group and give the number of rentals made by subscribers on April 18 as your answer.

In [50]:
april = pd.read_csv('bikes_april.csv')

In [51]:
april.head(5)

Unnamed: 0,start_time,trip_id,end_time,bikeid,tripduration,from_station_id,from_station_name,to_station_id,to_station_name,usertype,gender,birthyear
0,2018-04-01 00:10:23,18000531,2018-04-01 00:22:12,5065,709.0,228,Damen Ave & Melrose Ave,219,Damen Ave & Cortland St,Subscriber,Male,1983.0
1,2018-04-01 00:15:49,18000533,2018-04-01 00:19:47,4570,238.0,128,Damen Ave & Chicago Ave,130,Damen Ave & Division St,Subscriber,Male,1978.0
2,2018-04-01 00:17:00,18000534,2018-04-01 00:22:53,1323,353.0,130,Damen Ave & Division St,69,Damen Ave & Pierce Ave,Subscriber,Male,1991.0
3,2018-04-01 00:20:00,18000536,2018-04-01 00:26:22,2602,382.0,121,Blackstone Ave & Hyde Park Blvd,351,Cottage Grove Ave & 51st St,Subscriber,Female,1992.0
4,2018-04-01 00:23:19,18000538,2018-04-01 00:35:01,4213,702.0,31,Franklin St & Chicago Ave,180,Ritchie Ct & Banks St,Subscriber,Male,1985.0


In [52]:
april.start_time = pd.to_datetime(april.start_time)
april.set_index(keys='start_time', drop=True, inplace=True)

In [53]:
sub_april = april.query('usertype == "Subscriber"').resample(rule='D').size().loc['2018-04-18']
sub_april

2196

In [54]:
# variant 2: count rentals per day for every user type at once
sub_2 = april.groupby(['usertype']).resample(rule='D').size()
sub_2

usertype,2018-04-01,2018-04-02,2018-04-03,2018-04-04,2018-04-05,2018-04-06,2018-04-07,2018-04-08,2018-04-09,2018-04-10,...,2018-04-21,2018-04-22,2018-04-23,2018-04-24,2018-04-25,2018-04-26,2018-04-27,2018-04-28,2018-04-29,2018-04-30
Customer,239,166,31,82,90,124,335,242,39,117,...,655,1055,345,367,220,544,416,713,1082,1098
Subscriber,825,2841,1873,2253,2502,2520,1416,1252,1798,3114,...,1845,2241,3930,4356,3959,4398,3424,2015,2114,5281


In [55]:
sub_2 = sub_2.T # transpose: dates become rows, user types become columns
sub_2

start_time,Customer,Subscriber
2018-04-01,239,825
2018-04-02,166,2841
2018-04-03,31,1873
2018-04-04,82,2253
2018-04-05,90,2502
2018-04-06,124,2520
2018-04-07,335,1416
2018-04-08,242,1252
2018-04-09,39,1798
2018-04-10,117,3114


In [56]:
sub_2 = sub_2.loc['2018-04-18', 'Subscriber']
sub_2

2196
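The shape returned by `groupby(...).resample(...).size()` may differ between pandas versions, so the transpose step is not guaranteed to be needed everywhere. One alternative, sketched here on made-up rows, builds the dates-by-usertype table explicitly with `pd.Grouper` and `unstack`. The `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rentals on a sorted DatetimeIndex
toy = pd.DataFrame(
    {"usertype": ["Subscriber", "Customer", "Subscriber", "Customer", "Subscriber"]},
    index=pd.to_datetime([
        "2018-04-17 08:00", "2018-04-17 13:00",
        "2018-04-18 07:30", "2018-04-18 12:10", "2018-04-18 18:45",
    ]),
)
toy.index.name = "start_time"

# Group by user type and by calendar day of the index, then count rows
counts = (toy
          .groupby(["usertype", pd.Grouper(freq="D")])
          .size()
          .unstack("usertype", fill_value=0))  # rows: days, columns: user types

# Pick one cell: subscribers on April 18
subs_apr18 = counts.loc[pd.Timestamp("2018-04-18"), "Subscriber"]
print(counts)
print(subs_apr18)
```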

Let's look at the data for the period from April to December.

Combine the monthly samples into one common bikes dataset. Resample by day for each user group (usertype), then select the days on which customers made more rentals than subscribers.

**Data:**

* Q2: (1) April: `bikes_q2_sample_apr.csv`, (2) May: `bikes_q2_sample_may.csv`, (3) June: `bikes_q2_sample_jun.csv`
* Q3: (4) July: `bikes_q3_sample_july.csv`, (5) August: `bikes_q3_sample_aug.csv`, (6) September: `bikes_q3_sample_sep.csv`
* Q4: (7) October: `bikes_q4_sample_oct.csv`, (8) November: `bikes_q4_sample_nov.csv`, (9) December: `bikes_q4_sample_dec.csv`

In [70]:
# read the nine monthly files and concatenate them into one dataframe
overall = pd.concat([
        pd.read_csv('bikes_q2_sample_apr.csv'),
        pd.read_csv('bikes_q2_sample_may.csv'),
        pd.read_csv('bikes_q2_sample_jun.csv'),
        pd.read_csv('bikes_q3_sample_july.csv'),
        pd.read_csv('bikes_q3_sample_aug.csv'),
        pd.read_csv('bikes_q3_sample_sep.csv'),
        pd.read_csv('bikes_q4_sample_oct.csv'),
        pd.read_csv('bikes_q4_sample_nov.csv'),
        pd.read_csv('bikes_q4_sample_dec.csv')
        ]) 
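The repeated `read_csv` calls can also be generated from a collection of sources. The sketch below uses in-memory buffers with made-up content in place of the monthly files so that it runs on its own; with the real data, the list comprehension would iterate over the file names instead. `parse_dates` and `ignore_index=True` are optional additions not used in the cell above.

```python
import io
import pandas as pd

# Stand-ins for two monthly CSV files (hypothetical content, not the course data)
fake_files = {
    "apr": "trip_id,start_time,usertype\n1,2018-04-01 00:17:00,Subscriber\n",
    "may": "trip_id,start_time,usertype\n2,2018-05-01 09:00:00,Customer\n",
}

# Read every source the same way; parse_dates converts start_time while loading
frames = [
    pd.read_csv(io.StringIO(text), parse_dates=["start_time"])
    for text in fake_files.values()
]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each file's own index
combined = pd.concat(frames, ignore_index=True)
print(combined)
```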


In [71]:
overall.shape

(964781, 12)

In [72]:
overall.head(5)

Unnamed: 0,trip_id,start_time,end_time,bikeid,tripduration,from_station_id,from_station_name,to_station_id,to_station_name,usertype,gender,birthyear
0,18000534,2018-04-01 00:17:00,2018-04-01 00:22:53,1323,353.0,130,Damen Ave & Division St,69,Damen Ave & Pierce Ave,Subscriber,Male,1991.0
1,18000536,2018-04-01 00:20:00,2018-04-01 00:26:22,2602,382.0,121,Blackstone Ave & Hyde Park Blvd,351,Cottage Grove Ave & 51st St,Subscriber,Female,1992.0
2,18000538,2018-04-01 00:23:19,2018-04-01 00:35:01,4213,702.0,31,Franklin St & Chicago Ave,180,Ritchie Ct & Banks St,Subscriber,Male,1985.0
3,18000540,2018-04-01 00:24:46,2018-04-01 00:44:23,6401,1177.0,596,Benson Ave & Church St,517,Clark St & Jarvis Ave,Subscriber,Male,1974.0
4,18000541,2018-04-01 00:26:04,2018-04-01 00:31:05,6333,301.0,145,Mies van der Rohe Way & Chestnut St,24,Fairbanks Ct & Grand Ave,Subscriber,Male,1984.0


In [73]:
overall.dtypes

trip_id                int64
start_time            object
end_time              object
bikeid                 int64
tripduration          object
from_station_id        int64
from_station_name     object
to_station_id          int64
to_station_name       object
usertype              object
gender                object
birthyear            float64
dtype: object

In [74]:
overall.start_time = pd.to_datetime(overall.start_time) # convert start_time from object to datetime
overall.set_index(keys='start_time', drop=True, inplace=True) # use start_time as the index and drop the now-redundant column
overall.head(5)

start_time,trip_id,end_time,bikeid,tripduration,from_station_id,from_station_name,to_station_id,to_station_name,usertype,gender,birthyear
2018-04-01 00:17:00,18000534,2018-04-01 00:22:53,1323,353.0,130,Damen Ave & Division St,69,Damen Ave & Pierce Ave,Subscriber,Male,1991.0
2018-04-01 00:20:00,18000536,2018-04-01 00:26:22,2602,382.0,121,Blackstone Ave & Hyde Park Blvd,351,Cottage Grove Ave & 51st St,Subscriber,Female,1992.0
2018-04-01 00:23:19,18000538,2018-04-01 00:35:01,4213,702.0,31,Franklin St & Chicago Ave,180,Ritchie Ct & Banks St,Subscriber,Male,1985.0
2018-04-01 00:24:46,18000540,2018-04-01 00:44:23,6401,1177.0,596,Benson Ave & Church St,517,Clark St & Jarvis Ave,Subscriber,Male,1974.0
2018-04-01 00:26:04,18000541,2018-04-01 00:31:05,6333,301.0,145,Mies van der Rohe Way & Chestnut St,24,Fairbanks Ct & Grand Ave,Subscriber,Male,1984.0


In [75]:
res = overall.groupby(['usertype']).resample(rule='D').size().T # transpose: columns to rows, rows to columns
res[res.Customer > res.Subscriber]

start_time,Customer,Subscriber
2018-05-27,3263,2449
2018-09-02,2752,2183
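The row filter is an ordinary boolean mask built by comparing two columns. A small sketch: the 2018-05-27 counts are copied from the output above, while the two neighbouring days are made-up values added only so that the mask has something to reject.

```python
import pandas as pd

# Daily counts per user type; only the 2018-05-27 row comes from the notebook output
counts = pd.DataFrame(
    {"Customer": [120, 3263, 900], "Subscriber": [2500, 2449, 3100]},
    index=pd.to_datetime(["2018-05-26", "2018-05-27", "2018-05-28"]),
)

# Comparing two columns gives one True/False per day; indexing with it keeps the True rows
more_customers = counts[counts["Customer"] > counts["Subscriber"]]

# Format the surviving dates as plain strings
days = more_customers.index.strftime("%Y-%m-%d").tolist()
print(days)
```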


Let's take a look at summer.

Another advantage of using dates as the index is that data for a period of interest can be selected directly. Store the observations from June 1 to August 31 in the bikes_summer variable. Then store the name of the most popular destination in top_destination. Aggregate the data by day and determine on which day that destination (top_destination) had the fewest trips. Store the date as bad_day, formatting the timestamp with .strftime('%Y-%m-%d').

May be useful:

- loc
- strftime
- idxmin, idxmax
- size
- query
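A sketch of how these tools combine, on made-up timestamps and station names "A" and "B" (all hypothetical). Note that slicing a DatetimeIndex with date strings includes both endpoints, and that `resample` fills days with no rows with 0, so `idxmin()` may return a day on which there were no trips at all.

```python
import pandas as pd

idx = pd.to_datetime([
    "2018-05-31 23:50",  # before the period
    "2018-06-01 00:05", "2018-06-01 10:00", "2018-06-01 11:00",
    "2018-06-03 09:00",
    "2018-09-01 00:10",  # after the period
])
toy = pd.DataFrame({"to_station_name": ["A", "A", "A", "B", "A", "A"]}, index=idx)

# Label-based slice on the sorted DatetimeIndex; all of Aug 31 would be included
summer_toy = toy.loc["2018-06-01":"2018-08-31"]

# Daily number of trips to station "A"; June 2 has none and gets a 0
per_day = (summer_toy
           .query("to_station_name == 'A'")
           .resample("D").size())

# Day with the fewest trips, formatted as a string
quietest = per_day.idxmin().strftime("%Y-%m-%d")
print(per_day)
print(quietest)
```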

In [76]:
overall.head(2)

start_time,trip_id,end_time,bikeid,tripduration,from_station_id,from_station_name,to_station_id,to_station_name,usertype,gender,birthyear
2018-04-01 00:17:00,18000534,2018-04-01 00:22:53,1323,353.0,130,Damen Ave & Division St,69,Damen Ave & Pierce Ave,Subscriber,Male,1991.0
2018-04-01 00:20:00,18000536,2018-04-01 00:26:22,2602,382.0,121,Blackstone Ave & Hyde Park Blvd,351,Cottage Grove Ave & 51st St,Subscriber,Female,1992.0


In [77]:
# keep only the summer slice we are interested in
summer = overall.loc['2018-06-01': '2018-08-31']

In [78]:
summer.shape

(459817, 11)

In [79]:

# for a categorical (text) column, describe() reports the most frequent value as .top

top = summer.to_station_name.describe().top
top

'Streeter Dr & Grand Ave'
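The most frequent value can be obtained in several equivalent ways. The sketch below compares three of them on a tiny made-up series (the three rows are an assumption for illustration; the station names are taken from the outputs in this notebook).

```python
import pandas as pd

stations = pd.Series(
    ["Streeter Dr & Grand Ave", "Lake Shore Dr & North Blvd", "Streeter Dr & Grand Ave"],
    name="to_station_name",
)

top_by_describe = stations.describe()["top"]      # the approach used above
top_by_counts = stations.value_counts().idxmax()  # label with the highest count
top_by_mode = stations.mode().iloc[0]             # first of the most frequent values

print(top_by_describe, top_by_counts, top_by_mode, sep="\n")
```

`value_counts()` has the side benefit of exposing the full ranking, which helps when the runner-up destinations matter as well.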


Determine on which day the fewest trips were made to that destination.

In [81]:
bad_day = (summer
           .query('to_station_name == @top')
           .resample(rule='D').size()
           .idxmin()
           .strftime('%Y-%m-%d'))
bad_day

'2018-06-21'

In [82]:
# variant 2: boolean mask instead of query()
a = (summer[summer.to_station_name == top]
    .resample(rule='D').size()
    .idxmin()
    .strftime('%Y-%m-%d'))
a

'2018-06-21'

Where do people ride most on weekends: to the same places as on weekdays, or to other destinations?

Using data from June 1 to August 31, select the correct statements.

In [83]:
summer_d = summer.assign(weekday=lambda x: x.index.day_name()) # the index is already datetime, so the day name can be read directly
summer_d.head()

start_time,trip_id,end_time,bikeid,tripduration,from_station_id,from_station_name,to_station_id,to_station_name,usertype,gender,birthyear,weekday
2018-06-01 00:04:40,18709077,2018-06-01 00:06:47,3155,127.0,128,Damen Ave & Chicago Ave,214,Damen Ave & Grand Ave,Subscriber,Female,1978.0,Friday
2018-06-01 00:06:08,18709080,2018-06-01 00:24:18,2807,1090.0,258,Logan Blvd & Elston Ave,69,Damen Ave & Pierce Ave,Customer,,,Friday
2018-06-01 00:08:01,18709086,2018-06-01 00:32:55,2737,1494.0,337,Clark St & Chicago Ave,225,Halsted St & Dickens Ave,Customer,Male,1988.0,Friday
2018-06-01 00:09:02,18709091,2018-06-01 00:19:21,6089,619.0,210,Ashland Ave & Division St,56,Desplaines St & Kinzie St,Subscriber,Male,1987.0,Friday
2018-06-01 00:09:28,18709092,2018-06-01 00:14:44,2352,316.0,240,Sheridan Rd & Irving Park Rd,303,Broadway & Cornelia Ave,Subscriber,Male,1997.0,Friday


In [86]:
(summer_d 
   .groupby(['weekday', 'to_station_name'])
   .size()
   .sort_values(ascending=False)
)

weekday    to_station_name            
Saturday   Streeter Dr & Grand Ave        3461
Sunday     Streeter Dr & Grand Ave        2565
Friday     Streeter Dr & Grand Ave        1726
Saturday   Lake Shore Dr & North Blvd     1690
Wednesday  Streeter Dr & Grand Ave        1669
                                          ... 
Sunday     Pulaski Rd & Lake St              1
Monday     Racine Ave & 65th St              1
Saturday   Halsted St & 59th St              1
Friday     South Chicago Ave & 83rd St       1
Monday     Austin Blvd & Madison St          1
Length: 3893, dtype: int64
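To compare weekends with weekdays directly, the seven day names can be collapsed into two groups. A minimal sketch on made-up trips (the station names "Pier", "Beach" and "Office" are hypothetical): `dayofweek` numbers Monday as 0, so values 5 and 6 mark Saturday and Sunday.

```python
import pandas as pd

idx = pd.to_datetime([
    "2018-06-02 10:00",  # Saturday
    "2018-06-02 15:00",  # Saturday
    "2018-06-03 11:00",  # Sunday
    "2018-06-04 08:00",  # Monday
    "2018-06-05 08:10",  # Tuesday
    "2018-06-06 08:20",  # Wednesday
])
toy = pd.DataFrame(
    {"to_station_name": ["Pier", "Pier", "Beach", "Office", "Office", "Pier"]},
    index=idx,
)

# True for Saturday (5) and Sunday (6), False for Monday..Friday
is_weekend = toy.index.dayofweek >= 5

# Most frequent destination within each group
top_weekend = toy.loc[is_weekend, "to_station_name"].value_counts().idxmax()
top_weekday = toy.loc[~is_weekend, "to_station_name"].value_counts().idxmax()
print(top_weekend, top_weekday)
```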