# Pandas Practice on Bike Share Data

When writing code, you don't always have to reinvent the wheel. A great advantage of Python is that smart people before you spent a lot of energy on making life easier for the next programmers. So please, make your life easier and use code that has already been implemented; don't call it "copying" but "friendly borrowing" of other people's code. If you copy whole functions or great graphs in the future, don't forget to give props to the inventor!

So for this exercise, too, if you get stuck at any point, look at good solutions from others and learn from them how to solve these problems even better.
Here are two good resources for small code snippets which can be very helpful when dealing with DataFrames:

- [Sebastian Raschka's "Things in Pandas I Wish I'd Known Earlier"](https://nbviewer.jupyter.org/github/rasbt/python_reference/blob/master/tutorials/things_in_pandas.ipynb)
- [Chris Albon's set of code snippets](https://chrisalbon.com/)

## Learning Objectives
**By the end of this session you should be able to**
- Explore data with Pandas to answer conceptual questions
- Write chained commands for efficient one-liners



In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('../data/bike_share_201402_trip_data.csv')

1. How many observations are there?

In [None]:
#144015 observations
df.info()
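As a quick cross-check on `df.info()`, the row count can also be read directly from the frame. The sketch below runs on a small made-up frame; `toy` and its values are assumptions for illustration, not the bike share data.

```python
import pandas as pd

# Hypothetical stand-in for the trip data; the values are invented.
toy = pd.DataFrame({
    "Trip ID": [1, 2, 3],
    "Duration": [60, 531, 1230],
})

n_rows = len(toy)               # number of observations (rows)
n_rows_alt, n_cols = toy.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)
```

On the real data, `len(df)` should agree with the entry count reported by `df.info()`.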

2. Change the column-names to be Pythonic:

- lowercase 
- replace " " with `_` as a separator
- replace "#" with `num` 


In [None]:
# Convert column names to lowercase, use "_" as separator and spell out "#" as "num"
df.rename(columns=lambda x: x.lower().replace(' ', '_').replace('#', 'num'), inplace=True)
df
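The three renaming rules above could also be applied through the `.str` accessor on the column index, which keeps everything in one chained statement. This is only a sketch: the column names in `toy` are guesses at what raw headers might look like, not the actual file header.

```python
import pandas as pd

# Hypothetical raw-style column names; the values are invented.
toy = pd.DataFrame({"Trip ID": [1], "Start Station": ["A"], "Bike #": [7]})

# Index.str exposes vectorized string methods, so the steps chain in one statement.
toy.columns = (
    toy.columns.str.lower()
    .str.replace(" ", "_", regex=False)    # spaces become underscores
    .str.replace("#", "num", regex=False)  # "#" is spelled out as "num"
)
print(list(toy.columns))
```

Passing `regex=False` makes the replacement a plain string substitution regardless of the pandas version in use.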

3. How many types of subscription options are there?

In [None]:
df['subscription_type'].unique()

4. What is the frequency of each subscription option?

In [None]:
#df.groupby('subscription_type').count().trip_id
df['subscription_type'].value_counts()
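Absolute counts are often easier to compare as shares. A minimal sketch, assuming relative frequencies are of interest here; the labels `"A"` and `"B"` are placeholders, not the real subscription types.

```python
import pandas as pd

# Invented sample; "A" and "B" are placeholder categories.
subs = pd.Series(["A", "A", "A", "B"], name="subscription_type")

counts = subs.value_counts()                # absolute frequency per option
shares = subs.value_counts(normalize=True)  # relative frequency, sums to 1
print(counts)
print(shares)
```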

5. Please plot the frequency of each subscription option with a pie chart.

In [None]:
# pandas.df.plot
df['subscription_type'].value_counts().plot(kind='pie');

# matplotlib
subs = df['subscription_type'].value_counts()

fig, ax = plt.subplots()
# Pass the whole Series so the chart works for any number of subscription types
ax.pie(subs, labels=subs.index, autopct="%.2f")
plt.show()

6. Please plot the frequency of each subscription option with a bar chart.

In [None]:
df.groupby('subscription_type').count().trip_id.plot(kind='bar');

7. Repeat the same analysis for start_station, but sorted from high to low.

In [None]:
df.groupby('start_station').count().sort_values('trip_id', ascending=False)#.trip_id.plot(kind='bar');

8. Repeat the same analysis for end_station, but sorted from __low to high__.

In [None]:
df.groupby('end_station').count().sort_values('trip_id', ascending=True)#.trip_id.plot(kind='bar');
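For the two sorted station counts above, `value_counts` may be a shorter route than `groupby(...).count()`: it sorts from high to low by default and accepts `ascending=True` for the reverse. A sketch on invented station labels:

```python
import pandas as pd

# Invented station labels, not taken from the data set.
stations = pd.Series(["X", "Y", "Y", "Z", "Z", "Z"], name="end_station")

high_to_low = stations.value_counts()                # default: most frequent first
low_to_high = stations.value_counts(ascending=True)  # least frequent first
print(high_to_low)
print(low_to_high)
```

Either result can be passed on to `.plot(kind='bar')` in the same chain.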

9. Looking at just the most popular start stations and the most popular end stations, what are the qualitative similarities and differences between the set of start stations and set of end stations?  

In [None]:
#df.sort_values('bike_num', ascending=False)
df["start_station"].value_counts().nlargest(5)
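One possible way to compare the two groups is to turn the top stations into sets and look at overlap and differences. The frame `toy` and the cutoff `top_n` below are assumptions made for illustration.

```python
import pandas as pd

# Invented trips; station labels are placeholders.
toy = pd.DataFrame({
    "start_station": ["A", "A", "B", "B", "C", "D"],
    "end_station":   ["B", "B", "C", "C", "A", "D"],
})

top_n = 2  # arbitrary cutoff for "most popular"
top_start = set(toy["start_station"].value_counts().nlargest(top_n).index)
top_end = set(toy["end_station"].value_counts().nlargest(top_n).index)

in_both = top_start & top_end     # popular as start and as end
only_start = top_start - top_end  # popular only as a start station
only_end = top_end - top_start    # popular only as an end station
print(in_both, only_start, only_end)
```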

10. Create a table that has start_station segmented by subscription_type.

Include the marginals.

<details><summary>
Click here for a hint…
</summary>
`pd.crosstab`
</details>

In [None]:
# margins=True adds the row and column totals (the marginals) as "All"
pd.crosstab(df.start_station, df.subscription_type, margins=True)
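The marginals asked for in this exercise are the row and column totals of the table. A minimal sketch on invented data, showing how `margins=True` adds them under the label `All`:

```python
import pandas as pd

# Invented trips; "S" and "C" are placeholder subscription types.
toy = pd.DataFrame({
    "start_station": ["A", "A", "B", "B", "B"],
    "subscription_type": ["S", "C", "S", "S", "C"],
})

# margins=True appends an "All" row and an "All" column holding the totals.
table = pd.crosstab(toy["start_station"], toy["subscription_type"], margins=True)
print(table)
```

The bottom-right cell of `table` is the overall number of trips.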


11. Let's look at duration.

How long is the shortest trip? How many are that short?

In [None]:
# df.duration.min() = 60. 
# 17 trips are just 60 seconds long.

df.duration.min()
df.query('duration == duration.min()').shape[0]
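An alternative to `query` is a boolean mask: comparing the column with its minimum yields `True`/`False` values, and summing them counts the matches. The durations below are invented for the sketch.

```python
import pandas as pd

# Invented durations in seconds.
durations = pd.Series([60, 60, 75, 531, 1230], name="duration")

shortest = durations.min()                      # length of the shortest trip
n_shortest = int((durations == shortest).sum())  # True counts as 1 when summed
print(shortest, n_shortest)
```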

12. What do you think is going on with the short trips?

In [None]:
# Duration.min() = 60, must be seconds, all short trips end where they began.
# Seem to be interrupted bookings or immediate returns assigned to a minimum value of 60 seconds.

df.query('duration == duration.min()')
df.query('duration == duration.min() and start_station != end_station').shape[0]

13. What is the longest trip? How many trips are "long"?

14. Do the long durations seem reasonable? Why are they so long? What could that tell us about the users?

15. Timebox 15 minutes to explore the data guided by your own intuition or hypotheses…


In [None]:
# 722236 seconds = ~200 hours = ~8 days
# median trip duration = 531 seconds = ~9 Minutes
# mean trip duration = 1230 seconds = ~20 Minutes
# 72073 trips are longer than the median

df.duration.max()
#df.duration.mean()
df.query('duration >= duration.median()').sort_values('duration', ascending=False).head()#.shape[0]
#df.duration.sort_values(ascending=False)
#df.query('duration == duration.max ()').shape[0]
#df.query('duration == duration.max ()')
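What counts as "long" is a judgment call. The cell above uses the median as the threshold; a stricter option would be a high quantile. The sketch below assumes the 90th percentile, which is an arbitrary choice, and runs on invented durations.

```python
import pandas as pd

# Invented durations in seconds, with one extreme value.
durations = pd.Series([60, 120, 300, 480, 540, 600, 900, 1200, 3600, 90000])

threshold = durations.quantile(0.9)  # 0.9 is an arbitrary cutoff for "long"
long_trips = durations[durations > threshold]
summary = durations.describe()       # count, mean, std, min, quartiles, max
print(threshold)
print(long_trips)
```

Comparing `summary["mean"]` with `summary["50%"]` gives a first hint of how strongly a few extreme trips pull the average upwards.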

16. Plot duration.

In [None]:
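# Trip duration in minutes
df['duration_min'] = (df['duration'] / 60).round(2)
# Number of trips per distinct duration, ordered by duration
dur_count = df['duration_min'].value_counts().sort_index()

# Histogram of the durations themselves (not of their frequency counts)
df['duration_min'].plot(kind='hist', bins=20, figsize=(10,8));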

17. Does that plot give insights?

In [None]:
# Right-skewed distribution: mostly short trips, very few long trips

18. Select subsections of the data to make a series of plots to enable insights for the Product Team.

In [None]:
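# Shortest 5% of trips: filter on the durations, not on their frequency counts
df.loc[df['duration_min'] < df['duration_min'].quantile(0.05), 'duration_min'].plot(kind='hist', bins=20, figsize=(10,8));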

In [None]:
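# Longest 5% of trips: filter on the durations, not on their frequency counts
df.loc[df['duration_min'] > df['duration_min'].quantile(0.95), 'duration_min'].plot(kind='hist', bins=20, figsize=(10,8));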
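Besides the two tails, the bulk of the trips in between may be worth a plot of its own. A sketch of that selection, assuming the 5th and 95th percentiles as cutoffs and using invented durations in minutes:

```python
import pandas as pd

# Invented durations in minutes, including one extreme outlier.
duration_min = pd.Series([1, 2, 5, 8, 9, 10, 12, 15, 20, 12000], dtype=float)

# quantile with a list returns a Series; unpacking yields its two values.
lo, hi = duration_min.quantile([0.05, 0.95])

# between() keeps the values inside [lo, hi], dropping both tails.
core = duration_min[duration_min.between(lo, hi)]
print(core)
```

The resulting `core` Series can be passed to `.plot(kind='hist')` like the other subsets.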

19. The Product Team would like all of the station names to be lower case and with `_` as a separator.

`South Van Ness at Market` -> `south_van_ness_at_market`  

__DO NOT USE A FOR LOOP. THEY ARE THE 👿__

In [None]:
# Convert station names to lower_snake_case with chained, vectorized string methods
df.start_station = df.start_station.str.lower().str.replace(' ', '_')
df.end_station = df.end_station.str.lower().str.replace(' ', '_')

df.head(5)
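Both station columns can also be converted in a single statement without an explicit loop. The sketch below applies the same string chain to each selected column; `station_cols` is a helper name introduced here, and the second station name and the missing value are invented to show that `.str` methods pass missing entries through.

```python
import pandas as pd

# Small invented frame; only "South Van Ness at Market" comes from the exercise text.
toy = pd.DataFrame({
    "start_station": ["South Van Ness at Market", None],
    "end_station": ["Market at 4th", "South Van Ness at Market"],
})

station_cols = ["start_station", "end_station"]  # hypothetical helper list

# apply() hands each selected column to the lambda as a Series.
toy[station_cols] = toy[station_cols].apply(
    lambda col: col.str.lower().str.replace(" ", "_", regex=False)
)
print(toy)
```

Unlike `map` with a plain `lambda x: x.lower()`, the `.str` accessor does not raise on missing station names.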