# Scientific Python
## Central European University

## 04 Pandas, seaborn -- Even more exercises

Instructor: Márton Pósfai, TA: --

Email: posfaim@ceu.edu

*Don't forget:* use the Slack channel for discussion, to ask questions, or to show solutions to exercises that are different from the ones provided in the notebook. [Slack channel](http://www.personal.ceu.edu/staff/Marton_Posfai/slack_forward.html)

In [1]:
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Load the bike sharing data

The remaining exercises use the bike data. Load the files and merge them again using the following code cells.

In [None]:
trips = pd.read_csv('Divvy_Trips_2013.csv')
trips.starttime = pd.to_datetime(trips.starttime, format="%Y-%m-%d %H:%M")
trips.stoptime = pd.to_datetime(trips.stoptime, format="%Y-%m-%d %H:%M")
trips['dayofweek']=trips['starttime'].apply(lambda dt: dt.dayofweek)
trips['logduration']=np.log(trips['tripduration'])

stations = pd.read_csv('Divvy_Stations_2013.csv')
trips2 = pd.merge(left=trips, right=stations, how='left', left_on='from_station_name', right_on='name')
trips_extended = pd.merge(trips2, stations, how='inner', left_on='to_station_name', right_on='name',
                    suffixes=['_origin', '_dest'])
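
To see what the `suffixes` argument does in the second merge, here is a minimal sketch on made-up data. The names `toy_trips`, `toy_stations`, `step1` and `step2` are hypothetical and the coordinates are invented. After the first merge the station columns are already present once, so the second merge has to rename the clashing columns.

```python
import pandas as pd

# Made-up miniature versions of the trips and stations tables
toy_trips = pd.DataFrame({
    'trip_id': [1, 2],
    'from_station_name': ['A', 'B'],
    'to_station_name':   ['B', 'A'],
})
toy_stations = pd.DataFrame({'name': ['A', 'B'], 'latitude': [41.88, 41.90]})

# First merge: attach the data of the origin station
step1 = pd.merge(left=toy_trips, right=toy_stations, how='left',
                 left_on='from_station_name', right_on='name')

# Second merge: attach the data of the destination station;
# 'name' and 'latitude' now exist on both sides, so they get the suffixes
step2 = pd.merge(step1, toy_stations, how='inner',
                 left_on='to_station_name', right_on='name',
                 suffixes=('_origin', '_dest'))

print(step2.columns.tolist())
```

This is where column names such as `latitude_origin` and `latitude_dest` come from.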

In the optional Seaborn part of the class notebook, we calculated the distance between stations; we will use it again here.

We define a function that takes two latitude and longitude pairs and returns their distance in kilometers (see the [haversine formula](https://en.wikipedia.org/wiki/Haversine_formula) on Wikipedia for details):

In [None]:
def latlongdist(lat1,long1,lat2,long2):
    rlat1 = math.radians(lat1)
    rlat2 = math.radians(lat2)
    rlong1 = math.radians(long1)
    rlong2 = math.radians(long2)
    dlat = rlat2 - rlat1
    dlong = rlong2 - rlong1
    a = math.sin(dlat / 2)**2 + math.cos(rlat1) * math.cos(rlat2) * math.sin(dlong / 2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return 6371.0 * c
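
A quick sanity check of the function can be sketched as follows. The function is repeated so that the snippet runs on its own, and the coordinates are made up for the check rather than taken from the bike data. Two points on the same meridian that are one degree of latitude apart should be roughly $6371\cdot\pi/180\approx 111.2$ km from each other, and a point should be at distance zero from itself.

```python
import math

def latlongdist(lat1, long1, lat2, long2):
    # Haversine distance in km on a sphere with radius 6371 km
    rlat1 = math.radians(lat1)
    rlat2 = math.radians(lat2)
    rlong1 = math.radians(long1)
    rlong2 = math.radians(long2)
    dlat = rlat2 - rlat1
    dlong = rlong2 - rlong1
    a = math.sin(dlat / 2)**2 + math.cos(rlat1) * math.cos(rlat2) * math.sin(dlong / 2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return 6371.0 * c

# One degree of latitude along the same meridian
one_degree = latlongdist(41.0, -87.6, 42.0, -87.6)
# Identical start and end point
same_point = latlongdist(41.9, -87.6, 41.9, -87.6)
print(round(one_degree, 2), same_point)
```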

Create a new column for each trip containing the distance between two stations:

In [None]:
trips_extended['dist']=trips_extended.apply(lambda row: 
                                            latlongdist(row['latitude_origin'],
                                                        row['longitude_origin'],
                                                        row['latitude_dest'],
                                                        row['longitude_dest']),
                                           axis=1)
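
Calling `apply` with `axis=1` runs the Python function once per row, which can be slow on a table with many trips. One possible alternative, sketched here under the assumption that NumPy functions are acceptable, is to evaluate the same formula on whole columns at once. The name `latlongdist_vec` and the toy coordinates are hypothetical.

```python
import numpy as np
import pandas as pd

def latlongdist_vec(lat1, long1, lat2, long2):
    """Haversine distance in km; accepts scalars, arrays or pandas Series."""
    rlat1, rlong1, rlat2, rlong2 = (np.radians(v) for v in (lat1, long1, lat2, long2))
    a = (np.sin((rlat2 - rlat1) / 2)**2
         + np.cos(rlat1) * np.cos(rlat2) * np.sin((rlong2 - rlong1) / 2)**2)
    return 6371.0 * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Toy stand-in for trips_extended with made-up coordinates
toy = pd.DataFrame({
    'latitude_origin':  [41.0, 41.9],
    'longitude_origin': [-87.6, -87.6],
    'latitude_dest':    [42.0, 41.9],
    'longitude_dest':   [-87.6, -87.6],
})

# The whole column is computed in one call, without a row-wise lambda
toy['dist'] = latlongdist_vec(toy['latitude_origin'], toy['longitude_origin'],
                              toy['latitude_dest'], toy['longitude_dest'])
print(toy['dist'])
```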

### 01 Distances

Use the `dist` column in the `trips_extended` dataframe that contains the distance (as the crow flies) between the origin and destination stations to create a figure using seaborn's `pointplot` function to answer the following questions:
* Do men or women go on longer trips?
* How does the trip distance depend on the day of the week?

<details><summary><u>Hint</u></summary>
<p>

This is very similar to what we did in the class notebook with the day of week, gender and tripduration.

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
dayofweek_gender_dist = trips_extended.groupby(['dayofweek','gender'])['dist'].mean()
dayofweek_gender_dist = dayofweek_gender_dist.reset_index()

ax=sns.pointplot(data=dayofweek_gender_dist,x='dayofweek',y='dist', hue='gender')
ax.set_ylabel('Average distance of trips [km]')
ax.set_title('Average trip distance per day and user gender');
sns.despine(trim=True,offset=10)
```
    
</p>
</details>

### 02 Age

Using the `birthday` column (the rider's birth year) in `trips_extended`, calculate the age of the rider in years at the time of the trip and store it in a new column `age`.
* What is the highest age in the data? How many rides were taken by people this age?

<details><summary><u>Hint</u></summary>
<p>

The age is (year of the trip - birth year). You can get the year from the `starttime` column the same way we extracted the day of week. (If you must, you can also cut corners by noticing that the entire data set is from 2013...)

</p>
</details>
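
As an aside, datetime columns also expose their components through the `.dt` accessor, which avoids the `apply` with a lambda. The following is a minimal sketch on made-up rows; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Two made-up trips with a start time and a birth year
toy = pd.DataFrame({
    'starttime': pd.to_datetime(['2013-06-27 12:11', '2013-12-31 23:59']),
    'birthday': [1980, 1995],
})

# .dt.year, .dt.hour and .dt.dayofweek work on the whole column at once
toy['age'] = toy['starttime'].dt.year - toy['birthday']
toy['hour'] = toy['starttime'].dt.hour
toy['dayofweek'] = toy['starttime'].dt.dayofweek
print(toy)
```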

In [None]:
trips_extended['age']=trips_extended.starttime.apply(lambda dt: dt.year)-trips_extended.birthday


<details><summary><u>Solution.</u></summary>
<p>
    
```python
trips_extended['age']=trips_extended.starttime.apply(lambda dt: dt.year)-trips_extended.birthday
```
    
</p>
</details>

* Create a histogram showing the distribution of rider ages for each gender. Check out `sns.displot` [here](https://seaborn.pydata.org/tutorial/distributions.html). Try both showing the raw counts and normalizing the genders independently by setting `stat='density'` and `common_norm=False`. You can experiment with `alpha`, `hue_order` and other settings to make your plot look nice.

<details><summary><u>Hint</u></summary>
<p>

You need to show the histogram of ages so set `x='age'`. To show the genders separately, set `hue='gender'`.

</p>
</details>

In [None]:
sns.displot(data=trips_extended,x='age',hue='gender',
            hue_order=['Female','Male'],
            bins=20,alpha=.66,
            #stat='density',common_norm=False
           )

<details><summary><u>Solution.</u></summary>
<p>
    
```python
sns.displot(data=trips_extended,x='age',hue='gender',
            hue_order=['Female','Male'],
            bins=20,alpha=.66,
            #stat='density',common_norm=False
           )
```
    
</p>
</details>

* Does the trip distance depend on age? Plot the mean distance as a function of rider age. There are only a few old riders, so the data is noisy for large ages; show only ages under 60.

<details><summary><u>Hint</u></summary>
<p>

Groupby `age` and calculate the mean of `dist`. Plot using pandas or seaborn, your choice.

</p>
</details>

In [None]:
dist_by_age = trips_extended[trips_extended.age<60].groupby('age')['dist'].mean()
dist_by_age = dist_by_age.reset_index()
ax = sns.pointplot(data=dist_by_age, x='age', y='dist')

# pointplot treats age as a category and draws a tick for every age at positions 0, 1, 2, ...;
# keep only every 10th tick so that the labels stay readable
ax.set_xticks(np.arange(0, len(dist_by_age), 10))
ax.set_xticklabels(dist_by_age['age'].values[0::10].astype(int))

<details><summary><u>Solution.</u></summary>
<p>
    
```python
ax=trips_extended[trips_extended.age<60].groupby('age')['dist'].mean().plot()
ax.set_ylabel('distance')
```
    
</p>
</details>

### 03 Hours of day

Using the `hour` attribute of the timestamp objects, create a new column that contains the hour of the day when the trip started. Plot the total number of trips that happened in each hour with a separate color for each day of the week using seaborn.

Can you explain the patterns that you see?

<details><summary><u>Hint -- new column</u></summary>
<p>

This is very similar to what we did with the day of week, only instead of using `dt.dayofweek`, we have to use `dt.hour`.

</p>
</details>

<details><summary><u>Hint -- groupby</u></summary>
<p>

You have to `groupby` based on the two columns representing the day of the week and the hour. You need to use the `count()` statistic and you can pick any column to do the statistic on.

</p>
</details>
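
The grouping step can be sketched on a made-up table as follows; the frame `toy` and the column name `numtrips` are hypothetical. Besides `count()` on a chosen column, `size()` counts the rows of each group directly. The two agree as long as the chosen column has no missing values, because `count()` skips missing entries.

```python
import pandas as pd

# Five made-up trips: three on Monday (0), two on Saturday (5)
toy = pd.DataFrame({
    'trip_id':   [1, 2, 3, 4, 5],
    'dayofweek': [0, 0, 0, 5, 5],
    'hour':      [8, 8, 17, 11, 11],
})

# Variant 1: count the non-missing values of one column in each group
by_count = toy.groupby(['dayofweek', 'hour'])['trip_id'].count().reset_index()

# Variant 2: count the rows of each group and name the new column
by_size = toy.groupby(['dayofweek', 'hour']).size().reset_index(name='numtrips')
print(by_size)
```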

<details><summary><u>Hint -- plotting</u></summary>
<p>

This is very similar to what we did with the `dayofweek` and `gender`, only instead of using `gender` use the new column. Make sure that the hue represents the days of the week and the x axis represents the hour of the day.

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
trips_extended['hour']=trips_extended['starttime'].apply(lambda dt: dt.hour)

dayofweek_hour_numtrips = trips_extended.groupby(['dayofweek','hour'])['trip_id'].count().reset_index()

ax=sns.pointplot(data=dayofweek_hour_numtrips,x='hour',y='trip_id', hue='dayofweek')
ax.set_ylabel('Number of trips')
ax.set_title('Number of trips per hour');
sns.despine()
```
    
</p>
</details>

### 04 Map of stations

Use `sns.jointplot` to create a two-dimensional histogram of the longitude and latitude of the stations in the `stations` dataframe. Look up a map of Chicago. Does your plot make sense?

<details><summary><u>Hint</u></summary>
<p>

Use `jointplot` and set `x='longitude'` and `y='latitude'`. Yes, your plot should make sense!

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
sns.jointplot(data=stations,x='longitude',y='latitude',kind='kde')
```
    
</p>
</details>

### 05 East-west

What is the fraction of trips going west (the destination station is to the west of the origin station) and what is the fraction of trips going north? Exclude trips that originate and end at the same station.

Bonus: Is this statistically different from 50%? Calculate the half-width of the 95% confidence interval around the estimated fraction using
$$CI = 1.96\sqrt{\frac{\hat p(1-\hat p)}{n}},$$
where $\hat p$ is the estimated probability and $n$ is the number of samples.
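
As a worked example of the formula with made-up numbers (not the bike data): suppose 5,200 out of 10,000 trips go west. The sketch below computes the half-width and checks whether 50% falls inside the resulting interval $\hat p \pm CI$. All variable names are hypothetical.

```python
import math

n = 10_000           # made-up number of trips
k = 5_200            # made-up number of trips going west
p_hat = k / n        # estimated probability

# Half-width of the 95% confidence interval (normal approximation)
half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - half_width, p_hat + half_width

# The estimate differs from 50% at this level if 0.5 lies outside the interval
differs_from_half = not (lower <= 0.5 <= upper)
print(f"{p_hat:.4f}+-{half_width:.4f}", differs_from_half)
```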

<details><summary><u>Hint</u></summary>
<p>

Trips go west if
```python
trips_wo_returns.longitude_dest<trips_wo_returns.longitude_origin
```
Do something similar for north.
                                                                  
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
trips_wo_returns = trips_extended[trips_extended.from_station_id!=trips_extended.to_station_id]
n = len(trips_wo_returns)

p_west = (trips_wo_returns.longitude_dest<trips_wo_returns.longitude_origin).sum()/n
p_west_CI = 1.96*np.sqrt(p_west*(1-p_west)/n)
print(f"west:{p_west:.4f}+-{p_west_CI:.4f}")

p_north = (trips_wo_returns.latitude_dest>trips_wo_returns.latitude_origin).sum()/n
p_north_CI = 1.96*np.sqrt(p_north*(1-p_north)/n)
print(f"north:{p_north:.4f}+-{p_north_CI:.4f}")
```
    
</p>
</details>

### 06 Capacity

Investigate the relationship between station capacity and traffic. Which two stations would you expand if you had the budget?
* Create a series indexed by the station name and containing the total number of outgoing trips of each station
* Do the same thing with in-traffic and add the two series together to get the total traffic
* Use `merge()` to add traffic data to the `stations` dataframe
* Create a scatter plot showing the correlation between traffic and capacity

<details><summary><u>Hint 1</u></summary>
<p>

To get the out-traffic, group the trips by `from_station_name` and use `count()` on the `trip_id` column.

</p>
</details>

<details><summary><u>Hint 2</u></summary>
<p>

Use `merge()` very similarly as we did earlier to extend the `trips` dataframe. Try renaming the new column!
The next hint reveals the exact code to do the merge.

</p>
</details>

<details><summary><u>Hint 3</u></summary>
<p>

```python
stations_extended = pd.merge(left=stations, right=traffic, how='left', left_on='name', right_on='to_station_name')
```
The series object has a name attribute, in this case `traffic.name`, that will become the name of the column after the merge. You can rename the column after the merge, or you can change the name of the series before the merge.

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
outtraffic = trips_extended.groupby('from_station_name')['trip_id'].count()
intraffic = trips_extended.groupby('to_station_name')['trip_id'].count()
traffic = intraffic+outtraffic
traffic.name = "traffic"
stations_extended = pd.merge(left=stations, right=traffic, how='left', left_on='name', right_on='to_station_name')
stations_extended.plot(kind='scatter',x='dpcapacity',y='traffic');
```
    
</p>
</details>
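
One thing to watch out for when adding the two series: `+` aligns them by station name, and a station that appears in only one of them ends up with a missing value. A possible way around this, sketched on made-up data, is `Series.add` with `fill_value=0` combined with a merge on the index of the series. The frames `toy_trips` and `toy_stations` and their values are hypothetical.

```python
import pandas as pd

# Made-up trips: station C has departures but no arrivals
toy_trips = pd.DataFrame({
    'trip_id': [1, 2, 3, 4],
    'from_station_name': ['A', 'A', 'B', 'C'],
    'to_station_name':   ['B', 'B', 'A', 'A'],
})
toy_stations = pd.DataFrame({'name': ['A', 'B', 'C'], 'dpcapacity': [15, 19, 11]})

outtraffic = toy_trips.groupby('from_station_name')['trip_id'].count()
intraffic = toy_trips.groupby('to_station_name')['trip_id'].count()

# Plain addition leaves a missing value for station C
naive = intraffic + outtraffic

# add() with fill_value=0 treats the missing side as zero trips
traffic = intraffic.add(outtraffic, fill_value=0).rename('traffic')

# Merging on the index of the series does not depend on the index name
stations_extended = pd.merge(left=toy_stations, right=traffic, how='left',
                             left_on='name', right_index=True)
print(stations_extended)
```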

### 07 Sources and sinks

Which stations are sources and which stations are sinks, i.e. which stations have many more departures than arrivals, and vice versa? Print out the 10 stations with the largest difference between in- and out-traffic. To sort a series `S` by its values use `S.sort_values()`.

<details><summary><u>Hint.</u></summary>
<p>

Reuse part of the solution of the previous exercise.

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
traffic_diff = np.abs(intraffic-outtraffic).sort_values(ascending=False)
traffic_diff.head(10)
```
    
</p>
</details>
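
The absolute difference ranks the stations but does not tell sources from sinks. A minimal sketch of one way to keep the sign, on made-up data: subtract arrivals from departures, so that positive values mark sources and negative values mark sinks. The frame `toy` and the name `net_out` are hypothetical.

```python
import pandas as pd

# Made-up trips: A only sends bikes out, B mostly receives them
toy = pd.DataFrame({
    'trip_id': [1, 2, 3, 4, 5],
    'from_station_name': ['A', 'A', 'A', 'B', 'C'],
    'to_station_name':   ['B', 'B', 'C', 'C', 'B'],
})
outtraffic = toy.groupby('from_station_name')['trip_id'].count()
intraffic = toy.groupby('to_station_name')['trip_id'].count()

# Departures minus arrivals; fill_value=0 covers stations missing on one side
net_out = outtraffic.sub(intraffic, fill_value=0).sort_values(ascending=False)

sources = net_out.head(1)   # largest surplus of departures
sinks = net_out.tail(1)     # largest surplus of arrivals
print(net_out)
```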

### 08 Daily commute map

Building on the previous exercises, create two series:
* `morning_outtraffic`, which counts the number of outgoing trips from each station before noon
* `evening_outtraffic`, which counts the number of outgoing trips from each station after noon

Create a scatter plot representing the stations such that the x coordinate is the longitude, the y coordinate is the latitude and the color of the markers represents the morning traffic divided by the evening traffic.

For added fun, color the stations red if they have more outgoing traffic in the morning than in the evening, otherwise black.

<details><summary><u>Hint -- series</u></summary>
<p>

Modify the solution of the 06 Capacity exercise to only count the trips for `trips_extended.hour<12`. The `hour` column was created in the 03 Hours of day exercise.

</p>
</details>

<details><summary><u>Hint -- scatter plot</u></summary>
<p>

Use pandas' or matplotlib's scatter plot, set the color to be `c=morning_outtraffic/evening_outtraffic`.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
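morning_outtraffic = trips_extended[trips_extended.hour<12].groupby('from_station_name')['trip_id'].count()

# use >= so that trips starting between 12:00 and 12:59 are not dropped
evening_outtraffic = trips_extended[trips_extended.hour>=12].groupby('from_station_name')['trip_id'].count()

# the ratio is indexed by station name; map it onto the rows of the stations table
# so that each marker gets the color of its own station
ratio = stations_extended['name'].map(morning_outtraffic/evening_outtraffic)

stations_extended.plot(kind='scatter',x='longitude',y='latitude',
                       c=ratio,cmap='inferno',s=20);

# explicit color names do not need a colormap
stations_extended.plot(kind='scatter',x='longitude',y='latitude',
                       c=['r' if x>1 else 'k' for x in ratio],s=20);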
```
    
</p>
</details>