#Subway Data
Analyzing Subway and Weather Data

Questions:
    - What variables are related to subway ridership?
        - Which stations have the most riders?
        - What are the ridership patterns over time?
        - How does the weather affect ridership?
    - What patterns can I find in the weather?
        - Is the temperature rising throughout the month?
        - How does weather vary across the city?
        
#Two-Dimensional Data
**Python: List of Lists**, **Numpy: 2D Arrays**, **Pandas: DataFrame**

2D arrays, as opposed to an array of arrays:
    - More memory efficient
    - Elements are accessed with a single pair of brackets: a[1, 3]
    - mean(), std(), etc. operate on the entire array
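
The differences listed above can be sketched on a small made-up grid. The names `nested` and `grid` and their values are illustrative only and are not part of the subway data.

```python
import numpy as np

nested = [[1, 2, 3], [4, 5, 6]]   # plain Python list of lists
grid = np.array(nested)           # the same values as a 2D NumPy array

# A list of lists needs two separate index operations
print(nested[1][2])    # 6
# A 2D array takes both indices inside one pair of brackets
print(grid[1, 2])      # 6

# mean() with no arguments reduces over every element of the array
print(grid.mean())     # 3.5
```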

In [1]:
import numpy as np

ridership = np.array([
    [   0,    0,    2,    5,    0],
    [1478, 3877, 3674, 2328, 2539],
    [1613, 4088, 3991, 6461, 2691],
    [1560, 3392, 3826, 4787, 2613],
    [1608, 4802, 3932, 4477, 2705],
    [1576, 3933, 3909, 4979, 2685],
    [  95,  229,  255,  496,  201],
    [   2,    0,    1,   27,    0],
    [1438, 3785, 3589, 4174, 2215],
    [1342, 4043, 4009, 4665, 3033]
])

In [2]:
print(ridership[1, 3])
print(ridership[1:3, 3:5])
print(ridership[1, :])

2328
[[2328 2539]
 [6461 2691]]
[1478 3877 3674 2328 2539]


In [3]:
print(ridership[0, :] + ridership[1, :])
print(ridership[:, 0] + ridership[:, 1])

[1478 3877 3676 2333 2539]
[   0 5355 5701 4952 6410 5509  324    2 5223 5385]


In [4]:
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
print(a + b)

[[ 2  3  4]
 [ 6  7  8]
 [10 11 12]]


Writing a function to:
    1. find the station with the most riders on the first day
    2. return the overall mean ridership and the mean riders per day for that station

In [5]:
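def mean_riders_for_max_station(ridership):
    # Column index of the station with the most riders on the first day (row 0)
    max_station = ridership[0, :].argmax()
    # Mean over every day and every station
    overall_mean = ridership.mean()
    # Mean over all days for that one station
    mean_for_max = ridership[:, max_station].mean()

    return (overall_mean, mean_for_max)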

In [6]:
mean_riders_for_max_station(ridership)

(2342.5999999999999, 3239.9000000000001)

#Numpy Axis
Operations along an Axis

In [7]:
a = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

print(a.sum())
print(a.sum(axis=0))
print(a.sum(axis=1))

45
[12 15 18]
[ 6 15 24]
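
To connect the axis argument to the ridership layout (rows are days, columns are stations), here is a minimal sketch on a made-up 3 x 2 array. The name `riders` and its values are assumptions for illustration.

```python
import numpy as np

# Illustrative data: 3 days (rows) x 2 stations (columns)
riders = np.array([
    [10, 20],
    [30, 40],
    [50, 60],
])

per_station = riders.sum(axis=0)  # collapses the rows: one total per column
per_day = riders.sum(axis=1)      # collapses the columns: one total per row

print(per_station)  # [ 90 120]
print(per_day)      # [ 30  70 110]
```

The axis you pass is the one that disappears from the result, so `axis=0` leaves one value per station and `axis=1` leaves one value per day.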


Finding the mean ridership per day for each subway station, then returning the maximum and minimum of those means.

In [8]:
ridership = np.array([
    [   0,    0,    2,    5,    0],
    [1478, 3877, 3674, 2328, 2539],
    [1613, 4088, 3991, 6461, 2691],
    [1560, 3392, 3826, 4787, 2613],
    [1608, 4802, 3932, 4477, 2705],
    [1576, 3933, 3909, 4979, 2685],
    [  95,  229,  255,  496,  201],
    [   2,    0,    1,   27,    0],
    [1438, 3785, 3589, 4174, 2215],
    [1342, 4043, 4009, 4665, 3033]
])

In [9]:
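def min_and_max_riders_per_day(ridership):
    # axis=0 averages down each column: one mean riders-per-day value per station
    mean_daily_ridership = ridership.mean(axis=0)
    max_daily_ridership = mean_daily_ridership.max()
    min_daily_ridership = mean_daily_ridership.min()

    # Note the order: the maximum is returned first, then the minimum
    return (max_daily_ridership, min_daily_ridership)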

In [10]:
min_and_max_riders_per_day(ridership)

(3239.9000000000001, 1071.2)

#Numpy and Pandas Data Type

In [11]:
np.array([1,2,3,4,5]).dtype

dtype('int32')

In [12]:
enrollments = np.array([
        ['account_key','status','join_date','days_to_cancel','is_udacity'],
        [448,'canceled','2014-11-10',65,True],
        [448,'canceled','2014-11-05',5,True],
        [448,'canceled','2015-01-27',0,True],
        [448,'canceled','2014-11-10',0,True],
        [448,'current','2015-03-10',np.nan,True]
    ])

In [13]:
enrollments

array([['account_key', 'status', 'join_date', 'days_to_cancel',
        'is_udacity'],
       ['448', 'canceled', '2014-11-10', '65', 'True'],
       ['448', 'canceled', '2014-11-05', '5', 'True'],
       ['448', 'canceled', '2015-01-27', '0', 'True'],
       ['448', 'canceled', '2014-11-10', '0', 'True'],
       ['448', 'current', '2015-03-10', 'nan', 'True']], 
      dtype='<U14')

NumPy has converted every value to a string, because an array holds a single data type. This makes it awkward to calculate the mean and other metrics, so a Pandas DataFrame is used instead.
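
A short sketch of the same effect on a smaller, made-up array (the name `mixed` and its rows are illustrative). It also shows one possible workaround, casting a single column back to float, which is an assumption about how you might proceed rather than something the notebook does.

```python
import numpy as np

# Mixing numbers and text forces NumPy to choose one common dtype: a string type
mixed = np.array([
    [448, 'canceled', 65],
    [448, 'current', np.nan],
])
print(mixed.dtype.kind)   # U (Unicode string)

# The numeric column has to be converted back before doing arithmetic
days = mixed[:, 2].astype(float)
print(np.nanmean(days))   # 65.0 (the NaN entry is skipped)
```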

In [14]:
import pandas as pd

enrollments_df = pd.DataFrame({
        'account_key': [448,448,448,448,448],
        'status': ['canceled','canceled','canceled','canceled','current'],
        'join_date': ['2014-11-10','2014-11-05','2015-01-27','2014-11-10','2015-03-10'],
        'days_to_cancel': [65,5,0,0,np.nan],
        'is_udacity': [True,True,True,True,True]
    })

In [15]:
enrollments_df

   account_key  days_to_cancel  is_udacity   join_date    status
0          448            65.0        True  2014-11-10  canceled
1          448             5.0        True  2014-11-05  canceled
2          448             0.0        True  2015-01-27  canceled
3          448             0.0        True  2014-11-10  canceled
4          448             NaN        True  2015-03-10   current


In [16]:
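# numeric_only=True skips the string columns, which newer pandas versions no longer drop silently
enrollments_df.mean(numeric_only=True)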

account_key       448.0
days_to_cancel     17.5
is_udacity          1.0
dtype: float64
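
The per-column means work because a DataFrame stores a separate data type for each column. A minimal sketch on a made-up two-row frame (the name `df` and its values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'status': ['canceled', 'current'],
    'days_to_cancel': [65, np.nan],
    'is_udacity': [True, True],
})

# Each column has its own dtype, so numeric columns stay numeric
print(df.dtypes)

# Averaging one numeric column; NaN values are skipped by default
print(df['days_to_cancel'].mean())   # 65.0
```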

#Accessing Elements of DataFrame

In [17]:
ridership_df = pd.DataFrame({
        'R003': [0,1478,1613,1560,1608,1576,95,2,1438,1342],
        'R004': [0,3877,4088,3392,4802,3933,229,0,3785,4043],
        'R005': [2,3674,3991,3826,3932,3909,255,1,3589,4009],
        'R006': [5,2328,6461,4787,4477,4979,496,27,4174,4665],
        'R007': [0,2539,2691,2613,2705,2685,201,0,2215,3033]
    },index=[
        '05-01-11','05-02-11','05-03-11','05-04-11','05-05-11',
        '05-06-11','05-07-11','05-08-11','05-09-11','05-10-11'
    ])

In [18]:
ridership_df

          R003  R004  R005  R006  R007
05-01-11     0     0     2     5     0
05-02-11  1478  3877  3674  2328  2539
05-03-11  1613  4088  3991  6461  2691
05-04-11  1560  3392  3826  4787  2613
05-05-11  1608  4802  3932  4477  2705
05-06-11  1576  3933  3909  4979  2685
05-07-11    95   229   255   496   201
05-08-11     2     0     1    27     0
05-09-11  1438  3785  3589  4174  2215
05-10-11  1342  4043  4009  4665  3033


In [19]:
ridership_df.loc['05-02-11']

R003    1478
R004    3877
R005    3674
R006    2328
R007    2539
Name: 05-02-11, dtype: int64

In [20]:
ridership_df.iloc[9]

R003    1342
R004    4043
R005    4009
R006    4665
R007    3033
Name: 05-10-11, dtype: int64

In [21]:
ridership_df.iloc[1,3]

2328

In [22]:
ridership_df.loc['05-02-11','R003']

1478

In [23]:
ridership_df['R005']

05-01-11       2
05-02-11    3674
05-03-11    3991
05-04-11    3826
05-05-11    3932
05-06-11    3909
05-07-11     255
05-08-11       1
05-09-11    3589
05-10-11    4009
Name: R005, dtype: int64

In [24]:
ridership_df.values

array([[   0,    0,    2,    5,    0],
       [1478, 3877, 3674, 2328, 2539],
       [1613, 4088, 3991, 6461, 2691],
       [1560, 3392, 3826, 4787, 2613],
       [1608, 4802, 3932, 4477, 2705],
       [1576, 3933, 3909, 4979, 2685],
       [  95,  229,  255,  496,  201],
       [   2,    0,    1,   27,    0],
       [1438, 3785, 3589, 4174, 2215],
       [1342, 4043, 4009, 4665, 3033]], dtype=int64)

In [25]:
ridership_df.values.mean()

2342.5999999999999

In [26]:
ridership_df = pd.DataFrame(
    data=[[   0,    0,    2,    5,    0],
          [1478, 3877, 3674, 2328, 2539],
          [1613, 4088, 3991, 6461, 2691],
          [1560, 3392, 3826, 4787, 2613],
          [1608, 4802, 3932, 4477, 2705],
          [1576, 3933, 3909, 4979, 2685],
          [  95,  229,  255,  496,  201],
          [   2,    0,    1,   27,    0],
          [1438, 3785, 3589, 4174, 2215],
          [1342, 4043, 4009, 4665, 3033]],
    index=['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
           '05-06-11', '05-07-11', '05-08-11', '05-09-11', '05-10-11'],
    columns=['R003', 'R004', 'R005', 'R006', 'R007']
)

In [27]:
df_1 = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df_1)

df_2 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=['A', 'B', 'C'])
print(df_2)
   
print(ridership_df.iloc[0])
print(ridership_df.loc['05-05-11'])
print(ridership_df['R003'])
print(ridership_df.iloc[1, 3])

   A  B
0  0  3
1  1  4
2  2  5
   A  B  C
0  0  1  2
1  3  4  5
R003    0
R004    0
R005    2
R006    5
R007    0
Name: 05-01-11, dtype: int64
R003    1608
R004    4802
R005    3932
R006    4477
R007    2705
Name: 05-05-11, dtype: int64
05-01-11       0
05-02-11    1478
05-03-11    1613
05-04-11    1560
05-05-11    1608
05-06-11    1576
05-07-11      95
05-08-11       2
05-09-11    1438
05-10-11    1342
Name: R003, dtype: int64
2328


In [28]:
print(ridership_df.iloc[1:4])

print(ridership_df[['R003', 'R005']])

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df.sum())
print(df.sum(axis=1))
print(df.values.sum())

          R003  R004  R005  R006  R007
05-02-11  1478  3877  3674  2328  2539
05-03-11  1613  4088  3991  6461  2691
05-04-11  1560  3392  3826  4787  2613
          R003  R005
05-01-11     0     2
05-02-11  1478  3674
05-03-11  1613  3991
05-04-11  1560  3826
05-05-11  1608  3932
05-06-11  1576  3909
05-07-11    95   255
05-08-11     2     1
05-09-11  1438  3589
05-10-11  1342  4009
A     3
B    12
dtype: int64
0    3
1    5
2    7
dtype: int64
15


In [29]:
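def mean_riders_for_max_station(ridership):
    # idxmax() returns the column label of the busiest station on the first day;
    # argmax() returns an integer position in current pandas, which cannot be used as a column key here
    max_station = ridership.iloc[0].idxmax()
    overall_mean = ridership.values.mean()
    mean_for_max = ridership[max_station].mean()

    return (overall_mean, mean_for_max)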

In [30]:
mean_riders_for_max_station(ridership_df)

(2342.5999999999999, 3239.9)
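
Looking up the busiest station by label depends on getting a column name back, not a position. Which of the two `Series.argmax()` returns has varied across pandas versions, so the sketch below uses `idxmax()` for the label and the underlying NumPy array for the position. The name `first_day` is illustrative; its values are the first row of the ridership data.

```python
import pandas as pd

first_day = pd.Series([0, 0, 2, 5, 0],
                      index=['R003', 'R004', 'R005', 'R006', 'R007'])

# idxmax() returns the index label of the largest value
label = first_day.idxmax()
print(label)      # R006

# argmax() on the underlying NumPy array returns the integer position
position = first_day.to_numpy().argmax()
print(position)   # 3
```

The label can be used directly as a column key (`ridership_df[label]`), while the position is meant for `iloc`.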