#Pandas tutorial

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl

##Loading Data

In [2]:
bikedata = pd.read_csv('citibike-tripdata.csv')
len(bikedata)
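
As a rough sketch of what the load step does, the cell below reads a tiny in-memory CSV instead of the real file. The two rows are made up for illustration, and the column names are assumed to match the ones used later in this tutorial. Passing `parse_dates` is optional; it converts the start time to a datetime column while reading.

```python
import io
import pandas as pd

# Hypothetical two-row extract standing in for citibike-tripdata.csv.
csv_text = (
    "tripduration,starttime,usertype,birth year\n"
    "634,2014-07-01 00:00:04,Subscriber,1985\n"
    "1547,2014-07-01 00:00:06,Customer,\n"
)

# parse_dates converts the named column to datetime64 while reading.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["starttime"])
print(len(sample))
print(sample.dtypes)
```

The empty `birth year` field in the second row is read as a missing value (NaN).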

##Data Description

Citi Bike is New York City's bike sharing system. Intended to provide New Yorkers and visitors with an additional transportation option for getting around the city, bike sharing is fun, efficient and convenient. This data set contains the following fields collected for the bike sharing system:

Trip Duration (seconds)

Start Time and Date

Stop Time and Date

Start Station Name

End Station Name

Station ID

Station Lat/Long

Bike ID

User Type (Customer = 24-hour pass or 7-day pass user; Subscriber = Annual Member)

Gender (Zero=unknown; 1=male; 2=female)

Year of Birth
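
The coded Gender field listed above can be turned into readable labels with `Series.map`. This is only a sketch on made-up rows; the column names `gender` and `usertype` are assumed here.

```python
import pandas as pd

# Hypothetical rows; the real file has many more columns and records.
trips = pd.DataFrame({
    "usertype": ["Subscriber", "Customer", "Subscriber"],
    "gender": [1, 0, 2],
})

# Code-to-label mapping taken from the feature description above.
gender_labels = {0: "unknown", 1: "male", 2: "female"}
trips["gender_label"] = trips["gender"].map(gender_labels)
print(trips)
```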

##Viewing Data

####Shows top 5 rows

In [None]:
bikedata.head() 

####Customizable - head() accepts the number of rows to show

In [None]:
bikedata.head(15)

####Shows last 5 rows

In [None]:
bikedata.tail() #Shows last rows

####Quick descriptive statistics of the data

In [None]:
bikedata.describe() 
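
By default `describe()` summarises only the numeric columns. The sketch below, on a small made-up frame, shows the default and the `include="all"` variant that also covers text columns.

```python
import pandas as pd

# Hypothetical sample with one numeric and one text column.
sample = pd.DataFrame({
    "tripduration": [300, 600, 900],
    "usertype": ["Subscriber", "Customer", "Subscriber"],
})

numeric_summary = sample.describe()            # numeric columns only
full_summary = sample.describe(include="all")  # adds count/unique/top/freq for text
print(numeric_summary)
print(full_summary)
```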

####Sorting by values

In [None]:
bikedata.sort_values(by='birth year')
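
In current pandas releases, sorting by a column is done with `sort_values`. A minimal sketch on made-up data, showing both ascending and descending order:

```python
import pandas as pd

# Hypothetical sample rows.
sample = pd.DataFrame({
    "birth year": [1985.0, 1970.0, 1992.0],
    "tripduration": [400, 1200, 250],
})

# Ascending is the default; pass ascending=False to reverse the order.
by_birth_year = sample.sort_values(by="birth year")
longest_first = sample.sort_values(by="tripduration", ascending=False)
print(by_birth_year)
print(longest_first)
```

Both calls return a new DataFrame and leave `sample` unchanged.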

##Data Selection

####Selecting a single column using column name

In [None]:
bikedata['birth year']

In [None]:
bikedata.tripduration

Attribute access, the second strategy, does not work if the column name contains a space; the bracket form still does.

In [None]:
bikedata.'birth year'  # SyntaxError: attribute access cannot express a name with a space
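
A small sketch of the two access styles on a made-up frame: brackets accept any column name, while the dotted form only works for names that are valid Python identifiers.

```python
import pandas as pd

# Hypothetical sample with one name containing a space.
sample = pd.DataFrame({
    "tripduration": [400, 1200],
    "birth year": [1985, 1970],
})

by_brackets = sample["birth year"]   # works for any column name
by_attribute = sample.tripduration   # works only for identifier-like names
print(by_brackets)
print(by_attribute)
```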

####Get column names

In [None]:
column_names = bikedata.columns
column_names

####Replace space in each column name with an underscore.

In [None]:
new_column_names = [name.replace(' ','_') for name in column_names]
bikedata.columns = new_column_names 

####New column names

In [None]:
bikedata.columns

In [None]:
bikedata.birth_year
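
An alternative to assigning `columns` directly is `DataFrame.rename` with a function, which returns a renamed copy. This is a sketch on a made-up one-row frame, not the only way to do it.

```python
import pandas as pd

# Hypothetical sample with spaces in two column names.
sample = pd.DataFrame({
    "birth year": [1985],
    "start station name": ["York St & Jay St"],
    "tripduration": [400],
})

# rename applies the function to every column label and returns a new frame.
renamed = sample.rename(columns=lambda name: name.replace(" ", "_"))
print(list(renamed.columns))
print(renamed.birth_year)
```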

####Sorting by values - sorting data in descending order of trip duration

In [None]:
bikedata.sort_values('tripduration', ascending=False)

###Alternative ways to select data

####Slicing rows and columns by position

In [None]:
bikedata.iloc[1:5,:]

In [None]:
bikedata.iloc[:,0:5]
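
`iloc` slices by integer position, and the end of the slice is excluded, as with Python lists. A sketch on a made-up six-row frame:

```python
import pandas as pd

# Hypothetical sample with six rows and three columns.
sample = pd.DataFrame({
    "tripduration": [100, 200, 300, 400, 500, 600],
    "usertype": ["Subscriber", "Customer"] * 3,
    "birth_year": [1980, 1981, 1982, 1983, 1984, 1985],
})

rows = sample.iloc[1:5, :]   # rows at positions 1, 2, 3, 4; all columns
cols = sample.iloc[:, 0:2]   # all rows; first two columns
print(rows)
print(cols)
```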

####Boolean Indexing - Using a single column’s values to select data

In [None]:
duration_under90 = bikedata[bikedata['tripduration']<90]
len(duration_under90)

In [None]:
duration_under90
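
Boolean masks can also be combined. The sketch below uses made-up rows; note that each condition needs its own parentheses and that `&` (not `and`) joins them.

```python
import pandas as pd

# Hypothetical sample rows.
sample = pd.DataFrame({
    "tripduration": [60, 85, 400, 70],
    "usertype": ["Subscriber", "Customer", "Subscriber", "Subscriber"],
})

# Single condition: the comparison yields a True/False Series used as a mask.
short_trips = sample[sample["tripduration"] < 90]

# Two conditions combined element-wise with &.
short_subscribers = sample[
    (sample["tripduration"] < 90) & (sample["usertype"] == "Subscriber")
]
print(len(short_trips), len(short_subscribers))
```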

####Using the isin() method for filtering:

In [None]:
betweenLocations = bikedata[bikedata['start_station_name'].isin(['Pershing Square South','Liberty St & Broadway'])]
len(betweenLocations)

In [None]:
betweenLocations.head()
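
A sketch of `isin()` on made-up rows, including the `~` operator to select the rows that are not in the list:

```python
import pandas as pd

# Hypothetical sample rows.
sample = pd.DataFrame({
    "start_station_name": [
        "Pershing Square South",
        "York St & Jay St",
        "Liberty St & Broadway",
        "York St & Jay St",
    ],
    "tripduration": [300, 450, 600, 120],
})

wanted = ["Pershing Square South", "Liberty St & Broadway"]
mask = sample["start_station_name"].isin(wanted)

selected = sample[mask]    # trips starting at one of the listed stations
others = sample[~mask]     # every other trip
print(selected)
print(others)
```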

##Handling missing data
pandas primarily uses the value np.nan to represent missing data. By default, missing values are excluded from computations.

####To drop any rows that have missing data.

In [None]:
withoutNa = bikedata.dropna(how='any')
len(withoutNa)

####To drop any rows where all values are missing.

In [None]:
withoutNa = bikedata.dropna(how='all')
len(withoutNa)
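
The difference between `how='any'` and `how='all'` is easiest to see on a small made-up frame with one partly missing row and one fully missing row:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 1 is entirely missing, row 2 is partly missing.
sample = pd.DataFrame({
    "tripduration": [300, np.nan, 450],
    "birth_year": [1980, np.nan, np.nan],
})

complete_rows = sample.dropna(how="any")   # drops rows 1 and 2
non_empty_rows = sample.dropna(how="all")  # drops only row 1
print(len(complete_rows), len(non_empty_rows))
```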

In [None]:
bikedata['birth_year'].head(10)

####Filling missing data

In [None]:
bikedata['birth_year'].ffill().head(10)  # forward fill: carry the last valid value down
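
Forward filling ("pad") copies the last valid value into the gaps that follow it. A sketch on a made-up Series; note that a missing value at the very start stays missing because there is nothing before it to copy.

```python
import numpy as np
import pandas as pd

# Hypothetical birth years with gaps.
birth_year = pd.Series([np.nan, 1980.0, np.nan, np.nan, 1975.0])

# ffill() propagates the last valid observation forward.
filled = birth_year.ffill()
print(filled)
```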

##Data Manipulation

####A function, including a user-defined one, can be applied to the whole data set; the statement below returns the maximum of every column

In [None]:
bikedata.apply(np.max)
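
For a user-defined function, `apply` passes each column to it as a Series. The function `value_range` below is a hypothetical helper written for this sketch, and the data is made up.

```python
import pandas as pd

# Hypothetical numeric sample.
sample = pd.DataFrame({
    "tripduration": [300, 900, 450],
    "birth_year": [1980, 1990, 1975],
})

def value_range(column):
    """Return the spread (max minus min) of one column."""
    return column.max() - column.min()

# apply calls value_range once per column and collects the results in a Series.
ranges = sample.apply(value_range)
print(ranges)
```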

####Merging the data

In [None]:
pd.merge(betweenLocations,duration_under90)
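
When no key is given, `pd.merge` performs an inner join on all columns the two frames share. For two filtered subsets of the same table, this keeps the rows present in both. A sketch on made-up rows:

```python
import pandas as pd

# Hypothetical sample rows.
sample = pd.DataFrame({
    "tripduration": [60, 80, 500],
    "start_station_name": [
        "Pershing Square South",
        "York St & Jay St",
        "Pershing Square South",
    ],
})

short = sample[sample["tripduration"] < 90]
at_pershing = sample[sample["start_station_name"] == "Pershing Square South"]

# Inner join on both shared columns: only the short trip from Pershing remains.
both = pd.merge(at_pershing, short)
print(both)
```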

####Creating pivot tables

In [None]:
pd.pivot_table(bikedata, values='tripduration', index=['start_station_name'], columns=['usertype'])
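
`pivot_table` aggregates with the mean unless another `aggfunc` is given, so the table above holds the mean trip duration per station and user type. A sketch on made-up rows:

```python
import pandas as pd

# Hypothetical sample rows.
sample = pd.DataFrame({
    "start_station_name": [
        "York St & Jay St",
        "York St & Jay St",
        "York St & Jay St",
        "Liberty St & Broadway",
    ],
    "usertype": ["Subscriber", "Subscriber", "Customer", "Subscriber"],
    "tripduration": [300, 500, 1200, 700],
})

# One row per station, one column per user type, mean duration in each cell.
table = pd.pivot_table(
    sample,
    values="tripduration",
    index=["start_station_name"],
    columns=["usertype"],
)
print(table)
```

Combinations that never occur in the data, such as a Customer trip from Liberty St & Broadway in this sample, appear as NaN.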

####Grouping the data

In [None]:
grouping = bikedata.groupby(['start_station_name'])

In [None]:
grouping.get_group('York St & Jay St')
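
Besides fetching one group, a grouped object can be aggregated directly. A sketch on made-up rows, computing the mean duration and the trip count per start station:

```python
import pandas as pd

# Hypothetical sample rows.
sample = pd.DataFrame({
    "start_station_name": [
        "York St & Jay St",
        "Liberty St & Broadway",
        "York St & Jay St",
    ],
    "tripduration": [300, 700, 500],
})

grouping = sample.groupby("start_station_name")

mean_duration = grouping["tripduration"].mean()  # one value per station
trip_counts = grouping.size()                    # number of rows per station
york = grouping.get_group("York St & Jay St")    # the rows of a single group
print(mean_duration)
print(trip_counts)
```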

##Time Series using pandas
pandas has proven very successful as a tool for working with time series data, especially in the financial data analysis space.

####Creating a copy of our bike data and converting it into time series

In [None]:
bikedatats = bikedata.copy()

In [None]:
bikedatats.head()

####Creating a datetime index and forming a time series

In [None]:
bikedatats.index = pd.to_datetime(bikedatats.pop('starttime'))

####Calculate mean duration of a trip on daily basis

In [None]:
dailymeans = bikedatats['tripduration'].resample('D').mean()
dailymeans
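
The two steps above, building a datetime index and resampling by day, can be sketched end to end on made-up rows. The dates here are hypothetical.

```python
import pandas as pd

# Hypothetical trips spread over two days.
trips = pd.DataFrame({
    "starttime": [
        "2014-07-01 08:00:00",
        "2014-07-01 18:30:00",
        "2014-07-02 09:15:00",
    ],
    "tripduration": [300, 500, 1000],
})

# pop removes the column and returns it; to_datetime parses the strings.
trips.index = pd.to_datetime(trips.pop("starttime"))

# Group rows into calendar days ('D') and take the mean of each day.
daily_means = trips["tripduration"].resample("D").mean()
print(daily_means)
```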

####Show all trips starting at 13:00

In [None]:
bikedatats.at_time('13:00')

####Show all trips between a certain time range

In [None]:
bikedatats.between_time('15:00', '16:00').head(10)
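
Both methods look only at the time of day and ignore the date. In this sketch on made-up timestamps, `at_time` matches the given time exactly (a trip at 13:00:30 is not returned), and `between_time` includes both endpoints by default.

```python
import pandas as pd

# Hypothetical start times across two days.
index = pd.to_datetime([
    "2014-07-01 13:00:00",
    "2014-07-01 13:00:30",
    "2014-07-01 15:20:00",
    "2014-07-02 16:00:00",
    "2014-07-02 17:00:00",
])
trips = pd.DataFrame({"tripduration": [300, 400, 500, 600, 700]}, index=index)

at_one = trips.at_time("13:00")                    # exactly 13:00:00 on any day
afternoon = trips.between_time("15:00", "16:00")   # 15:00 to 16:00 inclusive
print(at_one)
print(afternoon)
```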

##Plotting the data using matplotlib
###Plot shows the variation in mean trip duration for each day

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
# Create the figure in the same cell as the plot so the figsize applies to it
plt.figure(figsize=(50,8))
plt.plot(dailymeans)
plt.show()