# Required Questions: Please answer completely all five required questions.

## Question 1

* Programmatically download and load into your favorite analytical tool the trip data for September 2015.

* Report how many rows and columns of data you have loaded.

## Question 2

* Plot a histogram of the number of the trip distance (“Trip Distance”).

* Report any structure you find and any hypotheses you have about that structure.

## Question 3

* Report mean and median trip distance grouped by hour of day.

* We’d like to get a rough sense of identifying trips that originate or terminate at one of the NYC area airports. Can you provide a count of how many transactions fit this criteria, the average fare, and any other interesting characteristics of these trips.

## Question 4

* Build a derived variable for tip as a percentage of the total fare.

* Build a predictive model for tip as a percentage of the total fare. Use as much of the data as you like (or all of it). Provide an estimate of performance using an appropriate sample, and show your work.

## Question 5

##### Choose only one of these options to answer for Question 5. There is no preference as to which one you choose. Please select the question that you feel your particular skills and/or expertise are best suited to. If you answer more than one, only the first will be scored.

### Option A: Distributions

* Build a derived variable representing the average speed over the course of a trip.

* Can you perform a test to determine if the average trip speeds are materially the same in all weeks of September? If you decide they are not the same, can you form a hypothesis regarding why they differ?

* Can you build up a hypothesis of average trip speed as a function of time of day?

### Option B: Visualization

* Can you build a visualization (interactive or static) of the trip data that helps us understand intra- vs. inter-borough traffic? What story does it tell about how New Yorkers use their green taxis?

### Option C: Search

*  We’re thinking about promoting ride sharing. Build a function that, given a point P, finds the k trip origination points nearest P.

     * For this question, point P would be a taxi ride starting location picked by us at a given LAT-LONG.

     * As an extra layer of complexity, consider the time for pickups, so this could eventually be used for real time ride sharing matching.

     * Please explain not only how this can be computed, but how efficient your approach is (time and space complexity)

### Option D: Anomaly Detection

* What anomalies can you find in the data? Did taxi traffic or behavior deviate from the norm on a particular day/time or in a particular location?

* Using time-series analysis, clustering, or some other method, please develop a process/methodology to identify out of the norm behavior and attempt to explain why those anomalies occurred.

### Option E: Your own curiosity!

* If the data leaps out and screams some question of you that we haven’t asked, ask it and answer it! Use this as an opportunity to highlight your special skills and philosophies.


URL for NYC Taxi Data = 'http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml'


# ---------------------------------------------------------------------------------------------------------------
# Question 1

* Programmatically download and load into your favorite analytical tool the trip data for September 2015.

* Report how many rows and columns of data you have loaded.

## Approach
1. Use the **_requests_** library to get the HTML from the page.
2. Use **_BeautifulSoup_** to find the "href" attributes of the links on the page. I used the tutorial at http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup as a reference for this.
3. Filter the list of links on "green_tripdata" and the date to find the URL for the data.
4. Use **_urllib_** to download the CSV from the S3 bucket. A good reference for this is https://stackoverflow.com/questions/7243750/download-file-from-web-in-python-3 because the Python documentation can be difficult to navigate.
5. Use **_Pandas_** to load the file into a dataframe.
6. Report the answer in **_Markdown_**. The reference for that is here: https://stackoverflow.com/questions/18878083/can-i-use-variables-on-an-ipython-notebook-markup-cell


In [1]:
################################################################################
####################### USER DEFINED VARIABLES AND NAMES #######################

## Define the target web site
taxiURL = 'http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml'

## the values in filterValues can be edited to modify the analysis later if desired
filterValues = ['green_tripdata', '2015-09']

## Define the local name of the CSV file
csvName = 'cabdata.csv'

In [2]:
from os.path import isfile
import requests
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
from IPython.display import display, Markdown

## Check to see if the csv is already there. If it is then there is no need
## to repeatedly hit their websites to download a large file. 
if not isfile(csvName):
    ################################################################################
    ################## PROGRAMMATICALLY FIND THE URL FOR THE DATA ##################

    ## Send the GET command using Python's requests module
    htmlResponse = requests.get(taxiURL)

    ## Convert the html object to text
    textResponse = htmlResponse.text

    ## Create a Beautiful soup object
    soupResponse = BeautifulSoup(textResponse, 'lxml')

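    ## Use a list comprehension with built-in Beautiful Soup methods 
    ## to create a list of all the links. Passing href=True skips anchors
    ## without an href attribute, which would otherwise put None in the list
    ## and break the string filter below
    linkList = [link.get('href') for link in soupResponse.find_all('a', href=True)]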

    ## Use a list comprehension to filter using filterValues defined above
    ## so that the only entry is the URL for the CSV we want
    csvURL = [item for item in linkList if filterValues[0] in item and filterValues[1] in item][0]

    ################################################################################
    ################## PROGRAMMATICALLY GET THE DATA FROM THE URL ##################

    ## Download the file from csvURL, name it using csvName defined above
    ## and save it in the local working directory
    urllib.request.urlretrieve(csvURL, csvName)

## End if condition
    
################################################################################
###### LOAD THE CSV INTO A DATAFRAME, CHECK THE SIZE, AND PRINT THE ANSWER #####

## Load the csv file into a pandas dataframe
greenDF = pd.read_csv(csvName)

## Determine the shape of the dataframe 
dfRows, dfCols = greenDF.shape

## Print in markdown because it is prettier than simple print commands
mdText = "Answer to Question 1: \n The CSV has **_ {} _** rows and **_ {} _** columns".format(dfRows, dfCols)
display(Markdown(mdText))

Answer to Question 1: 
 The CSV has **_ 1494926 _** rows and **_ 21 _** columns

# ---------------------------------------------------------------------------------------------------------------
# Question 2

* Plot a histogram of the number of the trip distance (“Trip Distance”).

* Report any structure you find and any hypotheses you have about that structure.


## Approach
1. Examine the summary statistics to get a handle on the contents of the data set.
2. Use **_Numpy_** and **_Bokeh_** to compute and plot the desired histogram. There are easier ways (such as pandas.DataFrame.hist), but I like a bit more control over the visualization. Here is a good Bokeh histogram reference: https://bokeh.pydata.org/en/latest/docs/gallery/histogram.html. I kept making histograms, so I defined the plotting code as a subroutine, which cleans up my code on the one hand but does not allow for overlays and customization on the other. Bokeh has nice controls, and it is pretty intuitive.
3. Zoom into areas of interest in the histogram to examine the structure
    - Examine the sawtooth structure
    - Examine trip distances that equal 0 with regard to fare and trip duration to ascertain whether those values are erroneous
    - Fit a statistical model and provide observations regarding the fit


In [3]:
################################################################################
###################### HISTOGRAM VISUALIZATION SUBROUTINE ######################

def bokehhistogram(values, binsVal, tLabel, xLabel, yLabel):
    from numpy import histogram
    import bokeh
    from bokeh.io import output_notebook
    from bokeh.plotting import figure, show
    from bokeh.layouts import gridplot
    output_notebook()

    
    ################################################################################
    ############################ GENERATE HISTOGRAM DATA ###########################
    
    ## Use Numpy's histogram to generate the histogram data
    ## numpy.histogram default options: range=None, normed=False, weights=None, density=None
    histVals, binEdges = histogram(values, bins=binsVal)
    
    
    ################################################################################
    ######################## PLOT THE HISTOGRAM USING BOKEH ########################

    ## Initialize the plot
    histPlot = figure(title=tLabel, plot_height = 500, plot_width = 900)
    histPlot.xaxis.axis_label = xLabel
    histPlot.yaxis.axis_label = yLabel

    ## Use the .quad glyphs to represent the bars in the chart
    histPlot.quad(top=histVals, bottom=0, left=binEdges[:-1], right=binEdges[1:],
            fill_color="#036564", line_color="#033649")

    ## Show the plot
    show(histPlot)


In [13]:
################################################################################
################## EXAMINE SUMMARY STATISTICS ON THE DATAFRAME #################


## Print the name and type of each column
for col in greenDF:
    print(col, type(greenDF[col].iloc[0]))

## Split into three dataframes to get descriptive statistics
numDF = greenDF[['Pickup_longitude','Pickup_latitude',
                 'Dropoff_longitude', 'Dropoff_latitude', 
                 'Passenger_count', 'Trip_distance', 'Fare_amount', 
                 'Extra', 'MTA_tax', 'Tip_amount', 'Tolls_amount', 
                 'Ehail_fee', 'improvement_surcharge', 'Total_amount'] ] 

strDF = greenDF[['VendorID','Store_and_fwd_flag', 'RateCodeID', 
                 'Payment_type', 'Trip_type ']] 

dayDF = greenDF[['lpep_pickup_datetime', 'Lpep_dropoff_datetime']]

## Convert dates to Pandas timestamp objects
dayDF = dayDF.apply(pd.to_datetime)
for col in dayDF:
    greenDF[col] = dayDF[col]
display(dayDF.describe())

## Convert categoricals and IDs to strings
strDF = strDF.astype('str')
for col in strDF:
    greenDF[col] = strDF[col]
display(strDF.describe())

## Keep true numerical values 
display(numDF.describe())

VendorID <class 'str'>
lpep_pickup_datetime <class 'pandas._libs.tslib.Timestamp'>
Lpep_dropoff_datetime <class 'pandas._libs.tslib.Timestamp'>
Store_and_fwd_flag <class 'str'>
RateCodeID <class 'str'>
Pickup_longitude <class 'numpy.float64'>
Pickup_latitude <class 'numpy.float64'>
Dropoff_longitude <class 'numpy.float64'>
Dropoff_latitude <class 'numpy.float64'>
Passenger_count <class 'numpy.int64'>
Trip_distance <class 'numpy.float64'>
Fare_amount <class 'numpy.float64'>
Extra <class 'numpy.float64'>
MTA_tax <class 'numpy.float64'>
Tip_amount <class 'numpy.float64'>
Tolls_amount <class 'numpy.float64'>
Ehail_fee <class 'numpy.float64'>
improvement_surcharge <class 'numpy.float64'>
Total_amount <class 'numpy.float64'>
Payment_type <class 'str'>
Trip_type  <class 'str'>


| | lpep_pickup_datetime | Lpep_dropoff_datetime |
|---|---|---|
| count | 1494926 | 1494926 |
| unique | 1079075 | 1077210 |
| top | 2015-09-20 02:00:32 | 2015-09-28 00:00:00 |
| freq | 9 | 172 |
| first | 2015-09-01 00:00:00 | 2015-09-01 00:00:00 |
| last | 2015-09-30 23:59:58 | 2015-10-01 23:56:10 |


| | VendorID | Store_and_fwd_flag | RateCodeID | Payment_type | Trip_type |
|---|---|---|---|---|---|
| count | 1494926 | 1494926 | 1494926 | 1494926 | 1494926.0 |
| unique | 2 | 2 | 7 | 5 | 3.0 |
| top | 2 | N | 1 | 2 | 1.0 |
| freq | 1169099 | 1486192 | 1454464 | 783699 | 1461506.0 |


| | Pickup_longitude | Pickup_latitude | Dropoff_longitude | Dropoff_latitude | Passenger_count | Trip_distance | Fare_amount | Extra | MTA_tax | Tip_amount | Tolls_amount | Ehail_fee | improvement_surcharge | Total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1494926.0 | 1494926.0 | 1494926.0 | 1494926.0 | 1494926.0 | 1494926.0 | 1494926.0 | 1494926.0 | 1494926.0 | 1494926.0 | 1494926.0 | 0.0 | 1494926.0 | 1494926.0 |
| mean | -73.83084 | 40.69114 | -73.83728 | 40.69291 | 1.370598 | 2.968141 | 12.5432 | 0.35128 | 0.4866408 | 1.235727 | 0.1231047 | NaN | 0.2920991 | 15.03215 |
| std | 2.776082 | 1.530882 | 2.677911 | 1.476698 | 1.039426 | 3.076621 | 10.08278 | 0.3663096 | 0.08504473 | 2.431476 | 0.8910137 | NaN | 0.05074009 | 11.55316 |
| min | -83.31908 | 0.0 | -83.42784 | 0.0 | 0.0 | 0.0 | -475.0 | -1.0 | -0.5 | -50.0 | -15.29 | NaN | -0.3 | -475.0 |
| 25% | -73.95961 | 40.69895 | -73.96782 | 40.69878 | 1.0 | 1.1 | 6.5 | 0.0 | 0.5 | 0.0 | 0.0 | NaN | 0.3 | 8.16 |
| 50% | -73.94536 | 40.74674 | -73.94504 | 40.74728 | 1.0 | 1.98 | 9.5 | 0.5 | 0.5 | 0.0 | 0.0 | NaN | 0.3 | 11.76 |
| 75% | -73.91748 | 40.80255 | -73.91013 | 40.79015 | 1.0 | 3.74 | 15.5 | 0.5 | 0.5 | 2.0 | 0.0 | NaN | 0.3 | 18.3 |
| max | 0.0 | 43.17726 | 0.0 | 42.79934 | 9.0 | 603.1 | 580.5 | 12.0 | 0.5 | 300.0 | 95.75 | NaN | 0.3 | 581.3 |


In [12]:
################################################################################
############################ GENERATE HISTOGRAM PLOT ###########################

# Pass the trip distance series to the histogram subroutine previously defined
bokehhistogram(greenDF['Trip_distance'], 'auto', 'Histogram of trip distance', 'Miles', 'Frequency')

## For those unfamiliar with Bokeh, point out that the histogram is interactive
display(Markdown("#### Remember that you can interact with this histogram using the tools on the right"))

#### Remember that you can interact with this histogram using the tools on the right

## Initial observations regarding structure:
1. The distribution looks log-normal. This is easier to see if you ignore the long tail and zoom in between 0 and 50 miles.
2. If you zoom in on the start of the distribution, there are many values in the bin between 0 and 0.05 miles.
3. The distribution has a very long tail (the maximum recorded distance is 603.1 miles).
4. There is a sawtooth oscillation in the fractional part of the trip distance.



In [6]:
################################################################################
################## EXAMINE THE SAWTOOTH STRUCTURE IN THE DATA ##################

divisor = 1

## Compute the fractional part of each Trip_distance value so that it can be plotted as a histogram.
## The modulo is applied to the whole column at once instead of looping over the rows.
rems = greenDF['Trip_distance'] % divisor


bokehhistogram(rems, 100, 'histogram of fractional trip distance', 'miles', 'frequency')

## What do I see here...
### It looks like there are two different trip distance recording methods:
- one that rounds to the nearest tenth of a mile
- one that rounds to the nearest hundredth of a mile

### Whole miles are slightly more frequent than fractional miles
- Could this be an artifact of a rounding algorithm?
- According to https://www.nytimes.com/2006/09/17/nyregion/thecity/17fyi.html, certain combinations of city blocks correspond to whole miles, so the layout of the city may simply cause certain trips to come out at whole miles. The average NY block is 264 by 900 feet according to Wikipedia.
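
One way to probe the two-recording-methods hypothesis is to measure how many distances already sit exactly on a tenth of a mile. If every meter recorded hundredths, roughly one value in ten would do so by chance, so a much larger share would support the hypothesis. The sketch below is illustrative only: `rounding_share` is a hypothetical helper, and the small sample stands in for `greenDF['Trip_distance']`.

```python
import numpy as np
import pandas as pd

def rounding_share(distances, decimals=1, tol=1e-9):
    """Return the fraction of values that are already rounded to `decimals` places."""
    values = np.asarray(distances, dtype=float)
    # A value counts as "rounded" if rounding it changes nothing (within tol)
    return float(np.mean(np.abs(values - np.round(values, decimals)) < tol))

# Hypothetical sample standing in for greenDF['Trip_distance']
sample = pd.Series([1.0, 2.5, 0.73, 3.10, 4.26, 2.0])
share_tenths = rounding_share(sample, decimals=1)  # values on a tenth of a mile
share_whole = rounding_share(sample, decimals=0)   # values on a whole mile
print(share_tenths, share_whole)
```

The same helper with `decimals=0` gives the share of whole-mile trips, which can be compared against the roughly 10% of tenth-resolution values that would land on a whole mile by chance.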

In [7]:
################################################################################
############################### EXAMINE THE ZEROS ##############################

## Let's filter the dataframe to distance values that equal 0
vsTrips = greenDF[greenDF['Trip_distance'] == 0 ]

## Let's look at a histogram of the fares that were recorded when the distance recorded is 0.
bokehhistogram(vsTrips['Fare_amount'], 'auto', 'histogram of fare when distance = 0' , 'fare ($) ', 'frequency')

In [9]:
## Let's compute trip duration and plot a histogram of it when the distance recorded is 0.
## Refer to the datetime columns by name rather than by position so the
## calculation does not depend on the column order of the CSV.
duTime = vsTrips['Lpep_dropoff_datetime'] - vsTrips['lpep_pickup_datetime']
duTimeH = duTime.dt.total_seconds() / 3600

## Let's look at a histogram of the trip durations that were recorded when the distance recorded is 0.
bokehhistogram(duTimeH, 'auto', 'histogram of trip duration when distance = 0' , 'duration (hours) ', 'frequency')

## What do I see here...
### A histogram of the fares when the trip distance equals 0 shows that the fares are all over the place.
- I did not expect such a wide range of values
- Some trip distance values may be erroneous
- Because some fares are negative, some fare values may be erroneous as well

### This leads me to believe that there are data recording errors throughout the data set and that I cannot rely on any one column more than any other.

### A histogram of trip duration when the trip distance equals 0 shows a distribution of durations that would be plausible if the distance were not 0
- Trip duration values make physical sense (i.e. no negatives)
- Most durations are near 0, which makes sense for trips with no recorded distance
- I don't know enough about taxi fare rules (e.g. whether a driver can charge a fare for waiting when no one actually travels) to explain how these variables relate to each other
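
To put numbers on the observations above, a small summary like the following could be run. This is a minimal sketch: `zero_distance_summary` is a hypothetical helper, the column names are the ones used in the dataframe above, and the toy frame only stands in for `greenDF` after its datetime columns have been converted.

```python
import pandas as pd

def zero_distance_summary(df):
    """Summarize fares and durations for trips with a recorded distance of 0."""
    zero = df[df['Trip_distance'] == 0]
    duration_h = (zero['Lpep_dropoff_datetime']
                  - zero['lpep_pickup_datetime']).dt.total_seconds() / 3600
    return {
        'zero_distance_trips': len(zero),
        'negative_fares': int((zero['Fare_amount'] < 0).sum()),
        'median_duration_hours': float(duration_h.median()),
    }

# Toy data standing in for greenDF (values are made up for illustration)
toy = pd.DataFrame({
    'Trip_distance': [0.0, 0.0, 2.1],
    'Fare_amount': [-2.5, 10.0, 9.0],
    'lpep_pickup_datetime': pd.to_datetime(
        ['2015-09-01 10:00', '2015-09-01 11:00', '2015-09-01 12:00']),
    'Lpep_dropoff_datetime': pd.to_datetime(
        ['2015-09-01 10:30', '2015-09-01 11:00', '2015-09-01 12:20']),
})
summary = zero_distance_summary(toy)
print(summary)
```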


In [60]:
################################################################################
############################ EXAMINE LOG NORMAL FIT ############################

from scipy import stats 
import numpy as np
from numpy import histogram
import bokeh
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
from bokeh.layouts import gridplot
output_notebook()

## Trim the ends of the distribution of tripDistance
subDF = greenDF[greenDF['Trip_distance']>0]
subDF = subDF[subDF['Trip_distance']<25]


## Use Numpy's histogram to generate the histogram data
## NOTE: in this case I am normalizing with density=True because I am comparing
## the histogram to a predicted pdf curve and I want them on the same scale.
## (density=True replaces the older normed=True, which newer NumPy releases no longer accept)
histVals, binEdges = histogram(subDF['Trip_distance'], bins='auto', density=True)


################################################################################
######################## PLOT THE HISTOGRAM USING BOKEH ########################

## Initialize the plot
histPlot = figure(title= 'Histogram and PDF of truncated trip distance', plot_height = 500, plot_width = 900)
# histPlot.xaxis.axis_label = xLabel
# histPlot.yaxis.axis_label = yLabel

## Use the .quad glyphs to represent the bars in the chart
histPlot.quad(top=histVals, bottom=0, left=binEdges[:-1], right=binEdges[1:],
        fill_color="#036564", line_color="#033649")


################################################################################
########################## PREDICT THE FIT PARAMETERS ##########################


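## Fit a lognormal curve to the data, holding the location parameter at 0
shape, loc, scale = stats.lognorm.fit(subDF['Trip_distance'], floc=0) 
mu = np.log(scale)     # mean of log(trip distance)
sigma = shape          # standard deviation of log(trip distance)
M = np.exp(mu)         # median of the fitted distribution (equal to scale)
s = np.exp(sigma)      # geometric standard deviation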

## Compute a log Normal curve based on the computed shape, etc.
## For simplicity generate a sequence and predict off that. 
lnX = np.linspace(0, 25, num=400)
lnY = stats.lognorm.pdf(lnX, shape, loc=loc, scale=scale)

################################################################################
########################### PLOT THE LOGNORMAL CURVE ###########################
histPlot.line(lnX,lnY, line_width=2, color="#B3DE69")

histPlot.xaxis.axis_label = 'Trip Distance (mi)'
histPlot.yaxis.axis_label = 'Normalized frequency'

## Show the plot
show(histPlot)




## What do I see here...

### As I hypothesized, the data fit a lognormal distribution fairly well.
- I cut off the zero values since they are possibly erroneous, and I also cut off distances of 25 miles or more
- I could improve the fit by rounding trip distance to the nearest tenth of a mile, which would remove the sawtooth structure
- I could tune the model parameters to improve the fit
- I could compute an RMS error of prediction, but for the purpose of commenting on the structure, I believe it is sufficient to say that a lognormal model generally describes the data.
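
The RMS error mentioned above could be computed along the following lines. This is a sketch under stated assumptions, not the method used in this notebook: it uses NumPy only, fits the lognormal with the closed-form maximum-likelihood estimates for a location fixed at 0 (which should agree with `stats.lognorm.fit(..., floc=0)` up to numerical tolerance), and evaluates the pdf at bin centers. The function names and the synthetic data are hypothetical, and all input values must be strictly positive.

```python
import numpy as np

def lognormal_pdf(x, mu, sigma):
    """Lognormal density with log-mean mu and log-std sigma (location fixed at 0)."""
    x = np.asarray(x, dtype=float)
    return (np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2))
            / (x * sigma * np.sqrt(2 * np.pi)))

def histogram_fit_rmse(values, bins=50):
    """RMSE between a density histogram and the fitted lognormal pdf."""
    values = np.asarray(values, dtype=float)
    logs = np.log(values)
    mu, sigma = logs.mean(), logs.std()  # closed-form MLE when loc = 0
    hist_vals, bin_edges = np.histogram(values, bins=bins, density=True)
    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    predicted = lognormal_pdf(centers, mu, sigma)
    rmse = float(np.sqrt(np.mean((hist_vals - predicted) ** 2)))
    return rmse, float(mu), float(sigma)

# Synthetic lognormal data standing in for the truncated trip distances
rng = np.random.default_rng(0)
synthetic = rng.lognormal(mean=0.7, sigma=0.8, size=50_000)
rmse, mu_hat, sigma_hat = histogram_fit_rmse(synthetic)
print(rmse, mu_hat, sigma_hat)
```

Because the RMSE depends on the bin width, it is most useful for comparing candidate fits on the same binning rather than as an absolute score.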

# ---------------------------------------------------------------------------------------------------------------
# Question 3

* Report mean and median trip distance grouped by hour of day.

* We’d like to get a rough sense of identifying trips that originate or terminate at one of the NYC area airports. Can you provide a count of how many transactions fit this criteria, the average fare, and any other interesting characteristics of these trips.

## Approach
##### Part 1
1. Create a derived value for hour of the day using pandas methods
2. Loop through by hour and create a dictionary of the summary statistics by hour. I am using a dictionary because conversion to a dataframe is particularly easy. Although the dictionary's main key is the hour, I also added the hour as a key/value pair inside each entry because of how the dictionary is converted into the dataframe: if I want to use the hour data directly or alter it, it is convenient to have it as its own column and not just as the index.
3. Convert the dictionary to a dataframe, including orient = 'index' so that it is a long, skinny dataframe. I like my columns to be homogeneous.
4. Display the dataframe and plot it as a simple line plot

##### Part 2
1. Using the taxi zone lookup table at http://www.nyc.gov/html/exit-page.html?url=https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv, I determined that JFK Airport is zone 132 and LaGuardia Airport is zone 138. Unfortunately, the September 2015 data has lat/lon coordinates, not zones. It does, however, have a rate code ID, where code 2 is JFK, so filtering on it captures trips billed at the JFK rate but does not identify LaGuardia trips.


JFK = 40.6413° N, 73.7781° W
LaGuardia = 40.7769° N, 73.8740° W


If I had more time, I could filter on the lat/lon, but I think that is outside the scope of this challenge...
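
For reference, the lat/lon filter mentioned above could look roughly like this. It is a sketch only: `haversine_miles` and `near_airport` are hypothetical helpers, the airport coordinates are the ones listed above, and the 1-mile radius is an arbitrary assumption that would need tuning against a map.

```python
import numpy as np
import pandas as pd

# Airport coordinates as listed above (west longitudes are negative)
AIRPORTS = {'JFK': (40.6413, -73.7781), 'LaGuardia': (40.7769, -73.8740)}
EARTH_RADIUS_MI = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MI * np.arcsin(np.sqrt(a))

def near_airport(df, radius_mi=1.0):
    """Boolean mask of trips that start or end within radius_mi of an airport."""
    mask = pd.Series(False, index=df.index)
    for lat, lon in AIRPORTS.values():
        for prefix in ('Pickup', 'Dropoff'):
            dist = haversine_miles(df[prefix + '_latitude'],
                                   df[prefix + '_longitude'], lat, lon)
            mask |= dist <= radius_mi
    return mask

# Toy trips: pickup near JFK, dropoff near LaGuardia, and neither
trips = pd.DataFrame({
    'Pickup_latitude': [40.6420, 40.7500, 40.7000],
    'Pickup_longitude': [-73.7790, -73.9900, -73.9500],
    'Dropoff_latitude': [40.7300, 40.7770, 40.7100],
    'Dropoff_longitude': [-73.9800, -73.8745, -73.9600],
})
flags = near_airport(trips)
print(flags.tolist())
```

Rows with coordinates of 0.0, which appear in the summary statistics above, fall far outside any radius and are excluded automatically.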

In [37]:
################################################################################
########################### GET HOUR OUT OF TIMESTAMP ##########################

## use Pandas to extract the hour of the day in 24hour notation from the timestamp
greenDF['puHour'] = pd.DatetimeIndex(greenDF['lpep_pickup_datetime']).hour

## We can view a histogram of the new data to visually check the conversion
bokehhistogram(greenDF['puHour'], 'auto', 'Histogram of Pickup Time ' , 'Hour of the day (24hr format) ', 'Frequency ')


In [55]:
hours = range(0, 24)

## Use a dictionary to make things easier in terms of conversion to a dataframe
summaryDict = {}
for hour in hours:
    subDF = greenDF[greenDF['puHour']==hour]
    summaryDict[hour] = {'Hour': hour, 'Mean Distance': subDF['Trip_distance'].mean(), 'Median Distance': subDF['Trip_distance'].median()}

summaryDF = pd.DataFrame.from_dict(summaryDict, orient = 'index')
display(summaryDF)

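| | Hour | Mean Distance | Median Distance |
|---|---|---|---|
| 0 | 0 | 3.115276 | 2.2 |
| 1 | 1 | 3.017347 | 2.12 |
| 2 | 2 | 3.046176 | 2.14 |
| 3 | 3 | 3.212945 | 2.2 |
| 4 | 4 | 3.526555 | 2.36 |
| 5 | 5 | 4.133474 | 2.9 |
| 6 | 6 | 4.055149 | 2.84 |
| 7 | 7 | 3.284394 | 2.17 |
| 8 | 8 | 3.04845 | 1.98 |
| 9 | 9 | 2.999105 | 1.96 |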
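
The per-hour loop above can also be expressed as a single pandas `groupby`, which avoids building the intermediate dictionary. The sketch below is an alternative, not the code that produced the table; `distance_by_hour` is a hypothetical helper and the toy frame stands in for `greenDF`.

```python
import pandas as pd

def distance_by_hour(df, time_col='lpep_pickup_datetime', dist_col='Trip_distance'):
    """Mean and median trip distance for each pickup hour (0-23)."""
    hours = pd.to_datetime(df[time_col]).dt.hour
    summary = df.groupby(hours)[dist_col].agg(['mean', 'median'])
    summary.index.name = 'Hour'
    summary = summary.rename(columns={'mean': 'Mean Distance',
                                      'median': 'Median Distance'})
    # Keep the hour as a regular column, as in the dictionary-based version
    return summary.reset_index()

# Toy data standing in for greenDF
toy = pd.DataFrame({
    'lpep_pickup_datetime': ['2015-09-01 00:05:00', '2015-09-01 00:45:00',
                             '2015-09-01 05:10:00'],
    'Trip_distance': [1.0, 3.0, 4.0],
})
result = distance_by_hour(toy)
print(result)
```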


In [59]:
################################################################################
############################ PLOT SUMMARY STATISTICS ###########################

import bokeh
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
from bokeh.layouts import gridplot
output_notebook()

sumPlot = figure()

sumPlot.line(summaryDF['Hour'], summaryDF['Mean Distance'], color = 'red')
sumPlot.line(summaryDF['Hour'], summaryDF['Median Distance'], color = 'blue')


sumPlot.xaxis.axis_label = 'Hour of Day (24 hr notation)'
sumPlot.yaxis.axis_label = 'Trip Distance (mi)'
show(sumPlot)

In [76]:
################################################################################
############################ SUBSET TO AIRPORT ZONES ###########################

airPU = greenDF[greenDF['RateCodeID'] == "2"]

message = '### Total number of airport trips in Sept 2015 was ' + str(len(airPU)) + ' based on the Rate Code ID for the airport'
display(Markdown(message))

message = '### Mean fare of airport trips in Sept 2015 was $' + str( round(airPU['Fare_amount'].mean() ,2)) + ' based on the Rate Code ID for the airport'
display(Markdown(message))

### Total number of airport trips in Sept 2015 was 4435 based on the Rate Code ID for the airport

### Mean fare of airport trips in Sept 2015 was $49.02 based on the Rate Code ID for the airport

In [86]:


bokehhistogram(airPU['Trip_distance'], 10000, 'Histogram of Trip Distance ' , 'Distance', 'Frequency ')

In [62]:
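def commentbar(stringly):
    """Center a label in an 80-character bar of '#' characters.

    Labels longer than 70 characters are returned unchanged.
    """
    if len(stringly)>70:
        return stringly
    else:
        ## Pad a non-empty label with one space on each side
        if len(stringly)>0:
            stringly = " "+stringly+" "
        
        ## Add one '#' to each side until the bar is at least 80 characters wide
        while len(stringly)<80:
            stringly = "#"+stringly+"#"

        ## Trim the extra trailing '#' if the bar overshot to 81 characters
        if len(stringly)>80:
            stringly = stringly[:-1]
        return stringly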
    
print(commentbar(''))    
print(commentbar('SUBSET TO AIRPORT ZONES'))




################################################################################
############################ SUBSET TO AIRPORT ZONES ###########################
