# Required Questions: Please answer all five required questions completely.

## Question 1

* Programmatically download and load into your favorite analytical tool the trip data for September 2015.

* Report how many rows and columns of data you have loaded.

## Question 2

* Plot a histogram of the trip distance (“Trip Distance”).

* Report any structure you find and any hypotheses you have about that structure.

## Question 3

* Report mean and median trip distance grouped by hour of day.

* We’d like to get a rough sense of which trips originate or terminate at one of the NYC area airports. Can you provide a count of how many transactions fit these criteria, the average fare, and any other interesting characteristics of these trips?

## Question 4

* Build a derived variable for tip as a percentage of the total fare.

* Build a predictive model for tip as a percentage of the total fare. Use as much of the data as you like (or all of it). Provide an estimate of performance using an appropriate sample, and show your work.

## Question 5

##### Choose only one of these options to answer for Question 5. There is no preference as to which one you choose. Please select the question that you feel your particular skills and/or expertise are best suited to. If you answer more than one, only the first will be scored.

### Option A: Distributions

* Build a derived variable representing the average speed over the course of a trip.

* Can you perform a test to determine if the average trip speeds are materially the same in all weeks of September? If you decide they are not the same, can you form a hypothesis regarding why they differ?

* Can you build up a hypothesis of average trip speed as a function of time of day?

### Option B: Visualization

* Can you build a visualization (interactive or static) of the trip data that helps us understand intra- vs. inter-borough traffic? What story does it tell about how New Yorkers use their green taxis?

### Option C: Search

*  We’re thinking about promoting ride sharing. Build a function that, given a point P, finds the k trip origination points nearest P.

     * For this question, point P would be a taxi ride starting location picked by us at a given LAT-LONG.

     * As an extra layer of complexity, consider the time for pickups, so this could eventually be used for real time ride sharing matching.

     * Please explain not only how this can be computed, but how efficient your approach is (time and space complexity)

### Option D: Anomaly Detection

* What anomalies can you find in the data? Did taxi traffic or behavior deviate from the norm on a particular day/time or in a particular location?

* Using time-series analysis, clustering, or some other method, please develop a process/methodology to identify out of the norm behavior and attempt to explain why those anomalies occurred.

### Option E: Your own curiosity!

* If the data leaps out and screams some question of you that we haven’t asked, ask it and answer it! Use this as an opportunity to highlight your special skills and philosophies.



## Question 1

* Programmatically download and load into your favorite analytical tool the trip data for September 2015.

* Report how many rows and columns of data you have loaded.

Approach:
1. Use the requests library to fetch the HTML of the page.
2. Use Beautiful Soup's built-in methods to find every link on the page and read its "href" attribute. I used the tutorial at http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup as a reference for this.
3. Filter the list of links on the green trip data and the date to find the URL for the data.
4. Use urllib to download the CSV from the S3 bucket. A good reference for this is https://stackoverflow.com/questions/7243750/download-file-from-web-in-python-3 because the Python documentation is sometimes difficult to use.
5. Use pandas to load the file into a dataframe.
6. Report the answer nicely in Markdown. The reference for that is here: https://stackoverflow.com/questions/18878083/can-i-use-variables-on-an-ipython-notebook-markup-cell

NOTE: I put some of these steps in different cells so that I could debug and run routines without repeatedly hitting others' URLs. 
This is just the respectful thing to do :)


In [18]:
################################################################################
########## PROGRAMMATICALLY FIND THE URL FOR THE DATA ##########################

import requests
from bs4 import BeautifulSoup

## Define the target web site
taxiURL = 'http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml'

## Send the GET command using Python's requests module
htmlResponse = requests.get(taxiURL)

## Extract the HTML text from the response object
textResponse = htmlResponse.text

## Create a Beautiful Soup object from the HTML text
soupResponse = BeautifulSoup(textResponse, 'lxml')

## Use a list comprehension with built-in Beautiful Soup methods
## to create a list of all the links (skipping <a> tags that have no href)
linkList = [link.get('href') for link in soupResponse.find_all('a') if link.get('href')]

## Filter linkList so that the only entry is the one we want.
## The values in filterValues can be edited to modify the analysis later if desired.
filterValues = ['green_tripdata', '2015-09']
csvURL = [item for item in linkList if all(value in item for value in filterValues)][0]

print(csvURL)

https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2015-09.csv


In [24]:
################################################################################
########## PROGRAMMATICALLY GET THE DATA FROM THE URL ##########################

import urllib.request

## Download the file from csvURL and save it in the current working directory
csvName = 'cabdata.csv'
urllib.request.urlretrieve(csvURL, csvName)
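
The note above about not repeatedly hitting others' URLs can also be enforced in code instead of relying on cell order. The following is a minimal sketch, not part of the original notebook: `download_once` is a hypothetical helper name, and it assumes that a file already on disk is complete and can be reused.

```python
import os
import urllib.request


def download_once(url, path):
    """Download url to path only if path does not exist yet.

    Returns True when a download happened, False when the existing file was reused.
    """
    if os.path.exists(path):
        # Assumption: an existing file is a complete earlier download.
        return False
    urllib.request.urlretrieve(url, path)
    return True

# In the notebook this would be called as: download_once(csvURL, csvName)
```

One trade-off of this approach: an interrupted download leaves a partial file behind that would then be reused, so the file should be deleted by hand if a download fails midway.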

In [30]:
################################################################################
########## LOAD THE CSV INTO A DATAFRAME #######################################

import pandas as pd

## Load the csv file into a pandas dataframe
greenDF = pd.read_csv(csvName)

## Preview the data to ensure everything makes sense
greenDF.head(10)

Unnamed: 0,VendorID,lpep_pickup_datetime,Lpep_dropoff_datetime,Store_and_fwd_flag,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,...,Fare_amount,Extra,MTA_tax,Tip_amount,Tolls_amount,Ehail_fee,improvement_surcharge,Total_amount,Payment_type,Trip_type
0,2,2015-09-01 00:02:34,2015-09-01 00:02:38,N,5,-73.979485,40.684956,-73.979431,40.68502,1,...,7.8,0.0,0.0,1.95,0.0,,0.0,9.75,1,2.0
1,2,2015-09-01 00:04:20,2015-09-01 00:04:24,N,5,-74.010796,40.912216,-74.01078,40.912212,1,...,45.0,0.0,0.0,0.0,0.0,,0.0,45.0,1,2.0
2,2,2015-09-01 00:01:50,2015-09-01 00:04:24,N,1,-73.92141,40.766708,-73.914413,40.764687,1,...,4.0,0.5,0.5,0.5,0.0,,0.3,5.8,1,1.0
3,2,2015-09-01 00:02:36,2015-09-01 00:06:42,N,1,-73.921387,40.766678,-73.931427,40.771584,1,...,5.0,0.5,0.5,0.0,0.0,,0.3,6.3,2,1.0
4,2,2015-09-01 00:00:14,2015-09-01 00:04:20,N,1,-73.955482,40.714046,-73.944412,40.714729,1,...,5.0,0.5,0.5,0.0,0.0,,0.3,6.3,2,1.0
5,2,2015-09-01 00:00:39,2015-09-01 00:05:20,N,1,-73.945297,40.808186,-73.937668,40.821198,1,...,5.5,0.5,0.5,1.36,0.0,,0.3,8.16,1,1.0
6,2,2015-09-01 00:00:52,2015-09-01 00:05:50,N,1,-73.890877,40.746426,-73.876923,40.756306,1,...,6.5,0.5,0.5,0.0,0.0,,0.3,7.8,1,1.0
7,2,2015-09-01 00:02:15,2015-09-01 00:05:34,N,1,-73.946701,40.797321,-73.937645,40.804516,1,...,5.0,0.5,0.5,0.0,0.0,,0.3,6.3,2,1.0
8,2,2015-09-01 00:02:36,2015-09-01 00:07:20,N,1,-73.96315,40.693829,-73.956787,40.680531,1,...,6.0,0.5,0.5,1.46,0.0,,0.3,8.76,1,1.0
9,2,2015-09-01 00:02:13,2015-09-01 00:07:23,N,1,-73.89682,40.746128,-73.888626,40.752724,1,...,5.5,0.5,0.5,0.0,0.0,,0.3,6.8,2,1.0


In [38]:
################################################################################
################### ANSWER THE QUESTION  #######################################

from IPython.display import Markdown as md

## Determine the shape of the dataframe and return the answer nicely
rows, cols = greenDF.shape



SyntaxError: invalid syntax (<ipython-input-38-73b4889a36b8>, line 13)
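
The cell above stops at a SyntaxError, and the line that triggered it is not shown in the output. As a sketch only (not a reconstruction of the original cell), step 6 of the approach could be done along these lines. `shape_report` is a hypothetical helper name, and the tuple in the usage line is a made-up example, not the real size of the September 2015 file.

```python
def shape_report(shape):
    """Format a (rows, columns) tuple, such as DataFrame.shape, as a Markdown sentence."""
    rows, cols = shape
    # The ',' format spec adds thousands separators to the row count.
    return f"The data has **{rows:,}** rows and **{cols}** columns."


# Made-up shape purely for illustration; in the notebook this would be greenDF.shape.
print(shape_report((1000, 5)))
```

In a notebook, the string can be rendered as Markdown by making `md(shape_report(greenDF.shape))` the last expression in a cell, using the `md` alias imported above.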