# Case Study 3: Data Science in NYC Taxi and Uber Data

**Required Readings:** 
* [Analyzing 1.1 Billion NYC Taxi and Uber Trips](http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/) 
* Please download the NYC taxi and Uber dataset from [here](https://github.com/toddwschneider/nyc-taxi-data).
* [TED Talks](https://www.ted.com/talks) for examples of 7-minute talks.


**NOTE**
* Please save the notebook frequently when working in Jupyter Notebook; otherwise, your changes may be lost.

----------------------

# Problem: pick a data science problem that you plan to solve using Uber/Taxi Data
* The problem should be important and interesting, with potential impact in some area.
* The problem should be solvable using the data and data science solutions.

Please briefly describe in the following cell: What problem are you trying to solve? Why is this problem important and interesting?

# Data Collection/Processing: 

In [1]:
#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary






# Data Exploration: Exploring the Uber/Taxi Dataset

**Plot the spatial distribution of the pickup locations of 5000 Uber trips**
* Collect a set of 5000 Uber trips.
* Plot the distribution of the pickup locations using a scatter plot.
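
As a starting point, the two steps above could be sketched as follows. This is a minimal sketch, not a reference solution: the column names `Lat` and `Lon`, the helper names `sample_pickups` and `plot_pickups`, and the synthetic coordinates are all assumptions made for illustration, so swap in the real Uber trip file and its actual column names.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def sample_pickups(trips, n=5000, seed=0):
    """Return n randomly chosen trips (all of them if fewer than n exist)."""
    return trips.sample(n=min(n, len(trips)), random_state=seed)


def plot_pickups(trips, lon_col="Lon", lat_col="Lat"):
    """Scatter-plot pickup coordinates; small translucent markers reduce overplotting."""
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(trips[lon_col], trips[lat_col], s=2, alpha=0.4)
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    ax.set_title(f"Pickup locations of {len(trips)} Uber trips")
    return ax


# Synthetic stand-in for the real data (assumption): random points around Manhattan.
# Replace this frame with the Uber trips loaded from the downloaded dataset.
rng = np.random.default_rng(0)
uber_trips = pd.DataFrame({
    "Lat": rng.normal(40.75, 0.04, size=20000),
    "Lon": rng.normal(-73.98, 0.04, size=20000),
})

pickup_sample = sample_pickups(uber_trips)
ax = plot_pickups(pickup_sample)
```

Sampling with a fixed `random_state` keeps the figure reproducible between runs, which helps when the plot is reused in the slides.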

In [2]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary















# The Solution: implement a data science solution to the problem you are trying to solve.

Briefly describe the idea of your solution to the problem in the following cell:

Write code to implement the solution in Python:

In [9]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

In [5]:
# Load the data

daily_trips = pd.read_csv('daily_trips_with_location_id.csv')
election_results = pd.read_csv('election_results_by_taxi_zone.csv')

In [35]:
# Calculate the features used to predict the Trump vote share
# Features (one row per taxi zone, computed from 2015 data):
#   Fraction of rides for each car type (yellow, green, uber, gett, juno, lyft, other, via)
#   Total rides in 2015
# Targets (from the 2016 election results):
#   Whether the Trump vote fraction is >= 0.5, and its complement

feature_lists = []
car_types = daily_trips['car_type'].unique()
car_type_num = len(car_types)
car_type_dict = dict(zip(car_types, range(car_type_num)))
trips_2015 = daily_trips[daily_trips['date'].str.contains('2015')]
for location_id, location in trips_2015.groupby('pickup_location_id'):
    feature_list = [0]*(car_type_num+3)
    # NOTE: count() tallies the number of daily records per car type, not the number of rides.
    # If 'trips' holds per-day ride totals, sum() would match the "fraction of rides" description.
    type_counts = location.groupby('car_type')['trips'].count()
    indices = [car_type_dict[i] for i in type_counts.index.values]
    feature_list[-3] = np.sum(type_counts.values)
    for i, v in zip(indices, type_counts.values):
        feature_list[i] = v/feature_list[-3]
    # Skip zones that have no matching row in the election results
    trump = election_results.loc[election_results['locationid'] == location_id, 'trump']
    if trump.empty:
        print('Location id '+str(location_id)+' has no voter information.')
        continue
    feature_list[-2] = trump.iloc[0] >= 0.5
    feature_list[-1] = trump.iloc[0] < 0.5
    feature_lists.append(feature_list)

df = pd.DataFrame(feature_lists)
#print(df)
df.to_csv('feature_lists.csv', index = False)

Location id 1.0 has no voter information.
Location id 57.0 has no voter information.
Location id 105.0 has no voter information.
Location id 199.0 has no voter information.
Location id 253.0 has no voter information.
Location id 264.0 has no voter information.
Location id 265.0 has no voter information.
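
The per-zone loop above can also be expressed with a pivot table. The sketch below runs on a small made-up table, not the real `daily_trips` file, and it assumes that the `trips` column holds per-day ride totals, so it sums them instead of counting rows. The toy values and the variable names `daily`, `totals`, `volume`, and `features` are illustrative only.

```python
import pandas as pd

# Toy stand-in for daily_trips (made-up numbers, for illustration only)
daily = pd.DataFrame({
    "pickup_location_id": [4, 4, 4, 7, 7],
    "car_type": ["yellow", "uber", "yellow", "green", "uber"],
    "date": ["2015-01-01", "2015-01-01", "2015-01-02", "2015-01-01", "2016-01-01"],
    "trips": [30, 10, 20, 5, 99],
})

# Keep only 2015 records
in_2015 = daily[daily["date"].str.startswith("2015")]

# One row per zone, one column per car type, cells = total rides in 2015
totals = in_2015.pivot_table(
    index="pickup_location_id",
    columns="car_type",
    values="trips",
    aggfunc="sum",
    fill_value=0,
)

# Total rides per zone, then each car type's share of that total
volume = totals.sum(axis=1)
features = totals.div(volume, axis=0).assign(volume=volume)
print(features)
```

Each row of `features` then holds the car-type fractions for one zone (summing to 1) plus the zone's total volume, which is the same layout the loop builds before the target columns are appended.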


In [8]:
# Prepare the data for the neural network
trn_data,val_tst_data = train_test_split(feature_lists,train_size=0.6,stratify=np.array(feature_lists)[:,-1])
val_data,tst_data = train_test_split(val_tst_data,train_size=0.5,stratify=np.array(val_tst_data)[:,-1])

trn_data = np.array(trn_data)
val_data = np.array(val_data)
tst_data = np.array(tst_data)

scaler = StandardScaler()
trn_data[:,:-2] = scaler.fit_transform(trn_data[:,:-2])
val_data[:,:-2] = scaler.transform(val_data[:,:-2])
tst_data[:,:-2] = scaler.transform(tst_data[:,:-2])

# Create and train the neural network
n_features = trn_data.shape[1] - 2  # every column except the two target columns
model = Sequential()
model.add(Dense(16, activation='sigmoid', input_shape=(n_features,)))
model.add(Dense(2, activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='sgd',metrics=['accuracy'])
# Stop training once the validation loss has not improved for 5 epochs
callbacks = [EarlyStopping(monitor='val_loss',patience=5,restore_best_weights=False)]
model.fit(trn_data[:,:-2],trn_data[:,-2:],validation_data=(val_data[:,:-2],val_data[:,-2:]),epochs=10000,callbacks=callbacks,verbose=1)

In [83]:
threshold = 0.5
output = model.predict(val_data[:,:-2])[:,-2] >= threshold
target = val_data[:,-2] >= threshold
print('validation confusion matrix:')
print(confusion_matrix(target,output))
output = model.predict(tst_data[:,:-2])[:,-2] >= threshold
target = tst_data[:,-2] >= threshold
print('testing confusion matrix:')
print(confusion_matrix(target,output))

validation confusion matrix:
[[45  0]
 [ 4  2]]
testing confusion matrix:
[[46  0]
 [ 2  4]]
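
The two confusion matrices above can be turned into summary metrics. The sketch below assumes the layout that `sklearn.metrics.confusion_matrix` returns (rows are true classes, columns are predicted classes) and treats "Trump fraction >= 0.5" as the positive class, matching how `target` is built in the cell above. The helper name `summarize` is hypothetical.

```python
import numpy as np


def summarize(cm):
    """Accuracy, precision, recall, and majority-class baseline from a 2x2 confusion matrix.

    Rows are true classes and columns are predicted classes, ordered (negative, positive).
    """
    (tn, fp), (fn, tp) = np.asarray(cm)
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Accuracy of always predicting the more common class
        "baseline": max(tn + fp, fn + tp) / total,
    }


validation = summarize([[45, 0], [4, 2]])
testing = summarize([[46, 0], [2, 4]])
for name, stats in [("validation", validation), ("testing", testing)]:
    print(name, {k: round(float(v), 3) for k, v in stats.items()})
```

On these matrices the accuracy (47/51 and 50/52) sits only modestly above the majority-class baseline (45/51 and 46/52), and recall for the positive class is 2/6 on validation and 4/6 on testing. With so few positive zones, reporting recall alongside accuracy gives a fairer picture of the model.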


In [44]:
## Alternate Model
# Prepare the data for the logistic regression
data = pd.read_csv('feature_lists.csv')
#print(data)

# Name the columns: one fraction per car type
# (yellow, green, uber, gett, juno, lyft, other, via), then total volume and the two target columns
columns = list(car_types) + ['volume', 'trump', 'not_trump']
data.columns = columns

trn_data,tst_data = train_test_split(data,train_size=0.75,stratify=data['trump'])

# Separate the features from the binary target (True where the Trump fraction is >= 0.5)
feature_columns = columns[:-2]
trn_X, trn_y = trn_data[feature_columns], trn_data['trump']
tst_X, tst_y = tst_data[feature_columns], tst_data['trump']

# Logistic Regression Model
logis = LogisticRegression()
logis.fit(trn_X, trn_y)

# Alt model confusion matrix
print('training confusion matrix:')
print(confusion_matrix(trn_y, logis.predict(trn_X)))
print('testing confusion matrix:')
print(confusion_matrix(tst_y, logis.predict(tst_X)))
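
NumPy-style indexing such as `data[:, :-2]` works on arrays but is not a valid key for a pandas DataFrame; positional selection on a DataFrame goes through `.iloc`. The sketch below shows the DataFrame equivalent on a made-up two-row table. The column names and values are illustrative only.

```python
import pandas as pd

# Tiny made-up feature table: two feature columns followed by two target columns
frame = pd.DataFrame({
    "yellow": [0.7, 0.2],
    "uber": [0.3, 0.8],
    "trump": [False, True],
    "not_trump": [True, False],
})

# Positional slicing on a DataFrame uses .iloc: all rows, every column but the last two
X = frame.iloc[:, :-2].to_numpy()

# A single target column can be selected by name
y = frame["trump"].to_numpy()

print(X.shape, y)
```

Converting with `.to_numpy()` after slicing keeps the feature block numeric, since the boolean target columns are excluded before the conversion.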

# Results: summarize and visualize the results discovered from the analysis

Please use figures, tables, or videos to communicate the results with the audience.


In [None]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary








-----------------
# Done

All set! 

**What do you need to submit?**

* **Notebook File**: Save this Jupyter notebook, and find the notebook file in your folder (for example, "filename.ipynb"). This is the file you need to submit. Please make sure all the plotted tables and figures are in the notebook. If you used "jupyter notebook --pylab=inline" to open the notebook, all the figures and tables should have shown up in the notebook.

* **PPT Slides**: Please prepare PPT slides (for a 7-minute talk) to present the case study. Each team will present its case study in class for 7 minutes.

Please compress all the files into a single zip file.


**How to submit:**

        Please submit through Canvas, in the Assignment "Case Study 3".
        
**Note: Each team only needs to submit one submission in Canvas**


# Peer-Review Grading Template:

**Total Points: (100 points)** Please don't worry about the absolute scores; we will rescale the final grades according to the performance of all teams in the class.

Please add an "**X**" mark in front of your rating: 

For example:

*2: bad*
          
**X** *3: good*
    
*4: perfect*


    ---------------------------------
    The Problem: 
    ---------------------------------
    
    1. (10 points) How well did the team describe the problem they are trying to solve using the data?
       0: not clear
       2: I can barely understand the problem
       4: okay, can be improved
       6: good, but can be improved
       8: very good
       10: crystal clear
    
    2. (10 points) do you think the problem is important or has a potential impact?
        0: not important at all
        2: not sure if it is important
        4: seems important, but not clear
        6: interesting problem
        8: an important problem, which I want to know the answer myself
       10: very important; I would be happy to invest money in a project like this.
    
    ----------------------------------
    Data Collection and Processing:
    ----------------------------------
    
    3. (10 points) Do you think the data collected/processed are relevant and sufficient for solving the above problem? 
       0: not clear
       2: I can barely understand what data they are trying to collect/process
       4: I can barely understand why the data is relevant to the problem
       6: the data are relevant to the problem, but better data can be collected
       8: the data collected are relevant and at a proper scale
      10: the data are properly collected and they are sufficient

    -----------------------------------
    Data Exploration:
    -----------------------------------
    4. How well did the team solve the following task:
    
    (1) plot the spatial distribution of the pickup locations of 5000 Uber trips (10 points):
       0: missing answer
       4: okay, but with major problems
       7: good, but with minor problems
      10: perfect
    

    -----------------------------------
    The Solution
    -----------------------------------
    5. How well did the team describe the solution they used to solve the problem? (10 points)
       0: not clear
       2: I can barely understand
       4: okay, can be improved
       6: good, but can be improved
       8: very good
       10: crystal clear
       
    6. How well does the solution solve the problem? (10 points)
       0: not relevant
       2: barely relevant to the problem
       4: okay solution, but there is an easier solution.
       6: good, but can be improved
       8: very good, but solution is simple/old
       10: innovative and technically sound
       
    7. How well did the team implement the solution in Python? (10 points)
       0: the code is not relevant to the solution proposed
       2: the code is barely understandable, but not relevant
       4: okay, the code is clear but incorrect
       6: good, the code is correct, but with major errors
       8: very good, the code is correct, but with minor errors
      10: perfect 
   
    -----------------------------------
    The Results
    -----------------------------------
    8. How well did the team present the results they found in the data? (10 points)
       0: not clear
       2: I can barely understand
       4: okay, can be improved
       6: good, but can be improved
       8: very good
      10: crystal clear
       
    9. What do you think of the results they found in the data? (5 points)
       0: not clear
       1: likely to be wrong
       2: okay, maybe wrong
       3: good, but can be improved
       4: make sense, but not interesting
       5: make sense and very interesting
     
    -----------------------------------
    The Presentation
    -----------------------------------
    10. How well do all the different parts (data, problem, solution, result) fit together as a coherent story? (5 points)
       0: they are irrelevant
       1: I can barely understand how they are related to each other
       2: okay, the problem is good, but the solution doesn't match well, or the problem is not solvable.
       3: good, but the results don't make much sense in the context
       4: very good fit, but not exciting (the storyline can be improved/polished)
       5: a perfect story
      
    11. Did the presenter make good use of the 7 minutes for the presentation? (5 points)
       0: the team didn't present
       1: bad, barely finished a small part of the talk
       2: okay, barely finished most parts of the talk.
       3: good, finished all parts of the talk, but some part is rushed
       4: very good, but the allocation of time on different parts can be improved.
       5: perfect timing and good use of time      

    12. How would you rate the presentation (overall quality)? (5 points)
       0: the team didn't present
       1: bad
       2: okay
       3: good
       4: very good
       5: perfect


