## Part 1 - Selection (30 points)
Identify and describe your dataset, its source, and what appeals
to you about it.  Acquire the data and perform an initial exploration
to determine which themes you wish to explore.  Describe the questions
you want to be able to answer with the data, any concerns you have
about the data, and any challenges you expect to have to overcome.


As traffic volumes continue to grow in the coming decades, the public sector will need to consider every possible opportunity to better manage all transportation systems and infrastructure.
Better traffic flow is achievable in part through better systems for collecting and analyzing real-time traffic data. In this arena, transportation managers can learn from the technologies and practices deployed by private companies. One such real-time system combines daily data on package delivery commitments with historical route tracking to identify the optimal path. Government could provide similar route optimization suggestions during rush hours. The approach is to dive into the data to learn where and when people need to travel, determine how to make the transportation system the best it can be, and then invest in the technology needed to realize those improvements.


This dataset includes taxi trips for 2016, reported to the City of Chicago in its role as a regulatory agency. Due to the data reporting process, not all trips are reported but the City believes that most are. 

In this project, we focus specifically on the following questions: At what times is taxi demand highest? Which areas are the most popular for using a taxi as the primary vehicle? What is the taxi pricing strategy? What is the main payment type?



In [3]:
import json
import time
from datetime import datetime
from datetime import datetime as dt

import numpy as np
import pandas as pd
import seaborn as sns
from dateutil.parser import parse
from matplotlib.pylab import date2num
from pandas import DataFrame, Timestamp
from sqlalchemy import create_engine

%matplotlib inline

# FACT table

**We inspect the data, drop the rows and columns that contain null values, and reset the index.**

In [4]:
!csvcut -n chicago_taxi_trips_2016_01.csv

  1: taxi_id
  2: trip_start_timestamp
  3: Month
  4: Day
  5: Time
  6: 
  7: trip_end_timestamp
  8: trip_seconds
  9: trip_miles
 10: pickup_census_tract
 11: dropoff_census_tract
 12: pickup_community_area
 13: dropoff_community_area
 14: fare
 15: tips
 16: tolls
 17: extras
 18: trip_total
 19: payment_type
 20: company
 21: pickup_latitude
 22: pickup_longitude
 23: dropoff_latitude
 24: dropoff_longitude


**Looking into the meaning of the columns, we find that the two census tract columns are not useful for our analysis, so we delete them.**

In [1]:
!csvcut -C10,11 chicago_taxi_trips_2016_01.csv > chicago1.csv

**There are many null values in our dataset. Since the original dataset has more than 2,500,000 rows, we can afford to delete the rows that contain null values. This avoids trouble in Part 2 and Part 3.**

In [None]:
!csvgrep -c 10 -r '^$' -i chicago1.csv > chicago2.csv

In [None]:
!csvgrep -c 18 -r '^$' -i chicago2.csv > chicago3.csv
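
For comparison, the same null filtering could be done in pandas. The sketch below runs on a tiny made-up frame rather than the notebook's data, and it assumes that empty CSV fields are read as NaN, which is the `pd.read_csv` default. The column names match columns 10 and 18 of `chicago1.csv`; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the trips table; values are made up for illustration.
sample = pd.DataFrame({
    "taxi_id": [85, 6641, 6078, 221],
    "pickup_community_area": [24.0, np.nan, 41.0, 7.0],
    "company": [107.0, 8.0, np.nan, 101.0],
})

# Drop rows where either key column is missing, then renumber the index.
cleaned = (
    sample.dropna(subset=["pickup_community_area", "company"])
          .reset_index(drop=True)
)
print(len(sample), "->", len(cleaned))
```

Passing `subset` keeps rows whose nulls fall only in columns we do not filter on, which mirrors running `csvgrep` on one column at a time.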

In [4]:
chi = pd.read_csv('chicago3.csv')

In [5]:
chi = chi.reset_index(drop=True)

**Because of the size of the dataset, we limit it to the first 20 days of the month. We then combine the year (stored in 'trip_start_timestamp'), 'Month', and 'Day' into a single date and drop 'trip_end_timestamp', so that the data is stored more efficiently for later exploration.**


In [12]:
chi['Day'] = chi['Day'].astype(int)

In [13]:
chi.head(5)

Unnamed: 0,taxi_id,trip_start_timestamp,Month,Day,Time,Unnamed: 5,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,...,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude


In [10]:
chi = chi[chi.Day <21 ]
chi = chi.reset_index(drop=True)
len(chi)

348840

In [None]:
!xsv search -s 

In [11]:
chi.head(10)

Unnamed: 0,taxi_id,trip_start_timestamp,Month,Day,Time,Unnamed: 5,trip_end_timestamp,trip_seconds,trip_miles,pickup_community_area,...,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude
0,85.0,2016,1,13,6:15:00,AM,2016/1/13 6:15,180.0,0.4,24.0,...,0.0,0.0,0.0,4.5,Cash,107.0,199.0,510.0,199.0,510.0
1,6641.0,2016,1,6,11:15:00,PM,2016/1/6 23:30,420.0,0.0,8.0,...,0.0,0.0,0.0,7.25,Cash,8.0,419.0,615.0,411.0,545.0
2,6078.0,2016,1,3,7:45:00,AM,2016/1/3 8:00,480.0,0.1,41.0,...,0.0,0.0,0.0,9.0,Cash,107.0,583.0,195.0,278.0,173.0
3,221.0,2016,1,3,4:30:00,PM,2016/1/3 16:30,720.0,0.0,7.0,...,0.0,0.0,0.0,10.75,Cash,101.0,642.0,32.0,411.0,545.0
4,5367.0,2016,1,11,2:00:00,PM,2016/1/11 14:00,420.0,0.0,28.0,...,0.0,0.0,0.0,6.75,Cash,107.0,158.0,270.0,158.0,270.0
5,4251.0,2016,1,1,7:30:00,PM,2016/1/1 19:45,660.0,4.1,8.0,...,3.0,0.0,1.0,15.85,Credit Card,101.0,210.0,470.0,606.0,617.0
6,5132.0,2016,1,8,7:30:00,PM,2016/1/8 19:30,720.0,0.2,32.0,...,2.0,0.0,1.0,15.25,Credit Card,8.0,18.0,610.0,472.0,207.0
7,2647.0,2016,1,15,7:30:00,PM,2016/1/15 19:30,360.0,0.0,8.0,...,0.0,0.0,0.0,6.5,Cash,107.0,754.0,410.0,474.0,204.0
8,373.0,2016,1,2,5:45:00,PM,2016/1/2 17:45,240.0,0.6,8.0,...,0.0,0.0,1.5,6.35,Cash,109.0,688.0,206.0,744.0,605.0
9,5540.0,2016,1,13,7:00:00,PM,2016/1/13 19:15,600.0,0.9,8.0,...,1.11,0.0,0.0,8.11,Credit Card,92.0,754.0,410.0,167.0,754.0


**We want the exact date, so we zero-pad the month and the day before combining them with the year.**

In [12]:
# Zero-pad the month to two digits, e.g. 1 -> '01'
chi['Month'] = chi['Month'].astype(int).astype(str).str.zfill(2)

In [13]:
chi.head(10)

Unnamed: 0,taxi_id,trip_start_timestamp,Month,Day,Time,Unnamed: 5,trip_end_timestamp,trip_seconds,trip_miles,pickup_community_area,...,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude
0,85.0,2016,1,13,6:15:00,AM,2016/1/13 6:15,180.0,0.4,24.0,...,0.0,0.0,0.0,4.5,Cash,107.0,199.0,510.0,199.0,510.0
1,6641.0,2016,1,6,11:15:00,PM,2016/1/6 23:30,420.0,0.0,8.0,...,0.0,0.0,0.0,7.25,Cash,8.0,419.0,615.0,411.0,545.0
2,6078.0,2016,1,3,7:45:00,AM,2016/1/3 8:00,480.0,0.1,41.0,...,0.0,0.0,0.0,9.0,Cash,107.0,583.0,195.0,278.0,173.0
3,221.0,2016,1,3,4:30:00,PM,2016/1/3 16:30,720.0,0.0,7.0,...,0.0,0.0,0.0,10.75,Cash,101.0,642.0,32.0,411.0,545.0
4,5367.0,2016,1,11,2:00:00,PM,2016/1/11 14:00,420.0,0.0,28.0,...,0.0,0.0,0.0,6.75,Cash,107.0,158.0,270.0,158.0,270.0
5,4251.0,2016,1,1,7:30:00,PM,2016/1/1 19:45,660.0,4.1,8.0,...,3.0,0.0,1.0,15.85,Credit Card,101.0,210.0,470.0,606.0,617.0
6,5132.0,2016,1,8,7:30:00,PM,2016/1/8 19:30,720.0,0.2,32.0,...,2.0,0.0,1.0,15.25,Credit Card,8.0,18.0,610.0,472.0,207.0
7,2647.0,2016,1,15,7:30:00,PM,2016/1/15 19:30,360.0,0.0,8.0,...,0.0,0.0,0.0,6.5,Cash,107.0,754.0,410.0,474.0,204.0
8,373.0,2016,1,2,5:45:00,PM,2016/1/2 17:45,240.0,0.6,8.0,...,0.0,0.0,1.5,6.35,Cash,109.0,688.0,206.0,744.0,605.0
9,5540.0,2016,1,13,7:00:00,PM,2016/1/13 19:15,600.0,0.9,8.0,...,1.11,0.0,0.0,8.11,Credit Card,92.0,754.0,410.0,167.0,754.0


In [14]:
# Zero-pad the day to two digits, e.g. 6 -> '06'
chi['Day'] = chi['Day'].astype(str).str.zfill(2)

In [15]:
chi.head(10)

Unnamed: 0,taxi_id,trip_start_timestamp,Month,Day,Time,Unnamed: 5,trip_end_timestamp,trip_seconds,trip_miles,pickup_community_area,...,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude
0,85.0,2016,1,13,6:15:00,AM,2016/1/13 6:15,180.0,0.4,24.0,...,0.0,0.0,0.0,4.5,Cash,107.0,199.0,510.0,199.0,510.0
1,6641.0,2016,1,6,11:15:00,PM,2016/1/6 23:30,420.0,0.0,8.0,...,0.0,0.0,0.0,7.25,Cash,8.0,419.0,615.0,411.0,545.0
2,6078.0,2016,1,3,7:45:00,AM,2016/1/3 8:00,480.0,0.1,41.0,...,0.0,0.0,0.0,9.0,Cash,107.0,583.0,195.0,278.0,173.0
3,221.0,2016,1,3,4:30:00,PM,2016/1/3 16:30,720.0,0.0,7.0,...,0.0,0.0,0.0,10.75,Cash,101.0,642.0,32.0,411.0,545.0
4,5367.0,2016,1,11,2:00:00,PM,2016/1/11 14:00,420.0,0.0,28.0,...,0.0,0.0,0.0,6.75,Cash,107.0,158.0,270.0,158.0,270.0
5,4251.0,2016,1,1,7:30:00,PM,2016/1/1 19:45,660.0,4.1,8.0,...,3.0,0.0,1.0,15.85,Credit Card,101.0,210.0,470.0,606.0,617.0
6,5132.0,2016,1,8,7:30:00,PM,2016/1/8 19:30,720.0,0.2,32.0,...,2.0,0.0,1.0,15.25,Credit Card,8.0,18.0,610.0,472.0,207.0
7,2647.0,2016,1,15,7:30:00,PM,2016/1/15 19:30,360.0,0.0,8.0,...,0.0,0.0,0.0,6.5,Cash,107.0,754.0,410.0,474.0,204.0
8,373.0,2016,1,2,5:45:00,PM,2016/1/2 17:45,240.0,0.6,8.0,...,0.0,0.0,1.5,6.35,Cash,109.0,688.0,206.0,744.0,605.0
9,5540.0,2016,1,13,7:00:00,PM,2016/1/13 19:15,600.0,0.9,8.0,...,1.11,0.0,0.0,8.11,Credit Card,92.0,754.0,410.0,167.0,754.0


In [16]:
chi["trip_start_timestamp"] = chi['trip_start_timestamp'].map(str) + chi['Month'].map(str) + chi['Day'].map(str)

In [17]:
chi.head()

Unnamed: 0,taxi_id,trip_start_timestamp,Month,Day,Time,Unnamed: 5,trip_end_timestamp,trip_seconds,trip_miles,pickup_community_area,...,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude
0,85.0,20160113,1,13,6:15:00,AM,2016/1/13 6:15,180.0,0.4,24.0,...,0.0,0.0,0.0,4.5,Cash,107.0,199.0,510.0,199.0,510.0
1,6641.0,20160106,1,6,11:15:00,PM,2016/1/6 23:30,420.0,0.0,8.0,...,0.0,0.0,0.0,7.25,Cash,8.0,419.0,615.0,411.0,545.0
2,6078.0,20160103,1,3,7:45:00,AM,2016/1/3 8:00,480.0,0.1,41.0,...,0.0,0.0,0.0,9.0,Cash,107.0,583.0,195.0,278.0,173.0
3,221.0,20160103,1,3,4:30:00,PM,2016/1/3 16:30,720.0,0.0,7.0,...,0.0,0.0,0.0,10.75,Cash,101.0,642.0,32.0,411.0,545.0
4,5367.0,20160111,1,11,2:00:00,PM,2016/1/11 14:00,420.0,0.0,28.0,...,0.0,0.0,0.0,6.75,Cash,107.0,158.0,270.0,158.0,270.0


In [20]:
chi = chi.drop(['Month', 'Day','trip_end_timestamp'], axis=1)

In [21]:
chi.head(10)

Unnamed: 0,taxi_id,trip_start_timestamp,Time,Unnamed: 5,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude
0,85.0,2016-01-13,6:15:00,AM,180.0,0.4,24.0,24.0,4.5,0.0,0.0,0.0,4.5,Cash,107.0,199.0,510.0,199.0,510.0
1,6641.0,2016-01-06,11:15:00,PM,420.0,0.0,8.0,28.0,7.25,0.0,0.0,0.0,7.25,Cash,8.0,419.0,615.0,411.0,545.0
2,6078.0,2016-01-03,7:45:00,AM,480.0,0.1,41.0,69.0,9.0,0.0,0.0,0.0,9.0,Cash,107.0,583.0,195.0,278.0,173.0
3,221.0,2016-01-03,4:30:00,PM,720.0,0.0,7.0,28.0,10.75,0.0,0.0,0.0,10.75,Cash,101.0,642.0,32.0,411.0,545.0
4,5367.0,2016-01-11,2:00:00,PM,420.0,0.0,28.0,28.0,6.75,0.0,0.0,0.0,6.75,Cash,107.0,158.0,270.0,158.0,270.0
5,4251.0,2016-01-01,7:30:00,PM,660.0,4.1,8.0,6.0,11.85,3.0,0.0,1.0,15.85,Credit Card,101.0,210.0,470.0,606.0,617.0
6,5132.0,2016-01-08,7:30:00,PM,720.0,0.2,32.0,24.0,12.25,2.0,0.0,1.0,15.25,Credit Card,8.0,18.0,610.0,472.0,207.0
7,2647.0,2016-01-15,7:30:00,PM,360.0,0.0,8.0,8.0,6.5,0.0,0.0,0.0,6.5,Cash,107.0,754.0,410.0,474.0,204.0
8,373.0,2016-01-02,5:45:00,PM,240.0,0.6,8.0,32.0,4.85,0.0,0.0,1.5,6.35,Cash,109.0,688.0,206.0,744.0,605.0
9,5540.0,2016-01-13,7:00:00,PM,600.0,0.9,8.0,8.0,7.0,1.11,0.0,0.0,8.11,Credit Card,92.0,754.0,410.0,167.0,754.0


In [22]:
chi["Time"] = chi['Time'].map(str) + chi['Unnamed: 5'].map(str)

In [23]:
chi = chi.drop(['Unnamed: 5'], axis=1)

In [24]:
chi.head(10)

Unnamed: 0,taxi_id,trip_start_timestamp,Time,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude
0,85.0,2016-01-13,6:15:00AM,180.0,0.4,24.0,24.0,4.5,0.0,0.0,0.0,4.5,Cash,107.0,199.0,510.0,199.0,510.0
1,6641.0,2016-01-06,11:15:00PM,420.0,0.0,8.0,28.0,7.25,0.0,0.0,0.0,7.25,Cash,8.0,419.0,615.0,411.0,545.0
2,6078.0,2016-01-03,7:45:00AM,480.0,0.1,41.0,69.0,9.0,0.0,0.0,0.0,9.0,Cash,107.0,583.0,195.0,278.0,173.0
3,221.0,2016-01-03,4:30:00PM,720.0,0.0,7.0,28.0,10.75,0.0,0.0,0.0,10.75,Cash,101.0,642.0,32.0,411.0,545.0
4,5367.0,2016-01-11,2:00:00PM,420.0,0.0,28.0,28.0,6.75,0.0,0.0,0.0,6.75,Cash,107.0,158.0,270.0,158.0,270.0
5,4251.0,2016-01-01,7:30:00PM,660.0,4.1,8.0,6.0,11.85,3.0,0.0,1.0,15.85,Credit Card,101.0,210.0,470.0,606.0,617.0
6,5132.0,2016-01-08,7:30:00PM,720.0,0.2,32.0,24.0,12.25,2.0,0.0,1.0,15.25,Credit Card,8.0,18.0,610.0,472.0,207.0
7,2647.0,2016-01-15,7:30:00PM,360.0,0.0,8.0,8.0,6.5,0.0,0.0,0.0,6.5,Cash,107.0,754.0,410.0,474.0,204.0
8,373.0,2016-01-02,5:45:00PM,240.0,0.6,8.0,32.0,4.85,0.0,0.0,1.5,6.35,Cash,109.0,688.0,206.0,744.0,605.0
9,5540.0,2016-01-13,7:00:00PM,600.0,0.9,8.0,8.0,7.0,1.11,0.0,0.0,8.11,Credit Card,92.0,754.0,410.0,167.0,754.0
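
To answer the demand-by-time question, the date string and the 12-hour `Time` string can be parsed into one timestamp. The following is a minimal sketch on two hypothetical rows shaped like the columns above, not the notebook's own method; the `start` and `hour` column names are assumptions introduced for illustration.

```python
import pandas as pd

# Hypothetical rows shaped like the notebook's columns after zero-padding.
sample = pd.DataFrame({
    "trip_start_timestamp": ["20160113", "20160106"],
    "Time": ["6:15:00AM", "11:15:00PM"],
})

# Join date and time text, then parse with an explicit format.
# %I is the 12-hour clock and %p the AM/PM marker.
sample["start"] = pd.to_datetime(
    sample["trip_start_timestamp"] + " " + sample["Time"],
    format="%Y%m%d %I:%M:%S%p",
)

# Hour of day (0-23), which a demand-by-hour count could group on.
sample["hour"] = sample["start"].dt.hour
print(sample[["start", "hour"]])
```

Giving an explicit `format` avoids ambiguous guesses about day and month order and makes parsing failures visible early.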


**To match the keys in the JSON file, we first convert the float values to integers and then to strings.**

In [25]:
chi['pickup_community_area'] = chi['pickup_community_area'].astype(int).astype(str)
chi['dropoff_community_area'] = chi['dropoff_community_area'].astype(int).astype(str)
chi['pickup_latitude'] = chi['pickup_latitude'].astype(int).astype(str)
chi['pickup_longitude'] = chi['pickup_longitude'].astype(int).astype(str)
chi['dropoff_latitude'] = chi['dropoff_latitude'].astype(int).astype(str)
chi['dropoff_longitude'] = chi['dropoff_longitude'].astype(int).astype(str)
chi['company'] = chi['company'].astype(int).astype(str)
chi['taxi_id'] = chi['taxi_id'].astype(int)

In [26]:
chi.head()

Unnamed: 0,taxi_id,trip_start_timestamp,Time,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude
0,85,2016-01-13,6:15:00AM,180.0,0.4,24,24,4.5,0.0,0.0,0.0,4.5,Cash,107,199,510,199,510
1,6641,2016-01-06,11:15:00PM,420.0,0.0,8,28,7.25,0.0,0.0,0.0,7.25,Cash,8,419,615,411,545
2,6078,2016-01-03,7:45:00AM,480.0,0.1,41,69,9.0,0.0,0.0,0.0,9.0,Cash,107,583,195,278,173
3,221,2016-01-03,4:30:00PM,720.0,0.0,7,28,10.75,0.0,0.0,0.0,10.75,Cash,101,642,32,411,545
4,5367,2016-01-11,2:00:00PM,420.0,0.0,28,28,6.75,0.0,0.0,0.0,6.75,Cash,107,158,270,158,270


**In the original dataset, the latitude and longitude columns hold index values rather than coordinates. Each index stands for an actual location, so for the analysis we need to look up the actual values.**

We use column_remapping.json to convert the latitude and longitude indexes into real coordinates.

In [27]:
with open('column_remapping.json', 'r') as jsonfile:
    json_string = json.load(jsonfile)

**Convert pickup latitude**

In [28]:
pickup_lat = json_string['pickup_latitude']
chi['pickup_latitude'] = chi['pickup_latitude'].astype(int)
pickup_latitude = pd.DataFrame.from_dict(pickup_lat, orient='index')
pickup_latitude['pickup_latitude'] =pickup_latitude.index
pickup_latitude.rename(columns={ pickup_latitude.columns[0]: "pickup_latitude1" }, inplace=True)
pickup_latitude.head()

Unnamed: 0,pickup_latitude1,pickup_latitude
0,41.941422478,0
1,41.920265121,1
2,41.898425258,2
3,42.005608023,3
4,41.884272021,4


In [29]:
pickup_latitude['pickup_latitude'] = pickup_latitude['pickup_latitude'].astype(int)
result = pd.merge(chi, pickup_latitude, how='left', on='pickup_latitude')

In [30]:
result = result.drop(['pickup_latitude'], axis=1)
result.head(5)

Unnamed: 0,taxi_id,trip_start_timestamp,Time,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_longitude,dropoff_latitude,dropoff_longitude,pickup_latitude1
0,85,2016-01-13,6:15:00AM,180.0,0.4,24,24,4.5,0.0,0.0,0.0,4.5,Cash,107,510,199,510,41.901206994
1,6641,2016-01-06,11:15:00PM,420.0,0.0,8,28,7.25,0.0,0.0,0.0,7.25,Cash,8,615,411,545,41.89321636
2,6078,2016-01-03,7:45:00AM,480.0,0.1,41,69,9.0,0.0,0.0,0.0,9.0,Cash,107,195,278,173,41.794090253
3,221,2016-01-03,4:30:00PM,720.0,0.0,7,28,10.75,0.0,0.0,0.0,10.75,Cash,101,32,411,545,41.914747305
4,5367,2016-01-11,2:00:00PM,420.0,0.0,28,28,6.75,0.0,0.0,0.0,6.75,Cash,107,270,158,270,41.874005383


**Convert pickup longitude**

In [31]:
pickup_long = json_string['pickup_longitude']
chi['pickup_longitude'] = chi['pickup_longitude'].astype(int)
pickup_longitude = pd.DataFrame.from_dict(pickup_long, orient='index')
pickup_longitude['pickup_longitude'] =pickup_longitude.index
pickup_longitude.rename(columns={pickup_longitude.columns[0]: "pickup_longitude1" }, inplace=True)
pickup_longitude.head()

Unnamed: 0,pickup_longitude1,pickup_longitude
0,-87.811605775,0
1,-87.670621043,1
2,-87.656411531,2
3,-87.672538401,3
4,-87.61626757,4


In [32]:
pickup_longitude['pickup_longitude'] = pickup_longitude['pickup_longitude'].astype(str)
result = pd.merge(result, pickup_longitude, how='left', on='pickup_longitude')

In [33]:
result = result.drop(['pickup_longitude'], axis=1)
result.head(5)

Unnamed: 0,taxi_id,trip_start_timestamp,Time,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,dropoff_latitude,dropoff_longitude,pickup_latitude1,pickup_longitude1
0,85,2016-01-13,6:15:00AM,180.0,0.4,24,24,4.5,0.0,0.0,0.0,4.5,Cash,107,199,510,41.901206994,-87.676355989
1,6641,2016-01-06,11:15:00PM,420.0,0.0,8,28,7.25,0.0,0.0,0.0,7.25,Cash,8,411,545,41.89321636,-87.63784421
2,6078,2016-01-03,7:45:00AM,480.0,0.1,41,69,9.0,0.0,0.0,0.0,9.0,Cash,107,278,173,41.794090253,-87.592310855
3,221,2016-01-03,4:30:00PM,720.0,0.0,7,28,10.75,0.0,0.0,0.0,10.75,Cash,101,411,545,41.914747305,-87.654007029
4,5367,2016-01-11,2:00:00PM,420.0,0.0,28,28,6.75,0.0,0.0,0.0,6.75,Cash,107,158,270,41.874005383,-87.66351755


**Convert dropoff_latitude**

In [34]:
dropoff_latitude = json_string['dropoff_latitude']
chi['dropoff_latitude'] = chi['dropoff_latitude'].astype(int)
dropoff_latitude = pd.DataFrame.from_dict(dropoff_latitude, orient='index')
dropoff_latitude['dropoff_latitude'] =dropoff_latitude.index
dropoff_latitude.rename(columns={dropoff_latitude.columns[0]: "dropoff_latitude1" }, inplace=True)
dropoff_latitude.head()

Unnamed: 0,dropoff_latitude1,dropoff_latitude
0,41.941422478,0
1,41.920265121,1
2,41.898425258,2
3,42.005608023,3
4,41.884272021,4


In [35]:
dropoff_latitude['dropoff_latitude'] =dropoff_latitude['dropoff_latitude'].astype(str)
result = pd.merge(result, dropoff_latitude, how='left', on='dropoff_latitude')

In [36]:
result = result.drop(['dropoff_latitude'], axis=1)
result.head(5)

Unnamed: 0,taxi_id,trip_start_timestamp,Time,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,dropoff_longitude,pickup_latitude1,pickup_longitude1,dropoff_latitude1
0,85,2016-01-13,6:15:00AM,180.0,0.4,24,24,4.5,0.0,0.0,0.0,4.5,Cash,107,510,41.901206994,-87.676355989,41.901206994
1,6641,2016-01-06,11:15:00PM,420.0,0.0,8,28,7.25,0.0,0.0,0.0,7.25,Cash,8,545,41.89321636,-87.63784421,41.879255084
2,6078,2016-01-03,7:45:00AM,480.0,0.1,41,69,9.0,0.0,0.0,0.0,9.0,Cash,107,173,41.794090253,-87.592310855,41.763246799
3,221,2016-01-03,4:30:00PM,720.0,0.0,7,28,10.75,0.0,0.0,0.0,10.75,Cash,101,545,41.914747305,-87.654007029,41.879255084
4,5367,2016-01-11,2:00:00PM,420.0,0.0,28,28,6.75,0.0,0.0,0.0,6.75,Cash,107,270,41.874005383,-87.66351755,41.874005383


**Convert dropoff longitude**

In [37]:
dropoff_longitude = json_string['dropoff_longitude']
chi['dropoff_longitude'] = chi['dropoff_longitude'].astype(int)
dropoff_longitude = pd.DataFrame.from_dict(dropoff_longitude, orient='index')
dropoff_longitude['dropoff_longitude'] =dropoff_longitude.index
dropoff_longitude.rename(columns={dropoff_longitude.columns[0]: "dropoff_longitude1" }, inplace=True)
dropoff_longitude.head()

Unnamed: 0,dropoff_longitude1,dropoff_longitude
0,-87.811605775,0
1,-87.670621043,1
2,-87.656411531,2
3,-87.672538401,3
4,-87.61626757,4


In [38]:
dropoff_longitude['dropoff_longitude'] =dropoff_longitude['dropoff_longitude'].astype(str)
result = pd.merge(result, dropoff_longitude, how='left', on='dropoff_longitude')
result = result.drop(['dropoff_longitude'], axis=1)
result.head(5)

Unnamed: 0,taxi_id,trip_start_timestamp,Time,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude1,pickup_longitude1,dropoff_latitude1,dropoff_longitude1
0,85,2016-01-13,6:15:00AM,180.0,0.4,24,24,4.5,0.0,0.0,0.0,4.5,Cash,107,41.901206994,-87.676355989,41.901206994,-87.676355989
1,6641,2016-01-06,11:15:00PM,420.0,0.0,8,28,7.25,0.0,0.0,0.0,7.25,Cash,8,41.89321636,-87.63784421,41.879255084,-87.642648998
2,6078,2016-01-03,7:45:00AM,480.0,0.1,41,69,9.0,0.0,0.0,0.0,9.0,Cash,107,41.794090253,-87.592310855,41.763246799,-87.616134111
3,221,2016-01-03,4:30:00PM,720.0,0.0,7,28,10.75,0.0,0.0,0.0,10.75,Cash,101,41.914747305,-87.654007029,41.879255084,-87.642648998
4,5367,2016-01-11,2:00:00PM,420.0,0.0,28,28,6.75,0.0,0.0,0.0,6.75,Cash,107,41.874005383,-87.66351755,41.874005383,-87.66351755
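
The four conversions above follow the same pattern, so they could also be written as one dictionary lookup per column. The sketch below is an alternative using `Series.map`, not the merge approach used in this notebook. The `remap` dictionary is a hypothetical excerpt whose index/coordinate pairs are taken from the first two rows of the outputs above; the assumption that the file stores keys and values as strings is ours, and `remap_columns` is a helper name introduced for illustration.

```python
import pandas as pd

# Hypothetical excerpt of a remapping file: column name -> {index: real value}.
# Storing keys and values as strings is an assumption about the file layout.
remap = {
    "pickup_latitude": {"199": "41.901206994", "419": "41.89321636"},
    "pickup_longitude": {"510": "-87.676355989", "615": "-87.63784421"},
}

trips = pd.DataFrame({
    "pickup_latitude": ["199", "419"],
    "pickup_longitude": ["510", "615"],
})

def remap_columns(df, mapping):
    """Return a copy of df with each mapped column replaced by its real value."""
    out = df.copy()
    for column, lookup in mapping.items():
        # Series.map looks each index string up in the dict; unknown keys become NaN.
        out[column] = out[column].map(lookup).astype(float)
    return out

decoded = remap_columns(trips, remap)
print(decoded)
```

Compared with a merge, `map` keeps the original column name and row order, so no helper column such as `pickup_latitude1` has to be renamed afterwards.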


**Find the company names**

In [39]:
company = json_string['company']
chi['company'] = chi['company'].astype(int)
companies = pd.DataFrame.from_dict(company, orient='index')
companies['company'] =companies.index
companies.rename(columns={companies.columns[0]: "company1" }, inplace=True)
companies.head()

Unnamed: 0,company1,company
0,3623-Arrington Enterprises,0
1,5874 - Sergey Cab Corp.,1
2,5874 - 73628 Sergey Cab Corp.,2
3,Chicago Medallion Management,3
4,3011 - JBL Cab Inc.,4


In [40]:
companies['company'] = companies['company'].astype(str)
result = pd.merge(result, companies, how='left', on='company')
result = result.drop(['company'], axis=1)
result.head(5)

Unnamed: 0,taxi_id,trip_start_timestamp,Time,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,pickup_latitude1,pickup_longitude1,dropoff_latitude1,dropoff_longitude1,company1
0,85,2016-01-13,6:15:00AM,180.0,0.4,24,24,4.5,0.0,0.0,0.0,4.5,Cash,41.901206994,-87.676355989,41.901206994,-87.676355989,Taxi Affiliation Services
1,6641,2016-01-06,11:15:00PM,420.0,0.0,8,28,7.25,0.0,0.0,0.0,7.25,Cash,41.89321636,-87.63784421,41.879255084,-87.642648998,Blue Ribbon Taxi Association Inc.
2,6078,2016-01-03,7:45:00AM,480.0,0.1,41,69,9.0,0.0,0.0,0.0,9.0,Cash,41.794090253,-87.592310855,41.763246799,-87.616134111,Taxi Affiliation Services
3,221,2016-01-03,4:30:00PM,720.0,0.0,7,28,10.75,0.0,0.0,0.0,10.75,Cash,41.914747305,-87.654007029,41.879255084,-87.642648998,Dispatch Taxi Affiliation
4,5367,2016-01-11,2:00:00PM,420.0,0.0,28,28,6.75,0.0,0.0,0.0,6.75,Cash,41.874005383,-87.66351755,41.874005383,-87.66351755,Taxi Affiliation Services


In [41]:
result.columns= ['taxi_id',
 'trip_start_timestamp',
 'Time',
 'trip_seconds',
 'trip_miles',
 'pickup_community_area',
 'dropoff_community_area',
 'fare',
 'tips',
 'tolls',
 'extras',
 'trip_total',
 'payment_type',
 'pickup_latitude',
 'pickup_longitude',
 'dropoff_latitude',
 'dropoff_longitude',
 'company']

In [42]:
chi = result
chi.head(5)

Unnamed: 0,taxi_id,trip_start_timestamp,Time,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude,company
0,85,2016-01-13,6:15:00AM,180.0,0.4,24,24,4.5,0.0,0.0,0.0,4.5,Cash,41.901206994,-87.676355989,41.901206994,-87.676355989,Taxi Affiliation Services
1,6641,2016-01-06,11:15:00PM,420.0,0.0,8,28,7.25,0.0,0.0,0.0,7.25,Cash,41.89321636,-87.63784421,41.879255084,-87.642648998,Blue Ribbon Taxi Association Inc.
2,6078,2016-01-03,7:45:00AM,480.0,0.1,41,69,9.0,0.0,0.0,0.0,9.0,Cash,41.794090253,-87.592310855,41.763246799,-87.616134111,Taxi Affiliation Services
3,221,2016-01-03,4:30:00PM,720.0,0.0,7,28,10.75,0.0,0.0,0.0,10.75,Cash,41.914747305,-87.654007029,41.879255084,-87.642648998,Dispatch Taxi Affiliation
4,5367,2016-01-11,2:00:00PM,420.0,0.0,28,28,6.75,0.0,0.0,0.0,6.75,Cash,41.874005383,-87.66351755,41.874005383,-87.66351755,Taxi Affiliation Services


In [43]:
len(chi)

348840

In [46]:
chi.to_csv('chicago_taxi.csv')

In [47]:
chi.head(5)

Unnamed: 0,taxi_id,trip_start_timestamp,Time,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude,company
0,85,2016-01-13,6:15:00AM,180.0,0.4,24,24,4.5,0.0,0.0,0.0,4.5,Cash,41.901206994,-87.676355989,41.901206994,-87.676355989,Taxi Affiliation Services
1,6641,2016-01-06,11:15:00PM,420.0,0.0,8,28,7.25,0.0,0.0,0.0,7.25,Cash,41.89321636,-87.63784421,41.879255084,-87.642648998,Blue Ribbon Taxi Association Inc.
2,6078,2016-01-03,7:45:00AM,480.0,0.1,41,69,9.0,0.0,0.0,0.0,9.0,Cash,41.794090253,-87.592310855,41.763246799,-87.616134111,Taxi Affiliation Services
3,221,2016-01-03,4:30:00PM,720.0,0.0,7,28,10.75,0.0,0.0,0.0,10.75,Cash,41.914747305,-87.654007029,41.879255084,-87.642648998,Dispatch Taxi Affiliation
4,5367,2016-01-11,2:00:00PM,420.0,0.0,28,28,6.75,0.0,0.0,0.0,6.75,Cash,41.874005383,-87.66351755,41.874005383,-87.66351755,Taxi Affiliation Services


In [10]:
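# Use the same file name (and case) that to_csv wrote above
chi = pd.read_csv('chicago_taxi.csv')
chi.head(5)
# Parse each YYYYMMDD value into a datetime; the list is named time_ so that
# the imported time module is not shadowed
time_ = [parse(str(d)) for d in chi["trip_start_timestamp"]]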

Unnamed: 0.1,Unnamed: 0,taxi_id,trip_start_timestamp,Time,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude,company
0,0,85,20160113,6:15:00AM,180.0,0.4,24,24,4.5,0.0,0.0,0.0,4.5,Cash,41.901207,-87.676356,41.901207,-87.676356,Taxi Affiliation Services
1,1,6641,20160106,11:15:00PM,420.0,0.0,8,28,7.25,0.0,0.0,0.0,7.25,Cash,41.893216,-87.637844,41.879255,-87.642649,Blue Ribbon Taxi Association Inc.
2,2,6078,20160103,7:45:00AM,480.0,0.1,41,69,9.0,0.0,0.0,0.0,9.0,Cash,41.79409,-87.592311,41.763247,-87.616134,Taxi Affiliation Services
3,3,221,20160103,4:30:00PM,720.0,0.0,7,28,10.75,0.0,0.0,0.0,10.75,Cash,41.914747,-87.654007,41.879255,-87.642649,Dispatch Taxi Affiliation
4,4,5367,20160111,2:00:00PM,420.0,0.0,28,28,6.75,0.0,0.0,0.0,6.75,Cash,41.874005,-87.663518,41.874005,-87.663518,Taxi Affiliation Services


In [17]:
chi["trip_start_timestamp"] = DataFrame(time_)

In [18]:
chi.head(5)

Unnamed: 0.1,Unnamed: 0,taxi_id,trip_start_timestamp,Time,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude,company
0,0,85,2016-01-13,6:15:00AM,180.0,0.4,24,24,4.5,0.0,0.0,0.0,4.5,Cash,41.901207,-87.676356,41.901207,-87.676356,Taxi Affiliation Services
1,1,6641,2016-01-06,11:15:00PM,420.0,0.0,8,28,7.25,0.0,0.0,0.0,7.25,Cash,41.893216,-87.637844,41.879255,-87.642649,Blue Ribbon Taxi Association Inc.
2,2,6078,2016-01-03,7:45:00AM,480.0,0.1,41,69,9.0,0.0,0.0,0.0,9.0,Cash,41.79409,-87.592311,41.763247,-87.616134,Taxi Affiliation Services
3,3,221,2016-01-03,4:30:00PM,720.0,0.0,7,28,10.75,0.0,0.0,0.0,10.75,Cash,41.914747,-87.654007,41.879255,-87.642649,Dispatch Taxi Affiliation
4,4,5367,2016-01-11,2:00:00PM,420.0,0.0,28,28,6.75,0.0,0.0,0.0,6.75,Cash,41.874005,-87.663518,41.874005,-87.663518,Taxi Affiliation Services
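
The extra `Unnamed: 0` columns come from the index that `to_csv` writes by default. A minimal sketch of one way to avoid them when reading the file back, and to parse the dates in a single vectorized call, is shown below. It uses a hypothetical two-row CSV held in memory instead of the real file, and `chi_sample` is a name introduced only for this example.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the saved CSV, including the index column
# that to_csv writes by default.
csv_text = """,taxi_id,trip_start_timestamp,Time
0,85,20160113,6:15:00AM
1,6641,20160106,11:15:00PM
"""

# index_col=0 reuses the first column as the index instead of creating 'Unnamed: 0'.
chi_sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Vectorized parse of YYYYMMDD integers; the explicit format avoids guessing.
chi_sample["trip_start_timestamp"] = pd.to_datetime(
    chi_sample["trip_start_timestamp"].astype(str), format="%Y%m%d"
)
print(chi_sample.dtypes)
```

Writing the file with `to_csv(..., index=False)` would be another option; either way the round trip stops adding index columns.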


**We have now deleted the unnecessary columns and replaced the index values with the actual values, so we can do the analysis more efficiently. We tried to clean the data with SQL and command-line pipelines but failed, because the original data's format is JSON. It was easier to clean the data with Python.**