# Uber Data 


This part of the report covers the analysis of the Uber data. The data are a sample of Uber rides from January 2009 through June 2015.

## Importing the Uber data

The following cells import the required libraries and load the Uber data into the Jupyter Notebook.

In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

In [2]:
uber_df = pd.read_csv("C:/Users/Dimitris/Downloads/uber_rides_sample.csv")

## Filtering for longitude and latitude 

The cell below filters the data by removing every ride whose pickup or dropoff coordinates fall outside the pre-specified longitude and latitude bounds.

In [3]:
filter_df = uber_df[
    (uber_df["pickup_longitude"] < -73.717047) & (uber_df["pickup_longitude"] > -74.242330) &
    (uber_df["dropoff_longitude"] < -73.717047) & (uber_df["dropoff_longitude"] > -74.242330) &
    (uber_df["pickup_latitude"] < 40.908524) & (uber_df["pickup_latitude"] > 40.560445) &
    (uber_df["dropoff_latitude"] < 40.908524) & (uber_df["dropoff_latitude"] > 40.560445)
].copy()
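
The same bounding-box test can be written more compactly. The following is a minimal sketch, not the notebook's own code: the constant names, the helper `in_bounding_box` and the sample rows are hypothetical, and `inclusive="neither"` assumes pandas 1.3 or newer.

```python
import pandas as pd

# Bounding box taken from the filter above (degrees)
LON_MIN, LON_MAX = -74.242330, -73.717047
LAT_MIN, LAT_MAX = 40.560445, 40.908524

def in_bounding_box(df):
    """Return a boolean mask that is True where pickup and dropoff both lie inside the box."""
    mask = pd.Series(True, index=df.index)
    for col in ("pickup_longitude", "dropoff_longitude"):
        # inclusive="neither" reproduces the strict < and > comparisons
        mask &= df[col].between(LON_MIN, LON_MAX, inclusive="neither")
    for col in ("pickup_latitude", "dropoff_latitude"):
        mask &= df[col].between(LAT_MIN, LAT_MAX, inclusive="neither")
    return mask

# Hypothetical sample rows: the second has a (0, 0) pickup, the third a missing dropoff longitude
sample = pd.DataFrame({
    "pickup_longitude": [-73.999817, 0.0, -73.98],
    "pickup_latitude": [40.738354, 0.0, 40.75],
    "dropoff_longitude": [-73.999512, -73.99, None],
    "dropoff_latitude": [40.723217, 40.75, 40.76],
})
filtered_sample = sample[in_bounding_box(sample)].copy()
print(len(filtered_sample))
```

Rows with missing coordinates are dropped as well, because a comparison against a missing value evaluates to False.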

## Cleaning the data

In the cell below, the columns that carry no useful information ("Unnamed: 0" and "key") are removed from the filtered data with the .drop() method. The index is then reset so that the remaining rows are numbered consecutively from 0.

In [4]:
# Drop the unused columns and renumber the rows from 0 without keeping the old index
uber_final = filter_df.drop(columns=["Unnamed: 0", "key"]).reset_index(drop=True)

Similarly, the .strip() method is used to remove the "UTC" characters from the pickup_datetime column so that the values can be parsed as datetimes.

In [5]:
uber_final["pickup_datetime"] = [str(i).strip("UTC") for i in uber_final["pickup_datetime"]]
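
One detail of `str.strip` is worth illustrating: its argument is a set of characters to remove from both ends, not a literal suffix. The sketch below assumes the raw values look like `"2015-05-07 19:52:06 UTC"` (the sample strings are hypothetical) and compares the list comprehension above with removing the literal `" UTC"` suffix.

```python
import pandas as pd

# Hypothetical values in the assumed raw "YYYY-MM-DD HH:MM:SS UTC" layout
raw = pd.Series(["2015-05-07 19:52:06 UTC", "2009-07-17 20:04:56 UTC"])

# str.strip("UTC") removes any leading/trailing 'U', 'T' or 'C' characters,
# so the space before "UTC" is left behind
stripped = [str(value).strip("UTC") for value in raw]

# Removing the literal " UTC" suffix leaves no trailing whitespace
cleaned = raw.str.replace(" UTC", "", regex=False)

print(repr(stripped[0]))  # '2015-05-07 19:52:06 '
print(repr(cleaned[0]))   # '2015-05-07 19:52:06'
```

Either form gives a string that starts with the timestamp; the second simply avoids the trailing space.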

## Manipulating the date

Next, the pickup_datetime column is converted from text to pandas datetime values so that later steps can filter and group the rides by date.

In [6]:
uber_final["pickup_datetime"] = pd.to_datetime(uber_final["pickup_datetime"],format='%Y-%m-%d %H:%M:%S')
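
As an illustration of what the conversion enables, the sketch below parses two hypothetical timestamps with the same format string and then uses the `.dt` accessor and a timestamp comparison. The column names `year` and `hour` are made up for this example.

```python
import pandas as pd

# Hypothetical sample mirroring the cleaned pickup_datetime strings
sample = pd.DataFrame({"pickup_datetime": ["2015-05-07 19:52:06", "2009-07-17 20:04:56"]})
sample["pickup_datetime"] = pd.to_datetime(sample["pickup_datetime"], format="%Y-%m-%d %H:%M:%S")

# Once parsed, date parts are available through the .dt accessor
sample["year"] = sample["pickup_datetime"].dt.year
sample["hour"] = sample["pickup_datetime"].dt.hour

# Datetime columns also support direct comparisons with timestamps
rides_2015 = sample[sample["pickup_datetime"] >= pd.Timestamp("2015-01-01")]
print(sample[["year", "hour"]])
print(len(rides_2015))
```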

## Calculating the trip distance

The data contain no column for the trip distance, only the coordinates of the pickup and dropoff points. To make trips comparable, the distance has to be calculated from those coordinates.
The following formula (the spherical law of cosines) converts the latitude and longitude values into a distance.

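* Distance, $d = 3963.0 \cdot \arccos\left[\sin(\text{lat}_{\text{pickup}}) \sin(\text{lat}_{\text{dropoff}}) + \cos(\text{lat}_{\text{pickup}}) \cos(\text{lat}_{\text{dropoff}}) \cos(\text{long}_{\text{dropoff}} - \text{long}_{\text{pickup}})\right]$
* Where the latitude and longitude values must first be converted to radians by multiplying each one by $\pi/180$. The constant 3963.0 is the Earth's radius in miles, so $d$ is in miles.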

The calculation of the distance can be seen in the code below. The second-to-last line of the function creates a new column in the DataFrame called "trip_distance".


In [7]:
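import math

def distance_calc(df):
    # Convert all coordinates from degrees to radians
    rad_pick_long = df["pickup_longitude"]*math.pi/180
    rad_pick_lat = df["pickup_latitude"]*math.pi/180
    rad_drop_long = df["dropoff_longitude"]*math.pi/180
    rad_drop_lat = df["dropoff_latitude"]*math.pi/180

    dist_list = []
    # Positional lookup by i relies on the index having been reset to 0..n-1
    for i in range(len(df)):
        try:
            # Spherical law of cosines; 3963 is the Earth's radius in miles
            distance = 3963*math.acos((math.sin(rad_pick_lat[i])*math.sin(rad_drop_lat[i]))+math.cos(rad_pick_lat[i])*math.cos(rad_drop_lat[i])*math.cos(rad_drop_long[i]-rad_pick_long[i]))
            dist_list.append(distance)
        except ValueError:
            # math.acos raises ValueError when rounding pushes its argument outside [-1, 1],
            # which can happen for identical pickup and dropoff points; record 0 in that case
            dist_list.append(0)
    df["trip_distance"] = dist_list
    return df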

In [8]:
distance_calc(uber_final)

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,trip_distance
0,7.5,2015-05-07 19:52:06,-73.999817,40.738354,-73.999512,40.723217,1,1.047090
1,7.7,2009-07-17 20:04:56,-73.994355,40.728225,-73.994710,40.750325,1,1.528713
2,12.9,2009-08-24 21:45:00,-74.005043,40.740770,-73.962565,40.772647,1,3.132815
3,5.3,2009-06-26 08:22:21,-73.976124,40.790844,-73.965316,40.803349,3,1.033629
4,16.0,2014-08-28 17:47:00,-73.925023,40.744085,-73.973082,40.761247,5,2.783897
...,...,...,...,...,...,...,...,...
195467,3.0,2012-10-28 10:49:00,-73.987042,40.739367,-73.986525,40.740297,1,0.069799
195468,7.5,2014-03-14 01:09:00,-73.984722,40.736837,-74.006672,40.739620,1,1.166351
195469,30.9,2009-06-29 00:42:00,-73.986017,40.756487,-73.858957,40.692588,2,7.993379
195470,14.5,2015-05-20 14:56:25,-73.997124,40.725452,-73.983215,40.695415,1,2.201835
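
As a cross-check of the formula, the same calculation can be sketched in vectorized form with NumPy. This is an alternative illustration rather than the notebook's method: the name `distance_vectorized` is hypothetical, and `np.clip` is swapped in for the try/except fallback. The first sample row reuses the coordinates of row 0 in the output above; the second is a made-up trip with identical endpoints.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3963.0  # same constant as in the formula above

def distance_vectorized(df):
    """Spherical law of cosines distance in miles, computed for all rows at once."""
    lat1 = np.radians(df["pickup_latitude"])
    lat2 = np.radians(df["dropoff_latitude"])
    dlon = np.radians(df["dropoff_longitude"] - df["pickup_longitude"])
    cos_angle = np.sin(lat1) * np.sin(lat2) + np.cos(lat1) * np.cos(lat2) * np.cos(dlon)
    # Clip guards against rounding that pushes the value just outside [-1, 1]
    return EARTH_RADIUS_MILES * np.arccos(np.clip(cos_angle, -1.0, 1.0))

sample = pd.DataFrame({
    "pickup_longitude": [-73.999817, -73.98],
    "pickup_latitude": [40.738354, 40.75],
    "dropoff_longitude": [-73.999512, -73.98],
    "dropoff_latitude": [40.723217, 40.75],
})
sample["trip_distance"] = distance_vectorized(sample)
print(sample["trip_distance"])
```

The first value comes out at roughly 1.047 miles, in line with row 0 of the output above, and the trip with identical endpoints comes out at or very near 0.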


## Removing trips with zero distance

Some trips have identical pickup and dropoff coordinates, for example round trips. The distance actually travelled on those trips cannot be calculated from the data given, so their computed distance is 0.
The cell below keeps only the rows of the uber_final dataset whose "trip_distance" is not 0 and stores them in uber_new.

In [9]:
uber_new = uber_final[(uber_final["trip_distance"]!=0)]

## Converting Pandas dataframe to SQL Table 

Here, the processed pandas DataFrame containing the Uber data is written to a table named Uber_new in an SQLite database named Project.db.


In [None]:
engine = create_engine('sqlite:////Users/rishabhsalwan/Downloads/Project Data/Project.db', echo=False)
uber_new.to_sql('Uber_new', con=engine)
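
To confirm that a write like this succeeded, the table can be read back with a query. The sketch below is self-contained and makes several assumptions: it uses an in-memory SQLite database through the standard-library `sqlite3` module instead of the SQLAlchemy engine above, a hypothetical two-row stand-in for uber_new, and the optional `index=False` and `if_exists="replace"` arguments, which the original cell does not set.

```python
import sqlite3
import pandas as pd

# Hypothetical two-row stand-in for uber_new
sample = pd.DataFrame({"fare_amount": [7.5, 7.7], "trip_distance": [1.047090, 1.528713]})

# In-memory SQLite database used only for this illustration
conn = sqlite3.connect(":memory:")

# index=False keeps the DataFrame index out of the table;
# if_exists="replace" lets the cell be re-run without a "table already exists" error
sample.to_sql("Uber_new", con=conn, index=False, if_exists="replace")

# Read the table back to confirm the row count
row_count = pd.read_sql("SELECT COUNT(*) AS n FROM Uber_new", conn)["n"].iloc[0]
print(row_count)  # 2
conn.close()
```

With the default `if_exists="fail"`, running the write a second time against an existing table raises an error, which is why the sketch sets it explicitly.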