In [None]:
# importing libraries
import pandas as pd

# Reading the Dataset

The dataset used is the ["My Uber Drives"](https://www.kaggle.com/zusmani/uberdrives) dataset available on Kaggle. 

The "My Uber Drives" dataset records the Uber trips taken by one individual between January 2016 and December 2016.

It consists of the following variables:

- START_DATE* : Date and Time when the trip started
- END_DATE* : Date and Time when the trip ended
- CATEGORY* : Whether the trip was a business trip or personal
- START* : Location where the trip started
- STOP* : Location where the trip ended
- MILES* : Miles covered during the trip
- PURPOSE* : Purpose of the trip (Meals, Errands, Meetings, Customer Support etc.)

The dataset is loaded and its last few rows are displayed.

In [None]:
# reading the dataset
data = pd.read_csv('datasets_1026_1855_My Uber Drives - 2016.csv')
data.tail()

# Cleaning the Dataset

The last row of the dataset stores only the total number of miles and carries no trip information, so it is deleted.

In [None]:
# dropping the last row
data = data.drop([1155])
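
The row removal above relies on the hard-coded index 1155. As a minimal sketch, assuming the summary row is always the final row of the file, the same step can be written without a fixed index. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the raw file; the last row mimics a totals row.
# All values here are illustrative, not taken from the real dataset.
raw = pd.DataFrame({
    "START_DATE*": ["1/1/2016 21:11", "1/2/2016 1:25", "Totals"],
    "MILES*": [5.1, 5.0, 10.1],
})

# Drop whichever label the final row carries instead of naming it explicitly.
trimmed = raw.drop(raw.index[-1])

print(trimmed)
```

This keeps working if the file gains or loses rows, as long as the summary row stays last.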

Next, a concise summary of the dataframe is displayed to check the datatype of each column and the presence of missing values.

In [None]:
# printing the info of the dataset
data.info()

The feature 'PURPOSE\*' has missing values. Other than that, there is no missing data. 
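
The missing-value check can also be made explicit by counting nulls per column rather than reading them off the `info()` output. The sketch below uses a small made-up frame, not the real data.

```python
import numpy as np
import pandas as pd

# Illustrative frame: only the purpose column has gaps.
toy = pd.DataFrame({
    "MILES*": [5.1, 5.0, 4.8],
    "PURPOSE*": ["Meeting", np.nan, np.nan],
})

# isna() marks missing cells; sum() counts them column by column.
missing = toy.isna().sum()

print(missing)
```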

The columns 'START_DATE*' and 'END_DATE*' are stored as strings and should be of DateTime type. After conversion, each one is also split into a separate date column and time column.

In [None]:
# converting to DateTime type
data['START_DATE*'] = pd.to_datetime(data['START_DATE*'])

# making the respective date and time columns
data['Start Date'] = data['START_DATE*'].dt.date
data['Start Time'] = data['START_DATE*'].dt.time

In [None]:
# converting to DateTime type
data['END_DATE*'] = pd.to_datetime(data['END_DATE*'])

# making the respective date and time columns
data['End Date'] = data['END_DATE*'].dt.date
data['End Time'] = data['END_DATE*'].dt.time
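
The conversion above lets pandas infer the timestamp layout. A minimal sketch of the same step with an explicit format follows, assuming the strings look like `1/1/2016 21:11` (month/day/year, 24-hour clock); that layout is an assumption and should be checked against the raw file.

```python
import pandas as pd

# Sample strings in the assumed month/day/year hour:minute layout.
stamps = pd.Series(["1/1/2016 21:11", "1/2/2016 1:25"])

# An explicit format avoids ambiguous day/month guessing;
# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(stamps, format="%m/%d/%Y %H:%M", errors="coerce")

# Split into calendar date and clock time, as in the cells above.
dates = parsed.dt.date
times = parsed.dt.time

print(parsed)
```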

In [None]:
data.info()

A column 'Weekday' is added which stores the day of the week on which the trip started. Its values range from 0 to 6, where 0 represents Monday and 6 represents Sunday.

In [None]:
# finding the day of the week (0 = Monday, 6 = Sunday)
data['Weekday'] = data['START_DATE*'].dt.weekday

In [None]:
data['Weekday'].unique()
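
To illustrate the 0 to 6 encoding described above, the sketch below computes the weekday number and the matching day name for three sample timestamps (chosen for illustration) using the `.dt` accessor.

```python
import pandas as pd

# Three sample trip start times.
starts = pd.Series(pd.to_datetime([
    "2016-01-01 21:11",
    "2016-01-03 09:00",
    "2016-01-04 08:30",
]))

# Numeric weekday: Monday is 0 and Sunday is 6.
weekday = starts.dt.weekday

# Human-readable names for the same days.
names = starts.dt.day_name()

print(pd.DataFrame({"weekday": weekday, "name": names}))
```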

Another feature 'Duration' is added which stores the duration of the trip in minutes, computed as the difference between the end and start timestamps.

In [None]:
# finding the duration of the trip as a Timedelta
duration = data['END_DATE*'] - data['START_DATE*']

# converting the Timedelta to minutes
data['Duration'] = duration.dt.total_seconds() / 60
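
The duration calculation can be checked on a small example. The sketch below uses two made-up trips with hypothetical column names `start`, `end` and `minutes`.

```python
import pandas as pd

# Two illustrative trips lasting 6 and 12 minutes.
trips = pd.DataFrame({
    "start": pd.to_datetime(["2016-01-01 21:11", "2016-01-02 01:25"]),
    "end": pd.to_datetime(["2016-01-01 21:17", "2016-01-02 01:37"]),
})

# Subtracting datetime columns yields Timedeltas; convert them to minutes.
trips["minutes"] = (trips["end"] - trips["start"]).dt.total_seconds() / 60

print(trips)
```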

The start and end locations also show that the rider was outside the USA during 3 periods of the year. The rides from those periods are deleted to keep the data consistent.

In [None]:
# making 3 dataframes corresponding to the 3 time intervals
# ('Start Date' holds datetime.date objects, so the bounds are converted with .date())
d1 = data[data['Start Date'].between(pd.to_datetime('2016-02-16').date(), pd.to_datetime('2016-02-21').date())]
d2 = data[data['Start Date'].between(pd.to_datetime('2016-08-15').date(), pd.to_datetime('2016-10-14').date())]
d3 = data[data['Start Date'].between(pd.to_datetime('2016-12-17').date(), pd.to_datetime('2016-12-31').date())]

In [None]:
# dropping the rows of the three intervals by their index
data = data.drop(pd.concat([d1, d2, d3]).index)
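
As an alternative to building separate dataframes and dropping by index, the same filtering can be expressed with a boolean mask. This is a sketch on made-up rows covering only the first interval; the names `rides`, `abroad` and `kept` are hypothetical.

```python
import datetime as dt
import pandas as pd

# Illustrative rides around the first interval (16 to 21 February 2016).
rides = pd.DataFrame({
    "Start Date": [
        dt.date(2016, 2, 15),
        dt.date(2016, 2, 16),
        dt.date(2016, 2, 21),
        dt.date(2016, 2, 22),
    ],
    "MILES*": [1.0, 2.0, 3.0, 4.0],
})

# between() is inclusive on both ends by default.
abroad = rides["Start Date"].between(dt.date(2016, 2, 16), dt.date(2016, 2, 21))

# Keep only the rows outside the interval.
kept = rides[~abroad]

print(kept)
```

Masks for several intervals can be combined with `|` before negating, which avoids the intermediate dataframes.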

The dataset now contains only the trips that took place in the United States of America.

The columns 'START_DATE*' and 'END_DATE*' are also removed as their information is present in the columns 'Start Date', 'Start Time', 'End Date' and 'End Time'.

In [None]:
# dropping the now-redundant columns
data = data.drop(['START_DATE*', 'END_DATE*'], axis=1)

The feature 'PURPOSE*' is imputed with the modal value since it is a categorical variable.

In [None]:
# data imputation with mode
mode = data['PURPOSE*'].mode()[0]
data['PURPOSE*'] = data['PURPOSE*'].fillna(mode)
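
The effect of mode imputation can be seen on a short series. The values below are illustrative, not counts from the real dataset.

```python
import numpy as np
import pandas as pd

# Illustrative purposes with two missing entries.
purpose = pd.Series(["Meeting", "Meal/Entertain", np.nan, "Meeting", np.nan])

# mode() returns a Series (there may be ties), so take the first entry.
most_common = purpose.mode()[0]

# Replace every missing entry with the most frequent category.
filled = purpose.fillna(most_common)

print(filled.value_counts())
```

Note that this inflates the count of the most frequent category, which is worth keeping in mind when the share of missing values is large.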

This is what the final dataset looks like.

In [None]:
data

# Saving the Dataset

The cleaned dataset is saved with the `to_csv()` method of the pandas dataframe.

In [None]:
# saving the dataset
data.to_csv('Uber_Drives_Clean.csv', index=False)
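
A CSV file does not preserve column types, so date columns come back as plain strings unless they are parsed again on load. The sketch below shows a round trip through an in-memory buffer instead of a file; the frame and its values are made up for illustration.

```python
import io
import pandas as pd

# Small illustrative frame with a date column and a numeric column.
clean = pd.DataFrame({
    "Start Date": [pd.Timestamp("2016-01-01").date()],
    "Duration": [6.0],
})

# Write to an in-memory buffer; index=False leaves out the row labels.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the date column as datetime64 on reload.
restored = pd.read_csv(buffer, parse_dates=["Start Date"])

print(restored.dtypes)
```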