## Step 1: Imports

In [23]:
import pandas as pd
import plotly.express as px
import sqlalchemy

### Step 1b: Import the data

In [24]:
df = pd.read_csv('UberDataset.csv')
df.head()

|   | START_DATE | END_DATE | CATEGORY | START | STOP | MILES | PURPOSE |
|---|---|---|---|---|---|---|---|
| 0 | 01-01-2016 21:11 | 01-01-2016 21:17 | Business | Fort Pierce | Fort Pierce | 5.1 | Meal/Entertain |
| 1 | 01-02-2016 01:25 | 01-02-2016 01:37 | Business | Fort Pierce | Fort Pierce | 5.0 | NaN |
| 2 | 01-02-2016 20:25 | 01-02-2016 20:38 | Business | Fort Pierce | Fort Pierce | 4.8 | Errand/Supplies |
| 3 | 01-05-2016 17:31 | 01-05-2016 17:45 | Business | Fort Pierce | Fort Pierce | 4.7 | Meeting |
| 4 | 01-06-2016 14:42 | 01-06-2016 15:49 | Business | Fort Pierce | West Palm Beach | 63.7 | Customer Visit |


## Step 2: Tidy Data

Following PEP 8 naming style, column headers should be snake_case: lowercase, with spaces replaced by underscores.

In [25]:
df.columns = df.columns.str.lower().str.strip().str.replace(' ', '_')
df.columns

Index(['start_date', 'end_date', 'category', 'start', 'stop', 'miles',
       'purpose'],
      dtype='object')
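
The headers in this dataset contain no spaces, so the cell above only lowercases them. As a minimal sketch of what the same chain does to messier headers, the toy frame below (its column names are made up for illustration) includes padding and an embedded space:

```python
import pandas as pd

# Hypothetical headers with surrounding whitespace and an embedded space.
demo = pd.DataFrame(columns=[" Start Date ", "MILES"])

# Same chain as above: lowercase, trim the ends, then replace inner spaces.
demo.columns = demo.columns.str.lower().str.strip().str.replace(" ", "_")

print(list(demo.columns))
```

Stripping before replacing matters here: otherwise the padding would turn into leading and trailing underscores.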

## Step 3: EDA (Exploratory Data Analysis)

In [26]:
df.describe()

|   | miles |
|---|---|
| count | 1156.0 |
| mean | 21.115398 |
| std | 359.299007 |
| min | 0.5 |
| 25% | 2.9 |
| 50% | 6.0 |
| 75% | 10.4 |
| max | 12204.7 |


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1156 entries, 0 to 1155
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   start_date  1156 non-null   object 
 1   end_date    1155 non-null   object 
 2   category    1155 non-null   object 
 3   start       1155 non-null   object 
 4   stop        1155 non-null   object 
 5   miles       1156 non-null   float64
 6   purpose     653 non-null    object 
dtypes: float64(1), object(6)
memory usage: 63.3+ KB


In [28]:
df.shape

(1156, 7)

In [29]:
df['start_date'].value_counts()

start_date
6/28/2016 23:34     2
01-01-2016 21:11    1
9/27/2016 21:01     1
9/27/2016 13:21     1
9/27/2016 8:33      1
                   ..
5/27/2016 20:47     1
5/27/2016 20:26     1
5/23/2016 21:09     1
5/23/2016 20:19     1
Totals              1
Name: count, Length: 1155, dtype: int64

In [30]:
# Row 1155 is the 'Totals' summary row seen above, not a trip, so drop it.
df = df.drop(1155)
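
The `Totals` row is most likely also the source of the 12204.7 maximum in the `describe()` output, since that value is consistent with the sum of all trip miles. Dropping it by a hard-coded index works for this file, but a value-based filter would keep working if the row count changed. A minimal sketch on a toy frame (the data below is invented for illustration):

```python
import pandas as pd

# Toy frame mimicking the raw file: the last row is a summary, not a trip.
raw = pd.DataFrame({
    "start_date": ["01-01-2016 21:11", "6/28/2016 23:34", "Totals"],
    "miles": [5.1, 4.8, 9.9],
})

# Flag rows whose start_date is the literal string 'Totals' and keep the rest.
is_total = raw["start_date"].eq("Totals")
trips = raw.loc[~is_total].reset_index(drop=True)

print(trips)
```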

In [31]:
df['start_date'] = pd.to_datetime(df['start_date'], format='mixed')

In [32]:
df['end_date'] = pd.to_datetime(df['end_date'], format='mixed')
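
The `value_counts()` output above shows two date styles in the same column (`01-01-2016 21:11` and `6/28/2016 23:34`), which is why `format='mixed'` is used. A small sketch of that behavior, assuming pandas 2.0 or later (where `format='mixed'` is available); the sample strings are taken from the output above:

```python
import pandas as pd

# Two of the date styles that appear in the start_date column.
stamps = pd.Series(["01-01-2016 21:11", "6/28/2016 23:34"])

# format='mixed' infers the format of each element individually.
parsed = pd.to_datetime(stamps, format="mixed")
print(parsed)

# A non-date string such as 'Totals' would raise by default;
# errors='coerce' turns it into NaT instead.
bad = pd.to_datetime(pd.Series(["Totals"]), format="mixed", errors="coerce")
print(bad)
```

Both sample strings are read month-first, which is the pandas default when `dayfirst` is not set.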

In [33]:
df['purpose'].value_counts()

purpose
Meeting            187
Meal/Entertain     160
Errand/Supplies    128
Customer Visit     101
Temporary Site      50
Between Offices     18
Moving               4
Airport/Travel       3
Charity ($)          1
Commute              1
Name: count, dtype: int64

### Fill our NA Values:

In [34]:
df.isna().sum()

start_date      0
end_date        0
category        0
start           0
stop            0
miles           0
purpose       502
dtype: int64

In [35]:
df['purpose'] = df['purpose'].fillna('Unavailable')


In [36]:
df.isna().sum()

start_date    0
end_date      0
category      0
start         0
stop          0
miles         0
purpose       0
dtype: int64

### Create Visualizations:

In [37]:
# Plot a histogram for every non-object (numeric or datetime) column.
for col in df.columns:
    if df[col].dtype != 'O':
        display(px.histogram(df, x=col))


The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result
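
The loop above picks columns by testing `dtype != 'O'`. One alternative, sketched here on a toy frame with invented values, is to let `select_dtypes` return the numeric and datetime columns explicitly. This is a slightly different rule (it names the wanted types rather than excluding object), so treat it as an option rather than a drop-in equivalent:

```python
import pandas as pd

# Toy frame with one datetime, one text, and one numeric column.
df_demo = pd.DataFrame({
    "start_date": pd.to_datetime(["2016-01-01 21:11", "2016-01-02 01:25"]),
    "category": ["Business", "Business"],
    "miles": [5.1, 5.0],
})

# Select numeric and datetime columns; the original column order is preserved.
plot_cols = df_demo.select_dtypes(include=["number", "datetime"]).columns.tolist()
print(plot_cols)
```

The resulting list could then drive the plotting loop in place of the per-column dtype check.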







In [38]:
# For every non-object column, draw a bar chart by purpose, colored by purpose.
for col in df.columns:
    if df[col].dtype != 'O':
        display(px.bar(df, x='purpose', y=col, color='purpose'))
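
Passing the raw frame to `px.bar` stacks one segment per trip within each purpose. If the goal is total miles per purpose, aggregating first gives one bar per category. A minimal sketch with pandas only, using invented sample rows:

```python
import pandas as pd

# Toy trips; 'Unavailable' stands in for the filled-in missing purposes.
trips = pd.DataFrame({
    "purpose": ["Meeting", "Meeting", "Customer Visit", "Unavailable"],
    "miles": [4.7, 10.3, 63.7, 5.0],
})

# Sum miles per purpose and sort so the largest total comes first.
miles_by_purpose = (
    trips.groupby("purpose", as_index=False)["miles"]
    .sum()
    .sort_values("miles", ascending=False)
)
print(miles_by_purpose)
```

The aggregated frame could then be plotted with something like `px.bar(miles_by_purpose, x='purpose', y='miles')`.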





In [None]:
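import os  # UBER_DB_PASSWORD is an illustrative variable name; avoid hard-coding credentials
con_str = f"mysql+mysqldb://bonfire:{os.environ['UBER_DB_PASSWORD']}@localhost/bonfire129"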

my_con = sqlalchemy.create_engine(con_str)
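
The engine is created here but not used in the cells shown; the cleaned data is only written to CSV. As a rough sketch of how a DataFrame could be loaded into a database, the example below uses `DataFrame.to_sql` with an in-memory SQLite connection instead of MySQL so that it runs without a server. The table name `uber_trips` and the sample rows are assumptions made for illustration:

```python
import sqlite3

import pandas as pd

# Toy stand-in for the cleaned trips frame.
trips = pd.DataFrame({
    "purpose": ["Meeting", "Customer Visit"],
    "miles": [4.7, 63.7],
})

# In-memory SQLite database; pandas accepts a sqlite3 connection directly.
conn = sqlite3.connect(":memory:")

# Write the frame to a table, replacing it if it already exists.
trips.to_sql("uber_trips", conn, index=False, if_exists="replace")

# Read back a row count to confirm the load.
row_count = pd.read_sql("SELECT COUNT(*) AS n FROM uber_trips", conn)["n"].iloc[0]
print(row_count)

conn.close()
```

With a SQLAlchemy engine such as `my_con`, the same `to_sql` call would take the engine in place of `conn`.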

In [40]:
df.to_csv('uber_data.csv', index=False)