**UBER Rides Dataset 2016 ANALYSIS**

**Introduction**

*In this notebook we analyze the "My Uber Rides 2016" dataset and look for hidden relationships among time (date), miles, purpose, start and end locations, and category. After that, we will predict the travel miles from the starting time and place using machine learning.*

In [1]:
# import the relevant Python libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt 
import seaborn as sns 
import datetime as dt
%matplotlib inline 
import warnings
warnings.filterwarnings('ignore')
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.


In [None]:
# load dataset into Pandas
mydata = pd.read_csv('../input/My Uber Drives - 2016.csv')

**Firstly**, let's have a look at this dataset.

In [None]:
mydata.info()

In [None]:
mydata.head()

In [None]:
mydata.tail()

In [None]:
mydata.isnull().sum()

*We can see that this dataset has seven columns (START_DATE*, END_DATE*, CATEGORY*, START*, STOP*, MILES*, PURPOSE*) and 1156 rows in total. Another thing we should notice is that there are a lot of missing values in PURPOSE*. If we want to get a good result, we need to fill in the missing data.*

*We should also notice that the last row is "Totals", which is not usable data, so we can delete this row.*

In [None]:
# Copy a dataset
datacopy = mydata.copy()

In [None]:
# delete the last row (the "Totals" summary row)
datacopy = datacopy.drop(datacopy.index[-1])
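
Dropping the summary row by position assumes it is always the last row of the file. A minimal sketch of a more defensive alternative, run on a small made-up frame (the values below are invented for illustration) and assuming the summary row is the one whose START_DATE* value reads "Totals":

```python
import pandas as pd

# Toy stand-in for the real file; the values are invented for illustration.
toy = pd.DataFrame({
    "START_DATE*": ["1/1/2016 21:11", "1/2/2016 1:25", "Totals"],
    "MILES*": [5.1, 5.0, 10.1],
})

# Keep every row except the summary row, wherever it appears in the file.
trips = toy[toy["START_DATE*"] != "Totals"].reset_index(drop=True)
print(len(trips))
```

Filtering by value keeps working if rows are appended to or removed from the file.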

**Secondly**, *let's fill the missing values and cleanse the data.*

*In order to fill the missing values, let's look at the relationships between PURPOSE*, time, and MILES*.*

In [None]:
# Change 'START_DATE*','END_DATE*' to time format
datacopy['START_DATE*'] = pd.to_datetime(datacopy['START_DATE*'])
datacopy['END_DATE*'] = pd.to_datetime(datacopy['END_DATE*'])

In [None]:
# Extract 'Hour', 'Month', 'Day of Week', 'Date' from 'START_DATE*' with the .dt accessor
datacopy['Hour'] = datacopy['START_DATE*'].dt.hour
datacopy['Month'] = datacopy['START_DATE*'].dt.month
datacopy['Day of Week'] = datacopy['START_DATE*'].dt.dayofweek
datacopy['Date'] = datacopy['START_DATE*'].dt.date
datacopy.head()

In [None]:
# Convert 'Day of Week' from numbers to readable text
daymap ={0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
datacopy['Day of Week'] = datacopy['Day of Week'].map(daymap)
datacopy.head()
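
The number-to-name mapping can also be obtained directly from the timestamps. A small sketch on made-up start times (the three timestamps are illustrative, not taken from the dataset), using the pandas `.dt` accessor and `day_name()`:

```python
import pandas as pd

# Three invented start times: a Friday, a Saturday and a Monday.
starts = pd.Series(pd.to_datetime(
    ["2016-01-01 21:11", "2016-01-02 01:25", "2016-01-04 08:00"]
))

features = pd.DataFrame({
    "Hour": starts.dt.hour,
    "Month": starts.dt.month,
    # day_name() returns 'Friday', 'Saturday', ...; keep the first three letters.
    "Day of Week": starts.dt.day_name().str[:3],
})
print(features)
```

This produces the same 'Mon' to 'Sun' labels as the dictionary mapping, without an intermediate numeric column.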

In [None]:
# Look for a hidden relationship between the missing values and 'Day of Week'
plt.figure(figsize=(20,8))
sns.countplot(x='Day of Week',data = datacopy,hue = 'PURPOSE*')
plt.legend(bbox_to_anchor = (1.05,1),loc=2,borderaxespad=0.)

In [None]:
# Look for a hidden relationship between the missing values and the 'Hour' of the day
plt.figure(figsize=(20,8))
sns.countplot(x='Hour',data = datacopy,hue = 'PURPOSE*')
plt.legend(bbox_to_anchor = (1.05,1),loc=2,borderaxespad=0.)

In [None]:
datacopy.head()

*From the analysis above, we can see a correlation between the missing values and the time of day. Now let's fill in the missing values.*

In [None]:
datacopy['Hour'].unique()

In [None]:
# Fill the missing PURPOSE* values according to the hour the trip started
missing = datacopy['PURPOSE*'].isnull()
evening = datacopy['Hour'].between(15, 21)
datacopy.loc[missing & evening, 'PURPOSE*'] = 'Meal/Entertain'  # hours 15 to 21
datacopy.loc[missing & ~evening, 'PURPOSE*'] = 'Meeting'        # hours 22 to 14


In [None]:
datacopy.isnull().sum()
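
The fill above assigns a fixed purpose to each band of hours. One data-driven alternative, sketched here on a made-up frame, takes the most frequent known purpose for each hour instead. This is a swapped-in technique for comparison, not the rule used in this notebook, and the toy values are invented.

```python
import pandas as pd

# Invented sample: two hours, each with one missing purpose.
toy = pd.DataFrame({
    "Hour": [9, 9, 9, 18, 18, 18],
    "PURPOSE*": ["Meeting", "Meeting", None,
                 "Meal/Entertain", None, "Meal/Entertain"],
})

# Most frequent known purpose for each hour.
mode_by_hour = (
    toy.dropna(subset=["PURPOSE*"])
       .groupby("Hour")["PURPOSE*"]
       .agg(lambda s: s.mode().iloc[0])
)

# Fill only the missing entries, looking the value up by the row's hour.
toy["PURPOSE*"] = toy["PURPOSE*"].fillna(toy["Hour"].map(mode_by_hour))
print(toy)
```

An hour with no known purpose at all would stay missing under this approach, so a fallback value may still be needed.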

**Then**, *let's look for the hidden patterns.*

*First we will check each column separately, and then look for correlations between them.*

*1.  MILES*

*We can divide the MILES* data into 5 sets ("<=5", "5-10", "10-15", "15-20", ">20") according to distance, and then find the travel frequency of each set.*

In [None]:
ml_dis=datacopy["MILES*"]
ml_range_lst=["<=5","5-10","10-15","15-20",">20"]
def autolabel(rects):
    for rect in rects:
        height = rect.get_height()
        plt.text(rect.get_x()+rect.get_width()/2., 1.03*height, '%s' % int(height))
ml_dic=dict()
for item in ml_range_lst:
    ml_dic[item]=0
for mile in ml_dis.values:
    if mile<=5:
        ml_dic["<=5"]+=1
    elif mile<=10:
        ml_dic["5-10"]+=1
    elif mile<=15:
        ml_dic["10-15"]+=1
    elif mile<=20:
        ml_dic["15-20"]+=1
    else:
        ml_dic[">20"]+=1
ml_dis=pd.Series(ml_dic)
ml_dis.sort_values(inplace=True,ascending=False)
print("Miles:\n",ml_dis)
#figure
rects=plt.bar(range(1,len(ml_dis.index)+1),ml_dis.values)
plt.title("Miles")
plt.xlabel("Miles")
plt.ylabel("Quantity")
plt.xticks(range(1,len(ml_dis.index)+1),ml_dis.index)
plt.grid()
autolabel(rects)
plt.savefig("./ml_dis_fig")

*We can see a decreasing trend. The largest count is 502 (<=5 miles), followed by 338 (5-10 miles) and 161 (10-15 miles). Most trips are shorter than 15 miles.*
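
The same five distance bands can be produced without an explicit loop. A minimal sketch using `pd.cut` on a handful of invented mileages (not values from the dataset):

```python
import pandas as pd

# Invented mileages, including values that sit exactly on a band edge.
miles = pd.Series([0.5, 5.0, 7.2, 10.0, 12.3, 18.0, 25.0, 201.0])

bins = [0, 5, 10, 15, 20, float("inf")]
labels = ["<=5", "5-10", "10-15", "15-20", ">20"]

# By default each band includes its upper edge, matching the <= comparisons;
# include_lowest=True also lets a distance of exactly 0 fall into the first band.
banded = pd.cut(miles, bins=bins, labels=labels, include_lowest=True)

# Count per band, reported in band order rather than by frequency.
counts = banded.value_counts().reindex(labels)
print(counts)
```

Keeping the counts in band order, rather than sorted by size, makes a decreasing trend easier to read off a bar chart.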

*2.  PURPOSE:*


In [None]:
datacopy['PURPOSE*'].value_counts()

In [None]:
plt.figure(figsize=(15,8))
sns.countplot(x='PURPOSE*', data=datacopy)

In [None]:
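# Combine 'Charity ($)', 'Commute', 'Moving' and 'Airport/Travel' into 'Others'
# Note: dp is another name for datacopy, not a copy, so datacopy changes as well
dp = datacopy
dp['PURPOSE*'] = dp['PURPOSE*'].replace(['Charity ($)', 'Commute', 'Moving', 'Airport/Travel'], 'Others')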

In [None]:
plt.figure(figsize=(12,12))
dp['PURPOSE*'].value_counts()[:11].plot(kind='pie',autopct='%1.1f%%',shadow=True,legend = True)
plt.show()

*From the chart above we can see that 'Meeting' and 'Meal/Entertain' account for more than 73%, followed by 'Errand/Supplies' (11.1%) and 'Customer Visit' (8.7%).*
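
The percentages shown on the pie chart can also be computed as numbers. A short sketch with `value_counts(normalize=True)` on an invented list of purposes (the counts below are made up and do not reproduce the dataset's shares):

```python
import pandas as pd

# Invented sample of ten trips.
purpose = pd.Series(
    ["Meeting"] * 5 + ["Meal/Entertain"] * 3 + ["Errand/Supplies"] * 2
)

# normalize=True returns fractions of the total; convert them to percentages.
share = purpose.value_counts(normalize=True).mul(100).round(1)
print(share)
```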

*3.  CATEGORY*

*Let's see what happens in CATEGORY*.*

In [None]:
datacopy['CATEGORY*'].value_counts()

In [None]:
#plot 
plt.figure(figsize=(15,5))
sns.countplot(x='CATEGORY*', data=datacopy)

We can see that the number of "business" trips is far larger than the number of "personal" trips: 1078 versus 77.


*4.  Start_Date & End_Date*

*Monthly:*

In [None]:
# Sum only the MILES* column; summing the whole frame would also hit date and text columns
per_month = datacopy.groupby('Month')['MILES*'].sum()
plt.figure(figsize=(20,8))
sns.barplot(x='Month',y='MILES*',data=per_month.reset_index())

*December has the largest number, 146. In contrast, September has the lowest, 36. January, April, and May are also a little lower than the other months.*

*5.  Start_Time & End_Time*

*Hourly*

In [None]:
# Sum only the MILES* column; summing the whole frame would also hit date and text columns
ByHour = datacopy.groupby('Hour')['MILES*'].sum()
plt.figure(figsize=(20,8))
sns.barplot(x='Hour',y='MILES*',data=ByHour.reset_index())

**Next**, *let's look for associations between them!*

**1.   The relationship between Purpose and Miles.**

In [None]:
Pur_Mil = datacopy.groupby('PURPOSE*')['MILES*'].sum()
Pur_Mil

In [None]:

plt.figure(figsize=(20,8))
sns.barplot(x='PURPOSE*',y='MILES*',data=Pur_Mil.reset_index())

In [None]:
CAT_Mil_Mean = datacopy.groupby('PURPOSE*').mean(numeric_only=True)
CAT_Mil_Mean

In [None]:
plt.figure(figsize=(15,10))
CAT_Mil_Mean['PURPOSE*']=CAT_Mil_Mean.index.tolist()
ax = sns.barplot(x='MILES*',y='PURPOSE*',data=CAT_Mil_Mean ,order=CAT_Mil_Mean.sort_values('MILES*',ascending=False)['PURPOSE*'].tolist())
ax.set(xlabel='Average Miles', ylabel='Purpose')
plt.show()
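
Total and average miles per purpose can be put side by side in one table, which makes it easier to tell frequent short trips from rare long ones. A minimal sketch on invented rows (the mileages below are made up):

```python
import pandas as pd

# Invented trips: purpose and distance.
toy = pd.DataFrame({
    "PURPOSE*": ["Meeting", "Meeting", "Customer Visit",
                 "Meal/Entertain", "Meal/Entertain"],
    "MILES*": [10.0, 20.0, 30.0, 4.0, 6.0],
})

# One row per purpose with total miles, mean miles and number of trips,
# ordered by mean miles like the bar chart above.
summary = (
    toy.groupby("PURPOSE*")["MILES*"]
       .agg(["sum", "mean", "count"])
       .sort_values("mean", ascending=False)
)
print(summary)
```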

**2.   The relationship between Start Date and Miles.**

A.  Total Miles Per Month

In [None]:
MilPurMon = datacopy.groupby('Month')['MILES*'].sum()

plt.figure(figsize=(20,8))
sns.barplot(x='Month',y='MILES*',data=MilPurMon.reset_index())
plt.tight_layout()

In [None]:
MilPurMon = datacopy.groupby('Month').count()['MILES*'].plot()

In [None]:
#Month purpose regression
sns.lmplot(x='Month',y='PURPOSE*',data=datacopy.groupby('Month').count().reset_index())

In [None]:
#Heatmap
dayHour = datacopy.groupby(by=['Day of Week','Hour']).count()['PURPOSE*'].unstack()
plt.figure(figsize=(20,12))
sns.heatmap(dayHour,cmap='coolwarm',linecolor='white',linewidths=1)
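
Because 'Day of Week' holds text labels, the grouped table is sorted alphabetically (Fri, Mon, Sat, ...) rather than in calendar order. A sketch of building the same day-by-hour count table with the rows in weekday order, on an invented sample:

```python
import pandas as pd

# Invented trips: weekday label and start hour.
toy = pd.DataFrame({
    "Day of Week": ["Mon", "Mon", "Wed", "Sun", "Sun", "Sun"],
    "Hour": [9, 9, 14, 9, 20, 20],
})

day_order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Count trips per (day, hour), spread the hours into columns, then put the
# rows in calendar order; days or hours without trips get a count of 0.
table = (
    toy.groupby(["Day of Week", "Hour"]).size()
       .unstack(fill_value=0)
       .reindex(day_order, fill_value=0)
)
print(table)
```

A table shaped like this can be passed to `sns.heatmap` in place of the alphabetically ordered one.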

**3.   The relationship between CATEGORY and Miles.**

In [None]:
CAT_Mil_SUM = datacopy.groupby('CATEGORY*')['MILES*'].sum()
plt.figure(figsize=(10,8))
sns.barplot(x='CATEGORY*',y='MILES*',data=CAT_Mil_SUM.reset_index())
plt.tight_layout()

**In the end**, *let's have a look at "velocity"!*

In [None]:
datacopy["END_DATE*"]=pd.to_datetime(datacopy["END_DATE*"],format="%m/%d/%Y %H:%M")
# speed in miles per minute; total_seconds() also counts whole days, unlike .seconds
speed=datacopy["MILES*"]/((datacopy["END_DATE*"]-datacopy["START_DATE*"]).dt.total_seconds()/60)
#print(speed)

In [None]:
datacopy["SPEED*"]=speed
datacopy["START_HOUR*"]=datacopy["START_DATE*"].dt.hour
spd_df=datacopy[datacopy["SPEED*"]!=np.inf].groupby(["START_HOUR*"])["SPEED*"].mean()
datacopy.head()

In [None]:
plt.figure(figsize=(20,8))
sns.barplot(x="START_HOUR*",y="SPEED*",data=spd_df.reset_index())
plt.title("Speed")
plt.xlabel("Time(Hour)")
plt.ylabel("Speed[Mile(s)/min]")
plt.xticks(spd_df.index)
plt.grid()
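
Trips whose start and end time are identical have zero duration, which is why infinite speeds have to be filtered out above. A sketch of one way to avoid them at the source, on invented trips, with the speed expressed in miles per hour (a unit chosen here for readability; the notebook itself uses miles per minute):

```python
import numpy as np
import pandas as pd

# Invented trips; the last one has identical start and end times.
toy = pd.DataFrame({
    "START_DATE*": pd.to_datetime(
        ["2016-01-01 21:11", "2016-01-02 01:25", "2016-01-02 20:25"]),
    "END_DATE*": pd.to_datetime(
        ["2016-01-01 21:17", "2016-01-02 01:37", "2016-01-02 20:25"]),
    "MILES*": [5.1, 6.0, 1.0],
})

hours = (toy["END_DATE*"] - toy["START_DATE*"]).dt.total_seconds() / 3600

# where() turns zero durations into NaN, so the division never yields inf.
toy["MPH"] = toy["MILES*"] / hours.where(hours > 0)
print(toy[["MILES*", "MPH"]])
```

NaN values are skipped by `mean()`, so a later `groupby(...).mean()` needs no extra filter.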

**Conclusion:**

From our analysis, we can see that this person travels far more for business than for personal reasons, and that the car speed at 2:00 AM is much higher than at other times of day. We also know that this person travelled the longest distances in March and October. 'Commute' has the largest number among all the travel purposes. From the 'Hour' diagram, we can see that most of the trips happened after 9:00 AM. 'Meal/Entertain' and 'Meeting' account for more than 73% of trips.