# DATA 606 Project - Analysis of Uber & Lyft ride details


* We've all experienced those moments when we open the Uber or Lyft app and wonder why the ride prices seem to be all over the place. Well, it turns out that it's not just random. Those prices can change because of a lot of factors, and one of those factors is the weather.

* Think about it - when it's pouring rain, everyone wants to get a ride rather than getting soaked. Or on a super hot day, you might prefer to avoid walking in the scorching sun. These weather conditions can make ride-hailing services like Uber and Lyft busier, and when there's high demand, they often increase their prices. That's what they call "surge pricing."

* But there's more to it than just that. Different types of weather can also affect how long your ride takes or the route your driver takes. All of these things can add up and influence how much you pay for your ride.

* So, here's the deal: we've got a bunch of data that tells us all about rides, prices, and the weather. We're going to dig into this data to find out how the weather and ride prices are connected.
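To make the surge-pricing idea above concrete, here is a minimal arithmetic sketch. It is not how Uber or Lyft actually compute fares; the function name `surged_price` and the base price are made up for illustration. The only link to the data is that the dataset used later has a `surge_multiplier` column.

```python
def surged_price(base_price: float, surge_multiplier: float = 1.0) -> float:
    """Scale a hypothetical base price by a surge multiplier."""
    if surge_multiplier < 1.0:
        # Assumption for this sketch: a multiplier never discounts the ride.
        raise ValueError("surge_multiplier is assumed to be at least 1.0")
    return round(base_price * surge_multiplier, 2)


# A hypothetical 10.00 ride on a normal day and during a rainy rush.
normal_day = surged_price(10.0)
rainy_rush = surged_price(10.0, surge_multiplier=1.5)
print(normal_day, rainy_rush)
```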

### This analysis is a level-one look at dynamic pricing, with weather as the first parameter. I think of this model as a feedback mechanism for dynamic pricing, and understanding such models should help broaden our view of real-world scenarios.

## About the data



* For this analysis, I am taking a dataset from Kaggle. This is a very beginner-friendly dataset. It does contain a lot of NA values. It is a good dataset if you want to use machine learning models and understand real-world data.

* It provides some important trip details and the weather at the time of each trip. I believe the weather affects trip prices. You can learn more about each column in the notes further on. I consider this a level-one analysis because it uses only weather data; other factors, such as driver ratings, user ratings, and rising gas prices, also affect the price.

* You can find the dataset here - https://www.kaggle.com/datasets/brllrb/uber-and-lyft-dataset-boston-ma

# Analysing Uber and Lyft ride costs and visualizations


* I am reading the file from Google Drive; the dataset is around 300 MB. Reading the file this way is convenient for me because I don't need to upload it every time.
* If you are reading the file from local storage, change the file path in the `read_csv` call to match your own location, and comment out the Drive-mounting lines (`from google.colab import drive` and `drive.mount('/content/drive')`).
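If you switch between Colab and a local machine often, one option is to pick the path automatically. The sketch below is only an assumption about how that could look: the helper names `running_in_colab` and `resolve_csv_path` and the local path `rideshare_kaggle.csv` are hypothetical, so adjust them to your own setup.

```python
import importlib.util
from pathlib import Path

# Path used in this notebook on Colab, plus a hypothetical local location.
COLAB_CSV = "/content/drive/MyDrive/rideshare_kaggle.csv"
LOCAL_CSV = "rideshare_kaggle.csv"


def running_in_colab() -> bool:
    """Return True when the google.colab package can be found."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # The parent 'google' package is missing or has no usable spec.
        return False


def resolve_csv_path() -> Path:
    """Mount Drive on Colab and return the CSV path for the current setup."""
    if running_in_colab():
        from google.colab import drive

        drive.mount("/content/drive")
        return Path(COLAB_CSV)
    return Path(LOCAL_CSV)
```

With this in place, the read step could become `df = pd.read_csv(resolve_csv_path())` in both environments.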

In [None]:
# For accessing file from drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Importing useful libraries
import pandas as pd
import numpy as np

In [None]:
# Reading the data as a pandas DataFrame
df = pd.read_csv("/content/drive/MyDrive/rideshare_kaggle.csv")

In [None]:
# Sample of data
df.head()

| | id | timestamp | hour | day | month | datetime | timezone | source | destination | cab_type | ... | precipIntensityMax | uvIndexTime | temperatureMin | temperatureMinTime | temperatureMax | temperatureMaxTime | apparentTemperatureMin | apparentTemperatureMinTime | apparentTemperatureMax | apparentTemperatureMaxTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 424553bb-7174-41ea-aeb4-fe06d4f4b9d7 | 1544953000.0 | 9 | 16 | 12 | 2018-12-16 09:30:07 | America/New_York | Haymarket Square | North Station | Lyft | ... | 0.1276 | 1544979600 | 39.89 | 1545012000 | 43.68 | 1544968800 | 33.73 | 1545012000 | 38.07 | 1544958000 |
| 1 | 4bd23055-6827-41c6-b23b-3c491f24e74d | 1543284000.0 | 2 | 27 | 11 | 2018-11-27 02:00:23 | America/New_York | Haymarket Square | North Station | Lyft | ... | 0.13 | 1543251600 | 40.49 | 1543233600 | 47.3 | 1543251600 | 36.2 | 1543291200 | 43.92 | 1543251600 |
| 2 | 981a3613-77af-4620-a42a-0c0866077d1e | 1543367000.0 | 1 | 28 | 11 | 2018-11-28 01:00:22 | America/New_York | Haymarket Square | North Station | Lyft | ... | 0.1064 | 1543338000 | 35.36 | 1543377600 | 47.55 | 1543320000 | 31.04 | 1543377600 | 44.12 | 1543320000 |
| 3 | c2d88af2-d278-4bfd-a8d0-29ca77cc5512 | 1543554000.0 | 4 | 30 | 11 | 2018-11-30 04:53:02 | America/New_York | Haymarket Square | North Station | Lyft | ... | 0.0 | 1543507200 | 34.67 | 1543550400 | 45.03 | 1543510800 | 30.3 | 1543550400 | 38.53 | 1543510800 |
| 4 | e0126e1f-8ca9-4f2e-82b3-50505a09db9a | 1543463000.0 | 3 | 29 | 11 | 2018-11-29 03:49:20 | America/New_York | Haymarket Square | North Station | Lyft | ... | 0.0001 | 1543420800 | 33.1 | 1543402800 | 42.18 | 1543420800 | 29.11 | 1543392000 | 35.75 | 1543420800 |


In [None]:
# Observing null values, data types, column names, and total entries
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 693071 entries, 0 to 693070
Data columns (total 57 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           693071 non-null  object 
 1   timestamp                    693071 non-null  float64
 2   hour                         693071 non-null  int64  
 3   day                          693071 non-null  int64  
 4   month                        693071 non-null  int64  
 5   datetime                     693071 non-null  object 
 6   timezone                     693071 non-null  object 
 7   source                       693071 non-null  object 
 8   destination                  693071 non-null  object 
 9   cab_type                     693071 non-null  object 
 10  product_id                   693071 non-null  object 
 11  name                         693071 non-null  object 
 12  price                        637976 non-null  float64
 13 
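In the `df.info()` output above, `price` has 637976 non-null values out of 693071 rows, so 55095 prices are missing. A quick way to list such columns is sketched below. It runs on a tiny made-up frame (`toy`, with illustrative values only) so it works on its own; on the real data you would swap `toy` for `df`.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the ride data; the values are illustrative only.
toy = pd.DataFrame(
    {
        "cab_type": ["Lyft", "Uber", "Uber", "Lyft"],
        "price": [11.0, np.nan, 9.5, np.nan],
        "distance": [0.44, 1.10, 2.30, 0.90],
    }
)

# Count missing values per column and express them as a share of all rows.
summary = pd.DataFrame(
    {
        "missing": toy.isna().sum(),
        "share": toy.isna().mean(),
    }
)

# Keep only the columns that actually have missing values.
summary = summary[summary["missing"] > 0]
print(summary)
```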

In [None]:
# Observing the data, checking minimum, maximum values, and average values.
df.describe()

| | timestamp | hour | day | month | price | distance | surge_multiplier | latitude | longitude | temperature | ... | precipIntensityMax | uvIndexTime | temperatureMin | temperatureMinTime | temperatureMax | temperatureMaxTime | apparentTemperatureMin | apparentTemperatureMinTime | apparentTemperatureMax | apparentTemperatureMaxTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 693071.0 | 693071.0 | 693071.0 | 693071.0 | 637976.0 | 693071.0 | 693071.0 | 693071.0 | 693071.0 | 693071.0 | ... | 693071.0 | 693071.0 | 693071.0 | 693071.0 | 693071.0 | 693071.0 | 693071.0 | 693071.0 | 693071.0 | 693071.0 |
| mean | 1544046000.0 | 11.619137 | 17.794365 | 11.586684 | 16.545125 | 2.18943 | 1.01387 | 42.338172 | -71.066151 | 39.584388 | ... | 0.037374 | 1544044000.0 | 33.457774 | 1544042000.0 | 45.261313 | 1544047000.0 | 29.731002 | 1544048000.0 | 41.997343 | 1544048000.0 |
| std | 689192.5 | 6.948114 | 9.982286 | 0.492429 | 9.324359 | 1.138937 | 0.091641 | 0.04784 | 0.020302 | 6.726084 | ... | 0.055214 | 691202.8 | 6.467224 | 690195.4 | 5.645046 | 690135.3 | 7.110494 | 687186.2 | 6.936841 | 691077.7 |
| min | 1543204000.0 | 0.0 | 1.0 | 11.0 | 2.5 | 0.02 | 1.0 | 42.2148 | -71.1054 | 18.91 | ... | 0.0 | 1543162000.0 | 15.63 | 1543122000.0 | 33.51 | 1543154000.0 | 11.81 | 1543136000.0 | 28.95 | 1543187000.0 |
| 25% | 1543444000.0 | 6.0 | 13.0 | 11.0 | 9.0 | 1.28 | 1.0 | 42.3503 | -71.081 | 36.45 | ... | 0.0 | 1543421000.0 | 30.17 | 1543399000.0 | 42.57 | 1543439000.0 | 27.76 | 1543399000.0 | 36.57 | 1543439000.0 |
| 50% | 1543737000.0 | 12.0 | 17.0 | 12.0 | 13.5 | 2.16 | 1.0 | 42.3519 | -71.0631 | 40.49 | ... | 0.0004 | 1543770000.0 | 34.24 | 1543727000.0 | 44.68 | 1543788000.0 | 30.13 | 1543745000.0 | 40.95 | 1543788000.0 |
| 75% | 1544828000.0 | 18.0 | 28.0 | 12.0 | 22.5 | 2.92 | 1.0 | 42.3647 | -71.0542 | 43.58 | ... | 0.0916 | 1544807000.0 | 38.88 | 1544789000.0 | 46.91 | 1544814000.0 | 35.71 | 1544789000.0 | 44.12 | 1544818000.0 |
| max | 1545161000.0 | 23.0 | 30.0 | 12.0 | 97.5 | 7.86 | 3.0 | 42.3661 | -71.033 | 57.22 | ... | 0.1459 | 1545152000.0 | 43.1 | 1545192000.0 | 57.87 | 1545109000.0 | 40.05 | 1545134000.0 | 57.2 | 1545109000.0 |


In [None]:
# Listing all column names
list(df.columns)

['id',
 'timestamp',
 'hour',
 'day',
 'month',
 'datetime',
 'timezone',
 'source',
 'destination',
 'cab_type',
 'product_id',
 'name',
 'price',
 'distance',
 'surge_multiplier',
 'latitude',
 'longitude',
 'temperature',
 'apparentTemperature',
 'short_summary',
 'long_summary',
 'precipIntensity',
 'precipProbability',
 'humidity',
 'windSpeed',
 'windGust',
 'windGustTime',
 'visibility',
 'temperatureHigh',
 'temperatureHighTime',
 'temperatureLow',
 'temperatureLowTime',
 'apparentTemperatureHigh',
 'apparentTemperatureHighTime',
 'apparentTemperatureLow',
 'apparentTemperatureLowTime',
 'icon',
 'dewPoint',
 'pressure',
 'windBearing',
 'cloudCover',
 'uvIndex',
 'visibility.1',
 'ozone',
 'sunriseTime',
 'sunsetTime',
 'moonPhase',
 'precipIntensityMax',
 'uvIndexTime',
 'temperatureMin',
 'temperatureMinTime',
 'temperatureMax',
 'temperatureMaxTime',
 'apparentTemperatureMin',
 'apparentTemperatureMinTime',
 'apparentTemperatureMax',
 'apparentTemperatureMaxTime']

In [None]:
# Columns to drop: redundant timestamps, daily high/low/min/max summaries, and duplicates
x = [
    'timestamp',
    'temperatureHigh',
    'temperatureHighTime',
    'temperatureLow',
    'temperatureLowTime',
    'apparentTemperatureHigh',
    'apparentTemperatureHighTime',
    'apparentTemperatureLow',
    'apparentTemperatureLowTime',
    'windGust',
    'windGustTime',
    'icon',
    'visibility.1',
    'uvIndexTime',
    'temperatureMin',
    'temperatureMinTime',
    'temperatureMax',
    'precipIntensityMax',
    'temperatureMaxTime',
    'apparentTemperatureMin',
    'apparentTemperatureMinTime',
    'apparentTemperatureMax',
    'apparentTemperatureMaxTime',
]

In [None]:
# Dropping the columns not needed for the analysis, keeping the data simple and ready for machine learning.
df1 = df.drop(x, axis=1)
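Before dropping a column such as `visibility.1`, it can be worth confirming that it really repeats another column. The sketch below shows one way to check, using a small made-up frame (`toy`) and a hypothetical list `unwanted`. Whether `visibility.1` matches `visibility` in the real data is something to verify rather than assume.

```python
import pandas as pd

# Made-up miniature of the ride data; the values are illustrative only.
toy = pd.DataFrame(
    {
        "price": [11.0, 9.5],
        "visibility": [10.0, 4.8],
        "visibility.1": [10.0, 4.8],
        "icon": ["partly-cloudy-night", "rain"],
    }
)

# Check whether the suspected duplicate column really matches the original.
is_duplicate = toy["visibility"].equals(toy["visibility.1"])
print("visibility.1 duplicates visibility:", is_duplicate)

# 'windGustTime' is absent from the toy frame; errors="ignore" skips
# missing labels instead of raising a KeyError.
unwanted = ["visibility.1", "icon", "windGustTime"]
trimmed = toy.drop(columns=unwanted, errors="ignore")

print(trimmed.columns.tolist())
```

`drop` returns a new DataFrame, so the original frame stays untouched, just like `df` and `df1` in the cell above.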

In [None]:
# Checking the columns again.
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 693071 entries, 0 to 693070
Data columns (total 34 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   id                   693071 non-null  object 
 1   hour                 693071 non-null  int64  
 2   day                  693071 non-null  int64  
 3   month                693071 non-null  int64  
 4   datetime             693071 non-null  object 
 5   timezone             693071 non-null  object 
 6   source               693071 non-null  object 
 7   destination          693071 non-null  object 
 8   cab_type             693071 non-null  object 
 9   product_id           693071 non-null  object 
 10  name                 693071 non-null  object 
 11  price                637976 non-null  float64
 12  distance             693071 non-null  float64
 13  surge_multiplier     693071 non-null  float64
 14  latitude             693071 non-null  float64
 15  longitude        

* We can observe that the columns have different data types. For visualizations and machine learning models, we need to convert some columns to suitable types.
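As a rough idea of what those conversions could look like, here is a minimal sketch on a made-up frame (`toy`). The choice of which columns become datetime, categorical, or one-hot encoded is my assumption for illustration, not a fixed recipe for this dataset.

```python
import pandas as pd

# Made-up rows that mimic a few of the object-typed columns.
toy = pd.DataFrame(
    {
        "datetime": ["2018-12-16 09:30:07", "2018-11-27 02:00:23"],
        "cab_type": ["Lyft", "Uber"],
        "source": ["Haymarket Square", "North Station"],
        "price": [11.0, 9.5],
    }
)

converted = toy.copy()

# Parse the text timestamps into a proper datetime dtype.
converted["datetime"] = pd.to_datetime(converted["datetime"])

# Store repeated text labels as categoricals.
for col in ["cab_type", "source"]:
    converted[col] = converted[col].astype("category")

# One-hot encode a categorical column so that models can consume it.
encoded = pd.get_dummies(converted, columns=["cab_type"])

print(converted.dtypes)
print(encoded.columns.tolist())
```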

In [12]:
# Checking for duplicate rows
df1[df1.duplicated()]

Unnamed: 0,id,hour,day,month,datetime,timezone,source,destination,cab_type,product_id,...,visibility,dewPoint,pressure,windBearing,cloudCover,uvIndex,ozone,sunriseTime,sunsetTime,moonPhase
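The result above shows only the header row, which suggests there are no fully duplicated rows. If you prefer a single number over an empty table, a count works too. The sketch below uses a made-up frame (`toy`) that does contain one duplicate, so you can see both the count and the clean-up step.

```python
import pandas as pd

# Made-up frame with one fully repeated row (the second "b2" row).
toy = pd.DataFrame(
    {
        "id": ["a1", "b2", "b2", "c3"],
        "price": [11.0, 9.5, 9.5, 7.0],
    }
)

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())
print("duplicate rows:", n_duplicates)

# Keep the first occurrence of each row and rebuild a clean index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```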
