In [1]:
#import the libraries
import pandas as pd

In [2]:
#path for the dataset
path=r'C:\Users\Dell\Downloads\uber_rides_data.xlsx - sample_train.csv'

In [3]:
#load the dataset
df=pd.read_csv(path)
df.head()

Unnamed: 0,ride_id,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,24238194,7.5,2015-05-07 19:52:06 UTC,-73.999817,40.738354,-73.999512,40.723217,1
1,27835199,7.7,2009-07-17 20:04:56 UTC,-73.994355,40.728225,-73.99471,40.750325,1
2,44984355,12.9,2009-08-24 21:45:00 UTC,-74.005043,40.74077,-73.962565,40.772647,1
3,25894730,5.3,2009-06-26 08:22:21 UTC,-73.976124,40.790844,-73.965316,40.803349,3
4,17610152,16.0,2014-08-28 17:47:00 UTC,-73.925023,40.744085,-73.973082,40.761247,5


In [4]:
#shape of dataset
df.shape

(200000, 8)

In [5]:
# datatypes of columns and missing values in columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   ride_id            200000 non-null  int64  
 1   fare_amount        200000 non-null  float64
 2   pickup_datetime    200000 non-null  object 
 3   pickup_longitude   200000 non-null  float64
 4   pickup_latitude    200000 non-null  float64
 5   dropoff_longitude  199999 non-null  float64
 6   dropoff_latitude   199999 non-null  float64
 7   passenger_count    200000 non-null  int64  
dtypes: float64(5), int64(2), object(1)
memory usage: 12.2+ MB


In [6]:
# convert the pickup_datetime from object to datetime
df['pickup_datetime']=pd.to_datetime(df['pickup_datetime'])
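
The conversion above lets pandas infer the timestamp layout. A minimal sketch of the same step with an explicit format is shown below, assuming every value follows the `YYYY-MM-DD HH:MM:SS UTC` pattern seen in the first rows (this has not been checked against the full file). The `raw` Series is a hypothetical two-row stand-in for the real column.

```python
import pandas as pd

# Hypothetical stand-in for df['pickup_datetime'] before conversion
raw = pd.Series(["2015-05-07 19:52:06 UTC", "2009-07-17 20:04:56 UTC"])

# Explicit format: the trailing " UTC" is matched as literal text,
# and utc=True makes the result timezone-aware
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S UTC", utc=True)

print(parsed.dtype)
print(parsed.dt.year.tolist())
```

Stating the format makes a malformed timestamp raise an error instead of being parsed by guesswork, which may be preferable for a 200000-row file.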

In [7]:
df.head()

Unnamed: 0,ride_id,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,24238194,7.5,2015-05-07 19:52:06+00:00,-73.999817,40.738354,-73.999512,40.723217,1
1,27835199,7.7,2009-07-17 20:04:56+00:00,-73.994355,40.728225,-73.99471,40.750325,1
2,44984355,12.9,2009-08-24 21:45:00+00:00,-74.005043,40.74077,-73.962565,40.772647,1
3,25894730,5.3,2009-06-26 08:22:21+00:00,-73.976124,40.790844,-73.965316,40.803349,3
4,17610152,16.0,2014-08-28 17:47:00+00:00,-73.925023,40.744085,-73.973082,40.761247,5


In [10]:
# drop all the missing values
df=df.dropna()

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199999 entries, 0 to 199999
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype              
---  ------             --------------   -----              
 0   ride_id            199999 non-null  int64              
 1   fare_amount        199999 non-null  float64            
 2   pickup_datetime    199999 non-null  datetime64[ns, UTC]
 3   pickup_longitude   199999 non-null  float64            
 4   pickup_latitude    199999 non-null  float64            
 5   dropoff_longitude  199999 non-null  float64            
 6   dropoff_latitude   199999 non-null  float64            
 7   passenger_count    199999 non-null  int64              
dtypes: datetime64[ns, UTC](1), float64(5), int64(2)
memory usage: 13.7 MB


In [15]:
# calculate average fare amount
print("Average fare amount:", df['fare_amount'].mean())

Average fare amount: 11.359891549457748


In [13]:
import numpy as np
from math import radians

# Define the Haversine formula
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth's mean radius in kilometers

    # Convert latitude and longitude from degrees to radians
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])

    # Haversine formula
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    distance = R * c

    return distance

# Calculate Haversine distances for each row in the DataFrame
distances = df.apply(lambda row: haversine(row['pickup_latitude'], row['pickup_longitude'], row['dropoff_latitude'], row['dropoff_longitude']), axis=1)

# Calculate the median Haversine distance
median_distance = np.median(distances)
print("Median Haversine Distance:", median_distance, "kilometers")

Median Haversine Distance: 2.1209923961833708 kilometers
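
The row-by-row `df.apply` call above runs the formula once per ride in Python. As a sketch of an alternative, the same Haversine formula can be applied to whole arrays with NumPy. The function name `haversine_vectorized` and the `np.clip` guard against rounding error are additions for illustration, not part of the original notebook. The coordinates are the first two rows shown by `df.head()`.

```python
import numpy as np

def haversine_vectorized(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers for scalars or equal-length arrays."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Guard against tiny floating-point excursions outside [0, 1]
    a = np.clip(a, 0.0, 1.0)
    return 2 * radius_km * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# First two rides from df.head()
pickup_lat = np.array([40.738354, 40.728225])
pickup_lon = np.array([-73.999817, -73.994355])
dropoff_lat = np.array([40.723217, 40.750325])
dropoff_lon = np.array([-73.999512, -73.99471])

distances_km = haversine_vectorized(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
print(distances_km)
```

The function should also accept the four DataFrame columns directly, since `np.asarray` converts a pandas Series to an array. The result is then a plain NumPy array rather than a Series.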


In [14]:
# maximum Haversine distance between pickup and dropoff location
max_distance = np.max(distances)
print("Maximum Haversine Distance:", max_distance, "kilometers")

Maximum Haversine Distance: 16409.239135313168 kilometers


In [16]:
# calculate how many rides have 0.0 haversine distance between pickup and dropoff location
rides_with_zero_distance_count = (distances == 0.0).sum()
print("Number of rides with 0.0 Haversine distance:", rides_with_zero_distance_count)


Number of rides with 0.0 Haversine distance: 5632


In [17]:
# Filter rows with 0.0 Haversine distance
zero_distance_rows = df[distances == 0.0]

# Calculate the mean 'fare_amount' for rides with 0.0 Haversine distance
mean_fare_for_zero_distance = zero_distance_rows['fare_amount'].mean()
print("Mean 'fare_amount' for rides with 0.0 Haversine distance:", mean_fare_for_zero_distance)

Mean 'fare_amount' for rides with 0.0 Haversine distance: 11.585317826704546


The mean 'fare_amount' for rides with a Haversine distance of 0.0 is approximately 11.59, slightly above the overall mean of about 11.36. A ride that covers no distance should not cost more than an average ride, so these 5632 records warrant further investigation. Possible explanations:

* Implausible Zero Distance: A Haversine distance of 0.0 means the recorded pickup and dropoff coordinates are identical. For a paid taxi ride this is improbable unless the trip was a round trip, because the Haversine distance measures only the straight-line separation of the two endpoints, not the route driven.

* Data Entry Errors: The coordinates for the pickup or dropoff location may have been recorded incorrectly, for example by copying one location into both fields.

* Data Manipulation or Cleaning: An earlier processing step applied to the dataset, intentional or not, may have produced the identical coordinates.

* Fraud or Anomalies: A pattern of zero-distance rides with normal or high fares could indicate fraudulent or inaccurately recorded rides.

To analyze this further:

* Examine a sample of the records with 0.0 Haversine distance to see whether they share common characteristics.
* Check the data collection process and the source of the data for problems with location recording.
* Look for outliers in the 'fare_amount' values of these rides.
* Review any data cleaning or preprocessing steps applied to the dataset to confirm they did not introduce the issue.

Identifying the root cause of these 0.0 distance values is essential for deciding whether they are genuine rides or data quality issues.
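
One way to start examining the zero-distance records is to separate rides whose pickup and dropoff coordinates are merely identical from rides where all four coordinates are exactly 0. The second case would hint at a missing GPS fix stored as a placeholder. That is a hypothesis to test, not something the notebook has established. The `sample` frame below is made-up data that reuses the notebook's column names.

```python
import pandas as pd

# Hypothetical miniature sample; column names follow the notebook's DataFrame
sample = pd.DataFrame({
    "fare_amount": [7.5, 52.0, 4.1, 10.0],
    "pickup_longitude": [-73.99, 0.0, -73.98, -73.95],
    "pickup_latitude": [40.73, 0.0, 40.75, 40.78],
    "dropoff_longitude": [-73.99, 0.0, -73.96, -73.95],
    "dropoff_latitude": [40.73, 0.0, 40.77, 40.78],
})

# Rides whose recorded pickup and dropoff points coincide
same_point = (
    (sample["pickup_longitude"] == sample["dropoff_longitude"])
    & (sample["pickup_latitude"] == sample["dropoff_latitude"])
)

# Subset of those rides where every coordinate is exactly 0
all_zero = (
    same_point
    & (sample["pickup_longitude"] == 0)
    & (sample["pickup_latitude"] == 0)
)

summary = pd.DataFrame(
    {
        "rides": [int(same_point.sum()), int(all_zero.sum())],
        "mean_fare": [
            sample.loc[same_point, "fare_amount"].mean(),
            sample.loc[all_zero, "fare_amount"].mean(),
        ],
    },
    index=["identical pickup/dropoff", "all coordinates zero"],
)
print(summary)
```

Running the same masks on the real DataFrame would show how many of the 5632 zero-distance rides fall into each group.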


In [18]:
# calculate maximum 'fare_amount' for a ride
print("Maximum fare amount for a ride: ",df['fare_amount'].max())

Maximum fare amount for a ride:  499.0


In [20]:
# calculate the haversine distance between pickup and dropoff location for the costliest ride
max_fare_rows = df[df['fare_amount'] == df['fare_amount'].max()]
haversine_distance_for_costliest_ride = haversine(
    max_fare_rows['pickup_latitude'].values[0],
    max_fare_rows['pickup_longitude'].values[0],
    max_fare_rows['dropoff_latitude'].values[0],
    max_fare_rows['dropoff_longitude'].values[0]
)

print("The Haversine distance between pickup and dropoff location for the costliest ride:", haversine_distance_for_costliest_ride, "kilometers")

The Haversine distance between pickup and dropoff location for the costliest ride: 0.0007899213191009994 kilometers


The Haversine distance between the pickup and dropoff locations for the costliest ride is approximately 0.00079 kilometers (about 0.79 meters), yet the fare is 499.0. This combination is highly unusual and points to an issue or anomaly in the data. Points to consider:

* Implausibly Short Distance: A Haversine distance of less than 1 meter means the recorded pickup and dropoff locations are essentially the same point, which is highly improbable for a ride with the highest fare in the dataset.

* Data Entry Errors: The latitude and longitude values for the pickup or dropoff location may have been recorded incorrectly.

* Data Anomalies or Outliers: This record may be an outlier. Check whether other rides combine a very short distance with an unusually high fare.

* Quality of GPS Data: Consider the source and quality of the GPS data behind the coordinates. Poor GPS data leads to incorrect distance calculations.

* Data Validation: Validate the coordinates and the fare amount for this ride to confirm they were recorded accurately.
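
A basic coordinate check is sketched below. It flags rows whose latitude lies outside [-90, 90] or whose longitude lies outside [-180, 180]. The maximum Haversine distance of about 16409 km reported earlier suggests some coordinates may be implausible, although the notebook has not verified this. The helper name `has_valid_coordinates` and the `sample` rows are hypothetical.

```python
import pandas as pd

def has_valid_coordinates(frame):
    """Return a boolean mask: True where all four coordinates are in range."""
    lat_ok = frame[["pickup_latitude", "dropoff_latitude"]].abs().le(90).all(axis=1)
    lon_ok = frame[["pickup_longitude", "dropoff_longitude"]].abs().le(180).all(axis=1)
    return lat_ok & lon_ok

# Made-up rows: one plausible ride, one bad latitude, one bad longitude
sample = pd.DataFrame({
    "pickup_latitude": [40.738354, 401.08, 40.75],
    "pickup_longitude": [-73.999817, -73.98, -736.4],
    "dropoff_latitude": [40.723217, 40.76, 40.77],
    "dropoff_longitude": [-73.999512, -73.97, -73.96],
})

mask = has_valid_coordinates(sample)
print(sample[mask])
```

A range check only catches impossible values. Coordinates that are in range but far from the service area would need a tighter bounding box chosen for the city in question.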

In [23]:
# count the rides recorded in the year 2014
number_of_rides_df = df[df['pickup_datetime'].dt.year == 2014]
print("Number of rides that were recorded in the year 2014:", number_of_rides_df.shape[0])

Number of rides that were recorded in the year 2014: 29968


In [25]:
# count the rides recorded in the first quarter of 2014
rides_in_first_quarter_2014 = df[(df['pickup_datetime'].dt.year == 2014) & (df['pickup_datetime'].dt.month >= 1) & (df['pickup_datetime'].dt.month <=3)]
print("Number of rides recorded in the first quarter of 2014:", rides_in_first_quarter_2014.shape[0])

Number of rides recorded in the first quarter of 2014: 7687
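
The month-range filter above can also be written with the `.dt.quarter` accessor. The sketch below uses a hypothetical four-timestamp Series in place of `df['pickup_datetime']`.

```python
import pandas as pd

# Hypothetical stand-in for df['pickup_datetime']
pickups = pd.Series(pd.to_datetime([
    "2014-01-15 10:00:00+00:00",
    "2014-03-31 23:59:59+00:00",
    "2014-04-01 00:00:00+00:00",
    "2013-02-10 07:30:00+00:00",
]))

# Quarter 1 covers January through March
in_q1_2014 = (pickups.dt.year == 2014) & (pickups.dt.quarter == 1)
q1_count = int(in_q1_2014.sum())
print("Rides in Q1 2014:", q1_count)
```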


In [26]:
# Find the day of the week in September 2010 with the most rides

df['year'] = df['pickup_datetime'].dt.year
df['month'] = df['pickup_datetime'].dt.month
df['day_of_week'] = df['pickup_datetime'].dt.dayofweek

rides_in_september_2010 = df[(df['year'] == 2010) & (df['month'] == 9)]

# Find the day of the week with the maximum number of rides
max_rides_day = rides_in_september_2010['day_of_week'].value_counts().idxmax()

day_mapping = {
    0: 'Monday',
    1: 'Tuesday',
    2: 'Wednesday',
    3: 'Thursday',
    4: 'Friday',
    5: 'Saturday',
    6: 'Sunday'
}

max_rides_day_name = day_mapping[max_rides_day]

df.drop(['year', 'month', 'day_of_week'], axis=1, inplace=True)

print("Day of the week with the most rides in September 2010:", max_rides_day_name)

Day of the week with the most rides in September 2010: Thursday
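
The temporary columns and the manual `day_mapping` dictionary are one way to get the weekday name. As a sketch of a shorter route, `.dt.day_name()` returns the English weekday name directly, so no helper columns have to be added and dropped. The `pickups` Series is made-up data for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df['pickup_datetime']
pickups = pd.Series(pd.to_datetime([
    "2010-09-02 08:00:00+00:00",  # Thursday
    "2010-09-09 18:30:00+00:00",  # Thursday
    "2010-09-04 12:00:00+00:00",  # Saturday
    "2010-10-07 09:00:00+00:00",  # Thursday, but in October
]))

in_sept_2010 = (pickups.dt.year == 2010) & (pickups.dt.month == 9)

# Count rides per weekday name and take the most frequent one
busiest_day = pickups[in_sept_2010].dt.day_name().value_counts().idxmax()
print("Busiest weekday in September 2010:", busiest_day)
```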


In [27]:
df.head()

Unnamed: 0,ride_id,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,24238194,7.5,2015-05-07 19:52:06+00:00,-73.999817,40.738354,-73.999512,40.723217,1
1,27835199,7.7,2009-07-17 20:04:56+00:00,-73.994355,40.728225,-73.99471,40.750325,1
2,44984355,12.9,2009-08-24 21:45:00+00:00,-74.005043,40.74077,-73.962565,40.772647,1
3,25894730,5.3,2009-06-26 08:22:21+00:00,-73.976124,40.790844,-73.965316,40.803349,3
4,17610152,16.0,2014-08-28 17:47:00+00:00,-73.925023,40.744085,-73.973082,40.761247,5


In [28]:
# Calculate 'distance' and 'ride_week_day' and add them to the DataFrame
df['distance'] = df.apply(lambda row: haversine(row['pickup_latitude'], row['pickup_longitude'], row['dropoff_latitude'], row['dropoff_longitude']), axis=1)
df['ride_week_day'] = df['pickup_datetime'].dt.dayofweek

In [29]:
df.head()

Unnamed: 0,ride_id,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,distance,ride_week_day
0,24238194,7.5,2015-05-07 19:52:06+00:00,-73.999817,40.738354,-73.999512,40.723217,1,1.683323,3
1,27835199,7.7,2009-07-17 20:04:56+00:00,-73.994355,40.728225,-73.99471,40.750325,1,2.45759,4
2,44984355,12.9,2009-08-24 21:45:00+00:00,-74.005043,40.74077,-73.962565,40.772647,1,5.036377,0
3,25894730,5.3,2009-06-26 08:22:21+00:00,-73.976124,40.790844,-73.965316,40.803349,3,1.661683,4
4,17610152,16.0,2014-08-28 17:47:00+00:00,-73.925023,40.744085,-73.973082,40.761247,5,4.47545,3


In [30]:
# input features
X=df[['passenger_count','distance','ride_week_day']]
# output feature
y=df['fare_amount']
# task: regression

## Train-Test Split

In [32]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(139999, 3) (139999,)
(60000, 3) (60000,)


## Linear Regression

In [33]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [34]:
y_test_pred = regressor.predict(X_test)

In [35]:
from sklearn import metrics

# Adjusted R^2

r2 = metrics.r2_score(y_test, y_test_pred)
n = len(y_test)
k = X_test.shape[1]

r2_adj = 1 - (1-r2)*(n-1)/(n-k-1)

print("Adjusted R^2 score for linear regression model: ", r2_adj)

Adjusted R^2 score for linear regression model:  0.0005663805253751653


## Random Forest Regression

In [36]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor()
regressor.fit(X_train, y_train)

In [37]:
y_test_pred = regressor.predict(X_test)

In [38]:
# Adjusted R^2

r2 = metrics.r2_score(y_test, y_test_pred)
n = len(y_test)
k = X_test.shape[1]

r2_adj = 1 - (1-r2)*(n-1)/(n-k-1)

print("Adjusted R^2 score for Random Forest Regression model: ", r2_adj)

Adjusted R^2 score for Random Forest Regression model:  0.659960585764324
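
Both evaluation cells repeat the same adjusted R^2 arithmetic. A small helper, sketched below under the name `adjusted_r2` (not part of the original notebook), keeps the formula in one place and rejects inputs for which the denominator is not positive.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    denominator = n_samples - n_features - 1
    if denominator <= 0:
        raise ValueError("n_samples must exceed n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / denominator

# With 60000 test rows and 3 features the adjustment is very small
example = adjusted_r2(0.66, n_samples=60000, n_features=3)
print(example)
```

With a test set this large relative to the number of features, the adjusted value stays very close to the plain R^2, so either metric leads to the same comparison between the two models.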


The adjusted R-squared values for the linear regression and random forest regression models indicate how much of the variance in fare amount each model explains on the test set. Observations:

Linear Regression Model (Adjusted R-squared ≈ 0.0006):

* The adjusted R-squared value is very close to 0, so the linear regression model explains almost none of the variance in the target variable (fare amount).
* This suggests that the relationship between the input features (passenger_count, distance, ride_week_day) and the fare amount is either weak or not linear.
* Extreme values, such as the maximum distance of about 16409 kilometers, may also distort the linear fit.

Random Forest Regression Model (Adjusted R-squared ≈ 0.660):

* The adjusted R-squared value is far higher than for the linear regression model, indicating a much better fit.
* A value of about 0.66 means the random forest model explains roughly two-thirds of the variance in the target variable on the test set.
* This suggests that the random forest model is better at capturing nonlinear relationships between the input features and the fare amount.

Overall Observation:

* Based on the adjusted R-squared values, the random forest regression model is the more suitable of the two for this prediction task. Because RandomForestRegressor() was created without a random_state, the exact score will vary slightly between runs.
