In the name of ALLAH, The Most Beneficent, The Most Merciful

# Uber & Lyft ride prices

**Credits: RaviMunde**

    https://www.kaggle.com/code/ravi72munde/starter-uber-lyft-ride-prices-random-forrest/notebook

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor,RandomForestClassifier

In [None]:
datapath_cab_rides = '/content/drive/MyDrive/cab_rides.csv'
datapath_weather = '/content/drive/MyDrive/weather.csv'

In [None]:
cab_df = pd.read_csv(datapath_cab_rides, delimiter='\t', encoding = "utf-16")
weather_df = pd.read_csv(datapath_weather, delimiter='\t', encoding = "utf-16")

In [None]:
cab_df.head()

In [None]:
cab_df.describe()
# By default, describe() summarizes numeric columns only.

In [None]:
"""
About cab_df:

- It contains 693071 records.
- I think only 6 of the 10 columns are usable as inputs, plus 1 output column
(price).
- The data shows price as a function of distance, destination, source, cab_type,
surge_multiplier, product_id, and name.
    - I think time_stamp also matters.
    - distance and destination+source are complementary features.
- Rows can be grouped by distance, time_stamp, source, destination, and cab_type.
    - Open question: how many groups exist in total?
    - Open question: what are the min, max, and mean group sizes?

Possibly useless columns:
    - product_id, because the name column already exists.
    - Maybe cab_type as well.
    - Maybe id as well.

Relationship to weather_df:
- Only time_stamp and source/destination connect cab_df to weather_df,
    - because weather_df has only two related features: location and time_stamp.
    - [Rescaled below] The two time_stamp columns differ in scale: 13 digits in cab_df vs. 10 digits in weather_df.
        - The 10-digit value is standard Unix time in seconds; the 13-digit value is in milliseconds.
"""
pass

In [None]:
# rough
cab_df["time_stamp"].astype('int64')

In [None]:
weather_df.head()
# It contains few records

In [None]:
"""
About weather data:
- There are 2 connected columns as location and time_stamp.
- 6 weather columns.
- Total columns = 8
"""


In [None]:
# Add a new column which converts unix to standard date-time.
cab_df['date_time'] = pd.to_datetime(cab_df['time_stamp']/1000, unit='s')
weather_df['date_time'] = pd.to_datetime(weather_df['time_stamp'], unit='s')
cab_df.head()
#weather_df.head()
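
The scale difference between the two `time_stamp` columns can be checked on single values. This is a minimal sketch: the two literals are illustrative values chosen to lie in the same range as the data (an assumption, not rows quoted from the CSV files), and the `ts_`-prefixed names are hypothetical. Passing `unit='ms'` is an equivalent alternative to dividing by 1000.

```python
import pandas as pd

# Illustrative values only (assumed, not copied from the CSV files)
ts_cab = pd.Series([1544952607890])     # 13 digits: milliseconds since the epoch
ts_weather = pd.Series([1545003901])    # 10 digits: seconds since the epoch

ts_cab_ms = pd.to_datetime(ts_cab, unit='ms')           # no manual division needed
ts_cab_div = pd.to_datetime(ts_cab / 1000, unit='s')    # the approach used in the cell above
ts_weather_s = pd.to_datetime(ts_weather, unit='s')

print(ts_cab_ms.iloc[0], ts_weather_s.iloc[0])
```

Both routes land on the same second; `unit='ms'` simply avoids the float division.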

In [None]:
# Dtype of date_time = datetime64[ns]
# Dtype of source = object
# Dtype of location in weather data = object
# cab_df['merge_date'] # dtype = object # run after the next cell

In [None]:
#merge the datasets to reflect same time for a location
    # selecting date and hour only, leaving min and seconds.
cab_df['merge_date'] = cab_df.source.astype(str) +" - "+ cab_df.date_time.dt.date.astype("str") +" - "+ cab_df.date_time.dt.hour.astype("str")
weather_df['merge_date'] = weather_df.location.astype(str) +" - "+ weather_df.date_time.dt.date.astype("str") +" - "+ weather_df.date_time.dt.hour.astype("str")
cab_df.head()

In [None]:
# Use the new merge_date column as the index of weather_df
weather_df.index = weather_df['merge_date']
    # Then why don't you delete the column?
#weather_df.head()

In [None]:
# Join both dataframes on merge_date (source location + date + hour)
# Records from weather_df will be repeated after the join
pd.set_option('display.max_columns', None)     #pd.reset_option('display.max_columns')
merged_df = cab_df.join(weather_df,on=['merge_date'],rsuffix ='_w')
    #rsuffix means suffix to use from right frame’s overlapping columns.
merged_df.head()
    # NaN rain is normal because not all the records contain this value.
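
The join can also increase the number of cab rows: when several weather readings share one `merge_date` key, every matching ride is repeated once per reading. That is a likely reason the merged frame has more rows than `cab_df`. The sketch below uses toy frames with made-up values (all names prefixed `toy_` are hypothetical) and shows one possible alternative, averaging the readings per key before joining.

```python
import pandas as pd

# Toy frames (assumed values) that mimic the merge_date key built above
toy_cab = pd.DataFrame({
    'merge_date': ['A - 2018-12-16 - 8', 'B - 2018-12-16 - 9'],
    'price': [5.0, 7.0],
})
toy_weather = pd.DataFrame({
    'merge_date': ['A - 2018-12-16 - 8', 'A - 2018-12-16 - 8'],
    'temp': [39.0, 41.0],
}).set_index('merge_date')

# Two weather readings share one key, so the matching cab row appears twice
toy_joined = toy_cab.join(toy_weather, on='merge_date')

# Alternative: average the readings per key first, keeping one row per ride
toy_weather_hourly = toy_weather.groupby(level=0).mean()
toy_joined_unique = toy_cab.join(toy_weather_hourly, on='merge_date')

print(len(toy_cab), len(toy_joined), len(toy_joined_unique))
```

Whether duplicated rides are acceptable depends on the goal; duplicated rows can end up on both sides of a later train/test split.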

In [None]:
# Testing
#weather_df.head()
#weather_df.loc[weather_df['merge_date'] == 'Haymarket Square - 2018-12-16 - 8']

merged_df.isna().sum()
#merged_df[merged_df['price'].isna()][:50]
# NaN means missing values

In [None]:
merged_df['rain'] = merged_df['rain'].fillna(0) # assign the result back; fillna(inplace=True) on a selected column is unreliable in recent pandas
# Why does this column contain NaN values?
# Why replace them with 0?
# Mark it solved

In [None]:
merged_df = merged_df[pd.notnull(merged_df['date_time_w'])]
merged_df = merged_df[pd.notnull(merged_df['price'])]

In [None]:
#Testing
#before: 1269926 rows
#after: 1161392 rows
#merged_df.date_time.head()

In [None]:
# Add 2 new columns derived from date_time:
    #1- day (day of the week from 0 to 6) 
    #2-hour  
merged_df['day'] = merged_df.date_time.dt.dayofweek
merged_df['hour'] = merged_df.date_time.dt.hour
#merged_df.head()
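
A quick check of the two accessors on known dates, as a minimal sketch with made-up timestamps (the `toy_` names are hypothetical): `dt.dayofweek` counts from Monday = 0 to Sunday = 6, and `dt.hour` returns the hour of day from 0 to 23.

```python
import pandas as pd

# 2018-12-16 was a Sunday, 2018-12-17 a Monday
toy_stamps = pd.Series(pd.to_datetime(['2018-12-16 08:30:00', '2018-12-17 23:05:00']))

toy_day = toy_stamps.dt.dayofweek   # Monday=0 ... Sunday=6
toy_hour = toy_stamps.dt.hour       # 0 ... 23

print(toy_day.tolist(), toy_hour.tolist())
```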

In [None]:
merged_df['day'].describe()

count    1.161392e+06
mean     2.389423e+00
std      1.759752e+00
min      0.000000e+00
25%      1.000000e+00
50%      2.000000e+00
75%      3.000000e+00
max      6.000000e+00
Name: day, dtype: float64

In [None]:
merged_df.columns

In [None]:
merged_df.count() #All columns have a length of 1161392

In [None]:
# Choose lyft_line product_id only with 3 non-weather features and 6 weather features
X = merged_df[merged_df.product_id=='lyft_line'][['day','distance','hour','temp','clouds', 'pressure','humidity', 'wind', 'rain']]
#X.count() # Length decreases to 91041

In [None]:
y = merged_df[merged_df.product_id=='lyft_line']['price']
#y.count() # 91041

In [None]:
# Why did we choose lyft_line only? It is just 8% of the data.
  # Extra note: at that share, about 12.5 product ids of the same size would make up the full dataset.

In [None]:
# Run only once
X = X.reset_index(drop=True) # the filtered rows no longer have a contiguous index
X.head(50)

In [None]:
features = pd.get_dummies(X) # No-op here: every column in X is already numeric, so X could be used directly
features.head()
# features.shape # = (91041, 9)

,day,distance,hour,temp,clouds,pressure,humidity,wind,rain
0,6,0.44,8,39.36,0.39,1022.44,0.74,8.14,0.0
1,1,1.08,0,43.96,1.0,1006.26,0.9,9.86,0.0497
2,1,1.08,0,43.83,0.97,1005.9,0.91,10.93,0.2173
3,1,1.08,0,43.82,0.97,1005.87,0.91,11.02,0.2039
4,1,1.08,0,43.82,0.97,1005.89,0.91,10.94,0.2154


In [None]:
#s = pd.Series(list('abca'))
#pd.get_dummies(s)
#type(X)

#features.equals(X) # True
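
Why `features.equals(X)` is True can be seen on a toy frame: `pd.get_dummies` only expands object or categorical columns and passes numeric ones through unchanged. This is a minimal sketch with made-up values; the `cab_type` column is added here purely for illustration and is not part of `X`.

```python
import pandas as pd

toy_numeric = pd.DataFrame({'distance': [0.44, 1.08], 'hour': [8, 0]})
toy_mixed = toy_numeric.assign(cab_type=['Lyft', 'Uber'])   # one object column added

toy_same = pd.get_dummies(toy_numeric).equals(toy_numeric)   # numeric-only: unchanged
toy_encoded = pd.get_dummies(toy_mixed)                      # object column is one-hot encoded

print(toy_same, list(toy_encoded.columns))
```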

In [None]:
features.columns

In [None]:
y.head() # y index is out of order # no problem

In [None]:
labels = np.array(y) # converts pandas series to numpy array

# Saving feature names for later use
feature_list = list(features.columns)
# Convert to numpy array
features = np.array(features) # converts pandas dataframe to numpy array

print(labels)
print(features[:5])

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(features,
    labels, test_size = 0.25, random_state = 42)

print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (68280, 9)
Training Labels Shape: (68280,)
Testing Features Shape: (22761, 9)
Testing Labels Shape: (22761,)


In [None]:
#Testing
#y #dtype: float64
#max(y) #22.5

22.5

#### Random Forest - Price Prediction

In [None]:
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
# Train the model on training data
rf.fit(train_features, train_labels)

RandomForestRegressor(n_estimators=1000, random_state=42)

In [None]:
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)
# Calculate the absolute errors
errors = abs(predictions - test_labels)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'dollars.')

Mean Absolute Error: 0.44 dollars.


In [None]:
# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / test_labels)
# Report 100 - MAPE as a rough accuracy figure (not a classification accuracy)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')

Accuracy: 91.86 %.
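
The accuracy figure above is 100 minus the mean absolute percentage error (MAPE). A worked example on three made-up prices shows the arithmetic; the `toy_` names are hypothetical. The formula divides by the true label, so it is undefined if any label is 0.

```python
import numpy as np

toy_labels = np.array([5.0, 10.0, 20.0])   # toy prices (assumed)
toy_preds = np.array([5.5, 9.0, 21.0])

toy_errors = np.abs(toy_preds - toy_labels)    # [0.5, 1.0, 1.0]
toy_mae = toy_errors.mean()                    # mean absolute error
toy_mape = 100 * (toy_errors / toy_labels)     # [10.0, 10.0, 5.0] percent
toy_accuracy = 100 - toy_mape.mean()           # 100 - MAPE

print(round(toy_mae, 2), round(toy_accuracy, 2))
```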


In [None]:
# Get numerical feature importances
importances = list(rf.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the features and their importances
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

Variable: distance             Importance: 0.66
Variable: pressure             Importance: 0.1
Variable: day                  Importance: 0.07
Variable: hour                 Importance: 0.05
Variable: temp                 Importance: 0.05
Variable: wind                 Importance: 0.03
Variable: clouds               Importance: 0.02
Variable: humidity             Importance: 0.02
Variable: rain                 Importance: 0.01


Why is pressure important?
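
One way to probe this question: `feature_importances_` is impurity-based and computed on the training data, and it is known to favour continuous features with many distinct values. That may partly explain the score for pressure, but this is a hypothesis the notebook does not verify. Permutation importance on held-out data is a common cross-check. The sketch below runs on synthetic data in which only the first column matters; all names are hypothetical and the data is generated, not taken from the cab dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumed): the target depends on column 0 only
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(400, 3))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
toy_rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each column of the held-out set in turn and measure the drop in score
perm = permutation_importance(toy_rf, X_te, y_te, n_repeats=5, random_state=42)
print(perm.importances_mean)
```

Applied to the price model, the same call would take `rf`, `test_features`, and `test_labels`; with 1000 trees it may take a while.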

#### Random Forest - Surge_multiplier prediction

In [None]:
merged_df_surge = merged_df[merged_df.surge_multiplier < 3]
X = merged_df_surge[['day','hour','temp','clouds', 'pressure','humidity', 'wind', 'rain']]
#X.count()

In [None]:
print(merged_df.shape)
print(merged_df_surge.shape)
#merged_df[merged_df.surge_multiplier >= 3]

(1161392, 24)
(1161380, 24)


In [None]:
features = pd.get_dummies(X)
#features.equals(X) #True

In [None]:
y = merged_df_surge['surge_multiplier']
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
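len(y) # len=1161380. y's index is not contiguous, so I checked the length instead.
# Ignoring the multiplier of 3 because it is so rare in our dataset
    # Me: the original note said 2 rows, but there are 12 (1161392 - 1161380)
# Note: 2.25 is listed in the fit below but never occurs in y, so the encoded labels skip 5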
le.fit([1,1.25,1.5,1.75,2.,2.25,2.5])
y2 = le.transform(y) 

In [None]:
#Testing
#type(y2) # numpy array
#np.array_equal(np.array(y), y2) # False
#np.array(y)[1000:2000], y2[1000:2000]
#np.unique(np.array(y)) # = array([1.  , 1.25, 1.5 , 1.75, 2.  , 2.5 ])
np.unique(y2)

array([0, 1, 2, 3, 4, 6])
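
The gap at 5 comes from fitting the encoder on a hand-written list that includes 2.25, a value that never occurs in `y`. A minimal sketch comparing that with fitting on the observed values; the `le_` and `codes_` names are hypothetical, and the observed values are the ones reported by `np.unique` in the cell above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

observed = np.array([1.0, 1.25, 1.5, 1.75, 2.0, 2.5])   # values that actually occur

le_listed = LabelEncoder().fit([1, 1.25, 1.5, 1.75, 2., 2.25, 2.5])   # as in the cell above
le_observed = LabelEncoder().fit(observed)                            # alternative

codes_listed = le_listed.transform(observed)      # code 5 is reserved for 2.25 and never used
codes_observed = le_observed.transform(observed)  # contiguous codes

print(codes_listed, codes_observed)
```

Contiguous codes matter for estimators that expect classes numbered 0 to n-1.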

In [None]:
labels = np.array(y2)

feature_list = list(X.columns)
# Convert to numpy array
features = np.array(features)

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(features,
    labels, test_size = 0.25, random_state = 42)

print('Training Features Shape:', train_features.shape) # (871035, 8)
print('Testing Features Shape:', test_features.shape) # (290345, 8)

Training Features Shape: (871035, 8)
Testing Features Shape: (290345, 8)


In [None]:
#Testing
#(np.count_nonzero(y == 1)/len(y))*100 # 97.5%

In [None]:
# The dataset is imbalanced when it comes to surge multipliers.
# More than 95% of the data has a surge multiplier of 1.
# We use SMOTE to balance the training data.

In [None]:
# Some installations

#!pip install -U imbalanced-learn
#!pip install delayed

In [None]:
# Balance train data only
# warning: run only once
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
train_features, train_labels = sm.fit_resample(train_features, train_labels)

print('Training Features Shape:', train_features.shape) # (5096064, 8)

Training Features Shape: (5096064, 8)


In [None]:
# The above cell has increased the rows of training features from 871k to 5096k.
    # (5.85 times increase)
#Testing
#unique, counts = np.unique(train_labels, return_counts=True)
#print(np.asarray((unique, counts)).T) # Wow! 6 classes of same length
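
The size of the resampled training set follows from how SMOTE balances by default: every class except the majority is oversampled up to the majority count, so the total is the number of classes times that count. A minimal sketch of the arithmetic on toy labels (made-up counts, hypothetical names), followed by the same calculation on the shape printed above.

```python
import numpy as np

# Toy imbalanced labels (assumed): 8 samples of class 0, 2 of class 1, 1 of class 2
toy_labels = np.array([0] * 8 + [1] * 2 + [2] * 1)
toy_classes, toy_counts = np.unique(toy_labels, return_counts=True)

# After balancing, every class is raised to the majority count
toy_expected_rows = len(toy_classes) * toy_counts.max()
print(dict(zip(toy_classes.tolist(), toy_counts.tolist())), toy_expected_rows)

# Same arithmetic on the notebook's numbers: 6 classes, 5096064 rows after resampling
rows_per_class = 5096064 // 6
print(rows_per_class, round(100 * rows_per_class / 871035, 1))
```

The per-class count is about 97.5% of the 871035 original training rows, which matches the share of rides with a surge multiplier of 1.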

In [None]:
# Extra note: This is different. The previous one was RandomForestRegressor 
# Extra note: This cell takes time, more than 10mins
rf = RandomForestClassifier(n_jobs=-1, random_state = 42,class_weight="balanced")
    # Please note that I have tried n_estimators=1000
        # result: negligible improvement
# Train the model on training data
rf.fit(train_features, train_labels)

RandomForestClassifier(class_weight='balanced', n_jobs=-1, random_state=42)

In [None]:
# Me: I need to revise randomforest regressor and classifier

In [None]:
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)
# Calculate the absolute errors
errors = abs(predictions - test_labels)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'encoded-label steps.')

Mean Absolute Error: 0.45 encoded-label steps.


In [None]:
# Calculate the weighted precision score (taking the imbalance of the dataset into account)
# Me: using test predictions and test labels
from sklearn.metrics import precision_score, recall_score

print(precision_score(test_labels, predictions, average="weighted"))
print(recall_score(test_labels, predictions, average="micro"))
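    # Me: Why is the weighted average used for precision but the micro average for recall?
    # Note: for single-label multiclass data, micro-averaged recall equals plain accuracy.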

0.9747899096864808
0.7635054848542251
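
On the averaging question: with single-label multiclass data, micro-averaged recall counts every correct prediction once over all samples, so it equals plain accuracy. Weighted and macro averages instead combine per-class scores, weighted by class frequency or equally. A minimal sketch on made-up labels (hypothetical `toy_` names):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels (assumed): class 0 dominates, as surge multiplier 1 does in the data
toy_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
toy_pred = np.array([0, 0, 0, 0, 1, 2, 1, 0, 2, 2])

toy_micro_recall = recall_score(toy_true, toy_pred, average='micro')
toy_acc = accuracy_score(toy_true, toy_pred)
toy_weighted_precision = precision_score(toy_true, toy_pred, average='weighted')
toy_macro_recall = recall_score(toy_true, toy_pred, average='macro')   # each class counts equally

print(toy_micro_recall, toy_acc, toy_weighted_precision, toy_macro_recall)
```

For a dataset this imbalanced, macro-averaged scores or the per-class rows of a confusion matrix say more about the rare surge classes than micro or weighted averages do.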


In [None]:
#Confusion Matrix for the Surge Multiplier prediction
# Create confusion matrix
pd.crosstab(le.inverse_transform(test_labels), le.inverse_transform(predictions),rownames=['Actual'],colnames=['Predicted'])

Actual \ Predicted,1.00,1.25,1.50,1.75,2.00,2.50
1.0,218069,32802,14799,8742,7394,1172
1.25,253,2278,763,529,368,88
1.5,10,325,682,373,249,64
1.75,3,65,102,317,139,79
2.0,0,24,47,122,320,128
2.5,0,2,1,5,17,14


In [None]:
# Get numerical feature importances
importances = list(rf.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort by most important
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];

Variable: temp                 Importance: 0.2
Variable: wind                 Importance: 0.2
Variable: pressure             Importance: 0.19
Variable: humidity             Importance: 0.13
Variable: clouds               Importance: 0.09
Variable: hour                 Importance: 0.07
Variable: day                  Importance: 0.06
Variable: rain                 Importance: 0.06


In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(test_labels, predictions)

0.7635054848542251

### XGBoost
for prediction of surge multiplier

In [None]:
!pip install xgboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Error during fit:
    #ValueError: Invalid classes inferred from unique values of `y`.
        #Expected: [0 1 2 3 4 5], got [0 1 2 3 4 6]
train_labels[train_labels == 6] = 5
test_labels[test_labels == 6] = 5
len(train_labels)

5096064
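
Note that after overwriting 6 with 5, `le.inverse_transform` would decode class 5 as 2.25 rather than 2.5, because 5 is the code the encoder reserved for 2.25. A more general way to close gaps, sketched on a toy array under the assumption that a reversible mapping is wanted, is `np.unique(..., return_inverse=True)`; the names below are hypothetical.

```python
import numpy as np

toy_codes = np.array([0, 1, 2, 3, 4, 6, 6, 0])   # encoded labels with a gap at 5 (toy sample)

# np.unique returns the sorted distinct codes and, for each element, its position in that list
toy_distinct, toy_contiguous = np.unique(toy_codes, return_inverse=True)
print(toy_distinct, toy_contiguous)

# Indexing with the contiguous codes restores the original codes,
# which can then be passed to le.inverse_transform
toy_restored = toy_distinct[toy_contiguous]
```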

In [None]:
#Time: 8:31 - 9:05
import xgboost as xgb
from sklearn.metrics import accuracy_score

# Init classifier
xgb_cl = xgb.XGBClassifier()
xgb_cl.fit(train_features, train_labels)

preds = xgb_cl.predict(test_features)
# Score
accuracy_score(test_labels, preds)

# Accuracy on local machine = 0.7661506139248135

NameError: ignored

In [None]:
# Try this afterwards:
# XGBRegressor(n_estimators = 1000, learning_rate = 0.05) # Your code here

# Documentation for XGBClassifier and XGBRegressor:
#https://xgboost.readthedocs.io/en/stable/python/python_api.html

In [None]:
# XGBoost experiments:
'''
1- Default settings: about the same accuracy as Random Forest
2- n_estimators = 1000
3- n_estimators = 1000, learning_rate = 0.05: 
'''

### XGBoost
for prediction of prices

In [None]:
import xgboost as xgb

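# Init regressor (the variable keeps the name xgb_cl from the classifier cell)
# Note: re-run the price-prediction split first, so train_features/train_labels hold prices, not surge labels
xgb_cl = xgb.XGBRegressor(n_estimators = 1000)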
xgb_cl.fit(train_features, train_labels)

preds = xgb_cl.predict(test_features)

# Calculate the absolute errors
errors = abs(preds - test_labels)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'dollars.')

# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / test_labels)
# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')

Mean Absolute Error: 0.67 dollars.
Accuracy: 87.46 %.


In [None]:
# XGBoost experiments:
'''
1- Accuracy decreased relative to Random Forest, from 91.86% to 85.29%
2- n_estimators = 1000. result = 87.46%
3- n_estimators = 1000, learning_rate = 0.05. result = 86.72%
4- n_estimators = 1000, learning_rate = 0.01. result = 85.23%
'''

### **Conclusion:**

  **On price prediction, Random Forest performs better than XGBoost (91.86% vs. 87.46%).**