# **Transport Demand Prediction**    - Regression



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Member 1 -**        Hasnain Mazhar Rizvi


# **Project Summary -**

In brief, the project's goal is to build a predictive model for Mobiticket that forecasts seat sales for each trip, given its route, date, and time. The trips run on 14 routes that originate in towns towards Lake Victoria and end in Nairobi. The travel duration is approximately 8 to 9 hours to reach the outskirts of Nairobi, followed by an additional 2 to 3 hours to the central bus terminal. Traffic conditions significantly affect passenger behavior during their journey to the city and onward to their final destinations in Nairobi. Gaining insights into these patterns can lead to better service planning and operational optimization for Mobiticket.

To improve the model, additional features were engineered from the travel date, travel time, and origin of each ride. Several regression models were then fitted and compared on the resulting dataset. The features that the models rank as most important are highlighted, so that stakeholders can see which factors drive the number of seats sold per trip.

# **GitHub Link -**

https://github.com/Hasnain-Rizvi/Capstone-Project---Regression.git

# **Problem Statement**


This challenge asks you to build a model that predicts the number of seats that Mobiticket can expect to sell for each ride, i.e. for a specific route on a specific date and time. There are 14 routes in this dataset. All of the routes end in Nairobi and originate in towns to the North-West of Nairobi towards Lake Victoria.

The towns from which these routes originate are:


* Awendo
* Homa Bay
* Kehancha
* Kendu Bay
* Keroka
* Keumbu
* Kijauri
* Kisii
* Mbita
* Migori
* Ndhiwa
* Nyachenge
* Oyugis
* Rodi
* Rongo
* Sirare
* Sori


The journey from these 14 origins to the first stop on the outskirts of Nairobi takes approximately 8 to 9 hours from the time of departure. From that first stop to the main bus terminal in the Central Business District, where most passengers get off, takes another 2 to 3 hours depending on traffic.
The three stops that all these routes make in Nairobi (in order) are:

1. Kawangware: the first stop in the outskirts of Nairobi
2. Westlands
3. Afya Centre: the main bus terminal where most passengers disembark


Passengers on these bus (or shuttle) rides are affected by Nairobi traffic not only during their ride into the city; from there they must also continue to their final destination in Nairobi, wherever that may be. Traffic can act as a deterrent for those who have the option to avoid buses that arrive in Nairobi during peak traffic hours. On the other hand, traffic may be an indicator of people’s movement patterns, reflecting business hours, cultural events, political events, and holidays.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that each algorithm answers the following format.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df = pd.read_csv('/content/drive/MyDrive/Capstone Project 2- Regression/train_revised.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# df.shape gives (rows, columns) directly and does not depend on null counts
rows, columns = df.shape
print('Rows are', rows)
print('Columns are', columns)

In [None]:
df

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
len(df[df.duplicated()])

In [None]:
# Dataset Duplicate Value Count
df.duplicated().value_counts()

***Note : No Duplicate Values Found in Dataset.***

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

***Note : No Null Values Found in Dataset.***

In [None]:
# Visualizing the missing values
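
The cell above is left empty because no nulls were found. As a minimal sketch of how the share of missing values per column could be summarised (and then plotted), the hypothetical helper `missing_share` below works on any DataFrame. The toy frame is made up for illustration and is not the Mobiticket data.

```python
import numpy as np
import pandas as pd

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return (frame.isnull().mean() * 100).sort_values(ascending=False)

# Toy data (illustrative only, not the project dataset)
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', 'y', None, 'z'],
    'c': [1, 2, 3, 4],
})
share = missing_share(toy)
print(share)
# share.plot(kind='bar') would draw the same numbers as a bar chart
```

On the project data, `missing_share(df)` would be expected to return zeros for every column, in line with the null count above.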

### What did you know about your dataset?

From the checks above:


1.   The dataset has 51645 rows and 10 columns.
2.   There are no duplicate values.
3.   There are no null/missing values.

The dataset is pretty clean so far.



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='object')

### Variables Description



* #### ride_id: unique ID of a vehicle on a specific route on a specific day and time.
* #### seat_number: seat assigned to ticket
* #### payment_method: method used by customer to purchase ticket from Mobiticket (cash or Mpesa)
* #### payment_receipt: unique id number for ticket purchased from Mobiticket
* #### travel_date: date of ride departure. (MM/DD/YYYY)
* #### travel_time: scheduled departure time of ride. Rides generally depart on time. (hh:mm)
* #### travel_from: town from which ride originated
* #### travel_to: destination of ride. All rides are to Nairobi.
* #### car_type: vehicle type (shuttle or bus)
* #### max_capacity: number of seats on the vehicle

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.describe(include='object').iloc[1,:]

In [None]:
# Unique value count for every column (numeric columns included)
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df.head()

### What all manipulations have you done and insights you found?

**The dataset does not include the target variable, so we need to derive it first.**

There are several ways to do this. Here we count the tickets recorded for each ride_id and use that count, number_of_ticket, as the target variable.

In [None]:
#Checking number of unique ride ids
len(df['ride_id'].unique())

In [None]:
# Making target variable 'number of ticket' by grouping the dataframe on ride_id column
temp_df = df.groupby('ride_id')['seat_number'].count().reset_index()
temp_df.rename(columns = {'seat_number':'number_of_ticket'},inplace=True)
temp_df.head()

In [None]:
# Dropping columns that are not needed for modelling

df.drop(['seat_number','payment_method','payment_receipt','travel_to'],axis=1, inplace=True)
df.shape

In [None]:
# Dropping duplicates so that only one row per ride_id remains
df.drop_duplicates('ride_id',inplace=True)
df.shape

In [None]:
# Merging this new dataframe with the original dataframe on column ride_id
df = df.merge(temp_df, how='left', on='ride_id')
df.head()

In [None]:
#Creating a column date_time which is a combination of columns travel_date and travel_time
df['date_time'] = pd.to_datetime(df['travel_date'] +" "+ df['travel_time'])
df['travel_date'] = pd.to_datetime(df['travel_date'])

In [None]:
df_copy = df.copy()

In [None]:
df_copy

In [None]:
#Creating additional features from the travel_date and travel_time columns

#Creating a function to add the above features
def create_date_cols(df_temp):
  df_temp['travel_month'] = df_temp['travel_date'].dt.month
  df_temp['travel_year'] = df_temp['travel_date'].dt.year
  df_temp['travel_day_of_month'] = df_temp['travel_date'].dt.day
  df_temp['travel_day_of_year'] = df_temp['travel_date'].dt.dayofyear
  df_temp['travel_day_of_week'] = df_temp['travel_date'].dt.dayofweek
  df_temp['travel_hour'] = pd.to_datetime(df_temp['travel_time']).dt.hour
  df_temp['quarter'] = df_temp['travel_date'].dt.quarter
  df_temp['is_weekend'] = df_temp['travel_day_of_week'].apply(lambda x: 1 if x in [5,6] else 0)

  return df_temp

#Applying function on our dataframe
df_copy = create_date_cols(df_copy)

In [None]:
df_copy.columns

In [None]:
df_copy.head()

In [None]:
#Converting travel_time column into decimal hours (e.g. 7:15 becomes 7.25)
df_copy['travel_time'] = df_copy['travel_time'].str.split(':').apply(lambda x: round(int(x[0]) + int(x[1])/60 ,2) )

In [None]:
#Creating function that defines periods for time intervals
def get_period(hour):
  if hour<7: return 'em'
  elif hour>=7 and hour<=11: return 'mor'
  elif hour>11 and hour<=15: return 'an'
  elif hour>15 and hour<=19: return 'evn'
  elif hour>19 and hour<=24: return 'nght'

df_copy['time_period_of_day'] = df_copy['travel_hour'].apply(get_period)
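
As a hedged alternative to the row-wise `apply`, the same binning can be sketched with `pd.cut`. The names `PERIOD_BINS`, `PERIOD_LABELS` and `periods_from_hours` are hypothetical, and the bin edges are chosen on the assumption that hours are whole numbers from 0 to 23, as produced by `dt.hour`.

```python
import pandas as pd

# Bin edges chosen to match get_period for whole hours 0-23:
# (-1, 6] -> 'em', (6, 11] -> 'mor', (11, 15] -> 'an', (15, 19] -> 'evn', (19, 24] -> 'nght'
PERIOD_BINS = [-1, 6, 11, 15, 19, 24]
PERIOD_LABELS = ['em', 'mor', 'an', 'evn', 'nght']

def periods_from_hours(hours: pd.Series) -> pd.Series:
    """Vectorised period-of-day labels for integer hours."""
    return pd.cut(hours, bins=PERIOD_BINS, labels=PERIOD_LABELS).astype(str)

hours = pd.Series([5, 7, 11, 12, 19, 23])
labels = periods_from_hours(hours).tolist()
print(labels)
```

Keeping the edges and labels in two lists makes it easy to see at a glance which hours fall into which period.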

In [None]:
df_copy.head()

**What all manipulations have you done and insights you found?**

1. Created target variable 'number_of_ticket'
2. Dropped constant and non essential columns
3. Used travel_date and travel_time columns to extract and create datetime related features
4. Created period feature from travel time for data visualization

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Histogram for distribution of tickets
plt.figure(figsize=(10,6))
sns.histplot(df_copy['number_of_ticket'], color='orange')
plt.title("Distribution of tickets")

##### 1. Why did you pick the specific chart?

A histogram gives a clear, intuitive picture of how the number of tickets per ride is distributed.

##### 2. What is/are the insight(s) found from the chart?

For most rides, the number of tickets bought per ride_id is between 1 and 12.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

No insight here points to negative growth. The distribution is right-skewed, meaning most rides sell few tickets and a small number of rides sell many.

In [None]:
import warnings
warnings.filterwarnings('ignore')

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#Bar chart for types of transport used
plt.figure(figsize=(10,6))
df_copy['car_type'].value_counts().plot(kind='bar',color='teal')
plt.title("Types of transport used")
plt.xlabel('Car Type')
plt.ylabel('Count')

##### 1. Why did you pick the specific chart?

A bar chart is the most direct way to compare counts across categories.

##### 2. What is/are the insight(s) found from the chart?

The numbers of buses and shuttles in the data are nearly equal, so both vehicle types are used about equally for travel.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

No, this insight on its own has no direct business impact.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

#Bar chart for max capacity of each vehicle type
plt.figure(figsize=(10,6))
sns.barplot(data=df_copy, x='car_type', y='max_capacity', color='red')
plt.title("Max Capacities of Transport")
plt.xlabel('Car Type')
plt.ylabel('Max Capacity')

##### 1. Why did you pick the specific chart?

A bar chart is the most direct way to compare a value across categories.

##### 2. What is/are the insight(s) found from the chart?

As expected, buses have a larger capacity than shuttles.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It helps decide which vehicle capacity suits a given route.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

#Bar chart for average number of tickets per ride from each origin place
#(sns.barplot plots the mean of y for each x category by default)
plt.figure(figsize=(10,6))
sns.barplot(data=df_copy,x='travel_from',y='number_of_ticket', palette='viridis')
plt.xticks(rotation=90)
plt.title('Average tickets per ride from each origin place')
plt.xlabel('Origin Place')
plt.ylabel('Number of Tickets')

##### 1. Why did you pick the specific chart?

sns.barplot is easy to use, produces a readable chart, and automatically computes and displays a summary statistic (the mean by default) for each category.

##### 2. What is/are the insight(s) found from the chart?



The highest number of tickets per ride is sold from:

    Sirare,  Mbita, Migori
   
While the lowest number of tickets per ride is sold from:

    Keumbu, Kendu Bay

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, it shows which origins we have to focus on.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

#Monthwise Distribution of travellers

plt.figure(figsize=(10,6))
sns.histplot(df_copy['travel_month'],bins=12, color='teal')
plt.title('Month wise distribution of travel')
plt.xlabel('Months')
plt.ylabel('Count')

##### 1. Why did you pick the specific chart?

A histogram shows how often each value occurs, and seaborn provides a wide range of options for customizing its appearance, including bin size and color.

##### 2. What is/are the insight(s) found from the chart?

Most of the travelling is done in the months of January, February and December.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, we can see which months have the most travel.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

#Day of Monthwise Distribution of travellers
plt.figure(figsize=(10,6))
sns.histplot(df_copy['travel_day_of_month'],bins=12, color='teal')
plt.title('Day of month wise distribution for travelling')
plt.xlabel('Days of Month')
plt.ylabel('Count')

##### 1. Why did you pick the specific chart?

A histogram shows how often each value occurs, and seaborn provides a wide range of options for customizing its appearance, including bin size and color.

##### 2. What is/are the insight(s) found from the chart?

Most of the travelling is done before the 5th of the month, and there seems to be no travel between the 5th and the 11th. This could be because of a transport holiday during this period every month.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, it adds to our understanding of when people travel.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

#Scatterplot of number of tickets sold for everyday of the month
plt.figure(figsize=(10,6))
sns.scatterplot(data=df_copy, x='travel_day_of_month',y='number_of_ticket',color='teal')
plt.title('Number of tickets for every day of month')

##### 1. Why did you pick the specific chart?

Scatter plots are excellent for visualizing the relationship between two variables.

##### 2. What is/are the insight(s) found from the chart?

Similar to the graph above, we can see that no tickets are sold between the 5th and 11th of any month. Transport may be closed during this period every month because of transport holidays.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, because holidays also affect sales.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

#Scatterplot for number of tickets in every hour of day.
plt.figure(figsize=(10,6))
sns.scatterplot(data=df_copy, x='travel_hour',y='number_of_ticket',color='teal')
plt.title('Number of tickets for each hour of day')

##### 1. Why did you pick the specific chart?

Scatter plots are excellent for visualizing the relationship between two variables.

##### 2. What is/are the insight(s) found from the chart?

During the day, most tickets are sold at 7 AM and close to 7 PM. This may be because people travel to and return from work in Nairobi at these times. There are no tickets sold between 12 PM and 5:30 PM.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, departure time also has to be taken into account.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

#Bar chart for Number of tickets for each period of day
plt.figure(figsize=(10,6))
sns.barplot(data=df_copy, x='time_period_of_day',y='number_of_ticket', palette='viridis')
plt.title('Number of tickets for each period of day')

##### 1. Why did you pick the specific chart?

sns.barplot allows DataFrame columns to be used directly for plotting, which makes it convenient for data analysis workflows.

##### 2. What is/are the insight(s) found from the chart?

The most tickets are sold in the evening, followed by the morning hours.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the time of day should also be taken into account.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

plt.figure(figsize=(12,5))
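# numeric_only=True skips text and datetime columns, which recent pandas versions refuse to correlate
sns.heatmap(df_copy.corr(numeric_only=True),annot=True, cmap='viridis')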

##### 1. Why did you pick the specific chart?

sns.heatmap draws a heatmap of the given data, which is a convenient way to visualize patterns using colors. Here it shows the pairwise correlations between the numeric features.

##### 2. What is/are the insight(s) found from the chart?

We can see that is_weekend is strongly correlated with travel_day_of_week, which is expected because the former is derived from the latter.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

sns.pairplot(df_copy, diag_kind='kde')
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot is a handy tool for gaining a quick and comprehensive overview of the relationships and distributions within a dataset.

##### 2. What is/are the insight(s) found from the chart?

We can see that the weekend flag and the day of the week matter for the number of tickets, and that ticket counts peak at particular hours of the day.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**Based on my observations so far, I propose the following three statements:**
    
There are no travel activities taking place during the afternoon.

The quantity of buses utilized for travel matches the count of shuttles used.

The highest ticket sales occur during the morning.



### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis** : (H0) = No travel activity in the afternoon

**Alternative Hypothesis** : (H1) = Travel activity in the afternoon

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
sns.scatterplot(data=df_copy, x='travel_hour',y='number_of_ticket',color='teal')
plt.title('Number of tickets for each hour of day')

##### Which statistical test have you done to obtain P-Value?

From the chart above, there are no travellers between 12 PM and 4 PM (afternoon), i.e. the number of tickets sold is 0.

Hence Na (the number of afternoon rides) = 0, and we cannot reject the null hypothesis.

##### Why did you choose the specific statistical test?

A graphical representation is sufficient to see the result here. Note that it is a visual check rather than a formal test, so no p-value is computed.
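
The visual check can be backed by a direct count. The sketch below is only an illustration: `count_rides_between` is a hypothetical helper, the toy hours are made up, and the afternoon is assumed to span whole hours 12 to 15 inclusive (12 PM up to 4 PM).

```python
import pandas as pd

def count_rides_between(hours: pd.Series, start: int, end: int) -> int:
    """Number of rides whose departure hour h satisfies start <= h < end (whole hours)."""
    return int(hours.between(start, end - 1).sum())

# Toy departure hours (illustrative only)
toy_hours = pd.Series([5, 7, 7, 10, 19, 23])
n_afternoon = count_rides_between(toy_hours, 12, 16)
n_morning = count_rides_between(toy_hours, 7, 11)
print(n_afternoon, n_morning)
```

On the project data, `count_rides_between(df_copy['travel_hour'], 12, 16)` would give the value referred to as Na above.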

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Statement: The number of buses used for travel is the same as the number of shuttles used. Let the number of buses used be Nb, and the number of shuttles used be Ns.

Null Hypothesis (H0) : Nb = Ns

Alternative Hypothesis (Ha) : Nb != Ns


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

df_copy['car_type'].value_counts().plot(kind='bar',color='teal')
plt.title("Types of transport used")
plt.xlabel('Car Type')
plt.ylabel('Count')

##### Which statistical test have you done to obtain P-Value?

From the chart above, Nb > Ns, hence we reject the null hypothesis.

##### Why did you choose the specific statistical test?

A graphical representation is sufficient to see the result here. Note that it is a visual check rather than a formal test, so no p-value is computed.
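
If a p-value is wanted for this statement, one option is a chi-square goodness-of-fit test of the two counts against equal shares. This is a sketch under assumptions, not the test used in this notebook: `equal_share_test` is a hypothetical helper and the 60/40 counts are made up.

```python
import pandas as pd
from scipy import stats

def equal_share_test(car_types: pd.Series):
    """Chi-square goodness-of-fit test of H0: all vehicle types are equally frequent."""
    counts = car_types.value_counts()
    # With no expected frequencies given, chisquare assumes equal shares
    statistic, p_value = stats.chisquare(counts.to_numpy())
    return counts, statistic, p_value

# Toy data (illustrative only)
toy_types = pd.Series(['Bus'] * 60 + ['shuttle'] * 40)
counts, statistic, p_value = equal_share_test(toy_types)
print(counts.to_dict(), round(statistic, 3), round(p_value, 4))
```

A p-value below the chosen significance level (commonly 0.05) would support rejecting Nb = Ns; on the project data the call would be `equal_share_test(df_copy['car_type'])`.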

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

The most tickets are sold in the morning.

Null Hypothesis : The most tickets are sold in the morning

Alternative Hypothesis : The most tickets are not sold in the morning

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

sns.barplot(data=df_copy, x='time_period_of_day',y='number_of_ticket', palette='viridis')
plt.title('Number of tickets for each period of day')

##### Which statistical test have you done to obtain P-Value?

From the chart above, the most tickets are sold in the evening, hence we reject the null hypothesis.

##### Why did you choose the specific statistical test?

The bar chart is a visual comparison of the periods of the day rather than a formal test, so no p-value is computed.
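
One way to attach a p-value to this comparison is a one-sided Mann-Whitney U test on tickets per ride, evening versus morning. The sketch below swaps in that test as an assumption (it reframes the hypothesis as "evening rides sell more than morning rides"); `evening_vs_morning` is a hypothetical helper and the toy numbers are made up.

```python
import pandas as pd
from scipy import stats

def evening_vs_morning(frame: pd.DataFrame) -> float:
    """One-sided Mann-Whitney U test; H1: evening rides sell more tickets than morning rides."""
    evening = frame.loc[frame['time_period_of_day'] == 'evn', 'number_of_ticket']
    morning = frame.loc[frame['time_period_of_day'] == 'mor', 'number_of_ticket']
    result = stats.mannwhitneyu(evening, morning, alternative='greater')
    return result.pvalue

# Toy data (illustrative only)
toy = pd.DataFrame({
    'time_period_of_day': ['mor'] * 6 + ['evn'] * 6,
    'number_of_ticket': [1, 2, 1, 3, 2, 1, 8, 9, 10, 7, 11, 9],
})
p_value = evening_vs_morning(toy)
print(p_value)
```

The rank-based test is chosen here because the ticket counts are right-skewed, so a test that assumes normality may be less appropriate.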

## ***6. Feature Engineering & Data Pre-processing***

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

df_copy.travel_month.value_counts()

In [None]:
df_copy.travel_day_of_year.value_counts()

In [None]:
df_copy.travel_day_of_month.value_counts()

In [None]:
df_copy.time_period_of_day.value_counts()

From the 4 value counts above, we can see:
1. Some months have a higher frequency of travel than others.
2. Some days of the year have a very high frequency of travel, while others have a really low frequency. This is because more travelling is done in some months than in others.
3. Some days of the month have a higher frequency of travel than others.
4. Some periods of the day have a higher frequency of travel than others.

To manage this, we will build frequency-based weights for the 4 columns above and create 4 new columns: a log transform (np.log1p) of the frequency for the period of day and the day of year, and hand-assigned frequency bands for the day of month and the month.
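
The log-frequency idea can be sketched on a toy column as follows. `add_log_frequency` is a hypothetical helper and the three-row frame is made up; it only illustrates mapping each value to `log1p` of its count.

```python
import numpy as np
import pandas as pd

def add_log_frequency(frame: pd.DataFrame, column: str, new_column: str) -> pd.DataFrame:
    """Return a copy with log1p(number of rows sharing the same value) as a new column."""
    frequency = frame[column].value_counts()
    out = frame.copy()
    out[new_column] = np.log1p(out[column].map(frequency))
    return out

# Toy data (illustrative only): 'mor' appears twice, 'evn' once
toy = pd.DataFrame({'time_period_of_day': ['mor', 'mor', 'evn']})
encoded = add_log_frequency(toy, 'time_period_of_day', 'period_weight')
print(encoded)
```

Taking `log1p` compresses large counts, so a very busy period does not dominate the feature scale.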

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

# Weight for each period of the day: log1p of the number of rides in that period
period_dict = dict(df_copy.time_period_of_day.value_counts())
df_copy['travel_hour_wise_weights'] = np.log1p(df_copy.time_period_of_day.map(period_dict))

# Weight for each day of the year: log1p of the number of rides on that day
day_of_year_dict = dict(df_copy.travel_day_of_year.value_counts())
df_copy['travel_day_of_year_wise_weights'] = np.log1p(df_copy.travel_day_of_year.map(day_of_year_dict))

# Giving weights to each day of the month based on the frequency of ticket bookings
day_of_month_wise_weights_dict = {2:1, 12:1, 3:1, 4:2, 1:3, 13:3, 14:3, 16:3, 28:3, 19:3, 18:3, 15:3, 17:3, 20:3, 22:4, 21:4, 27:4, 29:4, 23:4, 24:4, 26:4, 30:4, 25:4, 31:4}
df_copy['travel_day_of_month_wise_weights'] = df_copy.travel_day_of_month.replace(day_of_month_wise_weights_dict)

# Creating a column that gives a weight to each month of the year based on the frequency of ticket bookings
travel_month_wise_weights_dict = {12: 1,
 2: 1,
 1: 1,
 3: 1,
 4: 1,
 11: 2,
 9: 3,
 7: 3,
 8: 3,
 10: 3,
 6: 3,
 5: 3}
df_copy['travel_month_wise_weights'] = df_copy.travel_month.replace(travel_month_wise_weights_dict)

In [None]:
df_copy.head(3)

In [None]:
# Creating a method to create new features in df

# Creating columns for time difference between next and previous buses for each of the origin places (travel_from).
def find_difference_bw_bus(data):

  data.sort_values(["travel_from","date_time"],inplace=True,ascending=True)
  data["Time_gap_btw_0_1_next_bus"]=(data["date_time"]-data.groupby(["travel_from"]).date_time.shift(-1)).dt.total_seconds()/3600
  data["Time_gap_btw_0_1_previous_bus"]=(data["date_time"]-data.groupby(["travel_from"]).date_time.shift(1)).dt.total_seconds()/3600
  data["Time_gap_btw_0_2_next_bus"]=(data["date_time"]-data.groupby(["travel_from"]).date_time.shift(-2)).dt.total_seconds()/3600
  data["Time_gap_btw_0_2_previous_bus"]=(data["date_time"]-data.groupby(["travel_from"]).date_time.shift(2)).dt.total_seconds()/3600
  data["Time_gap_btw_0_3_next_bus"]=(data["date_time"]-data.groupby(["travel_from"]).date_time.shift(-3)).dt.total_seconds()/3600
  data["Time_gap_btw_0_3_previous_bus"]=(data["date_time"]-data.groupby(["travel_from"]).date_time.shift(3)).dt.total_seconds()/3600
  data["Time_gap_btw_next_previous_bus"]=(data.groupby(["travel_from"]).date_time.shift(-1)-data.groupby(["travel_from"]).date_time.shift(1)).dt.total_seconds()/3600
  cols=["Time_gap_btw_0_1_next_bus", "Time_gap_btw_0_1_previous_bus", "Time_gap_btw_0_2_next_bus","Time_gap_btw_0_2_previous_bus",
      "Time_gap_btw_0_3_next_bus", "Time_gap_btw_0_3_previous_bus",
      "Time_gap_btw_next_previous_bus"]

  #Handling missing values at the start/end of each origin's schedule (forward fill, then backward fill)
  data[cols]=data.groupby(["travel_from"])[cols].ffill()
  data[cols]=data.groupby(["travel_from"])[cols].bfill()


  return data

In [None]:
transport_data_new = find_difference_bw_bus(df_copy)

In [None]:
transport_data_new.groupby(["travel_from"]).date_time.shift(-1)

In [None]:
transport_data_new[['travel_from','date_time','Time_gap_btw_0_1_next_bus','Time_gap_btw_0_1_previous_bus']].head()
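
To make the grouped `shift` logic easier to follow, here is a small sketch on made-up departures (the towns are real origins, but the timestamps are invented for illustration). It computes only the gap to the previous bus from the same origin; `hours_since_previous_bus` is a hypothetical column name.

```python
import pandas as pd

# Toy schedule (illustrative timestamps only)
toy = pd.DataFrame({
    'travel_from': ['Kisii', 'Kisii', 'Kisii', 'Migori', 'Migori'],
    'date_time': pd.to_datetime([
        '2018-01-01 07:00', '2018-01-01 09:30', '2018-01-01 19:00',
        '2018-01-01 06:00', '2018-01-02 06:00',
    ]),
})
toy = toy.sort_values(['travel_from', 'date_time'])

# Previous departure within the same origin; NaT for the first bus of each origin
previous = toy.groupby('travel_from')['date_time'].shift(1)
toy['hours_since_previous_bus'] = (toy['date_time'] - previous).dt.total_seconds() / 3600
gaps = toy['hours_since_previous_bus'].tolist()
print(toy)
```

The first bus of each origin has no predecessor, which is why the function above fills such gaps forward and backward within each origin.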

In [None]:
# Making a dictionary containing distances of originating places from nairobi, taken from google maps
distance_from_nairobi = {'Awendo':351, 'Homa Bay':360, 'Kehancha': 387.7, 'Keroka': 280, 'Keumbu':295, 'Kijauri':271,
                         'Kisii':305.1, 'Mbita':401, 'Migori': 370, 'Ndhiwa': 371, 'Nyachenge':326, 'Rodi':348, 'Rongo':332,
                         'Sirare':392, 'Sori':399}

transport_data_new['distance_to_destination'] = transport_data_new['travel_from'].map(distance_from_nairobi)

In [None]:
transport_data_new.columns

##### What all feature selection methods have you used  and why?

I used NumPy functions such as np.log1p, which returns the natural logarithm of one plus the input array, element-wise.

I also used pandas DataFrame operations to separate and select columns.

##### Which all features you found important and why?

The raw columns alone are not very informative, so extra features had to be derived for the ML algorithms, such as the time gaps between consecutive buses from the same origin and the distance to Nairobi.

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
transport_data_new.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

A basic check is enough here, and it shows that some missing values are present. The affected rows are dropped rather than imputed.

In [None]:
transport_data_new.dropna(inplace=True)
transport_data_new.isnull().sum()

In [None]:
transport_data_new.info()

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

def find_outliers_IQR(series):
   # Values beyond 1.5 * IQR from the quartiles are flagged as outliers

   q1=series.quantile(0.25)

   q3=series.quantile(0.75)

   IQR=q3-q1

   outliers = series[((series<(q1-1.5*IQR)) | (series>(q3+1.5*IQR)))]

   return outliers


In [None]:
od = find_outliers_IQR(transport_data_new['ride_id'])
print("max outlier value: "+ str(od.max()))
print("min outlier value: "+ str(od.min()))
od

##### What all outlier treatment techniques have you used and why did you use those techniques?

No outlier treatment was applied. Every ride is treated as a legitimate observation, so no rows are removed beyond those with missing values, which were already dropped.
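
The IQR rule is usually more informative on a measured quantity than on an identifier. As a sketch under that assumption, the snippet below applies the same 1.5 * IQR fences to a made-up series of ticket counts; `iqr_bounds` is a hypothetical helper.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy ticket counts (illustrative only)
toy_tickets = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 50])
lower, upper = iqr_bounds(toy_tickets)
flagged = toy_tickets[(toy_tickets < lower) | (toy_tickets > upper)]
print(lower, upper, flagged.tolist())
```

Whether flagged rides should be removed is a modelling decision; a full bus can be a real and valuable observation.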

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

from sklearn import preprocessing
final_df = pd.get_dummies(df_copy, columns = ['travel_from','travel_day_of_month_wise_weights','travel_month_wise_weights'])

In [None]:
#Label encoding car_type column (mapping only this column, so other columns are left untouched)
label_encoder = {'Bus':1,'shuttle':0}
final_df['car_type'] = final_df['car_type'].map(label_encoder)
final_df.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

We used one-hot encoding on a few categorical features and label encoding on car_type.

### 4. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

#Separating target variable and independent feature variables
cols_to_drop = ['ride_id','travel_date','travel_time','max_capacity','travel_year','number_of_ticket','time_period_of_day','date_time','travel_month','travel_day_of_month','travel_day_of_year','travel_hour']
X = final_df.drop(cols_to_drop,axis=1)
X.shape

In [None]:
#Target Variable
y = final_df['number_of_ticket'].values
y.shape

In [None]:
# Split your data to train and test.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train.shape

In [None]:
y_train.shape

##### What data splitting ratio have you used and why?



```
# I use an 80%-20% train-test split, a common default that keeps most of the data for training while holding out enough for evaluation.
```



### 5. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

`As we have only integer columns and the remaining object columns are already in good shape, this step is not needed.`

#### 1. Expand Contraction

In [None]:
# Expand Contraction

`As we have only integer columns and the remaining object columns are already in good shape, this step is not needed.`

#### 2. Lower Casing

In [None]:
# Lower Casing

`As we have only integer columns and the remaining object columns are already in good shape, this step is not needed.`

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

`As we have only integer columns and the remaining object columns are already in good shape, this step is not needed.`

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

`There are no URLs in our clean dataset.`

#### 5. Removing Stopwords & Removing White spaces

`As we have only integer columns and the remaining object columns are already in good shape, this step is not needed.`

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

`As we have only integer columns and the remaining object columns are already in good shape, this step is not needed.`

In [None]:
# Rephrase Text

#### 7. Tokenization

`As we have only integer columns and the remaining object columns are already in good shape, this step is not needed.`

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

`As we have only integer columns and the remaining object columns are already in good shape, this step is not needed.`

### 6. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

No, we don't need any further transformation at this stage.

## ***7. ML Model Implementation***

In [None]:
#Importing required libraries
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error,mean_absolute_percentage_error
import math

In [None]:
#Creating function for evaluation metrics
def evaluate_metric(actual,predicted):
  print('MSE is {}'.format(mean_squared_error(actual, predicted)))
  print('RMSE is {}'.format(math.sqrt(mean_squared_error(actual, predicted))))
  print('MAE is {}'.format(mean_absolute_error(actual, predicted)))
  print('MAPE is {}'.format(np.mean(np.abs((actual - predicted) / actual)) * 100))
  print('R2 Score is {}'.format(r2_score(actual, predicted)))

### ML Model - 1 (Linear Regression Algorithm)

In [None]:
# ML Model - 1 Implementation
regressor = LinearRegression()

# Fit the Algorithm
regressor.fit(X_train,y_train)

# Predict on the model

y_train_pred = regressor.predict(X_train)

y_test_pred = regressor.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#Train data evaluation
evaluate_metric(y_train,y_train_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_train), (y_train_pred)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))

In [None]:
evaluate_metric(y_test,y_test_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_test), (y_test_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))

**Linear Regression does not give good results on this data, so regularized linear regression models are tried next.**

### ML Model - 2 - (***Lasso Algorithm***)

In [None]:
# ML Model - 2 Implementation
lasso  = Lasso(alpha=0.1 , max_iter= 3000)

# Fit the Algorithm
lasso.fit(X_train, y_train)

# Predict on the model
y_train_pred = lasso.predict(X_train)

y_test_pred = lasso.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:

#Train data evaluation
evaluate_metric(y_train,y_train_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_train), (y_train_pred)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))

In [None]:
#Test data evaluation
evaluate_metric(y_test,y_test_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_test), (y_test_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
lasso = Lasso()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='neg_mean_squared_error', cv=3)


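# Fit the Algorithm on the training split only, so the test rows stay unseen during tuning
lasso_regressor.fit(X_train, y_train)

# Report the best hyperparameter and its cross-validated score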

print("The best fit alpha value is found out to be :" ,lasso_regressor.best_params_)
print("\nUsing ",lasso_regressor.best_params_, " the negative mean squared error is: ", lasso_regressor.best_score_)

# Predict on the model
y_train_pred = lasso_regressor.predict(X_train)

y_test_pred = lasso_regressor.predict(X_test)

In [None]:
#Train data evaluation
evaluate_metric(y_train,y_train_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_train), (y_train_pred)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))

In [None]:
#Test data evaluation
evaluate_metric(y_test,y_test_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_test), (y_test_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used to tune the hyperparameter: it scores every candidate alpha with cross-validation and keeps the best one. The best alpha found was {'alpha': 0.01}.
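
The idea behind a cross-validated grid search can be sketched in plain Python. This is an illustration under simplifying assumptions, not GridSearchCV's implementation: it uses a one-feature ridge fit with no intercept (chosen because it has a closed form), interleaved folds, and made-up training values. The helper names `fit_ridge_1d` and `cv_mse` are hypothetical.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Closed-form ridge slope for one feature and no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def cv_mse(xs, ys, alpha, k=3):
    """Mean validation MSE of the ridge fit over k interleaved folds."""
    fold_scores = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        valid = [i for i in range(len(xs)) if i % k == fold]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], alpha)
        fold_scores.append(sum((ys[i] - w * xs[i]) ** 2 for i in valid) / len(valid))
    return sum(fold_scores) / k

# Hypothetical training rows only: y is roughly 2 * x
x_train = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y_train = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0]

# Score every candidate alpha by cross-validation and keep the lowest error
alphas = [0.001, 0.01, 0.1, 1, 10, 100]
scores = {alpha: cv_mse(x_train, y_train, alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)
print("best alpha:", best_alpha)
```

Only training rows enter the search here, so a separate test set would remain an unbiased check of the chosen alpha.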

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

No. The model's accuracy decreased overall after tuning, and it still does not give good results.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

A conclusion can be drawn only after evaluating the remaining algorithms.

### ML Model - 3 (***Ridge Algorithm***)

In [None]:
# ML Model - 3 Implementation
ridge = Ridge()

# Fit the Algorithm
ridge.fit(X_train,y_train)

# Predict on the model
y_train_pred = ridge.predict(X_train)

y_test_pred = ridge.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric.

In [None]:
#Train data evaluation
evaluate_metric(y_train,y_train_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_train), (y_train_pred)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))

In [None]:
#Test data evaluation
evaluate_metric(y_test,y_test_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_test), (y_test_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques
ridge = Ridge()
parameters = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100]}
ridge_regressor = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=3)

# Fit the Algorithm on the training split only, so the test rows stay unseen during tuning
ridge_regressor.fit(X_train, y_train)

# Predict on the model
y_train_pred = ridge_regressor.predict(X_train)

y_test_pred = ridge_regressor.predict(X_test)

In [None]:
#Train data evaluation
evaluate_metric(y_train,y_train_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_train), (y_train_pred)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))

In [None]:
#Test data evaluation
evaluate_metric(y_test,y_test_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_test), (y_test_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))

##### Which hyperparameter optimization technique have you used and why?

`GridSearchCV was used to tune the hyperparameter. The best alpha found was {'alpha': 1}.`

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

`GridSearchCV had little effect on Ridge (L2) regression: the model accuracy barely changed and overfitting is still present.`

### *Since the linear models did not give the required results, non-linear regression models are tried next.*
---

### ML Model - 4 (***Random Forest Algorithm***)

In [None]:
# ML Model - 4 Implementation
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()

# Fit the Algorithm
rfr.fit(X_train,y_train)

# Predict on the model
y_train_pred = rfr.predict(X_train)

y_test_pred = rfr.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
#Train data evaluation
evaluate_metric(y_train,y_train_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_train), (y_train_pred)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))

In [None]:
#Test data evaluation
evaluate_metric(y_test,y_test_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_test), (y_test_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))

**The model appears to overfit the training data, so hyperparameter tuning is tried next.**

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
from pprint import pprint
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rfr.get_params())


In [None]:
# ML Model - 4 Implementation with hyperparameter optimization techniques
n_estimators = [int(x) for x in np.linspace(start = 400, stop = 1000, num = 4)]

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(40, 100, num = 4)]
max_depth.append(None)

# Create the parameters grid
grid_params_dict = {'n_estimators': n_estimators,
               'max_depth': max_depth
                    }
print(grid_params_dict)

In [None]:
rfr = RandomForestRegressor()

# Grid Search of parameters, using 3 fold cross validation,
rf_gridCV = GridSearchCV(estimator = rfr, param_grid = grid_params_dict, cv = 3, verbose=2, n_jobs = -1)

# Fit the Algorithm on the training split only, so the test rows stay unseen during tuning
rf_gridCV.fit(X_train, y_train)


print(rf_gridCV.best_params_)

In [None]:
rf_gridCV.best_estimator_

In [None]:
#Taking best params and creating a new regressor
rf_grid_optimal_model =rf_gridCV.best_estimator_

# Predict on the model
y_train_pred = rf_grid_optimal_model.predict(X_train)

y_test_pred = rf_grid_optimal_model.predict(X_test)

In [None]:
#Train data evaluation
evaluate_metric(y_train,y_train_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_train), (y_train_pred)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))

In [None]:
#Test data evaluation
evaluate_metric(y_test,y_test_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_test), (y_test_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used to tune the hyperparameters. The best values found were:
{'max_depth': 40, 'n_estimators': 400}
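
For a sense of the search cost, the sketch below enumerates the Random Forest grid defined in this section and counts the model fits that 3-fold cross-validation requires. The value lists are written out by hand from the `np.linspace` calls in the notebook; the variable names `combinations` and `total_fits` are introduced only for this example.

```python
from itertools import product

n_estimators = [400, 600, 800, 1000]   # np.linspace(400, 1000, num=4) cast to int
max_depth = [40, 60, 80, 100, None]    # np.linspace(40, 100, num=4) plus None
cv_folds = 3

# Every pairing of the two hyperparameters is one candidate model
combinations = list(product(n_estimators, max_depth))
total_fits = len(combinations) * cv_folds
print(len(combinations), "combinations,", total_fits, "cross-validation fits")
```

Each extra hyperparameter multiplies this count, which is why the grid here is kept to two parameters.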

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

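In the recorded run, GridSearchCV raised the training accuracy slightly and the test accuracy substantially, so the overfitting seen with the default model on the train-test split was reduced. This comparison is only fair when the search is fit on the training split alone; a search fit on the full dataset also sees the test rows, which inflates the test score.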

### ML Model - 5 (***XGBoost Algorithm***)

In [None]:
# ML Model - 5 Implementation
import xgboost as xgb
xgbr = xgb.XGBRegressor()

# Fit the Algorithm
xgbr.fit(X_train,y_train)

# Predict on the model
y_train_pred = xgbr.predict(X_train)

y_test_pred = xgbr.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
#Train data evaluation
evaluate_metric(y_train,y_train_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_train), (y_train_pred)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))

In [None]:
#Test data evaluation
evaluate_metric(y_test,y_test_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_test), (y_test_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))

**The model appears to overfit the training data, so hyperparameter tuning is tried next.**

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 5 Implementation with hyperparameter optimization techniques
n_estimators = [int(x) for x in np.linspace(start = 400, stop = 1000, num = 4)]

max_depth= [6, 8, 10, 12]
min_child_weight= [7, 8, 10, 12]

# Create the parameter grid (n_estimators defined above is not included in the search)
xgb_grid_params_dict = {
         'max_depth': max_depth,
         'min_child_weight': min_child_weight
                         }
print(xgb_grid_params_dict)


xgbr = xgb.XGBRegressor(objective='reg:squarederror', random_state = 3)

# Grid Search of parameters, using 3 fold cross validation,
xgbr_grid = GridSearchCV(estimator = xgbr, param_grid = xgb_grid_params_dict, cv = 3, verbose=2, n_jobs = -1)

# Fit the Algorithm on the training split only, so the test rows stay unseen during tuning
xgbr_grid.fit(X_train, y_train)


print(xgbr_grid.best_params_)

In [None]:
xgbr_grid.best_estimator_

In [None]:
#Taking best params and creating a new regressor
xgbr_optimal_model =xgbr_grid.best_estimator_

# Predict on the model
y_train_pred = xgbr_optimal_model.predict(X_train)

y_test_pred = xgbr_optimal_model.predict(X_test)

In [None]:
#Train data evaluation
evaluate_metric(y_train,y_train_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_train), (y_train_pred)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))

In [None]:
#Test data evaluation
evaluate_metric(y_test,y_test_pred)
print("Adjusted R2 : ",1-(1-r2_score((y_test), (y_test_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used to tune the hyperparameters. The best values found were {'max_depth': 6, 'min_child_weight': 8}.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In the recorded run, GridSearchCV raised the training accuracy slightly and the test accuracy substantially, which reduced the overfitting seen with the default XGBoost model. As with any tuned model, the test score is a fair measure only when the search is fit on the training split alone.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

The following evaluation metrics were chosen:
1. Mean Squared Error (MSE)
2. Root Mean Squared Error (RMSE)
3. Mean Absolute Error (MAE)
4. Mean Absolute Percentage Error (MAPE)
5. R2 Score
6. Adjusted R2 Score
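
As a compact reference, the six metrics can be written out in plain Python. This is a minimal sketch on made-up numbers rather than the notebook's `evaluate_metric` function; the name `regression_metrics` and the toy values are assumptions. The Adjusted R2 line uses the same formula as the notebook cells, with `n` rows and `n_features` predictors.

```python
import math

def regression_metrics(actual, predicted, n_features):
    """Return the six metrics for paired lists of actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when any actual value is zero
    mape = sum(abs(e / a) for e, a in zip(errors, actual)) / n * 100
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total variation
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 penalizes R2 for the number of predictors used
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae,
            "MAPE": mape, "R2": r2, "Adjusted R2": adj_r2}

# Hypothetical seat counts and predictions
scores = regression_metrics([10, 20, 30, 40], [12, 18, 33, 39], n_features=1)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

MSE and RMSE weight large misses more heavily than MAE does, while MAPE expresses the error relative to the actual seat count, which makes it the easiest of the six to communicate to non-technical stakeholders.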

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Final Prediction Model : Random Forest (GridSearchCV) (Params: {'max_depth': 40, 'n_estimators': 600} )

Performance on test data:
- R2 Score : 0.944
- Adjusted R2 :  0.943

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In [None]:
#Extracting most important features from final prediction model
importances = rf_grid_optimal_model.feature_importances_

In [None]:
#Creating a dictionary with all the important features
importance_dict = {'Feature' : list(X_train.columns), 'Feature Importance' : importances}

#Creating a dataframe from the dictionary of important features
importance_df = pd.DataFrame(importance_dict)

#Sorting features inside df by feature importance in descending order
important_features=importance_df.sort_values(by=['Feature Importance'],ascending=False).head(20)

In [None]:
#Creating and printing the list of important features
imp_features = important_features['Feature'].tolist()
print(f"Important features are: {imp_features}")

In [None]:
#Plotting the important features obtained from the optimal RF model
fig = plt.figure(figsize=(10,6))
sns.barplot(x = 'Feature Importance', y = 'Feature', data=important_features, palette= 'viridis')
plt.xticks(rotation=90)
plt.title('Feature Importance')

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In our project, we employed diverse regression techniques to forecast transportation demand from different locations to Nairobi.

Utilizing the available data, we engineered both the target variable and several other features crucial for enhancing our model's predictive performance.

The regression models encompassed both linear and non-linear approaches. Specifically, we utilized linear models such as :

*   Linear Regression
*   Lasso (L1)
*   Ridge (L2)

As well as non-linear models like :

*   Random Forest
*   XGBoost

To optimize the models further, we conducted **hyperparameter tuning**. After tuning, the **Random Forest model** performed best, reaching an R2 score of approximately 0.94 on the test data.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***