# Predicting Demand and Fare Dynamics in Green Taxi


**Authors**: [Alpha Guya](mailto:alpha.guya@student.moringaschool.com), [Ben Ochoro](mailto:ben.ochoro@student.moringaschool.com), [Caleb Ochieng](mailto:caleb.ochieng@student.moringaschool.com), [Christine Mukiri](mailto:christine.mukiri@student.moringaschool.com), [Dominic Muli](mailto:dominic.muli@student.moringaschool.com), [Frank Mandele](mailto:frank.mandele@student.moringaschool.com), [Jacquiline Tulinye](mailto:jacquiline.tulinye@student.moringaschool.com) and [Lesley Wanjiku](mailto:lesley.wanjiku@student.moringaschool.com)

## 1.0) Project Overview

This project analyzes a dataset of trip records generated by green taxi Technology Service Providers (TSPs) and derives insights from it. Each row in the dataset represents a single green taxi trip and records details such as pick-up and drop-off dates and times, locations, trip distance, fare, and payment type.

## 1.1) Business Problem

The taxi service providers want to optimize their operations, improve customer satisfaction, and increase revenue. To achieve these goals, they need to understand the patterns and trends in the data. Key business problems include:

1. Operational Efficiency: Identify factors affecting the duration and efficiency of taxi trips to optimize resource allocation and reduce waiting times.

2. Revenue Maximization: Analyze fare distribution and identify opportunities to increase revenue through pricing strategies or service improvements.

3. Customer Satisfaction: Understand customer preferences and behaviors to enhance the overall taxi service experience.


## 1.2) Objectives

1. Data Exploration and Cleaning: Explore the dataset to understand its structure, features, and statistical properties, and handle missing values, outliers, and inconsistent data so that the analysis is accurate.

2. Descriptive Analysis: Generate descriptive statistics and visualizations to gain insight into the distribution of key variables, and explore relationships between variables to identify potential patterns.

3. Predictive Modeling: Develop unsupervised machine learning models, such as k-means clustering, to group similar trips and uncover hidden structure in the data.

4. Feature Engineering: Extract relevant features from date/time information, locations, and other variables to enhance model performance.

5. Recommendation System: Implement a recommendation system based on historical data to suggest optimal routes, fare structures, and service improvements.

6. Performance Evaluation: Evaluate the developed models and systems using appropriate metrics, then iterate and refine them based on feedback and additional data.
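Objectives 1 and 4 can be illustrated with a small pandas sketch. The column names (`pickup_datetime`, `dropoff_datetime`, `trip_distance`), the helper `clean_and_engineer`, and the 1.5 × IQR outlier rule are all assumptions made for illustration; the project's actual schema and thresholds are not specified here, and the rows below are hand-made toy values rather than real trips.

```python
import pandas as pd


def clean_and_engineer(trips: pd.DataFrame) -> pd.DataFrame:
    """Drop unusable rows and derive time-based features (illustrative only)."""
    df = trips.copy()

    # Parse timestamps; unparseable values become NaT and are dropped below.
    for col in ("pickup_datetime", "dropoff_datetime"):
        df[col] = pd.to_datetime(df[col], errors="coerce")

    # Missing values: drop rows lacking any field the features depend on.
    df = df.dropna(subset=["pickup_datetime", "dropoff_datetime", "trip_distance"])

    # Feature engineering from date/time information.
    duration = df["dropoff_datetime"] - df["pickup_datetime"]
    df["duration_min"] = duration.dt.total_seconds() / 60
    df["pickup_hour"] = df["pickup_datetime"].dt.hour
    df["pickup_dayofweek"] = df["pickup_datetime"].dt.dayofweek  # Monday = 0

    # Inconsistent data: a trip cannot end before it starts or cover no distance.
    df = df[(df["duration_min"] > 0) & (df["trip_distance"] > 0)]

    # Outliers: keep distances within 1.5 * IQR of the quartiles (assumed rule).
    q1, q3 = df["trip_distance"].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = df["trip_distance"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[in_range].reset_index(drop=True)


# Hand-made toy rows (hypothetical), including one missing distance,
# one trip that ends before it starts, and one extreme distance.
toy_trips = pd.DataFrame(
    {
        "pickup_datetime": [
            "2023-01-02 08:00:00", "2023-01-02 08:30:00", "2023-01-02 17:05:00",
            "2023-01-03 09:10:00", "2023-01-03 18:00:00", "2023-01-04 12:00:00",
            "2023-01-04 13:00:00", "2023-01-04 14:00:00",
        ],
        "dropoff_datetime": [
            "2023-01-02 08:15:00", "2023-01-02 08:50:00", "2023-01-02 17:20:00",
            "2023-01-03 09:05:00", "2023-01-03 18:10:00", "2023-01-04 12:30:00",
            "2023-01-04 13:12:00", "2023-01-04 14:18:00",
        ],
        "trip_distance": [2.0, 3.0, 2.5, 1.0, None, 80.0, 1.5, 3.5],
    }
)

cleaned = clean_and_engineer(toy_trips)
print(cleaned[["trip_distance", "duration_min", "pickup_hour", "pickup_dayofweek"]])
```

Dropping rows is the simplest policy; depending on how much data is affected, imputing or capping values may be a better fit.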

Research questions to guide the analysis:

1. Characterize the data and comment on data quality: Examine the basic statistics of key features such as trip distance, fare amount, and tip amount. Check for missing values, outliers, and anomalies in the dataset. Evaluate the distribution of categorical variables such as payment type, trip type, and the store-and-forward flag.

2. Explore and visualize the data (e.g., a histogram of trip distance): Create visualizations of the distribution of key variables (e.g., trip distance, fare amount, tip amount) using histograms, box plots, or kernel density plots. Visualize geographical patterns by plotting pickup and drop-off locations on a map. Explore the distribution of trips over time, considering both pickup and drop-off times.

3. Find interesting trip statistics grouped by hour: Analyze and visualize trip statistics grouped by hour, such as the average trip distance, fare amount, and tip amount. Identify peak hours or time periods with higher trip demand.

4. The taxi drivers want to know what kind of trip yields better tips. Can you build a model for them and explain the model? Train a predictive model (e.g., a regression model) to estimate tip amount from relevant features such as trip distance, fare amount, and payment type. Evaluate the model's performance using appropriate metrics (e.g., Mean Absolute Error, R-squared). Explain the significant features influencing tips and provide insights for maximizing tips.

5. (Option 2) Visualize the data to help understand trip patterns: Create heatmaps or density plots of pickup and drop-off locations. Explore how trip patterns change over the course of a day or week. Identify any clusters or hotspots of frequent trips.
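As a rough illustration of the hourly statistics and the tip model described in the questions above, the sketch below groups trips by pickup hour and fits an ordinary least squares regression with NumPy. The column names (`pickup_hour`, `trip_distance`, `fare_amount`, `tip_amount`) and the helpers `hourly_summary` and `fit_tip_model` are hypothetical, and the six rows are invented toy values, so the printed numbers say nothing about real tipping behavior.

```python
import numpy as np
import pandas as pd


def hourly_summary(trips: pd.DataFrame) -> pd.DataFrame:
    """Trip count and mean distance, fare, and tip for each pickup hour."""
    return trips.groupby("pickup_hour").agg(
        n_trips=("fare_amount", "size"),
        mean_distance=("trip_distance", "mean"),
        mean_fare=("fare_amount", "mean"),
        mean_tip=("tip_amount", "mean"),
    )


def fit_tip_model(trips: pd.DataFrame, features: list) -> dict:
    """Least-squares fit of tip_amount on the given numeric features."""
    # Design matrix with a leading column of ones for the intercept.
    X = np.column_stack(
        [np.ones(len(trips)), trips[features].to_numpy(dtype=float)]
    )
    y = trips["tip_amount"].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    # In-sample metrics only; a held-out split is needed for a fair estimate.
    pred = X @ coef
    mae = float(np.mean(np.abs(y - pred)))
    r2 = float(1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2))
    return {
        "intercept": float(coef[0]),
        "coefficients": dict(zip(features, coef[1:].tolist())),
        "mae": mae,
        "r2": r2,
    }


# Invented toy trips (hypothetical values, not taken from the dataset).
toy = pd.DataFrame(
    {
        "pickup_hour": [8, 8, 8, 17, 17, 22],
        "trip_distance": [1.0, 2.0, 3.0, 2.0, 4.0, 5.0],
        "fare_amount": [6.0, 9.0, 12.0, 10.0, 15.0, 20.0],
        "tip_amount": [1.0, 2.0, 2.0, 1.5, 3.0, 5.0],
    }
)

summary = hourly_summary(toy)
peak_hour = summary["n_trips"].idxmax()  # hour with the most trips
model = fit_tip_model(toy, ["fare_amount", "trip_distance"])

print(summary)
print("peak hour:", peak_hour)
print(model)
```

The coefficient signs and magnitudes are what a linear model offers by way of explanation. A categorical feature such as payment type would need to be encoded first, for example with `pd.get_dummies`.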

## 1.3) Metric of Success

## 1.4) Data Relevance and Validation

The available data is relevant to the intended analysis and predictions.

## 2.0) Understanding the Data

The data for this project is obtained from the [Yahoo Finance website](https://finance.yahoo.com/).

## 2.1) Reading the Data

### 2.1.1) Installations

In [1]:
# installations
%pip install yfinance

Note: you may need to restart the kernel to use updated packages.


### 2.1.2) Importing Relevant Libraries

In [4]:
# Standard library
import json
import urllib.error
import urllib.request
from datetime import datetime, timedelta

# Third-party
import pandas as pd
import requests
import yfinance as yf

### 2.1.3) Reading the Data

In [7]:
# Ticker symbol to download
ticker_symbol = 'AAPL'

# Date range for the download: start of 1985 through yesterday's date
start_date = '1985-01-01'
end_date = (datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d')

# Fetch the price history (yfinance treats `end` as exclusive)
stock_data = yf.download(ticker_symbol, start=start_date, end=end_date)

# Print the five most recent rows
print(stock_data.tail())


[*********************100%%**********************]  1 of 1 completed
                  Open        High         Low       Close   Adj Close  \
Date                                                                     
2023-12-14  198.020004  199.619995  196.160004  198.110001  198.110001   
2023-12-15  197.529999  198.399994  197.000000  197.570007  197.570007   
2023-12-18  196.089996  196.630005  194.389999  195.889999  195.889999   
2023-12-19  196.160004  196.949997  195.889999  196.940002  196.940002   
2023-12-20  196.899994  197.679993  194.830002  194.830002  194.830002   

               Volume  
Date                   
2023-12-14   66831600  
2023-12-15  128256700  
2023-12-18   55751900  
2023-12-19   40714100  
2023-12-20   52242800  
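Before any cleaning, a quick structural check of the downloaded frame can be useful. The sketch below rebuilds the five rows printed above by hand so that it runs without a network call; the helper `quality_report` and the specific checks are assumptions for illustration, and the code assumes flat column names as shown in the output (some yfinance versions return multi-level columns, which would need flattening first).

```python
import pandas as pd

# The five rows printed above, re-entered by hand; in the notebook,
# pass `stock_data` to quality_report() instead.
sample = pd.DataFrame(
    {
        "Open": [198.020004, 197.529999, 196.089996, 196.160004, 196.899994],
        "High": [199.619995, 198.399994, 196.630005, 196.949997, 197.679993],
        "Low": [196.160004, 197.000000, 194.389999, 195.889999, 194.830002],
        "Close": [198.110001, 197.570007, 195.889999, 196.940002, 194.830002],
        "Adj Close": [198.110001, 197.570007, 195.889999, 196.940002, 194.830002],
        "Volume": [66831600, 128256700, 55751900, 40714100, 52242800],
    },
    index=pd.DatetimeIndex(
        ["2023-12-14", "2023-12-15", "2023-12-18", "2023-12-19", "2023-12-20"],
        name="Date",
    ),
)


def quality_report(prices: pd.DataFrame) -> dict:
    """Basic structural checks on a price frame (hypothetical helper)."""
    return {
        "rows": len(prices),
        "missing_values": int(prices.isna().sum().sum()),
        "duplicate_dates": int(prices.index.duplicated().sum()),
        "dates_sorted": bool(prices.index.is_monotonic_increasing),
        # A daily high below the daily low would indicate corrupt data.
        "high_below_low": int((prices["High"] < prices["Low"]).sum()),
    }


report = quality_report(sample)
print(report)
```

Any non-zero count in the report points to rows that deserve a closer look during data cleaning.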


## 2.2) Data Cleaning

## 2.3) EDA

## 2.4) Building Model

## 2.5) Conclusion

## 2.6) Recommendation

## 2.7) Model Deployment