<div style="text-align:right;">Ahmed Haitham 454778</div>
<div style="text-align:right;">Irena Zimovska 414522</div>
<div style="text-align:right;">Jakub Bandurski 417911</div>
<div style="text-align:right;">Pearly Tantra 455722</div>

<h2 style="text-align:center;">TAXI FARES PREDICTION</h2>
<hr>
<h3 style="text-align:center;">Project Documentation</h3>
<hr>

<div style="overflow: auto;">
    <div style="float: left; margin-right: 20px; border-right: 1px solid #ddd; padding-right: 10px; margin-bottom: 20px;">
        <img src="documentation/pictures/taxipicture.jpg" alt="Taxi Picture" style="width: 300px;">
    </div>
    <div style="text-align:justify;">
        <h3>Brief Background</h3>
        <p style="text-align:justify;">Fare forecasting is a common task in transportation and urban analytics, because reliable estimates support both transport planning and passengers' decisions. This project applies machine learning to predict taxi fares in New York from geospatial point data. It estimates the fare amount from factors such as pickup and drop-off locations, trip distance, and time of day. Machine learning models can capture the spatial and temporal patterns of taxi operations, which improves the accuracy of fare predictions.</p>
    </div>
    <hr style="clear:both;">
</div>

<div style="overflow: auto;">
    <div style="float: right; margin-right: 20px; padding-right: 10px; margin-bottom: 20px;">
        <img src="documentation/pictures/whatsnew.jpg" alt="What's New" style="width: 500px;">
    </div>
        <h3>What's New?</h3>
        <p style="text-align:justify;">In the previous study, the code was organized in Rmd files and ran linearly, without the structure needed for an efficient ETL process. In this study, we adopted a more methodical approach, using object-oriented principles and classes in Python. This made the program more adaptable and able to handle changing datasets, which the static Rmd-based code could not.</p>
    </div>
</div>

<div style="overflow: auto;">
    <div style="float: right; margin-right: 20px; padding-right: 10px; margin-bottom: 20px;">
        <img src="documentation/pictures/whatsnew1.jpg" alt="What's New 1" style="width: 500px;">
    </div>
        <p style="text-align:justify;">Beyond refining the ETL process, we aim to structure the code so that it identifies the best-performing models for taxi fare prediction, which broadens the scope of the research. We collaborated on the project using Git.</p>
    </div>
</div>
<hr>
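
To illustrate the class-based approach described above, here is a minimal sketch of a transform step wrapped in a class. The names `FareCleaner`, `min_fare`, and `max_fare` are hypothetical and are not taken from the project code; the thresholds are placeholders chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class FareCleaner:
    """Hypothetical transform step: drops rows with implausible fares."""

    min_fare: float = 0.0    # assumed lower bound, not from the project
    max_fare: float = 500.0  # assumed upper bound, not from the project

    def transform(self, rows: list[dict]) -> list[dict]:
        """Return only the rows whose fare lies in (min_fare, max_fare]."""
        return [
            row for row in rows
            if self.min_fare < row["fare_amount"] <= self.max_fare
        ]


# Illustrative records, not taken from the dataset.
rows = [
    {"fare_amount": 19.35},
    {"fare_amount": -3.0},    # invalid: negative fare
    {"fare_amount": 9999.0},  # invalid: above the assumed maximum
]
cleaned = FareCleaner().transform(rows)
print(cleaned)
```

Because the thresholds live on the object rather than in the script body, the same step can be reused on a new dataset with different settings, which is the kind of adaptability a linear script makes harder.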

### Program Architecture

<p style="text-align:justify;">We replicated the project by first running the ETL (Extract, Transform, Load) process and then performing feature engineering, which involved clustering and adding temperature and trip distance. Next, we evaluated several models and selected the most effective one to forecast taxi fare amounts. All of this runs in the backend. We then use Streamlit to visualize all components in the frontend.</p>

<div style="text-align:center;">
    <img src="documentation/pictures/architecture.jpg" alt="Architecture" style="width: 500px;">
</div>
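
As an illustration of the trip distance feature mentioned above, the sketch below computes a great-circle (haversine) distance between a pickup and a drop-off point. This is an assumption made for the example: this section does not show how the project derives `trip_distance`, nor in which unit, and the function name `haversine_km` and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative coordinates in New York (not taken from the dataset).
pickup = (40.7580, -73.9855)
dropoff = (40.7484, -73.9857)
distance = haversine_km(*pickup, *dropoff)
print(f"{distance:.2f} km")
```

A straight-line distance underestimates the distance actually driven on a street grid, so it is best read as a proxy feature rather than a measured trip length.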


The architectural schema below shows the components in more detail:
<div style="display: flex;">
    <div style="flex: 1; margin-right: 40px;">
        <img src="documentation/pictures/architecture1.jpg" alt="Architecture 1" style="width: 100%;">
    </div>
    <div style="flex: 1;">
        <img src="documentation/pictures/architecture2.jpg" alt="Architecture 2" style="width: 100%;">
    </div>
</div>
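
The model evaluation step can be sketched as a comparison of candidate models by cross-validated RMSE, keeping the one with the lowest error. The models the project actually compared are not listed in this section, so the two candidates here (a mean baseline and a least-squares line) are stand-ins, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (not the project's dataset):
# the fare grows roughly linearly with distance, plus noise.
distance = rng.uniform(0.5, 15.0, size=200)
fare = 3.0 + 2.5 * distance + rng.normal(0.0, 1.0, size=200)
X = distance.reshape(-1, 1)


def fit_mean(X_train, y_train):
    """Baseline: always predict the training mean."""
    mean = y_train.mean()
    return lambda X_new: np.full(len(X_new), mean)


def fit_linear(X_train, y_train):
    """Ordinary least squares with an intercept."""
    design = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef


def cv_rmse(fit, X, y, folds):
    """Average RMSE over the given folds (each fold is held out once)."""
    fold_scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        predict = fit(X[train_idx], y[train_idx])
        errors = predict(X[test_idx]) - y[test_idx]
        fold_scores.append(np.sqrt(np.mean(errors ** 2)))
    return float(np.mean(fold_scores))


# Use the same 5 folds for every candidate so the scores are comparable.
folds = np.array_split(rng.permutation(len(fare)), 5)
candidates = {"mean_baseline": fit_mean, "linear": fit_linear}
scores = {name: cv_rmse(fit, X, fare, folds) for name, fit in candidates.items()}
best_model = min(scores, key=scores.get)
print(scores, "->", best_model)
```

Scoring every candidate on the same folds keeps the comparison fair; the selected model would then be refit on all available data before being used for prediction.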

<hr>
<h3 style="text-align:center;">Program Execution</h3>
<hr>

In [3]:
import pandas as pd
import geopandas as gpd
import numpy as np
import sys

import utils.extract_df as extract_df
import utils.transform as transform
import utils.clustering as clustering
import utils.Feature_Engineering as Feature_Engineering

In [4]:
    # extract taxi_data.csv
    filepathcsv = "data/taxi_data.csv"
    df = extract_df.readcsv(filepathcsv)

    # extract nyc.shp
    filepathshp = "data/nyc-boundaries/geo_export_9ca5396d-336c-47af-9742-ab30cd995e41.shp"
    nyc = extract_df.readshp(filepathshp)

    # transform & data cleaning
    transformer = transform.dataTransformation(df,nyc)
    transformedDf = transformer.transform()

    # feature engineering
    filepathtemp = "data/NYC_Weather_2014_2020.csv"
    temperature_df = extract_df.readcsv(filepathtemp)
    merged_df = Feature_Engineering.add_temperature(transformedDf, temperature_df)

    # clustering
    cluster = clustering.pickUpCluster(merged_df)
    df = cluster.clusterCreated()
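
The internals of `clustering.pickUpCluster` are not shown here. As a rough sketch of what a pickup clustering step can look like, the block below groups pickup coordinates with a plain k-means (Lloyd's algorithm) written in NumPy. The function names, the choice of `k`, and the sample points are assumptions made for illustration, and Euclidean distance on latitude and longitude is only an approximation over a city-sized area.

```python
import numpy as np


def assign(points, centroids):
    """Index of the nearest centroid for every point (squared Euclidean distance)."""
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        # Move each centroid to the mean of its points; keep it if its cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign(points, centroids), centroids


# Two synthetic groups of pickup points (latitude, longitude), not from the dataset.
rng = np.random.default_rng(42)
group_a = rng.normal([40.75, -73.98], 0.005, size=(50, 2))
group_b = rng.normal([40.64, -73.78], 0.005, size=(50, 2))
pickups = np.vstack([group_a, group_b])

labels, centroids = kmeans(pickups, k=2)
print(centroids)
```

The resulting label per trip plays the role of a categorical feature such as `pickup_cluster`, letting a model learn location effects without using raw coordinates directly.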
   



In [5]:
df

| Unnamed: 0 | dropoff_latitude | dropoff_longitude | fare_amount | feat01 | feat02 | feat03 | feat04 | feat05 | feat06 | feat07 | ... | passenger_big_group | fare_amount_log | year | month | day | hour | trip_distance | date | avg_temperature_2m (°C) | pickup_cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.768550 | -73.862065 | 52.713 | 0.607633 | 0.680994 | 0.869333 | 0.359081 | 0.283538 | 0.898003 | 0.481185 | ... | 1 | 3.964862 | 2014 | 1 | 8 | 6 | 10.461512 | 2014-01-08 | 2.366667 | 0 |
| 1 | 40.746906 | -73.990494 | 19.350 | 0.353808 | 0.555256 | 0.946294 | 0.530530 | 0.453938 | 0.708570 | 0.161038 | ... | 0 | 2.962692 | 2015 | 2 | 16 | 20 | 2.089068 | 2015-02-16 | 1.312500 | 0 |
| 2 | 40.697496 | -73.984946 | 24.850 | 0.248761 | 0.271752 | 0.418165 | 0.368993 | 0.362234 | 0.257532 | 0.710595 | ... | 0 | 3.212858 | 2014 | 3 | 18 | 13 | 2.590397 | 2014-03-18 | 10.420833 | 3 |
| 3 | 40.767617 | -73.959482 | 16.600 | 0.606718 | 0.809065 | 0.826723 | 0.228102 | 0.819767 | 0.859372 | 0.014095 | ... | 0 | 2.809403 | 2014 | 3 | 20 | 18 | 1.402666 | 2014-03-20 | 3.391667 | 2 |
| 4 | 40.724657 | -73.994457 | 14.950 | 0.386871 | 0.657538 | 0.861953 | 0.155679 | 0.928781 | 0.935444 | 0.381414 | ... | 0 | 2.704711 | 2014 | 4 | 10 | 22 | 1.059340 | 2014-04-10 | 4.308333 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 88170 | 40.787151 | -73.952049 | 19.900 | 0.535587 | 0.677861 | 0.561279 | 0.422915 | 0.023729 | 0.182607 | 0.285767 | ... | 0 | 2.990720 | 2015 | 6 | 9 | 7 | 2.609719 | 2015-06-09 | 18.687500 | 0 |
| 88171 | 40.727375 | -73.995237 | 19.350 | 0.520380 | 0.691482 | 0.886975 | 0.601929 | 0.511175 | 0.507924 | 0.020218 | ... | 0 | 2.962692 | 2014 | 12 | 11 | 10 | 0.884057 | 2014-12-11 | -1.154167 | 5 |
| 88172 | 40.763532 | -73.981185 | 17.700 | 0.529983 | 0.993864 | 0.645237 | 0.527858 | 0.231053 | 0.394276 | 0.447921 | ... | 0 | 2.873565 | 2014 | 12 | 5 | 22 | 1.064042 | 2014-12-05 | 4.920833 | 0 |
| 88173 | 40.748657 | -73.908997 | 27.600 | 0.533111 | 0.309530 | 0.643218 | 0.880006 | 0.026861 | 0.060154 | 0.304693 | ... | 1 | 3.317816 | 2014 | 1 | 16 | 14 | 5.432428 | 2014-01-16 | 6.687500 | 0 |
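
The `date` and `avg_temperature_2m (°C)` columns in the output suggest that each trip is joined to a daily average temperature by calendar date. The sketch below shows one way to do such a join with pandas. It is not the project's `Feature_Engineering.add_temperature`: the weather column names `time` and `temperature_2m (°C)`, the hourly granularity, and all values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical hourly weather readings covering two days.
weather = pd.DataFrame({
    "time": pd.Timestamp("2014-01-08") + pd.to_timedelta(list(range(48)), unit="h"),
    "temperature_2m (°C)": [2.0] * 24 + [4.0] * 24,
})

# Hypothetical trips, already carrying a calendar date.
trips = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-08", "2014-01-09", "2014-01-09"]),
    "fare_amount": [10.0, 20.0, 30.0],
})

# Collapse the hourly readings to one average per calendar day.
daily = (
    weather
    .assign(date=weather["time"].dt.normalize())
    .groupby("date", as_index=False)["temperature_2m (°C)"]
    .mean()
    .rename(columns={"temperature_2m (°C)": "avg_temperature_2m (°C)"})
)

# A left join keeps every trip, even if no weather record exists for its date.
merged = trips.merge(daily, on="date", how="left")
print(merged)
```

With a left join, trips on dates missing from the weather table get `NaN` for the temperature instead of being dropped, so gaps in the weather data stay visible.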
