# DATA WRANGLING HACKATHON

## JOINING THE DATA INTO A SINGLE DATAFRAME

### Overview
This data dictionary describes High Volume FHV trip data. Each row represents a single trip in an FHV dispatched by one of NYC’s licensed High Volume FHV bases. On August 14, 2018, Mayor de Blasio signed Local Law 149 of 2018, creating a new license category for TLC-licensed FHV businesses that currently dispatch or plan to dispatch more than 10,000 FHV trips in New York City per day under a single brand, trade, or operating name, referred to as High-Volume For-Hire Services (HVFHS). This law went into effect on Feb 1, 2019.

### Objective
The main goal of this hackathon is to predict whether the client will leave a tip.
Your submission file should be a CSV file with two columns (see the example in sample_submission.csv):

* ID: identifier of the observation
* Tipped: whether the client tipped or not

A dataset spread over several data sources has been provided for you. There are plenty of features, and it's up to you to use as many or as few as you want. That said, some features might be more relevant than others.
Keep in mind that this is a Data Wrangling specialization.

### Datasets:
| **Dataset** | **Information**    | **Location** |
|-------------|--------------------|--------------|
| API         | Trip Mileage       | https://hckt02-api.lisbondatascience.org/docs#/default/get_data_data_get |
| Webpage     | Taxi Zone Data     | https://s02-infrastructure.s3.eu-west-1.amazonaws.com/hackathon-02-batch8/index.html |
| Files       | Detailed Trip Data | https://drive.google.com/drive/folders/12MhOAVrplggHVTm6-CtjqkkjI9xrVPek?usp=drive_link |
| Database    | Weather Data       | batch-s02.ctq2kxc7kx1i.eu-west-1.rds.amazonaws.com |

## Comments:
* Now that all dataframes are in parquet (a single file format) and curated when necessary, we'll start joining the data into a single dataframe for the final training.

## INITIALIZING DASK DASHBOARD

In [1]:
from dask.distributed import Client, LocalCluster
from dask import config
# Global config for Dask worker memory thresholds (fractions of the worker memory limit)
config.set({
    "distributed.worker.memory.target": 0.80,  # Target: use at most 80% of the worker's memory
    "distributed.worker.memory.spill": 0.70,  # Start spilling to disk at 70% of the limit
    "distributed.worker.memory.pause": 0.95,  # Pause the worker if it reaches 95% of the limit
    "distributed.worker.memory.terminate": 0.99  # Restart the worker if it reaches 99%
})

# cluster = LocalCluster(processes=False, memory_limit='6GB', local_directory='/Volumes/SSD256G/Dask_temp')
# cluster = LocalCluster(processes=True, memory_limit='5.6GB', local_directory='C:/Dask_temp/')
cluster = LocalCluster(
    n_workers=4,              # Number of workers
    threads_per_worker=1,     # Threads per worker
    memory_limit='6GB',       # Memory limit per worker
    local_directory=r'C:\Dask_temp'
)
client = Client(cluster)
client

Connection method: Cluster object, Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
Workers: 4, Total threads: 4, Total memory: 22.35 GiB
Status: running, Using processes: True
Comm: tcp://127.0.0.1:60770
Each of the 4 workers: 1 thread, 5.59 GiB memory, local directory under C:\Dask_temp\dask-scratch-space




## Library

In [2]:
def list_partitions(dataframe_name):
    partition_sizes = dataframe_name.map_partitions(lambda x: x.memory_usage(deep=True).sum()).compute()/1024/1024
    for i, size in enumerate(partition_sizes):
        print(f'Partition {i}: {size:.2f} MB')

## Reading web scraping data

In [3]:
import pandas as pd
import dask.dataframe as dd

In [4]:
dask_web = dd.read_parquet(".data/webpage/bronze/webpage_data.parquet")
dask_web.head()

Unnamed: 0,LocationID,Borough,Zone,service_zone
0,1,EWR,Newark Airport,EWR
1,2,Queens,Jamaica Bay,Boro Zone
2,3,Bronx,Allerton/Pelham Gardens,Boro Zone
3,4,Manhattan,Alphabet City,Yellow Zone
4,5,Staten Island,Arden Heights,Boro Zone
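
The zone table above is a lookup keyed by `LocationID`, which lines up with the `PULocationID` and `DOLocationID` columns in the trip data. This notebook does not perform that join, so the following is only a sketch of how it could look, written with pandas on a tiny frame. The trip rows and the `PU_Borough` column name are hypothetical.

```python
import pandas as pd

# First rows of the zone lookup shown above
zones = pd.DataFrame({
    "LocationID": [1, 2, 3, 4, 5],
    "Borough": ["EWR", "Queens", "Bronx", "Manhattan", "Staten Island"],
})

# Hypothetical trips; only PULocationID matters for this example
trips = pd.DataFrame({"ID": [101, 102], "PULocationID": [4, 2]})

# Left join keeps every trip and attaches the pickup borough
with_zone = (
    trips.merge(
        zones[["LocationID", "Borough"]],
        left_on="PULocationID",
        right_on="LocationID",
        how="left",
    )
    .drop(columns="LocationID")
    .rename(columns={"Borough": "PU_Borough"})
)
print(with_zone)
```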


## Reading database data - Weather

In [5]:
dask_db = dd.read_parquet(".data/database/silver/")
dask_db.head()

Unnamed: 0,timestamp,temp,prcp
0,2021-07-02 17:00:00,25.0,0.2
1,2021-07-03 17:00:00,18.9,0.3
2,2021-07-04 17:00:00,25.6,0.2
3,2021-07-05 17:00:00,28.3,0.3
4,2021-07-06 17:00:00,32.8,0.0


## Reading API data - Trip Miles

In [6]:
dask_api = dd.read_parquet(".data/api/raw/")
dask_api.head()

Unnamed: 0,ID,trip_miles
0,5010374,1.84
1,6063883,1.45
2,4941792,16.732
3,7765520,2.477
4,7881861,0.64


## Reading FILES data - Full Trip Data

In [7]:
dask_files = dd.read_parquet(".data/files/silver/")
dask_files.head()

Unnamed: 0,hvfhs_license_num,Tipped,driver_pay,sales_tax,shared_request_flag,request_datetime,trip_time,base_passenger_fare,shared_match_flag,tolls,...,originating_base_num,wav_match_flag,wav_request_flag,PULocationID,dispatching_base_num,dropoff_datetime,airport_fee,trip_miles,bcf,congestion_surcharge
0,HV0003,0,5.4,0.7,0,2021-12-09 12:03:02,357,7.91,0,0.0,...,B03404,0,0,204,B03404,2021-12-09 12:13:41,0.0,1.85,0.24,0.0
1,HV0003,1,26.55,3.08,0,2021-09-12 21:35:45,2362,34.68,0,0.0,...,B02889,0,0,249,B02889,2021-09-12 22:24:02,0.0,5.7,1.04,2.75
2,HV0003,0,8.96,1.18,0,2021-11-22 08:43:05,810,13.25,0,0.0,...,B03404,0,0,243,B03404,2021-11-22 08:59:23,0.0,1.98,0.4,0.0
3,HV0003,0,5.58,0.7,0,2021-09-17 18:50:23,287,7.91,0,0.0,...,B02869,0,0,80,B02869,2021-09-17 18:57:12,0.0,0.75,0.24,0.0
4,HV0003,0,18.0,1.16,0,2021-11-02 08:57:24,685,13.02,0,0.0,...,B03404,0,0,210,B03404,2021-11-02 09:12:21,0.0,2.89,0.39,0.0


## Sampling

### Sample

In [8]:
# frac parameter of Dask's sample function: keep a random 30% of the rows
dask_files = dask_files.sample(frac=0.30)
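
Without a seed, `sample` draws a different subset on every run, so the row counts reported further down will vary between runs. A minimal sketch of a repeatable sample, shown with pandas on a made-up frame. Dask's `DataFrame.sample` also takes `frac` and `random_state`, and the seed value 42 is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical stand-in for the trip data
trips = pd.DataFrame({"ID": range(10), "trip_miles": [1.0] * 10})

# Passing random_state makes the 30% sample repeatable across runs
sample_a = trips.sample(frac=0.30, random_state=42)
sample_b = trips.sample(frac=0.30, random_state=42)

print(len(sample_a))              # 3 rows out of 10
print(sample_a.equals(sample_b))  # True: same rows both times
```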

## Partitioning to Improve Parallelism

In [9]:
dask_files = dask_files.repartition(npartitions=1)
dask_files = dask_files.repartition(npartitions=50)

In [10]:
list_partitions(dask_files)

Partition 0: 10.89 MB
Partition 1: 10.87 MB
Partition 2: 10.87 MB
Partition 3: 10.87 MB
Partition 4: 10.89 MB
Partition 5: 10.89 MB
Partition 6: 10.88 MB
Partition 7: 10.87 MB
Partition 8: 10.87 MB
Partition 9: 10.87 MB
Partition 10: 10.87 MB
Partition 11: 10.87 MB
Partition 12: 10.87 MB
Partition 13: 10.87 MB
Partition 14: 10.87 MB
Partition 15: 10.87 MB
Partition 16: 10.87 MB
Partition 17: 10.87 MB
Partition 18: 10.87 MB
Partition 19: 10.87 MB
Partition 20: 10.87 MB
Partition 21: 10.87 MB
Partition 22: 10.87 MB
Partition 23: 10.89 MB
Partition 24: 10.87 MB
Partition 25: 10.87 MB
Partition 26: 10.87 MB
Partition 27: 10.87 MB
Partition 28: 10.87 MB
Partition 29: 10.87 MB
Partition 30: 10.87 MB
Partition 31: 10.87 MB
Partition 32: 10.87 MB
Partition 33: 10.89 MB
Partition 34: 10.89 MB
Partition 35: 10.89 MB
Partition 36: 10.89 MB
Partition 37: 10.89 MB
Partition 38: 10.89 MB
Partition 39: 10.89 MB
Partition 40: 10.87 MB
Partition 41: 10.87 MB
Partition 42: 10.87 MB
Partition 43: 10.87 M

In [11]:
len(dask_files)

2615021

In [12]:
# Ensuring that the timestamp column has a datetime dtype
dask_db['timestamp'] = dd.to_datetime(dask_db['timestamp'])
# Keeping only the 17:00 readings to reduce the data cardinality when joining the dataframes
dask_db = dask_db[dask_db['timestamp'].dt.hour == 17]
dask_files['dropoff_datetime'] = dd.to_datetime(dask_files['dropoff_datetime'])

In [13]:
import pandas as pd
# Creating the join column since timestamp has too many values
dask_db['date'] = dask_db['timestamp'].dt.date
dask_db['date'] = dd.to_datetime(dask_db['date'])

# Creating the join column since timestamp has too many values
dask_files['date'] = dask_files['dropoff_datetime'].dt.date
dask_files['date'] = dd.to_datetime(dask_files['date'])
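
The two-step conversion above (`.dt.date`, then back through `to_datetime`) goes via Python `date` objects. As a possible alternative, truncating to midnight with `.dt.floor("D")` gives the same values in one step while staying in a datetime dtype. The sketch below uses pandas; the Dask `.dt` accessor mirrors pandas, so the same call should carry over, though that is not verified here.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2021-12-09 12:13:41", "2021-09-12 22:24:02"]))

# Two-step version: Python date objects, then back to datetime64
two_step = pd.to_datetime(ts.dt.date)

# One-step alternative: truncate each timestamp to midnight
one_step = ts.dt.floor("D")

print(one_step.tolist())
print((two_step == one_step).all())  # both approaches give the same join key
```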

In [14]:
# Checking if the date column has been created correctly
dask_files.columns

Index(['hvfhs_license_num', 'Tipped', 'driver_pay', 'sales_tax',
       'shared_request_flag', 'request_datetime', 'trip_time',
       'base_passenger_fare', 'shared_match_flag', 'tolls', 'DOLocationID',
       'pickup_datetime', 'on_scene_datetime', 'ID', 'originating_base_num',
       'wav_match_flag', 'wav_request_flag', 'PULocationID',
       'dispatching_base_num', 'dropoff_datetime', 'airport_fee', 'trip_miles',
       'bcf', 'congestion_surcharge', 'date'],
      dtype='object')

In [15]:
# Dropping the columns that might not improve ML model learning
drop_cols = ['dispatching_base_num', 'hvfhs_license_num', 
             'on_scene_datetime', 'pickup_datetime', 
             'request_datetime', 'sales_tax']
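# dask_files.drop(columns=drop_cols, errors='ignore')
# Note: drop() returns a new dataframe; the result is not assigned here,
# so dask_files itself still contains these columns
dask_files.drop(columns=drop_cols)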

Unnamed: 0_level_0,Tipped,driver_pay,shared_request_flag,trip_time,base_passenger_fare,shared_match_flag,tolls,DOLocationID,ID,originating_base_num,wav_match_flag,wav_request_flag,PULocationID,dropoff_datetime,airport_fee,trip_miles,bcf,congestion_surcharge,date
npartitions=50,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
,float64,float64,float64,float64,float64,float64,float64,float64,float64,string,float64,float64,float64,datetime64[ns],float64,float64,float64,float64,datetime64[ns]
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
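
The cell above displays the result of `drop` but does not store it, which is why the columns are still present when `dask_files.columns` is checked next. A small pandas illustration of that behaviour, with made-up values (Dask dataframes behave the same way in this respect):

```python
import pandas as pd

df = pd.DataFrame({
    "Tipped": [0, 1],
    "sales_tax": [0.7, 3.08],
    "trip_miles": [1.85, 5.7],
})

df.drop(columns=["sales_tax"])            # result discarded, df is unchanged
print(list(df.columns))                   # ['Tipped', 'sales_tax', 'trip_miles']

trimmed = df.drop(columns=["sales_tax"])  # keep the returned frame instead
print(list(trimmed.columns))              # ['Tipped', 'trip_miles']
```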


In [16]:
# Checking the columns after the drop call (unchanged, because its result was not assigned)
dask_files.columns

Index(['hvfhs_license_num', 'Tipped', 'driver_pay', 'sales_tax',
       'shared_request_flag', 'request_datetime', 'trip_time',
       'base_passenger_fare', 'shared_match_flag', 'tolls', 'DOLocationID',
       'pickup_datetime', 'on_scene_datetime', 'ID', 'originating_base_num',
       'wav_match_flag', 'wav_request_flag', 'PULocationID',
       'dispatching_base_num', 'dropoff_datetime', 'airport_fee', 'trip_miles',
       'bcf', 'congestion_surcharge', 'date'],
      dtype='object')

In [17]:
# Sorting dataframe dask_db before persisting it to memory for faster join
# (sort_values returns a new dataframe, so the result is assigned back)
dask_db = dask_db.sort_values(by='date', ascending=True)
dask_db

Unnamed: 0_level_0,timestamp,temp,prcp,date
npartitions=7,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
,datetime64[ns],float64,float64,datetime64[ns]
,...,...,...,...
...,...,...,...,...
,...,...,...,...
,...,...,...,...


In [18]:
# Persisting the small dataframe into memory to improve merge operation performance
# (persist returns the persisted collection, so the result is assigned back)
dask_db = dask_db.persist()
dask_db

Unnamed: 0_level_0,timestamp,temp,prcp,date
npartitions=7,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
,datetime64[ns],float64,float64,datetime64[ns]
,...,...,...,...
...,...,...,...,...
,...,...,...,...
,...,...,...,...


In [19]:
# Joining the dataframes
dask_files = dd.merge(
    dask_files, 
    dask_db, 
    on='date', 
    how='left' # Keeps every trip; trips with no matching weather row get NaN in temp and prcp
)
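
A left join keeps every row of the left table and fills the right-hand columns with NaN where no key matches. The sketch below shows this with pandas on made-up rows. The `indicator=True` flag is used here only to make the unmatched rows easy to spot and is not part of the notebook's own merge.

```python
import pandas as pd

trips = pd.DataFrame({
    "ID": [1, 2, 3],
    "date": pd.to_datetime(["2021-09-18", "2021-12-28", "2022-01-01"]),
})
weather = pd.DataFrame({
    "date": pd.to_datetime(["2021-09-18", "2021-12-28"]),
    "temp": [28.3, 8.1],
})

# how="left" keeps all three trips; the one without weather gets NaN in temp
merged = trips.merge(weather, on="date", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
print(len(merged), len(unmatched))
```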

In [20]:
# Saving the curated weather dataframe (now with the date join column) and checking its columns
dask_db.to_parquet('.data/database/gold/dask_db')
dask_db.columns

Index(['timestamp', 'temp', 'prcp', 'date'], dtype='object')

In [21]:
#dask_db = pd.read_parquet(".data/database/gold/dask_db")
len(dask_db)

549

In [22]:
# Checking the new columns available after the merge: timestamp, temp and prcp
dask_files.columns

Index(['hvfhs_license_num', 'Tipped', 'driver_pay', 'sales_tax',
       'shared_request_flag', 'request_datetime', 'trip_time',
       'base_passenger_fare', 'shared_match_flag', 'tolls', 'DOLocationID',
       'pickup_datetime', 'on_scene_datetime', 'ID', 'originating_base_num',
       'wav_match_flag', 'wav_request_flag', 'PULocationID',
       'dispatching_base_num', 'dropoff_datetime', 'airport_fee', 'trip_miles',
       'bcf', 'congestion_surcharge', 'date', 'timestamp', 'temp', 'prcp'],
      dtype='object')

In [23]:
# Saving the new merged dataframe that will be used to train the ML model
dask_files.to_parquet('.data/files/gold/', write_index=False, engine="pyarrow")

In [24]:
# Reloading the dataframe and counting rows
dask_files = dd.read_parquet(".data/files/gold/")
len(dask_files)

8737190

In [25]:
# Previewing the merged dataframe, including the new temp and prcp columns
dask_files.head(10)

Unnamed: 0,hvfhs_license_num,Tipped,driver_pay,sales_tax,shared_request_flag,request_datetime,trip_time,base_passenger_fare,shared_match_flag,tolls,...,dispatching_base_num,dropoff_datetime,airport_fee,trip_miles,bcf,congestion_surcharge,date,timestamp,temp,prcp
0,HV0003,0.0,7.36,1.01,0.0,2021-12-28 09:25:18,579.0,11.36,0.0,0.0,...,B03404,2021-12-28 09:36:24,0.0,2.28,0.34,0.0,2021-12-28,2021-12-28 17:00:00,8.1,0.0
1,HV0003,0.0,7.36,1.01,0.0,2021-12-28 09:25:18,579.0,11.36,0.0,0.0,...,B03404,2021-12-28 09:36:24,0.0,2.28,0.34,0.0,2021-12-28,2021-12-28 17:00:00,8.1,0.0
2,HV0003,0.0,22.75,3.29,0.0,2021-12-28 13:02:36,1170.0,30.57,0.0,6.55,...,B03404,2021-12-28 13:23:58,0.0,8.3,1.11,2.75,2021-12-28,2021-12-28 17:00:00,8.1,0.0
3,HV0003,0.0,22.75,3.29,0.0,2021-12-28 13:02:36,1170.0,30.57,0.0,6.55,...,B03404,2021-12-28 13:23:58,0.0,8.3,1.11,2.75,2021-12-28,2021-12-28 17:00:00,8.1,0.0
4,HV0003,0.0,5.83,0.74,0.0,2021-09-18 19:41:50,389.0,8.32,0.0,0.0,...,B02872,2021-09-18 20:00:16,0.0,0.82,0.25,2.75,2021-09-18,2021-09-18 17:00:00,28.3,0.2
5,HV0003,0.0,5.83,0.74,0.0,2021-09-18 19:41:50,389.0,8.32,0.0,0.0,...,B02872,2021-09-18 20:00:16,0.0,0.82,0.25,2.75,2021-09-18,2021-09-18 17:00:00,28.3,0.2
6,HV0005,1.0,5.47,0.63,0.0,2021-12-28 16:01:25,491.0,7.15,0.0,0.0,...,B03406,2021-12-28 16:12:28,0.0,1.143,0.21,0.0,2021-12-28,2021-12-28 17:00:00,8.1,0.0
7,HV0005,1.0,5.47,0.63,0.0,2021-12-28 16:01:25,491.0,7.15,0.0,0.0,...,B03406,2021-12-28 16:12:28,0.0,1.143,0.21,0.0,2021-12-28,2021-12-28 17:00:00,8.1,0.0
8,HV0005,0.0,19.64,1.74,0.0,2021-09-18 01:51:11,1242.0,19.66,0.0,0.0,...,B02510,2021-09-18 02:18:59,0.0,4.607,0.59,0.0,2021-09-18,2021-09-18 17:00:00,28.3,0.2
9,HV0005,0.0,19.64,1.74,0.0,2021-09-18 01:51:11,1242.0,19.66,0.0,0.0,...,B02510,2021-09-18 02:18:59,0.0,4.607,0.59,0.0,2021-09-18,2021-09-18 17:00:00,28.3,0.2
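
In the preview above, consecutive rows are identical, which is the pattern a left join produces when the right-hand table holds more than one row for the same key. One possible cause (an assumption, not verified in this notebook) is that the weather table has several 17:00 rows for the same date. A sketch of how that could be checked and handled before merging, using pandas and made-up values:

```python
import pandas as pd

weather = pd.DataFrame({
    "date": pd.to_datetime(["2021-12-28", "2021-12-28", "2021-09-18"]),
    "temp": [8.1, 8.1, 28.3],
    "prcp": [0.0, 0.0, 0.2],
})
trips = pd.DataFrame({
    "ID": [10, 11],
    "date": pd.to_datetime(["2021-12-28", "2021-09-18"]),
})

# Count weather rows per join key; any count above 1 multiplies trips in the merge
rows_per_date = weather.groupby("date").size()
print(rows_per_date[rows_per_date > 1])

# Keep one weather row per date, then merge
weather_unique = weather.drop_duplicates(subset="date")
merged = trips.merge(weather_unique, on="date", how="left")
print(len(trips), len(merged))  # row count is preserved
```

Comparing `len()` before and after a merge is a cheap way to catch this kind of row multiplication early.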


# ML Model Training

In [26]:
#!pip install scikit-learn

In [27]:
#!pip install dask-ml

In [28]:
cols_drop = [
    'hvfhs_license_num', 'sales_tax', 'request_datetime', 'tolls',
    'pickup_datetime', 'on_scene_datetime', 'ID', 'wav_match_flag',
    'wav_request_flag', 'bcf', 'DOLocationID', 'PULocationID'
]

In [29]:
dask_files = dask_files.drop(columns=cols_drop)

In [30]:
dask_files.columns

Index(['Tipped', 'driver_pay', 'shared_request_flag', 'trip_time',
       'base_passenger_fare', 'shared_match_flag', 'originating_base_num',
       'dispatching_base_num', 'dropoff_datetime', 'airport_fee', 'trip_miles',
       'congestion_surcharge', 'date', 'timestamp', 'temp', 'prcp'],
      dtype='object')

In [31]:
dask_files.dtypes

Tipped                          float64
driver_pay                      float64
shared_request_flag             float64
trip_time                       float64
base_passenger_fare             float64
shared_match_flag               float64
originating_base_num    string[pyarrow]
dispatching_base_num    string[pyarrow]
dropoff_datetime         datetime64[ns]
airport_fee                     float64
trip_miles                      float64
congestion_surcharge            float64
date                     datetime64[ns]
timestamp                datetime64[ns]
temp                            float64
prcp                            float64
dtype: object

In [33]:
# Previewing dropna(); the result is not assigned, so dask_files itself is unchanged
dask_files.dropna()

Unnamed: 0_level_0,Tipped,driver_pay,shared_request_flag,trip_time,base_passenger_fare,shared_match_flag,originating_base_num,dispatching_base_num,dropoff_datetime,airport_fee,trip_miles,congestion_surcharge,date,timestamp,temp,prcp
npartitions=56,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
,float64,float64,float64,float64,float64,float64,string,string,float64,float64,float64,float64,float64,int64,float64,float64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


In [45]:
import dask.dataframe as dd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
import pandas as pd

# Updated features list after removing unnecessary columns
features = [
    'driver_pay', 'trip_time', 'base_passenger_fare', 
    'airport_fee', 'trip_miles', 'congestion_surcharge', 
    'temp', 'prcp'
]
target = 'Tipped'

# Step 1: Convert datetime columns to UNIX timestamps
dask_files['dropoff_datetime'] = dd.to_datetime(dask_files['dropoff_datetime']).astype('int64') // 10**9
dask_files['date'] = dd.to_datetime(dask_files['date']).astype('int64') // 10**9
dask_files['timestamp'] = dd.to_datetime(dask_files['timestamp']).astype('int64') // 10**9

# Step 2: Ensure all features are numeric
X = dask_files[features]
X = X.select_dtypes(include=['float64', 'int64'])  # Ensure all features are numeric

# Step 3: Target variable for prediction
y = dask_files[target].astype(float)

# Step 4: Drop rows with missing values in both X and y
# Combine X and y to ensure consistency before dropping missing values
data = dd.concat([X, y], axis=1).dropna()

# Compute the cleaned data to finalize the processing
data = data.compute()

# Split X and y again after cleaning
X = data[features]
y = data[target]

# Step 5: Split the data into training and testing sets using train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Handle missing values in X_train and X_test using SimpleImputer
# Replace missing values (NaN) in features with the column mean
# (a safeguard only: rows with NaN were already dropped in Step 4)
imputer = SimpleImputer(strategy='mean')
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X.columns)
X_test = pd.DataFrame(imputer.transform(X_test), columns=X.columns)

# Step 7: Reset indices to ensure alignment
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)

# Step 8: Train the Linear Regression model
# Fit the model using the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Step 9: Make predictions on the test set
# Use the trained model to predict target values for the test set
y_pred = model.predict(X_test)

# Step 10: Evaluate the model using Mean Squared Error (MSE)
# Calculate and display the MSE to evaluate model performance
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Step 11: Display the coefficients of the model
# Create a DataFrame to show feature names and their corresponding coefficients
coef_df = pd.DataFrame({
    'Feature': features,
    'Coefficient': model.coef_
})
print(coef_df)


Mean Squared Error: 0.1400309323633845
                Feature  Coefficient
0            driver_pay     0.000283
1             trip_time    -0.000006
2   base_passenger_fare     0.001650
3           airport_fee     0.049714
4            trip_miles    -0.003488
5  congestion_surcharge     0.034233
6                  temp    -0.000537
7                  prcp     0.002492
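
The linear regression above returns continuous scores, while the submission format described in the Objective expects one `Tipped` label per `ID`. A minimal sketch of one way to bridge the two, assuming a 0.5 cut-off and assuming the IDs were kept aside before `ID` was dropped from the features. The helper name `to_submission`, the threshold, and the scores are all hypothetical.

```python
import io
import pandas as pd

def to_submission(ids, scores, threshold=0.5):
    """Turn continuous scores into the two-column submission layout."""
    tipped = [1 if s >= threshold else 0 for s in scores]
    return pd.DataFrame({"ID": list(ids), "Tipped": tipped})

# Made-up scores standing in for model.predict(...) output
submission = to_submission([5010374, 6063883, 4941792], [0.12, 0.61, 0.50])

# Written to an in-memory buffer here; pass a file path in practice
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Since the target is binary, a classifier with a classification metric would be the more usual choice; the thresholding shown here is only a way to fit the existing regression output into the required file layout.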


In [None]:
dask_files.columns