## Debugging Dataset Folder

We are working with the 2023 Citi Bike trip dataset. Each month contains roughly 3 to 5 million instances, which is too large to iterate on quickly. To make the data easier to run, we randomly sampled 40,000 instances from every month of 2023, for a total of `12*40000=480000` instances. <br>
The sampled files are saved in `../raw_data/Smaller_Dataset_of_2023/`. We treat this folder as the initial debugging dataset. <br>
We created it with this simple code:
```python
import pandas as pd

df = pd.read_csv("../raw_data/202301-citibike-tripdata_1.csv")  # original dataset with 1,000,000 rows
df = df.dropna()  # drop rows with missing values
sample_df = df.sample(n=40000, random_state=42)  # reproducible random sample of 40,000 rows
sample_df.to_csv("../raw_data/Smaller_Dataset_of_2023/2023_January.csv")  # save the January sample
```
The datasets for the other months were created the same way.
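
The per-month sampling could be wrapped in a loop along the following lines. This is only a sketch: `sample_month`, `build_debug_folder`, and the `raw_pattern` argument are hypothetical names, and the raw file naming for months other than January is an assumption based on the January file above. Unlike the snippet above, the sketch caps `n` at the number of available rows so that a short file does not raise an error.

```python
import pandas as pd

# Month number -> month name, matching the output file names used in this notebook
MONTH_NAMES = {
    1: "January", 2: "February", 3: "March", 4: "April",
    5: "May", 6: "June", 7: "July", 8: "August",
    9: "September", 10: "October", 11: "November", 12: "December",
}


def sample_month(df, n=40000, seed=42):
    """Drop incomplete rows, then draw a reproducible sample of at most n rows."""
    clean = df.dropna()
    return clean.sample(n=min(n, len(clean)), random_state=seed)


def build_debug_folder(raw_pattern, out_dir):
    """Sample every month and save it as <out_dir>/2023_<Month>.csv.

    raw_pattern is an assumed naming scheme, for example
    "../raw_data/2023{month:02d}-citibike-tripdata_1.csv".
    """
    for number, name in MONTH_NAMES.items():
        df = pd.read_csv(raw_pattern.format(month=number))
        sample_month(df).to_csv(f"{out_dir}/2023_{name}.csv")
```

Fixing `random_state` keeps the sample identical between runs, which matters when results are compared across notebook sessions.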

## Debugging Dataset
### or Dataset for basic tasks, visualizations, initial calculations, etc.
To keep the run time under two minutes, we take 5,000 rows from every month, so there will be `12*5000=60000` rows.<br>
**This is the part where we do the basic data preparation**

In [None]:
from google.colab import drive

drive.mount('/content/drive') # Remember to add the folder (as a shortcut) to your drive before running this cell
%cd /content/drive/MyDrive/PROJECT_CS547_IE534

Mounted at /content/drive
/content/drive/.shortcut-targets-by-id/1ebKvoK7afoaMA3BiVP8gBiRPCDUjeorO/PROJECT_CS547_IE534


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import copy
import pickle
months_dict = {
    1: "January",
    2: "February",
    3: "March",
    4: "April",
    5: "May",
    6: "June",
    7: "July",
    8: "August",
    9: "September",
    10: "October",
    11: "November",
    12: "December"
}
def get_time_data(date_time_string):
    """
    from a date time string we will get month, day_of_week and exact_time
    """
    timestamp = pd.to_datetime(date_time_string)
    month = timestamp.month
    month = months_dict[month]
    day_of_week = timestamp.day_name()
    exact_time = timestamp.time()
    return month,day_of_week,exact_time

def initial_preprocessing(some_df):
    """
    here we have done the initial processing. we created df.duration, df.Month,
    df.Day_of_Week df.Exact_start_Time
    """
    df = copy.deepcopy(some_df)
    df["started_at"] = pd.to_datetime(df["started_at"])
    df["ended_at"] = pd.to_datetime(df["ended_at"])
    # ride duration in seconds
    df["duration"] = (df["ended_at"] - df["started_at"]).dt.total_seconds()
    df[['Month', 'Day_of_Week', 'Exact_start_Time']] = df['started_at'].apply(
    lambda x: pd.Series(get_time_data(x)))
    df = df[['rideable_type', 'started_at','Month', 'Day_of_Week',
           'Exact_start_Time', 'ended_at','duration', 'start_station_name',
           'end_station_name', 'start_lat', 'start_lng', 'end_lat', 'end_lng',
           'member_casual']]
    return df

def haversine(lat1, lon1, lat2, lon2):
    """
    Helper for get_weather: great-circle distance in kilometers between two
    (latitude, longitude) points given in degrees.
    """
    # Convert latitude and longitude from degrees to radians
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])

    # Haversine formula
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # Radius of Earth in kilometers
    return c * r

def get_weather(start_time,lat,long,weather_df):
    """
    Weather data is not available in our actual dataset
    (https://ride.citibikenyc.com/system-data). So we had to download it from
    a different source. Here we match the location and timing of cycle renting and
    closest weather station. we added some important weather values in our dataset
    """
    start_date = pd.to_datetime(start_time).date()
    matching_weather_date = weather_df[weather_df['DATE'] == pd.Timestamp(start_date)]
    # Calculate the distance between the series lat/lng and all matching_weather_data lat/lng
    distances = matching_weather_date.apply(
        lambda row: haversine(lat, long, row['LATITUDE'], row['LONGITUDE']),
        axis=1
    )
    if len(distances) == 0:
        ## debugging aid: no weather record was found for this date
        print("No weather data found for:", start_date, lat, long)
        return pd.Series(np.nan, index=["PRCP","AWND","TMAX","TMIN","SNOW"])
    # Find the row with the minimum distance
    closest_location = matching_weather_date.loc[distances.idxmin()]
    return closest_location[["PRCP","AWND","TMAX","TMIN","SNOW"]]

def processing_before_train_test(data_path,weather_df,total_num = 25000,use_weather_data=True):
    """
    putting it all together.
    """
    df = pd.read_csv(data_path)
    df = df.dropna()
    df.started_at = pd.to_datetime(df.started_at)
    df.ended_at = pd.to_datetime(df.ended_at)
    ## I am working with 2023 dataset. they mistakenly added some 2022 data.
    ## I removed them here
    df = df[df.started_at.dt.year == 2023]

    df = df.drop(["ride_id","start_station_id","end_station_id"],axis=1)
    sample_df = df.sample(n=total_num, random_state=42)
    tdf = initial_preprocessing(sample_df)
    if not use_weather_data:
        return tdf
    else:
        tdf[['Weather_PRCP','Weather_AVG_WIND','Weather_TMAX','Weather_TMIN','Weather_SNOW']]=tdf.apply(
            lambda row: get_weather(row['started_at'], row['start_lat'], row['start_lng'],weather_df),axis=1)
        return tdf

In [None]:
## This is a separate weather report that we downloaded for New York (2023)
weather_df = pd.read_csv("2023_weather.csv")
weather_df['DATE'] = pd.to_datetime(weather_df['DATE'])



In [None]:
total_num = 5000
df_jan = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_January.csv',
                                      weather_df,total_num=total_num)
df_feb = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_February.csv',
                                      weather_df,total_num=total_num)
df_mar = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_March.csv',
                                      weather_df,total_num=total_num)
df_apr = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_April.csv',
                                      weather_df,total_num=total_num)
df_may = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_May.csv',
                                      weather_df,total_num=total_num)
df_jun = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_June.csv',
                                      weather_df,total_num=total_num)
df_jul = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_July.csv',
                                      weather_df,total_num=total_num)
df_aug = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_August.csv',
                                      weather_df,total_num=total_num)
df_sep = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_September.csv',
                                      weather_df,total_num=total_num)
df_oct = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_October.csv',
                                      weather_df,total_num=total_num)
df_nov = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_November.csv',
                                      weather_df,total_num=total_num)
df_dec = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_December.csv',
                                      weather_df,total_num=total_num)



**Let's combine them**

In [None]:
df_combined = pd.concat([df_jan,df_feb,df_mar,df_apr,df_may,df_jun,
          df_jul,df_aug,df_sep,df_oct,df_nov,df_dec],axis=0)
df_combined

Unnamed: 0,rideable_type,started_at,Month,Day_of_Week,Exact_start_Time,ended_at,duration,start_station_name,end_station_name,start_lat,start_lng,end_lat,end_lng,member_casual,Weather_PRCP,Weather_AVG_WIND,Weather_TMAX,Weather_TMIN,Weather_SNOW
7517,classic_bike,2023-01-09 18:45:11.355,January,Monday,18:45:11.355000,2023-01-09 18:52:51.173,459.818,Broadway & W 56 St,E 54 St & 1 Ave,40.765265,-73.981923,40.756265,-73.964179,member,0.01,4.47,44.0,37.0,0.0
13706,classic_bike,2023-01-24 21:59:02.412,January,Tuesday,21:59:02.412000,2023-01-24 22:01:46.344,163.932,Kingston Ave & Herkimer St,MacDonough St & Marcy Ave,40.678907,-73.941428,40.680780,-73.946130,member,0.23,,,,0.0
28386,classic_bike,2023-01-06 05:36:46.608,January,Friday,05:36:46.608000,2023-01-06 05:53:27.805,1001.197,6 Ave & Broome St,Broadway & W 48 St,40.724310,-74.004730,40.760177,-73.984868,member,0.34,0.89,57.0,57.0,0.0
10032,classic_bike,2023-01-17 09:32:14.431,January,Tuesday,09:32:14.431000,2023-01-17 10:00:05.335,1670.904,E 39 St & 2 Ave,W 17 St & 7 Ave,40.748033,-73.973828,40.740564,-73.998526,member,0.00,3.13,47.0,35.0,0.0
18672,classic_bike,2023-01-27 18:35:40.665,January,Friday,18:35:40.665000,2023-01-27 18:43:48.903,488.238,6 Ave & W 33 St,6 Ave & W 45 St,40.749013,-73.988484,40.756951,-73.982631,casual,0.00,5.82,44.0,35.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12378,electric_bike,2023-12-20 08:48:35.967,December,Wednesday,08:48:35.967000,2023-12-20 08:53:48.454,312.487,Grand Army Plaza & Plaza St West,Washington Ave & Empire Blvd,40.672951,-73.970884,40.663140,-73.960570,member,0.00,,,,0.0
8296,electric_bike,2023-12-26 08:47:39.893,December,Tuesday,08:47:39.893000,2023-12-26 08:53:09.918,330.025,Allen St & Stanton St,Cleveland Pl & Spring St,40.721950,-73.989144,40.722104,-73.997249,member,0.00,,51.0,44.0,0.0
7419,classic_bike,2023-12-28 14:36:41.140,December,Thursday,14:36:41.140000,2023-12-28 15:15:12.901,2311.761,E 58 St & 1 Ave (NW Corner),W 20 St & 10 Ave,40.759125,-73.962658,40.745686,-74.005141,member,1.22,6.93,55.0,49.0,0.0
17344,classic_bike,2023-12-03 03:12:45.973,December,Sunday,03:12:45.973000,2023-12-03 03:26:49.153,843.180,Jefferson St & Cypress Ave,Stuyvesant Ave & Hart St,40.709070,-73.921570,40.694650,-73.934300,casual,0.23,,,,0.0


In [None]:
# saving the dataset
df_combined.to_csv('debug_dataset.csv')
df_combined.to_pickle('debug_dataset.pkl')

We have saved the debug dataset in both CSV and pickle format at these paths: `debug_dataset.csv`, `debug_dataset.pkl`
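
The two formats behave differently when read back. The sketch below uses a tiny made-up frame (not the real dataset) and temporary file names to show that a plain `pd.read_csv` turns the saved index into an `Unnamed: 0` column and leaves timestamps as text, while the pickle restores the frame unchanged.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for df_combined; the values are illustrative only
df = pd.DataFrame(
    {
        "started_at": pd.to_datetime(["2023-01-09 18:45:11", "2023-01-24 21:59:02"]),
        "duration": [459.818, 163.932],
    },
    index=[7517, 13706],
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")
    pkl_path = os.path.join(tmp, "demo.pkl")
    df.to_csv(csv_path)
    df.to_pickle(pkl_path)

    # Default read: the index comes back as a column, timestamps stay as text
    from_csv_plain = pd.read_csv(csv_path)
    # Restoring the index and the datetime column has to be requested explicitly
    from_csv = pd.read_csv(csv_path, index_col=0, parse_dates=["started_at"])
    # Pickle keeps index and dtypes as they were
    from_pkl = pd.read_pickle(pkl_path)

print(from_csv_plain.columns.tolist())
print(from_csv.dtypes)
```

For later notebook cells the pickle is the more convenient source; the CSV is mainly useful for inspection outside pandas.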

## Preprocessing of Training Data
**First: get the category values for one-hot encoding (e.g. all the station names)** <br>
    For the training data, we need to know all the station names, because we use them for one-hot encoding.
    We take almost all the rows (35,000 of the 40,000 per month) so that, as far as possible, no station is missing. <br>
    *Note: We don't need weather information to find the one-hot encoding values (e.g. all the station names), so we intentionally skip it in this cell to make the code faster. Weather data will of course be included in the training dataset.*
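
The twelve near-identical calls in cells like the one below could also be written as a loop. This is a sketch rather than the notebook's own code: `load_months` is a hypothetical helper that accepts any loader callable, so it can be tried without the real data files.

```python
import pandas as pd

MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def load_months(loader, folder, months=MONTHS, **loader_kwargs):
    """Call loader(path, **loader_kwargs) once per month.

    Returns a dict mapping month name -> whatever the loader returns.
    """
    return {
        month: loader(f"{folder}/2023_{month}.csv", **loader_kwargs)
        for month in months
    }


# Possible use with the function defined earlier in this notebook:
# frames = load_months(
#     lambda path, **kw: processing_before_train_test(path, weather_df, **kw),
#     "../raw_data/Smaller_Dataset_of_2023",
#     total_num=35000,
#     use_weather_data=False,
# )
# df_combined_large = pd.concat(list(frames.values()), axis=0)
```

Keeping the frames in a dict avoids twelve separate variables and makes it harder to leave a month out of the later `pd.concat`.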

In [None]:
total_num = 35000
df_jan = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_January.csv',
                                      weather_df,total_num=total_num,use_weather_data=False)
df_feb = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_February.csv',
                                      weather_df,total_num=total_num,use_weather_data=False)
df_mar = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_March.csv',
                                      weather_df,total_num=total_num,use_weather_data=False)
df_apr = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_April.csv',
                                      weather_df,total_num=total_num,use_weather_data=False)
df_may = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_May.csv',
                                      weather_df,total_num=total_num,use_weather_data=False)
df_jun = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_June.csv',
                                      weather_df,total_num=total_num,use_weather_data=False)
df_jul = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_July.csv',
                                      weather_df,total_num=total_num,use_weather_data=False)
df_aug = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_August.csv',
                                      weather_df,total_num=total_num,use_weather_data=False)
df_sep = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_September.csv',
                                      weather_df,total_num=total_num,use_weather_data=False)
df_oct = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_October.csv',
                                      weather_df,total_num=total_num,use_weather_data=False)
df_nov = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_November.csv',
                                      weather_df,total_num=total_num,use_weather_data=False)
df_dec = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_December.csv',
                                      weather_df,total_num=total_num,use_weather_data=False)


**One Hot Encodings**

In [None]:
df_combined_large = pd.concat([df_jan,df_feb,df_mar,df_apr,df_may,df_jun,
          df_jul,df_aug,df_sep,df_oct,df_nov,df_dec],axis=0)
station_names = df_combined_large.start_station_name.unique()

## not necessary, these are very simple. only a few categories
# month_names = df_combined_large.Month.unique()
# days_names = df_combined_large.Day_of_Week.unique()
# rideable_type_names = df_combined_large.rideable_type.unique()
# member_type_names = df_combined_large.member_casual.unique()

print(len(station_names))  # number of distinct start stations
del df_combined_large

That means the one-hot encoding of the station names will have `2190` columns.
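
The reason for collecting the full station list first is that `pd.get_dummies` only creates columns for values it actually sees, unless the column is a categorical with fixed categories. A small sketch with made-up station names shows the difference:

```python
import pandas as pd

# Made-up names standing in for the full station list
all_stations = ["Station A", "Station B", "Station C"]

# A sample in which "Station C" never occurs
sample = pd.Series(["Station B", "Station A", "Station B"], name="start_station_name")

# Without fixed categories: one column per observed value only
plain = pd.get_dummies(sample, prefix="start_station")

# With fixed categories: one column per known station, observed or not
fixed = pd.get_dummies(
    sample.astype(pd.CategoricalDtype(categories=all_stations, ordered=False)),
    prefix="start_station",
)

print(plain.columns.tolist())
print(fixed.columns.tolist())
```

With fixed categories, training and test frames get the same columns in the same order, regardless of which stations happen to appear in each sample.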

## Creating Training Data
We take 10,000 rows from every month, so there will be `12*10000=120000` rows.<br>
The size is easy to adjust by changing `total_num = 10000`. We may change it later depending on how long training takes; for now we keep 10,000 rows per month.

In [None]:
total_num = 10000
df_jan = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_January.csv',
                                      weather_df,total_num=total_num)
df_feb = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_February.csv',
                                      weather_df,total_num=total_num)
df_mar = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_March.csv',
                                      weather_df,total_num=total_num)
df_apr = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_April.csv',
                                      weather_df,total_num=total_num)
df_may = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_May.csv',
                                      weather_df,total_num=total_num)
df_jun = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_June.csv',
                                      weather_df,total_num=total_num)
df_jul = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_July.csv',
                                      weather_df,total_num=total_num)
df_aug = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_August.csv',
                                      weather_df,total_num=total_num)
df_sep = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_September.csv',
                                      weather_df,total_num=total_num)
df_oct = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_October.csv',
                                      weather_df,total_num=total_num)
df_nov = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_November.csv',
                                      weather_df,total_num=total_num)
df_dec = processing_before_train_test('../raw_data/Smaller_Dataset_of_2023/2023_December.csv',
                                      weather_df,total_num=total_num)


In the training dataset, the output label is `end_station_name`. We should not use `end_lat` and `end_lng` as features: a given (`end_lat`, `end_lng`) pair corresponds to only one station, so knowing the ending coordinates is equivalent to knowing the label. Note that the cells below still keep these two columns at the end of `train_df`, next to the label, so they must be dropped before the features are passed to a model.
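
A minimal sketch of that separation, assuming a frame with the column names used in this notebook. `split_features_label` is a hypothetical helper and the toy rows are illustrative only.

```python
import pandas as pd

LABEL = "end_station_name"
LEAKY_COLUMNS = ["end_lat", "end_lng"]  # these determine the label directly


def split_features_label(df, label=LABEL, leaky=LEAKY_COLUMNS):
    """Return (X, y): y is the label column, X excludes the label and leaky columns."""
    y = df[label]
    X = df.drop(columns=[label, *leaky])
    return X, y


# Toy frame with a subset of the notebook's columns
toy = pd.DataFrame(
    {
        "start_lat": [40.765265, 40.678907],
        "start_lng": [-73.981923, -73.941428],
        "end_station_name": ["E 54 St & 1 Ave", "MacDonough St & Marcy Ave"],
        "end_lat": [40.756265, 40.680780],
        "end_lng": [-73.964179, -73.946130],
    }
)

X, y = split_features_label(toy)
print(X.columns.tolist())
```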

In [None]:
train_df = pd.concat([df_jan,df_feb,df_mar,df_apr,df_may,df_jun,
          df_jul,df_aug,df_sep,df_oct,df_nov,df_dec],axis=0)

train_df = train_df[['rideable_type', 'started_at', 'Month', 'Day_of_Week',
       'Exact_start_Time', 'ended_at', 'duration', 'start_station_name', 'start_lat', 'start_lng',
       'member_casual', 'Weather_PRCP', 'Weather_AVG_WIND', 'Weather_TMAX',
       'Weather_TMIN', 'Weather_SNOW','end_station_name','end_lat', 'end_lng']]

## saving it for future use
train_df.to_csv("training_data_without_onehot.csv")
train_df.to_pickle("training_data_without_onehot.pkl")

## Implementing one-hot encoding on the training dataset (categorical features)

In [None]:
train_df = pd.concat([df_jan,df_feb,df_mar,df_apr,df_may,df_jun,
          df_jul,df_aug,df_sep,df_oct,df_nov,df_dec],axis=0)

train_df = train_df[['rideable_type', 'started_at', 'Month', 'Day_of_Week',
       'Exact_start_Time', 'ended_at', 'duration', 'start_station_name', 'start_lat', 'start_lng',
       'member_casual', 'Weather_PRCP', 'Weather_AVG_WIND', 'Weather_TMAX',
       'Weather_TMIN', 'Weather_SNOW','end_station_name','end_lat', 'end_lng']]
#-------------------------------------------------------------------
one_hot_encoded_df1 = pd.get_dummies(train_df['Month'], prefix='Month')
one_hot_encoded_df2 = pd.get_dummies(train_df['Day_of_Week'], prefix='Day')
one_hot_encoded_df3 = pd.get_dummies(train_df['rideable_type'], prefix='ride')
one_hot_encoded_df4 = pd.get_dummies(train_df['member_casual'], prefix='user')

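# use the full station list as categories so that every known station gets a column,
# even if it does not appear in this sample
train_df['start_station_name'] = train_df['start_station_name'].astype(pd.CategoricalDtype(
    categories=station_names, ordered=False))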

one_hot_encoded_df_0 = pd.get_dummies(train_df['start_station_name'], prefix='start_station')


train_df = pd.concat([train_df, one_hot_encoded_df1,one_hot_encoded_df2,
                     one_hot_encoded_df3,one_hot_encoded_df4,one_hot_encoded_df_0], axis=1)
train_df = train_df.drop(['Month','Day_of_Week','rideable_type','member_casual','start_station_name'],axis=1)

In [None]:
train_df

Unnamed: 0,started_at,Exact_start_Time,ended_at,duration,start_lat,start_lng,Weather_PRCP,Weather_AVG_WIND,Weather_TMAX,Weather_TMIN,...,start_station_Wyckoff Av & Jefferson St,start_station_Wyckoff Av & Stanhope St,start_station_Wyckoff Ave & Cooper Ave,start_station_Wyckoff Ave & Gates Ave,start_station_Wyckoff Ave & Jefferson St,start_station_Wyckoff Ave & Stanhope St,start_station_Wyckoff St & 3 Ave,start_station_Wythe Ave & Metropolitan Ave,start_station_Wythe Ave & N 13 St,start_station_Yankee Ferry Terminal
7517,2023-01-09 18:45:11.355,18:45:11.355000,2023-01-09 18:52:51.173,459.818,40.765265,-73.981923,0.01,4.47,44.0,37.0,...,False,False,False,False,False,False,False,False,False,False
13706,2023-01-24 21:59:02.412,21:59:02.412000,2023-01-24 22:01:46.344,163.932,40.678907,-73.941428,0.23,,,,...,False,False,False,False,False,False,False,False,False,False
28386,2023-01-06 05:36:46.608,05:36:46.608000,2023-01-06 05:53:27.805,1001.197,40.724310,-74.004730,0.34,0.89,57.0,57.0,...,False,False,False,False,False,False,False,False,False,False
10032,2023-01-17 09:32:14.431,09:32:14.431000,2023-01-17 10:00:05.335,1670.904,40.748033,-73.973828,0.00,3.13,47.0,35.0,...,False,False,False,False,False,False,False,False,False,False
18672,2023-01-27 18:35:40.665,18:35:40.665000,2023-01-27 18:43:48.903,488.238,40.749013,-73.988484,0.00,5.82,44.0,35.0,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29415,2023-12-24 11:16:29.629,11:16:29.629000,2023-12-24 11:19:29.024,179.395,40.711256,-73.986665,0.01,,46.0,35.0,...,False,False,False,False,False,False,False,False,False,False
11359,2023-12-16 16:55:20.319,16:55:20.319000,2023-12-16 16:57:41.232,140.913,40.875149,-73.901239,0.00,,,,...,False,False,False,False,False,False,False,False,False,False
575,2023-12-15 15:28:06.858,15:28:06.858000,2023-12-15 15:51:37.394,1410.536,40.727412,-73.979488,0.02,,44.0,35.0,...,False,False,False,False,False,False,False,False,False,False
17398,2023-12-28 12:49:46.191,12:49:46.191000,2023-12-28 12:56:22.384,396.193,40.756438,-73.929340,1.22,6.93,55.0,49.0,...,False,False,False,False,False,False,False,False,False,False


### Saving the Train Dataset

In [None]:
## saving it for future use
train_df.to_csv("training_data_with_onehot.csv")
train_df.to_pickle("training_data_with_onehot.pkl")