This notebook processes the raw NYC taxi trip data (parquet files) and the taxi zone shapefile.  
At the end of the project it will be merged into the main notebook.


In [57]:
# import libraries to process parquet files and geospatial data
import pandas as pd
import geopandas as gpd
import os
import warnings
warnings.filterwarnings('ignore')

In [114]:
# read every parquet file in a folder, aggregate it, and save the result as a CSV file
def read_and_parquet(folder_path):
    '''
    INPUT: folder_path - path to folder containing parquet files
    OUTPUT: 'Done!' - status string; one counted CSV file is saved per parquet file
    '''
    
    for filename in os.listdir(folder_path):
        if filename.endswith(".parquet"):
            file_path = os.path.join(folder_path, filename)
            df = pd.read_parquet(file_path)

            # yellow taxi files use a 'tpep_' prefix; rename to the 'lpep_' names used by green taxi files
            if filename.split('_')[0] == 'yellow':
                df = df.rename(columns={'tpep_pickup_datetime': 'lpep_pickup_datetime',
                                        'tpep_dropoff_datetime': 'lpep_dropoff_datetime'})
            df = process_data(df)  # process the dataframe
            save_path = f'../data/processed_nyc_data/{filename.split(".")[0]}.csv'
            save_path = save_path.replace('trip', 'count')
            df.to_csv(save_path, index=False)  # save the dataframe to csv file
            print(f'{filename} is processed')
    return 'Done!'

# create a function to process the dataframe and return a counted dataframe
def process_data(df):
    '''
    INPUT: df - dataframe of parquet files
    OUTPUT: df - counted dataframe of parquet files
    '''
    # select columns that are needed; copy so later assignments do not act on a slice
    df = df[['lpep_pickup_datetime', 'passenger_count',
             'trip_distance', 'PULocationID', 'DOLocationID', 'tip_amount', 'total_amount']].copy()

    # fill missing values in 'passenger_count' column with 1
    df['passenger_count'] = df['passenger_count'].fillna(1)
    # delete rows with missing values in 'trip_distance', 'total_amount' or 'tip_amount'
    df = df.dropna(subset=['trip_distance', 'total_amount', 'tip_amount']).copy()
    # calculate the passenger_turnover = passenger_count * trip_distance
    df['passenger_turnover'] = df['passenger_count'] * df['trip_distance']
    # calculate the trip_fee = total_amount - tip_amount
    df['trip_fee'] = df['total_amount'] - df['tip_amount']

    # delete columns that are not needed anymore
    df = df.drop(columns=['passenger_count','trip_distance','total_amount','tip_amount'])
    # keep only rows with a positive trip_fee; copy so new columns are set on an independent frame
    df = df[df['trip_fee'] > 0].copy()

    # extract the day and hour of each pickup
    df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'])
    df['day'] = df['lpep_pickup_datetime'].dt.day
    df['hour'] = df['lpep_pickup_datetime'].dt.hour

    # for each day, hour, PULocationID and DOLocationID, compute in a single pass:
    #   trip_count         - number of trips
    #   passenger_turnover - total passenger_turnover
    #   trip_fee           - average trip_fee
    count_df = df.groupby(['day','hour','PULocationID','DOLocationID'], as_index=False).agg(
        trip_count=('lpep_pickup_datetime', 'count'),
        passenger_turnover=('passenger_turnover', 'sum'),
        trip_fee=('trip_fee', 'mean'))

    return count_df

In [115]:
# process parquet files from folder
folder_path = '../data/NYC_taxi_data/'
read_and_parquet(folder_path)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


green_tripdata_2020-12.parquet is processed


yellow_tripdata_2020-12.parquet is processed


yellow_tripdata_2021-12.parquet is processed


yellow_tripdata_2018-12.parquet is processed


green_tripdata_2019-12.parquet is processed


green_tripdata_2018-12.parquet is processed


green_tripdata_2017-12.parquet is processed


yellow_tripdata_2017-12.parquet is processed


yellow_tripdata_2019-12.parquet is processed


green_tripdata_2021-12.parquet is processed


'Done!'
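Each parquet file is saved as one CSV file whose name swaps `trip` for `count`. The sketch below reproduces that naming rule in isolation. The helper name `count_csv_path` and its `out_dir` parameter are hypothetical, introduced only for illustration. One assumed difference from the loop above is that only the file stem is renamed, so a `trip` in the output directory name would be left untouched.

```python
from pathlib import Path

def count_csv_path(filename, out_dir="../data/processed_nyc_data"):
    """Return the CSV path for a parquet file name, swapping 'trip' for 'count'."""
    stem = Path(filename).stem  # drop the .parquet extension
    # Rename only the file stem so that a 'trip' in out_dir is left untouched.
    return f"{out_dir}/{stem.replace('trip', 'count')}.csv"

print(count_csv_path("yellow_tripdata_2020-12.parquet"))
# ../data/processed_nyc_data/yellow_countdata_2020-12.csv
```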

Useful columns:
* lpep_pickup_datetime
* passenger_count
* trip_distance
* PULocationID
* DOLocationID
* tip_amount
* total_amount

Derived columns:
* passenger_turnover = passenger_count * trip_distance
* trip_fee = total_amount - tip_amount
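To make the two formulas concrete, here is a minimal sketch on a toy table. The three trips and their values are made up for illustration and are not taken from the taxi data. It derives both columns and then aggregates per day, hour, pickup zone and drop-off zone using pandas named aggregation, which is one possible way to get the count, sum and mean in a single pass.

```python
import pandas as pd

# Toy trips (hypothetical values): the first two share the same day/hour/zone pair.
trips = pd.DataFrame({
    "lpep_pickup_datetime": pd.to_datetime(
        ["2020-12-01 08:05", "2020-12-01 08:40", "2020-12-01 09:10"]),
    "passenger_count": [1.0, 2.0, 1.0],
    "trip_distance": [2.0, 3.0, 1.5],
    "PULocationID": [10, 10, 10],
    "DOLocationID": [20, 20, 30],
    "tip_amount": [1.0, 2.0, 0.0],
    "total_amount": [11.0, 16.0, 7.0],
})

# Derived columns, following the two formulas above.
trips["passenger_turnover"] = trips["passenger_count"] * trips["trip_distance"]
trips["trip_fee"] = trips["total_amount"] - trips["tip_amount"]
trips["day"] = trips["lpep_pickup_datetime"].dt.day
trips["hour"] = trips["lpep_pickup_datetime"].dt.hour

# One row per (day, hour, pickup zone, drop-off zone).
summary = trips.groupby(
    ["day", "hour", "PULocationID", "DOLocationID"], as_index=False
).agg(
    trip_count=("lpep_pickup_datetime", "count"),
    passenger_turnover=("passenger_turnover", "sum"),
    trip_fee=("trip_fee", "mean"),
)
print(summary)
```

The two 08:00 trips collapse into a single row with trip_count 2, passenger_turnover 8.0 (2.0 + 6.0) and trip_fee 12.0 (the mean of 10.0 and 14.0).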



In [96]:
# Read shapefile into geopandas dataframe
nyczones = gpd.read_file('../data/taxi_zones/taxi_zones.shp')

# Convert to WGS84
nyczones = nyczones.to_crs(epsg=4326)


In [97]:
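# find the center of each zone; compute the centroid in a projected CRS
# (EPSG:2263, NY State Plane Long Island, is assumed here) and convert the points to WGS84,
# because centroids computed directly on longitude/latitude coordinates are not accurate
nyczones['center'] = nyczones['geometry'].to_crs(epsg=2263).centroid.to_crs(epsg=4326)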
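For intuition about what a zone "center" means here: a polygon centroid is an area-weighted average computed in planar coordinates, which is why the coordinate system matters. The sketch below is a plain-Python illustration of the standard shoelace-based centroid formula. The function name `polygon_centroid` is hypothetical and this is not the geopandas implementation.

```python
def polygon_centroid(points):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices."""
    area2 = 0.0  # twice the signed area (shoelace formula)
    cx = cy = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # Centroid = sum / (6 * area) = sum / (3 * area2)
    return cx / (3.0 * area2), cy / (3.0 * area2)

# A 4 x 2 rectangle has its centroid at the middle.
print(polygon_centroid([(0, 0), (4, 0), (4, 2), (0, 2)]))  # (2.0, 1.0)
```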


  
