# Milestone 1 - EDA and Preprocessing data 

# Table of Contents
- [Extraction](#Extraction)
- [EDA](#EDA)
- [Cleaning](#Cleaning)
- [Outliers](#Outliers)
- [Feature-Engineering](#Feature)
- [GPS](#GPS)
- [Look Up Table](#LookUP)
- [Parquet File & Export](#Parquet)

<a class="anchor" id="Extraction"></a>
# 1 - Extraction

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

In [2]:
def read_dataSet(dataSet_path):
    return pd.read_csv(dataSet_path + "green_tripdata_2016-01.csv")

In [3]:
dataSet_path = "./"
df = read_dataSet(dataSet_path)

FileNotFoundError: [Errno 2] No such file or directory: './green_tripdata_2016-01.csv'

In [None]:
df.head()

In [None]:
df.tail()

<a class="anchor" id="EDA"></a>
# 2 - EDA


In [None]:
lookup_table = {}

def update_lookup(mappings):
    for feature, mapping in mappings.items():
        lookup_table[feature] = mapping

In [None]:
# Unique Values
unique_vals = pd.DataFrame(df.nunique(), columns=['Num of Unique Values'])
display(unique_vals)

In [None]:
# Shape
display("Shape of the Dataframe:")
display(df.shape)

In [None]:
# Info
display("Info of the Dataframe:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in display()

In [None]:
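# Correlation
display("Correlation of Features:")
correlations = df.select_dtypes(include='number').corr()  # restrict to numeric columns; text/date columns cannot be correlated
display(correlations)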

In [None]:
# Description
display("Description of Dataframe:")
pd.set_option('display.float_format', lambda x: '%.2f' % x) # to avoid the scientific notation(to be more readable)
display(df.describe())

## Insights from the Data Description:

### 1. **Passenger Count**:
- Most trips have only 1 passenger (mean is close to 1, and median is 1).
- The max value of 666 passengers seems highly unrealistic for a taxi and might be an error that needs investigation or correction.

### 2. **Trip Distance**:
- Average trip distance is 2.76 miles, but 50% of the trips are 1.8 miles or shorter.
- The maximum trip distance of 360.5 miles is unusually long for a taxi ride within a city.

### 3. **Fare Amount**:
- The average fare is around $11.94.
- There's a negative minimum fare (-492.80), which seems to be a data error or might represent refunds/adjustments.

### 4. **Extra**:
- Half of the rides seem to have an extra charge of 0.50, but there's a max value of 83, which seems quite high.

### 5. **MTA Tax**:
- The vast majority of rides have an MTA tax of 0.50, as indicated by the 25th, 50th, and 75th percentiles.

### 6. **Tip Amount**:
- The median tip is 0, suggesting that a significant number of rides have no tip. However, the average tip is $1.25, indicating that while many rides don't have tips, some have significant tip amounts.
- A max tip of 400 dollars might be an outlier or a very generous tip.

### 7. **Tolls Amount**:
- The vast majority of rides don't have toll charges, but some do go up to 900 dollars, which is very high and might be an error.

### 8. **Ehail Fee**:
- All entries are missing for this column. It might be worth considering dropping it if it's not relevant for the analysis.

### 9. **Improvement Surcharge**:
- Most rides have an improvement surcharge of 0.30.

### 10. **Total Amount**:
- The average total amount is 14.42, but 50 percent of rides are 11.16 or cheaper.
- Negative values might represent refunds or adjustments but need to be checked for data quality.

### 11. **Congestion Surcharge**:
- There are only 2 non-missing values, both of which are 0. This column might not be very informative if almost all its values are missing, therefore it could be worth dropping.
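
The suspicious values listed above can be counted with simple range filters. The snippet below is a minimal sketch on a made-up three-row frame; the helper name `flag_implausible` and the thresholds (`max_passengers=9`, `max_distance=100`) are assumptions chosen for illustration, not limits taken from the dataset documentation.

```python
import pandas as pd

def flag_implausible(df, max_passengers=9, max_distance=100):
    """Return a boolean frame marking values that look suspicious.

    The thresholds are illustrative assumptions, not official limits.
    """
    return pd.DataFrame({
        "too_many_passengers": df["passenger count"] > max_passengers,
        "very_long_trip": df["trip distance"] > max_distance,
        "negative_fare": df["fare amount"] < 0,
        "negative_total": df["total amount"] < 0,
    })

# Tiny made-up sample using the raw (space-separated) column names.
sample = pd.DataFrame({
    "passenger count": [1, 666, 2],
    "trip distance": [1.8, 0.5, 360.5],
    "fare amount": [11.0, -492.8, 900.0],
    "total amount": [14.4, -493.6, 905.0],
})

flags = flag_implausible(sample)
print(flags.sum())  # number of flagged rows per check
```

Running the same helper on the full frame would give a quick count per check before deciding how to clean each column.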

In [None]:
def check_index_candidate(df):
    potential_indices = []
    for column in df.columns:
        if df[column].nunique() == df.shape[0] and df[column].isnull().sum() == 0:
            potential_indices.append(column)
    if not potential_indices:
        return "No potential index candidates found. "
    else:
        return potential_indices

In [None]:
# Index Candidate(s), if any
display(check_index_candidate(df))

#### No feature has unique values for each entry and does not contain any missing values, therefore there are no candidates

In [None]:
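# Correlation heatmap
def draw_correlation_Heatmap(df):
    plt.figure(figsize=(15,10))
    # Compute the correlations from the frame passed in rather than relying on a global variable
    sns.heatmap(df.select_dtypes(include='number').corr(), cmap='coolwarm', annot=True, fmt='.2f')
    plt.title('Correlation Heatmap')
    plt.show()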

In [None]:
draw_correlation_Heatmap(df)

## More Insights

- There is a strong correlation between the trip distance and both the total amount and the fare amount.
- There is a very strong correlation between the MTA tax and the improvement surcharge.
- There is a medium correlation between the tip amount and the total amount.

## Visualization Functions

In [None]:
def plot_histogram(df, feature, bins='auto', title=None):
    """
    Plots a histogram for the given feature.
    """
    plt.figure(figsize=(10, 6))
    sns.histplot(data=df, x=feature, bins=bins)
    plt.title(title if title else f'Distribution of {feature}')
    plt.show()

def plot_density(df, feature, title=None):
    """
    Plots a density plot for the given feature.
    """
    plt.figure(figsize=(10, 6))
    sns.kdeplot(data=df, x=feature)
    plt.title(title if title else f'Density of {feature}')
    plt.show()

def plot_boxplot(df, feature, title=None):
    """
    Plots a boxplot for the given feature.
    """
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=df, x=feature)
    plt.title(title if title else f'Boxplot of {feature}')
    plt.show()
    
def plot_countplot(df, feature, title=None):
    """
    Plots a countplot for a categorical or discrete feature.
    """
    plt.figure(figsize=(10, 6))
    
    sns.countplot(data=df, x=feature, order=df[feature].value_counts().index)
    
    plt.title(title if title else f'Distribution of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
    plt.show()
    
def plot_scatter(df, x_feature, y_feature, title=None):
    """
    Plots a scatter plot for the given features.
    """
    plt.figure(figsize=(10, 6))
    sns.scatterplot(data=df, x=x_feature, y=y_feature)
    plt.title(title if title else f'Relationship between {x_feature} and {y_feature}')
    plt.xlabel(x_feature)
    plt.ylabel(y_feature)
    plt.show()
    

def plot_line(data, title, xlabel, ylabel):
    """
    Create a line chart using matplotlib.

    Parameters:
    - data: Series or DataFrame with the data to be plotted.
    - title: Title of the line chart.
    - xlabel: Label for the x-axis.
    - ylabel: Label for the y-axis.

    Returns:
    - None (displays the chart).
    """
    plt.figure(figsize=(10, 6))
    if isinstance(data, pd.DataFrame):
        for column in data.columns:
            plt.plot(data.index, data[column], label=column)
    else:
        plt.plot(data.index, data, label=data.name)  # label the line so plt.legend() has an entry
    
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.show()


## Questions & Visualization

### 1) What's the distribution of trip distances?

In [None]:
max_distance = int(df['trip distance'].max())
plot_histogram(df, 'trip distance', bins=range(0, max_distance + 1, 1))

In [None]:
plot_density(df, 'trip distance', title= "Distribution of trip distance")

In [None]:
display((df["trip distance"]==0).sum())

In [None]:
df[df["trip distance"]==0]

### Conclusion

Both graphs show that trip distance is right-skewed, meaning most trips are very short. Moreover, a distance of 0 was recorded in 20437 cases. There are several possible explanations: the taximeter wasn't functioning; rides were started and stopped without a passenger, either mistakenly or intentionally (for example, to meet daily trip targets); or, depending on the granularity of the recording system, the trip was too short to register or the driver didn't start the taximeter (such as a trip within a large complex or between adjacent buildings). The last explanation is supported by the fact that most of these trips have a very short duration.
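
The duration argument can be checked by deriving a trip duration from the two timestamp columns and summarizing it for the zero-distance rows. This is a minimal sketch on made-up rows; the `duration_min` column is a hypothetical helper, and the timestamp column names are assumed to match the raw headers used in this notebook.

```python
import pandas as pd

# Made-up rows; column names follow the raw dataset headers.
sample = pd.DataFrame({
    "lpep pickup datetime": ["2016-01-01 00:10:00", "2016-01-01 00:20:00", "2016-01-01 01:00:00"],
    "lpep dropoff datetime": ["2016-01-01 00:10:30", "2016-01-01 00:45:00", "2016-01-01 01:02:00"],
    "trip distance": [0.0, 5.2, 0.0],
})

pickup = pd.to_datetime(sample["lpep pickup datetime"])
dropoff = pd.to_datetime(sample["lpep dropoff datetime"])

# Trip duration in minutes (hypothetical helper column).
sample["duration_min"] = (dropoff - pickup).dt.total_seconds() / 60

# Summarize the duration of trips recorded with zero distance.
zero_distance = sample[sample["trip distance"] == 0]
print(zero_distance["duration_min"].describe())
```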

### 2) What's the Passenger Count Distribution?


In [None]:
plot_countplot(df, 'passenger count')

In [None]:
df["passenger count"].value_counts()

### Conclusion

The graph indicates that the majority of trips have a single passenger. However, the count of 666 passengers for 5 rides is highly unrealistic and likely represents data entry errors or system glitches; such values should be investigated further and possibly cleaned or corrected. Moreover, counts of 7, 8, and 9 are extremely rare, with only 43, 27, and 12 rides respectively. These might represent special vehicles or further data entry errors.


### 3) What's the most popular Payment Type?

In [None]:
plot_countplot(df, 'payment type')

In [None]:
df["payment type"].value_counts()

### Conclusion

Customers predominantly pay by cash or credit card. The `Uknown` category points to possible data entry errors or instances where the payment method was not captured accurately. Disputes could be due to service dissatisfaction, and the No charge category could be due to promotions and discounts.

### 4) What's the relationship between the total amount and the tip amount?

In [None]:
plot_scatter(df, 'total amount', 'tip amount')

### Conclusion

Based on the correlation heatmap and the scatter plot, the total amount and the tip amount are positively correlated: as the total amount increases, passengers tend to be more generous and the tip amount increases. However, for some high total amounts in the scatter plot the tip is very low (or zero), which could point to service dissatisfaction or a dispute. Moreover, most trips had no tip.

### 5) What's the distribution of Trip Fare?

In [None]:
plot_density(df,"fare amount")

In [None]:
plot_boxplot(df,"fare amount")

In [None]:
display((df["fare amount"]==0).sum())
display((df["fare amount"]<0).sum())
display((df["fare amount"]>0).sum())
display((df["fare amount"]<200).sum())
df[df["fare amount"]<0]

In [None]:
df[df["fare amount"]==0]

### Conclusion
There are 2977 instances where the fare amount is negative. This could be due to data recording issues, actual refunds, or cancelled trips, since these rows have close to zero distance, but further investigation would be needed to pinpoint the exact cause. Zero fares could be due to promotions. Overall, the fare distribution is right-skewed (positively skewed).

### 6) How does the trip distance relate to the fare amount?

In [None]:
plot_scatter(df, 'trip distance', 'fare amount')

In [None]:
df[(df["fare amount"] >200) & (df["trip distance"] == 0)]
df[(df["fare amount"] >0) & (df["trip distance"] == 0)]

### Conclusion
There is a roughly linear relationship between trip distance and fare amount (as the distance increases, the fare amount increases). However, there are many rows with zero distance but high fares. This could indicate data entry errors, where the taxi traveled some distance but technical or human error caused it to be recorded as 0. That does not appear to be the case here, however: the trip durations are very short, so these fares are more likely cancellation fees. The negative fares could likewise be due to cancellation charges, since the durations of those trips are very short as well.

### 7) How does the trip distance relate to the number of passengers?


In [None]:
plot_line(df.groupby('passenger count')['trip distance'].mean(),'distance to passenger count', 'passenger Count', 'Trip Distance' )

### Conclusion
The average trip distance does not appear to depend on the passenger count. The outlier in the passenger count shows up again, and its average distance is small, so it tells us nothing about how distance relates to passenger count.

<a class="anchor" id="Cleaning"></a>
# 3 - Cleaning Data

## Tidying up column names

In [None]:
df.columns

In [None]:
def clean_column_names(df, rename_dict=None):
    # Remove leading/trailing whitespaces, replace inner spaces with underscores, and lowercase
    df.columns = df.columns.str.strip().str.replace(' ', '_').str.lower()
        
    # Rename columns if a dictionary is provided
    if rename_dict:
        # Convert the keys of rename_dict to match our cleaned column names
        rename_dict = {key.lower().replace(' ', '_'): value for key, value in rename_dict.items()}
        df = df.rename(columns=rename_dict)
    
    return df

In [None]:
df = clean_column_names(df, rename_dict={"lpep pickup datetime": "pickup_datetime", "lpep dropoff datetime":"dropoff_datetime","store and fwd flag":"store_and_fwd","pu_location":"pickup_location","do_location":"dropoff_location"})

In [None]:
display(df.columns)

In [None]:
# Change the type of the date features from object to datetime..
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
df['dropoff_datetime'] = pd.to_datetime(df['dropoff_datetime'])

## Observing & Handling irrelevant data

### Duplicates

In [None]:
display(df.duplicated().sum())

In [None]:
display(df.shape) 

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
display(df.shape) # Duplicates are removed, the 7 extra tuples were removed.

#### Duplicates were removed since they don't add any additional relevant information and they increase the computational costs.

## Handling Missing data

In [None]:
df_copy = df.copy()

def null_percentage(df):
    perc_null = df.isnull().sum() / len(df)*100
    return perc_null[perc_null>0]

display(null_percentage(df_copy))


In [None]:
display(df_copy.extra.value_counts())
nan_count = df_copy["extra"].isna().sum()
print(f"Number of NaN values in the 'extra' column: {nan_count}")

In [None]:
null_passenger_rows = df[df["passenger_count"].isnull()]
display(null_passenger_rows)

In [None]:
print(df_copy['passenger_count'].mode()[0])
print(df_copy['passenger_count'].median())

In [None]:
unique_values = df_copy["payment_type"].unique()
print(unique_values)
null_payment_rows = df[df["payment_type"].isnull()]
display(null_payment_rows)

### Reasoning behind each decision of handling missing data:
- For the `ehail_fee` and `congestion_surcharge` columns, essentially every entry is missing (the description above shows only two non-missing `congestion_surcharge` values). One option is to retain these columns and fill them with zeros, but this is problematic for two reasons. First, it's implausible that all 1,445,294 entries genuinely have no associated e-hail fee or congestion surcharge. Second, filling these columns with zeros might incorrectly suggest that the fees were specifically recorded as zero, which could be misleading. Given these concerns, it's more prudent and straightforward to remove the two columns. Since nearly every entry is missing, the cause is probably systematic, potentially suggesting missing not at random (MNAR).

&nbsp;

- For the `extra` feature, nearly half of the entries are missing. This is likely missing not at random (MNAR): it is plausible that many trips simply did not incur additional fees. A logical step is therefore to fill these missing values with zeros, signaling that no extra fees were applied to those trips.

&nbsp;

- The `passenger_count` feature has only 411 null rows, and the values are likely missing at random (MAR). Removing those rows would discard the information they hold for other features, so we impute the missing values with the mode (which is 1). Imputing with the mode or median assumes that the values are missing at random and helps maintain the overall statistical properties of the dataset. Moreover, this feature isn't correlated with other features, so simple statistical imputation is a reasonable choice.

&nbsp;

- The `payment_type` feature has 56,504 missing entries, which are probably missing at random (MAR). Rather than removing these rows and losing valuable information from other columns, we fill the missing values. The dataset already contains a category labeled `Uknown` (a misspelling that should be corrected to `unknown`), and imputing with this category reflects the uncertainty about the payment method for these trips. Before filling with `unknown`, however, we first assign the credit card category wherever there is a tip amount, since tip amounts are only recorded for credit card trips.
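
The fill order for the payment type matters: the tip-based rule has to run before the catch-all `unknown` fill, otherwise it would find nothing left to fill. A minimal sketch on a made-up four-row frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

sample = pd.DataFrame({
    "payment_type": ["Cash", None, None, "Uknown"],
    "tip_amount": [0.0, 2.5, 0.0, 0.0],
})

# Step 1: a recorded tip implies a card payment, so fill those gaps first.
has_tip = sample["tip_amount"] != 0
sample.loc[has_tip, "payment_type"] = sample.loc[has_tip, "payment_type"].fillna("Credit card")

# Step 2: fix the misspelled label, then mark the remaining gaps as unknown.
sample["payment_type"] = sample["payment_type"].replace("Uknown", "unknown").fillna("unknown")

print(sample["payment_type"].tolist())
```

Only the row with a tip becomes `Credit card`; the other missing row and the misspelled row both end up as `unknown`.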


### Missing Data Handling Functions 


In [None]:
def statistical_imputation(df, column, method='mean'):
    """
    Imputes missing values in a DataFrame column using a statistical method.

    Parameters:
    - df: DataFrame with the data.
    - column: The name of the column to impute.
    - method: The statistical method for imputation ('mean', 'median', or 'mode').

    Returns:
    - DataFrame with imputed values.
    """
    
    if method == 'mean':
        imputed_value = df[column].mean()
    elif method == 'median':
        imputed_value = df[column].median()
    elif method == 'mode':
        imputed_value = df[column].mode()[0]  # mode() returns a Series, so we get the first entry
    else:
        raise ValueError("Method should be one of 'mean', 'median', or 'mode'.")
    
    df[column] = df[column].fillna(imputed_value)  # assign back; chained inplace fillna is unreliable in recent pandas
    update_lookup({column: {"null/nan": imputed_value}})
    return df 

def fill_missing_with_zeros(df, feature):
    """
    Fills missing values in the specified feature with zeros.

    Parameters:
    - df: DataFrame
    - feature: column name as a string

    Returns:
    - Tuple of (DataFrame with missing values filled in the specified feature, lookup entry).
    """
    df[feature] = df[feature].fillna(0)  # assign back instead of chained inplace fillna
    update_lookup({feature: {"null/nan": 0}})    
    return df, {feature: {"null/nan": 0}}

def impute_missing_with_category(df, feature, category):
    """
    Fills missing values in the specified feature with the given category.

    Parameters:
    - df: DataFrame
    - feature: column name as a string
    - category: the category to fill the missing values with

    Returns:
    - Tuple of (DataFrame with missing values filled in the specified feature, lookup entry).
    """
    df[feature] = df[feature].fillna(category)  # assign back instead of chained inplace fillna
    update_lookup({feature: {pd.NA: category}})

    return df, {feature: {pd.NA: category}}

def impute_specific_value(df, column, target_value, method='mean'):
    """
    Imputes rows in a DataFrame column with a specific target value using a statistical method.

    Parameters:
    - df: DataFrame with the data.
    - column: The name of the column to impute.
    - target_value: Rows with this specific value in the column will be imputed.
    - method: The statistical method for imputation ('mean', 'median', or 'mode').

    Returns:
    - Tuple of (DataFrame with imputed values, lookup entry).
    """
    if method == 'mean':
        imputed_value = df.loc[df[column] != target_value, column].mean()
    elif method == 'median':
        imputed_value = df.loc[df[column] != target_value, column].median()
    elif method == 'mode':
        imputed_value = df.loc[df[column] != target_value, column].mode()[0]  # mode() returns a Series, so we get the first entry
    else:
        raise ValueError("Method should be one of 'mean', 'median', or 'mode'.")
    print(imputed_value)
    df.loc[df[column] == target_value, column] = imputed_value
    update_lookup({column: {target_value: imputed_value}})

    return df, {column: {target_value: imputed_value}}

In [None]:
#handle the ehail_fee and congestion_surcharge as stated above
df_copy.drop(columns=['ehail_fee', 'congestion_surcharge'], inplace=True)

In [None]:
#handle the passenger_count as stated above.
statistical_imputation(df_copy, "passenger_count", method="mode")

In [None]:
# handle extra as stated above.
fill_missing_with_zeros(df_copy, "extra")

In [None]:
# handle payment type as stated above.
df_copy.loc[df_copy['tip_amount'] != 0, 'payment_type'] = df_copy.loc[df_copy['tip_amount'] != 0, 'payment_type'].fillna('Credit card')
df_copy["payment_type"] = df_copy["payment_type"].replace("Uknown", "unknown")  # assign back instead of chained inplace replace
#handle the payment type missing values by placing unknown in null positions, as stated above.
impute_missing_with_category(df_copy, "payment_type", "unknown")

In [None]:
display(df_copy.shape)
unknown_payments = df_copy[df_copy["payment_type"] == "unknown"]
unique_values = df_copy["payment_type"].unique()
print(unique_values)
display(unknown_payments)
display(df_copy)
nan_count = df_copy["passenger_count"].isna().sum()
print(f"Number of NaN values in the 'passenger count' column: {nan_count}")

In [None]:
impute_specific_value(df_copy, "payment_type",'unknown', method="mode")
unique_values = df_copy["payment_type"].unique()
print(unique_values)

In [None]:
df_copy["payment_type"].value_counts()

#### `unknown` is also treated as missing, since these values may in fact have been empty and later filled with `unknown` to mark them as missing. We therefore impute them with the most common payment type (the mode).

In [None]:
display(null_percentage(df_copy)) 

#### This indicates that all missing values have been handled.


#### As shown by the shape of the DataFrame, both `ehail_fee` and `congestion_surcharge` were dropped. Moreover, missing `payment_type` values were filled with `unknown`, and the misspelled `Uknown` label was corrected to `unknown`. Furthermore, missing values in `extra` were filled with zeros, as shown in the previous display. Finally, `passenger_count` was statistically imputed.

### Incorrect & Inconsistent Data 


### Cleaning Functions

In [None]:
def adjust_column_for_multiple_values(df, col_to_adjust, adjustment_col, valid_values):
    """
    Adjusts the adjustment_col based on invalid values in col_to_adjust. Sets invalid values in col_to_adjust to 0.
    
    :param df: DataFrame
    :param col_to_adjust: Name of the column to check and adjust.
    :param adjustment_col: Name of the column to decrement by the value in col_to_adjust if invalid.
    :param valid_values: List of valid values for col_to_adjust.
    :return: Adjusted DataFrame.
    """

    # Convert the column to float for easy comparison
    df[col_to_adjust] = df[col_to_adjust].astype(float)

    # Create a mask for rows with invalid col_to_adjust values
    mask = ~df[col_to_adjust].isin(valid_values)

    # Adjust the adjustment_col based on the mask
    df.loc[mask, adjustment_col] -= df.loc[mask, col_to_adjust]

    # Set the invalid col_to_adjust values to 0
    df.loc[mask, col_to_adjust] = 0.0

    return df

def remove_rows_with_value(df, column_name, value_to_remove):
    """
    Removes rows from a DataFrame based on a specific value in a specified column.

    :param df: DataFrame from which rows will be removed.
    :param column_name: Name of the column to be checked.
    :param value_to_remove: Value to be removed from the DataFrame.
    :return: DataFrame with rows removed.
    """
    return df[df[column_name] != value_to_remove]

def remove_rows_with_multiple_conditions(df, conditions):
    """
    Removes rows from a DataFrame based on multiple conditions specified for different columns.
    Rows are removed if all specified conditions are met simultaneously.

    :param df: DataFrame from which rows will be removed.
    :param conditions: List of tuples, each containing a feature name and the value to be checked.
    :return: DataFrame with rows removed.
    """
    mask = pd.Series(True, index=df.index)  # align with df's index, which may have gaps after rows were dropped
    for feature, value in conditions:
        mask = mask & (df[feature] == value)
    return df[~mask]

def adjust_column_value_based_on_condition(df, condition_column, condition_value, target_column, desired_value):
    """
    Adjusts the value of the target column based on a condition in another column.

    Parameters:
    - df: The DataFrame.
    - condition_column: The column name where we check the condition.
    - condition_value: The value to check against in the condition column.
    - target_column: The column name where we want to adjust the value.
    - desired_value: The value to set in the target column if the condition is met.

    Returns:
    - DataFrame with adjusted values.
    """
    condition = (df[condition_column] == condition_value) & (df[target_column] != desired_value)
    df.loc[condition, target_column] = desired_value
    return df

def adjust_features_based_on_conditions(df, conditions, adjustments):
    """
    Adjusts values of specified features in a DataFrame based on given conditions for other features.

    :param df: Pandas DataFrame.
    :param conditions: List of tuples, each containing a column name, a comparison operator,
                      and a value for the condition (e.g., [('vendor', '==', 'VeriFone Inc.'), ('total_amount', '>', 50)]).
    :param adjustments: Dictionary where keys are feature names to be adjusted, and values are new values.
    :return: Pandas DataFrame with adjusted feature values.
    """
    mask = None
    for condition_column, operator, condition_value in conditions:
        condition_mask = df.eval(f"{condition_column} {operator} @condition_value", engine='python', local_dict={'condition_value': condition_value})
        mask = condition_mask if mask is None else mask & condition_mask

    if mask is not None:
        for feature, new_value in adjustments.items():
            df.loc[mask, feature] = new_value

    return df

def count_mismatch_rows(df, columns_to_sum, total_column, tolerance=1e-6):
    """
    Count the rows where the sum of specified columns is not equal to the total column.

    Parameters:
    - df: The DataFrame.
    - columns_to_sum: List of column names to be summed.
    - total_column: The column name of the total value to be compared against.

    Returns:
    - Count of rows where the sum of specified columns is not equal to the total column.
    """
    
    # Calculate the sum for each row
    row_sums = df[columns_to_sum].sum(axis=1)
    
    # Find rows where the sum isn't equal to the total
    mismatch_rows = df[abs(row_sums - df[total_column]) > tolerance]
    
    return len(mismatch_rows)

def remove_mismatched_rows(df, col_list, total_col, tolerance=1e-6):
    """ 
    Removes rows where the sum of values in col_list doesn't match the value in total_col within a given tolerance.
    
    Parameters:
    - df: DataFrame
    - col_list: List of column names to sum
    - total_col: Column name to compare the sum against
    - tolerance: Maximum difference allowed between the summed value and total_col value (default is 1e-6)
    
    Returns:
    - DataFrame with mismatched rows removed
    """
    
    # Calculate the row-wise sum of columns in col_list (kept local so the input frame is not modified)
    calculated_sum = df[col_list].sum(axis=1)
    
    # Determine rows where the difference between the sum and total_col is greater than the tolerance
    mask = (calculated_sum - df[total_col]).abs() > tolerance
    
    # Remove those rows
    filtered_df = df[~mask]
    
    return filtered_df

def replace_values(df, feature, old_value, new_value):
    """
    Replaces occurrences of old_value with new_value in the specified feature of the DataFrame.

    :param df: Pandas DataFrame.
    :param feature: Name of the feature/column where values should be replaced.
    :param old_value: Value to be replaced.
    :param new_value: Value to replace old_value with.
    :return: Pandas DataFrame with replaced values.
    """
    df_copy = df.copy()
    df_copy[feature] = df_copy[feature].replace(old_value, new_value)
    
    # Prepare the update details for the lookup table
    update_details = {feature: {old_value: new_value}}
    update_lookup(update_details)
    return df_copy

def filter_records_by_date_range(df, column_name, lower_bound=None, upper_bound=None):
    """
    Filter DataFrame rows based on a datetime column's values being within a specified date range.
    
    Parameters:
    - df: The dataframe.
    - column_name: Name of the datetime column to be filtered.
    - lower_bound: The lower bound of the date range (inclusive). If None, no lower filtering is applied.
    - upper_bound: The upper bound of the date range (inclusive). If None, no upper filtering is applied.
    
    Returns:
    - DataFrame with rows filtered based on specified date range.
    """
    
    # Convert the bounds to datetime
    if lower_bound is not None:
        lower_bound = pd.to_datetime(lower_bound).normalize()
    
    if upper_bound is not None:
        upper_bound = pd.to_datetime(upper_bound).normalize() + pd.DateOffset(days=1) - pd.Timedelta(seconds=1)

    # Apply filtering
    if lower_bound is not None:
        df = df[df[column_name] >= lower_bound]
    
    if upper_bound is not None:
        df = df[df[column_name] <= upper_bound]
    
    return df


In [None]:
# Convert "passenger count" column to integer
df_copy["passenger_count"] = df_copy["passenger_count"].astype(int)
print(df_copy.dtypes)

#### Converting the `passenger_count` to an integer type aligns better with the practical reality that taxis can only carry whole numbers of passengers; having fractional passenger counts is not realistic. We also removed outliers like 666 from the dataset, given that it's implausible for a taxi to hold such a high number of passengers. 
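
The order of operations matters here: `astype(int)` cannot represent missing values, so the imputation has to happen before the conversion. A minimal sketch on a made-up Series (not the real data) illustrates the failure and the fix:

```python
import pandas as pd

# Made-up passenger counts with one missing entry.
counts = pd.Series([1.0, 1.0, 2.0, None], name="passenger_count")

try:
    counts.astype(int)  # fails because NaN has no integer representation
    conversion_failed = False
except ValueError as err:
    conversion_failed = True
    print(f"Conversion failed: {err}")

# Impute with the mode first, then convert to a plain integer dtype.
filled = counts.fillna(counts.mode()[0]).astype(int)
print(filled.tolist())
```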

In [None]:
unique_values = df_copy['extra'].unique()
formatted_values = [f"{value:.2f}" for value in unique_values]
print(formatted_values)

In [None]:
# Checking whether the -1 is an incorrect value
filtered_count = df[(df['extra'] == -1)].shape[0]
display(filtered_count)
filtered_count = df[(df['extra'] == -1)& (df['total_amount'] < 0)].shape[0]
display(filtered_count)

In [None]:
# Checking whether the -0.5 is an incorrect value
filtered_count = df[(df['extra'] == -0.5)].shape[0]
display(filtered_count)
filtered_count = df[(df['extra'] == -0.5)& (df['total_amount'] < 0)].shape[0]
display(filtered_count)

In [None]:
filtered_count = df[(df['extra'] == 0.04)].shape[0]
display(filtered_count)
filtered_count = df[(df['extra'] == 0.02)].shape[0]
display(filtered_count)
filtered_count = df[(df['extra'] == 0.25)].shape[0]
display(filtered_count)
filtered_count = df[(df['extra'] == 2)].shape[0]
display(filtered_count)
filtered_count = df[(df['extra'] == 4.5)].shape[0]
display(filtered_count)

#### From our analysis, negative values in the `extra` column seem to suggest trip refunds, as corroborated by the corresponding negative totals, and there is a significant number of such records. In contrast, values like 0.04, 0.02, 0.25, 2, and 4.5 appear infrequently, implying they might be errors. The column's description only mentions 0.5 and 1 as valid values, further supporting this assumption. Additionally, the value 83 stands out as a clear outlier; it's unrealistic in this context. Consequently, to maintain consistency in our dataset, we reset these irregular `extra` values to 0 and adjust the `total_amount` column by the respective amounts. This keeps the information from the other features rather than dropping those rows. The negative values will be analyzed later.
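
The reset-and-adjust step can be sketched on a made-up frame as follows. The list `valid_extra` here is a simplified assumption containing only the non-negative documented values; the notebook's own list also keeps the negative values and 83 so they can be handled later.

```python
import pandas as pd

sample = pd.DataFrame({
    "extra": [0.5, 0.25, 1.0, 4.5],
    "total_amount": [10.0, 8.25, 12.0, 20.5],
})

valid_extra = [0.0, 0.5, 1.0]  # simplified assumption for this sketch
invalid = ~sample["extra"].isin(valid_extra)

# Remove the irregular charge from the total, then zero it out.
sample.loc[invalid, "total_amount"] -= sample.loc[invalid, "extra"]
sample.loc[invalid, "extra"] = 0.0

print(sample)
```

Rows with a valid `extra` are untouched, while the rows holding 0.25 and 4.5 have that amount subtracted from `total_amount` and their `extra` set to 0.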

In [None]:
df_copy.shape

In [None]:
print(df[df['extra'] == 0.25].shape[0])

In [None]:
# clean the incorrect values in the extra column..
valid_values = [0.5, 0.0, -0.5, 1, -1, 83.00]
df_copy = adjust_column_for_multiple_values(df_copy, 'extra', 'total_amount', valid_values)
unique_values = df_copy['extra'].unique()
formatted_values = [f"{value:.2f}" for value in unique_values]
print(formatted_values)


# The outlier (83) will be handled in the following section(s)

In [None]:
df_copy.shape

In [None]:
# The same rationale applies to the mta tax and the surcharge features (this is based on the column descriptions).
unique_values = df_copy['mta_tax'].unique()
formatted_values = [f"{value:.2f}" for value in unique_values]
print(formatted_values)

In [None]:
# Checking whether the -0.5 is an incorrect value
filtered_count = df_copy[(df_copy['mta_tax'] == -0.5)].shape[0]
display(filtered_count)
filtered_count = df_copy[(df_copy['mta_tax'] == -0.5)& (df_copy['total_amount'] < 0)].shape[0]
display(filtered_count)

In [None]:
filtered_count = df_copy[(df_copy['mta_tax'] == 2.5)].shape[0]
display(filtered_count)
filtered_count = df_copy[(df_copy['mta_tax'] == 3)].shape[0]
display(filtered_count)

#### From our analysis, negative values in the `mta_tax` column appear to indicate trip refunds, as corroborated by the corresponding negative totals, and there is a significant number of such records. Values like 2.5 and 3, by contrast, appear infrequently, implying they might be errors; the column's description only mentions 0.5 (or possibly 0) as valid values. Consequently, to keep the dataset consistent, we reset these irregular `mta_tax` values to 0 and adjust the `total_amount` column by the respective amounts. This keeps the information in the other features, rather than dropping those rows. The negative values will be analyzed later.

In [None]:
# Cleaning the Incorrect Values
valid_values = [0.5, 0.0, -0.5]
df_copy = adjust_column_for_multiple_values(df_copy, 'mta_tax', 'total_amount', valid_values)

In [None]:
unique_values = df_copy['mta_tax'].unique()
formatted_values = [f"{value:.2f}" for value in unique_values]
print(formatted_values)

In [None]:
df_copy.shape

In [None]:
unique_values = df_copy['improvement_surcharge'].unique()
formatted_values = [f"{value:.2f}" for value in unique_values]
print(formatted_values)

In [None]:
# Checking whether the -0.3 is an incorrect value
filtered_count = df_copy[(df_copy['improvement_surcharge'] == -0.3)].shape[0]
display(filtered_count)
filtered_count = df_copy[(df_copy['improvement_surcharge'] == -0.3)& (df_copy['total_amount'] < 0)].shape[0]
display(filtered_count)

In [None]:
filtered_count = df_copy[(df_copy['improvement_surcharge'] == 0.97)].shape[0]
display(filtered_count)

#### From our analysis, negative values in the `improvement_surcharge` column appear to indicate trip refunds, as corroborated by the corresponding negative totals, and there is a significant number of such records. Values other than 0.3, 0, and -0.3 appear infrequently, implying they might be errors; the column's description only mentions 0.3 (or possibly 0) as valid values. Consequently, to keep the dataset consistent, we reset these irregular `improvement_surcharge` values to 0 and adjust the `total_amount` column by the respective amounts. This keeps the information in the other features, rather than dropping those rows. The negative values will be analyzed later.

In [None]:
# Cleaning the Incorrect Values
valid_values = [0.3, 0.0, -0.3]
df_copy = adjust_column_for_multiple_values(df_copy, 'improvement_surcharge', 'total_amount', valid_values)

In [None]:
unique_values = df_copy['improvement_surcharge'].unique()
formatted_values = [f"{value:.2f}" for value in unique_values]
print(formatted_values)

In [None]:
df_copy.shape

In [None]:
unique_values = df_copy['trip_type'].unique()
display(unique_values)
filtered_count = df_copy[(df_copy['trip_type'] == "Unknown")].shape[0]
display(filtered_count)
filtered_count = df_copy[(df_copy['improvement_surcharge'] == 0.3) & (df_copy['trip_type'] == "Dispatch")].shape[0]
filtered_count1 = df_copy[(df_copy['improvement_surcharge'] == -0.3) & (df_copy['trip_type'] == "Dispatch")].shape[0]
display(filtered_count)
display(filtered_count1)

#### Only two rows in our dataset have an 'Unknown' trip type. Given the minimal presence of this category and the fact that our dataset primarily consists of two distinct trip types, street-hail and dispatch, it's prudent to remove these 'Unknown' entries. This way, we avoid introducing an unnecessary category. Moreover, 157 rows have the dispatch type together with an improvement surcharge, yet the surcharge is only assessed on street-hail trips, and we know that the driver can alter the trip type. These rows are therefore most likely errors, and we handle them by changing the trip type to street-hail, since a surcharge was charged.

In [None]:
# Clean the irrelevant data 
df_copy = remove_rows_with_value(df_copy, 'trip_type', 'Unknown')
unique_values = df_copy['trip_type'].unique()
display(unique_values)
df_copy.shape

In [None]:
df_copy = adjust_column_value_based_on_condition(df_copy, 'improvement_surcharge', 0.3, 'trip_type', 'Street-hail')
df_copy = adjust_column_value_based_on_condition(df_copy, 'improvement_surcharge', -0.3, 'trip_type', 'Street-hail')
unique_values = df_copy['trip_type'].unique()
display(unique_values)
filtered_count = df_copy[(df_copy['improvement_surcharge'] == 0.3) & (df_copy['trip_type'] == "Dispatch")].shape[0]
display(filtered_count)
df_copy.shape

In [None]:
filtered_count = df_copy[(df_copy['tip_amount'] > 0) & (df_copy['payment_type'] != "Credit card")].shape[0]
display(filtered_count)

#### Since a tip amount is only recorded when the payment type is credit card, we clean the data to meet that criterion: any row that has a tip amount but a different payment type is changed to credit card.

In [None]:
# Condition to check if tip amount is greater than 0 and payment type isn't 'credit card'
condition = (df_copy['tip_amount'] > 0) & (df_copy['payment_type'] != "Credit card")
    
# Adjust the payment type to 'credit card' where the condition is met
df_copy.loc[condition, 'payment_type'] = 'Credit card'
    
filtered_count = df_copy[(df_copy['tip_amount'] > 0) & (df_copy['payment_type'] != "Credit card")].shape[0]
display(filtered_count)

In [None]:
count_mismatch = count_mismatch_rows(df_copy, ['extra', 'mta_tax', 'tolls_amount','improvement_surcharge','fare_amount','tip_amount'], 'total_amount')
display(count_mismatch)
display(df_copy.shape)
df_copy = remove_mismatched_rows(df_copy,['extra', 'mta_tax', 'tolls_amount','improvement_surcharge','fare_amount','tip_amount'], 'total_amount')
display(df_copy.shape)
count_mismatch = count_mismatch_rows(df_copy, ['extra', 'mta_tax', 'tolls_amount','improvement_surcharge','fare_amount','tip_amount'], 'total_amount')
display(count_mismatch)


#### We've eliminated rows where the total cost doesn't align with the sum of its components. This discrepancy is likely due to data entry errors. Given that the number of affected records is minimal and we can't pinpoint the exact source of the inconsistency, we chose to remove these rows. This approach ensures we avoid making assumptions and prevents any potential inaccuracies, especially when dealing with financial data.
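
#### The consistency check described above can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's `count_mismatch_rows` helper; the function name `count_total_mismatches` and the half-cent tolerance are assumptions.

```python
import numpy as np
import pandas as pd

def count_total_mismatches(df, components, total_column, tol=0.005):
    """Hypothetical sketch: count rows whose components do not sum to the total."""
    component_sum = df[components].sum(axis=1)
    # Compare with a small absolute tolerance, because the amounts are floats
    matches = np.isclose(component_sum, df[total_column], atol=tol, rtol=0)
    return int((~matches).sum())

toy = pd.DataFrame({
    "fare_amount": [10.0, 5.0, 7.5],
    "tip_amount": [2.0, 0.0, 1.0],
    "mta_tax": [0.5, 0.5, 0.5],
    "total_amount": [12.5, 5.5, 20.0],  # the last row is inconsistent on purpose
})
n_bad = count_total_mismatches(toy, ["fare_amount", "tip_amount", "mta_tax"], "total_amount")
print(n_bad)
```

#### Comparing with a tolerance rather than `==` avoids flagging rows that differ only by floating-point rounding.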

In [None]:
# Filter rows with negative total amount
negative_total_amount_rows = df[df['total_amount'] < 0]

# Display the 'vendor' column for these rows
print(negative_total_amount_rows['vendor'])
plot_histogram(negative_total_amount_rows, 'vendor', title='Distribution of Vendors for Negative Total Amounts')


In [None]:
negative_total_amount_rows = df_copy[df_copy['total_amount'] < 0]

# Summary statistics for the rows with a negative total amount
display(negative_total_amount_rows.describe())

In [None]:
plot_boxplot(negative_total_amount_rows, 'total_amount')


In [None]:
plot_boxplot(df_copy,'total_amount')

In [None]:
def convert_to_positive(df, features):
    """
    Convert specified features' values from negative to positive in the given DataFrame.
    
    Parameters:
    - df: DataFrame containing the data.
    - features: List of feature column names to be converted.
    
    Returns:
    - DataFrame with specified features' negative values turned to positive.
    """
    for feature in features:
        df[feature] = df[feature].abs()  # vectorized; same result as applying abs() to negatives only
    return df


def find_matching_trips(df):
    # Filter rows based on given conditions
    filtered_df = df[(df["total_amount"] < 0) & 
                     (df["fare_amount"] < 0) & 
                     (df["extra"] <= 0) & 
                     (df['tolls_amount'] <= 0) & 
                     (df["mta_tax"] <= 0) & 
                     (df["improvement_surcharge"] <= 0) & 
                     (df['tip_amount'] <= 0)]
    
    # Columns that jointly identify a trip
    key_columns = ['pickup_datetime', 'dropoff_datetime',
                   'pickup_location', 'dropoff_location']

    # Match all key columns jointly, row by row. Independent isin() checks per
    # column would also accept rows that combine values from different
    # refund rows (e.g. the pickup time of one and the dropoff time of another).
    refund_keys = pd.MultiIndex.from_frame(filtered_df[key_columns])
    trip_keys = pd.MultiIndex.from_frame(df[key_columns])
    matching_trips = df[trip_keys.isin(refund_keys)]
    return matching_trips


In [None]:
result_df = find_matching_trips(df_copy)
display(result_df)

In [None]:
#df_copy = convert_to_positive(df_copy, ['extra', 'mta_tax', 'tolls_amount','improvement_surcharge','fare_amount','tip_amount', 'total_amount'])
count_mismatch = count_mismatch_rows(df_copy, ['extra', 'mta_tax', 'tolls_amount','improvement_surcharge','fare_amount','tip_amount'], 'total_amount')
display(count_mismatch)      
display(df_copy.shape)

In [None]:
negative_total_amount_rows = df_copy[df_copy['total_amount'] < 0]
display(negative_total_amount_rows.shape)
df_copy.describe()

In [None]:
plot_histogram(negative_total_amount_rows, 'payment_type')

#### VeriFone is the only vendor with negative values, which suggests that this vendor records refunds as negative amounts, rather than a system error affecting multiple vendors. Moreover, every negative row had an equivalent positive row. Furthermore, most of those trips had a payment type of no charge or dispute, indicating that they weren't a mistake. We therefore leave the negatives as they are, since they indicate refunds in this case.
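
#### The claim that each negative row has an equivalent positive row can be checked with a left merge. The sketch below uses made-up rows and assumes that the pickup and dropoff timestamps plus the absolute amount are enough to pair a refund with its original charge; the helper column `amount` is introduced only for this illustration.

```python
import pandas as pd

keys = ["pickup_datetime", "dropoff_datetime"]  # assumed identifying columns
toy = pd.DataFrame({
    "pickup_datetime": ["2016-01-01 10:00", "2016-01-01 10:00", "2016-01-02 09:00"],
    "dropoff_datetime": ["2016-01-01 10:20", "2016-01-01 10:20", "2016-01-02 09:15"],
    "total_amount": [-8.3, 8.3, -5.0],
})

# Give both sides a comparable positive amount
negatives = toy[toy["total_amount"] < 0].assign(amount=lambda d: -d["total_amount"])
positives = toy[toy["total_amount"] > 0].assign(amount=lambda d: d["total_amount"])

# A left merge keeps every negative row; `_merge` tells us whether a partner was found
paired = negatives.merge(
    positives[keys + ["amount"]].drop_duplicates(),
    on=keys + ["amount"],
    how="left",
    indicator=True,
)
unpaired = paired[paired["_merge"] == "left_only"]
print(f"{len(negatives)} negative rows, {len(unpaired)} without a positive counterpart")
```

#### On the real data, an empty `unpaired` frame would support reading the negative rows as refunds.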

In [None]:
zero_distance_rows = df_copy[(df_copy['trip_distance'] == 0) & (df_copy['total_amount'] ==0)]
display(zero_distance_rows)

In [None]:
remove_pairs = [('total_amount', 0), ('trip_distance', 0)]
df_copy = remove_rows_with_multiple_conditions(df_copy, remove_pairs)
display(df_copy.shape)

#### These rows have a total amount of 0 and a trip distance of 0, and their duration is clearly very low, so we simply remove them: there is no actual trip, just empty data that is not worth keeping.

In [None]:
zero_total_rows = df_copy[(df_copy['total_amount'] == 0) ]
display(zero_total_rows)

In [None]:
# Display the 'vendor' column for these rows
zero_total_rows = df_copy[(df_copy['total_amount'] == 0) ]
plot_histogram(zero_total_rows, 'vendor', title='Distribution of Vendors for 0 Total Amounts')

In [None]:
plot_histogram(zero_total_rows, 'rate_type', title='Distribution of Rate Types for 0 Total Amounts')

In [None]:
zero_total_rows = df_copy[(df_copy['total_amount'] == 0)  & (df_copy['vendor']== "Creative Mobile Technologies, LLC")  ]
display(zero_total_rows.shape)

zero_total_rows = df_copy[(df_copy['total_amount'] == 0) & (df_copy['vendor'] == "Creative Mobile Technologies, LLC") & (df_copy['rate_type'] == "Negotiated fare") ]
display(zero_total_rows.shape)

zero_total_rows = df_copy[(df_copy['total_amount'] == 0) & (df_copy['payment_type'] == "Dispute") & (df_copy['vendor'] == "Creative Mobile Technologies, LLC")]
display(zero_total_rows.shape)


In [None]:
zero_total_rows = df_copy[(df_copy['total_amount'] == 0)  & (df_copy['vendor']== "Creative Mobile Technologies, LLC")  ]

plot_histogram(zero_total_rows , 'payment_type' )

In [None]:
zero_total_rows = df_copy[(df_copy['total_amount'] == 0)  & (df_copy['vendor'] != "Creative Mobile Technologies, LLC")  ]
display(zero_total_rows.shape)

zero_total_rows = df_copy[(df_copy['total_amount'] == 0) & (df_copy['vendor'] != "Creative Mobile Technologies, LLC") & (df_copy['rate_type'] == "Standard rate") ]
display(zero_total_rows.shape)

zero_total_rows = df_copy[(df_copy['total_amount'] == 0) & (df_copy['payment_type'] == "Cash") & (df_copy['vendor'] != "Creative Mobile Technologies, LLC")]
display(zero_total_rows.shape)

zero_total_rows = df_copy[(df_copy['total_amount'] == 0) & (df_copy['payment_type'] == "Credit card") & (df_copy['vendor'] != "Creative Mobile Technologies, LLC")]
display(zero_total_rows.shape)

In [None]:
zero_total_rows = df_copy[(df_copy['total_amount'] == 0)  & (df_copy['trip_type'] == "Dispatch") & (df_copy['rate_type'] == "Negotiated fare") & (df_copy['vendor'] == "Creative Mobile Technologies, LLC") ]
display(zero_total_rows.shape)

zero_total_rows = df_copy[(df_copy['total_amount'] == 0)  & (df_copy['trip_type'] == "Dispatch") &  (df_copy['vendor'] != "Creative Mobile Technologies, LLC") ]
display(zero_total_rows.shape)

#### For VeriFone, all of these trips were paid by either cash or credit card, and only 1 row had a negotiated rate. Furthermore, only 2 were marked as dispatch (ordered by phone, the only way to give a refund or discount). These 52 rows are therefore likely mistakes and should be removed, since we saw earlier that VeriFone records refunds as negative values, not zeros.

#### For the other vendor (Creative Mobile Technologies, LLC), there are 2813 dispatch rides with a total of 0, of which 2807 have a negotiated fare. These could be refunds and should be marked as no charge in `payment_type`. The remaining 6 have a standard rate and should also be set to negotiated fare, indicating that the ride was a refund or a discount. For example, a rider who had a bad experience may have reported the ride to the vendor, which then offered a discount or refund for that ride or the next one as compensation.
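
#### The condition and adjustment format used in the next cell, a list of `(column, operator, value)` tuples plus a dictionary of new values, could be applied along the following lines. This is a guess at the idea behind `adjust_features_based_on_conditions`, not its actual implementation; `apply_adjustments`, `OPS`, and the toy data are hypothetical.

```python
import operator
import pandas as pd

# Map the operator strings used in the condition tuples to functions
OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt, ">": operator.gt}

def apply_adjustments(df, conditions, adjustments):
    """Hypothetical sketch: set the columns in `adjustments` on rows meeting all conditions."""
    out = df.copy()
    mask = pd.Series(True, index=out.index)
    for column, op, value in conditions:
        mask &= OPS[op](out[column], value)  # all conditions must hold
    for column, new_value in adjustments.items():
        out.loc[mask, column] = new_value
    return out

toy = pd.DataFrame({"vendor": ["A", "A", "B"],
                    "total_amount": [0.0, 12.0, 0.0],
                    "payment_type": ["Cash", "Cash", "Cash"]})
fixed = apply_adjustments(toy,
                          [("vendor", "==", "A"), ("total_amount", "==", 0)],
                          {"payment_type": "No charge"})
print(fixed)
```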

In [None]:
conditions = [('vendor', '==', 'Creative Mobile Technologies, LLC'), ('total_amount', '==', 0)]
adjustments = {'rate_type': 'Negotiated fare', 'payment_type': 'No charge','trip_type':'Dispatch'}
df_copy = adjust_features_based_on_conditions(df_copy, conditions, adjustments)
zero_total_rows = df_copy[(df_copy['total_amount'] == 0) ]
display(zero_total_rows)

In [None]:
zero_total_rows = df_copy[(df_copy['total_amount'] == 0) & (df_copy['vendor'] == "Creative Mobile Technologies, LLC") & (df_copy['rate_type'] == "Negotiated fare") & (df_copy['trip_type'] == "Dispatch") &(df_copy['payment_type'] == "No charge")]
display(zero_total_rows.shape)

In [None]:
display(df_copy.shape)

In [None]:
# Remove the zero-total rows of VeriFone
remove_pairs = [('total_amount', 0), ('vendor',"VeriFone Inc." )]
df_copy = remove_rows_with_multiple_conditions(df_copy, remove_pairs)
display(df_copy.shape)

In [None]:
unique_values = df_copy['rate_type'].unique()
display(unique_values)

#### Since there is a rate type for group rides, we need to check whether all group rides have more than 1 passenger.

In [None]:
incorrect_rows = df_copy[(df_copy['rate_type'] == "Group ride") & (df_copy['passenger_count'] == 1) ]
display(incorrect_rows)

#### Since the passenger count is a driver-entered value, human error could occur, which may be the case here. We therefore impute those values with 2, since it is the second most common count and is consistent with the group ride category.

In [None]:
conditions = [('rate_type', '==', 'Group ride'), ('passenger_count', '==', 1)]
adjustments = {'passenger_count': 2}
df_copy = adjust_features_based_on_conditions(df_copy, conditions, adjustments)

In [None]:
incorrect_rows = df_copy[(df_copy['rate_type'] == "Group ride") & (df_copy['passenger_count'] == 1) ]
display(incorrect_rows)

In [None]:
display(df_copy.shape)

In [None]:
incorrect_rows = df_copy[(df_copy['trip_distance'] == 0)  ]
display(incorrect_rows)

#### We can't simply drop the trips with a distance of 0, since they have a total amount; the zero distance could be due to a malfunctioning taximeter. We also can't impute the distance with the median: distance and total amount are strongly correlated, so a median distance would give false information. Instead, we flag that the taximeter wasn't working by setting the distance to -1.


In [None]:
df_copy = replace_values(df_copy, 'trip_distance', 0, -1)
incorrect_rows = df_copy[(df_copy['trip_distance'] == 0)  ]
display(incorrect_rows)
new_rows = df_copy[(df_copy['trip_distance'] == -1)]
display(new_rows.shape) # same number as the 0 before imputing.
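
#### Because -1 is a marker rather than a real distance, any later statistic on `trip_distance` (mean, quantiles, skewness) will include it unless it is masked out first. A minimal sketch on made-up values, assuming -1 is used only as the marker:

```python
import numpy as np
import pandas as pd

distances = pd.Series([-1.0, 1.0, 2.0, 3.0, -1.0])  # made-up values, -1 marks "not measured"

# Treat the marker as missing before computing statistics
measured = distances.replace(-1.0, np.nan)

print(distances.mean())  # pulled down by the marker values
print(measured.mean())   # mean of the measured distances only
```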

#### Remove rows outside the dataset range since they are irrelevant, and the dataset is only concerned with January 2016.

In [None]:
display(df_copy.shape)

In [None]:
start_date = "2016-01-01"
end_date = "2016-01-31"
df_copy = filter_records_by_date_range(df_copy, "pickup_datetime", start_date, end_date)

In [None]:
display(df_copy.shape)
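
#### How `filter_records_by_date_range` treats the end date is not shown in this section. If it compares timestamps against `"2016-01-31"` (which parses as midnight), trips picked up later on 31 January would be dropped. A half-open interval avoids that ambiguity; the sketch below uses a hypothetical `filter_month` function and made-up timestamps.

```python
import pandas as pd

def filter_month(df, datetime_column, start, end_exclusive):
    """Hypothetical sketch: keep rows with start <= timestamp < end_exclusive."""
    ts = pd.to_datetime(df[datetime_column])
    mask = (ts >= pd.Timestamp(start)) & (ts < pd.Timestamp(end_exclusive))
    return df[mask]

toy = pd.DataFrame({"pickup_datetime": [
    "2015-12-31 23:59:00", "2016-01-01 00:00:00",
    "2016-01-31 18:30:00", "2016-02-01 00:00:00",
]})
# Everything from 1 January up to, but not including, 1 February
january = filter_month(toy, "pickup_datetime", "2016-01-01", "2016-02-01")
print(january)
```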

<a class="anchor" id="Outliers"></a>
## Observing & Handling outliers

### Cleaning Functions

In [None]:
def remove_unwanted_values(df, feature, unwanted_value):
    """
    Removes rows with unwanted values from the given feature column.
    """
    return df[df[feature] != unwanted_value]

def fix_unwanted_values(df, feature, unwanted_value, replacement_value):
    """
    Fixes unwanted values in the given feature column.

    Args:
        df: The DataFrame to fix the values in.
        feature: The name of the feature column to fix the values in.
        unwanted_value: The value to fix.
        replacement_value: The value to replace the unwanted value with.

    Returns:
        The DataFrame with the unwanted values fixed.
    """

    # Create a copy of the DataFrame to avoid modifying the original DataFrame.
    fixed_df = df.copy()

    # Replace the unwanted value with the replacement value in the specified feature column.
    fixed_df.loc[fixed_df[feature] == unwanted_value, feature] = replacement_value

    # Return the DataFrame with the unwanted values fixed.
    return fixed_df

def impute_outliers_with_mean(df, feature, total_column=None, threshold=10):
    """
    Impute outliers in a feature using the mean of positive values within a given threshold range.

    Parameters:
    - df (pd.DataFrame): The data frame.
    - feature (str): The column with outliers.
    - total_column (str, optional): The total amount column to adjust based on imputed values.
    - threshold (float, optional): The threshold value to determine outliers.

    Returns:
    - DataFrame: The modified data frame.
    """

    # Calculate the mean of positive values between 0 and threshold
    mean_val = df[(df[feature] > 0) & (df[feature] < threshold)][feature].mean()
    
    # Create a mask for rows that are outliers (values below -threshold or above threshold)
    outliers_mask = (df[feature] < -threshold) | (df[feature] > threshold)
    
    if total_column:
        # Adjust the total column based on the difference between the outlier and mean value
        df.loc[outliers_mask, total_column] = df.loc[outliers_mask, total_column] - df.loc[outliers_mask, feature] + mean_val
    
    # Impute the outliers with the mean value
    df.loc[outliers_mask, feature] = mean_val
    
    return df

def impute_outliers_with_mean_iqr(df, feature, total_column=None):
    """
    Impute outliers in a feature using the mean of values within the IQR range.

    Parameters:
    - df (pd.DataFrame): The data frame.
    - feature (str): The column with outliers.
    - total_column (str): The total amount column to adjust based on imputed values (optional).

    Returns:
    - DataFrame: The modified data frame.
    """
    
    # Calculate the IQR and boundaries
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Create a mask for rows that are outliers
    outliers_mask = (df[feature] < lower_bound) | (df[feature] > upper_bound)
    
    # Calculate the mean of values within the IQR range
    mean_val = df[~outliers_mask][feature].mean()
    
    if total_column:
        # Adjust the total column based on the difference between the outlier and mean value
        df.loc[outliers_mask, total_column] = df.loc[outliers_mask, total_column] - df.loc[outliers_mask, feature] + mean_val
    
    # Impute the outliers with the mean value
    df.loc[outliers_mask, feature] = mean_val
    
    return df



In [None]:
unique_values = df_copy["passenger_count"].unique()
print(unique_values)
df_copy.shape

In [None]:
# Fix rows where passenger count is 666
df_copy = fix_unwanted_values(df_copy, "passenger_count", 666, 6)
df_copy.shape

In [None]:
unique_values = df_copy["passenger_count"].unique()
print(unique_values)

#### We fixed the rows with a passenger count of 666, since this is clearly an outlier: a ride with 666 passengers is not feasible. It points to a data entry error, where the driver most likely pressed 6 two additional times and meant to enter 6 only once.


In [None]:
unique_values = df_copy["extra"].unique()
print(unique_values)
display (df_copy[df_copy['extra'] == 83])
df_copy.shape

In [None]:
# Filter out rows where extra is 83
#df_copy = remove_unwanted_values(df_copy, "extra", 83)

In [None]:
valid_values = [0.5, 0.0, 1 , -0.5 , -1]
df_copy = adjust_column_for_multiple_values(df_copy, 'extra', 'total_amount', valid_values)

In [None]:
unique_values = df_copy["extra"].unique()
print(unique_values)
df_copy.shape

#### We fixed the row with an `extra` value of 83, since this is clearly an outlier: we know from the column description that `extra` can only be 0.5 or 1 (possibly 0), and this value is far outside that range, which indicates a data entry error. We zeroed that value and adjusted the total amount accordingly.


In [None]:
df_copy.describe()

In [None]:
df_copy['trip_distance'].skew()

In [None]:
plot_density(df_copy, 'trip_distance')

In [None]:
Q1_distance = df_copy.trip_distance.quantile(0.25)
Q3_distance = df_copy.trip_distance.quantile(0.75)
IQR_distance = Q3_distance - Q1_distance
print(IQR_distance)

In [None]:
cut_off_distance = IQR_distance * 1.5
lower_distance = Q1_distance - cut_off_distance
upper_distance =  Q3_distance + cut_off_distance
print(lower_distance,upper_distance)

In [None]:
df1 = df_copy[df_copy['trip_distance'] > upper_distance]
df2 = df_copy[df_copy['trip_distance'] < lower_distance]
print('Total number of outliers are', df1.shape[0]+ df2.shape[0])

In [None]:
df_copy.trip_distance.quantile(0.93)

#### Because the data is heavily skewed, and thus not normally distributed, we can't use the z-score. To detect outliers we used the IQR instead, and imputed the outliers with the mean of the data that falls within the IQR bounds, since the outliers would distort the mean if it were calculated over the whole dataset.


In [None]:
df_copy = impute_outliers_with_mean_iqr(df_copy, 'trip_distance')

In [None]:
df_copy.describe()

In [None]:
df_copy['trip_distance'].skew()

#### The maximum trip distance is now much lower and reasonable, and the skewness has improved.

In [None]:
display(df_copy['fare_amount'].skew())
display(df_copy['tip_amount'].skew())

In [None]:
plot_density(df_copy, 'fare_amount')
plot_density(df_copy, 'tip_amount')

#### The same concept applies to `fare_amount` and `tip_amount`: we can't apply the z-score, since the data is very skewed and not normally distributed. We therefore apply the IQR and impute the outliers with the mean of the data that falls within the IQR bounds, since the outliers would distort the mean if it were calculated over the whole dataset. Moreover, we need to adjust the `total_amount` column for the values we imputed, to keep the data consistent (all components adding up to `total_amount`).


In [None]:
df_copy = impute_outliers_with_mean_iqr(df_copy, 'fare_amount', 'total_amount')
df_copy = impute_outliers_with_mean_iqr(df_copy, 'tip_amount', 'total_amount')

In [None]:
df_copy.describe()

In [None]:
display(df_copy['fare_amount'].skew())
display(df_copy['tip_amount'].skew())

In [None]:
plot_density(df_copy, 'tolls_amount')

In [None]:
display(df_copy['tolls_amount'].skew())

In [None]:
df_copy['tolls_amount'].quantile(0.90)

In [None]:
(df_copy['tolls_amount']<0).sum()

In [None]:
(df_copy['tolls_amount']>0).sum()

In [None]:
(df_copy['tolls_amount']>8).sum()

In [None]:
(df_copy['tolls_amount']>15).sum()

In [None]:
(df_copy['tolls_amount']>30).sum()

In [None]:
df_copy = impute_outliers_with_mean(df_copy, 'tolls_amount','total_amount', 10)

#### For `tolls_amount`, the data is clearly very skewed, so we can't apply the z-score here either. The IQR method doesn't work either: the first and third quartiles are both zero, which gives an IQR of zero. We therefore impute the values above a threshold. For this dataset, it's crucial to compute the mean only from rides that incurred tolls, so as not to skew it. Furthermore, to prevent extreme outliers (like a toll of $900) from heavily influencing the mean, we set a reasonable threshold and compute the mean from the non-zero, positive toll amounts below it. Values exceeding the threshold are imputed with this mean, and the change is reflected in the `total_amount` feature. The threshold used here is reasonable for a ride with tolls, based on the data we saw.
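
#### A small worked example of this imputation, on made-up rows, shows how the total stays consistent. It follows the same logic as `impute_outliers_with_mean`, written inline for illustration; the values and the name `typical_toll` are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "tolls_amount": [0.0, 0.0, 4.0, 6.0, 900.0],
    "total_amount": [10.0, 12.0, 20.0, 25.0, 920.0],
})
threshold = 10  # same threshold as used above

tolls = toy["tolls_amount"]
# Mean over rides that actually paid a plausible toll: (4 + 6) / 2
typical_toll = tolls[(tolls > 0) & (tolls < threshold)].mean()

outliers = tolls.abs() > threshold
# Swap the outlier for the typical toll inside the total, then in the column itself
toy.loc[outliers, "total_amount"] += typical_toll - toy.loc[outliers, "tolls_amount"]
toy.loc[outliers, "tolls_amount"] = typical_toll
print(toy)
```

#### The 900 toll becomes 5, and its total drops from 920 to 25, so the components still add up to the total.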

In [None]:
# Check that the data is still consistent after handling the tolls outliers.
count_mismatch = count_mismatch_rows(df_copy, ['extra', 'mta_tax', 'tolls_amount','improvement_surcharge','fare_amount','tip_amount'], 'total_amount')
display(count_mismatch)

In [None]:
display(df_copy['tolls_amount'].skew())
display(df_copy['total_amount'].skew())

#### The `tolls_amount` data is still skewed, because 90 percent of rides didn't incur tolls. However, the method discussed above was the only way to handle the outliers and make the toll amounts reasonable, and the data is now plausible.

In [None]:
df_copy.describe()

#### The rest of the columns hold discrete values, while this kind of statistical outlier analysis needs continuous data. Moreover, we already handled the incorrect discrete values earlier, so those columns are already correct.

<a class="anchor" id="Feature"></a>
# 4 - Data transformation and feature eng.

## 4.1 - Discretization

#### In this section we discretize all continuous columns using equal-width discretization and add the 2 new columns `week_number` and `date_range`. Equal-width discretization was chosen for two reasons. Consistency: the method doesn't depend on the underlying data distribution, so with a predetermined set of bins, new data can be categorized into the same bins even if its distribution differs from the old data. Uniformity: every bin has the same width, which is useful when data must be categorized into fixed ranges regardless of how the values are distributed within them.
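
#### As a minimal illustration of equal-width binning with `pd.cut`, the sketch below uses made-up, right-skewed values (not taken from the dataset). The bins have the same width, but the number of rows per bin can be very uneven when the data is skewed.

```python
import pandas as pd

values = pd.Series([0.5, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 10.0])  # made-up, right-skewed
labels = ["low", "medium-low", "medium", "medium-high", "high"]

# bins=5 asks pd.cut for five intervals of equal width spanning min..max;
# retbins=True also returns the bin edges that were used
binned, edges = pd.cut(values, bins=5, labels=labels, include_lowest=True, retbins=True)

print(edges)
print(binned.value_counts())
```

#### Here most values land in the "low" bin and the single large value is alone in "high", which is the usual trade-off of equal-width bins on skewed data.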

#### Helper Functions

In [None]:
import calendar

def add_weekly_columns(df, datetime_column):
    """
    Add 'week_number' and 'week_range' columns to the dataframe based on a datetime column.
    
    Parameters:
    - df: The dataframe
    - datetime_column: The column in df containing datetime values
    
    Returns:
    - DataFrame with two new columns 'week_number' and 'week_range'
    """
    
    # Determine the start (Monday at midnight) of the week containing the earliest date
    first_date = df[datetime_column].min()
    start_of_first_week = (first_date - pd.to_timedelta(first_date.dayofweek, unit='D')).normalize()

    # Compute the day difference to the start of the first week
    day_difference = (df[datetime_column] - start_of_first_week).dt.days

    # Use this difference to determine the week number
    df['week_number'] = (day_difference // 7) + 1

    # Keep the week boundaries as datetimes so that date arithmetic and formatting work
    week_start = (df[datetime_column] - pd.to_timedelta(df[datetime_column].dt.dayofweek, unit='D')).dt.normalize()
    week_end = week_start + pd.Timedelta(days=6)
    df['week_range'] = week_start.dt.strftime('%Y-%m-%d') + ' / ' + week_end.dt.strftime('%Y-%m-%d')
    
    return df

def equal_width_discretization(df, column_name, bins=10, new_column_name=None, labels=None):
    """
    Perform Equal Width Discretization on a specified column of a DataFrame.
    
    Parameters:
    - df: The dataframe.
    - column_name: Name of the column to be discretized.
    - bins: Number of equal width bins to create (default is 10).
    - new_column_name: Name of the new column to store the discretized data (default is None, which overwrites the existing column).
    - labels: List of labels for the bins. If not provided, defaults to numerical labels.
    
    Returns:
    - DataFrame with the discretized column.
    """
    
    # If no labels provided, use numerical bin labels
    if labels is None:
        labels = [i for i in range(1, bins + 1)]
    elif len(labels) != bins:
        raise ValueError(f"Number of labels ({len(labels)}) should match the number of bins ({bins}).")

    # If no new column name provided, overwrite the original column
    if new_column_name is None:
        new_column_name = column_name
    
    series_to_cut = df[column_name]

    # Datetime columns are cut into consecutive 7-day bins starting at the earliest timestamp
    if pd.api.types.is_datetime64_any_dtype(series_to_cut):
        min_date = series_to_cut.min()
        bin_edges = [min_date + pd.to_timedelta(7 * i, unit='D') for i in range(bins + 1)]
    else:
        # For numeric columns, pd.cut splits the value range into `bins` equal-width intervals
        bin_edges = bins
    
    # Use pandas cut function to create the discretized column
    df[new_column_name] = pd.cut(series_to_cut, bins=bin_edges, labels=labels, include_lowest=True)
    
    return df

def equal_width_discretization_range(df, column_name, bins=10, new_column_name=None):
    """
    Perform Equal Width Discretization on a specified column of a DataFrame and return the date ranges for each bin.
    
    Parameters:
    - df: The dataframe.
    - column_name: Name of the column to be discretized.
    - bins: Number of equal width bins to create (default is 10).
    - new_column_name: Name of the new column to store the date range (default is None, which overwrites the existing column).
    
    Returns:
    - DataFrame with the date range column.
    """
    
    # If no new column name provided, overwrite the original column
    if new_column_name is None:
        new_column_name = column_name
    
    # Date-range labels only make sense for datetime columns
    if not pd.api.types.is_datetime64_any_dtype(df[column_name]):
        raise TypeError(f"Column '{column_name}' must be of datetime type to build date ranges.")

    series_to_cut = df[column_name]
    min_date = series_to_cut.min()

    # Creating bins with 7 day width using pd.to_timedelta
    bin_edges = [min_date + pd.to_timedelta(7 * i, unit='D') for i in range(bins + 1)]
    
    # Generate labels for the date ranges (the end date is the last day inside each bin)
    labels = [f"{bin_edges[i].strftime('%Y-%m-%d')} / {(bin_edges[i+1] - pd.to_timedelta(1, unit='D')).strftime('%Y-%m-%d')}" for i in range(len(bin_edges)-1)]
    
    # Use pandas cut function to create the discretized column
    df[new_column_name] = pd.cut(series_to_cut, bins=bin_edges, labels=labels, include_lowest=True)
    
    return df

In [None]:
display(df_copy.head(5))

In [None]:
df_copy = equal_width_discretization(df_copy, "trip_distance", bins=5, new_column_name="trip_distance_discretized", labels=["low", "medium-low", "medium", "medium-high", "high"])

In [None]:
display(df_copy.tail(25))

In [None]:
df_copy = equal_width_discretization(df_copy, "fare_amount", bins=5, new_column_name="fare_amount_discretized", labels =["low", "medium-low", "medium", "medium-high", "high"])

In [None]:
display(df_copy.tail(5))

In [None]:
df_copy = equal_width_discretization(df_copy, "tip_amount", bins=3, new_column_name="tip_amount_discretized", labels =["low", "medium","high"])

In [None]:
df_copy = equal_width_discretization(df_copy, "total_amount", bins=5, new_column_name="total_amount_discretized", labels =["low", "medium-low", "medium", "medium-high", "high"])

In [None]:
display(df_copy.head(25))

In [None]:
df_copy.dtypes

In [None]:
df_copy = equal_width_discretization(df_copy, "pickup_datetime", bins=5, new_column_name="week_number")

In [None]:
display(df_copy)

In [None]:
unique_values = df_copy["week_number"].value_counts()
print(unique_values)
df_copy.dtypes

In [None]:
filtered_df = df_copy[df_copy['pickup_datetime'].dt.day == 7]

display(filtered_df)

In [None]:
unique_values = df_copy["week_number"].unique()
print(unique_values)
df_copy.dtypes

In [None]:
df_copy = equal_width_discretization_range(df_copy, "pickup_datetime", bins=5, new_column_name="date_range")

In [None]:
filtered_df = df_copy[df_copy['pickup_datetime'].dt.day == 7]

display(filtered_df)

In [None]:
display(df_copy.shape)

#### As shown, this makes sense: January spans 5 weekly bins (week 5 covers only the last few days of the month), and `date_range` gives the date range covered by each week.
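The weekly binning can be checked on a handful of timestamps. The block below is a minimal sketch, not the notebook's own function: the toy timestamps are made up, and it assumes bins anchored at midnight of the first day with left-closed intervals (`right=False`), which differs slightly from the `include_lowest=True` call used above.

```python
import pandas as pd

# Toy pickup timestamps covering January 2016 (illustrative, not the real data)
pickups = pd.Series(pd.to_datetime([
    "2016-01-01 00:05", "2016-01-07 23:59", "2016-01-08 00:00",
    "2016-01-15 12:30", "2016-01-29 08:00", "2016-01-31 23:00",
]))

start = pickups.min().normalize()  # midnight of the earliest day
edges = [start + pd.Timedelta(days=7 * i) for i in range(6)]  # 5 bins of 7 days

# right=False makes each bin [start, end), so 2016-01-08 00:00 opens week 2
week_number = pd.cut(pickups, bins=edges, labels=[1, 2, 3, 4, 5], right=False)
print(week_number.tolist())
```

With these assumptions, the trips on 29 and 31 January both land in week 5, which is the short final week described above.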


<a class="anchor" id="GPS"></a>
## 4.2 - Additional data extraction (GPS coordinates)

In [None]:
import os
from geopy.geocoders import GoogleV3
import time

### Helper Functions

In [None]:
def get_coordinates(city_name, geolocator):
    location = geolocator.geocode(city_name)
    if location:
        return location.latitude, location.longitude
    else:
        return None, None
    
def get_coordinates_google(location, api_key):
    geolocator = GoogleV3(api_key=api_key)

    # Split location into borough and zone ("Borough, Zone")
    parts = location.split(', ')
    full_location = ", ".join(parts[:2])  # tolerate values without a zone part

    try:
        # Try the full location first, then fall back to the borough alone
        for query in (full_location, parts[0]):
            location_obj = geolocator.geocode(query, timeout=10)
            time.sleep(1)  # Pause after every request to stay under rate limits
            if location_obj:
                return location_obj.latitude, location_obj.longitude
    except Exception as e:
        print(f"Error fetching coordinates for {location}: {e}")

    return None, None
    
def gather_and_save_unique_coordinates(df, api_key, pu_column='pickup_location', do_column='dropoff_location', filename="all_location_coordinates.csv"):
    """
    Gathers unique GPS coordinates for the given pickup and drop-off columns and saves them to a CSV.
    """
    
    # Extract unique values from both columns
    unique_pu_locations = df[pu_column].unique()
    unique_do_locations = df[do_column].unique()

    # Combine and deduplicate
    all_unique_locations = set(unique_pu_locations) | set(unique_do_locations)

    # If CSV doesn't exist, fetch coordinates and save to CSV
    if not os.path.exists(filename):
        coordinates = {}
        for location in all_unique_locations:
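            coords = get_coordinates_google(location, api_key)
            # The helper returns (None, None) on failure; a tuple is always truthy, so check its content
            if coords[0] is not None:
                coordinates[location] = coords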
        # Save to CSV
        coordinates_df = pd.DataFrame.from_dict(coordinates, orient='index', columns=['Latitude', 'Longitude'])
        coordinates_df.to_csv(filename)
    else:
        coordinates_df = pd.read_csv(filename, index_col=0)
    
    return coordinates_df

def populate_lat_long(df, prefix, coordinates_df):
    df[prefix + '_lat'] = df[prefix + '_location'].map(coordinates_df['Latitude'])
    df[prefix + '_long'] = df[prefix + '_location'].map(coordinates_df['Longitude'])
    return df



In [None]:
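# Read the key from the environment instead of hard-coding it in the notebook.
# GOOGLE_API_KEY is an assumed variable name; set it before running this cell.
api_key = os.environ["GOOGLE_API_KEY"]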
coordinates_df = gather_and_save_unique_coordinates(df_copy, api_key)

In [None]:
# Populating the columns
df_copy = populate_lat_long(df_copy, 'pickup', coordinates_df)
df_copy = populate_lat_long(df_copy, 'dropoff', coordinates_df)

In [None]:
coordinates_df = pd.read_csv("all_location_coordinates.csv", index_col=0)  # keep locations as the index
display(coordinates_df)

In [None]:
display(df_copy)

In [None]:
display(df_copy.tail())

In [None]:
df_copy[["pickup_lat","dropoff_lat"]].isnull().sum()

#### These are the rows for which the API found no coordinates, because both the borough and the zone are unknown.
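To see why unknown locations end up null, here is a minimal sketch of the lookup step. The coordinate values and the `"Unknown, Unknown"` label are illustrative assumptions, not values taken from the dataset or the API.

```python
import pandas as pd

# Hypothetical coordinates table indexed by "Borough, Zone" (values are illustrative)
coords = pd.DataFrame(
    {"Latitude": [40.78, 40.69], "Longitude": [-73.97, -73.99]},
    index=["Manhattan, Central Park", "Brooklyn, Boerum Hill"],
)

trips = pd.DataFrame({"pickup_location": [
    "Manhattan, Central Park", "Unknown, Unknown", "Brooklyn, Boerum Hill",
]})

# Series.map looks each location up in the index; unmatched keys become NaN
trips["pickup_lat"] = trips["pickup_location"].map(coords["Latitude"])
missing = trips.loc[trips["pickup_lat"].isna(), "pickup_location"].unique()
print(missing)
```

Listing the distinct locations behind the nulls, as `missing` does here, is a quick way to confirm that only the unknown borough/zone rows are affected.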

## 4.3 - Encoding

In [None]:
df_copy.nunique()

#### The categorical features that could be encoded are `vendor`, `store_and_fwd`, `trip_type`, `rate_type`, `pickup_location`, `dropoff_location`, and `payment_type`:
- We can encode `vendor`, `store_and_fwd`, `payment_type`, `rate_type` and `trip_type` using one-hot encoding.
- We can encode `pickup_location` and `dropoff_location` using label encoding.



#### Reasoning
- For `vendor`, `store_and_fwd` and `trip_type`, one-hot encoding is appropriate. Although they are binary, there is no clear ordinal relationship between the two values, so label encoding might impose an unintended order: assigning 0 and 1 could make certain algorithms treat them as ranked. One-hot encoding adds only one extra column per feature, and dropping the first category removes even that, which gives the same result as label encoding. The decision therefore doesn't really matter for these features.
- `rate_type` and `payment_type` have no ordinal relationship between their categories either. One-hot encoding grows the feature space by 10 (they have 10 categories combined) and so increases the size of the dataset, while label encoding might enforce some kind of rank among the categories. It's a trade-off. I prefer one-hot encoding here because it is more readable and avoids confusing the model with an artificial ranking, at the cost of a larger feature space. I also didn't restrict the one-hot encoding to the most frequent categories, since that would lose information about the less frequent ones.
- `pickup_location` and `dropoff_location` have many unique values, so one-hot encoding them would add a huge number of columns. Label encoding is more appropriate for them in this preprocessing step.
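The trade-off can be illustrated on a toy frame. This is a sketch using plain pandas (`get_dummies` and `factorize`) rather than the scikit-learn encoders used in the next cell, and the category values are made up for illustration.

```python
import pandas as pd

# Toy data; the category names are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "payment_type": ["Cash", "Credit card", "Dispute", "Cash"],
    "pickup_location": ["Queens, Astoria", "Bronx, Norwood",
                        "Queens, Astoria", "Brooklyn, Bushwick"],
})

# One-hot with the first category dropped: k categories -> k - 1 columns
onehot = pd.get_dummies(toy["payment_type"], prefix="payment_type", drop_first=True)

# Label encoding: a single integer column, however many categories there are
codes, categories = pd.factorize(toy["pickup_location"], sort=True)

print(onehot.columns.tolist())
print(dict(zip(categories, range(len(categories)))))
```

Three payment categories produce two one-hot columns, while the three locations stay in one integer column. That is why label encoding scales better for the high-cardinality location features.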


In [None]:
df_copy.shape

In [None]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
import pandas as pd

def encode_features(df, features, method):
    """
    Encode given columns of a dataframe.
    
    Parameters:
    - df: DataFrame to be encoded.
    - features: List of column names to be encoded.
    - method: Encoding method - either 'onehot' or 'label'.
    
    Returns:
    - DataFrame with encoded features.
    - Dictionary of mappings from original to encoded values for each feature.
    """
    df_copy = df.copy()
    mappings = {}

    # One-Hot Encoding
    if method == 'onehot':
        one_hot_encoder = OneHotEncoder(drop='first', sparse_output=False)
        
        df_encoded = pd.DataFrame(one_hot_encoder.fit_transform(df_copy[features]), 
                                  columns=one_hot_encoder.get_feature_names_out(features),
                                  index=df_copy.index)  # Important: Maintain the original index
        
        # Mapping for one-hot encoding - for each feature, a dict of category to encoded column
        for feature, cats in zip(features, one_hot_encoder.categories_):
            mappings[feature] = {cat: f"{feature}_{cat}" for cat in cats[1:]}  # Exclude first category
        
        df_copy = pd.concat([df_copy, df_encoded], axis=1)
        df_copy.drop(features, axis=1, inplace=True)  # Remove original columns after encoding
    
    # Label Encoding
    elif method == 'label':
        for feature in features:
            label_encoder = LabelEncoder()
            df_copy[feature] = label_encoder.fit_transform(df_copy[feature])  # In-place modification
            mappings[feature] = dict(zip(label_encoder.classes_, range(len(label_encoder.classes_))))
    else:
        raise ValueError("Method must be either 'onehot' or 'label'")
    
    return df_copy, mappings



In [None]:
onehot_features = ['vendor', 'store_and_fwd', 'payment_type', 'rate_type', 'trip_type']
label_features = ['pickup_location','dropoff_location']

df_copy, onehot_mappings = encode_features(df_copy, onehot_features, 'onehot')

df_copy, label_mappings = encode_features(df_copy, label_features, 'label')
update_lookup(label_mappings)

display(lookup_table)

display(df_copy.columns)


In [None]:
df_copy.shape

In [None]:
df_copy.head()

In [None]:
df_copy[label_features].tail(5)

#### As shown, the data has been encoded. Each binary one-hot feature adds only a single column because we drop the first category, and the type can be inferred from the value of that one column. For `rate_type` and `payment_type`, a new column was added for each category except the first, again because of drop-first. `pickup_location` and `dropoff_location` were label-encoded, as the previous cell shows.
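Because the label mappings are stored in the lookup table, the integer codes can be turned back into location names later. The sketch below assumes a small hypothetical slice of such a mapping; the names and codes are illustrative.

```python
import pandas as pd

# Hypothetical slice of the label mapping kept in the lookup table
label_mapping = {"Bronx, Norwood": 0, "Brooklyn, Bushwick": 1, "Queens, Astoria": 2}
encoded = pd.Series([2, 0, 2, 1], name="pickup_location")

# Invert the mapping (code -> original value) and apply it to the encoded column
inverse = {code: name for name, code in label_mapping.items()}
decoded = encoded.map(inverse)
print(decoded.tolist())
```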

## 4.4 - Normalisation 

In [None]:
df_copy.describe()

#### Since we don't know which machine learning model will be used, we shouldn't normalise the data and change its distribution. We also already know that the columns aren't skewed, because we handled the outliers earlier. `fare_amount` has a high standard deviation, and consequently so does `total_amount`, so both could benefit from normalisation. However, that would change the distribution and tamper with financial information, which is wrong in this context. Normalising therefore isn't beneficial in this case.
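If a later model does turn out to need scaled inputs, the scaling can be done without touching the original amounts. The following is a minimal sketch under that assumption; the `fare_amount_scaled` column name and the toy values are hypothetical.

```python
import pandas as pd

# Toy fares (illustrative values only)
fares = pd.DataFrame({"fare_amount": [5.0, 10.0, 20.0, 45.0]})

# Min-max scaling into a separate column; the original amounts stay untouched
lo, hi = fares["fare_amount"].min(), fares["fare_amount"].max()
fares["fare_amount_scaled"] = (fares["fare_amount"] - lo) / (hi - lo)
print(fares)
```

Keeping the scaled values in their own column leaves the financial figures intact for reporting and defers the choice of scaler to the modelling stage.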

## 4.5 - Adding more features (feature engineering)

 #### Add 3 new features: 
 - Duration: Difference between dropoff_datetime and pickup_datetime.
 - Weekend: Whether the trip started on a weekend.
 - Average Speed: trip_distance divided by Duration.

In [None]:
def add_features(df):
    
    # 1. Calculate Duration in hours
    df['duration_hours'] = (df['dropoff_datetime'] - df['pickup_datetime']).dt.total_seconds() / 3600
    
    # 2. Identify if the trip was on a weekend
    df['is_weekend'] = df['pickup_datetime'].dt.weekday >= 5  # 5 for Saturday, 6 for Sunday
    
    # 3. Calculate Average Speed in miles per hour
    df['avg_speed_mph'] = df['trip_distance'] / df['duration_hours']
    # Set avg_speed_mph to -1 where trip_distance is -1
    df['avg_speed_mph'] = np.where(df['trip_distance'] == -1, -1, df['avg_speed_mph'])
    
    return df

In [None]:
df_copy = add_features(df_copy)

In [None]:
display(df_copy.head(5))

#### As shown the 3 new features were added with their respective values. 
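One edge case worth checking is a trip whose pickup and drop-off times are identical: dividing by a zero duration yields an infinite speed. The sketch below reproduces the three features on toy rows and, as an assumption that is not part of `add_features` above, marks such speeds as missing.

```python
import pandas as pd

# Toy trips (illustrative); the second one has zero duration
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2016-01-02 10:00", "2016-01-04 09:00", "2016-01-05 09:00"]),
    "dropoff_datetime": pd.to_datetime(
        ["2016-01-02 10:30", "2016-01-04 09:00", "2016-01-05 09:15"]),
    "trip_distance": [6.0, 1.2, 3.0],
})

hours = (trips["dropoff_datetime"] - trips["pickup_datetime"]).dt.total_seconds() / 3600
trips["duration_hours"] = hours
trips["is_weekend"] = trips["pickup_datetime"].dt.weekday >= 5  # Saturday=5, Sunday=6

# Mask non-positive durations so the division gives NaN instead of inf
trips["avg_speed_mph"] = trips["trip_distance"] / hours.where(hours > 0)
print(trips[["duration_hours", "is_weekend", "avg_speed_mph"]])
```

Whether missing, a sentinel such as -1, or dropping the row is the right treatment depends on how the downstream steps handle nulls.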

In [None]:
count_mismatch = count_mismatch_rows(df_copy, ['extra', 'mta_tax', 'tolls_amount','improvement_surcharge','fare_amount','tip_amount'], 'total_amount')
display(count_mismatch)

In [None]:
print(df_copy.columns)
display(null_percentage(df_copy))

<a class="anchor" id="LookUP"></a>
## 4.6 - CSV file for lookup

In [None]:
flattened_data = []

for feature, mapping in lookup_table.items():
    for original_value, encoded_value in mapping.items():
        flattened_data.append({
            'Column Name': feature,
            'Original Value': original_value,
            'Imputed/Encoded Value': encoded_value
        })

lookup_df = pd.DataFrame(flattened_data)

lookup_df.to_csv('lookup_table.csv', index=False)

<a class="anchor" id="Parquet"></a>
## 5 - Exporting the dataframe to a CSV or Parquet file

In [None]:
# replace df to df final
df_copy.to_csv('green_tripdata_16-01_clean.csv', index=False, header=True)

In [None]:
df_copy.to_parquet('green_trip_data_16-1_clean.parquet', engine='pyarrow', compression='snappy')