# **Taxi 🚕 Data** 📊
## Loading CSV file. 



In [40]:
import zipfile
import pandas as pd

# Define the ZIP file name
zip_file_name = 'Yellow_Taxi_Assignment.csv.zip'

# Extract the ZIP file
with zipfile.ZipFile(zip_file_name, 'r') as zip_file:
    # Assuming there is only one CSV file in the ZIP archive
    csv_file_name = zip_file.namelist()[0]
    zip_file.extract(csv_file_name)

# Define the date columns that you want to parse as datetime objects
date_columns = ['tpep_pickup_datetime', 'tpep_dropoff_datetime']

# Read the extracted CSV file using pandas with date parsing
df_ny = pd.read_csv(csv_file_name, parse_dates=date_columns)

# df_ny now holds the CSV data.
# List each column name together with its inferred data type.
variable_types = df_ny.dtypes
for variable, data_type in variable_types.items():
    print(f"Variable: {variable}, Data Type: {data_type}")

# Show the first few rows
df_ny.head()


Variable: VendorID, Data Type: int64
Variable: tpep_pickup_datetime, Data Type: datetime64[ns]
Variable: tpep_dropoff_datetime, Data Type: datetime64[ns]
Variable: passenger_count, Data Type: float64
Variable: trip_distance, Data Type: float64
Variable: RatecodeID, Data Type: float64
Variable: store_and_fwd_flag, Data Type: object
Variable: PULocationID, Data Type: int64
Variable: DOLocationID, Data Type: int64
Variable: payment_type, Data Type: int64
Variable: fare_amount, Data Type: float64
Variable: extra, Data Type: float64
Variable: mta_tax, Data Type: float64
Variable: tip_amount, Data Type: float64
Variable: tolls_amount, Data Type: float64
Variable: improvement_surcharge, Data Type: float64
Variable: total_amount, Data Type: float64
Variable: congestion_surcharge, Data Type: float64
Variable: airport_fee, Data Type: float64


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,2,2018-01-01 12:02:01,2018-01-01 12:04:05,1.0,0.53,1.0,N,142,163,1,3.5,0.0,0.5,1.29,0.0,0.3,5.59,,
1,2,2018-01-01 12:26:48,2018-01-01 12:31:29,1.0,1.05,1.0,N,140,236,1,6.0,0.0,0.5,1.02,0.0,0.3,7.82,,
2,2,2018-01-01 01:28:34,2018-01-01 01:39:38,4.0,1.83,1.0,N,211,158,1,9.5,0.5,0.5,1.62,0.0,0.3,12.42,,
3,1,2018-01-01 08:51:59,2018-01-01 09:01:45,1.0,2.3,1.0,N,249,4,2,10.0,0.0,0.5,0.0,0.0,0.3,10.8,,
4,2,2018-01-01 01:00:19,2018-01-01 01:14:16,1.0,3.06,1.0,N,186,142,1,12.5,0.5,0.5,1.0,0.0,0.3,14.8,,
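
The extract-then-read steps above can also be collapsed into one call: `pandas.read_csv` can read a ZIP archive directly when the archive contains exactly one file. A minimal sketch under that assumption; the archive name `demo_trips.csv.zip` and its two abridged rows are made up here so the example runs on its own.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny single-file ZIP archive (hypothetical demo data, not the real dataset)
csv_text = (
    "VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance\n"
    "2,2018-01-01 12:02:01,2018-01-01 12:04:05,0.53\n"
    "1,2018-01-01 08:51:59,2018-01-01 09:01:45,2.30\n"
)
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "demo_trips.csv.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("demo_trips.csv", csv_text)

# read_csv infers ZIP compression from the '.zip' suffix,
# so no separate extraction step is needed
date_columns = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]
df_demo = pd.read_csv(zip_path, parse_dates=date_columns)

print(df_demo.dtypes)
```

This avoids leaving an extracted copy of the CSV in the working directory, at the cost of failing if the archive ever holds more than one file.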


## Data Cleaning & Imputation
1.- I'm going to check for missing values.

In [41]:
print(f"The number of rows is {df_ny.shape[0]}")
# Count the null values in each column
df_ny.isnull().sum()

The number of rows is 304978


VendorID                      0
tpep_pickup_datetime          0
tpep_dropoff_datetime         0
passenger_count            9513
trip_distance                 0
RatecodeID                 9513
store_and_fwd_flag         9513
PULocationID                  0
DOLocationID                  0
payment_type                  0
fare_amount                   0
extra                         0
mta_tax                       0
tip_amount                    0
tolls_amount                  0
improvement_surcharge         0
total_amount                  0
congestion_surcharge      72632
airport_fee              198761
dtype: int64
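
Raw counts are easier to judge next to the share of rows they represent: out of 304978 rows, the counts above come to roughly 3%, 24% and 65%. A minimal sketch of computing both at once, using a made-up frame (`df_demo` is hypothetical, not the taxi data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values
df_demo = pd.DataFrame({
    "passenger_count": [1.0, np.nan, 2.0, 1.0],
    "fare_amount": [3.5, 6.0, 9.5, 10.0],
    "airport_fee": [np.nan, np.nan, np.nan, 1.25],
})

# isnull().sum() counts missing values; isnull().mean() gives their fraction
missing_summary = pd.DataFrame({
    "missing": df_demo.isnull().sum(),
    "share": df_demo.isnull().mean(),
})
print(missing_summary)
```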

2.- **Remove the rows with NaN values in the following fields:**
| Field                | Count |
|----------------------|-------|
| passenger_count      | 9513  |
| RatecodeID           | 9513  |
| store_and_fwd_flag   | 9513  |

**Nulls in columns `airport_fee` and `congestion_surcharge` will be handled separately.**

In [42]:
# Remove rows with missing values in specific columns
columns_to_check = ['passenger_count', 'RatecodeID', 'store_and_fwd_flag']
# Drop every row that has a null in any of these columns
df_ny.dropna(subset=columns_to_check, inplace=True)
# The new number of rows
print(f"The number of rows is {df_ny.shape[0]}")


The number of rows is 295465
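
The row count fell by exactly 9513 (from 304978 to 295465), which matches the null count of each of the three columns and is therefore consistent with the nulls occurring in the same rows. A sketch of how that could be checked directly before dropping, using a made-up frame (`df_demo` is hypothetical):

```python
import numpy as np
import pandas as pd

columns_to_check = ["passenger_count", "RatecodeID", "store_and_fwd_flag"]

# Hypothetical toy frame where the three columns are missing together
df_demo = pd.DataFrame({
    "passenger_count": [1.0, np.nan, 4.0, np.nan],
    "RatecodeID": [1.0, np.nan, 1.0, np.nan],
    "store_and_fwd_flag": ["N", np.nan, "N", np.nan],
    "fare_amount": [3.5, 6.0, 9.5, 10.0],
})

null_mask = df_demo[columns_to_check].isnull()
rows_any_null = int(null_mask.any(axis=1).sum())  # rows missing at least one field
rows_all_null = int(null_mask.all(axis=1).sum())  # rows missing all three fields

# Equal counts mean the nulls always occur together in the same rows
print(rows_any_null, rows_all_null)
```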


3.- Now that the rows with a missing passenger count have been removed, we will look into the columns `airport_fee` and `congestion_surcharge` for cleaning or imputation.

In [43]:
unique_airport_fees = df_ny['airport_fee'].unique()
unique_congestion_surcharges = df_ny['congestion_surcharge'].unique()

print("Unique values in 'airport_fee':")
print(unique_airport_fees)

print("\nUnique values in 'congestion_surcharge':")
print(unique_congestion_surcharges)

Unique values in 'airport_fee':
[  nan  0.    1.25 -1.25]

Unique values in 'congestion_surcharge':
[  nan  0.    2.5  -2.5   2.75  0.5 ]
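
`unique()` lists which values occur but not how often. A minimal sketch, on a made-up Series rather than the taxi data, of `value_counts(dropna=False)`, which would show how many rows the negative and NaN values actually affect:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of airport_fee values
airport_fee_demo = pd.Series(
    [np.nan, 0.0, 1.25, -1.25, 0.0, np.nan, 0.0], name="airport_fee"
)

# dropna=False keeps NaN as its own category in the tally
counts = airport_fee_demo.value_counts(dropna=False)
n_negative = int((airport_fee_demo < 0).sum())
n_missing = int(airport_fee_demo.isnull().sum())

print(counts)
print(n_negative, n_missing)
```

Knowing these counts helps decide whether flipping the sign or filling with 0 touches a handful of rows or a large part of the data.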


-----
By definition, `airport_fee` is a $1.25 charge applied only to pickups at LaGuardia and John F. Kennedy airports.
The definition of `congestion_surcharge` is the total amount collected in the trip for the NYS congestion surcharge.
I will change the negative values to positive (I am not yet sure this is the right treatment) and replace the NaN values with 0.
For example, when `airport_fee` is negative it is multiplied by -1, and when it is NaN it is replaced by 0.

-----

In [47]:
# First, fill NaN values in both columns with 0.
# Assigning the result back avoids the chained inplace call, which newer pandas versions warn about.
df_ny['airport_fee'] = df_ny['airport_fee'].fillna(0)
df_ny['congestion_surcharge'] = df_ny['congestion_surcharge'].fillna(0)
# Then, replace negative values in both columns with their absolute values
df_ny['airport_fee'] = df_ny['airport_fee'].abs()
df_ny['congestion_surcharge'] = df_ny['congestion_surcharge'].abs()

unique_airport_fees = df_ny['airport_fee'].unique()
unique_congestion_surcharges = df_ny['congestion_surcharge'].unique()

print("Unique values in 'airport_fee':")
print(unique_airport_fees)

print("\nUnique values in 'congestion_surcharge':")
print(unique_congestion_surcharges)

print(f"The number of rows is {df_ny.shape[0]}")

Unique values in 'airport_fee':
[0.   1.25]

Unique values in 'congestion_surcharge':
[0.   2.5  2.75 0.5 ]
The number of rows is 295465


## Looking for outliers. 