# Real-world Data Wrangling

In this project, you will apply the skills you acquired in the course to gather and wrangle real-world data with two datasets of your choice.

You will retrieve and extract the data, assess the data programmatically and visually, accross elements of data quality and structure, and implement a cleaning strategy for the data. You will then store the updated data into your selected database/data store, combine the data, and answer a research question with the datasets.

Throughout the process, you are expected to:

1. Explain your decisions towards methods used for gathering, assessing, cleaning, storing, and answering the research question
2. Write code comments so your code is more readable

Before you start, install the some of the required packages. 

In [None]:
!python -m pip install kaggle==1.6.12
!pip install --target=/workspace ucimlrepo

In [1]:
!pip install openmeteo-requests

Defaulting to user installation because normal site-packages is not writeable


In [2]:
!pip install requests-cache retry-requests numpy pandas

Defaulting to user installation because normal site-packages is not writeable


In [3]:
# Install compatible versions using %pip magic command
%pip install numpy==1.21.4 matplotlib


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


**Note:** Restart the kernel to use updated package(s).

## 1. Gather data

In this section, you will extract data using two different data gathering methods and combine the data. Use at least two different types of data-gathering methods.

### **1.1.** Problem Statement
The primary goal of this project is to analyze the impact of weather conditions on taxi trip patterns in New York City. By examining both the NYC Taxi Trip dataset and weather data from the Open-Meteo API, we aim to identify how variables such as temperature, humidity, precipitation, and wind speed influence taxi demand, trip duration, and fare amounts. This analysis will provide insights into how weather conditions affect urban transportation and can help improve the efficiency and reliability of taxi services in the city.

Resources : 
* kaggle https://www.kaggle.com/datasets/anandaramg/taxi-trip-data-nyc/data?select=taxi_tripdata.csv
* open-meteo API https://open-meteo.com/en/docs/historical-weather-api#latitude=40.7128&longitude=74.006&start_date=2020-01-01&end_date=2021-12-31&hourly=temperature_2m,relative_humidity_2m,rain,snowfall,snow_depth,pressure_msl,wind_speed_100m,wind_direction_100m&daily=&temperature_unit=fahrenheit&timezone=America%2FNew_York


### **1.2.** I Gathered two datasets using two different data gathering methods

List of data gathering methods I used:

- Download data manually
- Gather data by accessing APIs


#### **Dataset 1: New York City Taxi Trip Data**
**Why this dataset was picked:**
The NYC Taxi Trip dataset provides a comprehensive record of taxi trips in New York City, including details such as pickup and dropoff locations, trip distance, fare amount, and various surcharges. This dataset is crucial for understanding urban mobility patterns and assessing how different factors, such as weather conditions, affect taxi usage.

Type: *CSV file*

Method: *Download data manually from : Dataset Source: https://www.kaggle.com/datasets/anandaramg/taxi-trip-data-nyc/data Taxi trip data NYC* 

Dataset variables:
- **VendorID**: A code indicating the provider associated with the trip record.
- **lpep_pickup_datetime**: The date and time when the trip started. This variable helps to analyze taxi demand patterns over different times of the day and seasons.
- **lpep_dropoff_datetime**: The date and time when the trip ended. This variable, along with the pickup datetime, can be used to calculate trip duration.
- **trip_distance**: The distance covered during the trip. This variable is essential for understanding the relationship between trip distance and fare amounts.
- **fare_amount**: The fare charged for the trip. This variable is crucial for analyzing the economic aspects of taxi operations.
- **total_amount**: The total amount charged for the trip, including all surcharges and taxes. This variable provides a complete picture of the cost of a taxi ride.
- **passenger_count**: The number of passengers in the taxi. This variable can be used to analyze trends in group travel and taxi sharing.
- **PULocationID** and **DOLocationID**: The pickup and dropoff location IDs. These variables help in mapping and spatial analysis of taxi trips across different areas of the city.


In [4]:
import requests
import pandas as pd
import os

taxi_df = pd.read_csv('taxi_tripdata.csv')


  taxi_df = pd.read_csv('taxi_tripdata.csv')


#### Dataset 2 : Hourly weather forecast in NYC
**Why this dataset was picked:**
The weather data from the Open-Meteo API was chosen to complement the NYC Taxi Trip dataset. Weather conditions significantly affect transportation patterns, and analyzing this data can help in understanding how factors like temperature, precipitation, and wind speed influence taxi demand and travel times.



Type: *API Data.*

Method: *The data was gathered using the "weather_api" method from the Open-Meteo API.*

Dataset variables:

*   *Variable 1: temperature_2m* (e.g., 2-meter temperature in Degree Celsius)
*   *Variable 2: relative_humidity_2m* (e.g., Relative humidity at 2 meters)
*   *Variable 3: rain* (e.g., Rainfall)
*   *Variable 4: snowfall* (e.g., Snowfall)
*   *Variable 5: snow_depth* (e.g., Snow depth)
*   *Variable 6: pressure_msl* (e.g., Mean sea level pressure)
*   *Variable 7: wind_speed_100m* (e.g., Wind speed at 100 meters above ground level)
*   *Variable 8: wind_direction_100m* (e.g., Wind direction at 100 meters above ground level)


In [5]:
import openmeteo_requests
import requests_cache
import pandas as pd
from retry_requests import retry

# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after = -1)
retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
openmeteo = openmeteo_requests.Client(session = retry_session)

# Make sure all required weather variables are listed here
# The order of variables in hourly or daily is important to assign them correctly below
url = "https://archive-api.open-meteo.com/v1/archive"
params = {
    "latitude": 40.7128,
    "longitude": 74.006,
    "start_date": "2020-01-01",
    "end_date": "2021-12-31",
    "hourly": ["temperature_2m", "relative_humidity_2m", "rain", "snowfall", "snow_depth", "pressure_msl", "wind_speed_100m", "wind_direction_100m"],
    "temperature_unit": "fahrenheit",
    "timezone": "America/New_York"
}
responses = openmeteo.weather_api(url, params=params)

# Process first location. Add a for-loop for multiple locations or weather models
response = responses[0]
print(f"Coordinates {response.Latitude()}°N {response.Longitude()}°E")
print(f"Elevation {response.Elevation()} m asl")
print(f"Timezone {response.Timezone()} {response.TimezoneAbbreviation()}")
print(f"Timezone difference to GMT+0 {response.UtcOffsetSeconds()} s")

# Process hourly data. The order of variables needs to be the same as requested.
hourly = response.Hourly()
hourly_temperature_2m = hourly.Variables(0).ValuesAsNumpy()
hourly_relative_humidity_2m = hourly.Variables(1).ValuesAsNumpy()
hourly_rain = hourly.Variables(2).ValuesAsNumpy()
hourly_snowfall = hourly.Variables(3).ValuesAsNumpy()
hourly_snow_depth = hourly.Variables(4).ValuesAsNumpy()
hourly_pressure_msl = hourly.Variables(5).ValuesAsNumpy()
hourly_wind_speed_100m = hourly.Variables(6).ValuesAsNumpy()
hourly_wind_direction_100m = hourly.Variables(7).ValuesAsNumpy()

hourly_data = {"date": pd.date_range(
    start = pd.to_datetime(hourly.Time(), unit = "s", utc = True),
    end = pd.to_datetime(hourly.TimeEnd(), unit = "s", utc = True),
    freq = pd.Timedelta(seconds = hourly.Interval()),
    inclusive = "left"
)}
hourly_data["temperature_2m"] = hourly_temperature_2m
hourly_data["relative_humidity_2m"] = hourly_relative_humidity_2m
hourly_data["rain"] = hourly_rain
hourly_data["snowfall"] = hourly_snowfall
hourly_data["snow_depth"] = hourly_snow_depth
hourly_data["pressure_msl"] = hourly_pressure_msl
hourly_data["wind_speed_100m"] = hourly_wind_speed_100m
hourly_data["wind_direction_100m"] = hourly_wind_direction_100m

hourly_dataframe = pd.DataFrame(data = hourly_data)




Coordinates 40.738136291503906°N 74.17021179199219°E
Elevation 3825.0 m asl
Timezone b'America/New_York' b'EDT'
Timezone difference to GMT+0 -14400 s


Optional data storing step: You may save your raw dataset files to the local data store before moving to the next step.

In [6]:
#Optional: store the raw data in your local data store

## 2. Assess data

Assess the data according to data quality and tidiness metrics using the report below.

List **two** data quality issues and **two** tidiness issues. Assess each data issue visually **and** programmatically, then briefly describe the issue you find.  **Make sure you include justifications for the methods you use for the assessment.**

### Quality Issue 1:  invalid entities for fare_count and total_count columns 

Dataset 1:

Visually assess

In [7]:
#FILL IN - Inspecting the dataframe visually
print("Shape of NYC Taxi Trip data:", taxi_df.shape)
taxi_df.head()

Shape of NYC Taxi Trip data: (83691, 20)


Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
0,1.0,2021-07-01 00:30:52,2021-07-01 00:35:36,N,1.0,74,168,1.0,1.2,6.0,0.5,0.5,0.0,0.0,,0.3,7.3,2.0,1.0,0.0
1,2.0,2021-07-01 00:25:36,2021-07-01 01:01:31,N,1.0,116,265,2.0,13.69,42.0,0.5,0.5,0.0,0.0,,0.3,43.3,2.0,1.0,0.0
2,2.0,2021-07-01 00:05:58,2021-07-01 00:12:00,N,1.0,97,33,1.0,0.95,6.5,0.5,0.5,2.34,0.0,,0.3,10.14,1.0,1.0,0.0
3,2.0,2021-07-01 00:41:40,2021-07-01 00:47:23,N,1.0,74,42,1.0,1.24,6.5,0.5,0.5,0.0,0.0,,0.3,7.8,2.0,1.0,0.0
4,2.0,2021-07-01 00:51:32,2021-07-01 00:58:46,N,1.0,42,244,1.0,1.1,7.0,0.5,0.5,0.0,0.0,,0.3,8.3,2.0,1.0,0.0


In [8]:
taxi_df.sample(10)

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
15943,2.0,2021-07-10 16:27:35,2021-07-10 16:43:02,N,1.0,82,7,1.0,3.78,14.0,0.0,0.5,0.0,0.0,,0.3,14.8,2.0,1.0,0.0
14375,2.0,2021-07-09 18:32:20,2021-07-09 18:47:23,N,1.0,129,226,1.0,2.26,11.0,1.0,0.5,0.0,0.0,,0.3,12.8,2.0,1.0,0.0
72929,,2021-07-10 11:13:00,2021-07-10 12:07:00,,,210,230,,15.12,40.33,2.75,0.0,0.0,6.55,,0.3,49.93,,,
79585,,2021-07-27 13:26:00,2021-07-27 13:36:00,,,112,232,,2.78,17.25,0.0,0.0,0.0,0.0,,0.3,17.55,,,
70347,,2021-07-30 15:07:00,2021-07-30 15:48:00,,,76,188,,6.52,20.56,5.5,0.0,0.0,0.0,,0.3,26.36,,,
18493,2.0,2021-07-12 12:20:23,2021-07-12 12:27:46,N,1.0,116,244,1.0,1.36,7.5,0.0,0.5,0.0,0.0,,0.3,8.3,2.0,1.0,0.0
44940,2.0,2021-07-28 14:45:11,2021-07-28 14:58:03,N,1.0,116,41,1.0,1.96,10.5,0.0,0.5,0.0,0.0,,0.3,11.3,2.0,1.0,0.0
20422,2.0,2021-07-13 14:33:15,2021-07-13 15:05:43,N,1.0,247,50,1.0,7.38,28.0,0.0,0.5,0.0,0.0,,0.3,31.55,2.0,1.0,2.75
75889,,2021-07-06 14:51:00,2021-07-06 15:19:00,,,17,228,,5.09,22.9,2.75,0.0,0.0,6.55,,0.3,32.5,,,
63828,,2021-07-09 16:06:00,2021-07-09 17:22:00,,,248,186,,12.56,42.12,1.35,0.0,0.0,0.0,,0.3,43.77,,,


Programmatically assess:

In [9]:
taxi_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83691 entries, 0 to 83690
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   VendorID               51173 non-null  float64
 1   lpep_pickup_datetime   83691 non-null  object 
 2   lpep_dropoff_datetime  83691 non-null  object 
 3   store_and_fwd_flag     51173 non-null  object 
 4   RatecodeID             51173 non-null  float64
 5   PULocationID           83691 non-null  int64  
 6   DOLocationID           83691 non-null  int64  
 7   passenger_count        51173 non-null  float64
 8   trip_distance          83691 non-null  float64
 9   fare_amount            83691 non-null  float64
 10  extra                  83691 non-null  float64
 11  mta_tax                83691 non-null  float64
 12  tip_amount             83691 non-null  float64
 13  tolls_amount           83691 non-null  float64
 14  ehail_fee              0 non-null      float64
 15  im

In [10]:
taxi_df.describe()

Unnamed: 0,VendorID,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
count,51173.0,51173.0,83691.0,83691.0,51173.0,83691.0,83691.0,83691.0,83691.0,83691.0,83691.0,0.0,83691.0,83691.0,51173.0,51173.0,51173.0
mean,1.851113,1.159244,108.362572,133.270005,1.307858,194.354699,20.388305,1.156707,0.293562,1.058618,0.624529,,0.297745,24.204836,1.421726,1.034393,0.642815
std,0.355981,0.77326,70.37017,77.216791,0.984362,4405.549221,15.583552,1.367897,0.247773,2.368771,1.990481,,0.031219,17.262183,0.511146,0.182239,1.164219
min,1.0,1.0,3.0,1.0,0.0,0.0,-150.0,-4.5,-0.5,-1.14,0.0,,-0.3,-150.3,1.0,1.0,-2.75
25%,2.0,1.0,56.0,69.0,1.0,1.35,9.0,0.0,0.0,0.0,0.0,,0.3,11.76,1.0,1.0,0.0
50%,2.0,1.0,75.0,132.0,1.0,2.76,16.0,0.5,0.5,0.0,0.0,,0.3,19.8,1.0,1.0,0.0
75%,2.0,1.0,166.0,205.0,1.0,6.2,26.83,2.75,0.5,1.66,0.0,,0.3,31.3,2.0,1.0,0.0
max,2.0,5.0,265.0,265.0,32.0,260517.93,480.0,8.25,0.5,87.71,30.05,,0.3,480.31,5.0,2.0,2.75


In [11]:
taxi_df[taxi_df['fare_amount'] <= 0].count()

VendorID                 440
lpep_pickup_datetime     444
lpep_dropoff_datetime    444
store_and_fwd_flag       440
RatecodeID               440
PULocationID             444
DOLocationID             444
passenger_count          440
trip_distance            444
fare_amount              444
extra                    444
mta_tax                  444
tip_amount               444
tolls_amount             444
ehail_fee                  0
improvement_surcharge    444
total_amount             444
payment_type             440
trip_type                440
congestion_surcharge     440
dtype: int64

In [12]:
taxi_df[taxi_df['total_amount'] <= 0].count()

VendorID                 390
lpep_pickup_datetime     394
lpep_dropoff_datetime    394
store_and_fwd_flag       390
RatecodeID               390
PULocationID             394
DOLocationID             394
passenger_count          390
trip_distance            394
fare_amount              394
extra                    394
mta_tax                  394
tip_amount               394
tolls_amount             394
ehail_fee                  0
improvement_surcharge    394
total_amount             394
payment_type             390
trip_type                390
congestion_surcharge     390
dtype: int64




Issue: The fare_amount and total_amount columns might have negative or zero values which are not valid for fare amounts.


Visually assess using df.head() to spot-check for negative or zero values.                                            
Programmatically assess using df.describe() to find the minimum values and df[df['fare_amount'] <= 0] to count such instances.      
Justification: These methods ensure we identify and quantify the inaccuracies in fare-related columns.             

### Quality Issue 2:  Extreme Values in Weather Data


Dataset 2:

In [13]:
hourly_dataframe.head()

Unnamed: 0,date,temperature_2m,relative_humidity_2m,rain,snowfall,snow_depth,pressure_msl,wind_speed_100m,wind_direction_100m
0,2020-01-01 04:00:00+00:00,-11.819202,72.383469,0.0,0.0,0.78,1024.0,12.313894,127.875046
1,2020-01-01 05:00:00+00:00,-7.229198,60.484665,0.0,0.0,0.78,1023.200012,11.457958,133.727051
2,2020-01-01 06:00:00+00:00,-2.909203,52.399223,0.0,0.0,0.78,1022.799988,11.200571,135.000107
3,2020-01-01 07:00:00+00:00,0.060801,46.170277,0.0,0.0,0.78,1021.5,9.0,143.13002
4,2020-01-01 08:00:00+00:00,2.130802,40.043922,0.0,0.0,0.78,1019.900024,8.557102,157.750931


In [14]:
hourly_dataframe.sample(10)

Unnamed: 0,date,temperature_2m,relative_humidity_2m,rain,snowfall,snow_depth,pressure_msl,wind_speed_100m,wind_direction_100m
8585,2020-12-23 21:00:00+00:00,-9.839199,58.965752,0.0,0.0,0.73,1020.900024,7.895416,155.772263
8730,2020-12-29 22:00:00+00:00,-18.389198,56.732548,0.0,0.0,0.72,1032.099976,6.725354,164.47583
6406,2020-09-24 02:00:00+00:00,22.3808,68.70359,0.0,0.0,0.01,1022.400024,6.12,180.0
1428,2020-02-29 16:00:00+00:00,4.560801,73.309776,0.0,0.0,0.96,1020.099976,14.904173,142.853226
11791,2021-05-06 11:00:00+00:00,30.3008,74.005379,0.1,0.0,0.91,1018.0,4.829907,63.435013
11978,2021-05-14 06:00:00+00:00,28.500799,64.037827,0.0,0.56,0.67,1018.099976,14.154915,277.305664
4665,2020-07-13 13:00:00+00:00,39.1208,87.420372,0.3,0.0,0.0,1012.299988,10.464797,116.564987
9639,2021-02-05 19:00:00+00:00,-4.529198,32.341316,0.0,0.0,0.88,1021.5,7.289444,159.775055
12212,2021-05-24 00:00:00+00:00,22.560801,77.005379,0.0,0.0,0.53,1019.0,13.556282,100.713081
15393,2021-10-03 13:00:00+00:00,21.3008,82.808388,0.0,0.35,0.1,1023.0,4.693825,274.398621


In [15]:
hourly_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17544 entries, 0 to 17543
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype              
---  ------                --------------  -----              
 0   date                  17544 non-null  datetime64[ns, UTC]
 1   temperature_2m        17544 non-null  float32            
 2   relative_humidity_2m  17544 non-null  float32            
 3   rain                  17544 non-null  float32            
 4   snowfall              17544 non-null  float32            
 5   snow_depth            17544 non-null  float32            
 6   pressure_msl          17544 non-null  float32            
 7   wind_speed_100m       17544 non-null  float32            
 8   wind_direction_100m   17544 non-null  float32            
dtypes: datetime64[ns, UTC](1), float32(8)
memory usage: 685.4 KB


In [16]:
hourly_dataframe.describe()

Unnamed: 0,temperature_2m,relative_humidity_2m,rain,snowfall,snow_depth,pressure_msl,wind_speed_100m,wind_direction_100m
count,17544.0,17544.0,17544.0,17544.0,17544.0,17544.0,17544.0,17544.0
mean,17.612923,65.784538,0.032866,0.04541,0.506306,1020.901062,7.905742,186.48912
std,17.604399,21.286137,0.181937,0.174239,0.464847,5.977007,3.836345,64.107635
min,-28.6492,8.0536,0.0,0.0,0.0,1003.700012,0.0,3.576264
25%,3.930801,49.682577,0.0,0.0,0.0,1016.700012,5.154416,142.124954
50%,19.0508,67.517971,0.0,0.0,0.48,1020.0,7.23591,175.601379
75%,31.1108,83.753237,0.0,0.0,0.83,1024.5,10.086427,237.744358
max,61.980797,100.0,7.0,2.8,1.69,1045.199951,23.565567,360.0





Issue: The weather data might contain invalid or extreme values that are not realistic for the location and period (e.g., extremely high temperatures or negative humidity values).                                              
Assessment:                                      
Visually assess using df_weather.head() to spot-check for unrealistic values.                             
Programmatically assess using df_weather.describe() to find the range and identify outliers.                             
Justification: These methods ensure we identify and quantify the invalid or extreme values in the dataset.                            
______________________________________________________________________________________________________________________________

### Tidiness Issue 1: Unnecessary Columns in Taxi Data (nulls , duplicated , dataTypes)


In [17]:
taxi_df.head()

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
0,1.0,2021-07-01 00:30:52,2021-07-01 00:35:36,N,1.0,74,168,1.0,1.2,6.0,0.5,0.5,0.0,0.0,,0.3,7.3,2.0,1.0,0.0
1,2.0,2021-07-01 00:25:36,2021-07-01 01:01:31,N,1.0,116,265,2.0,13.69,42.0,0.5,0.5,0.0,0.0,,0.3,43.3,2.0,1.0,0.0
2,2.0,2021-07-01 00:05:58,2021-07-01 00:12:00,N,1.0,97,33,1.0,0.95,6.5,0.5,0.5,2.34,0.0,,0.3,10.14,1.0,1.0,0.0
3,2.0,2021-07-01 00:41:40,2021-07-01 00:47:23,N,1.0,74,42,1.0,1.24,6.5,0.5,0.5,0.0,0.0,,0.3,7.8,2.0,1.0,0.0
4,2.0,2021-07-01 00:51:32,2021-07-01 00:58:46,N,1.0,42,244,1.0,1.1,7.0,0.5,0.5,0.0,0.0,,0.3,8.3,2.0,1.0,0.0


In [18]:
taxi_df.sample(5)

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
13609,2.0,2021-07-09 13:49:08,2021-07-09 15:00:51,N,1.0,152,116,1.0,0.83,-8.5,0.0,-0.5,0.0,0.0,,-0.3,-9.3,3.0,1.0,0.0
61215,,2021-07-07 14:33:00,2021-07-07 14:50:00,,,242,213,,3.43,22.79,2.75,0.0,0.0,0.0,,0.3,25.84,,,
32571,1.0,2021-07-20 19:55:24,2021-07-20 20:16:08,N,1.0,45,62,1.0,0.0,14.2,0.0,0.5,0.0,0.0,,0.3,15.0,1.0,1.0,0.0
51768,,2021-07-01 00:24:00,2021-07-01 00:46:00,,,153,185,,4.6,32.95,2.75,0.0,0.0,0.0,,0.3,36.0,,,
18471,2.0,2021-07-12 12:48:09,2021-07-12 12:56:51,N,1.0,179,223,1.0,1.67,8.0,0.0,0.5,0.0,0.0,,0.3,8.8,2.0,1.0,0.0


In [19]:
taxi_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83691 entries, 0 to 83690
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   VendorID               51173 non-null  float64
 1   lpep_pickup_datetime   83691 non-null  object 
 2   lpep_dropoff_datetime  83691 non-null  object 
 3   store_and_fwd_flag     51173 non-null  object 
 4   RatecodeID             51173 non-null  float64
 5   PULocationID           83691 non-null  int64  
 6   DOLocationID           83691 non-null  int64  
 7   passenger_count        51173 non-null  float64
 8   trip_distance          83691 non-null  float64
 9   fare_amount            83691 non-null  float64
 10  extra                  83691 non-null  float64
 11  mta_tax                83691 non-null  float64
 12  tip_amount             83691 non-null  float64
 13  tolls_amount           83691 non-null  float64
 14  ehail_fee              0 non-null      float64
 15  im

In [20]:
taxi_df.dtypes

VendorID                 float64
lpep_pickup_datetime      object
lpep_dropoff_datetime     object
store_and_fwd_flag        object
RatecodeID               float64
PULocationID               int64
DOLocationID               int64
passenger_count          float64
trip_distance            float64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
ehail_fee                float64
improvement_surcharge    float64
total_amount             float64
payment_type             float64
trip_type                float64
congestion_surcharge     float64
dtype: object

In [21]:
taxi_df.isnull().sum()

VendorID                 32518
lpep_pickup_datetime         0
lpep_dropoff_datetime        0
store_and_fwd_flag       32518
RatecodeID               32518
PULocationID                 0
DOLocationID                 0
passenger_count          32518
trip_distance                0
fare_amount                  0
extra                        0
mta_tax                      0
tip_amount                   0
tolls_amount                 0
ehail_fee                83691
improvement_surcharge        0
total_amount                 0
payment_type             32518
trip_type                32518
congestion_surcharge     32518
dtype: int64

In [22]:
taxi_df.duplicated().sum()

np.int64(0)


There is no duplicated values 
___________________________________________________________________________________________________________________________


Issue 1: There are several columns with missing values and we don't want them to answer our question, such as VendorID, store_and_fwd_flag, payment_type, and trip_type.                                     
Assessment:                              
Visually assess using df.head() to see a snapshot of the data.                              
Programmatically assess using df.info() and df.isnull().sum() to quantify the missing values.                              
Justification: These methods provide a clear understanding of the extent of missing data and help identify which columns are most affected.                              



Issue 2: The date and time columns (lpep_pickup_datetime and lpep_dropoff_datetime) might have inconsistent formats.                              
Assessment:                              
Visually assess using taxi_df.sample() to see a snapshot of the datetime columns.                              
Programmatically assess using taxi_df.dtypes to check the data types of these columns.                              
Justification: These methods help verify that the datetime columns are in the correct format for further analysis.                              

Issue 3: null entities for passenger_count and congestion_surcharge           



Issue drop the unnecessery datetime columns

### Tidiness Issue 2:  Dataset2

In [23]:
hourly_dataframe.head()

Unnamed: 0,date,temperature_2m,relative_humidity_2m,rain,snowfall,snow_depth,pressure_msl,wind_speed_100m,wind_direction_100m
0,2020-01-01 04:00:00+00:00,-11.819202,72.383469,0.0,0.0,0.78,1024.0,12.313894,127.875046
1,2020-01-01 05:00:00+00:00,-7.229198,60.484665,0.0,0.0,0.78,1023.200012,11.457958,133.727051
2,2020-01-01 06:00:00+00:00,-2.909203,52.399223,0.0,0.0,0.78,1022.799988,11.200571,135.000107
3,2020-01-01 07:00:00+00:00,0.060801,46.170277,0.0,0.0,0.78,1021.5,9.0,143.13002
4,2020-01-01 08:00:00+00:00,2.130802,40.043922,0.0,0.0,0.78,1019.900024,8.557102,157.750931


In [24]:
hourly_dataframe.sample(5)

Unnamed: 0,date,temperature_2m,relative_humidity_2m,rain,snowfall,snow_depth,pressure_msl,wind_speed_100m,wind_direction_100m
16828,2021-12-02 08:00:00+00:00,23.3708,15.054276,0.0,0.0,0.9,1020.900024,7.172949,162.474335
6020,2020-09-08 00:00:00+00:00,22.1108,61.184513,0.0,0.0,0.0,1019.5,6.28713,156.370605
9929,2021-02-17 21:00:00+00:00,-4.709202,43.637104,0.0,0.0,1.02,1020.5,8.714677,128.290207
15446,2021-10-05 18:00:00+00:00,16.8008,88.215408,0.0,0.63,0.13,1022.299988,9.36,180.0
10023,2021-02-21 19:00:00+00:00,9.330799,87.454453,0.0,0.21,1.11,1024.400024,0.72,180.0


Programmatically assess:

In [25]:
hourly_dataframe.columns

Index(['date', 'temperature_2m', 'relative_humidity_2m', 'rain', 'snowfall',
       'snow_depth', 'pressure_msl', 'wind_speed_100m', 'wind_direction_100m'],
      dtype='object')

In [26]:
hourly_dataframe.shape

(17544, 9)

In [27]:
hourly_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17544 entries, 0 to 17543
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype              
---  ------                --------------  -----              
 0   date                  17544 non-null  datetime64[ns, UTC]
 1   temperature_2m        17544 non-null  float32            
 2   relative_humidity_2m  17544 non-null  float32            
 3   rain                  17544 non-null  float32            
 4   snowfall              17544 non-null  float32            
 5   snow_depth            17544 non-null  float32            
 6   pressure_msl          17544 non-null  float32            
 7   wind_speed_100m       17544 non-null  float32            
 8   wind_direction_100m   17544 non-null  float32            
dtypes: datetime64[ns, UTC](1), float32(8)
memory usage: 685.4 KB


In [28]:
hourly_dataframe.isnull().sum()

date                    0
temperature_2m          0
relative_humidity_2m    0
rain                    0
snowfall                0
snow_depth              0
pressure_msl            0
wind_speed_100m         0
wind_direction_100m     0
dtype: int64

In [29]:
hourly_dataframe.duplicated().sum()

np.int64(0)

There is no null nor duplicated values in this dataset
_____________________________________________________________________________________________________

Thid dataset doesn't have clear tidiness issues

## 3. Clean data
Clean the data to solve the 4 issues corresponding to data quality and tidiness found in the assessing step. **Make sure you include justifications for your cleaning decisions.**

After the cleaning for each issue, please use **either** the visually or programatical method to validate the cleaning was succesful.

At this stage, you are also expected to remove variables that are unnecessary for your analysis and combine your datasets. Depending on your datasets, you may choose to perform variable combination and elimination before or after the cleaning stage. Your dataset must have **at least** 4 variables after combining the data.

In [30]:
taxi = taxi_df.copy()
weather = hourly_dataframe.copy()

### Quality Issue 1: invalid entities for fare_count and total_count columns ### 

In [31]:
invalid_fare_count = taxi[taxi['fare_amount'] <= 0].shape[0]
invalid_total_count = taxi[taxi['total_amount'] <= 0].shape[0]

print(f"Number of trips with invalid fare_amount: {invalid_fare_count}")
print(f"Number of trips with invalid total_amount: {invalid_total_count}")

Number of trips with invalid fare_amount: 444
Number of trips with invalid total_amount: 394


In [32]:
taxi = taxi[(taxi['fare_amount'] > 0) & (taxi['total_amount'] > 0)]

In [33]:
taxi[['fare_amount', 'total_amount']].describe()

Unnamed: 0,fare_amount,total_amount
count,83247.0,83247.0
mean,20.50944,24.343475
std,15.5223,17.187644
min,0.01,0.31
25%,9.0,11.8
50%,16.0,19.86
75%,27.0,31.31
max,480.0,480.31


Justification:                  
- The `fare_amount` and `total_amount` columns should logically contain positive values only, as they represent monetary charges for taxi services.
- Negative or zero values in these columns indicate data inaccuracies or errors and could lead to incorrect analysis results.
- By removing rows with invalid values, we ensure that our dataset accurately reflects the financial aspects of taxi trips, which is crucial for reliable analysis of fare patterns and economic impact.

### Quality Issue 2:  Extreme Values in Weather Data


In [34]:
### Quality Issue 2:  Extreme Values in Weather Data
# Clean temperature_2m (Celsius)
weather = weather[(weather['temperature_2m'] >= -40) & (weather['temperature_2m'] <= 55)]

# Clean relative_humidity_2m
weather = weather[(weather['relative_humidity_2m'] >= 0) & (weather['relative_humidity_2m'] <= 100)]

# Clean rain, snowfall, and snow_depth
weather = weather[(weather['rain'] >= 0)]
weather = weather[(weather['snowfall'] >= 0)]
weather = weather[(weather['snow_depth'] >= 0)]

# Clean pressure_msl
weather = weather[(weather['pressure_msl'] >= 870) & (weather['pressure_msl'] <= 1080)]

# Clean wind_speed_100m
weather = weather[(weather['wind_speed_100m'] >= 0)]

# Clean wind_direction_100m
weather = weather[(weather['wind_direction_100m'] >= 0) & (weather['wind_direction_100m'] <= 360)]


In [35]:
# Validate the cleaning was successful
(weather.describe())

Unnamed: 0,temperature_2m,relative_humidity_2m,rain,snowfall,snow_depth,pressure_msl,wind_speed_100m,wind_direction_100m
count,17489.0,17489.0,17489.0,17489.0,17489.0,17489.0,17489.0,17489.0
mean,17.487959,65.88166,0.032918,0.045553,0.507899,1020.929626,7.89936,186.170898
std,17.489885,21.243895,0.182202,0.174494,0.464708,5.963303,3.837052,63.931034
min,-28.6492,8.0536,0.0,0.0,0.0,1003.700012,0.0,3.576264
25%,3.840799,49.841476,0.0,0.0,0.0,1016.799988,5.154416,142.124954
50%,18.9608,67.641739,0.0,0.0,0.49,1020.0,7.208994,175.23645
75%,31.0208,83.798477,0.0,0.0,0.83,1024.5,10.041354,237.264771
max,54.8708,100.0,7.0,2.8,1.69,1045.199951,23.565567,360.0


In [36]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17489 entries, 0 to 17543
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype              
---  ------                --------------  -----              
 0   date                  17489 non-null  datetime64[ns, UTC]
 1   temperature_2m        17489 non-null  float32            
 2   relative_humidity_2m  17489 non-null  float32            
 3   rain                  17489 non-null  float32            
 4   snowfall              17489 non-null  float32            
 5   snow_depth            17489 non-null  float32            
 6   pressure_msl          17489 non-null  float32            
 7   wind_speed_100m       17489 non-null  float32            
 8   wind_direction_100m   17489 non-null  float32            
dtypes: datetime64[ns, UTC](1), float32(8)
memory usage: 819.8 KB


Description: The weather data might contain invalid or extreme values that are not realistic for the location and period.

Justification:                               

Clean temperature_2m (Celsius)   :                                                           
Temperature values in Celsius are expected to be within a reasonable range based on historical weather data for most locations on Earth. Extremely low values (below -40°C) and extremely high values (above 55°C) are rare and often indicate data errors. By filtering out these outliers, we ensure the dataset contains realistic temperature values, improving the accuracy of any temperature-related analysis.                              


Clean relative_humidity_2m     :                         

Relative humidity is measured as a percentage and logically ranges from 0% (completely dry air) to 100% (air fully saturated with moisture). Values outside this range are physically impossible and indicate data errors. Cleaning this data ensures that all relative humidity values are valid, which is crucial for accurate climate and comfort-level analyses.                              


                                                            
Clean rain, snowfall, and snow_depth

Rain, snowfall, and snow depth should not have negative values, as these measurements represent quantities of precipitation and accumulated snow, which cannot be less than zero. By removing negative values, we ensure the integrity of the dataset, allowing for accurate precipitation analysis and forecasting.                              

Clean pressure_msl    :                                                                                      
Atmospheric pressure at mean sea level (msl) typically ranges between 870 hPa and 1080 hPa. Values outside this range are uncommon and usually indicate sensor errors or data entry mistakes. By filtering these outliers, we improve the reliability of pressure data, which is important for weather modeling and understanding atmospheric conditions.                                                            

Clean wind_speed_100m         :                                                   
Wind speed should not be negative, as it represents the magnitude of wind movement. Negative values are indicative of errors in measurement or data entry. Ensuring all wind speed values are non-negative maintains the dataset’s accuracy and is essential for analyses involving wind energy, weather prediction, and safety.

Clean wind_direction_100m                                                            
Wind direction is measured in degrees, ranging from 0 to 360, representing the direction from which the wind is blowing. Values outside this range are invalid and indicate data entry errors. Cleaning these values ensures the dataset's correctness, which is vital for accurately determining wind patterns and directions

### Tidiness Issue 1: Unnecessary Columns in Taxi Data (nulls , duplicated , dataTypes)


In [37]:
# FILL IN - Apply the cleaning strategy
columns_to_drop = ['VendorID', 'store_and_fwd_flag', 'payment_type', 'trip_type','ehail_fee']
taxi = taxi.drop(columns=columns_to_drop)

In [38]:
taxi.columns

Index(['lpep_pickup_datetime', 'lpep_dropoff_datetime', 'RatecodeID',
       'PULocationID', 'DOLocationID', 'passenger_count', 'trip_distance',
       'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount', 'congestion_surcharge'],
      dtype='object')

Justification: 

1. `VendorID`: This column is not necessary for the analysis as it only identifies the taxi vendor and does not provide useful information for the analysis of trip patterns or fare structures.
2. `store_and_fwd_flag`: This column indicates whether the trip record was held in vehicle memory before sending to the vendor. It is not useful for the analysis of trip data and does not impact trip patterns or fare amounts.
3. `payment_type`: This column describes the payment method, which is not relevant to the analysis of trip durations, distances, or fare amounts.
4. `trip_type`: This column differentiates between dispatched and non-dispatched trips, which is not relevant for the analysis of trip data.
5. `ehail_fee`: This column contains no data (all values are NaN) and thus is not useful for any analysis.




### Tidiness Issue 2 : convert dataType to datetime and drop the unnecessery

In [39]:
taxi['date'] = pd.to_datetime(taxi['lpep_pickup_datetime'])

columns_to_drop = ['lpep_dropoff_datetime','lpep_pickup_datetime']
taxi = taxi.drop(columns=columns_to_drop)

In [40]:
taxi.head()

Unnamed: 0,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,date
0,1.0,74,168,1.0,1.2,6.0,0.5,0.5,0.0,0.0,0.3,7.3,0.0,2021-07-01 00:30:52
1,1.0,116,265,2.0,13.69,42.0,0.5,0.5,0.0,0.0,0.3,43.3,0.0,2021-07-01 00:25:36
2,1.0,97,33,1.0,0.95,6.5,0.5,0.5,2.34,0.0,0.3,10.14,0.0,2021-07-01 00:05:58
3,1.0,74,42,1.0,1.24,6.5,0.5,0.5,0.0,0.0,0.3,7.8,0.0,2021-07-01 00:41:40
4,1.0,42,244,1.0,1.1,7.0,0.5,0.5,0.0,0.0,0.3,8.3,0.0,2021-07-01 00:51:32


In [41]:
taxi['date'] = pd.to_datetime(taxi['date']).dt.tz_localize('UTC')

In [42]:
taxi.info()

<class 'pandas.core.frame.DataFrame'>
Index: 83247 entries, 0 to 83690
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   RatecodeID             50733 non-null  float64            
 1   PULocationID           83247 non-null  int64              
 2   DOLocationID           83247 non-null  int64              
 3   passenger_count        50733 non-null  float64            
 4   trip_distance          83247 non-null  float64            
 5   fare_amount            83247 non-null  float64            
 6   extra                  83247 non-null  float64            
 7   mta_tax                83247 non-null  float64            
 8   tip_amount             83247 non-null  float64            
 9   tolls_amount           83247 non-null  float64            
 10  improvement_surcharge  83247 non-null  float64            
 11  total_amount           83247 non-null  float64            


Justification:                      
The pd.to_datetime() function is used to convert the 'lpep_pickup_datetime' and 'lpep_dropoff_datetime' columns from string format to datetime format. This conversion is necessary to facilitate date-based operations and analyses, such as calculating trip durations, grouping data by time periods, and extracting specific components like hour, day, or month. Storing these datetime objects in new columns (pickup_date and dropoff_date) preserves the original data while allowing for efficient date manipulations and queries.

### Tidiness Issue 3: null entities for passenger_count and congestion_surcharge           


In [43]:
taxi = taxi.dropna(subset=['passenger_count', 'congestion_surcharge'])

In [44]:
taxi.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50733 entries, 0 to 51172
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   RatecodeID             50733 non-null  float64            
 1   PULocationID           50733 non-null  int64              
 2   DOLocationID           50733 non-null  int64              
 3   passenger_count        50733 non-null  float64            
 4   trip_distance          50733 non-null  float64            
 5   fare_amount            50733 non-null  float64            
 6   extra                  50733 non-null  float64            
 7   mta_tax                50733 non-null  float64            
 8   tip_amount             50733 non-null  float64            
 9   tolls_amount           50733 non-null  float64            
 10  improvement_surcharge  50733 non-null  float64            
 11  total_amount           50733 non-null  float64            


Justification:                  
used drpna to remove null entities from the dataframe

### **Remove unnecessary variables and combine datasets**

Depending on the datasets, you can also peform the combination before the cleaning steps.

In [45]:
#FILL IN - Remove unnecessary variables and combine datasets
combined_data = pd.merge(taxi, weather, on='date', how='inner')

In [46]:
combined_data.head(5)

Unnamed: 0,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,...,congestion_surcharge,date,temperature_2m,relative_humidity_2m,rain,snowfall,snow_depth,pressure_msl,wind_speed_100m,wind_direction_100m
0,1.0,75,236,1.0,1.2,7.0,2.75,0.5,2.1,0.0,...,2.75,2021-07-02 15:00:00+00:00,34.8908,93.065773,0.2,0.0,0.0,1015.700012,6.830519,71.564964
1,1.0,74,247,1.0,2.09,8.5,0.0,0.5,2.79,0.0,...,0.0,2021-07-03 12:00:00+00:00,45.8708,57.156956,0.1,0.0,0.0,1012.299988,7.24486,296.564972
2,1.0,95,121,2.0,2.53,11.0,0.0,0.5,1.0,0.0,...,0.0,2021-07-06 08:00:00+00:00,48.750801,55.337654,0.0,0.0,0.0,1013.700012,11.525623,268.210114
3,1.0,74,237,1.0,3.31,16.0,0.0,0.5,0.0,0.0,...,2.75,2021-07-08 13:00:00+00:00,51.270802,50.208988,0.0,0.0,0.0,1017.400024,8.217153,298.810699
4,1.0,213,197,1.0,15.92,52.0,0.5,0.5,0.0,6.55,...,0.0,2021-07-13 06:00:00+00:00,41.2808,59.273697,0.5,0.0,0.0,1013.099976,10.495713,239.036301


## 4. Update your data store
Update your local database/data store with the cleaned data, following best practices for storing your cleaned data:

- Must maintain different instances / versions of data (raw and cleaned data)
- Must name the dataset files informatively
- Ensure both the raw and cleaned data is saved to your database/data store

In [47]:
#FILL IN - saving data
# Save raw data to files
taxi_df.to_csv('raw_taxi_data.csv', index=False)
hourly_dataframe.to_csv('raw_weather_data.csv', index=False)
# Save cleaned data to files
taxi.to_csv('cleaned_taxi_data.csv', index=False)
weather.to_csv('cleaned_weather_data.csv', index=False)

combined_data.to_csv('combined_data.csv', index=False)


## 5. Answer the research question

### **5.1:** Define and answer the research question 
Going back to the problem statement in step 1, use the cleaned data to answer the question you raised. Produce **at least** two visualizations using the cleaned data and explain how they help you answer the question.

Research question: How do weather conditions influence taxi trip patterns in New York City?

###  `Visual 1 - Visualization 1: Daily Average Temperature vs. Daily Taxi Demand`


Visualization: Plot showing the relationship between average daily taxi demand and temperature.



In [50]:
daily_taxi_demand = taxi.groupby('date').size().reset_index(name='taxi_demand')
# Merge with weather data on date
merged_data = pd.merge(daily_taxi_demand, weather, on='date')


In [51]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.scatter(merged_data['temperature_2m'], merged_data['taxi_demand'], alpha=0.5, color='b')
plt.title('Taxi Demand vs. Temperature')
plt.xlabel('Temperature (°C)')
plt.ylabel('Daily Taxi Demand')
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()

plt.show()



A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/opt/conda/lib/python3.10/site-packages/traitlets/config/application.py", line 992, in launch_instance
    app.start()
  File "/opt/conda/lib/python3.10/site-packages/ipykernel/

AttributeError: _ARRAY_API not found

ImportError: numpy.core.multiarray failed to import

*Answer to research question:*                      
In conclusion, while the scatter plot analysis suggests no direct correlation between temperature and taxi demand in NYC, further investigation into various factors influencing transportation preferences could provide a more comprehensive understanding of urban mobility patterns.

This structured approach acknowledges the findings from the scatter plot while framing your observations and insights effectively.

###  `Visualization 2: Daily Average Temperature vs. Daily Total Fare Amount`

Visualization: Scatter plot showing the relationship between trip duration and precipitation.


In [52]:
import pandas as pd
import matplotlib.pyplot as plt

# Assuming 'taxi' and 'weather' are your DataFrames
# Convert dropoff_date and date columns to datetime with timezone aware in 'taxi'
taxi['dropoff_date'] = pd.to_datetime(taxi['dropoff_date'], utc=True)
taxi['date'] = pd.to_datetime(taxi['date'], utc=True)

# Convert 'date' column in 'weather' to datetime with timezone aware
weather['date'] = pd.to_datetime(weather['date'], utc=True)

# Merge 'taxi' and 'weather' on 'date'
merged_data = pd.merge(taxi, weather, on='date', how='inner')

# Calculate trip duration in minutes
merged_data['trip_duration'] = (merged_data['dropoff_date'] - merged_data['date']).dt.total_seconds() / 60.0

plt.figure(figsize=(10, 6))
plt.scatter(merged_data['rain'], merged_data['trip_duration'], alpha=0.5, color='g')
plt.title('Trip Duration vs. Precipitation')
plt.xlabel('Precipitation (mm)')
plt.ylabel('Trip Duration (minutes)')
plt.grid(True)
plt.tight_layout()

plt.show()



A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/opt/conda/lib/python3.10/site-packages/traitlets/config/application.py", line 992, in launch_instance
    app.start()
  File "/opt/conda/lib/python3.10/site-packages/ipykernel/

AttributeError: _ARRAY_API not found

ImportError: numpy.core.multiarray failed to import

*Answer to research question:* Visualization 2:                       
Trip Duration vs. Precipitation                                              
Insight: The scatter plot illustrates that taxi trip durations in NYC tend to be longer during lighter precipitation or no rain, indicating a potential impact of weather on travel times.                       

Explanation: Lighter precipitation levels likely result in smoother traffic conditions or increased taxi demand, leading to longer trip durations compared to heavy rain periods.                       

Implications: Understanding this relationship helps optimize taxi service operations during varying weather conditions, enhancing service reliability and customer satisfaction.                       



### **5.2:** Reflection

If I had more time to complete the project, I would focus on addressing several aspects:

1. **Data Quality and Cleaning**: I would delve deeper into ensuring data completeness and accuracy, particularly in handling missing values and outliers across both taxi and weather datasets.

2. **Further Analysis**: I would explore additional research questions such as the impact of wind speed, humidity, and seasonal variations on taxi trip patterns. This could provide a more comprehensive understanding of how diverse weather factors influence urban transportation dynamics.

3. **Enhanced Visualizations**: I would refine visualizations to better illustrate trends and correlations, possibly using interactive plots to explore temporal and spatial variations in taxi demand and trip characteristics.

4. **Modeling and Predictive Analytics**: Incorporating predictive modeling to forecast taxi demand under different weather scenarios could enhance operational planning and resource allocation for taxi services in NYC.

These actions would not only deepen the analysis but also provide actionable insights for improving taxi service efficiency and resilience to weather-related challenges.
