## 1

### Why This Dataset?
The New York taxi cab dataset from November 2015 was particularly engaging due to its rich time-series and geospatial features. The dataset provided pickup locations, timestamps, and trip details, which made it ideal for analyzing trends in taxi demand, identifying traffic bottlenecks, and visualizing movement patterns across the city. 

This dataset was especially valuable for geospatial mapping, as it allowed for an insightful visualization of taxi activity during the Thanksgiving Day parade. By focusing on specific timestamps and locations, I was able to examine how road closures impacted pickup availability along the parade route.

### Example Analysis
One particularly interesting analysis was verifying the parade's impact on taxi pickups. The question examined whether taxis were unable to operate along the parade route due to street closures.

#### **Question:**  
Did the Thanksgiving Day parade affect taxi pickups along its route in NYC on November 26, 2015?
### Solution Below


In [1]:
import pandas as pd

In [2]:
# Loading the New York taxi data with only the required columns
myDF = pd.read_csv("/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_2015-11.csv", usecols=['tpep_pickup_datetime','pickup_longitude','pickup_latitude'])

In [3]:
# Importing datetime and defining the parade start and end times
from datetime import datetime 

paradestart = datetime.strptime('2015-11-26 09:00:00', "%Y-%m-%d %H:%M:%S")

paradeend = datetime.strptime('2015-11-26 12:00:00', "%Y-%m-%d %H:%M:%S")

In [4]:
# Converting the pickup timestamps to datetime format
mytimes = pd.to_datetime(myDF['tpep_pickup_datetime'])

In [5]:
#filtering data to include rides during parade time
finalDF = myDF[ (mytimes >= paradestart) & (mytimes <= paradeend)]

In [6]:
finalDF.shape  # display the shape of the filtered DataFrame

(28710, 3)

## 2

## Approach to Validating Results in Data Analytics
Solving the problem with two different approaches, or any other independent method, to verify that the results are consistent.

Using visual tools such as histograms, plots, and maps to detect anomalies or strange values.

Comparing manual counts of records within the time range against the final DataFrame shape.

Re-running the filter with slightly adjusted time ranges to check sensitivity.
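The cross-check and sensitivity ideas above can be sketched on a tiny synthetic table. The rows and the names `demo`, `pad`, and `count_*` are made up for illustration and are not taken from the taxi file. This is only a minimal sketch of the idea, assuming the timestamps are stored as strings in the same layout as the real data.

```python
import pandas as pd

# Synthetic stand-in for the taxi data (hypothetical rows, not real records)
demo = pd.DataFrame({
    "tpep_pickup_datetime": [
        "2015-11-26 08:59:30",
        "2015-11-26 09:00:00",
        "2015-11-26 10:15:42",
        "2015-11-26 12:00:00",
        "2015-11-26 12:04:10",
    ]
})
times = pd.to_datetime(demo["tpep_pickup_datetime"])
start = pd.Timestamp("2015-11-26 09:00:00")
end = pd.Timestamp("2015-11-26 12:00:00")

# Approach 1: boolean mask built from comparison operators
count_mask = len(demo[(times >= start) & (times <= end)])

# Approach 2: Series.between, which includes both endpoints by default
count_between = int(times.between(start, end).sum())

# Sensitivity check: widen the window by five minutes on each side
pad = pd.Timedelta(minutes=5)
count_padded = int(times.between(start - pad, end + pad).sum())

print(count_mask, count_between, count_padded)
```

If the two approaches disagree, or the count jumps sharply when the window is nudged, that is a hint to look more closely at the boundary rows.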

In [13]:
# Checked basic statistics of the coordinates to confirm the values are in the expected range (see question 3)
print(finalDF[['pickup_longitude','pickup_latitude']].describe())  

       pickup_longitude  pickup_latitude
count      27654.000000     27654.000000
mean         -73.970550        40.754908
std            0.037252         0.026623
min          -74.028313        40.604679
25%          -73.989708        40.740991
50%          -73.978619        40.759052
75%          -73.963097        40.773907
max          -73.755829        40.799988


## 3

In the NYC taxi example, the format string "%Y-%m-%d %H:%M:%S" passed to datetime.strptime() was unclear to me in the beginning.

So I wrote the dates in the standard date-time format to make the code clearer and easier to understand.
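As a small illustration of what each format code stands for, the snippet below parses one timestamp and inspects its parts. The variable names `fmt`, `parsed`, and `series` are chosen for this sketch only. It assumes the timestamps follow the year-month-day hour:minute:second layout used in the examples in this document.

```python
from datetime import datetime

import pandas as pd

# %Y = four-digit year, %m = month, %d = day of month
# %H = hour (24-hour clock), %M = minute, %S = second
fmt = "%Y-%m-%d %H:%M:%S"

parsed = datetime.strptime("2015-11-26 09:00:00", fmt)
print(parsed.year, parsed.month, parsed.day, parsed.hour)

# pandas accepts the same format string for a whole column
series = pd.to_datetime(pd.Series(["2015-11-26 09:00:00"]), format=fmt)
print(series.iloc[0] == parsed)
```

Passing `format=` to `pd.to_datetime` is optional here, but it makes the expected layout explicit.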

## Below is the working example

In [14]:
# Loading the New York taxi data with only the required columns
myDF = pd.read_csv("/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_2015-11.csv", usecols=['tpep_pickup_datetime','pickup_longitude','pickup_latitude'])

In [15]:
# Parade window written in the standard date-time format so it is simple to read
start = datetime.strptime('2015-11-26 09:00:00', '%Y-%m-%d %H:%M:%S')
end = datetime.strptime('2015-11-26 12:00:00', '%Y-%m-%d %H:%M:%S')

In [18]:
# Filter by the parade time window and coordinate bounds to get a clean dataset
filtered = myDF[
    (pd.to_datetime(myDF['tpep_pickup_datetime']).between(start, end)) &
    (myDF['pickup_longitude'].between(-74.03, -73.75)) & 
    (myDF['pickup_latitude'].between(40.6, 40.8))]    

In [19]:
print(f"Found {len(filtered)} valid parade pickups")

Found 27654 valid parade pickups
