# Understanding Hired Rides in NYC

_[Project prompt](https://docs.google.com/document/d/1VERPjEZcC1XSs4-02aM-DbkNr_yaJVbFjLJxaYQswqA/edit#)_

_This scaffolding notebook may be used to help set up your final project. It's **totally optional** whether you make use of this or not._

_If you do use this notebook, everything provided is optional as well - you may remove or add prose and code as you wish._

_Anything in italics (prose) or comments (in code) is meant to provide you with guidance. **Remove the italic lines and provided comments** before submitting the project, if you choose to use this scaffolding. We don't need the guidance when grading._

_**All code below should be considered "pseudo-code" - not functional by itself, and only a suggestion at the approach.**_

## Requirements

_A checklist of requirements to keep you on track. Remove this whole cell before submitting the project._

* Code clarity: make sure the code conforms to:
    * [ ] [PEP 8](https://peps.python.org/pep-0008/) - You might find [this resource](https://realpython.com/python-pep8/) helpful as well as [this](https://github.com/dnanhkhoa/nb_black) or [this](https://jupyterlab-code-formatter.readthedocs.io/en/latest/) tool
    * [ ] [PEP 257](https://peps.python.org/pep-0257/)
    * [ ] Break each task down into logical functions
* The following files are submitted for the project (see the project's GDoc for more details):
    * [ ] `README.md`
    * [ ] `requirements.txt`
    * [ ] `.gitignore`
    * [ ] `schema.sql`
    * [ ] 6 query files (using the `.sql` extension), appropriately named for the purpose of the query
    * [x] Jupyter Notebook containing the project (this file!)
* [x] You can edit this cell and add an `x` inside the `[ ]` like this task to denote a completed task

## Project Setup

In [1]:
# all import statements needed for the project, for example:

import math
import bs4
import matplotlib.pyplot as plt
import pandas as pd
import requests
import re
import pyarrow.parquet as pq
import geopandas as gpd
import os
import datetime
from math import sin, cos, sqrt, atan2, radians

In [2]:
# any general notebook setup, like log formatting
import warnings
warnings.filterwarnings("ignore")

In [3]:
# any constants you might need, for example:

TAXI_URL = "https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page"
# add other constants to refer to any local data, e.g. uber & weather
UBER_CSV = "uberdata/uber_rides_sample.csv"

NEW_YORK_BOX_COORDS = ((40.560445, -74.242330), (40.908524, -73.717047))

DATABASE_URL = "sqlite:///project.db"
DATABASE_SCHEMA_FILE = "schema.sql"
QUERY_DIRECTORY = "queries"

In [4]:
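# Create folders 'taxidata' (because of .gitignore) and 'sql_files' if they do not exist.
# Creating each folder independently ensures 'sql_files' is made even when 'taxidata' already exists.
os.makedirs("taxidata", exist_ok=True)
os.makedirs("sql_files", exist_ok=True)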

## Part 1: Data Preprocessing

_A checklist of requirements to keep you on track. Remove this whole cell before submitting the project. The order of these tasks isn't necessarily the order in which they need to be done. It's okay to do them in an order that makes sense to you._

* [ ] Define a function that calculates the distance between two coordinates in kilometers that **only uses the `math` module** from the standard library.
* [ ] Taxi data:
    * [ ] Use the `re` module, and the packages `requests`, BeautifulSoup (`bs4`), and (optionally) `pandas` to programmatically download the required CSV files & load into memory.
    * You may need to do this one file at a time - download, clean, sample. You can cache the sampling by saving it as a CSV file (and thereby freeing up memory on your computer) before moving on to the next file.
* [ ] Weather & Uber data:
    * [ ] Download the data manually in the link provided in the project doc.
* [ ] All data:
    * [ ] Load the data using `pandas`
    * [ ] Clean the data, including:
        * Remove unnecessary columns
        * Remove invalid data points (take a moment to consider what's invalid)
        * Normalize column names
        * (Taxi & Uber data) Remove trips that start and/or end outside the designated [coordinate box](http://bboxfinder.com/#40.560445,-74.242330,40.908524,-73.717047)
    * [ ] (Taxi data) Sample the data so that you have roughly the same amount of data points over the given date range for both Taxi data and Uber data.
* [ ] Weather data:
    * [ ] Split into two `pandas` DataFrames: one for required hourly data, and one for the required daily data.
    * [ ] You may find that the weather data you need later on does not exist at the frequency needed (daily vs hourly). You may calculate/generate samples from one to populate the other. Just document what you’re doing so we can follow along. 

### Calculating distance
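We estimate each trip's distance as the great-circle distance between its pickup and dropoff coordinates, using the haversine formula and only functions from the `math` module. `add_distance_column` then applies this calculation to every row of a DataFrame.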

In [5]:
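def calculate_distance(start_lat, start_lon, end_lat, end_lon):
    """Return the haversine distance in kilometers between two lat/lon points."""
    # Approximate radius of the Earth in kilometers
    R = 6373.0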

    lat1 = radians(start_lat)
    lon1 = radians(start_lon)
    lat2 = radians(end_lat)
    lon2 = radians(end_lon)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c
    
    return round(distance, 2)
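
As a quick sanity check on the haversine formula, the sketch below re-implements it in standalone form and tries two cases: identical points, and two points one degree of latitude apart, which should come out at roughly 111 km. The name `haversine_km` is hypothetical and not part of the notebook; the radius is the same value the notebook uses.

```python
from math import atan2, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6373.0  # same radius as in calculate_distance


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers (standalone sketch)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return EARTH_RADIUS_KM * 2 * atan2(sqrt(a), sqrt(1 - a))


# Identical points should be zero kilometers apart
same_point = haversine_km(40.7, -74.0, 40.7, -74.0)
# One degree of latitude should be roughly 111 km
one_degree = haversine_km(40.0, -74.0, 41.0, -74.0)
print(same_point, round(one_degree, 1))
```

A check like this is cheap to keep next to the function and catches mistakes such as forgetting to convert degrees to radians.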

In [6]:
def add_distance_column(df):
    """Add a `calculated_distance` column (km) computed from the trip coordinates."""
    distance = []

    for i in df.index:
        estimated_distance = calculate_distance(
            df["pickup_latitude"][i],
            df["pickup_longitude"][i],
            df["dropoff_latitude"][i],
            df["dropoff_longitude"][i],
        )
        distance.append(estimated_distance)

    df["calculated_distance"] = distance

    return df

### Converting datetime

TODO: transform date columns from strings to datetime Python objects.

In [7]:
def datetime_str_to_obj(date_time_str):
    date_time_obj = datetime.datetime.strptime(date_time_str, '%Y-%m-%d %H:%M:%S')

    return date_time_obj    

### Processing Taxi Data

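We scrape the TLC trip record page for the monthly Yellow Taxi Parquet files from January 2009 through June 2015. Each month is downloaded, sampled, cleaned, converted from location IDs to coordinates where needed, filtered to the NYC bounding box, and given a calculated distance.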

In [8]:
def find_taxi_csv_urls():
    TAXI_URL = "https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page"
    
    response = requests.get(TAXI_URL)
    html = response.content
    data = []
    soup = bs4.BeautifulSoup(html, "html.parser")
    for link in soup("a"):
        name = link.get("title")
        href = link.get("href")
        # Keep Yellow Taxi links whose URL mentions a year from 2009 through 2015 ...
        date_pattern = r'201[012345]|2009'
        if name == "Yellow Taxi Trip Records" and re.search(date_pattern, href):
            # ... but skip July through December 2015
            if not re.search(r'2015\-0[789]|2015\-1[012]', href):
                data.append(href)
    return data
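
An alternative to matching two regular expressions is to parse the year and month out of each URL and compare them against the project range. The sketch below assumes file names of the form `yellow_tripdata_YYYY-MM.parquet`, as in the example URL used later in this notebook; the helper `in_project_range` is a hypothetical name.

```python
import re

# Assumed file-name layout: yellow_tripdata_YYYY-MM.parquet
MONTH_PATTERN = re.compile(r"yellow_tripdata_(\d{4})-(\d{2})\.parquet")


def in_project_range(url, start=(2009, 1), end=(2015, 6)):
    """Return True if the URL's (year, month) lies within [start, end]."""
    match = MONTH_PATTERN.search(url)
    if match is None:
        return False
    year_month = (int(match.group(1)), int(match.group(2)))
    # Tuples compare element by element, so this is a chronological check
    return start <= year_month <= end


base = "https://d37ci6vzurychx.cloudfront.net/trip-data/"
urls = [
    base + "yellow_tripdata_2015-06.parquet",
    base + "yellow_tripdata_2015-07.parquet",
    base + "green_tripdata_2012-03.parquet",
]
kept = [u for u in urls if in_project_range(u)]
print(kept)
```

Comparing `(year, month)` tuples keeps the date range in one place, which makes it easier to change than a pair of hand-written patterns.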

In [9]:
def get_lat_lon_from_loc():
    find_lat_lon = gpd.read_file("taxi_zones/taxi_zones.shp")
  
    find_lat_lon = find_lat_lon.to_crs(4326)
    lon = find_lat_lon.centroid.x 
    lat = find_lat_lon.centroid.y
    find_lat_lon["lon"] = lon
    find_lat_lon["lat"] = lat

    return find_lat_lon

In [10]:
def filter_lat_lon(df):
    lon_border = [NEW_YORK_BOX_COORDS[0][1], NEW_YORK_BOX_COORDS[1][1]]
    lat_border = [NEW_YORK_BOX_COORDS[0][0], NEW_YORK_BOX_COORDS[1][0]]
    
    deleted_row = []
    
    for i in df.index:
        if df["pickup_longitude"][i] < lon_border[0] or df["pickup_longitude"][i] > lon_border[1] or df["dropoff_longitude"][i] < lon_border[0] or df["dropoff_longitude"][i] > lon_border[1]:
            deleted_row.append(i)
            
        elif df["pickup_latitude"][i] < lat_border[0] or df["pickup_latitude"][i] > lat_border[1] or df["dropoff_latitude"][i] < lat_border[0] or df["dropoff_latitude"][i] > lat_border[1]:
            deleted_row.append(i)
            
            
    df = df.drop(labels = deleted_row, axis=0)
    
    return df.reset_index(drop=True)
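
The row-by-row loop can become slow on large DataFrames. A vectorized variant of the same bounding-box test might look like the sketch below, which uses `Series.between` (inclusive on both ends). The function name `filter_box` and the sample rows are made up for illustration.

```python
import pandas as pd

# Same corners as NEW_YORK_BOX_COORDS: ((lat_min, lon_min), (lat_max, lon_max))
BOX = ((40.560445, -74.242330), (40.908524, -73.717047))


def filter_box(df, box=BOX):
    """Keep rows whose pickup and dropoff both fall inside the box."""
    (lat_min, lon_min), (lat_max, lon_max) = box
    inside = (
        df["pickup_latitude"].between(lat_min, lat_max)
        & df["dropoff_latitude"].between(lat_min, lat_max)
        & df["pickup_longitude"].between(lon_min, lon_max)
        & df["dropoff_longitude"].between(lon_min, lon_max)
    )
    return df[inside].reset_index(drop=True)


# Illustrative rows: one valid trip, one ending too far north, one at (0, 0)
trips = pd.DataFrame(
    {
        "pickup_latitude": [40.75, 40.75, 0.0],
        "pickup_longitude": [-73.99, -73.99, 0.0],
        "dropoff_latitude": [40.70, 41.50, 0.0],
        "dropoff_longitude": [-74.00, -74.00, 0.0],
    }
)
filtered = filter_box(trips)
print(len(filtered))
```

One behavioral difference to be aware of: `between` returns `False` for missing values, so rows with `NaN` coordinates are dropped here, whereas the loop version keeps them.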

In [11]:
def clean_data(df):
    # Normalize column for 2009 data
    if "Passenger_Count" in df.columns :
        df.rename(columns={"Passenger_Count": 'passenger_count', "Fare_Amt": 'fare_amount', "Trip_Distance": 'trip_distance'}, inplace=True)
    
    deleted_row = []
    
    for i in df.index:        
        # Trips with zero passenger count
        if df["passenger_count"][i] < 1 or df["passenger_count"][i] == False:
            deleted_row.append(i)
            
        # Trips with no fare
        elif df["fare_amount"][i] <= 0 or df["fare_amount"][i] == False:
            deleted_row.append(i)
            
        # Trips with no distance between dropoff and pickup
        elif df["trip_distance"][i] <= 0 or df["trip_distance"][i] == False:
            deleted_row.append(i)
            
    df = df.drop(labels = deleted_row, axis=0)
    
    return df.reset_index(drop=True)

In [12]:
def get_and_clean_month_taxi_data(url):
    
    # Check if we already have the data in local, otherwise download it
    file_name = url[-31:]
    
    for i in os.listdir('./taxidata'):
        if i == file_name:
            df = pd.read_parquet(f"taxidata/{file_name}", engine='pyarrow')
            return df
    
    
    # Download and preprocessing data
    find_lat_lon = get_lat_lon_from_loc()  # get the lat-lon from the .shp file
    
    # Read the data from url
    response = requests.get(url, stream=True)
    with open(f"taxidata/{url[-31:]}", "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
    df = pd.read_parquet(f"taxidata/{url[-31:]}", engine='pyarrow') #Reading the data
    df = df.sample(n=2564).reset_index(drop=True) #Take 2564 random sample

    try:
        #Clean the data
        df = clean_data(df)
    
        # Rename the columns
        if 'Start_Lon' in df.columns : #2009
            df = df[["Trip_Pickup_DateTime", "Start_Lon", "Start_Lat", "End_Lon", "End_Lat"]]
            df.rename(columns={"Trip_Pickup_DateTime": 'date_time', "Start_Lon": 'pickup_longitude', "Start_Lat": 'pickup_latitude', "End_Lon": 'dropoff_longitude', "End_Lat": 'dropoff_latitude'}, inplace=True)
        
        elif 'pickup_longitude' in df.columns : #2010
            df = df[["pickup_datetime", "pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude"]]
            df.rename(columns={"pickup_datetime": 'date_time'}, inplace=True)

        else:     
            if "tpep_pickup_datetime" in df.columns : #2011-2014
                df = df[["tpep_pickup_datetime", "PULocationID", "DOLocationID"]]
                df.rename(columns={"tpep_pickup_datetime": 'date_time'}, inplace=True)
        
            else: df = df[["date_time", "trip_distance", "PULocationID", "DOLocationID"]] #2015
            
            
            # Finding the lat-lon using shp file
            start_lon = []
            start_lat = []
            end_lon = []
            end_lat = []
        
        
            for i in range(len(df["PULocationID"])):
                start_point = df["PULocationID"][i]
                end_point = df["DOLocationID"][i]
            
                if df["PULocationID"][i] < 264 and df["DOLocationID"][i] < 264: #Filter for NYC Area only
                    index_location = find_lat_lon[find_lat_lon["LocationID"] == start_point].index.values[0] 
                    start_lon.append(float(find_lat_lon["lon"][index_location]))
                    start_lat.append(float(find_lat_lon["lat"][index_location]))
                
                    index_location = find_lat_lon[find_lat_lon["LocationID"] == end_point].index.values[0] 
                    end_lon.append(float(find_lat_lon["lon"][index_location]))
                    end_lat.append(float(find_lat_lon["lat"][index_location]))
                
                else: # Area outside NYC, to be deleted later
                    start_lon.append(0)
                    start_lat.append(0)
                    end_lon.append(0)
                    end_lat.append(0)
                    
        
                     
            df["pickup_longitude"] = start_lon
            df["pickup_latitude"] = start_lat
            df["dropoff_longitude"] = end_lon
            df["dropoff_latitude"] = end_lat
        
            df = df.drop(["PULocationID", "DOLocationID"], axis=1)
        
       
        
        # Filter the lat-lon between (40.560445, -74.242330) and (40.908524, -73.717047)
        df = filter_lat_lon(df)
    
        # Convert datetime strings to Python datetime objects
        if isinstance(df["date_time"][0], str):
            df["date_time"] = df["date_time"].apply(datetime_str_to_obj)
    
        # Calculate distance and add calculated_distance column
        df = add_distance_column(df)
        
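    except IndexError:
        # On failure, delete the downloaded file and retry from scratch (new download, new sample).
        # Return the retried result so the partially processed df below is not saved instead.
        os.remove(f"taxidata/{url[-31:]}")
        return get_and_clean_month_taxi_data(url)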
    
    #df = df[["date_time", "pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude", "calculated_distance"]]
    
    # Re-save the file
    df.to_parquet(f"taxidata/{url[-31:]}")
    
    return df            
    

In [13]:
#TBD
#url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2015-01.parquet'
#x = get_and_clean_month_taxi_data(url)
#x

In [14]:
def get_and_clean_taxi_data():
    all_taxi_dataframes = []
    
    all_csv_urls = find_taxi_csv_urls()
    for csv_url in all_csv_urls:
        # Get and clean the data from local or url
        dataframe = get_and_clean_month_taxi_data(csv_url)
        
        # Put all the dataframe into a list
        all_taxi_dataframes.append(dataframe)
        
    # Create one gigantic dataframe with data from every month needed
    taxi_data = pd.concat(all_taxi_dataframes, ignore_index=True)
    
    return taxi_data
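
The per-month sample size of 2564 is hard-coded in `get_and_clean_month_taxi_data`. One plausible way to arrive at such a number is to divide the size of the Uber sample by the number of months in the date range. The sketch below assumes an Uber sample of 200,000 rows, which is an assumption and not a value read from the data; check `len(uber_data)` before relying on it.

```python
def months_between(start, end):
    """Number of calendar months from start to end, inclusive; each is (year, month)."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1]) + 1


# January 2009 through June 2015
n_months = months_between((2009, 1), (2015, 6))
uber_rows = 200_000  # assumed size of the Uber sample, not taken from the data
per_month_sample = uber_rows // n_months
print(n_months, per_month_sample)
```

Deriving the number this way documents where it comes from and keeps the taxi and Uber row counts roughly balanced if either input changes.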

### Processing Uber Data

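We load the Uber sample CSV, parse the pickup timestamps, drop trips with no passengers or a non-positive fare, keep only trips inside the NYC bounding box, and add the calculated distance.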

In [15]:
def clean_uber_data(df):
    
    deleted_row = []
    
    for i in df.index:
        # Convert datetime str to python object
        df['pickup_datetime'][i] = df['pickup_datetime'][i].replace(' UTC', '')
        df['pickup_datetime'][i] = df['pickup_datetime'][i].replace('T', ' ')
        df['pickup_datetime'][i] = datetime_str_to_obj(df['pickup_datetime'][i])

        # Trips with zero passenger count
        if df["passenger_count"][i] < 1 or df["passenger_count"][i] == False:
            deleted_row.append(i)
            
        # Trips with no fare
        elif df["fare_amount"][i] <= 0 or df["fare_amount"][i] == False:
            deleted_row.append(i)
            
    df = df.drop(labels = deleted_row, axis=0)
    df.rename(columns={"pickup_datetime": 'date_time'}, inplace=True)

    return df.reset_index(drop=True)

In [16]:
def load_and_clean_uber_data(csv_file):
    df = pd.read_csv(csv_file)
    df = clean_uber_data(df)
    df = filter_lat_lon(df)
    # Drop unnecessary columns
    df.drop(df.columns[[0, 1, 2, 8]], axis=1,inplace=True)
    return df

In [17]:
def get_uber_data():
    uber_dataframe = load_and_clean_uber_data(UBER_CSV)
    add_distance_column(uber_dataframe)
    return uber_dataframe

### Processing Weather Data

We read one weather CSV per year and split it into hourly data (precipitation and wind speed) and daily data (average wind speed, precipitation, sunrise, and sunset).

In [18]:
def clean_month_weather_data_hourly(csv_file):
    df2 = pd.read_csv('weatherdata/' + csv_file)
    df2 = df2[['DATE', 'LATITUDE', 'LONGITUDE', 'HourlyPrecipitation', 'HourlyWindSpeed']]
    df2 = df2.dropna()    
    
    deleted_row = []
    
    for i in df2.index:
        # Convert datetime str to python object
        df2['DATE'][i] = df2['DATE'][i].replace('T', ' ')
        df2['DATE'][i] = datetime_str_to_obj(df2['DATE'][i])
    
        try:
            df2["HourlyPrecipitation"][i] = float(df2["HourlyPrecipitation"][i])
            df2["HourlyWindSpeed"][i] = int(df2["HourlyWindSpeed"][i])
            
        
            if(df2["HourlyPrecipitation"][i] <= 0 or df2["HourlyPrecipitation"][i] == False):
                deleted_row.append(i)
            elif (df2["HourlyWindSpeed"][i] <= 0 or df2["HourlyWindSpeed"][i] == False):
                deleted_row.append(i)
            
        except ValueError:
            deleted_row.append(i)
       
            
    df2 = df2.drop(labels = deleted_row, axis=0)
    
    return df2.reset_index(drop=True)

In [19]:
def clean_month_weather_data_daily(csv_file):
    df3 = pd.read_csv('weatherdata/' + csv_file)
    df3 = df3[['DATE', 'LATITUDE', 'LONGITUDE', 'DailyAverageWindSpeed', 'DailyPrecipitation', 'Sunrise', 'Sunset']]
    df3 = df3.dropna()  # dropna returns a new DataFrame, so the result must be assigned
    
    deleted_row = []
    
    for i in df3.index:
        # Convert datetime str to python object
        df3['DATE'][i] = df3['DATE'][i].replace('T', ' ')
        df3['DATE'][i] = datetime_str_to_obj(df3['DATE'][i])
    
        try:
            df3["DailyPrecipitation"][i] = float(df3["DailyPrecipitation"][i])
            df3["DailyAverageWindSpeed"][i] = int(df3["DailyAverageWindSpeed"][i])
            df3["Sunrise"][i] = int(df3["Sunrise"][i])
            df3["Sunset"][i] = int(df3["Sunset"][i])
        
            if(df3["DailyPrecipitation"][i] <= 0 or df3["DailyPrecipitation"][i] == False):
                deleted_row.append(i)
            elif (df3["DailyAverageWindSpeed"][i] <= 0 or df3["DailyAverageWindSpeed"][i] == False):
                deleted_row.append(i)
            
        except ValueError:
            deleted_row.append(i)
       
            
    df3 = df3.drop(labels = deleted_row, axis=0)
    
    return df3.reset_index(drop=True)

In [20]:
def load_and_clean_weather_data():
    hourly_dataframes = []
    daily_dataframes = []
    
    # add some way to find all weather CSV files or just add the name/paths manually
    weather_csv_files = ["2009_weather.csv", "2010_weather.csv", "2011_weather.csv", "2012_weather.csv", "2013_weather.csv", "2014_weather.csv", "2015_weather.csv"]
    
    for csv_file in weather_csv_files:
        hourly_dataframe = clean_month_weather_data_hourly(csv_file)
        daily_dataframe = clean_month_weather_data_daily(csv_file)
        hourly_dataframes.append(hourly_dataframe)
        daily_dataframes.append(daily_dataframe)
        
    # create two dataframes with hourly & daily data from every month
    hourly_data = pd.concat(hourly_dataframes, ignore_index=True)
    daily_data = pd.concat(daily_dataframes, ignore_index=True)
    # Copy so that later in-place edits do not act on a slice of daily_data
    daily_sunrisesunset_data = daily_data[["Sunrise", "Sunset"]].copy()
    
    daily_data = daily_data.drop(["Sunrise", "Sunset"], axis = 1)
    
    
    return hourly_data, daily_data, daily_sunrisesunset_data
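
If a daily value is missing, it can be derived from the hourly records. The sketch below is one way to do that with `DataFrame.resample`, summing precipitation and averaging wind speed per calendar day. The three sample rows are invented for illustration; the column names follow the weather CSV columns used above.

```python
import pandas as pd

# Invented hourly observations covering two days
hourly = pd.DataFrame(
    {
        "DATE": pd.to_datetime(
            ["2012-10-29 00:00", "2012-10-29 12:00", "2012-10-30 06:00"]
        ),
        "HourlyPrecipitation": [0.25, 0.5, 0.125],
        "HourlyWindSpeed": [10.0, 20.0, 30.0],
    }
)

# Aggregate to one row per day: total precipitation, mean wind speed
daily = (
    hourly.set_index("DATE")
    .resample("D")
    .agg({"HourlyPrecipitation": "sum", "HourlyWindSpeed": "mean"})
    .rename(
        columns={
            "HourlyPrecipitation": "DailyPrecipitation",
            "HourlyWindSpeed": "DailyAverageWindSpeed",
        }
    )
    .reset_index()
)
print(daily)
```

Whether a sum or a mean is the right aggregate depends on the measurement, so it is worth stating the choice in the notebook prose.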

### Process All Data

_This is where you can actually execute all the required functions._

_**TODO:** Write some prose that tells the reader what you're about to do here._

In [None]:
taxi_data = get_and_clean_taxi_data()
uber_data = get_uber_data()
hourly_weather_data, daily_weather_data, daily_sunrisesunset_data = load_and_clean_weather_data()


In [None]:
# Filter taxi data
taxi_data = taxi_data[["date_time", "pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude", "calculated_distance"]]
taxi_data

In [None]:
#Update the timesetting of sunrise-sunset

for i in daily_sunrisesunset_data.index:
    x = str(daily_sunrisesunset_data["Sunrise"][i])
    daily_sunrisesunset_data["Sunrise"][i] = f"{x[:-4]}:{x[-4:-2]}"
    daily_sunrisesunset_data["Sunrise"][i] = datetime.datetime.strptime(daily_sunrisesunset_data["Sunrise"][i], '%H:%M').time()
    y = str(daily_sunrisesunset_data["Sunset"][i])
    daily_sunrisesunset_data["Sunset"][i] = f"{y[:-4]}:{y[-4:-2]}"
    daily_sunrisesunset_data["Sunset"][i] = datetime.datetime.strptime(daily_sunrisesunset_data["Sunset"][i], '%H:%M').time()

daily_sunrisesunset_data
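
The string slicing above depends on the values printing with a trailing `.0`. A less fragile option, sketched below, is to treat the value as an `HHMM` number and split it with `divmod`. This assumes the sunrise and sunset values encode clock time as hours times 100 plus minutes; `hhmm_to_time` is a hypothetical helper name.

```python
import datetime


def hhmm_to_time(value):
    """Convert a clock value such as 715 or 1830.0 (HHMM) to datetime.time."""
    hours, minutes = divmod(int(value), 100)
    return datetime.time(hours, minutes)


sunrise = hhmm_to_time(715.0)
sunset = hhmm_to_time(1830)
print(sunrise, sunset)
```

The helper could be applied to a whole column with `Series.apply(hhmm_to_time)` and works for both integer and float inputs.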

## Part 2: Storing Cleaned Data

_Write some prose that tells the reader what you're about to do here._

In [None]:
import sqlite3
connection = sqlite3.connect("final_project.db")
connection

In [None]:
# if using SQL (as opposed to SQLAlchemy), define the commands 
# to create your 4 tables/dataframes
HOURLY_WEATHER_SCHEMA = """
CREATE TABLE IF NOT EXISTS hourly_weather
(
    id INTEGER PRIMARY KEY,
    DATE DATETIME,
    LATITUDE FLOAT,
    LONGITUDE FLOAT,
    HourlyPrecipitation FLOAT,
    HourlyWindSpeed FLOAT
);
"""

DAILY_WEATHER_SCHEMA = """
CREATE TABLE IF NOT EXISTS daily_weather
(
    id INTEGER PRIMARY KEY,
    DATE DATETIME,
    LATITUDE FLOAT,
    LONGITUDE FLOAT,
    DailyPrecipitation FLOAT,
    DailyAverageWindSpeed FLOAT,
    Sunrise FLOAT,
    Sunset FLOAT
);
"""

TAXI_TRIPS_SCHEMA = """
CREATE TABLE IF NOT EXISTS taxi_trips
(
    id INTEGER PRIMARY KEY,
    date_time DATETIME,
    pickup_longitude FLOAT,
    pickup_latitude FLOAT,
    dropoff_longitude FLOAT,
    dropoff_latitude FLOAT,
    calculated_distance FLOAT
);
"""

UBER_TRIPS_SCHEMA = """
CREATE TABLE IF NOT EXISTS uber_trips
(
    id INTEGER PRIMARY KEY,
    date_time DATETIME,
    pickup_longitude FLOAT,
    pickup_latitude FLOAT,
    dropoff_longitude FLOAT,
    dropoff_latitude FLOAT,
    calculated_distance FLOAT
);
"""

DAILY_SUNRISE_SUNSET_SCHEMA = """
CREATE TABLE IF NOT EXISTS daily_sunrise_sunset
(
    id INTEGER PRIMARY KEY,
    Sunrise TIMESTAMP,
    Sunset TIMESTAMP,
    FOREIGN KEY(id) REFERENCES daily_weather(id)
);
"""

In [None]:
# create that required schema.sql file
with open('sql_files/schema.sql', "w") as f:
    f.write(HOURLY_WEATHER_SCHEMA)
    f.write(DAILY_WEATHER_SCHEMA)
    f.write(TAXI_TRIPS_SCHEMA)
    f.write(UBER_TRIPS_SCHEMA)
    f.write(DAILY_SUNRISE_SUNSET_SCHEMA)

In [None]:
# create the tables with the schema files
with connection:
    connection.execute(HOURLY_WEATHER_SCHEMA)
with connection:
    connection.execute(DAILY_WEATHER_SCHEMA)
with connection:
    connection.execute(TAXI_TRIPS_SCHEMA)
with connection:
    connection.execute(UBER_TRIPS_SCHEMA)
with connection:
    connection.execute(DAILY_SUNRISE_SUNSET_SCHEMA)

### Add Data to Database

_**TODO:** Write some prose that tells the reader what you're about to do here._

In [None]:
def write_dataframes_to_table(table_to_df_dict):
    for key, value in table_to_df_dict.items():
        print(key)
        value.to_sql(name=key, con=connection, if_exists='append', index=False)
    
    return 'Success adding data'

In [None]:
map_table_name_to_dataframe = {
    "taxi_trips": taxi_data,
    "uber_trips": uber_data,
    "hourly_weather": hourly_weather_data,
    "daily_weather": daily_weather_data,
    "daily_sunrise_sunset": daily_sunrisesunset_data
}

In [None]:
write_dataframes_to_table(map_table_name_to_dataframe)
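
The DataFrames carry no `id` column, yet every table declares `id INTEGER PRIMARY KEY`. The sketch below uses an in-memory database and a cut-down table to show that SQLite fills in `id` automatically when `to_sql` appends rows with `index=False`. The two sample rows are invented.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
# Cut-down version of the uber_trips schema, for illustration only
conn.execute(
    "CREATE TABLE IF NOT EXISTS uber_trips ("
    "id INTEGER PRIMARY KEY, date_time DATETIME, calculated_distance FLOAT)"
)

frame = pd.DataFrame(
    {
        "date_time": ["2013-07-01 08:00:00", "2013-07-02 09:30:00"],
        "calculated_distance": [1.5, 3.25],
    }
)
# Append into the existing table; id is not supplied, so SQLite assigns it
frame.to_sql(name="uber_trips", con=conn, if_exists="append", index=False)

rows = conn.execute(
    "SELECT id, date_time, calculated_distance FROM uber_trips ORDER BY id"
).fetchall()
conn.close()
print(rows)
```

Because the mode is `append`, re-running the write cell adds the same rows again, so it helps to start from a fresh database file when re-executing the notebook.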

## Part 3: Understanding the Data

_A checklist of requirements to keep you on track. Remove this whole cell before submitting the project. The order of these tasks isn't necessarily the order in which they need to be done. It's okay to do them in an order that makes sense to you._

* [ ] For 01-2009 through 06-2015, what hour of the day was the most popular to take a yellow taxi? The result should have 24 bins.
* [ ] For the same time frame, what day of the week was the most popular to take an uber? The result should have 7 bins.
* [ ] What is the 95% percentile of distance traveled for all hired trips during July 2013?
* [ ] What were the top 10 days with the highest number of hired rides for 2009, and what was the average distance for each day?
* [ ] Which 10 days in 2014 were the windiest, and how many hired trips were made on those days?
* [ ] During Hurricane Sandy in NYC (Oct 29-30, 2012) and the week leading up to it, how many trips were taken each hour, and for each hour, how much precipitation did NYC receive and what was the sustained wind speed?

In [None]:
def write_query_to_file(query, outfile):
    with open(outfile, "w") as f:
        f.write(query)
    
    return f'Successfully generated {outfile}'

### Query N

_**TODO:** Write some prose that tells the reader what you're about to do here._

_Repeat for each query_

In [None]:
#Q1) For 01-2009 through 06-2015, what hour of the day was the most popular to take a yellow taxi? 
#The result should have 24 bins.

QUERY_1 = """
SELECT 
    strftime('%H', date_time) AS time,
    COUNT (*) as trip
FROM taxi_trips
WHERE date_time >= '2009-01-01' AND date_time < '2015-07-01'
GROUP BY time
ORDER BY trip DESC
"""

In [None]:
# TOBEDELETED Read Data
with connection:
    result = connection.execute(QUERY_1)

for row in result:
    print(row)

In [None]:
write_query_to_file(QUERY_1, "sql_files/question_1.sql")
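
When timestamps are stored as text, the upper bound of a date filter needs care: a value such as `'2015-06-30 08:30:00'` sorts after the string `'2015-06-30'`, so `BETWEEN ... AND '2015-06-30'` leaves out almost all of the last day. The sketch below demonstrates this on an in-memory table, assuming timestamps are stored as `YYYY-MM-DD HH:MM:SS` text; the table name `trips` and its rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (date_time TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?)",
    [
        ("2015-06-29 23:00:00",),
        ("2015-06-30 08:30:00",),
        ("2015-07-01 00:00:00",),
    ],
)

# BETWEEN with a date-only upper bound misses the 08:30 trip on June 30
between_count = conn.execute(
    "SELECT COUNT(*) FROM trips "
    "WHERE date_time BETWEEN '2009-01-01' AND '2015-06-30'"
).fetchone()[0]

# A half-open range up to the next day includes all of June 30
half_open_count = conn.execute(
    "SELECT COUNT(*) FROM trips "
    "WHERE date_time >= '2009-01-01' AND date_time < '2015-07-01'"
).fetchone()[0]
conn.close()
print(between_count, half_open_count)
```

Wrapping the column in `date(...)` before comparing is another way to get whole-day semantics.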

In [None]:
#Q2) For the same time frame, what day of the week was the most popular to take an uber? 
#The result should have 7 bins.

QUERY_2 = """
SELECT  case cast (strftime('%w', date_time) as integer)
  when 0 then 'Sunday'
  when 1 then 'Monday'
  when 2 then 'Tuesday'
  when 3 then 'Wednesday'
  when 4 then 'Thursday'
  when 5 then 'Friday'
  else 'Saturday' end as day,
  COUNT(*) as no_of_trip
FROM uber_trips
WHERE date_time >= '2009-01-01' AND date_time < '2015-07-01'
GROUP BY day
ORDER BY no_of_trip DESC
"""

# WHERE tpep_pickup_datetime between '2009-01-01' AND '2015-06-30'

In [None]:
# TOBEDELETED Read Data
with connection:
    result = connection.execute(QUERY_2)

for row in result:
    print(row)

In [None]:
write_query_to_file(QUERY_2, "sql_files/question_2.sql")

In [None]:
#Q3) What is the 95% percentile of distance traveled for all hired trips during July 2013?
# TBD: the date filter is not finalized yet; still unsure which percentile approach to use

QUERY_3 = """
WITH 
base 
AS (
    SELECT
        date,
        calculated_distance,
        ROW_NUMBER() OVER(ORDER BY calculated_distance ASC) AS row_num
    FROM (
        SELECT date(date_time) as date, calculated_distance
        FROM taxi_trips
        WHERE date between '2013-07-01' AND '2013-07-31'
        UNION ALL
        SELECT date(date_time) as date, calculated_distance
        FROM uber_trips
        WHERE date between '2013-07-01' AND '2013-07-31'
    )
    WHERE date between '2013-07-01' AND '2013-07-31'
    ),
    
quantile
AS (
    SELECT
        round(0.95 * COUNT(calculated_distance)) AS n_quantile
    FROM
        base
    )
    
select 
base.calculated_distance 
from base
join quantile
on base.row_num = quantile.n_quantile
"""

In [None]:
# TOBEDELETED Read Data # WHERE percent_rank >= 0.95
with connection:
    result = connection.execute(QUERY_3)

for row in result:
    print(row)

In [None]:
write_query_to_file(QUERY_3, "sql_files/question_3.sql")
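
To cross-check the result of the percentile query, the same nearest-rank idea can be reproduced in plain Python: sort the distances and take the value at rank `round(0.95 * n)`. This is a minimal sketch with made-up distances and a hypothetical helper name. Python's `round` resolves exact ties differently from SQLite's, so results can differ by one rank in rare cases.

```python
def nearest_rank_percentile(values, fraction):
    """Return the value at 1-based rank round(fraction * n) in sorted order."""
    ordered = sorted(values)
    rank = round(fraction * len(ordered))
    # Clamp so that very small or very large fractions stay inside the list
    rank = min(max(rank, 1), len(ordered))
    return ordered[rank - 1]


# Made-up distances: 1.0, 2.0, ..., 100.0 km
distances = [float(d) for d in range(1, 101)]
p95 = nearest_rank_percentile(distances, 0.95)
print(p95)
```

Running this on the July 2013 distances pulled into a list gives a value to compare against the SQL output.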

In [None]:
#Q4) What were the top 10 days with the highest number of hired rides for 2009, 
# and what was the average distance for each day?

QUERY_4 = """
SELECT
    date,
    COUNT (*) as no_of_trip,
    AVG (calculated_distance)
FROM (
    SELECT date(date_time) AS date, calculated_distance
    FROM taxi_trips
    UNION ALL
    SELECT date(date_time) AS date, calculated_distance
    FROM uber_trips
)
WHERE date between '2009-01-01' AND '2009-12-31'
GROUP BY date
ORDER BY no_of_trip DESC
LIMIT 10
"""

In [None]:
# TOBEDELETED Read Data
with connection:
    result = connection.execute(QUERY_4)

for row in result:
    print(row)

In [None]:
write_query_to_file(QUERY_4, "sql_files/question_4.sql")

In [None]:
#Q5) Which 10 days in 2014 were the windiest, and how many hired trips were made on those days?

QUERY_5 = """
WITH
weather AS (
SELECT
    date(DATE) as date,
    DailyAverageWindSpeed
FROM
    daily_weather
WHERE date(DATE) between '2014-01-01' AND '2014-12-31'
ORDER BY DailyAverageWindSpeed DESC
LIMIT 10),

trip AS (
SELECT
    date,
    COUNT (*) as no_of_trip
FROM (
    SELECT date(date_time) AS date
    FROM taxi_trips
    WHERE date between '2014-01-01' AND '2014-12-31'
    UNION ALL
    SELECT date(date_time) AS date
    FROM uber_trips
    WHERE date between '2014-01-01' AND '2014-12-31'
)
GROUP BY date
)


SELECT
    weather.*,
    trip.no_of_trip
FROM
    weather
LEFT JOIN trip
ON weather.date = trip.date
"""

In [None]:
# TOBEDELETED Read Data
with connection:
    result = connection.execute(QUERY_5)

for row in result:
    print(row)

In [None]:
write_query_to_file(QUERY_5, "sql_files/question_5.sql")

In [None]:
#Q6) During Hurricane Sandy in NYC (Oct 29-30, 2012) and the week leading up to it, 
# how many trips were taken each hour, and for each hour, 
# how much precipitation did NYC receive and what was the sustained wind speed?
#starting 22nd Oct

QUERY_6 = """
WITH
weather AS (
SELECT
    strftime('%Y-%m-%d', DATE) AS date,
    strftime('%H', DATE) AS time,
    sum(HourlyPrecipitation) as Precipitation,
    avg(HourlyWindSpeed) as Wind_Speed
FROM
    hourly_weather
WHERE date(DATE) between '2012-10-22' AND '2012-10-31'
GROUP BY strftime('%Y-%m-%d', DATE), strftime('%H', DATE)
),

trip AS (
SELECT
    date,
    time,
    COUNT (*) as no_of_trip
FROM(
    SELECT 
        date(date_time) AS date, 
        strftime('%H', date_time) AS time
    FROM taxi_trips
    WHERE date between '2012-10-22' AND '2012-10-31'
    UNION ALL
    SELECT 
        date(date_time) AS date, 
        strftime('%H', date_time) AS time
    FROM uber_trips
    WHERE date between '2012-10-22' AND '2012-10-31'
)
GROUP BY date, time
)


SELECT
    weather.date,
    weather.time,
    trip.no_of_trip,
    weather.Precipitation,
    weather.Wind_Speed
FROM
    weather
LEFT JOIN trip
ON weather.time = trip.time AND weather.date = trip.date
ORDER BY weather.date
"""

In [None]:
#TOBEDELETED Read Data
with connection:
    result = connection.execute(QUERY_6)

for row in result:
    print(row)

In [None]:
write_query_to_file(QUERY_6, "sql_files/question_6.sql")

In [None]:
#Extra question regarding the sunrise-sunset 
# What are the top 10 sunrise times with highest average wind speed?

QUERY_EXTRA = """
SELECT
    strftime('%H:%M', x.Sunrise) AS time,
    avg(y.DailyAverageWindSpeed) AS average_windspeed
FROM
    daily_sunrise_sunset AS x
JOIN daily_weather AS Y
ON x.id = Y.id
GROUP BY time
ORDER BY average_windspeed DESC
LIMIT 10
"""

In [None]:
#TOBEDELETED Read Data
with connection:
    result = connection.execute(QUERY_EXTRA)

for row in result:
    print(row)

In [None]:
write_query_to_file(QUERY_EXTRA, "sql_files/question_extra.sql")

## Part 4: Visualizing the Data

_A checklist of requirements to keep you on track. Remove this whole cell before submitting the project. The order of these tasks isn't necessarily the order in which they need to be done. It's okay to do them in an order that makes sense to you._

* [ ] Create an appropriate visualization for the first query/question in part 3
* [ ] Create a visualization that shows the average distance traveled per month (regardless of year - so group by each month). Include the 90% confidence interval around the mean in the visualization
* [ ] Define three lat/long coordinate boxes around the three major New York airports: LGA, JFK, and EWR (you can use bboxfinder to help). Create a visualization that compares what day of the week was most popular for drop offs for each airport.
* [ ] Create a heatmap of all hired trips over a map of the area. Consider using KeplerGL or another library that helps generate geospatial visualizations.
* [ ] Create a scatter plot that compares tip amount versus distance.
* [ ] Create another scatter plot that compares tip amount versus precipitation amount.

_Be sure these cells are executed so that the visualizations are rendered when the notebook is submitted._

### Visualization 1

_**TODO:** Write some prose that tells the reader what you're about to do here._

_Repeat for each visualization._

_The example below makes use of the `matplotlib` library. There are other libraries, including `pandas` built-in plotting library, kepler for geospatial data representation, `seaborn`, and others._

In [None]:
def addlabels(x, y):
    """Annotate each position along the x-axis with its y value."""
    for i in range(len(x)):
        plt.text(i, y[i], y[i])

In [None]:
def plot_visual_1(data):
    # preparing the dataset
    hours = list(data.keys())
    trips = list(data.values())

    fig = plt.figure(figsize = (10, 5))
 
    # creating the bar plot
    bars = plt.bar(hours, trips, color ='lightblue', width = 0.4)
 
    plt.xlabel("Hour", fontsize = 13)
    plt.ylabel("Number of trips", fontsize = 13)
    plt.title("Number of yellow taxi trips by hour of day", fontsize = 13)
    
    plt.yticks(fontsize=10)
    plt.xticks(fontsize=10)

 
    
            
    addlabels(hours, trips)

    
    plt.show()
    
    highest = max(trips)
    index = trips.index(highest)
    
    print(f"The most popular hour to take a yellow taxi is {hours[index]}")

In [None]:
def get_data_for_visual_1():
    # Query SQL database for the data needed.
    QUERY = """
    SELECT 
        strftime('%H', date_time) AS time,
        COUNT (*) as trip
    FROM taxi_trips
    WHERE date_time >= '2009-01-01' AND date_time < '2015-07-01'
    GROUP BY time
    ORDER BY time ASC
    """
    
    with connection:
        output = connection.execute(QUERY)
    
    data = {}
    
    for row in output:
        data[str(row[0])] = row[1]
        
     
    return data

In [None]:
curated_taxidata = get_data_for_visual_1()
plot_visual_1(curated_taxidata)

### Visualization 2

In [None]:
import statistics
from math import sqrt

def plot_visual_2(data):
    
    month = list(data.keys())
    avg_distance = list(data.values())
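    # z-score for a two-sided 90% confidence level
    z = 1.645
    
    # NOTE: the spread is taken across the monthly averages, so every month gets the same band width
    stdev = statistics.stdev(avg_distance)
    confidence_interval = z * stdev / sqrt(len(avg_distance))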
    
    figure = plt.figure(figsize = (10, 5))
    
    top = [i + confidence_interval for i in avg_distance]
    bottom = [i - confidence_interval for i in avg_distance]

    
 
    plt.plot(month, avg_distance,'o', color='red')
    plt.fill_between(month, bottom, top, color = 'lightblue', alpha = 0.5)
    
    
    addlabels(month, avg_distance)
    
    plt.xlabel("Month")
    plt.ylabel("Average Distance")
    plt.title("Average distance traveled per month")
    
    plt.show()

In [None]:
def get_data_for_visual_2():
    
    QUERY = """   
    SELECT
        month,
        round(AVG (calculated_distance), 2)
    FROM (
        SELECT strftime('%m', date_time) AS month, calculated_distance
        FROM taxi_trips
        UNION ALL
        SELECT strftime('%m', date_time) AS month, calculated_distance
        FROM uber_trips
    )   
    GROUP BY month
    """
    
    with connection:
        output = connection.execute(QUERY)
    
    data = {} 
    
    for row in output:
        data[str(row[0])] = row[1]
        
    return data

In [None]:
visualization2_dataframe = get_data_for_visual_2()
plot_visual_2(visualization2_dataframe)
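
A confidence interval around each month's mean is usually computed from that month's own trips, using the month's standard deviation and trip count. The sketch below shows that calculation with the normal approximation, under the assumption that the individual distances for each month are available as lists; the data and the helper name `mean_with_ci` are invented for illustration.

```python
import statistics
from math import sqrt

Z_90 = 1.645  # z-score for a two-sided 90% confidence level


def mean_with_ci(values, z=Z_90):
    """Return (mean, half_width) of the normal-approximation confidence interval."""
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / sqrt(len(values))
    return mean, half_width


# Invented distances (km) grouped by month label
distances_by_month = {
    "01": [1.0, 2.0, 3.0, 4.0, 5.0],
    "02": [2.0, 2.0, 2.0, 2.0],
}
summary = {month: mean_with_ci(vals) for month, vals in distances_by_month.items()}
print(summary)
```

The resulting half-widths can be passed to `plt.errorbar(..., yerr=...)` or used as the bounds for `plt.fill_between`, giving each month its own band.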

### Visualization 3
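
This visualization compares drop-off weekdays for the three airports. A possible starting point is sketched below: a bounding box per airport and a helper that counts drop-offs inside a box by weekday name. The box coordinates are rough, illustrative values placed around each airport and should be checked with bboxfinder before use; `AIRPORT_BOXES`, `dropoffs_by_weekday`, and the sample trips are all hypothetical.

```python
import pandas as pd

# Rough, illustrative boxes as ((lat_min, lon_min), (lat_max, lon_max)); verify before use
AIRPORT_BOXES = {
    "LGA": ((40.765, -73.890), (40.785, -73.855)),
    "JFK": ((40.620, -73.825), (40.665, -73.745)),
    "EWR": ((40.670, -74.195), (40.710, -74.150)),
}


def dropoffs_by_weekday(df, box):
    """Count drop-offs inside the box for each weekday name."""
    (lat_min, lon_min), (lat_max, lon_max) = box
    inside = df["dropoff_latitude"].between(lat_min, lat_max) & df[
        "dropoff_longitude"
    ].between(lon_min, lon_max)
    return df.loc[inside, "date_time"].dt.day_name().value_counts()


# Invented trips: two Monday drop-offs near LGA, one Tuesday drop-off near JFK
trips = pd.DataFrame(
    {
        "date_time": pd.to_datetime(
            ["2013-07-01 08:00", "2013-07-08 09:00", "2013-07-02 10:00"]
        ),
        "dropoff_latitude": [40.775, 40.776, 40.640],
        "dropoff_longitude": [-73.870, -73.872, -73.780],
    }
)
lga_counts = dropoffs_by_weekday(trips, AIRPORT_BOXES["LGA"])
jfk_counts = dropoffs_by_weekday(trips, AIRPORT_BOXES["JFK"])
print(lga_counts, jfk_counts, sep="\n")
```

The three resulting Series could be combined into one DataFrame and drawn as a grouped bar chart, one group per weekday.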

### Visualization 4

In [None]:
import pandas as pd
from keplergl import KeplerGl
import geopandas as gpd



In [None]:
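# Show the map with both trip datasets.
# df4 and df5 must hold the taxi and Uber trip DataFrames (for example taxi_data and uber_data).
# The variable is named trip_map to avoid shadowing the built-in `map`.
trip_map = KeplerGl(height=600, width=800)
trip_map.add_data(data=df4,name='New York City Taxi Trips')
trip_map.add_data(data=df5,name='New York City Uber Trips')
trip_map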

### Visualization 5

In [None]:
import numpy as np

def plot_visual_5(data):
    """Plot tip amount against distance with a least-squares trend line."""
    # preparing the dataset: data is a list of (tip_amount, distance) rows
    tips = [row[0] for row in data]
    distance = [row[1] for row in data]

    fig = plt.figure(figsize = (10, 5))
 
    # Creating the scatter plot
    plt.plot(tips, distance,'o', color='#f44336')
 
    plt.xlabel("Tip Amount", fontsize = 13)
    plt.ylabel("Distance", fontsize = 13)
    plt.title("Graph of Tip Amount versus Distance for Yellow Taxi rides", fontsize = 13)
    
    plt.yticks(fontsize=10)
    plt.xticks(fontsize=10)
    
    x = np.array(tips)
    y = np.array(distance)
    a, b = np.polyfit(x, y, 1)

    # Creating the line of best fit
    plt.plot(x, a*x+b)

    plt.show()

In [None]:
def get_data_for_visual_5():
    """Return (tip_amount, calculated_distance) rows for yellow taxi trips."""
    # Query SQL database for the data needed.
    # NOTE: this needs a tip_amount column in taxi_trips, which TAXI_TRIPS_SCHEMA does not define.
    QUERY_V5 = """
    SELECT 
        tip_amount,
        calculated_distance
    FROM taxi_trips
    WHERE tip_amount IS NOT NULL AND calculated_distance IS NOT NULL AND tip_amount BETWEEN 0 and 40
    """
    
    with connection:
        output = connection.execute(QUERY_V5)
    
    # Keep every row; a dict keyed by tip amount would collapse trips that share a tip value
    return output.fetchall()

In [None]:
visualization5_dataframe = get_data_for_visual_5()
plot_visual_5(visualization5_dataframe)