### 1. Read Data from CSV files to dataframes


Caution: A local copy of the data on your machine is a prerequisite.

If you don't have the data on your machine, run the following commands:

```bash 
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv
```
```bash 
wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
```

In [1]:
import pandas as pd

In [2]:
df_tripdata = pd.read_csv('2_docker_sql/yellow_tripdata_2021-01.csv')
df_zones = pd.read_csv('2_docker_sql/taxi+_zone_lookup.csv')
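
With a file of this size, `read_csv` infers column types as it goes and leaves the timestamp columns as plain strings. A minimal sketch of declaring the types up front is shown below. The three inline rows reuse a few columns from the trip data preview, and reading `store_and_fwd_flag` as text is an assumption made for illustration, not something the notebook itself does.

```python
import io

import pandas as pd

# Tiny inline stand-in for yellow_tripdata_2021-01.csv (subset of columns, three rows)
sample_csv = io.StringIO(
    "VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,store_and_fwd_flag,tip_amount\n"
    "1.0,2021-01-01 00:30:10,2021-01-01 00:36:12,N,0.0\n"
    "1.0,2021-01-01 00:43:30,2021-01-01 01:11:06,N,8.65\n"
    "2.0,2021-01-01 00:31:49,2021-01-01 00:48:21,N,4.06\n"
)

df_sample = pd.read_csv(
    sample_csv,
    # Parse both timestamp columns while reading instead of converting them later
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    # Assumption: treat the flag column as text so pandas does not have to guess its type
    dtype={"store_and_fwd_flag": str},
)

# With real datetimes, durations can be computed directly
trip_minutes = (
    df_sample["tpep_dropoff_datetime"] - df_sample["tpep_pickup_datetime"]
).dt.total_seconds() / 60

print(df_sample.dtypes)
print(trip_minutes.round(1).tolist())
```

The same `parse_dates` and `dtype` arguments could be passed to the `read_csv` call on the full file.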



In [3]:
df_tripdata.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1.0,2021-01-01 00:30:10,2021-01-01 00:36:12,1.0,2.1,1.0,N,142,43,2.0,8.0,3.0,0.5,0.0,0.0,0.3,11.8,2.5
1,1.0,2021-01-01 00:51:20,2021-01-01 00:52:19,1.0,0.2,1.0,N,238,151,2.0,3.0,0.5,0.5,0.0,0.0,0.3,4.3,0.0
2,1.0,2021-01-01 00:43:30,2021-01-01 01:11:06,1.0,14.7,1.0,N,132,165,1.0,42.0,0.5,0.5,8.65,0.0,0.3,51.95,0.0
3,1.0,2021-01-01 00:15:48,2021-01-01 00:31:01,0.0,10.6,1.0,N,138,132,1.0,29.0,0.5,0.5,6.05,0.0,0.3,36.35,0.0
4,2.0,2021-01-01 00:31:49,2021-01-01 00:48:21,1.0,4.94,1.0,N,68,33,1.0,16.5,0.5,0.5,4.06,0.0,0.3,24.36,2.5


In [4]:
df_zones.head()

Unnamed: 0,LocationID,Borough,Zone,service_zone
0,1,EWR,Newark Airport,EWR
1,2,Queens,Jamaica Bay,Boro Zone
2,3,Bronx,Allerton/Pelham Gardens,Boro Zone
3,4,Manhattan,Alphabet City,Yellow Zone
4,5,Staten Island,Arden Heights,Boro Zone


### 2. Connect to the Postgres and Run Queries

Caution: A running Postgres database instance (and, optionally, pgAdmin) with the expected tables is a prerequisite.

- Check 1: If `docker ps` does not show a running Postgres container, run one of the following commands to start Postgres in your environment.

```bash
docker-compose up
```
    or 
```bash
docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  postgres:13
```

- Check 2: If the connection works but the database doesn't have the expected tables, load the data into the database with the `load_csv_to_db` function given below.

In [5]:
# Global variables for accessing the database
user = "root"
password = "root"
host = "localhost"
port = 5432
db = "ny_taxi"

In [58]:
from time import time
from sqlalchemy import create_engine
import pandas as pd

def load_csv_to_db(csv_name, table_name, user=user, password=password, host=host, port=port, db=db):
    '''
        Load a CSV file into a Postgres table in chunks of 100,000 rows.

        csv_name: e.g. yellow_tripdata_2021-01.csv, taxi+_zone_lookup.csv
        table_name: e.g. yellow_taxi_data, zones
        host: Use "localhost" when this code runs on the host machine and the Postgres port
         is published, as with the docker run command in step 2. Use the service name defined
         in the docker-compose yaml ("pgdatabase") when the code itself runs in a container
         on the same docker-compose network.
    '''
    
    engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')
            
    df_iter = pd.read_csv(csv_name, iterator=True, chunksize=100000)
            
    while True: 
        t_start = time()

        try:
            df = next(df_iter)
        except StopIteration:
            # The iterator is exhausted: every chunk has been inserted
            print('finished loading all chunks')
            break

        if 'tpep_pickup_datetime' in df:
            df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
        if 'tpep_dropoff_datetime' in df:
            df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

        df.to_sql(name=table_name, con=engine, if_exists='append')

        t_end = time()

        print('inserted another chunk, took %.3f seconds' % (t_end - t_start))

```python
load_csv_to_db('taxi+_zone_lookup.csv','zones')
```

```python
load_csv_to_db('yellow_tripdata_2021-01.csv','yellow_taxi_data')
```
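
The chunked loading pattern used by `load_csv_to_db` can be tried without a running Postgres instance. The sketch below rests on several assumptions: it swaps Postgres for an in-memory SQLite database from the standard library, inlines five rows from the zones preview, and uses a chunk size of 2 so that several chunks are produced. `index=False` is also an illustrative choice that differs from the function above.

```python
import io
import sqlite3

import pandas as pd

# Five rows from the zones preview, inlined so the example needs no file
zones_csv = (
    "LocationID,Borough,Zone,service_zone\n"
    "1,EWR,Newark Airport,EWR\n"
    "2,Queens,Jamaica Bay,Boro Zone\n"
    "3,Bronx,Allerton/Pelham Gardens,Boro Zone\n"
    "4,Manhattan,Alphabet City,Yellow Zone\n"
    "5,Staten Island,Arden Heights,Boro Zone\n"
)

# Assumption: SQLite stands in for Postgres; pandas accepts a sqlite3 connection directly
conn = sqlite3.connect(":memory:")

chunks_loaded = 0
for chunk in pd.read_csv(io.StringIO(zones_csv), chunksize=2):
    # Each chunk is an ordinary DataFrame; append it to the same table
    chunk.to_sql("zones", con=conn, if_exists="append", index=False)
    chunks_loaded += 1

row_count = conn.execute("SELECT COUNT(*) FROM zones").fetchone()[0]
print(f"loaded {row_count} rows in {chunks_loaded} chunks")
```

Because every chunk is appended, running the load twice against the same table duplicates the rows; the same holds for `if_exists='append'` in the Postgres version.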

In [6]:
# Difference between sqlalchemy and psycopg2: https://pplonski.github.io/sqlalchemy-vs-psycopg2/

alchemy_engine = "postgresql://{}:{}@{}:{}/{}".format(user, password, host, port, db)
pg_engine = "user='{}' password='{}' host='{}' port='{}' dbname='{}'".format(user, password, host, port,db)
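
The two connection strings above work as long as the credentials contain no special characters. If a password contained characters such as `@` or `/`, the URL form would need them percent-encoded. A minimal sketch using only the standard library follows; `build_postgres_url` is a hypothetical helper name and the second password is made up for illustration.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, db):
    """Return a postgresql:// URL with the user name and password percent-encoded."""
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{db}"


# Credentials from this notebook: nothing needs escaping
plain_url = build_postgres_url("root", "root", "localhost", 5432, "ny_taxi")
# Made-up password containing characters that would otherwise break the URL
tricky_url = build_postgres_url("root", "p@ss/word", "localhost", 5432, "ny_taxi")

print(plain_url)
print(tricky_url)
```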

In [22]:
# 2.1 Create Connection and run queries to Postgres with psycopg2
import psycopg2

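pg_connection = psycopg2.connect(pg_engine)

def query_in_db_psycopg2(connection, sql_query):
    # The cursor context manager closes the cursor once the rows are fetched
    with connection.cursor() as cur:
        cur.execute(sql_query)
        # cur.description has one entry per result column; item 0 is the column name
        column_names = [desc[0] for desc in cur.description]
        result = cur.fetchall()
    df = pd.DataFrame(result, columns=column_names)
    return df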
    

In [9]:
# 2.2 Create Connection to Postgres with sqlalchemy
from sqlalchemy import create_engine

alchemy_connection = create_engine(alchemy_engine)

def query_in_db_alchemy(connection, sql_query):
    df = pd.read_sql_query(sql_query, connection)
    return df
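
Both helpers above take the SQL as a finished string. When part of a query comes from a variable, DB-API drivers can bind it as a parameter instead of formatting it into the string. The sketch below is an illustration under stated assumptions: it uses the standard library's `sqlite3` driver as a stand-in (its placeholder is `?`, whereas psycopg2 uses `%s`), and `query_rows` and the small sample table are hypothetical.

```python
import sqlite3


def query_rows(connection, sql_query, params=()):
    """Run a query with bound parameters and return the rows as dictionaries."""
    cur = connection.cursor()
    try:
        cur.execute(sql_query, params)
        # As with psycopg2, cursor.description lists the result columns by name
        column_names = [desc[0] for desc in cur.description]
        return [dict(zip(column_names, row)) for row in cur.fetchall()]
    finally:
        cur.close()


# Hypothetical in-memory table with three rows from the zones preview
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE zones (LocationID INTEGER, Zone TEXT)")
conn.executemany(
    "INSERT INTO zones VALUES (?, ?)",
    [(1, "Newark Airport"), (2, "Jamaica Bay"), (4, "Alphabet City")],
)

rows = query_rows(
    conn,
    "SELECT LocationID, Zone FROM zones WHERE Zone = ?",
    ("Alphabet City",),
)
print(rows)
```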

#### Question 3: 

How many taxi trips were there on January 15? Consider only trips that started on January 15.

In [10]:
question3 = '''SELECT COUNT(*) AS counter 
FROM yellow_taxi_data t WHERE 
	t.tpep_pickup_datetime >= '2021-01-15'::date
    AND
	t.tpep_pickup_datetime < '2021-01-16'::date; '''

result_df = query_in_db_psycopg2(pg_connection, question3)
result_df.head()

Unnamed: 0,counter
0,53024


In [11]:
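## Side note: this variant looks equivalent but is not. Postgres casts the date to a
## timestamp at midnight, so the equality only matches trips picked up at exactly
## 2021-01-15 00:00:00. The count of 1 is what the query asks for, whichever driver runs it.
question3_missleading = '''SELECT COUNT(*) AS counter 
FROM yellow_taxi_data t WHERE 
	t.tpep_pickup_datetime = '2021-01-15'::date;'''

query_in_db_alchemy(alchemy_connection, question3_missleading)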

Unnamed: 0,counter
0,1
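
The difference between the two Question 3 queries comes down to comparing a timestamp with midnight versus with a half-open day range. The plain-Python sketch below mirrors that with five made-up pickup times; it illustrates the comparison logic only and does not query the database.

```python
from datetime import datetime, timedelta

# Made-up pickup times around the day of interest
pickups = [
    datetime(2021, 1, 14, 23, 59, 59),
    datetime(2021, 1, 15, 0, 0, 0),
    datetime(2021, 1, 15, 8, 30, 0),
    datetime(2021, 1, 15, 23, 59, 59),
    datetime(2021, 1, 16, 0, 0, 0),
]

day_start = datetime(2021, 1, 15)          # what '2021-01-15'::date becomes as a timestamp
day_end = day_start + timedelta(days=1)    # exclusive upper bound

# Equality only matches the pickup at exactly midnight
equality_count = sum(1 for ts in pickups if ts == day_start)
# The half-open range [day_start, day_end) matches every pickup on that day
range_count = sum(1 for ts in pickups if day_start <= ts < day_end)

print(equality_count, range_count)
```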


#### Question 4: 

Find the largest tip for each day. On which day in January was the largest tip?

Use the pick up time for your calculations.

(note: it's not a typo, it's "tip", not "trip")



In [12]:
question4 = '''SELECT tpep_pickup_datetime::date, MAX(tip_amount) AS max_tip
FROM yellow_taxi_data t 
	WHERE t.tpep_pickup_datetime >= '2021-01-01'::date 
		AND t.tpep_pickup_datetime < '2021-02-01'::date
	GROUP BY tpep_pickup_datetime::date
	ORDER BY max_tip DESC;
'''

result_df = query_in_db_psycopg2(pg_connection, question4) # the first row is the day with the largest tip
result_df.head()

Unnamed: 0,tpep_pickup_datetime,max_tip
0,2021-01-20,1140.44
1,2021-01-04,696.48
2,2021-01-03,369.4
3,2021-01-26,250.0
4,2021-01-09,230.0
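
The same per-day maximum can be cross-checked in pandas once the pickup column is a datetime. This is a minimal sketch under stated assumptions: the four rows are made up, and `sample` stands in for `df_tripdata` with `tpep_pickup_datetime` already parsed.

```python
import pandas as pd

# Made-up rows; in the notebook this would be df_tripdata with parsed timestamps
sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2021-01-01 00:30:10",
        "2021-01-01 09:15:00",
        "2021-01-02 12:00:00",
        "2021-01-31 23:59:59",
    ]),
    "tip_amount": [0.0, 8.65, 4.06, 6.05],
})

# Group by the calendar date of the pickup and keep the largest tip of each day
max_tip_per_day = (
    sample.groupby(sample["tpep_pickup_datetime"].dt.date)["tip_amount"]
    .max()
    .sort_values(ascending=False)
)

best_day = max_tip_per_day.index[0]
print(max_tip_per_day)
```

Grouping on `.dt.date` plays the role of `tpep_pickup_datetime::date` in the SQL version.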


#### Question 5: 

What was the most popular destination for passengers picked up in Central Park on January 14?

Use the pick up time for your calculations.

Enter the zone name (not id). If the zone name is unknown (missing), write "Unknown".


In [13]:
question5 = '''SELECT  "DOLocationID", "Zone", COUNT(*) AS counter
	FROM yellow_taxi_data INNER JOIN zones
		ON "DOLocationID" = "LocationID"
        WHERE tpep_pickup_datetime::date='2021-01-14' 
            AND "PULocationID" = (SELECT "LocationID" FROM zones WHERE "Zone" = 'Central Park')
	GROUP BY "DOLocationID", "Zone"
	ORDER BY "counter" DESC;
'''

question5_v2 = '''SELECT  tpep_pickup_datetime::date, "DOLocationID", "Zone", COUNT(*) AS counter
	FROM yellow_taxi_data INNER JOIN zones
		ON "DOLocationID" = "LocationID"
	WHERE "PULocationID" = (SELECT "LocationID" FROM zones WHERE "Zone" = 'Central Park')
	GROUP BY "DOLocationID", "Zone", tpep_pickup_datetime::date
	ORDER BY tpep_pickup_datetime::date, counter DESC;

'''

result_df = query_in_db_psycopg2(pg_connection, question5) # the first row is the most popular destination
result_df.head()

Unnamed: 0,DOLocationID,Zone,counter
0,237,Upper East Side South,97
1,236,Upper East Side North,94
2,142,Lincoln Square East,83
3,238,Upper West Side North,68
4,239,Upper West Side South,60
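
A pandas version of the join can help when checking how missing zone names are handled. In this sketch the trips and the three-row zone table are made up; a left merge keeps drop-offs whose `LocationID` has no zone name, and `fillna` labels them "Unknown" as the question requires.

```python
import pandas as pd

# Hypothetical trips that all start in the same pickup zone
trips = pd.DataFrame({"DOLocationID": [237, 237, 236, 265, 237, 236]})

# Hypothetical zone table; ID 265 has no zone name (it shows up empty in the Question 6 output)
zones = pd.DataFrame({
    "LocationID": [236, 237, 265],
    "Zone": ["Upper East Side North", "Upper East Side South", None],
})

# A left merge keeps every trip, even when the zone name is missing
merged = trips.merge(zones, how="left", left_on="DOLocationID", right_on="LocationID")
merged["Zone"] = merged["Zone"].fillna("Unknown")

# Count trips per destination zone, most frequent first
destination_counts = merged.groupby("Zone").size().sort_values(ascending=False)
most_popular = destination_counts.index[0]
print(destination_counts)
```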


#### Question 6: 

What's the pickup-dropoff pair with the largest average price for a ride (calculated based on total_amount)?

Enter two zone names separated by a slash

For example:

"Jamaica Bay / Clinton East"

If any of the zone names are unknown (missing), write "Unknown". For example, "Unknown / Clinton East".

In [23]:
question6_v1 = '''SELECT
	"PULocationID", "DOLocationID", AVG(total_amount),
	CONCAT(zpu."Zone" , ' / ' , zdo."Zone") AS "route"
FROM 
	yellow_taxi_data t, 
	zones zdo,
	zones zpu
WHERE
	t."PULocationID" = zpu."LocationID" AND
	t."DOLocationID" = zdo."LocationID"
GROUP BY "PULocationID", "DOLocationID", "route"
ORDER BY avg DESC;
'''

question6_v2 = '''SELECT 
	"PULocationID", "DOLocationID", AVG(total_amount),
	CONCAT(zpu."Zone" , ' / ' , zdo."Zone") AS "route"
FROM yellow_taxi_data t  
	JOIN zones zpu
		ON t."PULocationID" = zpu."LocationID" 
	JOIN zones zdo
		ON t."DOLocationID" = zdo."LocationID"
	GROUP BY "PULocationID", "DOLocationID", "route"
	ORDER BY avg DESC;
'''

result_df = query_in_db_psycopg2(pg_connection, question6_v2) # the first row is the answer; an empty zone name in "route" means "Unknown"
result_df.head()

Unnamed: 0,PULocationID,DOLocationID,avg,route
0,4,265,2292.4,Alphabet City /
1,234,39,262.852,Union Sq / Canarsie
2,177,265,234.51,Ocean Hill /
3,145,48,207.61,Long Island City/Hunters Point / Clinton East
4,25,260,200.3,Boerum Hill / Woodside
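
In the output above, a missing zone name leaves one side of the route empty, while the question asks for "Unknown". One possible adjustment is to wrap each zone name in `COALESCE`. The sketch below shows the idea on made-up data, with the standard library's `sqlite3` as a stand-in for Postgres; it concatenates with the `||` operator, and the table contents and amounts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE zones (LocationID INTEGER, Zone TEXT);
    CREATE TABLE trips (PULocationID INTEGER, DOLocationID INTEGER, total_amount REAL);
    INSERT INTO zones VALUES (4, 'Alphabet City'), (25, 'Boerum Hill'), (265, NULL);
    INSERT INTO trips VALUES (4, 265, 100.0), (4, 265, 50.0), (25, 4, 20.0);
    """
)

# COALESCE substitutes 'Unknown' whenever a zone name is NULL
query = """
SELECT
    COALESCE(zpu.Zone, 'Unknown') || ' / ' || COALESCE(zdo.Zone, 'Unknown') AS route,
    AVG(t.total_amount) AS avg_amount
FROM trips t
    JOIN zones zpu ON t.PULocationID = zpu.LocationID
    JOIN zones zdo ON t.DOLocationID = zdo.LocationID
GROUP BY route
ORDER BY avg_amount DESC;
"""

routes = conn.execute(query).fetchall()
print(routes)
```

In the Postgres queries above, the equivalent change would be `CONCAT(COALESCE(zpu."Zone", 'Unknown'), ' / ', COALESCE(zdo."Zone", 'Unknown'))`.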
