### Ingesting NY Taxi Data using Postgres
[Lesson Video](https://www.youtube.com/watch?v=2JM-ziJt0WI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)

### Running Postgres using Docker
1. Pull the Postgres image and create a Postgres container with the configuration below:
```
# Flags: -e sets an environment variable, -v mounts a volume into the container,
# --name sets the container name, -p maps a host port to a container port.
# The dataset name is a good choice for the DB name.
# Keep comments off the continued lines: anything after a trailing "\" breaks the command.

# Windows WSL (the volume path is different; quoted because it contains spaces)
docker run -it \
    -e POSTGRES_USER='root' \
    -e POSTGRES_PASSWORD='root' \
    -e POSTGRES_DB="ny_taxi" \
    -v "c/home/vlad/dev/Data Engineering ZoomCamp/Module 1 - Containerization Infrastructure as Code/Data Ingesting with Postgres/pg_db_volume:/var/lib/postgresql/data" \
    --name postgres_db \
    -p 5432:5432 \
    postgres:13

# Linux
docker run -it \
    -e POSTGRES_USER='root' \
    -e POSTGRES_PASSWORD='root' \
    -e POSTGRES_DB="ny_taxi" \
    -v "$(pwd)/pg_db_volume:/var/lib/postgresql/data" \
    --name postgres_db \
    -p 5432:5432 \
    postgres:13
```

If the multi-line command above fails (for example, because of a space or a comment after a trailing `\`), use the one-line form:
    - `docker run -it -e POSTGRES_USER='root' -e POSTGRES_PASSWORD='root' -e POSTGRES_DB='ny_taxi' -v $(pwd)/pg_db_volume:/var/lib/postgresql/data --name postgres_db -p 5432:5432 postgres:13`

If the container already exists, start it instead of creating it again:
    - `docker start postgres_db`
    - Even if the container is removed, the database is preserved because its files live in the mounted Docker volume

### Using `pgcli` for Postgres Connection
1. Install `pgcli` in the current Python environment:
    - `pip install --upgrade pip`
    - `pip install "psycopg[binary,pool]"`
    - `pip install pgcli`

2. Connect to the Postgres container using `pgcli` in a new terminal:
    - `pgcli -h localhost -p 5432 -u root -d ny_taxi` -> connect to the Postgres NY Taxi DB running in Docker
    - `\dt` -> list the tables to confirm the connection works

### Download NY Taxi Data
1. Download the dataset from [here](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) and save it in the `data/` directory:
    - `mkdir data`, `cd data`, `wget <dataset_url>`

All NY Taxi datasets are currently published in parquet format, so we need to convert the file to `.csv`.


### NY Taxi Data Ingesting
1. Prepare the data for ingesting to Postgres DB:
    - Open Jupyter Lab in a new terminal: `jupyter lab`
    - Install dependencies: `pip install pandas sqlalchemy pyarrow`

2. Create a script that converts the data from parquet to CSV, then run it

3. Create Schema for Postgres 
    - read csv data (only first 100 rows for now)
    - check that all columns in a DataFrame have valid data types, transform if needed
    - use pandas io module to generate the schema from table:
         - `print(pd.io.sql.get_schema(data, name='yellow_taxi_data'))`

4. Create a connection (engine) object to Postgres using SQLAlchemy
    - `engine = create_engine(f'postgresql://{user_name}:{pwd}@localhost:5432/ny_taxi')`
    - generate the Postgres-specific schema: `print(pd.io.sql.get_schema(data, name='yellow_taxi_data', con=engine))`
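
The schema step can be sketched as follows. The sample rows and column names are made up to keep the example self-contained; with the real file you would read the CSV path instead of `io.StringIO`. No engine is passed here, so pandas falls back to a generic, SQLite-flavoured statement rather than a Postgres one.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for the real CSV (illustrative columns only).
SAMPLE_CSV = """tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance
2023-01-01 00:32:10,2023-01-01 00:40:36,1,0.97
2023-01-01 00:55:08,2023-01-01 01:01:27,1,1.10
"""

# Read only the first rows; that is enough to inspect the column types.
data = pd.read_csv(io.StringIO(SAMPLE_CSV), nrows=100)

# Timestamps are read as plain strings, so convert them before generating the schema.
for col in ("tpep_pickup_datetime", "tpep_dropoff_datetime"):
    data[col] = pd.to_datetime(data[col])

# Without con=..., the DDL is generic; pass con=engine to get Postgres types.
schema = pd.io.sql.get_schema(data, name="yellow_taxi_data")
print(schema)
```

Comparing the output before and after the `pd.to_datetime` conversion is a quick way to see why the dtype check matters: unconverted timestamp columns end up as text columns.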


### Dataset Batching
Since the dataset is large, we will ingest the data in batches to avoid problems with the database.

- Create an iterator object to read the CSV in batches:
    - `df_iterator = pd.read_csv('yellow_tripdata_2023-01.csv', iterator=True, chunksize=100000)`

- Get the first batch and generate the schema for it:
    - `df_current_batch = next(df_iterator)`

- Create the table in Postgres using `DataFrame.to_sql()`:
    - `df_current_batch.head(n=0).to_sql(name=pg_table_name, con=engine, if_exists='replace')` -> `if_exists='replace'` drops any existing table with that name, so this creates an empty table with the right columns

- Append the first chunk of data to Postgres
    - `df_current_batch.to_sql(name=pg_table_name, con=engine, if_exists='append')`

- Validate that the first batch has been written to Postgres using pgcli
    - `SELECT COUNT(*) FROM yellow_taxi_data`
    
- Create script to automate the process completely
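
The batching steps above can be sketched as one loop. To keep the example runnable without a database server, it uses an in-memory SQLite connection and synthetic data as a stand-in for Postgres; with Postgres you would pass the SQLAlchemy `engine` as `con`. The function name `ingest_in_batches` and the `index=False` argument are choices made for this sketch, not part of the lesson code.

```python
import io
import sqlite3

import pandas as pd


def ingest_in_batches(csv_source, con, table_name, chunksize=100_000):
    """Load a CSV into a SQL table chunk by chunk; return the number of rows written."""
    total_rows = 0
    chunks = pd.read_csv(csv_source, iterator=True, chunksize=chunksize)
    for i, chunk in enumerate(chunks):
        if i == 0:
            # Create (or replace) an empty table using only the column header.
            chunk.head(n=0).to_sql(name=table_name, con=con, if_exists="replace", index=False)
        # index=False keeps the DataFrame index out of the table.
        chunk.to_sql(name=table_name, con=con, if_exists="append", index=False)
        total_rows += len(chunk)
    return total_rows


# Demo on synthetic data; SQLite here is only a stand-in for the Postgres engine.
csv_text = "trip_id,fare\n" + "\n".join(f"{i},{i * 1.5}" for i in range(10))
con = sqlite3.connect(":memory:")
written = ingest_in_batches(io.StringIO(csv_text), con, "yellow_taxi_data", chunksize=4)
count = con.execute("SELECT COUNT(*) FROM yellow_taxi_data").fetchone()[0]
print(written, count)
```

The final `SELECT COUNT(*)` plays the same role as the pgcli check: the number of rows in the table should match the number of rows read from the file.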

### Managing Postgres Data using pgAdmin 
As an alternative to `pgcli`, pgAdmin can be used. Run the pgAdmin Docker image with the following command:
```
docker run -it \
    -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
    -e PGADMIN_DEFAULT_PASSWORD="root" \
    -p 8080:80 \
    dpage/pgadmin4
```

However, pgAdmin started this way cannot reach the Postgres container. pgAdmin and Postgres are **running in different containers and cannot access each other**: inside the pgAdmin container, `localhost` refers to the pgAdmin container itself. **Solution: put both containers in the same Docker network.**

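1. Create a Docker network (the commands below assume the name `pg_network`)
    - `docker network create <name>`

2. Remove the previous Postgres container and recreate it inside the network. Run the command from the directory that contains `pg_db_volume` (`cd` there first)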
```
docker run -it \
    -e POSTGRES_USER='root' \
    -e POSTGRES_PASSWORD='root' \
    -e POSTGRES_DB="ny_taxi" \
    -v $(pwd)/pg_db_volume:/var/lib/postgresql/data \
    -p 5432:5432 \
    --network pg_network \
    --name postgres_db \
    postgres:13
```
- `docker run -it -e POSTGRES_USER='root' -e POSTGRES_PASSWORD='root' -e POSTGRES_DB="ny_taxi" -v $(pwd)/pg_db_volume:/var/lib/postgresql/data -p 5432:5432 --network pg_network --name postgres_db postgres:13`

3. Run pgAdmin in the same network
```
docker run -d \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -p 8080:80 \
  --network=pg_network \
  --name pgadmin \
  dpage/pgadmin4
```
- `docker run -d -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" -e PGADMIN_DEFAULT_PASSWORD="root" -p 8080:80 --network=pg_network --name pgadmin dpage/pgadmin4`

4. Once pgAdmin is open in the browser (`http://localhost:8080` with the port mapping above), create a new server and check that the ingested data exists
    - host name/address: `<container_name>` (e.g. `postgres_db`)
    - port 5432
    - ...

### Putting Everything Into a Script/Pipeline
To process the downloaded data in a more robust way, we migrate the code from the notebook into a Python file. A Jupyter notebook can be converted with the following command (use `script` as `FORMAT` to get a `.py` file):
- `jupyter nbconvert --to FORMAT notebook.ipynb`

1. Convert the notebook into a Python script using `jupyter nbconvert`, then clean it up.

2. Use the `argparse` library to configure the script
    2.1 Define which parameters must be configurable (e.g. pg_user_name, pg_password, etc.)
    2.2 Create the functions needed for the pipeline and call them from `if __name__ == '__main__':`
    2.3 Drop the table in Postgres: `DROP TABLE <table_name>`
    2.4 Run the pipeline with the following command:
    ```
    URL="https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"

    python run_injesting_pipeline.py \
        --user=root \
        --password=root \
        --host=localhost \
        --port=5432 \
        --db_name=ny_taxi \
        --pg_table_name=yellow_taxi_trips \
        --url=${URL} \
        --n_rows_read=1500000 \
        --chunksize=100000
    ```
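
A possible `argparse` setup for the flags used in the command above. The flag names come from that command; the defaults, help texts, and which flags are required are assumptions made for this sketch.

```python
import argparse


def build_parser():
    """Argument parser mirroring the flags of the pipeline command."""
    parser = argparse.ArgumentParser(description="Ingest NY Taxi data into Postgres")
    parser.add_argument("--user", required=True, help="Postgres user name")
    parser.add_argument("--password", required=True, help="Postgres password")
    parser.add_argument("--host", default="localhost", help="Postgres host")
    # type=int makes argparse convert and validate numeric values;
    # without it every value arrives as a string.
    parser.add_argument("--port", type=int, default=5432, help="Postgres port")
    parser.add_argument("--db_name", required=True, help="database name")
    parser.add_argument("--pg_table_name", required=True, help="target table name")
    parser.add_argument("--url", required=True, help="URL of the parquet file")
    parser.add_argument("--n_rows_read", type=int, default=None, help="maximum rows to read")
    parser.add_argument("--chunksize", type=int, default=100_000, help="rows per batch")
    return parser


# Demo: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args([
    "--user=root",
    "--password=root",
    "--db_name=ny_taxi",
    "--pg_table_name=yellow_taxi_trips",
    "--url=https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet",
    "--n_rows_read=1500000",
])
print(args.port, args.n_rows_read, args.chunksize)
```

In the real script, call `build_parser().parse_args()` without arguments inside `if __name__ == '__main__':` so the values are taken from the command line.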

### Pipeline/Script Dockerization
Now let's dockerize our pipeline:
1. Create a Dockerfile for your script
2. Build the image
    - `docker build -t pg_taxi_injesting:0.1 .`

We mount a volume into the container to preserve the downloaded data, so it does not have to be downloaded again after every restart.

3. Check that the image was built correctly by running a container from it:
```
URL="https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"

docker run -it \
    -v $(pwd)/data:/app/data \
    --name pg_taxi_injesting \
    --network=pg_network \
    pg_taxi_injesting:0.1 \
        --user=root \
        --password=root \
        --host=postgres_db \
        --port=5432 \
        --db_name=ny_taxi \
        --pg_table_name=yellow_taxi_trips \
        --url=${URL} \
        --n_rows_read=1500000 \
        --chunksize=100000 
```

### Notes
- passing passwords directly in bash commands is insecure -> the shell history is saved
- a safer way -> pass them through environment variables
- to fix several lines of code -> `Shift + -> or <-` and then `Ctrl + Alt + down`
- check data types in argparse, especially int values!
- it is better to mount volumes at run time than to define them in the Dockerfile
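
As a sketch of the environment-variable approach from the notes above: the script can read the credentials from the environment and build the connection URL itself. The variable names (`PG_USER`, `PG_PASSWORD`, `PG_HOST`, `PG_PORT`, `PG_DB`) and the function name are hypothetical; only the `postgresql://user:password@host:port/db` URL shape comes from the earlier steps.

```python
import os
from urllib.parse import quote_plus


def pg_url_from_env(env=None):
    """Build a postgresql:// URL from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    user = env.get("PG_USER", "root")
    password = env["PG_PASSWORD"]  # no default: fail loudly if it is missing
    host = env.get("PG_HOST", "localhost")
    port = int(env.get("PG_PORT", "5432"))  # env values are strings, so cast explicitly
    db_name = env.get("PG_DB", "ny_taxi")
    # Percent-encode credentials so characters like "@" or ":" do not break the URL.
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{db_name}"


# Demo with an explicit mapping instead of the real environment.
url = pg_url_from_env({"PG_PASSWORD": "p@ss:word"})
print(url)
```

The resulting string can be passed to `create_engine(...)`. With Docker, the same variables can be supplied with `-e` flags, as in the `docker run` commands earlier.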