## Ingest data from circuits.csv
##### TODO:
1. Read the data from the `circuits.csv` file into a DataFrame using DataFrameReader
2. Select all columns except `url`
3. Rename the columns as required
- circuitId to circuit_id
- circuitRef to circuit_ref
- lat to latitude
- lng to longitude
- alt to altitude
4. Add a new column (`ingestion_date`) to the existing DataFrame
5. Write the data to a file using DataFrameWriter

#### Step 1 - Read the data from the `circuits.csv` file into a DataFrame using DataFrameReader

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

In [0]:
circuits_schema = StructType(fields=[StructField("circuitId", IntegerType(), False),
                                     StructField("circuitRef", StringType(), True),
                                     StructField("name", StringType(), True),
                                     StructField("location", StringType(), True),
                                     StructField("country", StringType(), True),
                                     StructField("lat", DoubleType(), True),
                                     StructField("lng", DoubleType(), True),
                                     StructField("alt", IntegerType(), True),
                                     StructField("url", StringType(), True),
])
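
As an alternative sketch, `DataFrameReader.schema()` also accepts a DDL-formatted string instead of a `StructType`. The helper `to_ddl` and the `CIRCUITS_COLUMNS` list below are illustrative names, not part of the original notebook. Column names are backtick-quoted defensively, and this form does not carry the nullability flags set above.

```python
# Hypothetical helper (not in the original notebook): build a DDL schema
# string from (column name, Spark SQL type) pairs.
CIRCUITS_COLUMNS = [
    ("circuitId", "INT"),
    ("circuitRef", "STRING"),
    ("name", "STRING"),
    ("location", "STRING"),
    ("country", "STRING"),
    ("lat", "DOUBLE"),
    ("lng", "DOUBLE"),
    ("alt", "INT"),
    ("url", "STRING"),
]


def to_ddl(columns):
    """Join (name, type) pairs into a DDL string, backtick-quoting each name."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in columns)


circuits_ddl = to_ddl(CIRCUITS_COLUMNS)

# In the notebook (needs the Databricks-provided `spark` session):
# circuits_df = spark.read.option("header", True) \
#     .schema(circuits_ddl) \
#     .csv('/mnt/formula19533dl/raw/circuits.csv')
```

The explicit `StructType` remains the better choice when nullability or nested types need to be spelled out; the string form is mainly shorter to type.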

In [0]:
circuits_df = spark.read \
    .option("header", True) \
    .schema(circuits_schema) \
    .csv('/mnt/formula19533dl/raw/circuits.csv')

In [0]:
# display(circuits_df)

#### Step 2 - Select all columns except `url`

In [0]:
from pyspark.sql.functions import col

In [0]:
# df.select(col("col_1"), col("col_2"), col("col_3"))

circuits_selected_df = circuits_df.select(col("circuitId"), col("circuitRef"), col("name"), col("location"), col("country"), col("lat"), col("lng"), col("alt"))
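
Listing every column by hand works, but it has to be updated whenever the source file changes. A minimal sketch of an exclusion-based selection follows. The helper `columns_except` is a hypothetical name introduced here, and `source_columns` simply mirrors the schema defined in Step 1.

```python
def columns_except(all_columns, excluded):
    """Return all_columns in their original order, minus any names in excluded."""
    excluded = set(excluded)
    return [c for c in all_columns if c not in excluded]


# Stand-in for circuits_df.columns so the helper can be tried without Spark.
source_columns = ["circuitId", "circuitRef", "name", "location",
                  "country", "lat", "lng", "alt", "url"]
selected_columns = columns_except(source_columns, ["url"])

# In the notebook:
# circuits_selected_df = circuits_df.select(*columns_except(circuits_df.columns, ["url"]))
# For a single column, DataFrame.drop gives the same result:
# circuits_selected_df = circuits_df.drop("url")
```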

In [0]:
# display(circuits_selected_df)

#### Step 3 - Rename the columns as required
##### using `.withColumnRenamed(existing, new)`

In [0]:
circuits_renamed_df = circuits_selected_df.withColumnRenamed("circuitId", "circuit_id") \
    .withColumnRenamed("circuitRef", "circuit_ref") \
    .withColumnRenamed("lat", "latitude") \
    .withColumnRenamed("lng", "longitude") \
    .withColumnRenamed("alt", "altitude")
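
The chained calls above can also be driven by a single mapping. The sketch below is one possible variant, not the notebook's own approach: `RENAME_MAP` and `renamed_columns` are hypothetical names, and the result is applied with `DataFrame.toDF`, which takes the complete list of new column names in order.

```python
# Hypothetical mapping of old column names to new ones, taken from the TODO list.
RENAME_MAP = {
    "circuitId": "circuit_id",
    "circuitRef": "circuit_ref",
    "lat": "latitude",
    "lng": "longitude",
    "alt": "altitude",
}


def renamed_columns(columns, mapping):
    """Apply mapping to each column name; names without an entry pass through."""
    return [mapping.get(c, c) for c in columns]


# Stand-in for circuits_selected_df.columns so this runs without Spark.
selected = ["circuitId", "circuitRef", "name", "location",
            "country", "lat", "lng", "alt"]
new_names = renamed_columns(selected, RENAME_MAP)

# In the notebook:
# circuits_renamed_df = circuits_selected_df.toDF(
#     *renamed_columns(circuits_selected_df.columns, RENAME_MAP))
```

Keeping the renames in one dictionary makes them easy to review against the TODO list at the top of the notebook.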

In [0]:
# display(circuits_renamed_df)

#### Step 4 - Add a new column (`ingestion_date`) to the existing DataFrame
##### using `.withColumn(<colname>, <col>)`
- This creates a new column, or replaces an existing column that has the same name

In [0]:
from pyspark.sql.functions import current_timestamp

In [0]:
circuits_final_df = circuits_renamed_df.withColumn("ingestion_date", current_timestamp())

In [0]:
display(circuits_final_df)

To put a literal string in the new column instead of the value of `current_timestamp()`, wrap the string in the `lit()` function:

``` python
# withColumn is a DataFrame method, not a function in pyspark.sql.functions
from pyspark.sql.functions import lit

new_df = df.withColumn("env", lit("production"))
```

#### Step 5 - Write the data to a file using DataFrameWriter

In [0]:
circuits_final_df.write.mode('overwrite').parquet("/mnt/formula19533dl/processed/circuits")
# .mode('overwrite') replaces any existing output on every run; without it, the default mode raises an error if the path already exists
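
Besides `overwrite`, `DataFrameWriter.mode()` accepts `append`, `ignore`, and `error` / `errorifexists` (the default). As a rough sketch, a small wrapper can reject a mistyped mode before any write is attempted. `write_parquet` and `VALID_SAVE_MODES` are hypothetical names introduced for illustration.

```python
# Save modes accepted by DataFrameWriter.mode().
VALID_SAVE_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def write_parquet(df, path, mode="overwrite"):
    """Write df to path as Parquet, rejecting unknown save modes up front."""
    if mode not in VALID_SAVE_MODES:
        raise ValueError(f"Unsupported save mode: {mode!r}")
    df.write.mode(mode).parquet(path)


# In the notebook:
# write_parquet(circuits_final_df, "/mnt/formula19533dl/processed/circuits")
```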

In [0]:
# %fs
# ls /mnt/formula19533dl/processed/circuits

In [0]:
# df = spark.read.parquet("/mnt/formula19533dl/processed/circuits")
# display(df)