# Cleaning and Transformation Steps

## Actor Table
- Import the nested column 'actor.id', renamed to 'actor_id'.
- Drop duplicates on import based on the 'actor_id' column.
- Drop rows where 'actor_id' is null.
- Add a column 'is_bot' that flags whether the actor is a bot by checking whether the login contains the string '[bot]'.
- Enforce schema to correct data types and nullability.
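
The row-level logic of the Actor steps can be sketched in plain Python. This is only an illustration of the rules, not the pipeline code: the function name `clean_actors`, the `login` key, and the sample records are hypothetical, and treating a missing login as "not a bot" is an assumption.

```python
def clean_actors(records):
    """Deduplicate on actor_id, drop null ids, and flag bots.

    Keeps the first record seen for each actor_id (an assumption; a
    distributed engine may keep an arbitrary duplicate instead).
    """
    seen = set()
    cleaned = []
    for rec in records:
        actor_id = rec.get("actor_id")
        # Skip rows with a null id and rows whose id was already kept.
        if actor_id is None or actor_id in seen:
            continue
        seen.add(actor_id)
        # Assumption: a missing login counts as "not a bot".
        login = rec.get("login") or ""
        cleaned.append({**rec, "is_bot": "[bot]" in login})
    return cleaned


# Hypothetical sample records for illustration only.
sample = [
    {"actor_id": 1, "login": "octocat"},
    {"actor_id": 1, "login": "octocat"},
    {"actor_id": None, "login": "ghost"},
    {"actor_id": 2, "login": "dependabot[bot]"},
]
actors = clean_actors(sample)
```

With this sample, the duplicate and the null-id record are dropped, and only the `dependabot[bot]` login is flagged as a bot.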

## Org Table
- Drop duplicates on import based on the 'org.id' column.
- Select the nested fields from all non-null orgs.
- Drop null orgs based on the 'org_id' column.
- Enforce schema to correct data types and nullability.

## Repo Table
- Drop duplicates on import based on the 'repo.id' column.
- Enforce schema to correct data types and nullability.

## Event Table
- Enforce schema to correct data types and nullability.

## PushEvent Table
- On import:
  - Drop any records where the 'commit_id' is null.
  - Add a column 'is_main' that flags whether the event affects the main branch, based on the 'branch_ref' column.
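
A minimal sketch of the 'is_main' check, in plain Python. The notes above do not say which ref values count as the main branch, so the set `MAIN_REFS` below is an assumption (full Git refs for `main` and `master`), and the function name is hypothetical.

```python
# Assumption: the main branch appears as one of these full Git refs.
MAIN_REFS = frozenset({"refs/heads/main", "refs/heads/master"})


def is_main(branch_ref):
    """Return True when branch_ref points at the assumed main branch."""
    # Exact match avoids false positives such as "refs/heads/feature/main".
    return branch_ref in MAIN_REFS
```

An exact match is used rather than a substring test so that feature branches whose names merely contain "main" are not flagged; a null ref yields `False`.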

## Commit Table
- On import, drop any records where the 'commit_id' is null.
- Use the raw text in the 'commit_message' column to detect the language of the commit message with the sparknlp library, then store the result in a new column named 'language'.
- Enforce schema to correct data types and nullability.

# Partitioning Strategy

To determine the appropriate number of partitions, we followed these steps:
- Import a sample size of two days of data.
- Transform these two days into the appropriate tables and export those tables as parquet files.
- Use the size of these files to estimate the size of each table once it contains the full month of data:
  - Divide the size by two to get the average size of one day of data.
  - Multiply by 31 to estimate the size of a month of data.
  - Divide the monthly size estimate by 128 MB to determine the approximate number of partitions needed.
  - This estimate was repeated for each individual table in our ERD.
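
The arithmetic above can be sketched as a small helper. The function name and parameters are hypothetical; rounding up to a whole partition and reading "128 MB" as 128 MiB are assumptions, since the steps only call for an approximate count.

```python
import math

# Assumption: "128 MB" is interpreted as 128 MiB.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def estimate_partitions(sample_bytes, sample_days=2, month_days=31,
                        target_bytes=TARGET_PARTITION_BYTES):
    """Estimate the partition count for one table from a sample export."""
    daily_bytes = sample_bytes / sample_days    # average size of one day
    monthly_bytes = daily_bytes * month_days    # projected size of the month
    # Round up (an assumption) and never return fewer than one partition.
    return max(1, math.ceil(monthly_bytes / target_bytes))


# Hypothetical example: a two-day sample that exports to 1 GiB of parquet.
partitions = estimate_partitions(1024 ** 3)  # 124
```

In the hypothetical example, 1 GiB over two days is 512 MiB per day, or 15,872 MiB for 31 days, which divides into 124 partitions of 128 MiB. The same call would be made once per table with that table's sample size.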