# Final Project - Part 1: Data Cleaning with R

## Overview
This project involves cleaning messy car dealership data using R, then loading it into a PostgreSQL database for analysis.

### Part 1 Goals (This Notebook):
- Point `data_path` at the data folder
- Clean 7 messy CSV files using R tidyverse tools
- Handle missing values, inconsistent formatting, and data type issues
- Export clean data for database import

### Files to Clean:
1. `messy_dealership_sales.csv` → `clean_dealership_sales.csv`
2. `messy_customer_data.csv` → `clean_customer_data.csv`
3. `messy_vehicle_inventory.csv` → `clean_vehicle_inventory.csv`
4. `messy_salesperson_info.csv` → `clean_salesperson_info.csv`
5. `messy_service_records.csv` → `clean_service_records.csv`
6. `messy_financing_details.csv` → `clean_financing_details.csv`
7. `messy_warranty_info.csv` → `clean_warranty_info.csv`

### Part 2 (Next Notebook):
After completing this cleaning process, you will use the cleaned CSV files in the PostgreSQL notebook to:
- Create a normalized database schema with 10 tables
- Import data using PostgreSQL COPY and INSERT commands
- Perform complex SQL queries and analysis

In [None]:
# Load required libraries (suppress startup messages)
suppressPackageStartupMessages({
  # YOUR CODE HERE
})

# Suppress all warnings for this session (e.g., parse failures raised inside mutate())
options(warn = -1)

data_path <- "/workspaces/Fall2025-MS3083-Base_Template/data/"
cat("✓ Libraries loaded successfully!\n")

## Part 2: Load Core Business Data

**Instructions:** Load the four main CSV files into R dataframes:
- Load `messy_dealership_sales.csv` into `sales_raw`
- Load `messy_customer_data.csv` into `customers_raw`
- Load `messy_vehicle_inventory.csv` into `vehicles_raw`
- Load `messy_salesperson_info.csv` into `salespeople_raw`
- Use `read_csv()` and wrap each call in `suppressMessages()` to avoid column specification output
- Print a confirmation message when complete
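
The loading pattern described above can be sketched on a small temporary file so that it runs anywhere. The file name `demo_sales.csv` and its columns are hypothetical stand-ins, not the project data; in the notebook the path would be built from `data_path` instead.

```r
library(readr)

# Hypothetical stand-in file; the real files live under data_path
demo_path <- file.path(tempdir(), "demo_sales.csv")
writeLines(
  c("sale_id,vehicle_make,sale_price",
    "1,Toyota,21500",
    "2,HONDA,18250"),
  demo_path
)

# suppressMessages() hides the column specification that read_csv() prints
demo_raw <- suppressMessages(read_csv(demo_path))

dim(demo_raw)
```

Because `data_path` ends in a slash, `paste0(data_path, "messy_dealership_sales.csv")` yields a usable path. In readr 2.0 and later, `show_col_types = FALSE` is another way to silence the column specification.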

In [None]:
sales_raw <- # YOUR CODE HERE
customers_raw <- # YOUR CODE HERE
vehicles_raw <- # YOUR CODE HERE
salespeople_raw <- # YOUR CODE HERE

cat("✓ Core data loaded (sales, customers, vehicles, salespeople)\n")

## Part 3: Load Supporting Data

**Instructions:** Load the three supporting CSV files:
- Load `messy_service_records.csv` into `service_raw`
- Load `messy_financing_details.csv` into `financing_raw`
- Load `messy_warranty_info.csv` into `warranty_raw`
- Use `suppressMessages()` with `read_csv()` for clean output
- Print a confirmation message when complete

In [None]:
service_raw <- # YOUR CODE HERE
financing_raw <- # YOUR CODE HERE
warranty_raw <- # YOUR CODE HERE

cat("✓ Supporting data loaded (service, financing, warranty)\n")

## Part 4: Inspect Data Quality Issues

**Instructions:** Examine the `sales_raw` dataset for data quality problems:
- Use `colSums(is.na())` to count missing values in each column
- Display unique values in the `vehicle_make` column to see inconsistent capitalization
- Use `head()` to display the first 10 rows of the raw data
- Add appropriate labels to show what you're displaying
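
As a rough illustration of these checks, the sketch below runs them on a toy data frame. The values are made up and only stand in for `sales_raw`.

```r
# Toy data frame (hypothetical values) standing in for sales_raw
demo <- data.frame(
  vehicle_make = c("Toyota", "TOYOTA", "toyota ", NA),
  sale_price   = c(21500, NA, 18250, 30100),
  stringsAsFactors = FALSE
)

# Count missing values in each column
na_counts <- colSums(is.na(demo))
cat("Missing values:\n")
print(na_counts)

# List the distinct spellings of the same make
makes <- unique(demo$vehicle_make)
cat("\nUnique makes:\n")
print(makes)

# Show up to the first 10 rows
head(demo, 10)
```

Note that `unique()` treats "Toyota", "TOYOTA", and "toyota " (trailing space) as three different values, which is exactly the inconsistency the cleaning step has to remove.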

In [None]:
cat("Sales - Missing values:\n")
# YOUR CODE HERE

cat("\nUnique makes (mixed case):\n")
# YOUR CODE HERE

cat("\n✓ Data quality issues identified\n")
# YOUR CODE HERE

## Part 5: Clean Sales Data

**Instructions:** Clean the sales dataset by creating `sales_clean` from `sales_raw`:
- Convert `vehicle_make` and `vehicle_model` to lowercase and trim whitespace
- Parse `sale_date` handling both "MM/DD/YYYY" and "YYYY-MM-DD" formats using `case_when()`, `str_detect()`, `mdy()`, and `ymd()`
- Standardize `payment_method` to title case
- Trim whitespace from `salesperson` names
- Convert `trade_in_value` to numeric, replacing "NULL", empty strings, and NA with 0
- Clean `customer_name` by trimming/squishing whitespace and replacing missing values with "Unknown Customer"
- Display the first 10 rows of cleaned data using `head()`
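
A minimal sketch of the text and date steps on toy data follows. The rows are hypothetical, and the regular expressions used to tell the two date formats apart are one possible choice, not a requirement of the assignment.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Hypothetical rows standing in for sales_raw
demo_sales <- data.frame(
  vehicle_make   = c("  TOYOTA", "honda ", "Ford"),
  sale_date      = c("03/15/2024", "2024-04-02", "12/01/2023"),
  payment_method = c("CASH", "financing", "Lease"),
  customer_name  = c("  Ana   Ruiz ", NA, "Li  Wei"),
  stringsAsFactors = FALSE
)

demo_clean <- demo_sales %>%
  mutate(
    vehicle_make = str_to_lower(str_trim(vehicle_make)),
    # Pick the parser by pattern; quiet = TRUE silences parse-failure
    # warnings from the branch that does not apply to a given row
    sale_date = case_when(
      str_detect(sale_date, "^\\d{1,2}/\\d{1,2}/\\d{4}$") ~ mdy(sale_date, quiet = TRUE),
      str_detect(sale_date, "^\\d{4}-\\d{2}-\\d{2}$")     ~ ymd(sale_date, quiet = TRUE)
    ),
    payment_method = str_to_title(str_trim(payment_method)),
    # Collapse inner whitespace, then fill blanks and NA with a label
    customer_name = coalesce(na_if(str_squish(customer_name), ""), "Unknown Customer")
  )

head(demo_clean, 10)
```

Rows whose date matches neither pattern fall through `case_when()` and become `NA`, which makes unexpected formats easy to spot afterwards.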

In [None]:
sales_clean <- sales_raw %>%
  mutate(
    # YOUR CODE HERE
  )

cat("✓ Sales data cleaned\n\n")
head(sales_clean, 10)

## Part 6: Clean Customer Data

**Instructions:** Clean the customer dataset by creating `customers_clean` from `customers_raw`:
- Replace missing `first_name` values with "Unknown" and trim whitespace
- Replace missing `last_name` values with "Customer" and trim whitespace
- Create a new `full_name` column by concatenating first and last names
- Generate placeholder emails for missing values using pattern: `customer[ID]@placeholder.com`
- Convert existing emails to lowercase
- Clean `phone` numbers by removing all non-numeric characters, then reformatting as "(XXX) XXX-XXXX"
- Convert `state` to uppercase
- Convert `zip_code` to character type
- Parse `registration_date` as dates using `ymd()`
- Display the first 10 rows of cleaned data
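
The name, email, and phone steps might look like the sketch below on toy data. The ID column name `customer_id` is an assumption (the instructions only say `[ID]`), and the sample rows are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical rows; customer_id is an assumed column name
demo_customers <- data.frame(
  customer_id = c(101, 102, 103),
  first_name  = c(" Ana ", NA, "Li"),
  last_name   = c("Ruiz", NA, " Wei"),
  email       = c("Ana.Ruiz@Example.COM", NA, NA),
  phone       = c("555.123.4567", "(555) 987-6543", "555 222 3333"),
  stringsAsFactors = FALSE
)

demo_clean <- demo_customers %>%
  mutate(
    first_name = str_trim(coalesce(first_name, "Unknown")),
    last_name  = str_trim(coalesce(last_name, "Customer")),
    full_name  = paste(first_name, last_name),
    # Placeholder for missing emails, lowercase for existing ones
    email = if_else(
      is.na(email),
      paste0("customer", customer_id, "@placeholder.com"),
      str_to_lower(email)
    ),
    # Keep digits only, then regroup exactly 10 digits as (XXX) XXX-XXXX
    phone = str_remove_all(phone, "[^0-9]"),
    phone = str_replace(phone, "^(\\d{3})(\\d{3})(\\d{4})$", "(\\1) \\2-\\3")
  )

head(demo_clean, 10)
```

In this sketch a phone number that does not contain exactly 10 digits is left as bare digits rather than reformatted, so such rows remain visible for manual review.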

In [None]:
customers_clean <- customers_raw %>%
  mutate(
    # YOUR CODE HERE
  )

cat("✓ Customer data cleaned\n\n")
head(customers_clean, 10)

## Part 7: Clean Vehicle Inventory

**Instructions:** Clean the vehicle inventory by creating `vehicles_clean` from `vehicles_raw`:
- Convert `vin` to uppercase and trim whitespace
- Convert `make` and `model` to lowercase and trim whitespace
- Standardize `condition` and `color` to title case
- Convert `sold` to logical (TRUE/FALSE): TRUE when the uppercase, trimmed value is one of "YES", "Y", "TRUE", "1", and FALSE otherwise
- Convert `year` and `mileage` to integers
- Convert `purchase_price` to numeric
- Parse `lot_date` as dates using `ymd()`
- Display the first 10 rows of cleaned data
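
The `sold` flag and integer conversions can be sketched as follows. The sample rows are hypothetical; `as.character()` is added defensively in case the column was read as a number or logical.

```r
library(dplyr)
library(stringr)

# Hypothetical rows standing in for vehicles_raw
demo_vehicles <- data.frame(
  make = c(" Toyota", "HONDA ", "ford", "Kia", "BMW"),
  sold = c(" yes", "No", "1", NA, "TRUE"),
  year = c("2019", "2021", "2018", "2022", "2020"),
  stringsAsFactors = FALSE
)

demo_clean <- demo_vehicles %>%
  mutate(
    make = str_to_lower(str_trim(make)),
    # %in% returns FALSE for NA, so a missing flag is treated as not sold
    sold = str_to_upper(str_trim(as.character(sold))) %in% c("YES", "Y", "TRUE", "1"),
    year = as.integer(year)
  )

head(demo_clean, 10)
```

Treating a missing `sold` value as FALSE is a side effect of `%in%`; if missing flags should stay missing, the comparison would need an explicit `NA` branch.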

In [None]:
vehicles_clean <- vehicles_raw %>%
  mutate(
    # YOUR CODE HERE
  )

cat("✓ Vehicle inventory cleaned\n\n")
head(vehicles_clean, 10)

## Part 8: Clean Salesperson Data

**Instructions:** Clean the salesperson dataset by creating `salespeople_clean` from `salespeople_raw`:
- Standardize `salesperson_name` to title case and trim whitespace
- Parse `hire_date` as dates using `ymd()`
- Convert `email` to lowercase and generate placeholder emails for missing values: `employee[ID]@dealership.com`
- Clean `phone` numbers by removing non-numeric characters and reformatting as "(XXX) XXX-XXXX"
- Set missing `commission_rate` values to 0.03 (3%)
- Standardize `department` to title case
- Convert `status` to lowercase
- Display the first 10 rows of cleaned data
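
One way to fill defaults here is `coalesce()`, which returns the first non-missing value at each position. The sketch below uses hypothetical rows, and `employee_id` is an assumed name for the ID column.

```r
library(dplyr)
library(stringr)

# Hypothetical rows; employee_id is an assumed column name
demo_staff <- data.frame(
  employee_id      = c(1, 2, 3),
  salesperson_name = c("  jordan LEE", "SAM patel ", "Alex Kim"),
  email            = c("J.Lee@Dealership.com", NA, NA),
  commission_rate  = c(0.05, NA, 0.04),
  status           = c("ACTIVE", "Active ", "inactive"),
  stringsAsFactors = FALSE
)

demo_clean <- demo_staff %>%
  mutate(
    salesperson_name = str_to_title(str_trim(salesperson_name)),
    # Existing emails are lowercased; NA falls through to the placeholder
    email = coalesce(
      str_to_lower(email),
      paste0("employee", employee_id, "@dealership.com")
    ),
    # Default commission of 3% where the rate is missing
    commission_rate = coalesce(as.numeric(commission_rate), 0.03),
    status = str_to_lower(str_trim(status))
  )

head(demo_clean, 10)
```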

In [None]:
salespeople_clean <- salespeople_raw %>%
  mutate(
    # YOUR CODE HERE
  )

cat("✓ Salesperson data cleaned\n\n")
head(salespeople_clean, 10)

## Part 9: Clean Service Records

**Instructions:** Clean the service records by creating `service_clean` from `service_raw`:
- Convert `vin` to uppercase and trim whitespace
- Parse `service_date` handling both "MM/DD/YYYY" and "YYYY-MM-DD" formats using `if_else()`, `str_detect()`, `mdy()`, and `ymd()`
- Standardize `service_type` to title case and trim whitespace
- Standardize `mechanic_name` to title case and trim whitespace
- Convert `labor_cost` to numeric, replacing "NULL", empty strings, and NA with 0
- Convert `parts_cost` to numeric, replacing "NULL", empty strings, and NA with 0
- Clean `notes` by trimming and squishing whitespace, setting empty values to NA
- Display the first 10 rows of cleaned data
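
Since the same "NULL", empty string, and NA rule applies to both cost columns, it can be wrapped in a small helper. The function name `to_number_or_zero` and the sample rows are hypothetical; this is a sketch of one approach, not the required solution.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: treat "NULL", "" and NA as 0, otherwise parse as number
to_number_or_zero <- function(x) {
  x <- str_trim(as.character(x))
  x <- na_if(na_if(x, "NULL"), "")
  coalesce(as.numeric(x), 0)
}

# Hypothetical rows standing in for service_raw
demo_service <- data.frame(
  labor_cost = c("120.50", "NULL", ""),
  parts_cost = c(NA, "45", "NULL"),
  notes      = c("  replaced   brake pads ", "   ", NA),
  stringsAsFactors = FALSE
)

demo_clean <- demo_service %>%
  mutate(
    labor_cost = to_number_or_zero(labor_cost),
    parts_cost = to_number_or_zero(parts_cost),
    # Squish whitespace; notes that end up empty become NA
    notes = na_if(str_squish(notes), "")
  )

head(demo_clean, 10)
```

Any other non-numeric text (for example a value with a currency symbol) would also end up as 0 in this sketch, so it is worth checking the raw values before relying on it.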

In [None]:
service_clean <- service_raw %>%
  mutate(
    # YOUR CODE HERE
  )

cat("✓ Service records cleaned\n\n")
head(service_clean, 10)

## Part 10: Clean Financing Details

**Instructions:** Clean the financing data by creating `financing_clean` from `financing_raw`:
- Standardize `lender_name` to title case and trim whitespace
- Convert `loan_amount` and `interest_rate` to numeric, and `term_months` to integer
- Calculate missing `monthly_payment` values using the loan payment formula:
  - Formula: `loan_amount * (rate/12) * (1 + rate/12)^months / ((1 + rate/12)^months - 1)`
  - Here `rate` is the annual interest rate as a decimal (`interest_rate / 100`) and `months` is `term_months`
  - Only calculate when `loan_amount`, `interest_rate`, and `term_months` are all present; keep existing `monthly_payment` values unchanged
- Parse `approval_date` handling both "MM/DD/YYYY" and "YYYY-MM-DD" formats
- Set missing `down_payment` values to 0
- Display the first 10 rows of cleaned data
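
The payment formula above can be checked with a short sketch. The helper name `loan_payment` and the sample loans are hypothetical; the first row (20,000 at 6% over 60 months) works out to roughly 386.66 per month.

```r
library(dplyr)

# Hypothetical helper implementing the formula from the instructions
loan_payment <- function(loan_amount, interest_rate, term_months) {
  r <- (interest_rate / 100) / 12          # monthly rate as a decimal
  growth <- (1 + r)^term_months
  loan_amount * r * growth / (growth - 1)
}

# Hypothetical rows standing in for financing_raw
demo_financing <- data.frame(
  loan_amount     = c(20000, 15000, NA),
  interest_rate   = c(6, 4.5, 5),
  term_months     = c(60L, 48L, 36L),
  monthly_payment = c(NA, 450, NA)
)

demo_clean <- demo_financing %>%
  mutate(
    # Fill only rows where the payment is missing and all inputs are present
    monthly_payment = if_else(
      is.na(monthly_payment) & !is.na(loan_amount) &
        !is.na(interest_rate) & !is.na(term_months),
      loan_payment(loan_amount, interest_rate, term_months),
      monthly_payment
    )
  )

round(demo_clean$monthly_payment, 2)
```

The formula is undefined for an interest rate of exactly 0 (it evaluates to `NaN` in R), so zero-interest loans, if the data contains any, would need separate handling such as `loan_amount / term_months`.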

In [None]:
financing_clean <- financing_raw %>%
  mutate(
    # YOUR CODE HERE
  )

cat("✓ Financing data cleaned\n\n")
head(financing_clean, 10)

## Part 11: Clean Warranty Information

**Instructions:** Clean the warranty data by creating `warranty_clean` from `warranty_raw`:
- Standardize `warranty_type` and `provider` to title case and trim whitespace
- Parse `start_date` and `end_date` handling both "MM/DD/YYYY" and "YYYY-MM-DD" formats using `if_else()`, `str_detect()`, `mdy()`, and `ymd()`
- Convert `coverage_amount` to numeric
- Set missing `deductible` values to 0
- Convert `status` to lowercase and trim whitespace
- Display the first 10 rows of cleaned data
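
Because two columns need the same mixed-format parsing, one option is a small helper applied with `across()`. The helper name `parse_mixed_date` and the sample rows are hypothetical, and detecting the format by the presence of a slash is a simplifying assumption.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Hypothetical helper: values containing "/" are read as MM/DD/YYYY,
# everything else as YYYY-MM-DD
parse_mixed_date <- function(x) {
  x <- str_trim(as.character(x))
  # if_else() evaluates both parsers for every row; quiet = TRUE hides
  # the parse-failure warnings from the branch that does not apply
  if_else(str_detect(x, "/"), mdy(x, quiet = TRUE), ymd(x, quiet = TRUE))
}

# Hypothetical rows standing in for warranty_raw
demo_warranty <- data.frame(
  start_date = c("01/15/2024", "2023-06-01"),
  end_date   = c("2027-01-15", "06/01/2026"),
  deductible = c(NA, 100),
  stringsAsFactors = FALSE
)

demo_clean <- demo_warranty %>%
  mutate(
    across(c(start_date, end_date), parse_mixed_date),
    deductible = coalesce(deductible, 0)
  )

head(demo_clean, 10)
```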

In [None]:
warranty_clean <- warranty_raw %>%
  mutate(
    # YOUR CODE HERE
  )

cat("✓ Warranty data cleaned\n\n")
head(warranty_clean, 10)

## Part 12: Data Quality Summary

**Instructions:** Create a summary report of all cleaned datasets:
- Calculate the total number of records across all 7 cleaned datasets
- Display the record count for each individual dataset:
  - sales_clean, customers_clean, vehicles_clean, salespeople_clean
  - service_clean, financing_clean, warranty_clean
- Format the output with separators (equal signs) for readability
- Use `nrow()` to count records in each dataframe
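
The counting logic can be sketched with a named list of toy data frames. The three tables below are hypothetical placeholders for the seven cleaned datasets.

```r
# Hypothetical stand-ins for the cleaned data frames
demo_tables <- list(
  sales     = data.frame(id = 1:3),
  customers = data.frame(id = 1:5),
  vehicles  = data.frame(id = 1:2)
)

# Row count per table, then the grand total
counts <- vapply(demo_tables, nrow, integer(1))
total  <- sum(counts)

# strrep() builds an unbroken separator line
cat(strrep("=", 40), "\n", sep = "")
cat("Total records:", total, "\n")
for (nm in names(counts)) {
  cat("  -", nm, ":", counts[[nm]], "\n")
}
cat(strrep("=", 40), "\n", sep = "")
```

Adding the seven `nrow()` calls together directly gives the same total; the list form just avoids repeating the names.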

In [None]:
# strrep() builds one unbroken line; cat("=", rep("=", 78)) would print "= = = ..."
cat(strrep("=", 79), "\n", sep = "")
cat("DATA CLEANING SUMMARY\n")
cat(strrep("=", 79), "\n\n", sep = "")

total_records <- # YOUR CODE HERE

cat("Total Records Cleaned:", total_records, "\n")
cat("  - Sales:", nrow(sales_clean), "\n")
cat("  - Customers:", nrow(customers_clean), "\n")
cat("  - Vehicles:", nrow(vehicles_clean), "\n")
cat("  - Salespeople:", nrow(salespeople_clean), "\n")
cat("  - Service Records:", nrow(service_clean), "\n")
cat("  - Financing:", nrow(financing_clean), "\n")
cat("  - Warranties:", nrow(warranty_clean), "\n\n")

cat(strrep("=", 79), "\n", sep = "")

## Part 13: Export Cleaned Data

**Instructions:** Export all cleaned datasets to CSV files:
- Export each of the 7 cleaned dataframes to the data directory with "clean_" prefix:
  - `sales_clean` → `clean_dealership_sales.csv`
  - `customers_clean` → `clean_customer_data.csv`
  - `vehicles_clean` → `clean_vehicle_inventory.csv`
  - `salespeople_clean` → `clean_salesperson_info.csv`
  - `service_clean` → `clean_service_records.csv`
  - `financing_clean` → `clean_financing_details.csv`
  - `warranty_clean` → `clean_warranty_info.csv`
- Use `write_csv()` wrapped in `suppressMessages()` for clean output
- Print a confirmation message when all files are exported
- These cleaned files will be used in Part 2: PostgreSQL Database Operations
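
A sketch of the export step follows, writing to a temporary directory so it does not touch the project data. The two toy tables are hypothetical; in the notebook each call would target `data_path` and one of the seven file names listed above.

```r
library(readr)

# Temporary output folder; the notebook would use data_path instead
out_dir <- tempdir()

# Hypothetical stand-ins, named after their output files
demo_tables <- list(
  clean_dealership_sales = data.frame(sale_id = 1:2, sale_price = c(21500, NA)),
  clean_customer_data    = data.frame(customer_id = 1:3, state = c("TX", "CA", NA))
)

# Write each table to <name>.csv
for (nm in names(demo_tables)) {
  write_csv(demo_tables[[nm]], file.path(out_dir, paste0(nm, ".csv")))
}

out_files <- file.path(out_dir, paste0(names(demo_tables), ".csv"))
written <- file.exists(out_files)
cat("Files written:", sum(written), "\n")
```

By default `write_csv()` writes missing values as the text `NA`. If the PostgreSQL import should read them as NULL, passing `na = ""` is one option to consider, since `COPY` in CSV mode treats an unquoted empty field as NULL by default.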

In [None]:
suppressMessages({
  # YOUR CODE HERE
})

cat("✓ Exported 7 cleaned CSV files\n")
cat("\nProceed to Part 2: PostgreSQL Database Operations\n")