# Final Project - Part 1: Data Cleaning with R

## Overview
This project involves cleaning messy car dealership data using R, then loading it into a PostgreSQL database for analysis.

### Part 1 Goals (This Notebook):
- Point `data_path` at the data folder
- Clean 7 messy CSV files using R tidyverse tools
- Handle missing values, inconsistent formatting, and data type issues
- Export clean data for database import

### Files to Clean:
1. `messy_dealership_sales.csv` → `clean_dealership_sales.csv`
2. `messy_customer_data.csv` → `clean_customer_data.csv`
3. `messy_vehicle_inventory.csv` → `clean_vehicle_inventory.csv`
4. `messy_salesperson_info.csv` → `clean_salesperson_info.csv`
5. `messy_service_records.csv` → `clean_service_records.csv`
6. `messy_financing_details.csv` → `clean_financing_details.csv`
7. `messy_warranty_info.csv` → `clean_warranty_info.csv`

### Part 2 (Next Notebook):
After completing this cleaning process, you will use the cleaned CSV files in the PostgreSQL notebook to:
- Create a normalized database schema with 10 tables
- Import data using PostgreSQL COPY and INSERT commands
- Perform complex SQL queries and analysis

In [11]:
# Load required libraries (suppress startup messages)
suppressPackageStartupMessages({
  library(tidyverse)
  library(lubridate)
  library(stringr)
})

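# Suppress all warnings for the rest of the session; this also hides the
# parse-failure warnings that mdy(), ymd() and as.numeric() raise inside mutate()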
options(warn = -1)

data_path <- "/workspaces/Data-Management-2025/data/"
cat("✓ Libraries loaded successfully!\n")

✓ Libraries loaded successfully!
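
Because `options(warn = -1)` silences every warning for the rest of the session, a failed date or number conversion later in the notebook would pass unnoticed. As a hedged alternative (not what this notebook does), warnings can be silenced only around the call expected to produce them, using base R's `suppressWarnings()`, and the failures counted afterwards. The `raw_values` vector is made up for illustration.

```r
# Illustrative values only: one clean number, one sentinel string, one blank.
raw_values <- c("5000.0", "NULL", "")

# Silence the coercion warning for this single call, not the whole session.
parsed <- suppressWarnings(as.numeric(raw_values))

# Count the values that did not become numbers so the loss stays visible.
n_failed <- sum(is.na(parsed))
cat("Parsed:", sum(!is.na(parsed)), "| Not numeric:", n_failed, "\n")
```

Scoping the suppression this way keeps unexpected warnings from other cells visible.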


## Part 2: Load Core Business Data

**Instructions:** Load the four main CSV files into R dataframes:
- Load `messy_dealership_sales.csv` into `sales_raw`
- Load `messy_customer_data.csv` into `customers_raw`
- Load `messy_vehicle_inventory.csv` into `vehicles_raw`
- Load `messy_salesperson_info.csv` into `salespeople_raw`
- Use `read_csv()` and wrap each call in `suppressMessages()` to avoid column specification output
- Print a confirmation message when complete

In [12]:
sales_raw <- suppressMessages(read_csv(paste0(data_path, "messy_dealership_sales.csv")))
customers_raw <- suppressMessages(read_csv(paste0(data_path, "messy_customer_data.csv")))
vehicles_raw <- suppressMessages(read_csv(paste0(data_path, "messy_vehicle_inventory.csv")))
salespeople_raw <- suppressMessages(read_csv(paste0(data_path, "messy_salesperson_info.csv")))

cat("✓ Core data loaded (sales, customers, vehicles, salespeople)\n")

✓ Core data loaded (sales, customers, vehicles, salespeople)
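
A possible variation on the loading step, offered as a sketch, not the notebook's method: `read_csv()` accepts an `na` argument listing strings to treat as missing, and `show_col_types = FALSE` quiets the column specification without `suppressMessages()`. The inline CSV text below is invented for the example, and treating `"NULL"` as missing at read time is an assumption about how you want sentinels handled.

```r
suppressPackageStartupMessages(library(readr))

# Invented two-column CSV: a number, a "NULL" sentinel, and a blank cell.
csv_text <- "sale_id,trade_in_value\n1,5000.0\n2,NULL\n3,\n"

# I() tells readr the string is literal data, not a file path.
demo <- read_csv(
  I(csv_text),
  na = c("", "NA", "NULL"),   # strings to read as missing
  show_col_types = FALSE      # skip the column specification message
)

# With the sentinel declared up front, the column is parsed as numeric.
print(demo)
```

With this approach the column arrives as `<dbl>` with `NA` for both the sentinel and the blank, so later steps only need to handle `NA`.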


## Part 3: Load Supporting Data

**Instructions:** Load the three supporting CSV files:
- Load `messy_service_records.csv` into `service_raw`
- Load `messy_financing_details.csv` into `financing_raw`
- Load `messy_warranty_info.csv` into `warranty_raw`
- Use `suppressMessages()` with `read_csv()` for clean output
- Print a confirmation message when complete

In [13]:
service_raw <- suppressMessages(read_csv(paste0(data_path, "messy_service_records.csv")))
financing_raw <- suppressMessages(read_csv(paste0(data_path, "messy_financing_details.csv")))
warranty_raw <- suppressMessages(read_csv(paste0(data_path, "messy_warranty_info.csv")))

cat("✓ Supporting data loaded (service, financing, warranty)\n")

✓ Supporting data loaded (service, financing, warranty)


## Part 4: Inspect Data Quality Issues

**Instructions:** Examine the `sales_raw` dataset for data quality problems:
- Use `colSums(is.na())` to count missing values in each column
- Display unique values in the `vehicle_make` column to see inconsistent capitalization
- Use `head()` to display the first 10 rows of the raw data
- Add appropriate labels to show what you're displaying

In [14]:
cat("Sales - Missing values:\n")
print(colSums(is.na(sales_raw)))

cat("\nUnique makes (mixed case):\n")
print(unique(sales_raw$vehicle_make))

cat("\n✓ Data quality issues identified\n")
head(sales_raw, 10)

Sales - Missing values:
       sale_id  customer_name   vehicle_make  vehicle_model      sale_date 
             0              1              0              0              0 
    sale_price    salesperson payment_method trade_in_value 
             0              0              0              3 

Unique makes (mixed case):
 [1] "TOYOTA" "honda"  "Ford"   "toyota" "HONDA"  "ford"   "Tesla"  "FORD"  
 [9] "Honda"  "tesla" 

✓ Data quality issues identified


sale_id,customer_name,vehicle_make,vehicle_model,sale_date,sale_price,salesperson,payment_method,trade_in_value
<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>
1,John Smith,TOYOTA,camry,2024-01-15,28500,Bob Johnson,Cash,5000.0
2,Mary Jones,honda,Accord,01/22/2024,31200,Sarah Williams,financing,
3,Robert Brown,Ford,mustang,2024-02-05,45000,Bob Johnson,Cash,8500.5
4,,toyota,RAV4,2024-02-10,32000,Mike Davis,Lease,
5,Lisa Martinez,HONDA,civic,02/28/2024,24500,Sarah Williams,Cash,3000.0
6,James Wilson,ford,F-150,2024-03-12,52000,Bob Johnson,Financing,12000.0
7,Patricia Garcia,Tesla,model 3,03/20/2024,48000,Mike Davis,cash,
8,Michael Rodriguez,TOYOTA,Highlander,2024-04-08,42000,Sarah Williams,financing,9500.0
9,Jennifer Lee,honda,CR-V,04/15/2024,29500,Bob Johnson,Cash,4500.0
10,David Kim,FORD,Explorer,2024-05-02,38000,Mike Davis,Lease,7000.0
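
To quantify the two problems visible above, a small base R sketch can compare the number of distinct makes before and after normalisation and tally the date layouts. The vectors are copied from the printed output so the example runs without the CSV files.

```r
# Values copied from the unique() output above.
makes <- c("TOYOTA", "honda", "Ford", "toyota", "HONDA",
           "ford", "Tesla", "FORD", "Honda", "tesla")

# Distinct spellings before and after normalising case and whitespace.
n_raw <- length(unique(makes))
n_normalised <- length(unique(tolower(trimws(makes))))
cat("Distinct makes:", n_raw, "raw vs", n_normalised, "normalised\n")

# Dates copied from the first five rows of the head() output above.
dates <- c("2024-01-15", "01/22/2024", "2024-02-05", "2024-02-10", "02/28/2024")

# Tally how many rows use each date layout.
layout <- ifelse(grepl("/", dates, fixed = TRUE), "MM/DD/YYYY", "YYYY-MM-DD")
print(table(layout))
```

Ten raw spellings collapse to four makes, which is the target for the cleaning step.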


## Part 5: Clean Sales Data

**Instructions:** Clean the sales dataset by creating `sales_clean` from `sales_raw`:
- Convert `vehicle_make` and `vehicle_model` to lowercase and trim whitespace
- Parse `sale_date` handling both "MM/DD/YYYY" and "YYYY-MM-DD" formats using `case_when()`, `str_detect()`, `mdy()`, and `ymd()`
- Standardize `payment_method` to title case
- Trim whitespace from `salesperson` names
- Convert `trade_in_value` to numeric, replacing "NULL", empty strings, and NA with 0
- Clean `customer_name` by trimming/squishing whitespace and replacing missing values with "Unknown Customer"
- Display the first 10 rows of cleaned data using `head()`

In [15]:
sales_clean <- sales_raw %>%
  mutate(
    vehicle_make = str_to_lower(str_trim(vehicle_make)),
    vehicle_model = str_to_lower(str_trim(vehicle_model)),
    sale_date = case_when(
      str_detect(sale_date, "/") ~ mdy(sale_date),
      TRUE ~ ymd(sale_date)
    ),
    payment_method = str_to_title(payment_method),
    salesperson = str_trim(salesperson),
    trade_in_value = case_when(
      is.na(trade_in_value) | trade_in_value == "NULL" | trade_in_value == "" ~ 0,
      TRUE ~ as.numeric(trade_in_value)
    ),
    customer_name = if_else(
      is.na(customer_name) | str_squish(customer_name) == "",
      "Unknown Customer",
      str_squish(customer_name)  # str_squish() also trims both ends
    )
  )

cat("✓ Sales data cleaned\n\n")
head(sales_clean, 10)

✓ Sales data cleaned



sale_id,customer_name,vehicle_make,vehicle_model,sale_date,sale_price,salesperson,payment_method,trade_in_value
<dbl>,<chr>,<chr>,<chr>,<date>,<dbl>,<chr>,<chr>,<dbl>
1,John Smith,toyota,camry,2024-01-15,28500,Bob Johnson,Cash,5000.0
2,Mary Jones,honda,accord,2024-01-22,31200,Sarah Williams,Financing,0.0
3,Robert Brown,ford,mustang,2024-02-05,45000,Bob Johnson,Cash,8500.5
4,Unknown Customer,toyota,rav4,2024-02-10,32000,Mike Davis,Lease,0.0
5,Lisa Martinez,honda,civic,2024-02-28,24500,Sarah Williams,Cash,3000.0
6,James Wilson,ford,f-150,2024-03-12,52000,Bob Johnson,Financing,12000.0
7,Patricia Garcia,tesla,model 3,2024-03-20,48000,Mike Davis,Cash,0.0
8,Michael Rodriguez,toyota,highlander,2024-04-08,42000,Sarah Williams,Financing,9500.0
9,Jennifer Lee,honda,cr-v,2024-04-15,29500,Bob Johnson,Cash,4500.0
10,David Kim,ford,explorer,2024-05-02,38000,Mike Davis,Lease,7000.0
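
The `case_when()` call above runs both `mdy()` and `ymd()` over every row and relies on the suppressed warnings to hide the mismatches. As a sketch of an alternative, the hypothetical helper `parse_mixed_date()` below (not part of the notebook) sends each string only to the parser for its own layout, using base R's `as.Date()` with explicit formats, and then counts anything that still failed.

```r
# Hypothetical helper: parse a character vector that mixes two date layouts.
parse_mixed_date <- function(x) {
  out <- as.Date(rep(NA_character_, length(x)))
  has_slash <- !is.na(x) & grepl("/", x, fixed = TRUE)
  has_dash <- !is.na(x) & !has_slash
  # Each subset is parsed with the one format that applies to it.
  out[has_slash] <- as.Date(x[has_slash], format = "%m/%d/%Y")
  out[has_dash] <- as.Date(x[has_dash], format = "%Y-%m-%d")
  out
}

# Sample values in both layouts, plus a missing entry.
sample_dates <- c("2024-01-15", "01/22/2024", "02/28/2024", NA)
parsed <- parse_mixed_date(sample_dates)
print(parsed)

# Values that were present but could not be parsed.
n_unparsed <- sum(is.na(parsed) & !is.na(sample_dates))
cat("Unparsed dates:", n_unparsed, "\n")
```

Inside a pipeline this could be used as `mutate(sale_date = parse_mixed_date(sale_date))`.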


## Part 6: Clean Customer Data

**Instructions:** Clean the customer dataset by creating `customers_clean` from `customers_raw`:
- Replace missing `first_name` values with "Unknown" and trim whitespace
- Replace missing `last_name` values with "Customer" and trim whitespace
- Create a new `full_name` column by concatenating first and last names
- Generate placeholder emails for missing values using pattern: `customer[ID]@placeholder.com`
- Convert existing emails to lowercase
- Clean `phone` numbers by removing all non-numeric characters, then reformatting as "(XXX) XXX-XXXX"
- Convert `state` to uppercase
- Convert `zip_code` to character type
- Parse `registration_date` as dates using `ymd()`
- Display the first 10 rows of cleaned data

In [16]:
customers_clean <- customers_raw %>%
  mutate(
    first_name = if_else(is.na(first_name), "Unknown", str_trim(first_name)),
    last_name = if_else(is.na(last_name), "Customer", str_trim(last_name)),
    full_name = paste(first_name, last_name),
    email = if_else(
      is.na(email) | str_trim(email) == "",
      paste0("customer", customer_id, "@placeholder.com"),
      str_to_lower(email)
    ),
    phone = str_replace_all(phone, "[^0-9]", "") %>%
      str_replace("(\\d{3})(\\d{3})(\\d{4})", "(\\1) \\2-\\3"),
    state = str_to_upper(state),
    zip_code = as.character(zip_code),
    registration_date = ymd(registration_date)
  )

cat("✓ Customer data cleaned\n\n")
head(customers_clean, 10)

✓ Customer data cleaned



customer_id,first_name,last_name,email,phone,address,city,state,zip_code,registration_date,full_name
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<date>,<chr>
1,John,Smith,john.smith@email.com,(555) 123-4567,123 Main St,Springfield,IL,62701,2023-12-01,John Smith
2,Mary,Jones,mary.jones@email.com,(555) 234-5678,456 Oak Ave,Chicago,IL,60601,2024-01-05,Mary Jones
3,Robert,Brown,rbrown@email.com,(555) 345-6789,789 Pine Rd,Peoria,IL,61602,2024-01-20,Robert Brown
4,Unknown,Customer,customer4@placeholder.com,(555) 456-7890,321 Elm St,Rockford,IL,61101,2024-02-01,Unknown Customer
5,Lisa,Martinez,lisa.m@email.com,(555) 567-8901,654 Maple Dr,Naperville,IL,60540,2024-02-15,Lisa Martinez
6,James,Wilson,j.wilson@email.com,(555) 678-9012,987 Cedar Ln,Aurora,IL,60502,2024-02-28,James Wilson
7,Patricia,Garcia,pgarcia@email.com,(555) 789-0123,147 Birch Ave,Joliet,IL,60431,2024-03-10,Patricia Garcia
8,Michael,Rodriguez,m.rodriguez@email.com,(555) 890-1234,258 Walnut St,Elgin,IL,60120,2024-03-25,Michael Rodriguez
9,Jennifer,Lee,jlee@email.com,(555) 901-2345,369 Spruce Rd,Waukegan,IL,60085,2024-04-08,Jennifer Lee
10,David,Kim,david.kim@email.com,(555) 012-3456,741 Ash Dr,Champaign,IL,61820,2024-04-22,David Kim
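
The phone pattern above is not anchored, so a value with a leading country code or too few digits would be reformatted incorrectly or passed through as bare digits. The sketch below is one hedged way to tighten it; `format_phone()` is a hypothetical helper, and dropping a leading `1` and returning `NA` for anything that is not ten digits are assumptions about the desired policy.

```r
suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
})

# Hypothetical helper: normalise US-style phone numbers to "(XXX) XXX-XXXX".
format_phone <- function(x) {
  digits <- str_replace_all(x, "[^0-9]", "")
  # Drop a leading country code "1" only when exactly ten digits follow it.
  digits <- str_replace(digits, "^1(?=\\d{10}$)", "")
  if_else(
    str_detect(digits, "^\\d{10}$"),
    str_replace(digits, "^(\\d{3})(\\d{3})(\\d{4})$", "(\\1) \\2-\\3"),
    NA_character_  # anything that is not ten digits is flagged as missing
  )
}

phones <- c("555-123-4567", "1 (555) 234-5678", "12345", NA)
formatted <- format_phone(phones)
print(formatted)
```

Returning `NA` for malformed numbers makes them easy to count with `sum(is.na(...))` before export.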


## Part 7: Clean Vehicle Inventory

**Instructions:** Clean the vehicle inventory by creating `vehicles_clean` from `vehicles_raw`:
- Convert `vin` to uppercase and trim whitespace
- Convert `make` and `model` to lowercase and trim whitespace
- Standardize `condition` and `color` to title case
- Convert `sold` to boolean (TRUE/FALSE) by checking if uppercase trimmed values are in: "YES", "Y", "TRUE", "1"
- Convert `year` and `mileage` to integers
- Convert `purchase_price` to numeric
- Parse `lot_date` as dates using `ymd()`
- Display the first 10 rows of cleaned data

In [17]:
vehicles_clean <- vehicles_raw %>%
  mutate(
    vin = str_to_upper(str_trim(vin)),
    make = str_to_lower(str_trim(make)),
    model = str_to_lower(str_trim(model)),
    condition = str_to_title(condition),
    color = str_to_title(color),
    sold = str_to_upper(str_trim(as.character(sold))) %in% c("YES", "Y", "TRUE", "1"),
    year = as.integer(year),
    mileage = as.integer(mileage),
    purchase_price = as.numeric(purchase_price),
    lot_date = ymd(lot_date)
  )

cat("✓ Vehicle inventory cleaned\n\n")
head(vehicles_clean, 10)

✓ Vehicle inventory cleaned



vehicle_id,vin,make,model,year,color,mileage,condition,purchase_price,lot_date,sold
<dbl>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<chr>,<dbl>,<date>,<lgl>
1,1HGBH41JXMN109186,toyota,camry,2023,Silver,15000,Excellent,22000,2023-11-15,TRUE
2,2HGFA16528H123456,honda,accord,2023,Blue,12000,Good,25000,2023-12-01,TRUE
3,1FA6P8CF3L5123456,ford,mustang,2022,Red,18000,Excellent,38000,2024-01-10,TRUE
4,4T1BF1FK5CU123456,toyota,rav4,2024,White,5000,Excellent,26000,2024-01-20,TRUE
5,19XFC2F59LE123456,honda,civic,2023,Black,20000,Good,19500,2024-02-05,TRUE
6,1FTFW1E89MFA12345,ford,f-150,2023,Gray,8000,Excellent,45000,2024-02-15,TRUE
7,5YJ3E1EB8MF123456,tesla,model 3,2024,White,2000,Excellent,42000,2024-02-28,TRUE
8,5TDBZRFH8JS123456,toyota,highlander,2023,Black,14000,Good,36000,2024-03-10,TRUE
9,2HKRM4H75MH123456,honda,cr-v,2023,Blue,16000,Excellent,24500,2024-03-25,TRUE
10,1FM5K8D84LGB12345,ford,explorer,2024,Gray,6000,Good,32000,2024-04-05,TRUE
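
One property of the `%in%` test used for `sold` is that missing or unrecognised values silently become `FALSE`. If that distinction matters, a three-way mapping keeps unknowns as `NA`. The sketch below assumes a hypothetical helper `to_flag()` and an assumed list of "no" spellings; neither appears in the notebook.

```r
suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
})

# Hypothetical helper: map yes/no style text to TRUE/FALSE, unknowns to NA.
to_flag <- function(x) {
  v <- str_to_upper(str_trim(as.character(x)))
  case_when(
    v %in% c("YES", "Y", "TRUE", "1") ~ TRUE,
    v %in% c("NO", "N", "FALSE", "0") ~ FALSE,  # assumed "no" spellings
    TRUE ~ NA                                    # missing or unrecognised
  )
}

flags <- to_flag(c("yes", " Y ", "0", "no", NA, "maybe"))
print(flags)
```

Keeping `NA` separate from `FALSE` lets a later check report how many rows had no usable `sold` value.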


## Part 8: Clean Salesperson Data

**Instructions:** Clean the salesperson dataset by creating `salespeople_clean` from `salespeople_raw`:
- Standardize `salesperson_name` to title case and trim whitespace
- Parse `hire_date` as dates using `ymd()`
- Convert `email` to lowercase and generate placeholder emails for missing values: `employee[ID]@dealership.com`
- Clean `phone` numbers by removing non-numeric characters and reformatting as "(XXX) XXX-XXXX"
- Set missing `commission_rate` values to 0.03 (3%)
- Standardize `department` to title case
- Convert `status` to lowercase
- Display the first 10 rows of cleaned data

In [18]:
salespeople_clean <- salespeople_raw %>%
  mutate(
    salesperson_name = str_to_title(str_trim(salesperson_name)),
    hire_date = ymd(hire_date),
    email = if_else(
      is.na(email) | str_trim(email) == "",
      paste0("employee", salesperson_id, "@dealership.com"),
      str_to_lower(email)
    ),
    phone = str_replace_all(phone, "[^0-9]", "") %>%
      str_replace("(\\d{3})(\\d{3})(\\d{4})", "(\\1) \\2-\\3"),
    commission_rate = if_else(is.na(commission_rate), 0.03, commission_rate),
    department = str_to_title(department),
    status = str_to_lower(status)
  )

cat("✓ Salesperson data cleaned\n\n")
head(salespeople_clean, 10)

✓ Salesperson data cleaned



salesperson_id,salesperson_name,hire_date,email,phone,commission_rate,department,status
<dbl>,<chr>,<date>,<chr>,<chr>,<dbl>,<chr>,<chr>
1,Mike Johnson,2020-05-15,mjohnson@dealership.com,(555) 111-2222,0.03,Sales,active
2,Sarah Williams,2019-03-20,swilliams@dealership.com,(555) 333-4444,0.035,Sales,active
3,Robert Brown,2021-08-10,rbrown@dealership.com,(555) 555-6666,0.03,Sales,active
4,Emily Davis,2022-01-15,edavis@dealership.com,(555) 777-8888,0.025,Sales,active
5,Michael Chen,2020-11-01,mchen@dealership.com,(555) 999-0000,0.03,Sales,active
6,Jessica Martinez,2021-06-12,jmartinez@dealership.com,(555) 123-4567,0.03,Sales,active
7,David Kim,2023-02-28,dkim@dealership.com,(555) 987-6543,0.03,Sales,inactive
8,Lisa Anderson,2019-09-05,landerson@dealership.com,(555) 246-8135,0.035,Sales,active
9,James Wilson,2022-07-20,employee9@dealership.com,(555) 159-7532,0.03,Sales,active
10,Maria Garcia,2021-04-18,mgarcia@dealership.com,(555) 864-2097,0.03,Sales,active
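
After the default is applied, an imputed 3% commission rate is indistinguishable from a recorded one. A small sketch of one way to keep that information, using a flag column computed before the fill; the `staff` data frame and the `commission_rate_imputed` column name are made up for illustration.

```r
suppressPackageStartupMessages(library(dplyr))

# Made-up rows: the second salesperson has no recorded commission rate.
staff <- data.frame(
  salesperson_id = 1:3,
  commission_rate = c(0.035, NA, 0.025)
)

staff_flagged <- staff %>%
  mutate(
    # Record which rows were missing before they are overwritten.
    commission_rate_imputed = is.na(commission_rate),
    # coalesce() keeps existing values and fills NA with the default.
    commission_rate = coalesce(commission_rate, 0.03)
  )

print(staff_flagged)
```

The order inside `mutate()` matters here: the flag has to be computed before `commission_rate` is replaced.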


## Part 9: Clean Service Records

**Instructions:** Clean the service records by creating `service_clean` from `service_raw`:
- Convert `vin` to uppercase and trim whitespace
- Parse `service_date` handling both "MM/DD/YYYY" and "YYYY-MM-DD" formats using `if_else()`, `str_detect()`, `mdy()`, and `ymd()`
- Standardize `service_type` to title case and trim whitespace
- Standardize `mechanic_name` to title case and trim whitespace
- Convert `labor_cost` to numeric, replacing "NULL", empty strings, and NA with 0
- Convert `parts_cost` to numeric, replacing "NULL", empty strings, and NA with 0
- Clean `notes` by trimming and squishing whitespace, setting empty values to NA
- Display the first 10 rows of cleaned data

In [19]:
service_clean <- service_raw %>%
  mutate(
    vin = str_to_upper(str_trim(vin)),
    service_date = if_else(
      str_detect(service_date, "/"),
      mdy(service_date),
      ymd(service_date)
    ),
    service_type = str_to_title(str_trim(service_type)),
    mechanic_name = str_to_title(str_trim(mechanic_name)),
    labor_cost = case_when(
      is.na(labor_cost) | labor_cost == "NULL" | labor_cost == "" ~ 0,
      TRUE ~ as.numeric(labor_cost)
    ),
    parts_cost = case_when(
      is.na(parts_cost) | parts_cost == "NULL" | parts_cost == "" ~ 0,
      TRUE ~ as.numeric(parts_cost)
    ),
    notes = if_else(
      str_squish(notes) == "",
      NA_character_,
      str_squish(notes)  # str_squish() also trims both ends
    )
  )

cat("✓ Service records cleaned\n\n")
head(service_clean, 10)

✓ Service records cleaned



service_id,vin,service_date,service_type,mechanic_name,labor_cost,parts_cost,notes
<dbl>,<chr>,<date>,<chr>,<chr>,<dbl>,<dbl>,<chr>
1,1HGBH41JXMN109186,2024-03-15,Oil Change,Mike Johnson,45,25.5,Regular maintenance
2,1HGCM82633A123456,2024-03-22,Brake Repair,Mike Johnson,120,85.0,Replaced front brake pads
3,5YJSA1E14HF123456,2024-04-01,Tire Rotation,Sarah Williams,35,0.0,Rotated all four tires
4,1HGBH41JXMN109186,2024-04-15,Transmission Service,Mike Johnson,250,150.0,Fluid change and filter
5,1G1ZD5ST8HF109186,2024-05-10,Oil Change,Sarah Williams,45,0.0,Standard oil change
6,1HGCM82633A123456,2024-05-20,Engine Diagnostic,Sarah Williams,95,0.0,Check engine light diagnosis
7,5YJSA1E14HF123456,2024-06-01,Battery Replacement,Mike Johnson,50,120.0,New battery installed
8,1FAHP3K29CL123456,2024-06-10,Oil Change,Sarah Williams,45,28.0,Synthetic oil
9,1HGBH41JXMN109186,2024-07-15,Brake Repair,Mike Johnson,0,95.0,Rear brake pads
10,1G1ZD5ST8HF109186,2024-07-22,Tire Rotation,Mike Johnson,35,0.0,
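
The same "NULL, empty, or NA becomes 0" rule is written out separately for `trade_in_value`, `labor_cost`, and `parts_cost`. As a sketch, the rule could live in one hypothetical helper, `to_numeric_or_zero()`, written here in base R. It mirrors the `case_when()` logic above: recognised missing markers become 0, while any other non-numeric text stays `NA` so it can be investigated.

```r
# Hypothetical helper: text to number, with known missing markers set to 0.
to_numeric_or_zero <- function(x) {
  x <- trimws(as.character(x))
  missing <- is.na(x) | x %in% c("", "NULL")
  # The warning for unparseable text is silenced for this call only.
  num <- suppressWarnings(as.numeric(x))
  ifelse(missing, 0, num)
}

costs <- c("45", "NULL", "", NA, " 25.5 ", "n/a")
cleaned_costs <- to_numeric_or_zero(costs)
print(cleaned_costs)
```

In a pipeline this could be applied as `mutate(across(c(labor_cost, parts_cost), to_numeric_or_zero))`.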


## Part 10: Clean Financing Details

**Instructions:** Clean the financing data by creating `financing_clean` from `financing_raw`:
- Standardize `lender_name` to title case and trim whitespace
- Convert `loan_amount`, `interest_rate` to numeric and `term_months` to integer
- Calculate missing `monthly_payment` values using the loan payment formula:
  - Formula: `loan_amount * (rate/12) * (1 + rate/12)^months / ((1 + rate/12)^months - 1)`
  - Where rate = `interest_rate / 100`
  - Only calculate when loan_amount, interest_rate, and term_months are not missing
- Parse `approval_date` handling both "MM/DD/YYYY" and "YYYY-MM-DD" formats
- Set missing `down_payment` values to 0
- Display the first 10 rows of cleaned data

In [20]:
financing_clean <- financing_raw %>%
  mutate(
    lender_name = str_to_title(str_trim(lender_name)),
    loan_amount = as.numeric(loan_amount),
    interest_rate = as.numeric(interest_rate),
    term_months = as.integer(term_months),
    monthly_payment = if_else(
      is.na(monthly_payment) & !is.na(loan_amount) & !is.na(interest_rate) & !is.na(term_months),
      {
        rate <- interest_rate / 100
        loan_amount * (rate/12) * (1 + rate/12)^term_months / ((1 + rate/12)^term_months - 1)
      },
      as.numeric(monthly_payment)
    ),
    approval_date = if_else(
      str_detect(approval_date, "/"),
      mdy(approval_date),
      ymd(approval_date)
    ),
    down_payment = if_else(is.na(down_payment), 0, down_payment)
  )

cat("✓ Financing data cleaned\n\n")
head(financing_clean, 10)

✓ Financing data cleaned



financing_id,sale_id,lender_name,loan_amount,interest_rate,term_months,monthly_payment,approval_date,down_payment
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<int>,<dbl>,<date>,<dbl>
1,3,First National Bank,25000,4.5,60,466.08,2024-01-20,5000
2,5,Credit Union Auto,18000,3.9,48,407.89,2024-02-01,3000
3,7,First National Bank,22000,4.75,60,412.64,2024-02-15,4000
4,10,Auto Finance Corp,28000,5.2,72,,2024-03-05,2000
5,11,Credit Union Auto,20000,3.8,60,367.71,2024-03-10,0
6,13,First National Bank,26500,4.6,60,,2024-04-01,3500
7,14,Auto Finance Corp,19500,5.0,48,449.44,2024-04-12,2500
8,16,Credit Union Auto,23000,4.0,60,423.41,2024-05-01,4500
9,18,First National Bank,21000,4.8,60,394.8,2024-05-15,3000
10,20,Auto Finance Corp,24000,,72,,2024-06-01,2000
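
In the output above, rows 4 and 6 still have an empty `monthly_payment` even though `loan_amount`, `interest_rate`, and `term_months` are all present. One possible cause, stated as an assumption because the raw file is not shown, is that the raw column holds text such as `"NULL"`, which `is.na()` does not flag and `as.numeric()` then turns into `NA`. The sketch below converts to numeric first and only then fills the gaps. The function `monthly_payment()` is a hypothetical helper that also handles a 0% rate, where the formula would otherwise divide zero by zero. The loan figures are borrowed from rows 1 and 4 of the output; the `"NULL"` text is invented.

```r
# Hypothetical helper: standard amortised loan payment.
monthly_payment <- function(principal, annual_rate_pct, months) {
  r <- annual_rate_pct / 100 / 12      # monthly rate as a fraction
  growth <- (1 + r)^months
  # At a 0% rate the formula is 0/0, so fall back to an even split.
  ifelse(r == 0, principal / months, principal * r * growth / (growth - 1))
}

loans <- data.frame(
  loan_amount = c(25000, 28000),
  interest_rate = c(4.5, 5.2),
  term_months = c(60L, 72L),
  monthly_payment = c("466.08", "NULL"),  # "NULL" sentinel is an assumption
  stringsAsFactors = FALSE
)

# Convert first so sentinel text becomes NA, then fill only the gaps.
existing <- suppressWarnings(as.numeric(loans$monthly_payment))
computed <- monthly_payment(loans$loan_amount, loans$interest_rate, loans$term_months)
loans$monthly_payment <- ifelse(is.na(existing), computed, existing)

print(round(loans$monthly_payment, 2))
```

For the first loan the helper gives a value close to the 466.08 recorded for `financing_id` 1, which is a quick check that the formula matches the data.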


## Part 11: Clean Warranty Information

**Instructions:** Clean the warranty data by creating `warranty_clean` from `warranty_raw`:
- Standardize `warranty_type` and `provider` to title case and trim whitespace
- Parse `start_date` and `end_date` handling both "MM/DD/YYYY" and "YYYY-MM-DD" formats using `if_else()`, `str_detect()`, `mdy()`, and `ymd()`
- Convert `coverage_amount` to numeric
- Set missing `deductible` values to 0
- Convert `status` to lowercase and trim whitespace
- Display the first 10 rows of cleaned data

In [21]:
warranty_clean <- warranty_raw %>%
  mutate(
    warranty_type = str_to_title(str_trim(warranty_type)),
    provider = str_to_title(str_trim(provider)),
    start_date = if_else(
      str_detect(start_date, "/"),
      mdy(start_date),
      ymd(start_date)
    ),
    end_date = if_else(
      str_detect(end_date, "/"),
      mdy(end_date),
      ymd(end_date)
    ),
    coverage_amount = as.numeric(coverage_amount),
    deductible = if_else(is.na(deductible), 0, deductible),
    status = str_to_lower(str_trim(status))
  )

cat("✓ Warranty data cleaned\n\n")
head(warranty_clean, 10)

✓ Warranty data cleaned



warranty_id,vehicle_id,warranty_type,provider,start_date,end_date,coverage_amount,deductible,status
<dbl>,<dbl>,<chr>,<chr>,<date>,<date>,<dbl>,<dbl>,<chr>
1,1,Extended Warranty,Premium Auto Protection,2024-01-15,2027-01-15,5000.0,100,active
2,2,Powertrain,Premium Auto Protection,2024-01-22,2029-01-22,7500.0,150,active
3,3,Extended Warranty,Complete Coverage Inc,2024-01-28,2027-01-28,5000.0,100,active
4,4,Bumper To Bumper,Complete Coverage Inc,2024-02-05,2027-02-05,10000.0,0,active
5,5,Extended Warranty,Premium Auto Protection,2024-02-12,2027-02-12,5000.0,0,active
6,6,Powertrain,Complete Coverage Inc,2024-02-20,2029-02-20,7500.0,150,active
7,7,Extended Warranty,Premium Auto Protection,2024-03-01,2027-03-01,5000.0,100,active
8,8,Bumper To Bumper,Complete Coverage Inc,2024-03-10,2027-03-10,10000.0,0,active
9,9,Extended Warranty,Premium Auto Protection,2024-03-22,2027-03-22,,100,active
10,10,Powertrain,Complete Coverage Inc,2024-04-01,2029-04-01,7500.0,150,active
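
Once both warranty dates are parsed, a quick consistency check can confirm that every warranty ends after it starts and that no date was lost in parsing. The following base R sketch uses a made-up `warranties` data frame with one reversed pair and one missing end date.

```r
# Made-up rows: row 2 ends before it starts, row 3 has no end date.
warranties <- data.frame(
  warranty_id = 1:3,
  start_date = as.Date(c("2024-01-15", "2024-01-22", "2024-03-01")),
  end_date = as.Date(c("2027-01-15", "2023-01-22", NA))
)

# A row is suspect if either date is missing or the range is not positive.
suspect <- is.na(warranties$start_date) |
  is.na(warranties$end_date) |
  warranties$end_date <= warranties$start_date

bad_rows <- warranties[suspect, ]
cat("Suspect warranty rows:", nrow(bad_rows), "\n")
print(bad_rows)
```

Checking `is.na()` before the comparison keeps the logical index free of `NA` values.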


## Part 12: Data Quality Summary

**Instructions:** Create a summary report of all cleaned datasets:
- Calculate the total number of records across all 7 cleaned datasets
- Display the record count for each individual dataset:
  - sales_clean, customers_clean, vehicles_clean, salespeople_clean
  - service_clean, financing_clean, warranty_clean
- Format the output with separators (equal signs) for readability
- Use `nrow()` to count records in each dataframe

In [22]:
cat(strrep("=", 80), "\n", sep = "")
cat("DATA CLEANING SUMMARY\n")
cat(strrep("=", 80), "\n\n", sep = "")

total_records <- nrow(sales_clean) + nrow(customers_clean) + 
                 nrow(vehicles_clean) + nrow(salespeople_clean) +
                 nrow(service_clean) + nrow(financing_clean) + 
                 nrow(warranty_clean)

cat("Total Records Cleaned:", total_records, "\n")
cat("  - Sales:", nrow(sales_clean), "\n")
cat("  - Customers:", nrow(customers_clean), "\n")
cat("  - Vehicles:", nrow(vehicles_clean), "\n")
cat("  - Salespeople:", nrow(salespeople_clean), "\n")
cat("  - Service Records:", nrow(service_clean), "\n")
cat("  - Financing:", nrow(financing_clean), "\n")
cat("  - Warranties:", nrow(warranty_clean), "\n\n")

cat(strrep("=", 80), "\n", sep = "")

================================================================================
DATA CLEANING SUMMARY
================================================================================

Total Records Cleaned: 125 
  - Sales: 20 
  - Customers: 20 
  - Vehicles: 20 
  - Salespeople: 10 
  - Service Records: 20 
  - Financing: 15 
  - Warranties: 20 

================================================================================
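
The per-dataset counts can also be produced from a named list, which avoids repeating `nrow()` for each data frame. This is a sketch with two small stand-in data frames; in the notebook the list would hold the seven `*_clean` objects.

```r
# Stand-in data frames; the real list would hold the seven cleaned datasets.
datasets <- list(
  Sales = data.frame(id = 1:3),
  Customers = data.frame(id = 1:2)
)

# vapply() returns a named integer vector with one row count per dataset.
counts <- vapply(datasets, nrow, integer(1))

cat("Total Records Cleaned:", sum(counts), "\n")
for (name in names(counts)) {
  cat("  - ", name, ": ", counts[[name]], "\n", sep = "")
}
```

Adding an eighth dataset then only requires one new list entry.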


## Part 13: Export Cleaned Data

**Instructions:** Export all cleaned datasets to CSV files:
- Export each of the 7 cleaned dataframes to the data directory with "clean_" prefix:
  - `sales_clean` → `clean_dealership_sales.csv`
  - `customers_clean` → `clean_customer_data.csv`
  - `vehicles_clean` → `clean_vehicle_inventory.csv`
  - `salespeople_clean` → `clean_salesperson_info.csv`
  - `service_clean` → `clean_service_records.csv`
  - `financing_clean` → `clean_financing_details.csv`
  - `warranty_clean` → `clean_warranty_info.csv`
- Use `write_csv()` wrapped in `suppressMessages()` for clean output
- Print a confirmation message when all files are exported
- These cleaned files will be used in Part 2: PostgreSQL Database Operations

In [23]:
suppressMessages({
  write_csv(sales_clean, paste0(data_path, "clean_dealership_sales.csv"))
  write_csv(customers_clean, paste0(data_path, "clean_customer_data.csv"))
  write_csv(vehicles_clean, paste0(data_path, "clean_vehicle_inventory.csv"))
  write_csv(salespeople_clean, paste0(data_path, "clean_salesperson_info.csv"))
  write_csv(service_clean, paste0(data_path, "clean_service_records.csv"))
  write_csv(financing_clean, paste0(data_path, "clean_financing_details.csv"))
  write_csv(warranty_clean, paste0(data_path, "clean_warranty_info.csv"))
})

cat("✓ Exported 7 cleaned CSV files\n")
cat("\nProceed to Part 2: PostgreSQL Database Operations\n")

✓ Exported 7 cleaned CSV files

Proceed to Part 2: PostgreSQL Database Operations
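
Before moving on, it can be worth confirming that each exported file reads back with the expected number of rows. The sketch below writes two made-up data frames to a temporary directory and verifies them; the file names and the use of `tempdir()` are illustrative, and in the notebook the list would hold the seven cleaned data frames with `data_path` as the target.

```r
suppressPackageStartupMessages(library(readr))

# Made-up stand-ins for the cleaned data frames, keyed by output file name.
cleaned <- list(
  clean_demo_sales = data.frame(sale_id = 1:3, sale_price = c(28500, 31200, 45000)),
  clean_demo_staff = data.frame(salesperson_id = 1:2, status = c("active", "inactive"))
)

out_dir <- tempdir()  # illustrative target; the notebook writes to data_path

for (name in names(cleaned)) {
  path <- file.path(out_dir, paste0(name, ".csv"))
  write_csv(cleaned[[name]], path)

  # Read the file back and stop if any rows were lost on the way.
  reread <- read_csv(path, show_col_types = FALSE)
  stopifnot(nrow(reread) == nrow(cleaned[[name]]))
  cat("✓", basename(path), "-", nrow(reread), "rows\n")
}
```

A round-trip check like this catches truncated or mis-quoted files before they reach the PostgreSQL import.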
