# **02.1 Data Cleaning - Vehicle Dimension Data**

In [1]:
!pip install simpledbf

Collecting simpledbf
  Downloading simpledbf-0.2.6.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: simpledbf
  Building wheel for simpledbf (setup.py) ... done
  Created wheel for simpledbf: filename=simpledbf-0.2.6-py3-none-any.whl size=13784 sha256=322a1a12bf478d8b85e951fdc8f17aa3c5b9369595065ecf9cb73ef85297d4a9
  Stored in directory: /home/trusted-service-user/.cache/pip/wheels/e5/41/13/ebdef29165b9309ec4e235dbff19eca8b6759125b0924ad430
Successfully built simpledbf
Installing collected packages: simpledbf
Successfully installed simpledbf-0.2.6


In [2]:
import os
import pandas as pd
import numpy as np
import re

import matplotlib.pyplot as plt
import seaborn as sns

from simpledbf import Dbf5 #to convert .dbf files to .csv

## Overview
The following cell prepares the raw data files for intermediate processing. It converts `.dbf` files to `.csv` and copies existing `.csv` files, so that everything from the source directory ends up in a single target directory under a consistent lowercase name.

## Steps
1. **Directories Setup**: Identifies source (`/lakehouse/default/Files/data/raw`) and target (`/lakehouse/default/Files/data/intermediate`) directories for file processing.
2. **Target Directory Creation**: Checks for the target directory's existence; if absent, it's created.
3. **File Filtering**: Lists all `.dbf` and `.csv` files in the source directory for processing.
4. **File Processing**:
   - For `.dbf` files, converts them to `.csv` format using `Dbf5` and saves them in the target directory with standardized lowercase names.
   - Directly copies `.csv` files to the target directory after standardizing the file names to lowercase.

This script ensures all relevant data files are standardized and located in a single directory, ready for further data analysis or processing tasks.
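The naming rule from step 4 can be checked in isolation. The following is a minimal sketch, assuming the rule is simply "lowercase the name and swap the extension for `.csv`"; `target_csv_name` is a hypothetical helper name, not part of the notebook.

```python
import os

def target_csv_name(filename: str) -> str:
    """Hypothetical helper: lowercase the file name and force a .csv extension."""
    stem, _ext = os.path.splitext(filename.lower())
    return stem + ".csv"

# File names taken from the conversion log of this notebook
for name in ["SPE2010.DBF", "spe1971.DBF"]:
    print(name, "->", target_csv_name(name))
```

Keeping the rule in one function makes it easier to see that mixed-case inputs all map to the same lowercase target name.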


In [3]:
# Define source and target directories
source_dir = '/lakehouse/default/Files/data/raw'
target_dir = '/lakehouse/default/Files/data/intermediate'

# Create the target directory if it doesn't exist
if not os.path.exists(target_dir):
    os.makedirs(target_dir)

# List all files in the source directory
files = [file for file in os.listdir(source_dir) if file.lower().endswith('.dbf') or file.lower().endswith('.csv')]

for file in files:
    # Set the full path for source and target files
    source_file_path = os.path.join(source_dir, file)
    
    # Standardize the file name to lowercase (e.g. "SPE2010.DBF" -> "spe2010.dbf");
    # lower() already covers the "SPE" prefix, so no further replace is needed
    standardized_file_name = file.lower()
    target_file_path = os.path.join(target_dir, os.path.splitext(standardized_file_name)[0] + '.csv')
    
    # Check if the file is a DBF file
    if file.lower().endswith('.dbf'):
        # Convert DBF to CSV
        dbf = Dbf5(source_file_path)
        df = dbf.to_dataframe()
        
        # Save to CSV in the target directory
        df.to_csv(target_file_path, index=False)
        print(f"Converted {file} to CSV and saved to {target_file_path}")
    elif file.lower().endswith('.csv'):
        # If the file is already a CSV, simply copy it to the target directory
        # This assumes you want to unify the location/format and not necessarily only convert formats
        import shutil
        shutil.copy(source_file_path, target_file_path)
        print(f"Copied {file} to {target_file_path}")


PyTables is not installed. No support for HDF output.
Converted SPE2010.DBF to CSV and saved to /lakehouse/default/Files/data/intermediate/spe2010.csv
Converted SPE2011.DBF to CSV and saved to /lakehouse/default/Files/data/intermediate/spe2011.csv
Converted spe1971.DBF to CSV and saved to /lakehouse/default/Files/data/intermediate/spe1971.csv
Converted spe1972.DBF to CSV and saved to /lakehouse/default/Files/data/intermediate/spe1972.csv
Converted spe1973.DBF to CSV and saved to /lakehouse/default/Files/data/intermediate/spe1973.csv
Converted spe1974.DBF to CSV and saved to /lakehouse/default/Files/data/intermediate/spe1974.csv
Converted spe1975.DBF to CSV and saved to /lakehouse/default/Files/data/intermediate/spe1975.csv
Converted spe1976.DBF to CSV and saved to /lakehouse/default/Files/data/intermediate/spe1976.csv
Converted spe1977.DBF to CSV and saved to /lakehouse/default/Files/data/intermediate/spe1977.csv
Converted spe1978.DBF to CSV and saved to /lakehouse/default/Files/data/i

## Overview
The cell below identifies the vehicle specification files in the intermediate directory and counts how many of those files contain each column.

## Key Components
- **Target Directory**: Sets the directory where the files are located.
- **Column Counts Initialization**: Initializes a dictionary to track the occurrence of each column.
- **File Pattern Matching**: Uses a regular expression to identify files following a specific naming convention.
- **Column Descriptions Mapping**: Provides human-readable descriptions for each column present in the files.
- **File Processing**: Lists all files in the target directory matching the pattern, reads them, and updates the column counts.
- **Total Files Calculation**: Computes the total number of files processed.
- **Information Display**: Outputs the count and description of each column found across all files.

## Data Dictionary

- `MAKE`: Vehicle Make
- `MODEL`: Vehicle Model
- `MYR`: Last two digits of the year data was compiled
- `A`: Longitudinal distance front bumper to windshield base
- `B`: Distance rear bumper to backlight base (varies by vehicle type)
- `C`: Maximum vertical height of the side glass
- `D`: Vertical distance base of side glass to lower edge of rocker panel
- `E`: Distance between side rails or maximum width of top
- `F`: Front overhang
- `G`: Rear overhang
- `OL`: Overall length
- `OW`: Overall width
- `OH`: Overall height
- `WB`: Wheelbase
- `TWF`: Front track width
- `TWR`: Rear track width
- `CW`: Curb weight
- `WDIST`: Weight distribution (Front/Rear)

Source: [Open Data Canada - Vehicle Specifications Dataset](https://open.canada.ca/data/dataset/913f8940-036a-45f2-a5f2-19bde76c1252)

In [4]:
# Define the target directory
target_dir = '/lakehouse/default/Files/data/intermediate'

# Initialize a dictionary to hold column names and their counts
column_counts = {}

# Regular expression to match files named "speYEAR.csv"
file_pattern = re.compile(r'spe\d{4}\.csv$', re.IGNORECASE)

# Mapping of column names to descriptions
column_descriptions = {
    'MAKE': 'Vehicle Make',
    'MODEL': 'Vehicle Model',
    'MYR': 'Last two digits of the year data was compiled',
    'A1': 'Longitudinal distance front bumper to windshield base',
    'B1': 'Distance rear bumper to backlight base (varies by vehicle type)',
    'C1': 'Maximum vertical height of the side glass',
    'D1': 'Vertical distance base of side glass to lower edge of rocker panel',
    'E1': 'Distance between side rails or maximum width of top',
    'F1': 'Front overhang',
    'G1': 'Rear overhang',
    'OL': 'Overall length',
    'OW': 'Overall width',
    'OH': 'Overall height',
    'WB': 'Wheelbase',
    'TWF': 'Front track width',
    'TWR': 'Rear track width',
    'CW': 'Curb weight',
    'WDIST': 'Weight distribution (Front/Rear)',
}

# List and process files
for file_ in os.listdir(target_dir):
    if file_pattern.match(file_):
        file_path = os.path.join(target_dir, file_)
        try:
            # Read the file_ into a DataFrame
            df = pd.read_csv(file_path)
            
            # Update counts for each column in this file_
            for column in df.columns:
                column_counts[column] = column_counts.get(column, 0) + 1
        except Exception as e:
            print(f"Error processing file {file_}: {e}")

# Calculate total number of files processed
total_files = sum(1 for file_ in os.listdir(target_dir) if file_pattern.match(file_))

# Display the information
for column, count in column_counts.items():
    description = column_descriptions.get(column, "Unknown description")
    print(f"{column}: {count}/{total_files} - {description}")


MAKE: 56/56 - Vehicle Make
MODEL: 56/56 - Vehicle Model
MYR: 36/56 - Last two digits of the year data was compiled
OL: 56/56 - Overall length
OW: 56/56 - Overall width
OH: 56/56 - Overall height
WB: 56/56 - Wheelbase
CW: 56/56 - Curb weight
A1: 56/56 - Longitudinal distance front bumper to windshield base
B1: 56/56 - Distance rear bumper to backlight base (varies by vehicle type)
C1: 56/56 - Maximum vertical height of the side glass
D1: 56/56 - Vertical distance base of side glass to lower edge of rocker panel
E1: 56/56 - Distance between side rails or maximum width of top
F1: 56/56 - Front overhang
G1: 56/56 - Rear overhang
TWF: 36/56 - Front track width
TWR: 36/56 - Rear track width
WDIST: 43/56 - Weight distribution (Front/Rear)
H1: 16/56 - Unknown description
I1: 17/56 - Unknown description
J1: 18/56 - Unknown description
K1: 18/56 - Unknown description
CTC: 2/56 - Unknown description
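The counts above can be split into columns that appear in every file and columns that appear in only some. The sketch below does this for a subset of the counts copied from the output; the variable names `in_every_file` and `share_of_files` are illustrative and do not appear in the notebook.

```python
# Counts copied from the output above (a subset of the columns), out of 56 files
column_counts = {
    "MAKE": 56, "MODEL": 56, "OL": 56,
    "MYR": 36, "TWF": 36, "TWR": 36,
    "WDIST": 43, "H1": 16, "CTC": 2,
}
total_files = 56

# Columns present in all files
in_every_file = sorted(col for col, n in column_counts.items() if n == total_files)

# Percentage of files that contain each partially present column
share_of_files = {
    col: round(100 * n / total_files, 1)
    for col, n in column_counts.items()
    if n < total_files
}

print("In every file:", in_every_file)
for col, pct in sorted(share_of_files.items(), key=lambda kv: kv[1]):
    print(f"{col}: present in {pct}% of files")
```

Since `MYR`, `TWF`, and `TWR` each appear in only 36 of the 56 files, any later step that requires all of them can accept at most 36 files.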



## Overview
The cell below combines the yearly CSV files into a single DataFrame, including only files that contain every required column. It runs on pandas; a Dask variant for larger datasets is sketched in comments but is not active, since `dask.dataframe` is never imported as `dd`.

- Defines the set of columns that must be present in a CSV file for it to be included in the combined DataFrame.
- Initializes an empty pandas DataFrame. A commented-out line shows the equivalent initialization of a Dask DataFrame.
- Iterates over the files in the target directory and checks each one for the required columns. If a file qualifies, the year is extracted from the file name and added as a new `FILE_YEAR` column, and the data is appended to the combined DataFrame. Errors during file processing are caught and reported.

In [5]:
# Define the required columns 
required_columns = set([
    'MAKE', 'MODEL', 'MYR', 'A1', 'B1', 'C1', 'D1', 'E1', 'F1', 'G1', 
    'OL', 'OW', 'OH', 'WB', 'TWF', 'TWR', 'CW', 'WDIST'
])

# Initialize an empty DataFrame or Dask DataFrame for large data
# For a regular pandas DataFrame:
combined_df = pd.DataFrame()
# For a Dask DataFrame:
# combined_df = dd.from_pandas(pd.DataFrame(), npartitions=1)

# Iterate over files in the directory again to filter and combine them
for file_ in os.listdir(target_dir):
    if file_pattern.match(file_):
        file_path = os.path.join(target_dir, file_)
        try:
            # Temporarily read the file to check if it contains the required columns
            temp_df = pd.read_csv(file_path)
            if required_columns.issubset(temp_df.columns):
                # Extract year from file name
                file_year = re.search(r'spe(\d{4})\.csv', file_).group(1)
                # Add FILE_YEAR column
                temp_df['FILE_YEAR'] = file_year
                # Append to the combined DataFrame
                if isinstance(combined_df, pd.DataFrame):
                    combined_df = pd.concat([combined_df, temp_df], ignore_index=True)
                else:
                    combined_df = dd.concat([combined_df, dd.from_pandas(temp_df, npartitions=1)], ignore_index=True)
        except Exception as e:
            print(f"Error processing file {file_}: {e}")

# If using Dask, you can compute the final result or save it directly without computing
# For example, to save to a Parquet file (good for large datasets)
# combined_df.to_parquet('/path/to/save/your_large_dataset.parquet')

# If the data is manageable in memory and you're using pandas
combined_df.to_csv('/lakehouse/default/Files/data/intermediate/yearly_combined_dimension_data.csv', index=False)

Error processing file SPE2010.csv: 'NoneType' object has no attribute 'group'
Error processing file SPE2011.csv: 'NoneType' object has no attribute 'group'
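Both errors in the output above have the same shape: `re.search` returned `None`, so the call to `.group(1)` failed. This is what happens when an upper-case name such as `SPE2010.csv` passes the case-insensitive `file_pattern` filter but is then searched with the case-sensitive pattern `r'spe(\d{4})\.csv'`. The sketch below reproduces the mismatch and shows one possible remedy; `extract_file_year` is a hypothetical helper, not part of the notebook.

```python
import re

# Same filter pattern as the notebook: case-insensitive
file_pattern = re.compile(r'spe\d{4}\.csv$', re.IGNORECASE)

def extract_file_year(filename):
    """Hypothetical helper: return the 4-digit year as a string, or None."""
    match = re.search(r'spe(\d{4})\.csv$', filename, re.IGNORECASE)
    return match.group(1) if match else None

name = "SPE2010.csv"
print(bool(file_pattern.match(name)))        # True: the filter accepts the file
print(re.search(r'spe(\d{4})\.csv', name))   # None: the case-sensitive search fails
print(extract_file_year(name))               # 2010
```

With a case-insensitive extraction the two skipped files would be considered for the combined DataFrame, which would change the row counts reported further down.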


In [6]:
combined_df.tail()

Unnamed: 0,MAKE,MODEL,MYR,OL,OW,OH,WB,CW,A1,B1,C1,D1,E1,F1,G1,TWF,TWR,WDIST,FILE_YEAR
24940,VOLVO,850 4DR WAGON TURBO,94.0,471.0,176.0,142.0,267,1542.0,130.0,214.0,41.0,69.0,130.0,98.0,108.0,152.0,147.0,,1994
24941,VOLVO,940 4DR SEDAN BASE/TURBO,91.0,487.0,176.0,141.0,277,1455.0,149.0,69.0,41.0,67.0,129.0,99.0,110.0,147.0,146.0,53/47,1994
24942,VOLVO,940 4DR WAGON BASE/TURBO,91.0,481.0,176.0,143.0,277,1489.0,149.0,228.0,41.0,67.0,129.0,99.0,107.0,147.0,146.0,,1994
24943,VOLVO,960 4DR SEDAN,92.0,487.0,175.0,141.0,277,1591.0,149.0,69.0,41.0,67.0,129.0,99.0,110.0,147.0,152.0,53/47,1994
24944,VOLVO,960 4DR WAGON,92.0,481.0,176.0,143.0,277,1565.0,149.0,228.0,41.0,67.0,129.0,99.0,107.0,147.0,146.0,,1994


In [7]:
# Calculate the number of null values per column
null_counts = combined_df.isnull().sum()

# Calculate the total number of entries in the DataFrame
total_entries = len(combined_df)

# Calculate the percentage of null values per column
null_percentage = (null_counts / total_entries) * 100

# Create a report as a DataFrame for better readability
null_report = pd.DataFrame({
    'Column': null_counts.index,
    'Null Values': null_counts.values,
    'Percentage (%)': null_percentage.values
}).sort_values(by='Percentage (%)', ascending=False)

# Display the report
print(null_report)

       Column  Null Values  Percentage (%)
17      WDIST         7078       28.374424
12         E1         1448        5.804770
10         C1         1324        5.307677
9          B1         1287        5.159351
16        TWR         1089        4.365604
11         D1         1017        4.076969
8          A1         1015        4.068952
14         G1          976        3.912608
13         F1          969        3.884546
15        TWF          884        3.543796
7          CW          503        2.016436
5          OH          484        1.940269
2         MYR          106        0.424935
4          OW           83        0.332732
6          WB           13        0.052115
3          OL            9        0.036079
0        MAKE            0        0.000000
1       MODEL            0        0.000000
18  FILE_YEAR            0        0.000000
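The same kind of report can be built more compactly, because `isna().mean()` gives the fraction of nulls per column directly. The sketch below is illustrative only: `toy` is a made-up four-row frame, not data from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up toy frame, for illustration only
toy = pd.DataFrame({
    "MAKE": ["ACURA", "VOLVO", "VOLVO", "VOLVO"],
    "OL": [454.0, 471.0, np.nan, 487.0],
    "WDIST": ["60/40", None, None, "53/47"],
})

# Null count and null percentage per column, worst column first
null_report = pd.DataFrame({
    "Null Values": toy.isna().sum(),
    "Percentage (%)": toy.isna().mean() * 100,
}).sort_values("Percentage (%)", ascending=False)

print(null_report)
```

Here the column names stay in the index rather than in a separate `Column` field, which is a design choice that keeps lookups such as `null_report.loc["WDIST"]` simple.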


In [8]:
# Remove all rows that contain at least one null value
cleaned_df = combined_df.dropna()

# If you want to ensure changes are made to the original DataFrame, you can also use inplace=True
# combined_df.dropna(inplace=True)

# Optional: If you're curious about how many rows were dropped
rows_before = len(combined_df)
rows_after = len(cleaned_df)
rows_dropped = rows_before - rows_after

print(f"Rows before cleaning: {rows_before}")
print(f"Rows after cleaning: {rows_after}")
print(f"Total rows dropped: {rows_dropped}")


Rows before cleaning: 24945
Rows after cleaning: 16661
Total rows dropped: 8284
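The null report shows that `WDIST` alone is missing in 7078 rows, so it likely accounts for a large share of the 8284 dropped rows. If `WDIST` is not needed downstream (an assumption this notebook does not confirm), `dropna(subset=...)` can restrict the check to the other columns. The sketch below compares both variants on a made-up toy frame.

```python
import numpy as np
import pandas as pd

# Made-up toy frame, for illustration only
toy = pd.DataFrame({
    "MODEL": ["a", "b", "c", "d"],
    "OL": [454.0, np.nan, 481.0, 487.0],
    "WDIST": ["60/40", "56/44", None, None],
})

# Variant used in the notebook: drop a row if any column is null
strict = toy.dropna()

# Alternative: ignore nulls in WDIST when deciding which rows to drop
columns_to_check = [col for col in toy.columns if col != "WDIST"]
lenient = toy.dropna(subset=columns_to_check)

print(len(toy), len(strict), len(lenient))  # 4 1 3
```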


## Overview
The code cell below converts the two-digit `MYR` value to a four-digit year, using the year taken from the file name (`FILE_YEAR`) to infer the century.

## Strategy
- **Extract MYR:** Split the MYR at the decimal point, use the first part.
- **Year Prefix:** Prefix the MYR with the first two digits of `FILE_YEAR` to form a full year.
- **Adjustment Logic:** If the resultant year is off by more than 10 years from `FILE_YEAR`, adjust by adding or subtracting 100.

## Implementation Notes
- Ensures MYR prefix has two digits, adding a '0' if necessary.
- The strategy relies on `FILE_YEAR` to approximate the century, adjusting for anomalies.
- Applied to a DataFrame, updating the MYR to a full year and ensuring it is in integer format.

The file year thus supplies the century for the two-digit `MYR`, and the 10-year check moves values that land in the wrong century.
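A few worked cases make the strategy concrete. The sketch below restates the rule as a standalone function on plain values; `to_full_year` is a hypothetical name, and the last two inputs are hypothetical edge cases rather than rows from the dataset.

```python
def to_full_year(myr: float, file_year: int) -> int:
    """Hypothetical restatement of the strategy above for a single value."""
    prefix = str(myr).split(".")[0].zfill(2)      # 9.0 -> "09", 92.0 -> "92"
    year = int(str(file_year)[:2] + prefix)       # borrow the century from FILE_YEAR
    if abs(year - file_year) > 10:                # too far away: shift one century
        year += -100 if year > file_year else 100
    return year

print(to_full_year(92.0, 1994))  # 1992 (matches the notebook output)
print(to_full_year(9.0, 2013))   # 2009 (matches the notebook output)
print(to_full_year(99.0, 2001))  # 1999 (hypothetical: crosses the century boundary)
print(to_full_year(85.0, 1999))  # 2085 (hypothetical: more than 10 years older, same century)
```

The last case shows a limit of the rule: a value more than 10 years older than `FILE_YEAR` within the same century is shifted forward by 100 years. Whether such rows occur in this dataset is not checked here.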


In [9]:
# Function to convert MYR to a full year format
def convert_myr_to_full_year(row):
    # Attempt to extract the year component from MYR correctly, considering it's in floating point format
    myr_str = str(row['MYR'])
    myr_prefix = myr_str.split('.')[0]  # Extract the digits before any decimal point

    # Ensure the extracted MYR prefix has at least two digits
    if len(myr_prefix) == 1:
        myr_prefix = '0' + myr_prefix  # Prefix with 0 if only one digit is present

    file_year_prefix = str(row['FILE_YEAR'])[:2]  # Extract the century
    potential_year = int(file_year_prefix + myr_prefix)  # Combine to form the full year

    # Adjust for cases where the difference between FILE_YEAR and potential_year is significant
    file_year = int(row['FILE_YEAR'])
    if abs(potential_year - file_year) > 10:
        if potential_year > file_year:
            potential_year -= 100  # Adjust down if MYR indicates a year past FILE_YEAR
        else:
            potential_year += 100  # Adjust up if MYR indicates a year before FILE_YEAR

    return potential_year

# Apply the correction and update the DataFrame
cleaned_df['Adjusted_MYR'] = cleaned_df.apply(convert_myr_to_full_year, axis=1)

# Convert Adjusted_MYR to integer
cleaned_df['Adjusted_MYR'] = cleaned_df['Adjusted_MYR'].astype(int)

# Verify the changes
cleaned_df[['MYR', 'FILE_YEAR', 'Adjusted_MYR']].tail()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_df['Adjusted_MYR'] = cleaned_df.apply(convert_myr_to_full_year, axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_df['Adjusted_MYR'] = cleaned_df['Adjusted_MYR'].astype(int)


Unnamed: 0,MYR,FILE_YEAR,Adjusted_MYR
24934,92.0,1994,1992
24937,93.0,1994,1993
24938,93.0,1994,1993
24941,91.0,1994,1991
24943,92.0,1994,1992
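The two warnings in the output above carry the text of pandas' `SettingWithCopyWarning`: a column is being assigned on a frame that pandas considers derived from another frame (here, the result of `dropna()`). A common remedy is to take an explicit copy before adding columns. The sketch below shows this on a made-up toy frame; the column name `FILE_YEAR_INT` is illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up toy frame, for illustration only
combined = pd.DataFrame({
    "MYR": [92.0, np.nan, 9.0],
    "FILE_YEAR": ["1994", "1994", "2013"],
})

# Explicit copy: later column assignments clearly target this frame only
cleaned = combined.dropna().copy()
cleaned["FILE_YEAR_INT"] = cleaned["FILE_YEAR"].astype(int)

print(cleaned)
print(list(combined.columns))  # the original frame is unchanged
```

The results are the same either way; the copy only makes the intent explicit and keeps the output free of warnings on pandas versions that emit them.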


In [10]:
cleaned_df.head()

Unnamed: 0,MAKE,MODEL,MYR,OL,OW,OH,WB,CW,A1,B1,C1,D1,E1,F1,G1,TWF,TWR,WDIST,FILE_YEAR,Adjusted_MYR
0,ACURA,ILX 4DR SEDAN/HYBRID,12.0,454.0,180.0,141.0,267,1356.0,121.0,51.0,30.0,79.0,114.0,96.0,91.0,150.0,152.0,60/40,2013,2012
1,ACURA,MDX 4DR SUV AWD/TECH/ELITE,10.0,485.0,199.0,173.0,275,2076.0,117.0,183.0,42.0,85.0,126.0,100.0,110.0,172.0,172.0,56/44,2013,2010
3,ACURA,TL 4 DR SEDAN FWD/TECH,12.0,493.0,188.0,145.0,278,1695.0,140.0,65.0,32.0,81.0,115.0,102.0,113.0,161.0,162.0,61/39,2013,2012
4,ACURA,TL 4 DR SEDAN SH-AWD/SH-AWD TECH/SH-AWD ELITE,12.0,493.0,188.0,145.0,278,1807.0,140.0,65.0,32.0,81.0,115.0,102.0,113.0,161.0,162.0,59/41,2013,2012
5,ACURA,TSX 4DR SEDAN FWD TECH PACKAGE/PREMIUM,9.0,473.0,184.0,144.0,271,1545.0,121.0,44.0,35.0,79.0,115.0,97.0,104.0,158.0,158.0,60/40,2013,2009


In [12]:
# Define the file path
file_path = '/lakehouse/default/Files/data/processed/processed_vehicle_dimension_data.csv'

# Check if the directory exists, if not, create it
directory = os.path.dirname(file_path)
if not os.path.exists(directory):
    os.makedirs(directory)

# Save the final DataFrame to a new file
cleaned_df.to_csv(file_path, index=False)

print("DataFrame saved to:", file_path)


DataFrame saved to: /lakehouse/default/Files/data/processed/processed_vehicle_dimension_data.csv
