### Data Engineering (Data Preparation)

The second phase of the CRISP-ML(Q) process model aims to prepare data for the following modeling phase. Data selection, data cleaning, feature engineering, and data standardization tasks are performed during this phase.
We identify valuable and necessary features for future model training by using either filter methods, wrapper methods, or embedded methods for data selection. Furthermore, we select data by discarding samples that do not satisfy data quality requirements. At this point, we also might tackle the problem of unbalanced classes by applying over-sampling or under-sampling strategies.
The data cleaning task implies that we perform error detection and error correction steps for the available data. Adding unit testing for data will mitigate the risk of error propagation to the next phase. Depending on the machine learning task, we might need to perform feature engineering and data augmentation activities. For example, such methods include one-hot encoding, clustering, or discretization of continuous attributes.
The data standardization task unifies the input data of the ML tools to avoid the risk of erroneous data. Finally, the normalization task mitigates the risk of bias toward features on larger scales. During this phase, we build data and input-data transformation pipelines for pre-processing and feature creation to ensure the ML application’s reproducibility.
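
The "unit testing for data" idea mentioned above can be sketched as a small validation function. This is only an illustration: the helper name `validate_sales`, the set of required columns, and the sample rows are assumptions, and the column names follow the renamed columns used later in this notebook.

```python
import pandas as pd

# Columns assumed to be mandatory for the sales data (an assumption of this sketch)
REQUIRED_COLUMNS = {"Dealer_Name", "Period", "Total_Sales"}

def validate_sales(df):
    """Return a list of data quality problems; an empty list means the checks passed."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    if df["Total_Sales"].isna().any():
        problems.append("Total_Sales contains null values")
    if (df["Total_Sales"] < 0).any():
        problems.append("Total_Sales contains negative values")
    # Period is assumed to be an integer in YYYYMM form, so the month part must be 1..12
    if not (df["Period"] % 100).between(1, 12).all():
        problems.append("Period has a month outside 1..12")
    return problems

# Made-up sample with one negative total and one invalid month
sample = pd.DataFrame({
    "Dealer_Name": ["DEALER A", "DEALER B"],
    "Period": [202401, 202413],
    "Total_Sales": [3, -1],
})
issues = validate_sales(sample)
print(issues)
```

Running such checks before each transformation step is one way to keep errors from propagating into the next phase.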

In [187]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt

In [188]:
def readFile(name):
    df = pd.read_json(name, encoding = 'ISO-8859-1')
    print(df.head())
    return df

In [189]:
vehicles_sales = readFile('../Data/Maestria_sls_Dummy_S.json')

# Count the number of records (rows)
num_records = vehicles_sales.shape[0]
print("Number of records:", num_records)

                    DL_NM    BR      PR   TS              LINE    TYPE  TOT
0  Dimotors Coatzacoalcos  FORD  202401  MTD  RANGER SILVERTON  TRUCKS    1
1  Dimotors Coatzacoalcos  FORD  202401  MTD  RANGER SILVERTON  TRUCKS    3
2  Dimotors Coatzacoalcos  FORD  202401  MTD         LOBO CREW  TRUCKS    2
3  Dimotors Coatzacoalcos  FORD  202401  MTD          MAVERICK  TRUCKS    4
4  Dimotors Coatzacoalcos  FORD  202401  MTD  RANGER SILVERTON  TRUCKS    7
Number of records: 42043


In [190]:
# Rename columns
vehicles_sales.rename(columns={'DL_NM': 'Dealer_Name', 'BR': 'Brand', 'PR': 'Period', 'TS': 'Time', 
                               'LINE': 'Vehicle_Line', 'TYPE': 'Vehicle_Type', 'TOT': 'Total_Sales'}, inplace=True)

vehicles_sales.head()

Unnamed: 0,Dealer_Name,Brand,Period,Time,Vehicle_Line,Vehicle_Type,Total_Sales
0,Dimotors Coatzacoalcos,FORD,202401,MTD,RANGER SILVERTON,TRUCKS,1
1,Dimotors Coatzacoalcos,FORD,202401,MTD,RANGER SILVERTON,TRUCKS,3
2,Dimotors Coatzacoalcos,FORD,202401,MTD,LOBO CREW,TRUCKS,2
3,Dimotors Coatzacoalcos,FORD,202401,MTD,MAVERICK,TRUCKS,4
4,Dimotors Coatzacoalcos,FORD,202401,MTD,RANGER SILVERTON,TRUCKS,7


In [191]:
# Convert "Dealer_Name" and "Brand" columns to uppercase
vehicles_sales['Dealer_Name'] = vehicles_sales['Dealer_Name'].str.upper()
vehicles_sales['Brand'] = vehicles_sales['Brand'].str.upper()

# Remove duplicate items
vehicles_sales.drop_duplicates(inplace=True)

# Count the number of records (rows)
num_records = vehicles_sales.shape[0]
print("Number of records:", num_records)


Number of records: 29244
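
The order of the two steps above matters: normalizing the case first lets `drop_duplicates` catch rows that differ only in capitalization. A minimal sketch with made-up rows (not taken from the dataset):

```python
import pandas as pd

raw = pd.DataFrame({
    "Dealer_Name": ["Zapata", "ZAPATA", "Zapata"],
    "Total_Sales": [5, 5, 5],
})

# Without normalization, differently cased names count as distinct rows
before = raw.drop_duplicates().shape[0]

# After upper-casing, all three rows collapse into one
normalized = raw.copy()
normalized["Dealer_Name"] = normalized["Dealer_Name"].str.upper()
after = normalized.drop_duplicates().shape[0]

print(before, after)
```

Whether identical rows are true duplicates or separate transactions depends on the source system, which this notebook does not describe, so the deduplication step is worth confirming with the data owner.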


In [192]:
# Remove rows with blank or null values in "Total_Sales" column
vehicles_sales = vehicles_sales.dropna(subset=['Total_Sales'])

# Keep only rows with consistent values (here: non-negative 'Total_Sales')
vehicles_sales = vehicles_sales[vehicles_sales['Total_Sales'] >= 0].copy()

# Count the number of records (rows)
num_records = vehicles_sales.shape[0]
print("Number of records:", num_records)


Number of records: 29244


In [193]:
vehicles_sales.head()

Unnamed: 0,Dealer_Name,Brand,Period,Time,Vehicle_Line,Vehicle_Type,Total_Sales
0,DIMOTORS COATZACOALCOS,FORD,202401,MTD,RANGER SILVERTON,TRUCKS,1
1,DIMOTORS COATZACOALCOS,FORD,202401,MTD,RANGER SILVERTON,TRUCKS,3
2,DIMOTORS COATZACOALCOS,FORD,202401,MTD,LOBO CREW,TRUCKS,2
3,DIMOTORS COATZACOALCOS,FORD,202401,MTD,MAVERICK,TRUCKS,4
4,DIMOTORS COATZACOALCOS,FORD,202401,MTD,RANGER SILVERTON,TRUCKS,7


In [194]:
# Split the "Period" column into "Year" and "Month"
vehicles_sales['Year'] = vehicles_sales['Period'] // 100  # Extract the year
vehicles_sales['Month'] = vehicles_sales['Period'] % 100  # Extract the month

# Drop the "Period" column
vehicles_sales.drop(columns=['Period'], inplace=True)

# Map numerical month values to month names
month_map = {
    1: 'January',
    2: 'February',
    3: 'March',
    4: 'April',
    5: 'May',
    6: 'June',
    7: 'July',
    8: 'August',
    9: 'September',
    10: 'October',
    11: 'November',
    12: 'December'
}
vehicles_sales['Month'] = vehicles_sales['Month'].map(month_map)

# Move the "Year" and "Month" columns to the 3rd and 4th positions
vehicles_sales.insert(2, 'Year', vehicles_sales.pop('Year'))
vehicles_sales.insert(3, 'Month', vehicles_sales.pop('Month'))

# Display the DataFrame with the new "Year" and "Month" columns
vehicles_sales.head()

Unnamed: 0,Dealer_Name,Brand,Year,Month,Time,Vehicle_Line,Vehicle_Type,Total_Sales
0,DIMOTORS COATZACOALCOS,FORD,2024,January,MTD,RANGER SILVERTON,TRUCKS,1
1,DIMOTORS COATZACOALCOS,FORD,2024,January,MTD,RANGER SILVERTON,TRUCKS,3
2,DIMOTORS COATZACOALCOS,FORD,2024,January,MTD,LOBO CREW,TRUCKS,2
3,DIMOTORS COATZACOALCOS,FORD,2024,January,MTD,MAVERICK,TRUCKS,4
4,DIMOTORS COATZACOALCOS,FORD,2024,January,MTD,RANGER SILVERTON,TRUCKS,7
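
As an alternative to the integer arithmetic and the manual `month_map`, the `YYYYMM` period can be parsed as a date. This is a sketch of that variant, not the approach used in the cell above; the sample periods are made up.

```python
import pandas as pd

periods = pd.Series([202401, 202402, 202412])

# Parse YYYYMM integers as dates (the day defaults to the first of the month)
dates = pd.to_datetime(periods.astype(str), format="%Y%m")

year = dates.dt.year
month_name = dates.dt.month_name()  # English month names by default

print(year.tolist(), month_name.tolist())
```

One practical difference: by default an invalid period such as `202413` makes `pd.to_datetime` raise an error, whereas mapping through a dictionary silently yields `NaN`.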


In [195]:
# Group by specified columns and sum "Total_Sales"
grouped_sales = vehicles_sales.groupby(['Dealer_Name', 'Brand', 'Year', 'Month', 'Time', 'Vehicle_Line', 'Vehicle_Type'])['Total_Sales'].sum().reset_index()
grouped_sales = grouped_sales.sort_values(by=['Month', 'Dealer_Name', 'Total_Sales'])

# Sort the DataFrame by month in the correct order
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
grouped_sales['Month'] = pd.Categorical(grouped_sales['Month'], categories=month_order, ordered=True)
grouped_sales = grouped_sales.sort_values(by=['Dealer_Name', 'Brand', 'Year', 'Month', 'Time'])

grouped_sales.head()

Unnamed: 0,Dealer_Name,Brand,Year,Month,Time,Vehicle_Line,Vehicle_Type,Total_Sales
21,ACASA PERINORTE,FORD,2024,January,MTD,F-450,TRUCKS,1
22,ACASA PERINORTE,FORD,2024,January,MTD,FORD BRONCO,OUTFITTERS,2
23,ACASA PERINORTE,FORD,2024,January,MTD,LOBO CREW,TRUCKS,6
19,ACASA PERINORTE,FORD,2024,January,MTD,E-TRANSIT,TRUCKS,25
26,ACASA PERINORTE,FORD,2024,January,MTD,TRANSIT COURIER,TRUCKS,28


In [196]:
# Count the number of records (rows)
num_records = grouped_sales.shape[0]
print("Number of records:", num_records)



Number of records: 4481


In [197]:
grouped_sales_quarterly = grouped_sales.copy()
grouped_sales_quarterly['Time'] = 'QTR'

# Map 'Month' column to represent quarters
quarter_map = {
    'January': 'Q1',
    'February': 'Q1',
    'March': 'Q1',
    'April': 'Q2',
    'May': 'Q2',
    'June': 'Q2',
    'July': 'Q3',
    'August': 'Q3',
    'September': 'Q3',
    'October': 'Q4',
    'November': 'Q4',
    'December': 'Q4'
}
grouped_sales_quarterly['Month'] = grouped_sales_quarterly['Month'].map(quarter_map)

# Group by specified columns and sum 'Total_Sales'
grouped_sales_quarterly = grouped_sales_quarterly.groupby(['Dealer_Name', 'Brand', 'Year', 'Month', 'Time', 'Vehicle_Line', 'Vehicle_Type'])['Total_Sales'].sum().reset_index()

# Sort the DataFrame if needed
grouped_sales_quarterly = grouped_sales_quarterly.sort_values(by=['Dealer_Name', 'Brand', 'Year', 'Month', 'Time'])



# Display the resulting DataFrame
print("Number of records:", grouped_sales_quarterly.shape[0])
print(grouped_sales_quarterly.head())


Number of records: 2790
       Dealer_Name Brand  Year Month Time    Vehicle_Line Vehicle_Type  \
0  ACASA PERINORTE  FORD  2024    Q1  QTR    BRONCO SPORT   OUTFITTERS   
1  ACASA PERINORTE  FORD  2024    Q1  QTR       E-TRANSIT       TRUCKS   
2  ACASA PERINORTE  FORD  2024    Q1  QTR  ESCAPE NA FHEV   OUTFITTERS   
3  ACASA PERINORTE  FORD  2024    Q1  QTR       EXPED MAX   OUTFITTERS   
4  ACASA PERINORTE  FORD  2024    Q1  QTR           F-150       TRUCKS   

   Total_Sales  
0           77  
1           25  
2          192  
3           57  
4           34  


In [198]:
# Custom search criteria
dealer_name_criteria = 'ZAPATA'
months_criteria = ['Q1']
vehicle_line_criteria = 'BRONCO SPORT'

# Perform the search
custom_search_result = grouped_sales_quarterly[
    (grouped_sales_quarterly['Dealer_Name'] == dealer_name_criteria) &
    (grouped_sales_quarterly['Month'].isin(months_criteria)) &
    (grouped_sales_quarterly['Vehicle_Line'] == vehicle_line_criteria)
]

# Display the search result
print(custom_search_result)


     Dealer_Name Brand  Year Month Time  Vehicle_Line Vehicle_Type  \
2691      ZAPATA  FORD  2024    Q1  QTR  BRONCO SPORT   OUTFITTERS   

      Total_Sales  
2691          227  


In [199]:
# Custom search criteria to validate Quarter Calculation
months_criteria = ['January', 'February','March']

# Perform the search
custom_search_result = grouped_sales[
    (grouped_sales['Dealer_Name'] == dealer_name_criteria) &
    (grouped_sales['Month'].isin(months_criteria)) &
    (grouped_sales['Vehicle_Line'] == vehicle_line_criteria)
]

# Display the search result
print(custom_search_result)

     Dealer_Name Brand  Year     Month Time  Vehicle_Line Vehicle_Type  \
4350      ZAPATA  FORD  2024   January  MTD  BRONCO SPORT   OUTFITTERS   
4333      ZAPATA  FORD  2024  February  MTD  BRONCO SPORT   OUTFITTERS   
4373      ZAPATA  FORD  2024     March  MTD  BRONCO SPORT   OUTFITTERS   

      Total_Sales  
4350           84  
4333           54  
4373           89  
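
The manual comparison above can be turned into an explicit check. The sketch below hard-codes the values printed by the two searches (ZAPATA, BRONCO SPORT) and verifies that the monthly figures add up to the quarterly one.

```python
import pandas as pd

# Monthly totals as printed above for ZAPATA / BRONCO SPORT
monthly = pd.DataFrame({
    "Month": ["January", "February", "March"],
    "Total_Sales": [84, 54, 89],
})
quarterly_total = 227  # Q1 value printed by the quarterly search

monthly_sum = int(monthly["Total_Sales"].sum())
assert monthly_sum == quarterly_total, "quarterly aggregation does not match monthly data"
print(monthly_sum)
```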


In [200]:
# Concatenate the two DataFrames
pd_total_sales_q_m = pd.concat([grouped_sales, grouped_sales_quarterly], ignore_index=True)

# Display the resulting DataFrame
print("Number of records:", pd_total_sales_q_m.shape[0])


Number of records: 7271


In [201]:
import json

# Assuming 'grouped_sales_quarterly' contains the DataFrame with the sales data

# Define the documentation template
documentation_template = {
    "documentation": {
        "context": "This file contains the sales information of Ford of Mexico vehicle sales for 2024 at vehicle line level.",
        "terms": [
            {"name": "Dealer_Name", "definition": "Name of the car dealership"},
            {"name": "Brand", "definition": "Brand of the vehicle"},
            {"name": "Year", "definition": "Year of the sales data"},
            {"name": "Month", "definition": "Month of the sales data; holds the quarter (Q1 to Q4) when Time is QTR"},
            {"name": "Time", "definition": "Time period of the sales data, MTD is Month to date and QTR means Quarter"},
            {"name": "Vehicle_Line", "definition": "Type of vehicle line"},
            {"name": "Vehicle_Type", "definition": "Type of vehicle"},
            {"name": "Total_Sales", "definition": "Total units sold"}
        ]
    }
}

# Convert DataFrame to list of dictionaries
vehicle_sales_data = grouped_sales_quarterly.to_dict(orient='records')

# Combine documentation and vehicle_sales data
json_data = {
    **documentation_template,
    "vehicle_sales": vehicle_sales_data
}

# Write JSON data to file
file_location = '../json/llm_train_sales_data.json'
with open(file_location, 'w') as json_file:
    json.dump(json_data, json_file, indent=4)
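
Since the "documentation" section is meant to explain every field, it can be useful to read the file back and confirm that each record key has a matching term. The following is a minimal sketch with a made-up payload written to a temporary directory; the payload contents and file name are assumptions, not the real export.

```python
import json
import os
import tempfile

# Made-up payload that mirrors the layout used above
payload = {
    "documentation": {
        "context": "Example context",
        "terms": [{"name": "Total_Sales", "definition": "Units sold"}],
    },
    "vehicle_sales": [
        {"Dealer_Name": "ZAPATA", "Month": "Q1", "Total_Sales": 227},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=4)
    with open(path, encoding="utf-8") as fh:
        loaded = json.load(fh)

# Keys that appear in the records but have no definition in "terms"
documented = {term["name"] for term in loaded["documentation"]["terms"]}
undocumented = {key for row in loaded["vehicle_sales"] for key in row} - documented
print(sorted(undocumented))
```

An empty result would indicate that every field in the records is covered by the documentation section.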


#### Industry Sales

In [202]:
industry = readFile('../Data/Maestria_Indsty_Dummy_S.json')

# Count the number of records (rows)
num_records = industry.shape[0]
print("Number of records:", num_records)

       PR        BRND   INDST    PMA    PMA_R    MS
0  202401      NISSAN  108293  19970  1 DE 44  0.18
1  202401   CHEVROLET  108293  14426  2 DE 44  0.13
2  202401  VOLKSWAGEN  108293  10264  3 DE 44  0.09
3  202401      TOYOTA  108293   9264  4 DE 44  0.09
4  202401         KIA  108293   8204  5 DE 44  0.08
Number of records: 126


In [203]:
industry.rename(columns={'PR': 'Period', 'BRND': 'Brand', 'INDST': 'Industry', 'PMA': 'PMA', 
                               'PMA_R': 'PMA_R', 'MS': 'Market_Shared'}, inplace=True)
industry.head()

Unnamed: 0,Period,Brand,Industry,PMA,PMA_R,Market_Shared
0,202401,NISSAN,108293,19970,1 DE 44,0.18
1,202401,CHEVROLET,108293,14426,2 DE 44,0.13
2,202401,VOLKSWAGEN,108293,10264,3 DE 44,0.09
3,202401,TOYOTA,108293,9264,4 DE 44,0.09
4,202401,KIA,108293,8204,5 DE 44,0.08


In [204]:
# Split the "Period" column into "Year" and "Month"
industry['Year'] = industry['Period'] // 100  # Extract the year
industry['Month'] = industry['Period'] % 100  # Extract the month

# Drop the "Period" column
industry.drop(columns=['Period'], inplace=True)

industry['Month'] = industry['Month'].map(month_map)

# Move the "Year" and "Month" columns to the 3rd and 4th positions
industry.insert(2, 'Year', industry.pop('Year'))
industry.insert(3, 'Month', industry.pop('Month'))

# Display the DataFrame with the new "Year" and "Month" columns
industry.head()

Unnamed: 0,Brand,Industry,Year,Month,PMA,PMA_R,Market_Shared
0,NISSAN,108293,2024,January,19970,1 DE 44,0.18
1,CHEVROLET,108293,2024,January,14426,2 DE 44,0.13
2,VOLKSWAGEN,108293,2024,January,10264,3 DE 44,0.09
3,TOYOTA,108293,2024,January,9264,4 DE 44,0.09
4,KIA,108293,2024,January,8204,5 DE 44,0.08


In [205]:
grouped_indsutry_quarterly = industry.copy()
grouped_indsutry_quarterly['Time'] = 'QTR'

grouped_indsutry_quarterly['Month'] = grouped_indsutry_quarterly['Month'].map(quarter_map)

# Get total of vehicles available and market shared per brand
grouped_indsutry_quarterly = grouped_indsutry_quarterly.groupby(['Brand', 'Year', 'Month', 'Time'])[['PMA', 'Market_Shared']].sum().reset_index()

#grouped_indsutry_quarterly['MS'] = (grouped_indsutry_quarterly['MS'] / grouped_indsutry_quarterly['MS'].sum()) * 100

grouped_indsutry_quarterly

Unnamed: 0,Brand,Year,Month,Time,PMA,Market_Shared
0,ACURA,2024,Q1,QTR,0,0.0
1,ALFA ROMEO,2024,Q1,QTR,0,0.0
2,AUDI,2024,Q1,QTR,0,0.0
3,BAIC,2024,Q1,QTR,2712,0.03
4,BMW,2024,Q1,QTR,0,0.0
5,BUICK,2024,Q1,QTR,769,0.0
6,CADILLAC,2024,Q1,QTR,0,0.0
7,CHEVROLET,2024,Q1,QTR,46032,0.41
8,CHIREY,2024,Q1,QTR,5457,0.05
9,CUPRA,2024,Q1,QTR,1616,0.01


In [206]:
# Calculate total sales per quarter
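# Note: this sums 'Market_Shared' across brands; the result is the denominator for the percentage below
total_sales_by_quarter = grouped_indsutry_quarterly.groupby('Month')['Market_Shared'].sum().rename('Total Sales')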

In [207]:
# Estimate the % of sales for each brand per quarter
df = grouped_indsutry_quarterly.join(total_sales_by_quarter, on='Month')
df['%_SALES_PER_Q'] = (df['Market_Shared'] / df['Total Sales']) * 100
#df['PMA_Normalized_MinMax'] = (df['PMA'] - df['PMA'].min()) / (df['PMA'].max() - df['PMA'].min())

In [208]:
not_in_final = ['Total Sales', 'Market_Shared']
final_df = df[[v for v in df.columns if v not in not_in_final]]
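
The percentage computed above is each brand's share of its quarter's total. The same idea can be sketched with `groupby(...).transform("sum")`, which avoids the separate join. The numbers are made up, and using the unit counts (`PMA`) instead of the pre-rounded `Market_Shared` values is an assumption of this sketch, not what the cells above do.

```python
import pandas as pd

toy = pd.DataFrame({
    "Brand": ["A", "B", "A", "B"],
    "Month": ["Q1", "Q1", "Q2", "Q2"],
    "PMA": [30, 70, 50, 150],
})

# Total per quarter, aligned row by row with the original frame
quarter_total = toy.groupby("Month")["PMA"].transform("sum")

# Share of each brand within its quarter, in percent
toy["share_pct"] = toy["PMA"] * 100 / quarter_total
print(toy)
```

By construction, the shares within each quarter add up to 100.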

In [209]:
documentation_template = {
    "documentation": {
        "context": "This file contains industry sales information by brand for 2024, aggregated per quarter.",
        "terms": [
            {"name": "Brand", "definition": "Brand of the vehicle"},
            {"name": "Year", "definition": "Year of the sales data"},
            {"name": "Month", "definition": "Month of the sales data"},
            {"name": "Time", "definition": "Time period of the sales data, MTD is Month to date and QTR means Quarter"},
            {"name": "PMA", "definition": "Total of vehicles available for sale"},
            {"name": "%_SALES_PER_Q", "definition": "Percentage of sales for each quarter"}
        ]
    }
}

# Convert DataFrame to list of dictionaries
industry_sales_data = final_df.to_dict(orient='records')

# Combine documentation and industry data
json_data = {
    **documentation_template,
    "industry": industry_sales_data
}

# Write JSON data to file
file_location = '../json/llm_train_industry_data.json'
with open(file_location, 'w') as json_file:
    json.dump(json_data, json_file, indent=4)

#### Sales Stock

In [210]:
stock = readFile('../Data/Maestria_stck_Dummy_S.json')

# Count the number of records (rows)
num_records = stock.shape[0]
print("Number of records:", num_records)

                    DL_NM      VIIN      PR        TYP MSS006_LINE_VEH_X  \
0  Dimotors Coatzacoalcos  PBA82406  202401  OUTFITTER              EDGE   
1  Dimotors Coatzacoalcos  PED61524  202401      TRUCK             F-250   
2  Dimotors Coatzacoalcos  PED70372  202401      TRUCK             F-250   
3  Dimotors Coatzacoalcos  PFA55596  202401      TRUCK         LOBO CREW   
4  Dimotors Coatzacoalcos  PFA84357  202401      TRUCK         LOBO CREW   

   MSS006_UNITS_Q MSS006_INVOICE_DATE_Y  MSS006_RANGE_DAYS_Q  \
0               1            2023-07-24                    5   
1               1            2023-11-11                    3   
2               1            2023-09-27                    4   
3               1            2023-02-28                    5   
4               1            2023-03-30                    5   

  MSS006_STATUS_STOCK_C  
0                  STCK  
1                  STCK  
2                  STCK  
3                  INTR  
4                  INTR  
Nu

In [211]:
stock.rename(columns={'DL_NM': 'Dealer_Name', 'VIIN': 'Vin', 'PR': 'Period', 'TYP':'Type', 'MSS006_LINE_VEH_X': 'Vehicle_Line', 
                               'MSS006_UNITS_Q': 'Units', 'MSS006_INVOICE_DATE_Y': 'Invoice_Date',
                               'MSS006_RANGE_DAYS_Q': 'Range_Days', 'MSS006_STATUS_STOCK_C': 'Status_Stock'}, inplace=True)
stock.head()

Unnamed: 0,Dealer_Name,Vin,Period,Type,Vehicle_Line,Units,Invoice_Date,Range_Days,Status_Stock
0,Dimotors Coatzacoalcos,PBA82406,202401,OUTFITTER,EDGE,1,2023-07-24,5,STCK
1,Dimotors Coatzacoalcos,PED61524,202401,TRUCK,F-250,1,2023-11-11,3,STCK
2,Dimotors Coatzacoalcos,PED70372,202401,TRUCK,F-250,1,2023-09-27,4,STCK
3,Dimotors Coatzacoalcos,PFA55596,202401,TRUCK,LOBO CREW,1,2023-02-28,5,INTR
4,Dimotors Coatzacoalcos,PFA84357,202401,TRUCK,LOBO CREW,1,2023-03-30,5,INTR


In [212]:
# Split the "Period" column into "Year" and "Month"
stock['Year'] = stock['Period'] // 100  # Extract the year
stock['Month'] = stock['Period'] % 100  # Extract the month

# Drop the "Period" column
stock.drop(columns=['Period'], inplace=True)

stock['Month'] = stock['Month'].map(month_map)

# Move the "Year" and "Month" columns to the 3rd and 4th positions
stock.insert(2, 'Year', stock.pop('Year'))
stock.insert(3, 'Month', stock.pop('Month'))

# Display the DataFrame with the new "Year" and "Month" columns
stock.head()

Unnamed: 0,Dealer_Name,Vin,Year,Month,Type,Vehicle_Line,Units,Invoice_Date,Range_Days,Status_Stock
0,Dimotors Coatzacoalcos,PBA82406,2024,January,OUTFITTER,EDGE,1,2023-07-24,5,STCK
1,Dimotors Coatzacoalcos,PED61524,2024,January,TRUCK,F-250,1,2023-11-11,3,STCK
2,Dimotors Coatzacoalcos,PED70372,2024,January,TRUCK,F-250,1,2023-09-27,4,STCK
3,Dimotors Coatzacoalcos,PFA55596,2024,January,TRUCK,LOBO CREW,1,2023-02-28,5,INTR
4,Dimotors Coatzacoalcos,PFA84357,2024,January,TRUCK,LOBO CREW,1,2023-03-30,5,INTR


In [213]:
# Convert "Dealer_Name" column to uppercase
stock['Dealer_Name'] = stock['Dealer_Name'].str.upper()

# Remove duplicate items
stock.drop_duplicates(inplace=True)

# Count the number of records (rows)
num_records = stock.shape[0]
print("Number of records:", num_records)

Number of records: 62042


In [214]:
grouped_stock_quarterly = stock.copy()
grouped_stock_quarterly['Time'] = 'QTR'

grouped_stock_quarterly['Month'] = grouped_stock_quarterly['Month'].map(quarter_map)

# Sum the units in stock per vehicle (VIN) and quarter
grouped_stock_quarterly = grouped_stock_quarterly.groupby(['Dealer_Name', 'Vin', 'Year', 'Month', 'Type','Vehicle_Line', 'Invoice_Date', 'Range_Days', 'Status_Stock', 'Time'])[['Units']].sum().reset_index()

grouped_stock_quarterly

Unnamed: 0,Dealer_Name,Vin,Year,Month,Type,Vehicle_Line,Invoice_Date,Range_Days,Status_Stock,Time,Units
0,ACASA PERINORTE,PFA55448,2024,Q1,TRUCK,LOBO CREW,2023-01-13,5,INTR,QTR,1
1,ACASA PERINORTE,PFA68245,2024,Q1,TRUCK,LOBO CREW,2023-03-30,5,STCK,QTR,1
2,ACASA PERINORTE,PFB81693,2024,Q1,TRUCK,LOBO CREW,2023-05-30,5,STCK,QTR,1
3,ACASA PERINORTE,PGB12934,2024,Q1,OUTFITTER,EXPLORER,2023-12-22,3,INTR,QTR,1
4,ACASA PERINORTE,PGB15226,2024,Q2,OUTFITTER,EXPLORER,2023-12-28,4,INTR,QTR,1
...,...,...,...,...,...,...,...,...,...,...,...
62036,ZAPATA (SUC. AEROPUERTO),RX574511,2024,Q2,TRUCK,RANGER TAILANDIA,2024-04-10,1,INTR,QTR,1
62037,ZAPATA (SUC. AEROPUERTO),RYF19436,2024,Q1,OUTFITTER,TERRITORY,2024-02-21,1,INTR,QTR,1
62038,ZAPATA (SUC. PACHUCA),PED10179,2024,Q1,TRUCK,F-250,2023-09-06,4,STCK,QTR,1
62039,ZAPATA (SUC. PACHUCA),PED35863,2024,Q1,TRUCK,F-250,2023-09-06,4,STCK,QTR,1


In [215]:
# Calculate total units in stock per quarter
total_sales_by_quarter = grouped_stock_quarterly.groupby('Month')['Units'].sum().rename('Total Sales')

In [216]:
# Estimate each record's share of the quarterly total, in percent
df = grouped_stock_quarterly.join(total_sales_by_quarter, on='Month')
df['%_SALES_PER_Q'] = (df['Units'] / df['Total Sales']) * 100

In [217]:
documentation_template = {
    "documentation": {
        "context": "This file contains the stock information of Ford of Mexico vehicles for 2024 at vehicle (VIN) level.",
        "terms": [
            {"name": "Dealer_Name", "definition": "Name of the car dealership"},
            {"name": "Vin", "definition": "Vehicle Identification number"},
            {"name": "Year", "definition": "Year of the sales data"},
            {"name": "Month", "definition": "Month of the sales data"},
            {"name": "Type", "definition": "Type of vehicle"},
            {"name": "Vehicle_Line", "definition": "Type of vehicle line"},
            {"name": "Invoice_Date", "definition": "Date of Invoicing"},
            {"name": "Range_Days", "definition": "Days until Invoicing"},
            {"name": "Status_Stock", "definition": "Status of Vehicle in Stock"},
            {"name": "Time", "definition": "Time period of the sales data, QTR means Quarter"},
            {"name": "Units", "definition": "Total Units sold per vehicle type"},
            {"name": "%_SALES_PER_Q", "definition": "Percentage of sales for each quarter"}
        ]
    }
}

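# Convert the stock DataFrame to a list of dictionaries, dropping the helper 'Total Sales' column
# (the original used final_df, which still held the industry data from the previous section)
stock_sales_data = df.drop(columns=['Total Sales']).to_dict(orient='records')

# Combine documentation and stock data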
json_data = {
    **documentation_template,
    "stock": stock_sales_data
}

# Write JSON data to file
file_location = '../json/llm_train_stock_data.json'
with open(file_location, 'w') as json_file:
    json.dump(json_data, json_file, indent=4)

### Conclusions

This project cleaned vehicle sales data and transformed it into a format designed for training large language models (LLMs) for AI chatbots.

We followed the principles of CRISP-ML(Q), which builds on the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, focusing specifically on the Data Preparation phase. This phase is crucial for ensuring the quality and effectiveness of the data used to train LLMs for AI chatbots.

The chosen format is a JSON file with a structured layout, providing both the data itself and clear explanations for each data point.

This approach leverages the power of fine-tuning for LLM training:

Fine-Tuning for Chatbot Development: Fine-tuning takes a pre-trained LLM with a broad understanding of general language and specializes it for a specific task such as chatbot development. By exposing the LLM to our structured JSON data, which contains sales data relevant to the chatbot domain, we can refine its ability to understand and respond to user queries related to car sales.

The chosen JSON format offers several advantages in the context of LLM fine-tuning for AI chatbots:

- Improved Training Efficiency: The structured nature of JSON data lets the LLM focus on the relevant information during fine-tuning. This targeted learning process can be more efficient than training on unstructured text, which may lead to faster convergence and better model performance.
- Reduced Risk of Hallucination: The JSON format's inherent structure and the inclusion of a "documentation" section with clear tag definitions help reduce the chance that the LLM hallucinates during chatbot interactions. Because the data comes with context and meaning, the LLM is less likely to generate responses that deviate from the actual sales information or introduce irrelevant details.
- Enhanced Chatbot Performance: Combining fine-tuning with a well-structured JSON format should help AI chatbots understand user queries related to car sales more accurately. The chatbot can access and process relevant data points (e.g., brand, model, sales figures) efficiently, enabling it to provide informative and coherent responses to user inquiries.

Furthermore, we opted against the normalization techniques commonly used in traditional machine learning for a specific reason:

Min-max scaling maps values to a fixed range (e.g., 0 to 1), while z-score standardization rescales them to zero mean and unit variance. Although beneficial for tasks involving distance or magnitude calculations, these transformations can be detrimental for LLM training. LLMs often rely on the inherent distribution and relationships within the data to learn language patterns. Normalization can disrupt these relationships, potentially hindering the LLM's ability to learn effectively.
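
For reference, the two normalization techniques named above behave as follows on a small array. This is a minimal NumPy sketch with made-up values. Note that z-scores are centered at zero with unit variance rather than confined to a fixed range, and that in both cases the rescaled numbers no longer read as actual sales figures.

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max scaling: maps the smallest value to 0 and the largest to 1
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score standardization: zero mean, unit (population) standard deviation
z_score = (values - values.mean()) / values.std()

print(min_max)
print(z_score)
```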