# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

In [1]:
#INSTALLAITONS
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector, RFE
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LinearRegression, Lasso
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SequentialFeatureSelector, SelectFromModel
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import set_config
set_config(display="diagram")


import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

Step 1 
- business objectives: we wish to know what makes a car more or less expensive


Step 2 
- Assess situation: our current situation is a data set of 400'000 rows


Step 3 
- data mining goals: Predict the valuation on a car based on the atributes of a car (make/model/year/location)

Step 4
- product plan: with minor details provided by a customer, we will provide a reliable estimate of price for the car


### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

In [2]:
# Step 1: collect initial data
df = pd.read_csv('./data/vehicles.csv')

In [3]:
# Step 2: describe the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

In [4]:
# Step 3: explore the data
df.head()

# Summary statistics
print(df.describe())

# Data types and missing values
print(df.info())

                 id         price           year      odometer
count  4.268800e+05  4.268800e+05  425675.000000  4.224800e+05
mean   7.311487e+09  7.519903e+04    2011.235191  9.804333e+04
std    4.473170e+06  1.218228e+07       9.452120  2.138815e+05
min    7.207408e+09  0.000000e+00    1900.000000  0.000000e+00
25%    7.308143e+09  5.900000e+03    2008.000000  3.770400e+04
50%    7.312621e+09  1.395000e+04    2013.000000  8.554800e+04
75%    7.315254e+09  2.648575e+04    2017.000000  1.335425e+05
max    7.317101e+09  3.736929e+09    2022.000000  1.000000e+07
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model    

In [5]:
df['type'].unique()
#df['condition'].unique()

array([nan, 'pickup', 'truck', 'other', 'coupe', 'SUV', 'hatchback',
       'mini-van', 'sedan', 'offroad', 'bus', 'van', 'convertible',
       'wagon'], dtype=object)

In [6]:
# # Convert categorical variable to numerical using one-hot encoding
# from matplotlib import pyplot as plt
# df = df.drop(columns=['VIN'])
# data_encoded = pd.get_dummies(df, columns=['region','manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'title_status', 'paint_color', 'type', 'size', 'drive', 'transmission', 'title_status', 'state'])

# # Calculate correlation matrix
# corr_matrix = data_encoded.corr()

# # Plot correlation matrix as heatmap
# plt.figure(figsize=(8, 6))
# sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
# plt.title('Correlation Matrix')
# plt.show()

In [7]:
# Step 4: verify data quality
# missing values in 14/18 of the columns
# will need to handle some of the 'object' data types

# this will require me to convert these object types into unique collumns and attribute them as boolean variables
# Will need to handle the missing values by either filling in them with expected value or drop them

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

In [8]:
# Step 1 - Select the data
df

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
0,7222695916,prescott,6000,,,,,,,,,,,,,,,az
1,7218891961,fayetteville,11900,,,,,,,,,,,,,,,ar
2,7221797935,florida keys,21000,,,,,,,,,,,,,,,fl
3,7222270760,worcester / central MA,1500,,,,,,,,,,,,,,,ma
4,7210384030,greensboro,4900,,,,,,,,,,,,,,,nc
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
426875,7301591192,wyoming,23590,2019.0,nissan,maxima s sedan 4d,good,6 cylinders,gas,32226.0,clean,other,1N4AA6AV6KC367801,fwd,,sedan,,wy
426876,7301591187,wyoming,30590,2020.0,volvo,s60 t5 momentum sedan 4d,good,,gas,12029.0,clean,other,7JR102FKXLG042696,fwd,,sedan,red,wy
426877,7301591147,wyoming,34990,2020.0,cadillac,xt4 sport suv 4d,good,,diesel,4174.0,clean,other,1GYFZFR46LF088296,,,hatchback,white,wy
426878,7301591140,wyoming,28990,2018.0,lexus,es 350 sedan 4d,good,6 cylinders,gas,30112.0,clean,other,58ABK1GG4JU103853,fwd,,sedan,silver,wy


In [9]:
# Step 2 - Clean the data
# Drop rows with missing values for columns with a small number of missing values
df.dropna(subset=['year', 'manufacturer', 'model', 'fuel', 'odometer', 'title_status', 'transmission'], inplace=True)

# Fill missing values for columns with a large number of missing values
df['condition'].fillna(value='unknown', inplace=True)
df['cylinders'].fillna(value='unknown', inplace=True)
df['VIN'].fillna(value='unknown', inplace=True)
df['drive'].fillna(value='unknown', inplace=True)
df['size'].fillna(value='unknown', inplace=True)
df['type'].fillna(value='unknown', inplace=True)
df['paint_color'].fillna(value='unknown', inplace=True)

# Convert data types
df['year'] = df['year'].astype(int)
df['odometer'] = df['odometer'].astype(int)

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Check for and handle outliers if necessary
# Save the cleaned dataset
# df.to_csv('cleaned_dataset.csv', index=False)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 389604 entries, 27 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            389604 non-null  int64 
 1   region        389604 non-null  object
 2   price         389604 non-null  int64 
 3   year          389604 non-null  int64 
 4   manufacturer  389604 non-null  object
 5   model         389604 non-null  object
 6   condition     389604 non-null  object
 7   cylinders     389604 non-null  object
 8   fuel          389604 non-null  object
 9   odometer      389604 non-null  int64 
 10  title_status  389604 non-null  object
 11  transmission  389604 non-null  object
 12  VIN           389604 non-null  object
 13  drive         389604 non-null  object
 14  size          389604 non-null  object
 15  type          389604 non-null  object
 16  paint_color   389604 non-null  object
 17  state         389604 non-null  object
dtypes: int64(4), object(14)
memo

In [11]:
# Step 3 - Construct the data
# Define mapping of condition values to integers
condition_mapping = {
    'salvage': 0,
    'poor': 1,
    'fair': 2,
    'good': 3,
    'like new': 4,
    'excellent': 5,
    'new': 6,
}

# Map condition values to integers
df['condition'] = df['condition'].map(condition_mapping)

# Define a function to extract the number of cylinders
def extract_cylinders(s):
    if pd.isnull(s) or s == 'unknown' or s == 'other':
        return 0
    else:
        return int(s.split()[0])

# Apply the function to the 'cylinders' column
df['cylinders'] = df['cylinders'].apply(extract_cylinders)

#Dropping VIN as it wont impact valuation
df = df.drop(columns=['VIN'])




In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 389604 entries, 27 to 426879
Data columns (total 17 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            389604 non-null  int64  
 1   region        389604 non-null  object 
 2   price         389604 non-null  int64  
 3   year          389604 non-null  int64  
 4   manufacturer  389604 non-null  object 
 5   model         389604 non-null  object 
 6   condition     232322 non-null  float64
 7   cylinders     389604 non-null  int64  
 8   fuel          389604 non-null  object 
 9   odometer      389604 non-null  int64  
 10  title_status  389604 non-null  object 
 11  transmission  389604 non-null  object 
 12  drive         389604 non-null  object 
 13  size          389604 non-null  object 
 14  type          389604 non-null  object 
 15  paint_color   389604 non-null  object 
 16  state         389604 non-null  object 
dtypes: float64(1), int64(5), object(11)
memory usage: 53

In [13]:
#df.fillna('unknown', inplace=True)  # Fill other categorical columns with 'unknown'
from sklearn.calibration import LabelEncoder


df['condition'].fillna(df['condition'].median(), inplace=True)

# Convert data types
df['condition'] = df['condition'].astype(int)
df['cylinders'] = df['cylinders'].astype(int)
# Fit and transform the 'model' column
label_encoder = LabelEncoder()

df['model_encoded'] = label_encoder.fit_transform(df['model'].astype(str))

# Fit and transform the 'manufacturer' column
df['manufacturer_encoded'] = label_encoder.fit_transform(df['manufacturer'].astype(str))

df['region_encoded'] = label_encoder.fit_transform(df['region'].astype(str))

df['state_encoded'] = label_encoder.fit_transform(df['state'].astype(str))

df = df.drop(columns=['region'])
df = df.drop(columns=['state'])
df = df.drop(columns=['manufacturer'])
df = df.drop(columns=['model'])




# Categorical encoding (One-hot encoding)
cat_columns = ['fuel', 'title_status', 'transmission', 'drive', 'size', 'type', 'paint_color']
df = pd.get_dummies(df, columns=cat_columns)

# Feature engineering (Example: Calculate car age)
current_year = pd.Timestamp.now().year
df['car_age'] = current_year - df['year']

In [14]:
df

Unnamed: 0,id,price,year,condition,cylinders,odometer,model_encoded,manufacturer_encoded,region_encoded,state_encoded,...,paint_color_green,paint_color_grey,paint_color_orange,paint_color_purple,paint_color_red,paint_color_silver,paint_color_unknown,paint_color_white,paint_color_yellow,car_age
27,7316814884,33590,2014,3,8,57923,17164,14,16,1,...,False,False,False,False,False,False,False,True,False,10
28,7316814758,22590,2010,3,8,71229,17546,7,16,1,...,False,False,False,False,False,False,False,False,False,14
29,7316814989,39590,2020,3,8,19160,17571,7,16,1,...,False,False,False,False,True,False,False,False,False,4
30,7316743432,30990,2017,3,8,41124,20272,38,16,1,...,False,False,False,False,True,False,False,False,False,7
31,7316356412,15000,2013,5,6,128000,8732,13,16,1,...,False,False,False,False,False,False,False,False,False,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
426875,7301591192,23590,2019,3,6,32226,13281,30,397,50,...,False,False,False,False,False,False,True,False,False,5
426876,7301591187,30590,2020,3,0,12029,16586,40,397,50,...,False,False,False,False,True,False,False,False,False,4
426877,7301591147,34990,2020,3,0,4174,21577,6,397,50,...,False,False,False,False,False,False,False,True,False,4
426878,7301591140,28990,2018,3,6,30112,7828,23,397,50,...,False,False,False,False,False,True,False,False,False,6


In [15]:
# Step 4 - Integrate the data
#there is no plan to join tables so we will skip this step
#But below is sample code to conduct such activities



# # Load the data from multiple tables or records
# # Replace 'table1.csv', 'table2.csv', etc. with the actual file paths
# table1 = pd.read_csv('table1.csv')
# table2 = pd.read_csv('table2.csv')
# # Load additional tables if needed

# # Perform data integration
# # Example 1: Merge tables based on common key(s)
# merged_data = pd.merge(table1, table2, on='common_key', how='inner')

# # Example 2: Concatenate tables along rows (stack vertically)
# concatenated_data = pd.concat([table1, table2], axis=0)

# # Example 3: Concatenate tables along columns (stack horizontally)
# concatenated_data = pd.concat([table1, table2], axis=1)

# # Example 4: Join tables based on common key(s)
# # For more complex joins, you can use the merge function with different join types (inner, outer, left, right)

# # Example 5: Append one table to another
# appended_data = table1.append(table2)

# # Example 6: Combine tables using database-style join operations
# # You can use the merge function with different parameters to perform database-style joins

# # Example 7: Perform more complex data integration operations as needed

# # Display the integrated data
# print(merged_data.head())
# print(concatenated_data.head())
# print(appended_data.head())

In [16]:
# Step 5 - Format the data

# Perform formatting transformations
# Example 1: Convert column names to lowercase
#data.columns = data.columns.str.lower()
# not required

# Example 2: Remove leading and trailing whitespaces from column values
#data['column_name'] = data['column_name'].str.strip()
# we wont use this since they are all good

# Example 3: Convert categorical variables to numerical values using label encoding
# from sklearn.preprocessing import LabelEncoder
# encoder = LabelEncoder()
#data['encoded_column'] = encoder.fit_transform(data['categorical_column'])
# we wont use this as we already Hot Encoded it

# Example 4: Convert datetime strings to datetime objects
# data['date_column'] = pd.to_datetime(data['date_column'])
#no date time

# Example 5: Convert numerical values to categorical variables based on bins
# data['bin_column'] = pd.cut(data['numeric_column'], bins=5, labels=['bin1', 'bin2', 'bin3', 'bin4', 'bin5'])
# handled with hot encoding

# Example 6: Convert boolean values to integers
# data['boolean_column'] = data['boolean_column'].astype(int)
#already done 

# Example 7: Convert text data to lowercase or uppercase
# data['text_column'] = data['text_column'].str.lower()
# not required

# Example 8: Replace missing values with a specific value
# data.fillna(value='missing', inplace=True)
# already handled

# Example 9: Normalize numerical values to a specific range
# from sklearn.preprocessing import MinMaxScaler
# scaler = MinMaxScaler()
# data['normalized_column'] = scaler.fit_transform(data[['numerical_column']])
#scaling will be done later

# Example 10: Perform additional formatting transformations as needed

# Display the formatted data
print(df.info)
# Display additional formatted data as needed


<bound method DataFrame.info of                 id  price  year  condition  cylinders  odometer  \
27      7316814884  33590  2014          3          8     57923   
28      7316814758  22590  2010          3          8     71229   
29      7316814989  39590  2020          3          8     19160   
30      7316743432  30990  2017          3          8     41124   
31      7316356412  15000  2013          5          6    128000   
...            ...    ...   ...        ...        ...       ...   
426875  7301591192  23590  2019          3          6     32226   
426876  7301591187  30590  2020          3          0     12029   
426877  7301591147  34990  2020          3          0      4174   
426878  7301591140  28990  2018          3          6     30112   
426879  7301591129  30590  2019          3          0     22716   

        model_encoded  manufacturer_encoded  region_encoded  state_encoded  \
27              17164                    14              16              1   
28     

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [17]:
#4.1 Select modeling technique
# 3. Modeling
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import silhouette_score, accuracy_score, r2_score, mean_squared_error

# Extract features (X) and target (y)
X = df.drop(columns=['price'])  
y = df['price']  # Select the 'price' column as the target variable

# Optionally, you can perform further preprocessing on X and y, such as handling missing values, encoding categorical variables, or scaling numeric features.

# Example preprocessing (replace with your actual preprocessing steps)
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Handle missing values (replace NaNs with median)
imputer = SimpleImputer(strategy='median')
X = imputer.fit_transform(X)

# Scale features to have zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Now, X and y are ready for modeling

# 4.2 Generate test design
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4.3 Build model
# Clustering (K-Means)
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(X_train)
cluster_labels = kmeans.labels_

# K-Nearest Neighbors (KNN)
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_regressor = KNeighborsRegressor(n_neighbors=5)
knn_classifier.fit(X_train, y_train)
knn_regressor.fit(X_train, y_train)
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid for KNN classifier
knn_classifier_param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
}

# Hyperparameter grid for KNN regressor
knn_regressor_param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
}

# Perform grid search for KNN classifier
knn_classifier_grid_search = GridSearchCV(knn_classifier, knn_classifier_param_grid, cv=5, n_jobs=-1)
knn_classifier_grid_search.fit(X_train, y_train)

# Perform grid search for KNN regressor
knn_regressor_grid_search = GridSearchCV(knn_regressor, knn_regressor_param_grid, cv=5, n_jobs=-1)
knn_regressor_grid_search.fit(X_train, y_train)

# Get best hyperparameters and retrain models
best_knn_classifier = knn_classifier_grid_search.best_estimator_
best_knn_regressor = knn_regressor_grid_search.best_estimator_
best_knn_classifier.fit(X_train, y_train)
best_knn_regressor.fit(X_train, y_train)

# Linear Regression
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

  super()._check_params_vs_input(X, default_n_init=10)


Clustering (K-Means) - Silhouette Score: 0.04641680001579937
KNN Classifier - Accuracy: 0.295401753057584
KNN Regressor - RMSE: 15024232.566455124
Linear Regression - R-squared: 5.6335088234882313e-05
Linear Regression - RMSE: 14865772.73476288


In [60]:
from sklearn.metrics import mean_absolute_error, silhouette_score, davies_bouldin_score, accuracy_score, precision_score, recall_score, f1_score, mean_squared_error, r2_score
# 4. Comparison
#4.4 Assess model

def assess_models(models, X_test, y_test):
    assessment_report = {}
    
    for model_name, model in models.items():
        assessment_report[model_name] = {}
        if 'clustering' in model_name.lower():
            y_pred = model.predict(X_test)
            # Clustering evaluation metrics
            y_pred = model.predict(X_test)
            accuracy = accuracy_score(y_test, y_pred)
            precision = precision_score(y_test, y_pred, average='weighted')
            recall = recall_score(y_test, y_pred, average='weighted')
            f1 = f1_score(y_test, y_pred, average='weighted')
            mae = mean_absolute_error(y_test, y_pred)
            mse = mean_squared_error(y_test, y_pred)
            inertia = model.inertia_
            db_index = davies_bouldin_score(X_test, model.labels_)
            r2 = r2_score(y_test, y_pred)


            assessment_report[model_name]['Inertia'] = inertia
            assessment_report[model_name]['Davies-Bouldin Index'] = db_index
            assessment_report[model_name]['R-squared'] = r2
            assessment_report[model_name]['mae'] = mae
            assessment_report[model_name]['Mean Squared Error'] = mse            
            assessment_report[model_name]['Accuracy'] = accuracy
            assessment_report[model_name]['Precision'] = precision
            assessment_report[model_name]['Recall'] = recall
            assessment_report[model_name]['F1-Score'] = f1
        elif 'knn class' in model_name.lower():
            # KNN evaluation metrics
            y_pred = model.predict(X_test)
            accuracy = accuracy_score(y_test, y_pred)
            precision = precision_score(y_test, y_pred, average='weighted')
            recall = recall_score(y_test, y_pred, average='weighted')
            f1 = f1_score(y_test, y_pred, average='weighted')
            mae = mean_absolute_error(y_test, y_pred)
            mse = mean_squared_error(y_test, y_pred)
            r2 = r2_score(y_test, y_pred)

            assessment_report[model_name]['R-squared'] = r2
            assessment_report[model_name]['mae'] = mae
            assessment_report[model_name]['Mean Squared Error'] = mse            
            assessment_report[model_name]['Accuracy'] = accuracy
            assessment_report[model_name]['Precision'] = precision
            assessment_report[model_name]['Recall'] = recall
            assessment_report[model_name]['F1-Score'] = f1
        elif 'knn r' in model_name.lower():
            # KNN evaluation metrics
            # KNN evaluation metrics
            y_pred = model.predict(X_test)
            mae = mean_absolute_error(y_test, y_pred)
            mse = mean_squared_error(y_test, y_pred)
            #inertia = model.inertia_
            #db_index = davies_bouldin_score(X_test, model.labels_)
            r2 = r2_score(y_test, y_pred)

            #assessment_report[model_name]['Inertia'] = inertia
            #assessment_report[model_name]['Davies-Bouldin Index'] = db_index
            assessment_report[model_name]['R-squared'] = r2
            assessment_report[model_name]['mae'] = mae
            assessment_report[model_name]['Mean Squared Error'] = mse            
            #assessment_report[model_name]['Accuracy'] = accuracy
            #assessment_report[model_name]['Precision'] = precision
            #assessment_report[model_name]['Recall'] = recall
            #assessment_report[model_name]['F1-Score'] = f1
        elif 'linear regression' in model_name.lower():
            # Linear Regression evaluation metrics
            y_pred = model.predict(X_test)
            # accuracy = accuracy_score(y_test, y_pred)
            # recall = recall_score(y_test, y_pred, average='weighted')
            # f1 = f1_score(y_test, y_pred, average='weighted')
            mae = mean_absolute_error(y_test, y_pred)
            mse = mean_squared_error(y_test, y_pred)
            # inertia = model.inertia_
            # db_index = davies_bouldin_score(X_test, model.labels_)
            r2 = r2_score(y_test, y_pred)


            # assessment_report[model_name]['Inertia'] = inertia
            # assessment_report[model_name]['Davies-Bouldin Index'] = db_index
            assessment_report[model_name]['R-squared'] = r2
            assessment_report[model_name]['mae'] = mae
            assessment_report[model_name]['Mean Squared Error'] = mse            
            # assessment_report[model_name]['Accuracy'] = accuracy
            #assessment_report[model_name]['Precision'] = precision
            # assessment_report[model_name]['Recall'] = recall
            # assessment_report[model_name]['F1-Score'] = f1
        elif 'kmeans' in model_name.lower():
            y_pred = model.predict(X_test)

            # KMeans evaluation metrics
            y_pred = model.predict(X_test)
            accuracy = accuracy_score(y_test, y_pred)
            precision = precision_score(y_test, y_pred, average='weighted')
            recall = recall_score(y_test, y_pred, average='weighted')
            f1 = f1_score(y_test, y_pred, average='weighted')
            mae = mean_absolute_error(y_test, y_pred)
            mse = mean_squared_error(y_test, y_pred)
            inertia = model.inertia_
            db_index = davies_bouldin_score(X_test, model.labels_)
            r2 = r2_score(y_test, y_pred)


            assessment_report[model_name]['Inertia'] = inertia
            assessment_report[model_name]['Davies-Bouldin Index'] = db_index
            assessment_report[model_name]['R-squared'] = r2
            assessment_report[model_name]['mae'] = mae
            assessment_report[model_name]['Mean Squared Error'] = mse            
            assessment_report[model_name]['Accuracy'] = accuracy
            assessment_report[model_name]['Precision'] = precision
            assessment_report[model_name]['Recall'] = recall
            assessment_report[model_name]['F1-Score'] = f1
    
    return assessment_report

# Example usage
models = {
    'KMeans': kmeans,
    'best_knn_classifier': best_knn_classifier,
    'best_knn_regressor': best_knn_regressor,
    'Linear Regression': linear_reg
}

assessment_report = assess_models(models, X_train, y_train)
#assessment_report = assess_models(models, X_test, y_test)
print(assessment_report)


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


{'KMeans': {'Inertia': 16369681.761597712, 'Davies-Bouldin Index': 3.35622096828084, 'R-squared': 0.5450104118858993, 'mae': 54589.04655691841, 'Mean Squared Error': 49406193906161.55, 'Accuracy': 0.026757956000166835, 'Precision': 0.006863771793921942, 'Recall': 0.026757956000166835, 'F1-Score': 0.010804122705244374}, 'KNN Classifier': {'R-squared': 0.5450176513713592, 'mae': 42013.17150117266, 'Mean Squared Error': 49405407788343.125, 'Accuracy': 0.4898502645315914, 'Precision': 0.4850720916283691, 'Recall': 0.4898502645315914, 'F1-Score': 0.4708255091553344}, 'KNN Regresskr': {'R-squared': 0.20554637796021868, 'mae': 60423.93142327298, 'Mean Squared Error': 86267753648258.42}, 'Linear Regression': {'R-squared': 8.773363863867623e-05, 'mae': 121068.92876861448, 'Mean Squared Error': 108577999610421.94}}


In [61]:
import matplotlib.pyplot as plt

# Convert assessment report to a DataFrame
df = pd.DataFrame.from_dict(assessment_report, orient='index')

# Plot the assessment report as a table
# plt.figure(figsize=(10, 6))
# plt.table(cellText=df.values,
#           colLabels=df.columns,
#           rowLabels=df.index,
#           loc='center')
# plt.axis('off')
# plt.title('Assessment Report')
# plt.show()

df.head()

Unnamed: 0,Inertia,Davies-Bouldin Index,R-squared,mae,Mean Squared Error,Accuracy,Precision,Recall,F1-Score
KMeans,16369680.0,3.356221,0.54501,54589.046557,49406190000000.0,0.026758,0.006864,0.026758,0.010804
KNN Classifier,,,0.545018,42013.171501,49405410000000.0,0.48985,0.485072,0.48985,0.470826
KNN Regresskr,,,0.205546,60423.931423,86267750000000.0,,,,
Linear Regression,,,8.8e-05,121068.928769,108578000000000.0,,,,


### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

## Evaluation Results

### Assessment of Data Mining Results

- **Model Performance**: The KNN Classifier model achieved an accuracy of 0.48985, while the KNN Regressor model has a very high Mean Squared Error of 8.626775e+13 and a low R-squared value of 0.205546. The Linear Regression model also has a very high Mean Squared Error of 1.085780e+14 and an extremely low R-squared value of 0.000088. The KMeans model's performance metrics are not available.
  
- **Summarization**: While the models evaluated did not meet the initial business objectives due to poor performance, there may still be value in proceeding with deployment to gather real-world feedback.

### Approved Models

- **Identified Models**: None of the models meet the selected criteria for approval due to unsatisfactory performance.

### Review Process

- **Thorough Review**: A thorough review of the data mining engagement was conducted.
  
- **Quality Assurance**: Quality assurance checks were performed on the correctness of the model-building process and attribute usage.

## Determination of Next Steps

### Potential Further Actions

- **Option 1**: Despite the subpar performance, consider finishing the project and moving to deployment to gather real-world feedback.
  
- **Option 2**: Initiating further iterations to improve models may still be necessary to refine performance.
  
- **Option 3**: Setting up new data mining projects for additional analyses may also be considered to explore alternative modeling approaches.

### Decision

- **Chosen Action**: The chosen action is to proceed with deployment despite the current models' poor performance.
  
- **Rationale**: While the models may not meet the desired performance metrics, deploying them can provide valuable insights and real-world feedback that may inform future iterations and improvements. This decision is based on the assessment results and acknowledges the potential benefits of gathering real-world data.


### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.