<a href="https://colab.research.google.com/github/MarcusLongton/Used_Cars_Analysis/blob/main/Car_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary.

Our task is to determine what customers value in a used car. Framed as a data problem, this is a supervised regression task with `price` as the target variable. Our answer will come in the form of the names of several features of our dataset that we conclude hold the most significance, or in other words, are most strongly correlated with the target. We will go through several steps to solve this problem. The first is making sure we have a solid understanding of the question we are actually being asked.
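As a rough illustration of what "most correlated with the target" can mean in practice, the sketch below ranks numeric columns by the absolute value of their Pearson correlation with `price`. The helper name `rank_numeric_drivers` and the toy data are hypothetical, invented for this example; they are not taken from the dataset. Pearson correlation only captures linear relationships between numeric columns, so this is a first pass rather than a full answer.

```python
import pandas as pd


def rank_numeric_drivers(frame: pd.DataFrame, target: str = "price") -> pd.Series:
    """Absolute Pearson correlation of each numeric column with the target, strongest first."""
    numeric = frame.select_dtypes(include="number")
    correlations = numeric.corr()[target].drop(target)
    return correlations.abs().sort_values(ascending=False)


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "price": [20000, 15000, 12000, 8000, 5000],
    "year": [2020, 2018, 2016, 2012, 2008],
    "odometer": [10000, 40000, 60000, 90000, 150000],
    "doors": [4, 2, 4, 2, 4],
})

ranking = rank_numeric_drivers(toy)
print(ranking)
```

Categorical features such as manufacturer or condition need a different treatment (for example encoding followed by a regularized model), which is why correlation alone cannot settle the question.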

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

Here we take our first look at the dataset itself. There are several things we want to explore, including the structure of the dataset, the data types it contains, and any missing or poorly recorded values. In addition, we will begin constructing some visualizations to aid our understanding of the data.
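The checks listed above (structure, data types, missing values) can be gathered into one table per column. The following is a minimal sketch, assuming a hypothetical helper called `summarize_columns` and a small invented DataFrame; the real notebook performs the same checks cell by cell.

```python
import numpy as np
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, percentage of nulls, and number of distinct non-null values."""
    summary = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "null_pct": frame.isna().mean().mul(100).round(1),
        "n_unique": frame.nunique(),
    })
    return summary.sort_values("null_pct", ascending=False)


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "price": [0, 4500, 12000, 987654321],
    "condition": ["good", None, "excellent", None],
    "odometer": [120000.0, np.nan, 45000.0, 80000.0],
})

summary = summarize_columns(toy)
print(summary)
```

Sorting by `null_pct` puts the columns most in need of attention at the top, which helps when deciding whether to drop a column, drop rows, or fill in a placeholder value.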

In [None]:
# Begin by importing the libraries that we will be using
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector

In [None]:
# Reading our csv file into a pandas DataFrame
filename = '/content/vehicles.csv'
df = pd.read_csv(filename)
df

In [None]:
# Listing our data types
df.dtypes

The cell above shows that most of our features are categorical (stored with the `object` dtype).

In [None]:
# Describe our data
df.describe(include='all')

In [None]:
df.columns # Listing all features

In [None]:
df['price'].isna().value_counts() # Price is the variable that we will try to predict.

In [None]:
# Looking at how many missing values we have per feature
missing_vals = df.isnull().sum().sort_values()
print(missing_vals)

In [None]:
# Count the number of null values in each row
null_count_per_row = df.isnull().sum(axis=1)

# Find rows with more than 5 null values
rows_with_more_than_5_nulls = null_count_per_row[null_count_per_row > 5]

# Get the number of such rows
num_rows_with_more_than_5_nulls = rows_with_more_than_5_nulls.count()

print(f'Number of rows with more than 5 null values: {num_rows_with_more_than_5_nulls}')

In [None]:
# Looking at the distribution of prices
plt.figure(figsize=(8,5))
sns.histplot(df['price'], bins=30, kde=True)
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.title('Histogram of Price')
plt.show()


This initial histogram raises more questions than it answers about our dependent variable. My hypothesis is that `price` contains some zero values as well as some extreme high outliers that are skewing the histogram.
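One way to test this hypothesis before plotting again is to compare the share of zero prices and the maximum against the median and the 99th percentile. The sketch below uses a hypothetical helper, `price_tail_report`, and invented prices; it only illustrates the kind of check meant here.

```python
import pandas as pd


def price_tail_report(prices: pd.Series) -> dict:
    """Summarize how many prices are zero and how far the maximum sits from typical values."""
    return {
        "share_zero": float((prices == 0).mean()),
        "median": float(prices.median()),
        "p99": float(prices.quantile(0.99)),
        "max": float(prices.max()),
    }


# Toy prices, invented for illustration only
toy_prices = pd.Series([0, 0, 3500, 7200, 9800, 15000, 22000, 31000, 45000, 3_000_000_000])

report = price_tail_report(toy_prices)
print(report)
```

A maximum that is orders of magnitude above the median, as in this toy series, is enough to compress almost every listing into the first bin of a 30-bin histogram.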

In [None]:
df['price'].describe()

In [None]:
(df['price'] == 0).sum() # Number of rows with price equal to zero

In [None]:
df['price'].max()

In [None]:
# Step 1: Take the 1st and 99th percentiles of price and the spread between them
# (note: this is much wider than the usual 25th/75th-percentile IQR)
Q1 = df['price'].quantile(0.01)
Q3 = df['price'].quantile(0.99)
IQR = Q3 - Q1

# Step 2: Filter out the outliers (values outside the range of Q1 - 1.5*IQR to Q3 + 1.5*IQR)
df_no_outliers = df[(df['price'] >= (Q1 - 1.5 * IQR)) & (df['price'] <= (Q3 + 1.5 * IQR))]
df_no_outliers_no_zeros = df_no_outliers[df_no_outliers['price'] > 1000] # Also drop zero and very low prices (here, $1,000 or less)

# Step 3: Plot the histogram without outliers
plt.figure(figsize=(10, 6))
sns.histplot(df_no_outliers_no_zeros['price'], kde=True, bins=30)
plt.title('Histogram of Price (Without Outliers)')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

In our Data Preparation phase we will certainly have to remove all of the zero values from 'price'. One of the biggest questions will be how to deal with the sheer quantity of null values we have.
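The zero prices need their own filter because an IQR fence will not necessarily remove them: for right-skewed prices the lower fence is often negative. The sketch below shows this on invented data; `iqr_bounds` is a hypothetical helper name, and `k=1.5` is the conventional multiplier rather than a value tuned for this dataset.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy prices, invented for illustration only
toy_prices = pd.Series([0, 4000, 6000, 8000, 10000, 12000, 14000, 16000, 1_000_000])

lower, upper = iqr_bounds(toy_prices)
print(lower, upper)

# The lower fence is negative here, so zero prices must be excluded explicitly
kept = toy_prices[(toy_prices > 0) & toy_prices.between(lower, upper)]
print(kept.tolist())
```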

In [None]:
df_numeric = df.select_dtypes(include=['number'])
df_numeric.corr() # Looking at correlation of our numeric features

In [None]:
# Step 1: Calculate the IQR (Interquartile Range)
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1

# Step 2: Filter out the outliers (values outside the range of Q1 - 1.5*IQR to Q3 + 1.5*IQR)
df_no_outliers = df[(df['price'] >= (Q1 - 1.5 * IQR)) & (df['price'] <= (Q3 + 1.5 * IQR))]

In [None]:
df['odometer'].describe()

In [None]:
fig = px.scatter(df_no_outliers, x='year', y='price', title='Price vs Year', color='odometer', trendline='ols')
fig.show()

In [None]:
fig, ax = plt.subplots(1,2, figsize=(12,6)) # Figure with 1 row and 2 columns

sns.regplot(x='year', y='price', data=df_no_outliers, ax=ax[0], scatter_kws={'color': 'blue'}, line_kws={'color': 'red'})
ax[0].set_title('Price vs Year with Trendline')


sns.histplot(df_no_outliers['price'], kde=True, ax=ax[1])
ax[1].set_title('Distribution of Price')

plt.tight_layout()

plt.show()

In [None]:
# Sorting by price
df.sort_values(by='price', ascending=False).head(10) # Notice lots of values where the price is extremely high.

In [None]:
df.sort_values(by='price', ascending=True) # Notice lots of values where price is equal to zero

In [None]:
fig = plt.figure(figsize=(15, 6))
sns.regplot(x='year', y='price', data=df_no_outliers, scatter_kws={'color': 'blue'}, line_kws={'color': 'red'})


In [None]:
# Making a distribution of the numbers of each car condition.

In [None]:
man_order = df['manufacturer'].value_counts().index # Sort to show manufacturer from most to least common
plt.figure(figsize=(10, 6))


sns.countplot(x='manufacturer', data=df, order=man_order)
plt.xticks(rotation=90)
plt.title('Distribution of Car Manufacturers')
plt.xlabel('Manufacturer')
plt.ylabel('Number of Cars')
plt.show()

In [None]:
condition_order = df['condition'].value_counts().index # Sort to show condition from most to least common


plt.figure(figsize=(10, 6))  # Adjust figure size if needed
sns.countplot(x='condition', data=df, order=condition_order)
plt.title('Distribution of Car Conditions')
plt.xlabel('Condition')
plt.ylabel('Number of Cars')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels if they are long
plt.show()


### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`.

In [None]:
# Number of cars whose price is less than 500
# (a boolean sum counts every matching row, including rows with missing values in other columns)
(df['price'] < 500).sum()

In [None]:
df.query('price < 500 & price > 50').groupby('condition').size() # Shows us the number of cars between $50 and $500 based on their condition.

Referring to the above cell, I do not think it makes sense that the vast majority of cars priced between $50 and $500 fall into the excellent, good, or like new categories. When I filter my DataFrame, I will set 500 as my lower bound.
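The same check can be expressed as proportions rather than raw counts, which makes the imbalance easier to read. This is a small sketch on invented rows; `condition_share_in_band` is a hypothetical helper, and the bounds 50 and 500 simply mirror the query above.

```python
import pandas as pd


def condition_share_in_band(frame: pd.DataFrame, low: float, high: float) -> pd.Series:
    """Share of each condition among listings priced strictly between low and high."""
    band = frame[(frame["price"] > low) & (frame["price"] < high)]
    return band["condition"].value_counts(normalize=True)


# Toy listings, invented for illustration only
toy = pd.DataFrame({
    "price": [100, 250, 400, 450, 9000, 15000],
    "condition": ["excellent", "good", "like new", "salvage", "good", "excellent"],
})

shares = condition_share_in_band(toy, 50, 500)
print(shares)
```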

In [None]:
print(df.select_dtypes(include=['number'])) # Display numeric features

In [None]:
# Here I am going to create a DataFrame that excludes the extreme upper and lower values of our numeric variables

# Define filtering conditions
lower_bound = 500 # I will assign this to be 500 based on the cell above
upper_bound = df['price'].quantile(0.99) # 99th percentile

df_filtered = df[
    (df['price'] > lower_bound) & (df['price'] <= upper_bound) &  # Filter price
    (df['year'] < df['year'].quantile(0.99)) & (df['year'] > df['year'].quantile(0.01)) &  # Filter year
    (df['odometer'] < df['odometer'].quantile(0.99)) & (df['odometer'] > df['odometer'].quantile(0.01))  # Filter odometer
]

print(f'Price min: {df_filtered["price"].min()}')
print(f'Price max: {df_filtered["price"].max()}')
print(f'Year min: {df_filtered["year"].min()}')
print(f'Year max: {df_filtered["year"].max()}')
print(f'Odometer max: {df_filtered["odometer"].max()}')
print(f'Odometer min: {df_filtered["odometer"].min()}')

In [None]:
df_filtered.shape

In [None]:
# Visualize missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df_filtered.isnull(), cmap='viridis', cbar=False, yticklabels=False)
plt.title('Missing Data Heatmap')
plt.show()

In [None]:
# Calculate the percentage of null values for each feature
null_percentage = df_filtered.isnull().sum() * 100 / len(df_filtered)

# Create a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x=null_percentage.index, y=null_percentage.values)
plt.xticks(rotation=90)
plt.xlabel('Features')
plt.ylabel('Percentage of Null Values')
plt.title('Percentage of Null Values per Feature')
plt.show()


In [None]:
# I am going to drop some columns that are either irrelevant to predicting price or have too high a percentage of null values.
df_filtered = df_filtered.drop(columns=['size', 'VIN', 'id'])

In [None]:
df_filtered.dropna(subset=['year', 'price', 'odometer'], inplace=True) # Dropping nulls from our numeric features

In [None]:
# Calculate the percentage of null values for each feature
null_percentage = df_filtered.isnull().sum() * 100 / len(df_filtered)

# Create a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x=null_percentage.index, y=null_percentage.values)
plt.xticks(rotation=90)
plt.xlabel('Features')
plt.ylabel('Percentage of Null Values')
plt.title('Percentage of Null Values per Feature')
plt.show()

In [None]:
# Removing rows with missing values in specific columns
columns_to_remove_na = ['transmission', 'manufacturer', 'fuel', 'title_status', 'model']
initial_rows = len(df_filtered)

for col in columns_to_remove_na:
    df_filtered = df_filtered[df_filtered[col].notna()]  # Remove rows where the column is NaN

# Replacing NaN values with 'unknown' for categorical features
columns_to_fill_unknown = ['condition', 'fuel','drive', 'type', 'paint_color', 'state']

for col in columns_to_fill_unknown:
    df_filtered[col] = df_filtered[col].fillna('unknown')
    print(f"{col} distinct values:", df_filtered[col].unique(), "\n")

# Handling 'cylinders' column separately
df_filtered['cylinders'] = df_filtered['cylinders'].fillna('unknown').replace('other', 'unknown')  # Replace NaN and "other"
df_filtered = df_filtered[df_filtered['cylinders'] != 'unknown'].copy()  # Drop rows where cylinders = 'unknown'
df_filtered['cylinders'] = df_filtered['cylinders'].str.replace('cylinders', '', regex=True).str.strip()  # Remove text
df_filtered['cylinders'] = pd.to_numeric(df_filtered['cylinders'], errors='coerce')  # Convert to numeric (only care about number of cylinders)
print("cylinders distinct values:", df_filtered['cylinders'].unique(), "\n")

In [None]:
# Make sure we have addressed all of the missing values.
df_filtered.isna().sum()

The biggest challenge for me now is keeping my dimensionality in check. I need to one-hot encode or ordinally encode many of the categorical features in my dataset. To simplify this, I am going to look at the frequencies of each of the categorical features and limit the unique values in each column to only those that occur most frequently.
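One way to limit each column to its most frequent values is to keep the top few categories and collapse everything else into a single placeholder before encoding. The sketch below is a hedged illustration: `keep_top_categories`, the `'other'` label, and the toy values are assumptions made for this example and are not part of the notebook's pipeline.

```python
import pandas as pd


def keep_top_categories(column: pd.Series, top_n: int, other_label: str = "other") -> pd.Series:
    """Keep the top_n most frequent categories and replace all remaining values with other_label."""
    top = column.value_counts().head(top_n).index
    # Values not in the top set (including missing values) become other_label
    return column.where(column.isin(top), other_label)


# Toy manufacturers, invented for illustration only
toy = pd.Series(["ford", "ford", "ford", "toyota", "toyota", "honda", "ferrari", "saab"])

collapsed = keep_top_categories(toy, top_n=2)
print(collapsed.value_counts())
```

With this approach a column that would have produced one dummy variable per distinct value produces at most `top_n + 1`, at the cost of losing the distinction between rare categories.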

In [None]:
# Calculate average price per car color
avg_price_by_color = df_filtered.groupby('paint_color')['price'].mean()

# Create the bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x=avg_price_by_color.index, y=avg_price_by_color.values)
plt.xlabel("Car Color")
plt.ylabel("Average Price")
plt.title("Average Price of Cars by Color")
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()


In [None]:
# Calculate average price per transmission type
avg_price_by_transmission = df_filtered.groupby('transmission')['price'].mean()

# Create the bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x=avg_price_by_transmission.index, y=avg_price_by_transmission.values)
plt.xlabel("Transmission Type")
plt.ylabel("Average Price")
plt.title("Average Price of Cars by Transmission")
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()


In [None]:
# Calculate average price per title_status
avg_price_by_title = df_filtered.groupby('title_status')['price'].mean()

# Create the bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x=avg_price_by_title.index, y=avg_price_by_title.values)
plt.xlabel("Title Status")
plt.ylabel("Average Price")
plt.title("Average Price of Cars by Title Status")
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()


In [None]:
# Calculate average price per car type
avg_price_by_type = df_filtered.groupby('type')['price'].mean()

# Create the bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x=avg_price_by_type.index, y=avg_price_by_type.values)
plt.xlabel("Type")
plt.ylabel("Average Price")
plt.title("Average Price of Cars by Type")
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()


In [None]:
df_filtered

In [None]:
clean_df = df_filtered.copy()
# Delete df_filtered to save RAM
del df_filtered

In [None]:
# Restrict the plot to listings with fewer than 400,000 miles on the odometer
odometer_filtered = clean_df[clean_df['odometer'] < 400000]

# Use the filtered DataFrame for x, y, and color
fig = px.scatter(
    odometer_filtered,
    x='price',
    y='odometer',
    color='price',
    title='Price vs. Odometer with Density',
    labels={'price': 'Price', 'odometer': 'Odometer'},
    trendline='ols',
    trendline_color_override='red'
)
fig.show()

In [None]:
# Odometer against year, colored by price
fig = px.scatter(
    clean_df,
    x='year',
    y='odometer',
    color='price',
    title='Odometer vs. Year with Density',
    labels={'year': 'Year', 'odometer': 'Odometer', 'price': 'Price'},
    trendline='ols',
    trendline_color_override='red'
)
fig.show()

In [None]:
# Price against year, colored by price
fig = px.scatter(
    clean_df,
    x='year',
    y='price',
    color='price',
    title='Year vs. Price with Density',
    labels={'year': 'Year', 'price': 'Price'},
    trendline='ols',
    trendline_color_override='red'
)
fig.show()

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [None]:
# Begin by separating our numerical features.
numerical_features = ['year','odometer','cylinders', 'price']
numerical_features

In [None]:
# Firstly we will look at some simple linear regressions with our numerical features.
numeric_df = clean_df[numerical_features]
numeric_df

In [None]:
X_numeric = numeric_df.drop(columns=['price'])
y_numeric = numeric_df['price']

In [None]:
# Simple linear regression on year
X_train, X_test, y_train, y_test = train_test_split(X_numeric[['year']], y_numeric, test_size=0.2, random_state=42)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
mse_year_linreg = mean_squared_error(y_test, y_pred)

sorted_idx = X_test['year'].argsort()
X_test_sorted = X_test.iloc[sorted_idx]
y_pred_sorted = y_pred[sorted_idx]

plt.figure(figsize=(15,6))
plt.scatter(X_test['year'], y_test, color='blue', label='Actual')
plt.plot(X_test_sorted['year'], y_pred_sorted, color='red', linewidth=2, label='Prediction')
plt.xlabel('Year')
plt.ylabel('Price')
plt.title('Linear Regression')
plt.legend()
plt.show()

print(f"Root Mean Squared Error (RMSE) for Linear Regression with 'Year' Feature: {np.sqrt(mse_year_linreg)}")

In [None]:
# Simple linear regression on odometer
X_train, X_test, y_train, y_test = train_test_split(X_numeric[['odometer']], y_numeric, test_size=0.2, random_state=42)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
mse_odometer_linreg = mean_squared_error(y_test, y_pred)

plt.figure(figsize=(15,6))
plt.scatter(X_test['odometer'], y_test, color='blue', label='Actual')
plt.plot(X_test['odometer'], y_pred, color='red', linewidth=2, label='Prediction')
plt.xlabel('Odometer')
plt.ylabel('Price')
plt.title('Linear Regression of Odometer and Price')
plt.legend()
plt.show()

print(f"Root Mean Squared Error (RMSE) for Linear Regression with 'Odometer' Feature: {np.sqrt(mse_odometer_linreg)}")

In [None]:
# Lets see if a multiple linear regression could improve our RMSE.
X_train, X_test, y_train, y_test = train_test_split(X_numeric, y_numeric, test_size=0.2, random_state=42)
multiple_linreg = LinearRegression()
multiple_linreg.fit(X_train, y_train)
y_pred = multiple_linreg.predict(X_test)

rmse_multiple_linreg = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error (RMSE) for Multiple Linear Regression with numeric features: {rmse_multiple_linreg}")

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

In [None]:
# Comparing our RMSE scores for our different models.
print(f"RMSE for Linear Regression with 'Year' Feature: {np.sqrt(mse_year_linreg)}")
print(f"RMSE for Linear Regression with 'Odometer' Feature: {np.sqrt(mse_odometer_linreg)}")
print(f"RMSE for Multiple Linear Regression with numeric features: {rmse_multiple_linreg}")

The above cell shows us that we achieved the lowest RMSE when using multiple linear regression.
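A comparison based on a single train/test split can depend on how the rows happened to be divided, so it can help to confirm the ranking with cross-validation. The sketch below does this on synthetic data generated for the example; the coefficients, noise level, and the helper name `cv_rmse` are all assumptions and say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, invented for illustration only
rng = np.random.default_rng(42)
n = 500
year = rng.integers(2000, 2021, size=n).astype(float)
odometer = rng.uniform(5_000, 200_000, size=n)
price = 1_000 * (year - 2000) - 0.05 * odometer + 15_000 + rng.normal(0, 1_000, size=n)
X_all = np.column_stack([year, odometer])


def cv_rmse(X, y, folds=5):
    """Mean RMSE of a plain linear regression across k cross-validation folds."""
    scores = cross_val_score(
        LinearRegression(), X, y, cv=folds, scoring="neg_root_mean_squared_error"
    )
    return -scores.mean()  # scikit-learn reports the negated RMSE, so flip the sign


rmse_year = cv_rmse(X_all[:, [0]], price)
rmse_both = cv_rmse(X_all, price)
print(f"CV RMSE, year only: {rmse_year:.0f}")
print(f"CV RMSE, year + odometer: {rmse_both:.0f}")
```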

In [None]:
# list of all the columns with a dtype == 'object' (isolates our categorical variables)
object_columns = clean_df.select_dtypes(include=['object']).columns.tolist()
object_columns

In [None]:
clean_df[object_columns].nunique().sort_values(ascending=False)

In [None]:
# We are going to drop the model, region, and state features due to their high cardinality.
clean_df = clean_df.drop(columns=['model', 'region', 'state'])

In [None]:
clean_df[clean_df.select_dtypes(include=['object']).columns.tolist()].nunique().sort_values(ascending=False)

In [None]:
# Making our dummy variables (price is the target, so it must not appear among the features)
X_categorical = pd.get_dummies(clean_df.drop(columns=['price']), drop_first=True)
y_categorical = clean_df['price']

In [None]:
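# Separate datasets first so that the scaler only learns from the training rows
X_train, X_test, y_train, y_test = train_test_split(X_categorical, y_categorical, test_size=0.2, random_state=42)

In [None]:
# Standardizing (fit on the training set, then apply the same scaling to the test set)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)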

In [None]:
# We are going to use PCA to help with dimensionality reduction
pca = PCA(n_components=0.95) # Keep enough components to explain 95% of the variance, balancing overfitting against over-simplification
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

In [None]:
# Using sequential feature selection to keep 5 inputs. Note that the inputs here are PCA components, not original features.
linreg_sfs = LinearRegression()
selector = SequentialFeatureSelector(linreg_sfs, n_features_to_select=5, direction='forward')
selector.fit(X_train_pca, y_train)
X_train_sfs = selector.transform(X_train_pca)
X_test_sfs = selector.transform(X_test_pca)

In [None]:
# Using GridSearchCV for optimizing Ridge and Lasso Regressions.

# Building parameter grid
parameter_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

# Creating GridSearchCV instances
ridge = GridSearchCV(Ridge(), parameter_grid, cv=5)
lasso = GridSearchCV(Lasso(), parameter_grid, cv=5)

# Training models
ridge.fit(X_train_sfs, y_train)
lasso.fit(X_train_sfs, y_train)

# Making predictions
lasso_preds = lasso.predict(X_test_sfs)
ridge_preds = ridge.predict(X_test_sfs)

In [None]:
# Evaluate Performance of Ridge vs Lasso Models


ridge_mse = mean_squared_error(y_test, ridge_preds) # Calculating MSE and RMSE
ridge_rsme = np.sqrt(ridge_mse)
lasso_mse = mean_squared_error(y_test, lasso_preds)
lasso_rsme = np.sqrt(lasso_mse)
r2_lasso = r2_score(y_test, lasso_preds)
r2_ridge = r2_score(y_test, ridge_preds)
# Printing results
print(f"MSE for Ridge Regression: {ridge_mse}")
print(f"MSE for Lasso Regression: {lasso_mse}")
print(f"RMSE for Ridge Regression: {ridge_rsme}")
print(f"RMSE for Lasso Regression: {lasso_rsme}")
print(f"R2 for Ridge Regression: {r2_ridge}")
print(f"R2 for Lasso Regression: {r2_lasso}")

In [None]:
# Side by side subplots comparing the Actual vs predicted for Lasso and Ridge

fig, ax = plt.subplots(1,2, figsize=(15,6)) # Figure with 1 row and 2 columns

sns.regplot(x=y_test, y=lasso_preds, ax=ax[0], scatter_kws={'color': 'blue'}, line_kws={'color': 'red'})
ax[0].set_title('Lasso Regression')
ax[0].set_xlabel('Actual Price')
ax[0].set_ylabel('Predicted Price')

sns.regplot(x=y_test, y=ridge_preds, ax=ax[1], scatter_kws={'color': 'blue'}, line_kws={'color': 'red'})
ax[1].set_title('Ridge Regression')
ax[1].set_xlabel('Actual Price')
ax[1].set_ylabel('Predicted Price')

plt.tight_layout()

plt.show()

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.
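For a report aimed at dealers, it can help to list which features a fitted model weights most heavily. The sketch below fits a Lasso inside a `Pipeline` with a `StandardScaler`, so the coefficients are on a comparable scale, and then ranks them by magnitude. Everything here is an assumption made for illustration: the data is synthetic, `alpha=10.0` is arbitrary, and the `noise` column exists only to show an uninformative feature landing at the bottom.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, invented for illustration only
rng = np.random.default_rng(0)
n = 400
features = pd.DataFrame({
    "year": rng.integers(2000, 2021, size=n).astype(float),
    "odometer": rng.uniform(5_000, 200_000, size=n),
    "noise": rng.normal(size=n),
})
price = (
    800 * (features["year"] - 2000)
    - 0.04 * features["odometer"]
    + 12_000
    + rng.normal(0, 500, size=n)
)

# Scaling inside the pipeline keeps the coefficients comparable across features
model = Pipeline([("scale", StandardScaler()), ("lasso", Lasso(alpha=10.0))])
model.fit(features, price)

coefs = pd.Series(model.named_steps["lasso"].coef_, index=features.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

The sign of each coefficient gives the direction of the association (here, newer cars cost more and higher mileage costs less), which is usually the part of the result a non-technical audience cares about. This only works on original features; coefficients fitted on PCA components cannot be read this way.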

In [None]:
# Create a DataFrame for the visualization
results_df = pd.DataFrame({'Actual Price': y_test, 'Predicted Price (Lasso)': lasso_preds})

# Group data by year for mean values
results_by_year = results_df.groupby(clean_df['year']).mean()

# Create the plot
plt.figure(figsize=(15, 6))
plt.plot(results_by_year.index, results_by_year['Actual Price'], label='Actual Price', marker='o')
plt.plot(results_by_year.index, results_by_year['Predicted Price (Lasso)'], label='Predicted Price (Lasso)', marker='x')

plt.xlabel('Year')
plt.ylabel('Mean Price')
plt.title('Actual vs. Predicted Mean Price by Year')
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Create the subplots
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Subplot 1: Odometer vs. Price
sns.regplot(x='odometer', y='price', data=clean_df, ax=axes[0], scatter_kws={"color": "blue"}, line_kws={"color": "red"})
axes[0].set_title('Odometer vs. Price')

# Subplot 2: Year vs. Price
sns.regplot(x='year', y='price', data=clean_df, ax=axes[1], scatter_kws={"color": "blue"}, line_kws={"color": "red"})
axes[1].set_title('Year vs. Price')


plt.tight_layout()
plt.show()


In [None]:
# Delete dataframes to reduce file size

# del clean_df
# del results_df
# del df_numeric
# del df_no_outliers
# del results_by_year
# del odometer_filtered
# del X_numeric
# del y_numeric
# del X_train
# del X_test
# del y_train
# del y_test
# del X_scaled
# del X_categorical
# del y_categorical
# del X_train_pca
# del X_test_pca
# del X_train_sfs
# del X_test_sfs
# del linreg_sfs
# del selector
# del lasso
# del ridge
# del parameter_grid
# del lasso_preds
# del ridge_preds
# del ridge_mse
# del ridge_rsme
# del lasso_mse
# del lasso_rsme
# del df

