# Used Car Price Estimator

### Summary
[Auto.dev](https://www.auto.dev/) is an API that serves recent car listing data. Roughly 1,000 Toyota Camry listings were gathered, and a multiple linear regression was fitted to predict price from model year, mileage, and trim. Listings priced furthest below their fitted value (the most negative residuals) were flagged as the largest discounts.

### Next Steps
* A daily check-in that flags discounted prices among newly posted listings
* A tracker of new listing data for seasonal analysis

### Code:
#### Part 1: Setup

In [500]:
# Jupyter occasionally fails to find packages installed in other environments,
# so install into the environment of the running kernel
import sys
!{sys.executable} -m pip install --upgrade seaborn

Collecting seaborn
  Downloading seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Downloading seaborn-0.13.2-py3-none-any.whl (294 kB)
Installing collected packages: seaborn
Successfully installed seaborn-0.13.2


In [2]:
# Import relevant libraries
import requests
import json
import pandas as pd
import numpy as np
from pandas import json_normalize
import statsmodels.api as sm
import matplotlib.pyplot as plt
import sys
import math
from scipy.stats import chi2
import seaborn as sns

#### Part 2: Call API for Data
Roughly 1,000 Toyota Camry listings were gathered. Model years before 2016 were excluded, and no mileage filter was applied. Four trims were used for model building (SE, LE, XSE, XLE); every other trim had too few listings to estimate reliably and was excluded.

In [6]:
# Base URL for API requests
url = "https://auto.dev/api/listings"
# General parameters for model training
params = {
    'apikey': 'ZrQEPSkKbWF4bWVsbmlrYXNAZ21haWwuY29t',
    'sort_filter': 'created_at:desc',
    'year_min': 2016,
    'make': 'Toyota',
    'model': 'Camry',
    #'city': 'Boston',
    #'state': 'MA',
    #'location': 'Boston, MA',
    'longitude': -71.058884,
    'latitude': 42.360081,
    'radius': 200,
    'transmission[]': 'automatic',
    'exclude_no_price': 'true'
    # Remove or properly set empty parameters
}

# Each page holds 20 entries, so 50 pages are needed for ~1,000 listings
pages = list(range(1, 51))

# Initialize an empty list to store all records
all_data = []

# Loop over each page, fetch data, and append to the list
for page in pages:
    params['page'] = page
    response = requests.get(url, params=params)
    result = response.json()
    
    if 'records' in result:
        all_data.extend(result['records'])
    else:
        print(f"Warning: 'records' key not found on page {page}")

# Normalize the JSON data into a DataFrame
df = json_normalize(all_data)

# `df` now holds every record from the fetched pages
df  # Displays a truncated preview (first and last rows) of the DataFrame

Unnamed: 0,id,vin,displayColor,year,make,model,price,mileage,city,lat,...,trackingParams.remoteDealerId,trackingParams.dealerName,trackingParams.remoteSku,trackingParams.experience,trackingParams.rooftopUniqueName,trackingParams.rooftopUuid,trackingParams.dealerUniqueName,trackingParams.dealerUuid,trackingParams.dealerGroupUniqueName,trackingParams.dealerGroupUuid
0,295410867,4T1KZ1AK4PU075125,Midnight Black Metallic,2023,Toyota,Camry,"$39,900","6,071 Miles",Nashua,42.7286,...,327343,,N19155TC,local,,,,,,
1,295410866,4T1C11AK7MU423619,Predawn Gray Mica,2021,Toyota,Camry,"$23,600","44,746 Miles",Nashua,42.7286,...,327343,,N21449C,local,,,,,,
2,295410758,4T1DBADK2SU006855,Ice Cap,2025,Toyota,Camry,"$33,669",New,Bronx,40.8700,...,65202,,241012,local,,,,,,
3,295405971,4T1T11BKXNU048163,WHITE,2022,Toyota,Camry,"$23,499","57,170 Miles",Brooklyn,40.6064,...,453444,,53710,local,,,,,,
4,295405421,4T1G11BK3RU118223,Midnight Black Metallic,2024,Toyota,Camry,"$29,990","8,213 Miles",Schenectady,42.7557,...,190707,,RU118223R,local,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,291480983,4T1B11HK7JU645039,Midnight Black Metallic,2018,Toyota,Camry,"$17,995","21,203 Miles",Inwood,40.6230,...,342141,,645039,local,,,,,,
996,291457309,4T1G11BK2RU108671,,2024,Toyota,Camry,"$29,350","17,025 Miles",Little Ferry,40.8539,...,444472,,108671,local,,,,,,
997,291449119,4T1DBADK3SU31A457,White,2025,Toyota,Camry,"$34,973",New,West Springfield,42.1420,...,305851,,SU31A457,local,,,,,,
998,291436725,4T1C11AK6PU760769,White,2023,Toyota,Camry,"$20,514","30,701 Miles",Irvington,40.7247,...,405793,,760769,local,,,,,,
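
The paging loop above can be sketched as a small reusable helper. This is only an illustration, not the notebook's actual code: `fetch_all_records` and `make_http_fetcher` are hypothetical names, and unlike the loop above the sketch stops at the first page without records instead of continuing. It also raises on HTTP errors rather than parsing an error body.

```python
def fetch_all_records(fetch_page, max_pages=50):
    """Collect records page by page, stopping early at the first empty page.

    fetch_page is any callable that takes a page number and returns the
    decoded JSON payload (a dict) for that page.
    """
    records = []
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        batch = payload.get("records", [])
        if not batch:
            print(f"No records on page {page}; stopping early")
            break
        records.extend(batch)
    return records


def make_http_fetcher(url, base_params, timeout=30):
    """Build a fetch_page callable backed by requests (hypothetical helper)."""
    import requests  # imported lazily so the paging logic can be tested offline

    def fetch_page(page):
        response = requests.get(
            url, params={**base_params, "page": page}, timeout=timeout
        )
        response.raise_for_status()  # surface HTTP errors instead of parsing them
        return response.json()

    return fetch_page
```

With the notebook's `url` and `params`, a call might look like `all_data = fetch_all_records(make_http_fetcher(url, params))`. Reading the API key from an environment variable (for example a hypothetical `AUTO_DEV_API_KEY`) instead of hard-coding it would keep the key out of the notebook.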


#### Part 3: Clean data
Three car features are included in this model:
- Model Year
- Mileage
- Body Trim

In [22]:
def convert_mileage(value):
    value = value.strip()
    if value.lower() == 'new':
        return 0
    try:
        # Remove non-numeric characters and convert to int
        return int(value.replace(',', '').replace(' Miles', ''))
    except ValueError:
        # Handle unexpected cases
        return None

def convert_year(value):
    # Age in model years, counting 2025 as age 1 so the log is defined for the newest cars
    value = 2025 - value + 1
    return math.log(value)

def convert_price(value):
    if isinstance(value, str):
        value = value.replace('$', '').replace(',', '')
        try:
            return float(value)
        except ValueError:
            return None
    return None
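
As a worked example of the year transform in `convert_year`, the sketch below prints the age and log-age for a few model years. The name `log_age` and the `reference_year` parameter are illustrative only; the arithmetic is the same `log(2025 - year + 1)`.

```python
import math


def log_age(model_year, reference_year=2025):
    """Log of the car's age in model years, with the reference year as age 1."""
    return math.log(reference_year - model_year + 1)


for year in (2025, 2018, 2016):
    age = 2025 - year + 1
    print(f"{year}: age {age}, log-age {log_age(year):.3f}")

# 2025: age 1, log-age 0.000
# 2018: age 8, log-age 2.079
# 2016: age 10, log-age 2.303
```

The `+ 1` keeps the newest model year at age 1 rather than 0, so the logarithm is always defined. The log scale also means one extra year matters more for a newer car than for an older one.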

In [150]:
X = df[['year', 'mileageUnformatted', 'priceUnformatted', 'trim']]

X = X.dropna()

In [152]:
# Define the trims to include
trims_to_filter = ['SE', 'LE', 'XSE', 'XLE']

# Keep only the wanted trims; .copy() makes X_mod an independent frame rather
# than a view of X, which avoids pandas' SettingWithCopyWarning
X_mod = X[X['trim'].isin(trims_to_filter)].copy()
y = X_mod.pop('priceUnformatted')

# One-hot encode trim as 0/1 integers; drop_first=True drops the first level (LE) as the baseline
X_mod = pd.get_dummies(X_mod, columns=['trim'], drop_first=True, dtype=int)

X_mod = sm.add_constant(X_mod)

# Allow each model year to have an additional effect beyond the logarithmic age effect
temp = X_mod['year']
X_mod = pd.get_dummies(X_mod, columns=['year'], drop_first=False, dtype=int)
X_mod.pop('year_2025')  # 2025 serves as the baseline year

# Re-add the log-age term as the last column
X_mod['year'] = temp.apply(convert_year)

#### Part 4: Build model


In [154]:
regression_model = sm.GLM(y, X_mod, family=sm.families.Gaussian()).fit()

# Print the summary of the model
print(regression_model.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:       priceUnformatted   No. Observations:                  742
Model:                            GLM   Df Residuals:                      728
Model Family:                Gaussian   Df Model:                           13
Link Function:               Identity   Scale:                      4.7603e+06
Method:                          IRLS   Log-Likelihood:                -6750.2
Date:                Sun, 25 Aug 2024   Deviance:                   3.4655e+09
Time:                        12:17:34   Pearson chi2:                 3.47e+09
No. Iterations:                     3   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust                                         
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
const               3.493e+04    186
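
Written out, the design matrix built in Part 3 implies a model of roughly the following form. The symbols are introduced here for illustration and do not appear in the notebook; square brackets denote 0/1 indicators, with LE and 2025 as the baseline trim and year.

```latex
\text{price}_i = \beta_0
  + \beta_m \,\text{mileage}_i
  + \sum_{t \in \{\text{SE},\,\text{XLE},\,\text{XSE}\}} \beta_t \,[\text{trim}_i = t]
  + \sum_{k=2016}^{2024} \gamma_k \,[\text{year}_i = k]
  + \delta \,\ln(2025 - \text{year}_i + 1)
  + \varepsilon_i
```

With a Gaussian family and identity link, fitting this GLM is equivalent to ordinary least squares. The log-age term depends on the model year alone, so it is a linear combination of the intercept and the nine year indicators. That is consistent with the summary reporting `Df Model: 13` for 14 non-constant columns, and it means $\delta$ is not separately identified from the $\gamma_k$.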

#### Part 5: Evaluate model

This section estimates the price of a car from its features and identifies listings with large discounts, where a discount is the predicted price minus the listed price.

In [205]:
new_car = pd.DataFrame({'const': [1], 
                        'mileageUnformatted': [39000],
                        'trim_SE': [0],
                        'trim_XLE': [1],
                        'trim_XSE': [0],
                        'year_2016': [0],
                        'year_2017': [0],
                        'year_2018': [1],
                        'year_2019': [0],
                        'year_2020': [0],
                        'year_2021': [0],
                        'year_2022': [0],
                        'year_2023': [0],
                        'year_2024': [0],
                        'year': [math.log(2025 - 2018 + 1)]  # log(8): log-age term for a 2018 car
                       })
# Predict the price
regression_model.predict(new_car)

0    23781.970535
dtype: float64
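
Filling in the dummy columns by hand is error-prone, so a helper along these lines could build the row instead. It is a sketch only: `make_feature_row` is a hypothetical name, and the column order is assumed to match the `new_car` frame above.

```python
import math

TRIM_DUMMIES = ["SE", "XLE", "XSE"]     # LE is the baseline trim
YEAR_DUMMIES = list(range(2016, 2025))  # 2025 is the baseline year


def make_feature_row(model_year, mileage, trim):
    """Return one feature row as a dict, in the column order used by new_car."""
    if trim != "LE" and trim not in TRIM_DUMMIES:
        raise ValueError(f"unsupported trim: {trim!r}")
    if not 2016 <= model_year <= 2025:
        raise ValueError(f"unsupported model year: {model_year}")

    row = {"const": 1, "mileageUnformatted": mileage}
    for t in TRIM_DUMMIES:
        row[f"trim_{t}"] = int(trim == t)
    for y in YEAR_DUMMIES:
        row[f"year_{y}"] = int(model_year == y)
    row["year"] = math.log(2025 - model_year + 1)  # log-age term
    return row


row = make_feature_row(2018, 39000, "XLE")
```

Because dicts preserve insertion order, `pd.DataFrame([row])` would give the same columns as `new_car` and could be passed to `regression_model.predict`.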

In [158]:
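# With a Gaussian family, Pearson residuals equal actual minus predicted price (in dollars),
# so values below -5000 mark listings priced more than $5,000 under the estimate
regression_model.resid_pearson[regression_model.resid_pearson < -5000]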

17    -5058.260503
207   -5092.175621
218   -5886.915986
252   -5123.398309
694   -6174.300329
779   -5870.016413
797   -5554.333122
929   -5328.128593
998   -5528.780504
dtype: float64
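
The same filter can be expressed directly in terms of discounts (predicted minus listed price, the negative of the residual). The sketch below uses plain dicts and made-up numbers purely for illustration; `rank_discounts` and the `price`/`predicted` keys are hypothetical.

```python
def rank_discounts(listings, min_discount=5000):
    """Return listings whose discount exceeds min_discount, largest first.

    Each listing is a dict with 'price' (listed) and 'predicted' (model estimate).
    """
    flagged = []
    for item in listings:
        discount = item["predicted"] - item["price"]
        if discount > min_discount:
            flagged.append({**item, "discount": discount})
    return sorted(flagged, key=lambda d: d["discount"], reverse=True)


# Made-up numbers, not taken from the fitted model
sample = [
    {"id": "a", "price": 20000, "predicted": 26000},
    {"id": "b", "price": 25000, "predicted": 25500},
    {"id": "c", "price": 18000, "predicted": 23500},
]
best = rank_discounts(sample)
print([(d["id"], d["discount"]) for d in best])  # [('a', 6000), ('c', 5500)]
```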

In [148]:
pd.set_option('display.max_rows', 100)
# The residual index holds row labels carried over from df, so look the listing up by label
df.loc[207]

id                                                                              295050155
vin                                                                     4T1G11AKXPU184377
displayColor                                                                      Ice Cap
year                                                                                 2023
make                                                                               Toyota
model                                                                               Camry
price                                                                             $19,995
mileage                                                                      53,103 Miles
city                                                                               Newark
lat                                                                               40.7697
lon                                                                               -74.163
primaryPho

#### Part 6: Automate a daily newsletter

In [1]:
%run daily_newsletter.py

Found an entry older than 24 hours, stopping the loop.


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_mod['estimate_price'] = X_mod.apply(calculate_price, axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_mod['discount'] = X_mod['estimate_price'] - X_mod['priceUnformatted']
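
`daily_newsletter.py` is not shown here. Judging from its output, it walks the newest-first listings, stops at the first entry older than 24 hours, and computes `discount` as `estimate_price - priceUnformatted`. The cutoff step might look roughly like the sketch below. This is an assumption-laden illustration, not the script's actual code: `recent_listings` and the `createdAt` field name are hypothetical, and timestamps are assumed to be timezone-aware ISO 8601 strings.

```python
from datetime import datetime, timedelta, timezone


def recent_listings(records, now=None, max_age=timedelta(hours=24),
                    timestamp_key="createdAt"):
    """Return the leading records that are no older than max_age.

    Assumes records are sorted newest first, so the scan can stop at the
    first record that is too old.
    """
    now = now or datetime.now(timezone.utc)
    fresh = []
    for record in records:
        created = datetime.fromisoformat(record[timestamp_key])
        if now - created > max_age:
            print("Found an entry older than 24 hours, stopping the loop.")
            break
        fresh.append(record)
    return fresh
```

Stopping at the first stale entry relies on the `created_at:desc` sort used in the API parameters. With unsorted input, every record would need to be checked instead.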


## Appendix