Summary of the Part 3 workflow:

1. **Data Loading and Preprocessing**:
   - The data is loaded from a JSON file.
   - Six columns are selected: publish time, year, distance, description, condition, and price.
   - The 'After_Description' column is transformed into a numerical feature using categorical encoding.
   - Data is split into training and testing sets with an 80/20 ratio.

2. **Model Training and Evaluation**:
   - Multiple regression models and classifiers are trained and evaluated:
     - **Linear Regression**: A basic model for continuous output prediction. Evaluated with MSE, RMSE, MAE, and R².
     - **Random Forest Regression**: An ensemble model that uses multiple decision trees to improve prediction accuracy. Evaluated similarly to Linear Regression.
     - **Gradient Boosting Regression**: Another ensemble model that builds trees sequentially to minimize errors from previous trees. Evaluated using the same metrics.
     - **Naïve Bayes Classifier**: A probabilistic classifier, applied here after discretizing prices into three quantile bins. Evaluated with accuracy and a confusion matrix.
     - **Linear Discriminant Analysis (LDA)**: A classifier that can also reduce dimensionality, applied here after discretizing prices into four quantile bins. Evaluated with accuracy and a confusion matrix.
     - **k-Nearest Neighbors (kNN)**: A model that predicts the output based on the 'k' closest training examples. Optimized using grid search to find the best 'k' and evaluated with regression metrics.
     - **Support Vector Machine (SVM)**: Applied for regression (SVR), with hyperparameters optimized using grid search. Evaluated using MSE, RMSE, MAE, and R².
     - **Decision Tree**: A model that predicts the value of a target variable by learning simple decision rules from data features. Optimized for parameters like max depth and evaluated with regression metrics.

3. **Advanced Neural Network Model - LSTM**:
   - An LSTM (Long Short-Term Memory) model is trained to predict prices, with each listing fed in as a one-step sequence. The process involves:
     - Feature scaling using MinMaxScaler.
     - Reshaping data to fit the LSTM input requirements.
     - Training with early stopping to prevent overfitting.
     - Evaluation using MSE, RMSE, MAE, and R², with results transformed back to the original scale for meaningful interpretation.
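
The four regression metrics named throughout the summary above can be written out by hand. The following is a minimal, dependency-free sketch of what they compute; the function name `regression_metrics` and the sample prices are assumptions made for illustration and are not taken from the project data.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}

# Hypothetical prices in dollars, for illustration only
y_true = [10000, 20000, 30000, 40000]
y_pred = [12000, 18000, 33000, 39000]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE and MAE are in the same unit as the price, which makes them easier to read than MSE. R² compares the model against always predicting the mean price, so a value near 0 means the model is barely better than that baseline.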


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Read data
file_path = '/Users/a1234/Desktop/BU/677 PYTHON/project/combined_data/processed_toyota_data.json'
data = pd.read_json(file_path, lines=True)

# Select required columns
selected_columns = ['Publish Time', 'After_Year', 'Distance', 'After_Description', 'Condition Numeric', 'Price']
data = data[selected_columns]

# Split into training and testing sets
X = data.drop('Price', axis=1)
y = data['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
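
The two preprocessing ideas used so far, encoding a text column as integer codes and holding out 20% of the rows, can be sketched without any library. The helper names `encode_categories` and `split_indices` and the description values are hypothetical; the encoding that produced `After_Description` happened upstream of this notebook and may differ, and the seed here will not reproduce scikit-learn's split.

```python
import math
import random

def encode_categories(values):
    """Map each distinct label to an integer code (sorted so the codes are reproducible)."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle the row indices and reserve a `test_size` share for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

# Hypothetical description values, for illustration only
descriptions = ["Camry", "Corolla", "Camry", "RAV4", "Corolla",
                "Camry", "RAV4", "Camry", "Corolla", "RAV4"]
codes, mapping = encode_categories(descriptions)
train_idx, test_idx = split_indices(len(descriptions))

print(mapping)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

Integer codes impose an artificial ordering on the categories. Tree-based models tolerate this reasonably well, while linear and distance-based models may treat the codes as magnitudes.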


In [2]:
'''Linear Regression Model'''
# Build linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred)
rmse_lr = np.sqrt(mse_lr)
mae_lr = mean_absolute_error(y_test, y_pred)
r2_lr = r2_score(y_test, y_pred)

print("Linear Regression MSE:", mse_lr)
print("Linear Regression RMSE:", rmse_lr)
print("Linear Regression MAE:", mae_lr)
print("Linear Regression R^2:", r2_lr)


Linear Regression MSE: 109287552.87232703
Linear Regression RMSE: 10454.068723340546
Linear Regression MAE: 7638.766117197178
Linear Regression R^2: 0.2317195259729372


In [3]:
'''Random Forest Regression Model'''
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Build Random Forest regression model
random_forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
random_forest_model.fit(X_train, y_train)

# Evaluate the model
y_pred_rf = random_forest_model.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mse_rf)
mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest MSE:", mse_rf)
print("Random Forest RMSE:", rmse_rf)
print("Random Forest MAE:", mae_rf)
print("Random Forest R^2:", r2_rf)


Random Forest MSE: 24095678.44002073
Random Forest RMSE: 4908.734912380249
Random Forest MAE: 2654.6383732752356
Random Forest R^2: 0.8306098108397637


In [4]:
'''Gradient Boosting Regression Model'''

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Build Gradient Boosting regression model
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gb_model.fit(X_train, y_train)

# Evaluate the model
y_pred_gb = gb_model.predict(X_test)
mse_gb = mean_squared_error(y_test, y_pred_gb)
rmse_gb = np.sqrt(mse_gb)
mae_gb = mean_absolute_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)

print("Gradient Boosting MSE:", mse_gb)
print("Gradient Boosting RMSE:", rmse_gb)
print("Gradient Boosting MAE:", mae_gb)
print("Gradient Boosting R^2:", r2_gb)


Gradient Boosting MSE: 43410152.972740695
Gradient Boosting RMSE: 6588.638172850342
Gradient Boosting MAE: 4281.901194164018
Gradient Boosting R^2: 0.6948310029190025


In [5]:
'''Naïve Bayesian'''
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Split into training and testing sets
X = data.drop('Price', axis=1)
y = data['Price']

# Convert price to categorical labels
est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
y_binned = est.fit_transform(y.values.reshape(-1, 1)).ravel()

X_train, X_test, y_train_binned, y_test_binned = train_test_split(X, y_binned, test_size=0.2, random_state=42)

# Build Naïve Bayes classifier
nb_model = GaussianNB()
nb_model.fit(X_train, y_train_binned)

# Evaluate the model
y_pred_nb = nb_model.predict(X_test)
accuracy = accuracy_score(y_test_binned, y_pred_nb)
conf_matrix = confusion_matrix(y_test_binned, y_pred_nb)

print("Naïve Bayesian Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)


Naïve Bayesian Accuracy: 0.514161220043573
Confusion Matrix:
 [[105  47   8]
 [ 36 106   7]
 [  5 120  25]]
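
The accuracy above can be recovered from the confusion matrix itself, and the same matrix also gives per-bin recall. This short sketch uses the matrix printed above and assumes scikit-learn's convention that rows are true bins and columns are predicted bins.

```python
# Confusion matrix copied from the Naive Bayes output above
conf_matrix = [
    [105, 47, 8],
    [36, 106, 7],
    [5, 120, 25],
]

total = sum(sum(row) for row in conf_matrix)
correct = sum(conf_matrix[i][i] for i in range(len(conf_matrix)))
accuracy = correct / total

# Recall per true bin: correct predictions divided by the row total
recall = [row[i] / sum(row) for i, row in enumerate(conf_matrix)]

print("accuracy:", accuracy)
for label, value in enumerate(recall):
    print(f"recall for bin {label}: {value:.3f}")
```

Read this way, the matrix shows that most listings in the highest price bin (120 of 150) are predicted as the middle bin, which is where the classifier loses most of its accuracy.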


In [6]:
'''Linear Discriminant Analysis, LDA'''
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix

# Split into training and testing sets
X = data.drop('Price', axis=1)
y = data['Price']

# Convert price to categorical labels
est = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
y_binned = est.fit_transform(y.values.reshape(-1, 1)).ravel()

X_train, X_test, y_train_binned, y_test_binned = train_test_split(X, y_binned, test_size=0.2, random_state=42)

# Build Linear Discriminant Analysis model
lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train_binned)

# Evaluate the model
y_pred_lda = lda_model.predict(X_test)
accuracy_lda = accuracy_score(y_test_binned, y_pred_lda)
conf_matrix_lda = confusion_matrix(y_test_binned, y_pred_lda)

print("LDA Accuracy:", accuracy_lda)
print("Confusion Matrix:\n", conf_matrix_lda)


LDA Accuracy: 0.45751633986928103
Confusion Matrix:
 [[72  6  9 27]
 [41 34 30 20]
 [16 18 29 38]
 [ 5  3 36 75]]
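
Both classification cells turn the continuous price into bins with `strategy='quantile'`, which aims for bins that hold roughly equal numbers of listings. The sketch below shows the idea in plain Python; it is an approximation written for illustration (linearly interpolated quantile edges, right-sided lookup), not the library's implementation, and the prices are hypothetical.

```python
from bisect import bisect_right

def quantile_edges(values, n_bins):
    """Return the n_bins - 1 inner bin edges at evenly spaced quantiles."""
    ordered = sorted(values)
    edges = []
    for i in range(1, n_bins):
        position = (len(ordered) - 1) * i / n_bins
        low = int(position)
        high = min(low + 1, len(ordered) - 1)
        fraction = position - low
        edges.append(ordered[low] + (ordered[high] - ordered[low]) * fraction)
    return edges

def to_bins(values, edges):
    """Assign each value the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]

# Hypothetical prices in dollars, for illustration only
prices = [5000, 7000, 9000, 12000, 15000, 18000, 22000, 30000, 45000]
edges = quantile_edges(prices, n_bins=3)
labels = to_bins(prices, edges)

print("inner edges:", edges)
print("bin labels: ", labels)
```

Because the bin count changes the task (three bins for Naïve Bayes, four for LDA), the two accuracy figures are not directly comparable: random guessing would score about 1/3 in one case and 1/4 in the other.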


In [7]:
'''k-Nearest Neighbors, kNN'''
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Split into training and testing sets
X = data.drop('Price', axis=1)
y = data['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set up parameter grid for kNN model
param_grid = {'n_neighbors': range(1, 31)}

# Create and run grid search
grid_search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Find and display the best k value
best_k = grid_search.best_params_['n_neighbors']
print("Best k:", best_k)

# Build kNN model with the best k value
knn_model = KNeighborsRegressor(n_neighbors=best_k)
knn_model.fit(X_train, y_train)

# Evaluate the model
y_pred_knn = knn_model.predict(X_test)
mse_knn = mean_squared_error(y_test, y_pred_knn)
rmse_knn = np.sqrt(mse_knn)
mae_knn = mean_absolute_error(y_test, y_pred_knn)
r2_knn = r2_score(y_test, y_pred_knn)

print("kNN MSE:", mse_knn)
print("kNN RMSE:", rmse_knn)
print("kNN MAE:", mae_knn)
print("kNN R^2:", r2_knn)


Best k: 1
kNN MSE: 39898807.98474946
kNN RMSE: 6316.550323139162
kNN MAE: 3070.912854030501
kNN R^2: 0.7195154040328053
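
kNN predicts from distances, so features with large numeric ranges can dominate the neighbor search when the inputs are not scaled. The sketch below is a hand-written illustration of that effect, not the notebook's method; the helper names and the (year, mileage) rows are hypothetical.

```python
import math

def fit_standardizer(rows):
    """Return per-column means and standard deviations."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return means, stds

def standardize(row, means, stds):
    """Scale one row to zero mean and unit variance per column."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]

def knn_predict(train_X, train_y, query, k):
    """Average the targets of the k training rows closest to `query`."""
    by_distance = sorted(zip((math.dist(x, query) for x in train_X), train_y))
    return sum(y for _, y in by_distance[:k]) / k

# Hypothetical (year, mileage) rows and prices, for illustration only
train_X = [[2020, 20000], [2019, 30000], [2012, 31000], [2010, 120000]]
train_y = [25000, 22000, 9000, 5000]
query = [2020, 31000]

raw_pred = knn_predict(train_X, train_y, query, k=1)

means, stds = fit_standardizer(train_X)
scaled_X = [standardize(row, means, stds) for row in train_X]
scaled_pred = knn_predict(scaled_X, train_y, standardize(query, means, stds), k=1)

print("k=1 on raw features:   ", raw_pred)
print("k=1 on scaled features:", scaled_pred)
```

On raw features the nearest neighbor of the 2020 query is the 2012 car, because a mileage gap of a few thousand outweighs an eight-year age gap. After standardizing, the 2019 car becomes the nearest neighbor.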


In [8]:
'''Support Vector Machine (SVM)'''
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Split into training and testing sets
X = data.drop('Price', axis=1)
y = data['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Data standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Set up parameter grid for SVM model
param_grid = {
    'C': [10000, 20000, 23000, 25000, 28000, 30000],  # Regularization parameter
    'gamma': ['scale', 'auto'],  # Kernel coefficient
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid']  # Kernel function type
}

# Create and run grid search
grid_search = GridSearchCV(SVR(), param_grid, cv=5, scoring='neg_mean_squared_error', verbose=2)
grid_search.fit(X_train_scaled, y_train)

# Find and display the best parameters
best_params = grid_search.best_params_
print("Best parameters:", best_params)

# Build SVM model with the best parameters
svm_model = SVR(**best_params)
svm_model.fit(X_train_scaled, y_train)

# Evaluate the model
y_pred_svm = svm_model.predict(X_test_scaled)
mse_svm = mean_squared_error(y_test, y_pred_svm)
rmse_svm = np.sqrt(mse_svm)  # same approach as the earlier cells; `squared=False` is deprecated in newer scikit-learn
mae_svm = mean_absolute_error(y_test, y_pred_svm)
r2_svm = r2_score(y_test, y_pred_svm)

print("SVM MSE:", mse_svm)
print("SVM RMSE:", rmse_svm)
print("SVM MAE:", mae_svm)
print("SVM R^2:", r2_svm)


Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] END ................C=10000, gamma=scale, kernel=linear; total time=   0.1s
[CV] END ................C=10000, gamma=scale, kernel=linear; total time=   0.1s
[CV] END ................C=10000, gamma=scale, kernel=linear; total time=   0.1s
[CV] END ................C=10000, gamma=scale, kernel=linear; total time=   0.1s
[CV] END ................C=10000, gamma=scale, kernel=linear; total time=   0.0s
[CV] END ..................C=10000, gamma=scale, kernel=poly; total time=   0.1s
[CV] END ..................C=10000, gamma=scale, kernel=poly; total time=   0.1s
[CV] END ..................C=10000, gamma=scale, kernel=poly; total time=   0.1s
[CV] END ..................C=10000, gamma=scale, kernel=poly; total time=   0.1s
[CV] END ..................C=10000, gamma=scale, kernel=poly; total time=   0.1s
[CV] END ...................C=10000, gamma=scale, kernel=rbf; total time=   0.1s
[CV] END ...................C=10000, gamma=scal
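
Each candidate in the log above is fitted five times because `cv=5` splits the training set into five folds and holds out one fold per fit. A minimal sketch of that splitting, assuming contiguous and unshuffled folds; the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

folds = list(k_fold_indices(n_samples=12, n_splits=5))
for i, (train, validation) in enumerate(folds):
    print(f"fold {i}: train={len(train)} rows, validation={validation}")
```

Every row lands in exactly one validation fold, so the averaged score for a candidate uses each training row once for validation.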

In [9]:
'''Decision Tree'''
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


# Split into training and testing sets
X = data.drop('Price', axis=1)
y = data['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set up parameter grid for grid search
param_grid = {
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 10, 20, 40],
    'min_samples_leaf': [1, 2, 5, 10],
    'max_features': ['auto', 'sqrt', 'log2', None]  # 'auto' is rejected by scikit-learn >= 1.3, so those candidates fail to fit
}

# Create and run grid search for decision tree regression model
grid_search = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid, cv=5, scoring='neg_mean_squared_error', verbose=2)
grid_search.fit(X_train, y_train)

# Find and display the best parameters
best_params = grid_search.best_params_
print("Best parameters:", best_params)

# Build decision tree model with the best parameters
dt_model_optimized = DecisionTreeRegressor(**best_params, random_state=42)
dt_model_optimized.fit(X_train, y_train)

# Evaluate the optimized model
y_pred_dt_opt = dt_model_optimized.predict(X_test)
mse_dt_opt = mean_squared_error(y_test, y_pred_dt_opt)
rmse_dt_opt = np.sqrt(mse_dt_opt)  # same approach as the earlier cells; `squared=False` is deprecated in newer scikit-learn
mae_dt_opt = mean_absolute_error(y_test, y_pred_dt_opt)
r2_dt_opt = r2_score(y_test, y_pred_dt_opt)

print("Optimized Decision Tree MSE:", mse_dt_opt)
print("Optimized Decision Tree RMSE:", rmse_dt_opt)
print("Optimized Decision Tree MAE:", mae_dt_opt)
print("Optimized Decision Tree R^2:", r2_dt_opt)

Fitting 5 folds for each of 384 candidates, totalling 1920 fits
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=10; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=10; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=10; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=10; total time=   0

480 fits failed out of a total of 1920.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
480 fits failed with the following error:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/sklearn/base.py", line 1145, in wrapper
    estimator._validate_params()
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/sklearn/base.py", line 638, in _validate_params
    validate_parameter_constraints(
  File "/Library/Frameworks/Python.framewo
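
The counts in the log above follow directly from the parameter grid. The traceback is cut off, so the cause cannot be read from it, but it points at parameter validation, and the number of failed fits matches the candidates that use `max_features='auto'`, a value scikit-learn no longer accepts for decision trees from version 1.3 on. The sketch below only redoes the arithmetic; it does not fit any model.

```python
from itertools import product

# Grid copied from the decision tree cell above
param_grid = {
    "max_depth": [None, 10, 20, 30, 40, 50],
    "min_samples_split": [2, 10, 20, 40],
    "min_samples_leaf": [1, 2, 5, 10],
    "max_features": ["auto", "sqrt", "log2", None],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]
n_fits = len(candidates) * cv_folds

# Candidates that would be rejected if 'auto' is not a valid value
auto_candidates = [c for c in candidates if c["max_features"] == "auto"]
n_auto_fits = len(auto_candidates) * cv_folds

print(len(candidates), n_fits, n_auto_fits)  # 384 1920 480
```

Under that reading, dropping `'auto'` from the grid would leave 288 candidates and 1440 fits, none of which should fail for this reason.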

In [None]:
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Input
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam
import numpy as np

# Define features and target
features = ['Publish Time', 'After_Year', 'Distance', 'Condition Numeric']
X = data[features]
y = data['Price'].values

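# Feature scaling (separate scalers, so the target scaler can invert predictions later)
feature_scaler = MinMaxScaler()
X_scaled = feature_scaler.fit_transform(X)
y = y.reshape(-1, 1)
scaler = MinMaxScaler()  # target scaler, reused below for inverse_transform
y_scaled = scaler.fit_transform(y)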

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled, test_size=0.2, random_state=42)

# Reshape input data to fit LSTM model
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))

# Define the model with an Input layer
model = Sequential()
model.add(Input(shape=(1, X_train.shape[2])))  # Input shape: (timesteps=1, n_features)
model.add(LSTM(20, activation='relu'))
model.add(Dense(1))
model.compile(optimizer=Adam(learning_rate=0.01), loss='mean_squared_error')

# Adjust batch size and epochs
batch_size = 64
epochs = 15

# Implement early stopping with reduced patience
early_stopping = EarlyStopping(monitor='val_loss', patience=2)

# Train the model with early stopping
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=1, 
                    validation_data=(X_test, y_test), callbacks=[early_stopping])

# Evaluate the model
y_pred = model.predict(X_test)
# Transform back to original scale
y_test_inv = scaler.inverse_transform(y_test)
y_pred_inv = scaler.inverse_transform(y_pred)
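
Min-max scaling and its inverse are simple enough to write by hand, which makes it clear why the object that scaled the target must also be the one that inverts the predictions. The `MinMax` class below is a stand-in written for illustration, not scikit-learn's `MinMaxScaler`, and the prices are hypothetical. It is also fitted on the training values only, which is one way to keep test-set information out of the scaling.

```python
class MinMax:
    """Minimal 1-D min-max scaler, for illustration only."""

    def fit(self, values):
        self.low = min(values)
        self.span = (max(values) - self.low) or 1.0  # avoid dividing by zero
        return self

    def transform(self, values):
        return [(v - self.low) / self.span for v in values]

    def inverse_transform(self, scaled):
        return [s * self.span + self.low for s in scaled]

# Hypothetical prices in dollars
train_prices = [5000, 15000, 25000]
test_prices = [10000]

target_scaler = MinMax().fit(train_prices)      # fitted on training data only
scaled_test = target_scaler.transform(test_prices)
restored = target_scaler.inverse_transform(scaled_test)

print("scaled test price:", scaled_test)
print("restored price:   ", restored)
```

Metrics computed on the scaled values would be in the 0 to 1 range, so inverting first is what makes MSE, RMSE and MAE readable in dollars.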

In [17]:
# Calculate MSE
mse = mean_squared_error(y_test_inv, y_pred_inv)
print('LSTM MSE:', mse)

# Calculate RMSE
rmse = np.sqrt(mse)
print('LSTM RMSE:', rmse)

# Calculate MAE
mae = mean_absolute_error(y_test_inv, y_pred_inv)
print('LSTM MAE:', mae)

# Calculate R^2
r2 = r2_score(y_test_inv, y_pred_inv)
print('LSTM R^2:', r2)

LSTM MSE: 32469541.33252813
LSTM RMSE: 5423.03523379162
LSTM MAE: 2857.853403304501
LSTM R^2: 0.821340460328053
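
The early-stopping rule used when training the LSTM above (`patience=2` on `val_loss`) can be sketched as a small loop. The function name `stopping_epoch` and the loss values are hypothetical; the sketch only mirrors the patience logic and does not train anything.

```python
def stopping_epoch(val_losses, patience=2):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses per epoch
losses = [0.9, 0.5, 0.4, 0.45, 0.41, 0.39]
stopped_at, best_at = stopping_epoch(losses, patience=2)
print("stopped at epoch", stopped_at, "with best epoch", best_at)
```

In this example training stops before the later improvement at the last epoch is reached, which is the trade-off of a small patience. Keras keeps the weights from the final epoch unless `restore_best_weights=True` is passed to `EarlyStopping`. Because the notebook monitors the test split as validation data, a separate validation split would keep the reported test metrics independent of the stopping decision.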
