**This notebook walks through our whole process, from loading the data to evaluating the model**

In [259]:
#cell for importing all libraries
import numpy as np
import pandas as pd
import scipy
import tensorflow
from sklearn.metrics import r2_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

In [235]:
#cell for importing dataset
X_train = pd.read_csv("./data/X_train.csv")
y_train = pd.read_csv("./data/y_train.csv")
X_test = pd.read_csv("./data/X_test.csv")
y_test = pd.read_csv("./data/y_test.csv")

**We should probably drop the make column as well, since a car's model already determines its make**
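
Before dropping the column it is worth checking that the make really is implied by the model. A minimal sketch on a toy frame; the column names `make` and `model` are assumptions, since the actual headers of `X_train.csv` are not shown here:

```python
import pandas as pd

# Toy data standing in for X_train; column names are assumed, not read from the real CSV.
toy = pd.DataFrame({
    "make": ["Ford", "Ford", "BMW", "BMW"],
    "model": ["Fiesta", "Focus", "X1", "X1"],
    "mileage": [12000, 30000, 8000, 15000],
})

# The make is redundant only if every model belongs to exactly one make.
makes_per_model = toy.groupby("model")["make"].nunique()
make_is_redundant = bool((makes_per_model == 1).all())

# Drop the column only when the check passes; otherwise keep the frame unchanged.
if make_is_redundant:
    toy_reduced = toy.drop(columns="make")
else:
    toy_reduced = toy.copy()
```

If the check fails on the real data, the same model name is shared by more than one make, and dropping the column would lose information.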

In [236]:
#one hot encoding variables
#X_train = pd.get_dummies(X_train)
#make sure to one hot encode test dataset as well before predicting
#X_test = pd.get_dummies(X_test)

#combine train and test before encoding so both end up with the same one-hot columns, then split again
X_combined = pd.concat([X_train,X_test],axis=0)
X_combined = pd.get_dummies(X_combined)

X_train = X_combined[:len(X_train)]
X_test = X_combined[len(X_train):]

#dropping data that isn't useful
X_train = X_train.drop('carID', axis=1)  # drop the CarID value since it's just an identifier
X_test = X_test.drop('carID', axis=1)  # drop the CarID value since it's just an identifier
y_train = y_train.drop('carID', axis=1)  # drop the CarID value since it's just an identifier
y_test = y_test.drop('carID', axis=1)  # drop the CarID value since it's just an identifier

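#scaling data
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train))
#reuse the mean and variance fitted on the training data; refitting on the test set would scale it differently
X_test = pd.DataFrame(scaler.transform(X_test))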


#remove highly correlated data
corr_matrix = X_train.corr()
high_corr = corr_matrix[(corr_matrix > 0.9) & (corr_matrix != 1.0)].any()

#drop every feature whose correlation with another feature is above 0.9
X_train.drop(high_corr.index[high_corr], axis=1, inplace=True)
X_test.drop(high_corr.index[high_corr], axis=1, inplace=True)
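
The mask built from the full correlation matrix flags both members of a correlated pair, so both columns are removed. A sketch of a variant that keeps the first feature of each pair by looking only at the upper triangle of the matrix. The helper name `drop_one_of_each_correlated_pair` is hypothetical, and unlike the cell above it uses absolute correlations, so strong negative correlations count as well:

```python
import numpy as np
import pandas as pd

def drop_one_of_each_correlated_pair(df, threshold=0.9):
    """Return (reduced_df, dropped_columns), keeping the first feature of each correlated pair."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it is highly correlated with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data: "b" is almost a multiple of "a", "c" is only weakly related to both.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
reduced, dropped = drop_one_of_each_correlated_pair(toy)
```

To keep train and test aligned, the list of dropped columns would be computed on the training frame only and then applied to both frames.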


In [266]:
#basic logistic regression using dataset to produce model
lr = LogisticRegression(max_iter=10000, solver="lbfgs")
lr.fit(X_train, y_train.to_numpy().ravel())
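
scikit-learn's `LogisticRegression` is a classifier, so the fit above treats every distinct price as its own class. As a point of comparison, here is a minimal sketch of a regression baseline with `LinearRegression`. The arrays `X_demo` and `y_demo` are synthetic and made up for illustration; they are not the car dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in for a scaled feature matrix and a price column.
X_demo = rng.normal(size=(200, 3))
true_coef = np.array([3000.0, -1500.0, 500.0])
y_demo = 20000.0 + X_demo @ true_coef + rng.normal(scale=100.0, size=200)

# Fit on the first 150 rows, evaluate on the remaining 50.
reg = LinearRegression()
reg.fit(X_demo[:150], y_demo[:150])
pred_demo = reg.predict(X_demo[150:])
r2_demo = r2_score(y_demo[150:], pred_demo)
```

A regressor predicts a continuous value, so it can output prices that never appeared in the training set, which a classifier cannot do.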

In [270]:
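#predict prices for the test set
pred = lr.predict(X_test)

#signed percentage difference: (actual - predicted) / actual * 100
dif = np.subtract(y_test["price"].to_numpy(),pred)
precentage_dif = np.divide(dif,y_test["price"].to_numpy())
precentage_dif *= 100
#r^2 on the test set, then adjusted for the number of features
r2_val = r2_score(y_test,pred)
adj_r2 = 1 - ( 1- r2_val ) * ( len(y_test) - 1 ) / ( len(y_test) - X_test.shape[1] - 1 )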
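
The adjusted $r^2$ computed above is $1 - (1 - r^2)\frac{n - 1}{n - p - 1}$, with $n$ test rows and $p$ features. A small sketch of the same formula as a reusable function; the name `adjusted_r2` and the guard against a non-positive denominator are additions for illustration:

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    denominator = n_samples - n_features - 1
    if denominator <= 0:
        # The correction is undefined without more samples than features plus one.
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / denominator

# Example with made-up numbers: r^2 of 0.9, 101 rows, 10 features.
example = adjusted_r2(0.9, n_samples=101, n_features=10)
```

The penalty grows with the number of features, which matters here because one-hot encoding can produce many columns.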

In [271]:
print(adj_r2)
print(np.mean(precentage_dif))

0.8664009523062677
-3.2752246351803516


- Using simple logistic regression (which treats each distinct price as a class label) and one-hot encoding, our model reaches an adjusted $r^2$ of 0.87 on the test set, which is pretty good.
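- The mean signed percentage difference is -3.3%. Because the difference is computed as actual minus predicted, the negative sign means that on average our model overestimates price by 3.3%.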
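
A mean of signed percentage differences shows the direction of the bias, but errors of opposite sign cancel each other out. A small sketch with made-up numbers that compares it with the mean absolute percentage difference, using the same sign convention as the notebook (actual minus predicted):

```python
import numpy as np

# Made-up prices for illustration only.
actual = np.array([10000.0, 20000.0, 30000.0])
predicted = np.array([12000.0, 18000.0, 30000.0])

# Signed percentage difference: negative means the prediction was too high.
signed_pct = (actual - predicted) / actual * 100

mean_signed = signed_pct.mean()             # direction of the bias; opposite errors cancel
mean_absolute = np.abs(signed_pct).mean()   # typical size of a miss, regardless of direction
```

Reporting both numbers separates systematic bias from overall accuracy.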