#### Car Price Prediction
This Python script predicts car prices from features such as mileage, engine size, and horsepower. It uses the dataset Car-Price-Prediction-dataset.csv.

#### Problem Description
This is a regression task: given features such as mileage, engine size, and horsepower, the goal is to predict a car's price as accurately as possible.

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [12]:
# Load the dataset
dataset = pd.read_csv("Car-Price-Prediction-dataset.csv")
dataset = dataset.drop(["car_ID"], axis=1)

In [13]:
# Display dataset shape and first 5 rows
print("Dataset Shape:", dataset.shape)
print("\nFirst 5 Rows of the Dataset:\n", dataset.head(5))

Dataset Shape: (205, 25)

First 5 Rows of the Dataset:
    symboling                   CarName fueltype aspiration doornumber  \
0          3        alfa-romero giulia      gas        std        two   
1          3       alfa-romero stelvio      gas        std        two   
2          1  alfa-romero Quadrifoglio      gas        std        two   
3          2               audi 100 ls      gas        std       four   
4          2                audi 100ls      gas        std       four   

       carbody drivewheel enginelocation  wheelbase  carlength  ...  \
0  convertible        rwd          front       88.6      168.8  ...   
1  convertible        rwd          front       88.6      168.8  ...   
2    hatchback        rwd          front       94.5      171.2  ...   
3        sedan        fwd          front       99.8      176.6  ...   
4        sedan        4wd          front       99.4      176.6  ...   

   enginesize  fuelsystem  boreratio stroke compressionratio  horsepower  \
0 

In [14]:
dataset.isnull().sum()

symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64

In [15]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   symboling         205 non-null    int64  
 1   CarName           205 non-null    object 
 2   fueltype          205 non-null    object 
 3   aspiration        205 non-null    object 
 4   doornumber        205 non-null    object 
 5   carbody           205 non-null    object 
 6   drivewheel        205 non-null    object 
 7   enginelocation    205 non-null    object 
 8   wheelbase         205 non-null    float64
 9   carlength         205 non-null    float64
 10  carwidth          205 non-null    float64
 11  carheight         205 non-null    float64
 12  curbweight        205 non-null    int64  
 13  enginetype        205 non-null    object 
 14  cylindernumber    205 non-null    object 
 15  enginesize        205 non-null    int64  
 16  fuelsystem        205 non-null    object 
 1

In [16]:
# Cast the integer-valued numeric columns to float
dataset['symboling'] = dataset['symboling'].astype(float)
dataset['curbweight'] = dataset['curbweight'].astype(float)
dataset['enginesize'] = dataset['enginesize'].astype(float)
dataset['horsepower'] = dataset['horsepower'].astype(float)
dataset['peakrpm'] = dataset['peakrpm'].astype(float)
dataset['citympg'] = dataset['citympg'].astype(float)
dataset['highwaympg'] = dataset['highwaympg'].astype(float)
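
The seven casts above can also be written as a single call. The following is a minimal sketch, assuming a small hypothetical frame `toy` in place of the real dataset (its values are illustrative only); `astype` accepts a column-to-dtype mapping and leaves the other columns untouched.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "symboling": [3, 1, 2],
    "curbweight": [2548, 2823, 2337],
    "fueltype": ["gas", "gas", "diesel"],
})

# Columns to cast; in the notebook this list would hold all seven column names.
cols_to_float = ["symboling", "curbweight"]

# astype with a {column: dtype} mapping converts only the listed columns.
toy = toy.astype(dict.fromkeys(cols_to_float, float))

print(toy.dtypes)
```

Keeping the column names in one list makes it harder to miss a column when the set of features changes.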

In [17]:
# Separate the features from the target (price) and keep only the numerical columns
X_data = dataset.drop("price", axis=1)
numerical_cols = X_data.select_dtypes(exclude=["object"]).columns
X = X_data[numerical_cols]

In [18]:
# Standardize numerical features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=numerical_cols)
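
To make the standardization step concrete, the sketch below checks that `StandardScaler` subtracts each column's mean and divides by its population standard deviation (`ddof=0`). It uses the five `wheelbase` values from the rows printed earlier; the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The five wheelbase values from the first rows shown earlier, as a single column.
values = np.array([[88.6], [88.6], [94.5], [99.8], [99.4]])

scaled = StandardScaler().fit_transform(values)

# Manual z-score: (x - mean) / std, with the population standard deviation.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, each column has mean 0 and unit variance, which matters for penalized and kernel models such as Ridge, Lasso, and SVR.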

In [19]:
# Split the dataset into training and testing sets.
# Note: the split uses the unscaled X; X_scaled from the previous cell is not used downstream.
X_train, X_test, y_train, y_test = train_test_split(
    X, dataset["price"], test_size=0.20, random_state=0
)
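
One common way to combine scaling with a train/test split is to put the scaler and the model in a single pipeline, so the scaler is fitted only on the training portion (and, during cross-validation, only on each training fold). This is a sketch of that alternative, not the notebook's method: the synthetic arrays `X_demo` and `y_demo` are assumptions standing in for the car data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales (an assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
y_demo = X_demo @ np.array([3.0, 0.2, 0.01]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# The scaler is refitted inside every CV fold, so no test-fold statistics leak in.
model = make_pipeline(StandardScaler(), Ridge(alpha=10.0))
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring="r2")

# Fit on the full training split, then score once on the held-out split.
model.fit(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
print(f"CV R2: {cv_scores.mean():.3f}, test R2: {test_r2:.3f}")
```

Tree-based models are insensitive to feature scale, so the pipeline mainly affects the linear and SVR entries.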

In [20]:
# Define regression algorithms
regression_algorithms = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=10.0),
    "Lasso Regression": Lasso(alpha=10.0),
    "Support Vector Regression": SVR(kernel='linear', C=10.0, epsilon=0.1),
    "Decision Tree Regression": DecisionTreeRegressor(max_depth=20, min_samples_split=5),
    "Random Forest Regression": RandomForestRegressor(n_estimators=100, max_depth=None),
    "Gradient Boosting Regression": GradientBoostingRegressor(n_estimators=50, learning_rate=0.1),
}

In [21]:
# Evaluate each regression algorithm using cross-validation
algorithm_scores = {}
for algorithm_name, algorithm in regression_algorithms.items():
    scores = cross_val_score(algorithm, X_train, y_train, cv=5, scoring="r2")
    algorithm_scores[algorithm_name] = scores.mean()
    print(f"{algorithm_name} Mean Cross-Validation R2 Score: {scores.mean()}")

Linear Regression Mean Cross-Validation R2 Score: 0.7949067984025655
Ridge Regression Mean Cross-Validation R2 Score: 0.802111886496722
Lasso Regression Mean Cross-Validation R2 Score: 0.7963688131835991
Support Vector Regression Mean Cross-Validation R2 Score: 0.7807588197714062
Decision Tree Regression Mean Cross-Validation R2 Score: 0.8906642211477399
Random Forest Regression Mean Cross-Validation R2 Score: 0.9112393786018418
Gradient Boosting Regression Mean Cross-Validation R2 Score: 0.9210890746863576
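
The tree-based models above are created without a `random_state`, so their scores can vary slightly between runs, and the gap between Random Forest and Gradient Boosting is small. The sketch below, which uses synthetic data from `make_regression` as a stand-in (an assumption) and a hypothetical helper `cv_r2`, shows one way to seed both the model and the folds and to report the spread across folds alongside the mean.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the car dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

def cv_r2(seed):
    """Return the five fold-wise R2 scores for a seeded model and seeded folds."""
    model = GradientBoostingRegressor(
        n_estimators=50, learning_rate=0.1, random_state=seed
    )
    folds = KFold(n_splits=5, shuffle=True, random_state=seed)
    return cross_val_score(model, X_demo, y_demo, cv=folds, scoring="r2")

scores = cv_r2(seed=0)
print(f"R2: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation helps judge whether a difference between two mean scores is larger than the fold-to-fold noise.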


In [22]:
# Choose the regression algorithm with the highest mean cross-validation R2 score
best_algorithm_name = max(algorithm_scores, key=algorithm_scores.get)
best_algorithm = regression_algorithms[best_algorithm_name]

# Train the best algorithm on the entire training set
best_algorithm.fit(X_train, y_train)

In [23]:
y_pred = best_algorithm.predict(X_test)

In [24]:
# Evaluate the best algorithm on the test set using R2, MAE, MSE, and RMSE
print(f"\nBest Algorithm: {best_algorithm_name}")
r2score = r2_score(y_test, y_pred)
print(f"R2 Score of the Best Model on Test Set: {r2score * 100}%")
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE) of the Best Model on Test Set: {mae:.2f}")
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE) of the Best Model on Test Set: {mse:.2f}")
# RMSE is the square root of MSE; this avoids the `squared=False` argument,
# which newer scikit-learn releases no longer accept.
rmse = mse ** 0.5
print(f"Root Mean Squared Error (RMSE) of the Best Model on Test Set: {rmse:.2f}")


Best Algorithm: Gradient Boosting Regression
R2 Score of the Best Model on Test Set: 89.93867747691368%
Mean Absolute Error (MAE) of the Best Model on Test Set: 1856.23
Mean Squared Error (MSE) of the Best Model on Test Set: 7789109.49
Root Mean Squared Error (RMSE) of the Best Model on Test Set: 2790.90
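
The four metrics reported above are related: RMSE is the square root of MSE (2790.90 is the square root of 7789109.49), and R2 compares the squared errors with the variance of the target. The sketch below recomputes each metric by hand and checks it against scikit-learn; the arrays `y_true` and `y_hat` are made-up illustrative values, not model output.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative prices and predictions (made up for this sketch, not model output).
y_true = np.array([10000.0, 15000.0, 20000.0, 25000.0])
y_hat = np.array([11000.0, 14000.0, 21000.0, 23000.0])

errors = y_true - y_hat
mae_manual = np.abs(errors).mean()        # mean absolute error
mse_manual = (errors ** 2).mean()         # mean squared error
rmse_manual = np.sqrt(mse_manual)         # root mean squared error
# R2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2_manual = 1.0 - (errors ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()

print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

MAE and RMSE are in the same units as the price, which makes them easier to interpret than MSE.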


