In [1]:
from google.colab import files
files.upload()
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json


In [2]:
!pip install kaggle



In [33]:
!kaggle datasets download -d avikasliwal/used-cars-price-prediction

Dataset URL: https://www.kaggle.com/datasets/avikasliwal/used-cars-price-prediction
License(s): other
Downloading used-cars-price-prediction.zip to /content
  0% 0.00/172k [00:00<?, ?B/s]
100% 172k/172k [00:00<00:00, 108MB/s]


In [41]:
!unzip used-cars-price-prediction.zip

Archive:  used-cars-price-prediction.zip
  inflating: test-data.csv           
  inflating: train-data.csv          


#Step 1: Business Understanding
The CRISP-DM methodology starts with understanding the business problem and the desired outcomes.

##Objective
The goal is to predict the price of used cars from their attributes using machine learning models. Winning the competition means producing the model with the lowest Root Mean Squared Error (RMSE) on the test set, so the work focuses on minimizing prediction error while keeping the model general enough to perform well on unseen cars.

##Key Business Insights
For used car prices, key attributes typically influencing price include:

Age of the car: Older cars typically have lower prices.

Mileage: Higher mileage often lowers the value.

Brand/Make: Certain brands retain value better.

Condition: Wear, tear, and maintenance history impact price.

Features: Higher-end features (e.g., leather seats, advanced safety features) increase value.

Fuel Type: Efficiency and fuel type (e.g., electric, petrol) impact resale value.

By understanding how these variables interplay, the model will learn to estimate prices based on these influences.

##Task Summary
Target variable: Price (continuous variable).

Evaluation metric: RMSE, which penalizes large errors more severely than small ones, making it appropriate for this regression task.

Dataset characteristics: We are provided with a training dataset (including features and target) and a test dataset (with features but without the target).
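
To make the metric concrete, here is a minimal sketch that compares RMSE with mean absolute error (MAE) on made-up numbers. The prices are illustrative and not taken from the dataset, and the helper names `rmse` and `mae` are introduced here only for the example. Both prediction sets have the same MAE, but the one with a single large miss has twice the RMSE.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Illustrative prices (hypothetical values)
y_true = [5.0, 6.0, 7.0, 8.0]
small_errors = [6.0, 7.0, 8.0, 9.0]    # every prediction is off by 1
one_big_error = [5.0, 6.0, 7.0, 12.0]  # a single prediction is off by 4

print("small errors :", mae(y_true, small_errors), rmse(y_true, small_errors))
print("one big error:", mae(y_true, one_big_error), rmse(y_true, one_big_error))
```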

In [42]:
import pandas as pd

# Load the dataset
file_path = '/content/train-data.csv'
df = pd.read_csv(file_path)

# Display the first few rows of the dataset to understand its structure
df.head()

Unnamed: 0.1,Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price
0,0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,,1.75
1,1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,,12.5
2,2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,8.61 Lakh,4.5
3,3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,,6.0
4,4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,,17.74


In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6019 entries, 0 to 6018
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         6019 non-null   int64  
 1   Name               6019 non-null   object 
 2   Location           6019 non-null   object 
 3   Year               6019 non-null   int64  
 4   Kilometers_Driven  6019 non-null   int64  
 5   Fuel_Type          6019 non-null   object 
 6   Transmission       6019 non-null   object 
 7   Owner_Type         6019 non-null   object 
 8   Mileage            6017 non-null   object 
 9   Engine             5983 non-null   object 
 10  Power              5983 non-null   object 
 11  Seats              5977 non-null   float64
 12  New_Price          824 non-null    object 
 13  Price              6019 non-null   float64
dtypes: float64(2), int64(3), object(9)
memory usage: 658.5+ KB


In [44]:
df.describe()

Unnamed: 0.1,Unnamed: 0,Year,Kilometers_Driven,Seats,Price
count,6019.0,6019.0,6019.0,5977.0,6019.0
mean,3009.0,2013.358199,58738.38,5.278735,9.479468
std,1737.679967,3.269742,91268.84,0.80884,11.187917
min,0.0,1998.0,171.0,0.0,0.44
25%,1504.5,2011.0,34000.0,5.0,3.5
50%,3009.0,2014.0,53000.0,5.0,5.64
75%,4513.5,2016.0,73000.0,5.0,9.95
max,6018.0,2019.0,6500000.0,10.0,160.0


#Step 2: Data Understanding
##2.1 Load and Inspect the Dataset
First, we will load the dataset and examine the structure, including columns, data types, and missing values. We aim to understand the following:

How many rows and columns are present?
What are the data types of each column?
Are there any missing or inconsistent values?
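
The three questions above map onto three pandas attributes and methods. The sketch below runs them on a small stand-in frame (hypothetical rows shaped like the training data) so that it is self-contained; on the real data the same calls would be made on `df`.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the missing entries are invented for illustration
sample = pd.DataFrame({
    "Name": ["Maruti Wagon R LXI CNG", "Honda Jazz V", "Audi A4 New 2.0 TDI Multitronic"],
    "Year": [2010, 2011, 2013],
    "Mileage": ["26.6 km/kg", None, "15.2 kmpl"],
    "Seats": [5.0, np.nan, 5.0],
})

# How many rows and columns are present?
n_rows, n_cols = sample.shape

# What are the data types of each column?
dtypes = sample.dtypes

# Are there any missing values?
missing_per_column = sample.isna().sum()

print(f"{n_rows} rows x {n_cols} columns")
print(dtypes)
print(missing_per_column[missing_per_column > 0])
```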


Initial Data Insights
The dataset contains 6019 rows and 14 columns. Here’s a quick summary of the columns:

Unnamed: 0: Index-like column, can be removed for modeling purposes.

Name: The name of the car, useful for brand/model extraction.

Location: City where the car is sold, could be a factor in pricing.

Year: Year of manufacture, which can help determine car age.

Kilometers_Driven: Distance the car has traveled, likely to impact price.

Fuel_Type: Type of fuel used (CNG, Petrol, Diesel, etc.).

Transmission: Type of transmission (Manual/Automatic).

Owner_Type: How many previous owners (First, Second, etc.).

Mileage, Engine, Power: Specifications of the car, but some of these have missing or inconsistent values.

Seats: Number of seats, with some missing values.

New_Price: The original price when the car was new, but most values are missing.

Price: The target variable (continuous) to predict.

#Step 3: Data Preparation
##3.1 Handling Missing Values
Several columns have missing values, and we need to decide how to handle them based on their importance to the prediction task.

Mileage: 2 missing entries.

The values are strings such as "26.6 km/kg" or "19.67 kmpl", which need to be cleaned and converted to a numeric type.
Strategy: Fill missing values with the median mileage, which is more robust to outliers than the mean.

Engine: 36 missing entries.

Values such as "998 CC" or "1199 CC" need the number extracted and converted to numeric form.
Strategy: Fill missing values with the median engine size.

Power: 36 missing entries.

Values are strings such as "88.7 bhp" or "126.2 bhp" and need to be converted to numeric form.
Strategy: Fill missing values with the median.

Seats: 42 missing entries.

Strategy: Fill missing values with the mode (most frequent value), because the seat count takes only a few discrete values and behaves like a categorical variable.

New_Price: Most of this column is missing (only 824 of 6019 entries are non-null), so it is not very useful. We will drop it from our model.

Unnamed: 0: This is just an index column, so we will drop it.
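
The strategies above can be sketched on toy data as follows. One variation is assumed here that the notebook itself does not use: the fill values are computed once on the training rows and reused for the test rows, which keeps the two sets on the same footing. The helper `to_numeric_spec`, the `fill_values` dictionary, and the sample rows are all hypothetical.

```python
import numpy as np
import pandas as pd

def to_numeric_spec(series):
    """Pull the leading number out of strings such as '998 CC' or '88.7 bhp'.

    Strings without a number (for example 'null bhp') become NaN.
    """
    return series.str.extract(r"(\d+\.?\d*)", expand=False).astype(float)

# Toy rows (illustrative values only)
train = pd.DataFrame({
    "Power": ["58.16 bhp", "126.2 bhp", None, "88.7 bhp"],
    "Seats": [5.0, 5.0, 7.0, np.nan],
})
test = pd.DataFrame({
    "Power": ["null bhp", "82.85 bhp"],
    "Seats": [np.nan, 4.0],
})

train["Power"] = to_numeric_spec(train["Power"])
test["Power"] = to_numeric_spec(test["Power"])

# Learn the fill values from the training rows only
fill_values = {
    "Power": train["Power"].median(),   # median for continuous specs
    "Seats": train["Seats"].mode()[0],  # mode for the seat count
}

# Apply the same fill values to both sets
train = train.fillna(fill_values)
test = test.fillna(fill_values)

print(fill_values)
print(test)
```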

In [46]:
df.columns

Index(['Unnamed: 0', 'Name', 'Location', 'Year', 'Kilometers_Driven',
       'Fuel_Type', 'Transmission', 'Owner_Type', 'Mileage', 'Engine', 'Power',
       'Seats', 'New_Price', 'Price'],
      dtype='object')

In [47]:
# Drop the 'Unnamed: 0' and 'New_Price' columns as they are not useful
df_clean = df.drop(columns=['Unnamed: 0', 'New_Price'])

# Clean and convert 'Mileage' to numeric
df_clean['Mileage'] = df_clean['Mileage'].str.extract(r'(\d+\.?\d*)').astype(float)

# Clean and convert 'Engine' to numeric
df_clean['Engine'] = df_clean['Engine'].str.extract(r'(\d+\.?\d*)').astype(float)

# Clean and convert 'Power' to numeric
df_clean['Power'] = df_clean['Power'].str.extract(r'(\d+\.?\d*)').astype(float)

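# Fill missing values (assign the result back; calling fillna(inplace=True) on a
# column selection may not update the DataFrame under pandas Copy-on-Write)
df_clean['Mileage'] = df_clean['Mileage'].fillna(df_clean['Mileage'].median())
df_clean['Engine'] = df_clean['Engine'].fillna(df_clean['Engine'].median())
df_clean['Power'] = df_clean['Power'].fillna(df_clean['Power'].median())
df_clean['Seats'] = df_clean['Seats'].fillna(df_clean['Seats'].mode()[0])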

# Verify the cleaning and imputation
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6019 entries, 0 to 6018
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Name               6019 non-null   object 
 1   Location           6019 non-null   object 
 2   Year               6019 non-null   int64  
 3   Kilometers_Driven  6019 non-null   int64  
 4   Fuel_Type          6019 non-null   object 
 5   Transmission       6019 non-null   object 
 6   Owner_Type         6019 non-null   object 
 7   Mileage            6019 non-null   float64
 8   Engine             6019 non-null   float64
 9   Power              6019 non-null   float64
 10  Seats              6019 non-null   float64
 11  Price              6019 non-null   float64
dtypes: float64(5), int64(2), object(5)
memory usage: 564.4+ KB



In [48]:
df_clean.head()

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6,998.0,58.16,5.0,1.75
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67,1582.0,126.2,5.0,12.5
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2,1199.0,88.7,5.0,4.5
3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77,1248.0,88.76,7.0,6.0
4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2,1968.0,140.8,5.0,17.74


##3.2 Data Cleaning and Imputation
This step performs two cleaning tasks:

Extract numerical values from the Mileage, Engine, and Power columns.
Impute missing values in Mileage, Engine, Power, and Seats.

In [49]:
# Feature Engineering: Add 'Car_Age' by subtracting the 'Year' from the current year (2024)
df_clean['Car_Age'] = 2024 - df_clean['Year']

# Extract the 'Brand' from the 'Name' column (Assuming the first word is the brand)
df_clean['Brand'] = df_clean['Name'].str.split(' ').str[0]

# Let's inspect the result after adding the new features
df_clean[['Car_Age', 'Brand']].head()

Unnamed: 0,Car_Age,Brand
0,14,Maruti
1,9,Hyundai
2,13,Honda
3,12,Maruti
4,11,Audi


##3.3 Feature Engineering
Next, we will create useful features from the existing ones. Key steps include:

Car Age: From the "Year" column, we can create a new feature representing the age of the car.
Brand Extraction: The Name column contains both brand and model. Extracting the brand will help capture the brand-specific pricing effect.
Fuel Efficiency: Combining Mileage and Fuel_Type might provide useful insights into fuel efficiency.
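
One possible way to act on the Fuel Efficiency idea is sketched below on made-up numbers. In the sample rows CNG mileage is reported in km/kg while petrol and diesel use kmpl, so raw mileage is not directly comparable across fuel types. Dividing each car's mileage by the median for its own fuel type gives a unit-free ratio. The column name `Mileage_vs_Fuel_Median` is hypothetical, and this is only one of several reasonable definitions.

```python
import pandas as pd

# Toy rows (illustrative values only)
cars = pd.DataFrame({
    "Fuel_Type": ["CNG", "CNG", "Diesel", "Diesel", "Petrol", "Petrol"],
    "Mileage": [26.0, 30.0, 20.0, 16.0, 18.0, 12.0],
})

# Median mileage of the fuel type each row belongs to, aligned row by row
fuel_median = cars.groupby("Fuel_Type")["Mileage"].transform("median")

# Ratio above 1 means the car is more efficient than is typical for its fuel type
cars["Mileage_vs_Fuel_Median"] = cars["Mileage"] / fuel_median

print(cars)
```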

In [50]:
from sklearn.preprocessing import StandardScaler

# One-hot encoding for categorical variables
categorical_cols = ['Fuel_Type', 'Transmission', 'Owner_Type', 'Brand', 'Location']
df_encoded = pd.get_dummies(df_clean, columns=categorical_cols, drop_first=True)

# Identify continuous numeric columns for scaling
numeric_cols = ['Kilometers_Driven', 'Mileage', 'Engine', 'Power', 'Seats', 'Car_Age']

# Initialize StandardScaler and scale the numeric columns
scaler = StandardScaler()
df_encoded[numeric_cols] = scaler.fit_transform(df_encoded[numeric_cols])

# Display the first few rows of the processed data
df_encoded.head()

Unnamed: 0,Name,Year,Kilometers_Driven,Mileage,Engine,Power,Seats,Price,Car_Age,Fuel_Type_Diesel,...,Location_Bangalore,Location_Chennai,Location_Coimbatore,Location_Delhi,Location_Hyderabad,Location_Jaipur,Location_Kochi,Location_Kolkata,Location_Mumbai,Location_Pune
0,Maruti Wagon R LXI CNG,2010,0.145315,1.847798,-1.038232,-1.027107,-0.343293,1.75,1.027139,False,...,False,False,False,False,False,False,False,False,True,False
1,Hyundai Creta 1.6 CRDi SX Option,2015,-0.194369,0.335076,-0.064226,0.249937,-0.343293,12.5,-0.502161,True,...,False,False,False,False,False,False,False,False,False,True
2,Honda Jazz V,2011,-0.139581,0.014196,-0.703001,-0.453901,-0.343293,4.5,0.721279,False,...,False,True,False,False,False,False,False,False,False,False
3,Maruti Ertiga VDI,2012,0.309678,0.575191,-0.621278,-0.452775,2.137237,6.0,0.415419,True,...,False,True,False,False,False,False,False,False,False,False
4,Audi A4 New 2.0 TDI Multitronic,2013,-0.197985,-0.640662,0.579552,0.523965,-0.343293,17.74,0.109559,True,...,False,False,True,False,False,False,False,False,False,False


###Feature Engineering Result
We have successfully added the following new features:

Car_Age: Represents the age of the car, calculated by subtracting the car's manufacturing year from 2024.
Brand: Extracted from the Name column, capturing the make of the car.
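
Taking the first word works for most names, but the encoded column list printed further down contains both Brand_ISUZU and Brand_Isuzu, as well as Brand_Land for what is presumably Land Rover. A minimal sketch of one way to tidy this up is shown below. The alias table `BRAND_ALIASES`, the helper `extract_brand`, and the car names are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical alias table: maps a raw first word to the brand name to use
BRAND_ALIASES = {
    "ISUZU": "Isuzu",      # same make, different capitalisation
    "Land": "Land Rover",  # two-word brand cut short by the split
}

def extract_brand(names: pd.Series) -> pd.Series:
    """Take the first word of each name, then apply the alias table."""
    first_word = names.str.split().str[0]
    return first_word.replace(BRAND_ALIASES)

# Illustrative car names
names = pd.Series([
    "ISUZU D-MAX V-Cross 4X4",
    "Isuzu MUX 4WD",
    "Land Rover Range Rover 2.2L Pure",
    "BMW 3 Series 320d",
])

brands = extract_brand(names)
print(brands.tolist())
```

With an alias table like this, both spellings of Isuzu end up in a single dummy column after one-hot encoding.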

#Step 4: Modeling
Now that our data is ready, we will move on to the modeling phase. The steps involved include:

Splitting the Data: We'll split the data into training and validation sets to assess model performance.

Training Initial Models: We'll train various regression models and compare their performance.

Evaluating Models: The evaluation metric is RMSE, so we will focus on minimizing this value during model comparison.
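
A single 80/20 split gives one RMSE estimate per model, which can shift with the random seed. As a complementary check, the sketch below compares two models with 5-fold cross-validation. It runs on synthetic data from `make_regression` so that it is self-contained; the names `X_demo`, `y_demo`, `candidates`, and `cv_rmse` are hypothetical, and the numbers it prints say nothing about the used-car data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the prepared features
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Shuffled 5-fold split with a fixed seed for repeatability
cv = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=20, random_state=42),
}

cv_rmse = {}
for name, model in candidates.items():
    # scikit-learn maximises scores, so RMSE is returned with a negative sign
    scores = cross_val_score(model, X_demo, y_demo, cv=cv,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[name] = float(-scores.mean())

print(cv_rmse)
```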

In [51]:
from sklearn.model_selection import train_test_split

# Separate features and target variable
X = df_encoded.drop(columns=['Price', 'Name', 'Year'])  # 'Name' and 'Year' are not necessary for the model
y = df_encoded['Price']

# Split the data into training and validation sets (80% train, 20% validation)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shape of the splits
X_train.shape, X_val.shape, y_train.shape, y_val.shape


((4815, 54), (1204, 54), (4815,), (1204,))

In [52]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42, n_estimators=100),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42, n_estimators=100)
}

# Dictionary to store RMSE for each model
rmse_scores = {}

# Train each model and calculate RMSE
for model_name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)

    # Predict on validation set
    y_pred = model.predict(X_val)

    # Calculate RMSE
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    rmse_scores[model_name] = rmse

# Display RMSE scores for each model
rmse_scores


{'Linear Regression': 7.191683401354857,
 'Decision Tree': 5.17297822303683,
 'Random Forest': 3.6154671253831827,
 'Gradient Boosting': 3.6673296579806034}

##Hyperparameter Tuning

In [53]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameters for Random Forest
rf_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Define hyperparameters for Gradient Boosting
gb_params = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7]
}

# Grid Search for Random Forest
rf_grid = GridSearchCV(RandomForestRegressor(random_state=42), rf_params, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
rf_grid.fit(X_train, y_train)
best_rf = rf_grid.best_estimator_

# Grid Search for Gradient Boosting
gb_grid = GridSearchCV(GradientBoostingRegressor(random_state=42), gb_params, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
gb_grid.fit(X_train, y_train)
best_gb = gb_grid.best_estimator_

# Evaluate the best models on the validation set
rf_rmse = np.sqrt(mean_squared_error(y_val, best_rf.predict(X_val)))
gb_rmse = np.sqrt(mean_squared_error(y_val, best_gb.predict(X_val)))

best_rf_params = rf_grid.best_params_
best_gb_params = gb_grid.best_params_

rf_rmse, gb_rmse, best_rf_params, best_gb_params




(3.830130909343681,
 3.5159099364360995,
 {'max_depth': 20,
  'min_samples_leaf': 2,
  'min_samples_split': 2,
  'n_estimators': 100},
 {'learning_rate': 0.2, 'max_depth': 3, 'n_estimators': 200})

In [62]:
import pandas as pd

# Load the dataset
file_path = 'test-data.csv'
test_data = pd.read_csv(file_path)

# Display the first few rows of the dataset to understand its structure
test_data.head()

Unnamed: 0.1,Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price
0,0,Maruti Alto K10 LXI CNG,Delhi,2014,40929,CNG,Manual,First,32.26 km/kg,998 CC,58.2 bhp,4.0,
1,1,Maruti Alto 800 2016-2019 LXI,Coimbatore,2013,54493,Petrol,Manual,Second,24.7 kmpl,796 CC,47.3 bhp,5.0,
2,2,Toyota Innova Crysta Touring Sport 2.4 MT,Mumbai,2017,34000,Diesel,Manual,First,13.68 kmpl,2393 CC,147.8 bhp,7.0,25.27 Lakh
3,3,Toyota Etios Liva GD,Hyderabad,2012,139000,Diesel,Manual,First,23.59 kmpl,1364 CC,null bhp,5.0,
4,4,Hyundai i20 Magna,Mumbai,2014,29000,Petrol,Manual,First,18.5 kmpl,1197 CC,82.85 bhp,5.0,


In [63]:
# Drop the 'Unnamed: 0' and 'New_Price' columns as they are not useful
df_test__clean = test_data.drop(columns=['Unnamed: 0', 'New_Price'])

# Clean and convert 'Mileage' to numeric
df_test__clean['Mileage'] = df_test__clean['Mileage'].str.extract(r'(\d+\.?\d*)').astype(float)

# Clean and convert 'Engine' to numeric
df_test__clean['Engine'] = df_test__clean['Engine'].str.extract(r'(\d+\.?\d*)').astype(float)

# Clean and convert 'Power' to numeric
df_test__clean['Power'] = df_test__clean['Power'].str.extract(r'(\d+\.?\d*)').astype(float)

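# Fill missing values (assign the result back; calling fillna(inplace=True) on a
# column selection may not update the DataFrame under pandas Copy-on-Write)
df_test__clean['Mileage'] = df_test__clean['Mileage'].fillna(df_test__clean['Mileage'].median())
df_test__clean['Engine'] = df_test__clean['Engine'].fillna(df_test__clean['Engine'].median())
df_test__clean['Power'] = df_test__clean['Power'].fillna(df_test__clean['Power'].median())
df_test__clean['Seats'] = df_test__clean['Seats'].fillna(df_test__clean['Seats'].mode()[0])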

# Verify the cleaning and imputation
df_test__clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1234 entries, 0 to 1233
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Name               1234 non-null   object 
 1   Location           1234 non-null   object 
 2   Year               1234 non-null   int64  
 3   Kilometers_Driven  1234 non-null   int64  
 4   Fuel_Type          1234 non-null   object 
 5   Transmission       1234 non-null   object 
 6   Owner_Type         1234 non-null   object 
 7   Mileage            1234 non-null   float64
 8   Engine             1234 non-null   float64
 9   Power              1234 non-null   float64
 10  Seats              1234 non-null   float64
dtypes: float64(4), int64(2), object(5)
memory usage: 106.2+ KB


In [64]:
# Assuming the test dataset is loaded into 'test_data'
# Apply the same feature engineering as we did on the training data

# Feature Engineering: Add 'Car_Age' by subtracting the 'Year' from the current year (2024)
df_test__clean['Car_Age'] = 2024 - df_test__clean['Year']

# Extract the 'Brand' from the 'Name' column (Assuming the first word is the brand)
df_test__clean['Brand'] = df_test__clean['Name'].str.split(' ').str[0]

# Display the first few rows after feature engineering
df_test__clean[['Car_Age', 'Brand']].head()


Unnamed: 0,Car_Age,Brand
0,10,Maruti
1,11,Maruti
2,7,Toyota
3,12,Toyota
4,10,Hyundai


In [65]:
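# One-hot encode the same categorical columns as in the training data.
# Keep every dummy column here (no drop_first): the first category in the test set can
# differ from the one dropped in training, and the reindex below discards any extra columns.
categorical_cols = ['Fuel_Type', 'Transmission', 'Owner_Type', 'Brand', 'Location']
test_data_encoded = pd.get_dummies(df_test__clean, columns=categorical_cols)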

# Ensure the test data has the same columns as the training data
# Align with the training dataset (X_train)
test_data_encoded = test_data_encoded.reindex(columns=X_train.columns, fill_value=0)

# Check the results
test_data_encoded.head()

Unnamed: 0,Kilometers_Driven,Mileage,Engine,Power,Seats,Car_Age,Fuel_Type_Diesel,Fuel_Type_Electric,Fuel_Type_LPG,Fuel_Type_Petrol,...,Location_Bangalore,Location_Chennai,Location_Coimbatore,Location_Delhi,Location_Hyderabad,Location_Jaipur,Location_Kochi,Location_Kolkata,Location_Mumbai,Location_Pune
0,40929,32.26,998.0,58.2,4.0,10,False,0,False,False,...,False,False,False,True,False,False,False,False,False,False
1,54493,24.7,796.0,47.3,5.0,11,False,0,False,True,...,False,False,True,False,False,False,False,False,False,False
2,34000,13.68,2393.0,147.8,7.0,7,True,0,False,False,...,False,False,False,False,False,False,False,False,True,False
3,139000,23.59,1364.0,93.7,5.0,12,True,0,False,False,...,False,False,False,False,True,False,False,False,False,False
4,29000,18.5,1197.0,82.85,5.0,10,False,0,False,True,...,False,False,False,False,False,False,False,False,True,False
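
An alternative to aligning columns by hand is to bundle scaling and encoding into a scikit-learn Pipeline, so that whatever was learned from the training rows is applied to the test rows automatically. The sketch below is not the notebook's method. It uses a few toy rows (the last training row is invented) and `OneHotEncoder(handle_unknown="ignore")`, which encodes categories unseen during fitting as all zeros. The names `train_demo`, `test_demo`, and `pipeline` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training rows (illustrative values)
train_demo = pd.DataFrame({
    "Kilometers_Driven": [72000, 41000, 46000, 87000, 40670, 30000],
    "Car_Age": [14, 9, 13, 12, 11, 8],
    "Fuel_Type": ["CNG", "Diesel", "Petrol", "Diesel", "Diesel", "Petrol"],
    "Brand": ["Maruti", "Hyundai", "Honda", "Maruti", "Audi", "Honda"],
    "Price": [1.75, 12.5, 4.5, 6.0, 17.74, 5.0],
})

# Toy test rows with categories that never appear in train_demo
test_demo = pd.DataFrame({
    "Kilometers_Driven": [40929, 34000],
    "Car_Age": [10, 7],
    "Fuel_Type": ["CNG", "Electric"],
    "Brand": ["Maruti", "Toyota"],
})

numeric = ["Kilometers_Driven", "Car_Age"]
categorical = ["Fuel_Type", "Brand"]

# Scale numeric columns and one-hot encode categorical ones in a single step
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=20, random_state=42)),
])

# Fitting learns the scaling statistics and category lists from the training rows only
pipeline.fit(train_demo[numeric + categorical], train_demo["Price"])

# Predicting reuses those learned statistics on the test rows
demo_predictions = pipeline.predict(test_demo)
print(demo_predictions)
```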


In [67]:
test_data_encoded.columns

Index(['Kilometers_Driven', 'Mileage', 'Engine', 'Power', 'Seats', 'Car_Age',
       'Fuel_Type_Diesel', 'Fuel_Type_Electric', 'Fuel_Type_LPG',
       'Fuel_Type_Petrol', 'Transmission_Manual', 'Owner_Type_Fourth & Above',
       'Owner_Type_Second', 'Owner_Type_Third', 'Brand_Audi', 'Brand_BMW',
       'Brand_Bentley', 'Brand_Chevrolet', 'Brand_Datsun', 'Brand_Fiat',
       'Brand_Force', 'Brand_Ford', 'Brand_Honda', 'Brand_Hyundai',
       'Brand_ISUZU', 'Brand_Isuzu', 'Brand_Jaguar', 'Brand_Jeep',
       'Brand_Lamborghini', 'Brand_Land', 'Brand_Mahindra', 'Brand_Maruti',
       'Brand_Mercedes-Benz', 'Brand_Mini', 'Brand_Mitsubishi', 'Brand_Nissan',
       'Brand_Porsche', 'Brand_Renault', 'Brand_Skoda', 'Brand_Smart',
       'Brand_Tata', 'Brand_Toyota', 'Brand_Volkswagen', 'Brand_Volvo',
       'Location_Bangalore', 'Location_Chennai', 'Location_Coimbatore',
       'Location_Delhi', 'Location_Hyderabad', 'Location_Jaipur',
       'Location_Kochi', 'Location_Kolkata', 'Location_Mu

In [68]:
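# Scale the numeric columns with the scaler fitted on the training data, then predict with the Random Forest model
test_scaled = test_data_encoded.copy()
test_scaled[numeric_cols] = scaler.transform(test_scaled[numeric_cols])
test_predictions = models['Random Forest'].predict(test_scaled)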

In [69]:
test_predictions

array([36.0103, 37.5416, 35.9715, ..., 35.9285, 36.5947, 34.964 ])

In [70]:
# Prepare the submission DataFrame (replace 'ID' with the actual identifier column in the test data)
submission = pd.DataFrame({
    'ID': test_data['Unnamed: 0'],  # Assuming 'ID' is the column name in your test data
    'Price': test_predictions
})
# Save the submission file
submission.to_csv('submission.csv', index=False)

print("Submission file created as 'submission.csv'")


Submission file created as 'submission.csv'


In [71]:
!kaggle competitions submit -c playground-series-s4e9 -f submission.csv -m "Message"

100% 24.3k/24.3k [00:00<00:00, 38.1kB/s]
Successfully submitted to Regression of Used Car Prices