# 1. Problem Definition and Dataset Selection:
# Problem:
# To predict the overall rating of cars based on various features such as price, exterior, interior, ride quality, color, number of airbags, and fuel type. This will help in understanding which factors contribute most to the overall rating and potentially assist in car purchase decisions.

# Dataset:
# We will use a car rating dataset containing the following columns:

# Index: A unique identifier for each car entry.
# Car: The name and model of the car.
# Price: The price range of the car.
# Overall Rating: The overall rating given to the car.
# Exterior: The rating of the car's exterior.
# Interior: The rating of the car's interior.
# Ride Quality: The rating of the car's ride quality.
# Color: The color of the car.
# Airbags: The number of airbags the car has.
# Fuel type: The type of fuel the car uses (e.g., Petrol, Hybrid).
# Price.1: A second column named Price in the CSV file (pandas appends ".1" to a duplicated column name). It holds a numeric price-related value.
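
# The Price column mixes units (Lakh and Crore) and ranges, so pandas reads it as text rather than as a number. Below is a minimal sketch of one way to turn it into a single numeric value. It assumes the midpoint of the range is an acceptable summary and uses 1 Crore = 100 Lakh. The helper name price_to_lakh is hypothetical and is not part of the original notebook.

```python
import re

LAKH_PER_CRORE = 100  # 1 crore = 100 lakh


def price_to_lakh(text):
    """Return the midpoint of a price string such as '3.99 - 5.96 Lakh', in lakh.

    Hypothetical helper for illustration only.
    """
    parts = [part.strip() for part in text.split("-")]
    unit = None
    values = []
    # Walk from right to left: the unit may be written only on the last part
    # ('3.99 - 5.96 Lakh'), so earlier parts inherit it.
    for part in reversed(parts):
        match = re.fullmatch(r"([\d.]+)\s*(Lakh|Crore)?", part)
        if match is None:
            raise ValueError(f"Unrecognized price: {text!r}")
        unit = match.group(2) or unit
        if unit is None:
            raise ValueError(f"No unit found in price: {text!r}")
        factor = LAKH_PER_CRORE if unit == "Crore" else 1
        values.append(float(match.group(1)) * factor)
    return sum(values) / len(values)


for example in ["3.99 - 5.96 Lakh", "1.95 Crore", "88.06 Lakh - 1.53 Crore"]:
    print(example, "->", round(price_to_lakh(example), 3))
```

# A numeric price like this could replace the text column before encoding, so that the price becomes one ordered feature rather than many indicator columns.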

# 2. Data Preprocessing and Exploration:

In [11]:
import pandas as pd

In [12]:
# Load the dataset
dataset = pd.read_csv('Car_Rating data.csv')

In [13]:
dataset

Unnamed: 0,Index,Car,Price,Overall Rating,Exterior,Interior,Ride Quality,Color,Airbags,Fuel type,Price.1
0,0,Maruti Suzuki Alto K10,3.99 - 5.96 Lakh,4.4,4.4,4.4,4.4,Silver,12,Hybrid,13328
1,1,Maruti Suzuki Alto 800,3.25 - 5.12 Lakh,4.2,4.0,3.8,4.2,Black,8,Petrol,16621
2,2,Renault Kwid,4.70 - 6.45 Lakh,3.5,2.7,3.0,2.3,Black,2,Petrol,8467
3,3,Maruti Suzuki Alto K10,3.99 - 5.96 Lakh,4.4,4.4,4.4,4.4,White,0,Hybrid,3607
4,4,Maruti Suzuki Alto 800,3.25 - 5.12 Lakh,4.2,4.0,3.8,4.2,Silver,4,Petrol,11726
...,...,...,...,...,...,...,...,...,...,...,...
167,167,BMW i7,1.95 Crore,0.0,,,,White,4,Petrol,20005
168,168,Land Rover Range Rover Sport,1.64 - 1.84 Crore,0.0,,,,Silver,8,Petrol,36110
169,169,Aston Martin DBX,3.82 - 4.63 Crore,0.0,,,,Blue,4,Petrol,7840
170,170,Porsche Macan,88.06 Lakh - 1.53 Crore,0.0,,,,Black,12,Petrol,470


In [28]:
# Fill missing values in the rating columns with 0.
# Assigning the result back avoids chained in-place modification of a column.
rating_cols = ["Exterior", "Interior", "Ride Quality"]
dataset[rating_cols] = dataset[rating_cols].fillna(0)

In [29]:
# Drop any rows that still contain missing values
dataset.dropna(inplace=True)
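
# The two strategies used above, filling and dropping, behave quite differently. Here is a small sketch on a made-up frame (the values are illustrative, not taken from the CSV) that counts the missing entries and compares both outcomes.

```python
import numpy as np
import pandas as pd

# Illustrative frame: the last row has no sub-ratings, as in the tail of the dataset.
toy = pd.DataFrame({
    "Overall Rating": [4.4, 3.5, 0.0],
    "Exterior": [4.4, 2.7, np.nan],
    "Interior": [4.4, 3.0, np.nan],
})

# Number of missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Strategy 1: fill the missing ratings with 0 (keeps every row).
rating_cols = ["Exterior", "Interior"]
filled = toy.copy()
filled[rating_cols] = filled[rating_cols].fillna(0)

# Strategy 2: drop every row that has a missing value.
dropped = toy.dropna()

print(len(filled), len(dropped))
```

# Once the ratings have been filled with 0, a later dropna only removes rows with gaps in other columns. A 0 also reads as a real (very low) rating, which may or may not be intended.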

In [30]:
# One-hot encode the categorical (text) columns
dataset = pd.get_dummies(dataset, drop_first=True)
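
# To show what get_dummies with drop_first=True produces, here is a short sketch on a made-up frame. The column names mirror the dataset, but the rows are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Color": ["Silver", "Black", "White", "Black"],
    "Airbags": [12, 8, 0, 2],
})

# Text columns become indicator columns; numeric columns pass through unchanged.
# drop_first=True drops the first category (alphabetically, 'Black') to avoid
# a redundant column.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
print(encoded.shape)
```

# Every distinct text value becomes its own column, so high-cardinality columns such as Car or Price can add a large number of features.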

In [31]:
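# Split the data into features and target variable.
# Note: this cell uses 'Price.1' as the target, although the problem statement
# names 'Overall Rating'. The results shown below were produced with 'Price.1'.
X = dataset.drop('Price.1', axis=1)
y = dataset['Price.1']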

In [32]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Feature Engineering:
# Apply StandardScaler before training the model. This preprocessing step puts all features on the same scale, which can be crucial for certain machine learning algorithms.

# Scaling transforms each feature so that it has a mean of 0 and a standard deviation of 1. This is particularly important for algorithms that rely on distance metrics, such as K-Nearest Neighbors or Support Vector Machines.

# The StandardScaler is fit on the training data only and then used to transform both the training and test data.

In [63]:
from sklearn.preprocessing import StandardScaler
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
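
# The standardization described above can be checked on a tiny made-up array. The sketch below uses illustrative values only. It confirms that the training columns end up with mean 0 and standard deviation 1, and that the test row is transformed with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values).
train = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```

# The test values are not guaranteed to have mean 0 or standard deviation 1, because the scaler never sees them while fitting.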

# Reshape the target variables y_train and y_test using the reshape method. Reshaping is often necessary in scikit-learn when a one-dimensional array has to be passed to a component that expects two-dimensional input.

# Here the target variables are reshaped into a two-dimensional array with a single column (reshape(-1, 1)), which is the shape that transformers such as StandardScaler expect.

In [46]:
# Reshape y_train and y_test
y_train_reshaped = y_train.values.reshape(-1, 1)
y_test_reshaped = y_test.values.reshape(-1, 1)

# Scale the target variables using StandardScaler. Scaling features (input variables) is common, but scaling the target variable is less common in most machine learning workflows.

# It can still help in specific cases, such as with Support Vector Machines (SVMs) or when the target variable spans a wide range of values.

# Make sure that scaling the target variable fits the requirements of the model and the characteristics of the data, and that there is a specific reason for doing it.

In [48]:
# Scale the target variable with its own scaler, so the feature scaler is not overwritten
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train_reshaped)
y_test_scaled = y_scaler.transform(y_test_reshaped)
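
# If a model is trained on a scaled target, its predictions come out in scaled units and need to be mapped back. Here is a minimal sketch of the reshape, scale, and inverse_transform round trip. It uses a few illustrative target values and a scaler named y_scaler, a name chosen for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative one-dimensional target values.
y_train_toy = np.array([13328.0, 16621.0, 8467.0, 3607.0])

# StandardScaler expects 2-D input: one column, as many rows as samples.
y_2d = y_train_toy.reshape(-1, 1)
print(y_2d.shape)

y_scaler = StandardScaler()
y_scaled = y_scaler.fit_transform(y_2d)

# inverse_transform maps scaled values back to the original units;
# ravel() flattens the result to one dimension again.
y_restored = y_scaler.inverse_transform(y_scaled).ravel()
print(y_restored)
```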

# 4. Model Selection and Training:

# Initialize and train a Random Forest Classifier using the scaled features X_train_scaled and the target variable y_train.

In [58]:
from sklearn.ensemble import RandomForestClassifier
# Initialize and train a random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train_scaled, y_train)

# 5. Hyperparameter Tuning:

In [50]:
from sklearn.model_selection import GridSearchCV
# Define hyperparameters to tune
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}


In [51]:
# Perform grid search cross-validation
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)




In [52]:
# Get the best hyperparameters
best_params = grid_search.best_params_


In [53]:
# Train the model with the best hyperparameters
best_rf_classifier = RandomForestClassifier(**best_params)
best_rf_classifier.fit(X_train_scaled, y_train)
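
# By default GridSearchCV uses refit=True, so after fitting it already holds a model retrained on the full training data with the best parameters, available as best_estimator_. That model also keeps the settings of the estimator passed in, such as random_state. The sketch below shows this on synthetic data from make_classification, with a reduced grid to keep it fast. The data and the grid are assumptions for illustration, not the notebook's own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic two-class data standing in for the real features and target.
X_toy, y_toy = make_classification(n_samples=120, n_features=6, random_state=42)

# A deliberately small grid so the search runs quickly.
small_grid = {
    "n_estimators": [10, 20],
    "max_depth": [None, 3],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated accuracy of the best setting

# Already refit on all of X_toy, y_toy with the best parameters.
best_model = search.best_estimator_
print(best_model.predict(X_toy[:5]))
```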

# 6. Model Evaluation:

In [54]:
from sklearn.metrics import accuracy_score, classification_report

In [55]:
# Predict on the test set
y_pred = best_rf_classifier.predict(X_test_scaled)

# Accuracy
# Accuracy Score: 0.057
# The accuracy score is the proportion of correctly classified samples out of all samples.
# Here the model predicted the target variable correctly for approximately 5.7% of the test samples.

In [56]:
# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.05714285714285714
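
# The printed value 0.05714285714285714 equals 2/35, which is consistent with 2 correct predictions out of 35 test samples. One reading that matches this number is a model that predicts the same class for every sample. This is an inference from the output, not something the notebook confirms. The sketch below reproduces the number with hypothetical labels.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: 35 test samples, two of which truly belong to class 549.
# The other 33 labels are placeholders that differ from 549.
y_true_toy = [549, 549] + list(range(1000, 1033))

# A model that predicts class 549 for every sample.
y_pred_toy = [549] * 35

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
print(toy_accuracy)
```

# In this toy case, class 549 has a recall of 1.00 (both true members found) and a precision of 2/35, which rounds to 0.06.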


In [57]:
# Generate classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

         314       0.00      0.00      0.00         1
         549       0.06      1.00      0.11         2
         627       0.00      0.00      0.00         1
         706       0.00      0.00      0.00         1
         862       0.00      0.00      0.00         1
         941       0.00      0.00      0.00         1
        1019       0.00      0.00      0.00         1
        3000       0.00      0.00      0.00         1
        3763       0.00      0.00      0.00         1
        4547       0.00      0.00      0.00         1
        4704       0.00      0.00      0.00         1
        7683       0.00      0.00      0.00         1
        8781       0.00      0.00      0.00         1
       12074       0.00      0.00      0.00         1
       12231       0.00      0.00      0.00         1
       12858       0.00      0.00      0.00         1
       15681       0.00      0.00      0.00         2
    



# Classification Report
# The classification report provides a detailed summary of the model's performance on each class.
# Precision: Indicates the proportion of true positive predictions out of all positive predictions.
# Recall: Represents the proportion of true positive predictions out of all actual positives.
# F1-score: Harmonic mean of precision and recall, providing a balance between the two metrics.
# Support: Number of actual occurrences of the class in the test set.
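
# The definitions above can be checked by hand on a small binary example. The labels below are made up for illustration: they contain 2 true positives, 2 false positives, and 1 false negative.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true_small = [1, 1, 0, 0, 1, 0]
y_pred_small = [1, 0, 0, 1, 1, 1]

# precision = TP / (TP + FP) = 2 / 4
precision = precision_score(y_true_small, y_pred_small)
# recall = TP / (TP + FN) = 2 / 3
recall = recall_score(y_true_small, y_pred_small)
# F1 = 2 * precision * recall / (precision + recall) = 4 / 7
f1 = f1_score(y_true_small, y_pred_small)

print(precision, recall, f1)
```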
# Metrics Overview
# Accuracy: 0.06
# Macro Average F1-score: 0.00
# Weighted Average F1-score: 0.01
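
# With only one or two test samples per class, the classifier is treating a numeric target as a set of unrelated labels, which goes a long way toward explaining the scores above. One possible alternative, shown here as a sketch and not as the notebook's method, is a RandomForestRegressor evaluated with regression metrics. The data below is synthetic and stands in for the real features and target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic numeric target that depends on the first two features plus noise.
rng = np.random.default_rng(42)
X_syn = rng.uniform(0, 5, size=(200, 3))
y_syn = 2.0 * X_syn[:, 0] + X_syn[:, 1] + rng.normal(0, 0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# A regressor predicts a number, not one label per distinct value.
regressor = RandomForestRegressor(n_estimators=100, random_state=42)
regressor.fit(X_tr, y_tr)
y_hat = regressor.predict(X_te)

# Regression metrics measure how close the predictions are, not exact matches.
mae = mean_absolute_error(y_te, y_hat)
r2 = r2_score(y_te, y_hat)
print("MAE:", mae)
print("R^2:", r2)
```

# Whether this helps on the real data depends on which column is used as the target and on how the text columns are encoded.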
