In [None]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'car-price-predictionused-cars:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F2491159%2F4226692%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240824%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240824T041304Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D6a64687b0beb2f46bd4d17c3b7786137eaeeee1ea7078d8b5aa2459e415a167a070028abea6aaf7263bfa4ef86c90c3ae680fbd76e5b041f9a74b1cea35b2cb711217f034b5390ea80f20090ee1d5efb6c6f5e7d00f60175f85c0aef839806f358961c82a663a9cece3432913deeea0b71b73905643b6e13770389ff397e3aa2e24fc7e1994e8c086539ed1f6eb4472b3a83726ebb1a3d14a2157dd728d349fc863f7b16872a8aa6a092347876bec41dda5788e53bc751965fbda85a437cccd3c0f111366b53b433c7f5a3dbedebcddea41053d346a298b7f4cae360a6620f76a6c5820185be1a5200efdfef48d8154062d5666a32dac861ce3a4ee349b09df8'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              with tarfile.open(tfile.name) as tar_archive:
                tar_archive.extractall(destination_path)
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


# ABOUT DATASET

The dataset titled "Car price prediction(used cars)" available on Kaggle is designed for predicting the price of used cars based on various attributes. Here's an overview of its structure and contents:

### Dataset Overview:
- **File Name**: `car data.csv`
- **File Size**: 16.91 kB
- **Number of Columns**: 9

### Columns:
1. **Car_Name**: Name of the car (categorical)
2. **Year**: Year of the car's manufacturing (numerical)
3. **Selling_Price**: Selling price of the car (target variable, numerical)
4. **Present_Price**: Current market price of the car (numerical)
5. **Driven_kms**: Kilometers driven by the car (numerical)
6. **Fuel_Type**: Type of fuel used by the car (categorical)
7. **Selling_type**: Selling type (categorical)
8. **Transmission**: Type of transmission (categorical)
9. **Owner**: Number of previous owners (numerical)

### Tags and Usability:
- **Tags**: Tabular, Automobiles and Vehicles, Beginner, India, Regression
- **Usability Rating**: 10.00 (the maximum score, reflecting complete documentation and metadata)
- **License**: CC0: Public Domain
- **Expected Update Frequency**: Never (static dataset)

### Dataset Description:
This dataset is ideal for regression tasks where the goal is to predict the selling price of a used car based on its characteristics such as age (Year), current market price (Present_Price), kilometers driven (Driven_kms), fuel type (Fuel_Type), transmission type (Transmission), and more. It's suitable for learning regression modeling techniques, exploring feature engineering, and evaluating various machine learning algorithms.

### Usage Examples:
- **Learning**: Useful for understanding how to train a car price prediction model.
- **Research**: Supports research in the domain of automotive pricing models.
- **Application**: Applicable for developing real-world applications related to used car valuation.

### Data Quality:
- **Cleanliness**: Well-documented and maintained.
- **Originality**: Original dataset source on Kaggle with high-quality notebooks available for reference.

### Additional Notes:
- The dataset has been actively viewed and downloaded, indicating its popularity and usefulness among data enthusiasts and learners.

This dataset provides a rich opportunity for exploration and experimentation in machine learning, particularly in the field of regression analysis applied to automotive data.

# LIBRARIES

In [None]:
from sklearn.metrics import make_scorer, mean_squared_error, mean_absolute_error, r2_score
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.model_selection import cross_val_score, KFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import warnings


In [None]:
warnings.filterwarnings('ignore')

# LOAD DATASET

In [None]:
df = pd.read_csv('/kaggle/input/car-price-predictionused-cars/car data.csv')

In [None]:
df.shape

In [None]:
df.head()

# DATA CLEANING

In [None]:
df.columns = df.columns.str.lower()

In [None]:
df.isnull().sum().any()

In [None]:
df.info()

In [None]:
obj_col = df.select_dtypes(['object']).columns
num_col = df.select_dtypes(['int', 'float']).columns

In [None]:
for col in obj_col:
  print(f'{col} => {df[col].nunique()}')
  print()

In [None]:
le = LabelEncoder()
df['fuel_type'] = le.fit_transform(df['fuel_type'])
df['selling_type'] = le.fit_transform(df['selling_type'])
df['transmission'] = le.fit_transform(df['transmission'])
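To make the encoding step concrete, here is a minimal pure-Python sketch of the mapping that `LabelEncoder` produces: the sorted distinct values are numbered from 0. The helper `label_encode` and the sample values are made up for illustration and are not read from the CSV.

```python
def label_encode(values):
    """Map each distinct value to its index in the sorted list of distinct values."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes

# Illustrative sample values (assumed, not read from the dataset)
fuel_sample = ['Petrol', 'Diesel', 'Petrol', 'CNG']
codes, classes = label_encode(fuel_sample)

print(classes)  # ['CNG', 'Diesel', 'Petrol']
print(codes)    # [2, 1, 2, 0]
```

The integer codes imply an order that the categories do not have. Tree-based models are usually insensitive to this, but a linear model may treat the codes as magnitudes, in which case one-hot encoding is a common alternative.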

In [None]:
df.drop('car_name', axis = 1, inplace = True)
df.info()

In [None]:
df.describe()

# DATA PRE-PROCESSING

## EDA & TRANSFORMATION

In [None]:
df.hist(figsize = (12,10), bins = 50)
plt.show()

In [None]:
df['driven_kms'] = np.log(df['driven_kms'])
df['selling_price'] = np.log(df['selling_price'])
df['present_price'] = np.log(df['present_price'])

In [None]:
df.hist(figsize = (12,10), bins = 50)
plt.show()

In [None]:
plt.figure(figsize = (12,8))
sns.heatmap(df.corr(),annot=True, cmap='coolwarm', linewidths=.5)
plt.show()

## FEATURES SELECTION

In [None]:
x = df.drop('selling_price', axis = 1)
y = df.selling_price

x, y = shuffle(x, y, random_state=42)

In [None]:
model_sfs = RandomForestRegressor(random_state = 42)

In [None]:
sfs = SFS(model_sfs, k_features = 'best', forward = True, floating = True, scoring = 'neg_root_mean_squared_error', cv = 5, n_jobs = 1, verbose = 2)

In [None]:
sfs.fit(x,y)

In [None]:
list(sfs.k_feature_names_)

In [None]:
x = df[list(sfs.k_feature_names_)]
y = df.selling_price

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(x,y,test_size = 0.2,random_state = 42, shuffle = True)
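The selection above is fitted on the full dataset before the split, so the held-out rows influence which features are kept. A leakage-free variant would split first and select inside a pipeline. The sketch below is only an illustration under several assumptions: it uses scikit-learn's own `SequentialFeatureSelector` (which has no floating option) rather than the mlxtend `SFS` used in this notebook, `LinearRegression` instead of a random forest to keep it fast, and synthetic stand-in data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 5 features, 2 of them informative
X, y = make_regression(n_samples=200, n_features=5, n_informative=2,
                       noise=0.1, random_state=42)

# Split first, so the held-out rows never influence which features are chosen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ('select', SequentialFeatureSelector(
        LinearRegression(), n_features_to_select=2, direction='forward',
        scoring='neg_root_mean_squared_error', cv=5)),
    ('model', LinearRegression()),
])
pipe.fit(X_train, y_train)

# Boolean mask over the input columns: True where a feature was kept
selected_mask = pipe.named_steps['select'].get_support()
print(selected_mask)
print(f'Held-out R-squared: {pipe.score(X_test, y_test):.3f}')
```

Wrapping the selector in a `Pipeline` also means that any later cross-validation repeats the selection within each fold instead of reusing a choice made on all rows.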

# MODELLING

In [None]:
model_l = LinearRegression()
model_r = RandomForestRegressor(random_state=42)

In [None]:
model_l.fit(xtrain,ytrain)

In [None]:
model_r.fit(xtrain, ytrain)

# EVALUATION

In [None]:
pred_l = model_l.predict(xtest)


mse = mean_squared_error(ytest, pred_l)
rmse = np.sqrt(mse)
mae = mean_absolute_error(ytest, pred_l)
r2 = r2_score(ytest, pred_l)

# Print metrics
print(f'Mean Squared Error (MSE): {mse}')
print(f'Root Mean Squared Error (RMSE): {rmse}')
print(f'Mean Absolute Error (MAE): {mae}')
print(f'R-squared (R2): {r2}')

baseline_pred = [np.mean(ytest)] * len(ytest)
baseline_mse = mean_squared_error(ytest, baseline_pred)
baseline_rmse = np.sqrt(baseline_mse)
print(f'Baseline MSE: {baseline_mse}')
print(f'Baseline RMSE: {baseline_rmse}')


In [None]:
pred_r = model_r.predict(xtest)

mse = mean_squared_error(ytest, pred_r)
rmse = np.sqrt(mse)
mae = mean_absolute_error(ytest, pred_r)
r2 = r2_score(ytest, pred_r)

print(f'Mean Squared Error (MSE): {mse}')
print(f'Root Mean Squared Error (RMSE): {rmse}')
print(f'Mean Absolute Error (MAE): {mae}')
print(f'R-squared (R2): {r2}')

baseline_pred = [np.mean(ytest)] * len(ytest)
baseline_mse = mean_squared_error(ytest, baseline_pred)
baseline_rmse = np.sqrt(baseline_mse)
print(f'Baseline MSE: {baseline_mse}')
print(f'Baseline RMSE: {baseline_rmse}')
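The mean baseline in the cells above gives the model metrics a reference point: a predictor that always outputs the mean of the evaluated targets has an MSE equal to the variance of those targets and an R-squared of exactly 0. A small pure-Python sketch with made-up numbers (the helpers `mse` and `r2` are written out here for illustration and follow the standard definitions):

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up log-price targets and model predictions, for illustration only
y_true = [0.5, 1.0, 1.5, 2.0]
y_model = [0.6, 0.9, 1.6, 1.9]

# Baseline: always predict the mean of the evaluated targets
y_mean = sum(y_true) / len(y_true)
y_baseline = [y_mean] * len(y_true)

baseline_mse = mse(y_true, y_baseline)
model_mse = mse(y_true, y_model)
print(f'Baseline MSE: {baseline_mse}, R2: {r2(y_true, y_baseline)}')
print(f'Model MSE: {model_mse}, R2: {r2(y_true, y_model)}')
```

A model is only useful to the extent that its error falls clearly below this baseline.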


# CROSS VALIDATION

In [None]:
model = RandomForestRegressor(random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

scoring = {
    'mse': make_scorer(mean_squared_error),
    'mae': make_scorer(mean_absolute_error),
    'r2': make_scorer(r2_score)
}

scores_mse = cross_val_score(model, xtrain, ytrain, cv=kf, scoring='neg_mean_squared_error')
scores_mae = cross_val_score(model, xtrain, ytrain, cv=kf, scoring='neg_mean_absolute_error')
scores_r2 = cross_val_score(model, xtrain, ytrain, cv=kf, scoring='r2')

rmse_scores = np.sqrt(-scores_mse)

print(f'Mean MSE: {-scores_mse.mean()}')
print(f'Standard Deviation of MSE: {scores_mse.std()}')
print(f'Mean RMSE: {rmse_scores.mean()}')
print(f'Standard Deviation of RMSE: {rmse_scores.std()}')
print(f'Mean MAE: {-scores_mae.mean()}')
print(f'Standard Deviation of MAE: {scores_mae.std()}')
print(f'Mean R-squared: {scores_r2.mean()}')
print(f'Standard Deviation of R-squared: {scores_r2.std()}')

model.fit(xtrain, ytrain)
pred = model.predict(xtest)

mse_test = mean_squared_error(ytest, pred)
rmse_test = np.sqrt(mse_test)
mae_test = mean_absolute_error(ytest, pred)
r2_test = r2_score(ytest, pred)

print('\n\n')
print(f'Test Set MSE: {mse_test}')
print(f'Test Set RMSE: {rmse_test}')
print(f'Test Set MAE: {mae_test}')
print(f'Test Set R-squared: {r2_test}')
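The cell above calls `cross_val_score` once per metric, which refits the model on every fold three times. As a possible alternative, `cross_validate` accepts several scorers and fits once per fold. The sketch below assumes synthetic stand-in data and a `LinearRegression` model to keep it fast; neither is what this notebook uses.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption, not the car dataset)
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scoring = ['neg_mean_squared_error', 'neg_mean_absolute_error', 'r2']

# One fit per fold; each metric is returned under the key 'test_<scorer name>'
results = cross_validate(LinearRegression(), X, y, cv=kf, scoring=scoring)

# The 'neg_' scorers return negated errors, so flip the sign back
mean_mse = -results['test_neg_mean_squared_error'].mean()
mean_mae = -results['test_neg_mean_absolute_error'].mean()
mean_r2 = results['test_r2'].mean()

print(f'Mean MSE: {mean_mse:.3f}')
print(f'Mean MAE: {mean_mae:.3f}')
print(f'Mean R-squared: {mean_r2:.3f}')
```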


# CONCLUSION

### Cross-Validation Metrics (Training Data):
- **Mean MSE**: 0.0537
- **Mean RMSE**: 0.2290
- **Mean MAE**: 0.1602
- **Mean R-squared**: 0.9677

These are averages over the five validation folds of the training split, so they estimate out-of-sample error rather than fit to the training data. Because `selling_price` was log-transformed before modelling, every error metric is in log-price units:
- **Mean MSE** (0.0537) is the average squared error between predicted and actual log prices.
- **Mean RMSE** (0.2290) is on the same scale as the log-transformed target, so it is easier to interpret than the MSE.
- **Mean MAE** (0.1602) is the average absolute difference between predicted and actual log prices.
- **Mean R-squared** (0.9677) means the model explains approximately 96.77% of the variance in log selling price on the validation folds.
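
Because the target was transformed with the natural log, an error of `e` in log price corresponds to a price ratio of `exp(e)`. A minimal sketch of that conversion, using the cross-validation values listed above; the helper name `log_error_to_factor` is made up for illustration.

```python
import math

# Cross-validation values quoted in the list above (log-price units)
cv_rmse_log = 0.2290
cv_mae_log = 0.1602

def log_error_to_factor(err_log):
    """Convert an error in ln(price) units to the equivalent price ratio."""
    return math.exp(err_log)

rmse_factor = log_error_to_factor(cv_rmse_log)
mae_factor = log_error_to_factor(cv_mae_log)

print(f'RMSE {cv_rmse_log} in log units -> price ratio {rmse_factor:.3f}')
print(f'MAE {cv_mae_log} in log units -> price ratio {mae_factor:.3f}')
```

The ratios come out at about 1.26 for the RMSE and 1.17 for the MAE, which gives a rough sense of the typical multiplicative error on the original price scale.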

### Test Set Metrics:
- **Test Set MSE**: 0.0383
- **Test Set RMSE**: 0.1958
- **Test Set MAE**: 0.1480
- **Test Set R-squared**: 0.9732

The test-set metrics are in line with the cross-validation estimates:
- **Test Set MSE** (0.0383) is lower than the mean cross-validation MSE (0.0537), which gives no sign of overfitting to the training split.
- **Test Set RMSE** (0.1958) is likewise below the cross-validation RMSE (0.2290), in log-price units.
- **Test Set MAE** (0.1480) is close to the cross-validation MAE (0.1602).
- **Test Set R-squared** (0.9732) implies that the model explains approximately 97.32% of the variance in log selling price on the test set.

### Summary:
The random forest model performs well both in cross-validation on the training split and on the held-out test set, with low errors (MSE, RMSE, MAE) and R-squared values above 0.96 in both cases. The small gap between the cross-validation and test metrics suggests that the model generalizes to unseen rows of this dataset. Two caveats apply: all errors are measured on the log of the selling price rather than in price units, and the features were selected on the full dataset before the train/test split, so the test metrics may be slightly optimistic.

