
In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objs as go
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

file_path = '/content/car_price_prediction_.csv'
data = pd.read_csv(file_path)

In [2]:
brand_price = data.groupby('Brand')['Price'].mean().sort_values(ascending=False)
print("Average price of cars by brand:\n", brand_price)

Average price of cars by brand:
 Brand
BMW         54157.114385
Tesla       53475.547471
Mercedes    53191.090085
Toyota      52078.728235
Honda       52050.283949
Audi        51953.424810
Ford        51593.254813
Name: Price, dtype: float64


In [None]:
plt.figure(figsize=(14, 8))
sns.barplot(x=brand_price.index, y=brand_price.values, palette='coolwarm')
plt.title('Average Car Price by Brand')
plt.xticks(rotation=0)
plt.ylabel('Average Price')
plt.show()


plt.figure(figsize=(14, 8))
sns.boxplot(x='Model', y='Price', data=data, palette='coolwarm')
plt.title('Distribution of Car Prices by Model')
plt.xticks(rotation=60)
plt.show()

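# **BMW, Tesla and Mercedes have the highest average prices, but the gap between the top brand (BMW, about 54,157) and the bottom brand (Ford, about 51,593) is only around 5%, and Audi ranks below Toyota and Honda. The data gives at most weak support for the hypothesis that luxury brands are significantly more expensive.**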
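
As a rough check on how large the brand effect is, the averages printed above can be compared directly. The sketch below copies those printed values by hand (rounded to two decimals) and computes the relative gap between the most and least expensive brand. The variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Brand averages copied from the printed output above, rounded to two decimals.
brand_price_printed = pd.Series({
    "BMW": 54157.11, "Tesla": 53475.55, "Mercedes": 53191.09,
    "Toyota": 52078.73, "Honda": 52050.28, "Audi": 51953.42, "Ford": 51593.25,
})

top, bottom = brand_price_printed.max(), brand_price_printed.min()
gap_pct = (top - bottom) / bottom * 100  # relative gap, in percent

# Each brand's average relative to the mean of the brand averages, in percent.
relative_to_mean = (brand_price_printed / brand_price_printed.mean() - 1) * 100

print(f"Gap between top and bottom brand: {gap_pct:.1f}%")
print(relative_to_mean.round(1))
```

A gap of this size says nothing about statistical significance by itself; that would require the per-car spread, which the averages do not show.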

In [None]:
year_price = data.groupby('Year')['Price'].mean().sort_values(ascending=False)
plt.figure(figsize=(14, 8))
# errorbar=None replaces ci=None, which is deprecated since seaborn 0.12
sns.lineplot(x='Year', y='Price', data=data, errorbar=None, marker='o')
plt.title('Car Price Depreciation Over Time (Line Plot)')
plt.show()
fig = px.scatter(data, x="Year", y="Price", color="Brand", title="Price Depreciation by Year and Brand")
fig.show()
heatmap_data = pd.pivot_table(data, values='Price', index='Year', columns='Brand', aggfunc='mean')
plt.figure(figsize=(14, 8))
sns.heatmap(heatmap_data, cmap="YlGnBu", annot=True, fmt=".1f", linewidths=0.5)
plt.title('Heatmap of Car Prices by Year and Brand')
plt.show()

1. 	Car prices vary by brand and year. For example, Mercedes and Tesla cars have high average prices in recent years (especially 2021-2023), which can be explained by their premium positioning and new technological solutions.
1. 	Older cars (e.g. 2000-2005) show a significant decrease in value across all brands. This is due to natural depreciation and technology obsolescence.
1. 	Sharp price changes can be seen in certain years for some brands. For example, BMW has a sharp price increase in 2011, while Honda shows a decline in the same year.
1. 	Tesla and Mercedes maintain higher prices compared to brands such as Ford and Honda throughout the period, emphasising their premium nature.
1. 	Ford and Honda show more stable and lower prices, indicating their positioning in the mid-range and budget segment.

**These trends demonstrate how year of manufacture and brand affect pricing in the car market.**
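
Sharp changes such as the ones described for individual brands can also be located numerically instead of being read off the heatmap. The following is a minimal sketch on synthetic data (an assumption: only the column names `Year`, `Brand` and `Price` match the real dataset, the values are random), computing the year-over-year percentage change of the mean price per brand.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset; values are random, column names are assumed.
rng = np.random.default_rng(0)
n = 1200
demo = pd.DataFrame({
    "Year": rng.integers(2000, 2024, size=n),
    "Brand": rng.choice(["BMW", "Honda", "Tesla"], size=n),
    "Price": rng.uniform(5_000, 100_000, size=n),
})

# Mean price per year (rows) and brand (columns), the same table the heatmap shows.
yearly = demo.pivot_table(values="Price", index="Year", columns="Brand", aggfunc="mean")

# Year-over-year change in percent: (this year - previous year) / previous year.
yoy_change = yearly.diff() / yearly.shift() * 100

# For each brand, the year with the largest absolute swing.
sharpest_year = yoy_change.abs().idxmax()
print(sharpest_year)
```

On the real data, `demo` would be replaced by `data`. With only a few cars per year and brand, large swings can be sampling noise, so the group sizes are worth checking alongside the means.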

In [None]:
engine_price = data.groupby('Engine Size')['Price'].mean().sort_values(ascending=False)
plt.figure(figsize=(14, 8))
sns.scatterplot(x='Engine Size', y='Price', hue='Fuel Type', data=data, palette='coolwarm')
plt.title('Engine Size vs. Price with Fuel Type')
plt.show()

In [None]:
fuel_price = data.groupby('Fuel Type')['Price'].mean().sort_values(ascending=False)
plt.figure(figsize=(14, 8))
sns.boxplot(x='Fuel Type', y='Price', data=data, palette='Set2')
plt.title('Impact of Fuel Type on Price (Boxplot)')
plt.show()

plt.figure(figsize=(14, 8))
sns.histplot(data=data, x='Price', hue='Fuel Type', multiple='stack', palette='Set2', kde=True)
plt.title('Price Distribution by Fuel Type (Histogram)')
plt.show()

In [None]:
transmission_price = data.groupby('Transmission')['Price'].mean().sort_values(ascending=False)
transmission_fuel = pd.crosstab(data['Transmission'], data['Fuel Type'])
transmission_fuel.plot(kind='bar', stacked=True, figsize=(14, 8), colormap='Set2')
plt.title('Transmission Type and Fuel Type (Stacked Bar Plot)')
plt.show()

In [None]:
data_ml = data.drop(columns=['Car ID'])
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4
encoder = OneHotEncoder(sparse_output=False)
categorical_columns = ['Brand', 'Fuel Type', 'Transmission', 'Condition', 'Model']
encoded_categorical_data = encoder.fit_transform(data_ml[categorical_columns])
encoded_df = pd.DataFrame(encoded_categorical_data, columns=encoder.get_feature_names_out(categorical_columns))
data_ml = pd.concat([data_ml.drop(columns=categorical_columns), encoded_df], axis=1)
X = data_ml.drop(columns=['Price'])
y = data_ml['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

y_pred_lr = lr_model.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mse_lr)
print(f'RMSE Linear_Regression: {rmse_lr}')

In [None]:
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)

y_pred_dt = dt_model.predict(X_test)
mse_dt = mean_squared_error(y_test, y_pred_dt)
rmse_dt = np.sqrt(mse_dt)
print(f'RMSE Decision_Tree_Regressor: {rmse_dt}')

In [None]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mse_rf)
print(f'RMSE Random_Forest_Regressor: {rmse_rf}')
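
An RMSE value is easier to interpret next to a trivial baseline. The sketch below is not part of the original analysis: it uses synthetic data (an assumption, with `Price` generated independently of the features) and scikit-learn's `DummyRegressor`, which always predicts the training mean, to show what "no predictive signal" looks like.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the target is drawn independently of both features.
rng = np.random.default_rng(42)
n = 1000
X_demo = pd.DataFrame({
    "Mileage": rng.uniform(0, 300_000, size=n),
    "Year": rng.integers(2000, 2024, size=n),
})
y_demo = pd.Series(rng.uniform(5_000, 100_000, size=n), name="Price")

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

def rmse(model):
    """Fit on the training split and return the RMSE on the test split."""
    model.fit(X_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

rmse_baseline = rmse(DummyRegressor(strategy="mean"))  # always predicts the mean
rmse_linear = rmse(LinearRegression())

print(f"RMSE baseline (mean): {rmse_baseline:.0f}")
print(f"RMSE linear model:    {rmse_linear:.0f}")
```

If the models above score close to such a baseline on the real `X_train`/`X_test` split, the features carry little information about price, and differences between the three RMSE values should be read with caution.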

In [None]:
importances = rf_model.feature_importances_
feature_names = X.columns

importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

plt.figure(figsize=(14, 8))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance in Random Forest Model')
plt.show()

In [None]:
correlation = data['Mileage'].corr(data['Price'])
print(f'Correlation between mileage and price: \n{correlation}\n')

X = data['Mileage'].values.reshape(-1, 1)
y = data['Price'].values
fig = px.scatter(data, x='Mileage', y='Price', color='Brand',
                 title="Mileage vs. Price",
                 labels={"Mileage": "Mileage (km)", "Price": "Price ($)"},
                 hover_data=['Model'])

fig.update_layout(template='plotly_dark', width=800, height=500)
fig.show()

* Near-zero correlation: The Pearson correlation coefficient between Mileage and Price is approximately -0.0086, which is effectively zero. Mileage has almost no linear relationship with price in this sample.

**Mileage is not a major factor affecting the price of a car, at least in this data sample. This may suggest that other characteristics (e.g. year of manufacture, condition, brand) play a more important role in determining price.**
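
A Pearson coefficient only measures linear association, so it can help to compare it with a rank-based (Spearman) coefficient and with the size of pure sampling noise. The sketch below uses synthetic, independent `Mileage` and `Price` columns (an assumption; the real data is not used) and computes Spearman as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: mileage and price are drawn independently of each other.
rng = np.random.default_rng(1)
n = 2000
demo = pd.DataFrame({
    "Mileage": rng.uniform(0, 300_000, size=n),
    "Price": rng.uniform(5_000, 100_000, size=n),
})

# Pearson: strength of the linear relationship.
pearson_r = demo["Mileage"].corr(demo["Price"])

# Spearman: Pearson correlation of the ranks, sensitive to any monotonic trend.
spearman_r = demo["Mileage"].rank().corr(demo["Price"].rank())

# For independent variables, r scatters around 0 with a standard error of
# roughly 1/sqrt(n); two standard errors give a rough noise band.
noise_band = 2 / np.sqrt(n)

print(f"Pearson:  {pearson_r:.4f}")
print(f"Spearman: {spearman_r:.4f}")
print(f"Noise band for n={n}: +/-{noise_band:.4f}")
```

For a sample of a few thousand rows (an assumption, since the row count is not shown here), a coefficient of -0.0086 would sit well inside this band, meaning it is indistinguishable from no relationship at all.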

In [None]:
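# include_lowest=True keeps cars with a mileage of exactly 0 in the first bin
data['Mileage_Group'] = pd.cut(data['Mileage'], bins=[0, 50000, 100000, 150000, 200000, np.inf],
                               labels=['0-50k', '50k-100k', '100k-150k', '150k-200k', '200k+'],
                               include_lowest=True)

# observed=False keeps empty mileage groups in the result and avoids a pandas FutureWarning
group_price = data.groupby('Mileage_Group', observed=False)['Price'].mean().reset_index()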
group_price['Price'] = group_price['Price'].round()

fig = px.bar(group_price, x='Mileage_Group', y='Price', title="Average Car Price by Mileage Group",
             labels={"Mileage_Group": "Mileage Group", "Price": "Average Price ($)"}, text='Price')

fig.update_layout(template='plotly_dark', width=800, height=500)
fig.show()
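
One detail of `pd.cut` is worth keeping in mind when reading the grouped averages: intervals are right-closed by default, so a mileage of exactly 0 falls outside the first bin unless `include_lowest=True` is passed. The small example below uses four hand-picked mileage values (illustrative, not taken from the dataset) to show the edge behaviour.

```python
import numpy as np
import pandas as pd

# Hand-picked values sitting on or near the bin edges.
mileage = pd.Series([0, 50_000, 50_001, 250_000])
bins = [0, 50_000, 100_000, 150_000, 200_000, np.inf]
labels = ['0-50k', '50k-100k', '100k-150k', '150k-200k', '200k+']

# Default: bins are (0, 50000], (50000, 100000], ... so 0 itself gets no group.
default_groups = pd.cut(mileage, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 50000].
inclusive_groups = pd.cut(mileage, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    "Mileage": mileage,
    "default": default_groups,
    "include_lowest": inclusive_groups,
}))
```

Rows that receive no group are silently dropped from a subsequent `groupby`, so brand-new cars with zero mileage could be missing from the bar chart without any error.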

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
condition_price = data.groupby('Condition')['Price'].mean().reset_index()

fig = px.sunburst(data, path=['Condition', 'Brand'], values='Price',
                  title="Impact of Car Condition on Price (Sunburst Plot)",
                  color='Price', hover_data=['Price'],
                  color_continuous_scale='RdYlBu', maxdepth=2)

fig.update_layout(template='plotly_dark', width=800, height=600)
fig.show()

In [None]:
fig = px.treemap(data, path=['Condition', 'Brand', 'Model'], values='Price',
                 title="Impact of Car Condition, Brand, and Model on Price (Treemap)",
                 color='Price', hover_data=['Price'],
                 color_continuous_scale='RdYlBu')

fig.update_layout(template='plotly_dark', width=900, height=700)
fig.show()

# 🏁 **Crossing the Finish Line** 🏁

And there we have it—**the ultimate journey** through the intricate world of car pricing! 🚗💨

---

### 💡 **What Did We Learn?**
- **Condition matters**, but it's not the only driver—other factors like **brand** can shift the gears on pricing.
- **Mileage vs. Price?** While you'd expect a direct link, our data says, "Not so fast!" 🚦
- **Fuel type and engine size** show their power, but the real winners? **Those premium brands and models** that hold their value mile after mile.

