# Business Understanding

## Project Domain

This project belongs to the e-commerce and consumer technology domain, specifically the analysis and prediction of laptop prices from technical specifications (RAM, processor, storage, GPU, brand, and so on). It helps online stores, buyers, and market analysts understand the main factors that drive laptop prices.

## Problem Statements

1. Laptop prices vary widely even among models with similar specifications, which makes it hard for consumers to judge whether a laptop is worth its price.

2. Sellers often struggle to set an optimal price for new products on e-commerce platforms without a solid reference from historical data.

3. There is no automated tool that predicts laptop prices from specifications, which makes pricing slow and error-prone.

## Goals

1. Build a machine learning model that predicts laptop prices from features such as brand, processor, RAM, storage, display, and GPU.

2. Identify which specifications influence the price most (feature importance).

3. Provide optimal price recommendations for buyers and sellers so that their decisions are more data-driven.

## Solution Statements

This project uses a laptop dataset available on Kaggle to train a machine learning model for price prediction (for example linear regression, random forest, or XGBoost). The dataset is cleaned, analyzed, and used to evaluate model performance. The final result is a system or script that predicts the price of a new laptop from its specifications and helps visualize the main factors that affect the price.
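
One way to choose among the candidate model families named above is to compare them with cross-validation. The sketch below is only an illustration: it uses synthetic data from `make_regression` as a stand-in for the Kaggle dataset, and the candidate list and settings are assumptions rather than the configuration used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the laptop data (assumption: not the Kaggle dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated R^2 for each candidate
cv_scores = {
    name: cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, estimator in candidates.items()
}

for name, score in cv_scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

On the real dataset the same loop would take the prepared feature matrix and the `price` column in place of `X_demo` and `y_demo`.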

# Data Understanding

## Import data from Kaggle

In [86]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle (1).json


{'kaggle (1).json': b'{"username":"marsckalrestujagad","key":"<redacted>"}'}

In [87]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!ls ~/.kaggle

kaggle.json
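
Uploading `kaggle.json` through the notebook echoes the API key in the cell output. A possible alternative, sketched below, is to build the file from the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables. The helper name `write_kaggle_credentials`, the placeholder values, and the temporary directory are assumptions made so the example runs on its own.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def write_kaggle_credentials(username: str, key: str, config_dir: Path) -> Path:
    """Write kaggle.json into config_dir with owner-only permissions."""
    config_dir.mkdir(parents=True, exist_ok=True)
    target = config_dir / "kaggle.json"
    target.write_text(json.dumps({"username": username, "key": key}))
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)  # same effect as chmod 600
    return target


# Demo with placeholder values and a temporary directory; in a real notebook
# the values would come from the environment and config_dir would be ~/.kaggle.
demo_dir = Path(tempfile.mkdtemp())
cred_path = write_kaggle_credentials(
    os.environ.get("KAGGLE_USERNAME", "demo-user"),
    os.environ.get("KAGGLE_KEY", "demo-key"),
    demo_dir,
)
print(cred_path.name)
```

Nothing in this version prints the key, so the notebook output can be shared safely.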


In [88]:
!kaggle datasets download -d jacksondivakarr/laptop-price-prediction-dataset

Dataset URL: https://www.kaggle.com/datasets/jacksondivakarr/laptop-price-prediction-dataset
License(s): apache-2.0
laptop-price-prediction-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


In [89]:
!mkdir laptop-price-prediction-dataset
!unzip laptop-price-prediction-dataset.zip -d laptop-price-prediction-dataset
!ls laptop-price-prediction-dataset

mkdir: cannot create directory ‘laptop-price-prediction-dataset’: File exists
Archive:  laptop-price-prediction-dataset.zip
replace laptop-price-prediction-dataset/data.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: laptop-price-prediction-dataset/data.csv  
replace laptop-price-prediction-dataset/data.xlsx? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: laptop-price-prediction-dataset/data.xlsx  
data.csv  data.xlsx
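
The `mkdir` error and the interactive `replace ...?` prompts above appear because the cell was run a second time. A rerun-safe variant can be sketched with the standard library. The helper `extract_dataset` and the tiny demo archive are assumptions for illustration, not part of the original notebook.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_dataset(zip_path: Path, target_dir: Path) -> list[str]:
    """Extract zip_path into target_dir, overwriting files from earlier runs."""
    target_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)  # replaces existing files without prompting
    return sorted(p.name for p in target_dir.iterdir())


# Demo on a tiny throwaway archive (stand-in for the Kaggle download)
work_dir = Path(tempfile.mkdtemp())
demo_zip = work_dir / "demo.zip"
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("data.csv", "brand,price\nHP,49900\n")

first_run = extract_dataset(demo_zip, work_dir / "dataset")
second_run = extract_dataset(demo_zip, work_dir / "dataset")  # safe to repeat
print(first_run, second_run)
```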


## Import the required libraries

In [90]:
import pandas as pd
import numpy as np

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import plot_model

In [91]:
df = pd.read_csv('laptop-price-prediction-dataset/data.csv')

In [92]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Unnamed: 0.1 | 0 | 1 | 2 | 3 | 4 |
| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 |
| brand | HP | HP | Acer | Lenovo | Apple |
| name | Victus 15-fb0157AX Gaming Laptop | 15s-fq5007TU Laptop | One 14 Z8-415 Laptop | Yoga Slim 6 14IAP8 82WU0095IN Laptop | MacBook Air 2020 MGND3HN Laptop |
| price | 49900 | 39900 | 26990 | 59729 | 69990 |
| spec_rating | 73.0 | 60.0 | 69.323529 | 66.0 | 69.323529 |
| processor | 5th Gen AMD Ryzen 5 5600H | 12th Gen Intel Core i3 1215U | 11th Gen Intel Core i3 1115G4 | 12th Gen Intel Core i5 1240P | Apple M1 |
| CPU | Hexa Core, 12 Threads | Hexa Core (2P + 4E), 8 Threads | Dual Core, 4 Threads | 12 Cores (4P + 8E), 16 Threads | Octa Core (4P + 4E) |
| Ram | 8GB | 8GB | 8GB | 16GB | 8GB |
| Ram_type | DDR4 | DDR4 | DDR4 | LPDDR5 | DDR4 |


In [93]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 893 entries, 0 to 892
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0.1       893 non-null    int64  
 1   Unnamed: 0         893 non-null    int64  
 2   brand              893 non-null    object 
 3   name               893 non-null    object 
 4   price              893 non-null    int64  
 5   spec_rating        893 non-null    float64
 6   processor          893 non-null    object 
 7   CPU                893 non-null    object 
 8   Ram                893 non-null    object 
 9   Ram_type           893 non-null    object 
 10  ROM                893 non-null    object 
 11  ROM_type           893 non-null    object 
 12  GPU                893 non-null    object 
 13  display_size       893 non-null    float64
 14  resolution_width   893 non-null    float64
 15  resolution_height  893 non-null    float64
 16  OS                 893 non

In [94]:
print(f"The Laptop Price Dataset has {df.shape[0]} rows and {df.shape[1]} columns")

The Laptop Price Dataset has 893 rows and 18 columns


In [95]:
df.isnull().sum()

| | 0 |
|---|---|
| Unnamed: 0.1 | 0 |
| Unnamed: 0 | 0 |
| brand | 0 |
| name | 0 |
| price | 0 |
| spec_rating | 0 |
| processor | 0 |
| CPU | 0 |
| Ram | 0 |
| Ram_type | 0 |


In [96]:
df.drop(['Unnamed: 0.1', 'Unnamed: 0'], axis=1, inplace=True)

In [97]:
df.describe()

| | price | spec_rating | display_size | resolution_width | resolution_height | warranty |
|---|---|---|---|---|---|---|
| count | 893.0 | 893.0 | 893.0 | 893.0 | 893.0 | 893.0 |
| mean | 79907.409854 | 69.379026 | 15.173751 | 2035.393057 | 1218.324748 | 1.079507 |
| std | 60880.043823 | 5.541555 | 0.939095 | 426.076009 | 326.756883 | 0.326956 |
| min | 9999.0 | 60.0 | 11.6 | 1080.0 | 768.0 | 0.0 |
| 25% | 44500.0 | 66.0 | 14.0 | 1920.0 | 1080.0 | 1.0 |
| 50% | 61990.0 | 69.323529 | 15.6 | 1920.0 | 1080.0 | 1.0 |
| 75% | 90990.0 | 71.0 | 15.6 | 1920.0 | 1200.0 | 1.0 |
| max | 450039.0 | 89.0 | 18.0 | 3840.0 | 3456.0 | 3.0 |


## Exploratory Data Analysis

In [98]:
# Visualize how many laptops each brand has in the dataset
import plotly.express as px

fig = px.histogram(df, x="brand", title="Brand Distribution",
                   color="brand", color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

In [99]:
# Visualize the distribution of spec ratings in the dataset

fig = px.scatter(df, x="spec_rating", title="Specs Rating Distribution",
                 color_discrete_sequence=px.colors.qualitative.Pastel)
# Draw the mean spec rating as a dotted red vertical line
fig.add_shape(type='line', x0=df['spec_rating'].mean(), y0=0,
              x1=df['spec_rating'].mean(), y1=1000, line=dict(color='red', dash='dot'))
fig.show()


In [100]:
# Compare spec ratings across brands to see which brands have higher ratings
fig = px.scatter(df, x="spec_rating", y="brand", title="Specs Rating Distribution",
                 color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

In [101]:
fig = px.histogram(df, x='processor', title="Processor Distribution", color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

In [102]:
fig = px.histogram(df, x='CPU', title="CPU Distribution", color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

In [103]:
fig = px.histogram(df, x='Ram_type', color='Ram', title="Ram Distribution", color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

In [104]:
fig = px.histogram(df, x='ROM_type', color='ROM', title="ROM Distribution", color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

In [105]:
fig = px.histogram(df, x='GPU', color='GPU', title="GPU Distribution")
fig.show()

In [106]:
fig = px.histogram(df, x='display_size', color='display_size', title="Display Size Distribution", color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

In [107]:
fig = px.histogram(df, x='OS', color='OS', title="OS Distribution", color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

# Data Preparation

In [108]:
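# Label-encode every categorical column. The single `label_encoder` is refit
# on each column, so after the loop it only holds the mapping of the last
# column ('OS').
label_encoder = LabelEncoder()
categorical_cols = ['brand', 'name', 'processor', 'CPU', 'Ram', 'Ram_type', 'ROM', 'ROM_type', 'GPU', 'OS']
for col in categorical_cols:
    df[col] = label_encoder.fit_transform(df[col])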
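
Because one encoder is reused in the loop above, the mappings of the earlier columns cannot be recovered later. A minimal sketch of the per-column alternative follows. The small frame and its values are made up for the demo, and the dictionary name `encoders` is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny illustrative frame; the values are made up for the demo
demo = pd.DataFrame({
    "brand": ["HP", "Acer", "HP", "Lenovo"],
    "Ram": ["8GB", "16GB", "8GB", "16GB"],
})

encoders = {}
encoded = demo.copy()
for col in ["brand", "Ram"]:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(demo[col])
    encoders[col] = enc  # one fitted encoder per column

# Each column can be decoded again with its own encoder
decoded_brand = list(encoders["brand"].inverse_transform(encoded["brand"]))
print(encoded["brand"].tolist(), decoded_brand)
```

`LabelEncoder` orders classes alphabetically, so `'16GB'` receives a smaller code than `'8GB'`. The codes are identifiers and do not reflect memory size.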

In [109]:
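# Standardize the numeric features. The target ('price') is left in its
# original units so that the error metrics stay interpretable.
numerical_cols = ['spec_rating', 'display_size', 'resolution_width', 'resolution_height', 'warranty']
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])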

In [110]:
X = df.drop('price', axis=1)
y = df['price']

In [111]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=50)

In [112]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
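
The preparation steps above can also be wrapped in a scikit-learn `Pipeline`, so that the encoder and scaler are fitted on the training rows only and are reused unchanged at prediction time. This is a sketch under assumptions: the rows are invented, only two feature columns are used, and `OrdinalEncoder` stands in for the `LabelEncoder` loop.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Made-up rows that mimic a few dataset columns
demo = pd.DataFrame({
    "brand": ["HP", "Acer", "Lenovo", "HP", "Apple", "Acer", "HP", "Lenovo"],
    "display_size": [15.6, 14.0, 14.0, 15.6, 13.3, 15.6, 16.0, 14.0],
    "price": [49900, 26990, 59729, 39900, 69990, 31000, 75000, 52000],
})
X_demo = demo.drop(columns="price")
y_demo = demo["price"]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

preprocess = ColumnTransformer([
    # Categories unseen during fit are mapped to -1 instead of raising an error
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["brand"]),
    ("num", StandardScaler(), ["display_size"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])

pipeline.fit(X_tr, y_tr)  # encoder and scaler see the training rows only
demo_pred = pipeline.predict(X_te)
print(demo_pred.shape)
```

A fitted pipeline is a single object, so saving it keeps the preprocessing and the model together.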

# Modeling

In [124]:
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

y_pred = rf_model.predict(X_test)
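
The second project goal is to identify the most influential specifications. A random forest exposes impurity-based scores through `feature_importances_`. The sketch below uses synthetic data and borrows four column names from the dataset purely as labels. In this notebook `X_train` becomes a NumPy array after scaling, so the names would have to be taken from `X.columns`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data; the column names are only illustrative labels
feature_names = ["spec_rating", "display_size", "resolution_width", "warranty"]
X_arr, y_arr = make_regression(n_samples=300, n_features=4, n_informative=2, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_arr)

# Impurity-based importances, sorted from most to least influential
importances = pd.Series(forest.feature_importances_, index=feature_names).sort_values(ascending=False)
print(importances)
```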

# Evaluation

In [125]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 548929720.7566094
R-squared: 0.8398054695243926


# Deployment

## Model Simulation

In [115]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder

# Reload the raw data so the simulation starts from unscaled values
df = pd.read_csv('laptop-price-prediction-dataset/data.csv')

df = df.dropna()

# Fit one LabelEncoder per categorical column and keep them for inference
label_encoders = {}
for col in df.select_dtypes(include='object').columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

X = df.drop('price', axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Build one example row: the first class of every categorical column and the
# training median of every numeric column
example_input = {
    col: (label_encoders[col].transform([label_encoders[col].classes_[0]])[0] if col in label_encoders else X_train[col].median())
    for col in X_train.columns
}

input_df = pd.DataFrame([example_input])

predicted_price = model.predict(input_df)[0]
print(f"Predicted Laptop Price for input {example_input}: ${predicted_price:,.2f}")

Predicted Laptop Price for input {'Unnamed: 0.1': 474.5, 'Unnamed: 0': 538.5, 'brand': np.int64(0), 'name': np.int64(0), 'spec_rating': 69.32352941176471, 'processor': np.int64(0), 'CPU': np.int64(0), 'Ram': np.int64(0), 'Ram_type': np.int64(0), 'ROM': np.int64(0), 'ROM_type': np.int64(0), 'GPU': np.int64(0), 'display_size': 15.6, 'resolution_width': 1920.0, 'resolution_height': 1080.0, 'OS': np.int64(0), 'warranty': 1.0}: $113,317.48
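
The example above feeds already-encoded values to the model. For a real request the specifications arrive as text and have to pass through the fitted encoders first. The sketch below shows one possible helper. The name `encode_row`, the stand-in encoders, and their categories are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-ins for the fitted `label_encoders` dictionary (made-up categories)
demo_encoders = {
    "brand": LabelEncoder().fit(["Acer", "Apple", "HP", "Lenovo"]),
    "Ram": LabelEncoder().fit(["16GB", "4GB", "8GB"]),
}


def encode_row(raw: dict, encoders: dict) -> pd.DataFrame:
    """Encode categorical values with their fitted encoders; pass others through."""
    row = {}
    for col, value in raw.items():
        if col in encoders:
            if value not in encoders[col].classes_:
                raise ValueError(f"Unseen category {value!r} for column {col!r}")
            row[col] = int(encoders[col].transform([value])[0])
        else:
            row[col] = value
    return pd.DataFrame([row])


encoded_row = encode_row({"brand": "HP", "Ram": "8GB", "display_size": 15.6}, demo_encoders)
print(encoded_row.to_dict("records")[0])
```

Raising on unseen categories is a deliberate choice here: a label the encoder never saw has no meaningful code, so failing loudly is safer than predicting from an arbitrary value.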


In [116]:
import pandas as pd

# Reuse the held-out test rows as simulated inputs (shuffled reproducibly)
X_test_df = pd.DataFrame(X_test)

simulated_data = X_test_df.sample(n=len(X_test_df), random_state=42)

# Predict a price for every simulated row
simulated_price = model.predict(simulated_data)

simulation_results = simulated_data.copy()
simulation_results['Predicted_Price'] = simulated_price

print(simulation_results.head())

     Unnamed: 0.1  Unnamed: 0  brand  name  spec_rating  processor  CPU  Ram  \
357           376         417     15   400    69.323529         58    1    6   
572           597         677      3   705    73.000000         75    8    1   
773           811         896     14   577    66.000000         37    1    6   
477           499         565     24   514    69.323529         23   28    6   
768           806         891      9     2    60.000000        129   28    1   

     Ram_type  ROM  ROM_type  GPU  display_size  resolution_width  \
357         2    5         1  111          15.6            1920.0   
572         2    1         1   94          16.0            1920.0   
773         2    5         1  131          15.6            1920.0   
477         2    5         1  123          14.0            1920.0   
768         8    5         1   79          14.0            1920.0   

     resolution_height  OS  warranty  Predicted_Price  
357             1080.0  12         1          6

## Save Model

In [117]:
from google.colab import files
import joblib

filename = 'laptop_price_prediction_model.sav'
joblib.dump(model, filename)

files.download(filename)



In [118]:
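import joblib
# `label_encoder` and `scaler` come from the Data Preparation section. The
# encoder only holds the mapping of the last column it was fitted on.
joblib.dump(label_encoder, 'label_encoder.pkl')
joblib.dump(scaler, 'scaler.pkl')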

['scaler.pkl']

In [119]:
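from google.colab import files
import joblib

# Save and download the trained model
filename = 'laptop_price_prediction_model.sav'
joblib.dump(model, filename)
files.download(filename)

# Save the per-column encoders; this overwrites the label_encoder.pkl written above
joblib.dump(label_encoders, 'label_encoder.pkl')
files.download('label_encoder.pkl')

# Save the scaler fitted in the Data Preparation section
joblib.dump(scaler, 'scaler.pkl')
files.download('scaler.pkl')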
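
Keeping the model and its encoders in separate files makes it easy for them to drift apart. One option is to store them in a single bundle. The sketch below uses small stand-in objects rather than the notebook's fitted ones. The bundle keys and the file name `laptop_price_bundle.joblib` are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder

# Small stand-in model and encoder (not the notebook's fitted objects)
demo_model = RandomForestRegressor(n_estimators=5, random_state=0).fit(
    [[0], [1], [2], [3]], [10, 20, 30, 40]
)
demo_encoders = {"brand": LabelEncoder().fit(["Acer", "HP"])}

bundle = {
    "model": demo_model,
    "label_encoders": demo_encoders,
    "feature_names": ["brand"],
}

# Hypothetical file name, written to a temporary directory for the demo
bundle_path = str(Path(tempfile.mkdtemp()) / "laptop_price_bundle.joblib")
joblib.dump(bundle, bundle_path)

restored = joblib.load(bundle_path)
print(sorted(restored))
```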



In [120]:
df_original = pd.read_csv('laptop-price-prediction-dataset/data.csv')

df_original = df_original.loc[:, ~df_original.columns.str.contains('^Unnamed', case=False)]


In [121]:
model_features = list(model.feature_names_in_) if hasattr(model, 'feature_names_in_') else [c for c in df_original.columns if c != 'price']

input_df = input_df[[col for col in input_df.columns if col in model_features]]

for col in model_features:
    if col not in input_df.columns:
        input_df[col] = 0

input_df = input_df[model_features]
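
The filter, fill, and reorder steps above can also be expressed with a single `DataFrame.reindex` call. A minimal sketch follows. The feature list and the input row are made up, and `color` stands for any column the model was not trained on.

```python
import pandas as pd

model_features_demo = ["brand", "Ram", "display_size"]  # hypothetical training order
raw_input = pd.DataFrame([{"display_size": 15.6, "brand": 3, "color": 1}])

# Drop unknown columns, add missing ones as 0, and match the training order
aligned = raw_input.reindex(columns=model_features_demo, fill_value=0)
print(aligned.columns.tolist())
```

Filling a missing feature with 0 mirrors the loop above. For a label-encoded column, 0 is the code of the first category, so a default chosen per column may be more appropriate.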


In [122]:
print("Input DataFrame columns:", input_df.columns)


Input DataFrame columns: Index(['Unnamed: 0.1', 'Unnamed: 0', 'brand', 'name', 'spec_rating',
       'processor', 'CPU', 'Ram', 'Ram_type', 'ROM', 'ROM_type', 'GPU',
       'display_size', 'resolution_width', 'resolution_height', 'OS',
       'warranty'],
      dtype='object')
