# Question 1: Regression - Laptop Price Prediction (1.0 point)

Use the laptop_price_clean.csv dataset (in the laptop folder) to build a model that predicts
the price of a laptop (inputs: a suitable selection of features; output: Price_euros).
Then make a new prediction:
- Suppose a laptop has the following specifications:

| Company | TypeName  | Ram | OpSys | Weight | Screen       | IPS Screen | Screen PPI | Cpu_brand     | HDD | SSD | Gpu_brand | Os  |
|---------|-----------|-----|-------|--------|--------------|------------|------------|---------------|-----|-----|-----------|-----|
| Apple   | Ultrabook | 8   | macOS | 1.34   | NormalScreen | Yes        | 127.677    | Intel Core i7 | 0   | 128 | AMD       | Mac |

- What is the price of that laptop?

Read more information here:
https://www.kaggle.com/datasets/muhammetvarl/laptop-price

The first step is to start a Spark session.

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark
from pyspark import SparkContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('Regression_LaptopPrice').getOrCreate()

In [4]:
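# Import only the functions used below; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import col, count, isnan, when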
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline

In [5]:
# Use Spark to read in the laptop price csv file.
data = spark.read.csv("laptop/laptop_price_clean.csv",inferSchema=True,header=True)

In [6]:
# Print the Schema of the DataFrame
data.printSchema()

root
 |-- Company: string (nullable = true)
 |-- TypeName: string (nullable = true)
 |-- Ram: integer (nullable = true)
 |-- OpSys: string (nullable = true)
 |-- Weight: double (nullable = true)
 |-- Price_euros: double (nullable = true)
 |-- Screen: string (nullable = true)
 |-- IPS Screen: string (nullable = true)
 |-- Screen PPI: double (nullable = true)
 |-- Cpu_brand: string (nullable = true)
 |-- HDD: integer (nullable = true)
 |-- SSD: integer (nullable = true)
 |-- Gpu brand: string (nullable = true)
 |-- os: string (nullable = true)



In [7]:
data.show(5)

+-------+---------+---+-----+------+-----------+------------+----------+-----------+-------------+---+---+---------+------------------+
|Company| TypeName|Ram|OpSys|Weight|Price_euros|      Screen|IPS Screen| Screen PPI|    Cpu_brand|HDD|SSD|Gpu brand|                os|
+-------+---------+---+-----+------+-----------+------------+----------+-----------+-------------+---+---+---------+------------------+
|  Apple|Ultrabook|  8|macOS|  1.37|    1339.69|NormalScreen|       Yes|226.9830047|Intel Core i5|  0|128|    Intel|               Mac|
|  Apple|Ultrabook|  8|macOS|  1.34|     898.94|NormalScreen|        No|127.6779401|Intel Core i5|  0|  0|    Intel|               Mac|
|     HP| Notebook|  8|No OS|  1.86|      575.0|NormalScreen|        No|141.2119981|Intel Core i5|  0|256|    Intel|Others/No OS/Linux|
|  Apple|Ultrabook| 16|macOS|  1.83|    2537.45|NormalScreen|       Yes|220.5346239|Intel Core i7|  0|512|      AMD|               Mac|
|  Apple|Ultrabook|  8|macOS|  1.37|     1803.6|

In [8]:
data.head()

Row(Company='Apple', TypeName='Ultrabook', Ram=8, OpSys='macOS', Weight=1.37, Price_euros=1339.69, Screen='NormalScreen', IPS Screen='Yes', Screen PPI=226.9830047, Cpu_brand='Intel Core i5', HDD=0, SSD=128, Gpu brand='Intel', os='Mac')

In [9]:
for item in data.head():
    print(item)

Apple
Ultrabook
8
macOS
1.37
1339.69
NormalScreen
Yes
226.9830047
Intel Core i5
0
128
Intel
Mac


In [10]:
# 3. Check for NaN and null values
data.select([count(when(isnan(c), c)).alias(c) for c in data.columns]).toPandas().T

Unnamed: 0,0
Company,0
TypeName,0
Ram,0
OpSys,0
Weight,0
Price_euros,0
Screen,0
IPS Screen,0
Screen PPI,0
Cpu_brand,0


In [11]:
data.select([count(when(col(c).isNull(), c)).alias(c) for c in data.columns]).toPandas().T

Unnamed: 0,0
Company,0
TypeName,0
Ram,0
OpSys,0
Weight,0
Price_euros,0
Screen,0
IPS Screen,0
Screen PPI,0
Cpu_brand,0


=> There are no NaN or null values.
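
The two checks above are not redundant: `isnan` flags floating-point NaN values, while `isNull` flags entries that are missing altogether. The plain-Python sketch below illustrates the same distinction without Spark. The `rows` list and the `count_nan_and_null` helper are hypothetical and made up for illustration; they are not part of the notebook or of PySpark.

```python
import math

def count_nan_and_null(rows, columns):
    """Count NaN and None entries per column in a list of dict rows."""
    counts = {c: {"nan": 0, "null": 0} for c in columns}
    for row in rows:
        for c in columns:
            value = row.get(c)
            if value is None:
                # Missing entry: what isNull() would flag in Spark
                counts[c]["null"] += 1
            elif isinstance(value, float) and math.isnan(value):
                # Present but not a number: what isnan() would flag
                counts[c]["nan"] += 1
    return counts

# Hypothetical rows, invented to show both kinds of bad value.
rows = [
    {"Weight": 1.37, "Ram": 8},
    {"Weight": float("nan"), "Ram": None},
    {"Weight": None, "Ram": 16},
]
missing = count_nan_and_null(rows, ["Weight", "Ram"])
print(missing)
```

A column can pass one check and fail the other, which is why the notebook runs both.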

In [12]:
# 4. Check for duplicate rows
num_rows = data.count()
num_dist_rows = data.distinct().count()
dup_rows = num_rows - num_dist_rows

In [13]:
display(num_rows, num_dist_rows, dup_rows)

1302

1273

29

In [14]:
# Remove duplicate rows
data = data.drop_duplicates()

In [15]:
data.count()

1273

## Spark Formatting of Data

In [16]:
# Create a StringIndexer for categorical features
indexers = [
    StringIndexer(inputCol=column, outputCol=column+"_index").fit(data)
    for column in ["Company", "TypeName", "OpSys", 'Screen', 'IPS Screen',"Cpu_brand", "Gpu brand", "os"]
]
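
To make the indexing step more concrete, here is a plain-Python sketch of frequency-based label indexing, which is how `StringIndexer` orders labels by default (most frequent label gets index 0.0). The `fit_string_index` helper is hypothetical, and the alphabetical tie-break is an assumption about Spark's behavior for equally frequent labels.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (assumed)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Company values of the five rows shown by data.show(5) above
companies = ["Apple", "Apple", "HP", "Apple", "Apple"]
company_index = fit_string_index(companies)
print(company_index)  # {'Apple': 0.0, 'HP': 1.0}
```

On the full dataset the ordering is different: the pipeline outputs later in this notebook show Dell with `Company_index` 0.0 and Apple with 7.0. Note that these indices are arbitrary codes, so a linear model will treat them as numeric magnitudes; one-hot encoding is a common alternative. With the default `handleInvalid="error"`, a label that was not seen during fitting makes the transform fail.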

In [17]:
# Assemble features
assembler = VectorAssembler(
    inputCols=['Ram', 'Weight', 'Screen PPI', 'HDD', 'SSD', 'Company_index', 'TypeName_index', 'OpSys_index', 'Screen_index', 'IPS Screen_index', 'Cpu_brand_index', 'Gpu brand_index', 'os_index'],
    outputCol='features')

In [18]:
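# Create a StandardScaler; it standardizes every component of the assembled vector,
# including the indexed categorical columns, not only the numeric features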
scaler = StandardScaler(inputCol="features", outputCol="features_scaled", withStd=True, withMean=True)
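
With `withMean=True` and `withStd=True`, each component of the feature vector is centered on its mean and divided by its standard deviation. The sketch below shows that arithmetic in plain Python for a single column. The `standardize` helper is hypothetical, and the use of the sample (n - 1) standard deviation is an assumption about how Spark computes the scale.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample (n - 1) standard deviation
    return [(x - mean) / std for x in values]

# Weight values of the five rows shown by data.show(5) above
weights = [1.37, 1.34, 1.86, 1.83, 1.37]
scaled = standardize(weights)
print([round(z, 3) for z in scaled])
```

After scaling, the column has mean 0 and standard deviation 1, so features measured in different units (GB, kg, PPI) end up on a comparable scale.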

In [19]:
# Create pipeline
pipeline = Pipeline(stages=indexers + [assembler, scaler])

In [20]:
# Fit and transform data
model = pipeline.fit(data)
data_pre = model.transform(data)

In [21]:
data_pre.select("features_scaled").show(2, False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features_scaled                                                                                                                                                                                                                                                       |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[-0.08781747519530542,-0.24072008344791684,-0.13368845240395108,-0.7985948264202817,1.7463373711913306,-0.8872357946577107,0.8640185965926662,-0.3831512042929425,-0.4147966540203555,-0.6228302320631895,-0

In [22]:
data_pre.show(2)

+-------+---------+---+----------+------+-----------+------------+----------+-----------+-------------+---+---+---------+-------+-------------+--------------+-----------+------------+----------------+---------------+---------------+--------+--------------------+--------------------+
|Company| TypeName|Ram|     OpSys|Weight|Price_euros|      Screen|IPS Screen| Screen PPI|    Cpu_brand|HDD|SSD|Gpu brand|     os|Company_index|TypeName_index|OpSys_index|Screen_index|IPS Screen_index|Cpu_brand_index|Gpu brand_index|os_index|            features|     features_scaled|
+-------+---------+---+----------+------+-----------+------------+----------+-----------+-------------+---+---+---------+-------+-------------+--------------+-----------+------------+----------------+---------------+---------------+--------+--------------------+--------------------+
|   Dell|Ultrabook|  8|Windows 10|  1.88|     1298.0|NormalScreen|        No|141.2119981|Intel Core i7|  0|512|    Intel|Windows|          0.0|     

In [23]:
final_data = data_pre.select('features_scaled', 'Price_euros')

In [24]:
train_data,test_data = final_data.randomSplit([0.7,0.3])

In [25]:
train_data.describe().show()

+-------+------------------+
|summary|       Price_euros|
+-------+------------------+
|  count|               898|
|   mean|1134.0044432071272|
| stddev| 686.7726422596256|
|    min|             174.0|
|    max|            5499.0|
+-------+------------------+



In [26]:
test_data.describe().show()

+-------+------------------+
|summary|       Price_euros|
+-------+------------------+
|  count|               375|
|   mean|1140.5908266666663|
| stddev| 734.3754296210716|
|    min|             196.0|
|    max|            6099.0|
+-------+------------------+
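
The counts above (898 training rows and 375 test rows out of 1273) are close to, but not exactly, a 70/30 split, because `randomSplit` assigns rows randomly. Since no seed was passed, rerunning the cell would give a different split and slightly different metrics. The plain-Python sketch below illustrates the idea with a seeded generator; the `random_split` helper is hypothetical and Spark's internal sampling differs in detail.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row independently to train or test using a seeded RNG."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1273))  # stand-in for the 1273 deduplicated rows
train_a, test_a = random_split(rows, 0.7, seed=42)
train_b, test_b = random_split(rows, 0.7, seed=42)

print(len(train_a), len(test_a))
print(train_a == train_b)  # the same seed reproduces the same split
```

In PySpark, `randomSplit` accepts a `seed` argument for the same purpose, for example `final_data.randomSplit([0.7, 0.3], seed=42)`.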



## The Regression Models

In [27]:
lr = LinearRegression(featuresCol="features_scaled",  labelCol='Price_euros', predictionCol='Predict_Price_euros')
rf = RandomForestRegressor(featuresCol="features_scaled", labelCol="Price_euros", predictionCol='Predict_Price_euros', seed=42, maxDepth=10)

In [28]:
# Fit both models to the training data
lrModel = lr.fit(train_data)
rfModel = rf.fit(train_data)


## Model Comparison

Let's compare each of these models!

In [29]:
# Make predictions
rf_test_model = rfModel.transform(test_data)
lr_test_model = lrModel.transform(test_data)

In [30]:
print("Some Linear Regression model predictions")
# Select from lr_test_model so that the table shows the Linear Regression predictions
lr_test_model.select("Price_euros", "Predict_Price_euros").show(10)



In [31]:
print("Some Random Forest Regressor model predictions")
rf_test_model.select("Price_euros", "Predict_Price_euros").show(10)

Some Random Forest Regressor model predictions
+-----------+-------------------+
|Price_euros|Predict_Price_euros|
+-----------+-------------------+
|      275.0|  325.1122112179487|
|      245.0|  234.2854549731183|
|      209.0| 268.70306900820594|
|      249.0| 282.59239739736074|
|      210.8| 272.21977575757575|
|      199.0| 385.73522655918964|
|      379.0|    328.46748593518|
|      479.0|  672.3121153846154|
|     646.27|  672.3121153846154|
|      319.0|  706.1341153846154|
+-----------+-------------------+
only showing top 10 rows



In [32]:
evaluator = RegressionEvaluator(labelCol="Price_euros", predictionCol="Predict_Price_euros")

In [33]:
print('Model Linear Regression :')
print("RMSE on test data: ", evaluator.evaluate(lr_test_model, {evaluator.metricName: "rmse"}))
print("MSE on test data: ", evaluator.evaluate(lr_test_model, {evaluator.metricName: "mse"}))
print("R2 on test data: ", evaluator.evaluate(lr_test_model, {evaluator.metricName: "r2"}))

Model Linear Regression :
RMSE on test data:  361.7424934709763
MSE on test data:  130857.63158259934
R2 on test data:  0.7567110157791419


In [34]:
print('Model Random Forest Regressor:')
print("RMSE on test data: ", evaluator.evaluate(rf_test_model, {evaluator.metricName: "rmse"}))
print("MSE on test data: ", evaluator.evaluate(rf_test_model, {evaluator.metricName: "mse"}))
print("R2 on test data: ", evaluator.evaluate(rf_test_model, {evaluator.metricName: "r2"}))

Model Random Forest Regressor:
RMSE on test data:  313.9657640809379
MSE on test data:  98574.50101492717
R2 on test data:  0.8167314360503177


| Metric                | Linear Regression | Random Forest Regressor |
|-----------------------|-------------------|-------------------------|
| **RMSE on test data** | 361.742           | 313.966                 |
| **MSE on test data**  | 130857.632        | 98574.501               |
| **R^2 on test data**  | 75.67%            | 81.67%                  |
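
For reference, the three metrics are related as follows: MSE is the mean of the squared errors, RMSE is its square root, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares. The plain-Python sketch below applies these definitions to the ten pairs displayed in the Random Forest prediction table above (predictions rounded to two decimals). The `regression_metrics` helper is hypothetical, and ten rows are far too few to reproduce the test-set figures; the block only illustrates the formulas.

```python
import math

def regression_metrics(actual, predicted):
    """Return (mse, rmse, r2) for paired actual and predicted values."""
    n = len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    mse = ss_res / n
    return mse, math.sqrt(mse), 1 - ss_res / ss_tot

# Ten (Price_euros, Predict_Price_euros) pairs from the table above
actual = [275.0, 245.0, 209.0, 249.0, 210.8, 199.0, 379.0, 479.0, 646.27, 319.0]
predicted = [325.11, 234.29, 268.70, 282.59, 272.22,
             385.74, 328.47, 672.31, 672.31, 706.13]

mse, rmse, r2 = regression_metrics(actual, predicted)
print(f"MSE={mse:.2f} RMSE={rmse:.2f} R2={r2:.3f}")

# Consistency check on the reported test metrics: RMSE should equal sqrt(MSE)
print(math.sqrt(130857.63158259934))  # Linear Regression
print(math.sqrt(98574.50101492717))   # Random Forest Regressor
```

Because RMSE is just the square root of MSE, the two metrics always rank the models the same way; R^2 adds information about how much of the price variance is explained.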


Comparison:

- RMSE: The Random Forest Regressor has the lower RMSE (313.97 vs. 361.74), so its predictions are closer to the actual prices on the test data.
- MSE: As with RMSE, the Random Forest Regressor's MSE is lower.
- R^2: The Random Forest Regressor has the higher R^2 (0.817 vs. 0.757), which means it explains more of the variance in the data than the Linear Regression model.

#### Conclusion: The Random Forest Regressor is the better choice because all of its evaluation metrics (RMSE, MSE, and R^2) are better on the test data than those of the Linear Regression model.

=> Choose the Random Forest Regressor model

In [35]:
# Save the model; overwrite() lets the cell be rerun when the directory already exists
rfModel.write().overwrite().save('rfModel_laptop_price')

In [36]:
from pyspark.ml.regression import RandomForestRegressionModel
# Load the saved model from disk
rfModel2 = RandomForestRegressionModel.load('rfModel_laptop_price')

## Predict new values

| Company | TypeName  | Ram | OpSys | Weight | Screen       | IPS Screen | Screen PPI | Cpu_brand     | HDD | SSD | Gpu_brand | Os  |
|---------|-----------|-----|-------|--------|--------------|------------|------------|---------------|-----|-----|-----------|-----|
| Apple   | Ultrabook | 8   | macOS | 1.34   | NormalScreen | Yes        | 127.677    | Intel Core i7 | 0   | 128 | AMD       | Mac |

In [37]:
# Build a one-row DataFrame for the new laptop, using the same column names as the training data
new_laptop_data = spark.createDataFrame([
    ("Apple", "Ultrabook", 8, "macOS", 1.34, "NormalScreen", "Yes", 127.677, "Intel Core i7", 0, 128, "AMD", "Mac")
], ["Company", "TypeName", "Ram", "OpSys", "Weight", "Screen", "IPS Screen", "Screen PPI", "Cpu_brand", "HDD", "SSD", "Gpu brand", "os"])


In [38]:
# Apply the fitted preprocessing pipeline (indexers, assembler, scaler) to the new row
new_data_pre = model.transform(new_laptop_data)

In [39]:
new_data_pre.select("features_scaled").show(2, False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features_scaled                                                                                                                                                                                                                                                 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[-0.08781747519530542,-1.0476042250695756,-0.44899278145609417,-0.7985948264202817,-0.31229536818640175,1.7729789315182192,0.8640185965926662,3.641200966209663,-0.4147966540203555,1.6043127044998449,-0.9079225200382877,1.9

In [40]:
new_data_pre.show(2)

+-------+---------+---+-----+------+------------+----------+----------+-------------+---+---+---------+---+-------------+--------------+-----------+------------+----------------+---------------+---------------+--------+--------------------+--------------------+
|Company| TypeName|Ram|OpSys|Weight|      Screen|IPS Screen|Screen PPI|    Cpu_brand|HDD|SSD|Gpu brand| os|Company_index|TypeName_index|OpSys_index|Screen_index|IPS Screen_index|Cpu_brand_index|Gpu brand_index|os_index|            features|     features_scaled|
+-------+---------+---+-----+------+------------+----------+----------+-------------+---+---+---------+---+-------------+--------------+-----------+------------+----------------+---------------+---------------+--------+--------------------+--------------------+
|  Apple|Ultrabook|  8|macOS|  1.34|NormalScreen|       Yes|   127.677|Intel Core i7|  0|128|      AMD|Mac|          7.0|           2.0|        5.0|         0.0|             1.0|            0.0|            2.0|    

In [41]:
# Predict with the model reloaded from disk (same fitted model as rfModel)
new_prediction = rfModel2.transform(new_data_pre)

In [42]:
new_prediction.show()

+-------+---------+---+-----+------+------------+----------+----------+-------------+---+---+---------+---+-------------+--------------+-----------+------------+----------------+---------------+---------------+--------+--------------------+--------------------+-------------------+
|Company| TypeName|Ram|OpSys|Weight|      Screen|IPS Screen|Screen PPI|    Cpu_brand|HDD|SSD|Gpu brand| os|Company_index|TypeName_index|OpSys_index|Screen_index|IPS Screen_index|Cpu_brand_index|Gpu brand_index|os_index|            features|     features_scaled|Predict_Price_euros|
+-------+---------+---+-----+------+------------+----------+----------+-------------+---+---+---------+---+-------------+--------------+-----------+------------+----------------+---------------+---------------+--------+--------------------+--------------------+-------------------+
|  Apple|Ultrabook|  8|macOS|  1.34|NormalScreen|       Yes|   127.677|Intel Core i7|  0|128|      AMD|Mac|          7.0|           2.0|        5.0|      

In [43]:
# Show the predicted price
predicted_price = new_prediction.select("Predict_Price_euros").collect()[0][0]
print(f"The predicted price of the given laptop is: {predicted_price} euros")

The predicted price of the given laptop is: 1401.9612500000003 euros
