# Part A – Application area review

Artificial Intelligence (AI) has significantly transformed market analysis, offering powerful tools to understand trends, forecast prices, and optimize decision-making. By analyzing vast datasets, AI can identify hidden patterns that traditional methods often overlook. One of the primary applications in this domain is price prediction, which is crucial for businesses to stay competitive.

Machine learning (ML) models are at the forefront of price prediction in market analysis. Techniques like regression models, neural networks, and ensemble methods (e.g., random forests and gradient boosting) are widely used to predict prices of products, commodities, or assets based on historical data and market trends. These methods consider factors such as demand, supply, seasonality, and external variables like economic indicators.

For example, in retail, AI-driven price prediction helps businesses dynamically set prices for products like laptops. By analyzing features such as brand, specifications, and historical sales data, these models offer accurate price predictions, enabling competitive pricing strategies. Additionally, natural language processing (NLP) tools can extract insights from customer reviews, providing valuable inputs for pricing decisions.

Recent advancements include the integration of deep learning techniques, such as recurrent neural networks (RNNs) and transformers, to process sequential data and predict prices with higher accuracy. Moreover, AI-powered platforms use reinforcement learning to continuously improve pricing strategies based on real-time market responses.

However, challenges remain in applying AI to market analysis. Data quality, model interpretability, and ethical concerns such as price manipulation are key considerations. Despite these challenges, AI continues to reshape market analysis by improving the accuracy and efficiency of pricing models.

This literature review aligns with the focus of my project, which involves developing a laptop price predictor model. Leveraging machine learning algorithms, the model analyzes various product attributes and market trends to provide accurate price estimations, showcasing the practical application of AI in market analysis.


<hr>

# Part B – Compare and Evaluate AI Techniques

<b>Goal:</b>

The primary goal of this project is to develop an accurate model that predicts laptop prices from product features (e.g., brand, processor type, RAM size, and operating system). The chosen <b>AI techniques</b> (`Linear Regression`, `Lasso Regression`, `Decision Tree`, and `Random Forest`) are evaluated on their applicability to this problem.

## 1. Linear Regression

<b>Description:</b>
Linear Regression predicts a continuous target value by modeling the relationship between input features and the target variable as a linear equation.

<b>Strengths:</b>
* Simple and interpretable, providing a baseline model.
* Computationally efficient.

<b>Weaknesses:</b>
* Assumes a linear relationship, which limits performance on complex datasets.
* Sensitive to outliers.

<b>Application to Price Prediction:</b>
Linear Regression was used as a baseline model. It failed to capture nonlinear dependencies, resulting in lower accuracy.

<b>Input Data:</b>
One-hot encoded categorical features and normalized numerical features.

<b>Output:</b>
A continuous predicted price value.
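
To make the "linear equation" concrete, the following minimal sketch fits price = slope × RAM + intercept by ordinary least squares. The numbers and variable names are illustrative assumptions, not values from the project's dataset.

```python
import numpy as np

# Toy data (illustrative assumption): RAM in GB and price in euros
ram = np.array([4.0, 8.0, 16.0, 32.0])
price = np.array([500.0, 700.0, 1100.0, 1900.0])  # constructed as 50 * RAM + 300

# Design matrix with a column of ones for the intercept
X = np.column_stack([ram, np.ones_like(ram)])

# Ordinary least squares: minimise ||X @ coef - price||^2
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
slope, intercept = coef

# Predict the price of a hypothetical 12 GB laptop
predicted = slope * 12 + intercept
print(slope, intercept, predicted)
```

Because the toy prices lie exactly on a line, the fit recovers it. Real laptop prices do not, which is why this model serves only as a baseline.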

## 2. Lasso Regression

<b>Description:</b>
Lasso Regression is a regularized linear model that adds an L1 penalty, shrinking some coefficients to zero and performing feature selection.

<b>Strengths:</b>
* Reduces overfitting and enhances generalization.
* Selects the most important features automatically.

<b>Weaknesses:</b>
* Ineffective for complex, nonlinear relationships.
* May discard useful features with small contributions.

<b>Application to Laptop Price Prediction:</b>
Lasso Regression performed slightly better than Linear Regression by focusing on key features. However, it struggled to model nonlinear interactions present in the dataset.

<b>Input Data:</b>
One-hot encoded categorical features and normalized numerical features.

<b>Output:</b>
A continuous predicted price value.
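
The way an L1 penalty drives coefficients to exactly zero can be illustrated with soft-thresholding, which is what the Lasso solution reduces to in the special case of orthonormal features. This is a sketch under that assumption; the coefficient values and the `soft_threshold` helper are hypothetical and not part of the project code.

```python
import numpy as np

def soft_threshold(coefficients, alpha):
    """Shrink each coefficient towards zero by alpha and clip at zero."""
    return np.sign(coefficients) * np.maximum(np.abs(coefficients) - alpha, 0.0)

# Hypothetical least-squares coefficients for four standardised features
ols_coefficients = np.array([3.0, 0.2, -0.05, -1.5])

# Coefficients smaller in magnitude than alpha are set to zero
lasso_coefficients = soft_threshold(ols_coefficients, alpha=0.5)
print(lasso_coefficients)
```

The two small coefficients vanish, which is the feature-selection effect described above, while the large ones are shrunk by `alpha`.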

## 3. Decision Tree Regressor

<b>Description:</b>
A Decision Tree Regressor splits the dataset into regions based on feature values, predicting the target variable by averaging the output of data points in each region.

<b>Strengths:</b>
* Capable of modeling complex, nonlinear relationships.
* Simple to interpret for smaller trees.
* Does not require feature scaling or normalization.

<b>Weaknesses:</b>
* Prone to overfitting, especially on small datasets.
* Sensitive to small changes in the data, leading to instability.

<b>Application to Laptop Price Prediction:</b>
The Decision Tree Regressor can capture nonlinear patterns that Linear and Lasso Regression cannot. However, it tended to overfit the training data: in the run reported in Part C its test R² (about 0.71) was slightly below that of Linear and Lasso Regression (about 0.73 to 0.74), showing reduced generalization to unseen data.

<b>Input Data:</b>
One-hot encoded categorical features and raw numerical features.

<b>Output:</b>
A continuous predicted price value.
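
The region-splitting idea can be sketched for a single feature: try every threshold between distinct values, keep the one with the lowest summed squared error, and predict the mean price on each side. The `best_split` helper and the toy numbers are illustrative assumptions, not the implementation used by scikit-learn.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, left_mean, right_mean) minimising the summed squared error."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best = None
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate identical feature values
        left, right = y_sorted[:i], y_sorted[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best = (sse, threshold, left.mean(), right.mean())
    if best is None:
        raise ValueError("all feature values are identical; no split is possible")
    return best[1:]

# Toy data (illustrative assumption): RAM in GB and price in euros
ram = np.array([4.0, 4.0, 8.0, 8.0, 16.0, 16.0])
price = np.array([400.0, 450.0, 800.0, 850.0, 1500.0, 1600.0])

threshold, left_mean, right_mean = best_split(ram, price)
print(threshold, left_mean, right_mean)
```

A full tree repeats this search recursively inside each region and across all features, which is also why an unconstrained tree can memorise the training data.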

## 4. Random Forest Regressor (Main Technique)

<b>Description:</b> 
Random Forest is an ensemble of decision trees, where each tree is trained on a random (bootstrap) sample of the data. For regression, it averages the predictions of the individual trees, which improves accuracy and reduces overfitting.

<b>Strengths:</b>
* Handles nonlinear relationships effectively.
* Resistant to overfitting when hyperparameters are tuned.
* Robust and stable, reducing the impact of individual outliers or noise.

<b>Weaknesses:</b>
* Computationally expensive, especially with many trees.
* Less interpretable compared to single decision trees.

<b>Application to Laptop Price Prediction:</b>
Random Forest Regressor outperformed all other models by capturing the complex relationships in the dataset. Hyperparameter tuning (number of trees and split criterion) using GridSearchCV further improved its accuracy, raising the test R² from about 0.83 to about 0.84 in the reported run and making it the best technique for this problem.

<b>Input Data:</b>
One-hot encoded categorical features and raw numerical features.

<b>Output:</b>
A continuous predicted price value.
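
The "combine the predictions" step can be checked directly: for a fitted `RandomForestRegressor`, the forest's prediction is the mean of its trees' predictions. The sketch below uses synthetic data as a stand-in (an assumption; it is not the laptop dataset).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative assumption)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X, y)

# Each fitted tree is available in `estimators_`
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])

# Averaging the per-tree predictions reproduces the forest's prediction
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X[:5])
print(np.allclose(manual_average, forest_prediction))
```

Averaging many trees that each saw a different sample is what smooths out the instability of a single tree.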

## Comparison and Evaluation:

| Technique         | Strengths                                     | Weaknesses                          | Input Data                            | Output                              | Suitability for Problem                 |
|:------------------|:----------------------------------------------|:------------------------------------|:--------------------------------------|:------------------------------------|:----------------------------------------|
| Linear Regression | Simple, interpretable, and fast               | Poor for nonlinear relationships    | One-hot encoded, normalized           | Continuous price predictions        | Baseline model                          |
| Lasso Regression  | Reduces overfitting, feature selection        | Struggles with nonlinearities       | One-hot encoded, normalized           | Continuous price predictions        | Slight improvement over baseline        |
| Decision Tree     | Models nonlinear relationships, interpretable | Prone to overfitting, unstable      | One-hot encoded, raw features         | Continuous price predictions        | Moderate accuracy, prone to overfitting |
| Random Forest     | Captures nonlinearities, robust               | Computationally expensive           | One-hot encoded, raw features         | Continuous price predictions        | Best model for complex relationships    |

## Selected Technique for Prototype Implementation: Random Forest Regressor
The Random Forest Regressor was selected as the primary technique due to its superior performance in handling nonlinear relationships and producing the highest prediction accuracy. Hyperparameter tuning further enhanced its performance.

<hr>

# Part C – Implementation

## Data Analysis

Import necessary libraries

In [54]:
import numpy as np
import pandas as pd

<br><br>
Read dataset from CSV file

In [161]:
data = pd.read_csv("laptop_price.csv", encoding = "latin-1")

<br><br>
Print the first five rows

In [162]:
data.head()

Unnamed: 0,Laptop ID,Company,Product,Type Name,Inches,Screen Resolution,CPU,RAM,GPU,OS,Weight,Price Euros
0,1,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,Intel Iris Plus Graphics 640,macOS,1.37kg,1339.69
1,2,Apple,Macbook Air,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,Intel HD Graphics 6000,macOS,1.34kg,898.94
2,3,HP,250 G6,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,Intel HD Graphics 620,No OS,1.86kg,575.0
3,4,Apple,MacBook Pro,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,AMD Radeon Pro 455,macOS,1.83kg,2537.45
4,5,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,Intel Iris Plus Graphics 650,macOS,1.37kg,1803.6


<br><br>
Get the row and column counts

In [163]:
data.shape

(1303, 12)

<br><br>
Check null values

In [164]:
data.isnull()

Unnamed: 0,Laptop ID,Company,Product,Type Name,Inches,Screen Resolution,CPU,RAM,GPU,OS,Weight,Price Euros
0,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
1298,False,False,False,False,False,False,False,False,False,False,False,False
1299,False,False,False,False,False,False,False,False,False,False,False,False
1300,False,False,False,False,False,False,False,False,False,False,False,False
1301,False,False,False,False,False,False,False,False,False,False,False,False


<br><br>
Get null value count

In [165]:
data.isnull().sum()

Laptop ID            0
Company              0
Product              0
Type Name            0
Inches               0
Screen Resolution    0
CPU                  0
RAM                  0
GPU                  0
OS                   0
Weight               0
Price Euros          0
dtype: int64

<br><br>
Get dataset information

In [166]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Laptop ID          1303 non-null   int64  
 1   Company            1303 non-null   object 
 2   Product            1303 non-null   object 
 3   Type Name          1303 non-null   object 
 4   Inches             1303 non-null   float64
 5   Screen Resolution  1303 non-null   object 
 6   CPU                1303 non-null   object 
 7   RAM                1303 non-null   object 
 8   GPU                1303 non-null   object 
 9   OS                 1303 non-null   object 
 10  Weight             1303 non-null   object 
 11  Price Euros        1303 non-null   float64
dtypes: float64(2), int64(1), object(9)
memory usage: 122.3+ KB


<br><br>
Rename some columns

In [167]:
data = data.rename(columns={"Laptop ID": "Laptop_ID", "Type Name": "Type_Name", "Screen Resolution": "Screen_Resolution", "Price Euros": "Price_Euros"})

<br><br>
Print column names

In [168]:
print(data.columns)

Index(['Laptop_ID', 'Company', 'Product', 'Type_Name', 'Inches',
       'Screen_Resolution', 'CPU', 'RAM', 'GPU', 'OS', 'Weight',
       'Price_Euros'],
      dtype='object')


<br><br>
Strip the unit suffixes from the RAM and Weight columns and convert them to `integer` and `float`

In [169]:
data["RAM"] = data["RAM"].str.replace("GB", "").astype("int32")
data["Weight"] = data["Weight"].str.replace("kg", "").astype("float32")
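
The conversion above raises an error if any value does not parse after the suffix is removed. A more defensive variant, sketched here on toy values, turns unparseable entries into `NaN` so they can be inspected. The `strip_unit` helper and the malformed entry are illustrative assumptions.

```python
import pandas as pd

def strip_unit(series, unit):
    """Remove a literal unit suffix and convert to numbers; unparseable values become NaN."""
    cleaned = series.str.replace(unit, "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")

# Toy values (illustrative assumption); the last RAM entry is deliberately malformed
ram_raw = pd.Series(["8GB", "16GB", "n/a"])
weight_raw = pd.Series(["1.37kg", "1.34kg", "1.86kg"])

ram = strip_unit(ram_raw, "GB")
weight = strip_unit(weight_raw, "kg")
print(ram.tolist(), weight.tolist())
```

Rows that come back as `NaN` can then be dropped or corrected before the column is cast to an integer type.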

<br><br>
Print the first two rows

In [170]:
data.head(2)

Unnamed: 0,Laptop_ID,Company,Product,Type_Name,Inches,Screen_Resolution,CPU,RAM,GPU,OS,Weight,Price_Euros
0,1,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8,Intel Iris Plus Graphics 640,macOS,1.37,1339.69
1,2,Apple,Macbook Air,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8,Intel HD Graphics 6000,macOS,1.34,898.94


<br><br>
Get dataset information

In [171]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Laptop_ID          1303 non-null   int64  
 1   Company            1303 non-null   object 
 2   Product            1303 non-null   object 
 3   Type_Name          1303 non-null   object 
 4   Inches             1303 non-null   float64
 5   Screen_Resolution  1303 non-null   object 
 6   CPU                1303 non-null   object 
 7   RAM                1303 non-null   int32  
 8   GPU                1303 non-null   object 
 9   OS                 1303 non-null   object 
 10  Weight             1303 non-null   float32
 11  Price_Euros        1303 non-null   float64
dtypes: float32(1), float64(2), int32(1), int64(1), object(7)
memory usage: 112.1+ KB


<br><br>
Correlation of numerical features with the price

In [172]:
# Get numerical columns
numeric_data = data.select_dtypes(include=["number"])

# Remove unwanted columns
numeric_data = numeric_data.drop(columns=["Laptop_ID"])
numeric_data = numeric_data.drop(columns=["Inches"])

numeric_data.corr()["Price_Euros"]

RAM            0.743007
Weight         0.210370
Price_Euros    1.000000
Name: Price_Euros, dtype: float64
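
The values returned by `corr()` are Pearson correlation coefficients by default. As a worked example, the sketch below computes the coefficient by hand on toy numbers (an illustrative assumption) and compares it with pandas.

```python
import numpy as np
import pandas as pd

# Toy values (illustrative assumption): RAM in GB and price in euros
toy = pd.DataFrame({
    "RAM": [4, 8, 8, 16, 32],
    "Price_Euros": [400.0, 650.0, 900.0, 1500.0, 2600.0],
})

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )

manual = pearson(toy["RAM"], toy["Price_Euros"])
from_pandas = toy["RAM"].corr(toy["Price_Euros"])
print(manual, from_pandas)
```

A coefficient near 1 indicates a strong positive linear relationship. It says nothing about nonlinear effects, so a low value does not by itself make a feature useless for tree-based models.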

<br><br>
### Company Column

Print company names

In [173]:
data["Company"].value_counts()

Company
Dell         297
Lenovo       297
HP           274
Asus         158
Acer         103
MSI           54
Toshiba       48
Apple         21
Samsung        9
Mediacom       7
Razer          7
Microsoft      6
Vero           4
Xiaomi         4
Chuwi          3
Fujitsu        3
Google         3
LG             3
Huawei         2
Name: count, dtype: int64

<br><br>
All company count

In [174]:
len(data["Company"].value_counts())

19

<br><br>
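Create a function that groups companies with 10 or fewer products in the dataset into an 'Other' category

In [175]:
# Count products per company once, rather than on every call
company_counts = data["Company"].value_counts()

def separate_other_company(company):
    # Keep companies with more than 10 products; group the rest as "Other"
    return company if company_counts[company] > 10 else "Other"

# Apply the function to the dataset
data["Company"] = data["Company"].apply(separate_other_company)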
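
The same grouping can also be written without a per-row Python function, by mapping each value to its frequency and masking the rare ones. This is a sketch on a toy column; the `group_rare` helper and its `min_count` parameter are hypothetical names.

```python
import pandas as pd

def group_rare(series, min_count, other_label="Other"):
    """Replace values that occur fewer than `min_count` times with `other_label`."""
    counts = series.map(series.value_counts())
    return series.where(counts >= min_count, other_label)

# Toy column (illustrative assumption)
companies = pd.Series(["Dell"] * 4 + ["HP"] * 3 + ["Vero", "LG"])

grouped = group_rare(companies, min_count=3)
print(grouped.value_counts().to_dict())
```

With this helper, the threshold used in the notebook (keep companies with more than 10 products) would correspond to `min_count=11`.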

<br><br>
Print company names

In [176]:
data["Company"].value_counts()

Company
Dell       297
Lenovo     297
HP         274
Asus       158
Acer       103
MSI         54
Other       51
Toshiba     48
Apple       21
Name: count, dtype: int64

<br><br>
All company count

In [177]:
len(data["Company"].value_counts())

9

<br><br>
### Type Name Column

Print type names

In [178]:
data["Type_Name"].value_counts()

Type_Name
Notebook              727
Gaming                205
Ultrabook             196
2 in 1 Convertible    121
Workstation            29
Netbook                25
Name: count, dtype: int64

<br><br>
All type name count

In [179]:
len(data["Type_Name"].value_counts())

6

<br><br>
### CPU Column

Print CPU

In [180]:
data["CPU"].value_counts()

CPU
Intel Core i5 7200U 2.5GHz       190
Intel Core i7 7700HQ 2.8GHz      146
Intel Core i7 7500U 2.7GHz       134
Intel Core i7 8550U 1.8GHz        73
Intel Core i5 8250U 1.6GHz        72
                                ... 
Intel Core i5 7200U 2.70GHz        1
Intel Core M M7-6Y75 1.2GHz        1
Intel Core M 6Y54 1.1GHz           1
AMD E-Series 9000 2.2GHz           1
Samsung Cortex A72&A53 2.0GHz      1
Name: count, Length: 118, dtype: int64

<br><br>
All CPU count

In [181]:
len(data["CPU"].value_counts())

118

<br><br>
Create a function to replace CPU types

In [182]:
def separate_cpu_types(input):
    
    input_lower = input.lower()
    
    if "intel core i7" in input_lower:
        return "Intel Core i7"
        
    elif "intel core i5" in input_lower:
        return "Intel Core i5"
        
    elif "intel core i3" in input_lower:
        return "Intel Core i3"
        
    elif "amd" in input_lower:
        return "AMD"
        
    else:
        return "Other"

# Apply function to the dataset
data["CPU"] = data["CPU"].apply(separate_cpu_types)

<br><br>
Print CPU

In [183]:
data["CPU"].value_counts()

CPU
Intel Core i7    527
Intel Core i5    423
Other            155
Intel Core i3    136
AMD               62
Name: count, dtype: int64

<br><br>
All CPU count

In [184]:
len(data["CPU"].value_counts())

5

<br><br>
### GPU Column

Print GPU

In [185]:
data["GPU"].value_counts()

GPU
Intel HD Graphics 620      281
Intel HD Graphics 520      185
Intel UHD Graphics 620      68
Nvidia GeForce GTX 1050     66
Nvidia GeForce GTX 1060     48
                          ... 
Nvidia Quadro M500M          1
AMD Radeon R7 M360           1
Nvidia Quadro M3000M         1
Nvidia GeForce 960M          1
ARM Mali T860 MP4            1
Name: count, Length: 110, dtype: int64

<br><br>
All GPU count

In [186]:
len(data["GPU"].value_counts())

110

<br><br>
Create a function to replace GPU types

In [187]:
def separate_gpu_types(input):
    
    input_lower = input.lower()
    
    if "intel" in input_lower:
        return "Intel"
        
    elif "nvidia" in input_lower:
        return "Nvidia"
        
    elif "amd" in input_lower:
        return "AMD"
        
    else:
        return "Other"

# Apply function to the dataset
data["GPU"] = data["GPU"].apply(separate_gpu_types)

<br><br>
Print GPU

In [188]:
data["GPU"].value_counts()

GPU
Intel     722
Nvidia    400
AMD       180
Other       1
Name: count, dtype: int64

<br><br>
All GPU count

In [189]:
len(data["GPU"].value_counts())

4

<br><br>
Get the row and column counts

In [190]:
data.shape

(1303, 12)

<br><br>
Remove the single row whose GPU type is 'Other'

In [191]:
data = data[data["GPU"] != "Other"]

<br><br>
Get the row and column counts

In [192]:
data.shape

(1302, 12)

<br><br>
### OS Column

Print OS

In [193]:
data["OS"].value_counts()

OS
Windows 10      1072
No OS             66
Linux             62
Windows 7         45
Chrome OS         26
macOS             13
Mac OS X           8
Windows 10 S       8
Android            2
Name: count, dtype: int64

<br><br>
All OS count

In [194]:
len(data["OS"].value_counts())

9

<br><br>
Create a function to replace OS types

In [195]:
def separate_os_types(input):
    
    input_lower = input.lower()
    
    if "windows" in input_lower:
        return "Windows"
        
    elif "macos" in input_lower or "mac os" in input_lower:
        return "MacOS"
        
    elif "linux" in input_lower:
        return "Linux"
        
    else:
        return "Other"

# Apply function to the dataset
data["OS"] = data["OS"].apply(separate_os_types)
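
The CPU, GPU, and OS grouping functions share one pattern: return the label of the first keyword found in the lower-cased text, else "Other". A table-driven sketch of that pattern follows. The `group_by_keywords` helper and the `CPU_RULES` list are hypothetical names introduced for illustration.

```python
def group_by_keywords(value, rules, default="Other"):
    """Return the label of the first rule whose keyword occurs in `value` (case-insensitive)."""
    value_lower = value.lower()
    for keyword, label in rules:
        if keyword in value_lower:
            return label
    return default

# Rules mirror the CPU grouping; order matters because the first match wins
CPU_RULES = [
    ("intel core i7", "Intel Core i7"),
    ("intel core i5", "Intel Core i5"),
    ("intel core i3", "Intel Core i3"),
    ("amd", "AMD"),
]

cpu_a = group_by_keywords("Intel Core i5 7200U 2.5GHz", CPU_RULES)
cpu_b = group_by_keywords("Samsung Cortex A72&A53 2.0GHz", CPU_RULES)
cpu_c = group_by_keywords("AMD E-Series 9000 2.2GHz", CPU_RULES)
print(cpu_a, cpu_b, cpu_c, sep=" | ")
```

Keeping the rules in a list makes it straightforward to add a category without writing another `if`/`elif` chain.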

<br><br>
Print OS

In [196]:
data["OS"].value_counts()

OS
Windows    1125
Other        94
Linux        62
MacOS        21
Name: count, dtype: int64

<br><br>
All OS count

In [197]:
len(data["OS"].value_counts())

4

<br><br>
### Screen Resolution Column

Print screen resolution

In [198]:
data["Screen_Resolution"].value_counts()

Screen_Resolution
Full HD 1920x1080                                507
1366x768                                         281
IPS Panel Full HD 1920x1080                      230
IPS Panel Full HD / Touchscreen 1920x1080         53
Full HD / Touchscreen 1920x1080                   47
1600x900                                          23
Touchscreen 1366x768                              16
Quad HD+ / Touchscreen 3200x1800                  15
IPS Panel 4K Ultra HD 3840x2160                   12
IPS Panel 4K Ultra HD / Touchscreen 3840x2160     11
4K Ultra HD / Touchscreen 3840x2160               10
4K Ultra HD 3840x2160                              7
Touchscreen 2560x1440                              7
IPS Panel 1366x768                                 7
IPS Panel Retina Display 2304x1440                 6
IPS Panel Quad HD+ / Touchscreen 3200x1800         6
Touchscreen 2256x1504                              6
IPS Panel Retina Display 2560x1600                 6
IPS Panel Touchscreen 2560x1

<br><br>
All screen resolution count

In [199]:
len(data["Screen_Resolution"].value_counts())

39

<br><br>
Add five new columns based on 'Screen_Resolution'

In [200]:
def flag_keywords(keywords):
    # Build a function that returns 1 if any keyword occurs in the lower-cased text, else 0
    return lambda text: int(any(keyword in text.lower() for keyword in keywords))

data["Touchscreen"] = data["Screen_Resolution"].apply(flag_keywords(["touchscreen"]))

data["IPS"] = data["Screen_Resolution"].apply(flag_keywords(["ips"]))

# Note: "hd" also matches the "Full HD", "Quad HD+" and "Ultra HD" labels
data["HD"] = data["Screen_Resolution"].apply(
    flag_keywords(["hd", "1920x1080", "1080p", "1366x768", "1600x900", "1920x1200"])
)

data["2K"] = data["Screen_Resolution"].apply(
    flag_keywords(["2560x1440", "2k", "2304x1440", "2256x1504", "2560x1600", "2880x1800", "2400x1600", "2736x1824"])
)

data["4K"] = data["Screen_Resolution"].apply(
    flag_keywords(["3840x2160", "4k", "ultra hd", "3200x1800"])
)
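
The keyword flags reduce each resolution string to coarse categories. An alternative, sketched below and not used in this project, is to extract the pixel dimensions themselves with a regular expression. The column names `Res_Width`, `Res_Height`, and `Pixels` are hypothetical.

```python
import pandas as pd

# Example strings taken from the value counts above
resolution = pd.Series([
    "IPS Panel Retina Display 2560x1600",
    "Full HD 1920x1080",
    "1366x768",
])

# Capture the two integers on either side of the "x" separator
size = resolution.str.extract(r"(\d+)x(\d+)").astype(int)
size.columns = ["Res_Width", "Res_Height"]  # hypothetical column names

# Total pixel count as a single numeric feature
size["Pixels"] = size["Res_Width"] * size["Res_Height"]
print(size)
```

Numeric dimensions avoid overlapping keyword matches, at the cost of moving away from the HD/2K/4K labels.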

<br><br>
Print dataset details

In [201]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1302 entries, 0 to 1302
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Laptop_ID          1302 non-null   int64  
 1   Company            1302 non-null   object 
 2   Product            1302 non-null   object 
 3   Type_Name          1302 non-null   object 
 4   Inches             1302 non-null   float64
 5   Screen_Resolution  1302 non-null   object 
 6   CPU                1302 non-null   object 
 7   RAM                1302 non-null   int32  
 8   GPU                1302 non-null   object 
 9   OS                 1302 non-null   object 
 10  Weight             1302 non-null   float32
 11  Price_Euros        1302 non-null   float64
 12  Touchscreen        1302 non-null   int64  
 13  IPS                1302 non-null   int64  
 14  HD                 1302 non-null   int64  
 15  2K                 1302 non-null   int64  
 16  4K                 1302 non-n

<br><br>
### Remove unnecessary columns

In [202]:
data = data.drop(columns = ["Laptop_ID", "Product", "Inches", "Screen_Resolution"])

<br><br>
Print the first five rows

In [203]:
data.head()

Unnamed: 0,Company,Type_Name,CPU,RAM,GPU,OS,Weight,Price_Euros,Touchscreen,IPS,HD,2K,4K
0,Apple,Ultrabook,Intel Core i5,8,Intel,MacOS,1.37,1339.69,0,1,0,1,0
1,Apple,Ultrabook,Intel Core i5,8,Intel,MacOS,1.34,898.94,0,0,0,0,0
2,HP,Notebook,Intel Core i5,8,Intel,Other,1.86,575.0,0,0,1,0,0
3,Apple,Ultrabook,Intel Core i7,16,AMD,MacOS,1.83,2537.45,0,1,0,1,0
4,Apple,Ultrabook,Intel Core i5,8,Intel,MacOS,1.37,1803.6,0,1,0,1,0


In [204]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1302 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Company      1302 non-null   object 
 1   Type_Name    1302 non-null   object 
 2   CPU          1302 non-null   object 
 3   RAM          1302 non-null   int32  
 4   GPU          1302 non-null   object 
 5   OS           1302 non-null   object 
 6   Weight       1302 non-null   float32
 7   Price_Euros  1302 non-null   float64
 8   Touchscreen  1302 non-null   int64  
 9   IPS          1302 non-null   int64  
 10  HD           1302 non-null   int64  
 11  2K           1302 non-null   int64  
 12  4K           1302 non-null   int64  
dtypes: float32(1), float64(1), int32(1), int64(5), object(5)
memory usage: 132.2+ KB


<br><br>
### Replace 'object' types into numeric using One Hot Encoding

In [205]:
# Get all object-type columns
object_data = data.select_dtypes(include=['object'])

# One-hot encode all object-type columns
one_hot_encode_object_data = pd.get_dummies(object_data).astype(int)

# Get all numerical columns
not_object_data = data.select_dtypes(exclude=['object'])

# Create the preprocessed dataset
preproses_data = pd.concat([not_object_data, one_hot_encode_object_data], axis=1)
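
What one-hot encoding does is easiest to see on a toy frame (an illustrative assumption, not the project data): every category of a text column becomes its own 0/1 column, named `<column>_<category>`.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "RAM": [8, 16, 4],
    "OS": ["Windows", "Linux", "Windows"],
})

# Each category becomes its own 0/1 column; the numeric column passes through unchanged
encoded = pd.get_dummies(toy, columns=["OS"], dtype=int)
print(encoded)
```

Passing `columns=` lets `get_dummies` encode selected columns inside the full frame, which gives the same kind of result as the split-and-concatenate steps above.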

In [206]:
preproses_data.shape

(1302, 35)

In [207]:
preproses_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1302 entries, 0 to 1302
Data columns (total 35 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   RAM                           1302 non-null   int32  
 1   Weight                        1302 non-null   float32
 2   Price_Euros                   1302 non-null   float64
 3   Touchscreen                   1302 non-null   int64  
 4   IPS                           1302 non-null   int64  
 5   HD                            1302 non-null   int64  
 6   2K                            1302 non-null   int64  
 7   4K                            1302 non-null   int64  
 8   Company_Acer                  1302 non-null   int64  
 9   Company_Apple                 1302 non-null   int64  
 10  Company_Asus                  1302 non-null   int64  
 11  Company_Dell                  1302 non-null   int64  
 12  Company_HP                    1302 non-null   int64  
 13  Company_

<br><br>
## Model Building and Selection

### Define features (X) and target (y)

In [208]:
x = preproses_data.drop("Price_Euros", axis=1)
y = preproses_data["Price_Euros"]

<br><br>
### Divide dataset into training and testing set

In [209]:
from sklearn.model_selection import train_test_split

# 75% training and 25% testing
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.25)

In [210]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((976, 34), (326, 34), (976,), (326,))
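
Because the split above is shuffled without a fixed seed, the scores reported below will vary from run to run. A minimal sketch of a repeatable split on toy arrays follows. The seed value 42 is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and target (illustrative assumption)
features = np.arange(200).reshape(100, 2)
target = np.arange(100)

# Fixing random_state makes the shuffle, and therefore the split, repeatable
first = train_test_split(features, target, test_size=0.25, random_state=42)
second = train_test_split(features, target, test_size=0.25, random_state=42)

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape)
print(np.array_equal(y_te, second[3]))
```

With a fixed seed, differences between models reflect the models rather than the luck of a particular split.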

<br><br>
### Model training

Function to fit a model and report its R² score on the test set

In [211]:
def model_accuracy_data(model):
    model.fit(X_train, Y_train)
    acc = model.score(X_test, Y_test)
    print(str(model) + " --> " + str(acc))
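
For scikit-learn regressors, `score` returns the coefficient of determination R², not a classification-style accuracy. The worked example below computes R² by hand on toy prices (an illustrative assumption) and adds the mean absolute error, which is expressed in euros.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Toy prices in euros (illustrative assumption)
y_true = np.array([500.0, 1000.0, 1500.0, 2000.0])
y_pred = np.array([600.0, 900.0, 1600.0, 1900.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Mean absolute error is in the same unit as the target
mae = mean_absolute_error(y_true, y_pred)

print(r2_manual, r2_score(y_true, y_pred), mae)
```

Here every prediction is off by 100 euros, giving an R² of 0.968. Reporting an error in euros alongside R² makes the scores easier to interpret.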

<br><br>
#### Linear Regression Model

In [212]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

In [213]:
model_accuracy_data(lr)

LinearRegression() --> 0.7339772419782346


<br><br>
#### Lasso Regression Model

In [214]:
from sklearn.linear_model import Lasso
lasso = Lasso()

In [215]:
model_accuracy_data(lasso)

Lasso() --> 0.7357704839192782


<br><br>
#### Decision Tree Regressor Model

In [216]:
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor()

In [217]:
model_accuracy_data(dt)

DecisionTreeRegressor() --> 0.7142124421085951


<br><br>
#### Random Forest Regressor Model

In [218]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()

In [219]:
model_accuracy_data(rf)

RandomForestRegressor() --> 0.8289561096078735


<br><br>
### Hyperparameter tuning

In [220]:
from sklearn.model_selection import GridSearchCV

<br><br>
Define the parameter grid for the random forest model
* 'n_estimators': Number of trees in the forest
* 'criterion': Function used to measure the quality of a split

In [221]:
parameters = {'n_estimators':[10, 50, 100],
              'criterion':['squared_error','absolute_error','poisson']}

<br><br>
Create a GridSearchCV object
* It will search for the best combination of parameters from the `parameters` dictionary
* `rf` is the random forest model that you're optimizing

In [222]:
grid_obj = GridSearchCV(estimator=rf, param_grid=parameters)

<br><br>
Fit the grid search to the training data. This trains the random forest model with every combination of parameters and evaluates each one using cross-validation.

In [223]:
grid_fit = grid_obj.fit(X_train, Y_train)

<br><br> 
Get the best model (with the optimal combination of parameters)

In [224]:
best_model = grid_fit.best_estimator_
best_model

In [225]:
best_model.score(X_test, Y_test)

0.8420262152960508
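
The fitted search object also exposes which combination won and its cross-validated score. The sketch below shows this on synthetic data with a deliberately small grid. The data, the grid values, and the explicit `cv` and `scoring` settings are assumptions for illustration and differ from the grid used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (illustrative assumption, not the laptop dataset)
X, y = make_regression(n_samples=120, n_features=6, noise=5.0, random_state=0)

# A deliberately small grid so the sketch runs quickly
param_grid = {"n_estimators": [5, 20], "max_depth": [3, None]}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    cv=3,          # 3-fold cross-validation on the data passed to fit
    scoring="r2",  # the same metric that RandomForestRegressor.score reports
)
search.fit(X, y)

print(search.best_params_)  # winning parameter combination
print(search.best_score_)   # mean cross-validated R^2 of that combination
```

`best_score_` is computed only on the data passed to `fit`, so the held-out test score remains an independent check.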

<br><br>
## Save model

In [230]:
import pickle

# Save the best_model object to a file named 'predictor.pickle'
with open('predictor.pickle', 'wb') as file:  # Open a file in write-binary mode ('wb')
    pickle.dump(best_model, file)          # Serialize the model and write it to the file
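
Loading the model back is the mirror image of saving it. The round trip below is a sketch that uses a small stand-in model and in-memory bytes rather than the `predictor.pickle` file. The toy data and model are assumptions.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small stand-in model trained on toy data (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.random((50, 3))
y = X @ np.array([1.0, 2.0, 3.0])
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Serialize to bytes and restore; pickle.dump / pickle.load do the same with a file
payload = pickle.dumps(model)
restored = pickle.loads(payload)

# The restored model reproduces the original predictions
print(np.allclose(model.predict(X), restored.predict(X)))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code. Loading with the same scikit-learn version that saved the model is also advisable.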


In [231]:
len(X_train.columns)

34

In [232]:
X_train.columns

Index(['RAM', 'Weight', 'Touchscreen', 'IPS', 'HD', '2K', '4K', 'Company_Acer',
       'Company_Apple', 'Company_Asus', 'Company_Dell', 'Company_HP',
       'Company_Lenovo', 'Company_MSI', 'Company_Other', 'Company_Toshiba',
       'Type_Name_2 in 1 Convertible', 'Type_Name_Gaming', 'Type_Name_Netbook',
       'Type_Name_Notebook', 'Type_Name_Ultrabook', 'Type_Name_Workstation',
       'CPU_AMD', 'CPU_Intel Core i3', 'CPU_Intel Core i5',
       'CPU_Intel Core i7', 'CPU_Other', 'GPU_AMD', 'GPU_Intel', 'GPU_Nvidia',
       'OS_Linux', 'OS_MacOS', 'OS_Other', 'OS_Windows'],
      dtype='object')

In [240]:
# Build a single-row input whose columns match the features used during training
input_data = pd.DataFrame([[8, 1.3, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0]],
                          columns=X_train.columns)

pred_value = best_model.predict(input_data)

print(pred_value)

print("Euro:", round(pred_value[0], 2))

# Approximate conversion to Sri Lankan rupees, assuming a fixed rate of 300 LKR per euro
print("LKR:", round((pred_value[0] * 300), 2))

[1485.6286]
Euro: 1485.63
LKR: 445688.58
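
A hand-written vector of 34 positional values is easy to get wrong: read against `X_train.columns`, the row above appears to set both `Company_Apple` and `Company_Dell`, and both `Type_Name_Netbook` and `Type_Name_Workstation`. One way to avoid this is to build the row from named values, as sketched below. The `build_input_row` helper and the example specification are hypothetical; the column list is copied from the `X_train.columns` output.

```python
import pandas as pd

# Column order copied from the X_train.columns output above
FEATURE_COLUMNS = [
    "RAM", "Weight", "Touchscreen", "IPS", "HD", "2K", "4K",
    "Company_Acer", "Company_Apple", "Company_Asus", "Company_Dell", "Company_HP",
    "Company_Lenovo", "Company_MSI", "Company_Other", "Company_Toshiba",
    "Type_Name_2 in 1 Convertible", "Type_Name_Gaming", "Type_Name_Netbook",
    "Type_Name_Notebook", "Type_Name_Ultrabook", "Type_Name_Workstation",
    "CPU_AMD", "CPU_Intel Core i3", "CPU_Intel Core i5", "CPU_Intel Core i7", "CPU_Other",
    "GPU_AMD", "GPU_Intel", "GPU_Nvidia",
    "OS_Linux", "OS_MacOS", "OS_Other", "OS_Windows",
]

def build_input_row(numeric, categorical, columns=FEATURE_COLUMNS):
    """Build a one-row DataFrame with exactly one flag set per categorical feature."""
    row = dict.fromkeys(columns, 0)
    row.update(numeric)
    for feature, value in categorical.items():
        column = f"{feature}_{value}"
        if column not in row:
            raise KeyError(f"Unknown category: {column}")
        row[column] = 1
    return pd.DataFrame([row], columns=columns)

# Hypothetical laptop specification
input_row = build_input_row(
    numeric={"RAM": 8, "Weight": 1.3, "Touchscreen": 1, "IPS": 1, "HD": 1},
    categorical={"Company": "Dell", "Type_Name": "Notebook", "CPU": "Intel Core i5",
                 "GPU": "Nvidia", "OS": "Windows"},
)
print(input_row.shape)
```

The resulting frame has the training column order, so it could be passed to `best_model.predict(input_row)` in place of the positional vector.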


<hr>