<a href="https://colab.research.google.com/github/nikesh11xx/Data-to-Decision-blog/blob/main/laptop_price_prediction_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

name = Nikesh kumar singh


github : https://github.com/nikesh11xx/Data-to-Decision-blog/blob/cf982642018fe4e1609bb48f46682205237e9ef1/laptop_price_prediction_.ipynb

# Summary


The dataset contains information about laptop specifications and prices. It has 1333 rows and 13 columns. The columns include laptop company, type name, RAM, weight, touch screen availability, IPS display availability, CPU brand, price, PPI, and operating system.

The data is clean: it contains no missing values and no duplicate rows. The RAM and weight columns were originally stored as strings with unit suffixes (GB, kg), so they were converted to integer and float data types, respectively, to make them usable as numeric features.

The dataset provides insights into the relationship between laptop features and prices. For example, laptops with more RAM tend to have higher prices, while screen size on its own shows only a weak relationship with price. The presence of touchscreen and IPS display features is also associated with higher prices.

Overall, the dataset is suitable for building machine learning models to predict laptop prices based on their specifications.


## Aim of the project

**Problem Statement:**

Given a dataset containing information about laptops, the goal is to build a machine learning model that can predict the price of a laptop based on its features. The dataset includes various features such as brand, processor, RAM, storage, and display size.

**Objective:**

The objective is to develop a model that can accurately estimate the price of a laptop given its specifications. This model can be used by consumers to make informed decisions when purchasing a laptop, and by manufacturers to optimize their pricing strategies.

**Data Description:**

The dataset contains the following features:

* Brand: The brand of the laptop (e.g., Apple, Dell, HP, Lenovo).
* Processor: The type of processor (e.g., Intel Core i3, Intel Core i5, Intel Core i7).
* RAM: The amount of RAM in gigabytes (GB).
* Storage: The storage capacity in gigabytes (GB).
* Display Size: The size of the display in inches.
* Price: The price of the laptop in US dollars.

**Challenges:**

* The dataset may contain outliers or missing values that need to be handled appropriately.
* The features may have different scales, which can affect the model's performance.
* The relationship between the features and the price may be complex and non-linear.

**Potential Solutions:**

* Use data preprocessing techniques such as scaling and imputation to handle outliers and missing values.
* Explore different machine learning algorithms and select the one that best suits the data and the problem.
* Consider using ensemble methods or deep learning models to capture complex relationships between features and the price.
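
As a rough illustration of the first point, the sketch below chains median imputation and standard scaling with scikit-learn. The array is made up for the example (RAM in GB, weight in kg) and is not taken from the dataset, and `preprocess` is a hypothetical name.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numeric features (RAM in GB, weight in kg) with one missing value
features = np.array([
    [4.0, 1.2],
    [8.0, np.nan],
    [16.0, 2.4],
    [8.0, 1.8],
])

# Fill missing entries with the column median, then standardize each column
preprocess = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
scaled = preprocess.fit_transform(features)
print(scaled.round(2))
```

After this step every column has zero mean and unit variance, which matters mostly for distance-based or regularized models such as KNN, SVR, Ridge, and Lasso.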

**Evaluation Metrics:**

* The performance of the model can be evaluated using metrics such as mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE).
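
To make the relationship between these metrics concrete, here is a minimal NumPy sketch. The prices are invented for illustration, and `regression_errors` is a hypothetical helper; scikit-learn's `mean_absolute_error` and `mean_squared_error` compute the same quantities.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))   # average absolute error
    mse = np.mean(residuals ** 2)      # average squared error
    rmse = np.sqrt(mse)                # back in the units of the target
    return mae, mse, rmse

# Hypothetical prices, for illustration only
actual = [500.0, 800.0, 1200.0]
predicted = [550.0, 700.0, 1300.0]
mae, mse, rmse = regression_errors(actual, predicted)
print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
```

RMSE is never smaller than MAE because squaring weights large errors more heavily, so a wide gap between the two hints at a few badly mispredicted laptops.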


# Loading libraries

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from sklearn.linear_model import LinearRegression,Ridge,Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor,AdaBoostRegressor,ExtraTreesRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor

from sklearn.metrics import r2_score,mean_absolute_error

# Loading dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv('/content/drive/MyDrive/csv data/laptop_data (1).csv')


First 5 rows of the data

In [None]:
df.head(5)

Last 5 rows of the data

In [None]:
df.tail(5)

# Data rows and columns

In [None]:
r,c = df.shape
print(f'there are {r} rows and {c} columns in our dataset')

# Data info

In [None]:
df.info()

# Duplicate values

In [None]:
df.duplicated().sum()

There are no duplicate rows in our dataset.

# Missing values

In [None]:
df.isnull().sum()

There are no missing or null values in our dataset.

# Summary of all variables

In [None]:
df.describe(include='all')

# Numerical and object features

In [None]:
numerical_feature = [i for i in df.columns if df[i].dtype != 'object']

object_features =  [i for i in df.columns if df[i].dtype == 'object']

print('Numerical_feature: ', numerical_feature)
print('object_features: ', object_features)

# Drop the unnecessary column

In [None]:
df.drop(['Unnamed: 0'],axis=1,inplace=True)

In [None]:
df

# Data wrangling

In [None]:
# remove the 'GB' suffix from the Ram column and the 'kg' suffix from Weight

df['Ram'] = df['Ram'].str.replace('GB','')
df['Weight'] = df['Weight'].str.replace('kg','')

In [None]:
# changing data type object to  int and float
df['Ram'] = df['Ram'].astype('int32')
df['Weight'] = df['Weight'].astype('float32')

# Univariate analysis

In [None]:
sns.histplot(df['Price'],kde=True)  # distplot is deprecated in recent seaborn releases
plt.title('Price distribution')
plt.show()

## Chart 2

In [None]:
sns.barplot(y = df['Company'].value_counts().index,x=df['Company'].value_counts().values,orient='h',palette='viridis')


for i,v in enumerate(df['Company'].value_counts()):
  plt.text(v,i,str(v))

plt.title('Horizontal Bar Plot of Laptop Brands with Counts')
plt.ylabel('Brand name')
plt.xlabel('Number of laptop')
plt.show()

## Chart 3

In [None]:
sns.barplot(x=df['Price'],y=df['Company'],palette='spring')
plt.title('Comparison of Laptop Prices by Company')
plt.show()

## Chart 4

In [None]:
sns.barplot(y=df['TypeName'].value_counts().index,x=df['TypeName'].value_counts().values,palette='husl')

plt.title("Frequency of Laptop Types")
plt.ylabel('laptop type')
plt.xlabel('count of laptop')
for i,v in enumerate(df['TypeName'].value_counts()):
  plt.text(v,i,str(v))

## Chart 5

In [None]:
sns.barplot(y=df['TypeName'],x=df['Price'],palette='husl')

## Chart 6

In [None]:
sns.histplot(df['Inches'],kde=True)  # distplot is deprecated in recent seaborn releases

## Chart 7

In [None]:
sns.scatterplot(x=df['Inches'],y=df['Price'])

There is no strong relationship between price and screen size.
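
A scatter plot gives a visual impression only; a correlation coefficient puts a number on it. The sketch below uses a small made-up frame standing in for `df[['Inches', 'Price']]`, so the printed value says nothing about the real data.

```python
import pandas as pd

# Hypothetical stand-in for df[['Inches', 'Price']]
sample = pd.DataFrame({
    "Inches": [13.3, 14.0, 15.6, 15.6, 17.3],
    "Price": [1200.0, 700.0, 650.0, 1500.0, 900.0],
})

# Pearson correlation lies in [-1, 1]; values near 0 mean a weak linear relationship
pearson = sample["Inches"].corr(sample["Price"])
print(f"Pearson r = {pearson:.2f}")
```

On the notebook's data the equivalent call would be `df['Inches'].corr(df['Price'])`.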

# Creating the Touchscreen column

In [None]:
df['Touchscreen'] = df['ScreenResolution'].apply(lambda x:1 if 'Touchscreen' in x else 0)

## Chart 8

In [None]:

sns.barplot(x=df['Touchscreen'].value_counts().index, y=df['Touchscreen'].value_counts().values,palette='rocket')
plt.xlabel('Touchscreen')
plt.ylabel('Count')
plt.title('Number of Laptops with and without Touchscreen')



for i,v in enumerate(df['Touchscreen'].value_counts().values):
  plt.text(i,v,str(v))
plt.show()


## Chart 9

In [None]:
sns.set(style="whitegrid")  # set the style before plotting so it applies to this chart
sns.barplot(x=df['Touchscreen'],y=df['Price'],palette='colorblind')

plt.title('Price of Laptops with and without Touchscreen')
plt.show()


# Creating the Ips column

In [None]:
df['Ips'] = df['ScreenResolution'].apply(lambda x:1 if 'IPS' in x else 0)

## Chart 10

In [None]:
sns.barplot(x=df['Ips'].value_counts().index,y=df['Ips'].value_counts().values,palette='colorblind')
plt.title("Number of Laptops with and without IPS Display")

for i,v in enumerate(df['Ips'].value_counts().values):
  plt.text(i,v,str(v))


## Chart 11

In [None]:
sns.barplot(x=df['Ips'],y=df['Price'],palette='mako')
plt.title('Price Comparison of Laptops with and without IPS Display')

In [None]:
new = df['ScreenResolution'].str.split('x',n=1,expand=True)

In [None]:
df['X_res'] = new[0]
df['Y_res'] = new[1]


In [None]:
df['X_res'] = df['X_res'].str.replace(',','').str.findall(r'(\d+\.?\d+)').apply(lambda x:x[0])

In [None]:
# changing the data type

df['X_res'] = df['X_res'].astype('int32')
df['Y_res'] = df['Y_res'].astype('int32')

In [None]:
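# checking the correlation of the numeric columns with Price
# (numeric_only=True is needed in pandas >= 2.0 because df still has text columns)
df.corr(numeric_only=True)['Price']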

In [None]:
# creating ppi columns from x_res,y_res and inches columns
df['ppi'] = (((df['X_res']**2) + (df['Y_res']**2))**0.5/df['Inches']).astype('float')
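
The PPI formula in the cell above divides the diagonal resolution in pixels by the diagonal screen size in inches. A small worked example, with a hypothetical helper name `pixels_per_inch` and an illustrative Full HD 15.6-inch panel:

```python
import math

def pixels_per_inch(x_res, y_res, diagonal_inches):
    """Diagonal resolution in pixels divided by the diagonal size in inches."""
    return math.hypot(x_res, y_res) / diagonal_inches

# Illustrative panel: 1920x1080 on a 15.6-inch screen
ppi_example = pixels_per_inch(1920, 1080, 15.6)
print(round(ppi_example, 2))  # about 141.21
```

Folding resolution and size into one feature is what allows the `X_res`, `Y_res`, and `Inches` columns to be dropped afterwards.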

In [None]:
# dropping unnecessary columns
df.drop(columns=['ScreenResolution','X_res','Y_res','Inches'],inplace=True)

In [None]:
df['cpu name']=df['Cpu'].apply(lambda x:' '.join(x.split()[0:3]))

In [None]:
df['cpu name'].value_counts()

In [None]:
# creating a function to get the cpu name

def fetch_processor(text):
  if text == 'Intel Core i7' or text == 'Intel Core i5' or text == 'Intel Core i3':
    return text
  else:
    if text.split()[0] =='Intel':
      return 'Other Intel Processor'
    else:
      return 'AMD Processor'

In [None]:
df['Cpu brand'] = df['cpu name'].apply(fetch_processor)

## Chart 12

In [None]:
sns.barplot(x = df['Cpu brand'].value_counts().values, y = df['Cpu brand'].value_counts().index,orient='h',palette='mako')

for i,v in enumerate(df['Cpu brand'].value_counts()):
  plt.text(v,i,str(v))

plt.title("Distribution of CPU Brands")

## Chart 13

In [None]:
plt.figure(figsize=(12, 8))  # Adjusting the figure size

# Creating a vertical bar plot of the mean price per CPU brand in a single color
ax = sns.barplot(x=df['Cpu brand'], y=df['Price'], color='skyblue')

# Adding annotations with price values on each bar
for p in ax.patches:
    ax.annotate(f"${p.get_height():.2f}",
                (p.get_x() + p.get_width() / 2, p.get_height()),
                ha='center', va='bottom', fontsize=10, color='black')

# Setting title and labels
plt.title('Price Distribution by CPU Brand', fontsize=16)
plt.xlabel('CPU Brand', fontsize=14)
plt.ylabel('Price', fontsize=14)

# Adjusting layout
plt.tight_layout()

# Removing top and right spines
sns.despine()

# Adding a horizontal line for the average price
average_price = df['Price'].mean()
plt.axhline(y=average_price, color='red', linestyle='--', linewidth=2, label=f'Average Price (${average_price:.2f})')

# Adding legend
plt.legend()

# Rotating x-axis labels for better readability
plt.xticks(rotation=45)

# Showing the plot
plt.show()


In [None]:
# dropping the Cpu and cpu name columns

df.drop(columns=['Cpu','cpu name'],inplace = True)

# RAM column

## Chart 14

In [None]:
ax = df['Ram'].value_counts().plot(kind='bar')

for i,v in enumerate(df['Ram'].value_counts()):
  ax.text(i,v,str(v),ha='center', va='bottom', fontsize=10, color='black')

# Setting title and labels
plt.title('Distribution of RAM', fontsize=16)
plt.xlabel('RAM (GB)', fontsize=14)
plt.ylabel('Count', fontsize=14)

## Chart 15

In [None]:
sns.barplot(x=df['Ram'],y=df['Price'],palette='mako')
plt.title("Price Distribution by RAM Capacity")


# Memory column

In [None]:
# strip a trailing '.0' so that values such as '1.0TB' become '1TB'
df['Memory'] = df['Memory'].astype(str).replace(r'\.0','',regex=True)

# remove 'GB' and turn 'TB' into '000' (1 TB is treated as 1000 GB)
df['Memory'] = df['Memory'].str.replace('GB','')
df['Memory'] = df['Memory'].str.replace('TB','000')

# split combined storage such as '128 SSD + 1000 HDD' into two parts
new = df['Memory'].str.split('+',n=1,expand = True)

df['first'] = new[0].str.strip()
df['second'] = new[1].fillna('0')

# flag the storage type of the first drive
df['layer1HHD'] = df['first'].apply(lambda x:1 if 'HDD' in x else 0)
df['layer1ssd'] = df['first'].apply(lambda x:1 if 'SSD' in x else 0)
df['layer1Hybrid'] =df['first'].apply(lambda x: 1 if 'Hybrid' in x else 0)
df['layer1Flash_storage'] = df['first'].apply(lambda x :1 if 'Flash Storage' in x else 0)

# keep only the digits of the first drive's capacity
df['first'] = df['first'].str.extract(r'(\d+)',expand=False)

# flag the storage type of the second drive
# (this must run before the text is stripped from 'second')
df['layer2HDD'] = df['second'].apply(lambda x:1 if 'HDD' in x else 0)
df['layer2SSD'] = df['second'].apply(lambda x:1 if 'SSD' in x else 0)
df['layer2Hybrid'] = df['second'].apply(lambda x: 1 if 'Hybrid' in x else 0)
df['layer2Flash_storage'] = df['second'].apply(lambda x: 1 if 'Flash Storage' in x else 0)

# keep only the digits of the second drive's capacity
# (regex=True is required in pandas >= 2.0, where str.replace is literal by default)
df['second'] = df['second'].str.replace(r'\D','',regex=True)

# changing the datatype
df['first'] = df['first'].astype('int32')
df['second'] = df['second'].astype('int32')


In [None]:
# creating HDD,SSD,Hybrid and Flash_storage columns
df['HDD'] = (df['first'] * df['layer1HHD'] + df['second'] * df['layer2HDD'])
df['SSD'] = (df['first'] * df['layer1ssd'] + df['second'] * df['layer2SSD'])
df['Hybrid'] = (df['first'] * df['layer1Hybrid'] + df['second'] * df['layer2Hybrid'])
df['Flash_storage'] = (df['first'] * df['layer1Flash_storage'] + df['second'] * df['layer2Flash_storage'])
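
The same idea can be sketched as a single helper that parses one raw `Memory` string at a time. `parse_memory` is a hypothetical function, not part of the notebook, and it assumes strings shaped like `'128GB SSD + 1TB HDD'`.

```python
import re

def parse_memory(text):
    """Split a string such as '128GB SSD + 1TB HDD' into GB per storage type."""
    totals = {"HDD": 0, "SSD": 0, "Hybrid": 0, "Flash Storage": 0}
    for part in text.split("+"):
        match = re.search(r"(\d+(?:\.\d+)?)\s*(GB|TB)", part)
        if match is None:
            continue  # no recognizable capacity in this part
        size = float(match.group(1))
        if match.group(2) == "TB":
            size *= 1000  # same 1 TB = 1000 GB convention as the notebook cell
        for kind in totals:
            if kind in part:
                totals[kind] += int(size)
    return totals

print(parse_memory("128GB SSD + 1TB HDD"))
```

Applied with `df['Memory'].apply(parse_memory)`, this would give one dictionary per row, which could be expanded into columns instead of building the intermediate `layer1...` and `layer2...` flags.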


In [None]:
# dropping unnecessary columns
df.drop(columns=['first','second','layer1HHD','layer1ssd','layer1Hybrid','layer2HDD','layer2SSD','layer2Hybrid','layer2Flash_storage','Memory','Hybrid','Flash_storage','layer1Flash_storage'],inplace=True)

In [None]:
# numeric_only=True is needed in pandas >= 2.0 because df still has text columns
df.corr(numeric_only=True)['Price']

# GPU column

In [None]:
df['gpu brand'] = df['Gpu'].apply(lambda x:x.split()[0])
df['gpu brand']

## Chart 16

In [None]:
sns.barplot(x= df['gpu brand'].value_counts().index,y=df['gpu brand'].value_counts().values,palette='mako')
for i,v in enumerate(df['gpu brand'].value_counts().values):
  plt.text(i,v,str(v))

plt.title('GPU Brand Distribution')

## Chart 17

In [None]:
sns.barplot(x= df['gpu brand'],y=df['Price'],palette='mako')
plt.title('Brand and price distribution')

In [None]:
# dropping the Gpu column
df.drop(columns=['Gpu'],inplace=True)

# Operating system

In [None]:
## Categorizes operating systems into broader categories: Windows, Mac, and Other/No OS/Linux

def cat_os(inp):
    if inp == 'Windows 10' or inp == 'Windows 7' or inp == 'Windows 10 S':
        return 'Windows'
    elif inp == 'macOS' or inp == 'Mac OS X':
        return 'Mac'
    else:
        return 'Others/No OS/Linux'

In [None]:
# creating os columns

df['os'] = df['OpSys'].apply(cat_os)
df['os'].value_counts()

In [None]:
# dropping the OpSys column

df.drop(columns=['OpSys'],inplace = True)

## Chart 18

In [None]:
sns.barplot(x=df['os'].value_counts().index,y=df['os'].value_counts().values,palette='mako')
for i,v in enumerate(df['os'].value_counts().values):
  plt.text(i,v,str(v))

## Chart 19

In [None]:
# Visualizes the distribution of laptop prices across different operating systems using a bar plot.


sns.barplot(x=df['os'],y=df['Price'],palette='mako')
plt.title("Price Distribution by Operating System")

## Chart 20

In [None]:
sns.barplot(x=df['os'],y=df['Price'],palette='viridis')
plt.title('OS and Price distribution')

# Feature engineering

In [None]:
X=df.drop(columns=['Price'])
y = np.log(df['Price'])

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=2)
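
Because the target is `np.log(df['Price'])`, every model below predicts log-prices, and the reported MAE is on the log scale rather than in currency units. The sketch below, with made-up prices and made-up prediction errors, shows how `np.exp` brings predictions back to price units.

```python
import numpy as np

# Hypothetical prices, for illustration only
prices = np.array([400.0, 900.0, 2500.0])
log_prices = np.log(prices)            # what the models are trained on

# Pretend a model predicted these log-prices
log_pred = log_prices + np.array([0.05, -0.10, 0.02])

pred_prices = np.exp(log_pred)         # back to the original price units
mae_log = np.mean(np.abs(log_prices - log_pred))
mae_price = np.mean(np.abs(prices - pred_prices))
print(f"MAE on log scale: {mae_log:.3f}")
print(f"MAE in price units: {mae_price:.2f}")
```

A log-scale error of about 0.05 corresponds to roughly a 5% relative error in price, which is often easier to interpret than the raw log value.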

# Linear regression

In [None]:
step1 = ColumnTransformer(transformers=[
    # sparse_output replaces the old sparse argument in scikit-learn >= 1.2
    ('col_tnf',OneHotEncoder(sparse_output=False,drop = 'first'),[0,1,7,10,11])
],remainder='passthrough')

step2= LinearRegression()

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2_Score ',r2_score(y_test,y_pred))
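
The positional indices `[0,1,7,10,11]` only stay correct while the column order of `X` is unchanged. One possible alternative, sketched here on a tiny made-up frame, is to select the categorical columns by name. The frame, the `log_price` column, and the `demo_pipe` name are illustrative assumptions, and `handle_unknown="ignore"` is used in place of `drop='first'` so that unseen categories do not raise an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame that mimics a few of the prepared columns
demo = pd.DataFrame({
    "Company": ["HP", "Dell", "Apple", "HP", "Dell", "Apple"],
    "os": ["Windows", "Windows", "Mac", "Windows", "Others/No OS/Linux", "Mac"],
    "Ram": [4, 8, 8, 16, 8, 16],
    "log_price": [6.0, 6.5, 7.2, 7.0, 6.4, 7.6],
})
categorical = ["Company", "os"]  # selected by name instead of position

encoder = ColumnTransformer(
    transformers=[("col_tnf", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",  # numeric columns such as Ram pass through unchanged
)
demo_pipe = Pipeline([("step1", encoder), ("step2", LinearRegression())])

X_small = demo.drop(columns=["log_price"])
demo_pipe.fit(X_small, demo["log_price"])
demo_pred = demo_pipe.predict(X_small)
print(demo_pred.round(2))
```

For the notebook's data the name list would be `['Company', 'TypeName', 'Cpu brand', 'gpu brand', 'os']`.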
print('MAE ',mean_absolute_error(y_test,y_pred))

# Ridge Regression

In [None]:
step1 = ColumnTransformer(transformers=[
    # sparse_output replaces the old sparse argument in scikit-learn >= 1.2
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first'),[0,1,7,10,11])
],remainder='passthrough')

step2 = Ridge(alpha=10)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('r2_score ',r2_score(y_test,y_pred))
print('MAE ' ,mean_absolute_error(y_test,y_pred))

# Lasso Regression

In [None]:
step1 = ColumnTransformer(transformers=[
    # sparse_output replaces the old sparse argument in scikit-learn >= 1.2
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first'),[0,1,7,10,11])
],remainder='passthrough')

step2 = Lasso(alpha=0.001)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)
y_pred = pipe.predict(X_test)

print('r2_score ',r2_score(y_test,y_pred))
print('MAE ' ,mean_absolute_error(y_test,y_pred))

# KNN

In [None]:
step1 = ColumnTransformer(transformers=[
    # sparse_output replaces the old sparse argument in scikit-learn >= 1.2
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first'),[0,1,7,10,11])
],remainder='passthrough')

step2 = KNeighborsRegressor(n_neighbors=3)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)
y_pred = pipe.predict(X_test)

print('R2 score ',r2_score(y_test,y_pred))
print('MAE ', mean_absolute_error(y_test,y_pred) )

# Decision Tree

In [None]:
step1 = ColumnTransformer(transformers=[
    # sparse_output replaces the old sparse argument in scikit-learn >= 1.2
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first'),[0,1,7,10,11])
],remainder='passthrough')

step2 = DecisionTreeRegressor(max_depth=8)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)
print('r2 score ',r2_score(y_test,y_pred))
print('MAE ',mean_absolute_error(y_test,y_pred))

# SVM

In [None]:
step1 = ColumnTransformer(transformers=[
    # sparse_output replaces the old sparse argument in scikit-learn >= 1.2
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first'),[0,1,7,10,11])
],remainder='passthrough')

step2 = SVR(kernel='rbf',C=10000,epsilon=0.1)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)
print('r2 score ',r2_score(y_test,y_pred))
print('MAE ',mean_absolute_error(y_test,y_pred))

# Random Forest

In [None]:
step1 = ColumnTransformer(transformers=[
    # sparse_output replaces the old sparse argument in scikit-learn >= 1.2
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first'),[0,1,7,10,11])
],remainder='passthrough')

step2 = RandomForestRegressor(n_estimators=991,
                              random_state=5,
                              max_samples=0.9,
                              max_features=0.77,
                              max_depth=15)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)
print('r2 score ',r2_score(y_test,y_pred))
print('MAE ',mean_absolute_error(y_test,y_pred))

# Extra Trees

In [None]:
step1 = ColumnTransformer(transformers=[
    # sparse_output replaces the old sparse argument in scikit-learn >= 1.2
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first'),[0,1,7,10,11])
],remainder='passthrough')

step2 = ExtraTreesRegressor(n_estimators=100,
                              random_state=5,
                              max_samples=None,
                              max_features=0.75,
                              max_depth=15)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)
print('r2 score ',r2_score(y_test,y_pred))
print('MAE ',mean_absolute_error(y_test,y_pred))

# AdaBoost

In [None]:
step1 = ColumnTransformer(transformers=[
    # sparse_output replaces the old sparse argument in scikit-learn >= 1.2
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first'),[0,1,7,10,11])
],remainder='passthrough')

step2 = AdaBoostRegressor(n_estimators=15,learning_rate=1.0)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)
print('r2 score ',r2_score(y_test,y_pred))
print('MAE ',mean_absolute_error(y_test,y_pred))

# Gradient Boost

In [None]:
step1 = ColumnTransformer(transformers=[
    # sparse_output replaces the old sparse argument in scikit-learn >= 1.2
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first'),[0,1,7,10,11])
],remainder='passthrough')

step2 = GradientBoostingRegressor(n_estimators=500)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)
print('r2 score ',r2_score(y_test,y_pred))
print('MAE ',mean_absolute_error(y_test,y_pred))

# XGBoost

In [None]:
step1 = ColumnTransformer(transformers=[
    # sparse_output replaces the old sparse argument in scikit-learn >= 1.2
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first'),[0,1,7,10,11])
],remainder='passthrough')

step2 = XGBRegressor(n_estimators=31,max_depth=6,learning_rate=0.5)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)
print('r2 score ',r2_score(y_test,y_pred))
print('MAE ',mean_absolute_error(y_test,y_pred))

# Conclusion


In conclusion, our analysis of the laptop dataset revealed several key insights:

1. **Brand Popularity:** HP, Lenovo, Dell, and Asus emerged as the dominant brands in terms of market share.

2. **Screen Size and Resolution:** The majority of laptops had screen sizes ranging from 14 to 15 inches, with Full HD (1920x1080) resolution being the most common.

3. **Processor and RAM:** Intel processors, particularly the Core i5 and Core i7 variants, were prevalent, paired with varying amounts of RAM, with 8GB being the most popular choice.

4. **Storage Options:** Solid-state drives (SSDs) were more common than traditional hard disk drives (HDDs), offering faster performance and reliability.

5. **Operating System:** Windows dominated as the preferred operating system, with a small percentage of laptops running macOS or other alternatives.

6. **Touchscreen and IPS Displays:** Touchscreen laptops were less common, while IPS displays offered superior viewing angles and color accuracy.

7. **Graphics Performance:** Integrated graphics were the norm, with dedicated graphics cards being present in higher-end models.

8. **Price Distribution:** Laptop prices exhibited a wide range, with a significant number of models falling within the $500 to $1000 price bracket.

9. **Correlation with Price:** Factors such as brand, processor, RAM, storage capacity, and screen resolution showed strong correlations with laptop prices.

10. **Predictive Modeling:** Our machine learning models, evaluated with R² and MAE on the log-transformed price, achieved varying levels of accuracy, with Random Forest, Gradient Boosting, and XGBoost yielding the best results.

Overall, the analysis provides valuable insights into the key characteristics, trends, and factors influencing laptop prices in the market.
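
One way to compare several regressors side by side is to fit them in a loop and collect the same two metrics for each. The sketch below does this on synthetic data rather than the laptop dataset, so the printed scores are not results of this project; all variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and the log-price target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] * 0.5 + np.sin(X_demo[:, 1]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=5),
    "GradientBoosting": GradientBoostingRegressor(random_state=5),
}

# Fit each model once and store (R2, MAE) on the held-out split
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    scores[name] = (r2_score(y_te, pred), mean_absolute_error(y_te, pred))

# Print the models ordered from highest to lowest R2
for name, (r2, mae) in sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{name:<18} R2={r2:.3f}  MAE={mae:.3f}")
```

A single train/test split can favor one model by chance, so cross-validation (for example `cross_val_score`) would give a steadier ranking.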