# Laptop price predictor 
In this notebook we will build a model that predicts the price of a laptop given some of its characteristics. Along the way we will do a lot of preprocessing and exploratory data analysis.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import re



Importing the data from the CSV file

In [None]:
df = pd.read_csv("laptop_data.csv",encoding='latin-1')
df.head(2)

In [None]:
df.columns

In [None]:
df.shape

We have 11 characteristics in this dataset, but some of them are really noisy; we will deal with them in the EDA.

Let's get some more information about this dataset.

In [None]:
df.info()

It looks like we have no null values

In [None]:
df.isnull().sum()

So this is a near-perfect dataset. It is great for academic work and for learning how things work; the only problem is that you rarely come across data like this, as it usually needs to be cleaned and processed first.

Now let's see the unique values that we have.

In [None]:
for col in df.columns:
    if col!='Price_euros':
        print(f'{col} column has {df[col].unique().size} unique elements'+'__'*20,
              f'\nUnique values in {col}:\n {df[col].unique()}\n')


As we can observe, `Ram`, `Memory` and `Weight` are numerical data with a unit attached. Let's fix `Ram` and `Weight` now; `Memory` needs more work and is handled in the Storage section.

In [None]:
df['Ram']=df['Ram'].str.replace('GB','')
df['Ram']=df['Ram'].astype('int32')
df=df.rename(columns={"Ram": "Ram (GB)"})
df['Weight']=df['Weight'].str.replace('kg','')
df['Weight']=df['Weight'].astype('float64')

df=df.rename(columns={"Weight": "Weight (kg)"})
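The same unit-stripping step can also be written so that it fails loudly when a value does not look like a number followed by the expected unit. This is a minimal sketch on a toy frame: the sample values and the helper name `strip_unit` are illustrative assumptions, not part of the dataset or of pandas.

```python
import re

import pandas as pd


def strip_unit(series: pd.Series, unit: str, dtype: str) -> pd.Series:
    """Extract the number from strings such as '8GB' or '1.37kg' (hypothetical helper)."""
    pattern = rf"^\s*(\d+(?:\.\d+)?)\s*{re.escape(unit)}\s*$"
    numbers = series.str.extract(pattern, expand=False)
    if numbers.isna().any():
        # Report the offending values instead of silently producing NaN.
        bad = series[numbers.isna()].unique()
        raise ValueError(f"unexpected values: {bad}")
    return pd.to_numeric(numbers).astype(dtype)


# Toy data shaped like the Ram and Weight columns (values are made up).
toy = pd.DataFrame({"Ram": ["8GB", "16GB", "4GB"],
                    "Weight": ["1.37kg", "2.1kg", "4kg"]})
toy["Ram (GB)"] = strip_unit(toy["Ram"], "GB", "int32")
toy["Weight (kg)"] = strip_unit(toy["Weight"], "kg", "float64")
print(toy[["Ram (GB)", "Weight (kg)"]])
```

Compared with a plain `str.replace`, the anchored pattern rejects malformed entries up front rather than letting `astype` fail later with a less helpful message.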

In [None]:
df.head()

## Exploratory Data Analysis

### Numerical data

In [None]:
df.describe()

#### Discrete data 


In [None]:
for variable in ['Inches','Ram (GB)']:
    plt.figure(figsize=(15,7))
    sn.countplot(x=variable,data = df)

These graphs tell us that laptops with 15.6" displays and laptops with 8GB of RAM are the most common in the dataset, which suggests that buyers prefer them.

#### Continuous data 

Let's plot the distributions of price and weight.

In [None]:
sn.displot(df['Price_euros'])
plt.axvline(df['Price_euros'].mean(), color='red', linewidth=2)  # vertical line at the mean price

In [None]:
sn.displot(np.log1p(df['Price_euros']),kind='kde')

In [None]:
sn.displot(df['Weight (kg)'])
plt.axvline(df['Weight (kg)'].mean(), color='red', linewidth=2)  # vertical line at the mean weight


We observe that the weight distribution is skewed, so the mean is not very informative. Some laptops in this data weigh more than 4kg; these are probably workstations.
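To see why the mean can mislead on skewed data, here is a small sketch that compares mean and median on a made-up list of weights. The numbers are hypothetical and only mimic the shape described above: many light laptops plus a couple of heavy ones.

```python
from statistics import mean, median

# Hypothetical weights in kg: mostly light laptops plus two heavy machines.
weights = [1.2, 1.3, 1.4, 1.5, 1.6, 1.8, 2.0, 2.2, 4.4, 4.7]

# The two heavy laptops pull the mean up, while the median stays
# close to what a typical laptop in the list weighs.
print(f"mean   = {mean(weights):.2f} kg")
print(f"median = {median(weights):.2f} kg")
```

On the real data the same comparison would be `df['Weight (kg)'].mean()` against `df['Weight (kg)'].median()`.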

### Categorical data

Let's plot some of these variables

In [None]:
Categorical=['Company', 'TypeName', 'OpSys']

for variable in Categorical:
       plt.figure(figsize=(20,7))
       sn.countplot(x=variable,data = df, order =df[variable].value_counts().index)

From these graphs we can see that Windows 10 is the most common OS in the dataset. Dell and Lenovo compete for the position of most common brand, and notebooks are the most common laptop type.

Let's explore more about each company and their prices

In [None]:
plt.figure(figsize=(15,7))
plt.xticks(rotation=20)
sn.barplot(x = df['Company'],y = df['Price_euros'],order=df['Company'].value_counts().index)

We observe that Razer products have the most variation in price, peaking at 42244.11 MAD and bottoming out at 7127.26 MAD, with a mean of 23176.72 MAD.

Now let's see the price variation based on laptop type.

In [None]:
plt.figure(figsize=(15,7))
sn.countplot(x=df['TypeName'],order=df['TypeName'].value_counts().index)

In [None]:
plt.figure(figsize=(15,7))
x = df.groupby(['TypeName']).Price_euros.max().sort_values().keys()
sn.barplot(x = 'TypeName',y= 'Price_euros',data=df,order=x)

Notebooks show the least variation in price of all laptop types, probably because of the large number of notebooks and competitors in the market.
The high price of workstations and their low count also suggest that workstations are more of a niche product.

#### Deep dive into the `ScreenResolution` column

In [None]:
df['ScreenResolution'].value_counts()

This column holds three types of data:
- _TouchScreen_,
- _display panel_ (IPS or TN panels)
- and _max screen resolution_ (e.g. Full HD 1920x1080).

We need to separate this information into different columns. To do that we will turn the two categorical pieces (touchscreen and display panel) into new columns with two possible values each: 'Yes' or 'No' for `TouchScreen`, and 'IPS' or 'TN' for the display panel.
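As a rough illustration of the split described above, the sketch below pulls all three pieces out of a single string with the standard `re` module. The function name `parse_screen` and the sample strings are assumptions made for this example; labelling everything without "IPS" as TN follows the convention used in this notebook.

```python
import re


def parse_screen(text: str) -> dict:
    """Split a ScreenResolution-style string into its three pieces (hypothetical helper)."""
    match = re.search(r"(\d+)x(\d+)", text)
    if match is None:
        raise ValueError(f"no resolution found in {text!r}")
    return {
        "TouchScreen": "Yes" if "Touchscreen" in text else "No",
        "Display panel": "IPS" if "IPS" in text else "TN",
        "X_res": int(match.group(1)),
        "Y_res": int(match.group(2)),
    }


# Illustrative string of the kind found in the column.
sample = "IPS Panel Full HD / Touchscreen 1920x1080"
print(parse_screen(sample))
```

Parsing everything in one pass keeps the three derived columns consistent with each other, at the cost of a stricter failure when a string has no resolution at all.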


##### Starting with _TouchScreen_:

In [None]:
df["TouchScreen"]=df['ScreenResolution'].apply(lambda x:'Yes' if 'Touchscreen' in x else 'No')
df['ScreenResolution']=df['ScreenResolution'].replace(regex={r"/* *Touchscreen /*":""})
df.tail()

In [None]:
sn.countplot(x='TouchScreen',data=df)

As we can see, most laptops in this dataset don't have a touchscreen; just 192 have this functionality.

In [None]:
sn.barplot(x = 'TouchScreen',y= 'Price_euros',data=df)

The price of laptops with a touchscreen varies a lot, averaging 10009.37, whereas laptops without one have an average price of about 7753.11.

Let's see how touchscreens are distributed across laptop types.

In [None]:
sn.countplot(x="TypeName", hue="TouchScreen", data=df)
plt.xticks(rotation=20)

In [None]:
pd.crosstab(index=df['TouchScreen'], columns=df['TypeName'])

As we can see, most of the laptops that have a touchscreen are 2 in 1 Convertibles. We can also see that the most popular laptop type doesn't usually come with a touchscreen.

We could say that touchscreens are mostly a luxury for laptop users.

##### Display panel

For the display panel we will label observations with IPS panels as 'IPS' and every other observation as 'TN'. Which of the two ends up as the baseline category doesn't matter.

In [None]:
df["Display panel"]=df['ScreenResolution'].apply(lambda x:'IPS' if 'IPS' in x else 'TN')
df['ScreenResolution']=df['ScreenResolution'].replace(regex={r"IPS Panel.[Retina Display ]*":""})
df.tail()

While we are at it, let's also clean the `ScreenResolution` column so that only the resolution itself remains.

In [None]:
df['ScreenResolution']=df['ScreenResolution'].replace(regex={r"(4K)?[^0-9^x]*":""})

In [None]:
sn.countplot(x='Display panel',data=df)

Most laptops in this dataset don't have an IPS panel but a TN one instead.

In [None]:
sn.barplot(x = 'Display panel',y= 'Price_euros',data=df)

Laptops with IPS display panels are more expensive than those with TN panels.

In [None]:
sn.countplot(x="TypeName", hue="Display panel", data=df)
plt.xticks(rotation=20)

In [None]:
pd.crosstab(index=df['Display panel'], columns=df['TypeName'])

We can see here that most notebooks and netbooks don't use IPS panels, probably because of their higher price. More than half of the 2 in 1 convertibles use IPS panels, and close to half of the workstations, ultrabooks and gaming laptops do too. This makes sense: IPS panels offer better viewing angles than their TN counterparts, so they are more of a premium addition.

##### ScreenResolution

In [None]:
df['ScreenResolution'].value_counts()

In [None]:
plt.figure(figsize=(15,7))
sn.countplot(x='ScreenResolution',data=df,order=df['ScreenResolution'].value_counts().index)
plt.xticks(rotation=20)

We observe that 1920x1080 (FHD) is the most popular display resolution across all laptops.

In [None]:
X=df['ScreenResolution'].str.split("x",expand=True)
df['X_res']=X[0]
df['Y_res']=X[1]
for value in ['X_res','Y_res']:
    df[value]=df[value].astype("int")

## Feature Engineering

In [None]:
df.sample(3)

### Pixel Per Inch

Pixel density indicates how many pixels per inch (PPI) there are on a display. The higher the pixel density, the more detailed and spacious the picture is.

In contrast, displays with low pixel density will have less screen space and more pixelated image quality.

Inches and resolution carry almost the same information, so we can combine them into a single metric, pixels per inch (PPI), calculated as follows. At the end of the day, our goal is to improve performance by having fewer features.

$$
    PPI = \frac{\sqrt{X_{resolution}^2+Y_{resolution}^2}}{inches}
$$
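As a worked example of the formula, the sketch below computes the PPI of a 1920x1080 panel at two common screen sizes. The helper name `pixels_per_inch` is made up for this illustration.

```python
import math


def pixels_per_inch(x_res: int, y_res: int, inches: float) -> float:
    """Diagonal resolution in pixels divided by the diagonal size in inches."""
    return math.hypot(x_res, y_res) / inches


ppi_fhd_15 = pixels_per_inch(1920, 1080, 15.6)   # about 141.2
ppi_fhd_13 = pixels_per_inch(1920, 1080, 13.3)   # same pixels, smaller screen

print(f"15.6 inch FHD: {ppi_fhd_15:.1f} ppi")
print(f"13.3 inch FHD: {ppi_fhd_13:.1f} ppi")
```

The same resolution gives a higher density on the smaller screen, which is the information that `Inches` and the resolution only carry jointly.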

In [None]:
df['ppi'] = (((df['X_res']**2) + (df['Y_res']**2))**0.5/df['Inches']).astype('float')

In [None]:
df["Res_value"]=df['X_res']*df['Y_res']
df.drop(columns = ['X_res','Y_res'], inplace=True)

In [None]:
# map each pixel count back to its resolution string, pairing them row by row
Convert=df.drop_duplicates('Res_value').set_index('Res_value')['ScreenResolution'].to_dict()
df.pop('ScreenResolution')

In [None]:
df.info()

In [None]:
plt.figure(figsize=(15,7))
sn.boxplot(y='Price_euros',x ='Inches',data=df)

# 'Inches' is now summarized by 'ppi'; it is dropped further down together with the other unused columns (X_res and Y_res are already gone)

### `CPU` column

The `Cpu` column also packs a lot of information: the CPU manufacturer, the model and its clock speed (GHz), for a total of 118 different categories.

In [None]:
df['Cpu'].value_counts()

We will put the CPU speed in its own column and change its dtype.

In [None]:
def get_GHz(CPU):
    # capture the whole number right before 'GHz' (so '1.60GHz' gives '1.60', not '60')
    return re.search(r'(\d+(?:\.\d+)?)\s*GHz',CPU).group(1)

In [None]:
df['GHz']=df['Cpu'].apply(get_GHz)


In [None]:
df['GHz']=df['GHz'].astype("float")
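Clock-speed extraction is easy to get subtly wrong, for instance when a value has two decimals or no decimal point at all, so it is worth checking a pattern on a few strings first. This is a minimal sketch with the standard `re` module; the sample CPU strings and the names `GHZ_PATTERN` and `clock_speed` are illustrative assumptions.

```python
import re

# One or more digits, an optional decimal part, then 'GHz'.
GHZ_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*GHz")


def clock_speed(cpu: str) -> float:
    """Return the clock speed in GHz found in a CPU description (hypothetical helper)."""
    match = GHZ_PATTERN.search(cpu)
    if match is None:
        raise ValueError(f"no clock speed in {cpu!r}")
    return float(match.group(1))


# Illustrative strings: model numbers contain digits too, which the pattern must skip.
samples = {
    "Intel Core i5 2.3GHz": 2.3,
    "Intel Core i7 7700HQ 2.8GHz": 2.8,
    "Intel Celeron Dual Core N3060 1.60GHz": 1.6,
    "AMD A9-Series 9420 3GHz": 3.0,
}
for cpu, expected in samples.items():
    print(cpu, "->", clock_speed(cpu), "(expected", expected, ")")
```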

We will group this column so that each laptop has either one of the Intel processor families (Intel Xeon, i3, i5, i7 or Other Intel Processor) or an AMD Processor.

In [None]:
def get_processor(x):
    match=re.search(r'Intel Core i[357]',x)
    if match:
        return match[0]
    if 'xeon' in x.lower():
        return 'Intel Xeon E3'
    if 'intel' in x.lower():
        return 'Other Intel Processor'
    return 'AMD Processor'
        

In [None]:
df['Cpu_brand']=df['Cpu'].apply(get_processor)

How does the price vary with processors?

In [None]:
df.pop('Cpu')

In [None]:
sn.countplot(x='Cpu_brand',data=df,palette='plasma',order=df['Cpu_brand'].value_counts().index)
plt.xticks(rotation=20)

In [None]:
plt.subplots(figsize=(16,10))
x = df.groupby(['Cpu_brand']).Price_euros.median().sort_values().keys()
sn.barplot(x='Price_euros',y='Cpu_brand',data= df,order=x)

Xeon processors are among the best processors Intel offers, which explains their very high price. We can also see that i7 and i5 processors are the most popular, and that they are better than Core i3 and AMD processors.

### Storage

In [None]:
df['Memory'].value_counts()

First of all, let's standardize the units and remove the decimal point.

In [None]:
df['Memory']=df['Memory'].str.replace('.0','',regex=False)
df['Memory']=df['Memory'].str.replace('GB','')
df['Memory']=df['Memory'].str.replace('TB','000')

In [None]:
def get_memory(mem,StorageType='HDD'):
  if StorageType in mem.split(' '):
    i = mem.split(' ').index(StorageType)
    return mem.split(' ')[i-1]
  else:
    return 0


In [None]:
# 'storage_type' avoids shadowing the built-in 'type'
for storage_type in ['HDD', 'SSD','Flash','Hybrid']:
    df[storage_type]=df['Memory'].apply(lambda x: get_memory(x,StorageType=storage_type))
    df[storage_type]=df[storage_type].astype("int")
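An alternative way to read the storage sizes is to parse the raw string in a single pass with a regular expression, converting TB to GB on the way. The sketch below is only an illustration under assumptions: the name `parse_storage` and the sample strings are made up, and unlike the approach above it sums repeated drives of the same type.

```python
import re

STORAGE_TYPES = ("HDD", "SSD", "Flash", "Hybrid")
# A number, its unit, then the storage type, e.g. '128GB SSD' or '1.0TB Hybrid'.
PART = re.compile(r"(\d+(?:\.\d+)?)\s*(GB|TB)\s+(HDD|SSD|Flash|Hybrid)")


def parse_storage(memory: str) -> dict:
    """Return the capacity in GB per storage type (hypothetical helper)."""
    sizes = dict.fromkeys(STORAGE_TYPES, 0)
    for amount, unit, kind in PART.findall(memory):
        gigabytes = float(amount) * (1000 if unit == "TB" else 1)
        sizes[kind] += int(gigabytes)
    return sizes


print(parse_storage("128GB SSD +  1TB HDD"))
print(parse_storage("64GB Flash Storage"))
```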

In [None]:
df.pop('Memory')
df.head()

### Gpu

In [None]:
df['Gpu'].value_counts()

In [None]:
df.loc[182,'Gpu']='AMD Radeon R7'

In [None]:
def get_Gpu(x):
    if 'Intel' in x:
        return 'Intel Iris Graphics' if 'Iris' in x else "Intel HD Graphics"
    if  'AMD' in x:
        match = re.search(r'R[579X]',x)
        if match:
            return 'AMD Radeon '+match.group()
        if 'FirePro' in x:
            return 'AMD FirePro'
        else:
            return "Other AMD Radeon"
    # if not AMD or Intel then its Nvidia
    if 'Nvidia' in x:
        if 'Quadro' in x:
            return 'Nvidia Quadro'
        if 'GTX' in x:
            return 'Nvidia GeForce GTX'
        else:
            return 'Nvidia GeForce GT'
    return 'Other Gpu'
    

In [None]:
df['Gpu_model']=df['Gpu'].apply(get_Gpu)

In [None]:
plt.subplots(figsize=(16,7))
x = df.groupby(['Gpu_model']).Price_euros.median().sort_values().keys()
sn.boxplot(x='Price_euros',y='Gpu_model',data= df,order=x)

In [None]:
f, axes = plt.subplots(2,3,figsize=(25,20))
# one pie chart per laptop type, showing the share of each GPU model
gpu_counts = df.groupby(["TypeName","Gpu_model"]).count()['laptop_ID']
for ax, type_name in zip(axes.flat, df["TypeName"].unique()):
    gpu_counts[type_name].plot.pie(autopct="%.1f%%",ax=ax)
    ax.set_title(type_name)

In [None]:
df.drop(columns = ['laptop_ID','Product','Inches', 'Gpu'], inplace=True)

In [None]:
df.head()

In [None]:
df.info()

## Model Building

### Correlation

In [None]:
sn.displot(df['Price_euros'],kind='kde')

In [None]:
sn.displot(np.log(df['Price_euros']),kind='kde')

In [None]:
df['Price_euros']=df.pop('Price_euros')
plt.figure(figsize=(20,12))
sn.heatmap(df.corr(numeric_only=True),annot=True)  # correlations are only defined for numeric columns

In [None]:
df.corr(numeric_only=True)['Price_euros'].sort_values(ascending=False)

### Pipeline

Now that we have prepared our data and have a better understanding of the dataset, let's get started with machine learning modeling and look for the best algorithm and hyperparameters to achieve the highest score.

Let's load the libraries that we will use.

In [None]:
df.head(2)

In [None]:
import sklearn as sl
from sklearn.compose import ColumnTransformer,make_column_selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


from sklearn.model_selection import train_test_split, GridSearchCV
# models
from sklearn.linear_model import LinearRegression,Ridge,BayesianRidge,SGDRegressor
from lightgbm import LGBMRegressor
from xgboost.sklearn import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor,RandomForestRegressor,AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_absolute_error,r2_score,mean_squared_error,max_error


In [None]:
df.info()

In [None]:
def getName(x):
    # works for both classes (LinearRegression) and instances (LinearRegression())
    return getattr(x, '__name__', type(x).__name__)

In [None]:
y=np.log(df['Price_euros'])
X=df.loc[:, df.columns != 'Price_euros']


mapper = {i:value for i,value in enumerate(X.columns)}  # X_train does not exist yet at this point
mapper
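Because the target is the logarithm of the price, predictions come out in log units and have to be exponentiated to get euros back. The sketch below shows the round trip on made-up numbers; the prices and the simulated prediction errors are assumptions for illustration only.

```python
import numpy as np

prices = np.array([400.0, 1000.0, 2500.0])     # hypothetical prices in euros
log_prices = np.log(prices)                    # what the model is trained on

# Pretend a model predicted the log price with a small additive error.
log_errors = np.array([0.05, -0.10, 0.02])
log_pred = log_prices + log_errors

pred_euros = np.exp(log_pred)                  # back to euros
ratio = pred_euros / prices                    # equals exp(log_errors)

print(np.round(pred_euros, 2))
print(np.round(ratio, 3))
```

An additive error in log space is a multiplicative error in euros, so an error of 0.05 in log units means the prediction is roughly 5% too high regardless of the laptop's price.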

We create the preprocessing pipelines for both numeric and categorical data.

In [None]:
numerical_features = make_column_selector(dtype_include=np.number)
cat_features = make_column_selector(dtype_exclude=np.number)

In [None]:
#numeric_features = ['Ram (GB)','Weight (kg)','ppi','Res_value','GHz','HDD','SSD','Flash','Hybrid']
numeric_transformer = StandardScaler()

#categorical_features = ['Company','TypeName','OpSys','Cpu_brand','Gpu_model']
categorical_transformer = OneHotEncoder(sparse_output=False,drop='first')  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numerical_features),
        ("cat", categorical_transformer, cat_features),
    ],
    remainder='passthrough'
)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(preprocessor.fit_transform(X),y,test_size=0.15,random_state=2)
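Transforming the full dataset before splitting means the scaler's mean and variance are computed with the test rows included. A common alternative is to put the preprocessor inside the pipeline so that it is fitted on the training rows only. The following is a sketch of that idea on toy data, not a drop-in replacement for the cell above: the column values are made up, and `handle_unknown="ignore"` is an added assumption so that categories seen only in the test split do not raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data standing in for the laptop features (values are made up).
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "Ram (GB)": rng.choice([4, 8, 16], size=n),
    "Company": rng.choice(["A", "B", "C"], size=n),
})
log_price = 6 + 0.05 * toy["Ram (GB)"] + rng.normal(0, 0.1, size=n)

pre = ColumnTransformer([
    ("num", StandardScaler(), make_column_selector(dtype_include=np.number)),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude=np.number)),
])
model = Pipeline(steps=[("prep", pre), ("Reg", LinearRegression())])

# Split the raw frame first, then let the pipeline fit its own preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(toy, log_price, test_size=0.25, random_state=2)
model.fit(X_tr, y_tr)                 # scaler and encoder see training rows only
r2_test = model.score(X_te, y_te)     # preprocessing is re-applied, not re-fitted
print(f"test R2: {r2_test:.3f}")
```

With this layout, cross-validation also refits the preprocessing inside every fold, so the validation folds never influence the scaling.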

In [None]:
def Model_pipe(mdl):
    return Pipeline(steps=[("Reg", mdl())])

After appending a regression model we have a full prediction pipeline.

In [None]:
#models=[LinearRegression,Ridge,SGDRegressor,DecisionTreeRegressor,RandomForestRegressor,GradientBoostingRegressor,
#        SVR,BayesianRidge,LGBMRegressor,XGBRegressor,AdaBoostRegressor,CatBoostRegressor]

In [None]:
def scores(md):
    # y is the log price, so the errors below are in log units
    for score_type,X,y in zip(['+ Training scores','+ Test scores'],[X_train,X_test],[y_train,y_test]):
        print(score_type)
        y_pred = md.predict(X)
        print('|            R2 Score: {:.3}%'.format(r2_score(y,y_pred)*100))
        print('| Mean Absolute Error: {:.3}'.format(mean_absolute_error(y,y_pred)))
        print('|           Max Error: {:.3}'.format(max_error(y,y_pred)))

### Model parameters

In [None]:
params=[]

In [None]:
def evaluate(model):
    md=Model_pipe(model)
    md.fit(X_train,y_train)
    print('\n------ {} ------'.format(getName(model)))
    scores(md)


#### Linear Regression

**Ordinary least squares Linear Regression.**

$\mathbf {y}$ is an $n\times 1$ ($1303\times 1$ in our case) vector of the response variable, and ${\mathrm  {X}}$ is an $n\times p$ ($1303\times 16$) matrix of regressors whose $i$-th row is $\mathbf {x} _{i}$ and contains the $i$-th observation of all the explanatory variables. In a linear regression model, the response variable $y_{i}$ is a linear function of the regressors:
$$
{\displaystyle \mathbf {y} =\mathrm {X} {\boldsymbol {\beta }}+{\boldsymbol {\varepsilon }},\,}
$$
where ${\boldsymbol {\beta }}$ is a $16\times 1$ vector of unknown parameters and $\boldsymbol{\varepsilon}$ is a $1303\times 1$ vector of the errors of the $n$ observations.

`LinearRegression` fits a linear model with coefficients $\boldsymbol {\beta } = ({\beta }_1, \dots, {\beta }_p)$ to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation.
$$
\sum _{i=1}^{1303}{\biggl |}y_{i}-\sum _{j=1}^{16}X_{ij}\beta _{j}{\biggr |}^{2}={\bigl \|}\mathbf {y} -\mathrm {X} {\boldsymbol {\beta }}{\bigr \|}^{2}_2.
$$
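The minimization above can be checked numerically on a small synthetic problem. The sketch below is an illustration with made-up data: `n = 50` and `p = 3` stand in for the real dimensions, and no intercept is included, whereas scikit-learn's `LinearRegression` fits one by default.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                                   # small stand-ins for the real sizes
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.01, size=n)

# Least-squares solution: minimizes ||y - X beta||^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta_hat) ** 2)

# The normal equations X^T X beta = X^T y lead to the same coefficients.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

print("beta_hat:", np.round(beta_hat, 3))
print("residual sum of squares:", rss)
```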

In [None]:
evaluate(LinearRegression)

#### GradientBoostingRegressor


In [None]:
evaluate(GradientBoostingRegressor)

In [None]:
GBR = Model_pipe(GradientBoostingRegressor)

In [None]:
param_GBR = {
    'Reg__learning_rate': [.01,.03, 0.05, .07],
    'Reg__subsample'    : [0.7, 0.5, 0.2, 0.1],
    'Reg__n_estimators' : [1000],
    'Reg__max_depth'    : [4,6,8,10,12],
    'Reg' : [GradientBoostingRegressor()]
}

In [None]:
grid_GBR = GridSearchCV(GBR,param_GBR,n_jobs=-1,scoring='r2',error_score='raise')
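The `Reg__` prefix in the parameter grid is how scikit-learn addresses a parameter of a named pipeline step: the step name, two underscores, then the parameter name. Here is a minimal sketch of that convention using `Ridge` on synthetic data; the data and the grid values are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + rng.normal(scale=0.1, size=80)

# The step is called "Reg", so its 'alpha' parameter is addressed as 'Reg__alpha'.
pipe = Pipeline(steps=[("Reg", Ridge())])
grid = GridSearchCV(pipe, {"Reg__alpha": [0.01, 1.0, 100.0]}, scoring="r2", cv=5)
grid.fit(X, y)

print(grid.best_params_)                   # keys keep the 'Reg__' prefix
print("Reg__alpha" in pipe.get_params())   # valid names are listed by get_params()
```

Calling `get_params()` on the pipeline is a quick way to list every name that a grid is allowed to use.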


In [None]:
grid_GBR.fit(X_train, y_train)

In [None]:
GBR.get_params()

In [None]:
print("Best parameter (CV score=%0.3f):" % grid_GBR.best_score_)
print(grid_GBR.best_params_)

Best parameter (CV score=0.905):
{'learning_rate': 0.01, 'max_depth': 10, 'n_estimators': 1000, 'subsample': 0.1}

In [None]:
scores(grid_GBR)

#### XGBRegressor 

In [None]:
evaluate(XGBRegressor)

In [None]:
param_XGBR = {
    'Reg__learning_rate'    : [.01,.03, 0.05, .07],
    'Reg__subsample'        : [0.7, 0.5, 0.2, 0.1],
    'Reg__n_estimators'     : [1000],
    'Reg__max_depth'        : [2,4,6,8,10],
    'Reg'                   : [XGBRegressor()]
}


In [None]:
XGBR = Model_pipe(XGBRegressor)

In [None]:
grid_XGBR = GridSearchCV(XGBR,param_XGBR,n_jobs=-1,scoring='r2',error_score='raise')


In [None]:
grid_XGBR.fit(X_train, y_train)

In [None]:
print("Best parameter (CV score=%0.3f):" % grid_XGBR.best_score_)
print(grid_XGBR.best_params_)

In [None]:
scores(grid_XGBR)

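Unlike plain linear regression, which has no hyperparameters to tune, XGBoost exposes many; the grid above covers only the learning rate, the subsample ratio and the tree depth.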

#### Support vector machines

In [None]:
evaluate(SVR)

In [None]:
SVMR=Model_pipe(SVR)

In [None]:
param_SVMR={
            'Reg__C': [1.1, 5.4, 170, 1001],
            'Reg__epsilon': [0.0003, 0.007, 0.0109, 0.019, 0.14, 0.05, 8, 0.2, 3, 2, 7],
            'Reg__gamma': [0.7001, 0.008, 0.001, 3.1, 1, 1.3, 5],
            'Reg' : [SVR()]
        }

In [None]:
grid_SVMR = GridSearchCV(SVMR,param_SVMR,n_jobs=60,scoring='r2',error_score='raise')

In [None]:
grid_SVMR.fit(X_train,y_train)

In [None]:
print("Best parameter (CV score=%0.3f):" % grid_SVMR.best_score_)
print(grid_SVMR.best_params_)

In [None]:
scores(grid_SVMR)

#### Decision Tree

In [None]:
evaluate(DecisionTreeRegressor)

In [None]:
DTR=Model_pipe(DecisionTreeRegressor)

In [None]:
param_DTR={
    # "mse"/"mae" were renamed in scikit-learn 1.0 and removed in 1.2
    "Reg__criterion": ["squared_error", "absolute_error"],
    "Reg__min_samples_split": [10, 20, 40],
    "Reg__max_depth": [2, 6, 8],
    "Reg__min_samples_leaf": [20, 40, 100],
    "Reg__max_leaf_nodes": [5, 20, 100],
    "Reg" : [DecisionTreeRegressor()]

}

In [None]:
grid_DTR = GridSearchCV(DTR,param_DTR,n_jobs=60,scoring='r2',error_score='raise')

In [None]:
grid_DTR.fit(X_train,y_train)

In [None]:
print("Best parameter (CV score=%0.3f):" % grid_DTR.best_score_)
print(grid_DTR.best_params_)

In [None]:
scores(grid_DTR)

#### Random Forest

In [None]:
evaluate(RandomForestRegressor)

In [None]:
RFR=Model_pipe(RandomForestRegressor)

In [None]:
param_RFR={
    # "mse"/"mae" were renamed in scikit-learn 1.0 and removed in 1.2
    "Reg__criterion": ["squared_error", "absolute_error"],
    "Reg__max_depth": [2, 6, 8],
    "Reg__min_samples_leaf": [20, 40, 100],
    "Reg__max_leaf_nodes": [5, 20, 100],
    "Reg__min_samples_split": [10, 20, 40],
    "Reg" : [RandomForestRegressor()]
}

In [None]:
grid_RFR = GridSearchCV(RFR,param_RFR,n_jobs=60,scoring='r2',error_score='raise')

In [None]:
grid_RFR.fit(X_train,y_train)

In [None]:
print("Best parameter (CV score=%0.3f):" % grid_RFR.best_score_)
print(grid_RFR.best_params_)

In [None]:
scores(grid_RFR)

### Performance


The next step is to find out how well our models perform when making predictions on the unseen test set. There are various metrics for that; mean absolute error, mean squared error and root mean squared error are three of the most common.
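The three metrics differ only in how they aggregate the individual errors, which a tiny hand-computed example makes concrete. The target and prediction values below are hypothetical log prices chosen so the arithmetic is easy to follow.

```python
import math

y_true = [6.0, 7.0, 8.0, 9.0]      # hypothetical log prices
y_pred = [6.5, 7.0, 7.0, 9.5]

errors = [p - t for p, t in zip(y_pred, y_true)]     # 0.5, 0.0, -1.0, 0.5

mae = sum(abs(e) for e in errors) / len(errors)      # (0.5 + 0 + 1 + 0.5) / 4 = 0.5
mse = sum(e * e for e in errors) / len(errors)       # (0.25 + 0 + 1 + 0.25) / 4 = 0.375
rmse = math.sqrt(mse)                                # about 0.612, same units as the target

print(f"MAE  = {mae}")
print(f"MSE  = {mse}")
print(f"RMSE = {rmse:.3f}")
```

The single error of 1.0 accounts for half of the MAE but two thirds of the MSE, which is why the squared metrics are more sensitive to outliers.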

In [None]:
md=Model_pipe(LinearRegression)
md.fit(X_train,y_train)
print('\n------ {} ------'.format(getName(LinearRegression)))  # pass the model class, not the fitted pipeline