# Linear Regression "laptop price"





# 1&nbsp;Table of contents


The contents of this notebook are divided into various categories which are given as follows:

<ol>
<li><b>Table of contents</b>
<li><b>Background</b>
<li><b>Business Understanding</b>
<li><b>Import required libraries</b>
<li><b>Import data and check import</b><br>
    5.1 Import Data<br>
    5.2 Check Import<br>
<li><b>First Data Preparation for Analysis</b><br>
    6.1 Remove unwanted columns (optional)<br>
    6.2 Rename column names (optional)<br>
    6.3 Check and handle Null values<br>
    6.4 Check and change data types<br>
    6.5 Remove special characters<br>
    6.6 Replace values<br>
    6.7 Replace rare values<br>

<li><b>Data Understanding</b><br>
    7.1 Univariate - statistical analysis<br>
    7.2 Univariate - visualizations<br>
    7.3 Bivariate - correlation matrix<br>
    7.4 Bivariate - visualization<br>
    7.5 Trivariate - visualization<br>

<li><b>Data Preparation</b><br>
    8.1 Feature Engineering<br>
    8.2 Scaling<br>
    8.3 Delete data records

<li><b>Split data (Train and Test)</b>

<li><b>Create Linear Models</b><br>
    10.1 Fit Linear Models<br>
    10.2 Visual check<br>
    10.3 R2, adj. R2 and RMSE (training)<br>
    10.4 Summary table - upper area<br>
    10.5 Summary table - middle area<br>
    10.6 Summary table - lower area<br>

<li><b>Check performance with test data</b><br>
    11.1 Handling extra categories in test set<br>
    11.2 Check R2 and RMSE for test<br>
    11.3 Residual plot (test)

</ol>



# 2&nbsp;Background

<ul>
<li>The data set is available via the following link:<br>
https://www.kaggle.com/datasets/muhammetvarl/laptop-price
<li>The data was uploaded 4 years ago.
<li>This dataset provides a comprehensive collection of information on various laptops, enabling a detailed analysis of their specifications and pricing.
<li>It covers a wide range of brands, models, and configurations, making it a valuable resource for researchers, data analysts, and machine learning enthusiasts interested in the laptop industry.
</ul>

# 3&nbsp;Business understanding
<ul>
<li>The task is to analyze the selling price of laptops depending on their characteristics.
<li>The aim is to develop a linear regression equation with which the individual prices can be estimated as well as possible.
<li>This equation can then be used to estimate which of these laptops are rather expensive and which are rather cheap.
<li>In addition, the expected price of a laptop with certain features can be estimated.
</ul>

<b>Question:</b><br>
<ul>
<li>Is this a supervised or unsupervised learning task?
<li>Is it classification or regression?
</ul>

<b>Selecting a performance measure</b><br>
<ul>
<li>Can you remember the target function of the linear regression?
<li>Is this target function also useful here?
</ul>

Each line in the file <b>laptop-price-DAS.csv</b> represents the features and selling price of a laptop.<br>
It contains 1275 laptops and the following 15 columns:<br>

\begin{array}{|l|l|l|} \hline
\textbf{Company} & String & \text{Laptop Manufacturer} \\ \hline
\textbf{Product} & String & \text{Brand and Model} \\ \hline
\textbf{TypeName} & String & \text{Type (Notebook, Ultrabook, Gaming, etc.)} \\ \hline
\textbf{Inches} & Numeric & \text{Screen Size} \\ \hline
\textbf{ScreenResolution} & String & \text{Screen Resolution} \\ \hline
\textbf{CPU_Company} & String & \text{CPU Manufacturer} \\ \hline
\textbf{CPU_Type} & String & \text{CPU Type} \\ \hline
\textbf{CPU_Frequency} & Numeric & \text{CPU Frequency in GHz} \\ \hline
\textbf{RAM} & String & \text{RAM in GB} \\ \hline
\textbf{Memory} & String & \text{Hard Disk / SSD Memory} \\ \hline
\textbf{GPU_Company} & String & \text{GPU Manufacturer} \\ \hline
\textbf{GPU_Type} & String & \text{GPU Type} \\ \hline
\textbf{OpSys} & String & \text{Operating System} \\ \hline
\textbf{Weight} & Numeric & \text{Weight in kg} \\ \hline
\textbf{Price} & Numeric & \text{Price in Euro} \\ \hline
\end{array}
<br>The columns are separated by a comma. A point is used as the decimal separator.

# 4&nbsp;Import required libraries

<b>Brief description of the used libraries</b><br>
<ul>
<li><b>numpy (numerical python):</b><br>
supports <b>efficient numerical operations</b> on large quantities of data.
<li><b>pandas (derived from 'panel data'):</b><br>
is a very popular library for working with data.<br>
DataFrames are at the center of pandas.<br>
It is based on numpy.
<li><b>matplotlib:</b><br>
is a library for creating static, animated, and interactive <b>visualizations</b>.
<li><b>seaborn:</b><br>
is a data <b>visualization</b> library based on matplotlib.<br>
It provides a high-level interface for drawing attractive and informative statistical graphics.
<li><b>sklearn (scikit-learn)</b><br>
Built on NumPy, SciPy, and matplotlib<br>
Simple and efficient tools for predictive data analysis
<li><b>statsmodels</b><br>
provides classes and functions for the estimation of many different statistical models,<br>
as well as for conducting statistical tests, and statistical data exploration.
</ul>

In [None]:
# import Python libraries
import numpy as np  # for numerical operations
import pandas as pd  # for dataframe operations
import matplotlib.pyplot as plt  # for plotting and visualisations
import seaborn as sns  # for plotting and visualisations - sometimes nicer than plt
from sklearn.linear_model import (
    LinearRegression,
)  # sklearn implementation of linear regression
from sklearn.metrics import r2_score  # calculation of R2
from sklearn.metrics import root_mean_squared_error  # calculation of RMSE
from sklearn.model_selection import (
    train_test_split,
)  # split a dataset in training and test
import statsmodels.api as sm  # statsmodel implementation of linear regression
import statsmodels.formula.api as smf  # to use formulas in linear regression ('R-style')
import sys  # access to some variables used or maintained by the interpreter


# 5.&nbsp;Import Data and check import

## 5.1&nbsp;Import Data

**(A) For Colab-Users and loading from your local drive:**

The code will prompt you to select a file.<br>
Click on “Choose Files” then select the file to be imported (e.g., 'laptop-price-DAS.csv').<br>
<b>Wait for the file to be 100% uploaded.</b><br>
You should see the name of the file once Colab has uploaded it.

In [None]:
"""
if "google.colab" in sys.modules:   # checks if google is used
  from google.colab import files
  uploaded = files.upload()
"""

<b>After executing this code cell, it should be commented out.</b><br>
This can be done with ''' at the beginning and at the end of that cell.<br>
Advantage: you can then run the notebook again and again from the beginning.

The following code imports the file uploaded in the previous step into a DataFrame.<br>
Make sure the filename in the code matches the name of the uploaded file - tip: copy the filename with extension (after 'to').<br>
sep: separator between the columns - adjust if necessary<br>
decimal: character used as the decimal point - adjust if necessary
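The effect of the two parameters can be tried out on a small in-memory file. The following is a minimal sketch; the two rows and the variable names `raw`, `demo` and `wrong` are made up for illustration and are not part of the laptop data set.

```python
import io

import pandas as pd

# Hypothetical two-row file: ';' separates columns, ',' is the decimal separator
raw = "Company;Weight;Price\nAcme;1,37;1339,69\nBolt;2,10;575,00\n"

# Correct settings: Weight and Price are parsed as floating point numbers
demo = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")
print(demo.dtypes)

# Default decimal setting ('.'): the numbers are read as text
wrong = pd.read_csv(io.StringIO(raw), sep=";")
print(wrong.dtypes)
```

If a numeric column shows up as text in `df.info()`, a wrong `decimal` setting is a likely cause.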

In [None]:
if "google.colab" in sys.modules:
    file = "laptop-price-DAS.csv"  # change filename

    import io

    df = pd.read_csv(
        io.BytesIO(uploaded[file]), sep=",", decimal="."
    )  # change values for sep and decimal
    # Dataset is now stored in a Pandas Dataframe named df

**(B) For Colab-Users and loading from Google Drive:**

<b>In this case, you have to remove "#" in this section (B)</b>

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

Click on the folder-symbol on the left side (onmouseover: "files" is shown).<br>
Then you can see folders and files under "content/drive".<br>
Go through the folders and look for the file you want to import.<br> Mark the file and click on the "three dots". Click on "copy path".<br>
Paste the copied path into the following code cell, as shown in the comment.

In [None]:
# df = pd.read_csv('/content/drive/MyDrive/DAS/Regression/Housing/housing-california.csv',sep=',', decimal= '.')#

**(C) For those, who do not use Colab:**

<b>In this case, you have to remove one "#" in the following code cell,<br>
and adjust the path.</b>

In [None]:
# df = pd.read_csv(r'C:/data_folder/laptop-price-DAS.csv')   # absolute path
# df = pd.read_csv("../data_folder/laptop-price-DAS.csv")    # relative path

## 5.2&nbsp;Check import



<b>pandas.DataFrame.info</b><br>
This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html

In [None]:
df.info()

<b>pandas.DataFrame.sample</b><br>
Return a random sample of items from an axis of object.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html

In [None]:
df.sample(3)  # show three randomly selected rows with all cols

<b>pandas.DataFrame.shape</b><br>
Return a tuple representing the dimensionality of the DataFrame.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html

In [None]:
df.shape

# 6&nbsp;First Data Preparation for Analysis

## 6.1&nbsp;Remove unwanted columns (optional)

<b>pandas.DataFrame.drop</b><br>
Remove rows or columns by specifying label names and corresponding axis, or by directly specifying index or column names.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html

Columns that are not needed can be removed.

In [None]:
# df = df.drop('name_of_column_to_be_dropped', axis = 1)          # axis = 1:  means columns (axis=0 are rows)

## 6.2&nbsp;Rename column names (optional)

<b>pandas.DataFrame.columns</b><br>
The column labels of the DataFrame.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html

If the column names are very long or include white spaces,<br>
it makes sense to rename them.<br>
Keep the long names at hand for possible later use.

I prefer column names with lower case letters:

In [None]:
df.columns = df.columns.str.lower()
df.columns

In [None]:
oldname = "screenresolution"
newname = "resolution"

df = df.rename(columns={oldname: newname})
df.columns

## 6.3&nbsp;Check and handle Null values

<b>pandas.isnull</b><br>
Detect missing values for an array-like object.<br>
https://pandas.pydata.org/docs/reference/api/pandas.isnull.html

In [None]:
df.isnull()

<b>pandas.DataFrame.sum</b><br>
Return the sum of the values over the requested axis (default: axis=0).<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html

Remark:<br>
<b>In Python you can calculate with boolean values.<br>
"False" corresponds to the value 0, "True" corresponds to the value 1.</b>


In [None]:
df.isnull().sum()
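The remark about calculating with boolean values can be checked on a tiny example. This is only a sketch; the list `flags` and the mini-DataFrame `demo` are made up for illustration.

```python
import pandas as pd

flags = [True, False, True, True]
print(sum(flags))  # → 3, because every True counts as 1

# Hypothetical mini-DataFrame with two missing entries in column 'b'
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [None, 5.0, None]})

# isnull() yields True/False per cell, sum() counts the True values per column
missing_per_column = demo.isnull().sum()
print(missing_per_column)
```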

<b>pandas.DataFrame.dropna</b><br>
Remove missing values.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html

<b>Delete every row, that has 'any' Null value in it:</b>

In [None]:
df = df.dropna(
    how="any", axis=0
)  # Erase every row (axis=0) that has "any" Null value in it

In [None]:
# always check what you have done!
df.isnull().sum()

## 6.4&nbsp;Check and change data types

<b>A change of the data type is mandatory when a categorical variable has been coded as a number!<br>
In this case, you tell the DataFrame that the column is a categorical variable.</b>

In [None]:
df.info()

We can see data types such as 'float', 'integer' and 'object'.<br>
'object' is used for text or mixed text and numeric values.

<b>pandas.DataFrame.astype</b><br>
Cast a pandas object to a specified dtype.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html

<b>Cast one single column to datatype 'category':</b><br>
If several columns are to be cast, copy the code cell and adjust the column name (reproducibility).


In [None]:
"""
col = 'company'                      # change column name
df[col] = df[col].astype('category') # change type, e.g.: 'category', 'int', 'float', 'str'
"""

In [None]:
df.info()

Remark: 'object' and 'category' are very similar.<br>
When the range of possible values is fixed and finite, 'category' has advantages in speed and memory.<br>
It is also possible to order categories - then they become ordinal data.
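Ordering categories could look like the following sketch. The size labels and their order (small < medium < large) are assumptions chosen for illustration; they do not come from the laptop data.

```python
import pandas as pd

# Hypothetical labels; the order small < medium < large is an assumption
sizes = pd.Series(["small", "large", "medium", "small"])
size_type = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes_cat = sizes.astype(size_type)

# With an order defined, min/max and comparisons become meaningful
print(sizes_cat.min(), sizes_cat.max())
print((sizes_cat > "small").tolist())
```

Without `ordered=True`, the comparison `sizes_cat > "small"` would raise an error, because unordered categories have no ranking.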


<b>Task: Cast column 'ram' to datatype 'integer':</b>

## 6.5&nbsp;Remove special characters

In [None]:
# Remove special characters from all text values (numeric columns are not affected).
# The character class covers: space - + / . < > [ ] ( )
df = df.replace(r"[ \-+/.<>\[\]()]", "", regex=True)

## 6.6 Replace Values

In [None]:
"""
oldvalue = 'old'     # change string
newvalue = 'new'     # change string
df = df.replace(oldvalue,newvalue , regex=True)
"""

## 6.7&nbsp;Replace rare values

In [None]:
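rare = 10  # categories with fewer than 'rare' occurrences are merged into "rare"
cols_cat = df.select_dtypes(include=["object", "string", "category"]).columns
for col in cols_cat:
    counts = df[col].value_counts()
    rare_values = counts[counts < rare].index
    # replace whole values in this column only (no regex, other columns untouched)
    df[col] = df[col].where(~df[col].isin(rare_values), "rare")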

In [None]:
col = ["company"]  # change column name
print(df[col].value_counts())

# 7.&nbsp;Data Understanding

In [None]:
df.info()

## 7.1 Univariate - statistical analysis

<b><u>numerical columns</u></b>

<b>pandas.DataFrame.describe</b><br>
Generate descriptive statistics.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

In [None]:
df.describe().round(1)  # select a suitable number of decimal places

<u><b>categorical columns</b></u>

In [None]:
df.describe(include=["object", "string", "category"])  # analysis of categorical columns

<b>pandas.Series.value_counts</b><br>
Return a Series containing counts of unique values.<br>
https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html

In [None]:
col = ["company"]  # change column name
print(df[col].value_counts())

## 7.2 Univariate - visualizations

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html

<b><u>Bar chart</u></b>

In [None]:
"""
col   = 'gpu_company'                       # change column name
kind  = 'bar'                               # 'bar': vertical bar plot, 'barh': horizontal bar plot
title = 'Bar Chart - number of occurrences'  # change title
xlabel = 'Values of column "' + col + '"'   # change xlabel to ylabel, change string
ylabel = 'Number'                           # change ylabel to xlabel, change string
width = 10                                  # change width
height = 5                                  # change height

df[col].value_counts().plot(kind=kind, title= title, xlabel= xlabel, ylabel= ylabel, figsize=(width, height));
"""

<u><b>Histogram of all numerical variables</b></u>

In [None]:
# calculation recommended number of bins in histogram
import math

if df.shape[0] <= 1000:
    bins = round(math.sqrt(df.shape[0]) + 0.5)
else:
    bins = round(10 * math.log10(df.shape[0]) + 0.5)
print("Calculated recommended number of bins: ", bins)

In [None]:
bins = 32  # change number of bins
width = 12  # change width
height = 10  # change height

df.hist(bins=bins, figsize=(width, height));

<b>Question:</b><br>
What is the shape of the histograms? Do they have a bell shape?

<u><b>Creation of boxplots for all numerical variables</b></u>

In [None]:
kind = "box"
title = "Boxplots of all numerical variables"  # change title
width = 30  # change width
height = 5  # change height

df.plot(kind=kind, subplots=True, sharey=False, title=title, figsize=(width, height));

<u><b>Creation of boxplots of one numerical variable</b></u>

In [None]:
col = "inches"  # change column name
kind = "box"
title = "Boxplot"  # change title
width = 4  # change width
height = 4  # change height

df[col].plot(kind=kind, title=title, figsize=(width, height));

## 7.3&nbsp;Bivariate - correlation matrix

<u><b>Show correlation matrix</b></u>

<b>pandas.DataFrame.corr</b><br>
Compute pairwise correlation of columns, excluding NA/null values.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html

In [None]:
df.corr(numeric_only=True).round(2)

<b>Colors facilitate the analysis....</b>

<u><b>Show correlation heat map</b></u>

<b>seaborn.heatmap</b><br>
Plot rectangular data as a color-encoded matrix.<br>
https://seaborn.pydata.org/generated/seaborn.heatmap.html

In [None]:
width = 4  # change width
height = 4  # change height
title = "Correlation heat map"  # change title

plt.figure(figsize=(width, height))
plt.title(title)
a = sns.heatmap(
    df.corr(numeric_only=True),
    square=True,
    annot=True,
    fmt=".2f",
    linecolor="white",
    vmin=-1,
    vmax=1,
    center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
)
a.set_xticklabels(a.get_xticklabels(), rotation=90)
a.set_yticklabels(a.get_yticklabels(), rotation=0);

<b>Task: analyze the correlation matrix</b>

## 7.4&nbsp;Bivariate - visualization

<u><b>Pairplot: scatterplot of all combinations of numerical features</b></u>

<b>seaborn.pairplot</b><br>
Plot pairwise relationships in a dataset.<br>
https://seaborn.pydata.org/generated/seaborn.pairplot.html

In [None]:
sns.pairplot(df);

<b>Problem: overplotting!</b>

<u><b>Scatterplot of two numerical features</b></u>

<b>pandas.DataFrame.plot</b><br>
Make plots of Series or DataFrame.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html

In [None]:
# Show numerical features
df.select_dtypes(include=["float", "int"]).columns

In [None]:
col_x = "weight"  # column for x-axis
col_y = "price"  # column for y-axis
width = 4  # change width
height = 4  # change height
title = "Scatterplot: " + col_y + " vs. " + col_x
alpha = 0.1  # regulate the transparency of a graph plot using the alpha attribute.

df.plot.scatter(title=title, x=col_x, y=col_y, alpha=alpha, figsize=(width, height));

<b>Question:</b><br>
What is the relationship between price and RAM?

<u><b>Creation of "side-by-side" boxplots</b></u>

In [None]:
print("Numerical features:")
print(df.select_dtypes(include=["float", "int"]).columns)
print("Categorical features:")
print(df.select_dtypes(include=["object", "string", "category"]).columns)

In [None]:
target = "price"  # change column name
width = 8  # change width
height = 4  # change height
rotation = 90  # change orientation degree x-label

col_categorial = df.select_dtypes(
    include=["object", "string", "category"]
).columns  # column names of the categorical features

for col in col_categorial:
    plt.figure(figsize=(width, height))
    title = "Boxplots: " + target + " vs. " + col
    ax = sns.boxplot(x=df[col], y=df[target])
    plt.setp(ax.get_xticklabels(), rotation=rotation)
    plt.title(title, fontsize=16)
    plt.xlabel("")

## 7.5 Trivariate - visualisation

<u><b>Scatterplot of selected two features with color-information of third feature</b></u>

In [None]:
"""
col_x     = 'ram'         # change column name (for x axis)
col_y     = 'weight'      # change column name (for y axis)
var_color = 'price'       # change column name (different colors)
alpha     =  0.2          # change parameter to see density of data points
width  = 5                # change width
height = 3                # change height

df.plot(kind="scatter", x=col_x, y=col_y, alpha=alpha,
        c = var_color,  cmap=plt.get_cmap("jet"), colorbar=True, figsize=(width, height));
"""

# 8&nbsp;Data Preparation

## 8.1&nbsp;Feature Engineering

In [None]:
df.info()

<b>The task is to generate features, that (probably) have a high correlation with the target.</b>

In [None]:
df["weight2"] = df.weight**2  # change column names and formula

In [None]:
df.sample(2)

Do you have any additional suggestions for feature engineering?

<b>The correlations of all features with the target are shown.</b>

In [None]:
corr_matrix = df.corr(numeric_only=True)
corr_matrix[target].sort_values(ascending=False)

## 8.2&nbsp;Scaling

In linear regression, scaling is not mandatory.<br>
Scaling may help to make the parameters easier to interpret.
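Why scaling is not mandatory can be illustrated with a small simulation. The sketch below uses made-up data and plain numpy least squares instead of statsmodels; the helper `fit_line` is hypothetical. Dividing the target by 1000 divides the coefficients by 1000, while the predictions, converted back, stay the same.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 3.0, size=50)  # hypothetical feature, e.g. a weight in kg
y = 500.0 + 400.0 * x + rng.normal(0, 50, 50)  # hypothetical target, e.g. a price in Euro


def fit_line(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns the array (b0, b1)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b0, b1 = fit_line(x, y)
b0_s, b1_s = fit_line(x, y / 1000)  # same target, now in thousands of Euro

pred = b0 + b1 * x
pred_s = b0_s + b1_s * x
print(b1, b1_s * 1000)  # the slope only changes its unit
```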

<b>Show min, median and max of every feature</b>.

In [None]:
df.describe().loc[["min", "50%", "max"], :]

In [None]:
"""
# example for scaling
df.price = df.price / 1000          # change column name and scaling number
"""

## 8.3&nbsp;Delete data records

In [None]:
"""
col    = 'column'                        # change column
df = df[ df[col] <= 500 ]    # keep only records that fulfill this condition
"""

# 9.&nbsp;Split Data (Train + Test)

<b>sklearn.model_selection.train_test_split</b><br>
Split arrays or matrices into random train and test subsets.<br>
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
seed = 12345  # change seed, to get another split
df_train, df_test = train_test_split(
    df, test_size=0.2, random_state=seed
)  # change proportion of the dataset to include in the test

In [None]:
print("Shape from df_train:", df_train.shape)
print("Shape from df_test:", df_test.shape)

# 10.&nbsp;Create Linear models

## 10.1&nbsp;Fit linear models

<b>statsmodels.regression.linear_model.OLS.fit</b><br>
Full fit of the model.<br>
https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.fit.html

<ul>
<li>To create a linear model, you must define the target and the features you'd like to use.
<li>Text columns (dtype 'object' or 'category') are treated as categorical automatically. Categorical variables that are coded as numbers must be put in C() - or you encode them beforehand.
<li><b>Syntax:</b><br>
    name_of_model = smf.ols(formula = "target ~ feature1 + feature2 + ...", data=df).fit()
</ul>
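Encoding a categorical column before fitting, the alternative to C(), could be sketched with `pd.get_dummies`. The mini-dataset below is made up for illustration; only the column names resemble the laptop data.

```python
import pandas as pd

# Hypothetical mini-dataset; the values are illustrative only
demo = pd.DataFrame(
    {
        "price": [900.0, 1500.0, 700.0, 1200.0],
        "ram": [8, 16, 4, 8],
        "cpu_company": ["Intel", "Intel", "AMD", "Samsung"],
    }
)

# One 0/1 column per category; drop_first=True omits the first category ('AMD'),
# which then serves as the reference level
encoded = pd.get_dummies(demo, columns=["cpu_company"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

The coefficient of each 0/1 column is then read as the price difference relative to the omitted reference category.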

In [None]:
target = "price"  # change column name - which variable is the target?

In [None]:
df_train.columns

<b>It makes sense to primarily use the features with the highest absolute correlation to the target.</b>

In [None]:
corr_matrix = df_train.corr(numeric_only=True)
ct = corr_matrix[target].abs().sort_values(ascending=False)  # ct: correlation target
ct.round(2)

<b>We start using features with the highest absolute correlations to the target.</b>

In [None]:
# del lm05
formula05 = (
    target
    + "~ ram + cpu_frequency + weight2 + weight + inches + C(cpu_company) + gpu_company + typename + opsys + company"
)
lm05 = smf.ols(formula=formula05, data=df_train).fit()

In [None]:
formula01 = target + "~ ram + cpu_frequency + weight2 + weight + inches"
formula02 = target + "~ ram + cpu_frequency + weight2 + weight + inches + cpu_company"
formula03 = (
    target
    + "~ ram + cpu_frequency + weight2 + weight + inches + cpu_company + gpu_company"
)
formula04 = (
    target
    + "~ ram + cpu_frequency + weight2 + weight + inches + cpu_company + gpu_company + typename"
)
formula05 = (
    target
    + "~ ram + cpu_frequency + weight2 + weight + inches + cpu_company + gpu_company + typename + opsys"
)
formula06 = (
    target
    + "~ ram + cpu_frequency + weight2 + weight + inches + cpu_company + gpu_company + typename + opsys + company"
)
formula07 = (
    target
    + "~ ram + cpu_frequency + weight2 + weight + inches + cpu_company + gpu_company + typename + opsys + company + resolution"
)
formula08 = (
    target
    + "~ ram + cpu_frequency + weight2 + weight + inches + cpu_company + gpu_company + typename + opsys + company + resolution + memory"
)
formula09 = (
    target
    + "~ ram + cpu_frequency + weight2 + weight + inches + cpu_company + gpu_company + typename + opsys + company + resolution + memory + cpu_type + gpu_type"
)  # full model: adds both cpu_type and gpu_type

lm01 = smf.ols(formula=formula01, data=df_train).fit()
lm02 = smf.ols(formula=formula02, data=df_train).fit()
lm03 = smf.ols(formula=formula03, data=df_train).fit()
lm04 = smf.ols(formula=formula04, data=df_train).fit()
lm05 = smf.ols(formula=formula05, data=df_train).fit()
lm06 = smf.ols(formula=formula06, data=df_train).fit()
lm07 = smf.ols(formula=formula07, data=df_train).fit()
lm08 = smf.ols(formula=formula08, data=df_train).fit()
lm09 = smf.ols(formula=formula09, data=df_train).fit()

models = [
    lm01,
    lm02,
    lm03,
    lm04,
    lm05,
    lm06,
    lm07,
    lm08,
    lm09,
]  # we collect the nine linear models in the variable 'models'
model_name = [
    "lm01",
    "lm02",
    "lm03",
    "lm04",
    "lm05",
    "lm06",
    "lm07",
    "lm08",
    "lm09",
]  # we store the names of the model in 'model_name'
formulas = [
    formula01,
    formula02,
    formula03,
    formula04,
    formula05,
    formula06,
    formula07,
    formula08,
    formula09,
]

In [None]:
print(lm05.summary2())

## 10.2&nbsp;Visual check

In [None]:
def plot_pred_target(df, target, model, modelname, fit=1):
    var_x = df[target]  # insert column name for x axis here!
    alpha = 0.1  # parameter to see density of data points
    if fit:
        plt.title("Scatterplot: Fitted vs. Target, model: " + modelname, fontsize=16)
        plt.ylabel("fitted " + target, fontsize=15)
        var_y = model.fittedvalues  # insert column name for y axis here!
    else:
        plt.title("Scatterplot: Predicted vs. Target, model: " + modelname, fontsize=16)
        plt.ylabel("predicted " + target, fontsize=15)
        var_y = model.predict(df)  # insert column name for y axis here!
    maxvalue = max(max(var_x), max(var_y)) * 1.05
    minvalue = min(-5, min(var_x), min(var_y)) * 1.05
    plt.scatter(var_x, var_y, alpha=alpha)
    plt.grid()
    plt.xlim([minvalue, maxvalue])
    plt.ylim([minvalue, maxvalue])
    plt.xlabel(target, fontsize=15)
    plt.plot([minvalue, maxvalue], [minvalue, maxvalue], color="red")
    plt.axvline(0, color="black")  # vertical
    plt.axhline(0, color="black")  # horizontal
    plt.show()

In [None]:
for i in range(len(models)):
    plot_pred_target(df_train, target, models[i], model_name[i], fit=1)

<ul>
<li>Various pieces of information are stored in our linear models (RegressionResultsWrappers).
<li>The "fittedvalues" are the predicted values based on the training data.
</ul>

In [None]:
lm = lm04
print("Parameters:")
print(lm.params)

## 10.3&nbsp;R2, adj. R2 and RMSE (training)

In [None]:
print("Training Data")
print("Name", " R2", "    adj R2")
for name, model in zip(model_name, models):
    print(
        name,
        "",
        "{:.2f}".format(model.rsquared),
        " ",
        "{:.2f}".format(model.rsquared_adj),
    )

Adjusted R2 is a key figure for comparing linear models with different numbers of features.<br>
Adjusted R2 is less than or equal to R2 and penalizes every additional feature.<br>
R2 (training) never decreases when additional features are added; adjusted R2 can decrease.
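The penalty can be made concrete with the usual formula for a model with intercept, adj. R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of features. The helper `adjusted_r2` and the numbers below are made up for illustration.

```python
def adjusted_r2(r2: float, n_obs: int, n_features: int) -> float:
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), model with intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)


# Hypothetical numbers: the same R2 reached with 5 and with 200 features
few = adjusted_r2(0.80, n_obs=1000, n_features=5)
many = adjusted_r2(0.80, n_obs=1000, n_features=200)
print(round(few, 3), round(many, 3))
```

With the same R2, the model that needs more features receives the lower adjusted R2.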

<b>Task: Analyze the results</b>

<u><b>Print summary of the full model</b></u>

In [None]:
lm = lm09
print(lm.summary2())

## 10.4&nbsp;Summary table - upper area


\begin{array}{|l|l|} \hline
\text{Model:} & \text{OLS stands for Ordinary Least Squares.} \\
 & \text{The model tries to find the linear function that minimizes the sum of squared residuals} \\ \hline
\text{Dep Variable:} & \text{Name of the dependent variable} \\ \hline
\text{Date:} & \text{Timestamp of creation of this OLSResults-Object} \\ \hline
\text{No Obs.:} & \text{Number of observations} \\ \hline
\text{Df Model:} & \text{Degrees of freedom of the model (# of independent features)} \\ \hline
\text{Df Residuals:} & \text{Degrees of freedom of the residuals (# Observations - Df Model - 1)} \\ \hline
\text{R-squared:} & R^2: \text{coefficient of determination} \\ \hline
\textbf{Adj. R-squared:} & R^2 \text{ is adjusted to the number of features, goal: to maximize} \\  \hline
\text{AIC:} & \text{Akaike information criterion, goal: to minimize} \\ \hline
\text{BIC:} & \text{Bayesian information criterion, goal: to minimize} \\ \hline
\text{Log-Likelihood:} & \text{The higher the value, the better the model fits the given data.} \\ \hline
\text{F-statistic:} & \text{F-test: check if all features together are related to the dependent variable.} \\ \hline
\textbf{Prob (F-stat):} & \text{<0.05:   at least one feature is significantly related with the output} \\
& \text{>=0.05: no evidence of relationship between any of the features with the output}
 \\ \hline
\text{Scale:} & \text{The Default value is ssr/(n-p)} \\
              & \text{ssr = Sum of squared (whitened) residuals.} \\
              & \text{n-p = (# observations - # parameters)} \\ \hline
\end{array}


https://towardsdatascience.com/simple-explanation-of-statsmodel-linear-regression-model-summary-35961919868b
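How AIC and BIC trade goodness of fit against model size can be sketched with their textbook definitions, AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters (including the intercept) and ln(L) the log-likelihood. The log-likelihood values below are made up for illustration.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical: 10 extra parameters improve the log-likelihood only slightly
small = (aic(-7000.0, 6), bic(-7000.0, 6, 1000))
large = (aic(-6998.0, 16), bic(-6998.0, 16, 1000))
print(small, large)
```

In this made-up case both criteria prefer the smaller model, because the gain in log-likelihood does not outweigh the penalty for the additional parameters.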

## 10.5&nbsp;Summary table - middle area

In [None]:
print(lm.summary2())

**The following values are given for the intercept and every feature:**

\begin{array}{|l|l|} \hline
\textbf{Coef.} & \text{The estimated coefficient value} \\ \hline
\text{Std.Err.} & \text{The standard error of the coefficient measures how precisely the coefficient's value is estimated} \\ \hline
\text{t} & \text{t-value = Coef. / Std.Err.} \\ \hline
\textbf{P>|t|} & \text{p-value of this t-statistic} \\
   & \text{A low p-value (< 0.05) means that the coefficient is likely not equal to zero} \\
   & \text{A high p-value (>= 0.05) means that we cannot conclude that the feature affects the outcome} \\ \hline
\text{[0.025 ; 0.975]} & \text{95% confidence interval; the coefficient lies exactly in the middle of this interval} \\
& \text{Set of coefficient values that a hypothesis test at the 5% level would not reject} \\ \hline
\end{array}
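The relation between the columns of this table can be reproduced by hand. The sketch below uses a made-up coefficient and standard error, and it replaces the t-distribution by the standard normal distribution, which is a large-sample approximation and therefore only close to the values statsmodels reports.

```python
from statistics import NormalDist

coef, std_err = 85.0, 20.0  # hypothetical coefficient and standard error

t_value = coef / std_err

# Large-sample approximation: standard normal instead of the t-distribution
p_value = 2 * (1 - NormalDist().cdf(abs(t_value)))
z = NormalDist().inv_cdf(0.975)  # about 1.96
ci_low, ci_high = coef - z * std_err, coef + z * std_err

print(round(t_value, 2), p_value)
print(round(ci_low, 1), round(ci_high, 1))
```

The interval does not contain zero, which matches the small p-value: both say that a coefficient of zero is not compatible with the data at the 5% level.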

## 10.6&nbsp;Summary table - lower area

In [None]:
print(lm.summary2())

**If the data is good for modeling, then our residuals will have certain characteristics. This is checked here.**

\begin{array}{|l|l|} \hline
\text{Omnibus:} & \text{A test of the skewness and kurtosis of the residuals.} \\
& \text{We hope to see a value close to zero, which would indicate normality.} \\ \hline
\text{Prob(Omnibus):} & \text{p-value of the Omnibus test (null hypothesis: the residuals are normally distributed).} \\
&  \text{We hope to see something close to 1 here.} \\
& \text{A small value (< 0.05) speaks against normally distributed residuals.} \\ \hline
\text{Skew} & \text{A measure of data symmetry.} \\
& \text{We want to see something close to zero, indicating that the residual distribution is symmetric.} \\
& \text{Note that this value also drives the Omnibus.} \\ \hline
\text{Kurtosis:} & \text{A measure of "peakiness", or curvature of the data.} \\
& \text{Higher peaks lead to greater kurtosis; a normal distribution has a kurtosis of 3.} \\
& \text{Greater kurtosis can be interpreted as a tighter clustering of residuals around zero.} \\ \hline
\text{Durbin-Watson:} & \text{Test for autocorrelation of the residuals. Values range from 0 to 4;} \\
& \text{a value close to 2 indicates no autocorrelation.} \\ \hline
\text{Jarque-Bera (JB):} & \text{Test like the Omnibus test, in that it tests both skew and kurtosis.} \\ \hline
\text{Prob(JB)} & \text{We hope to see in this test a confirmation of the Omnibus test.} \\ & \text{We hope to see something close to 1 here.} \\ \hline
\text{Condition Number} & \text{Measures how sensitive a function's output is to small changes in its input.} \\
& \text{With multicollinearity, we can expect much larger fluctuations in response to small changes in the data.} \\
& \text{Hence, we hope to see a relatively small number, something below 30.} \\ \hline
\end{array}

https://www.accelebrate.com/blog/interpreting-results-from-linear-regression-is-the-data-appropriate
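Several of these figures can be computed directly from the residuals. The sketch below uses simulated, well-behaved residuals instead of those of a fitted model, and the textbook formulas for skewness, kurtosis, Jarque-Bera and Durbin-Watson; statsmodels may differ in small details.

```python
import numpy as np

rng = np.random.default_rng(42)
resid = rng.normal(0.0, 1.0, size=500)  # hypothetical, well-behaved residuals

n = resid.size
centered = resid - resid.mean()
m2 = np.mean(centered**2)  # second central moment (variance)

skew = np.mean(centered**3) / m2**1.5  # about 0 for symmetric data
kurtosis = np.mean(centered**4) / m2**2  # about 3 for a normal distribution
jarque_bera = n / 6 * (skew**2 + (kurtosis - 3) ** 2 / 4)

# Sum of squared differences of neighbouring residuals over the sum of squares
durbin_watson = np.sum(np.diff(resid) ** 2) / np.sum(resid**2)

print(skew, kurtosis, jarque_bera, durbin_watson)
```

For independent, normally distributed residuals the skewness is near 0, the kurtosis near 3 and the Durbin-Watson statistic near 2.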

# 11&nbsp;Check performance with test data

## 11.1&nbsp;Extra categories in test set

In [None]:
df_test.shape

In [None]:
cols_cat = df.select_dtypes(include=["object", "string", "category"]).columns
for col in cols_cat:
    df_test = df_test[df_test[col].isin(df_train[col].unique())]

In [None]:
df_test.shape

## 11.2&nbsp;R2 and RMSE (test)

The model with the <b>maximum $R^2$</b> or, respectively, the <b>minimal RMSE</b> is to be selected.
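Both measures can be written down in a few lines. The sketch below uses made-up prices and the hypothetical helpers `r2` and `rmse`. It also shows that $R^2$ depends on the order of its arguments, which is why `r2_score` from scikit-learn expects the true values first: `r2_score(y_true, y_pred)`.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(y_true, y_pred):
    """Root mean squared error, in the unit of the target."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical true and predicted prices
y_true = [500.0, 1000.0, 1500.0, 2000.0]
y_pred = [700.0, 1000.0, 1400.0, 1700.0]

print(r2(y_true, y_pred), r2(y_pred, y_true))  # the two values differ
print(rmse(y_true, y_pred))
```

RMSE is symmetric in its arguments, $R^2$ is not, because the total sum of squares is computed from the first argument only.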

In [None]:
y_test = df_test[target]

print("Test Data")
print("Name", " R2", "   RMSE")
for name, model in zip(model_name, models):
    y_pred = model.predict(df_test)
    print(
        name,
        "",
        "{:.2f}".format(r2_score(y_test, y_pred)),  # r2_score expects (y_true, y_pred)
        "",
        "{:.0f}".format(root_mean_squared_error(y_test, y_pred)),
    )

<b>Question:</b><br>
Which model performs best?<br>




In [None]:
lm = lm09
lm_name = "lm09"
y_pred = lm.predict(df_test)

plot_pred_target(df_test, target, lm, lm_name, fit=0)
print("R2: ", round(r2_score(y_test, y_pred), 2))  # order: (y_true, y_pred)
print("Mean Residual: ", round(np.mean(y_test - y_pred), 2))
print("RMSE: ", round(root_mean_squared_error(y_test, y_pred), 1))

In [None]:
lm = lm08
lm_name = "lm08"
y_pred = lm.predict(df_test)
plot_pred_target(df_test, target, lm, lm_name, fit=0)
print("R2: ", round(r2_score(y_test, y_pred), 2))  # order: (y_true, y_pred)
print("Mean Residual: ", round(np.mean(y_test - y_pred), 2))
print("RMSE: ", round(root_mean_squared_error(y_test, y_pred), 1))

## 11.3&nbsp;Residual plot (test)

In [None]:
y_pred = lm09.predict(df_test)
plt.scatter(y_pred, y_test - y_pred)
plt.plot([min(y_pred), max(y_pred)], [0, 0], color="red")
plt.title("Residual Plot")
plt.xlabel("Predictions", fontsize=15)
plt.ylabel("Residuals", fontsize=15);

<b>Task: try to produce overfitting!</b>