# Linear Regression

Linear regression is one of the most flexible and widely used statistical methods in research and business practice. It is used to analyze the relationship between one dependent variable and one or more independent variables.

Linear regression is used for

1. Inference, i.e. testing a previously formulated hypothesis about the relationship between the variables of interest
2. Prediction, i.e. estimating the value of a dependent variable from the values of the independent variables

The primary use case for linear regression analysis is the analysis of causal relationships.

This relationship can be expressed as

$y = f(x)$
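
For the simple linear case, $f$ is a straight line. As a sketch, the model and its least-squares estimates can be written as follows. The symbols $\beta_0$, $\beta_1$ and $\varepsilon_i$ are notation introduced here for illustration; $b_0$ and $b_1$ correspond to the variables `b0` and `b1` computed in the notebook's manual estimation cell.

```latex
\begin{aligned}
y_i &= \beta_0 + \beta_1 x_i + \varepsilon_i \\
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
b_0 &= \bar{y} - b_1 \bar{x}
\end{aligned}
```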

In [44]:
import pandas as pd

link_to_csv = "https://www.statlearning.com/s/Advertising.csv"
# Load only the three advertising channels and the sales column
df = pd.read_csv(link_to_csv, usecols=["TV", "radio", "newspaper", "sales"])
df.head()

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


In [45]:
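import numpy as np

# Predictor and response with their sample means
x = df["TV"]
xbar = np.mean(x)
y = df["sales"]
ybar = np.mean(y)

# Closed-form least-squares estimates for slope (b1) and intercept (b0)
b1 = sum((x - xbar) * (y - ybar)) / sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

print("b0:", b0, "and b1:", b1)

b0: 7.032593549127704 and b1: 0.04753664043301969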
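
As a sanity check on the closed-form formulas, the following minimal sketch applies them to synthetic data and compares the result with `np.polyfit`. The data-generating values (intercept 7, slope 0.05, noise level 3) are assumptions chosen for illustration and are not taken from the Advertising data.

```python
import numpy as np

# Synthetic data (assumption): y = 7 + 0.05 * x plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)

# Closed-form OLS estimates for one predictor
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

# Reference: a degree-1 polynomial fit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

print("closed form:", b0, b1)
print("np.polyfit: ", intercept, slope)
```

Both approaches minimize the same sum of squared residuals, so the two printed lines should agree up to floating-point error.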


In [46]:
import statsmodels.formula.api as smf

# Define the model and pass the data
model = smf.ols("sales ~ TV", data=df)

# Estimate the model ("fitting")
model = model.fit()

paras = model.params  # model parameters

print(paras)

Intercept    7.032594
TV           0.047537
dtype: float64


In [47]:
import statsmodels.formula.api as smf

# Define the model and pass the data
model = smf.ols("sales ~ TV + newspaper + radio", data=df)

# Estimate the model ("fitting")
model = model.fit()

paras = model.params  # model parameters

print(paras)

Intercept    2.938889
TV           0.045765
newspaper   -0.001037
radio        0.188530
dtype: float64
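
With several predictors, the same least-squares idea can be written in matrix form. The sketch below is one possible illustration using `np.linalg.lstsq` on synthetic data; the variable names mirror the notebook only for readability, and the coefficients used to generate the data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic predictors and response (assumption, not the Advertising data)
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)
sales = 3.0 + 0.045 * tv + 0.19 * radio + 0.0 * newspaper + rng.normal(0, 1.5, n)

# Design matrix: a column of ones for the intercept, then one column per predictor
X = np.column_stack([np.ones(n), tv, radio, newspaper])

# Least-squares solution of X @ beta ~ sales
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)

for name, value in zip(["Intercept", "TV", "radio", "newspaper"], beta):
    print(f"{name:<10} {value: .6f}")
```

The formula string `"sales ~ TV + newspaper + radio"` asks statsmodels to build an equivalent design matrix, including the intercept column, from the data frame.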


In [48]:
model = smf.ols("sales ~ TV", data=df).fit()
model.summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.61
Dependent Variable:,sales,AIC:,1042.0913
Date:,2023-11-22 12:36,BIC:,1048.688
No. Observations:,200,Log-Likelihood:,-519.05
Df Model:,1,F-statistic:,312.1
Df Residuals:,198,Prob (F-statistic):,1.47e-42
R-squared:,0.612,Scale:,10.619

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,7.0326,0.4578,15.3603,0.0000,6.1297,7.9355
TV,0.0475,0.0027,17.6676,0.0000,0.0422,0.0528

0,1,2,3
Omnibus:,0.531,Durbin-Watson:,1.935
Prob(Omnibus):,0.767,Jarque-Bera (JB):,0.669
Skew:,-0.089,Prob(JB):,0.716
Kurtosis:,2.779,Condition No.:,338.0
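
In the coefficient table, `t` is the coefficient divided by its standard error; for TV, 0.0475 / 0.0027 gives roughly 17.6, and the small gap to the reported 17.6676 comes from the rounded inputs. The sketch below shows one way these quantities can be computed by hand. It uses synthetic data, so the data-generating values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Synthetic data (assumption): one predictor, linear signal plus noise
x = rng.uniform(0, 300, n)
y = 7.0 + 0.05 * x + rng.normal(0, 3, n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta
df_resid = n - X.shape[1]                   # observations minus estimated parameters
sigma2 = residuals @ residuals / df_resid   # residual variance ("Scale" in summary2)

# Covariance matrix of the estimates under the classical OLS assumptions
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(cov_beta))
t_values = beta / std_err

print("coef:   ", beta)
print("std err:", std_err)
print("t:      ", t_values)
```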


In [49]:
model = smf.ols("sales ~ radio", data=df).fit()
model.summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.329
Dependent Variable:,sales,AIC:,1150.6738
Date:,2023-11-22 12:36,BIC:,1157.2704
No. Observations:,200,Log-Likelihood:,-573.34
Df Model:,1,F-statistic:,98.42
Df Residuals:,198,Prob (F-statistic):,4.35e-19
R-squared:,0.332,Scale:,18.275

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,9.3116,0.5629,16.5422,0.0000,8.2016,10.4217
radio,0.2025,0.0204,9.9208,0.0000,0.1622,0.2427

0,1,2,3
Omnibus:,19.358,Durbin-Watson:,1.946
Prob(Omnibus):,0.0,Jarque-Bera (JB):,21.91
Skew:,-0.764,Prob(JB):,0.0
Kurtosis:,3.544,Condition No.:,51.0


In [50]:
model = smf.ols("sales ~ TV+radio", data=df).fit()
model.summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.896
Dependent Variable:,sales,AIC:,778.3941
Date:,2023-11-22 12:36,BIC:,788.2891
No. Observations:,200,Log-Likelihood:,-386.2
Df Model:,2,F-statistic:,859.6
Df Residuals:,197,Prob (F-statistic):,4.83e-98
R-squared:,0.897,Scale:,2.827

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,2.9211,0.2945,9.9192,0.0000,2.3403,3.5019
TV,0.0458,0.0014,32.9087,0.0000,0.0430,0.0485
radio,0.1880,0.0080,23.3824,0.0000,0.1721,0.2038

0,1,2,3
Omnibus:,60.022,Durbin-Watson:,2.081
Prob(Omnibus):,0.0,Jarque-Bera (JB):,148.679
Skew:,-1.323,Prob(JB):,0.0
Kurtosis:,6.292,Condition No.:,425.0
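
The three summaries above report both R-squared and adjusted R-squared. The adjusted value penalizes additional predictors, which matters when comparing the one-predictor models with `sales ~ TV+radio`. A minimal sketch of the standard formula follows; the helper name `adjusted_r2` is hypothetical, and because it is fed the rounded R-squared values from the summaries, the results are approximate.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for a model that includes an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Rounded R-squared values as reported in the summaries (n = 200 observations)
reported = [
    ("sales ~ TV", 0.612, 1),
    ("sales ~ radio", 0.332, 1),
    ("sales ~ TV+radio", 0.897, 2),
]

for label, r2, p in reported:
    print(f"{label:<18} adj. R2 = {adjusted_r2(r2, 200, p):.3f}")
```

Rounded to three decimals, this reproduces the reported adjusted values of 0.61, 0.329 and 0.896.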


In [51]:
link = "https://raw.githubusercontent.com/fredzett/BA-C/master/Slides/_data/Auto.csv"
df = pd.read_csv(link)
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino


In [52]:
model = smf.ols("mpg ~ horsepower + weight + year", data=df).fit()
model.summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.807
Dependent Variable:,mpg,AIC:,2082.8292
Date:,2023-11-22 12:36,BIC:,2098.7142
No. Observations:,392,Log-Likelihood:,-1037.4
Df Model:,3,F-statistic:,545.4
Df Residuals:,388,Prob (F-statistic):,9.37e-139
R-squared:,0.808,Scale:,11.767

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,-13.7194,4.1818,-3.2808,0.0011,-21.9411,-5.4976
horsepower,-0.0050,0.0094,-0.5297,0.5966,-0.0236,0.0136
weight,-0.0064,0.0004,-15.7675,0.0000,-0.0073,-0.0056
year,0.7487,0.0521,14.3650,0.0000,0.6462,0.8512

0,1,2,3
Omnibus:,41.952,Durbin-Watson:,1.227
Prob(Omnibus):,0.0,Jarque-Bera (JB):,69.49
Skew:,0.671,Prob(JB):,0.0
Kurtosis:,4.566,Condition No.:,74800.0


In [53]:
# Helper that returns the z-scores of x (np.std defaults to the population standard deviation, ddof=0)
def z_score(x):
    return (x - np.mean(x)) / np.std(x)


In [54]:
# Import library
import scipy.stats as stats

# Standardize variables
cols = ["horsepower", "weight", "year"]
df[["horsepower_z", "weight_z", "year_z"]] = df[cols].apply(stats.zscore)
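
One detail worth knowing when standardizing: `np.std` and `scipy.stats.zscore` default to `ddof=0` (population standard deviation), whereas pandas' `Series.std()` defaults to `ddof=1`. The sketch below illustrates the difference with a NumPy-only helper; the `ddof` parameter on `z_score` is an addition made here for illustration.

```python
import numpy as np


def z_score(x, ddof=0):
    """Standardize x; ddof=0 uses the population standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=ddof)


# First five horsepower values from the Auto preview shown earlier
values = np.array([130.0, 165.0, 150.0, 150.0, 140.0])

z_pop = z_score(values)              # ddof=0, matches np.std's default
z_sample = z_score(values, ddof=1)   # ddof=1, matches pandas' Series.std() default

print("ddof=0:", z_pop)
print("ddof=1:", z_sample)
```

The two versions differ only by the constant factor sqrt(n / (n - 1)), so the choice rescales the standardized coefficients slightly but does not change the fit.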

In [55]:
model = smf.ols("mpg ~ horsepower_z + weight_z + year_z", data=df).fit()
model.summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.807
Dependent Variable:,mpg,AIC:,2082.8292
Date:,2023-11-22 12:36,BIC:,2098.7142
No. Observations:,392,Log-Likelihood:,-1037.4
Df Model:,3,F-statistic:,545.4
Df Residuals:,388,Prob (F-statistic):,9.37e-139
R-squared:,0.808,Scale:,11.767

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,23.4459,0.1733,135.3240,0.0000,23.1053,23.7866
horsepower_z,-0.1922,0.3629,-0.5297,0.5966,-0.9056,0.5212
weight_z,-5.4697,0.3469,-15.7675,0.0000,-6.1518,-4.7877
year_z,2.7545,0.1918,14.3650,0.0000,2.3775,3.1315

0,1,2,3
Omnibus:,41.952,Durbin-Watson:,1.227
Prob(Omnibus):,0.0,Jarque-Bera (JB):,69.49
Skew:,0.671,Prob(JB):,0.0
Kurtosis:,4.566,Condition No.:,4.0
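
Comparing the raw and the standardized model: R-squared, t-values and p-values are identical, only the coefficients are rescaled, and the intercept becomes the predicted `mpg` at average predictor values. The sketch below illustrates why, using synthetic stand-ins for `weight`, `year` and `mpg`; all data-generating values and the helper names `ols` and `zscore` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 392

# Synthetic stand-ins (assumption) on very different scales
weight = rng.normal(3000, 850, n)
year = rng.integers(70, 83, n).astype(float)
mpg = -14.0 - 0.006 * weight + 0.75 * year + rng.normal(0, 3.4, n)


def ols(columns, y):
    """Return OLS coefficients [intercept, slopes...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def zscore(x):
    """Standardize with the population standard deviation (ddof=0)."""
    return (x - x.mean()) / x.std()


raw = ols([weight, year], mpg)
standardized = ols([zscore(weight), zscore(year)], mpg)

print("raw slopes * std(x):", raw[1] * weight.std(), raw[2] * year.std())
print("standardized slopes:", standardized[1], standardized[2])
print("intercept vs mean(y):", standardized[0], mpg.mean())
```

Each standardized slope equals the raw slope multiplied by the predictor's standard deviation, which makes the slopes comparable across predictors measured in different units.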


## One-Hot Encoding

One-hot encoding is a technique for converting categorical features into features that machine learning algorithms can process more easily.

In [64]:
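# Create column "origin_names" with the names of the regions of origin
df["origin_names"] = df["origin"].replace({1: "US", 2: "EUR", 3: "Asia"})
# Create dummy variables
dummies = pd.get_dummies(df["origin_names"], prefix="origin")
# Add the dummy variables to the data set (still boolean at this point,
# which is why the model output labels them with [T.True])
df = pd.concat([df, dummies], axis=1)
# Cast to 0/1 for display only; the columns already added to df are unaffected
dummies = dummies.astype(int)

dummies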

Unnamed: 0,origin_Asia,origin_EUR,origin_US
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
387,0,0,1
388,0,1,0
389,0,0,1
390,0,0,1


In [65]:
model = smf.ols(formula="mpg ~ origin_US + origin_EUR", data=df).fit()
model.summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.328
Dependent Variable:,mpg,AIC:,2570.3126
Date:,2023-11-22 12:42,BIC:,2582.2264
No. Observations:,392,Log-Likelihood:,-1282.2
Df Model:,2,F-statistic:,96.6
Df Residuals:,389,Prob (F-statistic):,8.67e-35
R-squared:,0.332,Scale:,40.912

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,30.4506,0.7196,42.3141,0.0000,29.0358,31.8655
origin_US[T.True],-10.4172,0.8276,-12.5878,0.0000,-12.0442,-8.7901
origin_EUR[T.True],-2.8477,1.0581,-2.6914,0.0074,-4.9279,-0.7674

0,1,2,3
Omnibus:,26.33,Durbin-Watson:,0.763
Prob(Omnibus):,0.0,Jarque-Bera (JB):,30.217
Skew:,0.679,Prob(JB):,0.0
Kurtosis:,3.066,Condition No.:,5.0
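
Because `origin_Asia` is left out of the formula, Asia acts as the reference category: the intercept (30.4506) is the mean `mpg` of the Asian cars, and each coefficient is the difference between that group's mean and the reference mean (US: about 30.45 - 10.42 = 20.03, EUR: about 30.45 - 2.85 = 27.60). The sketch below illustrates this property on synthetic data; the group sizes, group means and noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic groups (assumption): labels and group means are illustrative only
labels = np.array(["US"] * 120 + ["EUR"] * 40 + ["Asia"] * 40)
true_means = {"US": 20.0, "EUR": 27.5, "Asia": 30.5}
mpg = np.array([true_means[g] for g in labels]) + rng.normal(0, 6, labels.size)

# One-hot columns for US and EUR; Asia is the omitted reference category
us = (labels == "US").astype(float)
eur = (labels == "EUR").astype(float)
X = np.column_stack([np.ones(labels.size), us, eur])

beta, *_ = np.linalg.lstsq(X, mpg, rcond=None)
intercept, coef_us, coef_eur = beta

group_mean = {g: mpg[labels == g].mean() for g in true_means}
print("intercept      :", intercept, "| mean(Asia):", group_mean["Asia"])
print("intercept + US :", intercept + coef_us, "| mean(US):", group_mean["US"])
print("intercept + EUR:", intercept + coef_eur, "| mean(EUR):", group_mean["EUR"])

# With all three dummies, the dummy columns sum to the intercept column
X_full = np.column_stack([X, (labels == "Asia").astype(float)])
print("rank with all three dummies:", np.linalg.matrix_rank(X_full), "of", X_full.shape[1])
```

The final rank check shows why one category has to be dropped when the model includes an intercept: with all three dummies, the design matrix is rank-deficient and the coefficients are not uniquely determined.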
