# Multiple Linear Regression

**Main Goal:** find the linear function that expresses the relationship between the dependent variable and the independent **variables**.

**Assumptions of Linear Regression**

* The errors are normally distributed, independent of one another, and show no autocorrelation.

* The variance of the error terms is constant across observations (homoscedasticity).

* The independent variables are uncorrelated with the error terms.

* There is no multicollinearity among the independent variables.

**NOTE: Regression models are sensitive to outliers.**

**NOTE:** Multiple regression can be applied to the data only if there is no multicollinearity among the independent variables.
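
One common way to check the multicollinearity assumption is the variance inflation factor (VIF): each independent variable is regressed on the others, and VIF = 1 / (1 - R²). The sketch below is a minimal NumPy version run on synthetic data, not on the Advertising set; the helper name `vif` and the data are illustrative assumptions. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic data (assumption, for illustration): x3 is nearly a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + rng.normal(scale=0.1, size=200)
vif_values = vif(np.column_stack([x1, x2, x3]))
print(vif_values)  # x1 and x3 get large values, x2 stays close to 1
```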

In [4]:
import pandas as pd 
ad = pd.read_csv("Advertising.csv", usecols=[1,2,3,4])
df = ad.copy()
df.head()

,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


In [9]:
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

In [13]:
X = df.drop("sales", axis=1)
y = df["sales"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)

In [14]:
X_train.shape

(160, 3)

In [16]:
y_train.shape

(160,)

In [18]:
X_test.shape

(40, 3)

In [19]:
y_test.shape

(40,)

In [20]:
training = df.copy()

In [21]:
training.shape


(200, 4)

# Statsmodels

In [24]:
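import statsmodels.api as sm
# sm.OLS does not add an intercept on its own; X_train has no constant column,
# so this model is fit through the origin and the reported R-squared is uncentered.
lm = sm.OLS(y_train, X_train)
model = lm.fit()
model.summary()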

Dep. Variable:,sales,R-squared:,0.982
Model:,OLS,Adj. R-squared:,0.982
Method:,Least Squares,F-statistic:,2935.0
Date:,"Sun, 14 Feb 2021",Prob (F-statistic):,1.28e-137
Time:,03:09:05,Log-Likelihood:,-336.65
No. Observations:,160,AIC:,679.3
Df Residuals:,157,BIC:,688.5
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
TV,0.0531,0.001,36.467,0.000,0.050,0.056
radio,0.2188,0.011,20.138,0.000,0.197,0.240
newspaper,0.0239,0.008,3.011,0.003,0.008,0.040

Omnibus:,11.405,Durbin-Watson:,1.895
Prob(Omnibus):,0.003,Jarque-Bera (JB):,15.574
Skew:,-0.432,Prob(JB):,0.000415
Kurtosis:,4.261,Cond. No.,13.5
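
The fit above passes `X_train` to `sm.OLS` as is, and `sm.OLS` only estimates an intercept when the design matrix contains a constant column. In statsmodels the usual way to add one is `sm.add_constant(X_train)`. The sketch below illustrates the effect with plain NumPy least squares on synthetic data (the data and variable names are assumptions, not taken from the notebook): without the column of ones, the slopes absorb the missing intercept.

```python
import numpy as np

# Synthetic data with a known intercept of 3.0 (illustrative, not the Advertising set).
rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(200, 2))
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Fit without a constant column: the fitted plane is forced through the origin.
coef_no_const, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fit with a leading column of ones: the first coefficient is the intercept.
X_const = np.column_stack([np.ones(len(X)), X])
coef_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print("no constant:  ", coef_no_const)
print("with constant:", coef_const)
```

This may explain why the statsmodels coefficients in this notebook differ from the scikit-learn ones: `LinearRegression` fits an intercept by default.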


**NOTE:** Whichever algorithm is ultimately used, if the target is a continuous variable, first fit a linear model and check analytically whether the effects of the independent variables on the dependent variable are statistically significant.

In [26]:
model.summary().tables[1]

,coef,std err,t,P>|t|,[0.025,0.975]
TV,0.0531,0.001,36.467,0.000,0.050,0.056
radio,0.2188,0.011,20.138,0.000,0.197,0.240
newspaper,0.0239,0.008,3.011,0.003,0.008,0.040


**coef:** the average increase in sales associated with a one-unit increase in spending on that channel, holding the other variables constant.
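
As a small illustration of that reading, the sketch below plugs the rounded coefficients from the table into a linear prediction and raises TV spending by one unit. The helper `predict_sales` and the spending levels are hypothetical; only the three coefficients come from the output above.

```python
# Coefficients copied from the statsmodels table (that fit has no intercept term).
coef = {"TV": 0.0531, "radio": 0.2188, "newspaper": 0.0239}

def predict_sales(tv, radio, newspaper):
    """Linear prediction from the rounded coefficients in the table."""
    return coef["TV"] * tv + coef["radio"] * radio + coef["newspaper"] * newspaper

base = predict_sales(tv=100, radio=20, newspaper=30)
bumped = predict_sales(tv=101, radio=20, newspaper=30)  # one more unit of TV spend
print(bumped - base)  # equals the TV coefficient, up to floating-point error
```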

# Scikit-learn Model

In [31]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
model = lm.fit(X_train, y_train)

In [32]:
model.intercept_  # intercept (constant term)

2.9790673381226256

In [33]:
model.coef_  # coefficients of all the variables

array([0.04472952, 0.18919505, 0.00276111])

# Prediction

Sales = 2.98 + TV * 0.045 + Radio * 0.189 + Newspaper * 0.003

Example: what is the predicted sales figure for 30 units of TV, 10 units of radio, and 40 units of newspaper spending?

In [36]:
# predict expects 2-D input: one row holding TV, radio, and newspaper spending
yeni_veri = pd.DataFrame([[30, 10, 40]], columns=X_train.columns)

In [37]:
model.predict(yeni_veri)

array([6.32334798])
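
The same number can be reproduced by hand from the fitted intercept and coefficients, which is a quick sanity check on the equation. The values below are copied from the `model.intercept_` and `model.coef_` outputs; the variable names are illustrative.

```python
import numpy as np

# Values copied from the model.intercept_ and model.coef_ outputs.
intercept = 2.9790673381226256
coef = np.array([0.04472952, 0.18919505, 0.00276111])  # TV, radio, newspaper

spend = np.array([30, 10, 40])  # TV, radio, newspaper spending
manual_prediction = intercept + coef @ spend
print(manual_prediction)  # about 6.3233, in line with model.predict
```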

In [54]:
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))

In [55]:
rmse  # training set error

1.6447277656443373

In [56]:
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

In [51]:
rmse  # test set error

1.7815996615334502
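
For reference, RMSE is the square root of the mean squared difference between observed and predicted values, so it is expressed in the same units as `sales`. A minimal NumPy sketch of the formula follows; the helper name `rmse` and the toy numbers are assumptions for illustration only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy values: residuals are 1, 0 and -2, so the mean squared error is 5/3.
toy_error = rmse([22.1, 10.4, 9.3], [21.1, 10.4, 11.3])
print(toy_error)
```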

# Model Tuning / Model Validation

In [57]:
df.head()

,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


In [59]:
X = df.drop('sales', axis=1)
y = df["sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size=0.2,
                                                   random_state=42)
lm = LinearRegression()
model = lm.fit(X_train, y_train)

In [60]:
np.sqrt(mean_squared_error(y_train, model.predict(X_train)))

1.6447277656443373

In [71]:
np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

1.7815996615334502

In [61]:
model.score(X_train, y_train)

0.8957008271017817
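
For a regressor such as `LinearRegression`, `model.score` returns the coefficient of determination R²: one minus the ratio of the residual sum of squares to the total sum of squares around the mean. A minimal NumPy sketch of that formula, with an illustrative helper name and toy numbers that are not from the notebook:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Toy values: SS_res = 0.10 and SS_tot = 5.0.
toy_r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(toy_r2)
```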

In [65]:
cross_val_score(model, X_train, y_train, cv=10, scoring="r2").mean()

0.7913548596916338

In [70]:
np.sqrt(-cross_val_score(model, 
                X_train, 
                y_train, 
                cv=10, 
                scoring="neg_mean_squared_error")).mean()

1.6513523730313338
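
To make the cross-validated figure less of a black box, the sketch below hand-rolls roughly what k-fold validation does: split the rows into k folds, fit on k-1 of them, score RMSE on the held-out fold, and average. It is a simplified NumPy illustration on synthetic data, not scikit-learn's implementation; `kfold_rmse` and the data are assumptions.

```python
import numpy as np

def kfold_rmse(X, y, k=10):
    """Mean held-out RMSE of a least-squares linear fit over k contiguous folds."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])  # add intercept
    y = np.asarray(y, dtype=float)
    folds = np.array_split(np.arange(len(y)), k)
    scores = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False  # hold this fold out of the fit
        beta, *_ = np.linalg.lstsq(X[train_mask], y[train_mask], rcond=None)
        resid = y[val_idx] - X[val_idx] @ beta
        scores.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(scores))

# Synthetic data with noise standard deviation 0.5 (not the Advertising set).
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(160, 3))
y = 3.0 + X @ np.array([0.05, 0.2, 0.0]) + rng.normal(scale=0.5, size=160)
cv_rmse = kfold_rmse(X, y, k=10)
print(cv_rmse)  # should land near the noise level of 0.5
```

Because every row is held out exactly once, the averaged score depends less on one particular train/test split than a single hold-out RMSE does.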

In [72]:
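# 10-fold CV on the 40-row test set leaves only 4 rows per validation fold, so this estimate is noisy
np.sqrt(-cross_val_score(model, 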
                X_test, 
                y_test, 
                cv=10, 
                scoring="neg_mean_squared_error")).mean()

1.8462778823997084