### Multiple Linear Regression

The goal of multiple linear regression is to model the linear relationship between two or more independent variables and a single dependent (response) variable.

Unlike simple linear regression, it uses more than one independent variable.
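
As a sketch of what "linear relationship" means here, the model can be written as follows. The symbols are assumptions chosen for illustration and do not appear elsewhere in these notes: $y$ is the dependent variable, $x_1, \dots, x_p$ are the independent variables, $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the coefficients to be estimated, and $\varepsilon$ is the error term.

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
$$

With $p = 1$ this reduces to simple linear regression.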

#### Dummy Variable (Kukla Değişkeni)

A dummy variable is a numeric variable (taking only the values 0 and 1) that stands in for one category of a categorical variable.

Regression analysis is used with numerical variables. Results only have a valid interpretation if it makes sense to assume that having a value of 2 on some variable does indeed mean having twice as much of something as a 1, and having a 50 means 50 times as much as a 1.

However, social scientists often need to work with categorical variables in which the different values have no real numerical relationship with each other. Examples include variables for race, political affiliation, or marital status. If you have a variable for political affiliation with possible responses including Democrat, Independent, and Republican, it obviously doesn't make sense to assign values of 1 - 3 and interpret that as meaning that a Republican is somehow three times as politically affiliated as a Democrat.

The solution is to use dummy variables: variables with only two values, zero and one. It does make sense to create a variable called "Republican" and interpret it as meaning that someone assigned a 1 on this variable is Republican and someone with a 0 is not.

* See https://dss.princeton.edu/online_help/analysis/dummy_variables.htm for more details.
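
The idea can be sketched with `pandas.get_dummies`. The small `people` DataFrame below is made up for illustration; only the three affiliation labels come from the text above.

```python
import pandas as pd

# Hypothetical sample: one categorical column with three possible values.
people = pd.DataFrame(
    {"affiliation": ["Democrat", "Independent", "Republican", "Democrat"]}
)

# One 0/1 column per category; columns come out in alphabetical order.
dummies = pd.get_dummies(people["affiliation"], dtype=int)
print(dummies)

# Each row has exactly one 1, so any one column is implied by the others.
# Passing drop_first=True drops one column to remove that redundancy.
reduced = pd.get_dummies(people["affiliation"], dtype=int, drop_first=True)
print(reduced)
```

Each person gets a 1 in exactly one column, so the columns carry no artificial ordering between the categories.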

#### P-value (Olasılık Değeri) 

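The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one in the sample. It is compared with a chosen significance level to decide whether to reject the null hypothesis.

H0 : Null Hypothesis

H1 : Alternative Hypothesis

The smaller the p-value is, the stronger the evidence against H0.

The bigger the p-value is, the weaker the evidence against H0. A large p-value does not prove H0; it only means the data give no grounds to reject it in favor of H1.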
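
A minimal worked example, assuming a made-up experiment that is not part of these notes: a coin is flipped 20 times and lands heads 16 times, with H0 being "the coin is fair". The helper name `binomial_two_sided_p` is hypothetical. It computes an exact two-sided p-value by adding up the probabilities, under H0, of every outcome that is no more likely than the observed one.

```python
from math import comb


def binomial_two_sided_p(k, n, p0=0.5):
    """Exact two-sided p-value for observing k successes in n trials under H0: p = p0."""
    # Probability of every possible number of successes if H0 is true.
    probs = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Sum the outcomes that are at least as unlikely as the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(p for p in probs if p <= observed * (1 + 1e-9))


p_value = binomial_two_sided_p(16, 20)
print(round(p_value, 4))  # 0.0118

significance_level = 0.05
print("reject H0" if p_value < significance_level else "fail to reject H0")
```

Here the p-value is below 0.05, so under that significance level the fair-coin hypothesis would be rejected.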

#### Feature Selection

1. Include all features.
2. Backward Elimination (Geriye Doğru Eleme)
3. Forward Selection (İleri Seçim)
4. Bidirectional Elimination (İki Yönlü Eleme)
5. Score Comparison (Skor Karşılaştırması)

#### Include all features
1. We can use all of the features in the model if the variable selection has already been made and we are sure about the variables, if we are required to, or for exploration (to get an idea before applying the other four methods).

#### Backward Elimination
1. Select a significance level (SL) first (usually 0.05).
2. Build a model using all variables.
3. Consider the variable with the highest p-value. If p > SL, go to step 4; otherwise go to step 6.
4. Remove the variable with the highest p-value.
5. Refit the model without that variable and return to step 3.
6. The procedure ends and the current model is kept.

#### Forward Selection
1. Select a significance level (SL) first (usually 0.05).
2. Fit a separate model for each variable on its own.
3. Select the variable with the lowest p-value.
4. Keeping the variables selected so far in the model, fit a new model for each remaining variable, adding them one at a time.
5. Consider the newly added variable with the lowest p-value. If p < SL, keep it in the model and return to step 4; otherwise go to step 6.
6. The procedure ends and the model from before the last candidate was tried is kept.


#### Bidirectional Elimination
1. Select a significance level (SL) first (usually 0.05). Separate levels for entering and for staying in the model can also be used.
2. Forward step: keeping the variables already in the model (none at the start), fit a new model for each remaining variable, adding them one at a time.
3. Add the candidate with the lowest p-value to the model, provided that p < SL.
4. Backward step: with the new variable in the model, remove any variable whose p-value is now above SL.
5. Repeat steps 2 to 4 until no new variable can enter the model and none of the old variables can be removed from it.
6. The procedure ends and the current model is kept.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv("veriler.csv")

In [3]:
data.head()

Unnamed: 0,ulke,boy,kilo,yas,cinsiyet
0,tr,130,30,10,e
1,tr,125,36,11,e
2,tr,135,34,10,k
3,tr,133,30,9,k
4,tr,129,38,12,e


In [4]:
data.isna().sum()

ulke        0
boy         0
kilo        0
yas         0
cinsiyet    0
dtype: int64

In [5]:
ulke = data.iloc[:, 0:1].values
ulke

array([['tr'],
       ['tr'],
       ['tr'],
       ['tr'],
       ['tr'],
       ['tr'],
       ['tr'],
       ['tr'],
       ['tr'],
       ['us'],
       ['us'],
       ['us'],
       ['us'],
       ['us'],
       ['us'],
       ['fr'],
       ['fr'],
       ['fr'],
       ['fr'],
       ['fr'],
       ['fr'],
       ['fr']], dtype=object)

In [6]:
from sklearn import preprocessing

In [7]:
le = preprocessing.LabelEncoder()

In [8]:
ulke[:, 0] = le.fit_transform(data.iloc[:, 0])

In [9]:
ulke

array([[1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [2],
       [2],
       [2],
       [2],
       [2],
       [2],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0]], dtype=object)

In [10]:
ohe = preprocessing.OneHotEncoder()

In [11]:
ulke = ohe.fit_transform(ulke).toarray()

In [12]:
ulke

array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.]])
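
As an alternative sketch, recent versions of scikit-learn let `OneHotEncoder` work on string categories directly, so the intermediate `LabelEncoder` step is not strictly needed. The names `ulke_demo` and `ohe_demo` and the four-row sample are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of the country column, as a 2-D array of strings.
ulke_demo = np.array([["tr"], ["us"], ["fr"], ["tr"]], dtype=object)

ohe_demo = OneHotEncoder()
encoded = ohe_demo.fit_transform(ulke_demo).toarray()

# Categories are sorted, which gives the column order fr, tr, us.
print(ohe_demo.categories_[0])
print(encoded)
```

The sorted category order is why the columns are later labelled `fr`, `tr`, `us`.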

In [13]:
cinsiyet = data.iloc[:, 4:].values
cinsiyet

array([['e'],
       ['e'],
       ['k'],
       ['k'],
       ['e'],
       ['e'],
       ['e'],
       ['e'],
       ['k'],
       ['e'],
       ['k'],
       ['k'],
       ['k'],
       ['k'],
       ['k'],
       ['e'],
       ['e'],
       ['e'],
       ['e'],
       ['k'],
       ['k'],
       ['k']], dtype=object)

In [14]:
le = preprocessing.LabelEncoder()

In [15]:
cinsiyet[:, 0] = le.fit_transform(data.iloc[:, 4])

In [16]:
cinsiyet

array([[0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [1]], dtype=object)

In [17]:
ohe = preprocessing.OneHotEncoder()

In [18]:
cinsiyet = ohe.fit_transform(cinsiyet).toarray()

In [19]:
cinsiyet

array([[1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.]])

In [20]:
final = pd.DataFrame(data=ulke, index=range(22), columns=['fr', 'tr', 'us'])

In [22]:
yas = data.iloc[:, 1:4].values
final2 = pd.DataFrame(data=yas, index=range(22), columns=['boy', 'kilo', 'yas'])

In [23]:
final3 = pd.DataFrame(data = cinsiyet[:, :1], index=range(22), columns=['cinsiyet'])

In [24]:
final3

Unnamed: 0,cinsiyet
0,1.0
1,1.0
2,0.0
3,0.0
4,1.0
5,1.0
6,1.0
7,1.0
8,0.0
9,1.0


In [25]:
df = pd.concat([final, final2], axis=1)

In [26]:
df2 = pd.concat([df, final3], axis=1)

In [27]:
df2

Unnamed: 0,fr,tr,us,boy,kilo,yas,cinsiyet
0,0.0,1.0,0.0,130,30,10,1.0
1,0.0,1.0,0.0,125,36,11,1.0
2,0.0,1.0,0.0,135,34,10,0.0
3,0.0,1.0,0.0,133,30,9,0.0
4,0.0,1.0,0.0,129,38,12,1.0
5,0.0,1.0,0.0,180,90,30,1.0
6,0.0,1.0,0.0,190,80,25,1.0
7,0.0,1.0,0.0,175,90,35,1.0
8,0.0,1.0,0.0,177,60,22,0.0
9,0.0,0.0,1.0,185,105,33,1.0


In [28]:
from sklearn.model_selection import train_test_split

In [29]:
x_train, x_test, y_train, y_test = train_test_split(df, final3, test_size=0.33, random_state=0)

In [30]:
from sklearn.linear_model import LinearRegression

In [31]:
regressor = LinearRegression()

In [33]:
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)

In [34]:
y_pred

array([[ 0.98720204],
       [-0.12036863],
       [ 0.05009703],
       [ 0.07137418],
       [ 0.72473935],
       [ 0.64615044],
       [-0.03567453],
       [ 0.32612171]])

In [35]:
y_test

Unnamed: 0,cinsiyet
20,0.0
10,0.0
14,0.0
13,0.0
1,1.0
21,0.0
11,0.0
19,0.0


In [37]:
final2

Unnamed: 0,boy,kilo,yas
0,130,30,10
1,125,36,11
2,135,34,10
3,133,30,9
4,129,38,12
5,180,90,30
6,190,80,25
7,175,90,35
8,177,60,22
9,185,105,33


In [51]:
boy = pd.DataFrame(df2.iloc[:, 3:4].values)

In [52]:
boy  # DataFrame containing the dependent variable

Unnamed: 0,0
0,130
1,125
2,135
3,133
4,129
5,180
6,190
7,175
8,177
9,185


In [48]:
left = pd.DataFrame(df2.iloc[:, :3])
right = pd.DataFrame(df2.iloc[:, 4:])

In [49]:
dta = pd.concat([left, right], axis=1)

In [50]:
dta  # DataFrame containing the independent variables

Unnamed: 0,fr,tr,us,kilo,yas,cinsiyet
0,0.0,1.0,0.0,30,10,1.0
1,0.0,1.0,0.0,36,11,1.0
2,0.0,1.0,0.0,34,10,0.0
3,0.0,1.0,0.0,30,9,0.0
4,0.0,1.0,0.0,38,12,1.0
5,0.0,1.0,0.0,90,30,1.0
6,0.0,1.0,0.0,80,25,1.0
7,0.0,1.0,0.0,90,35,1.0
8,0.0,1.0,0.0,60,22,0.0
9,0.0,0.0,1.0,105,33,1.0


In [53]:
x_train, x_test, y_train, y_test = train_test_split(dta, boy, 
                                            test_size=0.33, random_state=0)

In [54]:
x_train

Unnamed: 0,fr,tr,us,kilo,yas,cinsiyet
8,0.0,1.0,0.0,60,22,0.0
6,0.0,1.0,0.0,80,25,1.0
16,1.0,0.0,0.0,90,23,1.0
4,0.0,1.0,0.0,38,12,1.0
2,0.0,1.0,0.0,34,10,0.0
5,0.0,1.0,0.0,90,30,1.0
17,1.0,0.0,0.0,80,27,1.0
9,0.0,0.0,1.0,105,33,1.0
7,0.0,1.0,0.0,90,35,1.0
18,1.0,0.0,0.0,88,28,1.0


In [55]:
y_train

Unnamed: 0,0
8,177
6,190
16,193
4,129
2,135
5,180
17,187
9,185
7,175
18,183


In [56]:
regressor2 = LinearRegression()
regressor2.fit(x_train, y_train)
y_pred2 = regressor2.predict(x_test)

In [57]:
y_pred2

array([[182.26638686],
       [152.87161474],
       [162.79386375],
       [158.30668577],
       [130.82888952],
       [173.96138408],
       [150.12782663],
       [157.26898922]])

In [58]:
y_test

Unnamed: 0,0
20,164
10,165
14,167
13,162
1,125
21,166
11,155
19,159
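
To put a number on how close the predictions are, one option is to compute the mean absolute error and the root mean squared error. The sketch below copies the printed `y_pred2` and `y_test` values into new arrays (the names `y_pred2_printed` and `y_test_printed` are made up here) so that it runs on its own.

```python
import numpy as np

# Values copied from the y_pred2 and y_test outputs shown above.
y_pred2_printed = np.array([
    182.26638686, 152.87161474, 162.79386375, 158.30668577,
    130.82888952, 173.96138408, 150.12782663, 157.26898922,
])
y_test_printed = np.array([164, 165, 167, 162, 125, 166, 155, 159])

errors = y_pred2_printed - y_test_printed
mae = np.mean(np.abs(errors))         # average size of the error
rmse = np.sqrt(np.mean(errors ** 2))  # penalizes large errors more heavily

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```

Both measures are in the same units as `boy`, which makes them easier to interpret than the raw arrays.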


#### Backward Elimination

In [60]:
import statsmodels.api as sm

In [63]:
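# Prepend constant columns of ones to dta. Note that (22, 2) adds two such columns;
# a single one, np.ones((22, 1)), is enough for an intercept. X is not used in the models below.
X = pd.DataFrame(np.append(arr=np.ones((22, 2)).astype("int"), values=dta, axis=1))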

In [64]:
X

Unnamed: 0,0,1,2,3,4,5,6,7
0,1.0,1.0,0.0,1.0,0.0,30.0,10.0,1.0
1,1.0,1.0,0.0,1.0,0.0,36.0,11.0,1.0
2,1.0,1.0,0.0,1.0,0.0,34.0,10.0,0.0
3,1.0,1.0,0.0,1.0,0.0,30.0,9.0,0.0
4,1.0,1.0,0.0,1.0,0.0,38.0,12.0,1.0
5,1.0,1.0,0.0,1.0,0.0,90.0,30.0,1.0
6,1.0,1.0,0.0,1.0,0.0,80.0,25.0,1.0
7,1.0,1.0,0.0,1.0,0.0,90.0,35.0,1.0
8,1.0,1.0,0.0,1.0,0.0,60.0,22.0,0.0
9,1.0,1.0,0.0,0.0,1.0,105.0,33.0,1.0


In [68]:
X_1 = dta.iloc[:, [0, 1, 2, 3, 4, 5]]
X_1 = pd.DataFrame(np.array(X_1, dtype=float))

In [69]:
X_1

Unnamed: 0,0,1,2,3,4,5
0,0.0,1.0,0.0,30.0,10.0,1.0
1,0.0,1.0,0.0,36.0,11.0,1.0
2,0.0,1.0,0.0,34.0,10.0,0.0
3,0.0,1.0,0.0,30.0,9.0,0.0
4,0.0,1.0,0.0,38.0,12.0,1.0
5,0.0,1.0,0.0,90.0,30.0,1.0
6,0.0,1.0,0.0,80.0,25.0,1.0
7,0.0,1.0,0.0,90.0,35.0,1.0
8,0.0,1.0,0.0,60.0,22.0,0.0
9,0.0,0.0,1.0,105.0,33.0,1.0


In [70]:
model = sm.OLS(boy, X_1).fit()

In [71]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | 0 | R-squared: | 0.885 |
| Model: | OLS | Adj. R-squared: | 0.849 |
| Method: | Least Squares | F-statistic: | 24.69 |
| Date: | Fri, 30 Oct 2020 | Prob (F-statistic): | 5.41e-07 |
| Time: | 16:48:18 | Log-Likelihood: | -73.95 |
| No. Observations: | 22 | AIC: | 159.9 |
| Df Residuals: | 16 | BIC: | 166.4 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| 0.0 | 114.0688 | 8.145 | 14.005 | 0.000 | 96.802 | 131.335 |
| 1.0 | 108.3030 | 5.736 | 18.880 | 0.000 | 96.143 | 120.463 |
| 2.0 | 104.4714 | 9.195 | 11.361 | 0.000 | 84.978 | 123.964 |
| 3.0 | 0.9211 | 0.119 | 7.737 | 0.000 | 0.669 | 1.174 |
| 4.0 | 0.0814 | 0.221 | 0.369 | 0.717 | -0.386 | 0.549 |
| 5.0 | -10.5980 | 5.052 | -2.098 | 0.052 | -21.308 | 0.112 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.031 | Durbin-Watson: | 2.759 |
| Prob(Omnibus): | 0.597 | Jarque-Bera (JB): | 0.624 |
| Skew: | 0.407 | Prob(JB): | 0.732 |
| Kurtosis: | 2.863 | Cond. No. | 524.0 |


#### We will eliminate the variable with the highest p-value (0.717), which is at index 4 (`yas`).

In [72]:
X_1 = dta.iloc[:, [0, 1, 2, 3, 5]]
X_1 = pd.DataFrame(np.array(X_1, dtype=float))

In [73]:
model = sm.OLS(boy, X_1).fit()

In [74]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | 0 | R-squared: | 0.884 |
| Model: | OLS | Adj. R-squared: | 0.857 |
| Method: | Least Squares | F-statistic: | 32.47 |
| Date: | Fri, 30 Oct 2020 | Prob (F-statistic): | 9.32e-08 |
| Time: | 16:51:38 | Log-Likelihood: | -74.043 |
| No. Observations: | 22 | AIC: | 158.1 |
| Df Residuals: | 17 | BIC: | 163.5 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| 0.0 | 115.6583 | 6.734 | 17.175 | 0.000 | 101.451 | 129.866 |
| 1.0 | 109.0786 | 5.200 | 20.978 | 0.000 | 98.108 | 120.049 |
| 2.0 | 106.5445 | 7.090 | 15.026 | 0.000 | 91.585 | 121.504 |
| 3.0 | 0.9405 | 0.104 | 9.029 | 0.000 | 0.721 | 1.160 |
| 4.0 | -11.1093 | 4.733 | -2.347 | 0.031 | -21.096 | -1.123 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.871 | Durbin-Watson: | 2.719 |
| Prob(Omnibus): | 0.647 | Jarque-Bera (JB): | 0.459 |
| Skew: | 0.351 | Prob(JB): | 0.795 |
| Kurtosis: | 2.91 | Cond. No. | 397.0 |
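
After the elimination, R-squared drops slightly (0.885 to 0.884) while adjusted R-squared rises (0.849 to 0.857). A short sketch of why, assuming the usual adjusted R-squared formula, which is consistent with the values reported in the two summaries. The helper name `adjusted_r2` is made up for illustration, and the inputs are the rounded figures from the summaries.

```python
def adjusted_r2(r2, n_obs, df_resid):
    """Adjusted R-squared: penalizes R-squared for the number of fitted parameters."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / df_resid


# Rounded values taken from the two summaries above.
full_model = adjusted_r2(0.885, n_obs=22, df_resid=16)     # before removing yas
reduced_model = adjusted_r2(0.884, n_obs=22, df_resid=17)  # after removing yas

print(round(full_model, 3))     # 0.849
print(round(reduced_model, 3))  # 0.857
```

Removing `yas` costs almost nothing in explained variance but frees one degree of freedom, so the adjusted value improves.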
