# Regression Application 1

#### This notebook fits a linear regression model to a small tennis (weather/play) dataset and applies one step of backward elimination. The data is loaded, every column is label-encoded, and the `outlook` column is additionally one-hot encoded. The processed data is split into training and test sets with `humidity` as the target, a linear regression model is trained, and predictions are made on the test set. For feature selection, an OLS model is first fitted on all features and then refitted after removing one feature (`windy`). The two model summaries and the test-set predictions before and after the removal are printed so they can be compared. The aim is to drop features that contribute little to the fit.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

data = pd.read_csv("C:\\Users\\Arif Furkan\\OneDrive\\Belgeler\\Python_kullanirken\\odev_tenis.csv")
print(data)

     outlook  temperature  humidity  windy play
0      sunny           85        85  False   no
1      sunny           80        90   True   no
2   overcast           83        86  False  yes
3      rainy           70        96  False  yes
4      rainy           68        80  False  yes
5      rainy           65        70   True   no
6   overcast           64        65   True  yes
7      sunny           72        95  False   no
8      sunny           69        70  False  yes
9      rainy           75        80  False  yes
10     sunny           75        70   True  yes
11  overcast           72        90   True  yes
12  overcast           81        75  False  yes
13     rainy           71        91   True   no


## Encode categorical features

In [2]:
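# Label-encode every column. Note that apply() also encodes the numeric
# temperature and humidity columns, replacing their values with integer codes.
label_encoder = LabelEncoder()
data_encoded = data.apply(label_encoder.fit_transform)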
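
Because `data.apply(label_encoder.fit_transform)` runs on every column, the numeric columns are encoded as well. The sketch below uses the humidity values from the printed table to show what `LabelEncoder` does to them: each value is replaced by its position among the sorted unique values. The variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Humidity column copied from the dataset printed above.
humidity = np.array([85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91])

encoder = LabelEncoder()
codes = encoder.fit_transform(humidity)

print(encoder.classes_)  # sorted unique humidity values
print(codes)             # index of each value within classes_

# inverse_transform maps the integer codes back to the original values.
restored = encoder.inverse_transform(codes)
```

With ten distinct humidity values the codes run from 0 to 9, so the regression outputs later in the notebook are on this code scale rather than in percent humidity.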

In [3]:
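# One-hot encode the label-encoded outlook column (column 0).
# Only that column is passed in, so remainder='passthrough' has nothing to pass through.
column_transformer = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough')
weather_encoded = column_transformer.fit_transform(data_encoded.iloc[:, :1])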

In [4]:
# o, r, s = overcast, rainy, sunny (LabelEncoder orders the classes alphabetically)
weather_df = pd.DataFrame(data=weather_encoded, index=range(len(data)), columns=['o', 'r', 's'])

In [5]:
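# Final column order: windy, play, o, r, s, temperature, humidity.
# humidity is last, so it becomes the regression target below.
processed_data = pd.concat([weather_df, data_encoded.iloc[:, 1:3]], axis=1)
processed_data = pd.concat([data_encoded.iloc[:, -2:], processed_data], axis=1)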
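
An alternative, shown here only as a sketch, is to encode just the categorical columns and leave `temperature` and `humidity` on their original scale. The example swaps the notebook's `LabelEncoder`/`OneHotEncoder` pair for `pd.get_dummies`, works on the first five rows of the table above, and uses the illustrative name `encoded`.

```python
import pandas as pd

# First five rows of the dataset printed above.
data = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy"],
    "temperature": [85, 80, 83, 70, 68],
    "humidity": [85, 90, 86, 96, 80],
    "windy": [False, True, False, False, False],
    "play": ["no", "no", "yes", "yes", "yes"],
})

# One-hot encode the three-valued outlook column as 0/1 integers.
outlook_dummies = pd.get_dummies(data["outlook"], dtype=int)

encoded = pd.concat(
    [
        data["windy"].astype(int),            # False/True -> 0/1
        (data["play"] == "yes").astype(int),  # no/yes -> 0/1
        outlook_dummies,
        data[["temperature", "humidity"]],    # numeric values kept unchanged
    ],
    axis=1,
)
print(encoded)
```

The column order matches the notebook's `processed_data` (windy, play, the three outlook indicators, temperature, humidity), but a model trained on this frame would predict humidity in its original units.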

## Split the dataset into training and testing sets

In [6]:
# Features are all columns except the last; the target is the last column (humidity).
x_train, x_test, y_train, y_test = train_test_split(processed_data.iloc[:, :-1], processed_data.iloc[:, -1:], test_size=0.33, random_state=0)

## Train a Linear Regression model

In [7]:
regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
print("Predictions:\n", y_pred)

Predictions:
 [[ 5.64285714]
 [ 0.35714286]
 [ 5.92857143]
 [-0.5       ]
 [ 1.85714286]]


## Backward elimination for feature selection

In [8]:
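# Build a design matrix with a leading column of ones for the intercept term.
# Note: the OLS fits below use X_1, which does not include this column, so X is unused.
X = np.append(arr=np.ones((len(processed_data), 1)).astype(int), values=processed_data.iloc[:, :-1], axis=1)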

## Initial model

In [9]:
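# Fit OLS on all six features: windy, play, o, r, s, temperature.
# No constant column is added; o, r and s sum to 1 in every row, so together they act as the intercept.
X_1 = processed_data.iloc[:, [0, 1, 2, 3, 4, 5]].values
X_1 = np.array(X_1, dtype=float)
model = sm.OLS(processed_data.iloc[:, -1], X_1).fit()
print("Initial Model Summary:\n", model.summary())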

Initial Model Summary:
                             OLS Regression Results                            
Dep. Variable:               humidity   R-squared:                       0.266
Model:                            OLS   Adj. R-squared:                 -0.192
Method:                 Least Squares   F-statistic:                    0.5807
Date:                Thu, 25 Jul 2024   Prob (F-statistic):              0.715
Time:                        16:52:55   Log-Likelihood:                -31.999
No. Observations:                  14   AIC:                             76.00
Df Residuals:                       8   BIC:                             79.83
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.7947      2



## Remove one feature and update model

In [10]:
# Drop the first column (windy), then refit OLS on the remaining five features.
processed_data = processed_data.iloc[:, 1:]
X = np.append(arr=np.ones((len(processed_data), 1)).astype(int), values=processed_data.iloc[:, :-1], axis=1)
X_1 = processed_data.iloc[:, [0, 1, 2, 3, 4]].values
X_1 = np.array(X_1, dtype=float)
model = sm.OLS(processed_data.iloc[:, -1], X_1).fit()
print("Updated Model Summary (after feature removal):\n", model.summary())

Updated Model Summary (after feature removal):
                             OLS Regression Results                            
Dep. Variable:               humidity   R-squared:                       0.252
Model:                            OLS   Adj. R-squared:                 -0.080
Method:                 Least Squares   F-statistic:                    0.7587
Date:                Thu, 25 Jul 2024   Prob (F-statistic):              0.577
Time:                        16:53:57   Log-Likelihood:                -32.133
No. Observations:                  14   AIC:                             74.27
Df Residuals:                       9   BIC:                             77.46
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1  



## Train the model again after feature selection

In [11]:
# Drop the same first column (windy) from the train and test features, then refit.
x_train = x_train.iloc[:, 1:]
x_test = x_test.iloc[:, 1:]
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
print("Predictions after feature selection:\n", y_pred)

Predictions after feature selection:
 [[3.19296254]
 [0.71282633]
 [4.1169126 ]
 [1.57094211]
 [2.1430193 ]]
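
The notebook prints both sets of predictions but does not score them. As a sketch, a held-out error metric makes the comparison concrete. The block below mirrors the notebook's preprocessing (label-encoding every column, one-hot encoding `outlook`, the same split settings) on the dataset printed at the top; the choice of mean absolute error and the helper name `holdout_mae` are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
                "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"],
    "temperature": [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71],
    "humidity": [85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91],
    "windy": [False, True, False, False, False, True, True,
              False, False, False, True, True, False, True],
    "play": ["no", "no", "yes", "yes", "yes", "no", "yes",
             "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

# Mirror the notebook: label-encode every column, then one-hot encode outlook.
codes = data.apply(LabelEncoder().fit_transform)
features = pd.concat(
    [codes[["windy", "play"]], pd.get_dummies(data["outlook"], dtype=int), codes[["temperature"]]],
    axis=1,
)
target = codes["humidity"]

x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.33, random_state=0
)


def holdout_mae(columns):
    """Fit on the training rows and return the mean absolute error on the test rows."""
    model = LinearRegression().fit(x_train[columns], y_train)
    return mean_absolute_error(y_test, model.predict(x_test[columns]))


all_columns = list(features.columns)
reduced_columns = [c for c in all_columns if c != "windy"]

print("MAE, all features: ", holdout_mae(all_columns))
print("MAE, without windy:", holdout_mae(reduced_columns))
```

The errors are in label-code units, and with only five test rows they give a rough indication at best; cross-validation would be a more stable basis for deciding whether removing `windy` helps.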
