# Linear Regression

## Regression on a simple dataset

In [3]:
import pandas as pd 

path = '../datasets/50_Startups.csv'
dataset = pd.read_csv(path)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

dataset.head()

|   | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|-----------|----------------|-----------------|-------|--------|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [4]:
# Encoding categorical data
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
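
To make the effect of the encoding step concrete, the sketch below applies the same `ColumnTransformer` configuration to three rows copied from the `dataset.head()` output. The `toy` array and the `encoded` name are introduced here only for illustration. The point to notice is that the one-hot columns are placed first (one per state, in alphabetical order of the category) and the untouched numeric columns follow.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three rows taken from dataset.head(); column 3 holds the State label.
toy = np.array([
    [165349.2, 136897.8, 471784.1, "New York"],
    [162597.7, 151377.59, 443898.53, "California"],
    [153441.51, 101145.55, 407934.54, "Florida"],
], dtype=object)

# Same configuration as in the cell above: encode column 3, pass the rest through.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
)
encoded = np.array(ct.fit_transform(toy))

print(encoded.shape)  # 3 one-hot columns + 3 numeric columns
print(encoded[0])     # first row: New York is the third category alphabetically
```

Because the encoded columns come first, any later code that indexes features by position has to account for the shift.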

In [5]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [6]:
from LinearRegression import LinearRegression
regressor = LinearRegression(mode=2)
regressor.fit(X_train, y_train)

In [7]:
# Predicting the Test set results
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[101047.32456433728 103282.38]
 [127872.52969434661 144259.4]
 [128620.62725706577 146121.95]
 [72243.07019823034 77798.83]
 [171431.66391190744 191050.39]
 [111517.45537074801 105008.31]
 [66853.83139358481 81229.06]
 [95705.5640514028 97483.56]
 [111135.85392294121 110352.25]
 [161612.88844856032 166187.94]]


In [8]:
from LinearRegression import r2_score

r2_score(y_test, y_pred)

R-squared:  0.9007496218952408
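
The `r2_score` helper imported from the local `LinearRegression` module is not shown in this notebook. Assuming it follows the standard definition, R² = 1 - SS_res / SS_tot, a minimal sketch looks like the following. The name `r2_manual` and the toy values are hypothetical and not part of the original module.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination, assuming the standard definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy values: SS_res = 4 * 0.25 = 1.0 and SS_tot = 9 + 1 + 1 + 9 = 20.0
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.5, 5.5, 7.5, 8.5]
r2_toy = r2_manual(toy_true, toy_pred)
print(r2_toy)
```

A value of 1 means the predictions match the targets exactly, and 0 means the model does no better than always predicting the mean of `y_true`.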


## Regression on a more complicated dataset

### Information about the dataset

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), while the plant was operating at full load.

The features are hourly averages of four ambient variables: Temperature (T, column `AT`), Ambient Pressure (`AP`), Relative Humidity (`RH`) and Exhaust Vacuum (`V`). They are used to predict the net hourly electrical energy output (EP, column `PE`) of the plant.

A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the Vacuum is collected from and affects the steam turbine, the other three ambient variables affect the GT performance.

In [9]:
path = '../datasets/Combined_cycle_plants.ods'
dataset = pd.read_excel(path, engine='odf', sheet_name="Sheet1")
dataset.head()

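|   | AT | V | AP | RH | PE |
|---|----|---|----|----|----|
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |
| 3 | 20.86 | 57.32 | 1010.24 | 76.64 | 446.48 |
| 4 | 10.82 | 37.5 | 1009.23 | 96.62 | 473.9 |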


In [10]:
dataset.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9568 entries, 0 to 9567
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AT      9568 non-null   float64
 1   V       9568 non-null   float64
 2   AP      9568 non-null   float64
 3   RH      9568 non-null   float64
 4   PE      9568 non-null   float64
dtypes: float64(5)
memory usage: 373.9 KB


### Data Preprocessing

In [12]:
import numpy as np

target = dataset['PE']
y = np.asarray(target)
X = dataset.iloc[:, 0:4].values

#print(X)
#print(y)

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.3,
                                                    stratify=None,      # no stratification (the target is continuous)
                                                    random_state=123)   # fix the random seed for reproducibility

print(X_train.shape, X_test.shape)

(6697, 4) (2871, 4)
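
The shapes printed above follow directly from `test_size=0.3`: scikit-learn rounds the test share up, so 0.3 × 9568 = 2870.4 becomes 2871 test rows and the remaining 6697 rows go to training. The sketch below checks this on a plain index array (`indices` is a stand-in for the real data) and also shows that reusing the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 9568 rows of the power plant dataset.
indices = np.arange(9568)

# Two independent calls with the same seed.
train_a, test_a = train_test_split(indices, test_size=0.3, random_state=123)
train_b, test_b = train_test_split(indices, test_size=0.3, random_state=123)

print(len(train_a), len(test_a))

# True when both calls selected exactly the same test rows, in the same order.
same_split = np.array_equal(test_a, test_b)
print(same_split)
```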


In [14]:
from LinearRegression import LinearRegression
regressor = LinearRegression( learning_rate=0.2)
regressor.fit(X_train, y_train)

In [15]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[455.98 455.41]
 [448.62 441.64]
 [463.38 471.63]
 ...
 [460.9  464.38]
 [453.82 450.72]
 [440.62 432.68]]


In [16]:
from LinearRegression import r2_score

r2_score(y_test, y_pred)

R-squared:  0.6712627575672694
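
The custom `LinearRegression` class is not shown in this notebook, so how it trains is an assumption here; the `learning_rate` argument only suggests gradient descent. If that is the case, one possible contributor to a low score is that the four features live on very different scales (`AP` is around 1000, `AT` around 20), which makes plain gradient descent converge slowly or unevenly. The sketch below uses synthetic data and a hypothetical `fit_gd` function, not the notebook's class, to illustrate that batch gradient descent on standardized features reaches the ordinary least squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic features on scales loosely resembling AT, V, AP, RH (illustrative only).
X = np.column_stack([
    rng.normal(20, 7, n),
    rng.normal(54, 12, n),
    rng.normal(1013, 6, n),
    rng.normal(73, 14, n),
])
true_w = np.array([-2.0, -0.2, 0.06, -0.15])
y = 450 + (X - X.mean(axis=0)) @ true_w + rng.normal(0, 4, n)

# Standardize each feature to zero mean and unit variance.
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xs = (X - mu) / sigma

def fit_gd(X, y, learning_rate=0.2, n_iters=500):
    """Batch gradient descent on the mean squared error (hypothetical helper)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        residual = X @ w + b - y
        w -= learning_rate * (X.T @ residual) / n_samples
        b -= learning_rate * residual.mean()
    return w, b

w, b = fit_gd(Xs, y)

# Reference: closed-form least squares on the same standardized features.
A = np.column_stack([Xs, np.ones(n)])
w_ls = np.linalg.lstsq(A, y, rcond=None)[0]

print(np.append(w, b))
print(w_ls)
```

When applying this to a train/test split, `mu` and `sigma` would be computed on the training set only and reused to transform the test set. Whether scaling explains the gap seen in this notebook has not been verified.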


## Comparison with Scikit-learn API

In [21]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[447.73 455.41]
 [437.86 441.64]
 [471.82 471.63]
 ...
 [462.6  464.38]
 [452.19 450.72]
 [429.7  432.68]]


In [22]:
from sklearn.metrics import r2_score

# Compute the R² coefficient
r2 = r2_score(y_test, y_pred)
print("R-squared:", r2)

R-squared: 0.9232530522555914


## Comparison with Weka tool

![](lin_reg_weka.png)