Regression is used to predict continuous values. For example, given past data, what is the expected humidity today, or what salary might a person earn after working at a company for 10 years?

# Preprocessing the Data

## Importing the libraries

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

## Importing the dataset

In [2]:
dataset = pd.read_csv("Data.csv")
x = dataset.iloc[:, :-1].values  # assuming the dependent variable is in the last column, select every column except the last
y = dataset.iloc[:, -1].values  # select only the last column
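
As a small illustration of the two `iloc` slices, the sketch below uses a made-up three-column table. The `toy` frame, its column names and its values are hypothetical and are not taken from Data.csv.

```python
import pandas as pd

# Hypothetical table; the last column plays the role of the dependent variable.
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [10.0, 20.0, 30.0, 40.0],
    "target": [100.0, 200.0, 300.0, 400.0],
})

features = toy.iloc[:, :-1].values  # every column except the last
target = toy.iloc[:, -1].values     # only the last column

print(features.shape)  # (4, 2)
print(target.shape)    # (4,)
```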

In [3]:
print(x)

[[  14.96   41.76 1024.07   73.17]
 [  25.18   62.96 1020.04   59.08]
 [   5.11   39.4  1012.16   92.14]
 ...
 [  31.32   74.33 1012.92   36.48]
 [  24.48   69.45 1013.86   62.39]
 [  21.6    62.52 1017.23   67.87]]


WARNING: Run the encoding cells below only if the data contains non-numerical (categorical) columns. If there is nothing to encode, skip them.

## Encoding categorical data

WARNING: Beware of the dummy variable trap.
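
The trap arises because the one-hot columns of a single category always sum to 1 in every row, which duplicates the information already carried by the model's intercept. A minimal sketch, assuming a hypothetical `city` column, puts the full encoding next to `OneHotEncoder(drop="first")`, which removes one level:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with three levels.
city = np.array([["Paris"], ["Berlin"], ["Madrid"], ["Berlin"]])

# Full encoding: one column per level.
full = OneHotEncoder().fit_transform(city).toarray()
# Reduced encoding: the first level (alphabetically, "Berlin") is dropped.
reduced = OneHotEncoder(drop="first").fit_transform(city).toarray()

print(full.shape)        # (4, 3)
print(reduced.shape)     # (4, 2)
print(full.sum(axis=1))  # every row sums to 1, so the columns are redundant with an intercept
```

With scikit-learn's `LinearRegression` the redundant column generally does not stop the fit, but dropping it keeps the coefficients easier to interpret.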

### Encoding Independent Variable

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')  # one-hot encode column index 3, keep the rest
x = np.array(ct.fit_transform(x))

In [None]:
print(x)

### Encoding dependent variable data

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [None]:
print(y)

## Splitting the dataset into the Training set and Test set

In [4]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
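
To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on a made-up ten-row array. All names in it are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

a_train, a_test, b_train, b_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
# Repeating the call with the same random_state reproduces the same split.
repeat = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

print(len(a_train), len(a_test))          # 8 2
print(np.array_equal(a_test, repeat[1]))  # True
```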

In [None]:
print(x_train)

In [None]:
print(x_test)

In [None]:
print(y_train)

In [None]:
print(y_test)

WARNING: Decide whether Feature Scaling matters for your data before running this section.

## Feature Scaling

Feature scaling is applied when the variables do not share the same range of values. It puts them on a comparable scale so that no variable dominates simply because of its magnitude. For example, if one variable ranges over 1-10 while another variable (or the dependent variable) ranges over 1-2000, Feature Scaling is worth applying, particularly for models that rely on distances between samples.
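
A minimal sketch of standardization on two hypothetical features with the ranges mentioned above. The arrays are made up; the point is that the scaler is fitted on the training rows only and then reused on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: the first spans 1-10, the second 1-2000.
train = np.array([[1.0, 1.0], [4.0, 500.0], [7.0, 1500.0], [10.0, 2000.0]])
test = np.array([[5.0, 1000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std from the training rows
test_scaled = scaler.transform(test)        # reuse the training statistics, no refit

print(train_scaled.mean(axis=0))  # approximately [0, 0]
print(train_scaled.std(axis=0))   # approximately [1, 1]
```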

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])
x_test[:, 3:] = sc.transform(x_test[:, 3:])

# scale the dependent variable too if needed

In [None]:
print(x_train)

In [None]:
print(x_test)

# Multiple Linear Regression

## Training the Multiple Linear Regression model with scikit-learn

In [5]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

## Predicting the Test set results

In [6]:
y_pred = regressor.predict(x_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[431.43 431.23]
 [458.56 460.01]
 [462.75 461.14]
 ...
 [469.52 473.26]
 [442.42 438.  ]
 [461.88 463.28]]


## Evaluating the Model Performance

In [7]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.9325315554761303
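
For reference, R² compares the model's squared error with that of always predicting the mean: R² = 1 - SS_res / SS_tot. A minimal sketch that computes it by hand and checks it against `r2_score`, using four prediction/actual pairs copied from the printout above as sample data (so the resulting number is not the notebook's score):

```python
import numpy as np
from sklearn.metrics import r2_score

# Four (predicted, actual) pairs taken from the printed comparison.
predicted = np.array([431.43, 458.56, 462.75, 469.52])
actual = np.array([431.23, 460.01, 461.14, 473.26])

ss_res = np.sum((actual - predicted) ** 2)      # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(r2_score(actual, predicted))  # same value as the manual computation
```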

# Polynomial Regression Model

## Training Polynomial Regression Model

In [8]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_reg = PolynomialFeatures(degree = 4)
x_poly = poly_reg.fit_transform(x_train)
regressor = LinearRegression()
regressor.fit(x_poly, y_train)
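
Polynomial regression here is still a linear model; `PolynomialFeatures` only expands the inputs into powers and cross-products first. A small sketch on a hypothetical two-feature sample shows the expansion for degree 2, and counts the columns produced for four features at degree 4 (the shape of the data used above):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical sample with features a = 2 and b = 3.
sample = np.array([[2.0, 3.0]])
expanded = PolynomialFeatures(degree=2).fit_transform(sample)
print(expanded)  # [[1. 2. 3. 4. 6. 9.]] i.e. 1, a, b, a^2, a*b, b^2

# Number of generated columns for four input features at degree 4.
n_terms = PolynomialFeatures(degree=4).fit_transform(np.zeros((1, 4))).shape[1]
print(n_terms)  # 70
```

The column count grows quickly with the degree, which is one reason a high degree can overfit.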

## Predicting the Test set results

In [9]:
y_pred = regressor.predict(poly_reg.transform(x_test))
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[433.94 431.23]
 [457.9  460.01]
 [460.52 461.14]
 ...
 [469.53 473.26]
 [438.27 438.  ]
 [461.67 463.28]]


## Evaluating the Model Performance

In [10]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.9458193219771797

# Support Vector Regression

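Here, Feature Scaling is fitted on the training portions of both X and y. The test features are transformed with the same X scaler at prediction time, the predictions are converted back to the original units, and y_test is left unscaled.
WARNING: Run this section last, since only this regression needs Feature Scaling; the other models do not.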
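
An alternative way to keep the scaling bookkeeping straight, sketched here as an assumption rather than as this notebook's method, is to let scikit-learn handle it: a pipeline scales the features, and `TransformedTargetRegressor` scales the target and undoes that scaling at prediction time. The data below is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the real dataset (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 2))
y_demo = 400 + 0.5 * X_demo[:, 0] - 0.2 * X_demo[:, 1] + rng.normal(0, 1, size=200)

model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf")),  # scales X, then fits SVR
    transformer=StandardScaler(),                                   # scales y, inverted on predict
)
model.fit(X_demo[:160], y_demo[:160])
svr_pred = model.predict(X_demo[160:])  # already back in the original units of y

print(svr_pred[:3])
```

Because the scalers live inside the model object, the same training data can be shared with the other regressors without being overwritten.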

## A few preprocessing necessities

In [21]:
y = y.reshape(len(y),1)

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)  # assign the test features to x_test, not x

## Feature Scaling (For reference)

In [22]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
x_train = sc_X.fit_transform(x_train)
y_train = sc_y.fit_transform(y_train)

## Training SVR Model

In [23]:
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(x_train, y_train.ravel())  # SVR expects a 1-D target; ravel() avoids the column-vector warning



## Predicting Test Set Results

In [24]:
y_pred = sc_y.inverse_transform(regressor.predict(sc_X.transform(x_test)).reshape(-1,1))
# the model was trained on scaled targets, so its predictions come out in scaled units;
# inverse_transform converts them back to the original units for comparison with y_test.
# if the targets had not been scaled, this step would not be necessary
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[434.05 431.23]
 [457.94 460.01]
 [461.03 461.14]
 ...
 [470.6  473.26]
 [439.42 438.  ]
 [460.92 463.28]]
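
The round trip through the target scaler can be checked in isolation. A minimal sketch, using four of the actual values from the printout above as sample data (the scaler name `demo_scaler` is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Sample targets as a column vector, the shape StandardScaler expects.
targets = np.array([[431.23], [460.01], [461.14], [473.26]])

demo_scaler = StandardScaler()
scaled = demo_scaler.fit_transform(targets)       # zero mean, unit variance
restored = demo_scaler.inverse_transform(scaled)  # back to the original units

print(np.allclose(restored, targets))  # True
```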


## Evaluating the model performance

In [25]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.9480784049986262

# Decision Tree Regression

## Training the Decision Tree model

In [11]:
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(x_train, y_train)
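
One property worth keeping in mind: a regression tree predicts a constant per leaf. With scikit-learn's default settings the tree is grown until each leaf holds a single training sample (when the features distinguish all samples), so every prediction is a target value already seen in training. A sketch on synthetic data, where the data and names are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature data with a noisy sine-shaped target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=50)

tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
new_points = np.array([[2.5], [5.0], [7.5]])
tree_pred = tree.predict(new_points)

# Check that each prediction coincides with one of the training targets.
matches = [bool(np.isclose(y_demo, p).any()) for p in tree_pred]
print(matches)
```

This is why a single unpruned tree can fit the training set perfectly yet generalize less smoothly than the other models.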

## Predicting the results

In [13]:
y_pred = regressor.predict(x_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[431.28 431.23]
 [459.59 460.01]
 [460.06 461.14]
 ...
 [471.46 473.26]
 [437.76 438.  ]
 [462.74 463.28]]


## Evaluating the performance

In [14]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.922905874177941

# Random Forest Regression

## Training the model

In [15]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(x_train, y_train)
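
A random forest regressor averages the predictions of its individual trees, and `n_estimators = 10` above means ten trees. The sketch below checks the averaging on synthetic data. The data and variable names are assumptions; `estimators_` is the fitted attribute through which scikit-learn exposes the individual trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic two-feature data with a noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 2))
y_demo = 2 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(0, 0.5, size=100)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

forest_pred = forest.predict(X_demo[:5])
# Average the ten per-tree predictions by hand.
tree_mean = np.mean([t.predict(X_demo[:5]) for t in forest.estimators_], axis=0)

print(len(forest.estimators_))              # 10
print(np.allclose(forest_pred, tree_mean))  # True
```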

## Predicting the results

In [16]:
y_pred = regressor.predict(x_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[434.05 431.23]
 [458.79 460.01]
 [463.02 461.14]
 ...
 [469.48 473.26]
 [439.57 438.  ]
 [460.38 463.28]]


## Evaluating Model Performance

In [17]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.9615908334363876