# Supervised learning

# Regression
Regression is the process of estimating the relationship between input data and continuous-valued output data. The output is usually a real number, and our goal is to estimate the underlying function that maps the inputs to the outputs.

# Generalized Linear Models 
The following are a set of methods intended for regression in which the target value is expected to be a linear combination of the input variables. 

In mathematical notation, if $\hat{y}$ is the predicted value, then:

$$
\hat{y}(w, x)=w_{0}+w_{1} x_{1}+\ldots+w_{p} x_{p}
$$

Across the module (`Scikit-Learn`), we designate the vector $w=(w_{1}, \ldots, w_{p})$ as `coef_` and $w_{0}$ as `intercept_`.
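As a minimal sketch of how these two attributes map onto the formula, the snippet below fits `LinearRegression` on a tiny made-up dataset and rebuilds the predictions by hand from `intercept_` and `coef_`. The arrays `X_toy` and `y_toy` are hypothetical and are not part of this notebook's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: 5 samples, 2 features, generated from y = 1 + 2*x1 + 3*x2
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 2.0], [4.0, 0.5]])
y_toy = 1.0 + 2.0 * X_toy[:, 0] + 3.0 * X_toy[:, 1]

model = LinearRegression().fit(X_toy, y_toy)

# y_hat = w0 + w1*x1 + ... + wp*xp, written with the fitted attributes
manual_pred = model.intercept_ + X_toy @ model.coef_

# The hand-built predictions should agree with model.predict
print(np.allclose(manual_pred, model.predict(X_toy)))
```

Because the toy targets are noise-free, the fitted `coef_` and `intercept_` should recover the generating weights up to floating-point error.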

# Ordinary Least Squares
An interactive visualization is available at: `http://setosa.io/ev/ordinary-least-squares-regression/`

`LinearRegression` fits a linear model with coefficients $w=(w_{1}, \ldots, w_{p})$ to minimize the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation. Mathematically, it solves a problem of the form:

$$
\min _{w}\|X w-y\|_{2}^{2}
$$

However, coefficient estimates for Ordinary Least Squares rely on the independence of the model terms. When terms
are correlated and the columns of the design matrix X have an approximate linear dependence, the design matrix
becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the
observed response, producing a large `variance`.
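The sensitivity described above can be sketched with NumPy alone. The example below is an illustration under assumed, synthetic data: it builds one design matrix with independent columns and one where a column is almost a copy of another, compares their condition numbers, and solves the least-squares problem for both with `np.linalg.lstsq`. All variable names and the helper `design` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features: x2_corr is almost an exact copy of x1
x1 = rng.normal(size=50)
x2_indep = rng.normal(size=50)
x2_corr = x1 + 1e-6 * rng.normal(size=50)

def design(*cols):
    # Prepend a column of ones so the first weight plays the role of w0
    return np.column_stack([np.ones(len(cols[0])), *cols])

X_indep = design(x1, x2_indep)
X_corr = design(x1, x2_corr)

# A large condition number means the matrix is close to singular
cond_indep = np.linalg.cond(X_indep)
cond_corr = np.linalg.cond(X_corr)
print(cond_indep, cond_corr)

# Solve min_w ||Xw - y||^2 for both designs; y depends on x1 only
y_ols = 1.0 + 2.0 * x1 + rng.normal(scale=0.1, size=50)
w_indep, *_ = np.linalg.lstsq(X_indep, y_ols, rcond=None)
w_corr, *_ = np.linalg.lstsq(X_corr, y_ols, rcond=None)
print(w_indep)  # close to the generating weights [1, 2, 0]
print(w_corr)   # individual weights on x1 and x2_corr are unstable
```

In the correlated case only the sum of the two weights is well determined, so small noise in the response can move the individual coefficients by large amounts.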

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Importing our dataset from CSV file:
dataset = pd.read_csv("../input/student-scores/student_scores.csv")

# Now let's explore our dataset:
dataset.shape

In [None]:
# Let's take a look at what our dataset actually looks like:
dataset.head()

In [None]:
# To see statistical details of the dataset, we can use describe():
dataset.describe()

In [None]:
# Finally, let's plot the data points on a 2-D graph
# and see whether we can spot any relationship in the data by eye:

dataset.plot(x='Hours', y='Scores', style='o')
plt.title('Hours vs Score')
plt.xlabel('Hours Studied')
plt.ylabel('Percentage Score')
plt.show()


**From the graph above, we can clearly see that there is a positive linear relation between the number of hours studied and the percentage score.**

In [None]:
# Preparing our data:
# Divide the data into "attributes" and "labels". Attributes are the independent variables
# while labels are dependent variables whose values are to be predicted.

X = dataset.iloc[:, :-1].values  # all columns except the last one (stays 2-D, one row per sample)

y = dataset.iloc[:, 1].values  # second column only (index 1): the labels


In [None]:
''' 
The next step is to split this data into training and test sets. 
We'll do this by using Scikit-Learn's built-in train_test_split() method:
'''
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The call above assigns 80% of the data to the training set and 20% to the test set.
# The test_size parameter is where we specify the proportion of the test set.
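To make the proportions concrete, here is a small sketch on a made-up array of 10 samples. The names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr` and `y_te` are illustrative only and do not refer to this notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with a single feature, labels are 2 * feature
X_demo = np.arange(10).reshape(-1, 1)
y_demo = 2 * np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# With test_size=0.2, 2 of the 10 samples go to the test set
print(X_tr.shape, X_te.shape)  # (8, 1) (2, 1)
```

The rows are shuffled before splitting, but each feature row stays paired with its own label.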

In [None]:
# Training the Algorithm:

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In the theory section we said that a linear regression model finds the best values for the intercept (bias) and slope, which results in the line that best fits the data. To see the intercept and slope calculated by the linear regression algorithm for our dataset, execute the following code.

In [None]:
# To retrieve the intercept:
print(regressor.intercept_)

# For retrieving the slope (coefficient of x):
print(regressor.coef_)


This means that for every additional hour studied, the predicted score changes by about 9.91 percentage points.
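The interpretation of the slope can be checked directly: two predictions whose inputs are one unit apart differ by exactly the coefficient. The sketch below uses made-up hours and scores (not the notebook's dataset), and `toy_regressor` is a hypothetical name chosen to avoid clashing with `regressor`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical hours/scores pairs, for illustration only
hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
scores = np.array([20.0, 31.0, 39.0, 52.0, 60.0])

toy_regressor = LinearRegression().fit(hours, scores)

# Predict for two inputs that are exactly one hour apart
pred_a, pred_b = toy_regressor.predict(np.array([[2.5], [3.5]]))
slope_gap = pred_b - pred_a

# The gap between the two predictions equals the fitted slope
print(slope_gap, toy_regressor.coef_[0])
```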

In [None]:
# Making Predictions:
# Now that we have trained our algorithm, it's time to make some predictions.

y_pred = regressor.predict(X_test)    # The y_pred is a numpy array that contains all the predicted values.


In [None]:
# To compare the actual output values for X_test with the predicted values, execute the following script:

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df


In [None]:
# Plot actual value vs predicted one:

plt.scatter(X_test, y_test)
plt.plot(X_test, y_pred, color='red')

plt.title('Hours vs Percentage')
plt.xlabel('Hours Studied')
plt.ylabel('Percentage Score')
plt.show()

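The final step is to evaluate the performance of the algorithm.
This step is particularly important for comparing how well different algorithms perform on a particular dataset.
For regression algorithms, four evaluation metrics are commonly used: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE, the square root of the MSE), and the R² score.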

# Evaluating the Algorithm:

## Mean Absolute Error

$$
\mathrm{MAE}=\frac{1}{N} \sum_{i=1}^{N}\left|y_{i}-\hat{y}_{i}\right|
$$

## Mean Squared Error

$$
\mathrm{MSE}=\frac{1}{N} \sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}
$$
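The two formulas above translate almost line for line into NumPy. This is a small sketch on made-up values; `y_true_demo` and `y_hat_demo` are hypothetical arrays, not outputs of the model trained earlier. RMSE is simply the square root of the MSE.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true_demo = np.array([3.0, 5.0, 2.5, 7.0])
y_hat_demo = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true_demo - y_hat_demo

mae = np.mean(np.abs(errors))  # average absolute error
mse = np.mean(errors ** 2)     # average squared error
rmse = np.sqrt(mse)            # same units as the target

print(mae)  # 0.75
print(mse)  # 0.875
print(rmse)
```

Squaring penalizes the single large error (1.5) more heavily, which is why the MSE exceeds the MAE here even though most errors are below 1.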

## R² Score (`r2_score`)

$$
R^{2}=1-\frac{\mathrm{MSE}(\text { model })}{\mathrm{MSE}(\text { baseline })}
$$

**The MSE of the model is computed as above, while the MSE of the baseline is defined as:**

$$
\mathrm{MSE}(\text { baseline })=\frac{1}{N} \sum_{i=1}^{N}\left(y_{i}-\bar{y}\right)^{2}
$$
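As a sketch of the definition above, the snippet below computes R² by hand as one minus the ratio of the model's MSE to the MSE of a baseline that always predicts the mean of the observed values. The arrays are made up for illustration and all names are hypothetical.

```python
import numpy as np

# Hypothetical actual and predicted values
y_obs = np.array([3.0, 5.0, 2.5, 7.0])
y_fit = np.array([2.5, 5.0, 4.0, 8.0])

# MSE of the model's predictions
mse_model = np.mean((y_obs - y_fit) ** 2)

# MSE of the baseline that predicts the mean of y for every sample
mse_baseline = np.mean((y_obs - y_obs.mean()) ** 2)

r2 = 1.0 - mse_model / mse_baseline
print(r2)
```

An R² of 1 means the model's errors vanish, 0 means it does no better than predicting the mean, and negative values mean it does worse than that baseline.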

In [None]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('r2_score: ', metrics.r2_score(y_test,y_pred))

**Ideally, lower RMSE and higher R-squared values are indicative of a good model.**