## Linear Regression in Python with Scikit-Learn ##

reference : https://stackabuse.com/linear-regression-in-python-with-scikit-learn/
        
        
Supervised machine learning :
    
    1. Regression - predicts continuous value outputs

    2. Classification - predicts discrete outputs

Example : Simple linear regression

Linear regression models a linear relationship between two or more variables. Simple linear regression uses a single independent variable.

Determine the linear relationship between the number of hours a student studies and the percentage of marks that student scores in an exam.

Question : Given the number of hours a student prepares for a test, how high a score can the student achieve? Plot the independent variable (hours) on the x-axis against the dependent variable (percentage) on the y-axis. The fitted line is y = mx + b, where b is the intercept and m is the slope of the line. Linear regression fits the line that gives the smallest sum of squared errors.
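To make the slope m and intercept b concrete, here is a minimal sketch that computes both with the closed-form least-squares formulas. The data is invented for illustration (it is not the student scores dataset) and lies exactly on y = 10x + 2 so the result is easy to check.

```python
import numpy as np

# Hypothetical data, invented for illustration: scores = 10 * hours + 2
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([12.0, 22.0, 32.0, 42.0, 52.0])

# Closed-form least-squares estimates for the line y = m*x + b
x_mean, y_mean = hours.mean(), scores.mean()
m = np.sum((hours - x_mean) * (scores - y_mean)) / np.sum((hours - x_mean) ** 2)
b = y_mean - m * x_mean

print(f"slope m = {m:.2f}, intercept b = {b:.2f}")  # slope m = 10.00, intercept b = 2.00
```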


Definition of multiple linear regression : This is where there is more than one independent variable, e.g. predicting the price of a house based on its area, number of bedrooms, the average income of the people in the area, the age of the house and so on. So the dependent variable depends on several independent variables.
With two independent variables the fitted equation is a plane in three dimensions; with more, it is a hyperplane.
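A minimal sketch of multiple linear regression with Scikit-Learn, assuming three made-up house features (area, bedrooms, age). The prices are generated from a known rule, chosen only for illustration, so the fitted coefficients can be checked against it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical house data: columns are area (m^2), bedrooms, age (years)
X = np.array([
    [50, 1, 30],
    [70, 2, 20],
    [90, 3, 10],
    [110, 3, 5],
    [130, 4, 2],
], dtype=float)

# Prices generated from an invented rule:
# price = 1000*area + 5000*bedrooms - 200*age + 10000
y = 1000 * X[:, 0] + 5000 * X[:, 1] - 200 * X[:, 2] + 10000

model = LinearRegression().fit(X, y)
print(model.coef_)       # one coefficient per feature
print(model.intercept_)  # the constant term
```

Each coefficient is the change in predicted price for a one-unit change in that feature while the others are held fixed.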



Using Scikit-Learn : 

Example : predict the percentage of marks that a student is expected to score based on the number of hours studied


Steps involved : 

1. Importing libraries : 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Jupyter/IPython magic that shows plots inline; remove this line in a plain .py script
%matplotlib inline

2. Importing the Dataset 

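Download the dataset of interest, then load it. The r prefix makes the path a raw string so the backslashes are not treated as escape characters.

dataset = pd.read_csv(r'D:\Datasets\student_scores.csv')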

3. Exploration of the Dataset

dataset.shape - the number of rows and columns
dataset.head() - the first 5 records by default; pass a number to choose how many records you'd like to see
dataset.describe() - statistical details of the dataset

4. Simple plotting of the variables

dataset.plot(x='Hours', y='Scores', style='o')
plt.title('Hours vs Percentage')
plt.xlabel('Hours Studied')
plt.ylabel('Percentage Score')
plt.show()

NB : In the script above, we use the plot() function of the pandas dataframe and pass it the column names for the x coordinate and y coordinate, which are "Hours" and "Scores" respectively.


5. Preparing the Data 

The data needs to be divided into attributes (independent variables) and labels (dependent variables to be predicted).
To extract the attributes and labels, we use : 

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values

The attributes are stored in X : the slice :-1 selects every column except the last. The labels are stored in y : the index 1 selects the Scores column (the column at index 1).
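The effect of the two iloc selections can be checked on a tiny stand-in DataFrame. The three rows below are invented for illustration and only mimic the Hours/Scores layout of the real file.

```python
import pandas as pd

# Hypothetical stand-in for student_scores.csv with the same two columns
dataset = pd.DataFrame({"Hours": [2.5, 5.1, 3.2], "Scores": [21, 47, 27]})

X = dataset.iloc[:, :-1].values  # every column except the last -> 2-D array
y = dataset.iloc[:, 1].values    # the column at index 1 ("Scores") -> 1-D array

print(X.shape)  # (3, 1)
print(y.shape)  # (3,)
```

X stays two-dimensional (rows by features), which is the shape Scikit-Learn estimators expect for their inputs, while y is a flat array of labels.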

Next, split the data into training and test sets using Scikit-Learn's built-in train_test_split() function : 

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


- This puts 80 % of the data in the training set and 20 % in the test set (test_size=0.2).
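A small sketch of what test_size and random_state do, using ten invented rows rather than the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, one feature
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 10 * X.ravel() + 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))  # 8 2

# Repeating the call with the same random_state gives the same split
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
print(np.array_equal(X_test, X_test2))  # True
```

Fixing random_state makes the shuffle repeatable, which is useful when comparing runs.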


6. Training the Algorithm 

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

- The linear regression model finds the values of the intercept and slope that give the line of best fit for the data. These values can be retrieved using the following code : 


To retrieve the intercept:

print(regressor.intercept_)
The resulting value you see should be approximately 2.01816004143.

For retrieving the slope (coefficient of x):

print(regressor.coef_)
The result should be approximately 9.91065648.

Explanation : for each additional hour studied, the predicted score increases by about 9.91 percentage points.
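The intercept and slope quoted above are enough to make a prediction by hand, since the model is just score = intercept + slope * hours. A minimal sketch, where predict_score is a hypothetical helper name used only for this illustration:

```python
# Values quoted in these notes for the fitted model
intercept = 2.01816004143
slope = 9.91065648

def predict_score(hours):
    """Hypothetical helper: apply score = intercept + slope * hours."""
    return intercept + slope * hours

print(round(predict_score(5.0), 2))  # 51.57

# One extra hour changes the prediction by exactly the slope
print(round(predict_score(6.0) - predict_score(5.0), 2))  # 9.91
```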

7. Making Predictions 

We can now use the trained algorithm to make some predictions on the test data and see how well it predicts the percentage score.


y_pred = regressor.predict(X_test)
y_pred is a numpy array that contains the predicted values for all the input values in X_test.

To compare the actual output values for X_test with the predicted values, execute the following script:

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df

Though our model is not very precise, the predicted percentages are close to the actual ones.

Note:

The values in the columns above may be different in your case if you use a different random_state (or leave it out), because train_test_split then splits the data into train and test sets differently. With random_state=0, as in the code above, the split is the same on every run.



8. Evaluating the Algorithm  : 

The final step is to evaluate the performance of the algorithm. This step is particularly important for comparing how well different algorithms perform on a particular dataset. For regression algorithms, three evaluation metrics are commonly used:

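In the formulas below, y_i is the actual value, \hat{y}_i is the predicted value and n is the number of test samples.

Mean Absolute Error (MAE) is the mean of the absolute values of the errors. It is calculated as:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$$
Mean Squared Error (MSE) is the mean of the squared errors and is calculated as:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$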
Luckily, we don't have to perform these calculations manually. The Scikit-Learn library comes with pre-built functions that can be used to find out these values for us.
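For reference, the three definitions can still be written out by hand in a few lines of numpy. This is only a sketch on four invented actual/predicted pairs, not the tutorial's test data.

```python
import numpy as np

# Hypothetical actual and predicted values, invented for illustration
y_true = np.array([20.0, 30.0, 60.0, 70.0])
y_hat = np.array([22.0, 27.0, 62.0, 65.0])

errors = y_true - y_hat          # [-2, 3, -2, 5]

mae = np.mean(np.abs(errors))    # mean of absolute errors
mse = np.mean(errors ** 2)       # mean of squared errors
rmse = np.sqrt(mse)              # square root of the MSE

print(mae)             # 3.0
print(mse)             # 10.5
print(round(rmse, 2))  # 3.24
```

RMSE is in the same units as the target (percentage points here), which makes it easier to interpret than MSE.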

Let's find the values for these metrics using our test data. Execute the following code:

from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
The output will look similar to this (but probably slightly different):

Mean Absolute Error: 4.183859899
Mean Squared Error: 21.5987693072
Root Mean Squared Error: 4.6474476121

You can see that the value of root mean squared error is about 4.65, which is less than 10% of the mean value of the percentages of all the students, i.e. 51.48. This suggests that our algorithm did a decent job.
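The comparison against the mean score is a single division. A quick check using the two numbers quoted above (the 10% threshold is the rule of thumb used in these notes, not a formal test):

```python
# Values quoted in these notes
rmse = 4.6474476121
mean_score = 51.48

ratio = rmse / mean_score
print(f"RMSE is {ratio:.1%} of the mean score")  # RMSE is 9.0% of the mean score
print(ratio < 0.10)                              # True
```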