## Regression Template

* Import the libraries required for a basic regression template:
    * Data Analysis - NumPy, pandas
    * Machine Learning - scikit-learn
    * Data Visualization - Plotly

In [1]:
# Import all dependencies required for the problem.
from __future__ import print_function
from plotly.offline import iplot, init_notebook_mode

import numpy as np
import pandas as pd
import plotly.graph_objs as go

from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [2]:
# Enable Plotly's offline mode so figures render inside the notebook
init_notebook_mode(connected=True)
# Set a seed for random number generation for reproducible results
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

## Load and preview the Dataset

* Read the dataset from its source format (here, a CSV file).
* Preview the loaded dataset.

In [3]:
# Load the salary dataset using the pandas library
df = pd.read_csv('../../data/assignment/salary_dataset.csv')
df.head()

Unnamed: 0,YearsExperience,Salary
0,1.1,39343.0
1,1.3,46205.0
2,1.5,37731.0
3,2.0,43525.0
4,2.2,39891.0
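
The `Unnamed: 0` column in the preview looks like a row index that was saved into the CSV. The sketch below assumes that is the case. It reads the five previewed rows from an in-memory string (a stand-in for the real file) and shows how `index_col=0` keeps that column out of the data columns.

```python
import io

import pandas as pd

# Stand-in for the real file: the five previewed rows, with an unnamed
# first column as pandas writes it when the index is saved (assumption).
csv_text = """,YearsExperience,Salary
0,1.1,39343.0
1,1.3,46205.0
2,1.5,37731.0
3,2.0,43525.0
4,2.2,39891.0
"""

# Default read: the unnamed first column becomes a data column 'Unnamed: 0'.
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats the first column as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

With the index handled at load time, any later column slice only sees `YearsExperience` and `Salary`.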


## Split the dataset into "what needs to be predicted", "what it should learn from"
* "What needs to be predicted" - Dependent Variable, Target Variable, y
* "What it should learn from" - Independent Variables, Features, x

In [4]:
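# Features: every column up to and including 'YearsExperience'.
# Note: this label slice also selects the 'Unnamed: 0' column shown in the preview;
# use df[['YearsExperience']] to keep years of experience as the only feature.
x = df.loc[:,:'YearsExperience']
# Target: the salary to predict
y = df['Salary']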

## Split the dataset into training, testing set
* The training set is the data the Machine Learning model learns from.
* The testing set is held back to measure how well the trained model performs on data it has not seen.

In [5]:
# Hold out 20% of the rows for testing; the model learns from the remaining 80%.
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=RANDOM_SEED)
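
To see what the split produces, here is a minimal sketch on a hypothetical 10-row frame (the `demo` data is made up for illustration and is not the salary dataset). With `test_size=0.2`, 8 rows go to training and 2 to testing, and a fixed `random_state` makes the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame standing in for the salary data.
demo = pd.DataFrame({
    "YearsExperience": np.arange(1, 11, dtype=float),
    "Salary": np.arange(1, 11, dtype=float) * 1000.0,
})

X_demo = demo[["YearsExperience"]]  # 2-D feature frame
y_demo = demo["Salary"]             # 1-D target series

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Repeating the call with the same seed selects the same rows.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
print(list(X_te.index) == list(X_te2.index))
```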

## Scaling the dataset
* Scale all features; the intended feature here is years of experience.
* Scaling puts all features on the same scale, so that no single feature affects the algorithm differently just because of its range of values.

In [6]:
scaler = preprocessing.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
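
As a small check of what `StandardScaler` does, the sketch below uses made-up values (the `*_demo` arrays are illustrative only). The scaler learns the mean and standard deviation from the training data, and the same statistics are then reused for the test data, which is why `fit_transform` is called on the training set and only `transform` on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative single-feature data, not taken from the salary dataset.
X_train_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test_demo = np.array([[5.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(X_train_demo)  # learn mean/std, then scale
test_scaled = scaler_demo.transform(X_test_demo)        # reuse the training mean/std

# After scaling, the training feature has mean 0 and standard deviation 1.
print(train_scaled.mean(), train_scaled.std())

# The test value is scaled with the training statistics: (5 - mean) / std.
manual = (5.0 - X_train_demo.mean()) / X_train_demo.std()
print(test_scaled[0, 0], manual)
```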

## Select the Machine Learning Model and train it
* Select a Machine Learning Model, for example one of:
    * http://scikit-learn.org/stable/modules/ensemble.html#regression
    * http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor
    * http://scikit-learn.org/stable/modules/tree.html#regression
    * http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor
    * http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR
    * http://scikit-learn.org/stable/modules/linear_model.html#ridge-regression
* Train the Machine Learning Model by calling its fit method on the training data


In [7]:
# Baseline model; swap in other regressors here to try to reduce the error
regressor = LinearRegression()
regressor.fit(X_train_scaled, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
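
One way to try several of the models linked above is to loop over candidates and compare their test error. The sketch below is only an illustration: it uses synthetic data generated in the block (not the salary dataset), skips scaling for brevity, and the choice of candidates and their hyperparameters is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, roughly linear data standing in for experience vs. salary.
rng = np.random.RandomState(42)
X_syn = rng.uniform(1, 10, size=(60, 1))
y_syn = 25000 + 9000 * X_syn[:, 0] + rng.normal(0, 5000, size=60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42)

# Candidate regressors; hyperparameters are illustrative, not tuned.
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "DecisionTreeRegressor": DecisionTreeRegressor(max_depth=3, random_state=42),
}

rmse_by_model = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    rmse_by_model[name] = float(np.sqrt(mse))
    print("{}: RMSE = {:.2f}".format(name, rmse_by_model[name]))
```

Every candidate exposes the same `fit` and `predict` methods, so the rest of the template does not need to change when the model is swapped.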

## Evaluate the model for accuracy
* Compute the Root Mean Squared Error (RMSE) on the test set.
* The smaller the RMSE, the better. Try to improve on this baseline by bringing the error below 7059.

In [11]:
from math import sqrt
from sklearn.metrics import mean_squared_error
print("Root Mean Squared Error Linear Regression Model: {:.2f}".format(
    sqrt(mean_squared_error(y_test.values, regressor.predict(X_test_scaled)))
))

Root Mean Squared Error Linear Regression Model: 7059.04
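
RMSE is the square root of the mean of the squared differences between actual and predicted values, so it is expressed in the same units as the target (salary here). The sketch below works it out by hand on three made-up values and checks the result against `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration only.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_pred - y_true             # [-1, 0, 2]
squared_errors = errors ** 2         # [1, 0, 4]
mse_manual = squared_errors.mean()   # 5 / 3
rmse_manual = np.sqrt(mse_manual)

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print("manual: {:.4f}, sklearn: {:.4f}".format(rmse_manual, rmse_sklearn))
```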


In [13]:
# Plot the training data and the model's predictions against one feature column
def build_graph(column_interested, output_column):
    plot1 = go.Scatter(
        x = X_train[column_interested].values,
        y = y_train.values,
        name='Training Data',
        mode='markers',
        connectgaps=True
    )

    plot2 = go.Scatter(
        x = X_train[column_interested].values,
        y = regressor.predict(X_train_scaled),
        name='Model Prediction',
        connectgaps=True
    )

    fig = dict(data=[plot1, plot2], layout=dict(
        title='{} vs {}'.format(output_column, column_interested),
        xaxis=dict(title=column_interested),
        yaxis=dict(title=output_column)))
    return fig

iplot(build_graph('YearsExperience', 'Salary'))