## Regression Template

* Import the libraries required for the basic regression template.
* Libraries used:
    * Data Analysis - NumPy, Pandas
    * Machine Learning - scikit-learn
    * Data Visualization - Plotly

In [1]:
# Import all dependencies required for the problem.
from __future__ import print_function
from plotly.offline import iplot, init_notebook_mode

import numpy as np
import pandas as pd
import plotly.graph_objs as go

from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [2]:
# Enable Plotly's offline notebook mode
init_notebook_mode(connected=True)
# Set a seed for random number generation for reproducible results
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

## Load and preview the Dataset

* Read the dataset from its source format (a CSV file here).
* Preview the loaded dataset.

In [16]:
# Load the startup profit dataset using the Pandas library
df = pd.read_csv('../../data/assignment/startup_profit_dataset.csv')
df.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


## Split the dataset into "what needs to be predicted", "what it should learn from"
* "What needs to be predicted" - Dependent Variable, Target Variable, y
* "What it should learn from" - Independent Variables, Features, x

In [33]:
x = df.loc[:,:'State']
y = df['Profit']

## Categorical Data
* Convert textual data such as State into numeric form by creating a separate 0/1 column per category, e.g. columns California and New York.
* Separate columns are used because the states have no inherent order: encoding Florida as 1 and California as 2 would imply an ordering that the model should not pick up.

In [39]:
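# Encode State as 0/1 dummy columns; Florida is dropped as the reference category,
# so a row with 0 in both remaining columns is a Florida row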
encoded_states = pd.get_dummies(x['State']).drop(['Florida'], axis=1)
encoded_x = pd.concat([x.loc[:, :'Marketing Spend'], encoded_states], axis=1)
encoded_x.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,California,New York
0,165349.2,136897.8,471784.1,0,1
1,162597.7,151377.59,443898.53,1,0
2,153441.51,101145.55,407934.54,0,0
3,144372.41,118671.85,383199.62,0,1
4,142107.34,91391.77,366168.42,0,0
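
The encoding step above can also be sketched with the `drop_first` option of `pd.get_dummies`. This is a minimal illustration on a made-up four-row frame (the `toy` name and its values are assumptions, not part of the dataset). Note that `drop_first=True` drops the first category in sorted order, which is California here, whereas the notebook drops Florida explicitly; either way, k categories end up as k-1 columns.

```python
import pandas as pd

# Hypothetical toy frame; the rows are made up for illustration only.
toy = pd.DataFrame({"State": ["New York", "California", "Florida", "New York"]})

# One 0/1 column per category, minus the first category in sorted order
# ("California"), which becomes the implicit reference category.
dummies = pd.get_dummies(toy["State"], drop_first=True, dtype=int)
print(dummies)
```

The California row has 0 in both remaining columns, in the same way the Florida rows do in the notebook's output.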


## Split the dataset into training, testing set
* The training set is what the machine learning model learns from.
* The testing set is held out to estimate how well the model performs on unseen data.

In [40]:
# Split the dataset into train and test sets: learn from one, evaluate on the other.
X_train, X_test, y_train, y_test = train_test_split(
    encoded_x, y, test_size=0.2, random_state=RANDOM_SEED)
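
As a quick sanity check on what `train_test_split` does, the sketch below splits a small synthetic array (the `X_toy` and `y_toy` names and values are assumptions for illustration) and confirms that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_a, X_b, y_a, y_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(X_a.shape, X_b.shape)  # (8, 2) (2, 2)

# Repeating the call with the same random_state yields the same rows.
X_a2, X_b2, y_a2, y_b2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(np.array_equal(X_b, X_b2))  # True
```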

## Scaling the dataset
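* Scale all features, in this case R&D Spend, Administration, Marketing Spend and the encoded State columns.
* This ensures all features are on the same scale, so that no single feature influences the algorithm differently just because of its range of values.
* The scaler is fit on the training set only and then applied unchanged to the test set.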

In [41]:
scaler = preprocessing.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
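
To see the effect of `StandardScaler`, here is a minimal sketch on synthetic data (the `demo_*` names and the generated values are assumptions, not the notebook's data). After fitting on the training rows, each training column has mean roughly 0 and standard deviation roughly 1; the test rows are transformed with the training statistics, so their means are close to 0 but generally not exactly 0.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Two synthetic features on very different scales.
demo_train = np.column_stack([rng.normal(100000, 30000, 40), rng.normal(0.5, 0.2, 40)])
demo_test = np.column_stack([rng.normal(100000, 30000, 10), rng.normal(0.5, 0.2, 10)])

demo_scaler = StandardScaler()
demo_train_scaled = demo_scaler.fit_transform(demo_train)  # learn mean/std from train
demo_test_scaled = demo_scaler.transform(demo_test)        # reuse the train statistics

print(demo_train_scaled.mean(axis=0))  # approximately 0 per column
print(demo_train_scaled.std(axis=0))   # approximately 1 per column
print(demo_test_scaled.mean(axis=0))   # near 0, but not exactly
```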

## Select the Machine Learning Model and train the same
* Select Machine Learning Model
    * http://scikit-learn.org/stable/modules/ensemble.html#regression
    * http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor
    * http://scikit-learn.org/stable/modules/tree.html#regression
    * http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor
    * http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR
    * http://scikit-learn.org/stable/modules/linear_model.html#ridge-regression
* Train the machine learning model by calling its fit method with the training data.


In [42]:
# Try out other models to reduce the error
regressor = LinearRegression()
regressor.fit(X_train_scaled, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
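
One possible way to try several of the models listed above is to loop over candidate regressors and compare their test RMSE. The sketch below is only an illustration: it uses synthetic data from `make_regression` rather than the startup dataset, and the chosen models and hyperparameters are assumptions, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); swap in X_train_scaled / y_train in the notebook.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

rmse_by_model = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    rmse_by_model[name] = float(np.sqrt(mse))
    print("{}: RMSE = {:.2f}".format(name, rmse_by_model[name]))
```

Which model wins depends on the data, so the ranking on this synthetic set says nothing about the startup dataset.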

## Evaluate the model for accuracy
* Obtain the Root Mean Squared Error (RMSE) on the test set.
* The smaller the RMSE, the better. Try to improve the model by bringing the error below 9055, the value obtained with linear regression.

In [43]:
from math import sqrt
from sklearn.metrics import mean_squared_error
print("Root Mean Squared Error Linear Regression Model: {:.2f}".format(
    sqrt(mean_squared_error(y_test.values, regressor.predict(X_test_scaled)))
))

Root Mean Squared Error Linear Regression Model: 9055.96
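
For reference, RMSE is the square root of the mean of the squared differences between the true and predicted values. The sketch below computes it by hand with NumPy on made-up numbers (the `rmse` helper and the toy values are assumptions for illustration) and compares it with a naive baseline that always predicts the mean. In the toy example the mean is taken over the same values being scored; in practice the baseline would use the mean of the training targets.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical toy values.
y_true_demo = [100.0, 150.0, 200.0]
y_pred_demo = [110.0, 140.0, 200.0]

model_rmse = rmse(y_true_demo, y_pred_demo)
baseline_rmse = rmse(y_true_demo, [np.mean(y_true_demo)] * len(y_true_demo))

print("Model RMSE: {:.2f}".format(model_rmse))        # Model RMSE: 8.16
print("Baseline RMSE: {:.2f}".format(baseline_rmse))  # Baseline RMSE: 40.82
```

A model is only useful if its RMSE is clearly below that of such a baseline, and RMSE is expressed in the same units as the target (Profit in this notebook).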


In [49]:
def build_graph(column_interested, output_column):
    """Build a Plotly figure comparing training targets with model predictions for one feature."""
    plot1 = go.Scatter(
        x = X_train[column_interested].values,
        y = y_train.values,
        name='Training Data',
        mode='markers',
        connectgaps=True
    )

    plot2 = go.Scatter(
        x = X_train[column_interested].values,
        y = regressor.predict(X_train_scaled),
        name='Model Prediction',
        connectgaps=True,
        mode='markers'
    )

    fig = dict(data=[plot1, plot2], layout=dict(
        title='{} vs {}'.format(output_column, column_interested),
        xaxis=dict(title=column_interested),
        yaxis=dict(title=output_column)))
    return fig

In [50]:
iplot(build_graph('R&D Spend', 'Profit'))

In [52]:
iplot(build_graph('Marketing Spend', 'Profit'))