## Employee Salary Prediction

This project predicts an employee's salary from their years of experience. Here is a link to a detailed description of the data set.

### Instructions
* A series of questions and tasks follows.
* Answer each question by double-clicking the cell right after the question and typing in your answer.
* Wherever you see a comment that says "TODO", fill it in with code as guided by the comment.



**Question**: Enter the names of the students in your team.

**Answer**: 

#### Clone the workshop repository to access the data sets

In [None]:
!git clone https://github.com/KshitijKarthick/reva_ml_workshop

### Import all the libraries that you will use for the task

In [None]:
# Import all dependencies required for the problem.
from __future__ import print_function
from plotly.offline import iplot, init_notebook_mode

import numpy as np
import pandas as pd
import plotly.graph_objs as go

from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [None]:
# Initialize Plotly's notebook mode so figures render inline.
init_notebook_mode(connected=True)
# Set a seed for random number generation for reproducible results.
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

## Load and preview the Dataset

* Read the dataset from its CSV file.
* Preview the loaded dataset.

In [None]:
df = pd.read_csv('../../data/assignment/salary_dataset.csv')
# TODO: use df.head() to display the first 5 rows of the data set

## Split the dataset into "what needs to be predicted" and "what it should learn from"
* "What needs to be predicted" - Dependent Variable, y
* "What it should learn from" - Independent Variables, X

In [None]:
# Select the single feature by name; double brackets keep X as a 2-D DataFrame.
X = df[['YearsExperience']]
y = df['Salary']
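
As a rough illustration of the shapes this split produces, the sketch below uses a small made-up DataFrame. The values and the names `toy_df`, `X_toy` and `y_toy` are assumptions for demonstration only, not the workshop data. scikit-learn estimators expect the features as a 2-D table and the target as a 1-D array.

```python
import pandas as pd

# Hypothetical toy rows; the real values come from salary_dataset.csv.
toy_df = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0, 4.0],
    "Salary": [40000.0, 45000.0, 52000.0, 58000.0],
})

# Double brackets keep the features as a 2-D DataFrame (rows x columns).
X_toy = toy_df[["YearsExperience"]]
# Single brackets return a 1-D Series for the target.
y_toy = toy_df["Salary"]

print(X_toy.shape)  # (4, 1)
print(y_toy.shape)  # (4,)
```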

## Split the dataset into training and testing sets
* The training dataset is what the Machine Learning model learns from.
* The testing dataset is used to measure how well the model performs on data it has not seen.

In [None]:
# Split the dataset into training and testing sets: the model learns from one and is evaluated on the other.
X_train, X_test, y_train, y_test = 
# TODO: complete the above line of code by using the train_test_split function from sklearn to split the data set into training and testing datasets
# use test_size=0.2
# Find an example from here: 
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
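
For orientation, here is a minimal sketch of how `train_test_split` behaves on made-up data. The arrays `X_demo` and `y_demo` and the variable names are hypothetical; in the assignment you would pass `X` and `y` instead. Passing `random_state` is one way to make the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with one feature each.
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel() + 1.0

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (8, 1) (2, 1)
```

With 10 rows and `test_size=0.2`, 8 rows go to training and 2 to testing, and every row lands in exactly one of the two sets.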

## Scaling the dataset
* Scale all features; in this case there is a single feature, years of experience.
* Scaling puts all features on the same scale, so that no feature affects the algorithm differently just because of its range of values.

In [None]:
scaler = preprocessing.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
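
The sketch below shows, on made-up numbers, what `StandardScaler` does: it learns the mean and standard deviation from the training data only, then applies those same statistics to the test data. The arrays and the name `demo_scaler` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical years-of-experience values, one feature per row.
train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
test = np.array([[6.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = demo_scaler.transform(test)        # reuses the training statistics

print(demo_scaler.mean_)    # mean learned from the training data
print(train_scaled.mean())  # approximately 0 after scaling
print(train_scaled.std())   # approximately 1 after scaling
print(test_scaled)          # (6 - train mean) / train std
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into the model.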

### Train a regression model 

In [None]:
regressor = LinearRegression()
regressor.fit(X_train_scaled, y_train)

## Evaluate the model
* Compute the Root Mean Squared Error (RMSE) on the test set.
* The smaller the RMSE, the better.

In [None]:
from math import sqrt
from sklearn.metrics import mean_squared_error
print("Root Mean Squared Error Linear Regression Model: {:.2f}".format(
    sqrt(mean_squared_error(y_test.values, regressor.predict(X_test_scaled)))
))
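
To make the metric concrete, the following sketch computes the RMSE by hand on three made-up values and compares it with the scikit-learn route used above. The arrays `y_true` and `y_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# RMSE = sqrt(mean((y_true - y_pred)^2))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Here the squared errors are 1, 0 and 4, so the mean squared error is 5/3 and the RMSE is its square root, roughly 1.29. The RMSE is expressed in the same units as the target, which makes it easier to interpret than the mean squared error.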

In [None]:
def build_graph(column_interested, output_column):
    """Return a Plotly figure dict of the training data and the model's predictions."""
    plot1 = go.Scatter(
        x = X_train[column_interested].values,
        y = y_train.values,
        name='Training Data',
        mode='markers',
        connectgaps=True
    )

    plot2 = go.Scatter(
        x = X_train[column_interested].values,
        y = regressor.predict(X_train_scaled),
        name='Model Prediction',
        connectgaps=True
    )

    fig = dict(data=[plot1, plot2], layout=dict(
        title='{} vs {}'.format(output_column, column_interested),
        xaxis=dict(title=column_interested),
        yaxis=dict(title=output_column)))
    return fig

iplot(build_graph('YearsExperience', 'Salary'))

* Select one of the Machine Learning models below:
    * http://scikit-learn.org/stable/modules/ensemble.html#regression
    * http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor
    * http://scikit-learn.org/stable/modules/tree.html#regression
    * http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor
    * http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR
    * http://scikit-learn.org/stable/modules/linear_model.html#ridge-regression

* Build a model with it and try to reduce the Root Mean Squared Error further.

In [None]:
## TODO
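
One possible way to compare candidate models is sketched below on synthetic data. Everything here is an assumption for illustration: the generated numbers, the helper `rmse_of`, and the choice of `Ridge` as the alternative. It is not the expected solution, and the result on the real dataset may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the salary data (hypothetical numbers).
rng = np.random.RandomState(42)
X_syn = rng.uniform(1, 10, size=(50, 1))
y_syn = 25000 + 9000 * X_syn.ravel() + rng.normal(0, 5000, size=50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

def rmse_of(model):
    """Fit a scaler + model pipeline on the training split and return the test RMSE."""
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_tr, y_tr)
    return np.sqrt(mean_squared_error(y_te, pipeline.predict(X_te)))

results = {
    "LinearRegression": rmse_of(LinearRegression()),
    "Ridge(alpha=1.0)": rmse_of(Ridge(alpha=1.0)),
}
for name, score in results.items():
    print("{}: {:.2f}".format(name, score))
```

Evaluating every candidate on the same train/test split keeps the comparison fair. Any other regressor from the list above can be passed to the helper in the same way.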

### What to do after completing the assignment
* Press Ctrl+S to save your progress.
* Click on 'Share' at the top right corner.
* Enter these two email addresses: *upman16@gmail.com* and *kshitij.karthick@gmail.com*
* Click on done.