# Supervised Learning 
In machine learning, supervised learning is the task of learning a function that maps an input to an output based on example input-output pairs.

One of its applications is predicting numeric outcomes: we train an algorithm to learn how to map one or more inputs to a numeric outcome value.
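
To make the idea concrete, here is a minimal sketch using made-up numbers (the `spend` and `profit` values are hypothetical and not taken from the dataset used later). A straight line is fitted to a handful of input-output pairs and then used to predict the outcome for an unseen input.

```python
import numpy as np

# Hypothetical input-output pairs: spend (input) and profit (outcome)
spend = np.array([10.0, 20.0, 30.0, 40.0])
profit = np.array([25.0, 45.0, 65.0, 85.0])  # generated as profit = 2 * spend + 5

# "Training": find the slope and intercept of the best-fitting line
slope, intercept = np.polyfit(spend, profit, deg=1)

# "Prediction": apply the learned function to a new, unseen input
new_spend = 50.0
predicted_profit = slope * new_spend + intercept
print(round(slope, 2), round(intercept, 2), round(predicted_profit, 2))
```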

# Numeric Prediction
In this activity, we will become familiar with several models for **predicting numeric outcomes**.

In [None]:
# make sure to upload the dataset to your Colab folder before proceeding (see instructions in 1_GettingStarted)
# load the 50_Startups dataset as a pandas DataFrame
import pandas as pd
dataset = pd.read_csv('50_Startups.csv')

In [None]:
# if you are using the file from your Google Drive
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')  # mount Google Drive so the path below is accessible
dataset = pd.read_csv('/content/drive/MyDrive/MIS7720/50_Startups.csv')

In [None]:
# take a look at descriptive statistics
dataset.describe()

## Visually exploring our dataset
We will use the seaborn library to create some visualizations

In [None]:
# visually exploring the dataset
# the next few cells demonstrate several graphs that can be useful to visually explore your data
import seaborn as sb
# histplot replaces distplot, which is deprecated in recent seaborn versions
sb.histplot(dataset['R&D Spend'], kde=True)  # histogram for R&D spending

In [None]:
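import matplotlib.pyplot as plt
# histplot replaces distplot, which is deprecated in recent seaborn versions
sb.histplot(dataset['Administration'], kde=True)  # histogram for Administration spending
plt.figure()  # start a new figure so the two histograms are not drawn on the same axes
sb.histplot(dataset['Marketing Spend'], kde=True)  # histogram for Marketing spending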

In [None]:
#comparing R&D spending across states using violin plots
sb.violinplot(x="State",y="R&D Spend", data=dataset)

In [None]:
# comparing R&D spending across states using box plots
sb.boxplot(x="State",y="R&D Spend", data=dataset)

In [None]:
#scatter plot of Profits vs. Marketing Spending, colored by State
sb.scatterplot(x="Marketing Spend",y="Profit",hue="State", data=dataset)


In [None]:
dataset.head(5)

In [None]:
##preparing the data for model training
#dataset.iloc[0,3] #points to a specific [row,column] ; index starts at 0
#defining input and outcome variables
y = dataset[['Profit']]  #profit
#X = dataset.drop(labels=['Profit','State'], axis=1) #other variables
X = dataset.drop('Profit',axis=1) #other variables

In [None]:
#before encoding the State variable
X.head(3)

In [None]:
# The State variable is categorical,
# so we use one-hot (dummy) encoding to create a binary variable for each level of State
pd.get_dummies(X['State']).head(3)

In [None]:
#add the binary encoded State variables to our X variable
X=pd.concat([X,pd.get_dummies(X['State'])],axis=1) 
# drop the State column, since we now have the binary encoded vars
X.drop(['State'],axis=1,inplace=True)

In [None]:
#after encoding the State variable
X.head(3)

## Splitting the data into Train and Test sets

We split the dataset, keeping 20% for testing and the rest for training.
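
As a rough illustration of what different split proportions mean in row counts, the sketch below uses a placeholder table of 50 rows (assuming the startups file holds 50 rows, as its name suggests). The column names `x` and `y` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data with 50 rows standing in for the real dataset
demo = pd.DataFrame({"x": np.arange(50), "y": np.arange(50) * 2.0})

# Record (training rows, test rows) for two candidate test proportions
sizes = {}
for test_size in (0.2, 0.4):
    train_part, test_part = train_test_split(demo, test_size=test_size, random_state=123)
    sizes[test_size] = (len(train_part), len(test_part))

print(sizes)
```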

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)  #using same random_state value for replicability


### **Question 1:**
Do we have to use 20% for testing? Would you use a 40% test set for this dataset? Why?

answer here 

In [None]:
# let's take another look at the resulting datasets after splitting
X_train.head(2)

## Linear regression model

In [None]:
#### Fitting Multiple Linear Regression to the Training set  ####
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, Y_train)

In [None]:
# Predicting the Test set results
y_pred_lin = lin_reg.predict(X_test)

Importing model evaluation metrics from sklearn:
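
For reference, the sketch below computes the same kinds of metrics by hand with NumPy on a few made-up values and compares them with scikit-learn's functions. `y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the startups data.

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted outcome values
y_true_demo = np.array([100.0, 150.0, 200.0, 250.0])
y_pred_demo = np.array([110.0, 140.0, 210.0, 230.0])

errors = y_true_demo - y_pred_demo
mse = np.mean(errors ** 2)       # mean squared error
rmse = np.sqrt(mse)              # root mean squared error, in the outcome's units
mae = np.mean(np.abs(errors))    # mean absolute error
# R-squared: 1 minus the ratio of residual variation to total variation
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true_demo - y_true_demo.mean()) ** 2)

# Compare the hand-computed values with scikit-learn's
print(mse, metrics.mean_squared_error(y_true_demo, y_pred_demo))
print(mae, metrics.mean_absolute_error(y_true_demo, y_pred_demo))
print(r2, metrics.r2_score(y_true_demo, y_pred_demo))
```

RMSE and MAE are expressed in the same units as the outcome, which makes them easier to interpret than MSE.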

In [None]:
from sklearn import metrics
import math
# The coefficients
print('Coefficients: \n', lin_reg.coef_)
# The mean squared error
print("Mean squared error: %.2f" % metrics.mean_squared_error(Y_test, y_pred_lin))
print("Root Mean squared error: %.2f" % math.sqrt(metrics.mean_squared_error(Y_test, y_pred_lin)))
# The mean absolute error
print("Mean absolute error: %.2f" % metrics.mean_absolute_error(Y_test, y_pred_lin))
# R-squared: 1 is perfect prediction
print('R-squared: %.2f' % metrics.r2_score(Y_test, y_pred_lin))


### Using the trained model to predict Profit for a new input
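
Any new input passed to a trained model needs the same columns, in the same order, as the training data, including the encoded State columns. A minimal sketch of one way to align them, using a tiny made-up training table (all values, and the names `train`, `model` and `new_startup`, are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny made-up training table with one numeric and one categorical input
train = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 300.0, 400.0],
    "State": ["New York", "California", "Florida", "California"],
    "Profit": [150.0, 260.0, 340.0, 460.0],
})
X_demo = pd.concat([train[["R&D Spend"]], pd.get_dummies(train["State"])], axis=1)
model = LinearRegression().fit(X_demo, train["Profit"])

# A new observation: encode it, then align its columns with the training columns
new_startup = pd.DataFrame({"R&D Spend": [250.0], "State": ["Florida"]})
X_new_demo = pd.concat(
    [new_startup[["R&D Spend"]], pd.get_dummies(new_startup["State"])], axis=1
)
# States missing from the new row become all-zero columns
X_new_demo = X_new_demo.reindex(columns=X_demo.columns, fill_value=0)

print(model.predict(X_new_demo))
```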

In [None]:
#X_new=pd.read_csv('50_Startups_newinput.csv')
#print(X_new)
#lin_reg.predict(X_new)

## Decision Tree Regression model
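
A decision tree grown without a depth limit can fit its training data almost exactly, so training error alone says little about predictive performance. The sketch below illustrates this on synthetic data from `make_regression` (not the startups data); `max_depth=3` is an arbitrary value chosen only for comparison.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for a real dataset
X_syn, y_syn = make_regression(n_samples=200, n_features=4, noise=20.0, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=123)

full_tree = DecisionTreeRegressor(random_state=123).fit(X_tr, y_tr)
small_tree = DecisionTreeRegressor(max_depth=3, random_state=123).fit(X_tr, y_tr)

# R-squared on the training data vs. the test data for each tree
for name, tree in [("full", full_tree), ("max_depth=3", small_tree)]:
    print(name, round(tree.score(X_tr, y_tr), 3), round(tree.score(X_te, y_te), 3))
```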

In [None]:
#### Fitting Decision Tree Regression to the dataset  ###########
from sklearn.tree import DecisionTreeRegressor
DecTree_reg = DecisionTreeRegressor(random_state = 123)
DecTree_reg.fit(X_train, Y_train)

In [None]:
# evaluating Decision Tree Regression
y_pred_DT = DecTree_reg.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f" % metrics.mean_squared_error(Y_test, y_pred_DT))
print("Root Mean squared error: %.2f" % math.sqrt(metrics.mean_squared_error(Y_test, y_pred_DT)))
print("Mean absolute error: %.2f" % metrics.mean_absolute_error(Y_test, y_pred_DT))

## Random Forest Regression
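
A random forest regressor predicts by averaging the predictions of its individual decision trees. The sketch below checks this on synthetic data (not the startups data), using the fitted trees that scikit-learn exposes through the `estimators_` attribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data standing in for a real dataset
X_syn, y_syn = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_syn, y_syn)

# Average the per-tree predictions by hand
per_tree = np.array([tree.predict(X_syn) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# Check whether the manual average matches the forest's own prediction
print(np.allclose(manual_average, forest.predict(X_syn)))
```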

In [None]:
#### Fitting Random Forest Regression to the dataset ##########
from sklearn.ensemble import RandomForestRegressor
RandForest_reg = RandomForestRegressor(n_estimators = 10, min_samples_leaf=1, random_state = 0)
RandForest_reg.fit(X_train, Y_train)

In [None]:
# evaluating RandForest_reg
y_pred_RF = RandForest_reg.predict(X_test)
print("Mean squared error: %.2f" % metrics.mean_squared_error(Y_test, y_pred_RF))
print("Root Mean squared error: %.2f" % math.sqrt(metrics.mean_squared_error(Y_test, y_pred_RF)))
print("Mean absolute error: %.2f" % metrics.mean_absolute_error(Y_test, y_pred_RF))

## Support Vector Regression
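
SVR is sensitive to the scale of its inputs and target, which is why the cells in this section standardize both. As an alternative sketch (not the approach used in this notebook), scikit-learn's `make_pipeline` and `TransformedTargetRegressor` can bundle the scaling with the model, so that the scalers are fitted on the training data only and predictions come back in the original units. Synthetic data stands in for the startups data here.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for a real dataset
X_syn, y_syn = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=123)

svr_model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf")),  # scales the inputs
    transformer=StandardScaler(),  # scales the target and inverts it at predict time
)
svr_model.fit(X_tr, y_tr)

y_pred_demo = svr_model.predict(X_te)  # already back in the target's original units
print(y_pred_demo[:3])
```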

In [None]:
#########  Support Vector Regression #########
# Feature scaling: SVR is sensitive to the scale of the inputs and the target
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X_train_sc = sc_X.fit_transform(X_train)
y_train_sc = sc_y.fit_transform(Y_train)  # Y_train is a one-column DataFrame, so it is already 2D

In [None]:
# Fitting SVR to the dataset
from sklearn.svm import SVR
svr_reg = SVR(kernel = 'rbf')
svr_reg.fit(X_train_sc, y_train_sc.ravel())  # SVR expects a 1D target array

In [None]:
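# evaluating SVR regression
# reuse the scaler fitted on the training data; do not refit it on the test set
y_pred_sc = svr_reg.predict(sc_X.transform(X_test))
# undo the target scaling; inverse_transform expects a 2D array
y_pred_SVC = sc_y.inverse_transform(y_pred_sc.reshape(-1, 1)).ravel()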

print("Mean squared error: %.2f" % metrics.mean_squared_error(Y_test, y_pred_SVC))
print("Root Mean squared error: %.2f" % math.sqrt(metrics.mean_squared_error(Y_test, y_pred_SVC)))
print("Mean absolute error: %.2f" % metrics.mean_absolute_error(Y_test, y_pred_SVC))

## Comparing different models


In [None]:
import matplotlib.pyplot as plt
names = ['Linear Reg','DT Reg','RF reg','SVR']
predictions=[y_pred_lin,y_pred_DT,y_pred_RF,y_pred_SVC]
results = []

for y_pred in predictions:
  rmse=round(math.sqrt(metrics.mean_squared_error(Y_test, y_pred)),2)
  #mae=round(metrics.mean_absolute_error(Y_test, y_pred),2)
  results.append(rmse) #change rmse to mae 
  
# create a bar plot to compare values
fig = plt.figure()
fig.suptitle('Model RMSE Comparison')
ax = fig.add_subplot(111)
ax.bar(names, results)  # draw on the axes created above
plt.show()

### Question 2
Which model has the better predictive performance? You can run the next cell to see the RMSE values.

In [None]:
results

answer here

### Question 3
Use mean absolute error (MAE) as the model evaluation metric in the model comparison code cell above (see video instructions). Which model performs better in terms of MAE?

answer here