## Objectives and Approach

For this task, I build a model that predicts the data in *Column 12* of the provided datasets. To do this, I explore the data to identify useful features, pick and tune machine learning algorithms, then test and evaluate them. I arbitrarily hold out *0529test4.csv* and *0530test5.csv* as testing datasets, which leaves 7 training datasets and 2 testing datasets.

In [1]:
import pandas as pd

# read in all training datasets

df_0529test1 = pd.read_csv('0529test1.csv')
df_0529test2 = pd.read_csv('0529test2.csv')
df_0529test3 = pd.read_csv('0529test3.csv')

df_0530test1 = pd.read_csv('0530test1.csv')
df_0530test2 = pd.read_csv('0530test2.csv')
df_0530test3 = pd.read_csv('0530test3.csv')
df_0530test4 = pd.read_csv('0530test4.csv')


# combine the datasets into one dataframe

df = pd.concat([df_0529test1, df_0529test2, df_0529test3, df_0530test1,
                 df_0530test2, df_0530test3, df_0530test4])
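
The same loading step can be written as a loop over file names. This is a minimal sketch, assuming the CSV files sit in the working directory; the helper name `load_datasets` and the constant `TRAIN_FILES` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# File names taken from the cell above; the constant name is hypothetical.
TRAIN_FILES = [
    "0529test1.csv", "0529test2.csv", "0529test3.csv",
    "0530test1.csv", "0530test2.csv", "0530test3.csv", "0530test4.csv",
]

def load_datasets(paths):
    """Read each CSV in order and stack the rows into one dataframe."""
    # Like the cell above, this keeps each file's own row labels.
    return pd.concat([pd.read_csv(path) for path in paths])

# Usage (requires the CSV files to be present):
# df = load_datasets(TRAIN_FILES)
```

Keeping the file list in one place makes it easier to change which datasets are held out for testing.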

## Data Exploration

In [9]:
df.head()

Unnamed: 0,Column 1,Column 2,Column 3,Column 4,Column 5,Column 6,Column 7,Column 8,Column 9,Column 10,Column 11,Column 12
0,25.259884,Wed 05/29/19 10:07:53.000 AM,0,22.041807,20.503357,9.93066,7.78125,0.11977,82.0,82.0,48.96875,43.585938
1,25.259884,Wed 05/29/19 10:07:54.451 AM,0,22.041807,20.503357,9.93066,7.78125,0.11977,82.0,82.0,48.96875,43.585938
2,25.259884,Wed 05/29/19 10:07:54.452 AM,0,22.041807,20.503357,9.93066,7.78125,0.11977,82.0,82.0,48.96875,43.585938
3,25.259884,Wed 05/29/19 10:07:54.455 AM,0,22.041807,20.503357,9.93066,7.78125,0.11977,82.0,82.0,48.96875,43.585938
4,25.259884,Wed 05/29/19 10:07:54.457 AM,0,22.041807,20.503357,9.93066,7.78125,0.11977,82.0,82.0,48.96875,43.585938


In [5]:
print('number of rows:', df['Column 1'].count())

number of rows: 109074


### Interactive Visualizations

Using Plotly and Dash, I developed an interactive dashboard to automate my data exploration. The feature visualizations in the dashboard uncover insights that help me build better machine learning models and understand them. The dashboard can also show the dataset before and after outlier removal. It was deployed to the web using Heroku and Git (visit: https://emmaomere1.herokuapp.com/ ). Feel free to check out *app2.py* in the repository to view the Python script that was used to develop this app.

Please note: the deployed version of this app contains only 3 of the 7 training datasets, because plotting fewer data points lets the app load and update faster on user interaction. Exploring the 3 plotted training datasets reveals all the insights that I gained from investigating all 7.

### Target Exploration Summary: Outlier Detection and Removal

Within the "Target Outlier Detection and Removal" section of the dashboard, https://emmaomere1.herokuapp.com/ , I noticed some consistent outliers across the data. These outliers are restricted to the beginning of each test (Phase 0). I also noticed that Phase 0 is very short compared to every other Phase; the other Phases appear to run until the output reaches steady state. Because of the outliers and the short duration, I conclude that Phase 0 is only a setup for the other Phases that Apple is curious about, so I remove Phase 0 from all training and testing sets.
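
The removal of Phase 0 from the training data can be sketched as a row filter on the phase label, mirroring the filter applied to the testing data later in this notebook. This is a minimal sketch, assuming *Column 3* holds the phase label; the helper name `drop_phase` is hypothetical and the demo values are synthetic.

```python
import pandas as pd

def drop_phase(frame, phase=0, phase_col="Column 3"):
    """Return a copy of the frame without the rows of the given phase."""
    return frame[frame[phase_col] != phase].copy()

# Synthetic stand-in for one test run: three Phase 0 rows, then Phase 1 rows.
demo = pd.DataFrame({
    "Column 3": [0, 0, 0, 1, 1],
    "Column 12": [43.6, 12.0, 90.1, 38.2, 38.0],
})
trimmed = drop_phase(demo)
print(len(demo), "->", len(trimmed))
```

The filter has to run before *Column 3* is dropped from the dataframe, since it relies on the phase labels.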


#### Interesting Target Trend 

The 3rd training dataset reveals something interesting. During Phase 1 of this test, the initial output is less than 30. This is atypical because the initial value at this Phase is above 35 for all other training datasets. Hence, only during this test does Phase 1 have a positive slope; in the other training datasets, Phase 1 has a negative slope. The potential reason for this unique trend was identified during feature exploration, and is discussed below.
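
The sign of the Phase 1 slope can also be checked numerically rather than read off the dashboard. This is a rough sketch, assuming that row order within a file reflects time and that *Column 3* holds the phase label; the helper name `phase_slope` is hypothetical and the demo values are synthetic.

```python
import numpy as np
import pandas as pd

def phase_slope(frame, phase, phase_col="Column 3", target_col="Column 12"):
    """Least-squares slope of the target against row order within one phase."""
    values = frame.loc[frame[phase_col] == phase, target_col].to_numpy()
    if len(values) < 2:
        return float("nan")  # a line needs at least two points
    slope, _intercept = np.polyfit(np.arange(len(values)), values, deg=1)
    return slope

# Synthetic example: the target rises during Phase 1, so the slope is positive.
demo = pd.DataFrame({
    "Column 3": [0, 1, 1, 1, 1],
    "Column 12": [43.0, 28.0, 30.0, 32.0, 34.0],
})
print(phase_slope(demo, phase=1))
```

Applying the helper to each training file separately would show which tests have a rising Phase 1 and which have a falling one.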

### Feature Exploration Summary

The "Feature Exploration" section of the dashboard, https://emmaomere1.herokuapp.com/ , reveals trends and outliers. This gives me an idea of which features most impact the target.

- Column 11: Compared to all other features, this feature's trend best mirrors the trend of the target. It will likely be the most important to the trained algorithms.

- Columns 4 & 5: Though their magnitudes differ, these features have very similar trends. Phase 1 of training set 3 contains outliers for these features. It is plausible that these outliers have a strong impact on the target. If so, this would explain the "Interesting Target Trend" discussed above.

- Column 1: Just like Columns 4 & 5, there are outliers in Column 1. The outliers are visible within Phase 1 of training set 1. These outliers don't appear to impact the target. This leads me to assume that Column 1 may be less important to the trained algorithms than Column 4. 

- No features have constant values throughout the test. Such a feature would have been removed from the study because it contains no information.
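
The check for constant features in the last bullet can be sketched by counting the distinct values in each column. This is a minimal sketch; the helper name `constant_columns` and the demo column names are hypothetical, and the demo values are synthetic.

```python
import pandas as pd

def constant_columns(frame):
    """List the columns that hold a single distinct value (NaN counts as a value)."""
    distinct = frame.nunique(dropna=False)
    return distinct[distinct <= 1].index.tolist()

# Synthetic example: one varying column and one constant column.
demo = pd.DataFrame({
    "varying": [0.11, 0.12, 0.10],
    "constant": [82.0, 82.0, 82.0],
})
print(constant_columns(demo))
```

An empty list for the training dataframe would confirm that no feature is constant.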

## Feature Removal

- Although the time (Column 2) is useful for gaining insight through data visualization, I exclude it from the study. It is not obvious to me that time is useful for predictions here.

- The integers within Column 3 are labels for the test phases. This is discrete data, whereas regression algorithms are typically trained on continuous data, so I exclude Column 3 as well.

In [13]:
# drop columns 2 and 3 from the dataframe here

df.drop(columns = ['Column 2', 'Column 3'], inplace = True)

# Create a dataframe containing only features

df_features = df.loc[:, df.columns != 'Column 12']
df_features.head()

Unnamed: 0,Column 1,Column 4,Column 5,Column 6,Column 7,Column 8,Column 9,Column 10,Column 11
299,25.984465,18.162004,16.702856,23.36466,18.90625,0.113796,82.0,82.0,65.960938
300,26.631109,18.233017,16.856773,35.141407,29.960938,0.109063,82.0,82.0,65.582031
301,27.116058,18.301517,16.916975,30.75914,25.71875,0.115157,82.0,82.0,67.269531
302,27.145935,18.142601,16.837379,37.806789,32.453125,0.107264,82.0,82.0,68.265625
303,27.010479,18.230341,16.739904,37.806789,32.453125,0.107264,82.0,82.0,69.628906


## Pick and Tune Algorithms

For my model, I compare the performance of tuned Lasso Regression and Ridge Regression algorithms. I choose these algorithms because their regularization is designed to reduce overfitting. GridSearchCV is used to tune both algorithms: alpha is the varied input parameter, and 5-fold cross-validation selects the setting that maximizes R^2 (coefficient of determination).
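
For reference, the two objectives as documented by scikit-learn are shown below. The symbols are introduced here for illustration and do not appear in the notebook: $n$ is the number of samples, $X$ the feature matrix, $y$ the target, and $w$ the coefficient vector.

```latex
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1
\qquad
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_2^2
```

The Lasso loss is divided by $2n$ while the Ridge loss is not, so the alpha values of the two algorithms are on different scales and are not directly comparable.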

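As expected, the output of the code below suggests that *Column 11* is the most important feature for predicting the target. Both algorithms assign their two largest positive coefficients to *Column 11* and *Column 10*. Ridge Regression also assigns a large negative coefficient (about -0.65) to *Column 8*, which Lasso Regression shrinks to zero. Because the features are not standardized and *Column 8* takes small values (around 0.11), raw coefficient sizes are only a rough guide to importance. For Lasso Regression, the best alpha setting is 0.1. For Ridge Regression, the best alpha setting is 200. Both values lie at an edge of their search grids, so a wider grid may be worth trying.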

In [16]:
from sklearn import linear_model
from sklearn import model_selection

# lasso regression

lasso = linear_model.Lasso()
parameters = {'alpha': [1e-1, 1, 5, 10, 20]}

reg_lasso = model_selection.GridSearchCV(lasso, parameters, scoring = 'r2', cv=5)
reg_lasso.fit(df_features[df_features.columns], df['Column 12'])

print('best lasso regression performance:', reg_lasso.best_score_)
print('best lasso regression parameter setting:', reg_lasso.best_params_)


# ridge regression

ridge = linear_model.Ridge()
parameters = {'alpha': [1e-1, 1.0, 5.0, 10.0, 20.0, 50.0, 100.0, 200.0]}

reg_ridge = model_selection.GridSearchCV(ridge, parameters, scoring = 'r2', cv=5)
reg_ridge.fit(df_features[df_features.columns], df['Column 12'])

print('best ridge regression performance:', reg_ridge.best_score_)
print('best ridge regression parameter setting:', reg_ridge.best_params_)

best lasso regression performance: 0.9590366981067651
best lasso regression parameter setting: {'alpha': 0.1}
best ridge regression performance: 0.9590137373681866
best ridge regression parameter setting: {'alpha': 200.0}


In [15]:
# employing the best parameter settings
# to reveal the most important features

# lasso regression
reg_l = linear_model.Lasso(alpha = 0.1)
reg_l.fit(df_features[df_features.columns], df['Column 12'])

print('most important features per lasso regression:', reg_l.coef_)

# ridge regression
reg_r = linear_model.Ridge(alpha = 200.0)
reg_r.fit(df_features[df_features.columns], df['Column 12'])

print('most important features per ridge regression:', reg_r.coef_)

most important features per lasso regression: [ 0.          0.07215365  0.         -0.06144051 -0.11918156 -0.
  0.01500647  0.24581384  0.35655008]
most important features per ridge regression: [ 0.09528398  0.03601126  0.04062941 -0.05816639 -0.12349886 -0.64700621
  0.01505164  0.27359149  0.3581574 ]
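
The printed arrays follow the column order of `df_features`, so pairing each coefficient with its feature name makes them easier to read. This is a minimal sketch using the Ridge coefficients copied from the output above; the helper name `rank_coefficients` is hypothetical. The features are not standardized, so the magnitudes depend on each column's units and the ranking is only a rough guide.

```python
import pandas as pd

# Feature order taken from the df_features header shown earlier.
FEATURES = ["Column 1", "Column 4", "Column 5", "Column 6", "Column 7",
            "Column 8", "Column 9", "Column 10", "Column 11"]

# Ridge coefficients copied from the output above.
RIDGE_COEF = [0.09528398, 0.03601126, 0.04062941, -0.05816639, -0.12349886,
              -0.64700621, 0.01505164, 0.27359149, 0.3581574]

def rank_coefficients(coef, names):
    """Pair each coefficient with its feature name, largest magnitude first."""
    series = pd.Series(coef, index=names)
    order = series.abs().sort_values(ascending=False).index
    return series.reindex(order)

ranked = rank_coefficients(RIDGE_COEF, FEATURES)
print(ranked)
```

In the notebook itself, `reg_r.coef_` and `df_features.columns` could be passed in place of the copied values.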


## Test Algorithms

Below, I read in all testing datasets, remove Phase 0, and evaluate the performance of both algorithms. With an R^2 value of 0.9345, Ridge Regression performs slightly better than Lasso Regression (0.9327). **Ridge Regression is the algorithm selected for Column 12 prediction**.

In [17]:
# read in testing datasets

df_0529test4 = pd.read_csv('0529test4.csv')
df_0530test5 = pd.read_csv('0530test5.csv')

# remove Phase 0

df_test = pd.concat([df_0529test4, df_0530test5])
df_test = df_test[df_test['Column 3'] != 0]

# test lasso regression

print('lasso regression performance on testing data:',
      reg_lasso.score(df_test[df_features.columns], df_test['Column 12']))

# test ridge regression

print('ridge regression performance on testing data:',
      reg_ridge.score(df_test[df_features.columns], df_test['Column 12']))

lasso regression performance on testing data: 0.9326526982513662
ridge regression performance on testing data: 0.9344935946217693
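
The scores above are R^2 values, since both searches were configured with `scoring = 'r2'`. As a reminder of what that number measures, here is a minimal sketch of the coefficient of determination computed by hand; the helper name `r_squared` is hypothetical and the inputs are synthetic.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Perfect predictions score 1; always predicting the mean scores 0.
print(r_squared([1, 2, 3], [1, 2, 3]))
print(r_squared([1, 2, 3], [2, 2, 2]))
```

A score near 0.93 therefore means the model leaves roughly 7% of the target's variance on the testing data unexplained.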
