<a href="https://colab.research.google.com/github/MatthewFried/Udemy/blob/master/Day1_Notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<br>
<br>
<br>
<br>

# __Module 1: Getting Started__
<br>
<br>
<br>

## What is the Life Cycle of a Typical Data Science Project?

__Step 1__: Define a question

__Step 2__: Identify and acquire the data

__Step 3__: Exploratory Data Analysis (EDA): deriving and interpreting relevant summary statistics (e.g., means, medians, quartiles, IQRs, standard deviations, variances) and producing appropriate exploratory graphics (including, but not limited to, histograms, bar plots, box plots, and correlation matrices).

__Step 4__: Data Preparation (e.g., clean the data; transform the data into an appropriate format; create visualizations)

__Step 5__: Data Splitting (e.g., separating data into training, validation, testing subsets)

__Step 6__: Model Training and Selection

__Step 7__: Model Testing

__Step 8__: Communicate your findings and revise model and/or data as needed (repeat Steps 2 - 7 as necessary)

__Step 9__: Model Deployment

<br>

Much of the time in a data science project is spent on Steps 1-4. Since the proper use of machine learning algorithms usually depends on the type of data to be analyzed, be *extremely* careful when choosing your data.

<br>

We will be using/doing the following:
  * Anaconda
  * Jupyter or Colab Notebooks 
  * Python programming, including how to write and use user-defined functions, comments, Matplotlib, Seaborn, etc.
  * Creating a narrative within our analysis
  * Uploading data and outputs to GitHub

## __Reference Texts__

-	_Feature Engineering and Selection_ (Kuhn, Johnson), CRC Press


-	_Machine Learning Pocket Reference_ (Matt Harrison), O’Reilly
https://github.com/mattharrison/ml_pocket_reference (link to all Python code examples provided in textbook)


-	_Hands-On Machine Learning with Scikit-Learn, Keras, and Tensorflow, 2nd Edition_ (Aurelien Geron) O’Reilly
https://github.com/ageron/handson-ml2 (link to sample Python code examples provided in textbook)


-	_Data Science From Scratch, 2nd Edition_ (Joel Grus), O’Reilly
https://github.com/joelgrus/data-science-from-scratch (link to sample Python code examples provided in textbook)


-	_Data Science for Business_ (Provost, Fawcett) O’Reilly


## __Keywords__

- An __"attribute"__ (a.k.a. __"feature"__) is a data field or variable that represents a single aspect of an observation. For example, if we are recording meteorological data via a weather station, we would be collecting attributes such as temperature, wind speed, humidity, and ambient air pressure. In a tabular data set, each column represents a distinct variable, referred to as an __"attribute"__ or __"feature"__.


- An __"observation"__ is a collection of attributes that together describe a single data record. In other words, it is a row. (__NOTE__: Some scientists also refer to such collections of attributes as __"feature vectors"__ or __"attribute vectors"__.)


- A __"response" (aka "Dependent")__ variable describes the output attribute of a model. For example, when constructing a predictive model, the variable we are attempting to predict would be identified as the __"response"__ variable.


- By contrast, __"Explanatory" (aka "Independent" or "Predictor")__ variables are the input variables we use to estimate the response variable.
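
To make these keywords concrete, here is a minimal sketch that maps them onto a pandas DataFrame. The column names and every reading are made up for illustration, and treating temperature as the response variable is an assumption chosen only for this example.

```python
import pandas as pd

# Hypothetical weather-station readings; every value is made up for illustration.
weather = pd.DataFrame(
    {
        "temperature_c": [21.5, 19.0, 23.1, 18.4],
        "wind_speed_kmh": [12.0, 20.5, 8.3, 15.1],
        "humidity_pct": [40.0, 55.0, 35.0, 60.0],
        "pressure_hpa": [1012.0, 1008.5, 1015.2, 1006.9],
    }
)

# Each column is an attribute (feature); each row is an observation.
n_observations, n_attributes = weather.shape

# Suppose we want to predict temperature: it becomes the response variable...
y = weather["temperature_c"]
# ...and the remaining columns are the explanatory variables.
X = weather.drop(columns="temperature_c")

# A single observation's explanatory values, viewed as a feature vector.
first_feature_vector = X.iloc[0].to_numpy()

print(n_observations, n_attributes)  # 4 4
print(list(X.columns))
print(first_feature_vector)
```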

<br>

## __Types of Machine Learning Systems__

- __Supervised Learning__: The data set used to train the machine learning model __includes__ "labels" for the response variable for each observation. Examples include regression models, support vector machines, decision trees, Naive Bayes models, K-nearest neighbor models, and many neural network algorithms.


- __Unsupervised Learning__: The data for the model contains no data values for the desired response variable. Examples include clustering algorithms and some neural network algorithms. 


- __Semisupervised Learning__: The data set used to train the machine learning model is __partially labeled__, i.e., some observations contain the data value for the desired response variable while some do not. Most semisupervised machine learning algorithms are constructed by combining supervised and unsupervised algorithms. Examples include deep belief networks (DBNs) and restricted Boltzmann machines (RBMs).


- __Reinforcement Learning__: Reinforcement learning is much more computationally complex than supervised or unsupervised learning. It requires the implementation of an "agent" that can observe its environment and then initiate actions in response to what it has observed. Examples of reinforcement learning include some robotics applications that incorporate sensors and/or machine vision components. 
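
The difference between supervised and unsupervised learning shows up directly in what we pass to the algorithm. The sketch below uses synthetic data (an assumption, not course data): a linear regression is given both `X` and labels `y`, while K-means clustering is given `X` only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic explanatory variable: two well-separated groups of 50 points each.
X = np.concatenate(
    [rng.normal(0.0, 0.5, 50), rng.normal(10.0, 0.5, 50)]
).reshape(-1, 1)

# Supervised learning: a label (response value) is available for every observation.
y = 3.0 * X.ravel() + 2.0 + rng.normal(0.0, 0.1, 100)
supervised_model = LinearRegression().fit(X, y)
print("estimated slope:", supervised_model.coef_[0])

# Unsupervised learning: the algorithm sees X only and must find structure itself.
unsupervised_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_labels = unsupervised_model.labels_
print("cluster sizes:", np.bincount(cluster_labels))
```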

<br>

### __Online vs. Batch Machine Learning Systems__

- __Online__ machine learning systems learn incrementally from a continuous stream of incoming data. Capital market feeds would be an example of this. 


- By contrast, __batch__ machine learning systems require the inputting of all available data for purposes of model training.
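
A minimal sketch of the contrast, assuming a simulated data stream: scikit-learn's `SGDRegressor` can be updated incrementally with `partial_fit()`, whereas `LinearRegression` is fit once on all of the data. The helper `make_batch()` and the constant learning rate are illustrative choices, not tuned or recommended values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

rng = np.random.default_rng(0)

def make_batch(n_rows):
    """Simulate one chunk of a data stream: y = 2x + 1 plus a little noise."""
    X = rng.normal(0.0, 1.0, size=(n_rows, 1))
    y = 2.0 * X.ravel() + 1.0 + rng.normal(0.0, 0.1, size=n_rows)
    return X, y

# Online learning: update the model one mini-batch at a time with partial_fit().
online_model = SGDRegressor(learning_rate="constant", eta0=0.05, random_state=0)
seen_X, seen_y = [], []
for _ in range(50):
    X_batch, y_batch = make_batch(20)
    online_model.partial_fit(X_batch, y_batch)
    seen_X.append(X_batch)
    seen_y.append(y_batch)

# Batch learning: fit once on all of the data collected so far.
batch_model = LinearRegression().fit(np.vstack(seen_X), np.concatenate(seen_y))

print("online slope:", online_model.coef_[0])
print("batch slope: ", batch_model.coef_[0])
```

Both models should land near the true slope of 2; the online model gets there without ever holding the full data set in memory at once.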

<br>

### __Instance Based vs. Model Based Machine Learning Systems__

- __Instance Based__ machine learning systems rely on __similarity metrics__ for purposes of generating predictions for previously unseen data observations. Examples of __instance based__ machine learning algorithms include K-nearest neighbors and K-means clustering.


- By contrast, __model based__ machine learning algorithms require that we embed all of our assumptions about the problem we are trying to solve within the form of a model and then use that model to make predictions for the response variable of a previously unseen data observation.
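
The following sketch contrasts the two approaches on a tiny made-up data set. K-nearest neighbors predicts from the most similar stored observations, while linear regression predicts from fitted parameters; the data values and the choice of 2 neighbors are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Tiny made-up data set that follows y = 2x exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
X_new = np.array([[10.0]])  # a previously unseen observation

# Instance based: the prediction is the mean response of the 2 most similar
# (closest) training observations, here x = 3 and x = 4.
knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)
knn_prediction = knn.predict(X_new)[0]

# Model based: the prediction comes from the fitted parameters of y = a*x + b.
linear = LinearRegression().fit(X, y)
linear_prediction = linear.predict(X_new)[0]

print(knn_prediction)     # 7.0
print(linear_prediction)  # approximately 20.0
```

The instance based prediction can never leave the range of responses it has stored, while the model based prediction extrapolates according to the assumed linear form.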

## Main Challenges of Machine Learning

- __Insufficient Data__: Models trained on insufficient data can be highly variable and inaccurate.

- __Nonrepresentative Training Data__: If the data we train a model on is not representative of the data it will actually be applied to, the model will be highly ineffective.


- __Sampling Bias__: Sampling bias arises when a model is trained on a non-random sample of a population, so that certain sub-populations of the broader population are underrepresented or overrepresented.

- __Sampling Noise__: If our model is trained on a data set that is too small to accurately reflect the broader population, it will be ineffective when applied to previously unseen data from that broader population.

- __Poor Quality Data__: If our training data is representative but suffers from too many missing or invalid data values, the resulting model is likely to be ineffective.

- __Irrelevant Features__: A model trained on features that have little or no relationship to the response variable is likely to be ineffective, since it may fit noise rather than signal.

- __Overfitting__: Our model is highly accurate when applied to our training data but relatively inaccurate when applied to previously unseen data. This is typical of, for example, a high-degree polynomial regression fit to a small data set.


- __Underfitting__: Our model is too "simple" relative to the broader population of data we intend to apply it to and as a result its output is not very useful. This is typical when, for example, a linear regression is used on data whose underlying relationship is nonlinear (e.g., exponential).
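
Overfitting and underfitting can be seen by comparing training error with error on previously unseen data. The sketch below is an illustration under stated assumptions: the data is a synthetic noisy sine curve, the helper `sample()` is hypothetical, and the polynomial degrees 1, 3, and 15 are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def sample(n_rows):
    """Synthetic data: a sine curve plus noise (an assumption for illustration)."""
    X = rng.uniform(0.0, 1.0, size=(n_rows, 1))
    y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0.0, 0.2, size=n_rows)
    return X, y

X_train, y_train = sample(30)   # small training set
X_test, y_test = sample(200)    # previously unseen data

train_mse, test_mse = {}, {}
for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:>2}: train MSE = {train_mse[degree]:.3f}, "
          f"test MSE = {test_mse[degree]:.3f}")
```

Under these assumptions, the degree 1 model should show high error on both subsets (underfitting), while the degree 15 model should show the lowest training error, usually with a noticeably larger gap between training and test error (overfitting). Exact numbers depend on the random draw.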



# __Model Complexity__
### __Bias vs. Variance__

![Data](https://drive.google.com/uc?export=view&id=1SaZvz1Z2bKl-AcP2LjghM5XDctckgZJ3)

* Bias is the difference between the model's expected (average) prediction and the true value it is trying to predict.
* Variance measures how much the model's predictions fluctuate in response to the particular training data used, including its noise. A model with high variance learns too much from the training data; this is called overfitting.
<br>

>- A model with high bias error underfits the data because it makes overly simplistic assumptions about it
>- A model with high variance error overfits the data because it learns too much from it, including its noise
>- A good model balances bias and variance errors

![Data](https://drive.google.com/uc?export=view&id=1raQvJJcpMqz-9afJMqoG45DeS7EBLEz6)
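
For squared-error loss, the trade-off is often summarized by the standard bias-variance decomposition. The notation is an assumption introduced here, not taken from these notes: the response is $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, $\hat{f}$ is the fitted model, and expectations are taken over training sets and noise.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

Making a model more flexible typically lowers the first term and raises the second, which is why neither can be driven to zero independently.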




<br>
<br>
<br>

# __Training Data__
<br>
<br>
<br>

### Data Splitting: Training, Evaluation / Validation, and Testing Subsets

- Machine learning models should be trained and tested on distinct subsets of the available data.

- Set aside 20%-35% of the available data for model testing.


- There is no "ideal" testing/training split. In general, the larger the amount of data, the more you can safely set aside for model testing while reducing the chance of an ineffective model being produced.


- Use __random sampling__ to create the training and testing subsets unless you are working with temporal data (e.g., time series data).


- A portion of the training subset (e.g., a random sample of 5% - 10% of the training subset) should then be set aside for evaluating / validating the results of the model training process (hence the name __Evaluation__ or __Validation__ subset). There are several approaches to this.

- Instead of setting aside a distinct evaluation / validation subset, we can use __cross validation__.
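
One possible way to produce training, validation, and testing subsets is to call scikit-learn's `train_test_split()` twice. This is a sketch on synthetic stand-in data; the 30% and 10% proportions and the `random_state` value are illustrative choices within the ranges suggested above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 observations, 3 explanatory variables, 1 response.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["x1", "x2", "x3"])
y = pd.Series(rng.normal(size=1000), name="response")

# First split: hold out 30% of all observations for final model testing.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12
)

# Second split: set aside 10% of the training subset for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.1, random_state=12
)

print(len(X_train), len(X_val), len(X_test))  # 630 70 300
```

For temporal data, random sampling like this would leak future information into training; a split by time order would be needed instead.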

<br>

### __Automated Data Splitting via scikit-learn: An Example__

scikit-learn's __train_test_split()__ function can be used to create training and testing subsets. However, we must first separate our response variable from our explanatory variables.

In [None]:
# load the pandas library
import pandas as pd

# load the train_test_split function from the sklearn.model_selection module
from sklearn.model_selection import train_test_split

# start by reading a set of sample data from GitHub; this data set describes wines
filename = "https://raw.githubusercontent.com/MatthewFried/DAV6150/master/M3_Data.csv"
df = pd.read_csv(filename)

# keep an independent copy as a safety for later; plain assignment (df_copy = df)
# would only create a second name for the same DataFrame
df_copy = df.copy()

# preview the first five rows
df.head()

Unnamed: 0,INDEX,TARGET,FixedAcidity,VolatileAcidity,CitricAcid,ResidualSugar,Chlorides,FreeSulfurDioxide,TotalSulfurDioxide,Density,pH,Sulphates,Alcohol,LabelAppeal,AcidIndex,STARS
0,1,3,3.2,1.16,-0.98,54.2,-0.567,,268.0,0.9928,3.33,-0.59,9.9,0,8,2.0
1,2,3,4.5,0.16,-0.81,26.1,-0.425,15.0,-327.0,1.02792,3.38,0.7,,-1,7,3.0
2,4,5,7.1,2.64,-0.88,14.8,0.037,214.0,142.0,0.99518,3.12,0.48,22.0,-1,8,3.0
3,5,3,5.7,0.385,0.04,18.8,-0.425,22.0,115.0,0.9964,2.24,1.83,6.2,-1,6,1.0
4,6,4,8.0,0.33,-1.26,9.4,,-167.0,108.0,0.99457,3.12,1.77,13.7,0,9,2.0


In [None]:
# how many observations are contained within the example data set?
len(df)

12795

In [None]:
# check for missing values
df.isnull().sum()

INDEX                    0
TARGET                   0
FixedAcidity             0
VolatileAcidity          0
CitricAcid               0
ResidualSugar          616
Chlorides              638
FreeSulfurDioxide      647
TotalSulfurDioxide     682
Density                  0
pH                     395
Sulphates             1210
Alcohol                653
LabelAppeal              0
AcidIndex                0
STARS                 3359
dtype: int64

In [None]:
# dropping rows is not recommended in general, and we will learn other (better) ways to deal with missing data,
# but as a first step we will drop every row that contains at least one missing value
df = df.dropna(how='any', axis=0)
len(df)

6436

In [None]:
#set up an X and y
#we make sure to keep everything as a data frame for simplicity
y = df[['TARGET']].copy()
X = df.drop('TARGET', axis = 1)

In [None]:
# Now split the data into training and testing subsets.
# We'll set aside 30% of the data for testing purposes. Remember to specify a value for random_state
# if you want to be able to reproduce the exact same training + testing subsets repeatedly
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)
print("X_train size: " ,len(X_train))
print("X_test size: ", len(X_test))

X_train size:  4505
X_test size:  1931


We first separated the response variable from the explanatory variables and then used the train_test_split() function. We have set aside 30% of the data for testing purposes.

__NOTE__: We opted to explicitly make a copy of the original data set and delete the response variable from that copy. However, an alternative approach would be to simply create a new dataframe object containing only those attributes you plan to make use of as explanatory variables within your model. There is no need to explicitly create a copy of the entire data set every time you want to split your data into training + testing subsets.

## __Cross Validation__

Cross-validation uses __resampling__ of training data to evaluate the performance of machine learning models on a limited data sample. This allows us to avoid the need for the creation of a distinct evaluation / validation subset.


The most common approach to cross validation works as follows:


- __Step 1__: Split the training data into K non-overlapping subsets of roughly equal size, typically after shuffling the observations. These subsets are referred to as "__folds__".


- __Step 2__: The model is then trained using __K-1__ of the folds as inputs to the training process. The fold not used for model training is used to evaluate the performance of the model.


- __Step 3__: Model performance metrics are recorded.


- __Step 4__: A different fold is then selected for use as the validation subset.


- __Step 5__: Repeat Steps 2 through 4 until each of the "K" folds has been used as the validation subset. At that point you will have trained and evaluated the model "K" number of times on "K" different subsets of the training data.


This process is referred to as "__K-Fold Cross Validation__".


This process can be automated using the __cross_val_score()__ function contained within the scikit-learn library.
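
The steps above can be sketched as an explicit loop and compared with `cross_val_score()`. This is an illustration on synthetic data, not the course data set; K = 5, the shuffling, and the `random_state` values are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for a training subset.
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# Step 1: split the training data into K folds (shuffled here for illustration).
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, val_idx in k_fold.split(X):
    # Step 2: train on K-1 folds...
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # Step 3: ...and record performance (R^2) on the fold that was held out.
    manual_scores.append(model.score(X[val_idx], y[val_idx]))
    # Steps 4-5: the loop moves on to the next fold until all K have been used.

# cross_val_score() automates the same loop when given the same folds.
automated_scores = cross_val_score(LinearRegression(), X, y, cv=k_fold)

print(np.round(manual_scores, 4))
print(np.round(automated_scores, 4))
```

Because both runs use the same fold definitions, the two arrays of scores should match.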

<br>

#### What K should we choose?

1. The value for "K" is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.


2. K = 10 is a common choice; it has been found empirically to give performance estimates with low bias and modest variance.


3. Setting "K" = n, where n is the number of observations in the training data, gives each observation in the training data set a turn as the model validation data set. This approach is called "__leave-one-out cross-validation__" (LOOCV).
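
As a sketch of LOOCV in scikit-learn, the `LeaveOneOut` splitter can be passed as the `cv` argument. The data here is synthetic and small on purpose, since LOOCV fits one model per observation. Note the assumption about scoring: R^2 cannot be computed on a single held-out observation, so this example scores each fold with negated mean squared error.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic training set; LOOCV fits one model per observation.
X, y = make_regression(n_samples=30, n_features=2, noise=5.0, random_state=0)

# R^2 is undefined on a single held-out observation, so we score each fold
# with (negated) mean squared error instead.
scores = cross_val_score(
    LinearRegression(), X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error"
)

print(len(scores))     # 30, one score per observation
print(-scores.mean())  # average squared prediction error across folds
```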

<br>

#### Variations of K-Fold Cross Validation

- __Stratified Cross Validation__: In stratified cross validation we split the data into folds based on user-specified criteria. A common use of this technique is ensuring that each fold contains approximately the same proportion of observations for each value of a given categorical variable as the full data set does.
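
A minimal sketch of stratification using scikit-learn's `StratifiedKFold`, assuming made-up, imbalanced class labels (80 observations of class 0 and 20 of class 1). Each held-out fold should preserve the 20% share of class 1.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 80 observations of class 0 and 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)  # placeholder feature values

stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

class_1_share = []
for train_idx, val_idx in stratified.split(X, y):
    # Share of class 1 in the held-out fold; the full data set has 0.2.
    class_1_share.append(y[val_idx].mean())

print(class_1_share)
```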


<br>

__**** IMPORTANT ****__: When using cross validation __you must still split your data set into training and testing subsets__. Cross validation eliminates the need to create a separate evaluation / validation subset but it __DOES NOT__ eliminate the need for a dedicated model testing subset.

<br>

### Using scikit-learn's Cross Validation Capabilities: An Example

scikit-learn provides us with __cross_val_score()__. The user must first split their data into training and testing subsets (as above) and select an appropriate machine learning model.

In [None]:
# load the LinearRegression class from sklearn's 'linear_model' module
from sklearn.linear_model import LinearRegression

# load the cross_val_score function from the sklearn.model_selection module
from sklearn.model_selection import cross_val_score

# Assign the model you want to use to a variable
model = LinearRegression()

# fit the model using 10-fold cross validation; note how the 'model' variable created above is used as a parameter for the 
# cross_val_score() function. Also note how we can specify the number of folds to use during cross validation via the 'cv' 
# parameter
scores = cross_val_score(model, X_train, y_train, cv=10)

# print out the R^2 scores (the default metric for regression models) from each of the 10 folds
print(scores)

[0.37391588 0.4815027  0.46241032 0.38677289 0.39322987 0.46292859
 0.44879197 0.44783892 0.42601647 0.41181553]


In [None]:
import numpy as np

# calculate the average R^2 score across all 10 folds
np.mean(scores)

0.4295223153950943

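Performance here is weak: the mean cross-validated R^2 is ~0.43, i.e., the model explains only about 43% of the variance in the response variable.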
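
Cross validation scores come from the training subset only, so the dedicated testing subset is still needed for a final check. The sketch below shows that last step on synthetic data (an assumption; the numbers will not match the wine example): refit the chosen model on the full training subset, then score it once on the held-out test subset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data standing in for a real data set.
X, y = make_regression(n_samples=500, n_features=4, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12
)

model = LinearRegression()

# Cross validation uses the training subset only.
cv_scores = cross_val_score(model, X_train, y_train, cv=10)
print("mean cross-validated R^2:", cv_scores.mean())

# Refit on the full training subset, then score once on the untouched test subset.
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print("test R^2:", test_r2)
```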

<br>
<br>

# __Assignment 1__

<br>
<br>

Cross validation can be applied during model training to assess how well a model will handle previously unseen data.

We will construct a cross validated linear regression model that predicts the energy production of a power plant. The data set is sourced from the UC Irvine machine learning archive: 
- https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant# 

The data set comprises nearly 10,000 observations of 1 response/dependent variable (net hourly electrical energy output) and 4 explanatory/independent variables (temperature, ambient pressure, relative humidity, and exhaust vacuum).

1. Load the provided data into your GitHub repository. Data Link: [here](https://docs.google.com/spreadsheets/d/1G9434EXtmv6sqtV_A_t8V3TUJspaSVPp7AguQnA7CJQ/edit?usp=sharing)
2. Load the data into a pandas dataframe
3. Get familiar with the data - this means possibly getting expert domain knowledge
4. Do an EDA. Include any important analytical highlights. Identify preliminary predictive inferences. Make sure to be both thorough and succinct. Do not include graphics in your EDA that you do not provide a written explanatory narrative for. Graphics lacking explanatory narratives are of no use to a reader of your work.

5. Create two different linear regression models that predict net hourly electrical energy output and evaluate them using K-fold cross validation.

6. Upload your final product to GitHub
