# Data science automation

This week is all about automation techniques for data science with Python. We can automate a lot of things with Python: collecting data, processing it, cleaning it, and many other parts of the data science pipeline.

For the purposes of our lecture this week, we will show how to:

- use the TPOT autoML Python package to find an optimized ML model for our diabetes dataset
- create a Python script to ingest new data and make predictions on it

Often, the next steps in fully operationalizing an ML pipeline like this are to use a cloud service to scale and serve our ML algorithm. We can use services like AWS (including AWS Lambda), GCP, or Azure ML for deployment, with tools such as Docker and Kubernetes.

## Auto ML package: TPOT

**What is TPOT?**

>**TPOT** [link](http://epistasislab.github.io/tpot/latest/) is meant to be an assistant that gives you ideas on how to solve a particular machine learning problem by exploring pipeline configurations that you might have never considered, then leaves the fine-tuning to more constrained parameter tuning techniques such as grid search.

Automated machine learning doesn’t replace the data scientist (at least not yet), but it might help you find good models faster. TPOT bills itself as your Data Science Assistant.

TPOT helps you find good algorithms. Note that it isn’t designed for automating deep learning — something like AutoKeras might be helpful there.


TPOT is built on the scikit-learn library and follows the scikit-learn API closely. It can be used for regression and classification tasks and has special implementations for medical research.

TPOT is open source, well documented, and under active development. Its development was spearheaded by researchers at the University of Pennsylvania.

**How does TPOT work?**

TPOT has what its developers call a genetic search algorithm to find the best parameters and model ensembles. It could also be thought of as a natural selection or evolutionary algorithm. TPOT tries a pipeline, evaluates its performance, and randomly changes parts of the pipeline in search of better performing algorithms.
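
The try, evaluate, and mutate loop described above can be illustrated with a toy sketch. This is not TPOT's implementation: the "pipeline" here is just a dictionary with two made-up settings (`depth` and `scale`), and `score` is a hypothetical stand-in for a cross-validated model score.

```python
import random


def score(candidate):
    """Hypothetical fitness: best when depth == 7 and scaling is switched on."""
    return -abs(candidate["depth"] - 7) + (1 if candidate["scale"] else 0)


def mutate(candidate, rng):
    """Return a copy of the candidate with one randomly chosen setting changed."""
    child = dict(candidate)
    if rng.random() < 0.5:
        child["depth"] = max(1, child["depth"] + rng.choice([-1, 1]))
    else:
        child["scale"] = not child["scale"]
    return child


def evolve(generations=30, population_size=10, seed=42):
    """Keep the better half of each generation and refill it with mutated copies."""
    rng = random.Random(seed)
    population = [
        {"depth": rng.randint(1, 15), "scale": rng.random() < 0.5}
        for _ in range(population_size)
    ]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        survivors = population[: population_size // 2]
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=score)


best = evolve()
print(best, score(best))
```

Because the best candidates always survive into the next generation, the best score can only stay the same or improve as the number of generations grows.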

>AutoML algorithms aren’t as simple as fitting one model on the dataset; they are considering multiple machine learning algorithms (random forests, linear models, SVMs, etc.) in a pipeline with multiple preprocessing steps (missing value imputation, scaling, PCA, feature selection, etc.), the hyperparameters for all of the models and preprocessing steps, as well as multiple ways to ensemble or stack the algorithms within the pipeline.


The power of TPOT comes from evaluating all kinds of possible pipelines automatically and efficiently. Doing this manually is cumbersome and slow.
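
To make the idea of a "pipeline" concrete, here is a minimal sketch of one hand-built candidate of the kind an AutoML search might consider: imputation, scaling, PCA, and a linear model. The particular steps and the synthetic data are assumptions chosen for illustration; they are not what TPOT would select for the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the diabetes dataset)
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# One candidate pipeline: preprocessing steps followed by a classifier
candidate = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(max_iter=1000),
)

# Evaluate the whole pipeline with 5-fold cross-validation
scores = cross_val_score(candidate, X, y, cv=5)
mean_score = scores.mean()
print(f"Mean cross-validated accuracy: {mean_score:.3f}")
```

An AutoML tool repeats this kind of evaluation for many different combinations of steps and hyperparameters, which is what makes the manual approach slow by comparison.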

**Are there disadvantages to using an AutoML package?**

The biggest concern with AutoML packages, such as TPOT, is the length of training time.

So, how long does TPOT take to run? The short answer is that it depends.

TPOT was designed to run for a while: hours or even a day. Less complex problems with smaller datasets, however, can see great results in minutes. You can adjust several parameters so that TPOT finishes its search faster, but at the expense of a less thorough search for an optimal pipeline. It was not designed to be a comprehensive search of preprocessing steps, feature selection, algorithms, and parameters, but it can come close if you set its parameters to be more exhaustive.

>…TPOT will take a while to run on larger datasets, but it’s important to realize why. With the default TPOT settings (100 generations with 100 population size), TPOT will evaluate 10,000 pipeline configurations before finishing. To put this number into context, think about a grid search of 10,000 hyperparameter combinations for a machine learning algorithm and how long that grid search will take. That is 10,000 model configurations to evaluate with 10-fold cross-validation, which means that roughly 100,000 models are fit and evaluated on the training data in one grid search.

An important TPOT parameter to set is the number of generations (via the generations kwarg). Since our aim is to just illustrate the use of TPOT, we assume the default setting of 100 generations, whilst bounding the total running time via the max_time_mins kwarg (which may, essentially, override the former setting). Further, we enable control for the maximum amount of time allowed for optimization of a single pipeline, via max_eval_time_mins.
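
As a rough sketch, the three settings mentioned above could be collected and passed to the classifier as shown here. The time limits of 60 and 5 minutes are illustrative assumptions rather than recommendations, and the exact set of accepted keyword arguments can differ between TPOT versions, so check the documentation for the version you have installed.

```python
# Settings named in the text; the two time limits are illustrative values only
tpot_settings = {
    "generations": 100,        # number of generations to evolve
    "population_size": 100,    # pipelines per generation
    "max_time_mins": 60,       # assumed cap on the total search time
    "max_eval_time_mins": 5,   # assumed cap on evaluating a single pipeline
    "random_state": 42,
}

try:
    import tpot

    # Constructing the estimator does not start the search; fit() does
    tpot_model = tpot.TPOTClassifier(**tpot_settings)
except ImportError:
    tpot_model = None
    print("TPOT is not installed; only the settings dictionary was built.")
```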

On a standard laptop with 4GB RAM, each generation takes approximately 5 minutes to run. Thus, for the default value of 100 generations and without an explicit duration bound, the total run time could be roughly 8 hours (100 × 5 minutes = 500 minutes).
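
The numbers quoted in this section can be checked with a few lines of arithmetic. The inputs below are taken from the text (100 generations, population size 100, 10-fold cross-validation, about 5 minutes per generation); only the variable names are made up.

```python
generations = 100
population_size = 100
cv_folds = 10
minutes_per_generation = 5  # rough laptop estimate from the text

# Pipelines evaluated over the whole search
pipelines_evaluated = generations * population_size

# Each pipeline is fit once per cross-validation fold
model_fits = pipelines_evaluated * cv_folds

# Total wall-clock time if no time limit is set
total_hours = generations * minutes_per_generation / 60

print(f"Pipelines evaluated: {pipelines_evaluated:,}")
print(f"Model fits: {model_fits:,}")
print(f"Estimated run time: {total_hours:.1f} hours")
```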

<b>Installation: TPOT</b> <br>
You will need to install/update the following Python packages in order for TPOT to work properly:<br>
    * pip install numpy scipy scikit-learn pandas joblib torch torchvision torchaudio <br>
    * pip install deap update_checker tqdm stopit xgboost <br>
    * pip install tpot **OR** conda install -c conda-forge tpot <br>
    
Additional details for installing extended TPOT functionality can be found on the [TPOT documentation](https://epistasislab.github.io/tpot/examples/) pages.


# Before we get started, let's build a Python environment

Creating a Python environment (usually a virtual environment) is a best practice for Python development, primarily to isolate project dependencies, prevent conflicts, and ensure projects are reproducible and portable. We can send the entire environment to another team or development environment and ensure compatibility. It is best to create the environment in VS Code.


Steps to Create the Environment
1. Open the Command Palette: Press Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (macOS).
2. Run the Create Environment Command:
    - Start typing Python: Create Environment in the command palette search bar.
    - Select the command when it appears.
3. Choose the Environment Type:
    - Select Venv from the options presented.
4. Select a Python Interpreter:
    - A list of available Python interpreters detected on your system will appear. Select the one you want to use for this specific project.
5. Wait for Creation:
    - VS Code will create a new folder, typically named .venv, within your project directory. A notification will show the progress.
    - The extension will automatically select this new environment as the active interpreter for your workspace. 
6. Open the terminal window in VS Code and install all the necessary packages again.
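
If you prefer not to use the VS Code command, a similar environment can be created from Python itself with the standard library `venv` module. This is a minimal sketch of an alternative route, not what VS Code runs internally; `create_env` is a hypothetical helper name.

```python
import venv
from pathlib import Path


def create_env(env_dir=".venv", with_pip=True):
    """Create a virtual environment in env_dir and return its path."""
    env_path = Path(env_dir)
    # with_pip=True also installs pip into the new environment
    venv.create(env_path, with_pip=with_pip)
    return env_path
```

Calling `create_env()` from the project folder creates a `.venv` directory there, which you can then pick in VS Code with the Python: Select Interpreter command.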

In [None]:
# ! pip install numpy scipy scikit-learn pandas torch torchvision torchaudio

In [None]:
# ! pip install deap update_checker tqdm stopit xgboost

In [None]:
# ! pip install tpot

## Using TPOT with our Diabetes dataset

In [13]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import tpot
from sklearn.model_selection import train_test_split
from sklearn.metrics import get_scorer

import timeit 

### Load data

First, we are going to load the same prepared data from week 2, where everything has been converted to numbers. Many autoML packages can handle non-numeric data (they usually convert it to numeric with various methods).

In [14]:
import pandas as pd

df = pd.read_excel('diabetes_data.xlsx', index_col='Patient number')
df

AttributeError: 'Index' object has no attribute '_format_flat'

                Cholesterol  Glucose  HDL Chol  Age  Gender  Height  Weight  \
Patient number                                                                
1                       193       77        49   19  female      61     119   
2                       146       79        41   19  female      60     135   
3                       217       75        54   20  female      67     187   
4                       226       97        70   20  female      64     114   
5                       164       91        67   20  female      70     141   
...                     ...      ...       ...  ...     ...     ...     ...   
386                     227      105        44   83  female      59     125   
387                     226      279        52   84  female      60     192   
388                     301       90       118   89  female      61     115   
389                     232      184       114   91  female      61     127   
390                     165       94        69   92 

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 390 entries, 1 to 390
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Cholesterol   390 non-null    int64  
 1   Glucose       390 non-null    int64  
 2   HDL Chol      390 non-null    int64  
 3   Age           390 non-null    int64  
 4   Gender        390 non-null    object 
 5   Height        390 non-null    int64  
 6   Weight        390 non-null    int64  
 7   BMI           390 non-null    float64
 8   Systolic BP   390 non-null    int64  
 9   Diastolic BP  390 non-null    int64  
 10  waist         390 non-null    int64  
 11  hip           390 non-null    int64  
 12  Diabetes      390 non-null    object 
dtypes: float64(1), int64(10), object(2)
memory usage: 42.7+ KB
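
The `df.info()` output above lists `Gender` and `Diabetes` as `object` columns, so they hold text rather than numbers. If your copy of the data looks like this, one way to convert such columns is shown in the sketch below. The toy rows and the `Diabetes` label strings are assumptions made for illustration, and `encode_text_columns` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in rows; the Diabetes label strings are assumed
toy = pd.DataFrame(
    {
        "Glucose": [77, 279, 90, 184],
        "Gender": ["female", "male", "female", "male"],
        "Diabetes": ["No diabetes", "Diabetes", "No diabetes", "Diabetes"],
    }
)


def encode_text_columns(frame):
    """Return a copy where every non-numeric column is replaced by integer codes."""
    encoded = frame.copy()
    for column in encoded.select_dtypes(exclude="number").columns:
        # Categories are sorted alphabetically, so codes follow that order
        encoded[column] = encoded[column].astype("category").cat.codes
    return encoded


encoded_toy = encode_text_columns(toy)
print(encoded_toy.dtypes)
```

Because the codes follow alphabetical order, it is worth checking which label ended up as 0 and which as 1 before interpreting predictions.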


### Splitting our dataset

As we've already seen in the prior couple of weeks, we need to split our dataset into feature/target sets and then into training and test sets.

In [16]:
features = df.drop('Diabetes', axis=1)
targets = df['Diabetes']

x_train, x_test, y_train, y_test = train_test_split(features, targets, stratify=targets, random_state=42)
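
The `stratify` argument used above keeps the class proportions the same in the training and test sets. A minimal sketch on made-up data (100 rows, 20% positive) shows the effect; the `toy_` names are hypothetical and unrelated to the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 20 positives and 80 negatives
toy_features = np.arange(100).reshape(-1, 1)
toy_targets = np.array([1] * 20 + [0] * 80)

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_features, toy_targets, stratify=toy_targets, random_state=42
)

# With stratification, both splits keep the original 20% positive rate
train_rate = toy_y_train.mean()
test_rate = toy_y_test.mean()
print(f"Positive rate in train: {train_rate:.2f}, in test: {test_rate:.2f}")
```

Without `stratify`, a small or imbalanced dataset can end up with noticeably different class proportions in the two splits.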

### Using TPOT

Most of the following should look really familiar at this point. We will:
* establish an instance of the model
* fit the model with our data
* evaluate the model

Also, the `%%time` command tells Jupyter to report how long the cell took to run.

In [17]:
%%time
tpot_model = tpot.TPOTClassifier(generations=5, population_size=50, cv=5, random_state=42)

tpot_model.fit(x_train, y_train)


Perhaps you already have a cluster running?
Hosting the HTTP server on port 55229 instead
Generation:   0%|          | 0/5 [01:21<?, ?it/s]
Generation:   0%|          | 0/5 [00:00<?, ?it/s]

Exception: No individuals could be evaluated in the initial population. This may indicate a bug in the configuration, included models, or objective functions. Set verbose>=4 to see the errors that caused individuals to fail.

Did you notice the one difference from our prior machine learning materials?

Yup! We are not evaluating the performance of our model against our training data this time. TPOT does that for us.
For more details on the individual parameters we are using, take a look at the [TPOT documentation](https://epistasislab.github.io/tpot/examples/) pages.

Now we can use the TPOT model to make predictions for our test dataset.

In [18]:
tpot_model.evaluated_individuals

In [19]:
tpot_model.pareto_front['Instance']

TypeError: 'NoneType' object is not subscriptable

In [None]:
scorer = get_scorer("accuracy")
scorer(tpot_model, x_test, y_test)

In [None]:
predictions = tpot_model.predict(x_test)
predictions

Let's compare TPOT's predictions against the actuals for the test dataset.

In [None]:
# display the actuals and predictions for the test set
print('Predictions for test data set')
print(predictions)
print('Actuals for test data set')
print(y_test)

In [None]:
from sklearn.metrics import accuracy_score
print(f'Accuracy of the TPOT predictions: {accuracy_score(y_test,predictions)}')
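
Accuracy is a single number; a confusion matrix shows where the predictions and actuals disagree. The sketch below uses a handful of made-up labels (not real model output) to show how the two relate.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only
labels = ["No diabetes", "Diabetes"]
toy_actuals = ["No diabetes", "Diabetes", "No diabetes", "No diabetes", "Diabetes"]
toy_predictions = ["No diabetes", "No diabetes", "No diabetes", "Diabetes", "Diabetes"]

# Rows are actual classes, columns are predicted classes, both in `labels` order
matrix = confusion_matrix(toy_actuals, toy_predictions, labels=labels)
toy_accuracy = accuracy_score(toy_actuals, toy_predictions)

print(matrix)
print(f"Accuracy: {toy_accuracy:.2f}")
```

The diagonal of the matrix counts the correct predictions, so accuracy is the diagonal sum divided by the total number of rows.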

### Saving our TPOT model

Next, we want to save our trained model so we can use it in a Python file later. After all, having a model that is only available on your local machine isn't ideal in a corporate setting. You will want to be able to deploy your model into a production environment.

We will use pickle to save our model and then run the predict_diabetes.py Python file.

In [None]:
import pickle

# Save the fitted pipeline
with open('tpot_diabetes_pipeline.pkl', 'wb') as f:
    pickle.dump(tpot_model.fitted_pipeline_, f)

# Later, load the pipeline and use it to make predictions
with open('tpot_diabetes_pipeline.pkl', 'rb') as f:
    loaded_pipeline = pickle.load(f)

# x_test is the lowercase test set created in the split above
predictions = loaded_pipeline.predict(x_test)

Yup! That's all there is to it. Let's take a look at the file that was created.

We can now use this model in a Python file to take in new data and make a prediction. We will first need to compose a Python file. We can do this in many ways:

- Jupyter and Jupyter Lab
- VS Code
- Atom
- Notepad++/Sublime
- Other text editors or IDEs (integrated development environments)

The benefit of using a code editor or IDE is that it will have lots of bells and whistles, like syntax highlighting, autocomplete, and many other things depending on the code editor or IDE. VS Code is one of the most widely used editors among data scientists and software developers, although you can try any IDE or code editor for Python that you like. You can easily install VS Code through Anaconda Navigator or by visiting the VS Code website. VS Code is developed by Microsoft, which also makes a separate, full-featured IDE called Visual Studio.

The file we've created is shown below:

In [None]:
from IPython.display import Code

Code('predict_diabetes.py')

<div class="alert alert-block alert-info">
<b>Important:</b> The code above will not execute as-is. 
    
The following comment specifies assumptions about the format of your dataset. You would need to provide code to ensure that the incoming dataset meets this format <b> OR </b> change the code to reflect your existing dataset format. <br>
&emsp;\# NOTE: Make sure that the outcome column is labeled 'target' in the data file

Also, notice the location of the dataset is not specified below. Generally in a production situation, file locations are passed by a parameter.  Since TPOT doesn't know this information, you have been provided with a template that you need to adjust to fit your environment. <br>
&emsp;tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
</div>
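
One way to pass file locations as parameters, as the note above suggests, is sketched here. This is not the contents of `predict_diabetes.py`; the function names and the command-line arguments are assumptions, and the sketch expects the new data to have the same feature columns the model was trained on.

```python
import argparse
import pickle

import pandas as pd


def load_pipeline(model_path):
    """Load a pickled, already fitted pipeline from disk."""
    with open(model_path, "rb") as f:
        return pickle.load(f)


def predict_from_csv(model_path, data_path, index_col=None):
    """Read new data from a CSV file and return one prediction per row."""
    pipeline = load_pipeline(model_path)
    new_data = pd.read_csv(data_path, index_col=index_col)
    predictions = pipeline.predict(new_data)
    return pd.Series(predictions, index=new_data.index, name="prediction")


def main(argv=None):
    """Parse the file locations from the command line and print predictions."""
    parser = argparse.ArgumentParser(description="Predict on new data.")
    parser.add_argument("data_path", help="CSV file with the new data")
    parser.add_argument(
        "--model", default="tpot_diabetes_pipeline.pkl", help="pickled pipeline"
    )
    args = parser.parse_args(argv)
    print(predict_from_csv(args.model, args.data_path))
```

Saved as a script with a call to `main()` under the usual `if __name__ == "__main__":` guard, this could be run as `python your_script.py new_data.csv --model tpot_diabetes_pipeline.pkl`. Only unpickle model files that come from a source you trust.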

We can test out running the file with the Jupyter "magic" command %run:

In [None]:
%run predict_diabetes.py

In [None]:
predictions

# Saving our code to GitHub

The last few things to do are to write a short summary of our process and results and create a GitHub repository and upload our code there. If you don't already have an account, head over to github.com and create one. Then, you need to install a GitHub client on your computer. Definitely the easiest way is to use the [GUI](https://desktop.github.com/), although if you're more advanced or adventurous you can use the CLI instead.  GitHub is also run and owned by Microsoft (they bought it in 2018).

Once you have an account and GitHub installed, you can create a new repository, either through the GUI with File -> New repository or through the web interface. It's best to select the option 'Initialize this repository with a README' and 'Python' for the 'Git ignore' option. The Git ignore option creates a file that ignores common Python-related files we don't need (like our Jupyter Notebook checkpoints folders). Lastly, it's not a bad idea to choose a license. The MIT license is a very permissive open-source license, while others, such as Apache 2.0, come with more conditions. We can choose MIT here since we aren't worried about protecting intellectual property in this case.

When you publish to GitHub, you need to include a README file that explains and defines the repo. It is also good practice to include a `requirements.txt` file alongside your codebase; running `pip freeze > requirements.txt` generates one that lists all the packages installed in the environment.

In [None]:
! pip freeze > requirements.txt

Once we've created the repository, we can open the folder by browsing there, with a hotkey through the GUI, or by clicking the button in the GUI to open the folder. We need to copy our work there, then write a short note in the 'summary' area, and hit 'commit to main'. The 'main' label is a branch of the Git repository; we can have several branches in parallel, but 'main' is the default. Then we can hit the 'push origin' button in the upper right to upload our work to GitHub. Last, we simply need to put the link in a text file and turn that in for our week 5 assignment (using the churn data instead of the diabetes data).

# Optional advanced section

Although we don't have walkthroughs for it this week, there are other autoML packages in Python that we can use. These have documentation with examples showing how to use them. For example, here are the docs for some of these:
- [H2O](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html)
- [MLBox](https://mlbox.readthedocs.io/en/latest/introduction.html)
- [LazyPredict](https://lazypredict.readthedocs.io/en/latest/)

Of course, using these packages requires that you first install them. At least with H2O, using conda is easier than with pip.

We can also improve our Python module by using a class instead of plain functions. We can read more about creating classes [here](https://realpython.com/python3-object-oriented-programming/). 
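
As a starting point, a class-based version of the prediction code might look like the minimal sketch below. The class name `DiabetesPredictor` and its methods are hypothetical; it simply wraps any fitted object that has a `predict` method.

```python
import pickle


class DiabetesPredictor:
    """Thin wrapper around a fitted pipeline that exposes a predict method."""

    def __init__(self, pipeline):
        self.pipeline = pipeline

    @classmethod
    def from_pickle(cls, model_path):
        """Build a predictor from a pickled pipeline stored on disk."""
        with open(model_path, "rb") as f:
            return cls(pickle.load(f))

    def predict(self, features):
        """Return the pipeline's predictions as a plain Python list."""
        return list(self.pipeline.predict(features))
```

With the pickle file saved earlier, usage could look like `DiabetesPredictor.from_pickle('tpot_diabetes_pipeline.pkl').predict(new_data)`. Keeping the loaded pipeline on the object means it is read from disk once and reused for every call.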