In [None]:
!pip3 install -q fsspec
!pip3 install -q gcsfs
!pip3 install -q vectice
!pip3 install -q mlflow
!pip3 install -q google-cloud-storage

In [None]:
!pip3 show vectice

The main entry point of the SDK is the high-level API, which provides several ways to track your runs.

* a procedural solution with two methods to call: vectice.create_run() and vectice.save_after_run()

* a more powerful solution based on the vectice.Vectice class, which itself offers several possibilities:

  * use an instance of vectice.Vectice to call create_run(), start_run() and end_run() (fluent API)

  * use the context manager syntax (the Python with keyword), in which case the end of the run is managed automatically.

In [None]:
import logging
from math import sqrt
import os 

import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, QuantileTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn import metrics

import mlflow
from vectice import Vectice

### Data:
This classic dataset contains the prices and other attributes of almost 54,000 diamonds. It has 10 attributes, including the target, i.e. price.

### Feature description:

price: price in US dollars ($326--$18,823). This is the target column, containing the label for each row of features.

### The 4 Cs of Diamonds:

- carat (0.2--5.01) The carat is the diamond’s physical weight measured in metric carats.  One carat equals 1/5 gram and is subdivided into 100 points. Carat weight is the most objective grade of the 4Cs. 

- cut (Fair, Good, Very Good, Premium, Ideal) In determining the quality of the cut, the diamond grader evaluates the cutter’s skill in the fashioning of the diamond. The more precise the diamond is cut, the more captivating the diamond is to the eye.  

- color, from J (worst) to D (best) The colour of gem-quality diamonds occurs in many hues, in the range from colourless to light yellow or light brown. Colourless diamonds are the rarest. Other natural colours (blue, red, pink for example) are known as "fancy", and their colour grading is different from that of colourless diamonds.  

- clarity (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) Diamonds can have internal characteristics known as inclusions or external characteristics known as blemishes. Diamonds without inclusions or blemishes are rare; however, most characteristics can only be seen with magnification.  
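
To make the carat definition above concrete, here is a minimal sketch that converts carat weight into grams and points using only the two ratios stated in the list (1 carat = 1/5 gram = 100 points). The helper names `carat_to_grams` and `carat_to_points` are made up for this illustration and are not part of the dataset or any library.

```python
# Ratios taken from the carat description above.
GRAMS_PER_CARAT = 0.2      # one carat equals 1/5 gram
POINTS_PER_CARAT = 100     # one carat is subdivided into 100 points


def carat_to_grams(carat: float) -> float:
    """Convert a weight in metric carats to grams."""
    return carat * GRAMS_PER_CARAT


def carat_to_points(carat: float) -> float:
    """Convert a weight in metric carats to points."""
    return carat * POINTS_PER_CARAT


# The smallest and largest carat values quoted for the dataset, plus 1 carat.
for carat in (0.2, 1.0, 5.01):
    print(f"{carat} ct = {carat_to_grams(carat):.3f} g = {carat_to_points(carat):.0f} points")
```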

### Goal: 

The goal is to predict the prices of diamonds using the features in the given dataset. It is therefore a regression problem: you'll perform a bit of data cleaning and create multiple models that are logged to MLflow. The code used to achieve this is hidden, but you can view it. However, it'll be more fun to give it a good old college try as a team and resort to the hidden code if all else fails.

Here is a link to the Python SDK Documentation. It's neither final nor complete, so you might need to troubleshoot a bit. 
[Python SDK Documentation](https://storage.googleapis.com/sdk-documentation/index.html)

Upload the GCS JSON. It is then declared as an environment variable, as seen below.

```
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'test.json'
```

In [None]:
# In Google Colab you can upload the JSON file that has your Google Cloud service account details with the following widget. It is used to access the data needed to perform the steps in the notebook.
from google.colab import files
uploaded = files.upload()

## Vectice Credentials 

To connect to the Vectice App through the SDK you'll need the Project Token, the Vectice API Endpoint and the Vectice API Token. You'll find all of these in the Vectice App: the Workspace allows you to create the Vectice API Token, and in Projects you'll be able to get the Project Token, as seen below. The Vectice API Endpoint is 'https://be-beta.vectice.com'. You're provided with the GCS Service Account JSON; this allows you to connect to the GCS Bucket in the Vectice App and get the data needed for the example. 

## Credentials Setup:
##### The Vectice API Endpoint and Token are needed to connect to the Vectice UI. Furthermore, a Google Cloud Storage credential JSON is needed to connect to the Google Cloud Storage to retrieve and upload the datasets. A project token links the runs to the relevant project and it's needed to create runs.

In [None]:
# Vectice API Endpoint
os.environ['VECTICE_API_ENDPOINT'] ='https://be-beta.vectice.com'
# The connection API token created in the Vectice Workspace
os.environ['VECTICE_API_TOKEN'] = "CONNECTION_TOKEN"
# The Google Cloud Storage Service Account 
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "FILE.json"
# Project token from Vectice UI
PROJECT_TOKEN = "TOKEN"

In [None]:
# Initialize the connection with Vectice
vectice = Vectice(project_token=PROJECT_TOKEN)
# Create a ds_version as an input for the run
ds_version = [vectice.create_dataset_version().with_parent_name("diamonds data")]
# Create a run that will be passed into a start run
run = vectice.create_run("Data Cleaning")
# Start the run 
vectice.start_run(run, inputs = ds_version)

This is an example of how you would push your data into your GCS bucket. Throughout the tutorial you'll be interacting with GCS, but we'll only be using read rights, so we won't be pushing any data into the GCS bucket.

```
data.to_csv("gs://BUCKET/FILE_PATH/FILE_NAME.csv")
```



The dataset used in this tutorial can be retrieved from a Google Cloud Storage Bucket:

In [None]:
data = pd.read_csv(r"gs://vectice-examples-samples/Diamonds/diamonds.csv")
data.head(5)

In [None]:
# This shows you the number of rows and columns
data.shape

In [None]:
# The details of the data
data.info()

### Data Cleaning 
In machine learning, if the data is irrelevant or error-prone then it leads to an incorrect model being built.

The first column is an index ("Unnamed: 0") and thus we are going to remove it.

In [None]:
# The first column seems to be just index
data = data.drop(["Unnamed: 0"], axis=1)
data.describe()

In [None]:
# Dropping diamonds with a zero dimension (x, y or z equal to 0)
data = data.drop(data[data["x"]==0].index)
data = data.drop(data[data["y"]==0].index)
data = data.drop(data[data["z"]==0].index)
# We dropped 20 dimensionless entries
data.shape
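
The three `drop` calls above can also be written as a single boolean mask. The sketch below shows the idea on a tiny made-up frame (the values are invented for illustration, not taken from the diamonds data), so you can see exactly which rows survive.

```python
import pandas as pd

# Toy frame standing in for the diamonds data; values are made up.
toy = pd.DataFrame({
    "x": [3.95, 0.0, 4.05, 4.20],
    "y": [3.98, 3.84, 0.0, 4.23],
    "z": [2.43, 2.31, 2.31, 0.0],
})

# True only for rows where x, y and z are all non-zero.
nonzero = (toy[["x", "y", "z"]] != 0).all(axis=1)
cleaned = toy[nonzero]

print(f"dropped {len(toy) - len(cleaned)} of {len(toy)} rows")
print(cleaned)
```

Counting the rows before and after, as done here, is a cheap way to confirm that a filter removed what you expected.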

In [None]:
sns.pairplot(data,hue= "cut", palette="rocket");

#### A few points to notice in these pair plots
##### There are some features with data points that are far from the rest of the dataset, which will affect the outcome of our regression model.

* "y" and "z" have some dimensional outliers in our dataset that need to be eliminated.
* The "depth" should be capped but we must examine the regression line to be sure.
* The "table" feature should be capped too.
* Let's have a look at regression plots to get a close look at the outliers.

In [None]:
ax = sns.regplot(x="price", y="y", data=data, scatter_kws={'color': 'purple'}, line_kws={'color': 'orange'}).set(title="Regression Line on Price vs 'y'")

In [None]:
ax = sns.regplot(x="price", y="z", data=data, scatter_kws={'color': 'purple'}, line_kws={'color': 'orange'}).set(title="Regression Line on Price vs 'z'")

In [None]:
ax = sns.regplot(x="price", y="depth", data=data, scatter_kws={'color': 'purple'}, line_kws={'color': 'orange'}).set(title="Regression Line on Price vs 'depth'")

We can clearly spot outliers in these attributes. Next up, we will remove these data points.

In [None]:
#Dropping the outliers. 
data = data[(data["depth"]<75)&(data["depth"]>45)]
data = data[(data["table"]<80)&(data["table"]>40)]
data = data[(data["x"]<30)]
data = data[(data["y"]<30)]
data = data[(data["z"]<30)&(data["z"]>2)]
# We dropped 13 outliers
data.shape

Let us have another look at the pair plot of data.

In [None]:
sns.pairplot(data, hue= "cut",palette="rocket");

That's a much cleaner dataset. Next, we will deal with the categorical variables.

In [None]:
# Get list of categorical variables
object_cols = [i for i in data.columns if data[i].dtype == 'object']
print(f"Categorical variables: {object_cols}")

#### Why are Categorical Features important?
Machine learning models require all input and output variables to be numeric.

This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model.

#### We have three categorical variables. Let us have a look at them with violin plots.
##### Violin plots are a method of plotting numeric data and can be considered a combination of the box plot with a kernel density plot. In the violin plot, we can find the same information as in the box plots:
* median (a white dot on the violin plot)
* interquartile range (the black bar in the center of violin)
* the lower/upper adjacent values (the black lines stretched from the bar), defined as first quartile - 1.5 IQR and third quartile + 1.5 IQR respectively. These values can be used in a simple outlier detection technique (Tukey’s fences): observations lying outside of these “fences” can be considered outliers.
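
As a rough illustration of the fences described above, the following sketch computes the quartiles, the IQR and both fences with NumPy on a handful of made-up prices. It uses NumPy's default (linear interpolation) percentile method; plotting libraries may compute quartiles slightly differently.

```python
import numpy as np

# Made-up prices; the last value is deliberately extreme.
prices = np.array([326, 400, 550, 700, 950, 1200, 1500, 2100, 18823], dtype=float)

# First and third quartiles, then the interquartile range.
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1

# Tukey's fences: 1.5 IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations outside the fences are flagged as potential outliers.
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```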

![Image](https://miro.medium.com/max/520/1*TTMOaNG1o4PgQd-e8LurMg.png)

Probability Density Function:

![Image](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Boxplot_vs_PDF.svg/525px-Boxplot_vs_PDF.svg.png)

In [None]:
ax = sns.violinplot(x="cut", y="price", data=data).set(title="Violinplot for Cut vs Price")

In [None]:
ax = sns.violinplot(x="color", y="price", data=data).set(title="Violinplot for Color vs Price")

In [None]:
ax = sns.violinplot(x="clarity", y="price", data=data).set(title="Violinplot for Clarity vs Price")

#### Label encoding the data to get rid of the object dtype.
This approach is very simple: it converts each value in a column to a number. Consider a dataset of bridges having a column named bridge-types with the values below. Though there will be many more columns in the dataset, to understand label encoding we will focus on one categorical column only. We choose to encode the text values by assigning a running sequence number to each text value, as shown below:

![Markdown Logo is here.](https://miro.medium.com/max/289/1*VinegxkUYMzik9GpucWCFA.png)
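
Here is a small sketch of what label encoding does to a column like "cut". scikit-learn's `LabelEncoder` numbers the classes in sorted (alphabetical) order, so the codes do not follow the quality order Fair < Good < Very Good < Premium < Ideal. The `cut_order` dictionary is a hypothetical alternative, shown only for comparison; it is not what this notebook uses.

```python
from sklearn.preprocessing import LabelEncoder

cuts = ["Ideal", "Premium", "Good", "Very Good", "Fair"]

# LabelEncoder assigns codes by sorted class name.
encoder = LabelEncoder()
codes = encoder.fit_transform(cuts)

print(list(encoder.classes_))   # classes in the order they were numbered
print(dict(zip(cuts, codes.tolist())))

# Hypothetical alternative: an explicit mapping that keeps the quality order
# listed in the feature description (worst to best).
cut_order = {"Fair": 0, "Good": 1, "Very Good": 2, "Premium": 3, "Ideal": 4}
ordinal_codes = [cut_order[c] for c in cuts]
print(dict(zip(cuts, ordinal_codes)))
```

Tree-based models are usually indifferent to the numbering, while linear and distance-based models may be sensitive to it, so it is worth knowing which order your encoder produced.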


In [None]:
# Make copy to avoid changing original data 
label_data = data.copy()

In [None]:
def encoder_labels(columns: list, dataframe: pd.DataFrame, encoder: LabelEncoder) -> pd.DataFrame:
    for col in columns:
        dataframe[col] = encoder.fit_transform(dataframe[col])
    return dataframe

In [None]:
encoder = LabelEncoder()
label_data = encoder_labels(object_cols, label_data, encoder)

In [None]:
label_data.head(5)

In [None]:
data.describe()

#### Correlation Matrix:
A correlation matrix is useful for showing the correlation coefficients (or degree of relationship) between variables. The correlation matrix is symmetric, as the correlation between a variable V1 and variable V2 is the same as the correlation between V2 and variable V1. Also, the values on the diagonal are always equal to one, because a variable is always perfectly correlated with itself.
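
The two properties mentioned above (symmetry and a diagonal of ones) are easy to check numerically. This sketch builds three synthetic columns (the names and the numbers are invented and only loosely resemble the diamonds features) and inspects their correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic columns: price depends strongly on carat, depth is independent noise.
carat = rng.uniform(0.2, 2.0, size=200)
price = 4000 * carat + rng.normal(0, 500, size=200)
depth = rng.normal(61.7, 1.4, size=200)

# Each row passed to corrcoef is treated as one variable.
corr = np.corrcoef(np.vstack([carat, price, depth]))

print(np.round(corr, 2))
print("symmetric:", np.allclose(corr, corr.T))
print("ones on the diagonal:", np.allclose(np.diag(corr), 1.0))
```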

In [None]:
#correlation matrix
corrmat= label_data.corr()
f, ax = plt.subplots(figsize=(12,12))
sns.heatmap(corrmat,annot=True);

#### Points to notice:
* "x", "y" and "z" show a high correlation to the target column.
* "depth", "cut" and "table" show low correlation. We could consider dropping them but let's rather keep them.

In [None]:
# Create a new version of the cleaned dataset
# An example of uploading the data to GCS -> label_data.to_csv(r"gs://GCS_URI")
outputs = [vectice.create_dataset_version().with_parent_name("diamonds cleaned")]
# End the run and save the new dataset version.
# Set the diamonds cleaned as an output.
vectice.end_run(outputs=outputs)

In [None]:
# Create inputs 
ds_version = [vectice.create_dataset_version().with_parent_name("diamonds cleaned")]
# Start a run to track this data train-test-split
# It will specify the dataset version we just created as the run's input.
run = vectice.create_run('Split Diamonds Data')
# Start the run
vectice.start_run(run, inputs=ds_version)

In [None]:
train, test = train_test_split(label_data, test_size=0.2, random_state = 42)

In [None]:
# The key you were provided for this tutorial may not have write permissions to GCS.
# Example of uploading to the GCS bucket -> train.to_csv (r'GCS_URI', index = False, header = True)
# Example of uploading to the GCS bucket -> test.to_csv (r'GCS_URI', index = False, header = True)
outputs = [vectice.create_dataset_version().with_parent_name("diamonds train test data")]
# End the run
vectice.end_run(outputs=outputs)

### Model Building
#### Steps involved in Model Building

* Setting up features and target
* Build a pipeline of standard scaler and model for five different regressors.
* Fit all the models on training data
* Get mean of cross-validation on the training set for all the models for negative root mean square error
* Pick the model with the best cross-validation score
* Fit the best model on the training set and get

### Train-Test Split Evaluation 
The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.

In [None]:
# Assigning the features as X and the target as y
X = label_data.drop(["price"], axis =1)
y = label_data["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.25, random_state=42)

In [None]:
# Initialise Vectice with MLflow
vectice = Vectice(project_token=PROJECT_TOKEN, lib="MLflow")

In [None]:
# Create the inputs
def create_inputs():
    return [
        Vectice.create_dataset_version().with_parent_name("diamonds train test data"),
    ]
# Data preparation
def prepare_data():
    """Read and prepare data."""
    df = pd.read_csv(r"gs://vectice-examples-samples/Diamonds/diamonds_cleaned.csv")

    X = df.drop(["price"], axis =1)
    y = df["price"]
    X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.25, random_state=42)

    return X_train, X_test, y_train, y_test

### Pipelines
In most machine learning projects the data that you have to work with is unlikely to be in the ideal format for producing the best performing model. There are quite often a number of transformational steps, such as encoding categorical variables, feature scaling and normalisation, that need to be performed. Scikit-learn has built-in functions for most of these commonly used transformations in its preprocessing package.
However, in a typical machine learning workflow you will need to apply all these transformations at least twice. Once when training the model and again on any new data you want to predict on. Of course you could write a function to apply them and reuse that but you would still need to run this first and then call the model separately. Scikit-learn pipelines are a tool to simplify this process. They have several key benefits:
* They make your workflow much easier to read and understand.
* They enforce the implementation and order of steps in your project.
* These in turn make your work much more reproducible.
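
To illustrate the point about applying transformations twice, the sketch below fits a pipeline and, separately, the same scaler and model by hand on noise-free synthetic data. The step names `scaler` and `model` and the data are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, noise-free data: y = 3*x0 - 2*x1 + 5.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

# With a pipeline: one fit, one predict; scaling happens inside.
pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X, y)

# Without a pipeline: the scaler must be applied by hand at both stages.
scaler = StandardScaler().fit(X)
model = LinearRegression().fit(scaler.transform(X), y)

X_new = np.array([[1.0, 2.0], [7.5, 0.5]])
pipe_pred = pipe.predict(X_new)
manual_pred = model.predict(scaler.transform(X_new))

print(pipe_pred)
print(manual_pred)
```

Both approaches give the same predictions; the pipeline simply removes the chance of forgetting the `transform` call on new data.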

### StandardScaler Example:
A StandardScaler subtracts the mean and then divides by the standard deviation; this shifts the distribution to have a mean of 0 and a standard deviation of 1.

In [None]:
example = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
scaler = StandardScaler().fit(example)
X_scaled = scaler.transform(example)
print(f"Before: {example[0]} \nAfter: {X_scaled[0]}")
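
The same result can be reproduced by hand with NumPy, which makes the formula explicit. This sketch reuses the example array from the cell above and relies on the fact that StandardScaler uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

example = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Column-wise mean and population standard deviation (ddof=0).
mean = example.mean(axis=0)
std = example.std(axis=0)

# z-score: subtract the mean, then divide by the standard deviation.
manual = (example - mean) / std

print(f"First row after scaling: {manual[0]}")
print(f"Column means: {manual.mean(axis=0)}")
print(f"Column standard deviations: {manual.std(axis=0)}")
```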

### Cross Validation:

Cross-validation works as follows. A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches exist, but generally follow the same principles). The following procedure is followed for each of the k “folds”:

- A model is trained using k-1 of the folds as training data;

- the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
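
The loop described above can be sketched by hand in a few lines. This is a simplified illustration, not scikit-learn's implementation: `k_fold_indices` is a made-up helper, the data are synthetic, and the "model" is just a straight line fitted with `np.polyfit`. Folds are contiguous and unshuffled, which is also the default behaviour of scikit-learn's `KFold`.

```python
import numpy as np


def k_fold_indices(n_samples: int, k: int):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


# Synthetic data: y = 2x plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + rng.normal(0, 1.0, size=50)

rmse_per_fold = []
for train_idx, val_idx in k_fold_indices(len(x), k=5):
    # Train on k-1 folds ...
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    # ... and validate on the remaining fold.
    pred = slope * x[val_idx] + intercept
    rmse_per_fold.append(float(np.sqrt(np.mean((y[val_idx] - pred) ** 2))))

# The reported score is the average over the k folds.
cv_rmse = float(np.mean(rmse_per_fold))
print(f"RMSE per fold: {np.round(rmse_per_fold, 3)}")
print(f"Mean CV RMSE: {cv_rmse:.3f}")
```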

![Image](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)

### Models:

1. LinearRegression <a href="https://ml-cheatsheet.readthedocs.io/en/latest/linear_regression.html" target="_blank">more info</a>.
2. DecisionTreeRegressor <a href="https://ml-cheatsheet.readthedocs.io/en/latest/classification_algos.html#decision-trees" target="_blank">more info</a>.
3. RandomForestRegressor <a href="https://www.geeksforgeeks.org/random-forest-regression-in-python/" target="_blank">more info</a>.
4. KNeighborsRegressor <a href="https://ml-cheatsheet.readthedocs.io/en/latest/classification_algos.html#k-nearest-neighbor" target="_blank">more info</a>.
5. XGBRegressor <a href="https://machinelearningmastery.com/xgboost-for-regression/" target="_blank">more info</a>.

In [None]:
import warnings
warnings.filterwarnings("ignore")
logging.basicConfig(level=logging.INFO)
"""Vectice MLflow adapter fluent usage in Python ``with`` syntax."""
X_train, X_test, y_train, y_test = prepare_data()

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.autolog(silent=True)


# Set up Vectice MLflow adapter
vectice = Vectice(project_token=PROJECT_TOKEN, lib="MLflow")

# Building pipelines of standard scaler and model for regressors.
pipeline_lr=Pipeline([("scalar1",StandardScaler()),
                 ("lr_classifier",LinearRegression())])

pipeline_dt=Pipeline([("scalar2",StandardScaler()),
                    ("dt_classifier",DecisionTreeRegressor())])

pipeline_rf=Pipeline([("scalar3",StandardScaler()),
                    ("rf_classifier",RandomForestRegressor())])


pipeline_kn=Pipeline([("scalar4",StandardScaler()),
                    ("kn_classifier",KNeighborsRegressor())])


pipeline_xgb=Pipeline([("scalar5",StandardScaler()),
                    ("xgb_classifier",XGBRegressor())])

# Pipelines list to iterate over
pipelines = [pipeline_lr, pipeline_dt, pipeline_rf, pipeline_kn, pipeline_xgb]

for pipe in pipelines:
    # Create inputs for each Vectice & MLflow run
    inputs = create_inputs()
    # Experiment name for each pipeline 
    MLFLOW_EXPERIMENT_NAME = pipe.steps[1][0]
    # Create each run that the start run will then start 
    run = vectice.create_run(MLFLOW_EXPERIMENT_NAME)
    # Fit each model 
    pipe.fit(X_train, y_train)
    
    with vectice.start_run(run, inputs=inputs):
        cv_score = cross_val_score(pipe, X_train, y_train,scoring="neg_root_mean_squared_error", cv=10, n_jobs=-1)
        mlflow.log_param('Algorithm', MLFLOW_EXPERIMENT_NAME)
        mlflow.log_param('Scaler', 'StandardScaler')
        mlflow.log_metric("Cross Validation", float(cv_score.mean()))
        print(f"{MLFLOW_EXPERIMENT_NAME}: {cv_score.mean()}")

#### Testing the Model with the best score on the test set
In the above scores, XGBRegressor (the xgb_classifier pipeline) appears to be the model with the best negative root mean squared error, although you might get different results. Let's test this model on the test set and evaluate it with several metrics.
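
A note on reading these scores: with the `neg_root_mean_squared_error` scoring, every value is zero or negative, and the best model is the one whose score is closest to zero (the largest value). The sketch below shows the convention with a hand-written `neg_rmse` helper and two invented sets of predictions; `model_a` and `model_b` are placeholders, not models from this notebook.

```python
import numpy as np


def neg_rmse(y_true, y_pred) -> float:
    """Negated root mean squared error: higher (closer to 0) is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return -float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


y_true = [1000, 2000, 3000]
scores = {
    "model_a": neg_rmse(y_true, [1100, 1900, 3100]),  # every prediction off by 100
    "model_b": neg_rmse(y_true, [1500, 2500, 2500]),  # every prediction off by 500
}

# Pick the maximum, not the minimum: the scores are negated errors.
best = max(scores, key=scores.get)
print(scores)
print(f"Best model: {best}")
```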

In [None]:
# If you have a mlflow run that is still running and you need to end it, then run this cell.
mlflow.end_run()

In [None]:
# Model prediction on test data
pred = pipeline_xgb.predict(X_test)

In [None]:
# Model Evaluation
print("R^2:",metrics.r2_score(y_test, pred))
print("Adjusted R^2:",1 - (1-metrics.r2_score(y_test, pred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
print("MAE:",metrics.mean_absolute_error(y_test, pred))
print("MSE:",metrics.mean_squared_error(y_test, pred))
print("RMSE:",np.sqrt(metrics.mean_squared_error(y_test, pred)))
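
The adjusted R^2 line above packs the formula into a single expression. As a sketch, it can be pulled out into a small function (the name `adjusted_r2` and the example numbers are made up) to make the roles of the sample count and the feature count clearer.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# With many samples and few features the adjustment is tiny ...
print(adjusted_r2(0.98, n_samples=1000, n_features=9))
# ... but with few samples the penalty for extra features is large.
print(adjusted_r2(0.5, n_samples=11, n_features=5))
```

On a test set with thousands of rows and nine features, as in this tutorial, R^2 and adjusted R^2 should be nearly identical.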

#### End

Congratulations and as Jake Peralta would say:

![Image](https://i.imgur.com/I1wR7mE.gif?noredirect)