# Data Science with Python 
## Regression 

In this demo we are going to look at a regression algorithm. 
Regression is typically a supervised machine learning technique (refer back to the slides for a definition). 

In this demo, we will explore Regression with linear regression. We will use a series of modules:

**matplotlib** - This module lets us visualise the output of our model. We will examine the data in 2 dimensions; we could use more, but that will do for now. Interested in more dimensions? Ask me about PCA.  
**numpy** - Numerical package for working with arrays of numbers.  
**sklearn** - scikit-learn is one of the most widely used modules for general (shallow) machine learning. We can talk more about deep learning another time.

OK. Let's begin by importing those modules.

[Original example](http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html)

In [0]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import os
import mlflow
from math import sqrt

# Set the experiment name to an experiment in the shared experiments folder.
# MLflow creates the experiment if it does not exist yet.
mlflow.set_experiment("/MLFlow/regressionDiabetes")


2023/03/26 10:29:50 INFO mlflow.tracking.fluent: Experiment with name '/MLFlow/regressionDiabetes' does not exist. Creating a new experiment.
Out[2]: <Experiment: artifact_location='dbfs:/databricks/mlflow-tracking/3965886382397629', creation_time=1679826590532, experiment_id='3965886382397629', last_update_time=1679826590532, lifecycle_stage='active', name='/MLFlow/regressionDiabetes', tags={'mlflow.experiment.sourceName': '/MLFlow/regressionDiabetes',
 'mlflow.experimentType': 'MLFLOW_EXPERIMENT',
 'mlflow.ownerEmail': 'sandeshhase15@gmail.com',
 'mlflow.ownerId': '5327354747386624'}>

Load a sample dataset. This will use the diabetes dataset.

In [0]:
# Load the diabetes dataset
diabetes = datasets.load_diabetes()

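Before going further, it can help to check what `load_diabetes()` returned. The following is a minimal sketch, assuming the copy of the dataset bundled with scikit-learn; the variable names `n_samples` and `n_features` are only illustrative.

```python
from sklearn import datasets

diabetes = datasets.load_diabetes()

# The loader returns a dictionary-like Bunch object.
n_samples, n_features = diabetes.data.shape
print(n_samples, n_features)     # number of rows and feature columns
print(diabetes.target.shape)     # one target value per row
print(diabetes.feature_names)    # names of the feature columns
```

Knowing the number of rows is useful later when deciding how many samples to hold out for testing.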

For ease, let's load this into a pandas DataFrame and look at the top few rows. We also save a copy as `diabetes.txt` so that it can be logged as an MLflow artifact later.

In [0]:
diabetespd = pd.DataFrame(data=diabetes.data)
diabetespd.to_csv('diabetes.txt', encoding='utf-8', index=False)


In [0]:
pwd

Out[8]: '/databricks/driver'

In [0]:
diabetespd.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.02593 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 |

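The frame above shows numeric column labels because no column names were passed to `pd.DataFrame`. As a hedged sketch, the names that ship with the dataset can be attached through `diabetes.feature_names`; the name `diabetes_named` and the extra `target` column are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn import datasets

diabetes = datasets.load_diabetes()

# Label the columns with the names that ship with the dataset.
diabetes_named = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Add the regression target as an extra column for convenience.
diabetes_named["target"] = diabetes.target

print(diabetes_named.head())
```

With named columns it becomes visible that column index 2, the one used for the single-feature model, is `bmi`.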

In [0]:
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

In [0]:
diabetes_X[0:5]

Out[11]: array([[ 0.06169621],
       [-0.05147406],
       [ 0.04445121],
       [-0.01159501],
       [-0.03638469]])

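The `np.newaxis` index keeps the selected column two-dimensional, which is the `(n_samples, n_features)` shape that scikit-learn estimators expect for `X`. The sketch below compares it with two equivalent spellings; the variable names are illustrative only.

```python
import numpy as np
from sklearn import datasets

diabetes = datasets.load_diabetes()

# Plain integer indexing drops the column axis: the result is 1-D.
flat = diabetes.data[:, 2]

# Three ways to keep a 2-D column of shape (n_samples, 1).
with_newaxis = diabetes.data[:, np.newaxis, 2]
with_slice = diabetes.data[:, 2:3]
with_reshape = flat.reshape(-1, 1)

print(flat.shape, with_newaxis.shape)
print(np.array_equal(with_newaxis, with_slice))
print(np.array_equal(with_newaxis, with_reshape))
```

Any of the three forms can be passed to `fit` and `predict`; the 1-D `flat` array cannot be passed as `X` without reshaping.
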
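Let's split the data into training/testing sets.
We will hold out the last 20 of the 442 samples for testing and train on the remaining 422 (note that this is 20 samples, not 20%).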

In [0]:
with mlflow.start_run():
  # 1st idea
  diabetes_X = diabetes.data[:, np.newaxis, 2]
  
  # 2nd idea
  #diabetes_X = diabetes.data
  
  diabetes_X_train = diabetes_X[:-20]
  diabetes_X_test = diabetes_X[-20:]

  diabetes_y_train = diabetes.target[:-20]
  diabetes_y_test = diabetes.target[-20:]

  # Lasso: linear regression with an L1 penalty of strength alpha
  regr = linear_model.Lasso(alpha=0.01)
  mlflow.log_param("alpha", 0.01)
  
#   regr = linear_model.LassoLars(alpha=0.1)
#   mlflow.log_param("alpha", 0.1)

#   regr = linear_model.BayesianRidge()   

  regr.fit(diabetes_X_train, diabetes_y_train)

  diabetes_y_pred = regr.predict(diabetes_X_test)

  mlflow.log_metric("mse", mean_squared_error(diabetes_y_test, diabetes_y_pred))
  mlflow.log_metric("rmse", sqrt(mean_squared_error(diabetes_y_test, diabetes_y_pred)))
  mlflow.log_metric("r2", r2_score(diabetes_y_test, diabetes_y_pred))
  
  mlflow.log_artifact("diabetes.txt")

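The cell above holds out the last 20 rows, which is roughly 4.5% of the 442 samples. If a proportional 80/20 split is wanted, one option is scikit-learn's `train_test_split`. The sketch below is an assumption-laden variant rather than the notebook's own method: MLflow is left out so that it runs outside a tracking environment, `random_state=42` is an arbitrary choice, and the three logged metrics are recomputed by hand to show what they measure.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()
X = diabetes.data[:, 2:3]   # single feature (column 2), kept 2-D
y = diabetes.target

# Shuffle and hold out 20% of the rows; random_state is an arbitrary choice.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regr = linear_model.Lasso(alpha=0.01)
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)

# The three metrics from the run above, computed by hand.
residuals = y_test - y_pred
mse = np.mean(residuals ** 2)                      # mean squared error
rmse = np.sqrt(mse)                                # same units as the target
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_test - y_test.mean()) ** 2)

# They should agree with scikit-learn's implementations.
assert np.isclose(mse, mean_squared_error(y_test, y_pred))
assert np.isclose(r2, r2_score(y_test, y_pred))

print(f"train rows: {len(X_train)}, test rows: {len(X_test)}")
print(f"mse={mse:.2f} rmse={rmse:.2f} r2={r2:.3f}")
```

Because the rows are shuffled before splitting, the metric values will differ from those of the last-20-rows split, and they will change with a different `random_state`.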