# This tutorial introduces the concepts of Machine Learning using Dask.

Learning outcomes of the tutorial are:

1. Learn how to do data preprocessing.
2. Learn how to implement a linear regression model.
3. Learn how to implement a K-Means clustering Model.
4. Learn how to cross validate a model.
5. Learn how to build ML pipelines.

Prerequisites:

1. Experience with the scikit-learn library
2. Experience with Dask DataFrames and Dask Arrays

In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline  # regular scikit-learn pipeline

import dask 
import dask.dataframe as dd
import dask.array as da

from dask_ml.preprocessing import Categorizer, DummyEncoder, StandardScaler, MinMaxScaler
from dask_ml.model_selection import train_test_split
from dask_ml.linear_model import LogisticRegression, LinearRegression
from dask_ml.decomposition import PCA
from dask_ml.cluster import KMeans


pd.set_option('future.no_silent_downcasting', True)

In [None]:

# The jupyter notebook is launched from your $HOME directory.
# Change the working directory to the workshop directory
# which was created in your username directory under /scratch/vp91

import os
os.chdir(os.path.expandvars("/scratch/vp91/$USER/"))

In [None]:
# https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package

ddf = dd.read_csv("intro-to-dask/data/weather/weatherAUS*.csv", dtype={'Humidity3pm': 'float64',
       'Humidity9am': 'float64',
       'WindGustSpeed': 'float64',
       'WindSpeed3pm': 'float64',
       'WindSpeed9am': 'float64'})
ddf.head()


# Data Preprocessing

The first step in building a machine learning model is data cleaning. The data we have here is not very complex, which makes cleaning easier. For a production-quality ML model, this is often the most time-consuming step.

Data cleaning mainly involves:
1. Remove any unnecessary observations from your dataset
2. Remove redundant information
3. Remove duplicate information
4. Remove structural errors in data collection
5. Remove unwanted outliers - outliers can result in overfitting
6. Handle missing data:
    * Remove observations with values missing
    * Infer the missing values

In this case, we take the easiest approach to missing values: we remove every dataframe row that has a missing value. This is not always advisable, because we lose a lot of information and may end up without the full picture.

Inferring the missing values is also not always a good idea, as the inferred values may introduce bias.
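
The two ways of handling missing data listed above can be compared on a small example. This is a minimal sketch using plain pandas and a made-up toy table (the values are hypothetical; only the column names echo the weather dataset). Mean imputation is just one possible way to infer values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries
df_toy = pd.DataFrame({
    "MinTemp": [10.0, np.nan, 14.0, 12.0],
    "Rainfall": [0.0, 2.0, np.nan, 1.0],
})

# Option 1: remove every row that contains at least one missing value
dropped = df_toy.dropna()

# Option 2: infer (impute) each missing value, here with the column mean
imputed = df_toy.fillna(df_toy.mean())

print(len(df_toy), len(dropped), len(imputed))
print(imputed)
```

Dropping keeps only complete rows, so half of this toy table disappears, while imputing keeps every row at the cost of inserting values that were never observed.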


In [None]:
ddf_clean = ddf.dropna() 
ddf_clean.head()

In [None]:
shape = ddf_clean.shape
print(shape)

In [None]:
print(type(shape[0]))
print(type(shape[1]))

As you can see, the number of columns is available immediately, while the number of rows is not: it is a lazy object. We have to invoke `compute()` to get the result.

In [None]:
shape[0].compute()

In [None]:
ddf_clean.columns

Here, we are trying to predict the temperature at 3 PM.
1. We divide the data frame into target and features.
    * The target is the value we are trying to predict
    * The features are the data points used to predict the target
2. We remove all the features we deem unnecessary

In [None]:
# Target
ddf_target = ddf_clean['Temp3pm']


Data usually contain both numerical and categorical values.
1. Categorical data groups information (usually text) with similar characteristics
2. Numerical data expresses information in the form of numbers

Most machine learning algorithms cannot handle categorical variables unless they are converted to numerical data. This process is called encoding.

Ideally, all categorical data should be converted to numerical data. In this case, we remove all categorical data other than 'RainToday' and 'RainTomorrow'.

In [None]:
# Features
ddf_features = ddf_clean.drop(columns=['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'Temp3pm'])


In [None]:
ddf_features.head()

In [None]:
shape = ddf_features.shape
print(shape)

In [None]:
shape[0].compute()

In [None]:
ddf_features.dtypes

There are two types of categorical data in Dask
1. Known: categories are known statically (from the metadata)
2. Unknown: categories are not known statically (from the metadata)

The **categorize()** function scans the entire dataset to find the different categories in each categorical feature, so that they become known in the metadata.

In [None]:
ddf_features = ddf_features.categorize()

In [None]:
ddf_features.dtypes

We can verify whether the categories of a feature are known, as shown below.

In [None]:
ddf_features.RainTomorrow.cat.known

Encoding is the method of converting categorical values into numerical values (and vice versa). There is more than one way to encode; here we use **Dummy Encoding**, in which each category ends up as its own binary (0/1) column.
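
To illustrate what dummy encoding produces, here is a small sketch using `pandas.get_dummies` on a hypothetical three-row column. It stands in for `DummyEncoder` only conceptually; the two are separate implementations.

```python
import pandas as pd

# Hypothetical categorical column with two known categories
df_toy = pd.DataFrame({
    "RainToday": pd.Categorical(["No", "Yes", "No"], categories=["No", "Yes"])
})

# One binary column per category; exactly one of them is 1 in each row
encoded = pd.get_dummies(df_toy, columns=["RainToday"], dtype=float)
print(encoded)
```

Each original value is replaced by a row of 0s with a single 1 in the column that corresponds to its category.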

In [None]:
de = DummyEncoder()
ddf_features_preproc = de.fit_transform(ddf_features.categorize())

In [None]:
ddf_features_preproc.head().reset_index()

In [None]:
ddf_features_preproc.dtypes

In [None]:
#bool_cols = ddf_features_preproc.select_dtypes(include='bool').columns
for col in ddf_features_preproc.select_dtypes(include='bool').columns:
    ddf_features_preproc[col] = ddf_features_preproc[col].astype(float)

In [None]:
ddf_features_preproc.dtypes

## Standardization
Data standardization becomes relevant when there are substantial variations in the ranges of features within the input dataset, or when those features are measured using different units (for example, meters and kilograms). `StandardScaler` rescales each feature to zero mean and unit variance.
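
The arithmetic behind standardization is short enough to write out by hand. This is a NumPy sketch on made-up numbers, where the two columns differ in scale by a factor of 1000; it is not the scaler's actual implementation.

```python
import numpy as np

# Hypothetical features on very different scales
X_toy = np.array([
    [1000.0, 1.0],
    [2000.0, 2.0],
    [3000.0, 3.0],
])

mean = X_toy.mean(axis=0)   # per-column mean
std = X_toy.std(axis=0)     # per-column standard deviation

# z-score: subtract the mean, divide by the standard deviation
X_toy_std = (X_toy - mean) / std
print(X_toy_std)
```

After the transformation both columns have mean 0 and standard deviation 1, so neither dominates purely because of its units.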

In [None]:
scaler = StandardScaler()
scalar_std = scaler.fit(ddf_features_preproc)

In [None]:
scalar_std.mean_

In [None]:
ddf_features_std = scaler.transform(ddf_features_preproc)

In [None]:
ddf_features_std.head()

In [None]:
# Standardization can result in NaN values. So check that.
na_count = ddf_features_std.isna().sum().compute()

# Output the result
print(na_count)


## Normalization

Normalization is the process of rescaling data into a fixed range; by default, `MinMaxScaler` maps each feature to [0, 1]. It is good practice to normalize the data, especially when different features have different value ranges. Normalization ensures that no single feature overly influences the model.
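
Min-max normalization can be sketched in a few lines of NumPy. The numbers below are made up, and the target range [0, 1] is assumed.

```python
import numpy as np

# Hypothetical features with different value ranges
X_toy = np.array([
    [10.0, 200.0],
    [20.0, 400.0],
    [40.0, 800.0],
])

col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)

# Map each column linearly so its minimum becomes 0 and its maximum becomes 1
X_toy_norm = (X_toy - col_min) / (col_max - col_min)
print(X_toy_norm)
```

Note that a column whose values are all identical would make the denominator zero, which is one way NaN values can appear and why the check in the following cells is useful.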

In [None]:
MinMax = MinMaxScaler()

In [None]:
MinMax.fit(ddf_features_std)                          # Fit once

In [None]:
ddf_features_norm = MinMax.transform(ddf_features_std)  # Then transform
ddf_features_norm.head()

In [None]:
# Normalization can result in NaN values. So check that.
na_count = ddf_features_norm.isna().sum().compute()

# Output the result
print(na_count)


## Correlation Matrix

Correlation is often used in machine learning to identify multicollinearity, which is when two or more predictor variables are highly correlated with each other. Multicollinearity can adversely affect the accuracy of predictive models.

* The coefficients become very sensitive to small changes in the model.
* Multicollinearity reduces the precision of the estimated coefficients, which weakens the statistical power of your regression model. 

Multicollinearity can be addressed by removing one of the correlated variables.
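
As an illustration of how highly correlated features can be spotted, the sketch below builds synthetic data in which the second feature is almost a multiple of the first. The 0.9 threshold is an arbitrary assumption, not a recommended value.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 2.0 * x1 + 0.1 * rng.normal(size=500)   # nearly a linear copy of x1
x3 = rng.normal(size=500)                    # independent feature
X_toy = np.column_stack([x1, x2, x3])

# Pearson correlation between columns
corr = np.corrcoef(X_toy, rowvar=False)

# Collect feature pairs whose absolute correlation exceeds the threshold
threshold = 0.9  # hypothetical cut-off
pairs = [
    (i, j)
    for i in range(corr.shape[0])
    for j in range(i + 1, corr.shape[1])
    if abs(corr[i, j]) > threshold
]
print(pairs)
```

Only the pair formed by the first two features is flagged, so dropping either of them would remove the redundancy.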

In [None]:
# Lazily define the Pearson correlation matrix; the other arguments keep their defaults
corr_matrix = ddf_features_norm.corr(method='pearson')


In [None]:
corr_matrix.compute()

In [None]:

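# Materialize the lazy result as a pandas DataFrame before plotting
corr = corr_matrix.compute()

f, ax = plt.subplots(figsize=(12, 10))
mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the redundant upper triangle
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, annot=True, mask=mask, cmap=cmap, ax=ax)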

Ideally, we should remove one feature from each highly correlated pair, or combine the correlated features. For the time being, we do neither.

## Principal Component Analysis

In [None]:

pca = PCA(n_components=3)

# Fit PCA on a Dask array with known chunk lengths
pca.fit(ddf_features_norm.to_dask_array(lengths=True))


In [None]:
# Transform and apply the principal components
# This gives you a Dask array of shape (n_samples, 3) with the data projected onto the principal components

X_pca = pca.fit_transform(ddf_features_norm.to_dask_array(lengths=True))
X_pca

In [None]:
pca.get_feature_names_out()

In [None]:
print(pca.explained_variance_ratio_) 

In [None]:
pca.components_

In [None]:
pca.n_components

In [None]:

# Optionally, get column names from PCA
column_names = [f"PC{i+1}" for i in range(pca.n_components)]
print(column_names)

# Create a Dask DataFrame
ddf_pca = dd.from_dask_array(X_pca, columns=column_names)

In [None]:
ddf_pca

In [None]:
ddf_pca.head()

In [None]:
# Check the normalized features for NaN values once more before splitting the data.
na_count = ddf_features_norm.isna().sum().compute()

# Output the result
print(na_count)


# Splitting data

We divide the dataset into a training set and a testing set. The training set is used to train the model, while the testing set is used to measure how well the trained model performs on unseen data.
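
The idea of a holdout split can be sketched with NumPy alone. The 20% test fraction and the `_demo` names are assumptions made for illustration; this is not necessarily how `train_test_split` partitions a Dask collection internally.

```python
import numpy as np

rng = np.random.default_rng(42)
X_demo = np.arange(20, dtype=float).reshape(10, 2)   # 10 samples, 2 features
y_demo = np.arange(10, dtype=float)

test_fraction = 0.2                          # hypothetical choice
indices = rng.permutation(len(X_demo))       # shuffled row positions
n_test = int(len(X_demo) * test_fraction)
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Features and target are indexed with the same positions so rows stay aligned
X_train_demo, X_test_demo = X_demo[train_idx], X_demo[test_idx]
y_train_demo, y_test_demo = y_demo[train_idx], y_demo[test_idx]
print(X_train_demo.shape, X_test_demo.shape)
```

Every sample lands in exactly one of the two sets, so the test score is computed on rows the model has never seen.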

In [None]:
# Divide into training and test sets

X_train, X_test, y_train, y_test = train_test_split(ddf_features_norm, ddf_target, shuffle=False)


In [None]:
X_train.compute().head()
X_test.compute().head()


In [None]:
y_train.compute().head()
y_test.compute().head()

In [None]:
y_train_array = y_train.to_dask_array(lengths=True)
X_train_array = X_train.to_dask_array(lengths=True)

In [None]:
X_train_array

# Linear Regression
Linear regression is used to predict the value of a variable based on the value of another variable or a set of variables. It is a type of **Supervised Learning**. Supervised machine learning involves establishing a connection between input variables and output variables. The input variables are often called features or independent variables, while the output is commonly called the target or dependent variable. Data containing both the features and the target is typically termed labeled data.

Linear regression tries to find the optimal W<sub>1</sub>, W<sub>2</sub>, W<sub>3</sub>, W<sub>4</sub>, so that we can predict the value of Y for the user-supplied X<sub>1</sub>, X<sub>2</sub>, X<sub>3</sub>.

$$
  Y(X_1, X_2, X_3) = W_1 X_1 + W_2 X_2 + W_3 X_3 + W_4
$$
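
To make the equation concrete, the sketch below generates noise-free synthetic data from known weights and recovers W<sub>1</sub> to W<sub>4</sub> with ordinary least squares in NumPy. The weights and data are hypothetical, and a closed-form least-squares solve is used here for illustration; it is not necessarily the solver used by the model in the following cells.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))          # columns play the role of X_1, X_2, X_3

true_w = np.array([2.0, -1.0, 0.5])        # hypothetical W_1, W_2, W_3
true_b = 4.0                               # hypothetical W_4 (intercept)
y_toy = X_toy @ true_w + true_b            # noise-free target

# Append a column of ones so the intercept is estimated like any other weight
A = np.column_stack([X_toy, np.ones(len(X_toy))])
w_hat, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
print(w_hat)
```

Because the data contain no noise, the estimated weights match the ones used to generate the target, up to floating-point error.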

In [None]:
lr = LinearRegression()

### Train the model

In [None]:
lr.fit(X_train_array, y_train_array)


In [None]:
X_test_array = X_test.to_dask_array(lengths=True, chunks=(1000, X_test.shape[1]))  # 1000 samples per chunk, full features
y_test_array = y_test.to_dask_array(lengths=True, chunks=(1000,))  # 1000 samples per chunk


In [None]:
X_test_array

In [None]:
y_test_array

In [None]:
# Get predictions
y_pred = lr.predict(X_test_array)

In [None]:
print(y_pred[0].compute())
print(y_test_array[0].compute())

### Score the performance of the model using test data
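
The score used here is the coefficient of determination, R². As a small worked example on made-up numbers, it can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Hypothetical true values and predictions
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.5, 5.5, 7.0, 9.0])

ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)          # unexplained variation
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)   # total variation
r2_toy = 1.0 - ss_res / ss_tot
print(r2_toy)
```

A value of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the target.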

In [None]:
from sklearn.metrics import r2_score

# R² is the coefficient of determination, not a classification accuracy
r2 = r2_score(y_test_array.compute(), y_pred.compute())

print("R² score = ", r2)

# Cross validation
Cross-validation is a method for evaluating ML models by training several models on subsets of the data and evaluating each one on the remaining, held-out subset. The advantages of cross-validation are:

1. Identify Overfitting
2. Comparison between different models 
3. Hyperparameter tuning
4. Efficiency : Allows the use of data for both training and validation
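
The mechanics of k-fold cross-validation can be written out by hand. The sketch below uses synthetic data, a least-squares linear model, and k = 5; all of these are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = X_toy @ np.array([1.5, -2.0]) + 0.3 + rng.normal(scale=0.1, size=100)

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

k = 5
folds = np.array_split(np.arange(len(X_toy)), k)   # k disjoint index blocks

scores = []
for i in range(k):
    val_idx = folds[i]                             # fold i is held out
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    # Fit on the other k-1 folds (column of ones = intercept)
    A_train = np.column_stack([X_toy[train_idx], np.ones(len(train_idx))])
    w, *_ = np.linalg.lstsq(A_train, y_toy[train_idx], rcond=None)

    # Score on the held-out fold
    A_val = np.column_stack([X_toy[val_idx], np.ones(len(val_idx))])
    scores.append(r2(y_toy[val_idx], A_val @ w))

print(np.round(scores, 3), np.mean(scores))
```

Each sample is used for validation exactly once and for training k-1 times, and the spread of the per-fold scores gives a rough sense of how stable the model is.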

In [None]:
# Perform cross-validation (e.g., 5-fold)
from sklearn.model_selection import TimeSeriesSplit
from dask_ml.model_selection import GridSearchCV
from sklearn.model_selection import KFold 

# Define a (dummy) param grid; even if you don't want to tune, it's needed
param_grid = {
    # LinearRegression has no real hyperparams; this is just for structure
    # Add any other model params here if applicable
}

# Cross-validator (5-fold CV)
cv = KFold(n_splits=5)

# GridSearchCV
grid = GridSearchCV(estimator=lr, param_grid=param_grid, cv=cv, scoring='r2')
grid.fit(X_train_array, y_train_array)

# Best score and estimator
print("Best R² score:", grid.best_score_)
print("Best estimator:", grid.best_estimator_)

| Aspect               | `cross-validation`                    | `.score(...)`                         |
| -------------------- | ------------------------------------- | ------------------------------------- |
| Data splits          | Multiple (k-folds)                    | Single                                |
| Trains model?        | Yes, multiple times                   | No (uses already fitted model)        |
| Gives variance?      | Yes (scores per fold)                 | No                                    |
| Purpose              | Model validation/generalization check | Evaluate performance on specific data |
| Risk of overfitting? | Lower                                 | Higher if only tested on one split    |


# K-Means Clustering

k-means clustering partitions n observations into k clusters, in which each observation belongs to the cluster with the nearest mean (cluster centroid). k-means clustering is a type of **Unsupervised Learning**. In unsupervised learning, the algorithm finds groups or patterns without the need for labeled data.
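
The assign-then-update loop at the heart of k-means can be sketched in NumPy. This is the textbook iteration (often called Lloyd's algorithm) on synthetic, well-separated blobs; the initial centroids are picked by hand, one point per blob, which is a simplification and not the initialization used by the `KMeans` class in the cells that follow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical, well-separated blobs of 30 points each
centers_true = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers_true])

k = 3
centroids = X_toy[[0, 30, 60]].copy()   # simplified, hand-picked initialization

for _ in range(10):
    # Assignment step: each point goes to its nearest centroid
    dists = np.linalg.norm(X_toy[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    centroids = np.array([X_toy[labels == j].mean(axis=0) for j in range(k)])

print(centroids)
```

The loop never looks at a target column: the grouping emerges purely from distances between observations.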

In [None]:


ddf = dd.read_csv("/g/data/vp91/Training-Data/Diabetes/diabetes.csv", dtype={
        'Pregnancies': 'float64',
        'Glucose': 'float64',
        'BloodPressure': 'float64',
        'SkinThickness': 'float64',
        'Insulin': 'float64',
        'BMI': 'float64',
        'DiabetesPedigreeFunction': 'float64',
        'Age': 'float64',
        'Outcome': 'float64'})


ddf.head()

In [None]:
type(ddf)

In [None]:
ddf.columns

In [None]:
ddf.head()

In [None]:
ddf = ddf.dropna()

In [None]:
ddf.head()

In [None]:
# Standardization (zero mean, unit variance)

scaler = StandardScaler()
ddf = scaler.fit_transform(ddf)

In [None]:
sns.scatterplot(data = ddf.compute(), x = 'BMI', y = 'DiabetesPedigreeFunction')

In [None]:
kmeans = KMeans(n_clusters=3, init_max_iter=1, oversampling_factor=8)

In [None]:
print(type(ddf))

In [None]:
lengths = ddf.map_partitions(len, meta=('x', int)).compute()
X = ddf.to_dask_array(lengths=tuple(lengths))


In [None]:
kmeans.fit(X)

In [None]:
kmeans.labels_

In [None]:
# labels_ is a lazy Dask array, so compute it before handing it to seaborn
sns.scatterplot(data = ddf.compute(), x = 'BMI', y = 'DiabetesPedigreeFunction', hue = kmeans.labels_.compute())

## Exercise
1. Test the result without data normalization
2. Apply other data preprocessing to the data
3. Change the number of clusters

# Pipelining
We saw that an ML workflow involves multiple stages. We can combine multiple stages of this workflow into a single pipeline. This is especially useful when your model is iterative. 
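
As a small illustration of the idea, the sketch below chains a scaler and a regressor with scikit-learn's `Pipeline` on synthetic data. The step names, data, and choice of estimators are assumptions for this example and differ from the pipeline built in the next cell.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales, with a noise-free linear target
X_toy = rng.normal(size=(100, 3)) * np.array([1.0, 100.0, 0.01])
y_toy = X_toy @ np.array([2.0, 0.03, 50.0]) + 1.0

# Each step is a (name, estimator) pair; fit() runs the steps in order
pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("regress", LinearRegression()),
])
pipe_demo.fit(X_toy, y_toy)

print(pipe_demo.score(X_toy, y_toy))   # R² of the final step on scaled inputs
```

A single `fit` call scales the data and trains the model, and `score` or `predict` applies the same fitted scaling automatically, which removes the risk of forgetting a preprocessing step.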

In [None]:
pipe = Pipeline([('reduce_dim', PCA()), 
                 ('cluster', KMeans(n_clusters = 3, random_state = 0, n_init='auto'))
                ])

In [None]:
pipe

In [None]:
kmeans = pipe.fit(ddf.to_dask_array(lengths=tuple(lengths)))

In [None]:
# labels_ is a lazy Dask array, so compute it before handing it to seaborn
sns.scatterplot(data = ddf.compute(), x = 'BMI', y = 'DiabetesPedigreeFunction', hue = pipe['cluster'].labels_.compute())

## Exercise
1. Add normalization to the pipeline (Solutions1)