# Import Required Packages and Libraries

In [0]:
from pyspark.sql.types import *
import pyspark.sql.functions as F
from pyspark.sql.window import Window
import pandas as pd



__These are specific to using ML with PySpark:__

In [0]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator



# Look at Databricks DBFS Datasets

In [0]:
display(dbutils.fs.ls("/databricks-datasets/"))



__The 'Wine quality' dataset is chosen for this demonstration__

In [0]:
file_path = ""
wine_df = spark.read.csv()



In [0]:
display(wine_df)



__Each column is already a numeric type (double or integer), which is required for ML tasks__

In [0]:
wine_df.printSchema()  # printSchema() prints directly and returns None, so it is not wrapped in display()



__Looking at the statistics and shape of the wine dataset__

In [0]:
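display(wine_df.describe())  # summary statistics for every column
print((wine_df.count(), len(wine_df.columns)))  # shape as (rows, columns)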



__Checking for nulls within the dataset; none are found in the wine dataset__

In [0]:
null_counts = wine_df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in wine_df.columns])
display(null_counts)



# Step 1: Preparing the Data

__Store all of the columns we will use as features, and exclude the target ('quality')__

In [0]:
feature_columns = [c for c in wine_df.columns if c != "quality"]



#### VectorAssembler in Spark MLlib

The `VectorAssembler` is a transformer in Spark MLlib that combines a list of columns into a single vector column. This transformation is crucial for preparing the dataset for machine learning algorithms, as most of these algorithms expect input features to be consolidated into a single vector.

**Key Roles of VectorAssembler:**

- **Aggregating Features**: It packs the selected feature columns into one vector per row, so each row carries all of its features in a single value that distributed algorithms can process together.

- **Meeting Model Expectations**: Spark MLlib models are optimized to work with data presented as a single vector column, making `VectorAssembler` essential for meeting these input requirements.

**Creating an Instance of VectorAssembler:**

- `inputCols`: Specifies the list of input columns, which in our case are the feature columns of the dataset.

- `outputCol`: Names the output column that will contain the transformed vector. We use "features" as the name, aligning with standard conventions in Spark ML.

**Considerations and Flexibility:**

While `VectorAssembler` is widely used in Spark MLlib for its efficiency and alignment with model expectations, it's important to consider the specifics of your data and modeling needs. In some cases, different preprocessing steps might be more appropriate based on the data characteristics or the requirements of the specific machine learning model being used.
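To make the transformation concrete, here is a plain-Python sketch of what assembling does to a single row. It is not the Spark API: the `assemble_row` helper, the column names, and the values are hypothetical and only mimic the idea of collecting several columns into one `features` vector while keeping the original columns.

```python
# Conceptual illustration only: mimics what VectorAssembler does to one row.
# assemble_row, the column names, and the values are made up for this sketch.

def assemble_row(row, input_cols, output_col="features"):
    """Return a copy of `row` with the input columns packed into one list."""
    assembled = dict(row)
    assembled[output_col] = [float(row[c]) for c in input_cols]
    return assembled


row = {"alcohol": 9.4, "pH": 3.51, "quality": 5}
input_cols = [c for c in row if c != "quality"]  # every column except the target

assembled_row = assemble_row(row, input_cols)
print(assembled_row["features"])
```

The original columns stay in place and the new `features` entry is added alongside them, which is why a later step can still select the label column.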



In [0]:
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")



__Apply the `VectorAssembler` transformer to the wine dataset__



In [0]:
wine_df_transformed = assembler.transform(wine_df)



__Select only the features and label for the model__

In [0]:
final_data = wine_df_transformed.select("features", "quality")



__Splitting the Data into Training and Test Sets__


In [0]:
# The 80/20 weights and the seed are assumed values; adjust them as needed
train_data, test_data = final_data.randomSplit([0.8, 0.2], seed=42)



# Step 2: Building the Model

#### Linear Regression Model
Linear regression is one of the simplest and most widely used statistical techniques for predictive modeling. It models the relationship between a dependent variable and one or more independent variables.

Here, we are defining a linear regression model with:

- `featuresCol`: The name of the features column, which is "features" in our case.
- `labelCol`: The name of the label column, which is "quality", the variable we are trying to predict.

We will then fit this model to our training data.
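As a rough illustration of what "fitting" means, the sketch below fits a one-feature least-squares line in plain Python. The data is invented and the `fit_line` helper is hypothetical. Spark's `LinearRegression` uses its own solvers and supports many features plus regularization, so this only shows the underlying idea.

```python
# Ordinary least squares for one feature: y ≈ slope * x + intercept.
# The data points are invented for illustration.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # these points lie exactly on y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```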


In [0]:
lr = LinearRegression(featuresCol="features", labelCol="quality")



#### Exploring Different Model Types in Spark MLlib

While we focus on linear regression in this lesson, Spark MLlib offers a variety of other machine learning models, each suitable for different types of data and tasks:

- **Classification Models**: 
  - *Example - Decision Trees*: Useful for both classification and regression tasks
    ```python
    from pyspark.ml.classification import DecisionTreeClassifier
    dt = DecisionTreeClassifier(featuresCol='features', labelCol='label')
    dt_model = dt.fit(train_data)
    dt_predictions = dt_model.transform(test_data)
    ```

- **Clustering Models**: 
  - *Example - K-Means*: For unsupervised clustering tasks
    ```python
    from pyspark.ml.clustering import KMeans
    kmeans = KMeans().setK(3).setSeed(1)
    kmeans_model = kmeans.fit(dataset) # dataset needs to be prepared for clustering
    centers = kmeans_model.clusterCenters()
    ```

- **Recommendation Models**: 
  - *Example - ALS (Alternating Least Squares)*: Commonly used for building recommendation systems
    ```python
    from pyspark.ml.recommendation import ALS
    als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")
    als_model = als.fit(training) # training should be a dataset of user-item interactions
    als_predictions = als_model.transform(test)
    ```

Each model type comes with its own set of parameters and considerations, making Spark MLlib a versatile and powerful tool for a wide range of machine learning applications. These examples serve as a starting point for exploring the diversity of models available in Spark MLlib.



# Step 3: Tuning the Model


#### Understanding the ParamGridBuilder and Hyperparameter Tuning

The `ParamGridBuilder` is a critical tool in Spark MLlib for hyperparameter tuning, allowing us to test different combinations of parameters for our model.

**Parameters in this snippet:**

- `lr.regParam`: This is the regularization parameter, helping to prevent overfitting. We test values 0.1 and 0.01.
- `lr.fitIntercept`: Determines whether an intercept should be used in the linear model. We experiment with `True` and `False`.
- `lr.elasticNetParam`: Controls the mix of L1 and L2 regularization, tested with values 0.0, 0.5, and 1.0.
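How `regParam` and `elasticNetParam` interact is easier to see in the objective itself. Following the form given in the Spark MLlib documentation, and writing $\lambda$ for `regParam` and $\alpha$ for `elasticNetParam` (the intercept is omitted here for brevity), linear regression with squared-error loss minimizes roughly:

$$
\min_{w}\ \frac{1}{2n}\sum_{i=1}^{n}\left(w^{\top}x_i - y_i\right)^2 + \lambda\left(\alpha\,\lVert w\rVert_1 + \frac{1-\alpha}{2}\,\lVert w\rVert_2^2\right)
$$

With $\alpha = 0$ only the L2 (ridge) penalty remains, with $\alpha = 1$ only the L1 (lasso) penalty remains, and $\alpha = 0.5$ weights the two equally. Setting $\lambda = 0$ turns regularization off entirely.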

**Other Parameters to Consider:**

- **Learning Rate**: Often used in iterative algorithms to control the step size at each iteration while moving toward a minimum of a loss function.
- **Max Depth (for tree-based models)**: Specifies the maximum depth of the tree.
- **Number of Trees (for ensemble models)**: In models like Random Forests, this parameter controls the number of trees in the forest.

**Why Hyperparameter Tuning?**

Hyperparameter tuning is essential to find the most effective settings for a model, as different parameters can significantly impact model performance and generalizability.
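A parameter grid is simply every combination of the candidate values. The plain-Python sketch below enumerates the combinations for the three parameters listed above, to show how quickly the number of model fits grows. It does not use Spark, and the dictionary keys are just labels chosen for this illustration.

```python
from itertools import product

# Candidate values taken from the parameter list above.
reg_params = [0.1, 0.01]
fit_intercepts = [True, False]
elastic_net_params = [0.0, 0.5, 1.0]

# Every combination of the three lists, one dict per combination.
grid = [
    {"regParam": r, "fitIntercept": f, "elasticNetParam": e}
    for r, f, e in product(reg_params, fit_intercepts, elastic_net_params)
]

num_folds = 3  # the fold count used in this lesson
print(len(grid))              # number of parameter combinations
print(len(grid) * num_folds)  # model fits performed during cross-validation
```

Adding one more value to any list multiplies the work, so it usually pays to keep grids small and refine them in stages.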



In [0]:
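paramGrid = (
    ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).addGrid(lr.fitIntercept, [True, False])
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]).build()
)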




#### CrossValidator for Model Tuning
`CrossValidator` in Spark MLlib is used for hyperparameter tuning with cross-validation, which is a method for robustly estimating the performance of a model on unseen data.

- `estimator`: This is the machine learning model we want to tune. In our case, it's the linear regression model (`lr`).
- `estimatorParamMaps`: This parameter takes the grid of parameters we built with `ParamGridBuilder`.
- `evaluator`: This is used to evaluate each model's performance. We're using `RegressionEvaluator` with RMSE as the metric.
- `numFolds`: This parameter specifies the number of folds to use for cross-validation. We use 3 folds, meaning the training data is divided into three parts, and each part is used as the validation set once.

By using `CrossValidator`, we can ensure that our model tuning is not just tailored to a specific subset of our data, thus improving the model's generalizability.


In [0]:
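crossval = CrossValidator(
    estimator=lr, estimatorParamMaps=paramGrid, numFolds=3,
    evaluator=RegressionEvaluator(labelCol="quality", predictionCol="prediction", metricName="rmse"),
)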



__Fitting the Model with Cross Validation__


In [0]:
cvModel = crossval.fit(train_data)



### Why a Validation Set Is Not Necessary Here

In this lesson we use `CrossValidator` for hyperparameter tuning, so a separate validation set is not strictly necessary: `CrossValidator` inherently performs the function of a validation set.

- **Cross-Validation Process**: `CrossValidator` divides the training data into a number of "folds". For each combination of parameters in the grid, it trains the model on all but one fold (considered the training set) and evaluates it on the remaining fold (acting as a validation set). This process repeats for each fold, ensuring that each part of the dataset is used for both training and validation. This method provides a robust way to tune parameters while reducing the chance of overfitting.
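The fold rotation described above can be sketched in plain Python. This is not Spark's implementation: Spark assigns rows to folds randomly, whereas the hypothetical `k_fold_indices` helper below uses contiguous blocks of row indices to keep the output easy to read.

```python
# Plain-Python sketch of k-fold splitting; not Spark's implementation.
# Folds are contiguous index ranges here purely for readability.

def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_rows=9, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every row lands in the validation set exactly once and in the training set for the other folds, which is what lets the averaged metric stand in for a dedicated validation split.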

#### Splitting the data for demonstrative purposes

```python
# Here we split the data into: training, validation, and test sets and then can use the validation set during our hyperparameter testing phase to find best params
train_data, temp_data = final_data.randomSplit([0.6, 0.4])
validation_data, test_data = temp_data.randomSplit([0.5, 0.5])
```

# Step 4: Evaluating the Model

#### Applying the Tuned Model for Predictions

Once we have our model tuned using cross-validation, we apply this optimized model to our test data:

- `cvModel.transform(test_data)`: This line uses the `transform` method of our cross-validated model (`cvModel`) to make predictions on the test dataset.

**Key Points:**

- **Making Predictions**: The `transform` method applies the best model obtained from cross-validation to the test data, generating predictions based on the learned patterns.

- **Evaluating Model Performance**: By applying the model to the test data, which it hasn't seen during training, we can evaluate how well our model generalizes to new data, an essential aspect of machine learning.


In [0]:
cv_predictions = cvModel.transform(test_data)



#### Using RegressionEvaluator for Model Evaluation
The `RegressionEvaluator` in Spark MLlib is used for evaluating regression models. It computes various statistical metrics to assess the performance of regression models.

In this snippet:
- `labelCol="quality"`: Specifies the column in the DataFrame which contains the true label values.
- `predictionCol="prediction"`: Specifies the column which contains the model's predictions.
- `metricName="rmse"`: Indicates that we are using Root Mean Squared Error (RMSE) as our metric for evaluation.

By using this evaluator, we can quantify the performance of our model and understand how well it is predicting the wine quality.


In [0]:
evaluator = RegressionEvaluator(labelCol="quality", predictionCol="prediction", metricName="rmse")



#### Understanding Model Performance Metrics in Simple Terms

After making predictions using our cross-validated model, it's important to evaluate its performance using several statistical metrics. This helps us understand the model's accuracy and its ability to generalize to new data. We use the following metrics for evaluation:

- **RMSE (Root Mean Squared Error)**: 
  Imagine you're guessing the weight of a bag of apples. RMSE is like calculating how much you're off on each guess, squaring those differences (to make them positive), taking the average, and then taking the square root of that average. It's a measure of your average error, with larger errors penalized more.

- **MSE (Mean Squared Error)**: 
  Similar to RMSE, but you don't take the square root. It's like guessing the weight of the apples, squaring the errors, and taking the average. The units are squared (e.g., pounds squared if you're measuring weight in pounds), making it harder to interpret directly, but useful for comparing models.


- **MAE (Mean Absolute Error)**: 
  This is straightforward. You're guessing the weight of apples, and MAE is the average of the absolute differences between your guesses and the actual weights. It tells you, on average, how much your guess is likely to be off. It's simpler than RMSE and doesn't heavily penalize larger errors.


- **R-squared**: 
  Think of this like a score in a video game, where 1 (or 100%) is perfect prediction, and 0 is no better than guessing the average every time. It measures how well your guesses match the actual values. The closer to 1, the better your model.
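The four metrics can be computed by hand on a tiny example. The sketch below uses plain Python with invented actual and predicted values, and it is meant only to show the arithmetic behind each metric.

```python
import math

# Invented example values: actual quality scores and model predictions.
actual = [5.0, 6.0, 7.0, 5.0]
predicted = [5.5, 6.0, 6.0, 5.5]

n = len(actual)
errors = [p - a for a, p in zip(actual, predicted)]

mse = sum(e ** 2 for e in errors) / n   # mean of squared errors
rmse = math.sqrt(mse)                   # back in the original units
mae = sum(abs(e) for e in errors) / n   # mean of absolute errors

# R-squared compares the model's squared error with that of always
# predicting the mean of the actual values.
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, mae, r2)
```

In this example RMSE is larger than MAE because the single error of 1.0 is weighted more heavily once it is squared.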




In [0]:
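rmse = evaluator.evaluate(cv_predictions, {evaluator.metricName: "rmse"})
print(f"Root Mean Squared Error (RMSE) on test data = {rmse}")

mse = evaluator.evaluate(cv_predictions, {evaluator.metricName: "mse"})
print(f"Mean Squared Error (MSE) on test data = {mse}")

mae = evaluator.evaluate(cv_predictions, {evaluator.metricName: "mae"})
print(f"Mean Absolute Error (MAE) on test data = {mae}")

r2 = evaluator.evaluate(cv_predictions, {evaluator.metricName: "r2"})
print(f"R-squared on test data = {r2}")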



#### Why MLlib Doesn't Extensively Support Deep Learning

Apache Spark's MLlib is a powerful tool for machine learning, particularly in big data contexts. However, it's not primarily focused on deep learning for several reasons:

- **Big Data Processing Focus**: Spark, including MLlib, is designed for distributed computing and excels in processing large-scale data. This focus aligns more with traditional machine learning algorithms rather than the computationally intense requirements of deep learning.

- **Resource Requirements of Deep Learning**: Deep learning models typically demand a lot of computational power, often from GPUs. Spark's architecture, while great for distributed data tasks, isn't optimized for the dense computations and GPU utilization deep learning demands.

- **Ecosystem Specialization**: The deep learning ecosystem already has specialized tools. Frameworks like TensorFlow, PyTorch, and Keras are specifically designed for deep learning. This allows Spark MLlib to focus on its strengths in data processing and traditional machine learning.

- **Integration Over Reinvention**: Spark integrates with existing deep learning frameworks rather than building its own. This approach allows users to leverage Spark's data processing and feature engineering capabilities alongside specialized deep learning models from frameworks like TensorFlow or Keras.

