
## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) is the Databricks File System, which stores data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. You can switch languages within a cell by using the `%LANGUAGE` syntax. Python, Scala, SQL, and R are all supported.

In [0]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
from math import ceil

# PySpark imports
from pyspark.sql import SparkSession  
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number, when, lit, count, lag, expr

# Spark ML imports
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler




In [0]:
# File location and type
file_location_train = "dbfs:/FileStore/tables/train_df-2.csv"
file_location_val = "dbfs:/FileStore/tables/val_df.csv"
file_location_test = "dbfs:/FileStore/tables/test_df-2.csv"

train_data = spark.read.csv(file_location_train, header=True, inferSchema=True)
val_data = spark.read.csv(file_location_val, header=True, inferSchema=True)
test_data = spark.read.csv(file_location_test, header=True, inferSchema=True)

In [0]:


# Select feature columns (all except 'label', 'time', and 'file').
# The loop variable is named `c` so it is not confused with the imported `col` function.
feature_cols = [c for c in train_data.columns if c not in ['label', 'time', 'file']]

# Assemble features into a single vector column
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_data = assembler.transform(train_data).select("features", "label")
val_data = assembler.transform(val_data).select("features", "label")

# Display the transformed train_data to verify
train_data.show(5)
val_data.show(5)


+--------------------+-----+
|            features|label|
+--------------------+-----+
|[-1.611721611721E...|    0|
|[1.95360195360194...|    0|
|[-9.7680097680097...|    0|
|[-4.1025641025641...|    0|
|[4.49328449328449...|    0|
+--------------------+-----+
only showing top 5 rows

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[-1.404639804639E...|    0|
|[-1.408547008547E...|    0|
|[-9.4749694749694...|    0|
|[-7.2869352869352...|    0|
|[-5.5677655677655...|    0|
+--------------------+-----+
only showing top 5 rows
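
To make the assembling step concrete, here is a minimal plain-Python sketch of the same idea: drop the non-feature columns, then pack the remaining values of each row into one vector. The column names and the row values are hypothetical, since the real schema is not shown in this notebook.

```python
# Columns excluded from the feature vector, as in the cell above.
excluded = {"label", "time", "file"}

# Hypothetical schema and row, for illustration only.
columns = ["time", "sensor_a", "sensor_b", "label", "file"]
row = {"time": 0.01, "sensor_a": 0.5, "sensor_b": -1.2, "label": 0, "file": "run_1"}

# Same selection logic as feature_cols in the notebook.
feature_cols = [c for c in columns if c not in excluded]

# A VectorAssembler-style result: one numeric vector per row, in column order.
features = [float(row[c]) for c in feature_cols]

print(feature_cols)
print(features, row["label"])
```

Only the `features` vector and the `label` survive the `select` call, which is why the output above has exactly two columns.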



LOGISTIC REGRESSION:

Grid Search:
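
`ParamGridBuilder` combined with `CrossValidator` evaluates every combination in the grid rather than sampling from it. The plain-Python sketch below enumerates the search space used in the next cell and estimates how many model fits it implies. The count assumes one fit per combination per fold plus a final refit of the best combination on the full training set.

```python
from itertools import product

# Values copied from the ParamGridBuilder call in the next cell.
reg_params = [0.1, 1, 10]
elastic_net_params = [0.0, 0.5]
num_folds = 5

# Every combination is tried, so the grid is the Cartesian product.
grid = [
    {"regParam": r, "elasticNetParam": e}
    for r, e in product(reg_params, elastic_net_params)
]

# One fit per combination per fold, plus one refit of the winner.
total_fits = len(grid) * num_folds + 1

print(len(grid), "combinations,", total_fits, "model fits")
```

With this grid the search covers 6 combinations, so the cost grows quickly as more values or more folds are added.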

In [0]:
# Get the active Spark session (on Databricks `spark` already exists, so getOrCreate() reuses it)
spark = SparkSession.builder.appName("Logistic Regression Spark").getOrCreate()

# Initialize the Logistic Regression model
log_reg = LogisticRegression(labelCol='label', featuresCol='features', maxIter=5, predictionCol='prediction')

# Define the parameter grid for logistic regression
param_grid = ParamGridBuilder() \
    .addGrid(log_reg.regParam, [0.1, 1, 10]) \
    .addGrid(log_reg.elasticNetParam, [0.0, 0.5]) \
    .build()

# Initialize the evaluator
evaluator = MulticlassClassificationEvaluator(labelCol='label', predictionCol='prediction', metricName='f1')

# Initialize CrossValidator for hyperparameter tuning
crossval = CrossValidator(
    estimator=log_reg,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=5,  # 5-fold cross-validation
    parallelism=2  # Fit at most two models in parallel to limit resource usage
)

# Fit the cross-validator to the training data
cv_model = crossval.fit(train_data)

# Extract the best model
best_model = cv_model.bestModel

# Print only tuned parameters (exclude default values)
tuned_params = ['regParam', 'elasticNetParam']
best_params = {}

for param in tuned_params:
    if best_model.hasParam(param) and best_model.isSet(getattr(log_reg, param)):
        best_params[param] = best_model.getOrDefault(getattr(log_reg, param))

print("Best Tuned Parameters:")
for param, value in best_params.items():
    print(f"  {param}: {value}")

# Print the fitted coefficients and intercepts of the best model
print("Best Coefficient Matrix:")
print(best_model.coefficientMatrix)
print("Best Intercept Vector:")
print(best_model.interceptVector)

# Evaluate the best model on the validation set and print its F1 score
print("Best F1 Score:", evaluator.evaluate(best_model.transform(val_data)))


Best Tuned Parameters:
  regParam: 0.1
  elasticNetParam: 0.0
Best Coefficient Matrix:
DenseMatrix([[  1.40450407,   2.53122411,   0.14011339,   1.97657379,
                2.23938477,  -1.31477332,  -6.1823494 ,   2.68776755,
                2.07758891,  -0.54064081,  -2.19668173,  -3.95818659,
                2.29067844,  -0.66173142,  -3.38937654,  -4.6374889 ,
                6.36666242,   5.59839772,  -0.14467872,  -1.89006418,
                3.36391549,  -0.94161118,  -3.37570875],
             [ -3.20014889,  -1.43777489,   1.07288034,  -0.44998933,
               -1.49782714,  -1.69682513,  -5.29850688,   1.26934949,
               -3.56387198,  -7.6315742 ,   0.94984046,   2.81197186,
               -4.98055581,  -1.70976195,   3.06201981,   2.79480495,
               -5.84092199, -13.01200814,  -1.071159  ,   1.39017911,
               -5.10802035,   2.1044435 ,   3.05682764],
             [  1.79564482,  -1.09344921,  -1.21299374,  -1.52658446,
               -0.74155763,  

For the `family` parameter, choose between `multinomial` and `binomial`.
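
The difference between the two families can be sketched in plain Python: a binomial model maps a single margin to one probability with a sigmoid, while a multinomial model maps one margin per class to a probability distribution with a softmax. The margins below are made-up numbers, not values from the fitted model.

```python
import math

def sigmoid(margin):
    """Binomial case: one margin mapped to P(label = 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

def softmax(margins):
    """Multinomial case: one margin per class mapped to a distribution."""
    shift = max(margins)  # subtract the max for numerical stability
    exps = [math.exp(m - shift) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical margins (coefficients . features + intercept).
binary_prob = sigmoid(0.0)
class_probs = softmax([2.0, 1.0, 0.1])

# The predicted class is the one with the highest probability.
predicted_class = class_probs.index(max(class_probs))

print(binary_prob, class_probs, predicted_class)
```

Since the coefficient matrix printed above has one row per class and more than one row, `multinomial` is the natural fit for this dataset.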

Model:

In [0]:
# Create the logistic regression model
log_reg = LogisticRegression(labelCol='label', featuresCol='features', maxIter=5, family='multinomial', regParam=5, elasticNetParam=0)

# Fit the model to the training set
lr_model = log_reg.fit(train_data)

# Make predictions on the training and validation sets
train_predictions = lr_model.transform(train_data)
val_predictions = lr_model.transform(val_data)

# Initialize the evaluator for F1 score
evaluator = MulticlassClassificationEvaluator(labelCol='label', predictionCol='prediction', metricName='f1')

# Calculate F1 scores
score_train = evaluator.evaluate(train_predictions)
score_val = evaluator.evaluate(val_predictions)

print("Training F1 Score:", score_train)
print("Validation F1 Score:", score_val)

# Extract the training-set predictions as a Python list
lr_y_pred = train_predictions.select("prediction").rdd.flatMap(lambda x: x).collect()


Training F1 Score: 0.906385588086969
Validation F1 Score: 0.8554010609489878
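
For reference, the `f1` metric of `MulticlassClassificationEvaluator` is, to our understanding, a weighted F1: the per-class F1 scores averaged with weights equal to each class's share of the true labels. A minimal plain-Python sketch of that computation, using toy labels rather than the notebook's data:

```python
from collections import Counter

def weighted_f1(labels, predictions):
    """Per-class F1, weighted by each class's share of the true labels."""
    support = Counter(labels)
    total = len(labels)
    score = 0.0
    for cls, n_true in support.items():
        # True positives and predicted positives for this class.
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        n_pred = sum(1 for p in predictions if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score

# Toy example: one sample of class 0 is misclassified as class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
toy_score = weighted_f1(y_true, y_pred)
print(toy_score)
```

Because of the weighting, frequent classes dominate the score, which matters if the labels are imbalanced.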


The model overfit with the parameters found by the search, so let's increase the regularization.
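
As a reminder of what the two regularization settings control, the sketch below computes the elastic net penalty in the form given in the Spark MLlib documentation: `regParam * (elasticNetParam * L1 + (1 - elasticNetParam) / 2 * L2^2)`. The weight vector is hypothetical, and the function name is ours.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss for a given weight vector."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    mix = elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    return reg_param * mix

# Hypothetical weights, for illustration only.
weights = [1.0, -2.0, 0.5]

weak_ridge = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.0)
strong_ridge = elastic_net_penalty(weights, reg_param=5, elastic_net_param=0.0)
strong_mixed = elastic_net_penalty(weights, reg_param=5, elastic_net_param=0.5)

print(weak_ridge, strong_ridge, strong_mixed)
```

A larger `regParam` scales the whole penalty, while `elasticNetParam` shifts it from pure L2 (0.0) toward pure L1 (1.0), which tends to push more coefficients to exactly zero.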

In [0]:
from pyspark.ml.classification import OneVsRest

# Create the logistic regression model
log_reg = LogisticRegression(labelCol='label', featuresCol='features', maxIter=5, regParam=5, elasticNetParam=0.5)

# One-vs-Rest strategy
ovr = OneVsRest(classifier=log_reg)

# Fit the model to the training set
lr_model = ovr.fit(train_data)

# Make predictions on the training and validation sets
train_predictions = lr_model.transform(train_data)
val_predictions = lr_model.transform(val_data)

# Calculate F1 scores (reuses the evaluator defined in the previous cell)
score_train = evaluator.evaluate(train_predictions)
score_val = evaluator.evaluate(val_predictions)

print("Training F1 Score:", score_train)
print("Validation F1 Score:", score_val)


Training F1 Score: 0.906385588086969
Validation F1 Score: 0.8579304236958261


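
The One-vs-Rest strategy used in the last cell can be sketched in plain Python: the labels are turned into one binary problem per class, one binary classifier is trained on each, and the final prediction is the class whose classifier is most confident. The helper names and the scores below are hypothetical and only illustrate the reduction, not Spark's internals.

```python
def binarize(labels, positive_class):
    """Relabel a multiclass problem as 'this class' (1) vs 'all others' (0)."""
    return [1 if y == positive_class else 0 for y in labels]

def one_vs_rest_predict(scores_per_class):
    """Pick the class whose binary classifier gives the highest score."""
    return max(range(len(scores_per_class)), key=lambda k: scores_per_class[k])

# Toy labels with three classes.
labels = [0, 1, 2, 1]
binary_targets = {k: binarize(labels, k) for k in sorted(set(labels))}

# Hypothetical confidence scores from the three binary classifiers for one row.
prediction = one_vs_rest_predict([0.2, 0.7, 0.1])

print(binary_targets)
print(prediction)
```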