## Overview

This notebook shows how to load a file that you uploaded to DBFS into a DataFrame and query it. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html), the Databricks File System, lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python** so the default cell type is Python. However, you can use different languages by using the `%LANGUAGE` syntax. Python, Scala, SQL, and R are all supported.

In [0]:
# File location and type
file_location = "/FileStore/shared_uploads/astartupcto@gmail.com/tips.csv"
file_type = "csv"

# Read the CSV file, using the first row as column names and inferring each column's type.
df = spark.read.csv(file_location, header=True, inferSchema=True)
df.show()

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2|
|     10.27|1.71|  Male|    No|Sun|Dinner|   2|
|     35.26| 5.0|Female|    No|Sun|Dinner|   4|
|     15.42|1.57|  Male|    No|Sun|Dinner|   2|
|     18.43| 3.0|  Male|    No|Sun|Dinner|   4|
|     14.83|3.02|Female|    No|Sun|Dinner|   2|
|     21.58|3.92|  Male|    No|Sun|Dinner|   2|
|     10.33|1.67|Female|    No|Sun|Dinner|   3|
|     16.29|3.71|  Male|    No|Sun|Dinne

In [0]:
#show schema of dataframe
df.printSchema()


root
 |-- total_bill: double (nullable = true)
 |-- tip: double (nullable = true)
 |-- sex: string (nullable = true)
 |-- smoker: string (nullable = true)
 |-- day: string (nullable = true)
 |-- time: string (nullable = true)
 |-- size: integer (nullable = true)



In [0]:
#show columns of dataframe
df.columns

Out[27]: ['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']

In [0]:
### Handling Categorical Features
from pyspark.ml.feature import StringIndexer

In [0]:
# Convert the categorical string column "sex" into a numerical column "sex_indexed".
# This prepares the data for machine learning models that require numerical input.
indexer = StringIndexer(inputCol="sex", outputCol="sex_indexed")
df_r = indexer.fit(df).transform(df)
df_r.show()

# Components of the code
#
# StringIndexer:
#   A feature transformer in PySpark that encodes a string column of labels as a column of label indices.
#   This is useful for converting categorical string columns into numerical indices,
#   which most machine learning algorithms require.
#
# Parameters:
#   inputCol="sex": the input column that contains the string labels.
#   outputCol="sex_indexed": the name of the output column that will hold the numerical indices
#   corresponding to the string labels.
#
# fit and transform:
#   indexer.fit(df): computes the mapping from string labels to numerical indices from the DataFrame df.
#   .transform(df): applies that mapping to df, producing a new DataFrame df_r
#   with the additional column "sex_indexed".

+----------+----+------+------+---+------+----+-----------+
|total_bill| tip|   sex|smoker|day|  time|size|sex_indexed|
+----------+----+------+------+---+------+----+-----------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|        1.0|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|        0.0|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|        0.0|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|        0.0|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|        1.0|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|        0.0|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|        0.0|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|        0.0|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2|        0.0|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2|        0.0|
|     10.27|1.71|  Male|    No|Sun|Dinner|   2|        0.0|
|     35.26| 5.0|Female|    No|Sun|Dinner|   4|        1.0|
|     15.42|1.57|  Male|    No|Sun|Dinner|   2|        0.0|
|     18.43| 3.0|  Male|    No|Sun|Dinne
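
To make the mapping concrete, here is a minimal pure-Python sketch of the idea behind `StringIndexer`. It assumes the default frequency-based ordering, in which the most frequent label receives index 0. The helper name `fit_string_index`, the alphabetical tie-break, and the six-value sample (the `sex` values of the first six rows shown above) are illustrative choices, not part of the PySpark API.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each distinct label to a float index, most frequent label first.

    Ties are broken alphabetically here so the result is deterministic
    (an assumption of this sketch).
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}


# The "sex" values of the first six rows of the DataFrame shown above.
sex = ["Female", "Male", "Male", "Male", "Female", "Male"]

mapping = fit_string_index(sex)           # the "fit" step: learn the mapping
indexed = [mapping[value] for value in sex]  # the "transform" step: apply it

print(mapping)  # → {'Male': 0.0, 'Female': 1.0}
print(indexed)  # → [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
```

This matches the `sex_indexed` column above, where `Male` maps to `0.0` and `Female` to `1.0`.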

In [0]:
from pyspark.ml.feature import StringIndexer
# Create a StringIndexer instance that will convert categorical columns into numerical indices
indexer = StringIndexer(
    inputCols=["smoker", "day", "time"],        # Specify the input columns to be indexed
    outputCols=["smoker_indexed", "day_indexed", "time_index"]  # Specify the output columns to store the indices
)

# Fit the StringIndexer on the DataFrame and transform the DataFrame to add the indexed columns
df_r = indexer.fit(df_r).transform(df_r)

# Show the transformed DataFrame with the new indexed columns
df_r.show()

+----------+----+------+------+---+------+----+-----------+--------------+-----------+----------+
|total_bill| tip|   sex|smoker|day|  time|size|sex_indexed|smoker_indexed|day_indexed|time_index|
+----------+----+------+------+---+------+----+-----------+--------------+-----------+----------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|        1.0|           0.0|        1.0|       0.0|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|        0.0|           0.0|        1.0|       0.0|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|        0.0|           0.0|        1.0|       0.0|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|        0.0|           0.0|        1.0|       0.0|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|        1.0|           0.0|        1.0|       0.0|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|        0.0|           0.0|        1.0|       0.0|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|        0.0|           0.0|        1.0|       0.0|
|     26.88|3.12|  M

In [0]:
df_r.columns

Out[32]: ['total_bill',
 'tip',
 'sex',
 'smoker',
 'day',
 'time',
 'size',
 'sex_indexed',
 'smoker_indexed',
 'day_indexed',
 'time_index']

In [0]:
#The purpose of using VectorAssembler in this code is to combine multiple individual feature columns into a single vector column. This is an essential step in preparing data for machine learning algorithms, which typically require input features to be in vector form.

from pyspark.ml.feature import VectorAssembler

# Create an instance of the VectorAssembler
featureassembler = VectorAssembler(
    inputCols=['tip', 'size', 'sex_indexed', 'smoker_indexed', 'day_indexed', 'time_index'],  # List of input columns to combine
    outputCol="Independent Features"  # Name of the output column that will contain the combined feature vector
)

# Transform the DataFrame to add the new feature vector column
output = featureassembler.transform(df_r)

# Show the transformed DataFrame with the new feature vector column
output.show()


+----------+----+------+------+---+------+----+-----------+--------------+-----------+----------+--------------------+
|total_bill| tip|   sex|smoker|day|  time|size|sex_indexed|smoker_indexed|day_indexed|time_index|Independent Features|
+----------+----+------+------+---+------+----+-----------+--------------+-----------+----------+--------------------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|        1.0|           0.0|        1.0|       0.0|[1.01,2.0,1.0,0.0...|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|        0.0|           0.0|        1.0|       0.0|[1.66,3.0,0.0,0.0...|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|        0.0|           0.0|        1.0|       0.0|[3.5,3.0,0.0,0.0,...|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|        0.0|           0.0|        1.0|       0.0|[3.31,2.0,0.0,0.0...|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|        1.0|           0.0|        1.0|       0.0|[3.61,4.0,1.0,0.0...|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4| 
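
Conceptually, the assembler just reads the listed columns of each row, in order, into one numeric vector. The following pure-Python sketch illustrates that for the first row of the table above; the function `assemble` and the dictionary representation of a row are hypothetical stand-ins, not how PySpark stores data.

```python
def assemble(row, input_cols):
    """Collect the named columns of one row into a single list of floats."""
    return [float(row[col]) for col in input_cols]


input_cols = ["tip", "size", "sex_indexed", "smoker_indexed", "day_indexed", "time_index"]

# First row of df_r, written as a plain dictionary for illustration.
first_row = {
    "total_bill": 16.99,
    "tip": 1.01,
    "size": 2,
    "sex_indexed": 1.0,
    "smoker_indexed": 0.0,
    "day_indexed": 1.0,
    "time_index": 0.0,
}

features = assemble(first_row, input_cols)
print(features)  # → [1.01, 2.0, 1.0, 0.0, 1.0, 0.0]
```

Note that `total_bill` is deliberately left out of `input_cols`: it is the target the model will predict, not an input feature.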

In [0]:
#The purpose of this code is to verify and visualize the combined feature vectors created by the VectorAssembler. 
# It helps in understanding how the individual feature columns have been combined into a single vector column.
output.select('Independent Features').show()

+--------------------+
|Independent Features|
+--------------------+
|[1.01,2.0,1.0,0.0...|
|[1.66,3.0,0.0,0.0...|
|[3.5,3.0,0.0,0.0,...|
|[3.31,2.0,0.0,0.0...|
|[3.61,4.0,1.0,0.0...|
|[4.71,4.0,0.0,0.0...|
|[2.0,2.0,0.0,0.0,...|
|[3.12,4.0,0.0,0.0...|
|[1.96,2.0,0.0,0.0...|
|[3.23,2.0,0.0,0.0...|
|[1.71,2.0,0.0,0.0...|
|[5.0,4.0,1.0,0.0,...|
|[1.57,2.0,0.0,0.0...|
|[3.0,4.0,0.0,0.0,...|
|[3.02,2.0,1.0,0.0...|
|[3.92,2.0,0.0,0.0...|
|[1.67,3.0,1.0,0.0...|
|[3.71,3.0,0.0,0.0...|
|[3.5,3.0,1.0,0.0,...|
|(6,[0,1],[3.35,3.0])|
+--------------------+
only showing top 20 rows



In [0]:
finalized_data=output.select("Independent Features","total_bill")

In [0]:
finalized_data.show()

+--------------------+----------+
|Independent Features|total_bill|
+--------------------+----------+
|[1.01,2.0,1.0,0.0...|     16.99|
|[1.66,3.0,0.0,0.0...|     10.34|
|[3.5,3.0,0.0,0.0,...|     21.01|
|[3.31,2.0,0.0,0.0...|     23.68|
|[3.61,4.0,1.0,0.0...|     24.59|
|[4.71,4.0,0.0,0.0...|     25.29|
|[2.0,2.0,0.0,0.0,...|      8.77|
|[3.12,4.0,0.0,0.0...|     26.88|
|[1.96,2.0,0.0,0.0...|     15.04|
|[3.23,2.0,0.0,0.0...|     14.78|
|[1.71,2.0,0.0,0.0...|     10.27|
|[5.0,4.0,1.0,0.0,...|     35.26|
|[1.57,2.0,0.0,0.0...|     15.42|
|[3.0,4.0,0.0,0.0,...|     18.43|
|[3.02,2.0,1.0,0.0...|     14.83|
|[3.92,2.0,0.0,0.0...|     21.58|
|[1.67,3.0,1.0,0.0...|     10.33|
|[3.71,3.0,0.0,0.0...|     16.29|
|[3.5,3.0,1.0,0.0,...|     16.97|
|(6,[0,1],[3.35,3.0])|     20.65|
+--------------------+----------+
only showing top 20 rows



In [0]:
# Train a linear regression model on the training split. The independent features are combined into one vector column, and the target variable is total_bill. The trained model can then be used to make predictions on new data.
from pyspark.ml.regression import LinearRegression

# Train-test split
train_data, test_data = finalized_data.randomSplit([0.75, 0.25])

# Initialize the LinearRegression model
regressor = LinearRegression(featuresCol='Independent Features', labelCol='total_bill')

# Fit the LinearRegression model on the training data
regressor = regressor.fit(train_data)
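
Because no seed is given, the split above will generally differ from run to run, and so will the coefficients and metrics that follow. The sketch below illustrates the general idea of a seeded, per-row random split in plain Python. It is a simplification, not Spark's implementation: the function `random_split` and the 20-row stand-in dataset are hypothetical.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row independently to train or test using a seeded generator."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in the training set with probability train_fraction.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(20))  # stand-in for the rows of finalized_data
train, test = random_split(rows, train_fraction=0.75, seed=42)

# Sizes are only approximately 75% / 25% because rows are assigned one by one.
print(len(train), len(test))
```

In PySpark, `randomSplit` accepts an optional `seed` argument, for example `finalized_data.randomSplit([0.75, 0.25], seed=42)`, which is the usual way to make a split repeatable.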


In [0]:
regressor.coefficients

Out[39]: DenseVector([2.7985, 3.4757, -1.0544, 1.8106, -0.3856, -1.1182])

In [0]:
regressor.intercept

Out[40]: 2.6572337983204286

In [0]:
### Predictions
pred_results=regressor.evaluate(test_data)

In [0]:
## Final comparison
pred_results.predictions.show()

+--------------------+----------+------------------+
|Independent Features|total_bill|        prediction|
+--------------------+----------+------------------+
|(6,[0,1],[1.75,2.0])|     17.82| 14.50592690983163|
| (6,[0,1],[2.0,2.0])|     12.69| 15.20554891453792|
|(6,[0,1],[2.24,3.0])|     16.04|19.352855578339547|
| (6,[0,1],[2.5,4.0])|     18.35| 23.55613200251767|
|(6,[0,1],[2.72,2.0])|     13.28| 17.22046028809204|
|(6,[0,1],[3.35,3.0])|     20.65|22.459177279235472|
| (6,[0,1],[4.3,2.0])|      21.7| 21.64207135783579|
|(6,[0,1],[6.73,4.0])|     48.27|  35.3937363221481|
|[1.0,1.0,1.0,0.0,...|      7.25| 7.876964304410876|
|[1.0,1.0,1.0,1.0,...|      3.07| 9.687578213722702|
|[1.1,2.0,1.0,1.0,...|      12.9|13.443096554888802|
|[1.44,2.0,0.0,0.0...|      7.56|11.748909539928615|
|[1.48,2.0,0.0,0.0...|      8.52|11.860849060681623|
|[1.5,2.0,0.0,0.0,...|     19.08|11.916818821058126|
|[1.5,2.0,0.0,1.0,...|     12.03|14.459990630161906|
|[1.5,2.0,1.0,0.0,...|     26.41|12.7518778531


### Explanation of the Columns

1. **Independent Features**:
    - This column contains the feature vectors that were used as inputs to the linear regression model. Each vector combines multiple individual features into a single vector.
    - Some of the vectors are displayed in sparse format, while others appear as dense bracketed lists such as `[1.0,1.0,1.0,0.0,...]`. In the sparse format, `(6,[0,1],[1.75,2.0])` means:
        - The vector has 6 elements.
        - The non-zero elements are at indices 0 and 1.
        - The values of the non-zero elements are 1.75 and 2.0 respectively.

2. **total_bill**:
    - This column contains the actual values of the target variable (the true labels) from the test dataset. In this context, `total_bill` represents the actual bill amount.

3. **prediction**:
    - This column contains the predicted values of the target variable, calculated by the linear regression model based on the feature vectors in the `Independent Features` column. These predictions are the model's estimate of the `total_bill`.
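
The sparse notation can be expanded by hand. This small sketch, with a hypothetical helper `sparse_to_dense`, turns the triple of size, indices, and values into the equivalent dense list:

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) description into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


print(sparse_to_dense(6, [0, 1], [1.75, 2.0]))
# → [1.75, 2.0, 0.0, 0.0, 0.0, 0.0]
```

Both notations describe the same vector. In PySpark, `Vectors.sparse(6, [0, 1], [1.75, 2.0]).toArray()` from `pyspark.ml.linalg` should give the same dense values.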

### Example Row Explanation

Let's break down an example row to understand what each part means:

```
|(6,[0,1],[1.75,2.0])| 17.82 | 14.50592690983163 |
```

- **Independent Features**: `(6,[0,1],[1.75,2.0])`
  - This means the feature vector has 6 elements, but only the first two elements (indices 0 and 1) are non-zero.
  - The value at index 0 is 1.75 and the value at index 1 is 2.0.

- **total_bill**: `17.82`
  - This is the actual bill amount for this particular observation.

- **prediction**: `14.50592690983163`
  - This is the predicted bill amount made by the linear regression model based on the feature vector `(6,[0,1],[1.75,2.0])`.
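
The prediction for this row can be reproduced by hand from the `regressor.coefficients` and `regressor.intercept` values printed earlier in the notebook. A linear regression prediction is the intercept plus the dot product of the coefficients and the feature vector. The sketch below uses the coefficients as displayed (rounded to four decimals), so it agrees with the model's output only approximately.

```python
# Values copied from the regressor.coefficients and regressor.intercept outputs.
coefficients = [2.7985, 3.4757, -1.0544, 1.8106, -0.3856, -1.1182]
intercept = 2.6572337983204286

# Dense form of the sparse vector (6,[0,1],[1.75,2.0]).
features = [1.75, 2.0, 0.0, 0.0, 0.0, 0.0]

# prediction = intercept + sum of coefficient * feature value
prediction = intercept + sum(c * x for c, x in zip(coefficients, features))
print(round(prediction, 2))  # → 14.51
```

Only the first two coefficients contribute here, because the other four feature values are zero.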

### Detailed Analysis of the Output

Let's go through several rows to understand the predictions:

| Independent Features     | total_bill | prediction         |
|--------------------------|------------|--------------------|
| (6,[0,1],[1.75,2.0])     | 17.82      | 14.50592690983163  |
| (6,[0,1],[2.0,2.0])      | 12.69      | 15.20554891453792  |
| (6,[0,1],[2.24,3.0])     | 16.04      | 19.352855578339547 |
| (6,[0,1],[2.5,4.0])      | 18.35      | 23.55613200251767  |
| (6,[0,1],[2.72,2.0])     | 13.28      | 17.22046028809204  |
| (6,[0,1],[3.35,3.0])     | 20.65      | 22.459177279235472 |
| (6,[0,1],[4.3,2.0])      | 21.7       | 21.64207135783579  |
| (6,[0,1],[6.73,4.0])     | 48.27      | 35.3937363221481   |
| [1.0,1.0,1.0,0.0,...]    | 7.25       | 7.876964304410876  |
| [1.0,1.0,1.0,1.0,...]    | 3.07       | 9.687578213722702  |
| [1.1,2.0,1.0,1.0,...]    | 12.9       | 13.443096554888802 |
| [1.44,2.0,0.0,0.0,...]   | 7.56       | 11.748909539928615 |
| [1.48,2.0,0.0,0.0,...]   | 8.52       | 11.860849060681623 |
| [1.5,2.0,0.0,0.0,...]    | 19.08      | 11.916818821058126 |
| [1.5,2.0,0.0,1.0,...]    | 12.03      | 14.459990630161906 |
| [1.5,2.0,1.0,0.0,...]    | 26.41      | 12.751877853107041 |
| [1.5,2.0,1.0,0.0,...]    | 10.65      | 10.862391769039828 |
| [1.71,2.0,0.0,0.0,...]   | 10.27      | 14.00834466098687  |

### Interpretation

1. **Feature Vector Interpretation**:
    - For each row, the `Independent Features` column shows a vector representation of the input features used for the prediction.
    - For example, `(6,[0,1],[1.75,2.0])` means a 6-element vector with values 1.75 at index 0 and 2.0 at index 1.

2. **Actual vs. Predicted Values**:
    - The `total_bill` column shows the actual bill amount, while the `prediction` column shows the bill amount predicted by the model.
    - By comparing these two columns, you can assess how well the model is performing. For instance, in the first row, the actual bill is 17.82 and the predicted bill is 14.51, a difference of about 3.31, or roughly 19% of the actual value.

3. **Model Accuracy**:
    - The closer the `prediction` values are to the `total_bill` values, the more accurate the model is.
    - Large discrepancies between these values indicate that the model may need improvement, either through better feature engineering, more data, or a different model.

### Summary

The `pred_results.predictions.show()` output provides a comparison between the actual and predicted values for the target variable (`total_bill`). It allows you to assess the performance of the linear regression model by showing how well the model's predictions match the actual values. Each row in the output shows the feature vector used for the prediction, the actual bill amount, and the predicted bill amount, helping you understand the model's accuracy and potential areas for improvement.

In [0]:
### Performance Metrics
pred_results.r2,pred_results.meanAbsoluteError,pred_results.meanSquaredError

Out[43]: (0.6715899791172986, 4.243246528600479, 33.83498887434812)

### Performance Metrics Explanation

The code `pred_results.r2, pred_results.meanAbsoluteError, pred_results.meanSquaredError` provides important performance metrics for evaluating the regression model. Here is a detailed explanation of each metric:

1. **R² (R-Squared)**:
   - **Value**: `0.6715899791172986`
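   - **Explanation**: R², also known as the coefficient of determination, indicates the proportion of the variance in the dependent variable (target) that is predictable from the independent variables (features). It usually falls between 0 and 1, but on held-out test data it can be negative, which means the model predicts worse than always predicting the mean of the target. As reference points: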
     - **0** means that the model explains none of the variability of the target variable.
     - **1** means that the model explains all the variability of the target variable.
   - **Interpretation**: An R² value of `0.6716` means that approximately 67.16% of the variance in the `total_bill` can be explained by the features in the model. This indicates a moderate level of explanatory power.

2. **Mean Absolute Error (MAE)**:
   - **Value**: `4.243246528600479`
   - **Explanation**: MAE is the average of the absolute differences between the predicted values and the actual values. It provides a measure of the average magnitude of the errors in a set of predictions, without considering their direction.
   - **Formula**: 
     \[
     \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
     \]
     where \( y_i \) is the actual value and \( \hat{y}_i \) is the predicted value.
   - **Interpretation**: An MAE value of `4.2432` means that, on average, the predictions are off by approximately 4.24 units from the actual `total_bill`. Lower values of MAE indicate better predictive accuracy.

3. **Mean Squared Error (MSE)**:
   - **Value**: `33.83498887434812`
   - **Explanation**: MSE is the average of the squared differences between the predicted values and the actual values. It provides a measure of the average magnitude of the errors, giving more weight to larger errors.
   - **Formula**:
     \[
     \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
     \]
     where \( y_i \) is the actual value and \( \hat{y}_i \) is the predicted value.
   - **Interpretation**: An MSE value of `33.8350` means that, on average, the squared difference between the predicted and actual `total_bill` is 33.8350. Lower values of MSE indicate better predictive accuracy. However, because MSE squares the errors, it is more sensitive to outliers than MAE.
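
To see why MSE is more sensitive to large errors than MAE, the sketch below recomputes both on the eight sparse-vector rows from the predictions table above. These are only a subset of the test set, so the resulting values differ from the full-test-set metrics reported by `pred_results`; the variable names are illustrative.

```python
# Actual and predicted total_bill for the first eight rows of the predictions table.
actual = [17.82, 12.69, 16.04, 18.35, 13.28, 20.65, 21.7, 48.27]
predicted = [
    14.50592690983163, 15.20554891453792, 19.352855578339547, 23.55613200251767,
    17.22046028809204, 22.459177279235472, 21.64207135783579, 35.3937363221481,
]

errors = [a - p for a, p in zip(actual, predicted)]
abs_errors = [abs(e) for e in errors]
sq_errors = [e * e for e in errors]

mae = sum(abs_errors) / len(errors)
mse = sum(sq_errors) / len(errors)

# Share of each metric's total that comes from the single worst prediction
# (actual 48.27 versus predicted 35.39).
share_in_mae = max(abs_errors) / sum(abs_errors)
share_in_mse = max(sq_errors) / sum(sq_errors)

print(f"MAE: {mae:.3f}, MSE: {mse:.3f}")
print(f"Worst row's share of MAE: {share_in_mae:.0%}, of MSE: {share_in_mse:.0%}")
```

The worst row accounts for a noticeably larger share of the MSE than of the MAE, which is the squaring effect described above.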

### Code Example

Let's put these metrics into context with a self-contained example that runs the whole pipeline on a three-row toy dataset:

In [0]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Create a Spark session
spark = SparkSession.builder.appName('LinearRegressionExample').getOrCreate()

# Sample data
data = [
    (5.0, 2, "Male", "Yes", "Sun", "Dinner", 50.0),
    (3.0, 3, "Female", "No", "Sat", "Lunch", 30.0),
    (4.5, 4, "Male", "Yes", "Fri", "Dinner", 45.0)
]
columns = ["tip", "size", "sex", "smoker", "day", "time", "total_bill"]

# Create a DataFrame
df = spark.createDataFrame(data, columns)

# Index categorical columns
indexer = StringIndexer(
    inputCols=["sex", "smoker", "day", "time"],
    outputCols=["sex_indexed", "smoker_indexed", "day_indexed", "time_index"]
)
df_r = indexer.fit(df).transform(df)

# Assemble features into a vector
featureassembler = VectorAssembler(
    inputCols=['tip', 'size', 'sex_indexed', 'smoker_indexed', 'day_indexed', 'time_index'],
    outputCol="Independent Features"
)
output = featureassembler.transform(df_r)

# Select relevant columns for modeling
finalized_data = output.select("Independent Features", "total_bill")

# Train-test split
train_data, test_data = finalized_data.randomSplit([0.75, 0.25])

# Check if the splits contain data
if train_data.count() == 0 or test_data.count() == 0:
    print("One of the train or test datasets is empty. Adjusting the split ratio.")
    train_data, test_data = finalized_data.randomSplit([0.9, 0.1])
    if train_data.count() == 0 or test_data.count() == 0:
        print("Still empty after adjustment. Using entire dataset for training.")
        train_data = finalized_data
        test_data = finalized_data

# Initialize and train the Linear Regression model
regressor = LinearRegression(featuresCol='Independent Features', labelCol='total_bill')
regressor = regressor.fit(train_data)

# Make predictions on the test data
predictions = regressor.transform(test_data)

# Evaluate the model
evaluator = RegressionEvaluator(
    labelCol="total_bill", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)

evaluator = RegressionEvaluator(
    labelCol="total_bill", predictionCol="prediction", metricName="mae")
mae = evaluator.evaluate(predictions)

evaluator = RegressionEvaluator(
    labelCol="total_bill", predictionCol="prediction", metricName="mse")
mse = evaluator.evaluate(predictions)

# Show predictions
predictions.select("Independent Features", "total_bill", "prediction").show()

# Print performance metrics
print(f"R²: {r2}")
print(f"Mean Absolute Error: {mae}")
print(f"Mean Squared Error: {mse}")


+--------------------+----------+------------------+
|Independent Features|total_bill|        prediction|
+--------------------+----------+------------------+
| (6,[0,1],[4.5,4.0])|      45.0|35.833333333333336|
+--------------------+----------+------------------+

R²: -inf
Mean Absolute Error: 9.166666666666664
Mean Squared Error: 84.02777777777773
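
The `-inf` value for R² follows from the test set containing a single row. R² is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the actual values. With one row, the actual value equals its own mean, so the total sum of squares is zero while the residual is not. The sketch below reproduces the three numbers in plain Python; the function `r2_score` and its handling of the zero-variance case are conventions of this sketch.

```python
import math


def r2_score(actual, predicted):
    """R² = 1 - SS_res / SS_tot, with the zero-variance case handled explicitly."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0.0:
        # No variance in the actual values: the ratio is undefined (0/0) or infinite.
        return math.nan if ss_res == 0.0 else -math.inf
    return 1.0 - ss_res / ss_tot


# The single test row from the output above.
actual = [45.0]
predicted = [35.833333333333336]

mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
r2 = r2_score(actual, predicted)

print(f"R²: {r2}")
print(f"Mean Absolute Error: {mae}")
print(f"Mean Squared Error: {mse}")
```

For this reason, metrics from a three-row dataset are not meaningful; the example is only a demonstration of the API calls, and the test set needs many rows with varying targets before R² can be interpreted.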


### Summary of Metrics

- **R² (R-Squared)**: Indicates the proportion of variance in the dependent variable that is predictable from the independent variables. Higher values indicate better model performance.
- **Mean Absolute Error (MAE)**: Measures the average magnitude of the errors in predictions, without considering their direction. Lower values indicate better model performance.
- **Mean Squared Error (MSE)**: Measures the average magnitude of the errors, giving more weight to larger errors. Lower values indicate better model performance, but it is more sensitive to outliers than MAE.

By understanding these metrics, you can better assess the performance of your regression model and make informed decisions about potential improvements.