# Introduction.

Bojanapally Santhoshini - U88362375.

In this notebook we use Spark ML to solve a regression problem: we fit a linear regression model to predict a continuous target variable and evaluate it with root mean square error (RMSE). As usual, the first step is importing the required packages. We then load the data, check it for missing values, and split it into training and test sets. Next, VectorAssembler combines the feature columns into a single vector column (scaled with StandardScaler), which we feed to the linear regression model. Finally, we evaluate the model on the test set using RMSE. Let's do that.
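
As a quick reference for the metric, here is a minimal NumPy sketch of how RMSE is computed: the square root of the mean squared difference between actual and predicted values. The helper name `rmse` and the sample values are assumptions made up for illustration; they are not taken from the dataset, and the notebook itself relies on Spark's `RegressionEvaluator` rather than this helper.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical values, chosen only to illustrate the formula
actual = [3.0, -1.0, 2.0]
predicted = [2.0, -1.0, 4.0]

# Errors are 1, 0 and -2, so the mean squared error is 5/3
print("RMSE:", rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large errors raise RMSE more than many small ones, and a perfect fit gives exactly 0.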

In [13]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import numpy as np
import pandas as pd
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
import findspark
import warnings
warnings.filterwarnings("ignore")

np.random.seed(42)

In [2]:
findspark.init()

spark = SparkSession.builder.master("local[4]").appName("ISM6562 Spark App01").enableHiveSupport().getOrCreate()

# Get the SparkContext object. It is the entry point to the low-level Spark API
# and is created automatically when the SparkSession is created.
sc = spark.sparkContext

# Note: if multiple Spark sessions are running (for example, from a previous notebook),
# this session's web UI will be on a different port than the default (4040). One way to
# identify the port is with the following line. If only one Spark session is running,
# the port will be 4040. If it is higher, other Spark sessions are still running.
spark_session_port = spark.sparkContext.uiWebUrl.split(":")[-1]
print("Spark Session WebUI Port: " + spark_session_port)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/29 17:52:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/04/29 17:52:12 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/04/29 17:52:12 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/04/29 17:52:12 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
24/04/29 17:52:12 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.


Spark Session WebUI Port: 4044


In [3]:
# Load the dataset into a pandas DataFrame
df = pd.read_csv('data.csv')
df.head()

Unnamed: 0,x1,x2,target
0,16.24,4.9,14.7
1,-6.12,2.39,7.17
2,-5.28,-4.48,-13.44
3,-10.73,-6.11,-18.33
4,8.65,-20.3,-60.9


In [4]:
df.isna().sum()

x1        0
x2        0
target    0
dtype: int64

In [5]:
df_spark = spark.createDataFrame(df)

24/04/29 17:52:24 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors


In [6]:
df_spark.show()

                                                                                

+------+------+-------------------+
|    x1|    x2|             target|
+------+------+-------------------+
| 16.24|   4.9|               14.7|
| -6.12|  2.39|               7.17|
| -5.28| -4.48|             -13.44|
|-10.73| -6.11|             -18.33|
|  8.65| -20.3|-60.900000000000006|
|-23.02|  6.08|              18.24|
| 17.45| -3.54|             -10.62|
| -7.61|  1.53|               4.59|
|  3.19|  5.01|              15.03|
| -2.49| -7.86|             -23.58|
| 14.62| 10.17|              30.51|
| -20.6|  1.13|               3.39|
| -3.22| 14.97|              44.91|
| -3.84|  1.69|               5.07|
| 11.34|  3.19|               9.57|
| -11.0| -2.73|              -8.19|
| -1.72| 14.76|              44.28|
| -8.78|-21.03|             -63.09|
|  0.42| -5.33|             -15.99|
|  5.83| -3.05|              -9.15|
+------+------+-------------------+
only showing top 20 rows



In [7]:
df_spark.printSchema()

root
 |-- x1: double (nullable = true)
 |-- x2: double (nullable = true)
 |-- target: double (nullable = true)



In [8]:
df_spark.count()

2000

In [9]:
# With only 2000 observations, use an 80/20 split so that most of the data is available for training

train_data, test_data = df_spark.randomSplit([0.8, 0.2], seed=42)

In [10]:
# Define numeric columns
numeric_columns = ['x1', 'x2']

# Define VectorAssembler for numeric features
numeric_assembler = VectorAssembler(inputCols=numeric_columns, outputCol='numeric_feature')

# Define StandardScaler for scaling numeric features
scaler = StandardScaler(inputCol='numeric_feature', outputCol='scaled_numeric_feature')

# Wrap the scaled vector in a 'features' column, the default input column of LinearRegression
assembler = VectorAssembler(inputCols=['scaled_numeric_feature'], outputCol='features')

# Define the pipeline
pipeline = Pipeline(stages=[numeric_assembler, scaler, assembler])

# Fit the pipeline to the training data
pipeline_model = pipeline.fit(train_data)

# Transform the training and testing data
train_data = pipeline_model.transform(train_data)
test_data = pipeline_model.transform(test_data)

# Define the Linear Regression model
lr_model = LinearRegression(labelCol='target')

# Fit the model to the training data
fit_model = lr_model.fit(train_data)

# Make predictions on the test data
predictions = fit_model.transform(test_data)

# Evaluate the model's performance
evaluator = RegressionEvaluator(labelCol='target', predictionCol='prediction', metricName='rmse')
rmse = evaluator.evaluate(predictions)
print("RMSE:", rmse)

# Optionally, you can also print the coefficients and intercept
coefficients = fit_model.coefficients
intercept = fit_model.intercept
print("Coefficients:", coefficients)
print("Intercept:", intercept)

24/04/29 17:54:37 WARN Instrumentation: [9b753390] regParam is zero, which might cause numerical instability and overfitting.
24/04/29 17:54:38 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.VectorBLAS


RMSE: 1.3047161790356743e-14
Coefficients: [5.056398194956959e-16,29.90742660673947]
Intercept: -3.634035041144678e-16


## Conclusion.

The RMSE on the test set is about 1.3e-14, which is zero up to floating-point error. The coefficient of x1 is about 5e-16, so x1 has effectively no influence on the target, while the coefficient of x2 is about 29.9, so x2 drives the prediction. Because the features were standardized, this coefficient is the change in the target per standard deviation of x2, not per unit of x2. The intercept is also essentially zero, which means the fitted line passes through the origin.

An RMSE this close to zero on held-out data means the target is almost exactly a linear function of the features; in the rows shown earlier, target equals 3 × x2. This suggests that the data is synthesized (artificial) for evaluating regression models rather than that the model is overfitting. On noisier data, overfitting could be reduced by setting the regularization parameter (regParam) to a non-zero value, and other evaluation metrics such as mean absolute error could be reported alongside RMSE.
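
One way to check whether the data is synthetic is to look at the rows printed by `df_spark.show()`. The sketch below copies the first ten `x2` and `target` values from that output and fits a slope through the origin with NumPy. The final step assumes `StandardScaler` was used with its defaults (divide by the standard deviation, no centering), in which case the coefficient reported for the scaled feature should be roughly the raw slope times the training-set standard deviation of x2. That standard deviation is inferred here, not measured.

```python
import numpy as np

# First ten rows of x2 and target, copied from the df_spark.show() output
x2 = np.array([4.9, 2.39, -4.48, -6.11, -20.3, 6.08, -3.54, 1.53, 5.01, -7.86])
target = np.array([14.7, 7.17, -13.44, -18.33, -60.9, 18.24, -10.62, 4.59, 15.03, -23.58])

# Least-squares slope for a line through the origin: sum(x * y) / sum(x * x)
slope = float(np.dot(x2, target) / np.dot(x2, x2))
max_abs_residual = float(np.max(np.abs(target - slope * x2)))
print("slope:", round(slope, 6))
print("max |residual|:", max_abs_residual)

# Coefficient of the scaled x2 feature, taken from the notebook output
scaled_coef = 29.90742660673947

# Assumption: scaled feature = x2 / std(x2), so scaled_coef = slope * std(x2)
implied_std_x2 = scaled_coef / slope
print("implied std of x2:", round(implied_std_x2, 3))
```

If the slope is 3 with residuals at floating-point level, the near-zero RMSE follows directly from how the target was generated, and regularization would only pull the fit away from the exact relationship.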

In [12]:
# Without scaling (optional)

# Define numeric columns
# numeric_columns = ['x1', 'x2']

# Define VectorAssembler for numeric features
# numeric_assembler = VectorAssembler(inputCols=numeric_columns, outputCol='features')

# Define the pipeline
# pipeline = Pipeline(stages=[numeric_assembler])

# Fit the pipeline to the training data
# pipeline_model = pipeline.fit(train_data)

# Transform the training and testing data
# train_data = pipeline_model.transform(train_data)
# test_data = pipeline_model.transform(test_data)

# Define the Linear Regression model
# lr_model = LinearRegression(labelCol='target')

# Fit the model to the training data
# fit_model = lr_model.fit(train_data)

# Make predictions on the test data
# predictions = fit_model.transform(test_data)

# Evaluate the model's performance
# evaluator = RegressionEvaluator(labelCol='target', predictionCol='prediction', metricName='rmse')
# rmse = evaluator.evaluate(predictions)
# print("RMSE:", rmse)

# Optionally, you can also print the coefficients and intercept
# coefficients = fit_model.coefficients
# intercept = fit_model.intercept
# print("Coefficients:", coefficients)
# print("Intercept:", intercept)