
# 📈 Simple Regression on a Large Dataset Using Spark MLlib

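This guide shows how to build, train, and apply a **linear regression model** using Spark MLlib, first step by step in Scala on a small CSV file and then as a complete PySpark program on synthetic data.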

---

## 📂 Dataset (`data.csv`)

```text
feature,target
1.0,2.0
2.0,4.1
3.0,6.0
4.0,8.1
5.0,10.2
```

---

## 🔹 Step 1: Load Dataset

```scala
val data = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data.csv")
```

---

## 🔹 Step 2: Prepare Data for Regression

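Spark ML estimators read their inputs from a single vector column, so the raw `feature` column must first be packed into a **features vector** with `VectorAssembler`: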

```scala
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("feature"))
  .setOutputCol("features")

val trainingData = assembler.transform(data)
  .select("features", "target")
```

---

## 🔹 Step 3: Build & Train Linear Regression Model

```scala
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setLabelCol("target")
  .setFeaturesCol("features")

val lrModel = lr.fit(trainingData)
```
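
With the settings used here (no regularization parameter set, intercept fitted), the fit should amount to an ordinary least-squares line. As a rough cross-check that does not need Spark, the closed-form solution for a single feature can be computed by hand on the five rows of `data.csv`. The sketch below is plain Python rather than Scala, and the helper name `fit_simple_ols` is hypothetical; it is only a sanity check, not a description of how Spark solves the problem internally.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The five rows of data.csv
features = [1.0, 2.0, 3.0, 4.0, 5.0]
targets = [2.0, 4.1, 6.0, 8.1, 10.2]

slope, intercept = fit_simple_ols(features, targets)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
for x in features:
    print(f"x={x:.1f} -> prediction={slope * x + intercept:.2f}")
```

For these five rows the sketch gives a slope of about 2.04 and an intercept of about -0.04, so the values reported by `lrModel.coefficients` and `lrModel.intercept` should come out close to those.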

---

## 🔹 Step 4: Apply Model & Predict Target Values

```scala
val predictions = lrModel.transform(trainingData)
predictions.select("features", "target", "prediction").show()
```

---

## ✅ Sample Output

```text
+--------+------+----------+
|features|target|prediction|
+--------+------+----------+
|   [1.0]|   2.0|      2.00|
|   [2.0]|   4.1|      4.04|
|   [3.0]|   6.0|      6.08|
|   [4.0]|   8.1|      8.12|
|   [5.0]|  10.2|     10.16|
+--------+------+----------+
```


## 📌 PySpark Program

```python
# ================================
# Simple Linear Regression in PySpark
# Colab-ready
# ================================

# Notebook shell commands (Colab/Jupyter syntax)
!pip install pyspark -q
!pip install pandas -q

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# ----------------------------
# 1. Create Spark session
# ----------------------------
spark = SparkSession.builder.appName("SimpleRegressionColab").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

# ----------------------------
# 2. Generate synthetic numerical dataset
# ----------------------------
# Let's create 1000 samples: y = 3*x1 + 5*x2 + noise
np.random.seed(42)

n_samples = 1000
x1 = np.random.rand(n_samples)
x2 = np.random.rand(n_samples)
noise = np.random.normal(0, 0.1, n_samples)
y = 3*x1 + 5*x2 + noise

# Create pandas DataFrame
df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})

# ----------------------------
# 3. Convert pandas DataFrame to Spark DataFrame
# ----------------------------
spark_df = spark.createDataFrame(df)
spark_df.show(5)

# ----------------------------
# 4. Assemble features into vector
# ----------------------------
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
data = assembler.transform(spark_df).select("features", "y")

# ----------------------------
# 5. Split data into train/test
# ----------------------------
train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)

# ----------------------------
# 6. Create and train linear regression model
# ----------------------------
lr = LinearRegression(featuresCol="features", labelCol="y")
lr_model = lr.fit(train_data)

# ----------------------------
# 7. Print model coefficients
# ----------------------------
print("Intercept:", lr_model.intercept)
print("Coefficients:", lr_model.coefficients)

# ----------------------------
# 8. Apply model to test data
# ----------------------------
predictions = lr_model.transform(test_data)
predictions.select("features", "y", "prediction").show(10)

# ----------------------------
# 9. Evaluate model performance
# ----------------------------
rmse_evaluator = RegressionEvaluator(labelCol="y", predictionCol="prediction", metricName="rmse")
rmse = rmse_evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE):", rmse)

# Create evaluators for the remaining metrics
mae_evaluator = RegressionEvaluator(labelCol="y", predictionCol="prediction", metricName="mae")
r2_evaluator = RegressionEvaluator(labelCol="y", predictionCol="prediction", metricName="r2")

# Evaluate on predictions
mae = mae_evaluator.evaluate(predictions)
r2 = r2_evaluator.evaluate(predictions)

# Display performance
print("==== Regression Model Performance ====")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R² (Coefficient of Determination): {r2:.4f}")
```
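
The three metrics requested from `RegressionEvaluator` have simple definitions. As a rough illustration of what the numbers mean, the sketch below computes them in plain Python on a handful of made-up label/prediction pairs. The function name `regression_metrics` and the sample values are hypothetical and are not output from the PySpark program.

```python
import math


def regression_metrics(labels, predicted):
    """Return (rmse, mae, r2) for paired lists of labels and predictions."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predicted)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)                 # root mean squared error
    mae = sum(abs(e) for e in errors) / n        # mean absolute error
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                   # coefficient of determination
    return rmse, mae, r2


# Made-up values, for illustration only
sample_labels = [1.0, 2.0, 3.0, 4.0]
sample_predictions = [1.1, 1.9, 3.2, 3.8]

sample_rmse, sample_mae, sample_r2 = regression_metrics(sample_labels, sample_predictions)
print(f"RMSE={sample_rmse:.4f}, MAE={sample_mae:.4f}, R2={sample_r2:.4f}")
```

RMSE and MAE are in the same units as `y`, with RMSE penalizing large errors more heavily; R² close to 1 means the model explains most of the variance in the labels.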


The PySpark program first prints the top rows of the dataset and the fitted parameters:

```text
+-------------------+-------------------+------------------+
|                 x1|                 x2|                 y|
+-------------------+-------------------+------------------+
| 0.3745401188473625|0.18513292883861965|1.9614867420595294|
| 0.9507143064099162| 0.5419009473783581|5.4789596207175535|
| 0.7319939418114051| 0.8729458358764083| 6.538063115626104|
| 0.5986584841970366| 0.7322248864095612| 5.493836435320534|
|0.15601864044243652| 0.8065611478614497| 4.592220123257883|
+-------------------+-------------------+------------------+
only showing top 5 rows
Intercept: 0.00745394240299101
Coefficients: [2.9923037286802074,4.998089904211016]
```
+--------------------+------------------+------------------+
|            features|                 y|        prediction|
+--------------------+------------------+------------------+
|[0.00695213053119...|1.9246719856888028|1.8909875428083662|
|[0.01215447468981...|3.0076105885866533|2.8778220718984207|
|[0.01545661652886...| 3.8195844107349