<a href="https://colab.research.google.com/github/SKitavi/pyspark/blob/main/Spark_ML_Intro_(3_Mnist).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**1. Setting Up the Environment**

In [None]:
!pip install pyspark



In [None]:
import pyspark

#**2. Loading the Dataset**

* The dataset used is California Housing Prices (from
Kaggle).
* It is loaded into PySpark DataFrame for distributed processing:

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Mnist").getOrCreate()
df = spark.read.csv("sample_data/mnist_train_small.csv", header=True,
inferSchema=True)

In [16]:
df.printSchema()
df.show(50)

root
 |-- 6: integer (nullable = true)
 |-- 01: integer (nullable = true)
 |-- 02: integer (nullable = true)
 |-- 03: integer (nullable = true)
 |-- 04: integer (nullable = true)
 |-- 05: integer (nullable = true)
 |-- 06: integer (nullable = true)
 |-- 07: integer (nullable = true)
 |-- 08: integer (nullable = true)
 |-- 09: integer (nullable = true)
 |-- 010: integer (nullable = true)
 |-- 011: integer (nullable = true)
 |-- 012: integer (nullable = true)
 |-- 013: integer (nullable = true)
 |-- 014: integer (nullable = true)
 |-- 015: integer (nullable = true)
 |-- 016: integer (nullable = true)
 |-- 017: integer (nullable = true)
 |-- 018: integer (nullable = true)
 |-- 019: integer (nullable = true)
 |-- 020: integer (nullable = true)
 |-- 021: integer (nullable = true)
 |-- 022: integer (nullable = true)
 |-- 023: integer (nullable = true)
 |-- 024: integer (nullable = true)
 |-- 025: integer (nullable = true)
 |-- 026: integer (nullable = true)
 |-- 027: integer (nullable = true

# 3. **Data Preprocessing**

* Handling missing values:



In [3]:
df = df.dropna()

* Selecting features and the target variable:

In [4]:
from pyspark.ml.feature import VectorAssembler

feature_cols = ['01', '02', '03',
 '04', '05']
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df_transformed = assembler.transform(df)

# **4. Splitting Data for Training and Testing**
* The dataset is split into training and test sets:



In [8]:
train_data, test_data = df_transformed.randomSplit([0.8, 0.2], seed=42)

#**5. Building a Machine Learning Model**
* Using Linear Regression from PySpark MLlib:

In [5]:
from pyspark.ml.regression import LinearRegression

In [14]:
lr = LinearRegression(featuresCol="features", labelCol="6")
model = lr.fit(train_data)

#**6. Evaluating the Model**
* Predictions on test data:

In [17]:
predictions = model.transform(test_data)
predictions.select("6", "prediction").show(50)

+---+----------+
|  6|prediction|
+---+----------+
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
|  0|       0.0|
+---+----------+
only showing top 50 rows


* Checking model performance using evaluation metrics:

In [18]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="07",
predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print(f"Root Mean Squared Error (RMSE): {rmse}")

Root Mean Squared Error (RMSE): 5.3570850110371815
