
# **1. Setting Up the Environment**

In [None]:
!pip install pyspark



In [None]:
import pyspark

# **2. Loading the Dataset**

* The dataset is California Housing Prices (from Kaggle).
* It is loaded into a PySpark DataFrame for distributed processing:

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CaliforniaHousing").getOrCreate()
df = spark.read.csv("sample_data/california_housing_train.csv", header=True,
                    inferSchema=True)

In [2]:
df.printSchema()
df.show(5)

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|  -114.31|   34.19|              15.0|     5612.0|        1283.0|    1015.0|     472.0|       1.4936|           66900.0|
|  -114.47|    34.4|              19.0|     7650.0|        1901.0|    1129.0|     463.0|         1.82|     

# **3. Data Preprocessing**

* Handling missing values by dropping every row that contains a null:



In [3]:
df = df.dropna()
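
As a rough illustration of what `dropna()` does with its default `how="any"` setting, the following plain-Python sketch keeps a row only when none of its fields is missing. The second and third rows are hypothetical, made up to show the effect; the function name `drop_rows_with_nulls` is likewise just an assumption for this example.

```python
# Plain-Python model of DataFrame.dropna() with its default how="any":
# a row survives only if none of its fields is null (None here).
rows = [
    {"total_rooms": 5612.0, "total_bedrooms": 1283.0},
    {"total_rooms": 7650.0, "total_bedrooms": None},  # hypothetical incomplete row
    {"total_rooms": None, "total_bedrooms": None},    # hypothetical empty row
]


def drop_rows_with_nulls(records):
    """Keep only the records in which every value is present."""
    return [r for r in records if all(v is not None for v in r.values())]


cleaned = drop_rows_with_nulls(rows)
print(f"kept {len(cleaned)} of {len(rows)} rows")
```

Comparing `df.count()` before and after the call is a simple way to see how many rows were actually removed from the real DataFrame.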

* Selecting features and the target variable:

In [4]:
from pyspark.ml.feature import VectorAssembler

feature_cols = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
 'total_bedrooms', 'population', 'households', 'median_income']
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df_transformed = assembler.transform(df)
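
To make the assembling step concrete, here is a minimal plain-Python sketch of what happens to one row: the input columns are read in the order listed and packed into a single vector, while the target column stays separate. The values come from the first row of the `df.show(5)` output above; the helper name `assemble` is an assumption for illustration, not part of PySpark.

```python
# Plain-Python model of what VectorAssembler does for a single row:
# read the input columns in the given order and pack them into one vector.
feature_cols = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
                'total_bedrooms', 'population', 'households', 'median_income']

# First row of the df.show(5) output.
row = {
    'longitude': -114.31, 'latitude': 34.19, 'housing_median_age': 15.0,
    'total_rooms': 5612.0, 'total_bedrooms': 1283.0, 'population': 1015.0,
    'households': 472.0, 'median_income': 1.4936, 'median_house_value': 66900.0,
}


def assemble(record, input_cols):
    """Return the values of input_cols as one ordered feature list."""
    return [record[c] for c in input_cols]


features = assemble(row, feature_cols)
label = row['median_house_value']  # the target is not part of the vector
print(features)
print(label)
```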

# **4. Splitting Data for Training and Testing**
* The dataset is split roughly 80/20 into training and test sets, with a fixed seed for reproducibility:



In [5]:
train_data, test_data = df_transformed.randomSplit([0.8, 0.2], seed=42)
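
The sketch below is a simplified plain-Python model of a weighted random split, assuming each row is assigned independently from one uniform draw. It is not Spark's actual implementation, and `random_split` is a hypothetical helper. It does show why the resulting sizes are only approximately 80/20 and why fixing the seed makes the split repeatable.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to a split by drawing one uniform number per row."""
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total  # weights are normalized to sum to 1
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))  # stand-in row ids, not real data
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=42)
print(len(train_rows), len(test_rows))
```

Because every row is drawn independently, the training share lands near 80% rather than exactly on it; `train_data.count()` and `test_data.count()` show the actual sizes for the Spark split.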

# **5. Building a Machine Learning Model**
* Using Linear Regression from PySpark MLlib:

In [6]:
from pyspark.ml.regression import LinearRegression

In [7]:
lr = LinearRegression(featuresCol="features", labelCol="median_house_value")
model = lr.fit(train_data)
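
With its default settings (no regularization), `LinearRegression` fits an ordinary least-squares model. As a minimal sketch of that idea, the plain-Python code below fits the one-feature case in closed form. The toy data and the helper `fit_simple_ols` are hypothetical; the Spark model fits all eight features at once with its own solver.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_simple_ols(xs, ys)
print(f"slope={slope}, intercept={intercept}")
print(f"prediction for x=5: {intercept + slope * 5.0}")
```

On the fitted Spark model, the analogous values are available as `model.coefficients` and `model.intercept`.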

# **6. Evaluating the Model**
* Predictions on test data:

In [8]:
predictions = model.transform(test_data)
predictions.select("median_house_value", "prediction").show(50)

+------------------+------------------+
|median_house_value|        prediction|
+------------------+------------------+
|          103600.0| 101334.1523250211|
|          106700.0|189023.14109935658|
|           73200.0| 76469.86363691278|
|           90100.0|165186.09220862994|
|           67000.0|120101.40360319754|
|          116100.0|199313.20768269338|
|           62500.0|132221.13859327696|
|           85400.0| 157664.6846611374|
|           90000.0|174691.75308407517|
|           86400.0| 157459.4768907926|
|           74100.0|121336.39571196772|
|           57500.0|104446.81338756438|
|           75100.0| 134649.3450419805|
|          130600.0|187192.50846866565|
|           92100.0|156258.08505277848|
|           90200.0|112560.03201755695|
|           92600.0|129170.19603408081|
|          165600.0|195263.98156292224|
|           36700.0| 38152.75754792476|
|          116700.0| 87423.02334289998|
|           82400.0|141105.39691865863|
|           76800.0| 99441.99060788658|


* Checking model performance using evaluation metrics:

In [9]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="median_house_value",
                                predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print(f"Root Mean Squared Error (RMSE): {rmse}")

Root Mean Squared Error (RMSE): 69231.88317848284
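
For reference, RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as `median_house_value`. The short sketch below recomputes it by hand for the first three rows of the predictions table above; the helper name `rmse` is an assumption for this example.

```python
import math

# (actual, predicted) pairs copied from the first three rows of the
# predictions output.
pairs = [
    (103600.0, 101334.1523250211),
    (106700.0, 189023.14109935658),
    (73200.0, 76469.86363691278),
]


def rmse(pairs):
    """Root of the mean squared difference between actual and predicted."""
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


sample_rmse = rmse(pairs)
print(f"RMSE on the three sample rows: {sample_rmse:.2f}")
```

Since this covers only three rows, the result will differ from the figure reported for the full test set. Squaring means a single large miss, such as the second row, dominates the score.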
