# Installing PySpark in Google Colab

In [33]:
# install Java 8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download Spark 3.0.1 (prebuilt for Hadoop 3.2)
!wget -q http://apache.osuosl.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
# extract the archive
!tar xf spark-3.0.1-bin-hadoop3.2.tgz
# install findspark
!pip install -q findspark

# set the JAVA_HOME and SPARK_HOME environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop3.2"

import findspark
findspark.init()
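
The install commands above run quietly (`-q`, `-qq`, output redirected), so a failed download or extraction may only surface later as an error from `findspark.init()`. A minimal sketch of a pre-flight check, assuming only the two environment variables set above; `missing_homes` is a hypothetical helper, not part of findspark:

```python
import os

def missing_homes(env=None, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variables in `names` that are unset or do not point to a directory."""
    env = os.environ if env is None else env
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

# Report anything that would make Spark start-up fail.
problems = missing_homes()
if problems:
    print("Not set or not a directory:", ", ".join(problems))
```

Running this before `findspark.init()` points at the broken path directly instead of at a Spark start-up error.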

**A Spark session was created.**

In [36]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('HousePrice_Prediction').getOrCreate()

**All the Spark ML classes required for this task were imported.**

In [37]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler, StringIndexer

from pyspark.ml.regression import LinearRegression
from pyspark.mllib.evaluation import RegressionMetrics

**The house price prediction dataset was uploaded.**

In [38]:
from google.colab import files
uploaded = files.upload()

Saving data.csv to data (5).csv


In [64]:
dataset = spark.read.csv("data.csv", inferSchema = True, header = True)
dataset.show()

+-------------------+---------+--------+---------+-----------+--------+------+----------+----+---------+----------+-------------+--------+------------+--------------------+----------------+--------+-------+
|               date|    price|bedrooms|bathrooms|sqft_living|sqft_lot|floors|waterfront|view|condition|sqft_above|sqft_basement|yr_built|yr_renovated|              street|            city|statezip|country|
+-------------------+---------+--------+---------+-----------+--------+------+----------+----+---------+----------+-------------+--------+------------+--------------------+----------------+--------+-------+
|2014-05-02 00:00:00| 313000.0|     3.0|      1.5|       1340|    7912|   1.5|         0|   0|        3|      1340|            0|    1955|        2005|18810 Densmore Ave N|       Shoreline|WA 98133|    USA|
|2014-05-02 00:00:00|2384000.0|     5.0|      2.5|       3650|    9050|   2.0|         0|   4|        5|      3370|          280|    1921|           0|     709 W Blaine St|

**Printing the schema of the dataset. The schema describes the structure of the data: the name, type, and nullability of each column.**

In [65]:
dataset.printSchema()

root
 |-- date: string (nullable = true)
 |-- price: double (nullable = true)
 |-- bedrooms: double (nullable = true)
 |-- bathrooms: double (nullable = true)
 |-- sqft_living: integer (nullable = true)
 |-- sqft_lot: integer (nullable = true)
 |-- floors: double (nullable = true)
 |-- waterfront: integer (nullable = true)
 |-- view: integer (nullable = true)
 |-- condition: integer (nullable = true)
 |-- sqft_above: integer (nullable = true)
 |-- sqft_basement: integer (nullable = true)
 |-- yr_built: integer (nullable = true)
 |-- yr_renovated: integer (nullable = true)
 |-- street: string (nullable = true)
 |-- city: string (nullable = true)
 |-- statezip: string (nullable = true)
 |-- country: string (nullable = true)



In [66]:
dtypes_strings = dataset.select("city", "country", "street", "statezip", "date").distinct()

In [67]:
dtypes_strings.show()

+-------------+-------+--------------------+--------+-------------------+
|         city|country|              street|statezip|               date|
+-------------+-------+--------------------+--------+-------------------+
|     Bellevue|    USA|   3250 165th Ave SE|WA 98008|2014-05-06 00:00:00|
|Mercer Island|    USA|     8300 SE 82nd St|WA 98040|2014-05-06 00:00:00|
|      Seattle|    USA| 4602 Woodlawn Ave N|WA 98103|2014-05-07 00:00:00|
|       Renton|    USA|  16411 120th Ave SE|WA 98058|2014-05-12 00:00:00|
|    Sammamish|    USA|   27747 SE 24th Way|WA 98075|2014-05-13 00:00:00|
|     Kirkland|    USA|   12033 NE 136th Pl|WA 98034|2014-05-13 00:00:00|
|      Seattle|    USA|     4417 47th Ave S|WA 98118|2014-05-15 00:00:00|
|      Seattle|    USA|     8309 36th Ave S|WA 98118|2014-05-19 00:00:00|
|  Federal Way|    USA|    1714 SW 354th Pl|WA 98023|2014-05-19 00:00:00|
|   Des Moines|    USA|    24807 10th Ave S|WA 98198|2014-05-21 00:00:00|
|      Seattle|    USA|     303 SW 108

**Distinguishing between categorical and numerical columns.**

In [68]:
categoricalColumns = [item[0] for item in dataset.dtypes if item[1].startswith('string')]
categoricalColumns

['date', 'street', 'city', 'statezip', 'country']

In [69]:
numericalColumnsInt = [item[0] for item in dataset.dtypes if item[1].startswith("int")]
numericalColumnsInt

['sqft_living',
 'sqft_lot',
 'waterfront',
 'view',
 'condition',
 'sqft_above',
 'sqft_basement',
 'yr_built',
 'yr_renovated']

In [70]:
numericalColumnsDouble = [item[0] for item in dataset.dtypes if item[1].startswith("double")]
numericalColumnsDouble

['price', 'bedrooms', 'bathrooms', 'floors']
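
The three list comprehensions above each scan `dataset.dtypes`, which is a list of `(column, type)` pairs. As a sketch, the same grouping can be done in one pass. The `dtypes` literal below is copied from the printed schema so the example runs without Spark, and `group_by_dtype` is a hypothetical helper name:

```python
from collections import defaultdict

# (name, type) pairs in the shape returned by DataFrame.dtypes,
# copied from the schema printed earlier in this notebook.
dtypes = [
    ("date", "string"), ("price", "double"), ("bedrooms", "double"),
    ("bathrooms", "double"), ("sqft_living", "int"), ("sqft_lot", "int"),
    ("floors", "double"), ("waterfront", "int"), ("view", "int"),
    ("condition", "int"), ("sqft_above", "int"), ("sqft_basement", "int"),
    ("yr_built", "int"), ("yr_renovated", "int"), ("street", "string"),
    ("city", "string"), ("statezip", "string"), ("country", "string"),
]

def group_by_dtype(dtypes):
    """Map each type name to the list of columns of that type, in schema order."""
    groups = defaultdict(list)
    for name, dtype in dtypes:
        groups[dtype].append(name)
    return dict(groups)

by_type = group_by_dtype(dtypes)
print(by_type["string"])
print(by_type["double"])
```

With a live session, `group_by_dtype(dataset.dtypes)` would take the place of the literal list.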

**StringIndexer was used to convert the string-typed columns to numerical indices.**

In [71]:
indexer = StringIndexer(inputCols= ["date", "street", "city", "statezip", "country"], 
                        outputCols=["date_Index", "street_Index", "city_Index", "statezip_Index", "country_Index"]) 
indexed = indexer.fit(dataset).transform(dataset) 
indexed.show()

+-------------------+---------+--------+---------+-----------+--------+------+----------+----+---------+----------+-------------+--------+------------+--------------------+----------------+--------+-------+----------+------------+--------------+-------------+----------+
|               date|    price|bedrooms|bathrooms|sqft_living|sqft_lot|floors|waterfront|view|condition|sqft_above|sqft_basement|yr_built|yr_renovated|              street|            city|statezip|country|city_Index|street_Index|statezip_Index|country_Index|date_Index|
+-------------------+---------+--------+---------+-----------+--------+------+----------+----+---------+----------+-------------+--------+------------+--------------------+----------------+--------+-------+----------+------------+--------------+-------------+----------+
|2014-05-02 00:00:00| 313000.0|     3.0|      1.5|       1340|    7912|   1.5|         0|   0|        3|      1340|            0|    1955|        2005|18810 Densmore Ave N|       Shorelin
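
To make the `*_Index` columns easier to interpret: by default `StringIndexer` orders labels by descending frequency, so the most common value receives index 0.0. The following pure-Python sketch mimics that ordering on a made-up list of cities. It is an illustration of the idea, not Spark's implementation, and `fit_string_index` is a hypothetical name:

```python
from collections import Counter

def fit_string_index(values):
    """Assign 0.0 to the most frequent label, 1.0 to the next, and so on.

    Ties are broken alphabetically in this sketch.
    """
    counts = Counter(values)
    labels = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(labels)}

# Made-up sample values, for illustration only.
cities = ["Seattle", "Seattle", "Renton", "Bellevue", "Seattle", "Renton"]
mapping = fit_string_index(cities)
print(mapping)                       # {'Seattle': 0.0, 'Renton': 1.0, 'Bellevue': 2.0}
print([mapping[c] for c in cities])  # [0.0, 0.0, 1.0, 2.0, 0.0, 1.0]
```

The indices encode frequency rank, not any property of the house, so a linear model will treat them as ordered magnitudes. One-hot encoding the indexed column is a common alternative when that ordering is not meaningful.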

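**The original string columns and the `country_Index`, `date_Index`, and `statezip_Index` columns were dropped; `city_Index` and `street_Index` were kept.**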

In [72]:
dataset_new = indexed.drop(*['date', 'street', 'city', 'statezip', 'country','country_Index', 'date_Index', 'statezip_Index'])

**Final Dataset.**

In [73]:
dataset_new.show()

+---------+--------+---------+-----------+--------+------+----------+----+---------+----------+-------------+--------+------------+----------+------------+
|    price|bedrooms|bathrooms|sqft_living|sqft_lot|floors|waterfront|view|condition|sqft_above|sqft_basement|yr_built|yr_renovated|city_Index|street_Index|
+---------+--------+---------+-----------+--------+------+----------+----+---------+----------+-------------+--------+------------+----------+------------+
| 313000.0|     3.0|      1.5|       1340|    7912|   1.5|         0|   0|        3|      1340|            0|    1955|        2005|      10.0|      1574.0|
|2384000.0|     5.0|      2.5|       3650|    9050|   2.0|         0|   4|        5|      3370|          280|    1921|           0|       0.0|      3912.0|
| 342000.0|     3.0|      2.0|       1930|   11947|   1.0|         0|   0|        4|      1930|            0|    1966|           0|       6.0|      2330.0|
| 420000.0|     3.0|     2.25|       2000|    8030|   1.0|      

In [74]:
dataset_new.describe().show()

+-------+-----------------+------------------+------------------+------------------+------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+-----------------+------------------+
|summary|            price|          bedrooms|         bathrooms|       sqft_living|          sqft_lot|            floors|          waterfront|               view|         condition|        sqft_above|     sqft_basement|         yr_built|     yr_renovated|       city_Index|      street_Index|
+-------+-----------------+------------------+------------------+------------------+------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+-----------------+------------------+
|  count|             4600|              4600|              4600|              4600|              4600|              4

**VectorAssembler is a feature transformer that merges multiple columns into a vector column.**

In [75]:
ignore = ['price']
assembler = VectorAssembler(
    inputCols=[x for x in dataset_new.columns if x not in ignore],
    outputCol='features')
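
Conceptually, the assembler reads the input columns of each row in the order given and packs them into one numeric vector. A pure-Python sketch of that step, using the first row of `dataset_new` shown above (`assemble` is a hypothetical helper, not a Spark function):

```python
# Feature columns in the order of dataset_new.columns, with 'price' left out.
input_cols = [
    "bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors",
    "waterfront", "view", "condition", "sqft_above", "sqft_basement",
    "yr_built", "yr_renovated", "city_Index", "street_Index",
]

# First row of dataset_new as printed above.
row = {
    "price": 313000.0, "bedrooms": 3.0, "bathrooms": 1.5, "sqft_living": 1340,
    "sqft_lot": 7912, "floors": 1.5, "waterfront": 0, "view": 0,
    "condition": 3, "sqft_above": 1340, "sqft_basement": 0, "yr_built": 1955,
    "yr_renovated": 2005, "city_Index": 10.0, "street_Index": 1574.0,
}

def assemble(row, input_cols):
    """Concatenate the selected columns of one row into a list of floats."""
    return [float(row[col]) for col in input_cols]

features = assemble(row, input_cols)
print(features[:4])  # [3.0, 1.5, 1340.0, 7912.0]
```

The position of each value in the vector follows `inputCols`, which is what later ties each model coefficient to a column.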

**Final output dataset containing the features column which is a vector column.**

In [76]:
output = assembler.transform(dataset_new)
output.show()

+---------+--------+---------+-----------+--------+------+----------+----+---------+----------+-------------+--------+------------+----------+------------+--------------------+
|    price|bedrooms|bathrooms|sqft_living|sqft_lot|floors|waterfront|view|condition|sqft_above|sqft_basement|yr_built|yr_renovated|city_Index|street_Index|            features|
+---------+--------+---------+-----------+--------+------+----------+----+---------+----------+-------------+--------+------------+----------+------------+--------------------+
| 313000.0|     3.0|      1.5|       1340|    7912|   1.5|         0|   0|        3|      1340|            0|    1955|        2005|      10.0|      1574.0|[3.0,1.5,1340.0,7...|
|2384000.0|     5.0|      2.5|       3650|    9050|   2.0|         0|   4|        5|      3370|          280|    1921|           0|       0.0|      3912.0|[5.0,2.5,3650.0,9...|
| 342000.0|     3.0|      2.0|       1930|   11947|   1.0|         0|   0|        4|      1930|            0|    19

In [77]:
output.select("features").show()

+--------------------+
|            features|
+--------------------+
|[3.0,1.5,1340.0,7...|
|[5.0,2.5,3650.0,9...|
|[3.0,2.0,1930.0,1...|
|[3.0,2.25,2000.0,...|
|[4.0,2.5,1940.0,1...|
|[2.0,1.0,880.0,63...|
|[2.0,2.0,1350.0,2...|
|[4.0,2.5,2710.0,3...|
|[3.0,2.5,2430.0,8...|
|[4.0,2.0,1520.0,6...|
|[3.0,1.75,1710.0,...|
|[4.0,2.5,2920.0,4...|
|[3.0,1.75,2330.0,...|
|[3.0,1.0,1090.0,6...|
|[5.0,2.75,2910.0,...|
|[3.0,1.5,1200.0,9...|
|[3.0,1.5,1570.0,6...|
|[4.0,3.0,3110.0,7...|
|[3.0,1.75,1370.0,...|
|[3.0,1.5,1180.0,1...|
+--------------------+
only showing top 20 rows



In [78]:
output.columns

['price',
 'bedrooms',
 'bathrooms',
 'sqft_living',
 'sqft_lot',
 'floors',
 'waterfront',
 'view',
 'condition',
 'sqft_above',
 'sqft_basement',
 'yr_built',
 'yr_renovated',
 'city_Index',
 'street_Index',
 'features']

In [79]:
finalized_data = output.select("features", "price")

**`finalized_data` contains both the features and the label (`price`).**

In [80]:
finalized_data.show()

+--------------------+---------+
|            features|    price|
+--------------------+---------+
|[3.0,1.5,1340.0,7...| 313000.0|
|[5.0,2.5,3650.0,9...|2384000.0|
|[3.0,2.0,1930.0,1...| 342000.0|
|[3.0,2.25,2000.0,...| 420000.0|
|[4.0,2.5,1940.0,1...| 550000.0|
|[2.0,1.0,880.0,63...| 490000.0|
|[2.0,2.0,1350.0,2...| 335000.0|
|[4.0,2.5,2710.0,3...| 482000.0|
|[3.0,2.5,2430.0,8...| 452500.0|
|[4.0,2.0,1520.0,6...| 640000.0|
|[3.0,1.75,1710.0,...| 463000.0|
|[4.0,2.5,2920.0,4...|1400000.0|
|[3.0,1.75,2330.0,...| 588500.0|
|[3.0,1.0,1090.0,6...| 365000.0|
|[5.0,2.75,2910.0,...|1200000.0|
|[3.0,1.5,1200.0,9...| 242500.0|
|[3.0,1.5,1570.0,6...| 419000.0|
|[4.0,3.0,3110.0,7...| 367500.0|
|[3.0,1.75,1370.0,...| 257950.0|
|[3.0,1.5,1180.0,1...| 275000.0|
+--------------------+---------+
only showing top 20 rows



In [81]:
finalized_data.columns

['features', 'price']

**Splitting the `finalized_data` dataset into train and test sets in a 3:1 ratio.**

In [82]:
train_data, test_data = finalized_data.randomSplit([0.75, 0.25], seed = 42)
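
Note that the 3:1 ratio is approximate: `randomSplit` assigns rows to a split at random according to the weights, so the resulting sizes vary around 75% and 25%. The sketch below illustrates the idea in plain Python with one uniform draw per row. It is an assumption-laden illustration, not Spark's implementation, and it will not reproduce the same rows as `seed = 42` in Spark:

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split using a single uniform draw per row."""
    total = float(sum(weights))
    bounds, running = [], 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point round-off

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits

# 4600 stand-in rows, matching the row count reported by describe().
train_rows, test_rows = random_split(range(4600), [0.75, 0.25], seed=42)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, which keeps the reported metrics comparable between runs.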

In [93]:
print("This is the Train dataset:- ")
train_data.show()  # show() prints the table itself and returns None, so it is not wrapped in print()

print("This is the Test dataset:- ")
test_data.show()

This is the Train dataset:- 
+--------------------+---------+
|            features|    price|
+--------------------+---------+
|[0.0,0.0,3064.0,4...|1095000.0|
|[0.0,0.0,4810.0,2...|1295648.0|
|[1.0,0.75,380.0,1...| 245000.0|
|[1.0,0.75,420.0,6...| 280000.0|
|[1.0,0.75,430.0,5...|  80000.0|
|[1.0,0.75,820.0,5...| 527550.0|
|[1.0,1.0,550.0,12...| 353000.0|
|[1.0,1.0,590.0,83...| 202000.0|
|[1.0,1.0,620.0,82...| 148000.0|
|[1.0,1.0,690.0,19...| 167500.0|
|[1.0,1.0,700.0,25...| 295000.0|
|[1.0,1.0,700.0,51...| 350000.0|
|[1.0,1.0,720.0,48...| 190000.0|
|[1.0,1.0,720.0,51...| 335000.0|
|[1.0,1.0,730.0,19...| 321500.0|
|[1.0,1.0,730.0,30...| 395000.0|
|[1.0,1.0,750.0,40...| 250000.0|
|[1.0,1.0,800.0,16...| 250000.0|
|[1.0,1.0,810.0,24...| 235000.0|
|[1.0,1.0,820.0,10...| 194000.0|
+--------------------+---------+
only showing top 20 rows

This is the Test dataset:- 
+--------------------+--------+
|            features|   price|
+--------------------+--------+
|[1.0,0.75,370.0,1...|27

**Linear regression algorithm was used for prediction.**

In [83]:
regressor = LinearRegression(featuresCol='features', labelCol='price')
regressor = regressor.fit(train_data)

In [84]:
regressor.coefficients

DenseVector([-54738.387, 80462.4082, 1016.692, -0.4755, -5674.3828, 174183.5185, 52338.6537, 34444.9211, -756.2056, -822.3781, -1939.0446, 8.4479, -5408.329, -0.2255])

In [85]:
regressor.intercept

3766039.7226613276
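
The coefficients and intercept above define the fitted model as `prediction = intercept + sum(coefficient_i * feature_i)`, with coefficients in the same order as the assembler's input columns. As a sketch, the calculation can be redone by hand for the first row of the dataset, using the rounded coefficients as printed, so the result is only approximate:

```python
# Values copied from the regressor.coefficients and regressor.intercept outputs above.
coefficients = [
    -54738.387, 80462.4082, 1016.692, -0.4755, -5674.3828, 174183.5185,
    52338.6537, 34444.9211, -756.2056, -822.3781, -1939.0446, 8.4479,
    -5408.329, -0.2255,
]
intercept = 3766039.7226613276

# Feature vector of the first dataset row (listed price 313000.0).
features = [3.0, 1.5, 1340.0, 7912.0, 1.5, 0.0, 0.0, 3.0,
            1340.0, 0.0, 1955.0, 2005.0, 10.0, 1574.0]

def predict(features, coefficients, intercept):
    """Linear model: the intercept plus the dot product of coefficients and features."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

prediction = predict(features, coefficients, intercept)
print(round(prediction))
```

This comes out at roughly 334,000 against a listed price of 313,000. The large intercept is mostly offset by the negative `yr_built` term, which is one reason individual coefficients here are hard to read in isolation.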

**Evaluation of the test_data using our linear regression model.**

In [86]:
pred_results = regressor.evaluate(test_data)

**Price prediction.**

In [87]:
pred_results.predictions.show()

+--------------------+--------+-------------------+
|            features|   price|         prediction|
+--------------------+--------+-------------------+
|[1.0,0.75,370.0,1...|276000.0|  304437.9962327592|
|[1.0,0.75,560.0,1...|299000.0|  59921.95356263919|
|[1.0,0.75,930.0,2...|190000.0|  349893.0134237702|
|[1.0,0.75,1170.0,...|275000.0|  265939.7557767574|
|[1.0,1.0,650.0,15...|129000.0| 239143.50742289657|
|[1.0,1.0,720.0,60...|     0.0| 329107.89825000754|
|[1.0,1.0,790.0,13...|135000.0| 339977.58689325117|
|[1.0,1.0,960.0,26...|332000.0| 441524.80853667296|
|[1.0,1.0,960.0,40...|420850.0| 265977.18040653225|
|[1.0,1.0,1140.0,6...|540000.0| 473126.82464962313|
|[1.0,1.5,810.0,32...|285000.0|  314652.2213279018|
|[1.0,1.5,1010.0,5...|410000.0| 500193.77066104487|
|[2.0,0.75,840.0,4...|528000.0|  534516.5518133589|
|[2.0,0.75,1392.0,...|350000.0|  228522.0436636093|
|[2.0,0.75,1440.0,...|562100.0|   476736.824840344|
|[2.0,1.0,520.0,22...|160000.0| 11640.465541466605|
|[2.0,1.0,59

**Root mean squared error (RMSE), mean absolute error (MAE), and R-squared (coefficient of determination).**

In [88]:
print("RMSE = %s" % pred_results.rootMeanSquaredError)
print("MAE = ", pred_results.meanAbsoluteError)
print("R-squared = %s" % pred_results.r2)

RMSE = 264588.83654159546
MAE =  157266.05119796735
R-squared = 0.5066408188315306
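
For reference, the three metrics can be written out from their standard definitions. The sketch below applies them to the first five rows of the prediction table above, so the numbers it prints describe only those five rows and will not match the full test-set values. `regression_metrics` is a hypothetical helper, not a Spark API:

```python
import math

def regression_metrics(actual, predicted):
    """Compute RMSE, MAE, and R-squared from paired actual and predicted values."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(r * r for r in residuals)                # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }

# First five (price, prediction) pairs from pred_results.predictions.show().
actual = [276000.0, 299000.0, 190000.0, 275000.0, 129000.0]
predicted = [304437.9962327592, 59921.95356263919, 349893.0134237702,
             265939.7557767574, 239143.50742289657]

for name, value in regression_metrics(actual, predicted).items():
    print(name, "=", value)
```

On the full test set, an R-squared of about 0.51 means the model accounts for roughly half of the variance in price. RMSE being well above MAE suggests a few large errors, such as the row with a recorded price of 0.0 in the prediction table.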
