
# PySpark Assignment

## RealEstate Housing Data

1. Extract: Load the data
 - Read all CSV files as text into a single RDD
2. Transform: Exploratory data analysis using RDDs
 - Count the unique records
 - Extract the full address from the url column
   - from http://www.zillow.com/homes/for_sale//homedetails/V-l-Buell-Newstead-NY-10001/2089629334_zpid/
   - to V-l-Buell-Newstead-NY-10001
 - Replace NA with zero in all numerical columns
 - Concatenate bedrooms and bathrooms as bed_bath_rooms, e.g. 3b2bh
 - Group by zip and bed_bath_rooms, then compute avg, max, and min
3. Load: Save the analysis report
 - Save the grouped avg, max, and min results as files


In [1]:
from random import random
import os
import pyspark
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.master("local").\
        appName("SparkApplication").\
        config("spark.driver.bindAddress","localhost").\
        config("spark.ui.port","4041").\
        getOrCreate()

In [3]:
sc = spark.sparkContext

### Read multiple CSV files into a single RDD

In [4]:
data=sc.textFile("2018-05-12_154616.csv,2018-05-12_155104.csv,2018-05-12_155435.csv")

In [5]:
# Capture the header row (the first line of the RDD)
header = data.first()

In [6]:
print(header)

address,city,state,zip,price,sqft,bedrooms,bathrooms,days_on_zillow,sale_type,url


In [7]:
# Remove header rows: drop every line that is equal to the header
step1 = data.filter(lambda line: line != header)

In [8]:
step1.collect()

['V/l Buell,Newstead,NY,10001,49000,NA,NA,NA,2,Lot/Land For Sale,http://www.zillow.com/homes/for_sale//homedetails/V-l-Buell-Newstead-NY-10001/2089629334_zpid/',
 '263 9th Ave # PHD,New York,NY,10001,4495000,2250,3,2,1,Condo For Sale,http://www.zillow.com/homes/for_sale//homedetails/263-9th-Ave-PHD-New-York-NY-10001/2103425273_zpid/',
 '315 7th Ave APT 21C,NEW YORK,NY,10001,1625000,1000,1,1,NA,Condo For Sale,http://www.zillow.com/homes/for_sale//homedetails/315-7th-Ave-APT-21C-New-York-NY-10001/31503968_zpid/',
 '252 7th Ave APT 4L,NEW YORK,NY,10001,1529000,980,0,1,2,Condo For Sale,http://www.zillow.com/homes/for_sale//homedetails/252-7th-Ave-APT-4L-New-York-NY-10001/55501383_zpid/',
 '150 W 26th St APT 201,NEW YORK,NY,10001,1950000,1600,1,2,NA,Condo For Sale,http://www.zillow.com/homes/for_sale//homedetails/150-W-26th-St-APT-201-New-York-NY-10001/60147930_zpid/',
 '133 W 28th St APT 6-C,New York,NY,10001,1550000,1300,2,2,NA,Co-op For Sale,http://www.zillow.com/homes/for_sale//homedeta

### Total records count

In [9]:
step1.count()

1117

### Total unique records count

In [10]:
step1.distinct().count()

1064

In [11]:
### Filtering out duplicate records

In [12]:
step2=step1.distinct()

In [13]:
step2.count()

1064

### Extract full address from url

In [14]:
# Split each line on commas to form a list of fields
step3= step2.map(lambda line: line.split(","))
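
A plain `split(",")` works for the rows shown in this notebook, but it would miscount fields if an address were ever quoted and contained a comma. The following is a minimal sketch, assuming such rows could occur; `parse_csv_line` and the sample line are hypothetical and not part of the assignment data.

```python
import csv

def parse_csv_line(line):
    """Parse one CSV line into a list of fields, honouring quoted commas."""
    return next(csv.reader([line]))

# Hypothetical row whose address contains a comma inside quotes
sample = '"12 Main St, Unit 4",New York,NY,10001'

print(len(sample.split(",")))        # 5 fields: the quoted address is cut in two
print(len(parse_csv_line(sample)))   # 4 fields: the quoted address stays whole

# In the notebook this could replace the lambda, e.g.:
# step3 = step2.map(parse_csv_line)
```

For rows without quotes the result is the same list that `line.split(",")` produces.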

In [15]:
step3.first()

['252 7th Ave APT 4L',
 'NEW YORK',
 'NY',
 '10001',
 '1529000',
 '980',
 '0',
 '1',
 '2',
 'Condo For Sale',
 'http://www.zillow.com/homes/for_sale//homedetails/252-7th-Ave-APT-4L-New-York-NY-10001/55501383_zpid/']

In [16]:
step3.collect()

[['252 7th Ave APT 4L',
  'NEW YORK',
  'NY',
  '10001',
  '1529000',
  '980',
  '0',
  '1',
  '2',
  'Condo For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/252-7th-Ave-APT-4L-New-York-NY-10001/55501383_zpid/'],
 ['133 W 28th St APT 6-C',
  'New York',
  'NY',
  '10001',
  '1550000',
  '1300',
  '2',
  '2',
  'NA',
  'Co-op For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/133-W-28th-St-APT-6-C-New-York-NY-10001/79496201_zpid/'],
 ['522 W 29th St # 4D',
  'New York',
  'NY',
  '10001',
  '3300000',
  '1255',
  '2',
  '2',
  '9',
  'Condo For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/522-W-29th-St-4D-New-York-NY-10001/2105595161_zpid/'],
 ['252 7th Ave # PHI',
  'New York',
  'NY',
  '10001',
  '6495000',
  '2400',
  '3',
  '3',
  '7',
  'Condo For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/252-7th-Ave-PHI-New-York-NY-10001/2100699091_zpid/'],
 ['252 7th Ave APT 5P',
  'NEW YORK',
  'NY',
  '10001',
  '4350000',
  '2203',


In [17]:
# Helper that extracts the address part of a listing URL
def extract_address(url):
    # The URL ends with ".../<address>/<id>_zpid/", so the address is the third-from-last segment
    after_split = url.split("/")
    return after_split[-3]

In [18]:
extract_address("http://www.zillow.com/homes/for_sale//homedetails/252-7th-Ave-APT-4L-New-York-NY-10001/55501383_zpid/")

'252-7th-Ave-APT-4L-New-York-NY-10001'
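
Indexing with `[-3]` relies on every URL ending with a trailing slash. As a hedged alternative, the sketch below looks for the segment that follows `homedetails` instead, so it does not depend on the trailing slash. The function name `extract_address_by_marker` and its `marker` parameter are illustrative assumptions, not part of the original notebook.

```python
def extract_address_by_marker(url, marker="homedetails"):
    """Return the path segment that follows `marker`, or None if it is absent."""
    parts = [p for p in url.split("/") if p]   # drop empty segments such as "//"
    if marker in parts:
        idx = parts.index(marker)
        if idx + 1 < len(parts):
            return parts[idx + 1]
    return None

url = "http://www.zillow.com/homes/for_sale//homedetails/252-7th-Ave-APT-4L-New-York-NY-10001/55501383_zpid/"
print(extract_address_by_marker(url))              # 252-7th-Ave-APT-4L-New-York-NY-10001
print(extract_address_by_marker(url.rstrip("/")))  # same result without the trailing slash
```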

In [19]:
step4=step3.map(lambda x: (x[0],x[1],x[2],x[3],x[4],x[5],x[6],x[7],x[8],x[9],x[10],extract_address(x[-1])))

In [20]:
print(step4.collect())

[('252 7th Ave APT 4L', 'NEW YORK', 'NY', '10001', '1529000', '980', '0', '1', '2', 'Condo For Sale', 'http://www.zillow.com/homes/for_sale//homedetails/252-7th-Ave-APT-4L-New-York-NY-10001/55501383_zpid/', '252-7th-Ave-APT-4L-New-York-NY-10001'), ('133 W 28th St APT 6-C', 'New York', 'NY', '10001', '1550000', '1300', '2', '2', 'NA', 'Co-op For Sale', 'http://www.zillow.com/homes/for_sale//homedetails/133-W-28th-St-APT-6-C-New-York-NY-10001/79496201_zpid/', '133-W-28th-St-APT-6-C-New-York-NY-10001'), ('522 W 29th St # 4D', 'New York', 'NY', '10001', '3300000', '1255', '2', '2', '9', 'Condo For Sale', 'http://www.zillow.com/homes/for_sale//homedetails/522-W-29th-St-4D-New-York-NY-10001/2105595161_zpid/', '522-W-29th-St-4D-New-York-NY-10001'), ('252 7th Ave # PHI', 'New York', 'NY', '10001', '6495000', '2400', '3', '3', '7', 'Condo For Sale', 'http://www.zillow.com/homes/for_sale//homedetails/252-7th-Ave-PHI-New-York-NY-10001/2100699091_zpid/', '252-7th-Ave-PHI-New-York-NY-10001'), ('252

In [21]:
step4.take(2)

[('252 7th Ave APT 4L',
  'NEW YORK',
  'NY',
  '10001',
  '1529000',
  '980',
  '0',
  '1',
  '2',
  'Condo For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/252-7th-Ave-APT-4L-New-York-NY-10001/55501383_zpid/',
  '252-7th-Ave-APT-4L-New-York-NY-10001'),
 ('133 W 28th St APT 6-C',
  'New York',
  'NY',
  '10001',
  '1550000',
  '1300',
  '2',
  '2',
  'NA',
  'Co-op For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/133-W-28th-St-APT-6-C-New-York-NY-10001/79496201_zpid/',
  '133-W-28th-St-APT-6-C-New-York-NY-10001')]

### Replacing NA by 0 in all numerical columns

In [22]:
# Convert a numerical column value from string to int; NA or any other non-numeric value becomes 0
def replace_na_0(column_val):
    try:
        return int(float(column_val))
    except (TypeError, ValueError, OverflowError):
        return 0

In [23]:
# Indices of the numerical columns: zip, price, sqft, bedrooms, bathrooms, days_on_zillow
num_columns=[3,4,5,6,7,8]

In [24]:
step5=step4.map(lambda x: (x[0],x[1],x[2],replace_na_0(x[3]),replace_na_0(x[4]),replace_na_0(x[5]),
                           replace_na_0(x[6]),replace_na_0(x[7]),replace_na_0(x[8]),x[9],x[10],x[11]))
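
The `num_columns` list is defined above but the map spells out each index by hand. One possible way to let the list drive the conversion is sketched below; `convert_numeric` is a hypothetical helper, and the sample row is the second record from the output above with the two URL-derived fields left out for brevity.

```python
def replace_na_0(column_val):
    # Same behaviour as the notebook's helper: non-numeric values become 0
    try:
        return int(float(column_val))
    except (TypeError, ValueError, OverflowError):
        return 0

def convert_numeric(row, num_columns):
    """Apply replace_na_0 only at the positions listed in num_columns."""
    wanted = set(num_columns)
    return tuple(replace_na_0(v) if i in wanted else v for i, v in enumerate(row))

row = ('133 W 28th St APT 6-C', 'New York', 'NY', '10001', '1550000',
       '1300', '2', '2', 'NA', 'Co-op For Sale')
converted = convert_numeric(row, [3, 4, 5, 6, 7, 8])
print(converted)

# In the notebook this could be used as:
# step5 = step4.map(lambda x: convert_numeric(x, num_columns))
```

The design choice here is only about maintainability: adding or removing a numerical column means editing the index list rather than the lambda.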

In [25]:
step4.collect()

[('252 7th Ave APT 4L',
  'NEW YORK',
  'NY',
  '10001',
  '1529000',
  '980',
  '0',
  '1',
  '2',
  'Condo For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/252-7th-Ave-APT-4L-New-York-NY-10001/55501383_zpid/',
  '252-7th-Ave-APT-4L-New-York-NY-10001'),
 ('133 W 28th St APT 6-C',
  'New York',
  'NY',
  '10001',
  '1550000',
  '1300',
  '2',
  '2',
  'NA',
  'Co-op For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/133-W-28th-St-APT-6-C-New-York-NY-10001/79496201_zpid/',
  '133-W-28th-St-APT-6-C-New-York-NY-10001'),
 ('522 W 29th St # 4D',
  'New York',
  'NY',
  '10001',
  '3300000',
  '1255',
  '2',
  '2',
  '9',
  'Condo For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/522-W-29th-St-4D-New-York-NY-10001/2105595161_zpid/',
  '522-W-29th-St-4D-New-York-NY-10001'),
 ('252 7th Ave # PHI',
  'New York',
  'NY',
  '10001',
  '6495000',
  '2400',
  '3',
  '3',
  '7',
  'Condo For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/252-7th

* In the step4 output above, the second record has NA in a numerical column (days_on_zillow).
* step5 replaces it with 0, as the following output shows.

In [26]:
step5.collect()

[('252 7th Ave APT 4L',
  'NEW YORK',
  'NY',
  10001,
  1529000,
  980,
  0,
  1,
  2,
  'Condo For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/252-7th-Ave-APT-4L-New-York-NY-10001/55501383_zpid/',
  '252-7th-Ave-APT-4L-New-York-NY-10001'),
 ('133 W 28th St APT 6-C',
  'New York',
  'NY',
  10001,
  1550000,
  1300,
  2,
  2,
  0,
  'Co-op For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/133-W-28th-St-APT-6-C-New-York-NY-10001/79496201_zpid/',
  '133-W-28th-St-APT-6-C-New-York-NY-10001'),
 ('522 W 29th St # 4D',
  'New York',
  'NY',
  10001,
  3300000,
  1255,
  2,
  2,
  9,
  'Condo For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/522-W-29th-St-4D-New-York-NY-10001/2105595161_zpid/',
  '522-W-29th-St-4D-New-York-NY-10001'),
 ('252 7th Ave # PHI',
  'New York',
  'NY',
  10001,
  6495000,
  2400,
  3,
  3,
  7,
  'Condo For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/252-7th-Ave-PHI-New-York-NY-10001/2100699091_zpid/',
  '

### Concatenate bedrooms and bathrooms as bed_bath_rooms, e.g. 3b2bh

In [27]:
# Concatenate the values at indices 6 and 7 (bedrooms, bathrooms) into one bed_bath_rooms value, e.g. 3b2bh
def bed_n_bath_combined(val1,val2):
    return str(val1)+"b"+str(val2)+"bh"

In [28]:
step6=step5.map(lambda x: (x[0],x[1],x[2],x[3],x[4],x[5],bed_n_bath_combined(x[6],x[7]),x[8],x[9],x[10],x[11]))

In [29]:
step5.take(3)

[('252 7th Ave APT 4L',
  'NEW YORK',
  'NY',
  10001,
  1529000,
  980,
  0,
  1,
  2,
  'Condo For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/252-7th-Ave-APT-4L-New-York-NY-10001/55501383_zpid/',
  '252-7th-Ave-APT-4L-New-York-NY-10001'),
 ('133 W 28th St APT 6-C',
  'New York',
  'NY',
  10001,
  1550000,
  1300,
  2,
  2,
  0,
  'Co-op For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/133-W-28th-St-APT-6-C-New-York-NY-10001/79496201_zpid/',
  '133-W-28th-St-APT-6-C-New-York-NY-10001'),
 ('522 W 29th St # 4D',
  'New York',
  'NY',
  10001,
  3300000,
  1255,
  2,
  2,
  9,
  'Condo For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/522-W-29th-St-4D-New-York-NY-10001/2105595161_zpid/',
  '522-W-29th-St-4D-New-York-NY-10001')]

In [30]:
step6.take(3)

[('252 7th Ave APT 4L',
  'NEW YORK',
  'NY',
  10001,
  1529000,
  980,
  '0b1bh',
  2,
  'Condo For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/252-7th-Ave-APT-4L-New-York-NY-10001/55501383_zpid/',
  '252-7th-Ave-APT-4L-New-York-NY-10001'),
 ('133 W 28th St APT 6-C',
  'New York',
  'NY',
  10001,
  1550000,
  1300,
  '2b2bh',
  0,
  'Co-op For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/133-W-28th-St-APT-6-C-New-York-NY-10001/79496201_zpid/',
  '133-W-28th-St-APT-6-C-New-York-NY-10001'),
 ('522 W 29th St # 4D',
  'New York',
  'NY',
  10001,
  3300000,
  1255,
  '2b2bh',
  9,
  'Condo For Sale',
  'http://www.zillow.com/homes/for_sale//homedetails/522-W-29th-St-4D-New-York-NY-10001/2105595161_zpid/',
  '522-W-29th-St-4D-New-York-NY-10001')]

### Group by zip and bed_bath_rooms: avg, max, min


In [31]:
# Create an RDD that contains only the zip, bed_bath_rooms and price columns
step7 = step6.map(lambda x: (x[3],x[6],x[4]))

In [32]:
step7.collect()

[(10001, '0b1bh', 1529000),
 (10001, '2b2bh', 1550000),
 (10001, '2b2bh', 3300000),
 (10001, '3b3bh', 6495000),
 (10001, '2b3bh', 4350000),
 (10001, '2b2bh', 2700000),
 (10001, '0b1bh', 410000),
 (10001, '3b3bh', 4450000),
 (10001, '2b2bh', 1995000),
 (10001, '5b5bh', 6995000),
 (10001, '2b3bh', 2450000),
 (10001, '2b3bh', 4315000),
 (10001, '3b2bh', 3250000),
 (10001, '1b1bh', 775000),
 (10001, '1b2bh', 1685000),
 (10001, '7b10bh', 6500000),
 (10001, '2b3bh', 7875000),
 (10001, '5b5bh', 16000000),
 (10001, '5b5bh', 9750000),
 (10001, '2b3bh', 7575000),
 (10001, '2b3bh', 3875000),
 (10001, '2b3bh', 5750000),
 (10001, '2b3bh', 2995000),
 (10001, '0b1bh', 435000),
 (10001, '4b5bh', 12810000),
 (10001, '3b4bh', 5900000),
 (10001, '3b4bh', 4725000),
 (10001, '2b3bh', 3695000),
 (10001, '1b1bh', 1825000),
 (10001, '0b1bh', 625000),
 (10001, '2b2bh', 2235000),
 (10001, '2b2bh', 2345000),
 (10001, '2b2bh', 1775000),
 (10001, '0b1bh', 435000),
 (10001, '1b1bh', 899000),
 (10001, '1b1bh', 14980

In [33]:
# Group by the composite key (zip, bed_bath_rooms)
step8 = step7.groupBy(lambda x: (x[0],x[1]))

In [34]:
step8.collect()

[((10001, '0b1bh'), <pyspark.resultiterable.ResultIterable at 0x1cf6744a940>),
 ((10001, '0b0bh'), <pyspark.resultiterable.ResultIterable at 0x1cf678ab430>),
 ((10003, '2b2bh'), <pyspark.resultiterable.ResultIterable at 0x1cf678ab4f0>),
 ((10003, '2b4bh'), <pyspark.resultiterable.ResultIterable at 0x1cf678ab580>),
 ((10003, '8b10bh'), <pyspark.resultiterable.ResultIterable at 0x1cf678ab5e0>),
 ((10002, '1b1bh'), <pyspark.resultiterable.ResultIterable at 0x1cf678ab640>),
 ((10002, '3b2bh'), <pyspark.resultiterable.ResultIterable at 0x1cf678ab6a0>),
 ((10002, '2b3bh'), <pyspark.resultiterable.ResultIterable at 0x1cf6744ad00>),
 ((10002, '0b0bh'), <pyspark.resultiterable.ResultIterable at 0x1cf6744af70>),
 ((10004, '4b5bh'), <pyspark.resultiterable.ResultIterable at 0x1cf67556ac0>),
 ((10004, '2b2bh'), <pyspark.resultiterable.ResultIterable at 0x1cf675567f0>),
 ((10006, '1b1bh'), <pyspark.resultiterable.ResultIterable at 0x1cf67556bb0>),
 ((10006, '0b1bh'), <pyspark.resultiterable.ResultI

In [35]:
step8.mapValues(list).collect()

[((10001, '0b1bh'),
  [(10001, '0b1bh', 1529000),
   (10001, '0b1bh', 410000),
   (10001, '0b1bh', 435000),
   (10001, '0b1bh', 625000),
   (10001, '0b1bh', 435000),
   (10001, '0b1bh', 449500),
   (10001, '0b1bh', 1295000)]),
 ((10001, '0b0bh'),
  [(10001, '0b0bh', 0), (10001, '0b0bh', 49000), (10001, '0b0bh', 0)]),
 ((10003, '2b2bh'),
  [(10003, '2b2bh', 2850000),
   (10003, '2b2bh', 2195000),
   (10003, '2b2bh', 1995000),
   (10003, '2b2bh', 1900000),
   (10003, '2b2bh', 1745000),
   (10003, '2b2bh', 3800000),
   (10003, '2b2bh', 2200000),
   (10003, '2b2bh', 3750000),
   (10003, '2b2bh', 2775000),
   (10003, '2b2bh', 2695000),
   (10003, '2b2bh', 3100000),
   (10003, '2b2bh', 2100000),
   (10003, '2b2bh', 1845000),
   (10003, '2b2bh', 2300000),
   (10003, '2b2bh', 1975000),
   (10003, '2b2bh', 2550000),
   (10003, '2b2bh', 2195000),
   (10003, '2b2bh', 2249000),
   (10003, '2b2bh', 2150000),
   (10003, '2b2bh', 2995000),
   (10003, '2b2bh', 2750000),
   (10003, '2b2bh', 1900000),
 

In [36]:
# Aggregate by min: records in a group share zip and bed_bath_rooms, so tuple comparison is decided by price
step9= step8.map(lambda x: min(x[1]))
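
Calling `min` on whole records works because Python compares tuples element by element, and the first two elements are identical within a group. The small sketch below, using three records from the (10001, '0b1bh') group, shows that behaviour next to a more explicit variant with `key=`; the variable names are illustrative only.

```python
# Three records from one group: same zip and bed_bath_rooms, different price
group = [(10001, '0b1bh', 1529000),
         (10001, '0b1bh', 410000),
         (10001, '0b1bh', 435000)]

# Tuple comparison: the first two fields tie, so the price decides
implicit_min = min(group)

# Explicit variant: compare on the price field only
explicit_min = min(group, key=lambda rec: rec[2])
explicit_max = max(group, key=lambda rec: rec[2])

print(implicit_min)   # (10001, '0b1bh', 410000)
print(explicit_min)   # (10001, '0b1bh', 410000)
print(explicit_max)   # (10001, '0b1bh', 1529000)
```

The `key=` form would keep working even if the records carried extra fields before the price.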

In [37]:
step9.collect()

[(10001, '0b1bh', 410000),
 (10001, '0b0bh', 0),
 (10003, '2b2bh', 1395000),
 (10003, '2b4bh', 7000000),
 (10003, '8b10bh', 17800000),
 (10002, '1b1bh', 400000),
 (10002, '3b2bh', 1100000),
 (10002, '2b3bh', 1850000),
 (10002, '0b0bh', 0),
 (10004, '4b5bh', 2500000),
 (10004, '2b2bh', 1325000),
 (10006, '1b1bh', 835000),
 (10006, '0b1bh', 635000),
 (10006, '3b3bh', 5755000),
 (10006, '3b4bh', 5800000),
 (10007, '2b2bh', 1995000),
 (10007, '4b5bh', 6495000),
 (10007, '3b2bh', 3650000),
 (10007, '7b10bh', 59000000),
 (10007, '2b3bh', 2725000),
 (10007, '1b2bh', 1935000),
 (10007, '0b0bh', 0),
 (10001, '4b3bh', 8500000),
 (10003, '6b5bh', 7875000),
 (10003, '6b7bh', 22900000),
 (10002, '4b2bh', 1650000),
 (10006, '2b3bh', 2975000),
 (10006, '0b0bh', 0),
 (10006, '1b2bh', 2495000),
 (10007, '3b5bh', 6900000),
 (10007, '6b12bh', 13750000),
 (10007, '4b6bh', 2600000),
 (10001, '3b5bh', 5500000),
 (10001, '14b14bh', 7690000),
 (10003, '5b7bh', 18000000),
 (10002, '5b6bh', 15945000),
 (10007, 

In [38]:
# Aggregate by max price within each group
step10= step8.map(lambda x: max(x[1]))

In [39]:
step10.collect()

[(10001, '0b1bh', 1529000),
 (10001, '0b0bh', 49000),
 (10003, '2b2bh', 7350000),
 (10003, '2b4bh', 23000000),
 (10003, '8b10bh', 17800000),
 (10002, '1b1bh', 2750000),
 (10002, '3b2bh', 3527000),
 (10002, '2b3bh', 9995000),
 (10002, '0b0bh', 9750000),
 (10004, '4b5bh', 10995000),
 (10004, '2b2bh', 2995000),
 (10006, '1b1bh', 2200000),
 (10006, '0b1bh', 835000),
 (10006, '3b3bh', 5755000),
 (10006, '3b4bh', 20305000),
 (10007, '2b2bh', 5300000),
 (10007, '4b5bh', 12000000),
 (10007, '3b2bh', 5275000),
 (10007, '7b10bh', 59000000),
 (10007, '2b3bh', 9000000),
 (10007, '1b2bh', 3875000),
 (10007, '0b0bh', 0),
 (10001, '4b3bh', 8500000),
 (10003, '6b5bh', 9400000),
 (10003, '6b7bh', 22900000),
 (10002, '4b2bh', 1680000),
 (10006, '2b3bh', 4600000),
 (10006, '0b0bh', 0),
 (10006, '1b2bh', 3350000),
 (10007, '3b5bh', 30000000),
 (10007, '6b12bh', 13750000),
 (10007, '4b6bh', 21325000),
 (10001, '3b5bh', 5500000),
 (10001, '14b14bh', 7690000),
 (10003, '5b7bh', 18000000),
 (10002, '5b6bh', 1

In [40]:
# Mean price of the records in a group, rounded to 2 decimal places
def mean_val(x):
    total = 0
    count = 0
    for record in x:
        total += record[2]   # price is the third field
        count += 1
    return round(total / count, 2)

In [41]:
# aggregating by mean
step11= step8.map(lambda x: (x[0][0],x[0][1], mean_val(x[1])))

In [42]:
step11.collect()

[(10001, '0b1bh', 739785.71),
 (10001, '0b0bh', 16333.33),
 (10003, '2b2bh', 2505388.88),
 (10003, '2b4bh', 13300000.0),
 (10003, '8b10bh', 17800000.0),
 (10002, '1b1bh', 1093295.45),
 (10002, '3b2bh', 2388217.14),
 (10002, '2b3bh', 3885400.0),
 (10002, '0b0bh', 2295454.55),
 (10004, '4b5bh', 6535750.0),
 (10004, '2b2bh', 1913000.0),
 (10006, '1b1bh', 1407014.06),
 (10006, '0b1bh', 718050.0),
 (10006, '3b3bh', 5755000.0),
 (10006, '3b4bh', 14430750.0),
 (10007, '2b2bh', 3356000.0),
 (10007, '4b5bh', 9101250.0),
 (10007, '3b2bh', 4237500.0),
 (10007, '7b10bh', 59000000.0),
 (10007, '2b3bh', 4171640.62),
 (10007, '1b2bh', 2543125.0),
 (10007, '0b0bh', 0.0),
 (10001, '4b3bh', 8500000.0),
 (10003, '6b5bh', 8741666.67),
 (10003, '6b7bh', 22900000.0),
 (10002, '4b2bh', 1665000.0),
 (10006, '2b3bh', 3923333.33),
 (10006, '0b0bh', 0.0),
 (10006, '1b2bh', 2922500.0),
 (10007, '3b5bh', 16016000.0),
 (10007, '6b12bh', 13750000.0),
 (10007, '4b6bh', 12235000.0),
 (10001, '3b5bh', 5500000.0),
 (100

In [43]:
# aggregate all 3 (min,max and average) in one step
step12= step8.map(lambda x: (x[0][0],x[0][1],mean_val(x[1]),min(x[1])[2],max(x[1])[2]))
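
`groupBy` gathers every record of a group before aggregating. A common alternative is to keep a running (sum, count, min, max) per key and merge partial results, which is the shape that `RDD.aggregateByKey(zeroValue, seqFunc, combFunc)` expects. The sketch below is not taken from the notebook: the names `ZERO`, `seq_op`, `comb_op` and `finish` are assumptions, and the merge logic is exercised in plain Python on the three prices of the (10001, '0b0bh') group rather than on a cluster.

```python
from functools import reduce

ZERO = (0, 0, None, None)  # running (sum, count, min, max)

def _pick(fn, a, b):
    # Combine two optional values, ignoring None placeholders
    if a is None:
        return b
    if b is None:
        return a
    return fn(a, b)

def seq_op(acc, price):
    """Fold one price into a partition-local accumulator."""
    total, count, lo, hi = acc
    return (total + price, count + 1, _pick(min, lo, price), _pick(max, hi, price))

def comb_op(a, b):
    """Merge the accumulators of two partitions."""
    return (a[0] + b[0], a[1] + b[1], _pick(min, a[2], b[2]), _pick(max, a[3], b[3]))

def finish(acc):
    """Turn (sum, count, min, max) into (avg, min, max)."""
    total, count, lo, hi = acc
    return (round(total / count, 2), lo, hi)

# Simulate two partitions holding the prices of group (10001, '0b0bh')
part1 = reduce(seq_op, [0, 49000], ZERO)
part2 = reduce(seq_op, [0], ZERO)
stats = finish(comb_op(part1, part2))
print(stats)   # (16333.33, 0, 49000)

# With Spark, the same functions could be wired up roughly like this (untested here):
# pairs = step7.map(lambda x: ((x[0], x[1]), x[2]))
# step12_alt = pairs.aggregateByKey(ZERO, seq_op, comb_op).mapValues(finish)
```

The printed tuple agrees with the avg, min and max reported for that group in the combined output.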

In [44]:
step12.collect()

[(10001, '0b1bh', 739785.71, 410000, 1529000),
 (10001, '0b0bh', 16333.33, 0, 49000),
 (10003, '2b2bh', 2505388.88, 1395000, 7350000),
 (10003, '2b4bh', 13300000.0, 7000000, 23000000),
 (10003, '8b10bh', 17800000.0, 17800000, 17800000),
 (10002, '1b1bh', 1093295.45, 400000, 2750000),
 (10002, '3b2bh', 2388217.14, 1100000, 3527000),
 (10002, '2b3bh', 3885400.0, 1850000, 9995000),
 (10002, '0b0bh', 2295454.55, 0, 9750000),
 (10004, '4b5bh', 6535750.0, 2500000, 10995000),
 (10004, '2b2bh', 1913000.0, 1325000, 2995000),
 (10006, '1b1bh', 1407014.06, 835000, 2200000),
 (10006, '0b1bh', 718050.0, 635000, 835000),
 (10006, '3b3bh', 5755000.0, 5755000, 5755000),
 (10006, '3b4bh', 14430750.0, 5800000, 20305000),
 (10007, '2b2bh', 3356000.0, 1995000, 5300000),
 (10007, '4b5bh', 9101250.0, 6495000, 12000000),
 (10007, '3b2bh', 4237500.0, 3650000, 5275000),
 (10007, '7b10bh', 59000000.0, 59000000, 59000000),
 (10007, '2b3bh', 4171640.62, 2725000, 9000000),
 (10007, '1b2bh', 2543125.0, 1935000, 387

### Saving outputs as csv files

In [45]:
# Column names for each report
col1=["zip_code","bed_bath_rooms","min_price"]
col2=["zip_code","bed_bath_rooms","max_price"]
col3=["zip_code","bed_bath_rooms","avg_price"]
col=["zip_code","bed_bath_rooms","avg_price","min_price","max_price"]

# Convert each RDD to a DataFrame
f1=step9.toDF(col1)
f2=step10.toDF(col2)
f3=step11.toDF(col3)
f=step12.toDF(col)

In [46]:
f1.show()

+--------+--------------+---------+
|zip_code|bed_bath_rooms|min_price|
+--------+--------------+---------+
|   10001|         0b1bh|   410000|
|   10001|         0b0bh|        0|
|   10003|         2b2bh|  1395000|
|   10003|         2b4bh|  7000000|
|   10003|        8b10bh| 17800000|
|   10002|         1b1bh|   400000|
|   10002|         3b2bh|  1100000|
|   10002|         2b3bh|  1850000|
|   10002|         0b0bh|        0|
|   10004|         4b5bh|  2500000|
|   10004|         2b2bh|  1325000|
|   10006|         1b1bh|   835000|
|   10006|         0b1bh|   635000|
|   10006|         3b3bh|  5755000|
|   10006|         3b4bh|  5800000|
|   10007|         2b2bh|  1995000|
|   10007|         4b5bh|  6495000|
|   10007|         3b2bh|  3650000|
|   10007|        7b10bh| 59000000|
|   10007|         2b3bh|  2725000|
+--------+--------------+---------+
only showing top 20 rows



In [47]:
f2.show()

+--------+--------------+---------+
|zip_code|bed_bath_rooms|max_price|
+--------+--------------+---------+
|   10001|         0b1bh|  1529000|
|   10001|         0b0bh|    49000|
|   10003|         2b2bh|  7350000|
|   10003|         2b4bh| 23000000|
|   10003|        8b10bh| 17800000|
|   10002|         1b1bh|  2750000|
|   10002|         3b2bh|  3527000|
|   10002|         2b3bh|  9995000|
|   10002|         0b0bh|  9750000|
|   10004|         4b5bh| 10995000|
|   10004|         2b2bh|  2995000|
|   10006|         1b1bh|  2200000|
|   10006|         0b1bh|   835000|
|   10006|         3b3bh|  5755000|
|   10006|         3b4bh| 20305000|
|   10007|         2b2bh|  5300000|
|   10007|         4b5bh| 12000000|
|   10007|         3b2bh|  5275000|
|   10007|        7b10bh| 59000000|
|   10007|         2b3bh|  9000000|
+--------+--------------+---------+
only showing top 20 rows



In [48]:
f3.show()

+--------+--------------+----------+
|zip_code|bed_bath_rooms| avg_price|
+--------+--------------+----------+
|   10001|         0b1bh| 739785.71|
|   10001|         0b0bh|  16333.33|
|   10003|         2b2bh|2505388.88|
|   10003|         2b4bh|    1.33E7|
|   10003|        8b10bh|    1.78E7|
|   10002|         1b1bh|1093295.45|
|   10002|         3b2bh|2388217.14|
|   10002|         2b3bh| 3885400.0|
|   10002|         0b0bh|2295454.55|
|   10004|         4b5bh| 6535750.0|
|   10004|         2b2bh| 1913000.0|
|   10006|         1b1bh|1407014.06|
|   10006|         0b1bh|  718050.0|
|   10006|         3b3bh| 5755000.0|
|   10006|         3b4bh|1.443075E7|
|   10007|         2b2bh| 3356000.0|
|   10007|         4b5bh| 9101250.0|
|   10007|         3b2bh| 4237500.0|
|   10007|        7b10bh|     5.9E7|
|   10007|         2b3bh|4171640.62|
+--------+--------------+----------+
only showing top 20 rows



In [49]:
f.show()

+--------+--------------+----------+---------+---------+
|zip_code|bed_bath_rooms| avg_price|min_price|max_price|
+--------+--------------+----------+---------+---------+
|   10001|         0b1bh| 739785.71|   410000|  1529000|
|   10001|         0b0bh|  16333.33|        0|    49000|
|   10003|         2b2bh|2505388.88|  1395000|  7350000|
|   10003|         2b4bh|    1.33E7|  7000000| 23000000|
|   10003|        8b10bh|    1.78E7| 17800000| 17800000|
|   10002|         1b1bh|1093295.45|   400000|  2750000|
|   10002|         3b2bh|2388217.14|  1100000|  3527000|
|   10002|         2b3bh| 3885400.0|  1850000|  9995000|
|   10002|         0b0bh|2295454.55|        0|  9750000|
|   10004|         4b5bh| 6535750.0|  2500000| 10995000|
|   10004|         2b2bh| 1913000.0|  1325000|  2995000|
|   10006|         1b1bh|1407014.06|   835000|  2200000|
|   10006|         0b1bh|  718050.0|   635000|   835000|
|   10006|         3b3bh| 5755000.0|  5755000|  5755000|
|   10006|         3b4bh|1.4430

In [50]:
# Collect each report to the driver as a pandas DataFrame and write it as a single CSV file
f1.toPandas().to_csv("min.csv")
f2.toPandas().to_csv("max.csv")
f3.toPandas().to_csv("average.csv")
f.toPandas().to_csv("combined.csv")

### Submitted By:
* **Lakshmi V Aji         (20BDA09)**
* **Josmi Agnes Jose      (20BDA27)**
* **Aishwarya Nair M J    (20BDA42)**
* **Mariya Biju           (20BDA61)**
    