# MASSIVE DATA PROCESSING

## BIG DATA PROJECT

### IDEALISTA - TOP N CITIES BY DESIRED PROPERTY

#### Group:
Daniel Vieira Cordeiro

Guillem Orellana Trullols

Marc Viladegut Abert



#### Coordination:
CORES PRADO, FERNANDO

MATEU PIÑOL, CARLOS


#### Description:

Using the previously gathered and cleaned dataset of properties listed on the idealista website, we created this notebook with Apache Spark and Python. It filters properties by:

- Maximum rent price the user is willing to pay
- Minimum number of rooms the user seeks in their new home
- Whether the announcement has photos of the property (expressed as a minimum number of photos)
- The property type (flat, house, etc.)

Once the data is filtered, it generates a list of the top N cities, ranked by the number of properties that match the requirements.
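The filter-then-rank logic described above can be sketched in plain Python 3, without Spark, on a handful of rows. This is only an illustration: the sample rows, the reduced column set, the `city` column name, and the `top_n_cities` function are assumptions made for the example and are not part of the notebook. It uses the standard `csv` module so that quoted addresses containing commas stay in a single field.

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical sample rows, reduced to the columns the filters need
SAMPLE_CSV = """numPhotos,price,propertyType,rooms,address,city
18,1050.0,flat,1,"Calle de Sants, 208",Barcelona
10,900.0,flat,4,"Carrer Major, 1",Lleida
5,1200.0,flat,5,"Gran Via, 20",Barcelona
0,800.0,house,4,"Plaza Mayor, 3",Madrid
7,950.0,flat,4,"Rambla, 9",Barcelona
"""

def top_n_cities(csv_text, max_price, min_rooms, min_photos, house_type, n):
    """Count matching properties per city and return the n largest counts."""
    rows = csv.DictReader(StringIO(csv_text))
    counts = Counter(
        row["city"]
        for row in rows
        if int(row["numPhotos"]) >= min_photos
        and float(row["price"]) < max_price
        and row["propertyType"] == house_type
        and int(row["rooms"]) >= min_rooms
    )
    return counts.most_common(n)

result = top_n_cities(SAMPLE_CSV, max_price=1500.0, min_rooms=4,
                      min_photos=1, house_type="flat", n=2)
print(result)  # [('Barcelona', 2), ('Lleida', 1)]
```

The Spark version follows the same shape: a `filter` applies the four conditions, `reduceByKey` plays the role of the `Counter`, and `takeOrdered` plays the role of `most_common`.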

In [1]:
# Import useful libraries
import pyspark, os, shutil, string, json, time

begin = time.time()

# Creating Spark Context
sc = pyspark.SparkContext('local[*]')
print(sc)

# Removing the output directory if it already exists
# (os.path.exists and shutil.rmtree do not expand wildcards such as *.json)
if os.path.exists("Output_Idealista"):
    shutil.rmtree("Output_Idealista")
    
# Reading the cleaned Idealista properties file
clean_data = sc.textFile("Input_Idealista/*.csv")
print("Header:")
print(clean_data.first())

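# Splitting each line into its fields on ","
# Note: a plain split also breaks quoted fields that contain commas (such as the address),
# so columns located after the address can shift in those rows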
data_header = clean_data.map(lambda line: line.split(","))

# Separating Header from data
header = data_header.first()
data = data_header.filter(lambda row: row != header)

print("\nInitial Data:")
print(data.take(3))

# Selecting columns: numPhotos (2), price (4), propertyType (5), rooms (9) and spain_state (last)
simple_data = data.map(lambda r: (r[2], r[4], r[5], r[9], r[-1]))

print("\nInitial Data - Important Columns Only:")
print(simple_data.take(3))

# Setting up filters
maxPrice = 1000000.0
minRooms = 4
minPhotos = 0
houseType = "flat"

# Filtering Data
filtered = simple_data.filter(lambda r: int(r[0]) >= minPhotos and float(r[1]) < maxPrice and r[2] == houseType and int(r[3]) >= minRooms)

print("\nFiltered Data:")
print(filtered.take(3))

# Defining number of Top Cities
N = 8

# Keeping only the City info and a value = 1
cities = filtered.map(lambda h: (h[4],1))

# Printing the number of matching properties (one (city, 1) pair each) and the number of distinct cities
print("\nTotal number of matching properties:  " + str(cities.count()))
print("\nTotal number of Distinct Cities:  " + str(cities.distinct().count()))

# Counting the matching properties per city
property_count = cities.reduceByKey(lambda a, b: a + b)
print(property_count.take(N))

# Writing the per-city counts as text (Spark creates a directory of part files at this path)
property_count.saveAsTextFile("Output_Idealista/properties_by_cities.json")

# Taking the N cities with the most matching properties, in decreasing order
properties_ordered = property_count.takeOrdered(N, key = lambda x: -x[1])
print("\nTop Cities with matching properties: \n")
print(properties_ordered)

# Converting the resulting Python list back into an RDD
parallel_properties_ordered = sc.parallelize(properties_ordered)

# Use the map function to write one element per line and write all elements to a single file (coalesce)
parallel_properties_ordered.coalesce(1).map(lambda row: str(row[0]) + " " + str(row[1])).saveAsTextFile("Output_Idealista/propertiesTopN.json")

print("\nExecution Time (s): " +  str(time.time() - begin))

<SparkContext master=local[*] appName=pyspark-shell>
Header:
propertyCode,thumbnail,numPhotos,floor,price,propertyType,operation,size,exterior,rooms,bathrooms,address,province,municipality,district,country,neighborhood,latitude,longitude,showAddress,url,distance,hasVideo,status,newDevelopment,hasLift,priceByArea,hasPlan,has3DTour,has360,topNewDevelopment,detailedType_typology,suggestedTexts_subtitle,suggestedTexts_title,parkingSpace_hasParkingSpace,parkingSpace_isParkingSpaceIncludedInPrice,detailedType_subTypology,parkingSpace_parkingSpacePrice,spain_state

Initial Data:
[[u'88707408', u'https://img3.idealista.com/blur/WEB_LISTING/0/id.pro.es.image.master/38/94/71/753258284.jpg', u'18', u'3', u'1050.0', u'flat', u'rent', u'50.0', u'True', u'1', u'1', u'"Calle de Sants', u' 208"', u'Barcelona', u'Barcelona', u'Sants-Montju\xefc', u'es', u'Sants', u'41.3756551', u'2.1312542', u'True', u'https://www.idealista.com/inmueble/88707408/', u'4876', u'False', u'good', u'False', u'True', u'21.0'