# Housing Market

### Introduction:

This time we will create our own dataset with fictional numbers to describe a house market. As we are going to create random data don't try to reason of the numbers.

### Step 1. Import the necessary libraries

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m317.0/317.0 MB[0m [31m1.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... [?25l[?25hdone
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=0ba2849e242ba97e5bdf148b8d9e037623b6ea6410c9ede75d60e1c8a21129ab
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [2]:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, FloatType
from pyspark.sql.functions import expr, col, mean, when, sum, count, desc, min, max
spark = SparkSession.builder.master("local[*]").getOrCreate()

### Step 2. Create 3 differents Series, each of length 100, as follows:
1. The first a random number from 1 to 4
2. The second a random number from 1 to 3
3. The third a random number from 10,000 to 30,000

In [17]:
df = spark.range(100)

# Create three different DataFrames each with a single column of random numbers
df1 = df.select((F.rand() * 3 + 1).cast("int").alias("random_numbers_1_to_4"), "id")
df2 = df.select((F.rand() * 2 + 1).cast("int").alias("random_numbers_1_to_3"), "id")
df3 = df.select((F.rand() * 20000 + 10000).cast("int").alias("random_numbers_10000_to_30000"), "id")


In [18]:
df1.show()
df2.show()
df3.show()

+---------------------+---+
|random_numbers_1_to_4| id|
+---------------------+---+
|                    2|  0|
|                    2|  1|
|                    1|  2|
|                    3|  3|
|                    1|  4|
|                    1|  5|
|                    3|  6|
|                    3|  7|
|                    3|  8|
|                    3|  9|
|                    3| 10|
|                    2| 11|
|                    3| 12|
|                    1| 13|
|                    3| 14|
|                    1| 15|
|                    1| 16|
|                    3| 17|
|                    3| 18|
|                    1| 19|
+---------------------+---+
only showing top 20 rows

+---------------------+---+
|random_numbers_1_to_3| id|
+---------------------+---+
|                    1|  0|
|                    1|  1|
|                    1|  2|
|                    2|  3|
|                    1|  4|
|                    1|  5|
|                    1|  6|
|                    1

### Step 3. Let's create a DataFrame by joinning the Series by column

In [19]:
# Merge the DataFrames on the id column
merged_df = df1.join(df2, "id").join(df3, "id")

# Drop the id column
merged_df = merged_df.drop("id")

# Show the merged DataFrame
merged_df.show()


+---------------------+---------------------+-----------------------------+
|random_numbers_1_to_4|random_numbers_1_to_3|random_numbers_10000_to_30000|
+---------------------+---------------------+-----------------------------+
|                    2|                    1|                        17840|
|                    2|                    1|                        13377|
|                    1|                    1|                        11016|
|                    3|                    2|                        17674|
|                    1|                    1|                        11791|
|                    1|                    1|                        25457|
|                    3|                    1|                        22851|
|                    3|                    1|                        14036|
|                    3|                    1|                        27640|
|                    3|                    2|                        24094|
|           

### Step 4. Change the name of the columns to bedrs, bathrs, price_sqr_meter

In [22]:
merged_df = merged_df.withColumnRenamed("random_numbers_1_to_4", "bedrs").withColumnRenamed("random_numbers_1_to_3", "bathrs").withColumnRenamed("random_numbers_10000_to_30000", "price_sqr_meter")

In [23]:
merged_df.show()

+-----+------+---------------+
|bedrs|bathrs|price_sqr_meter|
+-----+------+---------------+
|    2|     1|          17840|
|    2|     1|          13377|
|    1|     1|          11016|
|    3|     2|          17674|
|    1|     1|          11791|
|    1|     1|          25457|
|    3|     1|          22851|
|    3|     1|          14036|
|    3|     1|          27640|
|    3|     2|          24094|
|    3|     2|          29002|
|    2|     1|          28368|
|    3|     2|          23393|
|    1|     2|          15141|
|    3|     2|          28353|
|    1|     1|          11688|
|    1|     1|          22502|
|    3|     1|          13117|
|    3|     1|          24894|
|    1|     1|          10046|
+-----+------+---------------+
only showing top 20 rows



### Step 5. Create a one column DataFrame with the values of the 3 Series and assign it to 'bigcolumn'

In [28]:
big = df1.union(df2).union(df3)
big.show(300)

+---------------------+---+
|random_numbers_1_to_4| id|
+---------------------+---+
|                    2|  0|
|                    2|  1|
|                    1|  2|
|                    3|  3|
|                    1|  4|
|                    1|  5|
|                    3|  6|
|                    3|  7|
|                    3|  8|
|                    3|  9|
|                    3| 10|
|                    2| 11|
|                    3| 12|
|                    1| 13|
|                    3| 14|
|                    1| 15|
|                    1| 16|
|                    3| 17|
|                    3| 18|
|                    1| 19|
|                    2| 20|
|                    2| 21|
|                    1| 22|
|                    1| 23|
|                    1| 24|
|                    3| 25|
|                    3| 26|
|                    1| 27|
|                    3| 28|
|                    3| 29|
|                    2| 30|
|                    3| 31|
|                   

### Step 6. Oops, it seems it is going only until index 99. Is it true?

In [None]:
# NO! :)

### Step 7. Reindex the DataFrame so it goes from 0 to 299

In [30]:
big = big.coalesce(1)

In [31]:
big.show()

+---------------------+---+
|random_numbers_1_to_4| id|
+---------------------+---+
|                    2|  0|
|                    2|  1|
|                    1|  2|
|                    3|  3|
|                    1|  4|
|                    1|  5|
|                    3|  6|
|                    3|  7|
|                    3|  8|
|                    3|  9|
|                    3| 10|
|                    2| 11|
|                    3| 12|
|                    1| 13|
|                    3| 14|
|                    1| 15|
|                    1| 16|
|                    3| 17|
|                    3| 18|
|                    1| 19|
+---------------------+---+
only showing top 20 rows



In [32]:
big = big.withColumn("id", F.monotonically_increasing_id())
big.show(300)

+---------------------+---+
|random_numbers_1_to_4| id|
+---------------------+---+
|                    2|  0|
|                    2|  1|
|                    1|  2|
|                    3|  3|
|                    1|  4|
|                    1|  5|
|                    3|  6|
|                    3|  7|
|                    3|  8|
|                    3|  9|
|                    3| 10|
|                    2| 11|
|                    3| 12|
|                    1| 13|
|                    3| 14|
|                    1| 15|
|                    1| 16|
|                    3| 17|
|                    3| 18|
|                    1| 19|
|                    2| 20|
|                    2| 21|
|                    1| 22|
|                    1| 23|
|                    1| 24|
|                    3| 25|
|                    3| 26|
|                    1| 27|
|                    3| 28|
|                    3| 29|
|                    2| 30|
|                    3| 31|
|                   