# 103088 - MASSIVE DATA PROCESSING

## P2A1

## Summary
1. [Prepare environment](#prepare-environment)
2. [Start working with Spark](#start-spark)
3. [Download dataset](#dataset)
4. [Preprocessing](#preprocessing)
5. [Questions](#questions)

<a name="prepare-environment"></a>
## Prepare environment
First, we are going to prepare the environment for running PySaprk in the Google Collab Machine (if you work directly in your computer, and you want to prepare it, read and follow champter 2 instructions)

In [45]:
from google.colab import drive
drive.mount('/content/drive')
!python /content/drive/MyDrive/colab/massive/install_pyspark.py

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Install JAVA 8
Obtaining last version of spark


  soup = BeautifulSoup(html_doc)
Getting version spark-3.5.1
Downloading https://downloads.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
Installing PySpark
Setting environment variables for JAVA_HOME and SPARK_HOME


In [46]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import year, col, count, to_date, desc, month, weekofyear
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder
from pyspark.ml.stat import Correlation
import wget

<a name="start-spark"></a>
## Start working with Spark
Now we now and understand how Spark appeared in our lives and more or less how it works (and you know, it's amazing 🤭), we can start to work with it.
As you now, the SparkSession is the way programmers "talk" with Spark. So, we need to inicialize that.

In [47]:
spark = (SparkSession
 .builder
 .appName("P2A1")
 .getOrCreate())

spark

<a name="dataset"></a>
## Download dataset

In [48]:
wget.download('https://github.com/databricks/LearningSparkV2/raw/master/chapter3/data/sf-fire-calls.csv')
fire_df = spark.read.csv('sf-fire-calls.csv', header = True)

In [49]:
fire_df.show()

+----------+------+--------------+----------------+----------+----------+--------------------+--------------------+--------------------+----+-------+---------+-----------+----+----------------+--------+-------------+-------+-------------+---------+--------------+--------------------------+----------------------+------------------+--------------------+--------------------+-------------+---------+
|CallNumber|UnitID|IncidentNumber|        CallType|  CallDate| WatchDate|CallFinalDisposition|       AvailableDtTm|             Address|City|Zipcode|Battalion|StationArea| Box|OriginalPriority|Priority|FinalPriority|ALSUnit|CallTypeGroup|NumAlarms|      UnitType|UnitSequenceInCallDispatch|FirePreventionDistrict|SupervisorDistrict|        Neighborhood|            Location|        RowID|    Delay|
+----------+------+--------------+----------------+----------+----------+--------------------+--------------------+--------------------+----+-------+---------+-----------+----+----------------+--------+

<a name="preprocessing"></a>
## Preprocessing

In [50]:
fire_df = fire_df.na.drop() #drop nulls
fire_df = fire_df.withColumn("CallDate", to_date(col("CallDate"), "dd/MM/yyyy")) #transform "CallDate" to date type

<a name="questions"></a>
## Questions

+ What were all the different types of fire calls in 2018?

In [51]:
df_types_fire = fire_df.filter(
    (col("CallType").contains("Fire")) &
    (year(col("CallDate")) == 2018)
)
distinct_fire_calls = df_types_fire.select("CallType").distinct()
# Show the results
distinct_fire_calls.show(truncate=False)

+--------------+
|CallType      |
+--------------+
|Vehicle Fire  |
|Outside Fire  |
|Structure Fire|
+--------------+



+ What months within the year 2018 saw the highest number of fire calls?

In [52]:
df_number_fire = fire_df.filter(
    (col("CallType").contains("Fire")) &
    (year(col("CallDate")) == 2018)
)
# Group by month and count the occurrences
fire_counts_by_month = df_number_fire.groupBy(month(col("CallDate")).alias("Month")).agg(count(col("CallType")).alias("Total Fire Calls"))
fire_counts_by_month = fire_counts_by_month.orderBy(desc("Total Fire Calls"))
# Show the results
fire_counts_by_month.show(truncate=False)

+-----+----------------+
|Month|Total Fire Calls|
+-----+----------------+
|1    |52              |
|7    |45              |
|6    |40              |
|5    |39              |
|3    |38              |
|4    |38              |
|2    |38              |
|9    |37              |
|8    |37              |
|12   |28              |
|10   |28              |
|11   |22              |
+-----+----------------+



+ Which neighborhood in San Francisco generated the most fire calls in 2018?

In [53]:
df_sf_fire = fire_df.filter(
    (col("City").contains("San Francisco")) & (col("CallType").contains("Fire")) & (year(col("CallDate")) == 2018)
)

# Group by month and count the occurrences
neighborhood_fire_counts = df_sf_fire.groupBy("Neighborhood").count()
most_fire_neighborhood = neighborhood_fire_counts.orderBy(col("count").desc())

# Show the result
most_fire_neighborhood.show(truncate=False)

+------------------------------+-----+
|Neighborhood                  |count|
+------------------------------+-----+
|Tenderloin                    |44   |
|Financial District/South Beach|40   |
|Bayview Hunters Point         |34   |
|Mission                       |32   |
|Western Addition              |21   |
|South of Market               |20   |
|Haight Ashbury                |15   |
|Sunset/Parkside               |15   |
|Bernal Heights                |14   |
|Outer Richmond                |13   |
|Potrero Hill                  |13   |
|Castro/Upper Market           |13   |
|Pacific Heights               |13   |
|North Beach                   |12   |
|Hayes Valley                  |12   |
|Nob Hill                      |12   |
|Russian Hill                  |12   |
|West of Twin Peaks            |11   |
|Marina                        |10   |
|Inner Richmond                |9    |
+------------------------------+-----+
only showing top 20 rows



+ Which neighborhoods had the worst response times to fire calls in 2018?

In [54]:
# Sort by response time (Delay) in descending order
df_delay = fire_df.orderBy(fire_df["Delay"].desc())

# Select top 3 neighborhoods
top_3_neighborhoods_with_delays = df_delay.select("Neighborhood", "Delay").limit(10)

# Show the results
top_3_neighborhoods_with_delays.show(truncate=False)

+---------------------+---------+
|Neighborhood         |Delay    |
+---------------------+---------+
|Tenderloin           |99.9     |
|Tenderloin           |97.8     |
|Bayview Hunters Point|95.416664|
|South of Market      |94.71667 |
|Chinatown            |931.45   |
|Bayview Hunters Point|92.816666|
|Mission              |92.51667 |
|Lakeshore            |92.28333 |
|Bayview Hunters Point|91.78333 |
|South of Market      |91.666664|
+---------------------+---------+



+ Which week in the year in 2018 had the most fire calls?


In [55]:
df_week_fire_calls = fire_df.filter(
    (col("CallType").contains("Fire")) &
    (col("CallDate").between("2018-01-01", "2018-12-31"))
)

# 3. Group by week and count occurrences
weekly_structure_fires = df_week_fire_calls.groupBy(weekofyear("CallDate").alias("Week")).count()

# 4. Find the week with the highest count
max_week = weekly_structure_fires.orderBy(col("count").desc()).first()
weekly_structure_fires.orderBy(col("count").desc()).show()
# Extract the week number and count
most_fires_week = max_week["Week"]
fire_count = max_week["count"]

print(f"The week {most_fires_week} in 2018 had the most \"Fire\" incidents with {fire_count} calls.")

+----+-----+
|Week|count|
+----+-----+
|   1|   37|
|  27|   30|
|   6|   29|
|  23|   27|
|  14|   26|
|  10|   24|
|  18|   24|
|  40|   23|
|  49|   22|
|  36|   20|
|  31|   19|
|  32|   18|
|  45|   16|
|  19|   15|
|   2|   15|
|   9|   14|
|  22|   13|
|  15|   11|
|  28|    9|
|   5|    9|
+----+-----+
only showing top 20 rows

The week 1 in 2018 had the most "Fire" incidents with 37 calls.


+ Is there a correlation between neighborhood, zip code, and number of fire calls?

In [56]:
# Filter for rows where "CallType" contains "Fire"
df_fire_calls = fire_df.filter(col("CallType").contains("Fire"))
# Group by "neighborhood" and "Zipcode" and count occurrences
fire_counts_by_location = df_fire_calls.groupBy("Neighborhood", "Zipcode").agg(count("CallType").alias("Total_Fire_Incidents"))

# Show the results
fire_counts_by_location.show(truncate=False)

indexer_neighborhood = StringIndexer(inputCol="Neighborhood", outputCol="NeighborhoodIndex")
indexer_zipcode = StringIndexer(inputCol="Zipcode", outputCol="ZipcodeIndex")

df_fire_calls = indexer_neighborhood.fit(fire_counts_by_location).transform(fire_counts_by_location)
df_fire_calls = indexer_zipcode.fit(df_fire_calls).transform(df_fire_calls)

input_columns = ["NeighborhoodIndex", "ZipcodeIndex", "Total_Fire_Incidents"]

# Assemble the feature vector
vector_assembler = VectorAssembler(inputCols=input_columns, outputCol="features")
vector_df = vector_assembler.transform(df_fire_calls)

# Calculate the correlation matrix
correlation_matrix = Correlation.corr(vector_df, "features").head()

# Print the correlation matrix
print("Correlation matrix:\n")
print(correlation_matrix[0])

+------------------------------+-------+--------------------+
|Neighborhood                  |Zipcode|Total_Fire_Incidents|
+------------------------------+-------+--------------------+
|Inner Sunset                  |94131  |18                  |
|Castro/Upper Market           |94110  |14                  |
|Mission Bay                   |94103  |16                  |
|Pacific Heights               |94123  |15                  |
|Nob Hill                      |94109  |228                 |
|Noe Valley                    |94110  |13                  |
|Financial District/South Beach|94103  |33                  |
|Lone Mountain/USF             |94118  |53                  |
|Castro/Upper Market           |94103  |9                   |
|Golden Gate Park              |94121  |3                   |
|Treasure Island               |94130  |63                  |
|Western Addition              |94102  |63                  |
|Presidio Heights              |94115  |24                  |
|Excelsi

+ How can we use Parquet files or SQL tables to store this data and read it back?

In [57]:
fire_df.write.parquet("/content/drive/MyDrive/colab/massive/test.parquet")