# Functions

+ <a href="#functions">1. Built In Functions</a>
    + <a href="#string">String Functions</a>
    + <a href="#numeric">Numeric Functions</a>
    + <a href="#date">Date Functions</a>
+ <a href="#dates">2. Working with Dates</a>
+ <a href="#user">3. User Defined Functions</a>
+ <a href="#join">4. Working with Joins</a>
+ <a href="#challenges">5. Challenges</a>
----

# Set up

In [None]:
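from pyspark.sql.functions import col, lit, lower, upper, substring, min, max, date_add, date_sub, to_date, to_timestamp, lpad, dayofweek, date_format
spark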

------

<p id="functions"></p>

# 1) Built In Functions

-------

<p id="string"></p>

## String Functions

**NOTE: for `substring`, the position is 1-based, not 0-based as in Python and many other languages.**

In [None]:
# help(substring)

In [None]:
rc.select(
    lower(col("Primary Type")),
    upper(col("Primary Type")),
    substring(col("Primary Type"), 1, 4),
).show(5)
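
To make the 1-based note concrete, here is a small plain-Python sketch that needs no Spark session. `spark_style_substring` is a hypothetical helper, not part of PySpark, and it only mimics `substring(str, pos, len)` for `pos >= 1`.

```python
def spark_style_substring(text, pos, length):
    """Mimic Spark's substring(str, pos, len) for pos >= 1 (hypothetical helper)."""
    if pos < 1:
        raise ValueError("this sketch only covers pos >= 1")
    start = pos - 1  # shift the 1-based position to Python's 0-based index
    return text[start:start + length]


primary_type = "THEFT"
# Spark: substring(col, 1, 4) starts at the first character.
print(spark_style_substring(primary_type, 1, 4))  # THEF
# Python slicing reaches the same characters with a 0-based start.
print(primary_type[0:4])  # THEF
```

In other words, `substring(col("Primary Type"), 1, 4)` corresponds to the Python slice `[0:4]`.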

------

<p id="numeric"></p>

## Numeric Functions

### Show the oldest date and most recent date

In [None]:
rc.select(min(col("Date"))).show(1)

In [None]:
rc.select(max(col("Date"))).show(1)

-------

<p id="date"></p>

## Date Functions

### What is 3 days earlier than the oldest date and 3 days later than the most recent date?

In [None]:
help(date_add)

In [None]:
# 3 days earlier than the oldest date
rc.select(date_sub(min(col("Date")), 3)).show(1)

In [None]:
# 3 days later than the most recent date
rc.select(date_add(max(col("Date")), 3)).show(1)

------

<p id="dates"></p>

# 2) Working with Dates

### Parsing different strings to Date and Timestamp

### 2019-12-25 13:30:00

In [None]:
df = spark.createDataFrame(
    [("2019-12-25 13:30:00",)], ["Christmas"]
)  # value, column name

In [None]:
df.show(1)

In [None]:
# parse to date and timestamp
df.select(to_date(col("Christmas"), "yyyy-MM-dd HH:mm:ss")).show()
df.select(to_timestamp(col("Christmas"), "yyyy-MM-dd HH:mm:ss")).show()

### 25/Dec/2019 13:30:00

In [None]:
df = spark.createDataFrame([("25/Dec/2019 13:30:00",)], ["Christmas"])

In [None]:
df.show()

In [None]:
# parse to date and timestamp

In [None]:
df.select(to_date(col("Christmas"), "dd/MMM/yyyy HH:mm:ss")).show()

In [None]:
df.select(to_timestamp(col("Christmas"), "dd/MMM/yyyy HH:mm:ss")).show()

### 12/25/2019 01:30:00 PM

In [None]:
df = spark.createDataFrame([("12/25/2019 01:30:00 PM",)], ["Christmas"])
df.show(1, truncate=False)

In [None]:
# parse to date and timestamp

In [None]:
df.select(to_date(col("Christmas"), "MM/dd/yyyy hh:mm:ss a")).show()

In [None]:
df.select(to_timestamp(col("Christmas"), "MM/dd/yyyy hh:mm:ss a")).show()
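
If the Spark pattern letters are unfamiliar, it can help to compare them with Python's `strptime` directives. The sketch below parses the same three strings with the standard library only. The mapping is an approximate, hand-written correspondence (an assumption, not something Spark provides), and `%b` and `%p` depend on an English locale.

```python
from datetime import datetime

# Spark pattern -> (sample string, roughly equivalent strptime format)
SAMPLES = {
    "yyyy-MM-dd HH:mm:ss": ("2019-12-25 13:30:00", "%Y-%m-%d %H:%M:%S"),
    "dd/MMM/yyyy HH:mm:ss": ("25/Dec/2019 13:30:00", "%d/%b/%Y %H:%M:%S"),
    "MM/dd/yyyy hh:mm:ss a": ("12/25/2019 01:30:00 PM", "%m/%d/%Y %I:%M:%S %p"),
}

# Parse every sample with its strptime format.
parsed = {
    pattern: datetime.strptime(text, fmt)
    for pattern, (text, fmt) in SAMPLES.items()
}

for pattern, value in parsed.items():
    print(f"{pattern!r:26} -> {value}")
```

All three strings describe the same moment, so each entry should parse to 25 December 2019 at 13:30.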

------

In [None]:
new_rc = spark.read.csv("../data/reported-crimes.csv", header=True)

In [None]:
new_rc.show(2, truncate=False)

<p id="user"></p>

# 3) User Defined Functions

- User defined functions can be written in Java, Scala, Python, and R.
- For performance, write them in Java or Scala.
- Functions written in Java or Scala can still be called from Python.
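
As a minimal sketch of a Python UDF, the block below keeps the logic in a plain function and wraps it with `udf` only when it is applied. The names `crime_category`, `add_category_column`, and the `Category` column are hypothetical and chosen for illustration; the `Primary Type` column comes from the reported crimes data used in this notebook.

```python
def crime_category(primary_type):
    """Plain Python logic: return a coarse label for a Primary Type value."""
    if primary_type is None:
        return None
    if primary_type.strip().upper().startswith("NON"):
        return "non-criminal"
    return "criminal"


def add_category_column(df):
    """Wrap crime_category as a Python UDF and apply it to a DataFrame."""
    # Imported here so crime_category stays testable without a Spark install.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    category_udf = udf(crime_category, StringType())
    return df.withColumn("Category", category_udf(col("Primary Type")))
```

With the `rc` DataFrame loaded, `add_category_column(rc).select("Primary Type", "Category").show(5)` would show the new column. A Python UDF like this passes rows between the JVM and a Python worker, which is one reason built-in functions or Java/Scala UDFs tend to be faster.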

-------

<p id="join"></p>

# 4) Working with Joins

In [None]:
rc = (
    spark.read.csv("../data/reported-crimes.csv", header=True)
    .withColumn("Date", to_timestamp(col("Date"), "MM/dd/yyyy hh:mm:ss a"))
    .filter(col("Date") <= lit("2018-11-11"))
)
rc.show(2)

### Download the police stations dataset

In [None]:
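# !wget -O police-station.csv "https://data.cityofchicago.org/api/views/z8bn-74gv/rows.csv?accessType=DOWNLOAD"

# !ls -l

# load the downloaded file; `ps` is used in the join cells below
ps = spark.read.csv("police-station.csv", header=True)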

# Joins

### The reported crimes dataset has only district numbers. Add the district name by joining with the police stations dataset.

### Caching data
- The reported crimes dataset is very large, so we cache it to speed up the join.
- `cache()` is lazily evaluated, so we call an action (such as `count()`) right away to load the DataFrame into the cache as soon as possible.

In [None]:
rc.cache()
rc.count()

In [None]:
# get the distinct district values from the police stations dataset
ps.select(col("DISTRICT")).distinct().show(30)

In [None]:
# also get the distinct district values from the reported crimes dataset
rc.select(col("District")).distinct().show(30)

As the output shows, district numbers in the police stations dataset have no leading zeros, while district numbers in the reported crimes dataset do. We need to bring them into the same format before joining.

#### Use lpad on the police stations' district number so its format matches the reported crimes dataset

In [None]:
help(lpad)

In [None]:
ps.select(lpad(col("DISTRICT"), 3, "0")).show(5)
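
To see what the padding does to individual values, here is a plain-Python emulation. `lpad_like` is a hypothetical helper that approximates `lpad(col, len, pad)` for a single-character pad; the input strings are made-up examples, not rows from the dataset.

```python
def lpad_like(value, length, pad):
    """Rough emulation of Spark's lpad(col, len, pad) for a one-character pad."""
    padded = value.rjust(length, pad)  # add pad characters on the left
    return padded[:length]  # Spark shortens values longer than len


for district in ["1", "25", "1234"]:
    print(district, "->", lpad_like(district, 3, "0"))
```

Note the last case: a value longer than the target length is cut down to that length rather than left alone.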

#### Create a new column in the police stations dataset for the padded value

In [None]:
ps = ps.withColumn("Format_district", lpad(col("DISTRICT"), 3, "0"))

In [None]:
ps.show(5, truncate=False)

### Join with a left outer join

In [None]:
rc.join(ps, rc.District == ps.Format_district, "left_outer").show(2, truncate=False)
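
The behavior of a left outer join can be sketched on a few toy rows in plain Python. The rows and station names below are made up for illustration, and `left_outer_join` is a hypothetical helper that assumes each key appears at most once on the right side.

```python
crimes = [
    {"ID": 1, "District": "001"},
    {"ID": 2, "District": "025"},
    {"ID": 3, "District": "999"},  # no matching station, on purpose
]
stations = [
    {"Format_district": "001", "DISTRICT NAME": "Station A"},
    {"Format_district": "025", "DISTRICT NAME": "Station B"},
]

# Index the right side by its join key (assumes keys are unique).
stations_by_key = {s["Format_district"]: s for s in stations}


def left_outer_join(left_rows, right_index, left_key, right_columns):
    """Keep every left row; fill right columns with None when nothing matches."""
    joined = []
    for row in left_rows:
        match = right_index.get(row[left_key])
        extra = {c: (match[c] if match else None) for c in right_columns}
        joined.append({**row, **extra})
    return joined


result = left_outer_join(crimes, stations_by_key, "District", ["DISTRICT NAME"])
for row in result:
    print(row)
```

Every crime row survives the join, and the row without a matching station gets `None` in the station column, which Spark displays as `null`.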

#### The joined data looks a bit messy, so we clean it up by dropping unnecessary police station columns when joining

In [None]:
ps.columns

In [None]:
rc.join(ps, rc.District == ps.Format_district, "left_outer").drop(
    " ADDRESS",
    "CITY",
    "STATE",
    "ZIP",
    "WEBSITE",
    "PHONE",
    "FAX",
    "TTY",
    "X COORDINATE",
    "Y COORDINATE",
    "LATITUDE",
    "LONGITUDE",
    "LOCATION",
).show(2, truncate=False)

-------

<p id="challenges"></p>

# 5) Challenges

- What is the most frequently reported non-criminal activity?
- Which day of the week has the most crimes reported?

### What is the most frequently reported non-criminal activity?

In [None]:
rc.show(2, truncate=False)

In [None]:
# rc.filter(instr(col('Primary Type'), 'NON')).show()
rc.filter(col("Primary Type").like("%NON%")).groupBy("Description").count().sort(
    "count", ascending=False
).show(truncate=False)

**How to classify non criminal activity?**

First, we will check the `Primary Type` column.

In [None]:
rc.select(col("Primary Type")).distinct().count()

In [None]:
rc.select(col("Primary Type")).distinct().orderBy(col("Primary Type")).show(
    36, truncate=False
)

The result above shows three NON-CRIMINAL variants.

So we filter on those three values to get a new non-criminal dataset.

In [None]:
nc = rc.filter(
    (col("Primary Type") == "NON - CRIMINAL")
    | (col("Primary Type") == "NON-CRIMINAL")
    | (col("Primary Type") == "NON-CRIMINAL (SUBJECT SPECIFIED)")
)
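
The three `==` comparisons above can also be written with `Column.isin`, which may be easier to maintain if more variants turn up. This is a sketch of an alternative, not a change to the approach; `NON_CRIMINAL_TYPES` and `non_criminal` are hypothetical names.

```python
# The three spellings found in the Primary Type column.
NON_CRIMINAL_TYPES = (
    "NON - CRIMINAL",
    "NON-CRIMINAL",
    "NON-CRIMINAL (SUBJECT SPECIFIED)",
)


def non_criminal(df):
    """Return only the rows whose Primary Type is one of the three variants."""
    # Imported here so the constant above can be inspected without Spark.
    from pyspark.sql.functions import col

    return df.filter(col("Primary Type").isin(*NON_CRIMINAL_TYPES))
```

With the `rc` DataFrame loaded, `nc = non_criminal(rc)` should select the same rows as the chained `|` conditions.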

In [None]:
nc.show(5, truncate=False)

Then we group by the `Description` column.


In [None]:
nc.groupBy(col("Description")).count().orderBy("count", ascending=False).show(
    truncate=False
)

It seems that LOST PASSPORT is the most frequently reported non-criminal case.

-----------------------

### Which day of the week has the most crimes reported?

https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

In [None]:
help(dayofweek)

In [None]:
rc.show(1, truncate=False)

In [None]:
# get the day of the week
rc.select(col("Date"), dayofweek(col("Date"))).show(2)

Now we have the day of the week as an integer value, but we want to format it as a day name (Monday, Tuesday, etc.).
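
When reading those integers, keep in mind that Spark's `dayofweek` numbers the days from 1 (Sunday) to 7 (Saturday), which differs from Python's own conventions. The sketch below reproduces that numbering with the standard library; `spark_dayofweek` and `SPARK_DAY_NAMES` are hypothetical names used only for illustration.

```python
from datetime import date

# Spark's dayofweek: 1 = Sunday, 2 = Monday, ..., 7 = Saturday
SPARK_DAY_NAMES = {1: "Sun", 2: "Mon", 3: "Tue", 4: "Wed", 5: "Thu", 6: "Fri", 7: "Sat"}


def spark_dayofweek(d):
    """Return the number Spark's dayofweek would give for a date."""
    return d.isoweekday() % 7 + 1  # isoweekday: Monday = 1 ... Sunday = 7


christmas = date(2019, 12, 25)  # a Wednesday
number = spark_dayofweek(christmas)
print(number, SPARK_DAY_NAMES[number])  # 4 Wed
```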

In [None]:
help(date_format)

In [None]:
rc.select(col("Date"), dayofweek(col("Date")), date_format(col("Date"), "E")).show(5)

In [None]:
rc.groupBy(date_format(col("Date"), "E")).count().orderBy(
    "count", ascending=False
).show()

We can see that Friday has the most reported crimes. Perhaps this is because people tend to go out on Fridays.

### Plotting the results

In [None]:
# collect the row objects
results = rc.groupBy(date_format(col("Date"), "E")).count().orderBy("count").collect()

results

In [None]:
day_of_week = [row[0] for row in results]

day_of_week

In [None]:
reported_counts = [row[1] for row in results]

reported_counts

In [None]:
import pandas as pd

df = pd.DataFrame({"Day Of Week": day_of_week, "Count": reported_counts})

In [None]:
df.head()

In [None]:
df = df.sort_values(by="Count", ascending=False)
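
Sorting by count answers the question directly. If you would rather show the days in calendar order, one option is to sort the (day, count) pairs by a fixed day ranking. This sketch assumes the `"E"` pattern produced three-letter English abbreviations; `CALENDAR_ORDER` and `in_calendar_order` are hypothetical names, and the counts below are placeholders, not results from the dataset.

```python
CALENDAR_ORDER = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def in_calendar_order(pairs):
    """Sort (day, count) pairs from Monday to Sunday."""
    rank = {day: i for i, day in enumerate(CALENDAR_ORDER)}
    return sorted(pairs, key=lambda pair: rank[pair[0]])


# Placeholder counts, NOT real results from the dataset
example = [("Fri", 3), ("Sun", 1), ("Mon", 2)]
print(in_calendar_order(example))  # [('Mon', 2), ('Fri', 3), ('Sun', 1)]
```

With the lists built earlier, `in_calendar_order(list(zip(day_of_week, reported_counts)))` would give the pairs in weekday order.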

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

base_color = sns.color_palette()[0]

plt.figure(figsize=(10, 5), dpi=150)
sns.barplot(data=df, x="Day Of Week", y="Count", color=base_color)

plt.ylabel("Number of Reported Crimes")
plt.title("Which day of the week has the most crimes reported? (2001 to present)");