# Wrangle Exercises
Data Acquisition
These exercises should go in a notebook or script named ```wrangle```. Add, commit, and push your changes.

This exercise uses the ```cases```, ```dept```, and ```source``` tables from the ```311_data``` database on the Codeup MySQL server.

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import pandas as pd

# The SparkSession is where you would specify the JDBC driver and additional connection details.
# We'll use pd.read_sql to simplify here so we can focus on the Spark API and not the IT setup.
# When using Spark on the job, you'll work with the operations team to install the right Java drivers and configure your connection
spark = SparkSession.builder.getOrCreate()

# ------------- #
# Local Imports #
# ------------- #

# importing sys
import sys

# adding 00_helper_files to the system path
sys.path.insert(0, '/Users/qmcbt/codeup-data-science/00_helper_files')

# env containing sensitive access credentials
import env
from env import user, password, host
from env import get_db_url

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/01/22 12:49:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/01/22 12:49:12 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/01/22 12:49:12 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
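
The ```env``` module is not shown in this notebook, so the helper below is only a guess at what ```get_db_url``` might look like. It is a minimal sketch, assuming the function builds a SQLAlchemy-style ```mysql+pymysql``` URL that ```pd.read_sql``` can accept (which would also require SQLAlchemy and PyMySQL to be installed). The explicit ```user```, ```password```, and ```host``` parameters and the example values are hypothetical; the notebook's version takes only the database name and presumably reads the credentials from ```env``` itself.

```python
from urllib.parse import quote_plus


def get_db_url(database, user, password, host):
    """Build a SQLAlchemy-style MySQL URL (assumed format, not the author's code)."""
    # quote_plus escapes characters such as '@' or '/' that would otherwise break the URL
    return f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


# Hypothetical credentials, for illustration only
url = get_db_url("311_data", user="someone", password="p@ss", host="db.example.com")
print(url)
```

Keeping the credentials in a separate, git-ignored module means the notebook itself can be committed and pushed without exposing them.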


## 1. Read the ```cases```, ```dept```, and ```source``` data into their own Spark dataframes.

In [7]:
# Read in source to a spark dataframe
query = """SELECT * FROM source"""
url = get_db_url("311_data")
source_df = pd.read_sql(query, url)
source_df = spark.createDataFrame(source_df)
source_df.show(5)

+-----+---------+----------------+
|index|source_id| source_username|
+-----+---------+----------------+
|    0|   100137|Merlene Blodgett|
|    1|   103582|     Carmen Cura|
|    2|   106463| Richard Sanchez|
|    3|   119403|  Betty De Hoyos|
|    4|   119555|  Socorro Quiara|
+-----+---------+----------------+
only showing top 5 rows



In [8]:
# Read in cases to a spark dataframe
query = """SELECT * FROM cases"""
url = get_db_url("311_data")
cases_df = pd.read_sql(query, url)
cases_df = spark.createDataFrame(cases_df)
cases_df.show(5)

23/01/22 13:00:59 WARN TaskSetManager: Stage 3 contains a task of very large size (18865 KiB). The maximum recommended task size is 1000 KiB.


[Stage 3:>                                                          (0 + 1) / 1]

23/01/22 13:01:04 WARN PythonRunner: Detected deadlock while completing task 0.0 in stage 3 (TID 3): Attempting to kill Python Worker
+----------+----------------+----------------+------------+---------+-------------+-----------+----------------+--------------------+-----------+-----------+---------+--------------------+----------------+
|   case_id|case_opened_date|case_closed_date|SLA_due_date|case_late|num_days_late|case_closed|   dept_division|service_request_type|   SLA_days|case_status|source_id|     request_address|council_district|
+----------+----------------+----------------+------------+---------+-------------+-----------+----------------+--------------------+-----------+-----------+---------+--------------------+----------------+
|1014127332|     1/1/18 0:42|    1/1/18 12:29|9/26/20 0:42|       NO| -998.5087616|        YES|Field Operations|        Stray Animal|      999.0|     Closed| svcCRMLS|2315  EL PASO ST,...|               5|
|1014127333|     1/1/18 0:46|     1/3/18 8

                                                                                

In [9]:
# Read in dept to a spark dataframe
query = """SELECT * FROM dept"""
url = get_db_url("311_data")
dept_df = pd.read_sql(query, url)
dept_df = spark.createDataFrame(dept_df)
dept_df.show(5)

+--------------------+--------------------+----------------------+-------------------+
|       dept_division|           dept_name|standardized_dept_name|dept_subject_to_SLA|
+--------------------+--------------------+----------------------+-------------------+
|     311 Call Center|    Customer Service|      Customer Service|                YES|
|               Brush|Solid Waste Manag...|           Solid Waste|                YES|
|     Clean and Green|Parks and Recreation|    Parks & Recreation|                YES|
|Clean and Green N...|Parks and Recreation|    Parks & Recreation|                YES|
|    Code Enforcement|Code Enforcement ...|  DSD/Code Enforcement|                YES|
+--------------------+--------------------+----------------------+-------------------+
only showing top 5 rows



## 2. Let's see how writing to the local disk works in spark:

* Write the code necessary to store the source data in both csv and json format. Store these as ```sources_csv``` and ```sources_json```.
### ANSWER:

In [13]:
# Write the source data as JSON under data/source_json, overwriting any existing output
# (Spark writes a folder of part files, not a single .json file)
source_df.write.json("data/source_json", mode="overwrite")

# Write the source data as CSV with a header row under data/source_csv, overwriting any existing output
(
    source_df.write.format("csv")
    .mode("overwrite")
    .option("header", "true")
    .save("data/source_csv")
)

                                                                                

In [17]:
!ls -a

.                         .ipynb_checkpoints        data
..                        NOTES_spark.ipynb         spark-wrangle.ipynb
.git                      NOTES_spark_wrangle.ipynb spark101.ipynb
.gitignore                README.md


In [16]:
!ls data

mpg_csv     mpg_json    source_csv  source_json


* Inspect your folder structure. What do you notice?  
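### ANSWER: Both outputs land in the ```data``` folder because the paths passed to the writer start with ```data/```; Spark creates that folder if it does not already exist (```createDataFrame``` itself writes nothing to disk). Note that ```source_csv``` and ```source_json``` are themselves folders rather than single files: Spark writes one part file per partition, plus a ```_SUCCESS``` marker, into each.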

## 3. Inspect the data in your dataframes. Are the data types appropriate? Write the code necessary to cast the values to the appropriate types.

In [10]:
# The .schema attribute shows the data types that Spark has inferred from the source
source_df.schema

StructType([StructField('index', LongType(), True), StructField('source_id', StringType(), True), StructField('source_username', StringType(), True)])

In [11]:
# The .schema attribute shows the data types that Spark has inferred from the source
cases_df.schema

StructType([StructField('case_id', LongType(), True), StructField('case_opened_date', StringType(), True), StructField('case_closed_date', StringType(), True), StructField('SLA_due_date', StringType(), True), StructField('case_late', StringType(), True), StructField('num_days_late', DoubleType(), True), StructField('case_closed', StringType(), True), StructField('dept_division', StringType(), True), StructField('service_request_type', StringType(), True), StructField('SLA_days', DoubleType(), True), StructField('case_status', StringType(), True), StructField('source_id', StringType(), True), StructField('request_address', StringType(), True), StructField('council_district', LongType(), True)])

In [12]:
# The .schema attribute shows the data types that Spark has inferred from the source
dept_df.schema

StructType([StructField('dept_division', StringType(), True), StructField('dept_name', StringType(), True), StructField('standardized_dept_name', StringType(), True), StructField('dept_subject_to_SLA', StringType(), True)])

### 1. How old is the latest (in terms of days past SLA) currently open issue? How long has the oldest (in terms of days since opened) currently open issue been open?

### 2. How many Stray Animal cases are there?

### 3. How many service requests that are assigned to the Field Operations department (```dept_division```) are not classified as "Officer Standby" request type (```service_request_type```)?

### 4. Convert the ```council_district``` column to a ```string``` column.

### 5. Extract the year from the ```case_closed_date``` column.

### 6. Convert ```num_days_late``` from days to hours in a new column ```num_hours_late```.

### 7. Join the case data with the source and department data.

### 8. Are there any cases that do not have a request source?

### 9. What are the top 10 service request types in terms of number of requests?

### 10. What are the top 10 service request types in terms of average days late?

### 11. Does number of days late depend on department?

### 12. How does the number of days late depend on department and request type?

## You might have noticed that the latest date in the dataset is fairly far off from the present day. To account for this, replace any occurrences of the current time with the maximum date from the dataset.