# Exercises

## 11.01_GettingStarted

1. Create a SparkSession that connects to Spark in local mode.  Configure the SparkSession to use two cores.   
`from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[2]") \
    .appName("connect_solutions") \
    .getOrCreate()`

2. Create a small DataFrame  
`courses = spark \
    .createDataFrame([("Linear Algebra", "Mathematics", 3302), ("Calculus III", "Mathematics", 2302), \
    ("Business Statistics", "Business", 2305)], \
    schema = ["CourseName", "Department", "CourseID"])`

3. Print the schema of the dataframe  
`courses.printSchema()`

4. View the dataframe  
`courses.show()`

5. Count the number of records  
`courses.count()`

6. Stop the SparkSession  
`spark.stop()`  

## 11.03_Inspect

1. Read the 311 case data into a Spark DataFrame.

> `df2 = spark.read.csv("/sa311/case.csv", header=True, inferSchema=True)`

2. Inspect the DataFrame.  Are the data types for each column appropriate?

> `import pandas as pd
df2.dtypes
df2.columns
df2.schema
df2.count()
len(df2.columns)
df2.describe().show()
pd.options.display.html.table_schema = True
df2.describe().toPandas()`

3. Inspect various columns of the case DataFrame.  Are there any issues with the data?

> `df2.createOrReplaceTempView("df2_temp")
spark.sql("select cat, count(*) from df2_temp group by cat").show()`

- Here's a routine to inspect each column in turn:

> `def inspect_dataframe(df):
  print("====")
  print("====")
  print("INSPECT DATAFRAME")
  print("====")
  df.printSchema()
  from pyspark.sql.functions import count, countDistinct
  for c in df.columns:
    print("====")
    print("Inspecting column: " + c)
    print("====")
    df.select(c).printSchema()
    df.select(c).show(5)
    df.select(c).describe().show()
    df.select(count("*").alias("N"), count(c), countDistinct(c)).show()`

- Begin the column notes (these lines continue the body of `inspect_dataframe`).  Copy and paste the output into a script to start a report.
  
> `print("")
  print("# ## Observations:")
  for c in df.columns:
    print("# * " + c + ":")`
    
- Invoke the routine:

> `inspect_dataframe(df2)`
    
- Observations:

- Another routine, just to consider categorical variables:

> `def inspect_categorical_variables(df, column_list):
  print("====")
  print("====")
  print("INSPECT CATEGORICAL VARIABLES")
  print("====")
  for c in column_list:
    print("====")
    print("Inspecting column: " + c)
    print("====")
    n = df.select(c).distinct().count()
    print("Number of distinct values: " + str(n))
    df.groupBy(c).count().orderBy(c).show(20)`
    
- Begin the column notes (these lines continue the body of `inspect_categorical_variables`).  Copy and paste the output into a script to start a report.
  
> `print("")
  print("# ## Observations:")
  for c in column_list:
    print("# * " + c + ":")`
        
- Invoke the routine with variables of interest:    

> `df2_categorical_columns = \
  ["case_late", "case_closed", "dept_division", "service_request_type", "case_status"]
inspect_categorical_variables(df2, df2_categorical_columns)`

- Observations:

> - Further inspection with `orderBy("count").show(200)` shows many values with just one instance.
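
To make that observation concrete, here is a minimal plain-Python sketch (no Spark required) of what the categorical inspection computes: the number of distinct values, the per-value counts, and the values that occur only once. The sample values are made up for illustration and are not taken from the 311 data.

```python
from collections import Counter

# Hypothetical sample of one categorical column (illustrative values only).
values = ["Stray Animal", "Pothole", "Stray Animal", "Graffiti", "Pothole", "Stray Animal"]

counts = Counter(values)

# In the spirit of df.select(c).distinct().count()
n_distinct = len(counts)

# In the spirit of df.groupBy(c).count().orderBy("count")
by_count = sorted(counts.items(), key=lambda kv: kv[1])

# Values with a single instance
singletons = [value for value, n in by_count if n == 1]

print(n_distinct)   # 3
print(by_count)     # [('Graffiti', 1), ('Pothole', 2), ('Stray Animal', 3)]
print(singletons)   # ['Graffiti']
```

On the real data, a long list of singletons usually suggests free-text or near-duplicate categories that may need cleaning before modeling.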

## 11.04.01_Prepare_part1


1. Read the raw case data into a Spark DataFrame.

> `data = "sa311/case.csv"
df2 = spark.read.csv(data, header=True, inferSchema=True)
df2.printSchema()
df2.dtypes
df2.columns
len(df2.columns)
df2.schema
df2.count()`

> `import pandas as pd
pd.options.display.html.table_schema = True
df2.limit(5).toPandas()`

2. How old is the latest (in terms of days past SLA) currently open issue?  How long has the oldest (in terms of days since opened) currently open issue been open?

- Latest (SLA) Issue:  

> `from pyspark.sql.functions import col, current_timestamp, datediff
df2 \
  .select('case_opened_date', 'case_late', 'case_closed', 'num_days_late') \
  .filter(df2.case_late == "YES").where(df2.case_closed == "NO") \
  .orderBy(df2.num_days_late.desc()) \
  .limit(1) \
  .withColumn("age_in_days", datediff(current_timestamp(), col("case_opened_date"))) \
  .show(1)`
  
- Oldest Issue: 

> `from pyspark.sql.functions import col, current_timestamp, datediff
df2 \
  .select('case_opened_date', 'case_closed') \
  .filter(df2.case_closed == "NO") \
  .orderBy(df2.case_opened_date.asc()) \
  .limit(1) \
  .withColumn("age_in_days", datediff(current_timestamp(), col("case_opened_date"))) \
  .show(1)`
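
As a sanity check on the `age_in_days` column, the sketch below reproduces the `datediff` arithmetic in plain Python. It assumes `datediff(end, start)` returns a whole number of days, so dividing by 365 is only appropriate when the age is wanted in years. The dates are hypothetical.

```python
from datetime import date

def age_in_days(opened: date, today: date) -> int:
    """Whole days between two dates, mirroring datediff(today, opened)."""
    return (today - opened).days

opened = date(2017, 1, 1)   # hypothetical case_opened_date
today = date(2018, 1, 1)    # hypothetical "current" date

days = age_in_days(opened, today)
years = days / 365          # divide only when the age is wanted in years

print(days, years)          # 365 1.0
```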
  
  
3. How many Stray Animal cases are there?

> `df2 \
  .select('service_request_type') \
  .filter(df2.service_request_type == 'Stray Animal') \
  .count()`

4. How many service requests that are assigned to the Field Operations department (dept_division) are not classified as "Officer Standby" request type (service_request_type)? 

> `df2 \
  .select('dept_division', 'service_request_type') \
  .filter(df2.dept_division == 'Field Operations') \
  .where(df2.service_request_type != 'Officer Standby') \
  .count()`

5. Create a new DataFrame without any information related to dates or location. 

> `df2.drop('case_opened_date', 'case_closed_date', 'SLA_due_date', 'request_address', 'council_district').show(25)`

6. Read dept.csv into a Spark DataFrame.  Inspect the `dept_name` column.  Replace the missing values with "other" as a standard filler. 

> `data = "/sa311/dept.csv"
df3 = spark.read.csv(data, header=True, inferSchema=True)
df3.select("dept_name").show()
df3_fixed = df3.fillna("other", "dept_name")`


## 11.04.02_Prepare_part2


> `from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("transform2_answer").getOrCreate()
data = "/sa311/case"
df = spark.read.csv(data, header=True, inferSchema=True)`


1. Convert the `df.council_district` column to a string column.

> `from pyspark.sql.functions import format_string
df = df.withColumn("council_district", format_string("%012d", "council_district"))`

2. Extract the year from the `df.case_closed_date` column.

> `from pyspark.sql.functions import year
df = df.withColumn("case_closed_year", year("case_closed_date"))`

3. Convert `df.num_days_late` from days to hours in a new column `df.num_hours_late`.

> `from pyspark.sql.functions import col, round
df = df.withColumn("num_hours_late", round(col("num_days_late")*24, 0))`

4. Convert the `df.case_late` column to a boolean column.

> `from pyspark.sql.functions import col
df = df.withColumn("case_late", col("case_late")=="YES")`

5. Convert the `df.SLA_days` column to a double column.

> `from pyspark.sql.functions import col
df = df.withColumn("SLA_days", col("SLA_days") * 1.0)`


- All together: 

> `from pyspark.sql.functions import format_string, year, col, round
df = df\
  .withColumn("council_district", format_string("%012d", "council_district"))\
  .withColumn("case_closed_year", year("case_closed_date"))\
  .withColumn("num_hours_late", round(col("num_days_late")*24, 0))\
  .withColumn("case_late", col("case_late")=="YES")\
  .withColumn("SLA_days", col("SLA_days") * 1.0)`
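
The five conversions can be checked on a single record without Spark. The sketch below applies a plain-Python counterpart of each step to a hypothetical row. The field values are invented, and `format_string` is assumed to follow printf-style formatting in the same way as Python's `%` operator.

```python
from datetime import datetime

# Hypothetical record using the column names above (values invented).
row = {
    "council_district": 5,
    "case_closed_date": datetime(2018, 1, 1, 12, 29),
    "num_days_late": -998.5,
    "case_late": "NO",
    "SLA_days": 999,
}

converted = {
    "council_district": "%012d" % row["council_district"],   # zero-padded string
    "case_closed_year": row["case_closed_date"].year,        # year extraction
    "num_hours_late": round(row["num_days_late"] * 24, 0),   # days -> hours
    "case_late": row["case_late"] == "YES",                  # string -> boolean
    "SLA_days": row["SLA_days"] * 1.0,                       # integer -> float
}

print(converted["council_district"])   # 000000000005
print(converted["num_hours_late"])     # -23964.0
```

Note that `"%012d"` pads the district to twelve digits. If padding is not wanted, `col("council_district").cast("string")` is a more direct way to get a string column.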


## 11.04.03_Prepare_part3


1. Create a DataFrame with all combinations of council_district and service_request_type (regardless of whether the combination is observed in the data).

- Generate a DataFrame with all council_district:

> `council_district_df = df.select("council_district").distinct()`
  
- Generate a DataFrame with all service_request_type:

> `service_request_type_df = df.select("service_request_type").distinct()`

- Use the `crossJoin` method to generate all combinations:

> `combinations = council_district_df.crossJoin(service_request_type_df)
combinations.orderBy("council_district", "service_request_type").show()`

- This is not bulletproof.  A council_district or service_request_type that never appears in the data will not appear in any combination.
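
The idea behind `crossJoin` can be illustrated with `itertools.product` in plain Python. In this sketch the observed pairs are hypothetical. It shows that pairs never observed together still appear in the result, while a value that never occurs in the data at all cannot.

```python
from itertools import product

# Hypothetical observed (council_district, service_request_type) pairs.
observed = [(1, "Pothole"), (1, "Stray Animal"), (2, "Pothole")]

# Like select(...).distinct() on each column
districts = sorted({d for d, _ in observed})
request_types = sorted({t for _, t in observed})

# Like council_district_df.crossJoin(service_request_type_df)
combinations = list(product(districts, request_types))

# Pairs that exist only because of the cross join
unobserved = [pair for pair in combinations if pair not in observed]

print(len(combinations))   # 4
print(unobserved)          # [(2, 'Stray Animal')]
```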

2. Join the case data with the source and department data.

- Since we want all the case data, we will use a sequence of left outer joins.  Here `df2` is assumed to hold the department data and `df3` the source data:

> `data3 = "/sa311/source"
df3 = spark.read.csv(data3, header=True, inferSchema=True)
df3.show()`

> `joined = df \
      .join(df2, df.dept_division == df2.dept_division, "left_outer") \
      .join(df3, df.source_id == df3.source_id, "left_outer")
joined.printSchema()`

- We might want to rename some columns before joining the data and remove the duplicate ID columns after joining the data to make this DataFrame more usable.  

3. Are there any cases that do not have a request source?

- A solution using joins:

> `case_df_no_source = df.join(df3, df.source_id == df3.source_id, "left_anti")
case_df_no_source.count()
case_df_no_source.select("source_id").orderBy("source_id").show()`

- A solution using set operations:

> `case_df_no_source2 = df.select("source_id").subtract(df3.select("source_id"))
case_df_no_source2.count()
case_df_no_source2.orderBy("source_id").show()`
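
The two solutions can report different counts. A `left_anti` join keeps one row per unmatched case, whereas `subtract` behaves like SQL `EXCEPT DISTINCT` and returns each unmatched value once. The plain-Python sketch below mimics both on hypothetical `source_id` values.

```python
# Hypothetical source_id values (invented for illustration).
case_source_ids = ["100137", "100137", "139344", "999999", "999999"]
known_source_ids = {"100137", "139344"}

# Like a left_anti join: keep every case row whose source_id has no match.
anti_join = [s for s in case_source_ids if s not in known_source_ids]

# Like subtract: a set difference, so duplicates collapse.
subtracted = sorted(set(case_source_ids) - known_source_ids)

print(len(anti_join))    # 2 (number of cases without a source)
print(len(subtracted))   # 1 (number of distinct unmatched source_id values)
```

So the join answers "how many cases have no source", and the set operation answers "which source IDs are unknown".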

## 11.04.04_Prepare_part4


1. What are the top 10 service request types in terms of number of requests?

> `from pyspark.sql.functions import count
df.groupBy("service_request_type")\
  .agg(count("case_id").alias("count"))\
  .orderBy("count", ascending=False)\
  .show(10)`

2. What are the top 10 service request types in terms of average days late?

> `from pyspark.sql.functions import avg
df.groupBy("service_request_type")\
  .agg(avg("num_days_late").alias("average_days_late"))\
  .orderBy("average_days_late", ascending=False)\
  .show(10)`

3. Does the number of days late depend on department?

> `from pyspark.sql.functions import avg, format_number
df.groupBy("dept_division")\
  .agg(avg("num_days_late").alias("avg_days_late"))\
  .orderBy("avg_days_late")\
  .withColumn("avg_days_late", format_number("avg_days_late", 2))\
  .show(30)`
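
`format_number` returns a string, so where the `orderBy` sits relative to it matters: sorting the formatted column compares text rather than numbers. The plain-Python sketch below shows the difference on hypothetical averages, using `format(x, ",.2f")` as a stand-in for `format_number(x, 2)`.

```python
averages = [9.5, 10.0, 1234.5]   # hypothetical average days late

# Two decimals with comma separators, similar to format_number(x, 2)
formatted = [format(x, ",.2f") for x in averages]

# Sorting the formatted strings compares characters, not magnitudes.
sorted_as_text = sorted(formatted)

# Sorting the numbers first and formatting afterwards keeps numeric order.
sorted_as_numbers = [format(x, ",.2f") for x in sorted(averages)]

print(sorted_as_text)      # ['1,234.50', '10.00', '9.50']
print(sorted_as_numbers)   # ['9.50', '10.00', '1,234.50']
```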

4. How does the number of days late depend on department division and request type?

> `df.groupBy("service_request_type", "dept_division")\
  .agg(avg("num_days_late"))\
  .orderBy("service_request_type", "dept_division")\
  .show()`
  
- or:

> `df.groupBy("service_request_type")\
  .pivot("dept_division")\
  .agg(avg("num_days_late").alias("avg_days_late"))\
  .show()`
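
Both answers hold the same averages in different shapes: the two-column `groupBy` is the long form, and `pivot` is the wide form with one column per division. A minimal plain-Python sketch of the reshaping, on invented records and division names, might look like this:

```python
from collections import defaultdict

# Hypothetical (service_request_type, dept_division, num_days_late) records.
records = [
    ("Pothole", "Streets", 2.0),
    ("Pothole", "Streets", 4.0),
    ("Pothole", "Field Operations", 1.0),
    ("Stray Animal", "Field Operations", -3.0),
]

# Long form: one average per (type, division) pair.
groups = defaultdict(list)
for request_type, division, days_late in records:
    groups[(request_type, division)].append(days_late)
long_form = {key: sum(v) / len(v) for key, v in groups.items()}

# Wide form: one row per type, one column per division.
divisions = sorted({division for _, division in long_form})
wide_form = {
    request_type: {d: long_form.get((request_type, d)) for d in divisions}
    for request_type in sorted({t for t, _ in long_form})
}

print(wide_form["Pothole"])        # {'Field Operations': 1.0, 'Streets': 3.0}
print(wide_form["Stray Animal"])   # {'Field Operations': -3.0, 'Streets': None}
```

Combinations with no records show up as `None` here, which corresponds to the nulls in the pivoted Spark DataFrame.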



## 11.05_Explore

## 11.06.01_Model_TopicModeling

0. Prep:  Load the Twitter data, which includes the following attributes: "Topic", "Sentiment", "TweetId", "TweetDate", "TweetText"

> `data = "twitter_corpus.csv"
df = spark.read.csv(data, header=True, inferSchema=True)
df.head(5)`



1. Use the `NGram` transformer to generate pairs of words (bigrams) from the tokenized tweets.

- Import the `NGram` class from the `pyspark.ml.feature` module:

> `from pyspark.ml.feature import NGram`

- Create an instance of the `NGram` class:

> `ngramer = NGram(inputCol="words", outputCol="bigrams", n=2)`

- Tokenize the tweets, optionally remove stop words, and then use the `transform` method to apply the `NGram` instance to the resulting DataFrames:

> `from pyspark.ml.feature import RegexTokenizer
tokenizer = RegexTokenizer(inputCol="TweetText", outputCol="words", gaps=False, pattern="[a-zA-Z-']+")
tokenized = tokenizer.transform(df)`

> `from pyspark.ml.feature import StopWordsRemover
remover = StopWordsRemover(inputCol="words", outputCol="words_removed")
removed = remover.transform(tokenized)
removed.select("words", "words_removed").head(5)`

> `ngramed = ngramer.transform(removed)
ngramed2 = ngramer.transform(tokenized)`

- Print out a few rows of the transformed DataFrame:

> `ngramed.show(5)`
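
To see what the `bigrams` column should contain, here is a plain-Python sketch of the n-gram construction. It assumes, in line with the `NGram` documentation, that each n-gram is a space-delimited string of consecutive tokens and that inputs shorter than `n` yield an empty list. The helper name `ngrams` and the sample tokens are hypothetical.

```python
def ngrams(tokens, n=2):
    """Return space-joined n-grams of consecutive tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["spark", "makes", "big", "data", "simple"]   # hypothetical tokenized tweet

bigrams = ngrams(tokens, n=2)
print(bigrams)             # ['spark makes', 'makes big', 'big data', 'data simple']
print(ngrams(["spark"]))   # []
```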

2. Fit an LDA model with $k=3$ topics.

- Use the `setK` method to change the number of topics for the `lda` instance:

> `lda.setK(3)`

- Use the `fit` method to fit the LDA model to the `vectorized` DataFrame:

> `lda_model = lda.fit(vectorized)`

- Use the `print_topics` function to examine the topics:

> `print_topics(lda_model, 5, vectorizer_model.vocabulary)`

- Use the `transform` method to apply the LDA model to the `vectorized` DataFrame:

> `predictions = lda_model.transform(vectorized)`

- Print out a few rows of the transformed DataFrame:

> `predictions.select("TweetText", "topicDistribution").head(5)`



## 11.06.02_Classification

In the exercises we add another feature to the classification model and
determine if it improves the model performance.

1. Determine if `request_address_zip` is a promising feature.

2. Reassemble the feature vector and include `request_address_zip`.

3. Create new train and test datasets.

> `(train, test) = assembled.randomSplit([0.7, 0.3], 23451)`
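
A note on the split: as far as we know, `randomSplit` normalizes the weights and assigns each row to a split at random, so the 70/30 proportions are approximate and the seed is what makes the split repeatable. The plain-Python sketch below imitates that behavior. The helper `random_split` is hypothetical and is not Spark's implementation.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to a split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    cumulative, running = [], 0.0
    for w in weights:
        running += w / total
        cumulative.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(cumulative):
            # The last split also catches any floating-point shortfall in the bounds.
            if u < bound or i == len(cumulative) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))   # stand-in for the assembled DataFrame's rows
train, test = random_split(rows, [0.7, 0.3], 23451)
print(len(train), len(test))
```

Running the function again with the same seed returns the same split, which is why the solution fixes the seed before comparing models.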

4. Refit the logistic regression model on the train dataset.

> `log_reg_model = log_reg.fit(train)`

5. Apply the refit logistic model to the test dataset.

> `predictions = log_reg_model.transform(test)`

6. Compute the AUC on the test dataset.

> `evaluator.setMetricName("areaUnderROC").evaluate(predictions)`
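
When comparing the AUC before and after adding `request_address_zip`, it helps to remember what the number means: the probability that a randomly chosen positive case is scored above a randomly chosen negative one. The sketch below computes it pair by pair in plain Python on made-up labels and scores. Spark's evaluator works from the ROC curve instead, so treat this as an interpretation aid; `area_under_roc` is a hypothetical helper.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 0, 1, 0]             # hypothetical true labels
scores = [0.9, 0.4, 0.35, 0.1]    # hypothetical predicted probabilities

auc = area_under_roc(labels, scores)
print(auc)   # 0.75
```

A value near 0.5 means the model ranks cases no better than chance, so the new feature is only worth keeping if the test AUC clearly moves away from the previous value.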