# **Labs 1 and 2 PySpark:**

In these labs we will use the "[[NeurIPS 2020] Data Science for COVID-19 (DS4C)](https://www.kaggle.com/datasets/kimjihoo/coronavirusdataset?select=PatientInfo.csv)" dataset, retrieved from [Kaggle](https://www.kaggle.com/) on 1/6/2022 for educational, non-commercial purposes. License: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)


The CSV file used in these labs is **PatientInfo**.

## PatientInfo.csv

**patient_id**
the ID of the patient

**sex**
the sex of the patient

**age**
the age of the patient

**country**
the country of the patient

**province**
the province of the patient

**city**
the city of the patient

**infection_case**
the case of infection

**infected_by**
the ID of who infected the patient


**contact_number**
the number of contacts with people

**symptom_onset_date**
the date of symptom onset

**confirmed_date**
the date of being confirmed

**released_date**
the date of being released

**deceased_date**
the date of being deceased

**state**
isolated / released / deceased

### Import pyspark and check its version

In [58]:
import pyspark
from pyspark.sql import SparkSession

print(pyspark.__version__)

### Import and create SparkSession

In [59]:
spark = SparkSession.builder.getOrCreate()

### Load the PatientInfo.csv file and show the first 5 rows

In [60]:
from IPython.display import display, HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))

In [61]:
df = spark.read.csv("/kaggle/input/patiens/PatientInfo.csv", header=True, inferSchema=True)
df.show(5)

### Display the schema of the dataset

In [62]:
df.printSchema()

root
 |-- patient_id: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: string (nullable = true)
 |-- country: string (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- infected_by: string (nullable = true)
 |-- contact_number: string (nullable = true)
 |-- symptom_onset_date: string (nullable = true)
 |-- confirmed_date: date (nullable = true)
 |-- released_date: date (nullable = true)
 |-- deceased_date: date (nullable = true)
 |-- state: string (nullable = true)



### Display the statistical summary

In [63]:
df.describe().show()

+-------+--------------------+------+----+----------+--------+--------------+--------------------+--------------------+--------------------+------------------+--------+
|summary|          patient_id|   sex| age|   country|province|          city|      infection_case|         infected_by|      contact_number|symptom_onset_date|   state|
+-------+--------------------+------+----+----------+--------+--------------+--------------------+--------------------+--------------------+------------------+--------+
|  count|                5165|  4043|3785|      5165|    5165|          5071|                4246|                1346|                 791|               690|    5165|
|   mean|2.8636345618679576E9|  NULL|NULL|      NULL|    NULL|          NULL|                NULL|2.2845944015643125E9|1.6772572523506988E7|              NULL|    NULL|
| stddev| 2.074210725277473E9|  NULL|NULL|      NULL|    NULL|          NULL|                NULL|1.5265072953383324E9| 3.093097580985502E8|              N

In [64]:
df.summary().show()

+-------+--------------------+------+----+----------+--------+--------------+--------------------+--------------------+--------------------+------------------+--------+
|summary|          patient_id|   sex| age|   country|province|          city|      infection_case|         infected_by|      contact_number|symptom_onset_date|   state|
+-------+--------------------+------+----+----------+--------+--------------+--------------------+--------------------+--------------------+------------------+--------+
|  count|                5165|  4043|3785|      5165|    5165|          5071|                4246|                1346|                 791|               690|    5165|
|   mean|2.8636345618679576E9|  NULL|NULL|      NULL|    NULL|          NULL|                NULL|2.2845944015643125E9|1.6772572523506988E7|              NULL|    NULL|
| stddev| 2.074210725277473E9|  NULL|NULL|      NULL|    NULL|          NULL|                NULL|1.5265072953383324E9| 3.093097580985502E8|              N

                                                                                                    

### Using the state column, how many people survived (released), and how many did not (isolated/deceased)?

In [65]:
from pyspark.sql.functions import col
df.filter(col("state") == "released").count()

2929

In [66]:
df.filter(col("state").isin("isolated", "deceased")).count()

2236

### Display the number of null values in each column

In [67]:
import pyspark.sql.functions as F

null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns])
null_counts.show()

+----------+----+----+-------+--------+----+--------------+-----------+--------------+------------------+--------------+-------------+-------------+-----+
|patient_id| sex| age|country|province|city|infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|state|
+----------+----+----+-------+--------+----+--------------+-----------+--------------+------------------+--------------+-------------+-------------+-----+
|         0|1122|1380|      0|       0|  94|           919|       3819|          4374|              4475|             3|         3578|         5099|    0|
+----------+----+----+-------+--------+----+--------------+-----------+--------------+------------------+--------------+-------------+-------------+-----+
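
As a quick consistency check, the null counts above can be compared with the non-null counts reported by `describe()`: for every column the two should add up to the 5165 rows in the dataset. The sketch below copies the numbers from the two outputs into plain Python dictionaries; the variable names are illustrative only.

```python
# Non-null counts copied from the describe() output above.
non_null = {"sex": 4043, "age": 3785, "city": 5071, "infection_case": 4246,
            "infected_by": 1346, "contact_number": 791, "symptom_onset_date": 690}
# Null counts copied from the null_counts output above.
nulls = {"sex": 1122, "age": 1380, "city": 94, "infection_case": 919,
         "infected_by": 3819, "contact_number": 4374, "symptom_onset_date": 4475}

total_rows = 5165
# For each column, non-null + null should equal the total row count.
totals = {c: non_null[c] + nulls[c] for c in non_null}
print(all(t == total_rows for t in totals.values()))
```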



## Data preprocessing

### Fill the nulls in deceased_date with released_date
- You can use the <b>coalesce</b> function.

In [68]:
df_processed = df.withColumn("deceased_date", F.coalesce("deceased_date", "released_date"))

In [69]:
df_processed.select("deceased_date", "released_date").show(10)

+-------------+-------------+
|deceased_date|released_date|
+-------------+-------------+
|   2020-02-05|   2020-02-05|
|   2020-03-02|   2020-03-02|
|   2020-02-19|   2020-02-19|
|   2020-02-15|   2020-02-15|
|   2020-02-24|   2020-02-24|
|   2020-02-19|   2020-02-19|
|   2020-02-10|   2020-02-10|
|   2020-02-24|   2020-02-24|
|   2020-02-21|   2020-02-21|
|   2020-02-29|   2020-02-29|
+-------------+-------------+
only showing top 10 rows
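
`coalesce` returns the first non-null value among its arguments, evaluated row by row. Here is a minimal plain-Python sketch of that rule, using `None` for null. The helper name `coalesce_values` is made up for illustration and is not part of PySpark, and the dates are illustrative.

```python
def coalesce_values(*values):
    """Return the first value that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None

# deceased_date is null, so released_date is used.
print(coalesce_values(None, "2020-02-05"))
# deceased_date is present, so it is kept.
print(coalesce_values("2020-02-23", "2020-03-01"))
# Both are null, so the result stays null.
print(coalesce_values(None, None))
```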



### Add a column named no_days, which is the difference between deceased_date and confirmed_date, then show the top 5 rows. Print the schema.
- <b>Hint: You need to typecast these columns as date first</b>

In [70]:
df_processed = df_processed.withColumn("deceased_date", F.to_date("deceased_date", "yyyy-MM-dd"))
df_processed = df_processed.withColumn("confirmed_date", F.to_date("confirmed_date", "yyyy-MM-dd"))

df_processed = df_processed.withColumn("no_days", F.datediff("deceased_date", "confirmed_date"))

In [71]:
df_processed.select("confirmed_date", "deceased_date", "no_days").show(5)

+--------------+-------------+-------+
|confirmed_date|deceased_date|no_days|
+--------------+-------------+-------+
|    2020-01-23|   2020-02-05|     13|
|    2020-01-30|   2020-03-02|     32|
|    2020-01-30|   2020-02-19|     20|
|    2020-01-30|   2020-02-15|     16|
|    2020-01-31|   2020-02-24|     24|
+--------------+-------------+-------+
only showing top 5 rows
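
`datediff(end, start)` counts the days from `start` to `end`. The first two rows of the output above can be re-checked with Python's standard `datetime` module. This only reproduces the arithmetic and does not use Spark; `days_between` is an illustrative helper name.

```python
from datetime import date

def days_between(end, start):
    """Number of days from start to end, like datediff(end, start)."""
    return (end - start).days

# confirmed 2020-01-23, deceased_date (filled from released_date) 2020-02-05
row1 = days_between(date(2020, 2, 5), date(2020, 1, 23))
# confirmed 2020-01-30, deceased_date (filled from released_date) 2020-03-02
row2 = days_between(date(2020, 3, 2), date(2020, 1, 30))
print(row1, row2)  # 13 32
```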



### Add an is_male column: True if the patient is male, otherwise False

In [72]:
df_processed = df_processed.withColumn("is_male", F.when(df_processed.sex == "male", True).otherwise(False))

### Add an is_dead column: True if the patient state is not released, otherwise False

- Use a <b>UDF</b> to perform this task.
- However, a UDF is not recommended unless no built-in function can do the required operation.
- UDFs are slower than built-in functions.

In [73]:
from pyspark.sql.types import BooleanType

def check_is_dead(state):
    return state != "released"

In [74]:
is_dead_udf = F.udf(check_is_dead, BooleanType())

df_processed = df_processed.withColumn("is_dead", is_dead_udf(df_processed.state))
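
Because the UDF body is ordinary Python, it can be tried on plain values before it is registered. The sketch below repeats `check_is_dead` from the cell above and also passes `None`, which is how a null `state` reaches a Python UDF. A built-in column comparison such as `F.col("state") != "released"` behaves differently in that case: it follows SQL null semantics and yields null for a null `state`. In this dataset `state` has no nulls, so both approaches give the same result.

```python
def check_is_dead(state):
    # Same logic as the UDF body: anything other than "released" counts as dead.
    return state != "released"

for state in ["released", "isolated", "deceased", None]:
    print(state, check_is_dead(state))
```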

### Convert the age bins from strings (0s, 10s, 20s, etc.) to integers (0, 10, 20, etc.)

In [75]:
df_processed = df_processed.withColumn("age", F.regexp_replace("age", "s", "").cast("int"))
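
`regexp_replace("age", "s", "")` removes every `s` in the string, which is fine for values such as `50s`. Here is a plain-Python sketch of the same transformation with the `re` module. The helper `age_bin_to_int` is hypothetical, and it uses the stricter pattern `s$` (an assumption, not what the cell above uses) so that only a trailing `s` is stripped.

```python
import re

def age_bin_to_int(age):
    """Turn an age bin such as '50s' into the integer 50; keep None as None."""
    if age is None:
        return None
    return int(re.sub(r"s$", "", age))

print([age_bin_to_int(a) for a in ["0s", "10s", "50s", "100s", None]])
```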

### Cast age and no_days to double

In [76]:
df_processed = df_processed.withColumn("age", F.col("age").cast("double")) \
       .withColumn("no_days", F.col("no_days").cast("double"))

### Drop the columns
["patient_id","sex","infected_by","contact_number","released_date","state",
"symptom_onset_date","confirmed_date","deceased_date","country","no_days",
"city","infection_case"]

In [77]:
columns_to_drop = ["patient_id", "sex", "infected_by", "contact_number", "released_date",
                   "state", "symptom_onset_date", "confirmed_date", "deceased_date",
                   "country", "no_days", "city", "infection_case"]

df_processed = df_processed.drop(*columns_to_drop)

### Recount the number of nulls now

In [78]:
null_counts_2 = df_processed.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df_processed.columns])
null_counts_2.show()

+----+--------+-------+-------+
| age|province|is_male|is_dead|
+----+--------+-------+-------+
|1380|       0|      0|      0|
+----+--------+-------+-------+



## Now do the same but using SQL select statement

### From the original Patient DataFrame, Create a temporary view (table).

In [79]:
df.createOrReplaceTempView("patient_table")

### Use SELECT statement to select all columns from the dataframe and show the output.

In [80]:
spark.sql("SELECT * FROM patient_table").show(5)

+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|patient_id|   sex|age|country|province|       city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|
+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|1000000001|  male|50s|  Korea|   Seoul| Gangseo-gu|     overseas inflow|       NULL|            75|        2020-01-22|    2020-01-23|   2020-02-05|         NULL|released|
|1000000002|  male|30s|  Korea|   Seoul|Jungnang-gu|     overseas inflow|       NULL|            31|              NULL|    2020-01-30|   2020-03-02|         NULL|released|
|1000000003|  male|50s|  Korea|   Seoul|  Jongno-gu|contact with patient| 2002000001|            17|              NULL|    2020-01-30|   202

### *Using SQL commands*, limit the output to only 5 rows 

In [81]:
spark.sql("SELECT * FROM patient_table LIMIT 5").show()

+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|patient_id|   sex|age|country|province|       city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|
+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|1000000001|  male|50s|  Korea|   Seoul| Gangseo-gu|     overseas inflow|       NULL|            75|        2020-01-22|    2020-01-23|   2020-02-05|         NULL|released|
|1000000002|  male|30s|  Korea|   Seoul|Jungnang-gu|     overseas inflow|       NULL|            31|              NULL|    2020-01-30|   2020-03-02|         NULL|released|
|1000000003|  male|50s|  Korea|   Seoul|  Jongno-gu|contact with patient| 2002000001|            17|              NULL|    2020-01-30|   202

### Select the count of males and females in the dataset

In [82]:
query = """
SELECT sex, COUNT(*) AS count
FROM patient_table
GROUP BY sex
"""
spark.sql(query).show()

+------+-----+
|   sex|count|
+------+-----+
|  NULL| 1122|
|female| 2218|
|  male| 1825|
+------+-----+



### How many people did survive, and how many didn't?

In [83]:
query = """
SELECT
  CASE WHEN state = 'released' THEN 'survived' ELSE 'not_survived' END AS survival_status,
  COUNT(*) AS count
FROM patient_table
GROUP BY survival_status
"""
spark.sql(query).show()

+---------------+-----+
|survival_status|count|
+---------------+-----+
|   not_survived| 2236|
|       survived| 2929|
+---------------+-----+



### Now, let's perform some preprocessing using SQL:
1. Convert *age* column to double after removing the 's' at the end -- *hint: check SUBSTRING method*
2. Select only the following columns: `['sex', 'age', 'province', 'state']`
3. Store the result of the query in a new dataframe

In [84]:
query = """
SELECT sex,
       CAST(SUBSTRING(age, 1, LENGTH(age) - 1) AS DOUBLE) AS age,
       province,
       state
FROM patient_table
"""
# SUBSTRING arguments: column, start position (1-based), number of characters
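
SQL `SUBSTRING(str, pos, len)` uses 1-based positions, unlike Python slicing. Here is a small plain-Python sketch of the expression used in the query. The function `sql_substring` is an illustrative stand-in that handles only positive `pos`.

```python
def sql_substring(s, pos, length):
    """Mimic SUBSTRING(s, pos, length) for pos >= 1 (1-based start)."""
    start = pos - 1
    return s[start:start + length]

age = "50s"
# Keep everything except the last character, as LENGTH(age) - 1 does in the query.
trimmed = sql_substring(age, 1, len(age) - 1)
print(trimmed, float(trimmed))
```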

In [129]:
df_preprocessed = spark.sql(query)
df_preprocessed.show(5)

+------+----+--------+--------+
|   sex| age|province|   state|
+------+----+--------+--------+
|  male|50.0|   Seoul|released|
|  male|30.0|   Seoul|released|
|  male|50.0|   Seoul|released|
|  male|20.0|   Seoul|released|
|female|20.0|   Seoul|released|
+------+----+--------+--------+
only showing top 5 rows



In [130]:
df_preprocessed.printSchema()

root
 |-- sex: string (nullable = true)
 |-- age: double (nullable = true)
 |-- province: string (nullable = true)
 |-- state: string (nullable = true)



## Machine Learning 
### Create a pipeline model to predict is_dead and evaluate the performance.
- Use <b>StringIndexer</b> to transform <b>string</b> data type to indices.
- Use <b>OneHotEncoder</b> to deal with categorical values.
- Use <b>Imputer</b> to fill missing data with mean.
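
To make the three transformers concrete, here is a plain-Python sketch of what each one computes on a toy column. It assumes the PySpark defaults (StringIndexer orders labels by descending frequency, OneHotEncoder drops the last category, Imputer uses the mean). The toy values and helper names are invented for illustration; the real stages operate on DataFrame columns.

```python
from collections import Counter

def string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot(index, n_categories, drop_last=True):
    """Encode an index as a 0/1 vector; the last category maps to all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    return [1.0 if i == index else 0.0 for i in range(size)]

def impute_mean(values):
    """Replace None with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

provinces = ["Seoul", "Seoul", "Seoul", "Busan", "Busan", "Daegu"]
mapping = string_index(provinces)
encoded = {p: one_hot(i, len(mapping)) for p, i in mapping.items()}
print(mapping)
print(encoded)
print(impute_mean([20.0, None, 40.0]))
```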

In [131]:
df_preprocessed = df_preprocessed.withColumn("is_dead",F.when((col("state").isin("isolated", "deceased")), 1).otherwise(0))

In [132]:
df_preprocessed = df_preprocessed.drop("state")
df_preprocessed.show(5)

+------+----+--------+-------+
|   sex| age|province|is_dead|
+------+----+--------+-------+
|  male|50.0|   Seoul|      0|
|  male|30.0|   Seoul|      0|
|  male|50.0|   Seoul|      0|
|  male|20.0|   Seoul|      0|
|female|20.0|   Seoul|      0|
+------+----+--------+-------+
only showing top 5 rows



In [133]:
df_preprocessed.count()

5165

In [134]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, Imputer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [135]:
traindf,testdf = df_preprocessed.randomSplit([0.8, 0.2], seed=42)

In [136]:
null_counts_88 = df_preprocessed.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df_preprocessed.columns])
null_counts_88.show()

+----+----+--------+-------+
| sex| age|province|is_dead|
+----+----+--------+-------+
|1122|1380|       0|      0|
+----+----+--------+-------+



In [137]:
traindf.dtypes

[('sex', 'string'),
 ('age', 'double'),
 ('province', 'string'),
 ('is_dead', 'int')]

In [138]:
cat_cols = [field for field, dtype in traindf.dtypes if dtype == 'string']

In [139]:
numeric_cols = [field for field, dtype in traindf.dtypes if dtype == 'double' and field != 'is_dead']

In [140]:
indexed_cols = [c + '_index' for c in cat_cols]
OHE_cols = [c + '_OHE' for c in cat_cols]
Imputed = [c + '_imputed' for c in numeric_cols]

In [141]:
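# handleInvalid='skip' filters out rows whose categorical value is null or unseen,
# so rows with a missing sex are dropped before training and scoring.
strind = StringIndexer(inputCols=cat_cols, outputCols=indexed_cols, handleInvalid='skip')
ohe = OneHotEncoder(inputCols=indexed_cols, outputCols=OHE_cols)
# Imputer fills missing numeric values with the column mean (its default strategy).
imputer = Imputer(inputCols=numeric_cols, outputCols=Imputed)
assembler = VectorAssembler(inputCols=Imputed + OHE_cols, outputCol='features')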

In [142]:
classifier = LogisticRegression(featuresCol='features', labelCol='is_dead')
pl = Pipeline(stages=[imputer , strind, ohe, assembler, classifier])

In [143]:
pl_model = pl.fit(traindf)

In [144]:
pred_train = pl_model.transform(traindf)

In [145]:
pred = pl_model.transform(testdf)

In [146]:
evaluator = BinaryClassificationEvaluator(labelCol="is_dead", rawPredictionCol="rawPrediction", metricName="areaUnderROC")

In [147]:
auc_train = evaluator.evaluate(pred_train)
auc_test = evaluator.evaluate(pred)

In [148]:
print(f"auc_train : {auc_train}, auc_test : {auc_test} ")

auc_train : 0.9241543712514069, auc_test : 0.9284758128469465 
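
`areaUnderROC` can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. Here is a brute-force sketch of that definition on made-up scores. This is not how Spark computes the metric, and the numbers are not from the model above; `pairwise_auc` is an illustrative name.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical scores: 8 of the 9 pairs are ordered correctly.
auc = pairwise_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(round(auc, 4))
```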


In [150]:
preds = pred.select(F.col("prediction").cast("int"), F.col("is_dead").cast("int"))

# Confusion matrix components
TP = preds.filter((col("prediction") == 1) & (col("is_dead") == 1)).count()
TN = preds.filter((col("prediction") == 0) & (col("is_dead") == 0)).count()
FP = preds.filter((col("prediction") == 1) & (col("is_dead") == 0)).count()
FN = preds.filter((col("prediction") == 0) & (col("is_dead") == 1)).count()

# Metrics
accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP) if (TP + FP) != 0 else 0
recall = TP / (TP + FN) if (TP + FN) != 0 else 0
f1_score = 2 * precision * recall / (precision + recall) if (precision + recall) != 0 else 0

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1_score:.4f}")


Accuracy: 0.8877
Precision: 0.8900
Recall: 0.8215
F1 Score: 0.8544
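
The metric formulas can be wrapped in a small helper so that the same code serves every model. The sketch below is plain Python. `classification_metrics` and the example counts are hypothetical, not the counts behind the numbers above. As a cross-check that does use the reported values, F1 recomputed from the printed precision and recall comes out at about 0.854, in line with the printed F1 score.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=8, tn=9, fp=1, fn=2)
print({k: round(v, 4) for k, v in metrics.items()})

# Cross-check with the precision and recall printed above.
reported_precision, reported_recall = 0.8900, 0.8215
f1_check = 2 * reported_precision * reported_recall / (reported_precision + reported_recall)
print(round(f1_check, 4))
```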


### Support Vector Machine

In [151]:
from pyspark.ml.classification import LinearSVC

In [152]:
SVC = LinearSVC(featuresCol='features', labelCol='is_dead')
pl_SVC = Pipeline(stages=[imputer , strind, ohe, assembler, SVC])

In [153]:
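# fit() returns a PipelineModel; transform() must be called on that fitted model,
# not on the unfitted Pipeline.
pl_model_SVC = pl_SVC.fit(traindf)
pred_train_SVC = pl_model_SVC.transform(traindf)
pred_SVC = pl_model_SVC.transform(testdf)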

In [158]:
evaluator = BinaryClassificationEvaluator(labelCol="is_dead", rawPredictionCol="rawPrediction", metricName="areaUnderROC")

In [159]:
auc_train = evaluator.evaluate(pred_train_SVC)
auc_test = evaluator.evaluate(pred_SVC)

In [160]:
print(f"auc_train : {auc_train}, auc_test : {auc_test} ")

auc_train : 0.9075328418987797, auc_test : 0.9117335448057102 


In [161]:
preds = pred_SVC.select(F.col("prediction").cast("int"), F.col("is_dead").cast("int"))

# Confusion matrix components
TP = preds.filter((col("prediction") == 1) & (col("is_dead") == 1)).count()
TN = preds.filter((col("prediction") == 0) & (col("is_dead") == 0)).count()
FP = preds.filter((col("prediction") == 1) & (col("is_dead") == 0)).count()
FN = preds.filter((col("prediction") == 0) & (col("is_dead") == 1)).count()

# Metrics
accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP) if (TP + FP) != 0 else 0
recall = TP / (TP + FN) if (TP + FN) != 0 else 0
f1_score = 2 * precision * recall / (precision + recall) if (precision + recall) != 0 else 0

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1_score:.4f}")


Accuracy: 0.8827
Precision: 0.8734
Recall: 0.8277
F1 Score: 0.8499
