# **Labs 1 and 2 PySpark:**

In these labs we will use the "[[NeurIPS 2020] Data Science for COVID-19 (DS4C)](https://www.kaggle.com/datasets/kimjihoo/coronavirusdataset?select=PatientInfo.csv)" dataset, retrieved from [Kaggle](https://www.kaggle.com/) on 1/6/2022 for educational, non-commercial purposes. License: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).


The CSV file used in these labs is **PatientInfo.csv**.

## PatientInfo.csv

**patient_id**
the ID of the patient

**sex**
the sex of the patient

**age**
the age of the patient

**country**
the country of the patient

**province**
the province of the patient

**city**
the city of the patient

**infection_case**
the case of infection

**infected_by**
the ID of who infected the patient


**contact_number**
the number of contacts with people

**symptom_onset_date**
the date of symptom onset

**confirmed_date**
the date of being confirmed

**released_date**
the date of being released

**deceased_date**
the date of being deceased

**state**
isolated / released / deceased

### Import pyspark and check its version

In [1]:
import pyspark

### Import and create SparkSession

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
import pyspark.sql.functions as F

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/06/26 12:15:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
print(spark.version)

3.2.1


### Load the PatientInfo.csv file and show the first 5 rows

In [4]:
from IPython.display import display, HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))

In [5]:
df = (spark.read.format('csv')
          .option('inferSchema','true')
          .option('header','true')
          .load('PatientInfo.csv')
         )
df.show(5)

+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|patient_id|   sex|age|country|province|        city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|
+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|1000000001|  male|50s|  Korea|   Seoul|  Gangseo-gu|     overseas inflow|       null|            75|        2020-01-22|    2020-01-23|   2020-02-05|         null|released|
|1000000002|  male|30s|  Korea|   Seoul| Jungnang-gu|     overseas inflow|       null|            31|              null|    2020-01-30|   2020-03-02|         null|released|
|1000000003|  male|50s|  Korea|   Seoul|   Jongno-gu|contact with patient| 2002000001|            17|              null|    2020-01-30|

### Display the schema of the dataset

In [6]:
df.printSchema()

root
 |-- patient_id: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: string (nullable = true)
 |-- country: string (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- infected_by: string (nullable = true)
 |-- contact_number: string (nullable = true)
 |-- symptom_onset_date: string (nullable = true)
 |-- confirmed_date: string (nullable = true)
 |-- released_date: string (nullable = true)
 |-- deceased_date: string (nullable = true)
 |-- state: string (nullable = true)



### Display the statistical summary

In [7]:
df.describe().show()

+-------+--------------------+------+----+----------+--------+--------------+--------------------+--------------------+--------------------+------------------+--------------+-------------+-------------+--------+
|summary|          patient_id|   sex| age|   country|province|          city|      infection_case|         infected_by|      contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|
+-------+--------------------+------+----+----------+--------+--------------+--------------------+--------------------+--------------------+------------------+--------------+-------------+-------------+--------+
|  count|                5165|  4043|3785|      5165|    5165|          5071|                4246|                1346|                 791|               690|          5162|         1587|           66|    5165|
|   mean|2.8636345618679576E9|  null|null|      null|    null|          null|                null|2.2845944015643125E9|1.6772572523506988E7|            

                                                                                

### Using the state column.
### How many people survived (released), and how many didn't survive (isolated/deceased)?

In [8]:
(df.select('state')
   .groupBy('state')
   .agg(F.count("state"))
  ).show()

+--------+------------+
|   state|count(state)|
+--------+------------+
|isolated|        2158|
|released|        2929|
|deceased|          78|
+--------+------------+



### Display the number of null values in each column

In [9]:
# Find Count of Null, None, NaN of All DataFrame Columns
#from pyspark.sql.functions import col,isnan, when, count
df.select([F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c) for c in df.columns]
   ).show()

+----------+----+----+-------+--------+----+--------------+-----------+--------------+------------------+--------------+-------------+-------------+-----+
|patient_id| sex| age|country|province|city|infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|state|
+----------+----+----+-------+--------+----+--------------+-----------+--------------+------------------+--------------+-------------+-------------+-----+
|         0|1122|1380|      0|       0|  94|           919|       3819|          4374|              4475|             3|         3578|         5099|    0|
+----------+----+----+-------+--------+----+--------------+-----------+--------------+------------------+--------------+-------------+-------------+-----+
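As a quick consistency check, the non-null counts reported by `describe()` and the null counts above should add up to the total row count for every column. A minimal sketch in plain Python, with the numbers copied by hand from the two outputs (the dictionary names are illustrative, not part of the lab):

```python
# Non-null counts copied from the `count` row of df.describe()
non_null = {
    "sex": 4043, "age": 3785, "city": 5071, "infection_case": 4246,
    "infected_by": 1346, "contact_number": 791, "symptom_onset_date": 690,
    "confirmed_date": 5162, "released_date": 1587, "deceased_date": 66,
}
# Null counts copied from the null-count table
nulls = {
    "sex": 1122, "age": 1380, "city": 94, "infection_case": 919,
    "infected_by": 3819, "contact_number": 4374, "symptom_onset_date": 4475,
    "confirmed_date": 3, "released_date": 3578, "deceased_date": 5099,
}
total_rows = 5165  # count of patient_id, which has no nulls

totals = {col: non_null[col] + nulls[col] for col in non_null}
mismatched = [col for col, t in totals.items() if t != total_rows]
print(mismatched)  # an empty list means the two outputs agree
```

Note that `deceased_date` has only 66 non-null values while 78 patients have the state `deceased`, so some deceased patients have no recorded date.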



## Data preprocessing

### Fill the nulls in deceased_date with released_date
- You can use the <b>coalesce</b> function.

In [10]:
df = df.withColumn('deceased_date',F.coalesce('deceased_date','released_date'))
df.select('deceased_date','released_date').show()

+-------------+-------------+
|deceased_date|released_date|
+-------------+-------------+
|   2020-02-05|   2020-02-05|
|   2020-03-02|   2020-03-02|
|   2020-02-19|   2020-02-19|
|   2020-02-15|   2020-02-15|
|   2020-02-24|   2020-02-24|
|   2020-02-19|   2020-02-19|
|   2020-02-10|   2020-02-10|
|   2020-02-24|   2020-02-24|
|   2020-02-21|   2020-02-21|
|   2020-02-29|   2020-02-29|
|   2020-02-29|   2020-02-29|
|   2020-02-27|   2020-02-27|
|         null|         null|
|   2020-03-12|   2020-03-12|
|         null|         null|
|   2020-03-11|   2020-03-11|
|   2020-03-01|   2020-03-01|
|         null|         null|
|   2020-03-08|   2020-03-08|
|         null|         null|
+-------------+-------------+
only showing top 20 rows
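`coalesce` returns its first non-null argument, evaluated row by row. The following is a plain-Python model of that rule, not the Spark implementation; the helper name and the sample rows are illustrative, with `None` standing in for a SQL null.

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None

# (deceased_date, released_date) pairs; None stands in for a SQL null
rows = [
    (None, "2020-02-05"),   # released patient: falls back to released_date
    ("2020-02-19", None),   # hypothetical deceased patient: keeps deceased_date
    (None, None),           # neither date recorded: stays null
]
end_dates = [coalesce(deceased, released) for deceased, released in rows]
print(end_dates)  # ['2020-02-05', '2020-02-19', None]
```

After this step `deceased_date` effectively holds an "end date" (death or release), which is why it matches `released_date` in the rows shown above.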



### Add a column named no_days, the difference in days between deceased_date and confirmed_date, then show the top 5 rows and print the schema.
- <b>Hint: You need to cast these columns to date first.</b>

In [11]:
# Cast both columns to date explicitly before taking the difference in days
df = df.withColumn('no_days', F.datediff(F.to_date('deceased_date'), F.to_date('confirmed_date')))
# df.select('no_days').show(5)
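`datediff(end, start)` counts the days from `start` to `end` and yields null when either date is null. Here is a plain-Python sketch of the same arithmetic, checked against the first patient in the table shown earlier; the helper `date_diff` is a hypothetical name used only for illustration.

```python
from datetime import date

def date_diff(end, start):
    """Days from start to end for ISO 'YYYY-MM-DD' strings; None if either is missing."""
    if end is None or start is None:
        return None
    return (date.fromisoformat(end) - date.fromisoformat(start)).days

# First patient: confirmed 2020-01-23, released 2020-02-05, no deceased_date
print(date_diff("2020-02-05", "2020-01-23"))  # 13
print(date_diff(None, "2020-01-30"))          # None
```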

### Add an is_male column: true if the patient is male, otherwise false

In [12]:
df = df.withColumn("is_male",F.when(df.sex == "male","True")
                             .otherwise("False"))
df.select('is_male').show()   

+-------+
|is_male|
+-------+
|   True|
|   True|
|   True|
|   True|
|  False|
|  False|
|   True|
|   True|
|   True|
|  False|
|  False|
|   True|
|   True|
|  False|
|   True|
|   True|
|   True|
|   True|
|  False|
|  False|
+-------+
only showing top 20 rows



### Add an is_dead column: true if the patient state is not released, otherwise false

- Use a <b>UDF</b> to perform this task.
- However, UDFs are not recommended unless no built-in function can do the required operation.
- UDFs are slower than built-in functions.

In [13]:
from pyspark.sql.types import BooleanType,StringType

In [14]:
def convertUdf(x):
    if x != "released":
        return True
    else:
        return False

In [15]:
convertUDF = F.udf(convertUdf, BooleanType())


In [16]:
df = df.withColumn('is_dead',convertUDF('state'))
df.select('is_dead').show()

+-------+
|is_dead|
+-------+
|  false|
|  false|
|  false|
|  false|
|  false|
|  false|
|  false|
|  false|
|  false|
|  false|
|  false|
|  false|
|   true|
|  false|
|  false|
|  false|
|  false|
|  false|
|  false|
|  false|
+-------+
only showing top 20 rows
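The rule inside `convertUdf` is a single comparison, so it is worth spelling out what it labels as dead. The sketch below is plain Python, and the function name `is_dead` is illustrative; the trailing comment shows what a built-in column expression with the same rule might look like and is not executed here.

```python
def is_dead(state):
    # Same rule as convertUdf: anything other than "released" counts as dead
    return state != "released"

labels = {state: is_dead(state) for state in ["released", "isolated", "deceased"]}
print(labels)  # {'released': False, 'isolated': True, 'deceased': True}

# Built-in column expression with the same rule (PySpark, not executed here):
#   df.withColumn('is_dead', F.col('state') != 'released')
```

Under this rule, patients who are still isolated are labelled as dead too, which matters when interpreting the model later. The two versions would differ only for a null `state` (the UDF returns true, the column expression returns null), and `state` has no nulls in this dataset.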



### Change the age bins from 0s, 10s, 20s, etc. to 0, 10, 20

In [17]:
# def remove_S(x):
#     return x.replace('s',"")

In [18]:
# remove_s_UDF = F.udf(lambda x: remove_S(x), StringType())

In [19]:
from pyspark.sql.functions import regexp_replace

In [20]:
df = df.withColumn('age',regexp_replace('age', 's', ''))
df.select('age').show()

+---+
|age|
+---+
| 50|
| 30|
| 50|
| 20|
| 20|
| 50|
| 20|
| 20|
| 30|
| 60|
| 50|
| 20|
| 80|
| 60|
| 70|
| 70|
| 70|
| 20|
| 70|
| 70|
+---+
only showing top 20 rows
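The pattern `'s'` removes every `s` in the value. For labels such as `50s` that is the same as stripping the suffix, but anchoring the pattern with `$` makes the intent explicit. A small plain-Python sketch follows; the helper `age_bin` and the sample labels are illustrative, and in PySpark the same pattern could be passed to `regexp_replace('age', 's$', '')`.

```python
import re

def age_bin(label):
    """Convert an age label such as '50s' to the string '50'; pass None through."""
    if label is None:
        return None
    return re.sub(r"s$", "", label)  # strip only a trailing 's'

bins = [age_bin(x) for x in ["0s", "50s", "100s", None]]
print(bins)  # ['0', '50', '100', None]
```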



### Cast age and no_days to Double

In [21]:
df = df.withColumn('age', df.age.cast('double'))
df = df.withColumn('no_days',df.no_days.cast('double'))

### Drop the columns
["patient_id","sex","infected_by","contact_number","released_date","state",
"symptom_onset_date","confirmed_date","deceased_date","country","no_days",
"city","infection_case"]

In [22]:
df = df.drop("patient_id","sex","infected_by","contact_number","released_date","state", "symptom_onset_date","confirmed_date","deceased_date","country","no_days", "city","infection_case") 

In [23]:
df.printSchema()

root
 |-- age: double (nullable = true)
 |-- province: string (nullable = true)
 |-- is_male: string (nullable = false)
 |-- is_dead: boolean (nullable = true)



### Recount the number of nulls now

In [24]:
# isnan is meant for float/double columns; is_dead is boolean, so only nulls are counted here
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
   ).show()

## Now do the same but using SQL select statement

### From the original Patient DataFrame, Create a temporary view (table).

In [25]:
df_sql = spark.read.csv('PatientInfo.csv',
                   header=True,
                       )
df_sql.createOrReplaceTempView("patients")

### Use SELECT statement to select all columns from the dataframe and show the output.

In [26]:
spark.sql("""SELECT *
             FROM patients
          """).show()

+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|patient_id|   sex|age|country|province|        city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|
+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|1000000001|  male|50s|  Korea|   Seoul|  Gangseo-gu|     overseas inflow|       null|            75|        2020-01-22|    2020-01-23|   2020-02-05|         null|released|
|1000000002|  male|30s|  Korea|   Seoul| Jungnang-gu|     overseas inflow|       null|            31|              null|    2020-01-30|   2020-03-02|         null|released|
|1000000003|  male|50s|  Korea|   Seoul|   Jongno-gu|contact with patient| 2002000001|            17|              null|    2020-01-30|

### *Using SQL commands*, limit the output to only 5 rows 

In [27]:
spark.sql("""
          SELECT *
          FROM patients
          LIMIT 5
          """).show()

+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|patient_id|   sex|age|country|province|       city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|
+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|1000000001|  male|50s|  Korea|   Seoul| Gangseo-gu|     overseas inflow|       null|            75|        2020-01-22|    2020-01-23|   2020-02-05|         null|released|
|1000000002|  male|30s|  Korea|   Seoul|Jungnang-gu|     overseas inflow|       null|            31|              null|    2020-01-30|   2020-03-02|         null|released|
|1000000003|  male|50s|  Korea|   Seoul|  Jongno-gu|contact with patient| 2002000001|            17|              null|    2020-01-30|   202

### Select the count of males and females in the dataset

In [28]:
spark.sql("""
          SELECT sex,count(sex)
          FROM patients
          GROUP BY sex
          """).show()

+------+----------+
|   sex|count(sex)|
+------+----------+
|  null|         0|
|female|      2218|
|  male|      1825|
+------+----------+



### How many people did survive, and how many didn't?

In [29]:
spark.sql("""
          SELECT state,count(state)
          FROM patients
          GROUP BY state
          """).show()

+--------+------------+
|   state|count(state)|
+--------+------------+
|isolated|        2158|
|released|        2929|
|deceased|          78|
+--------+------------+



### Now, let's perform some preprocessing using SQL:
1. Convert *age* column to double after removing the 's' at the end -- *hint: check SUBSTRING method*
2. Select only the following columns: `['sex', 'age', 'province', 'state']`
3. Store the result of the query in a new dataframe

In [30]:
df_sql = spark.sql("""
          SELECT double(SUBSTRING(age,1,2))
          FROM patients;
          
          """)
# df_sql = spark.sql("""
#                    ALTER TABLE patients modify column age DOUBLE;
#                    """)
df_sql.show()

+--------------------+
|substring(age, 1, 2)|
+--------------------+
|                50.0|
|                30.0|
|                50.0|
|                20.0|
|                20.0|
|                50.0|
|                20.0|
|                20.0|
|                30.0|
|                60.0|
|                50.0|
|                20.0|
|                80.0|
|                60.0|
|                70.0|
|                70.0|
|                70.0|
|                20.0|
|                70.0|
|                70.0|
+--------------------+
only showing top 20 rows
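`SUBSTRING(age, 1, 2)` keeps exactly two characters, which works for labels like `50s` but not for shorter or longer ones. The sketch below models the behaviour in plain Python; the helpers `sql_substring` and `to_double` are hypothetical stand-ins for the SQL functions, and the labels `0s` and `100s` are illustrative edge cases.

```python
def sql_substring(s, pos, length):
    """Model of SQL SUBSTRING(s, pos, length) for pos >= 1 (1-based position)."""
    return s[pos - 1 : pos - 1 + length]

def to_double(text):
    """Model of a SQL cast to DOUBLE: unparseable text becomes None (null)."""
    try:
        return float(text)
    except ValueError:
        return None

results = {}
for label in ["50s", "0s", "100s"]:
    fixed_width = to_double(sql_substring(label, 1, 2))             # SUBSTRING(age, 1, 2)
    drop_last = to_double(sql_substring(label, 1, len(label) - 1))  # SUBSTRING(age, 1, LENGTH(age) - 1)
    results[label] = (fixed_width, drop_last)
print(results)
# {'50s': (50.0, 50.0), '0s': (None, 0.0), '100s': (10.0, 100.0)}
```

Dropping only the last character, for example with `SUBSTRING(age, 1, LENGTH(age) - 1)`, handles all three label lengths.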



In [31]:
new_sql = spark.sql("""
                    SELECT sex, age, province, state
                    FROM patients
                    """)
new_sql.show()

+------+---+--------+--------+
|   sex|age|province|   state|
+------+---+--------+--------+
|  male|50s|   Seoul|released|
|  male|30s|   Seoul|released|
|  male|50s|   Seoul|released|
|  male|20s|   Seoul|released|
|female|20s|   Seoul|released|
|female|50s|   Seoul|released|
|  male|20s|   Seoul|released|
|  male|20s|   Seoul|released|
|  male|30s|   Seoul|released|
|female|60s|   Seoul|released|
|female|50s|   Seoul|released|
|  male|20s|   Seoul|released|
|  male|80s|   Seoul|deceased|
|female|60s|   Seoul|released|
|  male|70s|   Seoul|released|
|  male|70s|   Seoul|released|
|  male|70s|   Seoul|released|
|  male|20s|   Seoul|released|
|female|70s|   Seoul|released|
|female|70s|   Seoul|released|
+------+---+--------+--------+
only showing top 20 rows



## Machine Learning 
### Create a pipeline model to predict is_dead and evaluate the performance.
- Use <b>StringIndexer</b> to transform <b>string</b> data type to indices.
- Use <b>OneHotEncoder</b> to deal with categorical values.
- Use <b>Imputer</b> to fill missing data with mean.

In [32]:
df = df.withColumn('is_dead',df.is_dead.cast('Integer'))
# df = df.withColumn('age', df.age.cast('double'))

In [33]:
df.select('is_dead','age').show()

+-------+----+
|is_dead| age|
+-------+----+
|      0|50.0|
|      0|30.0|
|      0|50.0|
|      0|20.0|
|      0|20.0|
|      0|50.0|
|      0|20.0|
|      0|20.0|
|      0|30.0|
|      0|60.0|
|      0|50.0|
|      0|20.0|
|      1|80.0|
|      0|60.0|
|      0|70.0|
|      0|70.0|
|      0|70.0|
|      0|20.0|
|      0|70.0|
|      0|70.0|
+-------+----+
only showing top 20 rows



In [34]:
from pyspark.ml.feature import StringIndexer,OneHotEncoder,VectorAssembler, Imputer

In [35]:
categorical_cols = [field for (field, dataType) in df.dtypes
                    if dataType == 'string' and field != 'is_dead']
categorical_cols

['province', 'is_male']

In [36]:
categorical_output = [s + "_Index" for s in categorical_cols]
categorical_output

['province_Index', 'is_male_Index']

In [37]:
categorical_encoded = [s + "_ohe" for s in categorical_cols]
categorical_encoded

['province_ohe', 'is_male_ohe']

In [38]:
stringIndexer = StringIndexer(inputCols=categorical_cols, outputCols= categorical_output,handleInvalid = 'skip')

ohe = OneHotEncoder(inputCols= categorical_output, outputCols= categorical_encoded)

In [39]:
numerical_cols = [field for (field, dataType) in df.dtypes
                 if dataType in ['int','double'] and field != 'is_dead']
numerical_cols

['age']

In [40]:
imputed_cols =  [x + "imputed" for x in numerical_cols]

In [41]:
imputer = Imputer(strategy='mean', inputCols=numerical_cols, outputCols=imputed_cols)

In [42]:
assemblerInputs = imputed_cols + categorical_encoded
assemblerInputs

['ageimputed', 'province_ohe', 'is_male_ohe']

In [43]:
vecAssembler = VectorAssembler(inputCols=assemblerInputs, outputCol='features')

In [44]:
vecAssembler

VectorAssembler_d46bdc7e5574

### Build models

In [45]:
from pyspark.ml.classification import DecisionTreeClassifier

In [46]:
tree = DecisionTreeClassifier(featuresCol='features',
                              labelCol='is_dead',
                              predictionCol='prediction')

In [47]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(featuresCol='features',
                        labelCol='is_dead',
                        predictionCol='prediction')

### Split the data

In [48]:
from pyspark.ml import Pipeline
trainDF, testDF = df.randomSplit([0.8,0.2],seed=42)

### Create a logistic regression pipeline

In [49]:
myStages = [stringIndexer, ohe, imputer, vecAssembler,lr]
pipeline = Pipeline(stages=myStages)
pipelineModel = pipeline.fit(trainDF)
predDF = pipelineModel.transform(testDF)

In [50]:
predDF.show()

+----+-----------+-------+-------+--------------+-------------+---------------+-------------+------------------+--------------------+--------------------+--------------------+----------+
| age|   province|is_male|is_dead|province_Index|is_male_Index|   province_ohe|  is_male_ohe|        ageimputed|            features|       rawPrediction|         probability|prediction|
+----+-----------+-------+-------+--------------+-------------+---------------+-------------+------------------+--------------------+--------------------+--------------------+----------+
|null| Gangwon-do|  False|      0|          10.0|          0.0|(16,[10],[1.0])|(1,[0],[1.0])|40.085978835978835|(18,[0,11,17],[40...|[0.74231117298496...|[0.67750103925742...|       0.0|
|null|Gyeonggi-do|  False|      1|           2.0|          0.0| (16,[2],[1.0])|(1,[0],[1.0])|40.085978835978835|(18,[0,3,17],[40....|[-3.1704535654030...|[0.04029287268268...|       1.0|
|null|Gyeonggi-do|  False|      1|           2.0|          0.0| (
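`VectorAssembler` concatenates its inputs in the order given, so each position in `features` can be traced back to a source column. The plain-Python sketch below reproduces the layout from the vector sizes visible in the output above; the variable names are illustrative.

```python
# Vector sizes read from the predDF output: ageimputed is a scalar,
# province_ohe has 16 slots, is_male_ohe has 1 slot
blocks = [("ageimputed", 1), ("province_ohe", 16), ("is_male_ohe", 1)]

offsets, start = {}, 0
for name, size in blocks:
    offsets[name] = start  # first slot of this block inside `features`
    start += size
total_size = start

print(total_size)  # 18
# First row: province_Index 10.0 and is_male_Index 0.0
first_row_slots = (offsets["province_ohe"] + 10, offsets["is_male_ohe"] + 0)
print(first_row_slots)  # (11, 17)
```

This matches the active indices `[0,11,17]` of the first row. By default `OneHotEncoder` drops the last category (`dropLast=True`), which is why the two-valued `is_male` column occupies a single slot.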

### Model evaluation

In [51]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
regeval_acc = MulticlassClassificationEvaluator(predictionCol='prediction',labelCol='is_dead', metricName = 'accuracy')

In [52]:
regeval_acc.evaluate(predDF)

0.8268268268268268

### Pipeline for decision tree

In [53]:
myStages = [stringIndexer, ohe, imputer, vecAssembler,tree]
pipeline = Pipeline(stages=myStages)
pipelineModel = pipeline.fit(trainDF)
predDF_tree = pipelineModel.transform(testDF)

In [54]:
tree_eval_acc = MulticlassClassificationEvaluator(predictionCol='prediction',labelCol='is_dead', metricName = 'accuracy')

### Decision tree evaluation

In [55]:
tree_eval_acc.evaluate(predDF_tree)

0.8378378378378378
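To put both accuracy figures in context, it helps to compare them with a majority-class baseline, a model that always predicts the more common label. The sketch below is a rough estimate in plain Python from the state counts on the full dataset; the test split may have a slightly different class balance, so treat the number as approximate.

```python
# State counts from the groupBy output earlier in the lab
state_counts = {"isolated": 2158, "released": 2929, "deceased": 78}

total = sum(state_counts.values())
positives = total - state_counts["released"]  # is_dead == 1 (isolated or deceased)
baseline_accuracy = max(positives, total - positives) / total

print(total, positives, round(baseline_accuracy, 3))  # 5165 2236 0.567
```

Both the logistic regression (about 0.83) and the decision tree (about 0.84) clearly beat this baseline.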