# **Lab 1 PySpark:** 


In these labs we will be using the "[[NeurIPS 2020] Data Science for COVID-19 (DS4C)](https://www.kaggle.com/datasets/kimjihoo/coronavirusdataset?select=PatientInfo.csv)" dataset, retrieved from [Kaggle](https://www.kaggle.com/) on 1/6/2022, for educational, non-commercial purposes. License: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)


The CSV file that we will be using in this lab is **PatientInfo**.



# File 📁

## PatientInfo.csv

**patient_id**
the ID of the patient

**sex**
the sex of the patient

**age**
the age of the patient

**country**
the country of the patient

**province**
the province of the patient

**city**
the city of the patient

**infection_case**
the case of infection

**infected_by**
the ID of who infected the patient


**contact_number**
the number of contacts with people

**symptom_onset_date**
the date of symptom onset

**confirmed_date**
the date of being confirmed

**released_date**
the date of being released

**deceased_date**
the date of being deceased

**state**
isolated / released / deceased

# Installing PySpark:


Kindly install pyspark ⌨

In [1]:
%pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Check the pyspark version

In [2]:
!pyspark --version

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.1
      /_/
                        
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.15
Branch HEAD
Compiled by user hgao on 2022-01-20T19:26:14Z
Revision 4f25b3f71238a00508a356591553f2dfa89f8290
Url https://github.com/apache/spark
Type --help for more information.


Import SparkSession, then create a Spark session

In [3]:
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext



In [4]:
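spark = SparkSession.builder.appName("Lab1_G2").getOrCreate()
# SparkContext on its own is the class, not an instance; take the live context from the session.
sc = spark.sparkContext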

## Importing libraries needed

Import the libraries that you may need below

In [5]:
import pandas as pd
import numpy as np


In [6]:
# Explicit imports instead of `import *`, which would shadow Python built-ins such as max and sum.
from pyspark.sql.functions import (
    coalesce, col, count, datediff, expr, isnan, lit, regexp_replace, when,
)

# Follow the instructions above the cells below:


Kindly load the PatientInfo.csv file and show the first 5 rows

❌ PLEASE do NOT change the cell below 

In [7]:
!gdown https://drive.google.com/uc?id=1PB6wBDVTM_eocxOyi0lWlLBQOlH0rLe_ -O PatientInfo.csv

Downloading...
From: https://drive.google.com/uc?id=1PB6wBDVTM_eocxOyi0lWlLBQOlH0rLe_
To: /content/PatientInfo.csv
  0% 0.00/489k [00:00<?, ?B/s]100% 489k/489k [00:00<00:00, 91.4MB/s]


Load the downloaded **PatientInfo.csv** file into a dataframe called **patient**

In [8]:
patient = spark.read.csv("PatientInfo.csv",header=True)

Show the first 5 rows

In [9]:
patient.head(5)

[Row(patient_id='1000000001', sex='male', age='50s', country='Korea', province='Seoul', city='Gangseo-gu', infection_case='overseas inflow', infected_by=None, contact_number='75', symptom_onset_date='2020-01-22', confirmed_date='2020-01-23', released_date='2020-02-05', deceased_date=None, state='released'),
 Row(patient_id='1000000002', sex='male', age='30s', country='Korea', province='Seoul', city='Jungnang-gu', infection_case='overseas inflow', infected_by=None, contact_number='31', symptom_onset_date=None, confirmed_date='2020-01-30', released_date='2020-03-02', deceased_date=None, state='released'),
 Row(patient_id='1000000003', sex='male', age='50s', country='Korea', province='Seoul', city='Jongno-gu', infection_case='contact with patient', infected_by='2002000001', contact_number='17', symptom_onset_date=None, confirmed_date='2020-01-30', released_date='2020-02-19', deceased_date=None, state='released'),
 Row(patient_id='1000000004', sex='male', age='20s', country='Korea', provin

Display the schema for the dataset

In [10]:
patient.printSchema()

root
 |-- patient_id: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: string (nullable = true)
 |-- country: string (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- infected_by: string (nullable = true)
 |-- contact_number: string (nullable = true)
 |-- symptom_onset_date: string (nullable = true)
 |-- confirmed_date: string (nullable = true)
 |-- released_date: string (nullable = true)
 |-- deceased_date: string (nullable = true)
 |-- state: string (nullable = true)



Display the statistical summary

In [11]:
patient.describe().show()

+-------+--------------------+------+----+----------+--------+--------------+--------------------+--------------------+--------------------+------------------+--------------+-------------+-------------+--------+
|summary|          patient_id|   sex| age|   country|province|          city|      infection_case|         infected_by|      contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|
+-------+--------------------+------+----+----------+--------+--------------+--------------------+--------------------+--------------------+------------------+--------------+-------------+-------------+--------+
|  count|                5165|  4043|3785|      5165|    5165|          5071|                4246|                1346|                 791|               690|          5162|         1587|           66|    5165|
|   mean|2.8636345618679576E9|  null|null|      null|    null|          null|                null|2.2845944015643125E9|1.6772572523506988E7|            

How many people survived, and how many didn't survive?

In [12]:
patient.where(patient.state == 'released').select(count("state").alias("Survived")).show()

+--------+
|Survived|
+--------+
|    2929|
+--------+



In [13]:
patient.select(count("released_date").alias("Survived")).show()

+--------+
|Survived|
+--------+
|    1587|
+--------+



In [14]:
patient.select(count("deceased_date").alias("Unsurvived")).show()

+----------+
|Unsurvived|
+----------+
|        66|
+----------+



In [15]:
patient.where(patient.state == 'deceased').select(count("state").alias("Unsurvived")).show()

+----------+
|Unsurvived|
+----------+
|        78|
+----------+



Display the number of null values in each column

In [16]:
# Earlier attempt at the null count, kept for reference; the next cell is the working version.

# patient.select([count(when((
#                            col(single_column).contains('None') |\
#                            col(single_column).contains('Null') |\
#                            isnan(single_column) | \
#                            col(single_column).isnull() , col)) \
#                           .alias(single_column)\
#                            for single_column in patient.columns]).show()

In [17]:
patient.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in patient.columns]).show()

+----------+----+----+-------+--------+----+--------------+-----------+--------------+------------------+--------------+-------------+-------------+-----+
|patient_id| sex| age|country|province|city|infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|state|
+----------+----+----+-------+--------+----+--------------+-----------+--------------+------------------+--------------+-------------+-------------+-----+
|         0|1122|1380|      0|       0|  94|           919|       3819|          4374|              4475|             3|         3578|         5099|    0|
+----------+----+----+-------+--------+----+--------------+-----------+--------------+------------------+--------------+-------------+-------------+-----+



# Data preprocessing

Kindly fill the nulls in the **deceased_date** with the **released_date**

In [18]:
from pyspark.sql.functions import coalesce
patient = patient.withColumn("deceased_date",coalesce(patient.deceased_date,patient.released_date))

Drop the rows that still have a null deceased_date

In [19]:
patient = patient.na.drop(subset=['deceased_date'])
patient.show()

+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|patient_id|   sex|age|country|province|        city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|
+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|1000000001|  male|50s|  Korea|   Seoul|  Gangseo-gu|     overseas inflow|       null|            75|        2020-01-22|    2020-01-23|   2020-02-05|   2020-02-05|released|
|1000000002|  male|30s|  Korea|   Seoul| Jungnang-gu|     overseas inflow|       null|            31|              null|    2020-01-30|   2020-03-02|   2020-03-02|released|
|1000000003|  male|50s|  Korea|   Seoul|   Jongno-gu|contact with patient| 2002000001|            17|              null|    2020-01-30|

Add a column named no_days, which is the difference in days between the deceased_date and the confirmed_date, then show the top 5 rows

Hint: You need to typecast these columns as dates first

In [20]:
from pyspark.sql.types import TimestampType
patient = patient.withColumn("deceased_date",col("deceased_date").cast(TimestampType())).withColumn("confirmed_date",col("confirmed_date").cast(TimestampType()))
patient.printSchema()

root
 |-- patient_id: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: string (nullable = true)
 |-- country: string (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- infected_by: string (nullable = true)
 |-- contact_number: string (nullable = true)
 |-- symptom_onset_date: string (nullable = true)
 |-- confirmed_date: timestamp (nullable = true)
 |-- released_date: string (nullable = true)
 |-- deceased_date: timestamp (nullable = true)
 |-- state: string (nullable = true)



In [21]:
patient = patient.withColumn("no_days",datediff(col("deceased_date"),col("confirmed_date")))
patient.show(5)

+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+-------------------+-------------+-------------------+--------+-------+
|patient_id|   sex|age|country|province|       city|      infection_case|infected_by|contact_number|symptom_onset_date|     confirmed_date|released_date|      deceased_date|   state|no_days|
+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+-------------------+-------------+-------------------+--------+-------+
|1000000001|  male|50s|  Korea|   Seoul| Gangseo-gu|     overseas inflow|       null|            75|        2020-01-22|2020-01-23 00:00:00|   2020-02-05|2020-02-05 00:00:00|released|     13|
|1000000002|  male|30s|  Korea|   Seoul|Jungnang-gu|     overseas inflow|       null|            31|              null|2020-01-30 00:00:00|   2020-03-02|2020-03-02 00:00:00|released|     32|
|1000000003|  male|50s|  Korea|   Seoul|  Jon

Add an **is_male** column that is 1 if the patient is male and 0 otherwise, then show your dataframe

In [22]:
patient = patient.withColumn("is_male",when((col("sex") == 'male'),lit(1)).otherwise(lit(0)))
patient.show(5)

+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+-------------------+-------------+-------------------+--------+-------+-------+
|patient_id|   sex|age|country|province|       city|      infection_case|infected_by|contact_number|symptom_onset_date|     confirmed_date|released_date|      deceased_date|   state|no_days|is_male|
+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+-------------------+-------------+-------------------+--------+-------+-------+
|1000000001|  male|50s|  Korea|   Seoul| Gangseo-gu|     overseas inflow|       null|            75|        2020-01-22|2020-01-23 00:00:00|   2020-02-05|2020-02-05 00:00:00|released|     13|      1|
|1000000002|  male|30s|  Korea|   Seoul|Jungnang-gu|     overseas inflow|       null|            31|              null|2020-01-30 00:00:00|   2020-03-02|2020-03-02 00:00:00|released|     32|      1|
|1000

Add an **is_dead** column that is 1 if the patient state is deceased and 0 otherwise, then show your dataframe

In [23]:
patient = patient.withColumn("is_dead",when((col("state") == 'deceased'),lit(1)).otherwise(lit(0)))
patient.show(5)

+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+-------------------+-------------+-------------------+--------+-------+-------+-------+
|patient_id|   sex|age|country|province|       city|      infection_case|infected_by|contact_number|symptom_onset_date|     confirmed_date|released_date|      deceased_date|   state|no_days|is_male|is_dead|
+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+-------------------+-------------+-------------------+--------+-------+-------+-------+
|1000000001|  male|50s|  Korea|   Seoul| Gangseo-gu|     overseas inflow|       null|            75|        2020-01-22|2020-01-23 00:00:00|   2020-02-05|2020-02-05 00:00:00|released|     13|      1|      0|
|1000000002|  male|30s|  Korea|   Seoul|Jungnang-gu|     overseas inflow|       null|            31|              null|2020-01-30 00:00:00|   2020-03-02|2020-03-02 00:00:00

In [24]:
patient.printSchema()

root
 |-- patient_id: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: string (nullable = true)
 |-- country: string (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- infected_by: string (nullable = true)
 |-- contact_number: string (nullable = true)
 |-- symptom_onset_date: string (nullable = true)
 |-- confirmed_date: timestamp (nullable = true)
 |-- released_date: string (nullable = true)
 |-- deceased_date: timestamp (nullable = true)
 |-- state: string (nullable = true)
 |-- no_days: integer (nullable = true)
 |-- is_male: integer (nullable = false)
 |-- is_dead: integer (nullable = false)



Kindly change the age bins from 0s, 10s, 20s, etc. to 0, 10, 20, etc. and show your dataframe

In [25]:
from pyspark.sql.types import DoubleType
patient = patient.withColumn("age",regexp_replace(col("age"),'s',''))
patient = patient.withColumn("age",col("age").cast(DoubleType()))
patient.show(5)

+----------+------+----+-------+--------+-----------+--------------------+-----------+--------------+------------------+-------------------+-------------+-------------------+--------+-------+-------+-------+
|patient_id|   sex| age|country|province|       city|      infection_case|infected_by|contact_number|symptom_onset_date|     confirmed_date|released_date|      deceased_date|   state|no_days|is_male|is_dead|
+----------+------+----+-------+--------+-----------+--------------------+-----------+--------------+------------------+-------------------+-------------+-------------------+--------+-------+-------+-------+
|1000000001|  male|50.0|  Korea|   Seoul| Gangseo-gu|     overseas inflow|       null|            75|        2020-01-22|2020-01-23 00:00:00|   2020-02-05|2020-02-05 00:00:00|released|     13|      1|      0|
|1000000002|  male|30.0|  Korea|   Seoul|Jungnang-gu|     overseas inflow|       null|            31|              null|2020-01-30 00:00:00|   2020-03-02|2020-03-02 00:

Repeat the same for **no_days**, keeping the numbers only, and show your dataframe. (**no_days** comes from `datediff`, so it is already an integer and has no text to strip.)

Print the schema

In [26]:
patient.printSchema()

root
 |-- patient_id: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: double (nullable = true)
 |-- country: string (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- infected_by: string (nullable = true)
 |-- contact_number: string (nullable = true)
 |-- symptom_onset_date: string (nullable = true)
 |-- confirmed_date: timestamp (nullable = true)
 |-- released_date: string (nullable = true)
 |-- deceased_date: timestamp (nullable = true)
 |-- state: string (nullable = true)
 |-- no_days: integer (nullable = true)
 |-- is_male: integer (nullable = false)
 |-- is_dead: integer (nullable = false)



**age** and **no_days** need to be typecast to Double

In [27]:
patient = patient.withColumn("no_days",col("no_days").cast(DoubleType()))

Reprint the schema to check:

In [28]:
patient.printSchema()

root
 |-- patient_id: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: double (nullable = true)
 |-- country: string (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- infected_by: string (nullable = true)
 |-- contact_number: string (nullable = true)
 |-- symptom_onset_date: string (nullable = true)
 |-- confirmed_date: timestamp (nullable = true)
 |-- released_date: string (nullable = true)
 |-- deceased_date: timestamp (nullable = true)
 |-- state: string (nullable = true)
 |-- no_days: double (nullable = true)
 |-- is_male: integer (nullable = false)
 |-- is_dead: integer (nullable = false)



Kindly drop the columns below

 ["patient_id","sex","infected_by","contact_number","released_date","state","symptom_onset_date","confirmed_date","deceased_date","country","no_days","city","infection_case"]

In [32]:
patient = patient.drop("patient_id","sex","infected_by","contact_number","released_date","state","symptom_onset_date","confirmed_date","deceased_date","country","no_days","city","infection_case")
patient.show()

+----+--------+-------+-------+
| age|province|is_male|is_dead|
+----+--------+-------+-------+
|50.0|   Seoul|      1|      0|
|30.0|   Seoul|      1|      0|
|50.0|   Seoul|      1|      0|
|20.0|   Seoul|      1|      0|
|20.0|   Seoul|      0|      0|
|50.0|   Seoul|      0|      0|
|20.0|   Seoul|      1|      0|
|20.0|   Seoul|      1|      0|
|30.0|   Seoul|      1|      0|
|60.0|   Seoul|      0|      0|
|50.0|   Seoul|      0|      0|
|20.0|   Seoul|      1|      0|
|60.0|   Seoul|      0|      0|
|70.0|   Seoul|      1|      0|
|70.0|   Seoul|      1|      0|
|70.0|   Seoul|      0|      0|
|80.0|   Seoul|      1|      0|
|40.0|   Seoul|      1|      0|
|30.0|   Seoul|      1|      0|
|50.0|   Seoul|      1|      0|
+----+--------+-------+-------+
only showing top 20 rows



Please recount the number of nulls now

In [35]:
patient.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in patient.columns]).show()


+---+--------+-------+-------+
|age|province|is_male|is_dead|
+---+--------+-------+-------+
| 11|       0|      0|      0|
+---+--------+-------+-------+



We will handle the null values in the next lab using an imputer; look it up if you would like.

Best of luck dears 🤩🤩🤩

If you have any questions you can reach out to me:

### Omar Hammad
#### Software Engineer
##### Email: ommar365@gmail.com
##### Phone: 01144070145
##### Linkedin: https://www.linkedin.com/in/omar-a-hammad