# Data Description and Project Goals

## Abstract

This project uses a Covid-19 dataset that contains over 1 million unique patients and 21 columns. The main features of the
dataset include patient type, age, pneumonia, COPD, and diabetes, among others. We first checked that the data was clean and
generated charts of the distribution of each feature to look for imbalances. We also had to generate a new column that provides
a binary label for whether a patient had died from Covid-19. The main goal of this project was to train a supervised machine
learning model that determines whether a patient is at risk given their current symptoms, status, and medical history.

# Data Preparation and Feature Engineering

## Creating Spark Session

In [1]:
# Create PySpark session

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Covid19").getOrCreate()

## Loading the Dataset

In [5]:
# Load the CSV dataset
df = spark.read.csv("covid_data.csv", header=True, inferSchema=True)

# Show the first 5 rows
df.show(5)

# Print the schema
df.printSchema()

+-----+------------+---+------------+----------+-------+---------+---+--------+--------+----+------+-------+------------+-------------+--------------+-------+-------------+-------+--------------------+---+
|USMER|MEDICAL_UNIT|SEX|PATIENT_TYPE| DATE_DIED|INTUBED|PNEUMONIA|AGE|PREGNANT|DIABETES|COPD|ASTHMA|INMSUPR|HIPERTENSION|OTHER_DISEASE|CARDIOVASCULAR|OBESITY|RENAL_CHRONIC|TOBACCO|CLASIFFICATION_FINAL|ICU|
+-----+------------+---+------------+----------+-------+---------+---+--------+--------+----+------+-------+------------+-------------+--------------+-------+-------------+-------+--------------------+---+
|    2|           1|  1|           1|03/05/2020|     97|        1| 65|       2|       2|   2|     2|      2|           1|            2|             2|      2|            2|      2|                   3| 97|
|    2|           1|  2|           1|03/06/2020|     97|        1| 72|      97|       2|   2|     2|      2|           1|            2|             2|      1|            1|    

## Feature Engineering

In [None]:
# Create a column 'hasDied' that is set to 1 if the patient has died, 0 otherwise
from pyspark.sql.functions import when

# Placeholder stored in DATE_DIED when no death date is recorded
default_date = "9999-99-99"

df = df.withColumn("hasDied", when(df["DATE_DIED"] == default_date, 0).otherwise(1))

df.show(5)

+-----+------------+---+------------+----------+-------+---------+---+--------+--------+----+------+-------+------------+-------------+--------------+-------+-------------+-------+--------------------+---+-------+
|USMER|MEDICAL_UNIT|SEX|PATIENT_TYPE| DATE_DIED|INTUBED|PNEUMONIA|AGE|PREGNANT|DIABETES|COPD|ASTHMA|INMSUPR|HIPERTENSION|OTHER_DISEASE|CARDIOVASCULAR|OBESITY|RENAL_CHRONIC|TOBACCO|CLASIFFICATION_FINAL|ICU|hasDied|
+-----+------------+---+------------+----------+-------+---------+---+--------+--------+----+------+-------+------------+-------------+--------------+-------+-------------+-------+--------------------+---+-------+
|    2|           1|  1|           1|03/05/2020|     97|        1| 65|       2|       2|   2|     2|      2|           1|            2|             2|      2|            2|      2|                   3| 97|      1|
|    2|           1|  2|           1|03/06/2020|     97|        1| 72|      97|       2|   2|     2|      2|           1|            2|         
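
The labeling rule used for `hasDied` can be checked on plain Python values before running it on the full dataset. The following is a minimal sketch, not the notebook's own code: `has_died` and `sample_dates` are hypothetical names introduced for illustration, and the only assumption carried over from the cell above is that `"9999-99-99"` is the placeholder for a missing death date.

```python
# Placeholder assumed from the feature engineering cell: DATE_DIED holds
# this value when no death date is recorded.
DEFAULT_DATE = "9999-99-99"


def has_died(date_died, default_date=DEFAULT_DATE):
    """Return 0 if DATE_DIED equals the placeholder, 1 otherwise."""
    return 0 if date_died == default_date else 1


# Hand-written sample values in the two formats seen in DATE_DIED.
sample_dates = ["03/05/2020", "9999-99-99", "03/06/2020"]
labels = [has_died(d) for d in sample_dates]
print(labels)  # [1, 0, 1]
```

One consequence of writing the rule this way is that any value other than the placeholder, including a missing value, is labeled 1. The Spark expression `when(...).otherwise(1)` behaves the same way for nulls, so it may be worth confirming that `DATE_DIED` contains no nulls before relying on the label.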

## Data Exploration
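
One way to look for the imbalances mentioned in the abstract is to compare the share of rows in each `hasDied` class. The sketch below assumes the per-class counts have already been collected from Spark into a dictionary; `class_shares` is a hypothetical helper, and the counts shown are placeholders for illustration, not figures from the dataset.

```python
def class_shares(counts):
    """Convert per-class row counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no rows to summarize")
    return {label: n / total for label, n in counts.items()}


# Placeholder counts for illustration only; not taken from the dataset.
# With Spark, the real counts could be gathered as:
#   rows = df.groupBy("hasDied").count().collect()
#   counts = {row["hasDied"]: row["count"] for row in rows}
example_counts = {0: 90, 1: 10}
shares = class_shares(example_counts)
print(shares)  # {0: 0.9, 1: 0.1}
```

A heavily skewed split would suggest looking at metrics other than plain accuracy when the model is evaluated later.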

# Machine Learning Algorithm Preparation and Tuning

# Model Evaluation and Visualization

# Limitations, Future Work, and Conclusion