To start the EDA, I import the libraries I will use: Pandas for loading and manipulating tabular data, NumPy for numerical operations, the standard-library statistics module for basic descriptive statistics, and Matplotlib and Seaborn for plotting.

In [1]:
import pandas as pd
import numpy as np
import statistics as stats
import matplotlib.pyplot as plt
import seaborn as sns

## Import dataset

I used the 'pd.read_csv' function to load the file that I want to work with.
The warning indicates that Pandas encountered columns whose data types are inconsistent, meaning that strings and numbers appear within the same column. I will handle this during the cleaning process.

In [2]:
df = pd.read_csv('2021VAERSDATA.csv', encoding='ISO-8859-1')
orig_df = df.copy()

  df = pd.read_csv('2021VAERSDATA.csv', encoding='ISO-8859-1')
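A mixed-type warning of this kind can be investigated by checking which columns hold more than one Python type. The sketch below uses a small made-up DataFrame rather than the VAERS file, and `mixed_type_columns` is a hypothetical helper name, not a Pandas function.

```python
import pandas as pd


def mixed_type_columns(frame):
    """Return the columns whose non-null values have more than one Python type."""
    mixed = []
    for col in frame.columns:
        # Count the distinct Python types among the non-missing values.
        n_types = frame[col].dropna().map(type).nunique()
        if n_types > 1:
            mixed.append(col)
    return mixed


# Made-up example: 'CODE' mixes integers and strings, 'ID' does not.
demo = pd.DataFrame({"ID": [1, 2, 3], "CODE": [10, "abc", 30]})
print(mixed_type_columns(demo))
```

Once such columns are known, passing an explicit `dtype` mapping to `pd.read_csv` (for example `dtype={'CODE': str}`) is one way to make the parsed type consistent.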


In [3]:
df.head()

Unnamed: 0,VAERS_ID,RECVDATE,STATE,AGE_YRS,CAGE_YR,CAGE_MO,SEX,RPT_DATE,SYMPTOM_TEXT,DIED,...,CUR_ILL,HISTORY,PRIOR_VAX,SPLTTYPE,FORM_VERS,TODAYS_DATE,BIRTH_DEFECT,OFC_VISIT,ER_ED_VISIT,ALLERGIES
0,916600,01/01/2021,TX,33.0,33.0,,F,,Right side of epiglottis swelled up and hinder...,,...,,,,,2,01/01/2021,,Y,,Pcn and bee venom
1,916601,01/01/2021,CA,73.0,73.0,,F,,Approximately 30 min post vaccination administ...,,...,Patient residing at nursing facility. See pati...,Patient residing at nursing facility. See pati...,,,2,01/01/2021,,Y,,"""Dairy"""
2,916602,01/01/2021,WA,23.0,23.0,,F,,"About 15 minutes after receiving the vaccine, ...",,...,,,,,2,01/01/2021,,,Y,Shellfish
3,916603,01/01/2021,WA,58.0,58.0,,F,,"extreme fatigue, dizziness,. could not lift my...",,...,kidney infection,"diverticulitis, mitral valve prolapse, osteoar...","got measles from measel shot, mums from mumps ...",,2,01/01/2021,,,,"Diclofenac, novacaine, lidocaine, pickles, tom..."
4,916604,01/01/2021,TX,47.0,47.0,,F,,"Injection site swelling, redness, warm to the ...",,...,Na,,,,2,01/01/2021,,,,Na


I used the df.shape attribute to get an idea of the dataset size: it has 34121 rows and 35 columns.

In [4]:
df.shape

(34121, 35)

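With the describe() method, I can compare the mean and the median (the 50% value) of each numeric column. 'NUMDAYS' is strongly right-skewed (mean 21.08 against a median of 1.0, with a maximum of 36896), and 'HOSPDAYS' is mildly right-skewed (mean 3.75 against a median of 3.0). For 'AGE_YRS' and 'CAGE_YR' the mean and median are close, which suggests roughly symmetric distributions, although this comparison alone does not show that they are normally distributed. Next I will start the cleaning process, but first I will select the main columns that can be useful for the project.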

In [5]:
df.describe()

Unnamed: 0,VAERS_ID,AGE_YRS,CAGE_YR,CAGE_MO,HOSPDAYS,NUMDAYS,FORM_VERS
count,34121.0,30933.0,26791.0,83.0,2857.0,31194.0,34121.0
mean,981306.6,51.471923,51.135381,0.084337,3.752888,21.077066,1.998124
std,62045.35,18.521742,18.633316,0.178395,3.878654,644.8344,0.043269
min,916600.0,0.08,0.0,0.0,1.0,0.0,1.0
25%,926464.0,37.0,36.0,0.0,1.0,0.0,2.0
50%,946837.0,50.0,49.0,0.0,3.0,1.0,2.0
75%,1047069.0,65.0,65.0,0.0,5.0,3.0,2.0
max,1115348.0,115.0,106.0,0.7,39.0,36896.0,2.0
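The mean-versus-median comparison can be complemented with the sample skewness. This is a minimal sketch on made-up values with a long right tail (not the VAERS data), using the Pandas `Series.skew()` method.

```python
import pandas as pd

# Made-up values with a long right tail, loosely mimicking a day-count column.
days = pd.Series([0, 0, 1, 1, 2, 3, 50])

mean_value = days.mean()
median_value = days.median()
skewness = days.skew()  # sample skewness; positive means a longer right tail

print(f"mean={mean_value:.2f}, median={median_value:.2f}, skew={skewness:.2f}")
```

A mean well above the median together with a positive skewness points to a right-skewed column. A skewness near zero is compatible with symmetry but does not by itself establish normality.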


## Clean data

To start the cleaning process, I use the df.info() method because it provides a quick overview of the DataFrame's structure, including each column's data type and non-null count. The non-null counts show that many columns have missing values: for example, 'STATE' has 28550 non-null entries out of 34121, and 'CAGE_MO' has only 83.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34121 entries, 0 to 34120
Data columns (total 35 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   VAERS_ID      34121 non-null  int64  
 1   RECVDATE      34121 non-null  object 
 2   STATE         28550 non-null  object 
 3   AGE_YRS       30933 non-null  float64
 4   CAGE_YR       26791 non-null  float64
 5   CAGE_MO       83 non-null     float64
 6   SEX           34121 non-null  object 
 7   RPT_DATE      63 non-null     object 
 8   SYMPTOM_TEXT  34119 non-null  object 
 9   DIED          1957 non-null   object 
 10  DATEDIED      1798 non-null   object 
 11  L_THREAT      1259 non-null   object 
 12  ER_VISIT      11 non-null     object 
 13  HOSPITAL      4387 non-null   object 
 14  HOSPDAYS      2857 non-null   float64
 15  X_STAY        52 non-null     object 
 16  DISABLE       870 non-null    object 
 17  RECOVD        31264 non-null  object 
 18  VAX_DATE      32622 non-nu

The isnull() method below returns True where a value is NaN (Not a Number) and False otherwise. Chaining .sum() then counts the missing values in each column.

The VAERS (Vaccine Adverse Event Reporting System) data use guide contains essential information about this dataset, such as how it was created and filled in. For example, in the 'DIED' column the letter "Y" indicates that the patient died; otherwise the field is left blank. That is why this dataset contains so many NaN values. In this case, I will convert the NaN values to zeros to represent the absence of an occurrence.

Example of data use guide:

DIED: If the vaccine recipient died a "Y" is used; otherwise the field will be blank.

In [7]:
df.isnull().sum()

VAERS_ID            0
RECVDATE            0
STATE            5571
AGE_YRS          3188
CAGE_YR          7330
CAGE_MO         34038
SEX                 0
RPT_DATE        34058
SYMPTOM_TEXT        2
DIED            32164
DATEDIED        32323
L_THREAT        32862
ER_VISIT        34110
HOSPITAL        29734
HOSPDAYS        31264
X_STAY          34069
DISABLE         33251
RECOVD           2857
VAX_DATE         1499
ONSET_DATE       1863
NUMDAYS          2927
LAB_DATA        19041
V_ADMINBY           0
V_FUNDBY        34057
OTHER_MEDS      13882
CUR_ILL         18052
HISTORY         11746
PRIOR_VAX       32687
SPLTTYPE        25898
FORM_VERS           0
TODAYS_DATE       199
BIRTH_DEFECT    34070
OFC_VISIT       28717
ER_ED_VISIT     28592
ALLERGIES       15534
dtype: int64
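The "Y or blank" convention from the data use guide can be illustrated on a tiny made-up column. This is only a sketch: `DIED_FLAG` is a hypothetical column name introduced for the example, and the values are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up flag column following the "Y or blank" convention.
flags = pd.DataFrame({"DIED": ["Y", np.nan, np.nan, "Y"]})

# Share of missing values per column (blank fields are read as NaN).
missing_share = flags.isnull().mean()

# 1 where the flag is "Y", 0 where the field was blank.
flags["DIED_FLAG"] = flags["DIED"].eq("Y").astype(int)

print(missing_share)
print(flags)
```

Using `.isnull().mean()` instead of `.isnull().sum()` gives the fraction of missing values, which is easier to compare across columns.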

As this dataset is large, I will use the .value_counts() method on the 'STATE' column to see how many records are in each category and focus on one state that satisfies the minimum requirement of the project. Then, I will create a subset with this category.

The output shows that I can use CA alone or combine more than one state. I prefer to work with CA because it is a cosmopolitan place that might have people from different parts of the world, and I imagine that the results could be interesting.

In [8]:
category_counts = df['STATE'].value_counts()
print(category_counts)

CA    2577
TX    1807
NY    1783
FL    1654
IN    1142
IL    1135
OH    1072
PA    1012
MI     921
MA     842
NJ     836
NC     752
VA     727
MD     693
AZ     659
WI     649
WA     630
GA     628
CO     617
MN     604
MO     556
TN     526
CT     524
KY     443
OR     363
IA     336
OK     333
LA     317
KS     295
AL     295
SC     286
AR     266
NE     263
MT     262
ME     235
UT     229
NM     229
WV     220
NH     203
NV     197
PR     195
AK     167
ID     157
MS     150
HI     134
SD     116
RI     112
VT     105
ND     101
DE      66
DC      55
WY      53
GU       7
VI       4
AS       3
MP       3
XB       1
Ca       1
MH       1
FM       1
Name: STATE, dtype: int64
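The counts above list 'Ca' separately from 'CA', which suggests a casing variant of the same state. A minimal sketch of how such variants could be collapsed before counting, on made-up values (the padded ' ca ' entry is invented for illustration):

```python
import pandas as pd

# Made-up sample containing a casing variant like the 'Ca' entry above.
states = pd.Series(["CA", "Ca", "TX", " ca ", None])

# Strip whitespace and upper-case so that variants collapse into one category.
normalized = states.str.strip().str.upper()

print(normalized.value_counts())
```

Missing values stay missing after these string operations, and `value_counts()` leaves them out by default.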


Below, I create the subset with the CA category and work with it from now on. This subset is named 'df1'.

In [11]:
by_category = df.groupby('STATE')
df1 = by_category.get_group('CA')
df1.head()

Unnamed: 0,VAERS_ID,RECVDATE,STATE,AGE_YRS,CAGE_YR,CAGE_MO,SEX,RPT_DATE,SYMPTOM_TEXT,DIED,...,CUR_ILL,HISTORY,PRIOR_VAX,SPLTTYPE,FORM_VERS,TODAYS_DATE,BIRTH_DEFECT,OFC_VISIT,ER_ED_VISIT,ALLERGIES
1,916601,01/01/2021,CA,73.0,73.0,,F,,Approximately 30 min post vaccination administ...,,...,Patient residing at nursing facility. See pati...,Patient residing at nursing facility. See pati...,,,2,01/01/2021,,Y,,"""Dairy"""
11,916613,01/01/2021,CA,40.0,40.0,,F,,On 12/30/2020 I got a pain in the stomach as i...,,...,,,,,2,01/01/2021,,,,
14,916617,01/01/2021,CA,35.0,35.0,,F,,"Dizziness, chills, fever, muscle aches, pain a...",,...,,Depression,,,2,01/01/2021,,,,
18,916621,01/01/2021,CA,25.0,25.0,,F,,Fatigue - 2 hours prior. Muscle aches/pain - 3...,,...,,None.,,,2,01/01/2021,,,,Ceftiaxone (Rocephin)
84,916690,01/01/2021,CA,37.0,37.0,,M,,Typical sore arm similar to flu shot. Followin...,,...,,,,,2,01/01/2021,,,,Amoxicillin


In [12]:
df1.shape

(2577, 35)
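The same kind of subset can also be taken with a boolean mask, which is a common alternative to groupby followed by get_group. The sketch below compares the two on a made-up DataFrame. The variable names are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({
    "STATE": ["CA", "TX", "CA", None],
    "AGE_YRS": [73.0, 33.0, 40.0, 58.0],
})

# Boolean mask: keep rows whose STATE equals 'CA'; .copy() gives an independent frame.
subset_mask = demo[demo["STATE"] == "CA"].copy()

# Same rows via groupby/get_group, as used in the cell above.
subset_group = demo.groupby("STATE").get_group("CA")

print(subset_mask.index.tolist())
print(subset_group.index.tolist())
```

Taking an explicit copy makes it clear that later in-place changes to the subset are not meant to affect the original DataFrame.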

I will drop the columns I do not need, because they could generate errors and waste time. After dropping them, I will name the resulting dataset 'df2'.

In [13]:
df1.columns

Index(['VAERS_ID', 'RECVDATE', 'STATE', 'AGE_YRS', 'CAGE_YR', 'CAGE_MO', 'SEX',
       'RPT_DATE', 'SYMPTOM_TEXT', 'DIED', 'DATEDIED', 'L_THREAT', 'ER_VISIT',
       'HOSPITAL', 'HOSPDAYS', 'X_STAY', 'DISABLE', 'RECOVD', 'VAX_DATE',
       'ONSET_DATE', 'NUMDAYS', 'LAB_DATA', 'V_ADMINBY', 'V_FUNDBY',
       'OTHER_MEDS', 'CUR_ILL', 'HISTORY', 'PRIOR_VAX', 'SPLTTYPE',
       'FORM_VERS', 'TODAYS_DATE', 'BIRTH_DEFECT', 'OFC_VISIT', 'ER_ED_VISIT',
       'ALLERGIES'],
      dtype='object')

In [21]:
df2 = df1.drop(columns=['VAERS_ID', 'RECVDATE', 'STATE', 'CAGE_YR', 'CAGE_MO', 'RPT_DATE', 'SYMPTOM_TEXT', 'DATEDIED', 'L_THREAT', 'ER_VISIT',
       'HOSPITAL', 'HOSPDAYS', 'X_STAY', 'RECOVD', 'VAX_DATE',
       'ONSET_DATE', 'LAB_DATA', 'V_ADMINBY', 'V_FUNDBY',
       'OTHER_MEDS', 'PRIOR_VAX', 'SPLTTYPE',
       'FORM_VERS', 'TODAYS_DATE', 'BIRTH_DEFECT', 'OFC_VISIT', 'ER_ED_VISIT'])
df2.head()

Unnamed: 0,AGE_YRS,SEX,DIED,DISABLE,NUMDAYS,CUR_ILL,HISTORY,ALLERGIES
1,73.0,F,,,0.0,Patient residing at nursing facility. See pati...,Patient residing at nursing facility. See pati...,"""Dairy"""
11,40.0,F,,,0.0,,,
14,35.0,F,,,0.0,,Depression,
18,25.0,F,,,0.0,,None.,Ceftiaxone (Rocephin)
84,37.0,M,,,1.0,,,Amoxicillin


In [22]:
df2.shape

(2577, 8)
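When only a few columns are kept, listing the columns to keep can be shorter than listing the ones to drop. A small sketch on made-up rows, where `keep_cols` is an illustrative name:

```python
import pandas as pd

demo = pd.DataFrame({
    "VAERS_ID": [916601, 916613],
    "AGE_YRS": [73.0, 40.0],
    "SEX": ["F", "F"],
    "STATE": ["CA", "CA"],
})

# Select the columns to keep; .copy() returns an independent DataFrame.
keep_cols = ["AGE_YRS", "SEX"]
kept = demo[keep_cols].copy()

# Equivalent result using drop, in the style of the cell above.
dropped = demo.drop(columns=["VAERS_ID", "STATE"])

print(kept.equals(dropped))
```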

In [23]:
df2.dtypes

AGE_YRS      float64
SEX           object
DIED          object
DISABLE       object
NUMDAYS      float64
CUR_ILL       object
HISTORY       object
ALLERGIES     object
dtype: object

In [24]:
df2.describe()

Unnamed: 0,AGE_YRS,NUMDAYS
count,2391.0,2424.0
mean,51.61348,5.14769
std,18.131568,62.659682
min,1.0,0.0
25%,37.0,0.0
50%,49.0,1.0
75%,66.0,3.0
max,101.0,2399.0


I am using df2.isnull().sum() to count the missing (NaN) values in each column of the subset.

In [25]:
df2.isnull().sum()

AGE_YRS       186
SEX             0
DIED         2422
DISABLE      2471
NUMDAYS       153
CUR_ILL      1236
HISTORY       689
ALLERGIES    1078
dtype: int64

Below, I use the .fillna() method to replace the NaN values with zeros.

In [26]:
df2.fillna(0, inplace=True)

In [27]:
df2.head()

Unnamed: 0,AGE_YRS,SEX,DIED,DISABLE,NUMDAYS,CUR_ILL,HISTORY,ALLERGIES
1,73.0,F,0,0,0.0,Patient residing at nursing facility. See pati...,Patient residing at nursing facility. See pati...,"""Dairy"""
11,40.0,F,0,0,0.0,,,
14,35.0,F,0,0,0.0,0,Depression,
18,25.0,F,0,0,0.0,,None.,Ceftiaxone (Rocephin)
84,37.0,M,0,0,1.0,0,0,Amoxicillin
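Calling .fillna(0) on the whole DataFrame also replaces a missing age with 0. One possible refinement, which is not what this notebook does, is to pass a dictionary to .fillna() so that only selected columns are filled. The fill values below ("N" and "None") are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "AGE_YRS": [73.0, np.nan],
    "DIED": ["Y", np.nan],
    "ALLERGIES": [np.nan, "Kiwi"],
})

# Per-column fill values: only the columns named here are filled.
# The choices below are illustrative assumptions, not the notebook's rule.
fill_values = {"DIED": "N", "ALLERGIES": "None"}
filled = demo.fillna(value=fill_values)

print(filled)
```

Here the missing age stays NaN, so it can still be excluded or imputed explicitly later.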


#### Data Dictionary - dataset (df2):

1. AGE_YRS: The recorded vaccine recipient's age in years.

2. SEX: Sex of the vaccine recipient (M = Male, F = Female, Unknown = Blank).

3. DIED: If the vaccine recipient died a "Y" is used; otherwise the field will be blank.

4. DISABLE: If the vaccine recipient was disabled as a result of the vaccination a "Y" is placed in this field; otherwise the field will be blank.

5. NUMDAYS: The calculated interval (in days) from the vaccination date to the onset date.

6. CUR_ILL: This text field contains narrative about any illnesses at the time of the vaccination as noted on the specified field of the form.

7. HISTORY: This text field contains narrative about any pre-existing physician-diagnosed birth defects or medical condition that existed at the time of vaccination as noted on the specified field of the form.

8. ALLERGIES: This text field contains narrative about any pre-existing physician-diagnosed allergies that existed at the time of vaccination as noted in the specified field of the form.

## Pre-processing

### One-Hot Encode

One-hot encoding replaces a categorical variable, like 'DIED' ('Y' = died and 0 = no death reported), with one or more features that take the values 0 and 1 (Müller and Guido, 2017, p. 214). I am using it because Machine Learning models are based on numerical operations and do not recognize strings. It is therefore necessary to convert strings into numbers without introducing an ordinal relationship, as Label Encoding does. Such an artificial ordering can give some categories more weight than others.

First, I use the value_counts() method to check the contents of the columns I want to one-hot encode. This is important because data entered by humans can always contain errors. These columns contain no typos, so I can start the one-hot encoding process. Otherwise, I would first have to merge each misspelled variant into a single category (Müller and Guido, 2017, pp. 214-215).

In [28]:
df2["SEX"].value_counts()

F    1837
M     677
U      63
Name: SEX, dtype: int64

In [29]:
df2["DIED"].value_counts()

0    2422
Y     155
Name: DIED, dtype: int64

In [30]:
df2["DISABLE"].value_counts()

0    2471
Y     106
Name: DISABLE, dtype: int64

In [31]:
df_encoded = pd.get_dummies(df2, columns=['SEX', 'DIED', 'DISABLE'])
df_encoded.head()

Unnamed: 0,AGE_YRS,NUMDAYS,CUR_ILL,HISTORY,ALLERGIES,SEX_F,SEX_M,SEX_U,DIED_0,DIED_Y,DISABLE_0,DISABLE_Y
1,73.0,0.0,Patient residing at nursing facility. See pati...,Patient residing at nursing facility. See pati...,"""Dairy""",1,0,0,1,0,1,0
11,40.0,0.0,,,,1,0,0,1,0,1,0
14,35.0,0.0,0,Depression,,1,0,0,1,0,1,0
18,25.0,0.0,,None.,Ceftiaxone (Rocephin),1,0,0,1,0,1,0
84,37.0,1.0,0,0,Amoxicillin,0,1,0,1,0,1,0


In [32]:
df_encoded.shape

(2577, 12)
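To make the effect of one-hot encoding easier to see, here is a minimal sketch on three made-up rows. Passing `dtype=int` is an addition for the example: it requests 0/1 columns, since recent Pandas versions return boolean columns by default.

```python
import pandas as pd

demo = pd.DataFrame({"AGE_YRS": [73.0, 37.0, 50.0], "SEX": ["F", "M", "U"]})

# Each category of SEX becomes its own 0/1 column; AGE_YRS is left untouched.
encoded = pd.get_dummies(demo, columns=["SEX"], dtype=int)

print(encoded)
```

Each row has exactly one 1 among the SEX_F, SEX_M and SEX_U columns, so no ordering between the categories is implied.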

#### Sparsity

In [33]:
missing_values = df_encoded.isnull().sum().sum()
zero_values = (df_encoded == 0).sum().sum()

total_data_points = df_encoded.size

sparsity = (missing_values + zero_values) / total_data_points

print(f"Sparsity of the dataset: {sparsity:.2f}")

Sparsity of the dataset: 0.48
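The calculation above can be wrapped in a small function and checked on a made-up DataFrame where the answer is easy to verify by hand. `sparsity_ratio` is a hypothetical helper name introduced for this sketch.

```python
import numpy as np
import pandas as pd


def sparsity_ratio(frame):
    """Fraction of cells that are either missing or equal to zero."""
    missing = frame.isnull().sum().sum()
    zeros = (frame == 0).sum().sum()
    return (missing + zeros) / frame.size


# 8 cells in total: 3 zeros and 1 NaN, so the ratio should be 4 / 8.
demo = pd.DataFrame({"A": [0, 1, 0, 2], "B": [np.nan, 3.0, 0.0, 4.0]})
print(f"Sparsity of the demo frame: {sparsity_ratio(demo):.2f}")
```

If an earlier .fillna(0) call has already replaced every NaN, the missing-values term contributes nothing and the ratio is driven by the zeros alone, including the zeros in the one-hot columns.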


In [34]:
df2['CUR_ILL'].unique()

array(['Patient residing at nursing facility. See patients chart.',
       'None', 0, 'COVID dec 2 ,2020', 'none',
       'Suspected autoimmune type reaction following Fluvax October 2020- developed Cold Urticaria (hives in response to cold stimulus) a few days after flu vaccine',
       'COVID 19 Positive on 12/11/2020', 'No e', 'No', 'Na',
       'HAD COVID SYMPTOMS BEGINNING OF DECEMBER, FEVER, CHILLS, LOSS OF TASTE AND SMELL. COVID TEST NEGATIVE.',
       'NKA', 'none known', 'COVID-19', "None- I'm in excellent health",
       'history of hypertension, mitral valve prolapse, asthma', 'na',
       'Cold sore on lip one month ago', 'Diabetes, high blood pressure',
       'NONE', 'COVID-19 asymptomatic case. 12/8/2020 positive test',
       'June 22,2020: COVID19 positive July 2020: Diagnosed with restrictive airway disease, small airway disease, lung nodules November 11, 2020: STEMI, required angioplasty and stent placement',
       'None known', 'None known.', 'unknown', '#NAME?', '

In [35]:
df2['HISTORY'].unique()

array(['Patient residing at nursing facility. See patients chart.',
       'None', 'Depression', ...,
       'Neurosarcoidosis Transverse myelitis with lower paraplegia HIV positive (on HAART) DM2 MDD',
       'Congestive heart failure, triple bypass surgery, diabetes mellitus',
       'severe chronic asthma with home O2'], dtype=object)

In [36]:
df2['ALLERGIES'].unique()

array(['"Dairy"', 'None', 'Ceftiaxone (Rocephin)', 'Amoxicillin',
       'Lactose, neosporin, nickel', 'NKA', 0, 'Kiwi',
       'sulfa, vytorin, ginger (spice)',
       'DayQuil, Hypercare, glycerin (in skin products and food), coconut oil (applied to skin)',
       'none', 'Allergies to bees, shellfish and shrimp.', 'NKDA',
       'Bees, PCN, Pineapple', 'Clindamycin',
       'Penicillins, ciprofloxacin, sulfas', 'Shellfish and milk',
       'Codeine, Thimerisol, sulfates, preservatives',
       'Lots of hay fever', 'Amoxicillin, levaquin', 'Chlorhexidine',
       'amoxicillin, clindamycin',
       'Tested for allergy in the past and was found to have environmental allergies - trees, pollen, dust',
       'Penicillin  Crab allergy',
       'allergies to sulfa, walnut, allergic rhinitis',
       'sulfa drug allergy, walnut food allergy, also has allergic rhinitis to environmental allergens',
       'asthma, food allergies, and environmental allergies',
       'penicillin, ampicillin, p
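The unique values above contain many spellings of "no entry" ('None', 'none', 'NONE', 'Na', 'NKA', and the 0 fill value). A minimal sketch of how they could be merged into one label is shown below. The `NONE_LIKE` set and the `normalize_none` helper are hypothetical, and deciding which spellings count as "none" is an assumption that would need checking against the data.

```python
import pandas as pd

# Hypothetical set of lower-cased spellings to treat as "no entry"; extend as needed.
NONE_LIKE = {"none", "no", "na", "nka", "none known", "none known."}


def normalize_none(value):
    """Map none-like spellings (and the 0 fill value) to the single label 'None'."""
    if isinstance(value, str):
        cleaned = value.strip()
        return "None" if cleaned.lower() in NONE_LIKE else cleaned
    return "None" if value == 0 else value


entries = pd.Series(["None", "none", "NONE", "Na", "NKA", 0, "COVID-19"])
print(entries.map(normalize_none).value_counts())
```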

### Reference

Müller, A. C. and Guido, S. (2017). Introduction to machine learning with Python: a guide for data scientists. 1st ed. United States of America: O’Reilly Media.