To start the EDA, I import the libraries I will use: Pandas to load and manipulate the data, NumPy for numerical operations, the statistics module for basic descriptive statistics, and Matplotlib and Seaborn for plotting.

In [33]:
import pandas as pd
import numpy as np
import statistics as stats
import matplotlib.pyplot as plt
import seaborn as sns

## Import dataset

I used the 'pd.read_csv' function to load the file that I want to work with.
The warning indicates that Pandas found columns with inconsistent data types, meaning that both strings and numbers appear in the same column. I will handle this during the cleaning process.

In [38]:
df = pd.read_csv('2021VAERSDATA.csv', encoding='ISO-8859-1')
orig_df = df.copy()

  df = pd.read_csv('2021VAERSDATA.csv', encoding='ISO-8859-1')
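To see which columns trigger a mixed-type warning, one option is to list the Python types present in each column. The following is a minimal sketch on a toy DataFrame; the helper name `mixed_type_columns` and the toy data are illustrative assumptions and are not taken from the VAERS file.

```python
import pandas as pd


def mixed_type_columns(frame):
    """Return {column: sorted type names} for columns holding more than one Python type."""
    mixed = {}
    for col in frame.columns:
        # Ignore missing values, then record the type name of every remaining cell.
        type_names = frame[col].dropna().map(lambda v: type(v).__name__).unique()
        if len(type_names) > 1:
            mixed[col] = sorted(type_names)
    return mixed


# Toy data (hypothetical): column "A" mixes numbers and strings.
toy = pd.DataFrame({"A": [1, "x", 3.5], "B": ["a", "b", None], "C": [1, 2, 3]})
report = mixed_type_columns(toy)
print(report)
```

When reading the real file, `pd.read_csv` also accepts the `dtype=` and `low_memory=False` arguments, which may avoid the chunk-by-chunk type inference that produces this kind of warning.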


In [39]:
df.head()

Unnamed: 0,VAERS_ID,RECVDATE,STATE,AGE_YRS,CAGE_YR,CAGE_MO,SEX,RPT_DATE,SYMPTOM_TEXT,DIED,...,CUR_ILL,HISTORY,PRIOR_VAX,SPLTTYPE,FORM_VERS,TODAYS_DATE,BIRTH_DEFECT,OFC_VISIT,ER_ED_VISIT,ALLERGIES
0,916600,01/01/2021,TX,33.0,33.0,,F,,Right side of epiglottis swelled up and hinder...,,...,,,,,2,01/01/2021,,Y,,Pcn and bee venom
1,916601,01/01/2021,CA,73.0,73.0,,F,,Approximately 30 min post vaccination administ...,,...,Patient residing at nursing facility. See pati...,Patient residing at nursing facility. See pati...,,,2,01/01/2021,,Y,,"""Dairy"""
2,916602,01/01/2021,WA,23.0,23.0,,F,,"About 15 minutes after receiving the vaccine, ...",,...,,,,,2,01/01/2021,,,Y,Shellfish
3,916603,01/01/2021,WA,58.0,58.0,,F,,"extreme fatigue, dizziness,. could not lift my...",,...,kidney infection,"diverticulitis, mitral valve prolapse, osteoar...","got measles from measel shot, mums from mumps ...",,2,01/01/2021,,,,"Diclofenac, novacaine, lidocaine, pickles, tom..."
4,916604,01/01/2021,TX,47.0,47.0,,F,,"Injection site swelling, redness, warm to the ...",,...,Na,,,,2,01/01/2021,,,,Na


I used the df.shape attribute to get an idea of the dataset size: it has 34121 rows and 35 columns.

In [40]:
df.shape

(34121, 35)

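With the describe() method, I can compare the mean and the median (the 50% value) of each numeric column. The mean is above the median in 'HOSPDAYS' (3.75 vs 3.0) and far above it in 'NUMDAYS' (21.08 vs 1.0, with a maximum of 36896), which suggests a right-skewed distribution in these two columns. For 'AGE_YRS' and 'CAGE_YR' the mean and median are close (51.47 vs 50.0 and 51.14 vs 49.0), which is consistent with a roughly symmetric distribution, although this comparison alone does not show that they are normally distributed. Next, I will start the cleaning process, but first I will select the principal columns that can be useful for the project.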

In [42]:
df.describe()

Unnamed: 0,VAERS_ID,AGE_YRS,CAGE_YR,CAGE_MO,HOSPDAYS,NUMDAYS,FORM_VERS
count,34121.0,30933.0,26791.0,83.0,2857.0,31194.0,34121.0
mean,981306.6,51.471923,51.135381,0.084337,3.752888,21.077066,1.998124
std,62045.35,18.521742,18.633316,0.178395,3.878654,644.8344,0.043269
min,916600.0,0.08,0.0,0.0,1.0,0.0,1.0
25%,926464.0,37.0,36.0,0.0,1.0,0.0,2.0
50%,946837.0,50.0,49.0,0.0,3.0,1.0,2.0
75%,1047069.0,65.0,65.0,0.0,5.0,3.0,2.0
max,1115348.0,115.0,106.0,0.7,39.0,36896.0,2.0
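The mean-versus-median comparison can be complemented with the sample skewness that Pandas computes directly. This is a minimal sketch on two toy Series (hypothetical values, not taken from the VAERS file); a positive skew together with a mean above the median points to a right-skewed distribution.

```python
import pandas as pd

# Toy data (hypothetical), chosen so that one series has a long right tail.
right_skewed = pd.Series([1, 1, 1, 2, 3, 5, 39])
symmetric = pd.Series([1, 2, 3, 4, 5])

for name, s in [("right_skewed", right_skewed), ("symmetric", symmetric)]:
    # Series.skew() returns the sample skewness; values near 0 indicate symmetry.
    print(name, "mean:", round(s.mean(), 2), "median:", s.median(), "skew:", round(s.skew(), 2))
```

On the real data the same check could be run as `df['NUMDAYS'].skew()`; a histogram would still be needed before describing any column as normally distributed.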


## Clean data

To start the cleaning process, I will use the df.info() method because it provides a quick overview of the structure of the DataFrame, such as the data type of each column and its non-null count. In this case, only a few columns are complete (for example 'VAERS_ID', 'RECVDATE' and 'SEX'); most of the others have missing values.

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34121 entries, 0 to 34120
Data columns (total 35 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   VAERS_ID      34121 non-null  int64  
 1   RECVDATE      34121 non-null  object 
 2   STATE         28550 non-null  object 
 3   AGE_YRS       30933 non-null  float64
 4   CAGE_YR       26791 non-null  float64
 5   CAGE_MO       83 non-null     float64
 6   SEX           34121 non-null  object 
 7   RPT_DATE      63 non-null     object 
 8   SYMPTOM_TEXT  34119 non-null  object 
 9   DIED          1957 non-null   object 
 10  DATEDIED      1798 non-null   object 
 11  L_THREAT      1259 non-null   object 
 12  ER_VISIT      11 non-null     object 
 13  HOSPITAL      4387 non-null   object 
 14  HOSPDAYS      2857 non-null   float64
 15  X_STAY        52 non-null     object 
 16  DISABLE       870 non-null    object 
 17  RECOVD        31264 non-null  object 
 18  VAX_DATE      32622 non-nu

The code below uses isnull(), which returns True where a value is missing (NaN, Not a Number) and False otherwise, and then sum() to count the missing values in each column.

The data use guide **(xxxxxx)** contains essential information about this dataset, such as how it was created and filled in. For example, in the 'DIED' column the letter "Y" indicates that the patient died; otherwise the field is left blank. That is why there are NaN values in this dataset. In this case, I will replace the NaN values with zeros to represent the absence of an occurrence.

In [45]:
df.isnull().sum()

VAERS_ID            0
RECVDATE            0
STATE            5571
AGE_YRS          3188
CAGE_YR          7330
CAGE_MO         34038
SEX                 0
RPT_DATE        34058
SYMPTOM_TEXT        2
DIED            32164
DATEDIED        32323
L_THREAT        32862
ER_VISIT        34110
HOSPITAL        29734
HOSPDAYS        31264
X_STAY          34069
DISABLE         33251
RECOVD           2857
VAX_DATE         1499
ONSET_DATE       1863
NUMDAYS          2927
LAB_DATA        19041
V_ADMINBY           0
V_FUNDBY        34057
OTHER_MEDS      13882
CUR_ILL         18052
HISTORY         11746
PRIOR_VAX       32687
SPLTTYPE        25898
FORM_VERS           0
TODAYS_DATE       199
BIRTH_DEFECT    34070
OFC_VISIT       28717
ER_ED_VISIT     28592
ALLERGIES       15534
dtype: int64
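Raw counts are easier to compare when they are shown next to the share of rows they represent. The sketch below builds such a summary on a toy DataFrame; the data and the column names `n_missing` and `pct_missing` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) that mimics the "Y or blank" pattern of the flag columns.
toy = pd.DataFrame({
    "SEX": ["F", "M", "F", "U"],
    "DIED": [np.nan, "Y", np.nan, np.nan],
    "AGE_YRS": [33.0, np.nan, 23.0, 58.0],
})

# isnull().sum() counts missing cells; isnull().mean() gives the fraction of missing cells.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
}).sort_values("pct_missing", ascending=False)
print(missing)
```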

I will drop the columns that are not needed for the project, because they could generate errors and waste time. After dropping them, I will name the resulting dataset df1.

In [46]:
df.columns

Index(['VAERS_ID', 'RECVDATE', 'STATE', 'AGE_YRS', 'CAGE_YR', 'CAGE_MO', 'SEX',
       'RPT_DATE', 'SYMPTOM_TEXT', 'DIED', 'DATEDIED', 'L_THREAT', 'ER_VISIT',
       'HOSPITAL', 'HOSPDAYS', 'X_STAY', 'DISABLE', 'RECOVD', 'VAX_DATE',
       'ONSET_DATE', 'NUMDAYS', 'LAB_DATA', 'V_ADMINBY', 'V_FUNDBY',
       'OTHER_MEDS', 'CUR_ILL', 'HISTORY', 'PRIOR_VAX', 'SPLTTYPE',
       'FORM_VERS', 'TODAYS_DATE', 'BIRTH_DEFECT', 'OFC_VISIT', 'ER_ED_VISIT',
       'ALLERGIES'],
      dtype='object')

In [47]:
df1 = df.drop(columns=['VAERS_ID', 'RECVDATE', 'CAGE_YR', 'CAGE_MO', 'RPT_DATE', 'SYMPTOM_TEXT', 'DATEDIED', 'L_THREAT', 'ER_VISIT',
       'HOSPITAL', 'HOSPDAYS', 'X_STAY', 'RECOVD', 'VAX_DATE',
       'ONSET_DATE', 'LAB_DATA', 'V_ADMINBY', 'V_FUNDBY',
       'OTHER_MEDS', 'PRIOR_VAX', 'SPLTTYPE',
       'FORM_VERS', 'TODAYS_DATE', 'BIRTH_DEFECT', 'OFC_VISIT', 'ER_ED_VISIT'])
df1.head()

Unnamed: 0,STATE,AGE_YRS,SEX,DIED,DISABLE,NUMDAYS,CUR_ILL,HISTORY,ALLERGIES
0,TX,33.0,F,,,2.0,,,Pcn and bee venom
1,CA,73.0,F,,,0.0,Patient residing at nursing facility. See pati...,Patient residing at nursing facility. See pati...,"""Dairy"""
2,WA,23.0,F,,,0.0,,,Shellfish
3,WA,58.0,F,,,0.0,kidney infection,"diverticulitis, mitral valve prolapse, osteoar...","Diclofenac, novacaine, lidocaine, pickles, tom..."
4,TX,47.0,F,,,7.0,Na,,Na
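When only a handful of columns survive, listing the columns to keep can be shorter and less error-prone than listing the ones to drop. This is a minimal sketch of that alternative on a toy DataFrame (hypothetical rows; the names `keep` and `subset` are illustrative).

```python
import pandas as pd

# Toy data (hypothetical) with a few of the original column names.
toy = pd.DataFrame({
    "VAERS_ID": [916600, 916601],
    "STATE": ["TX", "CA"],
    "AGE_YRS": [33.0, 73.0],
    "SEX": ["F", "F"],
    "FORM_VERS": [2, 2],
})

# Select the columns of interest; .copy() makes the result independent of the source frame.
keep = ["STATE", "AGE_YRS", "SEX"]
subset = toy[keep].copy()
print(subset.shape)
```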


In [48]:
df1.shape

(34121, 9)

In [49]:
df1.dtypes

STATE         object
AGE_YRS      float64
SEX           object
DIED          object
DISABLE       object
NUMDAYS      float64
CUR_ILL       object
HISTORY       object
ALLERGIES     object
dtype: object

In [50]:
df1.describe()

Unnamed: 0,AGE_YRS,NUMDAYS
count,30933.0,31194.0
mean,51.471923,21.077066
std,18.521742,644.8344
min,0.08,0.0
25%,37.0,0.0
50%,50.0,1.0
75%,65.0,3.0
max,115.0,36896.0


I am using df1.isnull().sum() to count the missing (NaN) values in each of the remaining columns.

In [17]:
df1.isnull().sum()

STATE         5571
AGE_YRS       3188
SEX              0
DIED         32164
DISABLE      33251
NUMDAYS       2927
CUR_ILL      18052
HISTORY      11746
ALLERGIES    15534
dtype: int64

Below, I am using the .fillna() method to replace every NaN in df1 with 0.

In [56]:
df1.fillna(0, inplace=True)

In [57]:
df1.head()

Unnamed: 0,STATE,AGE_YRS,SEX,DIED,DISABLE,NUMDAYS,CUR_ILL,HISTORY,ALLERGIES
0,TX,33.0,F,0,0,2.0,,,Pcn and bee venom
1,CA,73.0,F,0,0,0.0,Patient residing at nursing facility. See pati...,Patient residing at nursing facility. See pati...,"""Dairy"""
2,WA,23.0,F,0,0,0.0,,,Shellfish
3,WA,58.0,F,0,0,0.0,kidney infection,"diverticulitis, mitral valve prolapse, osteoar...","Diclofenac, novacaine, lidocaine, pickles, tom..."
4,TX,47.0,F,0,0,7.0,Na,0,Na
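Calling `fillna(0)` on the whole DataFrame also writes 0 into columns such as 'AGE_YRS' and 'STATE', where 0 does not mean "no occurrence". A possible alternative, sketched here on toy data under the assumption that only the "Y or blank" flag columns should become 0/1, is to convert those columns explicitly and leave the others untouched. The names `flag_cols` and `cleaned` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with one numeric column and two "Y or blank" flag columns.
toy = pd.DataFrame({
    "AGE_YRS": [33.0, np.nan, 58.0],
    "DIED": [np.nan, "Y", np.nan],
    "DISABLE": ["Y", np.nan, np.nan],
})

flag_cols = ["DIED", "DISABLE"]
cleaned = toy.copy()
for col in flag_cols:
    # "Y" becomes 1; blanks (NaN) and any other value become 0.
    cleaned[col] = cleaned[col].eq("Y").astype(int)
print(cleaned)
```

With this approach a missing age stays NaN and can be handled separately, for example by imputation or by dropping the row.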


### LabelEncoder

I will use the LabelEncoder class from scikit-learn to replace text data with numerical data. Otherwise, most machine learning models will not be able to process the data.

In [58]:
df1["SEX"].value_counts()

F    24468
M     8755
U      898
Name: SEX, dtype: int64

In [59]:
from sklearn.preprocessing import LabelEncoder
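As a minimal sketch of how the encoder could be applied, the example below encodes a toy 'SEX' column. The toy data, the new column name `SEX_CODE` and the `mapping` dictionary are illustrative assumptions; LabelEncoder assigns integer codes to the sorted unique labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical) using the three labels found in the SEX column.
toy = pd.DataFrame({"SEX": ["F", "M", "U", "F"]})

encoder = LabelEncoder()
# fit_transform learns the sorted unique labels and returns their integer codes.
toy["SEX_CODE"] = encoder.fit_transform(toy["SEX"])

# classes_ holds the labels in code order, so the mapping can be inspected or reversed.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
```

scikit-learn documents LabelEncoder as intended for target labels; for input features, OrdinalEncoder or OneHotEncoder are the usual counterparts, so the choice here is a judgment call.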

In [15]:
df1["DIED"].value_counts()

Y    1957
Name: DIED, dtype: int64

In [16]:
df1["DISABLE"].value_counts()

Y    870
Name: DISABLE, dtype: int64

In [28]:
df1['SEX'].unique()

array(['F', 'M', 'U'], dtype=object)

In [29]:
df1['DIED'].unique()

array([nan, 'Y'], dtype=object)

In [30]:
df1['DISABLE'].unique()

array([nan, 'Y'], dtype=object)

In [32]:
df1['CUR_ILL'].unique()

array(['None',
       'Patient residing at nursing facility. See patients chart.',
       'kidney infection', ..., 'mild diabetes',
       'hbp thyroid copd severe sleep apnea renal failure renal transplant 2016',
       'Asthma with home O2 and regular exacerbations and nebulizer use.'],
      dtype=object)

In [33]:
df1['HISTORY'].unique()

array(['None',
       'Patient residing at nursing facility. See patients chart.',
       'diverticulitis, mitral valve prolapse, osteoarthritis', ...,
       'Medical History/Concurrent Conditions: Aortic valve replacement',
       'Medical History/Concurrent Conditions: Type II diabetes mellitus',
       'Comments: List of non-encoded Patient Relevant History: Patient Other Relevant History 1: no adverse event, Continue: [UNK], Comment: No medical history reported'],
      dtype=object)

In [34]:
df1['ALLERGIES'].unique()

array(['Pcn and bee venom', '"Dairy"', 'Shellfish', ...,
       'Amoxicillan Lactose', 'shellfish allergy',
       'no known specific allergies to us, just history of severe asthma'],
      dtype=object)