# INTRODUCTION

Hello everyone! Welcome to this notebook, where we will look for insights in COVID-19 data. The data in this notebook was provided by the Mexican government. We will start by performing EDA and removing missing data and outliers.

# About the Dataset

The dataset consists of anonymised records of COVID-19 patients from Mexico. It contains each patient's medical history and medical status at that particular time. In the Boolean features, 1 means "yes" and 2 means "no". Values 97 and 99 denote missing data (98 also shows up as a missing code in the value counts further down). The respective columns mean:
* sex: 1 for female and 2 for male.
* age: of the patient.
* classification: covid test findings. Values 1-3 mean that the patient was diagnosed with covid in different degrees. 4 or higher means that the patient is not a carrier of covid or that the test is inconclusive.
* patient type: type of care the patient received in the unit. 1 for returned home and 2 for hospitalization.
* pneumonia: whether the patient already has inflammation of the air sacs or not.
* pregnancy: whether the patient is pregnant or not.
* diabetes: whether the patient has diabetes or not.
* copd: Indicates whether the patient has Chronic obstructive pulmonary disease or not.
* asthma: whether the patient has asthma or not.
* inmsupr: whether the patient is immunosuppressed or not.
* hypertension: whether the patient has hypertension or not.
* cardiovascular: whether the patient has a disease related to the heart or blood vessels.
* renal chronic: whether the patient has chronic renal disease or not.
* other disease: whether the patient has any other disease or not.
* obesity: whether the patient is obese or not.
* tobacco: whether the patient is a tobacco user.
* usmer: indicates whether the patient was treated in a medical unit of the first, second or third level.
* medical unit: type of institution of the National Health System that provided the care.
* intubed: whether the patient was connected to a ventilator.
* icu: indicates whether the patient was admitted to an Intensive Care Unit.
* date died: the date of death if the patient died, and 9999-99-99 otherwise.
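
Before loading the real file, here is a minimal sketch of how the 1/2 encoding described above could be turned into proper booleans. The toy values and the helper name `decode_flag` are made up for illustration; they are not part of the dataset or of this notebook's pipeline.

```python
import pandas as pd

# Toy rows with made-up values that follow the encoding described above.
toy = pd.DataFrame({"DIABETES": [1, 2, 98, 2], "ICU": [97, 1, 2, 99]})

def decode_flag(col: pd.Series) -> pd.Series:
    """Map 1 -> True and 2 -> False; every other code becomes <NA>."""
    return col.map({1: True, 2: False}).astype("boolean")

# Decode column by column so each result keeps the nullable boolean dtype.
decoded = pd.DataFrame({name: decode_flag(toy[name]) for name in toy.columns})
print(decoded)
```

Note that this sketch treats any code other than 1 or 2 as missing, which is a stronger assumption than the description makes.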

In [None]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#loading data 
data = pd.read_csv('/kaggle/input/covid19-dataset/Covid Data.csv')
data.head()

# Understanding the Dataset


1. How many instances and attributes does our dataset have??

In [None]:
data.shape #The dataset has 21 columns or attributes.

2. Let us see the data type of columns in our dataset.

In [None]:
data.dtypes #checking data type of each column

3. Let us see the number of unique values in each column.

In [None]:
data.nunique() #counting number of unique values in each column

4. What is the meaning of USMER? How many patients belong to each category??
> The three commonly known categories of medical treatment are primary (1), secondary (2) and tertiary (3).
> * Primary care: this is the first stop for most symptoms and medical concerns.
> * Secondary care: your primary care provider refers you to a specialist, someone with more specific expertise in whatever health issue you are experiencing.
> * Tertiary care: if you are hospitalised and require a higher level of speciality care, your doctor may refer you to tertiary care. Tertiary care requires highly specialised equipment and expertise.

In [None]:
print(data['USMER'].unique())  # only the last expression of a cell is displayed, so print this one
data['USMER'].value_counts()

> Far more patients belong to secondary care than to primary care. This suggests, though it does not prove, that many patients had health issues serious enough to need a specialist.

5. Now let us see the unique values for each column.

In [None]:
for i in data.columns:
    if(i!='AGE' and i!='DATE_DIED'):
        print(i," -> ", dict(data[i].value_counts()))

> As we can see there is a lot of missing data in our dataset corresponding to values 97,98,99. Most of the data is categorical which indicates whether the patient has a particular ailment or not.
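
To put a number on "a lot of missing data", a small sketch like the one below could report the share of missing codes per column. The toy frame is made up; AGE is left out on the assumption that 97, 98 and 99 can be genuine ages there.

```python
import pandas as pd

# Made-up rows; 97/98/99 stand for missing values in the flag columns.
toy = pd.DataFrame({
    "PNEUMONIA": [1, 2, 99, 2],
    "PREGNANT": [97, 97, 2, 98],
    "AGE": [34, 97, 61, 5],  # 97 here is a real age, not a missing code
})

missing_codes = [97, 98, 99]
flag_cols = toy.columns.drop("AGE")

# Percentage of rows per column that carry a missing code.
missing_share = toy[flag_cols].isin(missing_codes).mean() * 100
print(missing_share)
```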

6. Let us create a sub-dataset for the medical history of the patients.

In [None]:
med_hist = data.drop(columns=['USMER', 'MEDICAL_UNIT', 'SEX', 'PATIENT_TYPE', 'DATE_DIED', 'INTUBED', 'AGE', 'CLASIFFICATION_FINAL', 'ICU'])  # keep only the medical history columns
med_hist

7. Now, let us see how many rows are such where all medical history data is missing, i.e. the value is either 97, 98 or 99.

In [None]:
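# 97, 98 and 99 encode missing values, but they are also valid ages, so AGE is left untouched
cols = data.columns.drop('AGE')
data[cols] = data[cols].replace([97,98,99], np.nan)
med_hist = med_hist.replace([97,98,99], np.nan)
mm = med_hist[med_hist.isnull().all(axis=1)]  # rows where every medical history column is missing
data.loc[mm.index]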

> As we can see, among the patients whose medical history is not available at all, one has been hospitalised and one has even died.

8. Now, let us see the time period that our dataset covers.

In [None]:
data['DATE_DIED'] = data['DATE_DIED'].replace('9999-99-99', np.nan)
data['DATE_DIED'] =  pd.to_datetime(data['DATE_DIED'], infer_datetime_format=True)
x = data[data['DATE_DIED'].notnull()]['DATE_DIED']
print(x.max(), x.min())
print(x.max() - x.min())

> We have data for approximately two years, from January 2020 to December 2021.
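
The span above depends entirely on how the date strings are parsed. Here is a small sketch, on made-up strings, of parsing with an explicit format so that the `9999-99-99` sentinel becomes `NaT`. The day-first format `%d/%m/%Y` is an assumption for illustration; check a few raw values of `DATE_DIED` before relying on it.

```python
import pandas as pd

# Made-up examples; the DD/MM/YYYY layout is an assumption, not a documented fact.
raw = pd.Series(["03/05/2020", "9999-99-99", "21/01/2021"])

# Strings that do not match the format (such as the sentinel) turn into NaT.
died = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

span = died.max() - died.min()
print(died.min(), died.max(), span)
```

An explicit format avoids silently swapping day and month for ambiguous dates such as 03/05/2020.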


9. Correlation between all columns in our dataset.

In [None]:
sns.heatmap(data.corr(numeric_only=True))  # skip non-numeric columns such as DATE_DIED

> Darker colors indicate negative correlation, whereas lighter colors indicate positive correlation.

10. A weak immune system is often said to be a major reason for infection in young people. Let us check whether our dataset reflects this.

In [None]:
x=len(data[(data['INMSUPR']==1.0) & (data['AGE'].isin(np.arange(15.0,30.0)))])
y=len(data[data['INMSUPR']==1.0])
print(x/y*100)

> About 11% of the patients with a weak immune system are aged 15-29 (the range in the code stops just before 30).

11. Let us check the same for people aged 30 and above.

In [None]:
x=len(data[(data['INMSUPR']==1.0) & (data['AGE'].isin(np.arange(30.0,120.0)))])
y=len(data[data['INMSUPR']==1.0])
print(x/y*100)

> 79% of the people with a weak immune system are aged 30 or above!!
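
The two cells above repeat the same calculation with different age limits. A small helper like the hypothetical `share_in_age_band` below could do both; the toy data is made up and the function name is not from the original notebook.

```python
import pandas as pd

# Made-up patients: INMSUPR == 1 means immunosuppressed.
toy = pd.DataFrame({
    "INMSUPR": [1, 1, 2, 1, 1],
    "AGE": [20, 45, 25, 70, 10],
})

def share_in_age_band(df, flag_col, lo, hi):
    """Percentage of patients with flag_col == 1 whose age lies in [lo, hi]."""
    flagged = df[df[flag_col] == 1]
    return flagged["AGE"].between(lo, hi).mean() * 100

young = share_in_age_band(toy, "INMSUPR", 15, 29)
older = share_in_age_band(toy, "INMSUPR", 30, 119)
print(young, older)
```

`Series.between` includes both ends, so the band limits are explicit instead of hidden in `np.arange`.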

**Now it is clear what the different columns mean. We'll proceed by finding insights in the data.**

# Exploratory Data Analysis (EDA)

1. How many of the patients treated in first- and second-level medical units have died??

In [None]:
plt.rcParams["figure.figsize"] = (5,3)
first = data[data['USMER'] == 1]
second = data[data['USMER'] == 2]
patnt = [len(first), len(second)]
# a non-null DATE_DIED means the patient has died
died = [first['DATE_DIED'].notnull().sum(), second['DATE_DIED'].notnull().sum()]
stk = pd.DataFrame({'Total patients':patnt, 'Deaths':died}, index=['First MU','Second MU'])
# pass figsize to the plot call; a separate plt.figure() would only leave an empty figure behind
stk.plot.bar(figsize=(10,5))

2. How many patients who were not hospitalised have died and also how many who were hospitalised have died?

In [None]:
# STATUS is '1' if the patient has died (DATE_DIED is set) and '0' otherwise
data['STATUS'] = np.where(data['DATE_DIED'].isna(), '0', '1')

> Here, the value 1 is assigned if the patient has died and 0 otherwise.

In [None]:
xyz = pd.DataFrame(data.groupby('PATIENT_TYPE')['STATUS'].value_counts())
xyz

> We can clearly see, a smaller percentage of patients sent home have died and a larger percentage of hospitalised have died.
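
Raw counts make the percentages hard to read off. As a sketch on made-up data, `pd.crosstab` with `normalize="index"` gives the share of deaths within each patient type directly.

```python
import pandas as pd

# Made-up rows: PATIENT_TYPE 1 = sent home, 2 = hospitalised; STATUS '1' = died.
toy = pd.DataFrame({
    "PATIENT_TYPE": [1, 1, 1, 1, 2, 2],
    "STATUS": ["0", "0", "0", "1", "1", "0"],
})

# Each row sums to 1, so the '1' column is the death rate per patient type.
rates = pd.crosstab(toy["PATIENT_TYPE"], toy["STATUS"], normalize="index")
print(rates)
```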

3. Now, let us see whether the percentage of deaths is higher for females or for males.

In [None]:
data.groupby('SEX')['STATUS'].value_counts()

> Male deaths far outnumber female deaths.

4. What is the trend of deaths over the period of time?? Is the rise very steep??

In [None]:
plt.rcParams["figure.figsize"] = (8,3)
data2 = data[data['DATE_DIED'].dt.year == 2020]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# count deaths per month and put the months in calendar order
trans = data2['DATE_DIED'].dt.strftime('%b').value_counts().reindex(months).dropna()
plt.title("YEAR 2020")
sns.lineplot(x=trans.index, y=trans.values, sort=False)  # sort=False keeps the calendar order

> Deaths increased during the mid months of 2020 and slowly decreased towards the end of the year.

5. Let us also check deaths in 2021.

In [None]:
plt.rcParams["figure.figsize"] = (8,3)
data2 = data[data['DATE_DIED'].dt.year == 2021]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# count deaths per month and put the months in calendar order
trans = data2['DATE_DIED'].dt.strftime('%b').value_counts().reindex(months).dropna()
plt.title("YEAR 2021")
sns.lineplot(x=trans.index, y=trans.values, sort=False)  # sort=False keeps the calendar order

> Deaths were high at the start of 2021, peaking in April, and then decreased continuously towards the end of the year.

6. How many on the ventilator survived??

In [None]:
print(data['INTUBED'].value_counts()) #33656 patients were on ventilator
data.groupby('INTUBED')['STATUS'].value_counts()

> Of the 33656 patients on a ventilator, only 7275 have survived. Of those not on a ventilator, 117449 have survived.
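
To turn the ventilator figures quoted above into a rate, here is the arithmetic as a tiny sketch. The two input numbers are the ones stated in this notebook; nothing else is assumed.

```python
# Figures quoted in the notebook text above.
on_ventilator = 33656
survived_on_ventilator = 7275

died_on_ventilator = on_ventilator - survived_on_ventilator
survival_pct = survived_on_ventilator / on_ventilator * 100

print(f"{survival_pct:.1f}% survived, {died_on_ventilator} died")
```

So roughly one in five ventilated patients in this dataset survived.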

7. Let us see the distribution of cases over age.

In [None]:
plt.figure(figsize=(20,5))
plt.xticks(rotation=90, horizontalalignment='right',fontweight='light',fontsize='x-small' )
sns.countplot(x='AGE', data=data)  # number of patients at each age

In [None]:
# plt.figure(figsize=(200,5))
plt.rcParams["figure.figsize"] = (20,5)
(data.groupby('STATUS')['AGE'].value_counts()).unstack(0).plot.bar(ylabel='Number of patients',title='No. of deaths age-wise')

> Most cases fall in the 20-60 age range, and deaths increase significantly from age 40 onwards.

8. Let us now see the count of patients that have required a ventilator age wise.

In [None]:
data.groupby('INTUBED')['AGE'].value_counts().unstack(0).plot.bar(ylabel='Number of patients',title='Ventilators used age-wise')

> We can clearly see that the use of ventilators is very high for infants and patients of age 50 and above. The blue bars show use of ventilators.

9. Let's try to see which ailments are most common in patients

In [None]:
for col in med_hist.columns:
    total = len(data[(data[col]==1.0) | (data[col]==2.0)])
    xyz = len(data[data[col]==1.0])
    print("About", int(np.ceil((xyz/total)*100)), "in every 100 patients have", col)

> Most common ailments are hypertension, obesity, pneumonia and diabetes.
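
The same prevalence figures can be computed for all columns at once and sorted, which makes the "most common" ranking explicit. This is only a sketch on made-up data, counting 1 as "has the ailment" and ignoring rows where the value is missing.

```python
import numpy as np
import pandas as pd

# Made-up flags: 1 = yes, 2 = no, NaN = missing.
toy = pd.DataFrame({
    "DIABETES": [1, 2, 2, np.nan],
    "OBESITY": [1, 1, 2, 2],
})

has_ailment = (toy == 1).sum()     # patients with the ailment, per column
known = toy.isin([1, 2]).sum()     # patients with a known answer, per column
prevalence = (has_ailment / known * 100).sort_values(ascending=False)
print(prevalence)
```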

10. Now, let us see the total number of patients having each disease with the corresponding number of dead and alive from the set.

In [None]:
plt.rcParams["figure.figsize"] = (20,5)
fig,ax = plt.subplots(4,3, sharex=True)
plt.suptitle('Effect of ailments on the death rate', size=16)
i=0
j=0
PLOTS_PER_ROW=3
for col in med_hist.columns:
    xyz = data[data[col]==1.0]
    l = [len(xyz[xyz['STATUS']=='1']),len(xyz[xyz['STATUS']=='0']),len(xyz)]
    ax[i,j].title.set_text(col)
    ax[i,j].barh(y=['Dead','Live','Total'], width=l)
    j+=1
    if j%PLOTS_PER_ROW==0:
        i+=1
        j=0
plt.subplots_adjust(wspace=0.1, hspace=0.4)


> A lot of patients had pneumonia, diabetes, hypertension or obesity, or used tobacco. Deaths have been significantly high among patients with pneumonia, diabetes and hypertension.

11. Now, let us check which diseases make patients most likely to be intubed.

In [None]:
plt.rcParams["figure.figsize"] = (20,5)
fig,ax = plt.subplots(4,3, sharex=True)
plt.suptitle('Likelihood of being intubed', size=16)
i=0
j=0
PLOTS_PER_ROW=3
for col in med_hist.columns:
    l = [len(data[(data[col]==1.0) & (data['INTUBED']==1.0)]), len(data[(data[col]==1.0)])]
    ax[i,j].title.set_text(col)
    ax[i,j].barh(y=['Intubed','Total'], width=l)
    j+=1
    if j%PLOTS_PER_ROW==0:
        i+=1
        j=0
plt.subplots_adjust(wspace=0.2, hspace=0.4)

> Patients suffering from pneumonia are most likely to be intubed.

12. Patients of what age were mostly sent home??

In [None]:
plt.rcParams["figure.figsize"] = (10,3)
df = data[data['PATIENT_TYPE']==1.0]
df['AGE'].value_counts().sort_index().plot.line()  # patients sent home, counted per age

> Most of the patients sent home were aged 25-50.

13. How many of the patients sent home have died?

In [None]:
plt.rcParams["figure.figsize"] = (5,3)
df1 = data[(data['PATIENT_TYPE']==1.0) & (data['STATUS']=='1')]
df2 = data[(data['PATIENT_TYPE']==2.0) & (data['STATUS']=='1')]
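y1 = [len(df1),len(df2)]  # deaths per patient type
y2 = [len(data[data['PATIENT_TYPE']==1.0]), len(data[data['PATIENT_TYPE']==2.0])]  # all patients per type
x = np.arange(2)
width = 0.3
plt.bar(x-width/2, y2, width, label='Total patients')
plt.bar(x+width/2, y1, width, label='Deaths')
plt.xticks(x, ['SENT HOME', 'HOSPITALISED'])  # centre each label under its pair of bars
plt.legend()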

> A large fraction of the hospitalised patients have died. Very few of those sent home have died.

14. In this part, we will see the trend of ailments age wise.

In [None]:
plt.rcParams["figure.figsize"] = (30,20)
i=1
for col in med_hist.columns:
    df = data[data[col]==1.0]
    plt.subplot(12,1,i)
    df.groupby('AGE')[col].value_counts().plot.bar()
    plt.title(col)
    i=i+1

15. Let us see how many patients are in each medical unit.

In [None]:
for i in range(1,14):    
    print(i,'->',len(data[data['MEDICAL_UNIT']==float(i)]))

> The patients are spread very unevenly across the medical units.

16. Let us see how the ICU patients are distributed across the medical units.

In [None]:
plt.rcParams["figure.figsize"] = (15,4)
x = np.arange(1,14)
y=[]
for i in range(1,14):
    # number of ICU patients in medical unit i
    y.append(len(data[(data["MEDICAL_UNIT"]==float(i)) & (data['ICU']==1.0)]))
plt.title("No. of patients in ICU MEDICAL UNIT wise")
plt.pie(x=y, labels=x, autopct='%.0f%%', textprops={'fontsize': 7})

> Most of the ICU patients come from Medical Units 12, 4, 6 and 9.
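
A pie of counts shows each unit's share of all ICU patients, which also depends on how large the unit is. To compare how likely a patient in a given unit is to need an ICU, one could look at the ICU rate within each unit instead. The sketch below uses made-up rows and only counts patients whose ICU value is known.

```python
import numpy as np
import pandas as pd

# Made-up rows: ICU 1 = yes, 2 = no, NaN = missing.
toy = pd.DataFrame({
    "MEDICAL_UNIT": [4, 4, 4, 12, 12, 12],
    "ICU": [1, 2, 2, 1, 1, np.nan],
})

known = toy[toy["ICU"].isin([1, 2])]
# Mean of a boolean column is the fraction of True values, here per medical unit.
icu_rate = (known["ICU"] == 1).groupby(known["MEDICAL_UNIT"]).mean() * 100
print(icu_rate)
```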

17. Now, let's check the classification state of the patients.

In [None]:
data['CLASIFFICATION_FINAL'].value_counts().sort_index()

> Most patients belong to state 3 and 7. 

18. Let us check how many patients were sent home or hospitalised in classifications 1-3 (patients who were diagnosed with covid).

In [None]:
df1 = data[data['CLASIFFICATION_FINAL'].isin([1.0,2.0,3.0])]  # classifications 1-3: diagnosed with covid
df1.groupby('CLASIFFICATION_FINAL')['PATIENT_TYPE'].value_counts()

> Here, 1 indicates sent home and 2 indicates hospitalised. 

19. Let us check how many patients were sent home or hospitalised in classifications 4-7 (patients who were not diagnosed with covid).

In [None]:
df2 = data[data['CLASIFFICATION_FINAL'].isin([4.0,5.0,6.0,7.0])]  # classifications 4-7: not diagnosed or inconclusive
df2.groupby('CLASIFFICATION_FINAL')['PATIENT_TYPE'].value_counts()

> It is quite surprising that patients not diagnosed with covid were hospitalised. These may be patients with inconclusive tests but severe symptoms.

20. Now, let us check the deaths by classification type among the patients who were sent home.

In [None]:
df1[df1['PATIENT_TYPE']==1.0].groupby('CLASIFFICATION_FINAL').STATUS.value_counts()

In [None]:
df2[df2['PATIENT_TYPE']==1.0].groupby('CLASIFFICATION_FINAL').STATUS.value_counts()

> Among the patients who were sent home, 0 means they are alive and 1 means they have died.

21. Let us see age wise distribution of deaths in females.

In [None]:
df = data[(data['SEX']==1.0) & (data['STATUS']=='1')]
df.groupby('AGE')['STATUS'].count().plot.bar()

22. Let us see the age wise distribution of deaths in males.

In [None]:
df = data[(data['SEX']==2.0) & (data['STATUS']=='1')]
df.groupby('AGE')['STATUS'].count().plot.bar()

> The age wise distribution is approximately the same as for females. Only the death numbers are higher for males.

23. Which diseases are most prominent in males?

In [None]:
df = data[data['SEX']==2.0]
ctr=[]
almt=[]
for i in med_hist.columns:
    ctr.append(len(df[df[i]==1.0]))
    almt.append(i)
plt.title('Ailment count in Males')
plt.xlabel('Count of patients')
plt.ylabel('Ailment')
plt.barh(y=almt, width=ctr)

24. Which diseases are most prominent in females?

In [None]:
df = data[data['SEX']==1.0]
ctr=[]
almt=[]
for i in med_hist.columns:
    ctr.append(len(df[df[i]==1.0]))
    almt.append(i)
plt.title('Ailment count in Females')
plt.xlabel('Count of patients')
plt.ylabel('Ailment')
plt.barh(y=almt, width=ctr)

25. What is the CUMULATIVE death count according to our dataset?

In [None]:
import plotly.express as px
df = data[data['DATE_DIED'].notnull()].sort_values(by='DATE_DIED')
abc = df.groupby('DATE_DIED').count()
px.line(abc, y=abc['STATUS'].cumsum(), x=abc.index, title='Trend of deaths over the period of time', labels={"DATE_DIED":"Time Period", "y":"Number of Deaths"})
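
The cumulative curve boils down to "count deaths per day, sort by date, take a running sum". A minimal sketch of just that step on made-up dates, without any plotting:

```python
import pandas as pd

# Made-up dates of death; None stands for a patient who survived.
died = pd.to_datetime(
    pd.Series(["2020-05-03", "2020-05-03", "2020-06-10", None, "2020-04-01"])
)

daily = died.dropna().value_counts().sort_index()  # deaths per day, in date order
cumulative = daily.cumsum()                        # running total of deaths
print(cumulative)
```

The resulting series can be passed to any plotting library, with the index as the time axis.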

26. Let us check the cumulative test rate over the time period.

In [None]:
# px.line(data, y=np.arange(len(data)).cumsum(), x=data['DATE_DIED'], title='Trend of tests over the period of time', labels={"DATE_DIED":"Time Period", "y":"Number of Deaths"})

27. How many patients on a ventilator are shifted to the ICU?

In [None]:
df=data[data['INTUBED']==1.0]
df['ICU'].value_counts()

> 9306 patients on ventilator are shifted to ICU.

In [None]:
df=data[data['INTUBED']==2.0]
df['ICU'].value_counts()

> The number of patients who have not been on ventilator but are in ICU is 7552.

28. Since ICU beds are limited in hospitals, which combinations of ailments make a patient most likely to need an ICU?

In [None]:
med_hist.corr().unstack().sort_values(ascending=False).drop_duplicates().head()

> We can use a combination of (HIPERTENSION  DIABETES) and (DIABETES  PNEUMONIA) to check ICU usage.

In [None]:
df = data[(data['HIPERTENSION']==1.0) & (data['DIABETES']==1.0)]
x=len(df)
y=len(df[df['ICU']==1.0])
print('Chances of a patient suffering from HYPERTENSION and DIABETES to be taken to ICU is : ', ((y/x)*100))

In [None]:
from itertools import combinations
di={}
# look at every unordered pair of two different ailments exactly once
for i, j in combinations(med_hist.columns, 2):
    df = data[(data[i]==1.0) & (data[j]==1.0)]
    x=len(df)
    if x == 0:
        continue  # nobody has both ailments, so there is no rate to compute
    y=len(df[df['ICU']==1.0])
    di[i+' '+j] = ((y/x)*100)
dic = (sorted(di.items() , key = lambda x:x[1], reverse=True))
dic[:10]  # ten pairs with the highest ICU percentage

> These combinations are mostly taken to ICU.

29. Let us have a look at patients aged 100 and above, and check their health issues and how many of them have survived.

In [None]:
df = data[(data['AGE']>=100) & (data['STATUS']=='1')]
df.isin([1.0]).sum(axis=0)

> Out of 208 patients aged 100 and above, 34 have died and the rest have survived. The patients who died mostly had pneumonia and hypertension; 14 of them were females and the rest were males.

30. Since both the chances of being intubed and the death rate are quite low in pregnant ladies, let us check other factors for pregnant women.

In [None]:
df = data[data['PREGNANT']==1.0] #There are 8131 pregnant ladies 
df = df[df['STATUS']=='1'] #89 out of 8131 pregnant ladies have died. Let us check their major symptoms.
df.isin([1.0]).sum(axis=0)

> Pneumonia and Obesity are the major causes of death in pregnant ladies suffering from COVID. A lot of them required ventilator and ICU.

31. Patients below the age of 30 are the most likely to survive, presumably owing to a stronger immune system. Let us check which ailments are prominent among those who have died.

In [None]:
df = data[(data['AGE']<=30.0) & (data['AGE']>=15.0)] 
df = df[df['STATUS']=='1'] #1738 patients between age 15-30 have died 
df.isin([1.0]).sum(axis=0)

> Most of them had Pneumonia. 

32. For patients suffering from a particular ailment, which age group is more likely to use a ventilator?

In [None]:
plt.rcParams["figure.figsize"] = (20,10)
i=1
for col in med_hist.columns:
    df=data[(data[col]==1.0) & (data["INTUBED"]==1.0)]
    plt.subplot(12,1,i)
    df.groupby('INTUBED')['AGE'].value_counts().sort_index().plot.bar()
    plt.title(col)
    i=i+1

> These graphs show, for each ailment, how the number of patients on a ventilator varies with age.

33. For patients suffering from a particular ailment, which age group is more likely to use an ICU?

In [None]:
plt.figure(figsize=(20,10))
i=1
for col in med_hist.columns:
    df=data[(data[col]==1.0) & (data["ICU"]==1.0)]
    plt.subplot(6,2,i)
    df.groupby('ICU')['AGE'].value_counts().sort_index().plot.bar() 
    plt.title(col)
    i+=1

> These graphs show, for each ailment, how the number of patients in an ICU varies with age.
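
Counts per age depend on how many patients of that age exist in the first place. To get closer to a probability, one option is to bin the ages and compute the ICU rate inside each bin. The sketch below does this on made-up data; the bin edges and labels are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up patients: ICU 1 = admitted, 2 = not admitted.
toy = pd.DataFrame({
    "AGE": [5, 25, 45, 65, 85, 70],
    "ICU": [2, 2, 1, 1, 2, 1],
})

bins = [0, 18, 40, 60, 120]               # arbitrary age bands
labels = ["0-18", "19-40", "41-60", "61+"]
age_group = pd.cut(toy["AGE"], bins=bins, labels=labels, include_lowest=True)

# Percentage of patients in each age band who were admitted to an ICU.
icu_rate = (toy["ICU"] == 1).groupby(age_group, observed=True).mean() * 100
print(icu_rate)
```

The same pattern works for `INTUBED`, and it can be applied after filtering on a single ailment column.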