# PACKAGES TO BE USED

In [None]:
import pandas as pd #data loading and manipulation
import numpy as np #numerical operations
import matplotlib.pyplot as plt #plotting
#import seaborn as sns #statistical plotting (currently unused)

# DATASET INFORMATION

**The dataset contains 21 unique attributes and 1,048,576 records. In Boolean attributes, 1 means "Yes" and 2 means "No".**

## *Attributes of the Dataset*


sex: female or male 

    Value 1 -> Female

    Value 2 -> Male

age: age of the patient/subject.

classification: covid test result.

    Values 1-3 -> the patient was diagnosed with covid in different degrees.

    Values >= 4 -> the patient is not a carrier of covid or the test is inconclusive.

patient type: hospitalized or not hospitalized.

pneumonia: presence of air sac inflammation.

pregnancy: whether the patient is pregnant or not.

diabetes: presence of diabetes.

copd: presence of Chronic obstructive pulmonary disease.

asthma: presence of asthma.

inmsupr: whether the patient is immunosuppressed or not.

hypertension: presence of hypertension.

cardiovascular: presence of heart or blood vessel disease.

renal chronic: presence of chronic renal disease.

other disease: presence of other disease.

obesity: whether the patient is obese or not.

tobacco: whether the patient is a tobacco user.

usmr: indicates whether the patient was treated in medical units of the first, second or third level.

medical unit: type of institution of the National Health System that provided the care.

intubed: whether the patient was connected to a ventilator.

icu: indicates whether the patient was admitted to an Intensive Care Unit.

death: indicates whether the patient died or recovered.
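
The coding described above can be decoded into more readable values. The following is a minimal sketch on a small made-up frame; the column names `SEX`, `DIABETES` and `CLASSIFICATION`, the derived column names, and the sample values are assumptions for illustration and are not taken from the dataset file.

```python
import pandas as pd

# Toy data (hypothetical values) that follows the coding described above
toy = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "DIABETES": [1, 2, 2, 1],
    "CLASSIFICATION": [1, 3, 5, 7],
})

# sex: 1 -> Female, 2 -> Male
toy["SEX_LABEL"] = toy["SEX"].map({1: "Female", 2: "Male"})

# Boolean attributes: 1 -> Yes (True), 2 -> No (False)
toy["HAS_DIABETES"] = toy["DIABETES"].map({1: True, 2: False})

# classification: 1-3 -> diagnosed with covid, >= 4 -> not a carrier or inconclusive
toy["COVID_POSITIVE"] = toy["CLASSIFICATION"].between(1, 3)

print(toy)
```

Keeping the decoded values in new columns leaves the original codes available for later checks.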

## *Overview of Dataset*

In [None]:
#Load the dataset into a pandas DataFrame
df= pd.read_csv(r"C:\Users\ideapad\Desktop\Python Programs\Meta_data_BootCamp_Project\Dataset\Covid_Original_Dataset.csv") #Path of dataset stored in the System
print("Dimensions of Dataset:",df.shape)
df.head(5)

In [None]:
#checking null values and datatype of each attribute
df.info()

In [None]:
#checking NA or Empty values in the Dataset
#isna() flags each cell as True if it is NA, else False.
#1st sum() counts the NA values for each attribute.
#2nd sum() gives the total number of NA values.
df.isna().sum().sum()

Since the output of the above code is 0, no NA values are present.

However, missing values have been encoded as 97 or 99.

A missing DATE_DIED has been encoded as 9999-99-99.
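
Because `isna()` does not see these placeholder codes, they have to be counted explicitly. The following is a minimal sketch on made-up data; the toy values, the real-looking date, and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical values) that uses the placeholder codes described above
toy = pd.DataFrame({
    "INTUBED": [97, 1, 2, 2],
    "ICU": [97, 2, 99, 1],
    "DATE_DIED": ["9999-99-99", "2020-05-03", "9999-99-99", "9999-99-99"],
})

MISSING_CODES = [97, 99]          # codes the text above names as missing
MISSING_DATE = "9999-99-99"       # placeholder used for a missing DATE_DIED

# isin() flags every cell holding a placeholder code; sum() counts them per column
missing_per_column = toy[["INTUBED", "ICU"]].isin(MISSING_CODES).sum()

# Count the placeholder dates separately, since DATE_DIED is stored as text
missing_dates = (toy["DATE_DIED"] == MISSING_DATE).sum()

print(missing_per_column)
print("Placeholder dates:", missing_dates)
```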

In [None]:
#Checking for unique values. Most attributes should have only 2 unique values, i.e., 1 and 2.
for col in df.columns:
    print(col, "=>", df[col].nunique(dropna=False))

In [None]:
df.ASTHMA.value_counts()

In [None]:
#Checking for the count of different values in DATE_DIED
df.DATE_DIED.value_counts()

## *Conclusion*
1) Attributes which should have only 2 values (1, 2) have more unique values. This can be seen in the attribute ASTHMA. As noted above, out-of-range codes such as 97, 98 and 99 represent missing values. Thus, such records need to be discarded.
2) In DATE_DIED, "9999-99-99" represents a missing value. These values can be interpreted as missing because the patients/subjects are alive. Thus, this attribute can be used to derive a DEATH indicator: 2 if the patient is alive, otherwise 1.
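
The DEATH indicator described in the conclusion could be derived as follows. This is a minimal sketch on made-up data; the column name `DEATH` and the sample dates are assumptions, and the notebook may implement this step differently.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): the placeholder date marks patients who are alive
toy = pd.DataFrame({"DATE_DIED": ["9999-99-99", "2020-05-03", "9999-99-99"]})

# 2 -> alive (placeholder date), 1 -> died (a real date is recorded)
toy["DEATH"] = np.where(toy["DATE_DIED"] == "9999-99-99", 2, 1)

print(toy)
```

Using 1 and 2 keeps the new attribute consistent with the Yes/No coding of the other attributes.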

# DATA PREPROCESSING
There are missing values in PREGNANT. After manually inspecting some records, the values appear to be missing only for males. To verify this:

In [None]:
df.SEX.value_counts()[2] #Number of males

In [None]:
df.PREGNANT.value_counts() 

It can be deduced that the value 97 in the attribute PREGNANT marks male patients. To verify this further:

In [None]:
df[(df.PREGNANT == 97) & (df.SEX == 2)].shape[0] #Number of records where PREGNANT is missing and SEX is MALE

Thus, the value 97 in PREGNANT is used only for records where SEX is male.

Hence, 97 is replaced with the correct value 2 ("No") for these records.

In [None]:
df["PREGNANT"] = df["PREGNANT"].replace(97, 2) #97 occurs only for males, so it becomes 2 ("No")
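
The same check can be done in one step with a cross-tabulation of SEX against PREGNANT. The following is a minimal sketch on made-up data; the toy values and the names `table` and `only_males` are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical values): 97 appears in PREGNANT only where SEX == 2 (male)
toy = pd.DataFrame({
    "SEX": [1, 1, 2, 2, 2],
    "PREGNANT": [1, 2, 97, 97, 97],
})

# Rows are SEX values, columns are PREGNANT values, cells are record counts
table = pd.crosstab(toy["SEX"], toy["PREGNANT"])
print(table)

# True only if every record with PREGNANT == 97 is male
only_males = (toy.loc[toy["PREGNANT"] == 97, "SEX"] == 2).all()
print("97 occurs only for males:", only_males)

# Replace the placeholder only after the check has passed
if only_males:
    toy["PREGNANT"] = toy["PREGNANT"].replace(97, 2)
```

Guarding the replacement with the check avoids silently overwriting 97 if it ever appears for female records.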

## *CLEANING THE DATASET*

### *Value Analysis of INTUBED Attribute*

In [None]:
counts = df.INTUBED.value_counts().sort_index() #Occurrence of each value in INTUBED
plt.figure(figsize=(12, 3)) #Dimensions of the graph
plt.bar(counts.index.astype(str), counts.values) #String labels keep the bars evenly spaced
plt.xlabel("Values used")
plt.ylabel("Count")
plt.title("Count of Intubed")
plt.show()