## Goals
1. What are the most important features for predicting X as a target variable?
2. Which classification approach do you prefer for the prediction of X as a target variable, and why?
3. Why is dimensionality reduction important in machine learning?

## Understanding data variables
### Content
The dataset was provided by the Mexican government (link). It contains a large volume of anonymized patient-related information, including pre-existing conditions. The raw dataset consists of 21 features and 1,048,575 patient records. In the Boolean features, 1 means "yes" and 2 means "no". Values such as 97, 98, and 99 denote missing data.

1. sex: 1 for female and 2 for male.
2. age: age of the patient.
3. classification: COVID test findings. Values 1-3 mean that the patient was diagnosed with COVID in different
   degrees. 4 or higher means that the patient is not a carrier of COVID or that the test is inconclusive.
4. patient type: type of care the patient received in the unit. 1 for returned home and 2 for hospitalization.
5. pneumonia: whether the patient already has air sac inflammation or not.
6. pregnancy: whether the patient is pregnant or not.
7. diabetes: whether the patient has diabetes or not.
8. copd: whether the patient has chronic obstructive pulmonary disease or not.
9. asthma: whether the patient has asthma or not.
10. inmsupr: whether the patient is immunosuppressed or not.
11. hypertension: whether the patient has hypertension or not.
12. cardiovascular: whether the patient has a heart or blood vessel related disease or not.
13. renal chronic: whether the patient has chronic renal disease or not.
14. other disease: whether the patient has another disease or not.
15. obesity: whether the patient is obese or not.
16. tobacco: whether the patient is a tobacco user or not.
17. usmer: whether the patient was treated in medical units of the first, second, or third level.
18. medical unit: type of institution of the National Health System that provided the care.
19. intubed: whether the patient was connected to a ventilator or not.
20. icu: whether the patient was admitted to an Intensive Care Unit or not.
21. date died: the date of death if the patient died, and 9999-99-99 otherwise.
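
The coding convention described above (1 for yes, 2 for no, and 97/98/99 for missing) can be illustrated with a minimal sketch. The toy values and the names `raw`, `MISSING_CODES`, and `decoded` are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy values that follow the coding described above (not real records).
raw = pd.Series([1, 2, 97, 2, 99, 1], name="DIABETES")

MISSING_CODES = [97, 98, 99]  # placeholder codes named in the dataset description

# Step 1: turn placeholder codes into NaN.
# Step 2: map 1 -> True ("yes") and 2 -> False ("no"); NaN stays NaN.
decoded = raw.replace(MISSING_CODES, np.nan).map({1: True, 2: False})

print(decoded.tolist())
```

Decoding this way keeps "no" and "missing" clearly separate, which the raw integer codes do not.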


## Reading covid dataset to pandas dataframe

In [1]:
import pandas as pd # Helps to work with pandas DataFrames
import seaborn as sns # Helps with visualization
import matplotlib.pyplot as plt # Helps with visualization
import numpy as np # Helps to work with numbers and arithmetic operations
import warnings
warnings.filterwarnings('ignore')

In [2]:
df= pd.read_csv('Covid Data.csv')
df.head()

Unnamed: 0,USMER,MEDICAL_UNIT,SEX,PATIENT_TYPE,DATE_DIED,INTUBED,PNEUMONIA,AGE,PREGNANT,DIABETES,...,ASTHMA,INMSUPR,HIPERTENSION,OTHER_DISEASE,CARDIOVASCULAR,OBESITY,RENAL_CHRONIC,TOBACCO,CLASIFFICATION_FINAL,ICU
0,2,1,1,1,03/05/2020,97,1,65,2,2,...,2,2,1,2,2,2,2,2,3,97
1,2,1,2,1,03/06/2020,97,1,72,97,2,...,2,2,1,2,2,1,1,2,5,97
2,2,1,2,2,09/06/2020,1,2,55,97,1,...,2,2,2,2,2,2,2,2,3,2
3,2,1,1,1,12/06/2020,97,2,53,2,2,...,2,2,2,2,2,2,2,2,7,97
4,2,1,2,1,21/06/2020,97,2,68,97,1,...,2,2,1,2,2,2,2,2,3,97


In [3]:
pd.set_option('display.max_columns', None) # Display all columns of df instead of truncating them
pd.set_option('display.max_rows', None)
df.head()

Unnamed: 0,USMER,MEDICAL_UNIT,SEX,PATIENT_TYPE,DATE_DIED,INTUBED,PNEUMONIA,AGE,PREGNANT,DIABETES,COPD,ASTHMA,INMSUPR,HIPERTENSION,OTHER_DISEASE,CARDIOVASCULAR,OBESITY,RENAL_CHRONIC,TOBACCO,CLASIFFICATION_FINAL,ICU
0,2,1,1,1,03/05/2020,97,1,65,2,2,2,2,2,1,2,2,2,2,2,3,97
1,2,1,2,1,03/06/2020,97,1,72,97,2,2,2,2,1,2,2,1,1,2,5,97
2,2,1,2,2,09/06/2020,1,2,55,97,1,2,2,2,2,2,2,2,2,2,3,2
3,2,1,1,1,12/06/2020,97,2,53,2,2,2,2,2,2,2,2,2,2,2,7,97
4,2,1,2,1,21/06/2020,97,2,68,97,1,2,2,2,1,2,2,2,2,2,3,97


## Early data analysis
Exploring the nature of the dataset

In [4]:
df.shape
# There are 21 features and 1,048,575 rows

(1048575, 21)

In [5]:
df.duplicated().sum()

812049
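
A high duplicate count is plausible here: none of the 21 listed columns is a patient identifier, so `duplicated()` flags every row whose values all match an earlier row, and two different patients with the same coded profile look identical. The sketch below uses made-up rows (an assumption for illustration) to show the difference between counting only the repeats and counting every row that belongs to a duplicated group.

```python
import pandas as pd

# Made-up rows: three patients share the same coded profile.
toy = pd.DataFrame({
    "SEX": [1, 1, 2, 1],
    "AGE": [40, 40, 55, 40],
    "DIABETES": [2, 2, 1, 2],
})

# Default keep="first": the first occurrence is not flagged, only later repeats.
repeats = toy.duplicated().sum()

# keep=False: every row that has at least one identical twin is flagged.
in_dup_groups = toy.duplicated(keep=False).sum()

print(repeats, in_dup_groups)
```

Whether such rows should be dropped depends on whether they are believed to be repeated entries or distinct patients; the data shown here cannot settle that on its own.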

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 21 columns):
 #   Column                Non-Null Count    Dtype 
---  ------                --------------    ----- 
 0   USMER                 1048575 non-null  int64 
 1   MEDICAL_UNIT          1048575 non-null  int64 
 2   SEX                   1048575 non-null  int64 
 3   PATIENT_TYPE          1048575 non-null  int64 
 4   DATE_DIED             1048575 non-null  object
 5   INTUBED               1048575 non-null  int64 
 6   PNEUMONIA             1048575 non-null  int64 
 7   AGE                   1048575 non-null  int64 
 8   PREGNANT              1048575 non-null  int64 
 9   DIABETES              1048575 non-null  int64 
 10  COPD                  1048575 non-null  int64 
 11  ASTHMA                1048575 non-null  int64 
 12  INMSUPR               1048575 non-null  int64 
 13  HIPERTENSION          1048575 non-null  int64 
 14  OTHER_DISEASE         1048575 non-null  int64 
 15

In [7]:
df.isnull().sum()

USMER                   0
MEDICAL_UNIT            0
SEX                     0
PATIENT_TYPE            0
DATE_DIED               0
INTUBED                 0
PNEUMONIA               0
AGE                     0
PREGNANT                0
DIABETES                0
COPD                    0
ASTHMA                  0
INMSUPR                 0
HIPERTENSION            0
OTHER_DISEASE           0
CARDIOVASCULAR          0
OBESITY                 0
RENAL_CHRONIC           0
TOBACCO                 0
CLASIFFICATION_FINAL    0
ICU                     0
dtype: int64

### Null values
Although the 'isnull()' method returned no null values, the dataset literature states that values such as 97, 98, and 99 are used to fill in missing entries and should be treated as null values.

To proceed, we will count the values in each variable to look for these null values.
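
As a complement to inspecting each variable one at a time, the placeholder codes can be counted for all coded columns in one pass. This is a minimal sketch on a toy frame; the frame, `MISSING_CODES`, and `sentinel_counts` are illustrative names. `AGE` is excluded because 97, 98, and 99 are plausible ages rather than placeholders.

```python
import pandas as pd

MISSING_CODES = [97, 98, 99]  # placeholder codes named in the dataset description

# Toy frame with the same coding scheme (not real records).
toy = pd.DataFrame({
    "PREGNANT": [2, 97, 98, 1],
    "PNEUMONIA": [1, 2, 99, 2],
    "SEX": [1, 2, 1, 1],
    "AGE": [65, 97, 30, 99],
})

# Leave AGE out: there, 97-99 are real values, not missing-data codes.
coded = toy.drop(columns=["AGE"])

# isin() marks each placeholder cell; sum() counts them per column.
sentinel_counts = coded.isin(MISSING_CODES).sum()
print(sentinel_counts)
```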

In [8]:
## storing all variables in a list

In [9]:
df['PREGNANT'].unique()
# PREGNANT contains the placeholder values 97 and 98

array([ 2, 97, 98,  1], dtype=int64)

In [10]:
## Now let's count how many of these null values we have in 'PREGNANT'
df['PREGNANT'].value_counts()
# As shown, there are 527,265 null values (97 and 98 combined) in the PREGNANT variable

97    523511
2     513179
1       8131
98      3754
Name: PREGNANT, dtype: int64

In [11]:
no=[523511,513179,8131,3754]
w= sum(no)
print(w)

1048575


Only biological females can become pregnant, and the data literature does not say otherwise. To handle the proportion of non-pregnant patients, we count the values in the SEX column to see the number of females and males.

In [12]:
# 1: Female and 2: Male
df['SEX'].unique()

array([1, 2], dtype=int64)

In [13]:
df['SEX'].value_counts()

1    525064
2    523511
Name: SEX, dtype: int64
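
Matching totals (523,511 male patients and 523,511 values of 97 in PREGNANT) suggest that 97 marks male patients, but equal totals alone do not prove that the same rows are involved. A cross-tabulation checks this row by row. The sketch below runs on a toy frame; the data and the names `table` and `all_97_male` are assumptions for illustration.

```python
import pandas as pd

# Toy frame: SEX 1 = female, 2 = male; PREGNANT uses 1, 2, 97, 98.
toy = pd.DataFrame({
    "SEX": [1, 1, 1, 2, 2, 2],
    "PREGNANT": [1, 2, 98, 97, 97, 97],
})

# Rows are SEX values, columns are PREGNANT values, cells are row counts.
table = pd.crosstab(toy["SEX"], toy["PREGNANT"])
print(table)

# Direct check: is every row coded 97 a male patient?
all_97_male = (toy.loc[toy["PREGNANT"] == 97, "SEX"] == 2).all()
print(all_97_male)
```

On the real data, the same two calls applied to `df["SEX"]` and `df["PREGNANT"]` would confirm or refute the assumption before any recoding.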

In [14]:
# SEX shows 523,511 male patients, the same as the number of 97 values in PREGNANT,
# so 97 is recoded as 2 (not pregnant). The 98 values are assumed to be pregnant (1).
df['PREGNANT'] = df['PREGNANT'].replace({98: 1, 97: 2}) # assign back to df instead of a chained inplace call

In [15]:
df['PREGNANT'].value_counts()
# 2: Not pregnant. This value covers all male patients as well as women who were not pregnant.
# 1: Pregnant. These are the pregnant women, including those assumed pregnant (originally 98).

2    1036690
1      11885
Name: PREGNANT, dtype: int64

In [16]:
df['USMER'].value_counts()
# No null values in Usmer

2    662903
1    385672
Name: USMER, dtype: int64

In [17]:
df['MEDICAL_UNIT'].value_counts()
# No null values in MEDICAL_UNIT

12    602995
4     314405
6      40584
9      38116
3      19175
8      10399
10      7873
5       7244
11      5577
13       996
7        891
2        169
1        151
Name: MEDICAL_UNIT, dtype: int64

In [18]:
df['PATIENT_TYPE'].value_counts()
#No null values in the patient type

1    848544
2    200031
Name: PATIENT_TYPE, dtype: int64

In [19]:
df['DATE_DIED'].value_counts()
# 9999-99-99: represents patients who did not die
# All other dates represent patients who died
# So this can be categorized into 2 classes: died: 1, did not die: 2

9999-99-99    971633
06/07/2020      1000
07/07/2020       996
13/07/2020       990
16/06/2020       979
16/07/2020       938
14/07/2020       935
17/06/2020       926
29/06/2020       925
08/06/2020       923
09/07/2020       922
15/07/2020       916
12/06/2020       903
05/07/2020       897
14/06/2020       892
12/07/2020       889
30/06/2020       882
15/06/2020       877
08/07/2020       875
10/06/2020       873
11/07/2020       872
04/07/2020       870
10/07/2020       855
13/06/2020       855
09/06/2020       851
17/07/2020       848
28/06/2020       847
03/07/2020       843
01/07/2020       842
27/06/2020       835
18/06/2020       832
23/06/2020       831
03/06/2020       831
07/06/2020       830
02/07/2020       829
01/06/2020       829
21/06/2020       824
25/05/2020       821
22/06/2020       815
06/06/2020       814
11/06/2020       811
24/06/2020       804
18/05/2020       802
26/06/2020       800
04/06/2020       791
19/06/2020       790
02/06/2020       789
25/06/2020   

In [20]:
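Died= df['DATE_DIED'].value_counts() # frequency of each DATE_DIED value
def get_die(value):
    # 9999-99-99 is the placeholder date for patients who did not die
    if value == '9999-99-99':
        return 2 # did not die
    return 1 # died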

In [21]:
df['Died']= df['DATE_DIED'].apply(get_die)
df.head()

Unnamed: 0,USMER,MEDICAL_UNIT,SEX,PATIENT_TYPE,DATE_DIED,INTUBED,PNEUMONIA,AGE,PREGNANT,DIABETES,COPD,ASTHMA,INMSUPR,HIPERTENSION,OTHER_DISEASE,CARDIOVASCULAR,OBESITY,RENAL_CHRONIC,TOBACCO,CLASIFFICATION_FINAL,ICU,Died
0,2,1,1,1,03/05/2020,97,1,65,2,2,2,2,2,1,2,2,2,2,2,3,97,1
1,2,1,2,1,03/06/2020,97,1,72,2,2,2,2,2,1,2,2,1,1,2,5,97,1
2,2,1,2,2,09/06/2020,1,2,55,2,1,2,2,2,2,2,2,2,2,2,3,2,1
3,2,1,1,1,12/06/2020,97,2,53,2,2,2,2,2,2,2,2,2,2,2,7,97,1
4,2,1,2,1,21/06/2020,97,2,68,2,1,2,2,2,1,2,2,2,2,2,3,97,1


In [22]:
# Let's count the values in the Died column to make sure all unique values are accounted for
df['Died'].value_counts()
# A new column Died has now been added to df, where 1: died, 2: did not die

2    971633
1     76942
Name: Died, dtype: int64
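
The same died / did not die flag can also be built without a row-wise `apply`. The sketch below uses `np.where` on a toy column as a vectorized alternative; it is not the notebook's method, and the toy dates are illustrative values in the same format as `DATE_DIED`.

```python
import numpy as np
import pandas as pd

# Toy column in the same format as DATE_DIED.
toy = pd.DataFrame({"DATE_DIED": ["03/05/2020", "9999-99-99", "21/06/2020"]})

# Where the placeholder date appears the patient did not die (2), otherwise died (1).
toy["Died"] = np.where(toy["DATE_DIED"] == "9999-99-99", 2, 1)

print(toy["Died"].tolist())
```

On a frame with over a million rows, a vectorized comparison like this typically runs faster than calling a Python function once per row.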

In [23]:
# Drop the DATE_DIED column
df.drop(['DATE_DIED'], axis=1, inplace=True)

In [24]:
#INTUBED 
df['INTUBED'].value_counts()

97    848544
2     159050
1      33656
99      7325
Name: INTUBED, dtype: int64

In [29]:
# Function to calculate the total number of null placeholder values
def get_null(counts):
    return sum(counts)
null_in=[848544,7325] # counts of 97 and 99 in INTUBED
Total_Null= get_null(null_in)
Total_Null

855869

In [30]:
n=len(df['INTUBED'])
percent_null_Intubate = (Total_Null/n)*100
percent_null_Intubate
# This shows that over 81% of the values in the INTUBED variable are null values, so I will drop the column
# to avoid introducing bias

81.62210619173639
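
The percentage above is computed from counts copied by hand out of the `value_counts()` output. A minimal sketch of deriving the same kind of figure directly from a column is shown below, on a toy series; the values and the names `intubed` and `pct_placeholder` are assumptions for illustration.

```python
import pandas as pd

# Toy column using the INTUBED coding: 1 = yes, 2 = no, 97/99 = missing.
intubed = pd.Series([97, 2, 1, 99, 97, 97, 2, 97, 97, 97])

# isin() gives a Boolean mask; its mean is the fraction of placeholder values.
pct_placeholder = intubed.isin([97, 99]).mean() * 100

print(pct_placeholder)
```

Computing the share from the data avoids transcription errors if the dataset or the cleaning steps change.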

In [32]:
df.drop(['INTUBED'], axis=1, inplace=True)
df.head()

Unnamed: 0,USMER,MEDICAL_UNIT,SEX,PATIENT_TYPE,PNEUMONIA,AGE,PREGNANT,DIABETES,COPD,ASTHMA,INMSUPR,HIPERTENSION,OTHER_DISEASE,CARDIOVASCULAR,OBESITY,RENAL_CHRONIC,TOBACCO,CLASIFFICATION_FINAL,ICU,Died
0,2,1,1,1,1,65,2,2,2,2,2,1,2,2,2,2,2,3,97,1
1,2,1,2,1,1,72,2,2,2,2,2,1,2,2,1,1,2,5,97,1
2,2,1,2,2,2,55,2,1,2,2,2,2,2,2,2,2,2,3,2,1
3,2,1,1,1,2,53,2,2,2,2,2,2,2,2,2,2,2,7,97,1
4,2,1,2,1,2,68,2,1,2,2,2,1,2,2,2,2,2,3,97,1


In [34]:
#PNEUMONIA
df['PNEUMONIA'].value_counts()

2     892534
1     140038
99     16003
Name: PNEUMONIA, dtype: int64

To view all the null values at the same time, I will explicitly ask pandas to treat the values 97, 98, and 99 as null values.

In [36]:
# Explicitly tell pandas to treat the values 97, 98, and 99 as NaN (missing values)
features=["PNEUMONIA","DIABETES","COPD","ASTHMA","INMSUPR","HIPERTENSION","OTHER_DISEASE","CARDIOVASCULAR","OBESITY","RENAL_CHRONIC","TOBACCO","ICU"]
for variable in features:
    df[variable]=df[variable].replace([98,97,99],np.nan)

In [37]:
# Now we are able to get the missing values
df.isnull().sum()

USMER                        0
MEDICAL_UNIT                 0
SEX                          0
PATIENT_TYPE                 0
PNEUMONIA                16003
AGE                          0
PREGNANT                     0
DIABETES                  3338
COPD                      3003
ASTHMA                    2979
INMSUPR                   3404
HIPERTENSION              3104
OTHER_DISEASE             5045
CARDIOVASCULAR            3076
OBESITY                   3032
RENAL_CHRONIC             3006
TOBACCO                   3220
CLASIFFICATION_FINAL         0
ICU                     856032
Died                         0
dtype: int64

In [41]:
df.describe()

Unnamed: 0,USMER,MEDICAL_UNIT,SEX,PATIENT_TYPE,PNEUMONIA,AGE,PREGNANT,DIABETES,COPD,ASTHMA,INMSUPR,HIPERTENSION,OTHER_DISEASE,CARDIOVASCULAR,OBESITY,RENAL_CHRONIC,TOBACCO,CLASIFFICATION_FINAL,ICU,Died
count,1048575.0,1048575.0,1048575.0,1048575.0,1032572.0,1048575.0,1048575.0,1045237.0,1045572.0,1045596.0,1045171.0,1045471.0,1043530.0,1045499.0,1045543.0,1045569.0,1045355.0,1048575.0,192543.0,1048575.0
mean,1.632194,8.980565,1.499259,1.190765,1.864379,41.7941,1.988666,1.88042,1.985594,1.969805,1.986442,1.844349,1.97313,1.980135,1.847145,1.98192,1.919285,5.305653,1.912446,1.926622
std,0.4822084,3.723278,0.4999997,0.3929041,0.3423854,16.90739,0.1058583,0.3244694,0.1191554,0.1711242,0.1156451,0.3625247,0.1617045,0.1395369,0.3598474,0.1332413,0.2723973,1.881165,0.282647,0.2607556
min,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,1.0,4.0,1.0,1.0,2.0,30.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0
50%,2.0,12.0,1.0,1.0,2.0,40.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,6.0,2.0,2.0
75%,2.0,12.0,2.0,1.0,2.0,53.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,7.0,2.0,2.0
max,2.0,13.0,2.0,2.0,2.0,121.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,7.0,2.0,2.0


## Percentage of missing values per column


In [45]:
n=len(df['ICU'])
percent_null_icu= (856032/n)*100
percent_null_icu
# Null values in ICU make up over 81% of all observations, so I will drop the column

81.6376510979186

In [46]:
df.drop(["ICU"],axis=1, inplace=True)

In [47]:
null_list=[3338,3003,2979,3404,3104,5045,3076,3032,3006,3220]
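
Once the placeholder codes have been converted to NaN, the percentage of missing values for every column can be derived in one step rather than from a hand-typed list of counts. The sketch below runs on a toy frame; the data, the 50% cut-off `THRESHOLD`, and the names `pct_missing` and `to_drop` are hypothetical choices for illustration, not values from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame after placeholder codes have been replaced with NaN.
toy = pd.DataFrame({
    "DIABETES": [1, 2, np.nan, 2],
    "ICU": [np.nan, np.nan, np.nan, 2],
    "AGE": [65, 72, 55, 53],
})

# isna().mean() is the fraction of missing cells per column.
pct_missing = toy.isna().mean() * 100
print(pct_missing)

# Hypothetical cut-off: list columns with more than THRESHOLD percent missing.
THRESHOLD = 50
to_drop = pct_missing[pct_missing > THRESHOLD].index.tolist()
print(to_drop)
```

Columns below the cut-off, like the ten whose counts appear in `null_list`, remain candidates for imputation or row removal instead of being dropped.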
