### Case Study – 2
#### Objective:
- Learn to handle missing values.
- Learn to fit a decision tree and compare its accuracy with a random forest classifier.

### Questions:
#### 1. Let’s attempt to predict the survival of a horse based on various observed medical conditions. Load the data from ‘horses.csv’ and observe whether it contains missing values.
[Hint: A pandas DataFrame has an `isnull` method.]

In [1]:
import pandas as pd


In [2]:
df=pd.read_csv("data/horse.csv")
df.head()

Unnamed: 0,surgery,age,hospital_number,rectal_temp,pulse,respiratory_rate,temp_of_extremities,peripheral_pulse,mucous_membrane,capillary_refill_time,...,packed_cell_volume,total_protein,abdomo_appearance,abdomo_protein,outcome,surgical_lesion,lesion_1,lesion_2,lesion_3,cp_data
0,no,adult,530101,38.5,66.0,28.0,cool,reduced,,more_3_sec,...,45.0,8.4,,,died,no,11300,0,0,no
1,yes,adult,534817,39.2,88.0,20.0,,,pale_cyanotic,less_3_sec,...,50.0,85.0,cloudy,2.0,euthanized,no,2208,0,0,no
2,no,adult,530334,38.3,40.0,24.0,normal,normal,pale_pink,less_3_sec,...,33.0,6.7,,,lived,no,0,0,0,yes
3,yes,young,5290409,39.1,164.0,84.0,cold,normal,dark_cyanotic,more_3_sec,...,48.0,7.2,serosanguious,5.3,died,yes,2208,0,0,yes
4,no,adult,530255,37.3,104.0,35.0,,,dark_cyanotic,more_3_sec,...,74.0,7.4,,,died,no,4300,0,0,no


In [3]:
df.isnull().sum()

surgery                    0
age                        0
hospital_number            0
rectal_temp               60
pulse                     24
respiratory_rate          58
temp_of_extremities       56
peripheral_pulse          69
mucous_membrane           47
capillary_refill_time     32
pain                      55
peristalsis               44
abdominal_distention      56
nasogastric_tube         104
nasogastric_reflux       106
nasogastric_reflux_ph    246
rectal_exam_feces        102
abdomen                  118
packed_cell_volume        29
total_protein             33
abdomo_appearance        165
abdomo_protein           198
outcome                    0
surgical_lesion            0
lesion_1                   0
lesion_2                   0
lesion_3                   0
cp_data                    0
dtype: int64

In [4]:
# Observation: 19 of the 28 columns contain missing values, e.g. rectal_temp (60), pulse (24), respiratory_rate (58).
# nasogastric_reflux_ph has the most, with 246 missing values.
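
Raw counts are easier to judge as a share of the rows. The following is a minimal sketch on a hypothetical toy frame (the column values are made up for illustration, not taken from the horse data); it relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the horse data
toy = pd.DataFrame({
    "pulse": [66.0, np.nan, 40.0, 164.0],
    "pain": ["mild", None, None, "severe"],
    "age": ["adult", "adult", "young", "adult"],
})

# isnull() returns a boolean frame; mean() of booleans is the missing fraction
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)

# Show only the columns that actually have gaps
print(missing_pct[missing_pct > 0])
```

Applied to `df`, the same two lines would rank the columns by how sparse they are, which helps decide whether a column is worth imputing at all.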

In [5]:
# split target variable
X=df.drop(["outcome"], axis=1)
y=df["outcome"]


#### 2. This dataset contains many categorical features. Replace them using label encoding.
[Hint: Refer to the get_dummies function in pandas or the LabelEncoder class in scikit-learn]

In [6]:
# identify all categorical (object-typed) columns
cat_cols=X.select_dtypes(include="object").columns.tolist()
cat_cols

['surgery',
 'age',
 'temp_of_extremities',
 'peripheral_pulse',
 'mucous_membrane',
 'capillary_refill_time',
 'pain',
 'peristalsis',
 'abdominal_distention',
 'nasogastric_tube',
 'nasogastric_reflux',
 'rectal_exam_feces',
 'abdomen',
 'abdomo_appearance',
 'surgical_lesion',
 'cp_data']

In [7]:
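# handle categorical columns with LabelEncoder

from sklearn.preprocessing import LabelEncoder

# Caveat: missing values are encoded as a class of their own
# (e.g. 4 for temp_of_extremities in the output below), so these columns
# no longer contain NaN and the imputation in question 3 only affects
# the numeric columns.
encoder=LabelEncoder()
for col in cat_cols:
    # the encoder is refit for every column
    X[col]=encoder.fit_transform(X[col])

X.head()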


Unnamed: 0,surgery,age,hospital_number,rectal_temp,pulse,respiratory_rate,temp_of_extremities,peripheral_pulse,mucous_membrane,capillary_refill_time,...,abdomen,packed_cell_volume,total_protein,abdomo_appearance,abdomo_protein,surgical_lesion,lesion_1,lesion_2,lesion_3,cp_data
0,0,0,530101,38.5,66.0,28.0,1,3,6,2,...,0,45.0,8.4,3,,0,11300,0,0,0
1,1,0,534817,39.2,88.0,20.0,4,4,4,1,...,4,50.0,85.0,1,2.0,0,2208,0,0,0
2,0,0,530334,38.3,40.0,24.0,2,2,5,1,...,3,33.0,6.7,3,,0,0,0,0,1
3,1,1,5290409,39.1,164.0,84.0,0,2,2,2,...,5,48.0,7.2,2,5.3,1,2208,0,0,1
4,0,0,530255,37.3,104.0,35.0,4,4,2,2,...,5,74.0,7.4,3,,0,4300,0,0,0
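
In the output above, the missing `temp_of_extremities` values show up as label 4 rather than as NaN. If the goal is to impute them later, one option is to encode only the observed values and keep the gaps. This is a sketch of that alternative using pandas category codes, not the method used in this notebook; the toy frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in one categorical column
toy = pd.DataFrame({
    "temp_of_extremities": ["cool", np.nan, "normal", "cold", np.nan],
    "surgery": ["no", "yes", "no", "yes", "no"],
})

encoded = toy.copy()
for col in encoded.columns:
    # Category codes are integers in sorted category order; -1 marks missing
    codes = encoded[col].astype("category").cat.codes
    # Turn the -1 markers back into NaN so an imputer can still see them
    encoded[col] = codes.where(codes >= 0)

print(encoded)
```

With the gaps preserved, a most-frequent imputation step afterwards would fill the categorical columns as well as the numeric ones.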


#### 3. Replace the missing values with the most frequent value in each column.
[Hint: Refer to the SimpleImputer class in the scikit-learn impute module (the older Imputer class in the preprocessing module has been removed)]

In [8]:
X=X.fillna(X.mode().iloc[0])
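
The hint points to scikit-learn's imputer rather than `fillna`. A minimal sketch of the equivalent step with `SimpleImputer(strategy="most_frequent")`, assuming scikit-learn 0.20 or later (where the class lives in `sklearn.impute`); the toy frame is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one gap per column
toy = pd.DataFrame({
    "pulse": [66.0, np.nan, 66.0, 40.0],
    "pain": [2.0, 2.0, np.nan, 0.0],
})

# most_frequent replaces each NaN with the mode of its column
imputer = SimpleImputer(strategy="most_frequent")

# fit_transform returns a NumPy array, so rebuild the DataFrame around it
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(filled)
```

Unlike `fillna`, a fitted imputer remembers the per-column modes, so the same values can be reused on new data with `imputer.transform`.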

In [9]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   surgery                299 non-null    int32  
 1   age                    299 non-null    int32  
 2   hospital_number        299 non-null    int64  
 3   rectal_temp            299 non-null    float64
 4   pulse                  299 non-null    float64
 5   respiratory_rate       299 non-null    float64
 6   temp_of_extremities    299 non-null    int32  
 7   peripheral_pulse       299 non-null    int32  
 8   mucous_membrane        299 non-null    int32  
 9   capillary_refill_time  299 non-null    int32  
 10  pain                   299 non-null    int32  
 11  peristalsis            299 non-null    int32  
 12  abdominal_distention   299 non-null    int32  
 13  nasogastric_tube       299 non-null    int32  
 14  nasogastric_reflux     299 non-null    int32  
 15  nasoga

#### 4. Fit a decision tree classifier and observe the accuracy.


In [10]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

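model=DecisionTreeClassifier()
model.fit(X,y)

# Note: predicting on the rows used for fitting measures training accuracy only
y_predict=model.predict(X)
print(accuracy_score(y, y_predict))  # signature is accuracy_score(y_true, y_pred)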

1.0
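
The 1.0 above is measured on the same rows the tree was fitted on, and an unpruned decision tree can usually memorise its training set. A held-out split gives a more honest estimate. The sketch below uses synthetic data from `make_classification` as a stand-in, so its numbers say nothing about the horse data; the split size and `random_state` values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for X and y (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Hold out 25% of the rows; stratify keeps class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Compare accuracy on seen rows with accuracy on unseen rows
train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A large gap between the two numbers is the usual sign of overfitting.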


#### 5. Fit a random forest classifier and observe the accuracy.

In [11]:
from sklearn.ensemble import RandomForestClassifier

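r_model=RandomForestClassifier()
r_model.fit(X,y)

# Note: as with the decision tree, this is accuracy on the training rows
y_predict=r_model.predict(X)
print(accuracy_score(y, y_predict))  # signature is accuracy_score(y_true, y_pred)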


1.0
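
Both models score 1.0 on their own training data, so these figures cannot tell them apart. One way to compare them is k-fold cross-validation, where every score comes from rows the model did not see during fitting. This is a hedged sketch on synthetic data (a stand-in for `X` and `y`); the fold count, `n_estimators` and `random_state` are illustrative choices, not values from this case study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for X and y (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, estimator in models.items():
    # 5-fold CV: each fold is scored by a model fitted on the other four
    fold_acc = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    scores[name] = fold_acc.mean()
    print(f"{name}: {fold_acc.mean():.3f} +/- {fold_acc.std():.3f}")
```

Replacing `X_demo` and `y_demo` with the imputed `X` and `y` would give the comparison the objective asks for.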
