### Module 7: Supervised Learning-1

#### Case Study–2

Objective:

• Learn to handle missing values

• Learn to fit a decision tree and compare its accuracy with a random forest classifier.

#### 1. Let’s attempt to predict the survival of a horse based on various observed medical conditions. Load the data from ‘horses.csv’ and observe whether it contains missing values.
[Hint: Pandas dataframe has a method isnull]

In [4]:
import pandas as pd

# Load the dataset (the exercise text refers to 'horses.csv'; this notebook reads 'horse.csv')
df = pd.read_csv("horse.csv")

# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())

# Overview of dataset
print("\nDataset shape:", df.shape)
print("\nFirst 5 rows:")
df.head()

Missing values per column:
surgery                    0
age                        0
hospital_number            0
rectal_temp               60
pulse                     24
respiratory_rate          58
temp_of_extremities       56
peripheral_pulse          69
mucous_membrane           47
capillary_refill_time     32
pain                      55
peristalsis               44
abdominal_distention      56
nasogastric_tube         104
nasogastric_reflux       106
nasogastric_reflux_ph    246
rectal_exam_feces        102
abdomen                  118
packed_cell_volume        29
total_protein             33
abdomo_appearance        165
abdomo_protein           198
outcome                    0
surgical_lesion            0
lesion_1                   0
lesion_2                   0
lesion_3                   0
cp_data                    0
dtype: int64

Dataset shape: (299, 28)

First 5 rows:


| | surgery | age | hospital_number | rectal_temp | pulse | respiratory_rate | temp_of_extremities | peripheral_pulse | mucous_membrane | capillary_refill_time | ... | packed_cell_volume | total_protein | abdomo_appearance | abdomo_protein | outcome | surgical_lesion | lesion_1 | lesion_2 | lesion_3 | cp_data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | no | adult | 530101 | 38.5 | 66.0 | 28.0 | cool | reduced | NaN | more_3_sec | ... | 45.0 | 8.4 | NaN | NaN | died | no | 11300 | 0 | 0 | no |
| 1 | yes | adult | 534817 | 39.2 | 88.0 | 20.0 | NaN | NaN | pale_cyanotic | less_3_sec | ... | 50.0 | 85.0 | cloudy | 2.0 | euthanized | no | 2208 | 0 | 0 | no |
| 2 | no | adult | 530334 | 38.3 | 40.0 | 24.0 | normal | normal | pale_pink | less_3_sec | ... | 33.0 | 6.7 | NaN | NaN | lived | no | 0 | 0 | 0 | yes |
| 3 | yes | young | 5290409 | 39.1 | 164.0 | 84.0 | cold | normal | dark_cyanotic | more_3_sec | ... | 48.0 | 7.2 | serosanguious | 5.3 | died | yes | 2208 | 0 | 0 | yes |
| 4 | no | adult | 530255 | 37.3 | 104.0 | 35.0 | NaN | NaN | dark_cyanotic | more_3_sec | ... | 74.0 | 7.4 | NaN | NaN | died | no | 4300 | 0 | 0 | no |
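
The raw counts above are easier to judge as fractions of the 299 rows; for example, `nasogastric_reflux_ph` is missing in 246 of 299 rows, roughly 82%. The sketch below shows one way to compute such fractions. It uses a small made-up frame (`toy`) rather than the horse data, and the 0.4 threshold is an arbitrary illustrative choice, not something the exercise prescribes.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the horse data (values are illustrative only)
toy = pd.DataFrame({
    "rectal_temp": [38.5, np.nan, 38.3, 39.1],
    "pulse": [66.0, 88.0, np.nan, np.nan],
    "outcome": ["died", "euthanized", "lived", "died"],
})

# isnull() gives a boolean frame; the column-wise mean is the fraction missing
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)

# Columns whose missing fraction exceeds the (arbitrary) 0.4 threshold
mostly_missing = missing_fraction[missing_fraction > 0.4].index.tolist()
print(mostly_missing)
```

On the real data, the same two lines applied to `df` would flag the columns where imputation has to invent most of the values.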


#### 2. This dataset contains many categorical features; replace them with label encoding.
[Hint: Refer to the get_dummies function in pandas or LabelEncoder in scikit-learn]


In [3]:
from sklearn.preprocessing import LabelEncoder

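# Identify categorical (object-dtype) columns
categorical_cols = df.select_dtypes(include=['object']).columns
print("Categorical columns:", categorical_cols)

# Apply Label Encoding to each categorical column.
# Caution: astype(str) turns NaN into the string 'nan', which gets its own label,
# so missing categorical values are no longer seen as missing by a later imputer.
# The single encoder is refit per column, so only the last column's classes are retained.
le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col].astype(str))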

print("\nData after label encoding:")
df.head()

Categorical columns: Index(['surgery', 'age', 'temp_of_extremities', 'peripheral_pulse',
       'mucous_membrane', 'capillary_refill_time', 'pain', 'peristalsis',
       'abdominal_distention', 'nasogastric_tube', 'nasogastric_reflux',
       'rectal_exam_feces', 'abdomen', 'abdomo_appearance', 'outcome',
       'surgical_lesion', 'cp_data'],
      dtype='object')

Data after label encoding:


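| | surgery | age | hospital_number | rectal_temp | pulse | respiratory_rate | temp_of_extremities | peripheral_pulse | mucous_membrane | capillary_refill_time | ... | packed_cell_volume | total_protein | abdomo_appearance | abdomo_protein | outcome | surgical_lesion | lesion_1 | lesion_2 | lesion_3 | cp_data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 530101 | 38.5 | 66.0 | 28.0 | 1 | 4 | 3 | 2 | ... | 45.0 | 8.4 | 2 | NaN | 0 | 0 | 11300 | 0 | 0 | 0 |
| 1 | 1 | 0 | 534817 | 39.2 | 88.0 | 20.0 | 2 | 2 | 5 | 1 | ... | 50.0 | 85.0 | 1 | 2.0 | 1 | 0 | 2208 | 0 | 0 | 0 |
| 2 | 0 | 0 | 530334 | 38.3 | 40.0 | 24.0 | 3 | 3 | 6 | 1 | ... | 33.0 | 6.7 | 2 | NaN | 2 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1 | 1 | 5290409 | 39.1 | 164.0 | 84.0 | 0 | 3 | 2 | 2 | ... | 48.0 | 7.2 | 3 | 5.3 | 0 | 1 | 2208 | 0 | 0 | 1 |
| 4 | 0 | 0 | 530255 | 37.3 | 104.0 | 35.0 | 2 | 2 | 2 | 2 | ... | 74.0 | 7.4 | 2 | NaN | 0 | 0 | 4300 | 0 | 0 | 0 |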
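
One side effect of the encoding above is visible in the preview: the missing `mucous_membrane` in row 0 now has the label 3, because `astype(str)` turned `NaN` into the string `'nan'`. The sketch below contrasts that behaviour with one possible alternative that keeps `NaN` intact so a later imputation step can still see it. The series `col` is made up for illustration, and the use of pandas category codes is an assumption of this sketch, not the method used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up column with two missing entries (illustrative values only)
col = pd.Series(["cool", np.nan, "normal", "cold", np.nan],
                name="temp_of_extremities")

# Notebook approach: NaN becomes the literal string 'nan' and is encoded as a class
le = LabelEncoder()
encoded_with_str = le.fit_transform(col.astype(str))
print(list(le.classes_))          # 'nan' is listed among the classes

# Alternative sketch: category codes mark missing values as -1,
# which is converted back to NaN after casting to float
codes = col.astype("category").cat.codes
encoded_keep_nan = codes.astype("float64").where(codes >= 0)
print(encoded_keep_nan.tolist())
```

With the second form, the most-frequent imputation in the next step would fill the categorical gaps as well, instead of treating `'nan'` as a legitimate category.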


#### 3. Replace the missing values with the most frequent value in each column.
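[Hint: Refer to the SimpleImputer class in the scikit-learn impute module; it replaces the older Imputer class from the preprocessing module, which has been removed from current scikit-learn releases]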


In [5]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# Encode categorical features so the imputer receives numeric input
# (this loop does nothing if the columns were already encoded in step 2)
categorical_cols = df.select_dtypes(include=['object']).columns
le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col].astype(str))

# Replace missing values with most frequent value
imputer = SimpleImputer(strategy='most_frequent')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("Missing values after imputation:")
print(df_imputed.isnull().sum())

print("\nFirst 5 rows after imputation:")
df_imputed.head()

Missing values after imputation:
surgery                  0
age                      0
hospital_number          0
rectal_temp              0
pulse                    0
respiratory_rate         0
temp_of_extremities      0
peripheral_pulse         0
mucous_membrane          0
capillary_refill_time    0
pain                     0
peristalsis              0
abdominal_distention     0
nasogastric_tube         0
nasogastric_reflux       0
nasogastric_reflux_ph    0
rectal_exam_feces        0
abdomen                  0
packed_cell_volume       0
total_protein            0
abdomo_appearance        0
abdomo_protein           0
outcome                  0
surgical_lesion          0
lesion_1                 0
lesion_2                 0
lesion_3                 0
cp_data                  0
dtype: int64

First 5 rows after imputation:


| | surgery | age | hospital_number | rectal_temp | pulse | respiratory_rate | temp_of_extremities | peripheral_pulse | mucous_membrane | capillary_refill_time | ... | packed_cell_volume | total_protein | abdomo_appearance | abdomo_protein | outcome | surgical_lesion | lesion_1 | lesion_2 | lesion_3 | cp_data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 530101.0 | 38.5 | 66.0 | 28.0 | 1.0 | 4.0 | 3.0 | 2.0 | ... | 45.0 | 8.4 | 2.0 | 2.0 | 0.0 | 0.0 | 11300.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 534817.0 | 39.2 | 88.0 | 20.0 | 2.0 | 2.0 | 5.0 | 1.0 | ... | 50.0 | 85.0 | 1.0 | 2.0 | 1.0 | 0.0 | 2208.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 530334.0 | 38.3 | 40.0 | 24.0 | 3.0 | 3.0 | 6.0 | 1.0 | ... | 33.0 | 6.7 | 2.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 1.0 | 1.0 | 5290409.0 | 39.1 | 164.0 | 84.0 | 0.0 | 3.0 | 2.0 | 2.0 | ... | 48.0 | 7.2 | 3.0 | 5.3 | 0.0 | 1.0 | 2208.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 | 530255.0 | 37.3 | 104.0 | 35.0 | 2.0 | 2.0 | 2.0 | 2.0 | ... | 74.0 | 7.4 | 2.0 | 2.0 | 0.0 | 0.0 | 4300.0 | 0.0 | 0.0 | 0.0 |
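
To make the `most_frequent` strategy concrete, here is a minimal sketch on two tiny made-up arrays (`train` and `test` are hypothetical, not taken from the horse data). It also illustrates a common precaution: learning the fill values from the training rows only and then applying them to unseen rows, whereas the cell above fits the imputer on the full dataset before any split.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up training rows: column 0 has mode 1.0, column 1 has mode 20.0
train = np.array([[1.0, 10.0],
                  [1.0, np.nan],
                  [2.0, 20.0],
                  [np.nan, 20.0]])

# A made-up unseen row with both values missing
test = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(train)                      # learn the per-column modes from training rows only

print(imputer.statistics_)              # per-column fill values: [ 1. 20.]
print(imputer.transform(test))          # the unseen row is filled with those values: [[ 1. 20.]]
```

Fitting on the training split alone keeps information from the test rows out of the fill values, which matters more as the share of missing entries grows.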


#### 4. Fit a decision tree classifier and observe the accuracy.

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Separate predictors and target
# (hospital_number, an identifier, is kept among the predictors here)
X = df_imputed.drop(columns=['outcome'])
y = df_imputed['outcome']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit Decision Tree Classifier
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Predict and measure accuracy
y_pred = dt_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Decision Tree Test Accuracy:", accuracy)


Decision Tree Test Accuracy: 0.6166666666666667
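
A fully grown decision tree usually fits its training data almost perfectly, so comparing training and test accuracy is a quick way to see how much of the gap is overfitting. The sketch below does this on synthetic data from `make_classification`, which is a stand-in for the horse data; the depth values tried are arbitrary illustrative choices, and the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem of roughly the same size as the horse data
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Compare an unrestricted tree with two depth-limited trees
results = {}
for depth in (None, 3, 5):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```

Running the same loop with the `X_train`/`X_test` split from the cell above would show whether limiting `max_depth` narrows the train/test gap on the actual data.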


#### 5. Fit a random forest classifier and observe the accuracy.

In [20]:
from sklearn.ensemble import RandomForestClassifier

# Fit Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict and measure accuracy
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Random Forest Test Accuracy:", accuracy)


Random Forest Test Accuracy: 0.7
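
With a 20% split of 299 rows, the test set has 60 rows, so the two accuracies above correspond to 37 and 42 correct predictions: a difference of five horses. A cross-validated comparison gives a steadier picture than a single split. The following sketch shows the pattern on synthetic data generated with `make_classification`; the data, the fold count, and the dictionary names are assumptions of this example, and its scores are not results for the horse dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the imputed horse features
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# The same stratified folds are reused for both models so the comparison is paired
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

Substituting the notebook's `X` and `y` for the synthetic arrays would report a mean and spread for each model across five folds instead of a single 60-row estimate.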
