<div style="background-color:#5D73F2; color:#19180F; font-size:40px; font-family:Arial; padding:10px; border: 5px solid #19180F; border-radius:10px"> [Beginner friendly] Soft Voting Ensemble </div>

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Importing modules
    </div>


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.impute import SimpleImputer
import lightgbm as lgb
import xgboost as xgb




<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Step 1: Loading the data    </div>


In [2]:
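# Load the competition files from the Kaggle input directory
train = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/train.csv')
test = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/test.csv')
# greeks.csv is loaded here but is not used in the rest of this notebook
greeks = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/greeks.csv')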


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Step 2: Preprocessing the data. Converting the categorical variable 'EJ' to a binary (0/1) column    </div>


In [3]:
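# Use the first category seen in the training data as the reference value
first_category = train.EJ.unique()[0]
# 1 where EJ equals the reference category, 0 otherwise (same mapping for train and test)
train.EJ = train.EJ.eq(first_category).astype('int')
test.EJ = test.EJ.eq(first_category).astype('int')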
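To make the encoding step concrete, here is a small sketch on made-up data. The frames `toy_train` and `toy_test` and their values are assumptions for illustration only, not rows from the competition files. The first category that appears in the training column becomes 1, everything else becomes 0, and the same reference category is reused for the test set.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (hypothetical values)
toy_train = pd.DataFrame({"EJ": ["B", "A", "B", "A"]})
toy_test = pd.DataFrame({"EJ": ["A", "B"]})

# unique() keeps order of first appearance, so this is 'B' for the toy data
first_category = toy_train["EJ"].unique()[0]

# eq() gives True where the value matches; astype('int') turns that into 1/0
toy_train["EJ"] = toy_train["EJ"].eq(first_category).astype("int")
toy_test["EJ"] = toy_test["EJ"].eq(first_category).astype("int")

print(toy_train["EJ"].tolist())  # [1, 0, 1, 0]
print(toy_test["EJ"].tolist())   # [0, 1]
```

Because the reference category depends on row order, which category ends up as 1 can change if the training rows are shuffled. The column stays binary either way.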


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Separating features and target variable for the training set and dropping the Id column from the test set    </div>


In [4]:
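# Keep the test Ids for the submission file before dropping the column
test_ids = test['Id']
test = test.drop(['Id'], axis=1)

# Features (x) are every column except Id and the target; y is the target
x = train.drop(['Id', 'Class'], axis=1)
y = train['Class']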


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Step 3: Handling missing values. Each missing value is replaced with the mean of its column, computed on the training data    </div>


In [5]:
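# Learn each column's mean from the training features and fill their gaps
imputer = SimpleImputer(strategy='mean')
x = imputer.fit_transform(x)
# Reuse the training means for the test set (both results are NumPy arrays)
test = imputer.transform(test)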
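As a rough illustration of mean imputation, the sketch below runs `SimpleImputer` on a tiny made-up array. The arrays `toy_train` and `toy_test` are assumptions for this example. It shows that the column means are learned from the training rows and then reused unchanged on the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: two features, with some missing entries (np.nan)
toy_train = np.array([[1.0, 10.0],
                      [3.0, np.nan],
                      [np.nan, 30.0]])
toy_test = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(toy_train)  # learns the means, then fills
test_filled = imputer.transform(toy_test)        # fills with the training means

print(imputer.statistics_)  # per-column means learned from toy_train
print(train_filled)
print(test_filled)
```

Here the learned means are 2 and 20, so the fully missing test row is filled with exactly those values. Note that in this notebook the imputer is fitted before the train/validation split, so the validation rows also contribute to the means. A stricter setup would fit the imputer on the training split only.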

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Step 4: Initializing the models    </div>


In [6]:
# Default hyperparameters; random_state fixes the seed so results are reproducible
rf_model = RandomForestClassifier(random_state=42)
lgb_model = lgb.LGBMClassifier(random_state=42)
xgb_model = xgb.XGBClassifier(random_state=42)



<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Step 5: Creating the soft voting classifier    </div>


In [7]:
# voting='soft' averages the models' predicted class probabilities
voting_clf = VotingClassifier(
    estimators=[('rf', rf_model), ('lgb', lgb_model), ('xgb', xgb_model)],
    voting='soft'
)
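Soft voting averages the class probabilities predicted by each model and picks the class with the highest average, whereas hard voting counts each model's predicted label. The sketch below reproduces both by hand with NumPy. The three probability arrays are invented numbers standing in for `predict_proba` outputs, not results from the models above.

```python
import numpy as np

# Hypothetical predict_proba outputs from three models for two samples
rf_proba = np.array([[0.9, 0.1], [0.45, 0.55]])
lgb_proba = np.array([[0.8, 0.2], [0.45, 0.55]])
xgb_proba = np.array([[0.7, 0.3], [0.95, 0.05]])
all_proba = [rf_proba, lgb_proba, xgb_proba]

# Soft voting: average the probabilities, then take the most likely class
avg_proba = np.mean(all_proba, axis=0)
soft_labels = avg_proba.argmax(axis=1)

# Hard voting: each model votes with its own predicted label
votes = np.stack([p.argmax(axis=1) for p in all_proba])  # shape (models, samples)
hard_labels = np.array([np.bincount(col).argmax() for col in votes.T])

print(soft_labels.tolist())  # [0, 0]
print(hard_labels.tolist())  # [0, 1]
```

For the second sample, two models lean slightly towards class 1 while one is very confident in class 0. Hard voting follows the majority, but soft voting lets the confident model outweigh the two hesitant ones.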

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Splitting the training data into training and validation sets    </div>


In [8]:
# Hold out 20% of the rows for validation; random_state makes the split reproducible
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=42)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Step 6: Training the soft voting classifier on the training split    </div>


In [9]:
voting_clf.fit(x_train, y_train)
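The notebook creates a validation split but never scores the model on it. One possible check is the log loss of the predicted probabilities on the held-out rows. The sketch below is self-contained and makes several assumptions: it uses synthetic data from `make_classification`, and it swaps in a `LogisticRegression` for the LightGBM and XGBoost models so that it runs with scikit-learn alone. All names ending in `_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the competition data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# A smaller soft voting ensemble, used here only for illustration
clf_demo = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
clf_demo.fit(X_tr, y_tr)

# Score the predicted probabilities on the held-out rows (lower is better)
val_proba = clf_demo.predict_proba(X_va)
val_loss = log_loss(y_va, val_proba)
print(f"validation log loss: {val_loss:.4f}")
```

With the notebook's own objects, the equivalent call would be `log_loss(y_val, voting_clf.predict_proba(x_val))`.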


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Step 7: Making predictions on the test set    </div>


In [10]:
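# predict_proba returns one row per test sample; columns follow voting_clf.classes_ (here 0, then 1)
y_pred = voting_clf.predict_proba(test)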


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Step 8: Generating submission file    </div>


In [11]:
submission_df = pd.DataFrame({
    'Id': test_ids,
    'class_0': y_pred[:, 0],  # Probability of Class 0 (No age-related condition)
    'class_1': y_pred[:, 1],  # Probability of Class 1 (Age-related condition)
})


In [12]:
submission_df.to_csv('submission.csv', index=False)
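Before uploading, it can help to sanity-check the submission frame. The helper below is a hypothetical sketch, not part of the original notebook: `check_submission` and the toy frame `demo_submission` are names invented for this example. It checks that the Ids are unique, that nothing is missing, and that each row's two probabilities sum to 1.

```python
import numpy as np
import pandas as pd


def check_submission(df, proba_cols=("class_0", "class_1")):
    """Run basic consistency checks on a submission-style DataFrame."""
    assert df["Id"].is_unique, "Ids should be unique"
    assert not df.isna().any().any(), "no missing values expected"
    row_sums = df[list(proba_cols)].sum(axis=1)
    assert np.allclose(row_sums, 1.0), "each row's probabilities should sum to 1"
    return True


# Hypothetical stand-in for submission_df
demo_submission = pd.DataFrame({
    "Id": ["a1", "b2", "c3"],
    "class_0": [0.7, 0.2, 0.55],
    "class_1": [0.3, 0.8, 0.45],
})

print(check_submission(demo_submission))  # True
```

In the notebook itself the call would be `check_submission(submission_df)`.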

In [13]:
submission_df

             Id   class_0   class_1
0  00eed32682bb  0.661583  0.338417
1  010ebe33f668  0.661583  0.338417
2  02fa521e1838  0.661583  0.338417
3  040e15f562a2  0.661583  0.338417
4  046e85c7cc7f  0.661583  0.338417
