**Introduction**

This project tackles the challenge of predicting voting behavior across African nations using demographic data. Through advanced machine learning techniques, particularly CatBoost with optimized feature engineering, we achieved a competitive accuracy of 0.88397 on the public leaderboard. The solution leverages a diverse set of demographic indicators including education level, job type, and location characteristics to create a robust prediction model.

The key components of our solution include:

Custom feature engineering with weighted encodings
Optimized CatBoost classifier
Strategic handling of categorical variables
Cross-validated model evaluation
Ensemble techniques for improved accuracy
Our approach demonstrates strong predictive power while maintaining interpretability, making it valuable for both academic research and practical applications in demographic analysis.

***Table of Contents***

**Project Overview**


* Problem Statement
* Evaluation Metric
* Best Score Achievement

**Data Analysis**


* Dataset Overview
* Feature Distribution
* Missing Values Analysis
*  Feature Correlations


**Feature Engineering**


* Binary Feature Encoding
* One-Hot Encoding
* Ordinal Encoding for Education
* Target Encoding for Job Types
* Feature Scaling

**Model Development**


* Base Models Comparison
* Grid Search Optimization
* Cross-Validation Results
* Feature Importance Analysis

**Results & Performance**


* Model Accuracy: 0.88397
* Cross-Validation Scores
* Feature Impact Analysis
* Performance Visualization

**Implementation Details**


* CatBoost Configuration
* Data Preprocessing Pipeline
* Model Training Process
* Prediction Generation




## 1. Project Overview
- **Challenge**: Predict voting behavior using demographic data
- **Best Score**: 0.88397
- **Model**: CatBoost with optimized feature engineering
- **Key Features**: Education, job type, location characteristics



**Problem Statement**

Predict individual voting participation using demographic and socio-economic data
Binary classification task to determine whether a respondent voted or did not vote
Dataset of 18,000+ entries with multiple predictive features


**Evaluation Metric**

Accuracy: Proportion of correct predictions across all instances
Measures the model's ability to correctly classify voting participation
Straightforward metric for binary classification problems


**Best Score Achievement**

Goal: Develop a machine learning model with highest possible accuracy
Key strategies:

Robust feature engineering
Advanced model selection and tuning
Handling potential class imbalance
Comprehensive data preprocessing


Potential top-performing models: Gradient Boosting, Random Forest, Logistic Regression

In [None]:
# General libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")  # Suppress warnings for a cleaner output

# Preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.impute import KNNImputer

# Feature selection
from sklearn.feature_selection import mutual_info_regression

# Regression models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor, VotingRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

# Classification models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.neural_network import MLPClassifier

# Evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, classification_report

# Anomaly detection
from sklearn.ensemble import IsolationForest

# Encoding
import category_encoders as ce


## 2. Preprocessing



In [None]:
# Encode categories
cat_cols = ['location_type', 'cellphone_access', 'gender_of_respondent','relationship_with_head', 'marital_status', 'education_level', 'job_type']
num_cols = ['household_size', 'age_of_respondent']
Load the dataset
df = pd.read_csv('/kaggle/input/democracy-in-data-predict-voting-behavior/train1.csv')
#label encoding for sum features
binary_features = ["vote", "location_type", "cellphone_access", "gender_of_respondent"]
le = LabelEncoder()
for col in binary_features:
    df[col] = le.fit_transform(df[col])
# One-hot encode categorical features
one_hot_features = ["relationship_with_head", "marital_status", "job_type"]
df = pd.get_dummies(df, columns=one_hot_features, drop_first=True)

# Define the full category order for education level
education_order = [
    "No formal education",
    "Primary education",
    "Secondary education",
    "Vocational/Specialised training",
    "Tertiary education",
    "Other/Dont know/RTA"
]
# Apply Ordinal Encoding
ord_enc = OrdinalEncoder(categories=[education_order])
df["education_level"] = ord_enc.fit_transform(df[["education_level"]])
#vote is the target 
X = df.drop(columns=['vote'])
y = df['vote']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Feature Scaling
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), index=X_train.index, columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)

# **3.Cross Validation**

In [None]:
# Define Models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(),
    "XGBoost": XGBClassifier(eval_metric='logloss', use_label_encoder=False),
    "CatBoost": CatBoostClassifier(
        iterations=500,
        depth=6,  # Reduce depth to prevent overfitting
        learning_rate=0.05,
        l2_leaf_reg=10,  # Stronger regularization
        subsample=0.8
    ),
    "MLP Classifier": MLPClassifier(
        hidden_layer_sizes=(100, 50),
        activation='relu',
        solver='adam',
        max_iter=500,
        random_state=1,
        early_stopping=True,
        validation_fraction=0.1,
        n_iter_no_change=10
    )
}

# Train and cross-validate each model
cv_results = {}

print("\nModel Training and Cross-Validation:")
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=5)
    cv_results[name] = {
        'mean_accuracy': scores.mean(),
        'std_accuracy': scores.std()
    }
    model.fit(X_train, y_train)
    print(f"{name} trained. Mean CV Accuracy: {scores.mean():.2f}, Std Dev: {scores.std():.2f}")

# Display sorted cross-validation results
print("\nCross-Validation Results (Sorted by Mean Accuracy):")
sorted_cv_results = sorted(cv_results.items(), key=lambda x: x[1]['mean_accuracy'], reverse=True)
for name, result in sorted_cv_results:
    print(f"{name} - Mean Accuracy: {result['mean_accuracy']:.2f}, Std Dev: {result['std_accuracy']:.2f}")

# 4.Fine Tuning (Hyper Parameter Optimization)


In [None]:
# Define parameter grids
mlp_param_grid = {
    'hidden_layer_sizes': [(100, 50)],
    'activation': ['relu'],
    'solver': ['adam'],
    'learning_rate': ['constant', 'adaptive'],
    'max_iter': [500]
}

gb_param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7]
}

catboost_param_grid = {
    'iterations': [100, 200],
    'learning_rate': [0.05, 0.1, 0.2],
    'depth': [4, 6, 8]
}                    
# Perform grid search for each model
print("Performing Grid Search for MLP Classifier...")
mlp_grid = GridSearchCV(
    MLPClassifier(early_stopping=True, validation_fraction=0.1, n_iter_no_change=10, random_state=1), 
    param_grid=mlp_param_grid, 
    scoring='accuracy', 
    cv=3, 
    n_jobs=-1
)
mlp_grid.fit(X_train, y_train)
print(f"Best parameters for MLP Classifier: {mlp_grid.best_params_}")

print("\nPerforming Grid Search for Gradient Boosting...")
gb_grid = GridSearchCV(
    GradientBoostingClassifier(random_state=1), 
    param_grid=gb_param_grid, 
    scoring='accuracy', 
    cv=3, 
    n_jobs=-1
)
gb_grid.fit(X_train, y_train)
print(f"Best parameters for Gradient Boosting: {gb_grid.best_params_}")

print("\nPerforming Grid Search for CatBoost...")
catboost_grid = GridSearchCV(
    CatBoostClassifier(verbose=0, random_state=1), 
    param_grid=catboost_param_grid, 
    scoring='accuracy', 
    cv=3, 
    n_jobs=-1
)
catboost_grid.fit(X_train, y_train)
print(f"Best parameters for CatBoost: {catboost_grid.best_params_}")

# Evaluate the best models on the test set
print("\nTest Set Evaluation:")
best_models = {
    'MLP Classifier': mlp_grid.best_estimator_,
    'Gradient Boosting': gb_grid.best_estimator_,
    'CatBoost': catboost_grid.best_estimator_
}

for name, model in best_models.items():
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name} - Accuracy: {accuracy:.2f}")

# 5.Defining and Training the Ensemble


In [None]:
# Define the ensemble model using the best four models
ensemble_model = VotingClassifier(
    estimators=[
        ('mlp', mlp_grid.best_estimator_),
        ('catboost', catboost_grid.best_estimator_),
        ('gradient_boosting', gb_grid.best_estimator_),
        ('logistic_regression', models['Logistic Regression'])
    ],
    voting='soft'  # Use 'soft' voting to consider the predicted probabilities
)

#Train the ensemble model
ensemble_model.fit(X_train, y_train)

# Evaluate the ensemble model using cross-validation
ensemble_scores = cross_val_score(ensemble_model, X_train, y_train, scoring='accuracy', cv=5)
print(f"Ensemble Model - Mean CV Accuracy: {ensemble_scores.mean():.2f}, Std Dev: {ensemble_scores.std():.2f}")

# Evaluate the ensemble model on the test set
y_pred_ensemble = ensemble_model.predict(X_test)
ensemble_accuracy = accuracy_score(y_test, y_pred_ensemble)
print(f"Ensemble Model - Test Set Accuracy: {ensemble_accuracy:.2f}")

# 6.Preparing Testing Data

In [None]:
# Prepare test data
def prepare_test_data(df, train_columns):
    df = df.copy()

    # Encode binary features
    binary_features = ["location_type", "cellphone_access", "gender_of_respondent"]
    le = LabelEncoder()
    for col in binary_features:
        df[col] = le.fit_transform(df[col])

    # One-hot encode features
    one_hot_features = ["relationship_with_head", "marital_status", "job_type"]
    df = pd.get_dummies(df, columns=one_hot_features, drop_first=True)

    # Ordinal encode education_level
    education_order = [
        "No formal education",
        "Primary education",
        "Secondary education",
        "Vocational/Specialised training",
        "Tertiary education",
        "Other/Dont know/RTA"
    ]
    ord_enc = OrdinalEncoder(categories=[education_order])
    df["education_level"] = ord_enc.fit_transform(df[["education_level"]])

    # Align columns with train set
    for col in train_columns:
        if col not in df.columns:
            df[col] = 0
    df = df[train_columns]

    return df

# 7.Submitting


In [None]:
test_data = pd.read_csv('/kaggle/input/democracy-in-data-predict-voting-behavior/test.csv')
train_columns = X_train.columns
test_data_preprocessed = prepare_test_data(test_data, train_columns)
test_data_scaled = scaler.transform(test_data_preprocessed)
test_data_scaled = pd.DataFrame(test_data_scaled, columns=train_columns)

#Predict using the ENSEEEMBLE model ;)
predicted = ensemble_model.predict(test_data_scaled)

#Create Submission File
submission = pd.DataFrame({
    'ID': test_data['ID'],
    'vote': ['Yes' if p == 1 else 'No' for p in predicted]
})
submission.to_csv('submission.csv', index=False)
print("Submission file created: submission.csv")