# Classification with k-fold cross-validation

To determine the best classification model, I used stratified k-fold cross-validation with 10 splits (shuffled, with a fixed random seed).
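As a minimal sketch of what the stratified 10-fold split does, the snippet below runs `StratifiedKFold` on toy labels and records the size and class balance of each held-out fold. The toy labels (`y_toy`, 60 samples of one class and 40 of the other) are an assumption made for illustration and are not the spam data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (assumption for illustration): 60 samples of class 0, 40 of class 1.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not influence the split

cv_toy = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

fold_sizes = []
positive_rates = []
seen = []
for train_idx, test_idx in cv_toy.split(X_toy, y_toy):
    fold_sizes.append(len(test_idx))
    # Share of class 1 in the held-out fold; stratification keeps it near 0.4.
    positive_rates.append(float(y_toy[test_idx].mean()))
    seen.extend(test_idx.tolist())

print(fold_sizes)
print(positive_rates)
```

Every sample lands in exactly one held-out fold, and each fold keeps roughly the class ratio of the full label vector, which is why the per-fold metrics are comparable.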

Models I tried:
- Logistic Regression
- Support Vector Classifier
- Random Forest Classifier

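For Logistic Regression and SVC, I standardized the features with StandardScaler inside a Pipeline, so the scaler is fit on the training folds only. The Random Forest is trained on the unscaled features.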
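The following sketch illustrates why the scaler sits inside a `Pipeline`: when the pipeline is fit on one training fold, the scaler's statistics come from that fold alone, so the held-out rows do not leak into preprocessing. The data here is synthetic (`make_classification`), an assumption standing in for the spam dataset, and the names `pipe`, `X_demo`, and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption; not the spam dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Take the first of the 10 stratified folds.
cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
train_idx, test_idx = next(iter(cv_demo.split(X_demo, y_demo)))

# Fitting the pipeline on the training fold fits the scaler on those rows only.
pipe.fit(X_demo[train_idx], y_demo[train_idx])
scaler_mean = pipe.named_steps["scaler"].mean_

# Compare the learned means with the training-fold means.
uses_train_stats = bool(np.allclose(scaler_mean, X_demo[train_idx].mean(axis=0)))
print("scaler fitted on training fold only:", uses_train_stats)

# The held-out fold is scaled with the training statistics before scoring.
fold_accuracy = pipe.score(X_demo[test_idx], y_demo[test_idx])
print("held-out accuracy for this fold:", round(fold_accuracy, 3))
```

`cross_validate` repeats this fit-and-score step for every fold when it is handed the pipeline rather than pre-scaled data.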



In [2]:
import pandas as pd

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

RANDOM_STATE = 42

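# Load the data; "type" is the target column, all other columns are features.
df = pd.read_excel("Spam.xlsx")
# LabelEncoder maps the class names to integers in sorted order; the binary
# metrics used below (precision, recall, f1) treat the class encoded as 1 as positive.
le = LabelEncoder()
y = le.fit_transform(df["type"])
X = df.drop(columns=["type"])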

models = {
    "LogReg": Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000, random_state=RANDOM_STATE))
    ]),

    "SVC": Pipeline([
        ("scaler", StandardScaler()),
        ("clf", SVC(kernel="rbf", probability=True, random_state=RANDOM_STATE))
    ]),

    "RandomForest": RandomForestClassifier(
        n_estimators=400,
        max_depth=None,
        random_state=RANDOM_STATE,
        n_jobs=-1
    ),
}

scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

cv = StratifiedKFold(
    n_splits=10,
    shuffle=True,
    random_state=RANDOM_STATE
)

# Perform k-fold cross-validation for each model
results = []
for name, model in models.items():
    cv_out = cross_validate(
        model, X, y,
        cv=cv,
        scoring=scoring,
        n_jobs=-1,
        return_train_score=False
    )
    summary = {f"{metric}_mean":  cv_out[f"test_{metric}"].mean()
               for metric in scoring}
    summary.update({f"{metric}_std":  cv_out[f"test_{metric}"].std(ddof=0)
                    for metric in scoring})
    summary["model"] = name
    results.append(summary)

df_result = (pd.DataFrame(results)
             .set_index("model")
             .sort_values("accuracy_mean", ascending=False))

print("\nk-fold cross-validation results (with 10 splits):")
print(df_result.round(3))

# Pick the best model by mean cross-validated accuracy and refit it on the full dataset
best_name = df_result.index[0]
best_model = models[best_name].fit(X, y)
print(f"\nbest model: {best_name}")


k-fold cross-validation results (with 10 splits):
              accuracy_mean  precision_mean  recall_mean  f1_mean  \
model                                                               
RandomForest          0.953           0.962        0.943    0.953   
SVC                   0.924           0.941        0.905    0.923   
LogReg                0.922           0.926        0.917    0.921   

              roc_auc_mean  accuracy_std  precision_std  recall_std  f1_std  \
model                                                                         
RandomForest         0.987         0.015          0.020       0.021   0.015   
SVC                  0.971         0.018          0.017       0.026   0.019   
LogReg               0.969         0.016          0.017       0.026   0.017   

              roc_auc_std  
model                      
RandomForest        0.006  
SVC                 0.009  
LogReg              0.010  

best model: RandomForest


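With an accuracy of 0.9328, the neural network from part b would rank behind the RandomForest classifier (mean accuracy 0.953) and slightly ahead of SVC (0.924) and LogReg (0.922). Its lead over SVC and LogReg is smaller than one fold-to-fold standard deviation of their accuracy (0.018 and 0.016).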
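The ranking can be checked with a few lines of arithmetic. This is a small sketch that copies the mean accuracies and standard deviations from the printed results and the 0.9328 figure quoted for part b; the dictionary names are hypothetical, and since the evaluation protocol of part b is not restated here, the comparison is only indicative.

```python
# Mean CV accuracies copied from the printed results, plus the accuracy
# quoted for the neural network from part b.
accuracies = {
    "RandomForest": 0.953,
    "SVC": 0.924,
    "LogReg": 0.922,
    "NeuralNet (part b)": 0.9328,
}
# Fold-to-fold standard deviations of accuracy, also from the printed results.
accuracy_std = {"RandomForest": 0.015, "SVC": 0.018, "LogReg": 0.016}

# Sort from highest to lowest accuracy.
ranking = sorted(accuracies.items(), key=lambda item: item[1], reverse=True)
for rank, (name, acc) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {acc:.4f}")

# Gap between the neural network and each cross-validated model,
# expressed in units of that model's standard deviation.
gaps_in_std = {
    name: (accuracies["NeuralNet (part b)"] - accuracies[name]) / accuracy_std[name]
    for name in accuracy_std
}
print(gaps_in_std)
```

A gap well below one standard deviation, as for SVC and LogReg, means the ordering of those models against the neural network could plausibly change with a different split.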