# Testing models 
by YM

In [1]:
import pandas as pd

In [17]:
data = pd.read_csv("../data/training_data.csv")

In [18]:
data.columns

Index(['Inscrits', 'Votants', 'Nom', 'victimes_par_hab', 'infractions_par_hab',
       'mosquee_par_hab', 'NB_Pers_par_Foyer_Alloc_par_hab',
       'statut_commune_uu2020', 'revenu_imposable_par_habitant', 'rural',
       'montagne', 'touristique', 'ptot_n', 'dep_inv_hor_remb_par_hab',
       'dot_glo_fonc_par_hab', 'conso_ind_par_hab', 'cons_agr_par_hab',
       'conso_ter_par_hab', 'conso_res_par_hab'],
      dtype='object')

In [19]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Identify the categorical and numerical columns
categorical_cols = ['statut_commune_uu2020']
numerical_cols = data.columns.drop(['Nom', 'statut_commune_uu2020'])

# Create transformers for preprocessing steps
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
numerical_transformer = StandardScaler()

# Combine transformers into a preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Create a pipeline with the preprocessor
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Apply transformations to the data
X = data.drop('Nom', axis=1)
y = data['Nom']
X_transformed = pipeline.fit_transform(X)

# Show the transformed feature matrix shape to confirm processing
X_transformed.shape

(41367, 19)

In [20]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Define the models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(),
    'Gradient Boosting': GradientBoostingClassifier()
}

# Evaluate each model using cross-validation
model_scores = {}
for model_name, model in models.items():
    scores = cross_val_score(model, X_transformed, y, cv=5, scoring='accuracy')
    model_scores[model_name] = scores.mean()

model_scores



{'Logistic Regression': 0.5852016518374759,
 'Decision Tree': 0.4171701742807922,
 'Random Forest': 0.47745981705225304,
 'SVM': 0.534390130301176,
 'Gradient Boosting': 0.5397080723758357}
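
The scores above come from `X_transformed`, which was scaled and encoded on the full dataset before cross-validation, so every fold's held-out rows contributed to the scaler statistics. One way to avoid this is to put the preprocessing and the classifier in a single `Pipeline`, so each fold refits the transformers on its own training split. The following is a minimal sketch on synthetic data: the `demo` frame, its values, and the placeholder labels are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for training_data.csv (hypothetical values)
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "Inscrits": rng.integers(100, 5000, n),
    "revenu_imposable_par_habitant": rng.normal(15000, 3000, n),
    "statut_commune_uu2020": rng.choice(["C", "B", "I", "H"], n),
})
demo_y = rng.choice(["A", "B", "C"], n)  # placeholder labels standing in for 'Nom'

demo_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["Inscrits", "revenu_imposable_par_habitant"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["statut_commune_uu2020"]),
])

# Preprocessing and model form one estimator, so each CV fold
# refits the scaler and encoder on its training split only
demo_model = Pipeline([
    ("preprocessor", demo_preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

demo_scores = cross_val_score(demo_model, demo, demo_y, cv=5, scoring="accuracy")
print(demo_scores.mean())
```

In the notebook this would mean passing the raw `X` (not `X_transformed`) to `cross_val_score`. With a plain `StandardScaler` the difference in scores is usually small, but the pipeline form keeps the evaluation clean if more aggressive preprocessing is added later.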


#### Conclusion on Machine Learning Model Performance

Five classifiers were evaluated with 5-fold cross-validation, each with scikit-learn's default hyperparameters (apart from `max_iter=1000` for Logistic Regression). Mean accuracy per model:

- **Logistic Regression:** 58.52%
- **Decision Tree:** 41.72%
- **Random Forest:** 47.75%
- **SVM (Support Vector Machine):** 53.44%
- **Gradient Boosting:** 53.97%

##### Key Observations:
- **Logistic Regression** scored highest (58.52%), suggesting that a linear decision boundary captures much of the signal available in these features.
- **Decision Tree** scored lowest (41.72%). The tree was grown with default settings, so its depth was unlimited, which makes overfitting a likely explanation.
- **Random Forest** (47.75%) and **Gradient Boosting** (53.97%) both trailed Logistic Regression despite being ensemble methods. Both ran with default hyperparameters, so tuning may narrow the gap.
- **SVM** (53.44%) used the default RBF kernel with `C=1`. A different kernel or regularization strength might yield better results.
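
These accuracies are easier to judge against a majority-class baseline, which the notebook does not report. A minimal sketch of such a baseline with `DummyClassifier`, assuming synthetic data (the class labels and their proportions below are invented for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: three imbalanced classes with made-up proportions
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = rng.choice(["A", "B", "C"], size=500, p=[0.5, 0.3, 0.2])

# Always predicts the most frequent class seen in the training split
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="accuracy")
print(baseline_scores.mean())
```

Running the same two lines on `X_transformed` and `y` would show how far each model sits above the share of the most common label.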

##### Recommendations:
1. **Hyperparameter Tuning:** Use `GridSearchCV` to search model parameters, especially for the ensemble methods and SVM, which were only run with defaults.
2. **Feature Engineering:** Refine and select input features. This may help most for the models that currently underperform.
3. **Adjusting Model Complexity:** Limit parameters such as tree depth in the tree-based models to better balance bias and variance.
4. **Exploring Additional Algorithms:** Test other algorithms or more advanced techniques if the tuned models still fall short.
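
Recommendations 1 and 3 can be combined by searching over tree depth and leaf size. The sketch below does this for a Random Forest on synthetic data from `make_classification`. The grid values, `n_estimators=50`, and `cv=3` are illustrative assumptions chosen to keep the run short, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class problem standing in for the real data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Shallower trees and larger leaves reduce variance at the cost of bias
param_grid_rf = {
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 5],
}

grid_search_rf = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid=param_grid_rf,
    cv=3,
    scoring="accuracy",
)
grid_search_rf.fit(X_demo, y_demo)

print(grid_search_rf.best_params_)
print(grid_search_rf.best_score_)
```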

These results serve as a baseline: the best model reaches 58.52% accuracy, which leaves substantial room for improvement through methodical tuning and further model exploration.


### Hyperparameter Tuning for Logistic Regression with GridSearchCV

In [21]:
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid for Logistic Regression
param_grid_lr = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength (smaller = stronger)
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}

# Set up GridSearchCV with the same 5-fold accuracy scoring used above
grid_search_lr = GridSearchCV(estimator=LogisticRegression(max_iter=1000),
                              param_grid=param_grid_lr,
                              cv=5,
                              scoring='accuracy')

# Fitting GridSearchCV
grid_search_lr.fit(X_transformed, y)

# Best parameters and best score
best_params_lr = grid_search_lr.best_params_
best_score_lr = grid_search_lr.best_score_

print("Best Parameters for Logistic Regression:", best_params_lr)
print("Best Scoring by GridSearchCV:", best_score_lr)



Best Parameters for Logistic Regression: {'C': 100, 'solver': 'liblinear'}
Best Scoring by GridSearchCV: 0.6059181717111218
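
The selected `C=100` is the largest value in the grid, so the optimum may lie beyond it. Before extending the grid, it can help to look at how the score varies across candidates. The sketch below reads `cv_results_` into a DataFrame. It runs on synthetic data with a reduced grid, and the `demo_search` and `results` names are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class problem standing in for the real data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

demo_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 1, 100]},
    cv=5,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the fold scores, best candidate first
results = (
    pd.DataFrame(demo_search.cv_results_)
    [["param_C", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results)
```

If the gap between the top candidates is smaller than `std_test_score`, the choice among them is within the noise of the folds.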


In [22]:
%pip install joblib

Note: you may need to restart the kernel to use updated packages.


In [23]:
from joblib import dump

# Persist the tuned estimator; it expects inputs already transformed by `pipeline`
dump(grid_search_lr.best_estimator_, '../data/best_logistic_regression_model.joblib')


['../data/best_logistic_regression_model.joblib']
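
The saved estimator was fitted on `X_transformed`, so new data has to pass through the same fitted preprocessing before prediction. One option is to persist the preprocessing and the classifier together and reload them as a single object. This is a minimal sketch on synthetic data: the scaler-only pipeline, the temporary directory, and the `demo_pipeline.joblib` file name are assumptions for illustration, while `C=100` and `solver='liblinear'` mirror the best parameters reported above.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem standing in for the real data
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Preprocessing and classifier are saved as a single object
full_model = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(C=100, solver="liblinear", max_iter=1000)),
])
full_model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_pipeline.joblib")  # hypothetical file name
    dump(full_model, model_path)
    restored = load(model_path)

# The restored pipeline accepts raw (unscaled) inputs directly
same_predictions = (restored.predict(X_demo) == full_model.predict(X_demo)).all()
print(same_predictions)
```

joblib files are pickle-based, so they should be loaded only from trusted sources and preferably with the same scikit-learn version that wrote them.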