### Explanation of the imports:
- **pandas**: Provides the tabular DataFrame API used to manipulate and analyze the datasets. The first cell does not import pandas directly; `pd` is bound to `modin.pandas`, described below.
- **numpy**: For efficient numerical computations on large arrays and matrices.
- **matplotlib.pyplot**: For generating plots and visualizing data (such as ROC curves, feature distributions, etc.).
- **sklearn.model_selection.train_test_split**: Used to split the data into training and testing sets.
- **sklearn.linear_model.LogisticRegression**: This class implements logistic regression for binary classification problems.
- **sklearn.metrics.classification_report**: Generates a detailed report on classification performance, including precision, recall, F1-score, etc.
- **tqdm.notebook.tqdm**: Provides a progress bar for loops, useful in Jupyter notebooks when running long processes.
- **sklearn.metrics**: Various functions like accuracy, precision, recall, F1-score, ROC-AUC, and balanced accuracy are used to evaluate the performance of the model.
- **sklearn.utils.all_estimators**: A utility function to retrieve all available estimators in scikit-learn.
- **modin.pandas**: A drop-in replacement for pandas, enabling parallel and faster DataFrame operations, utilizing multiple CPU cores.
- **sklearnex**: Intel’s extension of scikit-learn that optimizes various algorithms for better performance, particularly on Intel architectures.

In [1]:
# Import necessary libraries
import numpy as np                # For numerical operations
import matplotlib.pyplot as plt   # For data visualization

# Machine learning model and evaluation imports
from sklearn.model_selection import train_test_split  # For splitting the dataset into training and testing sets
from sklearn.linear_model import LogisticRegression   # For logistic regression model
from sklearn.metrics import classification_report     # For detailed classification metrics

# Progress bar for loops
from tqdm.notebook import tqdm   # For progress bar visualization in Jupyter notebooks

# Additional metrics for evaluating model performance
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, balanced_accuracy_score

# Utility function to get all estimators from scikit-learn
from sklearn.utils import all_estimators

import modin.pandas as pd  # Import Modin for parallel DataFrame operations

from sklearnex import patch_sklearn  # Import sklearnex for scikit-learn optimization

patch_sklearn()  # Patch scikit-learn with sklearnex

import warnings
warnings.filterwarnings("ignore")


### Load Dataset from Project Root Directory

In this section, we set the root directory for the project and load the dataset into a DataFrame (a Modin DataFrame, since `pd` is `modin.pandas` here). The dataset contains processed city data, including population details.

#### Variables:
- `path_root_dir`: The root directory path where the datasets are stored.
- `data`: The DataFrame that holds the loaded CSV data.

The `pd.read_csv()` function is used to read the CSV file into the DataFrame.

In [2]:
# Set this to the directory that contains the datasets
path_root_dir="../datasets/"
data = pd.read_csv(path_root_dir + "processed/all_city_data_with_pop.csv")

In [3]:
data.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,geometry,parking,edges,EV_stations,parking_space,civic,restaurant,park,...,cinema,library,commercial,retail,townhall,government,residential,city,population,Berlin_data_onlycenter_
0,0,0,"POLYGON ((8.4727605 50.099822499999995, 8.4730...",0,0,0,0,0,0,0,...,0,0,0,0,0,0.0,0,Frankfurt,9.014051,
1,1,1,"POLYGON ((8.4775730092433 50.10302720327834, 8...",0,0,0,0,0,0,0,...,0,0,0,0,0,0.0,0,Frankfurt,0.0,
2,2,2,"POLYGON ((8.479750879173663 50.09863320231676,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0.0,0,Frankfurt,9.014051,
3,3,3,"POLYGON ((8.479688060978736 50.10443297769501,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0.0,0,Frankfurt,9.014051,
4,4,4,"POLYGON ((8.47965547981383 50.107440331063444,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0.0,0,Frankfurt,0.0,


### Filtering Relevant Columns for Modeling

Here, we select the columns needed for building our machine learning model and discard the rest. These columns contain features such as the number of EV stations, parking, schools, population, and various other civic, commercial, and residential attributes.

- **Columns Selected for Modeling**:
  - `geometry`, `city`, `EV_stations`, `parking`, `edges`, `parking_space`, `civic`, `restaurant`, `park`, `school`, `node`, `Community_centre`, `place_of_worship`, `university`, `cinema`, `library`, `commercial`, `retail`, `townhall`, `government`, `residential`, `population`.
  
- **Drop Missing Values**:
  - After selecting the columns, rows with missing values (NaN) are removed using `dropna()` to ensure clean data for modeling.
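
The two steps above can be sketched on a toy frame. The values and the reduced column set are made up for illustration, and plain pandas is assumed here rather than Modin:

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical values, not taken from the real dataset
toy = pd.DataFrame({
    "city": ["Frankfurt", "Frankfurt", "Berlin"],
    "EV_stations": [0, 2, 1],
    "population": [9.0, np.nan, 12.5],
    "unused_column": [1, 2, 3],
})

# Step 1: keep only the columns needed for modeling
selected = toy[["city", "EV_stations", "population"]]

# Step 2: drop every row that has at least one missing value
cleaned = selected.dropna()

print("before:", selected.shape)
print("after:", cleaned.shape)
```

Note that `dropna()` with default arguments removes a row if any of its selected columns is missing, so the column selection also determines how many rows survive.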

#### Code to Filter and Clean Data:

In [4]:
# Filtering the relevant columns for modeling
data = data[['geometry','city','EV_stations', 'parking', 'edges',
        'parking_space', 'civic', 'restaurant', 'park', 'school',
       'node', 'Community_centre', 'place_of_worship', 'university', 'cinema',
       'library', 'commercial', 'retail', 'townhall', 'government',
       'residential', 'population']]

# Print initial data size
print("data size:", data.shape)

# Drop rows with missing values
data = data.dropna()

# Print data size after dropping missing values
print("data size after dropping na:", data.shape)

# Display the first few rows of the filtered and cleaned data
data.head()

data size: (10824, 22)
data size after dropping na: (10129, 22)


Unnamed: 0,geometry,city,EV_stations,parking,edges,parking_space,civic,restaurant,park,school,...,place_of_worship,university,cinema,library,commercial,retail,townhall,government,residential,population
0,"POLYGON ((8.4727605 50.099822499999995, 8.4730...",Frankfurt,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0.0,0,9.014051
1,"POLYGON ((8.4775730092433 50.10302720327834, 8...",Frankfurt,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0.0,0,0.0
2,"POLYGON ((8.479750879173663 50.09863320231676,...",Frankfurt,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0.0,0,9.014051
3,"POLYGON ((8.479688060978736 50.10443297769501,...",Frankfurt,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0.0,0,9.014051
4,"POLYGON ((8.47965547981383 50.107440331063444,...",Frankfurt,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0.0,0,0.0


### Function: `data_splitter`

This function splits the dataset into training and testing sets, either based on predefined cities (if provided) or by random sampling when no city-specific split is requested. The goal is to train a model that predicts whether a row of the dataset (one polygon in the `geometry` column) contains at least one EV station.

- **Two Modes of Splitting**:
  - **City-based splitting**: If `train_cities` and `test_cities` are provided, the function will filter the dataset based on these cities.
  - **Random-based splitting**: If no city-specific splitting is provided, the dataset will be split randomly using `train_test_split`.

- **Binary Classification Target**:
  - The target variable `EV_stations` is transformed into a binary variable, where `1` means the row contains at least one EV station and `0` means it contains none.

#### Parameters:
- **`data`**: The input DataFrame containing city and feature information.
- **`train_cities`**: (Optional) A list of city names to be used for the training set.
- **`test_cities`**: (Optional) A list of city names to be used for the test set.
- **`test_size`**: The proportion of data to use for testing (default is 0.2, or 20%). Used only for the random split.
- **`random_state`**: A seed value for reproducibility. Used only for the random split.

#### Steps in the Code:
1. **City-based Splitting**:
    - If `train_cities` and `test_cities` are provided:
      - Filter rows belonging to `train_cities` for training and `test_cities` for testing.
      - Drop the `city`, `geometry`, and `EV_stations` columns from features.
      - Apply binary transformation on `EV_stations` as the target variable (1 for presence, 0 for absence).
      
2. **Random-based Splitting**:
    - If city lists are not provided:
      - Use `train_test_split` to randomly split the data into training and testing sets, stratifying based on the target variable (`EV_stations`).
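
As a minimal sketch of the city-based mode on toy data (hypothetical values, plain pandas assumed), the binarization can also be written as a vectorized comparison, which is an alternative to the `apply(lambda ...)` form and gives the same labels for non-negative integer counts:

```python
import pandas as pd

# Toy data: hypothetical values, not taken from the real dataset
toy = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Munich", "Frankfurt"],
    "EV_stations": [0, 3, 1, 0],
    "parking": [2, 5, 1, 0],
})

# Binary target: 1 if the row has at least one EV station, else 0
target = (toy["EV_stations"] > 0).astype(int)

# City-based split: membership in the city lists decides train vs. test
train_mask = toy["city"].isin(["Berlin", "Munich"])
test_mask = toy["city"].isin(["Frankfurt"])

X_train_toy = toy.loc[train_mask].drop(columns=["city", "EV_stations"])
y_train_toy = target[train_mask]
X_test_toy = toy.loc[test_mask].drop(columns=["city", "EV_stations"])
y_test_toy = target[test_mask]

print(y_train_toy.tolist(), y_test_toy.tolist())
```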


In [5]:
def data_splitter(data, train_cities=None, test_cities=None, test_size=0.2, random_state=42):
    """Split `data` into train/test features and a binary EV-station target."""
    # Columns that are identifiers or the target, not features
    non_features = ['city', 'geometry', 'EV_stations']

    if train_cities is not None and test_cities is not None:
        # City-based split: a row goes to train or test according to its city
        train = data[data['city'].isin(train_cities)]
        test = data[data['city'].isin(test_cities)]

        X_train = train.drop(non_features, axis=1)
        y_train = train['EV_stations'].astype(int)
        y_train = y_train.apply(lambda x: 1 if x > 0 else 0)  # 1 = at least one EV station

        X_test = test.drop(non_features, axis=1)
        y_test = test['EV_stations'].astype(int)
        y_test = y_test.apply(lambda x: 1 if x > 0 else 0)
    else:
        # Random split, stratified on the binary target
        X = data.drop(non_features, axis=1)
        y = data['EV_stations']
        y = y.apply(lambda x: 1 if x > 0 else 0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=test_size, random_state=random_state)

    return X_train, X_test, y_train, y_test

In [6]:
X_train, X_test, y_train, y_test = data_splitter(data)

### Logistic Regression Model

This code block demonstrates the training and evaluation of a Logistic Regression model for predicting the presence of EV stations in the dataset. It uses the data split earlier into training and testing sets.

#### Steps:
1. **Model Initialization**: 
   - A `LogisticRegression()` model is created using the default parameters.

2. **Model Training**: 
   - The model is trained on the training dataset (`X_train`, `y_train`) using the `.fit()` method.
   
3. **Model Evaluation**:
   - The model's performance is evaluated on the test dataset (`X_test`, `y_test`) using the `.score()` method, which returns the accuracy of the model.
   
4. **Classification Report**:
   - After making predictions on the test set using the `.predict()` method, a detailed classification report is generated using `classification_report()`. This report includes:
     - Precision: The ratio of correctly predicted positive observations to total predicted positives.
     - Recall: The ratio of correctly predicted positive observations to all actual positives.
     - F1-Score: The harmonic mean of Precision and Recall.
     - Support: The number of true instances for each label.
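
To make these definitions concrete, here is a small sketch that computes precision, recall, and F1 for one class from its confusion counts. The helper name `precision_recall_f1` and the counts are hypothetical and only serve as an illustration; `classification_report` performs the equivalent computation for each class:

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from its confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 3 true positives, 1 false positive, 2 false negatives
p, r, f1 = precision_recall_f1(tp=3, fp=1, fn=2)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a class with decent precision but low recall still ends up with a low F1.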


In [7]:
# Logistic Regression Model
logreg = LogisticRegression()

# Train the model
logreg.fit(X_train, y_train)

# Print Test Accuracy
print("Logistic Regression Test Accuracy: ", logreg.score(X_test, y_test))

# Predict on the test set
y_pred = logreg.predict(X_test)

# Print classification report
print(classification_report(y_test, y_pred))

Logistic Regression Test Accuracy:  0.8958538993089832
              precision    recall  f1-score   support

           0       0.91      0.98      0.94      1786
           1       0.62      0.30      0.41       240

    accuracy                           0.90      2026
   macro avg       0.77      0.64      0.68      2026
weighted avg       0.88      0.90      0.88      2026



### Running Multiple Classification Models

This code block demonstrates how to apply and evaluate all available classification models from `scikit-learn` using the `all_estimators()` function. Each model is trained on the training dataset and evaluated on the test dataset. Key metrics like accuracy, precision, recall, F1-score, AUC, and balanced accuracy are calculated and stored in a DataFrame for comparison.

#### Steps:
1. **Fetching Classifiers**:
   - `all_estimators(type_filter='classifier')` returns all available classifier models in `scikit-learn`.
   
2. **Model Training and Evaluation**:
   - Each classifier is instantiated and trained using the training set (`X_train`, `y_train`).
   - Predictions (`y_pred`) are made on the test set (`X_test`).
   
3. **Metrics Calculation**:
   - The following evaluation metrics are computed for each model:
     - **Accuracy**: Proportion of correct predictions out of all predictions.
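     - **Precision**: Ratio of correctly predicted positives to total predicted positives, computed using `precision_score()` with `average='macro'`, i.e. the unweighted mean of the per-class values.
     - **Recall**: Ratio of correctly predicted positives to all actual positives, computed using `recall_score()` with `average='macro'`.
     - **F1-Score**: Harmonic mean of precision and recall, computed using `f1_score()` with `average='macro'`.
     - **AUC (Area Under the ROC Curve)**: Measures the ability of the model to distinguish between classes, computed using `roc_auc_score()`. The code passes hard class predictions (`y_pred`) rather than probabilities or decision scores, so for this binary problem the value equals the balanced accuracy. This is why the AUC, Recall, and Balanced Accuracy columns coincide in the results tables.
     - **Balanced Accuracy**: Average of recall obtained in each class, computed using `balanced_accuracy_score()`. For a binary target this is the same quantity as macro-averaged recall.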
   
4. **Results Storage**:
   - Results for each classifier, along with their corresponding metrics, are stored in the `results` list.
   
5. **DataFrame Creation**:
   - A Pandas DataFrame is created from the results for easier visualization and sorting.
   - The DataFrame is sorted by `F1-score` and `AUC` in descending order to rank the best models.
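
The relationship between AUC and balanced accuracy can be checked with a small hand-written sketch. It uses the standard rank-based formulation of ROC AUC (the fraction of positive/negative pairs ranked correctly, ties counting one half); the helper names and toy labels are hypothetical, and this is not scikit-learn's implementation:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the recall of class 1 and the recall of class 0."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    tpr = sum(1 for p in pos if p == 1) / len(pos)
    tnr = sum(1 for p in neg if p == 0) / len(neg)
    return (tpr + tnr) / 2


def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels, hard predictions, and continuous scores
y_true = [0, 0, 0, 1, 1]
hard = [0, 1, 0, 1, 0]
scores = [0.1, 0.6, 0.2, 0.9, 0.4]

bal = balanced_accuracy(y_true, hard)
auc_hard = pairwise_auc(y_true, hard)      # AUC computed from hard labels
auc_scores = pairwise_auc(y_true, scores)  # AUC computed from continuous scores

print(bal, auc_hard, auc_scores)
```

With hard labels the two quantities agree, while continuous scores give a different and more informative AUC. In scikit-learn such scores would typically come from `predict_proba` or `decision_function`, which not every classifier provides.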


In [8]:
# Get all classification model classes
classifiers = all_estimators(type_filter='classifier')

# Initialize result table
results = []
models = {}
# Run models and collect results
for name, ClassifierClass in tqdm(classifiers):
    try:
        # Initialize model
        model = ClassifierClass()
        model.fit(X_train, y_train)
        models[name] = model
        y_pred = model.predict(X_test)
        
        # Calculate metrics
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, average='macro')
        recall = recall_score(y_test, y_pred, average='macro')
        f1 = f1_score(y_test, y_pred, average='macro')
        # y_pred holds hard labels, so this AUC equals the balanced accuracy
        auc = roc_auc_score(y_test, y_pred)
        balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
        
        # Append results
        results.append([name, accuracy, precision, recall, f1, auc, balanced_accuracy])
    except Exception:
        # Skip classifiers that need constructor arguments or fail on this data
        continue

# Create a DataFrame from results
results_df = pd.DataFrame(results, columns=["Model", "Accuracy", "Precision", "Recall", "F1-score", "AUC", "Balanced Accuracy"])
results_df = results_df.sort_values(by=['F1-score', 'AUC'], ascending=False)
print(results_df)


  0%|          | 0/41 [00:00<?, ?it/s]

                             Model  Accuracy  Precision    Recall  F1-score   
12  HistGradientBoostingClassifier  0.900296   0.767657  0.705403  0.730580  \
0               AdaBoostClassifier  0.900296   0.768031  0.703600  0.729434   
25          RandomForestClassifier  0.904245   0.788820  0.689609  0.724794   
11      GradientBoostingClassifier  0.899309   0.766435  0.694023  0.722009   
15      LinearDiscriminantAnalysis  0.894373   0.750447  0.680403  0.707106   
8             ExtraTreesClassifier  0.897828   0.765678  0.673346  0.705678   
1                BaggingClassifier  0.892892   0.745251  0.679563  0.704936   
9                       GaussianNB  0.866239   0.687214  0.707727  0.696568   
20                   MultinomialNB  0.872162   0.693752  0.693052  0.693401   
19                   MLPClassifier  0.900790   0.790903  0.651582  0.691288   
4                     ComplementNB  0.855874   0.674136  0.712668  0.689998   
24   QuadraticDiscriminantAnalysis  0.861797   0.677

### Function: `run_experiment`

This function automates the process of training and evaluating multiple classification models on the provided dataset. It compares performance metrics across different classifiers, facilitating easy model selection.

#### Inputs:
- **`X_train`**: Feature set for training the models.
- **`X_test`**: Feature set for testing the models.
- **`y_train`**: Target labels for training.
- **`y_test`**: Target labels for testing.

#### Process:
1. **Fetch Classifiers**:
   - Uses `all_estimators` from `scikit-learn` to fetch all available classifier models for experimentation.

2. **Initialize Result Storage**:
   - A list `results` is created to store the performance metrics.
   - A dictionary `models` is initialized to store trained models for later use.

3. **Train and Evaluate Models**:
   - The function iterates through each classifier:
     - It initializes the model using the `ClassifierClass`.
     - The model is trained on `X_train` and `y_train`.
     - Predictions are made on the `X_test` set.
     - Performance metrics, including accuracy, precision, recall, F1-score, AUC, and balanced accuracy, are computed.
   - Results are stored in the `results` list, and the trained model is saved in the `models` dictionary.

4. **Error Handling**:
   - Exceptions are caught and handled silently to avoid stopping the entire process if a particular model fails.

5. **Results Compilation**:
   - The results are converted into a Pandas DataFrame (`results_df`), and the table is sorted by F1-score and AUC in descending order to highlight the best-performing models.

#### Outputs:
- **`results_df`**: A Pandas DataFrame containing the performance metrics for each model.
- **`models`**: A dictionary with the trained models accessible by model name.
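
Silently skipping failures keeps the loop running but hides which classifiers were dropped and why. A possible variation, sketched here with toy stand-in classes rather than real scikit-learn estimators (all names in this block are hypothetical), records the reason for each failure alongside the successful results:

```python
def run_all(candidates, X_train, y_train, X_test):
    """Fit every candidate; return predictions and the reason for each failure."""
    predictions, failures = {}, {}
    for name, factory in candidates:
        try:
            model = factory()
            model.fit(X_train, y_train)
            predictions[name] = model.predict(X_test)
        except Exception as exc:
            # Keep the error type and message instead of discarding them
            failures[name] = f"{type(exc).__name__}: {exc}"
    return predictions, failures


class MajorityClass:
    """Toy stand-in for a classifier: always predicts the most common label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NeedsArgument:
    """Toy stand-in for an estimator that cannot be built without arguments."""

    def __init__(self, base_estimator):
        self.base_estimator = base_estimator


predictions, failures = run_all(
    [("MajorityClass", MajorityClass), ("NeedsArgument", NeedsArgument)],
    X_train=[[0], [1], [2]], y_train=[0, 0, 1], X_test=[[3], [4]],
)
print(predictions)
print(failures)
```

The `failures` dictionary could be returned from `run_experiment` as a third value or written to a log, so that a missing model in `results_df` can be traced back to its cause.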


In [9]:
def run_experiment(X_train, X_test, y_train, y_test):
    # Get all classification model classes
    classifiers = all_estimators(type_filter='classifier')

    # Initialize result table
    results = []
    models = {}
    # Run models and collect results
    for name, ClassifierClass in tqdm(classifiers):
        try:
            # Initialize model
            model = ClassifierClass()
            model.fit(X_train, y_train)
            models[name] = model
            y_pred = model.predict(X_test)
            
            # Calculate metrics
            accuracy = accuracy_score(y_test, y_pred)
            precision = precision_score(y_test, y_pred, average='macro')
            recall = recall_score(y_test, y_pred, average='macro')
            f1 = f1_score(y_test, y_pred, average='macro')
            # y_pred holds hard labels, so this AUC equals the balanced accuracy
            auc = roc_auc_score(y_test, y_pred)
            balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
            
            # Append results
            results.append([name, accuracy, precision, recall, f1, auc, balanced_accuracy])
        except Exception:
            # Skip classifiers that need constructor arguments or fail on this data
            continue

    # Create a DataFrame from results
    results_df = pd.DataFrame(results, columns=["Model", "Accuracy", "Precision", "Recall", "F1-score", "AUC", "Balanced Accuracy"])
    results_df = results_df.sort_values(by=['F1-score', 'AUC'], ascending=False)
    return results_df, models


In [10]:
results_df, models = run_experiment(X_train, X_test, y_train, y_test)

In [11]:
results_df

Unnamed: 0,Model,Accuracy,Precision,Recall,F1-score,AUC,Balanced Accuracy
12,HistGradientBoostingClassifier,0.900296,0.767657,0.705403,0.73058,0.705403,0.705403
0,AdaBoostClassifier,0.900296,0.768031,0.7036,0.729434,0.7036,0.7036
25,RandomForestClassifier,0.904245,0.78882,0.689609,0.724794,0.689609,0.689609
11,GradientBoostingClassifier,0.899309,0.766435,0.694023,0.722009,0.694023,0.694023
15,LinearDiscriminantAnalysis,0.894373,0.750447,0.680403,0.707106,0.680403,0.680403
8,ExtraTreesClassifier,0.897828,0.765678,0.673346,0.705678,0.673346,0.673346
1,BaggingClassifier,0.892892,0.745251,0.679563,0.704936,0.679563,0.679563
9,GaussianNB,0.866239,0.687214,0.707727,0.696568,0.707727,0.707727
20,MultinomialNB,0.872162,0.693752,0.693052,0.693401,0.693052,0.693052
19,MLPClassifier,0.90079,0.790903,0.651582,0.691288,0.651582,0.651582


In [12]:
results_df.to_csv("../results/all_cities_random_shuffle.csv", index=False)

### Experiment: Comparing Model Performance on Big and Small Cities

This experiment involves splitting the dataset based on two sets of cities:
1. **Big Cities**: Berlin, Munich, Stuttgart, Frankfurt.
2. **Small Cities**: Karlsruhe, Trier, Saarbrücken, Mainz.

For each group (big and small cities), one city is held out as the test set and the models are trained on the remaining three cities of that group. This is repeated so that every city serves as the test set exactly once (a leave-one-city-out scheme), and the performance of the models is recorded for each run.

#### Steps:
1. **City Splitting**:
   - For each city in the `big_cities` list, the dataset is split such that one city is used as the test set, and the rest are used for training. The same is done for `small_cities`.

2. **Data Splitting**:
   - The `data_splitter` function is used to create the train and test splits for each held-out city.

3. **Model Training and Evaluation**:
   - The `run_experiment` function is used to train multiple classifiers on the training data and evaluate them on the test data.
   - The metrics collected include accuracy, precision, recall, F1-score, AUC, and balanced accuracy.

4. **Saving Results**:
   - The results for each city test case are saved in a CSV file for further analysis.
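
The splitting scheme described above can be sketched as a small generator. The helper name `leave_one_city_out` is hypothetical; the notebook builds the same lists inline inside its loops:

```python
def leave_one_city_out(cities):
    """Yield (train_cities, test_cities) with each city held out exactly once."""
    for held_out in cities:
        yield [c for c in cities if c != held_out], [held_out]


big_cities = ['Berlin', 'Munich', 'Stuttgart', 'Frankfurt']
folds = list(leave_one_city_out(big_cities))

for train_cities, test_cities in folds:
    print("train:", train_cities, "test:", test_cities)
```

Each pair can be passed directly to `data_splitter(data, train_cities=..., test_cities=...)`, so a group of four cities yields four train/test runs.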


In [13]:
"""
EXP-1 (big cities): Berlin, Munich, Stuttgart, Frankfurt
EXP-2 (small cities): Karlsruhe, Trier, Saarbrücken, Mainz
"""

big_cities = ['Berlin', 'Munich', 'Stuttgart', 'Frankfurt']
small_cities = ['Karlsruhe', 'Trier', 'Saarbrücken', 'Mainz']


# make a table in the end to summarise the results of all experiments

# EXP-1: big cities split into train and test; each big city is the test set once
for city in tqdm(big_cities):
    test_cities = [city]
    train_cities = [x for x in big_cities if x != city]
    X_train, X_test, y_train, y_test = data_splitter(data, train_cities=train_cities, test_cities=test_cities)
    results_df, models = run_experiment(X_train, X_test, y_train, y_test)
    results_df.to_csv(f"../results/big_cities_test_city_{city}_.csv", index=False)
    



# EXP-2: small cities split into train and test; each small city is the test set once
for city in tqdm(small_cities):
    test_cities = [city]
    train_cities = [x for x in small_cities if x != city]
    X_train, X_test, y_train, y_test = data_splitter(data, train_cities=train_cities, test_cities=test_cities)
    results_df, models = run_experiment(X_train, X_test, y_train, y_test)
    results_df.to_csv(f"../results/small_cities_test_city_{city}_.csv", index=False)

In [14]:
results_df

Unnamed: 0,Model,Accuracy,Precision,Recall,F1-score,AUC,Balanced Accuracy
2,BernoulliNB,0.847561,0.633567,0.81241,0.665443,0.81241,0.81241
28,SGDClassifier,0.941057,0.871992,0.612098,0.662336,0.612098,0.612098
1,BaggingClassifier,0.930894,0.737337,0.619819,0.654837,0.619819,0.619819
8,ExtraTreesClassifier,0.941057,0.908574,0.598906,0.647382,0.598906,0.598906
20,MultinomialNB,0.894309,0.626664,0.652892,0.637969,0.652892,0.652892
17,LogisticRegression,0.912602,0.651632,0.623163,0.635386,0.623163,0.623163
7,ExtraTreeClassifier,0.904472,0.635564,0.631979,0.633737,0.631979,0.631979
22,PassiveAggressiveClassifier,0.941057,0.970165,0.585714,0.630965,0.585714,0.585714
4,ComplementNB,0.880081,0.605113,0.645233,0.620127,0.645233,0.645233
0,AdaBoostClassifier,0.922764,0.676409,0.589059,0.614229,0.589059,0.589059


### Aggregating AUC Scores and Identifying the Best Models

This script aggregates the AUC (Area Under the Curve) scores for all models across multiple experiments, calculates the average AUC for each model, and selects the top-performing models based on their average AUC. The goal is to identify the models that consistently perform well across different test cities.

#### Steps:
1. **Load Experiment Results**:
   - The script reads all CSV files from the `../results/` directory, where each file contains the results of a specific experiment.

2. **Sum AUC Scores**:
   - For each model in the results, the script sums up the AUC scores across different experiments and also counts the number of times each model appears.

3. **Calculate Average AUC**:
   - The average AUC for each model is computed by dividing the total AUC by the number of times the model was evaluated.

4. **Sort and Select Top Models**:
   - Models are then sorted based on their average AUC in descending order, and the top 5 models are selected for further analysis.
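
The averaging step can be sketched in plain Python on toy rows (hypothetical values). The sketch also filters the file list by name: the last cell of this notebook writes `summary_results.csv` into the same `../results/` directory, and that file has no `Model` column, so a rerun of the aggregation would presumably need to exclude it. Both helper names are assumptions made for this illustration:

```python
from collections import defaultdict


def per_experiment_files(paths, summary_name="summary_results.csv"):
    """Keep only per-experiment result files; drop the summary file by name."""
    def file_name(path):
        # Handle both "/" and "\\" separators
        return path.replace("\\", "/").rsplit("/", 1)[-1]
    return [p for p in paths if file_name(p) != summary_name]


def average_auc(rows):
    """Average the 'AUC' value per 'Model' over an iterable of row dicts."""
    totals = defaultdict(lambda: [0.0, 0])  # model -> [sum of AUC, count]
    for row in rows:
        totals[row["Model"]][0] += row["AUC"]
        totals[row["Model"]][1] += 1
    return {model: s / n for model, (s, n) in totals.items()}


files = per_experiment_files([
    "../results/all_cities_random_shuffle.csv",
    "../results/summary_results.csv",
])

# Toy rows standing in for the contents of two result files
rows = [
    {"Model": "A", "AUC": 0.8},
    {"Model": "B", "AUC": 0.6},
    {"Model": "A", "AUC": 0.6},
]
averages = average_auc(rows)
ranked = sorted(averages, key=averages.get, reverse=True)

print(files)
print(ranked)
```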


In [19]:
import pandas as pd  # plain pandas; this rebinds pd, which was modin.pandas above
import glob

# Get a list of all result files from different experiments
result_files = glob.glob("../results/*.csv")

# Create dictionaries to store the total AUC and the count for each model
auc_sum_per_model = {}
count_per_model = {}

# Iterate over each result file
for file in result_files:
    print(file)
    # Load the results for each experiment
    results = pd.read_csv(file)
    
    # Iterate over each row in the results
    for _, row in results.iterrows():
        model = row['Model']
        auc = row['AUC']
        
        # Update the total AUC and count for the model
        if model in auc_sum_per_model:
            auc_sum_per_model[model] += auc
            count_per_model[model] += 1
        else:
            auc_sum_per_model[model] = auc
            count_per_model[model] = 1

# Calculate the average AUC for each model
average_auc_per_model = {model: auc_sum_per_model[model] / count_per_model[model] for model in auc_sum_per_model}

# Create a DataFrame from the average AUC dictionary
average_auc_df = pd.DataFrame(list(average_auc_per_model.items()), columns=['Model', 'Average AUC'])

# Sort the DataFrame by Average AUC in descending order
sorted_models = average_auc_df.sort_values(by='Average AUC', ascending=False)

# Select the top 5 models
top_5_models = sorted_models.head(5)

# Display the best models
print(top_5_models)


../results\all_cities_random_shuffle.csv
../results\big_cities_test_city_Berlin_.csv
../results\big_cities_test_city_Frankfurt_.csv
../results\big_cities_test_city_Munich_.csv
../results\big_cities_test_city_Stuttgart_.csv
../results\small_cities_test_city_Karlsruhe_.csv
../results\small_cities_test_city_Mainz_.csv
../results\small_cities_test_city_Saarbrücken_.csv
../results\small_cities_test_city_Trier_.csv
              Model  Average AUC
12      BernoulliNB     0.779203
16  NearestCentroid     0.766557
10     ComplementNB     0.671147
9     MultinomialNB     0.662987
25    SGDClassifier     0.658037


In [20]:
top_5_models

Unnamed: 0,Model,Average AUC
12,BernoulliNB,0.779203
16,NearestCentroid,0.766557
10,ComplementNB,0.671147
9,MultinomialNB,0.662987
25,SGDClassifier,0.658037


### Combining and Summarizing Experiment Results by City Type

This script processes the results from various classification experiments, categorizes them by city type (big, small, and all), calculates the average AUC (Area Under the Curve) for the models used in these experiments, and generates a summary DataFrame displaying these results.

#### Steps:
1. **Load Experiment Results**:
   - The script reads all CSV files containing the results of classification experiments from the `../results/` directory.

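2. **Group Results by City Type**:
   - It categorizes the results based on city types: `big`, `small`, and `all`.
   - For each city type, it combines the results from every file whose path contains that string into a single DataFrame. Note that `all` is also a substring of `small`, so the `all` group pools `all_cities_random_shuffle.csv` with the four `small_cities_*` files rather than holding the random-shuffle results alone.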

3. **Calculate Average AUC**:
   - The script computes the average AUC for each model within each city type.

4. **Select Top Models**:
   - It identifies the top 5 models based on their average AUC and filters the combined results accordingly.

5. **Generate Summary DataFrame**:
   - For each city type, a summary row is created containing the average AUC values for the top models.
   - These summary rows are concatenated into a final DataFrame, which is printed at the end.
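
Because the grouping relies on a substring test, a stricter match on the file-name prefix may be preferable. The following sketch compares the two on the file names produced by the experiments above; the helper names are hypothetical, and the `<group>_cities_` prefix is assumed from those file names:

```python
def file_name(path):
    """Return the last path component, handling "/" and "\\" separators."""
    return path.replace("\\", "/").rsplit("/", 1)[-1]


def files_for_group(paths, group):
    """Select files whose name starts with '<group>_cities_'."""
    prefix = f"{group}_cities_"
    return [p for p in paths if file_name(p).startswith(prefix)]


paths = [
    "../results/all_cities_random_shuffle.csv",
    "../results/big_cities_test_city_Berlin_.csv",
    "../results/big_cities_test_city_Frankfurt_.csv",
    "../results/big_cities_test_city_Munich_.csv",
    "../results/big_cities_test_city_Stuttgart_.csv",
    "../results/small_cities_test_city_Karlsruhe_.csv",
    "../results/small_cities_test_city_Mainz_.csv",
    "../results/small_cities_test_city_Saarbrücken_.csv",
    "../results/small_cities_test_city_Trier_.csv",
]

substring_all = [p for p in paths if "all" in p]  # also matches "small"
prefix_all = files_for_group(paths, "all")        # matches only "all_cities_..."

print(len(substring_all), len(prefix_all))
```

Switching to the prefix match would change the `all` row of the summary, so the numbers shown in this notebook correspond to the substring behavior.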


In [24]:
import pandas as pd
import glob

# Get a list of all result files from different experiments
result_files = glob.glob("../results/*.csv")

# Create an empty DataFrame to store the combined results
combined_results = pd.DataFrame()

# Create an empty DataFrame to store the summary
summary_results = pd.DataFrame(columns=['type_city'])

# Iterate over each city type
for type_city in ['big', 'small', 'all']:
    # Reset combined_results for each type_city iteration
    combined_results = pd.DataFrame()

    # List to store dataframes to be concatenated later
    data_frames = []

    # Iterate over each result file
    for file in result_files:
        # Load the results for each experiment
        if type_city in file:
            results = pd.read_csv(file)
            
            # Append the results to the list of dataframes
            data_frames.append(results)

    # Concatenate all dataframes in the list
    if data_frames:
        combined_results = pd.concat(data_frames)

    # Calculate the average AUC for each model
    average_auc_per_model = combined_results.groupby('Model')['AUC'].mean()
    
    # Sort the models by average AUC in descending order
    sorted_models = average_auc_per_model.sort_values(ascending=False)
    
    # Get the top 5 models
    top_5_models = sorted_models.head(5).index.tolist()

    # Filter the results to include only the rows corresponding to the top 5 models
    filtered_results = combined_results[combined_results['Model'].isin(top_5_models)]

    # Calculate the average AUC for each model
    average_auc_by_model = filtered_results.groupby('Model')['AUC'].mean()
    
    # Create a row with type_city and average AUC values for each model
    row = {'type_city': type_city}
    row.update(average_auc_by_model.to_dict())
    
    # Append the row to the summary_results DataFrame
    summary_results = pd.concat([summary_results, pd.DataFrame([row])], ignore_index=True)

# Display the summary_results DataFrame
print(summary_results)


  type_city  BernoulliNB  ComplementNB  GaussianNB  NearestCentroid   
0       big     0.764115      0.669090    0.660482         0.715620  \
1     small     0.810223      0.654498         NaN         0.853362   
2       all     0.799321      0.673888         NaN         0.834474   

   QuadraticDiscriminantAnalysis  LinearSVC  MultinomialNB  SGDClassifier  
0                       0.657869        NaN            NaN            NaN  
1                            NaN   0.685329       0.666221            NaN  
2                            NaN        NaN       0.675165       0.676748  


In [25]:
summary_results

Unnamed: 0,type_city,BernoulliNB,ComplementNB,GaussianNB,NearestCentroid,QuadraticDiscriminantAnalysis,LinearSVC,MultinomialNB,SGDClassifier
0,big,0.764115,0.66909,0.660482,0.71562,0.657869,,,
1,small,0.810223,0.654498,,0.853362,,0.685329,0.666221,
2,all,0.799321,0.673888,,0.834474,,,0.675165,0.676748


### Summarizing Experiment Metrics by City Type

This script analyzes the results from various classification experiments, specifically focusing on the metrics for the top-performing models based on city types (big, small, and all). It calculates the average values for AUC (Area Under the Curve), Accuracy, Precision, and Recall, generating a concise summary of model performance.

#### Steps:
1. **Load Experiment Results**:
   - The script retrieves all CSV files containing experiment results from the `../results/` directory.

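2. **Group Results by City Type**:
   - It categorizes results based on city types: `big`, `small`, and `all`.
   - For each city type, it concatenates the results from every file whose path contains that string into a single DataFrame. Since `all` is a substring of `small`, the `all` group also includes the `small_cities_*` files.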

3. **Filter for Top Models**:
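   - It filters the combined results to retain only those corresponding to the top 5 models. The variable `top_5_models` is not defined in this cell; it carries over from the previous cell, where its last assignment was the top 5 list for the `all` group.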

4. **Calculate Average Metrics**:
   - The average values for AUC, Accuracy, Precision, and Recall are computed for the top 5 models within each city type.

5. **Generate Summary DataFrame**:
   - For each city type, a summary row is created containing the average metrics, which are appended to a final summary DataFrame.


In [26]:
import pandas as pd
import glob

# Get a list of all result files from different experiments
result_files = glob.glob("../results/*.csv")

# Create an empty DataFrame to store the summary
summary_results = pd.DataFrame(columns=['type_city', 'AUC', 'Accuracy', 'Precision', 'Recall'])

# Iterate over each type of city
for type_city in ['big', 'small', 'all']:
    # List to store the DataFrames for each file of the current type_city
    data_frames = []

    # Iterate over each result file
    for file in result_files:
        # Load the results for each experiment
        if type_city in file:
            results = pd.read_csv(file)
            # Add to the list of dataframes to be concatenated
            data_frames.append(results)

    # Concatenate all dataframes into one DataFrame if data exists
    if data_frames:
        combined_results = pd.concat(data_frames)

        # Filter the results to include only the rows corresponding to the top 5 models
        filtered_results = combined_results[combined_results['Model'].isin(top_5_models)]

        # Calculate the average values for each metric (AUC, Accuracy, Precision, Recall)
        average_metrics_per_model = filtered_results.groupby('Model')[['AUC', 'Accuracy', 'Precision', 'Recall']].mean()

        # Calculate the mean of the average metrics across the top 5 models
        average_values = average_metrics_per_model.mean()

        # Create a row with type_city and the average metric values
        row = {'type_city': type_city}
        for metric in ['AUC', 'Accuracy', 'Precision', 'Recall']:
            row[metric] = average_values[metric]

        # Append the row to the summary_results DataFrame
        summary_results = pd.concat([summary_results, pd.DataFrame([row])], ignore_index=True)

# Display the summary_results DataFrame
print(summary_results)


  type_city       AUC  Accuracy  Precision    Recall
0       big  0.689336  0.819802   0.662562  0.689336
1     small  0.722180  0.887707   0.654480  0.722180
2       all  0.731919  0.860643   0.656496  0.731919


In [27]:
top_5_models

['NearestCentroid',
 'BernoulliNB',
 'SGDClassifier',
 'MultinomialNB',
 'ComplementNB']

In [28]:
summary_results.to_csv("../results/summary_results.csv", index=False)