<a href="https://colab.research.google.com/github/Etienne982/AI-in-aviation/blob/main/Introduction%20Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**General Approach**

This code aims to train a machine learning model to predict whether a flight will be delayed by more than 15 minutes, using flight delay data. The process involves:

*   Loading and preprocessing the data.
*   Extracting features and defining the target variable.
*   Normalizing the data.
*   Training a neural network using Keras.
*   Saving the trained model, columns used, and scaler for future use.

**Code Explanation**

**1. Imports**

*   pandas: Handles data in tabular form (DataFrames).
*   Sequential: A linear stack of layers in a Keras model.
*   Dense: Fully connected layers for the neural network.
*   Dropout: Adds regularization by randomly dropping some neurons during training to prevent overfitting.
*   EarlyStopping: Stops training if the model’s performance on validation data stops improving.
*   StandardScaler: Normalizes data to have a mean of 0 and standard deviation of 1.
*   train_test_split: Splits data into training and testing sets.
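*   numpy: A library for numerical computations.
*   joblib: Saves and loads Python objects such as the fitted scaler (imported later in the code, just before the scaler is saved).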

**2. Loading the data**

  Loads the dataset from a CSV file that contains flight information and delay statuses.

**3. Preprocessing the Data**

  a. Removing Prefixes and Converting Columns
  
  Removing prefixes: The Month, DayofMonth, and DayOfWeek columns have a "c-" prefix, which is removed. These columns are then converted to integers.
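
As a minimal sketch of this step, the snippet below applies the same prefix removal to a hypothetical three-row sample (the values are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample mimicking the "c-" prefixed calendar columns
sample = pd.DataFrame({
    "Month": ["c-8", "c-4", "c-12"],
    "DayofMonth": ["c-21", "c-20", "c-2"],
    "DayOfWeek": ["c-7", "c-3", "c-5"],
})

for col in ["Month", "DayofMonth", "DayOfWeek"]:
    # Remove the literal "c-" prefix, then cast the remaining digits to int
    sample[col] = sample[col].str.replace("c-", "", regex=False).astype(int)

print(sample["Month"].tolist())  # [8, 4, 12]
```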

  b. Converting Target Column to Binary

  Converts the target column dep_delayed_15min to binary format: 1 represents "Y" (delayed), and 0 represents "N" (not delayed).
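
One detail worth illustrating: `Series.map` with a dictionary turns any label that is not a key into NaN. The sketch below uses a hypothetical sample in which "?" stands in for an unexpected label:

```python
import pandas as pd

# Hypothetical target values; "?" is an invented stand-in for a bad label
target = pd.Series(["Y", "N", "N", "?"])

# Labels missing from the mapping become NaN instead of raising an error
encoded = target.map({"Y": 1, "N": 0})

print(encoded.tolist())          # [1.0, 0.0, 0.0, nan]
print(int(encoded.isna().sum())) # 1
```

If such labels existed in the real data, the mean-filling step later in the preprocessing would replace them silently, so checking `isna()` on the target first may be worthwhile.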

  c. Creating Dummy Variables

  Dummy encoding: Converts categorical columns (UniqueCarrier, Origin, Dest) into multiple binary columns (e.g., UniqueCarrier_AA, Origin_JFK), dropping the first category to avoid redundancy.
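
A small sketch of what `drop_first=True` does, using a hypothetical carrier column (the carrier codes are illustrative only):

```python
import pandas as pd

# Hypothetical sample with three carriers
sample = pd.DataFrame({"UniqueCarrier": ["AA", "DL", "US", "AA"]})

# Categories are sorted, and the first one ("AA" here) gets no column of its own
dummies = pd.get_dummies(sample, columns=["UniqueCarrier"], drop_first=True)

print(list(dummies.columns))  # ['UniqueCarrier_DL', 'UniqueCarrier_US']
```

A row whose carrier is the dropped category is represented by zeros in all remaining carrier columns.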

  d. Extracting Departure Hour

  Extracts the departure hour from the DepTime column (e.g., 1530 becomes 15) and removes the original DepTime column.
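
The hour extraction is plain integer division by 100. A short sketch with hypothetical HHMM values:

```python
import pandas as pd

# Hypothetical departure times in HHMM format (5 means 00:05)
dep_time = pd.Series([1530, 945, 5, 2359])

# Integer division by 100 discards the minutes
dep_hour = dep_time // 100

print(dep_hour.tolist())  # [15, 9, 0, 23]
```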

  e. Handling Errors and Missing Values

  Converts all columns to numeric format, replacing invalid values with NaN.
Fills missing values (NaN) with the mean of each column.
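
The following sketch shows both operations on a hypothetical two-column sample, where "n/a" and `None` are invented examples of invalid and missing entries:

```python
import pandas as pd

# Hypothetical sample with one invalid string and one missing value
sample = pd.DataFrame({
    "Distance": ["732", "n/a", "834"],
    "Dep Hour": [19, 15, None],
})

# Invalid entries become NaN instead of raising an error
numeric = sample.apply(pd.to_numeric, errors="coerce")

# Each NaN is replaced by the mean of its own column
filled = numeric.fillna(numeric.mean())

print(filled["Distance"].tolist())  # [732.0, 783.0, 834.0]
print(filled["Dep Hour"].tolist())  # [19.0, 15.0, 17.0]
```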


**4. Splitting Features and Target**

  X: Defines the input features, including numeric columns and the dummy variables created earlier.

  y: Defines the target variable (dep_delayed_15min), which indicates whether a flight is delayed or not.

**5. Saving the List of Training Columns**

  Saves the list of columns used for training into a text file, ensuring consistency for future predictions.

**6. Normalizing the Data**


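*   StandardScaler: Scales each feature to a mean of 0 and a standard deviation of 1, which generally helps gradient-based training of the neural network converge. In this code the scaler is fitted on the full dataset before the train/test split, so statistics from the test rows influence the scaling; fitting it on the training split only would avoid that leakage.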

*   Saves the scaler object to a .pkl file, allowing it to be reused for scaling future data.


**7. Splitting the Data into Training and Testing Sets**

  Divides the dataset into:

  *   Training set (80%): Used to train the model.
  *   Testing set (20%): Used to evaluate the model's performance.
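
As a hedged alternative to the ordering used in this notebook, the sketch below splits first and fits the scaler on the training rows only. The data is synthetic and the variable names are illustrative; this is not the code that produced the saved model.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and the 0/1 target
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

# Split first, so the test rows stay unseen by the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only, then apply the same transform to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)  # (80, 3) (20, 3)
```

With this ordering the training features have exactly zero mean and unit variance, while the test features are only approximately standardized, which is the expected behavior.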


**8. Building the Neural Network**


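*   Dense(64): First hidden layer with 64 neurons and ReLU activation function.
*   Dropout(0.2): Randomly drops 20% of the layer's outputs during training to reduce overfitting; the code applies it after each hidden layer.
*   Dense(32): Second hidden layer with 32 neurons and ReLU activation.
*   Dense(1): Output layer with a single neuron and a linear activation function. Because the target is binary (0 or 1), the output is an unbounded score rather than a probability; a sigmoid activation is the more common choice for binary classification.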


**9. Compiling the Model**

*   optimizer='adam': Uses the Adam optimizer to adjust weights during training.
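*   loss='mean_squared_error': Uses mean squared error as the loss function. MSE is typical for regression; here it fits the 0/1 delay indicator by least squares, whereas binary cross-entropy with a sigmoid output is the usual choice for classification.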
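
To make the difference between the two setups concrete, here is a NumPy-only sketch. The pre-activation values `z` are hypothetical; the snippet compares the notebook's choice (linear output with MSE) against the common classification choice (sigmoid output with binary cross-entropy) on the same four examples.

```python
import numpy as np

def sigmoid(z):
    """Squash real values into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activation values of the output neuron, and true labels
z = np.array([-0.3, 0.12, 0.8, 1.4])
y_true = np.array([0, 0, 1, 1])

# Linear output + MSE: the prediction is z itself and may leave [0, 1]
mse = np.mean((z - y_true) ** 2)

# Sigmoid output + binary cross-entropy: the prediction is a probability
probs = sigmoid(z)
bce = -np.mean(y_true * np.log(probs) + (1 - y_true) * np.log(1 - probs))

print("linear outputs:", z)
print("sigmoid outputs:", probs.round(3))
print("MSE:", round(float(mse), 4), "BCE:", round(float(bce), 4))
```

The linear outputs include values below 0 and above 1, which cannot be read as probabilities, while every sigmoid output lies strictly between 0 and 1.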

**10. Defining Early Stopping**

  Stops training if the validation loss (val_loss) doesn't improve for 10 consecutive epochs.
  Restores the best model weights from before the early stopping occurred.
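
The patience rule can be sketched in plain Python. The function below is a simplified illustration of the logic, not the Keras implementation; `early_stopping_epoch` is a hypothetical helper, and the losses are the seven complete `val_loss` values from the training log in this notebook.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is None when the patience limit is never reached.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

val_losses = [0.1462, 0.1481, 0.1433, 0.1440, 0.1426, 0.1433, 0.1437]

print(early_stopping_epoch(val_losses, patience=10))  # (None, 5)
print(early_stopping_epoch(val_losses, patience=2))   # (7, 5)
```

With `patience=10` nothing stops within these seven epochs and epoch 5 holds the best weights so far; a hypothetical `patience=2` would stop at epoch 7 and restore the weights from epoch 5.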

**11. Training the Model**


*   epochs=100: Maximum number of training cycles.
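*   batch_size=2: Number of samples processed per weight update. With a batch this small, each epoch runs 40,000 steps (as the training log shows), which makes training slow.
*   validation_data=(X_test, y_test): Evaluates the model on the test set after each epoch. Because early stopping also monitors this set, it no longer gives a fully independent estimate of final performance.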
*   callbacks=[early_stopping]: Stops training early if no improvement is observed.
*   verbose=1: Displays detailed training logs.
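
The effect of `batch_size` on the number of steps per epoch is simple arithmetic. In the sketch below, the figure of 80,000 training rows is an assumption inferred from the 40,000 steps shown in the training log with `batch_size=2`; `steps_per_epoch` is a hypothetical helper and 256 is only an illustrative alternative.

```python
import math

def steps_per_epoch(n_train, batch_size):
    """Number of weight updates needed to pass once over the training set."""
    return math.ceil(n_train / batch_size)

n_train = 80_000  # assumed, inferred from the log (40000 steps x batch of 2)

print(steps_per_epoch(n_train, 2))    # 40000
print(steps_per_epoch(n_train, 256))  # 313
```

A larger batch reduces the number of updates per epoch considerably, at the cost of a different optimization behavior that would need to be validated on this dataset.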

**12. Saving the Model**

  Saves the trained model in an .h5 file for reuse.

  Prints a confirmation message once the model, scaler, and training columns have been saved.

In [None]:
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np


df = pd.read_csv('flight_delays_train.csv')


def preprocess_data(df):
    # Strip the "c-" prefix from the calendar columns and convert them to integers
    for col in ['Month', 'DayofMonth', 'DayOfWeek']:
        df[col] = df[col].str.replace('c-', '', regex=False).astype(int)

    # Encode the target as 1 (delayed, 'Y') or 0 (not delayed, 'N')
    df['dep_delayed_15min'] = df['dep_delayed_15min'].map({'Y': 1, 'N': 0})

    # One-hot encode the categorical columns, dropping the first category of each
    df = pd.get_dummies(df, columns=['UniqueCarrier', 'Origin', 'Dest'], drop_first=True)

    # Keep only the departure hour (e.g., 1530 -> 15) and drop the raw DepTime
    df['Dep Hour'] = df['DepTime'] // 100
    df = df.drop(columns=['DepTime'], errors='ignore')

    # Coerce every column to numeric, then fill missing values with column means
    df = df.apply(pd.to_numeric, errors='coerce')
    df = df.fillna(df.mean())
    return df


df = preprocess_data(df)


X = df[['Month', 'DayofMonth', 'DayOfWeek', 'Distance', 'Dep Hour'] +
       [col for col in df.columns if col.startswith('UniqueCarrier_') or
                                     col.startswith('Origin_') or
                                     col.startswith('Dest_')]]
y = df['dep_delayed_15min']


training_columns = list(X.columns)
with open('training_columns.txt', 'w') as f:
    f.write('\n'.join(training_columns))


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


import joblib
joblib.dump(scaler, 'scaler.pkl')


X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)


model = Sequential()
model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='linear'))


model.compile(optimizer='adam', loss='mean_squared_error')


early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)


model.fit(X_train, y_train, epochs=100, batch_size=2, validation_data=(X_test, y_test), callbacks=[early_stopping], verbose=1)


model.save('flight_delay_model.h5')

print("Model, scaler, and columns saved successfully.")




Epoch 1/100
40000/40000 ━━━━━━━━━━━━━━━━━━━━ 110s 3ms/step - loss: 0.2858 - val_loss: 0.1462
Epoch 2/100
40000/40000 ━━━━━━━━━━━━━━━━━━━━ 110s 3ms/step - loss: 0.1482 - val_loss: 0.1481
Epoch 3/100
40000/40000 ━━━━━━━━━━━━━━━━━━━━ 128s 2ms/step - loss: 0.1480 - val_loss: 0.1433
Epoch 4/100
40000/40000 ━━━━━━━━━━━━━━━━━━━━ 156s 3ms/step - loss: 0.1479 - val_loss: 0.1440
Epoch 5/100
40000/40000 ━━━━━━━━━━━━━━━━━━━━ 150s 3ms/step - loss: 0.1473 - val_loss: 0.1426
Epoch 6/100
40000/40000 ━━━━━━━━━━━━━━━━━━━━ 114s 3ms/step - loss: 0.1456 - val_loss: 0.1433
Epoch 7/100
40000/40000 ━━━━━━━━━━━━━━━━━━━━ 102s 3ms/step - loss: 0.1442 - val_loss: 0.1437
Epoch 8/100
40000/40000 ━━━━━━━━━━━━━━━━━━━━ 141s 3ms/step - loss: 0.1461 - v



Model, scaler, and columns saved successfully.


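We then developed the code further, aiming to improve execution speed and obtain more accurate results. The sections below compare the training code above with the prediction code that follows.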

**1. Main Objective**

**First Code (Model Training)**

  Purpose: Train a machine learning model using training data.
  It includes:


*   Loading training data from a CSV file.
*   Preprocessing data to make it suitable for model training (feature extraction, converting to numerical formats, creating dummy variables).
*   Splitting data into training and testing sets.
*   Building and training a neural network model using Keras.
*   Saving the trained model, the scaler for normalization, and the list of training columns.


**Second Code (Model Prediction)**

  Purpose: Load a pre-trained model and make predictions based on user-provided data.
  It includes:
*   Loading the pre-trained model (flight_delay_model.h5), the scaler (scaler.pkl), and the list of columns used during training (training_columns.txt).
*   Preprocessing user input data to match the format expected by the model.
*   Making a prediction using the loaded model.
*   Displaying the prediction result.
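
Because the model is trained on a 0/1 target, its output is best read as a score for "delayed by more than 15 minutes" rather than as a number of minutes. The sketch below shows one possible way to turn that score into a label; `interpret_score` and the 0.5 threshold are hypothetical choices for illustration and are not part of the notebook.

```python
def interpret_score(score, threshold=0.5):
    """Clip a raw model score to [0, 1] and map it to a readable label.

    The threshold is an assumed value; it should be tuned on validation data.
    """
    clipped = min(max(score, 0.0), 1.0)
    label = "delayed (>15 min)" if score >= threshold else "not delayed"
    return clipped, label

print(interpret_score(0.12))  # (0.12, 'not delayed')
print(interpret_score(1.30))  # (1.0, 'delayed (>15 min)')
```

Clipping is needed only because a linear output layer can produce values outside the 0 to 1 range.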

**2. Data Used**

**First Code**

  Works with training data loaded from a CSV file (flight_delays_train.csv).
  The data is prepared for training, involving:

*   Creating dummy variables for categorical features.
*   Normalizing numerical data.
*   Selecting relevant columns for training the model.


**Second Code**

Works with user-provided data entered dynamically via input prompts.

Ensures that the user data is compatible with the pre-trained model:

*   Adds missing columns if the user does not provide certain features.
*   Reorders columns to match the training data format.
*   Normalizes the user data using the saved scaler.
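
The column alignment can be sketched with `DataFrame.reindex`, which adds missing columns, drops extra ones, and enforces the order in a single call. The column list below is a hypothetical subset of the training columns, and the flight values are illustrative.

```python
import pandas as pd

# Hypothetical subset of the columns saved at training time
training_columns = [
    "Month", "Distance", "Dep Hour",
    "UniqueCarrier_DL", "UniqueCarrier_US", "Origin_PIT",
]

# One user-provided flight, before encoding
user_row = pd.DataFrame([{
    "Month": 7, "Distance": 994, "Dep Hour": 11,
    "UniqueCarrier": "US", "Origin": "PIT",
}])

# A single row only yields the dummy columns for its own categories
encoded = pd.get_dummies(user_row, columns=["UniqueCarrier", "Origin"])

# Missing dummy columns are filled with 0 and the training order is restored
aligned = encoded.reindex(columns=training_columns, fill_value=0).astype(float)

print(aligned.iloc[0].tolist())  # [7.0, 994.0, 11.0, 0.0, 1.0, 1.0]
```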



**3. Data Preprocessing**

**First Code**

Preprocessing focuses on cleaning raw training data from the CSV file:

*   Removes prefixes (e.g., c-) from columns like Month.
*   Converts non-numeric columns to numeric values.
*   Creates dummy variables for categorical columns (e.g., UniqueCarrier).
*   Fills missing values (NaN) with the mean of each numerical column.



**Second Code**

Preprocessing focuses on ensuring that user-provided data matches the model’s requirements:


*   Checks and fixes formatting issues (e.g., converting Month if it's a string).
*   Creates dummy variables for categorical user inputs.
*   Adds missing columns and sets their values to 0 if needed.


**4. Saving and Loading**

**First Code**

Saves artifacts needed for later use:
*   The trained model (flight_delay_model.h5).
*   The scaler used for normalization (scaler.pkl).
*   The list of columns used during training (training_columns.txt).

**Second Code**

Loads saved artifacts for making predictions:
*   Loads the model using load_model.
*   Loads the scaler with joblib.load.
*   Loads the training columns from a text file.
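
A minimal round-trip sketch of the two smaller artifacts, written to a temporary directory so that nothing is overwritten. The column names and the two-row fitting data are hypothetical; the Keras model is left out to keep the example light.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical artifacts standing in for the real ones
training_columns = ["Month", "Distance", "Dep Hour"]
scaler = StandardScaler().fit(np.array([[1, 100.0, 5], [12, 900.0, 23]]))

with tempfile.TemporaryDirectory() as tmp:
    cols_path = os.path.join(tmp, "training_columns.txt")
    scaler_path = os.path.join(tmp, "scaler.pkl")

    # Save: one column name per line, and the scaler via joblib
    with open(cols_path, "w") as f:
        f.write("\n".join(training_columns))
    joblib.dump(scaler, scaler_path)

    # Load: splitlines() restores the list, including names with spaces
    with open(cols_path) as f:
        loaded_columns = f.read().splitlines()
    loaded_scaler = joblib.load(scaler_path)

print(loaded_columns == training_columns)              # True
print(np.allclose(loaded_scaler.mean_, scaler.mean_))  # True
```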

**5. User Interaction**

**First Code**

Does not involve any direct interaction with the user.
The process of loading data, training the model, and saving artifacts is fully automated.

**Second Code**

Interactive: Uses the input() function to request flight details from the user (e.g., month, departure time, airline code, etc.).
Provides predictions based on the user-provided input.

**6. Core Functionality Comparison**


| **Aspect**         | **First Code**                      | **Second Code**                          |
|--------------------|-------------------------------------|------------------------------------------|
| **Focus**          | Training a model                    | Using a model for predictions           |
| **Data Input**     | Training dataset (`flight_delays_train.csv`) | User inputs (`input()`)                 |
| **Output**         | Trained model, scaler, columns list | Delay prediction for a given flight     |
| **Preprocessing**  | Cleans and prepares training data   | Prepares user-provided data             |
| **Normalization**  | Applies normalization to training data | Normalizes user data                  |
| **Saving**         | Saves the model, scaler, and columns | Does not save; loads pre-trained files  |
| **Interaction**    | Non-interactive                     | Interactive (user provides data)        |


**7. Purpose in Workflow**

**First Code**

Used for training and saving the model. It takes raw data, cleans and preprocesses it, trains a machine learning model, and saves the necessary artifacts for future use.

**Second Code**

Used for making predictions. It loads the saved model and artifacts, processes user-provided data, and predicts flight delays.

**Conclusion**

The first code is for model training, focusing on creating a reliable model from raw training data. The second code is for prediction, allowing the user to interactively input flight details and receive predictions using the pre-trained model.

The two codes are complementary: the first prepares the model, and the second utilizes it.

In [None]:
import numpy as np
import pandas as pd
from keras.models import load_model
from sklearn.preprocessing import StandardScaler
import joblib

# Load the Keras model, the scaler, and the saved training columns
model = load_model('flight_delay_model.h5')  # Replace with the name of your Keras model file
scaler = joblib.load('scaler.pkl')
with open('training_columns.txt', 'r') as f:
    training_columns = f.read().splitlines()

def preprocess_data(df):
    """
    Preprocess the data into the format the model expects.
    """
    # Convert 'Month', 'DayofMonth', and 'DayOfWeek' if they are strings with a "c-" prefix
    for col in ['Month', 'DayofMonth', 'DayOfWeek']:
        if df[col].dtype == 'object':
            df[col] = df[col].str.replace('c-', '', regex=False).astype(int)

    # Derive the departure hour as in training (e.g., 1530 -> 15); without this
    # step 'Dep Hour' would be missing and silently filled with 0 below
    df['Dep Hour'] = df['DepTime'] // 100

    # One-hot encode the categorical variables
    categorical_columns = ['UniqueCarrier', 'Origin', 'Dest']
    df = pd.get_dummies(df, columns=categorical_columns)

    # Align with the training columns in one step: missing columns are filled
    # with 0, extra columns are dropped, and the order matches the model input
    df = df.reindex(columns=training_columns, fill_value=0)
    return df

def preprocess_user_data(flight_data):
    """
    Convert the user's data to a DataFrame and preprocess it.
    """
    # Convert the dictionary to a one-row DataFrame
    df = pd.DataFrame([flight_data])

    # Preprocess the data
    df_processed = preprocess_data(df)

    # Scale the data with the scaler fitted during training
    df_scaled = scaler.transform(df_processed)

    return df_scaled

# User input for predicting a specific flight
def get_user_input():
    """
    Ask the user for the flight details.
    """
    flight_data = {
        'Month': int(input("Enter the month (1-12): ")),
        'DayofMonth': int(input("Enter the day of the month (1-31): ")),
        'DayOfWeek': int(input("Enter the day of the week (1=Monday, 7=Sunday): ")),
        'DepTime': int(input("Enter the departure time in HHMM format (e.g., 1230): ")),
        'UniqueCarrier': input("Enter the airline code (e.g., AA, DL): "),
        'Origin': input("Enter the origin airport code (e.g., JFK, ATL): "),
        'Dest': input("Enter the destination airport code (e.g., LAX, SFO): "),
        # The distance must use the same unit as the Distance column of the training data
        'Distance': float(input("Enter the flight distance (in the same unit as the training data): "))
    }
    return flight_data

def predict_delay(flight_data):
    """
    Predict the delay score for a flight from the user's data.
    """
    # Preprocess and scale the user data
    flight_df_scaled = preprocess_user_data(flight_data)

    # Predict the delay score
    predicted_delay = model.predict(flight_df_scaled)

    # Display the result. The model was trained on a 0/1 target (delayed by more
    # than 15 minutes or not), so the output is a score, not a number of minutes.
    print(f"Predicted delay score for this flight: {predicted_delay[0][0]:.2f} "
          "(values near 1 indicate a likely delay of more than 15 minutes).")

if __name__ == "__main__":
    # Get the user's flight data
    flight_data = get_user_input()

    # Predict the delay
    predict_delay(flight_data)






Enter the month (1-12): 7
Enter the day of the month (1-31): 11
Enter the day of the week (1=Monday, 7=Sunday): 3
Enter the departure time in HHMM format (e.g., 1230): 1139
Enter the airline code (e.g., AA, DL): US
Enter the origin airport code (e.g., JFK, ATL): PIT
Enter the destination airport code (e.g., LAX, SFO): FLL
Enter the flight distance (in the same unit as the training data): 994



1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 65ms/step
Predicted delay score for this flight: 0.12 (values near 1 indicate a likely delay of more than 15 minutes).
