# Exam on Artificial Neural Networks (ANN)

Welcome the Artificial Neural Networks (ANN) practical exam. In this exam, you will work on a classification task to predict the outcome of incidents involving buses. You are provided with a dataset that records breakdowns and delays in bus operations. Your task is to build, train, and evaluate an ANN model.

---

## Dataset Overview

### **Dataset:**
* Just run the command under the `Load Data` section to get the data downloaded and unzipped or you can access it [here](www.kaggle.com/datasets/khaledzsa/bus-breakdown-and-delays)

### **Dataset Name:** Bus Breakdown and Delays

### **Description:**  
The dataset contains records of incidents involving buses that were either running late or experienced a breakdown. Your task is to predict whether the bus was delayed or had a breakdown based on the features provided.

### **Features:**
The dataset contains the following columns:

- `School_Year`
- `Busbreakdown_ID`
- `Run_Type`
- `Bus_No`
- `Route_Number`
- `Reason`
- `Schools_Serviced`
- `Occurred_On`
- `Created_On`
- `Boro`
- `Bus_Company_Name`
- `How_Long_Delayed`
- `Number_Of_Students_On_The_Bus`
- `Has_Contractor_Notified_Schools`
- `Has_Contractor_Notified_Parents`
- `Have_You_Alerted_OPT`
- `Informed_On`
- `Incident_Number`
- `Last_Updated_On`
- `Breakdown_or_Running_Late` (Target Column)
- `School_Age_or_PreK`

## Load Data

In [None]:
!kaggle datasets download -d khaledzsa/bus-breakdown-and-delays
!unzip bus-breakdown-and-delays.zip

Dataset URL: https://www.kaggle.com/datasets/khaledzsa/bus-breakdown-and-delays
License(s): unknown
Downloading bus-breakdown-and-delays.zip to /content
100% 4.75M/4.75M [00:00<00:00, 49.3MB/s]
100% 4.75M/4.75M [00:00<00:00, 48.4MB/s]
Archive:  bus-breakdown-and-delays.zip
  inflating: Bus_Breakdown_and_Delays.csv  


## Importing Libraries

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import layers, models
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix

## Exploratory Data Analysis (EDA)
This could include:
* **Inspect the dataset**

* **Dataset structure**

* **Summary statistics**

* **Check for missing values**

* **Distribution of features**

* **Categorical feature analysis**

* **Correlation matrix**

* **Outlier detection**

And add more as needed!

In [2]:
df = pd.read_csv('Bus_Breakdown_and_Delays.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147972 entries, 0 to 147971
Data columns (total 21 columns):
 #   Column                           Non-Null Count   Dtype 
---  ------                           --------------   ----- 
 0   School_Year                      147972 non-null  object
 1   Busbreakdown_ID                  147972 non-null  int64 
 2   Run_Type                         147883 non-null  object
 3   Bus_No                           147972 non-null  object
 4   Route_Number                     147884 non-null  object
 5   Reason                           147870 non-null  object
 6   Schools_Serviced                 147972 non-null  object
 7   Occurred_On                      147972 non-null  object
 8   Created_On                       147972 non-null  object
 9   Boro                             141654 non-null  object
 10  Bus_Company_Name                 147972 non-null  object
 11  How_Long_Delayed                 126342 non-null  object
 12  Number_Of_Studen

In [4]:
df.head()

Unnamed: 0,School_Year,Busbreakdown_ID,Run_Type,Bus_No,Route_Number,Reason,Schools_Serviced,Occurred_On,Created_On,Boro,...,How_Long_Delayed,Number_Of_Students_On_The_Bus,Has_Contractor_Notified_Schools,Has_Contractor_Notified_Parents,Have_You_Alerted_OPT,Informed_On,Incident_Number,Last_Updated_On,Breakdown_or_Running_Late,School_Age_or_PreK
0,2015-2016,1224901,Pre-K/EI,811,1,Other,C353,10/26/2015 08:30:00 AM,10/26/2015 08:40:00 AM,Bronx,...,10MINUTES,5,Yes,Yes,No,10/26/2015 08:40:00 AM,,10/26/2015 08:40:39 AM,Running Late,Pre-K
1,2015-2016,1225098,Pre-K/EI,9302,1,Heavy Traffic,C814,10/27/2015 07:10:00 AM,10/27/2015 07:11:00 AM,Bronx,...,25 MINUTES,3,Yes,Yes,No,10/27/2015 07:11:00 AM,,10/27/2015 07:11:22 AM,Running Late,Pre-K
2,2015-2016,1215800,Pre-K/EI,358,2,Heavy Traffic,C195,09/18/2015 07:36:00 AM,09/18/2015 07:38:00 AM,Bronx,...,15 MINUTES,12,Yes,Yes,Yes,09/18/2015 07:38:00 AM,,09/18/2015 07:38:44 AM,Running Late,Pre-K
3,2015-2016,1215511,Pre-K/EI,331,2,Other,C178,09/17/2015 08:08:00 AM,09/17/2015 08:12:00 AM,Bronx,...,10 minutes,11,Yes,Yes,Yes,09/17/2015 08:12:00 AM,,09/17/2015 08:12:08 AM,Running Late,Pre-K
4,2015-2016,1215828,Pre-K/EI,332,2,Other,S176,09/18/2015 07:39:00 AM,09/18/2015 07:45:00 AM,Bronx,...,10MINUTES,12,Yes,Yes,No,09/18/2015 07:45:00 AM,,09/18/2015 07:56:40 AM,Running Late,Pre-K


In [5]:
df.shape

(147972, 21)

In [6]:
df.describe()

Unnamed: 0,Busbreakdown_ID,Number_Of_Students_On_The_Bus
count,147972.0,147972.0
mean,1287779.0,3.590071
std,43243.38,55.365859
min,1212681.0,0.0
25%,1250438.0,0.0
50%,1287844.0,0.0
75%,1325191.0,4.0
max,1362605.0,9007.0


In [7]:
df.isnull().sum()

School_Year                             0
Busbreakdown_ID                         0
Run_Type                               89
Bus_No                                  0
Route_Number                           88
Reason                                102
Schools_Serviced                        0
Occurred_On                             0
Created_On                              0
Boro                                 6318
Bus_Company_Name                        0
How_Long_Delayed                    21630
Number_Of_Students_On_The_Bus           0
Has_Contractor_Notified_Schools         0
Has_Contractor_Notified_Parents         0
Have_You_Alerted_OPT                    0
Informed_On                             0
Incident_Number                    142340
Last_Updated_On                         0
Breakdown_or_Running_Late               0
School_Age_or_PreK                      0
dtype: int64

In [8]:
df.duplicated().sum()

0

In [48]:
df['Occurred_On']=pd.to_datetime(df['Occurred_On'])
df['Created_On']=pd.to_datetime(df['Created_On'])
df['Created_On']=pd.to_datetime(df['Informed_On'])
df['Last_Updated_On']=pd.to_datetime(df['Last_Updated_On'])


In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 147693 entries, 0 to 147971
Data columns (total 19 columns):
 #   Column                           Non-Null Count   Dtype         
---  ------                           --------------   -----         
 0   School_Year                      147693 non-null  int64         
 1   Busbreakdown_ID                  147693 non-null  int64         
 2   Run_Type                         147693 non-null  int64         
 3   Bus_No                           147693 non-null  int64         
 4   Route_Number                     147693 non-null  int64         
 5   Reason                           147693 non-null  int64         
 6   Schools_Serviced                 147693 non-null  int64         
 7   Occurred_On                      147693 non-null  datetime64[ns]
 8   Created_On                       147693 non-null  datetime64[ns]
 9   Boro                             147693 non-null  int64         
 10  Bus_Company_Name                 147693 non-null 

## Data Preprocessing
This could include:

* **Handle Missing Values**
    * Impute missing values or drop them.

* **Encode Categorical Variables**
    * One-hot encoding
    * Label encoding

* **Scale and Normalize Data**
    * Standardization (Z-score)
    * Min-Max scaling

* **Feature Engineering**
    * Create new features
    * Feature selection

* **Handle Imbalanced Data**
    * Oversampling
    * Undersampling

* **Handle Outliers**
    * Remove outliers
    * Transform outliers

* **Remove Duplicates**
    * Remove redundant or duplicate data


And add more as needed!

Please treat these as suggestions. Feel free to use your judgment for the rest.

**Handle Missing Values**

In [12]:
df.dropna(subset='Run_Type',inplace=True)

In [13]:
df.dropna(subset='Route_Number',inplace=True)
df.dropna(subset='Reason',inplace=True)

In [14]:
df.drop(columns='Incident_Number',inplace=True)

In [15]:
df.drop(columns='How_Long_Delayed',inplace=True)

In [16]:
mode_B=df['Boro'].mode()[0]
df['Boro']=df['Boro'].fillna(mode_B)

**Encode Categorical Variables**

In [50]:
categ=['School_Year', 'Run_Type', 'Bus_No', 'Route_Number',
       'Reason', 'Schools_Serviced', 'Boro',
       'Bus_Company_Name','Has_Contractor_Notified_Schools', 'Has_Contractor_Notified_Parents',
       'Have_You_Alerted_OPT','Breakdown_or_Running_Late', 'School_Age_or_PreK']

label_encoders = {}
for feature in categ:
    le = LabelEncoder()
    df[feature] = le.fit_transform(df[feature].astype(str))
    label_encoders[feature] = le

In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 147693 entries, 0 to 147971
Data columns (total 19 columns):
 #   Column                           Non-Null Count   Dtype         
---  ------                           --------------   -----         
 0   School_Year                      147693 non-null  int64         
 1   Busbreakdown_ID                  147693 non-null  int64         
 2   Run_Type                         147693 non-null  int64         
 3   Bus_No                           147693 non-null  int64         
 4   Route_Number                     147693 non-null  int64         
 5   Reason                           147693 non-null  int64         
 6   Schools_Serviced                 147693 non-null  int64         
 7   Occurred_On                      147693 non-null  datetime64[ns]
 8   Created_On                       147693 non-null  datetime64[ns]
 9   Boro                             147693 non-null  int64         
 10  Bus_Company_Name                 147693 non-null 

**Feature Engineering**

In [52]:
df.head()

Unnamed: 0,School_Year,Busbreakdown_ID,Run_Type,Bus_No,Route_Number,Reason,Schools_Serviced,Occurred_On,Created_On,Boro,Bus_Company_Name,Number_Of_Students_On_The_Bus,Has_Contractor_Notified_Schools,Has_Contractor_Notified_Parents,Have_You_Alerted_OPT,Informed_On,Last_Updated_On,Breakdown_or_Running_Late,School_Age_or_PreK
0,0,1224901,3,9106,11121,6,3145,1970-01-01 00:00:00.000005222,2015-10-26 08:40:00,1,5,5,1,1,0,10/26/2015 08:40:00 AM,1970-01-01 00:00:00.000011797,1,0
1,0,1225098,3,9975,11121,3,3241,1970-01-01 00:00:00.000005329,2015-10-27 07:11:00,1,5,3,1,1,0,10/27/2015 07:11:00 AM,1970-01-01 00:00:00.000011990,1,0
2,0,1215800,3,5241,3429,3,3099,1970-01-01 00:00:00.000001288,2015-09-18 07:38:00,1,5,12,1,1,1,09/18/2015 07:38:00 AM,1970-01-01 00:00:00.000003009,1,0
3,0,1215511,3,5082,3429,6,3098,1970-01-01 00:00:00.000001143,2015-09-17 08:12:00,1,5,11,1,1,1,09/17/2015 08:12:00 AM,1970-01-01 00:00:00.000002720,1,0
4,0,1215828,3,5092,3429,6,3349,1970-01-01 00:00:00.000001291,2015-09-18 07:45:00,1,5,12,1,1,0,09/18/2015 07:45:00 AM,1970-01-01 00:00:00.000003071,1,0


In [53]:
df3=df[['Run_Type','Reason','Number_Of_Students_On_The_Bus',	'Has_Contractor_Notified_Schools','Have_You_Alerted_OPT']]

In [54]:
df3.head()

Unnamed: 0,Run_Type,Reason,Number_Of_Students_On_The_Bus,Has_Contractor_Notified_Schools,Have_You_Alerted_OPT
0,3,6,5,1,0
1,3,3,3,1,0
2,3,3,12,1,1
3,3,6,11,1,1
4,3,6,12,1,0


## Split the Dataset
Next, split the dataset into training, validation, and testing sets.

In [55]:
X = df3
y = df['Breakdown_or_Running_Late']

In [56]:
X.columns

Index(['Run_Type', 'Reason', 'Number_Of_Students_On_The_Bus',
       'Has_Contractor_Notified_Schools', 'Have_You_Alerted_OPT'],
      dtype='object')

In [57]:
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

## Building the ANN Model
In this section, define the architecture of the ANN by specifying the number of layers, neurons, and activation functions.

In [59]:
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

## Compile the Model
Compile the ANN model by defining the optimizer, loss function, and evaluation metrics.

In [61]:
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

In [62]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_3 (Dense)             (None, 64)                384       
                                                                 
 dense_4 (Dense)             (None, 32)                2080      
                                                                 
 dense_5 (Dense)             (None, 1)                 33        
                                                                 
Total params: 2497 (9.75 KB)
Trainable params: 2497 (9.75 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


## Training the Model
Train the ANN model using the training data.

In [63]:
history = model.fit(X_train, y_train, epochs=20, batch_size=32)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


## Evaluate the Model
Evaluate the performance of the model on the test set.

In [64]:
loss, accuracy = model.evaluate(X_test, y_test)



## Make Predictions
Use the trained model to make predictions on new or unseen data.

## Model Performance Visualization
Visualize the performance metrics such as accuracy and loss over the epochs.

## Save the Model
Save the trained model for submission.

In [65]:
model.save('model.h5')

  saving_api.save_model(


## Project Questions:

1. **Data Preprocessing**: Explain why you chose your specific data preprocessing techniques (e.g., normalization, encoding). How did these techniques help prepare the data for training the model?


- I dropped rows with missing values for small numbers of null values in column.

- I applied label encoding for categorical variables to convert them into numeric format suitable for training a neural network.

- I standardized the numerical features using StandardScaler to bring all features to a similar scale.


2. **Model Architecture**: Describe the reasoning behind your model’s architecture (e.g., the number of layers, type of layers, number of neurons, and activation functions). Why did you believe this architecture was appropriate for the problem at hand?


- The input layer was designed to accept 5 features .
- I used two hidden layers with 64 and 32 neurons, respectively, each using the ReLU activation function.
- The output layer consists of a single neuron with a sigmoid activation function to output the probability for binary classification.



3. **Training Process**: Discuss why you chose your batch size, number of epochs, and optimizer. How did these choices affect the training process? Did you experiment with different values, and what were the outcomes?
4. **Loss Function and Metrics**: Why did you choose the specific loss function and evaluation metrics? How do they align with the objective of the task (e.g., regression vs classification)?

- I chose binary cross-entropy as the loss function since i dealing with a binary classification task.

- I used accuracy as the primary metric to evaluate the model’s performance.


5. **Regularization Techniques**: If you used regularization techniques such as dropout or weight decay, explain why you implemented them and how they influenced the model's performance.
6. **Model Evaluation**: Justify your approach to evaluating the model. Why did you choose the specific performance metrics, and how do they reflect the model's success in solving the task?
7. **Model Tuning (If Done)**: Describe any tuning you performed (e.g., hyperparameter tuning) and why you felt it was necessary. How did these adjustments improve model performance?
8. **Overfitting and Underfitting**: Analyze whether the model encountered any overfitting or underfitting during training. What strategies could you implement to mitigate these issues?

### Answer Here: