There is a real time fire dataset (Indoor_Fire_Dataset.csv) with the following description:


The dataset contains 4 incipient fire scenarios (wood, candles, cable, lunts) along with 8 different nuisance scenarios (smoke from a fog machine, deodorant, ethanol, CO release, exhaust gases, green waste, welding, and abrasive cutting) carried out in a (10 x 22 x 8) m^3 industrial hall without ventilation. Each scenario was repeated 3 times in random order, with background sequences in between to reduce the influence of prehistory. The dataset consists of 248,502 rows and 18 columns and is structured as a continuous multivariate time series. Each row represents the sensor measurements (CO2, CO, H2, humidity, particulate matter of different sizes, air temperature, VOC and UV) from a unique sensor node position ("Sensor_ID") in the industrial hall at a specific timestamp. The columns correspond to the sensor measurements and include additional labels: a scenario-specific label ("scenario_label"), a binary label ("anomaly_label") distinguishing between "Normal" (background) and "Anomaly" (fire or nuisance scenario), a "fire_label" dividing the anomalies into fire-relevant and non-fire-relevant anomalies, and a progress label ("progress_label") that allows the events to be divided into sub-sequences based on ongoing physical sub-processes. The "Sensor_ID" column can be used to access data from different sensor node positions.



Our database uses a different schema, so we need to transform this data into a new CSV file before storing it in our database. Instructions for each column:

1. Temperature (float8) - same as the data column Temperature_Room
2. Humidity (float8) - same as the data column Humidity_Room
3. Pressure (float8) - set every entry to 1030.0.
4. CO (float8) - same as the data column CO_Room
5. CO2 (float8) - same as the data column CO2_Room
6. Fire (bool) - True if both the "anomaly_label" and the "fire" label are not 0. Otherwise, the entry is False.

In [3]:
import pandas as pd

# Read the Indoor_Fire_Dataset.csv file
df = pd.read_csv('Indoor_Fire_Dataset.csv')

# Apply column transformations and additions
df_transformed = pd.DataFrame({
    'Temperature': df['Temperature_Room'],
    'Humidity': df['Humidity_Room'],
    'Pressure': 1030.0,
    'CO': df['CO_Room'],
    'CO2': df['CO2_Room'],
    'Fire': (df['fire'] != 0.0) & (df['anomaly_label'] != 0.0)
})

# Save the transformed data to our_fire_data.csv
df_transformed.to_csv('our_fire_data.csv', index=False)

print("Data transformed and saved to 'our_fire_data.csv' successfully!")

Data transformed and saved to 'our_fire_data.csv' successfully!
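As a quick sanity check of the Fire rule, here is a minimal sketch on a made-up four-row frame. It assumes both label columns hold numeric 0/1 values. The toy values are invented for illustration and are not taken from the dataset. If the labels are actually stored as strings such as "Normal"/"Anomaly", a comparison against 0 would be True for every row, so inspecting `df.dtypes` first is worthwhile.

```python
import pandas as pd

# Toy stand-in for the raw dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "anomaly_label": [0, 1, 1, 0],
    "fire":          [0, 0, 1, 1],
})

# Same rule as the transformation: True only when both labels are nonzero.
toy["Fire"] = (toy["fire"] != 0) & (toy["anomaly_label"] != 0)

print(toy)
print(toy["Fire"].value_counts())
```

Only the third row satisfies both conditions, so it is the only row flagged as a fire.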


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Here are some other supervised learning models commonly used for classification tasks, each with its own strengths and weaknesses:

1.  **Decision Trees:**
    *   **Concept:** Builds a tree-like model of decisions based on features to predict a target value. It splits data into subsets based on the value of input features.
    *   **Pros:** Easy to understand and interpret (white-box model), can handle both numerical and categorical data, requires little data preparation.
    *   **Cons:** Prone to overfitting, can be unstable (small variations in data might result in a completely different tree).

2.  **Random Forest:**
    *   **Concept:** An ensemble learning method that builds multiple decision trees during training and outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
    *   **Pros:** Reduces overfitting compared to individual decision trees, generally more accurate and robust, can handle a large number of features.
    *   **Cons:** Less interpretable than single decision trees, can be computationally expensive for very large datasets.

3.  **Support Vector Machines (SVM):**
    *   **Concept:** Finds the optimal hyperplane that best separates classes in a high-dimensional space. It aims to maximize the margin between the classes.
    *   **Pros:** Effective in high-dimensional spaces, memory efficient (uses a subset of training points in the decision function), versatile due to different kernel functions.
    *   **Cons:** Can be slow on large datasets, choosing the right kernel function and regularization parameters can be challenging.

4.  **K-Nearest Neighbors (KNN):**
    *   **Concept:** A non-parametric, lazy learning algorithm that classifies a data point based on the majority class of its 'k' nearest neighbors in the feature space.
    *   **Pros:** Simple to understand and implement, no training phase (lazy learner), works well for multi-class problems.
    *   **Cons:** Computationally expensive during prediction (has to calculate distance to all training examples), sensitive to irrelevant features and the scale of data, needs careful selection of 'k'.

5.  **Gradient Boosting (e.g., XGBoost, LightGBM, CatBoost):**
    *   **Concept:** Builds an ensemble of weak prediction models (typically decision trees) sequentially. Each new tree corrects the errors of the previous ones.
    *   **Pros:** Often provides state-of-the-art performance on many tabular datasets, handles various data types, robust to overfitting with proper tuning.
    *   **Cons:** Can be more complex to tune than Random Forest, prone to overfitting if not tuned carefully, training can be time-consuming.

6.  **Neural Networks (Deep Learning):**
    *   **Concept:** Composed of layers of interconnected nodes (neurons), inspired by the human brain. They learn hierarchical representations of data.
    *   **Pros:** Highly powerful for complex patterns, especially in image, text, and sequence data; can achieve very high accuracy with enough data and computational resources.
    *   **Cons:** Requires very large datasets for optimal performance, computationally intensive, difficult to interpret (black-box model), prone to overfitting without proper regularization.

The choice of model depends heavily on the specific problem, the nature of the data, the size of the dataset, computational resources, and the need for interpretability.
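To make the comparison concrete, the following is a minimal sketch that fits several of the model families listed above on a small synthetic dataset from `make_classification`. The dataset, the hyperparameters, and the `candidates` dictionary are assumptions chosen for illustration. The scores say nothing about how these models would rank on the fire data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Hypothetical candidate set with mostly default settings.
candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model and record its accuracy on the held-out part.
results = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    results[name] = clf.score(X_te, y_te)
    print(f"{name}: {results[name]:.3f}")
```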

The next step is to build a predictive model using the scikit-learn library in Python. A linear classifier is probably a good first model for this data. First we need to split the data in "our_fire_data.csv" into training data and testing data, with each set containing both True and False values in the "Fire" column. The training data should be saved as training.csv and the testing data should be saved as testing.csv. Use a simple alternating split: send one row to training, the next row to testing, the next row to training again, and so on.

In [5]:
import pandas as pd

# Load the transformed data
df_fire = pd.read_csv('our_fire_data.csv')

# Split into training and testing sets based on alternating rows
training_df = df_fire.iloc[::2]
testing_df = df_fire.iloc[1::2]

# Save the training and testing sets to new CSV files
training_df.to_csv('training.csv', index=False)
testing_df.to_csv('testing.csv', index=False)

print("Data successfully split into 'training.csv' and 'testing.csv'!")

Data successfully split into 'training.csv' and 'testing.csv'!
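The alternating split can be checked on a small invented frame to confirm that the two halves do not overlap and that both contain fire and non-fire rows. This is a sketch under the assumption that the real file has the same `Fire` column. The toy values are made up. Since the source data is a continuous time series, neighboring rows are probably very similar, so a test score obtained from an alternating split may be optimistic.

```python
import pandas as pd

# Toy frame standing in for our_fire_data.csv (values are invented).
toy = pd.DataFrame({
    "CO": [1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 1.0, 1.1],
    "Fire": [False, False, False, True, True, True, False, False],
})

train = toy.iloc[::2]   # rows 0, 2, 4, 6
test = toy.iloc[1::2]   # rows 1, 3, 5, 7

# Share of fire rows in each half.
train_share = train["Fire"].mean()
test_share = test["Fire"].mean()
print(f"train: {len(train)} rows, fire share {train_share:.2f}")
print(f"test:  {len(test)} rows, fire share {test_share:.2f}")
```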


Then import the linear classification tools from Scikit-learn. These tools will be used on the five parameters: Temperature, Humidity, Pressure, CO, and CO2. The boolean value fire will be the result to be predicted. Write the import lines.

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Training Part:
Using the training.csv data, train the model so that it uses linear classification to predict the Fire output from the parameters.

In [7]:
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Load the training data
training_df = pd.read_csv('training.csv')

# Define features (X) and target (y)
X_train = training_df[['Temperature', 'Humidity', 'Pressure', 'CO', 'CO2']]
y_train = training_df['Fire']

# Initialize the Logistic Regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

print("Model trained successfully!")

Model trained successfully!


# Testing part
Find out the accuracy in percentage by applying testing.csv to the model.

In [8]:
import pandas as pd
from sklearn.metrics import accuracy_score

# Load the testing data
testing_df = pd.read_csv('testing.csv')

# Define features (X) and true target (y) for testing
X_test = testing_df[['Temperature', 'Humidity', 'Pressure', 'CO', 'CO2']]
y_test = testing_df['Fire']

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
accuracy_percentage = accuracy * 100

print(f"Model accuracy on the testing data: {accuracy_percentage:.2f}%")

Model accuracy on the testing data: 88.92%
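Accuracy alone can be hard to interpret when one class is much rarer than the other, because always predicting the majority class can already score high. The sketch below compares a logistic regression against a majority-class baseline and prints a confusion matrix. It uses synthetic data with roughly 10% positives, and that class ratio is an assumption for illustration, not a measured property of the fire data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(0)

# Synthetic imbalanced data: roughly 10% positives (illustrative only).
n = 1000
y = rng.random(n) < 0.1
X = rng.normal(size=(n, 2)) + y[:, None] * 1.5

# Alternating split, mirroring the approach used in this notebook.
X_train, X_test = X[::2], X[1::2]
y_train, y_test = y[::2], y[1::2]

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
clf = LogisticRegression().fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
clf_acc = accuracy_score(y_test, clf.predict(X_test))

# Rows are true classes, columns are predicted classes, ordered [False, True].
cm = confusion_matrix(y_test, clf.predict(X_test), labels=[False, True])

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {clf_acc:.3f}")
print(cm)
```

If the model's accuracy is close to the baseline, the confusion matrix shows whether any fire rows are being detected at all.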


# Cross validation
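Split the data into training and testing sets in a different way than above (a stratified random 70/30 split instead of alternating rows), then train and test again. Note that this is a second hold-out split, not k-fold cross-validation.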

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the original transformed data
df_fire = pd.read_csv('our_fire_data.csv')

# Define features (X) and target (y)
X = df_fire[['Temperature', 'Humidity', 'Pressure', 'CO', 'CO2']]
y = df_fire['Fire']

# Split data into training and testing sets using a different method (random split)
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Data split into new training and testing sets successfully!")

Data split into new training and testing sets successfully!


### Training the model with the new split

In [10]:
# Initialize a new Logistic Regression model
model_new = LogisticRegression()

# Train the new model
model_new.fit(X_train_new, y_train_new)

print("New model trained successfully!")

New model trained successfully!


### Testing the model with the new split

In [11]:
# Make predictions on the new test set
y_pred_new = model_new.predict(X_test_new)

# Calculate the accuracy
accuracy_new = accuracy_score(y_test_new, y_pred_new)
accuracy_percentage_new = accuracy_new * 100

print(f"New model accuracy on the testing data: {accuracy_percentage_new:.2f}%")

New model accuracy on the testing data: 88.93%
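The cells above evaluate a single stratified random split. For comparison, k-fold cross-validation in scikit-learn can be sketched as follows, using `StratifiedKFold` and `cross_val_score`. The synthetic data, the 5 folds, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix and Fire labels (illustrative only).
n = 600
y = rng.random(n) < 0.3
X = rng.normal(size=(n, 3)) + y[:, None] * 1.0

# 5 stratified folds: each fold keeps roughly the same class ratio.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each fold serves once as the test set while the other four train the model.
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="accuracy")

print("fold accuracies:", np.round(scores, 3))
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the mean and spread over folds gives a more stable estimate than a single split.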


# Utilizing the model
Now we have an input:

Temperature = 30.60
Humidity = 30.20
Pressure = 1030.30
CO = 3.10
CO2 = 512.00

Use the model to make a prediction.

In [12]:
import pandas as pd

# Define the input values
input_data = {
    'Temperature': [30.60],
    'Humidity': [30.20],
    'Pressure': [1030.30],
    'CO': [3.10],
    'CO2': [512.00]
}

# Create a DataFrame from the input data
input_df = pd.DataFrame(input_data)

# Make a prediction using the trained model (model_new from the random split)
prediction = model_new.predict(input_df)

# Output the prediction
if prediction[0]:
    print("Prediction: Fire detected!")
else:
    print("Prediction: No fire detected.")

Prediction: No fire detected.
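Besides the hard True/False answer, a logistic regression can also report a probability through `predict_proba`, which helps judge how close an input is to the decision boundary. The sketch below trains on synthetic data with only two features, so the model, the training values, and the resulting probability are illustrative assumptions and do not reproduce the notebook's model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic training data (invented): fire rows have higher CO and CO2.
n = 400
fire = rng.random(n) < 0.3
train = pd.DataFrame({
    "CO": rng.normal(2.0, 0.5, n) + fire * 3.0,
    "CO2": rng.normal(500.0, 20.0, n) + fire * 150.0,
})
clf = LogisticRegression(max_iter=1000).fit(train, fire)

# Single input row with the same column names as the training frame.
sample = pd.DataFrame({"CO": [3.10], "CO2": [512.00]})

# predict_proba returns one probability per class, ordered like clf.classes_.
proba = clf.predict_proba(sample)[0]
fire_idx = list(clf.classes_).index(True)
print(f"P(fire) = {proba[fire_idx]:.3f}")
```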


# KNN Model

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Reload the transformed data so this cell can run on its own,
# for example after a kernel restart
df_fire = pd.read_csv('our_fire_data.csv')
X = df_fire[['Temperature', 'Humidity', 'Pressure', 'CO', 'CO2']]
y = df_fire['Fire']

# Split data into training and testing sets using a random split (as done previously for 'new' datasets)
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Initialize the KNN classifier. A common choice for n_neighbors is 5.
knn_model = KNeighborsClassifier(n_neighbors=5)

# Train the KNN model
knn_model.fit(X_train_new, y_train_new)

print("KNN model trained successfully!")

# Make predictions on the test set
y_pred_knn = knn_model.predict(X_test_new)

# Calculate the accuracy
accuracy_knn = accuracy_score(y_test_new, y_pred_knn)
accuracy_percentage_knn = accuracy_knn * 100

print(f"KNN model accuracy on the testing data: {accuracy_percentage_knn:.2f}%")

KNN model trained successfully!
KNN model accuracy on the testing data: 92.48%
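KNN relies on distances, so features with large numeric ranges can dominate features with small ranges. One common remedy is to standardize the features before fitting. The sketch below compares an unscaled KNN with a `StandardScaler` plus KNN pipeline on synthetic data in which only the small-scale feature is informative. The data and the feature scales are assumptions for illustration, and whether scaling helps on the fire data would need to be measured.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales (illustrative only).
n = 600
y = rng.random(n) < 0.3
small = rng.normal(0.0, 1.0, n) + y * 2.0   # informative, small range
large = rng.normal(500.0, 100.0, n)         # pure noise, large range
X = np.column_stack([small, large])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# KNN on raw features versus KNN on standardized features.
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(X_tr, y_tr)

raw_acc = raw_knn.score(X_te, y_te)
scaled_acc = scaled_knn.score(X_te, y_te)
print(f"raw KNN accuracy:    {raw_acc:.3f}")
print(f"scaled KNN accuracy: {scaled_acc:.3f}")
```

Wrapping the scaler and the classifier in one pipeline ensures the scaling parameters are learned from the training part only.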
