<a href="https://colab.research.google.com/github/ANGB022210151/AquacultureProject/blob/main/Sensor_Fault_Labeling_Step1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd

# 1. Load your data into a Pandas DataFrame from the CSV file
df = pd.read_csv('/content/merged_sensor_data.csv')

# Display the first 5 rows to verify the data is loaded correctly
display(df.head())

Unnamed: 0,time,Temperature (°C),Time Dissolve Solid (ppm),Turbidity (NTU),pH
0,Wed Dec 03 2025 22:45:00 GMT+0800 (Malaysia Time),28.0,252.1596,4,7.535385
1,Wed Dec 03 2025 23:00:00 GMT+0800 (Malaysia Time),28.0,252.1596,4,7.535385
2,Wed Dec 03 2025 23:15:00 GMT+0800 (Malaysia Time),28.0,252.1596,4,7.535385
3,Wed Dec 03 2025 23:30:00 GMT+0800 (Malaysia Time),28.0,252.1596,4,7.535385
4,Wed Dec 03 2025 23:45:00 GMT+0800 (Malaysia Time),28.0,252.1596,4,7.535385


# Task
Detect sensor faults in aquaculture data by defining acceptable ranges for temperature, TDS, turbidity, and pH. Create a 'fault detection' column listing out-of-range parameters and an 'invalid' column (1 for any fault, 0 otherwise). Finally, display the updated DataFrame and summarize the fault detection process.

## Define Aquaculture Thresholds

### Subtask:
Establish reasonable out-of-range thresholds for temperature, TDS, turbidity, and pH parameters relevant to aquaculture systems. Values outside these thresholds will be considered indicative of a sensor fault.


**Reasoning**:
I will define the acceptable minimum and maximum values for Temperature, TDS, Turbidity, and pH based on typical aquaculture standards as requested in the instructions.
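
One way to keep the thresholds in a single place is a mapping keyed by column name. The sketch below is an illustration, not the notebook's own cell: `THRESHOLDS` and `is_out_of_range` are hypothetical names, and the bounds simply repeat the values used in the detection cell of this notebook.

```python
import math

# Hypothetical name: acceptable (min, max) per sensor column.
# The bounds repeat the values used in this notebook's detection cell.
THRESHOLDS = {
    "Temperature (°C)": (20.0, 40.0),
    "Time Dissolve Solid (ppm)": (0.0, 1000.0),
    "Turbidity (NTU)": (0.0, 900.0),
    "pH": (6.5, 9.5),
}

def is_out_of_range(column, value):
    """Return True when value is missing or outside the column's bounds."""
    low, high = THRESHOLDS[column]
    if value is None or math.isnan(value):
        return True
    return value < low or value > high

print(is_out_of_range("pH", 7.535385))
print(is_out_of_range("Temperature (°C)", 45.0))
```

This helper checks one reading at a time and does not apply the notebook's extra rule that a TDS reading of exactly 0 counts as a fault.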



## Implement Fault Detection Logic

### Subtask:
Apply the defined thresholds to each sensor parameter (temperature, TDS, turbidity, pH) in the DataFrame. For each row, identify which specific parameters are outside their acceptable range.


**Reasoning**:
I will create boolean Series for each sensor parameter ('Temperature (°C)', 'TDS (mg/L)', 'Turbidity (NTU)', 'pH') to identify values that are outside their defined acceptable ranges (min/max thresholds).



**Reasoning**:
The previous code failed with a `KeyError: 'TDS (mg/L)'`, indicating that the column name `TDS (mg/L)` does not exist in the DataFrame. Based on common data naming conventions and the fact that `Temperature (°C)` was correctly identified, it is likely that the 'TDS' and 'Turbidity' columns are simply named 'TDS' and 'Turbidity' without units in parentheses. I will correct the column names used to access 'TDS' and 'Turbidity' data in the DataFrame.



**Reasoning**:
The previous code failed with a `KeyError: 'TDS'`, and before that with `KeyError: 'TDS (mg/L)'`. Neither guessed name exists in the DataFrame. To make sure `df` contains the original, unmodified columns, I will reload the data from its CSV file. According to the `df.head()` output from the initial loading step, the actual column names for TDS, turbidity, and pH are `Time Dissolve Solid (ppm)`, `Turbidity (NTU)`, and `pH` respectively, so the fault checks will use those names.
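
A quick check of the assumed names against the real ones would likely have surfaced the mismatch before any indexing. The following is a minimal sketch; `missing_columns` is a hypothetical helper, and the `actual` list is copied from the `df.head()` output shown earlier.

```python
def missing_columns(available, required):
    """Return the required column names that are not in `available`."""
    available = set(available)
    return [name for name in required if name not in available]

# Column names as they appear in the df.head() output earlier in this notebook.
actual = ["time", "Temperature (°C)", "Time Dissolve Solid (ppm)",
          "Turbidity (NTU)", "pH"]
# Names the failed attempts assumed.
assumed = ["Temperature (°C)", "TDS (mg/L)", "Turbidity (NTU)", "pH"]

print(missing_columns(actual, assumed))
```

In the notebook itself, `df.columns` could be passed as `available` instead of the hard-coded list.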



In [None]:
import pandas as pd

# Reload the DataFrame to ensure all original columns are present
df = pd.read_csv('/content/augmented_sensor_data.csv')

# Define acceptable thresholds for each parameter
temp_min, temp_max = 20.0, 40.0
tds_min, tds_max = 0.0, 1000
turbidity_min, turbidity_max = 0.0, 900
ph_min, ph_max = 6.5, 9.5

print("Applying fault detection logic...")

# 1. Create boolean Series for 'Temperature' out-of-range values
df['temp_fault'] = (df['Temperature (°C)'] < temp_min) | (df['Temperature (°C)'] > temp_max) | (df['Temperature (°C)'].isna())

# 2. Create boolean Series for 'TDS' out-of-range values, using the correct column name
df['tds_fault'] = (df['Time Dissolve Solid (ppm)'] < tds_min) | (df['Time Dissolve Solid (ppm)'] > tds_max) | (df['Time Dissolve Solid (ppm)'] == 0) | (df['Time Dissolve Solid (ppm)'].isna())

# 3. Create boolean Series for 'Turbidity' out-of-range values, using the correct column name
df['turbidity_fault'] = (df['Turbidity (NTU)'] < turbidity_min) | (df['Turbidity (NTU)'] > turbidity_max) | (df['Turbidity (NTU)'].isna())

# 4. Create boolean Series for 'pH' out-of-range values
df['ph_fault'] = (df['pH'] < ph_min) | (df['pH'] > ph_max) | (df['pH'].isna())

# Display the first few rows with the new fault columns to verify
display(df[['Temperature (°C)', 'temp_fault', 'Time Dissolve Solid (ppm)', 'tds_fault', 'Turbidity (NTU)', 'turbidity_fault', 'pH', 'ph_fault']].head())

Applying fault detection logic...


Unnamed: 0,Temperature (°C),temp_fault,Time Dissolve Solid (ppm),tds_fault,Turbidity (NTU),turbidity_fault,pH,ph_fault
0,28.0,False,252.1596,False,4,False,7.535385,False
1,28.0,False,252.1596,False,4,False,7.535385,False
2,28.0,False,252.1596,False,4,False,7.535385,False
3,28.0,False,252.1596,False,4,False,7.535385,False
4,28.0,False,252.1596,False,4,False,7.535385,False
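
The four checks in the cell above share one pattern: below the minimum, above the maximum, or missing. A more compact sketch of that pattern, assuming pandas' `Series.between` (inclusive on both ends by default), is shown here; `out_of_range` is a hypothetical helper and the sample readings are made up for illustration.

```python
import numpy as np
import pandas as pd

def out_of_range(series, low, high):
    """Flag readings outside [low, high]; missing readings are flagged too."""
    # between() is inclusive on both ends and evaluates to False for NaN,
    # so negating it marks NaN as a fault as well.
    return ~series.between(low, high)

# Made-up readings for illustration only.
sample = pd.Series([28.0, 19.5, 41.2, np.nan])
flags = out_of_range(sample, 20.0, 40.0)
print(flags.tolist())
```

The TDS rule in the cell above additionally flags readings equal to 0, which would still need its own `== 0` condition.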


## Aggregate Faults and Create Invalid Flag

### Subtask:
Combine the individual fault indicators into a comprehensive 'fault detection' column and create an 'invalid' flag for rows with any fault.

#### Instructions
1. For each row, create a new column named `'fault detection'` that contains a list of the names of the parameters that are out of their acceptable range (e.g., `['Temperature', 'pH']` if both are out of range).
2. Create another new column named `'invalid'` that is `1` if the row has any fault (i.e., if the `'fault detection'` list is not empty), and `0` otherwise.
3. Display the first few rows of the DataFrame showing the original sensor data along with the new `'fault detection'` and `'invalid'` columns.

**Reasoning**:
I will create the 'fault detection' column by iterating through the fault boolean columns ('temp_fault', 'tds_fault', 'turbidity_fault', 'ph_fault') for each row and collecting the names of the parameters where a fault is detected into a list. Then, I will create the 'invalid' column based on whether the 'fault detection' list is empty or not.



**Reasoning**:
The previous code failed because the column name 'TDS (mg/L)' used in the display statement was incorrect. Based on the previous successful execution and the DataFrame's state, the correct column name for TDS is 'Time Dissolve Solid (ppm)'. I will correct this in the display statement to resolve the KeyError.



In [None]:
print("Aggregating faults and creating 'invalid' flag...")

# Map each boolean fault column to the parameter name it represents
sensor_params = {
    'temp_fault': 'Temperature',
    'tds_fault': 'TDS',
    'turbidity_fault': 'Turbidity',
    'ph_fault': 'pH'
}

# Create the 'fault detection' column
def get_faulty_params(row):
    faults = []
    for fault_col, param_name in sensor_params.items():
        if row[fault_col]:
            faults.append(param_name)
    return faults

df['fault detection'] = df.apply(get_faulty_params, axis=1)

# Create the 'invalid' column
df['invalid'] = df['fault detection'].apply(lambda x: 1 if len(x) > 0 else 0)

# Display the first few rows with the new columns, using the correct column name for TDS
display(df[['Temperature (°C)', 'Time Dissolve Solid (ppm)', 'Turbidity (NTU)', 'pH', 'fault detection' ]].head())

Aggregating faults and creating 'invalid' flag...


Unnamed: 0,Temperature (°C),Time Dissolve Solid (ppm),Turbidity (NTU),pH,fault detection
0,28.0,252.1596,4,7.535385,[]
1,28.0,252.1596,4,7.535385,[]
2,28.0,252.1596,4,7.535385,[]
3,28.0,252.1596,4,7.535385,[]
4,28.0,252.1596,4,7.535385,[]
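
The row-wise `df.apply` above is easy to read but can be slow on large frames. As a possible alternative, the sketch below computes `invalid` directly from the boolean columns with `any(axis=1)` and builds the lists from the underlying NumPy array. `label_faults` and the two-row `demo` frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Same mapping as the cell above: fault column -> parameter name.
sensor_params = {
    "temp_fault": "Temperature",
    "tds_fault": "TDS",
    "turbidity_fault": "Turbidity",
    "ph_fault": "pH",
}

def label_faults(frame):
    """Add 'fault detection' and 'invalid' columns from the boolean fault flags."""
    flags = frame[list(sensor_params)].astype(bool)
    names = list(sensor_params.values())
    out = frame.copy()
    # One list of parameter names per row, in the order of sensor_params.
    out["fault detection"] = pd.Series(
        [[n for n, hit in zip(names, row) if hit] for row in flags.to_numpy()],
        index=frame.index,
    )
    # 1 when at least one fault flag is set in the row, else 0.
    out["invalid"] = flags.any(axis=1).astype(int)
    return out

# Made-up flags for illustration only.
demo = pd.DataFrame({
    "temp_fault": [False, True],
    "tds_fault": [False, False],
    "turbidity_fault": [False, False],
    "ph_fault": [False, True],
})
labeled = label_faults(demo)
print(labeled[["fault detection", "invalid"]])
```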


In [None]:
print("Exporting the DataFrame to 'fault_detected_sensor_data.csv'...")

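# Include the 'invalid' flag so the exported file carries the consolidated label
export_columns = [
    'time', 'Temperature (°C)', 'Time Dissolve Solid (ppm)', 'Turbidity (NTU)', 'pH',
    'temp_fault', 'tds_fault', 'turbidity_fault', 'ph_fault', 'fault detection', 'invalid',
]
df[export_columns].to_csv('fault_detected_sensor_data.csv', index=False)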

print("DataFrame successfully exported.")

Exporting the DataFrame to 'fault_detected_sensor_data.csv'...
DataFrame successfully exported.


## Final Task

### Subtask:
Summarize the fault detection process and the insights gained from identifying invalid sensor readings.


## Summary:

### Q&A

**How was the fault detection process summarized, and what insights were gained from identifying invalid sensor readings?**

The fault detection process involved defining acceptable ranges for aquaculture parameters: Temperature (20.0-40.0 °C), TDS (0.0-1000 ppm), Turbidity (0.0-900 NTU), and pH (6.5-9.5). These thresholds were then applied to the sensor data to identify individual out-of-range readings; missing values and TDS readings of exactly 0 were also treated as faults. A new column, 'fault detection', was created to list all parameters that were out of range for a given record, and an 'invalid' column was added, flagging records with any detected fault (1 for invalid, 0 for valid).

Insights gained include the identification of specific sensor readings that fall outside healthy aquaculture parameters, allowing for immediate recognition of potentially faulty sensors or environmental issues. The process also highlighted the importance of consistent data labeling, as discrepancies in column names (e.g., 'TDS (mg/L)' vs. 'Time Dissolve Solid (ppm)') were a challenge during implementation.

### Data Analysis Key Findings

*   **Parameter Thresholds Defined:** Acceptable ranges for aquaculture parameters were established as follows:
    *   Temperature: 20.0 °C to 40.0 °C
    *   TDS: 0.0 ppm to 1000 ppm (a reading of exactly 0 is also flagged)
    *   Turbidity: 0.0 NTU to 900 NTU
    *   pH: 6.5 to 9.5
*   **Individual Fault Flags Created:** Boolean columns (`temp_fault`, `tds_fault`, `turbidity_fault`, `ph_fault`) were added to the DataFrame, indicating when each parameter's reading was missing or outside its defined acceptable range.
*   **Column Naming Discrepancy:** A critical finding during the process was the inconsistency in column naming; for instance, 'TDS (mg/L)' was expected but the actual column name in the dataset was 'Time Dissolve Solid (ppm)'. This required adjustment to correctly access the data.
*   **Aggregated Fault Detection:** A new column named 'fault detection' was successfully created, which lists the names of all parameters that are out of range for each record (e.g., `['Temperature', 'pH']`).
*   **Overall Invalid Flag:** An 'invalid' column was generated, marking records as `1` if any fault was detected and `0` otherwise, providing a consolidated view of data quality.
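
To put numbers behind these findings, the fault flags could be tallied per parameter. This is a minimal sketch under the assumption that the four boolean columns exist as named above; `summarize_faults` is a hypothetical helper and the `demo` frame is made up, not taken from the dataset.

```python
import pandas as pd

FAULT_COLUMNS = ["temp_fault", "tds_fault", "turbidity_fault", "ph_fault"]

def summarize_faults(frame):
    """Count flagged rows per parameter, rows with any fault, and total rows."""
    counts = frame[FAULT_COLUMNS].sum().astype(int).to_dict()
    counts["any_fault"] = int(frame[FAULT_COLUMNS].any(axis=1).sum())
    counts["rows"] = len(frame)
    return counts

# Made-up flags for illustration only.
demo = pd.DataFrame({
    "temp_fault": [False, True, False],
    "tds_fault": [False, False, False],
    "turbidity_fault": [False, False, True],
    "ph_fault": [False, True, False],
})
summary = summarize_faults(demo)
print(summary)
```

Calling `summarize_faults(df)` on the labeled DataFrame would give the actual counts for this dataset.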

### Insights or Next Steps

*   The established fault detection mechanism can serve as an early warning system for sensor malfunctions or critical environmental changes in aquaculture systems, enabling timely intervention.
*   To improve data quality and streamline analysis, it is crucial to standardize column naming conventions across all data collection and storage processes, preventing ambiguity and potential errors in future analyses.
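
Column standardization could be applied once, right after loading, so later cells never depend on the raw headers. The sketch below is one possible approach: the short names in `COLUMN_ALIASES` are hypothetical choices, while the keys are the column names present in this dataset and the sample row repeats the first row of the `df.head()` output.

```python
import pandas as pd

# Keys are the column names in this dataset; values are hypothetical short names.
COLUMN_ALIASES = {
    "Temperature (°C)": "temperature_c",
    "Time Dissolve Solid (ppm)": "tds_ppm",
    "Turbidity (NTU)": "turbidity_ntu",
    "pH": "ph",
}

def standardize_columns(frame):
    """Rename known sensor columns and fail loudly if any are missing."""
    missing = [c for c in COLUMN_ALIASES if c not in frame.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return frame.rename(columns=COLUMN_ALIASES)

# First row of the sensor data shown at the top of this notebook.
raw = pd.DataFrame({
    "Temperature (°C)": [28.0],
    "Time Dissolve Solid (ppm)": [252.1596],
    "Turbidity (NTU)": [4],
    "pH": [7.535385],
})
clean = standardize_columns(raw)
print(list(clean.columns))
```

Raising on a missing column turns a silent naming drift into an immediate, descriptive error at load time.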
