# Brief

- Dataset link - https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/XNFVTS
- Paper link - https://www.researchgate.net/publication/329150699_Electronic_nose_dataset_for_beef_quality_monitoring_in_uncontrolled_ambient_conditions

### Description of the dataset:-
This dataset consists of five time series, one per beef cut. Each series contains 2160 measurement points, one per minute, and is distributed as a comma-separated values (CSV) file. The first row contains the column headers:
- Minute: time of measurement point (minute);
- Class: discrete label of beef quality [“excellent”, “good”, “acceptable”, “spoiled”];
- TVC: continuous label of microbial population (log10 cfu/g);
- MQ_: sensor resistance value of a particular gas sensor (Ω);
- Humidity: relative humidity (%) in the sample chamber;
- Temperature: temperature (°C) in the sample chamber.
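
Because each beef cut is stored in its own CSV file, it can help to stack the series into one table with a column recording which file each row came from. The sketch below is one way to do that. `load_series` and the `series` column are hypothetical names introduced here for illustration, and the caller supplies the file paths (only `TS1.csv` appears in this notebook, so any other file names are assumptions).

```python
from pathlib import Path

import pandas as pd


def load_series(paths):
    """Read each CSV and stack the rows, tagging every row with its source file stem.

    Hypothetical helper: the `series` column is not part of the dataset.
    """
    frames = []
    for path in paths:
        path = Path(path)
        frame = pd.read_csv(path)
        frame["series"] = path.stem  # e.g. "TS1" for ".../TS1.csv"
        frames.append(frame)
    # ignore_index gives the combined table a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

Keeping the `series` column makes it possible to later hold out an entire beef cut for testing instead of individual minutes.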

### What each sensor detects:-
- MQ135 – Air quality (CO₂, NH₃, NOx, benzene, smoke)
- MQ136 – Hydrogen sulfide (H₂S)
- MQ2 – Methane (CH₄), propane, butane, smoke, LPG
- MQ3 – Alcohol, ethanol, benzine
- MQ4 – Methane (CH₄), natural gas
- MQ5 – LPG, natural gas, town gas
- MQ6 – LPG, butane
- MQ8 – Hydrogen (H₂)
- MQ9 – Carbon monoxide (CO), methane, LPG
- Humidity – Relative humidity of the air in the sample chamber (%)
- Temperature – Temperature in the sample chamber (°C)


In [None]:
#@title Mount drive

from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/Beef Dataset/TS1.csv")

df.head()

Unnamed: 0,minute,class,TVC,MQ135,MQ136,MQ2,MQ3,MQ4,MQ5,MQ6,MQ8,MQ9,Humidity,Temperature
0,1,excellent,2.567112,22.86,41.24,29.56,14.83,49.38,11.1,2.21,32.09,14.29,55.1,34.6
1,2,excellent,2.567112,22.75,41.5,29.72,14.83,49.38,11.23,2.23,31.92,14.58,55.1,34.6
2,3,excellent,2.567112,22.75,41.5,29.72,14.83,49.38,11.28,2.22,34.31,14.4,55.1,34.6
3,4,excellent,2.567112,22.86,42.56,29.56,14.77,49.38,11.46,2.22,34.12,14.64,55.4,34.5
4,5,excellent,2.567112,23.07,42.03,29.56,14.83,49.38,11.46,2.22,33.93,14.58,55.4,34.4


In [None]:
# --- BEEF SENSOR DATA EDA (TS1.csv) ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.ndimage import gaussian_filter1d

# --- 1. Load data ---
df = pd.read_csv("/content/drive/MyDrive/Beef Dataset/TS1.csv")

# --- 2. Basic info ---
print(df.head())
print(df.info())
print(df['class'].value_counts())

# --- 3. Convert time column properly ---
df['minute'] = pd.to_numeric(df['minute'], errors='coerce')

# --- 4. Sensor list ---
sensors = ['MQ135','MQ136','MQ2','MQ3','MQ4','MQ5','MQ6','MQ8','MQ9','Humidity','Temperature','TVC']

# --- 5. Plot raw vs smoothed sensor drift ---
fig, axes = plt.subplots(6, 2, figsize=(14, 16))
axes = axes.flatten()

for i, sensor in enumerate(sensors):
    raw = df[sensor].values
    smooth = gaussian_filter1d(raw, sigma=10)   # spike stability smoothing

    axes[i].plot(df['minute'], raw, label='Raw', alpha=0.5)
    axes[i].plot(df['minute'], smooth, label='Smoothed', linewidth=2)
    axes[i].set_title(f"{sensor} vs Minute (Sensor Drift)")
    axes[i].set_xlabel("Minute")
    axes[i].set_ylabel("Reading (Ω, %, °C or log10 cfu/g)")
    axes[i].legend()
    axes[i].grid(True)

# remove empty subplots if odd number
for j in range(len(sensors), len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.savefig("TS1_Sensor_Shift_and_Smoothing.png", dpi=300)
plt.close()

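# --- 6. Correlation matrix ---
df_encoded = df.copy()
# Encode class by freshness order rather than alphabetically, so its correlations are interpretable
# (assumes the labels are spelled exactly as listed; any other label is encoded as -1)
class_order = ['excellent', 'good', 'acceptable', 'spoiled']
df_encoded['class'] = pd.Categorical(df_encoded['class'], categories=class_order, ordered=True).codes

# `sensors` already contains TVC, so it is not added again (avoids a duplicated column)
corr = df_encoded[sensors + ['class']].corr()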

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap='coolwarm', square=True)
plt.title("Correlation Matrix: Sensors, TVC & Class (TS1)")
plt.tight_layout()
plt.savefig("TS1_Correlation_Matrix.png", dpi=300)
plt.close()

# --- 7. Combine both sheets into one PDF ---
from matplotlib.backends.backend_pdf import PdfPages
pdf = PdfPages("TS1_EDA_Report.pdf")

# Page 1 - Sensor drift
img1 = plt.imread("TS1_Sensor_Shift_and_Smoothing.png")
plt.figure(figsize=(8.3, 11.7)) # A4
plt.imshow(img1)
plt.axis('off')
plt.title("Sensor Drift & Spike Stability (TS1)")
pdf.savefig()
plt.close()

# Page 2 - Correlation
img2 = plt.imread("TS1_Correlation_Matrix.png")
plt.figure(figsize=(8.3, 11.7))
plt.imshow(img2)
plt.axis('off')
plt.title("Correlation Analysis (TS1)")
pdf.savefig()
plt.close()

pdf.close()

print("✅ EDA complete. Outputs saved as:")
print("   - TS1_Sensor_Shift_and_Smoothing.png")
print("   - TS1_Correlation_Matrix.png")
print("   - TS1_EDA_Report.pdf (2-page summary)")


   minute      class       TVC  MQ135  MQ136    MQ2    MQ3    MQ4    MQ5  \
0       1  excellent  2.567112  22.86  41.24  29.56  14.83  49.38  11.10   
1       2  excellent  2.567112  22.75  41.50  29.72  14.83  49.38  11.23   
2       3  excellent  2.567112  22.75  41.50  29.72  14.83  49.38  11.28   
3       4  excellent  2.567112  22.86  42.56  29.56  14.77  49.38  11.46   
4       5  excellent  2.567112  23.07  42.03  29.56  14.83  49.38  11.46   

    MQ6    MQ8    MQ9  Humidity  Temperature  
0  2.21  32.09  14.29      55.1         34.6  
1  2.23  31.92  14.58      55.1         34.6  
2  2.22  34.31  14.40      55.1         34.6  
3  2.22  34.12  14.64      55.4         34.5  
4  2.22  33.93  14.58      55.4         34.4  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2160 entries, 0 to 2159
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   minute       2160 non-null   int64  
 1   class        2160 non-n

In [7]:
#@title Basic LDA
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import pandas as pd


# Prepare Data
# Encode the class labels as integer codes (alphabetical order) without modifying df,
# so the cell can be re-run safely
y = df['class'].astype('category').cat.codes # Encoded class as target

X = df[['MQ135','MQ136','MQ2','MQ3','MQ4','MQ5','MQ6','MQ8','MQ9','Humidity','Temperature']] # Sensor readings as features

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train LDA Model
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Evaluate Model
y_pred = lda.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.9629629629629629
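
A random 80/20 split of minute-by-minute readings puts near-identical neighbouring minutes on both sides of the split, so the accuracy above may be optimistic. The sketch below shows one possible cross-check: cross-validating LDA while holding out whole blocks of consecutive minutes together. `blocked_cv_accuracy`, the 30-minute block size, and the five folds are assumptions made for illustration, not choices taken from the paper or from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GroupKFold, cross_val_score


def blocked_cv_accuracy(frame, feature_cols, label_col="class",
                        block_minutes=30, n_splits=5):
    """Cross-validate LDA so that each block of consecutive minutes is held out as a unit.

    Hypothetical helper; block_minutes and n_splits are illustrative defaults.
    """
    # Rows in the same block of consecutive minutes share one group id
    groups = (frame["minute"].to_numpy() - 1) // block_minutes
    X = frame[feature_cols].to_numpy()
    y = frame[label_col].to_numpy()
    # GroupKFold never places rows of one group in both train and test
    cv = GroupKFold(n_splits=n_splits)
    return cross_val_score(LinearDiscriminantAnalysis(), X, y, groups=groups, cv=cv)
```

With the notebook's data this could be called as `blocked_cv_accuracy(df, feature_columns)`, where `feature_columns` is the list of sensor columns used for `X`. A noticeably lower score than the random split would suggest that part of the 96% comes from temporal leakage.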


# Conclusion

### Ideal Case:
- TVC should rise sharply over time.
- MQ9 and MQ6 should show a slight decrease (sensor resistance falls as gas concentration rises).
- MQ135 and MQ136 should show a sharp decrease.
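
The expected directions listed above can be checked numerically rather than by eye. A minimal sketch, assuming Spearman's rank correlation against time is an acceptable measure of monotonic trend: a value near +1 means the column rises with time, and a value near -1 means it falls. `trend_direction` is a hypothetical helper name, and the measure only captures direction and monotonicity, not whether a change is "sharp" or "slight".

```python
import pandas as pd
from scipy.stats import spearmanr


def trend_direction(frame, columns, time_col="minute"):
    """Return Spearman's rho between each column and time; the sign gives the direction."""
    rhos = {}
    for col in columns:
        rho, _pvalue = spearmanr(frame[time_col], frame[col])
        rhos[col] = rho
    return pd.Series(rhos, name="spearman_rho")
```

For example, `trend_direction(df, ['TVC', 'MQ135', 'MQ136', 'MQ6', 'MQ9'])` would be expected to give a positive value for TVC and negative values for the four sensors if the ideal case holds.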

### Results:

The expected increases and decreases do occur, but after a certain time the values tend to saturate, so temporal information adds little for the SNN. A classifier can still be trained, but explainability is at risk because, as per the paper, the sensors also pick up noise and are 'contaminated'.
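
The point at which a signal saturates can be estimated with a simple rule. The sketch below is one possible definition, not the one used in the paper: after Gaussian smoothing (same `sigma=10` as in the EDA cell), the signal counts as saturated from the first index after which it stays within a small fraction of its total range of its final value. `saturation_index` and the 1% tolerance are assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d


def saturation_index(values, sigma=10, rel_tol=0.01):
    """Index from which the smoothed signal stays within rel_tol * range of its final value.

    Hypothetical helper; sigma and rel_tol are illustrative defaults.
    """
    smooth = gaussian_filter1d(np.asarray(values, dtype=float), sigma=sigma)
    span = smooth.max() - smooth.min()
    if span == 0:
        return 0  # flat signal: treated as saturated from the start
    # True wherever the smoothed signal is still outside the tolerance band
    outside = np.abs(smooth - smooth[-1]) > rel_tol * span
    if not outside.any():
        return 0
    # First index after the last sample that lies outside the band
    return int(np.flatnonzero(outside)[-1]) + 1
```

Applied to a sensor column, for example `saturation_index(df['MQ135'].values)`, an index well before the end of the 2160-minute series would support the observation that later readings carry little additional temporal information.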

### Verdict
Good for a classifier, risky for explainability.

The basic LDA model reached an accuracy above 96% (0.963) on the 20% test split of TS1.
