# **SECOM Data Analysis and Model Training**

## **0. Data Download and Extraction**

The following cell downloads and extracts the SECOM archive if the data is not already present in the local directory.

In [1]:
from get_data import download_and_extract_data

url = "https://archive.ics.uci.edu/static/public/179/secom.zip"
download_and_extract_data(url)

2026-01-16 18:42:50,649 - get_data - INFO - SECOM data found.
2026-01-16 18:42:50,650 - get_data - INFO - Data ready to be used.
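
The `get_data` module itself is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch using only the standard library. The function name `download_and_extract`, the logger name, and the `target_dir` parameter are assumptions made for illustration, not the actual implementation behind `download_and_extract_data`.

```python
import logging
import tempfile
import urllib.request
import zipfile
from pathlib import Path

logger = logging.getLogger("get_data_sketch")  # hypothetical logger name


def download_and_extract(url: str, target_dir: str = "data") -> Path:
    """Download a zip archive and extract it, unless target_dir already has files.

    Hypothetical stand-in for get_data.download_and_extract_data.
    """
    target = Path(target_dir)
    if target.is_dir() and any(target.iterdir()):
        logger.info("Data found in %s, skipping download.", target)
        return target

    target.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "archive.zip"
        # urlopen also accepts file:// URLs, which is handy for offline checks.
        with urllib.request.urlopen(url) as response, open(archive, "wb") as out:
            out.write(response.read())
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
    logger.info("Data extracted to %s.", target)
    return target
```

Skipping the download when the target directory is non-empty keeps repeated notebook runs fast and mirrors the "SECOM data found" log message above.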


## **1. Exploratory Data Analysis**

### **1.1 Analysis of Sensor Data**

#### **Load data**

In [2]:
import pandas as pd

In [3]:
data_path = "data/secom.data"  # forward slash works on Windows, Linux and macOS
df = pd.read_csv(data_path, sep=" ", header=None)

#### **Overview of data**

In [4]:
df.shape

(1567, 590)

In [5]:
df.head(5)

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 580 | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3030.93 | 2564.0 | 2187.7333 | 1411.1265 | 1.3602 | 100.0 | 97.6133 | 0.1242 | 1.5005 | 0.0162 | ... | NaN | NaN | 0.5005 | 0.0118 | 0.0035 | 2.363 | NaN | NaN | NaN | NaN |
| 1 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 100.0 | 102.3433 | 0.1247 | 1.4966 | -0.0005 | ... | 0.006 | 208.2045 | 0.5019 | 0.0223 | 0.0055 | 4.4447 | 0.0096 | 0.0201 | 0.006 | 208.2045 |
| 2 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 100.0 | 95.4878 | 0.1241 | 1.4436 | 0.0041 | ... | 0.0148 | 82.8602 | 0.4958 | 0.0157 | 0.0039 | 3.1745 | 0.0584 | 0.0484 | 0.0148 | 82.8602 |
| 3 | 2988.72 | 2479.9 | 2199.0333 | 909.7926 | 1.3204 | 100.0 | 104.2367 | 0.1217 | 1.4882 | -0.0124 | ... | 0.0044 | 73.8432 | 0.499 | 0.0103 | 0.0025 | 2.0544 | 0.0202 | 0.0149 | 0.0044 | 73.8432 |
| 4 | 3032.24 | 2502.87 | 2233.3667 | 1326.52 | 1.5334 | 100.0 | 100.3967 | 0.1235 | 1.5031 | -0.0031 | ... | NaN | NaN | 0.48 | 0.4766 | 0.1045 | 99.3032 | 0.0202 | 0.0149 | 0.0044 | 73.8432 |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1567 entries, 0 to 1566
Columns: 590 entries, 0 to 589
dtypes: float64(590)
memory usage: 7.1 MB


All 590 columns are of type float64, so the dataset contains no string-valued categorical variables.

#### **Check for constant features and drop them**

In [7]:
constant_features_mask = df.nunique(dropna=True) <= 1
print(f"{sum(constant_features_mask)} columns contain constant values. They are removed.")
constant_features = df.columns[constant_features_mask]
constant_features.to_numpy()

116 columns contain constant values. They are removed.


array([  5,  13,  42,  49,  52,  69,  97, 141, 149, 178, 179, 186, 189,
       190, 191, 192, 193, 194, 226, 229, 230, 231, 232, 233, 234, 235,
       236, 237, 240, 241, 242, 243, 256, 257, 258, 259, 260, 261, 262,
       263, 264, 265, 266, 276, 284, 313, 314, 315, 322, 325, 326, 327,
       328, 329, 330, 364, 369, 370, 371, 372, 373, 374, 375, 378, 379,
       380, 381, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404,
       414, 422, 449, 450, 451, 458, 461, 462, 463, 464, 465, 466, 481,
       498, 501, 502, 503, 504, 505, 506, 507, 508, 509, 512, 513, 514,
       515, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538])

In [8]:
df = df.loc[:, ~constant_features_mask]

In [9]:
df.shape

(1567, 474)
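
The constant-feature check above relies on `nunique(dropna=True) <= 1`. The following toy example (made-up column names and values, not SECOM data) is a small sketch of what that condition catches: columns with a single repeated value, columns with a single value plus gaps, and columns that are entirely missing.

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative column names; values are made up.
toy = pd.DataFrame({
    "varying": [1.0, 2.0, 3.0, 4.0],
    "constant": [100.0, 100.0, 100.0, 100.0],
    "constant_with_gaps": [7.0, np.nan, 7.0, np.nan],
    "all_missing": [np.nan, np.nan, np.nan, np.nan],
})

# nunique(dropna=True) ignores NaN, so a column with one distinct value plus
# gaps counts as constant, and an all-NaN column has zero distinct values.
n_unique = toy.nunique(dropna=True)
constant_mask = n_unique <= 1

print(n_unique.to_dict())
print(list(toy.columns[constant_mask]))

# Same selection idiom as in the notebook: keep only non-constant columns.
toy_reduced = toy.loc[:, ~constant_mask]
```

Using `<= 1` rather than `== 1` is what lets the same mask remove all-NaN columns as well.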

#### **Analysis and handling of missing values by column**

In [10]:
print("Statistics on numbers of missing values per column:")
df.isna().sum(axis=0).describe()

Statistics on numbers of missing values per column:


count     474.000000
mean       86.784810
std       267.540335
min         0.000000
25%         1.000000
50%         4.000000
75%        10.000000
max      1429.000000
dtype: float64

On average, each column has about 87 missing values, but the distribution is heavily skewed: the median is only 4, while the worst columns are missing 1429 of 1567 records.

In [11]:
top_na_cols = 50
print(f"{top_na_cols} columns with most missing values:")
top_na_cols_df = df.isna().sum().sort_values(ascending=False)[:top_na_cols].to_frame().reset_index()
top_na_cols_df.columns = ["Col number", "Num NAs"]
top_na_cols_df

50 columns with most missing values:


|   | Col number | Num NAs |
|---|---|---|
| 0 | 157 | 1429 |
| 1 | 292 | 1429 |
| 2 | 293 | 1429 |
| 3 | 158 | 1429 |
| 4 | 85 | 1341 |
| 5 | 358 | 1341 |
| 6 | 220 | 1341 |
| 7 | 492 | 1341 |
| 8 | 516 | 1018 |
| 9 | 110 | 1018 |


In [12]:
na_filter_threshold = 0.2
na_filter_mask = df.isna().sum()/len(df) > na_filter_threshold
print(f"{sum(na_filter_mask)} columns have more than {na_filter_threshold:.0%} missing values. These columns are dropped.")
df = df.loc[:, ~na_filter_mask]

32 columns have more than 20% missing values. These columns are dropped.


In [13]:
df.shape

(1567, 442)
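
The column filter above can also be written with `isna().mean()`, which gives the missing fraction per column directly and is equivalent to `isna().sum() / len(df)`. A minimal sketch on a toy frame (illustrative names and values, not SECOM data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],        # 0% missing
    "b": [1.0, np.nan, 3.0, 4.0, 5.0],     # 20% missing
    "c": [np.nan, np.nan, 3.0, 4.0, 5.0],  # 40% missing
})

threshold = 0.2

# Fraction of missing values per column.
na_fraction = toy.isna().mean()

# The notebook drops columns strictly above the threshold, so a column
# sitting exactly at the threshold ("b") is kept.
kept = toy.loc[:, na_fraction <= threshold]
print(na_fraction.to_dict())
print(list(kept.columns))
```

The threshold of 0.2 is the one chosen in the notebook; it is a judgment call rather than a fixed rule.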

#### **Analysis and handling of missing values by row**

In [14]:
print("Statistics on numbers of missing values per row:")
df.isna().sum(axis=1).describe()

Statistics on numbers of missing values per row:


count    1567.000000
mean        5.110402
std         9.455546
min         0.000000
25%         0.000000
50%         0.000000
75%         8.000000
max        99.000000
dtype: float64

In [15]:
top_na_rows = 20
print(f"{top_na_rows} rows with most missing values:")
top_na_rows_df = df.isna().sum(axis=1).sort_values(ascending=False)[:top_na_rows].to_frame().reset_index()
top_na_rows_df.columns = ["Row number", "Num NAs"]
top_na_rows_df

20 rows with most missing values:


|   | Row number | Num NAs |
|---|---|---|
| 0 | 1566 | 99 |
| 1 | 1564 | 99 |
| 2 | 1561 | 87 |
| 3 | 995 | 74 |
| 4 | 846 | 66 |
| 5 | 814 | 66 |
| 6 | 512 | 62 |
| 7 | 511 | 60 |
| 8 | 1206 | 54 |
| 9 | 1234 | 54 |


In [16]:
max_na_count = (df.isna().sum(axis=1)).max()
max_na_fraction = max_na_count/df.shape[1]
print(f"The rows with the most missing values have at most {max_na_fraction:.0%} missing values.\nNo rows will be removed due to missing values.")

The rows with the most missing values have at most 22% missing values.
No rows will be removed due to missing values.


### **1.2 Analysis of Labels**

In [17]:
labels_path = "data/secom_labels.data"  # forward slash works on Windows, Linux and macOS
labels_df = pd.read_csv(labels_path, sep=" ", header=None)

In [18]:
labels_df.shape

(1567, 2)

In [19]:
labels_df.head(5)

|   | 0 | 1 |
|---|---|---|
| 0 | -1 | 19/07/2008 11:55:00 |
| 1 | -1 | 19/07/2008 12:32:00 |
| 2 | 1 | 19/07/2008 13:17:00 |
| 3 | -1 | 19/07/2008 14:43:00 |
| 4 | -1 | 19/07/2008 15:22:00 |


In [20]:
labels_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1567 entries, 0 to 1566
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       1567 non-null   int64 
 1   1       1567 non-null   object
dtypes: int64(1), object(1)
memory usage: 24.6+ KB


The labels contain no missing values.

In [21]:
labels_df[0].value_counts()

0
-1    1463
 1     104
Name: count, dtype: int64

In [22]:
labels_df[0].value_counts(normalize=True)

0
-1    0.933631
 1    0.066369
Name: proportion, dtype: float64

As mentioned in the dataset description, there are 104 failed tests (6.6% of all tests), while the remaining 1463 tests passed. In the label column, -1 denotes a pass and 1 denotes a fail.
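
With this degree of class imbalance, plain accuracy can look good even for a model that never flags a failure. A quick back-of-the-envelope sketch, using only the class counts printed above, shows the accuracy of a trivial "always pass" predictor, which is a useful reference point for any model trained later:

```python
# Class counts taken from the value_counts() output above.
n_pass, n_fail = 1463, 104
n_total = n_pass + n_fail

# A classifier that always predicts "pass" (-1) is right on every passing test...
baseline_accuracy = n_pass / n_total
# ...but it never detects a failure, so its recall on the fail class is 0.
baseline_fail_recall = 0 / n_fail

print(f"Majority-class accuracy: {baseline_accuracy:.3f}")
print(f"Majority-class recall on fails: {baseline_fail_recall:.3f}")
```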

## **2. Feature Selection**

### **2.1 Correlation Analysis**

Pairwise correlations between the features are computed so that features that are highly correlated with others can be dropped. This reduces the number of features and, with it, the time needed to test models in the next step.

In [27]:
import numpy as np

In [28]:
corr = df.corr().abs()
corr.shape

(442, 442)

In [29]:
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

correlated_pairs = (
    upper.stack()
    .reset_index()
    .rename(columns={
        "level_0": "feature_1",
        "level_1": "feature_2",
        0: "feature_corr"
    })
)

In [30]:
# Threshold for deciding which features are considered highly correlated:
corr_threshold = 0.9

In [33]:
correlated_pairs = correlated_pairs[
    correlated_pairs["feature_corr"] >= corr_threshold
].copy()  # copy so that columns can be added later without a SettingWithCopyWarning
n_correlated_pairs = len(correlated_pairs)
print(f"{n_correlated_pairs} pairs of features are highly correlated (>= {corr_threshold:.2f}).")
correlated_pairs

349 pairs of features are highly correlated (>= 0.90).


|   | feature_1 | feature_2 | feature_corr | target_corr_f1 | target_corr_f2 | drop |
|---|---|---|---|---|---|---|
| 1759 | 4 | 7 | 0.916410 | 0.013760 | 0.012993 | 7 |
| 1879 | 4 | 140 | 0.999975 | 0.013760 | 0.013618 | 140 |
| 1971 | 4 | 275 | 0.999976 | 0.013760 | 0.013621 | 275 |
| 2065 | 4 | 413 | 0.938416 | 0.013760 | 0.006381 | 413 |
| 2750 | 7 | 140 | 0.916715 | 0.012993 | 0.013618 | 7 |
| ... | ... | ... | ... | ... | ... | ... |
| 97407 | 575 | 577 | 0.928311 | 0.052731 | 0.049633 | 577 |
| 97440 | 583 | 584 | 0.994771 | 0.005981 | 0.005419 | 584 |
| 97441 | 583 | 585 | 0.999890 | 0.005981 | 0.005034 | 585 |
| 97446 | 584 | 585 | 0.995342 | 0.005419 | 0.005034 | 585 |


For each pair of highly correlated features, the feature with the lower absolute correlation with the target variable is identified and dropped.

In [34]:
target_corr = df.apply(lambda col: col.corr(labels_df[0])).abs()

correlated_pairs["target_corr_f1"] = correlated_pairs["feature_1"].map(target_corr)
correlated_pairs["target_corr_f2"] = correlated_pairs["feature_2"].map(target_corr)

correlated_pairs["feature to drop"] = np.where(
    correlated_pairs["target_corr_f1"] < correlated_pairs["target_corr_f2"],
    correlated_pairs["feature_1"],
    correlated_pairs["feature_2"]
)
correlated_pairs

|   | feature_1 | feature_2 | feature_corr | target_corr_f1 | target_corr_f2 | drop | feature to drop |
|---|---|---|---|---|---|---|---|
| 1759 | 4 | 7 | 0.916410 | 0.013760 | 0.012993 | 7 | 7 |
| 1879 | 4 | 140 | 0.999975 | 0.013760 | 0.013618 | 140 | 140 |
| 1971 | 4 | 275 | 0.999976 | 0.013760 | 0.013621 | 275 | 275 |
| 2065 | 4 | 413 | 0.938416 | 0.013760 | 0.006381 | 413 | 413 |
| 2750 | 7 | 140 | 0.916715 | 0.012993 | 0.013618 | 7 | 7 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 97407 | 575 | 577 | 0.928311 | 0.052731 | 0.049633 | 577 | 577 |
| 97440 | 583 | 584 | 0.994771 | 0.005981 | 0.005419 | 584 | 584 |
| 97441 | 583 | 585 | 0.999890 | 0.005981 | 0.005034 | 585 | 585 |
| 97446 | 584 | 585 | 0.995342 | 0.005419 | 0.005034 | 585 | 585 |


In [35]:
features_to_drop = correlated_pairs["feature to drop"].unique()
print(f"Dropping {len(features_to_drop)} redundant features.")
df = df.drop(columns=features_to_drop)

Dropping 198 redundant features.


In [36]:
df.shape

(1567, 244)
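
The pairwise rule used in this section can be condensed into one helper and checked on a tiny hand-made example. This is a sketch only: the function name `drop_correlated` and the toy columns `f0`, `f1`, `f2` are illustrative assumptions, and the logic simply mirrors the cells above (upper triangle of the absolute correlation matrix, threshold filter, then drop the member of each pair that correlates less with the target).

```python
import numpy as np
import pandas as pd


def drop_correlated(X: pd.DataFrame, y: pd.Series, threshold: float = 0.9) -> list:
    """Return features to drop: for each highly correlated pair, the one
    with the lower absolute correlation with the target."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().reset_index()
    pairs.columns = ["feature_1", "feature_2", "feature_corr"]
    pairs = pairs[pairs["feature_corr"] >= threshold].copy()

    target_corr = X.corrwith(y).abs()
    f1_is_weaker = pairs["feature_1"].map(target_corr) < pairs["feature_2"].map(target_corr)
    to_drop = np.where(f1_is_weaker, pairs["feature_1"], pairs["feature_2"])
    return sorted(set(to_drop))


# f1 is f0 with its last two values swapped, so the two are highly correlated.
X_toy = pd.DataFrame({
    "f0": [1, 2, 3, 4, 5, 6, 7, 8],
    "f1": [1, 2, 3, 4, 5, 6, 8, 7],
    "f2": [2, 7, 1, 8, 3, 6, 4, 5],
})
# The single positive label sits where f1 is largest, so f1 tracks the target better.
y_toy = pd.Series([-1, -1, -1, -1, -1, -1, 1, -1])

dropped = drop_correlated(X_toy, y_toy)
X_reduced = X_toy.drop(columns=dropped)
print(dropped, list(X_reduced.columns))
```

One property worth keeping in mind: because every pair is judged independently, a feature that is kept in one pair can still be dropped through another pair, so the rule may remove more features than strictly necessary to break all high correlations.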

### **2.2 Impute Missing Values**

In [44]:
from sklearn.impute import SimpleImputer

X = df.to_numpy()
imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(X)
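
In the cell above, the imputer is fitted on all rows, so the medians also reflect records that later end up in the validation and test sets. A common alternative is to split first and learn the medians from the training rows only. The following is a minimal sketch on synthetic data (the `*_demo` arrays are made up for illustration and are not part of this notebook):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # knock out roughly 10% of the values
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

imputer_demo = SimpleImputer(strategy="median")
X_tr_imp = imputer_demo.fit_transform(X_tr)  # medians learned from training rows only
X_te_imp = imputer_demo.transform(X_te)      # the same medians are reused for held-out rows
```

Whether this matters much for median imputation on this dataset is an open question; the sketch only shows how the fit and transform steps would be separated.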

### **2.3 Mutual Information**

In [37]:
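# Disabled for now: univariate feature selection is not applied in this run.
# from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif

# X is the imputed array from the previous section (SelectKBest does not accept NaN).
# y = labels_df[0].to_numpy()

# Use score_func=mutual_info_classif to rank features by mutual information instead.
# selector = SelectKBest(score_func=f_classif, k=10)
# X_final = selector.fit_transform(X, y)

# X is a NumPy array without column labels, so the names come from df.
# final_features = df.columns[selector.get_support()]
# final_features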
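
The commented-out cell above is not executed in this notebook. For reference, here is a small runnable sketch of how `SelectKBest` could be combined with `mutual_info_classif`. It uses synthetic data in which only the first two columns carry information about the label; the array names and `k=2` are assumptions chosen for the illustration.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
n = 600
y_demo = rng.integers(0, 2, size=n)

# Two informative columns (class-dependent mean) followed by eight noise columns.
informative = y_demo[:, None] * 2.0 + rng.normal(scale=0.5, size=(n, 2))
noise = rng.normal(size=(n, 8))
X_demo = np.hstack([informative, noise])

# mutual_info_classif is randomized, so fix its seed for reproducible scores.
score_func = partial(mutual_info_classif, random_state=0)

selector_demo = SelectKBest(score_func=score_func, k=2)
X_selected = selector_demo.fit_transform(X_demo, y_demo)
selected_idx = np.flatnonzero(selector_demo.get_support())
print(selected_idx)
```

Unlike the correlation-based `f_classif`, mutual information can also pick up non-linear dependence between a feature and the label, at the cost of a noisier estimate.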

## **3. Model Training and Parameter Tuning**

### **3.1 Split Dataset**

In [38]:
from sklearn.model_selection import train_test_split

In [47]:
test_size = 0.2

y = labels_df[0].to_numpy()
X_full_train, X_test, y_full_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_full_train, y_full_train, test_size=test_size/(1-test_size), random_state=42)

In [48]:
len(X_train) + len(X_val) + len(X_test) == len(X)

True

In [49]:
len(y_train) + len(y_val) + len(y_test) == len(y)

True
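
The split above is purely random. Since only about 6.6% of the records are fails, the share of fails can drift between the three subsets. One option is to pass `stratify` to `train_test_split`, which keeps the class proportions roughly equal in each part. A sketch using synthetic labels with the same class counts as the SECOM labels (the `*_demo` arrays are placeholders, not the real features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as reported in section 1.2.
y_demo = np.array([-1] * 1463 + [1] * 104)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

test_size = 0.2
X_full, X_te, y_full, y_te = train_test_split(
    X_demo, y_demo, test_size=test_size, stratify=y_demo, random_state=42
)
# test_size / (1 - test_size) = 0.25 of the remainder, i.e. a 60/20/20 split overall.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_full, y_full, test_size=test_size / (1 - test_size),
    stratify=y_full, random_state=42
)

for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(name, len(part), f"{(part == 1).mean():.3f}")
```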

### **3.2 Train LogisticRegression Model**

In [50]:
from sklearn.linear_model import LogisticRegression

In [54]:
model = LogisticRegression(solver="liblinear", max_iter=1000)
model.fit(X_train, y_train)

LogisticRegression(max_iter=1000, solver='liblinear')


In [55]:
y_pred = model.predict(X_val)
(y_pred == y_val).mean()

np.float64(0.8789808917197452)
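
A validation accuracy of about 0.879 is lower than the roughly 0.934 that always predicting the majority class would reach on the full label set (1463 of 1567), although the exact share in the validation split may differ slightly. Accuracy alone is therefore not very informative here. The sketch below shows how recall, F1 and a confusion matrix for the minority class could be computed. It runs on synthetic imbalanced data, and the use of `StandardScaler` and `class_weight="balanced"` is an assumption for illustration, not something this notebook has evaluated.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with roughly 7% positives (label 1 = minority class).
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=20, n_informative=5,
    weights=[0.93, 0.07], random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Scaling and class weighting are bundled in a pipeline so that the scaler
# is fitted on the training rows only.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", class_weight="balanced", max_iter=1000),
)
clf.fit(X_tr, y_tr)
y_pred_demo = clf.predict(X_va)

accuracy = (y_pred_demo == y_va).mean()
minority_recall = recall_score(y_va, y_pred_demo, pos_label=1)
minority_f1 = f1_score(y_va, y_pred_demo, pos_label=1)
cm = confusion_matrix(y_va, y_pred_demo)

print(f"accuracy={accuracy:.3f} recall={minority_recall:.3f} f1={minority_f1:.3f}")
print(cm)
```

On the real data, the same metrics could be computed with `pos_label=1`, since 1 marks a failed test in the SECOM labels.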

In [None]:
# TODO: optional: view module 4 videos again
# TODO: Look into F1 score
# TODO: consider whether to rewatch module 6 videos before selecting models to test
# TODO: consider applying scaling to features (check if necessary)