# Data Loading and Preprocessing

This notebook loads and preprocesses the data for the ANN-PSO implementation. We'll:
1. Load a dataset from the UCI ML Repository
2. Preprocess the data (check for missing values, scale the features)
3. Split the data into training and testing sets
4. Save the processed data for the ANN-PSO implementation

In [1]:
# Libraries for loading and handling the data
import pandas as pd
import numpy as np
from ucimlrepo import fetch_ucirepo

In [2]:
# Load the Concrete Compressive Strength dataset from UCI ML Repository
concrete = fetch_ucirepo(id=165) 

# Get features and target
X = concrete.data.features  # Features
y = concrete.data.targets   # Target variable

print("Dataset Information:")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print("\nFeature names:", list(X.columns))
print("\nMetadata:")
print(concrete.metadata)

Dataset Information:
Number of samples: 1030
Number of features: 8

Feature names: ['Cement', 'Blast Furnace Slag', 'Fly Ash', 'Water', 'Superplasticizer', 'Coarse Aggregate', 'Fine Aggregate', 'Age']

Metadata:
{'uci_id': 165, 'name': 'Concrete Compressive Strength', 'repository_url': 'https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength', 'data_url': 'https://archive.ics.uci.edu/static/public/165/data.csv', 'abstract': 'Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. ', 'area': 'Physics and Chemistry', 'tasks': ['Regression'], 'characteristics': ['Multivariate'], 'num_instances': 1030, 'num_features': 8, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['Concrete compressive strength'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1998, 'last_updated': 'Sun Feb 11 2024', 'dataset_doi': '10.2443
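
The metadata above reports `'has_missing_values': 'no'`. Since the preprocessing plan includes missing values, a quick check on the loaded frame can confirm that directly. The sketch below is illustrative only: `missing_value_report` is a hypothetical helper, and the toy frame stands in for `X`.

```python
import numpy as np
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing entries per column, for columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]


# Toy frame standing in for X (hypothetical values, one gap added on purpose)
toy = pd.DataFrame({
    "Cement": [540.0, np.nan, 332.5],
    "Water": [162.0, 162.0, 228.0],
})

report = missing_value_report(toy)
print(report)
```

In the notebook, `missing_value_report(X)` should come back empty if the metadata is accurate.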

In [6]:
# Display first few rows and basic statistics
print("First few rows of the dataset:")
print(X.head())
print("\nBasic statistics of features:")
print(X.describe())
print("\nTarget variable statistics:")
print(y.describe())

First few rows of the dataset:
   Cement  Blast Furnace Slag  Fly Ash  Water  Superplasticizer  \
0   540.0                 0.0      0.0  162.0               2.5   
1   540.0                 0.0      0.0  162.0               2.5   
2   332.5               142.5      0.0  228.0               0.0   
3   332.5               142.5      0.0  228.0               0.0   
4   198.6               132.4      0.0  192.0               0.0   

   Coarse Aggregate  Fine Aggregate  Age  
0            1040.0           676.0   28  
1            1055.0           676.0   28  
2             932.0           594.0  270  
3             932.0           594.0  365  
4             978.4           825.5  360  

Basic statistics of features:
            Cement  Blast Furnace Slag      Fly Ash        Water  \
count  1030.000000         1030.000000  1030.000000  1030.000000   
mean    281.167864           73.895825    54.188350   181.567282   
std     104.506364           86.279342    63.997004    21.354219   
min  

In [7]:
# Import preprocessing tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import joblib  # for saving the fitted scaler

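# Split first so the scaler never sees the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training features only, then apply it to both sets
scaler = StandardScaler()
X_train = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)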

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

Training set shape: (824, 8)
Testing set shape: (206, 8)
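
Scaling parameters are usually estimated from the training split alone, so that no test-set statistics leak into preprocessing. The NumPy sketch below illustrates the idea on synthetic data. It mirrors what `StandardScaler` computes (per-feature mean and population standard deviation), but `fit_standardizer` and `apply_standardizer` are hypothetical names, not part of scikit-learn.

```python
import numpy as np


def fit_standardizer(X_train):
    """Estimate per-feature mean and standard deviation from training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)   # population std (ddof=0), as StandardScaler uses
    std[std == 0.0] = 1.0       # leave constant features unscaled
    return mean, std


def apply_standardizer(X, mean, std):
    """Scale any split with statistics that were fitted on the training split."""
    return (X - mean) / std


# Synthetic stand-in for the feature matrix (assumed shape and values)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(50, 3))
train_demo, test_demo = X_demo[:40], X_demo[40:]

mean, std = fit_standardizer(train_demo)
train_scaled = apply_standardizer(train_demo, mean, std)
test_scaled = apply_standardizer(test_demo, mean, std)

print(train_scaled.mean(axis=0))  # approximately zero for every feature
print(test_scaled.mean(axis=0))   # generally not zero: test data reuses train statistics
```

The test split will not have exactly zero mean or unit variance after scaling, which is expected.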


In [8]:
# Create data directory if it doesn't exist
import os
os.makedirs('../data', exist_ok=True)

# Save the preprocessed data and scaler
data_path = '../data/'
joblib.dump(scaler, f'{data_path}scaler.joblib')
np.savez(f'{data_path}processed_data.npz', 
         X_train=X_train, X_test=X_test,
         y_train=y_train, y_test=y_test)

print("Data and scaler have been saved successfully!")

Data and scaler have been saved successfully!


# Saved Data Files Explanation

The preprocessed data is saved in two files:

1. `scaler.joblib`:
   - Contains the fitted StandardScaler object
   - Stores the mean and standard deviation of each feature
   - Required for consistent scaling of new data
   - Uses the joblib format, which serializes scikit-learn objects efficiently

2. `processed_data.npz`:
   - NumPy `.npz` archive containing:
     - `X_train`: Scaled features for training (80% of data)
     - `X_test`: Scaled features for testing (20% of data)
     - `y_train`: Target values for training
     - `y_test`: Target values for testing
   - Uses a binary format that loads quickly
   - Preserves the arrays' dtype and values exactly
   - Written with `np.savez`, which does not compress; `np.savez_compressed` is the compressed alternative
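
To make the `.npz` format concrete, here is a small round trip on synthetic arrays, written to a temporary directory. `np.savez` stores arrays uncompressed, while `np.savez_compressed` is the compressed variant, and both are read back with `np.load`. The array names and values are assumptions for illustration, not the notebook's data.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-ins for the saved arrays
X_demo = np.arange(12, dtype=np.float64).reshape(4, 3)
y_demo = np.array([[1.0], [2.0], [3.0], [4.0]])  # column vector, like a one-column DataFrame

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "plain.npz")
    packed_path = os.path.join(tmp_dir, "packed.npz")

    np.savez(plain_path, X=X_demo, y=y_demo)              # uncompressed archive
    np.savez_compressed(packed_path, X=X_demo, y=y_demo)  # compressed archive

    # Both variants load the same way; the context manager closes the file
    with np.load(packed_path) as archive:
        keys = sorted(archive.files)
        X_back = archive["X"]
        y_back = archive["y"]

print(keys)
print(np.array_equal(X_back, X_demo))
print(y_back.shape, y_back.ravel().shape)
```

A target saved from a one-column DataFrame comes back as a 2-D array of shape `(n, 1)`; `ravel()` flattens it if a 1-D target is needed.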

In [None]:
# Example: Loading the saved data
# This shows how to load the data in your ANN-PSO implementation
import joblib
import numpy as np

# Load the scaler
loaded_scaler = joblib.load('../data/scaler.joblib')

# Load the processed data (the context manager closes the archive afterwards)
with np.load('../data/processed_data.npz') as data:
    X_train_loaded = data['X_train']
    X_test_loaded = data['X_test']
    y_train_loaded = data['y_train']
    y_test_loaded = data['y_test']

print("Loaded data shapes:")
print(f"X_train shape: {X_train_loaded.shape}")
print(f"X_test shape: {X_test_loaded.shape}")
print(f"y_train shape: {y_train_loaded.shape}")
print(f"y_test shape: {y_test_loaded.shape}")

# Example: How to scale new data using the saved scaler
# new_data = loaded_scaler.transform(new_data)