In [198]:
#Import necessary libraries.
import os
import numpy as np
import pandas as pd


In [199]:
# Data loading
dir_path = os.path.join(os.path.dirname(os.getcwd()), 'data', 'raw')
file_path = os.path.join(dir_path, 'Concrete_Data.xlsx')
df = pd.read_excel(file_path)
df.head(5)

Unnamed: 0,Cement (component 1)(kg in a m^3 mixture),Blast Furnace Slag (component 2)(kg in a m^3 mixture),Fly Ash (component 3)(kg in a m^3 mixture),Water (component 4)(kg in a m^3 mixture),Superplasticizer (component 5)(kg in a m^3 mixture),Coarse Aggregate (component 6)(kg in a m^3 mixture),Fine Aggregate (component 7)(kg in a m^3 mixture),Age (day),"Concrete compressive strength(MPa, megapascals)"
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.986111
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.887366
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.269535
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05278
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.296075


In [200]:
df[df.duplicated()]

Unnamed: 0,Cement (component 1)(kg in a m^3 mixture),Blast Furnace Slag (component 2)(kg in a m^3 mixture),Fly Ash (component 3)(kg in a m^3 mixture),Water (component 4)(kg in a m^3 mixture),Superplasticizer (component 5)(kg in a m^3 mixture),Coarse Aggregate (component 6)(kg in a m^3 mixture),Fine Aggregate (component 7)(kg in a m^3 mixture),Age (day),"Concrete compressive strength(MPa, megapascals)"
77,425.0,106.3,0.0,153.5,16.5,852.1,887.1,3,33.398217
80,425.0,106.3,0.0,153.5,16.5,852.1,887.1,3,33.398217
86,362.6,189.0,0.0,164.9,11.6,944.7,755.8,3,35.301171
88,362.6,189.0,0.0,164.9,11.6,944.7,755.8,3,35.301171
91,362.6,189.0,0.0,164.9,11.6,944.7,755.8,3,35.301171
100,425.0,106.3,0.0,153.5,16.5,852.1,887.1,7,49.201007
103,425.0,106.3,0.0,153.5,16.5,852.1,887.1,7,49.201007
109,362.6,189.0,0.0,164.9,11.6,944.7,755.8,7,55.895819
111,362.6,189.0,0.0,164.9,11.6,944.7,755.8,7,55.895819
123,425.0,106.3,0.0,153.5,16.5,852.1,887.1,28,60.294676


In [201]:
df = df.drop_duplicates()
df.shape

(1005, 9)

In [202]:
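# Ratio features
# str.strip() returns a new Index, so assign it back to actually clean the column names
df.columns = df.columns.str.strip()
# 1. Water to Cement Ratio
df["water_cement_ratio"] = df["Water  (component 4)(kg in a m^3 mixture)"] / df["Cement (component 1)(kg in a m^3 mixture)"]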

# 2. Water to Binder ratio
total_binder_content = df["Cement (component 1)(kg in a m^3 mixture)"] + df["Blast Furnace Slag (component 2)(kg in a m^3 mixture)"] + df["Fly Ash (component 3)(kg in a m^3 mixture)"]
df["water_binder_ratio"] = df["Water  (component 4)(kg in a m^3 mixture)"] / total_binder_content

# 3. Aggregate to binder ratio
total_aggregate_quantity = df["Coarse Aggregate  (component 6)(kg in a m^3 mixture)"] + df["Fine Aggregate (component 7)(kg in a m^3 mixture)"]
df["agg_binder_ratio"] = total_aggregate_quantity / total_binder_content

df[["water_cement_ratio", "water_binder_ratio", "agg_binder_ratio"]].describe()

Unnamed: 0,water_cement_ratio,water_binder_ratio,agg_binder_ratio
count,1005.0,1005.0,1005.0
mean,0.756223,0.473139,4.564681
std,0.313524,0.1255,1.233582
min,0.266893,0.235073,2.376562
25%,0.547465,0.389522,3.476
50%,0.689531,0.48,4.507895
75%,0.93753,0.561264,5.409803
max,1.882353,0.9,9.85


# Group 1 — Ratio Features

This section explains the engineered ratio features and why they are critical for predicting concrete compressive strength.

## Core Philosophy
- Raw ingredient amounts alone don’t capture concrete behavior.
- Concrete strength depends on **relationships between ingredients**, not absolute values.
- Group 1 features focus on these **ratios**.

## Water-to-Cement (w/c) Ratio
- Combines **Water** and **Cement** into a single predictor.
- More water relative to cement → more voids → weaker concrete.
- **Abrams’ Law (1919)** confirms this principle.
- **EDA correlation with strength:** -0.489 → about the same magnitude as raw Cement (+0.488) and clearly stronger than raw Water (-0.270).
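
For reference, Abrams' law is commonly written in the form below. Here $f_c$ is compressive strength, $w/c$ is the water-to-cement ratio, and $A$ and $B$ are empirical constants that depend on the materials, age, and curing conditions. These symbols are the conventional ones and are not names used elsewhere in this notebook.

$$
f_c = \frac{A}{B^{\,w/c}}
$$

With $B > 1$, a larger $w/c$ gives a lower $f_c$, which matches the negative sign of the correlation reported above.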

## Water-to-Binder (w/b) Ratio
- Extends w/c ratio by including all cementitious materials: **Cement + Slag + Fly Ash**.
- Captures a more accurate view of **paste quality** in modern mixes.
- Recognized in design codes like **Eurocode 2** and **ACI 318**.
- For mixes with high Slag or Fly Ash, w/b will be lower than w/c, better reflecting strength potential.

## Aggregate-to-Binder Ratio
- Combines **Total Aggregate (Coarse + Fine)** divided by **Total Binder**.
- Absolute aggregate amounts have weak correlations with strength (EDA showed -0.14 and -0.19).
- The ratio shows a much stronger negative relationship with strength (-0.555).
- Physically meaningful: captures the **paste volume concept** — more binder per aggregate → stronger concrete.
- Example: 900 kg/m³ aggregate with 500 kg/m³ binder is stronger than same aggregate with 250 kg/m³ binder.
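
As a quick worked example, the three ratios can be computed by hand for one mix. The sketch below uses the quantities from row 2 of the `df.head()` output; the short variable names are illustrative and are not columns in the notebook.

```python
# Mix proportions in kg/m^3, copied from row 2 of the df.head() output.
cement, slag, fly_ash = 332.5, 142.5, 0.0
water = 228.0
coarse_agg, fine_agg = 932.0, 594.0

binder = cement + slag + fly_ash        # all cementitious materials
aggregate = coarse_agg + fine_agg       # coarse + fine aggregate

water_cement_ratio = water / cement
water_binder_ratio = water / binder
agg_binder_ratio = aggregate / binder

print(f"w/c        = {water_cement_ratio:.3f}")
print(f"w/b        = {water_binder_ratio:.3f}")
print(f"agg/binder = {agg_binder_ratio:.3f}")
```

Because this mix contains slag, w/b comes out lower than w/c, which is the effect described in the Water-to-Binder section.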


In [203]:
# Summation features
# 1. Total binder content
df["total_binder_content"] = df["Cement (component 1)(kg in a m^3 mixture)"] + df["Blast Furnace Slag (component 2)(kg in a m^3 mixture)"] + df["Fly Ash (component 3)(kg in a m^3 mixture)"]
# 2. Total aggregate content
df["total_aggregate_content"] = df["Coarse Aggregate  (component 6)(kg in a m^3 mixture)"] + df["Fine Aggregate (component 7)(kg in a m^3 mixture)"]
# 3. Total mix content (note: Superplasticizer is not part of this sum)
df["total_mix"] = (
    df["Cement (component 1)(kg in a m^3 mixture)"]
    + df["Blast Furnace Slag (component 2)(kg in a m^3 mixture)"]
    + df["Fly Ash (component 3)(kg in a m^3 mixture)"]
    + df["Coarse Aggregate  (component 6)(kg in a m^3 mixture)"]
    + df["Fine Aggregate (component 7)(kg in a m^3 mixture)"]
    + df["Water  (component 4)(kg in a m^3 mixture)"]
)
df[["total_binder_content", "total_aggregate_content", "total_mix"]].describe()

Unnamed: 0,total_binder_content,total_aggregate_content,total_mix
count,1005.0,1005.0,1005.0
mean,406.207264,1747.063085,2335.344726
std,91.423606,102.230279,62.622809
min,200.0,1457.0,2183.6
25%,336.28,1679.0,2285.0
50%,388.48,1758.3,2343.63
75%,480.0,1829.0,2378.9
max,640.0,1970.0,2551.0


# Group 2 — Summation Features

This section explains the summation-based engineered features and the reasoning behind them.

## Core Philosophy
- Concrete materials behave as a **system**, not as isolated ingredients.
- Some materials work together toward a shared structural role.
- Summing related components captures their **collective behavior**.

## Total Binder (Cement + Slag + Fly Ash)
- Combines all cementitious materials into one feature.
- Encodes their **combined binding power**.
- Correlates more strongly with strength (+0.598) than any individual raw ingredient; raw Cement is the closest at +0.488.
- Reflects the total hydration potential available in the mix.
- Represents the full paste-forming capacity of the concrete.

## Total Aggregate (Coarse + Fine)
- Combines coarse and fine aggregates into a single structural mass.
- Aggregates work together to form the **rigid skeletal framework**.
- Captures the total solid framework volume rather than treating sand and gravel separately.
- Aligns with structural engineering principles of aggregate packing.

## Total Mix (All Ingredients Combined)
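- Sum of cement, slag, fly ash, water, and both aggregates; Superplasticizer is left out of the sum in the code above.
- Acts as a **density sanity check**.
- Normal-weight concrete typically falls between **2000–2500 kg/m³**.
- Mixes far outside this range may indicate data quality issues; here `total_mix` spans 2183.6–2551.0 kg/m³, so only the upper tail slightly exceeds it.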
- Ensures dataset realism before modeling.
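
The density check can be sketched as a simple range filter. The example below is a minimal illustration, not the notebook's own validation step: the first four values are the min, median, 75th percentile, and max from the `describe()` output, the last value is made up to show a low outlier, and `LOW`/`HIGH` are hypothetical names for the range quoted in the text.

```python
import pandas as pd

# Illustrative total_mix values in kg/m^3 (last one is invented for the demo).
total_mix = pd.Series([2183.6, 2343.63, 2378.9, 2551.0, 1850.0], name="total_mix")

LOW, HIGH = 2000.0, 2500.0  # range quoted in the text, treated as a soft guideline

in_range = total_mix.between(LOW, HIGH)  # inclusive on both ends
flagged = total_mix[~in_range]           # mixes worth a manual look

print(flagged)
```

Flagged rows are candidates for review rather than automatic removal, since the bounds are a rule of thumb.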

## Key Insight
Concrete performance emerges from **group-level material interactions**.
Summation features reflect this system-level behavior more effectively than analyzing each component independently.

In [204]:
# SCM (Supplementary Cementitious Materials) Ratio Features
# 1. Cement Dominance Ratio
df["cement_ratio"] = df["Cement (component 1)(kg in a m^3 mixture)"] / df["total_binder_content"]
# 2.  Slag Replacement Ratio
df["slag_ratio"] = df["Blast Furnace Slag (component 2)(kg in a m^3 mixture)"] / df["total_binder_content"]
# 3. Fly Ash Replacement Ratio
df["flyash_ratio"] = df["Fly Ash (component 3)(kg in a m^3 mixture)"] / df["total_binder_content"]

# Description of the new data
df[["cement_ratio","slag_ratio","flyash_ratio" ]].describe()

Unnamed: 0,cement_ratio,slag_ratio,flyash_ratio
count,1005.0,1005.0,1005.0
mean,0.689336,0.167852,0.142812
std,0.211444,0.201849,0.169989
min,0.264,0.0,0.0
25%,0.53469,0.0,0.0
50%,0.679224,0.03992,0.0
75%,0.812386,0.328691,0.295547
max,1.0,0.612987,0.552469


# Group 3 — Binder Composition Ratios

This group focuses on the **internal composition of the binder**, answering the question: *what is the binder made of?*

## Core Philosophy
- Not all binder behaves the same: cement, slag, and fly ash contribute differently to strength development.
- Ratios express **relative contribution** rather than absolute quantity.
- This tells the model the **character of the binder**, not just its size.

## Key Ratios

### Cement Ratio
- Fraction of total binder that is cement.
- High cement_ratio → fast early strength gain, higher cost.
- Dominates early-age performance.

### Slag Ratio
- Fraction of total binder that is slag.
- Moderate early strength gain, lower cost, long-term strength improves.
- Slower reacting than cement, contributes to maturity strength.

### Fly Ash Ratio
- Fraction of total binder that is fly ash.
- Slowest reacting; contributes primarily in long-term strength.
- Needs time and proper curing conditions to show impact.

## Practical Insights
- Two mixes can have identical total binder but behave differently at early ages:
  - 100% cement → fast strength gain
  - 50% cement + 50% slag → slower early gain, higher long-term strength
- The three ratios always sum to **1.0**, serving as a **built-in integrity check** for data consistency.
- Captures binder composition behavior for more accurate modeling of both early and long-term strength.
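
The sum-to-one integrity check can be written in a few lines. This is a sketch on three mixes taken from the `df.head()` output, with shortened, illustrative column names; in the notebook the same idea would apply to `cement_ratio`, `slag_ratio`, and `flyash_ratio`.

```python
import numpy as np
import pandas as pd

# Binder components in kg/m^3 for rows 0, 2 and 4 of the df.head() output.
mixes = pd.DataFrame({
    "cement": [540.0, 332.5, 198.6],
    "slag": [0.0, 142.5, 132.4],
    "fly_ash": [0.0, 0.0, 0.0],
})

binder = mixes.sum(axis=1)
ratios = mixes.div(binder, axis=0)   # each component as a fraction of total binder
ratio_sum = ratios.sum(axis=1)

# Floating-point division rarely gives exactly 1.0, so compare with a tolerance.
assert np.allclose(ratio_sum, 1.0), "binder fractions do not sum to 1"
print(ratios.round(3))
```

Using `np.allclose` instead of `==` avoids false alarms from rounding error.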

In [205]:
# Age transformations
# 1. Log age
df["log_age"] = np.log(df["Age (day)"] + 1)

# 2. Age group categorical
def age_groups(age):
    """Map curing age in days to a curing-stage label."""
    if age <= 7:
        return "Early"
    elif age <= 28:
        return "Standard"
    elif age <= 90:
        return "Mature"
    else:
        return "Long-term"

df["age_group"] = df["Age (day)"].apply(age_groups)
print(df["age_group"].value_counts())
df[["log_age", "age_group"]].describe()

age_group
Standard     481
Early        253
Mature       140
Long-term    131
Name: count, dtype: int64


Unnamed: 0,log_age
count,1005.0
mean,3.244973
std,1.108951
min,0.693147
25%,2.079442
50%,3.367296
75%,4.043051
max,5.902633


# Group 4 — Age Features

This group captures **how concrete strength evolves over time**, reflecting the non-linear hydration process.

## Core Philosophy
- Raw age as a linear number is misleading:
  - Day 1 → Day 28 difference is huge.
  - Day 337 → Day 365 difference is minimal.
- Concrete strength gains **fast early**, then plateaus.
- Features in this group compress, transform, or categorize age to match physical reality.

## Features

### log_age
- Logarithmic transformation of Age.
- Compresses large ages, emphasizes early-age differences.
- Helps the model understand rapid early hydration vs slow late growth.

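### age_squared
- Squared age, a non-linear term intended for linear models.
- Lets such models fit the curvature of the age-strength relationship.
- Useful when tree-based models are not used.
- Not created in the code above; only `log_age` and `age_group` are added to `df`.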

### age_group
- Categorical classification of curing stage:
  - Early (1–7 days)
  - Standard (8–28 days)
  - Mature (29–90 days)
  - Long-term (more than 90 days)
- Mimics how civil engineers think about concrete curing in practice.

## Practical Insights
- Together, these features give the model **multiple representations of age**.
- Ensures early-age strength gain is weighted more heavily.
- Supports both linear and non-linear modeling approaches.
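
The three age representations can be sketched together on a handful of ages. This is an illustration under assumptions: `age_squared` is built here only because the text describes it, and `pd.cut` is swapped in as a vectorized alternative to the `age_groups` function rather than being the notebook's method.

```python
import numpy as np
import pandas as pd

age = pd.Series([3, 7, 28, 90, 365], name="age_days")

log_age = np.log1p(age)     # same value as np.log(age + 1)
age_squared = age ** 2      # hypothetical column, not built in this notebook

# Right-closed bins reproduce the <= 7, <= 28, <= 90 thresholds.
age_group = pd.cut(
    age,
    bins=[0, 7, 28, 90, np.inf],
    labels=["Early", "Standard", "Mature", "Long-term"],
)

print(pd.DataFrame({"age": age, "log_age": log_age.round(3), "age_group": age_group}))
```

One difference to keep in mind: with these bins an age of 0 or less falls outside every interval and becomes `NaN`, whereas the `age_groups` function would label it "Early".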

In [206]:
# Interaction features
# 1. Slag age interaction
df["slag_age_interaction"] = df["Blast Furnace Slag (component 2)(kg in a m^3 mixture)"] * df["log_age"]

# 2. Fly ash age interaction
df["flyash_age_interaction"] = df["Fly Ash (component 3)(kg in a m^3 mixture)"] * df["log_age"]

# 3. Cement age interaction
df["cement_age_interaction"] = df["Cement (component 1)(kg in a m^3 mixture)"] * df["log_age"]
df[["slag_age_interaction", "flyash_age_interaction","cement_age_interaction"]].describe()

Unnamed: 0,slag_age_interaction,flyash_age_interaction,cement_age_interaction
count,1005.0,1005.0,1005.0
mean,231.942723,178.619678,904.887341
std,295.292979,221.106018,486.605003
min,0.0,0.0,141.402025
25%,0.0,0.0,524.019269
50%,50.509437,0.0,802.154097
75%,437.074999,376.800403,1122.898433
max,1401.875417,806.446159,3025.144163


# Group 5 — SCM × Age Interaction Features

This group captures the **time-dependent effect of supplementary cementitious materials (SCMs)** like Slag and Fly Ash on concrete strength.

## Core Philosophy
- Raw Slag and Fly Ash quantities alone tell almost no story.
  - Example: 200 kg/m³ of Slag at 3 days → weak concrete.
  - Same mix at 180 days → strong concrete.
- The ingredient hasn’t changed; the time of reaction is what matters.
- Multiplying each SCM by `log_age` encodes both **quantity and time available to react**.

## Features

### slag_age_interaction
- Interaction of Slag × log_age.
- Captures how Slag contributes increasingly over time.

### flyash_age_interaction
- Interaction of Fly Ash × log_age.
- Intended to turn a misleading raw correlation (negative or near-zero) into a physically meaningful positive trend.

### cement_age_interaction
- Optional: applies the same logic to Cement.
- Even though Cement reacts fast, its contribution compounds over time.

## Practical Insights
- These features teach the model the **time-dependent chemistry of concrete**.
- Removes the need for the model to learn these relationships from scratch.
- For Slag and Cement, results in **stronger correlations with strength** than the raw columns (+0.252 vs +0.103 and +0.701 vs +0.488); the Fly Ash interaction stays weak (+0.043).
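
To make the quantity-times-time idea concrete, the sketch below evaluates the slag interaction for the 200 kg/m³ example from the text at 3 and 180 days. It uses the same `log(age + 1)` transform as `log_age`; the variable names are illustrative.

```python
import numpy as np

slag = 200.0                 # kg/m^3, the example quantity from the text
ages = np.array([3, 180])    # curing age in days

log_age = np.log(ages + 1)   # same transform as the log_age column
slag_age_interaction = slag * log_age

for a, v in zip(ages, slag_age_interaction):
    print(f"age={a:>3} d -> slag_age_interaction={v:.1f}")
```

The same slag quantity yields a much larger feature value at 180 days than at 3 days, which is the signal the raw Slag column cannot carry by itself.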

In [207]:
# Binary flag features
# 1. Superplasticizer flag
def uses_superplasticizer(superplasticizer):
    return 1 if superplasticizer > 0 else 0
df["sp_flag"] = df["Superplasticizer (component 5)(kg in a m^3 mixture)"].apply(uses_superplasticizer)

# 2. Slag flag
def uses_slag(slag):
    return 1 if slag > 0 else 0
df["slag_flag"] = df["Blast Furnace Slag (component 2)(kg in a m^3 mixture)"].apply(uses_slag)

# 3. Fly ash flag
def uses_flyash(flyash):
    return 1 if flyash > 0 else 0
df["flyash_flag"] = df["Fly Ash (component 3)(kg in a m^3 mixture)"].apply(uses_flyash)

df[["sp_flag","slag_flag","flyash_flag"]].describe()

Unnamed: 0,sp_flag,slag_flag,flyash_flag
count,1005.0,1005.0,1005.0
mean,0.623881,0.537313,0.461692
std,0.484652,0.498854,0.498779
min,0.0,0.0,0.0
25%,0.0,0.0,0.0
50%,1.0,1.0,0.0
75%,1.0,1.0,1.0
max,1.0,1.0,1.0


# Group 6 — Binary Flags for Optional Ingredients

This group handles ingredients that are **not always used**: Superplasticizer, Slag, and Fly Ash.

## Core Philosophy
- A zero in the raw column means the engineer **chose not to use** the ingredient.
- Mixing zeros and non-zeros in one column forces the model to treat “not used” as just a numeric value.
- Separating the design decision into a **binary flag** provides clarity.

## Features

### sp_flag
- 1 if Superplasticizer is used, 0 if not.
- Works alongside the original Superplasticizer quantity column.
- Captures **both the decision to use it** and **how much was used**.

### slag_flag and flyash_flag
- Same 0/1 encoding, applied to the Slag and Fly Ash columns.

## Practical Insights
- Gives the model **two pieces of information per ingredient**:
  1. The design philosophy (used or not used).
  2. The quantity within that philosophy.
- Improves learning for optional ingredients that are zero-heavy in the dataset.
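
A vectorized comparison is an equivalent way to build such a flag without a per-row helper function. The sketch below assumes a small Series of Superplasticizer quantities taken from rows shown in the earlier outputs; `sp` and `usage_share` are illustrative names.

```python
import pandas as pd

# Superplasticizer quantities in kg/m^3 from rows shown in the outputs above.
sp = pd.Series([2.5, 0.0, 0.0, 16.5, 11.6], name="superplasticizer")

sp_flag = (sp > 0).astype(int)   # 1 where the ingredient is used, else 0
usage_share = sp_flag.mean()     # mean of a 0/1 flag = fraction of mixes using it

print(sp_flag.tolist(), usage_share)
```

This also explains how to read the `describe()` output for the flags: the `mean` row is the share of mixes that use each ingredient.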

## QUALITY CHECKS

In [208]:
# Quality checks
df.dtypes

Cement (component 1)(kg in a m^3 mixture)                float64
Blast Furnace Slag (component 2)(kg in a m^3 mixture)    float64
Fly Ash (component 3)(kg in a m^3 mixture)               float64
Water  (component 4)(kg in a m^3 mixture)                float64
Superplasticizer (component 5)(kg in a m^3 mixture)      float64
Coarse Aggregate  (component 6)(kg in a m^3 mixture)     float64
Fine Aggregate (component 7)(kg in a m^3 mixture)        float64
Age (day)                                                  int64
Concrete compressive strength(MPa, megapascals)          float64
water_cement_ratio                                       float64
water_binder_ratio                                       float64
agg_binder_ratio                                         float64
total_binder_content                                     float64
total_aggregate_content                                  float64
total_mix                                                float64
cement_ratio             

In [209]:
# Correlation against the target
def correlation_validation(df):
    correlations = {}

    # Strip column names to avoid invisible spaces
    df.columns = df.columns.str.strip()

    # Define target by name
    target_col = "Concrete compressive strength(MPa, megapascals)"
    target = df[target_col]

    # Select numeric columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns

    # Compute correlation of each numeric column with target
    for col in numeric_cols:
        if col != target_col:  # compare column name, not series
            correlations[col] = df[col].corr(target)

    return correlations

corrs = correlation_validation(df)

# Sorting by absolute correlation
corr_series = pd.Series(corrs)
sorted_corr = corr_series.reindex(
    corr_series.abs().sort_values(ascending=False).index
)
print("Sorted correlations")
print("=================================================================")
print(sorted_corr)

Sorted correlations
cement_age_interaction                                   0.700973
water_binder_ratio                                      -0.610834
total_binder_content                                     0.598086
log_age                                                  0.559856
agg_binder_ratio                                        -0.554540
water_cement_ratio                                      -0.489408
Cement (component 1)(kg in a m^3 mixture)                0.488283
total_mix                                                0.362793
Superplasticizer (component 5)(kg in a m^3 mixture)      0.344225
Age (day)                                                0.337371
sp_flag                                                  0.272225
Water  (component 4)(kg in a m^3 mixture)               -0.269606
total_aggregate_content                                 -0.256348
slag_age_interaction                                     0.251793
slag_flag                                               

In [210]:
# Dropping weak and redundant features
df_model = df.copy()
df_model = df_model.drop(columns= ["slag_ratio", "flyash_ratio","flyash_flag", "flyash_age_interaction", "total_aggregate_content", "Coarse Aggregate  (component 6)(kg in a m^3 mixture)", "Fine Aggregate (component 7)(kg in a m^3 mixture)"])

# Dir path
dir_path = os.path.join(os.path.dirname(os.getcwd()), "data", "processed")
# Make the directory if it does not exist
os.makedirs(dir_path, exist_ok=True)
# Save the new dataset
file_path = os.path.join(dir_path, "Concrete_processed_data.xlsx")
df_model.to_excel(file_path, index=False)
print(f"Processed data saved at: {file_path}")

Processed data saved at: /home/local-host/PycharmProjects/concrete_strength_prediction/data/processed/Concrete_processed_data.xlsx


In [211]:
df_model.shape

(1005, 19)

In [212]:
df.shape

(1005, 26)

# Correlation Analysis Full Interpretation

This section evaluates the correlation of engineered and raw features with the target variable, Concrete Compressive Strength. The goal is to check whether domain-driven feature engineering improved the predictive signal compared to raw measurements.
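
The ranking itself can be reproduced compactly with `DataFrame.corrwith`. The sketch below is a stand-in for the `correlation_validation` function, run on the five rows from `df.head()` with shortened, illustrative column names; with so few rows the numbers are not meaningful, only the mechanics are.

```python
import pandas as pd

# First five rows of the dataset as shown in df.head(); names shortened for the demo.
toy = pd.DataFrame({
    "cement": [540.0, 540.0, 332.5, 332.5, 198.6],
    "water": [162.0, 162.0, 228.0, 228.0, 192.0],
    "age": [28, 28, 270, 365, 360],
    "strength": [79.986111, 61.887366, 40.269535, 41.05278, 44.296075],
})

target = "strength"
# Pearson correlation of every feature column with the target.
corr = toy.drop(columns=target).corrwith(toy[target])
# Order by absolute value while keeping the sign, as in the notebook.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)

print(ranked)
```

Sorting on the absolute value keeps strongly negative predictors, such as the water ratios, near the top instead of at the bottom.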

---

# Top 5 Strongest Predictors

| Rank | Feature                | Correlation | Type       |
| ---- | ---------------------- | ----------- | ---------- |
| 1    | cement_age_interaction | +0.701      | Engineered |
| 2    | water_binder_ratio     | -0.611      | Engineered |
| 3    | total_binder_content   | +0.598      | Engineered |
| 4    | log_age                | +0.560      | Engineered |
| 5    | agg_binder_ratio       | -0.555      | Engineered |

Key Insight

Every single top 5 feature is engineered. Not one raw feature made it into the top 5.

This supports the view that domain-driven feature engineering worked as intended, at least in terms of linear correlation with the target.

---

# Bottom 5 Weakest Predictors

| Rank | Feature                | Correlation | Notes             |
| ---- | ---------------------- | ----------- | ----------------- |
| 1    | slag_ratio             | +0.003      | Essentially zero  |
| 2    | flyash_flag            | -0.034      | Near zero         |
| 3    | flyash_age_interaction | +0.043      | Surprisingly weak |
| 4    | Blast Furnace Slag raw | +0.103      | Weak raw feature  |
| 5    | cement_ratio           | +0.111      | Weak              |

---

# Parent vs Engineered Feature Comparison

These comparisons show that most engineered features outperform their raw parents in absolute correlation.

- cement_age_interaction (+0.701) vs raw Cement (+0.488): +0.213 improvement
- water_binder_ratio (-0.611) vs raw Water (-0.270): +0.341 improvement
- agg_binder_ratio (-0.555) vs raw Coarse (-0.145) and Fine (-0.186): large improvement
- log_age (+0.560) vs raw Age (+0.337): +0.223 improvement, confirming non-linearity
- water_cement_ratio (-0.489) vs raw Water (-0.270): +0.219 improvement, although it only matches raw Cement (+0.488)

These results support the idea that ratio features and interaction terms capture physical relationships better than raw quantities.
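
The improvement figures are differences in absolute correlation, which the short sketch below recomputes from the values reported in this notebook. The dictionary keys are illustrative labels, and the last pair is added only to show how close `water_cement_ratio` is to raw Cement.

```python
# (engineered, raw parent) correlations with strength, as reported in this notebook.
pairs = {
    "cement_age_interaction vs Cement": (0.701, 0.488),
    "water_binder_ratio vs Water": (-0.611, -0.270),
    "log_age vs Age": (0.560, 0.337),
    "water_cement_ratio vs Water": (-0.489, -0.270),
    "water_cement_ratio vs Cement": (-0.489, 0.488),
}

# Gain = how much the absolute correlation grows over the raw parent.
gains = {name: round(abs(eng) - abs(raw), 3) for name, (eng, raw) in pairs.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:+.3f}")
```

Comparing magnitudes rather than signed values is what makes a change from -0.270 to -0.611 count as an improvement.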

---

# Three Findings to Flag

Flag 1. flyash_age_interaction is weak at +0.043

We expected Fly Ash multiplied by log_age to correlate strongly with strength. The sign did flip from negative (raw Fly Ash) to positive, but the magnitude remains small.

The likely reason is distribution. More than half of mixes contain zero Fly Ash. The interaction term is therefore zero for many rows which weakens overall correlation.

This feature may still help tree based models in the subset of mixes that use Fly Ash but it is not globally strong.

Flag 2. slag_ratio is essentially useless at +0.003

The proportion of binder that is slag does not explain strength on its own. Absolute quantity appears to matter more than proportional composition.

This feature is a candidate for removal.

Flag 3. flyash_ratio is weaker than raw Fly Ash

Here an engineered feature performed worse than its raw parent, and it is not the only such case: slag_ratio (+0.003) is also weaker than raw Blast Furnace Slag (+0.103). Expressing Fly Ash as a fraction of binder reduces useful signal compared to the raw quantity.

This feature should be removed.

---

# Feature Selection Decisions

Keep Strong Engineered Features

- cement_age_interaction
- water_binder_ratio
- total_binder_content
- agg_binder_ratio
- log_age
- water_cement_ratio

Keep Other Engineered Features

- sp_flag
- slag_flag
- slag_age_interaction

Keep Useful Raw Features

- Cement
- Superplasticizer
- Age
- Water
- Slag

Retained by Default (not in the drop list of the code above)

- raw Fly Ash
- cement_ratio
- total_mix
- age_group

Drop Weak or Redundant Features

- slag_ratio
- flyash_ratio
- flyash_flag
- flyash_age_interaction
- raw Coarse Aggregate
- raw Fine Aggregate
- total_aggregate_content

---

# Final Conclusion

The correlation analysis shows that engineered features provide the strongest predictive signal. Domain knowledge improved feature quality and allowed ratios and interaction terms to replace several raw measurements. Removing weak or redundant variables creates a cleaner feature set for modeling.
