Uploading Data

In [None]:
from google.colab import files
uploaded = files.upload()

Loading Data

In [10]:
import pandas as pd
raw_data = pd.read_csv("Material Strength Predictor data.csv")
raw_data.head()

,Cement (component 1)(kg in a m^3 mixture),Blast Furnace Slag (component 2)(kg in a m^3 mixture),Fly Ash (component 3)(kg in a m^3 mixture),Water (component 4)(kg in a m^3 mixture),Superplasticizer (component 5)(kg in a m^3 mixture),Coarse Aggregate (component 6)(kg in a m^3 mixture),Fine Aggregate (component 7)(kg in a m^3 mixture),Age (day),"Concrete compressive strength(MPa, megapascals)"
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


The dataset contains material composition parameters and curing age information
used to predict concrete compressive strength.

# Data Understanding – Material Strength Prediction

## Objective
The purpose of this notebook is to understand the structure, quality, and basic characteristics of the dataset before applying any preprocessing or machine learning models.

This step ensures:
- The dataset is suitable for modeling
- Features are correctly interpreted
- Potential data quality issues are identified early



## Shape and Columns

In [11]:
# Dataset shape
print(f"Number of rows: {raw_data.shape[0]}")
print(f"Number of columns: {raw_data.shape[1]}")

# Column names
raw_data.columns

Number of rows: 1030
Number of columns: 9


Index(['Cement (component 1)(kg in a m^3 mixture)',
       'Blast Furnace Slag (component 2)(kg in a m^3 mixture)',
       'Fly Ash (component 3)(kg in a m^3 mixture)',
       'Water  (component 4)(kg in a m^3 mixture)',
       'Superplasticizer (component 5)(kg in a m^3 mixture)',
       'Coarse Aggregate  (component 6)(kg in a m^3 mixture)',
       'Fine Aggregate (component 7)(kg in a m^3 mixture)', 'Age (day)',
       'Concrete compressive strength(MPa, megapascals) '],
      dtype='object')

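The number of observations and features sets the scale of the problem and
constrains the modeling approach: 1030 rows and 9 columns (8 inputs plus the
strength target) make this a small tabular regression task. Several column
names contain double spaces or a trailing space (for example
`'Water  (component 4)(kg in a m^3 mixture)'`), so they must be typed exactly
when selecting columns.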
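
Because the printed column index shows double spaces and a trailing space in some names, normalizing the names once can make later column selection less error-prone. The sketch below is one possible approach, not a step from the original notebook: the helper `clean_column_name` and the toy frame `demo` are hypothetical, and the frame only reuses three of the column names listed above.

```python
import re

import pandas as pd


def clean_column_name(name: str) -> str:
    """Collapse runs of whitespace to one space and strip both ends."""
    return re.sub(r"\s+", " ", name).strip()


# Toy frame (assumption): three column names copied from the index printed above.
demo = pd.DataFrame({
    "Water  (component 4)(kg in a m^3 mixture)": [162.0, 228.0],
    "Coarse Aggregate  (component 6)(kg in a m^3 mixture)": [1040.0, 932.0],
    "Concrete compressive strength(MPa, megapascals) ": [79.99, 40.27],
})

# rename() accepts a callable and applies it to every column label.
demo = demo.rename(columns=clean_column_name)
cleaned_columns = list(demo.columns)
print(cleaned_columns)
```

The same call, `raw_data.rename(columns=clean_column_name)`, should work on the full frame; whether to also shorten the names is a separate design choice.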

## Data Types and Memory Usage

Checking data types ensures all variables are correctly loaded and
helps identify categorical or numerical features.


In [12]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column                                                 Non-Null Count  Dtype  
---  ------                                                 --------------  -----  
 0   Cement (component 1)(kg in a m^3 mixture)              1030 non-null   float64
 1   Blast Furnace Slag (component 2)(kg in a m^3 mixture)  1030 non-null   float64
 2   Fly Ash (component 3)(kg in a m^3 mixture)             1030 non-null   float64
 3   Water  (component 4)(kg in a m^3 mixture)              1030 non-null   float64
 4   Superplasticizer (component 5)(kg in a m^3 mixture)    1030 non-null   float64
 5   Coarse Aggregate  (component 6)(kg in a m^3 mixture)   1030 non-null   float64
 6   Fine Aggregate (component 7)(kg in a m^3 mixture)      1030 non-null   float64
 7   Age (day)                                              1030 non-null   int64  
 8   Concrete compressive strength(MPa, megapascals)        1030 non-null   float64
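
The dtype check can also be done programmatically, which helps when a numeric column is silently loaded as text. The following is a minimal sketch on toy data, not part of the original notebook: the frame `demo`, its short column names, and the `"n/a"` entry are assumptions made for illustration.

```python
import pandas as pd

# Toy frame (assumption): "water" was read as text because of one bad entry.
demo = pd.DataFrame({
    "water": ["162.0", "228.0", "n/a"],
    "age": [28, 270, 365],
})

# List the columns that pandas does not treat as numeric.
non_numeric = [
    col for col in demo.columns
    if not pd.api.types.is_numeric_dtype(demo[col])
]

# Coerce to numbers; entries that cannot be parsed become NaN.
demo["water"] = pd.to_numeric(demo["water"], errors="coerce")
n_failed = int(demo["water"].isna().sum())
print(non_numeric, demo["water"].dtype, n_failed)
```

In this dataset the list would be empty, since `info()` reports only numeric dtypes; the check matters mainly when the source file changes.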

## Missing Values Analysis

Missing values can negatively impact model training and must be identified early.


In [13]:
raw_data.isnull().sum()

Cement (component 1)(kg in a m^3 mixture),0
Blast Furnace Slag (component 2)(kg in a m^3 mixture),0
Fly Ash (component 3)(kg in a m^3 mixture),0
Water (component 4)(kg in a m^3 mixture),0
Superplasticizer (component 5)(kg in a m^3 mixture),0
Coarse Aggregate (component 6)(kg in a m^3 mixture),0
Fine Aggregate (component 7)(kg in a m^3 mixture),0
Age (day),0
"Concrete compressive strength(MPa, megapascals)",0


## Duplicate Records Check

Duplicate observations may bias the model and distort evaluation metrics.


In [14]:
duplicates = raw_data.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

Number of duplicate rows: 25
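
Note that `duplicated()` with default arguments flags only the later repeats of a row, not its first occurrence. The sketch below, on hypothetical toy data with assumed short column names, shows how to view every member of a duplicate group and how to drop the repeats.

```python
import pandas as pd

# Toy data (assumption): rows 0 and 2 are identical, row 1 is unique.
demo = pd.DataFrame({
    "cement": [540.0, 332.5, 540.0],
    "age": [28, 270, 28],
    "strength": [79.99, 40.27, 79.99],
})

# Default keep="first": only the later repeat is counted.
n_repeats = int(demo.duplicated().sum())

# keep=False marks every row that has at least one identical twin.
all_copies = demo[demo.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each distinct row.
deduplicated = demo.drop_duplicates().reset_index(drop=True)
print(n_repeats, len(all_copies), len(deduplicated))
```

Inspecting `all_copies` before dropping anything is a cheap way to confirm that the flagged rows really are repeated records.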


## Basic Statistical Summary

The summary statistics provide insight into:
- Feature ranges
- Mean and variance
- Potential outliers

In [15]:
raw_data.describe().T

,count,mean,std,min,25%,50%,75%,max
Cement (component 1)(kg in a m^3 mixture),1030.0,281.167864,104.506364,102.0,192.375,272.9,350.0,540.0
Blast Furnace Slag (component 2)(kg in a m^3 mixture),1030.0,73.895825,86.279342,0.0,0.0,22.0,142.95,359.4
Fly Ash (component 3)(kg in a m^3 mixture),1030.0,54.18835,63.997004,0.0,0.0,0.0,118.3,200.1
Water (component 4)(kg in a m^3 mixture),1030.0,181.567282,21.354219,121.8,164.9,185.0,192.0,247.0
Superplasticizer (component 5)(kg in a m^3 mixture),1030.0,6.20466,5.973841,0.0,0.0,6.4,10.2,32.2
Coarse Aggregate (component 6)(kg in a m^3 mixture),1030.0,972.918932,77.753954,801.0,932.0,968.0,1029.4,1145.0
Fine Aggregate (component 7)(kg in a m^3 mixture),1030.0,773.580485,80.17598,594.0,730.95,779.5,824.0,992.6
Age (day),1030.0,45.662136,63.169912,1.0,7.0,28.0,56.0,365.0
"Concrete compressive strength(MPa, megapascals)",1030.0,35.817961,16.705742,2.33,23.71,34.445,46.135,82.6

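One common way to turn the quartiles into an outlier screen is the 1.5 × IQR rule. This is an assumption here, since the notebook does not name a method. With the values in the table, Age has Q1 = 7 and Q3 = 56, so the upper fence is 56 + 1.5 × 49 = 129.5 days, and long-cured samples such as the 270- and 365-day rows would be flagged. A flagged value is unusual, not necessarily wrong. The helper `iqr_outlier_counts` and the toy frame are hypothetical.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series index on the columns.
    outside = (df < (q1 - k * iqr)) | (df > (q3 + k * iqr))
    return outside.sum()


# Toy data (assumption): short column names and illustrative values.
demo = pd.DataFrame({
    "age": [1, 7, 28, 28, 56, 365],
    "water": [162.0, 165.0, 185.0, 186.0, 192.0, 195.0],
})
counts = iqr_outlier_counts(demo)
print(counts)
```
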

## Summary – Data Understanding

This dataset contains **1030 real-world concrete samples**, each described by its material composition, curing age, and resulting compressive strength. All nine variables are numerical, so no categorical encoding is needed before modeling.

The dataset is **complete**, with no missing values in any column. The duplicate check flagged 25 rows (about 2.4% of the data) as exact repeats of earlier rows; they will be removed during preprocessing so that repeated samples do not bias training or distort evaluation metrics.

Material quantities vary widely across samples, which reflects **practical construction and mix-design decisions**. Blast furnace slag, fly ash, and superplasticizer each have a 25th percentile of 0, and fly ash also has a median of 0, so at least a quarter of the mixes omit each of these components and at least half contain no fly ash. Their effect on strength can therefore only be learned from the subset of mixes that use them.
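
The share of mixes that omit a component can be measured directly rather than inferred from quartiles. The sketch below is illustrative only: the short column names are assumptions, and the values are the slag, fly ash, and superplasticizer entries of the five rows shown by `head()`.

```python
import pandas as pd

# Toy frame (assumption): short names; values copied from the five head() rows.
demo = pd.DataFrame({
    "slag": [0.0, 0.0, 142.5, 142.5, 132.4],
    "fly_ash": [0.0, 0.0, 0.0, 0.0, 0.0],
    "superplasticizer": [2.5, 2.5, 0.0, 0.0, 0.0],
})

# (demo == 0) is a boolean frame; its column means are the fractions of zeros.
zero_share = (demo == 0).mean()
print(zero_share)
```

Applied to the full frame, the same expression gives the fraction of mixes without each component, which can inform whether a binary "component used" indicator is worth adding later.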

Curing age ranges from **1 to 365 days** with a median of 28 days, so the data covers both early-age and long-term strength. This matters because concrete strength develops over time. Compressive strength spans 2.33 to 82.6 MPa, allowing the model to learn patterns across both low- and high-strength mixes.

Overall, the dataset is **complete, realistic, and suitable for predictive modeling** once the duplicate rows are removed. Its structure and variability support further exploratory analysis, feature engineering, and physics-informed machine learning in the next stages of the project.

