# Exploratory Data Analysis (EDA) on Fuel Blend Properties Dataset

**Goal**: Develop accurate predictive models that can estimate 10 key properties of blended fuels, enabling faster development and optimization of sustainable fuel formulations.

**Dataset Description**:
- `train.csv`: Training data containing blend composition, component properties, and target blend properties
- `test.csv`: Test data with only input features; model must predict the 10 blend properties
- `sample_submission.csv`: Template for submission format

**Column Groups in `train.csv`**:
- **Blend Composition (first 5 columns)**: Volume fraction of each of the 5 base components, expressed on a 0–1 scale (e.g., *Component1_fraction*).
- **Component Properties (next 50 columns)**: 10 properties of each component batch (e.g., *Component1_Property1*).
- **Final Blend Properties - Targets (last 10 columns)**: Target properties to predict (e.g., *BlendProperty1*).

**Column Groups in `test.csv`**:
- **Id (first column)**: Represents the unique number for each row (e.g., *1*).
- **Blend Composition (next 5 columns)**: Volume fraction of each of the 5 base components, expressed on a 0–1 scale (e.g., *Component1_fraction*).
- **Component Properties (last 50 columns)**: 10 properties of each component batch (e.g., *Component1_Property1*).

**Evaluation Metric**:
The evaluation metric used is the **Mean Absolute Percentage Error (MAPE)**, defined as:
$$\text{MAPE} = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \times 100$$

Where:
- $y_t$: Actual value
- $\hat{y}_t$: Predicted value

For reporting purposes, scores are normalized using the formula:
$$\text{Score} = \max(0, 100 - \left(\frac{\text{cost}}{\text{reference cost}}\right) \times 100)$$
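
The two formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the official scorer: the function names `mape` and `normalized_score` are hypothetical, and the `reference_cost` value used in the example is an arbitrary placeholder because the source does not state it.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent, following the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)


def normalized_score(cost, reference_cost):
    """Score = max(0, 100 - cost / reference_cost * 100)."""
    return max(0.0, 100.0 - (cost / reference_cost) * 100.0)


# Toy values, not taken from the dataset
y_true = np.array([2.0, 4.0, -1.0])
y_pred = np.array([1.0, 5.0, -1.5])

cost = mape(y_true, y_pred)
score = normalized_score(cost, reference_cost=100.0)  # placeholder reference cost
print(f"MAPE: {cost:.2f}%  Score: {score:.2f}")
```

Because each error is divided by $y_t$, rows whose actual value is close to zero can dominate the average, which is worth keeping in mind for targets centered near zero.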

## Step 1: Load & Explore the Dataset

In [1]:
import pandas as pd
from pathlib import Path

# Directory of the datasets
data_path = Path('../data')

# Load the raw dataset
train_data = pd.read_csv(data_path / 'train.csv')
test_data = pd.read_csv(data_path / 'test.csv')

# Shape of the datasets
print("Shape of train data: ", train_data.shape)
print("Shape of test data: ", test_data.shape)

Shape of train data:  (2000, 65)
Shape of test data:  (500, 56)


In [2]:
# First 5 rows of the train data
train_data.head(5)

Unnamed: 0,Component1_fraction,Component2_fraction,Component3_fraction,Component4_fraction,Component5_fraction,Component1_Property1,Component2_Property1,Component3_Property1,Component4_Property1,Component5_Property1,...,BlendProperty1,BlendProperty2,BlendProperty3,BlendProperty4,BlendProperty5,BlendProperty6,BlendProperty7,BlendProperty8,BlendProperty9,BlendProperty10
0,0.21,0.0,0.42,0.25,0.12,-0.021782,1.981251,0.020036,0.140315,1.032029,...,0.489143,0.607589,0.32167,-1.236055,1.601132,1.384662,0.30585,0.19346,0.580374,-0.762738
1,0.02,0.33,0.19,0.46,0.0,-0.224339,1.148036,-1.10784,0.149533,-0.354,...,-1.257481,-1.475283,-0.437385,-1.402911,0.147941,-1.143244,-0.439171,-1.379041,-1.280989,-0.503625
2,0.08,0.08,0.18,0.5,0.16,0.457763,0.242591,-0.922492,0.908213,0.972003,...,1.784349,0.450467,0.622687,1.375614,-0.42879,1.161616,0.601289,0.87295,0.66,2.024576
3,0.25,0.42,0.0,0.07,0.26,-0.577734,-0.930826,0.815284,0.447514,0.455717,...,-0.066422,0.48373,-1.865442,-0.046295,-0.16382,-0.209693,-1.840566,0.300293,-0.351336,-1.551914
4,0.26,0.16,0.08,0.5,0.0,0.120415,0.666268,-0.626934,2.725357,0.392259,...,-0.118913,-1.172398,0.301785,-1.787407,-0.493361,-0.528049,0.286344,-0.265192,0.430513,0.735073


In [3]:
# Information about the dataset
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 65 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Component1_fraction    2000 non-null   float64
 1   Component2_fraction    2000 non-null   float64
 2   Component3_fraction    2000 non-null   float64
 3   Component4_fraction    2000 non-null   float64
 4   Component5_fraction    2000 non-null   float64
 5   Component1_Property1   2000 non-null   float64
 6   Component2_Property1   2000 non-null   float64
 7   Component3_Property1   2000 non-null   float64
 8   Component4_Property1   2000 non-null   float64
 9   Component5_Property1   2000 non-null   float64
 10  Component1_Property2   2000 non-null   float64
 11  Component2_Property2   2000 non-null   float64
 12  Component3_Property2   2000 non-null   float64
 13  Component4_Property2   2000 non-null   float64
 14  Component5_Property2   2000 non-null   float64
 15  Comp

In [4]:
# Statistical info
train_data.describe(include='all')

Unnamed: 0,Component1_fraction,Component2_fraction,Component3_fraction,Component4_fraction,Component5_fraction,Component1_Property1,Component2_Property1,Component3_Property1,Component4_Property1,Component5_Property1,...,BlendProperty1,BlendProperty2,BlendProperty3,BlendProperty4,BlendProperty5,BlendProperty6,BlendProperty7,BlendProperty8,BlendProperty9,BlendProperty10
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,...,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,0.18069,0.18291,0.17982,0.34209,0.11449,0.000245,-0.017319,0.001703,-0.004653,-0.018256,...,-0.016879,-0.002076,-0.014351,-0.006068,-0.015249,-0.003497,-0.013568,-0.017236,-0.001507,-0.001795
std,0.1632,0.163704,0.166283,0.141119,0.080219,0.999423,1.006409,0.998859,1.006902,1.009294,...,0.993787,1.004512,0.99936,1.009176,0.98648,1.009126,1.000613,0.998759,1.001096,0.990433
min,0.0,0.0,0.0,0.01,0.0,-2.943737,-1.718895,-3.008683,-3.029468,-3.57244,...,-2.550897,-3.079759,-3.041624,-2.835701,-1.730111,-2.80821,-2.994571,-3.62108,-3.292727,-2.476429
25%,0.03,0.04,0.02,0.22,0.05,-0.694658,-0.765154,-0.701948,-0.693361,-0.713149,...,-0.766128,-0.735109,-0.624235,-0.783547,-0.683165,-0.697379,-0.622453,-0.725564,-0.702384,-0.733653
50%,0.14,0.15,0.14,0.35,0.12,0.011977,-0.030235,0.021335,0.016774,0.194936,...,-0.021089,0.001684,0.146135,-0.028158,-0.25065,-0.011649,0.13347,-0.001548,-0.002604,-0.010459
75%,0.29,0.3,0.29,0.5,0.18,0.685717,0.65396,0.673125,0.659227,1.032029,...,0.714763,0.723807,0.727597,0.664659,0.358701,0.695182,0.70413,0.684894,0.706084,0.693839
max,0.5,0.5,0.5,0.5,0.29,2.981146,3.05109,2.868901,2.982258,1.032029,...,2.856588,2.769156,1.638646,3.769643,3.600439,3.433292,3.293228,3.340657,3.276199,2.708703


In [5]:
# Get column groups
fraction_cols = [col for col in train_data.columns if 'fraction' in col]
property_cols = [col for col in train_data.columns if 'Property' in col and 'Blend' not in col]
blend_cols = [col for col in train_data.columns if 'BlendProperty' in col]
print(f"Fraction Columns: {len(fraction_cols)}\nProperty Columns: {len(property_cols)}\nBlend Columns: {len(blend_cols)}")

Fraction Columns: 5
Property Columns: 50
Blend Columns: 10


Insights on the dataset:
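- No null values are present (every column listed by `info()` reports 2000 non-null entries), and no duplicate rows were found
- Only numerical values are present in the dataset
- Component fractions fall within the plausible 0.00–1.00 range; the observed values span 0.00–0.50, and `Component5_fraction` tops out at 0.29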
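
The checks behind these insights can be made explicit with a small helper. The sketch below is an assumption about how one might verify them, not the notebook's original code: `quality_report` is a hypothetical function, the toy frame stands in for `train_data`, and the sum-to-1 check assumes that fractions in each row are meant to add up to 1 (as the first rows of the training data suggest).

```python
import numpy as np
import pandas as pd


def quality_report(df, fraction_cols):
    """Summarise nulls, duplicate rows, non-numeric columns and fraction ranges."""
    fractions = df[fraction_cols]
    row_sums = fractions.sum(axis=1)
    return {
        "null_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "non_numeric_cols": df.select_dtypes(exclude="number").columns.tolist(),
        "fraction_min": float(fractions.min().min()),
        "fraction_max": float(fractions.max().max()),
        # Assumption: a valid blend has fractions that sum to 1
        "rows_not_summing_to_1": int((~np.isclose(row_sums, 1.0)).sum()),
    }


# Toy stand-in for train_data (the third row duplicates the first)
toy = pd.DataFrame({
    "Component1_fraction": [0.5, 0.2, 0.5],
    "Component2_fraction": [0.5, 0.8, 0.5],
    "BlendProperty1": [0.1, -0.3, 0.1],
})
report = quality_report(toy, ["Component1_fraction", "Component2_fraction"])
print(report)
```

On the real data the call would be `quality_report(train_data, fraction_cols)`.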

## Step 2: Analyze & Visualize the Dataset

#### 2.1 Blend Composition Distribution

In [6]:
## Histogram
import numpy as np
import plotly.graph_objects as go
from scipy.stats import gaussian_kde
from plotly.subplots import make_subplots

# Define a bold color palette (extend as needed)
bold_colors = ['#1f77b4', '#ff7f0e', '#2ca02c',
               '#d62728', '#9467bd', '#8c564b']

# Create subplot grid for histograms + KDEs
fig_hist_fractions = make_subplots(
    rows=2, cols=3,
    subplot_titles=fraction_cols,
    vertical_spacing=0.1
)

# Loop through each component fraction
for i, col in enumerate(fraction_cols):
    row = i // 3 + 1  # Determine subplot row
    col_pos = i % 3 + 1  # Determine subplot column

    data = train_data[col]

    # Add histogram trace with bold color and white borders
    fig_hist_fractions.add_trace(
        go.Histogram(
            x=data,
            name=col,
            nbinsx=30,
            marker=dict(
                color=bold_colors[i % len(bold_colors)],  # Cycle colors
                line=dict(color='white', width=1)  # White bar dividers
            ),
            opacity=0.6
        ),
        row=row, col=col_pos
    )

    # Compute KDE (kernel density estimate)
    kde = gaussian_kde(data)
    x_vals = np.linspace(data.min(), data.max(), 200)
    y_vals = kde(x_vals) * len(data) * (data.max() - data.min()) / 30  # Scale to match hist height

    # Add KDE line over histogram
    fig_hist_fractions.add_trace(
        go.Scatter(
            x=x_vals,
            y=y_vals,
            mode='lines',
            line=dict(color='black', width=2),
            name=f"{col} KDE"
        ),
        row=row, col=col_pos
    )

# Final layout tweaks
fig_hist_fractions.update_layout(
    title_text="Distribution of Component Fractions with KDE Overlay",
    showlegend=False,
    height=700
)

fig_hist_fractions.show()

Plot Insights:
- Component1–3:
    - **High frequency at 0 and 0.5**: Indicates these components are often either absent or at maximum allowed proportion.
    - **Relatively uniform in between**: Mild variation in mid-ranges.

- Component4:
    - **Skewed toward higher values, especially 0.5**: The 75th percentile equals the maximum (0.5), so at least a quarter of blends use Component4 at its cap. It is likely a dominant or essential component in most blends.

- Component5:
    - **Concentrated around 0.1–0.2, with a drop-off before 0.3**: Typically added in small to moderate quantities; it never exceeds 0.29 in the training data.

In [7]:
import plotly.express as px

# Sum each component's contribution across all rows
total_fractions = train_data[fraction_cols].sum(axis=0)

# Prepare data for pie chart
pie_data = total_fractions.reset_index()
pie_data.columns = ['Component', 'TotalFraction']

# Create and show the pie chart using Plotly
fig = px.pie(
    pie_data,
    values='TotalFraction',
    names='Component',
    title='Overall Contribution of Each Component',
    hover_name='Component'
)

colors = ['#0B4689', '#08306B', '#155FA0', '#3E92CC', '#1E77B5']
fig.update_traces(textposition='inside',
                  textinfo='percent',
                  marker=dict(colors=colors, line=dict(color='#FFFFFF', width=3)))
fig.update_layout(showlegend=True)
fig.show()

Plot Insights:
- `Component4_fraction` contributes the most (≈34.2%) to the overall blend composition.
- `Component5_fraction` contributes the least (≈11.4%) to the overall blend composition.
- The other components (Component1, Component2, and Component3) each contribute around 18%.
- The distribution is not uniform: Component4 holds the largest share, roughly twice that of Component1–3 and three times that of Component5.
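
The percentages in the pie chart can be reproduced without plotting. The sketch below starts from the column means printed by `describe()` above; since every column has 2000 rows, shares of the means equal shares of the column sums used in the chart. The variable names are illustrative only.

```python
import pandas as pd

# Mean fraction per component, copied from the describe() output above
mean_fractions = pd.Series({
    "Component1_fraction": 0.18069,
    "Component2_fraction": 0.18291,
    "Component3_fraction": 0.17982,
    "Component4_fraction": 0.34209,
    "Component5_fraction": 0.11449,
})

# Share of each component in the overall composition, in percent
share_percent = mean_fractions / mean_fractions.sum() * 100
largest = share_percent.idxmax()
smallest = share_percent.idxmin()
print(share_percent.round(1))
```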

#### 2.2 Blend Property Distribution

In [8]:
# Bold color palette for blend properties (extend as needed)
blend_colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
                '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']

# Create 2x5 subplot grid for 10 blend properties
fig_hist_blend = make_subplots(
    rows=2, cols=5,
    subplot_titles=blend_cols,
    vertical_spacing=0.1
)

# Loop through each blend property
for i, col in enumerate(blend_cols):
    row = i // 5 + 1
    col_pos = i % 5 + 1

    data = train_data[col].dropna()

    # Add histogram with bold color and white separator lines
    fig_hist_blend.add_trace(
        go.Histogram(
            x=data,
            name=col,
            nbinsx=30,
            marker=dict(
                color=blend_colors[i % len(blend_colors)],
                line=dict(color='white', width=1)
            ),
            opacity=0.6
        ),
        row=row, col=col_pos
    )

    # KDE overlay
    kde = gaussian_kde(data)
    x_vals = np.linspace(data.min(), data.max(), 200)
    y_vals = kde(x_vals) * len(data) * (data.max() - data.min()) / 30  # Scale to match histogram

    fig_hist_blend.add_trace(
        go.Scatter(
            x=x_vals,
            y=y_vals,
            mode='lines',
            line=dict(color='black', width=2),
            name=f"{col} KDE"
        ),
        row=row, col=col_pos
    )

# Update layout
fig_hist_blend.update_layout(
    title_text="Distribution of Blend Properties with KDE Overlay",
    showlegend=False,
    height=700
)

fig_hist_blend.show()


Plot Insights:

- BlendProperty1, 2, 6, 8, 9, 10:
    - These exhibit near-Gaussian (bell-shaped) distributions.
    - Likely standardized or already normalized.
    
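- BlendProperty3 & 4:
    - Show slight skew (e.g., a longer left tail in Property3, whose values range from about -3.04 to 1.64 with the mean below the median).
    - Mostly symmetric but may benefit from light transformation or checking for outliers.

- BlendProperty5:
    - Clearly right-skewed with a long tail.
    - Consider a Yeo-Johnson transformation before modeling; log and Box-Cox transforms require strictly positive values, and this target goes as low as -1.73.

- BlendProperty7:
    - Looks sharper (peaky) than normal, possibly leptokurtic.
    - Its standard deviation (≈1.00) matches the other targets, so the peak suggests tightly centered values with heavier tails rather than low overall variance.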
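
Visual impressions of skew and peakedness can be backed by numbers. The sketch below is one possible way to do that, run on synthetic data rather than the real targets: it uses pandas' sample skewness and excess kurtosis, and `scipy.stats.yeojohnson` as a transform that accepts negative values (log and Box-Cox need strictly positive inputs). The series `toy_target` is a hypothetical stand-in for a right-skewed column such as `BlendProperty5`.

```python
import numpy as np
import pandas as pd
from scipy.stats import yeojohnson

# Synthetic right-skewed sample that also contains negative values
rng = np.random.default_rng(0)
toy_target = pd.Series(rng.exponential(scale=1.0, size=2000) - 1.0)

skew_before = toy_target.skew()        # > 0 means a longer right tail
excess_kurtosis = toy_target.kurt()    # > 0 means heavier tails than a normal

# Yeo-Johnson with the lambda chosen by maximum likelihood
transformed, fitted_lambda = yeojohnson(toy_target.to_numpy())
skew_after = pd.Series(transformed).skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

For the real data, `train_data[blend_cols].skew()` and `train_data[blend_cols].kurt()` would give one value per target.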

In [9]:
# Select the 10 blend property columns by name rather than by position
data = train_data[blend_cols]

# Melt the DataFrame to long format for Plotly
melted_data = data.melt(var_name='Property', value_name='Value')

# Create the violin plot
fig = px.violin(melted_data, x='Property', y='Value', color='Property', box=True, points='all', color_discrete_sequence=px.colors.qualitative.Bold)

# Update layout for aesthetics
fig.update_layout(
    showlegend=False,
    title='Distribution of Blend Properties',
    yaxis_title='Value',
    xaxis_title=None,
    xaxis_tickangle=45
)

fig.show()

Plot Insights:
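- Most properties show roughly *normal/symmetric distributions* (BlendProperty1, 2, 6, 8, 9, 10)
- `BlendProperty3` has a longer lower tail (minimum ≈ -3.04 versus maximum ≈ 1.64), consistent with the slight skew seen in the histograms
- `BlendProperty4` appears *bimodal* (two peaks), which suggests two distinct blend regimes
- `BlendProperty5` shows *right skew* with a long tail
- `BlendProperty7` shows *left skew* (its mean sits below its median)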

In [10]:
## Boxplot
fig_box_blend = go.Figure()

for col in blend_cols:
    fig_box_blend.add_trace(
        go.Box(y=train_data[col], name=col, boxpoints='outliers')
    )

fig_box_blend.update_layout(
    showlegend=False,
    title="Boxplots of Blend Properties",
    yaxis_title="Value",
    height=500
)

fig_box_blend.show()

Plot Insights:
- Most blend properties have medians near 0, indicating centering or normalization; BlendProperty5 is the main exception (median ≈ -0.25).
- Boxes are mostly symmetric (except BlendProperty5).
- BlendProperty4–9 show several mild to moderate outliers, especially in the upper tail.
- BlendProperty5 and BlendProperty7 are more skewed and have more upper outliers, so they may benefit from transformation or capping.
- BlendProperty5 has the narrowest IQR (≈1.04, versus roughly 1.3–1.5 for the others) but a long upper tail reaching 3.60.
- BlendProperty3 shows the most compact range (about -3.04 to 1.64), although its standard deviation (≈1.00) is in line with the other targets.
- Some properties appear to have a few notable lower-end outliers while remaining roughly symmetric.
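
The outlier observations can be counted rather than read off the plot. The sketch below applies the usual 1.5 × IQR fence rule to each column; `iqr_outlier_counts` is a hypothetical helper and the toy frame replaces the real targets. Plotly computes quartiles slightly differently from pandas, so counts may not match the figure exactly.

```python
import pandas as pd


def iqr_outlier_counts(df, k=1.5):
    """Count values below Q1 - k*IQR and above Q3 + k*IQR for each column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = df.lt(q1 - k * iqr, axis=1).sum()
    upper = df.gt(q3 + k * iqr, axis=1).sum()
    return pd.DataFrame({"iqr": iqr, "lower_outliers": lower, "upper_outliers": upper})


# Toy data: column "a" has one high outlier, column "b" one low outlier
toy = pd.DataFrame({
    "a": [0.0, 1.0, 2.0, 3.0, 4.0, 50.0],
    "b": [-40.0, 1.0, 2.0, 3.0, 4.0, 5.0],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

On the notebook's data this would be called as `iqr_outlier_counts(train_data[blend_cols])`.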

#### 2.3 Correlation Heatmaps and Pairplot

In [11]:
# Component Fraction vs Blend Properties
subset_cols = fraction_cols + blend_cols
corr_matrix = train_data[subset_cols].corr()

fig = go.Figure(data=go.Heatmap(
    z=corr_matrix.values,
    x=corr_matrix.columns,
    y=corr_matrix.columns,
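    colorscale='RdBu',  # Diverging scale so the sign of each correlation is visible
    zmin=-1, zmax=1,  # Symmetric range so that 0 maps to the midpoint color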
    text=np.round(corr_matrix.values, 2),
    texttemplate="%{text}",
))

fig.update_layout(
    title="Correlation Heatmap - Fractions vs Blend Properties",
    width=1200,
    height=800,
    xaxis_tickangle=-45
)

fig.show()

Key Patterns Identified
1. `Component1_fraction` (Bottom Row)
    - Weak correlations across all blend properties (-0.09 to 0.22)
    - Minimal impact on final blend characteristics
    - Most correlations are near zero, suggesting limited predictive power

2. `Component2_fraction` (Second Row)
    - Strong negative correlations with most blend properties (-0.83 to -0.15)
    - Particularly strong negative impact on `BlendProperty7` (-0.83)
    - Consistent negative influence across `BlendProperty6-10`
    - Acts as a "diluting" component

3. `Component3_fraction` (Third Row)
    - Moderate positive correlations (0.17 to 0.50)
    - Strongest positive impact on `BlendProperty7` (0.50)
    - Balanced influence across different properties
    - Shows complementary behavior to Component2

4. `Component4_fraction` (Fourth Row)
    - Correlations range from -0.74 to 0.24
    - The -0.74 extreme is with `Component5_fraction`, which is another input rather than a blend property; because the fractions in a row appear to sum to 1, negative correlations between fractions are expected
    - Inverse relationship with `BlendProperty1` (-0.20) and `BlendProperty2` (-0.28)
    - Mildly antagonistic behavior toward these blend properties

5. `Component5_fraction` (Fifth Row)
    - Strong positive correlations with multiple properties (0.41 to 0.67)
    - Highest positive impact on `BlendProperty9` (0.67)
    - Consistent positive influence across `BlendProperty8-10`
    - Acts as an "enhancing" component

Critical Insights
1. Component Roles:
    - `Component2` & `Component4`: Negative influencers - reduce blend properties
    - `Component3` & `Component5`: Positive enhancers - improve blend properties
    - `Component1`: Neutral/minimal impact - possibly a base or filler

2. Blend Property Groups:
    - `BlendProperty1-5`: Show mixed correlations with different components
    - `BlendProperty6-10`: Show stronger, more consistent patterns
    - `BlendProperty7`: Shows extreme correlations (both +0.50 and -0.83)
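
Reading individual cells off a heatmap is error-prone, so it can help to slice out only the fraction-versus-target block and rank it. The sketch below does this on synthetic data; the Dirichlet-generated fractions and the two toy targets are assumptions made purely for illustration, and the `toy_` names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Toy fractions that sum to 1 per row, plus two toy targets
toy = pd.DataFrame(
    rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=n),
    columns=["Component1_fraction", "Component2_fraction", "Component3_fraction"],
)
toy["BlendProperty1"] = -2.0 * toy["Component2_fraction"] + rng.normal(0, 0.1, n)
toy["BlendProperty2"] = rng.normal(0, 1, n)

toy_fraction_cols = [c for c in toy.columns if "fraction" in c]
toy_blend_cols = [c for c in toy.columns if "BlendProperty" in c]

# Rows = fractions, columns = blend properties (fraction-to-fraction cells excluded)
cross_corr = toy.corr().loc[toy_fraction_cols, toy_blend_cols]

# Rank (blend property, fraction) pairs by absolute correlation
pairs = cross_corr.unstack()
strongest = pairs.reindex(pairs.abs().sort_values(ascending=False).index)

# Fractions constrained to sum to 1 are negatively correlated with each other
fraction_corr = toy[toy_fraction_cols].corr()
print(strongest.head(3))
```

Keeping the fraction-to-fraction block separate avoids mistaking a compositional effect for an influence on a blend property.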

In [12]:
# Full features
corr_matrix = train_data.corr()

fig = go.Figure(data=go.Heatmap(
    z=corr_matrix.values,
    x=corr_matrix.columns,
    y=corr_matrix.columns,
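    colorscale='RdBu',  # Diverging scale so the sign of each correlation is visible
    zmin=-1, zmax=1,  # Symmetric range so that 0 maps to the midpoint color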
    text=np.round(corr_matrix.values, 2),
    texttemplate="%{text}",
))

fig.update_layout(
    title="Correlation Heatmap - All Features",
    width=1700,
    height=1300,
    xaxis_tickangle=-45
)

fig.show()

Plot Insights:

- Fraction-to-Property Pathways
    - **Component fractions** (bottom-left) show **varied correlation patterns** with individual component properties
    - **Fraction influences** propagate through component properties to final blend properties
    - **Non-uniform influence patterns** - some fractions affect certain properties more than others

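- Multicollinearity Issues:
    - **High correlations between blend properties** concern the targets rather than the predictors; they suggest a multi-output model could share information across targets
    - **Component properties within same component** show expected correlations, which can cause multicollinearity among the inputs
    - **Redundant features** likely present - feature selection needed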
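
Redundant inputs can be listed directly from the correlation matrix. The sketch below is one possible approach, shown on synthetic data: `high_corr_pairs` is a hypothetical helper, and the 0.9 threshold is a common rule of thumb rather than a value derived from this dataset.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(features, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = features.corr().abs().to_numpy()
    cols = features.columns
    # Strict upper triangle: each pair once, no self-correlations
    i_idx, j_idx = np.triu_indices_from(corr, k=1)
    rows = [
        (cols[i], cols[j], corr[i, j])
        for i, j in zip(i_idx, j_idx)
        if corr[i, j] > threshold
    ]
    result = pd.DataFrame(rows, columns=["feature_a", "feature_b", "abs_corr"])
    return result.sort_values("abs_corr", ascending=False).reset_index(drop=True)


# Toy data: x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
toy = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.05 * rng.normal(size=300),
    "x3": rng.normal(size=300),
})
redundant = high_corr_pairs(toy, threshold=0.9)
print(redundant)
```

For the notebook's data, the input would be `train_data[fraction_cols + property_cols]` so that targets are not mixed in with predictors.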

In [13]:
# blend properties vs components fraction pairplots
subset_cols = fraction_cols + blend_cols

fig = px.scatter_matrix(
    train_data[subset_cols],
    dimensions=subset_cols,
    title="Pairplot - Fractions VS Blend Properties"
)

fig.update_layout(
    width=1700,
    height=1400
)

fig.show()

Plot Insights:
1. Linear Relationships:  
   - Strong negative correlation between `Component2_fraction` and `BlendProperty2`.  
   - Positive correlation between `Component5_fraction` and `BlendProperty5`.  

2. Nonlinear/Scattered Patterns:  
   - Most relationships between fractions/properties are diffuse or nonlinear, suggesting complex interactions.  

3. Outliers:  
   - Notable in `BlendProperty3`, `BlendProperty7`, and `BlendProperty9` (extreme values).  

4. Component Distributions:  
   - `Component4_fraction` is concentrated near 0.5, while `Component1_fraction` through `Component3_fraction` spread across the full 0–0.5 range and `Component5_fraction` stays below 0.3.  

5. Target Properties:  
   - Some blend properties are not bell-shaped: `BlendProperty5` is right-skewed and `BlendProperty4` appeared bimodal in the violin plot, so transformations may help.  

## Step 3: Feature Engineering

### **🎯 Strategic Insights**


#### **For Feature Engineering:**
1. **Dimensionality Reduction**: PCA could reduce the 55 input features (65 columns minus the 10 targets) to ~10-20 components
2. **Feature Grouping**: Create composite features for each component
3. **Selective Features**: Focus on features with strongest cross-group correlations
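
To judge how many principal components are worth keeping, one can look at cumulative explained variance. The sketch below is a NumPy-only illustration on synthetic data, not a result for this dataset: `components_for_variance` is a hypothetical helper, the 95% target is an arbitrary choice, and the toy matrix is built from two latent factors so the expected answer is known.

```python
import numpy as np


def components_for_variance(X, target=0.95):
    """Smallest number of principal components whose cumulative variance ratio reaches target."""
    X = np.asarray(X, dtype=float)
    # Standardize so every feature contributes on the same scale
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    singular_values = np.linalg.svd(X_std, compute_uv=False)
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained_ratio)
    k = min(int(np.searchsorted(cumulative, target)) + 1, len(cumulative))
    return k, cumulative


# Toy data: 10 features driven by 2 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
loadings = np.zeros((2, 10))
loadings[0, :5] = 1.0
loadings[1, 5:] = 1.0
X_toy = latent @ loadings + 0.05 * rng.normal(size=(500, 10))

k, cumulative = components_for_variance(X_toy, target=0.95)
print(k, cumulative[:3].round(3))
```

On the real inputs the same function could be applied to `train_data[fraction_cols + property_cols]`; whether the answer lands in the ~10-20 range is something to check, not assume.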

#### **For Modeling:**
1. **Regularization Essential**: Highly correlated inputs make unregularized coefficients unstable, which favors Ridge/Lasso regression
2. **Multi-output Benefits**: Blend properties are correlated - predict together
3. **Feature Selection**: Remove redundant features to improve generalization
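
The first two points can be illustrated together with a closed-form ridge fit that predicts several targets at once. This is a sketch under stated assumptions, not the modeling approach used later: `fit_ridge` is a hypothetical NumPy helper (a library such as scikit-learn could be used instead), `alpha=1.0` is an arbitrary penalty, and the data are synthetic with one deliberately duplicated input.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression for one or more targets (intercept left unpenalized)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ Yc)  # shape: (n_features, n_targets)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


# Toy data: x3 is almost a copy of x1, and there are two targets
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
x3 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
Y = np.column_stack([2.0 * x1 - x2, x2 + 0.5]) + 0.01 * rng.normal(size=(n, 2))

coef, intercept = fit_ridge(X, Y, alpha=1.0)
pred = X @ coef + intercept
mean_abs_error = float(np.mean(np.abs(pred - Y)))
print(coef.round(2), round(mean_abs_error, 4))
```

The penalty splits the weight for the first target roughly evenly between the two near-duplicate inputs instead of letting the coefficients blow up in opposite directions.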

#### **Data Quality:**
- **Well-structured dataset** with logical correlation patterns
- **No obvious data quality issues** (correlations make physical sense)
- **Sufficient complexity** for advanced modeling techniques

### **🚀 Recommended Next Steps**

1. **Correlation Analysis**: Identify feature pairs with |r| > 0.9 and drop one feature from each pair
2. **PCA Analysis**: Determine optimal number of components
3. **Feature Selection**: Use correlation thresholds and variance inflation factors
4. **Regularized Modeling**: Start with Ridge/Lasso before complex models
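
For the variance inflation factors mentioned in step 3, one compact route uses the identity that the VIF of feature *j* equals the *j*-th diagonal entry of the inverse correlation matrix. The sketch below relies on that identity and on synthetic data; `variance_inflation_factors` is a hypothetical helper, and it will fail if two features are exactly collinear because the correlation matrix is then singular.

```python
import numpy as np
import pandas as pd


def variance_inflation_factors(features):
    """VIF per feature, taken from the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns, name="VIF")


# Toy data: x2 is strongly tied to x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
toy = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.1 * rng.normal(size=500),
    "x3": rng.normal(size=500),
})
vif = variance_inflation_factors(toy)
print(vif.round(2))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity; the exact cutoff is a judgment call.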