# Fuel Blend Properties Prediction 

**Dataset Overview:**
- **Problem:** Predict 10 fuel blend properties from the blend fractions of 5 components and each component's individual properties
- **Features:** 55 input features (5 blend fractions + 50 component properties)
- **Targets:** 10 blend properties to predict
- **Evaluation Metric:** MAPE (Mean Absolute Percentage Error)
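
As a rough illustration of the metric, MAPE can be sketched in NumPy as below. This is a minimal version under stated assumptions: the competition's exact scoring formula is not given here, and the `eps` guard and the function name `mape` are illustrative choices rather than part of any official definition.

```python
import numpy as np

def mape(y_true, y_pred, eps=1e-8):
    """Mean absolute percentage error, returned as a fraction (0.1 == 10%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumption: clamp very small |y_true| to eps to avoid division by zero.
    denom = np.maximum(np.abs(y_true), eps)
    return float(np.mean(np.abs(y_true - y_pred) / denom))

# Made-up values, purely for illustration
y_true = np.array([1.0, -2.0, 0.5])
y_pred = np.array([1.1, -1.8, 0.4])
score = mape(y_true, y_pred)
print(f"MAPE: {score:.4f}")

# The same absolute error costs far more when the true value is near zero
near_zero = mape([0.001], [0.101])
print(f"MAPE near zero: {near_zero:.1f}")
```

Because the error is divided by the true value, targets close to zero can dominate the score, which is worth keeping in mind when reading the target summaries.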


# Data Exploration - Quick Overview

**Purpose:** Get a quick first look at the dataset structure and validate data quality.

This notebook is a fast entry point before the detailed EDA.

## 1. Setup & Imports

In [18]:
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

print('✓ Libraries imported successfully')

✓ Libraries imported successfully


## 2. Load Data

In [19]:
# Load training and test datasets
train_path = r'C:\Users\tm0792.STUDENTS.010\OneDrive - UNT System\Competitions\Shell ai Hackathon\shell_ai_hack\data\train.csv'  
test_path = r'C:\Users\tm0792.STUDENTS.010\OneDrive - UNT System\Competitions\Shell ai Hackathon\shell_ai_hack\data\test.csv'  

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

print('✓ Data loaded successfully')
print(f'  Training set: {train_df.shape}')
print(f'  Test set: {test_df.shape}')

✓ Data loaded successfully
  Training set: (2000, 65)
  Test set: (500, 56)
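
The test set has 56 columns, one more than the 55 input features, so it is worth checking which columns the two files actually share. The sketch below uses tiny stand-in frames so that it runs on its own; the `ID` column name is a hypothetical placeholder, not something confirmed by the data shown here.

```python
import pandas as pd

# Stand-in frames; in the notebook these would be train_df and test_df.
# "ID" is a hypothetical name for the extra test column.
train_demo = pd.DataFrame({"Component1_fraction": [0.2], "BlendProperty1": [0.5]})
test_demo = pd.DataFrame({"ID": [1], "Component1_fraction": [0.3]})

# Columns present in only one of the two frames
only_in_train = sorted(set(train_demo.columns) - set(test_demo.columns))
only_in_test = sorted(set(test_demo.columns) - set(train_demo.columns))

# Shared columns, kept in training order
shared = [c for c in train_demo.columns if c in test_demo.columns]

print(f"Only in train: {only_in_train}")
print(f"Only in test:  {only_in_test}")
print(f"Shared:        {shared}")
```

Running the same three lines on `train_df` and `test_df` would show whether the extra test column is an identifier and confirm that the targets are absent from the test file.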


## 3. Display First Few Rows

In [20]:
print('First 5 rows of training data:')
train_df.head()

First 5 rows of training data:


Unnamed: 0,Component1_fraction,Component2_fraction,Component3_fraction,Component4_fraction,Component5_fraction,Component1_Property1,Component2_Property1,Component3_Property1,Component4_Property1,Component5_Property1,...,BlendProperty1,BlendProperty2,BlendProperty3,BlendProperty4,BlendProperty5,BlendProperty6,BlendProperty7,BlendProperty8,BlendProperty9,BlendProperty10
0,0.21,0.0,0.42,0.25,0.12,-0.021782,1.981251,0.020036,0.140315,1.032029,...,0.489143,0.607589,0.32167,-1.236055,1.601132,1.384662,0.30585,0.19346,0.580374,-0.762738
1,0.02,0.33,0.19,0.46,0.0,-0.224339,1.148036,-1.10784,0.149533,-0.354,...,-1.257481,-1.475283,-0.437385,-1.402911,0.147941,-1.143244,-0.439171,-1.379041,-1.280989,-0.503625
2,0.08,0.08,0.18,0.5,0.16,0.457763,0.242591,-0.922492,0.908213,0.972003,...,1.784349,0.450467,0.622687,1.375614,-0.42879,1.161616,0.601289,0.87295,0.66,2.024576
3,0.25,0.42,0.0,0.07,0.26,-0.577734,-0.930826,0.815284,0.447514,0.455717,...,-0.066422,0.48373,-1.865442,-0.046295,-0.16382,-0.209693,-1.840566,0.300293,-0.351336,-1.551914
4,0.26,0.16,0.08,0.5,0.0,0.120415,0.666268,-0.626934,2.725357,0.392259,...,-0.118913,-1.172398,0.301785,-1.787407,-0.493361,-0.528049,0.286344,-0.265192,0.430513,0.735073


## 4. Data Types & Structure

In [21]:
print('\n' + '='*70)
print('DATA STRUCTURE & TYPES')
print('='*70)

print(f'\nTraining set shape: {train_df.shape} (rows, columns)')
print(f'Test set shape: {test_df.shape} (rows, columns)')

print(f'\nData types:')
print(train_df.dtypes.value_counts())

print(f'\nAll numeric? {train_df.select_dtypes(include=[np.number]).shape[1] == train_df.shape[1]}')


DATA STRUCTURE & TYPES

Training set shape: (2000, 65) (rows, columns)
Test set shape: (500, 56) (rows, columns)

Data types:
float64    65
Name: count, dtype: int64

All numeric? True


## 5. Column Breakdown

In [22]:
# Identify feature groups
blend_cols = [col for col in train_df.columns if 'fraction' in col]
component_prop_cols = [col for col in train_df.columns if 'Component' in col and 'Property' in col]
target_cols = [col for col in train_df.columns if 'BlendProperty' in col]

print('\n' + '='*70)
print('COLUMN BREAKDOWN')
print('='*70)

print(f'\nBlend Fractions ({len(blend_cols)} columns):')
print(f'  {blend_cols}')

print(f'\nComponent Properties ({len(component_prop_cols)} columns):')
print(f'  5 components × 10 properties')

print(f'\nTarget Variables ({len(target_cols)} columns):')
print(f'  {target_cols}')

print(f'\n' + '-'*70)
print(f'TOTAL: {len(blend_cols) + len(component_prop_cols) + len(target_cols)} columns')
print(f'  Features (Input): {len(blend_cols) + len(component_prop_cols)} columns')
print(f'  Targets (Output): {len(target_cols)} columns')


COLUMN BREAKDOWN

Blend Fractions (5 columns):
  ['Component1_fraction', 'Component2_fraction', 'Component3_fraction', 'Component4_fraction', 'Component5_fraction']

Component Properties (50 columns):
  5 components × 10 properties

Target Variables (10 columns):
  ['BlendProperty1', 'BlendProperty2', 'BlendProperty3', 'BlendProperty4', 'BlendProperty5', 'BlendProperty6', 'BlendProperty7', 'BlendProperty8', 'BlendProperty9', 'BlendProperty10']

----------------------------------------------------------------------
TOTAL: 65 columns
  Features (Input): 55 columns
  Targets (Output): 10 columns
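
Substring matching works for these column names, but it would silently miss or misfile a column that does not follow the pattern. A stricter alternative, sketched here with regular expressions, also reports anything that matches no group. The column list is rebuilt synthetically from the naming scheme above, and `ID` is a hypothetical extra column added only to show the unmatched case.

```python
import re

# Synthetic column names following the notebook's naming scheme.
# "ID" is a hypothetical extra column, included to show unmatched names.
columns = (
    [f"Component{i}_fraction" for i in range(1, 6)]
    + [f"Component{i}_Property{j}" for j in range(1, 11) for i in range(1, 6)]
    + [f"BlendProperty{j}" for j in range(1, 11)]
    + ["ID"]
)

patterns = {
    "fractions": re.compile(r"Component\d+_fraction"),
    "component_props": re.compile(r"Component\d+_Property\d+"),
    "targets": re.compile(r"BlendProperty\d+"),
}

groups = {name: [] for name in patterns}
unmatched = []
for col in columns:
    for name, pattern in patterns.items():
        # fullmatch requires the whole column name to fit the pattern
        if pattern.fullmatch(col):
            groups[name].append(col)
            break
    else:
        # No pattern matched this column
        unmatched.append(col)

for name, cols in groups.items():
    print(f"{name}: {len(cols)} columns")
print(f"unmatched: {unmatched}")
```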


## 6. Missing Values Check

In [23]:
print('\n' + '='*70)
print('MISSING VALUES CHECK')
print('='*70)

missing_train = train_df.isnull().sum().sum()
missing_test = test_df.isnull().sum().sum()

print(f'\nTraining set missing values: {missing_train}')
print(f'Test set missing values: {missing_test}')

if missing_train == 0 and missing_test == 0:
    print('\n✓ NO MISSING VALUES DETECTED - Data quality is good!')
else:
    print('\n⚠️ Missing values detected')
    if missing_train > 0:
        print(f'  Training: {train_df.isnull().sum()[train_df.isnull().sum() > 0]}')
    if missing_test > 0:
        print(f'  Test: {test_df.isnull().sum()[test_df.isnull().sum() > 0]}')


MISSING VALUES CHECK

Training set missing values: 0
Test set missing values: 0

✓ NO MISSING VALUES DETECTED - Data quality is good!
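
If missing values did appear, a per-column report is usually easier to read than a raw Series printout. The helper below is a small sketch of one way to build it; `missing_report` and the demo frame are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return count and percentage of missing values for affected columns only."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": 100 * counts / len(df),
    })
    # Keep only columns with at least one missing value, worst first
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Made-up frame with gaps in columns "a" and "c"
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, 1.0, 2.0, 3.0],
})
report = missing_report(demo)
print(report)
```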


## 7. Basic Statistics

In [29]:
print('\n' + '='*70)
print('BASIC STATISTICS')
print('='*70)

print(f'\nTotal component property columns: {len(component_prop_cols)}')
print(f'Components: 5')
print(f'Properties per component: {len(component_prop_cols) // 5}')

print('\nBlend Fractions Summary:')
print(train_df[blend_cols].describe().round(4))


fraction_sums = train_df[blend_cols].sum(axis=1)
print(f'\nFraction sum statistics:')
print(f'  Min: {fraction_sums.min():.4f}')
print(f'  Max: {fraction_sums.max():.4f}')
print(f'  Mean: {fraction_sums.mean():.4f}')
print(f'  Std: {fraction_sums.std():.4f}')


BASIC STATISTICS

Total component property columns: 50
Components: 5
Properties per component: 10

Blend Fractions Summary:
       Component1_fraction  Component2_fraction  Component3_fraction  \
count            2000.0000            2000.0000            2000.0000   
mean                0.1807               0.1829               0.1798   
std                 0.1632               0.1637               0.1663   
min                 0.0000               0.0000               0.0000   
25%                 0.0300               0.0400               0.0200   
50%                 0.1400               0.1500               0.1400   
75%                 0.2900               0.3000               0.2900   
max                 0.5000               0.5000               0.5000   

       Component4_fraction  Component5_fraction  
count            2000.0000            2000.0000  
mean                0.3421               0.1145  
std                 0.1411               0.0802  
min                 0.0100

In [30]:
print('\nStatistical summary of component properties:')
print(train_df[component_prop_cols].describe().round(4))


Statistical summary of component properties:
       Component1_Property1  Component2_Property1  Component3_Property1  \
count             2000.0000             2000.0000             2000.0000   
mean                 0.0002               -0.0173                0.0017   
std                  0.9994                1.0064                0.9989   
min                 -2.9437               -1.7189               -3.0087   
25%                 -0.6947               -0.7652               -0.7019   
50%                  0.0120               -0.0302                0.0213   
75%                  0.6857                0.6540                0.6731   
max                  2.9811                3.0511                2.8689   

       Component4_Property1  Component5_Property1  Component1_Property2  \
count             2000.0000             2000.0000             2000.0000   
mean                -0.0047               -0.0183               -0.0059   
std                  1.0069                1.0093    

In [25]:
print('\nTarget Variables Summary:')
print(train_df[target_cols].describe().round(4))


Target Variables Summary:
       BlendProperty1  BlendProperty2  BlendProperty3  BlendProperty4  \
count       2000.0000       2000.0000       2000.0000       2000.0000   
mean          -0.0169         -0.0021         -0.0144         -0.0061   
std            0.9938          1.0045          0.9994          1.0092   
min           -2.5509         -3.0798         -3.0416         -2.8357   
25%           -0.7661         -0.7351         -0.6242         -0.7835   
50%           -0.0211          0.0017          0.1461         -0.0282   
75%            0.7148          0.7238          0.7276          0.6647   
max            2.8566          2.7692          1.6386          3.7696   

       BlendProperty5  BlendProperty6  BlendProperty7  BlendProperty8  \
count       2000.0000       2000.0000       2000.0000       2000.0000   
mean          -0.0152         -0.0035         -0.0136         -0.0172   
std            0.9865          1.0091          1.0006          0.9988   
min           -1.7301  

## 8. Sanity Checks

In [26]:
print('\n' + '='*70)
print('SANITY CHECKS')
print('='*70)

# Check fraction sums
fraction_sums = train_df[blend_cols].sum(axis=1)
print(f'\n1. Blend Fractions Sum:')
print(f'   Mean: {fraction_sums.mean():.4f}')
print(f'   Min: {fraction_sums.min():.4f}')
print(f'   Max: {fraction_sums.max():.4f}')
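# Validate every row, not just the mean: offsetting errors could average out to 1.0
fractions_valid = np.allclose(fraction_sums, 1.0, atol=0.01) or np.allclose(fraction_sums, 100.0, atol=1.0)
print('   ✓ Valid blend composition' if fractions_valid else '   ⚠️ Check fractions')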

# Check duplicates
print(f'\n2. Duplicate Rows:')
print(f'   Training: {train_df.duplicated().sum()}')
print(f'   Test: {test_df.duplicated().sum()}')

# Check infinite values
inf_train = np.isinf(train_df.select_dtypes(include=[np.number])).sum().sum()
inf_test = np.isinf(test_df.select_dtypes(include=[np.number])).sum().sum()
print(f'\n3. Infinite Values:')
print(f'   Training: {inf_train}')
print(f'   Test: {inf_test}')

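# Check negative values. Informational only: the component properties look
# standardized (mean ≈ 0, std ≈ 1), so negative values are expected.
print(f'\n4. Negative Values in Features:')
neg_count_train = (train_df[blend_cols + component_prop_cols] < 0).sum().sum()
neg_count_test = (test_df[blend_cols + component_prop_cols] < 0).sum().sum()
print(f'   Training: {neg_count_train}')
print(f'   Test: {neg_count_test}')

# Report success only if the hard checks above actually passed
sums_ok = np.allclose(fraction_sums, 1.0, atol=0.01) or np.allclose(fraction_sums, 100.0, atol=1.0)
no_dupes = train_df.duplicated().sum() == 0 and test_df.duplicated().sum() == 0
no_infs = inf_train == 0 and inf_test == 0
print('\n✓ All sanity checks passed!' if sums_ok and no_dupes and no_infs else '\n⚠️ Some sanity checks failed')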


SANITY CHECKS

1. Blend Fractions Sum:
   Mean: 1.0000
   Min: 1.0000
   Max: 1.0000
   ✓ Valid blend composition

2. Duplicate Rows:
   Training: 0
   Test: 0

3. Infinite Values:
   Training: 0
   Test: 0

4. Negative Values in Features:
   Training: 49907
   Test: 12430

✓ All sanity checks passed!
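
The large negative-value counts are not a data problem: the summaries above show component properties with mean near 0 and standard deviation near 1, which suggests the values were standardized. A quick way to test that reading per column is sketched below. The helper name `looks_standardized` and the tolerance of 0.1 are assumptions made for illustration, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd

def looks_standardized(series, tol=0.1):
    """True if the column has mean close to 0 and sample std close to 1."""
    return bool(abs(series.mean()) < tol and abs(series.std() - 1.0) < tol)

# Synthetic demo data: one z-scored column and one fraction-like column
raw = pd.Series(np.arange(2000, dtype=float))
demo = pd.DataFrame({
    "Component1_Property1": (raw - raw.mean()) / raw.std(),
    "Component1_fraction": np.linspace(0.0, 0.5, 2000),
})

flags = {col: looks_standardized(demo[col]) for col in demo.columns}
for col, flag in flags.items():
    print(f"{col}: standardized-looking = {flag}")
```

Applied to `train_df[component_prop_cols]`, a check like this would indicate which columns can legitimately be negative, so the negative-value count can be read as informational rather than as a failure.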


## 9. Next Steps

In [27]:
print('\n' + '='*70)
print('SUMMARY')
print('='*70)

print(f'\n✓ Data is ready for analysis')
print(f'\nNext steps:')
print(f'  1. Run 02_eda_analysis.ipynb for detailed exploration')
print(f'  2. Review insights and patterns')
print(f'  3. Move to 03_model_experiments.ipynb for feature engineering')

print(f'\n' + '='*70)


SUMMARY

✓ Data is ready for analysis

Next steps:
  1. Run 02_eda_analysis.ipynb for detailed exploration
  2. Review insights and patterns
  3. Move to 03_model_experiments.ipynb for feature engineering

