# Construction Material Recommendation Model

This notebook creates a recommendation model for a Construction Material Recommendation System. The model takes project inputs (e.g., category, durability, cost) and recommends materials by predicting a suitability score for each material. We'll use a `RandomForestRegressor` from scikit-learn, train it on synthetic data, and save the model as a pickle file (`recommendation_model.pkl`).

## Steps:
1. Import required libraries.
2. Load and preprocess the material data.
3. Define the material features.
4. Generate synthetic project data for training.
5. Define the synthetic scoring function.
6. Generate the training data (project-material pairs and their scores).
7. Define and train the model using a pipeline.
8. Save the model to a pickle file.

## Step 1: Import Libraries



In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import pickle
import random

## Step 2: Load and Preprocess Material Data

We load the material data from a CSV file (`construction_material_recommendation_system.csv`), validate that the required columns are present, and preprocess it: numeric columns are coerced to numbers (entries that cannot be parsed become 0), and fire resistance ratings A to D are mapped to the values 4 to 1 (any other rating becomes 0).

In [2]:
# Load the material data
csv_path = r'E:\material-recommendation-system\construction_material_recommendation_system.csv'
material_data = pd.read_csv(csv_path)

# Validate required columns
required_columns = [
    'Material ID', 'Material Name', 'Category', 'Durability Rating',
    'Cost per Unit ($)', 'Environmental Suitability', 'Supplier Availability',
    'Lead Time (days)', 'Sustainability Score', 'Thermal Conductivity (W/m·K)',
    'Compressive Strength (MPa)', 'Fire Resistance Rating'
]
missing_columns = [col for col in required_columns if col not in material_data.columns]
if missing_columns:
    raise ValueError(f"Missing required columns in CSV: {missing_columns}")

# Clean and preprocess material data
numeric_columns = [
    'Durability Rating', 'Cost per Unit ($)', 'Lead Time (days)',
    'Sustainability Score', 'Thermal Conductivity (W/m·K)', 'Compressive Strength (MPa)'
]
for col in numeric_columns:
    material_data[col] = pd.to_numeric(material_data[col], errors='coerce').fillna(0)

# Map fire resistance ratings to numerical values
fire_rating_map = {'A': 4, 'B': 3, 'C': 2, 'D': 1}
material_data['Fire Resistance'] = material_data['Fire Resistance Rating'].map(fire_rating_map).fillna(0)

# Display the first few rows of the preprocessed data
material_data.head()

| | Material ID | Material Name | Category | Durability Rating | Cost per Unit ($) | Environmental Suitability | Supplier Availability | Lead Time (days) | Sustainability Score | Thermal Conductivity (W/m·K) | Compressive Strength (MPa) | Fire Resistance Rating | Fire Resistance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Plastic | Insulation | 8 | 768 | Coastal | High | 12 | 7 | 1.47 | 269 | C | 2 |
| 1 | 2 | Brick | Plumbing | 10 | 417 | Dry | Low | 9 | 4 | 1.98 | 326 | A | 4 |
| 2 | 3 | Aluminum | Finishing | 10 | 132 | Coastal | High | 22 | 8 | 0.54 | 61 | D | 1 |
| 3 | 4 | Wood | Structural | 7 | 518 | Humid | High | 29 | 8 | 0.92 | 227 | A | 4 |
| 4 | 5 | Concrete | Insulation | 9 | 834 | Humid | Medium | 18 | 3 | 0.12 | 324 | B | 3 |
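
To illustrate what the cleaning step does to malformed values, here is a minimal sketch on a hypothetical three-row frame (the rows are invented for illustration and do not come from the CSV). Non-numeric entries become `NaN` under `errors='coerce'` and are then replaced by 0, and ratings missing from the map also fall back to 0.

```python
import pandas as pd

# Hypothetical toy rows, invented for illustration (not taken from the CSV)
toy = pd.DataFrame({
    'Durability Rating': ['8', 'n/a', '10'],
    'Fire Resistance Rating': ['A', 'E', 'C'],
})

# Non-numeric strings are coerced to NaN, then filled with 0
toy['Durability Rating'] = pd.to_numeric(toy['Durability Rating'], errors='coerce').fillna(0)

# Ratings outside the map (here 'E') become NaN, then 0
fire_rating_map = {'A': 4, 'B': 3, 'C': 2, 'D': 1}
toy['Fire Resistance'] = toy['Fire Resistance Rating'].map(fire_rating_map).fillna(0)

print(toy)
```

A 0 produced this way is indistinguishable from a genuine value of 0, so it may be worth counting the coerced entries before filling them.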


## Step 3: Define Material Features

Extract the relevant features from the material data that will be used to compute similarity scores with project inputs.

In [3]:
# Define features for the model
material_features = material_data[[
    'Category', 'Environmental Suitability', 'Supplier Availability',
    'Fire Resistance', 'Durability Rating', 'Cost per Unit ($)', 'Lead Time (days)',
    'Sustainability Score', 'Thermal Conductivity (W/m·K)', 'Compressive Strength (MPa)'
]]

# Display the first few rows of material features
material_features.head()

| | Category | Environmental Suitability | Supplier Availability | Fire Resistance | Durability Rating | Cost per Unit ($) | Lead Time (days) | Sustainability Score | Thermal Conductivity (W/m·K) | Compressive Strength (MPa) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Insulation | Coastal | High | 2 | 8 | 768 | 12 | 7 | 1.47 | 269 |
| 1 | Plumbing | Dry | Low | 4 | 10 | 417 | 9 | 4 | 1.98 | 326 |
| 2 | Finishing | Coastal | High | 1 | 10 | 132 | 22 | 8 | 0.54 | 61 |
| 3 | Structural | Humid | High | 4 | 7 | 518 | 29 | 8 | 0.92 | 227 |
| 4 | Insulation | Humid | Medium | 3 | 9 | 834 | 18 | 3 | 0.12 | 324 |


## Step 4: Generate Synthetic Project Data

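Since we don't have real project data, we simulate project inputs. Categorical values are sampled from the values present in the material data, and numerical values are drawn uniformly from fixed ranges. Some of these ranges are narrower than the material data itself: cost is drawn from 10 to 500 and compressive strength from 10 to 100, while the material preview above already shows costs up to 834 and compressive strengths up to 326. This synthetic data will be used to train the model.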

In [4]:
# Simulate project inputs for training data
np.random.seed(42)
num_samples = 100  # Number of synthetic projects (reduce for speed)
materials_per_project = 10  # Number of materials to sample per project
synthetic_projects = pd.DataFrame({
    'category': np.random.choice(material_data['Category'].unique(), num_samples),
    'environmental_suitability': np.random.choice(material_data['Environmental Suitability'].unique(), num_samples),
    'supplier_availability': np.random.choice(material_data['Supplier Availability'].unique(), num_samples),
    'fire_resistance': np.random.choice([4, 3, 2, 1], num_samples),
    'durability': np.random.uniform(1, 10, num_samples),
    'cost': np.random.uniform(10, 500, num_samples),
    'lead_time': np.random.uniform(1, 30, num_samples),
    'sustainability': np.random.uniform(1, 10, num_samples),
    'thermal_conductivity': np.random.uniform(0.1, 2.0, num_samples),
    'compressive_strength': np.random.uniform(10, 100, num_samples)
})

# Display the first few rows of synthetic projects
synthetic_projects.head()

| | category | environmental_suitability | supplier_availability | fire_resistance | durability | cost | lead_time | sustainability | thermal_conductivity | compressive_strength |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Structural | All | Medium | 2 | 6.011211 | 324.785446 | 15.286128 | 4.631346 | 0.342403 | 81.462267 |
| 1 | Electrical | Humid | High | 1 | 9.425393 | 22.990542 | 1.329256 | 2.206137 | 0.639174 | 65.806548 |
| 2 | Finishing | Coastal | Low | 3 | 7.264268 | 297.030035 | 14.591159 | 1.259044 | 0.789856 | 58.011498 |
| 3 | Electrical | All | High | 4 | 6.130551 | 470.712818 | 2.632795 | 7.796235 | 1.327243 | 90.450332 |
| 4 | Electrical | All | High | 2 | 1.874588 | 291.982347 | 4.44572 | 6.582786 | 1.184479 | 80.973749 |
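
The numerical ranges in the cell above are hard-coded. One alternative, sketched here under the assumption that projects should span the same ranges as the catalogue, is to read each column's minimum and maximum from the material data. The frame `toy_materials` is a three-row excerpt of the material preview, and the helper `sample_projects` is a hypothetical name introduced for this example.

```python
import numpy as np
import pandas as pd

# Three-row excerpt of the material preview, standing in for material_data
toy_materials = pd.DataFrame({
    'Category': ['Insulation', 'Plumbing', 'Finishing'],
    'Cost per Unit ($)': [768, 417, 132],
    'Compressive Strength (MPa)': [269, 326, 61],
})

def sample_projects(materials, column_map, n, seed=42):
    """Sample n synthetic projects, drawing each numeric feature uniformly
    between the observed min and max of the matching material column."""
    rng = np.random.default_rng(seed)
    data = {'category': rng.choice(materials['Category'].unique(), n)}
    for project_col, material_col in column_map.items():
        low, high = materials[material_col].min(), materials[material_col].max()
        data[project_col] = rng.uniform(low, high, n)
    return pd.DataFrame(data)

projects = sample_projects(
    toy_materials,
    {'cost': 'Cost per Unit ($)', 'compressive_strength': 'Compressive Strength (MPa)'},
    n=5,
)
print(projects)
```

Whether projects should really cover the full catalogue range is a modelling decision; the sketch only shows how the ranges could be tied to the data.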


## Step 5: Compute Synthetic Scores

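Define a function to compute a synthetic score for each project-material pair based on feature similarity. Each exact match on a categorical feature adds 20 points. Each numerical feature adds its weight multiplied by one minus the normalized absolute difference, so closer values score higher. Because a difference can exceed the normalizing constant, a single feature's contribution can be negative; only the total is clamped at 0.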

In [5]:
# Compute synthetic scores based on feature similarity
def compute_synthetic_score(project, material):
    score = 0
    # Categorical features (exact match gets higher score)
    categorical_features = ['category', 'environmental_suitability', 'supplier_availability']
    for feat in categorical_features:
        score += 20 if project[feat] == material[feat] else 0
    
    # Numerical features (closer values get higher score)
    numerical_features = [
        ('fire_resistance', 'fire_resistance', 10, 4),
        ('durability', 'durability', 10, 10),
        ('cost', 'cost', 5, 490),
        ('lead_time', 'lead_time', 5, 30),
        ('sustainability', 'sustainability', 10, 10),
        ('thermal_conductivity', 'thermal_conductivity', 5, 2),
        ('compressive_strength', 'compressive_strength', 10, 100)
    ]
    for proj_feat, mat_feat, weight, max_diff in numerical_features:
        diff = abs(project[proj_feat] - material[mat_feat])
        normalized_diff = diff / max_diff
        score += weight * (1 - normalized_diff)
    
    return max(0, score)  # Ensure score is non-negative
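
As a quick check of the scoring rule, the sketch below restates the function in compact form (`score_pair` is a name introduced here) and evaluates it on two hypothetical project-material pairs. An identical pair reaches the maximum of 115: 3 × 20 for the categorical matches plus the numerical weights 10 + 10 + 5 + 5 + 10 + 5 + 10 = 55.

```python
# Compact restatement of compute_synthetic_score, for illustration only
CATEGORICAL = ['category', 'environmental_suitability', 'supplier_availability']
NUMERICAL = {  # feature: (weight, max_diff)
    'fire_resistance': (10, 4), 'durability': (10, 10), 'cost': (5, 490),
    'lead_time': (5, 30), 'sustainability': (10, 10),
    'thermal_conductivity': (5, 2), 'compressive_strength': (10, 100),
}

def score_pair(project, material):
    score = sum(20 for feat in CATEGORICAL if project[feat] == material[feat])
    for feat, (weight, max_diff) in NUMERICAL.items():
        score += weight * (1 - abs(project[feat] - material[feat]) / max_diff)
    return max(0, score)

# Hypothetical project (values invented for illustration)
project = {
    'category': 'Structural', 'environmental_suitability': 'Humid',
    'supplier_availability': 'High', 'fire_resistance': 4, 'durability': 7,
    'cost': 500, 'lead_time': 29, 'sustainability': 8,
    'thermal_conductivity': 0.92, 'compressive_strength': 100,
}

# An identical material scores the maximum: 3 * 20 + 55 = 115
print(score_pair(project, dict(project)))

# Same material but with compressive strength 300 MPa: the difference (200)
# is twice max_diff (100), so that feature contributes 10 * (1 - 2) = -10
stronger = dict(project, compressive_strength=300)
print(score_pair(project, stronger))
```

Since the rule uses the absolute difference, a material that is stronger than requested is penalized exactly as much as one that is weaker by the same amount.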

## Step 6: Generate Training Data

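Pair each synthetic project with a random subset of materials (at most `materials_per_project`) to create training data. For each pair, we compute a synthetic score as the target variable. The subset is drawn with `random.sample`, which uses Python's `random` module; `np.random.seed` does not seed that module, so the sampled materials vary between runs unless `random.seed` is also called.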

In [6]:
# Generate training data (sampled for speed)
X_train = []
y_train = []
for i in range(num_samples):
    project = synthetic_projects.iloc[i].to_dict()
    sampled_indices = random.sample(range(len(material_features)), min(materials_per_project, len(material_features)))
    for j in sampled_indices:
        material = material_features.iloc[j].to_dict()
        # Combine project and material features into a single input row
        input_row = {
            'category': project['category'],
            'environmental_suitability': project['environmental_suitability'],
            'supplier_availability': project['supplier_availability'],
            'fire_resistance': project['fire_resistance'],
            'durability': project['durability'],
            'cost': project['cost'],
            'lead_time': project['lead_time'],
            'sustainability': project['sustainability'],
            'thermal_conductivity': project['thermal_conductivity'],
            'compressive_strength': project['compressive_strength'],
            'material_category': material['Category'],
            'material_env_suitability': material['Environmental Suitability'],
            'material_supplier_availability': material['Supplier Availability'],
            'material_fire_resistance': material['Fire Resistance'],
            'material_durability': material['Durability Rating'],
            'material_cost': material['Cost per Unit ($)'],
            'material_lead_time': material['Lead Time (days)'],
            'material_sustainability': material['Sustainability Score'],
            'material_thermal_conductivity': material['Thermal Conductivity (W/m·K)'],
            'material_compressive_strength': material['Compressive Strength (MPa)']
        }
        X_train.append(input_row)
        # Prepare lowercase-keyed material dict for scoring
        material_dict = {
            'category': material['Category'],
            'environmental_suitability': material['Environmental Suitability'],
            'supplier_availability': material['Supplier Availability'],
            'fire_resistance': material['Fire Resistance'],
            'durability': material['Durability Rating'],
            'cost': material['Cost per Unit ($)'],
            'lead_time': material['Lead Time (days)'],
            'sustainability': material['Sustainability Score'],
            'thermal_conductivity': material['Thermal Conductivity (W/m·K)'],
            'compressive_strength': material['Compressive Strength (MPa)']
        }
        score = compute_synthetic_score(project, material_dict)
        y_train.append(score)

# Convert training data to DataFrame
X_train_df = pd.DataFrame(X_train)

# Display the shape of the training data
print(f"Training data shape: {X_train_df.shape}")
X_train_df.head()

Training data shape: (1000, 20)


| | category | environmental_suitability | supplier_availability | fire_resistance | durability | cost | lead_time | sustainability | thermal_conductivity | compressive_strength | material_category | material_env_suitability | material_supplier_availability | material_fire_resistance | material_durability | material_cost | material_lead_time | material_sustainability | material_thermal_conductivity | material_compressive_strength |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Structural | All | Medium | 2 | 6.011211 | 324.785446 | 15.286128 | 4.631346 | 0.342403 | 81.462267 | Plumbing | Humid | High | 4 | 7 | 351 | 17 | 6 | 1.15 | 156 |
| 1 | Structural | All | Medium | 2 | 6.011211 | 324.785446 | 15.286128 | 4.631346 | 0.342403 | 81.462267 | Electrical | All | Low | 3 | 7 | 336 | 14 | 4 | 1.65 | 293 |
| 2 | Structural | All | Medium | 2 | 6.011211 | 324.785446 | 15.286128 | 4.631346 | 0.342403 | 81.462267 | Plumbing | Coastal | High | 3 | 8 | 315 | 27 | 3 | 1.73 | 160 |
| 3 | Structural | All | Medium | 2 | 6.011211 | 324.785446 | 15.286128 | 4.631346 | 0.342403 | 81.462267 | Insulation | All | High | 2 | 7 | 403 | 14 | 6 | 0.43 | 35 |
| 4 | Structural | All | Medium | 2 | 6.011211 | 324.785446 | 15.286128 | 4.631346 | 0.342403 | 81.462267 | Plumbing | Coastal | High | 2 | 9 | 963 | 7 | 10 | 1.71 | 281 |
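
The pairing loop above can also be expressed with pandas operations. The following is a sketch of one alternative, not the notebook's method: the helper `pair_projects_with_materials` and the two small frames are hypothetical, and a per-project `random_state` is assumed so that the sampling is reproducible.

```python
import pandas as pd

# Hypothetical stand-ins (values invented for illustration)
projects = pd.DataFrame({'category': ['Structural', 'Finishing'], 'cost': [320.0, 45.0]})
materials = pd.DataFrame({
    'category': ['Structural', 'Plumbing', 'Finishing', 'Insulation'],
    'cost': [518, 417, 132, 768],
})

def pair_projects_with_materials(projects, materials, per_project, seed=42):
    """Pair every project with `per_project` randomly sampled materials."""
    frames = []
    k = min(per_project, len(materials))
    for i in range(len(projects)):
        # A different but reproducible sample for each project
        sampled = materials.sample(n=k, random_state=seed + i).add_prefix('material_')
        # Repeat the project's values on every sampled material row
        frames.append(sampled.assign(**projects.iloc[i].to_dict()))
    paired = pd.concat(frames, ignore_index=True)
    material_cols = [c for c in paired.columns if c.startswith('material_')]
    # Project columns first, then material columns
    return paired[list(projects.columns) + material_cols]

paired = pair_projects_with_materials(projects, materials, per_project=3)
print(paired)
```

The result has one row per project-material pair, with project columns first and `material_`-prefixed columns after them, mirroring the layout of `X_train_df`.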


## Step 7: Define and Train the Model

Create a pipeline that preprocesses the data (one-hot encoding for categorical features) and trains a `RandomForestRegressor` to predict the suitability scores.

In [7]:
# Define categorical and numerical columns for preprocessing
categorical_cols = [
    'category', 'environmental_suitability', 'supplier_availability',
    'material_category', 'material_env_suitability', 'material_supplier_availability'
]
numerical_cols = [
    'fire_resistance', 'durability', 'cost', 'lead_time', 'sustainability',
    'thermal_conductivity', 'compressive_strength',
    'material_fire_resistance', 'material_durability', 'material_cost',
    'material_lead_time', 'material_sustainability',
    'material_thermal_conductivity', 'material_compressive_strength'
]

# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_cols),
        ('num', 'passthrough', numerical_cols)
    ])

# Create model pipeline
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

# Train the model
model.fit(X_train_df, y_train)

print("Model training completed.")

Model training completed.
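
The cell above fits on all generated pairs, so it gives no indication of how well the forest reproduces the synthetic scores on unseen pairs. A hold-out check could look like the sketch below. It runs on hypothetical two-feature toy data rather than `X_train_df`, uses a simplified score (20 for a category match plus a cost-closeness term), and omits `drop='first'` from the encoder for brevity; all of these are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy pairs with a simplified synthetic score
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame({
    'category': rng.choice(['Structural', 'Finishing'], n),
    'material_category': rng.choice(['Structural', 'Finishing'], n),
    'cost': rng.uniform(10, 500, n),
    'material_cost': rng.uniform(10, 500, n),
})
y = (20 * (X['category'] == X['material_category'])
     + 5 * (1 - (X['cost'] - X['material_cost']).abs() / 490))

# Keep 20% of the pairs aside for validation
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['category', 'material_category']),
        ('num', 'passthrough', ['cost', 'material_cost']),
    ])),
    ('regressor', RandomForestRegressor(n_estimators=50, random_state=42)),
])
pipeline.fit(X_fit, y_fit)

# Mean absolute error on pairs the model has not seen
mae = mean_absolute_error(y_val, pipeline.predict(X_val))
print(f"Validation MAE: {mae:.2f}")
```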


## Step 8: Save the Model

Finally, we save the trained model to a pickle file (`recommendation_model.pkl`) for use in the Flask application.

In [8]:
# Save the model to a pickle file
with open('recommendation_model.pkl', 'wb') as f:
    pickle.dump(model, f)

print("Model trained and saved as recommendation_model.pkl")

Model trained and saved as recommendation_model.pkl
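
Once saved, the model could be used to rank materials for a new project by scoring one row per candidate material. The sketch below is a self-contained illustration under several assumptions: it trains a tiny stand-in pipeline on invented two-feature pairs, round-trips it through `pickle` in memory instead of reading `recommendation_model.pkl`, and uses a hypothetical helper `rank_materials`. How the Flask application actually calls the model is not shown in this notebook.

```python
import pickle
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training pairs and scores (values invented for illustration)
pairs = pd.DataFrame({
    'category': ['Structural', 'Structural', 'Finishing', 'Finishing'] * 5,
    'material_category': ['Structural', 'Finishing', 'Structural', 'Finishing'] * 5,
    'cost': [100.0, 100.0, 300.0, 300.0] * 5,
    'material_cost': [120.0, 400.0, 90.0, 310.0] * 5,
})
scores = [25.0, 2.0, 3.0, 25.0] * 5

toy_model = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['category', 'material_category']),
        ('num', 'passthrough', ['cost', 'material_cost']),
    ])),
    ('regressor', RandomForestRegressor(n_estimators=50, random_state=0)),
])
toy_model.fit(pairs, scores)

# Round-trip through pickle in memory, standing in for the saved file
restored = pickle.loads(pickle.dumps(toy_model))

def rank_materials(model, project, catalogue, feature_order):
    """Score every catalogue row against one project and sort best first."""
    # Prefix the material columns and repeat the project values on each row
    rows = catalogue.add_prefix('material_').assign(**project)
    ranked = catalogue.copy()
    ranked['predicted_score'] = model.predict(rows[feature_order])
    return ranked.sort_values('predicted_score', ascending=False)

# Hypothetical catalogue and project
catalogue = pd.DataFrame({
    'name': ['Wood', 'Aluminum'],
    'category': ['Structural', 'Finishing'],
    'cost': [120.0, 400.0],
})
project = {'category': 'Structural', 'cost': 110.0}

ranked = rank_materials(restored, project, catalogue, list(pairs.columns))
print(ranked)
```

Pickle files should only be loaded from trusted sources, and the scikit-learn version used for loading should match the one used for saving.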
