## Dataset Creation  
- This section builds a synthetic dataset for training a model that predicts the combinational logic depth of a signal.  
- Each row records the following RTL parameters:  
  - Fan-In and Fan-Out  
  - Gate Count  
  - Gate Types  
  - Combinational Logic Depth (target variable)  
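- Each row is structured like one entry of a real-world RTL timing report. The values are sampled from weighted distributions rather than measured on a synthesized netlist.  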

In [1]:
import random
import pandas as pd
import numpy as np

def weighted_choice(weights):
    """ Returns a value based on given probability distribution """
    keys, probabilities = zip(*weights.items())
    probabilities = np.array(probabilities) / sum(probabilities)  
    return np.random.choice(keys, p=probabilities)

def generate_random_expression(inputs, depth):
    """ Generate a random logic expression with given depth. """
    if depth == 1:
        return random.choice(inputs)
    else:
        gate = random.choice(["AND", "OR", "NOT", "NAND", "NOR", "XOR"])
        if gate == "NOT":
            return f"~({generate_random_expression(inputs, depth - 1)})"
        else:
            return f"({generate_random_expression(inputs, depth - 1)} {gate} {generate_random_expression(inputs, depth - 1)})"

def generate_rtl_module(module_name, num_inputs, max_depth):
    """ Generate a synthetic RTL module with random logic. """
    inputs = [f"input{i}" for i in range(num_inputs)]
    output = "y"
    expression = generate_random_expression(inputs, max_depth)
    
    rtl_code = f"""
module {module_name} (
    input {', '.join(inputs)},
    output {output}
);
    assign {output} = {expression};
endmodule
    """
    return rtl_code

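def extract_features(rtl_code, num_inputs, max_depth):
    """ Build one synthetic feature row for a generated module. """
    # Gate count is derived from depth and input count (not counted in rtl_code), capped at 30.
    gate_count = min(int(max_depth * num_inputs / random.uniform(1.5, 2.5)), 30)

    # Target: requested depth plus small noise (0 is twice as likely as -1 or +1).
    logic_depth = max_depth + random.choice([-1, 0, 0, 1])

    # Substring test: "AND" also matches inside "NAND", and "OR" inside "NOR" and "XOR".
    # NOT is emitted as "~" in the expression, so it is never detected here.
    gate_types = []
    for gate in ["AND", "OR", "NOT", "NAND", "NOR", "XOR"]:
        if gate in rtl_code:
            gate_types.append(gate)

    return {
        "Module Name": "synthetic_module",  # same placeholder name for every row
        "Signal Name": "y",
        "Fan-In": num_inputs,
        "Fan-Out": weighted_choice({1: 15, 2: 18, 3: 17, 4: 15, 5: 12, 6: 8, 7: 6, 8: 4, 9: 3, 10: 2}),
        "Gate Count": gate_count,
        "Logic Depth": max(2, min(logic_depth, 8)),  # clamp the target to the range 2..8
        "Gate Types": ", ".join(gate_types)
    }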


dataset = []
for i in range(500):
    num_inputs = weighted_choice({2: 5, 3: 10, 4: 15, 5: 20, 6: 20, 7: 15, 8: 8, 9: 4, 10: 2, 11: 0.5, 12: 0.5})
    max_depth = weighted_choice({2: 18, 3: 28, 4: 25, 5: 15, 6: 8, 7: 4, 8: 2})  # Slightly adjusted

    rtl_code = generate_rtl_module(f"module_{i}", num_inputs, max_depth)
    features = extract_features(rtl_code, num_inputs, max_depth)
    dataset.append(features)

df = pd.DataFrame(dataset)


print("Fan-In Distribution:")
print(df["Fan-In"].value_counts().sort_index())

print("\nFan-Out Distribution:")
print(df["Fan-Out"].value_counts().sort_index())

print("\nGate Count Distribution:")
print(df["Gate Count"].value_counts().sort_index())

print("\nLogic Depth Distribution:")
print(df["Logic Depth"].value_counts().sort_index())

print("\nFirst 5 Rows:")
print(df.head())

Fan-In Distribution:
Fan-In
2      31
3      42
4      66
5      89
6     113
7      87
8      42
9      18
10      8
11      2
12      2
Name: count, dtype: int64

Fan-Out Distribution:
Fan-Out
1     79
2     92
3     91
4     76
5     47
6     38
7     35
8     18
9      9
10    15
Name: count, dtype: int64

Gate Count Distribution:
Gate Count
1      2
2     12
3     22
4     34
5     35
6     35
7     28
8     40
9     41
10    42
11    25
12    29
13    24
14    14
15    16
16    18
17    14
18    11
19     8
20     5
21     5
22     5
23     6
24     3
25     2
26     6
27     4
28     3
29     3
30     8
Name: count, dtype: int64

Logic Depth Distribution:
Logic Depth
2    102
3    122
4    110
5     86
6     41
7     24
8     15
Name: count, dtype: int64

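First 5 Rows:
        Module Name Signal Name  Fan-In  Fan-Out  Gate Count  Logic Depth  \
0  synthetic_module           y       9       10           9            2   
1  synthetic_module           y       6        6          10            3   
2  synthetic_module           y       8        7           9            2   
3  synthetic_module           y       8        3          14            3   
4  synthetic_module           y       3        6           4            2   

  Gate Types  
0  AND, NAND  
1  AND, NAND  
2  AND, NAND  
3  AND, NAND  
4        AND  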
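
The `if gate in rtl_code` test in `extract_features` is a substring match, so a module that only uses `NAND` is recorded as `AND, NAND`, and `NOT` (written as `~` in the expression) is never recorded. A row with no detected gate gets an empty string, which pandas reads back from CSV as a missing value; this appears to be the source of the `Gate Types_nan` column later in the notebook. A minimal sketch of a token-based alternative follows. `extract_gate_types` is a hypothetical helper and not part of the notebook, and all recorded outputs were produced with the original substring test.

```python
import re

GATES = ["AND", "OR", "NOT", "NAND", "NOR", "XOR"]

def extract_gate_types(expression):
    """Return the gate names used in an expression, in GATES order."""
    # \b word boundaries keep "AND" from matching inside "NAND",
    # and "OR" from matching inside "NOR" or "XOR".
    found = set(re.findall(r"\b(?:AND|OR|NAND|NOR|XOR)\b", expression))
    # The generator writes NOT as "~", so detect it by that symbol.
    if "~" in expression:
        found.add("NOT")
    return [gate for gate in GATES if gate in found]

print(extract_gate_types("(input0 NAND ~(input1))"))
```

Switching to this helper would change the set of `Gate Types` categories, so the one-hot feature names and importance scores below would no longer match.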

In [2]:

df.to_csv("final_synthetic_rtl_dataset.csv", index=False)
print("Dataset saved to final_synthetic_rtl_dataset.csv")

Dataset saved to final_synthetic_rtl_dataset.csv


In [3]:

print("First 5 Rows of the Dataset:")
print(df.head())

First 5 Rows of the Dataset:
        Module Name Signal Name  Fan-In  Fan-Out  Gate Count  Logic Depth  \
0  synthetic_module           y       9       10           9            2   
1  synthetic_module           y       6        6          10            3   
2  synthetic_module           y       8        7           9            2   
3  synthetic_module           y       8        3          14            3   
4  synthetic_module           y       3        6           4            2   

  Gate Types  
0  AND, NAND  
1  AND, NAND  
2  AND, NAND  
3  AND, NAND  
4        AND  

## Feature Engineering  
- Feature engineering converts the raw dataset columns into numeric inputs that the models can consume.  
- Key transformations include:  
  - Encoding categorical values such as gate types into numerical features  
  - Normalizing numerical features for consistency  
  - Identifying and handling outliers that may affect model performance  
- Feature correlation analysis is conducted to understand relationships between input variables and the target variable.  
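
The correlation and outlier steps listed above are not shown in the preprocessing cell, so the following is only a sketch of how they could look. The `sample` frame is a hypothetical stand-in that reuses the dataset's column names, and the 1.5 × IQR rule is an assumed outlier criterion, not one stated in the notebook.

```python
import pandas as pd

# Hypothetical mini-sample with the same numeric columns as the dataset.
sample = pd.DataFrame({
    "Fan-In": [2, 3, 4, 5, 6, 7, 8, 12],
    "Fan-Out": [1, 2, 2, 3, 4, 1, 5, 10],
    "Gate Count": [2, 4, 6, 9, 11, 14, 17, 30],
    "Logic Depth": [2, 2, 3, 3, 4, 5, 6, 8],
})

# Pearson correlation of each feature with the target variable.
correlations = sample.corr()["Logic Depth"].drop("Logic Depth")
print(correlations.sort_values(ascending=False))

def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Rows whose gate count falls outside the IQR fence.
outliers = iqr_outlier_mask(sample["Gate Count"])
print(sample[outliers])
```

On the real data, the same two calls could be run on `df` right after it is loaded from the CSV.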

In [4]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer


df = pd.read_csv("final_synthetic_rtl_dataset.csv")

X = df.drop(columns=["Logic Depth"])
y = df["Logic Depth"]

categorical_features = ["Gate Types"]
numerical_features = ["Fan-In", "Fan-Out", "Gate Count"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_features),  
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features), 
    ]
)

X_processed = preprocessor.fit_transform(X)

import joblib
joblib.dump(preprocessor, "rtl_preprocessor.pkl")

['rtl_preprocessor.pkl']

In [5]:

print("Shape of Processed Data (X_processed):", X_processed.shape)
print("\nFirst 5 Rows of Processed Data (Numerical Features):")
print(X_processed[:5, :len(numerical_features)])  

print("\nFirst 5 Rows of Processed Data (Categorical Features):")
print(X_processed[:5, len(numerical_features):])  

Shape of Processed Data (X_processed): (500, 18)

First 5 Rows of Processed Data (Numerical Features):
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 15 stored elements and shape (5, 3)>
  Coords	Values
  (0, 0)	1.7735654185216503
  (0, 1)	2.636015292943297
  (0, 2)	-0.2874393218999412
  (1, 0)	0.20125565032869805
  (1, 1)	0.9226481867078098
  (1, 2)	-0.1303686541950553
  (2, 0)	1.2494621624573328
  (2, 1)	1.3509899632666815
  (2, 2)	-0.2874393218999412
  (3, 0)	1.2494621624573328
  (3, 1)	-0.3623771429688056
  (3, 2)	0.4979140166244883
  (4, 0)	-1.3710541178642541
  (4, 1)	0.9226481867078098
  (4, 2)	-1.0727926604243707

First 5 Rows of Processed Data (Categorical Features):
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 5 stored elements and shape (5, 15)>
  Coords	Values
  (0, 1)	1.0
  (1, 1)	1.0
  (2, 1)	1.0
  (3, 1)	1.0
  (4, 0)	1.0


## Model Training  
- The goal is to train a machine learning model to predict logic depth based on extracted features.  
- Steps involved:  
  - Splitting the dataset into training (80%) and testing (20%) sets  
  - Selecting suitable regression models for prediction  
  - Evaluating model performance using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE)  
- The objective is to achieve accurate predictions while maintaining a low computational cost.  

In [6]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (400, 18)
Shape of X_test: (100, 18)
Shape of y_train: (400,)
Shape of y_test: (100,)
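
Because the preprocessor above was fitted on all 500 rows before the split, the scaler statistics and one-hot categories also reflect the test rows. A sketch of one way to avoid this is to split the raw frame first and wrap the preprocessing and the model in a scikit-learn `Pipeline`, so that fitting only sees the training portion. The `demo` frame and its target formula are hypothetical stand-ins for the CSV, and the random forest is used only to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for final_synthetic_rtl_dataset.csv.
rng = np.random.default_rng(42)
n = 80
demo = pd.DataFrame({
    "Fan-In": rng.integers(2, 13, size=n),
    "Fan-Out": rng.integers(1, 11, size=n),
    "Gate Count": rng.integers(1, 31, size=n),
    "Gate Types": rng.choice(["AND", "OR", "AND, OR", "OR, XOR"], size=n),
})
demo["Logic Depth"] = np.clip(demo["Gate Count"] // 4 + 2, 2, 8)

X_raw = demo.drop(columns=["Logic Depth"])
y_demo = demo["Logic Depth"]

# Split the raw rows first, before any statistics are computed.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_demo, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["Fan-In", "Fan-Out", "Gate Count"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gate Types"]),
    ])),
    ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
])

# fit() learns scaling and encoding from the training rows only.
pipeline.fit(X_tr, y_tr)
predictions = pipeline.predict(X_te)
print(predictions[:5])
```

A pipeline like this can also be passed directly to `GridSearchCV` or `cross_val_score`, which then refit the preprocessing inside every fold.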


## Model Evaluation  
- The trained model is tested against the test dataset to measure performance.  
- Evaluation metrics include:  
  - Mean Absolute Error (MAE) to measure the average absolute difference between predicted and actual values  
  - Root Mean Squared Error (RMSE) to account for large errors  
  - Prediction runtime to ensure the model runs efficiently compared to synthesis-based methods  
- A good model should have a low error rate and a fast prediction time.  
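
The three metrics listed above can be computed as in the sketch below; the comparison cell that follows reports MSE and R² instead. `regression_report` and `ConstantModel` are hypothetical names introduced for illustration, and any fitted estimator with a `predict` method could take the stub's place.

```python
import time

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def regression_report(model, X_eval, y_true):
    """Return MAE, RMSE and wall-clock prediction time for a fitted model."""
    start = time.perf_counter()
    y_hat = model.predict(X_eval)
    elapsed = time.perf_counter() - start
    return {
        "MAE": float(mean_absolute_error(y_true, y_hat)),
        # RMSE is the square root of the MSE, in the same units as the target.
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_hat))),
        "predict_seconds": elapsed,
    }

class ConstantModel:
    """Hypothetical stub that always predicts a logic depth of 3."""
    def predict(self, X):
        return np.full(len(X), 3.0)

report = regression_report(ConstantModel(), np.zeros((4, 3)), [2, 3, 4, 5])
print(report)
```

In the notebook, the same function could be called as `regression_report(model, X_test, y_test)` inside the model comparison loop.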

In [7]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Models to compare
models = {
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42),
    "Support Vector Regressor": SVR(),
    "Neural Network": MLPRegressor(random_state=42, max_iter=1000),
}

# Train and evaluate each model
results = {}
for name, model in models.items():
    
    model.fit(X_train, y_train)
   
    y_pred = model.predict(X_test)
 
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
  
    results[name] = {"MSE": mse, "R²": r2}

print("Model Comparison Results:")
for name, metrics in results.items():
    print(f"{name}:")
    print(f"  MSE: {metrics['MSE']}")
    print(f"  R²: {metrics['R²']}")

Model Comparison Results:
Random Forest:
  MSE: 1.0432055170464851
  R²: 0.5859638367016649
Gradient Boosting:
  MSE: 0.8733777883582541
  R²: 0.6533664913644014
XGBoost:
  MSE: 1.3065781593322754
  R²: 0.48143428564071655
Support Vector Regressor:
  MSE: 0.8679352402934981
  R²: 0.655526575530442
Neural Network:
  MSE: 0.8223442120083831
  R²: 0.6736211255721609


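- The Neural Network (MLP) is carried forward because it achieved the lowest test MSE (0.822) and the highest R² (0.674) of the five models.
- Training emitted a convergence warning (not shown in the output above), so the MLP reached its 1000-iteration limit before fully converging and may need more iterations or tuning.
- The next cell tunes its hyperparameters to improve performance and address the convergence issue.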
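
One way to confirm whether an MLP stopped at its iteration limit is to capture scikit-learn's `ConvergenceWarning` and compare the fitted `n_iter_` attribute with `max_iter`. The sketch below assumes a small synthetic regression problem (`X_demo`, `y_demo`) in place of the RTL data, and `fit_and_check` is a hypothetical helper.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

# Hypothetical toy regression data standing in for the RTL features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, 0.5, -0.25]) + rng.normal(scale=0.1, size=200)

def fit_and_check(max_iter):
    """Fit an MLP and report iterations used and whether it hit the limit."""
    model = MLPRegressor(hidden_layer_sizes=(100,), max_iter=max_iter,
                         random_state=42)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X_demo, y_demo)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model.n_iter_, hit_limit

for limit in (5, 2000):
    n_iter, hit_limit = fit_and_check(limit)
    print(f"max_iter={limit}: n_iter_={n_iter}, hit limit={hit_limit}")
```

If `n_iter_` equals `max_iter` and the warning fires, raising `max_iter` or adjusting the learning rate is a reasonable next step, which is what the grid search below explores.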

In [10]:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

# Parameter grid 
param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (50, 50)],  # Different architectures
    "activation": ["relu", "tanh"],  # Activation functions
    "solver": ["adam", "sgd"],  # Optimization algorithms
    "alpha": [0.0001, 0.001, 0.01],  # Regularization strength
    "max_iter": [2000, 3000],  # Increase max iterations to avoid convergence issues
}

mlp = MLPRegressor(random_state=42)

grid_search = GridSearchCV(estimator=mlp, param_grid=param_grid, cv=3, scoring="neg_mean_squared_error", n_jobs=-1)
grid_search.fit(X_train, y_train)

best_mlp = grid_search.best_estimator_

# Evaluate the best model
y_pred_best = best_mlp.predict(X_test)
mse_best = mean_squared_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)

print("Best Hyperparameters:", grid_search.best_params_)
print(f"Best Model - MSE: {mse_best}")
print(f"Best Model - R²: {r2_best}")

Best Hyperparameters: {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'max_iter': 2000, 'solver': 'sgd'}
Best Model - MSE: 0.7486673296899541
Best Model - R²: 0.702862625142898


In [11]:
from sklearn.inspection import permutation_importance

X_test_dense = X_test.toarray()

result = permutation_importance(best_mlp, X_test_dense, y_test, n_repeats=10, random_state=42)

importance_scores = result.importances_mean

feature_names = numerical_features + list(preprocessor.named_transformers_["cat"].get_feature_names_out())
feature_importance = dict(zip(feature_names, importance_scores))

sorted_feature_importance = sorted(feature_importance.items(), key=lambda x: x[1], reverse=True)

print("Feature Importance:")
for feature, importance in sorted_feature_importance:
    print(f"{feature}: {importance:.4f}")

Feature Importance:
Gate Count: 0.9486
Fan-In: 0.3357
Gate Types_AND, OR, NAND, NOR, XOR: 0.2608
Gate Types_AND, OR, NAND, XOR: 0.0232
Gate Types_AND, OR, NOR: 0.0063
Fan-Out: 0.0051
Gate Types_AND, OR, NAND, NOR: 0.0028
Gate Types_OR: 0.0022
Gate Types_OR, XOR: 0.0019
Gate Types_AND, OR, XOR: 0.0014
Gate Types_AND, OR, NAND: 0.0008
Gate Types_AND: 0.0005
Gate Types_OR, NOR: 0.0002
Gate Types_AND, NAND: 0.0001
Gate Types_OR, NOR, XOR: -0.0001
Gate Types_AND, OR: -0.0002
Gate Types_AND, OR, NOR, XOR: -0.0003
Gate Types_nan: -0.0016


In [12]:

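from sklearn.base import clone

# Keep only the three features with the highest permutation importance.
important_features = ["Gate Count", "Fan-In", "Gate Types_AND, OR, NAND, NOR, XOR"]
selected_idx = [feature_names.index(feature) for feature in important_features]

X_train_filtered = X_train[:, selected_idx]
X_test_filtered = X_test[:, selected_idx]

# Refit a copy so that best_mlp (trained on all 18 features) stays intact.
mlp_filtered = clone(best_mlp)
mlp_filtered.fit(X_train_filtered, y_train)

y_pred_filtered = mlp_filtered.predict(X_test_filtered)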
mse_filtered = mean_squared_error(y_test, y_pred_filtered)
r2_filtered = r2_score(y_test, y_pred_filtered)

print("Performance After Feature Selection:")
print(f"MSE: {mse_filtered}")
print(f"R²: {r2_filtered}")

Performance After Feature Selection:
MSE: 0.7659450428077647
R²: 0.6960053013145877


In [13]:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
cv_scores_mse = -cross_val_score(best_mlp, X_processed, y, cv=5, scoring="neg_mean_squared_error")
cv_scores_r2 = cross_val_score(best_mlp, X_processed, y, cv=5, scoring="r2")


print("Cross-Validation Results:")
print(f"MSE Scores: {cv_scores_mse}")
print(f"R2 Scores: {cv_scores_r2}")
print(f"Mean MSE: {cv_scores_mse.mean():.4f} (±{cv_scores_mse.std():.4f})")
print(f"Mean R2: {cv_scores_r2.mean():.4f} (±{cv_scores_r2.std():.4f})")

Cross-Validation Results:
MSE Scores: [0.69163875 0.67666042 0.76836335 0.66537925 0.60011743]
R2 Scores: [0.74072622 0.71677183 0.67021617 0.7544726  0.72652323]
Mean MSE: 0.6804 (±0.0539)
Mean R2: 0.7217 (±0.0288)
