### Power BI Integration: Generating Bore Size Predictions  

This script prepares a dataset for Power BI: it imports the data, cleans it, and uses a Random Forest Regressor to predict bore sizes for future cycles, each labeled with a wear classification. The output is a single dataset containing the actual bore measurements and the predicted bore sizes for future cycles, ready to be visualized and analyzed in Power BI. The script performs the following steps:

---

#### Data Import and Preprocessing:
1. **Load Dataset**: Power BI passes the input table to the script as a DataFrame named `dataset`, which the script copies into `df`. Because `dataset` exists only inside Power BI's Python scripting environment, running the code elsewhere raises `NameError: name 'dataset' is not defined`.
2. **Drop Unnecessary Columns**: The script drops the columns listed in `columns_to_drop` (parameter IDs, names, descriptions, operator messages, and the `A.` and `B.` measurement columns), none of which are used in the modeling. Names that are missing from the input are skipped.
3. **Data Type Conversion**: Measurement values (`"C.[DataValue]"`) are converted to numeric types, and timestamps (`"C.[EntryTimestamp]"`) are converted to `datetime` objects. Values that cannot be parsed are coerced to missing values; rows without a valid timestamp are dropped, and the remaining rows are sorted chronologically.
4. **Outlier Removal Using IQR**: Measurements outside the range from Q1 − 1.5 × IQR to Q3 + 1.5 × IQR are removed using the Interquartile Range (IQR) method, so that extreme values do not distort the model.
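
The IQR filter in step 4 can be illustrated on a handful of readings. The sketch below is only an illustration: the values are made up, and the variable names are not taken from the script.

```python
import pandas as pd

# Hypothetical bore readings; 30.00 is an obvious outlier
values = pd.Series([10.00, 10.01, 10.02, 10.01, 10.03, 10.02, 30.00])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Keep readings inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
kept = values[(values >= lower_bound) & (values <= upper_bound)]
```

Here the six readings near 10 survive and the 30.00 reading is dropped. Note that the bounds are computed once over the whole series, so a slow drift in bore size widens the accepted range rather than being flagged.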

---

#### Feature Engineering:
1. **Lag Features**: The script creates lag features for different time windows (3, 5, 10, 15, 20 cycles) to capture the temporal dependencies of bore measurements.
2. **Rolling Statistics**: Rolling mean and standard deviation are computed for multiple time windows (3, 5, 10 cycles) to capture short-term trends in the bore measurements.
3. **Cycle Count**: A `Cycle_Count` column is created to track the cycle number for each measurement.
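4. **Target Variable**: The target variable, `"Target_NextCycle"`, is the bore measurement shifted back by one cycle (`shift(-1)`), so each row's target is the measurement of the following cycle. Rows left with missing values by the lag, rolling, and target shifts are then dropped.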
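
To see how the shifts line up, here is a minimal sketch on a six-value series. The column names (`bore`, `lag_3`, `rolling_mean_3`, `target_next`) and the readings are hypothetical and shorter than the ones the script uses.

```python
import pandas as pd

# Hypothetical bore readings, one per cycle
s = pd.Series([10.0, 10.1, 10.2, 10.3, 10.4, 10.5], name="bore")

features = pd.DataFrame({
    "bore": s,
    "lag_3": s.shift(3),                           # value three cycles earlier
    "rolling_mean_3": s.rolling(window=3).mean(),  # mean of the current and two prior cycles
    "target_next": s.shift(-1),                    # value one cycle later
})

# Rows without a full history (start) or without a next cycle (end) contain NaN
usable = features.dropna()
```

Only the rows at index 3 and 4 remain in `usable`: the first three rows lack a three-cycle lag, and the last row has no next cycle to use as a target. With a maximum lag of 20, the same effect removes the first 20 rows of the real dataset.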

---

#### Model Training:
1. **Random Forest Regressor**: A Random Forest Regressor is configured to predict the target variable, `"Target_NextCycle"`, from the columns listed in `feature_columns`. Its hyperparameters are fixed in the script: `n_estimators=200`, `max_depth=10`, `min_samples_split=10`, `min_samples_leaf=5`, `max_features="sqrt"`, `bootstrap=True`, and `random_state=42`.
2. **Model Fitting**: The model is fitted on all available rows (`X`, `y`). The script does not hold out a test set, so it produces no accuracy estimate.

---

#### Future Bore Size Predictions:
1. **Future Cycles**: The number of future cycles to predict (`future_cycles`) is specified. In this case, predictions are made for 10 future cycles, but this can be adjusted as needed.
2. **Prediction Loop**: The script uses the last known values from the dataset to predict future bore sizes by iteratively applying the model and updating the rolling statistics for each cycle.
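3. **Wear Classification**: The wear stage for each future cycle is classified from the cycle-to-cycle change in predicted bore size: **Normal Wear** for a change below 0.001, **Moderate Wear** for a change from 0.001 up to (but not including) 0.005, and **Critical Wear** for a change of 0.005 or more.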
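
The iterative idea behind the prediction loop can be sketched independently of the Random Forest. In the example below, `recursive_forecast` and `drift_model` are hypothetical names, and `drift_model` is a trivial stand-in for a trained model; the point is only that each prediction is appended to the history and becomes an input for the next step.

```python
def recursive_forecast(history, predict_next, steps):
    """Forecast `steps` values, feeding each prediction back into the history."""
    history = list(history)  # copy so the caller's data is not modified
    predictions = []
    for _ in range(steps):
        next_value = predict_next(history)
        predictions.append(next_value)
        history.append(next_value)  # the prediction becomes an input for the next step
    return predictions


def drift_model(history):
    """Stand-in model: last value plus the most recent one-step change."""
    return history[-1] + (history[-1] - history[-2])


forecast = recursive_forecast([10.0, 10.5, 11.0], drift_model, steps=3)
```

Because later predictions are built on earlier ones, errors compound with each step, which is one reason to keep `future_cycles` modest.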

---

#### Merging Actual and Predicted Data:
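1. **Combine Data**: The script concatenates the actual rows (`df`) and the future rows (`future_df`) into a final dataset (`final_df`). Actual rows keep the original bore measurements and have empty `Predicted_Bore_Size` and `Predicted_Wear_Stage` values; future rows carry the predicted bore size, the cycle-to-cycle change, and the wear classification, with the measurement columns left empty.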
2. **Power BI Integration**: The final dataset is displayed as the output, which can be directly imported into Power BI for further analysis and visualization.
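
Since Power BI supplies the input as `dataset`, the script cannot be run as-is in a notebook or terminal. One possible way to test it locally is sketched below; `load_input` and `sample_frame` are hypothetical helpers, and the sample rows are invented stand-ins that only reuse the column names from the script.

```python
import pandas as pd


def load_input(namespace, fallback):
    """Return a copy of Power BI's `dataset` if it exists, else call `fallback`."""
    if "dataset" in namespace:
        return namespace["dataset"].copy()
    return fallback()


def sample_frame():
    # Hypothetical stand-in rows using the column names from the script
    return pd.DataFrame({
        "C.[EntryTimestamp]": ["2024-01-01 08:00", "2024-01-01 08:05", "2024-01-01 08:10"],
        "C.[DataValue]": ["10.001", "10.002", "10.004"],
    })


# Inside Power BI, `dataset` is present in globals(); elsewhere the sample is used
df = load_input(globals(), sample_frame)
```

In practice the fallback would more likely read an exported CSV of the same query, but the file name and location depend on the local setup.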

---

#### Key Considerations:
1. **Adjustable Parameters**: 
   - **Number of Future Cycles**: The number of future cycles for which predictions are made can be adjusted by changing the `future_cycles` variable.
   - **Wear Classification Cutoffs**: The thresholds used to classify wear stages (e.g., `Normal Wear`, `Moderate Wear`, and `Critical Wear`) can be customized to suit different operational definitions or requirements.
2. **Feature Engineering**: The lag features and rolling statistics windows can be adjusted to capture more or less historical information, depending on the characteristics of the data and the needs of the analysis.
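
One way to make the wear cutoffs adjustable is to pass them as arguments. This is a sketch rather than the script's own function: the parameter names `moderate_from` and `critical_from` are assumptions, and their defaults mirror the cutoffs used in the script.

```python
def classify_wear(change, moderate_from=0.001, critical_from=0.005):
    """Label a per-cycle bore size change using two adjustable cutoffs."""
    if change < moderate_from:
        return "Normal Wear"
    if change < critical_from:
        return "Moderate Wear"
    return "Critical Wear"


# Default cutoffs
labels = [classify_wear(c) for c in (0.0002, 0.003, 0.02)]

# Stricter cutoffs for a tighter tolerance (hypothetical values)
strict = classify_wear(0.003, moderate_from=0.0005, critical_from=0.002)
```

With the defaults, the three sample changes map to Normal, Moderate, and Critical Wear; under the stricter cutoffs the same 0.003 change is already Critical Wear. Negative changes fall below the first cutoff and are therefore labeled Normal Wear.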


In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Power BI exposes the input table as a DataFrame named `dataset`
df = dataset.copy()

# Drop unnecessary columns
columns_to_drop = [
    "D.[NamePostfix]", "F.[Name]", "G1.[ParameterID]", "G1.[Name]",
    "G1.[OperatorMessage]", "G2.[ParameterID]", "G2.[Name]", 
    "G2.[Description]", "G3.[ParameterID]", "G3.[Name]", 
    "G3.[OperatorMessage]", "A.[ParameterID]", "A.[EntryTimestamp]", 
    "A.[DataValue]", "A.[Description]", "B.[ParameterID]", 
    "B.[EntryTimestamp]", "B.[DataValue]", "B.[Description]", 
    "C.[ParameterID]"
]

# errors="ignore" skips any listed column that is not present
df = df.drop(columns=columns_to_drop, errors="ignore")

# Convert measurement values to numeric
df["C.[DataValue]"] = pd.to_numeric(df["C.[DataValue]"], errors="coerce")

# Convert timestamps to datetime and sort
df["C.[EntryTimestamp]"] = pd.to_datetime(df["C.[EntryTimestamp]"], 
                                          errors="coerce")
df = df.dropna(subset=["C.[EntryTimestamp]"]).sort_values(
    by="C.[EntryTimestamp]"
).reset_index(drop=True)

# Outlier Removal Using IQR
Q1 = df["C.[DataValue]"].quantile(0.25)
Q3 = df["C.[DataValue]"].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df = df[
    (df["C.[DataValue]"] >= lower_bound) & 
    (df["C.[DataValue]"] <= upper_bound)
].reset_index(drop=True)

# Create Lag Features
for lag in [3, 5, 10, 15, 20]:
    df[f"DataValue_Lag{lag}"] = df["C.[DataValue]"].shift(lag)

# Create Rolling Statistics
for window in [3, 5, 10]:
    df[f"Rolling_Mean_{window}"] = df["C.[DataValue]"].rolling(
        window=window
    ).mean()
    df[f"Rolling_Std_{window}"] = df["C.[DataValue]"].rolling(
        window=window
    ).std()

# Create Cycle Count
df["Cycle_Count"] = np.arange(1, len(df) + 1)

# Define target variable
df["Target_NextCycle"] = df["C.[DataValue]"].shift(-1)

# Drop NaN values due to shifting
df = df.dropna().reset_index(drop=True)

# Define feature columns
feature_columns = [
    "Cycle_Count", "C.[DataValue]", "DataValue_Lag3", "DataValue_Lag5", 
    "DataValue_Lag10", "DataValue_Lag15", "DataValue_Lag20", 
    "Rolling_Mean_3", "Rolling_Std_3", "Rolling_Mean_5", "Rolling_Std_5", 
    "Rolling_Mean_10", "Rolling_Std_10"
]

# Train Random Forest Model
X = df[feature_columns]
y = df["Target_NextCycle"]

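rfr_model = RandomForestRegressor(
    n_estimators=200,        # number of trees in the forest
    max_depth=10,            # cap tree depth to limit overfitting
    min_samples_split=10,    # minimum samples needed to split a node
    min_samples_leaf=5,      # minimum samples required in each leaf
    max_features="sqrt",     # features considered at each split
    bootstrap=True,          # train each tree on a bootstrap sample
    random_state=42          # fixed seed for reproducible results
)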

rfr_model.fit(X, y)

# Define how many future cycles to predict
future_cycles = 10  

# Create a DataFrame for future predictions
future_df = pd.DataFrame()
future_df["Cycle_Count"] = range(
    df["Cycle_Count"].max() + 1, df["Cycle_Count"].max() + 1 + future_cycles
)

# Seed the history with the actual measurements; each prediction is appended
# so that later cycles can use it as an input
history = df["C.[DataValue]"].tolist()
predicted_bores = []

# Predict future bore sizes using the trained model
for cycle in future_df["Cycle_Count"]:
    # Features describe the latest available cycle (cycle - 1), as in training,
    # where each row's target is the following cycle
    new_row = {"Cycle_Count": cycle - 1, "C.[DataValue]": history[-1]}

    # Lag features: value `lag` cycles before the latest one
    # (falls back to the oldest value if the history is too short)
    for lag in [3, 5, 10, 15, 20]:
        new_row[f"DataValue_Lag{lag}"] = (
            history[-1 - lag] if len(history) > lag else history[0]
        )

    # Rolling statistics over the latest `window` values; ddof=1 matches
    # the sample standard deviation used by pandas' rolling().std()
    for window in [3, 5, 10]:
        new_row[f"Rolling_Mean_{window}"] = np.mean(history[-window:])
        new_row[f"Rolling_Std_{window}"] = np.std(history[-window:], ddof=1)

    # Convert to DataFrame and predict bore size
    new_X = pd.DataFrame([new_row])[feature_columns]
    predicted_bore = rfr_model.predict(new_X)[0]
    predicted_bores.append(predicted_bore)
    history.append(predicted_bore)

# Store predictions in the DataFrame
future_df["Predicted_Bore_Size"] = predicted_bores

# Compute bore size changes over time
future_df["Bore_Size_Change"] = future_df["Predicted_Bore_Size"].diff().fillna(0)

# Define wear classification function
def classify_wear(change):
    """Classify wear stages based on bore size change."""
    if change < 0.001:
        return "Normal Wear"
    elif 0.001 <= change < 0.005:
        return "Moderate Wear"
    return "Critical Wear"

# Assign wear labels to future cycles
future_df["Predicted_Wear_Stage"] = future_df["Bore_Size_Change"].apply(
    classify_wear
)

# Merge actual and predicted data
df["Predicted_Bore_Size"] = np.nan
df["Predicted_Wear_Stage"] = np.nan

# Final dataset
final_df = pd.concat([df, future_df], ignore_index=True)

# Display the final dataset for Power BI
final_df

NameError: name 'dataset' is not defined