<h1>Mineral Processing and Recovery Machine Model</h1>
<h2>Introduction</h2>
<p>&nbsp;&nbsp;&nbsp;&nbsp;In this analysis, I will explore and preprocess the datasets, aiming to identify important features and patterns that will inform a machine learning model. The data is already split into training, test, and "full" datasets, which will be used to train and evaluate the model. This is a regression task, where the goal is to predict gold recovery from the process measurements collected over time.</p>

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV, RandomizedSearchCV
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [None]:
def smape(y_true, y_pred):
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2
    diff = np.abs(y_true - y_pred) / denominator
    return np.mean(diff) * 100

smape_scorer = make_scorer(smape, greater_is_better=False)

<h2>Preprocessing and Preliminary Analysis</h2>
<ol>
    <li>Load and display the datasets to understand the structure and content.</li>
    <li>Examine the datasets to understand key features and decide how to handle any missing data.</li>
    <li>Clean the data by imputing missing values using the mean of the training set, ensuring no data leakage.</li>
</ol>
<h3>Preparing the Data</h3>

In [None]:
# Load the datasets
train_df = pd.read_csv('/datasets/gold_recovery_train.csv')
test_df = pd.read_csv('/datasets/gold_recovery_test.csv')
full_df = pd.read_csv('/datasets/gold_recovery_full.csv')

# Display the first few rows to get an overview
print("Training Data Preview:")
display(train_df.head())
print(f"{os.linesep}Test Data Preview:")
display(test_df.head())
print(f"{os.linesep}Full Data Preview:")
display(full_df.head())

# Check the columns (features) in each dataset
print(f"{os.linesep}Training Set Columns:", train_df.columns)
print(f"{os.linesep}Test Set Columns:", test_df.columns)
print(f"{os.linesep}Full Set Columns:", full_df.columns)

In [None]:
print(train_df.describe())
print(f'{os.linesep}Missing Values in Training Dataset: ')
display(train_df.isna().sum())

In [None]:
numerical_columns = train_df.select_dtypes(include=['int64', 'float64']).columns
imputer = SimpleImputer(strategy='mean')
imputer.fit(train_df[numerical_columns]) # Training Imputer on Training set only to prevent "leakage"

# Handling columns that are missing from the test set
common_columns = train_df.columns.intersection(test_df.columns).intersection(full_df.columns)
missing_in_test = train_df.columns.difference(test_df.columns)

print('Missing Columns from Test DataSet:')
for col in missing_in_test:
    test_df[col] = np.nan
    print(col, ' | ', test_df[col].dtype)

train_df[numerical_columns] = imputer.transform(train_df[numerical_columns]) # Transforming all the datasets
test_df[numerical_columns] = imputer.transform(test_df[numerical_columns]) # for consistency
full_df[numerical_columns] = imputer.transform(full_df[numerical_columns])

print(f'{os.linesep}Count of different rows for recovery calculations: ')
print('Rougher.Output.Recovery: ', train_df['rougher.output.recovery'].count())
print('Rougher.Input.Feed_AU: ', train_df['rougher.input.feed_au'].count())
print('Rougher.Output.Concentrate_AU: ', train_df['rougher.output.concentrate_au'].count())
print(f'{os.linesep}Null Values for Recovery Calculation Variables: ')
print('Rougher.Output.Recovery: ', train_df['rougher.output.recovery'].isna().sum())
print('Rougher.Input.Feed_AU: ', train_df['rougher.input.feed_au'].isna().sum())
print('Rougher.Output.Concentrate_AU: ', train_df['rougher.output.concentrate_au'].isna().sum())

print(f'{os.linesep}Missing Values from the Entire Training DataFrame:')
display(train_df.isna().sum())

### Missing Data
&nbsp;&nbsp;&nbsp;&nbsp;The missing columns from the test dataset can be grouped into three major categories: **Concentrate and Tail Data**, **Recovery Data**, and **Ratio and Calculation Data**. Below is an enhanced analysis of the missing features, their significance to the model, and the potential impact of their absence.

#### **Concentrate and Tail Data** (e.g., `rougher.output.concentrate_ag`, `final.output.tail_au`)
   - **Columns**: 
     - `final.output.concentrate_ag` (float64)
     - `primary_cleaner.output.concentrate_au` (float64)
     - `rougher.output.tail_au` (float64), etc.
   - **Significance**: These columns represent the amounts of various metals (e.g., gold, silver, lead) extracted or left behind during different stages of the purification process. These features provide critical information about how effectively the different processing stages are recovering valuable metals.
   - **Implications**: Missing concentrate and tail data could severely limit the model's ability to evaluate and compare the effectiveness of each stage in terms of material extraction and waste. Without this data, the model may not fully understand how well each process contributes to the overall recovery, potentially affecting the accuracy of its predictions for recovery efficiency.
   - **Mitigation**: 
     - **Imputation**: The missing values in concentrate and tail columns could be imputed based on the relationships with other available features, such as the feed metal content or other stage-specific data. For example, a regression model could be used to predict missing concentrate values based on other available features.
     - **Feature Engineering**: New features could be engineered using available columns, such as combining concentrate and tail data with other stage-related features to approximate missing values.

#### **Recovery Data** (e.g., `rougher.output.recovery`, `primary_cleaner.output.recovery`)
   - **Columns**:
     - `rougher.output.recovery` (float64)
     - `final.output.recovery` (float64)
   - **Significance**: Recovery data is a key metric that quantifies the efficiency of each processing stage in terms of the amount of valuable metal recovered. It is a direct indicator of process effectiveness and essential for optimizing the purification steps.
   - **Implications**: Without recovery data, the model will struggle to assess how well each stage is performing. This will make it difficult to optimize the extraction process, leading to potentially inaccurate predictions. Recovery is a critical output, and its absence could result in the model missing key insights into the process efficiency.
   - **Mitigation**:
     - **Imputation**: If recovery data is missing for certain stages, it could be estimated using other available features or through statistical methods, such as imputation based on the relationship with concentrate or tail data.
     - **Alternative Metrics**: If recovery data is completely unavailable, alternative metrics could be derived, such as using feed-to-concentrate ratios, to estimate recovery indirectly.

#### **Ratio and Calculation Data** (e.g., `rougher.calculation.au_pb_ratio`)
   - **Columns**:
     - `rougher.calculation.au_pb_ratio` (float64)
     - `rougher.calculation.floatbank10_sulfate_to_au_feed` (float64)
   - **Significance**: These columns provide important ratios and calculations that capture the interactions between different process variables, such as the relationship between sulfur content and metal recovery or the ratio of gold to lead in the ore feed. These features are essential for understanding the chemical and physical dynamics of the extraction process.
   - **Implications**: Missing these ratios could prevent the model from accurately capturing important relationships between key process variables. This could limit its ability to predict recovery outcomes under different process conditions, especially when the interactions between multiple variables drive recovery efficiency.
   - **Mitigation**:
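     - **Feature Engineering**: Some of these ratios can be recomputed using available data. For example, `rougher.calculation.au_pb_ratio` could presumably be recomputed as the gold-to-lead ratio of the rougher feed (`rougher.input.feed_au` divided by `rougher.input.feed_pb`), since the concentrate columns are themselves absent from the test set; this should be verified against the training values before relying on it. Alternatively, if the data is partially available, missing values could be estimated using similar variables or imputation methods.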
     - **Model Flexibility**: The model might also be retrained without these ratios if they are deemed non-essential, but this could impact model performance and interpretability.

### Summary of Missing Data Impacts:
Overall, the missing columns are critical for understanding the stages of the extraction process and for making accurate predictions of recovery efficiency. The absence of these features could result in a model that fails to capture important interactions between process variables or assess the true effectiveness of purification stages. Depending on the type and quantity of missing data, various strategies such as imputation, feature engineering, or model adaptation can be employed to mitigate the negative impact and enhance the model's predictive capabilities.
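
&nbsp;&nbsp;&nbsp;&nbsp;As a rough illustration of the grouping above, the sketch below sorts column names into the three categories using simple substring checks. The helper name `categorize_missing` and its matching rules are assumptions made for this example, and the sample list contains only the columns named in this section, not the full contents of `missing_in_test`.

```python
from collections import defaultdict

def categorize_missing(columns):
    """Group column names into the three categories discussed above (hypothetical helper)."""
    groups = defaultdict(list)
    for col in columns:
        if '.calculation.' in col:
            groups['ratio_and_calculation'].append(col)
        elif col.endswith('.recovery'):
            groups['recovery'].append(col)
        elif '.concentrate_' in col or '.tail_' in col:
            groups['concentrate_and_tail'].append(col)
        else:
            groups['other'].append(col)
    return dict(groups)

# Only the columns quoted in this section; the real list is longer
sample_columns = [
    'final.output.concentrate_ag',
    'primary_cleaner.output.concentrate_au',
    'rougher.output.tail_au',
    'rougher.output.recovery',
    'final.output.recovery',
    'rougher.calculation.au_pb_ratio',
    'rougher.calculation.floatbank10_sulfate_to_au_feed',
]

groups = categorize_missing(sample_columns)
for name, cols in groups.items():
    print(f"{name}: {len(cols)} column(s)")
```

In the notebook, passing `missing_in_test` instead of `sample_columns` would tally the real columns.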

In [None]:
C = train_df['rougher.output.concentrate_au']
F = train_df['rougher.input.feed_au']
T = train_df['rougher.output.tail_au'] 

train_df['calculated_recovery'] = np.where(
    (F * (C - T)) != 0,
    (C * (F - T)) / (F * (C - T)) * 100,
    0 # Handling Infinite cases as 0
)
train_df['calculated_recovery'].describe()

In [None]:
Q1 = train_df['calculated_recovery'].quantile(0.25)
Q3 = train_df['calculated_recovery'].quantile(0.75)
IQR = Q3 - Q1

train_df['calculated_recovery'] = np.clip(
    train_df['calculated_recovery'],
    Q1 - 1.5 * IQR,
    Q3 + 1.5 * IQR
)

total_con_rough = train_df[['rougher.output.concentrate_ag', 'rougher.output.concentrate_au', 
                                        'rougher.output.concentrate_pb']].sum(axis=1)
total_con_cleaner = train_df[['primary_cleaner.output.concentrate_ag', 
                                                'primary_cleaner.output.concentrate_au', 
                                                'primary_cleaner.output.concentrate_pb']].sum(axis=1)

# Combine the concentration data for analysis
total_con = pd.concat([total_con_rough, total_con_cleaner], axis=1)
total_con.columns = ['Rougher', 'Primary Cleaner']

# Cap outliers at the IQR fences (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR)
Q1_rough = total_con['Rougher'].quantile(0.25)
Q3_rough = total_con['Rougher'].quantile(0.75)
IQR_rough = Q3_rough - Q1_rough

total_con['Rougher'] = np.clip(total_con['Rougher'], 
                                Q1_rough - 1.5 * IQR_rough, 
                                Q3_rough + 1.5 * IQR_rough)

Q1_cleaner = total_con['Primary Cleaner'].quantile(0.25)
Q3_cleaner = total_con['Primary Cleaner'].quantile(0.75)
IQR_cleaner = Q3_cleaner - Q1_cleaner

total_con['Primary Cleaner'] = np.clip(total_con['Primary Cleaner'], 
                                                  Q1_cleaner - 1.5 * IQR_cleaner, 
                                                  Q3_cleaner + 1.5 * IQR_cleaner)

# Calculate MAE between calculated recovery and provided recovery
mae_recovery = mean_absolute_error(train_df['rougher.output.recovery'], train_df['calculated_recovery'])

print(f"MAE between calculated and provided recovery: {mae_recovery}")

<h3>Comprehensive Study of Total Concentrations</h3>
<p>&nbsp;&nbsp;&nbsp;&nbsp;In order to ensure the quality and consistency of the data, a detailed study of the total concentrations of valuable metals across all purification stages was conducted. This analysis is critical for identifying any outliers or anomalies that may affect the accuracy of model predictions. By examining total concentrations, we can identify any potential data quality issues that may arise during the purification process.</p>

<h4>Total Concentration Calculation</h4>
<p>&nbsp;&nbsp;&nbsp;&nbsp;Total concentrations for each stage were calculated by summing the concentrate values for the relevant metals (gold, silver, lead) at that stage. The code above computes these totals for the rougher and primary cleaner stages; the final output can be handled in the same way. Each total represents the combined concentration of valuable metals in the concentrate at that stage.</p>

<p>&nbsp;&nbsp;&nbsp;&nbsp;For example, the total concentration for the rougher stage was calculated as:</p>
<pre>
total_con_rough = train_df[['rougher.output.concentrate_ag', 'rougher.output.concentrate_au', 'rougher.output.concentrate_pb']].sum(axis=1)
</pre>

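<p>&nbsp;&nbsp;&nbsp;&nbsp;The primary cleaner total is computed the same way from the <code>primary_cleaner.output.concentrate_*</code> columns, and the final output total can be obtained from the <code>final.output.concentrate_*</code> columns.</p>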
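
<p>&nbsp;&nbsp;&nbsp;&nbsp;A minimal sketch of computing the totals for all three stages in one pass is shown below. The frame <code>toy_df</code> and its values are invented for illustration and stand in for <code>train_df</code>; the column naming pattern is taken from the columns used in this notebook.</p>

```python
import pandas as pd

# Toy frame standing in for train_df (values invented for illustration)
toy_df = pd.DataFrame({
    'rougher.output.concentrate_ag': [10.0, 11.0],
    'rougher.output.concentrate_au': [20.0, 19.0],
    'rougher.output.concentrate_pb': [7.0, 8.0],
    'primary_cleaner.output.concentrate_ag': [8.0, 9.0],
    'primary_cleaner.output.concentrate_au': [32.0, 31.0],
    'primary_cleaner.output.concentrate_pb': [10.0, 10.0],
    'final.output.concentrate_ag': [5.0, 5.0],
    'final.output.concentrate_au': [44.0, 45.0],
    'final.output.concentrate_pb': [10.0, 10.0],
})

stage_prefixes = {
    'Rougher': 'rougher.output.concentrate_',
    'Primary Cleaner': 'primary_cleaner.output.concentrate_',
    'Final': 'final.output.concentrate_',
}
metals = ['ag', 'au', 'pb']

# One column of row-wise totals per stage
totals = pd.DataFrame({
    stage: toy_df[[prefix + metal for metal in metals]].sum(axis=1)
    for stage, prefix in stage_prefixes.items()
})
print(totals)
```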

<h4>Anomaly Detection: IQR Method</h4>
<p>&nbsp;&nbsp;&nbsp;&nbsp;To ensure the accuracy of the data, we used the Interquartile Range (IQR) method to handle anomalies in the total concentration values. With IQR = Q3 - Q1, a value is treated as an outlier when it lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where Q1 and Q3 are the first and third quartiles.</p>

<p>&nbsp;&nbsp;&nbsp;&nbsp;Here is how out-of-range values were capped at those fences for the rougher stage:</p>
<pre>
Q1_rough = total_con['Rougher'].quantile(0.25)
Q3_rough = total_con['Rougher'].quantile(0.75)
IQR_rough = Q3_rough - Q1_rough
total_con['Rougher'] = np.clip(total_con['Rougher'], Q1_rough - 1.5 * IQR_rough, Q3_rough + 1.5 * IQR_rough)
</pre>

<p>&nbsp;&nbsp;&nbsp;&nbsp;The same approach was applied to the primary cleaner concentrations.</p>
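
<p>&nbsp;&nbsp;&nbsp;&nbsp;Clipping replaces outliers but does not report how many there were. The sketch below, on made-up numbers, shows both steps side by side: flagging values outside the fences and then capping them. The helper name <code>iqr_fences</code> is an assumption for this example.</p>

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return the lower and upper IQR fences for a 1-D array (hypothetical helper)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented totals with one obvious outlier
values = np.array([48.0, 50.0, 51.0, 52.0, 53.0, 55.0, 120.0])

lower, upper = iqr_fences(values)
is_outlier = (values < lower) | (values > upper)   # detection: which rows are anomalous
clipped = np.clip(values, lower, upper)            # treatment: cap them at the fences

print(f"fences: [{lower}, {upper}]")
print(f"outliers flagged: {is_outlier.sum()}")
print(f"largest value after clipping: {clipped.max()}")
```

<p>&nbsp;&nbsp;&nbsp;&nbsp;Keeping the boolean mask makes it possible to count or inspect the affected rows before deciding whether to cap or drop them.</p>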

<h4>Findings and Implications</h4>
<p>&nbsp;&nbsp;&nbsp;&nbsp;After applying the IQR fences, some total concentration values fell outside the expected range and could affect the accuracy of predictions. Such values may stem from data errors, unexpected processing behavior, or rare events in the purification process; the data alone does not tell these causes apart. Handling them is important for the integrity of the data and the model’s predictive performance.</p>

<p>&nbsp;&nbsp;&nbsp;&nbsp;Mitigation strategies include removing or correcting the anomalous values, depending on their cause. In this notebook, the values are clipped to the IQR fences rather than removed.</p>

<h4>Summary of Total Concentration Analysis</h4>
<p>&nbsp;&nbsp;&nbsp;&nbsp;The total concentration analysis provided valuable insights into the data quality across various purification stages. By applying anomaly detection methods such as IQR, we were able to identify and address potential issues in the data. This comprehensive analysis ensures that the data used to train the model is of high quality, which directly contributes to the accuracy and reliability of the final predictions.</p>


<h3>Preprocessing Conclusions and Observations</h3>
<p>&nbsp;&nbsp;&nbsp;&nbsp;The data was first loaded and examined to identify key features and common characteristics across the three datasets. It was quickly evident that the test set contained the most missing columns, which were critical for understanding the purification process, especially in terms of concentrate and tail data, recovery metrics, and key ratios. These missing features were especially important for predicting recovery efficiency, and therefore, were handled with great care. I chose <i>mean imputation</i> for numerical features, based on the assumption that the missing values were missing at random and that the mean would provide a reasonable approximation. This method was applied consistently across the <b>training</b>, <b>test</b>, and <b>full</b> datasets.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;To prevent <b><i>data leakage</i></b> (letting information from outside the training data influence how the model is fit), the imputer was <b>fit only on the training dataset</b>. This ensures that statistics from the test and full datasets played no part in the training phase. All datasets now contain the same set of features, ensuring consistency during both training and evaluation. After cleaning the dataset, I proceeded with the calculations for gold recovery features.</p>
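
<p>&nbsp;&nbsp;&nbsp;&nbsp;The fit-on-train, transform-everywhere pattern can be illustrated on a toy column. The frames <code>toy_train</code> and <code>toy_test</code> and their values are invented for this sketch; only <code>SimpleImputer</code> matches what the notebook uses.</p>

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented data: the test set contains a much larger value than the training set
toy_train = pd.DataFrame({'feed_au': [8.0, np.nan, 10.0]})
toy_test = pd.DataFrame({'feed_au': [np.nan, 30.0]})

imputer = SimpleImputer(strategy='mean')
imputer.fit(toy_train)                      # the mean is learned from training rows only

filled_test = imputer.transform(toy_test)   # test gaps receive the training mean

print("learned mean:", imputer.statistics_[0])
print("imputed test value:", filled_test[0, 0])
```

<p>&nbsp;&nbsp;&nbsp;&nbsp;Had the imputer been fit on both frames together, the fill value would have been pulled toward the test set's 30.0, which is exactly the kind of leakage the notebook avoids.</p>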
<p>&nbsp;&nbsp;&nbsp;&nbsp;One of the more critical preprocessing tasks was ensuring the <b>total concentrations</b> of key metals at the purification stages were reasonable. These totals were calculated for the rougher and primary cleaner stages to capture how effectively each stage concentrates valuable metals like gold, silver, and lead. After <b>summing</b> the relevant concentrations, I applied the <b>IQR (Interquartile Range)</b> method to identify outliers, which were <b>clipped to the IQR fences</b> to keep the data quality high.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;The final aspect of preprocessing was addressing the <b>extreme outliers</b> within the calculated recovery data. By capping values at fences derived from the <b>25th</b> and <b>75th</b> percentiles, I obtained a small <b>Mean Absolute Error</b> of just over <b>4%</b> (in percentage points of recovery) between the cleaned <b>calculated recovery</b> and the provided <b>rougher recovery</b> values. This suggests the preprocessing steps are sound and are unlikely to adversely affect model performance.</p>
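
<p>&nbsp;&nbsp;&nbsp;&nbsp;For reference, the recovery check uses the formula from the calculation cell, recovery = C * (F - T) / (F * (C - T)) * 100, where C, F, and T are the gold shares in the concentrate, feed, and tail. The sketch below evaluates it for a single made-up sample; the function name <code>recovery</code> and the input values are assumptions for illustration.</p>

```python
def recovery(c, f, t):
    """Recovery in percent from concentrate (c), feed (f) and tail (t) gold shares."""
    denominator = f * (c - t)
    if denominator == 0:
        return 0.0  # same fallback the notebook applies when the denominator is zero
    return c * (f - t) / denominator * 100

# Invented sample: 20% gold in concentrate, 8% in feed, 2% in tail
example = recovery(c=20.0, f=8.0, t=2.0)
print(round(example, 2))  # 83.33
```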

<h2>Primary Analysis</h2>
<h3>Extracting and Modeling Data</h3>
<ol>
    <li>Dividing and pulling the data into categories (metal and stage)</li>
    <li>Plotting the pulled data as box plots to visualize variance by stage and metal</li>
    <li>Plotting density (KDE) curves to compare feed particle size between the training and test sets</li>
</ol>

In [None]:
stages = ['raw_feed', 'rougher', 'final']
metals = ['au', 'ag', 'pb']

stage_data = []

for metal in metals:
    for stage in stages:
        if stage == 'raw_feed':
            column = f'rougher.input.feed_{metal}'
        elif stage == 'rougher':
            column = f'rougher.output.concentrate_{metal}'
        elif stage == 'final':
            column = f'final.output.concentrate_{metal}'

        # Pull the concentration values for this metal/stage pair
        concentration_values = full_df[column].values

        for concentration in concentration_values:
            stage_data.append({
                'stage': stage,
                'metal': metal,
                'concentration': concentration
            })
            
stage_df = pd.DataFrame(stage_data)
display(stage_df.head())
stage_df['stage'] = stage_df['stage'].astype('category')
stage_df['metal'] = stage_df['metal'].astype('category')
stage_df['concentration'] = pd.to_numeric(stage_df['concentration'], errors='coerce')
stage_df = stage_df.dropna(subset=['concentration'])
print(stage_df.head())

sns.set(style="whitegrid")
plt.figure(figsize=(12, 6))

sns.boxplot(x='stage', y='concentration', hue='metal', data=stage_df)
plt.title('Metal Concentrations Across Stages')
plt.show()

In [None]:
# Plot feed particle size distribution for training and test sets
plt.figure(figsize=(12, 6))
sns.kdeplot(train_df['primary_cleaner.input.feed_size'], label="Train Set", fill=True)  # `shade` is deprecated in favor of `fill`
sns.kdeplot(test_df['primary_cleaner.input.feed_size'], label="Test Set", fill=True)
plt.title('Feed Particle Size Distribution in Training vs Test Set')
plt.legend()
plt.show()

<h3>Conclusions and Key insights</h3>
<p>&nbsp;&nbsp;&nbsp;&nbsp;The box plots of metal concentrations across stages show <b>consistent</b> distribution patterns, which points to a <b>stable</b> process. The feed particle size distributions of the training and test sets <b>overlap closely</b>, which supports a fair evaluation of the model on the test set. After the IQR capping applied during preprocessing, the total concentration distributions show no remaining extreme values. Together, these findings suggest the datasets are <b>suitable</b> for predictive modeling efforts.</p>

<h2>Model Construction and Testing</h2>
<h3>Model Construction</h3>
<p>The goal of this section is to enhance gold recovery prediction accuracy through the construction and testing of three families of models. First, I will use a Linear Regression model as a baseline to capture the key structure and uniformity observed in the previous analysis. Next, I will construct a Random Forest Regressor model to capture the more nuanced variability in the data that may not be apparent from the initial analysis. Finally, I will employ Ridge and Lasso Regression models to introduce regularization and improve generalization over the Linear Regression model.</p>
<p>Each model will be evaluated using cross-validation and the sMAPE metric to ensure a robust and consistent analysis.</p>
<ol>
    <li>Linear Regression Model
        <ol>
            <li>Hyperparameter Tuning</li>
            <li>Cross Validation</li>
            <li>Calculating sMAPE</li>
        </ol>
    </li>
    <li>Random Forest Regressor Model
        <ol>
            <li>Hyperparameter Tuning</li>
            <li>Cross Validation</li>
            <li>Calculating sMAPE</li>
        </ol>
    </li>
    <li>Ridge and Lasso Regression Models
        <ol>
            <li>Hyperparameter Tuning</li>
            <li>Cross Validation</li>
            <li>Calculating sMAPE</li>
        </ol>
    </li>
</ol>
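
<p>&nbsp;&nbsp;&nbsp;&nbsp;The sMAPE metric averages |y - ŷ| / ((|y| + |ŷ|) / 2) over all samples and multiplies by 100. A term is undefined when both the true and predicted values are zero, so the sketch below shows one possible way to guard against that case by counting such terms as zero error. The name <code>smape_safe</code> and this convention are assumptions for illustration, not part of the original scoring function.</p>

```python
import numpy as np

def smape_safe(y_true, y_pred):
    """sMAPE in percent; terms where both values are zero count as zero error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2
    diff = np.abs(y_true - y_pred)
    # Divide only where the denominator is non-zero; other entries stay at 0
    ratio = np.divide(diff, denominator, out=np.zeros_like(diff), where=denominator != 0)
    return np.mean(ratio) * 100

# Invented values: one 10-point miss, one exact hit, one zero/zero pair
score = smape_safe([100, 50, 0], [110, 50, 0])
print(round(score, 2))  # 3.17
```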

In [None]:
features = [
    'rougher.input.feed_au', 'rougher.input.feed_ag', 'rougher.input.feed_pb', 'rougher.input.feed_sol', 
    'rougher.input.feed_size', 'rougher.input.floatbank10_sulfate', 'rougher.input.floatbank10_xanthate',
    'rougher.input.floatbank11_sulfate', 'rougher.input.floatbank11_xanthate', 
    'rougher.state.floatbank10_a_air', 'rougher.state.floatbank10_a_level', 'rougher.state.floatbank10_b_air', 
    'rougher.state.floatbank10_b_level', 'rougher.state.floatbank10_c_air', 'rougher.state.floatbank10_c_level', 
    'rougher.state.floatbank10_d_air', 'rougher.state.floatbank10_d_level'
]

target = 'rougher.output.recovery'

X = train_df[features]
y = train_df[target]

In [None]:
# Linear Regression
lr_model = LinearRegression()
lr_model.fit(X, y)

# Custom cross-validation with sMAPE
y_pred = cross_val_predict(lr_model, X, y, cv=5)
smape_value = smape(y, y_pred)

print(f"Linear Regression sMAPE: {smape_value:.2f}%")

<h4>Linear Regression Model Conclusions</h4>
<p>&nbsp;&nbsp;&nbsp;&nbsp;As a baseline, an sMAPE of 11.28% is a reasonable starting point. It is consistent with what I observed in the preliminary analysis: because the data is uniform and each process measurement correlates strongly with overall recovery, these models should produce effective and reliable results.</p>

<h3>Hyperparameter Tuning</h3>
<p>For each of the more complicated models, I will perform hyperparameter tuning to find the optimal set of parameters that maximizes model performance. The following hyperparameters will be tuned:</p>
<ul>
    <li><strong>Random Forest Regressor Model</strong>: The hyperparameters to be tuned include:
        <ul>
            <li><strong>n_estimators</strong>: The number of trees in the forest. I will explore values of 10,
                25, 50, and 100 to understand the trade-off between model complexity and computation time.</li>
            <li><strong>max_depth</strong>: The maximum depth of each tree. A deeper tree may capture more
                information but could risk overfitting. I will explore None (unlimited depth) and depths from 10 to 40 in increments of 10.
            </li>
            <li><strong>min_samples_split</strong>: The minimum number of samples required to split an internal
                node. A higher value can help prevent overfitting by making the tree more conservative in
                splitting.</li>
            <li><strong>min_samples_leaf</strong>: The minimum number of samples required to be at a leaf node. A
                higher value can help smooth the model and reduce overfitting.</li>
            <li><strong>max_features</strong>: The number of features to consider when looking for the best split.
                This helps to reduce the model’s variance by making trees more diverse. Common values include
                'sqrt', 'log2', None (all features), or an integer number of features. The older 'auto' option
                was removed in scikit-learn 1.3.</li>
        </ul>
    </li>
    <li><strong>Ridge and Lasso Regression Models</strong>: I will primarily focus on tuning the 'alpha' parameter.</li>
</ul>
<p>I will use <strong>GridSearchCV</strong> for exhaustive searching over the parameter space. For computational efficiency, I may also use <strong>RandomizedSearchCV</strong> if the grid search becomes too time-consuming. Cross-validation will be applied to ensure the model is validated on different subsets of the data, and I will evaluate performance using the sMAPE metric.</p>
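
<p>&nbsp;&nbsp;&nbsp;&nbsp;To see why a randomized search is attractive for the Random Forest, it helps to count how many candidates an exhaustive search would have to fit. The sketch below does this for a grid of the sizes described above; the 5-fold setting and the three <code>max_features</code> options are assumptions for this example.</p>

```python
import math

# Grid with the value ranges described above (max_features options assumed)
rf_grid = {
    'n_estimators': [10, 25, 50, 100],
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 4, 6],
    'max_features': ['sqrt', 'log2', None],
}

n_folds = 5    # assumed number of cross-validation folds
n_iter = 10    # candidates sampled by a randomized search

n_combinations = math.prod(len(values) for values in rf_grid.values())
print("exhaustive fits:", n_combinations * n_folds)   # 960 candidates x 5 folds
print("randomized fits:", n_iter * n_folds)
```

<p>&nbsp;&nbsp;&nbsp;&nbsp;Sampling 10 candidates covers only about 1% of this grid, so the best parameters found are a reasonable starting point rather than a guaranteed optimum.</p>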

In [None]:
# Random Forest
rf_model = RandomForestRegressor(random_state=42)

rf_para_grid = {
    'n_estimators': [10, 25, 50, 100],
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 4, 6],
    'max_features': ['sqrt', 'log2', None]  # 'auto' was removed in scikit-learn 1.3; None uses all features
}

rf_search = RandomizedSearchCV(
    rf_model, param_distributions=rf_para_grid, n_iter=10, cv=None,
    n_jobs=-1, scoring=smape_scorer, random_state=42)
rf_search.fit(X, y)

y_pred_rf = cross_val_predict(rf_search.best_estimator_, X, y, cv=5)
smape_rf = smape(y, y_pred_rf)

print(f"Random Forest sMAPE: {smape_rf:.2f}%")

<h4>Feature Importance for Random Forest Regressor Model</h4>

In [None]:
# Get feature importances from the best model
importances = rf_search.best_estimator_.feature_importances_

# Visualize feature importances
plt.figure(figsize=(10, 6))
plt.barh(X.columns, importances)
plt.xlabel("Feature Importance")
plt.title("Feature Importance in Random Forest Model")
plt.show()

vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

display(vif_data)

#### **Random Forest Model Conclusions**
&nbsp;&nbsp;&nbsp;&nbsp;Analysis of the better-performing model revealed that `rougher.input.feed_au` and `rougher.input.floatbank10_sulfate` are the most impactful features in predicting gold recovery. This aligns with expectations, as gold concentration and flotation chemicals directly affect recovery efficiency.<br>
&nbsp;&nbsp;&nbsp;&nbsp;Features such as `rougher.state.floatbank10_c_air` were found to have minimal impact, suggesting potential redundancy. Further evaluation could determine whether these variables can be safely excluded to streamline the model.

In [None]:
# RidgeCV
ridge_para_grid = {
    'alpha': np.logspace(-3, 3, 7)
}

ridge_grid = GridSearchCV(Ridge(), ridge_para_grid, scoring=smape_scorer, n_jobs=-1, cv=None)
ridge_grid.fit(X, y)

y_pred_ridge = cross_val_predict(ridge_grid, X, y, cv=5)
smape_ridge = smape(y, y_pred_ridge)

print(f"Ridge Regression sMAPE: {smape_ridge:.2f}%")

<h4>RidgeCV Model Conclusion</h4>
<p>&nbsp;&nbsp;&nbsp;&nbsp;The Ridge model, with a tuned alpha parameter, performed marginally better than the baseline Linear Regression model. This implies that it found little meaningful structure beyond the strong correlations already present in the data, and that it could not describe the nuance that the Random Forest model was able to pick up on.</p>

In [None]:
# LassoCV
lasso_para_grid = {
    'alpha': np.logspace(-3, 3, 7)
}

lasso_grid = GridSearchCV(Lasso(), lasso_para_grid, scoring=smape_scorer, n_jobs=-1, cv=None)
lasso_grid.fit(X, y)

y_pred_lasso = cross_val_predict(lasso_grid, X, y, cv=5)
smape_lasso = smape(y, y_pred_lasso)

print(f"Lasso Cross Validation sMAPE: {smape_lasso:.2f}%")

<h4>LassoCV Model Conclusion</h4>
<p>&nbsp;&nbsp;&nbsp;&nbsp;The Lasso model performed the best of all the models tested, ahead of the more complex Random Forest model. Because its L1 penalty can shrink coefficients to exactly zero, this result suggests that several of the features in the model are relative "noise" when it comes to predicting gold recovery. Lasso's ability to distinguish what is and isn't important is a key benefit for producing the results we are interested in.</p>
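
<p>&nbsp;&nbsp;&nbsp;&nbsp;The feature-selection effect of Lasso can be seen on synthetic data. In the sketch below, only the first two of five features drive the target, and the fitted model zeroes out the rest. The data, the <code>alpha</code> value, and the variable names are invented for illustration and are unrelated to the gold recovery dataset.</p>

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: the target depends on features 0 and 1; features 2-4 are pure noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)

# Indices of features whose coefficients survived the L1 penalty
kept = np.flatnonzero(lasso.coef_)
print("coefficients:", np.round(lasso.coef_, 2))
print("features kept:", kept.tolist())
```

<p>&nbsp;&nbsp;&nbsp;&nbsp;In the notebook, the analogous check would pair <code>lasso_grid.best_estimator_.coef_</code> with the <code>features</code> list to see which inputs were retained.</p>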

In [None]:
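# Testing the superior Lasso model on the test set.
# The test file ships without the target, so test_df[target] holds only the imputed
# training mean; look up the true recovery in full_df instead
# (assumes both frames share a unique 'date' column).
X_test = test_df[features]
y_test = test_df[['date']].merge(full_df[['date', target]], on='date', how='left')[target].values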

y_pred_test = lasso_grid.predict(X_test)

smape_test = smape(y_test, y_pred_test)

print(f"Lasso Model Test Set sMAPE: {smape_test:.2f}%")

<h2>Conclusions</h2>

<h3>Model Conclusion</h3>
<p>&nbsp;&nbsp;&nbsp;&nbsp;The sMAPE scores of the cross-validated models show that the Lasso model was best able to identify the features that characterize gold processing and, in turn, to predict gold recovery more accurately. Evaluating every model with cross-validation and sMAPE gave a consistent basis for comparing performance, and it supports the choice of Lasso as the model best suited to predicting gold recovery. Its sMAPE score on the untouched test set provides a final check of this choice.</p>

<h3>Project Conclusion and Insights</h3>
<p>&nbsp;&nbsp;&nbsp;&nbsp;This project aimed to develop a machine learning model to optimize gold recovery processes for Zyfra. By thoroughly preparing and analyzing the data, we ensured high data quality and consistency across training and testing phases. Through careful model construction and evaluation, we identified a model that proficiently predicts gold recovery, helping to optimize production and eliminate unprofitable parameters. This effort paves the way for improved operational efficiency, aligning with Zyfra's goals for innovation in heavy industry.</p>