
# **Quality Prediction of Iron Ore Mining Flotation Process - Part:3**

# **Machine Learning Models**

### **Import Libraries and Modules**

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import math
import random
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor

### **Import Dataset**

In [17]:
from google.colab import drive
drive.mount('/content/drive/')
%cd /content/drive/My Drive/Flotation/

flotation = pd.read_csv('flotation_scaled.csv')
flotation.head().T

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
/content/drive/My Drive/Flotation


| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| iron_feed | 0.540799 | 0.540799 | 0.540799 | 0.540799 | 0.540799 |
| silica_feed | 0.488314 | 0.488314 | 0.488314 | 0.488314 | 0.488314 |
| starch_flow | 0.256417 | 0.244856 | 0.28508 | 0.26904 | 0.276932 |
| amina_flow | 0.679801 | 0.595667 | 0.706357 | 0.708914 | 0.762635 |
| pulp_flow | 0.531515 | 0.55864 | 0.531768 | 0.558523 | 0.552414 |
| pulp_pH | 0.661952 | 0.669876 | 0.630313 | 0.567179 | 0.48336 |
| pulp_density | 0.670355 | 0.471926 | 0.681237 | 0.674662 | 0.786425 |
| airflow | 0.458967 | 0.45431 | 0.45598 | 0.455192 | 0.455045 |
| level | 0.276882 | 0.272294 | 0.275446 | 0.378092 | 0.520677 |
| iron_conc | 0.815436 | 0.840604 | 0.825503 | 0.788591 | 0.768456 |


### **Split Dataset as X and y**

In [19]:
X = flotation.drop('silica_conc', axis=1)
y_Si = flotation['silica_conc']

print("Shape of X:", X.shape)
print("Shape of y_Si:", y_Si.shape)

Shape of X: (4097, 10)
Shape of y_Si: (4097,)


### **Split into train, validation, and test set**

In [20]:
from sklearn.model_selection import train_test_split

X_train, X_val_test, y_train, y_val_test = train_test_split(X, y_Si, 
                                           test_size=0.4, random_state=1)

X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, 
                                                test_size=0.5, random_state=1) 

print(X_train.shape[0], X_val.shape[0], X_test.shape[0])

2458 819 820


### **Estimate and validate a linear regression with all inputs**

In [21]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

lr0 = linear_model.LinearRegression()

lr0.fit(X_train, y_train)

y_train_pred = lr0.predict(X_train)
y_val_pred = lr0.predict(X_val)

print('MSE on training set:')
print(mean_squared_error(y_train, y_train_pred))
print('MSE on validation set:')
print(mean_squared_error(y_val, y_val_pred))
print('')

print('R squared on training set:')
print(r2_score(y_train, y_train_pred))
print('R squared on validation set:')
print(r2_score(y_val, y_val_pred))

MSE on training set:
0.015910449639251378
MSE on validation set:
0.016538188279176805

R squared on training set:
0.6837973814566114
R squared on validation set:
0.6719744191606916


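- The linear regression with all inputs shows little overfitting: R squared is 0.684 on the training set and 0.672 on the validation set, and the MSE values are similarly close. The fit itself is moderate, explaining about two thirds of the variance in `silica_conc`.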

- We are not yet using the test set because we are going to try other models and then pick the best one.
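
As a quick check on how much performance drops from training to validation, the gap can be computed directly from the metrics printed above. This is a minimal sketch; the helper name `generalization_gap` is illustrative and not part of the notebook.

```python
# Metrics copied from the linear-regression output above.
mse_train, mse_val = 0.015910449639251378, 0.016538188279176805
r2_train, r2_val = 0.6837973814566114, 0.6719744191606916


def generalization_gap(train_value, val_value):
    """Return the absolute and relative change from training to validation."""
    absolute = val_value - train_value
    relative = absolute / train_value
    return absolute, relative


mse_abs, mse_rel = generalization_gap(mse_train, mse_val)
r2_abs, r2_rel = generalization_gap(r2_train, r2_val)

print(f"MSE rises by {mse_abs:.6f} ({mse_rel:.1%}) on validation")
print(f"R squared falls by {-r2_abs:.4f} ({-r2_rel:.1%}) on validation")
```

A change of a few percent between the two sets suggests the model behaves similarly on data it has not seen.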

### **Estimate and validate a linear regression with ten randomly chosen inputs**

In [22]:
import random
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

random.seed(10)

lr = linear_model.LinearRegression()

# X has only 10 columns, so sampling 10 indices returns all of them in shuffled order
input_indices = random.sample(range(0, X.shape[1]), 10)

X_train_subset = X_train.iloc[:, input_indices]
X_val_subset = X_val.iloc[:, input_indices]

lr.fit(X_train_subset, y_train)

y_train_pred = lr.predict(X_train_subset)
y_val_pred = lr.predict(X_val_subset)

print('MSE on training set:')
print(mean_squared_error(y_train, y_train_pred))
print('MSE on validation set:')
print(mean_squared_error(y_val, y_val_pred))
print('-'*25)
print('R squared on training set:')
print(r2_score(y_train, y_train_pred))
print('R squared on validation set:')
print(r2_score(y_val, y_val_pred))


MSE on training set:
0.015910449639251378
MSE on validation set:
0.016538188279176805
-------------------------
R squared on training set:
0.6837973814566114
R squared on validation set:
0.6719744191606916


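- The metrics are identical to those of the full model. `X` has only 10 columns, so drawing 10 of them selects every input (in shuffled order) and the least-squares fit is unchanged. To test whether fewer inputs reduce overfitting, fewer than 10 columns would have to be sampled.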
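
The effect of the sample size on the selected columns can be checked without fitting any model. The sketch below assumes 10 available columns, as reported by `X.shape` earlier; the subset size of 5 is an arbitrary illustration, not a value used in the notebook.

```python
import random

n_features = 10   # X.shape[1] in this notebook
k = 10            # number of inputs requested in the cell above

random.seed(10)
input_indices = random.sample(range(n_features), k)

# With k == n_features the sample is a permutation: every column is selected.
print(sorted(input_indices) == list(range(n_features)))

# A strict subset needs k < n_features (5 is an arbitrary illustration).
subset = random.sample(range(n_features), 5)
print(len(set(subset)), "distinct columns out of", n_features)
```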

### **Estimate many linear regressions with ten randomly chosen inputs and pick the best one**

- Repeat 1000 times: randomly choose 10 inputs, fit a linear regression, and keep the model with the lowest validation MSE.

In [23]:
import random
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

random.seed(10)

lr = linear_model.LinearRegression()

input_indices = random.sample(range(0, X.shape[1]), 10)

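# Baseline to beat: validation MSE of the model fitted in the previous cell
# (this reuses y_val_pred, so the previous cell must be run first)
MSE = mean_squared_error(y_val, y_val_pred)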

for j in range(0, 1000):
    lr_j = linear_model.LinearRegression()
    input_indices_j = random.sample(range(0, X.shape[1]), 10)
    lr_j.fit(X_train.iloc[:, input_indices_j], y_train)
    y_val_pred_j = lr_j.predict(X_val.iloc[:, input_indices_j])
    MSE_j = mean_squared_error(y_val, y_val_pred_j)
    if MSE_j < MSE:
        input_indices = input_indices_j
        lr = lr_j
        MSE = MSE_j

# Make predictions on the train, validation, and test sets
X_train_subset = X_train.iloc[:, input_indices]
X_val_subset = X_val.iloc[:, input_indices]
X_test_subset = X_test.iloc[:, input_indices]

lr.fit(X_train_subset, y_train)

y_train_pred = lr.predict(X_train_subset)
y_val_pred = lr.predict(X_val_subset)
y_test_pred = lr.predict(X_test_subset)

print('MSE on training set:')
print(mean_squared_error(y_train, y_train_pred))
print('MSE on validation set:')
print(mean_squared_error(y_val, y_val_pred))
print('MSE on test set:')
print(mean_squared_error(y_test, y_test_pred))
print('')

print('R squared on training set:')
print(r2_score(y_train, y_train_pred))
print('R squared on validation set:')
print(r2_score(y_val, y_val_pred))
print('R squared on test set:')
print(r2_score(y_test, y_test_pred))


MSE on training set:
0.015910449639251378
MSE on validation set:
0.016538188279176795
MSE on test set:
0.017172259535480513

R squared on training set:
0.6837973814566114
R squared on validation set:
0.6719744191606918
R squared on test set:
0.6736118126581889


- The MSE is about 0.016 on the training and validation sets and about 0.017 on the test set, so the error is stable across the three splits.

- R-squared values are 0.684, 0.672, and 0.674 on the training, validation, and test sets, so the model explains roughly two thirds of the variance in `silica_conc`.

- The search did not improve on the full model. Every draw of 10 indices from 10 columns selects the same inputs, so the validation MSE differs from the earlier result only by floating-point rounding (0.016538188279176795 versus 0.016538188279176805).
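
For comparison, a random search over strict subsets (fewer inputs than columns) could look like the sketch below. It is an illustration under stated assumptions: the data is synthetic rather than the flotation dataset, the subset size `k = 5` and the 200 iterations are arbitrary, and ordinary least squares is fitted with `np.linalg.lstsq` in place of scikit-learn's `LinearRegression` to keep the example self-contained.

```python
import random
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): 10 inputs, only 3 of them informative.
n_rows, n_features, k = 600, 10, 5
X = rng.normal(size=(n_rows, n_features))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + 0.5 * X[:, 7] + rng.normal(scale=0.5, size=n_rows)

X_train, X_val = X[:400], X[400:]
y_train, y_val = y[:400], y[400:]


def fit_ols(A, b):
    """Least-squares fit with an intercept; returns [intercept, coefficients...]."""
    A1 = np.column_stack([np.ones(len(A)), A])
    coef, *_ = np.linalg.lstsq(A1, b, rcond=None)
    return coef


def predict_ols(coef, A):
    """Apply the fitted intercept and coefficients to new rows."""
    return coef[0] + A @ coef[1:]


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


random.seed(10)
best_mse, best_indices = float("inf"), None
for _ in range(200):
    # k < n_features, so each draw is a genuine subset of the columns
    idx = sorted(random.sample(range(n_features), k))
    coef = fit_ols(X_train[:, idx], y_train)
    val_mse = mse(y_val, predict_ols(coef, X_val[:, idx]))
    if val_mse < best_mse:
        best_mse, best_indices = val_mse, idx

print("Best subset:", best_indices)
print("Validation MSE:", round(best_mse, 4))
```

Because the winning subset is chosen on the validation set, its validation MSE is optimistically biased; the held-out test set remains the fair estimate.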

## **10. Random Forest Regressor**

In [25]:
rf1 = RandomForestRegressor(random_state=0, n_estimators=100)
rf1.fit(X_train, y_train)
y_pred_rf = rf1.predict(X_test)
print('R2 Score of Random Forest Regression', r2_score(y_test, y_pred_rf))

R2 Score of Random Forest Regression 0.7458296086893559


### **Estimate and test a Random Forest Regressor with all inputs**

In [29]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# No random_state is set, so the metrics below vary slightly from run to run
rf0 = RandomForestRegressor()

rf0.fit(X_train, y_train)

y_train_pred0 = rf0.predict(X_train)
y_val_pred0 = rf0.predict(X_val)
y_test_pred0 = rf0.predict(X_test)

print('MSE on training set:')
print(mean_squared_error(y_train, y_train_pred0))
print('MSE on validation set:')
print(mean_squared_error(y_val, y_val_pred0))
print('MSE on test set:')
print(mean_squared_error(y_test, y_test_pred0))
print('-'*25)
print('R squared on training set:')
print(r2_score(y_train, y_train_pred0))
print('R squared on validation set:')
print(r2_score(y_val, y_val_pred0))
print('R squared on test set:')
print(r2_score(y_test, y_test_pred0))

MSE on training set:
0.0017140385432286424
MSE on validation set:
0.011863732984685604
MSE on test set:
0.013272989803509374
-------------------------
R squared on training set:
0.9659353765643361
R squared on validation set:
0.7646895876663908
R squared on test set:
0.7477241085471082
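
The output above shows a much higher R squared on the training set (0.966) than on the validation and test sets (about 0.76 and 0.75), which is common for random forests because each tree is grown deep on its bootstrap sample. One way to estimate performance on unseen rows without a separate split is the out-of-bag score. The sketch below uses synthetic data as a stand-in (an assumption, not the flotation dataset) and scikit-learn's `oob_score=True` option.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption), not the flotation dataset.
X = rng.uniform(size=(500, 10))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.3, size=500)

# oob_score=True scores each row using only the trees that did not train on it.
rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

train_r2 = r2_score(y, rf.predict(X))
print("R squared on training rows:", round(train_r2, 3))
print("Out-of-bag R squared:", round(rf.oob_score_, 3))
```

The out-of-bag value is usually closer to validation performance than the training score, so it is a more realistic number to compare against the linear models.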
