# This document shows the training process of the model.

The reference machine for training times is a 16-inch Apple M1 Pro (10 cores, 16 GB RAM).

A line of code is commented only the first time it appears.
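
The training times reported below could be collected with a small timing helper. The following is a minimal sketch, not necessarily how the times in this document were measured; the `timed` helper, the `timings` dictionary, and the toy workload are hypothetical names introduced here for illustration.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Measure the wall-clock time of the enclosed block and store it in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}
with timed("toy workload", timings):
    # Stand-in for a call such as model.fit(X_train, y_train)
    total = sum(i * i for i in range(100_000))

print(f"Time to train: {timings['toy workload']:.3f} seconds")
```

Wrapping only the `fit` call keeps data loading and prediction out of the measured time.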

## Trial 1

### Algorithm: KNN

### Parameters:
K_Neighbors: 5

### Results:

MSE: 2660.73
R^2: 0.974

Time to train: 40 seconds 
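
Every trial reports the same two metrics, so it may help to see how they are defined. The sketch below computes MSE and R^2 by hand in plain Python on toy values (not taken from `Battery_RUL.csv`); the function names mirror `sklearn.metrics` but are local stand-ins written for this example.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values for illustration only
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(mse)           # 0.375
print(round(r2, 3))  # 0.949
```

MSE is expressed in squared target units, so its square root is often easier to read: for Trial 1, the square root of 2660.73 is about 51.6.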

In [None]:
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Load the battery dataset
bat = pd.read_csv('Battery_RUL.csv')

# Drop "Cycle_Index" (first column) since it doesn't carry info, and "RUL" (last column) from X since it's the objective
X = bat.iloc[:, 1:-1]
y = bat["RUL"]

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4333)

knn = KNeighborsRegressor(n_neighbors=5)

knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

# Report MSE and R^2 on the held-out test set
print(metrics.mean_squared_error(y_test, y_pred))
print(metrics.r2_score(y_test, y_pred))

#y_pred = pd.DataFrame(y_pred)
#y_pred.to_csv("/test.csv")

#print(y_pred)

## Trial 2

### Algorithm: XGBoost

### Parameters:
N_estimators: 100
Learning Rate: 0.5
Max Depth: 10
Random State: 4333

### Results:

MSE: 844.4
R^2: 0.991

Time to train:

In [None]:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn import metrics

bat = pd.read_csv('Battery_RUL.csv')

# Drop "Cycle_Index" since it doesn't carry info, and "RUL" from X since it's the objective
X = bat.iloc[:, 1:-1]
y = bat["RUL"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4333)

# Initialize the XGBRegressor
xgb_regressor = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, learning_rate=0.5, max_depth=10, random_state=4333)

xgb_regressor.fit(X_train, y_train)

y_pred = xgb_regressor.predict(X_test)

print(metrics.mean_squared_error(y_test, y_pred))
print(metrics.r2_score(y_test, y_pred))

## Trial 3

### Algorithm: XGBoost + GridSearch

### Parameters:
```py
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'reg_alpha': [0, 0.1, 1],
    'reg_lambda': [1, 1.5, 2]
}
```
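
The size of this grid largely determines how long the search runs. The sketch below counts the parameter combinations and the resulting number of model fits, assuming 3-fold cross-validation as in the `GridSearchCV` call of this trial; `n_candidates` and `n_fits` are names introduced here.

```python
from math import prod

# Trial 3 grid, copied from above
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'reg_alpha': [0, 0.1, 1],
    'reg_lambda': [1, 1.5, 2]
}

cv_folds = 3  # matches cv=3

# Every combination of values is one candidate model
n_candidates = prod(len(values) for values in param_grid.values())

# Each candidate is fitted once per fold (the final refit of the best model is not counted)
n_fits = n_candidates * cv_folds

print(n_candidates)  # 972
print(n_fits)        # 2916
```

The same calculation can be applied to the grids of the later trials to compare their search cost before running them.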

### Results:

Best parameters found:  {'colsample_bytree': 0.8, 'learning_rate': 0.2, 'max_depth': 7, 'n_estimators': 200, 'reg_alpha': 0.1, 'reg_lambda': 2, 'subsample': 1.0}
MSE: 766.1
R^2: 0.992

Time to train: 40 seconds

In [None]:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics

bat = pd.read_csv('Battery_RUL.csv')

X = bat.iloc[:, 1:-1]
y = bat["RUL"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4333)

xgb_regressor = xgb.XGBRegressor(objective='reg:squarederror', random_state=4333)

param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'reg_alpha': [0, 0.1, 1],
    'reg_lambda': [1, 1.5, 2]
}

# Set up the GridSearchCV
grid_search = GridSearchCV(estimator=xgb_regressor, param_grid=param_grid, cv=3, scoring='neg_mean_squared_error', n_jobs=-1, verbose=2)

grid_search.fit(X_train, y_train)

# Get the best estimator
best_xgb_regressor = grid_search.best_estimator_

y_pred = best_xgb_regressor.predict(X_test)

print("Best parameters found: ", grid_search.best_params_)
print("Mean Squared Error: ", metrics.mean_squared_error(y_test, y_pred))
print("R-squared: ", metrics.r2_score(y_test, y_pred))

## Trial 4

### Algorithms & Techniques: XGBoost, GridSearch

### Parameters:
```py
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 100],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'reg_alpha': [0, 0.1, 1],
    'reg_lambda': [1, 1.5, 2]
}
```
**note: Same code as Trial 3, but the largest `max_depth` value is raised from 7 to 100.**

### Results:

Best parameters found:  {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 100, 'n_estimators': 200, 'reg_alpha': 0.1, 'reg_lambda': 1, 'subsample': 0.8}
MSE: 669.1
R^2: 0.993

Time to train: 5 minutes

## Trial 5

### Algorithms & Techniques: XGBoost, GridSearch and Normalization

### Parameters:
```py
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 100],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'reg_alpha': [0, 0.1, 1],
    'reg_lambda': [1, 1.5, 2]
}
```

### Results:

Best parameters found:  {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 100, 'n_estimators': 200, 'reg_alpha': 0.1, 'reg_lambda': 2, 'subsample': 1.0}
MSE: 1515.7
R^2: 0.985

Time to train: 5 minutes

In [None]:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import normalize
from sklearn import metrics

bat = pd.read_csv('Battery_RUL.csv')

X = bat.iloc[:, 1:-1]
y = bat["RUL"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4333)

# Normalize the training data (by default, normalize() rescales each row to unit L2 norm).
# Note: X_test is not normalized in this trial, so the recorded results reflect that mismatch.
X_train = normalize(X_train)

xgb_regressor = xgb.XGBRegressor(objective='reg:squarederror', random_state=4333)

param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 100],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'reg_alpha': [0, 0.1, 1],
    'reg_lambda': [1, 1.5, 2]
}

grid_search = GridSearchCV(estimator=xgb_regressor, param_grid=param_grid, cv=3, scoring='neg_mean_squared_error', n_jobs=-1, verbose=2)

grid_search.fit(X_train, y_train)

best_xgb_regressor = grid_search.best_estimator_

y_pred = best_xgb_regressor.predict(X_test)

print("Best parameters found: ", grid_search.best_params_)
print("Mean Squared Error: ", metrics.mean_squared_error(y_test, y_pred))
print("R-squared: ", metrics.r2_score(y_test, y_pred))
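
In the cell above, only `X_train` is passed through `normalize` before fitting, while `X_test` reaches `predict` in its original scale; this may contribute to the higher MSE of this trial. With its default arguments, `sklearn.preprocessing.normalize` rescales each row (sample) to unit L2 norm rather than scaling each column. The sketch below reproduces that row-wise behavior in plain Python on toy data; `l2_normalize_rows` is a hypothetical helper written for this example, and the values are not from `Battery_RUL.csv`.

```python
import math

def l2_normalize_rows(rows):
    """Scale each row to unit Euclidean length; all-zero rows are left unchanged."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        normalized.append([v / norm for v in row] if norm > 0 else list(row))
    return normalized

# Toy rows: the second training row is the first one multiplied by 10
train_rows = [[3.0, 4.0], [30.0, 40.0]]
test_rows = [[6.0, 8.0]]

train_norm = l2_normalize_rows(train_rows)
test_norm = l2_normalize_rows(test_rows)

print(train_norm)
print(test_norm)
```

Rows that differ only in magnitude become identical after this transformation, so a model trained on normalized rows only sees comparable inputs if the test rows are transformed the same way. Column-wise scaling (for example with `StandardScaler`) is a different technique and was not used in this trial.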

## Trial 6

### Algorithms & Techniques: Random Forest, GridSearch

### Parameters:
```py
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 100],
    'max_leaf_nodes': [None, 10, 20, 30],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2, 3],
    'min_impurity_decrease': [0, 0.1, 0.2],
    'min_weight_fraction_leaf': [0, 0.1, 0.2],
    'random_state': [100, 500, 4000]
}
```

### Results:

Best parameters found:  
MSE: 835.6
R^2: 0.991

Time to train: 45 minutes

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics

bat = pd.read_csv('Battery_RUL.csv')

X = bat.iloc[:, 1:-1]
y = bat["RUL"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4333)

# Initialize the RandomForestRegressor
randomForest_regressor = RandomForestRegressor(criterion='squared_error')

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 100],
    'max_leaf_nodes': [None, 10, 20, 30],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2, 3],
    'min_impurity_decrease': [0, 0.1, 0.2],
    'min_weight_fraction_leaf': [0, 0.1, 0.2],
    'random_state': [100, 500, 4000]
}

grid_search = GridSearchCV(estimator=randomForest_regressor, param_grid=param_grid, cv=3, scoring='neg_mean_squared_error', n_jobs=-1, verbose=2)

grid_search.fit(X_train, y_train)

best_rf_regressor = grid_search.best_estimator_

y_pred = best_rf_regressor.predict(X_test)

print("Best parameters found: ", grid_search.best_params_)
print("Mean Squared Error: ", metrics.mean_squared_error(y_test, y_pred))
print("R-squared: ", metrics.r2_score(y_test, y_pred))

## Trial 7

### Algorithms & Techniques: XGBoost, GridSearch, and Parameter Optimization

The best result so far uses XGBoost (Trial 4). Looking at that result's parameters: if the best value for a parameter is the largest value in its list, I set it as the new lower limit and add two larger values.
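
The rule described above can be sketched as a small function. This is an illustration under assumptions, not the procedure actually used: the function name and the `multipliers` argument are hypothetical, and the document does not state how the two larger values were chosen.

```python
def expand_if_at_upper_edge(values, best, multipliers=(2, 5)):
    """Return a new candidate list following the rule described above.

    If `best` is the largest value tried, it becomes the new lower limit and
    two larger values are appended. Otherwise the list is returned unchanged.
    The multipliers are an assumption made for this sketch.
    """
    if best != max(values):
        return list(values)
    return [best] + [best * m for m in multipliers]

# The best max_depth was the largest value tried, so the range moves upward
print(expand_if_at_upper_edge([3, 5, 100], best=100))        # [100, 200, 500]

# The best learning_rate was not at the upper edge, so nothing changes
print(expand_if_at_upper_edge([0.01, 0.1, 0.2], best=0.1))   # [0.01, 0.1, 0.2]
```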

**note: This time `random_state` gets its own list of values instead of being fixed at 4333.**

**note: It took 1 hour and 45 minutes to train a slightly worse model than the one that took 5 minutes to train.**

**note: From now on the code is the same as in Trial 4; only the parameters change.**

### Parameters:

```py
param_grid = {
    'n_estimators': [200, 300, 1000],
    'learning_rate': [0.1, 0.15, 0.19],
    'max_depth': [100, 200, 500],
    'subsample': [0.1, 0.5, 0.8],
    'colsample_bytree': [0.1, 0.5, 0.8],
    'reg_alpha': [1, 2, 3],
    'reg_lambda': [0.05, 0.5, 0.9],
    'random_state': [100, 500, 4000]
}
```

### Results:

Best parameters found:  {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 100, 'n_estimators': 300, 'random_state': 500, 'reg_alpha': 3, 'reg_lambda': 0.9, 'subsample': 0.8}
MSE: 743.6
R^2: 0.992

Time to train: 1 hour and 45 minutes

## Trial 8

### Algorithms & Techniques: XGBoost, GridSearch, and Parameter Optimization

Comparing the best parameters of the two best XGBoost results so far (Trials 4 and 7) shows which values they share:

**found in both:**

**colsample_bytree: 0.8**
**learning rate: 0.1**
**max depth: 100**
**n_estimators: not common**
**reg_alpha: not common**
**reg_lambda: not common**
**subsample: 0.8**
**random_state: not common**

Looking at the parameters that are **not common:**

For `n_estimators`, 200 and 300 were best out of [50, 100, 200] and [200, 300, 1000], so **[200, 250, 300] will be used**.

For `reg_alpha`, 1 and 3 were best out of [0, 0.1, 1] and [1, 2, 3]; the maximum value was always chosen. According to the XGBoost documentation, the range for this parameter is [0, ∞), so **[5, 10, 100] will be used**.

For `reg_lambda`, 1 and 0.9 were best out of [1, 1.5, 2] and [0.05, 0.5, 0.9], so [1] will be used.

For `random_state`, 500 was best out of [100, 500, 4000, 4333], so [300, 500, 1000] will be used.
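
The comparison above can also be done programmatically. The sketch below assumes the two results being compared are the best parameters recorded for Trial 4 and Trial 7; `compare_params` is a hypothetical helper introduced here. In Trial 4 `random_state` was fixed on the estimator rather than searched, so it is absent from that dictionary and shows up as `None`. The recorded Trial 4 result lists `reg_alpha` as 0.1.

```python
# Best parameters as recorded in the Results of Trial 4 and Trial 7
trial_4 = {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 100,
           'n_estimators': 200, 'reg_alpha': 0.1, 'reg_lambda': 1, 'subsample': 0.8}
trial_7 = {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 100,
           'n_estimators': 300, 'random_state': 500, 'reg_alpha': 3,
           'reg_lambda': 0.9, 'subsample': 0.8}

def compare_params(a, b):
    """Split two parameter dicts into shared values and differing (a, b) pairs."""
    common = {k: a[k] for k in a if k in b and a[k] == b[k]}
    different = {k: (a.get(k), b.get(k))
                 for k in sorted(set(a) | set(b)) if k not in common}
    return common, different

common, different = compare_params(trial_4, trial_7)

print("found in both:", common)
print("not common:", different)
```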

### Parameters:

```py
param_grid = {
    'n_estimators': [200, 250, 300],
    'learning_rate': [0.1],
    'max_depth': [100],
    'subsample': [0.8],
    'colsample_bytree': [0.8],
    'reg_alpha': [5, 10, 100],
    'reg_lambda': [1],
    'random_state': [300, 500, 1000]
}
```

### Results:

Best parameters found:  {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 100, 'n_estimators': 300, 'random_state': 500, 'reg_alpha': 10, 'reg_lambda': 1, 'subsample': 0.8}
MSE: 753.8
R^2: 0.992

Time to train: 24 seconds

## Trial 9

### Algorithms & Techniques: XGBoost, GridSearch, and Parameter Optimization

Comparing the results, `n_estimators` and `random_state` are the only parameters with enough range left to try more values. The result is close to the best result so far but not better.

### Parameters:

```py
param_grid = {
    'n_estimators': [350, 750, 900],
    'learning_rate': [0.1],
    'max_depth': [100],
    'subsample': [0.8],
    'colsample_bytree': [0.8],
    'reg_alpha': [1],
    'reg_lambda': [1],
    'random_state': [600, 700, 900]
}
```

### Results:

Best parameters found:  {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 100, 'n_estimators': 350, 'random_state': 700, 'reg_alpha': 1, 'reg_lambda': 1, 'subsample': 0.8}
MSE: 684.9
R^2: 0.993

Time to train: 25 seconds

## Trial 10

### Algorithms & Techniques: XGBoost, GridSearch, and Parameter Optimization

The main difference between the last trial and the best trial is `n_estimators`: the best trial uses 200 and the previous trial used 350. Also, the best `random_state` so far is 700. The result is very close to the best result so far but not better.

### Parameters:

```py
param_grid = {
    'n_estimators': [200],
    'learning_rate': [0.1],
    'max_depth': [100],
    'subsample': [0.8],
    'colsample_bytree': [0.8],
    'reg_alpha': [1],
    'reg_lambda': [1],
    'random_state': [700]
}
```

### Results:

Best parameters found:  {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 100, 'n_estimators': 200, 'random_state': 700, 'reg_alpha': 1, 'reg_lambda': 1, 'subsample': 0.8}
MSE: 686.7
R^2: 0.993

Time to train: 

## Trial 11

### Algorithms & Techniques: XGBoost, GridSearch, and Parameter Optimization

Trying `n_estimators` at 2000 and `random_state` at 4333 gives a new best result.

### Parameters:

```py
param_grid = {
    'n_estimators': [2000],
    'learning_rate': [0.1],
    'max_depth': [100],
    'subsample': [0.8],
    'colsample_bytree': [0.8],
    'reg_alpha': [1],
    'reg_lambda': [1],
    'random_state': [4333]
}
```

### Results:

Best parameters found:  {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 100, 'n_estimators': 2000, 'random_state': 4333, 'reg_alpha': 1, 'reg_lambda': 1, 'subsample': 0.8}
MSE: 618.3
R^2: 0.994

Time to train: 23 seconds