## Remaining Battery Life 🔋🪫 Prediction

The Hawaii Natural Energy Institute conducted an analysis on 14 NMC-LCO 18650 batteries, each with a nominal capacity of 2.8 Ah. These batteries underwent over 1000 charge-discharge cycles at a temperature of 25°C, using a constant current-constant voltage (CC-CV) charging method at a C/2 rate and a discharge rate of 1.5C.

<img align=left width=550px src='https://apmonitor.com/pds/uploads/Main/battery_life.png'>

Data

 - Cycle Index: cycle number
 - F1: Discharge Time (s)
 - F2: Time at 4.15V (s)
 - F3: Time Constant Current (s)
 - F4: Decrement 3.6-3.4V (s)
 - F5: Max. Voltage Discharge (V)
 - F6: Min. Voltage Charge (V)
 - F7: Charging Time (s)
 - Total time (s)
 - RUL: Remaining Useful Life in cycles (target)

See full [problem statement](https://apmonitor.com/pds/index.php/Main/BatteryLife).

### Import Packages and Battery Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

url = 'http://apmonitor.com/pds/uploads/Main/'
data = pd.read_csv(url+"battery_data.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15064 entries, 0 to 15063
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Cycle_Index                15064 non-null  float64
 1   Discharge Time (s)         15064 non-null  float64
 2   Decrement 3.6-3.4V (s)     15064 non-null  float64
 3   Max. Voltage Dischar. (V)  15064 non-null  float64
 4   Min. Voltage Charg. (V)    15064 non-null  float64
 5   Time at 4.15V (s)          15064 non-null  float64
 6   Time constant current (s)  15064 non-null  float64
 7   Charging time (s)          15064 non-null  float64
 8   RUL                        15064 non-null  int64  
dtypes: float64(8), int64(1)
memory usage: 1.0 MB


Shorten column names

In [2]:
data.columns = ['Cycle','Disch_s','Dec_3.6-3.4','MaxVD','MinVC','T4.15V','TCC_s','Charge_s','RUL']

Summarize the data

In [3]:
data.describe()

| | Cycle | Disch_s | Dec_3.6-3.4 | MaxVD | MinVC | T4.15V | TCC_s | Charge_s | RUL |
|---|---|---|---|---|---|---|---|---|---|
| count | 15064.0 | 15064.0 | 15064.0 | 15064.0 | 15064.0 | 15064.0 | 15064.0 | 15064.0 | 15064.0 |
| mean | 556.155005 | 4581.27396 | 1239.784672 | 3.908176 | 3.577904 | 3768.336171 | 5461.26697 | 10066.496204 | 554.194172 |
| std | 322.37848 | 33144.012077 | 15039.589269 | 0.091003 | 0.123695 | 9129.552477 | 25155.845202 | 26415.354121 | 322.434514 |
| min | 1.0 | 8.69 | -397645.908 | 3.043 | 3.022 | -113.584 | 5.98 | 5.98 | 0.0 |
| 25% | 271.0 | 1169.31 | 319.6 | 3.846 | 3.488 | 1828.884179 | 2564.31 | 7841.9225 | 277.0 |
| 50% | 560.0 | 1557.25 | 439.239471 | 3.906 | 3.574 | 2930.2035 | 3824.26 | 8320.415 | 551.0 |
| 75% | 833.0 | 1908.0 | 600.0 | 3.972 | 3.663 | 4088.3265 | 5012.35 | 8763.2825 | 839.0 |
| max | 1134.0 | 958320.37 | 406703.768 | 4.363 | 4.379 | 245101.117 | 880728.1 | 880728.1 | 1133.0 |


![idea](https://apmonitor.com/che263/uploads/Begin_Python/idea.png)

### Create new ID Column to Identify 14 Batteries

In [4]:
data['ID']= 0 
# add ID to DataFrame
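
One possible way to fill in this cell, as a sketch rather than the tutorial's own solution: assuming the rows are stored one battery after another, a new battery begins wherever the cycle index drops. The small `demo` frame is synthetic and stands in for `data`.

```python
import pandas as pd

# Synthetic stand-in for the 'Cycle' column: three batteries stored in sequence
demo = pd.DataFrame({'Cycle': [1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 5]})

# Assumption: a drop in the cycle index marks the first row of a new battery
new_battery = demo['Cycle'].diff() < 0

# Running count of the drops gives a battery ID that starts at 1
demo['ID'] = new_battery.cumsum() + 1
print(demo['ID'].tolist())
```

If the ordering assumption holds for the real data, the same two lines applied to `data['Cycle']` should give 14 distinct values; `data['ID'].nunique()` is a quick check.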

![idea](https://apmonitor.com/che263/uploads/Begin_Python/idea.png)

### Filter Data

There are many bad measurements as shown in the line plot. Data rows with bad values need to be removed.

In [5]:
# Create a line plot of the data 
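
A minimal sketch of a line plot that makes bad measurements visible, assuming `pandas` plotting with one subplot per column. The `demo` values are synthetic, with one deliberately bad value in each column.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for two columns of `data`, each with one bad value
demo = pd.DataFrame({
    'Disch_s': [1600.0, 1590.0, 958320.37, 1570.0, 1560.0],
    'MaxVD':   [3.95, 3.94, 3.93, 3.04, 3.91],
})

# One subplot per column so that each variable keeps its own y-axis scale
axes = demo.plot(subplots=True, figsize=(8, 4))
plt.tight_layout()
```

Spikes that are orders of magnitude above the neighboring points stand out immediately in this view.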

Remove bad values with upper and lower validity limits. A more automated approach could reject values based on rate of change or knowledge of physical constraints that would lead to elimination of data rows. 

In [6]:
# Remove bad data values
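
The validity limits can be applied with a boolean mask, as in the sketch below. The limit values are hypothetical and chosen only for the synthetic `demo` frame; suitable limits for the battery data should come from the line plot and `data.describe()`.

```python
import pandas as pd

# Synthetic stand-in for two columns of `data`
demo = pd.DataFrame({
    'Disch_s': [1500.0, 1600.0, 958320.37, 1400.0, 8.69],
    'MaxVD':   [3.90, 3.95, 3.91, 3.04, 3.92],
})

# Hypothetical (lower, upper) validity limits for each column
limits = {'Disch_s': (100.0, 5000.0), 'MaxVD': (3.5, 4.3)}

# Keep a row only if every listed column is inside its limits
mask = pd.Series(True, index=demo.index)
for col, (lo, hi) in limits.items():
    mask &= demo[col].between(lo, hi)

clean = demo[mask].reset_index(drop=True)
print(len(demo), '->', len(clean), 'rows')
```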

After filtering, the line and box plots show fewer outliers.

In [7]:
# Create a line plot of the data 

In [8]:
# Create a box plot of the data 

![idea](https://apmonitor.com/che263/uploads/Begin_Python/idea.png)

### Pair Plot

A pair plot shows the correlation between variables.

```python
sns.pairplot(data)
```

It has univariate distributions on the diagonal and scatter plots on the off-diagonal. Setting `hue='ID'` colors the points by battery `ID`. Pair plots reveal pairs of variables that may be related and give a good indication of which features (explanatory inputs) are useful for classification or regression. Reduce the data by 10x to speed up plotting, for example by keeping every tenth row with `data[::10]`.

In [9]:
# Create a pair plot with reduced data set.

![idea](https://apmonitor.com/che263/uploads/Begin_Python/idea.png)

### Joint Plot

A joint plot shows two variables, with the univariate and joint distributions.

```python
sns.jointplot(x='MaxVD',y='RUL',data=data,kind="kde")
```

Try `kind='reg'`, `'kde'`, and `'hex'` to see different joint plot styles.

In [10]:
# Create a joint plot

Create a correlation heat map

```python
plt.figure(figsize=(10,8))
cor = data.corr()
sns.heatmap(cor, annot=True,cmap=plt.cm.Reds)
plt.show()
```

to examine the correlation among the variables. Which have the strongest correlation to `RUL`?

In [11]:
# Calculate the data correlation

In [12]:
# Visualize the correlation 
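
To answer which variables correlate most strongly with `RUL`, one option is to sort the `RUL` column of the correlation matrix by absolute value. This is a sketch on synthetic data; the column names `Strong` and `Weak` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
rul = np.linspace(1000, 0, n)

# Synthetic features: one tracks RUL closely, the other is mostly noise
demo = pd.DataFrame({
    'Strong': 0.002 * rul + rng.normal(scale=0.05, size=n),
    'Weak':   0.0002 * rul + rng.normal(scale=0.5, size=n),
    'RUL':    rul,
})

cor = demo.corr()

# Rank features by the magnitude of their correlation with the label
ranked = cor['RUL'].drop('RUL').abs().sort_values(ascending=False)
print(ranked)
```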

### Regression

The objective is to minimize a loss function such as a sum of squared errors between the measured and predicted values:

$Loss = \sum_{i=1}^{n}\left(y_i-z_i\right)^2$

where `n` is the number of observations. Regression requires labelled data (output values) for training.
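
As a small worked example of the loss, the sketch below evaluates the sum of squared errors for four made-up observations. Following the naming in the later code cells, `z` is assumed to hold the measured values and `y` the predicted values.

```python
import numpy as np

# Made-up values for illustration only
z = np.array([100.0, 80.0, 60.0, 40.0])  # measured
y = np.array([ 98.0, 83.0, 59.0, 44.0])  # predicted

# Sum of squared errors: 2^2 + 3^2 + 1^2 + 4^2
loss = np.sum((y - z)**2)
print(loss)  # 30.0
```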

![idea](https://apmonitor.com/che263/uploads/Begin_Python/idea.png)

### Linear Regression

There are many model forms such as linear, polynomial, and nonlinear. A familiar linear model is a line with slope `a` and intercept `b` with `y = a x + b`.   
    
```python
x = data['MaxVD'].values
z = data['RUL'].values
p1 = np.polyfit(x,z,1)
```
    
A simple method for linear regression is to use `numpy`: fit with `p1=np.polyfit(x,z,1)` and evaluate the model with `np.polyval(p1,x)`. Determine the slope and intercept that minimize the sum of squared errors (least squares) between the predicted `RUL` and measured `RUL` using `MaxVD` as the input.

In [13]:
# Linear regression with one feature
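
A sketch of the fit-and-evaluate steps on synthetic data. The linear relationship and noise level below are invented for illustration and are not taken from the battery data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic input and label with a known slope and intercept plus noise
x = np.linspace(3.8, 4.0, 50)
z = 5000.0 * x - 19000.0 + rng.normal(scale=5.0, size=x.size)

# First-order polynomial fit returns [slope, intercept]
p1 = np.polyfit(x, z, 1)
slope, intercept = p1

# Evaluate the fitted line and the least squares objective
z_pred = np.polyval(p1, x)
sse = np.sum((z - z_pred)**2)
print(f'slope={slope:.1f} intercept={intercept:.1f} SSE={sse:.1f}')
```

With the battery data, `x` and `z` come from `data['MaxVD'].values` and `data['RUL'].values` as shown above.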

![idea](https://apmonitor.com/che263/uploads/Begin_Python/idea.png)

### Multiple Linear Regression

Multiple linear regression uses more than one feature to predict the label.

In [14]:
# Create the features and label for multiple linear regression

`statsmodels` performs standard Ordinary Least Squares (OLS) analysis with an informative report summary.

```python
import statsmodels.api as sm
xc = sm.add_constant(x)
model = sm.OLS(z,xc).fit()
predictions = model.predict(xc)
model.summary()
```

The input `x` is augmented with a column of ones so that the model also fits an intercept. This is accomplished with `xc=sm.add_constant(x)`. Perform a multiple linear regression with all of the feature columns to predict `RUL`.

In [15]:
# Linear regression with multiple features
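
To show what the column of ones does, the sketch below solves the same least squares problem with `np.linalg.lstsq` instead of `statsmodels`. It is an illustration on synthetic, noise-free data and not a replacement for the OLS report.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic features and a label with known coefficients
x = rng.normal(size=(200, 2))
z = 3.0 * x[:, 0] - 2.0 * x[:, 1] + 10.0

# Prepend a ones column; its coefficient is the intercept
xc = np.column_stack([np.ones(len(x)), x])

# Least squares solution: [intercept, coefficient 1, coefficient 2]
coef, *_ = np.linalg.lstsq(xc, z, rcond=None)
print(coef)
```

Without the ones column the fitted plane is forced through the origin, which gives a poor fit whenever the label has a nonzero offset.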

### Scale Data

Many regression algorithms require scaled data to perform well (e.g. Artificial Neural Networks). Scale the data with `StandardScaler` from scikit-learn.

In [16]:
# Scale data

The value `ds` is returned as a `numpy` array so we need to convert it back to a `pandas` `DataFrame`.

```python
ds = pd.DataFrame(ds,columns=data.columns)
```

Re-use the column names from `data`.

In [17]:
# Restore ID value (unscaled)
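
The scaling steps can be sketched as follows, assuming `StandardScaler` is applied to every column and the `ID` column is then overwritten with its original values. The `demo` frame is synthetic.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for `data` with a battery ID column
demo = pd.DataFrame({
    'MaxVD': [3.85, 3.90, 3.95, 4.00],
    'RUL':   [900, 600, 300, 0],
    'ID':    [1, 1, 2, 2],
})

# Scale each column to zero mean and unit standard deviation
s = StandardScaler()
ds = s.fit_transform(demo)

# fit_transform returns a numpy array; restore the column names
ds = pd.DataFrame(ds, columns=demo.columns)

# ID is a category, not a measurement, so keep it unscaled
ds['ID'] = demo['ID'].values
print(ds)
```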

![idea](https://apmonitor.com/che263/uploads/Begin_Python/idea.png)

### Divide Data

Data is divided into train and test sets to separate a fraction of the rows for evaluating classification or regression models. A typical split is 80% for training and 20% for testing, although the range depends on how much data is available and the objective of the study.

`train_test_split` is a function in `sklearn` that splits data into train and test sets.

```python
from sklearn.model_selection import train_test_split
train,test = train_test_split(ds, test_size=0.2, shuffle=True)
```

There are options such as `shuffle=True` to randomize the selection in each set. 

In [18]:
# Split data - method 1

For this data set, it is better to split by battery ID than randomly. With a random split, rows from every battery appear in both the train and test sets, so the test score does not show how the model performs on a battery it has not seen. A split by battery ID evaluates test performance on batteries that were not used for training.

In [19]:
# Split data - method 2
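
A minimal sketch of a split by battery ID using `isin`. The choice of which IDs to hold out is hypothetical, and the `demo` frame is synthetic.

```python
import pandas as pd

# Synthetic stand-in for the scaled data with five batteries
demo = pd.DataFrame({
    'ID':  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    'RUL': [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
})

# Hypothetical choice: hold out the last two batteries for testing
test_ids = [4, 5]
is_test = demo['ID'].isin(test_ids)

train = demo[~is_test]
test = demo[is_test]
print(len(train), 'train rows,', len(test), 'test rows')
```

scikit-learn also provides `GroupShuffleSplit`, which can select the held-out groups at random when a fixed list is not wanted.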

### Select Best Features

Rank the features to determine the best set that predicts `RUL`. There is additional information on [Select K Best Features](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html).

In [20]:
# Select features and label

In [21]:
# Determine the features with the highest correlation to the label

In [22]:
# remove lowest scoring features
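
One way to rank features is `SelectKBest` with the `f_regression` score function, sketched below on synthetic data. The feature names `A`, `B`, and `C` are hypothetical; the scoring function is an assumption, since the tutorial does not state which one it uses.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(2)
n = 300

# Synthetic features: A is strongly related to y, B weakly, C not at all
X = pd.DataFrame({
    'A': rng.normal(size=n),
    'B': rng.normal(size=n),
    'C': rng.normal(size=n),
})
y = 4.0 * X['A'] + 2.0 * X['B'] + rng.normal(scale=0.5, size=n)

# k='all' scores every feature without dropping any
best = SelectKBest(score_func=f_regression, k='all')
fit = best.fit(X, y)

# Higher score means a stronger linear relationship with the label
scores = pd.Series(fit.scores_, index=X.columns).sort_values(ascending=False)
print(scores)

# Remove the lowest scoring feature
X_reduced = X.drop(columns=[scores.index[-1]])
```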

![exercise](https://apmonitor.com/che263/uploads/Begin_Python/exercise.png)

### Regression

Machine learning uses computer algorithms and statistical models that rely on patterns and inference to perform a specific task without explicit instructions. Machine learned regression models can be as simple as linear regression or as complex as deep learning. This tutorial demonstrates several regression methods with `scikit-learn` and the `lazypredict` package.

In [23]:
# pip install lazypredict

In [24]:
# Evaluate many regressors

In [25]:
# predict with kernel ridge regressor

In [26]:
# predict with linear regressor
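
The two regressors named in the cells above can be compared directly with `scikit-learn`, without `lazypredict`. The sketch below uses synthetic data, and the `KernelRidge` settings (`kernel='rbf'`, `alpha=0.1`) are assumptions rather than tuned values.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)

# Synthetic features and a mildly nonlinear label
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 0]**2 + rng.normal(scale=0.05, size=200)

# Simple ordered split for illustration
Xtrain, Xtest = X[:150], X[150:]
ytrain, ytest = y[:150], y[150:]

models = {
    'KernelRidge': KernelRidge(kernel='rbf', alpha=0.1),
    'LinearRegression': LinearRegression(),
}

# Fit each model on the training rows and score it on the test rows
r2 = {}
for name, m in models.items():
    m.fit(Xtrain, ytrain)
    r2[name] = r2_score(ytest, m.predict(Xtest))
    print(f'{name}: R2 = {r2[name]:.3f}')
```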

### View Remaining Useful Life (Unscaled) on Test Batteries

In [27]:
# view RUL on test batteries
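
Predictions made on scaled data need to be converted back to cycles before viewing. A sketch of the conversion, assuming the data were scaled with a fitted `StandardScaler` whose `mean_` and `scale_` attributes are still available; the `demo` frame is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for `data`
demo = pd.DataFrame({
    'MaxVD': [3.85, 3.90, 3.95, 4.00],
    'RUL':   [900.0, 600.0, 300.0, 0.0],
})

s = StandardScaler().fit(demo)
ds = pd.DataFrame(s.transform(demo), columns=demo.columns)

# Position of the RUL column in the scaler's mean_ and scale_ arrays
i = demo.columns.get_loc('RUL')

# Undo the scaling: multiply by the standard deviation, then add the mean
rul_unscaled = ds['RUL'].values * s.scale_[i] + s.mean_[i]
print(rul_unscaled)
```

The same two attributes apply to a vector of predicted scaled `RUL` values in place of `ds['RUL'].values`.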

### View Remaining Useful Life on Training Data Batteries

In [28]:
# view RUL on training batteries

### Regression with PyTorch

In [29]:
# regression with PyTorch

### Regression with Keras / TensorFlow

In [30]:
# regression with TensorFlow / Keras