## NASA dataset feature extraction

In this notebook, we walk step by step through extracting features from the NASA battery datasets.

Our goal is to extract features that are useful for state-of-health (SOH) estimation.


In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import os

### 1.1 DataSet Introduction

When you open a .mat file from the NASA dataset, e.g. B0005.mat, you will see a data structure like the one below:

Data Structure:
- cycle:	top level structure array containing the charge, discharge and impedance operations
	- type: 	operation type; can be charge, discharge or impedance (each corresponds to one of the field lists below)
	- ambient_temperature:	ambient temperature (degree C)
	- time: 	the date and time of the start of the cycle, in MATLAB  date vector format
	- data:	data structure containing the measurements

- for charge the fields are:
	- Voltage_measured: 	Battery terminal voltage (Volts)
	- Current_measured:	Battery output current (Amps)
	- Temperature_measured: 	Battery temperature (degree C)
	- Current_charge:		Current measured at charger (Amps)
	- Voltage_charge:		Voltage measured at charger (Volts)
	- Time:			Time vector for the cycle (secs)

- for discharge the fields are:
	- Voltage_measured: 	Battery terminal voltage (Volts)
	- Current_measured:	Battery output current (Amps)
	- Temperature_measured: 	Battery temperature (degree C)
	- Current_charge:		Current measured at load (Amps)
	- Voltage_charge:		Voltage measured at load (Volts)
	- Time:			Time vector for the cycle (secs)
	- Capacity:		Battery capacity (Ahr) for discharge till 2.7V 

- for impedance the fields are:
	- Sense_current:		Current in sense branch (Amps)
	- Battery_current:	Current in battery branch (Amps)
	- Current_ratio:		Ratio of the above currents 
	- Battery_impedance:	Battery impedance (Ohms) computed from raw data
	- Rectified_impedance:	Calibrated and smoothed battery impedance (Ohms) 
	- Re:			Estimated electrolyte resistance (Ohms)
	- Rct:			Estimated charge transfer resistance (Ohms)

This means that each battery undergoes many cycles, and the cycle type can differ (charge, discharge, impedance). All of them are aggregated in a single .mat file.

In the sections below, we inspect each cycle type and process it accordingly.
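
As a quick way to inspect this structure from Python, the sketch below loads one file and counts the cycles of each type. It is only an illustration: the helper names `load_cycles` and `count_cycle_types` are hypothetical, and it assumes that the top-level MATLAB variable is named after the battery (e.g. `B0005`) and that SciPy is installed.

```python
from collections import Counter


def load_cycles(mat_path, battery_id):
    """Return the array of cycle structs stored in a NASA .mat file."""
    # Imported here so that count_cycle_types() works without SciPy.
    from scipy.io import loadmat

    # squeeze_me / struct_as_record=False expose MATLAB structs as objects
    # with attribute access (cycle.type, cycle.data, ...).
    mat = loadmat(mat_path, squeeze_me=True, struct_as_record=False)
    # Assumption: the top-level variable is named after the battery.
    return mat[battery_id].cycle


def count_cycle_types(cycles):
    """Count how many cycles of each type (charge/discharge/impedance) exist."""
    return Counter(str(cycle.type) for cycle in cycles)
```

A call such as `count_cycle_types(load_cycles('B0005.mat', 'B0005'))` would then report how many charge, discharge and impedance cycles the file contains.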


### 2.1 charge feature extraction

#### 2.1.1 CC-CV mode separation
A charge cycle combines a constant-current (CC) phase and a constant-voltage (CV) phase, so we need to separate the two phases within each cycle.

The readme file of each dataset describes the CC and CV charging protocol, so we define the rules as follows:

- find the first row where the measured voltage reaches 4.2 V; that row and every row before it count as CC mode.
- every row after that row counts as CV mode.

We could also sample only part of the data to improve the correlation between the features and SOH, for example only the 4.0 V - 4.2 V part of the CC mode or only the 1.0 A - 0.2 A part of the CV mode. In this version we use the full data of each mode.

The code below performs the separation.

In [2]:
def separate_cc_cv(data):
    data = data.reset_index(drop=True)  # Reset the index within each group
    try:
        # Find the first row (relative index) where the voltage reaches 4.2 V
        cc_end_idx = data[data['Voltage_Measured'] >= 4.2].index[0]
    except IndexError:
        # The voltage never reaches 4.2 V, so the cycle cannot be separated
        cc_end_idx = None
    
    if cc_end_idx is not None:
        # Separate CC and CV mode data
        cc_data = data.iloc[:cc_end_idx+1]
        cv_data = data.iloc[cc_end_idx+1:]
    else:
        return None, None
    
    return cc_data, cv_data
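
The optional windowing mentioned above (for example keeping only 4.0 V - 4.2 V of the CC mode) is not part of this version of the notebook. A minimal sketch of how it could look is given here; the helper name `select_window` and the toy values are assumptions made for illustration only.

```python
import pandas as pd


def select_window(data, column, low, high):
    """Keep only the rows whose `column` value lies in [low, high]."""
    mask = data[column].between(low, high)  # inclusive on both ends
    return data[mask].reset_index(drop=True)


# Toy CC segment; the values are made up for illustration only.
cc_data = pd.DataFrame({
    'Time': [0, 10, 20, 30],
    'Voltage_Measured': [3.8, 4.0, 4.1, 4.2],
})
cc_window = select_window(cc_data, 'Voltage_Measured', 4.0, 4.2)
```

The same helper could be applied to the CV segment with `column='Current_Measured'`, `low=0.2` and `high=1.0`.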

#### 2.1.2 feature calculation for cycle
For each charging cycle, we use statistical, time-series, and physics-based quantities to describe the characteristics of the cycle data.

Below is the list of features that we use:
- Mean
- Standard Deviation
- Kurtosis
- Skewness
- Charging Time
- Accumulated Charge
- Curve Slope
- Curve Entropy

These features describe the current and voltage data of the CC and CV modes. Specifically, for CC mode we calculate the mean, standard deviation, kurtosis, skewness, slope and entropy of the voltage, because the current is held constant. For CV mode we calculate the same quantities for the current, because the voltage is held constant. The code below also computes the mean, standard deviation, kurtosis and skewness of the temperature for both modes.

In [3]:
def calculate_CC_features(data):
    # Initialize a dictionary to hold the calculated features
    features = {}
    
    # Voltage features
    features['voltage mean'] = data['Voltage_Measured'].mean()
    features['voltage std'] = data['Voltage_Measured'].std()
    features['voltage kurtosis'] = stats.kurtosis(data['Voltage_Measured'])
    features['voltage skewness'] = stats.skew(data['Voltage_Measured'])
    
    # Temperature features
    features['CC_temperature mean'] = data['Temperature'].mean()
    features['CC_temperature std'] = data['Temperature'].std()
    features['CC_temperature kurtosis'] = stats.kurtosis(data['Temperature'])
    features['CC_temperature skewness'] = stats.skew(data['Temperature'])
    
    # Charging time
    t_start = data['Time'].iloc[0]
    t_end = data['Time'].iloc[-1]
    features['CC charge time'] = t_end - t_start
    
    # Accumulated charge (Q)
    I = data['Current_Measured']
    dt = data['Time'].diff().fillna(0)  # Time differences
    Q = (I * dt).sum()
    features['CC Q'] = Q
    
    # Voltage slope
    V_start = data['Voltage_Measured'].iloc[0]
    V_end = data['Voltage_Measured'].iloc[-1]
    features['voltage slope'] = (V_end - V_start) / features['CC charge time']
    
    # Voltage entropy
    voltage_counts = data['Voltage_Measured'].value_counts(normalize=True)
    features['voltage entropy'] = -np.sum(voltage_counts * np.log(voltage_counts))
    
    return features

def calculate_CV_features(data):
    # Initialize a dictionary to hold the calculated features
    features = {}
    
    # Current features
    features['current mean'] = data['Current_Measured'].mean()
    features['current std'] = data['Current_Measured'].std()
    features['current kurtosis'] = stats.kurtosis(data['Current_Measured'])
    features['current skewness'] = stats.skew(data['Current_Measured'])
    
    # Temperature features
    features['CV_temperature mean'] = data['Temperature'].mean()
    features['CV_temperature std'] = data['Temperature'].std()
    features['CV_temperature kurtosis'] = stats.kurtosis(data['Temperature'])
    features['CV_temperature skewness'] = stats.skew(data['Temperature'])
    
    # Charging time
    t_start = data['Time'].iloc[0]
    t_end = data['Time'].iloc[-1]
    features['CV charge time'] = t_end - t_start
    
    # Accumulated charge (Q)
    I = data['Current_Measured']
    dt = data['Time'].diff().fillna(0)  # Time differences
    Q = (I * dt).sum()
    features['CV Q'] = Q
    
    # Current slope
    I_start = data['Current_Measured'].iloc[0]
    I_end = data['Current_Measured'].iloc[-1]
    features['current slope'] = (I_end - I_start) / features['CV charge time']
    
    # Current entropy
    current_counts = data['Current_Measured'].value_counts(normalize=True)
    features['current entropy'] = -np.sum(current_counts * np.log(current_counts))
    
    return features
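
To make the accumulated charge feature concrete: the functions above sum `I * dt` with the first `dt` set to zero, which gives a value in ampere-seconds because `Time` is in seconds. The sketch below reproduces that sum on plain arrays, adds a trapezoidal variant for comparison (not used in this notebook), and converts the result to ampere-hours. All function names here are hypothetical.

```python
import numpy as np


def accumulated_charge(current, time):
    """Rectangle-rule sum of I * dt, as in the feature functions (A*s)."""
    current = np.asarray(current, dtype=float)
    time = np.asarray(time, dtype=float)
    # The first dt is 0, matching Series.diff().fillna(0)
    dt = np.diff(time, prepend=time[0])
    return float(np.sum(current * dt))


def accumulated_charge_trapezoid(current, time):
    """Trapezoidal-rule alternative (an assumption, not the notebook's method)."""
    current = np.asarray(current, dtype=float)
    time = np.asarray(time, dtype=float)
    return float(np.sum(0.5 * (current[1:] + current[:-1]) * np.diff(time)))


def to_amp_hours(charge_amp_seconds):
    """Convert ampere-seconds to ampere-hours."""
    return charge_amp_seconds / 3600.0
```

For a constant current both rules agree; they differ when the current changes between samples, as it does in CV mode.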

Below is the Python script that performs the CC/CV mode separation and the feature calculation.

All results are saved under the path NASA/ChargeOutput.

Before this step, run NASA/NASAdataextract/NASAmatscript/NASAchargeExtract.m in MATLAB to generate the necessary data from the source data.

In [5]:

def process_charge_csv(file_path):
    data = pd.read_csv(file_path)
    results = []

    for cycle_life, cycle_data in data.groupby('Cycle Life'):
        cc_data, cv_data = separate_cc_cv(cycle_data)

        if cc_data is None or cv_data is None:
            print(f"Skipping Cycle Life {cycle_life} in file {file_path} due to missing CC/CV separation")
            continue

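        # Skip cycles where either segment is empty; otherwise the features
        # of a previous cycle would be reused (or a NameError raised)
        if cc_data.empty or cv_data.empty:
            print(f"Skipping Cycle Life {cycle_life} in file {file_path} due to an empty CC or CV segment")
            continue

        cc_features = calculate_CC_features(cc_data)
        cv_features = calculate_CV_features(cv_data)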

        # Merge CC and CV features into a single dictionary
        combined_features = {**cc_features, **cv_features}
        combined_features['charge cycle life'] = cycle_life
        
        results.append(combined_features)
    
    return pd.DataFrame(results)

# Battery batch information

Batch1 = ['B0005','B0006','B0007','B0018']
Batch2 = ['B0025','B0026','B0027','B0028']
Batch3 = ['B0029','B0030','B0031','B0032']
Batch4 = ['B0033','B0034','B0036']
Batch5 = ['B0038','B0039','B0040']
Batch6 = ['B0041','B0042','B0043', 'B0044']
Batch7 = ['B0045','B0046','B0047','B0048']
Batch8 = ['B0049','B0050','B0051','B0052']
Batch9 = ['B0053','B0054','B0055','B0056']

# recreate the above information into a dictionary
Batch = {'Batch1': Batch1, 'Batch2': Batch2, 'Batch3': Batch3, 'Batch4': Batch4, 'Batch5': Batch5, 'Batch6': Batch6, 'Batch7': Batch7, 'Batch8': Batch8, 'Batch9': Batch9}

# Set the folder paths
root = '/Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ChargeExtract'
output_folder = '/Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ChargeOutput'

for batch in Batch:
    for battery in Batch[batch]:
        file = [f for f in os.listdir(root) if f.endswith('.csv') and f.startswith(battery)]
        if not file:
            continue
        file_path = os.path.join(root, file[0])
        df = process_charge_csv(file_path)
        # Write only this battery's features to its own file, so that rows
        # from previously processed batteries do not leak into it
        batch_output_folder = os.path.join(output_folder, batch)
        os.makedirs(batch_output_folder, exist_ok=True)
        df.to_csv(os.path.join(batch_output_folder, f'{battery}_charge_features.csv'), index=False)

print('Charge Feature extraction completed.')

Skipping Cycle Life 80 in file /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ChargeExtract/B0025.mat_chargeExtract.csv due to missing CC/CV separation
Skipping Cycle Life 80 in file /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ChargeExtract/B0026.mat_chargeExtract.csv due to missing CC/CV separation
Skipping Cycle Life 80 in file /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ChargeExtract/B0027.mat_chargeExtract.csv due to missing CC/CV separation
Skipping Cycle Life 80 in file /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ChargeExtract/B0028.mat_chargeExtract.csv due to missing CC/CV separation
Skipping Cycle Life 10 in file /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ChargeExtract/B0050.mat_chargeExtract.csv due to missing CC/CV separation



Charge Feature extraction completed.
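
The slope features divide by the charge time, which is zero when a segment contains a single row or when its first and last timestamps coincide. A small guard, sketched below under the assumption that such cases should yield NaN, avoids the division by zero; `endpoint_slope` is a hypothetical helper and not part of the original functions.

```python
import math


def endpoint_slope(values, times):
    """(last - first) / elapsed time, or NaN when the slope is undefined."""
    if len(values) < 2 or len(values) != len(times):
        return math.nan
    elapsed = times[-1] - times[0]
    if elapsed <= 0:
        return math.nan
    return (values[-1] - values[0]) / elapsed
```

When calling it with pandas columns, pass `data['Voltage_Measured'].to_numpy()` and `data['Time'].to_numpy()`, since indexing a Series with `-1` is label-based rather than positional. Rows with a NaN slope are later removed by the preprocessing step that drops NaN rows.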


### 3.1 Discharge Extraction

#### 3.1.1 Discharge CC/CV & feature calculation

Because the discharge protocol is the same across the NASA dataset, we do not separate CC/CV modes and treat every discharge as CC mode. According to the Readme file of the dataset, discharge was carried out at a constant current (CC) level of 2A until the battery voltage fell to a cutoff voltage that differs between batteries. Hence, for the discharge features we use the same metrics as for charge, that is:
- Mean
- Standard Deviation
- Kurtosis
- Skewness
- Slope
- Entropy
- Discharge Time
- Accumulated Charge

The code below additionally records the coefficient of variation (std / mean) of the voltage and temperature, and the capacity of each cycle.

Before this step, run the NASAdischargeExtract.m script in MATLAB to extract the necessary discharge data from the raw data.

In [6]:

def calculate_discharge_features(data):
    # Initialize a dictionary to hold the calculated features
    features = {}
    
    # Voltage features
    features['discharge voltage mean'] = data['Voltage_Measured'].mean()
    features['discharge voltage std'] = data['Voltage_Measured'].std()
    features['discharge voltage kurtosis'] = stats.kurtosis(data['Voltage_Measured'])
    features['discharge voltage skewness'] = stats.skew(data['Voltage_Measured'])
    features['discharge voltage cov'] = features['discharge voltage std'] / features['discharge voltage mean']


    # Temperature features
    features['discharge_temperature mean'] = data['Temperature'].mean()
    features['discharge_temperature std'] = data['Temperature'].std()
    features['discharge_temperature kurtosis'] = stats.kurtosis(data['Temperature'])
    features['discharge_temperature skewness'] = stats.skew(data['Temperature'])
    features['discharge_temperature cov'] = features['discharge_temperature std'] / features['discharge_temperature mean']

    # Discharge time
    t_start = data['Time'].iloc[0]
    t_end = data['Time'].iloc[-1]
    features['discharge time'] = t_end - t_start
    
    # Accumulated charge (Q)
    I = data['Current_Measured']
    dt = data['Time'].diff().fillna(0)  # Time differences
    Q = (I * dt).sum()
    features['discharge Q'] = Q
    
    # Voltage slope
    V_start = data['Voltage_Measured'].iloc[0]
    V_end = data['Voltage_Measured'].iloc[-1]
    features['discharge voltage slope'] = (V_end - V_start) / features['discharge time']
    
    # Voltage entropy
    voltage_counts = data['Voltage_Measured'].value_counts(normalize=True)
    features['discharge voltage entropy'] = stats.entropy(voltage_counts)
    features['capacity'] = data['Capacity'].iloc[0]
    return features

def process_discharge_csv(file_path):
    data = pd.read_csv(file_path)
    results = []
    
    for cycle_life, cycle_data in data.groupby('Cycle Life'):
        features = calculate_discharge_features(cycle_data)
        features['discharge cycle life'] = cycle_life
        results.append(features)
    
    return pd.DataFrame(results)

# Set the folder paths
root = '/Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/DischargeExtract'
output_folder = '/Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/DischargeOutput'
files = [f for f in os.listdir(root) if f.endswith('.csv')]

for batch in Batch:
    for battery in Batch[batch]:
        file = [f for f in os.listdir(root) if f.endswith('.csv') and f.startswith(battery)]
        if not file:
            continue
        file_path = os.path.join(root, file[0])
        df = process_discharge_csv(file_path)
        batch_output_folder = os.path.join(output_folder, batch)
        if not os.path.exists(batch_output_folder):
            os.makedirs(batch_output_folder)
        df.to_csv(os.path.join(batch_output_folder, f'{battery}_discharge_features.csv'), index=False)

print('Discharge Feature extraction completed.')


Discharge Feature extraction completed.
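
Since the stated goal is SOH estimation, the `capacity` column extracted above is the natural label. SOH is commonly defined as the ratio of the measured capacity to a reference capacity. The sketch below shows that conversion; the helper name and the choice of reference (a rated capacity such as 2 Ah, or the first measured capacity when none is given) are assumptions and not something this notebook fixes.

```python
def capacity_to_soh(capacities, reference_capacity=None):
    """Convert per-cycle capacities (Ah) into SOH = capacity / reference."""
    capacities = list(capacities)
    if not capacities:
        return []
    if reference_capacity is None:
        # Assumption: fall back to the first measured capacity as reference
        reference_capacity = capacities[0]
    return [c / reference_capacity for c in capacities]
```

For example, `capacity_to_soh(df['capacity'], reference_capacity=2.0)` would express each cycle's capacity as a fraction of a 2 Ah rating.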


### 4.1 Impedance Extraction

#### 4.1.1 Complex number processing

For the impedance cycles, several columns contain complex numbers. A complex number has a real part and an imaginary part; using a function in the MATLAB script, we have extracted both parts.

We assume that the current ratio, sense current and battery current do not affect the capacity that we want to estimate, so we simply do not use them.

We split the complex values of the battery impedance and the rectified impedance into real and imaginary parts. Impedance is known to be closely related to a Li-ion battery's state of health, so we extract the mean and max values of those two columns as features.

Since processing the complex numbers is straightforward, we did not write a separate Python function for it. All the processing steps are done in the MATLAB script NASAdataextract/NASAmatscript/NASAimpedanceExtract.m.
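
For readers who prefer to stay in Python, the same idea can be sketched as follows. This is not the MATLAB script's implementation: the function names and output keys are hypothetical, and the input is assumed to be a list of complex values or of strings such as `0.10-0.02i`.

```python
import numpy as np


def parse_complex(text):
    """Parse '0.05-0.01i' (MATLAB style) or '0.05-0.01j' into a complex number."""
    cleaned = text.strip().replace(' ', '').replace('i', 'j')
    return complex(cleaned)


def impedance_features(values, prefix):
    """Mean and max of the real and imaginary parts of an impedance column."""
    z = np.array([parse_complex(v) if isinstance(v, str) else complex(v)
                  for v in values])
    return {
        f'{prefix} real mean': float(z.real.mean()),
        f'{prefix} real max': float(z.real.max()),
        f'{prefix} imag mean': float(z.imag.mean()),
        f'{prefix} imag max': float(z.imag.max()),
    }
```

The string handling is deliberately simple and would need extending for values such as `inf`, which also contain the letter `i`.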


### 5.1 Merge Features

#### 5.1.1 cycle number matching

Remember that charge, discharge and impedance each have a cycle life column (the cycle number). Now it is time to use it!

The cycle order in the NASA dataset is not strictly periodic. Charge and discharge cycles alternate as charge - discharge - charge - discharge, but the impedance cycles appear at irregular positions.

We inspected all the features generated for impedance, charge and discharge and found that the discharge data is the bottleneck: every charge cycle and impedance cycle is matched relative to a discharge cycle.

So we set the rules as follows:
- for a charge cycle, its cycle life should be the discharge cycle life + 2
- for an impedance cycle, its cycle life should be the discharge cycle life + 1
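
The rules can be illustrated on plain lists of cycle numbers. The sketch below mirrors the logic of the merge code later in this section (start from an impedance cycle, look for a discharge cycle at impedance - 1 and a charge cycle at impedance + 1, and fall back to the next available cycle), but the helper names and the toy numbers are assumptions for illustration.

```python
from bisect import bisect_left


def first_at_or_after(sorted_cycles, target):
    """Return the first cycle number >= target, or None if there is none."""
    idx = bisect_left(sorted_cycles, target)
    return sorted_cycles[idx] if idx < len(sorted_cycles) else None


def match_cycles(impedance_cycle, discharge_cycles, charge_cycles):
    """Match one impedance cycle to a discharge and a charge cycle."""
    discharge = first_at_or_after(discharge_cycles, impedance_cycle - 1)
    charge = first_at_or_after(charge_cycles, impedance_cycle + 1)
    if discharge is None or charge is None:
        return None
    return {'discharge': discharge, 'charge': charge,
            'impedance': impedance_cycle}


# Toy sequence: charge(1), discharge(2), impedance(3), charge(4), ...
discharge_cycles = [2, 5, 8]
charge_cycles = [1, 4, 7]
matched = match_cycles(3, discharge_cycles, charge_cycles)
```

Both lists must be sorted in ascending order for the binary search to be valid.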

Before merging, we first preprocess the charge/discharge/impedance features to drop rows that contain NaN, blank, or zero values.

In [7]:

def preprocess_data(file_path):
    # Load data
    data = pd.read_csv(file_path)
    
    # Drop rows containing NaN, blank, or zero values
    data = data.replace('', pd.NA).dropna()
    data = data[(data != 0).all(axis=1)]
    
    return data

def preprocess_folder(input_folder, output_folder):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    
    for batch in os.listdir(input_folder):
        batch_folder = os.path.join(input_folder, batch)
        if os.path.isdir(batch_folder):
            output_batch_folder = os.path.join(output_folder, batch)
            if not os.path.exists(output_batch_folder):
                os.makedirs(output_batch_folder)
            
            for file in os.listdir(batch_folder):
                if file.endswith('.csv'):
                    file_path = os.path.join(batch_folder, file)
                    cleaned_data = preprocess_data(file_path)
                    output_file_path = os.path.join(output_batch_folder, file)
                    cleaned_data.to_csv(output_file_path, index=False)
                    print(f'Processed and saved: {output_file_path}')

# Define input and output folders
charge_input_folder = '/Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ChargeOutput'
charge_output_folder = '/Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ChargeOutput_Cleaned'

discharge_input_folder = '/Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/DischargeOutput'
discharge_output_folder = '/Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/DischargeOutput_Cleaned'

impedance_input_folder = '/Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ImpedanceOutput'
impedance_output_folder = '/Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ImpedanceOutput_Cleaned'

# Preprocess data
print('Processing charge data...')
preprocess_folder(charge_input_folder, charge_output_folder)

print('Processing discharge data...')
preprocess_folder(discharge_input_folder, discharge_output_folder)

print('Processing impedance data...')
preprocess_folder(impedance_input_folder, impedance_output_folder)

print('Data preprocessing completed.')


Processing charge data...
Processed and saved: /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ChargeOutput_Cleaned/Batch8/B0052_charge_features.csv
Processed and saved: /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ChargeOutput_Cleaned/Batch8/B0049_charge_features.csv
Processed and saved: /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ChargeOutput_Cleaned/Batch8/B0050_charge_features.csv
Processed and saved: /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ChargeOutput_Cleaned/Batch8/B0051_charge_features.csv
Processed and saved: /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ChargeOutput_Cleaned/Batch6/B0044_charge_features.csv
Processed and saved: /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ChargeOutput_Cleaned/Batch6/B0041_charge_features.csv
Processed and saved: /Users/jonathanzha/Desktop/Battery-dataset-preprocess

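Then we perform the merge. When no cycle with exactly the expected cycle life exists, the code takes the next available cycle (the first one whose cycle life is greater than or equal to the target).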

In [11]:

# Set the folder paths
charge_folder = '/Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ChargeOutput_Cleaned'
discharge_folder = '/Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/DischargeOutput_Cleaned'
impedance_folder = '/Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/ImpedanceOutput_Cleaned'
output_folder = '/Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/MergedOutput'

# Create the output folder if it does not exist
if not os.path.exists(output_folder):
    os.makedirs(output_folder)

# Battery batch information
Batch = {
    'Batch1': ['B0005', 'B0006', 'B0007', 'B0018'],
    'Batch2': ['B0025', 'B0026', 'B0027', 'B0028'],
    'Batch3': ['B0029', 'B0030', 'B0031', 'B0032'],
    'Batch4': ['B0033', 'B0034', 'B0036'],
    'Batch5': ['B0038', 'B0039', 'B0040'],
    'Batch6': ['B0041', 'B0042', 'B0043', 'B0044'],
    'Batch7': ['B0045', 'B0046', 'B0047', 'B0048'],
    'Batch8': ['B0049', 'B0050', 'B0051'],
    'Batch9': ['B0053', 'B0054', 'B0055', 'B0056']
}

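def find_next_cycle(df, cycle_life_col, target_cycle_life):
    # Return the first row whose cycle life is >= the target, or None
    if cycle_life_col not in df.columns:
        # e.g. the feature file was missing and df is an empty DataFrame
        return None
    df = df[df[cycle_life_col] >= target_cycle_life]
    if not df.empty:
        return df.iloc[0]
    return None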

for batch_name, batteries in Batch.items():
    batch_output_folder = os.path.join(output_folder, batch_name)
    if not os.path.exists(batch_output_folder):
        os.makedirs(batch_output_folder)
    
    discharge_batch_folder = os.path.join(discharge_folder, batch_name)
    charge_batch_folder = os.path.join(charge_folder, batch_name)
    impedance_batch_folder = os.path.join(impedance_folder, batch_name)
    
    for battery in batteries:
        discharge_file = f'{battery}_discharge_features.csv'
        charge_file = f'{battery}_charge_features.csv'
        impedance_file = f'{battery}.mat_ImpedanceExtract.csv'
        
        if os.path.exists(os.path.join(discharge_batch_folder, discharge_file)):
            discharge_data = pd.read_csv(os.path.join(discharge_batch_folder, discharge_file))
            charge_data = pd.read_csv(os.path.join(charge_batch_folder, charge_file)) if os.path.exists(os.path.join(charge_batch_folder, charge_file)) else pd.DataFrame()
            impedance_data = pd.read_csv(os.path.join(impedance_batch_folder, impedance_file)) if os.path.exists(os.path.join(impedance_batch_folder, impedance_file)) else pd.DataFrame()
            
            merged_data = []

            for _, impedance_row in impedance_data.iterrows():
                impedance_cycle_life = impedance_row['Impedance Cycle Life']
                # Find corresponding discharge cycle
                target_discharge_cycle_life = impedance_cycle_life - 1
                discharge_row = find_next_cycle(discharge_data, 'discharge cycle life', target_discharge_cycle_life)
                
                # Find corresponding charge cycle
                target_charge_cycle_life = impedance_cycle_life + 1
                charge_row = find_next_cycle(charge_data, 'charge cycle life', target_charge_cycle_life)
                
                if discharge_row is not None and charge_row is not None:
                    combined_row = {**discharge_row, **charge_row, **impedance_row}
                    merged_data.append(combined_row)
            
            if merged_data:
                merged_df = pd.DataFrame(merged_data)
                output_csv = os.path.join(batch_output_folder, f'{battery}_Merged_discharge+charge+impedance.csv')
                merged_df.to_csv(output_csv, index=False)
                print(f'Merged data saved to {output_csv}')
            else:
                print(f'No merged data for battery {battery}')
        else:
            print(f'Discharge file not found for battery {battery}')

print('Data merging completed.')


Merged data saved to /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/MergedOutput/Batch1/B0005_Merged_discharge+charge+impedance.csv
Merged data saved to /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/MergedOutput/Batch1/B0006_Merged_discharge+charge+impedance.csv
Merged data saved to /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/MergedOutput/Batch1/B0007_Merged_discharge+charge+impedance.csv
Merged data saved to /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/MergedOutput/Batch1/B0018_Merged_discharge+charge+impedance.csv
Merged data saved to /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/MergedOutput/Batch2/B0025_Merged_discharge+charge+impedance.csv
Merged data saved to /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/MergedOutput/Batch2/B0026_Merged_discharge+charge+impedance.csv
Merged data saved to /Users/jonathanzha/
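
The row-by-row loop above can also be expressed with `pandas.merge_asof`, which matches each left row to the first right row whose key is greater than or equal to the left key when `direction='forward'`. The sketch below shows only the impedance-to-discharge match on made-up data; it is an alternative formulation and not the code that produced the files above. Both frames must be sorted by their key, and the key columns need the same dtype.

```python
import pandas as pd

# Toy data; the values are made up for illustration only.
impedance = pd.DataFrame({'Impedance Cycle Life': [3, 6], 'Re': [0.05, 0.06]})
discharge = pd.DataFrame({'discharge cycle life': [2, 5, 8],
                          'capacity': [1.9, 1.8, 1.7]})

impedance = impedance.sort_values('Impedance Cycle Life')
# Rule: the matching discharge cycle is at (impedance cycle life - 1) or later
impedance['target discharge'] = impedance['Impedance Cycle Life'] - 1

merged = pd.merge_asof(
    impedance,
    discharge.sort_values('discharge cycle life'),
    left_on='target discharge',
    right_on='discharge cycle life',
    direction='forward',  # first discharge cycle >= target
)
```

The charge match could be added with a second `merge_asof` call that uses `Impedance Cycle Life + 1` as the target.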

### 6.1 Plot the Correlation Matrix

#### 6.1.1 Problematic rows

While inspecting the rows of the merged CSV files, we found a few problems:

- some complex values of Re and Rct in B0050 (Batch8) were left unprocessed and cannot be fed into the neural network
- some rows are blank or contain NaN values because of the merging process
- duplicate rows need to be eliminated (the discharge cycle life and charge cycle life columns can be used for this)
- some unnecessary columns need to be eliminated, e.g. charge cycle life / discharge cycle life.



In [12]:

# Function to handle complex number combinations
def process_complex_numbers(df, columns):
    for column in columns:
        if column in df.columns:
            df[column] = df[column].apply(lambda x: np.real(complex(x)) if isinstance(x, str) and 'j' in x else x)
    return df

# Set the folder paths
output_folder = '/Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/MergedOutput'

# Columns to remove
columns_to_remove = ['charge cycle life', 'discharge cycle life', 'Impedance Cycle Life']

# Columns with potential complex numbers
complex_columns = ['Re', 'Rct']

for batch_name in os.listdir(output_folder):
    batch_folder = os.path.join(output_folder, batch_name)
    if os.path.isdir(batch_folder):
        for file_name in os.listdir(batch_folder):
            file_path = os.path.join(batch_folder, file_name)
            if file_name.endswith('_Merged_discharge+charge+impedance.csv'):
                # Read the CSV file
                df = pd.read_csv(file_path)
                
                # Process complex numbers
                df = process_complex_numbers(df, complex_columns)
                
                # Remove rows with NaN values
                df = df.dropna()
                
                # Eliminate duplicate rows
                df = df.drop_duplicates(subset=['discharge cycle life', 'charge cycle life'])
                
                # Remove unnecessary columns
                df = df.drop(columns=columns_to_remove, errors='ignore')
                
                # Save the cleaned DataFrame back to CSV
                df.to_csv(file_path, index=False)
                print(f'Processed and cleaned {file_path}')

print('Data cleaning completed.')


Processed and cleaned /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/MergedOutput/Batch8/B0051_Merged_discharge+charge+impedance.csv
Processed and cleaned /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/MergedOutput/Batch8/B0050_Merged_discharge+charge+impedance.csv
Processed and cleaned /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/MergedOutput/Batch8/B0049_Merged_discharge+charge+impedance.csv
Processed and cleaned /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/MergedOutput/Batch6/B0042_Merged_discharge+charge+impedance.csv
Processed and cleaned /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/MergedOutput/Batch6/B0043_Merged_discharge+charge+impedance.csv
Processed and cleaned /Users/jonathanzha/Desktop/Battery-dataset-preprocessing-code-library/NASA/MergedOutput/Batch6/B0041_Merged_discharge+charge+impedance.csv
Processed and cleaned /Users/jonat
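
Once the merged files are cleaned, the correlation between each feature and the capacity can be computed. The sketch below covers only the computation (plotting is left out) on made-up data; the helper name `correlation_with_target` and the choice of Pearson correlation are assumptions.

```python
import pandas as pd


def correlation_with_target(df, target='capacity'):
    """Pearson correlation of every numeric feature with the target column,
    ordered by absolute value (strongest first)."""
    corr = df.select_dtypes(include='number').corr(method='pearson')
    series = corr[target].drop(target)
    order = series.abs().sort_values(ascending=False).index
    return series.reindex(order)


# Toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    'capacity': [2.0, 1.9, 1.8, 1.7],
    'discharge time': [3600, 3400, 3200, 3000],
    'Re': [0.05, 0.06, 0.07, 0.08],
    'noise': [1, -1, -1, 1],
})
ranking = correlation_with_target(toy)
```

The full matrix returned by `DataFrame.corr` can be passed to a heatmap function of a plotting library to obtain the correlation plot.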