# **Data Cleaning, Manipulation and Analysis**

---

## **Objectives**

This notebook has three aims: data cleaning, data transformation and data loading. Light analysis is also carried out to better understand the data before it is transformed and loaded.

### **Inputs**

* Dataset retrieved from Kaggle (CSV file containing data regarding patients with, or potentially at risk of, Alzheimer's disease, saved to the inputs folder)

### **Outputs**

* Data cleaning pipeline (within this notebook)
* Machine learning pipeline (within this notebook)
* Cleaned data (CSV file extracted to outputs folder)
* Data for machine learning (CSV file extracted to outputs folder)

### **Additional Comments**

* Data was extracted from Kaggle; the source citation is included in the README file.
* Data was saved in its raw original form and then cleaned (a machine learning dataset with scaling and encoding was also created).

---
---

##### **REMINDER**: 
Run all notebook cells from top to bottom. Do not rely on going back to an earlier cell to perform a task, such as refreshing the contents of a variable.

---
---

## **Setup Information**

---

### **Change Working Directory**

* Because the notebooks are stored in a subfolder, it is best practice to change the working directory before running them in the editor.
* We need to change the working directory from its current folder to its parent folder.

In [1]:
# First we access the current directory with os.getcwd()
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\F_bee\\Documents\\vs-code\\vs-code-projects\\healthcare-and-public-health\\jupyter_notebooks'

In [2]:
# Then we make the parent of the current directory the new current directory
# We use *os.path.dirname()* to get the parent directory
# Next we use *os.chdir()* to define the new current directory
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [3]:
# Next we confirm the new current directory
current_dir = os.getcwd()
current_dir

'c:\\Users\\F_bee\\Documents\\vs-code\\vs-code-projects\\healthcare-and-public-health'
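
Re-running the cells above moves the working directory up one more level each time. As a minimal sketch of one way to guard against that, the helper below only steps up when the current folder is the notebooks folder. The function name `project_root` is hypothetical, and the folder name `jupyter_notebooks` is taken from the path shown in the output above.

```python
import os
from pathlib import Path


def project_root(cwd: Path, notebook_folder: str = "jupyter_notebooks") -> Path:
    """Return the parent folder only if cwd is the notebooks folder (hypothetical helper)."""
    return cwd.parent if cwd.name == notebook_folder else cwd


# Safe to run more than once: a second call leaves the directory unchanged
os.chdir(project_root(Path.cwd()))
print(Path.cwd())
```

Keeping the decision in a small function that takes the path as an argument makes it easy to check without actually changing directories.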

---
---

#### **IMPORTANT**: 
Before running the cells below, you **MUST** restart the kernel!

**This is because:**
- Windows locks files that are currently in use.
- NumPy is loaded in the current kernel session.
- Restarting clears memory and releases file locks.

**How to restart the kernel:**
1. Click the restart button (the circular arrow) in the toolbar above
2. Confirm the restart
3. **Then** run the cells below in order

---
---

### **Install Packages**

---

In [6]:
# Upgrade numpy first (run after kernel restart); pandas is installed in the next cell
%pip install --upgrade numpy

Note: you may need to restart the kernel to use updated packages.


In [7]:
# Install other packages (run after numpy upgrade completes)
%pip install pandas matplotlib seaborn scikit-learn plotly feature-engine

Note: you may need to restart the kernel to use updated packages.


In [8]:
# Test all imports (run after all packages are installed)
import numpy as np
import pandas as pd
import matplotlib as mb
import matplotlib.pyplot as plt
import plotly as pl
import seaborn as sns
import sklearn as sk
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import feature_engine as fe

print("All packages imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"Matplotlib version: {mb.__version__}")
print(f"Seaborn version: {sns.__version__}")
print(f"Scikit-learn version: {sk.__version__}")
print(f"Plotly version: {pl.__version__}")
print(f"Feature-engine version: {fe.__version__}")

All packages imported successfully!
NumPy version: 2.3.2
Pandas version: 2.3.1
Matplotlib version: 3.10.5
Seaborn version: 0.13.2
Scikit-learn version: 1.7.1
Plotly version: 6.2.0
Feature-engine version: 1.8.3


---

## **Section 1**

### **Data Extraction**
This section contains code for the loading of data.

---

Extract the dataset from the inputs folder and load it into the notebook as a DataFrame.

In [36]:
df = pd.read_csv("inputs/alzheimers_disease_data.csv")
print("Data loaded successfully!")
print(f"DataFrame shape: {df.shape}")
df

Data loaded successfully!
DataFrame shape: (2149, 35)


Unnamed: 0,PatientID,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,...,MemoryComplaints,BehavioralProblems,ADL,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,Diagnosis,DoctorInCharge
0,4751,73,0,0,2,22.927749,0,13.297218,6.327112,1.347214,...,0,0,1.725883,0,0,0,1,0,0,XXXConfid
1,4752,89,0,0,0,26.827681,0,4.542524,7.619885,0.518767,...,0,0,2.592424,0,0,0,0,1,0,XXXConfid
2,4753,73,0,3,1,17.795882,0,19.555085,7.844988,1.826335,...,0,0,7.119548,0,1,0,1,0,0,XXXConfid
3,4754,74,1,0,1,33.800817,1,12.209266,8.428001,7.435604,...,0,1,6.481226,0,0,0,0,0,0,XXXConfid
4,4755,89,0,0,0,20.716974,0,18.454356,6.310461,0.795498,...,0,0,0.014691,0,0,1,1,0,0,XXXConfid
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2144,6895,61,0,0,1,39.121757,0,1.561126,4.049964,6.555306,...,0,0,4.492838,1,0,0,0,0,1,XXXConfid
2145,6896,75,0,0,2,17.857903,0,18.767261,1.360667,2.904662,...,0,1,9.204952,0,0,0,0,0,1,XXXConfid
2146,6897,77,0,0,1,15.476479,0,4.594670,9.886002,8.120025,...,0,0,5.036334,0,0,0,0,0,1,XXXConfid
2147,6898,78,1,3,1,15.299911,0,8.674505,6.354282,1.263427,...,0,0,3.785399,0,0,0,0,1,1,XXXConfid


Create a random sample of the data for testing purposes. Only the first 5 rows (head) are displayed throughout to keep the notebook readable.

In [32]:
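# Store the sample under its own name so that df keeps the full dataset for later cells
sample_df = df.sample(frac=0.25, random_state=10)
print("Data loaded successfully!")
print(f"DataFrame shape: {sample_df.shape}")
print("\nFirst 5 rows:")
sample_df.head()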

Data loaded successfully!
DataFrame shape: (537, 35)

First 5 rows:


Unnamed: 0,PatientID,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,...,MemoryComplaints,BehavioralProblems,ADL,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,Diagnosis,DoctorInCharge
613,5364,61,1,1,0,29.27054,0,14.782028,7.315484,2.976619,...,1,0,0.775005,0,1,0,0,0,1,XXXConfid
1018,5769,81,1,0,0,34.641073,0,7.383103,2.473479,5.280584,...,0,0,3.389056,0,0,0,0,0,0,XXXConfid
264,5015,81,0,1,0,22.923111,0,9.314832,8.917378,3.807813,...,0,1,8.681801,0,0,0,0,0,1,XXXConfid
1758,6509,62,0,0,2,23.587924,0,1.236318,0.666426,3.360432,...,0,0,3.983733,1,0,0,0,0,0,XXXConfid
1441,6192,80,1,3,2,23.715891,1,12.339372,5.970801,1.625098,...,0,0,2.744058,0,0,0,0,1,0,XXXConfid
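
The fixed `random_state` is what makes the sample repeatable. A minimal sketch of that behaviour, using a small made-up DataFrame (the `toy` data is an assumption for illustration, not the project dataset):

```python
import pandas as pd

# Made-up stand-in for the real dataset: 20 rows with consecutive IDs
toy = pd.DataFrame({"PatientID": range(4751, 4771)})

# Sampling twice with the same seed selects the same rows in the same order
first = toy.sample(frac=0.25, random_state=10)
second = toy.sample(frac=0.25, random_state=10)

print(len(first))            # 25% of 20 rows
print(first.equals(second))  # identical samples
```

Note that `sample` keeps the original index labels, which is why the sampled rows above are not numbered from zero.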


---

## **Section 2**

### **Data Transformation**
This section contains functions for transformer creation, pipeline code and light analysis.

---

Check the current columns.

In [39]:
print("Data loaded successfully!")
print("Available columns:")
print(df.columns.tolist())

Data loaded successfully!
Available columns:
['PatientID', 'Age', 'Gender', 'Ethnicity', 'EducationLevel', 'BMI', 'Smoking', 'AlcoholConsumption', 'PhysicalActivity', 'DietQuality', 'SleepQuality', 'FamilyHistoryAlzheimers', 'CardiovascularDisease', 'Diabetes', 'Depression', 'HeadInjury', 'Hypertension', 'SystolicBP', 'DiastolicBP', 'CholesterolTotal', 'CholesterolLDL', 'CholesterolHDL', 'CholesterolTriglycerides', 'MMSE', 'FunctionalAssessment', 'MemoryComplaints', 'BehavioralProblems', 'ADL', 'Confusion', 'Disorientation', 'PersonalityChanges', 'DifficultyCompletingTasks', 'Forgetfulness', 'Diagnosis', 'DoctorInCharge']


Check the minimum values for numerical columns.

In [48]:
print("Data loaded successfully!")
numerical_columns = ["Age", "Gender", "Ethnicity", "EducationLevel", "BMI", "AlcoholConsumption", "PhysicalActivity", "DietQuality", "SleepQuality", "SystolicBP", "DiastolicBP", "CholesterolTotal", "CholesterolLDL", "CholesterolHDL", "CholesterolTriglycerides", "MMSE", "FunctionalAssessment"]
print("Minimum values for numerical columns:")
df[numerical_columns].min()

Data loaded successfully!
Minimum values for numerical columns:


Age                          60.000000
Gender                        0.000000
Ethnicity                     0.000000
EducationLevel                0.000000
BMI                          15.008851
AlcoholConsumption            0.002003
PhysicalActivity              0.003616
DietQuality                   0.009385
SleepQuality                  4.002629
SystolicBP                   90.000000
DiastolicBP                  60.000000
CholesterolTotal            150.093316
CholesterolLDL               50.230707
CholesterolHDL               20.003434
CholesterolTriglycerides     50.407194
MMSE                          0.005312
FunctionalAssessment          0.000460
dtype: float64

Check the maximum values for numerical columns.

In [49]:
print("Data loaded successfully!")
numerical_columns = ["Age", "Gender", "Ethnicity", "EducationLevel", "BMI", "AlcoholConsumption", "PhysicalActivity", "DietQuality", "SleepQuality", "SystolicBP", "DiastolicBP", "CholesterolTotal", "CholesterolLDL", "CholesterolHDL", "CholesterolTriglycerides", "MMSE", "FunctionalAssessment"]
print("Maximum values for numerical columns:")
df[numerical_columns].max()

Data loaded successfully!
Maximum values for numerical columns:


Age                          90.000000
Gender                        1.000000
Ethnicity                     3.000000
EducationLevel                3.000000
BMI                          39.992767
AlcoholConsumption           19.989293
PhysicalActivity              9.987429
DietQuality                   9.998346
SleepQuality                  9.999840
SystolicBP                  179.000000
DiastolicBP                 119.000000
CholesterolTotal            299.993352
CholesterolLDL              199.965665
CholesterolHDL               99.980324
CholesterolTriglycerides    399.941862
MMSE                         29.991381
FunctionalAssessment          9.996467
dtype: float64
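
The two checks above can also be combined into a single table. This is a sketch on made-up values (the `toy` DataFrame is an assumption), using `DataFrame.agg` to compute both statistics in one call:

```python
import pandas as pd

# Made-up values for two of the numerical columns
toy = pd.DataFrame({"Age": [60, 75, 90], "BMI": [15.0, 27.5, 39.9]})

# One row per column, with the minimum and maximum side by side
value_ranges = toy.agg(["min", "max"]).T
print(value_ranges)
```

Seeing both bounds side by side makes it easier to spot a column whose range looks implausible.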

Check for duplicate rows and count them.

In [57]:
print("Data loaded successfully!")
df.duplicated().sum()

Data loaded successfully!


np.int64(0)
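
`duplicated()` with no arguments only flags rows that match on every column. If repeated patient identifiers are also a concern, a `subset` can be passed. A small sketch on made-up data (the `toy` DataFrame is an assumption):

```python
import pandas as pd

# Made-up data: the same PatientID appears twice with different ages
toy = pd.DataFrame({"PatientID": [1, 2, 2], "Age": [70, 81, 82]})

# No two rows are identical across all columns
full_row_duplicates = int(toy.duplicated().sum())

# One row repeats an ID that has already been seen
repeated_ids = int(toy.duplicated(subset=["PatientID"]).sum())

print(full_row_duplicates, repeated_ids)
```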

Check for null values and count them per column.

In [53]:
print("Data loaded successfully!")
df.isnull().sum()

Data loaded successfully!


PatientID                    0
Age                          0
Gender                       0
Ethnicity                    0
EducationLevel               0
BMI                          0
Smoking                      0
AlcoholConsumption           0
PhysicalActivity             0
DietQuality                  0
SleepQuality                 0
FamilyHistoryAlzheimers      0
CardiovascularDisease        0
Diabetes                     0
Depression                   0
HeadInjury                   0
Hypertension                 0
SystolicBP                   0
DiastolicBP                  0
CholesterolTotal             0
CholesterolLDL               0
CholesterolHDL               0
CholesterolTriglycerides     0
MMSE                         0
FunctionalAssessment         0
MemoryComplaints             0
BehavioralProblems           0
ADL                          0
Confusion                    0
Disorientation               0
PersonalityChanges           0
DifficultyCompletingTasks    0
Forgetfu

Replace the integer codes in the categorical columns with their string labels.

In [61]:
print("Data loaded successfully!")
# Replace values in categorical columns for better readability
# Gender mapping
if "Gender" in df.columns:
    df["Gender"] = df["Gender"].replace({0: "Male", 1: "Female"})

# Ethnicity mapping
if "Ethnicity" in df.columns:
    df["Ethnicity"] = df["Ethnicity"].replace({
        0: "Caucasian", 1: "African American", 2: "Asian", 3: "Other"
    })

# Binary columns (0/1 to No/Yes)
binary_cols = ["Smoking", "CardiovascularDisease", "Depression", 
               "MemoryComplaints", "BehavioralProblems", "PersonalityChanges", 
               "DifficultyCompletingTasks"]

for col in binary_cols:
    if col in df.columns:
        df[col] = df[col].replace({0: "No", 1: "Yes"})

# Diagnosis mapping
if "Diagnosis" in df.columns:
    df["Diagnosis"] = df["Diagnosis"].replace({0: "No Dementia", 1: "Dementia"})
print(f"DataFrame shape: {df.shape}")
df.head()

Data loaded successfully!
DataFrame shape: (2149, 35)


Unnamed: 0,PatientID,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,...,MemoryComplaints,BehavioralProblems,ADL,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,Diagnosis,DoctorInCharge
0,4751,73,Male,Caucasian,2,22.927749,No,13.297218,6.327112,1.347214,...,No,No,1.725883,0,0,No,Yes,0,No Dementia,XXXConfid
1,4752,89,Male,Caucasian,0,26.827681,No,4.542524,7.619885,0.518767,...,No,No,2.592424,0,0,No,No,1,No Dementia,XXXConfid
2,4753,73,Male,Other,1,17.795882,No,19.555085,7.844988,1.826335,...,No,No,7.119548,0,1,No,Yes,0,No Dementia,XXXConfid
3,4754,74,Female,Caucasian,1,33.800817,Yes,12.209266,8.428001,7.435604,...,No,Yes,6.481226,0,0,No,No,0,No Dementia,XXXConfid
4,4755,89,Male,Caucasian,0,20.716974,No,18.454356,6.310461,0.795498,...,No,No,0.014691,0,0,Yes,Yes,0,No Dementia,XXXConfid
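
The same idea can be written in a table-driven form. The sketch below is an alternative under stated assumptions, not the notebook's method: it uses a made-up `toy` DataFrame and a hypothetical `label_maps` dictionary, and it uses `Series.map`, which turns any code missing from the mapping into a missing value (whereas `replace` leaves such codes unchanged).

```python
import pandas as pd

# Made-up data using the same integer codes as the dataset
toy = pd.DataFrame({"Gender": [0, 1, 1], "Smoking": [1, 0, 0]})

# Hypothetical lookup table: one mapping per column
label_maps = {
    "Gender": {0: "Male", 1: "Female"},
    "Smoking": {0: "No", 1: "Yes"},
}

# Work on a copy so the original codes remain available
labelled = toy.copy()
for column, mapping in label_maps.items():
    labelled[column] = labelled[column].map(mapping)

# Any code without a label would show up here as a missing value
unmapped = labelled.isna().sum()
print(labelled)
print(unmapped)
```

Counting missing values after mapping is a quick way to confirm that every code in the data had a label.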


Define the functions that will be wrapped in transformers.

In [None]:
# Drop specific columns
def drop_columns(df):
    return df.drop(columns=["EducationLevel", "SleepQuality", "FamilyHistoryAlzheimers", "Diabetes", "HeadInjury", "Hypertension", "SystolicBP", "DiastolicBP", "CholesterolLDL", "CholesterolHDL", "CholesterolTriglycerides", "Confusion", "Disorientation", "Forgetfulness", "DoctorInCharge"], errors="ignore")

# Change column locations
def change_column_location(df):
    new_column_order = ["PatientID", "Age", "Gender", "Ethnicity", "BMI", "DietQuality", "PhysicalActivity", "Smoking", "AlcoholConsumption", "CardiovascularDisease", "CholesterolTotal", "FunctionalAssessment", "ADL", "MMSE", "MemoryComplaints", "BehavioralProblems", "PersonalityChanges", "DifficultyCompletingTasks", "Depression", "Diagnosis", "DoctorInCharge"]
    # Only include columns that actually exist in the dataframe
    existing_columns = [col for col in new_column_order if col in df.columns]
    return df[existing_columns]

# Convert data types
def convert_data_types(df):
    # Target dtype for each column.
    # The continuous scores (PhysicalActivity, DietQuality, FunctionalAssessment)
    # are kept as floats: casting them to int would truncate the decimals, and
    # casting to str would stop them from being rounded or treated as numbers.
    dtypes = {
        "PatientID": int,
        "Age": int,
        "Gender": str,
        "Ethnicity": str,
        "BMI": float,
        "Smoking": str,
        "AlcoholConsumption": float,
        "PhysicalActivity": float,
        "DietQuality": float,
        "CardiovascularDisease": str,
        "Depression": str,
        "CholesterolTotal": float,
        "MMSE": float,
        "FunctionalAssessment": float,
        "MemoryComplaints": str,
        "BehavioralProblems": str,
        "ADL": float,
        "PersonalityChanges": str,
        "DifficultyCompletingTasks": str,
        "Diagnosis": str,
    }
    # Only convert columns that actually exist in the dataframe
    existing = {col: dtype for col, dtype in dtypes.items() if col in df.columns}
    # astype returns a new DataFrame, so the input is not modified in place
    return df.astype(existing)

# Remove outliers using IQR method
def remove_outliers(df):
    columns = ["BMI", "CholesterolTotal"]
    df_cleaned = df.copy()
    for col in columns:
        if col in df_cleaned.columns: 
            Q1 = df_cleaned[col].quantile(0.25)
            Q3 = df_cleaned[col].quantile(0.75)
            IQR = Q3 - Q1
            mask = (df_cleaned[col] >= Q1 - 1.5 * IQR) & (df_cleaned[col] <= Q3 + 1.5 * IQR)
            df_cleaned = df_cleaned[mask]  
    return df_cleaned

# Rename columns
def rename_columns(df):
    return df.rename(columns={
        "PatientID": "Patient_ID",
        "Age": "Patient_Age",
        "AlcoholConsumption": "Alcohol_Consumption",
        "PhysicalActivity": "Physical_Activity",
        "DietQuality": "Diet_Quality",
        "CardiovascularDisease": "Cardiovascular_Disease",
        "CholesterolTotal": "Cholesterol_Total",
        "FunctionalAssessment": "Functional_Assessment",
        "MemoryComplaints": "Memory_Complaints",
        "BehavioralProblems": "Behavioral_Problems",
        "ADL": "Activities_Of_Daily_Living",
        "PersonalityChanges": "Personality_Changes",
        "DifficultyCompletingTasks": "Difficulty_Completing_Tasks",  
    })

# Drop missing values
def drop_missing_values(df):
    return df.dropna()

# Remove duplicates
def remove_duplicates(df):
    return df.drop_duplicates()

# Round numerical values to 2 decimal places
def round_values(df):
    return df.round(2)

# Capitalize column names with proper acronym handling
def capitalize_columns(df):
    def smart_title(text):
        # Common acronyms that should stay uppercase
        acronyms = {
            "bmi": "BMI",
            "mmse": "MMSE", 
            "adl": "ADL",
            "id": "ID"
        }
        
        # Split by underscore and process each part
        parts = text.split("_")
        result_parts = []
        
        for part in parts:
            lower_part = part.lower()
            if lower_part in acronyms:
                result_parts.append(acronyms[lower_part])
            else:
                result_parts.append(part.title())
        
        return "_".join(result_parts)
    
    df.columns = [smart_title(col) for col in df.columns]
    return df

# Scale numerical values and encode categorical values
# Note: "Doctor_In_Charge" is not listed because DoctorInCharge is removed by drop_columns;
# a ColumnTransformer raises an error when asked for a column that is missing.
# Columns not listed here are dropped from the output (the default remainder="drop").
scaling_transformer = ColumnTransformer([
    ("num", StandardScaler(), ["Patient_Age", "BMI", "Alcohol_Consumption", "Physical_Activity", "Cholesterol_Total", "MMSE", "Functional_Assessment", "Activities_Of_Daily_Living"]), 
    ("cat", OneHotEncoder(drop="first", handle_unknown="ignore"), ["Gender", "Ethnicity", "Smoking", "Cardiovascular_Disease", "Depression", "Memory_Complaints", "Behavioral_Problems", "Personality_Changes", "Difficulty_Completing_Tasks"])  
])
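
The IQR rule used in `remove_outliers` can be illustrated on a handful of values. This is a minimal sketch on made-up numbers (the `values` Series is an assumption, not data from the dataset): anything further than 1.5 times the interquartile range below the first quartile or above the third quartile is dropped.

```python
import pandas as pd

# Made-up BMI-like values with one obvious outlier
values = pd.Series([20.0, 22.0, 23.0, 24.0, 25.0, 80.0], name="BMI")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Bounds of the accepted range
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values inside the bounds
kept = values[(values >= lower) & (values <= upper)]
print(lower, upper)
print(kept.tolist())
```

Because the bounds are computed from the data itself, the rows removed depend on the distribution of each column at the time the function runs.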

Create the transformers.

In [None]:
# Define transformers
change_column_location_transformer = FunctionTransformer(change_column_location)
drop_columns_transformer = FunctionTransformer(drop_columns)
convert_data_types_transformer = FunctionTransformer(convert_data_types)
remove_outliers_transformer = FunctionTransformer(remove_outliers)
rename_columns_transformer = FunctionTransformer(rename_columns)
capitalize_columns_transformer = FunctionTransformer(capitalize_columns)
drop_missing_values_transformer = FunctionTransformer(drop_missing_values)
remove_duplicates_transformer = FunctionTransformer(remove_duplicates)
round_values_transformer = FunctionTransformer(round_values)
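
As a small illustration of what wrapping a function in `FunctionTransformer` does, the sketch below applies a rounding function to a made-up DataFrame (the `toy` data and the name `rounder` are assumptions). With the default settings the DataFrame is passed to the function unchanged, so the result is still a DataFrame.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def round_values(df: pd.DataFrame) -> pd.DataFrame:
    """Round numerical values to 2 decimal places."""
    return df.round(2)


# Made-up values taken to six decimal places
toy = pd.DataFrame({"BMI": [22.927749, 26.827681]})

# The wrapper exposes fit/transform so the function can be used as a pipeline step
rounder = FunctionTransformer(round_values)
rounded = rounder.fit_transform(toy)
print(rounded)
```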

Create the pipeline.

In [None]:
# Create data cleaning pipeline
data_cleaning_pipeline = Pipeline([
    ("drop_columns", drop_columns_transformer),
    ("change_column_order", change_column_location_transformer),
    ("convert_data_types", convert_data_types_transformer),
    ("rename_columns", rename_columns_transformer),
    ("capitalize_columns", capitalize_columns_transformer),
    ("remove_outliers", remove_outliers_transformer),
    ("drop_missing_values", drop_missing_values_transformer),
    ("remove_duplicates", remove_duplicates_transformer),
    ("round_values", round_values_transformer)
])

Create the machine learning pipeline, which adds scaling and encoding to the cleaning steps.

In [None]:
# Create advanced pipeline with scaling and encoding for machine learning
# This pipeline should clean and preprocess data, rename columns, scale numerical features, encode categorical features and handle unknown categories
data_cleaning_with_ml_pipeline = Pipeline([
    ("drop_columns", drop_columns_transformer),
    ("change_column_order", change_column_location_transformer),
    ("convert_data_types", convert_data_types_transformer),
    ("rename_columns", rename_columns_transformer),
    ("capitalize_columns", capitalize_columns_transformer),
    ("remove_outliers", remove_outliers_transformer),
    ("drop_missing_values", drop_missing_values_transformer),
    ("remove_duplicates", remove_duplicates_transformer),
    ("round_values", round_values_transformer),
    ("scale_and_encode", scaling_transformer)
])

---

## **Section 3**

### **Data Loading** 
In this section, we fit each pipeline to the same DataFrame. Each run produces a new, transformed dataset, which is then saved as a CSV file.

---

Fit the pipeline to the DataFrame.

In [None]:
# Apply the pipeline to dataframe
processed_df = data_cleaning_pipeline.fit_transform(df)
print("Data loaded successfully!")
print(f"Processed data shape: {processed_df.shape}")
print(processed_df.head())

Check the current column list after fitting pipeline.

In [None]:
print("Data loaded successfully!")
print("Processed DataFrame columns:", processed_df.columns.tolist())

Fit the machine learning pipeline.

In [None]:
# Apply the ML pipeline to the original dataframe
scaled_data = data_cleaning_with_ml_pipeline.fit_transform(df)
print("Data loaded successfully!")
print(f"Scaled data shape: {scaled_data.shape}")
print(scaled_data)

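Save both processed datasets to separate CSV files.

In [None]:
# Save the processed dataframes
processed_df.to_csv("outputs/processed_alzheimers_disease_data_unscaled_and_unencoded.csv", index=False)
# The ColumnTransformer returns an array (possibly sparse), which has no to_csv method,
# so wrap it in a DataFrame with the generated feature names before saving
feature_names = data_cleaning_with_ml_pipeline.named_steps["scale_and_encode"].get_feature_names_out()
scaled_array = scaled_data.toarray() if hasattr(scaled_data, "toarray") else scaled_data
scaled_df = pd.DataFrame(scaled_array, columns=feature_names)
scaled_df.to_csv("outputs/processed_alzheimers_disease_data_scaled_and_encoded.csv", index=False)
print("Files saved to outputs folder!")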
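
A quick way to confirm that a file was written as expected is to read it back and compare it with what was saved. The sketch below does this with a made-up DataFrame and a temporary folder (both are assumptions for illustration; the notebook itself writes to the outputs folder).

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for a processed DataFrame
toy = pd.DataFrame({"Patient_ID": [4751, 4752], "BMI": [22.93, 26.83]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "toy.csv"
    # index=False avoids writing the row labels as an extra unnamed column
    toy.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(reloaded.shape)
print(reloaded.columns.tolist())
```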

---

## **Conclusion**
The process presented some difficulties, but in the end we generated the two versions of the dataset we were after. These will be used within our application.

---

### **Notes** 
**Method**
- Mapped the integer codes in the categorical and binary columns to readable labels, as the original data encoded these columns as numbers.
- Dropped several columns, as the aim was to focus on particular parameters (health and lifestyle) and to provide a simpler, less technical, viewer-friendly application.
- Also extracted a sampled DataFrame (a fraction of the original, drawn with a fixed random state) for analysis purposes.

**Issues**
- Following conflicts between packages such as NumPy and Pandas, the installation block was added as a precautionary measure.
- The kernel kept dying, so the Jupyter dependencies had to be installed from within the notebook: the Python kernel was started first and the necessary packages were then downloaded.
- Pandas faced issues such as import errors and HTML errors; this was resolved by updating Pandas and by using the print function.

**Further Considerations**
- Avoid ruling out further factors, such as head injury, other comorbid diseases such as diabetes, or family history (these were included in the original dataset).
- The capitalize, drop-missing-values and remove-duplicates steps were added for quality assurance purposes.