# Notebook 01 - Problem Statement And Setup

### Project Title  
**AlzheimerPredictor4u**

---

### Group Info  
**Group 4, l25dat4bi1f**  
*CPH Business Academy Lyngby*  
*Exam Project 2025*

---

### Collaborators  
- Felicia Favrholdt — cph-ff62@cphbusiness.dk  
- Fatima Majid Shamcizadh — cph-fs156@cphbusiness.dk

---

### GitHub Links  
- **Repository**: [AlzheimerPredictor4u_BI_Exam](https://github.com/FeliciaFavrholdt/AlzheimerPredictor4u_BI_Exam.git)  
- **Streamlit Folder**: Located inside the same repository under `/streamlit_app/`

---

### Problem Statement  
*How can we use AI and BI, based on factors such as age, gender, lifestyle, and health, to predict whether a person has Alzheimer’s?*

---

### Research Questions  
1. Can we predict the stage of Alzheimer’s disease using patient demographic and clinical data?  
2. Which features are the most influential in disease progression?  
3. Can we build a dashboard to visualize disease stages for clinical decision-making?



### Motivation 
Our motivation is to help doctors and healthcare staff detect Alzheimer’s disease at an earlier stage, so patients can get the right treatment before it is too late. Alzheimer’s is one of the most common types of dementia and affects millions of people around the world. If the disease is discovered early, treatment can improve the patient’s quality of life and may slow its progression. In this project, we explore how Business Intelligence (BI) and Artificial Intelligence (AI) can be applied to patient data such as age, gender, health, and lifestyle. By analyzing this data, we hope to find patterns that help predict how far the disease has progressed, so doctors and nurses can make better and faster decisions.





### Project Goals
The main goal of this project is to build a simple and useful system that helps doctors understand how far a patient’s Alzheimer’s disease has progressed. We want to do this by applying AI and BI methods to real patient data. Our goals include:  
- Creating a machine learning model that can predict the stage of the disease  
- Finding out which patient features (like age, lifestyle, health) are most important  
- Making an interactive dashboard that shows the predictions in a clear and easy way  



### Hypotheses
In this project, we expect to find the following patterns in the data:  
- Patients over the age of 80 have more than a 60% chance of having Alzheimer’s  
- At least 50% of patients with a family history of Alzheimer’s also have the disease  
- Only 10% of physically active patients aged 60 to 70 live longer than the average for other diagnosed patients
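
Each of the first two hypotheses can be checked as a simple proportion once the dataset is loaded. The sketch below shows one way to do that with pandas. The column names `Age`, `FamilyHistoryAlzheimers`, and `Diagnosis` (1 = diagnosed) are assumptions, the helper `diagnosed_share` is hypothetical, and the rows are made up for illustration only.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the project dataset.
df = pd.DataFrame({
    "Age": [82, 85, 90, 67, 72, 81],
    "FamilyHistoryAlzheimers": [1, 0, 1, 1, 0, 0],
    "Diagnosis": [1, 1, 0, 1, 0, 1],
})


def diagnosed_share(frame, mask):
    """Share of diagnosed patients among the rows selected by `mask`."""
    subset = frame.loc[mask, "Diagnosis"]
    if subset.empty:
        return float("nan")
    return float(subset.mean())


# Hypothesis 1: more than 60% of patients over 80 are diagnosed.
share_over_80 = diagnosed_share(df, df["Age"] > 80)
# Hypothesis 2: at least 50% of patients with a family history are diagnosed.
share_family = diagnosed_share(df, df["FamilyHistoryAlzheimers"] == 1)

print(f"Over 80: {share_over_80:.0%} diagnosed (hypothesis: more than 60%)")
print(f"Family history: {share_family:.0%} diagnosed (hypothesis: at least 50%)")
```

The third hypothesis would need a survival or follow-up variable, which this sketch does not assume is available.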




---------------------------------

## Project Scope and Impact
This project is focused on using data from real patients to explore how we can predict the stage of Alzheimer’s disease. We work with clinical and demographic data and apply AI and BI tools to create a solution that can support healthcare professionals. The project follows a structured development process with four sprints, covering problem definition, data cleaning, machine learning, and building a Streamlit dashboard.


### Key Objectives
- Use patient data to build a model that can predict Alzheimer’s disease stage  
- Explore which features are most important (age, gender, health, etc.)  
- Clean and prepare the data using BI methods  
- Make a dashboard that is easy to use for non-technical users like doctors or nurses  
- Document every step in Jupyter Notebooks and share the solution on GitHub  



---

## Expected Outcomes
The project will deliver the following components:
- A clean and structured dataset ready for analysis  
- A machine learning model that predicts disease stage with visual output  
- Well-documented, reproducible code and workflows across modular Jupyter notebooks  
- A Streamlit dashboard that presents the results clearly  
- A full project description with problem statement, hypotheses, and interpretation of results 

---

## Impact and Beneficiaries
The solution we develop could help healthcare professionals make faster and better decisions when diagnosing and treating Alzheimer’s. It can also save time and support early detection, which is important for improving the quality of life for patients. The main users who can benefit are doctors, nurses, clinics, and hospitals that work with elderly or memory care patients.


---------------------------------------

## Brief Annotation

**1. Which challenge would you like to address?**  
We want to address the challenge of predicting the stage of Alzheimer’s disease using patient demographic and clinical data, such as age, gender, health history, and lifestyle information.

**2. Why is this challenge an important or interesting research goal?**  
It is important because early understanding of the disease stage can help doctors give the right treatment at the right time, before it is too late. It can also improve the patient’s quality of life and slow down the development of the disease.

**3. What is the expected solution your project would provide?**  
Our solution includes a machine learning model that predicts the disease stage and a simple, easy-to-use dashboard to show the results in a clear way.

**4. What would be the positive impact of the solution, and which category of users could benefit from it?**  
The solution can help doctors and healthcare staff make faster and better decisions. It can be used in clinics or hospitals to support more effective treatment and care for patients with Alzheimer’s.


----------------------------------------

## Notebooks

The project is implemented through six modular notebooks:

- **01_Problem_Statement_and_Setup**  
   Define the research goals and problem formulation, and prepare the working environment by creating folders, setting up libraries, and documenting the project structure so that it is easier to navigate.

- **02_Dataset_Cleaning_Overview**  
   Load the Alzheimer’s dataset, check for missing values, clean and rename columns, remove duplicates or irrelevant features, and describe the structure of the data.

- **03_Data_Loading_and_Preprocessing**  
   Prepare the dataset for analysis by encoding categorical features, scaling numeric values, handling outliers, and selecting the most important variables for modeling.

- **04_Exploratory_Data_Analysis (EDA)**  
   Explore the data using descriptive statistics and visualizations. Create graphs to show age distribution, gender ratio, disease progression, and correlations between features.

- **05_Model_Training_and_Evaluation**  
   Train classification models (such as Logistic Regression, Naive Bayes, or Decision Trees) to predict the disease stage. Evaluate the models using accuracy scores, confusion matrices, and cross-validation.

- **06_Results_and_Interpretation**  
   Show the final results, explain which features had the most influence on predictions, and prepare visual output for the Streamlit dashboard. Summarize findings and discuss model performance.
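
The scaling, training, and evaluation steps described for notebooks 03 and 05 can be combined in a single scikit-learn pipeline. The sketch below is one possible arrangement, not the project's final model. It runs on synthetic data from `make_classification`, and the choice of `LogisticRegression`, five folds, and three classes are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the patient data: 300 rows, 8 numeric features, 3 stages.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, random_state=42,
)

# Hold out 25% of the rows, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so it is fit on the training folds only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation on the training part, then a final hold-out check.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"Cross-validated accuracy: {cv_scores.mean():.3f}")
print(f"Hold-out accuracy: {test_accuracy:.3f}")
print(cm)
```

Swapping `LogisticRegression` for `GaussianNB` or `DecisionTreeClassifier` in the pipeline would let the other candidate models be compared with the same evaluation code.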

------------------------------

### Streamlit Application
A separate **Streamlit web application** is used to deliver feedback interactively, allowing users to explore the predictions and results without needing technical knowledge. The dashboard includes an interface where users can upload data, view visualizations, and see model outputs in a simple and intuitive way. This makes the solution accessible for doctors, nurses, or clinical staff who want to understand the disease stage of a patient based on selected features.

The Streamlit app also includes charts, graphs, and summary tables based on the machine learning results. It allows users to filter, compare, and interpret the findings visually. The app is built using standard Python libraries and will be stored in a dedicated folder within the GitHub repository.

If time allows, we also plan to integrate a simple **chatbot** built with GenAI techniques and tools (e.g., retrieval-augmented generation (RAG) or a framework such as LangChain) that can answer basic questions based on the data and model results. This would make the dashboard more interactive and user-friendly, especially for healthcare professionals looking for quick insights.




--------------------------------

## Execution Plan: BI Sprints

**Sprint 1: Problem Formulation**  
Notebook: `01_problem_statement_and_setup.ipynb`  
Focus: Define the problem, goals, research questions, and project structure. Prepare the working environment, folders, and library setup.

**Sprint 2: Data Collection & Cleaning**  
Notebook: `02_dataset_cleaning_overview.ipynb`  
Focus: Load the Alzheimer’s dataset, clean the data, check for missing values, drop irrelevant columns, and ensure the structure is ready for analysis.

**Sprint 3: Feature Engineering and Machine Learning**  
Notebooks: `03_data_loading_and_preprocessing.ipynb`, `05_model_training_and_evaluation.ipynb`  
Focus: Prepare and transform features, apply scaling and encoding, train classification models, evaluate performance, and select the best model for deployment.

**Sprint 4: Business Application**  
Assets: `streamlit_app.py`, `06_results_and_interpretation.ipynb`  
Focus: Deploy a user-facing Streamlit application that provides simple and clear access to model predictions. Include charts, feature importance, and final explanations for clinical decision support. If time allows, consider adding a chatbot for natural language interaction with the results.

This project follows a structured BI development lifecycle, using notebooks for reproducible analysis and Streamlit for interactive delivery.



#### Team Member Engagement

| Member    | Tasks                                         | Estimated Deadline |
|-----------|-----------------------------------------------|--------------------|
| Fatima    | Sprint 1: Problem Formulation                 | Wednesday d. 4/6-25|
| Felicia   | Sprint 2: Data Preparation                    | Monday    d. 4/6-25|
| Felicia   | Sprint 3: Data Modelling                      | Friday    d. 4/6-25|
| Fatima    | Sprint 4: Business Application                | Monday    d. 4/6-25|

All members have been involved in review and testing before deliverables were pushed to GitHub or submitted.


## Project Directory Structure

The project is organized in the following folder structure to support modular development and reproducibility:
/AlzheimerPredictor4u_BI_Exam/
│
├── data/ # Raw and cleaned datasets (.csv)
├── models/ # Saved ML models (e.g., .pkl or .joblib)
├── plots/ # Generated figures and visualizations
├── reports/ # Summary reports, exported results
├── notebooks/ # All Jupyter notebooks
│ ├── 01_problem_statement_and_setup.ipynb
│ ├── 02_dataset_cleaning_overview.ipynb
│ ├── 03_data_loading_and_preprocessing.ipynb
│ ├── 04_exploratory_data_analysis.ipynb
│ ├── 05_model_training_and_evaluation.ipynb
│ └── 06_results_and_interpretation.ipynb
├── streamlit_app/ # Streamlit app script and assets
│ └── streamlit_app.py
├── utils/ # Utility scripts for setup, saving, etc.
│ ├── setup.py
│ └── save_tools.py
├── requirements.txt # Project dependencies
├── README.md # Project documentation
└── .gitignore # Files and folders to ignore in version control

------------------------------

## Environment Setup

We use standard Python libraries for data analysis, machine learning, and dashboard development. All paths are defined as relative to ensure reproducibility across machines and platforms.

### Libraries Used

- **Pandas** — Data manipulation and DataFrame operations  
- **NumPy** — Numerical computations  
- **Matplotlib & Seaborn** — Data visualization  
- **Scikit-learn** — Modeling, metrics, and preprocessing tools  
- **Joblib** — Model persistence  
- **Streamlit** — Interactive web application for feedback delivery

### Setup Procedures

To ensure consistency and structure throughout the project, we use a setup function stored in `utils/setup.py`. This function:

- Applies a consistent visual style for all plots  
- Verifies that key folders (`/data`, `/models`, `/plots`, `/reports`) exist  
- Ensures the working environment is ready before any notebook is executed  


#### Development Tools

- **IDE:** Visual Studio Code (VS Code) with Jupyter support
- **Version Control:** Git (GitHub remote)
- **Package Management:** Anaconda / pip (via `requirements.txt`)
- **Notebook Environment:** Jupyter Notebooks, Streamlit for deployment


### Platform Requirements

- **Python** 3.9 or higher  
- **Jupyter Notebook** (or Visual Studio Code with Jupyter extension)  
- **Streamlit** version 1.20 or later  

All notebooks are executed from within the `notebooks/` directory. Output files such as datasets, visualizations, and model summaries are saved to corresponding subdirectories located one level up for consistency across the project.
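
Because the notebooks run from `notebooks/` while `utils/` and the output folders sit one level up, the project root has to be resolvable from the notebook's working directory. The sketch below shows one way this could be done, assuming the layout shown in the directory structure. `project_root` and `output_path` are hypothetical helper names, not functions from the repository.

```python
import sys
from pathlib import Path


def project_root(cwd=None):
    """Return the project root, assuming the notebook runs inside notebooks/."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    # Assumption: notebooks live exactly one level below the project root.
    return cwd.parent if cwd.name == "notebooks" else cwd


def output_path(kind, filename, cwd=None):
    """Build a path such as <root>/plots/<filename> for one of the output folders."""
    allowed = {"data", "models", "plots", "reports"}
    if kind not in allowed:
        raise ValueError(f"unknown output folder: {kind!r}")
    return project_root(cwd) / kind / filename


root = project_root()
# Make the `utils` package importable when the working directory is notebooks/.
if str(root) not in sys.path:
    sys.path.insert(0, str(root))
```

With this in place, a call such as `output_path("plots", "age_distribution.png")` (the file name is only an example) resolves to the same folder no matter which machine the notebook runs on.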

------------------------------------

### Project initialization

In [68]:
from utils.setup import init_environment
init_environment()

Environment initialized.


### Initialization Function

The `init_environment()` function performs the following setup tasks:

- Applies consistent Seaborn and Matplotlib visual styles
- Verifies the existence of required directories: `../data`, `../plots`, `../models`, and `../reports`
- Ensures a clean, reproducible environment for all notebooks
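
The source of `utils/setup.py` is not shown in this notebook, so the following is only a sketch of what a function with these responsibilities might look like. The parameters `base_dir` and `apply_style` and the returned list are assumptions added for illustration; the real `init_environment()` is called without arguments.

```python
import os

# Folders named in the description above, relative to the notebooks/ directory.
REQUIRED_DIRS = ("../data", "../plots", "../models", "../reports")


def init_environment(base_dir=".", apply_style=True):
    """Hypothetical stand-in for utils.setup.init_environment()."""
    created = []
    for rel in REQUIRED_DIRS:
        path = os.path.normpath(os.path.join(base_dir, rel))
        # Create the folder only if it is missing, and remember what was created.
        if not os.path.isdir(path):
            os.makedirs(path, exist_ok=True)
            created.append(path)

    if apply_style:
        try:
            import seaborn as sns  # set_theme also updates Matplotlib defaults
            sns.set_theme(style="whitegrid", palette="deep")
        except ImportError:
            print("seaborn not installed; skipping plot styling.")

    print("Environment initialized.")
    return created
```

Returning the list of newly created folders makes a first run easy to tell apart from a repeat run, where the list is empty.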

-------------------

In [72]:
# Import libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Apply Seaborn styling (this sets both matplotlib and seaborn visuals)
sns.set_theme(style="whitegrid", palette="deep")


# Create directories for reproducibility
folders = ["../data", "../models", "../plots", "../reports"]
for folder in folders:
    os.makedirs(folder, exist_ok=True)

print("Environment setup complete.")

Environment setup complete.


In [None]:
from utils.save_tools import save_notebook_and_summary

save_notebook_and_summary(
    notebook_name="01_problem_statement_and_setup",
    summary={
        "description": "Established the project scope and objectives for AlzheimerPredictor4u, outlined key research questions, defined the BI sprint structure, and initialized the working environment.",
        "team_members": [
            "Felicia Favrholdt",
            "Fatima Majid Shamcizadh"
        ],
        "sprints_defined": 4,
        "notebooks_planned": [
            "01_problem_statement_and_setup",
            "02_dataset_cleaning_overview",
            "03_data_loading_and_preprocessing",
            "04_exploratory_data_analysis",
            "05_model_training_and_evaluation",
            "06_results_and_interpretation"
        ],
        "folders_created": ["data", "models", "plots", "reports"],
        "tools_used": [
            "Python 3.9", "Jupyter Notebook", "VS Code", "GitHub", "Streamlit"
        ]
    }
)


Notebook save triggered.
Summary saved to: ../reports/01_problem_statement_and_setup_summary_2025-05-26_03-38-27.json
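
The implementation of `save_notebook_and_summary` in `utils/save_tools.py` is not shown here. Judging only from the output path above, the summary part could work roughly as in the sketch below. The function name `save_summary` and its parameters `reports_dir` and `now` are hypothetical, and the notebook-saving step is left out.

```python
import json
from datetime import datetime
from pathlib import Path


def save_summary(notebook_name, summary, reports_dir="../reports", now=None):
    """Write `summary` to <reports_dir>/<notebook_name>_summary_<timestamp>.json."""
    now = now or datetime.now()
    # Timestamp format inferred from the file name in the recorded output.
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")

    out_dir = Path(reports_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{notebook_name}_summary_{stamp}.json"

    with out_file.open("w", encoding="utf-8") as fh:
        json.dump(summary, fh, indent=2, ensure_ascii=False)

    print(f"Summary saved to: {out_file}")
    return out_file
```

Putting a timestamp in the file name means repeated runs add new summaries instead of overwriting earlier ones, which keeps a simple history of each notebook's state.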


------------------------