# Brain Tumor MRI Classification - Scikit-learn Baseline

## Introduction

Brain MRI scans provide structured information about tissue patterns. Differences in shape, contrast, and texture can indicate whether a tumour is present and, if so, what type. For a radiologist, these differences are interpreted visually. For us, the task is to translate them into numerical features that a computer can analyse.

The dataset we will use - [Brain Tumor MRI Dataset (CC0)](https://www.kaggle.com/datasets/masoudnickparvar/brain-tumor-mri-dataset) - contains four categories: glioma, meningioma, pituitary tumour, and no tumour. Our aim is not to build a clinical tool but to create a clear, reproducible baseline using classical machine learning.

We will do this by:

* Loading the dataset either from local files or through the Kaggle API if credentials are available
* Preprocessing the MRI images (resize, grayscale, normalise) so they are consistent
* Extracting features using Histogram of Oriented Gradients (HOG), a method that captures edge and texture patterns
* Training simple classifiers such as Linear SVM and RBF SVM
* Evaluating the models with accuracy, precision, recall, and F1 score
* Discussing the limitations of this approach and the ethical considerations of using medical datasets
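
As a small illustration of the evaluation step listed above, the following sketch computes accuracy and per-class precision, recall, and F1 by hand. The label lists are made up for illustration and are not results from the dataset, and the class name `notumor` is an assumed spelling of the "no tumour" category. In the actual workflow, scikit-learn's `classification_report` reports the same quantities.

```python
from collections import Counter

# Assumed label names for the four categories (hypothetical spelling).
CLASSES = ["glioma", "meningioma", "notumor", "pituitary"]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, classes):
    """Return {class: (precision, recall, f1)} computed from two label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed an example of the true class
    results = {}
    for c in classes:
        precision = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[c] = (precision, recall, f1)
    return results


# Made-up labels purely for illustration.
y_true = ["glioma", "glioma", "meningioma", "notumor", "pituitary", "pituitary"]
y_pred = ["glioma", "meningioma", "meningioma", "notumor", "pituitary", "glioma"]

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
for name, (p, r, f1) in per_class_metrics(y_true, y_pred, CLASSES).items():
    print(f"{name:<11} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Looking at the metrics per class matters here because a single accuracy figure can hide a class that the model rarely gets right.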

The goal is to show, step by step, how images can be prepared and processed for classical machine learning, and how model results should be interpreted with care. This gives us a strong foundation for understanding the workflow before moving into more advanced methods such as deep learning.

## Setup — Libraries & Environment

Any meaningful exploration begins with the right tools. In neuroscience, this might mean calibrated electrodes or imaging equipment. In data analysis, our “lab bench” is the computational environment. Getting this right ensures that every step that follows is reliable and reproducible.

The core tools we’ll use are:

* **Python** — The language itself. Human-readable, flexible, and supported by a massive ecosystem of libraries.
* **NumPy** — The mathematical engine for fast, vectorised operations on arrays and matrices.
* **Pandas** — A high-level library for structuring and manipulating tabular data.
* **Matplotlib** — The foundational plotting library for turning numbers into visual insights.
* **scikit-learn** — Our main machine learning toolkit, providing SVMs, metrics, and utilities.
* **scikit-image** — Adds image-specific tools such as Histogram of Oriented Gradients (HOG).
* **kaggle** — A small command-line tool for downloading datasets directly from Kaggle.

### Creating your environment

It’s best to isolate your work in a virtual environment and install the required packages from `requirements.txt`. This ensures consistency and reduces version conflicts.

```bash
# Create a virtual environment
python -m venv venv

# Activate it (run only the line that matches your operating system)
source venv/bin/activate    # macOS/Linux
venv\Scripts\activate       # Windows

# Install the required libraries
python -m pip install -r requirements.txt
```
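
After installation, a quick check can confirm that the libraries are visible to the active interpreter. The sketch below is an assumption about what `requirements.txt` contains, since the file itself is not shown here: the module list is inferred from the tools listed above, and `find_missing` is a hypothetical helper name.

```python
import importlib.util

# Import names differ from the pip package names for two of the libraries:
# scikit-learn is imported as `sklearn`, scikit-image as `skimage`.
REQUIRED_MODULES = ["numpy", "pandas", "matplotlib", "sklearn", "skimage", "kaggle"]


def find_missing(modules):
    """Return the module names that cannot be located in the active environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

`find_spec` only looks the modules up without importing them, so the check stays fast and has no side effects.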

### Accessing the dataset

The dataset is already included in this repository under `data/raw/`, so you can start working immediately.
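
To confirm that the local copy is in place, one option is to count the image files in each folder. This is a minimal sketch that assumes a two-level layout of `data/raw/<split>/<class>/` (for example `Training/glioma`); the exact folder names are an assumption and should be checked against your copy. `count_images` is a hypothetical helper, not part of any library.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(root):
    """Map each 'split/class' folder under root to its number of image files."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # nothing to count if the folder is absent
    for split_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            n = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
            counts[f"{split_dir.name}/{class_dir.name}"] = n
    return counts


for folder, n in count_images("data/raw").items():
    print(f"{folder}: {n} images")
```

An empty result means `data/raw/` is missing or does not follow the assumed layout, which is a cue to download the dataset using one of the options below.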

If you prefer to download or update the dataset yourself, there are two options:

* **Manual download (always works):**

  1. Visit the [Kaggle dataset page](https://www.kaggle.com/datasets/masoudnickparvar/brain-tumor-mri-dataset).
  2. Download the files manually.
  3. Place them in `data/raw/` inside your project folder.

* **Automatic download (optional, requires API key):**

  1. In your Kaggle account settings, generate an API token. This downloads a file called `kaggle.json`.
  2. Place `kaggle.json` in `~/.kaggle/` (Linux/macOS) or `C:\Users\<YourName>\.kaggle\` (Windows).
  3. With the API configured, you can run the following Python cell inside the notebook to **force-update** the dataset into `data/raw/`:

```python
import os
import shutil

from kaggle.api.kaggle_api_extended import KaggleApi

# Authenticate first, so a missing or invalid API key stops the cell
# before the existing copy of the dataset is touched
api = KaggleApi()
api.authenticate()

# Remove the old copy and recreate an empty target folder
shutil.rmtree("data/raw", ignore_errors=True)
os.makedirs("data/raw", exist_ok=True)

# Download and extract the Brain Tumor MRI dataset
api.dataset_download_files(
    "masoudnickparvar/brain-tumor-mri-dataset",
    path="data/raw",
    unzip=True,
)
```

The download cell does three things:

* Authenticates with Kaggle using your `kaggle.json` token.
* Clears and recreates `data/raw/` so that no files from an older copy remain.
* Downloads and unzips the dataset into `data/raw/`.

If the `kaggle` import fails, check that the notebook is running in the environment where the package was installed:

```python
import sys
print(sys.executable)  # path of the interpreter running this notebook

# %pip targets the environment of the running kernel
%pip show kaggle
```

In the original run, this reported the following package details (output trimmed):

```text
Name: kaggle
Version: 1.7.4.5
Summary: Access Kaggle resources anywhere
Home-page: https://github.com/Kaggle/kaggle-api
```

Updating through the Kaggle API is **optional**. If no API key is found, the download cell stops with an authentication error, and you can continue using the dataset already included in `data/raw/`.
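
One way to avoid the error altogether is to check for credentials before attempting the download. The sketch below is a pre-check under stated assumptions, not how the `kaggle` package itself decides: it looks for `kaggle.json` in the locations mentioned above and for the `KAGGLE_USERNAME`, `KAGGLE_KEY`, and `KAGGLE_CONFIG_DIR` environment variables, and it may not cover every authentication method. `kaggle_credentials_available` is a hypothetical helper name.

```python
import os
from pathlib import Path


def kaggle_credentials_available(config_dir=None, env=None):
    """Return True if both environment variables or a kaggle.json file are present."""
    env = os.environ if env is None else env
    # Credentials supplied directly through environment variables
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Otherwise look for the token file in the configured or default folder
    if config_dir is None:
        config_dir = env.get("KAGGLE_CONFIG_DIR") or Path.home() / ".kaggle"
    return (Path(config_dir) / "kaggle.json").is_file()


if kaggle_credentials_available():
    print("Kaggle credentials found: the API download can be attempted.")
else:
    print("No Kaggle credentials found: keep using the copy in data/raw/.")
```

Wrapping the download cell in this check lets the notebook run from top to bottom on machines without a Kaggle account.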