# Assignment 1 Part A - Stage 1
#### 11482 - Pattern Recognition and Machine Learning
#### James McGuinness (u3196600)

# A1
1. Provide a description of the problem, motivation and characterisation.  
2. Identify and describe the dataset.  
*(15 pts)*

**Problem:**
- **Identification of Malignant Tumors:** The goal is to develop a model that accurately identifies malignant tumors in breast scans based on the provided dataset.
- **Importance:** Early and accurate detection of malignant tumors is crucial for effective treatment and can significantly improve patient outcomes.

**Motivation:**
- **Choice of Dataset:** The breast cancer dataset was chosen because it poses a manageable yet significant problem for learning key concepts such as classification, confusion matrices, and the two error types a confusion matrix exposes: false positives (FP) and false negatives (FN).
- **Impact:** Automating tumor detection could reduce human error, speed up diagnosis, and potentially uncover patterns that aren't immediately obvious to human experts.

**Motivation Clarification:**
- **Skill Development:** The breast cancer dataset was chosen to deepen understanding of classification problems, confusion matrices, and other relevant concepts, which are foundational to more complex tasks like face recognition or space-related projects.

**Characterization:**
- **Supervised Learning:** This is a supervised learning problem where the task is to classify scans as either benign or malignant based on features extracted from the data.
- **Patterns:** The features likely display patterns that correlate with the presence or absence of malignancy, making this a typical pattern recognition problem.

## 2. Identification and Description of the Dataset

**Dataset Overview:**
- **Source:** The dataset contains measurements of tissue properties taken from breast scans.
- **Label:** Each sample is labeled as either benign or malignant.
- **Features:** The dataset includes several features that could be indicators of malignancy, such as the size, shape, and texture of the tumor.

**Details:**
- **Samples:** The dataset is divided into training and testing sets, so the model is trained on one portion of the data and evaluated on a separate, held-out portion.
- **Feature Description:** The dataset includes features such as mean radius, mean texture, mean perimeter, etc. These features are crucial for the model to identify patterns that correlate with malignancy.
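
The train/test division described above can be sketched as follows. This is a minimal illustration, not the split actually used in the project: the function name `stratified_split`, the 25% test fraction, and the label counts are all assumptions made for the example. Stratifying by label keeps the benign/malignant ratio similar in both sets.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with similar class proportions in each set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # randomise within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 0 = benign, 1 = malignant (made-up counts, not the real dataset).
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, which matters when comparing several models on the same held-out data.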

## Additional Considerations

**Model Usage in Problem Solving and Decision-Making:**
- **Clinical Decision Support:** The model could assist doctors in making more informed decisions by providing a second opinion based on data-driven patterns.
- **Screening Programs:** It could be used in large-scale screening programs to quickly and efficiently analyze scans, prioritizing those that need further investigation.

# A2
1. Identify questions to be investigated.  
2. How would this model be used in decision making in the problem domain?  
3. How would you use the model to draw outcomes for new cases?  
*(15 pts)*

- Investigate and characterize the problem
  - In order to better understand the goals of the project
  - Identify what questions will be investigated
  - How will the models learned help in answering the questions?
  - How will the model be validated and used to predict unseen problem cases?

## 1. Identify Questions to be Investigated

**Research Questions:**

- **Which Features are Most Effective?**  
  - *Question:* What are the most significant features in the dataset that contribute to accurately identifying malignant tumors?  
  - *Reasoning:* Understanding which features (e.g., mean radius, mean texture) have the most impact on the model's performance can help in refining the model and improving accuracy.

- **Model Processing in Real-Time Scenarios:**  
  - *Question:* How can the model be optimized to process breast scan data in real-time for clinical use?  
  - *Reasoning:* Real-time processing is critical in clinical environments, where quick and accurate decisions can save lives. This question addresses the practicality of deploying the model in a real-world setting.

- **Model's Performance on Unseen Data:**  
  - *Question:* How well does the model generalize to new, unseen data?  
  - *Reasoning:* It's essential to ensure that the model performs well not just on the training data but also on new cases it hasn't encountered before. This helps in assessing the model's reliability in predicting outcomes for future patients.
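
One simple way to start on the first question (which features are most effective) is to score each feature by how far apart the two class means are, relative to the feature's spread. The sketch below is an assumption about how this could be explored, not the method the project commits to; `rank_features` is a hypothetical helper and the measurements are made up. It looks at one feature at a time, so it ignores interactions between features.

```python
from statistics import mean, pstdev

def rank_features(rows, labels, feature_names):
    """Rank features by the absolute standardized gap between class means."""
    scores = {}
    for j, name in enumerate(feature_names):
        column = [row[j] for row in rows]
        malignant = [v for v, y in zip(column, labels) if y == 1]
        benign = [v for v, y in zip(column, labels) if y == 0]
        spread = pstdev(column) or 1.0  # guard against a constant column
        scores[name] = abs(mean(malignant) - mean(benign)) / spread
    # Largest separation first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up measurements for illustration only (0 = benign, 1 = malignant).
feature_names = ["mean radius", "mean texture"]
rows = [[12.0, 18.0], [13.0, 20.0], [11.5, 19.0],
        [18.0, 19.5], [20.0, 18.5], [19.0, 20.5]]
labels = [0, 0, 0, 1, 1, 1]

ranking = rank_features(rows, labels, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy data the classes differ mainly in radius, so that feature ranks first. On the real dataset the ranking would need to be checked against model-based importance as well.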

## 2. How Would This Model Be Used in Decision Making in the Problem Domain?

**Clinical Decision-Making:**

- **Supporting Diagnoses:**  
  The model can assist radiologists by providing a second opinion on breast scans, highlighting cases with a high probability of malignancy. This reduces the chances of misdiagnosis and ensures that suspicious cases are further investigated.

- **Prioritization of Cases:**  
  In a clinical setting, the model could be used to prioritize cases for review based on the likelihood of malignancy. This ensures that urgent cases are dealt with promptly.

- **Continuous Learning:**  
  The model could be updated continuously with new data, improving its accuracy and adaptability over time. This would help in keeping up with any changes in the characteristics of the data, such as new imaging techniques or demographic shifts.
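
The prioritization idea above amounts to sorting the review queue by predicted probability. A minimal sketch, assuming each case is a dictionary with a hypothetical `p_malignant` field holding the model's output:

```python
def prioritise(cases):
    """Order cases so the highest predicted malignancy probability comes first."""
    return sorted(cases, key=lambda c: c["p_malignant"], reverse=True)

# Made-up case records for illustration.
cases = [
    {"case_id": "A", "p_malignant": 0.12},
    {"case_id": "B", "p_malignant": 0.91},
    {"case_id": "C", "p_malignant": 0.47},
]
queue = [c["case_id"] for c in prioritise(cases)]
print(queue)
```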

## 3. How Would You Use the Model to Draw Outcomes for New Cases?

**Application to New Cases:**

- **Input New Data:**  
  For each new patient, the model would take the breast scan measurements as input and predict whether the tumor is likely to be malignant or benign.

- **Predictive Outcome:**  
  Based on the learned patterns from the training data, the model would output a probability score indicating the likelihood of malignancy. This score would be used to classify the tumor as benign or malignant.

- **Validation with Real Data:**  
  The model's predictions on new data can be validated by comparing them with the actual outcomes. This feedback loop would help in refining the model further.
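
Turning the probability score into a benign/malignant label requires a decision threshold. The sketch below assumes a default threshold of 0.5, which is a common starting point rather than a value fixed by this project; the function name `classify` is illustrative.

```python
def classify(p_malignant, threshold=0.5):
    """Map a predicted malignancy probability to a class label."""
    if not 0.0 <= p_malignant <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return "malignant" if p_malignant >= threshold else "benign"

print(classify(0.35))                 # default threshold
print(classify(0.35, threshold=0.3))  # more cautious threshold
```

Lowering the threshold flags more cases as malignant, which reduces false negatives at the cost of more false positives. In a screening setting a missed malignancy is usually the more serious error, so the threshold is a design choice and not just a default.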

## Additional Considerations

**Model Validation and Prediction:**

- **Validation Strategy:**  
  The model will be validated on data it has not seen during training. Common strategies include a train-test split or k-fold cross-validation; comparing performance on the training data with performance on the held-out data shows whether the model is overfitting.

- **Unseen Problem Cases:**  
  By applying the model to unseen cases (test data), we can assess its ability to generalize. High performance on the test data would suggest that the model can reliably predict outcomes for new patients.

- **Real-Time Implementation:**  
  For real-time decision-making, the model must be optimized for speed and accuracy, potentially integrating with existing medical systems to provide instant feedback.
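
The cross-validation strategy mentioned above can be illustrated by generating the fold indices by hand. This is a sketch under assumed settings (5 folds, 10 samples, a fixed seed); `k_fold_indices` is a hypothetical helper, and a library routine would normally be used instead.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated on exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(10, k=5))
for train_idx, val_idx in splits:
    print(sorted(val_idx), len(train_idx))
```

Averaging the evaluation metric over the k validation folds gives a steadier estimate of generalization than a single train-test split, at the cost of training the model k times.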


# A3
1. Explain why the problem is a PRML problem with reference to the design steps of a PRML solution.  
2. Discuss what 'pattern recognition' will be involved and how machine learning will be used in the design.  
3. Identify 3-4 ML algorithms you would be investigating for the selected problem, explain why are they suited?  
4. Explain why the proposed project qualifies as a pattern recognition and machine learning problem?  
*(15 pts)*

## 1. Why the Problem is a PRML Problem

The breast cancer detection problem qualifies as a Pattern Recognition and Machine Learning (PRML) problem due to the following reasons:

**Pattern Recognition Design Steps:**
- **Problem Definition:** The task is to classify breast scans into malignant or benign categories. This involves recognizing patterns in the scan features that correlate with the malignancy of tumors.
- **Feature Extraction:** Relevant features are extracted from the scan data, such as mean radius, texture, and perimeter. These features serve as inputs for the machine learning algorithms.
- **Model Training and Testing:** The dataset is split into training and testing sets. A model is trained on the training set to learn the patterns and is evaluated on the test set to measure its performance.
- **Model Validation:** Validation techniques ensure that the model generalizes well to unseen data, which is crucial for its effectiveness in real-world scenarios.

**Benefits:**
- **For Patients and the Medical System:** Automating tumor detection using a machine learning model can enhance accuracy, reduce diagnostic errors, and expedite the process, potentially leading to better patient outcomes and more efficient medical practices.

**Design Pipeline:**
- **Data Collection:** Obtain and preprocess breast scan data.
- **Feature Extraction:** Identify and extract relevant features from the scans.
- **Model Selection:** Choose appropriate machine learning models.
- **Training:** Train the models using labeled data.
- **Evaluation:** Assess the model’s performance using metrics such as accuracy, precision, recall, and F1 score.
- **Deployment:** Implement the model in a clinical setting for real-time predictions.
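
The evaluation metrics named in the pipeline all derive from the four confusion-matrix counts. The sketch below computes them from made-up labels and predictions (1 = malignant, 0 = benign); the function names are illustrative and the numbers are not results from this project.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1, returning 0.0 where a ratio is undefined."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up example: 4 malignant and 6 benign cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
scores = metrics(tp, fp, tn, fn)
print(tp, fp, tn, fn)
print(scores)
```

Recall is the metric most directly tied to false negatives (missed malignant tumors), so it deserves particular attention alongside accuracy.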

## 2. Pattern Recognition and Machine Learning Involvement

**Pattern Recognition:**
- **Pattern Recognition Involvement:** The task involves identifying patterns in breast scan features that distinguish malignant tumors from benign ones. The pattern recognition aspect focuses on extracting meaningful patterns from the data to make accurate classifications.

**Machine Learning Design:**
- **Supervised Learning:** This problem is a supervised learning task, where the model learns from labeled data (i.e., scans with known malignancy status) to make predictions on new, unseen data.
- **Model Training:** Machine learning algorithms are trained to recognize the patterns in the features that are indicative of malignancy.

## 3. ML Algorithms to Investigate

**1. Logistic Regression:**
- **Suitability:** Logistic Regression is suited for binary classification problems and provides probabilities for class membership. It works well if the relationship between the features and the log-odds of the outcome is approximately linear.

**2. Support Vector Machines (SVM):**
- **Suitability:** SVM is effective for classification tasks. It finds the hyperplane that maximizes the margin between classes, and with a kernel function it can also separate data that is not linearly separable in the original feature space.

**3. Random Forest:**
- **Suitability:** Random Forest is an ensemble method that uses multiple decision trees to improve accuracy and control overfitting. It handles both linear and non-linear relationships well.

**4. Gradient Boosting Machines (GBM):**
- **Suitability:** GBM builds an ensemble in a stage-wise manner, with each new tree fitted to correct the errors of the trees before it. It is effective for capturing complex patterns in the data.

**Reasoning:**
- These algorithms are chosen because they handle classification problems effectively and can accommodate both linear and non-linear relationships in the data. They also offer various ways to deal with potential overfitting and model complexity.
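
To make the simplest of these candidates concrete, the sketch below fits a one-feature logistic regression by gradient descent. It is a teaching illustration under stated assumptions: a single standardized feature, made-up data, and hand-picked values for the learning rate and number of epochs. The project itself would more likely rely on a library implementation.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=500):
    """Fit weight w and bias b by gradient descent on the mean log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log-loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up standardized feature values; larger values are labeled malignant (1).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
p_high = sigmoid(w * 1.5 + b)   # predicted probability for a large feature value
p_low = sigmoid(w * -1.5 + b)   # predicted probability for a small feature value
print(round(p_high, 3), round(p_low, 3))
```

The learned weight is positive, so the predicted probability of malignancy rises with the feature value. This is the kind of approximately linear relationship for which Logistic Regression is a good fit.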

## 4. Why the Project Qualifies as a PRML Problem

**Pattern Recognition and Machine Learning Problem:**
- **Classification Task:** The problem involves classifying breast scans into malignant or benign categories, a typical pattern recognition problem.
- **Feature-Based Decision Making:** The model uses extracted features to make predictions, which aligns with machine learning principles.
- **Supervised Learning:** The dataset provides labeled examples, making it a supervised learning problem where the model learns from examples to generalize to new cases.

**Linear vs. Non-Linear:**
- **Relationship Exploration:** The relationship between features and the outcome may be linear or non-linear. Visualization and statistical analysis are used to determine the nature of these relationships.
- **Model Choice:** Depending on the linearity of the data, linear models (like Logistic Regression) or non-linear models (like Random Forest, Gradient Boosting, or a kernel SVM) may be used.

**Real-Time and Computational Requirements:**
- **Real-Time Implementation:** For practical applications, the model must be optimized for real-time performance, especially in clinical settings where timely results are crucial.
- **Computational Complexity:** The choice of model will depend on the computational resources available and the complexity of the data.

By addressing these aspects, the project demonstrates a clear understanding of pattern recognition and machine learning principles and their application to the breast cancer detection problem.

- Write and submit a report of 2 pages > Convert to HTML > print to PDF

# Rubric
https://uclearn.canberra.edu.au/courses/16042/assignments/129302

# References

- TODO
- ChatGPT
- Kaggle
- Learning python PDF
