# Finding Appropriate Datasets for Each Method


## Approach:

- The assumptions and strengths of each model will be stated  
  - Based on these, an appropriate dataset will be generated for each model  
  - Every model will be tested on every dataset
- Each model will be evaluated using 5-fold cross-validation  
  - This makes the score less dependent on any single train/test split and gives a more reliable estimate
- Performance will be measured and compared using the F1 score  
  - F1 is the harmonic mean of precision and recall, so it penalizes both false positives and false negatives and is more informative than accuracy when classes are imbalanced
- The test results will be placed into a matrix and compared to validate the assumptions
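
The evaluation step described above could be sketched as follows. This is a minimal sketch assuming scikit-learn; the helper name `mean_f1` and the placeholder data are assumptions made for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def mean_f1(model, X, y, folds=5):
    """Return the mean F1 score of `model` over `folds`-fold cross-validation."""
    # For classifiers, an integer `cv` makes scikit-learn use stratified folds.
    scores = cross_val_score(model, X, y, cv=folds, scoring="f1")
    return scores.mean()


# Placeholder data, only to show the call; the real datasets are built per model.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
score = mean_f1(LogisticRegression(), X, y)
print(f"mean F1 over 5 folds: {score:.3f}")
```

Note that `scoring="f1"` computes the binary F1 score for the class labelled `1`, which fits the two-class datasets used here.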

#### *Graph drawing utility function*

#### *Declaring the models*
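
A minimal sketch of how the models could be declared, assuming scikit-learn estimators. The dictionary name `models`, the display names, and the choice of default hyperparameters (apart from `max_depth=2` and the SVM kernels, which follow the section titles) are assumptions.

```python
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# One entry per method discussed in this document (names are assumptions).
models = {
    "Logistic regression": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    # scikit-learn trees grow fully by default, i.e. without pruning.
    "Decision tree (no pruning)": DecisionTreeClassifier(random_state=0),
    "Decision tree (depth 2)": DecisionTreeClassifier(max_depth=2, random_state=0),
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
}
```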

#### *Declaring the result matrix*
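
A minimal sketch of a result matrix, assuming one row per model and one column per dataset. The names `model_names`, `dataset_names`, `results`, and `record` are hypothetical, and labelling each dataset after the model it was designed for is an assumption.

```python
import numpy as np

model_names = [
    "Logistic regression", "LDA", "QDA",
    "Decision tree (no pruning)", "Decision tree (depth 2)",
    "SVM (linear)", "SVM (RBF kernel)",
]
# Hypothetical dataset labels: one dataset per model, named after that model.
dataset_names = [f"Dataset for {name}" for name in model_names]

# NaN marks (model, dataset) pairs that have not been evaluated yet.
results = np.full((len(model_names), len(dataset_names)), np.nan)


def record(model_name, dataset_name, score):
    """Store `score` in the cell for the given model (row) and dataset (column)."""
    row = model_names.index(model_name)
    col = dataset_names.index(dataset_name)
    results[row, col] = score
```

With this layout, the diagonal holds each model's score on the dataset designed for it, so an assumption is supported when the diagonal entry is among the best in its column.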

## **Logistic regression**


#### Assumptions:
- The decision boundary is linear: the log-odds of the class are a linear function of the features

#### Strengths:
- Efficient on smaller datasets  
- Deals well with overlap of classes

#### Weaknesses:
- Assumes linear boundaries; doesn't work with complex relationships

### Appropriate dataset:
-  **Classes divided by a linear boundary, with little overlap**
  - Moderate number of instances, balanced classes
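
A dataset of this kind could be generated as sketched below, assuming scikit-learn's `make_blobs`. The centers, spread, sample count, and variable names are illustrative assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Two equally sized blobs whose centers are 4 standard deviations apart,
# so a straight line separates them with only a little overlap.
X_lin, y_lin = make_blobs(
    n_samples=300,
    centers=[[-2.0, 0.0], [2.0, 0.0]],
    cluster_std=1.0,
    random_state=0,
)

lin_f1 = cross_val_score(
    LogisticRegression(), X_lin, y_lin, cv=5, scoring="f1"
).mean()
print(f"logistic regression, mean F1: {lin_f1:.3f}")
```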

## **LDA** (Linear Discriminant Analysis)


#### Assumptions:
- Each class follows a multivariate normal (Gaussian) distribution
- All classes share the same covariance matrix
- Classes can be separated by a linear boundary

#### Strengths:
- Sample-efficient, works well even with few instances
- Robust to some noise
- Less prone to overfitting

#### Weaknesses:
- Assumes linear boundaries
- Imposes the same covariance structure on every class

### Appropriate dataset:
-  **Gaussian clouds with little overlap**
  - **Small number of instances**, balanced classes
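
As a sketch of such a dataset, two Gaussian clouds with a shared covariance matrix can be drawn with NumPy. The means, the covariance, the 20 instances per class, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Both classes use the same covariance matrix, matching the LDA assumption.
shared_cov = np.array([[1.0, 0.6], [0.6, 1.0]])
n_per_class = 20  # deliberately small

X_a = rng.multivariate_normal([0.0, 0.0], shared_cov, size=n_per_class)
X_b = rng.multivariate_normal([3.0, 0.0], shared_cov, size=n_per_class)
X_gauss = np.vstack([X_a, X_b])
y_gauss = np.repeat([0, 1], n_per_class)

lda_small_f1 = cross_val_score(
    LinearDiscriminantAnalysis(), X_gauss, y_gauss, cv=5, scoring="f1"
).mean()
print(f"LDA, mean F1: {lda_small_f1:.3f}")
```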

## **QDA** (**Quadratic** Discriminant Analysis)


#### Assumptions:
- Each class follows a multivariate normal (Gaussian) distribution

#### Strengths:
- Classes may have different covariances
- Draws curved decision boundaries

#### Weaknesses:
- Requires more instances than LDA, because a separate covariance matrix is estimated for each class
- Sensitive to class imbalances
- Prone to overfitting when instances are few

### Appropriate dataset:
-  **Clouds of different covariances with overlap**
  - **Large number of instances**, balanced classes
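
A sketch of such a dataset: one tight cloud and one wide, correlated cloud that overlap. The means, covariances, sample size, and variable names are assumptions. LDA is evaluated alongside QDA only to show how the comparison could be made; on data like this, QDA would be expected to score higher.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_per_class = 1000  # large, so each class covariance is estimated well

# Class 0 is a tight cloud; class 1 is wide and correlated, and overlaps it.
X_tight = rng.multivariate_normal(
    [0.0, 0.0], [[0.3, 0.0], [0.0, 0.3]], size=n_per_class
)
X_wide = rng.multivariate_normal(
    [1.0, 1.0], [[4.0, 1.5], [1.5, 4.0]], size=n_per_class
)
X_cov = np.vstack([X_tight, X_wide])
y_cov = np.repeat([0, 1], n_per_class)

qda_f1 = cross_val_score(
    QuadraticDiscriminantAnalysis(), X_cov, y_cov, cv=5, scoring="f1"
).mean()
lda_f1 = cross_val_score(
    LinearDiscriminantAnalysis(), X_cov, y_cov, cv=5, scoring="f1"
).mean()
print(f"QDA mean F1: {qda_f1:.3f}, LDA mean F1: {lda_f1:.3f}")
```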

## **Decision Trees** *(no pruning)*


#### Assumptions:
- Enough data to support meaningful splits  
- Ideally, data shapes that can be split along the axes

#### Strengths:
- Handle both discrete and continuous data well  
- Can capture non-linear patterns  
- Flexible with many types of data, since very few assumptions are made  
- Easy to interpret

#### Weaknesses:
- Prone to overfitting if implemented without pruning, especially for noisy data  
- Can struggle with curved or diagonal boundaries  
- Favours the majority class when classes are imbalanced

### Appropriate dataset:
- **A class in the shape of a circle surrounded by points of the other class**
  - **Very little noise**
  - Balanced classes, medium number of instances
- *Other options:*
  - Square surrounded by the other class  
  - The example with half-moons
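
The circle-in-a-ring dataset could be produced with scikit-learn's `make_circles`, as sketched below. The sample count, `noise=0.05`, `factor=0.5`, and the variable names are assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Inner circle (class 1) surrounded by an outer ring (class 0), low noise.
# make_circles splits the samples evenly between the two classes.
X_circ, y_circ = make_circles(
    n_samples=400, noise=0.05, factor=0.5, random_state=0
)

# DecisionTreeClassifier grows an unpruned tree by default.
tree_f1 = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_circ, y_circ, cv=5, scoring="f1"
).mean()
print(f"unpruned tree, mean F1: {tree_f1:.3f}")
```

The half-moons alternative mentioned above is available in the same module as `make_moons`.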


## **Decision Trees** *(depth limited to 2)*


#### Assumptions:
- A small number of informative features, since a depth-2 tree can split on at most three features
- Otherwise the same as for unpruned decision trees

#### Strengths:
- **More robust to noisy data**
- **Less prone to overfitting on small datasets**
- Handle both discrete and continuous data well  
- Can capture non-linear patterns  
- Flexible with many types of data, since very few assumptions are made  
- Easy to interpret

#### Weaknesses:
- **More limited complexity of class shapes**
- Can struggle with curved or diagonal boundaries  
- Favours the majority class when classes are imbalanced

### Appropriate datasets:
- **A class in the shape of a circle surrounded by points of the other class**
  - **Noisy data**
  - Balanced classes, medium number of instances
    - *Takes advantage of the robustness to noise*
- **A class in the shape of a circle surrounded by points of the other class**
  - **Small number of instances**
  - Balanced classes, reduced noise
    - *Takes advantage of the resistance to overfitting*
- *Other options:*
  - Square surrounded by the other class  
  - The example with half-moons
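
Both candidate datasets can be sketched with `make_circles`, one with heavy noise and one with few instances. The noise levels, sample counts, and names below are assumptions. A depth-2 tree has at most four leaves, so it can only approximate the circle coarsely; whether it actually beats the unpruned tree on these datasets is what the result matrix is meant to check, so no expected scores are stated here.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical settings: a noisy medium-sized set and a small, cleaner set.
datasets = {
    "noisy circle": make_circles(
        n_samples=400, noise=0.25, factor=0.5, random_state=0
    ),
    "small circle": make_circles(
        n_samples=40, noise=0.05, factor=0.5, random_state=0
    ),
}
trees = {
    "depth 2": DecisionTreeClassifier(max_depth=2, random_state=0),
    "no pruning": DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold F1 for every (dataset, tree) pair.
tree_scores = {}
for data_name, (X, y) in datasets.items():
    for tree_name, tree in trees.items():
        f1 = cross_val_score(tree, X, y, cv=5, scoring="f1").mean()
        tree_scores[(data_name, tree_name)] = f1
        print(f"{data_name:>12} | {tree_name:>10} | mean F1 = {f1:.3f}")
```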


## **SVM (Linear)**


#### Assumptions:
-

#### Strengths:
-
-

#### Weaknesses:
-

### Appropriate dataset:
-  **..**
  - .

## **SVM (RBF Kernel)**


#### Assumptions:
-

#### Strengths:
-
-

#### Weaknesses:
-

### Appropriate dataset:
-  **..**
  - .