# Project Machine Learning and Statistics

Winter 2023/24

Author: Sofiia Meteliuk

The project is to create a notebook exploring classification algorithms applied to the Iris flower data set associated with Ronald A. Fisher.

--------
The Iris dataset is a famous dataset in machine learning and is often used for practicing classification algorithms. Ronald A. Fisher introduced this dataset in his 1936 paper "The Use of Multiple Measurements in Taxonomic Problems."

• In your notebook, you should first explain what supervised learning is and then explain what classification algorithms are.

## Chapter 1:  Supervised Learning and Classification Algorithms

### Supervised Learning:

Supervised learning is a type of machine learning where the algorithm is trained on a labeled dataset. In a labeled dataset, each input data point is associated with a corresponding output label or target.

The goal of supervised learning is to learn a mapping from input features to the output labels so that the algorithm can make predictions or classifications on new, unseen data.

The process involves:

#### Training Phase:
- The algorithm is presented with a training dataset, where both the input features and their corresponding output labels are provided.
- The algorithm learns the relationship between the input features and the output labels during this training phase.

#### Testing/Evaluation Phase:
- Once the algorithm is trained, it is evaluated on a separate dataset (testing set) that it has not seen before.
- The performance is assessed by comparing the algorithm's predictions to the actual labels in the testing set.

Supervised learning tasks can be broadly categorized into two types:

- **Regression:** The algorithm predicts a continuous output variable. For example, predicting the price of a house based on its features.
- **Classification:** The algorithm predicts a discrete output variable or class label. This is the focus of this project, which explores the Iris flower dataset.
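To make the distinction concrete, here is a minimal sketch that fits one model of each kind with scikit-learn. The house sizes, prices, and the "expensive" label are invented for illustration and are not part of the project data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: house size in square metres (a single feature).
sizes = np.array([[50], [70], [90], [110], [130], [150]])

# Regression target: a continuous price (made-up values, in thousands).
prices = np.array([150.0, 210.0, 270.0, 330.0, 390.0, 450.0])

# Classification target: a discrete label (0 = cheaper, 1 = more expensive).
is_expensive = np.array([0, 0, 0, 1, 1, 1])

regressor = LinearRegression().fit(sizes, prices)
classifier = LogisticRegression(max_iter=1000).fit(sizes, is_expensive)

new_house = np.array([[120]])
predicted_price = regressor.predict(new_house)[0]   # a real number
predicted_class = classifier.predict(new_house)[0]  # one of the labels 0 or 1

print(f"Predicted price: {predicted_price:.1f}")
print(f"Predicted class: {predicted_class}")
```

The regressor returns a number on a continuous scale, while the classifier can only return one of the labels it saw during training.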

### Classification Algorithms:

Classification algorithms are a subset of supervised learning algorithms specifically designed for solving classification problems. 
In a classification task, the goal is to categorize input data points into predefined classes or labels. Here are some common classification algorithms:

- **Decision Trees:** A tree-like model where each internal node represents a decision based on a feature, and each leaf node represents the class label.
- **Random Forest:** An ensemble method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting.
- **Support Vector Machines (SVM):** A model that finds the hyperplane separating the classes with the largest margin in the feature space.
- **K-Nearest Neighbors (KNN):** A simple algorithm that classifies a data point based on the majority class among its k-nearest neighbors.
- **Logistic Regression:** Despite its name, it is a classification method. It estimates the probability that a given input belongs to a particular class, and it extends from binary to multiclass problems such as Iris.
- **Naive Bayes:** A probabilistic algorithm based on Bayes' theorem that assumes the features are conditionally independent given the class.


Source: [What is supervised learning?](https://www.ibm.com/topics/supervised-learning)
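As a rough illustration, the algorithms listed above can all be tried on the Iris data through the same scikit-learn interface. The sketch below is not part of the original analysis; the hyperparameters (`n_estimators=100`, `n_neighbors=5`, `max_iter=1000`, `cv=5`) are illustrative assumptions rather than tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative settings only; none of these values have been tuned.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}

# Mean accuracy over 5-fold cross-validation for each model.
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {mean_scores[name]:.3f}")
```

Because every estimator exposes the same `fit`/`predict` methods, swapping one algorithm for another only changes the line that constructs the model.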

In the context of my project with the Iris flower dataset, classification algorithms are applied to predict the species of Iris flowers based on their features, such as sepal length, sepal width, petal length, and petal width. The choice of algorithm depends on the characteristics of the dataset and the specific requirements of the classification task.
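The features and species mentioned above can be inspected directly in the copy of the dataset bundled with scikit-learn. This is a small sketch for orientation; the variable names are my own.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Names of the four measurements recorded for every flower.
feature_names = iris.feature_names
# Names of the three species that the integer targets 0, 1, 2 stand for.
species_names = iris.target_names.tolist()
# (number of flowers, number of features)
n_samples, n_features = iris.data.shape

print(feature_names)
print(species_names)
print(n_samples, n_features)
```

The dataset holds 150 flowers with four numeric features each, labelled with one of three species: setosa, versicolor, and virginica.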

## Chapter 2: Describe at least one common classification algorithm and implement it using the scikit-learn Python library

One common classification algorithm is the Decision Tree. 

**Decision Trees (DTs)** are a non-parametric supervised learning method used for classification and regression[^1]. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation[^1].

[^1]: Source: [scikit-learn documentation on Decision Trees](https://scikit-learn.org/stable/modules/tree.html)

Non-parametric means the model does not assume a particular distribution for the data; the shape of the tree is determined by the training data rather than fixed in advance.

Decision Trees learn from data to make decisions by recursively partitioning the input space into regions and assigning a label or value to each region.
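The partitioning described above can be made visible by printing the rules a fitted tree has learned. The sketch below uses `export_text` from `sklearn.tree`; limiting the tree with `max_depth=2` is an illustrative choice to keep the printout short, not a setting used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# max_depth=2 is an assumption made only to keep the printed tree small.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
shallow_tree.fit(iris.data, iris.target)

# Each indented line is a threshold test on one feature;
# lines starting with "class:" are leaves, i.e. the label given to a region.
rules = export_text(shallow_tree, feature_names=iris.feature_names)
print(rules)
```

Every path from the root to a leaf corresponds to one rectangular region of the feature space, and all points falling in that region receive the leaf's class.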

Decision Trees are prone to overfitting, especially if the tree is allowed to grow too deep. To mitigate this, techniques like pruning (removing unnecessary branches) can be applied. Ensemble methods like Random Forests, which combine multiple Decision Trees, are also commonly used to improve predictive performance.
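The following sketch compares an unrestricted tree with three of the mitigations mentioned above: limiting depth, cost-complexity pruning, and a Random Forest. The values `max_depth=3`, `ccp_alpha=0.02`, and `n_estimators=100` are illustrative assumptions, not tuned settings, and on a dataset as small and clean as Iris the differences may be slight.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative, untuned settings for each way of controlling tree complexity.
candidates = {
    "Unrestricted tree": DecisionTreeClassifier(random_state=42),
    "Depth-limited tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "Pruned tree": DecisionTreeClassifier(ccp_alpha=0.02, random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Store (training accuracy, test accuracy) for each model.
results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={results[name][0]:.2f}, test={results[name][1]:.2f}")
```

A large gap between training and test accuracy is the usual symptom of overfitting; a simpler tree gives up some training accuracy in exchange for a model that is less tailored to the training sample.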

*NB: There is no need to load the Iris dataset from a CSV file with pandas. The dataset ships with `sklearn.datasets` and can be loaded with `load_iris`.*

In [1]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

In [2]:
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

Split the dataset into training and testing sets

In [3]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [4]:
# Create the Decision Tree classifier; random_state fixes the seed so results are reproducible
decision_tree_classifier = DecisionTreeClassifier(random_state=42)

Train the classifier on the training data

In [5]:
decision_tree_classifier.fit(X_train, y_train)


Make predictions on the testing data

In [7]:
y_pred = decision_tree_classifier.predict(X_test)
print(y_pred)

[1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]


In [9]:
# Evaluate the predictions against the true labels of the testing set
accuracy = accuracy_score(y_test, y_pred)
classification_report_result = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n", classification_report_result)

Accuracy: 1.00

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
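A confusion matrix is another way to read the same result, showing for each true class how the predictions were distributed. It is not part of the cells above; the sketch below repeats the same split and classifier settings so that it can run on its own.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are the true classes, columns are the predicted classes.
matrix = confusion_matrix(y_test, y_pred)
print(matrix)
```

Each row sums to the support of that class in the report (10, 9, and 11), and any misclassification would appear as a non-zero entry off the diagonal.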

