# Project

***

This project explores classification algorithms applied to the Iris flower data set associated with Ronald A. Fisher. I will first explain what supervised learning is and then what classification algorithms are. We will then look at one common classification algorithm and implement it using the scikit-learn Python library.

I will start with a brief overview of the Iris flower data set.

## What is the Iris Flower Data Set?

This has been classed as a "small classic dataset from Fisher, 1936" and is one of the earliest known datasets used for evaluating classification methods. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.

The dataset consists of 150 samples in total, 50 from each of three species of Iris flowers: Iris setosa, Iris virginica, and Iris versicolor. [1] For each sample, four features were measured:

- Sepal Length (in centimeters)
- Sepal Width (in centimeters)
- Petal Length (in centimeters)
- Petal Width (in centimeters)
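The structure described above can be checked directly. The following is a minimal sketch, assuming scikit-learn is installed and using its bundled copy of the dataset via `load_iris`; the variable names (`iris`, `X`, `y`, `class_counts`) are my own choices for illustration.

```python
from collections import Counter

from sklearn.datasets import load_iris

# Load the copy of the Iris data set that ships with scikit-learn.
iris = load_iris()
X, y = iris.data, iris.target  # X: measurements, y: species encoded as 0, 1, 2

# One row per flower, one column per measured feature.
print("Shape of the feature matrix:", X.shape)
print("Feature names:", iris.feature_names)

# Count how many samples belong to each species.
class_counts = Counter(str(iris.target_names[label]) for label in y)
print("Samples per species:", dict(class_counts))
```

Running this should report 150 rows, 4 feature columns, and 50 samples for each of the three species.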

# What is Supervised Learning?

Supervised learning tasks are well defined and apply to a wide range of scenarios, such as detecting spam or forecasting rainfall. Supervised machine learning is based on the following core concepts: Data, Model, Training, Evaluating and Inference. [2] I will briefly describe each of these core concepts below.

- Data: We might have a dataset of movies released in 2023 or a dataset of weather observations. A movie dataset might include features such as director, genre and IMDb rating. A dataset used to predict the weather might include features such as latitude, longitude, temperature, humidity, cloud coverage, wind direction, and atmospheric pressure.

- Model: In supervised learning, a model is a set of parameters that captures the mathematical relationship between input feature patterns and the corresponding output label values. During training, the model learns these patterns from the provided data.

- Training: Training an ML model involves applying a machine learning algorithm to training data, allowing the model to learn and adapt. ML models allow us to handle vast datasets efficiently, recognizing patterns, detecting anomalies, and exploring correlations that might be challenging for a human to find. The result is a working model that can then be validated, tested and deployed. The model's performance during training largely determines how well it will work once it is put into an application for end users. [3] An example of this may be an image recognition model used to detect cancerous cells: it is trained on images of cells labeled as malignant or benign, and it uses what it has learned to predict whether a new cell is likely to be cancerous. As the model learns from more data, its estimates of the probability that someone has cancerous cells generally become more reliable.

- Evaluating: We assess a trained model to gauge how well it has learned. During evaluation, we provide the model with a labeled dataset, supplying only the features while withholding the actual labels. We then compare the model's predictions with the true values of the labels. For example, if a model predicted 31 inches of rainfall for Ireland in 2022 when 34 inches actually fell, we can use that difference to improve the model's predictions for 2024.

- Inference: Once we know that the model is accurate, or very close to an accurate level, we can use it to make predictions, called inferences, on unlabeled examples. For a weather model, we would give the model the current weather conditions, such as temperature, atmospheric pressure, and relative humidity, and it would predict the amount of rainfall.
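The five concepts above map fairly directly onto a few lines of scikit-learn. The sketch below is illustrative only: it assumes the Iris data as the dataset, a K-Nearest Neighbors classifier as the model, and an 80/20 train/test split; the `new_flower` measurements are a hypothetical unlabeled example, not part of the original project.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Data: features (X) and labels (y).
X, y = load_iris(return_X_y=True)

# Hold back 20% of the labeled data so the model can be evaluated on
# examples it has not seen; stratify keeps the species balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Model: here a K-Nearest Neighbors classifier (an assumption for this sketch).
model = KNeighborsClassifier(n_neighbors=5)

# Training: the model learns from the labeled training examples.
model.fit(X_train, y_train)

# Evaluating: supply only the test features, then compare the predictions
# with the withheld true labels.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on held-out data: {accuracy:.2f}")

# Inference: predict the species of a new, unlabeled flower
# (sepal length, sepal width, petal length, petal width in cm).
new_flower = [[5.1, 3.5, 1.4, 0.2]]
predicted_class = model.predict(new_flower)[0]
print("Predicted class index:", predicted_class)
```

The held-out test set plays the role of the labeled evaluation dataset described under Evaluating, while `new_flower` stands in for the unlabeled example described under Inference.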


## What are the different types of classification algorithms for supervised learning?

Commonly used supervised learning algorithms include Logistic Regression, Linear Regression, Naive Bayes, Random Forest, K-Nearest Neighbors (KNN), Support Vector Machines (SVM) and Decision Trees. For the purposes of studying the Iris dataset, I will use Logistic Regression and Decision Trees, but I will first explain each of these algorithms.

- Logistic Regression - Regression is a form of supervised learning designed to establish the connection between dependent and independent variables. Using labeled datasets, regression methods in general predict continuous output across diverse data. Logistic regression is used instead when the dependent variable is categorical or has binary outputs such as ‘yes’ or ‘no’, so it predicts discrete values for variables. [4]
A real-world example, based on my understanding, would be an email spam filter, which extracts features from an email, such as attachments or the subject line, to determine whether it is likely to be spam.

- Linear Regression - Strictly speaking, linear regression is a regression algorithm rather than a classification algorithm, because it predicts a continuous value instead of a class. It is used to uncover associations between two variables and to make future predictions, and it is categorized by the number of independent and dependent variables involved. [4] A real-world example might be predicting students' final exam scores: a model quantifies the relationship between predictors such as study habits and attendance and the final exam score, and is then used to predict the expected score for new students.

- Naive Bayes - This is a statistical classification method grounded in Bayes' Theorem and is one of the most straightforward supervised learning algorithms. The Naive Bayes classifier is known for its speed, accuracy, and reliability, particularly on large datasets. Central to Naive Bayes is the assumption that the influence of a specific feature within a class is independent of the other features. This assumption contributes to its efficiency in various classification tasks. [5] An example of this could be business lending: a Naive Bayes classifier could be trained on historical data in which businesses are labeled as either high-risk or low-risk borrowers. The model can then predict the risk category of new loan applicants. This would allow lenders to assess the creditworthiness of businesses and make informed decisions about whether to approve or deny loan applications.

- Random Forest - This is a supervised learning technique that uses labeled datasets to establish connections between input and output. This versatile method is applicable to classification tasks, such as identifying a flower's species based on attributes like petal length and color, as well as regression tasks, like forecasting tomorrow's weather using historical weather data. A Random Forest comprises numerous decision trees and produces its predictions from their collective output. [6] An interesting use of Random Forest that I found was in predicting disease risks. In their research, Khalilia et al. present a method for "predicting disease risk of individuals based on their medical diagnosis history". The researchers used RF and SVM for predictive modeling on the Nationwide Inpatient Sample (NIS), a database of hospital inpatient admissions that holds 8 million records with information on the disease or illness. RF also helped to fill in the gaps in the data. The researchers achieved a result of 89.05% in detecting a disease or illness. [7]

- K-Nearest Neighbors (KNN) - The k-Nearest Neighbors (KNN) classification and regression algorithms belong to the family of memory-based or instance-based learning. Instead of creating a model, KNN memorises the training dataset and uses this data directly to make predictions. [8] A commonly cited application is recommendation, for example suggesting movies or TV shows on services like Netflix and Amazon based on what a user has previously watched.

- Support Vector Machines (SVM) - This is a robust machine learning algorithm employed for tasks involving linear or nonlinear classification, regression, and outlier detection. SVMs find applications in various tasks, including text classification, image classification, spam detection, handwriting identification, gene expression analysis, face detection, and anomaly detection. [9] An example of SVM use is image classification, such as identifying whether a picture contains a cat.

- Decision Trees - A decision tree is a supervised machine learning method used for categorization or prediction based on the answers to a series of questions. The model undergoes supervised learning, where it is trained and tested on a dataset containing the specified categorizations. [10] The decision tree approach could be applied in sports analytics, for example in baseball. A decision tree could be used to predict whether a batter will swing at a pitch based on factors such as the batter's historical performance or the current game situation.
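To make the two algorithms chosen for the Iris dataset concrete, here is a minimal sketch that fits Logistic Regression and a Decision Tree with scikit-learn and compares them using 5-fold cross-validation. The hyperparameters (`max_iter=1000`, `max_depth=3`, `cv=5`) and the `models` and `mean_scores` names are assumptions made for illustration, not settings taken from the sources cited above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The two classifiers used for the Iris dataset in this project.
models = {
    # max_iter is raised so the solver has room to converge.
    "Logistic Regression": LogisticRegression(max_iter=1000),
    # A shallow tree keeps the learned series of questions short.
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# 5-fold cross-validation: train on four fifths of the data, score
# accuracy on the remaining fifth, and repeat for each fold.
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

Cross-validation is used here instead of a single train/test split because the dataset has only 150 samples, so a single split can give a noisy estimate of accuracy.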



# References

- [1] Iris – UCI Machine Learning Repository. Aug. 17, 2023. url: https://archive.ics.uci.edu/dataset/53/iris (visited on 01/11/2023).

- [2] Supervised learning | Machine Learning | Google for Developers (no date) Google. Available at: https://developers.google.com/machine-learning/intro-to-ml/supervised (Accessed: 01 November 2023).

- [3] Weedmark, D. (2023) Machine learning model training: What it is and why it’s important, Domino Data Lab. Available at: https://domino.ai/blog/what-is-machine-learning-model-training (Accessed: 03 December 2023). 

- [4] Emeritus (2023) What is supervised learning and its top examples?, Emeritus Online Courses. Available at: https://emeritus.org/blog/ai-and-ml-supervised-learning/#:~:text=By%20analyzing%20patterns%20and%20relationships,how%20supervised%20learning%20is%20used. (Accessed: 03 December 2023). 

- [5] Awan, A.A. and Navlani, A. (2023) Naive Bayes classifier tutorial: With Python Scikit-Learn, DataCamp. Available at: https://www.datacamp.com/tutorial/naive-bayes-scikit-learn (Accessed: 03 December 2023). 

- [6] Molina, E. (2021) A practical guide to implementing a random forest classifier in Python, Medium. Available at: https://towardsdatascience.com/a-practical-guide-to-implementing-a-random-forest-classifier-in-python-979988d8a263 (Accessed: 03 December 2023). 

- [7] Khalilia, M., Chakraborty, S. and Popescu, M. (2011) Predicting disease risks from highly imbalanced data using Random Forest - BMC Medical Informatics and Decision Making, BioMed Central. Available at: https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/1472-6947-11-51 (Accessed: 03 December 2023).

- [8] k-Nearest Neighbors (KNN) (2023) IBM. Available at: https://www.ibm.com/docs/en/db2oc?topic=procedures-k-nearest-neighbors-knn (Accessed: 04 December 2023).

- [9] Support Vector Machine (SVM) algorithm (2023) GeeksforGeeks. Available at: https://www.geeksforgeeks.org/support-vector-machine-algorithm/ (Accessed: 04 December 2023). 

- [10] Decision tree (2023) CORP-MIDS1 (MDS). Available at: https://www.mastersindatascience.org/learning/machine-learning-algorithms/decision-tree/ (Accessed: 04 December 2023). 