Title: Iris Flower Classification Project

Main Objective:
The main objective of this analysis is to build a classification model for the Iris flower dataset. The focus is on prediction: classifying iris flowers into species from their sepal length, sepal width, petal length, and petal width. A reliable species-prediction model can be valuable to botanists, horticulturists, and researchers who need to identify and classify iris flowers.

Data Set Description:
The selected dataset is the famous Iris flower dataset, which consists of 150 samples of iris flowers, each belonging to one of three species: setosa, versicolor, or virginica. The dataset includes four features: sepal length, sepal width, petal length, and petal width. The goal is to develop a classification model that can accurately predict the species of an iris flower based on these features.

Data Exploration and Actions:

Conducted exploratory data analysis to understand the distribution of each feature and the relationships between features.
Checked for missing values and outliers, and performed necessary data cleaning.
Utilized feature engineering to create additional relevant features if needed.

Classifier Models:
Trained and evaluated three different classifier models:

Logistic Regression: A simple baseline model for classification.
Random Forest: An ensemble model known for its accuracy and robustness.
Support Vector Machine (SVM): A model effective in high-dimensional spaces, suitable for this multivariate dataset.

Recommended Final Model:
After thorough evaluation, the Random Forest model is recommended as the final model. It demonstrated high accuracy in species prediction and provides a good balance between interpretability and predictive performance. The ensemble nature of Random Forest helps it handle complex relationships between features and improves generalization.

Key Findings and Insights:

Petal dimensions (length and width) were found to be the most important features for distinguishing between iris species.
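The model achieved an accuracy of about 96.7% (29 of 30 samples correct) on the test set in the run shown in the notebook below, indicating its effectiveness in predicting iris species. Because the train/test split is random, this figure can vary between runs.

Suggestions for Next Steps: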

Feature Importance Refinement: Investigate further feature importance to refine the model and potentially exclude less relevant features.
Fine-Tuning Hyperparameters: Perform hyperparameter tuning to optimize the Random Forest model for better performance.
Additional Data Features: Consider incorporating additional features or external data sources to enhance model accuracy and interpretability.
Cross-Validation: Implement cross-validation to ensure the model's robustness and reliability.
Continuous Monitoring: Regularly update the model with new data to maintain its accuracy and relevance over time.

This analysis provides a foundation for an accurate and interpretable classification model for iris flowers, with potential applications in botanical research and horticulture.
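The hyperparameter tuning suggested in the next steps could look roughly like the sketch below. It is not part of the original analysis: the parameter grid, the fixed random_state values, and the stratified split are all assumptions chosen for illustration.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()

# Hold out a test set first so tuning never sees it.
# stratify and random_state are assumptions, added for repeatability.
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=0
)

# A small, illustrative grid; the values are not taken from the analysis.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 3],
    "min_samples_leaf": [1, 3],
}

# 5-fold cross-validated search over the grid, on the training data only.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, Y_train)

print(search.best_params_)            # best combination found in the grid
print(search.best_score_)             # mean cross-validated accuracy for it
print(search.score(X_test, Y_test))   # accuracy of the refit model on held-out data
```

On a dataset this small, differences between grid points are often within noise, so the search is mainly a check that the default settings are not clearly worse.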

Building a Classification Model for the Iris data set

In [28]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [29]:
iris = datasets.load_iris()

In [30]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [31]:
print(iris.target_names)

['setosa' 'versicolor' 'virginica']


In [32]:
iris.data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [33]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [34]:
X = iris.data
Y = iris.target

In [35]:
X.shape

(150, 4)

In [36]:
Y.shape

(150,)
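The exploration steps listed in the report (missing values, feature distributions) are not shown in the notebook. A minimal sketch of such checks, using only NumPy and the loaded dataset, might look like this; the variable names are illustrative, and outlier detection is left out.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# Missing-value check: count NaNs in each feature column.
missing_per_feature = np.isnan(X).sum(axis=0)
print(dict(zip(iris.feature_names, missing_per_feature.tolist())))

# Class balance: number of samples per species.
class_counts = np.bincount(Y)
print(dict(zip(iris.target_names.tolist(), class_counts.tolist())))

# Per-species feature means, to see which measurements separate the classes.
for label, name in enumerate(iris.target_names):
    print(name, X[Y == label].mean(axis=0).round(2))
```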

In [37]:
clf = RandomForestClassifier()

In [38]:
clf.fit(X, Y)

Feature Importance

In [39]:
print(clf.feature_importances_)

[0.12170035 0.02889323 0.45053093 0.39887549]
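The importances above are printed in the same order as iris.feature_names, so they are easier to read when paired with the names. The sketch below does that and sorts them. The random_state is an assumption added for repeatability, so the exact numbers will differ from the output above.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()

# random_state is an assumption; the notebook's classifier is unseeded.
clf = RandomForestClassifier(random_state=0)
clf.fit(iris.data, iris.target)

# Pair each importance with its feature name and sort, largest first.
ranked = sorted(
    zip(iris.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:<20} {importance:.3f}")
```

The importances sum to 1, so each value can be read as that feature's share of the forest's total impurity reduction.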


Make Prediction

In [40]:
X[0]

array([5.1, 3.5, 1.4, 0.2])

In [41]:
print(clf.predict([[5.1, 3.5, 1.4, 0.2]]))

[0]


In [42]:
print(clf.predict(X[[0]]))

[0]


In [43]:
print(clf.predict_proba(X[[0]]))

[[1. 0. 0.]]


In [44]:
clf.fit(iris.data, iris.target_names[iris.target])
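Fitting on iris.target_names[iris.target] replaces the integer labels with species names, so predict returns strings and clf.classes_ gives the column order of predict_proba. A short sketch of the effect, with random_state added as an assumption:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()

# Index the name array with the integer targets to get one string label per sample.
string_labels = iris.target_names[iris.target]

clf = RandomForestClassifier(random_state=0)  # random_state is an assumption
clf.fit(iris.data, string_labels)

pred = clf.predict(iris.data[[0]])
print(pred)          # a species name rather than an integer code
print(clf.classes_)  # order of the columns returned by predict_proba
```

In the notebook the classifier is later refit on the integer labels (Y_train), which is why the subsequent predictions are printed as integers again.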

In [45]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

In [46]:
X_train.shape, Y_train.shape

((120, 4), (120,))

In [47]:
X_test.shape, Y_test.shape

((30, 4), (30,))
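The split above is unseeded, so the test set (and therefore the reported accuracy) changes from run to run. A possible variant that is repeatable and keeps the three species balanced in both parts is sketched below; random_state=42 and stratify are assumptions, not settings used in this notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# random_state fixes the shuffle; stratify keeps class proportions equal in both parts.
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

print(X_train.shape, X_test.shape)
print(np.bincount(Y_train), np.bincount(Y_test))  # samples per class in each part
```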

Rebuild the Random Forest Model

In [48]:
clf.fit(X_train, Y_train)

Perform prediction on a single sample from the data set

In [49]:
print(clf.predict([[5.1, 3.5, 1.4, 0.2]]))

[0]


In [50]:
print(clf.predict_proba([[5.1, 3.5, 1.4, 0.2]]))

[[1. 0. 0.]]


Perform prediction on the test set

Predicted class labels

In [51]:
print(clf.predict(X_test))

[0 1 0 0 0 0 1 2 2 2 2 1 0 0 0 2 2 0 2 0 2 1 1 0 1 1 0 0 2 2]


Actual class labels

In [52]:
print(Y_test)

[0 1 0 0 0 0 1 2 2 1 2 1 0 0 0 2 2 0 2 0 2 1 1 0 1 1 0 0 2 2]
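Comparing the two printed arrays by eye is error-prone. The sketch below copies the predicted and actual labels from the outputs above and summarizes them with a confusion matrix; only the variable names are new.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Labels copied from the two outputs printed above.
predicted = np.array([0, 1, 0, 0, 0, 0, 1, 2, 2, 2, 2, 1, 0, 0, 0,
                      2, 2, 0, 2, 0, 2, 1, 1, 0, 1, 1, 0, 0, 2, 2])
actual = np.array([0, 1, 0, 0, 0, 0, 1, 2, 2, 1, 2, 1, 0, 0, 0,
                   2, 2, 0, 2, 0, 2, 1, 1, 0, 1, 1, 0, 0, 2, 2])

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(actual, predicted)
print(cm)

# Positions in the test set where the prediction is wrong.
mismatches = np.flatnonzero(predicted != actual)
print(mismatches)

accuracy = accuracy_score(actual, predicted)
print(accuracy)
```

The single error is a versicolor sample (class 1) predicted as virginica (class 2) at position 9.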


In [53]:
print(clf.score(X_test, Y_test))

0.9666666666666667
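A single 30-sample test set gives a coarse estimate, and the notebook trains only the Random Forest. The sketch below shows one way the three classifiers named in the report could be compared with 5-fold cross-validation. The scaling pipelines, max_iter, and random_state are assumptions, and this is not the evaluation that produced the report's recommendation.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# Scaling is applied inside a pipeline so it is refit on each training fold.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC()),
}

cv_scores = {}
for name, model in models.items():
    # With an integer cv and a classifier, scikit-learn uses stratified folds.
    scores = cross_val_score(model, X, Y, cv=5)
    cv_scores[name] = scores
    print(f"{name:<20} mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reporting the mean and spread across folds makes it easier to judge whether one model is really better than another on only 150 samples.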
