# Iris Flower Classification using Logistic Regression



## 1. Project Title
**Iris Flower Classification using Logistic Regression**



## 2. Objective
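The primary objective of this project is to classify Iris flowers by species (Setosa, Versicolor, and Virginica) from four flower measurements using a logistic regression model. The model-building cells in this notebook work on a two-class subset (targets 0 and 1, Setosa and Versicolor), so the reported results describe a binary classifier.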



## 3. Dataset Information


- **Dataset Name:** Iris Dataset


- **Source:** The Iris dataset is a classic dataset available from the UCI Machine Learning Repository; this notebook loads the copy bundled with scikit-learn via `load_iris()`.


- **Number of Instances (Rows):** 150


- **Number of Attributes (Columns):** 5 (4 features + 1 target label)



## 4. Columns Information


- **Features:**


  1. **Sepal Length (cm):** The length of the sepal in centimeters.
  
  
  2. **Sepal Width (cm):** The width of the sepal in centimeters.
  
  
  3. **Petal Length (cm):** The length of the petal in centimeters.
  
  
  4. **Petal Width (cm):** The width of the petal in centimeters.
  
  
- **Target Variable:**


  - **Species:** The species of Iris flower, which can be one of three classes:
  
  
    - `Setosa`
    - `Versicolor`
    - `Virginica`



In [9]:
import pandas as pd 
import numpy as np 

### Data Loading

In [10]:
from sklearn.datasets import load_iris

In [19]:
dataset = load_iris()

In [20]:
dataset

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [21]:
print(dataset.DESCR)  # human-readable description of the dataset



### Analyzing the Dataset

## 5. Exploratory Data Analysis (EDA)
- **Data Distribution:** 
  - Analyzed the distribution of each feature to understand their spread and central tendencies.
- **Correlation Analysis:** 
  - Identified relationships between features using correlation matrices and visualizations like pair plots.
- **Class Distribution:** 
  - Verified that the dataset is balanced across the three species classes.
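
The checks summarized above can be sketched as follows. This is a minimal illustration assuming the scikit-learn copy of the dataset and pandas; the names `iris_df`, `summary`, `correlations`, and `class_counts` are illustrative and do not come from the notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["target"] = iris.target

# Spread and central tendency of each feature
summary = iris_df[iris.feature_names].describe()
print(summary.loc[["mean", "std", "min", "max"]])

# Pairwise Pearson correlations between the four features
correlations = iris_df[iris.feature_names].corr()
print(correlations.round(2))

# Number of samples per class (balanced if all counts are equal)
class_counts = iris_df["target"].value_counts().sort_index()
print(class_counts)
```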



In [23]:
df = pd.DataFrame(dataset.data,columns=dataset.feature_names)

In [24]:
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [26]:
df['target'] = dataset.target

In [27]:
df

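| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |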


## 6. Data Preprocessing
- **Handling Missing Values:** 
  - Checked for any missing values and found none.
- **Feature Scaling:** 
  - Standardization (Z-score normalization) brings all features onto a similar scale. The cells below fit the model on the raw measurements, which are all in centimeters and of similar magnitude.
- **Splitting the Dataset:** 
  - Divided the dataset into training (80%) and testing (20%) sets to evaluate the model's performance.
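
The three steps listed above can be sketched as follows. This is an illustration assuming scikit-learn's `StandardScaler`, not a record of the cells in this notebook; the variable names, `stratify=labels`, and `random_state=0` are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)
labels = pd.Series(iris.target, name="target")

# 1. Missing values: count NaNs across the whole feature table
missing_total = int(features.isna().sum().sum())

# 2. Split 80% / 20%, keeping class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0
)

# 3. Standardize: learn mean and std on the training part only,
#    then apply the same transformation to the test part
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(missing_total, X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training part alone keeps information about the test samples out of the preprocessing step.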



In [28]:
# Keep only classes 0 (setosa) and 1 (versicolor) so the task becomes binary
dff = df[df['target'] != 2]

In [29]:
dff

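| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 95 | 5.7 | 3.0 | 4.2 | 1.2 | 1 |
| 96 | 5.7 | 2.9 | 4.2 | 1.3 | 1 |
| 97 | 6.2 | 2.9 | 4.3 | 1.3 | 1 |
| 98 | 5.1 | 2.5 | 3.0 | 1.1 | 1 |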


In [30]:
x = dff.iloc[:, :-1]  # feature columns (everything except the last column)
y = dff.iloc[:, -1]   # target column

In [31]:
x

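| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 95 | 5.7 | 3.0 | 4.2 | 1.2 |
| 96 | 5.7 | 2.9 | 4.2 | 1.3 |
| 97 | 6.2 | 2.9 | 4.3 | 1.3 |
| 98 | 5.1 | 2.5 | 3.0 | 1.1 |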


In [32]:
y

0     0
1     0
2     0
3     0
4     0
     ..
95    1
96    1
97    1
98    1
99    1
Name: target, Length: 100, dtype: int32

## 7. Model Building
- **Model Used:** Logistic Regression
- **Reason for Choosing Logistic Regression:**
  - Logistic regression is suitable for classification problems and works well with linear decision boundaries.
- **Training the Model:**
  - The logistic regression model was trained using the training dataset.
- **Hyperparameter Tuning:**
  - Used cross-validation to fine-tune the model's hyperparameters, such as the regularization parameter.
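
For a binary problem, logistic regression passes a linear score `w·x + b` through the sigmoid function and predicts class 1 when the resulting probability exceeds 0.5, which is why its decision boundary is linear. The sketch below illustrates this in plain Python; `weights` and `bias` are hypothetical values chosen for illustration, not the coefficients of the fitted model.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """Probability of class 1 for one sample under a logistic model."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(score)

# Hypothetical parameters for
# [sepal length, sepal width, petal length, petal width]
weights = [0.4, -0.9, 2.2, 0.9]
bias = -6.0

setosa_like = [5.1, 3.5, 1.4, 0.2]
versicolor_like = [5.7, 3.0, 4.2, 1.2]

for sample in (setosa_like, versicolor_like):
    p = predict_proba(sample, weights, bias)
    print(sample, round(p, 3), "class", int(p > 0.5))
```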

In [34]:
from sklearn.linear_model import LogisticRegression

In [35]:
classifier = LogisticRegression()

In [36]:
from sklearn.model_selection import train_test_split

In [37]:
x_train , x_test , y_train , y_test = train_test_split(x,y,test_size=.2,random_state=20)

In [38]:
x_train.shape , x_test.shape , y_train.shape , y_test.shape

((80, 4), (20, 4), (80,), (20,))

In [39]:
classifier.fit(x_train,y_train)

LogisticRegression()

In [40]:
classifier.predict_proba(x_test)

array([[0.00336393, 0.99663607],
       [0.0099712 , 0.9900288 ],
       [0.98289688, 0.01710312],
       [0.94566469, 0.05433531],
       [0.004571  , 0.995429  ],
       [0.97705682, 0.02294318],
       [0.9761887 , 0.0238113 ],
       [0.9808763 , 0.0191237 ],
       [0.9801104 , 0.0198896 ],
       [0.97763807, 0.02236193],
       [0.00233787, 0.99766213],
       [0.12157695, 0.87842305],
       [0.02380738, 0.97619262],
       [0.02145096, 0.97854904],
       [0.96248577, 0.03751423],
       [0.00140634, 0.99859366],
       [0.94229869, 0.05770131],
       [0.00433177, 0.99566823],
       [0.00512825, 0.99487175],
       [0.98239461, 0.01760539]])

In [41]:
y_pred = classifier.predict(x_test)

In [42]:
y_pred

array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0])

In [43]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

In [44]:
accuracy_score(y_test, y_pred)  # scikit-learn expects (y_true, y_pred)

1.0

In [45]:
confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

array([[10,  0],
       [ 0, 10]], dtype=int64)

In [46]:
print(classification_report(y_test, y_pred))  # (y_true, y_pred); print renders the table

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        10

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20

In [67]:
from sklearn.model_selection import GridSearchCV

In [80]:
from warnings import filterwarnings
# Silences all warnings, including those raised when a grid-search candidate fails to fit
filterwarnings('ignore')

In [81]:
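# Full cross-product of penalties and solvers. Not every pair is supported
# (for example, 'lbfgs' does not accept 'l1'); unsupported candidates fail to
# fit and are scored as NaN under GridSearchCV's default error_score.
parameters = {'penalty':('l1','l2','elasticnet',None),'C':[1,10,20],
             'solver':('lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga')}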

In [82]:
clf = GridSearchCV(classifier , param_grid=parameters , cv=5)

In [83]:
clf.fit(x_train,y_train)

GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid={'C': [1, 10, 20],
                         'penalty': ('l1', 'l2', 'elasticnet', None),
                         'solver': ('lbfgs', 'liblinear', 'newton-cg',
                                    'newton-cholesky', 'sag', 'saga')})

In [84]:
clf.best_params_

{'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}
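
Several penalty/solver pairs in the grid above are not supported by `LogisticRegression` (for instance, `lbfgs` accepts only an L2 penalty or none), so those candidates fail during the search. One way to avoid this, sketched here under the assumption that only `liblinear` and `lbfgs` are of interest, is to pass `GridSearchCV` a list of grids that each contain only valid pairs. The data preparation mirrors the cells above; `valid_grid`, `search`, and `max_iter=1000` are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data["target"] = iris.target
binary = data[data["target"] != 2]          # setosa vs. versicolor
X, y = binary.iloc[:, :-1], binary.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=20
)

# Each dictionary is expanded separately, so only valid pairs are tried
valid_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [1, 10, 20]},
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [1, 10, 20]},
]

search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid=valid_grid, cv=5
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
```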

In [88]:
classifier = LogisticRegression(C= 1, penalty = 'l1', solver = 'liblinear')

In [89]:
classifier.fit(x_train,y_train)

LogisticRegression(C=1, penalty='l1', solver='liblinear')

In [90]:
classifier.predict_proba(x_test)

array([[0.00554382, 0.99445618],
       [0.01025365, 0.98974635],
       [0.98422045, 0.01577955],
       [0.98127853, 0.01872147],
       [0.0091412 , 0.9908588 ],
       [0.99173365, 0.00826635],
       [0.96745465, 0.03254535],
       [0.99219569, 0.00780431],
       [0.98983797, 0.01016203],
       [0.98376382, 0.01623618],
       [0.00215246, 0.99784754],
       [0.03454223, 0.96545777],
       [0.02188722, 0.97811278],
       [0.01117498, 0.98882502],
       [0.99124449, 0.00875551],
       [0.00203145, 0.99796855],
       [0.96460197, 0.03539803],
       [0.00355705, 0.99644295],
       [0.00190685, 0.99809315],
       [0.97487377, 0.02512623]])

In [92]:
y_pred = classifier.predict(x_test)

In [93]:
accuracy_score(y_test, y_pred)  # scikit-learn expects (y_true, y_pred)

1.0

In [94]:
print(classification_report(y_test, y_pred))  # (y_true, y_pred); print renders the table

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        10

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20

In [95]:
confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

array([[10,  0],
       [ 0, 10]], dtype=int64)

In [96]:
from sklearn.model_selection import RandomizedSearchCV

In [100]:
# No random_state is set, so the sampled candidates (and best_params_) can differ between runs
random_clf = RandomizedSearchCV(LogisticRegression(), param_distributions=parameters, n_iter=10)

In [101]:
random_clf.fit(x_train,y_train)

RandomizedSearchCV(estimator=LogisticRegression(),
                   param_distributions={'C': [1, 10, 20],
                                        'penalty': ('l1', 'l2', 'elasticnet',
                                                    None),
                                        'solver': ('lbfgs', 'liblinear',
                                                   'newton-cg',
                                                   'newton-cholesky', 'sag',
                                                   'saga')})

In [102]:
random_clf.best_params_

{'solver': 'liblinear', 'penalty': 'l2', 'C': 10}
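
Randomized search is most useful when a hyperparameter is drawn from a continuous distribution rather than a short list. The sketch below samples `C` log-uniformly with `scipy.stats.loguniform` and fixes `random_state` so the result is repeatable. The range `1e-2` to `1e2`, `max_iter=1000`, and the variable names are assumptions made for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)
mask = y != 2                      # keep setosa and versicolor only
X, y = X[mask], y[mask]

# C is sampled from a continuous log-uniform distribution
param_distributions = {"C": loguniform(1e-2, 1e2)}

random_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    random_state=0,
)
random_search.fit(X, y)

print(random_search.best_params_, round(random_search.best_score_, 3))
```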

In [106]:
# Predict one sample: [sepal length, sepal width, petal length, petal width] in cm
classifier.predict([[5.7, 3, 4.2, 0]])

array([1])

In [1]:
import numpy as np 
import pandas as pd 

In [2]:
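# Second example: the same workflow on a diabetes dataset (binary Outcome column).
# The path below is specific to the author's machine.
df = pd.read_csv("C:\\Users\\Khan Mokhit\\Downloads\\diabetes.csv")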

In [3]:
df.head()

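| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |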


In [4]:
df.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [5]:
df.Outcome.value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [30]:
from warnings import filterwarnings 
filterwarnings("ignore")

In [8]:
x = df.iloc[:,:-1]
y = df.iloc[:,-1]

In [9]:
x

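| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 |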


In [10]:
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

In [12]:
x_train , x_test , y_train , y_test = train_test_split(x,y , test_size=0.3 , random_state = 30 )

In [13]:
x_train.shape , x_test.shape , y_train.shape , y_test.shape

((537, 8), (231, 8), (537,), (231,))

In [14]:
classifierr = LogisticRegression()

In [31]:
classifierr.fit(x_train,y_train)

LogisticRegression()

In [16]:
classifierr.predict_proba(x_test)

array([[0.88044069, 0.11955931],
       [0.85419425, 0.14580575],
       [0.85230025, 0.14769975],
       [0.36935683, 0.63064317],
       [0.83768642, 0.16231358],
       [0.94785695, 0.05214305],
       [0.29015761, 0.70984239],
       [0.72310716, 0.27689284],
       [0.65662872, 0.34337128],
       [0.95172102, 0.04827898],
       [0.88942901, 0.11057099],
       [0.81638859, 0.18361141],
       [0.87074912, 0.12925088],
       [0.79417153, 0.20582847],
       [0.91056087, 0.08943913],
       [0.91359943, 0.08640057],
       [0.76691669, 0.23308331],
       [0.75795029, 0.24204971],
       [0.19503911, 0.80496089],
       [0.85355934, 0.14644066],
       [0.52950856, 0.47049144],
       [0.12726388, 0.87273612],
       [0.47262392, 0.52737608],
       [0.63092233, 0.36907767],
       [0.02986788, 0.97013212],
       [0.56463117, 0.43536883],
       [0.90945424, 0.09054576],
       [0.95198234, 0.04801766],
       [0.95085448, 0.04914552],
       [0.61032721, 0.38967279],
       [0.

In [17]:
y_pred = classifierr.predict(x_test)

In [18]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report

In [19]:
confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

array([[139,  20],
       [ 27,  45]], dtype=int64)

In [20]:
accuracy_score(y_test, y_pred)  # scikit-learn expects (y_true, y_pred)

0.7965367965367965

In [23]:
print(classification_report(y_test, y_pred))  # (y_true, y_pred); print renders the table

              precision    recall  f1-score   support

           0       0.84      0.87      0.86       159
           1       0.69      0.62      0.66        72

    accuracy                           0.80       231
   macro avg       0.76      0.75      0.76       231
weighted avg       0.79      0.80      0.79       231

In [24]:
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV

In [27]:
parameters = {'penalty':('l1','l2','elasticnet',None),'C':[1,10,20],
             'solver':('lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga')}

In [28]:
clf = GridSearchCV(LogisticRegression(),param_grid=parameters , cv=5)

In [32]:
clf.fit(x_train,y_train)

GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid={'C': [1, 10, 20],
                         'penalty': ('l1', 'l2', 'elasticnet', None),
                         'solver': ('lbfgs', 'liblinear', 'newton-cg',
                                    'newton-cholesky', 'sag', 'saga')})

In [33]:
clf.best_params_

{'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}

In [39]:
classifierr = LogisticRegression(penalty='l1',solver='liblinear',C=1)

In [40]:
classifierr.fit(x_train,y_train)

LogisticRegression(C=1, penalty='l1', solver='liblinear')

In [43]:
y_pred = classifierr.predict(x_test)

In [44]:
confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

array([[141,  18],
       [ 27,  45]], dtype=int64)

In [45]:
print(classification_report(y_test, y_pred))  # (y_true, y_pred); print renders the table

              precision    recall  f1-score   support

           0       0.84      0.89      0.86       159
           1       0.71      0.62      0.67        72

    accuracy                           0.81       231
   macro avg       0.78      0.76      0.76       231
weighted avg       0.80      0.81      0.80       231

In [46]:
accuracy_score(y_test, y_pred)  # scikit-learn expects (y_true, y_pred)

0.8051948051948052

In [48]:
random_clf = RandomizedSearchCV(LogisticRegression(),param_distributions=parameters,cv=5)

In [49]:
random_clf.fit(x_train,y_train)

RandomizedSearchCV(cv=5, estimator=LogisticRegression(),
                   param_distributions={'C': [1, 10, 20],
                                        'penalty': ('l1', 'l2', 'elasticnet',
                                                    None),
                                        'solver': ('lbfgs', 'liblinear',
                                                   'newton-cg',
                                                   'newton-cholesky', 'sag',
                                                   'saga')})

In [50]:
random_clf.best_params_

{'solver': 'liblinear', 'penalty': 'l1', 'C': 1}

In [51]:
y_pred = random_clf.predict(x_test)

In [52]:
confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

array([[141,  18],
       [ 27,  45]], dtype=int64)

In [53]:
print(classification_report(y_test, y_pred))  # (y_true, y_pred); print renders the table

              precision    recall  f1-score   support

           0       0.84      0.89      0.86       159
           1       0.71      0.62      0.67        72

    accuracy                           0.81       231
   macro avg       0.78      0.76      0.76       231
weighted avg       0.80      0.81      0.80       231

In [54]:
accuracy_score(y_test, y_pred)  # scikit-learn expects (y_true, y_pred)

0.8051948051948052



## 8. Model Evaluation
- **Metrics Used:**
  - **Accuracy:** The percentage of correct predictions.
  - **Confusion Matrix:** To visualize the performance and errors across classes.
  - **Precision, Recall, and F1-Score:** To measure the model's effectiveness in classifying each species.
- **Results:**
  - On the Iris test set (20 samples, Setosa vs. Versicolor), the model achieved an accuracy of 100%, both before and after hyperparameter tuning.
  - Precision, recall, and F1-score were 1.00 for both classes in that test set.
  - On the diabetes example (231 test samples), accuracy was about 80%: 0.797 with default settings and 0.805 after grid search.
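
To make the metric definitions above concrete, the sketch below computes them by hand from the counts of the first diabetes model reported earlier: 139 and 45 correct predictions for classes 0 and 1, 27 class-1 samples predicted as 0, and 20 class-0 samples predicted as 1. Class 1 is treated as the positive class, and the function name `binary_metrics` is illustrative.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of the predicted positives, how many are right
    recall = tp / (tp + fn)      # of the actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

accuracy, precision, recall, f1 = binary_metrics(tp=45, fp=20, fn=27, tn=139)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

The accuracy computed this way is 184/231, which matches the 0.7965 returned by `accuracy_score` for that model.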

## 9. Conclusion
The logistic regression model separated Setosa from Versicolor with 100% accuracy on the held-out test set. Petal length and petal width are generally the most discriminative Iris features, although the outputs recorded in this notebook do not include the fitted coefficients to confirm their contribution here.

The work can be extended by covering all three species, adding feature engineering, or comparing against other algorithms such as Support Vector Machines or Decision Trees.
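
One way to check which features drive the predictions is to look at the fitted coefficients. The sketch below refits a default `LogisticRegression` on the same two-class split used earlier and orders the coefficients by absolute size. Because the features are not standardized, the magnitudes are only a rough guide, and the variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data["target"] = iris.target
binary = data[data["target"] != 2]          # setosa vs. versicolor
X, y = binary.iloc[:, :-1], binary.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=20
)

model = LogisticRegression().fit(X_train, y_train)

# coef_ has shape (1, n_features) for a binary problem
coefficients = pd.Series(model.coef_[0], index=X.columns)
order = coefficients.abs().sort_values(ascending=False).index
ranked = coefficients.reindex(order)
print(ranked)
```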

## 10. Future Work
- **Improvement Suggestions:**
  - Consider using other classification algorithms such as SVM or Random Forest and compare their performance with logistic regression.
  - Perform feature engineering to create new features that might improve the model's accuracy.
- **Model Deployment:**
  - Deploy the model using Flask or another web framework to create a simple web application where users can input flower measurements and get the predicted species.
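
A comparison of the kind suggested above could start from cross-validated accuracy. The sketch below evaluates logistic regression, an SVM, and a random forest on all three Iris classes with 5-fold cross-validation. The model settings are scikit-learn defaults apart from `max_iter`, `n_estimators`, and `random_state`, which are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "svm (rbf kernel)": SVC(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

mean_scores = {}
for name, model in models.items():
    # cross_val_score uses accuracy by default for classifiers
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```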

