In [1]:
from sklearn.datasets import load_iris

dataset = load_iris()
print(dataset.DESCR)  # DESCR is an attribute holding a plain-text description of the dataset

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [3]:
dataset

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [5]:
dataset.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [7]:
import pandas as pd
import numpy as np

df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [9]:
# Add the class labels as a 'target' column
# (X and y are split out later, once the data is reduced to two classes)

df['target'] = dataset.target

In [13]:
# Binary classification: keep only classes 0 and 1 (drop class 2)

df_copy = df[df['target'] != 2]
df_copy.head(5)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


### Step-by-Step Breakdown of df[df['target'] != 2]:

1. *Accessing the Column*:
   - df['target'] retrieves the column named 'target' from the DataFrame df.
   - This column is a *Pandas Series*, which is essentially a one-dimensional array-like structure. 

---

2. *Applying the Condition*:
   - df['target'] != 2 applies the condition *"not equal to 2"* to each element in the 'target' column.
   - This results in a *Boolean Series*, where:
     - True indicates that the condition (target != 2) is satisfied.
     - False indicates that the condition is not satisfied.

---

3. *Boolean Indexing*:
   - The *Boolean Series* (True or False) is then used to filter rows in the DataFrame df.
   - Each True value in the Boolean Series indicates that the corresponding row in the DataFrame should be included in the result.
   - Each False value indicates that the corresponding row should be excluded.

   Internally, this step works by masking the DataFrame rows using the Boolean Series.  

---

### Key Mechanisms Behind the Scenes:

1. *Element-wise Comparison*:
   - The condition df['target'] != 2 is applied element-wise to the Series. This is optimized by Pandas using NumPy for fast array operations.

2. *Boolean Masking*:
   - The resulting Boolean Series acts as a "mask" that tells Pandas which rows to keep or drop.
   - Rows corresponding to True are retained, and those with False are removed.

3. *Efficiency*:
   - Pandas uses optimized low-level C and NumPy functions for these operations, making them very efficient even for large datasets.

---
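As a small illustration of the element-wise comparison and masking described above, the sketch below uses a toy DataFrame (the values are made up for illustration and are not taken from the notebook's data). It checks that the pandas mask is the same as the one NumPy produces on the underlying array.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({"feature": [5.1, 7.0, 6.3, 6.4, 4.9],
                    "target": [0, 1, 2, 1, 0]})

mask = toy["target"] != 2                     # Boolean Series, one entry per row
numpy_mask = toy["target"].to_numpy() != 2    # same comparison on the raw NumPy array

print(mask.dtype)                                    # bool
print(np.array_equal(mask.to_numpy(), numpy_mask))   # True

filtered = toy[mask]                 # keeps only the rows where the mask is True
print(filtered.index.tolist())       # [0, 1, 3, 4]
```

The original row labels are preserved in the result, which is why index 2 is simply missing rather than renumbered.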

### Visualization of the Process:

For a hypothetical five-row target column:

| *Index* | *Target Value* | *Condition (target != 2)* | *Row Included?* |
|-----------|------------------|------------------------------|--------------------|
| 0         | 0                | True                         | Yes                |
| 1         | 1                | True                         | Yes                |
| 2         | 2                | False                        | No                 |
| 3         | 1                | True                         | Yes                |
| 4         | 0                | True                         | Yes                |

---
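The table above can be reproduced with a few lines of pandas. This is only a sketch on the same hypothetical target values; the column name `condition` and the variable names are chosen here for illustration.

```python
import pandas as pd

# Same hypothetical target values as in the table above
toy = pd.DataFrame({"target": [0, 1, 2, 1, 0]})

# Rebuild the table: one row per index, plus the value of the condition
table = toy.assign(condition=toy["target"] != 2)
print(table)

# Three equivalent ways to keep the rows where target != 2
a = toy[toy["target"] != 2]
b = toy.loc[toy["target"] != 2]
c = toy.query("target != 2")
print(a.equals(b) and b.equals(c))   # True

# .copy() makes the subset an independent DataFrame, so adding a column
# to it leaves the original untouched
subset = a.copy()
subset["is_class_1"] = subset["target"] == 1
```

Taking an explicit copy is a common choice when the filtered frame will be modified afterwards.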

In [16]:
# Independent and Dependent 

# iloc[rows_selection, columns_selection]
X = df_copy.iloc[:,:-1] # selects all rows, selects all columns except the last one
y = df_copy.iloc[:,-1] # selects all rows, selects just the last column

In [18]:
# Train Test Split

from sklearn.model_selection import train_test_split
# An integer test_size is a row count: 20 of the 100 rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=20, random_state=42)

In [20]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train) # trains (fits) the logistic model on the training data

In [22]:
# Predicted class probabilities

classifier.predict_proba(X_test) # Gives probability estimates for each class

array([[0.00118085, 0.99881915],
       [0.01580857, 0.98419143],
       [0.00303433, 0.99696567],
       [0.96964813, 0.03035187],
       [0.94251523, 0.05748477],
       [0.97160984, 0.02839016],
       [0.99355615, 0.00644385],
       [0.03169836, 0.96830164],
       [0.97459743, 0.02540257],
       [0.97892756, 0.02107244],
       [0.95512297, 0.04487703],
       [0.9607199 , 0.0392801 ],
       [0.00429472, 0.99570528],
       [0.9858324 , 0.0141676 ],
       [0.00924893, 0.99075107],
       [0.98144334, 0.01855666],
       [0.00208036, 0.99791964],
       [0.00125422, 0.99874578],
       [0.97463766, 0.02536234],
       [0.96123726, 0.03876274]])

In [24]:
# Prediction

y_pred = classifier.predict(X_test)
y_pred

array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0])

In [26]:
# Confusion matrix, accuracy score, classification report

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
# scikit-learn metrics take the true labels first, then the predictions
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[12  0]
 [ 0  8]]
1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       1.00      1.00      1.00         8

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20



In [43]:
# Hyperparameter Tuning

### *Confusion Matrix, Accuracy Score, Classification Report*

These metrics evaluate the performance of a classification model. 

#### *1. Confusion Matrix*
- **confusion_matrix()**:
  - Computes a table showing *true positives (TP)*, *true negatives (TN)*, *false positives (FP)*, and *false negatives (FN)*.
  - Input (scikit-learn expects the true labels first):
    - **y_test**: Actual true labels.
    - **y_pred**: Predicted labels from the model.
  - Output: A 2D array with one row per actual class and one column per predicted class.

##### Example (binary classification):
|                | Predicted Positive | Predicted Negative |
|----------------|--------------------|--------------------|
| *Actual Positive* | TP                 | FN                 |
| *Actual Negative* | FP                 | TN                 |

scikit-learn orders rows and columns by sorted label, so for labels 0 and 1 (with 1 as the positive class) the returned array is [[TN, FP], [FN, TP]].

##### Behind the Scenes:
- Compares each value of y_pred with y_test and counts occurrences in each of the four categories (TP, TN, FP, FN).

---
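To make the four categories concrete, here is a minimal sketch that counts them by hand and compares the result with confusion_matrix(). The labels are hypothetical, and class 1 is assumed to be the positive class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 1])

# Count the four categories by hand, treating class 1 as "positive"
tp = int(np.sum((y_true == 1) & (y_hat == 1)))
tn = int(np.sum((y_true == 0) & (y_hat == 0)))
fp = int(np.sum((y_true == 0) & (y_hat == 1)))
fn = int(np.sum((y_true == 1) & (y_hat == 0)))

# True labels go first; rows are actual classes, columns are predicted
# classes, both in sorted label order: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_hat)
print(cm)
print((tn, fp, fn, tp) == tuple(cm.ravel()))   # True
```

Swapping the two arguments transposes the matrix, which goes unnoticed when the predictions are perfect but changes FP and FN otherwise.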

#### *2. Accuracy Score*
from sklearn.metrics import accuracy_score<br>
print(accuracy_score(y_test, y_pred))


- **accuracy_score()**:
  - Calculates the proportion of correct predictions:
    Accuracy = (TP + TN) / (TP + TN + FP + FN)
  - Input:
    - **y_test**: True labels.
    - **y_pred**: Predicted labels.
  - Output: A single numeric value (0 to 1).

##### Behind the Scenes:
- Compares y_pred and y_test element-wise and counts the matching values.
- Divides the total number of matches by the total number of predictions.

---
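The accuracy formula can be checked on a handful of labels. The sketch below uses hypothetical labels and compares a manual calculation with accuracy_score().

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 1])

# Fraction of positions where prediction and true label agree (5 of 8 here)
manual = float(np.mean(y_true == y_hat))
library = accuracy_score(y_true, y_hat)

print(manual)    # 0.625
print(library)   # 0.625
```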

#### *3. Classification Report*
from sklearn.metrics import classification_report<br>
print(classification_report(y_test, y_pred))


- **classification_report()**:
  - Provides detailed performance metrics for each class:
    - *Precision*: Proportion of positive predictions that are actually positive.
      Precision = TP / (TP + FP)
    - *Recall (Sensitivity)*: Proportion of actual positives correctly identified.
      Recall = TP / (TP + FN)
    - *F1-Score*: Harmonic mean of precision and recall.
      F1-Score = 2 * [(Precision * Recall) / (Precision + Recall)]
    - *Support*: Number of occurrences of each class in y_test.

##### Behind the Scenes:
- Iterates through each class in y_test and computes metrics using the confusion matrix.

---
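The precision, recall, and F1 formulas above can be verified against scikit-learn's per-metric functions. This is a sketch on hypothetical labels, with class 1 assumed to be the positive class.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 1]

# Counts for the positive class (1)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(precision, recall, round(f1, 4))
print(precision_score(y_true, y_hat),
      recall_score(y_true, y_hat),
      round(f1_score(y_true, y_hat), 4))
```

classification_report() reports these same quantities once per class, along with the support.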

### *GridSearchCV for Hyperparameter Tuning*

#### *1. Importing and Setting Up GridSearchCV*
from sklearn.model_selection import GridSearchCV<br>
parameters = {
    'penalty': ('l1', 'l2', 'elasticnet', None),
    'C': [1, 10, 20],
}<br>
clf = GridSearchCV(classifier, param_grid=parameters, cv=5)


- *Purpose*:
  - Perform an exhaustive search over a specified grid of hyperparameter values.
  - Finds the best hyperparameters that maximize the model’s performance.

- **GridSearchCV Components**:
  - **classifier**: The base model to optimize (Logistic Regression in this case).
  - **param_grid**:
    - Dictionary of hyperparameters to search over.
    - Example:
      - penalty: Specifies the type of regularization (L1, L2, etc.).
      - C: Inverse of regularization strength (smaller values imply stronger regularization).
  - **cv=5**:
    - Specifies *k-fold cross-validation* (5 folds in this case).
    - Splits the training data into 5 subsets: trains on 4 and validates on 1, repeating for all subsets.

---
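To see what "train on 4, validate on 1" looks like, the sketch below splits ten hypothetical samples into 5 folds. Plain KFold is used here only to keep the pattern easy to read; with a classifier, an integer cv makes GridSearchCV use stratified folds instead.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical samples with two features each
X_demo = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5)
folds = list(kf.split(X_demo))   # list of (train indices, validation indices)

for fold, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {fold}: train on {len(train_idx)} rows, "
          f"validate on rows {val_idx.tolist()}")
```

Every row appears in exactly one validation fold, so each sample is used for validation once and for training four times.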

#### *2. Splitting and Training with GridSearchCV*
clf.fit(X_train, y_train)


- *How It Works*:
  1. Splits the X_train and y_train into 5 folds (as specified by cv=5).
  2. For each combination of hyperparameters (penalty and C), it:
     - Trains the model on 4 folds.
     - Validates the model on the remaining fold.
  3. Repeats this process for all folds, calculating an average performance metric for each hyperparameter combination.
  4. Selects the hyperparameter combination that yields the best performance.

##### Behind the Scenes:
1. *Hyperparameter Grid*:
   - Expands the param_grid into all possible combinations:
     - Example: {penalty: 'l1', C: 1}, {penalty: 'l2', C: 10}, etc.
   - If there are 4 penalty options and 3 C values, there are 4 * 3 = 12 combinations.
2. *Cross-Validation*:
   - Divides the training data into 5 folds.
   - For each combination:
     - Trains on 4 folds.
     - Validates on the 1 remaining fold.
     - Computes average performance across all 5 folds (accuracy by default for a classifier; other metrics such as F1-score can be chosen with the scoring parameter).
3. *Best Hyperparameters*:
   - Chooses the combination with the highest average performance score.
4. *Model Refit*:
   - Once the best hyperparameters are found, the model is retrained on the entire training set using those hyperparameters.

---
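The steps above can be sketched by hand. The example below first expands the 4 x 3 grid with ParameterGrid, then runs a manual search over a smaller grid and compares it with GridSearchCV. The smaller grid (C only, default L2 penalty) and max_iter=1000 are assumptions made here so that every candidate can be fit; the data is the two-class Iris subset, reloaded to keep the block self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid, cross_val_score

# 1. Expand the grid into every combination
grid = {"penalty": ["l1", "l2", "elasticnet", None], "C": [1, 10, 20]}
combos = list(ParameterGrid(grid))
print(len(combos))   # 12

# Two-class Iris data (classes 0 and 1 only)
X_all, y_all = load_iris(return_X_y=True)
X_bin, y_bin = X_all[y_all != 2], y_all[y_all != 2]

# 2-3. Manual search: mean 5-fold accuracy for each candidate
small_grid = {"C": [1, 10, 20]}   # assumed grid, default L2 penalty
scores = {}
for params in ParameterGrid(small_grid):
    model = LogisticRegression(max_iter=1000, **params)
    scores[params["C"]] = cross_val_score(model, X_bin, y_bin, cv=5).mean()
best_C = max(scores, key=scores.get)   # first candidate with the top score

# 4. GridSearchCV does the same and then refits on all of the data
search = GridSearchCV(LogisticRegression(max_iter=1000), small_grid, cv=5)
search.fit(X_bin, y_bin)
print(best_C, search.best_params_, search.best_score_)
```

After fitting, search.best_estimator_ holds the refitted model, so search.predict() can be called directly.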

In [34]:
# GridSearchCV

from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

In [36]:
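# Note: scikit-learn only accepts the lowercase names 'l1' and 'l2'. The
# uppercase 'L1'/'L2' entries are rejected when the model is fit, and
# 'elasticnet' is not supported by the default 'lbfgs' solver, so only the
# penalty=None candidates can be scored in the search below.
parameters = {
    'penalty' : ('L1', 'L2', 'elasticnet', None),
    'C' : [1, 10, 20],
}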

In [38]:
clf = GridSearchCV(classifier, param_grid=parameters, cv=5)

# Run 5-fold cross-validation on the training data for every parameter combination
clf.fit(X_train, y_train)

In [40]:
clf.best_params_

{'C': 1, 'penalty': None}

In [42]:
classifier = LogisticRegression(C=1, penalty=None)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[12  0]
 [ 0  8]]
1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       1.00      1.00      1.00         8

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20



In [44]:
# RandomizedSearchCV

from sklearn.model_selection import RandomizedSearchCV

# Samples n_iter candidates (10 by default) from the grid instead of trying all 12
random_clf = RandomizedSearchCV(LogisticRegression(), param_distributions=parameters, cv=5)
random_clf.fit(X_train, y_train)

random_clf.best_params_

{'penalty': None, 'C': 1}
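RandomizedSearchCV can also sample from continuous distributions rather than fixed lists. The following is a sketch under assumptions that are not part of the cells above: the log-uniform range for C, max_iter=1000, and random_state=42 are all chosen here for illustration, and the two-class Iris data is reloaded so the block runs on its own.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Two-class Iris data (classes 0 and 1 only)
X_all, y_all = load_iris(return_X_y=True)
X_bin, y_bin = X_all[y_all != 2], y_all[y_all != 2]

# Assumed search space: C drawn log-uniformly between 0.01 and 100
distributions = {"C": loguniform(1e-2, 1e2)}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=distributions,
    n_iter=10,         # number of sampled candidates
    cv=5,
    random_state=42,   # makes the sampled candidates reproducible
)
search.fit(X_bin, y_bin)

print(search.best_params_)
print(search.best_score_)
```

Fixing random_state is what makes two runs return the same best_params_; without it, the sampled candidates can differ between runs.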

In [59]:
# Logistic Regression model and check accuracy