# 🤖 Introduction to Scikit-learn
This notebook provides a beginner-friendly introduction to scikit-learn — a widely used machine learning library in Python.

## ✅ What is Scikit-learn?
- Scikit-learn (or `sklearn`) offers simple and efficient tools for predictive data analysis.
- It supports:
  - Classification (e.g., spam detection)
  - Regression (e.g., price prediction)
  - Clustering (e.g., customer segmentation)
  - Dimensionality reduction
  - Model evaluation and selection

Scikit-learn is open source and was created to simplify the process of implementing machine learning and statistical models in Python.

In [22]:
# ✅ Step 1: Import Required Libraries
!pip install pandas scikit-learn
import pandas as pd
from sklearn.datasets import load_iris



## 🌸 Step 2: Load Sample Dataset (Iris)

The first aspect of scikit-learn we will explore is its data. Scikit-learn ships with several standard machine learning datasets, which means you’re not required to download them from an external website or database.

Examples of the toy datasets available in sklearn include the iris dataset for classification and the diabetes dataset for regression. For our example, we will be using the iris dataset. 

Let’s load it into memory:

In [23]:
iris_data = load_iris()

Executing the code above returns a dictionary-like object (a `Bunch`) containing the data along with metadata that describes it.
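Before printing the whole object, it can help to look at what it contains. The following is a minimal sketch of how the dictionary-like object can be explored; it only uses keys that `load_iris()` provides, and the exact set of keys may vary slightly between scikit-learn versions.

```python
from sklearn.datasets import load_iris

iris_data = load_iris()

# A Bunch behaves like a dict, so its keys can be listed directly.
print(sorted(iris_data.keys()))

# Each key is also reachable as an attribute.
print(iris_data.feature_names)
print(iris_data.target_names)
print(iris_data.data.shape, iris_data.target.shape)
```

Attribute access (`iris_data.data`) and key access (`iris_data["data"]`) refer to the same underlying array, so either style can be used.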

In [24]:
iris_data

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  


The data we need is stored under the `data` key, which we can also access as an attribute of the dataset object:

In [25]:
iris_data.data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [26]:
iris_data.data.shape

(150, 4)

The `data` array has shape N x M, where N is the number of samples (150) and M is the number of features (4).

Let’s use this knowledge to load our data into a pandas DataFrame, which is much easier to manipulate and analyze. 

In [27]:
# Create DataFrame from the feature data
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
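
As an alternative to building the DataFrame by hand, recent scikit-learn versions can return pandas objects directly. This is a small sketch assuming a scikit-learn version that supports the `as_frame` parameter (0.23 or later) and that pandas is installed; the variable names are illustrative only.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays.
iris_frame = load_iris(as_frame=True)

features_df = iris_frame.data      # DataFrame holding the four feature columns
target_series = iris_frame.target  # Series holding the integer class labels
full_df = iris_frame.frame         # features and target combined in one DataFrame

print(full_df.shape)
print(full_df.columns.tolist())
```

The manual construction used in this notebook makes each step visible, which is useful when learning; `as_frame=True` is simply a shortcut to a similar result.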


In [28]:
iris_data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [29]:
iris_data.target_names[0]

np.str_('setosa')

In [30]:
iris_data.target_names[1]

np.str_('versicolor')

In [31]:
iris_data.target_names[2]

np.str_('virginica')

In [32]:
# Add target column using the original dataset object (iris_data)
iris_df["target"] = iris_data.target

# Optionally add target names
iris_df["target_name"] = [iris_data.target_names[i] for i in iris_data.target]

# Preview
iris_df.head()

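|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | target_name |
| - | ----------------- | ---------------- | ----------------- | ---------------- | ------ | ----------- |
| 0 | 5.1               | 3.5              | 1.4               | 0.2              | 0      | setosa      |
| 1 | 4.9               | 3.0              | 1.4               | 0.2              | 0      | setosa      |
| 2 | 4.7               | 3.2              | 1.3               | 0.2              | 0      | setosa      |
| 3 | 4.6               | 3.1              | 1.5               | 0.2              | 0      | setosa      |
| 4 | 5.0               | 3.6              | 1.4               | 0.2              | 0      | setosa      |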


Next, we are going to call the info() method on our pandas DataFrame; this will print a concise summary of the iris data contained within the DataFrame.

In [33]:
iris_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int64  
 5   target_name        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


After executing this cell, you will learn:

- The data contains 150 samples.
- There are 6 columns in total, including the target column (what we want to predict).
- No column has missing values; you can infer this from the “Non-Null Count” column, which shows 150 for every column.
- All four features are of data type float64, whereas the target label is an int64.
- target_name is of type object because it holds strings.
- The data uses at least 7.2 KB of memory; the `+` in the output indicates that the object column may take more.
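
The same conclusions about missing values can be checked programmatically rather than read off the summary. The sketch below rebuilds a DataFrame like the one above (without the `target_name` column, for brevity) and counts missing values per column; the class count at the end is an extra check not shown in the original cells.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_data = load_iris()
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_df["target"] = iris_data.target

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = iris_df.isna().sum()
print(missing_per_column)

# Count samples per class to see whether the classes are balanced.
class_counts = iris_df["target"].value_counts().sort_index()
print(class_counts)
```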

In [34]:
iris_df.describe()

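|       | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target   |
| ----- | ----------------- | ---------------- | ----------------- | ---------------- | -------- |
| count | 150.0             | 150.0            | 150.0             | 150.0            | 150.0    |
| mean  | 5.843333          | 3.057333         | 3.758             | 1.199333         | 1.0      |
| std   | 0.828066          | 0.435866         | 1.765298          | 0.762238         | 0.819232 |
| min   | 4.3               | 2.0              | 1.0               | 0.1              | 0.0      |
| 25%   | 5.1               | 2.8              | 1.6               | 0.3              | 0.0      |
| 50%   | 5.8               | 3.0              | 4.35              | 1.3              | 1.0      |
| 75%   | 6.4               | 3.3              | 5.1               | 1.8              | 2.0      |
| max   | 7.9               | 4.4              | 6.9               | 2.5              | 2.0      |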


Executing this code shows us that our features are on different scales: for example, petal length ranges from 1.0 to 6.9 cm, while petal width ranges from 0.1 to 2.5 cm. This may cause problems for models fitted with regularization or gradient-based optimization, such as logistic regression, and for algorithms that rely on distances or margins, such as support vector machines, because they are sensitive to the range of the input values.

###  Data preprocessing

Data preprocessing is a vital step in the machine learning workflow because data from the real world is messy. It may contain:

1. Missing values
2. Redundant values
3. Outliers
4. Errors
5. Noise

You must deal with all of this before feeding the data to a machine learning model; otherwise, the model will incorporate these mistakes into its approximation function and will learn to make mistakes on new instances.

What is the importance of scaling and normalisation? Scaling puts all features on a comparable range so that no single feature dominates purely because of its units.

Other than our features being on different scales, there’s not much else wrong with our data at first glance. To address this, let’s standardize the features using sklearn’s StandardScaler class; this rescales each feature so that its mean is approximately zero and its standard deviation is approximately one.

In [35]:
from sklearn.preprocessing import StandardScaler

# Split data into features (X) and labels (y),
# which is the form scikit-learn estimators expect for training
X = iris_df[iris_data.feature_names].copy()
y = iris_df["target"].copy()

# Instantiate the scaler and fit it on the features
scaler = StandardScaler()
scaler.fit(X)

# Transform the features; passing the same DataFrame used in fit()
# avoids a feature-name mismatch warning. The result is a NumPy array.
X_scaled = scaler.transform(X)

# Compare the first instance after and before scaling
print(X_scaled[0])  # scaled values (NumPy array)
print(X.iloc[0])    # original values (pandas Series)


[-0.90068117  1.01900435 -1.34022653 -1.3154443 ]
sepal length (cm)    5.1
sepal width (cm)     3.5
petal length (cm)    1.4
petal width (cm)     0.2
Name: 0, dtype: float64
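
To make the transformation less of a black box: standardization replaces each value x with z = (x - mean) / std, computed per feature. The sketch below reproduces this by hand with NumPy and compares it to StandardScaler. It assumes the population standard deviation (`ddof=0`, NumPy's default), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Manual standardization: subtract each column's mean and divide by
# each column's standard deviation (population form, ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same operation using scikit-learn.
X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))  # do both approaches agree?
print(X_scaled.mean(axis=0).round(6))   # per-feature means after scaling
print(X_scaled.std(axis=0).round(6))    # per-feature standard deviations after scaling
```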




### Model Training

Before a machine learning model can make accurate predictions, it first needs to be trained on labeled data so it can learn an approximation of the underlying pattern.

But how can we tell if the model will perform well on new, unseen data? We can’t—unless we test it.

To evaluate a model safely before deploying it in the real world, we typically split our data into two parts: one for training and one for testing. This process is called offline evaluation. It helps us measure how well the model has learned by testing it on data it hasn’t seen during training.

There are various ways to perform this split, but scikit-learn provides a convenient method called train_test_split() that does it for us.

In this example, we'll use train_test_split() to divide our dataset so that:

- 70% of the data is used for training the model
- 30% is used to evaluate how well the model generalizes to new inputs

In [36]:
from sklearn.model_selection import train_test_split

# Split data into train and test sets
X_train_scaled, X_test_scaled, y_train, y_test = train_test_split(
    X_scaled, y, train_size=0.7, random_state=25
)

# Check the splits are correct
print(f"Train size: {round(len(X_train_scaled) / len(X) * 100)}%")
print(f"Test size: {round(len(X_test_scaled) / len(X) * 100)}%")


Train size: 70% 
Test size: 30%
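
A plain random split does not guarantee that each class appears in the same proportion in both parts. One option, not used in the cells of this notebook, is the `stratify` parameter of train_test_split(), which keeps class proportions equal across the split. A minimal sketch, working on the raw features for brevity:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=25, stratify=y
)

# Number of samples per class in each part.
print(np.bincount(y_train))
print(np.bincount(y_test))
```

Because the iris dataset has 50 samples per class, a stratified 70/30 split places the same number of samples from each class in the training set and in the test set.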


### Building the model


Thanks to sklearn, building a machine learning model is straightforward.

We are going to build a logistic regression model to predict the species of iris.


In [37]:
from sklearn.linear_model import LogisticRegression


# Instantiating the model
# This creates an instance of the LogisticRegression model.
# It’s now ready to be trained with data.
logistic_regression = LogisticRegression()


In [38]:
logistic_regression

In [39]:
# Here we train (or “fit”) the model on the training data.
# X_train_scaled = the input features (already scaled)
# y_train = the target labels
# The model learns patterns from the training data so it can make predictions.
logistic_regression.fit(X_train_scaled, y_train)

In [40]:
# Now that the model is trained, you use it to predict labels on new, unseen data (X_test_scaled).
# The output, log_reg_preds, is an array of predicted classes.
log_reg_preds = logistic_regression.predict(X_test_scaled)
print(log_reg_preds)

[0 1 2 1 2 1 2 0 1 1 0 0 0 1 0 1 2 2 1 1 1 1 1 0 0 2 1 2 2 0 1 2 2 0 2 1 1
 0 0 0 0 0 0 0 2]
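
In the cells above, the scaler was fitted on the full dataset before splitting, so statistics from the test rows influenced the scaling. A common alternative is to chain the scaler and the model in a pipeline, so that the scaler is fitted on the training rows only. The sketch below is one possible way to do that, not the approach used in this notebook, and its scores may differ slightly from the results shown here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Split the raw (unscaled) features first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=25
)

# The pipeline fits the scaler on the training data only,
# then applies the same scaling before every prediction.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

pipeline_preds = pipeline.predict(X_test)
print(pipeline_preds[:10])
print(pipeline.score(X_test, y_test))  # accuracy on the test set
```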


### Model evaluation

Model evaluation is done to test how well the model generalizes to unseen instances. Scikit-learn provides an array of classification and regression metrics to evaluate a trained model's performance. 

For our use case, we are going to use classification_report() from the metrics module to build a text report showing the main classification metrics: precision, recall, F1-score, and accuracy.

| Metric        | What it means                                         |
| ------------- | ----------------------------------------------------- |
| **Precision** | Out of all predicted positives, how many were correct |
| **Recall**    | Out of all actual positives, how many were found      |
| **F1-score**  | Harmonic mean of precision and recall (balance)       |
| **Accuracy**  | Overall correctness of predictions (shown at bottom)  |
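
To make the definitions in the table concrete, the sketch below computes each metric by hand on a small set of made-up binary labels (invented purely for illustration, not taken from the iris data) and compares the results with scikit-learn's metric functions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up binary labels for illustration only (1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # correct among predicted positives
recall = tp / (tp + fn)     # found among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

print(precision, recall, round(f1, 3), accuracy)
print(
    precision_score(y_true, y_pred),
    recall_score(y_true, y_pred),
    round(f1_score(y_true, y_pred), 3),
    accuracy_score(y_true, y_pred),
)
```

For a multi-class problem such as iris, classification_report() computes these metrics once per class, treating that class as the positive one.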



In [41]:
from sklearn.metrics import classification_report

# Store model predictions in a dictionary;
# this makes it easier to iterate through each model
# and print the results.
model_preds = {
    "Logistic Regression": log_reg_preds
}

for model, preds in model_preds.items():
    print(f"{model} Results:\n{classification_report(y_test, preds)}")


Logistic Regression Results:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        17
           1       0.94      0.94      0.94        16
           2       0.92      0.92      0.92        12

    accuracy                           0.96        45
   macro avg       0.95      0.95      0.95        45
weighted avg       0.96      0.96      0.96        45
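
The report summarizes performance per class, but it does not show which classes get confused with each other. A confusion matrix can complement it. The sketch below repeats the steps from this notebook in a self-contained form and uses `confusion_matrix` from the metrics module; exact counts may depend on the scikit-learn version, so none are asserted here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris_data = load_iris()
X_scaled = StandardScaler().fit_transform(iris_data.data)

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, iris_data.target, train_size=0.7, random_state=25
)

model = LogisticRegression().fit(X_train, y_train)

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1, 2])
print(iris_data.target_names)
print(cm)
```

Values on the diagonal are correct predictions; any off-diagonal value counts samples of the row's class that were predicted as the column's class.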

