# Class 10 – Introduction to Scikit-Learn

Welcome to our introduction to Scikit-Learn! Today, we will walk through the basic workflow for Machine Learning using Scikit-Learn. We’ll learn how to load datasets, preprocess data, split data, train a model, and evaluate its performance.

## What is Scikit-Learn?

Scikit-Learn (or `sklearn`) is a powerful Python library that simplifies the process of creating machine learning models. Think of it as a toolkit with ready-made tools to help you solve ML problems without building everything from scratch.

For example, you don’t need to build a car engine to drive a car. Similarly, you don’t need to implement algorithms like Logistic Regression from scratch, because Scikit-Learn already provides them.

### Key Modules in Scikit-Learn:
- `datasets`: Load sample datasets like Iris, Wine, Digits
- `preprocessing`: Scale and encode your data
- `model_selection`: Split your data into training and testing
- `metrics`: Evaluate model performance
- Model modules such as `linear_model`, `svm`, and `tree`: The learning algorithms themselves
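
As a rough orientation, the sketch below imports one representative tool from each of the modules listed above. The specific classes picked from `svm` and `tree` (`SVC`, `DecisionTreeClassifier`) are illustrative choices only, not something the rest of this lesson depends on.

```python
from sklearn.datasets import load_iris                # sample datasets
from sklearn.preprocessing import StandardScaler      # scaling and encoding
from sklearn.model_selection import train_test_split  # train/test splitting
from sklearn.metrics import accuracy_score            # evaluation metrics
from sklearn.linear_model import LogisticRegression   # linear models
from sklearn.svm import SVC                           # support vector machines
from sklearn.tree import DecisionTreeClassifier       # decision trees

# Every estimator exposes the same core methods, which makes models easy to swap
estimators = [LogisticRegression(), SVC(), DecisionTreeClassifier()]
for est in estimators:
    print(type(est).__name__, hasattr(est, "fit"), hasattr(est, "predict"))
```

Because all models share `fit` and `predict`, switching algorithms later usually means changing a single line.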

## Loading and Preprocessing Datasets

We'll use the Iris dataset. It contains 150 flowers, each described by four measurements (sepal length, sepal width, petal length, petal width) and a species label (our target).

In [None]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

In [3]:
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()

# Build a DataFrame: one column per feature, plus the species label
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

# Show the last 10 rows
df.tail(10)


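| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 140 | 6.7 | 3.1 | 5.6 | 2.4 | 2 |
| 141 | 6.9 | 3.1 | 5.1 | 2.3 | 2 |
| 142 | 5.8 | 2.7 | 5.1 | 1.9 | 2 |
| 143 | 6.8 | 3.2 | 5.9 | 2.3 | 2 |
| 144 | 6.7 | 3.3 | 5.7 | 2.5 | 2 |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 2 |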


In [None]:
# Species codes used in the target column:
# 0: setosa
# 1: versicolor
# 2: virginica

In [5]:
X = iris.data  # Features
y = iris.target  # Labels

print("First 5 samples (features):")
print(X[:5])
print("First 5 labels:")
print(y[:5])



First 5 samples (features):
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
First 5 labels:
[0 0 0 0 0]


**Explanation:**
- `load_iris()` loads the dataset containing flower measurements.
- We create a DataFrame `df` with the data and column names.
- The `target` column represents the flower species as numbers (0, 1, 2).
- `df.tail(10)` shows the last 10 rows of the data.

**Analogy:**
Imagine you're a botanist trying to identify flower species. Each row in `X` holds the measurements taken from one flower (sepal and petal length and width), and the matching entry in `y` is that flower's actual species (such as Iris setosa).
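
To make the structure of the dataset concrete, here is a small sketch that inspects its shape, names, and class balance. It only uses attributes returned by `load_iris()`; the variable name `counts` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print("Feature matrix shape:", X.shape)  # (n_samples, n_features)
print("Feature names:", iris.feature_names)
print("Species names:", list(iris.target_names))

# Count how many flowers belong to each species code (0, 1, 2)
counts = np.bincount(y)
for name, count in zip(iris.target_names, counts):
    print(f"{name}: {count} samples")
```

The dataset is balanced: each of the three species has the same number of samples, which is one reason plain accuracy is a reasonable metric here.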

## Preprocessing: Standardizing the Data

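Why scale data? Imagine two features: weight (0–100 kg) and height (0–2 m). Without scaling, many models (especially distance-based ones and those trained with regularized or gradient-based optimization) are influenced more by weight simply because its values span a larger range.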

`StandardScaler` transforms each feature to have **mean = 0** and **standard deviation = 1** by subtracting the feature's mean and dividing by its standard deviation.

In [6]:
from sklearn.preprocessing import StandardScaler

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("First 5 scaled samples:")
print(X_scaled[:5])


First 5 scaled samples:
[[-0.90068117  1.01900435 -1.34022653 -1.3154443 ]
 [-1.14301691 -0.13197948 -1.34022653 -1.3154443 ]
 [-1.38535265  0.32841405 -1.39706395 -1.3154443 ]
 [-1.50652052  0.09821729 -1.2833891  -1.3154443 ]
 [-1.02184904  1.24920112 -1.34022653 -1.3154443 ]]


**Analogy:** Think of scaling like converting all measurements to the same unit, such as converting inches and feet to centimeters, so comparisons are fair.
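
To see what standardization does numerically, the sketch below computes the z-score by hand and checks it against `StandardScaler`. The names `means`, `stds`, and `X_manual` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Manual standardization: z = (x - mean) / std, computed per column
means = X.mean(axis=0)
stds = X.std(axis=0)  # NumPy's default (population) standard deviation
X_manual = (X - means) / stds

X_scaled = StandardScaler().fit_transform(X)

print("Matches StandardScaler:", np.allclose(X_manual, X_scaled))
print("Column means after scaling:", X_scaled.mean(axis=0).round(6))
print("Column stds after scaling:", X_scaled.std(axis=0).round(6))
```

Each column is scaled independently, so a feature measured in large units no longer outweighs one measured in small units.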

## Splitting the Data

We split data into training and testing sets. Training is like studying for an exam. Testing is writing the actual exam.

We use 80% for training, 20% for testing.

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

**Explanation:**
- `X_scaled` contains the standardized features (inputs): sepal/petal length and width.
- `y` contains the labels (outputs): flower species.
- We split the data into training and testing sets using `train_test_split()`.
- `test_size=0.2` means 20% of the data (30 of the 150 samples) is used for testing.

**Note:** `random_state=42` ensures the same split every time (for reproducibility).
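
The cells above scale the full dataset before splitting, which keeps a first walkthrough simple. A common variation, sketched here as an alternative rather than this lesson's method, is to split first and fit the scaler on the training rows only, so the test set does not influence the scaling statistics. All names ending in `_alt` or `_raw` are hypothetical and chosen to avoid clashing with the notebook's variables.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Split the raw (unscaled) data first
X_train_raw, X_test_raw, y_train_alt, y_test_alt = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn mean and std from the training rows only, then reuse them for the test rows
scaler_alt = StandardScaler()
X_train_alt = scaler_alt.fit_transform(X_train_raw)
X_test_alt = scaler_alt.transform(X_test_raw)

print("Training rows:", X_train_alt.shape[0])
print("Testing rows:", X_test_alt.shape[0])
```

Note the use of `transform` (not `fit_transform`) on the test rows: the test set is scaled with the statistics learned from training data.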

## Implementing a Simple Model (Logistic Regression)

### Training the Model

Now let’s train a Logistic Regression model. Despite its name, it is a classification model.

**Analogy:** Training a model is like teaching a child to recognize animals by showing many labeled pictures.

In [8]:
from sklearn.linear_model import LogisticRegression

# Initialize and train
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

**Explanation:**
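- We create a Logistic Regression model; `max_iter=200` raises the limit on optimizer iterations (the default is 100) so training has room to converge.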
- `model.fit()` trains the model using the training data (`X_train`, `y_train`).
- The model learns how different measurements relate to flower species.
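
To get a feel for what "learning" produces, the sketch below repeats the training steps and then inspects the fitted model. It assumes the same scaling and split used in this notebook; `probs` is a name introduced here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

print("Classes seen during training:", model.classes_)
print("Coefficient matrix shape (classes x features):", model.coef_.shape)

# predict_proba returns one probability per class; each row sums to 1
probs = model.predict_proba(X_test[:3])
print(probs.round(3))
```

The coefficient matrix holds one weight per feature for each species, and `predict` simply returns the class with the highest probability.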

In [9]:
y_pred = model.predict(X_test)

In [11]:
from sklearn.metrics import accuracy_score
# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(f"Accuracy in percent: {accuracy * 100:.2f}%")

Accuracy: 1.0
Accuracy in percent: 100.00%


**Explanation:**
- `.fit()` learns from the training data.
- `.predict()` predicts labels for samples the model has not seen.
- `accuracy_score` returns the fraction of predictions that match the true labels.
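
As a quick check of what `accuracy_score` computes, this sketch reproduces it by hand and also prints a confusion matrix. The confusion matrix is an extra not used elsewhere in this lesson; `manual_accuracy` and `cm` are names introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy by hand: fraction of positions where prediction equals the true label
manual_accuracy = np.mean(y_pred == y_test)
print("Manual accuracy:", manual_accuracy)
print("accuracy_score: ", accuracy_score(y_test, y_pred))

# Rows are true species, columns are predicted species
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

Correct predictions sit on the diagonal of the confusion matrix, so it shows which species get confused with each other, not just how many mistakes were made.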

## Visualizing and Explaining the Output
- We trained a model using flower measurements.
- The model learned to predict the type of iris flower.

**Try changing `test_size` from 0.2 to 0.3 or 0.1 and see how accuracy changes.**
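
One possible way to run that experiment is a small loop over several `test_size` values. This is a sketch under the same scaling and `random_state` as above; the `results` dictionary is a name introduced here, and no particular accuracy values are assumed.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

results = {}
for test_size in (0.1, 0.2, 0.3):
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=test_size, random_state=42
    )
    model = LogisticRegression(max_iter=200).fit(X_train, y_train)
    results[test_size] = accuracy_score(y_test, model.predict(X_test))
    print(f"test_size={test_size}: {len(y_test)} test samples, "
          f"accuracy={results[test_size]:.3f}")
```

With a dataset this small, a single misclassified flower shifts the accuracy by several percentage points, so small differences between runs should be read with caution.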

In [None]:
# User input for prediction
sepal_len = float(input("Enter sepal length (cm): "))
sepal_wid = float(input("Enter sepal width (cm): "))
petal_len = float(input("Enter petal length (cm): "))
petal_wid = float(input("Enter petal width (cm): "))

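input_data = [[sepal_len, sepal_wid, petal_len, petal_wid]]
# The model was trained on standardized features, so apply the same fitted scaler
input_scaled = scaler.transform(input_data)
prediction = model.predict(input_scaled)

print("Predicted species:", iris.target_names[prediction[0]])

Predicted species: virginica


**Explanation:**
- This code takes user input for flower measurements.
- It creates a feature list `input_data` with those values.
- The values are standardized with the already-fitted `scaler`, because the model was trained on scaled features.
- The trained model predicts the flower species based on the scaled input.
- The species name is printed using `iris.target_names`.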
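
For repeated predictions without typing values interactively, the same steps can be wrapped in a function. The helper `predict_species` below is hypothetical (it is not part of Scikit-Learn or of the original notebook). Because the model is trained on standardized features, the helper passes new measurements through the same fitted scaler before predicting.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, iris.target, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)


def predict_species(sepal_len, sepal_wid, petal_len, petal_wid):
    """Hypothetical helper: return the predicted species name for one flower."""
    raw = [[sepal_len, sepal_wid, petal_len, petal_wid]]
    scaled = scaler.transform(raw)   # same scaling the model was trained with
    code = model.predict(scaled)[0]  # numeric class: 0, 1, or 2
    return str(iris.target_names[code])


# Measurements copied from rows of the dataset shown earlier
print(predict_species(5.1, 3.5, 1.4, 0.2))
print(predict_species(6.7, 3.0, 5.2, 2.3))
```

Feeding unscaled centimeter values straight into `model.predict` would place them far outside the range the model saw during training, which tends to push every prediction toward the same class.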

# Activity: Wine Dataset

Now it's your turn to try the full ML workflow on the Wine dataset!

In [13]:
from sklearn.datasets import load_wine

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

# Preprocess
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Train and Evaluate
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)


y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on Wine dataset: {accuracy * 100:.2f}%")

Accuracy on Wine dataset: 100.00%


**Try it yourself:**
- Change the model to `DecisionTreeClassifier`
- Use a different dataset like `load_digits()` or `load_breast_cancer()`
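
As a starting point for the first suggestion, here is a minimal sketch that swaps in `DecisionTreeClassifier` on the Wine data. The names `tree_model` and `tree_accuracy`, and the choice of `random_state=42` for the tree, are assumptions made for this example.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, wine.target, test_size=0.2, random_state=42
)

# Only this line differs from the Logistic Regression version
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)

tree_accuracy = accuracy_score(y_test, tree_model.predict(X_test))
print(f"Decision tree accuracy on Wine dataset: {tree_accuracy * 100:.2f}%")
```

Decision trees split on one feature at a time, so they are largely insensitive to scaling; the scaling step is kept here only so the workflow stays identical.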

# Wrap-Up

### ML Workflow Summary:
1. Load Dataset
2. Preprocess (Scaling)
3. Split into Training and Testing
4. Train Model
5. Evaluate Model
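
The five steps above can also be sketched compactly with a `Pipeline`, which bundles scaling and the model into one object. This is an optional variation, not the approach used earlier in the lesson: here the data is split before scaling, and the pipeline fits the scaler on the training rows only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Load dataset
X, y = load_iris(return_X_y=True)

# 3. Split into training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2 + 4. Preprocess and train: the pipeline scales, then fits the model
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipeline.fit(X_train, y_train)

# 5. Evaluate: the pipeline applies the same scaling before predicting
accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print(f"Accuracy: {accuracy * 100:.2f}%")
```

A pipeline also removes the risk of forgetting to scale new inputs, since `pipeline.predict` always runs the scaler first.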

**Why Preprocessing Matters:**
Scaling puts all features on a comparable range. Without it, some features may dominate just because of the units they are measured in.

### Homework:
- Try the same steps using another dataset: `load_digits()` or `load_breast_cancer()`
- Try a different model: `DecisionTreeClassifier`, `KNeighborsClassifier`
- Compare accuracy and think about why one model might perform better than another.
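
One way to organize the comparison is a loop over several models that share the same split. The sketch below uses `load_breast_cancer()`. The `models` and `scores` dictionaries and the default settings for each classifier are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=200),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

# Train every model on the same training set and score it on the same test set
scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {scores[name] * 100:.2f}%")
```

Using one shared split keeps the comparison fair: any difference in accuracy then comes from the models, not from which samples happened to land in the test set.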