# Muhammad Dawood Khan  

### Welcome to the **Best Place to Learn scikit-learn (SKLEARN)**  
Get ready for an **exciting journey** through Machine Learning with hands-on, practical examples.  
Buckle up, explore, and most importantly — **enjoy the ride!**  

---

### Happy Coding & Keep Learning!
If you have any **questions**, **feedback**, or **suggestions**, feel free to reach out anytime.  
I’m always happy to connect with fellow learners and developers!

---

### Connect With Me  
[![GitHub](https://img.shields.io/github/followers/Dawood-ML?label=GitHub&style=social)](https://github.com/Dawood-ML)  
[![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/muhammad-dawood-khan-5a3292329/)

---

_“Learning never exhausts the mind — it only sharpens it.”_


### **Chunk 1: The Scikit-Learn API Philosophy**

#### **1. Concept Introduction**

You understand the theory behind machine learning models. Scikit-learn's genius is not in the algorithms themselves, but in its **unified API**. Every algorithm, whether it's a simple regression or a complex ensemble, is an "Estimator" object that shares the same simple, consistent methods. This is the key to productivity.

The three core methods are:

1.  `.fit(X, y)`: This is the **training** step. The estimator "learns" from your data `X` (features) and `y` (target). Every supervised learning model in scikit-learn has this method.
2.  `.predict(X_new)`: Once the model is trained, this method generates predictions for new, unseen data `X_new`.
3.  `.transform(X)`: This is for **preprocessing**. It takes data and returns a transformed version of it (e.g., scaled, encoded, or with imputed values). We will cover this more in the next chunk. `fit_transform()` is a convenient shortcut that learns the transformation parameters from the data and applies the transformation in one step.
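
As a minimal sketch of the `fit` / `transform` / `fit_transform` relationship described above, the snippet below uses `StandardScaler` as one example transformer. The tiny `X_demo` matrix and the variable names are made up for illustration and are not part of the Iris walkthrough.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature matrix: 4 samples, 2 features (illustrative values only)
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0],
                   [4.0, 40.0]])

# Two steps: learn each column's mean and standard deviation, then apply them
scaler = StandardScaler()
scaler.fit(X_demo)
X_two_step = scaler.transform(X_demo)

# One step: fit_transform learns and applies the scaling on the same data
X_one_step = StandardScaler().fit_transform(X_demo)

print(scaler.mean_)                         # column means learned during fit
print(np.allclose(X_two_step, X_one_step))  # both routes give the same result
```

The two-step form matters once you have a test set: you would typically call `fit` on the training data only and then `transform` both sets with the same fitted scaler.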

This leads to the most fundamental workflow in all of supervised machine learning, which you will type thousands of times:

**Load Data → Split Data → Train Model → Predict → Evaluate**

Let's see this in action with our first dataset.

#### **2. Dataset EDA: The Iris Dataset**

The Iris dataset is the "hello, world" of machine learning. It's a clean, simple dataset for classifying three species of iris flowers based on four measurements.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Set Plot style
sns.set_style("whitegrid")

In [None]:
iris = load_iris()
# Create a Pandas DataFrame for easier manipulation
df = pd.DataFrame(data=iris.data, 
                    columns=iris.feature_names)

# Create target column
df['target'] = iris.target
df['species'] = df['target'].map({i: name for i, name in enumerate(iris.target_names)})

print("INFO")
df.info()

In [None]:
# In case you are wondering what the dict comprehension in the previous cell does,
# uncomment the line below and run it.

#{i: name for i, name in enumerate(iris.target_names)}
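
For reference, here is a small self-contained sketch of what that comprehension builds. The names `iris_demo` and `label_map` are illustrative, and `str(name)` is added only so the values print as plain Python strings.

```python
from sklearn.datasets import load_iris

iris_demo = load_iris()

# enumerate() pairs each class name with its integer position,
# and the comprehension turns those pairs into a lookup table
label_map = {i: str(name) for i, name in enumerate(iris_demo.target_names)}
print(label_map)
```

Passing a dictionary like this to `Series.map` replaces each integer code in `target` with the matching species name.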

In [None]:
# Statistical summary
df.describe()

In [None]:
# First 5 rows
df.head()

In [None]:
# Check for missing values
print("Missing Values")
df.isnull().sum()

In [None]:
# Target Variable Distribution
print("Target Variable Distribution")
plt.figure(figsize=(8,5))
sns.countplot(x='species', data = df)
plt.title("Distribution of Iris Species")
plt.xlabel('Species')
plt.ylabel('Count')
plt.show()

In [None]:
# Feature Distributions ( Histogram )
df[iris.feature_names].hist()
plt.suptitle('Histogram of Feature Distribution');



In [None]:
# Feature Distributions ( Box Plots )
plt.figure(figsize=(12,8))
sns.boxplot(data=df[iris.feature_names])
plt.title('Box Plots of Features')
plt.show()

In [None]:
# Pair plot to visualize relationships
# This is a powerful command to see relationships between all pairs of features
# and the distribution of each feature, colored by the target variable.
sns.pairplot(df, hue = 'species', height=2.5)
plt.suptitle('Pair Plot of Iris Dataset', y=1.02);

In [None]:
# Correlation Matrix Heatmap
# Select only the numeric columns for the correlation
df_num = df.drop(columns=['species'])
plt.figure(figsize=(10,8))
corr_mat  = df_num.corr() # Built in function in pandas
sns.heatmap(corr_mat,
            annot=True,
            cmap='coolwarm',
            linewidths=0.5)
plt.title('Correlation Matrix of Features and Target')
plt.show()

## Minimal Working Example : 
> Load -> Split -> Train -> Predict -> Evaluate

In [None]:
# Imports
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# For reproducible results
np.random.seed(42)


In [None]:
# Load the data
# We have already loaded the data, so there is no need to reload it.

# Split the data into features (X) and label (y)
X = df.drop(['target', 'species'], axis=1)
y = df['target']

print(f"Features Shape : {X.shape}")
print(f"Target Shape : {y.shape}")

# Another, less common way: load_iris can return the arrays directly.
# Note: this overwrites X and y with NumPy arrays holding the same values.
X, y = load_iris(return_X_y=True)

print(f"Features Shape : {X.shape}")
print(f"Target Shape : {y.shape}")
print("\nSame shapes as before")

In [None]:
# Split DATA
# Split the data into Training ( 80% ) and testing ( 20% ) sets.
# stratify = y ensures that the proportion of classes in the train and test sets
# is the same as the original dataset.

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=y
                                                    )

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
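
To see what `stratify=y` does, the sketch below counts the samples per class before and after the split. It reloads Iris so that it runs on its own, and names such as `X_demo` and `y_tr` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# Same split settings as in the walkthrough; only the labels are needed here
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# np.bincount counts how many samples carry each integer class label
print("Full dataset:", np.bincount(y_demo))
print("Train set:   ", np.bincount(y_tr))
print("Test set:    ", np.bincount(y_te))
```

Because Iris has 50 samples per class, a stratified 80/20 split keeps the classes perfectly balanced in both parts.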

In [None]:
# Train Model
# 1. Instantiate the estimator (create the model object)
# We'll use Logistic Regression, a simple and powerful classification model
model = LogisticRegression(max_iter=200) # Increased max_iter for convergence

# Fit the model to the training data
model.fit(X_train, y_train)
print("model training complete")

In [None]:
# Predict
# Use the trained model to make predictions on the unseen data.
y_pred = model.predict(X_test)
print(f"First 5 predictions : {y_pred[:5]}")
print(f"First 5 actual labels : {y_test[:5]}")


In [None]:
# Evaluate the model
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {100 * accuracy:.2f}%")
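
Accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this on a handful of made-up labels (`y_true_demo` and `y_pred_demo` are illustrative and not taken from the model above).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 4 of the 5 predictions are correct
y_true_demo = np.array([0, 1, 2, 2, 1])
y_pred_demo = np.array([0, 1, 1, 2, 1])

# Element-wise comparison gives booleans; their mean is the share of matches
manual_accuracy = np.mean(y_true_demo == y_pred_demo)
library_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, library_accuracy)
```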

## Wow that is a nice model. Isn't it?


## Variations & Key Concepts
The beauty of Sklearn is its consistency. Let's swap the model. Notice how little the code changes.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Instantiate a new model
new_model = KNeighborsClassifier(n_neighbors=5)

# Fit the model OR Train the model
new_model.fit(X_train, y_train)

# Predict
y_pred_new  = new_model.predict(X_test)

# Evaluate 
accuracy_knn = accuracy_score(y_test, y_pred_new)
print(f"Accuracy: {100 * accuracy_knn:.2f}%")
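
Because every estimator exposes the same `fit` and `predict` methods, several models can be compared in a single loop. This is a minimal sketch rather than a benchmark. `DecisionTreeClassifier` is added here purely as a third example, and the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Three different algorithms, one shared interface
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=200),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=5),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)                                  # train
    scores[name] = accuracy_score(y_te, estimator.predict(X_te))  # predict + evaluate
    print(f"{name}: {100 * scores[name]:.2f}%")
```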

Many models can also predict class probabilities, which are often more useful than a hard prediction because they show how confident the model is.

In [None]:
# Getting Probabilities
# Get the probabilities for each class for the first 5 test samples
probabilities = model.predict_proba(X_test[:5])

print("Probabilities for first 5 samples : ")
np.set_printoptions(precision=3, suppress=True)
print(probabilities)

print("\nPredicted class for first sample:", np.argmax(probabilities[0]))
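
Two properties of `predict_proba` are worth checking for yourself: each row sums to 1, and the most probable column, looked up in `classes_`, matches what `predict` returns. The sketch below retrains a small model so it runs on its own. The names `clf` and `proba` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=200).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # shape: (n_samples, n_classes)

# Each row is a probability distribution over the classes
print(np.allclose(proba.sum(axis=1), 1.0))

# Column order follows clf.classes_, so argmax must be mapped through it
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(labels_from_proba, clf.predict(X_te)))
```

On Iris the classes are already `0, 1, 2`, so `np.argmax` alone gives the label. Mapping through `classes_` keeps the code correct when labels are strings or non-contiguous integers.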


## Common Pitfalls and Mistakes
1. **Training and Testing on the Same Data (Data Leakage):** This is the #1 mistake. The score looks excellent but is meaningless, because the model is graded on examples it has already seen.

In [None]:
# DO NOT DO THIS
model_leaky = LogisticRegression(max_iter=200)
model_leaky.fit(X, y) # Fit on ALL data

leaky_preds = model_leaky.predict(X) # Predict on the SAME data

leaky_accuracy = accuracy_score(y, leaky_preds)
print(f"Dangerously optimistic accuracy: {leaky_accuracy * 100:.2f}%") # This is meaningless!
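
The effect is easiest to see with a model that can memorize its training data. The sketch below uses a 1-nearest-neighbor classifier as an illustrative choice (it is not one of the models configured above) and compares the score on data it has seen with the score on held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With n_neighbors=1, each training point is its own nearest neighbor,
# so the model effectively looks up the answer it was given
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, memorizer.predict(X_tr))  # seen data
test_acc = accuracy_score(y_te, memorizer.predict(X_te))   # unseen data

print(f"Accuracy on training data: {100 * train_acc:.2f}%")
print(f"Accuracy on held-out data: {100 * test_acc:.2f}%")
```

Only the held-out number says anything about how the model will behave on new flowers.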

2. **Forgetting to Instantiate the Model:** You must create an *instance* of the model class by calling it with parentheses `()`.

In [None]:
# Wrong way: calling fit on the class itself instead of on an instance
try:
    LogisticRegression.fit(X_train, y_train)
except Exception as e:
    print(f"Error: {e}")

# Correct way
model = LogisticRegression(max_iter=200) # Correct: creates an instance
model.fit(X_train, y_train)
print("\nModel fitted correctly after instantiation.")

#### Congratulations. You have just successfully trained, predicted with, and evaluated your first two machine learning models using the standard, professional workflow. You've seen the core scikit-learn API (`fit`, `predict`) and used the most critical function for model validation (`train_test_split`).