1. Introduction to Machine Learning with scikit-learn
Theoretical Explanation
Machine Learning (ML) is a subset of artificial intelligence that focuses on building systems that can learn from and make decisions based on data. It involves algorithms that improve their performance at a task with experience (i.e., data).

Key Concepts:

Types of Machine Learning:

Supervised Learning: The model is trained on labeled data. Examples include regression and classification.
Unsupervised Learning: The model works with unlabeled data to find hidden patterns. Examples include clustering and dimensionality reduction.
Reinforcement Learning: The model learns by interacting with an environment to maximize some notion of cumulative reward.
Common Terminologies:

Features: Independent variables/input data used to make predictions.
Labels: Dependent variables/output data the model aims to predict.
Training Set: Subset of data used to train the model.
Testing Set: Subset of data used to evaluate the model's performance.
Overfitting: When a model learns the training data too well, including noise, and performs poorly on unseen data.
Underfitting: When a model is too simple to capture the underlying pattern of the data.
Why scikit-learn?

scikit-learn is a powerful and widely-used Python library for machine learning. It provides simple and efficient tools for data analysis and modeling, built on NumPy, SciPy, and Matplotlib.

Programmatic Implementation
Let's implement a simple supervised learning task using scikit-learn. We'll use the Iris dataset, a classic dataset in machine learning.

Objective: Build a classification model to predict the species of iris flowers based on their features.

In [1]:
import pandas as pd 
import numpy as np 
from sklearn import  datasets
from sklearn.model_selection import  train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report,confusion_matrix

In [4]:
# Load the Iris Dataset 
iris =datasets.load_iris()

X=iris.data   # Feature 
y=iris.target # label 

In [5]:
# Conver  To DataFrame for Better visualization 

df = pd.DataFrame(data=np.c_[iris['data'],iris['target']],
columns =iris['feature_names'] + ['target '])

In [7]:
print('first 5 rows of Data Set ')
df.head()

first 5 rows of Data Set 


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0.0
1,4.9,3.0,1.4,0.2,0.0
2,4.7,3.2,1.3,0.2,0.0
3,4.6,3.1,1.5,0.2,0.0
4,5.0,3.6,1.4,0.2,0.0


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    float64
dtypes: float64(5)
memory usage: 6.0 KB


In [10]:
# cheack for null values 
df.isnull().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
dtype: int64

In [11]:
X_train,X_test ,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [13]:
# Feature Scalling
scaler = StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.fit_transform(X_test)

In [14]:
# Initilize the K-Nearest Neighbour classifier 
knn = KNeighborsClassifier(n_neighbors=5)

In [15]:
# Train the Model 
knn.fit(X_train,y_train)

In [16]:
# make Prediction on the Test Test 
y_pred=knn.predict(X_test)

In [17]:
# Evaluate the model 
print('Confusion Matrix\n')
print(confusion_matrix(y_test,y_pred))

Confusion Matrix

[[10  0  0]
 [ 0  9  0]
 [ 0  1 10]]


In [22]:
#  classification Report 
print('\n Classification Report :')
print(classification_report(y_test,y_pred))


 Classification Report :
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.90      1.00      0.95         9
           2       1.00      0.91      0.95        11

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.97        30
weighted avg       0.97      0.97      0.97        30



Output Explanation:
Loading the Dataset:

We load the Iris dataset and separate it into features (X) and labels (y).
DataFrame Creation:

For better visualization, we convert the data into a Pandas DataFrame and display the first five rows.
Train-Test Split:

The dataset is split into training and testing sets with an 80-20 ratio.
Feature Scaling:

We apply feature scaling using StandardScaler to standardize the feature values, which often improves model performance.
Model Training:

We initialize a K-Nearest Neighbors (KNN) classifier with n_neighbors=5 and train it on the training data.
Prediction and Evaluation:

Predictions are made on the test set.
We evaluate the model's performance using a confusion matrix and classification report, which includes metrics like precision, recall, and F1-score.