# Session 10 🐍

☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️

***

# 76. Scikit-learn
Scikit-learn (imported as `sklearn`) is one of the most popular machine learning libraries in Python, providing simple and efficient tools for data mining, preprocessing, model training, and evaluation. It is built on NumPy, SciPy, and Matplotlib: NumPy and SciPy supply the efficient numerical routines, and Matplotlib backs its plotting utilities.

***

# 77. Important Features of Scikit-learn
- Supervised Learning (Classification, Regression)
- Unsupervised Learning (Clustering, Dimensionality Reduction)
- Model Selection & Evaluation (Cross-validation, Hyperparameter Tuning)
- Data Preprocessing (Scaling, Encoding, Imputation)
- Pipeline Construction (Chaining preprocessing & models)
- Integration with NumPy & Pandas

***

# 78. Supervised Learning with Scikit-learn

***

## 78-1. Classification (Predicting Categories)
Logistic Regression (shown here on the three-class Iris dataset; the same class also handles binary problems)

In [None]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

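# Split into train & test sets (fixed seed so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train model (max_iter raised above the default of 100 to help the solver converge)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)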

# Predict & evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

***

### 78-1-1. Common Classification Algorithms
| Algorithm | Class |
| --- | --- |
| Logistic Regression | **LogisticRegression** |
| Decision Tree | **DecisionTreeClassifier** |
| Random Forest | **RandomForestClassifier** |
| SVM | **SVC** |
| K-Nearest Neighbors | **KNeighborsClassifier** |
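Because all of these classes share the same `fit` / `predict` / `score` interface, they can be swapped in a loop. The sketch below is only an illustration: the model settings (`max_iter=1000`, `random_state=0`) and the dictionary name `classifiers` are assumptions, not tuned choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Illustrative, untuned settings (assumptions for this sketch)
classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "SVC": SVC(),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

test_accuracy = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    # For classifiers, score() returns mean accuracy on the given data
    test_accuracy[name] = clf.score(X_test, y_test)
    print(f"{name}: {test_accuracy[name]:.3f}")
```

A single train/test split on a dataset this small gives noisy numbers, so differences of a few points between models should not be over-interpreted.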

***

## 78-2. Regression (Predicting Continuous Values)
Linear Regression

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate data
X = [[1], [2], [3]]
y = [2, 4, 6]

# Train model
model = LinearRegression()
model.fit(X, y)

# Predict
y_pred = model.predict([[4]])
print("Prediction:", y_pred)

Prediction: [8.]
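To see where that prediction comes from, the fitted line can be inspected through the `coef_` and `intercept_` attributes. This is a small sketch on the same toy data; the variable names `slope`, `intercept`, and `manual` are only illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3]])
y = np.array([2, 4, 6])

model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # one coefficient per feature
intercept = model.intercept_  # constant term of the fitted line
print("slope:", round(slope, 6), "intercept:", round(intercept, 6))

# The prediction for x = 4 is slope * 4 + intercept
manual = slope * 4 + intercept
print("manual prediction:", round(manual, 6))
```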


***

### 78-2-1. Common Regression Algorithms
| Algorithm | Class |
| --- | --- |
| Linear Regression | **LinearRegression** |
| Ridge Regression | **Ridge** |
| Lasso Regression | **Lasso** |
| Decision Tree | **DecisionTreeRegressor** |
| Random Forest | **RandomForestRegressor** |
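Ridge and Lasso are regularized variants of linear regression: Ridge adds an L2 penalty that shrinks coefficients, while Lasso adds an L1 penalty that can set some coefficients exactly to zero. The sketch below fits all three linear models on the diabetes dataset; `alpha=1.0` is simply the library default, not a tuned value, and scoring on the training data is done only to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = load_diabetes(return_X_y=True)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
}

zero_counts = {}
for name, reg in models.items():
    reg.fit(X, y)
    # Count coefficients that the penalty removed entirely
    zero_counts[name] = int(np.sum(reg.coef_ == 0))
    print(
        f"{name}: R² on training data = {reg.score(X, y):.3f}, "
        f"zero coefficients = {zero_counts[name]}"
    )
```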

***

# 79. Unsupervised Learning

***

## 79-1. Clustering (Grouping Similar Data)
K-Means Clustering

In [None]:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

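# Generate synthetic data (fixed seed for reproducibility)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Apply K-Means (n_init set explicitly because its default differs across versions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)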

# Get cluster labels
labels = kmeans.labels_
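Beyond `labels_`, a fitted `KMeans` object also exposes the centroids and the inertia, and it can assign new points to the nearest centroid. A minimal sketch, assuming the same blob setup; the `new_points` coordinates are made up for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("Centroids shape:", kmeans.cluster_centers_.shape)
print("Inertia:", kmeans.inertia_)  # sum of squared distances to the closest centroid

# Assign previously unseen points to the nearest learned centroid
new_points = [[0.0, 0.0], [5.0, 5.0]]  # hypothetical points
new_labels = kmeans.predict(new_points)
print("New labels:", new_labels)
```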

***

### 79-1-1. Common Clustering Algorithms
| Algorithm | Class |
| --- | --- |
| K-Means | **KMeans** |
| DBSCAN | **DBSCAN** |
| Agglomerative Clustering | **AgglomerativeClustering** |
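The other two algorithms follow the same pattern, with one difference worth noting: DBSCAN does not take a number of clusters and labels points it considers noise as `-1`. The sketch below is a rough illustration; `eps=0.8` and `min_samples=5` are guesses for this synthetic data, not recommended values.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# DBSCAN: density-based, finds the number of clusters itself
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
n_noise = int((dbscan_labels == -1).sum())
n_dbscan_clusters = len(set(dbscan_labels) - {-1})
print("DBSCAN clusters:", n_dbscan_clusters, "noise points:", n_noise)

# Agglomerative clustering: hierarchical, the number of clusters is given
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print("Agglomerative labels found:", sorted(set(agglo_labels)))
```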


***

## 79-2. Dimensionality Reduction (Feature Extraction)
PCA (Principal Component Analysis)

In [5]:
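from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Use the 4-feature Iris data (the blobs from 79-1 are already 2-D)
X = load_iris().data

# Reduce to 2 dimensions
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)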
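A useful check after PCA is how much of the original variance the kept components retain, which is available as `explained_variance_ratio_`. A short sketch, assuming the four-feature Iris data as input:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples, 4 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
# Fraction of total variance captured by each principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total retained:", pca.explained_variance_ratio_.sum())
```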

***

# 80. Data Preprocessing

***

## 80-1. Feature Scaling

In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
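When the data is split into train and test sets, the scaler is usually fitted on the training portion only and then reused on the test portion, so that no test-set statistics leak into training. A minimal sketch of that pattern, using Iris as a stand-in dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print("Train means:", np.round(X_train_scaled.mean(axis=0), 6))
print("Train stds:", np.round(X_train_scaled.std(axis=0), 6))
```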

***

## 80-2. Handling Missing Values

In [7]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
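The effect is easier to see on data that actually contains missing values. In this sketch the small array `X_missing` is made up for illustration; each `NaN` is replaced by the mean of the observed values in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column
X_missing = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X_missing)

print("Column means used for filling:", imputer.statistics_)
print(X_imputed)
```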

***

## 80-3. Categorical Encoding

In [None]:
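from sklearn.preprocessing import OneHotEncoder

# Example categorical column (one feature, four samples)
X_categorical = [["red"], ["green"], ["blue"], ["green"]]

encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X_categorical)  # returns a SciPy sparse matrix
print(X_encoded.toarray())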
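Two details are worth seeing in action: the learned categories are stored in `categories_`, and `handle_unknown="ignore"` makes the encoder emit an all-zero row for a category it never saw during fitting. The colour values below are made-up sample data.

```python
from sklearn.preprocessing import OneHotEncoder

X_categorical = [["red"], ["green"], ["blue"], ["green"]]  # hypothetical data

encoder = OneHotEncoder(handle_unknown="ignore")
X_encoded = encoder.fit_transform(X_categorical).toarray()  # dense copy for printing

print("Categories:", encoder.categories_)
print(X_encoded)

# A category not seen during fit becomes a row of zeros instead of raising an error
unseen = encoder.transform([["purple"]]).toarray()
print("Unseen category:", unseen)
```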

***

# 81. Model Evaluation

***

## 81-1. Classification Metrics

In [None]:
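from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report,
)

# y_test and y_pred are the Iris labels and classifier predictions from section 78-1
print("Accuracy:", accuracy_score(y_test, y_pred))
# Iris has three classes, so the per-class scores need an averaging strategy
print("Precision (macro):", precision_score(y_test, y_pred, average="macro"))
print("Recall (macro):", recall_score(y_test, y_pred, average="macro"))
print("F1 (macro):", f1_score(y_test, y_pred, average="macro"))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))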

***

## 81-2. Regression Metrics

In [None]:
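from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    r2_score,
)

# y_test and y_pred must hold continuous targets and predictions from a
# regression model on a held-out test set (not the classifier outputs above)
print("MSE:", mean_squared_error(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R² Score:", r2_score(y_test, y_pred))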
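For a version that runs on its own, the sketch below trains a linear regression on the diabetes dataset and computes the same metrics on a held-out split. The dataset choice, the `random_state=42` seed, and the extra RMSE line are assumptions added for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # same units as the target, often easier to interpret
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.2f}  RMSE: {rmse:.2f}  MAE: {mae:.2f}  R²: {r2:.3f}")
```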

***

## 81-3. Cross-Validation

In [None]:
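from sklearn.model_selection import cross_val_score

# `model` must be a classifier and X, y a classification dataset (e.g. Iris from 78-1);
# for classifiers the default score is accuracy
scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV
print("Cross-validated Accuracy:", scores.mean())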
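Reporting the spread across folds alongside the mean gives a better sense of how stable the estimate is. The sketch below also wraps the scaler and the model in one pipeline so that scaling is refitted inside each fold; the choice of `LogisticRegression(max_iter=1000)` on Iris is an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling happens inside each fold, using only that fold's training part
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", scores)
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```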

***

# 82. Hyperparameter Tuning

***

## 82-1. Grid Search

In [None]:
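from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Try each candidate value of n_neighbors with 5-fold cross-validation
params = {'n_neighbors': [3, 5, 7]}
grid = GridSearchCV(KNeighborsClassifier(), params, cv=5)
grid.fit(X_train, y_train)

print("Best Params:", grid.best_params_)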
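After fitting, the search object also holds the best cross-validated score and a model refitted on the full training set, which can be scored on held-out data. A self-contained sketch, assuming the Iris split from earlier with a fixed seed:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}, cv=5)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)

# With the default refit=True, score() uses best_estimator_,
# which was retrained on all of X_train
test_acc = grid.score(X_test, y_test)
print("Test accuracy:", test_acc)
```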

***

## 82-2. Randomized Search

In [None]:
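from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from scipy.stats import randint

# randint(1, 10) samples integers from 1 to 9 (the upper bound is exclusive)
params = {'n_neighbors': randint(1, 10)}
random_search = RandomizedSearchCV(
    KNeighborsClassifier(), params, n_iter=10, cv=5, random_state=42
)
random_search.fit(X_train, y_train)

print("Best Params:", random_search.best_params_)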

***

# 83. Pipelines (Chaining Steps)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
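A pipeline can itself be passed to `GridSearchCV`. Parameters of a step are addressed as `<step name>__<parameter>`, using the names given when the pipeline was built. The grid values below are illustrative assumptions, kept small so the example runs quickly.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# "classifier__..." targets parameters of the step named "classifier"
param_grid = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 5],
}

grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Test accuracy:", grid.score(X_test, y_test))
```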

***

# Some Exercises

**1.**  Use the load_iris dataset to train a Logistic Regression model and evaluate its performance.

Steps:
- Load the Iris dataset.
- Split into train/test sets (test_size=0.2).
- Train a LogisticRegression model.
- Predict on test data and compute accuracy.
- Print the confusion matrix and classification report.

***

**2.** Predict disease progression using load_diabetes and Linear Regression.

Steps:
- Load the dataset.
- Split into train/test sets.
- Train a LinearRegression model.
- Predict on test data and compute MSE and R² score.
- Plot predictions vs actual values using matplotlib.

***

**3.**  Cluster synthetic data using KMeans and visualize the clusters.

Steps:
- Generate synthetic data with make_blobs(n_samples=300, centers=4).
- Apply KMeans(n_clusters=4).
- Plot the clusters (use different colors for each cluster).
- Print the inertia (sum of squared distances to centroids).

***

**4.**  Reduce the Iris dataset to 2D using PCA and visualize it.

Steps:
- Load the Iris dataset.
- Apply PCA(n_components=2).
- Transform the data and plot the reduced features.
- Color points by their true class labels.

***

**5.** Preprocess a dataset with missing values and categorical features.

Steps:

- Create a synthetic dataset with:
    - Numerical features (some missing values).
    - Categorical features (text labels).
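- Build a preprocessing pipeline (numerical and categorical columns need different steps, so route them separately, for example with ColumnTransformer) with: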
    - SimpleImputer (fill missing values).
    - OneHotEncoder (encode categories).
    - StandardScaler (scale features).
- Fit and transform the data.
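One possible starting point for the steps above is sketched here. It assumes pandas is available and uses `ColumnTransformer` to send numerical and categorical columns through different steps. The column names and values in `df` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical dataset: two numerical columns with gaps, one text column
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 31.0],
    "income": [40000.0, 52000.0, np.nan, 61000.0],
    "city": ["Paris", "Lyon", "Paris", "Nice"],
})
numeric_features = ["age", "income"]
categorical_features = ["city"]

# Numerical columns: fill missing values, then scale
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Route each group of columns to its own transformer
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

X_processed = preprocessor.fit_transform(df)
print("Output shape:", X_processed.shape)  # 2 scaled columns + one column per city
```

The same `preprocessor` can be placed as the first step of a larger `Pipeline` ahead of a model.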

***

**6.** Optimize a Random Forest classifier using grid search.

Steps:
- Load the Iris dataset.
- Define a parameter grid: params = {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]}
- Use GridSearchCV with RandomForestClassifier.
- Fit and print the best parameters and best score.

***

**7.** Evaluate an SVC model using 5-fold cross-validation.

Steps:
- Load the Iris dataset.
- Use cross_val_score with SVC().
- Print the mean accuracy and standard deviation.

***

**8.** Build a pipeline that preprocesses data and trains a classifier.

Steps:
- Load the Titanic dataset (or any dataset with mixed features).
- Define a pipeline:
    - Impute missing values.
    - Scale numerical features.
    - Encode categorical features.
    - Train a RandomForestClassifier.
- Evaluate using accuracy and classification report.

***

**9.** Compare Logistic Regression, SVM, and Random Forest on the same dataset.

Steps:
- Train all 3 models on the Iris dataset.
- Compare their accuracy, precision, and recall.
- Visualize results in a bar plot.

***

#                                                        🌞 https://github.com/AI-Planet 🌞