# Preparatory Class
## Scikit-Learn Basics for Beginners

&rarr; Make a habit of checking the scikit-learn documentation:
https://scikit-learn.org/stable/user_guide.html

scikit-learn's "Choosing the right estimator" map:
https://scikit-learn.org/stable/_downloads/b82bf6cd7438a351f19fac60fbc0d927/ml_map.svg

## Data Pipelines
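A `Pipeline` in scikit-learn chains multiple processing steps together (typically one or more transformers followed by a final model), so the whole sequence can be fit and applied as a single object.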

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

In [2]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

In [None]:
# pipeline.fit(X_train, y_train)
# predictions = pipeline.predict(X_test)
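
The two commented lines above need training and test data to run. As a minimal sketch, assuming the Iris dataset as stand-in data (the notebook only loads it later) and the same 80/20 split used elsewhere in this notebook, the pipeline could be fit and used like this:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: Iris is used here only as stand-in data for the example
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() fits the scaler on the training data, transforms it,
# and then trains the model on the scaled features
pipeline.fit(X_train, y_train)

# predict() applies the already-fitted scaler to X_test before predicting
predictions = pipeline.predict(X_test)
print(predictions.shape)
```

Because the scaler sits inside the pipeline, it is fit only on the training data, which helps avoid leaking test-set statistics into preprocessing.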

## Transformers vs. Estimators

### Transformers:
- Used for data preprocessing (e.g., scaling, encoding, imputation).
- Implements a `fit` method to learn from data and a `transform` method to apply the transformation.

### Estimators:
- Used for modeling (e.g., regression, classification).
- Implements a `fit` method to train the model and a `predict` method to make predictions.

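### Key difference:
- Transformers modify the input data, while estimators predict outcomes based on the input data.
- In scikit-learn's own terminology, any object with a `fit` method is an estimator; the models described above, which also implement `predict`, are called predictors.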

In [3]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

In [4]:
# Transformer: MinMaxScaler
scaler = MinMaxScaler()
# scaler.fit(X_train); X_scaled = scaler.transform(X_train)

# Estimator: LinearRegression
model = LinearRegression()
# model.fit(X_train, y_train); predictions = model.predict(X_test)
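
To make the `fit`/`transform` and `fit`/`predict` contrast concrete, here is a small sketch on hypothetical toy data (the arrays below are made up for illustration and are not part of the notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data where y = 2 * x exactly
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])
X_test = np.array([[5.0]])

# Transformer: fit() learns the column min and max,
# transform() rescales the data to the [0, 1] range
scaler = MinMaxScaler()
scaler.fit(X_train)
X_scaled = scaler.transform(X_train)
print(X_scaled.ravel())

# Estimator: fit() learns the coefficients,
# predict() returns outputs for new inputs
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```

The transformer returns a modified copy of its input, with the same shape as `X_train`, while the model returns one predicted value per row of `X_test`.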

## Datasets in scikit-learn

In [5]:
from sklearn.datasets import load_iris

In [6]:
# Load the Iris dataset
iris = load_iris()
print("Features:", iris.feature_names)
print("Target Classes:", iris.target_names)

# Access data and target
X, y = iris.data, iris.target

Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target Classes: ['setosa' 'versicolor' 'virginica']


In [7]:
X.shape, y.shape

((150, 4), (150,))

## Creating Custom Datasets

`sklearn.datasets` can also generate synthetic datasets:

- `make_classification`: Generate a classification dataset.
- `make_regression`: Generate a regression dataset.
- `make_blobs`: Generate clusters of points for clustering.

In [8]:
from sklearn.datasets import make_classification

In [9]:
# Generate classification dataset
X, y = make_classification(n_samples=100, n_features=4, n_classes=2, random_state=42)

In [10]:
X.shape, y.shape

((100, 4), (100,))
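
The other two generators listed above work the same way. A brief sketch follows; the parameter values (`noise=10.0`, `centers=3`) and the variable names are arbitrary choices for illustration, picked so they do not overwrite the `X, y` created above:

```python
from sklearn.datasets import make_blobs, make_regression

# Regression dataset: continuous target with some Gaussian noise added
X_reg, y_reg = make_regression(
    n_samples=100, n_features=4, noise=10.0, random_state=42
)

# Clustering dataset: 2-D points grouped around 3 centers;
# the second return value is the cluster index of each point
X_blobs, labels = make_blobs(
    n_samples=100, n_features=2, centers=3, random_state=42
)

print(X_reg.shape, y_reg.shape)
print(X_blobs.shape, labels.shape)
```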

## Basic Tools in scikit-learn
scikit-learn provides several tools to simplify machine learning workflows.

### Key Tools:
- `train_test_split`: Split data into training and testing sets.
- `GridSearchCV`: Perform hyperparameter tuning using cross-validation.
- `sklearn.metrics`: Evaluate model performance with measures such as accuracy, precision, and recall.

In [11]:
# Example: Splitting Data and Evaluating a Model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate a model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.9
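
`GridSearchCV` and the other metrics from the list above are not demonstrated in the cell above. As a rough sketch, assuming the same synthetic data and split, a small parameter grid (the grid values and `cv=3` are hypothetical choices, not tuned recommendations) could be searched like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Same synthetic data and split as in the cells above
X, y = make_classification(n_samples=100, n_features=4, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical grid: every combination is evaluated with 3-fold cross-validation
param_grid = {"n_estimators": [10, 50], "max_depth": [2, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)

# After fitting, the search object predicts with the best model,
# refit on the whole training set (the default refit=True behaviour)
y_pred = search.predict(X_test)
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
```

The test set is kept out of the search entirely, so the precision and recall printed at the end are measured on data the tuning procedure never saw.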


## Main Modules

### Modules that we will use in this bootcamp
- `sklearn.datasets`: Provides utilities to load and generate datasets.
- `sklearn.model_selection`: Tools for splitting data, cross-validation, and hyperparameter tuning.
- `sklearn.preprocessing`: Functions for data preprocessing, such as scaling and encoding.
- `sklearn.metrics`: Metrics for evaluating model performance.
- `sklearn.ensemble`: Ensemble methods like Random Forest and Gradient Boosting.
- `sklearn.linear_model`: Linear models for regression and classification.
- `sklearn.svm`: Support Vector Machines for classification and regression.

In [12]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [14]:
# 1. Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Preprocess data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4. Train a model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# 5. Evaluate the model
y_pred = model.predict(X_test)
print("Accuracy: ", accuracy_score(y_test, y_pred))

Accuracy:  1.0
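
The same workflow carries over to the other modules listed above. As an illustrative sketch, swapping in `SVC` from `sklearn.svm` and wrapping the scaling step in a `Pipeline` (a choice made for this example, using default `SVC` settings) might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same dataset and split as in the workflow above
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling and the SVM classifier combined into one object
svm_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", SVC()),
])
svm_pipeline.fit(X_train, y_train)

y_pred = svm_pipeline.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

Only the model step changes; loading, splitting, and evaluation stay the same, which is the main benefit of scikit-learn's shared `fit`/`predict` interface.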
