Scikit-learn

"""
📚 Full Scikit-Learn (sklearn) Interview Guide
---

1. What is Scikit-Learn?

- Scikit-Learn (sklearn) is a widely used open-source Python library for:
  - Machine Learning (ML) algorithms
  - Data preprocessing
  - Model selection and evaluation
  - Pipelines (end-to-end workflows)

- Key features:
  - Consistent, easy-to-use API (`fit`, `predict`, `transform`)
  - Many built-in models (classification, regression, clustering, etc.)
  - Integrates well with NumPy, pandas, and Matplotlib

---

2. Installation

Use pip:
    pip install scikit-learn

---

3. Basic Workflow (VERY IMPORTANT)

Every project with Scikit-Learn typically follows this pattern:

    1. Load Data
    2. Split Data (train/test)
    3. Preprocess Data (fit preprocessing on the training split only)
    4. Choose a Model
    5. Train the Model
    6. Predict on new data
    7. Evaluate Performance
    8. Tune Hyperparameters (optional)
    9. (Optional) Save the Model

---

4. Practical Hands-On Example

Dataset: Iris (150 flowers, 4 features, 3 classes; bundled with scikit-learn)
"""

# 1. Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import joblib

"""
Explanation:
- `load_iris` loads the built-in Iris dataset (flowers, 3 classes).
- `train_test_split` splits the data into training and test sets.
- `StandardScaler` standardizes each feature by subtracting its mean and scaling to unit variance.
- `LogisticRegression` is a linear classification model.
- `accuracy_score` and `classification_report` measure how well the model predicts.
- `GridSearchCV` tunes hyperparameters using cross-validation.
- `joblib` saves the trained model to disk.
"""

# 2. Load data
iris = load_iris()
X = iris.data
y = iris.target

"""
X = features (sepal length, sepal width, petal length, petal width, all in cm)
y = labels (0 = setosa, 1 = versicolor, 2 = virginica)
"""

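"""
A quick way to see what X and y contain is to inspect the loaded dataset.
This is a minimal sketch; the names `iris_demo`, `X_demo`, and `y_demo` are
chosen here only to avoid overwriting the variables above.
"""

```python
from sklearn.datasets import load_iris

iris_demo = load_iris()
X_demo, y_demo = iris_demo.data, iris_demo.target

print(X_demo.shape)             # (150, 4): 150 flowers, 4 measurements each
print(iris_demo.feature_names)  # the four sepal/petal measurements, in cm
print(iris_demo.target_names)   # ['setosa' 'versicolor' 'virginica']
print(sorted(set(y_demo)))      # [0, 1, 2]: integer codes for the classes
```
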
# 3. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

"""
- 70% training data (105 of the 150 samples)
- 30% testing data (45 samples)
- random_state=42 fixes the shuffle, so the same split is produced on every run
"""

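"""
The sketch below checks the split sizes and the reproducibility claim.
It also passes `stratify=...`, which is an addition not used in the code
above: it keeps the class proportions equal in both splits. Variable names
are illustrative.
"""

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)

# Same arguments as above, plus stratify (an assumption of this sketch).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42, stratify=y_all
)
print(X_tr.shape, X_te.shape)  # (105, 4) (45, 4)
print(np.bincount(y_te))       # [15 15 15]: each class equally represented

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42, stratify=y_all
)
print(np.array_equal(X_te, X_te2))  # True
```
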
# 4. Preprocessing (feature scaling)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

"""
- `fit_transform` learns the mean and standard deviation from the training data only (important to avoid data leakage)
- `transform` then applies those same training statistics to the test set
"""

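"""
To make the "fit on training data only" rule concrete, this sketch checks
the statistics after scaling and reproduces `transform` by hand from the
scaler's learned `mean_` and `scale_`. Variable names are illustrative.
"""

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42
)

sc = StandardScaler()
X_tr_s = sc.fit_transform(X_tr)  # learns mean/std from the training split
X_te_s = sc.transform(X_te)      # reuses the training mean/std

# Training columns end up with mean ~0 and standard deviation ~1.
print(np.allclose(X_tr_s.mean(axis=0), 0.0))
print(np.allclose(X_tr_s.std(axis=0), 1.0))

# The test columns are only approximately centered, which is expected.
print(X_te_s.mean(axis=0).round(2))

# transform() is just (x - training mean) / training std.
manual = (X_te - sc.mean_) / sc.scale_
print(np.allclose(manual, X_te_s))
```
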
# 5. Choose and Train model
model = LogisticRegression()
model.fit(X_train, y_train)

"""
- The `fit` method trains the model: it learns the coefficients from the training data
"""

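"""
After `fit`, the learned parameters are stored on the estimator in
attributes that end with an underscore. A minimal sketch (the name
`clf_demo` is illustrative):
"""

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42
)
X_tr_s = StandardScaler().fit_transform(X_tr)

clf_demo = LogisticRegression().fit(X_tr_s, y_tr)

print(clf_demo.classes_)          # [0 1 2]: the labels seen during training
print(clf_demo.coef_.shape)       # (3, 4): one weight per class and feature
print(clf_demo.intercept_.shape)  # (3,): one bias term per class
```
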
# 6. Predict
y_pred = model.predict(X_test)

"""
- The `predict` method returns one predicted class label for each row of the test set
"""

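"""
Besides hard labels, `LogisticRegression` can return class probabilities
through `predict_proba`. The sketch below shows that each row sums to 1 and
that `predict` picks the most probable class. Variable names are illustrative.
"""

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42
)
sc = StandardScaler().fit(X_tr)
clf_demo = LogisticRegression().fit(sc.transform(X_tr), y_tr)

proba = clf_demo.predict_proba(sc.transform(X_te))
labels = clf_demo.predict(sc.transform(X_te))

print(proba.shape)                           # (45, 3): one column per class
print(np.allclose(proba.sum(axis=1), 1.0))   # True: rows are distributions
# The predicted label is the class with the highest probability.
print(np.array_equal(clf_demo.classes_[proba.argmax(axis=1)], labels))
```
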
# 7. Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

"""
- Accuracy: proportion of correctly predicted labels
- Classification report: per-class precision, recall, and F1-score, plus support (the number of true samples in each class)
"""

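"""
The metrics are easier to remember after computing them once by hand.
This sketch uses a tiny made-up set of labels (not the Iris results) so the
numbers can be verified on paper.
"""

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels: one class-1 sample is misclassified as class 2.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 1, 2, 2, 2])

# Accuracy = correct predictions / all predictions = 5 / 6.
print(accuracy_score(y_true, y_hat), np.mean(y_true == y_hat))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
print(cm)  # [[2 0 0], [0 1 1], [0 0 2]]

# Precision: of the samples predicted as class k, how many really are k?
precision = precision_score(y_true, y_hat, average=None)  # [1, 1, 2/3]
# Recall: of the samples that really are class k, how many were found?
recall = recall_score(y_true, y_hat, average=None)        # [1, 1/2, 1]
print(precision, recall)
```
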
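# 8. Tune Hyperparameters (Optional)
# Note: newer scikit-learn releases may warn about (or reject) 'liblinear' on
# multiclass targets such as Iris; if that happens, drop it from the grid.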
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'solver': ['liblinear', 'lbfgs']
}

grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best Parameters from GridSearchCV:", grid_search.best_params_)

"""
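- GridSearchCV tries every combination in `param_grid` (here 4 values of C x 2 solvers = 8 candidates)
- Each candidate is scored with 5-fold cross-validation (cv=5), so the choice does not hinge on a single train/validation split
- By default, the best candidate is then refit on all of the training data and exposed as `best_estimator_`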
"""

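"""
One common refinement, shown here as a sketch rather than as the approach
used above, is to put the scaler and the model in a `Pipeline` and tune the
pipeline. The scaler is then refit inside every cross-validation fold, so
no fold sees statistics from its own validation data. The step names
"scaler" and "clf", the reduced grid, and `max_iter=1000` are choices made
for this example.
"""

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
pipe_grid = {"clf__C": [0.01, 0.1, 1, 10]}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=5)
pipe_search.fit(X_tr, y_tr)  # raw features: scaling happens inside each fold

print("Best parameters:", pipe_search.best_params_)
print("Mean CV accuracy of best candidate:", pipe_search.best_score_)
print("Held-out test accuracy:", pipe_search.score(X_te, y_te))
```
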
# 9. Save the Model (Optional)
joblib.dump(grid_search.best_estimator_, "best_logistic_model.pkl")

"""
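- Saves the best model found by the grid search to disk using joblib
- Can later load it with `model = joblib.load("best_logistic_model.pkl")`
- The fitted `scaler` is not part of this file: new data must be scaled with the same scaler before calling `predict`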
"""

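"""
A saved model is only useful if the reloaded copy behaves identically and
receives data scaled the same way. The sketch below saves a scaler + model
pipeline as one object and checks the round trip. The file name
"iris_pipeline.joblib" and the temporary directory are assumptions made for
this example.
"""

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)

# Bundling the scaler with the model means one file holds both.
bundle = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
bundle.fit(X_all, y_all)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iris_pipeline.joblib")  # hypothetical name
    joblib.dump(bundle, path)
    restored = joblib.load(path)

# The reloaded pipeline accepts raw (unscaled) features directly.
same = np.array_equal(bundle.predict(X_all), restored.predict(X_all))
print("Reloaded pipeline gives identical predictions:", same)
```

"""
Only load joblib/pickle files from sources you trust: loading such a file
can execute arbitrary code.
"""
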
