## I. Introduction & Core Concepts

This section covers the fundamentals of the Scikit-learn library.

### 1. What is Scikit-learn?

* **Purpose:** Scikit-learn (imported as `sklearn`) is one of the most widely used Python libraries for "traditional" machine learning (i.e., non-deep learning). It provides tools for building ML models, including algorithms for classification, regression, clustering, and dimensionality reduction, as well as utilities for preprocessing data, selecting features, evaluating models, and tuning hyperparameters.
* **Strengths:**
    * **Wide Range of Algorithms:** Covers most standard ML tasks.
    * **Preprocessing Tools:** Excellent utilities for cleaning and transforming data (scaling, encoding, imputation).
    * **Model Evaluation:** Robust methods for assessing model performance and generalizability (cross-validation, metrics).
    * **Consistent API:** A uniform interface across different algorithms makes it easy to swap models.
    * **Efficiency:** Many algorithms are optimized using `NumPy`, `SciPy`, and Cython.
    * **Great Documentation:** Extensive user guides, examples, and API references.
* **Dependencies:** Built on Python's core scientific libraries: `NumPy` (array manipulation) and `SciPy` (scientific computing). It is commonly used alongside `Matplotlib`/`Seaborn` for visualization and `Pandas` for data handling.

### 2. Installation

If scikit-learn is not already installed, you can typically install it with pip:

```bash
pip install scikit-learn
```

(Note: The import name is `sklearn`, but the package name for installation is `scikit-learn`).
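
A quick way to confirm the installation is to import the package and print its version. This is a minimal check; the variable name `installed_version` is only illustrative.

```python
import sklearn

# The import name is `sklearn`, even though the package is installed as `scikit-learn`.
installed_version = sklearn.__version__
print(f"scikit-learn version: {installed_version}")
```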

### 3. Key Design Principles

Scikit-learn follows several key principles that make it user-friendly and powerful:

* **Consistency:** All objects share a common, simple interface centered around the Estimator API (see below).
* **Inspection:** Model parameters and results are stored as public attributes on the estimator objects after fitting (e.g., `model.coef_`).
* **Sensible Defaults:** Algorithms have default parameter values that work reasonably well in many cases, making it easy to get started.
* **Composition:** Easily combine multiple steps (like preprocessing and modeling) using tools like `Pipeline`.
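
The four principles above can be seen together in one short sketch. The choice of `StandardScaler`, `LogisticRegression`, the step names `"scale"`/`"clf"`, and `max_iter=1000` are assumptions made for illustration, not the only way to do this.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Composition: chain a preprocessing step and a model into one estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Sensible defaults: hyperparameters we did not set keep their default values.
default_params = pipe.named_steps["clf"].get_params()
print(f"Default C: {default_params['C']}")

# Consistency: the pipeline exposes fit/predict like any other estimator.
pipe.fit(X, y)

# Inspection: learned values live in attributes whose names end with an underscore.
coef = pipe.named_steps["clf"].coef_
print(f"Coefficient matrix shape: {coef.shape}")  # (n_classes, n_features)
```

Because the pipeline is itself an estimator, it can be passed anywhere a single model is expected, such as cross-validation or parameter search utilities.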

### 4. The Estimator API

This is the heart of Scikit-learn's consistency. Most objects in the library are "estimators" and follow this pattern:

* **Instantiation:** You create an instance of an estimator class, setting its hyperparameters (parameters not learned from data).
  ```python
  from sklearn.linear_model import LinearRegression
  model = LinearRegression(fit_intercept=True)
  ```
* **Fitting:** The estimator learns from the data using the `fit()` method.
    * Supervised learning (needs features `X` and target `y`): `estimator.fit(X, y)`
    * Unsupervised learning (needs only features `X`): `estimator.fit(X)`
* **Prediction/Transformation:** Once fitted, the estimator can perform tasks:
    * `estimator.predict(X_new)`: Make predictions on new data (Classification/Regression).
    * `estimator.predict_proba(X_new)`: Get probability estimates for each class (Classification).
    * `estimator.transform(X_new)`: Apply data transformation (Preprocessing/Dimensionality Reduction).
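    * `estimator.fit_transform(X)`: A convenience method that fits and transforms the same data in one call (equivalent to `fit` followed by `transform`, and sometimes more efficient). Use it on training data; apply plain `transform` to new data so the learned parameters are reused.
* **Evaluation:** Many estimators have a `score()` method that returns a default evaluation metric (mean accuracy for classifiers, R² for regressors).
    * `estimator.score(X_test, y_test)` (Supervised)
    * `estimator.score(X_test)` (Unsupervised; the metric depends on the estimator)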
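
The methods listed above can be exercised end to end in a short sketch. The specific estimators (`StandardScaler`, `LogisticRegression`, `KMeans`) and settings such as `max_iter=1000` and `n_init=10` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformer: fit() learns per-feature mean and scale, transform() applies them.
scaler = StandardScaler()
X_scaled = scaler.fit(X).transform(X)
# fit_transform() produces the same result in a single call.
X_scaled_once = StandardScaler().fit_transform(X)
print(np.allclose(X_scaled, X_scaled_once))

# Supervised estimator: fit(X, y), then predict / predict_proba / score.
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)
labels = clf.predict(X_scaled[:3])
probabilities = clf.predict_proba(X_scaled[:3])  # one row per sample, one column per class
print(labels, probabilities.shape)
print(f"Training accuracy: {clf.score(X_scaled, y):.3f}")

# Unsupervised estimator: fit(X) only. KMeans.score returns the negative inertia,
# so values closer to zero indicate tighter clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(f"KMeans score: {kmeans.score(X_scaled):.2f}")
```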

### 5. Data Representation

Scikit-learn expects data primarily in these formats:

* **Features (`X`):** A 2D array-like structure (`NumPy` array, `Pandas` `DataFrame`, `SciPy` sparse matrix) of shape `(n_samples, n_features)`, where rows represent samples and columns represent features. Values should generally be numeric, so categorical features usually need to be encoded first.
* **Target (`y`):** A 1D array-like structure (`NumPy` array, `Pandas` `Series`) of shape `(n_samples,)` containing the target values (numbers for regression, class labels for classification), one per row of `X`.
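
A common stumbling block is a single feature stored as a 1D array. The sketch below uses hypothetical toy data (`hours`, `scores`) to show the reshape that turns it into the 2D layout expected for `X`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one feature, four samples, with score = 2 * hours.
hours = np.array([1.0, 2.0, 3.0, 4.0])
scores = np.array([2.0, 4.0, 6.0, 8.0])

# A single feature must still be 2D: shape (n_samples, 1), not (n_samples,).
X = hours.reshape(-1, 1)
y = scores
print(f"X shape: {X.shape}, y shape: {y.shape}")

model = LinearRegression().fit(X, y)

# New data follows the same rule: one row per sample, one column per feature.
X_new = np.array([[5.0]])
prediction = model.predict(X_new)
print(f"Predicted score for 5 hours: {prediction[0]:.1f}")
```

With a `DataFrame`, selecting columns with a list (for example `df[["hours"]]`) keeps the result 2D, whereas `df["hours"]` returns a 1D `Series`.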

### 6. Loading Example Datasets

Scikit-learn includes several small standard datasets within the `sklearn.datasets` module, useful for examples and testing.

```python
# --- Quick Example ---
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# 1. Load Data
iris = load_iris()
X = iris.data # Features (NumPy array)
y = iris.target # Target (NumPy array)
print(f"Feature shape: {X.shape}") # (150 samples, 4 features)
print(f"Target shape: {y.shape}") # (150 samples,)
print(f"Target names: {iris.target_names}") # ['setosa', 'versicolor', 'virginica']

# 2. Instantiate Estimator
knn = KNeighborsClassifier(n_neighbors=3)

# 3. Fit Estimator
knn.fit(X, y)
print(f"\nModel fitted: {knn}")

# 4. Predict on new (or existing) data
# Using first 5 samples as example 'new' data
X_new = X[:5]
predictions = knn.predict(X_new)
print(f"\nPredictions for first 5 samples: {predictions}") # Should predict class 0
print(f"Predicted class names: {iris.target_names[predictions]}")

# 5. Evaluate (using default score - accuracy for classifiers)
accuracy = knn.score(X, y) # Evaluate on the training data (not ideal, just for demo)
print(f"\nModel accuracy on training data: {accuracy:.4f}")
# --------------------
```
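
The example above scores the model on the same data it was trained on, which tends to overstate performance. A more realistic sketch holds out part of the data with `train_test_split`; the 25% test size, `random_state=42`, and stratification are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples; stratify=y keeps class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit only on the training split.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Score on samples the model has not seen during fitting.
test_accuracy = knn.score(X_test, y_test)
print(f"Accuracy on held-out data: {test_accuracy:.4f}")
```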