# Learning Scikit-learn

Scikit-learn is a powerful machine learning library in Python that provides tools for data preprocessing, model training, model selection, and evaluation. Here's an overview of its core modules and their functionalities:

## 1. Preprocessing
### Module: sklearn.preprocessing

- **Purpose**: Prepare data for machine learning algorithms by standardizing, normalizing, or encoding it.
- **Key Tools**:
>- **StandardScaler**: Standardizes features by removing the mean and scaling to unit variance.
>- **MinMaxScaler**: Scales features to a specific range, usually [0, 1].
>- **OneHotEncoder**: Converts categorical variables into a one-hot numeric array.
>- **LabelEncoder**: Converts labels into integers.
>- **PolynomialFeatures**: Generates polynomial and interaction features.

## 2. Feature Selection
### Module: sklearn.feature_selection

- **Purpose**: Identify and select the most relevant features in your dataset.
- **Key Tools**:
>- **SelectKBest**: Selects the top k features based on statistical tests.
>- **RFE (Recursive Feature Elimination)**: Recursively removes the least important features.
>- **VarianceThreshold**: Removes features with low variance.
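
As a rough sketch of how these three selectors might be used together: the synthetic dataset from `make_classification`, the appended constant column, and the choice of keeping 3 features are illustrative assumptions, not part of the overview above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 10 features, only 3 of them informative
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# Append a constant column so VarianceThreshold has something to remove
X_const = np.hstack([X, np.ones((X.shape[0], 1))])

# VarianceThreshold: drop zero-variance features (default threshold is 0.0)
vt = VarianceThreshold()
X_vt = vt.fit_transform(X_const)

# SelectKBest: keep the 3 features with the highest ANOVA F-scores
skb = SelectKBest(score_func=f_classif, k=3)
X_kbest = skb.fit_transform(X_vt, y)

# RFE: refit the model repeatedly, dropping the weakest feature until 3 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
X_rfe = rfe.fit_transform(X_vt, y)

print("With constant column:", X_const.shape)
print("After VarianceThreshold:", X_vt.shape)
print("After SelectKBest:", X_kbest.shape)
print("After RFE:", X_rfe.shape)
```

`SelectKBest` scores each feature independently, while `RFE` relies on the fitted model's coefficients, so the two may keep different columns.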

## 3. Dimensionality Reduction
### Module: sklearn.decomposition

- **Purpose**: Reduce the number of features while retaining important information.
- **Key Tools**:
>- **PCA (Principal Component Analysis)**: Reduces dimensionality by finding the principal components.
>- **TruncatedSVD**: Performs dimensionality reduction on sparse matrices.
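
A minimal sketch of both reducers, assuming the bundled Iris dataset as example data and 2 output components (both choices are illustrative). `TruncatedSVD` is given a SciPy sparse matrix here only to show that it accepts one.

```python
from scipy.sparse import csr_matrix
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD

X = load_iris().data  # 150 samples, 4 features

# PCA: project onto the 2 directions of largest variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("PCA output shape:", X_pca.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# TruncatedSVD: works directly on sparse input and does not center the data
X_sparse = csr_matrix(X)
svd = TruncatedSVD(n_components=2, random_state=0)
X_svd = svd.fit_transform(X_sparse)
print("TruncatedSVD output shape:", X_svd.shape)
```

The `explained_variance_ratio_` attribute is a quick way to judge how much information the kept components retain.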

## 4. Model Selection and Validation
### Module: sklearn.model_selection

- **Purpose**: Split data into training and testing sets, and perform cross-validation.
- **Key Tools**:
>- **train_test_split**: Splits data into training and test sets.
>- **GridSearchCV**: Performs hyperparameter tuning using exhaustive search.
>- **RandomizedSearchCV**: Performs hyperparameter tuning using randomized search.
>- **KFold**: Splits data for k-fold cross-validation.
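
The tools above are often combined. The sketch below assumes the Iris dataset, a `LogisticRegression` model, and a small grid over its `C` parameter; all three are illustrative choices rather than recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for a final check
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 5-fold cross-validation on the training portion only
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Exhaustive search over a small, illustrative grid
param_grid = {"C": [0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=cv)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", search.best_score_)
print("Held-out test accuracy:", test_accuracy)
```

Keeping the test set out of the search means the final score is measured on data that played no part in choosing the hyperparameters.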

## 5. Classification
### Modules: sklearn.linear_model, sklearn.svm, sklearn.tree, sklearn.naive_bayes, etc.

- **Purpose**: Algorithms for classifying data into categories.
- **Key Models**:
>- **Logistic Regression**: From sklearn.linear_model.
>- **Support Vector Machines (SVM)**: From sklearn.svm.
>- **Decision Trees**: From sklearn.tree.
>- **Naive Bayes**: From sklearn.naive_bayes.
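
All of these classifiers share the same `fit` / `predict` / `score` interface. A short sketch, assuming the Iris dataset and default hyperparameters (apart from `max_iter` and `random_state`), which are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# One instance of each model family listed above
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Fit each model and record its accuracy on the held-out data
accuracies = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    accuracies[name] = clf.score(X_test, y_test)
    print(f"{name}: {accuracies[name]:.3f}")
```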

## 6. Regression
### Modules: sklearn.linear_model, sklearn.svm, etc.

- **Purpose**: Algorithms for predicting continuous values.
- **Key Models**:
>- **Linear Regression**: From sklearn.linear_model.
>- **Ridge/Lasso Regression**: Regularized regression models.
>- **SVR (Support Vector Regression)**: From sklearn.svm.

## 7. Clustering
### Module: sklearn.cluster

- **Purpose**: Group similar data points together without predefined labels.
- **Key Models**:
>- **KMeans**: Clusters data into k clusters.
>- **DBSCAN**: Groups data based on density.
>- **AgglomerativeClustering**: Hierarchical clustering method.
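
A minimal sketch of the three clustering models on synthetic blobs. The data from `make_blobs` and the values of `eps` and `min_samples` are assumptions chosen for illustration; in practice they need tuning for each dataset.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

# Synthetic data (assumption): 3 compact, well-separated groups
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# KMeans: the number of clusters must be given up front
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: finds dense regions; the label -1 marks points treated as noise
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# AgglomerativeClustering: merges points bottom-up until 3 clusters remain
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

print("KMeans clusters:", sorted(set(kmeans_labels)))
print("DBSCAN labels:", sorted(set(dbscan_labels)))
print("Agglomerative clusters:", sorted(set(agglo_labels)))
```

Unlike the other two, DBSCAN decides the number of clusters itself based on density.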

## 8. Ensemble Methods
### Module: sklearn.ensemble

- **Purpose**: Combine multiple models for better performance.
- **Key Models**:
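>- **RandomForestClassifier / RandomForestRegressor**: Ensembles of decision trees whose predictions are averaged or voted on.
>- **GradientBoostingClassifier / GradientBoostingRegressor**: Build trees sequentially, each one correcting the errors of the previous ones.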
>- **VotingClassifier**: Combines predictions from multiple models.
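
A sketch comparing the ensemble models with 5-fold cross-validation. The Iris dataset, the hyperparameters, and the choice of hard voting are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
gb = GradientBoostingClassifier(random_state=0)
lr = LogisticRegression(max_iter=1000)

# Hard voting: each model casts one vote and the majority class wins
voter = VotingClassifier(
    estimators=[("rf", rf), ("gb", gb), ("lr", lr)], voting="hard"
)

# Mean cross-validated accuracy for each model
scores = {}
for name, model in [("Random Forest", rf), ("Gradient Boosting", gb), ("Voting", voter)]:
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```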

## 9. Metrics
### Module: sklearn.metrics

- **Purpose**: Evaluate model performance.
- **Key Tools**:
>- **accuracy_score**, **f1_score**, **precision_score**, **recall_score** for classification.
>- **mean_squared_error**, **r2_score** for regression.
>- **confusion_matrix** and **roc_auc_score** for detailed performance analysis.
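
The classification metrics can be checked by hand on a tiny example. The labels and scores below are made up for illustration: there are 3 true positives, 1 false negative, 0 false positives, and 2 true negatives.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical labels: the third sample is a positive predicted as negative
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]
# Hypothetical predicted probabilities for the positive class
y_score = [0.1, 0.9, 0.4, 0.2, 0.8, 0.7]

acc = accuracy_score(y_true, y_pred)    # 5 of 6 predictions are correct
prec = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 3
rec = recall_score(y_true, y_pred)      # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
auc = roc_auc_score(y_true, y_score)    # uses scores, not hard predictions

print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
print("F1:", f1)
print("Confusion matrix:\n", cm)
print("ROC AUC:", auc)
```

Note that `roc_auc_score` takes scores or probabilities rather than class predictions. Here every positive sample has a higher score than every negative one, so the AUC is 1.0 even though one hard prediction is wrong.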

## 10. Pipeline
### Module: sklearn.pipeline

- **Purpose**: Combine preprocessing steps and modeling into a single workflow.
- **Key Tools**:
>- **Pipeline**: Chains multiple steps together.
>- **FeatureUnion**: Combines multiple feature extraction methods.
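
A sketch of `Pipeline` and `FeatureUnion` working together. The Iris dataset, the combination of `PCA` with `SelectKBest`, and the step names are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# FeatureUnion: run both extractors on the same input and concatenate the columns
features = FeatureUnion([
    ("pca", PCA(n_components=2)),   # 2 principal components
    ("kbest", SelectKBest(k=1)),    # the single best original feature
])

# Pipeline: scale, then build the combined features, then classify
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Slicing off the final step exposes the transformed features
X_combined = pipe[:-1].transform(X)
print("Combined feature shape:", X_combined.shape)
print("Training accuracy:", pipe.score(X, y))
```

Because the whole chain is one estimator, it can be passed to `cross_val_score` or `GridSearchCV` so that preprocessing is refit inside each fold.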

---

# Linear Regression
Linear regression is one of the most fundamental and widely used machine learning algorithms for predicting a continuous target variable. It models the relationship between independent variables (features) and a dependent variable (target) by fitting a linear equation to the data.

### Equation:
$y=w_1x_1+w_2x_2+...+w_nx_n+b$,
where:
- $y$: Predicted value.
- $x_1, x_2, ...,x_n$: Features.
- $w_1, w_2, ..., w_n$: Weights (coefficients) to be learned.
- $b$: Bias term (intercept).

## Objective:
Minimize the error between predicted and actual values. This is achieved by minimizing a loss function, commonly the Mean Squared Error (MSE):
$$ MSE=\frac{1}{n}\sum\limits _{i=1}^{n}(y_i-\hat{y_i})^2$$

## Implementation with Scikit-Learn
Here’s how to implement linear regression in Python using scikit-learn:

### Step 1: Import Libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

### Step 2: Prepare Data
Let’s assume you have a dataset data.csv:

In [None]:
# Load the dataset
df = pd.read_csv("data.csv")

# Features (X) and Target (y)
X = df[['feature1', 'feature2']]  # Replace with your features
y = df['target']  # Replace with your target variable

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Step 3: Train the Model

In [None]:
# Create the model
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)

# Coefficients and intercept
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

### Step 4: Make Predictions

In [None]:
# Predict on test data
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared:", r2)

## Visualize the Results
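For a model trained on a single feature, you can visualize the regression line (with several features, as in the example above, predictions plotted against one feature will not fall on a straight line):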

In [None]:
import matplotlib.pyplot as plt

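# Sort by the feature so the predicted line is drawn left to right
# (np was imported in Step 1)
order = np.argsort(X_test['feature1'].to_numpy())
x_sorted = X_test['feature1'].to_numpy()[order]

# Actual values as points, predictions as a line
plt.scatter(x_sorted, y_test.to_numpy()[order], color='blue', label='Actual')
plt.plot(x_sorted, y_pred[order], color='red', label='Predicted')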
plt.xlabel('Feature 1')
plt.ylabel('Target')
plt.legend()
plt.show()

---

# Preprocessing

Preprocessing in machine learning involves transforming raw data into a clean, consistently scaled form that is suitable for training models. In scikit-learn, the **sklearn.preprocessing** module provides most of these tools; imputation of missing values lives in **sklearn.impute**.

## Why Preprocess?
**1. Consistency**: Ensure all features are on the same scale.

**2. Handling Missing Data**: Replace missing values appropriately.

**3. Encoding Categorical Data**: Convert categorical variables to numeric format.

**4. Improving Model Performance**: Optimize data for better learning.

**5. Reducing Noise**: Clean data to focus on relevant patterns.

## Key Preprocessing Techniques
Here’s a breakdown of the most common preprocessing techniques in scikit-learn:

### 1. Scaling Features
Scaling brings all features to a comparable range. Without it, features with larger ranges can dominate models that rely on distances or gradient-based optimization (e.g., KNN, SVM, regularized linear models); tree-based models are largely unaffected.

#### StandardScaler
- Standardizes features by removing the mean and scaling to unit variance. After scaling, the data will have a mean of 0 and a standard deviation of 1.

- Formula: $Z=\frac{x-mean}{std}$

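- **When to use:**
When features are roughly **Gaussian (normally distributed)**, or when the model expects zero-centered features with comparable variance (e.g., SVM, regularized linear models).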

In [9]:
from sklearn.preprocessing import StandardScaler

# Example data
X = [[1, 10], [2, 15], [3, 25]]

# Apply StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)

[[-1.22474487 -1.06904497]
 [ 0.         -0.26726124]
 [ 1.22474487  1.33630621]]


#### MinMaxScaler
- Scales features to a specific range, typically [0, 1].

- Formula: $x^\prime = \frac{x-min}{max-min}$

- **When to use:**
When you want all features to be within a specific range, useful for algorithms like **neural networks**.

In [13]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
X_min_max_scaled = min_max_scaler.fit_transform(X)

print(X_min_max_scaled)

[[0.         0.        ]
 [0.5        0.33333333]
 [1.         1.        ]]


### 2. Encoding Categorical Variables
Many machine learning algorithms require categorical variables to be converted into numeric representations.

#### OneHotEncoder
- Converts categorical variables into a one-hot (binary) format. Each category gets its own column.

In [2]:
from sklearn.preprocessing import OneHotEncoder

# Example categorical data
data = [['red'], ['blue'], ['green']]

encoder = OneHotEncoder()
encoded = encoder.fit_transform(data).toarray()

print(encoded)

[[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]


**Output**:

Each category is converted into its own binary column. The columns follow the alphabetically sorted categories (blue, green, red), so:

blue=[1,0,0], green=[0,1,0], red=[0,0,1]

#### LabelEncoder
- Converts categories into numeric labels. This is typically used for the **target variable**.

In [3]:
from sklearn.preprocessing import LabelEncoder

labels = ['cat', 'dog', 'mouse']

encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)

print(encoded_labels)

[0 1 2]


**Output**:


cat=0,dog=1,mouse=2

### 3. Imputation of Missing Values
Handling missing data is critical because most machine learning models cannot handle NaN values.

#### SimpleImputer
- Replaces missing values with a specified strategy (mean, median, most frequent, or a constant).

In [4]:
from sklearn.impute import SimpleImputer
import numpy as np

data2 = [[1, 2], [np.nan, 3], [7, 6]]

imputer = SimpleImputer(strategy='mean')
filled_data = imputer.fit_transform(data2)

print(filled_data)

[[1. 2.]
 [4. 3.]
 [7. 6.]]


### 4. Generating Polynomial Features
#### PolynomialFeatures
- Creates new features that are combinations of existing ones.
- Useful for polynomial regression or capturing non-linear relationships.
- **When to use**:
When you suspect non-linear relationships between features and the target.

In [5]:
from sklearn.preprocessing import PolynomialFeatures

# Example data
Y = [[2, 3]]

poly = PolynomialFeatures(degree=2)
Y_poly = poly.fit_transform(Y)

print(Y_poly)

[[1. 2. 3. 4. 6. 9.]]


Original features: $[x_1, x_2]$

Generated features: $[1, x_1, x_2, x_1^2, x_1x_2, x_2^2]$

### 5. Binarization
#### Binarizer
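- Converts numeric values to binary: values strictly greater than the threshold become 1, and all others (including values equal to the threshold, such as 2.0 in the example) become 0.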

In [6]:
from sklearn.preprocessing import Binarizer

data3 = [[1.5], [2.0], [3.5]]
binarizer = Binarizer(threshold=2.0)
binary_data = binarizer.fit_transform(data3)

print(binary_data)

[[0.]
 [0.]
 [1.]]


### 6. Normalization
#### Normalizer
- Rescales each sample (row), not each feature (column), to unit norm (L2 by default).
- Useful for text classification or clustering with sparse data.

In [7]:
from sklearn.preprocessing import Normalizer

data4 = [[1, 2, 3], [4, 5, 6]]
normalizer = Normalizer()
normalized_data = normalizer.fit_transform(data4)

print(normalized_data)

[[0.26726124 0.53452248 0.80178373]
 [0.45584231 0.56980288 0.68376346]]


**Output:**

Each row is normalized such that the sum of squared values equals 1.

## Preprocessing with Pipelines
To streamline preprocessing, you can use pipelines in scikit-learn to combine multiple steps:

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

# Example pipeline: Scaling + Polynomial Features
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2))
])

X_transformed = pipeline.fit_transform(X)
print(X_transformed)

[[ 1.         -1.22474487 -1.06904497  1.5         1.30930734  1.14285714]
 [ 1.          0.         -0.26726124  0.         -0.          0.07142857]
 [ 1.          1.22474487  1.33630621  1.5         1.63663418  1.78571429]]


In [None]:
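from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def train_models(X, y, degree):
    # Build a pipeline: polynomial feature expansion followed by linear regression
    features = PolynomialFeatures(degree=degree)
    lr = LinearRegression()

    model = make_pipeline(features, lr)
    # X is assumed to be a 1-D NumPy array; reshape it into a single-feature column
    model.fit(X.reshape(-1, 1), y)

    return model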

**model = make_pipeline(features, lr)**:

**make_pipeline**: A utility from sklearn.pipeline that chains preprocessing and model fitting into one estimator.
- It applies the PolynomialFeatures step first (to transform the features) and then fits the LinearRegression model on the result.
- The same transformation is reapplied automatically when calling predict, so the two steps always run in sequence.

## How to Choose Preprocessing Techniques?
- **Scaling**: Always scale features when your model depends on distance or gradients (e.g., SVM, KNN).
- **Encoding**: Use encoding for categorical data.
- **Imputation**: Fill missing values before training.
- **Normalization**: Use for data with varying magnitudes or when vector lengths matter.
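
These choices often differ by column. One way to apply them together is `ColumnTransformer` from `sklearn.compose`, which is not covered above. The sketch below is only an illustration: the toy DataFrame and its column names (`age`, `income`, `color`) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing numeric values and one categorical column
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 35.0],
    "income": [40000.0, 52000.0, np.nan, 61000.0],
    "color": ["red", "blue", "green", "blue"],
})

# Numeric columns: fill missing values with the mean, then standardize
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Route each group of columns to its own preprocessing
preprocess = ColumnTransformer(
    [
        ("num", numeric, ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_ready = preprocess.fit_transform(df)
print(X_ready.shape)  # 2 scaled numeric columns + 3 one-hot columns
```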

# Metrics

## 1. Mean Squared Error (MSE):
Mean Squared Error (MSE) is a commonly used metric to evaluate regression models. It measures the average squared difference between the actual values and the predicted values.

$$ MSE=\frac{1}{n}\sum\limits _{i=1}^{n}(y_i-\hat{y_i})^2$$


1. MSE penalizes large errors more than smaller ones because the errors are squared.
2. A lower MSE indicates a better fit of the model to the data.
3. MSE is always non-negative and has the same units as the square of the target variable.

### Using mean_squared_error from Scikit-learn
The **scikit-learn** library provides a built-in function for MSE, which simplifies the calculation:

In [2]:
from sklearn.metrics import mean_squared_error

# Actual and predicted values
y_actual = [3, -0.5, 2, 7]
y_predicted = [2.5, 0.0, 2, 8]

# Calculate MSE
mse = mean_squared_error(y_actual, y_predicted)

print("Mean Squared Error:", mse)

Mean Squared Error: 0.375


#### When to Use MSE
- **Regression Models**: Evaluate how well the model predicts continuous target variables.
- **Model Comparison**: Compare different models on the same test data; the one with the lower MSE fits that data better.

### Advantages
- Sensitive to large errors because of squaring, which can help identify poor predictions.
- Easy to compute and widely understood.
### Disadvantages
- Since MSE squares errors, it can overemphasize large deviations, making it sensitive to outliers.
- The value is in squared units, which can be harder to interpret compared to metrics like RMSE (Root Mean Squared Error).
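
To make the link between MSE and RMSE concrete, the sketch below recomputes the MSE by hand with NumPy and takes its square root. The values are the same small example used with `mean_squared_error` earlier in this section.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_actual = np.array([3, -0.5, 2, 7])
y_predicted = np.array([2.5, 0.0, 2, 8])

# MSE by hand: the mean of the squared differences
mse_manual = np.mean((y_actual - y_predicted) ** 2)

# The same value from scikit-learn
mse_sklearn = mean_squared_error(y_actual, y_predicted)

# RMSE: the square root brings the error back to the units of the target
rmse = np.sqrt(mse_sklearn)

print("Manual MSE:", mse_manual)
print("scikit-learn MSE:", mse_sklearn)
print("RMSE:", rmse)
```

The squared errors are 0.25, 0.25, 0 and 1, so the MSE is 1.5 / 4 = 0.375 and the RMSE is its square root, roughly 0.612.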

### Real-World Example
Suppose you train a polynomial regression model and want to calculate its MSE on the testing set:

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate synthetic data
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 5 * (X**2).flatten() + np.random.normal(0, 10, X.shape[0])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a polynomial regression model
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

model = LinearRegression()
model.fit(X_train_poly, y_train)

# Predict and calculate MSE
y_pred = model.predict(X_test_poly)
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error:", mse)

Mean Squared Error: 61.96330450394991


### Interpreting MSE
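- A **low MSE** indicates predictions close to the actual values, but "low" is relative: MSE depends on the scale of the target, so compare it against a baseline or the variance of the target.
- A **high MSE** suggests poor predictions. Comparing training and test MSE helps tell underfitting (both high) from overfitting (low training MSE, high test MSE).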

---

# Lasso and Ridge Regression

Lasso and Ridge are two types of regularized regression methods that help prevent overfitting by adding a penalty to the regression coefficients.

Both methods are useful when you have multicollinearity (high correlation between features) or when you want to control the complexity of a model.

## 1. Ridge Regression (L2 Regularization)

### Concept
Ridge regression modifies Linear Regression by adding a penalty term that is the sum of the squared values of the coefficients:
$$ \text{Loss} = \sum_{i=1}^{n}(y_i-\hat{y_i})^2 + \lambda\sum_{j} w_j^2 $$

- The **first term** is the sum of squared errors (the MSE without the $\frac{1}{n}$ factor).
- The **second term** $\lambda\sum_j w_j^2$ is the **L2 penalty** that **shrinks** the coefficients.
- $\lambda$ **(alpha in Python)** controls the regularization strength:
>- If $\lambda = 0$ → Ridge behaves like regular linear regression.
>- If **$\lambda$ is large** → The model forces coefficients to be **small**, reducing variance but possibly increasing bias.
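
The effect of the regularization strength can be seen by fitting Ridge with increasing `alpha`. This is a sketch on synthetic data: the true weights, noise level, and list of alphas are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (assumption): 3 features with known true weights
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=50)

# Baseline: ordinary least squares, i.e. no penalty
ols_norm = np.linalg.norm(LinearRegression().fit(X, y).coef_)
print(f"OLS: coefficient norm = {ols_norm:.4f}")

# Ridge with increasing alpha: the coefficient norm shrinks toward zero
alphas = [0.01, 1.0, 100.0, 10000.0]
ridge_norms = []
for alpha in alphas:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    ridge_norms.append(np.linalg.norm(coef))
    print(f"alpha={alpha}: coefficient norm = {ridge_norms[-1]:.4f}")
```

With a very small `alpha` the result is close to ordinary least squares, while a very large `alpha` pushes all coefficients toward zero without making any of them exactly zero.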

### When to Use Ridge?
- When all features contribute to the output, but some may have **small effects**.
- When you want to **reduce overfitting** but **keep all features** (shrink coefficients instead of removing them).

## 2. Lasso Regression (L1 Regularization)

### Concept
Lasso regression also adds a penalty but uses the **absolute values** of the coefficients:
$$ \text{Loss} = \sum_{i=1}^{n}(y_i-\hat{y_i})^2 + \lambda\sum_{j} |w_j| $$

- The **L1 penalty** ($\lambda\sum_j |w_j|$) can drive some coefficients to exactly **zero**.
- This means Lasso **performs feature selection** by removing some features completely.
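
A sketch of this feature-selection effect on synthetic data where only two of five features matter. The data, the true weights, and `alpha=0.5` are assumptions chosen so that the effect is easy to see.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): only the first two features influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))

print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)
print("Exactly-zero coefficients (Lasso, Ridge):", n_zero_lasso, n_zero_ridge)
```

On this data Lasso should set the three irrelevant coefficients to exactly zero and shrink the two relevant ones, while Ridge keeps small non-zero values for every feature.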

### When to Use Lasso?
- When you suspect **some features are irrelevant** and want to **automatically select important features**.
- When you need **a simpler, more interpretable model**.

## 3. Ridge vs. Lasso: Key Differences

|Feature | Ridge Regression (L2) | Lasso Regression (L1)|
|---|---|---|
|Penalty Type|$\sum w_j^2$| $\sum \lvert w_j \rvert$ |
|Effect on Coefficients|Shrinks them towards zero|Some coefficients become exactly zero|
|Feature Selection|No|Yes (removes irrelevant features)|
|Best for|Keeping all features but reducing their impact|Selecting only important features|

## 4. Implementing Ridge and Lasso in Python
We will use *Ridge* and *Lasso* from *sklearn.linear_model*.

### Example: Ridge Regression

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate sample data
np.random.seed(42)
X = 3 * np.random.rand(100, 1)
y = (5 + 2 * X + np.random.randn(100, 1)).ravel()  # y = 5 + 2X + noise, flattened to the 1-D target scikit-learn expects

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Ridge Regression
ridge_reg = Ridge(alpha=1.0)  # lambda (regularization strength)
ridge_reg.fit(X_train_scaled, y_train)

# Predict
y_pred = ridge_reg.predict(X_test_scaled)

# Coefficients
print("Ridge Coefficients:", ridge_reg.coef_)

Ridge Coefficients: [1.61038427]


### Example: Lasso Regression

In [2]:
from sklearn.linear_model import Lasso

# Train Lasso Regression
lasso_reg = Lasso(alpha=0.1)  # Larger alpha pushes more coefficients to exactly zero
lasso_reg.fit(X_train_scaled, y_train)

# Predict
y_pred_lasso = lasso_reg.predict(X_test_scaled)

# Coefficients
print("Lasso Coefficients:", lasso_reg.coef_)

Lasso Coefficients: [1.53051407]


## 5. Choosing Between Ridge and Lasso
### Use Ridge when:
- You **don't want to remove features**, just **reduce their effect**.
- You have **many correlated features**.
- You need a **stable model** that doesn't eliminate variables.

### Use Lasso when:
- You suspect **some features are irrelevant** and want to remove them.
- You prefer **a simpler, more interpretable model**.
- You need **automatic feature selection**.

### Use Elastic Net when:
- You want a balance between Ridge and Lasso: **Elastic Net** combines both penalties.
- Ridge keeps too many features, but Lasso removes too many.
- You have **highly correlated** features, where Lasso alone tends to keep one feature from a correlated group and drop the rest.

## 6. Elastic Net Regression
Elastic Net is a **hybrid** of **Ridge (L2) and Lasso (L1) regression** that combines both regularization techniques.
$$ \text{Loss} = \sum_{i=1}^{n}(y_i-\hat{y_i})^2 + \lambda_1\sum_{j} |w_j| + \lambda_2\sum_{j} w_j^2 $$

- The **first term** is the sum of squared errors (the MSE without the $\frac{1}{n}$ factor).
- The **L1 penalty** ($\lambda_1\sum_j |w_j|$) encourages **sparse models** (like Lasso).
- The **L2 penalty** ($\lambda_2\sum_j w_j^2$) shrinks coefficients smoothly and stabilizes the solution when features are correlated (like Ridge).
>- If $\lambda_1 = 0$ → Elastic Net behaves like **Ridge**.
>- If $\lambda_2 = 0$ → Elastic Net behaves like **Lasso**.
>- **If both $\lambda_1$ and $\lambda_2$ are nonzero**, Elastic Net balances Ridge and Lasso.
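
For reference, scikit-learn's `ElasticNet` does not take $\lambda_1$ and $\lambda_2$ directly. As its documentation describes it, the objective is written in terms of `alpha` ($\alpha$) and `l1_ratio` ($\rho$), with the squared-error term scaled by $\frac{1}{2n}$:

$$ \frac{1}{2n}\sum_{i=1}^{n}(y_i-\hat{y_i})^2 + \alpha\rho\sum_{j} |w_j| + \frac{\alpha(1-\rho)}{2}\sum_{j} w_j^2 $$

Relative to that scaled error term, the L1 weight is $\alpha\rho$ and the L2 weight is $\frac{\alpha(1-\rho)}{2}$, so `l1_ratio=1` gives a pure Lasso penalty and `l1_ratio=0` a pure Ridge penalty. For example, `alpha=0.1` with `l1_ratio=0.5` gives an L1 weight of $0.05$ and an L2 weight of $0.025$.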

In [3]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate sample data
np.random.seed(42)
X = 3 * np.random.rand(100, 1)
y = (5 + 2 * X + np.random.randn(100, 1)).ravel()  # y = 5 + 2X + noise, flattened to the 1-D target scikit-learn expects

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Elastic Net Regression
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # 50% L1, 50% L2
elastic_net.fit(X_train_scaled, y_train)

# Predict
y_pred = elastic_net.predict(X_test_scaled)

# Coefficients
print("Elastic Net Coefficients:", elastic_net.coef_)

Elastic Net Coefficients: [1.5052515]
