# Model Types

The right model depends on the type of problem you are solving. Some examples of traditional ML models are:

1. Supervised Learning:

 - Classification: Logistic Regression, Decision Trees, Random Forest, SVM, Gradient Boosting (e.g., XGBoost, LightGBM, CatBoost), KNN, Naive Bayes, Neural Networks (tbd).

 - Regression: Linear Regression, Multiple Linear Regression, Polynomial Regression, Decision Trees, Random Forest, Gradient Boosting (e.g., XGBoost, LightGBM, CatBoost), KNN, Support Vector Regression (SVR), Neural Networks (tbd).

2. Unsupervised Learning: K-Means Clustering, Hierarchical Clustering

3. Time Series Models: ARIMA, LSTM (Long Short-Term Memory), SARIMA (Seasonal AutoRegressive Integrated Moving Average), Prophet, N-BEATS (tbd)


Criteria for Model Selection:

- Problem Type: Classification, regression, clustering, etc.
- Data Size: Some models (e.g., neural networks) require large datasets, while others (e.g., decision trees) work well with smaller datasets.
- Interpretability: Linear models are more interpretable than complex models like neural networks.
- Computational Complexity: Some models (e.g., SVMs) are computationally expensive.


Model Comparison: Compare multiple models and select the best one based on performance metrics, or use ensemble methods that combine multiple models (e.g., bagging, boosting) to improve performance.
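
As one way to carry out such a comparison, the sketch below scores a few candidate classifiers with 5-fold cross-validation. The dataset (Iris), the list of candidates, and the accuracy metric are assumptions chosen for illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate models (an arbitrary selection for illustration)
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Mean 5-fold cross-validated accuracy for each candidate
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")

best_model = max(cv_scores, key=cv_scores.get)
print("Best model:", best_model)
```

Comparing on cross-validated scores rather than training scores keeps the choice from favoring whichever model memorizes the data best.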

Many machine learning libraries (e.g., XGBoost, LightGBM, tree-based models, scikit-learn, TensorFlow) provide built-in support for parallelism, so take advantage of these options. Parallelization can consume a lot of CPU/GPU resources, so monitor your system's resource usage to avoid overloading your hardware.
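
A minimal sketch of enabling parallelism in scikit-learn through the `n_jobs` parameter. Leaving one core free is an assumed rule of thumb here, not a library requirement, and the synthetic dataset is only for illustration.

```python
import os

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Leave one core free so the machine stays responsive (assumed heuristic)
n_cores = os.cpu_count() or 1
n_jobs = max(1, n_cores - 1)

# Trees are built in parallel across n_jobs workers
model = RandomForestClassifier(n_estimators=200, n_jobs=n_jobs, random_state=42)
model.fit(X, y)
print(f"Trained with n_jobs={n_jobs} of {n_cores} available cores")
```

Setting `n_jobs=-1` uses every available core, which is faster but leaves no headroom for other processes.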

## Supervised Learning

### Regression

#### Linear Regression

- When to Use: 
When the relationship between features and the target is linear.
- Advantages:
Simple and interpretable.
Fast to train.
- Disadvantages:
Assumes linearity, which may not hold in real-world data.
Sensitive to outliers.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Example data
X = [[1], [2], [3], [4], [5]]
y = [1, 3, 2, 3, 5]

# Train model
model = LinearRegression()
model.fit(X, y)

# Predict
y_pred = model.predict(X)
print("MSE:", mean_squared_error(y, y_pred))

####  Multiple Linear Regression
- When to Use:
When you have multiple features (independent variables) that may influence the target (dependent variable).
When the relationship between features and the target is linear.
- Advantages:
Simple and interpretable.
Fast to train and predict.
- Disadvantages:
Assumes linearity, which may not hold in real-world data.
Sensitive to multicollinearity (high correlation between independent variables).


In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Example data
data = {
    'X1': [1, 2, 3, 4, 5],
    'X2': [5, 4, 3, 2, 1],
    'y': [2, 4, 5, 4, 5]
}
df = pd.DataFrame(data)

# Features and target
X = df[['X1', 'X2']]
y = df['y']

# Train model
model = LinearRegression()
model.fit(X, y)

# Predict
y_pred = model.predict(X)

# Evaluate
mse = mean_squared_error(y, y_pred)
print("Mean Squared Error:", mse)

# Coefficients
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
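
The multicollinearity issue mentioned above can be checked before fitting. The sketch below inspects the pairwise correlation between the same two example features; the 0.9 threshold is an assumed cutoff, not a fixed rule.

```python
import pandas as pd

df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5],
    'X2': [5, 4, 3, 2, 1],
})

# Pairwise Pearson correlation between features
corr = df.corr()
print(corr)

# Flag feature pairs whose absolute correlation exceeds the threshold
threshold = 0.9  # assumed cutoff; tune per project
high_corr_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print("Highly correlated pairs:", high_corr_pairs)
```

In this toy data X2 is always 6 - X1, so the two features are perfectly (negatively) correlated and the individual coefficients are not uniquely determined.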

#### Polynomial Regression

- When to Use:
When the relationship between the independent and dependent variables is nonlinear.
When linear regression fails to capture the complexity of the data.
- Advantages:
Can model complex, nonlinear relationships.
Flexible in capturing curvature in data.
- Disadvantages:
Prone to overfitting, especially with high-degree polynomials.
Computationally expensive for high-degree polynomials.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Example data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([1, 4, 9, 16, 25])

# Transform features to polynomial
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Train model
model = LinearRegression()
model.fit(X_poly, y)

# Predict
y_pred = model.predict(X_poly)

# Evaluate
mse = mean_squared_error(y, y_pred)
print("Mean Squared Error:", mse)
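
To illustrate how the polynomial degree affects generalization, the sketch below compares cross-validated error for several degrees. The synthetic quadratic data, the noise level, and the degrees tried are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: quadratic signal plus Gaussian noise
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 40).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=1.0, size=40)

# Shuffled folds so each fold covers the whole input range
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}
for degree in (1, 2, 10):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()
    print(f"degree={degree}: CV MSE = {cv_mse[degree]:.2f}")
```

A degree that is too low underfits the curvature, while a very high degree typically starts fitting the noise; cross-validation is one way to pick a value in between.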


#### Decision Trees
- When to Use: when interpretability is important.
- Advantages:
Easy to interpret and visualize.
Handles nonlinear relationships.
- Disadvantages:
Prone to overfitting.
Sensitive to small changes in data.

#### Random Forest
- When to Use: when high accuracy is required.
- Advantages:
Reduces overfitting compared to single decision trees.
Handles nonlinear relationships well.
- Disadvantages:
Less interpretable than single decision trees.
Slower to train.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Example data
X = np.array([[1], [2], [3], [4], [5]])  # Feature
y = np.array([1, 3, 2, 3, 5])            # Target

# Train model
model = DecisionTreeRegressor(random_state=42)
model.fit(X, y)

# Predict
y_pred = model.predict(X)

# Evaluate
mse = mean_squared_error(y, y_pred)
print("Mean Squared Error (MSE):", mse)


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Example data
X = np.array([[1], [2], [3], [4], [5]])  # Feature
y = np.array([1, 3, 2, 3, 5])            # Target

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Predict
y_pred = model.predict(X)

# Evaluate
mse = mean_squared_error(y, y_pred)
print("Mean Squared Error (MSE):", mse)


#### Gradient Boosting: XGBoost

- When to Use:
When you need high predictive accuracy.
When computational efficiency is important (XGBoost is highly optimized).

- Advantages:
High Performance: Often outperforms other algorithms on structured data.
Regularization: Built-in L1/L2 regularization to prevent overfitting.
Flexibility: Supports custom loss functions and evaluation metrics.
Handles Missing Data: Automatically handles missing values.

- Disadvantages:
Complexity: Requires careful tuning of hyperparameters.
Computationally Expensive: Can be slow for very large datasets.
Less Interpretable: Compared to simpler models like linear regression.
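Categorical Features: Usually need numeric encoding (e.g., one-hot) before training; recent versions also offer native categorical support through the enable_categorical parameter.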

In [None]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import fetch_california_housing

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train XGBoost model for regression
model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (XGBoost):", mse)
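
When a dataset contains categorical columns, they usually need to be encoded numerically before being passed to XGBoost. A minimal sketch with pandas follows; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical data with one numeric and one categorical feature
df = pd.DataFrame({
    'size': [10, 20, 15, 30],
    'color': ['red', 'blue', 'red', 'green'],
})

# One-hot encode the categorical column; numeric columns pass through unchanged
X_encoded = pd.get_dummies(df, columns=['color'], dtype=int)
print(X_encoded)
```

The encoded frame can then be used as the feature matrix in place of the raw one.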

#### Gradient Boosting: LightGBM 


- When to Use:
For large datasets with many features.
When training speed is critical.

- Advantages:
Speed: Designed for efficiency and scalability; often trains faster than XGBoost, especially on large datasets.
Memory Efficiency: Uses less memory due to histogram-based splitting.
Handles Categorical Features: Supports categorical variables natively, without one-hot encoding, once the columns are marked as categorical.
Scalability: Works well with distributed computing.

- Disadvantages:
Overfitting: Can overfit on small datasets if not properly tuned.
Complexity: Requires hyperparameter tuning.
Less Robust to Noisy Data: Compared to XGBoost.

In [None]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import fetch_california_housing

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train LightGBM model for regression
train_data = lgb.Dataset(X_train, label=y_train)
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}
model = lgb.train(params, train_data, num_boost_round=100)

# Predict
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (LightGBM):", mse)

#### Gradient Boosting: CatBoost

- When to Use:
When your dataset contains categorical features.
When you want a high-performance model with minimal preprocessing.
When you need a model that is robust to overfitting.

- Advantages:
Handles Categorical Features: Automatically encodes categorical variables, reducing the need for manual preprocessing.
High Accuracy: Often outperforms other gradient boosting algorithms like XGBoost and LightGBM.
Robust to Overfitting: Uses techniques like ordered boosting to reduce overfitting.
GPU Support: Can be accelerated using GPUs for faster training (task_type parameter).

- Disadvantages:
Slower Training: Compared to LightGBM, CatBoost can be slower, especially on large datasets.
Less Customizable: Fewer hyperparameters to tune compared to XGBoost.
Memory Intensive: Requires more memory than some other gradient boosting algorithms.


CatBoost automatically handles missing values, so you don’t need to impute them manually. However, you can specify how missing values are handled using the nan_mode parameter:
* nan_mode='Min': Treat missing values as the minimum value.
* nan_mode='Max': Treat missing values as the maximum value.


In [None]:
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import pandas as pd

# Example dataset with categorical features
data = {
    'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'feature2': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],  # Categorical feature
    'target': [10, 20, 15, 25, 30, 35, 40, 45, 50, 55]
}
df = pd.DataFrame(data)

# Split data into features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize CatBoostRegressor
model = CatBoostRegressor(
    cat_features=['feature2'],  # Specify categorical features
    iterations=100,             # Number of boosting iterations
    learning_rate=0.1,          # Learning rate
    verbose=0,                  # Disable logging
    task_type='CPU'             # Use 'GPU' for GPU acceleration (requires a CUDA-capable GPU)
)

# Train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("MSE:", mean_squared_error(y_test, y_pred))

#### KNN

k-NN is a non-parametric algorithm. It predicts the target variable by finding the k most similar instances (neighbors) in the training data and averaging their values (for regression).

- When to Use:
When the data is small to medium-sized (k-NN is computationally expensive for large datasets).
When the decision boundary is nonlinear.
When interpretability is important (you can explain predictions based on neighbors).
- Advantages:
Simple and easy to implement.
No training phase (lazy learning).
Works well with nonlinear relationships.
- Disadvantages:
Computationally expensive for large datasets (requires storing the entire dataset).
Sensitive to the choice of k and distance metric.
Requires feature scaling (since it relies on distance calculations).

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import numpy as np

# Example data
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])  # y = 2 * X

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling (important for k-NN)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train k-NN regression model
model = KNeighborsRegressor(n_neighbors=3)  # k=3
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
print("Predictions:", y_pred)

# Evaluate model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
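
To show why feature scaling matters for k-NN, the sketch below compares an unscaled model with a scaled pipeline. The data is hypothetical: two features on very different scales, where only the small-scale feature drives the target.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: feature 0 in [0, 1], feature 1 in [0, 1000]
rng = np.random.default_rng(42)
X = np.column_stack([rng.uniform(0, 1, 100), rng.uniform(0, 1000, 100)])
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)  # depends on feature 0 only

models = {
    "unscaled": KNeighborsRegressor(n_neighbors=3),
    # The pipeline refits the scaler on each training fold, avoiding leakage
    "scaled": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3)),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    results[name] = -scores.mean()
    print(f"{name}: CV MSE = {results[name]:.3f}")
```

Without scaling, distances are dominated by the large-range feature, so the neighbors are chosen almost independently of the feature that actually matters.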

#### Support Vector Regression (SVR)

It works by finding a hyperplane that best fits the data while allowing a margin of error (epsilon).

- When to Use:
When the data has a nonlinear relationship.
When you need a robust model that is less sensitive to outliers.
- Advantages:
Effective in high-dimensional spaces.
Can model nonlinear relationships using kernel functions (e.g., RBF, polynomial).
- Disadvantages:
Computationally expensive for large datasets.
Requires careful tuning of hyperparameters (e.g., kernel, C, epsilon).


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Example data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([1, 4, 9, 16, 25])

# Train model
model = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
model.fit(X, y)

# Predict
y_pred = model.predict(X)

# Evaluate
mse = mean_squared_error(y, y_pred)
print("Mean Squared Error:", mse)
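
Since SVR depends heavily on kernel, C, gamma, and epsilon, a grid search is one common way to tune them. The sketch below uses synthetic sine data and an assumed parameter grid; both are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

# Synthetic data: noisy sine wave
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 60).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

# Assumed search grid; widen or narrow it for real problems
param_grid = {
    'kernel': ['rbf'],
    'C': [1, 10, 100],
    'gamma': [0.1, 1],
    'epsilon': [0.01, 0.1],
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(SVR(), param_grid, cv=cv, scoring='neg_mean_squared_error')
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV MSE:", -search.best_score_)
```

The grid grows multiplicatively with each parameter, so the search cost rises quickly on larger datasets.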

### Classification

#### Logistic Regression
- When to Use: For classification problems where the target is categorical.
- Advantages:
Simple and interpretable.
Works well with small datasets.
- Disadvantages:
Assumes linear decision boundaries.
Struggles with nonlinear relationships.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Example data
X = [[1], [2], [3], [4], [5]]
y = [0, 0, 1, 1, 1]

# Train model
model = LogisticRegression()
model.fit(X, y)

# Predict
y_pred = model.predict(X)
print("Accuracy:", accuracy_score(y, y_pred))

#### Decision Trees
- When to Use: when interpretability is important.
- Advantages:
Easy to interpret and visualize.
Handles nonlinear relationships.
- Disadvantages:
Prone to overfitting.
Sensitive to small changes in data.

#### Random Forest
- When to Use: when high accuracy is required.
- Advantages:
Reduces overfitting compared to single decision trees.
Handles nonlinear relationships well.
- Disadvantages:
Less interpretable than single decision trees.
Slower to train.


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Example data
X = [[1], [2], [3], [4], [5]]
y = [0, 0, 1, 1, 1]

# Train model
model = DecisionTreeClassifier()
model.fit(X, y)

# Predict
y_pred = model.predict(X)
print("Accuracy:", accuracy_score(y, y_pred))

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Example data
X = [[1], [2], [3], [4], [5]]
y = [0, 0, 1, 1, 1]

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)

# Predict
y_pred = model.predict(X)
print("Accuracy:", accuracy_score(y, y_pred))
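
The interpretability of a single decision tree, noted above, can be made concrete by printing its learned rules. This sketch reuses the same toy data; the feature name `x` is a label chosen here for readability.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1], [2], [3], [4], [5]]
y = [0, 0, 1, 1, 1]

model = DecisionTreeClassifier(random_state=42)
model.fit(X, y)

# Print the learned splitting rules as indented text
rules = export_text(model, feature_names=['x'])
print(rules)
```

A random forest has no single rule set of this kind, which is why it is described as less interpretable.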

#### Support Vector Machines (SVM)

- When to Use: For classification tasks with clear margins of separation.
- Advantages:
Effective in high-dimensional spaces.
Works well with small datasets.
- Disadvantages:
Computationally expensive for large datasets.
Requires careful tuning of hyperparameters.

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Example data
X = [[1], [2], [3], [4], [5]]
y = [0, 0, 1, 1, 1]

# Train model
model = SVC(kernel='linear')
model.fit(X, y)

# Predict
y_pred = model.predict(X)
print("Accuracy:", accuracy_score(y, y_pred))

#### Gradient Boosting: XGBoost

- When to Use:
When you need high predictive accuracy.
When computational efficiency is important (XGBoost is highly optimized).

- Advantages:
High Performance: Often outperforms other algorithms on structured data.
Regularization: Built-in L1/L2 regularization to prevent overfitting.
Flexibility: Supports custom loss functions and evaluation metrics.
Handles Missing Data: Automatically handles missing values.

- Disadvantages:
Complexity: Requires careful tuning of hyperparameters.
Computationally Expensive: Can be slow for very large datasets.
Less Interpretable: Compared to simpler models like logistic regression.
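Categorical Features: Usually need numeric encoding (e.g., one-hot) before training; recent versions also offer native categorical support through the enable_categorical parameter.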


In [None]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Example data
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train XGBoost model
model = xgb.XGBClassifier(objective='multi:softmax', num_class=3, n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

#### Gradient Boosting: LightGBM

- When to Use:
For large datasets with many features.
When training speed is critical.

- Advantages:
Speed: Faster training compared to XGBoost, especially on large datasets.
Memory Efficiency: Uses less memory due to histogram-based splitting.
Handles Categorical Features: Supports categorical variables natively, without one-hot encoding, once the columns are marked as categorical.
Scalability: Works well with distributed computing.

- Disadvantages:
Overfitting: Can overfit on small datasets if not properly tuned.
Complexity: Requires hyperparameter tuning.
Less Robust to Noisy Data: Compared to XGBoost.

In [None]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Example data
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train LightGBM model
train_data = lgb.Dataset(X_train, label=y_train)
params = {
    'objective': 'multiclass',
    'num_class': 3,
    'metric': 'multi_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}
model = lgb.train(params, train_data, num_boost_round=100)

# Predict
y_pred = model.predict(X_test)
y_pred = y_pred.argmax(axis=1)  # Convert class probabilities to class labels
print("Accuracy:", accuracy_score(y_test, y_pred))

#### Gradient Boosting: CatBoost

- When to Use:
When your dataset contains categorical features.
When you want a high-performance model with minimal preprocessing.
When you need a model that is robust to overfitting.

- Advantages:
Handles Categorical Features: Automatically encodes categorical variables, reducing the need for manual preprocessing.
High Accuracy: Often outperforms other gradient boosting algorithms like XGBoost and LightGBM.
Robust to Overfitting: Uses techniques like ordered boosting to reduce overfitting.
GPU Support: Can be accelerated using GPUs for faster training (task_type parameter).

- Disadvantages:
Slower Training: Compared to LightGBM, CatBoost can be slower, especially on large datasets.
Less Customizable: Fewer hyperparameters to tune compared to XGBoost.
Memory Intensive: Requires more memory than some other gradient boosting algorithms.

CatBoost automatically handles missing values, so you don’t need to impute them manually. However, you can specify how missing values are handled using the nan_mode parameter:
* nan_mode='Min': Treat missing values as the minimum value.
* nan_mode='Max': Treat missing values as the maximum value.


In [None]:
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import pandas as pd

# Example dataset with categorical features
data = {
    'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'feature2': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],  # Categorical feature
    'target': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
}
df = pd.DataFrame(data)

# Split data into features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize CatBoostClassifier
model = CatBoostClassifier(
    cat_features=['feature2'],  # Specify categorical features
    iterations=100,             # Number of boosting iterations
    learning_rate=0.1,          # Learning rate
    verbose=0,                  # Disable logging
    task_type='CPU'             # Use 'GPU' for GPU acceleration (requires a CUDA-capable GPU)
)

# Train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))

#### KNN
k-NN is a non-parametric algorithm. It predicts the target variable by finding the k most similar instances (neighbors) in the training data and taking a majority vote (for classification).

- When to Use:
When the data is small to medium-sized (k-NN is computationally expensive for large datasets).
When the decision boundary is nonlinear.
When interpretability is important (you can explain predictions based on neighbors).
- Advantages:
Simple and easy to implement.
No training phase (lazy learning).
Works well with nonlinear relationships.
- Disadvantages:
Computationally expensive for large datasets (requires storing the entire dataset).
Sensitive to the choice of k and distance metric.
Requires feature scaling (since it relies on distance calculations).

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Example data
X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling (important for k-NN)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train k-NN model
model = KNeighborsClassifier(n_neighbors=3)  # k=3
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

#### Naive Bayes
 
Naive Bayes is a probabilistic algorithm based on Bayes' Theorem. It assumes that features are independent of each other (the "naive" assumption), which simplifies the computation.

- When to Use:
For text classification (e.g., spam detection, sentiment analysis).
When the dataset is small or high-dimensional.
When you need a fast and simple model.
- Advantages:
Fast to train and predict.
Works well with high-dimensional data (e.g., text data).
Performs well even with the naive assumption of feature independence.
- Disadvantages:
The naive assumption of feature independence may not hold in real-world data.
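Continuous data needs care: the multinomial and Bernoulli variants expect count or binary features, so continuous features require discretization or Gaussian Naive Bayes, which assumes each feature is normally distributed within a class.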
May produce poor results if the data distribution is not well-represented.

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Example data
X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Naive Bayes model
model = GaussianNB()  # Gaussian Naive Bayes for continuous data
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
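
For the text-classification use case listed above, a multinomial variant on word counts is a common choice. The sketch below uses a tiny made-up corpus purely for illustration; real spam filters need far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus for illustration only
texts = [
    "win a free prize now", "free money click here", "claim your free reward",
    "meeting at noon tomorrow", "project report attached", "lunch with the team",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

# Convert text to word counts, then fit Multinomial Naive Bayes
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

preds = model.predict(["free prize inside", "team meeting tomorrow"])
print("Predictions:", preds)
```

Words that never appeared in training (such as "inside") are ignored by the vectorizer, so predictions rest only on the known vocabulary.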

## Unsupervised Learning

#### K-Means Clustering

- When to Use: For clustering tasks where the number of clusters is known or can be estimated.
- Advantages:
Simple and fast.
Works well with large datasets.
- Disadvantages:
Requires specifying the number of clusters.
Sensitive to initial cluster centers.

In [None]:
from sklearn.cluster import KMeans
import numpy as np

# Example data
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Train model
model = KMeans(n_clusters=2, n_init=10, random_state=42)  # Fixed seed and explicit n_init for reproducible results
model.fit(X)

# Predict clusters
labels = model.predict(X)
print("Cluster Labels:", labels)
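
When the number of clusters is not known, it can be estimated by comparing a quality score across several values of k. The sketch below uses the silhouette score on synthetic blobs; the data and the range of k are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data generated around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

# Silhouette score (between -1 and 1, higher is better) for each candidate k
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Best k:", best_k)
```

The elbow method on inertia is a common alternative; both are heuristics and can disagree on less clearly separated data.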

#### Hierarchical Clustering
It groups similar data points into clusters by either a divisive (top-down) or agglomerative (bottom-up) approach. The result is often visualized as a dendrogram, which shows the arrangement of clusters.

- When to Use:
When you want to explore the hierarchical structure of the data.
When the number of clusters is not known in advance.
For small to medium-sized datasets (due to computational complexity).

- Advantages:
No Need to Specify Number of Clusters: The dendrogram helps in deciding the number of clusters.
Interpretable: The hierarchical structure provides insights into the relationships between clusters.
Works with Any Similarity Metric: Can use Euclidean distance, cosine similarity, etc.

- Disadvantages:
Computationally Expensive: Not suitable for large datasets (time complexity is O(n^3) for agglomerative clustering).
Sensitive to Noise and Outliers: Can produce misleading clusters if the data contains noise.
Once a decision is made to combine clusters, it cannot be undone: This can lead to suboptimal clustering.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Generate sample data
X, _ = make_blobs(n_samples=50, centers=3, cluster_std=0.60, random_state=0)

# Standardize the data (optional but recommended for clustering)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform hierarchical clustering using the "ward" method (minimizes variance)
linked = linkage(X_scaled, method='ward')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

# Cut the dendrogram to form clusters (e.g., 3 clusters)
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(linked, t=3, criterion='maxclust')

# Visualize the clusters
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap='viridis', s=50)
plt.title('Hierarchical Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Cluster')
plt.show()

#### DBSCAN (Density-Based Spatial Clustering of Applications with Noise)


- When to Use:
Non-spherical clusters: When the data has irregularly shaped clusters (e.g., geographical data).
Noise handling: When outlier detection is important (e.g., fraud detection, environmental data).
Unknown number of clusters: When the cluster count cannot be fixed in advance, as K-Means requires.
- Advantages:
No need to predefine the number of clusters: Unlike K-Means, it finds the clusters automatically.
Identifies outliers: Points in low-density regions are labeled as noise.
Works with non-linear cluster shapes: Unlike K-Means, which assumes roughly spherical clusters.
- Disadvantages:
Struggles with high-dimensional data: Performance degrades as the number of dimensions increases.
Parameter-sensitive: The eps (neighborhood radius) and min_samples parameters need careful tuning.
Issues with variable-density clusters: A single eps applies to the whole dataset, so DBSCAN may not work well when clusters differ widely in density.


In [None]:
from sklearn.cluster import DBSCAN
import numpy as np

# Create sample data (2D points)
X = np.array([[1, 2], [2, 3], [2, 2], [8, 8], [8, 9], [25, 80]])  # The last point is an outlier, anomaly detection

# Apply DBSCAN
db = DBSCAN(eps=2, min_samples=2).fit(X)
labels = db.labels_

print("Cluster labels:", labels)  # -1 represents noise (outliers)
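
One common heuristic for choosing eps is to look at the sorted distance from each point to its nearest neighbors and pick a value just below the sharp jump. The sketch below applies it to the same sample points, assuming min_samples=2.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[1, 2], [2, 3], [2, 2], [8, 8], [8, 9], [25, 80]])
min_samples = 2

# n_neighbors=min_samples because each point is returned as its own
# nearest neighbor (distance 0) when querying the training data
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Distance to the farthest of those neighbors, sorted ascending
k_distances = np.sort(distances[:, -1])
print("Sorted k-distances:", k_distances)
```

Here five points have a neighbor at distance 1 while the outlier's nearest neighbor is about 73 away, so any eps between those values, such as 2, separates the clusters from the noise point.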


## Time Series

#### ARIMA (AutoRegressive Integrated Moving Average)

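- When to Use: For univariate time series that are stationary, or that become stationary after differencing.
- Advantages:
Handles trends through differencing (the "I" component).
Widely used in forecasting.
- Disadvantages:
Requires the differenced series to be stationary.
Does not model seasonality (SARIMA adds seasonal terms for that).
Complex to tune (the p, d, q orders must be chosen).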

In [None]:
from statsmodels.tsa.arima.model import ARIMA

# Example data
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Train model
model = ARIMA(data, order=(1, 1, 1))
model_fit = model.fit()

# Forecast
forecast = model_fit.forecast(steps=3)
print("Forecast:", forecast)

#### Vector AutoRegression (VAR)
VAR extends the univariate autoregressive (AR) model to multivariate time series forecasting. Unlike ARIMA, which models a single time series, VAR captures relationships between multiple time-dependent variables.

- When to Use:
Multivariate time series: When two or more time-dependent variables influence each other (e.g., stock prices and interest rates).
Interdependencies between variables: When lagged values of multiple variables affect future values (e.g., economic forecasting).
- Advantages:
Models Multiple Time Series Together 
Well-established & interpretable
Handles Lag Dependencies – Captures effects over time.
- Disadvantages:
Requires stationary data.
High Complexity with Many Variables – Large datasets require careful selection of lag parameters.
Assumes Linear Relationships – Non-linear dependencies may not be captured well.


In the example below, the model learns the relationships between all the variables and their past values (lags). It then forecasts the next two time steps for each feature by considering:
* The past two time steps (maxlags=2).
* The interactions between all time series in the dataset.

In [None]:
import pandas as pd
from statsmodels.tsa.api import VAR

# Sample multivariate time series data
data = {
    'GDP': [2.3, 2.5, 2.8, 3.0, 3.2, 3.3],  # GDP growth rate
    'Unemployment': [5.2, 5.0, 4.8, 4.6, 4.5, 4.3]  # Unemployment rate
}
df = pd.DataFrame(data)

# Fit VAR model
model = VAR(df)
result = model.fit(maxlags=2)  # Selecting 2 lags

# Forecast next 2 periods
forecast = result.forecast(df.values[-result.k_ar:], steps=2)  # Pass the last k_ar rows; columns follow the order of the input features
print(forecast)

#### VARIMA (Vector ARIMA)
VARIMA is a combination of VAR and ARIMA, extending ARIMA to multiple time series. It allows for both differencing (I) and moving average (MA) components in addition to autoregression (AR).

- When to Use:
Multivariate Time Series with Differencing Needed: If data has trends or seasonality that require differencing.
Better Handling of Non-Stationarity: If VAR alone fails due to non-stationarity.
Time Series with Moving Average Components: If errors have serial correlation.
- Advantages:
Captures complex dependencies and is more general than VAR.
Handles non-stationary data.
- Disadvantages:
Computationally Expensive.
Difficult to Interpret.
Needs Large Datasets.

In [None]:
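import pandas as pd
from statsmodels.tsa.statespace.varmax import VARMAX

# Example DataFrame with two time series
data = {
    'GDP': [2.3, 2.5, 2.8, 3.0, 3.2, 3.3],
    'Unemployment': [5.2, 5.0, 4.8, 4.6, 4.5, 4.3]
}
df = pd.DataFrame(data)

# pmdarima's auto_arima is univariate and cannot fit a multivariate model.
# One workaround: difference the data manually (the "I" step), then fit a
# VARMA model with statsmodels' VARMAX, which has no built-in differencing.
df_diff = df.diff().dropna().reset_index(drop=True)

# order=(p, q); q=0 keeps this tiny example estimable, use q > 0 for MA terms on longer series.
# Six observations are far too few for a reliable fit, so expect convergence warnings.
model = VARMAX(df_diff, order=(1, 0))
result = model.fit(disp=False)

# Forecast the next 2 differences, then integrate back to the original scale
forecast_diff = result.forecast(steps=2)
forecast = forecast_diff.cumsum() + df.iloc[-1]  # Columns follow the order of the input features
print(forecast)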


#### LSTM (Long Short-Term Memory)

- When to Use: For complex time series or sequential data with long-term dependencies.
- Advantages:
Handles long-term dependencies.
Works well with large datasets.
- Disadvantages:
Computationally expensive.
Requires large amounts of data.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
import numpy as np

# Example data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 3, 4, 5, 6])

# Reshape data for LSTM
X = X.reshape((X.shape[0], X.shape[1], 1))

# Build LSTM model
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(1, 1)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

# Train model
model.fit(X, y, epochs=200, verbose=0)

# Predict
y_pred = model.predict(X)
print("Predictions:", y_pred)
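
In practice an LSTM is usually fed windows of several past time steps rather than a single value. The helper below, `make_windows`, is a hypothetical name introduced here to sketch that preparation step with NumPy only.

```python
import numpy as np

def make_windows(series, n_steps):
    """Split a 1-D series into (samples, n_steps, 1) inputs and next-value targets."""
    series = np.asarray(series, dtype=float)
    # Each input row holds n_steps consecutive values
    X = np.array([series[i:i + n_steps] for i in range(len(series) - n_steps)])
    # Each target is the value that immediately follows its window
    y = series[n_steps:]
    return X.reshape(-1, n_steps, 1), y

series = [1, 2, 3, 4, 5, 6, 7, 8]
X, y = make_windows(series, n_steps=3)
print(X.shape, y.shape)    # (5, 3, 1) (5,)
print(X[0].ravel(), y[0])  # [1. 2. 3.] 4.0
```

With windows like these, the LSTM layer's input shape would be (n_steps, 1) instead of (1, 1).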

#### SARIMA (Seasonal AutoRegressive Integrated Moving Average)
SARIMA is an extension of ARIMA that explicitly models seasonality in time series data. It is useful for time series that exhibit seasonal patterns (e.g., monthly sales data with yearly seasonality).

- When to Use:
For forecasting tasks where seasonality is a key factor (e.g., monthly, quarterly, or yearly cycles).
When the data is non-stationary and requires differencing to remove trends and seasonality.

- Advantages:
Explicitly models seasonality, making it suitable for seasonal time series.
Flexible in handling both trend and seasonality.
Widely used in forecasting applications.

- Disadvantages:
Requires careful tuning of hyperparameters (e.g., seasonal order).
Computationally expensive for large datasets.
Assumes that seasonality is constant over time.



In [None]:
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
import matplotlib.pyplot as plt

# Example data: Monthly sales with seasonality
data = [100, 120, 130, 150, 200, 250, 300, 350, 400, 450, 500, 550,
        110, 130, 140, 160, 210, 260, 310, 360, 410, 460, 510, 560]
dates = pd.date_range(start='2020-01-01', periods=len(data), freq='MS')  # 'MS' = month start; the 'M' alias is deprecated in recent pandas
df = pd.DataFrame({'date': dates, 'sales': data})
df.set_index('date', inplace=True)

# Fit SARIMA model
# SARIMA(p, d, q)(P, D, Q, S)
# p: AR order, d: differencing, q: MA order
# P: Seasonal AR order, D: Seasonal differencing, Q: Seasonal MA order, S: Seasonality length (e.g., 12 for monthly data)
model = SARIMAX(df['sales'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit(disp=False)

# Forecast the next 12 months
forecast = results.get_forecast(steps=12)
forecast_mean = forecast.predicted_mean
confidence_intervals = forecast.conf_int()

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(df.index, df['sales'], label='Observed')
plt.plot(forecast_mean.index, forecast_mean, label='Forecast', color='red')
plt.fill_between(confidence_intervals.index,
                 confidence_intervals.iloc[:, 0],
                 confidence_intervals.iloc[:, 1], color='pink', alpha=0.3)
plt.title('SARIMA Forecast')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()

#### Prophet

- When to Use:
When the time series has strong seasonality (e.g., daily, weekly, or yearly patterns).
When you want to include holiday effects or other known events.
For quick and robust forecasting without extensive tuning.

- Advantages:
Easy to use and requires minimal hyperparameter tuning.
Handles missing data and outliers gracefully.
Provides intuitive forecasts with uncertainty intervals.

- Disadvantages:
Less flexible than SARIMA for custom seasonality or trend modeling.
May not perform well on time series without clear seasonality.
Requires additional setup for custom seasonality or holiday effects.




In [None]:
from prophet import Prophet  # The package was renamed from fbprophet to prophet in v1.0
import pandas as pd
import matplotlib.pyplot as plt

# Example data: Daily sales with seasonality
data = {
    'ds': pd.date_range(start='2020-01-01', periods=365, freq='D'),
    'y': [100 + 10 * (i % 30) + 5 * (i % 7) for i in range(365)]  # Simulated seasonal data
}
df = pd.DataFrame(data)

# Initialize and fit Prophet model
model = Prophet()
model.fit(df)

# Create future dataframe for forecasting
future = model.make_future_dataframe(periods=30)  # Forecast the next 30 days

# Make predictions
forecast = model.predict(future)

# Plot the forecast
fig = model.plot(forecast)
plt.title('Prophet Forecast')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()

# Plot components (trend, weekly seasonality, yearly seasonality)
fig2 = model.plot_components(forecast)
plt.show()