# Python Packages used in ML

## Key Libraries:
- **NumPy**
- **SciPy**
- **Matplotlib**
- **Pandas**
- **Scikit-learn**

# Categories of Machine Learning

### Supervised Learning:
- **Definition**: The model is trained on a labeled dataset (output is known).
- **Goal**: To learn a mapping from inputs to known outputs so that the model can predict outputs for new, unseen inputs.
- **Types:**
  - **Regression**:
    - Predicts a continuous value.
    - Examples: Predicting house prices, stock trends.
    - Algorithms: Linear Regression, Ridge, Lasso, Polynomial Regression.
  - **Classification**:
    - Predicts a category label.
    - Examples: Spam detection, image classification.
    - Algorithms: Logistic Regression, SVM, Decision Trees, Random Forest, Neural Networks.

- **Other Regression Algorithms:**
  - Ordinal Regression
  - Poisson Regression
  - Bayesian Linear Regression
  - Boosted Decision Tree Regression
  - Neural Network Regression
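
To illustrate the difference between the two supervised task types, here is a minimal sketch using scikit-learn. The regression data is synthetic (an assumed slope of 3 and intercept of 2 with added noise), and the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: the target is a continuous value.
# Synthetic data, assuming y = 3x + 2 plus Gaussian noise.
rng = np.random.default_rng(0)
X_reg = rng.uniform(0, 10, size=(100, 1))
y_reg = 3.0 * X_reg[:, 0] + 2.0 + rng.normal(0, 1.0, size=100)
reg = LinearRegression().fit(X_reg, y_reg)
reg_prediction = reg.predict([[5.0]])  # a real number

# Classification: the target is a category label (iris species 0, 1, or 2).
X_clf, y_clf = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=200).fit(X_clf, y_clf)
clf_prediction = clf.predict(X_clf[:1])  # one of the known class labels

print(f"Regression output: {reg_prediction[0]:.2f}")
print(f"Classification output: {clf_prediction[0]}")
```

The regression model returns a number on a continuous scale, while the classifier returns one of a fixed set of labels.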

### Unsupervised Learning:
- **Definition**: The model is trained on unlabeled data (no output provided).
- **Goal**: To infer natural structure within data.
- **Types:**
  - **Clustering**:
    - Groups similar objects.
    - Algorithms: k-Means, Hierarchical Clustering, DBSCAN.
  - **Dimensionality Reduction**:
    - Reduces data complexity while retaining key information.
    - Algorithms: PCA, t-SNE, SVD.
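
As a rough sketch of both unsupervised task types, the example below runs k-Means and PCA on the iris features while ignoring the labels. The choice of 3 clusters and 2 components is an assumption made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load features only; the labels are deliberately discarded.
X, _ = load_iris(return_X_y=True)

# Clustering: assign each sample to one of 3 groups (3 is an assumed value).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: project 4 features down to 2 components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(f"Cluster sizes: {[int((cluster_ids == k).sum()) for k in range(3)]}")
print(f"Reduced shape: {X_2d.shape}")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2f}")
```

The `explained_variance_ratio_` attribute gives a quick check of how much information the reduced representation keeps.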

# Supervised vs Unsupervised Learning
| Feature | Supervised | Unsupervised |
|---------|------------|--------------|
| **Labeled Data** | Requires labels | No labels required |
| **Goal** | Predict outputs | Discover patterns |
| **Methods** | Classification, Regression | Clustering, Dimensionality Reduction |
| **Applications** | Spam detection, trend prediction | Market basket analysis, anomaly detection |

# Regression Techniques
## Simple Linear Regression:
- **Equation**: y = β0 + β1x + ε
  - y: Dependent variable
  - x: Independent variable
  - β0: Intercept
  - β1: Slope
  - ε: Error term
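
A minimal sketch of estimating β0 and β1 by ordinary least squares with NumPy. The data is synthetic, generated with assumed true values β0 = 1.5 and β1 = 2.0, so the estimates can be compared against them.

```python
import numpy as np

# Synthetic data following y = β0 + β1*x + ε (assumed β0 = 1.5, β1 = 2.0).
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
epsilon = rng.normal(0, 0.5, size=200)
y = 1.5 + 2.0 * x + epsilon

# Closed-form least-squares estimates:
#   slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   intercept = mean(y) - slope * mean(x)
x_mean, y_mean = x.mean(), y.mean()
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

# Residuals are the sample estimates of the error term ε.
residuals = y - (beta0 + beta1 * x)

print(f"Estimated intercept (β0): {beta0:.3f}")
print(f"Estimated slope (β1): {beta1:.3f}")
```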

## Multiple Linear Regression:
- **Definition**: Uses multiple independent variables to predict the dependent variable.
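
A short sketch of multiple linear regression with scikit-learn's `LinearRegression`. The three features and their coefficients (2.0, -1.0, 0.5, with intercept 4.0) are assumed values used to generate synthetic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with three independent variables (assumed coefficients).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
true_coefs = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_coefs + rng.normal(0, 0.1, size=300)

# Fit one coefficient per independent variable, plus an intercept.
model = LinearRegression().fit(X, y)

print(f"Intercept: {model.intercept_:.3f}")
print(f"Coefficients: {np.round(model.coef_, 3)}")
```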

# Model Evaluation Metrics
- **Training Accuracy**: Model's accuracy on the data it was trained on.
  - Training accuracy that is much higher than out-of-sample accuracy may indicate overfitting.
- **Out-of-Sample Accuracy**:
  - Accuracy on unseen data.
  - Helps assess generalization capability.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict on the training set
y_train_pred = model.predict(X_train)
# Calculate training accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)

# Predict on the test set
y_test_pred = model.predict(X_test)
# Calculate out-of-sample (test) accuracy
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Training Accuracy: {train_accuracy:.2f}")
print(f"Out-of-Sample Accuracy: {test_accuracy:.2f}")
```

```
Training Accuracy: 0.96
Out-of-Sample Accuracy: 1.00
```


# K-Fold Cross-Validation
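- **Definition**: Splits data into k subsets (folds); the model is trained on k-1 folds and validated on the remaining fold, and this is repeated k times so that each fold serves as the validation set once.
- **Benefits**:
  - Better data utilization: every sample is used for both training and validation.
  - Helps detect overfitting, since the score does not depend on a single train/test split.
  - More reliable performance estimate (mean and spread across folds).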

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create a logistic regression model
model = LogisticRegression(max_iter=200)

# Set up K-Fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate the model using cross-validation
cv_scores = cross_val_score(model, X, y, cv=kf)

# Print the cross-validation scores
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Mean CV Score: {cv_scores.mean():.2f}")
print(f"Standard Deviation of CV Score: {cv_scores.std():.2f}")
```

```
Cross-Validation Scores: [1.         1.         0.93333333 0.96666667 0.96666667]
Mean CV Score: 0.97
Standard Deviation of CV Score: 0.02
```
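
To make the fold mechanics concrete, the following sketch prints the index split that `KFold` produces on a toy array of 10 samples (the array and the names `X_small` and `kf_small` are illustrative assumptions). It also counts how often each sample lands in a validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy dataset: 10 samples, one feature (illustrative only).
X_small = np.arange(10).reshape(-1, 1)
kf_small = KFold(n_splits=5, shuffle=True, random_state=0)

# Track how many times each sample is used for validation.
validation_counts = np.zeros(len(X_small), dtype=int)
for fold, (train_idx, val_idx) in enumerate(kf_small.split(X_small), start=1):
    validation_counts[val_idx] += 1
    print(f"Fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")

# Every sample should appear in exactly one validation fold.
print(f"Validation counts per sample: {validation_counts.tolist()}")
```

With k = 5 and 10 samples, each fold trains on 8 samples and validates on 2.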


# Regression Evaluation Metrics
1. **R-Squared (R²)**:
   - Proportion of variance explained by the model.
   - Higher values indicate a better fit.
2. **Mean Absolute Error (MAE)**:
   - Average of absolute differences between predictions and actuals; lower is better.
3. **Mean Squared Error (MSE)**:
   - Average of squared differences between predictions and actuals; penalizes large errors more heavily than MAE.
4. **Root Mean Squared Error (RMSE)**:
   - Square root of MSE, expressed in the same units as the target.
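
The four metrics can be computed as in the sketch below, which uses `sklearn.metrics` on a small set of made-up values (`y_true` and `y_pred` are illustrative, not results from any model in these notes).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

mae = mean_absolute_error(y_true, y_pred)   # mean of |error|
mse = mean_squared_error(y_true, y_pred)    # mean of error^2
rmse = np.sqrt(mse)                         # square root of MSE
r2 = r2_score(y_true, y_pred)               # 1 - SS_res / SS_tot

print(f"MAE:  {mae:.3f}")   # MAE:  0.500
print(f"MSE:  {mse:.3f}")   # MSE:  0.375
print(f"RMSE: {rmse:.3f}")  # RMSE: 0.612
print(f"R²:   {r2:.3f}")    # R²:   0.925
```

Here the errors are 0.5, 0, -1, and -0.5, so the squared errors sum to 1.5; the total sum of squares around the mean (6.0) is 20, giving R² = 1 - 1.5/20 = 0.925.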

---

# Typical Machine Learning Workflow

1. **Data Collection:** Gather relevant and clean data from various sources.
2. **Data Preprocessing:** Handle missing values, encode categorical variables, normalize/scale features.
3. **Feature Selection/Engineering:** Identify important attributes, create new features if necessary.
4. **Model Selection:** Choose the right algorithm for the problem (e.g., regression, classification, clustering).
5. **Model Training:** Fit the model to the training data.
6. **Model Evaluation:** Assess model performance using appropriate metrics and validation strategies.
7. **Hyperparameter Tuning:** Adjust algorithm parameters for better accuracy and generalization.
8. **Deployment:** Integrate the model into production systems for real-world use.
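
Several of these steps can be sketched end to end with scikit-learn. This is one possible arrangement under stated assumptions (the iris dataset, a `StandardScaler` plus `LogisticRegression` pipeline, and an arbitrary grid of `C` values), not a prescribed recipe; deployment is out of scope here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: data collection (a bundled dataset stands in for real data).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Steps 2 and 4: preprocessing and model choice combined in one pipeline.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200)),
])

# Steps 5 and 7: training plus hyperparameter tuning via 5-fold CV
# on the training data only (the grid values are assumed).
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

# Step 6: final evaluation on the held-out test set.
test_accuracy = search.score(X_test, y_test)

print(f"Best parameters: {search.best_params_}")
print(f"Mean CV accuracy of best model: {search.best_score_:.2f}")
print(f"Test accuracy: {test_accuracy:.2f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the validation folds never influence the preprocessing.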

# Common Challenges in Machine Learning

- **Overfitting:** Model performs well on training data but poorly on unseen data.
- **Underfitting:** Model is too simple to capture underlying patterns.
- **Imbalanced Data:** Unequal class distribution affects classification performance.
- **Data Leakage:** Information from outside the training dataset (for example, from the test set) is used to build the model, leading to overly optimistic results.
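
Overfitting and underfitting can be illustrated by varying model complexity. The sketch below uses decision trees of different depths on synthetic data with some label noise; the dataset parameters and depth values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data with about 10% noisy labels (flip_y).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# depth=1 is very simple (prone to underfitting);
# depth=None grows the tree until it fits the training data fully.
results = {}
for depth in (1, 4, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.2f} "
          f"test={results[depth][1]:.2f}")
```

A large gap between training and test accuracy for the unrestricted tree is the typical signature of overfitting, while low accuracy on both sets for the depth-1 tree points to underfitting.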

# Good Practices

- Use **cross-validation** to validate models more reliably.
- Always keep **test data** separate for final evaluation.
- Perform **feature scaling** for scale-sensitive methods such as SVM, k-Means, and PCA, fitting the scaler on the training data only.
- Regularly visualize data and results to spot issues and interpret findings.
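
As a small sketch of two of these practices together, the example below keeps the test set separate and fits a `StandardScaler` on the training data only, then reuses those statistics for the test data. The dataset and split settings are assumptions carried over for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Learn the mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_train)

# Apply the same training statistics to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Train feature means after scaling: {np.round(X_train_scaled.mean(axis=0), 2)}")
print(f"Test feature means after scaling:  {np.round(X_test_scaled.mean(axis=0), 2)}")
```

The scaled training features have zero mean by construction; the test features generally do not, which is expected because the test set played no part in fitting the scaler.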
