# Step 4: Model Selection and Training 🧠

Once your data is engineered, compressed, and split, you've reached the main event: training the algorithm to learn patterns.

## 1. Algorithm Selection (Choosing Your Fighter) 🥊

You rarely train just one model. Instead, you train a few different types of algorithms and compare them to see which fits your data best.

### Common Algorithms for ZS Interviews

| Algorithm | Type | When to Use | Why to Use It |
| :--- | :--- | :--- | :--- |
| **Linear Regression** | Regression | Predicting continuous numbers (sales, prices). | Highly interpretable; clear mathematical relationship. |
| **Logistic Regression** | Classification | Predicting binary outcomes (Yes/No, Fraud/Not). | Outputs probabilities; great for stakeholder explanations. |
| **Decision Trees** | Classification/Regression | When each prediction must be explainable as a sequence of rules. | Easy to visualize (flowchart style). |
| **Random Forest** | Classification/Regression | Complex data with many features. | Robust; averaging many trees reduces overfitting. Missing-value handling depends on the implementation. |
| **Gradient Boosting (XGBoost/LightGBM)** | Classification/Regression | Maximum predictive accuracy. | Iteratively learns from mistakes; top-performing for tabular data. |
| **K-Means** | Clustering | Grouping customers/segments without labels. | Fast, scales well, drives business segmentation. |
| **PCA** | Dimensionality Reduction | Reducing the number of features in wide datasets. | Speeds up training and can remove noise. |
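
As a rough sketch of "trying a few fighters", the snippet below scores three of the regressors from the table with 5-fold cross-validation. The synthetic dataset and the names `X_demo`, `y_demo`, `candidates`, and `cv_scores` are assumptions made for illustration; in practice you would pass your own training split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for your own training data (illustrative assumption).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=42
)

# A few candidate algorithms to compare on equal footing.
candidates = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Mean 5-fold cross-validated R^2 for each candidate.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

# Print candidates from best to worst.
for name, score in sorted(cv_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean R^2 = {score:.3f}")
```

Because this synthetic target is linear by construction, Linear Regression should do very well here; on real data the ranking can look quite different, which is the reason to compare.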

## 2. Model Training (Fitting the Data) 🛠️

This is the learning part. You pass `X_train` (features) and `y_train` (labels) to the algorithm's `fit` method, and it adjusts its internal parameters (coefficients for linear models, split rules for trees) to minimize prediction error on the training data.
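
To make "assigning weights" concrete, here is a minimal sketch on toy data where the true relationship is known. The data, the formula `y = 3*x1 + 2*x2 + 5`, and the names `X_toy`, `y_toy`, and `linreg` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data generated from y = 3*x1 + 2*x2 + 5 (no noise).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(50, 2))
y_toy = 3 * X_toy[:, 0] + 2 * X_toy[:, 1] + 5

linreg = LinearRegression()
linreg.fit(X_toy, y_toy)  # the "learning" step

# The fitted parameters are exposed as attributes after fit().
print("Learned weights:", linreg.coef_)
print("Learned intercept:", linreg.intercept_)
```

Since the toy data has no noise, the learned weights should come out very close to 3 and 2, and the intercept very close to 5.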

## 3. Deep Dive: Random Forest 🌲🌲🌲

Instead of relying on one deep tree (which tends to overfit), a Random Forest builds hundreds of trees and combines their answers.

### How It Works:
1.  **Bootstrapping (Data Mixing) 🎲**: Each tree is trained on its own bootstrap sample, drawn at random *with replacement* from the original training data.
2.  **Random Features 🔀**: At each split, a tree may only choose from a random subset of features. This ensures diversity and prevents the strongest feature from dominating every tree.
3.  **Aggregation (The Vote) 🗳️**:
    - **Regression**: Average of all tree predictions.
    - **Classification**: Majority vote.
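
The three steps above can be sketched by hand with plain decision trees. This is a simplified illustration, not how `RandomForestRegressor` is implemented internally; the synthetic data, the choice of 25 trees, and `max_features=2` are assumptions for the demo.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only.
X_rf, y_rf = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)
rng = np.random.default_rng(0)
n_rows = X_rf.shape[0]

trees = []
for i in range(25):
    # 1. Bootstrapping: draw row indices with replacement.
    idx = rng.integers(0, n_rows, size=n_rows)
    # 2. Random features: max_features limits the features considered at each split.
    tree = DecisionTreeRegressor(max_features=2, random_state=i)
    tree.fit(X_rf[idx], y_rf[idx])
    trees.append(tree)

# 3. Aggregation (regression): average the per-tree predictions.
per_tree = np.stack([t.predict(X_rf[:5]) for t in trees])  # shape (25, 5)
forest_pred = per_tree.mean(axis=0)
print(forest_pred)
```

For classification, the averaging step would be replaced by picking the most common predicted label across trees.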

### 4. Implementation Example


In [1]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# 1. Select the Algorithm
model = RandomForestRegressor(random_state=42)

# 2. Define Hyperparameters (The Dials)
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10]
}

# 3. Set up Grid Search with 5-Fold Cross Validation
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)

# 4. Train the model
# grid_search.fit(X_train, y_train)  # Uncomment in actual execution environment

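# 5. Check Results
# best_score_ is the mean cross-validated score of the best settings;
# with no `scoring` argument, a regressor is scored by R^2
# print(f"Best Settings Found: {grid_search.best_params_}")
# print(f"Best Cross-Validation Score: {grid_search.best_score_:.4f}")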

# 6. Save the best version of the model
# best_model = grid_search.best_estimator_
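
Because the training lines in the cell above are commented out, here is a self-contained version you can run as-is. It uses synthetic data in place of the `X_train` and `y_train` from earlier steps; the names `X_syn`, `y_syn`, `X_tr`, `X_te`, `search`, and `tuned_model` are assumptions introduced only for this sketch.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the split produced in earlier steps.
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Same grid as above: 2 x 2 settings, each evaluated with 5-fold cross-validation.
search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10]},
    cv=5,
)
search.fit(X_tr, y_tr)

print(f"Best settings found: {search.best_params_}")
print(f"Best cross-validation R^2: {search.best_score_:.4f}")

# best_estimator_ is refit on all of X_tr using the best settings.
tuned_model = search.best_estimator_
print(f"Held-out R^2: {tuned_model.score(X_te, y_te):.4f}")
```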

### 5. How to Run the Model (Predictions) 🚀

Once you have the `best_model`, you can use it to predict outcomes on new, unseen data.

In [None]:
# Make predictions using the final tuned model
# predictions = best_model.predict(X_test)

# For classification only: predict_proba exists on classifiers
# (e.g. RandomForestClassifier), not on the RandomForestRegressor tuned above
# probabilities = best_model.predict_proba(X_test)[:, 1]

# Example of checking the first 5 predictions
# print(predictions[:5])
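
For the classification case, a minimal sketch with `RandomForestClassifier` shows both hard labels and probabilities. The synthetic binary dataset and the names `X_cls`, `y_cls`, `clf`, `labels`, and `probs` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative assumption).
X_cls, y_cls = make_classification(n_samples=300, n_features=6, random_state=42)
X_cls_train, X_cls_test, y_cls_train, y_cls_test = train_test_split(
    X_cls, y_cls, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_cls_train, y_cls_train)

labels = clf.predict(X_cls_test)             # hard class labels (0 or 1)
probs = clf.predict_proba(X_cls_test)[:, 1]  # estimated probability of class 1

print(labels[:5])
print(probs[:5].round(2))
```

Probabilities are often more useful than labels for stakeholders, since you can pick a decision threshold that matches the business cost of false positives versus false negatives.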