# Building and Refining Your Machine Learning Model

In this notebook, we will learn how to train, test, and improve machine learning models step by step. Let's get started!

## Concept 3: Training, Testing, and Improving Models

- 🎯 Train/validation/test split strategy
- 🤖 Model selection and hyperparameter tuning
- 📊 Cross-validation techniques
- 🔧 Overfitting and underfitting solutions
- 📈 Performance optimization strategies

## 🎯 Data Splitting Strategy

To evaluate our model fairly, we split our dataset into three parts: training, validation, and testing.

![Train Test Split Image](images/train_test_split.png)

- **Training (60-80%)**: The data the model learns from.
- **Validation (10-20%)**: Used to tune hyperparameters and compare candidate models.
- **Test (10-20%)**: Held back until the end for a final evaluation of the model's performance on unseen data.
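
One way to get a three-way split is to call `train_test_split` twice. The sketch below assumes a 60/20/20 split and uses synthetic data from `make_classification` as a stand-in; the variable names are illustrative, and you would replace `X` and `y` with your own features and labels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption); replace with your own X and y
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Step 1: hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2 overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Passing `stratify` keeps the class proportions roughly equal across the three parts, which matters most when classes are imbalanced.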

## 🤖 Model Selection Process

When building a machine learning model, follow these steps:

1. **Start simple**: Try basic models first.
2. **Compare algorithms**: Use different models like Logistic Regression, Random Forest, and SVM.
3. **Tune hyperparameters**: Improve model performance by adjusting settings.
4. **Validate results**: Use cross-validation to check how well your model performs.

_💡 Remember: Complex models are not always better!_
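
Steps 3 and 4 can be combined with scikit-learn's `GridSearchCV`, which tries each hyperparameter combination and scores it with cross-validation. This is a minimal sketch, not a recommended configuration: the synthetic data and the grid of `C` values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption); replace with your own X and y
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative grid: C is the inverse of the regularization strength
param_grid = {"C": [0.01, 0.1, 1, 10]}

# Each candidate is scored with 5-fold cross-validation on the training set
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")

# The test set is touched only once, after tuning is finished
print(f"Test accuracy: {search.score(X_test, y_test):.3f}")
```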

## ⚖️ Overfitting vs Underfitting

A good model is complex enough to capture the real patterns in the data, but not so complex that it memorizes noise.

![Overfitting and Underfitting Graph](images/overfitting_underfitting.png)

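- 🎯 **Just Right**: Good performance on both training data and new data.
- 📈 **Overfitting**: The model memorizes training data and performs poorly on new data.
- 📉 **Underfitting**: The model is too simple to capture the patterns in the data, so it performs poorly even on training data.
- 🔧 **Solutions**: Regularization and more data reduce overfitting; a more flexible model or better features reduce underfitting. Cross-validation helps you detect both.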
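
A quick way to see both problems is to vary a model's complexity and compare training and validation scores. The sketch below is one possible illustration, assuming a decision tree whose `max_depth` controls complexity and synthetic data with some label noise (`flip_y`); neither choice comes from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 10% label noise, so memorizing it hurts on new data
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

results = {}
for depth in [1, 3, 5, 10, None]:  # None lets the tree grow until it fits perfectly
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_val, y_val))
    train_acc, val_acc = results[depth]
    print(f"max_depth={str(depth):>4}  train={train_acc:.3f}  val={val_acc:.3f}")
```

A low score on both sets suggests underfitting, while a large gap between a high training score and a lower validation score suggests overfitting.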

## 🔧 Model Training and Evaluation

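Let's see how to train and evaluate different models using scikit-learn (`sklearn`). In this example, 5-fold cross-validation on the training set plays the role of a separate validation set.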


```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# X and y are your features and labels.
# The Iris dataset is only a stand-in so this cell runs; swap in your own data.
X, y = load_iris(return_X_y=True)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define models to compare
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=1000)
}

# Compare models using cross-validation and evaluate on test data
for name, model in models.items():
    # Cross-validation scores on training data only (the test set stays untouched)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name} CV Score: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

    # Train the model on the full training set
    model.fit(X_train, y_train)
    # Evaluate on training data
    train_score = model.score(X_train, y_train)
    # Evaluate on testing data
    test_score = model.score(X_test, y_test)
    print(f"{name} - Train: {train_score:.3f}, Test: {test_score:.3f}")

    # Per-class breakdown of the test predictions
    y_pred = model.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
```
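
The loop above reports accuracy with default folds. As a variation on the cross-validation step, the sketch below uses `StratifiedKFold` with `cross_validate` to collect several metrics along with the training scores. The imbalanced synthetic data and the choice of accuracy and F1 are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in data (about 80% / 20%); replace with your own X and y
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y,
    cv=cv,
    scoring=["accuracy", "f1"],
    return_train_score=True,  # lets us compare train and validation folds
)

for metric in ["accuracy", "f1"]:
    train_mean = scores[f"train_{metric}"].mean()
    val_mean = scores[f"test_{metric}"].mean()
    print(f"{metric}: train={train_mean:.3f}, validation={val_mean:.3f}")
```

On imbalanced data, accuracy alone can look good while F1 on the minority class is weak, so reporting both gives a fuller picture.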


### 🚀 Open in Colab
[Open this notebook in Google Colab](https://colab.research.google.com/github/Roopesht/codeexamples/blob/main/genai/python_easy/4/concept_3.ipynb)

## 🎯 Key Takeaway

Great models are built through systematic experimentation and validation!

### 💭 Think About It

How would you know if your model is ready for production?