
**Author:** Navjyot Architect  
**Email:** navjyot.dataarchitect@proton.me  

Hello there! Excellent question, Captain! Moving from isolated Colab notebooks to a structured, professional project in VS Code is a  **critical step**  in becoming a data scientist or ML engineer. It's all about reproducibility, collaboration, maintainability, and scalability.

You're absolutely right to focus on  `requirements.txt`,  YAML configs,  `.gitignore`, and a proper folder structure. This is how real-world ML projects are built.

Let's break down the conceptual steps first, then we'll map them to a concrete project structure using a mixed numerical and categorical dataset like  **Customer Churn Prediction**.

----------

## Part 1: Conceptual Framework - The Professional ML Project Lifecycle

When professionals build ML projects, they don't just dump code into one file. They follow a well-defined process and structure to ensure robustness.

### 1. The "Why" - Beyond Colab

-   **Reproducibility:**  Anyone (including your future self) should be able to set up the project and get the same results.
-   **Maintainability:**  Code is modular, easy to update, and debug.
-   **Collaboration:**  Multiple people can work on different parts without conflict.
-   **Scalability:**  Easier to add new features, models, or integrate with other systems.
-   **Version Control:**  Track changes, revert mistakes, and collaborate effectively using Git.
-   **Deployment Readiness:**  A structured project is a prerequisite for deploying your model as an API or service.

### 2. Core Phases of an ML Project (Professional Lens)

Here's how the typical ML phases translate into a structured project:

#### Phase 0: Project Setup & Environment Management

-   **Concept:**  Before writing any ML code, you set up the project folder, define dependencies, and create an isolated environment. This prevents "it works on my machine" problems.
-   **Key Files/Folders:**
    -   `project_root/`: The main project directory.
    -   `venv/`  or  `.conda/`: Your isolated virtual environment.
    -   `requirements.txt`: Lists all Python packages and their exact versions.
    -   `.gitignore`: Specifies files/folders Git should ignore (e.g.,  `venv/`, data, models).
    -   `README.md`: Project description, setup instructions, how to run.
    -   `conf/config.yaml`  (or  `config.ini`): Centralized place for configuration parameters (e.g., data paths, model hyperparameters). Keep secrets such as database credentials out of tracked config files, for example in an untracked  `.env`  file.
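
To make "exact versions" concrete, here is a minimal sketch that builds pinned `name==version` lines from whatever is installed in the active environment. The helper name `pinned_lines` and the package list are assumptions for illustration; `pip freeze` produces a similar listing for the whole environment.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_lines(packages):
    """Return 'name==version' lines for the packages that are installed."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Not installed in this environment, so there is nothing to pin
            print(f"skipping {name}: not installed")
    return lines


if __name__ == "__main__":
    # Hypothetical package list; adjust it to the project's real dependencies
    for line in pinned_lines(["pandas", "scikit-learn", "pyyaml"]):
        print(line)
```

Pinning matters because an unpinned `pandas` entry can resolve to different releases on different machines, which is one common source of "it works on my machine" problems.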

#### Phase 1: Data Ingestion & Understanding

-   **Concept:**  Obtaining the raw data, storing it, and performing an initial inspection to understand its structure, types, and potential issues.
-   **Key Files/Folders:**
    -   `data/raw/`: Where original, unaltered datasets are stored.
    -   `src/data/make_dataset.py`: Script to download data from external sources, or load from raw and perform initial sanity checks.
    -   `notebooks/01_data_understanding.ipynb`: A Jupyter notebook for interactive exploration, generating initial statistics and visualizations.  _Crucially, these notebooks are for exploration, not for production code._
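
As an illustration of what "initial sanity checks" could look like, the sketch below uses only the standard library to count rows, report missing columns, and count blank cells. The function name `sanity_check` and the three-column sample are made up for this example; a real script would more likely run the same checks with pandas.

```python
import csv
import io


def sanity_check(csv_text, expected_columns):
    """Report row count, missing expected columns, and blank cells per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    missing = [c for c in expected_columns if c not in columns]
    blanks = {c: 0 for c in columns}
    rows = 0
    for row in reader:
        rows += 1
        for col in columns:
            value = row.get(col)
            # Treat missing values and whitespace-only strings as blank
            if value is None or value.strip() == "":
                blanks[col] += 1
    return {"rows": rows, "missing_columns": missing, "blank_cells": blanks}


# Tiny made-up sample, not real data
sample = "customerID,tenure,TotalCharges\nA1,1,29.85\nB2,0, \n"
report = sanity_check(sample, ["customerID", "tenure", "Churn"])
print(report)
```

Checks like these are cheap to run on every load and catch problems such as a renamed column or whitespace-only numeric fields before they reach the cleaning step.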

#### Phase 2: Data Cleaning & Preprocessing

-   **Concept:**  Handling missing values, outliers, correcting data types, dealing with inconsistent entries, and transforming raw features into a format suitable for ML models (e.g., encoding categorical variables, scaling numerical features).
-   **Key Files/Folders:**
    -   `data/interim/`: Intermediate datasets generated during cleaning.
    -   `data/processed/`: The final, clean dataset ready for modeling.
    -   `src/data/make_dataset.py`: Might also contain cleaning logic, or call helper functions.
    -   `src/features/build_features.py`: Contains functions/classes for feature engineering (e.g., one-hot encoding, standard scaling, creating new features).
    -   `src/utils/`: Helper functions used across different scripts.
    -   `notebooks/02_data_cleaning_and_eda.ipynb`: Interactive notebook to experiment with cleaning techniques and perform detailed EDA.

#### Phase 3: Exploratory Data Analysis (EDA)

-   **Concept:**  Deep dive into the processed data to uncover patterns, relationships, and insights. This informs model selection and further feature engineering.
-   **Key Files/Folders:**
    -   `notebooks/02_data_cleaning_and_eda.ipynb`: Continued interactive analysis, generating plots.
    -   `reports/figures/`: Directory to save important charts and visualizations.
    -   `reports/eda_report.md`  (optional): A markdown file summarizing key EDA findings.

#### Phase 4: Model Development & Training

-   **Concept:**  Selecting appropriate ML algorithms, splitting data, training models, and tuning hyperparameters.
-   **Key Files/Folders:**
    -   `src/models/train_model.py`: The main script for training the ML model. It loads the processed features (produced by  `build_features.py`), splits the data, and handles model instantiation, training, and evaluation.
    -   `src/models/predict_model.py`: A script for making predictions using a trained model.
    -   `models/`: Directory to save trained model artifacts (e.g.,  `model.pkl`,  `scaler.pkl`).
    -   `notebooks/03_model_training_and_evaluation.ipynb`: For rapid prototyping and experimenting with different models/hyperparameters before formalizing in scripts.
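
Churn data is usually imbalanced, so the split should keep the same class mix in the training and test sets. The sketch below illustrates the idea of a stratified, seeded split in plain Python. The function `stratified_split` and the labels are assumptions for illustration; in practice `train_test_split(..., stratify=y)` from scikit-learn does this.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in both."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 80 + [1] * 20  # made-up labels: 20% churners
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "churners in the test set")
```

Because each class is sampled separately, the test set keeps the 20% churn rate of the full data, and the fixed seed means rerunning the script yields the same split.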

#### Phase 5: Model Evaluation & Testing

-   **Concept:**  Objectively assessing model performance on unseen data using appropriate metrics (accuracy, precision, recall, F1, RMSE, etc.) and ensuring generalization.
-   **Key Files/Folders:**
    -   Part of  `src/models/train_model.py`: Evaluation metrics are calculated and logged here.
    -   `reports/results/`: Directory to store evaluation metrics (e.g.,  `metrics.json`,  `evaluation_report.txt`).
    -   `reports/mlruns/`  (if using MLflow): For experiment tracking.
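
For reference, the sketch below shows how accuracy, precision, recall, and F1 follow from the four confusion-matrix counts. It is a plain-Python illustration with made-up labels; the function name `classification_metrics` is hypothetical, and the training script would normally rely on `sklearn.metrics` instead.

```python
def classification_metrics(y_true, y_pred):
    """Compute binary classification metrics, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


# Made-up labels: 3 actual churners among 8 customers
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On imbalanced churn data, accuracy alone can look good while recall is poor, which is why the evaluation step records several metrics side by side.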

#### Phase 6: Model Deployment (Future Step)

-   **Concept:**  Making the trained model available for predictions in a production environment (e.g., a web API, batch processing).
-   **Key Files/Folders:**
    -   `src/app/api.py`: (e.g., Flask/FastAPI app for serving predictions).
    -   `Dockerfile`: For containerizing the application.
    -   `requirements.txt`: (Used by Docker).

### 3. Essential Tools & Practices in VS Code

-   **Integrated Terminal:**  Run commands (`pip install`,  `python script.py`,  `git`).
-   **Python Extension:**  Provides IntelliSense, linting, debugging, environment selection.
-   **Jupyter Support:**  Open and run  `.ipynb`  files directly in VS Code.
-   **Git Integration:**  Source control directly within the IDE.
-   **YAML Extension:**  Syntax highlighting for  `config.yaml`.
-   **Virtual Environments:**  Always use a  `venv`  or  `conda`  environment to isolate project dependencies. VS Code makes it easy to select and activate these.
-   **Formatting & Linting (e.g., Black, Flake8):**  Black reformats code automatically; Flake8 checks for style issues and common errors.

----------

## Part 2: Concrete Example - Customer Churn Prediction Project

Let's apply these concepts to a  **Customer Churn Prediction**  project. This dataset typically includes a mix of numerical features (e.g., tenure, monthly charges) and categorical features (e.g., gender, contract type, internet service).

**Goal:**  Predict whether a customer will churn (cancel their service) or not.

### Step 1: Project Setup in VS Code

1.  **Create Project Directory:**
    
    -   Open your terminal (or VS Code integrated terminal).
    -   `mkdir customer_churn_prediction`
    -   `cd customer_churn_prediction`
    -   `code .`  (This opens VS Code in your new project directory).
2.  **Initialize Virtual Environment:**
    
    -   In the VS Code terminal:  `python -m venv venv`
    -   Activate it:
        -   Windows:  `.\venv\Scripts\activate`
        -   macOS/Linux:  `source venv/bin/activate`
    -   **VS Code Tip:**  VS Code will usually detect the new  `venv`  and ask if you want to select it as your interpreter. Always say yes. If not, open Command Palette (`Ctrl+Shift+P`  or  `Cmd+Shift+P`), type "Python: Select Interpreter", and choose the one inside your  `venv`  folder.
3.  **Create  `requirements.txt`:**
    
    -   Create a file named  `requirements.txt`  in the root of your project.
    -   Add initial dependencies:
        
        ```
        pandas
        scikit-learn
        matplotlib
        seaborn
        jupyter # To run notebooks in VS Code
        ipykernel # Required by jupyter
        pyyaml # For config.yaml
        
        ```
        
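    -   Install them:  `pip install -r requirements.txt`
    -   The list above is unpinned. Once the environment works, record the exact versions you installed (for example with  `pip freeze`) so that others get the same packages.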
4.  **Create  `.gitignore`:**
    
    -   Create a file named  `.gitignore`  in the root of your project.
    -   Add common ignores:
        
        ```
        # Virtual environment
        venv/
        .venv/
        
        # Data
        data/raw/
        data/interim/
        data/processed/
        
        # Models
        models/
        
        # Jupyter notebooks outputs
        .ipynb_checkpoints/
        *.pyc
        __pycache__/
        
        # VS Code specific
        .vscode/
        
        # Environment variables
        .env
        
        ```
        
    -   _Explanation:_  We ignore  `venv/`  because it's locally generated. We ignore data and models because they can be large and are usually tracked separately or regenerated by scripts.
5.  **Create  `README.md`:**
    
    -   Create a file named  `README.md`  in the root.
    -   Add basic project description, setup instructions, and how to run.
        
        ```
        # Customer Churn Prediction Project
        
        This project aims to predict customer churn using a mixed dataset of numerical and categorical features.
        
        ## Setup
        
        1.  Clone the repository: `git clone <repository_url>`
        2.  Navigate to the project directory: `cd customer_churn_prediction`
        3.  Create a virtual environment: `python -m venv venv`
        4.  Activate the environment:
            *   Windows: `.\venv\Scripts\activate`
            *   macOS/Linux: `source venv/bin/activate`
        5.  Install dependencies: `pip install -r requirements.txt`
        
        ## Data
        
        The raw data is expected to be in `data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv`.
        (Download the Telco Customer Churn dataset from a source like Kaggle and place it there.)
        
        ## How to Run
        
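        1.  **Explore Data:** Open `notebooks/01_data_understanding.ipynb` and `notebooks/02_data_cleaning_and_eda.ipynb`
        2.  **Process Data:** `python -m src.data.make_dataset`
        3.  **Build Features:** `python -m src.features.build_features`
        4.  **Train Model:** `python -m src.models.train_model`
        5.  **Predict:** `python -m src.models.predict_model --input_data <path_to_new_data>`
        
        Run the commands from the project root so that the `src` package and `conf/config.yaml` are found.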
        
        ## Project Structure
        
        ~~~
        .
        ├── data/
        │   ├── raw/
        │   ├── interim/
        │   └── processed/
        ├── notebooks/
        ├── src/
        │   ├── __init__.py
        │   ├── data/
        │   │   └── make_dataset.py
        │   ├── features/
        │   │   └── build_features.py
        │   ├── models/
        │   │   ├── train_model.py
        │   │   └── predict_model.py
        │   └── utils/
        │       └── helpers.py
        ├── models/
        ├── reports/
        │   ├── figures/
        │   └── results/
        ├── conf/
        │   └── config.yaml
        ├── venv/
        ├── .gitignore
        ├── requirements.txt
        └── README.md
        ~~~
        
        ```
        
6.  **Create  `config.yaml`:**
    
    -   Create a folder  `conf/`  in the root.
    -   Inside  `conf/`, create  `config.yaml`.
    -   Add configurations:
        
        ```
        # Data paths
        data_paths:
          raw: "data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv"
          processed: "data/processed/processed_churn_data.csv"
          interim: "data/interim/cleaned_churn_data.csv"
        
        # Feature engineering parameters
        features:
          categorical_features:
            - "gender"
            - "Partner"
            - "Dependents"
            - "PhoneService"
            - "MultipleLines"
            - "InternetService"
            - "OnlineSecurity"
            - "OnlineBackup"
            - "DeviceProtection"
            - "TechSupport"
            - "StreamingTV"
            - "StreamingMovies"
            - "Contract"
            - "PaperlessBilling"
            - "PaymentMethod"
          numerical_features:
            - "tenure"
            - "MonthlyCharges"
            - "TotalCharges"
          target_feature: "Churn"
          features_to_drop: # Example of columns to drop during preprocessing
            - "customerID"
        
        # Model parameters
        model_params:
          random_state: 42
          test_size: 0.2
          # Logistic Regression example
          logistic_regression:
            C: 0.1
            solver: "liblinear"
            max_iter: 1000
          # RandomForestClassifier example
          random_forest:
            n_estimators: 100
            max_depth: 10
            min_samples_split: 5
        
        # Output paths
        output_paths:
          model_dir: "models/"
          scaler_path: "models/scaler.pkl"
          model_path: "models/logistic_regression_model.pkl"
          metrics_path: "reports/results/metrics.json"
          figures_dir: "reports/figures/"
        
        ```
        
    -   _Explanation:_  This centralizes all parameters. Scripts will read from this, making it easy to change settings without altering code.
7.  **Create Project Folder Structure:**
    
    -   Manually create the directories defined in your  `README.md`  and  `config.yaml`:
        
        ```
        .
        ├── conf/
        │   └── config.yaml
        ├── data/
        │   ├── raw/
        │   ├── interim/
        │   └── processed/
        ├── models/
        ├── notebooks/
        ├── reports/
        │   ├── figures/
        │   └── results/
        ├── src/
        │   ├── data/
        │   ├── features/
        │   ├── models/
        │   └── utils/
        ├── .gitignore
        ├── README.md
        └── requirements.txt
        
        ```
        
    -   Create an empty  `__init__.py`  file in  `src/`,  `src/data/`,  `src/features/`,  `src/models/`, and  `src/utils/`  so that Python treats them as packages.
        -   e.g.,  `touch src/__init__.py`,  `touch src/data/__init__.py`, etc. (on Windows, create the empty files from the VS Code Explorer instead).
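
If creating the folders by hand feels tedious, the layout above can be scripted. The following is a minimal sketch using `pathlib`; the function name `scaffold` is hypothetical, and the demo runs in a temporary directory so it does not touch a real project.

```python
import tempfile
from pathlib import Path

# Directories from the project layout above
DIRECTORIES = [
    "conf", "data/raw", "data/interim", "data/processed",
    "models", "notebooks", "reports/figures", "reports/results",
    "src/data", "src/features", "src/models", "src/utils",
]
# Folders that need an __init__.py to be importable as packages
PACKAGES = ["src", "src/data", "src/features", "src/models", "src/utils"]


def scaffold(root):
    """Create the project folders and empty __init__.py files under root."""
    root = Path(root)
    for directory in DIRECTORIES:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for package in PACKAGES:
        (root / package / "__init__.py").touch(exist_ok=True)


# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    scaffold(tmp)
    created = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("__init__.py"))
    print(created)
```

To set up a real project, call `scaffold(".")` from the project root. The script is safe to rerun because existing folders and files are left untouched.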

### Step 2: Data Acquisition & Initial Exploration (using  `notebooks/01_data_understanding.ipynb`)

1.  **Download Data:**  Find a "Telco Customer Churn" dataset (e.g., from Kaggle) and place it in  `data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv`.
    
2.  **Create  `notebooks/01_data_understanding.ipynb`:**
    
    -   Open  `notebooks/01_data_understanding.ipynb`  in VS Code.
    -   Load the config:
        
        ```
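        import os
        from pathlib import Path
        
        import yaml
        import pandas as pd
        
        # Notebooks usually start in notebooks/; switch to the project root
        # so that the relative paths below and in the config resolve
        if Path.cwd().name == 'notebooks':
            os.chdir('..')
        
        # Load configuration
        with open('conf/config.yaml', 'r') as f:
            config = yaml.safe_load(f)
        
        RAW_DATA_PATH = config['data_paths']['raw']
        
        # Load data
        df = pd.read_csv(RAW_DATA_PATH)
        
        # Initial inspection
        print(df.head())
        df.info()  # info() prints its summary itself and returns None
        print(df.describe(include='all'))
        print(df.isnull().sum())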
        
        ```
        
    -   _Purpose:_  This notebook is for quick checks, data types, missing values, and getting a feel for the data. No heavy cleaning or modeling here.
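
Relative paths such as `conf/config.yaml` only resolve when the working directory is the project root, which is often not the case for a notebook stored in `notebooks/`. One way to make this robust is to search upward for the config file. This is a sketch under that assumption; `find_project_root` is a hypothetical helper, and the demo builds a throwaway folder tree.

```python
import tempfile
from pathlib import Path


def find_project_root(start=None, marker="conf/config.yaml"):
    """Walk up from start until a directory containing marker is found."""
    current = Path(start or Path.cwd()).resolve()
    for candidate in [current, *current.parents]:
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No {marker} found at or above {current}")


# Demo: a fake project with conf/config.yaml and a notebooks/ folder
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve()
    (project / "conf").mkdir()
    (project / "conf" / "config.yaml").write_text("data_paths: {}\n")
    (project / "notebooks").mkdir()

    found = find_project_root(project / "notebooks")
    found_matches = found == project
    print(found_matches)
```

In a notebook, `os.chdir(find_project_root())` at the top would then let every later cell use the same relative paths as the scripts.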

### Step 3: Data Cleaning & Preprocessing (Scripted -  `src/data/make_dataset.py`,  `src/features/build_features.py`)

This is where you move from interactive exploration to production-ready code.

1.  **Create  `src/utils/helpers.py`:**  For reusable functions.
    
    ```
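    # src/utils/helpers.py
    import os
    import yaml
    
    def load_config(config_path='conf/config.yaml'):
        """Load the YAML config; the default path is relative to the project root."""
        with open(config_path, 'r') as f:
            return yaml.safe_load(f)
    
    def save_dataframe(df, path):
        """Save a DataFrame as CSV, creating the parent directory if needed."""
        os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
        df.to_csv(path, index=False)
        print(f"DataFrame saved to {path}")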
    
    ```
    
2.  **Create  `src/data/make_dataset.py`:**
    
    ```
    # src/data/make_dataset.py
    import pandas as pd
    from src.utils.helpers import load_config, save_dataframe
    
    # Service columns that use 'No internet service' / 'No phone service' as an extra category
    SERVICE_COLUMNS = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
                       'StreamingTV', 'StreamingMovies', 'MultipleLines']
    
    def clean_data(df, config):
        # Work on a copy so the caller's DataFrame is not modified
        df = df.copy()
    
        # Collapse 'No internet service' and 'No phone service' into 'No'
        df[SERVICE_COLUMNS] = df[SERVICE_COLUMNS].replace(
            {'No internet service': 'No', 'No phone service': 'No'}
        )
    
        # Convert TotalCharges to numeric; values that cannot be parsed become NaN
        df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
    
        # Handle missing TotalCharges (often a few rows, impute with median or drop)
        # For simplicity, drop those rows for now
        df = df.dropna(subset=['TotalCharges'])
    
        # Drop the columns listed under features_to_drop in the config, if present
        to_drop = [c for c in config['features']['features_to_drop'] if c in df.columns]
        df = df.drop(columns=to_drop)
    
        # Convert the target variable to numerical (Yes/No to 1/0)
        target_feature = config['features']['target_feature']
        df[target_feature] = df[target_feature].map({'Yes': 1, 'No': 0})
    
        return df
    
    def main():
        config = load_config()
        raw_data_path = config['data_paths']['raw']
        interim_data_path = config['data_paths']['interim']
    
        print(f"Loading raw data from {raw_data_path}...")
        df = pd.read_csv(raw_data_path)
    
        print("Cleaning data...")
        cleaned_df = clean_data(df, config)
    
        save_dataframe(cleaned_df, interim_data_path)
        print("Data cleaning complete. Interim data saved.")
    
    if __name__ == "__main__":
        main()
    
    ```
    
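    -   _Run (from the project root):_  `python -m src.data.make_dataset`. Running the file by path (`python src/data/make_dataset.py`) puts  `src/data/`  rather than the project root on the import path, so  `from src.utils.helpers import ...`  would fail.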
3.  **Create  `src/features/build_features.py`:**
    
    ```
    # src/features/build_features.py
    import pandas as pd
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.compose import ColumnTransformer
    import joblib # For saving/loading preprocessors
    from src.utils.helpers import load_config, save_dataframe
    
    def build_preprocessor(df, config):
        categorical_features = config['features']['categorical_features']
        numerical_features = config['features']['numerical_features']
        target_feature = config['features']['target_feature']
    
        # Ensure features exist in dataframe and filter target
        categorical_features = [f for f in categorical_features if f in df.columns and f != target_feature]
        numerical_features = [f for f in numerical_features if f in df.columns and f != target_feature]
    
        numerical_transformer = StandardScaler()
        categorical_transformer = OneHotEncoder(handle_unknown='ignore')
    
        preprocessor = ColumnTransformer(
            transformers=[
                ('num', numerical_transformer, numerical_features),
                ('cat', categorical_transformer, categorical_features)
            ],
            remainder='passthrough',  # Keep columns not listed in the config (e.g. SeniorCitizen)
            sparse_threshold=0,  # Always return a dense array
            verbose_feature_names_out=False  # Plain column names, no 'num__'/'cat__' prefixes
        )
        return preprocessor
    
    def main():
        config = load_config()
        interim_data_path = config['data_paths']['interim']
        processed_data_path = config['data_paths']['processed']
        scaler_path = config['output_paths']['scaler_path']
        target_feature = config['features']['target_feature']
    
        print(f"Loading interim data from {interim_data_path}...")
        df_interim = pd.read_csv(interim_data_path)
    
        # Separate features and target before fitting preprocessor
        X = df_interim.drop(columns=[target_feature])
        y = df_interim[target_feature]
    
        print("Building preprocessor and transforming features...")
        preprocessor = build_preprocessor(X, config)
        X_processed = preprocessor.fit_transform(X)
    
        # Ask the fitted ColumnTransformer for the output column names
        # (scaled numerical, one-hot encoded categorical, then passthrough columns).
        # Building the list by hand would miss passthrough columns and break the
        # DataFrame construction below.
        all_feature_names = preprocessor.get_feature_names_out()
    
        # Create processed DataFrame
        df_processed = pd.DataFrame(X_processed, columns=all_feature_names)
        df_processed[target_feature] = y.reset_index(drop=True) # Add target back
    
        save_dataframe(df_processed, processed_data_path)
        joblib.dump(preprocessor, scaler_path) # Save the fitted preprocessor
        print(f"Features built and processed data saved to {processed_data_path}")
        print(f"Preprocessor saved to {scaler_path}")
    
    if __name__ == "__main__":
        main()
    
    ```
    
    -   _Run (from the project root):_  `python -m src.features.build_features`
    -   _Result:_  You should now have  `data/processed/processed_churn_data.csv`  and  `models/scaler.pkl`.
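
One caveat about the flow above: the preprocessor is fitted on the full dataset before the train/test split, so the scaler's statistics include the test rows. A common alternative is to split first and fit the preprocessing together with the model on the training part only. The sketch below shows that pattern with a scikit-learn `Pipeline` on a tiny made-up frame; it is an illustration of the idea, not a drop-in replacement for the scripts.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame with the same kind of columns as the churn data
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 28, 62, 13, 16, 58],
    "MonthlyCharges": [29.9, 56.9, 53.9, 42.3, 99.7, 89.1,
                       29.8, 104.8, 56.2, 49.9, 18.9, 100.4],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year",
                 "Month-to-month", "Month-to-month", "Month-to-month", "Month-to-month",
                 "One year", "Month-to-month", "Two year", "One year"],
    "Churn": [0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0],
})
X, y = df.drop(columns=["Churn"]), df["Churn"]

# Split first, so the test rows never influence the fitted scaler or encoder
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["tenure", "MonthlyCharges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

pipeline.fit(X_train, y_train)          # fits scaler, encoder, and model on train only
predictions = pipeline.predict(X_test)  # applies the same fitted steps to test rows
print(len(predictions))
```

A further benefit of this design is that a single saved pipeline object carries both preprocessing and model, so the prediction script cannot accidentally pair a model with the wrong preprocessor.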

### Step 4: More Detailed EDA (using  `notebooks/02_data_cleaning_and_eda.ipynb`)

1.  **Create  `notebooks/02_data_cleaning_and_eda.ipynb`:**
    -   Load  `processed_churn_data.csv`.
    -   Perform in-depth visualizations:
        
        ```
        import os
        import sys
        from pathlib import Path
        
        import matplotlib.pyplot as plt
        import pandas as pd
        import seaborn as sns
        
        # Notebooks usually start in notebooks/; switch to the project root so that
        # `src` is importable and the relative paths in the config resolve
        if Path.cwd().name == 'notebooks':
            os.chdir('..')
        sys.path.insert(0, os.getcwd())
        
        from src.utils.helpers import load_config
        
        # Load configuration
        config = load_config()
        PROCESSED_DATA_PATH = config['data_paths']['processed']
        TARGET_FEATURE = config['features']['target_feature']
        FIGURES_DIR = config['output_paths']['figures_dir']
        
        # Ensure figures directory exists
        os.makedirs(FIGURES_DIR, exist_ok=True)
        
        df_processed = pd.read_csv(PROCESSED_DATA_PATH)
        
        # Example 1: Churn distribution
        plt.figure(figsize=(6, 4))
        sns.countplot(x=TARGET_FEATURE, data=df_processed)
        plt.title('Churn Distribution')
        plt.savefig(os.path.join(FIGURES_DIR, 'churn_distribution.png'))
        plt.show()
        
        # Example 2: MonthlyCharges distribution by Churn
        # (values in the processed data are standardized, so the axis is in standard deviations)
        plt.figure(figsize=(8, 6))
        sns.histplot(data=df_processed, x='MonthlyCharges', hue=TARGET_FEATURE, kde=True)
        plt.title('MonthlyCharges Distribution by Churn')
        plt.savefig(os.path.join(FIGURES_DIR, 'monthlycharges_churn_dist.png'))
        plt.show()
        
        # Example 3: Correlation heatmap (numerical features and target only, to keep it readable)
        plt.figure(figsize=(10, 8))
        num_cols = config['features']['numerical_features'] + [TARGET_FEATURE]
        corr = df_processed[num_cols].corr()
        sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
        plt.title('Correlation Heatmap of Numerical Features')
        plt.savefig(os.path.join(FIGURES_DIR, 'correlation_heatmap.png'))
        plt.show()
        
        ```
        
    -   _Result:_  Plots saved in  `reports/figures/`.

### Step 5: Model Training & Evaluation (Scripted -  `src/models/train_model.py`)

1.  **Create  `src/models/train_model.py`:**
    
    ```
    # src/models/train_model.py
    import pandas as pd
    import joblib
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
    from src.utils.helpers import load_config
    
    def evaluate_model(model, X_test, y_test):
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None
    
        metrics = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred),
            "recall": recall_score(y_test, y_pred),
            "f1_score": f1_score(y_test, y_pred),
        }
        if y_prob is not None:
            metrics["roc_auc"] = roc_auc_score(y_test, y_prob)
    
        return metrics
    
    def main():
        config = load_config()
        processed_data_path = config['data_paths']['processed']
        target_feature = config['features']['target_feature']
        model_path = config['output_paths']['model_path']
        metrics_path = config['output_paths']['metrics_path']
        random_state = config['model_params']['random_state']
        test_size = config['model_params']['test_size']
    
        print(f"Loading processed data from {processed_data_path}...")
        df_processed = pd.read_csv(processed_data_path)
    
        X = df_processed.drop(columns=[target_feature])
        y = df_processed[target_feature]
    
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_state, stratify=y
        )
        print("Data split into training and testing sets.")
    
        # --- Choose your model (e.g., Logistic Regression) ---
        print("Training Logistic Regression model...")
        lr_params = config['model_params']['logistic_regression']
        model = LogisticRegression(random_state=random_state, **lr_params)
        model.fit(X_train, y_train)
        print("Model trained.")
    
        print("Evaluating model...")
        metrics = evaluate_model(model, X_test, y_test)
        print("Evaluation Metrics:")
        for metric, value in metrics.items():
            print(f"- {metric}: {value:.4f}")
    
        # Save model and metrics
        joblib.dump(model, model_path)
        with open(metrics_path, 'w') as f:
            json.dump(metrics, f, indent=4)
    
        print(f"Model saved to {model_path}")
        print(f"Metrics saved to {metrics_path}")
    
    if __name__ == "__main__":
        main()
    
    ```
    
    -   _Run (from the project root):_  `python -m src.models.train_model`
    -   _Result:_  `models/logistic_regression_model.pkl`  and  `reports/results/metrics.json`.

### Step 6: Prediction (Scripted -  `src/models/predict_model.py`)

1.  **Create  `src/models/predict_model.py`:**
    
    ```
    # src/models/predict_model.py
    import pandas as pd
    import joblib
    import argparse
    from src.utils.helpers import load_config
    
    def main():
        parser = argparse.ArgumentParser(description="Predict customer churn.")
        parser.add_argument('--input_data', type=str, required=True,
                            help="Path to the new data CSV file for prediction.")
        args = parser.parse_args()
    
        config = load_config()
        model_path = config['output_paths']['model_path']
        scaler_path = config['output_paths']['scaler_path']
        target_feature = config['features']['target_feature']
    
        print(f"Loading model from {model_path}...")
        model = joblib.load(model_path)
        print(f"Loading preprocessor from {scaler_path}...")
        preprocessor = joblib.load(scaler_path)
    
        print(f"Loading new data from {args.input_data}...")
        new_data_df = pd.read_csv(args.input_data)
    
        # Apply the same cleaning steps as make_dataset.py before transforming.
        # In a real project, move these into one shared function used by both scripts.
        new_data_df = new_data_df.replace(
            {'No internet service': 'No', 'No phone service': 'No'}
        )
        if 'TotalCharges' in new_data_df.columns:
            new_data_df['TotalCharges'] = pd.to_numeric(new_data_df['TotalCharges'], errors='coerce')
            new_data_df = new_data_df.dropna(subset=['TotalCharges'])
    
        # Drop the ID and, if present, the target so only model inputs remain
        features_df = new_data_df.drop(columns=['customerID', target_feature], errors='ignore')
    
        # Apply the same preprocessing as used during training
        X_new_processed = preprocessor.transform(features_df)
    
        print("Making predictions...")
        predictions = model.predict(X_new_processed)
        probabilities = model.predict_proba(X_new_processed)[:, 1]
    
        # Add predictions back to the input rows (customerID is kept for context)
        new_data_df = new_data_df.assign(
            Predicted_Churn=predictions, Churn_Probability=probabilities
        )
    
        print("\n--- Predictions ---")
        print(new_data_df[['Churn_Probability', 'Predicted_Churn']].head())
        # You might save this output, e.g., new_data_df.to_csv("predictions.csv", index=False)
    
    if __name__ == "__main__":
        main()
    
    ```
    
    -   _Create Dummy Test Data:_  Create a  `data/raw/new_customer_data.csv`  (can be a few rows from your original data, without the 'Churn' column).
    -   _Run (from the project root):_  `python -m src.models.predict_model --input_data data/raw/new_customer_data.csv`
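
The prediction script notes that cleaning should ideally live in one function shared with `make_dataset.py`. A minimal sketch of such a function is shown below. The name `clean_features` and its placement (for example in `src/utils/helpers.py`) are assumptions, and the sample rows are made up.

```python
import pandas as pd


def clean_features(df):
    """Cleaning steps that do not depend on the target column (hypothetical helper)."""
    # Collapse the extra 'no service' categories into plain 'No'
    df = df.replace({"No internet service": "No", "No phone service": "No"})
    # Blank or malformed TotalCharges values become NaN and are dropped
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    return df.dropna(subset=["TotalCharges"]).reset_index(drop=True)


# Made-up rows resembling new, unlabeled customers
new_rows = pd.DataFrame({
    "MultipleLines": ["No phone service", "Yes", "No"],
    "OnlineSecurity": ["No internet service", "Yes", "No"],
    "TotalCharges": ["29.85", " ", "108.15"],
})
cleaned = clean_features(new_rows)
print(cleaned)
```

With a helper like this, the training path would call it and then map the target column, while the prediction path would call it and go straight to `preprocessor.transform`, so both see identically cleaned inputs.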

----------

This comprehensive setup demonstrates how to build an ML project professionally in VS Code. It encourages modularity, uses configuration files, manages dependencies, and tracks important artifacts. You're now equipped to tackle more complex ML challenges with a robust foundation! Good luck, Captain!