
## Pipeline for Telecom Customer Churn Prediction

This pipeline outlines the key steps for building a machine learning workflow that predicts customer churn from telecom customer data.

1. **Data Generation and Collection**
   - Collect or generate raw customer data from internal systems and external sources (e.g., a data marketplace).
   - Include historical customer data covering demographics, usage patterns, billing information, and service details.

2. **Data Ingestion**
   - Ingest collected data into the analytics platform (e.g., Databricks, data lake).
   - Store ingested data in the Bronze Layer (raw zone) for traceability and reproducibility.

3. **Data Splitting**
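   - Split the data into training and test sets before cleaning, using a stratified split to preserve the churn/non-churn class proportions.
   - Splitting first means cleaning statistics (e.g., imputation values, capping thresholds) are derived from the training set only, and it mirrors production, where the model must handle raw, uncleaned data.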
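
A stratified split can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline's actual implementation: the `stratified_split` helper, the `churn` column name, and the 80/20 ratio are assumptions made for the example. In practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument would usually do this job.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key, test_fraction=0.2, seed=42):
    """Split rows into train/test sets while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    train, test = [], []
    for group in by_class.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Toy data: 20 churners and 80 retained customers (hypothetical "churn" column).
customers = [{"customer_id": i, "churn": int(i % 5 == 0)} for i in range(100)]
train_rows, test_rows = stratified_split(customers, label_key="churn")

train_rate = sum(r["churn"] for r in train_rows) / len(train_rows)
test_rate = sum(r["churn"] for r in test_rows) / len(test_rows)
print(len(train_rows), len(test_rows), train_rate, test_rate)
```

Because each class is split separately, both sets keep the same churn rate as the full dataset.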

4. **Data Cleaning**
   - Handle missing values.
   - Detect outliers and cap them to reduce the impact of extreme values.
   - Standardize data formats and values.
   - Compute column statistics (mean, median, std, min, max) for validation and profiling.
   - Perform data validation checks to ensure data consistency (schema, datatypes, and null handling).
   - Store cleaned data in the Silver Layer.
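
The sketch below illustrates missing-value handling and outlier capping. It assumes median imputation and the 1.5 × IQR capping rule, which are common choices rather than ones the pipeline prescribes. The helper names and sample values are hypothetical. The parameters are learned from the training set and then reused on the test set.

```python
import statistics

def fit_cleaning_params(values, k=1.5):
    """Learn the imputation value and capping bounds from training data only."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    return {
        "median": statistics.median(observed),
        "lower": q1 - k * iqr,
        "upper": q3 + k * iqr,
    }

def clean(values, params):
    """Impute missing values with the median, then cap values to the bounds."""
    cleaned = []
    for v in values:
        if v is None:
            v = params["median"]
        cleaned.append(min(max(v, params["lower"]), params["upper"]))
    return cleaned

# Hypothetical monthly charges: None marks a missing value, 900.0 is an outlier.
train_charges = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, None, 900.0]
test_charges = [None, 33.0, -500.0]

params = fit_cleaning_params(train_charges)
train_clean = clean(train_charges, params)
test_clean = clean(test_charges, params)
print(params)
print(train_clean)
print(test_clean)
```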

5. **Data Analysis**
   - Perform exploratory data analysis (EDA) to understand distributions, correlations, and patterns.
   - Identify the key factors influencing churn and check that no feature leaks information from the target variable.

6. **Data Transformation**
   - Handle skewness using log transformation or other techniques.
   - Apply feature scaling (e.g., MinMaxScaler, StandardScaler).
   - Encode categorical variables (e.g., one-hot encoding, label encoding).
   - Standardize and normalize data formats.
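
The three transformations above can be illustrated with a small standard-library sketch. The helper names and sample values are hypothetical, and the code stands in for library transformers such as MinMaxScaler rather than reproducing them. The scaling range and the category vocabulary are fitted on training data and then reused.

```python
import math

def fit_min_max(values):
    """Return the (min, max) pair learned from training values."""
    return min(values), max(values)

def min_max_scale(values, lo, hi):
    """Scale values to [0, 1] using a previously fitted range."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

def fit_one_hot(categories):
    """Return the sorted vocabulary of categories seen in training."""
    return sorted(set(categories))

def one_hot(value, vocabulary):
    """Encode a value as a 0/1 vector; unseen categories become all zeros."""
    return [1 if value == c else 0 for c in vocabulary]

# Right-skewed numeric feature: a log transform compresses the long tail.
train_charges = [0.0, 9.0, 99.0, 999.0]
log_charges = [math.log1p(v) for v in train_charges]
lo, hi = fit_min_max(log_charges)
scaled = min_max_scale(log_charges, lo, hi)

# Categorical feature: one-hot encode against the training vocabulary.
contracts = ["month-to-month", "two-year", "one-year", "month-to-month"]
vocab = fit_one_hot(contracts)
encoded = [one_hot(c, vocab) for c in contracts]
print(scaled)
print(vocab, encoded)
```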

7. **SMOTE Analysis**
   - Apply SMOTE (or another resampling technique) to address class imbalance.
   - Resample the training set only; the test set keeps its original class distribution, which prevents data leakage into evaluation.
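
To show what SMOTE does, here is a simplified sketch of its core idea: each synthetic sample is an interpolation between a minority-class point and one of its nearest minority-class neighbours. The `smote` function and the toy data are hypothetical. A real pipeline would more likely rely on a maintained implementation such as `SMOTE` from the imbalanced-learn package.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    point and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Training set only: 6 retained customers (0) vs 3 churners (1).
# Features are (tenure_months, monthly_charge); values are made up.
X_train = [(40, 30.0), (52, 45.0), (60, 50.0), (35, 25.0), (48, 55.0), (70, 60.0),
           (2, 80.0), (4, 95.0), (3, 70.0)]
y_train = [0, 0, 0, 0, 0, 0, 1, 1, 1]

minority = [x for x, y in zip(X_train, y_train) if y == 1]
n_needed = y_train.count(0) - y_train.count(1)
new_points = smote(minority, n_needed, k=2)

X_balanced = X_train + new_points
y_balanced = y_train + [1] * n_needed
print(y_balanced.count(0), y_balanced.count(1))
```

The test set is never passed through `smote`, so evaluation still reflects the real class balance.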

8. **Feature Engineering**
   - Add derived features (e.g., tenure buckets, contract type indicators).
   - Perform feature correlation analysis to identify relationships.
   - Drop unnecessary columns based on analysis.
   - Evaluate feature importance and ranking (e.g., feature_importances_).
   - Reorder features based on importance.
   - Assemble features using VectorAssembler.
   - Apply StandardScaler to feature vectors.
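
As an illustration of the derived features mentioned above, the following sketch adds a tenure bucket and a contract type indicator to a customer record. The column names (`tenure_months`, `contract`) and the bucket boundaries are assumptions chosen for the example, not values taken from the dataset.

```python
def tenure_bucket(months):
    """Map tenure in months to a coarse bucket (boundaries are illustrative)."""
    if months < 12:
        return "0-11"
    if months < 24:
        return "12-23"
    if months < 48:
        return "24-47"
    return "48+"

def add_features(row):
    """Return a copy of the row with derived features added."""
    enriched = dict(row)
    enriched["tenure_bucket"] = tenure_bucket(row["tenure_months"])
    enriched["is_month_to_month"] = int(row["contract"] == "month-to-month")
    return enriched

customer = {"tenure_months": 14, "contract": "month-to-month"}
features = add_features(customer)
print(features)
```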

9. **Model Training**
   - Train two ML algorithms: Logistic Regression (lr) and Random Forest (rf).
   - Fit both models on the training data.
   - Log training metrics, parameters, and artifacts for both models using MLflow for reproducibility.
   - **Generate predictions on test data using both models.**
   - **Compare model performance on test data to select the best model.**
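
To make the fit-then-predict flow concrete, here is a from-scratch sketch of logistic regression trained by batch gradient descent. It is a teaching stand-in only: the pipeline itself would use a library implementation, and the learning rate, epoch count, and toy data are assumptions. Random Forest and MLflow logging are not covered by this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in zip(X, y):
            z = bias + sum(w * f for w, f in zip(weights, features))
            error = sigmoid(z) - label
            grad_b += error
            for j, f in enumerate(features):
                grad_w[j] += error * f
        bias -= lr * grad_b / len(X)
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
    return weights, bias

def predict_proba(model, features):
    """Return the predicted churn probability for one feature vector."""
    weights, bias = model
    return sigmoid(bias + sum(w * f for w, f in zip(weights, features)))

# Features already scaled to [0, 1]: (tenure, monthly_charge). Made-up values.
X_train = [(0.9, 0.2), (0.8, 0.3), (0.7, 0.1), (0.1, 0.9), (0.2, 0.8), (0.15, 0.7)]
y_train = [0, 0, 0, 1, 1, 1]
model = fit_logistic_regression(X_train, y_train)

X_test = [(0.85, 0.25), (0.05, 0.95)]
predictions = [int(predict_proba(model, x) >= 0.5) for x in X_test]
print(model, predictions)
```

In this toy data, short tenure and high charges go with churn, so the fitted tenure weight is negative and the charge weight is positive.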

10. **Model Evaluation**
    - Evaluate the selected model using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.
    - Visualize the results with a confusion matrix.
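
The metrics listed above can be computed directly from the confusion matrix counts. The sketch below is a plain-Python illustration with made-up labels and scores, treating churn (1) as the positive class and assuming a 0.5 decision threshold. ROC-AUC is computed as the fraction of (churner, non-churner) pairs in which the churner receives the higher score.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with churn (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    return (pairs.count((1, 1)), pairs.count((0, 1)),
            pairs.count((1, 0)), pairs.count((0, 0)))

def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def roc_auc(y_true, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up test labels and predicted churn probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(counts, metrics, auc)
```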

11. **ML Pipeline Construction**
    - Build an ML pipeline with a single stage: model training.
    - Integrate the training step for the selected model (e.g., Logistic Regression or Random Forest).

12. **Model Registry**
    - Register the best model in a model registry for versioning and governance.
    - Use Databricks MLflow Model Registry for tracking versions, ownership, and deployment readiness.

13. **Model Deployment**
    - Deploy the registered model for real-time or batch inference.
    - Serve it through a REST API or Databricks Model Serving, or run it in batch scoring jobs.

14. **Inference Pipeline**
    - Use the deployed model to predict churn for new customers.
    - Integrate predictions into business workflows.
    - Save inference results in Gold Layer tables for business dashboards.
    - Monitor data drift and model performance regularly.
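
One way to monitor data drift is to compare the distribution of a feature at training time with its distribution in recent inference data. The sketch below uses the Population Stability Index (PSI) as an example metric. PSI is an assumption here, one common choice rather than something this pipeline prescribes, and the function name, bin count, and sample data are hypothetical.

```python
import math

def population_stability_index(expected, actual, n_bins=5):
    """Compare two samples of one feature; 0.0 means identical bin fractions."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical tenure values: training baseline vs. two inference batches.
baseline = [i % 50 for i in range(200)]
same_batch = list(baseline)
shifted_batch = [v + 20 for v in baseline]

psi_same = population_stability_index(baseline, same_batch)
psi_shifted = population_stability_index(baseline, shifted_batch)
print(psi_same, psi_shifted)
```

A scheduled job could compute such a score per feature and raise an alert when it exceeds a threshold agreed with the business; the threshold itself would need to be chosen for the use case.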