## Machine Learning 

Let's create a notebook that trains machine learning models to predict customer churn.
We will train and compare two classifiers: Logistic Regression and Random Forest.

### Step 1: Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import col

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Customer Churn Model Training") \
    .getOrCreate()

print("Spark session initialized.")


### Step 2: Load the Preprocessed Data

The preprocessed data is stored either in Delta Lake format or as a CSV file. Load it with whichever option matches your setup to continue from the previous step:

In [None]:
# Load the preprocessed data from Delta Lake or CSV. Use ONE option only;
# running both would overwrite the first result with the second.

# Option A: Delta Lake (requires the Delta Lake package on the Spark session)
# df_spark = spark.read.format("delta").load("s3a://your-bucket-name/preprocessed_data")

# Option B: CSV (ensure the file path is correct)
data_path = "C:/Users/ADMIN/Desktop/Projects/Batch-Processing-Project-Customer-Churn-Prediction-Pipeline/datasets/processed_data.csv"
df = pd.read_csv(data_path)

# Convert the pandas DataFrame to a Spark DataFrame
df_spark = spark.createDataFrame(df)

# Check the schema and first few rows
df_spark.printSchema()
df_spark.show(5)


### Step 3: Feature Selection and Vectorization
Spark ML models expect the features as a single vector column. We’ll use the VectorAssembler to combine all feature columns into one column named "features".

In [None]:
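# List of feature columns.
# categorical_columns must already be defined (the categorical column names
# encoded during preprocessing); it is not created in this notebook.
# The loop variable is named c to avoid confusion with the imported col().
feature_columns = ['tenure', 'MonthlyCharges', 'TotalCharges', 'charges_diff'] + \
                  [c + '_Encoded' for c in categorical_columns]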

# Assemble features into a single vector column
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
df_spark = assembler.transform(df_spark)

# Show the assembled features
df_spark.select('features').show(5)
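
The feature list above depends on categorical_columns, which this notebook never defines. As a minimal sketch, assuming a hypothetical list of three categorical columns (the names below are placeholders, not taken from the preprocessing step), the list would be built like this:

```python
# Hypothetical categorical columns; replace with the ones encoded during preprocessing.
categorical_columns = ["gender", "Contract", "PaymentMethod"]

numeric_columns = ["tenure", "MonthlyCharges", "TotalCharges", "charges_diff"]

# Each categorical column is assumed to have an encoded twin named <column>_Encoded.
feature_columns = numeric_columns + [c + "_Encoded" for c in categorical_columns]

print(feature_columns)
```

Every name in feature_columns must exist in the DataFrame, otherwise the assembler raises an error when it transforms the data.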


### Step 4: Train-Test Split

Split the dataset into training and testing datasets.

In [None]:
# Split data into training (80%) and testing (20%)
train_data, test_data = df_spark.randomSplit([0.8, 0.2], seed=1234)

# Show the sizes of the train and test datasets
print(f"Training Data Size: {train_data.count()}")
print(f"Test Data Size: {test_data.count()}")
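
The printed sizes will usually be close to an 80/20 ratio rather than exact, because the split is random per row. The pure-Python sketch below mimics that idea with a hypothetical helper, random_split; it is an illustration of the concept, not Spark's actual implementation.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by an independent random draw."""
    total = float(sum(weights))
    # Cumulative upper bound of each split's share, e.g. [0.8, 1.0]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound
            splits[-1].append(row)
    return splits

rows = list(range(1000))  # stand-in for 1000 customer rows
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=1234)
print(len(train_rows), len(test_rows))  # roughly 800 and 200; exact counts depend on the seed
```

Fixing the seed makes the split reproducible, which is why the cell above passes seed=1234.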


### Step 5: Train Logistic Regression Model

Fit a Logistic Regression classifier on the training data, then score the test data.

In [None]:
# Initialize the Logistic Regression model
lr = LogisticRegression(featuresCol="features", labelCol="Churn")

# Train the model on the training data
lr_model = lr.fit(train_data)

# Make predictions on the test data
lr_predictions = lr_model.transform(test_data)

# Show the predictions
lr_predictions.select("Churn", "prediction", "probability").show(5)


### Step 6: Train Random Forest Model

Fit a Random Forest classifier on the same training data, then score the test data.

In [None]:
# Initialize the Random Forest classifier
rf = RandomForestClassifier(featuresCol="features", labelCol="Churn")

# Train the Random Forest model
rf_model = rf.fit(train_data)

# Make predictions on the test data
rf_predictions = rf_model.transform(test_data)

# Show the predictions
rf_predictions.select("Churn", "prediction", "probability").show(5)


### Step 7: Model Evaluation
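Evaluate both models with a BinaryClassificationEvaluator, which reports the area under the ROC curve (its default metric) or the area under the precision-recall curve. Accuracy and F1 score are not available from this evaluator; computing them requires a MulticlassClassificationEvaluator.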

In [None]:
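# Initialize evaluator. Use the continuous rawPrediction column rather than the
# hard 0/1 prediction so the AUC reflects how well the model ranks customers.
evaluator = BinaryClassificationEvaluator(labelCol="Churn", rawPredictionCol="rawPrediction",
                                          metricName="areaUnderROC")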

# Evaluate Logistic Regression
lr_auc = evaluator.evaluate(lr_predictions)
print(f"Logistic Regression AUC: {lr_auc}")

# Evaluate Random Forest
rf_auc = evaluator.evaluate(rf_predictions)
print(f"Random Forest AUC: {rf_auc}")
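
ROC AUC is a ranking metric, so the value reported depends on the column given to the evaluator as rawPredictionCol: continuous scores carry ranking information that hard 0/1 predictions discard. The sketch below illustrates this in plain Python; roc_auc is a hypothetical helper and the labels and probabilities are made-up toy values, not results from the churn models.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy data: true churn labels and predicted churn probabilities
labels = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.6, 0.4, 0.55, 0.2, 0.1]

# Thresholding at 0.5 collapses the scores to hard 0/1 predictions
hard_predictions = [1.0 if p >= 0.5 else 0.0 for p in probabilities]

auc_from_scores = roc_auc(labels, probabilities)
auc_from_labels = roc_auc(labels, hard_predictions)
print(f"AUC from probabilities: {auc_from_scores:.3f}")
print(f"AUC from hard predictions: {auc_from_labels:.3f}")
```

On this toy data the probabilities give a higher AUC than the thresholded predictions, because thresholding turns many distinct scores into ties.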


### Step 8: Hyperparameter Tuning 

Tune the Logistic Regression hyperparameters with a grid search, scoring each parameter combination by 5-fold cross-validation on the training data.

In [None]:
# Define the parameter grid for tuning
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.1, 0.01])
             .addGrid(lr.elasticNetParam, [0.0, 0.5])
             .build())

# Initialize CrossValidator
crossval = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid,
                          evaluator=evaluator, numFolds=5)

# Run cross-validation to find the best model
cv_model = crossval.fit(train_data)

# Get the best model
best_lr_model = cv_model.bestModel
print("Best Logistic Regression Model:", best_lr_model)
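
Cross-validated grid search multiplies training cost quickly, so it helps to count the model fits before running it. The sketch below enumerates the same grid values in plain Python; the variable names are illustrative, and the extra final fit reflects CrossValidator refitting the best parameter combination on the full training set.

```python
from itertools import product

# Same candidate values as the parameter grid above
reg_params = [0.1, 0.01]
elastic_net_params = [0.0, 0.5]
num_folds = 5

# Every combination of the two hyperparameters
grid = [
    {"regParam": r, "elasticNetParam": e}
    for r, e in product(reg_params, elastic_net_params)
]

fits_during_cv = len(grid) * num_folds  # one fit per combination per fold
total_fits = fits_during_cv + 1         # plus the final refit of the best combination

for params in grid:
    print(params)
print(f"{len(grid)} combinations x {num_folds} folds = {fits_during_cv} fits, {total_fits} in total")
```

Adding a third hyperparameter with three candidate values would triple these counts, which is worth keeping in mind before widening the grid.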


### Step 9: Model Export 
Once the models are trained, save them for later use:

In [None]:
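# Save the trained Logistic Regression model.
# write().overwrite() replaces an existing model at the path, so re-running
# the notebook does not fail; a plain save() raises if the path already exists.
lr_model.write().overwrite().save("path_to_save_lr_model")

# Save the trained Random Forest model
rf_model.write().overwrite().save("path_to_save_rf_model")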


### Step 10: Conclusion
Summarize the model performance and provide the next steps:

In [None]:
print("Model training completed.")
print("1. Logistic Regression and Random Forest models trained.")
print("2. Hyperparameter tuning performed (if applicable).")
print("3. Models evaluated based on AUC.")
print("4. Models saved for deployment.")
