
# **Welcome To the Notebook**


### **Task 1 - Loading our data**

Installing PySpark using pip

In [None]:
!pip install pyspark


Importing Modules

In [None]:
# importing spark session
from pyspark.sql import SparkSession

# data visualization modules
import matplotlib.pyplot as plt
import plotly.express as px

# pandas module
import pandas as pd

# pyspark SQL functions
from pyspark.sql.functions import col, when, count, udf

# pyspark data preprocessing modules
from pyspark.ml.feature import Imputer, StringIndexer, VectorAssembler, StandardScaler, OneHotEncoder

# pyspark data modeling and model evaluation modules
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator


Building our Spark Session

In [None]:
spark = SparkSession.builder.appName("Customer_Churn_Prediction").getOrCreate()
spark

Loading our data

In [None]:
# read the CSV with a header row and let Spark infer the column types
data=spark.read.format('csv').option('header',True).option('inferSchema',True).load("dataset.csv")
data.show(4)

Print the data schema to check out the data types

In [None]:
data.printSchema()

Get the data dimensions (number of rows, number of columns)

In [None]:
print((data.count(), len(data.columns)))


### **Task 2 - Exploratory Data Analysis**
- Distribution Analysis
- Correlation Analysis
- Univariate Analysis
- Finding Missing values

Let's define some lists to store different column names with different data types.

In [None]:
numerical_columns = [name for name,typ in data.dtypes if typ=="double" or typ =='int']
categorical_columns = [name for name,typ in data.dtypes if typ=="string" ]
data.select(numerical_columns).show()

Let's get all the numerical features and store them into a pandas dataframe.

In [None]:
df=data.select(numerical_columns).toPandas()
df.head()

Let's create histograms to analyse the distribution of our numerical columns.

In [None]:
fig=plt.figure(figsize=(15,10))
ax=fig.gca()
df.hist(ax=ax,bins=20)
df.tenure.describe()

Let's generate the correlation matrix

In [None]:
df.corr()
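
By default, pandas' `df.corr()` computes the Pearson correlation coefficient for every pair of columns. As a rough illustration of what each entry of that matrix measures, here is a minimal plain-Python sketch. The helper name `pearson` and the sample values are made up for this example and are not taken from the dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values: the second list is exactly twice the first,
# so the two are perfectly positively correlated.
r_positive = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
# Reversing one list gives a perfect negative correlation.
r_negative = pearson([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
print(r_positive, r_negative)
```

Values close to 1 or -1 indicate a strong linear relationship between two columns, and values near 0 indicate a weak one.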

Let's check the count of each unique value for every categorical variable

In [None]:
for column in categorical_columns:
  data.groupBy(column).count().show()

Let's find the number of null values in each of our dataframe columns

In [None]:
# count the nulls of every column in a single pass and show them as one row
data.select([count(when(col(column).isNull(),column)).alias(column) for column in data.columns]).show()

### **Task 3 - Data Preprocessing**
- Handling the missing values
- Removing the outliers

**Handling the missing values** <br>
Let's create a list of column names with missing values

In [None]:
columns_with_missing_values = ["TotalCharges"]

Creating our Imputer

In [None]:
imputer=Imputer(inputCols=columns_with_missing_values,outputCols=columns_with_missing_values).setStrategy("mean")

Use Imputer to fill the missing values

In [None]:
imputer=imputer.fit(data)
data=imputer.transform(data)
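
With the `"mean"` strategy, the Imputer replaces each missing entry of a column with the mean of that column's observed values. The idea can be sketched in plain Python as follows. This is only an illustration of the technique, not Spark's implementation, and the helper `impute_mean` and the values are made up.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the non-missing entries."""
    observed = [v for v in values if v is not None]
    fill_value = mean(observed)
    return [fill_value if v is None else v for v in values]

# Made-up charges with one missing entry (not taken from the dataset)
charges = [20.0, None, 40.0, 60.0]
imputed = impute_mean(charges)
print(imputed)
```

The missing entry is filled with the mean of the three observed values, so the column mean itself does not change.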

Let's check the missing value counts again

In [None]:
data.select(count(when(col("TotalCharges").isNull(),"TotalCharges")).alias("TotalCharges")).show()

**Removing the outliers** <br>
Let's find the customers with a tenure higher than 100

In [None]:
data.select("*").where(data.tenure>100).show()

Let's drop the outlier row

In [None]:
print("Before removing the outliers",data.count())
# keep tenure values up to and including 100, matching the "higher than 100" outlier rule above
data=data.filter(data.tenure<=100)
print("After removing the outliers",data.count())
data.select("*").where(data.tenure>100).show()

### **Task 4 - Feature Preparation**
- Numerical Features
    - Vector Assembling
    - Numerical Scaling
- Categorical Features
    - String Indexing
    - Vector Assembling

- Combining the numerical and categorical feature vectors




**Feature Preparation - Numerical Features** <br>

`Vector Assembling --> Standard Scaling` <br>

**Vector Assembling** <br>
To apply our machine learning model we need to combine all of our numerical and categorical features into vectors. For now let's create a feature vector for our numerical columns.


In [None]:
numerical_vector_assembler=VectorAssembler(inputCols=numerical_columns, outputCol="numerical_features_vector")
data=numerical_vector_assembler.transform(data)
data.show()

**Numerical Scaling** <br>
Let's standardize all of our numerical features.

In [None]:
scaler=StandardScaler(inputCol="numerical_features_vector", outputCol="numerical_features_scaled", withStd=True, withMean=True)
data=scaler.fit(data).transform(data)
data.show()
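
With `withMean=True` and `withStd=True`, the scaler subtracts each feature's mean and divides by its standard deviation (Spark uses the corrected sample standard deviation). A minimal plain-Python sketch of that computation for a single feature is shown here. The helper `standardize` and the values are made up for illustration.

```python
from statistics import mean, stdev

def standardize(values):
    """Center values on their mean and scale by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - mu) / sigma for v in values]

# Made-up tenure values (not taken from the dataset)
tenures = [1.0, 2.0, 3.0]
scaled = standardize(tenures)
print(scaled)
```

After scaling, every feature has mean 0 and unit standard deviation, so features measured on large scales do not dominate the others.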

**Feature Preparation - Categorical Features** <br>

`String Indexing --> Vector Assembling` <br>

**String Indexing** <br>
We need to convert all the string columns to numeric columns.

In [None]:
categorical_columns_indexed=[name+"_Indexed" for name in categorical_columns]
indexer=StringIndexer(inputCols=categorical_columns, outputCols=categorical_columns_indexed)
data=indexer.fit(data).transform(data)
data.show()
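
By default, `StringIndexer` orders labels by descending frequency, so the most frequent label gets index 0. The mapping can be sketched in plain Python like this. The helper `fit_string_index`, the alphabetical tie-break, and the sample labels are assumptions made for this illustration.

```python
from collections import Counter

def fit_string_index(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; break ties alphabetically so the
    # result is deterministic (an assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up contract labels (not taken from the dataset)
contracts = ["Month-to-month", "Two year", "Month-to-month",
             "One year", "Month-to-month", "Two year"]
mapping = fit_string_index(contracts)
indexed = [mapping[c] for c in contracts]
print(mapping)
print(indexed)
```

Keep in mind that the resulting indices only encode frequency rank. They do not carry any natural ordering of the categories.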


Let's combine all of our categorical features into one feature vector.

In [None]:
if "customerID_Indexed" in categorical_columns_indexed:
    categorical_columns_indexed.remove("customerID_Indexed")
if "Churn_Indexed" in categorical_columns_indexed:
    categorical_columns_indexed.remove("Churn_Indexed")

categorical_vector_assembler=VectorAssembler(inputCols=categorical_columns_indexed,outputCol="categorical_features_vector")
data=categorical_vector_assembler.transform(data)
data.show()

Now let's combine categorical and numerical feature vectors.

In [None]:
final_vector_assembler=VectorAssembler(inputCols=["categorical_features_vector","numerical_features_scaled"],outputCol="final_feature_vector")
data=final_vector_assembler.transform(data)


### **Task 5 - Model Training**
- Train and Test data splitting
- Creating our model
- Training our model
- Make initial predictions using our model

In this task, we are going to start training our model

In [None]:
train,test=data.randomSplit([0.7,0.3],seed=100)
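# print both counts; a notebook cell only displays its last expression
print("Train rows:",train.count())
print("Test rows:",test.count())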

Now let's create and train our decision tree

In [None]:


dt=DecisionTreeClassifier(featuresCol="final_feature_vector",labelCol="Churn_Indexed",maxDepth=3)
dtModel=dt.fit(train)

Let's make predictions on our test data

In [None]:
predictions_test=dtModel.transform(test)
predictions_test.select(["Churn","prediction"]).show()

### **Task 6 - Model Evaluation**
- Calculating area under the ROC curve for the `test` set
- Calculating area under the ROC curve for the `training` set
- Hyper parameter tuning

In [None]:
evaluator=BinaryClassificationEvaluator(labelCol="Churn_Indexed")
auc_test=evaluator.evaluate(predictions_test,{evaluator.metricName:"areaUnderROC"})
auc_test
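
The area under the ROC curve can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it from that pairwise definition in plain Python. It illustrates the metric only and is not how Spark computes it internally; the helper `auc_by_ranking`, the labels, and the scores are made up.

```python
def auc_by_ranking(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Made-up labels (1 = churned) and model scores
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = auc_by_ranking(labels, scores)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why it is a convenient single number for comparing models.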

Let's get the AUC for our `training` set

In [None]:
# predict on the training set; use a separate variable so the test predictions are not overwritten
predictions_train=dtModel.transform(train)
auc_train=evaluator.evaluate(predictions_train,{evaluator.metricName:"areaUnderROC"})
auc_train

**Hyper parameter tuning**

Let's find the best `maxDepth` parameter for our DT model.

In [None]:
def evaluate_dt(mode_params):
    """Train one decision tree per maxDepth value and return (test AUCs, train AUCs)."""
    test_aucs = []
    train_aucs = []
    # one evaluator is enough; it computes the area under the ROC curve
    evaluator = BinaryClassificationEvaluator(labelCol="Churn_Indexed", metricName="areaUnderROC")

    for maxD in mode_params:
        # train the model with the current maxDepth
        decision_tree = DecisionTreeClassifier(featuresCol="final_feature_vector", labelCol="Churn_Indexed", maxDepth=maxD)
        model = decision_tree.fit(train)

        # AUC on the test set
        test_aucs.append(evaluator.evaluate(model.transform(test)))

        # AUC on the training set
        train_aucs.append(evaluator.evaluate(model.transform(train)))

    return test_aucs, train_aucs

Let's define a `maxDepths` list to evaluate our model iteratively with different maxDepth values.

In [None]:
maxDepths=[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
test_accs,train_accs=evaluate_dt(maxDepths)
print(train_accs)
print(test_accs)


Let's visualize our results

In [None]:
df=pd.DataFrame()
df["maxDepth"]=maxDepths
df["test_Accs"]=test_accs
df["train_Accs"]=train_accs
px.line(df,x="maxDepth",y=["train_Accs","test_Accs"])
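
One simple way to read the chart is to pick the depth with the highest test AUC, since a training AUC that keeps rising while the test AUC falls points to overfitting. A minimal sketch of that selection follows. The helper `best_depth` is hypothetical and the numbers are placeholders, not results from this notebook.

```python
def best_depth(depths, test_scores):
    """Return the (depth, score) pair with the highest test score.

    On ties, max() keeps the first maximum, so with depths in ascending
    order the shallowest of the tied trees is chosen.
    """
    best_index = max(range(len(depths)), key=lambda i: test_scores[i])
    return depths[best_index], test_scores[best_index]

# Placeholder values for illustration only
example_depths = [2, 3, 4, 5]
example_test_scores = [0.70, 0.78, 0.78, 0.74]
chosen = best_depth(example_depths, example_test_scores)
print(chosen)
```

In the notebook, the same helper could be called as `best_depth(maxDepths, test_accs)`.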

### **Task 7 - Model Deployment**
- Giving Recommendations using our model



We were asked to recommend a solution to reduce the customer churn.


In [None]:
feature_importance=dtModel.featureImportances
# convert the importance vector to a dense array, one score per feature
scores=feature_importance.toArray()
df=pd.DataFrame(scores,columns=["score"],index=categorical_columns_indexed+numerical_columns)
px.bar(df,y="score")

Let's create a bar chart to visualize the customer churn per contract type

In [None]:
# group by the original Contract labels so the bars show the contract names
df = data.groupBy(["Contract","Churn"]).count().toPandas()
px.bar(df,x="Contract",y="count",color="Churn")

The bar chart shows the number of churned and retained customers for each contract type. Customers with a "Month-to-month" contract churn at a higher rate than those with "One year" or "Two year" contracts. As a recommendation, the telecommunication company could offer incentives or discounts that encourage customers on month-to-month contracts to switch to longer-term contracts.
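
Since the chart shows counts, it can help to convert them into churn rates before comparing contract types. The sketch below does this in plain Python. It assumes the `Churn` column holds the labels "Yes" and "No"; the helper `churn_rates` is hypothetical and the counts are placeholders, not values from the dataset.

```python
def churn_rates(counts):
    """Compute the churn rate per contract type.

    counts maps (contract, churn_label) to a number of customers,
    where churn_label is assumed to be "Yes" or "No".
    """
    totals = {}
    churned = {}
    for (contract, churn_label), n in counts.items():
        totals[contract] = totals.get(contract, 0) + n
        if churn_label == "Yes":
            churned[contract] = churned.get(contract, 0) + n
    return {c: churned.get(c, 0) / totals[c] for c in totals}

# Placeholder counts for illustration only
example_counts = {
    ("Month-to-month", "Yes"): 40,
    ("Month-to-month", "No"): 60,
    ("Two year", "Yes"): 5,
    ("Two year", "No"): 95,
}
rates = churn_rates(example_counts)
print(rates)
```

Comparing rates rather than raw counts keeps a contract type with many customers from looking worse simply because it is larger.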