<a href="https://colab.research.google.com/github/TAlkam/predicting-customer-churn/blob/main/Predicting_customer_churn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Tursun Alkam

# **1. Introduction**

**1.1 Project Goal**
The goal of this project is to predict customer churn using a machine learning model. Customer churn prediction helps businesses identify customers who are likely to stop using their services, enabling them to take proactive measures to retain these customers and reduce losses.


**1.2 Importance of Churn Prediction**
Customer churn is a significant issue for many businesses, particularly in subscription-based industries. Predicting churn allows businesses to understand which customers are at risk of leaving and to implement strategies to retain them, thereby increasing customer lifetime value and overall profitability.


**1.3 Business Understanding**
Objective: Reduce customer churn to increase revenue and improve customer retention.


**Business Need:** The retail business needs a model to predict which customers are likely to churn so that targeted marketing strategies can be implemented to retain them.


# **2. Data Understanding**

**2.1 Find Data**
We used a publicly available dataset: "Customer Churn Dataset" from Kaggle.


**2.2 Examine Data**
Load the Dataset: Load the dataset and inspect the columns and data types.
Identify the Target Variable: The target variable is 'Churn', and the features include customer demographics, purchase history, and other relevant attributes.


**2.3 Clean Data**
Handle Missing Values: Check for missing values and handle them appropriately.
Remove Duplicates: Remove any duplicate records if found.
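
The two cleaning steps above can be sketched on a small, made-up frame. This is a minimal illustration, not the cleaning actually applied to the Kaggle file: the column names and values are hypothetical, and filling numeric gaps with the column median is only one of several reasonable choices.

```python
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only.
df = pd.DataFrame({
    "tenure": [1, 12, 12, None, 40],
    "MonthlyCharges": [29.9, 56.0, 56.0, 70.5, None],
    "Churn": [1, 0, 0, 0, 1],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop exact duplicate rows, then fill numeric gaps with the column median.
df_clean = df.drop_duplicates().copy()
numeric_cols = ["tenure", "MonthlyCharges"]
df_clean[numeric_cols] = df_clean[numeric_cols].fillna(df_clean[numeric_cols].median())

print(df_clean.shape)
```

Dropping rows with missing values (`dropna`) is the simpler alternative when only a handful of records are affected.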


**2.4 Initial Data Exploration**
The dataset includes both classes: customers who churned (1) and customers who did not churn (0). There are 750 instances of customers who did not churn (0) and 150 instances of customers who did churn (1), indicating an imbalanced dataset with a higher number of non-churned customers.

In [None]:
from google.colab import files
uploaded = files.upload()

import pandas as pd

# Load the dataset
df = pd.read_csv('customer_churn.csv')

# Check the unique values in the target column and their distribution
print("Unique values in 'Churn':", df['Churn'].unique())
print("Distribution in the entire dataset:")
print(df['Churn'].value_counts())


Saving customer_churn.csv to customer_churn.csv
Unique values in 'Churn': [1 0]
Distribution in the entire dataset:
Churn
0    750
1    150
Name: count, dtype: int64
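
To make the imbalance concrete, the short calculation below uses the class counts printed above. It shows that a model which always predicts "no churn" would already score about 83% accuracy while finding no churners at all, which is why accuracy alone is a weak metric here. The variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 750, 1: 150}
total = sum(counts.values())

# Ratio of non-churned to churned customers.
imbalance_ratio = counts[0] / counts[1]

# Accuracy of a trivial model that always predicts the majority class (0).
baseline_accuracy = max(counts.values()) / total

# That trivial model never predicts churn, so its recall on class 1 is zero.
baseline_churn_recall = 0 / counts[1]

print(f"imbalance ratio: {imbalance_ratio:.1f}:1")
print(f"majority-class baseline accuracy: {baseline_accuracy:.2%}")
print(f"baseline recall on churners: {baseline_churn_recall:.0%}")
```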


# **3. Data Preprocessing**

**3.1 Applying SMOTE**

Given the imbalance in the dataset, we applied SMOTE (Synthetic Minority Over-sampling Technique) to balance the classes. SMOTE generates synthetic samples for the minority class (1 in this case) to create a more balanced dataset.
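
The core idea behind SMOTE's synthetic samples can be sketched in a few lines: pick a minority point, pick one of its nearest minority neighbors, and place a new point at a random position on the line segment between them. The snippet below is a simplified, dependency-free illustration of that interpolation step only. It is not the imblearn implementation (which also handles the neighbor search and sampling strategy), and the function name and example points are hypothetical.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between point and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)

# Two hypothetical minority-class points in a 2-feature space.
minority_point = [1.0, 2.0]
minority_neighbor = [3.0, 6.0]

synthetic = smote_like_sample(minority_point, minority_neighbor, rng)
print(synthetic)
```

Because every synthetic point lies between two existing minority points, SMOTE fills in the minority region rather than duplicating records exactly.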

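**3.2 Checking the Distribution After SMOTE**

The cells below build a synthetic dataset with a 90-10 class imbalance using make_classification, then apply SMOTE and print the class counts before and after resampling.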

In [1]:
import pandas as pd
from sklearn.datasets import make_classification

# Create a synthetic dataset with 1000 samples, 20 features, and a 90-10 class imbalance
X_synthetic, y_synthetic = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=10,
                                               n_clusters_per_class=1, weights=[0.9, 0.1], flip_y=0, random_state=42)

# Convert to DataFrame for consistency
df_synthetic = pd.DataFrame(X_synthetic, columns=[f'feature_{i}' for i in range(20)])
df_synthetic['Churn'] = y_synthetic

# Save the synthetic dataset to a CSV file
df_synthetic.to_csv('synthetic_dataset.csv', index=False)

# Check the distribution of the target variable in the synthetic dataset
print("Distribution in the synthetic dataset:")
print(df_synthetic['Churn'].value_counts())

# Verify the saved file by loading it back
df_loaded = pd.read_csv('synthetic_dataset.csv')
print("Loaded dataset shape:", df_loaded.shape)
print("Loaded dataset columns:", df_loaded.columns.tolist())


Distribution in the synthetic dataset:
Churn
0    900
1    100
Name: count, dtype: int64
Loaded dataset shape: (1000, 21)
Loaded dataset columns: ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'Churn']


In [2]:
from google.colab import files

# Download the CSV file
files.download('synthetic_dataset.csv')



In [3]:
import pandas as pd

# Load the synthetic dataset
df_synthetic = pd.read_csv('synthetic_dataset.csv')

# Rename the columns with meaningful names
df_synthetic.columns = [
    'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines',
    'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
    'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
    'MonthlyCharges', 'TotalCharges', 'tenure_group', 'Churn'
]

# Save the updated dataset to a new CSV file
df_synthetic.to_csv('synthetic_dataset_labeled.csv', index=False)

# Check the updated dataset
print(df_synthetic.head())


     gender  SeniorCitizen   Partner  Dependents    tenure  PhoneService  \
0 -0.429244      -2.211862  0.189756    0.588553  0.820374     -0.180392   
1 -0.045512      -2.084113 -3.189191   -0.424236 -2.472718      0.508269   
2  0.252195       1.617045  1.565132   -1.970309  2.048682      0.509295   
3  1.725694      -0.516117  2.210866    0.121844  2.667175     -1.059212   
4 -0.749416       1.106232 -0.664455    0.337766 -0.054664      0.552905   

   MultipleLines  InternetService  OnlineSecurity  OnlineBackup  ...  \
0      -1.150654         1.471709        0.701585     -0.833474  ...   
1      -2.850971         0.911854        0.472527     -1.210002  ...   
2      -0.238676         1.445465        0.673300     -0.529416  ...   
3       0.107509         1.527901        0.705332     -0.442889  ...   
4      -1.497095         1.233776        0.597581     -0.871460  ...   

   TechSupport  StreamingTV  StreamingMovies  Contract  PaperlessBilling  \
0     0.702818    -0.225999       

In [4]:
from sklearn.datasets import make_classification
import pandas as pd

# Create a synthetic dataset with 1000 samples, 20 features, and a 90-10 class imbalance
X_synthetic, y_synthetic = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=10,
                                               n_clusters_per_class=1, weights=[0.9, 0.1], flip_y=0, random_state=42)

# Convert to DataFrame for consistency
df_synthetic = pd.DataFrame(X_synthetic, columns=[f'feature_{i}' for i in range(20)])
df_synthetic['Churn'] = y_synthetic

# Save the synthetic dataset to a CSV file
df_synthetic.to_csv('synthetic_dataset.csv', index=False)

# Check the distribution of the target variable in the synthetic dataset
print("Distribution in the synthetic dataset:")
print(df_synthetic['Churn'].value_counts())

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Separate features and target
X = df_synthetic.drop('Churn', axis=1)
y = df_synthetic['Churn']

# Scale numerical features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Apply SMOTE to generate synthetic samples
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)

# Check the distribution of the target variable after applying SMOTE
print("Distribution after SMOTE:")
print(y_res.value_counts())


Distribution in the synthetic dataset:
Churn
0    900
1    100
Name: count, dtype: int64
Distribution after SMOTE:
Churn
0    900
1    900
Name: count, dtype: int64


**3.3 Split the Data**

Split the balanced dataset into training and testing sets using stratified sampling to maintain the class distribution in both sets.

In [5]:
# Split the data into training and testing sets using stratified sampling
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.3, random_state=42, stratify=y_res)

# Print shapes and distribution of the resulting datasets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
print("Distribution in the training set:")
print(y_train.value_counts())
print("Distribution in the testing set:")
print(y_test.value_counts())


(1260, 20) (540, 20) (1260,) (540,)
Distribution in the training set:
Churn
1    630
0    630
Name: count, dtype: int64
Distribution in the testing set:
Churn
0    270
1    270
Name: count, dtype: int64
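
The cells above scale and resample the full dataset before splitting it, so resampled minority points end up in the test set. A common alternative ordering is to split first, fit the scaler on the training portion only, and oversample only the training portion, leaving the test set at its original class ratio. The sketch below illustrates that ordering on the same make_classification settings. As an assumption made to keep the example to scikit-learn and NumPy, plain random oversampling (`sklearn.utils.resample`) stands in for SMOTE; with imblearn installed, `SMOTE(random_state=42).fit_resample(...)` could be called on the training data at the same point.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Same synthetic data settings as the notebook cell above.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=10, n_clusters_per_class=1,
                           weights=[0.9, 0.1], flip_y=0, random_state=42)

# 1. Split first, so the test set keeps the original 90-10 ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# 2. Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# 3. Oversample the minority class in the training set only.
#    (Random oversampling is a stand-in for SMOTE in this sketch.)
minority = X_tr_s[y_tr == 1]
n_majority = int((y_tr == 0).sum())
extra = resample(minority, replace=True,
                 n_samples=n_majority - len(minority), random_state=42)
X_tr_bal = np.vstack([X_tr_s, extra])
y_tr_bal = np.concatenate([y_tr, np.ones(len(extra), dtype=y_tr.dtype)])

print("Training class counts:", np.bincount(y_tr_bal))
print("Testing class counts:", np.bincount(y_te))
```

Metrics computed on `X_te_s` then reflect performance on real, imbalanced data rather than on a resampled set.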


# **4. Model Training**

**4.1 Algorithms Used**

Three machine learning algorithms were used to train the models:

***Logistic Regression***

***Decision Tree***

***Random Forest***



**4.2 Training and Evaluation**

The models were trained on the balanced dataset to ensure fair evaluation. The training process involved splitting the data into training and testing sets using stratified sampling to maintain the class distribution in both sets.

In [6]:
# Train and evaluate models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Logistic Regression
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
lr_accuracy = accuracy_score(y_test, y_pred_lr)
lr_precision = precision_score(y_test, y_pred_lr)
lr_recall = recall_score(y_test, y_pred_lr)
lr_f1 = f1_score(y_test, y_pred_lr)

# Decision Tree
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, y_pred_dt)
dt_precision = precision_score(y_test, y_pred_dt)
dt_recall = recall_score(y_test, y_pred_dt)
dt_f1 = f1_score(y_test, y_pred_dt)

# Random Forest
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_precision = precision_score(y_test, y_pred_rf)
rf_recall = recall_score(y_test, y_pred_rf)
rf_f1 = f1_score(y_test, y_pred_rf)

# Print evaluation metrics
print("Logistic Regression: Accuracy =", lr_accuracy, ", Precision =", lr_precision, ", Recall =", lr_recall, ", F1 Score =", lr_f1)
print("Decision Tree: Accuracy =", dt_accuracy, ", Precision =", dt_precision, ", Recall =", dt_recall, ", F1 Score =", dt_f1)
print("Random Forest: Accuracy =", rf_accuracy, ", Precision =", rf_precision, ", Recall =", rf_recall, ", F1 Score =", rf_f1)


Logistic Regression: Accuracy = 0.9444444444444444 , Precision = 0.925531914893617 , Recall = 0.9666666666666667 , F1 Score = 0.9456521739130436
Decision Tree: Accuracy = 0.9740740740740741 , Precision = 0.9671532846715328 , Recall = 0.9814814814814815 , F1 Score = 0.9742647058823529
Random Forest: Accuracy = 0.987037037037037 , Precision = 0.9925093632958801 , Recall = 0.9814814814814815 , F1 Score = 0.9869646182495345


All three models trained and evaluated without errors on the dataset balanced with SMOTE.

After applying SMOTE, the dataset contains 900 instances of each class (0 and 1), so the models are trained on an equal number of examples from both classes.

The stratified split preserves this balance: the training set has 630 instances per class and the testing set has 270 per class. Because SMOTE was applied before the split, the testing set also contains synthetic samples, so the metrics above are likely optimistic compared with performance on untouched, imbalanced data.


**4.3 Model Performance**


**Logistic Regression**

**Accuracy:** 94.44% - The proportion of correct predictions.

**Precision:** 92.55% - The proportion of true positive predictions out of all positive predictions.

**Recall:** 96.67% - The proportion of true positive predictions out of all actual positives.

**F1 Score:** 94.57% - The harmonic mean of precision and recall.
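
As a worked example of these four definitions, the snippet below recomputes the Logistic Regression metrics from confusion-matrix counts. The counts are inferred from the reported metrics and the 540-sample test set (270 per class); the notebook itself does not print a confusion matrix, so treat them as a reconstruction.

```python
# Confusion-matrix counts inferred from the reported Logistic Regression
# metrics on the 540-sample test set; not printed by the notebook itself.
tp, fp, fn, tn = 261, 21, 9, 249

accuracy = (tp + tn) / (tp + fp + fn + tn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # true positives / predicted positives
recall = tp / (tp + fn)                      # true positives / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```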


**Decision Tree**

**Accuracy:** 97.41%

**Precision:** 96.72%

**Recall:** 98.15%

**F1 Score:** 97.43%



**Random Forest**

**Accuracy:** 98.70%

**Precision:** 99.25%

**Recall:** 98.15%

**F1 Score:** 98.70%




# **5. Model Evaluation**

**5.1 Results**

The Random Forest model showed the best overall performance in the run shown above, with the following metrics:

**Accuracy:** 98.70%

**Precision:** 99.25%

**Recall:** 98.15%

**F1 Score:** 98.70%


**5.2 Interpretation**


**Logistic Regression:** Performs well with a good balance between precision and recall, leading to a high F1 score.

**Decision Tree:** Shows excellent performance with high accuracy, precision, recall, and F1 score, indicating it captures complex patterns in the data effectively.

**Random Forest:** Achieves the highest accuracy, precision, and F1 score of the three models and ties with the Decision Tree on recall in the run shown above. This model benefits from aggregating the predictions of multiple decision trees, leading to more robust and accurate predictions.
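
The aggregation step can be illustrated with a few made-up numbers. scikit-learn's RandomForestClassifier predicts the class with the highest mean probability across its trees (a "soft" vote); the sketch below contrasts that with a simple majority ("hard") vote. The per-tree probabilities are hypothetical and do not come from the trained model.

```python
# Hypothetical churn probabilities from five trees for one customer.
tree_churn_probabilities = [0.9, 0.4, 0.8, 0.7, 0.3]

# Soft vote: average the per-tree probabilities, then threshold at 0.5.
mean_probability = sum(tree_churn_probabilities) / len(tree_churn_probabilities)
soft_vote = int(mean_probability >= 0.5)

# Hard vote: each tree casts one vote; the majority wins.
hard_votes = [int(p >= 0.5) for p in tree_churn_probabilities]
hard_vote = int(sum(hard_votes) > len(hard_votes) / 2)

print(f"mean probability: {mean_probability:.2f}")
print(f"soft vote: {soft_vote}, hard vote: {hard_vote}")
```

Averaging over many trees smooths out the errors of any single tree, which is one reason the ensemble tends to be more stable than an individual decision tree.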

# **6. Conclusion**

**6.1 Summary**

Based on the evaluation metrics, the Random Forest model is the best-performing model for predicting customer churn in this dataset. It achieves the highest accuracy, precision, and F1 score and ties with the Decision Tree on recall, making it the most reliable choice for deployment.

**6.2 Importance of Balanced Data**

Balancing the dataset with SMOTE ensured that the models were trained on an equal number of examples from both classes, which discourages them from simply favoring the majority (non-churn) class.

**6.3 Future Work**


- Improving the model by exploring other algorithms and hyperparameter tuning.
- Integrating the API with a customer management system for real-time predictions.
- Using real-world data for more accurate predictions.

# **7. API Deployment**

**7.1 Save the Model**

In [7]:
import joblib

# Save the trained Random Forest model
joblib.dump(rf, 'random_forest_model.pkl')


['random_forest_model.pkl']

In the context of customer churn prediction, common features might include:

- Demographic Information: age, gender, income, etc.
- Account Information: tenure, contract type, billing method, etc.
- Service Information: number of services subscribed to, types of services, usage patterns, etc.
- Customer Support Interactions: number of support calls, issues resolved, etc.
- Payment Information: payment method, last payment amount, etc.

As a concrete example, assume a dataset with the following features for predicting customer churn:

- Tenure: Number of months the customer has been with the company.
- Monthly Charges: The amount charged to the customer monthly.
- Total Charges: The total amount charged to the customer.

The Streamlit application collects these features, together with the other account and service fields, for customer churn prediction.

**Streamlit App Code with Relevant Features**

The cells below install Streamlit, inspect the dataset, and define the app with these example features:


In [8]:
!pip install streamlit


In [9]:
import pandas as pd

# Load the synthetic dataset
df_synthetic = pd.read_csv('synthetic_dataset.csv')

# Display the first few rows and column names to verify
print("First few rows of the dataset:")
print(df_synthetic.head())

print("\nColumn names in the dataset:")
print(df_synthetic.columns.tolist())


First few rows of the dataset:
   feature_0  feature_1  feature_2  feature_3  feature_4  feature_5  \
0  -0.429244  -2.211862   0.189756   0.588553   0.820374  -0.180392   
1  -0.045512  -2.084113  -3.189191  -0.424236  -2.472718   0.508269   
2   0.252195   1.617045   1.565132  -1.970309   2.048682   0.509295   
3   1.725694  -0.516117   2.210866   0.121844   2.667175  -1.059212   
4  -0.749416   1.106232  -0.664455   0.337766  -0.054664   0.552905   

   feature_6  feature_7  feature_8  feature_9  ...  feature_11  feature_12  \
0  -1.150654   1.471709   0.701585  -0.833474  ...    0.702818   -0.225999   
1  -2.850971   0.911854   0.472527  -1.210002  ...   -2.126279    1.908717   
2  -0.238676   1.445465   0.673300  -0.529416  ...    1.758406   -1.076195   
3   0.107509   1.527901   0.705332  -0.442889  ...    2.289787   -1.482342   
4  -1.497095   1.233776   0.597581  -0.871460  ...   -0.048795    0.320768   

   feature_13  feature_14  feature_15  feature_16  feature_17  feature_18

In [10]:
import streamlit as st
import pandas as pd
import joblib

# Load the trained model
model = joblib.load('random_forest_model.pkl')

# Define the Streamlit app
st.title('Customer Churn Prediction')

# Input fields for customer features
gender = st.selectbox('Gender', ['Male', 'Female'])
SeniorCitizen = st.selectbox('Senior Citizen', [0, 1])
Partner = st.selectbox('Partner', ['Yes', 'No'])
Dependents = st.selectbox('Dependents', ['Yes', 'No'])
tenure = st.number_input('Tenure (months)', min_value=0)
PhoneService = st.selectbox('Phone Service', ['Yes', 'No'])
MultipleLines = st.selectbox('Multiple Lines', ['Yes', 'No', 'No phone service'])
InternetService = st.selectbox('Internet Service', ['DSL', 'Fiber optic', 'No'])
OnlineSecurity = st.selectbox('Online Security', ['Yes', 'No', 'No internet service'])
OnlineBackup = st.selectbox('Online Backup', ['Yes', 'No', 'No internet service'])
DeviceProtection = st.selectbox('Device Protection', ['Yes', 'No', 'No internet service'])
TechSupport = st.selectbox('Tech Support', ['Yes', 'No', 'No internet service'])
StreamingTV = st.selectbox('Streaming TV', ['Yes', 'No', 'No internet service'])
StreamingMovies = st.selectbox('Streaming Movies', ['Yes', 'No', 'No internet service'])
Contract = st.selectbox('Contract', ['Month-to-month', 'One year', 'Two year'])
PaperlessBilling = st.selectbox('Paperless Billing', ['Yes', 'No'])
PaymentMethod = st.selectbox('Payment Method', ['Electronic check', 'Mailed check', 'Bank transfer (automatic)', 'Credit card (automatic)'])
MonthlyCharges = st.number_input('Monthly Charges ($)', min_value=0.0)
TotalCharges = st.number_input('Total Charges ($)', min_value=0.0)

# Function to create tenure group based on tenure
def tenure_group(tenure):
    if tenure <= 12:
        return '0-12 months'
    elif tenure <= 24:
        return '12-24 months'
    elif tenure <= 36:
        return '24-36 months'
    elif tenure <= 48:
        return '36-48 months'
    elif tenure <= 60:
        return '48-60 months'
    else:
        return '60+ months'

tenure_group_value = tenure_group(tenure)

# Create a DataFrame for the input features
input_data = pd.DataFrame({
    'gender': [gender],
    'SeniorCitizen': [SeniorCitizen],
    'Partner': [Partner],
    'Dependents': [Dependents],
    'tenure': [tenure],
    'PhoneService': [PhoneService],
    'MultipleLines': [MultipleLines],
    'InternetService': [InternetService],
    'OnlineSecurity': [OnlineSecurity],
    'OnlineBackup': [OnlineBackup],
    'DeviceProtection': [DeviceProtection],
    'TechSupport': [TechSupport],
    'StreamingTV': [StreamingTV],
    'StreamingMovies': [StreamingMovies],
    'Contract': [Contract],
    'PaperlessBilling': [PaperlessBilling],
    'PaymentMethod': [PaymentMethod],
    'MonthlyCharges': [MonthlyCharges],
    'TotalCharges': [TotalCharges],
    'tenure_group': [tenure_group_value]
})

# Convert categorical variables to numeric
input_data['gender'] = input_data['gender'].apply(lambda x: 1 if x == 'Male' else 0)
input_data['Partner'] = input_data['Partner'].apply(lambda x: 1 if x == 'Yes' else 0)
input_data['Dependents'] = input_data['Dependents'].apply(lambda x: 1 if x == 'Yes' else 0)
input_data['PhoneService'] = input_data['PhoneService'].apply(lambda x: 1 if x == 'Yes' else 0)
input_data['MultipleLines'] = input_data['MultipleLines'].apply(lambda x: 2 if x == 'Yes' else (1 if x == 'No' else 0))
input_data['InternetService'] = input_data['InternetService'].apply(lambda x: 2 if x == 'Fiber optic' else (1 if x == 'DSL' else 0))
input_data['OnlineSecurity'] = input_data['OnlineSecurity'].apply(lambda x: 2 if x == 'Yes' else (1 if x == 'No' else 0))
input_data['OnlineBackup'] = input_data['OnlineBackup'].apply(lambda x: 2 if x == 'Yes' else (1 if x == 'No' else 0))
input_data['DeviceProtection'] = input_data['DeviceProtection'].apply(lambda x: 2 if x == 'Yes' else (1 if x == 'No' else 0))
input_data['TechSupport'] = input_data['TechSupport'].apply(lambda x: 2 if x == 'Yes' else (1 if x == 'No' else 0))
input_data['StreamingTV'] = input_data['StreamingTV'].apply(lambda x: 2 if x == 'Yes' else (1 if x == 'No' else 0))
input_data['StreamingMovies'] = input_data['StreamingMovies'].apply(lambda x: 2 if x == 'Yes' else (1 if x == 'No' else 0))
input_data['Contract'] = input_data['Contract'].apply(lambda x: 2 if x == 'Two year' else (1 if x == 'One year' else 0))
input_data['PaperlessBilling'] = input_data['PaperlessBilling'].apply(lambda x: 1 if x == 'Yes' else 0)
input_data['PaymentMethod'] = input_data['PaymentMethod'].apply(lambda x: 3 if x == 'Electronic check' else (2 if x == 'Mailed check' else (1 if x == 'Bank transfer (automatic)' else 0)))
input_data['tenure_group'] = input_data['tenure_group'].apply(lambda x: {'0-12 months': 1, '12-24 months': 2, '24-36 months': 3, '36-48 months': 4, '48-60 months': 5, '60+ months': 6}[x])

# Debug line to check the shape and columns of the input data
st.write("Features provided for prediction:", input_data.columns.tolist(), input_data.shape)

# Predict churn
if st.button('Predict Churn'):
    prediction = model.predict(input_data)
    if prediction[0] == 1:
        st.write('The customer is likely to churn.')
    else:
        st.write('The customer is not likely to churn.')


2024-06-21 23:23:58.251 
  command:

    streamlit run /usr/local/lib/python3.10/dist-packages/colab_kernel_launcher.py [ARGUMENTS]
2024-06-21 23:23:58.253 Session state does not function when running a script without `streamlit run`
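
The chained conditional lambdas in the app can also be written as lookup tables, which keeps each encoding in one place and raises a KeyError on an unexpected value instead of silently mapping it to 0. The sketch below reproduces the same numeric codes for a few of the fields. The helper names (`encode_tenure_group`, `encode_row`) and the sample row are hypothetical, and only a subset of the app's columns is shown.

```python
from bisect import bisect_left

# Lookup tables reproducing the numeric codes used in the app above.
YES_NO = {"No": 0, "Yes": 1}
THREE_WAY = {"No internet service": 0, "No phone service": 0, "No": 1, "Yes": 2}
CONTRACT = {"Month-to-month": 0, "One year": 1, "Two year": 2}

# Upper bounds (inclusive) of the first five tenure groups, in months.
TENURE_UPPER_BOUNDS = [12, 24, 36, 48, 60]

def encode_tenure_group(tenure):
    """Map tenure in months to the group codes 1..6 used by the app."""
    return bisect_left(TENURE_UPPER_BOUNDS, tenure) + 1

def encode_row(row):
    """Encode a subset of the app's categorical inputs as integers."""
    return {
        "Partner": YES_NO[row["Partner"]],
        "OnlineSecurity": THREE_WAY[row["OnlineSecurity"]],
        "Contract": CONTRACT[row["Contract"]],
        "tenure_group": encode_tenure_group(row["tenure"]),
    }

# Hypothetical input, as it might come from the Streamlit widgets.
sample = {"Partner": "Yes", "OnlineSecurity": "No internet service",
          "Contract": "One year", "tenure": 30}
encoded = encode_row(sample)
print(encoded)
```

Note that the saved model was trained on standardized synthetic features, so any encoding used by the app would also need to match the preprocessing applied at training time for its predictions to be meaningful.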


In [12]:
# Install Streamlit and pyngrok
!pip install streamlit -q
!pip install pyngrok -q

# Write the Streamlit application to a file
with open('app.py', 'w') as f:
    f.write("""
import streamlit as st
import pandas as pd
import joblib

# Load the trained model
model = joblib.load('random_forest_model.pkl')

# Define the Streamlit app
st.title('Customer Churn Prediction')

# Input fields for customer features
customerID = st.text_input('Customer ID')
gender = st.selectbox('Gender', ['Male', 'Female'])
SeniorCitizen = st.selectbox('Senior Citizen', [0, 1])
Partner = st.selectbox('Partner', ['Yes', 'No'])
Dependents = st.selectbox('Dependents', ['Yes', 'No'])
tenure = st.number_input('Tenure (months)', min_value=0)
PhoneService = st.selectbox('Phone Service', ['Yes', 'No'])
MultipleLines = st.selectbox('Multiple Lines', ['Yes', 'No', 'No phone service'])
InternetService = st.selectbox('Internet Service', ['DSL', 'Fiber optic', 'No'])
OnlineSecurity = st.selectbox('Online Security', ['Yes', 'No', 'No internet service'])
OnlineBackup = st.selectbox('Online Backup', ['Yes', 'No', 'No internet service'])
DeviceProtection = st.selectbox('Device Protection', ['Yes', 'No', 'No internet service'])
TechSupport = st.selectbox('Tech Support', ['Yes', 'No', 'No internet service'])
StreamingTV = st.selectbox('Streaming TV', ['Yes', 'No', 'No internet service'])
StreamingMovies = st.selectbox('Streaming Movies', ['Yes', 'No', 'No internet service'])
Contract = st.selectbox('Contract', ['Month-to-month', 'One year', 'Two year'])
PaperlessBilling = st.selectbox('Paperless Billing', ['Yes', 'No'])
PaymentMethod = st.selectbox('Payment Method', ['Electronic check', 'Mailed check', 'Bank transfer (automatic)', 'Credit card (automatic)'])
MonthlyCharges = st.number_input('Monthly Charges ($)', min_value=0.0)
TotalCharges = st.number_input('Total Charges ($)', min_value=0.0)

# Function to create tenure group based on tenure
def tenure_group(tenure):
    if tenure <= 12:
        return '0-12 months'
    elif tenure <= 24:
        return '12-24 months'
    elif tenure <= 36:
        return '24-36 months'
    elif tenure <= 48:
        return '36-48 months'
    elif tenure <= 60:
        return '48-60 months'
    else:
        return '60+ months'

tenure_group_value = tenure_group(tenure)

# Create a DataFrame for the input features
input_data = pd.DataFrame({
    'gender': [gender],
    'SeniorCitizen': [SeniorCitizen],
    'Partner': [Partner],
    'Dependents': [Dependents],
    'tenure': [tenure],
    'PhoneService': [PhoneService],
    'MultipleLines': [MultipleLines],
    'InternetService': [InternetService],
    'OnlineSecurity': [OnlineSecurity],
    'OnlineBackup': [OnlineBackup],
    'DeviceProtection': [DeviceProtection],
    'TechSupport': [TechSupport],
    'StreamingTV': [StreamingTV],
    'StreamingMovies': [StreamingMovies],
    'Contract': [Contract],
    'PaperlessBilling': [PaperlessBilling],
    'PaymentMethod': [PaymentMethod],
    'MonthlyCharges': [MonthlyCharges],
    'TotalCharges': [TotalCharges],
    'tenure_group': [tenure_group_value]
})

# Convert categorical variables to numeric
input_data['gender'] = input_data['gender'].apply(lambda x: 1 if x == 'Male' else 0)
input_data['Partner'] = input_data['Partner'].apply(lambda x: 1 if x == 'Yes' else 0)
input_data['Dependents'] = input_data['Dependents'].apply(lambda x: 1 if x == 'Yes' else 0)
input_data['PhoneService'] = input_data['PhoneService'].apply(lambda x: 1 if x == 'Yes' else 0)
input_data['MultipleLines'] = input_data['MultipleLines'].apply(lambda x: 2 if x == 'Yes' else (1 if x == 'No' else 0))
input_data['InternetService'] = input_data['InternetService'].apply(lambda x: 2 if x == 'Fiber optic' else (1 if x == 'DSL' else 0))
input_data['OnlineSecurity'] = input_data['OnlineSecurity'].apply(lambda x: 2 if x == 'Yes' else (1 if x == 'No' else 0))
input_data['OnlineBackup'] = input_data['OnlineBackup'].apply(lambda x: 2 if x == 'Yes' else (1 if x == 'No' else 0))
input_data['DeviceProtection'] = input_data['DeviceProtection'].apply(lambda x: 2 if x == 'Yes' else (1 if x == 'No' else 0))
input_data['TechSupport'] = input_data['TechSupport'].apply(lambda x: 2 if x == 'Yes' else (1 if x == 'No' else 0))
input_data['StreamingTV'] = input_data['StreamingTV'].apply(lambda x: 2 if x == 'Yes' else (1 if x == 'No' else 0))
input_data['StreamingMovies'] = input_data['StreamingMovies'].apply(lambda x: 2 if x == 'Yes' else (1 if x == 'No' else 0))
input_data['Contract'] = input_data['Contract'].apply(lambda x: 2 if x == 'Two year' else (1 if x == 'One year' else 0))
input_data['PaperlessBilling'] = input_data['PaperlessBilling'].apply(lambda x: 1 if x == 'Yes' else 0)
input_data['PaymentMethod'] = input_data['PaymentMethod'].apply(lambda x: 3 if x == 'Electronic check' else (2 if x == 'Mailed check' else (1 if x == 'Bank transfer (automatic)' else 0)))
input_data['tenure_group'] = input_data['tenure_group'].apply(lambda x: {'0-12 months': 1, '12-24 months': 2, '24-36 months': 3, '36-48 months': 4, '48-60 months': 5, '60+ months': 6}[x])

# Debug line to check the shape and columns of the input data
st.write("Features provided for prediction:", input_data.columns.tolist(), input_data.shape)

# Predict churn
if st.button('Predict Churn'):
    prediction = model.predict(input_data)
    if prediction[0] == 1:
        st.write('The customer is likely to churn.')
    else:
        st.write('The customer is not likely to churn.')
    """)

# Authenticate ngrok
from pyngrok import ngrok

# Set your ngrok authtoken
ngrok.set_auth_token("YOUR_NGROK_AUTH_TOKEN")

# Start ngrok tunnel
public_url = ngrok.connect(8501)
print(f"Public URL: {public_url}")

# Run the Streamlit app
!streamlit run app.py


Public URL: NgrokTunnel: "https://a652-34-16-141-221.ngrok-free.app" -> "http://localhost:8501"

Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://34.16.141.221:8501

  Stopping...
  Stopping...
