<a href="https://colab.research.google.com/github/ShabnumBatool/customer-churn-prediction/blob/main/churn_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Importing Libraries**
The first step in any machine learning project is to import the libraries needed for data processing, modeling, and evaluation.

These libraries provide tools to handle datasets, build models, and evaluate performance efficiently.

In [45]:

import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# **Loading Dataset**
The next step is loading the dataset into your environment so you can work with it.

This is usually done using libraries like Pandas, which make it easy to read CSV or other data files into a DataFrame for analysis and preprocessing.

In [18]:

df = pd.read_csv("/content/WA_Fn-UseC_-Telco-Customer-Churn.csv")
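As a small illustration of what `pd.read_csv` returns, the sketch below parses a few rows from an in-memory string instead of the real file. The three sample rows, the customer IDs, and the names `sample_csv` and `sample_df` are made up for the example; only the column names come from the dataset.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the notebook itself reads the full CSV from disk.
sample_csv = io.StringIO(
    "customerID,tenure,MonthlyCharges,TotalCharges,Churn\n"
    "0001-AAAAA,1,29.85,29.85,No\n"
    "0002-BBBBB,34,56.95,1889.5,No\n"
    "0003-CCCCC,2,53.85,108.15,Yes\n"
)

sample_df = pd.read_csv(sample_csv)

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # pandas infers one dtype per column
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the file was parsed as expected.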

In [19]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [20]:
df.tail()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.8,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.2,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.6,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.4,306.6,Yes
7042,3186-AJIEK,Male,0,No,No,66,Yes,No,Fiber optic,Yes,...,Yes,Yes,Yes,Yes,Two year,Yes,Bank transfer (automatic),105.65,6844.5,No


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [22]:
df.describe()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75


# **Handle missing values**

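Handling missing values is important to ensure data quality and prevent errors during model training.
In the Telco Churn dataset, the TotalCharges column is read as text (it does not appear in the `df.describe()` output above) and contains a few blank entries. We convert it to numeric with `errors='coerce'`, which turns values that cannot be parsed into NaN, and then remove the rows with nulls.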

In [None]:

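# Values that cannot be parsed as numbers become NaN instead of raising an error
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Drop every row that contains a NaN
df.dropna(inplace=True)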
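To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a made-up three-row frame in which one TotalCharges entry is a blank string. The frame and the names `demo`, `n_missing`, and `clean` are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Hypothetical mini-frame: the second TotalCharges entry is a blank string.
demo = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# Strings that cannot be parsed become NaN instead of raising an error.
demo["TotalCharges"] = pd.to_numeric(demo["TotalCharges"], errors="coerce")
n_missing = int(demo["TotalCharges"].isna().sum())

# Drop the affected rows and renumber the index.
clean = demo.dropna().reset_index(drop=True)

print(n_missing)   # how many values could not be parsed
print(len(clean))  # rows left after dropping them
```

Counting the NaN values before dropping them shows how many rows the cleaning step removes.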

# **Encode categorical variables**

Most machine learning models work with numerical data, so categorical variables must be converted into numbers.
This process is called encoding. Here we use Label Encoding, which maps each category to an integer; One-Hot Encoding is a common alternative.

In [25]:

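# Label-encode every text (object) column, including customerID and Churn.
# LabelEncoder assigns integer codes to the values of a column in sorted order.
label_encoder = LabelEncoder()
for col in df.select_dtypes(include=['object']).columns:
    df[col] = label_encoder.fit_transform(df[col])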
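The codes that `LabelEncoder` assigns follow the sorted order of the category names, which matters later when raw inputs have to be encoded the same way at prediction time. The sketch below uses the four PaymentMethod values that appear in the dataset; the names `payment`, `encoder`, and `mapping` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# The four PaymentMethod values that occur in the dataset.
payment = ["Electronic check", "Mailed check", "Bank transfer (automatic)",
           "Credit card (automatic)", "Electronic check"]

encoder = LabelEncoder()
codes = encoder.fit_transform(payment)

# classes_ lists the categories in sorted order; the position is the code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# A fitted encoder can translate codes back to the original labels.
print(list(encoder.inverse_transform(codes)))
```

Keeping one fitted encoder per column (for example in a dictionary) is one way to reuse exactly the same mapping when new data arrives.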


In [26]:
# Note that the categorical variables have now been converted into numeric codes
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,5375,0,0,1,0,1,0,1,0,0,...,0,0,0,0,0,1,2,29.85,2505,0
1,3962,1,0,0,0,34,1,0,0,2,...,2,0,0,0,1,0,3,56.95,1466,0
2,2564,1,0,0,0,2,1,0,0,2,...,0,0,0,0,0,1,3,53.85,157,1
3,5535,1,0,0,0,45,0,1,0,2,...,2,2,0,0,1,0,0,42.3,1400,0
4,6511,0,0,0,0,2,1,0,1,0,...,0,0,0,0,0,1,2,70.7,925,1


# **Features and target**

1. Features are the input variables (independent variables) the model uses to make predictions.

2. The target is the output variable (dependent variable) we want to predict; here, whether the customer will churn.

In [28]:

X = df.drop('Churn', axis=1)
y = df['Churn']
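Note that `X` as defined here still contains `customerID`, which is an identifier rather than a property of the customer. One possible variant, sketched on a made-up encoded frame, drops the identifier together with the target. The frame and the names `encoded`, `features`, and `target` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical already-encoded frame with an identifier, two features, and the target.
encoded = pd.DataFrame({
    "customerID": [5375, 3962, 2564],
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Churn": [0, 0, 1],
})

# Drop the identifier and the target; everything that remains is a feature.
features = encoded.drop(columns=["customerID", "Churn"])
target = encoded["Churn"]

print(list(features.columns))
print(target.tolist())
```

A model trained this way expects only the customer attributes at prediction time, which is what an input form can realistically provide.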

In [30]:
X.head()  # Churn (the last column) is no longer present in X

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
0,5375,0,0,1,0,1,0,1,0,0,2,0,0,0,0,0,1,2,29.85,2505
1,3962,1,0,0,0,34,1,0,0,2,0,2,0,0,0,1,0,3,56.95,1466
2,2564,1,0,0,0,2,1,0,0,2,2,0,0,0,0,0,1,3,53.85,157
3,5535,1,0,0,0,45,0,1,0,2,0,2,2,0,0,1,0,0,42.3,1400
4,6511,0,0,0,0,2,1,0,1,0,0,0,0,0,0,0,1,2,70.7,925


In [31]:
y.head()  # y holds only the target column (Churn)

Unnamed: 0,Churn
0,0
1,0
2,1
3,0
4,1


# **Train-test split**
A train-test split divides the dataset into two parts: training data used to fit the model and test data held back to evaluate its performance.

In [32]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 80% training data and 20% testing data


In [36]:
X.shape

(7043, 20)

In [37]:
X_train.shape  # 80% of the rows, used to train the model

(5634, 20)

In [38]:
X_test.shape  # 20% data for testing

(1409, 20)
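Because churners are the minority class (373 of the 1409 test rows in the reports below), it can help to keep the class proportions equal in both parts of the split. The sketch below shows the `stratify` argument of `train_test_split` on synthetic data of the same length; the random features and labels are stand-ins, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 7043 rows, roughly one positive label in four.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(7043, 3))
y_demo = (rng.random(7043) < 0.265).astype(int)

# stratify=y_demo keeps the share of each class nearly identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(round(float(y_tr.mean()), 3), round(float(y_te.mean()), 3))
```

The row counts match the 5634 / 1409 split shown above, since `test_size=0.2` rounds the test part up to a whole number of rows.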

# **Train model**
Training the model is the process where the machine learning algorithm learns from the training data.
It identifies patterns and relationships between input features and the target variable.

For churn prediction, we first train a Logistic Regression model and then a Random Forest, and compare how well each predicts whether a customer will churn.

In [50]:
# Train Logistic Regression
model1 = LogisticRegression(max_iter=1000)
model1.fit(X_train, y_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [53]:
# Train Random Forest
model2 = RandomForestClassifier(n_estimators=100, random_state=42)
model2.fit(X_train, y_train)

# **Evaluation**
Evaluation checks how well the trained model performs on unseen data.

We use metrics like accuracy, precision, recall, and F1-score to measure performance.

This step shows how well the model is likely to generalize to real-world data.
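To show how these four metrics relate to each other, the sketch below computes them by hand from the counts of correct and incorrect predictions. The ten labels are made up for the example and do not come from the dataset.

```python
# Hypothetical labels for ten customers (1 = churn, 0 = no churn).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners correctly flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-churners correctly cleared

accuracy = (tp + tn) / len(pairs)                    # share of all predictions that are right
precision = tp / (tp + fp)                           # share of flagged customers who really churn
recall = tp / (tp + fn)                              # share of churners that were flagged
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.70 precision=0.67 recall=0.50 f1=0.57
```

Accuracy alone can look good when most customers do not churn, which is why recall on the churn class is worth reading separately.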


In [56]:
# Evaluation of model 1 (Logistic Regression)
y_pred = model1.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"✅ Model Accuracy: {accuracy * 100:.2f}%")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

✅ Model Accuracy: 82.04%

Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.91      0.88      1036
           1       0.69      0.58      0.63       373

    accuracy                           0.82      1409
   macro avg       0.77      0.74      0.76      1409
weighted avg       0.81      0.82      0.81      1409



In [55]:
# Evaluation of model 2 (Random Forest)
y_pred = model2.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"✅ Model Accuracy: {accuracy * 100:.2f}%")
print("\nClassification Report:\n", classification_report(y_test, y_pred))


✅ Model Accuracy: 79.70%

Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.91      0.87      1036
           1       0.66      0.48      0.56       373

    accuracy                           0.80      1409
   macro avg       0.74      0.70      0.71      1409
weighted avg       0.78      0.80      0.79      1409



# **Saving the Model**
Saving a model in a pickle file means converting the trained machine learning model into a binary format and storing it on disk.

This is done using Python's `pickle` module, allowing the model to be reloaded later for making predictions without retraining.


In [57]:
# Save the Logistic Regression model (model1), which had the higher test accuracy above
with open('churn_model.pkl', 'wb') as file:
    pickle.dump(model1, file)

print("Model saved as churn_model.pkl")


Model saved as churn_model.pkl
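A quick way to check that a saved model survives the round trip is to reload it and compare predictions. The sketch below does this with a small synthetic model and a temporary directory; the file name `demo_model.pkl` and the variable names are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic model standing in for the churn classifier.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(clf, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give the same predictions as the original.
same = bool((restored.predict(X_demo) == clf.predict(X_demo)).all())
print(same)
```

Two cautions apply to pickle files in general: only load files from a source you trust, since unpickling can execute arbitrary code, and load a model with the same scikit-learn version that saved it.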


# **Model Deployment**
Model deployment is the process of making a trained machine learning model available for use in a real-world application.
It involves integrating the model into an environment (such as a web app, mobile app, or API) where users can input data and get predictions.
Common deployment tools include Streamlit, Gradio, Flask, and cloud platforms like AWS or Streamlit Cloud.

# **Gradio**
We will use Gradio for deployment because it provides a simple way to create interactive web interfaces for machine learning models.

With just a few lines of code, Gradio allows users to input data, run predictions, and see results in real time without complex setup.


In [58]:
import gradio as gr
import pickle
import pandas as pd

# Load the trained model
with open('churn_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Prediction function
def predict_churn(gender, senior, partner, dependents, tenure, phone_service, multiple_lines,
                  internet_service, online_security, online_backup, device_protection,
                  tech_support, streaming_tv, streaming_movies, contract, paperless_billing,
                  payment_method, monthly_charges, total_charges):

    # Encode the inputs with the same integer codes used in training.
    # LabelEncoder numbers each column's categories in sorted order,
    # e.g. No=0, No internet service=1, Yes=2 (see the encoded df.head() above).
    three_level = {'No': 0, 'No internet service': 1, 'No phone service': 1, 'Yes': 2}
    payment_codes = {'Bank transfer (automatic)': 0, 'Credit card (automatic)': 1,
                     'Electronic check': 2, 'Mailed check': 3}
    data = {
        'gender': 1 if gender == 'Male' else 0,
        'SeniorCitizen': 1 if senior == 'Yes' else 0,
        'Partner': 1 if partner == 'Yes' else 0,
        'Dependents': 1 if dependents == 'Yes' else 0,
        'tenure': int(tenure),
        'PhoneService': 1 if phone_service == 'Yes' else 0,
        'MultipleLines': three_level[multiple_lines],
        'InternetService': 0 if internet_service == 'DSL' else (1 if internet_service == 'Fiber optic' else 2),
        'OnlineSecurity': three_level[online_security],
        'OnlineBackup': three_level[online_backup],
        'DeviceProtection': three_level[device_protection],
        'TechSupport': three_level[tech_support],
        'StreamingTV': three_level[streaming_tv],
        'StreamingMovies': three_level[streaming_movies],
        'Contract': 0 if contract == 'Month-to-month' else (1 if contract == 'One year' else 2),
        'PaperlessBilling': 1 if paperless_billing == 'Yes' else 0,
        'PaymentMethod': payment_codes[payment_method],
        'MonthlyCharges': float(monthly_charges),
        'TotalCharges': float(total_charges)
    }

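    # Caution: X above still included customerID, so the saved model expects 20 columns;
    # predict() raises on this 19-column frame unless the model is trained without customerID.
    input_df = pd.DataFrame([data])
    prediction = model.predict(input_df)[0]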
    return "✅ Customer is likely to CHURN!" if prediction == 1 else "❌ Customer is NOT likely to churn."

# Build the Gradio interface
with gr.Blocks(theme=gr.themes.Soft(primary_hue="blue", secondary_hue="cyan")) as demo:
    gr.Markdown("<h1 style='text-align:center;'>📊 Customer Churn Prediction</h1>")
    gr.Markdown("<p style='text-align:center;'>Enter the customer details below to predict whether the customer will churn. <br> Powered by <b>Machine Learning</b> & <b>Gradio</b></p>")

    with gr.Row():
        with gr.Column():
            gender = gr.Dropdown(['Male', 'Female'], label="Gender")
            senior = gr.Dropdown(['Yes', 'No'], label="Senior Citizen")
            partner = gr.Dropdown(['Yes', 'No'], label="Partner")
            dependents = gr.Dropdown(['Yes', 'No'], label="Dependents")
            tenure = gr.Number(label="Tenure (Months)")
            phone_service = gr.Dropdown(['Yes', 'No'], label="Phone Service")
            multiple_lines = gr.Dropdown(['Yes', 'No', 'No phone service'], label="Multiple Lines")
            internet_service = gr.Dropdown(['DSL', 'Fiber optic', 'No'], label="Internet Service")
            online_security = gr.Dropdown(['Yes', 'No', 'No internet service'], label="Online Security")
            online_backup = gr.Dropdown(['Yes', 'No', 'No internet service'], label="Online Backup")
        with gr.Column():
            device_protection = gr.Dropdown(['Yes', 'No', 'No internet service'], label="Device Protection")
            tech_support = gr.Dropdown(['Yes', 'No', 'No internet service'], label="Tech Support")
            streaming_tv = gr.Dropdown(['Yes', 'No', 'No internet service'], label="Streaming TV")
            streaming_movies = gr.Dropdown(['Yes', 'No', 'No internet service'], label="Streaming Movies")
            contract = gr.Dropdown(['Month-to-month', 'One year', 'Two year'], label="Contract")
            paperless_billing = gr.Dropdown(['Yes', 'No'], label="Paperless Billing")
            payment_method = gr.Dropdown(['Electronic check', 'Mailed check', 'Bank transfer (automatic)', 'Credit card (automatic)'], label="Payment Method")
            monthly_charges = gr.Number(label="Monthly Charges")
            total_charges = gr.Number(label="Total Charges")

    output = gr.Textbox(label="Prediction", placeholder="Result will appear here...", interactive=False)

    btn = gr.Button("🔍 Predict Now", variant="primary")
    btn.click(predict_churn, inputs=[gender, senior, partner, dependents, tenure, phone_service, multiple_lines,
                                     internet_service, online_security, online_backup, device_protection,
                                     tech_support, streaming_tv, streaming_movies, contract, paperless_billing,
                                     payment_method, monthly_charges, total_charges],
              outputs=output)

    gr.Examples(
        examples=[
            ["Male", "No", "Yes", "No", 12, "Yes", "No", "Fiber optic", "No", "Yes", "Yes", "No", "Yes", "Yes", "Month-to-month", "Yes", "Electronic check", 70.5, 900.0],
            ["Female", "Yes", "No", "Yes", 48, "Yes", "Yes", "DSL", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Two year", "No", "Credit card (automatic)", 89.1, 2000.5]
        ],
        inputs=[gender, senior, partner, dependents, tenure, phone_service, multiple_lines,
                internet_service, online_security, online_backup, device_protection,
                tech_support, streaming_tv, streaming_movies, contract, paperless_billing,
                payment_method, monthly_charges, total_charges]
    )

demo.launch(share=True)


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://830325876c51e1ca55.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
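The app above returns a yes/no label. If a probability is preferred, classifiers such as Logistic Regression expose `predict_proba`. The sketch below shows the idea on a small synthetic model; the data and the helper name `churn_probability` are assumptions for illustration and are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic model standing in for the churn classifier.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)


def churn_probability(model, row):
    """Return the predicted probability of class 1 for a single feature row."""
    # classes_ gives the column order of predict_proba.
    positive_col = list(model.classes_).index(1)
    return float(model.predict_proba([row])[0][positive_col])


p = churn_probability(clf, X_demo[0])
print(f"{p:.1%}")
```

A function like this could be returned from the prediction callback in place of the fixed label, so that users see how confident the model is.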


