This project aims to predict customer churn using machine learning models. The dataset contains information about customer demographics, service plans, billing, and churn status, which is analyzed and used to train various classification models. The best-performing model is deployed as an API for real-time predictions.
- Dataset Description
- Data Preprocessing
- Exploratory Data Analysis
- Feature Engineering and Scaling
- Model Training
- Model Performance Evaluation
- Feature Importance
- Hyperparameter Tuning
- Business Insights and Recommendations
- Model Deployment
- Rows: 360
- Columns: 21
- Target Column: `Churn` (Yes/No)
This dataset contains customer data, including demographic information, service details, and billing information. The data is used to understand the reasons for customer churn and create machine learning models for churn prediction.
The dataset consists of the following columns:
- `customerID` (object): Unique identifier for the customer.
- `gender`, `Partner`, `Dependents`, `PhoneService`, etc. (object): Categorical columns representing various customer features.
- `SeniorCitizen`, `tenure` (int64): Numeric columns representing customer attributes.
- `MonthlyCharges`, `TotalCharges` (float64): Numeric columns related to billing information.
- `Churn` (object): Target column with one missing value that was removed.
- Handling Missing Values:
  - Removed one missing value in the `Churn` column (at index 359).
- Handling Non-Numeric Data in `TotalCharges`:
  - Converted `TotalCharges` to `float64` and checked for non-numeric entries.
  - If necessary, missing values would be imputed using mean, median, or mode, but in this case, no non-numeric values were found.
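
The preprocessing steps above could look roughly like the following. This is a minimal sketch on a made-up four-row frame, not the project's actual code: the sample values are invented, and the blank `TotalCharges` entry is included only to exercise the imputation branch (the README reports that the real data had no such entries). The median is an arbitrary choice among the mean/median/mode options mentioned.

```python
import pandas as pd

# Tiny synthetic stand-in for the real dataset (all values are invented).
df = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C", "0004-D"],
    "tenure": [1, 34, 2, 45],
    "TotalCharges": ["29.85", "1889.5", " ", "1840.75"],
    "Churn": ["No", "No", "Yes", None],
})

# Drop rows whose target label is missing.
df = df.dropna(subset=["Churn"]).reset_index(drop=True)

# Convert TotalCharges to float64; anything that is not a number becomes NaN.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
n_bad = int(df["TotalCharges"].isna().sum())

# Impute only when the check above actually found unparseable entries.
if n_bad > 0:
    df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())

print(f"rows kept: {len(df)}, non-numeric TotalCharges entries: {n_bad}")
```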
- Customers with tenures below 20 months are highly prone to churn. This indicates the importance of early retention strategies.
- After 60 months, the churn rate drops significantly, suggesting that long-tenured customers are more loyal.
- Univariate and bivariate analyses were conducted to explore relationships between customer attributes and churn behavior.
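
A bivariate check such as churn rate by tenure band can be reproduced with a short pandas snippet. The sketch below uses invented data and band edges at 20 and 60 months to mirror the observations above; the column names match the dataset, everything else is an assumption for illustration.

```python
import pandas as pd

# Invented sample; only the column names come from the dataset description.
df = pd.DataFrame({
    "tenure": [2, 5, 12, 18, 25, 40, 55, 61, 70, 72],
    "Churn": ["Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No", "No", "No"],
})

# Left-closed bands: [0, 20), [20, 60), [60, inf).
bands = pd.cut(
    df["tenure"],
    bins=[0, 20, 60, float("inf")],
    right=False,
    labels=["<20", "20-59", ">=60"],
)

# Share of churners inside each tenure band.
churn_rate = df["Churn"].eq("Yes").groupby(bands, observed=True).mean()
print(churn_rate)
```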
- Categorical Features: Converted categorical variables (e.g., `gender`, `Partner`) into numerical format using One-Hot Encoding or Label Encoding.
- Applied StandardScaler or MinMaxScaler to numerical features (`tenure`, `MonthlyCharges`, `TotalCharges`) for normalization.
- Created a feature combining `StreamingTV` and `StreamingMovies`.
- Created an `average_monthly_charge` feature (`TotalCharges` / `tenure`).
- Split the dataset into training and testing sets using an 80-20 split; fixing the random seed is what keeps such a split reproducible.
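
The feature engineering, encoding, splitting, and scaling steps can be sketched as follows. Several details here are assumptions rather than the project's confirmed choices: the combined streaming feature is modelled as a count of subscribed services (hypothetical name `streaming_services`), one-hot encoding and `StandardScaler` are picked from the options listed, `random_state=42` is arbitrary, and the zero-tenure guard is added for safety. The data is invented.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Invented ten-row sample using column names from the dataset description.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female", "Male",
               "Female", "Female", "Male", "Female", "Male"],
    "Partner": ["Yes", "No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "Yes"],
    "StreamingTV": ["No", "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No", "Yes"],
    "StreamingMovies": ["No", "Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No"],
    "tenure": [1, 34, 2, 45, 8, 22, 10, 62, 5, 70],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 99.65, 89.10, 29.75, 104.80, 70.70, 110.00],
    "TotalCharges": [29.85, 1889.50, 108.15, 1840.75, 820.50, 1949.40, 301.90, 6500.00, 350.00, 7700.00],
    "Churn": ["No", "No", "Yes", "No", "Yes", "No", "No", "No", "Yes", "No"],
})

# Assumed combination: number of streaming services the customer subscribes to.
df["streaming_services"] = (
    df["StreamingTV"].eq("Yes").astype(int) + df["StreamingMovies"].eq("Yes").astype(int)
)

# TotalCharges / tenure, falling back to MonthlyCharges when tenure is 0.
safe_tenure = df["tenure"].where(df["tenure"] > 0)
df["average_monthly_charge"] = (df["TotalCharges"] / safe_tenure).fillna(df["MonthlyCharges"])

# One-hot encode the categorical columns and map the target to 0/1.
X = pd.get_dummies(
    df.drop(columns=["Churn", "StreamingTV", "StreamingMovies"]),
    columns=["gender", "Partner"],
    drop_first=True,
)
y = df["Churn"].map({"No": 0, "Yes": 1})

num_cols = ["tenure", "MonthlyCharges", "TotalCharges", "average_monthly_charge"]
X[num_cols] = X[num_cols].astype("float64")

# 80-20 split; the fixed seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```

Fitting the scaler on the training portion only keeps test-set statistics from leaking into training.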
The following models were trained to predict customer churn:
- Logistic Regression
- Random Forest
- Gradient Boosting
- Support Vector Machine (SVM)
- K-Nearest Neighbors (KNN)
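
As an illustration of how these five classifiers might be trained and compared with scikit-learn, here is a minimal sketch. It runs on synthetic data from `make_classification` rather than the churn dataset, and every setting (default hyperparameters, `max_iter=1000`, scaling inside pipelines, the seeds) is an assumption, so the scores it prints are unrelated to the results reported in this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced binary problem standing in for the churn data.
X, y = make_classification(
    n_samples=359, n_features=10, weights=[0.73, 0.27], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Distance- and margin-based models get a scaler in front of them.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    churn_proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "roc_auc": roc_auc_score(y_test, churn_proba),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, roc_auc={scores['roc_auc']:.3f}")
```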
| Model | Accuracy | Precision | Recall | F1-score | ROC-AUC Score |
|---|---|---|---|---|---|
| Logistic Regression | 77.78% | 61.54% | 42.11% | 50.00% | 75.47% |
| Random Forest | 76.39% | 60.00% | 31.58% | 41.38% | 73.98% |
| Gradient Boosting | 76.39% | 60.00% | 31.58% | 41.38% | 66.33% |
| Support Vector Machine (SVM) | 80.56% | 66.67% | 52.63% | 58.82% | 76.17% |
| K-Nearest Neighbors | 77.78% | 61.54% | 42.11% | 50.00% | 74.68% |
Best Model: The SVM model performed the best, achieving the highest accuracy (80.56%), precision (66.67%), recall (52.63%), F1-score (58.82%), and ROC-AUC (76.17%).
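
To make the relationship between these metrics concrete, the sketch below recomputes the SVM figures from confusion-matrix counts. The counts (10 true positives, 5 false positives, 9 false negatives, 48 true negatives) are not stated in the source; they are inferred as a set consistent with the reported percentages, assuming a 72-row test set (20% of 359 rows, rounded up).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Inferred counts for the SVM row (an assumption, see the note above).
svm = classification_metrics(tp=10, fp=5, fn=9, tn=48)

for name, value in svm.items():
    print(f"{name}: {value:.2%}")
# accuracy: 80.56%
# precision: 66.67%
# recall: 52.63%
# f1: 58.82%
```

A recall of about 53% means that, under these counts, 9 of the 19 actual churners in the test set would go undetected, which matters when the goal is proactive retention.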
The most important features contributing to churn prediction (using Random Forest and Gradient Boosting) are:
- `Contract`
- `MonthlyCharges`
- `TechSupport`
- `average_monthly_charge`
- `TotalCharges`
- `tenure`
- `PaymentMethod`
- `OnlineSecurity`
- `SeniorCitizen`
- `PaperlessBilling`
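
A ranking like this is typically read from a fitted tree ensemble's `feature_importances_` attribute. The sketch below shows the mechanics with a Random Forest; the column names are borrowed from the list above, but the values are random synthetic data, so the order it prints says nothing about real churn drivers.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Names reused for readability only; the underlying values are synthetic.
feature_names = ["Contract", "MonthlyCharges", "TechSupport", "tenure", "PaperlessBilling"]
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort them from most to least important.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```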
Method: GridSearchCV was used for hyperparameter tuning to improve model performance.
Tuned Hyperparameters:
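- `C = 10`: Low regularization for fitting complex patterns.
- `gamma = 1`: Medium influence for decision boundaries. Note that scikit-learn's `SVC` ignores `gamma` when the kernel is linear, so this value does not affect the selected model.
- `kernel = 'linear'`: Data separation using a linear hyperplane.
- `probability = True`: Enabled probability estimates for better decision-making.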
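
A grid search over these SVM settings might be set up along the following lines. This is a sketch under assumptions: the candidate values in `param_grid`, the F1 scoring, the five folds, and the scaling pipeline are all illustrative, and the data is synthetic, so the winning combination will generally differ from the one reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the prepared churn features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(probability=True, random_state=0)),
])

# Hypothetical search space; gamma only matters for the non-linear kernel.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.01, 0.1, 1],
    "svc__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
search.fit(X, y)

best_params = search.best_params_
print(best_params, round(search.best_score_, 3))
```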
- The SVM model can help predict potential churners, allowing the business to take proactive actions like offering incentives or personalized support.
- Features like `Contract`, `MonthlyCharges`, `TechSupport`, and `TotalCharges` highlight areas needing improvement to reduce churn.
- Contract Type: Offer discounts to secure longer contracts, as customers with longer contracts are less likely to churn.
- Monthly Charges: Consider discounted bundles or individualized pricing for at-risk customers.
- TechSupport and OnlineSecurity: Provide free trials or discounts to enhance customer satisfaction.
- Loyalty Programs: Reward loyal customers to improve retention.
- Payment Options: Market PaperlessBilling and offer flexible payment options to customers with higher `TotalCharges`.
- The best-performing SVM model was saved as a `.pkl` file for future use.
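
Saving and restoring a fitted model as a `.pkl` file can be done with the standard `pickle` module, as in this minimal sketch. The file name `svm_model.pkl`, the temporary directory, and the synthetic training data are assumptions; the project's actual file name and serialization library are not given.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Fit a small stand-in model on synthetic data.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = SVC(kernel="linear", C=10, probability=True, random_state=0).fit(X, y)

# Hypothetical file name inside a throwaway directory.
model_path = Path(tempfile.mkdtemp()) / "svm_model.pkl"

with model_path.open("wb") as f:
    pickle.dump(model, f)

with model_path.open("rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool(np.array_equal(model.predict(X), restored.predict(X)))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.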
A simple Flask application was built to accept input data through an API and return churn predictions.
- Route: `/predict`
- Deployment Platform: The Flask app was deployed on Render.com.
Deployed Link: https://mastersoft.onrender.com/
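
A Flask endpoint of this kind could be structured as in the sketch below. It is not the deployed application: the request schema (a JSON object with a `features` list), the response shape, and the `StubModel` class are all hypothetical, with the stub standing in for the unpickled SVM so the example runs without a model file.

```python
from flask import Flask, jsonify, request


class StubModel:
    """Stand-in for the loaded SVM: flags churn when the first feature (tenure) is below 20."""

    def predict(self, rows):
        return [1 if row[0] < 20 else 0 for row in rows]


app = Flask(__name__)
model = StubModel()  # the real app would load the saved .pkl file here


@app.route("/predict", methods=["POST"])
def predict():
    # Parse the JSON body; malformed or missing JSON yields an empty dict.
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")
    if not isinstance(features, list) or not features:
        return jsonify(error="expected a JSON body with a non-empty 'features' list"), 400

    label = int(model.predict([features])[0])
    return jsonify(churn="Yes" if label == 1 else "No")
```

Calling `app.run()` (or pointing a WSGI server at `app`) would start the service; the sketch omits that call so it can be imported and exercised with Flask's test client.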
- Clone the repository.
- Install the required packages from `requirements.txt`.
- Run the Flask app locally using `python app.py`.
- Use the deployed API for predictions by sending POST requests to `/predict` with the necessary input data.
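
A POST request to `/predict` can be built with the standard library as sketched here. The payload shape (`{"features": [...]}`) and the local address `http://127.0.0.1:5000` are assumptions, since the README does not document the input schema; adjust both to match the actual app. The `send` helper is defined but not called, so the snippet runs without a server.

```python
import json
import urllib.request


def build_predict_request(base_url, payload):
    """Create a JSON POST request aimed at the /predict route."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req, timeout=30):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical payload: tenure, MonthlyCharges, TotalCharges.
req = build_predict_request("http://127.0.0.1:5000", {"features": [5, 70.35, 351.75]})
print(req.get_method(), req.full_url)
```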