- MonthlyCharges: The amount charged to the customer monthly
- TotalCharges: The total amount charged to the customer
- CustomerID: Unique ID for each customer
- gender: Whether the customer is male or female
- SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
- Partner: Whether the customer has a partner or not (Yes, No)
- Dependents: Whether the customer has dependents or not (Yes, No)
- Tenure: Number of months the customer has stayed with the company
- PhoneService: Whether the customer has a phone service or not (Yes, No)
- MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
- InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
- OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
- OnlineBackup: Whether the customer has an online backup or not (Yes, No, No internet service)
- DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
- TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
- StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
- StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
- Contract: The contract term of the customer (Month-to-month, One year, Two years)
- PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
- PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- Shape
- Null Values summary
- Column datatypes, as well as a small amount of cleaning and preprocessing
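
The inspection steps listed above can be sketched with pandas as follows. This is a minimal illustration on a tiny made-up sample, not the notebook's actual code: the rows, the blank `TotalCharges` entry, and the `Churn` target column with Yes/No values are assumptions made for the example.

```python
import pandas as pd

# Tiny made-up sample; the real notebook would load the full dataset instead.
df = pd.DataFrame({
    "CustomerID": ["0001-A", "0002-B", "0003-C"],
    "Tenure": [1, 34, 0],
    "MonthlyCharges": [29.85, 56.95, 52.55],
    # Assumption: charges arrive as text, with a blank where the value is missing.
    "TotalCharges": ["29.85", "1889.5", " "],
    # Assumption: the target column is named Churn and holds Yes/No.
    "Churn": ["No", "No", "Yes"],
})

print(df.shape)   # (rows, columns)
print(df.dtypes)  # column datatypes

# Convert the text column to numbers; entries that cannot be parsed become NaN.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Null values summary: count of missing entries per column.
null_summary = df.isnull().sum()
print(null_summary)

# Encode the target as 1 = churn, 0 = no churn.
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})
```

Converting before counting nulls matters here, because a blank string is not reported as null until it has been coerced to a number.
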
FILE_TAG: (EDA_1) This is where we visualize the features and how they relate to our target. We also try to draw patterns/trends from these features, form some basic theories on our main question, and propose possible solutions to what we conclude is part of the problem.
FILE_TAG: (EDA_2) Here we take several groups of people:
- People who left or stayed in the first 6 months
- Loyal customers

We mainly apply the same metrics we applied in the EDA_1 file, where we also compare different features with our target.
FILE_TAG: (Final_Train_Model)
- Feature selection (Recursive Feature Elimination (RFE) and SelectKBest)
- Sampling Techniques (SMOTE/RandomOverSampler)
- Data Splitting and training
- Model Evaluation (confusion_matrix, ROC_AUC_Curve)
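
The steps listed above could be chained roughly as in the sketch below. Several things here are assumptions rather than the notebook's actual choices: the data is synthetic, the classifier is logistic regression, the two selectors are applied one after the other, and plain random oversampling via `sklearn.utils.resample` stands in for the SMOTE/RandomOverSampler classes from imbalanced-learn so that the example needs only scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the churn data (1 = churn, the minority class).
X, y = make_classification(n_samples=1000, n_features=12, n_informative=5,
                           weights=[0.75, 0.25], random_state=0)

# Split first so the test set keeps the original class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Feature selection: a univariate filter, then recursive feature elimination.
kbest = SelectKBest(score_func=f_classif, k=8).fit(X_train, y_train)
X_train_k, X_test_k = kbest.transform(X_train), kbest.transform(X_test)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X_train_k, y_train)
X_train_sel, X_test_sel = rfe.transform(X_train_k), rfe.transform(X_test_k)

# Random oversampling of the minority class, applied to the training set only.
majority = X_train_sel[y_train == 0]
minority = X_train_sel[y_train == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
X_bal = np.vstack([majority, minority_up])
y_bal = np.concatenate([np.zeros(len(majority), dtype=int),
                        np.ones(len(minority_up), dtype=int)])

# Train on the balanced data, evaluate on the untouched test set.
model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
y_pred = model.predict(X_test_sel)
y_prob = model.predict_proba(X_test_sel)[:, 1]  # predicted churn probability

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
recall = recall_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)
print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("recall:", recall, "ROC AUC:", auc)
```

Oversampling only after the split keeps duplicated minority rows out of the test set, so the reported recall is not inflated by rows the model has already seen.
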
Our evaluation was recall-based (fewer false negatives).
[What is a confusion matrix?](https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5)
Recall = out of all the customers who actually churned, how many the model managed to predict correctly
Precision = out of all the customers the model predicted would leave, how many actually churned
Recall and precision usually trade off: as one increases, the other tends to decrease
In our situation, we decided that lowering false negatives is more important than lowering false positives
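
The recall and precision definitions above can be checked by hand on a small example. The sketch below assumes churn is encoded as 1 and treated as the positive class; the labels are made up for illustration and the helper name `precision_recall` is hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 4 customers churned (1), 6 stayed (0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the model catches 3 of the 4 churners (recall 0.75) but 2 of its 5 churn predictions are wrong (precision 0.60).
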
Positive prediction (1) = churn
Negative prediction (0) = no churn
False negative = predicted negative, while the actual value was positive.
In simpler terms, the model predicted that a certain customer would not churn, but they actually did.
This is bad in a business situation.
False positive = predicted positive, while the actual value was negative.
In simpler terms, the model predicted that a certain customer would churn, but they actually did not.
This is acceptable in a business situation and not as bad as the situation above.
A higher recall means fewer false negatives. Recall was our main evaluation metric, while we also made sure to keep a reasonable F1 score (the harmonic mean of precision and recall).
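
As a worked illustration of the harmonic mean (the precision and recall values here are made up, not results from this project), take precision $P = 0.6$ and recall $R = 0.75$:

$$
F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \cdot 0.6 \cdot 0.75}{0.6 + 0.75} = \frac{0.9}{1.35} \approx 0.667
$$

The result sits slightly below the arithmetic mean (0.675), because the harmonic mean is pulled toward the smaller of the two values. This is why a model cannot reach a good F1 score by maximizing recall alone.
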
You can increase recall even further by lowering the decision threshold applied to the predicted churn probabilities. This in turn will increase recall significantly, but it decreases precision as well. The overall result is a lower F1 score but a higher recall.
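
The effect of the threshold can be seen on a small hand-made example. The sketch below assumes the threshold is applied to the predicted probability of churn, with churn (1) as the positive class; the probabilities are invented for illustration and the helper name `recall_precision_at` is hypothetical.

```python
def recall_precision_at(y_true, churn_prob, threshold):
    """Predict churn when the probability reaches the threshold, then score."""
    y_pred = [1 if p >= threshold else 0 for p in churn_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Made-up data: 4 churners followed by 6 customers who stayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
churn_prob = [0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]

for threshold in (0.5, 0.3):
    recall, precision = recall_precision_at(y_true, churn_prob, threshold)
    print(f"threshold={threshold}: recall={recall:.2f} precision={precision:.2f}")
```

On this data, moving the threshold from 0.5 to 0.3 raises recall from 0.50 to 1.00 while precision falls from about 0.67 to about 0.57.
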
** Under Construction **