# Phase 3 Project

Student Name: **Paul Kamau**

DSC-ft06



# Project Overview
***SyriaTel Customer Churn***

This project uses data analytics and machine learning to improve customer experience and reduce churn for Syriatel, a leading telecommunications provider in Syria. In an intensely competitive market, Syriatel's sustained growth depends on customer retention and satisfaction.

# Business Understanding
#### 1.1 Introduction to Syriatel
Syriatel is a mobile network provider in Syria, founded in 2000. It is one of the country's two dominant providers, alongside MTN Syria, and offers LTE, 3G, and GSM services under the brand name Super Surf. The company is owned by Rami Makhlouf, a cousin of Syrian president Bashar al-Assad, and has about 3,500 employees and 8 million subscribers. It is headquartered in Damascus and operates in all Syrian governorates.

In recent years the company has faced several challenges, including European Union sanctions, the Syrian civil war, and a judicial custody order issued by a Syrian court in 2020. It also faces competition from a new entrant, Wafa Telecom, which received Syria's third telecom license in 2022. To maintain its competitive edge and keep delivering a strong customer experience in a rapidly changing market, Syriatel must continue to evolve and improve its services.

#### 1.2 Key Stakeholders
As lead data scientist at SyriaTel, my stakeholders are (1) executives, (2) the customer retention team, and (3) potential investors and partners. The project's success depends on meeting the needs and expectations of these groups, each with distinct interests and objectives:

##### A.	Syriatel:
1. **Syriatel Executives**: The company's leadership is focused on sustaining and growing the customer base. Their strategic objectives encompass ensuring the long-term prosperity and market dominance of Syriatel, amidst increasing competition and evolving market demands.


2. **Syriatel Customer Retention Team**: Tasked with bolstering customer acquisition and retention, this team is keen on enhancing customer engagement. It is responsible for devising and implementing targeted promotional campaigns and strategies to attract and retain customers.

##### B.	Potential Investors and Partners: 
While not directly involved in day-to-day operations, investors and business partners have a vested interest in Syriatel's market performance and strategic direction. Their support and investment are essential for funding new initiatives and driving technological advancements.

# Business Problem
We currently cannot identify customers who are on the cusp of churning (**churn: loss of customers to the competition**). The objective of this project is to predict likely churners. Our dataset includes 20 variables plus the churn label, describing 3,333 current and churned customers. This predictive ability will allow us to examine the data on a rolling basis and quickly offer targeted incentives.

### Objectives
The project is aligned with the following key objectives:

1. **Churn Prediction and Mitigation**:

    1. Train, test, and evaluate advanced classification models to accurately predict customer churn.
    2. Leverage predictive insights to identify customers at risk of churn, facilitating proactive retention strategies.
    3. Use model outputs to develop targeted interventions, mitigating churn and enhancing customer loyalty.

2.	**Customer Experience Enhancement**:

    1. Utilize insights derived from machine learning models to understand customer needs and preferences.

    2. Implement data-driven recommendations to improve customer satisfaction, such as personalized services and prompt issue resolution.
    3. Ensure high network performance and service quality, fostering a positive and consistent customer experience.

3.	**Strategic Business Decision Making**:
    1. Apply the findings from the final model to inform and support key business decisions.

    2. Develop and execute strategies that not only retain existing customers but also attract new ones, strengthening Syriatel’s market position.
    3. Craft innovative solutions and services based on customer data insights, ensuring Syriatel remains competitive and responsive to market dynamics.

By focusing on these objectives, the project aims to empower Syriatel with actionable data-driven insights, reinforcing its market presence and customer-centric approach. The integration of machine learning into business strategies is envisioned to transform customer interactions and decision-making processes, setting new standards for excellence in the telecommunications industry.
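
Objective 1.1 (train, test, and evaluate classification models) can be illustrated with a minimal sketch. It runs on synthetic stand-in data, not the SyriaTel dataset, and the feature names, the churn-generating rule, and the choice of recall as the metric are all assumptions made for illustration. It compares a majority-class baseline with a scaled logistic regression, two of the estimators imported later in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the SyriaTel dataset). Assumption for the demo:
# churn becomes more likely as the number of customer service calls grows.
rng = np.random.default_rng(42)
n_customers = 1000
service_calls = rng.poisson(1.5, size=n_customers)
day_minutes = rng.normal(180, 50, size=n_customers)
churn_probability = 1 / (1 + np.exp(-(service_calls - 3.5)))
churned = (rng.random(n_customers) < churn_probability).astype(int)

X = np.column_stack([service_calls, day_minutes])
X_train, X_test, y_train, y_test = train_test_split(
    X, churned, test_size=0.25, stratify=churned, random_state=42
)

# Baseline: always predicts the majority class (non-churn in this data).
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Candidate model: standardize the features, then fit a logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

# Recall = share of actual churners that the model flags.
baseline_recall = recall_score(y_test, baseline.predict(X_test))
model_recall = recall_score(y_test, model.predict(X_test))
print(f"baseline recall: {baseline_recall:.2f}")
print(f"logistic regression recall: {model_recall:.2f}")
```

Because the baseline never predicts churn, its recall is zero, which is why a plain accuracy figure on an imbalanced target can look better than the model actually is.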


# Data Understanding

Sourced from: 

   - [Kaggle: Churn in Telecom's dataset](https://www.kaggle.com/datasets/becksddf/churn-in-telecoms-dataset/)

## Dataset Overview

This dataset contains records of 3,333 clients of a fictional telecommunications company named "SyriaTel." It includes 20 attributes plus the churn label, capturing details such as customer geographic location, usage patterns for day, evening, and night calls, the presence of voice mail or international plans, and the account length. The account length records how long a customer has held an account with SyriaTel, so it can serve as a rough proxy for the customer's lifetime value.

### Variable Descriptions

1. **Churn**: Indicates if the customer has stopped doing business with SyriaTel. (False = No churn, True = Churned)

2. **State**: The U.S. State of the customer. (Requires one-hot encoding; not ordinal)

3. **Account Length**: How long the account has been active; a larger number signifies an older account. (Indicative of Customer Lifetime Value)

4. **Area Code**: Area code of the customer's phone number.

5. **Phone Number**: The customer's phone number.

6. **International Plan**: Whether the customer has an international plan. ('yes' or 'no'; binary, so it can be mapped directly to 1/0)

7. **Voice Mail Plan**: Whether the customer subscribes to a voice mail plan. ('yes' or 'no'; as above)

8. **Number of Voice Mail Messages**: Total number of voice mail messages left by the customer.

9. **Total Day Minutes**: Aggregate of daytime minutes used.

10. **Total Day Calls**: Total number of calls made during the day.

11. **Total Day Charge**: Total charges incurred for daytime calls.

12. **Total Eve Minutes**: Total minutes spent on calls in the evening.

13. **Total Eve Calls**: Number of calls made during the evening.

14. **Total Eve Charge**: Charges for evening calls.

15. **Total Night Minutes**: Total minutes for nighttime calls.

16. **Total Night Calls**: Number of calls made at night.

17. **Total Night Charge**: Nighttime call charges.

18. **Total Intl Minutes**: Cumulative international minutes (covering day, evening, and night).

19. **Total Intl Calls**: Total number of international calls (across all time periods).

20. **Total Intl Charge**: Total charges for international calls.

21. **Customer Service Calls**: Number of calls made to customer service by the customer.


### Target Variable Description:
**Churn**: Whether the customer has churned (True or False). This is the same variable as item 1 above.
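
The encoding notes in the variable descriptions (one-hot encoding for State, a binary mapping for the two plan columns) can be sketched as follows. This is only an illustration on a small hand-built frame that reuses the dataset's column names. It uses `pandas.get_dummies` for brevity, which is a substitution for, not a copy of, the `OneHotEncoder` imported later in the notebook.

```python
import pandas as pd

# Tiny hand-built frame using the dataset's column names (illustrative only).
sample = pd.DataFrame({
    "state": ["KS", "OH", "NJ", "OH"],
    "international plan": ["no", "no", "no", "yes"],
    "voice mail plan": ["yes", "yes", "no", "no"],
})

# Binary plans: map the 'yes'/'no' strings to 1/0.
for column in ["international plan", "voice mail plan"]:
    sample[column] = sample[column].map({"yes": 1, "no": 0})

# State is nominal (no natural order), so expand it into one
# indicator column per state instead of assigning integer codes.
encoded = pd.get_dummies(sample, columns=["state"], dtype=int)
print(encoded)
```

With all 3,333 rows, the State column would expand into one indicator column per distinct state, so the encoded frame becomes considerably wider than the original.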


# Data Preparation and EDA 

In [1]:
# Importing Required Python Libraries:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.preprocessing import OneHotEncoder, StandardScaler

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector as selector

from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV

from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import log_loss

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImPipeline

# Hiding warnings
import warnings
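try:
    # Newer pandas releases expose the warning class under pandas.errors
    from pandas.errors import SettingWithCopyWarning
except ImportError:
    # Older pandas releases keep it in pandas.core.common
    from pandas.core.common import SettingWithCopyWarning
warnings.simplefilter(action='ignore', category=SettingWithCopyWarning)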
warnings.filterwarnings(action='ignore',module='sklearn')

In [2]:
# import the data-set
df = pd.read_csv('bigml_59c28831336c6604c800002a.csv')

# print the first 5 rows
df.head()

|   | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


In [3]:
# Preview all columns and their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

**Analysis of Data Types in the Dataset**

The examination of our dataset's structure revealed the following distribution of data types:

| Data Type       | Quantity |
|:----------------|----------|
| bool            | 1        |
| float64         | 8        |
| int64           | 8        |
| object (string) | 4        |

Each column contains **3,333 non-null entries**, equal to the number of rows in the frame, so the dataset appears to have no missing values.
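
The non-null counts reported by `df.info()` can be confirmed with an explicit check. The sketch below uses a hypothetical toy frame with one deliberately missing value; on the real data the equivalent call would be `df.isna().sum()`. Note that this only detects `NaN`/`None`; placeholder strings such as `'?'` would not be counted.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value.
toy = pd.DataFrame({
    "total day minutes": [265.1, np.nan, 243.4],
    "churn": [False, False, True],
})

# Count missing entries per column, then reduce to a single yes/no answer.
missing_per_column = toy.isna().sum()
has_missing = bool(missing_per_column.sum() > 0)

print(missing_per_column)
print("Any missing values:", has_missing)
```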

In [4]:
df.describe()

|   | account length | area code | number vmail messages | total day minutes | total day calls | total day charge | total eve minutes | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 |
| mean | 101.064806 | 437.182418 | 8.09901 | 179.775098 | 100.435644 | 30.562307 | 200.980348 | 100.114311 | 17.08354 | 200.872037 | 100.107711 | 9.039325 | 10.237294 | 4.479448 | 2.764581 | 1.562856 |
| std | 39.822106 | 42.37129 | 13.688365 | 54.467389 | 20.069084 | 9.259435 | 50.713844 | 19.922625 | 4.310668 | 50.573847 | 19.568609 | 2.275873 | 2.79184 | 2.461214 | 0.753773 | 1.315491 |
| min | 1.0 | 408.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 23.2 | 33.0 | 1.04 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 74.0 | 408.0 | 0.0 | 143.7 | 87.0 | 24.43 | 166.6 | 87.0 | 14.16 | 167.0 | 87.0 | 7.52 | 8.5 | 3.0 | 2.3 | 1.0 |
| 50% | 101.0 | 415.0 | 0.0 | 179.4 | 101.0 | 30.5 | 201.4 | 100.0 | 17.12 | 201.2 | 100.0 | 9.05 | 10.3 | 4.0 | 2.78 | 1.0 |
| 75% | 127.0 | 510.0 | 20.0 | 216.4 | 114.0 | 36.79 | 235.3 | 114.0 | 20.0 | 235.3 | 113.0 | 10.59 | 12.1 | 6.0 | 3.27 | 2.0 |
| max | 243.0 | 510.0 | 51.0 | 350.8 | 165.0 | 59.64 | 363.7 | 170.0 | 30.91 | 395.0 | 175.0 | 17.77 | 20.0 | 20.0 | 5.4 | 9.0 |
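
In the summary above, the minimum, quartiles, and maximum of `area code` fall on just 408, 415, and 510. This hints that the column may behave like a category label rather than a quantity, even though pandas stores it as an integer. A minimal sketch of how that could be checked, using values copied from the five-row `df.head()` preview (on the full frame the same calls would be `df.nunique()` and `df['area code'].value_counts()`):

```python
import pandas as pd

# Values copied from the five rows of the df.head() preview.
preview = pd.DataFrame({
    "area code": [415, 415, 415, 408, 415],
    "total day minutes": [265.1, 161.6, 243.4, 299.4, 166.7],
})

# Few distinct values relative to the row count suggests a categorical column.
distinct_counts = preview.nunique()
area_code_counts = preview["area code"].value_counts()

print(distinct_counts)
print(area_code_counts)
```

If the full column does turn out to hold only a handful of codes, treating it as a categorical feature (for example, one-hot encoding it like State) is one option to consider.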
