## Predicting patterns in customer behavior that can help identify customers who are likely to churn soon from SyriaTel's services. 

### Phase 3 Group Project Members:
* Joseph Kinuthia 
* John Mark
* Peter Kariuki
* Collins Kanyiri
* Calvin Kipkirui
* Raphael Muthemba

## 1. Introduction

***
In today's rapidly evolving telecommunications landscape, customer churn remains a significant challenge that directly impacts business profitability and sustainability. Customer churn, also known as customer attrition or customer turnover, refers to the phenomenon where customers stop doing business with a company or stop using its services. Churn analysis is commonly used in various industries to understand why customers leave, predict which customers are likely to churn, and develop strategies to retain customers. With the increasing prevalence of choices and options available to consumers, telecommunications companies must navigate the delicate balance of attracting new customers while retaining their existing ones. 

* N.B.: One of the key questions that arises is whether there are predictable patterns in customer behavior that can help identify customers who are likely to churn soon from a firm's services.

This question is at the heart of our project, where we delve into the realm of predictive analytics to uncover insights that can shape the future of customer retention strategies. Our focus is on SyriaTel, a telecommunications company dedicated to providing cutting-edge services to its customers. By harnessing the power of data analysis and machine learning, we aim to provide SyriaTel with the tools to proactively identify potential churners and implement targeted efforts for retaining their valuable customer base.

In the following sections of our project, we will embark on a comprehensive journey through the telecom dataset provided by SyriaTel. We will explore the rich tapestry of customer interactions, behaviors, and characteristics that contribute to the phenomenon of churn. We will also delve into the interpretability of our predictive models, seeking to understand the features and behaviors that have the most significant impact on churn prediction. Armed with this knowledge, SyriaTel can make informed decisions about targeted interventions and tailored strategies to enhance customer satisfaction and retention.
***



## 2. Business Understanding

*** 
### Stakeholders:
The success of this project relies on collaboration among several key stakeholders:

* SyriaTel Management: As the ultimate decision-makers, they're vested in the project's outcomes for improved business performance.
* Marketing Team: They will utilize insights to design targeted retention campaigns and optimize customer engagement strategies.
* Customer Support Team: The findings will help them identify customer pain points and enhance support services.
* Data Science Team (Project Team): Responsible for executing the project, analyzing data, and creating predictive models.

### Direct Impact:
The creation of this project directly affects the core operations of SyriaTel, impacting customer retention strategies, revenue streams, and customer satisfaction levels.

### Business Problem(s) Solved:
This Data Science endeavor addresses the critical business problem of customer churn. It aims to predict potential churners and guide SyriaTel in proactive strategies to minimize attrition. In this sense, our research questions are:

* Can a predictive model accurately forecast whether a customer is likely to churn based on the available attributes and usage metrics?

* Which features contribute the most to the model's predictions?

* How well does the developed model generalize to new, unseen data? Are there certain patterns that the model consistently struggles to capture?

### Scope of the Project:
Inside the project's scope are the following components:

* Churn Prediction: Developing machine learning models to predict customer churn.
* Feature Analysis: Identifying significant features and behaviors linked to customer churn.
* Recommendations: Offering actionable suggestions to curb churn and enhance retention.

### Outside the Scope:
While the project tackles the formidable challenge of churn prediction, certain aspects lie beyond its immediate purview. Specifically, the implementation of recommended strategies to mitigate churn is a subsequent endeavor. Additionally, the evaluation of the financial impact arising from the project's outcomes is a distinct consideration.

### Data Sources:
It's important to note that the project's data originates from Kaggle, a well-regarded platform for diverse datasets. The SyriaTel telecommunications dataset, sourced from Kaggle, forms the cornerstone of our analysis, offering a comprehensive array of customer behaviors, usage patterns, and churn-related data.

### Expected Timeline:
The projected timeline for the completion of this venture is estimated at approximately 2-3 months. While stringent deadlines do not apply, the project is tailored to provide timely insights that align with SyriaTel's retention strategies.

### Stakeholder Alignment:
Although stakeholders from different parts of the organization may already have a basic grasp of the project, we place great importance on building a shared and thorough understanding. We achieve this through consistent communication, updates, and clarifications, keeping objectives aligned across all stakeholders involved.
***

## 3. Problem Statement

***
In the landscape of modern telecommunications, the persistent challenge of customer churn demands strategic solutions that transcend conventional practices. SyriaTel, a telecommunications company aiming to enhance customer retention, faces the pressing question: Are there discernible patterns in customer behavior that can aid in the early identification of customers on the brink of churning? This project encapsulates the endeavor to unravel these patterns, employing data science techniques to predict customer churn and provide actionable insights for SyriaTel's proactive retention efforts.

### Challenge:
The primary challenge lies in SyriaTel's pursuit of understanding and predicting customer behavior that leads to churn. The vast volume of customer data available needs to be distilled into predictive models that not only forecast potential churn but also offer valuable insights for targeted interventions.

### Objective:
The objective of this project is to build accurate classification models capable of identifying customers who are likely to churn soon. By delving into the dataset and analyzing customer attributes, usage patterns, and interactions, we aim to uncover patterns that contribute to churn, ultimately enabling SyriaTel to mitigate customer attrition.

This project encompasses data preprocessing, exploratory data analysis, feature engineering, machine learning model development, and the interpretation of model results. It involves understanding the correlation between various customer attributes, usage metrics, and churn rates, thereby offering insights into patterns that can inform SyriaTel's proactive efforts.

The expected outcome is a set of predictive models capable of accurately forecasting customer churn. The insights derived from these models will not only aid SyriaTel in identifying potential churners but also guide the formulation of tailored strategies for customer engagement and retention.

### Benefits
By successfully addressing the challenge of predicting customer churn, SyriaTel stands to gain several benefits:
* Proactive Retention: The ability to identify potential churners in advance allows for targeted interventions and personalized retention strategies.
* Enhanced Customer Satisfaction: Addressing pain points revealed by the data can lead to improved customer satisfaction and loyalty.
* Optimized Resource Allocation: Precise churn predictions let retention resources be directed at the customers most at risk, improving operational efficiency.
* Business Sustainability: By reducing churn, SyriaTel can bolster its revenue streams and establish a solid foundation for long-term growth.
***

## 4. Data Understanding

***
In the pursuit of understanding our data, we delve into a comprehensive exploration of the datasets that underpin our project. This stage involves unraveling the intricacies of the data we have at hand, as well as gaining insights into the origins, characteristics, and potential limitations of the data. Our primary source of data for this project is the SyriaTel dataset, retrieved from Kaggle. This dataset encapsulates a myriad of customer interactions, usage patterns, and churn-related information, serving as a valuable foundation for our predictive models.
The data sources are under the control of the SyriaTel organization. Accessing the data involves obtaining the required permissions or credentials and, where needed, liaising with the relevant data custodians within the organization.
***

In [1]:
# Read data from the CSV file and create a DataFrame. Displaying `data` shows the first and last rows.
import pandas as pd
data = pd.read_csv('Churn in Telecoms dataset.csv')
data

| Unnamed: 0 | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.90 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3328 | AZ | 192 | 415 | 414-4276 | no | yes | 36 | 156.2 | 77 | 26.55 | ... | 126 | 18.32 | 279.1 | 83 | 12.56 | 9.9 | 6 | 2.67 | 2 | False |
| 3329 | WV | 68 | 415 | 370-3271 | no | no | 0 | 231.1 | 57 | 39.29 | ... | 55 | 13.04 | 191.3 | 123 | 8.61 | 9.6 | 4 | 2.59 | 3 | False |
| 3330 | RI | 28 | 510 | 328-8230 | no | no | 0 | 180.8 | 109 | 30.74 | ... | 58 | 24.55 | 191.9 | 91 | 8.64 | 14.1 | 6 | 3.81 | 2 | False |
| 3331 | CT | 184 | 510 | 364-6381 | yes | no | 0 | 213.8 | 105 | 36.35 | ... | 84 | 13.57 | 139.2 | 137 | 6.26 | 5.0 | 10 | 1.35 | 2 | False |


### Target Variable:

***
Our target variable is "churn," which signifies whether a customer has churned or not. This binary boolean-type variable forms the core of our predictive modeling, as we aim to predict whether a customer is likely to churn in the future.
***
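
Because the target is binary, it is worth checking how the two classes are distributed before modeling. The snippet below is a minimal sketch of that check. It assumes a small, made-up DataFrame named `toy` in place of the real dataset, so the numbers it prints say nothing about SyriaTel's actual churn rate.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset; values are made up.
toy = pd.DataFrame({"churn": [False, False, False, True, False, True, False, False]})

# Count of each class and its share of all rows.
class_counts = toy["churn"].value_counts()
class_share = toy["churn"].value_counts(normalize=True)

# For a boolean column, the mean equals the fraction of True values.
churn_rate = toy["churn"].mean()

print(class_counts)
print(class_share)
print(f"Churn rate: {churn_rate:.2%}")
```

Running the same three lines on `data["churn"]` would show whether churners are a minority class, which matters later when choosing evaluation metrics.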

### Predictors and Data Types:

***
* state: the state the customer lives in (Categorical: String/Object)

* account length: the number of days the customer has had an account 

* area code: the area code of the customer

* phone number: the phone number of the customer

* international plan: "yes" if the customer has the international plan, otherwise "no"

* voice mail plan: "yes" if the customer has the voice mail plan, otherwise "no"

* number vmail messages: the number of voicemails the customer has sent

* total day minutes: total number of minutes the customer has been in calls during the day

* total day calls: total number of calls the customer has made during the day

* total day charge: total amount of money the customer was charged by the Telecom company for calls during the day

* total eve minutes: total number of minutes the customer has been in calls during the evening

* total eve calls: total number of calls the customer has made during the evening

* total eve charge: total amount of money the customer was charged by the Telecom company for calls during the evening

* total night minutes: total number of minutes the customer has been in calls during the night

* total night calls: total number of calls the customer has made during the night

* total night charge: total amount of money the customer was charged by the Telecom company for calls during the night

* total intl minutes: total number of minutes the customer has been in international calls

* total intl calls: total number of international calls the customer has made

* total intl charge: total amount of money the customer was charged by the Telecom company for international calls

* customer service calls: number of calls the customer has made to customer service
***

### Summary of Features to Work With

##### Categorical Columns:

    state
    international plan
    voice mail plan

##### Numerical Columns:

    account length
    number vmail messages
    total day minutes
    total day calls
    total day charge
    total eve minutes
    total eve calls
    total eve charge
    total night minutes
    total night calls
    total night charge
    total intl minutes
    total intl calls
    total intl charge
    customer service calls

***
**'state'**: Geographic location might impact customer behavior and preferences, potentially influencing churn.

**'international plan'** and **'voice mail plan'**: Customer subscription plans can directly affect their engagement and usage patterns, which in turn might influence churn.

Usage metrics such as **'total day minutes'**, **'total eve minutes'**, **'total night minutes'**, and **'total intl minutes'**: These, together with the corresponding call counts (for example **'total day calls'**) and charges (for example **'total day charge'**), could provide insights into how customers interact with the telecom services and their propensity to churn.

**'customer service calls'**: High customer service call frequency might indicate dissatisfaction and potentially lead to higher churn rates.
***
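
The categorical/numerical grouping above can also be derived from the column dtypes rather than typed by hand. The following is a rough sketch under stated assumptions: `toy` is a hypothetical DataFrame that reuses a few of the column names listed above with made-up values, and the target is dropped first so that it cannot end up in either feature list.

```python
import pandas as pd

# Hypothetical toy rows using a few of the column names listed above; values are made up.
toy = pd.DataFrame({
    "state": ["KS", "OH", "NJ"],
    "international plan": ["no", "no", "yes"],
    "total day minutes": [265.1, 161.6, 243.4],
    "customer service calls": [1, 1, 0],
    "churn": [False, False, True],
})

# Keep the target out of the feature lists.
features = toy.drop(columns="churn")

# Integer and float columns are treated as numerical, everything else as categorical.
numerical_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

print(categorical_cols)
print(numerical_cols)
```

A dtype-based split is only a starting point: a column such as `area code` is stored as a number but arguably behaves like a category, so the resulting lists still deserve a manual review.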

The dataset consists of 3333 observations and spans various customer attributes and usage metrics. This size is moderate, providing a reasonable amount of data for modeling. However, it's important to consider the complexity of the predictive task and the number of features. If the model requires a high level of accuracy or deals with intricate relationships, more data might be beneficial. Resampling methods such as **bootstrapping** (to estimate variability from limited data) or **oversampling** (to offset an under-represented class) can be employed if the dataset is deemed insufficient for building robust models.
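
As an illustration of the oversampling idea, here is a minimal sketch of naive random oversampling using only pandas. The DataFrame `train` and its values are hypothetical, and this is just one of several possible resampling techniques, not necessarily the one used later in the project. Resampling of this kind would normally be applied to the training split only, so that duplicated rows never reach the test set.

```python
import pandas as pd

# Hypothetical, made-up training rows: 6 non-churners and 2 churners.
train = pd.DataFrame({
    "customer service calls": [1, 0, 2, 1, 3, 0, 5, 4],
    "churn": [False, False, False, False, False, False, True, True],
})

majority = train[~train["churn"]]
minority = train[train["churn"]]

# Sample the minority class with replacement until it matches the majority size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

# Recombine and shuffle so the two classes are interleaved.
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)

print(balanced["churn"].value_counts())
```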

To verify data accuracy, it's crucial to understand the data collection process. The dataset's source, Kaggle, is a reputable platform, but it's still advisable to investigate potential errors or inconsistencies. Data collection methods, whether from customer records or surveys, should be evaluated for reliability. A validation process, such as cross-checking with other reliable sources, can help identify any anomalies. While the data might generally be correct, factors like entry errors or outdated information could introduce inaccuracies. Rigorous preprocessing and cleaning are necessary to minimize these risks and ensure the data's reliability for meaningful analysis and modeling.
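
A few of these checks can be automated. The sketch below looks for missing values, duplicated rows, repeated phone numbers, and negative usage values. To stay self-contained it uses a small DataFrame named `sample`, built from a few columns of the first three rows displayed earlier; the choice of checks is our assumption about what is worth verifying, not an exhaustive audit.

```python
import pandas as pd

# A few columns from the first three rows displayed earlier,
# used here as a stand-in for the full dataset.
sample = pd.DataFrame({
    "phone number": ["382-4657", "371-7191", "358-1921"],
    "total day minutes": [265.1, 161.6, 243.4],
    "total day charge": [45.07, 27.47, 41.38],
    "customer service calls": [1, 1, 0],
})

# Missing values per column.
missing_per_column = sample.isna().sum()

# Fully duplicated rows, and repeated values in the identifier column.
duplicate_rows = int(sample.duplicated().sum())
duplicate_phones = int(sample["phone number"].duplicated().sum())

# Usage and charge columns should never be negative.
numeric = sample.select_dtypes(include="number")
negative_values = int((numeric < 0).sum().sum())

print(missing_per_column)
print(duplicate_rows, duplicate_phones, negative_values)
```

Replacing `sample` with `data` would run the same checks on the whole dataset.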

### Train-Test Split with Scikit-Learn
N.B.: It's good practice to split the dataset into training and testing sets before performing any preprocessing, especially for modeling tasks. This helps prevent data leakage, which occurs when information from the testing set inadvertently influences the training process, leading to overly optimistic performance metrics.

Here is our approach: 

Data Splitting: Divide the dataset into two separate sets: one for training and one for testing.

Preprocessing: Fit the data preprocessing steps (cleaning, transformation, encoding, etc.) on the training set only, then apply the same fitted steps to the testing set.

Modeling and Evaluation: Train and validate the models using the preprocessed training set, and keep the testing set for the final evaluation.
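
The three steps above can be sketched with scikit-learn as follows. This is an illustrative outline under stated assumptions, not the project's final pipeline: the DataFrame `toy` and its values are hypothetical, and `StandardScaler` and `OneHotEncoder` are just one plausible choice of preprocessing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    "international plan": ["no", "yes", "no", "no", "yes", "no", "no", "yes"],
    "total day minutes": [265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2, 157.0],
    "churn": [False, False, False, True, False, True, False, False],
})
X = toy.drop(columns="churn")
y = toy["churn"]

# 1. Split first, so the test rows play no part in fitting the preprocessing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# 2. Define the preprocessing: scale numeric columns, one-hot encode categorical ones.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["total day minutes"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["international plan"]),
])

# 3. Learn the parameters (means, categories) from the training set only...
X_train_prepared = preprocessor.fit_transform(X_train)
# ...and reuse them, unchanged, on the test set.
X_test_prepared = preprocessor.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)
```

Calling `fit_transform` on the training data and only `transform` on the test data is what keeps test-set statistics out of the preprocessing.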


In [27]:
from sklearn.model_selection import train_test_split
# Selecting the target variable
y = data["churn"]

# Selecting the features
# Categorical Columns
categorical_columns = ['state', 'international plan', 'voice mail plan']

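# Numeric Columns (exclude the target: pandas treats the boolean "churn" column as numeric,
# so without this filter the label would leak into the features)
numeric_columns = [col for col in data.columns
                   if col != "churn" and pd.api.types.is_numeric_dtype(data[col])]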

X_numeric = data[numeric_columns]
X_categorical = data[categorical_columns]

# Concatenate numeric and categorical features
X = pd.concat([X_numeric, X_categorical], axis=1)


X_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
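
If churners turn out to be a minority class, a purely random split can leave the two sets with noticeably different churn rates. One option, which the split above does not use, is the `stratify` argument of `train_test_split`. The sketch below shows its effect on a hypothetical toy dataset with made-up values.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 customers, 5 of whom churned (values are made up).
toy = pd.DataFrame({
    "customer service calls": list(range(20)),
    "churn": [True] * 5 + [False] * 15,
})
X = toy.drop(columns="churn")
y = toy["churn"]

# stratify=y keeps the churn proportion (approximately) equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

Comparing `y_train.mean()` with `y_test.mean()` is a quick way to confirm that both splits reflect the overall churn rate.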
