# Business Understanding

## Introduction
Customer churn, the loss of paying subscribers, is a major challenge for telecommunications companies such as SyriaTel. It reduces revenue, raises customer acquisition costs, and weakens brand loyalty. SyriaTel needs a robust, data-driven approach to tackle churn proactively and improve customer retention.
## Problem Statement
The goal of this project is to use SyriaTel's customer data to create a model for predicting customer churn. SyriaTel will be able to implement targeted retention strategies and reduce customer attrition by using the model to identify customers who are at high risk of churning. 

## Business Objectives
- Forecasting customer churn to lower client attrition.
- Focusing on high-risk consumers for retention efforts.
- Increasing customer lifetime value by focusing on significant revenue contributors.
- Enhancing client experience to increase satisfaction and loyalty.
- Using model insights for customized marketing campaigns to re-engage at-risk customers.
## Data Description
The project will make use of a dataset from SyriaTel's customer relationship management (CRM) system that contains previous customer data. Features like user demographics, account information, call usage trends, and customer service interactions are anticipated to be included in this data.

## Expected Benefits
- Better customer experience: Anticipating problems and understanding customer behavior helps SyriaTel build loyalty and improve the overall customer experience.
- Data-driven decision making: Model insights guide strategic retention initiatives and resource allocation.
- Customized marketing initiatives: Model outputs help tailor campaigns to specific customer segments.
- Proactive customer engagement: Detecting churn risk early lets SyriaTel address customer complaints before those customers leave.



## Data Understanding

### Dataset Overview
The dataset contains historical customer data from SyriaTel's customer relationship management (CRM) system.
Its features cover customer demographics, account information, phone usage patterns, and customer service interactions.

### Structure and Features
- **Features:**
  - State of residence
  - Account length
  - Service plans (e.g., international plan, voice mail plan)
  - Call usage metrics (e.g., total minutes, number of calls)
  - Customer service interactions
  
### Potential Issues
- Missing values: Check the dataset for missing values and decide how to handle any that are found.
- Outliers: Identify outliers in the numerical features and assess whether further action is required.
- Data discrepancies: Check categorical features for errors or inconsistencies that could affect the analysis.
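
The three checks above could be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual procedure: the helper names (`missing_counts`, `iqr_outlier_counts`, `category_levels`) are hypothetical, the 1.5 × IQR rule is only one common outlier convention, and the toy rows are made up to trigger each check.

```python
import pandas as pd

# Toy rows for illustration only: column names follow the dataset,
# but the values are made up so that each check has something to find.
sample = pd.DataFrame({
    "state": ["KS", "OH", "NJ", "oh", "OK", "AL"],
    "international plan": ["no", "no", "yes", "Yes", "no", None],
    "total day minutes": [180.0, 175.5, 190.2, 185.0, 600.0, 178.3],
})


def missing_counts(df):
    """Number of missing entries per column."""
    return df.isnull().sum()


def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


def category_levels(df):
    """Distinct values of each non-numeric column, so odd spellings stand out."""
    text = df.select_dtypes(exclude="number")
    return {col: sorted(text[col].dropna().unique()) for col in text.columns}


print(missing_counts(sample))
print(iqr_outlier_counts(sample))
print(category_levels(sample))
```

On the toy data, the listing of category levels exposes `oh` next to `OH` and `Yes` next to `yes`, which is the kind of inconsistency the third check is meant to catch.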


### Data Preprocessing
Clean the dataset to address missing values, outliers, and inconsistencies, then carry out the required transformations, such as scaling numerical features and encoding categorical variables.
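
One way to combine scaling and encoding is scikit-learn's `ColumnTransformer`. The sketch below is an assumption about how the step might look, not the notebook's final pipeline: the choice of columns is illustrative and the toy rows are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows for illustration only; the values are made up.
sample = pd.DataFrame({
    "international plan": ["no", "yes", "no", "yes"],
    "voice mail plan": ["yes", "no", "no", "yes"],
    "total day minutes": [100.0, 200.0, 300.0, 400.0],
    "customer service calls": [0, 1, 2, 5],
})

# Illustrative column split; the full dataset has more columns of each kind.
numeric_cols = ["total day minutes", "customer service calls"]
categorical_cols = ["international plan", "voice mail plan"]

preprocessor = ColumnTransformer(
    transformers=[
        # Standardize numeric features to zero mean and unit variance.
        ("num", StandardScaler(), numeric_cols),
        # Turn each category level into its own 0/1 indicator column.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)

# Output columns: 2 scaled numeric columns followed by 4 indicator columns.
features = preprocessor.fit_transform(sample)
print(features.shape)
```

In a modeling workflow the transformer would normally be fitted on the training split only and then applied to the test split, so that test-set statistics do not leak into the scaling.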

### Data Exploration

### Overview
The first analysis of the SyriaTel customer churn dataset aimed to understand its properties and potential connections to customer attrition.
Service plans, consumption patterns, and client demographics are all included in the dataset. Examining feature distribution, spotting anomalies, and figuring out possible connections between variables and the target variable were the key goals.
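
As a rough sketch of how relationships with the target might be examined, the helpers below compute the churn rate within each level of a categorical feature and the correlation of each numeric feature with churn. The function names are hypothetical and the toy rows are made up; nothing here reproduces results from the real dataset.

```python
import pandas as pd

# Toy rows for illustration only; the values are made up.
sample = pd.DataFrame({
    "international plan": ["no", "no", "yes", "yes", "no", "yes"],
    "customer service calls": [1, 0, 4, 5, 2, 1],
    "churn": [False, False, True, True, False, False],
})


def churn_rate_by(df, column, target="churn"):
    """Share of churned customers within each level of `column`."""
    return df.groupby(column)[target].mean()


def correlation_with_target(df, target="churn"):
    """Pearson correlation of each numeric feature with the 0/1 target."""
    numeric = df.select_dtypes(include="number")
    return numeric.corrwith(df[target].astype(int)).sort_values(ascending=False)


print(churn_rate_by(sample, "international plan"))
print(correlation_with_target(sample))
```

The boolean `churn` column is excluded from the numeric selection, so it is not correlated with itself.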

##### Importing Libraries 

In [13]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from scipy import stats

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
# Evaluation metrics (single import; accuracy_score was previously imported twice)
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    classification_report, confusion_matrix,
)

import warnings
warnings.filterwarnings("ignore")

##### Loading CSV data

In [14]:
# Load the data from a CSV file
df = pd.read_csv("churn.csv")

# Display the first 10 rows
df.head(10)

| | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |
| 5 | AL | 118 | 510 | 391-8027 | yes | no | 0 | 223.4 | 98 | 37.98 | ... | 101 | 18.75 | 203.9 | 118 | 9.18 | 6.3 | 6 | 1.7 | 0 | False |
| 6 | MA | 121 | 510 | 355-9993 | no | yes | 24 | 218.2 | 88 | 37.09 | ... | 108 | 29.62 | 212.6 | 118 | 9.57 | 7.5 | 7 | 2.03 | 3 | False |
| 7 | MO | 147 | 415 | 329-9001 | yes | no | 0 | 157.0 | 79 | 26.69 | ... | 94 | 8.76 | 211.8 | 96 | 9.53 | 7.1 | 6 | 1.92 | 0 | False |
| 8 | LA | 117 | 408 | 335-4719 | no | no | 0 | 184.5 | 97 | 31.37 | ... | 80 | 29.89 | 215.8 | 90 | 9.71 | 8.7 | 4 | 2.35 | 1 | False |
| 9 | WV | 141 | 415 | 330-8173 | yes | yes | 37 | 258.6 | 84 | 43.96 | ... | 111 | 18.87 | 326.4 | 97 | 14.69 | 11.2 | 5 | 3.02 | 0 | False |


In [18]:
df.describe()

| | account length | area code | number vmail messages | total day minutes | total day calls | total day charge | total eve minutes | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 |
| mean | 101.064806 | 437.182418 | 8.09901 | 179.775098 | 100.435644 | 30.562307 | 200.980348 | 100.114311 | 17.08354 | 200.872037 | 100.107711 | 9.039325 | 10.237294 | 4.479448 | 2.764581 | 1.562856 |
| std | 39.822106 | 42.37129 | 13.688365 | 54.467389 | 20.069084 | 9.259435 | 50.713844 | 19.922625 | 4.310668 | 50.573847 | 19.568609 | 2.275873 | 2.79184 | 2.461214 | 0.753773 | 1.315491 |
| min | 1.0 | 408.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 23.2 | 33.0 | 1.04 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 74.0 | 408.0 | 0.0 | 143.7 | 87.0 | 24.43 | 166.6 | 87.0 | 14.16 | 167.0 | 87.0 | 7.52 | 8.5 | 3.0 | 2.3 | 1.0 |
| 50% | 101.0 | 415.0 | 0.0 | 179.4 | 101.0 | 30.5 | 201.4 | 100.0 | 17.12 | 201.2 | 100.0 | 9.05 | 10.3 | 4.0 | 2.78 | 1.0 |
| 75% | 127.0 | 510.0 | 20.0 | 216.4 | 114.0 | 36.79 | 235.3 | 114.0 | 20.0 | 235.3 | 113.0 | 10.59 | 12.1 | 6.0 | 3.27 | 2.0 |
| max | 243.0 | 510.0 | 51.0 | 350.8 | 165.0 | 59.64 | 363.7 | 170.0 | 30.91 | 395.0 | 175.0 | 17.77 | 20.0 | 20.0 | 5.4 | 9.0 |


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   


The **info()** method was used to examine the structure of the loaded dataset. Each column contains 3333 non-null entries, matching the 3333 rows in the index, which indicates that the dataset has no missing values.


In [20]:
df.isnull().sum()

state                     0
account length            0
area code                 0
phone number              0
international plan        0
voice mail plan           0
number vmail messages     0
total day minutes         0
total day calls           0
total day charge          0
total eve minutes         0
total eve calls           0
total eve charge          0
total night minutes       0
total night calls         0
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
churn                     0
dtype: int64

The **df.isnull().sum()** call was used to verify this further: it returns 0 for every column, confirming that there are no missing values. With nothing to impute, we can move forward with further analysis and modeling.
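
The same check can be written as a guard that fails loudly if missing values ever appear, for example after the data is refreshed. The sketch below is illustrative only: `assert_no_missing` and `placeholder_counts` are hypothetical helpers, and the placeholder list is an assumption. The second helper is there because `isnull()` does not flag text placeholders such as `"?"` or empty strings.

```python
import pandas as pd


def assert_no_missing(df):
    """Raise ValueError naming the affected columns if any value is missing."""
    counts = df.isnull().sum()
    bad = counts[counts > 0]
    if not bad.empty:
        raise ValueError(f"Missing values found: {bad.to_dict()}")
    return True


def placeholder_counts(df, placeholders=("", "?", "NA", "unknown")):
    """Count placeholder strings in non-numeric columns (list is an assumption)."""
    text = df.select_dtypes(exclude="number")
    return text.apply(
        lambda col: col.astype(str).str.strip().isin(list(placeholders)).sum()
    )


# Toy rows for illustration only; the values are made up.
clean = pd.DataFrame({"state": ["KS", "OH"], "total day minutes": [265.1, 161.6]})
print(assert_no_missing(clean))
print(placeholder_counts(pd.DataFrame({"state": ["KS", "?", " "]})))
```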