<a href="https://colab.research.google.com/github/Harshad1025/Harshad1025-Capstone-2-telecom-churn-analysis/blob/main/Telecom_Churn_Analysis_Capstone_EDA_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 - Harshad Thombre**


# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

https://github.com/Harshad1025?tab=repositories

# **Problem Statement**


**Orange S.A., formerly France Télécom S.A., is a French multinational telecommunications corporation. The Orange Telecom's Churn Dataset, consists of cleaned customer activity data (features), along with a churn label specifying whether a customer cancelled the subscription. Explore and analyze the data to discover key factors responsible for customer churn and come up with ways/recommendations to ensure customer retention.**

#### **Define Your Business Objective?**

The primary business objective for exploring and analyzing the Orange Telecom's Churn Dataset is to identify key factors influencing customer churn and to formulate strategic recommendations aimed at improving customer retention. The goal is to gain actionable insights from the dataset, understand the patterns, and propose effective measures to reduce churn rates.

The ultimate goal is to foster a customer-centric approach that not only reduces churn but also enhances overall customer satisfaction. By implementing data-driven strategies, the aim is to create a more resilient and loyal customer base, positively impacting the long-term sustainability and profitability of Orange Telecom.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart follows the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries


import functools
import os
import time
import warnings
import datetime

import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from PIL import Image
import matplotlib.pyplot as plt
%matplotlib inline

warnings.filterwarnings('ignore')


pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)

blue = '\033[94m'
bold = '\033[1m'
italics = '\033[3m'
underline = '\033[4m'
end = '\033[0m'
pretty_print_start = italics+bold+underline+blue
pretty_print_end = italics+bold+underline+blue+end

### Dataset Loading

In [None]:
# Load Dataset
data = pd.read_csv('/content/telecom_churn.csv')

### Dataset First View

In [None]:
# Dataset First Look
data.head()

In [None]:
data.tail()

In [None]:
data.sample(5)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"Dataset Rows count: {data.shape[0]}")
print(f"Dataset Columns count: {data.shape[1]}")

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(20,10))
sns.heatmap(data.isna(), cmap="viridis", cbar_kws={'label': 'Missing Data'})
plt.title('Visualization of Missing Values', fontsize=18)
plt.show()
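
The duplicate and missing-value checks above can also be folded into one small summary table. The sketch below is only an illustration: `toy` is a made-up frame standing in for the churn data, and `missing_summary` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the churn data (not the real dataset).
toy = pd.DataFrame({
    "State": ["KS", "OH", None, "OH"],
    "Total day minutes": [265.1, 161.6, 243.4, 161.6],
    "Churn": [False, False, True, False],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst column first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


summary = missing_summary(toy)
# duplicated() flags every repeat of an earlier row, so the sum is the
# number of rows that would be dropped by drop_duplicates().
n_duplicates = int(toy.duplicated().sum())

print(summary)
print("duplicate rows:", n_duplicates)
```

On the real data the same helper could be called as `missing_summary(data)`; a table is often easier to scan than a heatmap when nothing is missing.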

### What did you know about your dataset?



1. **Size of the Dataset:**
   - The dataset contains 3333 rows and 20 columns.

2. **Columns and Data Types:**
   - The columns include a mix of numerical and categorical data types.
   - There are four main types of data in the dataset: `int64` (integer), `float64` (floating-point), `object` (string or categorical), and `bool` (boolean).

3. **Columns and Their Meanings:**
   - The dataset includes columns such as 'State,' 'Account length,' 'Area code,' 'International plan,' 'Voice mail plan,' 'Number vmail messages,' and various columns related to the total minutes, calls, and charges for day, evening, night, and international calls.

4. **Target Variable:**
   - The 'Churn' column appears to be the target variable, indicating whether a customer has canceled the subscription. It is of boolean type (`bool`), with values of either `True` or `False`.

5. **Memory Usage:**
   - The memory usage of the dataset is approximately 498.1 KB, indicating the dataset is reasonably sized and should be manageable for analysis.

6. **Potential Features:**
   - There are features like 'International plan,' 'Voice mail plan,' and 'Customer service calls' that may be crucial for understanding customer behavior and predicting churn.

7. **Data Quality:**
   - No missing values are reported in the information, suggesting that the dataset is relatively clean. However, further exploration may reveal outliers or anomalies that need attention.
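
The dtype and memory figures listed above come from `data.info()`; they can also be read programmatically. This is a minimal sketch on a made-up four-column frame (`toy` is hypothetical), so the numbers it prints are not those of the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the churn data: one column per dtype family.
toy = pd.DataFrame({
    "State": ["KS", "OH"],
    "Account length": [128, 107],
    "Total day minutes": [265.1, 161.6],
    "Churn": [False, True],
})

# Count how many columns share each dtype (dtype names rendered as strings).
dtype_counts = toy.dtypes.astype(str).value_counts()

# deep=True also measures the Python strings held in text columns,
# which the default shallow estimate leaves out.
memory_bytes = int(toy.memory_usage(deep=True).sum())

print(dtype_counts)
print(f"memory usage: {memory_bytes / 1024:.2f} KB")
```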



## ***2. Understanding Your Variables***

In [None]:
# List the text and boolean columns to confirm their dtypes are correct
data.select_dtypes(include=['object','bool']).columns.tolist()

In [None]:
# List the numeric columns to confirm their dtypes are correct
data.select_dtypes(include=['number']).columns.tolist()

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe()

In order to see statistics on non-numerical features, one has to explicitly indicate data types of interest in the include parameter.

In [None]:
# Descriptive statistics on non-numerical features
data.describe(include=["object", "bool"])

### Variables Description



1. **State (Object):**
   - The state in which the customer resides.

2. **Account length (Integer):**
   - The number of days the customer has been an account holder.

3. **Area code (Integer):**
   - The three-digit area code corresponding to the customer's phone number.

4. **International plan (Object):**
   - Whether the customer has an international calling plan (Yes/No).

5. **Voice mail plan (Object):**
   - Whether the customer has a voice mail plan (Yes/No).

6. **Number vmail messages (Integer):**
   - The number of voice mail messages received by the customer.

7. **Total day minutes (Float):**
   - The total number of minutes the customer used during the day.

8. **Total day calls (Integer):**
   - The total number of calls made by the customer during the day.

9. **Total day charge (Float):**
   - The total charge incurred by the customer for day calls.

10. **Total eve minutes (Float):**
    - The total number of minutes the customer used during the evening.

11. **Total eve calls (Integer):**
    - The total number of calls made by the customer during the evening.

12. **Total eve charge (Float):**
    - The total charge incurred by the customer for evening calls.

13. **Total night minutes (Float):**
    - The total number of minutes the customer used during the night.

14. **Total night calls (Integer):**
    - The total number of calls made by the customer during the night.

15. **Total night charge (Float):**
    - The total charge incurred by the customer for night calls.

16. **Total intl minutes (Float):**
    - The total number of international minutes used by the customer.

17. **Total intl calls (Integer):**
    - The total number of international calls made by the customer.

18. **Total intl charge (Float):**
    - The total charge incurred by the customer for international calls.

19. **Customer service calls (Integer):**
    - The number of customer service calls made by the customer.

20. **Churn (Boolean):**
    - The target variable indicating whether the customer has canceled the subscription (True/False).



### Check Unique Values for each variable.

In [None]:
print("Number of unique values ")
for column in data.columns:
    unique_values = len(data[column].unique())
    print(f"{column}: {unique_values}")

In [None]:
data.nunique()

In [None]:
# Show all unique values for the text and boolean columns
for col in data.select_dtypes(include=['object','bool']):
    print(f"Feature name: {col}\nValues: {data[col].unique()}\n")

In [None]:
columns_to_exclude=[
    'Account length',
    'Number vmail messages',
    'Total day minutes',
    'Total day calls',
    'Total day charge',
    'Total eve minutes',
    'Total eve calls',
    'Total eve charge',
    'Total night minutes',
    'Total night calls',
    'Total night charge' ,
    'Total intl minutes',
    'Total intl calls',
    'Total intl charge']

In [None]:
# Check Unique Values for each variable
for column in data.columns:
    if column not in columns_to_exclude:
        unique_values = data[column].unique()
        print(f"Unique values for {column}:", unique_values)
        print("-" * 50)



## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# created a copy of original dataframe
df = data.copy()

In [None]:
# 1) corrected column names
df.columns = df.columns.str.strip().str.lower().str.replace(" ","_" )

In [None]:
df.columns


In [None]:
# 2) Adding new columns (currently commented out)

# # Totals [calls, minutes, charge] excluding international calls
# df["no_inter_total_calls"]=df['total_day_calls']+df['total_eve_calls']+df['total_night_calls']
# df["no_inter_total_revenue"]=df['total_day_charge']+df['total_eve_charge']+df['total_night_charge']
# df["no_inter_total_minutes"]=df['total_day_minutes']+df['total_eve_minutes']+df['total_night_minutes']

# # Totals [calls, minutes, charge] including international calls
# df["total_calls"]=df['total_day_calls']+df['total_eve_calls']+df['total_night_calls']+df['total_intl_calls']
# df["total_revenue"]=df['total_day_charge']+df['total_eve_charge']+df['total_night_charge']+df['total_intl_charge']
# df["total_minutes"]=df['total_day_minutes']+df['total_eve_minutes']+df['total_night_minutes']+df['total_intl_minutes']
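
If the total columns are wanted, they can be built in a loop rather than six hand-written sums. The sketch below assumes the cleaned column names produced in step 1 and reuses the output names from the commented-out draft; `toy` is a hypothetical two-row frame, not the real data.

```python
import pandas as pd

# Hypothetical two-row frame using the cleaned column names.
toy = pd.DataFrame({
    "total_day_calls": [110, 123], "total_eve_calls": [99, 103],
    "total_night_calls": [91, 103], "total_intl_calls": [3, 3],
    "total_day_minutes": [265.1, 161.6], "total_eve_minutes": [197.4, 195.5],
    "total_night_minutes": [244.7, 254.4], "total_intl_minutes": [10.0, 13.7],
    "total_day_charge": [45.07, 27.47], "total_eve_charge": [16.78, 16.62],
    "total_night_charge": [11.01, 11.45], "total_intl_charge": [2.70, 3.70],
})

periods = ["day", "eve", "night"]
# Source measure -> suffix used in the draft's output column names.
output_suffix = {"calls": "calls", "minutes": "minutes", "charge": "revenue"}

for measure, suffix in output_suffix.items():
    domestic_cols = [f"total_{period}_{measure}" for period in periods]
    # Row-wise sum over day, evening and night only.
    toy[f"no_inter_total_{suffix}"] = toy[domestic_cols].sum(axis=1)
    # Add the international figure on top of the domestic total.
    toy[f"total_{suffix}"] = (
        toy[f"no_inter_total_{suffix}"] + toy[f"total_intl_{measure}"]
    )

print(toy.filter(like="total_revenue"))
```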

In [None]:
# 3) Label-encode the international_plan, voice_mail_plan and churn columns as 1/0
df['international_plan'] = df['international_plan'].replace({'Yes': 1, 'No': 0})
df['voice_mail_plan'] = df['voice_mail_plan'].replace({'Yes': 1, 'No': 0})
df['churn'] = df['churn'].replace({True: 1, False: 0})
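
One thing `replace` does not do is complain when a column contains a label outside the mapping; such values simply pass through unchanged. A stricter variant, sketched here under the assumption that the plan columns only ever hold `'Yes'` and `'No'`, uses `map` and raises on anything unexpected. `encode_binary` and `toy` are hypothetical names for illustration.

```python
import pandas as pd

# Hypothetical toy frame with the cleaned column names.
toy = pd.DataFrame({
    "international_plan": ["No", "Yes", "No"],
    "voice_mail_plan": ["Yes", "No", "No"],
    "churn": [False, True, False],
})


def encode_binary(series: pd.Series, mapping: dict) -> pd.Series:
    """Map a two-valued column to integers, failing loudly on unknown labels."""
    encoded = series.map(mapping)
    if encoded.isna().any():
        # map() turns labels missing from the mapping into NaN.
        unexpected = sorted(set(series[encoded.isna()].astype(str)))
        raise ValueError(f"Unmapped values in {series.name!r}: {unexpected}")
    return encoded.astype("int64")


yes_no = {"Yes": 1, "No": 0}
toy["international_plan"] = encode_binary(toy["international_plan"], yes_no)
toy["voice_mail_plan"] = encode_binary(toy["voice_mail_plan"], yes_no)
# A boolean column converts directly: True -> 1, False -> 0.
toy["churn"] = toy["churn"].astype("int64")

print(toy)
```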


In [None]:
# converted from int data type to category [area_code column]
df['area_code'] =df['area_code'].astype('category')

In [None]:
#  4) full state names
state_dict={
    'KS': 'Kansas',
    'OH': 'Ohio',
    'NJ': 'New Jersey',
    'OK': 'Oklahoma',
    'AL': 'Alabama',
    'MA': 'Massachusetts',
    'MO': 'Missouri',
    'LA': 'Louisiana',
    'WV': 'West Virginia',
    'IN': 'Indiana',
    'RI': 'Rhode Island',
    'IA': 'Iowa',
    'MT': 'Montana',
    'NY': 'New York',
    'ID': 'Idaho',
    'VT': 'Vermont',
    'VA': 'Virginia',
    'TX': 'Texas',
    'FL': 'Florida',
    'CO': 'Colorado',
    'AZ': 'Arizona',
    'SC': 'South Carolina',
    'NE': 'Nebraska',
    'WY': 'Wyoming',
    'HI': 'Hawaii',
    'IL': 'Illinois',
    'NH': 'New Hampshire',
    'GA': 'Georgia',
    'AK': 'Alaska',
    'MD': 'Maryland',
    'AR': 'Arkansas',
    'WI': 'Wisconsin',
    'OR': 'Oregon',
    'MI': 'Michigan',
    'DE': 'Delaware',
    'UT': 'Utah',
    'CA': 'California',
    'MN': 'Minnesota',
    'SD': 'South Dakota',
    'NC': 'North Carolina',
    'WA': 'Washington',
    'NM': 'New Mexico',
    'NV': 'Nevada',
    'DC': 'District of Columbia',
    'KY': 'Kentucky',
    'ME': 'Maine',
    'MS': 'Mississippi',
    'TN': 'Tennessee',
    'PA': 'Pennsylvania',
    'CT': 'Connecticut',
    'ND': 'North Dakota'
    }

df['state'] = df['state'].replace(state_dict)


In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

### What all manipulations have you done and insights you found?

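1) Corrected the column names (stripped whitespace, lower-cased, spaces replaced with underscores). <br>

2) Drafted aggregate total columns (calls, minutes, charge); the code is currently commented out. <br>

3) Label-encoded the international_plan, voice_mail_plan and churn columns as 1/0. <br>

4) Converted area_code from integer to a categorical column. <br>

5) Replaced state abbreviations with full state names, which are more readable for anyone unfamiliar with the abbreviations.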

### Analysis

#### Cost per minute for non-international calls

In [None]:
# Define the total cost associated with non-international minutes
# total_cost = df["no_inter_total_revenue"].sum()

# # Convert total duration to minutes
# total_duration_minutes = df["no_inter_total_minutes"].sum()

# # Calculate the cost per minute
# cost_per_minute = total_cost / total_duration_minutes

# # Print the result
# print(f"Cost per Minute for Non-International Minutes: ${cost_per_minute:.2f} per minute")

#### Revenue generated excluding international call charges

In [None]:
# revenue_without_inter = df["no_inter_total_revenue"].sum()
# revenue_without_inter

#### Revenue generated including international call charges

In [None]:
# total_revenue = df["total_revenue"].sum()
# total_revenue

How much revenue is generated from international calls by customers who have an international plan?

And how much revenue is generated from international calls by customers who don't have an international plan?

In [None]:
# # Calculate total cost of international calls for people with an international plan
# total_cost_intl_plan = df.loc[df['international_plan'] == 'Yes', 'total_intl_charge'].sum()

# # Calculate total cost of international calls for people without an international plan
# total_cost_no_intl_plan = df.loc[df['international_plan'] == 'No', 'total_intl_charge'].sum()

# print(f"Profit generated from international calls for people with international plan: {total_cost_intl_plan:.2f}")
# print(f"Revenue generated from international calls for people without international plan: {total_cost_no_intl_plan:.2f}")

**It looks like the cost per minute for international calls does not change between customers with and without the plan.**

In [None]:
# # Calculate total number of international minutes for people with an international plan
# total_minutes_intl_plan = df.loc[df['international_plan'] == 'Yes', 'total_intl_minutes'].sum()

# # Calculate total number of international minutes for people without an international plan
# total_minutes_no_intl_plan = df.loc[df['international_plan'] == 'No', 'total_intl_minutes'].sum()

# # Calculate cost per minute for people with an international plan
# cost_per_minute_intl_plan = total_cost_intl_plan / total_minutes_intl_plan

# # Calculate cost per minute for people without an international plan
# cost_per_minute_no_intl_plan = total_cost_no_intl_plan / total_minutes_no_intl_plan

# print(f"Cost per minute for people with an international plan: ${cost_per_minute_intl_plan:.2f}")
# print(f"Cost per minute for people without an international plan: ${cost_per_minute_no_intl_plan:.2f}")
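
The per-plan comparison can be expressed as a single `groupby`. Note that by this point `international_plan` has been label-encoded to 1/0, so a filter on `'Yes'`/`'No'` would match nothing; grouping on the column avoids that issue. The frame below is a hypothetical four-row example with made-up values, so its output says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows: plan flag (1 = has plan), international minutes and charge.
toy = pd.DataFrame({
    "international_plan": [0, 1, 0, 1],
    "total_intl_minutes": [10.0, 13.7, 12.2, 6.6],
    "total_intl_charge": [2.70, 3.70, 3.29, 1.78],
})

# Sum charge and minutes per plan group, then divide to get a per-minute rate.
totals = toy.groupby("international_plan")[
    ["total_intl_charge", "total_intl_minutes"]
].sum()
totals["cost_per_minute"] = (
    totals["total_intl_charge"] / totals["total_intl_minutes"]
)

print(totals.round(3))
```

Dividing the summed charge by the summed minutes gives a usage-weighted rate, which is less sensitive to customers with very few minutes than averaging per-customer ratios.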

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# **Univariate analysis**

In [None]:
numerical_columns = df.select_dtypes(include=['number']).columns.tolist()
categorical_columns = df.select_dtypes(exclude=['number']).columns.tolist()

# Print the lists of numerical and categorical columns
print("Numerical Columns:", numerical_columns)
print("Categorical Columns:", categorical_columns)

## **Categorical columns**

In [None]:
categorical_col = [ 'area_code', 'international_plan', 'voice_mail_plan', 'churn']
total_count = len(df)

for column in categorical_columns:
    print(f"Column: {column}")
    value_counts = df[column].value_counts()
    percentage_values = (value_counts / total_count) * 100

    for value, count in value_counts.items():
        percentage = percentage_values[value]
        print(f"{value}: Count = {count}, Percentage = {percentage:.2f}%")

    print("\n")
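
The count-and-percentage loop above can be condensed into a small table per column by letting pandas compute the shares with `value_counts(normalize=True)`. This is a sketch on a hypothetical eight-row `area_code` column; `count_and_share` is an illustrative helper name.

```python
import pandas as pd

# Hypothetical area codes, stored as a category like in the wrangling step.
toy = pd.DataFrame({"area_code": [415, 415, 408, 510, 415, 408, 415, 510]})
toy["area_code"] = toy["area_code"].astype("category")


def count_and_share(series: pd.Series) -> pd.DataFrame:
    """Return the count and percentage share of each category."""
    counts = series.value_counts()
    # normalize=True returns fractions that sum to 1; scale them to percent.
    percent = series.value_counts(normalize=True).mul(100).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})


table = count_and_share(toy["area_code"])
print(table)
```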

### 1) Count plot

In [None]:
# Distribution of categorical columns using count plots
for column in categorical_col:
    plt.figure(figsize=(9, 7))
    ax = sns.countplot(data=df, x=column, palette='Set2')

    # Adding annotations on top of the bars
    for p in ax.patches:
        ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='center', xytext=(0, 10), textcoords='offset points')

    plt.xlabel(column)
    plt.ylabel('Count')
    plt.title(f'Count Plot of {column}')
    plt.xticks(rotation=90)


    # Show the plot
    plt.show()

    # Close the plot to avoid overlapping when creating the next plot
    plt.close()


For better visuals, separate plots were created for each feature.

In [None]:

plt.figure(figsize=(5, 4))
sns.countplot(data=df, x='area_code', palette='Set2')
plt.xlabel('Area Code')
plt.ylabel('Count')
plt.title(f'Count Plot of area_code')
plt.xticks(rotation=90)
plt.show()
plt.close()

In [None]:
plt.figure(figsize=(5, 4))
sns.countplot(data=df, x='international_plan', palette='Set2')
plt.xlabel('international_plan')
plt.ylabel('Count')
plt.title(f'Count Plot of international_plan')
plt.xticks(rotation=90)
plt.show()
plt.close()

In [None]:
plt.figure(figsize=(5, 4))
sns.countplot(data=df, x='voice_mail_plan', palette='Set2')
plt.xlabel('voice_mail_plan')
plt.ylabel('Count')
plt.title(f'Count Plot of voice_mail_plan')
plt.xticks(rotation=90)
plt.show()
plt.close()

In [None]:
plt.figure(figsize=(10, 5))
sns.countplot(data=df, x='state')
plt.xlabel('state')
plt.ylabel('Count')
plt.title('Count Plot of state')
plt.xticks(rotation=90)
plt.show()
plt.close()

##### 1. Why did you pick the specific chart?

**countplot** is suitable for visualizing categorical data as it efficiently displays the count of each category, offering a quick overview of the distribution of categorical variables.

##### 2. What is/are the insight(s) found from the chart?



1) The first chart shows that users are concentrated in area code 415, with roughly equal numbers in area codes 510 and 408.

2) The second chart shows that the majority of users, around 90%, do not have an international plan, indicating low adoption of this service.

3) According to the third chart, approximately 27.66% of users have opted for the voicemail plan, suggesting a moderate level of interest in this feature.

4) The fourth chart shows that West Virginia has the highest customer count, while California has the lowest of all the states.

##### 3. Will the gained insights help creating a positive business impact?

##### Are there any insights that lead to negative growth? Justify with specific reason.

1) **Concentration in Area Code 415:**
   - *Positive Impact:* Enables targeted marketing and resource allocation.
   - *Negative Growth:* Unlikely, as concentrating resources strategically can be positive.

2) **Low Adoption of International Plan:**
   - *Positive Impact:* Identifies opportunities for promoting international plans.
   - *Negative Growth:* Possible if unaddressed, as it indicates missed revenue opportunities.

3) **Moderate Voicemail Plan Adoption:**
   - *Positive Impact:* Allows for targeted promotions and improved customer engagement.
   - *Negative Growth:* Potential if adoption remains stagnant and the plan is a significant revenue driver.

4) **State-wise Customer Distribution:**
   - *Positive Impact:* Informs tailored strategies for marketing and service improvements.
   - *Negative Growth:* Unlikely from the insight itself, but not addressing low customer counts in certain states could hinder business expansion.

### 2) Donut Chart

In [None]:
churn_percentage = (df['churn'].value_counts() / df.shape[0] * 100).round(2).astype(str) + '%'
churn_percentage

In [None]:
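# Use a dedicated name so the original `data` DataFrame is not overwritten
churn_counts = df['churn'].value_counts()
# Derive the labels from the index so they always match the counts
labels = list(churn_counts.index.map({0: 'Not Churned', 1: 'Churned'}))

plt.pie(churn_counts.values, labels=labels, autopct='%1.1f%%', startangle=90)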
centre_circle = plt.Circle((0, 0), 0.50, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

plt.axis('equal')
plt.legend(loc='upper left')
plt.title('Churn Distribution (Donut Chart)')
plt.show()



##### 1. Why did you pick the specific chart?

I chose a donut chart to visually represent the Churn Distribution due to its aesthetic appeal and clear depiction of the proportion of Churned and Not Churned categories. The central hole enhances visual engagement while maintaining data clarity, making it an effective choice for this categorical analysis.

##### 2. What is/are the insight(s) found from the chart?

The donut chart provides a visual insight into the churn distribution, revealing that approximately 14.5% of customers have churned. This indicates the proportion of customers who ended their subscription, giving a quick and clear understanding of the overall churn rate in the dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


- *Positive Impact:* The donut chart's insight into a 14.5% churn rate provides a quick and clear understanding of the overall churn in the dataset. This information can guide strategic efforts to improve customer retention and loyalty.
- *Negative Growth:* If the churn rate is considered high for the industry or business standards, it may lead to negative growth, indicating a potential problem in customer satisfaction and retention strategies.

The churn distribution can positively impact business decisions by highlighting the need for effective retention strategies. Conversely, a high churn rate may signal negative growth potential if not addressed promptly.
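
Because `churn` is encoded as 1/0, the churn rate shown in the chart is simply the mean of the column. A minimal sketch on a hypothetical ten-customer column (the values are made up, so the 20% it yields is not the dataset's rate):

```python
import pandas as pd

# Hypothetical encoded churn flags: 1 = churned, 0 = retained.
toy = pd.DataFrame({"churn": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]})

# The mean of a 0/1 column is the fraction of ones.
churn_rate = toy["churn"].mean()

# Ratio of retained to churned customers, a quick gauge of class imbalance.
retained_to_churned = (toy["churn"] == 0).sum() / (toy["churn"] == 1).sum()

print(f"churn rate: {churn_rate:.1%}")
print(f"retained per churned customer: {retained_to_churned:.1f}")
```

The imbalance ratio matters for later bivariate charts: raw counts will be dominated by retained customers, so per-group churn rates are usually more informative than per-group counts.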

## **Numerical data**

In [None]:
numerical_columns

### 3) Box plot

#### a)

In [None]:
plt.figure(figsize=(14, 6))

plt.subplot(1, 3, 1)
sns.boxplot(data=df, y='total_day_calls', color='lightblue')
plt.ylabel('Total Day Calls', fontsize=12)
plt.ylim(bottom=0)
# plt.title('Boxplot of Total Day Calls')

plt.subplot(1, 3, 2)
sns.boxplot(data=df, y='total_eve_calls', color='lightgreen')
plt.ylabel('Total Eve Calls', fontsize=12)
plt.ylim(bottom=0)
# plt.title('Boxplot of Total Eve Calls')

plt.subplot(1, 3, 3)
sns.boxplot(data=df, y='total_night_calls', color='orange')
plt.ylabel('Total Night Calls', fontsize=12)
plt.ylim(bottom=0)
# plt.title('Boxplot of Total Night Calls')

plt.suptitle('Call Distribution Across Different Times of Day', fontsize=16)  # Overall title

plt.subplots_adjust(wspace=0.4)
# plt.grid()
plt.show()



##### 1. Why did you pick the specific chart?

I chose boxplots for 'total_day_calls,' 'total_eve_calls,' and 'total_night_calls' because they effectively illustrate the distribution of numerical data, showcasing key statistics like median and quartiles. This visualization choice allows for a clear comparison of call distributions across different times of the day. Boxplots are particularly suitable for displaying variations in multiple categories, making it easy to observe and compare data distribution characteristics.
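
The statistics a boxplot draws can be computed directly, which helps when reading values off the figure is imprecise. The sketch below uses the conventional 1.5 × IQR fence rule (seaborn's default whisker setting) on a hypothetical nine-value series; `boxplot_stats` is an illustrative helper, not a library function.

```python
import pandas as pd


def boxplot_stats(series: pd.Series, whisker: float = 1.5) -> dict:
    """Compute the quartiles, fences and outlier count behind a boxplot."""
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    # Points beyond the fences are the ones drawn as individual markers.
    outliers = series[(series < lower_fence) | (series > upper_fence)]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "n_outliers": len(outliers),
    }


# Hypothetical call counts, not taken from the dataset.
calls = pd.Series([88, 95, 97, 100, 101, 103, 105, 110, 165],
                  name="total_day_calls")
stats = boxplot_stats(calls)
print(stats)
```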

##### 2. What is/are the insight(s) found from the chart?

From the plots above, more calls are made during the day than in the evening or at night, with the night period being the lowest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Daytime Call Peak:** The plots show that more calls are made during the day than in the evening or at night.

**Impact on Business:**
- *Positive Impact:* Knowing about the daytime peak helps the business plan staff schedules and promotions, which can improve customer satisfaction.

- *Negative Growth:* Ignoring the busy daytime period might mean missing chances to improve customer satisfaction and could lead to slower growth for the business.

#### b)


In [None]:
plt.figure(figsize=(14, 6))

plt.subplot(1, 3, 1)
sns.boxplot(data=df, y='total_day_minutes', color='lightblue')
plt.ylabel('Total Day Minutes', fontsize=12)
plt.ylim(bottom=0)


plt.subplot(1, 3, 2)
sns.boxplot(data=df, y='total_eve_minutes', color='lightgreen')
plt.ylabel('Total Eve Minutes', fontsize=12)
plt.ylim(bottom=0)


plt.subplot(1, 3, 3)
sns.boxplot(data=df, y='total_night_minutes', color='orange')
plt.ylabel('Total Night Minutes', fontsize=12)
plt.ylim(bottom=0)

plt.suptitle('Call Duration Distribution Across Different Times of Day', fontsize=16)  # Overall title

plt.subplots_adjust(wspace=0.4)

plt.show()


##### 1. Why did you pick the specific chart?

I chose boxplots for 'total_day_minutes,' 'total_eve_minutes,' and 'total_night_minutes' because they effectively showcase the distribution of call duration, highlighting key statistics like median and quartiles. This visualization choice allows for a clear comparison of call duration across different times of the day. Boxplots are especially suitable for revealing variations in multiple categories, providing a concise summary of the data distribution.

##### 2. What is/are the insight(s) found from the chart?

* Although more calls are made during the day, users do not spend more time on daytime calls.
* Users tend to talk longer in the evening than at other times.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Daytime Call Duration:** Although more calls are made during the day, users do not spend longer on calls during this time.

**Evening Call Duration:** Users tend to talk for a longer time in the evening compared to other times.

**Impact on Business:**
- **Positive Impact:** Recognizing that users talk longer in the evening can inform targeted promotions or service enhancements during peak call duration, potentially increasing customer satisfaction.

- **Negative Growth:** If the business fails to leverage the insight about shorter daytime calls and longer evening calls, it may miss opportunities to tailor services or promotions, potentially leading to stagnation in customer engagement and business growth. Ignoring patterns in call duration might result in misaligned strategies and lower satisfaction during peak usage times.

#### c)


In [None]:
# Chart - 4 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(14, 6))

plt.subplot(1, 3, 1)
sns.boxplot(data=df, y='total_day_charge', color='lightblue')
plt.ylabel('Total Day Charge', fontsize=12)
plt.ylim(bottom=0)


plt.subplot(1, 3, 2)
sns.boxplot(data=df, y='total_eve_charge', color='lightgreen')
plt.ylabel('Total Eve Charge', fontsize=12)
plt.ylim(bottom=0)

plt.subplot(1, 3, 3)
sns.boxplot(data=df, y='total_night_charge', color='orange')
plt.ylabel('Total Night Charge', fontsize=12)
plt.ylim(bottom=0)

plt.suptitle('Charge Distribution Across Different Times of Day', fontsize=16)  # Overall title

plt.subplots_adjust(wspace=0.4)

plt.show()


##### 1. Why did you pick the specific chart?

I chose boxplots for 'total_day_charge,' 'total_eve_charge,' and 'total_night_charge' because they effectively visualize the distribution of call charges, emphasizing key statistics like median and quartiles. This choice allows for a clear comparison of charge distribution across different times of the day. Boxplots succinctly summarize the variability in multiple categories, offering insights into how charges vary during specific periods, which is essential for understanding customer behavior and optimizing pricing strategies.

##### 2. What is/are the insight(s) found from the chart?

Reading the three panels, charges appear highest in the evening and lowest at night, in line with the minutes spoken.
Because each panel has its own y-axis, this comparison is worth confirming against `df.describe()`.
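
Since the three boxplots are drawn in separate panels, a side-by-side table of medians is a simple way to check which period ranks highest for each measure. The sketch below runs on hypothetical values chosen only to show that minutes and charges need not rank the periods the same way; it makes no claim about the real data.

```python
import pandas as pd

# Hypothetical per-customer minutes and charges for each period.
toy = pd.DataFrame({
    "total_day_minutes": [180.0, 200.0, 160.0],
    "total_eve_minutes": [200.0, 210.0, 190.0],
    "total_night_minutes": [195.0, 205.0, 185.0],
    "total_day_charge": [30.6, 34.0, 27.2],
    "total_eve_charge": [17.0, 17.85, 16.15],
    "total_night_charge": [8.78, 9.23, 8.33],
})

periods = ["day", "eve", "night"]
# One row per period, one column per measure, each cell a median.
comparison = pd.DataFrame(
    {
        measure: [toy[f"total_{p}_{measure}"].median() for p in periods]
        for measure in ["minutes", "charge"]
    },
    index=periods,
)
# For each measure, the period with the largest median.
highest = comparison.idxmax()

print(comparison)
print(highest)
```

On the real frame, replacing `toy` with `df` would give the medians to compare against the reading of the plots.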

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


1. **Evening Charges Peak:** Charges are highest in the evening, correlating with the longer call durations during that time.
2. **Night Charges Lowest:** Charges are lowest at night, aligning with shorter call durations during this period.

**Impact on Business:**
- *Positive Impact:* Recognizing the evening charges peak allows the business to optimize pricing strategies or introduce targeted promotions during high-charge periods, potentially increasing revenue. Aligning charges with usage patterns can enhance customer satisfaction.

- *Negative Growth:* If the business fails to align pricing strategies with usage patterns, especially during peak evening hours, it may miss opportunities to maximize revenue. Neglecting these insights could lead to misaligned pricing and potential dissatisfaction among customers, impacting long-term growth prospects.

### 4) Histplot

In [None]:
df.columns

#### a)

In [None]:
# Chart - 5 visualization code

plt.figure(figsize=(10, 6))


sns.histplot(data=df, x='account_length', bins=20, kde=True, color='#5DADE2', edgecolor='black')


plt.axvline(df['account_length'].mean(), color='red', linestyle='dashed', linewidth=2, label='Mean')

plt.xlabel('Account Length')
plt.ylabel('Frequency')
plt.title('Distribution of Account Length')
plt.legend()  # Show legend with the mean line

sns.despine()

plt.show()
plt.close()


##### 1. Why did you pick the specific chart?

I chose a histogram with a kernel density estimate (KDE) for 'account_length' to visually represent the distribution of customer account durations. This choice effectively highlights the central tendency, spread, and common durations, providing a comprehensive overview. The dashed red line indicating the mean adds clarity to the average account length, facilitating insights into customer tenure patterns.

##### 2. What is/are the insight(s) found from the chart?

**Insight from the Chart:**

- **Central Tendency:** The dashed red line represents the mean account length, which is approximately 101.06. This indicates the average duration for which customers have held their accounts.

- **Spread of Data:** The histogram and kernel density estimate (KDE) showcase the distribution of account lengths. The spread of data is evident, ranging from a minimum of 1 day to a maximum of 243 days.

- **Common Durations:** The quartiles (25th, 50th, and 75th percentiles) provide insights into common durations. For instance, 50% of customers have an account length between 74 and 127 days, emphasizing the concentration of account durations within this range.

Overall, the chart provides a visual summary of account length distribution, highlighting the central tendency and variability in customer account durations.
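
The mean, quartiles and the "middle 50%" statement can be reproduced with `quantile` and `between`. The series below is a hypothetical evenly spread range of account lengths, used only so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical account lengths: every whole number of days from 1 to 243.
account_length = pd.Series(range(1, 244), name="account_length")

mean_length = account_length.mean()
quartiles = account_length.quantile([0.25, 0.5, 0.75])

# Share of customers whose account length lies between Q1 and Q3 (inclusive);
# by construction of the quartiles this should be close to one half.
share_in_iqr = account_length.between(
    quartiles.loc[0.25], quartiles.loc[0.75]
).mean()

print(f"mean: {mean_length:.2f}")
print(quartiles)
print(f"share between Q1 and Q3: {share_in_iqr:.1%}")
```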


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**
Understanding the average account length guides tailored retention strategies, enabling the identification of loyal customers and opportunities for exclusive loyalty programs. Recognizing common durations informs targeted promotions for diverse customer segments, allowing for personalized engagement strategies.

**Potential Negative Impact:**
Neglecting variability in account lengths may lead to generalized strategies, missing opportunities for personalized engagement with loyal customers and negatively affecting customer satisfaction and business growth.

#### b)

In [None]:
# Chart - 6 visualization code

sns.set(style="whitegrid")

plt.figure(figsize=(10, 5))
sns.histplot(data=df, x='number_vmail_messages', bins=20, kde=True, color='skyblue', edgecolor='black')


plt.xlabel('Number of Voicemail Messages', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Voicemail Messages', fontsize=14)

plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.yscale('log')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a histogram with a kernel density estimate (KDE) to show how the number of voicemail messages is distributed across customers. The y-axis uses a logarithmic scale so that the much smaller bins remain visible next to the large spike near zero. This makes it easy to see that many customers hardly use voicemail, while a few use it a lot.

##### 2. What is/are the insight(s) found from the chart?

A substantial number of customers show near-zero voicemail usage, forming a distinct peak, while the distribution for the remaining customers follows a more typical, bell-shaped pattern. This suggests that a notable segment of users either does not use voicemail or has it disabled, which presents an opportunity for targeted engagement or service customization for this group.
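The size of the near-zero peak can be quantified instead of estimated from the plot. This is a rough sketch: `zero_usage_share` is a hypothetical helper, and the `toy` values are made up for illustration rather than taken from the dataset.

```python
import pandas as pd


def zero_usage_share(series: pd.Series) -> float:
    """Fraction of rows whose value is exactly zero."""
    return float((series == 0).mean())


# Made-up counts for illustration; in the notebook, pass df['number_vmail_messages'].
toy = pd.DataFrame({"number_vmail_messages": [0, 0, 0, 25, 0, 31, 0, 28, 0, 0]})
share = zero_usage_share(toy["number_vmail_messages"])
print(f"Share of customers with no voicemail messages: {share:.0%}")

# Describe only the customers who do use voicemail (the bell-shaped part).
active = toy.loc[toy["number_vmail_messages"] > 0, "number_vmail_messages"]
print(active.describe())
```

Summarizing the non-zero group separately keeps the spike at zero from distorting the mean and quartiles of actual voicemail users.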

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


1. **Distinct Voicemail Peak:** A substantial number of clients show near-zero voicemail usage, forming a distinct peak in the distribution.
2. **Typical Bell-Shaped Pattern:** The distribution for the remaining clients follows a more typical, bell-shaped pattern.

**Impact on Business:**
- *Positive Impact:* Recognizing the significant segment with minimal voicemail usage presents an opportunity for targeted engagement or customized services. This insight can guide personalized marketing strategies or feature enhancements, potentially increasing user satisfaction and loyalty.

- *Negative Growth:* Ignoring the distinct voicemail peak might result in missed opportunities for engagement and customization for a notable user segment. Failing to address the specific needs or preferences of this group could lead to dissatisfaction and potential attrition, impacting long-term business growth.

#### c)

In [None]:

sns.set(style="whitegrid")
sns.set_palette("pastel")

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# Plot without log scale
sns.histplot(data=df, x='customer_service_calls', bins=20, kde=True, color='skyblue', edgecolor='black', ax=axes[0])
axes[0].set_xlabel('Number of Customer Service Calls', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Distribution of Customer Service Calls', fontsize=14)
axes[0].grid(axis='y', linestyle='--', alpha=0.7)

# Plot with log scale
sns.histplot(data=df, x='customer_service_calls', bins=20, kde=True, color='skyblue', edgecolor='black', ax=axes[1])
axes[1].set_xlabel('Number of Customer Service Calls', fontsize=12)
axes[1].set_ylabel('Frequency (Log Scale)', fontsize=12)
axes[1].set_title('Distribution of Customer Service Calls (Log Scale)', fontsize=14)
axes[1].grid(axis='y', linestyle='--', alpha=0.7)
axes[1].set_yscale('log')


plt.tight_layout()

plt.show()


In [None]:
df['customer_service_calls'].describe()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

**Insight(s) from the Chart:**
- **Central Tendency:** The histogram reveals that the majority of customers make 1 to 2 customer service calls, as indicated by the peak around the mean of 1.56.
  
- **Spread of Data:** The distribution has a right-skewed shape, suggesting that while most customers have a low number of service calls, there are instances of higher call frequencies, with a maximum of 9 calls.

- **Common Scenarios:** The quartiles indicate that 50% of customers make 1 or fewer calls, while 75% make 2 or fewer calls. This emphasizes the prevalence of relatively low customer service call volumes.

Overall, the chart shows the typical customer service call pattern: a concentration around a low number of calls, with a right tail representing customers who call much more often.
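Because the number of service calls is a small integer count, the quartile statements above can be checked with a frequency table and its cumulative share. The sketch below assumes a hypothetical helper `call_distribution` and uses made-up `toy` data, so the printed shares are illustrative only.

```python
import pandas as pd


def call_distribution(series: pd.Series) -> pd.DataFrame:
    """Customers per call count, with share and cumulative share."""
    counts = series.value_counts().sort_index()
    table = pd.DataFrame({"customers": counts})
    table["share"] = table["customers"] / table["customers"].sum()
    table["cumulative_share"] = table["share"].cumsum()
    return table


# Made-up values for illustration; in the notebook, pass df['customer_service_calls'].
toy = pd.DataFrame({"customer_service_calls": [0, 1, 1, 1, 2, 2, 3, 1, 0, 9]})
table = call_distribution(toy["customer_service_calls"])
print(table)

# Positive skewness indicates a right-skewed distribution (long tail of high counts).
skewness = toy["customer_service_calls"].skew()
print("skewness:", skewness)
```

The `cumulative_share` column shows directly what fraction of customers made at most a given number of calls.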

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**
- **Positive Impact:** Understanding common scenarios allows for targeted and efficient allocation of customer service resources. Recognizing the right-skewed distribution helps in identifying and addressing the needs of customers with higher call frequencies, potentially enhancing customer satisfaction.

- **Negative Growth Risk:** Neglecting the instances of higher call frequencies may lead to inadequate resource allocation and a failure to address the needs of a subset of customers. Ignoring this variability might result in negative growth as the business might miss opportunities to improve service quality and customer experience for those with higher service call requirements.

# **Bivariate analysis**

In [None]:

# Correlate every feature with the churn label
target_column = 'churn'

# All columns except the target
included_columns = df.columns[df.columns != target_column]
corr_with_churn = df[included_columns].corrwith(df[target_column])
corr_df = pd.DataFrame({'Correlation with Churn': corr_with_churn})

# Sort so the most positively correlated features come first
corr_df = corr_df.sort_values(by='Correlation with Churn', ascending=False)

# Bar chart of each feature's correlation with churn
plt.figure(figsize=(14, 5))
sns.barplot(x=corr_df.index, y='Correlation with Churn', data=corr_df, palette='coolwarm')
plt.title('Correlation with Churn', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.show()


##### 1. Why did you pick the specific chart?

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:

corr_matrix = df.drop(to_drop, axis=1).corr()


plt.figure(figsize=(15, 10))

# Customize the heatmap
sns.heatmap(corr_matrix, annot=True, fmt=".2f", linewidths=.5)

# Add title
plt.title('Correlation Heatmap', fontsize=16)

plt.xticks(rotation=45, ha='right')

plt.show()


##### 1. Why did you pick the specific chart?

The heatmap was chosen due to its ability to visually represent correlations in a concise manner. Its color gradients and annotations facilitate the identification of high correlation values. It efficiently communicates relationships between variables, aiding in the interpretation of patterns. Specific details, such as the correlation of 'churn' with various features, become easily noticeable. The heatmap's clarity and visual appeal make it an effective tool for exploring complex interdependencies in the dataset.

##### 2. What is/are the insight(s) found from the chart?

**From the heatmap, the following correlations stand out:**

**With respect to churn**
* Total day minutes to churn: 0.21
* Total day charge to churn: 0.21
* Customer service calls to churn: 0.21
* International plan to churn: 0.26

**Other relations:**
* Total day charge to total day minutes: 1.00
* Total evening charge to total evening minutes: 1.00
* Total night charge to total night minutes: 1.00

A correlation of 1.00 means each charge is effectively a linear function of the corresponding minutes, so each charge/minutes pair carries the same information.
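Feature pairs with a correlation near 1, like the charge and minutes columns listed above, can also be extracted from the correlation matrix programmatically. The following is a sketch under assumptions: `redundant_pairs` is a hypothetical helper, the 0.99 threshold is an arbitrary choice, and the `toy` columns (including the underscore-style names and the per-minute rate) are made up for illustration.

```python
import pandas as pd


def redundant_pairs(corr: pd.DataFrame, threshold: float = 0.99) -> list:
    """List (feature_a, feature_b, r) for off-diagonal pairs with |r| >= threshold."""
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        # Only look at columns after `a` so each pair is reported once.
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs


# Made-up data: the charge is an exact multiple of the minutes, so r = 1.
toy = pd.DataFrame({
    "total_day_minutes": [100.0, 150.0, 200.0, 250.0, 300.0],
    "total_day_charge": [17.0, 25.5, 34.0, 42.5, 51.0],
    "customer_service_calls": [1, 0, 4, 2, 1],
})
pairs = redundant_pairs(toy.corr())
print(pairs)
```

In the notebook, the same function could be applied to `corr_matrix`; one column from each reported pair is a candidate for removal before any modelling step.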

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***