<a href="https://colab.research.google.com/github/DileepNalluri010723/EDA_Playstore_App_review_analysis/blob/main/EDA_Playstore_App_Review_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  **EDA Playstore App Review Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Problem Statement**


**BUSINESS PROBLEM OVERVIEW**


The Play Store apps data offers tremendous potential for developers to drive business success by gaining actionable insights. This dataset includes various metrics for each app, such as category, rating, and size, which allows for a comprehensive analysis of factors contributing to app engagement and success. Additionally, another dataset comprising customer reviews provides invaluable feedback directly from users. By exploring and analyzing this data, developers can identify trends and patterns, understand user preferences, and address common pain points. This insight enables developers to refine their apps, enhance user experience, and tailor their offerings to better meet user needs. Overall, leveraging the Play Store apps data empowers developers to make informed decisions, innovate on existing features, and strategically position their apps in the competitive Android market, ultimately achieving higher engagement and user satisfaction.

#### **Define Your Business Objective?**

***Key Factors for App engagement and success***

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
###Setting the default view options in output screen for better visibility
pd.set_option('display.max_columns', None) # None --> All Columns
pd.set_option('display.max_rows', 1000)

In [None]:
##Cloning the dataset from github
!git clone https://github.com/DileepNalluri010723/EDA_Playstore_App_review_analysis

### Dataset Loading

In [None]:
# Load Dataset
## A try/except block is used to handle any exceptions raised while reading the files
def load_dataset(datasetName):
    '''
    Reads the requested CSV from the cloned GitHub repository and returns it as a DataFrame.
    Pass "Play Store" for the app data; any other name loads the user reviews.
    '''
    try:
        if datasetName == "Play Store":
            filepath = "/content/EDA_Playstore_App_review_analysis/Play Store Data.csv"
        else:
            filepath = "/content/EDA_Playstore_App_review_analysis/User Reviews.csv"
        df = pd.read_csv(filepath)
        return df
    except Exception as e:
        # On failure the error is printed and the function returns None
        print(e)

In [None]:
# Dataset First Look
playstore_data = load_dataset('Play Store')
review_data = load_dataset('Reviews')

### Dataset First View

In [None]:
playstore_data.head()

In [None]:
review_data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns
playstore_data.shape

In [None]:
playstore_data.columns

In [None]:
review_data.shape

### Dataset Information

In [None]:
# Dataset Info
review_data.info()

In [None]:
playstore_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# len(playstore_data[playstore_data.duplicated()])
playstore_data['App'].value_counts()
dup = playstore_data[playstore_data['App'] =='ROBLOX']
dup

In [None]:
####Since some applications are duplicated, we keep only the first occurrence of each application
final_playstore_data = playstore_data.drop_duplicates(subset='App',keep='first')
final_playstore_data['App'].value_counts()

In [None]:
###We are following the same
final_playstore_data = playstore_data.drop_duplicates(subset='App',keep='first')
final_playstore_data['App']
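
To make the de-duplication step easier to follow, here is a minimal sketch on a toy table. The `apps` frame and its values are illustrative assumptions, not rows from the Play Store dataset.

```python
import pandas as pd

# Toy stand-in for the Play Store table (illustrative values only)
apps = pd.DataFrame({
    "App": ["ROBLOX", "ROBLOX", "Notes", "ROBLOX"],
    "Reviews": [100, 120, 5, 130],
})

# Count rows whose App name has already appeared earlier in the frame
n_repeated_rows = int(apps.duplicated(subset="App").sum())

# Keep only the first occurrence of each App, as in the cell above
deduped = apps.drop_duplicates(subset="App", keep="first")

print(n_repeated_rows)
print(deduped)
```

Note that `keep='first'` retains whichever row happens to come first, so the retained review count may not be the most recent one.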

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(final_playstore_data.isnull().sum())

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(final_playstore_data.isnull(), cbar=False)

In [None]:
###Since rating is a primary factor in this analysis, we are removing the rows which have NaN values
###Note: dropna(axis = 0) drops every row that has a NaN in any column, not only in Rating
final_playstore_data = final_playstore_data.dropna(axis = 0)
final_playstore_data.isnull().sum()
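
The sketch below contrasts dropping rows with a NaN in any column against dropping only rows with a missing `Rating`. The toy frame is an assumption made for illustration. Switching the notebook to the `subset` form would change the row counts reported later, so this is an alternative to consider rather than what the cell above does.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.5, np.nan, 3.9],
    "Current Ver": ["1.0", "2.1", np.nan],
})

# Drops every row that has a NaN in any column
any_null_dropped = toy.dropna(axis=0)

# Drops only the rows where Rating itself is missing
rating_null_dropped = toy.dropna(subset=["Rating"])

print(len(any_null_dropped), len(rating_null_dropped))
```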

### What did you know about your dataset?

In [None]:
review_data.info()

In [None]:
final_playstore_data.info()


The provided dataset from the Google Play Store is intended for analyzing key factors contributing to app engagement and success, and deriving actionable insights from this analysis.

Several variables might influence app engagement, including rating, size, and whether the app is paid. By examining both the app data and user reviews, we aim to identify these influential factors and provide practical insights.

The Play Store dataset initially contains 10,841 rows and 13 columns. However, after addressing missing and duplicate values, the cleaned dataset consists of 8,190 rows and 13 columns.

Regarding the review dataset, it includes 37,427 rows and 5 columns, with no missing or duplicate data.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
final_playstore_data.columns

In [None]:
# Dataset Describe
final_playstore_data.describe(include='all')

### Variables Description

**Playstore data variables**:

* **App:** The name of the mobile application.

* **Category:** The primary category to which the app belongs (e.g., Games, Productivity, Education).

* **Rating:** The overall user rating of the app, represented as a numeric value (e.g., 4.5).

* **Reviews:** The total number of user reviews for the app, as recorded when the data was scraped.

* **Size:** The size of the app file (e.g., 15MB, 50MB), as recorded when the data was scraped.

* **Installs:** The total number of times the app has been downloaded/installed by users, as recorded when the data was scraped.

* **Type:** Whether the app is free or paid.

* **Price:** The cost of the app if it is paid, represented as a numeric value (e.g., $2.99). If the app is free, the price will be 0.

* **Content Rating:** The age group the app is targeted at (e.g., Everyone, Teen, Mature 21+, Adult).

* **Genres:** Additional categories or genres the app belongs to, apart from its main category (e.g., Musical, Family, Game).

**Review Data variables**:

* **App:** The name of the mobile application being reviewed.

* **Translated_Review:** The user review of the app, which has been preprocessed (e.g., cleaned, tokenized) and translated into English.

* **Sentiment:** The sentiment of the user review, categorized as Positive, Negative, or Neutral, based on text analysis.

* **Sentiment_Polarity:** A numeric score representing the sentiment polarity of the review, ranging from -1 (most negative) to 1 (most positive).

* **Sentiment_Subjectivity:** A numeric score ranging from 0 (objective) to 1 (subjective), showing how opinion-based the review is.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in final_playstore_data.columns.tolist():
  print("No. of unique values in ",i,"is",final_playstore_data[i].nunique(),".")

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Function to determine the dominant sentiment
def dominant_sentiment(row):
  sentiment_counts = row.value_counts()
  return sentiment_counts.idxmax()

In [None]:
##Dropping the NaN values from the review data
review_data.dropna(inplace= True)
##We are aggregating the reviews by using the mean value from all the reviews given for a single application
review_data_agg = review_data.groupby('App').agg({'Sentiment_Polarity':'mean','Sentiment_Subjectivity':'mean'}).reset_index()


In [None]:
#Using the dominant sentiment approach, we are aggregating the sentiment
review_data_agg_sentiment = review_data.groupby('App')['Sentiment'].apply(dominant_sentiment)

In [None]:
#Merging the review data so that we have only one unique row for each application
final_review_data = review_data_agg.merge(review_data_agg_sentiment,on='App',how='left')
final_review_data.info()

In [None]:
#Merging the playstore data and review data to analyse the application engagement based on user reviews.
playstore_review_merged_data = final_playstore_data.merge(final_review_data, on='App',how='left')
playstore_review_merged_data.dropna(axis = 0, inplace= True)
playstore_review_merged_data.info()

In [None]:
playstore_review_merged_data.head()

In [None]:
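###Clean the Installs values by removing the commas and the trailing '+' and returning them as integers
def clean_installs(row):
  try:
    return int(row.replace(',', '').replace('+', ''))
  except (ValueError, AttributeError):
    # Leave values that cannot be converted (or are already numeric) unchanged
    return row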

final_playstore_data['Installs'] = final_playstore_data['Installs'].apply(clean_installs)
final_playstore_data['Installs']
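
As an alternative to the row-wise `apply`, the same cleaning can be sketched with vectorized string methods. The sample strings are assumptions about how `Installs` values look. Unlike `clean_installs`, this version turns unparseable entries into `NaN` instead of leaving them as strings.

```python
import pandas as pd

# Illustrative Installs strings (assumed format: digits with commas and a trailing '+')
raw_installs = pd.Series(["10,000+", "500+", "Free"])

# Remove commas and plus signs, then convert; values that cannot be parsed become NaN
installs = pd.to_numeric(
    raw_installs.str.replace(r"[,+]", "", regex=True),
    errors="coerce",
)

print(installs)
```

Coercing to `NaN` keeps the column numeric, so a later `groupby(...).sum()` does not fail on a stray string.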

In [None]:
###We are here calculating the total sum of installs of each category for Category vs Total Installs analysis
category_installs_df = final_playstore_data.groupby(['Category','Type'])['Installs'].sum().reset_index()

category_installs_df.head()

### What all manipulations have you done and insights you found?

**User Review Data Manipulations**:
We aggregated user reviews by calculating the mean values of sentiment subjectivity and sentiment polarities. Additionally, we determined the dominant sentiment for each app based on all reviews and assigned this as the overall sentiment in the sentiment column. By merging both dataframes, we created a final review dataframe that includes mean values and dominant sentiment, resulting in 865 unique rows and 5 columns.

**Insights**:
Upon merging the user review data with the Play Store data, we obtained 816 unique rows and 5 columns. This indicates that we have comprehensive data for 816 applications that include both Play Store metrics and user reviews. Consequently, our primary analysis will focus on the Play Store data, considering the factors available in the dataset since there is insufficient user review data for a complete analysis. However, we will still consider user reviews to analyze application engagement and provide actionable insights to improve app performance.
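
The aggregation described above can be sketched on a toy review table. The `reviews` frame and its values are illustrative assumptions. The sketch uses a single named aggregation in place of the two separate group-bys and the merge used in the notebook.

```python
import pandas as pd

# Toy review table (illustrative values only)
reviews = pd.DataFrame({
    "App": ["A", "A", "A", "B", "B"],
    "Sentiment": ["Positive", "Negative", "Positive", "Neutral", "Neutral"],
    "Sentiment_Polarity": [0.5, -0.2, 0.3, 0.0, 0.0],
})

def dominant_sentiment(series):
    # Most frequent sentiment label within one app's reviews
    return series.value_counts().idxmax()

# One row per app: mean polarity plus the dominant sentiment label
per_app = (
    reviews.groupby("App")
    .agg(
        Sentiment_Polarity=("Sentiment_Polarity", "mean"),
        Sentiment=("Sentiment", dominant_sentiment),
    )
    .reset_index()
)

print(per_app)
```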

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Distribution of App ratings (Univariate - Numerical)

In [None]:
# Chart - 1 visualization code
# Showing Distribution of App ratings
final_playstore_data['Rating']

In [None]:
fig_subscribers_histogram = px.histogram(
    final_playstore_data, x='Rating', nbins=25,
    title="Distribution of Application ratings",
    color_discrete_sequence=['lightgreen']
)

# Customize size and font
fig_subscribers_histogram.update_layout( height=600,  # Set the height of the figure
    width=900,  # Set the width of the figure
    yaxis_title = "Count of Applications",
    title_font=dict(size=20, color='darkblue'),  # Title font settings
    font=dict(size=14,color='darkblue'),  # General font size for axes and labels
    bargap=0.01  # Gap between bars
)
fig_subscribers_histogram.show()


##### 1. Why did you pick the specific chart?

A histogram shows the distribution of a numerical variable by grouping its values into bins, which makes it a natural fit for analysing the distribution of app ratings. That is why I picked a histogram for this analysis.

##### 2. What is/are the insight(s) found from the chart?

This chart shows the distribution of application ratings.

The majority of the applications, more than 60 percent, have a good rating of 4.0 or higher, which is an encouraging insight.

However, more than 30 percent are rated below 4 and 20 percent are rated below 3.5, which is the actionable item here.

We can also observe that 10 percent of the applications are rated below 2.
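
Shares like the ones quoted above can be computed directly instead of being read off the histogram. This is a minimal sketch on made-up ratings; the numbers in `ratings` are assumptions and do not reproduce the dataset's percentages.

```python
import pandas as pd

# Illustrative ratings (not taken from the dataset)
ratings = pd.Series([4.5, 4.1, 3.8, 4.0, 2.9, 4.7, 3.4, 4.3, 1.9, 4.2], name="Rating")

# The mean of a boolean mask is the fraction of True values
share_4_plus = (ratings >= 4.0).mean() * 100
share_below_3_5 = (ratings < 3.5).mean() * 100
share_below_2 = (ratings < 2.0).mean() * 100

print(share_4_plus, share_below_3_5, share_below_2)
```

Applying the same expressions to `final_playstore_data['Rating']` would give the exact shares for the cleaned data.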

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The visualization indicates that the majority of applications are achieving quality engagement and user satisfaction. However, there is room for improvement for those with lower ratings. Developers should focus on these lower-rated applications, utilizing user reviews to identify and address issues. By doing so, they can enhance user engagement and improve the overall quality of their apps.

#### Chart - 2 - Bar chart -- Category vs Installs (Categorical and Numerical -Bivariate)

In [None]:
# Chart - 2 visualization code
fig_category_count = px.bar(
    category_installs_df,
    y='Installs', x='Category',
    title="Bar Plot: Total Installs per Category",
)

fig_category_count.update_layout( height=600,  # Set the height of the figure
    width=900,  # Set the width of the figure
    title_font=dict(size=20, color='darkblue'),  # Title font settings
    font=dict(size=14,color='darkblue'),  # General font size for axes and labels
    bargap=0.2,  # Gap between bars
    xaxis=dict( tickangle=-45),# Set the angle of the x ticks
    yaxis_title='Total Number of Installs',
)

fig_category_count.show()

##### 1. Why did you pick the specific chart?

A bar chart is an excellent choice for visualizing the total number of installs per category. Since we are conducting bivariate analysis involving both categorical and numerical values, the bar chart effectively illustrates these relationships. This is why I have chosen this chart type for our analysis, as it provides clear and concise insights into the distribution of installs across different app categories.

##### 2. What is/are the insight(s) found from the chart?

From the above chart I learned that:
1. Game, Communication and Tools are the most installed, and therefore the most in-demand, categories.
2. Education, Beauty and Finance are among the least installed categories. We can consider these under-served niche categories that offer opportunities for new apps to enter and fill a gap.
3. Apart from Tools, Photography and Productivity also have a good number of installs. We can take this as a shift towards these categories and build effective applications for greater user engagement.
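
The ranking behind these observations can be checked numerically. Below is a minimal sketch on a toy frame shaped like `category_installs_df`, which has one row per Category and Type. The values are illustrative assumptions.

```python
import pandas as pd

# Toy frame with the same columns as category_installs_df (illustrative values only)
category_installs = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "BEAUTY", "EVENTS", "GAME"],
    "Type": ["Free", "Free", "Free", "Free", "Paid"],
    "Installs": [1000, 400, 30, 10, 50],
})

# Collapse the Free/Paid rows into one total per category
totals = category_installs.groupby("Category")["Installs"].sum()

# Most and least installed categories
top2 = totals.nlargest(2)
bottom2 = totals.nsmallest(2)

print(top2)
print(bottom2)
```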


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

From the above analysis, businesses can take the following actions to drive success:

1. **Resource Allocation**: By focusing on categories with higher engagement, such as Games, Communication, and Tools, businesses can allocate their resources more effectively and tap into areas with proven user interest and potential for growth.

2. **Exploring Niche Opportunities**: In a proactive approach, businesses can target categories with fewer installs, like Finance and Education. By developing engaging and high-quality applications in these under-served areas, businesses can attract users looking for niche content that is currently lacking focus.

3.  **Partnerships and Collaborations**: Collaborating with educational institutions, financial experts, or other relevant entities can enhance the credibility and functionality of apps in under-served categories like Finance and Education.

These strategies can help businesses optimize their efforts, drive user engagement, and seize opportunities in both highly popular and less explored app categories.

#### Chart - 3 - Category wise rating (Bivariate)


In [None]:
df = final_playstore_data[(final_playstore_data['Category']=='FINANCE')& (final_playstore_data['Rating'] < 3.9)]
df.head()

In [None]:
fig_engagement_country = px.box(
    final_playstore_data, x='Category', y='Rating',
    title="Box Plot - Rating by Category",
    color ='Type'
)
fig_engagement_country.update_xaxes(title_text = 'Category')
fig_engagement_country.update_yaxes(title_text = 'Rating')

fig_engagement_country.show()


In [None]:
# Marking outliers using Inter Quartile Range
# Outlier Treatment Inter Quartile Range

# Calculate Q1 (25th percentile) and Q3 (75th percentile)
def outlier_marking(data, column_name):
  Q1 = data[column_name].quantile(0.25) # Calculate the percentile
  Q3 = data[column_name].quantile(0.75)
  IQR = Q3 - Q1

  # Determine the bounds for outliers
  lower_bound = Q1 - 1.5 * IQR
  upper_bound = Q3 + 1.5 * IQR

  # Flag outliers
  return data[column_name].apply(
      lambda x: 'Outlier' if x < lower_bound or x > upper_bound else 'Not Outlier'
  )


In [None]:
###Marking the outliers to the main playstore data
final_playstore_data['Outlier Marking'] = outlier_marking(final_playstore_data,'Rating')
final_playstore_data.head()

##### 1. Why did you pick the specific chart?

Box plots show the distribution of numeric values and are especially useful for comparing distributions across multiple groups. They provide high-level information at a glance about a group's symmetry, skew, spread, and outliers. So I used a box plot to see the minimum and maximum values, the quartiles and median, and clearly segregated outliers for each category.
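
A box plot draws its outlier fences separately for each group, whereas the `outlier_marking` helper above uses quartiles computed over all ratings at once. The sketch below shows one way to flag outliers per category with the same 1.5 × IQR rule. The toy data and the `Outlier` column name are assumptions for illustration.

```python
import pandas as pd

# Toy ratings for two categories (illustrative values only)
toy = pd.DataFrame({
    "Category": ["FINANCE"] * 6 + ["TOOLS"] * 6,
    "Rating": [4.0, 4.1, 4.2, 4.3, 4.4, 1.5, 3.8, 4.0, 4.1, 4.2, 4.4, 4.5],
})

grouped = toy.groupby("Category")["Rating"]

# Broadcast each category's quartiles back onto its rows
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# A rating is an outlier if it falls outside its own category's fences
toy["Outlier"] = (toy["Rating"] < q1 - 1.5 * iqr) | (toy["Rating"] > q3 + 1.5 * iqr)

print(toy[toy["Outlier"]])
```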

##### 2. What is/are the insight(s) found from the chart?

This chart shows the rating of the applications by category.

We can observe that Art and Design, Events and Parenting have the best ratings across all the categories.

There are multiple outliers with lower ratings than expected in categories such as Finance, Family and Tools. Even though the median ratings are good, these outliers deserve a closer look.

The Entertainment category has the lowest rating when compared to other categories, followed by Maps and Navigation.

We can also observe that Lifestyle and Dating have the lowest fences when compared to other categories.

When the chart is split by Type (free or paid), a new insight emerges: there are fewer outliers, and the Parenting category has some apps with the lowest ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Although the median ratings are encouraging, with most categories having a median rating of at least 4.2, several categories such as Finance and Family need immediate attention because they contain applications with low ratings. Developers need to look into the reviews and resolve the issues as soon as possible to mitigate the negative impact on users; otherwise these categories could see negative growth.

On the other hand, some categories such as Art and Design and Beauty have good median ratings but very few applications. The business team can focus on such categories and fill the gap by introducing engaging applications.

When it comes to paid apps, the business team must focus on Parenting applications, as their very low ratings would impact the business negatively. Other categories with low-rated applications, such as Maps and Navigation and Dating, also need a close look.

#### Chart - 4 - Type vs Installs (Bar chart)

In [None]:
# Chart - 4 visualization code
type_installs_df = final_playstore_data.groupby('Type')['Installs'].sum().reset_index()
type_installs_df.head()

In [None]:
#Type vs Installs visualization code
fig_category_count = px.bar(
    type_installs_df,
    y='Installs', x='Type',
    title="Bar Plot: Total Installs per Type",
)

fig_category_count.update_layout( height=600,  # Set the height of the figure
    width=600,  # Set the width of the figure
    title_font=dict(size=20, color='darkblue'),  # Title font settings
    font=dict(size=14,color='darkblue'),  # General font size for axes and labels
    bargap=0.2,  # Gap between bars
    yaxis_title='Total Number of Installs',
)

fig_category_count.show()

##### 1. Why did you pick the specific chart?

Bar charts show the frequency counts of values for the different levels of a categorical or nominal variable. Sometimes, bar charts show other statistics, such as totals or percentages.

To compare the total number of installs across the values of the `Type` column, I have used a bar chart.

##### 2. What is/are the insight(s) found from the chart?


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


#### Chart - 5 - Installs vs Size of the application

In [None]:
# Chart - 5 visualization code
#Installs vs size of the application
fig_engagement_vs_views = px.scatter(
    final_playstore_data, x='Installs', y='Size',
    title="Scatter Plot: Installs vs Size of the Application",
    color='Category',
    hover_name='Type',
    labels={'Installs': 'Total Installs', 'Size': 'App Size'}
)

# Order the categories on the y-axis (this only has an effect while Size is a non-numeric column)
fig_engagement_vs_views.update_layout(yaxis=dict(categoryorder='total ascending'))

fig_engagement_vs_views.show()
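
A scatter plot of installs against size is easier to read when `Size` is numeric. The sketch below converts size strings to megabytes. It assumes the values look like `'19M'`, `'201k'` or `'Varies with device'`, and it treats 1 MB as 1024 kB. Both points are assumptions to verify against the actual column, and `size_to_mb` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def size_to_mb(value):
    """Convert strings such as '19M' or '512k' to megabytes; anything else becomes NaN."""
    if isinstance(value, str):
        text = value.strip()
        try:
            if text.endswith("M"):
                return float(text[:-1])
            if text.endswith("k"):
                return float(text[:-1]) / 1024  # assumed kB-to-MB conversion
        except ValueError:
            pass
    return np.nan

# Illustrative Size strings (assumed format)
sizes = pd.Series(["19M", "8.7M", "512k", "Varies with device"])
size_mb = sizes.apply(size_to_mb)

print(size_mb)
```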


In [None]:
# Assigning values for further charts
i2 = dataset.groupby('International plan')['Churn'].mean()*100
i1 = i2.index  # take the labels from the groupby result so they line up with the values
i3 = dataset.groupby(['International plan'])['Total intl charge'].mean()
i4 = dataset.groupby(["Churn"])['Total intl minutes'].mean()

In [None]:
# Visualizing code for people churning percentage having international plan
plt.rcParams['figure.figsize'] = (6, 7)

plt.bar(i1,i2 , color=['b','r'])

plt.title(" Percentage of people leaving", fontsize = 20)
plt.xlabel('International plan', fontsize = 15)
plt.ylabel('percentage', fontsize = 15)
plt.show()

In [None]:
# Visualizing code for average calling charge of customers having international plan
plt.rcParams['figure.figsize'] = (6, 7)

plt.bar(i1,i3, color=['b','r'])
plt.title(" Average charge of people", fontsize = 20)
plt.xlabel('International plan', fontsize = 15)
plt.ylabel(' charge', fontsize = 15)
plt.show()

In [None]:
# Visualizing code for average minutes talked by customers having international plan
plt.rcParams['figure.figsize'] = (6, 7)

plt.bar(i1,i4, color=['b','r'])
plt.title(" Average minute people talk", fontsize = 20)
plt.xlabel('International plan', fontsize = 15)
plt.ylabel(' Minutes', fontsize = 15)
plt.show()

##### 1. Why did you pick the specific chart?

Pie charts represent proportional or relative data in a single chart. Each slice shows the share of one part of the data relative to the whole pie.

Thus, I used a pie chart to show the percentage of people who have taken the international plan, with differently colored areas of the circle.

A bar chart is used when you want to show a distribution of data points or perform a comparison of metric values across different subgroups of your data. From a bar chart, we can see which groups are highest or most common, and how other groups compare against the others.

Thus, I used bar charts to show the churn percentage of customers with an international plan, as well as the average calling charge and average conversation minutes of customers who have an international plan.

##### 2. What is/are the insight(s) found from the chart?

**INTERNATIONAL PLAN**

3010 customers don't have an international plan.

323 customers have an international plan.

Among those who have an international plan, 42.4% churn.

Whereas among those who don't have an international plan, only 11.4% churn.

Those who have an international plan pay an average charge of 2.86 and talk for 10.7 minutes on average.

Whereas those who don't have an international plan pay an average charge of 2.75 and talk for 10.15 minutes on average.

The reason why customers with an international plan might be leaving is that they pay nearly the same amount for international calls as customers without one. Hence they aren't getting any benefit from having an international plan, so they might be unhappy.


***Customers with the International Plan tend to churn more frequently***
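
Per-group churn percentages like those above come from averaging a boolean column within each group. This is a minimal sketch on a toy table; the `calls` frame is an illustrative assumption and does not reproduce the figures quoted in this section.

```python
import pandas as pd

# Toy customer table (illustrative values only)
calls = pd.DataFrame({
    "International plan": ["No", "No", "No", "No", "Yes", "Yes"],
    "Churn": [False, False, False, True, True, False],
})

# The mean of a boolean column is the share of True values, i.e. the churn rate
churn_rate = calls.groupby("International plan")["Churn"].mean() * 100

# Taking labels from the result's own index keeps bars and labels aligned when plotting
labels = churn_rate.index.tolist()

print(labels)
print(churn_rate)
```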


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights found will definitely help create a positive business impact. Customers who have an international plan pay additional charges to get the plan, but their talk-time charge is the same as for customers without an international plan. That might be a major reason for the higher churn among those with an international plan.

#### Chart - 6 - Voice Mail (Univariate + Bivariate)

In [None]:
# Chart - 6 visualization code
# Visualizing code for customers percentage having voice mail plan
dataset['Voice mail plan'].value_counts().plot(kind='pie',
                              figsize=(15,6),
                               autopct="%1.1f%%",
                               startangle=90,
                               shadow=True,
                               labels=['No','Yes'],
                               colors=['skyblue','red'],
                               explode=[0,0]
                              )

In [None]:
# Visualizing code for customers churning while having voice mail plan
plt.rcParams['figure.figsize'] = (6, 7)

cc1=list(['no','yes'])
cc2=dataset.groupby('Voice mail plan')['Churn'].mean()*100
plt.bar(cc1,cc2, color=['b','r'])

plt.title(" Percentage of people leaving", fontsize = 20)
plt.xlabel('Voice mail plan', fontsize = 15)
plt.ylabel('percentage', fontsize = 15)
plt.show()

##### 1. Why did you pick the specific chart?

Pie chart is a type of graph in which a circle is divided into sectors that each represents a proportion of the whole. Pie charts are a useful way to organize data in order to see the size of components relative to the whole, and are particularly good at showing percentage or proportional data.

Thus, I have used pie chart to show the percentage of customers having voice mail plan.

A bar graph compares values across discrete groups.

Thus, I have used a bar chart to show the churn percentage of customers with and without a voice mail plan.

##### 2. What is/are the insight(s) found from the chart?

**VOICE MAIL**

2411 customers don't have a voice mail plan.

922 customers have a voice mail plan.

Among those who don't have a voice mail plan, 16.7% churn.

Whereas among those who have a voice mail plan, only 8.7% churn.

**Hypothesis Based on Voice Mail**
* Customers may send fewer voicemails either because they don't need to, or because network stability is poor in their area, so voice messages fail to send.

***Customers with the Voice Mail Plan tend to churn less frequently***

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

A voice mail plan doesn't have much business impact until the hypothesis has been proven. If the hypothesis holds, we need to take care of the states where network stability is poor, where towers need maintenance or new towers should be installed.

Yes, the voice mail plan might be considered partially.

#### Chart - 7- Overall Calls (Bivariate)

In [None]:
# Chart - 7 visualization code
# Getting means of churn vs total day calls, total day minutes, total day charge
print(dataset.groupby(["Churn"])['Total day calls'].mean())
print(" ")
print(dataset.groupby(["Churn"])['Total day minutes'].mean())
print(" ")
print(dataset.groupby(["Churn"])['Total day charge'].mean())

# 18% more min    18% more charge    no insight

In [None]:
# Visualizing total day minutes vs total day charge
cdd = sns.scatterplot(x="Total day minutes", y="Total day charge", hue="Churn", data=dataset)


In [None]:
# Getting means of churn vs total eve calls, total eve minutes, total evening charge
print(dataset.groupby(["Churn"])['Total eve calls'].mean())
print(" ")
print(dataset.groupby(["Churn"])['Total eve minutes'].mean())
print(" ")
print(dataset.groupby(["Churn"])['Total eve charge'].mean())

In [None]:
# Visualizing total evening minutes vs total evening charge
cdd = sns.scatterplot(x="Total eve minutes", y="Total eve charge", hue="Churn", data=dataset)

In [None]:
# Getting means of churn vs total night calls, total night minutes, total night charge
print(dataset.groupby(["Churn"])['Total night calls'].mean())
print(" ")
print(dataset.groupby(["Churn"])['Total night minutes'].mean())
print(" ")
print(dataset.groupby(["Churn"])['Total night charge'].mean())

In [None]:
# Visualizing total night minutes vs total night charge
cdd = sns.scatterplot(x="Total night minutes", y="Total night charge", hue="Churn", data=dataset)

In [None]:
# Import pandas library
import pandas as pd

# initialize list of lists
data1 = [['Total day minutes',175.17 , 206.91], ['Total day charge',29.78, 35.17]]

#7.012,6.12,6.86

# Create the pandas DataFrame
minutes_code1 = pd.DataFrame(data1, columns = ['day', 'dont churn',' churn'])

# print dataframe.
minutes_code1

In [None]:
# Visualizing code for the above created dataframe
plt.rcParams['figure.figsize'] = (8, 6)


minutes_code1.plot(kind='bar', x='day',ylabel='mean  ')

In [None]:
# Import pandas library
import pandas as pd

# initialize list of lists
data2 = [ ['Total eve minutes',199.04, 212.41], ['Total night minutes',200.13,205.23]]

#7.012,6.12,6.86

# Create the pandas DataFrame
minutes_code2 = pd.DataFrame(data2, columns = ['minutes', 'dont churn',' churn'])

# print dataframe.
minutes_code2

In [None]:
# Visualizing the above created dataframe
plt.rcParams['figure.figsize'] = (8,6)


minutes_code2.plot(kind='bar', x='minutes',xlabel='minutes',ylabel='mean of churn ')

In [None]:
# Import pandas library
import pandas as pd

# initialize list of lists
data3 = [ ['Total eve charge',16.91, 18.05], ['Total night charge',9,9.23]]

#7.012,6.12,6.86

# Create the pandas DataFrame
minutes_code3 = pd.DataFrame(data3, columns = ['charge', 'dont churn',' churn'])

# print dataframe.
minutes_code3

In [None]:
# Visualizing code for the above dataframe
plt.rcParams['figure.figsize'] = (8,6)


minutes_code3.plot(kind='bar', x='charge',ylabel='mean charge')

##### 1. Why did you pick the specific chart?

A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables.

Thus, I have used scatter plots to depict the relationship between minutes and charge for day, evening and night calls.

A bar graph compares values across discrete groups. When the differences between groups are large, a bar graph is a good way to represent the data.

Thus, I have used bar plots of the aggregated day, evening and night data to depict meaningful insights.

##### 2. What is/are the insight(s) found from the chart?

**OVERALL CALLS**

Customers who churn talk for more minutes than non-churn customers during the day, evening and night. Hence they pay higher charges than non-churn customers.

We can retain these customers by introducing a master plan. Under the master plan, a customer who talks for more minutes could be charged a slightly lower rate, or receive a discount or a few additional free minutes.

This should keep customers who are at risk of churning satisfied and make them less likely to leave the company.
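
The group means used in the bar charts can also be computed directly rather than typed in by hand. The following is a minimal sketch, assuming a DataFrame with a boolean `Churn` column; the `toy` data and the helper name `mean_by_churn` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy stand-in for the notebook's `dataset`; the values are made up for illustration.
toy = pd.DataFrame({
    "Churn": [False, False, True, True],
    "Total eve minutes": [190.0, 210.0, 205.0, 220.0],
    "Total night minutes": [195.0, 205.0, 200.0, 210.0],
})

def mean_by_churn(df, columns, target="Churn"):
    """Return one row per column with its mean for non-churn and churn customers."""
    # Group by the churn flag, average each column, then transpose so that
    # every metric becomes a row and the two churn groups become columns.
    means = df.groupby(target)[columns].mean().T
    means = means.rename(columns={False: "dont churn", True: "churn"})
    means.index.name = "metric"
    return means.reset_index()

summary = mean_by_churn(toy, ["Total eve minutes", "Total night minutes"])
print(summary)
```

The resulting table has the same shape as the hand-built DataFrames above, so it could be passed to `.plot(kind='bar', x='metric')` in the same way.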

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

For a telecom service provider, calling and messaging are two essential product plans. Thus, optimizing voice call plans is likely to have a business impact. Customers who use only the calling service should be given some additional offers, either in a talktime or a powerplus plan. For those who use the voice call plan at night only, we might offer attractive plans from midnight to 6 a.m. Customers with a higher account length should also be given attractive offers, as they are our loyal customers. Churn among customers with a higher account length will have a negative impact on the business.


#### Chart - 8 - Customer Service Calls (Bivariate)

In [None]:
# Chart - 8 visualization code
# Visualizing churn rate per customer service calls
plt.rcParams['figure.figsize'] = (12, 8)

# Compute the churn rate (%) per call count once, so that the x positions
# (index) and the bar heights (values) are guaranteed to line up.
# unique() returns values in order of appearance, while groupby sorts by key.
churn_rate = dataset.groupby('Customer service calls')['Churn'].mean() * 100
plt.bar(churn_rate.index, churn_rate.values,
        color=['violet', 'indigo', 'b', 'g', 'y', 'orange', 'r'])

plt.title("Churn rate per service call", fontsize=20)
plt.xlabel('No. of customer service calls', fontsize=15)
plt.ylabel('Churn rate (%)', fontsize=15)
plt.show()


##### 1. Why did you pick the specific chart?

A bar graph compares values across groups, and it can also show change over time when the groups are time periods. When the differences between groups are large, a bar graph is a good option to represent the data.

Thus, I have used the bar plot to show the relationship between churn rate and the number of customer service calls.

##### 2. What is/are the insight(s) found from the chart?

**CUSTOMER SERVICE CALL**

The number of customer service calls varies from 0 to 9.

Customers who make more service calls have a higher probability of leaving.

As the graph shows, customers with more than 5 service calls have a churn probability of more than 50%.

Hence, queries from customers who make more than 5 service calls should be resolved immediately, and these customers should be given better service so that they don't leave the company.

***Customers with four or more customer service calls churn more than four times as often as the other customers.***
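
A claim like the one above can be checked by comparing the churn rate on either side of a call-count threshold. This is a sketch under stated assumptions: the `toy` data is invented, and `churn_rate_ratio` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's `dataset`; the values are made up for illustration.
toy = pd.DataFrame({
    "Customer service calls": [0, 1, 1, 2, 3, 4, 4, 5, 6, 2],
    "Churn": [False, False, False, False, True, True, False, True, True, False],
})

def churn_rate_ratio(df, threshold=4, calls="Customer service calls", target="Churn"):
    """Compare the churn rate of customers at or above `threshold` calls with the rest."""
    high = df[calls] >= threshold
    rate_high = df.loc[high, target].mean()   # share of churners among frequent callers
    rate_low = df.loc[~high, target].mean()   # share of churners among everyone else
    # Note: the ratio is undefined if nobody in the low group churns.
    return rate_high, rate_low, rate_high / rate_low

rate_high, rate_low, ratio = churn_rate_ratio(toy)
print(f"high: {rate_high:.2%}, low: {rate_low:.2%}, ratio: {ratio:.1f}x")
```

On the real data, the same call with `dataset` in place of `toy` would give the ratio behind the "more than four times" statement.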

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Customer service is an essential factor for every business, so good customer service should have a positive impact on the business. We have to look after customer calls and track how long customer queries take to resolve, then work to shorten that time. If one type of issue is reported by more than 5 customers, a root cause analysis should be done on that issue and it should be resolved for everyone.
We need to reduce the number of calls per customer, so that each customer is satisfied within a single call. Customer service agents should be given rewards or recognition for strong performance in resolving customer issues.

#### Chart - 9 - Column wise Histogram & Box Plot Univariate Analysis

In [None]:
# Chart - 9 visualization code
# Histogram of each numeric column to inspect the data distribution
for col in dataset.describe().columns:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = dataset[col]
    # sns.distplot is deprecated; histplot with kde=True draws a comparable plot
    sns.histplot(feature, kde=True, ax=ax)
    # Dashed lines mark the mean (magenta) and the median (cyan)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()

# Box plot of each numeric column to inspect spread and outliers
for col in dataset.describe().columns:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    dataset.boxplot(column=col, ax=ax)
    ax.set_title('Box plot of ' + col)
plt.show()


##### 1. Why did you pick the specific chart?

The histogram is a popular graphing tool. It is used to summarize discrete or continuous data that are measured on an interval scale. It is often used to illustrate the major features of the distribution of the data in a convenient form. It is also useful when dealing with large data sets (greater than 100 observations). It can help detect any unusual observations (outliers) or any gaps in the data.

Thus, I used the histogram to analyse the distribution of each variable over the whole dataset and to see whether it is symmetric or not.

Box plots are used to show distributions of numeric data values, especially when you want to compare them between multiple groups. They are built to provide high-level information at a glance, offering general information about a group of data's symmetry, skew, variance, and outliers.

Thus, for each numerical variable in the given dataset, I used a box plot to analyse the outliers and the interquartile range, along with the median, maximum and minimum values.

##### 2. What is/are the insight(s) found from the chart?

Almost all numerical columns are symmetrically distributed, with the mean nearly equal to the median. Area code will be treated as a text value, as it takes only 3 distinct values in that column.
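
The visual impression of symmetry can be backed by numbers: the gap between mean and median and the sample skewness, both of which are close to zero for a symmetric column. The sketch below uses invented `toy` values and a hypothetical helper `symmetry_report`; it is not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for the notebook's `dataset`; the values are made up for illustration.
toy = pd.DataFrame({
    "Total day minutes": [150.0, 170.0, 180.0, 190.0, 210.0],   # symmetric around 180
    "Customer service calls": [0, 1, 1, 2, 9],                  # long right tail
})

def symmetry_report(df):
    """Summarize mean, median and skewness for every numeric column."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "skew": numeric.skew(),   # about 0 for symmetric data, > 0 for a right tail
    })

report = symmetry_report(toy)
print(report)
```

Columns whose skew is far from zero, or whose mean and median differ noticeably, are the ones worth a closer look in the histograms.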

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Just a histogram and box plot cannot define business impact. It's done just to see the distribution of the column data over the dataset.

#### Chart - 10 - Correlation Heatmap

In [None]:
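# Correlation Heatmap visualization code
# numeric_only=True skips text columns, which newer pandas versions would otherwise reject
corr = dataset.corr(numeric_only=True)
cmap = sns.diverging_palette(5, 250, as_cmap=True)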

def magnify():
    return [dict(selector="th",
                 props=[("font-size", "7pt")]),
            dict(selector="td",
                 props=[('padding', "0em 0em")]),
            dict(selector="th:hover",
                 props=[("font-size", "12pt")]),
            dict(selector="tr:hover td:hover",
                 props=[('max-width', '200px'),
                        ('font-size', '12pt')])
]

# Styler.set_precision was removed in newer pandas; format(precision=2) replaces it
corr.style.background_gradient(cmap, axis=1)\
    .set_properties(**{'max-width': '80px', 'font-size': '10pt'})\
    .set_caption("Hover to magnify")\
    .format(precision=2)\
    .set_table_styles(magnify())

##### 1. Why did you pick the specific chart?

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. The range of correlation is [-1, 1].

Thus, to see the correlation between all the variables along with the correlation coefficients, I used a correlation heatmap.

##### 2. What is/are the insight(s) found from the chart?

From the above correlation heatmap, we can see that total day charge and total day minutes, total evening charge and total evening minutes, and total night charge and total night minutes are each highly positively correlated, with a value of 1.

Customer service calls is positively correlated only with area code and negatively correlated with the remaining variables.

The remaining correlations can be read from the above chart.
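
Strongly correlated pairs can also be listed programmatically instead of being read off the heatmap. The following is a minimal sketch: the `toy` data, the per-minute rate of 0.17 and the helper `strong_pairs` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's `dataset`; the values are made up for illustration.
minutes = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
toy = pd.DataFrame({
    "Total day minutes": minutes,
    "Total day charge": minutes * 0.17,        # hypothetical per-minute rate
    "Customer service calls": [1, 0, 3, 1, 2],
})

def strong_pairs(df, threshold=0.9):
    """Return variable pairs whose absolute correlation is at least `threshold`."""
    corr = df.corr(numeric_only=True)
    # Keep only the strict upper triangle so each pair appears exactly once
    # and the diagonal (always 1) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) fail the comparison and are dropped here.
    return pairs[pairs.abs() >= threshold].sort_values(ascending=False)

result = strong_pairs(toy)
print(result)
```

A pair with correlation 1, such as minutes and charge here, carries the same information twice, so one of the two columns is a candidate for removal before modelling.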


#### Chart - 11 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(dataset, hue="Churn")

##### 1. Why did you pick the specific chart?

A pair plot is used to find the best set of features to explain a relationship between two variables or to form the most separated clusters. It also helps in forming simple classification models by drawing lines that linearly separate the data.

Thus, I used a pair plot to analyse the patterns in the data and the relationships between the features. It serves a similar purpose to the correlation map, but gives a graphical representation.

##### 2. What is/are the insight(s) found from the chart?

From the above chart, I found that there is little linear relationship between the variables and that the data points aren't linearly separable. Data for churned customers is clustered and overlapping. Non-churn data is fairly symmetric, while churned customer data is fairly asymmetric. Across the whole pair plot, the importance of area code can be seen, and the number of churned customers with respect to the different features is insightful. The remaining insights can be read from the above graph.

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain Briefly.

**Solution to Reduce Customer Churn**

* Modify the International Plan, as its charge is the same as the normal plan.
* Be proactive with communication.
* Ask for feedback often.
* Periodically roll out offers to retain customers.
* Look into the problems customers face in the states with the most churn.
* Lean into the best customers.
* Perform regular server maintenance.
* Resolve poor network connectivity issues.
* Define a roadmap for new customers.
* Analyze churn when it happens.
* Stay competitive.




# **Conclusion**

•	The four charge fields are linear functions of the minute fields.

•	The area code field and/or the state field are anomalous, and can be omitted.

•	Customers with the International Plan tend to churn more frequently.

•	Customers with four or more customer service calls churn more than four times as often as do the other customers.

•	Customers with high day minutes and evening minutes tend to churn at a higher rate than do the other customers.

•	There is no obvious association of churn with the variables day calls, evening calls, night calls, international calls, night minutes, international minutes, account length, or voice mail messages.
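
The first conclusion, that charge is a linear function of minutes, can be verified with a least-squares line fit: if the relationship is exactly linear, the residuals are essentially zero. This sketch uses invented values and a hypothetical rate of 0.17 per minute; `fit_line` is an illustrative helper.

```python
import numpy as np

# Made-up minutes and a hypothetical per-minute rate, for illustration only.
minutes = np.array([120.0, 180.5, 200.0, 265.1, 310.4])
charge = 0.17 * minutes

def fit_line(x, y):
    """Fit y = slope * x + intercept and report the largest absolute residual."""
    slope, intercept = np.polyfit(x, y, deg=1)
    max_residual = np.max(np.abs(y - (slope * x + intercept)))
    return slope, intercept, max_residual

slope, intercept, max_residual = fit_line(minutes, charge)
print(f"slope={slope:.4f}, intercept={intercept:.4f}, max residual={max_residual:.2e}")
```

Running the same fit on each minutes and charge pair in the real data would recover the per-minute rate as the slope.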



### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***