# Grouping and Summarizing Data

### Summary statistics for both classes
Consider the following .groupby() code:

# Group by x and compute the standard deviation
df.groupby(['x']).std()

Here, a DataFrame df is grouped by the column 'x', and the standard deviation of each of the other columns of df is then calculated for every value of 'x'. The .groupby() method is especially useful when you want to compare specific columns of your dataset across groups.

Here, you're going to explore the 'Churn' column further to see whether there are differences between churners and non-churners. A subset of the telco DataFrame, consisting of the columns 'Churn', 'CustServ_Calls', and 'Vmail_Message', is available in your workspace.

If you need a refresher on how .groupby() works, please refer back to the prerequisite Manipulating DataFrames with pandas course.
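As a quick illustration of the pattern described above, here is a minimal sketch on a small made-up DataFrame. The name `toy` and its values are hypothetical stand-ins for the telco subset, not real data.

```python
import pandas as pd

# Hypothetical toy data standing in for the telco subset (values are made up)
toy = pd.DataFrame({
    "Churn": ["no", "no", "yes", "yes"],
    "CustServ_Calls": [1, 3, 4, 6],
    "Vmail_Message": [10, 20, 0, 0],
})

# Each result has one row per 'Churn' value and one column per remaining feature
group_means = toy.groupby("Churn").mean()
group_stds = toy.groupby("Churn").std()  # sample standard deviation (ddof=1)

print(group_means)
print(group_stds)
```

Because 'Churn' becomes the index of the result, an individual value can be read with `.loc`, for example `group_means.loc["yes", "CustServ_Calls"]`.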

In [1]:
# Group telco by 'Churn' and compute the mean
print(telco.groupby(['Churn']).mean())

NameError: name 'telco' is not defined

### Churn by State
When dealing with customer data, geographic region may play an important part in determining whether a customer will cancel their service. You may have noticed that there is a 'State' column in the dataset. In this exercise, you'll group by 'State' and count the values of 'Churn' to get the number of churners and non-churners in each state. For example, if you wanted to group by x and count the values of y within each group, you could use .groupby() as follows:

df.groupby('x')['y'].value_counts()
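To show what this call returns, here is a small sketch on made-up data. The `toy` DataFrame and its state codes are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data (not the real telco dataset)
toy = pd.DataFrame({
    "State": ["KS", "KS", "KS", "OH", "OH"],
    "Churn": ["no", "no", "yes", "yes", "no"],
})

# The result is a Series indexed by (State, Churn) pairs
counts = toy.groupby("State")["Churn"].value_counts()
print(counts)

# Optional: pivot the 'Churn' level into columns for a state-by-churn table
table = counts.unstack(fill_value=0)
print(table)
```

The `unstack()` step is not part of the exercise; it is just one convenient way to view the same counts as a table with one row per state.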

In [2]:
# Count the number of churners and non-churners by State
print(telco.groupby('State')['Churn'].value_counts())

NameError: name 'telco' is not defined

# Exploring your Data Using Visualization

### Exploring feature distributions
You saw in the video that the 'Account_Length' feature was normally distributed. Let's now visualize the distributions of the following features using seaborn's distribution plot:

'Day_Mins'

'Eve_Mins'

'Night_Mins'

'Intl_Mins'

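To create a feature's distribution plot, pass it in as an argument to sns.distplot(). Note that distplot() has been deprecated since seaborn 0.11; in recent versions, sns.histplot() and sns.displot() are the recommended replacements. The Telco dataset is available to you as a DataFrame called telco.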

In [3]:
# Import matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize the distributions of 'Day_Mins', 'Eve_Mins', 'Night_Mins', and 'Intl_Mins'
sns.distplot(telco['Day_Mins'])
sns.distplot(telco['Eve_Mins'])
sns.distplot(telco['Night_Mins'])
sns.distplot(telco['Intl_Mins'])

# Display the plot
plt.show()

NameError: name 'telco' is not defined

### Customer service calls and churn
You've already seen that there's not much of a difference in account lengths between churners and non-churners, but that there is a difference in the number of customer service calls made by churners.

Let's now visualize this difference using a box plot and incorporate other features of interest - do customers who have international plans make more customer service calls? Or do they tend to churn more? How about voicemail plans? Let's find out!

Recall the syntax for creating a box plot using seaborn:

sns.boxplot(x = "X-axis variable",
            y = "Y-axis variable",
            data = DataFrame)

If you want to hide outliers in the plot, you can specify the additional parameter sym="" (depending on your seaborn version, showfliers=False may be needed instead), and you can add a third variable using hue.
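For context on what counts as an outlier here: with the default whisker setting (whis=1.5), a box plot draws as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below reproduces that rule with pandas on a made-up series; the values are hypothetical. Hiding outliers in the plot does not remove those rows from the DataFrame.

```python
import pandas as pd

# Hypothetical call counts (made-up values for illustration)
calls = pd.Series([0, 1, 1, 2, 2, 3, 9], name="CustServ_Calls")

# Quartiles and interquartile range
q1 = calls.quantile(0.25)
q3 = calls.quantile(0.75)
iqr = q3 - q1

# Fences used by the default 1.5 * IQR whisker rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are the ones a box plot would draw as outliers
outliers = calls[(calls < lower_fence) | (calls > upper_fence)]
print(outliers.tolist())
```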

In [4]:
# Import matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create the box plot
sns.boxplot(x = 'Churn',
          y = 'CustServ_Calls',
          data = telco)

# Display the plot
plt.show()

NameError: name 'telco' is not defined

# Data Preparation

### Encoding binary features
Recasting data types is an important part of data preprocessing. In this exercise, you will replace the value 'yes' with 1 and 'no' with 0 in both the 'Vmail_Plan' and 'Churn' features.

You saw two approaches to doing this in the video - one using pandas, and the other using scikit-learn. For straightforward tasks like this, sticking with pandas is recommended, so that's what we'll do in this exercise. If you're trying to build machine learning pipelines, on the other hand - which is beyond the scope of this course - you can explore using LabelEncoder(). When doing data science, it's important to be aware that there is always more than one way to accomplish a task, and you need to pick the one that is most effective for your application.
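As a minimal sketch of the pandas approach, the example below encodes both columns of a made-up DataFrame with a single mapping. The `toy` DataFrame and the `mapping` dictionary are hypothetical names used only for illustration, and `.map()` is one of several ways to do this.

```python
import pandas as pd

# Hypothetical toy data (not the real telco dataset)
toy = pd.DataFrame({
    "Vmail_Plan": ["yes", "no", "no"],
    "Churn": ["no", "no", "yes"],
})

# One shared mapping keeps the encoding consistent across columns
mapping = {"no": 0, "yes": 1}
for col in ["Vmail_Plan", "Churn"]:
    # Note: .map() turns any value missing from the mapping into NaN
    toy[col] = toy[col].map(mapping)

print(toy)
```

One difference worth knowing: `.map()` converts values that are absent from the mapping into NaN, whereas `.replace()` leaves them unchanged, so the choice depends on how you want unexpected values to surface.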

In [8]:
# Replace 'no' with 0 and 'yes' with 1 in 'Vmail_Plan'
telco['Vmail_Plan'] = telco['Vmail_Plan'].replace({'no': 0, 'yes': 1})

# Replace 'no' with 0 and 'yes' with 1 in 'Churn'
telco['Churn'] = telco['Churn'].replace({'no': 0, 'yes': 1})

# Print the results to verify
print(telco['Vmail_Plan'].head())
print(telco['Churn'].head())

NameError: name 'telco' is not defined

### One hot encoding
In the video, you saw how the 'State' feature can be encoded numerically using the technique of one hot encoding:

[Figure: ohe_part3.png]

Doing this manually would be quite tedious, especially when you have 50 states and over 3000 customers! Fortunately, pandas has a get_dummies() function which automatically applies one hot encoding to the selected feature.
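To make the output of get_dummies() concrete, here is a small sketch on a made-up 'State' column. The `toy` DataFrame is hypothetical, and `dtype=int` is an optional choice made here so the result shows 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy data (not the real telco dataset)
toy = pd.DataFrame({"State": ["KS", "OH", "KS", "NJ"]})

# One column per distinct state; each row has a single 1 in its own state's column
dummies = pd.get_dummies(toy["State"], dtype=int)
print(dummies)
```

Each row of the result contains exactly one 1, which is where the name "one hot" comes from.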

In [6]:
# Import pandas
import pandas as pd

# Perform one hot encoding on 'State'
telco_state = pd.get_dummies(telco['State'])

NameError: name 'telco' is not defined