# Keeping Customers: Fighting Churn with Pandas

In this assignment, you'll play the role of lead analyst for a credit card provider. You've been given a CSV file containing information on both customers who have churned and customers who still use the credit card. Your task is to determine whether churned and existing customers differ significantly on the following factors:

- Age
- Number of dependents 
- Total revolving balance, i.e., the balance that still needs to be paid
- Income 

The first three metrics are numeric, so you will apply the same strategy to each of them. Customer income is categorical, so you will use a slightly different strategy for it.

By the end of this notebook, you will have determined which of the above factors may help identify a customer who is likely to churn. This will help your organization better identify and retain flight risks in the future.

---
### Getting Started
To get started, download the following files:
- `Unit 20 - Business - Unsolved.ipynb` (_this notebook_)
- `CreditCardChurn.csv`

Place these together into a dedicated directory on your hard drive. We recommend creating a folder in your `Documents` directory for this week of class, as follows:

```
Documents/
  Python/
    Week20/
      Unit 20 - Business - Unsolved.ipynb
      CreditCardChurn.csv
```

Then, start Jupyter Notebook and open `Unit 20 - Business - Unsolved.ipynb` in your browser. Make sure the `CreditCardChurn.csv` file lives in the same directory.

---

### Problem Structure
Each problem will be accompanied by:
- **Instructions**
  - Each problem features a markdown cell explaining the problem.
- **Unfinished Code Cells**
  - Each problem has unfinished code cells, where you will write code to solve the problem.
  - Cells will contain either starter code for you to finish, or a comment explaining what your code should do.
- **Expected Output**
  - Many unfinished code cells will have output below them. You will be expected to write code that produces the same output.
  - Some unfinished code cells do _not_ have output below them. This is simply because not all code will generate output. Your solutions for these cells should _not_ print anything.
  
---
  
### Deliverables
To receive credit for this assignment, you must submit the following files:
- Your completed Jupyter Notebook

Your completed Jupyter Notebook will be this file, but with all of the problems solved. This is the only file you will need to submit. When you're done with the assignment, run all cells to verify that your code executes as expected. Then, save and submit this notebook.

Good luck!

----

## Part 1: Loading & Cleaning Data
In Part 1, you will perform the following steps on the data in `CreditCardChurn.csv`:
- Load the CSV into a dataframe and print the first five rows
- Add a new column called `Churned`


### Problem 1: Loading Data
You will load the data in `CreditCardChurn.csv`, and inspect its columns using `head`. 

You have been provided a `filename` variable, which contains the path to `CreditCardChurn.csv`. Use it to complete the steps below:
- Load `filename` into a DataFrame called `churn`
- Print the first 5 rows of `churn`
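As a minimal sketch of this step, the snippet below reads a small in-memory CSV instead of the real file, so it runs without `CreditCardChurn.csv`. The two sample rows use values from the expected output; `sample_csv` and `sample` are hypothetical names used only for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the assignment itself reads CreditCardChurn.csv.
sample_csv = io.StringIO(
    "ClientID,AttritionFlag,CustomerAge\n"
    "768805383,Existing Customer,45\n"
    "818770008,Existing Customer,49\n"
)

# read_csv accepts a file path or a file-like object; comma is the default separator.
sample = pd.read_csv(sample_csv)

# head() returns the first 5 rows by default (here there are only 2).
print(sample.head())
```

In the notebook, the same call would take `filename` in place of `sample_csv`.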

---

Your code should print the following:

```
ClientID	AttritionFlag	CustomerAge	Gender	DependentCount	EducationLevel	IncomeCategory	TotalRevolvingBal
0	768805383	Existing Customer	45	M	3	High School	$60K - $80K	777
1	818770008	Existing Customer	49	F	5	Graduate	Less than $40K	864
2	713982108	Existing Customer	51	M	3	Graduate	$80K - $120K	0
3	769911858	Existing Customer	40	F	4	High School	Less than $40K	2517
4	709106358	Existing Customer	40	M	3	Uneducated	$60K - $80K	0
```


In [1]:
# Provided Code -- Do NOT Edit!
import pandas as pd
filename = 'CreditCardChurn.csv'

In [2]:
# TODO: Load `filename` into a DataFrame called `churn`
churn = pd.read_csv(filename)

In [3]:
# TODO: Print first 5 rows of `churn`
churn.head(5)

Unnamed: 0,ClientID,AttritionFlag,CustomerAge,Gender,DependentCount,EducationLevel,IncomeCategory,TotalRevolvingBal
0,768805383,Existing Customer,45,M,3,High School,$60K - $80K,777
1,818770008,Existing Customer,49,F,5,Graduate,Less than $40K,864
2,713982108,Existing Customer,51,M,3,Graduate,$80K - $120K,0
3,769911858,Existing Customer,40,F,4,High School,Less than $40K,2517
4,709106358,Existing Customer,40,M,3,Uneducated,$60K - $80K,0


### Problem 2: Adding a `Churned` Column
Note that the `AttritionFlag` Series contains one of two values: either `Existing Customer`, indicating that the customer is still a subscriber, or `Attrited Customer`, indicating that they've canceled.

In this problem, you'll add a new column, called `Churned`, which will be `True` if `AttritionFlag` is `Attrited Customer`, and `False` otherwise. Follow the steps below:
- Create a new column, called `Churned`, which is `True` for rows where `AttritionFlag` is equal to `Attrited Customer`, and `False` otherwise
- Count the values of the `Churned` column
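The steps above can be sketched on a toy DataFrame. This is only an illustration under assumed data: `toy` and its four rows are hypothetical, and the real notebook would apply the same comparison to `churn`.

```python
import pandas as pd

# Hypothetical toy data, not the real file.
toy = pd.DataFrame({
    "AttritionFlag": [
        "Existing Customer",
        "Attrited Customer",
        "Existing Customer",
        "Existing Customer",
    ]
})

# Comparing a Series to a string yields a boolean Series, one value per row.
toy["Churned"] = toy["AttritionFlag"] == "Attrited Customer"

# value_counts tallies how often each distinct value occurs.
counts = toy["Churned"].value_counts()
print(counts)
```

Because the comparison is evaluated for every row at once, no loop or placeholder column is needed.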

---

Your code should print the following:

```
False    8500
True     1627
Name: Churned, dtype: int64
```

In [4]:
# TODO: Create column called `Churned`
# True where `AttritionFlag` is 'Attrited Customer', False otherwise
churn['Churned'] = churn['AttritionFlag'] == 'Attrited Customer'

In [5]:
# TODO: Count values in `Churned` column
churn['Churned'].value_counts()

False    8500
True     1627
Name: Churned, dtype: int64

## Part 2: Differences in Numeric Variables
In Part 2, you will see if there are significant differences in the following columns between the two groups:
- Customer Age
- Number of Dependents
- Total Revolving Balance
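The problems in this part compute each group's mean one column at a time. As a hedged alternative sketch, `groupby` can produce all three group means in one step. The `toy` DataFrame and its values are hypothetical and only reuse the assignment's column names.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the assignment.
toy = pd.DataFrame({
    "Churned": [True, True, False, False],
    "CustomerAge": [50, 40, 45, 35],
    "DependentCount": [2, 4, 1, 3],
    "TotalRevolvingBal": [0, 600, 1200, 1400],
})

numeric_columns = ["CustomerAge", "DependentCount", "TotalRevolvingBal"]

# One row per group (False / True), one column per metric.
group_means = toy.groupby("Churned")[numeric_columns].mean()

# Churned mean minus unchurned mean, for every column at once.
differences = group_means.loc[True] - group_means.loc[False]
print(differences)
```

The problems below still ask for the explicit `churned_customers` and `unchurned_customers` variables, so treat this as a cross-check rather than a replacement.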

### Problem 1: Calculating the Average Age Difference Between `churned` vs `not_churned`
You will write code that computes the average age of churned customers, and the average age of unchurned customers, and then prints the _difference_ in these two averages.

Follow the steps below to solve this problem:
- Create a variable, called `churned_customers`, that contains only the rows in your DataFrame corresponding to churned customers 
- Create a variable, called `unchurned_customers`, that contains only the rows in your DataFrame corresponding to _unchurned_ customers
- Compute the average `CustomerAge` for each group
- Print the difference between these averages

---

Your code should print the following:

```
AVERAGE AGE DIFFERENCE: 0.3973783578582015
```

In [6]:
# TODO: Create `churned_customers` and `unchurned_customers` variables
churned_customers = churn[churn['Churned']]
unchurned_customers = churn[~churn['Churned']]

In [7]:
# TODO: Compute average age of churned and unchurned customers
churned_average_age = churned_customers['CustomerAge'].mean()
print(churned_average_age)
unchurned_average_age = unchurned_customers['CustomerAge'].mean()
print(unchurned_average_age)
# TODO: Compute `difference_in_average_age`
difference_in_average_age = churned_average_age - unchurned_average_age

46.659496004917024
46.26211764705882


In [8]:
# TODO: Print `difference_in_average_age`
print('AVERAGE AGE DIFFERENCE: ' + str(difference_in_average_age))

AVERAGE AGE DIFFERENCE: 0.3973783578582015


After finding the difference in average age, answer the following questions:
- What are the minimum and maximum values in `churn['CustomerAge']`? Store the minimum value in a variable called `min_age` and maximum value in a variable called `max_age`.
- What is the difference between these values? Store the result in a variable called `age_range`.
- Is `CustomerAge` predictive of churn?

In [9]:
# Compute and print `min_age`
min_age = churn['CustomerAge'].min()
print(min_age)

26


In [10]:
# Compute and print `max_age`
max_age = churn['CustomerAge'].max()
print(max_age)

73


In [11]:
# Compute and print difference between maximum and minimum ages
age_range = max_age - min_age
print(age_range)

47


### Problem 2: Calculating Difference in Average Number of Dependents
Next, you will write code that computes the average number of dependents of churned customers, and the average number of dependents of unchurned customers, and then prints the _difference_ in these two averages.

Follow the steps below to solve this problem:
- Create a variable, called `churned_customers`, that contains only the rows in your DataFrame corresponding to churned customers 
- Create a variable, called `unchurned_customers`, that contains only the rows in your DataFrame corresponding to _unchurned_ customers
- Compute the average `DependentCount` for each group
- Print the difference between these averages

---

Your code should print the following:

```
AVERAGE DIFFERENCE IN DEPENDENT COUNT:  0.06716967352398884
```

---

In [12]:
# TODO: Create `churned_customers` and `unchurned_customers` variables
churned_customers = churn[churn['Churned']]
unchurned_customers = churn[~churn['Churned']]

In [13]:
# TODO: Compute average `DependentCount` for each group
churned_dependent_count = churned_customers['DependentCount'].mean()
unchurned_dependent_count = unchurned_customers['DependentCount'].mean()
# TODO: Compute `difference_in_average_dependent_count`
difference_in_average_dependent_count = churned_dependent_count - unchurned_dependent_count

In [14]:
# TODO: Print `difference_in_average_dependent_count`
print('AVERAGE DIFFERENCE IN DEPENDENT COUNT:  ' + str(difference_in_average_dependent_count))

AVERAGE DIFFERENCE IN DEPENDENT COUNT:  0.06716967352398884


After finding the difference in average number of dependents, answer the following questions:
- What are the minimum and maximum values in `churn['DependentCount']`? Store the minimum value in a variable called `min_dependent_count` and maximum value in a variable called `max_dependent_count`.
- What is the difference between these values? Store the result in a variable called `dependent_count_range`.
- Is `DependentCount` predictive of churn?

In [15]:
# Compute and print `min_dependent_count`
min_dependent_count = churn['DependentCount'].min()

In [16]:
# Compute and print `max_dependent_count`
max_dependent_count = churn['DependentCount'].max()

In [17]:
# Compute and print difference between max and min dependent count
dependent_count_range = max_dependent_count - min_dependent_count
print(dependent_count_range)

5


### Problem 3: Calculating Average Difference in Total Revolving Balance
Next, you will write code that computes the average total revolving balance of churned customers and the average total revolving balance of unchurned customers, and then prints the _difference_ between these two averages.

Follow the steps below to solve this problem:
- Create a variable, called `churned_customers`, that contains only the rows in your DataFrame corresponding to churned customers
- Create a variable, called `unchurned_customers`, that contains only the rows in your DataFrame corresponding to _unchurned_ customers
- Compute the average `TotalRevolvingBal` for each group
- Print the difference between these averages

---

Your code should print the following:

```
AVERAGE DIFFERENCE IN TOTAL REVOLVING BALANCE:  -583.7811305542499
```

---

In [18]:
# TODO: Create `churned_customers` and `unchurned_customers` variables
churned_customers = churn[churn['Churned']]
unchurned_customers = churn[~churn['Churned']]

In [19]:
# TODO: Compute average `TotalRevolvingBal` for each group
revolving_balance_churned_customers = churned_customers['TotalRevolvingBal'].mean()
revolving_balance_unchurned_customers = unchurned_customers['TotalRevolvingBal'].mean()
# TODO: Compute `difference_in_average_revolving_balance`
difference_in_average_revolving_balance = revolving_balance_churned_customers - revolving_balance_unchurned_customers

In [20]:
# TODO: Print `difference_in_average_revolving_balance`
print('AVERAGE DIFFERENCE IN TOTAL REVOLVING BALANCE:  ' + str(difference_in_average_revolving_balance))

AVERAGE DIFFERENCE IN TOTAL REVOLVING BALANCE:  -583.7811305542499


After finding the difference in total revolving balance, answer the following questions:
- What are the maximum and minimum values in `churn['TotalRevolvingBal']`? Store the minimum value in a variable called `min_balance` and maximum value in a variable called `max_balance`.
- What is the difference between these values? Store the result in a variable called `balance_range`.
- Is `TotalRevolvingBal` predictive of churn?

In [21]:
# TODO: Compute and print minimum revolving balance
min_balance = churn['TotalRevolvingBal'].min()

In [22]:
# TODO: Compute and print maximum revolving balance
max_balance = churn['TotalRevolvingBal'].max()

In [23]:
# TODO: Compute difference between max and min revolving balances
balance_range = max_balance - min_balance
print(balance_range)

2517
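
One rough way to approach the "is it predictive?" questions in this part is to compare each difference in means with the column's full range. The sketch below does that arithmetic with the numbers printed above. The `results` dictionary and the choice to divide by the range are illustrative assumptions, not a formal significance test.

```python
# Differences in group means and overall ranges, copied from the outputs above.
results = {
    "CustomerAge": (0.3973783578582015, 47),
    "DependentCount": (0.06716967352398884, 5),
    "TotalRevolvingBal": (-583.7811305542499, 2517),
}

relative_differences = {}
for column, (difference, value_range) in results.items():
    # Express the gap between the groups as a share of the column's full range.
    relative_differences[column] = abs(difference) / value_range
    print(f"{column}: {relative_differences[column]:.1%} of range")
```

By this measure the gap is roughly 1% of the range for age and dependent count, but about 23% for total revolving balance, which is why the balance stands out.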


## Part 3: Studying Income Categories
In Part 3, you will continue looking for differences between churned and unchurned customers. This time, you will see whether low-income or high-income customers are more likely to churn, by studying the `IncomeCategory` column.

### Problem 1: Comparing "Low-Income" Churned vs Unchurned Customers
Next, you will determine whether churned and unchurned customers differ in the proportion who qualify as "low-income" -- i.e., those who make less than $40K per year.

You have been provided with a variable, called `low_income`, containing the value `'Less than $40K'`. Use it to complete the steps below:
- Create a variable, called `churned_customers`, that contains only the rows in your DataFrame corresponding to churned customers
- Create a variable, called `unchurned_customers`, that contains only the rows in your DataFrame corresponding to unchurned customers
- Create the following two boolean Series:
  - `low_income_churned`: `True` for each row of `churned_customers` whose `IncomeCategory` value equals `low_income`
  - `low_income_unchurned`: `True` for each row of `unchurned_customers` whose `IncomeCategory` value equals `low_income`
- Compute the mean of each Series (the proportion of low-income customers in each group), then print the difference between these means.
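
The expected output below is a difference in proportions. A minimal sketch of why taking the mean of a boolean comparison gives a proportion, using a hypothetical `toy` DataFrame rather than the real data:

```python
import pandas as pd

# Hypothetical toy data, not the real file.
toy = pd.DataFrame({
    "Churned": [True, True, False, False, False, False],
    "IncomeCategory": [
        "Less than $40K", "$120K +",
        "Less than $40K", "$60K - $80K", "$120K +", "$60K - $80K",
    ],
})
low_income = "Less than $40K"

churned = toy[toy["Churned"]]
unchurned = toy[~toy["Churned"]]

# True counts as 1 and False as 0, so the mean is the fraction of True values.
share_churned = (churned["IncomeCategory"] == low_income).mean()
share_unchurned = (unchurned["IncomeCategory"] == low_income).mean()

print(share_churned - share_unchurned)
```

Here one of two churned rows and one of four unchurned rows are low-income, so the shares are 0.5 and 0.25.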

---

Your code should print the following:

```
0.029211251310604147
```

In [24]:
# Provided Code -- Do NOT Edit!
low_income = 'Less than $40K'

In [25]:
# TODO: Create `churned_customers` and `unchurned_customers` DataFrames
churned_customers = churn[churn['Churned']]
unchurned_customers = churn[~churn['Churned']]

In [26]:
# TODO: Filter for low-income, churned customers
low_income_churned = churned_customers['IncomeCategory'] == low_income

# TODO: Filter for low-income, unchurned customers
low_income_unchurned = unchurned_customers['IncomeCategory'] == low_income

# TODO: Compute difference in means between `low_income_churned` and `low_income_unchurned`
low_mean_income = low_income_churned.mean() - low_income_unchurned.mean()
low_mean_income

0.029211251310604147

### Problem 2: Comparing "High-Income" Churned vs Unchurned Customers
Next, you will determine whether churned and unchurned customers differ in the proportion who qualify as "high-income" -- i.e., those who make $120K or more per year.

You have been provided with a variable, called `high_income`, containing the value `'$120K +'`. Use it to complete the steps below:
- Create a variable, called `churned_customers`, that contains only the rows in your DataFrame corresponding to churned customers
- Create a variable, called `unchurned_customers`, that contains only the rows in your DataFrame corresponding to unchurned customers
- Create the following two boolean Series:
  - `high_income_churned`: `True` for each row of `churned_customers` whose `IncomeCategory` value equals `high_income`
  - `high_income_unchurned`: `True` for each row of `unchurned_customers` whose `IncomeCategory` value equals `high_income`
- Compute the mean of each Series (the proportion of high-income customers in each group), then print the difference between these means.
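
Rather than checking one income category at a time, a cross-tabulation can show the income mix of both groups together. This is a sketch on hypothetical toy data; `income_mix` is an illustrative name, and the assignment itself only asks for the two categories above.

```python
import pandas as pd

# Hypothetical toy data, not the real file.
toy = pd.DataFrame({
    "Churned": [True, True, False, False, False, False],
    "IncomeCategory": [
        "Less than $40K", "$120K +",
        "Less than $40K", "$60K - $80K", "$120K +", "$60K - $80K",
    ],
})

# normalize="columns" makes each column sum to 1, giving the share of every
# income category within unchurned (False) and churned (True) customers.
income_mix = pd.crosstab(toy["IncomeCategory"], toy["Churned"], normalize="columns")
print(income_mix)
```

Subtracting the `False` column from the `True` column would then give the same kind of difference in proportions as the steps above, for every category at once.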


In [27]:
# Provided Code -- Do NOT Edit!
high_income = '$120K +'

In [28]:
# TODO: Create `churned_customers` and `unchurned_customers` DataFrames
churned_customers = churn[churn['Churned']]
unchurned_customers = churn[~churn['Churned']]

In [29]:
# TODO: Filter for high-income, churned customers
high_income_churned = churned_customers['IncomeCategory'] == high_income

# TODO: Filter for high-income, unchurned customers
high_income_unchurned = unchurned_customers['IncomeCategory'] == high_income

# TODO: Compute difference in means between `high_income_churned` and `high_income_unchurned` 
high_mean_income = high_income_churned.mean() - high_income_unchurned.mean()
high_mean_income

0.006737264543186669

## Wrapping Up

Congratulations -- you've finished digging into the customer churn data set, and have generated significant insights for your organization! Write a paragraph that summarizes your findings and explains which factor(s) you analyzed are predictive of customer churn. 

In [None]:
After analyzing the data, no significant age difference was found between churned and unchurned customers.
Neither dependent count nor income category showed a meaningful difference between the two groups.
Total revolving balance, however, appears to be predictive of credit card churn. Churned customers carry a lower revolving balance,
averaging $672, whereas unchurned customers carry nearly twice that, averaging $1256.
One possible explanation is that some customers open a card to earn reward points for spending a certain amount of money
in a given period of time, and then leave after receiving the rewards.
The credit card provider could attract customers with cashback offers that
increase spending and, with it, total revolving balance, which may help reduce churn.