# Extra Examples - Pivoting

Let's use customer satisfaction for this example: https://www.kaggle.com/johndddddd/customer-satisfaction

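We'll load the data with `pd.read_excel`, which needs an Excel engine installed. On recent pandas versions that is `openpyxl` for `.xlsx` files (older setups used `xlrd`), so you might need to run `pip install openpyxl` and restart your kernel. Be warned: the data takes a long time to load.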

Let's investigate:

1. Assign a numeric ranking to "satisfaction_v2"
2. Pivot to show average satisfaction by gender and class.
3. What is most correlated with satisfaction?
4. Are the online features correlated in count?

In [10]:
import pandas as pd
df = pd.read_excel("satisfaction.xlsx")
df.head()

Unnamed: 0,id,satisfaction_v2,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,11112,satisfied,Female,Loyal Customer,65,Personal Travel,Eco,265,0,0,...,2,3,3,0,3,5,3,2,0,0.0
1,110278,satisfied,Male,Loyal Customer,47,Personal Travel,Business,2464,0,0,...,2,3,4,4,4,2,3,2,310,305.0
2,103199,satisfied,Female,Loyal Customer,15,Personal Travel,Eco,2138,0,0,...,2,2,3,3,4,4,4,2,0,0.0
3,47462,satisfied,Female,Loyal Customer,60,Personal Travel,Eco,623,0,0,...,3,1,1,0,1,4,1,3,0,0.0
4,120011,satisfied,Female,Loyal Customer,70,Personal Travel,Eco,354,0,0,...,4,2,2,0,2,4,2,5,0,0.0


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129880 entries, 0 to 129879
Data columns (total 24 columns):
id                                   129880 non-null int64
satisfaction_v2                      129880 non-null object
Gender                               129880 non-null object
Customer Type                        129880 non-null object
Age                                  129880 non-null int64
Type of Travel                       129880 non-null object
Class                                129880 non-null object
Flight Distance                      129880 non-null int64
Seat comfort                         129880 non-null int64
Departure/Arrival time convenient    129880 non-null int64
Food and drink                       129880 non-null int64
Gate location                        129880 non-null int64
Inflight wifi service                129880 non-null int64
Inflight entertainment               129880 non-null int64
Online support                       129880 non-null int64

## Make satisfaction_v2 numeric

In [14]:
df.satisfaction_v2.unique()

array(['satisfied', 'neutral or dissatisfied'], dtype=object)

In [16]:
mapping = {'satisfied': 1, 'neutral or dissatisfied': 0}
df["satisfaction"] = df.satisfaction_v2.map(mapping)
df

Unnamed: 0,id,satisfaction_v2,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,...,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,11112,satisfied,Female,Loyal Customer,65,Personal Travel,Eco,265,0,0,...,3,3,0,3,5,3,2,0,0.0,1
1,110278,satisfied,Male,Loyal Customer,47,Personal Travel,Business,2464,0,0,...,3,4,4,4,2,3,2,310,305.0,1
2,103199,satisfied,Female,Loyal Customer,15,Personal Travel,Eco,2138,0,0,...,2,3,3,4,4,4,2,0,0.0,1
3,47462,satisfied,Female,Loyal Customer,60,Personal Travel,Eco,623,0,0,...,1,1,0,1,4,1,3,0,0.0,1
4,120011,satisfied,Female,Loyal Customer,70,Personal Travel,Eco,354,0,0,...,2,2,0,2,4,2,5,0,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129875,119211,satisfied,Female,disloyal Customer,29,Personal Travel,Eco,1731,5,5,...,2,3,3,4,4,4,2,0,0.0,1
129876,97768,neutral or dissatisfied,Male,disloyal Customer,63,Personal Travel,Business,2087,2,3,...,3,2,3,3,1,2,1,174,172.0,0
129877,125368,neutral or dissatisfied,Male,disloyal Customer,69,Personal Travel,Eco,2320,3,0,...,4,4,3,4,2,3,2,155,163.0,0
129878,251,neutral or dissatisfied,Male,disloyal Customer,66,Personal Travel,Eco,2450,3,2,...,3,3,2,3,2,1,2,193,205.0,0
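
One thing worth checking after a `map` like this is whether every label was covered, because values missing from the dictionary silently become `NaN`. Here is a minimal sketch on made-up data (the `"unknown"` label is hypothetical and does not appear in the real dataset):

```python
import pandas as pd

# Toy data standing in for the real column; the values are made up.
toy = pd.DataFrame({
    "satisfaction_v2": ["satisfied", "neutral or dissatisfied", "satisfied", "unknown"]
})

mapping = {"satisfied": 1, "neutral or dissatisfied": 0}
toy["satisfaction"] = toy["satisfaction_v2"].map(mapping)

# Labels that are not keys in the mapping come back as NaN, so count them.
n_unmapped = int(toy["satisfaction"].isna().sum())
print(toy)
print("unmapped rows:", n_unmapped)
```

On the real data, `df.satisfaction_v2.unique()` shows only the two mapped labels, so this count should be zero there.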


## Satisfaction based on gender and class

In [21]:
df.pivot_table(values="satisfaction", index="Class", columns="Gender", aggfunc="mean")
# Now that is a huge satisfaction gap between genders in Eco and Eco Plus

Class,Female,Male
Business,0.720628,0.697997
Eco,0.590799,0.19009
Eco Plus,0.57793,0.258493


In [20]:
df.pivot_table(values="satisfaction", index="Class", columns="Gender", aggfunc="count")

Class,Female,Male
Business,31263,30897
Eco,29670,28639
Eco Plus,4966,4445
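
The mean and the count can also be produced in a single call by passing a list to `aggfunc`. This is a small sketch on made-up rows, not the real dataset; the result has two-level columns of the form `(aggregation, Gender)`:

```python
import pandas as pd

# Made-up rows that mimic the three columns used in the pivots above.
toy = pd.DataFrame({
    "Class": ["Eco", "Eco", "Eco", "Business", "Business", "Business"],
    "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
    "satisfaction": [1, 0, 0, 1, 0, 1],
})

# A list of aggregation functions yields one block of columns per function.
summary = toy.pivot_table(
    values="satisfaction",
    index="Class",
    columns="Gender",
    aggfunc=["mean", "count"],
)
print(summary)
```

Seeing the count next to the mean makes it easier to judge whether a low average comes from a small group.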


## What is most correlated with satisfaction?

In [26]:
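# Keep only numeric columns: newer pandas versions raise an error if corr() sees string columns
df.select_dtypes("number").corr()["satisfaction"].sort_values()
# You'd think that delays would have a bigger impact!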

Arrival Delay in Minutes            -0.080691
Departure Delay in Minutes          -0.073909
Flight Distance                     -0.039224
Departure/Arrival time convenient   -0.015507
Gate location                       -0.012071
id                                   0.013728
Age                                  0.117971
Food and drink                       0.120677
Inflight wifi service                0.227062
Seat comfort                         0.242384
Cleanliness                          0.259330
Baggage handling                     0.260347
Checkin service                      0.266179
Leg room service                     0.304928
Online boarding                      0.338147
On-board service                     0.352047
Online support                       0.390143
Ease of Online booking               0.431772
Inflight entertainment               0.523496
satisfaction                         1.000000
Name: satisfaction, dtype: float64
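
To make the ranking step easier to follow, here is a minimal sketch on made-up numbers (the `rating` and `delay` columns are hypothetical stand-ins for the survey columns). It selects the numeric columns, correlates them with the target, and drops the target's correlation with itself:

```python
import pandas as pd

# Made-up data: one string column, two numeric features and a binary target.
toy = pd.DataFrame({
    "Class": ["Eco", "Eco", "Business", "Eco", "Business", "Business"],
    "rating": [1, 2, 2, 4, 5, 5],
    "delay": [30, 0, 45, 10, 0, 5],
    "satisfaction": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every numeric column with the target, sorted ascending.
corr = (
    toy.select_dtypes("number")
    .corr()["satisfaction"]
    .drop("satisfaction")  # the target always correlates 1.0 with itself
    .sort_values()
)
print(corr)
```

Because the survey answers are ordinal ratings, passing `method="spearman"` to `corr` is a reasonable alternative to the default Pearson coefficient.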

## Can we check if Online features have duplicate info?

This is a very open question, so there are many ways to approach it. One is to look at the
correlation between the online features; another is to check how often each combination of ratings occurs.
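
Both ideas can be sketched on a small made-up frame before running them on the full data. The ratings here are invented, and the choice of Spearman correlation is an assumption made because the ratings are ordinal:

```python
import pandas as pd

# Invented ratings for the three online features.
toy = pd.DataFrame({
    "Ease of Online booking": [1, 1, 2, 3, 4, 5, 5, 2],
    "Online support":         [1, 2, 2, 3, 4, 5, 4, 2],
    "Online boarding":        [1, 1, 2, 3, 4, 5, 5, 3],
})

# Idea 1: pairwise rank correlation between the features.
online_corr = toy.corr(method="spearman")
print(online_corr)

# Idea 2: how often each pair of ratings occurs, as a percent of all rows.
freq = 100 * pd.crosstab(
    toy["Ease of Online booking"], toy["Online boarding"], normalize="all"
)
print(freq)
```

If the features carry the same information, the off-diagonal correlations are high and the frequency table is concentrated on its diagonal.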

In [55]:
index = ["Ease of Online booking", "Online support"]
cols = ["Online boarding"]
p = df.pivot_table(index=index, columns=cols, values="satisfaction", aggfunc="count", fill_value=0)
p = 100 * p / p.to_numpy().sum()  # This turns it all into a percent of respondents
p.style.background_gradient(cmap="magma", low=0, high=1)

# The table below shows that a respondent who gives Online boarding a 1 is far more
# likely to also give a 1 to Online support and Ease of Online booking. The same holds
# for 2, 3, 4 and 5, so it seems people are encoding the same info in these answers.

Ease of Online booking,Online support,Online boarding: 0,Online boarding: 1,Online boarding: 2,Online boarding: 3,Online boarding: 4,Online boarding: 5
0,2,0.00307977,0.000769941,0.0,0.0,0.0,0.0
0,3,0.00230982,0.0,0.000769941,0.000769941,0.000769941,0.0
0,4,0.00461965,0.0,0.0,0.000769941,0.0,0.0
1,1,0.0,6.89483,0.0215584,0.0169387,0.0161688,0.00692947
1,2,0.0,0.311056,0.169387,0.0608254,0.0515861,0.00538959
1,3,0.0,0.368802,0.187866,0.311056,0.192485,0.00384971
1,4,0.0,0.335694,0.148599,0.230982,0.375731,0.0908531
1,5,0.0,0.177087,0.00769941,0.0939329,0.0823837,0.183246
2,1,0.0,0.120881,0.445796,0.0292578,0.0292578,0.0277179
2,2,0.0,0.114721,8.64259,0.111642,0.0977826,0.0192485
