# 🚀 BONUS
### Dataset: Loyalty program (`loyalty_program.csv`) 
**Objective:** Evaluate whether the number of flights booked differs significantly across customers' education levels.

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [2]:
pd.set_option('display.max_columns', None) 

df_loyalty= pd.read_csv('../data/processed/loyalty_program.csv', index_col=0)

df_loyalty.head()

Unnamed: 0,Loyalty Number,Year,Month,Flights Booked,Flights with Companions,Total Flights,Distance,Points Accumulated,Points Redeemed,Dollar Cost Points Redeemed,Province,City,Postal Code,Gender,Education,Salary,Marital Status,Loyalty Card,CLV,Enrollment Type,Enrollment Year,Enrollment Month,Cancellation Year,Cancellation Month
0,100018,2017,1,3,0,3,1521,152.0,0,0,Alberta,Edmonton,T9G 1W3,Female,Bachelor,92552.0,Married,Aurora,7919.2,Standard,2016,8,,
1,100102,2017,1,10,4,14,2030,203.0,0,0,Ontario,Toronto,M1R 4K3,Male,College,,Single,Nova,2887.74,Standard,2013,3,,
2,100140,2017,1,6,0,6,1200,120.0,0,0,British Columbia,Dawson Creek,U5I 4F1,Female,College,,Divorced,Nova,2838.07,Standard,2016,7,,
3,100214,2017,1,0,0,0,0,0.0,0,0,British Columbia,Vancouver,V5R 1W3,Male,Bachelor,63253.0,Married,Star,4170.57,Standard,2015,8,,
4,100272,2017,1,0,0,0,0,0.0,0,0,Ontario,Toronto,P1L 8X8,Female,Bachelor,91163.0,Divorced,Star,6622.05,Standard,2014,1,,


## 🫧 Data cleaning (again)

The 'Loyalty Number' column contains duplicated values, since the same customer appears in several rows. Therefore, the first step is to aggregate the data so that each loyalty number appears only once.

In [3]:
df_loyalty['Loyalty Number'].duplicated().sum()

387023

In [4]:
# select columns from df_loyalty 
df_clean = df_loyalty[['Flights Booked', 'Education', 'Loyalty Number']] 

In [5]:
# Pair every unique Loyalty Number with its corresponding Education level.
# unique() returns an array of loyalty numbers per education level;
# explode() turns each element of those arrays into its own row.
df_education = df_loyalty.groupby('Education')['Loyalty Number'].unique().explode().reset_index()

df_education.tail() 

Unnamed: 0,Education,Loyalty Number
16732,Master,934529
16733,Master,942685
16734,Master,946061
16735,Master,985320
16736,Master,998072
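
To illustrate what the `groupby` / `unique` / `explode` chain does, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and are not taken from `loyalty_program.csv`. It also shows `drop_duplicates` as one possible alternative that yields the same customer-to-education pairs.

```python
import pandas as pd

# Toy data (hypothetical values): two customers, several rows each
toy = pd.DataFrame({
    "Loyalty Number": [1, 1, 2, 2, 2],
    "Education": ["Bachelor", "Bachelor", "Master", "Master", "Master"],
})

# unique() yields one array of loyalty numbers per education level;
# explode() then gives each array element its own row
via_explode = (
    toy.groupby("Education")["Loyalty Number"].unique().explode().reset_index()
)

# Alternative: keep one row per (Education, Loyalty Number) pair
via_dedup = (
    toy[["Education", "Loyalty Number"]].drop_duplicates().reset_index(drop=True)
)

print(via_explode)
print(via_dedup)
```

Both results contain one row per customer. The `explode` route leaves 'Loyalty Number' with `object` dtype, while `drop_duplicates` keeps the original integer dtype.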


In [6]:
# Total flights booked per Loyalty Number
df_flighs_booked = df_loyalty.groupby('Loyalty Number')['Flights Booked'].sum().reset_index()

df_flighs_booked.head()

Unnamed: 0,Loyalty Number,Flights Booked
0,100018,157
1,100102,173
2,100140,152
3,100214,79
4,100272,127


In [7]:
# Combine both DataFrames
df_clean = df_flighs_booked.merge(df_education, on='Loyalty Number', how='left')

df_clean.head()

Unnamed: 0,Loyalty Number,Flights Booked,Education
0,100018,157,Bachelor
1,100102,173,College
2,100140,152,College
3,100214,79,Bachelor
4,100272,127,Bachelor


In [8]:
df_clean.duplicated(subset='Loyalty Number').sum()

0
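
The duplicate check above confirms that the merge did not multiply rows. As a complementary sketch, the same guarantee can be requested from `merge` itself through its `validate` argument, after checking that each customer maps to a single education level. The toy frames below are hypothetical stand-ins for the two aggregated tables.

```python
import pandas as pd

# Toy frames (hypothetical values) standing in for the two aggregated tables
flights = pd.DataFrame({"Loyalty Number": [1, 2, 3], "Flights Booked": [10, 0, 7]})
education = pd.DataFrame({
    "Loyalty Number": [1, 2, 3],
    "Education": ["Bachelor", "Master", "College"],
})

# Each customer should map to exactly one education level
levels_per_customer = education.groupby("Loyalty Number")["Education"].nunique()
assert (levels_per_customer == 1).all()

# validate="one_to_one" makes merge raise MergeError if a key repeats on either side
merged = flights.merge(
    education, on="Loyalty Number", how="left", validate="one_to_one"
)

# A left merge on unique keys keeps the row count of the left frame
print(len(merged) == len(flights))
```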

## 👩🏻‍🔬 Statistics and hypothesis testing

In [9]:
df_clean.groupby('Education')['Flights Booked'].describe()

Education,count,mean,std,min,25%,50%,75%,max
Bachelor,10475.0,99.104821,54.231557,0.0,59.0,113.0,139.0,354.0
College,4238.0,100.613025,54.20881,0.0,62.25,113.0,140.0,324.0
Doctor,734.0,100.866485,55.933391,0.0,58.25,115.0,143.0,292.0
High School or Below,782.0,101.014066,54.867491,0.0,62.0,115.0,139.0,265.0
Master,508.0,101.602362,54.023146,0.0,65.0,113.0,141.0,322.0


Customers with a bachelor's degree form by far the largest group (10,475 customers), and the highest individual total (354 flights) also belongs to this group. On average, however, customers booked a similar number of flights regardless of their education level: group means range from about 99 to 102, and medians from 113 to 115.

To test whether these differences are significant, a hypothesis test is performed. To do so, education levels are first re-categorized into two groups, 'Basic' and 'Superior'.

In [10]:
df_clean['Education Numerical'] = df_clean['Education'].map({'High School or Below': 1, 'Bachelor': 2, 'College': 2, 'Master': 3, 'Doctor': 4})

In [11]:
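# pd.cut assigns labels from the lowest bin to the highest:
# codes 1-2 -> 'Basic', codes 3-4 -> 'Superior'
df_clean['Education Category'] = pd.cut(df_clean['Education Numerical'], bins=2, labels=['Basic', 'Superior'])

In [12]:
education_cat = ['Basic', 'Superior']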
flights_by_ed = []

for ed in education_cat:
    flights = df_clean['Flights Booked'][df_clean['Education Category'] == ed]
    flights_by_ed.append(flights)
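
Because the grouping depends on how `pd.cut` splits the numeric codes, a small sketch helps to see which education levels end up in each category. It uses the same 1 to 4 coding as the mapping above; the label order shown here is chosen for illustration.

```python
import pandas as pd

# Numeric codes as in the mapping above: 1 = High School or Below,
# 2 = Bachelor / College, 3 = Master, 4 = Doctor
codes = pd.Series([1, 2, 2, 3, 4])

# With bins=2 the range [1, 4] is split at 2.5. Labels are applied
# from the lowest bin to the highest, so their order matters.
category = pd.cut(codes, bins=2, labels=["Basic", "Superior"])

print(category.tolist())
# ['Basic', 'Basic', 'Basic', 'Superior', 'Superior']
```

Codes 1 and 2 fall in the lower bin and codes 3 and 4 in the upper bin, so the first label always goes to the lower education levels.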

In [13]:
import sys
sys.path.append('../')

In [14]:
from src import support_stats as stats

In [15]:
test_hypothesis = stats.hypothesis_test(*flights_by_ed)


 **Hypothesis Test Results** 
✅ Normality Test: No
  - Normality per group: [False, False]
✅ Variance Test: Equal (p = 0.4879)
✅ Applied Test: Mann-Whitney U (non-parametric test)
 Statistic: 9415797.0000, p-value: 0.2071
 Conclusion: Fail to reject H0 (No significant differences)



First, the Kolmogorov-Smirnov test is performed to assess normality: neither the 'Superior' nor the 'Basic' group follows a normal distribution. Levene's test then indicates that the variances of the two groups are equal (p = 0.488).
Finally, because the normality assumption is not met, the non-parametric Mann-Whitney U test is carried out, yielding a p-value of 0.207. Thus, there is no significant difference in the number of flights booked between the two education groups.
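
The code of `support_stats.hypothesis_test` is not shown in this notebook. The sketch below is therefore only an assumption of how such a helper could chain the three steps described above with SciPy. The function name `sketch_two_group_test`, the `alpha=0.05` threshold, and the synthetic data are all hypothetical.

```python
import numpy as np
from scipy import stats as sps


def sketch_two_group_test(a, b, alpha=0.05):
    """Hypothetical stand-in for a helper like hypothesis_test."""
    # Normality: KS test of each standardized sample against N(0, 1).
    # Estimating mean and std from the sample makes this test conservative.
    normal = [
        sps.kstest((g - np.mean(g)) / np.std(g, ddof=1), "norm").pvalue > alpha
        for g in (a, b)
    ]
    # Homogeneity of variances
    equal_var = sps.levene(a, b).pvalue > alpha

    if all(normal):
        name = "t-test"
        result = sps.ttest_ind(a, b, equal_var=equal_var)
    else:
        name = "Mann-Whitney U"
        result = sps.mannwhitneyu(a, b, alternative="two-sided")

    return {
        "normal": normal,
        "equal_var": equal_var,
        "test": name,
        "statistic": result.statistic,
        "pvalue": result.pvalue,
    }


# Synthetic, clearly non-normal data (not the loyalty dataset)
rng = np.random.default_rng(0)
group_a = rng.exponential(scale=100, size=500)
group_b = rng.exponential(scale=100, size=500)

print(sketch_two_group_test(group_a, group_b))
```

With skewed samples like these, the normality check fails and the sketch falls back to the Mann-Whitney U test, mirroring the path taken in the analysis above.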