# Starbucks customer loyalty prediction

**Introduction**

Survey analysis results have a significant impact on business profitability in today's markets. When running a coffee shop, it is vital to understand as soon as possible which services clients prefer most and which need to be improved.

In this kernel I would like to estimate the factors that really influence the customer's decision (loyalty) to continue visiting Starbucks.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
# plot_roc_curve was removed in scikit-learn 1.2 and is not used below, so it is not imported
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/starbucks-customer-retention-malaysia-survey/Starbucks satisfactory survey encode cleaned.csv')
df.head()

First, let's examine the target feature that we will predict:

0 - customer is not loyal

1 - customer is loyal

In [None]:
df['loyal'].value_counts()

We can see that the feature `loyal` is imbalanced, so we need to account for that when we do the analysis.

In [None]:
# The next step is to assess the correlation between the independent features and the target variable

plt.figure(figsize=(12,12))
cor = df.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

Here we observe that some columns do not display any correlation values, so they have no impact on the target feature and we can drop them.
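
One likely reason a column shows no correlation at all is that it never varies. The sketch below uses a small made-up DataFrame (the column names `rating` and `constant` and all values are illustrative, not taken from the survey) to show that pandas reports `NaN` for the correlation of a constant column, because its standard deviation is zero.

```python
import pandas as pd

# Toy data: 'constant' never varies, 'rating' loosely tracks the target
toy = pd.DataFrame({
    'rating':   [1, 2, 3, 4, 5, 4],
    'constant': [1, 1, 1, 1, 1, 1],
    'loyal':    [0, 0, 1, 1, 1, 1],
})

# Correlation of every feature with the target column
target_corr = toy.corr()['loyal'].drop('loyal')
print(target_corr)

# Features whose correlation with the target is undefined (NaN)
undefined = target_corr[target_corr.isna()].index.tolist()
print(undefined)
```

Such columns are the ones a zero-variance filter would remove.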

In [None]:
df.columns

**Dropping constant features:**

This step removes constant features, which carry no information for the problem at hand. scikit-learn's `VarianceThreshold` feature selector removes all low-variance features. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
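
As a minimal sketch of the idea on made-up data (the columns `a`, `b`, `c` are purely illustrative), `VarianceThreshold(threshold=0)` keeps only the columns whose values actually vary:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [7, 7, 7, 7],   # constant column, zero variance
    'c': [0, 1, 0, 1],
})

selector = VarianceThreshold(threshold=0)
selector.fit(toy)

# get_support() returns a boolean mask over the input columns
kept = toy.columns[selector.get_support()].tolist()
dropped = [col for col in toy.columns if col not in kept]
print(kept)
print(dropped)
```

No target column is passed to `fit`, which is what makes the selector usable without labels.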

In [None]:
def drop_constant_features(df):
    # Drop zero-variance (constant) features in place and return the DataFrame
    from sklearn.feature_selection import VarianceThreshold
    var_thres = VarianceThreshold(threshold=0)
    var_thres.fit(df)
    kept_columns = df.columns[var_thres.get_support()]
    constant_columns = [column for column in df.columns
                        if column not in kept_columns]
    df.drop(constant_columns, axis=1, inplace=True)
    return df

In [None]:
drop_constant_features(df)

In [None]:
df.columns

In [None]:
df.isnull().sum()

There are no null values in our DataFrame.

In [None]:
plt.figure(figsize=(20,10))
c= df.corr()
sns.heatmap(c)

The heatmap shows that some features have similar correlation coefficients (the orange square in the bottom-right corner). We therefore need to drop some of them, or combine them through feature engineering, to reduce the number of predictors; otherwise the model may overfit.

First, I will take care of the target variable (loyalty): its classes are imbalanced, so we need to balance them.

In [None]:
from sklearn.model_selection import train_test_split
X = df.drop(columns = ['loyal'])
y = df['loyal']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

In [None]:

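# Using SMOTE to balance the dataset
from imblearn.over_sampling import SMOTE
from collections import Counter

# Named `smote` rather than `os` so the imported `os` module is not shadowed
smote = SMOTE(random_state=0)
X_train_os, y_train_os = smote.fit_resample(X_train, y_train)
# Caution: the test set is oversampled too, so metrics computed on it
# do not reflect the original class balance
X_test_os, y_test_os = smote.fit_resample(X_test, y_test)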
print("The number of y_train classes before fit {}".format(Counter(y_train)))
print("The number of y_train classes after fit {}".format(Counter(y_train_os)))
print("The number of y_test classes before fit {}".format(Counter(y_test)))
print("The number of y_test classes after fit {}".format(Counter(y_test_os)))
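
SMOTE, used in the cell above, balances classes by creating synthetic minority samples that lie between an existing minority sample and one of its nearest minority neighbours. The sketch below illustrates only that interpolation idea with NumPy and scikit-learn; it is not imblearn's implementation, and the helper name `smote_like_oversample` and the toy points are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between
    minority samples and their nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    # Pick a random base sample and one of its k true neighbours
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]
    # Random position on the segment between base and neighbour
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[neigh] - X_min[base])

X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                       [1.0, 1.0], [0.5, 0.5]])
X_synth = smote_like_oversample(X_minority, n_new=10)
print(X_synth.shape)
```

Because every synthetic point is a blend of two real minority samples, it always stays inside the region those samples span.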

There are two popular feature selection techniques that can be used for categorical input data and a categorical (class) target variable.

They are:

1. Chi-Squared Statistic. 
2. Mutual Information Statistic.
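
To make the two statistics concrete, here is a small sketch on synthetic data (the arrays `informative` and `noise` are invented for illustration): one feature copies the target and the other is unrelated to it. Both scores should rank the informative feature first, although their scales differ.

```python
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=200)

informative = y_toy.copy()               # perfectly tracks the target
noise = rng.integers(0, 3, size=200)     # unrelated to the target
X_toy = np.column_stack([informative, noise])

# chi2 expects non-negative features such as counts or encoded categories
chi2_scores, p_values = chi2(X_toy, y_toy)

# discrete_features=True tells the estimator the inputs are categorical
mi_scores = mutual_info_classif(X_toy, y_toy, discrete_features=True,
                                random_state=0)

print(chi2_scores)
print(mi_scores)
```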


Chi-Squared Statistic.

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

fs = SelectKBest(score_func=chi2, k='all')
fs.fit(X_train, y_train)
X_train_fs = fs.transform(X_train_os)
X_test_fs = fs.transform(X_test_os)

In [None]:
# Print the chi-squared score of each feature
for i, score in enumerate(fs.scores_):
    print('Feature %d: %f' % (i, score))


In [None]:
scores = pd.Series(fs.scores_)

scores.index = X_train_os.columns
scores.sort_values(ascending=False)

In [None]:
# let's plot the ordered chi-squared scores per feature
scores.sort_values(ascending=False).plot.bar(figsize=(20, 8))
plt.xticks(fontsize= 22)
plt.show()

Mutual Information Statistic.

In [None]:
from sklearn.feature_selection import mutual_info_classif
# determine the mutual information
mutual_info = mutual_info_classif(X_train_os, y_train_os)
mutual_info

In [None]:
mutual_info = pd.Series(mutual_info)
mutual_info.index = X_train_os.columns
mutual_info.sort_values(ascending=False)

In [None]:
#let's plot the ordered mutual_info values per feature
mutual_info.sort_values(ascending=False).plot.bar(figsize=(20, 8))
plt.xticks(fontsize= 22)
plt.show()

In [None]:
X_train_os.columns

Unfortunately, it is not clear which features I should use for predictions, because the two aforementioned algorithms did not give the same result. We therefore need another approach to reduce the number of features in our dataframe. Before doing that, I will show the overfitting that results from using all features to predict the target variable.

In [None]:
from sklearn.neighbors import KNeighborsClassifier  
from sklearn.ensemble import RandomForestClassifier

estimators = {
    'KNeighborsClassifier' :[KNeighborsClassifier()],
    'Random Forest' :[RandomForestClassifier()]
}


def mfit(estimators, X_train_os, y_train_os):
    # Fit every estimator on the (oversampled) training data passed in
    for m in estimators:
        estimators[m][0].fit(X_train_os, y_train_os)
        print(m+' fitted')

mfit(estimators, X_train_os, y_train_os)

In [None]:
def mpredict(estimators, X_test_os, y_test_os):
    outcome = dict()
    r_a_score = dict()
    for m in estimators:
        y_pred = estimators[m][0].predict(X_test_os)
        #r_a_score[m] = roc_auc_score(y_test_os, y_pred)
        # scikit-learn metrics expect (y_true, y_pred) in this order
        outcome[m] = [y_pred, confusion_matrix(y_test_os, y_pred), classification_report(y_test_os, y_pred)]
    return outcome, r_a_score

outcome, r_a_score = mpredict(estimators, X_test_os, y_test_os)
for m in outcome:
    print('------------------------'+m+'------------------------')
    print(outcome[m][1])
    print(outcome[m][2])

Amazing results! We got 92% accuracy with high precision and recall values. The problem is that this model is very sensitive to variations in the feature values, meaning that on an unseen dataset we will likely get lower accuracy, recall and precision.
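
One common way to check for this kind of sensitivity is to compare accuracy on the training data with accuracy on held-out data; a large gap suggests overfitting. The sketch below shows the comparison on a synthetic dataset from `make_classification` (it is not the survey data, and the sizes and noise level are arbitrary assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: 20 features, only 3 of them informative, 10% label noise
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=3, flip_y=0.1,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.33, random_state=1)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Accuracy on data the model has seen vs. data it has not seen
train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)
print(f'train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}')
```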

However, we do not know which features to choose, as feature selection did not give us reliable results. Yes, we could choose the one feature that ranks highly in both algorithms (price rate) and classify based on this feature alone. However, we also want to examine the other service needs that must be improved, so ideally we would like to use 30-40% of the features. If the number of variables in the data is very high, regression models tend to perform badly, and identifying the important variables becomes challenging. In this scenario, we try to reduce the number of variables.

Let's say I do not want to simply drop features, as I am not sure whether mutual information or chi-squared gives me reliable results. Instead, I will attempt to group the columns and perform feature reduction with factor analysis.



# **Factor analysis**

Factor analysis is widely used in market research, advertising, psychology, finance, and operations research. Market researchers use factor analysis to identify price-sensitive customers, to identify brand features that influence consumer choice, and to understand channel selection criteria for the distribution channel.

**Assumptions:**

1. There are no outliers in the data.
2. The sample size should be greater than the number of factors.
3. There should not be perfect multicollinearity.
4. There should not be homoscedasticity between the variables.

**Factor Analysis implementation:**

Factor Extraction: In this step, the number of factors and the extraction approach are selected using variance partitioning methods such as principal components analysis and common factor analysis.

Factor Rotation: In this step, rotation tries to convert factors into uncorrelated factors; the main goal of this step is to improve the overall interpretability. Many rotation methods are available, such as Varimax, Quartimax, and Promax.

Factor Analysis vs. Principal Component Analysis

1. PCA components explain the maximum amount of variance, while factor analysis explains the covariance in data.
2. PCA components are fully orthogonal to each other, whereas factor analysis does not require factors to be orthogonal.
3. A PCA component is a linear combination of the observed variables, while in FA the observed variables are linear combinations of the unobserved variables or factors.
4. PCA components are uninterpretable. In FA, underlying factors are labelable and interpretable. PCA is a kind of dimensionality reduction method, whereas factor analysis is a latent variable method.
5. PCA is a type of factor analysis. PCA is observational, whereas FA is a modeling technique.

The info is taken from this [reference](https://www.datacamp.com/community/tutorials/introduction-factor-analysis)
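
Two of the differences listed above can be seen in a short sketch. It uses scikit-learn's `PCA` and `FactorAnalysis` (not the `factor_analyzer` package used later in this notebook) on synthetic data in which two hidden factors drive six observed variables; the data and the name `fa_demo` are assumptions made only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(0)
# Two hidden factors drive six observed variables, plus independent noise
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
observed = latent @ mixing + 0.5 * rng.normal(size=(500, 6))

pca = PCA(n_components=2).fit(observed)
fa_demo = FactorAnalysis(n_components=2, random_state=0).fit(observed)

# PCA directions are orthonormal: components @ components.T is the identity
gram = pca.components_ @ pca.components_.T
print(np.round(gram, 6))

# FA models each observed variable as loadings @ factors + its own noise
print(fa_demo.components_.shape)       # loading matrix, factors x variables
print(fa_demo.noise_variance_.shape)   # one noise variance per variable
```

The per-variable noise term is what separates the factor model from PCA, which has no explicit noise model for individual variables.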


Before applying factor analysis we need to make sure that our dataframe is suitable for it.
There are two tests to be performed:

1. The Kaiser-Meyer-Olkin (KMO) test checks sampling adequacy for the overall data set. The KMO Measure of Sampling Adequacy is a statistic that indicates the proportion of variance in your variables that might be caused by underlying factors. High values (close to 1.0) generally indicate that a factor analysis may be useful with your data. If the value is less than 0.50, the results of the factor analysis probably won't be very useful. (https://www.ibm.com/docs/en/spss-statistics/23.0.0?topic=detection-kmo-bartletts-test)

2. Bartlett's test of sphericity tests the hypothesis that your correlation matrix is an identity matrix, which would indicate that your variables are unrelated and therefore unsuitable for structure detection. Small values (less than 0.05) of the significance level indicate that a factor analysis may be useful with your data. (https://www.ibm.com/docs/en/spss-statistics/23.0.0?topic=detection-kmo-bartletts-test)
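
For intuition, Bartlett's statistic can be computed by hand from the correlation matrix $R$ of $n$ observations on $p$ variables as $\chi^2 = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln|R|$ with $p(p-1)/2$ degrees of freedom. The sketch below implements this textbook formula with NumPy and SciPy on synthetic data; the function name `bartlett_sphericity` and the demo arrays are hypothetical, and the notebook itself relies on `factor_analyzer` for the real test.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist

def bartlett_sphericity(data):
    """Textbook Bartlett test: is the correlation matrix an identity?"""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    statistic = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(corr))
    dof = p * (p - 1) / 2
    # Survival function = probability of a statistic at least this large
    p_value = chi2_dist.sf(statistic, dof)
    return statistic, p_value

rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 1))
correlated = shared + 0.5 * rng.normal(size=(200, 4))   # one common factor
independent = rng.normal(size=(200, 4))                 # no common factor

stat_corr, p_corr = bartlett_sphericity(correlated)
stat_ind, p_ind = bartlett_sphericity(independent)
print(stat_corr, p_corr)
print(stat_ind, p_ind)
```

Variables that share a common factor produce a far larger statistic than unrelated ones.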

In [None]:
pip install factor_analyzer

In [None]:
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity,calculate_kmo

In [None]:
# Named `chi_square_value` so the `chi2` function imported from sklearn is not overwritten
chi_square_value, p = calculate_bartlett_sphericity(df)
print("Bartlett Sphericity Test")
print("Chi squared value : ", chi_square_value)
print("p value : ", p)
if p < 0.05:
    print('FA might be useful to reduce the number of features')
else:
    print('FA might NOT be useful to reduce the number of features')

In [None]:
# calculate_kmo returns the per-variable KMO values and the overall KMO
kmo_all, kmo_model = calculate_kmo(df)
print(kmo_model)

We got an overall KMO value of 0.6960887256821134. Values above 0.5 justify performing factor analysis.

In [None]:
df.columns
x =df.drop('loyal', axis = 1)
x

In [None]:
df.shape

In [None]:
# Fit a factor model on the survey features to obtain the eigenvalues
from factor_analyzer import FactorAnalyzer
fa = FactorAnalyzer()
fa.fit(x)
# Get the eigenvalues and plot them
ev, v = fa.get_eigenvalues()
plt.figure(figsize=(15,10))
plt.scatter(range(1, x.shape[1]+1), ev, s=800)


plt.title('Scree Plot', fontsize = 20)
plt.xlabel('Factors', fontsize = 20)
plt.ylabel('Eigen Value', fontsize = 20)
plt.xticks(np.arange(0, 20, 1.0))

plt.grid()

The scree plot suggests that the closest number of factors is 7, so we will reduce the number of columns (features) from 19 to 7.
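
The eigenvalues in a scree plot are those of the correlation matrix of the features. One common heuristic for reading them, the Kaiser criterion, keeps the factors whose eigenvalue exceeds 1; the original analysis does not state which rule it applied, so treat this as an assumption. A minimal NumPy sketch on synthetic data (the `answers` array is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hidden factors drive eight observed variables, plus noise
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 8))
answers = latent @ mixing + rng.normal(size=(300, 8))

# Eigenvalues of the correlation matrix, sorted from largest to smallest
corr = np.corrcoef(answers, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Kaiser criterion: retain factors with eigenvalue greater than 1
n_retained = int((eigenvalues > 1).sum())
print(np.round(eigenvalues, 2))
print(n_retained)
```

The eigenvalues always sum to the number of variables, so an eigenvalue above 1 means a factor explains more variance than a single original variable.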

What are the factor loadings?

The factor loadings form a matrix that shows the relationship of each variable to each underlying factor. Each entry is the correlation coefficient between an observed variable and a factor, and so it indicates how much of that variable's variance the factor explains.

In [None]:
x.shape
x.head()

In [None]:
fa = FactorAnalyzer(7, rotation='varimax')
fa.fit(x)
print(pd.DataFrame(fa.loadings_,index=x.columns))
#loads = fa.loadings_
#print(loads)

In short, we can reduce the dimensionality to 7 groups, which means we need to create 7 new columns in a new dataframe.

In [None]:
print(pd.DataFrame(fa.get_communalities(),index=x.columns,columns=['Communalities']))

In [None]:
# Average communality across all variables
fa.get_communalities().mean()

We need to check the average communality. As MacCallum (2000, 2001) suggested, the average communality should be no less than 0.5 for 120 samples. In our case the value is lower, but the difference is not critical (0.47 vs 0.5), so I will use it in our study.
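
A variable's communality is the sum of its squared loadings across all factors, that is, the share of its variance that the factors explain together. The sketch below shows the arithmetic on a hypothetical 4 x 2 loading matrix (the numbers are invented, not the notebook's results):

```python
import numpy as np

# Hypothetical loading matrix: 4 variables (rows) x 2 factors (columns)
loadings = np.array([
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.6],
    [0.3, 0.3],
])

# Communality of each variable = sum of its squared loadings
communalities = (loadings ** 2).sum(axis=1)
average_communality = communalities.mean()

print(np.round(communalities, 2))
print(average_communality)
```

In this toy matrix the last variable has a low communality, so the two factors capture little of its variance.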

In [None]:
df.head()

In [None]:
pip install pingouin

In [None]:
import pingouin as pg
#Create the factors
factor1 = df[['productRate', 'priceRate', 'promoRate']] #service
factor2 = df[['gender', 'age', 'status']] #customer general info
factor3 = df[['location', 'visitNo']] #location
factor4 = df[['income', 'membershipCard']] #money
factor5 = df[['ambianceRate', 'serviceRate', 'wifiRate']] #service1
factor6 = df[['timeSpend', 'method']] #Inside




#Get cronbach alpha
factor1_alpha = pg.cronbach_alpha(factor1)
factor2_alpha = pg.cronbach_alpha(factor2)
factor3_alpha = pg.cronbach_alpha(factor3)
factor4_alpha = pg.cronbach_alpha(factor4)
factor5_alpha = pg.cronbach_alpha(factor5)
factor6_alpha = pg.cronbach_alpha(factor6)

print(factor1_alpha, factor2_alpha, factor3_alpha, factor4_alpha, factor5_alpha, factor6_alpha)
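
Cronbach's alpha, computed above with `pingouin`, measures how consistently a group of items moves together. For $k$ items the usual formula is $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_{total}^2}\right)$, where $\sigma_i^2$ is the variance of item $i$ and $\sigma_{total}^2$ is the variance of the row sums. The sketch below implements that formula directly; the helper `cronbach_alpha_manual` and the `scores` table are hypothetical.

```python
import numpy as np

def cronbach_alpha_manual(items):
    """Cronbach's alpha for a table with one row per respondent
    and one column per item."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Made-up ratings from five respondents on three related items
scores = np.array([
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [1, 2, 1],
    [3, 3, 4],
])
alpha = cronbach_alpha_manual(scores)
print(round(alpha, 3))
```

Values close to 1 indicate that the items in a group measure the same underlying construct.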

In [None]:
new_variables = fa.fit_transform(x)
new_variables

In [None]:
df['Service'] = new_variables[:, 0]
df['Customer_info'] = new_variables[:, 1]
df['Location'] = new_variables[:, 2]
df['Money'] = new_variables[:, 3]
df['Service_1'] = new_variables[:, 4]
df['Inside'] = new_variables[:, 5]

In [None]:
df.head()

In [None]:
df_factorized = df[['Service', 'Customer_info', 'Location', 'Money','Service_1','Inside', 'loyal']]

In [None]:
df_factorized.head()

In [None]:
df_factorized.shape

In [None]:
from sklearn.model_selection import train_test_split
X = df_factorized.drop(columns = ['loyal'])
y = df_factorized['loyal']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

In [None]:

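# Using SMOTE to balance the dataset
from imblearn.over_sampling import SMOTE
from collections import Counter

# Named `smote` rather than `os` so the imported `os` module is not shadowed
smote = SMOTE(random_state=0)
X_train_os, y_train_os = smote.fit_resample(X_train, y_train)
# Caution: the test set is oversampled too, so metrics computed on it
# do not reflect the original class balance
X_test_os, y_test_os = smote.fit_resample(X_test, y_test)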
print("The number of y_train classes before fit {}".format(Counter(y_train)))
print("The number of y_train classes after fit {}".format(Counter(y_train_os)))
print("The number of y_test classes before fit {}".format(Counter(y_test)))
print("The number of y_test classes after fit {}".format(Counter(y_test_os)))

In [None]:
from xgboost import  XGBClassifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier


estimators = {
    'KNeighborsClassifier' :[KNeighborsClassifier()],
    'Random Forest' :[RandomForestClassifier()],
    'XGBoost' :[XGBClassifier()],
    'Logistic Regression': [LogisticRegression()],
    'GaussianNB' :[GaussianNB()],
    'Gradient Boost' :[GradientBoostingClassifier()],
    'Decision Tree' :[DecisionTreeClassifier()],
}


def mfit(estimators, X_train_os, y_train_os):
    # Fit every estimator on the (oversampled) training data passed in
    for m in estimators:
        estimators[m][0].fit(X_train_os, y_train_os)
        print(m+' fitted')

mfit(estimators, X_train_os, y_train_os)

In [None]:
def mpredict(estimators, X_test_os, y_test_os):
    outcome = dict()
    r_a_score = dict()
    for m in estimators:
        y_pred = estimators[m][0].predict(X_test_os)
        #r_a_score[m] = roc_auc_score(y_test_os, y_pred)
        # scikit-learn metrics expect (y_true, y_pred) in this order
        outcome[m] = [y_pred, confusion_matrix(y_test_os, y_pred), classification_report(y_test_os, y_pred)]
    return outcome, r_a_score

outcome, r_a_score = mpredict(estimators, X_test_os, y_test_os)
for m in outcome:
    print('------------------------'+m+'------------------------')
    print(outcome[m][1])
    print(outcome[m][2])