In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data = pd.read_csv('../input/acmehappinesssurvey2020/ACME-HappinessSurvey2020.csv')

In [3]:
data.head()

Data Description:

Y = target attribute (Y) with values indicating 0 (unhappy) and 1 (happy) customers

X1 = my order was delivered on time

X2 = contents of my order was as I expected

X3 = I ordered everything I wanted to order

X4 = I paid a good price for my order

X5 = I am satisfied with my courier

X6 = the app makes ordering easy for me

In [4]:
# Check for information contained in dataset
data.info()

You can see that we have a very small dataset, and `info()` already suggests that we don't have any missing values. We can confirm that by running the cell below:

In [5]:
data.isnull().sum()

We can see that all the features have no missing entries. 

Please also remember that all our features contain ordinal categorical data, so a statistical summary is not necessary, and we could change the data types to string.
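
If the ratings are to be treated as categories, an ordered categorical dtype keeps the ranking that a plain string conversion would lose. Here is a minimal sketch on a hypothetical three-row frame; the values and the 1-5 scale are assumptions for illustration, and the real `data` frame is not modified:

```python
import pandas as pd

# Hypothetical stand-in for two survey columns; the values are made up.
sample = pd.DataFrame({"X1": [3, 5, 1], "X2": [2, 4, 4]})

# Assumed 1-5 rating scale, declared explicitly so the order is preserved.
rating_scale = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
ordinal_sample = sample.astype(rating_scale)

print(ordinal_sample.dtypes)
# Ordered categories still support comparisons, unlike plain strings.
print(ordinal_sample["X1"] >= 3)
```

The scaling step further down expects numeric input, so a conversion like this would be done on a copy used for exploration only.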

In [6]:
# Ordered everything the customer wanted vs The app makes ordering easy for them
sns.countplot(x="X3", hue="X6", data=data);

Among the individuals who ordered the least of what they wanted to order, only 2 found the app extremely difficult to use, and more found the app easy to use.
So maybe they didn't order much because the items they wanted were out of stock, or above their budget, or the pictures weren't clear enough for them to buy. There could be a number of reasons.

Not many individuals who ordered everything they wanted to order found the app easy to use.
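
Exact counts are hard to read off a grouped bar chart, and a cross-tabulation gives the numbers behind the plot. The sketch below uses a small made-up frame in place of `data`, so the counts are illustrative only:

```python
import pandas as pd

# Made-up ratings standing in for the X3 and X6 columns.
sample = pd.DataFrame({
    "X3": [1, 1, 3, 3, 5, 5, 5],
    "X6": [1, 4, 4, 5, 3, 5, 5],
})

# Rows: X3 rating, columns: X6 rating, cells: number of respondents.
counts = pd.crosstab(sample["X3"], sample["X6"])
print(counts)
```

With the real data, `pd.crosstab(data["X3"], data["X6"])` would give the table behind the plot above.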

In [7]:
sns.countplot(x="X1", hue="X2", data=data);

You can see from the graph above that the customer whose order took the longest to arrive also found that the contents were not as expected.
Most of the orders that were delivered the fastest also had contents that were not as the customers expected.

In [8]:
sns.countplot(x="X5", data=data);

Most customers are indeed satisfied with their courier.

In [9]:
sns.countplot(x="X4", data=data);

More customers than not did indeed pay a good price for their order.

#### Dependent And Independent Variables

In [10]:
x = data.drop('Y', axis=1)
y = data.Y

#### Scaling Data To Make Sure Our Features Are Within The Same Scale So Our Models Run Faster

In [11]:
from sklearn.preprocessing import MinMaxScaler

In [12]:
scaler = MinMaxScaler(copy=True, feature_range=(0,1))
X = scaler.fit_transform(x)
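
For reference, `MinMaxScaler` maps each column to the range [0, 1] using (x - min) / (max - min). Here is a small sketch on assumed 1-5 ratings, checked against the manual formula:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Assumed ratings on a 1-5 scale, one feature column.
ratings = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(ratings)

# Same result computed by hand: (x - min) / (max - min).
manual = (ratings - ratings.min()) / (ratings.max() - ratings.min())

print(scaled.ravel())  # 0, 0.25, 0.5, 0.75, 1
print(np.allclose(scaled, manual))
```

Tree-based models split on thresholds, so this rescaling mainly matters for the logistic regression model.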

In [13]:
## Let's check the data
print('Independent variables: ', X[:5])
print('\nDependent variables: ', y[:5])

#### Train And Test Data

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
# An integer test_size is an absolute count: 30 rows go to the test set
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=30, random_state=33)

In [16]:
print('x_train shape is ', x_train.shape)

In [17]:
print('y_train shape is ', y_train.shape)

In [18]:
print('x_test shape is ', x_test.shape)

In [19]:
print('y_test shape is ', y_test.shape)
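
Note that an integer `test_size` is an absolute number of test rows, not a percentage. The sketch below uses a hypothetical 100-row array to show the resulting shapes. The `stratify` argument is an optional extra that is not used in this notebook; it keeps the class ratio the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 6 features, balanced binary target.
features = np.arange(600).reshape(100, 6)
target = np.array([0, 1] * 50)

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=30, random_state=33, stratify=target
)

print(f_train.shape, f_test.shape)  # (70, 6) (30, 6)
print(t_test.mean())  # share of class 1 in the test part
```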

#### Train Models

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

In [21]:
# criterion='entropy'
randomForestClassifierModel = RandomForestClassifier(criterion='gini', n_estimators=100, max_depth=2, random_state=33)
randomForestClassifierModel.fit(x_train, y_train)

In [22]:
# Calculating Details
print('RandomForestClassifierModel Train score is : ', randomForestClassifierModel.score(x_train, y_train))

In [23]:
print('RandomForestClassifierModel Test score is : ', randomForestClassifierModel.score(x_test, y_test))

In [24]:
y_pred_random = randomForestClassifierModel.predict(x_test)

In [25]:
# DecisionTreeClassifier
decisionTreeModel = DecisionTreeClassifier(criterion='gini', max_depth=1, random_state=33)
decisionTreeModel.fit(x_train, y_train)

In [26]:
print('DecisionTreeClassifier Train score is : ', decisionTreeModel.score(x_train, y_train))

In [27]:
print('DecisionTreeClassifier Test score is : ', decisionTreeModel.score(x_test, y_test))

In [28]:
y_pred_decision = decisionTreeModel.predict(x_test)

In [29]:
# Logistic Regression
logisticRegression = LogisticRegression()
logisticRegression.fit(x_train, y_train)

In [30]:
y_pred_logistic = logisticRegression.predict(x_test)

#### Check Model Performance

Now, let us see how our 3 models performed by using the **confusion matrix** *(a table that shows the number of records our model predicted correctly and incorrectly)*, the **F1 score** *(the harmonic mean of precision and recall, which tells us how well our model captures positive cases)*, the **precision score** *(the share of positively predicted labels that are actually correct)*, and **accuracy** *(the share of all predictions that are correct)*.
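
To make these definitions concrete, the sketch below computes the metrics for a made-up set of ten labels. It checks the scikit-learn values against the formulas precision = TP / (TP + FP), recall = TP / (TP + FN) and F1 = 2 * precision * recall / (precision + recall):

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels: 1 = happy, 0 = unhappy.
true_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
pred_labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(true_labels, pred_labels).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print('TN, FP, FN, TP:', tn, fp, fn, tp)
print('precision:', precision, precision_score(true_labels, pred_labels))
print('recall   :', recall, recall_score(true_labels, pred_labels))
print('f1       :', f1, f1_score(true_labels, pred_labels))
```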

In [31]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score

In [32]:
ConfusionMatrixDisplay.from_estimator(randomForestClassifierModel, x_test, y_test)
plt.show();

In [33]:
ConfusionMatrixDisplay.from_estimator(decisionTreeModel, x_test, y_test)
plt.show();

In [34]:
precisionScore_random = precision_score(y_test, y_pred_random, average='micro')
print('Precision score for RandomForestClassifier is ', precisionScore_random)

In [35]:
precisionScore_decision = precision_score(y_test, y_pred_decision, average='micro')
print('Precision score for DecisionTreeClassifier is ', precisionScore_decision)

Because we passed `average='micro'`, this precision score is computed over both classes together, which for a two-class problem is the same number as the accuracy. The score for the random forest classifier therefore tells us that the model classified only 56 percent of the test customers correctly; the rest were incorrectly predicted.
From this, we can tell that this model performed poorly.

The score for the decision tree classifier tells us that this model classified 73 percent of the test customers correctly. From this, we can see that this model performed ok.
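
The relationship between micro-averaged precision and accuracy can be checked on made-up labels. In the sketch below, `average='micro'` gives the same value as `accuracy_score`, while `average='binary'` (the default) gives the precision for the happy class alone:

```python
from sklearn.metrics import accuracy_score, precision_score

# Made-up labels: 1 = happy, 0 = unhappy.
true_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
pred_labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

micro = precision_score(true_labels, pred_labels, average='micro')
binary = precision_score(true_labels, pred_labels, average='binary')
accuracy = accuracy_score(true_labels, pred_labels)

print('micro precision :', micro)   # same value as accuracy
print('accuracy        :', accuracy)
print('binary precision:', binary)  # precision for class 1 only
```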

In [36]:
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
print('Logistic Regression Accuracy: ', accuracy_logistic)

You can see from the accuracy score that the logistic regression model also performed poorly, because it predicted only 63 percent of the test data correctly.
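
Whether 63 percent is poor depends on what a trivial model would score. One possible reference point is a majority-class baseline, sketched here with scikit-learn's `DummyClassifier` on made-up data. The 60/40 class split is an assumption, not the survey's actual ratio:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up data: the feature values are irrelevant to a majority-class baseline.
features = np.zeros((10, 6))
target = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(features, target)

baseline_accuracy = baseline.score(features, target)
print('Baseline accuracy: ', baseline_accuracy)
```

In this notebook the baseline would be fit on `x_train, y_train` and scored on `x_test, y_test`. A model is only useful if it beats that number.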

In [37]:
f1Score_random = f1_score(y_test, y_pred_random, average='weighted')
print('F1 Score for randomForestClassifier model is ', f1Score_random)

In [38]:
f1Score_decision = f1_score(y_test, y_pred_decision, average='weighted')
print('F1 Score for decisionTreeClassifier model is ', f1Score_decision)

### Apply Feature Selection

* Chi-Squared Statistic
* Mutual Information Statistic

In [39]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [40]:
# Select the most relevant features
selectKBest = SelectKBest(score_func=chi2, k='all')
selectKBest.fit(x_train, y_train)

In [41]:
x_train_kBest = selectKBest.transform(x_train)
x_test_kBest = selectKBest.transform(x_test)

In [42]:
for i in range(len(selectKBest.scores_)):
    print('Feature %d: %f '% (i, selectKBest.scores_[i]))

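We can see that feature 1 (column X2, since the printed index starts at 0) may be considered insignificant compared with the other features in our dataset. Note, however, that `k='all'` keeps every feature, so `transform` does not actually drop anything. Let us train our models again on the transformed data to see if the performance of our models will change.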
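
A quick way to check what `SelectKBest` actually keeps is to compare shapes after `transform`. The sketch below uses random non-negative stand-in data (an assumption; `chi2` requires non-negative features) and contrasts `k='all'` with `k=5`:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Random stand-in data: 40 rows, 6 features with values 1-5, binary target.
rng = np.random.default_rng(33)
features = rng.integers(1, 6, size=(40, 6)).astype(float)
target = np.array([0, 1] * 20)

keep_all = SelectKBest(score_func=chi2, k='all').fit(features, target)
keep_five = SelectKBest(score_func=chi2, k=5).fit(features, target)

print(keep_all.transform(features).shape)   # (40, 6): nothing is dropped
print(keep_five.transform(features).shape)  # (40, 5): one feature removed
print(keep_five.get_support())              # boolean mask of the kept columns
```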

In [43]:
# Logistic Regression
logisticRegression = LogisticRegression()
logisticRegression.fit(x_train_kBest, y_train)

Let's now evaluate our logistic model built on the data after the feature selection step.

In [44]:
y_predict_logistic = logisticRegression.predict(x_test_kBest)

In [45]:
accuracy_logistic = accuracy_score(y_test, y_predict_logistic)
print('Logistic Regression Accuracy With Feature Selection: ', accuracy_logistic)

There has been no change in the performance of our logistic regression model. This is expected: with `k='all'` the selector keeps all six features, so the model was trained on the same data as before.

In [46]:
# RandomForestClassifier model
randomForest = RandomForestClassifier(criterion='gini', n_estimators=100, max_depth=2, random_state=33)
randomForest.fit(x_train_kBest, y_train)

In [47]:
y_pred_random = randomForest.predict(x_test_kBest)

In [48]:
precisionScore_random = precision_score(y_test, y_pred_random, average='micro')
print('Precision score for RandomForestClassifier is ', precisionScore_random)

In [49]:
# DecisionTreeClassifier model
decisionTree = DecisionTreeClassifier(criterion='gini', max_depth=1, random_state=33)
decisionTree.fit(x_train_kBest, y_train)

In [50]:
y_pred_decision = decisionTree.predict(x_test_kBest)

In [51]:
precisionScore_decision = precision_score(y_test, y_pred_decision, average='micro')
print('Precision score for DecisionTreeClassifier is ', precisionScore_decision)

We can see that the performance of all 3 of our models did not change after applying feature selection, i.e. the random forest classifier still predicts only 56 percent of the test customers correctly and the decision tree classifier still predicts 73 percent.
This is because `SelectKBest` was run with `k='all'`, which keeps every feature; to actually drop the lowest-scoring feature, `k` would have to be set to 5.

Based on these results, we can see that the decision tree model performed best.