<h1> 5. Predicting Customer Satisfaction Using Logistic Regression</h1>
<h3><b> Preprocessing Steps:</b></h3>
<ul>
    <li>Handle missing values (e.g., fill missing values with median).</li>
    <li>Encode categorical variables (e.g., one-hot encoding for region).</li>
    <li>Standardize numerical features.</li>
</ul>
<h3><b> Task:</b> Implement logistic regression to predict customer satisfaction and evaluate the model using accuracy and confusion matrix. </h3>

In [32]:
# Importing Libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder   # For label encoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

In [15]:
# Loading the dataset
customer_dataset = pd.read_csv('Datasets\\CustomerSatisfaction.csv')
print(customer_dataset.shape, '\n')
customer_dataset.head()

(10616, 5) 



Unnamed: 0,Customer ID,Overall Delivery Experience (Rating),Food Quality (Rating),Speed of Delivery (Rating),Order Accuracy
0,1,5.0,3.0,4.0,Yes
1,2,3.0,4.0,3.0,Yes
2,3,4.0,5.0,2.0,Yes
3,4,5.0,3.0,4.0,Yes
4,5,2.0,5.0,1.0,Yes


In [16]:
# Printing the basic statistics of dataset
customer_dataset.describe()

Unnamed: 0,Customer ID,Overall Delivery Experience (Rating),Food Quality (Rating),Speed of Delivery (Rating)
count,10616.0,10198.0,10364.0,10377.0
mean,5308.5,3.32526,3.332015,3.322926
std,3064.719563,1.419754,1.414709,1.408918
min,1.0,1.0,1.0,1.0
25%,2654.75,2.0,2.0,2.0
50%,5308.5,3.0,3.0,3.0
75%,7962.25,5.0,5.0,5.0
max,10616.0,5.0,5.0,5.0


In [17]:
# Printing information of dataset
customer_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10616 entries, 0 to 10615
Data columns (total 5 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Customer ID                           10616 non-null  int64  
 1   Overall Delivery Experience (Rating)  10198 non-null  float64
 2   Food Quality (Rating)                 10364 non-null  float64
 3   Speed of Delivery (Rating)            10377 non-null  float64
 4   Order Accuracy                        9956 non-null   object 
dtypes: float64(3), int64(1), object(1)
memory usage: 414.8+ KB


<h2>Data Preprocessing</h2>

<h3>1. Handling Missing Values</h3>

In [18]:
# Checking for missing values in the dataset
customer_dataset.isnull().sum()

Customer ID                               0
Overall Delivery Experience (Rating)    418
Food Quality (Rating)                   252
Speed of Delivery (Rating)              239
Order Accuracy                          660
dtype: int64

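-> 'Overall Delivery Experience' has 418 missing values, 'Food Quality' has 252, 'Speed of Delivery' has 239, and 'Order Accuracy' has 660. The three ratings are ordinal scores from 1 to 5 and 'Order Accuracy' is a Yes/No category, so we treat every column as categorical and use mode imputation instead of the median suggested in the preprocessing steps. Note that this also fills the 660 missing values of the target, 'Order Accuracy', with its most frequent class.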

In [19]:
# Every column is treated as categorical here, so each one is filled with its most frequent value (mode).
customer_dataset.fillna(customer_dataset.mode().iloc[0], inplace=True)

In [20]:
# Checking missing values after imputation
print(customer_dataset.shape, '\n')
customer_dataset.isnull().sum()

(10616, 5) 



Customer ID                             0
Overall Delivery Experience (Rating)    0
Food Quality (Rating)                   0
Speed of Delivery (Rating)              0
Order Accuracy                          0
dtype: int64

-> All missing values have now been imputed with the mode of their column.
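-> To make the imputation step concrete, here is a minimal sketch of what `mode().iloc[0]` followed by `fillna` does. The toy frame and its values are hypothetical and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (not the real dataset)
toy = pd.DataFrame({
    "Food Quality (Rating)": [5.0, 3.0, 5.0, np.nan],
    "Order Accuracy": ["Yes", "No", "Yes", np.nan],
})

# mode() returns one row per tied mode; iloc[0] keeps the first mode of each column
column_modes = toy.mode().iloc[0]

# fillna with a Series fills each column with the value stored under that column's name
toy_filled = toy.fillna(column_modes)
print(toy_filled)
```

Each missing cell receives the most frequent value of its own column, which is why a single call can handle both the numeric ratings and the Yes/No column.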

<h3>2. Encoding Categorical Variables</h3>

In [21]:
# Identifying the non-numeric categorical variables in the dataset
categorical_features = customer_dataset.select_dtypes(include=['object']).columns
print('\nCategorical Variables\n', categorical_features)

# Printing categories in each feature
for feature in categorical_features:
    print('\nFeature:', feature)
    print(customer_dataset[feature].value_counts())


Categorical Variables
 Index(['Order Accuracy'], dtype='object')

Feature: Order Accuracy
Order Accuracy
Yes    7771
No     2845
Name: count, dtype: int64


In [26]:
# Applying label encoding (the column is binary, so a single 0/1 column is enough)
encoder = LabelEncoder()

# Encoding the 'Order Accuracy' feature in place.
# LabelEncoder expects a 1-D input, so we pass the column as a Series;
# passing a one-column DataFrame triggers a DataConversionWarning.
customer_dataset['Order Accuracy'] = encoder.fit_transform(customer_dataset['Order Accuracy'])

customer_dataset.head()



Unnamed: 0,Customer ID,Overall Delivery Experience (Rating),Food Quality (Rating),Speed of Delivery (Rating),Order Accuracy
0,1,5.0,3.0,4.0,1
1,2,3.0,4.0,3.0,1
2,3,4.0,5.0,2.0,1
3,4,5.0,3.0,4.0,1
4,5,2.0,5.0,1.0,1


In [27]:
# Checking the datatypes of each feature
customer_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10616 entries, 0 to 10615
Data columns (total 5 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Customer ID                           10616 non-null  int64  
 1   Overall Delivery Experience (Rating)  10616 non-null  float64
 2   Food Quality (Rating)                 10616 non-null  float64
 3   Speed of Delivery (Rating)            10616 non-null  float64
 4   Order Accuracy                        10616 non-null  int64  
dtypes: float64(3), int64(2)
memory usage: 414.8 KB


-> The object column 'Order Accuracy' is now an integer column (No = 0, Yes = 1), so the categorical variable has been successfully encoded.
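-> The mapping between labels and integer codes can be checked through the encoder's `classes_` attribute. A minimal sketch on a toy Series (hypothetical values):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with hypothetical answers
answers = pd.Series(["Yes", "No", "Yes", "Yes"], name="Order Accuracy")

encoder = LabelEncoder()
encoded = encoder.fit_transform(answers)   # 1-D input, so no warning is raised

# classes_ is sorted, and the position of each label is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'No': 0, 'Yes': 1}
```

Because the classes are sorted alphabetically, 'No' becomes 0 and 'Yes' becomes 1, which matters later when reading the confusion matrix.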

<h3>3. Standardize numerical features</h3>

-> The three rating features are ordinal scores on the same 1-to-5 scale and are treated as categorical in this notebook, so standardization is skipped here.
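-> If the ratings were instead treated as numeric, standardization could be applied as sketched below. This is not part of the notebook's pipeline; the arrays are hypothetical toy values, and the scaler is fitted on the training rows only so that no test information leaks into it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rating rows: delivery experience, food quality, speed of delivery
X_train_toy = np.array([
    [5.0, 3.0, 4.0],
    [3.0, 4.0, 3.0],
    [4.0, 5.0, 2.0],
    [2.0, 5.0, 1.0],
])
X_test_toy = np.array([[5.0, 3.0, 4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test_toy)        # reuse the training mean/std

print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
```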

<h2>Model Training</h2>

In [29]:
# Separating features and target variable
X = customer_dataset.drop(['Customer ID', 'Order Accuracy'], axis=1)   # 'Customer ID' is only an identifier, and 'Order Accuracy' is the target variable.
Y = customer_dataset['Order Accuracy']

# Splitting the dataset into train and test data in 80/20 ratio
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
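-> Because 'Yes' is much more frequent than 'No' (7771 vs. 2845 after imputation), a stratified split may be worth considering so that the train and test sets keep the same class ratio. A minimal sketch on toy labels (hypothetical data); `stratify` is an existing `train_test_split` parameter, but the split above does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 75 positives and 25 negatives (hypothetical)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 75 + [0] * 25)

# stratify=y_toy keeps the 75/25 class ratio in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Share of positives in train:", y_tr.mean())
print("Share of positives in test: ", y_te.mean())
```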

In [30]:
# Initializing and fitting the logistic regression model
lr_model = LogisticRegression()
lr_model.fit(X_train, Y_train)

In [38]:
# Predicting the target variable
Y_pred = lr_model.predict(X_test)

<h2>Model Evaluation</h2>

<h3>1. Accuracy Score</h3>

In [33]:
# Computing the accuracy of the model
accuracy = accuracy_score(Y_test, Y_pred)
print("Accuracy of the Model:", accuracy)

Accuracy of the Model: 0.7335216572504708


<h3>2. Confusion Matrix</h3>

In [34]:
# Calculating confusion matrix of the model
c_matrix = confusion_matrix(Y_test, Y_pred)
print("Confusion Matrix:\n", c_matrix)

Confusion Matrix:
 [[   0  566]
 [   0 1558]]


<p><b>True Positives (TP):</b> The number of correctly predicted positive cases (1558). <br>
<b>False Positives (FP):</b> The number of incorrectly predicted positive cases (566). In this case, these are cases where the model predicted 'yes', but the actual outcome was 'no'. <br>
<b>True Negatives (TN):</b> The number of correctly predicted negative cases (0). <br>
<b>False Negatives (FN):</b> The number of incorrectly predicted negative cases (0). In this case, these are cases where the model predicted 'no', but the actual outcome was 'yes'. <br></p>
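-> The four counts can be read directly from the matrix and turned into the usual metrics. A minimal sketch that re-enters the matrix printed above by hand; the variable names other than `c_matrix` are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
c_matrix = np.array([[0, 566],
                     [0, 1558]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = c_matrix.ravel()

accuracy = (tp + tn) / c_matrix.sum()
recall = tp / (tp + fn)            # share of actual 'yes' cases that were found
precision = tp / (tp + fp)         # share of predicted 'yes' cases that were right
specificity = tn / (tn + fp)       # share of actual 'no' cases that were found
majority_share = (tp + fn) / c_matrix.sum()   # accuracy of always predicting 'yes'

print("Accuracy:      ", round(accuracy, 4))
print("Recall:        ", round(recall, 4))
print("Precision:     ", round(precision, 4))
print("Specificity:   ", round(specificity, 4))
print("Majority share:", round(majority_share, 4))
```

The accuracy and the majority share coincide, which shows that the reported accuracy is exactly what a model that always answers 'yes' would achieve on this test set.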

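-> The model achieves an accuracy of 73.35%, but the confusion matrix shows that it predicts 'yes' for every test case. Recall is therefore perfect (1558 of 1558), while none of the 566 negative cases is identified, and the accuracy equals the share of positive cases in the test set (1558 / 2124). In other words, the model does no better than always predicting the majority class, which suggests a need for better class balance or threshold adjustments.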
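-> The two remedies mentioned above can be sketched as follows. This is only an illustration on synthetic ratings (hypothetical data generated in the code, not the notebook's dataset), and the threshold of 0.75 is an assumed value that would need tuning; `class_weight` and `predict_proba` are existing `LogisticRegression` features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000

# Synthetic 1-5 ratings for three features (hypothetical data)
X = rng.integers(1, 6, size=(n, 3)).astype(float)
# 'Yes' (1) is the majority class and slightly more likely for high first ratings
p_yes = 0.55 + 0.08 * (X[:, 0] - 1)
y = (rng.random(n) < p_yes).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline: default settings, as in the notebook
plain = LogisticRegression().fit(X_train, y_train)
cm_plain = confusion_matrix(y_test, plain.predict(X_test))

# Option 1: reweight classes inversely to their frequency
balanced = LogisticRegression(class_weight="balanced").fit(X_train, y_train)
cm_balanced = confusion_matrix(y_test, balanced.predict(X_test))

# Option 2: keep the baseline model but require a higher probability for 'yes'
threshold = 0.75  # assumed value, to be tuned on validation data
y_pred_thr = (plain.predict_proba(X_test)[:, 1] >= threshold).astype(int)
cm_thr = confusion_matrix(y_test, y_pred_thr)

print("Default:\n", cm_plain)
print("class_weight='balanced':\n", cm_balanced)
print("Threshold", threshold, ":\n", cm_thr)
```

Both options trade some recall on the 'yes' class for the ability to flag 'no' cases; whether that trade is worthwhile depends on how costly each type of error is.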

<hr>