# Product Category Preference Prediction


### Reg No: IT21126888
### Name: Senadheera P.V.P.P

<hr/>

<ul>
    <li><b>Target Variable:</b> Product line (Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel)</li>
    <li><b>Predictors:</b> Branch, Customer type, Gender, Date, Time, Payment, COGS, and Gross income</li>
    <li><b>Objective:</b> Predict the likely product category preferences of customers based on their demographics and purchase patterns, assisting in inventory management and marketing strategies.</li>
</ul>




In [83]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Importing Dataset

In [84]:
dataset = pd.read_csv('../dataset/supermarket_sales.csv')

In [85]:
dataset

Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,13:08,Ewallet,522.83,4.761905,26.1415,9.1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.8200,80.2200,3/8/2019,10:29,Cash,76.40,4.761905,3.8200,9.6
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,324.31,4.761905,16.2155,7.4
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,23.2880,489.0480,1/27/2019,20:33,Ewallet,465.76,4.761905,23.2880,8.4
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,604.17,4.761905,30.2085,5.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,233-67-5758,C,Naypyitaw,Normal,Male,Health and beauty,40.35,1,2.0175,42.3675,1/29/2019,13:46,Ewallet,40.35,4.761905,2.0175,6.2
996,303-96-2227,B,Mandalay,Normal,Female,Home and lifestyle,97.38,10,48.6900,1022.4900,3/2/2019,17:16,Ewallet,973.80,4.761905,48.6900,4.4
997,727-02-1313,A,Yangon,Member,Male,Food and beverages,31.84,1,1.5920,33.4320,2/9/2019,13:22,Cash,31.84,4.761905,1.5920,7.7
998,347-56-2442,A,Yangon,Normal,Male,Home and lifestyle,65.82,1,3.2910,69.1110,2/22/2019,15:33,Cash,65.82,4.761905,3.2910,4.1


Data Pre-processing

In [86]:
dataset = dataset[['Branch', 'Customer type', 'Gender', 'Payment', 'cogs', 'gross income', 'Product line']]

In [87]:
dataset

Unnamed: 0,Branch,Customer type,Gender,Payment,cogs,gross income,Product line
0,A,Member,Female,Ewallet,522.83,26.1415,Health and beauty
1,C,Normal,Female,Cash,76.40,3.8200,Electronic accessories
2,A,Normal,Male,Credit card,324.31,16.2155,Home and lifestyle
3,A,Member,Male,Ewallet,465.76,23.2880,Health and beauty
4,A,Normal,Male,Ewallet,604.17,30.2085,Sports and travel
...,...,...,...,...,...,...,...
995,C,Normal,Male,Ewallet,40.35,2.0175,Health and beauty
996,B,Normal,Female,Ewallet,973.80,48.6900,Home and lifestyle
997,A,Member,Male,Cash,31.84,1.5920,Food and beverages
998,A,Normal,Male,Cash,65.82,3.2910,Home and lifestyle


In [88]:
# Checking for missing values
hasNullValues = dataset.isnull().values.any()
if(hasNullValues == False):
    print("Dataset doesn't have any null value.")
else:
    print("Dataset has null values.")
    nullCount = dataset.isnull().sum().sum()
    print("Number of null values:", nullCount)

Dataset doesn't have any null value.


In [89]:
# Checking the columns in dataframe
dataset.columns

Index(['Branch', 'Customer type', 'Gender', 'Payment', 'cogs', 'gross income',
       'Product line'],
      dtype='object')

In [90]:
# Checking the unique values
print("Branch:", dataset.Branch.unique())
print("Customer type:", dataset['Customer type'].unique())
print("Gender:", dataset.Gender.unique())
print("Payment:", dataset.Payment.unique())
print("Product line", dataset['Product line'].unique())

Branch: ['A' 'C' 'B']
Customer type: ['Member' 'Normal']
Gender: ['Female' 'Male']
Payment: ['Ewallet' 'Cash' 'Credit card']
Product line ['Health and beauty' 'Electronic accessories' 'Home and lifestyle'
 'Sports and travel' 'Food and beverages' 'Fashion accessories']


In [91]:
# Label Encoding
labelEncoder = LabelEncoder()
dataset['Branch'] = labelEncoder.fit_transform(dataset['Branch'])
dataset['Customer type'] = labelEncoder.fit_transform(dataset['Customer type'])
dataset['Gender'] = labelEncoder.fit_transform(dataset['Gender'])
dataset['Payment'] = labelEncoder.fit_transform(dataset['Payment'])
dataset['Product line'] = labelEncoder.fit_transform(dataset['Product line'])
dataset

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataset['Branch'] = labelEncoder.fit_transform(dataset['Branch'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataset['Customer type'] = labelEncoder.fit_transform(dataset['Customer type'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataset['Gender'] = labelEncoder.fit_transform(dataset['Gen

Unnamed: 0,Branch,Customer type,Gender,Payment,cogs,gross income,Product line
0,0,0,0,2,522.83,26.1415,3
1,2,1,0,0,76.40,3.8200,0
2,0,1,1,1,324.31,16.2155,4
3,0,0,1,2,465.76,23.2880,3
4,0,1,1,2,604.17,30.2085,5
...,...,...,...,...,...,...,...
995,2,1,1,2,40.35,2.0175,3
996,1,1,0,2,973.80,48.6900,4
997,0,0,1,0,31.84,1.5920,2
998,0,1,1,0,65.82,3.2910,4


In [92]:
# Normalization
def min_max_normalize(data):
    min_val = np.min(data)
    max_val = np.max(data)
    data_norm = (data - min_val) / (max_val - min_val)
    return data_norm

In [93]:
dataset['cogs'] = min_max_normalize(dataset['cogs'])
dataset['gross income'] = min_max_normalize(dataset['gross income'])
dataset

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataset['cogs'] = min_max_normalize(dataset['cogs'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataset['gross income'] = min_max_normalize(dataset['gross income'])


Unnamed: 0,Branch,Customer type,Gender,Payment,cogs,gross income,Product line
0,0,0,0,2,0.521616,0.521616,3
1,2,1,0,0,0.067387,0.067387,0
2,0,1,1,1,0.319628,0.319628,4
3,0,0,1,2,0.463549,0.463549,3
4,0,1,1,2,0.604377,0.604377,5
...,...,...,...,...,...,...,...
995,2,1,1,2,0.030707,0.030707,3
996,1,1,0,2,0.980465,0.980465,4
997,0,0,1,0,0.022049,0.022049,2
998,0,1,1,0,0.056622,0.056622,4


Unnamed: 0,Branch,Customer type,Gender,Payment,cogs,gross income,Product line
0,0,0,0,2,0.521616,0.521616,3
1,2,1,0,0,0.067387,0.067387,0
2,0,1,1,1,0.319628,0.319628,4
3,0,0,1,2,0.463549,0.463549,3
4,0,1,1,2,0.604377,0.604377,5
...,...,...,...,...,...,...,...
995,2,1,1,2,0.030707,0.030707,3
996,1,1,0,2,0.980465,0.980465,4
997,0,0,1,0,0.022049,0.022049,2
998,0,1,1,0,0.056622,0.056622,4


In [82]:
# Dividing X and Y sets
Y = dataset['Product line']
X = dataset.drop(['Product line'], axis=1)

In [10]:
# Visualizing X and Y sets
print(X)
print(Y)

     Branch  Customer type  Gender  Payment    cogs  gross income
0         0              0       0        2  522.83       26.1415
1         2              1       0        0   76.40        3.8200
2         0              1       1        1  324.31       16.2155
3         0              0       1        2  465.76       23.2880
4         0              1       1        2  604.17       30.2085
..      ...            ...     ...      ...     ...           ...
995       2              1       1        2   40.35        2.0175
996       1              1       0        2  973.80       48.6900
997       0              0       1        0   31.84        1.5920
998       0              1       1        0   65.82        3.2910
999       0              0       0        0  618.38       30.9190

[1000 rows x 6 columns]
0      3
1      0
2      4
3      3
4      5
      ..
995    3
996    4
997    2
998    4
999    1
Name: Product line, Length: 1000, dtype: int32


In [11]:
# Splitting train and test sets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=22)

Model Training and Evaluation

In [12]:
# Classifiers
logisticRegression = LogisticRegression(solver='lbfgs', max_iter=400)
decisionTreeClassifier = DecisionTreeClassifier()
randomForestClassifier = RandomForestClassifier()
svc = SVC()
kneighborsClassifier = KNeighborsClassifier()

In [14]:
model1 = logisticRegression.fit(x_train, y_train)
model2 = decisionTreeClassifier.fit(x_train, y_train)
model3 = randomForestClassifier.fit(x_train, y_train)
model4 = svc.fit(x_train, y_train)
model5 = kneighborsClassifier.fit(x_train, y_train)

In [15]:
model1_pred = model1.predict(x_test)
model2_pred = model2.predict(x_test)
model3_pred = model3.predict(x_test)
model4_pred = model4.predict(x_test)
model5_pred = model5.predict(x_test)

In [19]:
model1_accuracy = accuracy_score(y_test, model1_pred)
model2_accuracy = accuracy_score(y_test, model2_pred)
model3_accuracy = accuracy_score(y_test, model3_pred)
model4_accuracy = accuracy_score(y_test, model4_pred)
model5_accuracy = accuracy_score(y_test, model5_pred)

In [20]:
print(model1_accuracy)
print(model2_accuracy)
print(model3_accuracy)
print(model4_accuracy)
print(model5_accuracy)

0.135
0.145
0.16
0.14
0.16


In [73]:
y_pred_model1 = logisticRegression.predict(x_test)

In [79]:
np.min(y_pred_model1)

0

In [43]:
for clf in classifiers:
    # Train the classifier
    clf.fit(x_train, y_train)

    # Make predictions on the test set
    y_pred = clf.predict(x_test)

    # Evaluate accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Print the results
    print(f"{clf.__class__.__name__} Accuracy: {accuracy:.2f}")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression Accuracy: 0.14
DecisionTreeClassifier Accuracy: 0.14
RandomForestClassifier Accuracy: 0.16
SVC Accuracy: 0.14
KNeighborsClassifier Accuracy: 0.16


In [44]:
classifiers[0].fit(x_train, y_train)
y_pred = classifiers[0].predict(x_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [125]:
dataset['Branch'] = labelEncoder.inverse_transform(dataset['Branch'])
dataset

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataset['Branch'] = labelEncoder.inverse_transform(dataset['Branch'])


Unnamed: 0,Branch,Customer type,Gender,Payment,cogs,gross income,Product line
0,A,Member,Female,Ewallet,522.83,26.1415,Health and beauty
1,C,Normal,Female,Cash,76.40,3.8200,Electronic accessories
2,A,Normal,Male,Credit card,324.31,16.2155,Home and lifestyle
3,A,Member,Male,Ewallet,465.76,23.2880,Health and beauty
4,A,Normal,Male,Ewallet,604.17,30.2085,Sports and travel
...,...,...,...,...,...,...,...
995,C,Normal,Male,Ewallet,40.35,2.0175,Health and beauty
996,B,Normal,Female,Ewallet,973.80,48.6900,Home and lifestyle
997,A,Member,Male,Cash,31.84,1.5920,Food and beverages
998,A,Normal,Male,Cash,65.82,3.2910,Home and lifestyle
