>**First and foremost**: We read the data from the .csv file and load it into a pandas DataFrame so that it can be used for classification.

In [7]:
import pandas as pd

df = pd.read_csv("/content/AmazonData.csv")


>**Initialization:**
We set up the components needed for the classification:
*   Classifiers: we initialize three different classifiers:
    1.   Logistic Regression
    2.   Support Vector Machine (SVM)
    3.   Multi-Layer Perceptron (MLP)
*   The KFold method, with five shuffled splits, so that each fold has a distinct train-test split.
*   The TF-IDF method to represent the text descriptions. This technique transforms the text into a numerical format that captures the importance of each term relative to the corpus.

In [8]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import numpy as np
import sklearn.feature_extraction.text as sk_text


logistic_regression = LogisticRegression(solver='lbfgs')
svm_classifier = SVC()
mlp_classifier = MLPClassifier(solver='lbfgs',max_iter = 1000)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

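# Replace missing descriptions with empty strings so the vectorizer can process them
df['description'] = df['description'].fillna('')
# TF-IDF over the whole corpus; below it only tells kf.split how many rows there are
vectorizer = sk_text.TfidfVectorizer(min_df=1)
X_tfidf = vectorizer.fit_transform(df['description'])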

# Dictionaries for organizing the metrics of each model
metrics = {
    'LogisticRegression': {'confusion_matrices': [],'accuracies': [],'precisions': [],'recalls': [],'f1_scores': []},
    'SVC': {'confusion_matrices': [],'accuracies': [],'precisions': [],'recalls': [],'f1_scores': []},
    'MLPClassifier': {'confusion_matrices': [],'accuracies': [],'precisions': [],'recalls': [],'f1_scores': []}
}

>**Model Training and Testing**:
*   Each classifier was trained on the training data of the respective fold and evaluated on the test data. The TF-IDF vectorizer was refit on the training descriptions of each fold.
*   In each fold, the metrics of every classifier were calculated and stored in a dictionary.
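
The fold-wise procedure described above can also be written with a scikit-learn pipeline, which refits the vectorizer on the training part of every fold automatically. The following is only a sketch of that alternative formulation, not the code used in this notebook. The data in `toy_texts` and `toy_labels` is made up, and four folds are used only because the toy set is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df['description'] and df['category']
toy_texts = [
    "car charger 12v", "vehicle lighter adapter",
    "car dc adapter", "auto charger 12v",
    "wall charger ac", "home travel charger",
    "wall ac adapter", "desktop dock charger",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The pipeline fits TF-IDF on the training fold only, then applies it to the test fold
pipeline = make_pipeline(
    TfidfVectorizer(min_df=1),
    LogisticRegression(solver="lbfgs"),
)
toy_kf = KFold(n_splits=4, shuffle=True, random_state=42)

cv_results = cross_validate(pipeline, toy_texts, toy_labels, cv=toy_kf, scoring="accuracy")
print("Accuracy per fold:", cv_results["test_score"])
```

The explicit loop in the next cell is still needed if the per-fold confusion matrices and per-class scores are to be stored.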



In [9]:
fold_num = 0
for train_index, test_index in kf.split(X_tfidf):
    fold_num += 1
    # kf.split yields positional indices, so rows are selected with .iloc
    X_train, X_test = df['description'].iloc[train_index], df['description'].iloc[test_index]
    y_train, y_test = df['category'].iloc[train_index], df['category'].iloc[test_index]

    # Fit TF-IDF on the training fold only, so no test-fold vocabulary leaks into training
    vectorizer = sk_text.TfidfVectorizer(min_df=1)
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)

    predictors = [logistic_regression,svm_classifier,mlp_classifier]
    print("Fold:",fold_num)
    for model in predictors:
        model.fit(X_train_tfidf, y_train)
        predictions = model.predict(X_test_tfidf)

        model_name = type(model).__name__
        print(model_name,model.score(X_test_tfidf,y_test))

        # Store metrics for each model
        metrics[model_name]['confusion_matrices'].append(confusion_matrix(y_test, predictions))
        metrics[model_name]['accuracies'].append(accuracy_score(y_test, predictions))
        metrics[model_name]['precisions'].append(precision_score(y_test, predictions, average=None))
        metrics[model_name]['recalls'].append(recall_score(y_test, predictions, average=None))
        metrics[model_name]['f1_scores'].append(f1_score(y_test, predictions, average=None))


Fold: 1
LogisticRegression 0.8460992907801419
SVC 0.8524822695035461
MLPClassifier 0.8148936170212766
Fold: 2
LogisticRegression 0.8460992907801419
SVC 0.8560283687943262
MLPClassifier 0.8304964539007093
Fold: 3
LogisticRegression 0.849645390070922
SVC 0.8638297872340426
MLPClassifier 0.8361702127659575
Fold: 4
LogisticRegression 0.8489361702127659
SVC 0.8602836879432624
MLPClassifier 0.8262411347517731
Fold: 5
LogisticRegression 0.8445706174591909
SVC 0.8516678495386799
MLPClassifier 0.8076650106458482


>**Performance Metrics:**
For each classifier, we computed the average confusion matrix across the five folds, the mean accuracy, and the mean per-class precision, recall, and F1-score.
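
As a small illustration of this averaging step, the sketch below takes two made-up confusion matrices (hypothetical values, not the results of this notebook) and averages them element-wise with `axis=0`, the same call used in the next cell. Per-class precision is derived from each matrix by hand, assuming scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Hypothetical confusion matrices from two folds (rows: true class, columns: predicted class)
fold_confusions = [
    np.array([[8, 2], [1, 9]]),
    np.array([[6, 4], [3, 7]]),
]

# Per-class precision in each fold: correct predictions divided by the column total
fold_precisions = [np.diag(cm) / cm.sum(axis=0) for cm in fold_confusions]

# axis=0 averages across folds and keeps the per-class structure
mean_confusion = np.mean(fold_confusions, axis=0)
mean_precision = np.mean(fold_precisions, axis=0)

print("Mean confusion matrix:\n", mean_confusion)
print("Mean precision per class:", mean_precision)
```

Because the matrices are averaged, their entries are generally not integers.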

In [10]:
for model_name, model_metrics in metrics.items():
    print(f"Results for {model_name}:")
    print("Confusion Matrices:", np.mean(model_metrics['confusion_matrices'], axis=0))
    print("Accuracies:", np.mean(model_metrics['accuracies']))
    print("Precisions:", np.mean(model_metrics['precisions'], axis=0))
    print("Recalls:", np.mean(model_metrics['recalls'], axis=0))
    print("F1 Scores:", np.mean(model_metrics['f1_scores'], axis=0))
    print("\n")


Results for LogisticRegression:
Confusion Matrices: [[753.6  98.6]
 [117.  440.6]]
Accuracies: 0.8470701518606326
Precisions: [0.86560504 0.81707219]
Recalls: [0.88436425 0.79023685]
F1 Scores: [0.87484743 0.80335139]


Results for SVC:
Confusion Matrices: [[746.6 105.6]
 [ 96.2 461.4]]
Accuracies: 0.8568583926027715
Precisions: [0.88583254 0.8136113 ]
Recalls: [0.87611652 0.82744518]
F1 Scores: [0.88093876 0.8204487 ]


Results for MLPClassifier:
Confusion Matrices: [[730.6 121.6]
 [127.8 429.8]]
Accuracies: 0.8230932858171128
Precisions: [0.851177   0.77910507]
Recalls: [0.85734601 0.77043361]
F1 Scores: [0.8542492  0.77474224]




>For the **Logistic Regression** classifier trained in the final fold, we identified the 20 words with the largest positive weights and the 20 words with the most negative weights.

In [11]:
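# logistic_regression and vectorizer still hold their state from the final fold
coefficients = logistic_regression.coef_[0]
feature_names = vectorizer.get_feature_names_out()
order = coefficients.argsort()  # ascending order of weight

pos = order[-20:][::-1]  # indices of the 20 largest weights, largest first
top_pos = feature_names[pos]

neg = order[:20]  # indices of the 20 most negative weights, most negative first
top_neg = feature_names[neg]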

print("20 words with biggest positive weight:\n", top_pos)
print("20 words with smallest negative weight:\n", top_neg)

20 words with biggest positive weight:
 ['wall' 'travel' 'battery' 'ac' 'home' '240v' 'desktop' 'qi' 'dock'
 'receiver' '60hz' 'original' 'of' '100' 'nokia' 'solar' 'station' 'case'
 'is' 'pin']
20 words with smallest negative weight:
 ['car' 'vehicle' 'lighter' '12v' 'cigarette' 'dc' '24v' '12' 'dual'
 'indicated' 'auto' 'led' 'retractable' 'road' 'in' 'system' 'powered'
 'title' 'cars' 'very']
