**This is the continuation of the document "Data_Analysis"**. Please see the other document for an extended explanation.

# Purpose of this document

Briefly, **this document aims to train a machine learning model that classifies a product in the Amazon search results into one of three categories**: a product owned by Amazon ("Amazon product"), a product apparently not owned by Amazon ("non-Amazon product"), and a product completely not owned by Amazon ("wholly non-Amazon product").

This will be done based on **3 features** that were given in the ["Amazon Brands and Exclusives" dataset](https://www.kaggle.com/datasets/thedevastator/amazon-s-dominance-in-e-commerce-why-you-should). These are: _the position of the product in the search results, its star rating, and its number of reviews_. The previous analysis found that these features differ significantly between the categories in most cases, so it is plausible that a model could distinguish the products based on them.

Now, let's start talking about the machine learning models that will be used. 

# Machine learning algorithm
Since we want to classify an input into 3 different classes, I consider 2 possible algorithms: **multiclass logistic regression or neural networks**. In this document, both algorithms will be applied to see which one performs better.

**First, I will begin with the simpler algorithm, logistic regression**. From there, depending on the results, it will be decided what to do next. 
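As a reminder of what a multiclass logistic regression computes, here is a minimal NumPy sketch of the softmax (multinomial) formulation, which is one common way of extending logistic regression to several classes. The weights, bias, and inputs are made-up values for illustration only; they are not taken from the dataset or from any fitted model.

```python
import numpy as np

def softmax(scores):
    """Turn a (n_samples, n_classes) array of linear scores into probabilities."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # subtract the row maximum for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical weights and bias for 3 features and 3 classes (illustrative values only)
W = np.array([[ 1.0, -0.5,  0.2],
              [ 0.3,  0.8, -1.0],
              [-0.7,  0.1,  0.6]])
b = np.array([0.0, 0.1, -0.1])

# Two hypothetical examples, already scaled to the 0-1 range
X_demo = np.array([[0.2, 0.9, 0.1],
                   [0.8, 0.4, 0.5]])

probs = softmax(X_demo @ W + b)         # one probability per class, each row sums to 1
predicted_class = probs.argmax(axis=1)  # the class with the highest probability
print(probs)
print(predicted_class)
```

Training consists of finding the `W` and `b` that make these probabilities match the known labels as closely as possible, with the regularization strength controlling how large the weights are allowed to grow.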

## Multiclass logistic regression 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math


df = pd.read_csv("Amazon Brands and Exclusives/quality_and_sales_comparisons.csv")

df = df.dropna()
df.rename(columns={"position_first_wholly_non_amazon": "position_first_wnon_amazon"}, inplace=True)
nump_df = df.to_numpy()


# Separating the rows into different examples (Amazon, non-Amazon, and wholly non-Amazon)
# Also creating the "Y" array, which indicates the type of product. 0 = Amazon, 1 = non-Amazon, 2 = wholly non-Amazon
amazon_examples = nump_df[:, [1, 4, 5]]
amazon_results = np.zeros( (amazon_examples.shape[0], 1) )

non_amazon_examples = nump_df[:, [2, 6, 7]]
non_amazon_results = np.ones( (amazon_examples.shape[0], 1) )

wnon_amazon_examples = nump_df[:, [3, 8, 9]]
wnon_amazon_results = np.ones( (amazon_examples.shape[0], 1) ) * 2

# Stacking all arrays
all_examples = np.vstack( (amazon_examples, non_amazon_examples, wnon_amazon_examples) )
all_results = np.vstack( (amazon_results, non_amazon_results, wnon_amazon_results) )



# Now I'm applying feature scaling, converting all values into the range 0-1 for better performance.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_examples = scaler.fit_transform(all_examples)



# Dividing the data into training data and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(scaled_examples, all_results, test_size=0.25, random_state=42, shuffle=True)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

## Creating the algorithm to be used
## This will be done inside a for loop, since cross-validation will be used to find the best C value for regularization.
## C values will range from 0.01 to 100 (lower values mean stronger regularization).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

all_c = [0.01, 0.1, 0.2, 0.4, 0.8, 1.5, 3, 6, 12, 24, 48, 100]
curr_acc = 0
better_c = 0
for c_value in all_c:
    log_reg = LogisticRegression(solver='sag', C=c_value, random_state=32) # sag is used because it works with multi-class logistic regression, and it was fast while testing
    result_c = cross_val_score(log_reg, X_train, y_train, cv=3)
    print(f"For C={c_value}: Mean = %.2f, standard deviation = %0.3f" % (result_c.mean(), result_c.std()))
    
    # Storing the best c_value according to the cross validation
    if curr_acc < result_c.mean():
        better_c = c_value
        curr_acc = result_c.mean()
    
    

print(f"\nThe best value of C was {better_c}")

# Now it's time to train the algorithm with that c_value and see the result
log_reg = LogisticRegression(solver='sag', C=better_c, random_state=32)
log_reg.fit(X_train, y_train)
final_score = log_reg.score(X_test, y_test)
print(f"\nFor C={better_c}: Final score = %0.2f" % (final_score))


For C=0.01: Mean = 0.52, standard deviation = 0.012
For C=0.1: Mean = 0.56, standard deviation = 0.012
For C=0.2: Mean = 0.57, standard deviation = 0.008
For C=0.4: Mean = 0.59, standard deviation = 0.008
For C=0.8: Mean = 0.60, standard deviation = 0.009
For C=1.5: Mean = 0.62, standard deviation = 0.011
For C=3: Mean = 0.62, standard deviation = 0.009
For C=6: Mean = 0.63, standard deviation = 0.006
For C=12: Mean = 0.63, standard deviation = 0.007
For C=24: Mean = 0.64, standard deviation = 0.006
For C=48: Mean = 0.64, standard deviation = 0.006
For C=100: Mean = 0.64, standard deviation = 0.005

The best value of C was 100

For C=100: Final score = 0.62
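In the cell above, the scaler is fitted on all examples before the train/test split, so the test rows influence the scaling. One possible alternative, sketched here under assumptions, is to put the scaler and the classifier in a `Pipeline` and let `GridSearchCV` search over C, so that scaling is refitted on each training fold only. The data is a synthetic stand-in generated with `make_classification`, and the default solver is used instead of `sag` to keep the sketch fast; neither choice comes from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data: 3 features, 3 classes (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# The scaler lives inside the pipeline, so it only ever sees training folds
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Same C values as in the manual loop
param_grid = {"clf__C": [0.01, 0.1, 0.2, 0.4, 0.8, 1.5, 3, 6, 12, 24, 48, 100]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_tr, y_tr)

best_c = search.best_params_["clf__C"]
test_score = search.score(X_te, y_te)  # uses the pipeline refitted with the best C
print(f"Best C = {best_c}, cross-validation mean = {search.best_score_:.2f}, test score = {test_score:.2f}")
```

Whether this changes the scores on the real dataset is not something this sketch can tell; it only shows the structure.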


As the final score shows, **the results are not great**. Since accuracy was low in cross-validation as well as on the test set, the model appears to be **underfitting**.
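One way to check for underfitting more directly is to compare the accuracy on the training folds with the accuracy on the validation folds: if both are low and close to each other, the model is too simple for the data (or the features carry too little signal). The following is a minimal sketch on synthetic data, assuming `cross_validate` with `return_train_score=True`; the dataset and the default solver are placeholders, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the scaled training data (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

cv_out = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    cv=3, return_train_score=True,
)

train_acc = cv_out["train_score"].mean()
val_acc = cv_out["test_score"].mean()
gap = train_acc - val_acc

# Low train_acc with a small gap points to underfitting;
# high train_acc with a large gap points to overfitting.
print(f"Train accuracy = {train_acc:.2f}, validation accuracy = {val_acc:.2f}, gap = {gap:.2f}")
```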

To solve this, I'm going to try giving the model **more features**. Each new feature will be a ratio between two of the original features, since the stars/reviews, position/reviews, or stars/position ratios could help to predict the type of product.

In [2]:
# Suppressing divide-by-zero warnings; the resulting infinite values are replaced below
np.seterr(divide='ignore', invalid='ignore')

# Creating the new features in an additional array (with original values), then I will concatenate and scale them.
new_features = np.empty_like(all_examples)
new_features[:, 0] = all_examples[:, 1] / all_examples[:, 0] # Stars / Position
new_features[:, 1] = all_examples[:, 0] / all_examples[:, 2] # Position / Reviews
new_features[:, 2] = all_examples[:, 1] / all_examples[:, 2] # Stars / Reviews

# Replacing infinite values (from division by zero) with 1.
# Note that this happens before scaling, so 1 is a placeholder, not necessarily the scaled maximum.
new_features[new_features == np.inf] = 1

# Concatenating
new_all_examples = np.concatenate((all_examples, new_features), axis=1)
new_all_examples

# Scaling
scaler = MinMaxScaler()
scaled_examples = scaler.fit_transform(new_all_examples)
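The comparison `== np.inf` only catches positive infinity; a 0/0 division would produce NaN, which it leaves untouched. As a hedged alternative, the ratio can be computed so that division by zero never happens in the first place. The helper name `safe_ratio` and the fill value are hypothetical choices for this sketch; 1.0 is used here only to mirror the replacement value in the cell above.

```python
import numpy as np

def safe_ratio(numerator, denominator, fill_value=0.0):
    """Element-wise numerator / denominator, using fill_value wherever the denominator is 0."""
    numerator = np.asarray(numerator, dtype=float)
    denominator = np.asarray(denominator, dtype=float)
    out = np.full_like(numerator, fill_value)
    # Only divide where the denominator is non-zero; other positions keep fill_value
    np.divide(numerator, denominator, out=out, where=denominator != 0)
    return out

# Hypothetical values: the second and third products have no reviews
stars = np.array([4.5, 3.0, 0.0])
reviews = np.array([100.0, 0.0, 0.0])

ratio = safe_ratio(stars, reviews, fill_value=1.0)
print(ratio)
```

With this approach no warnings need to be silenced, and both the x/0 and the 0/0 cases end up with the same explicit value.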

In [3]:
# Ignoring some warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

# Now I will apply the same process as last time

# Dividing the data into training data and test data
X_train, X_test, y_train, y_test = train_test_split(scaled_examples, all_results, test_size=0.25, random_state=42, shuffle=True)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

## Creating the algorithm to be used
## This will be done inside a for loop, since cross-validation will be used to find the best C value for regularization.
## C values will range from 0.01 to 100 (lower values mean stronger regularization).

all_c = [0.01, 0.1, 0.2, 0.4, 0.8, 1.5, 3, 6, 12, 24, 48, 100]
curr_acc = 0
better_c = 0
for c_value in all_c:
    log_reg = LogisticRegression(solver='sag', C=c_value, random_state=32, max_iter=200) # sag is used because it works with multi-class logistic regression, and it was fast while testing
    result_c = cross_val_score(log_reg, X_train, y_train, cv=3)
    print(f"For C={c_value}: Mean = %.2f, standard deviation = %0.3f" % (result_c.mean(), result_c.std()))
    
    # Storing the best c_value according to the cross validation
    if curr_acc < result_c.mean():
        better_c = c_value
        curr_acc = result_c.mean()
    
    

print(f"\nThe best value of C was {better_c}")

# Now it's time to train the algorithm with that c_value and see the result
log_reg = LogisticRegression(solver='sag', C=better_c, random_state=32)
log_reg.fit(X_train, y_train)
final_score = log_reg.score(X_test, y_test)
print(f"\nFor C={better_c}: Final score = %0.2f" % (final_score))


For C=0.01: Mean = 0.59, standard deviation = 0.007
For C=0.1: Mean = 0.60, standard deviation = 0.011
For C=0.2: Mean = 0.60, standard deviation = 0.009
For C=0.4: Mean = 0.60, standard deviation = 0.005
For C=0.8: Mean = 0.60, standard deviation = 0.004
For C=1.5: Mean = 0.61, standard deviation = 0.004
For C=3: Mean = 0.62, standard deviation = 0.002
For C=6: Mean = 0.62, standard deviation = 0.001
For C=12: Mean = 0.63, standard deviation = 0.002
For C=24: Mean = 0.63, standard deviation = 0.003
For C=48: Mean = 0.63, standard deviation = 0.004
For C=100: Mean = 0.64, standard deviation = 0.003

The best value of C was 100

For C=100: Final score = 0.62


The new features did not help: the final score is still 0.62. The results aren't horrible, but they aren't good either.

To try to improve this, I'm going to consider the results of the previous data analysis. In those results, **it was clear that wholly non-Amazon and non-Amazon products differed**, with non-Amazon products holding better positions. Therefore, to improve the model, I will **treat Amazon products and non-Amazon products as the same class**, distinguishing only between wholly non-Amazon products and the rest.

Let's see if this improves the classification. 

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

df = pd.read_csv("Amazon Brands and Exclusives/quality_and_sales_comparisons.csv")

df = df.dropna()
df.rename(columns={"position_first_wholly_non_amazon": "position_first_wnon_amazon"}, inplace=True)
nump_df = df.to_numpy()

# Separating the rows into different examples (Amazon, non-Amazon, and wholly non-Amazon)
# Also creating the "Y" array, which indicates the type of product. 0 = Amazon or non-Amazon, 1 = wholly non-Amazon
amazon_examples = nump_df[:, [1, 4, 5]]
amazon_results = np.zeros( (amazon_examples.shape[0], 1) )

non_amazon_examples = nump_df[:, [2, 6, 7]]
non_amazon_results = np.zeros( (amazon_examples.shape[0], 1) )

wnon_amazon_examples = nump_df[:, [3, 8, 9]]
wnon_amazon_results = np.ones( (amazon_examples.shape[0], 1) )

# Stacking all arrays
all_examples = np.vstack( (amazon_examples, non_amazon_examples, wnon_amazon_examples) )
all_results = np.vstack( (amazon_results, non_amazon_results, wnon_amazon_results) )

# Now I'm applying feature scaling, converting all values into the range 0-1 for better performance.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_examples = scaler.fit_transform(all_examples)


# Dividing the data into training data and test data
X_train, X_test, y_train, y_test = train_test_split(scaled_examples, all_results, test_size=0.25, random_state=42, shuffle=True)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

## Creating the algorithm to be used
## This will be done inside a for loop, since cross-validation will be used to find the best C value for regularization.
## C values will range from 0.01 to 100 (lower values mean stronger regularization).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

all_c = [0.01, 0.1, 0.2, 0.4, 0.8, 1.5, 3, 6, 12, 24, 48, 100]
curr_acc = 0
better_c = 0
for c_value in all_c:
    log_reg = LogisticRegression(solver='sag', C=c_value, random_state=32, max_iter=200) # sag is used because it works with multi-class logistic regression, and it was fast while testing
    result_c = cross_val_score(log_reg, X_train, y_train, cv=3)
    print(f"For C={c_value}: Mean = %.2f, standard deviation = %0.3f" % (result_c.mean(), result_c.std()))
    
    # Storing the best c_value according to the cross validation
    if curr_acc < result_c.mean():
        better_c = c_value
        curr_acc = result_c.mean()
    
    

print(f"\nThe best value of C was {better_c}")

# Now it's time to train the algorithm with that c_value and see the result
log_reg = LogisticRegression(solver='sag', C=better_c, random_state=32)
log_reg.fit(X_train, y_train)
final_score = log_reg.score(X_test, y_test)
print(f"\nFor C={better_c}: Final score = %0.2f" % (final_score))

For C=0.01: Mean = 0.66, standard deviation = 0.000
For C=0.1: Mean = 0.67, standard deviation = 0.005
For C=0.2: Mean = 0.67, standard deviation = 0.007
For C=0.4: Mean = 0.67, standard deviation = 0.006
For C=0.8: Mean = 0.67, standard deviation = 0.006
For C=1.5: Mean = 0.67, standard deviation = 0.008
For C=3: Mean = 0.67, standard deviation = 0.008
For C=6: Mean = 0.67, standard deviation = 0.008
For C=12: Mean = 0.68, standard deviation = 0.008
For C=24: Mean = 0.68, standard deviation = 0.007
For C=48: Mean = 0.68, standard deviation = 0.007
For C=100: Mean = 0.68, standard deviation = 0.006

The best value of C was 100

For C=100: Final score = 0.67


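We can already see an improvement: **accuracy increased by 5 percentage points** (from 0.62 to 0.67), even without the new features, although the task is now a two-class problem rather than a three-class one.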
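When reading this number, it helps to keep a baseline in mind. Because the three groups contribute the same number of rows and two of them now share label 0, two thirds of the examples belong to class 0, so a model that always predicts 0 would already score about 0.667. The cross-validation mean of 0.66 with a standard deviation of 0.000 at C=0.01 is consistent with such a baseline. The sketch below reproduces the baseline with scikit-learn's `DummyClassifier`; the group size is a hypothetical number, since only the 2:1 proportion matters.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

n_per_group = 1000  # hypothetical group size; only the 2:1 class proportion matters

# Amazon and non-Amazon rows get label 0, wholly non-Amazon rows get label 1
y_demo = np.concatenate([np.zeros(2 * n_per_group), np.ones(n_per_group)])
X_demo = np.zeros((y_demo.size, 1))  # the dummy model ignores the features

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)
print(f"Majority-class baseline accuracy = {baseline_acc:.3f}")
```

Any model for the two-class version of the problem should be judged by how far it gets above this baseline, not above 50%.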

Now, let's see if the 3 new features improve it even more. 

In [5]:
# Ignoring errors when trying to divide by 0, will solve that later
np.seterr(divide='ignore', invalid='ignore')

# Creating the new features in an additional array (with original values), then I will concatenate and scale them.
new_features = np.empty_like(all_examples)
new_features[:, 0] = all_examples[:, 1] / all_examples[:, 0] # Stars / Position
new_features[:, 1] = all_examples[:, 0] / all_examples[:, 2] # Position / Reviews
new_features[:, 2] = all_examples[:, 1] / all_examples[:, 2] # Stars / Reviews

# Replacing infinite values (from division by zero) with 1.
# Note that this happens before scaling, so 1 is a placeholder, not necessarily the scaled maximum.
new_features[new_features == np.inf] = 1

# Concatenating
new_all_examples = np.concatenate((all_examples, new_features), axis=1)
new_all_examples

# Scaling
scaler = MinMaxScaler()
scaled_examples = scaler.fit_transform(new_all_examples)


# Now I will apply the same process as last time
# Dividing the data into training data and test data
X_train, X_test, y_train, y_test = train_test_split(scaled_examples, all_results, test_size=0.25, random_state=42, shuffle=True)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

## Creating the algorithm to be used
## This will be done inside a for loop, since cross-validation will be used to find the best C value for regularization.
## C values will range from 0.01 to 100 (lower values mean stronger regularization).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

all_c = [0.01, 0.1, 0.2, 0.4, 0.8, 1.5, 3, 6, 12, 24, 48, 100]
curr_acc = 0
better_c = 0
for c_value in all_c:
    log_reg = LogisticRegression(solver='sag', C=c_value, random_state=32, max_iter=200) # sag is used because it works with multi-class logistic regression, and it was fast while testing
    result_c = cross_val_score(log_reg, X_train, y_train, cv=3)
    print(f"For C={c_value}: Mean = %.2f, standard deviation = %0.3f" % (result_c.mean(), result_c.std()))
    
    # Storing the best c_value according to the cross validation
    if curr_acc < result_c.mean():
        better_c = c_value
        curr_acc = result_c.mean()
    
    

print(f"\nThe best value of C was {better_c}")

# Now it's time to train the algorithm with that c_value and see the result
log_reg = LogisticRegression(solver='sag', C=better_c, random_state=32)
log_reg.fit(X_train, y_train)
final_score = log_reg.score(X_test, y_test)
print(f"\nFor C={better_c}: Final score = %0.2f" % (final_score))


For C=0.01: Mean = 0.66, standard deviation = 0.001
For C=0.1: Mean = 0.68, standard deviation = 0.003
For C=0.2: Mean = 0.69, standard deviation = 0.007
For C=0.4: Mean = 0.69, standard deviation = 0.005
For C=0.8: Mean = 0.70, standard deviation = 0.008
For C=1.5: Mean = 0.70, standard deviation = 0.007
For C=3: Mean = 0.70, standard deviation = 0.013
For C=6: Mean = 0.71, standard deviation = 0.012
For C=12: Mean = 0.71, standard deviation = 0.013
For C=24: Mean = 0.72, standard deviation = 0.013
For C=48: Mean = 0.72, standard deviation = 0.012
For C=100: Mean = 0.73, standard deviation = 0.010

The best value of C was 100

For C=100: Final score = 0.71


That was even better: **a score over 70% is not the best, but it is still decent**.

It is clear that the algorithm can still improve, as **underfitting** can still be seen, so let's try a more complex algorithm: **Neural Networks**
- It makes sense to use this one, since the problem is probably a lack of features rather than a lack of data, given the fairly large amount of data we have

## Neural Networks

It is important to consider that this will be just a small application of neural networks, so I will continue to use this same interface to code it. Because the dataset is not especially large and we do not have many features, the *scikit-learn neural networks* should be good enough for this small project.

Since it performed better before, the new features created will be maintained, and all the preprocessing of the data will also be kept the same. 

### Specification of the neural networks
To begin, let's try a fairly small neural network and see how well it performs. That is, let's try 3 hidden layers with 10 neurons each, for a total of 30 hidden neurons.
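To get a feeling for how small this network is, the sketch below fits an `MLPClassifier` with the same hidden layers on random placeholder data with 6 inputs (the 3 original features plus the 3 ratios) and counts its weights and biases. The data and labels are invented for illustration, and `max_iter=50` is chosen only to keep the example fast.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_demo = rng.random((200, 6))               # 6 inputs: 3 original features + 3 ratios
y_demo = (X_demo[:, 0] > 0.5).astype(int)   # hypothetical binary labels

mlp = MLPClassifier(hidden_layer_sizes=[10, 10, 10], max_iter=50, random_state=32)
with warnings.catch_warnings():
    # The short training run may not converge; that does not matter for counting parameters
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    mlp.fit(X_demo, y_demo)

layer_shapes = [w.shape for w in mlp.coefs_]  # one weight matrix per connection between layers
n_params = sum(w.size for w in mlp.coefs_) + sum(b.size for b in mlp.intercepts_)
print(layer_shapes)
print(f"Total trainable parameters: {n_params}")
```

For binary labels scikit-learn uses a single output unit, so the weight matrices have shapes (6, 10), (10, 10), (10, 10) and (10, 1), which together with the biases gives 301 parameters.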

Cross-validation will be used to choose the value of alpha, the regularization term.

In [6]:
# Ignoring some warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

from sklearn.neural_network import MLPClassifier
# The alpha values to be tested
all_alpha = [0.0000001, 0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1]
curr_acc = 0
better_alpha = 0

# Testing for the best alpha with this NN combination. Tested this way to prevent overfitting the test set. 
for my_alpha in all_alpha:
    NNC = MLPClassifier(hidden_layer_sizes=[10,10,10], alpha=my_alpha, shuffle=True, random_state=32, learning_rate="adaptive", max_iter=300)
    result_alpha = cross_val_score(NNC, X_train, y_train, cv=3)
    print(f"For alpha={my_alpha}: Mean = %.3f, standard deviation = %0.3f" % (result_alpha.mean(), result_alpha.std()))
    
    # Storing the best alpha according to the cross validation
    if curr_acc < result_alpha.mean():
        better_alpha = my_alpha
        curr_acc = result_alpha.mean()

print(f"\nThe best value of alpha was {better_alpha}")

# Now it's time to train the algorithm with that alpha and see the result
NNC = MLPClassifier(hidden_layer_sizes=[10,10,10], alpha=better_alpha, shuffle=True, random_state=32, learning_rate="adaptive", max_iter=300)
NNC.fit(X_train, y_train)
final_score = NNC.score(X_test, y_test)
print(f"\nFor alpha={better_alpha}: Final score = %0.3f \n" % (final_score))

For alpha=1e-07: Mean = 0.746, standard deviation = 0.012
For alpha=1e-06: Mean = 0.746, standard deviation = 0.013
For alpha=1e-05: Mean = 0.747, standard deviation = 0.015
For alpha=0.0001: Mean = 0.749, standard deviation = 0.012
For alpha=0.001: Mean = 0.746, standard deviation = 0.010
For alpha=0.01: Mean = 0.747, standard deviation = 0.011
For alpha=0.1: Mean = 0.745, standard deviation = 0.011
For alpha=1: Mean = 0.711, standard deviation = 0.007

The best value of alpha was 0.0001

For alpha=0.0001: Final score = 0.751 



Great! That was an **improvement of about 4 percentage points** compared to the previous test (from 0.71 to 0.751).

Now, **let's try different arrangements of Neural Networks** to check whether another configuration gives better results.
For each configuration, the same cross-validation tests will be used to find the best alpha.

In [9]:
# Ignoring some warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

from sklearn.neural_network import MLPClassifier
# The alpha values to be tested
all_alpha = [0.0000001, 0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1]
best_acc = 0
best_alpha = 0
best_NN_comb = []
NN_combinations = [ [6,6,6], [10,10,10], [15,15,15], [5,5,5,5,5], [100] ]


# Testing for the best NN combination with cross-validation. Tested this way to prevent overfitting the test set. 
# These combinations were chosen because they were varied and could possibly have different results
print("Starting tests. This could take around 4-5 minutes")
for NN_comb in NN_combinations:
    print(f"\nTesting {NN_comb}...", end="")
    for my_alpha in all_alpha:
        NNC = MLPClassifier(hidden_layer_sizes=NN_comb, alpha=my_alpha, shuffle=True, random_state=32, learning_rate="adaptive", max_iter=300)
        result_alpha = cross_val_score(NNC, X_train, y_train, cv=3)
        
        # Storing the best alpha according to the cross validation
        if best_acc < result_alpha.mean():
            best_alpha = my_alpha
            best_NN_comb = NN_comb
            best_acc = result_alpha.mean()
        print(".", end="") # To show progress
    print(f"\nBest result until now is NN={best_NN_comb} with alpha={best_alpha} and accuracy of %.3f" % best_acc)

print(f"\nThe best NN performance was {best_NN_comb} with an alpha of {best_alpha}, with accuracy in cross-validation of %.3f" % (best_acc))

# Now it's time to train the algorithm with that alpha and hidden-layer combination and see the result
NNC = MLPClassifier(hidden_layer_sizes=best_NN_comb, alpha=best_alpha, shuffle=True, random_state=32, learning_rate="adaptive", max_iter=300)
NNC.fit(X_train, y_train)
final_score = NNC.score(X_test, y_test)
print(f"\nFor alpha={best_alpha} and NN combination of {best_NN_comb}: Final score = %0.3f \n" % (final_score))

Starting tests. This could take around 4-5 minutes

Testing [6, 6, 6]...........
Best result until now is NN=[6, 6, 6] with alpha=1e-06 and accuracy of 0.755

Testing [10, 10, 10]...........
Best result until now is NN=[6, 6, 6] with alpha=1e-06 and accuracy of 0.755

Testing [15, 15, 15]...........
Best result until now is NN=[6, 6, 6] with alpha=1e-06 and accuracy of 0.755

Testing [5, 5, 5, 5, 5]...........
Best result until now is NN=[6, 6, 6] with alpha=1e-06 and accuracy of 0.755

Testing [100]...........
Best result until now is NN=[6, 6, 6] with alpha=1e-06 and accuracy of 0.755

The best NN performance was [6, 6, 6] with an alpha of 1e-06, with accuracy in cross-validation of 0.755

For alpha=1e-06 and NN combination of [6, 6, 6]: Final score = 0.750 
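The nested loop above can also be expressed with `GridSearchCV`, which tries every combination of hidden-layer layout and alpha and keeps the best one. This is a minimal sketch on synthetic data with a deliberately reduced grid so that it runs quickly; the dataset, the grid, and `max_iter=200` are assumptions for illustration and will not reproduce the numbers above.

```python
from warnings import simplefilter

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

simplefilter("ignore", category=ConvergenceWarning)

# Synthetic stand-in for the 6 scaled features and the binary labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0, random_state=42,
)

# Reduced grid: 2 layouts x 2 alphas = 4 combinations
param_grid = {
    "hidden_layer_sizes": [(6, 6, 6), (10, 10, 10)],
    "alpha": [0.0001, 0.01],
}
search = GridSearchCV(
    MLPClassifier(random_state=32, learning_rate="adaptive", max_iter=200),
    param_grid, cv=3,
)
search.fit(X_demo, y_demo)

best_layout = search.best_params_["hidden_layer_sizes"]
best_alpha_demo = search.best_params_["alpha"]
print(f"Best layout = {best_layout}, best alpha = {best_alpha_demo}, "
      f"cross-validation accuracy = {search.best_score_:.3f}")
```

The `cv_results_` attribute of the fitted search also keeps the mean and standard deviation for every combination, which makes it easier to see whether the winner is clearly ahead or only marginally so.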



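The result of this experiment is that **the \[6,6,6\] hidden-layer combination did slightly better in cross-validation than \[10,10,10\]** (0.755 versus 0.749), although its final test score was marginally lower (0.750 versus 0.751). Both differences are smaller than the cross-validation standard deviation of about 0.01 reported for \[10,10,10\], so the two configurations perform essentially the same.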

With this, the testing of the machine learning models can be concluded, finishing with **an accuracy of about 75%** in the predictions.
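Since two thirds of the examples belong to one class, a single accuracy figure hides how well each class is recognized. A confusion matrix and per-class recall make that visible. The sketch below uses synthetic data with the same 2:1 class proportion and a plain logistic regression as a placeholder model; none of the numbers it prints describe the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly two thirds of the examples in class 0
X_demo, y_demo = make_classification(
    n_samples=900, n_features=6, n_informative=4, n_redundant=0,
    weights=[2 / 3, 1 / 3], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo,
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
# Fraction of each true class that was predicted correctly
per_class_recall = recall_score(y_te, y_pred, average=None)
print(cm)
print(per_class_recall)
```

Applied to the final neural network, the same two calls on `y_test` and `NNC.predict(X_test)` would show whether the wholly non-Amazon products are recognized as well as the rest.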

# Conclusions

I can conclude from this experiment the following:
- Machine learning models can make somewhat accurate predictions of the type of a product in the Amazon search results
- More features would probably be needed for a better accuracy rate. The position, star rating, and number of reviews are not enough.
- Still, even with so few features, decent models **with an accuracy of up to 75% could be made**. Thus, I consider this project a small success.

_This document was created by Sergio González. The other file of this project can be found in my GitHub_ https://github.com/SergioGzzBrz/My-projects