## Analysis of an E-commerce Dataset Part 3 (s2 2023)


In this Portfolio task, you will continue working with the dataset you used in Portfolio 2. The difference is that the ratings have been converted to like (score 1) and dislike (score 0). Your task is to train classification models such as KNN to predict whether a user likes or dislikes an item.


The header of the csv file is shown below. 

| userId | timestamp | review | item | helpfulness | gender | category | item_id | item_price | user_city | rating |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |

Your high-level goal in this notebook is to build and evaluate predictive models for 'rating' from the other available features, that is, to predict like (rating 1) or dislike (rating 0) from some of the other fields. More specifically, you need to complete the following major steps: 
1) Explore the data. Clean the data if necessary. For example, remove abnormal instances and replace missing values.
2) Convert object features into digit features by using an encoder
3) Study the correlation between these features. 
4) Split the dataset and train a logistic regression model to predict 'rating' based on other features. Evaluate the accuracy of your model.
5) Split the dataset and train a KNN model to predict 'rating' based on other features. You can set K in an ad-hoc manner in this step. Evaluate the accuracy of your model.
6) Tune the hyper-parameter K in KNN to see how it influences the prediction performance

Note 1: We did not provide any description of each step in the notebook. You should learn how to properly comment your notebook by yourself to make your notebook file readable. 

Note 2: you are not being evaluated on the ___accuracy___ of the model but on the ___process___ that you use to generate it. Please use both ___Logistic Regression model___ and ___KNN model___ for solving this classification problem. Accordingly, discuss the performance of these two methods.
    

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
#1) Explore the data. Clean the data if necessary. For example, remove abnormal instances and replace missing values.
data = pd.read_csv("portfolio_3.csv")
data.head()

In [None]:
data.describe()

In [None]:
data.info()

Based on the `describe()` and `info()` output above, the data already looks clean, so we can move on to the next steps.
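A quick way to back up the "data is clean" conclusion is to count missing values and duplicated rows explicitly. The sketch below uses a small made-up frame (`toy` and its values are hypothetical stand-ins, not rows from `portfolio_3.csv`); the same two calls could be run on `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "userId": [1, 2, 2, 3],
    "gender": ["F", "M", "M", None],
    "helpfulness": [4.0, 3.0, 3.0, np.nan],
    "rating": [1, 0, 0, 1],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}")
```

If both counts are zero on the real data, no imputation or row removal is needed.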

In [None]:
#2) Convert object features into digit features by using an encoder

# Work on a copy so the original DataFrame is left untouched
encoded_data = data.copy()
# Encode non-numeric columns
columns_to_encode = ['review', 'item', 'gender', 'category']

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Apply label encoding to each column
for col in columns_to_encode:
    encoded_data[col] = label_encoder.fit_transform(encoded_data[col])

encoded_data

We have successfully encoded all object feature columns into digit features. We can move on to the next task.
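It helps to see what the encoder actually did. `LabelEncoder` assigns integers in sorted order of the labels, so the resulting numbers imply an ordering that the original categories may not have. A minimal sketch with made-up category names (the list `categories` is hypothetical):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up category labels, for illustration only
categories = ["Movies", "Books", "Games", "Books"]

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)

# classes_ holds the original labels; the position of each label is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(list(codes))
print(mapping)
```

Keeping this mapping around makes it possible to translate encoded values back into readable labels later, for example with `encoder.inverse_transform`.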

In [None]:
#3) Study the correlation between these features. 

correlation_matrix = encoded_data.corr()
correlation_matrix

It is difficult to spot patterns in a table of raw numbers. A heatmap makes the stronger correlations much easier to find.

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.xticks(rotation=45)
plt.title("Correlation Matrix")
plt.show()

The heatmap shows that, overall, the correlation between variables isn't very high. item and item_id are perfectly correlated (1.00), but this is to be expected, presumably because both identify the same item. The correlations of note are:
- item and review: 0.16 (and by extension item_id and review)
- helpfulness and userId: -0.17
- rating and category: -0.14
- category and item_price: -0.12

Looking at rating specifically, the most strongly correlated feature is category with -0.14, followed by userId with 0.07, and item/item_id with 0.06.

Nonetheless, overall these are weak correlations. The rest of the correlations are negligible.
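Instead of reading the ranking off the heatmap, the correlations with 'rating' can be sorted by absolute value. This is a minimal sketch on a made-up frame (`toy` and its numbers are hypothetical); on the real data the same lines could be applied to `encoded_data`.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "category": [0, 1, 2, 3, 4, 5],
    "item_price": [10, 30, 20, 50, 40, 60],
    "rating": [1, 1, 0, 1, 0, 0],
})

# Correlation of every feature with the target, excluding the target itself
corr_with_rating = toy.corr()["rating"].drop("rating")

# Order features by the strength of the correlation, ignoring the sign
order = corr_with_rating.abs().sort_values(ascending=False).index
ranked = corr_with_rating.reindex(order)

print(ranked)
```

Sorting by absolute value matters here because the strongest correlation in the notebook (category) is negative.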

In [None]:
#4) Split the dataset and train a logistic regression model to predict 'rating' based on other features. Evaluate the accuracy of your model.
#Split the dataset on rating
X = encoded_data.drop('rating', axis=1)
y = encoded_data['rating']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Train logistic model
logistic_model = LogisticRegression(max_iter=1000)
logistic_model.fit(X_train, y_train)

#Evaluate model
y_pred = logistic_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

We can see that the trained logistic model has an accuracy of about 64%. While this is better than a coin toss, it is too low to be considered really useful. Overall, with a logistic model, predicting rating from all the other columns in the dataset isn't very effective. Let's try different combinations of features to see if they yield better results.
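A coin toss is only the right yardstick if likes and dislikes are balanced. A more informative reference point is the accuracy of always predicting the majority class. The sketch below shows one way to compute it with `DummyClassifier` on synthetic data; `X_demo`, `y_demo` and the 65% like rate are assumptions made up for illustration, not properties of the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: features are pure noise, about 65% of labels are 1
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
y_demo = (rng.random(500) < 0.65).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline that ignores the features and always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_predictions = baseline.predict(X_te)
baseline_accuracy = accuracy_score(y_te, baseline_predictions)

print(f"Majority-class baseline accuracy: {baseline_accuracy * 100:.2f}%")
```

A model is only adding value if it beats this baseline on the same test split.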

In [None]:
#4) Split the dataset and train a logistic regression model to predict 'rating' based on other features. Evaluate the accuracy of your model.
#Split the dataset on rating
X_1 = encoded_data[['category']]
X_3 = encoded_data[['category', 'userId', 'item']]
y = encoded_data['rating']
X_1_train, X_1_test, y_1_train, y_1_test = train_test_split(X_1, y, test_size=0.2, random_state=42)
X_3_train, X_3_test, y_3_train, y_3_test = train_test_split(X_3, y, test_size=0.2, random_state=42)

#Train logistic models
logistic_model_1 = LogisticRegression(max_iter=1000)
logistic_model_3 = LogisticRegression(max_iter=1000)
logistic_model_1.fit(X_1_train, y_1_train)
logistic_model_3.fit(X_3_train, y_3_train)


#Evaluate models
y_1_pred = logistic_model_1.predict(X_1_test)
y_3_pred = logistic_model_3.predict(X_3_test)

accuracy_1 = accuracy_score(y_1_test, y_1_pred)
accuracy_3 = accuracy_score(y_3_test, y_3_pred)
print(f'Accuracy of model with 1 feature: {accuracy_1 * 100:.2f}%')
print(f'Accuracy of model with 3 features: {accuracy_3 * 100:.2f}%')

When using all available features, the model achieved an accuracy of 63.69%.

When using only the single most correlated feature (category), the accuracy increased very slightly to 63.87%. A difference of 0.18pp on a single train/test split is too small to call an improvement, but it does indicate that the remaining features add little predictive information beyond what category already provides.

When incorporating the three most correlated features, the accuracy decreased to 62.57%. This might suggest that the additional features introduced some noise or redundancy into the model. Adding more features doesn't always lead to better performance, and a small subset of informative features can be just as effective.
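Differences of a fraction of a percentage point on one split can easily come from the split itself. One way to check is k-fold cross-validation, which reports a mean and a spread for each feature set. This is a sketch on synthetic data (`X_demo`, `y_demo` and the column indices in `feature_sets` are hypothetical), not a rerun of the notebook's experiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: only the first column carries (noisy) signal
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=2.0, size=400) > 0).astype(int)

# Hypothetical feature subsets, given as column indices
feature_sets = {"1 feature": [0], "3 features": [0, 1, 2]}

cv_summary = {}
for name, cols in feature_sets.items():
    scores = cross_val_score(
        LogisticRegression(max_iter=1000),
        X_demo[:, cols],
        y_demo,
        cv=5,
        scoring="accuracy",
    )
    # Store the mean accuracy and its standard deviation across the 5 folds
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean() * 100:.2f}% (+/- {scores.std() * 100:.2f}pp)")
```

If the gap between two feature sets is smaller than the spread across folds, the ranking between them should not be trusted.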

In [None]:
#5) Split the dataset and train a KNN model to predict 'rating' based on other features. You can set K in an ad-hoc manner in this step. Evaluate the accuracy of your model.
#Although we can reuse the data from the last question, I think it's better to maintain modularity.
#Split the dataset on rating
X = encoded_data.drop('rating', axis=1)
y = encoded_data['rating']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Train the model
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

#Test the model
y_pred = knn_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

An accuracy score of 60% means that the K-Nearest Neighbors (KNN) model correctly predicted the 'rating' for approximately 60% of the samples in the testing set. While this indicates some level of predictive capability, it's important to put this accuracy score into context. Depending on the nature of the problem, 60% accuracy may meet the intended requirements; however, in most cases it would not be sufficient.
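One factor worth keeping in mind is that KNN is distance-based, so a feature with a large numeric range (a timestamp, for example) can dominate the neighbour search. The sketch below illustrates the effect on synthetic data by comparing raw features with standardised ones; the arrays and the `make_pipeline` setup are assumptions for illustration and were not part of the notebook's experiments.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: one informative small-scale feature,
# one uninformative feature with a much larger scale
rng = np.random.default_rng(0)
n_samples = 600
signal = rng.normal(size=n_samples)
noise = rng.normal(scale=1000.0, size=n_samples)
X_demo = np.column_stack([signal, noise])
y_demo = (signal > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# KNN on raw features: distances are dominated by the large-scale column
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# KNN on standardised features: both columns contribute on the same scale
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(X_tr, y_tr)

raw_accuracy = raw_knn.score(X_te, y_te)
scaled_accuracy = scaled_knn.score(X_te, y_te)
print(f"Raw features:    {raw_accuracy * 100:.2f}%")
print(f"Scaled features: {scaled_accuracy * 100:.2f}%")
```

Putting the scaler inside a pipeline means it is fitted on the training split only, which avoids leaking test-set statistics.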

Similarly to the logistic model, let's try using different feature sets.

In [None]:
X_1 = encoded_data[['category']]
X_3 = encoded_data[['category', 'userId', 'item']]
y = encoded_data['rating']
X_1_train, X_1_test, y_1_train, y_1_test = train_test_split(X_1, y, test_size=0.2, random_state=42)
X_3_train, X_3_test, y_3_train, y_3_test = train_test_split(X_3, y, test_size=0.2, random_state=42)

#Train KNN models
knn_model_1 = KNeighborsClassifier(n_neighbors=5)
knn_model_3 = KNeighborsClassifier(n_neighbors=5)
knn_model_1.fit(X_1_train, y_1_train)
knn_model_3.fit(X_3_train, y_3_train)

#Evaluate models
y_1_pred = knn_model_1.predict(X_1_test)
y_3_pred = knn_model_3.predict(X_3_test)

accuracy_1 = accuracy_score(y_1_test, y_1_pred)
accuracy_3 = accuracy_score(y_3_test, y_3_pred)
print(f'Accuracy of model with 1 feature: {accuracy_1 * 100:.2f}%')
print(f'Accuracy of model with 3 features: {accuracy_3 * 100:.2f}%')

The first KNN model correctly predicted the 'rating' for roughly 60% of the samples in the testing set. It suggests a moderate level of predictive capability, but there may be room for improvement.

Surprisingly, when using only the single most correlated feature, the accuracy drops significantly to 46.18%. This indicates that this particular feature, despite being the one most correlated with 'rating', does not carry enough information on its own for KNN. A likely reason is that a single label-encoded feature takes only a few distinct values, so many training samples sit at exactly the same distance from a test point and the five neighbours that get to vote are chosen more or less arbitrarily among them.

Interestingly, when incorporating the three most correlated features, the accuracy increases to 62.38%. This suggests that these three features, when combined, provide a more effective set of information for the model to make accurate predictions. This demonstrates the importance of considering multiple potentially informative features in the modeling process.

# Comparing the Two Models
The logistic regression model and the K-Nearest Neighbors (KNN) model both exhibit distinct strengths and weaknesses in predicting the 'rating' based on different feature sets. The logistic regression model, when considering all features, achieved an accuracy of 63.69%, outperforming the KNN model with all features which achieved an accuracy of 59.78%. This suggests that, for this specific dataset and problem, logistic regression may be better suited to capture the underlying relationships between features and the target variable. However, the KNN model demonstrated a notable improvement when using only the three most correlated features, achieving an accuracy of 62.38%. This indicates that KNN might be more sensitive to a select set of highly informative features. Ultimately, the choice between the models should be driven by the specific characteristics of the dataset and the nature of the problem at hand. Experimentation with different algorithms and feature sets, along with consideration of computational efficiency and interpretability, is crucial in selecting the most effective predictive model.

In [None]:
#6) Tune the hyper-parameter K in KNN to see how it influences the prediction performance
#I will tune knn_model_3 as it had the highest accuracy.

# Range of K values to test
k_values = list(range(1, 101))

# Create a dictionary of hyperparameters and their corresponding values
param_grid = {'n_neighbors': k_values}

# Initialize GridSearchCV to perform hyperparameter tuning
grid_search = GridSearchCV(knn_model_3, param_grid, cv=5, scoring='accuracy')

# Fit the model using GridSearchCV
grid_search.fit(X_3_train, y_3_train)

# Get the best hyperparameter
best_k = grid_search.best_params_['n_neighbors']
best_accuracy = grid_search.best_score_

# Train the model using the best K
knn_model_3 = KNeighborsClassifier(n_neighbors=best_k)
knn_model_3.fit(X_3_train, y_3_train)

# Evaluate the model with the best K
y_pred_3 = knn_model_3.predict(X_3_test)
accuracy_3 = accuracy_score(y_3_test, y_pred_3)

print(f'The best K found is: {best_k}')
print(f'Accuracy with the best K: {accuracy_3 * 100:.2f}%')

In [None]:
k_values = [params['n_neighbors'] for params in grid_search.cv_results_['params']]
mean_test_scores = grid_search.cv_results_['mean_test_score']

# Create a bar chart
plt.figure(figsize=(10, 6))
plt.bar(k_values, mean_test_scores, color='skyblue')
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Mean Test Score')
plt.title('Grid Search Results')
show_ticks = [1] + list(range(5, len(k_values), 5)) + [100]
plt.xticks(show_ticks)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Highlight the best K
# The label gives plt.legend() an entry to display
plt.axvline(x=best_k, color='red', linestyle='--', label=f'Best K = {best_k}')
plt.legend()

# Show the plot
plt.show()

The grid search found that the optimal number of neighbors is 67, resulting in 63.31% accuracy, an increase of 0.93pp over the 62.38% obtained with K = 5. Although it is an improvement, it is almost negligible.

Picking a larger K makes the model less sensitive to noise. However, looking at the bar chart of the results, we see that changing K didn't result in much of a change. While there were noticeable improvements at the beginning, they level off at around K = 20.
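The effect of K on noise sensitivity can be illustrated by comparing training and test accuracy for a few values of K. The sketch below uses synthetic data in which 20% of the labels are deliberately flipped; all names and numbers are assumptions for illustration, not results from this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: the label depends on the first feature,
# then 20% of the labels are flipped to act as noise
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(500, 2))
y_clean = (X_demo[:, 0] > 0).astype(int)
flipped = rng.random(500) < 0.2
y_demo = np.where(flipped, 1 - y_clean, y_clean)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Record (training accuracy, test accuracy) for each K
results = {}
for k in (1, 5, 25, 75):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    results[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"K={k:>2}: train={results[k][0]:.3f}, test={results[k][1]:.3f}")
```

With K = 1 every training point is its own nearest neighbour, so the model reproduces the noisy training labels exactly; larger K averages over more neighbours and stops memorising individual noisy points.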