### Version 1

### Data

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy
import pandas as pd
from bokeh.plotting import figure
from bokeh.io import show, output_notebook
from bokeh.models import ColumnDataSource

output_notebook()

In [3]:
# Path to the dataset file
ruta = "Amazon dataset.xlsx"
data = pd.read_excel(ruta)
print(data.dtypes)

product_id              object
product_name            object
Full_category           object
category                object
subcategory             object
subcategory_1           object
discounted_price       float64
actual_price           float64
discount_percentage    float64
rating                 float64
rating_count           float64
about_product           object
user_id                 object
user_name               object
review_id               object
review_title            object
review_content          object
img_link                object
product_link            object
dtype: object


### Encoding

In [6]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
data['Full_category_encoded'] = encoder.fit_transform(data['Full_category'])

In [7]:
print("Original Categories:")
print(data['Full_category'].unique())

print("\nEncoded Categories:")
print(data['Full_category_encoded'].unique())

Original Categories:
['Computers&Accessories|Accessories&Peripherals|Cables&Accessories|Cables|USBCables'
 'Computers&Accessories|NetworkingDevices|NetworkAdapters|WirelessUSBAdapters'
 'Electronics|HomeTheater,TV&Video|Accessories|Cables|HDMICables'
 'Electronics|HomeTheater,TV&Video|Televisions|SmartTelevisions'
 'Electronics|HomeTheater,TV&Video|Accessories|RemoteControls'
 'Electronics|HomeTheater,TV&Video|Televisions|StandardTelevisions'
 'Electronics|HomeTheater,TV&Video|Accessories|TVMounts,Stands&Turntables|TVWall&CeilingMounts'
 'Electronics|HomeTheater,TV&Video|Accessories|Cables|RCACables'
 'Electronics|HomeAudio|Accessories|SpeakerAccessories|Mounts'
 'Electronics|HomeTheater,TV&Video|Accessories|Cables|OpticalCables'
 'Electronics|HomeTheater,TV&Video|Projectors'
 'Electronics|HomeAudio|Accessories|Adapters'
 'Electronics|HomeTheater,TV&Video|SatelliteEquipment|SatelliteReceivers'
 'Computers&Accessories|Accessories&Peripherals|Cables&Accessories|Cables|DVICables'
 'Electr

The Encoded Categories output lists the integer code assigned to each unique value of Full_category. LabelEncoder sorts the unique strings and numbers them from 0, so the codes are arbitrary identifiers and their order carries no meaning.

Comparing the two lists is a quick check that the encoding worked: the number of distinct codes should equal the number of distinct categories, and each category should map to exactly one code.

If both conditions hold, the encoded column preserves the categorical information and can stand in for the original strings.
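The check described above can also be done programmatically. The following is a minimal sketch on a toy series; the three category strings are illustrative stand-ins for `data['Full_category']`, not the full dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for data['Full_category'] (illustrative values only)
categories = pd.Series([
    "Electronics|HomeAudio|Accessories|Adapters",
    "Electronics|HomeTheater,TV&Video|Projectors",
    "Electronics|HomeAudio|Accessories|Adapters",
])

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)

# One distinct code per distinct category
n_unique_match = len(set(codes)) == categories.nunique()

# Decoding the codes should give back the original strings
round_trip_ok = list(encoder.inverse_transform(codes)) == list(categories)

# Explicit category -> code mapping for inspection
mapping = dict(zip(encoder.classes_, range(len(encoder.classes_))))
print(mapping)
print("Distinct codes match:", n_unique_match, "| Round trip OK:", round_trip_ok)
```

A round trip through `inverse_transform` is a stricter test than comparing the two printed lists by eye, because it checks every row.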

### Attribute Engineering

In [8]:
data['average_rating'] = data.groupby('product_id')['rating'].transform('mean')

In [10]:
# Check the updated dataframe
print(data.head())  # Display the first few rows of the dataframe

product_id = 'B07JW9H4J1'
actual_average_rating = data.loc[data['product_id'] == product_id, 'rating'].mean()
calculated_average_rating = data.loc[data['product_id'] == product_id, 'average_rating'].iloc[0]
print("Actual Average Rating:", actual_average_rating)
print("Calculated Average Rating:", calculated_average_rating)

   product_id                                       product_name  \
0  B07JW9H4J1  Wayona Nylon Braided USB to Lightning Fast Cha...   
1  B098NS6PVG  Ambrane Unbreakable 60W / 3A Fast Charging 1.5...   
2  B096MSW6CT  Sounce Fast Phone Charging Cable & Data Sync U...   
3  B08HDJ86NZ  boAt Deuce USB 300 2 in 1 Type-C & Micro USB S...   
4  B08CF3B7N1  Portronics Konnect L 1.2M Fast Charging 3A 8 P...   

                                       Full_category               category  \
0  Computers&Accessories|Accessories&Peripherals|...  Computers&Accessories   
1  Computers&Accessories|Accessories&Peripherals|...  Computers&Accessories   
2  Computers&Accessories|Accessories&Peripherals|...  Computers&Accessories   
3  Computers&Accessories|Accessories&Peripherals|...  Computers&Accessories   
4  Computers&Accessories|Accessories&Peripherals|...  Computers&Accessories   

               subcategory       subcategory_1  discounted_price  \
0  Accessories&Peripherals  Cables&Accessories  

1. The output displays the first few rows of the updated dataframe. It shows the columns "product_id," "product_name," "Full_category," "category," "subcategory," "subcategory_1," "discounted_price," "actual_price," "discount_percentage," "rating," "rating_count," "about_product," "user_id," "user_name," "review_id," "review_title," "review_content," "img_link," "product_link," "Full_category_encoded," and "average_rating."

2. The specific example used for verification is for the product with the ID "B07JW9H4J1."

3. The "actual_average_rating" variable stores the mean of the original "rating" column over the rows for the specified product.

4. The "calculated_average_rating" variable reads the same product's value from the newly created "average_rating" column.

5. The output prints both values. In this case, both are "4.2," which shows that the groupby and transform call assigned the per-product mean to that product's rows as intended.

Because both numbers are derived from the same "rating" column, this check confirms that the new column was computed and aligned correctly for one product; it does not validate the ratings themselves. A match for a single product gives some confidence in the attribute engineering step, and repeating the comparison across all products would strengthen it.
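The single-product check can be extended to every product at once. This is a small sketch on a made-up dataframe (the product IDs "A" and "B" and their ratings are assumptions for illustration); it computes the per-product mean a second way and compares the two columns.

```python
import numpy as np
import pandas as pd

# Toy data: two products with several ratings each (illustrative values)
df = pd.DataFrame({
    "product_id": ["A", "A", "B", "B", "B"],
    "rating": [4.0, 5.0, 3.0, 4.0, 5.0],
})

# Same feature as in the notebook: per-product mean broadcast to every row
df["average_rating"] = df.groupby("product_id")["rating"].transform("mean")

# Second route: aggregate to one value per product, then map back by product_id
per_product_mean = df.groupby("product_id")["rating"].mean()
expected = df["product_id"].map(per_product_mean)

all_match = np.allclose(df["average_rating"], expected)
print(df)
print("All products match:", all_match)
```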

### Training a Supervised Model (Classification)

In [22]:
print(data['category'].dtype)  # Check data type
print(data['category'].isnull().sum())  # Check the count of null values

print(data['rating'].dtype)  # Check data type
print(data['rating'].isnull().sum())  # Check the count of null values

print(data['rating_count'].dtype)  # Check data type
print(data['rating_count'].isnull().sum())  # Check the count of null values


object
0
float64
1
float64
2


In [24]:
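from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split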

# Select the features (X) and the target variable (y)
X = data[['rating', 'rating_count']]
y = data['category']

# Check for missing values in X
print(X.isnull().sum())

# Handle missing values using an imputer
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the classification model (Random Forest Classifier)
classification_model = RandomForestClassifier()
classification_model.fit(X_train, y_train)

rating          1
rating_count    2
dtype: int64


In [25]:
# Check for missing values after imputation
print(pd.DataFrame(X, columns=['rating', 'rating_count']).isnull().sum())

rating          0
rating_count    0
dtype: int64


In [26]:
# Make predictions on the testing set
y_pred = classification_model.predict(X_test)

# Evaluate the model's performance
accuracy = classification_model.score(X_test, y_test)

# Print the accuracy
print("Model Accuracy:", accuracy)

Model Accuracy: 0.5085324232081911


The features (rating and rating count) and the target variable (category) are selected from the dataset. The one missing rating and the two missing rating counts are replaced with the column means using SimpleImputer, and the data is then split into training and testing sets with the train_test_split function, using a test size of 0.2 (20% of the data).

A RandomForestClassifier model is created and trained on the training data, and predictions are made on the testing set. The accuracy is the fraction of test rows whose predicted category (y_pred) matches the actual category (y_test); here it is computed with the model's score method, which returns the same value.

The reported accuracy of about 0.51 means that the model assigns the correct category to roughly half of the test rows. This suggests that rating and rating count alone carry limited information about a product's category, so the score is best read alongside a simple baseline such as always predicting the most frequent category.
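To put an accuracy figure in context, it helps to compare it with a trivial baseline. The sketch below uses synthetic data (the two features and the two category names are assumptions, not the Amazon dataset) and wraps the imputer and the classifier in a pipeline so that the imputation means are learned from the training split only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the two numeric features and the category target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, "Electronics", "Home&Kitchen")
X[rng.integers(0, 200, size=5), 0] = np.nan  # inject a few missing values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Imputer + classifier in one pipeline: the imputer is fitted on X_train only
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(random_state=42),
)
model.fit(X_train, y_train)
model_accuracy = accuracy_score(y_test, model.predict(X_test))

# Baseline that always predicts the most frequent training category
baseline = make_pipeline(
    SimpleImputer(strategy="mean"),
    DummyClassifier(strategy="most_frequent"),
)
baseline.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))

print("Model accuracy:   ", model_accuracy)
print("Baseline accuracy:", baseline_accuracy)
```

If the model's accuracy is close to the baseline's, the features add little beyond the class frequencies.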

### Training a Supervised Model (Classification, Consolidated)

In [39]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

# Define features and target variable
X = data[['rating', 'rating_count']]
y = data['category']

# Check for missing values in the feature matrix X
print(X.isnull().sum())

# Handle missing values in X using an imputer transformer
imputer_X = SimpleImputer(strategy='mean')
X = imputer_X.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the random forest classifier
classification_model = RandomForestClassifier()
classification_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = classification_model.predict(X_test)


rating          1
rating_count    2
dtype: int64



The cell above repeats the classification workflow in a single cell. It imports everything it needs and builds a random forest classifier with the RandomForestClassifier class from scikit-learn. The steps are:

1. Data Preparation:

- The feature matrix X is created by selecting the columns 'rating' and 'rating_count' from the data DataFrame.
- The target variable y is created by selecting the column 'category' from the data DataFrame.
- The missing values in X are counted and then replaced with the column means using SimpleImputer.

2. Train-Test Split:
- The train_test_split function is used to split the data into training and testing sets. The parameter test_size=0.2 specifies that 20% of the data will be used for testing, while the remaining 80% will be used for training.
- The split returns X_train and y_train (training features and target) and X_test and y_test (testing features and target).

3. Model Training:

- An instance of the RandomForestClassifier is created.
- The fit method is called to train the random forest classifier using the training data (X_train and y_train).

4. Model Evaluation:

- The trained model is used to make predictions on the test set (X_test), which are stored in y_pred.
- This cell does not compute a score. The predictions in y_pred can be compared with the actual values in y_test using metrics such as accuracy_score or confusion_matrix from sklearn.metrics.
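The evaluation step in item 4 can be sketched as follows. The two short label lists are made up for illustration; in the notebook they would be `y_test` and `y_pred` from the cell above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted categories for five test rows
y_test = ["Electronics", "Electronics", "Home&Kitchen", "Home&Kitchen", "Electronics"]
y_pred = ["Electronics", "Home&Kitchen", "Home&Kitchen", "Home&Kitchen", "Electronics"]

# Fraction of rows where the prediction equals the true label (4 of 5 here)
accuracy = accuracy_score(y_test, y_pred)

# Rows are true labels, columns are predicted labels, in the order given by `labels`
labels = ["Electronics", "Home&Kitchen"]
cm = confusion_matrix(y_test, y_pred, labels=labels)

print("Accuracy:", accuracy)
print(cm)
```

The confusion matrix shows which categories are mistaken for which, something a single accuracy number hides.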


The output shown is not an error message. It is the result of print(X.isnull().sum()), which reports one missing value in 'rating' and two in 'rating_count' before imputation. Depending on the scikit-learn version, RandomForestClassifier may raise an error when the feature matrix contains missing values, which is why the cell fills them in before training.

There are two common ways to handle the missing values. One is to use the SimpleImputer class from scikit-learn to fill them with the mean, median, or most frequent value, as the cell does with the mean. The other is to remove the affected rows with the dropna function, which is reasonable here because only a few rows are affected.
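Both options can be compared on a small example. The dataframe below is a made-up stand-in with the same column names as the notebook; the median strategy is used only to show an alternative to the mean.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one missing rating and one missing rating_count
df = pd.DataFrame({
    "rating": [4.2, np.nan, 3.9, 4.1],
    "rating_count": [100.0, 250.0, np.nan, 50.0],
    "category": ["Electronics", "Electronics", "Home&Kitchen", "Home&Kitchen"],
})
features = ["rating", "rating_count"]

# Option 1: drop every row that has a missing feature value
dropped = df.dropna(subset=features)

# Option 2: fill missing values with the column median
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(
    imputer.fit_transform(df[features]),
    columns=features,
    index=df.index,
)

print("Rows after dropna:", len(dropped))
print(imputed)
```

Dropping rows keeps only observed values but shrinks the dataset; imputing keeps every row at the cost of introducing estimated values.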

### -------------------------------------

### Version 2

### Data

In [27]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy
import pandas as pd
from bokeh.plotting import figure
from bokeh.io import show, output_notebook
from bokeh.models import ColumnDataSource

output_notebook()

In [29]:
# Path to the dataset file
ruta = "Amazon dataset.xlsx"
data = pd.read_excel(ruta)
print(data.dtypes)

product_id              object
product_name            object
Full_category           object
category                object
subcategory             object
subcategory_1           object
discounted_price       float64
actual_price           float64
discount_percentage    float64
rating                 float64
rating_count           float64
about_product           object
user_id                 object
user_name               object
review_id               object
review_title            object
review_content          object
img_link                object
product_link            object
dtype: object


### Encoding

In [31]:
# Select categorical columns for one-hot encoding
cat_cols = ['Full_category', 'category', 'subcategory', 'subcategory_1']
# Perform one-hot encoding
encoded_data = pd.get_dummies(data, columns=cat_cols)
# Check the encoded_data DataFrame
print(encoded_data.head())

   product_id                                       product_name  \
0  B07JW9H4J1  Wayona Nylon Braided USB to Lightning Fast Cha...   
1  B098NS6PVG  Ambrane Unbreakable 60W / 3A Fast Charging 1.5...   
2  B096MSW6CT  Sounce Fast Phone Charging Cable & Data Sync U...   
3  B08HDJ86NZ  boAt Deuce USB 300 2 in 1 Type-C & Micro USB S...   
4  B08CF3B7N1  Portronics Konnect L 1.2M Fast Charging 3A 8 P...   

   discounted_price  actual_price  discount_percentage  rating  rating_count  \
0             399.0        1099.0                 0.64     4.2       24269.0   
1             199.0         349.0                 0.43     4.0       43994.0   
2             199.0        1899.0                 0.90     3.9        7928.0   
3             329.0         699.0                 0.53     4.2       94363.0   
4             154.0         399.0                 0.61     4.2       16905.0   

                                       about_product  \
0  High Compatibility : Compatible With iPhone 12...  

The output shows the first rows of the encoded_data DataFrame. The four selected columns ('Full_category', 'category', 'subcategory', and 'subcategory_1') no longer appear; each has been replaced by one binary column per unique category, named with the original column name as a prefix (for example, subcategory_1_VideoCameras).

In each of these columns, a value of 1 (or True in recent pandas versions) indicates that the row belongs to that category, and 0 (or False) indicates that it does not. Checking that the original columns are gone and that the new prefixed columns are present confirms that the one-hot encoding worked.
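A small example makes the column replacement concrete. The three rows below are illustrative; `dtype=int` is passed so the result uses 0 and 1 regardless of the pandas version.

```python
import pandas as pd

# Toy data with one categorical column (illustrative values)
df = pd.DataFrame({
    "product_id": ["P1", "P2", "P3"],
    "category": ["Electronics", "Computers&Accessories", "Electronics"],
})

# 'category' is replaced by one 0/1 column per unique value
encoded = pd.get_dummies(df, columns=["category"], dtype=int)

print(list(encoded.columns))
print(encoded)

# Each row has exactly one 1 across the new category columns
one_hot_cols = [c for c in encoded.columns if c.startswith("category_")]
row_sums = encoded[one_hot_cols].sum(axis=1)
```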

### Attribute Engineering

In [32]:
# Calculate total rating counts per product
product_rating_counts = data.groupby('product_id')['rating_count'].sum().reset_index()
product_rating_counts.columns = ['product_id', 'total_rating_count']

# Merge the new feature with the original dataset
data = data.merge(product_rating_counts, on='product_id', how='left')

# Check the updated dataframe
print(data.head())

   product_id                                       product_name  \
0  B07JW9H4J1  Wayona Nylon Braided USB to Lightning Fast Cha...   
1  B098NS6PVG  Ambrane Unbreakable 60W / 3A Fast Charging 1.5...   
2  B096MSW6CT  Sounce Fast Phone Charging Cable & Data Sync U...   
3  B08HDJ86NZ  boAt Deuce USB 300 2 in 1 Type-C & Micro USB S...   
4  B08CF3B7N1  Portronics Konnect L 1.2M Fast Charging 3A 8 P...   

                                       Full_category               category  \
0  Computers&Accessories|Accessories&Peripherals|...  Computers&Accessories   
1  Computers&Accessories|Accessories&Peripherals|...  Computers&Accessories   
2  Computers&Accessories|Accessories&Peripherals|...  Computers&Accessories   
3  Computers&Accessories|Accessories&Peripherals|...  Computers&Accessories   
4  Computers&Accessories|Accessories&Peripherals|...  Computers&Accessories   

               subcategory       subcategory_1  discounted_price  \
0  Accessories&Peripherals  Cables&Accessories  

The output displays the first few rows of the updated dataframe, which now includes the column 'total_rating_count': the sum of 'rating_count' over all rows that share a 'product_id'.

The full output lists the following columns: 'product_id', 'product_name', 'Full_category', 'category', 'subcategory', 'subcategory_1', 'discounted_price', 'actual_price', 'discount_percentage', 'rating', 'rating_count', 'about_product', 'user_id', 'user_name', 'review_id', 'review_title', 'review_content', 'img_link', 'product_link', 'total_rating_count_x', 'total_rating_count_y', and 'total_rating_count'. The '_x' and '_y' suffixes are what pandas adds when both sides of a merge already contain a column with the same name, which indicates that this cell was executed more than once on the same dataframe. The three columns should hold the same totals, so the suffixed copies can be dropped.
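The duplicated columns can be reproduced, and avoided, on a toy dataframe. In this sketch the helper `add_total` is a hypothetical wrapper around the notebook's merge logic, and the three rows are made up; the alternative at the end uses `transform('sum')`, which overwrites the column and is therefore safe to rerun.

```python
import pandas as pd

df = pd.DataFrame({
    "product_id": ["A", "A", "B"],
    "rating_count": [10.0, 5.0, 7.0],
})

def add_total(frame):
    """Hypothetical wrapper around the notebook's groupby + merge step."""
    totals = frame.groupby("product_id")["rating_count"].sum().reset_index()
    totals.columns = ["product_id", "total_rating_count"]
    return frame.merge(totals, on="product_id", how="left")

once = add_total(df)
twice = add_total(once)  # rerun: the existing column clashes and gets _x/_y suffixes
print(list(twice.columns))

# Rerun-safe alternative: assign the per-product sum directly
safe = df.copy()
for _ in range(2):  # running it twice leaves a single column
    safe["total_rating_count"] = (
        safe.groupby("product_id")["rating_count"].transform("sum")
    )
print(list(safe.columns))
```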

### Training a Supervised Model (Regression) and an Unsupervised Model (Clustering)

In [33]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer

# Define features and target variable
X = encoded_data.drop(['product_id', 'product_name', 'about_product', 'user_id', 'user_name', 'review_id', 'review_title', 'review_content', 'img_link', 'product_link'], axis=1)
y = encoded_data['rating']

# Check for missing values in the feature matrix X
print(X.isnull().sum())

# Check for missing values in the target variable y
print(y.isnull().sum())

# Handle missing values in X using an imputer transformer
imputer_X = SimpleImputer(strategy='mean')
X = imputer_X.fit_transform(X)

# Handle missing values in y using an imputer transformer
imputer_y = SimpleImputer(strategy='mean')
y = imputer_y.fit_transform(y.values.reshape(-1, 1)).ravel()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the random forest regressor
rf_regressor = RandomForestRegressor()
rf_regressor.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_regressor.predict(X_test)

# Calculate mean squared error of the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

discounted_price                            0
actual_price                                0
discount_percentage                         0
rating                                      1
rating_count                                2
                                           ..
subcategory_1_UninterruptedPowerSupplies    0
subcategory_1_Vacuum,Cleaning&Ironing       0
subcategory_1_VideoCameras                  0
subcategory_1_WaterHeaters&Geysers          0
subcategory_1_WaterPurifiers&Accessories    0
Length: 327, dtype: int64
1
Mean Squared Error: 0.0005386996587030734


In [34]:
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer

# Define features
X = encoded_data.drop(['product_id', 'product_name', 'about_product', 'user_id', 'user_name', 'review_id', 'review_title', 'review_content', 'img_link', 'product_link'], axis=1)

# Check for missing values in the feature matrix X
print(X.isnull().sum())

# Handle missing values in X using an imputer transformer
imputer_X = SimpleImputer(strategy='mean')
X = imputer_X.fit_transform(X)

# Initialize and fit the K-Means clustering model
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

# Assign cluster labels to each data point
cluster_labels = kmeans.labels_

import warnings
warnings.filterwarnings("ignore")

discounted_price                            0
actual_price                                0
discount_percentage                         0
rating                                      1
rating_count                                2
                                           ..
subcategory_1_UninterruptedPowerSupplies    0
subcategory_1_Vacuum,Cleaning&Ironing       0
subcategory_1_VideoCameras                  0
subcategory_1_WaterHeaters&Geysers          0
subcategory_1_WaterPurifiers&Accessories    0
Length: 327, dtype: int64




1. Random Forest Regression:
   - The features (input variables) are defined as `X`, which is obtained by dropping the identifier and free-text columns from the `encoded_data` DataFrame. The 'rating' column is not in the drop list, so it remains in `X`, as the missing-value output confirms.
   - The target variable (output variable) is defined as `y`, which corresponds to the 'rating' column of the `encoded_data` DataFrame.
   - The code checks for missing values in the feature matrix `X` and prints the count of missing values for each column.
   - Similarly, the code checks for missing values in the target variable `y` and prints the count (1).
   - Missing values in `X` are handled using the `SimpleImputer` class from scikit-learn. The strategy chosen is 'mean', which replaces missing values with the mean value of the respective column.
   - The missing value in `y` is filled with the mean rating in the same way. Dropping rows with a missing target is a common alternative, because an imputed target is not a real observation.
   - The data is split into training and testing sets using the `train_test_split` function from scikit-learn.
   - A `RandomForestRegressor` model is initialized and trained using the training data.
   - Predictions are made on the test set using the trained model.
   - The mean squared error (MSE) is calculated to evaluate the performance of the regression model, and it is printed as the output. The value of about 0.0005 is very low because the model receives 'rating' as an input feature while predicting 'rating' (target leakage), so it does not measure how well the other features predict the rating.

2. K-Means Clustering:
   - The features (input variables) are defined as `X`, similar to the previous task.
   - The code checks for missing values in the feature matrix `X` and prints the count of missing values for each column.
   - Missing values in `X` are handled using the `SimpleImputer` class from scikit-learn, with the same 'mean' strategy as before.
   - A `KMeans` clustering model is initialized with `n_clusters=3`, which means it will cluster the data into 3 groups.
   - The K-Means model is fitted to the data, assigning cluster labels to each data point.
   - The cluster labels are stored in the `cluster_labels` variable.
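For the regression in item 1, a leakage-free setup removes the target from the features before training. The sketch below uses synthetic data (the column values and the relationship between discount and rating are assumptions for illustration) and compares the model's MSE with a baseline that always predicts the mean training rating.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the numeric columns of encoded_data
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "discounted_price": rng.uniform(100, 1000, 300),
    "discount_percentage": rng.uniform(0, 0.9, 300),
})
df["rating"] = 3.5 + df["discount_percentage"] + rng.normal(0, 0.2, 300)
df.loc[0, "rating"] = np.nan  # one missing target, as in the notebook

# Drop rows with a missing target instead of imputing it
df = df.dropna(subset=["rating"])

# Keep the target out of the feature matrix
X = df.drop(columns=["rating"])
y = df["rating"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(random_state=42).fit(X_train, y_train)
model_mse = mean_squared_error(y_test, model.predict(X_test))

# Baseline: always predict the mean training rating
baseline_mse = mean_squared_error(y_test, np.full(len(y_test), y_train.mean()))

print("Model MSE:   ", model_mse)
print("Baseline MSE:", baseline_mse)
```

With the target removed from `X`, the MSE reflects what the remaining features can predict, and the baseline shows how much of that is better than guessing the mean.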

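Additionally, scikit-learn may emit warnings during the K-Means fit, such as the notice about the default value of `n_init` and the memory leak issue with K-Means on Windows with MKL. These warnings do not indicate that the fit failed, and passing `n_init` explicitly avoids the first one.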

The output shows the count of missing values for each column of `X` before imputation. The cluster labels are stored in `cluster_labels` but are not printed by this cell.
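To inspect the clustering result, the labels can be summarized as cluster sizes. This sketch runs on three synthetic, well-separated groups rather than the notebook's data; the standardization step and the explicit `n_init` and `random_state` values are assumptions added here, since K-Means is distance-based and unscaled columns such as rating counts would otherwise dominate the 0/1 one-hot columns.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Three synthetic, well-separated groups of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

# Put all features on a comparable scale before computing distances
X_scaled = StandardScaler().fit_transform(X)

# Explicit n_init and random_state make the run reproducible
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

# Number of points assigned to each cluster
cluster_sizes = np.bincount(cluster_labels)
print("Cluster sizes:", cluster_sizes)
```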

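Note: The `warnings.filterwarnings("ignore")` statement suppresses all warning messages raised after it runs. In this cell it is placed at the end, after `kmeans.fit(X)`, so it only affects code executed later in the session. To silence warnings from the fit itself, it has to run first, for example directly after the import statements.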
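The effect of filter placement can be demonstrated with the standard library alone. In this sketch `noisy` is a hypothetical function standing in for a library call that warns; `catch_warnings(record=True)` is used only to count what gets through, and it also restores the previous filters when the block ends.

```python
import warnings

def noisy():
    """Hypothetical stand-in for a library call that emits a warning."""
    warnings.warn("example warning", FutureWarning)
    return 42

# No ignore filter installed before the call: the warning is recorded
with warnings.catch_warnings(record=True) as caught_before:
    warnings.simplefilter("always")
    noisy()

# Ignore filter installed before the call: nothing is recorded
with warnings.catch_warnings(record=True) as caught_after:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=FutureWarning)
    result = noisy()

print("Recorded without filter:", len(caught_before))
print("Recorded with filter:   ", len(caught_after))
```

Restricting the filter to one category, as done here with `FutureWarning`, keeps other warnings visible.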