Recipe Reviews and User Feedback

Name: Joel K Abraham

Organisation: Entri Elevate

Date: 18/03/25

Introduction

The Recipe Reviews dataset comprises 18,182 user-generated reviews of recipes, together with data on how other users interacted with each review. It includes recipe identifiers (recipe_number, recipe_code, and recipe_name) and review-specific data such as unique comment IDs, timestamps, reply counts, and engagement metrics (thumbs_up and thumbs_down). It also describes the reviewer through user IDs, names, and reputation scores, and records numerical scores (stars and best_score) alongside the review text itself. This makes the dataset well suited to analyzing sentiment, user engagement, and trends in culinary feedback.


Objective

The Recipe Reviews dataset is intended to provide a multifaceted view of user interactions with culinary recipes, enabling the analysis of sentiment, engagement, and review quality. It supports research into how factors such as user reputation, review scores (stars and best_score), and textual feedback correlate with overall user satisfaction and recipe performance. It facilitates analytical tasks ranging from sentiment analysis and trend detection to the development of recommendation systems, offering insights for data scientists and culinary professionals who want to improve user experience and recipe curation.

Data Description

Unnamed: 0: An index column generated during data export or import.
recipe_number: A numerical identifier for the recipe.
recipe_code: A code representing the recipe, likely used for internal identification.
recipe_name: The name or title of the recipe.
comment_id: A unique identifier for each review or comment.
user_id: A unique identifier for the user who submitted the review.
user_name: The display name of the user.
user_reputation: A numerical score indicating the credibility or influence of the user.
created_at: A timestamp showing when the review was posted, stored as Unix epoch seconds.
reply_count: The number of replies or follow-up comments the review received.
thumbs_up: The count of positive feedback received for the review.
thumbs_down: The count of negative feedback received for the review.
stars: The rating given to the recipe on a 1-5 star scale; the data also contains 0, which appears to mean that no rating was given.
best_score: A computed score that might reflect the quality or relevance of the review.
text: The actual textual content of the review.
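
The Unnamed: 0 column is typically what pandas produces when a CSV was written with its index and is read back without index_col. A minimal sketch, using a made-up two-row CSV held in memory rather than Recipe_Reviews.csv, of how index_col=0 keeps that column out of the feature set:

```python
import io
import pandas as pd

# Made-up stand-in for the CSV: the first header cell is empty, as in an exported index
csv_text = ",recipe_number,stars\n0,1,5\n1,1,4\n"

# Default read: the unnamed first column becomes a regular column
with_index_col = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index instead
without_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_col.columns))     # includes 'Unnamed: 0'
print(list(without_index_col.columns))  # only the real columns
```

Whether to do this on the real file is a judgment call; the cells below keep the column, so their outputs include it.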

In [8]:
import pandas as pd

# Load dataset
df = pd.read_csv("Recipe_Reviews.csv", encoding="ISO-8859-1")  # Change file format if needed

# Display basic info
df.info()
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18182 entries, 0 to 18181
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       18182 non-null  int64 
 1   recipe_number    18182 non-null  int64 
 2   recipe_code      18182 non-null  int64 
 3   recipe_name      18182 non-null  object
 4   comment_id       18182 non-null  object
 5   user_id          18182 non-null  object
 6   user_name        18182 non-null  object
 7   user_reputation  18182 non-null  int64 
 8   created_at       18182 non-null  int64 
 9   reply_count      18182 non-null  int64 
 10  thumbs_up        18182 non-null  int64 
 11  thumbs_down      18182 non-null  int64 
 12  stars            18182 non-null  int64 
 13  best_score       18182 non-null  int64 
 14  text             18180 non-null  object
dtypes: int64(10), object(5)
memory usage: 2.1+ MB


Unnamed: 0.1,Unnamed: 0,recipe_number,recipe_code,recipe_name,comment_id,user_id,user_name,user_reputation,created_at,reply_count,thumbs_up,thumbs_down,stars,best_score,text
0,0,1,14299,Creamy White Chili,sp_aUSaElGf_14299_c_2G3aneMRgRMZwXqIHmSdXSG1hEM,u_9iFLIhMa8QaG,Jeri326,1,1665619889,0,0,0,5,527,"I tweaked it a little, removed onions because ..."
1,1,1,14299,Creamy White Chili,sp_aUSaElGf_14299_c_2FsPC83HtzCsQAtOxlbL6RcaPbY,u_Lu6p25tmE77j,Mark467,50,1665277687,0,7,0,5,724,Bush used to have a white chili bean and it ma...
2,2,1,14299,Creamy White Chili,sp_aUSaElGf_14299_c_2FPrSGyTv7PQkZq37j92r9mYGkP,u_s0LwgpZ8Jsqq,Barbara566,10,1664404557,0,3,0,5,710,I have a very complicated white chicken chili ...
3,3,1,14299,Creamy White Chili,sp_aUSaElGf_14299_c_2DzdSIgV9qNiuBaLoZ7JQaartoC,u_fqrybAdYjgjG,jeansch123,1,1661787808,2,2,0,0,581,"In your introduction, you mentioned cream chee..."
4,4,1,14299,Creamy White Chili,sp_aUSaElGf_14299_c_2DtZJuRQYeTFwXBoZRfRhBPEXjI,u_XXWKwVhKZD69,camper77,10,1664913823,1,7,0,0,820,Wonderful! I made this for a &#34;Chili/Stew&#...


In [9]:
# Check missing values
missing_values = df.isnull().sum()
print(missing_values)

Unnamed: 0         0
recipe_number      0
recipe_code        0
recipe_name        0
comment_id         0
user_id            0
user_name          0
user_reputation    0
created_at         0
reply_count        0
thumbs_up          0
thumbs_down        0
stars              0
best_score         0
text               2
dtype: int64


In [10]:
# Drop rows with a missing user_id or text (here, the 2 rows with missing text)
df.dropna(subset=["user_id", "text"], inplace=True)

In [11]:
df.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
18177    False
18178    False
18179    False
18180    False
18181    False
Length: 18180, dtype: bool

In [12]:
df.duplicated().sum()

0

In [13]:
# created_at holds Unix epoch seconds, so the unit must be given explicitly;
# without it pandas reads the integers as nanoseconds since 1970-01-01
df["created_at"] = pd.to_datetime(df["created_at"], unit="s")
df["created_at"]

0       2022-10-13 00:11:29
1       2022-10-09 01:08:07
2       2022-09-28 22:35:57
3       2022-08-29 15:43:28
4       2022-10-04 20:03:43
                ...        
18177   2021-06-03 10:59:37
18178   2021-02-11 09:45:20
18179   2021-06-03 10:57:24
18180   2021-06-03 10:47:13
18181   2021-06-03 10:53:45
Name: created_at, Length: 18180, dtype: datetime64[ns]

In [14]:

# Function to count outliers using the IQR method
def count_outliers(df, columns):
    outlier_counts = {}
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
        outlier_counts[col] = len(outliers)
    return outlier_counts

# Define numerical columns to check for outliers
numerical_columns = ["user_reputation", "reply_count", "thumbs_up", "thumbs_down", "stars", "best_score"]

# Get the counts of outliers
outlier_counts = count_outliers(df, numerical_columns)

# Print out the results
for col, count in outlier_counts.items():
    print(f"{col}: {count} outliers")


user_reputation: 1246 outliers
reply_count: 230 outliers
thumbs_up: 4080 outliers
thumbs_down: 2396 outliers
stars: 4353 outliers
best_score: 4180 outliers
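
Counts this large can be a side effect of the IQR rule on zero-inflated columns rather than a sign of thousands of anomalous reviews: when at least 75% of the values are identical, Q1 equals Q3, the IQR is 0, and every other value falls outside the fences. The sketch below illustrates this on made-up counts (not the actual thumbs_down column).

```python
import pandas as pd

# Made-up, zero-inflated counts standing in for a column such as thumbs_down
counts = pd.Series([0] * 90 + [1] * 6 + [2] * 3 + [15])

q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Same rule as count_outliers: anything outside [lower, upper] is flagged
flagged = counts[(counts < lower) | (counts > upper)]

print(f"IQR = {iqr}, fences = [{lower}, {upper}]")
print(f"Flagged: {len(flagged)} of {len(counts)}; non-zero values: {(counts > 0).sum()}")
```

If the real columns behave this way, the flagged rows are simply the reviews that received any votes or replies, which is worth checking before treating them as outliers.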


In [15]:
df.select_dtypes(include = ["number"]).skew()

Unnamed: 0          2.112885
recipe_number       0.458756
recipe_code         3.443614
user_reputation    33.716657
reply_count        11.282445
thumbs_up           8.413665
thumbs_down        17.889305
stars              -2.128418
best_score          3.402349
dtype: float64

In [16]:
def remove_outliers(df, columns):
    df_clean = df.copy()
    for col in columns:
        Q1 = df_clean[col].quantile(0.25)
        Q3 = df_clean[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # Keep rows within the bounds for the current column
        df_clean = df_clean[(df_clean[col] >= lower_bound) & (df_clean[col] <= upper_bound)]
    return df_clean

# Define numerical columns from which to remove outliers
# Excluding 'stars' and 'recipe_number'
columns_to_clean = ["user_reputation", "reply_count", "thumbs_up", "thumbs_down", "best_score"]

# Remove outliers
df_clean = remove_outliers(df, columns_to_clean)

# Print the shapes of the original and cleaned datasets
print("Original dataset shape:", df.shape)
print("Cleaned dataset shape:", df_clean.shape)

Original dataset shape: (18180, 15)
Cleaned dataset shape: (12657, 15)


In [17]:
import pandas as pd
import numpy as np
from scipy.stats import boxcox

# Assume df_clean is your cleaned DataFrame
# Select numeric columns
numeric_cols = df_clean.select_dtypes(include=[np.number]).columns

# Compute skewness for each numeric column
skewness = df_clean[numeric_cols].skew()
print("Skewness of numeric columns:")
print(skewness)

# Define thresholds for high skewness
right_threshold = 1   # right-skewed if skewness > 1
left_threshold = -1   # left-skewed if skewness < -1

# Dictionaries to store transformation parameters
boxcox_lambdas = {}

# Loop over each numeric column
for col in numeric_cols:
    col_skew = skewness[col]
    # For right-skewed columns: use Box-Cox transformation
    if col_skew > right_threshold:
        # Box-Cox requires strictly positive values. If not, shift the data.
        if (df_clean[col] <= 0).any():
            shift = abs(df_clean[col].min()) + 1
            data = df_clean[col] + shift
            transformed, lmbda = boxcox(data)
            print(f"Applied Box-Cox to '{col}' with shift {shift} and lambda {lmbda:.4f}")
        else:
            transformed, lmbda = boxcox(df_clean[col])
            print(f"Applied Box-Cox to '{col}' with lambda {lmbda:.4f}")
        # Save transformed data in a new column (you can choose to replace the original)
        df_clean[col + '_boxcox'] = transformed
        boxcox_lambdas[col] = lmbda
    # For left-skewed columns: use square transformation
    elif col_skew < left_threshold:
        df_clean[col + '_square'] = df_clean[col] ** 2
        print(f"Applied square transformation to '{col}'")
    else:
        print(f"No transformation needed for '{col}'")

# Optionally, inspect the new columns
print("\nTransformed DataFrame columns:")
print(df_clean.columns)

Skewness of numeric columns:
Unnamed: 0         1.945846
recipe_number      0.587924
recipe_code        3.105770
user_reputation    0.000000
reply_count        0.000000
thumbs_up          0.000000
thumbs_down        0.000000
stars             -2.315872
best_score         0.000000
dtype: float64
Applied Box-Cox to 'Unnamed: 0' with shift 1 and lambda 0.1059
No transformation needed for 'recipe_number'
Applied Box-Cox to 'recipe_code' with lambda 0.2571
No transformation needed for 'user_reputation'
No transformation needed for 'reply_count'
No transformation needed for 'thumbs_up'
No transformation needed for 'thumbs_down'
Applied square transformation to 'stars'
No transformation needed for 'best_score'

Transformed DataFrame columns:
Index(['Unnamed: 0', 'recipe_number', 'recipe_code', 'recipe_name',
       'comment_id', 'user_id', 'user_name', 'user_reputation', 'created_at',
       'reply_count', 't
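
A skewness of exactly 0 for user_reputation, reply_count, thumbs_up, thumbs_down, and best_score is what pandas reports for a column with a single distinct value, which may be what the IQR filter left behind. A quick way to check, sketched here on a made-up frame (on the real data the same calls would be applied to df_clean):

```python
import pandas as pd

# Made-up frame: 'reply_count' is constant, 'best_score' still varies
demo = pd.DataFrame({
    "reply_count": [0, 0, 0, 0, 0],
    "best_score": [100, 100, 193, 100, 527],
})

print(demo.skew())     # the constant column reports a skewness of 0
print(demo.nunique())  # number of distinct values per column

# Columns with one distinct value carry no information for a model
constant_cols = [col for col in demo.columns if demo[col].nunique() == 1]
print("Constant columns:", constant_cols)
```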

In [18]:
# Separate the features (x) from the target (y)
x = df.drop(columns=['stars'])  
y = df['stars']

In [19]:
y.value_counts()

stars
5    13827
0     1696
4     1655
3      490
1      280
2      232
Name: count, dtype: int64

In [20]:
# Apply SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
x_resampled, y_resampled = smote.fit_resample(x.select_dtypes(include='number'), y)

In [21]:
print("After SMOTE:")
print(pd.Series(y_resampled).value_counts())

After SMOTE:
stars
5    13827
0    13827
4    13827
3    13827
1    13827
2    13827
Name: count, dtype: int64
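
Oversampling is usually applied to the training split only, so that the test set keeps the original class distribution and contains no synthetic rows. The sketch below shows that ordering on made-up data. As an assumption made to keep the example free of extra dependencies, it uses plain random oversampling via sklearn.utils.resample as a stand-in for SMOTE; with imblearn, the same position in the workflow would call fit_resample on the training arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))        # made-up features
y_demo = np.array([5] * 160 + [0] * 40)   # made-up imbalanced labels

# 1) Split first, stratifying so both splits keep the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# 2) Oversample minority classes in the training split only
labels, label_counts = np.unique(y_tr, return_counts=True)
target = int(label_counts.max())
X_parts, y_parts = [], []
for label, n in zip(labels, label_counts):
    X_cls = X_tr[y_tr == label]
    if n < target:
        # Sample with replacement until this class matches the majority count
        X_cls = resample(X_cls, replace=True, n_samples=target, random_state=42)
    X_parts.append(X_cls)
    y_parts.append(np.full(len(X_cls), label))

X_tr_bal = np.vstack(X_parts)
y_tr_bal = np.concatenate(y_parts)

print("Train before:", dict(zip(labels.tolist(), label_counts.tolist())))
print("Train after: ", {int(l): int((y_tr_bal == l).sum()) for l in labels})
print("Test size (untouched):", len(y_te))
```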


In [22]:
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Drop the target and select only numeric columns for feature selection
X = df.drop("stars", axis=1)
X_numeric = X.select_dtypes(include='number')

k = min(9, X_numeric.shape[1])  # Number of features to select, capped at the number of numeric columns available
selector = SelectKBest(score_func=f_classif, k=k)
X_new = selector.fit_transform(X_numeric, y)

# Get the names of the selected features
selected_feature_names = X_numeric.columns[selector.get_support()]
print("Selected Features:", list(selected_feature_names))


Selected Features: ['Unnamed: 0', 'recipe_number', 'recipe_code', 'user_reputation', 'reply_count', 'thumbs_up', 'thumbs_down', 'best_score']
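
Because the requested k is not smaller than the number of numeric columns, every column is kept, and the list above does not show which features actually matter. The per-feature F-scores stored in scores_ can be ranked instead. A minimal sketch on synthetic data (the column names informative, noise_a, and noise_b are invented for the example):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=300)

# One column that depends on the label and two that are pure noise
X_demo = pd.DataFrame({
    "informative": y_demo * 2.0 + rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})

# k="all" keeps every column; the point here is only to read the scores
selector_demo = SelectKBest(score_func=f_classif, k="all").fit(X_demo, y_demo)
scores = pd.Series(selector_demo.scores_, index=X_demo.columns)
print(scores.sort_values(ascending=False))
```

On the recipe data, the same ranking would be read from selector.scores_ against X_numeric.columns.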




In [23]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_new = scaler.fit_transform(X_new)

In [24]:
x_train, x_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=42)
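
In the cells above the scaler is fit on all rows before the split, so the test rows influence the scaling statistics. One common way to avoid this is to split first and wrap the scaler and the model in a pipeline, which fits the scaler on the training rows only. A sketch on synthetic data from make_classification, not the recipe features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8 numeric features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)

# Split before any fitting; stratify keeps the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on X_tr only, then reuses it to transform X_te
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

print(f"Test accuracy: {pipe.score(X_te, y_te):.4f}")
print("Rows seen by the scaler:", pipe.named_steps["standardscaler"].n_samples_seen_)
```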

In [25]:
print(f"""
  x_train shape : {x_train.shape}
  x_test shape : {x_test.shape}
  y_train shape : {y_train.shape}
  y_test shape : {y_test.shape}
  """)


  x_train shape : (14544, 8)
  x_test shape : (3636, 8)
  y_train shape : (14544,)
  y_test shape : (3636,)
  


In [26]:
# Define five classification models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
# Define models with regularization parameters applied where applicable
models = {
    'Logistic Regression': LogisticRegression(
                                penalty='l2',      # L2 regularization
                                C=1.0,             # Regularization strength (adjust as needed)
                                solver='lbfgs', 
                                max_iter=1000, 
                                random_state=42),
    'Decision Tree': DecisionTreeClassifier(
                                random_state=42, 
                                max_depth=10,      # Limit tree depth to reduce overfitting
                                min_samples_split=5),  # Increase min samples to split a node
    'Random Forest': RandomForestClassifier(
                                n_estimators=100, 
                                random_state=42, 
                                max_depth=10,      # Limit depth of trees for regularization
                                min_samples_split=5),
    'Support Vector Machine': SVC(
                                C=1.0,             # Regularization parameter (adjust as needed)
                                kernel='rbf', 
                                gamma='scale', 
                                random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5)
}

In [27]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

results = {}
for model_name, model in models.items():
    model.fit(x_train, y_train)
    
    # Predict on training and test sets
    y_train_pred = model.predict(x_train)
    y_test_pred = model.predict(x_test)
    
    # Calculate accuracies
    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    
    # Calculate precision, recall, and F1 score with zero_division set to 0
    train_precision = precision_score(y_train, y_train_pred, average='weighted', zero_division=0)
    train_recall = recall_score(y_train, y_train_pred, average='weighted', zero_division=0)
    train_f1 = f1_score(y_train, y_train_pred, average='weighted', zero_division=0)
    
    test_precision = precision_score(y_test, y_test_pred, average='weighted', zero_division=0)
    test_recall = recall_score(y_test, y_test_pred, average='weighted', zero_division=0)
    test_f1 = f1_score(y_test, y_test_pred, average='weighted', zero_division=0)
    
    results[model_name] = {
        'train_accuracy': train_acc,
        'test_accuracy': test_acc,
        'train_precision': train_precision,
        'train_recall': train_recall,
        'train_f1': train_f1,
        'test_precision': test_precision,
        'test_recall': test_recall,
        'test_f1': test_f1,
    }
    
    print(f"{model_name} -")
    print(f"    Train Accuracy: {train_acc:.4f}, Precision: {train_precision:.4f}, Recall: {train_recall:.4f}, F1: {train_f1:.4f}")
    print(f"    Test Accuracy:  {test_acc:.4f}, Precision: {test_precision:.4f}, Recall: {test_recall:.4f}, F1: {test_f1:.4f}\n")


Logistic Regression -
    Train Accuracy: 0.7633, Precision: 0.6263, Recall: 0.7633, F1: 0.6676
    Test Accuracy:  0.7561, Precision: 0.6013, Recall: 0.7561, F1: 0.6572

Decision Tree -
    Train Accuracy: 0.7948, Precision: 0.7825, Recall: 0.7948, F1: 0.7374
    Test Accuracy:  0.7588, Precision: 0.6709, Recall: 0.7588, F1: 0.6881

Random Forest -
    Train Accuracy: 0.7855, Precision: 0.8156, Recall: 0.7855, F1: 0.7120
    Test Accuracy:  0.7627, Precision: 0.6667, Recall: 0.7627, F1: 0.6744

Support Vector Machine -
    Train Accuracy: 0.7673, Precision: 0.7752, Recall: 0.7673, F1: 0.6744
    Test Accuracy:  0.7577, Precision: 0.6104, Recall: 0.7577, F1: 0.6590

K-Nearest Neighbors -
    Train Accuracy: 0.7972, Precision: 0.7584, Recall: 0.7972, F1: 0.7619
    Test Accuracy:  0.7398, Precision: 0.6661, Recall: 0.7398, F1: 0.6927
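
Two points help in reading these numbers. With average='weighted', recall is identical to accuracy, which is why those two values match in every row. And since 13,827 of the 18,180 reviews have 5 stars, always predicting 5 already scores about 0.76 accuracy. The sketch below rebuilds the label vector from the value counts shown earlier (over the full dataset, so it only approximates the test split) and evaluates that baseline; macro-averaged F1 is included as a metric that does not reward ignoring the minority classes.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Class counts copied from the y.value_counts() output
counts = {5: 13827, 0: 1696, 4: 1655, 3: 490, 1: 280, 2: 232}
y_all = np.repeat(list(counts.keys()), list(counts.values()))
X_dummy = np.zeros((len(y_all), 1))  # features are ignored by this baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_all)
y_pred = baseline.predict(X_dummy)

acc = accuracy_score(y_all, y_pred)
weighted_recall = recall_score(y_all, y_pred, average="weighted", zero_division=0)
macro_f1 = f1_score(y_all, y_pred, average="macro", zero_division=0)

print(f"Accuracy: {acc:.4f}")
print(f"Weighted recall: {weighted_recall:.4f}")
print(f"Macro F1: {macro_f1:.4f}")
```

Against this baseline, the test accuracies above are best read alongside per-class or macro-averaged metrics.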



In [28]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Define the parameter grid for Random Forest tuning
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}

# Initialize Random Forest model
rf = RandomForestClassifier(random_state=42)

# Setup GridSearchCV with 3-fold cross-validation
grid_search_rf = GridSearchCV(estimator=rf,
                              param_grid=param_grid_rf,
                              cv=3,
                              n_jobs=-1,
                              scoring='accuracy',
                              verbose=1)

# Fit GridSearchCV on the training data
grid_search_rf.fit(x_train, y_train)

# Output the best parameters and cross-validation score
print("Best Parameters for Random Forest:", grid_search_rf.best_params_)
print("Best Cross-validation Accuracy:", grid_search_rf.best_score_)

# Evaluate the best estimator on the test set
best_rf = grid_search_rf.best_estimator_
y_pred_test = best_rf.predict(x_test)

test_accuracy = accuracy_score(y_test, y_pred_test)
test_precision = precision_score(y_test, y_pred_test, average='weighted', zero_division=0)
test_recall = recall_score(y_test, y_pred_test, average='weighted', zero_division=0)
test_f1 = f1_score(y_test, y_pred_test, average='weighted', zero_division=0)

print("\nTest Metrics for the Tuned Random Forest:")
print(f"Test Accuracy:  {test_accuracy:.4f}")
print(f"Test Precision: {test_precision:.4f}")
print(f"Test Recall:    {test_recall:.4f}")
print(f"Test F1 Score:  {test_f1:.4f}")


Fitting 3 folds for each of 324 candidates, totalling 972 fits
Best Parameters for Random Forest: {'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 100}
Best Cross-validation Accuracy: 0.7707645764576458

Test Metrics for the Tuned Random Forest:
Test Accuracy:  0.7660
Test Precision: 0.6868
Test Recall:    0.7660
Test F1 Score:  0.6858
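
Given the class imbalance, tuning for accuracy alone may favor models that mostly predict 5 stars. A possible variation, shown here on a small synthetic problem with a deliberately tiny grid, uses stratified folds and macro F1 as the selection metric. The grid values are illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced two-class problem standing in for the star ratings
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=42
)

# Stratified folds keep the class ratio in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=25, random_state=42),
    param_grid={"max_depth": [5, None], "min_samples_leaf": [1, 4]},
    cv=cv,
    scoring="f1_macro",  # averages F1 over classes, regardless of class size
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated macro F1: {search.best_score_:.4f}")
```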
