
# Welcome to the Recipe Ratings Prediction Project!

## Problem Overview
The project involves developing predictive models to forecast the ratings for various recipes. The data comprises recipe names, user reviews, and other key features. The objective is to analyze this information and create accurate rating predictions for each recipe.

## Load basic libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

## Load the data

In [2]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



/kaggle/input/recipe-for-rating-predict-food-ratings-using-ml/sample.csv
/kaggle/input/recipe-for-rating-predict-food-ratings-using-ml/train.csv
/kaggle/input/recipe-for-rating-predict-food-ratings-using-ml/test.csv


In [3]:
# Read the CSV files into pandas DataFrames
train = pd.read_csv("/kaggle/input/recipe-for-rating-predict-food-ratings-using-ml/train.csv")
test = pd.read_csv("/kaggle/input/recipe-for-rating-predict-food-ratings-using-ml/test.csv")

## Exploring Datasets

In [None]:
train.head()

In [None]:
train.shape

### Features

In [4]:
features = train.columns[train.columns != 'Rating']
label = train['Rating']
print(features,label)

Index(['ID', 'RecipeNumber', 'RecipeCode', 'RecipeName', 'CommentID', 'UserID',
       'UserName', 'UserReputation', 'CreationTimestamp', 'ReplyCount',
       'ThumbsUpCount', 'ThumbsDownCount', 'BestScore', 'Recipe_Review'],
      dtype='object') 0        5
1        5
2        3
3        5
4        4
        ..
13631    5
13632    5
13633    5
13634    5
13635    5
Name: Rating, Length: 13636, dtype: int64


### Data statistics

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13636 entries, 0 to 13635
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   ID                 13636 non-null  int64 
 1   RecipeNumber       13636 non-null  int64 
 2   RecipeCode         13636 non-null  int64 
 3   RecipeName         13636 non-null  object
 4   CommentID          13636 non-null  object
 5   UserID             13636 non-null  object
 6   UserName           13636 non-null  object
 7   UserReputation     13636 non-null  int64 
 8   CreationTimestamp  13636 non-null  int64 
 9   ReplyCount         13636 non-null  int64 
 10  ThumbsUpCount      13636 non-null  int64 
 11  ThumbsDownCount    13636 non-null  int64 
 12  Rating             13636 non-null  int64 
 13  BestScore          13636 non-null  int64 
 14  Recipe_Review      13634 non-null  object
dtypes: int64(10), object(5)
memory usage: 1.6+ MB


* Total entries: 13636 (13634 of them have a non-null `Recipe_Review`)
* Total columns: 15 (14 features + 1 label)
    * Label column: `Rating`
    * Features: `['ID', 'RecipeNumber', 'RecipeCode', 'RecipeName', 'CommentID', 'UserID','UserName', 'UserReputation', 'CreationTimestamp', 'ReplyCount','ThumbsUpCount', 'ThumbsDownCount', 'BestScore', 'Recipe_Review']`

In [None]:
train.describe()

We can draw a few conclusions:
* **User Reputation:** The average user reputation is around 2.16, with a minimum of 1 and a maximum of 510, so the distribution is strongly right-skewed.

* **Reply Count:** Most reviews have a low reply count, with 75% having zero replies.

* **ThumbsUp Count:** Similarly, most reviews receive few thumbs-up, with 75% having none. While some reviews are well received, most get little positive feedback.

* **ThumbsDown Count:** The distribution of thumbs-down counts is similar to that of thumbs-up counts, with most reviews receiving zero thumbs-down.

* **Rating:** The average rating is around 4.29 out of 5, and 5 is the most common rating, so the majority of reviews rate the recipe highly.

* **BestScore:** This attribute has a wide range of values, with a mean of 153 and a maximum of 946.


In [None]:
train['Rating'].value_counts()

Based on the distribution of ratings from the cell above:

- The majority of ratings fall into the 5-star category, with 10,371 occurrences.
- The 0-star and 4-star ratings also have a considerable number of occurrences, with 1,272 and 1,241 respectively.
- The 3-star rating has fewer occurrences compared to the higher ratings, with 368 instances.
- The 1-star and 2-star ratings are the least frequent, with 210 and 174 occurrences respectively.
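Because one rating dominates, it helps to know what accuracy a trivial model would reach before judging any classifier. The sketch below reuses the counts listed above; the names `rating_counts` and `baseline_accuracy` are illustrative and do not appear elsewhere in the notebook.

```python
# Rating counts as listed above (from train['Rating'].value_counts())
rating_counts = {5: 10371, 0: 1272, 4: 1241, 3: 368, 1: 210, 2: 174}

total = sum(rating_counts.values())
shares = {rating: count / total for rating, count in rating_counts.items()}

# A model that always predicts the most common rating reaches this accuracy
majority_rating = max(rating_counts, key=rating_counts.get)
baseline_accuracy = shares[majority_rating]

for rating in sorted(shares):
    print(f"rating {rating}: {shares[rating]:.1%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Always predicting 5 stars is already right for roughly 76% of the rows, so accuracies reported later are best read against that baseline.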

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
df = train.copy()
df = df.dropna()
ratings_to_visualize = [0, 1, 2, 3, 4, 5]

num_cols = 3

num_rows = (len(ratings_to_visualize) + num_cols - 1) // num_cols

fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 8))

for i, rating in enumerate(ratings_to_visualize):
    filtered_comments = df[df['Rating'] == rating]['Recipe_Review']
    text = ' '.join(filtered_comments)
    wordcloud = WordCloud(width=400, height=200, background_color='white').generate(text)
    row = i // num_cols
    col = i % num_cols
    axes[row, col].imshow(wordcloud, interpolation='bilinear')
    axes[row, col].set_title(f'Rating {rating}')
    axes[row, col].axis('off')
plt.suptitle("Word clouds for each rating", fontsize=20)
plt.tight_layout()
plt.show()


In [None]:
train.Rating.hist()
plt.suptitle("Rating distribution")
plt.xlabel("Rating")
plt.ylabel('Count');

### We can note that there is imbalance in the dataset
Since the dataset is imbalanced, with a significant number of 5-star ratings compared to lower ratings, consider techniques such as oversampling, undersampling, or using algorithms that are robust to class imbalance.
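One lightweight alternative to resampling is to weight each class inversely to its frequency. The sketch below computes such weights from the rating counts reported earlier, using the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn applies for `class_weight='balanced'`. The variable names are illustrative, and this is not something the notebook itself does.

```python
# Rating counts reported earlier in the notebook
rating_counts = {0: 1272, 1: 210, 2: 174, 3: 368, 4: 1241, 5: 10371}

n_samples = sum(rating_counts.values())
n_classes = len(rating_counts)

# "Balanced" heuristic: rare classes get proportionally larger weights
class_weights = {
    rating: n_samples / (n_classes * count)
    for rating, count in rating_counts.items()
}

for rating, weight in class_weights.items():
    print(f"rating {rating}: weight {weight:.2f}")
```

In scikit-learn the same effect can be requested by passing `class_weight='balanced'` to estimators that support it, such as `LogisticRegression` or `SVC`.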

In [None]:
train.hist(bins=50,figsize=(15,15))
plt.suptitle("Distributions of numeric columns",fontsize=20)
plt.show()

A few observations based on these plots:
1. Some features are at different scales
2. Features have different distributions
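Features on different scales are usually rescaled before being fed to distance-based or gradient-based models. The following dependency-free sketch shows the two transformations that `StandardScaler` and `MinMaxScaler` perform; the sample values are hypothetical and only mimic the ranges described above.

```python
from statistics import mean, pstdev

# Hypothetical values on very different scales (not taken from the dataset)
user_reputation = [1, 1, 10, 20, 510]
best_score = [100, 100, 193, 946, 100]

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

reputation_z = standardize(user_reputation)
best_score_01 = min_max(best_score)

print([round(v, 2) for v in reputation_z])
print([round(v, 2) for v in best_score_01])
```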

# Data visualization

In [None]:
exploration_set = train.copy()

In [None]:

# 'number' selects every numeric dtype regardless of platform integer width
numerical_columns = exploration_set.select_dtypes(include='number').columns
exploration_set_numeric = exploration_set[numerical_columns]

# Create correlation matrix
corr_matrix = exploration_set_numeric.corr()


In [None]:
corr_matrix['Rating']

- The correlation analysis reveals **subtle relationships** between the 'Rating' column and other numerical features.
- A **weak positive correlation** with the 'ID' column indicates a slight tendency for higher ratings with higher ID values. Since 'ID' is only a row identifier, this is unlikely to be a meaningful signal.
- **Weak negative correlations** with 'ReplyCount' and 'ThumbsDownCount' suggest that recipes with more replies or thumbs-down counts tend to have slightly lower ratings.
- 'UserReputation' exhibits a **negligible positive correlation** with the 'Rating' column, implying a minimal influence of user reputation on recipe ratings.
- The remaining features show correlations **close to zero**, indicating minimal linear relationships with recipe ratings.
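For reference, `DataFrame.corr()` computes the Pearson coefficient by default. The sketch below implements that coefficient directly on hypothetical toy data (the two lists are invented for illustration) to show what a negative value means.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: more thumbs-down goes with lower ratings
thumbs_down = [0, 0, 1, 3, 5, 0]
rating = [5, 5, 4, 2, 1, 5]

r = pearson(thumbs_down, rating)
print(f"Pearson r = {r:.3f}")
```

The toy data is deliberately extreme, so the coefficient is close to -1. The correlations in the real data are much weaker.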

In [None]:
plt.figure(figsize=(14,7))
sns.heatmap(corr_matrix, annot=True);

In [None]:
plt.figure(figsize=(14,7))
sns.heatmap(corr_matrix[['Rating']], annot=True)
plt.title('Correlation Heatmap: Rating vs Others')
plt.show()


The heatmap shows the same correlations between "Rating" and the other features. The number of replies (`ReplyCount`) and thumbs-down reactions (`ThumbsDownCount`) correlate negatively with ratings, so reviews that spark more discussion or attract negative feedback tend to carry lower ratings. Thumbs-up reactions (`ThumbsUpCount`) also show a slight negative correlation with ratings, which suggests that a high thumbs-up count on a review does not imply a high rating. The creation timestamp (`CreationTimestamp`) shows a modest negative correlation as well, meaning that more recent reviews tend to have slightly lower ratings. All of these relationships are weak, and correlation captures only linear association.


# Data Preparation

### Separate features and labels from the training set.

In [None]:
X = train.drop("Rating",axis =1)
y = train['Rating']

### Data cleaning

In [None]:
X.isna().sum()

- **Dropping Rows with Null Values and Updating X and y**: To ensure data integrity and effectiveness in modeling, we'll handle null values by dropping rows containing them. We'll then update the feature matrix `X` and target variable `y` accordingly.

- **Dropping Unnecessary Columns**: Effective feature selection is crucial for improving model performance and interpretability. Here, we drop the identifier columns (`ID`, `RecipeNumber`, `RecipeCode`, `CommentID`, `UserID`, `UserName`) and `CreationTimestamp`, which carry little useful signal for our target variable, "Rating." This streamlines our dataset, focusing on the most relevant features for our predictive model.


In [6]:

column_names = [
    "ID",
    "RecipeNumber",
    "RecipeCode",
    "CommentID",
    "UserID",
    "UserName",
    "CreationTimestamp",
    "Rating"
]

# Drop rows with missing values once, then separate features and label
train_clean = train.dropna()
X = train_clean.drop(column_names, axis=1)
y = train_clean['Rating']

In [None]:
X.shape,y.shape

In [None]:
X

In [None]:
y

### Importing necessary libraries for Preprocessing

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest, chi2
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=UserWarning)

## Text Preprocessing
We compute the most frequent words and store them in `word_freq`.

In [None]:
word_freq = X['Recipe_Review'].str.split(expand=True).stack().value_counts()

In [None]:
word_freq.head()

In [None]:
stop_words = word_freq.index[:50].tolist()
print(stop_words)

Some stop words might be important, so we will **remove** only those that are **common** to the vocabulary of every rating.

In [None]:
ratings_to_create_vocabulary = [0, 1, 2, 3, 4, 5]
df = train.copy()
df=df.dropna()
vocabulary = {}

for rating in ratings_to_create_vocabulary:
    # Filter DataFrame for the specific rating
    filtered_reviews = df[df['Rating'] == rating]['Recipe_Review'].astype(str)
    all_reviews = ' '.join(filtered_reviews)
    words = all_reviews.split()
    vocabulary_per_rating = set(words)
    vocabulary[rating] = vocabulary_per_rating

We take the **intersection** of the per-rating vocabularies.

In [None]:
common_words = set.intersection(*vocabulary.values())
print(len(set(stop_words) - common_words ))

As we can see, every word in `stop_words` is also present in `common_words`, so we can use these words as stop words.
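The idea of keeping only words shared by every rating's vocabulary can be illustrated on a tiny example. The reviews below are hypothetical, and `candidate_stop_words` stands in for the frequency-based list built earlier.

```python
# Hypothetical reviews grouped by rating (not taken from the dataset)
reviews_by_rating = {
    1: ["the cake was dry and bland"],
    3: ["the soup was fine and easy"],
    5: ["the best cake and so easy", "loved the soup"],
}

# One vocabulary (set of distinct words) per rating
vocabulary = {
    rating: set(" ".join(reviews).split())
    for rating, reviews in reviews_by_rating.items()
}

# Words that occur in every rating's vocabulary carry no rating-specific signal
common_words = set.intersection(*vocabulary.values())

candidate_stop_words = ["the", "and", "cake"]
kept = [word for word in candidate_stop_words if word in common_words]
print(sorted(common_words), kept)
```

Here "cake" is frequent but missing from the 3-star vocabulary, so it is not treated as a stop word.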

In [None]:
# Columns for training
numeric_features = ['UserReputation', 'ReplyCount', 'ThumbsUpCount', 'ThumbsDownCount', 'BestScore']
categorical_features = ['RecipeName','Recipe_Review']

In [None]:
# Concatenate the two text features into one column to reduce dimensionality
X['cat'] = X['RecipeName'] + ' ' + X['Recipe_Review'] 
test['cat'] = test['RecipeName'] + ' ' + test['Recipe_Review']
# creating a new feature column
X['score'] = X['UserReputation']*X['BestScore']
test['score'] = test['UserReputation']*test['BestScore']

### Split the training data into training and validation sets

In [None]:
# y = y.to_numpy()
# y = y.reshape(-1,1)

In [None]:
# Hold out 20% of the training data as a validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
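With classes this imbalanced, a purely random split can leave very few rare ratings in the validation set. A stratified split keeps each rating's share roughly the same in both parts. The sketch below is a plain-Python illustration of that idea on hypothetical labels; it is an alternative to the split used here, not what the notebook does.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, valid_idx), sampling test_size of every label separately."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, valid_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_valid = round(len(indices) * test_size)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

# Hypothetical imbalanced labels
labels = [5] * 80 + [4] * 10 + [0] * 10
train_idx, valid_idx = stratified_split(labels)
valid_counts = Counter(labels[i] for i in valid_idx)
print(valid_counts)
```

In scikit-learn the equivalent is to pass `stratify=y` to `train_test_split`.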

In [None]:
# y

### Using TF-IDF to vectorize the text

In [8]:
NGRAM_RANGE = (1, 1)
TOKEN_MODE = 'word'
MIN_DOCUMENT_FREQUENCY = 5
kwargs = {
        'ngram_range': NGRAM_RANGE,
        'dtype': 'float64',
        'strip_accents': 'unicode',
        'decode_error': 'replace',
        'analyzer': TOKEN_MODE,
        'min_df': MIN_DOCUMENT_FREQUENCY,
        'stop_words': ['i', 'you', 'the']  # text is lowercased first, so 'I' is redundant; alternative: add 'and', 'it', 'a'
    }
vectorizer = TfidfVectorizer(**kwargs)
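To make the weighting concrete, the sketch below computes TF-IDF by hand for three hypothetical documents. It uses the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches `TfidfVectorizer`'s default settings, but it ignores the `min_df` and `stop_words` options configured above and uses simple whitespace tokenization.

```python
import math
from collections import Counter

# Hypothetical documents (not taken from the dataset)
docs = ["great cake great taste", "dry cake", "great soup"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    """Term counts times IDF, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vector in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vector.items()})
```

In the second document, "dry" gets a higher weight than "cake" because it appears in fewer documents.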

# MODELS 🔗

## Logistic Regression

In [None]:
# numeric_features = ['BestScore', 'ID', 'RecipeNumber', 'ReplyCount', 'ThumbsDownCount', 'ThumbsUpCount', 'UserReputation']
# categorical_features = ['RecipeName','Recipe_Review']

In [None]:
# Logistic Regression
pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
#         ('num', StandardScaler(), numeric_features),

        ('cat', vectorizer, 'cat')
    ])),
    
#     ('smote', SMOTE(random_state=42)), #Accuracy: 0.5988265493215988 

    ('classifier', LogisticRegression(max_iter=1000,random_state=42,))
])

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_valid)

# Evaluate the model
logistic_accuracy = accuracy_score(y_valid, y_pred)
print("Accuracy:", logistic_accuracy)

In [None]:
# def logistic():
#     # Logistic Regression
#     pipeline = Pipeline(steps=[
#         ('preprocessor', ColumnTransformer(transformers=[
#             ('num', StandardScaler(), numeric_features),

#             ('cat', vectorizer, 'cat')
#         ])),

#     #     ('smote', SMOTE(random_state=42)), #Accuracy: 0.5988265493215988 

#         ('classifier', LogisticRegression(max_iter=1000,random_state=42,))
#     ])

#     pipeline.fit(X_train, y_train)

#     y_pred = pipeline.predict(X_valid)

#     # Evaluate the model
#     logistic_accuracy = accuracy_score(y_valid, y_pred)
#     print("Accuracy:", logistic_accuracy)
#     return logistic_accuracy

In [None]:
# import itertools

# def get_numeric_combinations(numeric_features):
#     combinations_list = []
#     for r in range(1, len(numeric_features) + 1):
#         combinations = list(itertools.combinations(numeric_features, r))
#         combinations_list.extend(combinations)
#     return [list(combination) for combination in combinations_list]

# numeric_features = ['UserReputation', 'ReplyCount', 'ThumbsUpCount', 'ThumbsDownCount', 'BestScore','RecipeNumber','ID','RecipeCode']
# combinations = get_numeric_combinations(numeric_features)



In [None]:
# import itertools

# def get_numeric_combinations(numeric_features):
#     combinations_set = set()
#     for r in range(1, len(numeric_features) + 1):
#         combinations = itertools.combinations(numeric_features, r)
#         for combo in combinations:
#             sorted_combo = sorted(combo)
#             combinations_set.add(tuple(sorted_combo))  # Convert to tuple and sort to ensure uniqueness
#     return [list(combination) for combination in combinations_set]

# numeric_features = ['UserReputation', 'ReplyCount', 'ThumbsUpCount', 'ThumbsDownCount', 'BestScore','RecipeNumber','ID','RecipeCode']
# combinations = get_numeric_combinations(numeric_features)


In [None]:
# t = []
# for combination in combinations:
#     numeric_features = combination
#     t.append((logistic(),combination))
#     print(combination)
    

In [None]:
# sorted_data = sorted(t, key=lambda x: x[0], reverse=True)
# for score, features in sorted_data:
#     print(f"Score: {score:.6f}, Features: {features}")
# Score: 0.782178, Features: ['BestScore', 'ID', 'RecipeNumber', 'ReplyCount', 'ThumbsDownCount', 'ThumbsUpCount', 'UserReputation']
# Score: 0.781812, Features: ['BestScore', 'ID', 'RecipeCode', 'RecipeNumber', 'ReplyCount', 'ThumbsDownCount', 'ThumbsUpCount', 'UserReputation']
# Score: 0.781812, Features: ['BestScore', 'ID', 'RecipeCode', 'RecipeNumber', 'ThumbsDownCount', 'ThumbsUpCount', 'UserReputation']
# Score: 0.781445, Features: ['BestScore', 'ID', 'RecipeCode', 'RecipeNumber', 'ReplyCount', 'ThumbsDownCount', 'ThumbsUpCount']
# Score: 0.781445, Features: ['BestScore', 'RecipeCode', 'RecipeNumber', 'ReplyCount', 'ThumbsDownCount', 'ThumbsUpCount']
# Score: 0.781445, Features: ['BestScore', 'ID', 'RecipeNumber', 'ReplyCount', 'ThumbsDownCount', 'ThumbsUpCount']
# Score: 0.781445, Features: ['RecipeCode', 'RecipeNumber', 'ReplyCount', 'ThumbsDownCount', 'ThumbsUpCount']

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_valid, y_pred,zero_division=1));

From these metrics, we can infer how the classification model performs across the different classes:

1. **Precision**: Class 5 has a high precision of 0.80, meaning that when the model predicts class 5 it is correct 80% of the time. For other classes such as class 2, the precision is 0.00: none of the model's class 2 predictions are correct.

2. **Recall**: Class 5 has a high recall of 0.99, meaning the model captures almost all instances of class 5. For other classes such as class 1, the recall is very low at 0.02, so the model misses most instances of that class.
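The two metrics can be computed by hand from true and predicted labels. The sketch below uses hypothetical labels in which the model over-predicts class 5, which is the pattern described above; `y_true_demo` and `y_pred_demo` are illustrative names.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: the model predicts 5 for almost everything
y_true_demo = [5, 5, 5, 5, 1, 1, 4, 0]
y_pred_demo = [5, 5, 5, 5, 5, 1, 5, 5]

for cls in (5, 1):
    p, r = precision_recall(y_true_demo, y_pred_demo, cls)
    print(f"class {cls}: precision={p:.2f}, recall={r:.2f}")
```

Class 5 reaches perfect recall at the cost of precision, while class 1 is only found half the time.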


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import numpy as np

# True labels
true_labels = y_valid

# Predicted labels
predicted_labels = y_pred

# Get confusion matrix
cm = confusion_matrix(true_labels, predicted_labels)

# Define class labels
classes = np.unique(true_labels)

# Plot confusion matrix as heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, cmap="Blues", fmt='g', xticklabels=classes, yticklabels=classes)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()


### cross-validation

In [None]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy')
print("Cross-validation scores:", cv_scores)
print("Mean accuracy:", cv_scores.mean())

The validation accuracy and the cross-validation scores are close to each other, which suggests that the logistic regression model is not significantly overfitting. Because the classes are imbalanced, accuracy alone can hide poor performance on the minority ratings, so these scores should be read together with the per-class metrics.
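As a reminder of what `cv=5` does, the sketch below generates the index splits for plain, unshuffled k-fold cross-validation. It is a simplification: for classifiers, `cross_val_score` with an integer `cv` uses stratified folds, which additionally balance the class shares in each fold.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size

folds = list(k_fold_indices(n_samples=12, k=5))
for train_idx, valid_idx in folds:
    print("validate on", valid_idx, "train on", len(train_idx), "samples")
```

Every sample is used for validation exactly once, and the reported score is the mean over the folds.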

### Hyperparameter-tuning

In [None]:
# from sklearn.model_selection import GridSearchCV

# param_grid = {
#     'preprocessor__cat__ngram_range': [(1, 1), (1, 2),(1,3)], 
#     'preprocessor__cat__min_df': [1, 3,5, 10,15],
#     'classifier__C': [0.1, 1.0, 10.0],  
#     'classifier__penalty': ['l1', 'l2'],
#     'classifier__solver': ['liblinear', 'saga']
# }

# grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1,verbose=3)
# grid_search.fit(X_train, y_train)

# best_params = grid_search.best_params_
# print("Best Parameters:", best_params)

# best_pipeline = grid_search.best_estimator_

# y_pred = best_pipeline.predict(X_valid)

# best_logistic_accuracy = accuracy_score(y_valid, y_pred)
# print("Accuracy:", best_logistic_accuracy)

**Result of hyperparameter tuning** : `{'ngram_range':(1,2),'min_df': 10,'C': 1, 'penalty': 'l1', 'solver': 'saga'}`
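A grid search tries every combination in the grid, so its cost grows multiplicatively. The sketch below enumerates the grid from the commented-out cell above and counts the model fits that 5-fold cross-validation requires.

```python
from itertools import product

# Same grid as in the commented-out GridSearchCV cell
param_grid = {
    'preprocessor__cat__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'preprocessor__cat__min_df': [1, 3, 5, 10, 15],
    'classifier__C': [0.1, 1.0, 10.0],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear', 'saga'],
}

# Cartesian product of all value lists, one dict per candidate configuration
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

n_fits = len(combinations) * 5  # cv=5
print(len(combinations), "candidates,", n_fits, "cross-validation fits")
```

That is 180 candidates and 900 cross-validation fits, plus one final refit of the best candidate, which helps explain why the search is left commented out.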

## KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier(n_neighbors=5)  # You can adjust the number of neighbors as needed

pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
#         ('num', StandardScaler(), numeric_features),
        ('cat', vectorizer, 'cat')
    ])),
    ('classifier', knn_classifier)  
])

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_valid)

# Evaluate the model
knn_accuracy = accuracy_score(y_valid, y_pred)
print("Accuracy:", knn_accuracy)


### cross-validation

In [None]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy')

print("Cross-validation scores:", cv_scores)
print("Mean accuracy of knn:", cv_scores.mean())

As with logistic regression, the validation accuracy and the cross-validation scores are relatively consistent, which suggests that the KNN model is not significantly overfitting.

### Hyperparameter-tuning

In [None]:
# param_grid = {
#     'classifier__n_neighbors': [3, 5, 7, 9, 11],
#     'classifier__weights': ['uniform', 'distance'],
#     'classifier__metric': ['euclidean', 'manhattan']
# }

# grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5)
# grid_search.fit(X_train, y_train)

# best_params = grid_search.best_params_
# print("Best parameters:", best_params)

# best_knn = grid_search.best_estimator_
# best_knn_accuracy = best_knn.score(X_valid, y_valid)
# print("Accuracy on test set:", best_knn_accuracy)

**Best parameters**: `{'metric': 'euclidean', 'n_neighbors': 11, 'weights': 'distance'}`
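The `weights='distance'` setting means that closer neighbours count more than distant ones. The sketch below is a minimal, dependency-free illustration of that voting rule with Euclidean distance; the points and labels are hypothetical, and the handling of exact matches is a simplification.

```python
import math
from collections import defaultdict

def knn_predict(train_points, train_labels, query, k=3):
    """Distance-weighted k-nearest-neighbour vote using Euclidean distance."""
    neighbours = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )[:k]
    votes = defaultdict(float)
    for distance, label in neighbours:
        if distance == 0:
            return label  # an exact match decides the vote
        votes[label] += 1.0 / distance  # closer neighbours weigh more
    return max(votes, key=votes.get)

# Hypothetical 2-D points with rating labels
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.9, 1.0)]
labels = [5, 5, 1, 1, 1]

print(knn_predict(points, labels, (0.2, 0.1), k=5))
```

With `k=5` a uniform vote would pick label 1 (three neighbours against two), but the two much closer label-5 points outweigh them under distance weighting.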

In [9]:
# Replace HTML entities and mis-encoded character sequences left in the raw reviews
replacements = {
    '&#39;': "'",   # HTML entity for an apostrophe
    '&#34;': '"',   # HTML entity for a double quote
    'â€™': "'",     # mis-encoded right single quotation mark
    '!ðŸ˜': "'",    # mis-encoded emoji fragment (replacement kept from the original)
}

# Apply the same literal (non-regex) replacements to the training and test reviews
for frame in (X, test):
    for old, new in replacements.items():
        frame['Recipe_Review'] = frame['Recipe_Review'].str.replace(old, new, regex=False)

In [10]:
from bs4 import BeautifulSoup
import re
import string

def preprocess_text(text):
    # Strip HTML markup and URLs first, while the punctuation that delimits them is still present
    text = BeautifulSoup(text, 'html.parser').get_text()
    text = re.sub(r'http\S+', '', text)
    # Lowercase, then remove punctuation, other non-alphanumeric characters, and digits
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = re.sub(r'\d+', '', text)
    return text

# X['Recipe_Review'] = X['Recipe_Review'].apply(preprocess_text)
test['Recipe_Review'] = test['Recipe_Review'].apply(preprocess_text)


## SVM

In [11]:
from sklearn.svm import SVC

# svm = SVC(kernel='linear', C=1.0)
svm = SVC(C=10, gamma='scale', kernel='rbf')

pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
#         ('num', StandardScaler(), numeric_features),
        ('cat', vectorizer, 'Recipe_Review')
    ])),
#     ('smote', SMOTE(random_state=42)),
    ('classifier', svm)  
])

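# Fit on the training split first so that the SVM gets a validation score
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_valid)
svm_accuracy = accuracy_score(y_valid, y_pred)
print("Accuracy:", svm_accuracy)

# Refit on all labelled data before predicting on the test set
pipeline.fit(X, y)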


In [12]:
import numpy as np

# Decision function values for the test set: one row per sample, one column per class
decision_values_test = pipeline.decision_function(test)

# Number of classes
num_classes = decision_values_test.shape[1]

# Pick the highest-scoring column per row; the column index equals the rating because the classes are 0-5
predictions_test = np.argmax(decision_values_test, axis=1)
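Taking the `argmax` of the decision values returns a column index, not a label. Here the two coincide only because the ratings are the integers 0 to 5. A more general approach maps each index through the fitted classifier's `classes_` attribute, as sketched below with hypothetical scores; the list `classes` stands in for that attribute.

```python
# Stands in for the fitted classifier's classes_ attribute
classes = [0, 1, 2, 3, 4, 5]

# Hypothetical decision scores: one row per sample, one column per class
decision_values = [
    [-0.2, 0.1, 0.0, 0.3, 1.2, 5.1],
    [4.8, 0.9, 0.2, -0.1, 1.0, 3.9],
]

def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)

# Map each winning column index back to its class label
predicted = [classes[argmax(row)] for row in decision_values]
print(predicted)
```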

In [14]:
import numpy as np

# Decision function values for the training data the model was fitted on
decision_values_train = pipeline.decision_function(X)

# Number of classes
num_classes = decision_values_train.shape[1]

# Predicted rating for each training row
predictions_train = np.argmax(decision_values_train, axis=1)

In [19]:
from sklearn.metrics import classification_report
# This report is computed on the training data, so it measures fit rather than generalization
print(classification_report(y, predictions_train))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99      1272
           1       1.00      1.00      1.00       210
           2       1.00      0.99      1.00       174
           3       1.00      0.97      0.98       368
           4       1.00      0.98      0.99      1241
           5       1.00      1.00      1.00     10369

    accuracy                           1.00     13634
   macro avg       1.00      0.99      0.99     13634
weighted avg       1.00      1.00      1.00     13634



In [13]:
y_test_pred = pipeline.predict(test)
# Create a submission dataframe
submission = pd.DataFrame({"ID": range(1, len(y_test_pred) + 1), "Rating": predictions_test})

# Save the submission dataframe to a CSV file
submission.to_csv('submission.csv', index=False)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_valid, y_pred))

### Hyperparameter-tuning

In [None]:
# from sklearn.model_selection import GridSearchCV

# param_grid = {
#     'classifier__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
#     'classifier__C': [0.1, 1, 10, 100],
#     'classifier__gamma': ['scale', 'auto']
# }



# grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5)
# grid_search.fit(X_train, y_train)

# best_params = grid_search.best_params_
# print("Best parameters:", best_params)

# best_svm = grid_search.best_estimator_

# # Calculate accuracy
# best_svm_accuracy = best_svm.score(X_valid, y_valid)
# print("Accuracy on test set:", best_svm_accuracy)


**Best parameters**: `{'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}`
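For context, the RBF kernel measures similarity as `exp(-gamma * ||x - z||^2)`, and `gamma='scale'` sets gamma to `1 / (n_features * X.var())`. The sketch below evaluates both on a tiny hypothetical feature matrix; `X_demo` is illustrative and unrelated to the TF-IDF features used in the pipeline.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel value: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Hypothetical feature matrix: 3 samples, 2 features
X_demo = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
n_features = len(X_demo[0])

# Variance over all entries of the matrix, as X.var() would compute it
flat = [value for row in X_demo for value in row]
mean_value = sum(flat) / len(flat)
variance = sum((value - mean_value) ** 2 for value in flat) / len(flat)

gamma_scale = 1.0 / (n_features * variance)
similarity = rbf_kernel(X_demo[0], X_demo[1], gamma_scale)
print(f"gamma={gamma_scale:.2f}, k(x0, x1)={similarity:.4f}")
```

A larger `C`, such as the value of 10 found here, penalizes training errors more strongly, which is consistent with the near-perfect fit on the training data.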

### Cross Validation

In [None]:
# from sklearn.model_selection import cross_val_score

# cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy')
# print("Cross-validation scores:", cv_scores)
# print("Mean accuracy:", cv_scores.mean())

From cross-validation we can conclude that the model is slightly overfitting.

## DecisionTree

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(random_state=42)  # fixed seed for reproducible results

pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
#         ('num', MinMaxScaler(), ['score']),
        ('cat', vectorizer, 'cat')
    ])),
    ('classifier', dtc)  
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Predict on the validation data
y_pred = pipeline.predict(X_valid)

# Evaluate the model
dtc_accuracy = accuracy_score(y_valid, y_pred)
print("Accuracy:", dtc_accuracy)


### Hyperparameter-tuning

In [None]:
# from sklearn.model_selection import GridSearchCV

# param_grid = {
#     'classifier__max_depth': [3, 5, 7, None],
#     'classifier__min_samples_split': [2, 5, 10],
#     'classifier__min_samples_leaf': [1, 2, 4]
# }

# grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
# grid_search.fit(X_train, y_train)

# best_params = grid_search.best_params_
# best_model = grid_search.best_estimator_

# y_pred = best_model.predict(X_valid)

# # Evaluate the model
# best_dtc_accuracy = accuracy_score(y_valid, y_pred)
# print("Validation Accuracy:",best_dtc_accuracy)
# print("Best Parameters:", best_params)


**Best Parameters**: `{'classifier__max_depth': 5, 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 2}`

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, random_state=42)

pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
#         ('num', StandardScaler(), numeric_features),
        ('cat', vectorizer, 'Recipe_Review')
    ])),
    ('classifier', rfc)  
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Predict on the validation data
y_pred = pipeline.predict(X_valid)

# Evaluate the model
rfc_accuracy = accuracy_score(y_valid, y_pred)
print("Accuracy:", rfc_accuracy)


### Hyperparameter-tuning

In [None]:

# # Define the hyperparameters to tune
# param_grid = {
#     'classifier__n_estimators': [50, 100, 150],
#     'classifier__max_depth': [None, 5, 10, 20],
#     'classifier__min_samples_split': [2, 5, 10],
#     'classifier__min_samples_leaf': [1, 2, 4]
# }

# grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5, scoring='accuracy')
# grid_search.fit(X_train, y_train)

# best_params = grid_search.best_params_
# print("Best Hyperparameters:", best_params)

# best_model = grid_search.best_estimator_
# # Make predictions on the test set using the best model
# y_pred = best_model.predict(X_valid)

# best_rfc_accuracy = accuracy_score(y_valid, y_pred)
# print("Validation Accuracy:", best_rfc_accuracy)


**Best parameters**: `{'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}`

## MLP

In [None]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(random_state=42)

pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
#         ('num', StandardScaler(), numeric_features),
        ('cat', vectorizer, 'cat')
    ])),
    ('classifier',mlp)  
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Predict on the validation data
y_pred = pipeline.predict(X_valid)

# Evaluate the model
mlp_accuracy = accuracy_score(y_valid, y_pred)
print("Accuracy:", mlp_accuracy)


### Hyperparameter-tuning

In [None]:


# param_grid = {
#     'classifier__hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50)],
#     'classifier__activation': ['logistic', 'tanh', 'relu'],
#     'classifier__solver': ['lbfgs', 'sgd', 'adam'],
#     'classifier__alpha': [0.0001, 0.001, 0.01],
#     'classifier__learning_rate': ['constant', 'invscaling', 'adaptive']
# }

# grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5, scoring='accuracy')
# grid_search.fit(X_train, y_train)

# best_params = grid_search.best_params_
# print("Best Hyperparameters:", best_params)

# best_model = grid_search.best_estimator_
# y_pred = best_model.predict(X_valid)

# validation_accuracy = accuracy_score(y_valid, y_pred)
# print("Validation Accuracy:", validation_accuracy)


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

models = ['Logistic Regression', 'KNN', 'SVM', 'Decision Tree', 'Random Forest', 'MLP']

scores = [logistic_accuracy, knn_accuracy, svm_accuracy, dtc_accuracy, rfc_accuracy, mlp_accuracy]

sns.set_style("whitegrid")

plt.figure(figsize=(10, 6))
sns.barplot(x=models, y=scores, palette="viridis")
plt.xlabel('Models')
plt.ylabel('Accuracy')
plt.title('Comparison of Model Accuracy')
plt.ylim(0, 1)  # Setting y-axis limits for better visualization
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()


In [None]:
# import seaborn as sns
# import matplotlib.pyplot as plt

# models = ['Logistic Regression', 'KNN', 'SVM', 'Decision Tree', 'Random Forest']

# scores = [best_logistic_accuracy, best_knn_accuracy, best_svm_accuracy, best_dtc_accuracy, best_rfc_accuracy]

# sns.set_style("whitegrid")

# # Creating the bar plot
# plt.figure(figsize=(10, 6))
# sns.barplot(x=models, y=scores, palette="viridis")
# plt.xlabel('Models')
# plt.ylabel('Accuracy')
# plt.title('Comparison of Model Accuracy')
# plt.ylim(0, 1)  # Setting y-axis limits for better visualization
# plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
# plt.grid(axis='y', linestyle='--', alpha=0.7)
# plt.tight_layout()  # Adjust layout to prevent clipping of labels
# plt.show()


In [None]:
# y_test_pred = best_svm.predict(test)
# # Create a submission dataframe
# submission = pd.DataFrame({"ID": range(1, len(y_test_pred) + 1), "Rating": y_test_pred})

# # Save the submission dataframe to a CSV file
# submission.to_csv('submission.csv', index=False)
