
In [None]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')
print('Done')

Done


In [None]:
train_data = pd.read_json('train.json')

In [None]:
train_data.head()

Unnamed: 0,id,cuisine,ingredients
0,10259,greek,"[romaine lettuce, black olives, grape tomatoes..."
1,25693,southern_us,"[plain flour, ground pepper, salt, tomatoes, g..."
2,20130,filipino,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,22213,indian,"[water, vegetable oil, wheat, salt]"
4,13162,indian,"[black pepper, shallots, cornflour, cayenne pe..."


>Before a model can be trained, the data needs two preprocessing steps:

>Label encoding for cuisine categories: convert the cuisine names (e.g., "indian", "italian") into integer labels with scikit-learn's LabelEncoder.

>Text vectorization for ingredients: convert each list of ingredients into a numerical feature vector. Two common methods are Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF); this notebook uses TF-IDF.

>The following cells carry out these steps:

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer


# Work on a copy so the DataFrame loaded from train.json stays unmodified
train_df = train_data.copy()

In [None]:
# Step 1: Label Encoding for Cuisine Categories
label_encoder = LabelEncoder()
train_df['cuisine_label'] = label_encoder.fit_transform(train_df['cuisine'])
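
>As a small illustration of what the label-encoding step does, the sketch below runs LabelEncoder on a hand-written list of cuisine names. The list `cuisines` is made up for the example and is not taken from train.json. The encoder assigns integers in sorted order of the class names, and `inverse_transform` maps them back.

```python
from sklearn.preprocessing import LabelEncoder

# Toy cuisine names (illustrative only, not the real training column)
cuisines = ["indian", "greek", "indian", "filipino"]

encoder = LabelEncoder()
labels = encoder.fit_transform(cuisines)

# classes_ holds the distinct names in sorted order; a label is an index into it
classes = list(encoder.classes_)

# inverse_transform recovers the original names from the integer labels
decoded = list(encoder.inverse_transform(labels))

print(classes)       # ['filipino', 'greek', 'indian']
print(list(labels))  # [2, 1, 2, 0]
print(decoded)
```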

In [None]:
# Step 2: Text Vectorization for Ingredients
# Using TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer()
ingredient_vectors = tfidf_vectorizer.fit_transform(train_df['ingredients'].apply(lambda x: ' '.join(x)))

# 'cuisine_label' now holds the integer cuisine labels and 'ingredient_vectors' holds the
# TF-IDF features; these are the target and the input for training a cuisine classifier.
# Any new data (such as test.json) must be passed through this same fitted vectorizer
# with transform(), not fit_transform(), so the feature columns stay aligned.
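
>One consequence of joining each ingredient list with spaces is that TfidfVectorizer's default tokenizer splits multi-word ingredients such as "black olives" into separate word tokens. The sketch below contrasts that with a possible alternative, passing a callable `analyzer` so that each whole ingredient becomes one feature. The two toy recipes are invented for the example, and whether whole-ingredient features would help on this dataset is not something the notebook tests.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy recipes (illustrative only)
recipes = [
    ["romaine lettuce", "black olives", "feta cheese"],
    ["soy sauce", "black pepper", "rice"],
]

# Approach used in the notebook: join with spaces, tokenize into words
word_vectorizer = TfidfVectorizer()
word_matrix = word_vectorizer.fit_transform([" ".join(r) for r in recipes])
word_vocab = sorted(word_vectorizer.vocabulary_)

# Alternative sketch: a callable analyzer returns each recipe's ingredient
# list unchanged, so every whole ingredient is treated as a single token
ingredient_vectorizer = TfidfVectorizer(analyzer=lambda ingredients: ingredients)
ingredient_matrix = ingredient_vectorizer.fit_transform(recipes)
ingredient_vocab = sorted(ingredient_vectorizer.vocabulary_)

print(word_matrix.shape, word_vocab)
print(ingredient_matrix.shape, ingredient_vocab)
```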

In [None]:
ingredient_vectors

<39774x3010 sparse matrix of type '<class 'numpy.float64'>'
	with 761951 stored elements in Compressed Sparse Row format>
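
>The summary above gives the matrix shape and the number of stored elements. As a quick piece of arithmetic on those three numbers, the sketch below works out how sparse the matrix is and how many distinct vocabulary terms a recipe has on average; the variable names are made up for the example.

```python
# Shape and stored-element count copied from the sparse-matrix summary above
n_rows, n_cols = 39774, 3010
n_stored = 761951

# Fraction of cells that are actually stored (the rest are implicit zeros)
density = n_stored / (n_rows * n_cols)

# Each stored element in a row is one distinct vocabulary term in that recipe
avg_terms_per_recipe = n_stored / n_rows

print(f"Density: {density:.2%}")
print(f"Average distinct terms per recipe: {avg_terms_per_recipe:.1f}")
```

>The density comes out well under 1%, which is why the features are kept in a sparse format rather than a dense array.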

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Use the TF-IDF features (x) and the encoded cuisine labels (y) from the preprocessing cells
x = ingredient_vectors
y = train_df['cuisine_label']

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Initialize and train the model (Logistic Regression in this example)
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)

# Make predictions on the held-out 20% split (not test.json)
y_pred = model.predict(x_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=label_encoder.classes_)

print(f"Accuracy: {accuracy:.2f}")
print(report)


Accuracy: 0.78
              precision    recall  f1-score   support

   brazilian       0.79      0.52      0.63        84
     british       0.69      0.36      0.47       157
cajun_creole       0.78      0.65      0.71       328
     chinese       0.76      0.86      0.81       510
    filipino       0.73      0.54      0.62       136
      french       0.60      0.66      0.63       550
       greek       0.79      0.67      0.72       249
      indian       0.88      0.89      0.88       602
       irish       0.62      0.42      0.50       151
     italian       0.79      0.91      0.85      1567
    jamaican       0.90      0.60      0.72        91
    japanese       0.84      0.71      0.77       284
      korean       0.84      0.75      0.79       166
     mexican       0.90      0.93      0.91      1336
    moroccan       0.86      0.73      0.79       166
     russian       0.60      0.39      0.48        89
 southern_us       0.67      0.79      0.72       848
     spanish
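
>The support column in the report shows that class sizes vary widely (84 brazilian recipes against 1567 italian ones in the held-out split). One option the notebook does not use is the `stratify` argument of `train_test_split`, which keeps class proportions the same in both splits. A minimal sketch on made-up labels, assuming an 80/20 class imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (illustrative only): 80 of one class, 20 of another
labels = np.array(["italian"] * 80 + ["russian"] * 20)
features = np.arange(len(labels)).reshape(-1, 1)

# stratify=labels asks for the same class proportions in train and test
x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Count how many samples of each class ended up in the test split
names, counts = np.unique(y_te, return_counts=True)
test_counts = {str(name): int(count) for name, count in zip(names, counts)}

print(test_counts)  # {'italian': 16, 'russian': 4}
```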

In [None]:
test_df = pd.read_json('test.json')

In [None]:
# Preprocess the test data using the same TF-IDF vectorizer
test_ingredient_vectors = tfidf_vectorizer.transform(test_df['ingredients'].apply(lambda x: ' '.join(x)))

# Make predictions on the test data
test_predictions = model.predict(test_ingredient_vectors)

# Inverse transform the predicted labels to cuisine names
predicted_cuisines = label_encoder.inverse_transform(test_predictions)

In [None]:
# Add the predicted cuisines to the test DataFrame
test_df['cuisine'] = predicted_cuisines

# 'test_df' now contains a new column 'cuisine' with the predicted cuisine names.
# You can save or further analyze the results as needed.
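
>The cells above fit the vectorizer on the training data and only call `transform` on the test data. One way to make that pairing harder to get wrong is to bundle the vectorizer and the classifier in a scikit-learn pipeline. The sketch below is an alternative arrangement, not what this notebook does; the toy recipes and the helper `join_ingredients` are invented for the example. It also passes the cuisine names directly as targets, which scikit-learn classifiers accept, so no separate label encoding is needed in this variant.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data (illustrative only)
train_recipes = [
    ["soy sauce", "rice", "ginger"],
    ["soy sauce", "sesame oil", "scallions"],
    ["olive oil", "basil", "tomatoes"],
    ["olive oil", "parmesan cheese", "pasta"],
]
train_cuisines = ["chinese", "chinese", "italian", "italian"]
new_recipes = [["rice", "soy sauce"], ["pasta", "basil"]]


def join_ingredients(recipes):
    """Hypothetical helper: turn each ingredient list into one string."""
    return [" ".join(recipe) for recipe in recipes]


# fit() fits the vectorizer and the model together on the training data;
# predict() reuses the fitted vectorizer, so the features stay aligned
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(join_ingredients(train_recipes), train_cuisines)
predictions = pipeline.predict(join_ingredients(new_recipes))

print(list(predictions))
```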

In [None]:
test_df.head()

Unnamed: 0,id,ingredients,cuisine
0,18009,"[baking powder, eggs, all-purpose flour, raisi...",british
1,28583,"[sugar, egg yolks, corn starch, cream of tarta...",southern_us
2,41580,"[sausage links, fennel bulb, fronds, olive oil...",italian
3,29752,"[meat cuts, file powder, smoked sausage, okra,...",cajun_creole
4,35687,"[ground black pepper, salt, sausage casings, l...",italian


In [None]:
cuisine_submission = test_df.drop('ingredients', axis = 1)
cuisine_submission.head()

Unnamed: 0,id,cuisine
0,18009,british
1,28583,southern_us
2,41580,italian
3,29752,cajun_creole
4,35687,italian


In [None]:
cuisine_submission.to_csv('cuisine_submission.csv', index=False)
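
>To show what `index=False` changes in the saved file, the sketch below writes a two-row frame to an in-memory buffer with and without the index and reads it back. The two rows reuse the first ids and predictions shown above; the buffers stand in for a file on disk.

```python
import io

import pandas as pd

submission = pd.DataFrame(
    {"id": [18009, 28583], "cuisine": ["british", "southern_us"]}
)

# Default: the row index is written as an extra, unnamed first column
with_index = io.StringIO()
submission.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the 'id' and 'cuisine' columns are written
without_index = io.StringIO()
submission.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'id', 'cuisine']
print(cols_without_index)  # ['id', 'cuisine']
```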