# Regression

### Exercise - Predict a National Cuisine

* Import the dependencies and the cleaned dataframe created in the previous assignment

In [13]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve
from sklearn.svm import SVC
import numpy as np

cuisines_df = pd.read_csv("C:/Users/Soya/Documents/Dev/Game_Dev/GODOT_Games/ML-For-Beginners/3-Classification/data/cleaned_cuisines.csv")
cuisines_df.head()

Unnamed: 0.1,Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,indian,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


Split the dataframe into two parts to use for training: the labels (the cuisine column, *y*) and the features (the ingredient columns, *X*).

In [14]:
cuisines_label_df = cuisines_df['cuisine']
cuisines_label_df.head()

0    indian
1    indian
2    indian
3    indian
4    indian
Name: cuisine, dtype: object

In [15]:
cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
cuisines_feature_df.head()

Unnamed: 0,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,artichoke,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


#### Selection of Classifier / Split the Data
We will use multiclass logistic regression from the scikit-learn library.

First, split the data into a training set and a test set. A common choice is 80% / 20%, but the right ratio depends on the size of the dataset. The cell below holds out 30% of the rows for testing (test_size=0.3).

In [16]:
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3, random_state=42)
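
The split above is purely random. When class sizes differ, passing the labels to the stratify argument of train_test_split keeps the class proportions similar in both splits. A minimal sketch on a made-up ten-row dataset (the toy features and labels are illustrative and not taken from cleaned_cuisines.csv):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 recipes, 2 binary ingredient columns, 2 cuisines.
toy_features = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0],
                [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]]
toy_labels = ["thai"] * 5 + ["indian"] * 5

# test_size=0.3 holds out 30% of the rows; stratify keeps the class mix
# in train and test close to the mix in the full dataset.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_labels, test_size=0.3, random_state=42, stratify=toy_labels
)

print(len(X_tr), len(X_te))          # number of rows in each split
print(Counter(y_tr), Counter(y_te))  # class counts in each split
```

Whether stratification matters for the cuisines data depends on how balanced the classes are, which is worth checking with value_counts() on the label column first.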

### Applying the Multiclass Logistic Regression

Choose a *scheme* and a *solver*. Here we use **LogisticRegression** with the one-vs-rest (*ovr*) multiclass scheme and the **liblinear** solver.

In [27]:
lr = LogisticRegression(multi_class='ovr', solver='liblinear')
model = lr.fit(X_train, np.ravel(y_train))

accuracy = model.score(X_test, y_test)
print("Accuracy is {}".format(accuracy))

print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
print(f'cuisine: {y_test.iloc[50]}')

Accuracy is 0.786488740617181
ingredients: Index(['coconut', 'coriander', 'cumin', 'fenugreek', 'fish', 'lime_juice',
       'pepper', 'turmeric', 'vegetable_oil'],
      dtype='object')
cuisine: thai
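
The *scheme* decides how per-class scores become probabilities. With one-vs-rest, each class gets its own binary model, each score passes through a sigmoid, and the results are rescaled to sum to 1. A multinomial model instead applies a softmax over all scores at once. The sketch below uses made-up scores for three classes (the numbers are hypothetical, not taken from the fitted model) and is meant as an illustration of the idea rather than scikit-learn's exact code:

```python
import math

def ovr_probabilities(scores):
    """Sigmoid per class, then rescale so the values sum to 1."""
    sigmoids = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(sigmoids)
    return [p / total for p in sigmoids]

def multinomial_probabilities(scores):
    """Softmax over all class scores at once."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, -1.0, -3.0]  # hypothetical decision scores for three classes
ovr = ovr_probabilities(scores)
multinomial = multinomial_probabilities(scores)
print([round(p, 3) for p in ovr])
print([round(p, 3) for p in multinomial])
```

Both schemes pick the same top class here, but the softmax concentrates more probability on it, which is one reason probabilities from the two schemes are not directly comparable.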




In [25]:
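# Select row 50 as a one-row DataFrame so the feature names are preserved.
test = X_test.iloc[[50]]
proba = model.predict_proba(test)
classes = model.classes_
resultdf = pd.DataFrame(data=proba, columns=classes)

# Transpose so each cuisine is a row, then sort by probability, highest first.
topPrediction = resultdf.T.sort_values(by=[0], ascending=False)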
topPrediction.head()



Unnamed: 0,0
thai,0.929633
indian,0.04059
japanese,0.029481
chinese,0.000252
korean,4.4e-05
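
The same ranking can be produced without building a dataframe. A small sketch using the probabilities printed above, paired with the class names in the sorted order that model.classes_ returns them (the top_k name is introduced here for illustration):

```python
# Class names and probabilities copied from the output above.
classes = ["chinese", "indian", "japanese", "korean", "thai"]
proba = [0.000252, 0.04059, 0.029481, 0.000044, 0.929633]

# Pair each class with its probability and sort from most to least likely.
ranked = sorted(zip(classes, proba), key=lambda pair: pair[1], reverse=True)

top_k = 3
for name, p in ranked[:top_k]:
    print(f"{name}: {p:.3f}")
```

The model is confident about thai for this recipe; the second and third guesses share ingredients such as coriander, cumin, and fish with it.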


In [28]:
y_pred = model.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

     chinese       0.72      0.70      0.71       236
      indian       0.91      0.90      0.90       245
    japanese       0.71      0.75      0.73       231
      korean       0.81      0.74      0.77       242
        thai       0.79      0.84      0.81       245

    accuracy                           0.79      1199
   macro avg       0.79      0.79      0.79      1199
weighted avg       0.79      0.79      0.79      1199
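
To read the report, it helps to recompute a few numbers by hand. The f1-score is the harmonic mean of precision and recall, and the macro average is the unweighted mean over classes. The sketch below uses the rounded values from the report above, so a recomputed figure could differ from the printed one in the last digit:

```python
# Per-class (precision, recall) copied from the report above, rounded to 2 decimals.
report = {
    "chinese":  (0.72, 0.70),
    "indian":   (0.91, 0.90),
    "japanese": (0.71, 0.75),
    "korean":   (0.81, 0.74),
    "thai":     (0.79, 0.84),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_scores = {name: f1(p, r) for name, (p, r) in report.items()}
macro_precision = sum(p for p, _ in report.values()) / len(report)
macro_recall = sum(r for _, r in report.values()) / len(report)

for name, score in f1_scores.items():
    print(f"{name}: f1 ~ {score:.2f}")
print(f"macro precision ~ {macro_precision:.2f}, macro recall ~ {macro_recall:.2f}")
```

The per-class view shows what the single accuracy number hides: indian recipes are recognized far more reliably than chinese or japanese ones.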



## Study The Solvers


#### **liblinear**
* Type: Coordinate Descent Algorithm (uses a library called LIBLINEAR)
* Best for: Small to medium-sized datasets; binary and multiclass (one-vs-rest) classification
* How it works: Optimizes the loss function by updating one coefficient at a time while holding the others fixed, which keeps each individual step cheap.
* Data structures: Handles sparse data well, but not ideal for very large datasets or datasets with many classes, because the one-vs-rest strategy fits a separate binary model for each class.
* Why choose it: It’s robust, reliable, and works well for smaller datasets or when you want to use the one-vs-rest (OvR) multiclass strategy.
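
The "one coefficient at a time" idea can be seen on a toy problem. The sketch below is not LIBLINEAR's implementation; it applies plain coordinate descent to a small two-variable quadratic chosen for illustration, where each single-coordinate update has a closed form:

```python
def coordinate_descent(sweeps=30):
    """Minimize f(w1, w2) = (w1 - 1)^2 + (w2 + 2)^2 + w1 * w2 one coordinate at a time."""
    w1, w2 = 0.0, 0.0
    for _ in range(sweeps):
        # Solve df/dw1 = 2*(w1 - 1) + w2 = 0 for w1, with w2 held fixed.
        w1 = 1.0 - w2 / 2.0
        # Solve df/dw2 = 2*(w2 + 2) + w1 = 0 for w2, with w1 held fixed.
        w2 = -2.0 - w1 / 2.0
    return w1, w2

w1, w2 = coordinate_descent()
print(w1, w2)  # approaches the exact minimizer (8/3, -10/3)
```

Each update only touches one variable, yet repeated sweeps converge to the joint minimum. Logistic loss has no closed-form coordinate update, so real solvers use a small inner optimization for each step instead.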


#### **lbfgs**
* Type: Quasi-Newton method (Limited-memory Broyden–Fletcher–Goldfarb–Shanno)
* Best for: Medium to large datasets; supports multinomial loss for multiclass classification
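* How it works: Builds an approximation of the second-derivative (Hessian) information from a short history of recent gradients, so it never has to store the full matrix. This lets it find the minimum of the loss function efficiently, especially for large numbers of features.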
* Data structures: Works well with dense data and can handle larger datasets and more classes than liblinear.
* Why choose it: It’s the default solver in scikit-learn, supports both OvR and multinomial strategies, and is generally faster and more scalable for larger problems.
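
To see the optimizer on its own, the sketch below minimizes an L2-regularized binary logistic loss with SciPy's L-BFGS-B routine, which is the routine scikit-learn's lbfgs solver is understood to build on. The synthetic data, the true_w weights, and the alpha regularization strength are all assumptions made for the example:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

# Hypothetical synthetic binary problem with 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = (X @ true_w + rng.normal(scale=0.5, size=200) > 0).astype(float)

def loss_and_grad(w, X, y, alpha=1.0):
    """Return the regularized logistic loss and its gradient at w."""
    z = X @ w
    # log(1 + exp(z)) - y*z is the negative log-likelihood per sample.
    loss = np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * alpha * (w @ w)
    grad = X.T @ (expit(z) - y) + alpha * w
    return loss, grad

w0 = np.zeros(3)
initial_loss, _ = loss_and_grad(w0, X, y)

# jac=True tells minimize that the function returns (loss, gradient).
result = minimize(loss_and_grad, w0, args=(X, y), jac=True, method="L-BFGS-B")

print("initial loss:", initial_loss)
print("final loss:  ", result.fun)
print("weights:     ", result.x)
print("iterations:  ", result.nit)
```

Only the loss and its gradient are supplied; the curvature estimate is built internally from previous steps, which is what keeps memory use low.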


#### **Comparison**
* **liblinear** is simple, reliable, and good for small datasets or when using OvR multiclass classification. It does not support multinomial loss and scales less well to large datasets.
* **lbfgs** is more flexible, supports multinomial loss (true multiclass), and is better for larger datasets and more complex problems.

#### **In summary:**
Choose **liblinear** for small, simple problems or when you need OvR. Choose **lbfgs** for larger, more complex datasets or when you want to use multinomial logistic regression.
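
A quick way to compare the two choices is to fit both on the same split. The sketch below uses synthetic five-class data as a stand-in for the cuisines dataset (an assumption, so the accuracies will not match the ones above). Recent scikit-learn releases deprecate the multi_class argument, so the liblinear model is wrapped in OneVsRestClassifier here; check the documentation for the installed version.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 5-class data standing in for the cuisines dataset (an assumption).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=5, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

candidates = {
    # One binary liblinear model per class (explicit one-vs-rest).
    "liblinear (OvR)": OneVsRestClassifier(LogisticRegression(solver="liblinear")),
    # lbfgs fits a single multinomial model by default in current releases.
    "lbfgs (multinomial)": LogisticRegression(solver="lbfgs", max_iter=1000),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

On small problems like this one the two usually land close together, so the choice tends to come down to dataset size and whether multinomial probabilities are needed.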