# Exploratory Data Analysis

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

### Cuisine Classification Challenge

**Link:** https://www.kaggle.com/competitions/cuisine-classification-challenge/data

**Dataset Description from Kaggle:**

The training dataset includes two files.

The train_label.csv file contains columns Recipe_ID and Cuisine. Recipe_ID is the ID of the recipe and Cuisine is the name of the regional cuisine to which the recipe belongs. Recipe_ID and Cuisine have one-to-one mapping.

The train_features.csv file contains columns Recipe_ID, Ingredient_ID, state, quantity, unit. Here Recipe_ID is the ID of the recipe and rest of the columns contain detailed information of the ingredients used in this recipe. Recipe_ID and other columns have one-to-many mappings.

Test dataset includes test_features.csv file which contains the following columns: Recipe_ID, Ingredient_ID, state, quantity, unit. Here Recipe_ID is the ID of the recipe and rest of the columns contain detailed information of the ingredients used in this recipe. Recipe_ID and other columns have one-to-many mappings.

You are expected to generate a submission file with two columns: Recipe_ID and Cuisine. Here Recipe_ID should have all the unique values of recipe IDs in test dataset and Cuisine should have the predicted regional cuisine of the corresponding recipe.
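
The submission format described above can be sketched as follows. This is a minimal illustration on made-up data: the `test_features` values and the `predicted` mapping are hypothetical stand-ins for the real test file and for whatever model is eventually trained.

```python
import pandas as pd

# Toy stand-in for test_features.csv (values are invented for illustration).
test_features = pd.DataFrame({
    "Recipe_ID": [10, 10, 11, 12, 12, 12],
    "Ingredient_ID": [3, 452, 180, 21, 1, 3],
})

# Hypothetical per-recipe predictions; a real model would produce these.
predicted = {10: "Italian", 11: "Canadian", 12: "Italian"}

# One row per unique Recipe_ID, with the predicted cuisine attached.
submission = pd.DataFrame({"Recipe_ID": test_features["Recipe_ID"].unique()})
submission["Cuisine"] = submission["Recipe_ID"].map(predicted)

# Writing the file would then be: submission.to_csv("submission.csv", index=False)
```

Starting from the unique recipe IDs rather than from the ingredient rows keeps the one-to-many structure of the feature file from leaking duplicate rows into the submission.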

In [12]:
base_dir_ccc = "/Users/timseeberger/PycharmProjects/6.C511-Project/data/cuisine_classification_challenge"
df_ccc_labels = pd.read_csv(base_dir_ccc + "/train_labels.csv")
df_ccc_features = pd.read_csv(base_dir_ccc + "/train_features.csv")

In [6]:
df_ccc_features.head(5)

Unnamed: 0,Recipe_ID,Ingredient_ID,state,quantity,unit
0,2610,3,,3,cups
1,2610,452,,1,cup
2,2610,180,quartered,1,
3,2610,21,quartered,1,
4,2610,1,quartered,1,


In [7]:
df_ccc_labels.head(5)

Unnamed: 0,Recipe_ID,Cuisine
0,2610,Middle Eastern
1,2611,Middle Eastern
2,2612,Middle Eastern
3,2613,Middle Eastern
4,2614,Middle Eastern


In [8]:
df_ccc = pd.merge(df_ccc_features, df_ccc_labels, on="Recipe_ID", how="inner")
df_ccc

Unnamed: 0,Recipe_ID,Ingredient_ID,state,quantity,unit,Cuisine
0,2610,3,,3,cups,Middle Eastern
1,2610,452,,1,cup,Middle Eastern
2,2610,180,quartered,1,,Middle Eastern
3,2610,21,quartered,1,,Middle Eastern
4,2610,1,quartered,1,,Middle Eastern
...,...,...,...,...,...,...
807195,147181,34,,3,cups,Canadian
807196,147181,213,cooked chopped,3,cups,Canadian
807197,147181,29,peeled cooked mashed,2 1/2,cups,Canadian
807198,147181,204,chopped,1,tablespoon,Canadian
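
Since the labels are described as one row per recipe and the features as many rows per recipe, the merge could also assert that relationship explicitly. The sketch below uses small toy frames (the values mirror the preview rows but are otherwise illustrative) and the `validate` argument of `pd.merge`, which raises a `MergeError` if the right-hand keys are not unique.

```python
import pandas as pd

# Toy frames standing in for the features (many rows per recipe)
# and labels (one row per recipe).
features = pd.DataFrame({
    "Recipe_ID": [2610, 2610, 2611],
    "Ingredient_ID": [3, 452, 180],
})
labels = pd.DataFrame({
    "Recipe_ID": [2610, 2611],
    "Cuisine": ["Middle Eastern", "Middle Eastern"],
})

# validate="many_to_one" checks that Recipe_ID is unique in `labels`.
merged = pd.merge(features, labels, on="Recipe_ID", how="inner",
                  validate="many_to_one")
```

With an inner join, comparing `len(merged)` against `len(features)` is a quick way to see whether any ingredient rows lacked a label.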


In [9]:
df_ccc.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 807200 entries, 0 to 807199
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   Recipe_ID      807200 non-null  int64 
 1   Ingredient_ID  807200 non-null  int64 
 2   state          336642 non-null  object
 3   quantity       756956 non-null  object
 4   unit           584056 non-null  object
 5   Cuisine        807200 non-null  object
dtypes: int64(2), object(4)
memory usage: 37.0+ MB


In [10]:
df_ccc.describe(include="all")

Unnamed: 0,Recipe_ID,Ingredient_ID,state,quantity,unit,Cuisine
count,807200.0,807200.0,336642,756956.0,584056,807200
unique,,,11180,806.0,930,26
top,,,chopped,1.0,cup,Italian
freq,,,58480,253218.0,141673,116487
mean,80251.121567,450.319419,,,,
std,41829.250792,1741.80623,,,,
min,2610.0,0.0,,,,
25%,54416.0,12.0,,,,
50%,83157.0,56.0,,,,
75%,114305.25,223.0,,,,


In [11]:
df_ccc.isnull().sum()

Recipe_ID             0
Ingredient_ID         0
state            470558
quantity          50244
unit             223144
Cuisine               0
dtype: int64
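
The `quantity` column is stored as `object` and contains mixed fractions such as `2 1/2`, so it cannot be cast to a number directly. One possible way to convert it is sketched below; `parse_quantity` is a hypothetical helper, and the assumption that every parseable value is a whitespace-separated sum of integers, decimals, or simple fractions has not been checked against all 806 unique values.

```python
from fractions import Fraction

import pandas as pd


def parse_quantity(value):
    """Convert '3', '1/2' or '2 1/2' to a float; return NaN otherwise."""
    parts = str(value).split()
    if not parts:
        return float("nan")
    try:
        # Each whitespace-separated token is parsed and the tokens are summed,
        # so '2 1/2' becomes 2 + 1/2.
        return float(sum(Fraction(part) for part in parts))
    except (ValueError, ZeroDivisionError):
        # Anything that is not a plain number or fraction is left as missing.
        return float("nan")


# Toy sample using quantity strings of the kind seen in the preview above.
sample = pd.Series(["3", "1", "2 1/2", None, "a pinch"])
parsed = sample.map(parse_quantity)
```

Values the helper cannot interpret become `NaN` instead of raising, which makes it easy to count how many rows would need a more specific rule.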

### Indonesian Food Recipes

**Link:** https://www.kaggle.com/datasets/canggih/indonesian-food-recipes?select=dataset-udang.csv

**Dataset Description from Kaggle:**

Indonesian foods are well-known for their rich taste. There are many spices used even for daily foods. This dataset may give insight on how to prepare Indonesian food, in many ways.

This dataset contains 14000 recipes divided in 7 categories:
- dataset-ayam.csv (chicken recipes)
- dataset-kambing.csv (lamb recipes)
- dataset-sapi.csv (beef recipes)
- dataset-telur.csv (egg recipes)
- dataset-tahu.csv (tofu recipes)
- dataset-ikan.csv (fish recipes)
- dataset-tempe.csv (tempe recipes)

In [13]:
base_dir_ifr = "/Users/timseeberger/PycharmProjects/6.C511-Project/data/indonesian_food_recipes"
df_ifr_chicken = pd.read_csv(base_dir_ifr + "/dataset-ayam.csv")
df_ifr_lamb = pd.read_csv(base_dir_ifr + "/dataset-kambing.csv")
df_ifr_beef = pd.read_csv(base_dir_ifr + "/dataset-sapi.csv")
df_ifr_egg = pd.read_csv(base_dir_ifr + "/dataset-telur.csv")
df_ifr_tofu = pd.read_csv(base_dir_ifr + "/dataset-tahu.csv")
df_ifr_fish = pd.read_csv(base_dir_ifr + "/dataset-ikan.csv")
df_ifr_tempe = pd.read_csv(base_dir_ifr + "/dataset-tempe.csv")
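
Because the per-category files share the same columns, they could be stacked into a single frame with an extra column recording the source category. The sketch below uses two tiny toy frames instead of the real files; the `category` column name is an assumption introduced here for illustration.

```python
import pandas as pd

# Toy stand-ins for two of the per-category frames (rows are illustrative).
frames = {
    "chicken": pd.DataFrame({"Title": ["Ayam Geprek"], "Loves": [10]}),
    "lamb": pd.DataFrame({"Title": ["Gulai kambing", "Rabeg Kambing"],
                          "Loves": [0, 6]}),
}

# Tag each frame with its category, then stack them with a fresh index.
combined = pd.concat(
    [df.assign(category=name) for name, df in frames.items()],
    ignore_index=True,
)
```

With the real data, the dictionary would map each category name to the corresponding `df_ifr_*` frame.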

In [14]:
df_ifr_chicken.head(5)

Unnamed: 0,Title,Ingredients,Steps,Loves,URL
0,Ayam Woku Manado,1 Ekor Ayam Kampung (potong 12)--2 Buah Jeruk ...,Cuci bersih ayam dan tiriskan. Lalu peras jeru...,1,/id/resep/4473027-ayam-woku-manado
1,Ayam goreng tulang lunak,1 kg ayam (dipotong sesuai selera jangan kecil...,"Haluskan bumbu2nya (BaPut, ketumbar, kemiri, k...",1,/id/resep/4471956-ayam-goreng-tulang-lunak
2,Ayam cabai kawin,1/4 kg ayam--3 buah cabai hijau besar--7 buah ...,Panaskan minyak di dalam wajan. Setelah minyak...,2,/id/resep/4473057-ayam-cabai-kawin
3,Ayam Geprek,250 gr daging ayam (saya pakai fillet)--Secuku...,Goreng ayam seperti ayam krispi--Ulek semua ba...,10,/id/resep/4473023-ayam-geprek
4,Minyak Ayam,400 gr kulit ayam & lemaknya--8 siung bawang p...,Cuci bersih kulit ayam. Sisihkan--Ambil 50 ml ...,4,/id/resep/4427438-minyak-ayam


In [16]:
recipe_sample = df_ifr_chicken.iloc[0, :]
recipe_sample

Title                                           Ayam Woku Manado
Ingredients    1 Ekor Ayam Kampung (potong 12)--2 Buah Jeruk ...
Steps          Cuci bersih ayam dan tiriskan. Lalu peras jeru...
Loves                                                          1
URL                           /id/resep/4473027-ayam-woku-manado
Name: 0, dtype: object

In [17]:
recipe_sample["Ingredients"]

'1 Ekor Ayam Kampung (potong 12)--2 Buah Jeruk Nipis--2 Sdm Garam--3 Ruas Kunyit--7 Bawang Merah--7 Bawang Putih--10 Cabe Merah--10 Cabe Rawit Merah (sesuai selera)--3 Butir Kemiri--2 Batang Sereh--2 Lembar Daun Salam--2 Ikat Daun Kemangi--Penyedap Rasa--1 1/2 Gelas Air--'

In [18]:
recipe_sample["Ingredients"].split("--")

['1 Ekor Ayam Kampung (potong 12)',
 '2 Buah Jeruk Nipis',
 '2 Sdm Garam',
 '3 Ruas Kunyit',
 '7 Bawang Merah',
 '7 Bawang Putih',
 '10 Cabe Merah',
 '10 Cabe Rawit Merah (sesuai selera)',
 '3 Butir Kemiri',
 '2 Batang Sereh',
 '2 Lembar Daun Salam',
 '2 Ikat Daun Kemangi',
 'Penyedap Rasa',
 '1 1/2 Gelas Air',
 '']
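
The trailing `--` in the raw string leaves an empty final element after splitting. A small helper could drop such empty entries and strip surrounding whitespace; `split_field` is a hypothetical name used only for this sketch.

```python
def split_field(text, sep="--"):
    """Split a '--'-delimited field and discard empty or whitespace-only parts."""
    return [part.strip() for part in text.split(sep) if part.strip()]


# Shortened example in the same format as the Ingredients field.
items = split_field("2 Sdm Garam--3 Ruas Kunyit--Penyedap Rasa--")
```

The same helper should apply to the `Steps` field, which uses the same delimiter.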

In [19]:
recipe_sample["Steps"]

'Cuci bersih ayam dan tiriskan. Lalu peras jeruk nipis (kalo gak ada jeruk nipis bisa pake cuka) dan beri garam. Aduk hingga merata dan diamkan selama 5 menit, biar ayam gak bau amis.--Goreng ayam tersebut setengah matang, lalu tiriskan--Haluskan bumbu menggunakan blender. Bawang merah, bawang putih, cabe merah, cabe rawit, kemiri dan kunyit. Oh iya kasih minyak sedikit yaa biar bisa di blender. Untuk sereh nya di geprek aja terus di buat simpul.--Setelah bumbu di haluskan barulah di tumis. Jangan lupa sereh dan daun salamnya juga ikut di tumis. Di tumis sampai berubah warna ya 👌--Masukan ayam yang sudah di goreng setengah matang ke dalam bumbu yang sudah di tumis, dan diamkan 5 menit dulu. Biar bumbu meresap. Lalu tuangkan 1 1/2 Gelas air. Lalu tambahkan penyedap rasa (saya 3 Sdt, tapi sesuai selera ya) koreksi rasa dan Biar kan sampai mendidih--Setelah masakan mendidih, lalu masukan daun kemangi yang sudah di potong potong. Masak lagi sekitar 10 menit. And taraaaaaaaaaaaaaa..... jadi

In [20]:
recipe_sample["Steps"].split("--")

['Cuci bersih ayam dan tiriskan. Lalu peras jeruk nipis (kalo gak ada jeruk nipis bisa pake cuka) dan beri garam. Aduk hingga merata dan diamkan selama 5 menit, biar ayam gak bau amis.',
 'Goreng ayam tersebut setengah matang, lalu tiriskan',
 'Haluskan bumbu menggunakan blender. Bawang merah, bawang putih, cabe merah, cabe rawit, kemiri dan kunyit. Oh iya kasih minyak sedikit yaa biar bisa di blender. Untuk sereh nya di geprek aja terus di buat simpul.',
 'Setelah bumbu di haluskan barulah di tumis. Jangan lupa sereh dan daun salamnya juga ikut di tumis. Di tumis sampai berubah warna ya 👌',
 'Masukan ayam yang sudah di goreng setengah matang ke dalam bumbu yang sudah di tumis, dan diamkan 5 menit dulu. Biar bumbu meresap. Lalu tuangkan 1 1/2 Gelas air. Lalu tambahkan penyedap rasa (saya 3 Sdt, tapi sesuai selera ya) koreksi rasa dan Biar kan sampai mendidih',
 'Setelah masakan mendidih, lalu masukan daun kemangi yang sudah di potong potong. Masak lagi sekitar 10 menit. And taraaaaaaaa

In [15]:
df_ifr_lamb.head(5)

Unnamed: 0,Title,Ingredients,Steps,Loves,URL
0,.Sate Kambing,Bahan-bahan :--500 gr Daging kambing--Daun pep...,"1. Cuci bersih daging kambing, potong"" kotak, ...",6,/id/resep/4470066-sate-kambing
1,Rabeg Kambing,1 kg daging kambing bagian paha beserta tulang...,Tumis bumbu halus hingga harum. Masukan bumbu ...,6,/id/resep/4469170-rabeg-kambing
2,Gulai kambing,500 gram daging kambing--Bumbu :--sesuai seler...,Rebus dgng kambing dgn jahe krng lbh 20 menit ...,0,/id/resep/4467084-gulai-kambing
3,Sayur tulang kambing,1.5 kg tulang sumsum kambing(direbus dahulu)ag...,Haluskan bawang merah 5 siung dan 5 bawang put...,1,/id/resep/4458778-sayur-tulang-kambing
4,Sate Kambing,300 garam daging kambing--1/2 buah jeruk nipis...,"Potong daging kambing sesuai selera, beri pera...",7,/id/resep/4462884-sate-kambing


### What's Cooking?

**Link:** https://www.kaggle.com/competitions/whats-cooking/data

**Dataset Description from Kaggle:**

In the dataset, we include the recipe id, the type of cuisine, and the list of ingredients of each recipe (of variable length). The data is stored in JSON format. 

File descriptions:
- train.json: the training set containing recipes id, type of cuisine, and list of ingredients
- test.json: the test set containing recipes id, and list of ingredients
- sample_submission.csv: a sample submission file in the correct format

In [23]:
base_dir_wc = "/Users/timseeberger/PycharmProjects/6.C511-Project/data/whats_cooking"
df_wc_train = pd.read_json(base_dir_wc + "/train.json")
df_wc_sample_submission = pd.read_csv(base_dir_wc + "/sample_submission.csv")

In [22]:
df_wc_train.head(5)

Unnamed: 0,id,cuisine,ingredients
0,10259,greek,"[romaine lettuce, black olives, grape tomatoes..."
1,25693,southern_us,"[plain flour, ground pepper, salt, tomatoes, g..."
2,20130,filipino,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,22213,indian,"[water, vegetable oil, wheat, salt]"
4,13162,indian,"[black pepper, shallots, cornflour, cayenne pe..."
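
Each row holds its ingredients as a Python list, so counting ingredient frequencies requires flattening first. A minimal sketch with `DataFrame.explode` on toy data is shown below; the first row matches the preview, while the second row is invented for illustration.

```python
import pandas as pd

# Toy frame in the same shape as train.json.
toy = pd.DataFrame({
    "id": [22213, 1],  # the second id is made up
    "cuisine": ["indian", "indian"],
    "ingredients": [
        ["water", "vegetable oil", "wheat", "salt"],
        ["salt", "water"],
    ],
})

# One row per (recipe, ingredient) pair; id and cuisine are repeated.
long_form = toy.explode("ingredients")

# Frequency of each ingredient across all recipes.
ingredient_counts = long_form["ingredients"].value_counts()
```

The long form can also be grouped by `cuisine` to compare ingredient usage between cuisines.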


### Recipes and Reviews (Food.com)

**Link:** https://www.kaggle.com/datasets/irkaal/foodcom-recipes-and-reviews?select=reviews.csv

**Dataset Description from Kaggle:**

The recipes dataset contains 522,517 recipes from 312 different categories. This dataset provides information about each recipe like cooking times, servings, ingredients, nutrition, instructions, and more.
The reviews dataset contains 1,401,982 reviews from 271,907 different users. This dataset provides information about the author, rating, review text, and more.

In [25]:
base_dir_rr = "/Users/timseeberger/PycharmProjects/6.C511-Project/data/recipes_and_reviews"
df_rr_recipes = pd.read_csv(base_dir_rr + "/recipes.csv")
df_rr_reviews = pd.read_csv(base_dir_rr + "/reviews.csv")

In [26]:
df_rr_recipes.head(5)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,...,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions
0,38,Low-Fat Berry Blue Frozen Dessert,1533,Dancer,PT24H,PT45M,PT24H45M,1999-08-09T21:46:00Z,Make and share this Low-Fat Berry Blue Frozen ...,"c(""https://img.sndimg.com/food/image/upload/w_...",...,1.3,8.0,29.8,37.1,3.6,30.2,3.2,4.0,,"c(""Toss 2 cups berries with sugar."", ""Let stan..."
1,39,Biryani,1567,elly9812,PT25M,PT4H,PT4H25M,1999-08-29T13:12:00Z,Make and share this Biryani recipe from Food.com.,"c(""https://img.sndimg.com/food/image/upload/w_...",...,16.6,372.8,368.4,84.4,9.0,20.4,63.4,6.0,,"c(""Soak saffron in warm milk for 5 minutes and..."
2,40,Best Lemonade,1566,Stephen Little,PT5M,PT30M,PT35M,1999-09-05T19:52:00Z,This is from one of my first Good House Keepi...,"c(""https://img.sndimg.com/food/image/upload/w_...",...,0.0,0.0,1.8,81.5,0.4,77.2,0.3,4.0,,"c(""Into a 1 quart Jar with tight fitting lid, ..."
3,41,Carina's Tofu-Vegetable Kebabs,1586,Cyclopz,PT20M,PT24H,PT24H20M,1999-09-03T14:54:00Z,This dish is best prepared a day in advance to...,"c(""https://img.sndimg.com/food/image/upload/w_...",...,3.8,0.0,1558.6,64.2,17.3,32.1,29.3,2.0,4 kebabs,"c(""Drain the tofu, carefully squeezing out exc..."
4,42,Cabbage Soup,1538,Duckie067,PT30M,PT20M,PT50M,1999-09-19T06:19:00Z,Make and share this Cabbage Soup recipe from F...,"""https://img.sndimg.com/food/image/upload/w_55...",...,0.1,0.0,959.3,25.1,4.8,17.7,4.3,4.0,,"c(""Mix everything together and bring to a boil..."
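
The `CookTime`, `PrepTime`, and `TotalTime` columns hold ISO 8601 style durations such as `PT24H45M`. One way to turn them into minutes is sketched below. `duration_minutes` is a hypothetical helper, and it assumes that only hour and minute components occur, which is true of the rows previewed above but has not been verified for the full column.

```python
import re

# Matches durations made of optional hours and optional minutes, e.g. PT4H25M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?$")


def duration_minutes(text):
    """Return the duration in minutes, or None if the value does not match."""
    if not isinstance(text, str):
        return None  # covers missing values such as NaN
    match = _DURATION.match(text)
    if match is None:
        return None
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)


total = duration_minutes("PT24H45M")
```

Returning `None` for unmatched strings makes it possible to count how many values fall outside the assumed pattern before deciding how to handle them.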


In [27]:
df_rr_recipes.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522517 entries, 0 to 522516
Data columns (total 28 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   RecipeId                    522517 non-null  int64  
 1   Name                        522517 non-null  object 
 2   AuthorId                    522517 non-null  int64  
 3   AuthorName                  522517 non-null  object 
 4   CookTime                    439972 non-null  object 
 5   PrepTime                    522517 non-null  object 
 6   TotalTime                   522517 non-null  object 
 7   DatePublished               522517 non-null  object 
 8   Description                 522512 non-null  object 
 9   Images                      522516 non-null  object 
 10  RecipeCategory              521766 non-null  object 
 11  Keywords                    505280 non-null  object 
 12  RecipeIngredientQuantities  522514 non-null  object 
 13  RecipeIngredie

In [28]:
df_rr_recipes.describe(include="all")

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,...,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions
count,522517.0,522517,522517.0,522517,439972,522517,522517,522517,522512,522516,...,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,339606.0,174446,522517
unique,,438188,,56793,490,318,1240,245540,492838,165889,...,,,,,,,,,34043,519993
top,,Banana Bread,,ratherbeswimmin,PT30M,PT10M,PT30M,1999-12-01T20:03:00Z,Make and share this Banana Bread recipe from F...,character(0),...,,,,,,,,,2 cups,"""Blend all ingredients until smooth."""
freq,,186,,7742,50715,120265,41590,221,96,356620,...,,,,,,,,,4421,32
mean,271821.43697,,45725850.0,,,,,,,,...,9.559457,86.487003,767.2639,49.089092,3.843242,21.878254,17.46951,8.606191,,
std,155495.878422,,292971400.0,,,,,,,,...,46.622621,301.987009,4203.621,180.822062,8.603163,142.620191,40.128837,114.319809,,
min,38.0,,27.0,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,
25%,137206.0,,69474.0,,,,,,,,...,1.5,3.8,123.3,12.8,0.8,2.5,3.5,4.0,,
50%,271758.0,,238937.0,,,,,,,,...,4.7,42.6,353.3,28.2,2.2,6.4,9.1,6.0,,
75%,406145.0,,565828.0,,,,,,,,...,10.8,107.9,792.2,51.1,4.6,17.9,25.0,8.0,,
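
Several columns (`Images`, `RecipeInstructions`) store R-style character vectors as text: `c("...", "...")` for multiple values, a bare quoted string for a single value, and `character(0)` for none. A rough way to convert these to Python lists is sketched below; `parse_r_vector` is a hypothetical helper, and it simply collects double-quoted substrings without unescaping anything inside them.

```python
import re

# A double-quoted string, allowing backslash-escaped characters inside.
_QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')


def parse_r_vector(text):
    """Extract the quoted elements of an R-style character vector as a list."""
    if not isinstance(text, str):
        return []  # treat missing values as empty
    # 'character(0)' contains no quoted strings, so it yields an empty list.
    return _QUOTED.findall(text)


steps = parse_r_vector('c("Toss 2 cups berries with sugar.", "Chill.")')
```

Because the helper only looks for quoted substrings, it handles the bare single-string form seen in one of the `Images` values without a special case.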


In [29]:
df_rr_recipes.isnull().sum()

RecipeId                           0
Name                               0
AuthorId                           0
AuthorName                         0
CookTime                       82545
PrepTime                           0
TotalTime                          0
DatePublished                      0
Description                        5
Images                             1
RecipeCategory                   751
Keywords                       17237
RecipeIngredientQuantities         3
RecipeIngredientParts              0
AggregatedRating              253223
ReviewCount                   247489
Calories                           0
FatContent                         0
SaturatedFatContent                0
CholesterolContent                 0
SodiumContent                      0
CarbohydrateContent                0
FiberContent                       0
SugarContent                       0
ProteinContent                     0
RecipeServings                182911
RecipeYield                   348071
R

In [30]:
df_rr_reviews.head(5)

Unnamed: 0,ReviewId,RecipeId,AuthorId,AuthorName,Rating,Review,DateSubmitted,DateModified
0,2,992,2008,gayg msft,5,better than any you can get at a restaurant!,2000-01-25T21:44:00Z,2000-01-25T21:44:00Z
1,7,4384,1634,Bill Hilbrich,4,"I cut back on the mayo, and made up the differ...",2001-10-17T16:49:59Z,2001-10-17T16:49:59Z
2,9,4523,2046,Gay Gilmore ckpt,2,i think i did something wrong because i could ...,2000-02-25T09:00:00Z,2000-02-25T09:00:00Z
3,13,7435,1773,Malarkey Test,5,easily the best i have ever had. juicy flavor...,2000-03-13T21:15:00Z,2000-03-13T21:15:00Z
4,14,44,2085,Tony Small,5,An excellent dish.,2000-03-28T12:51:00Z,2000-03-28T12:51:00Z


In [31]:
df_rr_reviews.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1401982 entries, 0 to 1401981
Data columns (total 8 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   ReviewId       1401982 non-null  int64 
 1   RecipeId       1401982 non-null  int64 
 2   AuthorId       1401982 non-null  int64 
 3   AuthorName     1401982 non-null  object
 4   Rating         1401982 non-null  int64 
 5   Review         1401768 non-null  object
 6   DateSubmitted  1401982 non-null  object
 7   DateModified   1401982 non-null  object
dtypes: int64(4), object(4)
memory usage: 85.6+ MB


In [32]:
df_rr_reviews.describe(include="all")

Unnamed: 0,ReviewId,RecipeId,AuthorId,AuthorName,Rating,Review,DateSubmitted,DateModified
count,1401982.0,1401982.0,1401982.0,1401982,1401982.0,1401768,1401982,1401982
unique,,,,241365,,1392745,1384268,1384268
top,,,,Sydney Mike,,Delicious!,2002-03-18T10:44:27Z,2002-03-18T10:44:27Z
freq,,,,8842,,383,49,49
mean,817973.9,152641.2,155863800.0,,4.407951,,,
std,528082.1,130111.2,530511100.0,,1.272012,,,
min,2.0,38.0,1533.0,,0.0,,,
25%,374386.2,47038.75,133680.0,,4.0,,,
50%,771780.5,109327.0,330545.0,,5.0,,,
75%,1204126.0,231876.8,818359.0,,5.0,,,


In [33]:
df_rr_reviews.isnull().sum()

ReviewId           0
RecipeId           0
AuthorId           0
AuthorName         0
Rating             0
Review           214
DateSubmitted      0
DateModified       0
dtype: int64
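
The review timestamps are stored as strings and each recipe can have many reviews, so two natural next steps are parsing the dates and aggregating ratings per recipe. The sketch below does both on a toy frame; the rows are illustrative, and the output column names `mean_rating` and `n_reviews` are assumptions introduced here.

```python
import pandas as pd

# Toy reviews in the same shape as reviews.csv (values are illustrative).
reviews = pd.DataFrame({
    "RecipeId": [992, 992, 44],
    "Rating": [5, 4, 5],
    "DateSubmitted": [
        "2000-01-25T21:44:00Z",
        "2001-10-17T16:49:59Z",
        "2000-03-28T12:51:00Z",
    ],
})

# Parse the ISO 8601 timestamps into timezone-aware datetimes.
reviews["DateSubmitted"] = pd.to_datetime(reviews["DateSubmitted"], utc=True)

# Mean rating and number of reviews for each recipe.
per_recipe = reviews.groupby("RecipeId").agg(
    mean_rating=("Rating", "mean"),
    n_reviews=("Rating", "size"),
)
```

The resulting frame is indexed by `RecipeId`, so it could be joined to the recipes table on that key.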