# **Exploratory Data Analysis**

## **Download Dataset**

In [None]:
!pip install -q kaggle

from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"hanifalirsyad","key":"<redacted>"}'}

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [None]:
!chmod 600 /root/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d irkaal/foodcom-recipes-and-reviews

Downloading foodcom-recipes-and-reviews.zip to /content
 98% 712M/723M [00:08<00:00, 106MB/s]
100% 723M/723M [00:08<00:00, 87.3MB/s]


In [None]:
!unzip /content/foodcom-recipes-and-reviews.zip

Archive:  /content/foodcom-recipes-and-reviews.zip
  inflating: recipes.csv             
  inflating: recipes.parquet         
  inflating: reviews.csv             
  inflating: reviews.parquet         


## **Import Library**

In [None]:
# Data loading and data analysis
import pandas as pd

# Data preprocessing
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import pylab 
import scipy.stats as stats

# import tensorflow as tf

## **Dataset Information**

In [None]:
# Load dataset recipes.parquet
data = pd.read_parquet('/content/recipes.parquet')

In [None]:
# Show dataset
data.head()

| | RecipeId | Name | AuthorId | AuthorName | CookTime | PrepTime | TotalTime | DatePublished | Description | Images | ... | SaturatedFatContent | CholesterolContent | SodiumContent | CarbohydrateContent | FiberContent | SugarContent | ProteinContent | RecipeServings | RecipeYield | RecipeInstructions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 38.0 | Low-Fat Berry Blue Frozen Dessert | 1533 | Dancer | PT24H | PT45M | PT24H45M | 1999-08-09 21:46:00+00:00 | Make and share this Low-Fat Berry Blue Frozen ... | [https://img.sndimg.com/food/image/upload/w_55... | ... | 1.3 | 8.0 | 29.8 | 37.1 | 3.6 | 30.2 | 3.2 | 4.0 | | [Toss 2 cups berries with sugar., Let stand fo... |
| 1 | 39.0 | Biryani | 1567 | elly9812 | PT25M | PT4H | PT4H25M | 1999-08-29 13:12:00+00:00 | Make and share this Biryani recipe from Food.com. | [https://img.sndimg.com/food/image/upload/w_55... | ... | 16.6 | 372.8 | 368.4 | 84.4 | 9.0 | 20.4 | 63.4 | 6.0 | | [Soak saffron in warm milk for 5 minutes and p... |
| 2 | 40.0 | Best Lemonade | 1566 | Stephen Little | PT5M | PT30M | PT35M | 1999-09-05 19:52:00+00:00 | This is from one of my first Good House Keepi... | [https://img.sndimg.com/food/image/upload/w_55... | ... | 0.0 | 0.0 | 1.8 | 81.5 | 0.4 | 77.2 | 0.3 | 4.0 | | [Into a 1 quart Jar with tight fitting lid, pu... |
| 3 | 41.0 | Carina's Tofu-Vegetable Kebabs | 1586 | Cyclopz | PT20M | PT24H | PT24H20M | 1999-09-03 14:54:00+00:00 | This dish is best prepared a day in advance to... | [https://img.sndimg.com/food/image/upload/w_55... | ... | 3.8 | 0.0 | 1558.6 | 64.2 | 17.3 | 32.1 | 29.3 | 2.0 | 4 kebabs | [Drain the tofu, carefully squeezing out exces... |
| 4 | 42.0 | Cabbage Soup | 1538 | Duckie067 | PT30M | PT20M | PT50M | 1999-09-19 06:19:00+00:00 | Make and share this Cabbage Soup recipe from F... | [https://img.sndimg.com/food/image/upload/w_55... | ... | 0.1 | 0.0 | 959.3 | 25.1 | 4.8 | 17.7 | 4.3 | 4.0 | | [Mix everything together and bring to a boil.,... |


In [None]:
# Show the dataset dimensions
print(f'The dataset has {data.shape[1]} columns')
print(f'The dataset has {data.shape[0]} rows')

The dataset has 28 columns
The dataset has 522517 rows


In [None]:
# Show the column names of the dataset
column_headers = data.columns.tolist()
print("The Column Header :", column_headers)

The Column Header : ['RecipeId', 'Name', 'AuthorId', 'AuthorName', 'CookTime', 'PrepTime', 'TotalTime', 'DatePublished', 'Description', 'Images', 'RecipeCategory', 'Keywords', 'RecipeIngredientQuantities', 'RecipeIngredientParts', 'AggregatedRating', 'ReviewCount', 'Calories', 'FatContent', 'SaturatedFatContent', 'CholesterolContent', 'SodiumContent', 'CarbohydrateContent', 'FiberContent', 'SugarContent', 'ProteinContent', 'RecipeServings', 'RecipeYield', 'RecipeInstructions']


In [None]:
# Show variables info
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522517 entries, 0 to 522516
Data columns (total 28 columns):
 #   Column                      Non-Null Count   Dtype              
---  ------                      --------------   -----              
 0   RecipeId                    522517 non-null  float64            
 1   Name                        522517 non-null  object             
 2   AuthorId                    522517 non-null  int32              
 3   AuthorName                  522517 non-null  object             
 4   CookTime                    439972 non-null  object             
 5   PrepTime                    522517 non-null  object             
 6   TotalTime                   522517 non-null  object             
 7   DatePublished               522517 non-null  datetime64[ns, UTC]
 8   Description                 522512 non-null  object             
 9   Images                      522516 non-null  object             
 10  RecipeCategory              521766 non-null 

**Description of the variables in the dataset:**


* RecipeId : Unique identifier of the recipe.
* Name : Name of the recipe.
* AuthorId : Recipe creator's Id.
* AuthorName : Recipe creator's name.
* CookTime : Cooking time, stored as an ISO 8601 duration string (e.g. `PT25M`).
* PrepTime : Preparation time, stored as an ISO 8601 duration string (e.g. `PT45M`).
* TotalTime : Total preparation and cooking time, stored as an ISO 8601 duration string (e.g. `PT4H25M`).
* DatePublished : Date the recipe was published.
* Description : Description of the recipe.
* Images : Image URLs for the recipe.
* RecipeCategory : Category of the recipe.
* Keywords : Keywords associated with the recipe.
* RecipeIngredientQuantities : Quantity of each ingredient in the recipe.
* RecipeIngredientParts : Ingredients used in the recipe.
* AggregatedRating : Aggregated rating of the recipe from its reviews.
* ReviewCount : Number of reviews the recipe has received.
* Calories : Amount of calories in the recipe.
* FatContent : Amount of fat in the recipe.
* SaturatedFatContent : Amount of saturated fat in the recipe.
* CholesterolContent : Amount of cholesterol in the recipe.
* SodiumContent : Amount of sodium in the recipe.
* CarbohydrateContent : Amount of carbohydrate in the recipe.
* FiberContent : Amount of fiber in the recipe.
* SugarContent : Amount of sugar in the recipe.
* ProteinContent : Amount of protein in the recipe.
* RecipeServings : Number of servings the recipe makes.
* RecipeYield : Yield of the recipe (e.g. `4 kebabs`).
* RecipeInstructions : Steps for preparing the recipe.
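
The `CookTime`, `PrepTime` and `TotalTime` values shown by `data.head()` (for example `PT24H45M`) are ISO 8601 duration strings rather than plain numbers, so they need converting before any numeric analysis. The following is a minimal sketch of one way to do that. The helper name `duration_to_minutes` is hypothetical and not part of the original notebook, and it assumes only day, hour, minute and second components appear in the data.

```python
import math
import re

# Matches durations such as "PT45M", "PT24H45M" or "P1DT2H"; every component is optional.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_minutes(value):
    """Convert an ISO 8601 duration string to minutes; return NaN if it cannot be parsed."""
    if not isinstance(value, str):
        return math.nan  # covers None / NaN entries such as missing CookTime values
    match = _DURATION_RE.match(value.strip())
    if match is None:
        return math.nan
    # Components absent from the string are treated as zero.
    parts = {name: int(number or 0) for name, number in match.groupdict().items()}
    return (
        parts["days"] * 24 * 60
        + parts["hours"] * 60
        + parts["minutes"]
        + parts["seconds"] / 60
    )


print(duration_to_minutes("PT24H45M"))  # 24 * 60 + 45 minutes
```

Applied to a column, this could look like `data['CookTime'].map(duration_to_minutes)`, which would leave missing entries as NaN.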




**The variables that we will use in this project:**

['RecipeId','Name','CookTime','PrepTime','TotalTime','RecipeIngredientParts','Calories','FatContent','SaturatedFatContent','CholesterolContent','SodiumContent','CarbohydrateContent','FiberContent','SugarContent','ProteinContent','RecipeInstructions']

In [None]:
# Select the required columns
columns = ['RecipeId', 'Name', 'CookTime', 'PrepTime', 'TotalTime', 'RecipeIngredientParts', 'Calories', 'FatContent', 'SaturatedFatContent', 'CholesterolContent', 'SodiumContent', 'CarbohydrateContent', 'FiberContent', 'SugarContent', 'ProteinContent', 'RecipeInstructions']
dataset = data[columns].copy()
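
Among the selected columns, the `data.info()` output lists `CookTime` with 439972 non-null entries out of 522517, which means 82545 values are missing. A quick per-column missing-value summary makes gaps like this easy to spot. The sketch below uses a small made-up frame (`toy`) in place of `dataset`, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Toy frame standing in for `dataset`; the rows are made up for illustration.
toy = pd.DataFrame({
    "RecipeId": [38.0, 39.0, 40.0],
    "CookTime": ["PT24H", None, "PT5M"],
    "Calories": [170.9, 1110.7, 311.1],
})

missing = toy.isna().sum()             # number of missing values per column
missing_pct = toy.isna().mean() * 100  # share of missing values, in percent

# Keep only the columns that actually contain missing values.
summary = pd.DataFrame({"missing": missing, "missing_pct": missing_pct.round(2)})
print(summary[summary["missing"] > 0])
```

Replacing `toy` with `dataset` would produce the same summary for the real data.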

## **Data Visualization**

In [None]:
dataset.head()

| | RecipeId | Name | CookTime | PrepTime | TotalTime | RecipeIngredientParts | Calories | FatContent | SaturatedFatContent | CholesterolContent | SodiumContent | CarbohydrateContent | FiberContent | SugarContent | ProteinContent | RecipeInstructions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 38.0 | Low-Fat Berry Blue Frozen Dessert | PT24H | PT45M | PT24H45M | [blueberries, granulated sugar, vanilla yogurt... | 170.9 | 2.5 | 1.3 | 8.0 | 29.8 | 37.1 | 3.6 | 30.2 | 3.2 | [Toss 2 cups berries with sugar., Let stand fo... |
| 1 | 39.0 | Biryani | PT25M | PT4H | PT4H25M | [saffron, milk, hot green chili peppers, onion... | 1110.7 | 58.8 | 16.6 | 372.8 | 368.4 | 84.4 | 9.0 | 20.4 | 63.4 | [Soak saffron in warm milk for 5 minutes and p... |
| 2 | 40.0 | Best Lemonade | PT5M | PT30M | PT35M | [sugar, lemons, rind of, lemon, zest of, fresh... | 311.1 | 0.2 | 0.0 | 0.0 | 1.8 | 81.5 | 0.4 | 77.2 | 0.3 | [Into a 1 quart Jar with tight fitting lid, pu... |
| 3 | 41.0 | Carina's Tofu-Vegetable Kebabs | PT20M | PT24H | PT24H20M | [extra firm tofu, eggplant, zucchini, mushroom... | 536.1 | 24.0 | 3.8 | 0.0 | 1558.6 | 64.2 | 17.3 | 32.1 | 29.3 | [Drain the tofu, carefully squeezing out exces... |
| 4 | 42.0 | Cabbage Soup | PT30M | PT20M | PT50M | [plain tomato juice, cabbage, onion, carrots, ... | 103.6 | 0.4 | 0.1 | 0.0 | 959.3 | 25.1 | 4.8 | 17.7 | 4.3 | [Mix everything together and bring to a boil.,... |


In [None]:
# plotPerColumnDistribution(dataset, 10, 5)
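
The commented-out call above refers to a helper that is not defined anywhere in this notebook. As a rough stand-in, the sketch below draws one histogram per numeric column. The function name `plot_numeric_distributions` is hypothetical, and reading the two numeric arguments as "maximum number of plots" and "plots per row" is an assumption, not something the source confirms.

```python
import math

import matplotlib.pyplot as plt
import pandas as pd


def plot_numeric_distributions(df, max_plots=10, n_cols=5):
    """Draw one histogram per numeric column, for at most max_plots columns."""
    numeric_cols = df.select_dtypes(include="number").columns[:max_plots]
    if len(numeric_cols) == 0:
        raise ValueError("no numeric columns to plot")
    n_rows = math.ceil(len(numeric_cols) / n_cols)
    # squeeze=False keeps `axes` two-dimensional even for a single row.
    fig, axes = plt.subplots(
        n_rows, n_cols, figsize=(4 * n_cols, 3 * n_rows), squeeze=False
    )
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, numeric_cols):
        ax.hist(df[col].dropna(), bins=30)
        ax.set_title(col)
    # Hide the unused panels of the grid.
    for ax in flat_axes[len(numeric_cols):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig
```

A call such as `fig = plot_numeric_distributions(dataset, 10, 5)` would then plot the first ten numeric columns, five per row. Text columns such as `Name` or `CookTime` are skipped because only numeric dtypes are selected.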