### Data Cleaning

- Load the scraped recipe files and clean up any inconsistencies.
- We'll walk through each stage of the data cleaning process below.

#### Pre-requisites

In [1]:
import warnings 
warnings.filterwarnings(action="ignore")

#### Import the dependencies

In [2]:
import pandas as pd
import numpy as np

#### Load the data

In [3]:
df_bbc = pd.read_csv("recipies 1.csv")
df_tarladal_1 = pd.read_csv("indian_recipies_1.csv")
df_tarladal_2 = pd.read_csv("indian_recipies_2.csv")

##### Concatenate the tarladal data

In [46]:
df_tarladal  = pd.concat([df_tarladal_1,df_tarladal_2], ignore_index=True)
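As a minimal sketch of what `ignore_index=True` does in the cell above: the two frames here are tiny made-up stand-ins (`part_1`, `part_2` are hypothetical names, not the real scraped files), used only to show how the row labels differ with and without the flag.

```python
import pandas as pd

# Toy stand-ins for the two scraped files (hypothetical rows, not the real data).
part_1 = pd.DataFrame({"name": ["a", "b"], "cook_time": ["10 Mins", "20 Mins"]})
part_2 = pd.DataFrame({"name": ["c"], "cook_time": ["5 Mins"]})

# Without ignore_index each part keeps its own labels, so label 0 appears twice.
stacked_keep = pd.concat([part_1, part_2])

# With ignore_index=True the result gets a fresh 0..n-1 RangeIndex.
stacked_fresh = pd.concat([part_1, part_2], ignore_index=True)

print(list(stacked_keep.index))   # [0, 1, 0]
print(list(stacked_fresh.index))  # [0, 1, 2]
```

A fresh index avoids duplicate labels, which matters as soon as rows are selected with `.loc`.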

#### Exploratory Data Analysis

In [48]:
df_tarladal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2114 entries, 0 to 2113
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    2114 non-null   int64 
 1   name          2114 non-null   object
 2   tags          2114 non-null   object
 3   ingredients   2114 non-null   object
 4   serving_size  2114 non-null   object
 5   cook_time     2114 non-null   object
 6   nutrition     1997 non-null   object
 7   instructions  2033 non-null   object
 8   link          2114 non-null   object
dtypes: int64(1), object(8)
memory usage: 148.8+ KB


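- Recipes without instructions are of no use, and since there are only 81 such recipes, we drop them.
- Let's keep the nutrition column for now; we'll try to calculate the nutrition values from the ingredients later. Note that the plain `dropna()` call below removes every row with any missing value, so recipes with missing nutrition are dropped as well, leaving 1917 rows.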
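To make the difference between the two ways of dropping rows concrete, here is a small sketch on made-up data (the `toy` frame and its values are assumptions for illustration). Plain `dropna()` removes a row if any column is missing, while `dropna(subset=["instructions"])` only looks at the instructions column. The `subset` variant is shown as an alternative and is not what the notebook runs.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset: r2 lacks nutrition, r3 lacks instructions, r4 lacks both.
toy = pd.DataFrame({
    "name": ["r1", "r2", "r3", "r4"],
    "nutrition": ["{'Energy': '10 cal'}", np.nan, "{'Energy': '30 cal'}", np.nan],
    "instructions": ["mix", "boil", np.nan, np.nan],
})

# Drops every row that has a missing value in any column.
drop_any = toy.dropna()

# Drops only the rows whose instructions are missing; missing nutrition survives.
drop_instr = toy.dropna(subset=["instructions"])

print(list(drop_any["name"]))    # ['r1']
print(list(drop_instr["name"]))  # ['r1', 'r2']
```

If the nutrition values really are to be recomputed from the ingredients later, the `subset` form would keep more recipes, at the cost of carrying missing values in the nutrition column.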

In [67]:
# Drop the leftover index column and the link, then remove rows with any missing value
df_dropped = df_tarladal.drop(["Unnamed: 0","link"], axis=1).dropna()

In [68]:
df_dropped.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1917 entries, 0 to 2113
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name          1917 non-null   object
 1   tags          1917 non-null   object
 2   ingredients   1917 non-null   object
 3   serving_size  1917 non-null   object
 4   cook_time     1917 non-null   object
 5   nutrition     1917 non-null   object
 6   instructions  1917 non-null   object
dtypes: object(7)
memory usage: 119.8+ KB


In [69]:
df_dropped

|  | name | tags | ingredients | serving_size | cook_time | nutrition | instructions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | vaal ki usal recipe | ['Non-stick Pan', 'Boiled Indian recipes', 'Sa... | ['2 cups sprouted vaal (field beans/ butter be... | 4 servings | 34 Mins | {'Energy': '195 cal', 'Protein': '10.3 g', 'Ca... | For vaal ki usal To make vaal ki usal recipe, ... |
| 1 | capsicum paneer sabzi recipe | ['Non Stick Kadai Veg', 'Antioxidant Rich Indi... | ['2 1/2 cups capsicum cubes', '1/2 cup low-fat... | 4 servings | 26 Mins | {'Energy': '74 cal', 'Protein': '2.6 g', 'Carb... | For capsicum paneer sabziTo make capsicum pane... |
| 3 | restaurant style palak paneer recipe | ['Non-stick Pan', 'Indian Dinner', 'Indian Lun... | ['1 cup sliced onions', '3 tbsp roughly choppe... | 4 servings | 46 Mins | {'Energy': '374 cal', 'Protein': '13.3 g', 'Ca... | For the onion-cashew paste Combine all the ing... |
| 4 | beetroot and dill salad recipe | ['Refrigerator', 'Indian Salads', 'Indian Sala... | ['1 tbsp olive oil', '1 tsp vinegar', 'a pinch... | 4 servings | 15 Mins | {'Energy': '92 cal', 'Protein': '0.1 g', 'Carb... | For beetroot and dill saladTo make beetroot an... |
| 5 | how to eat flaxseeds | ['Indian Dinner', 'Indian Veg Recipes', 'High ... | ['1 cup flax seeds', '1 tbsp lemon juice', '1/... | 4 servings | 5 Mins | {'Energy': '68 cal', 'Protein': '2.3 g', 'Carb... | For flaxseeds mukhwas Combine all the ingredie... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2109 | Penne and Fruit Salad | ['No Cooking Veg Indian', 'Indian Salads', 'In... | ['1 cup cooked penne , cut diagonally', '1/4 c... | 4 servings | 20 Mins | {'Energy': '103 cal', 'Protein': '4.6 g', 'Car... | Combine all the ingredients for the salad in a... |
| 2110 | eggless vegan chocolate cake | ['Oven', 'different kinds of Indian Eggless Ca... | ['1/2 cup oil', '1 1/2 tsp vinegar', '1 tsp va... | 4 servings | 5 Mins | {'Energy': '186 cal', 'Protein': '2 g', 'Carbo... | For vegan chocolate cake To make vegan chocola... |
| 2111 | ginger garlic soup recipe | ['Indian Soups', 'Chunky Indian Soups', 'India... | ['1 tsp oil', '1/2 tbsp finely chopped garlic ... | 4 servings | 25 Mins | {'Energy': '54 cal', 'Protein': '1.4 g', 'Carb... | For ginger garlic soup To make ginger garlic s... |
| 2112 | broccoli broth | ['Deep Pan', 'Boiled Indian recipes', 'Indian ... | ['1 cup broccoli floret', '1 tsp olive oil', '... | 4 servings | 16 Mins | {'Energy': '27 cal', 'Protein': '0.7 g', 'Carb... | For broccoli broth To make broccoli broth, hea... |


#### Cleaning BBC

In [70]:
df_bbc.head(3)

|  | Unnamed: 0 | name | tags | ingredients | serving_size | cook_time | nutrition | instructions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | Chicken madras | ['Dairy-free', 'Egg-free', 'Gluten-free', 'Hea... | ['1 onion peeled and quartered', '2 garlic clo... | serves 3 - 4 | 35 mins | {'kcal': '373 g', 'fat': '17 g', 'saturates': ... | ['step 1', 'Blitz 1 quartered onion, 2 garlic ... |
| 1 | 1 | Pani puris | ['Healthy', 'Vegan', 'Vegetarian'] | ['150g chakki atta (chapatti flour)', '30g fin... | serves 4 - 6 | 40 mins | {'kcal': '385 g', 'fat': '20 g', 'saturates': ... | ['step 1', 'Make the pani water. Place the cor... |
| 2 | 2 | Easy veggie biryani | ['Healthy', 'Vegetarian'] | ['250g basmati rice', '400g special mixed froz... | serves 4 | NaN | {'kcal': '305 g', 'fat': '6 g', 'saturates': '... | ['step 1', 'Boil the kettle. Get out a large m... |


In [71]:
df_bbc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    52 non-null     int64 
 1   name          52 non-null     object
 2   tags          52 non-null     object
 3   ingredients   52 non-null     object
 4   serving_size  50 non-null     object
 5   cook_time     45 non-null     object
 6   nutrition     52 non-null     object
 7   instructions  52 non-null     object
dtypes: int64(1), object(7)
memory usage: 3.4+ KB


- We will not be using cook time and serving size as features, so their missing values (7 and 2 rows respectively) do no harm. Let's keep these columns, and the rows with gaps, for now.
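The per-column gaps can also be read off directly instead of subtracting non-null counts from the `info()` output. A minimal sketch, using a made-up frame (`toy_bbc` and its values are hypothetical, not the real BBC data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the BBC frame with a few gaps.
toy_bbc = pd.DataFrame({
    "name": ["x", "y", "z"],
    "serving_size": ["serves 4", np.nan, "serves 2"],
    "cook_time": [np.nan, np.nan, "35 mins"],
})

# Count missing values per column and show only the columns that have any.
missing = toy_bbc.isna().sum()
print(missing[missing > 0])
```

Running the same two lines on `df_bbc` should list only `serving_size` and `cook_time`, assuming the `info()` output above is current.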

In [72]:
df_cleaned = df_bbc.drop(["Unnamed: 0"], axis=1)

#### Merging both datasets

In [74]:
df = pd.concat([df_dropped,df_cleaned], ignore_index= True)
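`pd.concat` aligns frames by column name and fills anything that does not line up with `NaN`, so a mismatched column name would silently produce extra, half-empty columns. The sketch below illustrates a quick check that could be run before merging. The helper `column_mismatch` and the toy frames are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy frames whose column names deliberately disagree (cook_time vs. time).
left = pd.DataFrame({"name": ["a"], "cook_time": ["10 Mins"]})
right = pd.DataFrame({"name": ["b"], "time": ["20 mins"]})


def column_mismatch(a, b):
    """Return the column names present in exactly one of the two frames."""
    return sorted(set(a.columns) ^ set(b.columns))


print(column_mismatch(left, right))  # ['cook_time', 'time']

# concat keeps the union of the columns and fills the gaps with NaN.
merged = pd.concat([left, right], ignore_index=True)
print(list(merged.columns))  # ['name', 'cook_time', 'time']
```

For `df_dropped` and `df_cleaned` the check should come back empty, which matches the seven columns reported by `df.info()` after the merge.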

In [77]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1969 entries, 0 to 1968
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name          1969 non-null   object
 1   tags          1969 non-null   object
 2   ingredients   1969 non-null   object
 3   serving_size  1967 non-null   object
 4   cook_time     1962 non-null   object
 5   nutrition     1969 non-null   object
 6   instructions  1969 non-null   object
dtypes: object(7)
memory usage: 107.8+ KB


- Before saving, rename and reorder the columns to keep a consistent format.
- The final data should have its columns in this order: ["name", "ingredients", "instructions", "nutrition", "time", "serving_size", "tags"]

In [94]:
df = df.rename(columns={"cook_time":"time"})

In [95]:
df.head(3)

|  | name | ingredients | instructions | nutrition | time | serving_size | tags |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | vaal ki usal recipe | ['2 cups sprouted vaal (field beans/ butter be... | For vaal ki usal To make vaal ki usal recipe, ... | {'Energy': '195 cal', 'Protein': '10.3 g', 'Ca... | 34 Mins | 4 servings | ['Non-stick Pan', 'Boiled Indian recipes', 'Sa... |
| 1 | capsicum paneer sabzi recipe | ['2 1/2 cups capsicum cubes', '1/2 cup low-fat... | For capsicum paneer sabziTo make capsicum pane... | {'Energy': '74 cal', 'Protein': '2.6 g', 'Carb... | 26 Mins | 4 servings | ['Non Stick Kadai Veg', 'Antioxidant Rich Indi... |
| 2 | restaurant style palak paneer recipe | ['1 cup sliced onions', '3 tbsp roughly choppe... | For the onion-cashew paste Combine all the ing... | {'Energy': '374 cal', 'Protein': '13.3 g', 'Ca... | 46 Mins | 4 servings | ['Non-stick Pan', 'Indian Dinner', 'Indian Lun... |


In [96]:
df = df[["name", "ingredients","instructions","nutrition","time","serving_size","tags"]]

In [98]:
df.to_csv("cleaned_data1.csv")
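One thing to be aware of when saving: `to_csv` writes the row index by default, and reading such a file back produces an `Unnamed: 0` column, which is probably where the `Unnamed: 0` columns in the loaded files came from. The sketch below shows the effect on a made-up frame, using in-memory buffers instead of real files. Passing `index=False` is a suggestion here, not what the cell above does.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the cleaned data.
toy = pd.DataFrame({"name": ["a", "b"], "time": ["10 Mins", "20 Mins"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# index=False leaves the index out, so the round trip preserves the columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)     # ['Unnamed: 0', 'name', 'time']
print(cols_without)  # ['name', 'time']
```

Saving with `index=False` would spare later notebooks the extra `drop(["Unnamed: 0"], axis=1)` step, provided nothing downstream relies on that column.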