## Vector Representation with fastText

Besides being a technique for computing word embeddings, fastText is also available as a Python module.

fastText provides pre-trained models, which can be downloaded here (note that these models are very large): https://fasttext.cc/docs/en/crawl-vectors.html

These pre-trained models are trained on Wikipedia and Common Crawl, so they have seen a lot of data. However, when we apply such a model to a custom domain, say Indian food recipes, the results can be improved by training on domain-specific text.

In [2]:
!pip install fasttext -q

In [None]:
import fasttext

## Using the Pre-Trained Model


### Loading the Model (.bin file)

In [None]:
model_path = ""

model_en = fasttext.load_model(model_path)

### Getting Most Similar Words of a given Word

In [None]:
model_en.get_nearest_neighbors("great")

### Getting the Vector Representation of a Word

In [None]:
print(model_en.get_word_vector("great").shape)
model_en.get_word_vector("great")[:30]
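
Word vectors are usually compared with cosine similarity, which, to our understanding, is also the kind of score `get_nearest_neighbors` reports. Below is a minimal sketch in pure Python. The helper `cosine_similarity` is hypothetical (it is not part of fastText), and the three-dimensional toy vectors are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Toy vectors (invented for illustration, not taken from any model)
great = [0.9, 0.1, 0.3]
good = [0.8, 0.2, 0.4]
car = [-0.2, 0.9, -0.5]

print(cosine_similarity(great, good))  # close to 1: similar direction
print(cosine_similarity(great, car))   # negative: dissimilar direction
```

With a loaded model, the same helper should work on real vectors, for example `cosine_similarity(model_en.get_word_vector("great"), model_en.get_word_vector("good"))`.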

### Finding Analogies between Words

In [None]:
model_en.get_analogies("Berlin", "Germany", "India")

In [None]:
model_en.get_analogies("car", "driving", "phone")
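
Analogy queries are commonly explained as vector arithmetic: for `get_analogies(A, B, C)` the query is roughly `vec(A) - vec(B) + vec(C)`, and the words nearest to that point are returned. The sketch below illustrates the idea on hand-made two-dimensional vectors. The helpers `normalize`, `cosine`, and `analogy` are hypothetical, the vectors are invented, and fastText's internal implementation may differ in its details.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(u, v):
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def analogy(vectors, word_a, word_b, word_c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    a, b, c = (normalize(vectors[w]) for w in (word_a, word_b, word_c))
    query = [x - y + z for x, y, z in zip(a, b, c)]
    candidates = (w for w in vectors if w not in (word_a, word_b, word_c))
    return max(candidates, key=lambda w: cosine(vectors[w], query))


# Toy vectors: first axis ~ "which country", second axis ~ "capital vs. country"
toy_vectors = {
    "Germany": [1.0, 0.0],
    "Berlin": [1.0, 1.0],
    "India": [-1.0, 0.0],
    "Delhi": [-1.0, 1.0],
    "rice": [0.0, -1.0],
}

print(analogy(toy_vectors, "Berlin", "Germany", "India"))
```

Read as "Berlin is to Germany as ? is to India", the toy query lands closest to `Delhi`.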

## Training the Model over the Indian Food Recipes Dataset

In [3]:
import fasttext

import pandas as pd
import re

### Loading the Dataset

In [4]:
from pathlib import Path
import zipfile


zip_path = Path("/content/indian_food.zip")
dest_dir = Path("/content")

csv_path = dest_dir / "IndianFoodDatasetCSV.csv"

# Extract only if the CSV has not been unpacked yet
if not csv_path.is_file():
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        print(f"[INFO] Unzipping dataset `{zip_path}` to `{dest_dir}`...")
        zip_ref.extractall(dest_dir)

print(f"[INFO] Dataset successfully extracted to `{dest_dir}`.")

[INFO] Unzipping dataset `/content/indian_food.zip` to `/content`...
[INFO] Dataset successfully extracted to `/content`.


### Preprocessing the Dataset

In [5]:
df = pd.read_csv(dest_dir / "IndianFoodDatasetCSV.csv")

print(df.shape)
df.head(3)

(6871, 15)


Unnamed: 0,Srno,RecipeName,TranslatedRecipeName,Ingredients,TranslatedIngredients,PrepTimeInMins,CookTimeInMins,TotalTimeInMins,Servings,Cuisine,Course,Diet,Instructions,TranslatedInstructions,URL
0,1,Masala Karela Recipe,Masala Karela Recipe,"6 Karela (Bitter Gourd/ Pavakkai) - deseeded,S...","6 Karela (Bitter Gourd/ Pavakkai) - deseeded,S...",15,30,45,6,Indian,Side Dish,Diabetic Friendly,"To begin making the Masala Karela Recipe,de-se...","To begin making the Masala Karela Recipe,de-se...",https://www.archanaskitchen.com/masala-karela-...
1,2,टमाटर पुलियोगरे रेसिपी - Spicy Tomato Rice (Re...,Spicy Tomato Rice (Recipe),"2-1/2 कप चावल - पका ले,3 टमाटर,3 छोटा चमच्च बी...","2-1 / 2 cups rice - cooked, 3 tomatoes, 3 teas...",5,10,15,3,South Indian Recipes,Main Course,Vegetarian,टमाटर पुलियोगरे बनाने के लिए सबसे पहले टमाटर क...,"To make tomato puliogere, first cut the tomato...",http://www.archanaskitchen.com/spicy-tomato-ri...
2,3,Ragi Semiya Upma Recipe - Ragi Millet Vermicel...,Ragi Semiya Upma Recipe - Ragi Millet Vermicel...,"1-1/2 cups Rice Vermicelli Noodles (Thin),1 On...","1-1/2 cups Rice Vermicelli Noodles (Thin),1 On...",20,30,50,4,South Indian Recipes,South Indian Breakfast,High Protein Vegetarian,"To begin making the Ragi Vermicelli Recipe, fi...","To begin making the Ragi Vermicelli Recipe, fi...",http://www.archanaskitchen.com/ragi-vermicelli...


In [6]:
# Let's look at the first value of the column we are interested in, `TranslatedInstructions`
print(df["TranslatedInstructions"][0])

To begin making the Masala Karela Recipe,de-seed the karela and slice. Do not remove the skin as the skin has all the nutrients. Add the karela to the pressure cooker with 3 tablespoon of water, salt and turmeric powder and pressure cook for three whistles. Release the pressure immediately and open the lids. Keep aside.Heat oil in a heavy bottomed pan or a kadhai. Add cumin seeds and let it sizzle.Once the cumin seeds have sizzled, add onions and saute them till it turns golden brown in color.Add the karela, red chilli powder, amchur powder, coriander powder and besan. Stir to combine the masalas into the karela.Drizzle a little extra oil on the top and mix again. Cover the pan and simmer Masala Karela stirring occasionally until everything comes together well. Turn off the heat.Transfer Masala Karela into a serving bowl and serve.Serve Masala Karela along with Panchmel Dal and Phulka for a weekday meal with your family. 


In [15]:
# Replacing all non-alphanumeric characters (punctuation) with spaces
text = df["TranslatedInstructions"][0]
alphanumeric_text = re.sub(r"[^\w\s]", " ", text)

print(alphanumeric_text)

# Removing extra spaces
processed_text = re.sub(" +", " ", alphanumeric_text)

print(processed_text)

To begin making the Masala Karela Recipe de seed the karela and slice  Do not remove the skin as the skin has all the nutrients  Add the karela to the pressure cooker with 3 tablespoon of water  salt and turmeric powder and pressure cook for three whistles  Release the pressure immediately and open the lids  Keep aside Heat oil in a heavy bottomed pan or a kadhai  Add cumin seeds and let it sizzle Once the cumin seeds have sizzled  add onions and saute them till it turns golden brown in color Add the karela  red chilli powder  amchur powder  coriander powder and besan  Stir to combine the masalas into the karela Drizzle a little extra oil on the top and mix again  Cover the pan and simmer Masala Karela stirring occasionally until everything comes together well  Turn off the heat Transfer Masala Karela into a serving bowl and serve Serve Masala Karela along with Panchmel Dal and Phulka for a weekday meal with your family  
To begin making the Masala Karela Recipe de seed the karela and 

In [16]:
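# Creating a preprocessing function
def preprocessed(text):
    # Replace punctuation with spaces, then collapse runs of spaces
    alphanumeric_text = re.sub(r"[^\w\s]", " ", text)
    processed_text = re.sub(" +", " ", alphanumeric_text)

    # Trim both ends and lowercase everything
    return processed_text.strip().lower()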

In [17]:
print(preprocessed(text))

to begin making the masala karela recipe de seed the karela and slice do not remove the skin as the skin has all the nutrients  add the karela to the pressure cooker with 3 tablespoon of water salt and turmeric powder and pressure cook for three whistles release the pressure immediately and open the lids keep aside heat oil in a heavy bottomed pan or a kadhai add cumin seeds and let it sizzle once the cumin seeds have sizzled add onions and saute them till it turns golden brown in color add the karela  red chilli powder amchur powder coriander powder and besan stir to combine the masalas into the karela drizzle a little extra oil on the top and mix again cover the pan and simmer masala karela stirring occasionally until everything comes together well turn off the heat transfer masala karela into a serving bowl and serve serve masala karela along with panchmel dal and phulka for a weekday meal with your family
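
The output above still contains a few double spaces. One possible explanation, which is an assumption and has not been checked against the dataset, is that the source text contains whitespace characters other than a plain space (for example non-breaking spaces), which the pattern `" +"` does not collapse. The sketch below shows a hypothetical variant, `preprocess_all_whitespace`, that collapses any run of whitespace with `\s+`.

```python
import re


def preprocess_all_whitespace(text):
    """Like `preprocessed`, but collapses any whitespace run, not only plain spaces."""
    no_punct = re.sub(r"[^\w\s]", " ", text)   # punctuation -> space
    collapsed = re.sub(r"\s+", " ", no_punct)  # tabs, newlines, NBSP, ... -> one space
    return collapsed.strip().lower()


# Made-up sample with a non-breaking space, a tab, and a newline
sample = "Keep aside.\u00a0 Heat oil,\tthen add\ncumin seeds."
print(preprocess_all_whitespace(sample))
```

Collapsing newlines as well keeps each recipe on a single line, which matters later when every recipe is written as one line of the training file.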


In [18]:
# Applying this function to the entire column of the DataFrame
df["TranslatedInstructions"] = df["TranslatedInstructions"].apply(preprocessed)

In [20]:
df.head(3)

Unnamed: 0,Srno,RecipeName,TranslatedRecipeName,Ingredients,TranslatedIngredients,PrepTimeInMins,CookTimeInMins,TotalTimeInMins,Servings,Cuisine,Course,Diet,Instructions,TranslatedInstructions,URL
0,1,Masala Karela Recipe,Masala Karela Recipe,"6 Karela (Bitter Gourd/ Pavakkai) - deseeded,S...","6 Karela (Bitter Gourd/ Pavakkai) - deseeded,S...",15,30,45,6,Indian,Side Dish,Diabetic Friendly,"To begin making the Masala Karela Recipe,de-se...",to begin making the masala karela recipe de se...,https://www.archanaskitchen.com/masala-karela-...
1,2,टमाटर पुलियोगरे रेसिपी - Spicy Tomato Rice (Re...,Spicy Tomato Rice (Recipe),"2-1/2 कप चावल - पका ले,3 टमाटर,3 छोटा चमच्च बी...","2-1 / 2 cups rice - cooked, 3 tomatoes, 3 teas...",5,10,15,3,South Indian Recipes,Main Course,Vegetarian,टमाटर पुलियोगरे बनाने के लिए सबसे पहले टमाटर क...,to make tomato puliogere first cut the tomatoe...,http://www.archanaskitchen.com/spicy-tomato-ri...
2,3,Ragi Semiya Upma Recipe - Ragi Millet Vermicel...,Ragi Semiya Upma Recipe - Ragi Millet Vermicel...,"1-1/2 cups Rice Vermicelli Noodles (Thin),1 On...","1-1/2 cups Rice Vermicelli Noodles (Thin),1 On...",20,30,50,4,South Indian Recipes,South Indian Breakfast,High Protein Vegetarian,"To begin making the Ragi Vermicelli Recipe, fi...",to begin making the ragi vermicelli recipe fir...,http://www.archanaskitchen.com/ragi-vermicelli...


In [29]:
# The dataset is not perfectly translated: count the words that contain non-ASCII characters
j = 0
for text in df["TranslatedInstructions"]:
    for word in text.split():
        if not word.isascii():
            j += 1
j

121273

As we can see, there are `121273` words that contain non-English (non-ASCII) characters.
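
Note that the loop above counts words containing at least one non-ASCII character, not individual characters. To make the distinction concrete, here is a small sketch that reports both numbers. The helper `count_non_ascii` is hypothetical and the two toy sentences are made up.

```python
def count_non_ascii(texts):
    """Return (non-ASCII word count, non-ASCII character count) over all texts."""
    word_count = 0
    char_count = 0
    for text in texts:
        for word in text.split():
            if not word.isascii():
                word_count += 1
                char_count += sum(1 for ch in word if not ch.isascii())
    return word_count, char_count


# Toy input: one untranslated Hindi word among English words
toy_texts = ["add टमाटर and salt", "heat oil in a kadhai"]
print(count_non_ascii(toy_texts))
```

On the real data this could be called as `count_non_ascii(df["TranslatedInstructions"])`.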

In [33]:
# Fixing this problem by keeping only the fully translated recipes in a new DataFrame
correct_df = pd.DataFrame(columns=["TranslatedInstructions"])

correct_df

Unnamed: 0,TranslatedInstructions


In [55]:
# Remove each recipe that contains a non-English (non-ASCII) word
correct_data = []
for text in df["TranslatedInstructions"]:
    words = text.split()
    # Keep the recipe only if it is non-empty and every word is ASCII
    if words and all(word.isascii() for word in words):
        correct_data.append(" ".join(words))

correct_df = pd.DataFrame(correct_data, columns=["TranslatedInstructions"])
correct_df

Unnamed: 0,TranslatedInstructions
0,to begin making the masala karela recipe de s...
1,to make tomato puliogere first cut the tomato...
2,to begin making the ragi vermicelli recipe fi...
3,to begin making gongura chicken curry recipe ...
4,to make andhra style alam pachadi first heat ...
...,...
5726,to begin making the saffron paneer peda recip...
5727,to begin making the italian arancini rice bal...
5728,to begin making quinoa phirnee recipe place a...
5729,to begin making ullikadala pulusu recipe spri...


In [57]:
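# A more compact version of the same filter (Series.filter would match index labels, not values)
words = df["TranslatedInstructions"].str.split()
mask = words.map(lambda ws: bool(ws) and all(w.isascii() for w in ws))
correct_df = words[mask].str.join(" ").to_frame().reset_index(drop=True)
correct_df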

Unnamed: 0,TranslatedInstructions
0,to begin making the masala karela recipe de s...
1,to make tomato puliogere first cut the tomato...
2,to begin making the ragi vermicelli recipe fi...
3,to begin making gongura chicken curry recipe ...
4,to make andhra style alam pachadi first heat ...
...,...
5726,to begin making the saffron paneer peda recip...
5727,to begin making the italian arancini rice bal...
5728,to begin making quinoa phirnee recipe place a...
5729,to begin making ullikadala pulusu recipe spri...


### Exporting the column `TranslatedInstructions` into a separate file

In [58]:
correct_df.to_csv("food_recipes.txt", columns=["TranslatedInstructions"], header=None, index=False)
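
fastText's unsupervised training reads a plain text file. `to_csv` works here because preprocessing already removed commas and quotes, but if any text still contained a newline, CSV quoting would add quote characters to the file. As an alternative, here is a minimal sketch that writes one recipe per line directly. The helper `write_corpus` and the toy recipes are hypothetical, and a temporary directory is used only for the demonstration.

```python
from pathlib import Path
import tempfile


def write_corpus(texts, path):
    """Write one document per line as plain UTF-8 text; return the line count."""
    lines = [" ".join(text.split()) for text in texts]  # normalize whitespace
    lines = [line for line in lines if line]            # skip empty documents
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Toy recipes standing in for correct_df["TranslatedInstructions"]
toy_recipes = ["heat oil in a kadhai", "  add cumin\nseeds  ", ""]

with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = Path(tmp_dir) / "toy_recipes.txt"
    n_lines = write_corpus(toy_recipes, corpus_path)
    written = corpus_path.read_text(encoding="utf-8").splitlines()

print(n_lines, written)
```

In the notebook, the equivalent call would be `write_corpus(correct_df["TranslatedInstructions"], "food_recipes.txt")`.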

### Initializing and Training the Model

In [59]:
model = fasttext.train_unsupervised("food_recipes.txt")
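
The call above relies on default hyperparameters. The sketch below spells out a few of them so they are easy to tune. To the best of our knowledge the values shown match fastText's documented defaults for `train_unsupervised`, but treat them as assumptions and check the documentation for your installed version. `TRAIN_PARAMS` and `train_recipe_model` are hypothetical names.

```python
# Assumed to mirror fastText's defaults for unsupervised training
TRAIN_PARAMS = {
    "model": "skipgram",  # alternative: "cbow"
    "dim": 100,           # size of the word vectors
    "epoch": 5,           # passes over the corpus
    "minCount": 5,        # ignore words seen fewer times than this
    "minn": 3,            # shortest character n-gram
    "maxn": 6,            # longest character n-gram
}


def train_recipe_model(corpus_path, **overrides):
    """Train an unsupervised fastText model, overriding selected parameters."""
    import fasttext  # imported lazily so this sketch loads without the package

    params = {**TRAIN_PARAMS, **overrides}
    return fasttext.train_unsupervised(str(corpus_path), **params)
```

For a corpus of only a few thousand recipes, a call such as `train_recipe_model("food_recipes.txt", epoch=10, minCount=2)` might be worth trying, and `model.save_model("food_recipes.bin")` stores the result for later use with `fasttext.load_model`.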

### Getting Similar Words for Indian Recipes

In [60]:
model.get_nearest_neighbors("chutney")

[(0.9372386932373047, 'chutneys'),
 (0.7289499640464783, 'dhaniya'),
 (0.7172319293022156, 'khajur'),
 (0.7093727588653564, 'imli'),
 (0.6905165314674377, 'pudina'),
 (0.6788460612297058, 'mavinakayi'),
 (0.660719633102417, 'pudi'),
 (0.65772944688797, 'mullu'),
 (0.652636706829071, 'dhani'),
 (0.6501569747924805, 'south')]

As we can see, this model has a far better understanding of the custom dataset than the original pre-trained model.
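
It is no surprise that `chutneys` is the closest neighbor of `chutney`: fastText represents a word through its character n-grams, so words that share most of their spelling end up with similar vectors. The sketch below is a simplification that only extracts the n-grams, with `<` and `>` marking the word boundaries. The real library additionally hashes n-grams into buckets and keeps a vector for the whole word. The helper `char_ngrams` is hypothetical.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of `<word>` with lengths from minn to maxn."""
    marked = f"<{word}>"
    return {
        marked[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(marked) - n + 1)
    }


a = char_ngrams("chutney")
b = char_ngrams("chutneys")
shared = a & b

print(len(a), len(b), len(shared))
print(sorted(a - b))  # n-grams of "chutney" that "chutneys" does not contain
```

Most n-grams of `chutney` also occur in `chutneys`; only the ones that include the end-of-word marker differ.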

For more about word representations in fastText, see: https://fasttext.cc/docs/en/unsupervised-tutorial.html