# Additional Pre-Processing

This notebook applies additional pre-processing to the `menu_section` and `dish_name` columns. It then creates a new column, `full_description`, by concatenating the cleaned `menu_section`, the cleaned `dish_name`, and `cleaned_descriptions`, separated by single spaces.

The initial pre-processing ran on the original file and took a long time to finish. Rather than rerun it, this notebook extends that step by working on its output, which is much smaller.

In [None]:
import pandas as pd
import nltk
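# English stop-word list used for filtering.
nltk.download('stopwords')
# Tokenizer models used by nltk.word_tokenize; recent NLTK releases may ask for 'punkt_tab' instead.
nltk.download('punkt')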

In [None]:
df = pd.read_csv('data/processed_dishes_v2.csv')
df.head(5)

In [None]:
stop_words = set(nltk.corpus.stopwords.words('english'))

def tokenize_and_process(desc):
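    # Lowercase before filtering: NLTK's stop-word list is lowercase, so "The" would otherwise be kept.
    tokens = nltk.word_tokenize(desc.lower())

    # Remove stop words and non-alphabetic tokens such as punctuation and numbers.
    tokens = [t for t in tokens if t not in stop_words and t.isalpha()]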
    
    return " ".join(tokens)
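
NLTK's English stop-word list contains lowercase entries only, so the order of lowercasing and filtering affects which tokens survive. The sketch below illustrates this without NLTK. The small `STOP_WORDS` set, the whitespace split, and both function names are hypothetical stand-ins, not the notebook's actual tokenizer or list.

```python
# Hypothetical stand-in for NLTK's English stop-word list (lowercase entries only).
STOP_WORDS = {"the", "of", "with", "and", "a"}

def filter_then_lower(text):
    # Filter against the lowercase list first, then lowercase whatever survives.
    tokens = [t for t in text.split() if t not in STOP_WORDS and t.isalpha()]
    return " ".join(tokens).lower()

def lower_then_filter(text):
    # Lowercase first so that capitalized stop words match the list.
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS and t.isalpha()]
    return " ".join(tokens)

name = "The Catch of the Day"
print(filter_then_lower(name))  # the catch day
print(lower_then_filter(name))  # catch day
```

Menu sections and dish names are often title-cased, so a leading "The" or "With" only gets removed when the text is lowercased before the stop-word check.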

In [None]:
df_copy = df.copy()
df_copy['cleaned_menu_section'] = df_copy['menu_section'].apply(tokenize_and_process).str.lower()
df_copy['cleaned_dish_name'] = df_copy['dish_name'].apply(tokenize_and_process).str.lower()
df_copy.head(5)

In [None]:
full_description = df_copy['cleaned_menu_section'] + ' ' + df_copy['cleaned_dish_name'] + ' ' + df_copy['cleaned_descriptions']
df.insert(5, "full_description", full_description)
df.head(5)
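
When pandas string columns are joined with `+`, a missing value in any one part makes the whole result missing. Whether these columns actually contain missing values is not stated here, so the following is only a sketch of how to check for and guard against that case. The `toy` frame and its values are made up; only the column names mirror the notebook.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the notebook, values are invented.
toy = pd.DataFrame({
    "cleaned_menu_section": ["soups", "desserts"],
    "cleaned_dish_name": ["tomato bisque", "apple pie"],
    "cleaned_descriptions": ["creamy tomato soup", None],
})

# Plain `+` concatenation: the second row has no description, so its result is missing.
plain = (
    toy["cleaned_menu_section"] + " "
    + toy["cleaned_dish_name"] + " "
    + toy["cleaned_descriptions"]
)

# Guarded variant: treat missing parts as empty strings, then trim stray spaces.
guarded = (
    toy["cleaned_menu_section"].fillna("") + " "
    + toy["cleaned_dish_name"].fillna("") + " "
    + toy["cleaned_descriptions"].fillna("")
).str.strip()

print(plain.isna().tolist())  # [False, True]
print(guarded.tolist())
```

A quick `full_description.isna().sum()` on the real data would show whether the guard is needed at all. Note that `.str.strip()` only trims the ends, so a missing middle part would still leave a double space.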

In [None]:
df.to_csv('data/processed_dishes_v3.csv', index=False)