## Data Cleaning
In this notebook, we describe the data cleaning process we will follow and apply it to clean our data.

First we have to load the data in order to explore them.

In [49]:
import pandas as pd
data = pd.read_csv('datasets/incidents_train.csv', index_col=0)

Lets take a look to see what kind of text we have in our data set.

In [50]:
data.sample(20)

Unnamed: 0,year,month,day,country,title,text,hazard-category,product-category,hazard,product
1958,2016,11,9,us,"ISB Food Group, LLC Recalls Nancy’s Fancy Butt...","ISB Food Group, LLC of Los Angeles, California...",biological,ices and desserts,listeria monocytogenes,ice cream
850,2013,7,10,us,2009 - foodscience corporation recalls kid's m...,"FOR IMMEDIATE RELEASE - June 11, 2009 - FoodSc...",fraud,"dietetic foods, food supplements, fortified foods",mislabelled,dietary supplement
5195,2021,8,5,us,Blount Fine Foods Corp. Recalls Chicken Soup P...,027-2021\n\n \n Low - Class II\n\n Produc...,foreign bodies,"soups, broths, sauces and condiments",foreign bodies,soup
3029,2018,9,10,us,Market of Choice Issues Allergy Alert for Unde...,"Market of Choice, based in Eugene, Ore., is re...",allergens,fruits and vegetables,eggs and products thereof,salads
2441,2017,11,1,ca,Maple Leaf brand Chicken Breast Strips recalle...,Food Recall Warning - Maple Leaf brand Chicken...,biological,"meat, egg and dairy products",staphylococcus,chicken based products
2455,2017,11,7,hk,Food Alert - Stop consuming three kinds of Iri...,Food Alert - Stop consuming three kinds of Iri...,biological,"meat, egg and dairy products",listeria monocytogenes,cheddar cheese
4124,2020,2,28,au,Aussie Shwe — Shan Ma Lay Salt Fruits & Jams 400g,PRA No. 2020/18219 Date published 28 Feb 2020 ...,chemical,fruits and vegetables,heavy metals,dried fruit in jars
1842,2016,8,10,us,Al Shabrawy Incorporated Recalls Meat and Poul...,EDITORS NOTE: This release is being reissued a...,allergens,"meat, egg and dairy products",soybeans and products thereof,other not classified meat products
87,2000,11,28,au,Bluebird—Incredibites Choc/Hazelnut Snack,PRA No. 2000/4567 Date published 28 Nov 2000 P...,biological,cereals and bakery products,other not classified biological hazards,cookies
3377,2019,3,13,us,Hometown Food Company Recalls Two Production L...,Please be advised the Hometown Food Company in...,biological,cereals and bakery products,salmonella,flour


First we will start by casting the title and the text columns into lowercase.

In [51]:
#Casting to lowercase
data.title = data.title.apply(lambda x: x.lower()) 
data.text = data.text.apply(lambda x: x.lower())

We have multiple cases where the "title" column contains only the recall notification, which does not provide meaningful information for the classification task.

On the other hand, the "text" column contains excessive information but is in a standard format.

Therefore, it is possible to extract and retain only the meaningful information from the "text" column and store it in both the "title" and "text" columns.



In [52]:
data.iloc[:10]

Unnamed: 0,year,month,day,country,title,text,hazard-category,product-category,hazard,product
0,1994,1,7,us,recall notification: fsis-024-94,case number: 024-94 \n date opene...,biological,"meat, egg and dairy products",listeria monocytogenes,smoked sausage
1,1994,3,10,us,recall notification: fsis-033-94,case number: 033-94 \n date opene...,biological,"meat, egg and dairy products",listeria spp,sausage
2,1994,3,28,us,recall notification: fsis-014-94,case number: 014-94 \n date opene...,biological,"meat, egg and dairy products",listeria monocytogenes,ham slices
3,1994,4,3,us,recall notification: fsis-009-94,case number: 009-94 \n date opene...,foreign bodies,"meat, egg and dairy products",plastic fragment,thermal processed pork meat
4,1994,7,1,us,recall notification: fsis-001-94,case number: 001-94 \n date opene...,foreign bodies,"meat, egg and dairy products",plastic fragment,chicken breast
5,1994,8,11,us,recall notification: fsis-044-94,case number: 044-94 \n date opene...,biological,"meat, egg and dairy products",escherichia coli,ground beef
6,1994,9,2,us,recall notification: fsis-005-94,case number: 005-94 \n date opene...,biological,"meat, egg and dairy products",listeria monocytogenes,thermal processed pork meat
7,1994,10,11,us,recall notification: fsis-045-94,case number: 045-94 \n date opene...,foreign bodies,"meat, egg and dairy products",plastic fragment,beef
8,1994,12,5,us,recall notification: fsis-018-94,case number: 018-94 \n date opene...,foreign bodies,"meat, egg and dairy products",plastic fragment,beef stewed
9,1994,12,7,us,recall notification: fsis-026-94,case number: 026-94 \n date opene...,chemical,"meat, egg and dairy products","antibiotics, vet drugs",chicken breast


In the above output there are some cases like those mentioned.

Lets see how the format of the "text" column looks like.

In [53]:
print(data.text.iloc[0])

case number: 024-94   
            date opened: 07/01/1994   
            date closed: 09/22/1994 
    
            recall class:  1   
            press release (y/n):  y  
    
            domestic est. number:  05893  p   
              name:  gerhard's napa valley sausage
    
            imported product (y/n):  n       
            foreign estab. number:  n/a
    
            city:  napa    
            state:  ca   
            country:  usa
    
            product:  smoked chicken sausage
    
            problem:  bacteria   
            description: listeria
    
            total pounds recalled:  2,894   
            pounds recovered:  2,894


Below, we define a function that checks if the text begins with "case number". If it does, the function extracts and retains only the meaningful labels from the output.

In [54]:
import re

In [55]:
def case_study(text):
    if re.match("^case number", text):
        information_labels = ['name', 'product' , 'problem', 'description'] #Defining the meaningful labels
        text = text.split('\n') #Spliting the text on \n
        text = [x.strip() for x in text] #Striping excess spacing 
        text = ", ".join([x for x in text for label in information_labels if re.match(f"^{label}", x)]) #Keeping only the labels that are inside the meaningful labels
        return text
    else:
        return text


Now, we will locate the rows that start with "case number" (though this check is also performed within the case_study function) in order to identify the indexes of these cases.\
After identifying these rows, we will apply the case_study function to them.\
Finally in the cleaning process, the cleaned text will be stored in both the "text" and "title" columns.

In [56]:
indexes = data.loc[data['text'].str.contains(r"^case number", na=False),'text'].index
new_col = data.loc[indexes,'text'].apply(case_study)
initial_col = data.loc[indexes,'text'] 

In [57]:
print("The initial columns were: \n",initial_col[:5],"\n")
print("The new columns will be: \n",new_col[:5])

The initial columns were: 
 0    case number: 024-94   \n            date opene...
1    case number: 033-94   \n            date opene...
2    case number: 014-94   \n            date opene...
3    case number: 009-94   \n            date opene...
4    case number: 001-94   \n            date opene...
Name: text, dtype: object 

The new columns will be: 
 0    name:  gerhard's napa valley sausage, product:...
1    name:  wimmer's meat products, product:  wiene...
2    name:  willow foods inc, product:  ham, sliced...
3    name:  oscar mayer foods, product:  beef frank...
4    name:  tyson foods, product:  chicken breast c...
Name: text, dtype: object


In the code below we can see that in some cases there may be excess text from the crawling process so we can remove anything included in "< >"

In [58]:
print(data.text.iloc[1763])

orlando, florida – fresh express incorporated issued a precautionary recall of a small quantity of 7.6 ounce net weight fresh express caesar salad kits when it was learned that two bags of salad kits were mistakenly packed with incorrect condiments including walnuts, a tree nut allergen. according to the company, a total of 2,449 cases could possibly contain the wrong condiment packet and are subject to the precautionary recall. the recalled caesar salad kits were distributed to approximately 19 states primarily in the southeast and are identified by product code g163b13a with a use-by date of june 26.
fresh express representatives are already working with retailers to ensure any incorrectly packed salad kits are rapidly removed from store shelves and inventories. no other fresh express products are included in this recall. no illnesses are reported.
consumers in possession of the recalled product should discard it. a refund is available where purchased or by contacting the fresh expre

Next, we observe that most of the cases contain a lot of numbers, such as dates, barcodes, quantities, and other numerical values, which do not provide meaningful insights for the classification task. Therefore, we can eliminate numbers from the text to focus only on the relevant information.


In [59]:
print(data.text.iloc[4563])

pra number 2021/19189 published date 3 sep 2021 product description original juice company black label cloudy apple juice 1.5l plastic bottle batch code 15468 use by date 07/10/2021 all other use by dates available for sale are not affected   identifying features barcode number 9350142001900 what are the defects? the recall is due to microbial (mycotoxin patulin) contamination. what are the hazards? food products containing mycotoxin (patulin) may cause illness if consumed. what should consumers do? consumers should not drink this product and should return the product to the place of purchase for a full refund. any consumers concerned about their health should seek medical advice. for further information, please contact thirsty brothers pty ltd by phone on 03 9982 1451 or email info@originaljuice.com.au supplier thirsty brothers pty ltd traders who sold this product coles stores in nsw and vic only. where the product was sold new south wales victoria dates available for sale 1 mar 2021

In [60]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now, we are defining the cleaning function that implements everything mentioned above.\
Additionally, we are removing stopwords to reduce the size of the text, and using a lemmatizer, we are reducing words to their root form.

In [61]:
stop_words = set(stopwords.words('english')) #The set of the English stopwords
lem = WordNetLemmatizer() #Reducer of a word to its root word.

def cleaning(text):
    cleaned_text = re.sub(r'<[^<>]*>', '', text.strip()) #Removing anything inside '<>'
    cleaned_text = re.sub(r'\d+', '', cleaned_text).strip() #Removing numbers
    cleaned_text = re.sub(r'\n', ' ', cleaned_text) #Replacing new line char with " "
    cleaned_text = re.sub(r'[^\w\s]', '', cleaned_text) #Removing punctuation
    words = cleaned_text.split() #Spliting the text into words
    filtered_words = [word for word in words if word not in stop_words] #Filtering for stopwords
    reduced_words = [lem.lemmatize(word) for word in filtered_words] #Reducing words to their root
    reduced_text = ' '.join(reduced_words)


    return reduced_text

## Cleaning the given DataSets

Now, we will load our datasets, clean them, and then store the cleaned datasets into CSV files.

In [62]:
# Loading the data
train = pd.read_csv('datasets/incidents_train.csv', index_col=0)
test = pd.read_csv('datasets/incidents_test.csv', index_col=0)
val = pd.read_csv('datasets/incidents_val.csv', index_col=0)
datasets = [train,test,val]

In [63]:
for dataset in datasets:
    # Casting columns to lowercase
    dataset.title = dataset.title.apply(lambda x: x.lower()) 
    dataset.text = dataset.text.apply(lambda x: x.lower())  
    
    # Finding the indexes of the rows that text column starts with case number and applying the case_study function
    indexes = dataset.loc[dataset['text'].str.contains(r"^case number", na=False),'text'].index
    column = dataset.loc[indexes,'text'].apply(case_study)
    dataset.loc[indexes,'text'] = column
    dataset.loc[indexes,'title'] = column

    # Applying the cleaning function to the text and title columns
    data['text'] = data['text'].apply(cleaning)
    data['title'] = data['title'].apply(cleaning)
    

    

In [64]:
val.sample(10)

Unnamed: 0,year,month,day,country,title,text
343,2018,9,22,au,schweppes lemonade,page content ​ schweppes lemonade 1.1l best be...
314,2018,4,12,uk,sweetland ltd recalls sweet and cake products,sweetland ltd is recalling various sweet and c...
369,2019,3,14,us,butterball llc recalls turkey products due to ...,"washington, march 13, 2019 – butterball, llc, ..."
61,2008,9,22,au,coles group ltd—you’ll love coles—sliced chick...,pra no. 2008/10323 date published 22 sep 2008 ...
373,2019,3,29,us,thomas hammer coffee roasters inc. issues alle...,"thomas hammer coffee roasters inc. of spokane,..."
188,2016,2,5,uk,asco foods ltd recalls clover chips barbecue f...,asco foods ltd is recalling three batches of l...
230,2016,9,14,sg,recall of “cocoluscious” coconut milk ice crea...,food alert \n \n \n \n \n \nrecall of “cocolus...
261,2017,4,23,ca,certain longo's brand ground meat products rec...,updated food recall warning - certain longo's ...
469,2020,11,11,uk,diageo great britain recalls guinness draught ...,diageo great britain has taken the precautiona...
422,2020,1,13,ca,scarpone's italian store brand frozen ground ...,food recall warning - scarpone's italian store...


In [65]:
name = ['train','test', 'val']
for name , dataset in zip(name,datasets):
    dataset.to_csv(f'datasets/cleaned_{name}.csv',index=False)