## Data Cleaning
In this notebook, we describe the data cleaning process we will follow and apply it to clean our data.

First we have to load the data in order to explore them.

In [None]:
import pandas as pd
data = pd.read_csv('Datasets/incidents_train.csv', index_col=0)

Lets take a look to see what kind of text we have in our data set.

In [15]:
data.sample(20)

Unnamed: 0,year,month,day,country,title,text,hazard-category,product-category,hazard,product
183,2004,1,8,au,Golden Circle—Meal Variety 4 and 8 Pack Baby Food,PRA No. 2004/6724 Date published 8 Jan 2004 Pr...,allergens,"dietetic foods, food supplements, fortified foods",milk and products thereof,baby food
5356,2021,10,14,ca,Canada Uncle Bill Seafood brand Dried Octopus ...,Service interruption Due to system maintenance...,allergens,seafood,sulphur dioxide and sulphites,octopus
4310,2020,7,17,us,"InHe Manufacturing, LLC and MHR Brands Issues ...",This recall press release was issued by the fi...,chemical,"dietetic foods, food supplements, fortified foods",heavy metals,food supplement
4733,2021,1,8,uk,Premier Selection Sweets recalls The Premier S...,Premier Selection Sweets is recalling The Prem...,allergens,"cocoa and cocoa preparations, coffee and tea",hazelnut,chocolate products
4391,2020,8,22,us,Prima® Wawona Recalls Bulk/Loose and Bagged Pe...,"Prima® Wawona of Fresno, California is volunta...",biological,fruits and vegetables,salmonella,peaches
5911,2022,6,23,us,Daily Harvest Issues Voluntary Recall of Frenc...,"June 23, 2022, Daily Harvest, Inc., New York, ...",biological,fruits and vegetables,other,frozen leek
1564,2016,3,5,us,Voluntary Recall of Cheese and Fruit Bistro Bo...,"Gretchen’s Shoebox Express, a food packing est...",allergens,prepared dishes and snacks,cashew,ready to eat - cook meals
565,2011,9,11,us,arkansas firm recalls ground turkey products d...,"WASHINGTON, September 11, 2011 - Cargill Meat ...",biological,"meat, egg and dairy products",salmonella,turkey and turkey preparations
2087,2017,2,4,uk,Flynn's Fine Foods recalls Limerick Cooked Ham...,Flynn's Fine Foods is recalling Limerick Cooke...,allergens,"meat, egg and dairy products",milk and products thereof,cooked ham
283,2007,2,15,au,Harvey Fresh Ltd—All White Milk,PRA No. 2007/9048 Date published 15 Feb 2007 P...,organoleptic aspects,"meat, egg and dairy products",taste disturbance,milk


First we will start by casting the title and the text columns into lowercase.

In [16]:
#Casting to lowercase
data.title = data.title.apply(lambda x: x.lower()) 
data.text = data.text.apply(lambda x: x.lower())

We have multiple cases where the "title" column contains only the recall notification, which does not provide meaningful information for the classification task.

On the other hand, the "text" column contains excessive information but is in a standard format.

Therefore, it is possible to extract and retain only the meaningful information from the "text" column and store it in both the "title" and "text" columns.



In [17]:
data.iloc[:10]

Unnamed: 0,year,month,day,country,title,text,hazard-category,product-category,hazard,product
0,1994,1,7,us,recall notification: fsis-024-94,case number: 024-94 \n date opene...,biological,"meat, egg and dairy products",listeria monocytogenes,smoked sausage
1,1994,3,10,us,recall notification: fsis-033-94,case number: 033-94 \n date opene...,biological,"meat, egg and dairy products",listeria spp,sausage
2,1994,3,28,us,recall notification: fsis-014-94,case number: 014-94 \n date opene...,biological,"meat, egg and dairy products",listeria monocytogenes,ham slices
3,1994,4,3,us,recall notification: fsis-009-94,case number: 009-94 \n date opene...,foreign bodies,"meat, egg and dairy products",plastic fragment,thermal processed pork meat
4,1994,7,1,us,recall notification: fsis-001-94,case number: 001-94 \n date opene...,foreign bodies,"meat, egg and dairy products",plastic fragment,chicken breast
5,1994,8,11,us,recall notification: fsis-044-94,case number: 044-94 \n date opene...,biological,"meat, egg and dairy products",escherichia coli,ground beef
6,1994,9,2,us,recall notification: fsis-005-94,case number: 005-94 \n date opene...,biological,"meat, egg and dairy products",listeria monocytogenes,thermal processed pork meat
7,1994,10,11,us,recall notification: fsis-045-94,case number: 045-94 \n date opene...,foreign bodies,"meat, egg and dairy products",plastic fragment,beef
8,1994,12,5,us,recall notification: fsis-018-94,case number: 018-94 \n date opene...,foreign bodies,"meat, egg and dairy products",plastic fragment,beef stewed
9,1994,12,7,us,recall notification: fsis-026-94,case number: 026-94 \n date opene...,chemical,"meat, egg and dairy products","antibiotics, vet drugs",chicken breast


In the above output there are some cases like those mentioned.

Lets see how the format of the "text" column looks like.

In [18]:
print(data.text.iloc[0])

case number: 024-94   
            date opened: 07/01/1994   
            date closed: 09/22/1994 
    
            recall class:  1   
            press release (y/n):  y  
    
            domestic est. number:  05893  p   
              name:  gerhard's napa valley sausage
    
            imported product (y/n):  n       
            foreign estab. number:  n/a
    
            city:  napa    
            state:  ca   
            country:  usa
    
            product:  smoked chicken sausage
    
            problem:  bacteria   
            description: listeria
    
            total pounds recalled:  2,894   
            pounds recovered:  2,894


Below, we define a function that checks if the text begins with "case number". If it does, the function extracts and retains only the meaningful labels from the output.

In [19]:
import re

In [20]:
def case_study(text):
    if re.match("^case number", text):
        information_labels = ['name', 'product' , 'problem', 'description'] #Defining the meaningful labels
        text = text.split('\n') #Spliting the text on \n
        text = [x.strip() for x in text] #Striping excess spacing 
        text = ", ".join([x for x in text for label in information_labels if re.match(f"^{label}", x)]) #Keeping only the labels that are inside the meaningful labels
        return text
    else:
        return text


Now, we will locate the rows that start with "case number" (though this check is also performed within the case_study function) in order to identify the indexes of these cases.\
After identifying these rows, we will apply the case_study function to them.\
Finally in the cleaning process, the cleaned text will be stored in both the "text" and "title" columns.

In [21]:
indexes = data.loc[data['text'].str.contains(r"^case number", na=False),'text'].index
new_col = data.loc[indexes,'text'].apply(case_study)
initial_col = data.loc[indexes,'text'] 

In [22]:
print("The initial columns were: \n",initial_col[:5],"\n")
print("The new columns will be: \n",new_col[:5])

The initial columns were: 
 0    case number: 024-94   \n            date opene...
1    case number: 033-94   \n            date opene...
2    case number: 014-94   \n            date opene...
3    case number: 009-94   \n            date opene...
4    case number: 001-94   \n            date opene...
Name: text, dtype: object 

The new columns will be: 
 0    name:  gerhard's napa valley sausage, product:...
1    name:  wimmer's meat products, product:  wiene...
2    name:  willow foods inc, product:  ham, sliced...
3    name:  oscar mayer foods, product:  beef frank...
4    name:  tyson foods, product:  chicken breast c...
Name: text, dtype: object


In the code below we can see that in some cases there may be excess text from the crawling process so we can remove anything included in "< >"

In [23]:
print(data.text.iloc[1763])

orlando, florida – fresh express incorporated issued a precautionary recall of a small quantity of 7.6 ounce net weight fresh express caesar salad kits when it was learned that two bags of salad kits were mistakenly packed with incorrect condiments including walnuts, a tree nut allergen. according to the company, a total of 2,449 cases could possibly contain the wrong condiment packet and are subject to the precautionary recall. the recalled caesar salad kits were distributed to approximately 19 states primarily in the southeast and are identified by product code g163b13a with a use-by date of june 26.
fresh express representatives are already working with retailers to ensure any incorrectly packed salad kits are rapidly removed from store shelves and inventories. no other fresh express products are included in this recall. no illnesses are reported.
consumers in possession of the recalled product should discard it. a refund is available where purchased or by contacting the fresh expre

Next, we observe that most of the cases contain a lot of numbers, such as dates, barcodes, quantities, and other numerical values, which do not provide meaningful insights for the classification task. Therefore, we can eliminate numbers from the text to focus only on the relevant information.


In [24]:
print(data.text.iloc[4563])

pra number 2021/19189 published date 3 sep 2021 product description original juice company black label cloudy apple juice 1.5l plastic bottle batch code 15468 use by date 07/10/2021 all other use by dates available for sale are not affected   identifying features barcode number 9350142001900 what are the defects? the recall is due to microbial (mycotoxin patulin) contamination. what are the hazards? food products containing mycotoxin (patulin) may cause illness if consumed. what should consumers do? consumers should not drink this product and should return the product to the place of purchase for a full refund. any consumers concerned about their health should seek medical advice. for further information, please contact thirsty brothers pty ltd by phone on 03 9982 1451 or email info@originaljuice.com.au supplier thirsty brothers pty ltd traders who sold this product coles stores in nsw and vic only. where the product was sold new south wales victoria dates available for sale 1 mar 2021

In [25]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now, we are defining the cleaning function that implements everything mentioned above.\
Additionally, we are removing stopwords to reduce the size of the text, and using a lemmatizer, we are reducing words to their root form.

In [26]:
stop_words = set(stopwords.words('english')) #The set of the English stopwords
lem = WordNetLemmatizer() #Reducer of a word to its root word.

def cleaning(text):
    cleaned_text = re.sub(r'<[^<>]*>', '', text.strip()) #Removing anything inside '<>'
    cleaned_text = re.sub(r'\d+', '', cleaned_text).strip() #Removing numbers
    cleaned_text = re.sub(r'\n', ' ', cleaned_text) #Replacing new line char with " "
    cleaned_text = re.sub(r'[^\w\s]', '', cleaned_text) #Removing punctuation
    words = cleaned_text.split() #Spliting the text into words
    filtered_words = [word for word in words if word not in stop_words] #Filtering for stopwords
    reduced_words = [lem.lemmatize(word) for word in filtered_words] #Reducing words to their root
    reduced_text = ' '.join(reduced_words)


    return reduced_text

## Cleaning the given DataSets

Now, we will load our datasets, clean them, and then store the cleaned datasets into CSV files.

In [None]:
# Loading the data
train = pd.read_csv('Datasets/incidents_train.csv', index_col=0)
test = pd.read_csv('Datasets/incidents_test.csv', index_col=0)
val = pd.read_csv('Datasets/incidents_val.csv', index_col=0)
datasets = [train,test,val]

In [28]:
for dataset in datasets:
    # Casting columns to lowercase
    dataset.title = dataset.title.apply(lambda x: x.lower()) 
    dataset.text = dataset.text.apply(lambda x: x.lower())  
    
    # Finding the indexes of the rows that text column starts with case number and applying the case_study function
    indexes = dataset.loc[dataset['text'].str.contains(r"^case number", na=False),'text'].index
    column = dataset.loc[indexes,'text'].apply(case_study)
    dataset.loc[indexes,'text'] = column
    dataset.loc[indexes,'title'] = column

    # Applying the cleaning function to the text and title columns
    data['text'] = data['text'].apply(cleaning)
    data['title'] = data['title'].apply(cleaning)
    

    

In [29]:
val.sample(10)

Unnamed: 0,year,month,day,country,title,text
368,2019,3,13,uk,premier foods recalls hovis granary bread flou...,premier foods is recalling hovis granary bread...
551,2022,4,14,hk,cfs continues to follow up on imported chocola...,cfs continues to follow up on imported chocola...
159,2015,2,13,uk,anjoman is recalling its dried plums,anjoman is recalling its dried plums because t...
11,2000,12,15,au,mcwilliams wines—holsten imported beer,pra no. 2000/4579 date published 15 dec 2000 p...
439,2020,3,24,us,tiffany food corp. issues alert on undeclared ...,"tiffany food corp. of brooklyn, ny is recallin..."
374,2019,4,2,us,wakefern food corp. voluntarily recalls wholes...,wakefern food corp. has initiated a voluntary ...
145,2014,9,19,ca,pc organics brand original stoned wheat cracke...,notice this archive of previously issued food ...
223,2016,8,2,us,"michael angelo's gourmet foods, inc. recalls s...",editors note: this release is being reissued a...
12,2001,3,8,au,safcol australia pty ltd—canned tuna with sata...,pra no. 2001/4677 date published 8 mar 2001 pr...
34,2004,6,25,us,archives,recall notification report 020-2004\n ...


In [None]:
#Saving the new data sets
name = ['train','test', 'val']
for name , dataset in zip(name,datasets):
    dataset.to_csv(f'Datasets/cleaned_{name}.csv',index=False)

## Data Cleaning Evaluation
We are going to compare the achieved results of our basic models with the raw and the clean data.

<img src="Images\F1 with the raw data.JPG" alt="Raw Data" width="900" style="display:inline-block;">
<img src="Images\F1 with the clean data.JPG" alt="Clean Data" width="900" style="display:inline-block; margin-right:15px;">


The F1 scores of the models show little to no improvement with the clean data.\
This suggests that the data either needed more thorough cleaning or that the models were already able to handle the noise during training.\
Despite the fact that the cleaning didn’t help the models to achieve significantlly better results, from now on we are going to use the clean data for the training of our models.