# Preprocessing Data in EMADE
In this notebook I am going to cover the process of cleaning and formatting a specific dataset for EMADE.

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import shutil
import gzip

import os
import re
import math

We will start by importing all the libraries we will need. The most important libraries are numpy and pandas because we will use their data types to store and manipulate our dataset.

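Note: Make sure you have nltk's stopwords corpus downloaded. Running nltk.download() in a Python shell opens the downloader, where you can check for and install it; nltk.download('stopwords') fetches it directly.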

Next, we will import our dataset into a pandas dataframe object.

In [2]:
data = pd.read_csv("datasets/winemag-data_first150k.csv")

# Print first 5 examples
print(data.head())
print(data.shape)

# remove all rows except for the first 25,000
data = data.drop(data.index[25000:])
print(data.shape)

  country                                        description  \
0      US  This tremendous 100% varietal wine hails from ...   
1   Spain  Ripe aromas of fig, blackberry and cassis are ...   
2      US  Mac Watson honors the memory of a wine once ma...   
3      US  This spent 20 months in 30% new French oak, an...   
4  France  This is the top wine from La Bégude, named aft...   

                            designation  points  price        province  \
0                     Martha's Vineyard      96  235.0      California   
1  Carodorum Selección Especial Reserva      96  110.0  Northern Spain   
2         Special Selected Late Harvest      96   90.0      California   
3                               Reserve      96   65.0          Oregon   
4                            La Brûlade      95   66.0        Provence   

            region_1           region_2             variety  \
0        Napa Valley               Napa  Cabernet Sauvignon   
1               Toro                NaN     

The original link to the dataset can be found here: https://www.kaggle.com/zynicide/wine-reviews

However, I removed the index column from the csv in Excel.

The dataset consists of 10 features. Most of these are string values, which most machine learning algorithms and data transformations cannot consume directly. The dataset also has columns with missing values. We will either have to remove those columns or replace the missing values.

We also cut the dataset size down to 25,000 examples to reduce the size of our final dataset and RAM usage.

Note: If your computer does not have at least 8 GB of RAM, you might need to reduce the size of the data even more.
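A rough way to judge whether the data will fit in memory is to estimate the size of the dense array before building it. The sketch below assumes 64-bit floats; the helper name `dense_size_gb` and the example column count are illustrative assumptions, not measurements from this notebook.

```python
import numpy as np

def dense_size_gb(n_rows, n_cols, dtype=np.float64):
    """Approximate size in GB of a dense n_rows x n_cols array."""
    return n_rows * n_cols * np.dtype(dtype).itemsize / 1e9

# Hypothetical example: 25,000 rows and 20,000 feature columns
print(dense_size_gb(25_000, 20_000))
```

If the estimate is close to your available RAM, keeping fewer rows is the simplest fix.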

Next, we will separate a column for labels and modify its values.

In [3]:
labels = data[["price"]]
print(labels.shape)

# Fill in all missing values with mean of column
labels = labels.fillna(labels.mean())

# Label is 1 if the price is greater than or equal to 50, otherwise 0.
# A single comparison avoids mislabeling a wine whose price is exactly 1.
labels["price"] = (labels["price"] >= 50).astype(int)

(25000, 1)


In this case, we will choose to classify whether a wine has a price greater than or equal to 50. It is important to learn how to recognize potential labels and classification problems when you find a dataset. Most datasets do not come with predefined labels, and even those that do can often be reframed as a different problem.

We also fill in the missing price values with the mean of all price values. There are many different methods of dealing with missing values. You could instead drop every row with a missing value, or replace missing values with the median.
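The three options mentioned above can be compared on a toy column. This is only a sketch: the prices are made up, and which strategy suits the wine data best is a judgment call.

```python
import numpy as np
import pandas as pd

# Toy price column with one missing value (illustrative data, not from the wine dataset)
toy = pd.DataFrame({"price": [10.0, 20.0, np.nan, 90.0]})

mean_filled = toy.fillna(toy.mean())      # what this notebook does
median_filled = toy.fillna(toy.median())  # less sensitive to a few very expensive wines
dropped = toy.dropna()                    # discard incomplete rows instead

print(mean_filled["price"].tolist())
print(median_filled["price"].tolist())
print(len(dropped))
```

If you drop rows, remember to drop the same rows from the input features so that labels and examples stay aligned.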

Next, we will remove irrelevant or unhelpful features.

In [4]:
data = data.drop("price", axis=1)
data = data.drop("region_2", axis=1)
data = data.drop("winery", axis=1)
data = data.drop("designation", axis=1)

I removed each of these features for a specific reason. price needs to be removed because we will add it back as labels later. region_2 is similar enough to region_1 that it does not add useful information. winery and designation have too many unique values.

The number of unique values matters when we one-hot encode some of our features in the next section. If most of the strings in a feature are unique, we get very little useful or relevant information and only make the dimensions of our data unnecessarily large. This also applies to region_2. We do not want to double our dimensions with overlapping information.
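One way to spot columns with too many unique values is to compare the number of distinct values with the number of rows. The sketch below uses a made-up frame, and the 0.9 cutoff is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# Toy frame standing in for the wine data (values are made up)
toy = pd.DataFrame({
    "country": ["US", "US", "France", "Spain"],
    "winery": ["A", "B", "C", "D"],
})

# Fraction of rows that carry a distinct value, per column
unique_ratio = toy.nunique() / len(toy)
print(unique_ratio)

# Columns where almost every value is unique are poor one-hot candidates
too_unique = unique_ratio[unique_ratio > 0.9].index.tolist()
print(too_unique)
```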

Next, we will apply one-hot encoding to our remaining string features to extract relevant information from them.

In [None]:
# dtype=int keeps the encoded columns numeric (newer pandas versions default to bool)
data = pd.get_dummies(data, columns=["country", "province", "variety", "region_1"], dtype=int)

An explanation of one-hot encoding can be found here: https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science
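As a quick illustration of what the encoding produces, here is a minimal sketch on a made-up three-row frame. Each distinct country becomes its own 0/1 column, and the numeric column is left untouched.

```python
import pandas as pd

# Toy data (made up); only the string column is encoded
toy = pd.DataFrame({"points": [96, 95, 90], "country": ["US", "France", "US"]})

encoded = pd.get_dummies(toy, columns=["country"], dtype=int)
print(encoded)
```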

Next, we will separate the descriptions column and clean up the text data. One-hot encoding does not work well on columns with unique sentences or paragraphs because all of the items in the column will end up being unique.

In [None]:
text = data[["description"]].values
data = data.drop("description", axis=1)

# Flatten the (n, 1) array into a plain list of strings
text_list = [row[0] for row in text]

# For debugging
print(text_list[0])

stemmer = PorterStemmer()
# Build the stop word set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))

cleaned = []
for count, text in enumerate(text_list, start=1):
    text = text.lower()
    text = re.sub('[!@#$.,?]', '', text)
    words = text.split(" ")
    words = [word for word in words if word not in stop_words]
    words = [stemmer.stem(word) for word in words]
    # Store the result; reassigning the loop variable alone would discard it
    cleaned.append(" ".join(words))
    print("line " + str(count) + " processed")

text_list = cleaned

This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.
line 1 processed
line 2 processed
line 3 processed
...

First, we separate the text column into its own numpy array. Then, we convert the numpy array into a list. This puts our data into a universal list format for when we clean the data.

Next, we initialize variables we need and start looping over the individual text descriptions.

We start by making all of the text lower case and using a regular expression to strip punctuation and symbols. This ensures that the same word is counted once regardless of capitalization or trailing punctuation, so "Oak," and "oak" are treated as the same token.

Next, we split our description string into words by splitting on the spaces between words. This allows us to modify individual words in our text.

Then, we remove stop words from our text and stem all our words. Stop words are words such as "and", "the", and "of" which carry little meaning for our machine learning model. You can think of it as removing noise. However, you have to be careful. Removing stop words on a dataset with short phrases such as "On the blue river" and "Over the mountain" can cause a loss of important information.
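The risk with short phrases is easy to see in a small sketch. The stop word set below is a tiny hand-written stand-in for illustration only (the notebook itself uses NLTK's English list), and `remove_stop_words` is a hypothetical helper name.

```python
# A tiny hand-written stop word set for illustration only;
# the notebook uses NLTK's much larger English list.
STOP_WORDS = {"on", "the", "over", "and", "of"}

def remove_stop_words(text):
    """Lowercase the text and drop any word found in STOP_WORDS."""
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)

print(remove_stop_words("On the blue river"))
print(remove_stop_words("Over the mountain"))
```

After filtering, nothing distinguishes "on" from "over" any more, which may or may not matter for the problem at hand.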

When we stem the words, we strip suffixes such as "-ing", "-ed", and "-s" so that related forms of a word collapse to a common stem. We use the Porter stemmer, a widely used stemming algorithm for English. Stemming also tends to reduce the number of distinct words in our bag-of-words model later on, which can help lower the dimensionality of our input data. However, as with removing stop words, we need to be careful not to stem away useful information.

Finally, we join the modified words back together with spaces and keep track of our progress.

Next, we will vectorize our text descriptions and add our text features back into our input data.

In [None]:
text_array = np.array(text_list)
data_array = data.values

tfidf = TfidfVectorizer()
text_vect = tfidf.fit_transform(text_array)

print(data_array.shape)
print(text_vect.shape)

full_data = np.concatenate((text_vect.toarray(), data_array), axis=1)

print(full_data.shape)

First, we convert both the main dataframe and the cleaned text list into numpy arrays. Then, we use sklearn's TF-IDF vectorizer to vectorize our text data. Rather than simply counting the words in each description, it weights each word by its term frequency within a description, scaled down by how many descriptions contain that word. This keeps common words that appear almost everywhere from overshadowing rarer, more informative words.

Then, we combine our vectorized text data with our input data and observe the new shape of the data. The feature count may look very large compared to the number of examples. This is a side effect of cutting the number of examples from 150,000 to 25,000: the final feature count with all 150,000 examples is only around 31,000, which suggests that many words repeat across descriptions.

You can find more information about tf-idf term weighting here: http://scikit-learn.org/stable/modules/feature_extraction.html
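To see the weighting at work, here is a minimal sketch on three made-up descriptions. The word "wine" appears in every document, so its inverse document frequency is lower than that of "cherry", which appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration
docs = ["wine cherry oak", "wine plum", "wine citrus"]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)   # sparse matrix: one row per document

vocab = tfidf.vocabulary_            # maps each term to its column index
print(matrix.shape)
print(tfidf.idf_[vocab["wine"]], tfidf.idf_[vocab["cherry"]])
```

The result is a sparse matrix; calling `.toarray()` on it, as the notebook does, produces a dense array and is where most of the memory goes.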

Next, we will split our data into train and test data and concatenate the input data and labels together to fit EMADE's format.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(full_data, labels.values, test_size=0.33)

data_train = np.concatenate((X_train, y_train), axis=1)
data_test = np.concatenate((X_test, y_test), axis=1)

print(data_train.shape)
print(data_test.shape)

We split the data into training and test data, with 67% as training data and 33% as testing data. Then, we append the label column to the end of the input data columns to put the data into the format EMADE expects.
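The layout after concatenation can be checked on a toy array. This is a sketch with made-up data; `random_state` is an addition here so the split is reproducible, and is not part of the notebook's own call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 examples, 3 features, labels as a (10, 1) column like labels.values
X = np.arange(30, dtype=float).reshape(10, 3)
y = (np.arange(10) % 2).reshape(-1, 1)

# random_state is an assumption added for reproducibility
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

train_rows = np.concatenate((X_tr, y_tr), axis=1)
print(train_rows.shape)     # features plus one label column
print(train_rows[:, -1])    # the last column holds the labels
```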

Next, we will test out the performance of our dataset on a logistic regression classifier to get a rough benchmark. Logistic Regression works well on text datasets and any datasets with a lot of discrete and/or Boolean (1,0) values.

In [None]:
classifier = LogisticRegression()

classifier.fit(X_train, np.ravel(y_train))

predicted = classifier.predict(X_test)

print("Classification report for classifier %s:\n%s\n"
      % (classifier, metrics.classification_report(y_test, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(y_test, predicted))

Next, we will convert our train and test data back into pandas dataframes and export them in the proper format for EMADE.

In [None]:
train = pd.DataFrame(data_train)
test = pd.DataFrame(data_test)

divisor_train = math.ceil(len(train) / 5)
divisor_test = math.ceil(len(test) / 5)

count = 0
for g, df in train.groupby(np.arange(len(train)) // divisor_train):
    print(df.shape)

    np.savetxt("datasets/wine_data_set_train_%i.txt" % count, df.values, fmt='%.5f', delimiter=",")

    with open("datasets/wine_data_set_train_%i.txt" % count, "rb") as f_in, gzip.open("datasets/wine_data_set_train_%i.dat.gz" % count, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
        
    os.remove("datasets/wine_data_set_train_%i.txt" % count)

    count += 1
    
count = 0
for g, df in test.groupby(np.arange(len(test)) // divisor_test):
    print(df.shape)

    np.savetxt("datasets/wine_data_set_test_%i.txt" % count, df.values, fmt='%.5f', delimiter=",")

    with open("datasets/wine_data_set_test_%i.txt" % count, "rb") as f_in, gzip.open("datasets/wine_data_set_test_%i.dat.gz" % count, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
        
    os.remove("datasets/wine_data_set_test_%i.txt" % count)

    count += 1
    
small_train = train.drop(train.index[1675:])
small_test = test.drop(test.index[825:])

np.savetxt("datasets/wine_data_set_train_small.txt", small_train.values, fmt='%.5f', delimiter=",")

with open("datasets/wine_data_set_train_small.txt", "rb") as f_in, gzip.open("datasets/wine_data_set_train_small.dat.gz", "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)
    
os.remove("datasets/wine_data_set_train_small.txt")
              
# Write the small test split (not the small train split) to the test file
np.savetxt("datasets/wine_data_set_test_small.txt", small_test.values, fmt='%.5f', delimiter=",")

with open("datasets/wine_data_set_test_small.txt", "rb") as f_in, gzip.open("datasets/wine_data_set_test_small.dat.gz", "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)
    
os.remove("datasets/wine_data_set_test_small.txt")

We split the train and test sets into 5 chunks each, with every chunk holding 20% of its set. Then, we set aside the first 10% of each set as our small dataset in EMADE.

The code above already writes every chunk as a compressed .dat.gz file. The final step for preprocessing is to place these files in a folder under the datasets directory of EMADE, for example datasets/wine, which is the path used in the XML template below.

Other datasets are already implemented in EMADE in this format.
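The export step can also be written more compactly. The sketch below is an alternative, not the notebook's method: it relies on `np.savetxt` compressing automatically when the file name ends in `.gz`, and on `np.array_split`, which may produce slightly different chunk sizes than the `groupby` approach when the row count is not divisible by 5. The helper name `save_chunks` is hypothetical.

```python
import os
import tempfile
import numpy as np

def save_chunks(array, out_dir, prefix, n_chunks=5):
    """Split rows into n_chunks parts and write each as a gzipped CSV."""
    paths = []
    for i, chunk in enumerate(np.array_split(array, n_chunks)):
        path = os.path.join(out_dir, "%s_%i.dat.gz" % (prefix, i))
        # np.savetxt compresses automatically when the name ends in .gz
        np.savetxt(path, chunk, fmt="%.5f", delimiter=",")
        paths.append(path)
    return paths

# Toy usage in a temporary directory
demo = np.arange(40, dtype=float).reshape(10, 4)
with tempfile.TemporaryDirectory() as tmp:
    written = save_chunks(demo, tmp, "wine_data_set_train")
    restored = np.loadtxt(written[0], delimiter=",")
    print(len(written), restored.shape)
```

Reading a chunk back with `np.loadtxt` is a cheap way to confirm the file is valid before handing it to EMADE.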

## XML Formatting

Below is an example XML template for the dataset we preprocessed. This template is used to set up parameters for EMADE. You can change the objectives, crossover probability, and mutation probability from here.

Make sure to make an xml file and store it in the templates directory when you implement a new dataset into EMADE.

In [None]:
<?xml version="1.0"?>

<input>

    <datasets>
        <dataset>
            <name>SmallDataSet</name>
            <type>featuredata</type>
            <MonteCarlo>
                <trial>
                    <trainFilename>datasets/wine/wine_data_set_train_small.dat.gz</trainFilename>
                    <testFilename>datasets/wine/wine_data_set_test_small.dat.gz</testFilename>
                </trial>
            </MonteCarlo>
        </dataset>
        <dataset>
            <name>FullDataSet</name>
            <type>featuredata</type>
            <MonteCarlo>
                <trial>
                    <trainFilename>datasets/wine/wine_data_set_train_0.dat.gz</trainFilename>
                    <testFilename>datasets/wine/wine_data_set_test_0.dat.gz</testFilename>
                </trial>
                <trial>
                    <trainFilename>datasets/wine/wine_data_set_train_1.dat.gz</trainFilename>
                    <testFilename>datasets/wine/wine_data_set_test_1.dat.gz</testFilename>
                </trial>
                <trial>
                    <trainFilename>datasets/wine/wine_data_set_train_2.dat.gz</trainFilename>
                    <testFilename>datasets/wine/wine_data_set_test_2.dat.gz</testFilename>
                </trial>
                <trial>
                    <trainFilename>datasets/wine/wine_data_set_train_3.dat.gz</trainFilename>
                    <testFilename>datasets/wine/wine_data_set_test_3.dat.gz</testFilename>
                </trial>
                <trial>
                    <trainFilename>datasets/wine/wine_data_set_train_4.dat.gz</trainFilename>
                    <testFilename>datasets/wine/wine_data_set_test_4.dat.gz</testFilename>
                </trial>
            </MonteCarlo>
        </dataset>
    </datasets>

    <objectives>
        <objective>
            <name>False Positives</name>
            <weight>-1.0</weight>
            <achievable>4971.8</achievable>
            <goal>0</goal>
            <evaluationFunction>false_positive</evaluationFunction>
            <lower>0</lower>
            <upper>1</upper>
        </objective>
        <objective>
            <name>False Negatives</name>
            <weight>-1.0</weight>
            <achievable>1541.2</achievable>
            <goal>0</goal>
            <evaluationFunction>false_negative</evaluationFunction>
            <lower>0</lower>
            <upper>1</upper>
        </objective>
        <objective>
            <name>F1-Score</name>
            <weight>-1.0</weight>
            <achievable>0.2</achievable>
            <goal>0</goal>
            <evaluationFunction>f1_score_min</evaluationFunction>
            <lower>0</lower>
            <upper>1</upper>
        </objective>
        <objective>
            <name>Num Elements</name>
            <weight>-1.0</weight>
            <achievable>100.0</achievable>
            <goal>0</goal>
            <evaluationFunction>num_elements_eval_function</evaluationFunction>
            <lower>0</lower>
            <upper>1</upper>
        </objective>

    </objectives>

    <evaluation>
        <module>evalFunctions</module>
        <memoryLimit>30</memoryLimit> <!-- In Percent -->
    </evaluation>

    <scoopParameters>
        <host>
            <name>localhost</name>
            <workers>24</workers>
        </host>
        <!--<host>
            <name>localhost</name>
            <workers>3</workers>
        </host>
        <host>
            <name>localhost</name>
            <workers>3</workers>
        </host>
        <host>
            <name>localhost</name>
            <workers>3</workers>
        </host>
        <host>
            <name>localhost</name>
            <workers>3</workers>
        </host>
        <host>
            <name>localhost</name>
            <workers>3</workers>
        </host>-->
    </scoopParameters>

    <evolutionParameters>
        <initialPopulationSize>512</initialPopulationSize>
        <elitePoolSize>512</elitePoolSize>
        <launchSize>300</launchSize>
        <minQueueSize>200</minQueueSize>

        <matings>
            <mating>
                <name>crossover</name>
                <probability>0.50</probability>
            </mating>
            <mating>
                <name>crossoverEphemeral</name>
                <probability>0.50</probability>
            </mating>
            <mating>
                <name>headlessChicken</name>
                <probability>0.10</probability>
            </mating>
            <mating>
                <name>headlessChickenEphemeral</name>
                <probability>0.10</probability>
            </mating>
        </matings>

        <mutations>
            <mutation>
                <name>insert</name>
                <probability>0.05</probability>
            </mutation>
            <mutation>
                <name>insert modify</name>
                <probability>0.10</probability>
            </mutation>
            <mutation>
                <name>ephemeral</name>
                <probability>0.25</probability>
            </mutation>
            <mutation>
                <name>node replace</name>
                <probability>0.05</probability>
            </mutation>
            <mutation>
                <name>uniform</name>
                <probability>0.05</probability>
            </mutation>
            <mutation>
                <name>shrink</name>
                <probability>0.05</probability>
            </mutation>
        </mutations>

        <selections>
            <selection>
                <name>NSGA2</name>
            </selection>
        </selections>

    </evolutionParameters>

    <seedFile>

    </seedFile>

    <genePoolFitness>
        <prefix>genePoolFitnessWine</prefix>
    </genePoolFitness>
    <paretoFitness>
        <prefix>paretoFitnessWine</prefix>
    </paretoFitness>
    <parentsOutput>
        <prefix>parentsWine</prefix>
    </parentsOutput>



    <paretoOutput>
        <prefix>paretoFrontWine</prefix>
    </paretoOutput>
</input>
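Before launching a run, it can help to confirm that every file named in the template actually exists. The following is a minimal sketch using Python's standard library; the inline string is a shortened stand-in for the template above, and this check is not part of EMADE itself.

```python
import os
import xml.etree.ElementTree as ET

# Shortened stand-in for the template above (only the parts being checked)
template = """<?xml version="1.0"?>
<input>
    <datasets>
        <dataset>
            <name>SmallDataSet</name>
            <MonteCarlo>
                <trial>
                    <trainFilename>datasets/wine/wine_data_set_train_small.dat.gz</trainFilename>
                    <testFilename>datasets/wine/wine_data_set_test_small.dat.gz</testFilename>
                </trial>
            </MonteCarlo>
        </dataset>
    </datasets>
</input>"""

root = ET.fromstring(template)

# Collect every train and test path mentioned in the template
files = [el.text
         for tag in ("trainFilename", "testFilename")
         for el in root.iter(tag)]

# Report paths that do not exist relative to the current directory
missing = [path for path in files if not os.path.exists(path)]
print("referenced:", files)
print("missing:", missing)
```

For a template saved on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`, with `path` pointing at your own XML file.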