# Team Ecologists
### AutoML Data Converter
This notebook converts CSV data to the AutoML format used by Codalab challenges.


The input csv file should have the following structure:
<ul>
  <li> n+1 columns, where n is the number of features in the dataset
  <li> the last column should be the label/class of the example
  <li> the name of the last column should be <b>label</b>
</ul>
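
The structure described above can be checked before running the conversion. The following is a minimal sketch, not part of the original notebook: the helper name `check_label_column` is hypothetical, and it assumes the first row of the file is a header row.

```python
import csv
import io


def check_label_column(csv_text):
    """Return the feature names if the last header column is 'label'.

    Hypothetical helper: raises ValueError when the header does not
    match the n+1 column layout described above.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)  # First row is assumed to be the header
    if not header or len(header) < 2:
        raise ValueError("expected at least one feature column plus 'label'")
    if header[-1] != "label":
        raise ValueError("last column must be named 'label', got %r" % header[-1])
    return header[:-1]  # The n feature names


# Example with two features and a label column (values are made up)
sample = "f1,f2,label\n0.1,0.2,bee\n0.3,0.4,wasp\n"
print(check_label_column(sample))
```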

<br>

The following variables must be set to the required values before running the conversion:
<ul>
<li> <b>pathRes</b> .........: path of the directory where the csv file is located
<li> <b>fileName</b> ........: csv filename including the extension .csv ___ Example: data.csv
<li> <b>path</b> ................: [Do not change] complete path of the csv file
<li> <b>pathAuto</b> ........: [Do not change] path of directory where the converted AutoML files will be saved. [Default: same path as of the csv file]
<li> <b>pathPublic</b> ......: [Do not change] path of public_data 
<li> <b>pathSample</b> ....: [Do not change] path of sample_data
<li> <b>dataName</b> .......: name of the dataset to be created
<li> <b>chalName</b> .......: name of the challenge for which the data is being converted
<li> <b>taskName</b> ........: name of the task of the challenge ___ Example: Regression
<li> <b>targetType</b> .......: type of the label ___ Example: Numerical
<li> <b>featType</b> ...........: type of the features ___ Example: Numerical
<li> <b>metric</b> ...............: name of the performance metric ___ Example: accuracy
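<li> <b>percTest</b> ...........: fraction of the examples held out for the test set ___ Example: 0.1
<li> <b>percValid</b> ..........: fraction of the examples held out for the validation set ___ Example: 0.1
<li> <b>sampleSize</b> .......: number of examples in sample_data
<li> <b>hasMissing</b> ........: whether the dataset has missing values ('0' if false, '1' if true)
<li> <b>hasCategorical</b> .: whether the dataset has categorical data ('0' if false, '1' if true)
<li> <b>isSparse</b> ............: whether the dataset is sparse, i.e. contains a lot of zeros ('0' if false, '1' if true)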

</ul>
<hr>

**Imports**

In [1]:
import sys
import os
import shutil

import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

**Initialization of variables**

In [2]:
pathRes = 'Data/'
fileName = 'bees.csv'

path = pathRes + fileName
pathAuto = pathRes+'AutoML/'
pathPublic = pathAuto+'public_data/'
pathSample = pathAuto+'sample_data/'
dataName = 'bee'
chalName = 'beeVSwasp'
taskName = 'multiclass.classification'
targetType = 'Numerical'
featType = 'Numerical'
metric = 'balanced_accuracy_score'
percTest = 0.1
percValid = 0.1
sampleSize = 100
hasMissing = '0' # 0 if false, 1 if true
hasCategorical = '0' # 0 if false, 1 if true
isSparse = '0' # 0 if false, 1 if true
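
The derived paths above rely on `pathRes` ending with a slash. As an alternative sketch (not the notebook's own code; the function name `derive_paths` is hypothetical), `pathlib` can build the same locations without that assumption.

```python
from pathlib import Path


def derive_paths(path_res, file_name):
    """Build the derived locations from the two user-supplied values."""
    root = Path(path_res)
    auto = root / "AutoML"
    return {
        "path": root / file_name,            # Complete path of the CSV file
        "pathAuto": auto,                    # Root of the converted files
        "pathPublic": auto / "public_data",  # Full converted dataset
        "pathSample": auto / "sample_data",  # Small sample of the training set
    }


paths = derive_paths("Data", "bees.csv")
print(paths["pathPublic"].as_posix())
```

The later cells concatenate strings such as `pathPublic+dataName`, so adopting `Path` objects would also mean switching those cells to `/` joins.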

**Read CSV file**

In [3]:
print(r'--/!\-- Reading the CSV file --/!\--') # Raw string: '\-' is not a valid escape sequence
data = pd.read_csv(path)


--/!\-- Reading the CSV file --/!\--


**Separate features and labels**

In [4]:
print(r'--/!\-- Separating features and labels --/!\--') # Raw string: '\-' is not a valid escape sequence
X = data.loc[:, data.columns != 'label']
y = data['label']

--/!\-- Separating features and labels --/!\--


**Creating Directories**

In [5]:
try:
    shutil.rmtree(pathAuto) # Remove any output left over from a previous run
except OSError:
    print ("Deletion of the directory %s failed" % pathAuto)
else:
    print ("Successfully deleted the directory %s" % pathAuto)

try:
    os.mkdir(pathAuto)
except OSError:
    print ("Creation of the directory %s failed" % pathAuto)
else:
    print ("Successfully created the directory %s" % pathAuto)

try:
    os.mkdir(pathPublic)
except OSError:
    print ("Creation of the directory %s failed" % pathPublic)
else:
    print ("Successfully created the directory %s" % pathPublic)

try:
    os.mkdir(pathSample)
except OSError:
    print ("Creation of the directory %s failed" % pathSample)
else:
    print ("Successfully created the directory %s" % pathSample)

Deletion of the directory Data/AutoML/ failed
Successfully created the directory Data/AutoML/
Successfully created the directory Data/AutoML/public_data/
Successfully created the directory Data/AutoML/sample_data/
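
The delete-then-create sequence can be condensed. This is a minimal sketch, assuming it is acceptable to discard any previous output directory without reporting it; the helper name `reset_output_dirs` is hypothetical.

```python
import os
import shutil
import tempfile


def reset_output_dirs(path_auto, subdirs=("public_data", "sample_data")):
    """Remove path_auto if it exists, then recreate it with its subdirectories."""
    shutil.rmtree(path_auto, ignore_errors=True)  # No error if it does not exist yet
    created = []
    for name in subdirs:
        target = os.path.join(path_auto, name)
        os.makedirs(target)  # Also creates path_auto itself on the first pass
        created.append(target)
    return created


# Demonstration in a temporary directory so no real data is touched
with tempfile.TemporaryDirectory() as tmp:
    for d in reset_output_dirs(os.path.join(tmp, "AutoML")):
        print("created", os.path.basename(d))
```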


**Saving Features**

In [6]:
features = X.columns # Names of all features (the label column is already excluded from X)
with open(pathPublic+dataName+"_feat.name", "w") as f: # Create the file which contains feature names
    f.write('\n'.join(map(str, features))) # One name per line, no newline after the last one
print(dataName+"_feat.name is created")

bee_feat.name is created


**Encoding label values**

In [7]:
le = preprocessing.LabelEncoder()
le.fit(y.unique())
enc_y = le.transform(y.values)
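
`LabelEncoder` sorts the distinct label values and replaces each label with its index in that sorted list. The plain-Python sketch below only illustrates that mapping; it is not scikit-learn's implementation, and the function name `encode_labels` is hypothetical.

```python
def encode_labels(values):
    """Map each label to its position among the sorted distinct labels."""
    classes = sorted(set(values))  # Plays the role of le.classes_
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[v] for v in values]


classes, codes = encode_labels(["wasp", "bee", "bee", "other"])
print(classes)  # ['bee', 'other', 'wasp']
print(codes)    # [2, 0, 0, 1]
```

Because the classes are kept in sorted order, the line order of the label name file matches the integer codes stored in the `.solution` files.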

**Saving Label Names**

In [8]:
labels = le.classes_ # Distinct label values in the order used by the encoder
with open(pathPublic+dataName+"_label.name", "w") as f: # Create the file which contains label names
    f.write('\n'.join(map(str, labels))) # One name per line, no newline after the last one
print(dataName+"_label.name is created")

bee_label.name is created


**Creating Training, Validation and Test sets**

In [9]:
x_temp, x_test, y_temp, y_test = train_test_split(
    X, enc_y, test_size=percTest)


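# Validation size, expressed relative to the rows left after the test split
validSize = int(percValid*X.shape[0])/x_temp.shape[0]

if(validSize == 0):
    validSize = 1 # Keep at least one validation example

x_train, x_valid, y_train, y_valid = train_test_split(
    x_temp, y_temp, test_size=validSize)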
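
The second call to `train_test_split` only sees the rows left over from the first split, so a share expressed against the full dataset has to be rescaled. The sketch below isolates that arithmetic; the function name `second_split_size` is hypothetical and the row counts in the example are made up for illustration.

```python
def second_split_size(n_total, n_remaining, fraction_of_total):
    """Return the test_size to use on the remaining rows.

    The result is a float fraction of n_remaining, or the integer 1
    (a single example) when the requested share rounds down to zero rows.
    """
    wanted_rows = int(fraction_of_total * n_total)  # Rows wanted, counted on the full set
    size = wanted_rows / n_remaining                # Same rows as a share of what is left
    return 1 if size == 0 else size


# 1000 rows, 900 left after a 10% test split, 10% wanted for validation
print(second_split_size(1000, 900, 0.1))
# 5 rows, 4 left: 10% of 5 rounds down to 0 rows, so fall back to 1 example
print(second_split_size(5, 4, 0.1))
```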

**Saving train, valid and test data and solution**

In [12]:
x_train.to_csv(pathPublic+dataName+"_train.data", header=None, index=None, sep=' ', mode='a')
np.savetxt(pathPublic+dataName+"_train.solution", y_train, fmt='%d')
print(dataName+"_train.data and "+dataName+"_train.solution are created")

x_valid.to_csv(pathPublic+dataName+"_valid.data", header=None, index=None, sep=' ', mode='a')
np.savetxt(pathPublic+dataName+"_valid.solution", y_valid, fmt='%d')
print(dataName+"_valid.data and "+dataName+"_valid.solution are created")

x_test.to_csv(pathPublic+dataName+"_test.data", header=None, index=None, sep=' ', mode='a')
np.savetxt(pathPublic+dataName+"_test.solution", y_test, fmt='%d')
print(dataName+"_test.data and "+dataName+"_test.solution are created")

bee_train.data and bee_train.solution are created
bee_valid.data and bee_valid.solution are created
bee_test.data and bee_test.solution are created
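
Each `.data` file should hold one line per example and the matching `.solution` file one label per line. A small sanity check could compare the two; this is a sketch with hypothetical helper names (`count_lines`, `same_row_count`) and made-up file contents.

```python
import os
import tempfile


def count_lines(path):
    """Count the non-empty lines of a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def same_row_count(data_path, solution_path):
    """True when the .data and .solution files describe the same number of examples."""
    return count_lines(data_path) == count_lines(solution_path)


# Demonstration on two throwaway files
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "demo_train.data")
    solution_path = os.path.join(tmp, "demo_train.solution")
    with open(data_path, "w") as handle:
        handle.write("0.1 0.2\n0.3 0.4\n")
    with open(solution_path, "w") as handle:
        handle.write("0\n1\n")
    print(same_row_count(data_path, solution_path))
```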


**Saving Feature types**

In [13]:
typee = x_train.dtypes
typee.to_csv(pathPublic+dataName+"_feat.type", header=None, index=None, sep=' ', mode='a')
print(dataName+"_feat.type is created")

bee_feat.type is created
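
`x_train.dtypes` yields pandas dtype names such as `float64`, whereas the `feat_type` line written to the private info file further down lists the values `Numerical`, `Categorical` and `Binary`. If the challenge expects those names in the feature type file (an assumption; the source does not say), a mapping along these lines could be applied first. The function name `automl_feat_type` and the mapping rules are hypothetical.

```python
def automl_feat_type(dtype_name):
    """Translate a pandas dtype name into one of the three AutoML type names.

    Assumed rule of thumb: booleans are Binary, strings and categories are
    Categorical, and everything else is treated as Numerical.
    """
    name = str(dtype_name)
    if name == "bool":
        return "Binary"
    if name in ("object", "category", "string"):
        return "Categorical"
    return "Numerical"


for dtype in ("float64", "int64", "bool", "object"):
    print(dtype, "->", automl_feat_type(dtype))
```

With a DataFrame, this could be applied as `x_train.dtypes.map(automl_feat_type)` before the file is written.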


**Saving Public Info**

In [14]:
f = open(pathPublic+dataName+"_public.info", "w")
f.write('usage = '+chalName+'\n')
f.write('name = '+dataName+'\n')
f.write('task = '+taskName+'\n')
f.write('target_type = '+targetType+'\n')
f.write('feat_type = '+featType+'\n')
f.write('metric = '+metric+'\n')
f.write('feat_num = '+str(len(features))+'\n')
f.write('target_num = '+str(len(labels))+'\n')
f.write('label_num = '+str(len(labels))+'\n')
f.write('train_num = '+str(len(x_train))+'\n')
f.write('valid_num = '+str(len(x_valid))+'\n')
f.write('test_num = '+str(len(x_test))+'\n')
f.write('has_categorical = '+hasCategorical+'\n')
f.write('has_missing = '+hasMissing+'\n')
f.write('is_sparse = '+isSparse+'\n')
f.write('time_budget = 500')
f.close()
print(dataName+"_public.info is created")

bee_public.info is created
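
The `key = value` lines can also be generated from a dictionary, which makes it easier to keep the public and sample info files in step. This is a minimal sketch; `format_info` is a hypothetical helper, and the values shown are the ones set in the initialization cell.

```python
def format_info(fields):
    """Render a dict as 'key = value' lines with no trailing newline."""
    return "\n".join("%s = %s" % (key, value) for key, value in fields.items())


info = {
    "usage": "beeVSwasp",
    "name": "bee",
    "task": "multiclass.classification",
    "metric": "balanced_accuracy_score",
    "time_budget": 500,
}
print(format_info(info))
```

The resulting string can be written with a single `f.write(...)` call inside a `with open(...)` block.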


**Saving Private Info**

In [16]:
f = open(pathPublic+dataName+"_private.info", "w")
f.write('title = '+dataName+'\n')
f.write('keywords = image.classification\n')
f.write('authors = Grégoire Loïs, Colin FONTAINE, Jean-Francois Julien\n')
f.write('resource_url = https://www.mnhn.fr/fr\n')
f.write('contact_name = Grégoire Loïs, Colin FONTAINE, Jean-Francois Julien\n')
f.write('contact_url = gregoire.lois@mnhn.fr\n')
f.write('license = \n')
f.write('date_created = 30 Dec 2020\n')
f.write('past_usage = \n')
f.write('description = The data is property of MUSÉUM NATIONAL D’HISTOIRE NATURELLE. The data consists of 290,000 images including images of bees, wasps, butterflies and other insects.\n')
f.write('preparation = The data was processed to get uniform images of resolution 128x128. After preprocessing the data was divided into Train, Validation and Test sets. \n')
f.write('representation = OpenCv features extracted by the opencv algorithm from the image which represents different image properties\n')
f.write('real_feat_num = 2048\n')
f.write('probe_num = 0\n')
f.write('frac_probes = 0\n')
f.write("feat_type = { 'Numerical' 'Categorical' 'Binary' }\n")
f.write('feat_type_freq = [1 0 0]\n')
f.write("label_names = { 'bee' 'butterfly' 'insect' 'other' 'wasp' } \n")
f.write('train_label_freq = [0.30540744175263707 0.10869115236218517 0.3951133484575005 0.1501349583533425 0.04065309907433473 ]\n')
f.write('train_label_entropy = 1.3852419181750144\n')
f.write('train_sparsity = \n')
f.write('train_frac_missing = 0\n')
f.write('valid_label_freq = [0.2, 0.2, 0.2, 0.2, 0.2]\n')
f.write('valid_label_entropy = 1.6094379124341005\n')
f.write('valid_sparsity = \n')
f.write('valid_frac_missing = 0\n')
f.write('test_label_freq = [0.2, 0.2, 0.2, 0.2, 0.2]\n')
f.write('test_label_entropy = 1.6094379124341005\n')
f.write('test_sparsity = \n')
f.write('test_frac_missing = 0\n')
f.write('train_data_aspect_ratio = 0.008478365265197305')
f.close()
print(dataName+"_private.info is created")



bee_private.info is created
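
The label frequencies and entropies in the cell above are written as literals. They could instead be recomputed from the encoded labels. The sketch below uses the natural logarithm, which is consistent with the uniform five-class value 1.6094... (ln 5) used above; the function name `label_stats` is hypothetical.

```python
import math
from collections import Counter


def label_stats(encoded_labels, n_classes):
    """Return (frequencies, entropy in nats) for integer labels 0..n_classes-1."""
    counts = Counter(encoded_labels)
    total = len(encoded_labels)
    freqs = [counts.get(k, 0) / total for k in range(n_classes)]
    entropy = -sum(p * math.log(p) for p in freqs if p > 0)  # Classes with p = 0 contribute nothing
    return freqs, entropy


# Five equally frequent classes give an entropy of ln(5)
freqs, entropy = label_stats([0, 1, 2, 3, 4] * 20, 5)
print(freqs)
print(round(entropy, 4))  # 1.6094
```

In the notebook this could be called as `label_stats(list(y_train), len(labels))` before the training subset is reduced to the sample.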


**Creating Sample Data from Training set of public data**

In [17]:
x_train = x_train[:sampleSize]
y_train = y_train[:sampleSize]

x_temp, x_test, y_temp, y_test = train_test_split(
    x_train, y_train, test_size=percTest)


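# Validation size, expressed relative to the rows left after the test split
validSize = int(percValid*x_train.shape[0])/x_temp.shape[0]
if(validSize == 0):
    validSize = 1 # Keep at least one validation example

x_train, x_valid, y_train, y_valid = train_test_split(
    x_temp, y_temp, test_size=validSize)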

**Saving Sample Data**

In [18]:
x_train.to_csv(pathSample+dataName+"_train.data", header=None, index=None, sep=' ', mode='a')
np.savetxt(pathSample+dataName+"_train.solution", y_train, fmt='%d')
print(dataName+"_train.data and "+dataName+"_train.solution are created")

x_valid.to_csv(pathSample+dataName+"_valid.data", header=None, index=None, sep=' ', mode='a')
np.savetxt(pathSample+dataName+"_valid.solution", y_valid, fmt='%d')
print(dataName+"_valid.data and "+dataName+"_valid.solution are created")

x_test.to_csv(pathSample+dataName+"_test.data", header=None, index=None, sep=' ', mode='a')
np.savetxt(pathSample+dataName+"_test.solution", y_test, fmt='%d')
print(dataName+"_test.data and "+dataName+"_test.solution are created")

bee_train.data and bee_train.solution are created
bee_valid.data and bee_valid.solution are created
bee_test.data and bee_test.solution are created


**Saving Public Info for Sample Data**

In [19]:
f = open(pathSample+dataName+"_public.info", "w")
f.write('usage = '+chalName+'\n')
f.write('name = '+dataName+'\n')
f.write('task = '+taskName+'\n')
f.write('target_type = '+targetType+'\n')
f.write('feat_type = '+featType+'\n')
f.write('metric = '+metric+'\n')
f.write('feat_num = '+str(len(features))+'\n')
f.write('target_num = '+str(len(labels))+'\n')
f.write('label_num = '+str(len(labels))+'\n')
f.write('train_num = '+str(len(x_train))+'\n')
f.write('valid_num = '+str(len(x_valid))+'\n')
f.write('test_num = '+str(len(x_test))+'\n')
f.write('has_categorical = '+hasCategorical+'\n')
f.write('has_missing = '+hasMissing+'\n')
f.write('is_sparse = '+isSparse+'\n')
f.write('time_budget = 500')
f.close()
print(dataName+"_public.info is created")

bee_public.info is created


**Copying Private Info, Label Names, Feature Names and Feature Types to Sample Data**

In [20]:
shutil.copyfile(pathPublic+dataName+"_private.info", pathSample+dataName+"_private.info")
shutil.copyfile(pathPublic+dataName+"_label.name", pathSample+dataName+"_label.name")
shutil.copyfile(pathPublic+dataName+"_feat.name", pathSample+dataName+"_feat.name")
shutil.copyfile(pathPublic+dataName+"_feat.type", pathSample+dataName+"_feat.type")

'Data/AutoML/sample_data/bee_feat.type'