<a href="https://colab.research.google.com/github/ELehmann91/Thesis_Multilingual_Transferlearning/blob/master/Text_Annotation_Tool_BdF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<h1>What is Colaboratory?</h1>

Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with 
- Zero configuration required
- Free access to GPUs
- Easy sharing

Whether you're a **student**, a **data scientist** or an **AI researcher**, Colab can make your work easier. Watch [Introduction to Colab](https://www.youtube.com/watch?v=inN8seMm7UI) to learn more, or just get started below!

# ECOICOP Annotation
  
Use this tool to annotate your product texts with the corresponding ECOICOP category.


## Get code from GitHub
  
First, we download the code from GitHub into this Colab session.

In [1]:
%%capture
!pip install eli5
!git clone 'https://github.com/ELehmann91/Thesis_Multilingual_Transferlearning'
 
%cd Thesis_Multilingual_Transferlearning
import labeler_cc5
import coicop_model
import pandas as pd
import numpy as np
from tqdm import tqdm
import io

## Upload CSV / Excel from Local Drive

To upload from your local drive, start with the following code:

In [None]:
from google.colab import files
uploaded = files.upload()

Saving eantoclassifyECB.csv to eantoclassifyECB (1).csv


The cell prompts you to select a file. Click “Choose Files”, select the file, and wait until the upload reaches 100%. Colab shows the name of the file once the upload is complete.
Then run the following code to import the file into a dataframe (make sure the filename matches the name of the uploaded file).

In [None]:
# Alternative: read a Stata file (.dta) from the working directory
df = pd.read_stata('table_BCE_finale.dta')

In [None]:
# Read the uploaded CSV from its in-memory bytes
df = pd.read_csv(io.BytesIO(uploaded['eantoclassifyECB.csv']),sep=',')
# For an Excel file instead (read_excel takes no encoding argument in current pandas):
#df = pd.read_excel(io.BytesIO(uploaded['carrfour_trans_pred.xlsx']))
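
If the hard-coded key does not match the name Colab stored (for example after a repeated upload), the lookup in `uploaded` fails. A minimal sketch that reads whichever file was uploaded first, assuming `uploaded` maps file names to raw bytes as returned by `files.upload()`. The helper `load_uploaded` and the stand-in dictionary `uploaded_demo` are hypothetical and only used for illustration:

```python
import io

import pandas as pd


def load_uploaded(uploaded, sep=","):
    """Return (file name, DataFrame) for the first entry of an upload dict."""
    name = next(iter(uploaded))  # first uploaded file name
    frame = pd.read_csv(io.BytesIO(uploaded[name]), sep=sep)
    return name, frame


# Stand-in for the dict returned by files.upload(); contents are made up.
uploaded_demo = {"products.csv": b"ean,title\n1,apple juice\n2,rye bread\n"}

name, df_demo = load_uploaded(uploaded_demo)
print(name, df_demo.shape)
```

In the notebook you would call `load_uploaded(uploaded)` with the real upload dictionary instead of the stand-in.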

In [None]:
# Dataset is now stored in a Pandas Dataframe
#df['productCategory'] = df['productCategory'].apply(lambda x: str(x).replace('suesses','süßes').replace('ue','ü').replace('ae','ä').replace('oe','ö'))
print(df.shape)
df[:2]

(132639, 3)


Unnamed: 0,ean,title_accent,id
0,30080751,"coolwave bonbons sans sucres avec édulcorants,...",1811805
1,30080768,"cool wave fraise, sans sucres, la boite de 28g",1811806


## Upload from Google Drive

Load CSV files stored in your Google Drive.

In [2]:
from google.colab import drive, files
 
drive.mount('/content/gdrive')
path ='/content/gdrive/My Drive/Thesis_ecb_ecoicop'

Mounted at /content/gdrive


The commands take you to a Google authentication step. You should see a screen saying that Google Drive File Stream wants to access your Google Account. After you grant permission, copy the verification code and paste it into the box in Colab.
In the notebook, click the charcoal > at the top left and then click Files. Locate the data folder you created earlier and find your data. Right-click the file and select Copy Path. Store the copied path in a variable and you are ready to go.

Specify the exact path to your file and its separator to read it in.
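
As a small sketch of how the full file path can be assembled, the snippet below joins the Drive folder, the data sub-folder, and the file name with `pathlib` instead of string concatenation. The folder and file names are the ones used in this notebook; building the path this way is only a suggestion, not what the notebook cells do:

```python
from pathlib import PurePosixPath

# Folder mounted from Google Drive and the file chosen in the form.
path = "/content/gdrive/My Drive/Thesis_ecb_ecoicop"
file_path = "auchan.csv"

# Joining path parts avoids missing or doubled slashes.
full_path = PurePosixPath(path) / "data" / "bdf" / file_path
print(full_path)
```

The resulting string can be passed to `pd.read_csv(str(full_path), sep='|')`.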

In [14]:
#@title Which data do you want to use?
file_path = "auchan.csv" #@param ["auchan.csv", "ecb_data.csv", "carrefour.csv", "banque_de_france.csv"] {allow-input: true}

In [15]:
import pandas as pd
data_path = '/data/bdf/'
df = pd.read_csv(path+data_path+file_path,sep='|',index_col=False)
print(file_path,'loaded',len(df),'observation')

auchan.csv loaded 25511 observation


Print out the dataframe to see if the import is correct.

# Predict CoiCop Level 4
  
How does it work? - Word Embeddings (left)
* Words are translated into vectors, which are learned representations
* Embeddings form a multidimensional space (often 300+ dimensions), and words that relate to each other lie closer together in this space
* These spaces can be aligned across languages
* The example shows a three-dimensional space with embeddings for three different languages; translations are ideally close neighbours, and the distance between milk and cheese is smaller than the distance between milk and potato  
more information: https://towardsdatascience.com/introduction-to-word-embeddings-4cf857b12edc
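
To make the idea of "closeness" concrete, here is a minimal sketch using cosine similarity on toy vectors. The three-dimensional vectors are invented for illustration (real embeddings are learned and have far more dimensions), and cosine similarity is one common choice of measure, not necessarily the one used inside this tool:

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up 3-d "embeddings"; "lait" stands for the French word for milk
# in a space that has been aligned with the English one.
embeddings = {
    "milk": np.array([0.90, 0.80, 0.10]),
    "cheese": np.array([0.80, 0.90, 0.20]),
    "potato": np.array([0.10, 0.20, 0.90]),
    "lait": np.array([0.88, 0.82, 0.12]),
}

sim_milk_lait = cosine_similarity(embeddings["milk"], embeddings["lait"])
sim_milk_cheese = cosine_similarity(embeddings["milk"], embeddings["cheese"])
sim_milk_potato = cosine_similarity(embeddings["milk"], embeddings["potato"])

print(sim_milk_lait, sim_milk_cheese, sim_milk_potato)
```

With these toy values the translation is the closest neighbour, cheese comes next, and potato is clearly further away.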
  
How does it work? - Recurrent Neural Network (right)
* Every word is now represented by a vector, so the input to the model is a sequence of vectors
* LSTM (Long Short-Term Memory) networks are good at solving sequential tasks because they can remember information from previous states (words) and output a representation of the whole sentence
* This representation is used to classify the text into the CoiCop categories  
more information: https://towardsdatascience.com/a-beginners-guide-on-sentiment-analysis-with-rnn-9e100627c02e
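
The following sketch shows the mechanics described above in plain NumPy: an LSTM reads a sequence of word vectors, its final hidden state serves as the sentence representation, and a softmax layer turns that into class probabilities. All weights are random and untrained, and the sizes are arbitrary. It illustrates the general technique, not the architecture or weights of `coicop_model`:

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(logits):
    shifted = np.exp(logits - logits.max())  # subtract max for stability
    return shifted / shifted.sum()


def lstm_encode(sequence, W, U, b):
    """Run one LSTM layer over (timesteps, emb_dim); return the last hidden state."""
    hidden_dim = U.shape[0]
    h = np.zeros(hidden_dim)  # hidden state (output per step)
    c = np.zeros(hidden_dim)  # cell state (the "memory")
    for x in sequence:
        z = x @ W + h @ U + b
        i, f, g, o = np.split(z, 4)      # input, forget, candidate, output
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)       # keep part of the old memory, add new
        h = o * np.tanh(c)
    return h


rng = np.random.default_rng(0)
emb_dim, hidden_dim, n_classes = 4, 5, 3  # arbitrary toy sizes

W = rng.normal(scale=0.1, size=(emb_dim, 4 * hidden_dim))
U = rng.normal(scale=0.1, size=(hidden_dim, 4 * hidden_dim))
b = np.zeros(4 * hidden_dim)
W_out = rng.normal(scale=0.1, size=(hidden_dim, n_classes))

sentence = rng.normal(size=(6, emb_dim))   # six "words", one vector each
sentence_vec = lstm_encode(sentence, W, U, b)
probs = softmax(sentence_vec @ W_out)      # one probability per class
print(probs)
```

In a trained model the weights are learned so that the highest probability falls on the correct category.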




![embed](https://drive.google.com/uc?id=1AoleK5q47ZTkPCtvxDD6icEpcnLc9viE)

For the prediction you have to specify the names of certain columns in your dataframe which the tool will use.  
* name_col is the column with the product name (mandatory)
* category_col is the column with the category given by the supermarket (optional)
* url_col is the column with the URL of the product (optional)
* lang is the language of the texts; 'de' and 'fr' are supported (mandatory)
* label_cat5 is the column with the labels, if the products are already labeled (optional)
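
Before initialising the predictor it can help to check these settings against the dataframe. The sketch below uses a hypothetical helper, `check_columns`, which is not part of `coicop_model`; it only encodes the rules from the list above (mandatory name column, supported languages, optional columns):

```python
import pandas as pd


def check_columns(df, name_col, lang, optional_cols=()):
    """Validate the mandatory settings; return optional columns that are missing."""
    if name_col not in df.columns:
        raise KeyError(f"mandatory column {name_col!r} not found")
    if lang not in ("de", "fr"):
        raise ValueError(f"unsupported language {lang!r}, expected 'de' or 'fr'")
    return [c for c in optional_cols if c is not None and c not in df.columns]


# Toy dataframe with a name and a category column but no URL column.
toy = pd.DataFrame({"name": ["Rôti filet de porc 700g"], "categ": ["Boucherie"]})

missing = check_columns(toy, name_col="name", lang="fr",
                        optional_cols=("categ", "url", "cc5"))
print("missing optional columns:", missing)
```

Missing optional columns are only reported, since the tool can run without them.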



In [16]:
df[:1]

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,lang,name,categ,prod_desc,text_other,url,words_from_url,unit,cc3,cc4,cc5,cc3_pred,cc4_pred,cc5_pred,shop,brand,price,id,labeld_by,url_text,text,labeled_by,sort_columns,1111_Rice,1112_Flours and other cereals,1113_Bread,1114_Other bakery products,1115_Pizza and quiche,1116_Pasta products and couscous,1117_Breakfast cereals,1118_Other cereal products,1121_Beef and veal,1122_Pork,1123_Lamb and goat,1124_Poultry,1125_Other meats,...,1163_Dried fruit and nuts,1164_Preserved fruit and fruit-based products,1171_Fresh or chilled vegetables other than potatoes and other tubers,1172_Frozen vegetables other than potatoes and other tubers,"1173_Dried vegetables, other preserved or processed vegetables",1174_Potatoes,1175_Crisps,1176_Other tubers and products of tuber vegetables,1181_Sugar,"1182_Jams, marmalades and honey",1183_Chocolate,1184_Confectionery products,1185_Edible ices and ice cream,1186_Artificial sugar substitutes,"1191_Sauces, condiments","1192_Salt, spices and culinary herbs",1193_Baby food,1194_Ready-made meals,1199_Other food products n.e.c.,1211_Coffee,1212_Tea,1213_Cocoa and powdered chocolate,1221_Mineral or spring waters,1222_Soft drinks,1223_Fruit and vegetable juices,2111_Spirits and liqueurs,2112_Alcoholic soft drinks,2121_Wine from grapes,2122_Wine from other fruits,2123_Fortified wines,2124_Wine-based drinks,2131_Lager beer,2132_Other alcoholic beer,2133_Low and non-alcoholic beer,2134_Beer-based drinks,2201_Cigarettes,2202_Cigars,2203_Other tobacco products,9999_Non-Food,max_score
0,0,22406,2555,27915,fr,Rôti filet de porc 700g,"Accueil Boucherie, volaille, poissonnerie Bouc...",Rôti de viande de porc ficelé. Cette viande es...,,https://www.auchan.fr/roti-filet-de-porc-700g/...,roti filet porc,,11_Food,112_Meat,1122_Pork,11_Food,112_Meat,1122_Pork,Auchan,,unknown,[24188],,roti filet porc,"Rôti filet de porc 700gAccueil Boucherie, vola...",Lukas,44,1.356779e-10,6.192233e-15,4.7179e-10,1.079738e-10,8.015436e-12,2.101298e-12,2.295424e-15,1.498002e-17,0.000692,0.997314,8e-06,0.001153,2e-05,...,3.73855e-25,2.804867e-24,8.589847e-16,2.13656e-16,6.284986e-11,6.745654e-13,1.000561e-15,6.201404e-15,3.120115e-17,7.741413e-17,7.948463e-16,5.204252e-21,3.5978199999999997e-20,1.067667e-22,2.292552e-09,4.433889e-15,8.144385e-14,2.501014e-08,4.582998e-09,3.660202e-18,1.552436e-26,2.5015730000000003e-23,1.4535399999999998e-20,1.5937990000000001e-25,1.188169e-14,4.136259e-16,8.050178e-21,3.981738e-14,1.1905040000000001e-17,2.182963e-20,9.761882e-21,3.286936e-13,6.027602e-14,1.807706e-18,2.622885e-16,2.9579289999999997e-20,2.205907e-21,4.400288e-16,1.940416e-09,0.997314


In [17]:
# init predictor
  
name_col = 'name'
category_col = 'categ'
url_col = 'url'
lang = 'fr'
label_cat5 = 'cc5' #'coicop4_str'

# build the predictor from the coicop_model module
CoiCop_Predictor = coicop_model.predictor(df
                                          , name_col
                                          , category_col
                                          , url_col
                                          , label_cat5
                                          , lang)


using name, category and words in url as input
using french embeddings
95% quantile no. of words per row is 35 (trained on 39)


In [18]:
df_probs = CoiCop_Predictor.predict_proba()

100%|██████████| 255/255 [00:32<00:00,  7.77it/s]


If products are already labeled, you can test how consistent the labels are with the predictions.
The label column, label_cat5, is specified two cells above.

In [20]:
if label_cat5 is not None:
    CoiCop_Predictor.test_performance()
    #CoiCop_Predictor.confusion_matrix()

number of observation (labeled / all): 189 / 25511 consistency  98.94 %
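
The reported figure matches the share of labeled rows whose prediction equals the label (187 of 189 would give 98.94 %). That reading is an assumption about what `test_performance` computes. A minimal sketch of such a consistency check on invented toy data, using the column names `cc5` and `cc5_pred` from this notebook:

```python
import pandas as pd

# Toy data: three labeled rows, two unlabeled rows (values are made up).
toy = pd.DataFrame({
    "cc5": ["1122_Pork", None, "1113_Bread", "1211_Coffee", None],
    "cc5_pred": ["1122_Pork", "1212_Tea", "1113_Bread", "1212_Tea", "1174_Potatoes"],
})

labeled = toy[toy["cc5"].notna()]            # only rows that carry a label
consistency = (labeled["cc5"] == labeled["cc5_pred"]).mean() * 100

print(f"labeled / all: {len(labeled)} / {len(toy)}, consistency {consistency:.2f} %")
```

Unlabeled rows are excluded, so the percentage refers only to products that already have a label.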


## Explanation  
  
It shows the three best candidates for a product and highlights the words which contribute to the estimate (positive: green, negative: red).
If you leave the text empty (None), it will show a random product. If you want the explanation for a particular product, you can enter its text.  
If you enter text, it always needs to be in quotation marks ''.

In [21]:
text = None
categ= '1134_Frozen seafood'
#text = ''

CoiCop_Predictor.tell_me_why(text,categ)

prediction 1134_Frozen seafood
label nan


Contribution?,Feature
3.779,Highlighted in text (sum)
-4.264,<BIAS>

Contribution?,Feature
5.109,Highlighted in text (sum)
-6.273,<BIAS>

Contribution?,Feature
2.697,Highlighted in text (sum)
-3.903,<BIAS>
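
The "three best candidates" are the categories with the highest predicted probability. A small sketch of how they can be read from one row of class probabilities with pandas; the probability values are invented, and the category names are taken from the columns shown in this notebook:

```python
import pandas as pd

# One row of predicted probabilities (made-up values).
probs = pd.Series({
    "1131_Fresh or chilled fish": 0.20,
    "1134_Frozen seafood": 0.70,
    "1122_Pork": 0.07,
    "1113_Bread": 0.03,
})

top3 = probs.nlargest(3)  # three highest-scoring categories, best first
print(top3)
```

Applied to a row of `df_probs` restricted to the category columns, the same call would list the candidates that the explanation compares.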


# Annotation

For the annotation tool you have to specify the names of certain columns in your dataframe which the tool will use.  
* Labeled_by takes your name and stores it with every item you label.  
* Text column 1 is the name of the column for the first line of text, which is displayed to help you label the product; usually this is the product name.  
* Text column 2 is the name of the column for the second line of text, which is displayed to help you label the product; this could be the category or the translation.  
* The URL column is the name of the column that holds the URL of the scraped product (if it is in the dataframe). The URL can be helpful if you are unsure about the product and want to look up the whole page on the website. If you do not have the URL in your data, you can leave it empty.  
* CoiCop 5 prediction is the column where the prediction is stored. The prediction will appear preselected in the dropdown menu.   
  
The data are sorted so that categories with few labels come first, provided predictions for those categories are available.
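
One way such an ordering could be produced is sketched below: count the existing labels per category, attach that count to each unlabeled product via its predicted category, and sort ascending. This is an assumption about the idea, not the implementation in `labeler_cc5`; the helper column `n_labels_in_pred` and the toy data are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "cc5": ["1122_Pork", "1122_Pork", None, None, None],
    "cc5_pred": ["1122_Pork", "1122_Pork", "1122_Pork", "1113_Bread", "1211_Coffee"],
})

# Number of existing labels per category (missing labels are ignored).
label_counts = toy["cc5"].value_counts()

# For every row: how many labels does its predicted category already have?
toy["n_labels_in_pred"] = toy["cc5_pred"].map(label_counts).fillna(0).astype(int)

# Unlabeled products, rarest predicted categories first (stable sort).
queue = toy[toy["cc5"].isna()].sort_values("n_labels_in_pred", kind="mergesort")
print(queue[["name", "cc5_pred", "n_labels_in_pred"]])
```

Products predicted as bread or coffee, which have no labels yet, are queued before the additional pork product.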


In [22]:
# init labeler

labeled_by = 'Erik'
text_column_1 = 'name'
text_column_2 = 'categ'
url_column = 'url'
CoiCop_5_pred_col = 'cc5_pred'
Use_probabilities = True

#labeler_cc5
CoiCop_Labeler = labeler_cc5.labeler( labeled_by
                        , df_probs
                        , text_column_1
                        , text_column_2
                        , url_column
                        , CoiCop_5_pred_col
                        , Use_probabilities)

Now you are ready to label. After executing the next cell you will see:
* Select the category you want to label
* Text 1
* Text 2
* URL
* A dropdown preselected with the category predicted for the product. If the prediction is wrong, click on the dropdown and select the right category.
* The Next button skips this product
* The Save button writes the selected category to your dataframe
  
When you have finished labeling, continue with the next cell.

In [23]:
CoiCop_Labeler.start_to_label()

VBox(children=(Box(children=(Dropdown(description='Select category to label:', layout=Layout(height='60px', wi…

The next cell prints your current annotation progress.

In [24]:
CoiCop_Labeler.get_stats()

new labels: 2
in total 191 of 25511 labeled ( 1.0 %)


Output the data including annotation.

In [25]:
df = CoiCop_Labeler.output_labels()
df[:1]

Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,lang,name,categ,prod_desc,text_other,url,words_from_url,unit,cc3,cc4,cc5,cc3_pred,cc4_pred,cc5_pred,shop,brand,price,id,labeld_by,url_text,text,labeled_by,sort_columns,1111_Rice,1112_Flours and other cereals,1113_Bread,1114_Other bakery products,1115_Pizza and quiche,1116_Pasta products and couscous,1117_Breakfast cereals,1118_Other cereal products,1121_Beef and veal,1122_Pork,1123_Lamb and goat,1124_Poultry,1125_Other meats,1126_Edible offal,...,1163_Dried fruit and nuts,1164_Preserved fruit and fruit-based products,1171_Fresh or chilled vegetables other than potatoes and other tubers,1172_Frozen vegetables other than potatoes and other tubers,"1173_Dried vegetables, other preserved or processed vegetables",1174_Potatoes,1175_Crisps,1176_Other tubers and products of tuber vegetables,1181_Sugar,"1182_Jams, marmalades and honey",1183_Chocolate,1184_Confectionery products,1185_Edible ices and ice cream,1186_Artificial sugar substitutes,"1191_Sauces, condiments","1192_Salt, spices and culinary herbs",1193_Baby food,1194_Ready-made meals,1199_Other food products n.e.c.,1211_Coffee,1212_Tea,1213_Cocoa and powdered chocolate,1221_Mineral or spring waters,1222_Soft drinks,1223_Fruit and vegetable juices,2111_Spirits and liqueurs,2112_Alcoholic soft drinks,2121_Wine from grapes,2122_Wine from other fruits,2123_Fortified wines,2124_Wine-based drinks,2131_Lager beer,2132_Other alcoholic beer,2133_Low and non-alcoholic beer,2134_Beer-based drinks,2201_Cigarettes,2202_Cigars,2203_Other tobacco products,9999_Non-Food,max_score
4543,21023,964,10791,fr,Pavés de saumon d'Ecosse filière responsable x...,"Accueil Boucherie, volaille, poissonnerie Pois...",4 pavés de saumon avec peau et sans arêtes. Ma...,,https://www.auchan.fr/paves-de-saumon-d-ecosse...,paves saumon ecosse filiere responsable,,,,,11_Food,113_Fish and seafood,1131_Fresh or chilled fish,Auchan,,13.99,862017,,paves saumon ecosse filiere responsable,Pavés de saumon d'Ecosse filière responsable x...,,56,6.31551e-12,1.096988e-11,2.024923e-10,2.799689e-10,6.841676e-13,1.827027e-09,1.007096e-13,6.922135e-14,2.118212e-07,3.766295e-09,7.13018e-09,5.511939e-10,3.413446e-09,6.249634e-10,...,8.848409e-15,1.19781e-11,1.150949e-09,8.268149e-07,3.087571e-08,8.744443e-10,1.851336e-13,4.086762e-07,1.055068e-14,6.467982e-15,2.691449e-13,1.378857e-16,4.122532e-13,3.571318e-15,2.904277e-10,6.157712e-09,1.890226e-08,9.670295e-08,1.511618e-10,2.201905e-15,2.944557e-12,2.369656e-14,2.118141e-07,5.688365e-12,7.780102e-12,1.159642e-09,1.067876e-13,1.433147e-09,1.557186e-10,8.798209e-11,1.36849e-12,4.395232e-09,1.329559e-09,1.662149e-14,5.573373e-13,3.31035e-13,1.734929e-12,6.127578e-09,2.252896e-08,0.771309


## Save data to Local Drive

In [None]:
from google.colab import files

df.to_csv('df_out.csv',sep='|')
files.download('df_out.csv')

## Save data to Google Drive

In [26]:
df.to_csv(path+data_path+file_path,sep='|')
print(file_path,'saved',len(df),'observation',df.cc5.notna().sum(),'labeled')

auchan.csv saved 25511 observation 191 labeled
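
The `Unnamed: 0`, `Unnamed: 0.1`, ... columns in the outputs above suggest that the dataframe index was written to the file on earlier saves and read back as a data column. The following sketch shows the effect on a toy dataframe with in-memory buffers, and how `index=False` avoids it. Whether you want to drop the index in your own files is a choice for you to make:

```python
import io

import pandas as pd

toy = pd.DataFrame({"name": ["Rôti filet de porc 700g"], "cc5": ["1122_Pork"]})

# Default: the index is written as an unnamed first column.
buf = io.StringIO()
toy.to_csv(buf, sep="|")
buf.seek(0)
with_index = pd.read_csv(buf, sep="|")

# index=False: only the data columns are written.
buf = io.StringIO()
toy.to_csv(buf, sep="|", index=False)
buf.seek(0)
without_index = pd.read_csv(buf, sep="|")

print(list(with_index.columns))
print(list(without_index.columns))
```

Each save and reload without `index=False` adds one more such column, which is why several of them can pile up over time.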
