<a href="https://colab.research.google.com/github/ELehmann91/Thesis_Multilingual_Transferlearning/blob/master/Text_Annotation_Tool_AT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# ECOICOP Annotation
  
Use this tool to annotate your product texts with the corresponding ECOICOP category.


## Get code from GitHub
  
First, download the code from GitHub into this Colab session.

In [0]:
%%capture
!pip install eli5
!git clone 'https://github.com/ELehmann91/Thesis_Multilingual_Transferlearning'

%cd Thesis_Multilingual_Transferlearning
import labeler_cc5
import coicop_model
import pandas as pd
import numpy as np
from tqdm import tqdm
import io

## Upload from Google Drive

Load CSV files stored in your Google Drive.

In [2]:
from google.colab import drive, files

drive.mount('/content/gdrive')
path ='/content/gdrive/My Drive/Thesis_ecb_ecoicop'

Mounted at /content/gdrive


Running the cell above takes you to a Google authentication step. You should see a screen saying "Google Drive File Stream wants to access your Google Account". After you grant permission, copy the verification code and paste it into the box in Colab.

In the notebook, click the charcoal `>` at the top left and then click Files. Locate the data folder you created earlier and find your data. Right-click the file and select Copy Path. Store the copied path in a variable and you are ready to go.

Specify the exact path to your file and its separator so that it can be read in.
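If you are unsure which separator a file uses, the standard library can make a guess from a small text sample. The sketch below is only an illustration: `guess_separator` and the sample text are hypothetical and not part of the tool.

```python
import csv

def guess_separator(sample_text, candidates="|,;\t"):
    """Return the most likely column separator found in a text sample."""
    dialect = csv.Sniffer().sniff(sample_text, delimiters=candidates)
    return dialect.delimiter

# Hypothetical sample; for a real file, read the first few kilobytes instead
sample = "name|categ|price\nmilk|dairy|1.99\nbread|bakery|2.49\n"
separator = guess_separator(sample)
print(repr(separator))
```

The detected value can then be passed as `sep` to `pd.read_csv`. Sniffing is a heuristic, so it is still worth checking the first rows of the dataframe.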

In [3]:
import pandas as pd

data_path = '/data/'
file_path = 'at/norm_at.csv'
df = pd.read_csv(path+data_path+file_path,sep='|',index_col=False)
df[:2]

Unnamed: 0.1,Unnamed: 0,lang,name,categ,prod_desc,text_other,url,words_from_url,unit,cc3,cc4,cc5,cc3_pred,cc4_pred,cc5_pred,shop,brand,price,id,url_text,text,labeled_by,sort_columns
0,0,de,Kelly's Linsenchips mit Meersalz,suesses und salziges chips und co knabbereien,Die neue Generation Chips!Kelly's LinsenCHIPS ...,,https://www.billa.at/produkte/kellys-linsenchi...,produkte kellys linsenchips mit meersalz,,11_Food,117_Vegetables,1176_Other tubers and products of tuber vegeta...,11_Food,117_Vegetables,1176_Other tubers and products of tuber vegeta...,Billa,,1.99,00-245459,produkte kellys linsenchips mit meersalz,Kelly's Linsenchips mit Meersalzsuesses und sa...,Erik,68
1,1,de,SPAR Veggie Veganes Erdnuss-Eis,Tiefkühlung Mehlspeisen & Eis Familienpack...,- Eis auf Hafer-BasisVeganes Hafer-Eis ist fre...,,https://www.interspar.at/shop/lebensmittel/spa...,shop lebensmittel spar veggie veganes erdnuss eis,,11_Food,"118_Sugar, jam, honey, chocolate and confectio...",1185_Edible ices and ice cream,11_Food,117_Vegetables,1176_Other tubers and products of tuber vegeta...,Spar,,3.59,2020001019731,shop lebensmittel spar veggie veganes erdnuss eis,SPAR Veggie Veganes Erdnuss-EisTiefkühlung M...,Erik,68


The first two rows of the dataframe are printed to check that the import worked.

## Predict CoiCop Level 4
  
How does it work? - Word Embeddings (left)
* Words are translated into vectors that are learned to represent the words
* Embeddings form a multidimensional space (often 300+ dimensions), and words that relate to each other are closer together in this space
* These spaces can be aligned across different languages
* The example shows a three-dimensional space with embeddings for three different languages; the translations are ideally close neighbours, and the distance between milk and cheese is smaller than the distance between milk and potato  
more information: https://towardsdatascience.com/introduction-to-word-embeddings-4cf857b12edc
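
The notion of "closeness" between embeddings is commonly measured with cosine similarity. The following sketch uses made-up three-dimensional vectors (real embeddings have hundreds of dimensions, and these numbers are not taken from the model) to show a translation pair scoring highest and an unrelated word scoring lowest.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings in an aligned English/German space
embeddings = {
    "milk":   [0.90, 0.80, 0.10],
    "Milch":  [0.88, 0.82, 0.12],
    "cheese": [0.80, 0.60, 0.30],
    "potato": [0.10, 0.30, 0.90],
}

for word in ("Milch", "cheese", "potato"):
    similarity = cosine_similarity(embeddings["milk"], embeddings[word])
    print(f"milk vs {word}: {similarity:.3f}")
```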
  
How does it work? - Recurrent Neural Network (right)
* Now every word is represented by a vector, and the input to the model is a sequence of vectors
* LSTM (Long Short-Term Memory) networks are good at solving sequential tasks because they can remember information from previous states (words) and output a representation of the sentence
* This representation is used to classify the text into the CoiCop categories  
more information: https://towardsdatascience.com/a-beginners-guide-on-sentiment-analysis-with-rnn-9e100627c02e
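
To make the idea of "a sequence of vectors in, one class distribution out" concrete, here is a minimal NumPy sketch. It deliberately uses a plain recurrent cell rather than an LSTM, with random weights and toy sizes, so it only illustrates the data flow and is not the model used by `coicop_model`.

```python
import numpy as np

def classify_sequence(word_vectors, W_in, W_rec, W_out):
    """Run a plain recurrent cell over the word vectors and return class probabilities."""
    hidden = np.zeros(W_rec.shape[0])
    for vector in word_vectors:
        # The new state mixes the current word with the memory of earlier words
        hidden = np.tanh(W_in @ vector + W_rec @ hidden)
    logits = W_out @ hidden
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
embedding_dim, hidden_dim, n_classes = 4, 8, 3  # toy sizes, assumptions only
W_in = rng.normal(size=(hidden_dim, embedding_dim))
W_rec = rng.normal(size=(hidden_dim, hidden_dim))
W_out = rng.normal(size=(n_classes, hidden_dim))

sentence = rng.normal(size=(5, embedding_dim))  # five "words" as random vectors
probabilities = classify_sequence(sentence, W_in, W_rec, W_out)
print(probabilities, probabilities.argmax())
```

An LSTM replaces the single `tanh` update with gated updates that decide what to keep and what to forget, which helps on longer texts.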




![embed](https://drive.google.com/uc?id=1AoleK5q47ZTkPCtvxDD6icEpcnLc9viE)

For the prediction you have to specify the names of certain columns in your dataframe which the tool will use.  
* name_col is the column with the product name (mandatory)
* category_col is the column with the category given by the supermarket (optional)
* url_col is the column with the URL of the product (optional)
* lang is the language of the texts; 'de' and 'fr' are supported (mandatory)
* label_col is the column with the labels, if the data are already labeled (optional)
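
Before initialising the predictor it can help to check that the chosen column names actually exist. The helper below is a hypothetical sketch, not part of `coicop_model`; in the notebook you would pass `df.columns` as the first argument.

```python
def check_columns(available, mandatory, optional=()):
    """Return (missing_mandatory, missing_optional) column names."""
    available = set(available)
    missing_mandatory = [c for c in mandatory if c not in available]
    missing_optional = [c for c in optional if c and c not in available]
    return missing_mandatory, missing_optional

# A subset of the column names from the dataframe preview above
columns = ["lang", "name", "categ", "url", "cc5"]
missing, skipped = check_columns(columns,
                                 mandatory=["name"],
                                 optional=["categ", "url", "cc5"])
if missing:
    raise KeyError(f"mandatory columns not found: {missing}")
print("optional columns not found:", skipped)
```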



In [11]:
# init predictor

new_prediction = True

name_col = 'name'
category_col = 'categ'
url_col = 'url'
lang = 'de'
label_col = 'cc5'

#coicop_model
CoiCop_Predictor = coicop_model.predictor(df
                                          , name_col
                                          , category_col
                                          , url_col
                                          , label_col
                                          , lang)
    
if new_prediction:
    df = CoiCop_Predictor.predict_proba()
    #CoiCop_Predictor.predict()
    #df = CoiCop_Predictor.get_df()

using name, category and words in url as input
using german embeddings


  0%|          | 0/303 [00:00<?, ?it/s]

95% quantile no. of words per row is 21 (trained on 39)


100%|██████████| 303/303 [00:27<00:00, 11.00it/s]


ValueError: ignored

## Explanation  
  
It shows the three best candidates for a product and highlights the words which contribute to the estimate (positive: green, negative: red).
If `text` is None, it shows a random product. If you want the explanation for a particular product, enter its text.  
Entered text always needs to be in quotation marks ''.
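
The output of the next cell lists a "Highlighted in text (sum)" row and a `<BIAS>` row per candidate, which suggests an additive explanation: the score of a category is the sum of the word contributions plus a bias. The sketch below illustrates that reading with hypothetical per-word weights; the real values come from the explainer inside `coicop_model` and are not reproduced here.

```python
def explain_score(word_weights, bias):
    """Sum per-word contributions and the bias into one category score."""
    positive = {w: v for w, v in word_weights.items() if v > 0}  # shown in green
    negative = {w: v for w, v in word_weights.items() if v < 0}  # shown in red
    score = sum(word_weights.values()) + bias
    return score, positive, negative

# Hypothetical contributions for one candidate category
weights = {"pork": 2.1, "chip": 1.9, "bites": 0.4, "suess": -0.8}
score, green, red = explain_score(weights, bias=-3.3)
print(f"score: {score:.2f}")
print("pushes towards the category:", sorted(green))
print("pushes away from the category:", sorted(red))
```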

In [5]:
text = None
#text = 'wurst spezialitäten produkte big power pork chip bites suess pikant'

CoiCop_Predictor.tell_me_why(text)

prediction 1194_Ready-made meals
label nan


Contribution?,Feature
3.585,Highlighted in text (sum)
-3.317,<BIAS>

Contribution?,Feature
4.446,Highlighted in text (sum)
-6.244,<BIAS>

Contribution?,Feature
3.787,Highlighted in text (sum)
-6.162,<BIAS>


# Annotation

For the annotation tool you have to specify the names of certain columns in your dataframe which the tool will use.  
* Labeled_by takes your name and stores it with every item you label.  
* Text column 1 is the name of the column for the first line of text that is displayed to help you label the product; usually this is the product name.  
* Text column 2 is the name of the column for the second line of text that is displayed to help you label the product; this could be the category or the translation.  
* The URL column is the name of the column containing the URL of the scraped product (if it is in the dataframe). The URL can be helpful if you are unsure about the product and want to look up the whole page on the website. If you do not have the URL in your data, you can leave it empty.  
* CoiCop 5 prediction is the column where the prediction is stored. The prediction will appear preselected in the dropdown menu.   
  
The data are sorted so that categories with few labels come first, provided predictions for those categories are available.
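
One way such an ordering could work is sketched below: count how many labels each category already has, then show unlabeled products whose predicted category is rarest first. This is an assumption about the idea, not the code in `labeler_cc5`; the function name and the toy rows are hypothetical.

```python
from collections import Counter

def sort_rare_first(rows, label_key="cc5", pred_key="cc5_pred"):
    """Order unlabeled rows so that predicted categories with few labels come first."""
    label_counts = Counter(r[label_key] for r in rows if r[label_key] is not None)
    unlabeled = [r for r in rows if r[label_key] is None]
    # Counter returns 0 for categories that have no labels yet
    return sorted(unlabeled, key=lambda r: label_counts[r[pred_key]])

rows = [
    {"name": "a", "cc5": "1113_Bread", "cc5_pred": "1113_Bread"},
    {"name": "b", "cc5": "1113_Bread", "cc5_pred": "1113_Bread"},
    {"name": "c", "cc5": "1175_Crisps", "cc5_pred": "1175_Crisps"},
    {"name": "d", "cc5": None, "cc5_pred": "1113_Bread"},
    {"name": "e", "cc5": None, "cc5_pred": "1175_Crisps"},
    {"name": "f", "cc5": None, "cc5_pred": "1211_Coffee"},
]
queue = sort_rare_first(rows)
print([r["name"] for r in queue])
```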


In [0]:
# init labeler

labeled_by = 'Erik'
text_column_1 = 'name'
text_column_2 = 'categ'
url_column = 'url'
CoiCop_5_pred_col = 'cc5_pred'
use_proba = False

CoiCop_Labeler = labeler_cc5.labeler( labeled_by
                        , df
                        , text_column_1
                        , text_column_2
                        , url_column
                        , CoiCop_5_pred_col
                        , use_proba)

Now you are ready to label. After executing the next cell you will see:
* Text 1
* Text 2
* URL
* A dropdown with one category preselected, which is the prediction for the product. If the prediction is wrong, click on the dropdown and select the right category.
* The Next button skips this product
* The Save button writes the selected category to your dataframe
  
When you have finished labeling, just move on to the next cell.

In [7]:
CoiCop_Labeler.start_to_label()

VBox(children=(Box(children=(Dropdown(description='Select category to label:', layout=Layout(height='60px', wi…

The next cell prints your current annotation progress.

In [8]:
CoiCop_Labeler.get_stats()

new labels: -28362
in total 1100 of 30360 labeled ( 4.0 %)


Output the data including the annotations.

In [9]:
df= CoiCop_Labeler.output_labels()
df[:1]

Unnamed: 0,lang,name,categ,prod_desc,text_other,url,words_from_url,unit,cc3,cc4,cc5,cc3_pred,cc4_pred,cc5_pred,shop,brand,price,id,url_text,text,labeled_by,sort_columns,1111_Rice,1112_Flours and other cereals,1113_Bread,1114_Other bakery products,1115_Pizza and quiche,1116_Pasta products and couscous,1117_Breakfast cereals,1118_Other cereal products,1121_Beef and veal,1122_Pork,1123_Lamb and goat,1124_Poultry,1125_Other meats,1126_Edible offal,"1127_Dried, salted or smoked meat",1128_Other meat preparations,1131_Fresh or chilled fish,1132_Frozen fish,...,1163_Dried fruit and nuts,1164_Preserved fruit and fruit-based products,1171_Fresh or chilled vegetables other than potatoes and other tubers,1172_Frozen vegetables other than potatoes and other tubers,"1173_Dried vegetables, other preserved or processed vegetables",1174_Potatoes,1175_Crisps,1176_Other tubers and products of tuber vegetables,1181_Sugar,"1182_Jams, marmalades and honey",1183_Chocolate,1184_Confectionery products,1185_Edible ices and ice cream,1186_Artificial sugar substitutes,"1191_Sauces, condiments","1192_Salt, spices and culinary herbs",1193_Baby food,1194_Ready-made meals,1199_Other food products n.e.c.,1211_Coffee,1212_Tea,1213_Cocoa and powdered chocolate,1221_Mineral or spring waters,1222_Soft drinks,1223_Fruit and vegetable juices,2111_Spirits and liqueurs,2112_Alcoholic soft drinks,2121_Wine from grapes,2122_Wine from other fruits,2123_Fortified wines,2124_Wine-based drinks,2131_Lager beer,2132_Other alcoholic beer,2133_Low and non-alcoholic beer,2134_Beer-based drinks,2201_Cigarettes,2202_Cigars,2203_Other tobacco products,9999_Non-Food,max_score
0,de,Kelly's Linsenchips mit Meersalz,suesses und salziges chips und co knabbereien,Die neue Generation Chips!Kelly's LinsenCHIPS ...,,https://www.billa.at/produkte/kellys-linsenchi...,produkte kellys linsenchips mit meersalz,,11_Food,117_Vegetables,1176_Other tubers and products of tuber vegeta...,11_Food,117_Vegetables,1176_Other tubers and products of tuber vegeta...,Billa,,1.99,00-245459,produkte kellys linsenchips mit meersalz,Kelly's Linsenchips mit Meersalzsuesses und sa...,Erik,68,0.000191,8.204557e-08,3e-06,0.004922,2.088711e-07,6.6e-05,7.2e-05,5.109275e-07,1.377003e-08,1.329912e-10,5.356492e-09,1.525451e-08,9.386105e-08,2.718626e-09,4.578719e-07,1e-06,4.391935e-08,9.916019e-08,...,0.000917,9e-06,4e-06,1.4e-05,0.001196,0.005708,0.127077,0.844922,5.114393e-07,2e-06,0.000682,0.002596,7e-06,7.116184e-07,0.004055,0.000897,1.6e-05,0.003996,6e-06,2e-06,2.227783e-07,6.926529e-07,3.348202e-07,1.3e-05,1e-06,8.6e-05,0.000224,1.3e-05,3e-06,9e-06,4.964586e-07,2.210389e-07,7.811616e-08,3.348473e-09,6.184307e-08,1.6e-05,2.630245e-07,1e-06,0.000901,0.844922


## Save data to Google Drive

In [0]:
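save_to_drive = True  # set to False to skip writing back to Google Drive
if save_to_drive:
    # index=False avoids adding another 'Unnamed: 0' column on every save
    df.to_csv(path + data_path + file_path, sep='|', index=False)
    print(path + data_path + file_path, 'saved')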

/content/gdrive/My Drive/Thesis_ecb_ecoicop/data/at/norm_at.csv saved
