<a href="https://colab.research.google.com/github/ELehmann91/Thesis_Multilingual_Transferlearning/blob/master/Text_Annotation_Tool_BuBa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# ECOICOP Annotation
  
Use this tool to annotate your product texts with the corresponding ECOICOP category.


## Get code from GitHub
  
First, we need to download the code from GitHub into this Colab environment.

In [15]:
%%capture
!pip install eli5
!git clone 'https://github.com/ELehmann91/Thesis_Multilingual_Transferlearning'
 
%cd Thesis_Multilingual_Transferlearning
import labeler_cc5
import coicop_model
import pandas as pd
import numpy as np
from tqdm import tqdm
import io

The files from GitHub are now available in your temporary Colab folder:

In [34]:
!ls

coicop_5_3.txt	  Match_DE_Galeria_small.xlsx
coicop_5_4.txt	  model_helper.py
coicop_model.py   Multilingual_Embeddings.ipynb
data		  README.md
de_fr_mod_cc5.h5  Results_Thesis.ipynb
de_mod_cc5.h5	  Text_Annotation_Tool_AT.ipynb
img		  Text_Annotation_Tool_BdF.ipynb
__init.py__	  Text_Annotation_Tool_BuBa.ipynb
labeler_cc5.py	  Text_Annotation_Tool.ipynb
labeler.py	  Text_Annotation_Tool_Italy.ipynb


## Upload CSV / Excel from Local Drive

To upload from your local drive, start with the following code:

In [16]:
from google.colab import files
uploaded = files.upload()

Saving Match_DE_Galeria_small.xlsx to Match_DE_Galeria_small.xlsx


The cell prompts you to select a file. Click “Choose Files”, then select and upload the file, and wait until the upload reaches 100%. Colab shows the name of the file once the upload has finished.
Finally, run the following code to import the file into a dataframe (make sure the filename matches the name of the uploaded file).

In [17]:
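# Alternatives for other uploads:
#df = pd.read_csv(io.BytesIO(uploaded['eantoclassifyECB.csv']), sep=',')
#df = pd.read_excel(io.BytesIO(uploaded['carrfour_trans_pred.xlsx']))
# `uploaded` maps file names to raw bytes, so wrap the bytes in a buffer.
# Recent pandas versions reject the `encoding` argument of read_excel, so it is dropped.
df = pd.read_excel(io.BytesIO(uploaded['Match_DE_Galeria_small.xlsx']))
print(df.shape)
df[:2]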

(1999, 7)


Unnamed: 0,PRODUKTART_BEZ,ABTEILUNG_BEZ,GESCHLECHT,WARENGRUPPE_BESCHREIBUNG1,category,url,label_cat5
0,DAMENWAESCHE,DA.-TAGW.,Damen,D-Slips,Tageswäsche-Damen,https://www.galeria.de/Schoeller-Damen-Hueftsl...,9999_Non-Food
1,OUTDOOR,OUTDOORBKL.,-,kurzarm He/Uni,Hemden - Bergsport / Wandern,,
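
If you switch between CSV and Excel uploads, a small helper can choose the reader from the file extension. The following is only a sketch: `read_uploaded` is a hypothetical name that is not part of the repository, and it assumes that `uploaded` is the mapping from file name to raw bytes returned by `files.upload()`. The demo data are made up.

```python
import io
import pandas as pd

def read_uploaded(uploaded, filename, sep=','):
    """Load one uploaded file (name -> bytes mapping) into a DataFrame."""
    buffer = io.BytesIO(uploaded[filename])
    if filename.lower().endswith(('.xlsx', '.xls')):
        return pd.read_excel(buffer)  # Excel files need no separator
    return pd.read_csv(buffer, sep=sep)

# Stand-in for the dict returned by files.upload(); the content is invented.
demo_upload = {'demo.csv': b'name;category\nFeinstrumpfhosen;Struempfe\n'}
demo_df = read_uploaded(demo_upload, 'demo.csv', sep=';')
print(demo_df.shape)
```

In the notebook you would call it as `df = read_uploaded(uploaded, 'Match_DE_Galeria_small.xlsx')`.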


## Upload from Google Drive

You can also load CSV or Excel files stored in your Google Drive.

In [29]:
from google.colab import drive, files
 
drive.mount('/content/gdrive')
path ='/content/gdrive/My Drive/Thesis_ecb_ecoicop'

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


You can now list your Drive folders:

In [None]:
!ls '/content/gdrive/My Drive/'

and the data available in the buba folder (here, the author's Drive):

In [36]:
!ls '/content/gdrive/My Drive/Thesis_ecb_ecoicop/data/buba'

Match_DE_Galeria_small.xlsx  training.csv


In [33]:
# Load the same file from Drive.
# Recent pandas versions reject the `encoding` argument of read_excel, so it is dropped.
df = pd.read_excel(path + '/data/buba/Match_DE_Galeria_small.xlsx')
print(df.shape)
df[:2]

(1999, 7)


Unnamed: 0,PRODUKTART_BEZ,ABTEILUNG_BEZ,GESCHLECHT,WARENGRUPPE_BESCHREIBUNG1,category,url,label_cat5
0,DAMENWAESCHE,DA.-TAGW.,Damen,D-Slips,Tageswäsche-Damen,https://www.galeria.de/Schoeller-Damen-Hueftsl...,9999_Non-Food
1,OUTDOOR,OUTDOORBKL.,-,kurzarm He/Uni,Hemden - Bergsport / Wandern,,


The commands take you to a Google authentication step. You should see a screen saying that Google Drive File Stream wants to access your Google Account. After you grant permission, copy the verification code and paste it into the box in Colab.
In the notebook, click the charcoal > at the top left and then click Files. Locate the data folder you created earlier and find your data. Right-click your data and select Copy Path. Store the copied path in a variable and you are ready to go.

Specify the exact path to your file and the separator used in the file to read it in.

Print out the dataframe to see if the import is correct.
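
If you are unsure which separator a CSV file uses, you can inspect its first lines before reading it. The sketch below is one simple heuristic, not part of the tool: `guess_separator` is a hypothetical helper that picks the candidate character occurring equally often in every sampled line. It ignores quoted fields, so treat the result as a hint and still print the dataframe to check the import.

```python
def guess_separator(sample, candidates=(',', ';', '|', '\t')):
    """Return the candidate that occurs equally often (and at least once) in every line."""
    lines = [line for line in sample.splitlines() if line.strip()]
    best, best_count = None, 0
    for sep in candidates:
        counts = {line.count(sep) for line in lines}
        if len(counts) == 1:  # same number of occurrences in every line
            count = counts.pop()
            if count > best_count:
                best, best_count = sep, count
    return best

# Invented sample; in practice read the first few lines of your own file.
sample = "product|category|price\nMilch|Dairy|1.09\nKaese|Dairy|2.49\n"
separator = guess_separator(sample)
print(repr(separator))
```

The detected value can then be passed on, for example as `pd.read_csv(file_path, sep=separator)`.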

# Predict CoiCop Level 4
  
How does it work? - Word Embeddings (left)
* Words are translated into vectors, which are learned representations
* Embeddings live in a multidimensional space (often 300+ dimensions), and words that relate to each other are closer together in this space
* These spaces can be aligned across different languages
* The example shows a three-dimensional space with embeddings for three different languages; translations are ideally close neighbours, and the distance between milk and cheese is smaller than the distance between milk and potato  
more information: https://towardsdatascience.com/introduction-to-word-embeddings-4cf857b12edc
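
To make the idea of "closeness" concrete, here is a toy sketch using cosine similarity, a common way to compare embedding vectors. The three-dimensional vectors are invented for illustration; they are not taken from the embeddings used by this tool, which have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors in an aligned English/German space.
toy_embeddings = {
    'milk':   [0.90, 0.80, 0.10],
    'Milch':  [0.88, 0.82, 0.12],
    'cheese': [0.80, 0.90, 0.20],
    'potato': [0.10, 0.20, 0.90],
}

sim_translation = cosine_similarity(toy_embeddings['milk'], toy_embeddings['Milch'])
sim_cheese = cosine_similarity(toy_embeddings['milk'], toy_embeddings['cheese'])
sim_potato = cosine_similarity(toy_embeddings['milk'], toy_embeddings['potato'])
print(sim_translation, sim_cheese, sim_potato)
```

With these toy values the translation pair is the most similar, followed by milk and cheese, with milk and potato furthest apart.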
  
How does it work? - Recurrent Neural Network (right)
* Every word is now represented by a vector, so the input to the model is a sequence of vectors
* LSTM (Long Short-Term Memory) networks are good at solving sequential tasks because they can remember information from previous states (words) and output a representation of the whole sentence
* This representation is used to classify the text into the COICOP categories  
more information: https://towardsdatascience.com/a-beginners-guide-on-sentiment-analysis-with-rnn-9e100627c02e
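
The flow from word vectors to category probabilities can be sketched with a single LSTM layer written in NumPy. This is an illustration under strong assumptions: the weights are random rather than trained, all sizes are tiny, and the architecture of the actual model in `coicop_model` may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_encode(sequence, W, U, b):
    """Run one LSTM layer over a (steps x emb_dim) sequence; return the last hidden state."""
    hidden = U.shape[1]
    h = np.zeros(hidden)  # hidden state (the running sentence representation)
    c = np.zeros(hidden)  # cell state (the "memory")
    for x in sequence:
        z = W @ x + U @ h + b
        i, f, o, g = np.split(z, 4)  # input, forget, output gates and candidate
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
emb_dim, hidden, n_classes = 3, 4, 5  # toy sizes, chosen arbitrarily
W = rng.normal(size=(4 * hidden, emb_dim))
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
V = rng.normal(size=(n_classes, hidden))  # classification layer

word_vectors = rng.normal(size=(6, emb_dim))  # a "sentence" of six word vectors
sentence_vec = lstm_encode(word_vectors, W, U, b)
probs = softmax(V @ sentence_vec)
print(probs)
```

The last hidden state plays the role of the sentence representation, and the softmax layer turns it into one probability per category.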




![embed](https://drive.google.com/uc?id=1AoleK5q47ZTkPCtvxDD6icEpcnLc9viE)

For the prediction you have to specify the names of certain columns in your dataframe that the tool will use.  
* name_col is the column with the product name (mandatory)
* category_col is the column with the category given by the supermarket (optional)
* url_col is the column with the URL of the product (optional)
* lang is the language of the texts; 'de' and 'fr' are supported (mandatory)
* label_cat5 is the column with the labels, if the data are already labeled (optional)
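
Before initialising the predictor, it can help to verify that the configured column names actually exist in your dataframe. The following sketch shows one way to do that; `check_predictor_columns` is a hypothetical helper, not a function of `coicop_model`, and it only encodes the mandatory/optional rules listed above.

```python
def check_predictor_columns(columns, name_col, category_col=None,
                            url_col=None, label_col=None, lang='de'):
    """Raise ValueError if a configured column is missing or lang is unsupported."""
    if lang not in ('de', 'fr'):
        raise ValueError(f"unsupported language: {lang!r}")
    problems = []
    if name_col is None:
        problems.append('name_col is mandatory')
    configured = {'name_col': name_col, 'category_col': category_col,
                  'url_col': url_col, 'label_cat5': label_col}
    for setting, column in configured.items():
        if column is not None and column not in columns:
            problems.append(f"{setting}={column!r} not found")
    if problems:
        raise ValueError('column check failed: ' + ', '.join(problems))
    return True

# In the notebook you would pass list(df.columns) instead of this literal.
columns = ['PRODUKTART_BEZ', 'WARENGRUPPE_BESCHREIBUNG1', 'category', 'url']
ok = check_predictor_columns(columns, 'WARENGRUPPE_BESCHREIBUNG1', 'category', 'url')
print(ok)
```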



In [18]:
# init predictor
  
name_col = 'WARENGRUPPE_BESCHREIBUNG1'
category_col = 'category'
url_col = 'url'
lang = 'de'
label_cat5 = None #'cc5' #'coicop4_str'

# coicop_model
CoiCop_Predictor = coicop_model.predictor(
    df, name_col, category_col, url_col, label_cat5, lang)


using name, category and words in url as input
using german embeddings
95% quantile no. of words per row is 11 (trained on 39)


In [19]:
df = CoiCop_Predictor.predict_proba()

100%|██████████| 19/19 [00:01<00:00, 18.67it/s]


In [None]:
df.iloc[111][:12]

PRODUKTART_BEZ                      STRUEMPFE                               
ABTEILUNG_BEZ                       STRUEMPFE                               
GESCHLECHT                                                                 -
WARENGRUPPE_BESCHREIBUNG1                                   Feinstrumpfhosen
category                                                   Neue WG verwenden
url                                                                      NaN
label_cat5                                                               NaN
url_text                                                                    
text                         Feinstrumpfhosen <sep> Neue WG verwenden <sep> 
cc3_pred                                                         99_Non-Food
cc4_pred                                                        999_Non-Food
cc5_pred                                                       9999_Non-Food
Name: 111, dtype: object

In [20]:
# Predicted ECOICOP Categories:
df['cc5_pred'].value_counts()

9999_Non-Food                                                            760
1211_Coffee                                                              144
1117_Breakfast cereals                                                   138
1192_Salt, spices and culinary herbs                                     128
1116_Pasta products and couscous                                         108
1154_Other edible oils                                                    96
1128_Other meat preparations                                              80
1125_Other meats                                                          62
2111_Spirits and liqueurs                                                 46
1193_Baby food                                                            46
1134_Frozen seafood                                                       41
1118_Other cereal products                                                31
1194_Ready-made meals                                                     31

In [21]:
if label_cat5 is not None:
    CoiCop_Predictor.test_performance()
    #CoiCop_Predictor.confusion_matrix()

## Explanation  
  
This shows the three best candidates for a product and highlights the words that contribute to the estimate (positive: green, negative: red).
If you leave the brackets empty, it will show a random product. If you want the explanation for a particular product, you can enter its text.  
Any text you enter must be enclosed in quotation marks ''.

In [None]:
text = 'Feinstrumpfhosen <sep> Neue WG verwenden <sep>  '
categ= None #'1134_Frozen seafood'

CoiCop_Predictor.tell_me_why(text,categ)

Contribution?,Feature
3.095,Highlighted in text (sum)
-4.923,<BIAS>

Contribution?,Feature
-0.521,Highlighted in text (sum)
-2.506,<BIAS>

Contribution?,Feature
1.012,Highlighted in text (sum)
-4.265,<BIAS>
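
The input string appears to follow the layout of the `text` column shown earlier: product name, category and URL words joined by `<sep>`. Assuming that layout (it is inferred from the printed output, not from the model code), a hypothetical helper for composing such a string could look like this:

```python
SEP = ' <sep> '

def build_model_text(name, category='', url_text=''):
    """Join the text fields in the layout seen in the `text` column."""
    return f"{name}{SEP}{category}{SEP}{url_text}"

example_text = build_model_text('Feinstrumpfhosen', 'Neue WG verwenden')
print(repr(example_text))
```

The result could then be passed on as `CoiCop_Predictor.tell_me_why(example_text, None)`.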


# Annotation

For the annotation tool you have to specify the names of certain columns in your dataframe that the tool will use.  
* Labeled_by takes your name and stores it with every item you label.  
* Text column 1 is the name of the column for the first line of text displayed to help you label the product; usually this is the product name.  
* Text column 2 is the name of the column for the second line of text displayed to help you label the product; this could be the category or the translation.  
* The URL column is the name of the column containing the URL of the scraped product (if it is in the dataframe). The URL can be helpful if you are unsure about the product and want to look up the whole page on the website. If you do not have the URL in your data, you can leave it empty.  
* CoiCop 5 prediction is the column where the prediction is stored. The prediction will appear preselected in the dropdown menu.   
  
The data are sorted so that categories with few labels come first, provided that predictions for those categories are available.
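
The ordering idea can be sketched in a few lines. This is not the implementation in `labeler_cc5`; `sort_rare_categories_first` is a hypothetical function that counts the existing labels per category and puts products whose predicted category has the fewest labels first. The rows are invented.

```python
from collections import Counter

def sort_rare_categories_first(rows, pred_key='cc5_pred', label_key='cc5'):
    """Order rows so that predicted categories with few existing labels come first."""
    label_counts = Counter(row[label_key] for row in rows if row.get(label_key))
    # A Counter returns 0 for categories that have no label yet.
    return sorted(rows, key=lambda row: label_counts[row[pred_key]])

rows = [
    {'text': 'Espresso', 'cc5_pred': '1211_Coffee', 'cc5': '1211_Coffee'},
    {'text': 'Filterkaffee', 'cc5_pred': '1211_Coffee', 'cc5': '1211_Coffee'},
    {'text': 'Spaghetti', 'cc5_pred': '1116_Pasta products and couscous',
     'cc5': '1116_Pasta products and couscous'},
    {'text': 'Babybrei', 'cc5_pred': '1193_Baby food', 'cc5': None},
]
queue = sort_rare_categories_first(rows)
print([row['text'] for row in queue])
```

Because Python's sort is stable, products with the same label count keep their original relative order.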


In [22]:
add_categories = ['12111 Hairdressing for men and children',
                   '12112 Hairdressing for women',
                   '12113 Personal grooming treatments',
                   '12121 Electric appliances for personal care',
                   '12131 Non-electrical appliances',
                   '12132 Articles for personal hygiene (wellness, esoteric beauty) ',
                   '06110 Pharmaceutical products',
                   '06131 Corrective eye-glasses and contact lenses',
                   '06110 Pharmaceutical products',
                   '06139 Other therapeutic appliances and equipment' ,
                   '5111_Household furniture',
                    '5112_Garden furniture',
                    '5113_Lighting equipment',
                    '5119_Other furniture and furnishings',
                    '5121_Carpets and rugs',
                    '5122_Other floor coverings',
                    '5123_Services of laying of fitted carpets and floor coverings',
                    '5130_Repair of furniture, furnishings and floor coverings',
                    '5201_Furnishing fabrics and curtains',
                    '5202_Bed linen',
                    '5203_Table linen and bathroom linen',
                    '5204_Repair of household textiles',
                    '5209_Other household textiles',
                    '5311_Refrigerators, freezers and fridge-freezers',
                    '5312_Clothes washing machines, clothes drying machines and dish washing machines',
                    '5313_Cookers',
                    '5314_Heaters, air conditioners',
                    '5315_Cleaning equipment',
                    '5319_Other major household appliances',
                    '5321_Food processing appliances',
                    '5322_Coffee machines, tea makers and similar appliances',
                    '5323_Irons',
                    '5324_Toasters and grills',
                    '5329_Other small electric household appliances',
                    '5330_Repair of household appliances',
                    '5401_Glassware, crystal-ware, ceramic ware and chinaware',
                    '5402_Cutlery, flatware and silverware',
                    '5403_Non-electric kitchen utensils and articles',
                    '5404_Repair of glassware, tableware and household utensils',
                    '5511_Motorised major tools and equipment',
                    '5512_Repair, leasing and rental of major tools and equipment',
                    '5521_Non-motorised small tools',
                    '5522_Miscellaneous small tool accessories',
                    '5523_Repair of non-motorised small tools and miscellaneous accessories',
                    '5611_Cleaning and maintenance products',
                    '5612_Other non-durable small household articles',
                    '5621_Domestic services by paid staff',
                    '5622_Cleaning services',
                    '5623_Hire of furniture and furnishings',
                    '5629_Other domestic services and household services'
                   ]

In [25]:
# init labeler

labeled_by = 'Erik'
text_column_1 = 'WARENGRUPPE_BESCHREIBUNG1'
text_column_2 = 'category'
url_column = 'url'
CoiCop_5_pred_col = 'cc5_pred'
Use_probabilities = False

# labeler_cc5
CoiCop_Labeler = labeler_cc5.labeler(
    labeled_by, df, text_column_1, text_column_2, url_column,
    CoiCop_5_pred_col, Use_probabilities, add_categories)

Now you are ready to label. After executing the next cell you will see:
* A dropdown to select the category you want to label
* Text 1
* Text 2
* URL
* A dropdown preselected with the predicted category for the product. If the prediction is wrong, click the dropdown and select the right category.
* A Next button, which skips this product
* A Save button, which writes the selected category to your dataframe
  
When you have finished labeling, continue with the next cell.

In [26]:
CoiCop_Labeler.start_to_label()

VBox(children=(Box(children=(Dropdown(description='Select category to label:', layout=Layout(height='60px', wi…

The next cell prints your current annotation progress.

In [27]:
CoiCop_Labeler.get_stats()

new labels: -1239
in total 0 of 1999 labeled ( 0.0 %)


Output the data including annotation.

In [28]:
df= CoiCop_Labeler.output_labels()
df[:1]

Unnamed: 0,PRODUKTART_BEZ,ABTEILUNG_BEZ,GESCHLECHT,WARENGRUPPE_BESCHREIBUNG1,category,url,label_cat5,url_text,text,cc3_pred,cc4_pred,cc5_pred,1111_Rice,1112_Flours and other cereals,1113_Bread,1114_Other bakery products,1115_Pizza and quiche,1116_Pasta products and couscous,1117_Breakfast cereals,1118_Other cereal products,1121_Beef and veal,1122_Pork,1123_Lamb and goat,1124_Poultry,1125_Other meats,1126_Edible offal,"1127_Dried, salted or smoked meat",1128_Other meat preparations,1131_Fresh or chilled fish,1132_Frozen fish,1133_Fresh or chilled seafood,1134_Frozen seafood,"1135_Dried, smoked or salted fish and seafood",1136_Other preserved or processed fish and seafood-based preparations,1141_Fresh whole milk,1142_Fresh low fat milk,1143_Preserved milk,1144_Yoghurt,1145_Cheese and curd,1146_Other milk products,...,"1173_Dried vegetables, other preserved or processed vegetables",1174_Potatoes,1175_Crisps,1176_Other tubers and products of tuber vegetables,1181_Sugar,"1182_Jams, marmalades and honey",1183_Chocolate,1184_Confectionery products,1185_Edible ices and ice cream,1186_Artificial sugar substitutes,"1191_Sauces, condiments","1192_Salt, spices and culinary herbs",1193_Baby food,1194_Ready-made meals,1199_Other food products n.e.c.,1211_Coffee,1212_Tea,1213_Cocoa and powdered chocolate,1221_Mineral or spring waters,1222_Soft drinks,1223_Fruit and vegetable juices,2111_Spirits and liqueurs,2112_Alcoholic soft drinks,2121_Wine from grapes,2122_Wine from other fruits,2123_Fortified wines,2124_Wine-based drinks,2131_Lager beer,2132_Other alcoholic beer,2133_Low and non-alcoholic beer,2134_Beer-based drinks,2201_Cigarettes,2202_Cigars,2203_Other tobacco products,9999_Non-Food,max_score,labeled_by,cc3,cc4,cc5
0,DAMENWAESCHE,DA.-TAGW.,Damen,D-Slips,Tageswäsche-Damen,https://www.galeria.de/Schoeller-Damen-Hueftsl...,9999_Non-Food,schoeller damen hueftslip pack html src,D-Slips <sep> Tageswäsche-Damen <sep> schoelle...,11_Food,111_Bread and cereals,1116_Pasta products and couscous,0.001608,0.001574,0.022974,0.017038,0.000169,0.268262,0.002265,0.000798,7.3e-05,0.000375,2e-06,0.000767,7e-06,5.1e-05,0.000606,0.008287,0.000105,0.000817,0.002088,0.000758,0.000366,0.002999,0.00037,0.001108,0.000475,0.000645,0.031194,0.00221,...,0.031773,0.002507,0.000253,0.001993,4e-06,0.00821,0.001788,0.00237,3.2e-05,0.000509,0.007687,0.216767,0.00374,0.020162,0.002495,0.012391,0.012769,0.000417,0.000659,0.000386,0.000442,0.002778,5.4e-05,0.001995,2.7e-05,0.000408,0.000481,9.6e-05,0.001152,1.1e-05,0.001057,0.000176,0.000775,7.9e-05,0.232724,0.268262,,,,
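
The output contains one probability column per category, and `max_score` matches the highest of them for the row shown. To inspect the runners-up for a single product, you can rank these probabilities yourself. The sketch below uses four values copied from the row above; `top_candidates` is a hypothetical helper, not part of the tool.

```python
def top_candidates(probabilities, k=3):
    """Return the k (category, probability) pairs with the highest probability."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# A few of the per-category probabilities of row 0.
row_probs = {
    '1113_Bread': 0.022974,
    '1116_Pasta products and couscous': 0.268262,
    '1192_Salt, spices and culinary herbs': 0.216767,
    '9999_Non-Food': 0.232724,
}
best = top_candidates(row_probs)
for category, probability in best:
    print(f"{probability:.3f}  {category}")
```

A small gap between the first and second candidate, as in this row, suggests that the prediction deserves a closer look during labeling.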


## Save data to Local Drive

In [None]:
from google.colab import files

df.to_csv('df_out.csv',sep='|')
files.download('df_out.csv')

## Save data to Google Drive

In [None]:
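# Assumed values (not defined earlier in this notebook), based on the folder listing above.
data_path, file_path = '/data/buba/', 'training.csv'
df.to_csv(path + data_path + file_path, sep='|')
print(file_path, 'saved', len(df), 'observation', df['cc5'].notna().sum(), 'labeled')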

training.csv saved 9584 observation 8473 labeled
