<a href="https://colab.research.google.com/github/ELehmann91/Thesis_Multilingual_Transferlearning/blob/master/Text_Annotation_Tool_BdF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p><img alt="Colaboratory logo" height="45px" src="/img/colab_favicon.ico" align="left" hspace="10px" vspace="0px"></p>

<h1>What is Colaboratory?</h1>

Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with:
- Zero configuration required
- Free access to GPUs
- Easy sharing

Whether you're a **student**, a **data scientist** or an **AI researcher**, Colab can make your work easier. Watch [Introduction to Colab](https://www.youtube.com/watch?v=inN8seMm7UI) to learn more, or just get started below!

# ECOICOP Annotation
  
Use this tool to annotate your product texts with the corresponding ECOICOP category.


## Get code from GitHub
  
First, download the code from GitHub into this Colab session.

In [0]:
%%capture
!pip install eli5
!git clone 'https://github.com/ELehmann91/Thesis_Multilingual_Transferlearning'

%cd Thesis_Multilingual_Transferlearning
import labeler_cc5
import coicop_model
import pandas as pd
import numpy as np
from tqdm import tqdm
import io

## Upload CSV / Excel from Local Drive

To upload from your local drive, start with the following code:

In [0]:
from google.colab import files
uploaded = files.upload()

Saving eantoclassifyECB.csv to eantoclassifyECB (1).csv


Running the cell prompts you to select a file. Click “Choose Files”, then select the file and wait until it is 100% uploaded. Colab shows the file name once the upload has finished.
Then run the following code to load the file into a dataframe (make sure the file name matches the name of the uploaded file).

In [0]:
#stata
df = pd.read_stata('table_BCE_finale.dta')

In [0]:
df = pd.read_csv(io.BytesIO(uploaded['eantoclassifyECB.csv']),sep=',')
#df = pd.read_excel(io.BytesIO(uploaded['carrfour_trans_pred.xlsx']),encoding='ANSI')
#df = pd.read_excel(uploaded['carrfour_trans_pred.xlsx'],encoding='unicode')
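
The lookup `uploaded['eantoclassifyECB.csv']` raises a `KeyError` if the key does not match the uploaded file name. A minimal sketch that avoids hard-coding the name, assuming exactly one CSV file was uploaded. `read_first_upload` is a hypothetical helper and not part of the repository, and `demo_upload` only stands in for the dict returned by `files.upload()`:

```python
import io
import pandas as pd

def read_first_upload(uploaded, sep=','):
    """Read the first entry of a {file name: bytes} dict as a CSV file."""
    if not uploaded:
        raise ValueError('no file was uploaded')
    name, content = next(iter(uploaded.items()))
    return name, pd.read_csv(io.BytesIO(content), sep=sep)

# Stand-in for the dict returned by files.upload() in Colab (made-up content)
demo_upload = {'demo.csv': b'ean,title_accent,id\n30080751,coolwave bonbons,1811805\n'}
demo_name, demo_df = read_first_upload(demo_upload)
print(demo_name, demo_df.shape)
```

In the notebook itself the call would be `name, df = read_first_upload(uploaded)`.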

In [0]:
# Dataset is now stored in a Pandas Dataframe
#df['productCategory'] = df['productCategory'].apply(lambda x: str(x).replace('suesses','süßes').replace('ue','ü').replace('ae','ä').replace('oe','ö'))
print(df.shape)
df[:2]

(132639, 3)


Unnamed: 0,ean,title_accent,id
0,30080751,"coolwave bonbons sans sucres avec édulcorants,...",1811805
1,30080768,"cool wave fraise, sans sucres, la boite de 28g",1811806


## Upload from Google Drive

Load CSV files stored in your Google Drive.

In [2]:
from google.colab import drive, files

drive.mount('/content/gdrive')
path ='/content/gdrive/My Drive/Thesis_ecb_ecoicop'

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


The commands take you to a Google authentication step. You should see a screen saying "Google Drive File Stream wants to access your Google Account". After you grant permission, copy the verification code and paste it into the box in Colab.
In the notebook, click on the charcoal > at the top left and then on Files. Locate the data folder you created earlier and find your data. Right-click on the file and select Copy Path. Store the copied path in a variable and you are ready to go.

Specify the exact path to your file and the separator used in the file to read it in.

In [0]:
#@title Which data do you want to use?
file_path = "carrefour.csv" #@param ["auchan.csv","ecb_data.csv", "carrefour.csv","banque_de_france.csv"] {allow-input: true}

In [4]:
import pandas as pd

data_path = '/data/bdf/'  # sub-folder below the Drive folder stored in `path`
df = pd.read_csv(path + data_path + file_path, sep='|', index_col=False)
print(file_path, 'loaded', len(df), 'observation')

carrefour.csv loaded 8610 observation


Print out the dataframe to see if the import is correct.
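
A quick check could look like the sketch below. `summarize_import` is a hypothetical helper and the `demo` frame is made-up data. In the notebook you would pass the `df` loaded above. A frame with a single column usually means the separator was wrong.

```python
import pandas as pd

def summarize_import(df, n=5):
    """Print the first rows and return shape, column names and missing-value counts."""
    print(df.head(n))
    return {
        'rows': len(df),
        'columns': list(df.columns),
        'missing': df.isna().sum().to_dict(),
    }

# Made-up example data, only to show the output format
demo = pd.DataFrame({'name': ['lait demi-ecreme', None],
                     'categ': ['cremerie', 'epicerie']})
summary = summarize_import(demo)
print(summary)
```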

# Predict CoiCop Level 4
  
How does it work? - Word Embeddings (left)
* Words are translated into vectors, which are learned representations
* Embeddings form a multidimensional space (often 300 or more dimensions) in which related words lie closer together
* These spaces can be aligned across languages
* The example shows a three-dimensional space with embeddings for three languages: translations are ideally close neighbours, and the distance between milk and cheese is smaller than the distance between milk and potato  
more information: https://towardsdatascience.com/introduction-to-word-embeddings-4cf857b12edc
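
The idea of "closeness" can be illustrated with cosine similarity. The sketch below uses made-up three-dimensional vectors, not the embeddings used by the tool, and only shows the expected ordering: translation closest, related word next, unrelated word last.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up toy vectors; real embeddings have 300 or more dimensions
toy = {
    'milk':   np.array([0.90, 0.80, 0.10]),
    'lait':   np.array([0.88, 0.82, 0.12]),  # French translation of "milk"
    'cheese': np.array([0.70, 0.90, 0.20]),
    'potato': np.array([0.10, 0.20, 0.90]),
}

sim_translation = cosine_similarity(toy['milk'], toy['lait'])
sim_cheese = cosine_similarity(toy['milk'], toy['cheese'])
sim_potato = cosine_similarity(toy['milk'], toy['potato'])
print(round(sim_translation, 3), round(sim_cheese, 3), round(sim_potato, 3))
```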
  
How does it work? - Recurrent Neural Network (right)
* Every word is now represented by a vector, so the input to the model is a sequence of vectors
* LSTM (Long Short-Term Memory) networks are well suited to sequential tasks because they can retain information from previous states (words) and output a representation of the whole sentence
* This representation is used to classify the product into the CoiCop categories  
more information: https://towardsdatascience.com/a-beginners-guide-on-sentiment-analysis-with-rnn-9e100627c02e 
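
The steps above can be sketched in plain NumPy. This is a toy forward pass of a standard LSTM cell with random, untrained weights, followed by a softmax layer. It is not the model shipped in `coicop_model`, whose architecture is not shown in this notebook, and all sizes are arbitrary assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lstm_encode(sequence, W, U, b):
    """Run one LSTM layer over a (steps, emb_dim) array and return the last hidden state."""
    hidden = U.shape[1]
    h = np.zeros(hidden)  # hidden state (the running sentence representation)
    c = np.zeros(hidden)  # cell state (the "memory")
    for x in sequence:
        z = W @ x + U @ h + b            # all four gates at once, shape (4 * hidden,)
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g                # forget part of the old memory, add new input
        h = o * np.tanh(c)               # expose part of the memory as output
    return h

rng = np.random.default_rng(0)
emb_dim, hidden, n_classes = 4, 5, 3     # toy sizes
W = rng.normal(scale=0.1, size=(4 * hidden, emb_dim))
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
W_out = rng.normal(scale=0.1, size=(n_classes, hidden))

sentence = rng.normal(size=(6, emb_dim))           # six "word vectors"
representation = lstm_encode(sentence, W, U, b)    # one vector for the sentence
probabilities = softmax(W_out @ representation)    # one probability per category
print(probabilities)
```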




![embed](https://drive.google.com/uc?id=1AoleK5q47ZTkPCtvxDD6icEpcnLc9viE)

For the prediction you have to specify the names of certain columns in your dataframe which the tool will use.  
* name_col is the column with the product name (mandatory)
* category_col is the column with the category given by the supermarket (optional)
* url_col is the column with the URL of the product (optional)
* lang is the language of the texts; 'de' and 'fr' are supported (mandatory)
* label_cat5 is the column with the labels, if the data are already labeled (optional)
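
Before initialising the predictor it can help to verify that the column names exist in the dataframe. The sketch below is a hypothetical helper, not part of `coicop_model`, and the `demo` frame is made-up data. How the predictor itself treats absent optional columns is not covered here.

```python
import pandas as pd

def check_columns(df, mandatory, optional=()):
    """Raise if a mandatory column is missing; return the optional columns that are absent."""
    missing = [col for col in mandatory if col not in df.columns]
    if missing:
        raise KeyError(f'mandatory column(s) not found: {missing}')
    return [col for col in optional if col not in df.columns]

# Made-up example data
demo = pd.DataFrame({'name': ['pomme golden'], 'categ': ['fruits']})
absent_optional = check_columns(demo, mandatory=['name'],
                                optional=['categ', 'url', 'cc5'])
print('optional columns not in the data:', absent_optional)
```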



In [33]:
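# init predictor

name_col = 'name'        # product name (mandatory)
category_col = 'categ'   # category given by the supermarket (optional)
url_col = 'url'          # URL of the product (optional)
lang = 'fr'              # language of the texts: 'de' or 'fr'
label_cat5 = 'cc5'       # existing labels (optional) #'coicop4_str'

CoiCop_Predictor = coicop_model.predictor(df,
                                          name_col,
                                          category_col,
                                          url_col,
                                          label_cat5,
                                          lang)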


using name, category and words in url as input
using french embeddings
95% quantile no. of words per row is 34 (trained on 39)


In [34]:
df_probs = CoiCop_Predictor.predict_proba()

100%|██████████| 86/86 [00:10<00:00,  8.56it/s]


If products are already labeled, you can test how consistent the labels are with the predictions.
label_cat5 is specified two cells above.

In [0]:
if label_cat5 is not None:
    CoiCop_Predictor.test_performance()
    #CoiCop_Predictor.confusion_matrix()
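
As an independent cross-check, the agreement between labels and predictions can also be computed with pandas. The sketch assumes the labels are in `cc5` and the predictions in `cc5_pred` (the column names used elsewhere in this notebook). `label_agreement` is a hypothetical helper, not a function of the repository, and the `demo` frame is made-up data.

```python
import pandas as pd

def label_agreement(df, label_col='cc5', pred_col='cc5_pred'):
    """Return the overall share of matching labels and the share per labeled category."""
    labeled = df[df[label_col].notna()]
    if labeled.empty:
        return float('nan'), pd.Series(dtype=float)
    match = labeled[label_col] == labeled[pred_col]
    per_category = match.groupby(labeled[label_col]).mean()
    return float(match.mean()), per_category

# Made-up example: three labeled rows, one unlabeled row
demo = pd.DataFrame({
    'cc5': ['1134_Frozen seafood', '1134_Frozen seafood',
            '1155_Other edible animal fats', None],
    'cc5_pred': ['1134_Frozen seafood', '1155_Other edible animal fats',
                 '1155_Other edible animal fats', '1134_Frozen seafood'],
})
overall, per_category = label_agreement(demo)
print(round(overall, 3))
print(per_category)
```

Categories with a low share are good candidates for a manual review of either the labels or the predictions.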

## Explanation  
  
It shows the three best candidate categories for a product and highlights the words that contribute to the estimate (positive: green, negative: red).
If you leave text set to None, it shows a random product. If you want the explanation for a particular product, enter its text.  
Text always has to be entered in quotation marks ''.

In [36]:
text = None
categ= '1134_Frozen seafood'
#text = ''

CoiCop_Predictor.tell_me_why(text,categ)

prediction 1134_Frozen seafood
label 1134_Frozen seafood


Contribution?,Feature
15.115,Highlighted in text (sum)
-9.002,<BIAS>

Contribution?,Feature
3.054,Highlighted in text (sum)
-5.86,<BIAS>

Contribution?,Feature
-3.256,<BIAS>
-3.472,Highlighted in text (sum)


# Annotation

For the annotation tool you have to specify the names of certain columns in your dataframe which the tool will use.  
* Labeled_by takes your name and stores it with every item you label.  
* Text column 1 is the name of the column for the first line of text displayed to help you label the product; usually this is the product name.  
* Text column 2 is the name of the column for the second line of text displayed to help you label the product; this could be the category or the translation.  
* The URL column is the name of the column that holds the URL of the scraped product (if it is in the dataframe). The URL can be helpful if you are unsure about the product and want to look up the whole page on the website. If you do not have the URL in your data, you can leave it empty.  
* CoiCop 5 prediction is the column where the prediction is stored. The prediction will appear preselected in the dropdown menu.   
  
The data are sorted so that categories with few labels come first, provided predictions for those categories are available.
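
One way to picture this ordering is the pandas sketch below. It is an assumption about the idea, not the code in `labeler_cc5`: rows are ordered by how many labels already exist for their predicted category. The category names `A`, `B` and `C` are placeholders and the `demo` frame is made-up data.

```python
import pandas as pd

def sort_by_label_scarcity(df, label_col='cc5', pred_col='cc5_pred'):
    """Put rows first whose predicted category has the fewest existing labels."""
    label_counts = df[label_col].value_counts()           # labels per category (NaN ignored)
    n_labels = df[pred_col].map(label_counts).fillna(0)   # 0 if the category has no labels yet
    return (df.assign(n_labels_in_pred_cat=n_labels)
              .sort_values('n_labels_in_pred_cat', kind='stable'))

# Made-up example: category A has 2 labels, B has 1, C has none
demo = pd.DataFrame({
    'name': ['saumon', 'crevettes', 'beurre', 'cabillaud', 'saindoux', 'huile'],
    'cc5': ['A', 'A', 'B', None, None, None],
    'cc5_pred': ['A', 'A', 'B', 'A', 'B', 'C'],
})
sorted_demo = sort_by_label_scarcity(demo)
print(sorted_demo[['name', 'cc5_pred', 'n_labels_in_pred_cat']])
```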


In [37]:
# init labeler

labeled_by = 'Erik'
text_column_1 = 'name'
text_column_2 = 'categ'
url_column = 'url'
CoiCop_5_pred_col = 'cc5_pred'
Use_probabilities = True

# labeler is assumed to be defined in the labeler_cc5 module imported above
CoiCop_Labeler = labeler_cc5.labeler( labeled_by
                                    , df_probs
                                    , text_column_1
                                    , text_column_2
                                    , url_column
                                    , CoiCop_5_pred_col
                                    , Use_probabilities)

4798




Now you are ready to label. After executing the next cell you will see:
* Select category you want to label
* Text 1
* Text 2
* URL
* A dropdown with the predicted category for the product preselected. If the prediction is wrong, click on the dropdown and select the right category.
* The Next button skips this product
* The Save button writes the selected category to your dataframe
  
When you have finished labeling, continue with the next cell.

In [38]:
CoiCop_Labeler.start_to_label()

VBox(children=(Box(children=(Dropdown(description='Select category to label:', layout=Layout(height='60px', wi…

4798








The next cell prints your current annotation progress.

In [0]:
CoiCop_Labeler.get_stats()

new labels: 1
in total 19242 of 303638 labeled ( 6.0 %)


Output the data including annotation.

In [0]:
df= CoiCop_Labeler.output_labels()
df[:1]

Unnamed: 0,lang,name,categ,prod_desc,text_other,url,words_from_url,unit,cc3,cc4,cc5,cc3_pred,cc4_pred,cc5_pred,shop,brand,price,id,labeld_by,url_text,text,labeled_by,sort_columns
0,fr,pomme chp bq 6 frt golden,unknown,,,unknown,,,,,,11_Food,115_Oils and fats,1155_Other edible animal fats,ecb,,,1807326,,,pomme chp bq 6 frt goldenunknown,,78


## Save data to Local Drive

In [0]:
from google.colab import files

df.to_csv('df_out.csv',sep='|')
files.download('df_out.csv')

## Save data to Google Drive

In [45]:
df.to_csv(path+data_path+file_path,sep='|')
# count the rows that have a cc5 label
print(file_path,'saved',len(df),'observation',df['cc5'].notna().sum(),'labeled')

banque_de_france.csv saved 14910 observation 14910 labeled
