# Extraction of Named Entities: Prerequisites
Author: Pierre Nugues

__You must execute this notebook before you start the assignment.__

The goal of the assignment is to create a system to extract named entities from a text. You will apply it to the CoNLL 2003 dataset.

In this part, you will collect the datasets and the files you need to train your models. You will also collect the script you need to evaluate them.

## Collecting the Training and Test Sets

As annotated data and annotation scheme, you will use the data created for [CoNLL 2003](https://www.clips.uantwerpen.be/conll2003/ner/).
1. Read the description of the CoNLL 2003 task.
2. Download the training, validation, and test sets from https://data.deepai.org/conll2003.zip and decompress them. See the instructions below.
3. Note that the tagging scheme has been changed to IOB2.
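
To make the note on IOB2 concrete, here is a minimal sketch of a conversion from IOB1 to IOB2, assuming the original release used the IOB1 convention, where `B-` only appears between two adjacent entities of the same type. In IOB2, every entity starts with `B-`. The function name `iob1_to_iob2` is hypothetical and the tag sequence is written by hand for illustration.

```python
def iob1_to_iob2(tags):
    """Convert the IOB1 tags of one sentence to IOB2 (illustrative sketch)."""
    converted = []
    previous = 'O'
    for tag in tags:
        if tag.startswith('I-'):
            label = tag[2:]
            # In IOB1, an I- tag may open an entity; IOB2 requires B- there.
            if previous == 'O' or previous[2:] != label:
                tag = 'B-' + label
        converted.append(tag)
        previous = tag
    return converted


# Hand-written IOB1 sequence, assumed for illustration only
iob1 = ['I-ORG', 'O', 'I-MISC', 'O', 'O', 'I-PER', 'I-PER']
iob2 = iob1_to_iob2(iob1)
print(iob2)
```

You should not need this conversion for the files downloaded below, as they are already in IOB2. The sketch only shows what the change of scheme means.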


In [2]:
!wget https://data.deepai.org/conll2003.zip
!unzip -u conll2003.zip
!mkdir conll2003
!mv train.txt valid.txt test.txt conll2003
!rm conll2003.zip

--2024-04-27 11:14:26--  https://data.deepai.org/conll2003.zip
Resolving data.deepai.org (data.deepai.org)... 138.199.37.229
Connecting to data.deepai.org (data.deepai.org)|138.199.37.229|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 982975 (960K) [application/zip]
Saving to: 'conll2003.zip'


2024-04-27 11:14:26 (3.49 MB/s) - 'conll2003.zip' saved [982975/982975]

Archive:  conll2003.zip
  inflating: metadata                
  inflating: test.txt                
  inflating: train.txt               
  inflating: valid.txt               
mkdir: conll2003: File exists
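
Once the files are in place, a reader along the following lines can load them. This is a minimal sketch, assuming four whitespace-separated columns per token (word, part of speech, chunk tag, named entity tag), blank lines between sentences, and `-DOCSTART-` lines between documents. The function `read_conll` is hypothetical and the sample lines are hand-written in that assumed format.

```python
def read_conll(lines):
    """Group CoNLL-style lines into sentences of column tuples."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        # A blank line or a document marker closes the current sentence
        if not line or line.startswith('-DOCSTART-'):
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(tuple(line.split()))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample in the assumed format, for illustration only
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = read_conll(sample)
print(len(sentences), sentences[0][0])
```

To read the real data, you could pass an open file instead of the sample, for instance `read_conll(open('conll2003/train.txt', encoding='utf-8'))`.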


## The evaluation script

You will train the models with the training set and evaluate them with the test set. For this, you will apply the `conlleval` script, which computes the F1 score: the harmonic mean of precision and recall.
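
To illustrate what the score measures, here is a minimal sketch of an entity-level F1 computation on one IOB2 sequence. It is not the code of `conlleval`: the functions `extract_spans` and `precision_recall_f1` are hypothetical, the tag sequences are made up, and ill-formed `I-` tags without a preceding `B-` are simply ignored, which is a simplifying assumption.

```python
def extract_spans(tags):
    """Return the set of (start, end, label) entity spans in an IOB2 sequence."""
    spans = set()
    start, label = None, None
    # The final 'O' is a sentinel that closes a span ending the sequence
    for i, tag in enumerate(list(tags) + ['O']):
        inside = tag.startswith('I-') and tag[2:] == label
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith('B-'):
            start, label = i, tag[2:]
    return spans


def precision_recall_f1(gold_tags, predicted_tags):
    """Compare predicted spans with gold spans; a span must match exactly."""
    gold = extract_spans(gold_tags)
    predicted = extract_spans(predicted_tags)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up sequences: one of two gold entities is found, one prediction is wrong
gold = ['B-PER', 'I-PER', 'O', 'B-LOC', 'O']
pred = ['B-PER', 'I-PER', 'O', 'O', 'B-ORG']
p, r, f1 = precision_recall_f1(gold, pred)
print(f'P={p:.2f} R={r:.2f} F1={f1:.2f}')
```

Note that an entity counts as correct only if both its boundaries and its label match. For a whole corpus, the counts would be accumulated over all the sentences before computing the ratios.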

`conlleval` was written in Perl. Some people rewrote it in Python, and you will use one such translation in this lab. The cell below installs it. The source code is available at https://github.com/kaniblu/conlleval

In [3]:
!pip install conlleval

Collecting conlleval
  Downloading conlleval-0.2-py3-none-any.whl.metadata (171 bytes)
Downloading conlleval-0.2-py3-none-any.whl (5.4 kB)
Installing collected packages: conlleval
Successfully installed conlleval-0.2


## Collecting the Embeddings

You will represent the words with dense vectors instead of a one-hot encoding. GloVe embeddings are one such representation. The GloVe files contain a list of words, where each word is represented by a vector of a fixed dimension. In this notebook, we will use the files of 400,000 lowercase words with 50- and 100-dimensional vectors.

Download either:
* The GloVe embeddings 6B from [https://nlp.stanford.edu/projects/glove/](https://nlp.stanford.edu/projects/glove/) and keep the 50d and 100d vectors; or
* A local copy of this dataset with the cells below (faster)

In [4]:
!wget https://fileadmin.cs.lth.se/nlp/nobackup/embeddings/nobackup/glove.6B.100d.txt.gz

--2024-04-27 11:16:42--  https://fileadmin.cs.lth.se/nlp/nobackup/embeddings/nobackup/glove.6B.100d.txt.gz
Resolving fileadmin.cs.lth.se (fileadmin.cs.lth.se)... 130.235.16.7
Connecting to fileadmin.cs.lth.se (fileadmin.cs.lth.se)|130.235.16.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 134409071 (128M) [application/x-gzip]
Saving to: 'glove.6B.100d.txt.gz'


2024-04-27 11:16:45 (43.1 MB/s) - 'glove.6B.100d.txt.gz' saved [134409071/134409071]



In [5]:
!wget https://fileadmin.cs.lth.se/nlp/nobackup/embeddings/nobackup/glove.6B.50d.txt.zip

--2024-04-27 11:17:24--  https://fileadmin.cs.lth.se/nlp/nobackup/embeddings/nobackup/glove.6B.50d.txt.zip
Resolving fileadmin.cs.lth.se (fileadmin.cs.lth.se)... 130.235.16.7
Connecting to fileadmin.cs.lth.se (fileadmin.cs.lth.se)|130.235.16.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69240158 (66M) [application/zip]
Saving to: 'glove.6B.50d.txt.zip'


2024-04-27 11:17:26 (41.5 MB/s) - 'glove.6B.50d.txt.zip' saved [69240158/69240158]



In [6]:
!gunzip -k glove.6B.100d.txt.gz
!unzip -u glove.6B.50d.txt.zip
!mkdir glove
!mv glove.6B.100d.txt glove
!mv glove.6B.50d.txt glove
!rm glove.6B.100d.txt.gz glove.6B.50d.txt.zip

Archive:  glove.6B.50d.txt.zip
  inflating: glove.6B.50d.txt        
  inflating: __MACOSX/._glove.6B.50d.txt  
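
With the files in the `glove` folder, the embeddings can be read into a dictionary. The following is a minimal sketch, assuming each line holds a word followed by its space-separated vector components. The functions `read_embeddings` and `cosine` are hypothetical, and the toy three-dimensional vectors are invented for illustration; they are not real GloVe values.

```python
import math


def read_embeddings(lines):
    """Map each word to its vector, stored as a list of floats."""
    embeddings = {}
    for line in lines:
        values = line.rstrip().split(' ')
        if len(values) < 2:
            continue  # skip empty or malformed lines
        embeddings[values[0]] = [float(v) for v in values[1:]]
    return embeddings


def cosine(u, v):
    """Cosine similarity between two vectors of the same dimension."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy vectors in the assumed file format, for illustration only
toy_lines = ['the 0.1 0.2 0.3', 'of 0.2 0.4 0.6', 'table -0.3 0.1 0.0']
toy = read_embeddings(toy_lines)
print(len(toy), len(toy['the']))
print(cosine(toy['the'], toy['of']))
```

To load the real vectors, you could pass an open file, for instance `read_embeddings(open('glove/glove.6B.100d.txt', encoding='utf-8'))`. In practice, you would probably convert the lists to NumPy arrays or tensors before using them in a model.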
