# Assignment #4: Extracting syntactic groups using machine-learning techniques: Prerequisites
Author: Pierre Nugues

__You must execute this notebook before you start the assignment.__

The goal of the assignment is to create a system to extract syntactic groups from a text. You will apply it to the CoNLL 2000 dataset. 

In this part, you will collect the datasets and the files you need to train your models. You will also collect the script you need to evaluate them.

## Collecting the Training and Test Sets

As annotated data and annotation scheme, you will use the data available from [CoNLL 2000](https://www.clips.uantwerpen.be/conll2000/chunking/).
1. Read the description of the CoNLL 2000 task.
2. Download both the training and test sets and decompress them, as shown in the cells below.

CoNLL 2000 is an early dataset and, contrary to many current ones, it has no development set.

You can also download the dataset from Hugging Face: https://huggingface.co/datasets/conll2000

In [4]:
%pip install requests

import requests

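def download_file(url, local_filename):
    """Stream the file at url to local_filename in 8 KB chunks."""
    # The timeout keeps the cell from hanging if the server stops responding
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)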

download_file('http://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz', 'train.txt.gz')
download_file('http://www.clips.uantwerpen.be/conll2000/chunking/test.txt.gz', 'test.txt.gz')


Note: you may need to restart the kernel to use updated packages.




In [5]:
import gzip
import shutil

def decompress_gz_file(input_file_path, output_file_path):
    """Decompress a gzip file to output_file_path without loading it all in memory."""
    with gzip.open(input_file_path, 'rb') as f_in:
        with open(output_file_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

decompress_gz_file('train.txt.gz', 'train.txt')
decompress_gz_file('test.txt.gz', 'test.txt')


In [6]:
import os
import shutil

# Create directory
os.makedirs('corpus', exist_ok=True)

# Move files
shutil.move('train.txt', 'corpus/train.txt')
shutil.move('test.txt', 'corpus/test.txt')


'corpus/test.txt'
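
Once the files are in `corpus`, you may want to check that they parse as expected. The sketch below assumes the usual CoNLL 2000 layout: one token per line with three space-separated columns (word, part of speech, chunk tag) and a blank line between sentences. The function name `read_sentences` and the second sample sentence are made up for illustration; they are not part of the assignment code.

```python
def read_sentences(lines):
    """Group CoNLL 2000 lines into sentences of (word, pos, chunk) tuples."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if line:
            # Assumed layout: word, part of speech, chunk tag
            word, pos, chunk = line.split()
            current.append((word, pos, chunk))
        elif current:
            # A blank line closes the current sentence
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


# Small in-memory sample in the same layout as the corpus files
sample = """He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP

The DT B-NP
dog NN I-NP
barks VBZ B-VP
. . O
""".splitlines()

sentences = read_sentences(sample)
print(len(sentences), sentences[0][:2])
```

To run it on the real data, pass an open file instead of the sample, for instance `read_sentences(open('corpus/train.txt', encoding='utf-8'))`.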

## The evaluation script

You will train the models on the training set and evaluate them on the test set. For the evaluation, you will apply the `conlleval` script, which computes the precision, the recall, and their harmonic mean: F1.
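
As a reminder of what the harmonic mean amounts to, here is a minimal sketch that computes precision, recall, and F1 from chunk counts. The function `f1_score` and the counts are hypothetical and only illustrate the formula; this is not the `conlleval` code.

```python
def f1_score(correct, predicted, gold):
    """Return (precision, recall, F1) from chunk counts.

    correct:   number of predicted chunks that match a gold chunk
    predicted: number of chunks the system proposed
    gold:      number of chunks in the annotated data
    """
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 100 chunks proposed, 120 in the gold data, 80 correct
p, r, f1 = f1_score(correct=80, predicted=100, gold=120)
print(f'{p:.3f} {r:.3f} {f1:.3f}')  # 0.800 0.667 0.727
```

The harmonic mean is pulled toward the lower of the two values, so a system cannot reach a high F1 by maximizing precision or recall alone.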

`conlleval` was originally written in Perl. It has since been rewritten in Python, and you will use one such translation in this lab. The line below installs it. The source code is available from this address: https://github.com/kaniblu/conlleval

In [7]:
%pip install conlleval

Collecting conlleval
  Downloading conlleval-0.2-py3-none-any.whl (5.4 kB)
Installing collected packages: conlleval
Successfully installed conlleval-0.2
Note: you may need to restart the kernel to use updated packages.




## Collecting the Embeddings

You will represent the words with dense vectors instead of a one-hot encoding. GloVe embeddings are one such representation. The GloVe files contain a list of words, where each word is represented by a vector of a fixed dimension. In this notebook, we will use the file of 400,000 lowercase words with 100-dimensional vectors.
Download either:
* The GloVe embeddings 6B from [https://nlp.stanford.edu/projects/glove/](https://nlp.stanford.edu/projects/glove/) and keep the 100d vectors; or
* A local copy of this dataset with the cell below (faster)

In [8]:
download_file("https://fileadmin.cs.lth.se/nlp/nobackup/embeddings/nobackup/glove.6B.100d.txt.gz", "glove.6B.100d.txt.gz")

In [9]:
decompress_gz_file('glove.6B.100d.txt.gz', 'glove.6B.100d.txt')

shutil.move('glove.6B.100d.txt', 'corpus/glove.6B.100d.txt')

'corpus/glove.6B.100d.txt'
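
To check the download, you can load the vectors into a dictionary. The sketch below assumes each line of the GloVe file holds a word followed by its vector components, separated by spaces. The function name `read_embeddings` and the three-dimensional sample are made up for illustration; the real file has 100 values per word.

```python
import io


def read_embeddings(lines):
    """Parse GloVe-style lines into a dict mapping each word to a list of floats."""
    embeddings = {}
    for line in lines:
        values = line.strip().split()
        if not values:
            continue  # skip empty lines
        word = values[0]
        embeddings[word] = [float(v) for v in values[1:]]
    return embeddings


# Made-up sample in the same layout (3 dimensions instead of 100)
sample = io.StringIO("the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n")
embeddings = read_embeddings(sample)
print(len(embeddings), len(embeddings['the']))
```

With the real file, you would pass an open file handle instead: `with open('corpus/glove.6B.100d.txt', encoding='utf-8') as f: embeddings = read_embeddings(f)`. In practice you will probably convert the lists to NumPy arrays or tensors before feeding them to a model.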