# Problem Set 1 - Regex and Tokenization
One of the difficulties of the German language is the phenomenon of merging many words into one. To a newcomer to the language, this is particularly obvious and painful with numbers. For example:
* fünfhundretfünfzehn
* achthundretzehn
* dreitausendvierhundretachtzehn
* dreitausendneunhundretneunundsiebzig
* eintausendneunhundretzweiundneunzig

If we wanted to write an algorithm that took a number such as *eintausendneunhundretzweiundneunzig* and converted it into 1992, we might want to start by breaking that word up into its constituents.
This task is an extreme example of one of the fundamental tasks in NLP, tokenization: the act of breaking up text into its "atomic" elements.

## What we'll do 

The exercises in this notebook will walk you through some building-block techniques in NLP, namely string searching and regular expressions. We'll use both techniques to try to break up the words into the most relevant tokens.

## Prerequisites
Start by working through the amazing regex tutorial at [RegexOne](https://regexone.com/)

## Getting Data
We've conveniently provided a utility that will generate data for you. Run the following code to get a dictionary that maps numerical values to their verbal representation:
**code**
```python
from pprint import pprint
from utils.numtoWord import createNum2WordDict
d = createNum2WordDict(100, 10000)
pprint(d)  # pretty-print: one key per line, sorted, as in the output below
```
**output**
```
{2360: 'zweitausenddreihundretundsechzig',
 2518: 'zweitausendfünfhundretachtzehn',
 4080: 'viertausendundachtzig',
 4808: 'viertausendachthundretacht',
 5785: 'fünftausendsiebenhundretfünfundachtzig',
 6002: 'sechstausendzwei',
 6289: 'sechstausendzweihundretneunundachtzig',
 7157: 'siebentausendeinhundretsiebenundfünfzig',
 8930: 'achttausendneunhundretunddreiβig',
 9455: 'neuntausendvierhundretfünfundfünfzig'}
```

You want to reach something that maps zweitausenddreihundretundsechzig to
*zweitausend dreihundret und sechzig*
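
As a first taste of that mapping, here is a minimal sketch rather than the intended solution. It assumes that a hand-picked list of unit words (`tausend`, `hundret`, `und`) is enough for this one example; the helper name `add_spaces` is hypothetical.

```python
import re

# Hypothetical, hand-picked unit words. The scan runs left to right, so
# "hundret" is consumed as a whole before the "und" inside it can match.
UNIT_PATTERN = re.compile(r"tausend|hundret|und")

def add_spaces(word):
    """Insert a space after every unit word found in `word`."""
    spaced = UNIT_PATTERN.sub(lambda m: m.group(0) + " ", word)
    return spaced.strip()  # drop the trailing space if the word ends in a unit

print(add_spaces("zweitausenddreihundretundsechzig"))
# zweitausend dreihundret und sechzig
```

Note that this only adds a space *after* each unit, so a word like `fünfundachtzig` would come out as `fünfund achtzig`. Handling such cases properly is part of the exercises below.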



**code**
```python
# Load some data
from pprint import pprint
from utils.numtoWord import createNum2WordDict
d = createNum2WordDict(size=10, high=100000000)  # Get 10 random numbers between 0 and 100,000,000
pprint(d)
```

**output**
```
{2256795: 'zwei Million '
          'zweihundretsechsundfünfzigtausendsiebenhundretfünfundneunzig',
 8676914: 'acht Million sechshundretsechsundsiebzigtausendneunhundretvierzehn',
 14521415: 'vierzehn Million '
           'fünfhundreteinundzwanzigtausendvierhundretfünfzehn',
 29692583: 'neunundzwanzig Million '
           'sechshundretzweiundneunzigtausendfünfhundretdreiundachtzig',
 31690331: 'einunddreiβig Million '
           'sechshundretundneunzigtausenddreihundreteinunddreiβig',
 50723793: 'undfünfzig Million '
           'siebenhundretdreiundzwanzigtausendsiebenhundretdreiundneunzig',
 55623689: 'fünfundfünfzig Million '
           'sechshundretdreiundzwanzigtausendsechshundretneunundachtzig',
 74613093: 'vierundsiebzig Million sechshundretdreizehntausenddreiundneunzig',
 79116293: 'neunundsiebzig Million '
           'einhundretsechszehntausendzweihundretdreiundneunzig',
 89692350: 'neunundachtzig Million '
           'sechshundretzweiundneunzigtausenddreihundretundfünfzig'}
```


## Exercise 1 - Naive Pattern Matching
Sample 1000 numbers between 1 and 100. Write a Python function that will list all of the "number-names" (1-9) that appear in each sampled word.
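
One possible starting point, offered as a sketch and not as the reference solution, is a plain substring check with Python's `in` operator. The list `NUMBER_NAMES` and the function name `find_number_names` are assumptions made for illustration; the spellings may need adjusting to what the generator actually emits.

```python
# Assumed spellings of the number-names 1-9 (not taken from the utility).
NUMBER_NAMES = ["ein", "zwei", "drei", "vier", "fünf", "sechs", "sieben", "acht", "neun"]

def find_number_names(word):
    """Return every number-name that occurs as a substring of `word`."""
    return [name for name in NUMBER_NAMES if name in word]

for word in ["fünfundachtzig", "einundzwanzig", "sechzig"]:
    print(word, find_number_names(word))
```

The last word shows a limit of the naive approach: `sechzig` contains `sech`, not `sechs`, so nothing is found. The first word shows the opposite problem: `acht` is reported even though it is only part of the tens word `achtzig`.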

## Exercise 2 - Smarter Pattern Matching
Install the [FlashText](https://github.com/vi3k6i5/flashtext) library or [py-aho-corasick](https://github.com/JanFan/py-aho-corasick) and repeat Exercise 1 with it.
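
To give a feel for what multi-keyword matching does, here is a small pure-Python sketch of the Aho-Corasick idea: build a trie of the keywords, add failure links, then scan the text once. It deliberately does not use either library's API, and the names `build_automaton` and `find_all` are hypothetical.

```python
from collections import deque

def build_automaton(keywords):
    """Build trie transitions, failure links and outputs for the keywords."""
    goto, fail, out = [{}], [0], [[]]
    for kw in keywords:
        node = 0
        for ch in kw:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(kw)
    # Breadth-first pass: depth-1 nodes keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_index, keyword) for every keyword occurrence in text."""
    goto, fail, out = automaton
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for kw in out[node]:
            hits.append((i - len(kw) + 1, kw))
    return hits

# Assumed spellings of the number-names 1-9.
names = ["ein", "zwei", "drei", "vier", "fünf", "sechs", "sieben", "acht", "neun"]
automaton = build_automaton(names)
print(find_all("fünfundachtzig", automaton))
# [(0, 'fünf'), (7, 'acht')]
```

Unlike the naive approach, the text is read only once no matter how many keywords there are, which is the main reason to reach for a dedicated library.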

## Exercise 3 - Regular Expressions
Sample numbers between 1 and 1,000,000,000.
Use the Python `re` module and the `re.split` function to split the numbers into millions, thousands, hundreds and tens.
For example:
```
vierzehnmillionfünfhundreteinundzwanzigtausendvierhundretfünfzehn
```
Should map to
```
vierzehnmillion
fünfhundreteinundzwanzigtausend
vierhundret
fünfzehn
```
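
One way to approach this, sketched under assumptions, is to split on zero-width positions using lookbehinds (this needs Python 3.7 or later, where `re.split` can split on empty matches). The sketch assumes the unit words are spelled `million`, `tausend` and `hundret` with no spaces, as in the example above; `split_number` is a hypothetical name.

```python
import re

# Split right after "million" and "tausend", and after "hundret" only when
# no larger unit follows (so the hundreds inside the thousands stay intact).
SPLIT_PATTERN = re.compile(
    r"(?<=million)"
    r"|(?<=tausend)"
    r"|(?<=hundret)(?!.*(?:million|tausend))"
)

def split_number(word):
    """Split a number word into its million/thousand/hundred/rest parts."""
    # A split at the very end of the word yields an empty string; drop it.
    return [part for part in SPLIT_PATTERN.split(word) if part]

parts = split_number(
    "vierzehnmillionfünfhundreteinundzwanzigtausendvierhundretfünfzehn"
)
print("\n".join(parts))
```

The negative lookahead is the design choice worth noticing: without it, `fünfhundret` would be cut off from `einundzwanzigtausend`.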

## Exercise 4 - Regular Expression Classifier
We can do clever things with regular expressions. In this case we'll use regular expressions to decide if a given number is even or odd. Write a function, powered by a single regular expression, that receives a number (in word form) and returns 1 if it is even and 0 if it is odd. 
For example
```
f("vierzehnmillion") =1
f("sechshundretdreiundzwanzigtausendsechshundretneunundachtzig") =0
```
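
A possible shape for such a function is sketched below. It rests on a stated assumption: a number is odd exactly when the word ends in an odd ones-word (optionally followed by `und` and a tens-word) or in an odd teen. The word lists are assumed spellings, not taken from the utility, and the input is assumed to contain no spaces.

```python
import re

# Single regular expression that matches odd numbers only.
# Both "ß" and the "β" seen in the generated data are accepted.
ODD_PATTERN = re.compile(
    r"(?:eins?|drei|fünf|sieben|neun)"  # odd ones-word ...
    r"(?:und(?:zwanzig|drei[ßβ]ig|vierzig|fünfzig"
    r"|sechzig|siebzig|achtzig|neunzig))?$"  # ... optionally "und" + tens
    r"|(?:elf|dreizehn|fünfzehn|sieb(?:en)?zehn|neunzehn)$"  # odd teens
)

def f(word):
    """Return 1 if the number word looks even, 0 if it looks odd."""
    return 0 if ODD_PATTERN.search(word) else 1

print(f("vierzehnmillion"))  # 1
print(f("sechshundretdreiundzwanzigtausendsechshundretneunundachtzig"))  # 0
```

Spelling out the tens-words matters: a looser `und\w+$` would wrongly treat `dreiundzwanzigtausend` as odd, because `drei` is followed by `und` and the word eventually ends.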

## Exercise 5 - Named Capture Groups
Use Python's named capture groups ([Explanation](https://www.regular-expressions.info/named.html)) ([Reference](https://docs.python.org/3/library/re.html#regular-expression-syntax)) to split a number into its constituent parts as a dictionary.
Use named capture groups and the match object's **groupdict** method to get a dictionary. For example:

```python
# yourRegex is the compiled pattern you write for this exercise
match = yourRegex.search("zweihundretsechsundfünfzigtausendsiebenhundretfünfundneunzig")
match.groupdict()
# Expected result:
# {
#     "thousands": "zweihundretsechsundfünfzig",
#     "hundreds": "sieben",
#     "tens": "neunzig",
#     "ones": "fünf",
# }
```
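
As a hedged illustration of the syntax, the sketch below uses `(?P<name>...)` groups with lazy quantifiers. The pattern name `NUMBER_PARTS` is hypothetical, and the pattern is only checked against the example above; words without an `und` (for instance a bare ones-word) would land in the `tens` group, so treat it as a starting point.

```python
import re

# Each optional part is captured lazily up to its unit word.
NUMBER_PARTS = re.compile(
    r"^(?:(?P<thousands>.+?)tausend)?"
    r"(?:(?P<hundreds>.+?)hundret)?"
    r"(?:(?P<ones>.+?)und)?"
    r"(?P<tens>.*)$"
)

match = NUMBER_PARTS.search(
    "zweihundretsechsundfünfzigtausendsiebenhundretfünfundneunzig"
)
parts = match.groupdict()
print(parts)
```

Groups that do not take part in the match are returned as `None` by `groupdict()`, which makes it easy to see which parts a given number has.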
