<b>This notebook is just a tutorial for you to get familiar with skip-gram and MapReduce.  
<font color="red">You don't need to hand in this notebook</font>, so feel free to jump to [Requirement section](#Assignment-Requirement) and directly work on your `mapper.py` and `reducer.py` if you already have the idea of how to do so.</b>  

# Week 05: Skip-gram and MapReduce

In previous assignments, you learned the concept of n-grams and how to generate them.  
This week, we introduce another type of gram, called the *skip-gram*.  
We are also going to compute it on a large dataset, so you'll have to process the data with the MapReduce technique.  

So, first things first: what is a skip-gram?  

## Skip-gram

<i>\[S\]kip-grams are a generalization of n-grams in which the components (typically words) need not be consecutive in the text under consideration, but may leave gaps that are skipped over.  - from [Wikipedia](https://en.wikipedia.org/wiki/N-gram#Skip-gram)</i>  

That is, a skip-gram is the same as an n-gram, except that it is allowed to skip some words in between.  
In the sentence <i>"Strong winds blew roofs away"</i>, two of its bigrams are <i>"winds blew"</i> and <i>"blew roofs"</i>, while <i>"blew away"</i> is one of the skip-grams with distance 2, since it skips one word, <i>"roofs"</i>.  
As you can see, a skip-gram is able to capture a phrase whose words are separated by other words.  
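
As a small illustration of the difference, the sketch below lists the bigrams of the sentence above and its skip-grams at distance 2. The helper name `pairs_at_distance` is made up for this example and only looks to the right of each word.

```python
def pairs_at_distance(tokens, distance):
    """Pair every token with the token `distance` positions to its right."""
    return [(tokens[i], tokens[i + distance])
            for i in range(len(tokens) - distance)]

tokens = "Strong winds blew roofs away".split()

bigrams = pairs_at_distance(tokens, 1)      # adjacent words
skipgrams_2 = pairs_at_distance(tokens, 2)  # one word skipped in between

print(bigrams)
print(skipgrams_2)
```

The pair `("blew", "away")` shows up only in the second list, because *roofs* sits between the two words.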

Now consider another sentence

> "Skip-gram is used to predict the context word for a given target word".

With a pivot word *predict*, all of its skip-grams within distance 5 are as below:
```
.------------------------------------------------------------------------------.
| distance || -5 |     -4    | -3 |  -2  | -1 |  1  |    2    |  3   |  4  | 5 |
|----------||----|-----------|----|------|----|-----|---------|------|-----|---|
| predict  || -  | Skip-gram | is | used | to | the | context | word | for | a |
'------------------------------------------------------------------------------'
```

<a name="Practice"></a>
### Practice: Distance table of skip-gram

Now, let's practice!  
Given the sentence <i>"Skip-gram is used to predict the context word for a given target word"</i>, **<u>output all of its skip-grams with distance between -3 and 3</u> and show the result in a table**.  

**Example**
```
distance      -3            -2            -1            1             2             3             
--------------------------------------------------------------------------------------------
Skip-gram     -             -             -             is            used          to            
is            -             -             Skip-gram     used          to            predict       
used          -             Skip-gram     is            to            predict       the           
to            Skip-gram     is            used          predict       the           context       
predict       is            used          to            the           context       word          
the           used          to            predict       context       word          for           
context       to            predict       the           word          for           a             
word          predict       the           context       for           a             given         
for           the           context       word          a             given         target        
a             context       word          for           given         target        word          
given         word          for           a             target        word          -             
target        for           a             given         word          -             -             
word          a             given         target        -             -             -             
```

\*Hint: Try to get the skip-grams for a single word first if you have trouble generating them all at once. 
```
(predict, is, -3)
(predict, used, -2)
(predict, to, -1)
(predict, the, 1)
(predict, context, 2)
(predict, word, 3)
```
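
The hint above can be sketched as follows: collect the skip-grams of one pivot position first, and only then loop over all positions. The function name `skip_grams_for` and its parameters are illustrative assumptions, not part of the assignment template.

```python
def skip_grams_for(tokens, idx, max_dis=3):
    """Return (pivot, word, distance) for every neighbour of tokens[idx] within max_dis."""
    result = []
    for dis in range(-max_dis, max_dis + 1):
        pos = idx + dis
        # skip the pivot itself and positions outside the sentence
        if dis != 0 and 0 <= pos < len(tokens):
            result.append((tokens[idx], tokens[pos], dis))
    return result

tokens = "Skip-gram is used to predict the context word for a given target word".split()
for gram in skip_grams_for(tokens, tokens.index("predict")):
    print(gram)
```

Calling the same function for every index in `range(len(tokens))` would give the skip-grams of the whole sentence.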

In [116]:
# tokens = "Skip-gram is used to predict the context word for a given target word".split()
def skip_gram(tokens, max_dis=5):
    """Return (pivot, word, distance, count) tuples for all pairs within max_dis."""
    token_length = len(tokens)
    grams = []

    for idx in range(token_length):
        # words before the pivot: distance -1, -2, ..., -max_dis
        for i in range(-1, -max_dis - 1, -1):
            if idx + i >= 0:
                grams.append((tokens[idx], tokens[idx + i], i, 1))
        # words after the pivot: distance 1, 2, ..., max_dis
        for i in range(1, max_dis + 1):
            if idx + i < token_length:
                grams.append((tokens[idx], tokens[idx + i], i, 1))

    return grams
#skip_gram
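
To show the result as a table like the example in the practice, one option is to pad every cell to a fixed width. This is only a sketch: the name `distance_table` and the column width of 14 are assumptions chosen to resemble the example, and `-` marks positions that fall outside the sentence.

```python
def distance_table(tokens, max_dis=3, width=14):
    """Build the lines of a distance table, one line per pivot word."""
    distances = [d for d in range(-max_dis, max_dis + 1) if d != 0]
    rows = [["distance"] + [str(d) for d in distances]]
    for idx, pivot in enumerate(tokens):
        cells = [pivot]
        for d in distances:
            pos = idx + d
            cells.append(tokens[pos] if 0 <= pos < len(tokens) else "-")
        rows.append(cells)
    # left-align every cell in a fixed-width column
    return ["".join(cell.ljust(width) for cell in row) for row in rows]

tokens = "Skip-gram is used to predict the context word for a given target word".split()
for line in distance_table(tokens):
    print(line)
```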

## MapReduce

<i>MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. - from [Wikipedia](https://en.wikipedia.org/wiki/MapReduce)</i> 

### Why MapReduce?

Imagine that you are working on a pretty large dataset, say all pages on Wikipedia (whose size had already reached 94GB in 2013).  
Most likely you are not able to process the whole corpus in memory or on a single computer. Even a simple frequency counter would be challenging with this much data.  
To deal with this, Google proposed a big-data processing model called MapReduce, which has since been implemented and supported by many distributed computing systems, such as Apache Hadoop.  
The core concept of MapReduce is to **split, apply and then combine**, so that each data segment can be handled separately.  

### Mapper-Shuffler-Reducer

![](https://www.todaysoftmag.com/images/articles/tsm33/large/a11.png)
<small><i> - image source: [Today Software Magazine](https://www.todaysoftmag.com/article/1358/hadoop-mapreduce-deep-diving-and-tuning)</i></small> 

As you can see in the picture:  
First, the whole dataset is split into smaller partitions, each of which can be processed by an independent machine.  
In this step, **mappers** generate one or more key-value pairs that can easily be grouped.  
 - example: in a word counter, the mapper emits each word together with the count for that occurrence.  

Then, we will **shuffle** and group all outputs from mappers.  
 - example: sort the output from mappers.  

Lastly, we can combine the grouped values and **reduce** them into final results.  
 - example: calculate total frequency in each group.  
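
The three stages can be sketched on a toy word counter. This runs on a single machine and only imitates the data flow; the names `mapper` and `reducer` and the sample sentences are made up for illustration.

```python
from itertools import groupby

def mapper(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    return [(word, 1) for word in line.lower().split()]

def reducer(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)

partitions = ["the cat sat", "the dog sat", "the end"]

# Map: each partition could be handled by a different machine.
mapped = [pair for line in partitions for pair in mapper(line)]

# Shuffle: sorting puts identical keys next to each other.
shuffled = sorted(mapped)

# Reduce: combine each group of identical keys into one result.
word_counts = [reducer(word, (count for _, count in group))
               for word, group in groupby(shuffled, key=lambda pair: pair[0])]

print(word_counts)
```

Note that `groupby` only merges neighbouring items with the same key, which is why the shuffle step has to sort first.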

## MapReduce for skip-gram

Now, having the concepts of skip-gram and MapReduce in mind, it's time to put them together: let's generate a skip-gram table with the MapReduce technique!  

It may sound scary to some of you, so let's break it down first.  
There are 3 steps, each described below:
1. **Mapper**: Print every skip-gram with its distance information and its current count.  
   ```
   a b -3 1
   a c 3  1
   c e -2 1
   a c 1  1
   a b -3 1
   b d 2  1
   ```
2. **Shuffler**: Group all skip-grams by their text. This can easily be achieved by sorting.  
   ```
   a b -3 1
   a b -3 1
   a c 1  1
   a c 3  1
   b d 2  1
   c e -2 1
   ```
3. **Reducer** :  
   Since the results have been sorted in the previous step, identical lines are now next to each other, so we can easily calculate the frequency of each skip-gram at each distance.  
   For example, the frequency of skip-gram `a b` with distance $-3$ is $1+1=2$, while the frequency of every other skip-gram is $1$.
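
The three steps above can be traced on the toy data with a few lines of Python. This is just a sketch of the idea on in-memory tuples; the real mapper and reducer work on text lines instead.

```python
from itertools import groupby

# Mapper output from step 1: (pivot, word, distance, count)
mapped = [
    ("a", "b", -3, 1),
    ("a", "c", 3, 1),
    ("c", "e", -2, 1),
    ("a", "c", 1, 1),
    ("a", "b", -3, 1),
    ("b", "d", 2, 1),
]

# Shuffler: sort so that identical (pivot, word, distance) keys are adjacent.
shuffled = sorted(mapped)

# Reducer: sum the counts within each run of identical keys.
reduced = [(key, sum(row[3] for row in group))
           for key, group in groupby(shuffled, key=lambda row: row[:3])]

for key, total in reduced:
    print(key, total)
```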

### Step 1: Mapper

First, in the mapper we want to generate all skip-grams within distance $-5$ to $5$.  
Remember that you've already done something similar in [previous Practice](#Practice)? Just modify it to MapReduce format!  

Output: 
 - `"{pivot}\t{word}\t{distance}\t{count}"`  

Example: 
```
predict is  -3  1
predict used    -2  1
predict to  -1  1
predict the 1   1
...
```
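
A minimal sketch of a mapper in this output format is shown below. It assumes the same tokenization as the notebook cell that follows (punctuation replaced by spaces, everything lowercased); the names `map_line` and `run` are hypothetical, and a `StringIO` stands in for the real input so the snippet runs on its own.

```python
import io
import re

MAX_DIS = 5  # largest distance, in either direction

def map_line(line, max_dis=MAX_DIS):
    """Yield one tab-separated pivot/word/distance/count record per skip-gram in a line."""
    tokens = re.sub(r"[^\w\s]", " ", line).lower().split()
    for idx, pivot in enumerate(tokens):
        for dis in range(-max_dis, max_dis + 1):
            pos = idx + dis
            if dis != 0 and 0 <= pos < len(tokens):
                yield f"{pivot}\t{tokens[pos]}\t{dis}\t1"

def run(stream_in, stream_out):
    """Map every input line and write the records straight to the output."""
    for line in stream_in:
        for record in map_line(line):
            stream_out.write(record + "\n")

# In mapper.py this would be run(sys.stdin, sys.stdout).
demo_out = io.StringIO()
run(io.StringIO("to predict the\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Records are written as soon as they are produced, so nothing accumulates in memory no matter how long the input is.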



In [145]:
import os
import pandas as pd
import csv
import re

In [146]:
with open(os.path.join('data', 'wiki1G.txt')) as f:
    for line in f:
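        # tokens = re.findall(r'\S+', line)  # alternative: split on whitespace only
        # replace punctuation with spaces, lowercase, then split on whitespace
        tokens = re.sub(r'[^\w\s]', ' ', line).lower().split()
        grams = skip_gram(tokens)
        break # for the sake of this practice, just test the first page now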


In [147]:
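#df = pd.DataFrame(grams)
#df.to_csv('skip_gram/mapper.csv', index=False)
os.makedirs('skip_gram', exist_ok=True)  # make sure the output folder exists
# newline='' and lineterminator='\n' give plain "\n" line endings on every platform
with open('skip_gram/mapper.tsv', 'wt', newline='') as out_file:
    tsv_writer = csv.writer(out_file, delimiter='\t', lineterminator='\n')
    for gram in grams:
        tsv_writer.writerow(gram)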

In [148]:
os.system('sort -k1,3 < skip_gram/mapper.tsv > skip_gram/shuffler.tsv')

0

### Step 2: Shuffler

All we need to do in the shuffler is sorting, so let's use the built-in command to do this for us!  

Try this on your terminal/command prompt ;)  
(You can get the sample input from [here](https://drive.google.com/drive/folders/1vKxr--sLd2J4kdsXUzJDBZdG3AmV4NGl?usp=sharing))

**Unix**  
```bash
sort -k1,3 < mapper.sample.tsv
```
**Windows**
```powershell
type mapper.sample.tsv | sort
```
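
If you want to check what the shuffler should produce on a small sample, the same grouping can be sketched in Python. This loads every line into memory, so it is only meant for sample-sized files, not for the full dataset. The function name `shuffle_lines` is made up for this sketch.

```python
def shuffle_lines(lines):
    """Sort mapper records by pivot, word, and numeric distance."""
    def sort_key(line):
        pivot, word, distance, _count = line.rstrip("\n").split("\t")
        return pivot, word, int(distance)
    return sorted(lines, key=sort_key)

sample = [
    "a\tc\t3\t1\n",
    "c\te\t-2\t1\n",
    "a\tb\t-3\t1\n",
    "a\tc\t1\t1\n",
    "a\tb\t-3\t1\n",
]
print("".join(shuffle_lines(sample)), end="")
```

The exact order may differ from what `sort` prints on your system; what matters for the reducer is that records with the same pivot and word end up next to each other.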

### Step 3: Reducer

Since all the input has already been sorted by the shuffler, the reducer's task is pretty simple: just count how many times the same gram appears, and then print the count out!

Input: 
 - `"{pivot}\t{word}\t{distance}\t{count}"`
 - You can get a sample input file `shuffler.sample.tsv` from [here](https://drive.google.com/drive/folders/1vKxr--sLd2J4kdsXUzJDBZdG3AmV4NGl?usp=sharing)

Output: 
 - `"{pivot}\t{word}\t{total_freq}\t{-5}\t{-4}\t{-3}\t{-2}\t{-1}\t{1}\t{2}\t{3}\t{4}\t{5}"`
 - The first two columns are the skip-gram; the third column is the total frequency; columns 4\~13 are the frequencies at distances -5\~5, excluding 0.

Example:
 - `arouse  open    4       0       0       3       0       0       0       0       0       0       1`

Hints: 
1. Parse the input from shuffler
2. Check if this is the same skipgram as the previous one
3. If so, add the frequency according to its distance
4. If not, output the previous skipgram data

Note that you may NOT want to store all your counting results in a dict or any data structure.  
Recall that one purpose of MapReduce is to prevent memory exhaustion. It loses its value if you end up storing everything in memory again.  
Instead, <u>directly print it out or write it into a file</u> .  
(Don't get me wrong: of course you can store some temporary data, but let's not store the whole result and then print it out at once, okay?)
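
One way to follow the hints above without storing the whole result is sketched below. It is an illustration under assumptions, not the required solution: `reduce_stream` and `parse` are hypothetical names, `itertools.groupby` does the "same skip-gram as the previous one?" check, and a `StringIO` stands in for the sorted input.

```python
import io
from itertools import groupby

DISTANCES = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]

def parse(lines):
    """Turn each sorted input line into (pivot, word, distance, count)."""
    for line in lines:
        pivot, word, distance, count = line.split()
        yield pivot, word, int(distance), int(count)

def reduce_stream(lines):
    """Yield one output row per (pivot, word), holding only that group's counts."""
    for (pivot, word), group in groupby(parse(lines), key=lambda rec: rec[:2]):
        freq = dict.fromkeys(DISTANCES, 0)
        for _, _, distance, count in group:
            freq[distance] += count
        total = sum(freq.values())
        yield "\t".join([pivot, word, str(total)] + [str(freq[d]) for d in DISTANCES])

# In reducer.py the input would be sys.stdin.
shuffled = io.StringIO(
    "arouse\topen\t-3\t1\n"
    "arouse\topen\t-3\t1\n"
    "arouse\topen\t-3\t1\n"
    "arouse\topen\t5\t1\n"
    "arouse\tso\t-4\t1\n"
    "arouse\tso\t4\t1\n"
)
for row in reduce_stream(shuffled):
    print(row)
```

Only the ten counters of the current skip-gram are kept at any time; each finished row is handed out before the next group starts.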


In [189]:
def distance_column(distance):
    # Row layout: [pivot, word, total, -5, -4, -3, -2, -1, 1, 2, 3, 4, 5]
    # so distances -5..-1 map to columns 3..7 and distances 1..5 to columns 8..12.
    return 7 + distance if distance > 0 else 8 + distance

def generate_reducer2(reducer_list, skip_list, index=0, overwrite=False):
    distance, count = int(skip_list[2]), int(skip_list[3])
    if not overwrite:
        # new skip-gram: start a fresh row with all frequencies at 0
        reducer_list.append([skip_list[0], skip_list[1], 0] + [0] * 10)
        index = len(reducer_list) - 1
    row = reducer_list[index]
    row[2] += count                          # total frequency
    row[distance_column(distance)] += count  # frequency at this distance
    return reducer_list

# NOTE: this cell keeps every row in reducer_list, which is fine for the single
# test page used here; the real reducer.py should print each row as soon as it is complete.
pivot = ''; word = ''; reducer_list = []; index = 0
with open(os.path.join('skip_gram', 'shuffler.tsv')) as f:
    # 1) Parse the input from shuffler
    # 2) Check if this is the same skipgram
    # 3) If so, add the frequency according to its distance
    # 4) If not, output the previous skipgram data
    for line in f:
        skip_list = line.split()
        if pivot != skip_list[0] or word != skip_list[1]:
            # different skip-gram: append a new row
            pivot, word = skip_list[0], skip_list[1]
            reducer_list = generate_reducer2(reducer_list, skip_list)
            index += 1
        else:
            # same skip-gram as the previous line: update the last row
            reducer_list = generate_reducer2(reducer_list, skip_list, index - 1, True)
        


In [190]:
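# newline='' and lineterminator='\n' give plain "\n" line endings on every platform
with open('skip_gram/reducer.tsv', 'wt', newline='') as out_file:
    tsv_writer = csv.writer(out_file, delimiter='\t', lineterminator='\n')
    for reducer in reducer_list:
        tsv_writer.writerow(reducer)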


### Step 4: Combine them together!  

Now you can move your code above into mapper.py and reducer.py (with some tiny modifications, of course), and this is your assignment this week!   
See below for detailed requirement description.  

**Hints: What should I modify in my mapper and reducer?**  

1. Read the input from standard input and write the output to standard output, rather than using files (we've already done this for you)
2. Process the whole dataset, rather than only the first line

That's it!  

The processing takes some time (~1hr w/o parallel computing), so go enjoy some coffee or movies (or sleep) while you wait ;)

<a name="Assignment-Requirement"></a>
## Assignment Requirement 

1. You need to implement the `mapper.py` and `reducer.py` to calculate the skip-gram table.

2. In `mapper.py`, you need to generate skipgrams with distance within -5 to 5 (inclusive).  
   - Input: Pure text file (`wiki1G.txt`) with each line as a wikipage.
   - Output: `"{pivot}\t{word}\t{distance}\t{count}"`
   - Example: 
     ```
     predict is  -3  1
     predict used    -2  1
     predict to  -1  1
     predict the 1   1
     ...
     ```
   - Sample output: `mapper.sample.tsv` (Find it [here](https://drive.google.com/drive/folders/1vKxr--sLd2J4kdsXUzJDBZdG3AmV4NGl?usp=sharing); no need to be exactly the same)

3. In `reducer.py`, you have to collect the output from the shuffler (`sort`) and generate the skip-gram table.
   - Input: `"{pivot}\t{word}\t{distance}\t{count}"`
   - Output: 
     - `"{pivot}\t{word}\t{total}\t{-5}\t{-4}\t{-3}\t{-2}\t{-1}\t{1}\t{2}\t{3}\t{4}\t{5}"`
     - The first two columns are the skip-gram; the third column is the total frequency; columns 4\~13 are the frequencies at distances -5\~5, excluding 0.
   - Example:
     ```
     arouse  of      1       0       0       0       0       0       0       0       1       0       0
     arouse  open    4       0       0       3       0       0       0       0       0       0       1
     arouse  so      2       0       1       0       0       0       0       0       0       1       0
     arouse  sufficiently    1       0       1       0       0       0       0       0       0       0       0
     ...
     ```
   - Sample output: `reducer.sample.tsv` (Find it [here](https://drive.google.com/drive/folders/1vKxr--sLd2J4kdsXUzJDBZdG3AmV4NGl?usp=sharing); no need to be exactly the same)

4. Chain your mapper, shuffler, and reducer together and generate the skip-gram table on the wiki1G dataset
   - Unix: 
     - Use the [local map-reduce tool](https://github.com/dspp779/local-mapreduce) (faster),
     - or run it directly: `python mapper.py < wiki1G.txt | sort -k1,2 -k3n | python reducer.py > skipgram.tsv` (slower)
   - Windows: 
     - CMD: `python mapper.py < wiki1G.txt | sort | python reducer.py > skipgram.tsv`
     - PS: `type wiki1G.txt | python mapper.py | sort | python reducer.py > skipgram.tsv`
     - or the bash environment you installed last week.  
   - See [Appendix](#built-in-command) if you want to know what these commands mean

During the demo, you need to 

1. show us your skip-gram result on the given dataset, and
2. explain your implementation in `mapper.py` and `reducer.py`.  

Note that the final result will be a large file (~6 GB), so **you may want to show it with the `more` or `less` command**.  

## TA's note

Congratulations! You've learned how to calculate skip-gram frequencies and how to deal with a huge dataset using the MapReduce technique.  

Remember to <b><a href="https://docs.google.com/spreadsheets/d/1QGeYl5dsD9sFO9SYg4DIKk-xr-yGjRDOOLKZqCLDv2E/edit?usp=sharing">make an appointment with the TA</a> to demo/explain your implementation <u>before <font color="red">10/21 15:30</font></u></b> .  
You should also submit your `mapper.py` and `reducer.py` to <a href="https://eeclass.nthu.edu.tw/course/homework/3285">eeclass</a> .

<a name="built-in-command"></a>
## Appendix: useful built-in commands

Several built-in commands are very useful in the MapReduce procedure.  
Here we introduce `cat` and `type`, `<` and `>`, `sort`, and pipe `|`.  

### cat (on Unix)
The `cat` command, which definitely does not refer to some cute creatures (*meow~*), is short for `concatenate`. ([doc](https://man7.org/linux/man-pages/man1/cat.1.html))   

When you `cat` a file, it means you want to print the content from a file (or some files) to standard output.  
Now open your bash and test the command below!  
```bash
cat file.txt
```

You should see something like this: 

![picture](https://i.imgur.com/Z9shOYQ.png)

### type (on Windows)
`type` command works exactly the same as `cat` on Unix, but without its cute nickname (Shame on you, Windows). ([doc](https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/type))  

Similarly, if you `type` a file, it means you print the content from a file (or some files) to standard output.  

```powershell
type file.txt
```

You should see something like this:  
![](https://i.imgur.com/5WFhxkq.png)

### `>`? `<`? `>///<`? 

`<` and `>` are the I/O redirections.  
`program < filename` means that you want to redirect the input from a file to a program, while `program > filename` means that you want to redirect the output of a program to that file.  

For example,
```bash
echo "hello world" > greet.txt
```
writes the string "hello world" into a file `greet.txt`.  

On the other hand, 
```bash
head < greet.txt
```
makes `head` receive the content from `greet.txt`, so it will print out the string in `greet.txt`.  
![](https://i.imgur.com/swxv8LG.png)
<small>p.s. `>///<` is just a joke. Don't take it seriously.</small>

### sort

As its name suggests, `sort` sorts the data that it receives. (doc on [Linux](https://man7.org/linux/man-pages/man1/sort.1.html) and on [Windows](https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/sort))  
Try this:
```
sort sample.txt
```
You can see that the content has been sorted before being printed onto your screen.  
![](https://i.imgur.com/QFEq3Tc.png)

### Pipe `|`

A pipe passes the output of the previous program to the next program as its input.  
For example, 
```bash
python program.py | sort
```
will pass the output of `program.py` to the `sort` command.  
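
Redirection and pipes work with your own Python scripts because a script can read from standard input and write to standard output. A minimal sketch of such a filter is shown below; `to_upper` is a made-up example, and `StringIO` objects stand in for the real streams so the snippet can run on its own.

```python
import io

def to_upper(stream_in, stream_out):
    """Copy every line from stream_in to stream_out in upper case."""
    for line in stream_in:
        stream_out.write(line.upper())

# In a real script this would be to_upper(sys.stdin, sys.stdout), so that
# `python program.py < greet.txt | sort` connects the streams for you.
fake_stdout = io.StringIO()
to_upper(io.StringIO("hello world\n"), fake_stdout)
print(fake_stdout.getvalue(), end="")
```

Your `mapper.py` and `reducer.py` follow the same pattern: they never open the dataset themselves and simply process whatever arrives on standard input.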