# Challenge - Hadoop : MapReduce in Python, locally

# Python Scripts in MapReduce

Big Data tools are powerful, but having to learn a new language for every new tool slows development down. For this reason, Hadoop lets you submit Python scripts as MapReduce jobs through Hadoop Streaming. Let's consider the WordCount example.

A MapReduce job in Hadoop has two phases:
- the Mapper
- the Reducer

In this exercise, we will write a WordCount as a MapReduce job and run it locally.
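To make the two phases concrete before touching Hadoop, here is a minimal in-memory sketch of a word count. The function names (`map_phase`, `reduce_phase`, `word_count`) are illustrative assumptions, and the call to `sorted` only stands in for the sort that Hadoop performs between the Mapper and the Reducer.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Emit a (word, 1) pair for every word of every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(sorted_pairs):
    """Sum the counts of consecutive pairs that share the same word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

def word_count(lines):
    # sorted() plays the role of the sort step between the two phases
    return list(reduce_phase(sorted(map_phase(lines))))

print(word_count(["Some content", "Hello World Some content"]))
```

The real exercise does the same thing, except that the two phases live in separate scripts that exchange text lines through standard input and output.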

## Download Hadoop locally

*Step 0*: Install the necessary packages (macOS). If you use another OS, stop before Step 8.

```bash
brew tap caskroom/versions
brew cask install java
brew install hadoop
```

On macOS, you might also need to:
- install GCC
- install Xcode

Follow the error messages you get and run the necessary installs. They are needed so that Hadoop (written in Java) can be compiled.

If access to a Java SDK file is denied, open your system preferences and, under Security, allow access to the Java SDK.

*Step 1*: Download and unzip the following Hadoop archive: https://drive.google.com/open?id=1rOmwnWK3NotPeI0bSxXKTSaThgrqVBmg

*Step 2*: Place the folder in your home directory (e.g. Users/your_name/hadoop_here)

*Step 3*: Create a folder on your Desktop called "HadoopTP" in which you will have all the necessary files.

## Set up for the Word Count

*Step 4*: In this folder, create a file called `file.txt` which contains the following content:

```
Some content
Some content
Some contents 
Hello World Some content
```

*Step 5*: We now have to create a Mapper in Python. Create in this folder a file called `mapper.py`.

*Step 6*: Fill in the mapper according to the following template. 

```python
#!/usr/bin/python3
import sys

def main(argv):
    for line in sys.stdin:
        # `line` is the input line that comes in (e.g. `Some content`)
        # Strip the leading and trailing whitespace, then split the line into words
        # (make sure repeated spaces do not produce empty words)
        # Then, for each word in the list, print it with a 1 next to it (this will be read by the reducer)
        # Separate the word and the 1 with a tab
        pass  # TODO: replace with your implementation

if __name__ == "__main__":
    main(sys.argv)
```
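If you get stuck, here is one possible way to fill in the mapper logic, written as a sketch rather than as the reference solution. The helper name `run_mapper` is an assumption made for this example; it takes any iterable of lines so that it can be tried in memory, and in `mapper.py` the template's `main` would call it with `sys.stdin`.

```python
import io
import sys

def run_mapper(lines, out=sys.stdout):
    """Write one "word<TAB>1" record per word read from `lines`."""
    for line in lines:
        # split() with no argument ignores leading, trailing and repeated whitespace
        for word in line.strip().split():
            out.write(f"{word}\t1\n")

# In-memory demonstration; in mapper.py, main() would call run_mapper(sys.stdin)
demo = io.StringIO()
run_mapper(["  Some   content \n", "Hello World\n"], demo)
print(demo.getvalue(), end="")
```

Note that the mapper does no counting at all: it emits a `1` for every occurrence and leaves the aggregation to the reducer.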

*Step 7*: Create `reducer.py` in the same folder and fill it in according to the following template.

```python
#!/usr/bin/python3
import sys

def main(argv):

    # Initialize the current word and the count
    current_word = None
    count = 0

    # Read each line
    for line in sys.stdin:

        # Reverse what the mapper did: strip the line and split it on the tab
        # Get the word and the count

        # Don't forget that Hadoop sorted the records for us
        # Check whether the word we just received is the same as "current_word"
        # If it is, update the count

        # Otherwise, we must:
        # print the current word and its count
        # update the count to the one of the new word
        # update the current word
        pass  # TODO: replace with your implementation

    # Don't forget to display the last word once we exit the loop
    # (current_word stays None if no input was received)
    if current_word is not None:
        print(current_word + "\t" + str(count))

if __name__ == "__main__":
    main(sys.argv)
```
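As with the mapper, here is one possible way to implement the reducer logic, offered as a sketch and not as the only valid answer. The helper name `run_reducer` is an assumption made for this example, and it relies on the input records already being sorted by word.

```python
import io
import sys

def run_reducer(lines, out=sys.stdout):
    """Sum the counts of consecutive records that share the same word."""
    current_word = None
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        word, value = line.split("\t", 1)
        if word == current_word:
            count += int(value)
        else:
            # A new word starts: flush the previous one, if there is one
            if current_word is not None:
                out.write(f"{current_word}\t{count}\n")
            current_word = word
            count = int(value)
    # Flush the last word once the input is exhausted
    if current_word is not None:
        out.write(f"{current_word}\t{count}\n")

# In-memory demonstration; in reducer.py, main() would call run_reducer(sys.stdin)
demo = io.StringIO()
run_reducer(["Hello\t1\n", "Some\t1\n", "Some\t1\n", "World\t1\n"], demo)
print(demo.getvalue(), end="")
```

Because only the current word and its running count are kept in memory, this approach works on inputs far larger than what a dictionary of all words could hold.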

*Additional Step*: Test your Mapper and Reducer.
   
To test the mapper, run:
```bash
echo "Some content Some content Some contents  Hello World Some content" | python mapper.py
```

To test the whole pipeline:
```bash
echo "Some content Some content Some contents  Hello World Some content" | python mapper.py | sort -k1,1 | python reducer.py
```
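The `sort` step in the pipeline is what makes the reducer correct, since the reducer only compares each word with the previous one. The sketch below illustrates this with `itertools.groupby`, which groups consecutive equal items much like a streaming reducer does; `count_consecutive` is a hypothetical helper written for this illustration.

```python
from itertools import groupby

def count_consecutive(words):
    """Count runs of identical consecutive words, as a streaming reducer does."""
    return [(word, len(list(run))) for word, run in groupby(words)]

words = "Some content Some content Some contents  Hello World Some content".split()

# Unsorted input: the same word is reported once per run, so "Some" appears several times
print(count_consecutive(words))

# Sorted input: every distinct word is reported exactly once, with its full count
print(count_consecutive(sorted(words)))
```

Python's `sorted` orders strings by code point, so capitalized words come first. The order produced by the shell's `sort` depends on your locale and may differ, but the counts are the same as long as identical words end up next to each other.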

NB: If your Hadoop installation does not work properly, stop here and move on to the next exercise.

*Step 8*: Put `file.txt` in HDFS.
- First, from your terminal, go to the folder: /Users/yourname/hadoop-3.2.1/bin
- Then, to put a file in HDFS, use:

````
hdfs dfs -put -f path_to_your_local_txt_file/file.txt
````

*Step 9*: Make sure that the file is in HDFS by running:

````
hdfs dfs -ls
````

## Run the Word Count

*Step 10*: You are now ready to run your job! From the terminal, still in the `bin` folder:

````
hadoop jar /Users/yourname/hadoop/hadoop-3.2.1/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar \
	-input file.txt \
	-output output \
	-mapper path_to_folder/mapper.py \
	-reducer path_to_folder/reducer.py 

````

If you get error 13 ("permission denied"), your Python files are not executable. A simple fix is to run them through the Python interpreter explicitly:


````
hadoop jar /Users/yourname/hadoop/hadoop-3.2.1/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar \
	-input file.txt \
	-output output \
	-mapper "python path_to_folder/mapper.py" \
	-reducer "python path_to_folder/reducer.py" 

````
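Another option, instead of calling the interpreter explicitly, is to give the scripts execute permission so that their shebang line is used. The sketch below does the equivalent of `chmod u+x` from Python; `make_executable` is a hypothetical helper, and the temporary file only stands in for your own `mapper.py`.

```python
import os
import stat
import tempfile

def make_executable(path):
    """Add the owner-execute bit to `path`, the equivalent of `chmod u+x`."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR)

# Demonstration on a throwaway file standing in for mapper.py
with tempfile.TemporaryDirectory() as folder:
    script = os.path.join(folder, "mapper.py")
    with open(script, "w") as handle:
        handle.write("#!/usr/bin/python3\n")
    was_executable = bool(os.stat(script).st_mode & stat.S_IXUSR)
    make_executable(script)
    is_executable = bool(os.stat(script).st_mode & stat.S_IXUSR)

print(was_executable, is_executable)
```

For this to help, the shebang on the first line of each script must point to an interpreter that actually exists on your machine.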

*Step 11*: The output should now be in the `bin` folder, in a folder called `output`. You should have:

```
Hello	1
Some	4
World	1
content	3
contents	1
```