# PySpark and Hadoop

In this notebook we look at real data on our small Hadoop cluster. For programming we will use parallel Spark DataFrames.

Copy your notebooks, connect to the cluster through an SSH tunnel, and launch the Jupyter notebook:
```bash
scp -r big-data YY@192.168.2.8X:
ssh -L 99XX:localhost:99XX YY@192.168.2.8X
jupyter notebook --port 99XX --no-browser
```
Copy and paste the HTTP address into your browser.

## Create some dummy data

We want to implement the **WordCount application** already done in notebook 01.MapReduce.
This time, the files are 1000 times bigger and stored in the Hadoop file system (HDFS).

### Prepare data and put it on hdfs

In [1]:
from lorem import text
with open('sample.txt','w') as f:
    for i in range(500):
        f.write(text())

In [10]:
!du -sh sample.txt

680K	sample.txt


We will increase the size of the data by copying this file many times. The shell script below makes 100 copies of sample.txt in the data/latin directory; raise the upper bound of `seq` to 1000 for the full-size data set.

In [11]:
%%file cp_file.sh
#!/bin/bash
mkdir -p data/latin
INPUT=sample.txt
for num in $(seq 1 100)
do
    bn=$(basename $INPUT .txt)
    cp $INPUT data/latin/$bn$num.txt
done

Overwriting cp_file.sh


In [12]:
!chmod +x cp_file.sh; ./cp_file.sh

In [13]:
!du -sh data/latin/

67M	data/latin/
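
The same replication can be done without a shell script. The following is a minimal Python sketch using only the standard library; the function name `replicate` and its parameters are illustrative and not part of the original notebook.

```python
import shutil
from pathlib import Path


def replicate(src, dest_dir, n):
    """Copy `src` into `dest_dir` n times as <stem>1<suffix> ... <stem>n<suffix>."""
    src = Path(src)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    copies = []
    for num in range(1, n + 1):
        target = dest / f"{src.stem}{num}{src.suffix}"
        shutil.copy(src, target)
        copies.append(target)
    return copies
```

For example, `replicate('sample.txt', 'data/latin', 100)` would mirror the shell loop above.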


In [14]:
# put these files on HDFS
!hdfs dfs -put data/latin /user/pnavaro

17/12/08 14:51:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## hdfs3

[hdfs3](http://hdfs3.readthedocs.io/en/latest/) is an alternative native C/C++ HDFS client that interacts with HDFS without the JVM, exposing first-class support to non-JVM languages like Python.

The hdfs3 library is a lightweight Python wrapper around the C/C++ libhdfs3 library. It provides both direct access to libhdfs3 from Python and a typical Pythonic interface.

In [109]:
from hdfs3 import HDFileSystem
hdfs = HDFileSystem(host='localhost', port=54310)

In [16]:
hdfs.ls('/user/pnavaro')

['/user/pnavaro/1990.csv',
 '/user/pnavaro/1991.csv',
 '/user/pnavaro/1992.csv',
 '/user/pnavaro/1993.csv',
 '/user/pnavaro/1994.csv',
 '/user/pnavaro/1995.csv',
 '/user/pnavaro/1996.csv',
 '/user/pnavaro/1997.csv',
 '/user/pnavaro/1998.csv',
 '/user/pnavaro/1999.csv',
 '/user/pnavaro/2016_Yellow_Taxi_Trip_Data.csv',
 '/user/pnavaro/copied-file.txt',
 '/user/pnavaro/latin',
 '/user/pnavaro/nyc-taxi',
 '/user/pnavaro/nycflights.parquet',
 '/user/pnavaro/remote-file.txt',
 '/user/pnavaro/samples.txt']

In [17]:
hdfs.put('sample.txt', '/user/pnavaro/remote-file.txt')

In [18]:
hdfs.ls('/user/pnavaro/')

['/user/pnavaro/1990.csv',
 '/user/pnavaro/1991.csv',
 '/user/pnavaro/1992.csv',
 '/user/pnavaro/1993.csv',
 '/user/pnavaro/1994.csv',
 '/user/pnavaro/1995.csv',
 '/user/pnavaro/1996.csv',
 '/user/pnavaro/1997.csv',
 '/user/pnavaro/1998.csv',
 '/user/pnavaro/1999.csv',
 '/user/pnavaro/2016_Yellow_Taxi_Trip_Data.csv',
 '/user/pnavaro/copied-file.txt',
 '/user/pnavaro/latin',
 '/user/pnavaro/nyc-taxi',
 '/user/pnavaro/nycflights.parquet',
 '/user/pnavaro/remote-file.txt',
 '/user/pnavaro/samples.txt']

In [19]:
hdfs.mv('/user/pnavaro/remote-file.txt', '/user/pnavaro/copied-file.txt')

False

In [20]:
hdfs.ls('/user/pnavaro')

['/user/pnavaro/1990.csv',
 '/user/pnavaro/1991.csv',
 '/user/pnavaro/1992.csv',
 '/user/pnavaro/1993.csv',
 '/user/pnavaro/1994.csv',
 '/user/pnavaro/1995.csv',
 '/user/pnavaro/1996.csv',
 '/user/pnavaro/1997.csv',
 '/user/pnavaro/1998.csv',
 '/user/pnavaro/1999.csv',
 '/user/pnavaro/2016_Yellow_Taxi_Trip_Data.csv',
 '/user/pnavaro/copied-file.txt',
 '/user/pnavaro/latin',
 '/user/pnavaro/nyc-taxi',
 '/user/pnavaro/nycflights.parquet',
 '/user/pnavaro/remote-file.txt',
 '/user/pnavaro/samples.txt']

### Get filenames 

Here is a Python function that returns the list of `.txt` filenames in
a given local directory.

In [21]:
import os

def get_filenames(root):
    """
    Return the list of .txt filenames in the root directory
    """
    files = []
    for f in os.listdir(root):
        if f.endswith(".txt"):
            files.append(f)

    return files

root = os.path.join(os.getcwd(), 'data', 'latin')
get_filenames(root)

['sample81.txt',
 'sample29.txt',
 'sample64.txt',
 'sample88.txt',
 'sample51.txt',
 'sample84.txt',
 'sample32.txt',
 'sample30.txt',
 'sample27.txt',
 'sample31.txt',
 'sample61.txt',
 'sample71.txt',
 'sample41.txt',
 'sample44.txt',
 'sample12.txt',
 'sample78.txt',
 'sample55.txt',
 'sample54.txt',
 'sample98.txt',
 'sample53.txt',
 'sample42.txt',
 'sample11.txt',
 'sample73.txt',
 'sample18.txt',
 'sample59.txt',
 'sample37.txt',
 'sample57.txt',
 'sample94.txt',
 'sample2.txt',
 'sample4.txt',
 'sample19.txt',
 'sample50.txt',
 'sample35.txt',
 'sample47.txt',
 'sample89.txt',
 'sample23.txt',
 'sample16.txt',
 'sample8.txt',
 'sample5.txt',
 'sample3.txt',
 'sample87.txt',
 'sample77.txt',
 'sample83.txt',
 'sample60.txt',
 'sample26.txt',
 'sample85.txt',
 'sample76.txt',
 'sample13.txt',
 'sample69.txt',
 'sample45.txt',
 'sample17.txt',
 'sample28.txt',
 'sample80.txt',
 'sample74.txt',
 'sample91.txt',
 'sample56.txt',
 'sample34.txt',
 'sample97.txt',
 'sample49.txt',
 '
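
`os.listdir` returns entries in arbitrary order, which is why the listing above is not sorted. As a small sketch (the helper names `natural_key` and `get_sorted_filenames` are illustrative assumptions, not part of the notebook), the names can be sorted so that `sample2.txt` comes before `sample10.txt`:

```python
import os
import re


def natural_key(name):
    """Split a name into text and integer chunks so that 2 sorts before 10."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]


def get_sorted_filenames(root, suffix=".txt"):
    """Return the names in `root` ending with `suffix`, in natural order."""
    names = [f for f in os.listdir(root) if f.endswith(suffix)]
    return sorted(names, key=natural_key)
```

A plain `sorted(names)` would instead place `sample10.txt` before `sample2.txt`, because it compares the names character by character.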

### Exercise
- Modify the function above to use hdfs3 and get the filenames and sizes on HDFS.
- Take a look at the [API](http://hdfs3.readthedocs.io/en/latest/api.html)

In [24]:
def get_hdfs_filenames(root):
    """Return the paths of the .txt files directly under root on HDFS."""
    # hdfs.ls returns full paths, so no join with root is needed
    return [f for f in hdfs.ls(root) if f.endswith(".txt")]

root = '/user/pnavaro'
get_hdfs_filenames(root)

['/user/pnavaro/copied-file.txt',
 '/user/pnavaro/remote-file.txt',
 '/user/pnavaro/samples.txt']

### Get files size

The Python function below prints the total size of the `.txt` files in
a given local directory.

In [33]:
def files_total_size(root):
    """
    Print the total size of the .txt files in root
    """
    filesize = 0
    for f in os.listdir(root):
        if f.endswith(".txt"):
            filesize += os.path.getsize(os.path.join(root, f))

    print("Size of files:", filesize / 1073741824, "GB")

root = os.path.join(os.getcwd(), 'data', 'latin')
files_total_size(root)

Size of files: 0.06463425233960152 GB
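
The divisor 1073741824 is 1024**3, so the figure printed above is in binary gigabytes (GiB). For small directories a fraction of a gigabyte is hard to read; the sketch below picks the unit automatically. The helper name `format_size` is an illustrative assumption.

```python
def format_size(nbytes):
    """Format a byte count with binary units (1 KiB = 1024 bytes)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(nbytes)
    for unit in units[:-1]:
        if size < 1024:
            return f"{size:.2f} {unit}"
        size /= 1024
    # anything that large is reported in the last unit
    return f"{size:.2f} {units[-1]}"
```

Passing the accumulated `filesize` to `format_size` instead of dividing by a fixed constant keeps the printed value readable whatever the directory size.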


### Exercise
- Modify the function above to use hdfs3 and get the total size on HDFS.
- Take a look at the [API](http://hdfs3.readthedocs.io/en/latest/api.html)

In [None]:
def hdfs_files_total_size(root):
    """
    Return the total size, in GB, of the files under root on HDFS
    """
    # with total=True the result is indexed by root and holds the sum in bytes
    filesize = hdfs.du(root, total=True)

    return filesize[root] / 1073741824

root = '/user/pnavaro'
hdfs_files_total_size(root)
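
Note that `hdfs.du(root, total=True)` adds up every file under `root`, not only the `.txt` files. A possible refinement, sketched under the assumption that the sizes are available as a dictionary mapping each path to its size in bytes (the shape `hdfs.du` prints in this notebook); the function name `total_size_gb` is hypothetical:

```python
def total_size_gb(sizes, suffix=".txt"):
    """Sum the sizes (in bytes) of the paths ending with `suffix`, return GB."""
    total = sum(size for path, size in sizes.items() if path.endswith(suffix))
    return total / 1024 ** 3
```

On the cluster this would be called as `total_size_gb(hdfs.du('/user/pnavaro/latin'))`.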

### Exercise 

- Use your wordcount functions to count occurrences of
words in this dummy "data lake" of text files on HDFS.

In [82]:
import string
def words(file):
    """ Read a text file and return a sorted list of (word, 1) values."""
    translator = str.maketrans('', '', string.punctuation)
    output = []
    for line in file:
        line = line.strip()
        line = line.translate(translator)
        for word in line.split():            
            word = word.lower()
            output.append((word, 1))
    output.sort()
    return output

In [83]:
hdfs.head('samples.txt')

b'Etincidunt consectetur voluptatem aliquam modi eius adipisci aliquam. Etincidunt dolorem etincidunt neque quaerat ut. Sed consectetur dolorem non non dolorem magnam ipsum. Numquam adipisci sed est porro dolore. Quaerat est neque dolore sed. Magnam sit amet dolor dolorem velit labore. Aliquam ipsum consectetur ut quaerat sed velit. Non non velit dolor eius consectetur neque.'

In [2]:
def hdfs_words(file):
    """ Read a text file on HDFS and return a sorted list of (word, 1) values."""
    output = []
    with hdfs.open(file) as f:
        for line in f:  # hdfs.open yields bytes, so words stay binary
            line = line.strip()
            for word in line.split(b" "):
                # lowercase and drop periods (other punctuation is kept)
                word = word.lower().replace(b'.', b'')
                output.append((word, 1))
    output.sort()
    return output

In [106]:
counts = hdfs_words('samples.txt')
counts

  


[(b'adipisci', 1),
 (b'adipisci', 1),
 (b'aliquam', 1),
 (b'aliquam', 1),
 (b'aliquam', 1),
 (b'amet', 1),
 (b'consectetur', 1),
 (b'consectetur', 1),
 (b'consectetur', 1),
 (b'consectetur', 1),
 (b'dolor', 1),
 (b'dolor', 1),
 (b'dolore', 1),
 (b'dolore', 1),
 (b'dolorem', 1),
 (b'dolorem', 1),
 (b'dolorem', 1),
 (b'dolorem', 1),
 (b'eius', 1),
 (b'eius', 1),
 (b'est', 1),
 (b'est', 1),
 (b'etincidunt', 1),
 (b'etincidunt', 1),
 (b'etincidunt', 1),
 (b'ipsum', 1),
 (b'ipsum', 1),
 (b'labore', 1),
 (b'magnam', 1),
 (b'magnam', 1),
 (b'modi', 1),
 (b'neque', 1),
 (b'neque', 1),
 (b'neque', 1),
 (b'non', 1),
 (b'non', 1),
 (b'non', 1),
 (b'non', 1),
 (b'numquam', 1),
 (b'porro', 1),
 (b'quaerat', 1),
 (b'quaerat', 1),
 (b'quaerat', 1),
 (b'sed', 1),
 (b'sed', 1),
 (b'sed', 1),
 (b'sed', 1),
 (b'sit', 1),
 (b'ut', 1),
 (b'ut', 1),
 (b'velit', 1),
 (b'velit', 1),
 (b'velit', 1),
 (b'voluptatem', 1)]
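
The words above are `bytes` objects, and `hdfs_words` only removes periods. A sketch of a variant that decodes each line and strips all ASCII punctuation, as the local `words` function does; the name `words_from_binary` is hypothetical and `io.BytesIO` stands in for a file opened on HDFS:

```python
import io
import string


def words_from_binary(lines, encoding="utf-8"):
    """Return a sorted list of (word, 1) pairs from an iterable of bytes lines."""
    translator = str.maketrans("", "", string.punctuation)
    output = []
    for raw in lines:
        # decode first so that str.translate can strip the punctuation
        line = raw.decode(encoding).translate(translator)
        for word in line.split():
            output.append((word.lower(), 1))
    output.sort()
    return output


# Stand-in for a binary file object: any iterable of bytes lines works.
fake_file = io.BytesIO(b"Quaerat est neque. Dolore, sed est!\n")
pairs = words_from_binary(fake_file)
```

The UTF-8 encoding is an assumption; it suits the ASCII lorem ipsum text used here.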

In [107]:
import operator
def reduce(words):
    """ Read the sorted list from map and return every word with 
    its number of occurrences, most frequent first"""
    d = {}
    for w in words:
        try:
            d[w[0]] +=1
        except KeyError:
            d[w[0]] = 1 
    
    return sorted(d.items(), key=operator.itemgetter(1), reverse=True)

In [108]:
file = 'samples.txt'
result = reduce(hdfs_words(file))
result

  


[(b'consectetur', 4),
 (b'dolorem', 4),
 (b'non', 4),
 (b'sed', 4),
 (b'aliquam', 3),
 (b'etincidunt', 3),
 (b'neque', 3),
 (b'quaerat', 3),
 (b'velit', 3),
 (b'adipisci', 2),
 (b'dolor', 2),
 (b'dolore', 2),
 (b'eius', 2),
 (b'est', 2),
 (b'ipsum', 2),
 (b'magnam', 2),
 (b'ut', 2),
 (b'amet', 1),
 (b'labore', 1),
 (b'modi', 1),
 (b'numquam', 1),
 (b'porro', 1),
 (b'sit', 1),
 (b'voluptatem', 1)]
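
The counting step can also be written with `collections.Counter` from the standard library. This is a sketch of an alternative, not the notebook's own method; the function name `count_words` is illustrative.

```python
from collections import Counter


def count_words(pairs):
    """Count occurrences from (word, 1) pairs, most frequent first."""
    counter = Counter(word for word, _ in pairs)
    # most_common() returns (word, count) tuples sorted by decreasing count
    return counter.most_common()
```

It accepts any iterable of `(word, 1)` pairs, so it works on the output of the map step whether the words are `str` or `bytes`.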

In [None]:
# Run the modified functions on hdfs latin directory
# Use functions to get filenames and size to display total amount
# of gigabytes processed.

import itertools

files = hdfs.glob("/user/pnavaro/latin/*.txt")
print(hdfs.du("/user/pnavaro/latin/"))
mapped_values = map(hdfs_words, files)
results = reduce(itertools.chain(*mapped_values))
results

{'/user/pnavaro/latin/sample1.txt': 3184213, '/user/pnavaro/latin/sample10.txt': 3184213, '/user/pnavaro/latin/sample100.txt': 3184213, '/user/pnavaro/latin/sample1000.txt': 3184213, '/user/pnavaro/latin/sample101.txt': 3184213, '/user/pnavaro/latin/sample102.txt': 3184213, '/user/pnavaro/latin/sample103.txt': 3184213, '/user/pnavaro/latin/sample104.txt': 3184213, '/user/pnavaro/latin/sample105.txt': 3184213, '/user/pnavaro/latin/sample106.txt': 3184213, '/user/pnavaro/latin/sample107.txt': 3184213, '/user/pnavaro/latin/sample108.txt': 3184213, '/user/pnavaro/latin/sample109.txt': 3184213, '/user/pnavaro/latin/sample11.txt': 3184213, '/user/pnavaro/latin/sample110.txt': 3184213, '/user/pnavaro/latin/sample111.txt': 3184213, '/user/pnavaro/latin/sample112.txt': 3184213, '/user/pnavaro/latin/sample113.txt': 3184213, '/user/pnavaro/latin/sample114.txt': 3184213, '/user/pnavaro/latin/sample115.txt': 3184213, '/user/pnavaro/latin/sample116.txt': 3184213, '/user/pnavaro/latin/sample117.txt':
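
One detail of the cell above: `itertools.chain(*mapped_values)` unpacks the `map` iterator before `chain` is called, so every file is read and its word list kept in memory before counting starts. `itertools.chain.from_iterable` consumes the inputs one at a time instead. The sketch below illustrates this with a hypothetical `fake_words` function standing in for `hdfs_words`:

```python
import itertools

processed = []  # records the order in which "files" are read


def fake_words(name):
    """Hypothetical stand-in for hdfs_words: returns (word, 1) pairs."""
    processed.append(name)
    return [(name, 1), ("lorem", 1)]


names = ["a.txt", "b.txt", "c.txt"]
stream = itertools.chain.from_iterable(map(fake_words, names))

first = next(stream)            # only the first file has been read so far
read_so_far = list(processed)   # snapshot of what was read at this point
rest = list(stream)             # the remaining files are read on demand
```

With this approach only one file's word list needs to be held at a time while the reducer iterates.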

### Exercise

Do the work with a Spark context.
Using the [Cheat Sheet](http://datacamp-community.s3.amazonaws.com/4d91fcbc-820d-4ae2-891b-f7a436ebefd4), you can create a parallel collection with `textFile`.
You can do your wordcount with the functions `map`, `flatMap`, `reduceByKey` and `sortByKey`. Try it on a small local text file before analyzing the "latin data lake" on HDFS.

In [45]:
from pyspark import SparkContext
sc = SparkContext("spark://hadoop1:7077", "Your Name")
sc

In [47]:
files = "sample.txt"

# Your code here...
# wordcounts = ...

wordcounts.collect()


[(13, 'quaerat'),
 (12, 'est'),
 (12, 'neque'),
 (10, 'dolor'),
 (9, 'amet'),
 (9, 'ut'),
 (9, 'numquam'),
 (8, 'eius'),
 (8, 'quiquia'),
 (8, 'tempora'),
 (8, 'sit'),
 (8, 'dolorem'),
 (7, 'labore'),
 (6, 'dolore'),
 (6, 'aliquam'),
 (6, 'etincidunt'),
 (6, 'voluptatem'),
 (6, 'velit'),
 (5, 'ipsum'),
 (5, 'consectetur'),
 (5, 'modi'),
 (5, 'sed'),
 (5, 'quisquam'),
 (5, 'adipisci'),
 (4, 'magnam'),
 (4, 'non'),
 (3, 'porro')]
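
To see what each transformation contributes, here is a plain-Python emulation of the `flatMap`, `map`, `reduceByKey` and `sortByKey` chain. It runs without Spark, the input lines are made up, and it is meant as a sketch of the data flow rather than the exercise solution itself.

```python
from collections import defaultdict

lines = ["Quaerat est neque", "est dolor quaerat", "quaerat amet"]  # made-up input

# flatMap: each line produces several words, flattened into one collection
tokens = [word.lower() for line in lines for word in line.split()]

# map: each word becomes a (word, 1) pair
pairs = [(word, 1) for word in tokens]

# reduceByKey: add up the values that share the same key
totals = defaultdict(int)
for word, one in pairs:
    totals[word] += one

# map + sortByKey(ascending=False): swap to (count, word), largest count first
local_wordcounts = sorted(((n, w) for w, n in totals.items()),
                          key=lambda pair: pair[0], reverse=True)
```

With a Spark context, the corresponding chain would be along the lines of `sc.textFile(files).flatMap(lambda line: line.split()).map(lambda w: (w.lower(), 1)).reduceByKey(lambda a, b: a + b).map(lambda kv: (kv[1], kv[0])).sortByKey(ascending=False)`; this one-liner is untested here and does not strip punctuation.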

In [None]:
sc.stop()