# PySpark and Hadoop

In this notebook we work with real data on our small Hadoop cluster. For programming, we will use parallel Spark dataframes.

Connection to the cluster:
```bash
scp big-data YY@192.168.2.8X:
ssh -L 99XX:localhost:99XX YY@192.168.2.8X
jupyter notebook --port 99XX --no-browser
```
Copy and paste the HTTP address into your browser.

## Create some dummy data

We want to implement the **WordCount application** already done in notebook 01.MapReduce.
This time, the files are 1000 times bigger and stored in the Hadoop file system (HDFS).

### Prepare data and put it on HDFS

In [1]:
from lorem import text
with open('sample.txt','w') as f:
    for i in range(500):
        f.write(text())

In [2]:
!du -sh sample.txt

3,0M	sample.txt


We will increase the size of this data by a factor of 1000. Let's make 1000 copies of sample.txt in the data/latin directory with this shell script.

In [3]:
%%file cp_file.sh
#!/bin/bash
mkdir -p data/latin
INPUT=sample.txt
# 1000 copies of a 3 MB file give about 3 GB in data/latin
for num in $(seq 1 1000)
do
    bn=$(basename "$INPUT" .txt)
    cp "$INPUT" "data/latin/$bn$num.txt"
done

Writing cp_file.sh


In [4]:
!chmod +x cp_file.sh; ./cp_file.sh

In [9]:
!du -sh data/latin/

3,0G	data/latin/
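The same replication can also be sketched in Python, which avoids the extra shell script. This is only an illustration: the helper name `replicate` and its parameters are hypothetical and not part of the original notebook.

```python
import shutil
from pathlib import Path


def replicate(src, dest_dir, copies):
    """Copy `src` into `dest_dir` `copies` times as <stem><i><suffix>.

    Hypothetical helper mirroring cp_file.sh; returns the sorted
    list of file names found in `dest_dir` afterwards.
    """
    src = Path(src)
    dest_dir = Path(dest_dir)
    # Same effect as `mkdir -p data/latin`
    dest_dir.mkdir(parents=True, exist_ok=True)
    for i in range(1, copies + 1):
        shutil.copy(src, dest_dir / f"{src.stem}{i}{src.suffix}")
    return sorted(p.name for p in dest_dir.iterdir())
```

Calling `replicate('sample.txt', 'data/latin', 1000)` would produce `sample1.txt` through `sample1000.txt`, as the shell loop does.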


In [7]:
# put these files on HDFS
!hdfs dfs -put data/latin /user/navaro

## hdfs3

[hdfs3](http://hdfs3.readthedocs.io/en/latest/) is an alternative native C/C++ HDFS client that interacts with HDFS without the JVM, exposing first-class support to non-JVM languages like Python.

This library, hdfs3, is a lightweight Python wrapper around the C/C++ libhdfs3 library. It provides both direct access to libhdfs3 from Python as well as a typical Pythonic interface.

In [11]:
from hdfs3 import HDFileSystem
hdfs = HDFileSystem(host='localhost', port=9000)

ImportError: Can not find the shared library: libhdfs3.so
See installation instructions at http://hdfs3.readthedocs.io/en/latest/install.html

In [None]:
hdfs.ls('/user/navaro')

In [19]:
hdfs.put('sample.txt', '/user/pnavaro/remote-file.txt')

In [20]:
hdfs.ls('/user/pnavaro/')

['/user/pnavaro/1990.csv',
 '/user/pnavaro/1991.csv',
 '/user/pnavaro/1992.csv',
 '/user/pnavaro/1993.csv',
 '/user/pnavaro/1994.csv',
 '/user/pnavaro/1995.csv',
 '/user/pnavaro/1996.csv',
 '/user/pnavaro/1997.csv',
 '/user/pnavaro/1998.csv',
 '/user/pnavaro/1999.csv',
 '/user/pnavaro/2016_Yellow_Taxi_Trip_Data.csv',
 '/user/pnavaro/copied-file.txt',
 '/user/pnavaro/latin',
 '/user/pnavaro/nyc-taxi',
 '/user/pnavaro/nycflights.parquet',
 '/user/pnavaro/remote-file.txt',
 '/user/pnavaro/samples.txt']

In [21]:
hdfs.mv('/user/pnavaro/remote-file.txt', '/user/pnavaro/copied-file.txt')

False

In [22]:
hdfs.ls('/user/pnavaro')

['/user/pnavaro/1990.csv',
 '/user/pnavaro/1991.csv',
 '/user/pnavaro/1992.csv',
 '/user/pnavaro/1993.csv',
 '/user/pnavaro/1994.csv',
 '/user/pnavaro/1995.csv',
 '/user/pnavaro/1996.csv',
 '/user/pnavaro/1997.csv',
 '/user/pnavaro/1998.csv',
 '/user/pnavaro/1999.csv',
 '/user/pnavaro/2016_Yellow_Taxi_Trip_Data.csv',
 '/user/pnavaro/copied-file.txt',
 '/user/pnavaro/latin',
 '/user/pnavaro/nyc-taxi',
 '/user/pnavaro/nycflights.parquet',
 '/user/pnavaro/remote-file.txt',
 '/user/pnavaro/samples.txt']

### Get filenames 

Here is a Python function that returns the list of `.txt` filenames in
a given local directory.

In [26]:
import os
def get_filenames(root):
    """
    Returns complete list of filenames in root directory
    """

    files = []
    for f in os.listdir(root):
        if f.endswith(".txt"):
            files.append(f)

    return files

root = os.path.join(os.getcwd(),'data','latin')
get_filenames(root)

['sample575.txt',
 'sample922.txt',
 'sample911.txt',
 'sample649.txt',
 'sample964.txt',
 'sample754.txt',
 'sample253.txt',
 'sample454.txt',
 'sample133.txt',
 'sample670.txt',
 'sample248.txt',
 'sample397.txt',
 'sample336.txt',
 'sample506.txt',
 'sample688.txt',
 'sample237.txt',
 'sample243.txt',
 'sample638.txt',
 'sample81.txt',
 'sample939.txt',
 'sample395.txt',
 'sample399.txt',
 'sample909.txt',
 'sample789.txt',
 'sample270.txt',
 'sample841.txt',
 'sample925.txt',
 'sample860.txt',
 'sample721.txt',
 'sample103.txt',
 'sample328.txt',
 'sample890.txt',
 'sample517.txt',
 'sample127.txt',
 'sample29.txt',
 'sample129.txt',
 'sample905.txt',
 'sample967.txt',
 'sample976.txt',
 'sample740.txt',
 'sample324.txt',
 'sample391.txt',
 'sample596.txt',
 'sample262.txt',
 'sample875.txt',
 'sample296.txt',
 'sample177.txt',
 'sample958.txt',
 'sample617.txt',
 'sample681.txt',
 'sample782.txt',
 'sample532.txt',
 'sample230.txt',
 'sample312.txt',
 'sample64.txt',
 'sample934.t

### Exercise
- Modify the function above to use hdfs3 so that it returns the filenames and sizes on HDFS.
- Take a look at the [API](http://hdfs3.readthedocs.io/en/latest/api.html)

In [27]:
#Your code here...
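One possible approach to this exercise is sketched below. It assumes that `fs` behaves like an `hdfs3.HDFileSystem` instance and that `ls(path, detail=True)` returns one dictionary per entry with at least the keys `'name'` and `'size'`. The function name `hdfs_filenames` and the `suffix` parameter are hypothetical.

```python
def hdfs_filenames(fs, root, suffix=".txt"):
    """Return (path, size_in_bytes) pairs for files under `root` on HDFS.

    Assumption: `fs.ls(root, detail=True)` yields dicts with
    'name' (full path) and 'size' (bytes), as hdfs3 is expected to do.
    """
    return [(entry["name"], entry["size"])
            for entry in fs.ls(root, detail=True)
            if entry["name"].endswith(suffix)]
```

Passing the file system as an argument keeps the function easy to try with a stand-in object when the cluster is not reachable.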

### Get files size

The Python function below prints the total size of the `.txt` files in
a given local directory.

In [31]:
def files_total_size(root):
    """
    Prints sum of filesize of txt files in root
    """
    filesize = 0
    for f in os.listdir(root):
        if f.endswith(".txt"):
            filesize += os.path.getsize(os.path.join(root, f))

    print("Size of files:", filesize / 1073741824, "GB")

files_total_size(root)

Size of files: 2.9655294492840767 GB


### Exercise
- Modify the function above to use hdfs3 so that it reports the total size on HDFS.
- Take a look at the [API](http://hdfs3.readthedocs.io/en/latest/api.html)

In [32]:
# Your code here...
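A minimal sketch for this exercise, under the same assumption as before: `fs` is an `hdfs3.HDFileSystem`-like object whose `ls(path, detail=True)` returns dictionaries with `'name'` and `'size'` keys. The name `hdfs_total_bytes` is hypothetical.

```python
def hdfs_total_bytes(fs, root, suffix=".txt"):
    """Return the summed size in bytes of files ending with `suffix` in `root`.

    Assumption: each entry from `fs.ls(root, detail=True)` is a dict
    with 'name' and 'size' keys.
    """
    return sum(entry["size"]
               for entry in fs.ls(root, detail=True)
               if entry["name"].endswith(suffix))
```

Dividing the result by `1024**3` (the constant 1073741824 used in the local version) gives the size in the same unit as `files_total_size`.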

### Exercise 

- Use your wordcount functions to count occurrences of
words in this dummy "data lake" of text files on HDFS.

In [37]:
# Modify words function using hdfs3...
import string
def words(file):
    """ Read a text file and return a sorted list of 
    (word, 1) values."""
    translator = str.maketrans('', '', string.punctuation)
    output = []
    for line in file:
        line = line.strip()
        line = line.translate(translator)
        for word in line.split():
            word = word.lower()
            output.append((word, 1))
    output.sort()
    return output

In [38]:
import operator
def reduce(words):
    """ Read the sorted list from map and return every word with 
    its number of occurrences, most frequent first"""
    d = {}
    for w in words:
        try:
            d[w[0]] +=1
        except KeyError:
            d[w[0]] = 1 
    
    return sorted(d.items(), key=operator.itemgetter(1), reverse=True)

In [39]:
with open('sample.txt') as file:
    counts = reduce(words(file))
# An expression inside a `with` block is not displayed, so show the result here
counts

[('quaerat', 13),
 ('est', 12),
 ('neque', 12),
 ('dolor', 10),
 ('amet', 9),
 ('numquam', 9),
 ('ut', 9),
 ('dolorem', 8),
 ('eius', 8),
 ('quiquia', 8),
 ('sit', 8),
 ('tempora', 8),
 ('labore', 7),
 ('aliquam', 6),
 ('dolore', 6),
 ('etincidunt', 6),
 ('velit', 6),
 ('voluptatem', 6),
 ('adipisci', 5),
 ('consectetur', 5),
 ('ipsum', 5),
 ('modi', 5),
 ('quisquam', 5),
 ('sed', 5),
 ('magnam', 4),
 ('non', 4),
 ('porro', 3)]
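The map and reduce steps above can also be collapsed into a single pass with `collections.Counter`. This is an alternative sketch, not the notebook's own method, and the name `word_counts` is hypothetical.

```python
import string
from collections import Counter


def word_counts(lines):
    """Count lowercase words in an iterable of text lines.

    Punctuation is stripped as in `words`; the result is a list of
    (word, count) pairs ordered by decreasing count.
    """
    translator = str.maketrans('', '', string.punctuation)
    counts = Counter()
    for line in lines:
        counts.update(line.translate(translator).lower().split())
    return counts.most_common()
```

Because an open text file is an iterable of lines, `word_counts(open('sample.txt'))` would work on the same input as `words`.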

In [43]:
# Run the modified functions on hdfs latin directory
# Use the functions that get filenames and sizes to display the total
# amount of gigabytes processed.
...

Ellipsis
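One way this cell could be filled in is sketched below. It assumes `fs` behaves like `hdfs3.HDFileSystem`: `ls(path, detail=True)` returns dictionaries with `'name'` and `'size'`, and `open(path, 'rb')` returns a file-like object usable as a context manager that yields bytes. The function name `hdfs_wordcount` is hypothetical, and the text is assumed to be UTF-8.

```python
import string
from collections import Counter


def hdfs_wordcount(fs, root, suffix=".txt"):
    """Count words in every `suffix` file under `root`.

    Returns (pairs, nbytes): (word, count) pairs by decreasing count,
    and the number of bytes processed.
    """
    translator = str.maketrans('', '', string.punctuation)
    counts = Counter()
    nbytes = 0
    for entry in fs.ls(root, detail=True):
        if not entry["name"].endswith(suffix):
            continue
        nbytes += entry["size"]
        # Files are read as bytes, so decode before splitting into words
        with fs.open(entry["name"], "rb") as f:
            text = f.read().decode("utf-8")
        counts.update(text.translate(translator).lower().split())
    return counts.most_common(), nbytes
```

Each file is read fully into memory, which is acceptable for 3 MB files but would need chunked reading for much larger ones.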

### Exercise

Do the work with a Spark context.
Using the [Cheat Sheet](http://datacamp-community.s3.amazonaws.com/4d91fcbc-820d-4ae2-891b-f7a436ebefd4), you can create a parallel collection with `textFile`.
You can do your wordcount with the functions `map`, `flatMap`, `reduceByKey` and `sortByKey`. Try it on a small local text file before analyzing the "latin data lake" on HDFS.

In [45]:
from pyspark import SparkContext
sc = SparkContext("spark://hadoop1:7077", "Your Name")
sc

In [47]:
files = "sample.txt"

# Your code here...
# wordcounts = ...

wordcounts.collect()


[(13, 'quaerat'),
 (12, 'est'),
 (12, 'neque'),
 (10, 'dolor'),
 (9, 'amet'),
 (9, 'ut'),
 (9, 'numquam'),
 (8, 'eius'),
 (8, 'quiquia'),
 (8, 'tempora'),
 (8, 'sit'),
 (8, 'dolorem'),
 (7, 'labore'),
 (6, 'dolore'),
 (6, 'aliquam'),
 (6, 'etincidunt'),
 (6, 'voluptatem'),
 (6, 'velit'),
 (5, 'ipsum'),
 (5, 'consectetur'),
 (5, 'modi'),
 (5, 'sed'),
 (5, 'quisquam'),
 (5, 'adipisci'),
 (4, 'magnam'),
 (4, 'non'),
 (3, 'porro')]
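Output of this shape, `(count, word)` pairs in decreasing order of count, could be produced by a pipeline like the sketch below. It assumes `sc` is an existing `SparkContext`; the names `tokenize` and `spark_wordcount` are hypothetical.

```python
import string

# Translation table that deletes ASCII punctuation
_TRANSLATOR = str.maketrans('', '', string.punctuation)


def tokenize(line):
    """Split a line into lowercase words with punctuation removed."""
    return line.translate(_TRANSLATOR).lower().split()


def spark_wordcount(sc, path):
    """Return an RDD of (count, word) pairs sorted by decreasing count.

    `sc` is assumed to be a live SparkContext; `path` may be a local
    file or a location readable by `textFile`.
    """
    return (sc.textFile(path)
              .flatMap(tokenize)                      # one record per word
              .map(lambda word: (word, 1))            # emit (word, 1)
              .reduceByKey(lambda a, b: a + b)        # sum counts per word
              .map(lambda pair: (pair[1], pair[0]))   # swap to (count, word)
              .sortByKey(ascending=False))            # most frequent first
```

With this helper, `wordcounts = spark_wordcount(sc, files)` would define the variable that the cell then collects.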

In [None]:
sc.stop()