## Wordcount

The goal of this example is to count how many times each word appears across the complete works of Shakespeare.
Data source: http://www.gutenberg.org/cache/epub/100/pg100.txt (the introduction and license were removed).


In [1]:
import pyspark

try: 
    type(sc)
except NameError:
    sc = pyspark.SparkContext('local[*]')

In [7]:
shakespeareRDD = sc.textFile('data/shakespeare.txt',8)

In [8]:
shakespeareRDD.take(20)

[u'1609',
 u'',
 u'THE SONNETS',
 u'',
 u'by William Shakespeare',
 u'',
 u'',
 u'',
 u'                     1',
 u'  From fairest creatures we desire increase,',
 u"  That thereby beauty's rose might never die,",
 u'  But as the riper should by time decease,',
 u'  His tender heir might bear his memory:',
 u'  But thou contracted to thine own bright eyes,',
 u"  Feed'st thy light's flame with self-substantial fuel,",
 u'  Making a famine where abundance lies,',
 u'  Thy self thy foe, to thy sweet self too cruel:',
 u"  Thou that art now the world's fresh ornament,",
 u'  And only herald to the gaudy spring,',
 u'  Within thine own bud buriest thy content,']

In [9]:
wordsRDD = shakespeareRDD.flatMap(lambda line: line.split())
wordsRDD.take(10)

[u'1609',
 u'THE',
 u'SONNETS',
 u'by',
 u'William',
 u'Shakespeare',
 u'1',
 u'From',
 u'fairest',
 u'creatures']

In [10]:
# Pair each word with a (count, placeholder) tuple; only the first element is used as the count
wordsCountRDD = wordsRDD.map(lambda word: (word,(1,2)))
wordsCountRDD.take(5)

[(u'1609', (1, 2)),
 (u'THE', (1, 2)),
 (u'SONNETS', (1, 2)),
 (u'by', (1, 2)),
 (u'William', (1, 2))]

In [11]:
# Sum the first tuple element per word, then sort by that count in descending order
(wordsCountRDD
    .reduceByKey(lambda a, b: (a[0] + b[0], 1))
    .sortBy(ascending=False, keyfunc=lambda x: x[1][0])
    .take(10))

[(u'the', (23373, 1)),
 (u'I', (19540, 1)),
 (u'and', (18334, 1)),
 (u'to', (15667, 1)),
 (u'of', (15626, 1)),
 (u'a', (12575, 1)),
 (u'my', (10825, 1)),
 (u'in', (9624, 1)),
 (u'you', (9111, 1)),
 (u'is', (7862, 1))]

## N-Grams

### What is an n-gram

An n-gram, as used here, is a contiguous sequence of n characters taken from a text string.

In [12]:
def trigrams(t):
    t=t.lower()
    return [t[i:i+3] for i in range(0, len(t) - 2)]

In [13]:
trigrams("hola datos")

['hol', 'ola', 'la ', 'a d', ' da', 'dat', 'ato', 'tos']
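
The same slicing idea extends to any value of n. The following is a minimal sketch, assuming a hypothetical helper `ngrams(text, n)` that is not defined in the notebook; it returns an empty list when the text is shorter than n.

```python
def ngrams(text, n):
    """Return the contiguous character n-grams of text, lowercased."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    text = text.lower()
    # A string of length L has L - n + 1 n-grams (none when L < n)
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("hola datos", 3))  # same result as trigrams("hola datos")
print(ngrams("hola", 2))        # ['ho', 'ol', 'la']
```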

Finding the trigrams across all of Shakespeare's works

In [14]:
anotherShakespeareRDD = sc.textFile('data/shakespeare.txt',8)

In [15]:
# Split every line into trigrams and drop the ones made only of spaces
trigramsRDD = anotherShakespeareRDD.flatMap(trigrams).filter(lambda a : a != '   ')

In [16]:
trigramsRDD.take(10)

[u'160',
 u'609',
 u'the',
 u'he ',
 u'e s',
 u' so',
 u'son',
 u'onn',
 u'nne',
 u'net']

Computing the frequency of each trigram

In [17]:
trigramsCount = trigramsRDD.map(lambda x: (x, 1)).reduceByKey(lambda x,y: x+y)
print(trigramsCount.take(5))

[(u'osy', 15), (u'aln', 4), (u'? t', 262), (u'f; ', 116), (u' 54', 1)]


In [18]:
trigramsCountSorted = trigramsCount.sortBy(ascending=False,keyfunc=lambda x:x[1])
print(trigramsCountSorted.take(20))

[(u' th', 83504), (u'the', 52000), (u'he ', 35101), (u'and', 32677), (u' an', 32633), (u'nd ', 31158), (u' to', 23607), (u'is ', 23118), (u' yo', 22873), (u'you', 22242), (u' he', 20994), (u'to ', 19818), (u' of', 19811), (u' no', 19309), (u' i ', 19146), (u'her', 18969), (u'hat', 18789), (u'll ', 18605), (u'at ', 18091), (u' wi', 17937)]


In [19]:
trigramsCountSorted.take(10)

[(u' th', 83504),
 (u'the', 52000),
 (u'he ', 35101),
 (u'and', 32677),
 (u' an', 32633),
 (u'nd ', 31158),
 (u' to', 23607),
 (u'is ', 23118),
 (u' yo', 22873),
 (u'you', 22242)]

### Computing the total frequency of the collection

In [20]:
totalFrec = trigramsCountSorted.map(lambda x: x[1]).reduce(lambda x,y: x+y)

In [None]:
print(totalFrec)

In [None]:
print(trigramsRDD.count())
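
Both cells should print the same number, because summing the per-trigram counts gives back the number of trigram occurrences. A small Spark-free check of that identity, using made-up input lines and a local copy of `trigrams`, might look like this:

```python
from collections import Counter

def trigrams(t):
    t = t.lower()
    return [t[i:i + 3] for i in range(0, len(t) - 2)]

# Hypothetical input lines, chosen only for illustration
lines = ["to be or not to be", "that is the question"]

# Equivalent of flatMap(trigrams): one flat list of trigram occurrences
all_trigrams = [g for line in lines for g in trigrams(line)]
trigram_counts = Counter(all_trigrams)

total_freq = sum(trigram_counts.values())
# The sum of the counts equals the number of occurrences
print(total_freq == len(all_trigrams))  # True
```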

### Computing the probability of each trigram

In [None]:
print(trigramsCountSorted.take(5))
trigramsProb = trigramsCountSorted.map(lambda x: (x[0],round(float(x[1])/totalFrec,3)))

In [None]:
trigramsProb.take(10)
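
Each probability here is a relative frequency: the trigram's count divided by the total number of trigram occurrences, so the unrounded values sum to 1. The sketch below illustrates this on a made-up string without Spark; the names `text`, `counts` and `probs` are illustrative only and do not appear in the notebook.

```python
from collections import Counter

def trigrams(t):
    t = t.lower()
    return [t[i:i + 3] for i in range(0, len(t) - 2)]

# Hypothetical input, chosen only for illustration
text = "banana"
counts = Counter(trigrams(text))   # 'ban': 1, 'ana': 2, 'nan': 1
total = sum(counts.values())       # 4 trigram occurrences

# Relative frequency of each trigram (float() mirrors the notebook cell)
probs = {gram: float(count) / total for gram, count in counts.items()}
print(probs)

# The unrounded probabilities sum to 1, up to floating-point error
print(abs(sum(probs.values()) - 1.0) < 1e-9)  # True
```

Note that rounding to three decimals, as the cell above does, maps any trigram whose relative frequency is below 0.0005 to 0.0. In this collection that includes every trigram seen only once, such as `' 54'`.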