# Word Counting with Spark

We start the notebook by explaining the difference between map() and flatMap().

* map: lets you take a set of data and transform it into another set of data using a function. Ex: if you wanted to square all the numbers in an RDD, you can use map. The resulting RDD has the same number of values as the original. Ex: rdd.map(lambda x: x*x)

* flatMap: similar to “map”, but it lets you produce zero, one, or many values from each original RDD value. The RDD you produce can therefore be smaller or bigger than the original RDD.

With map(), every element goes through the same function. For example, we can convert every line of text to uppercase with:

*rdd = lines.map(lambda x: x.upper())*

With flatMap(), we can create several new elements from each element in the RDD. For example, each word in a line can become its own element with:

*rdd = lines.flatMap(lambda x: x.split())*
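The difference can also be seen without Spark. The following is a plain-Python sketch of the two behaviours, not PySpark code: a list comprehension stands in for map(), and itertools.chain.from_iterable stands in for the flattening step of flatMap(). The sample lines are made up for illustration.

```python
from itertools import chain

lines = ["the quick brown fox", "jumps over"]

# map analog: exactly one output element per input element.
upper_lines = [line.upper() for line in lines]

# flatMap analog: each input element yields a list of words, and the
# lists are flattened into one sequence of elements.
words = list(chain.from_iterable(line.split() for line in lines))

print(upper_lines)  # ['THE QUICK BROWN FOX', 'JUMPS OVER']
print(words)        # ['the', 'quick', 'brown', 'fox', 'jumps', 'over']
```

Two lines go in and two lines come out of the map analog, while the flatMap analog turns the same two lines into six elements.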

---

We will now take a book and count the number of times each unique word appears in it. We'll start with a simple way to do it, then move to an approach based on reduceByKey() that keeps the counts in an RDD, and explain the difference between the two.

In [2]:
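# Load the book; each element of the RDD is one line of text.
input = sc.textFile("/Users/jacquesthibodeau/big-data-datasets/book.txt")
# Split each line on whitespace so that every word becomes its own element.
words = input.flatMap(lambda x: x.split())
# countByValue() is an action: it returns a dictionary of word -> count to the driver.
wordCounts = words.countByValue()

for word, count in wordCounts.items():
    # Drop non-ASCII characters; encode() returns bytes, hence the b'...' in the output.
    cleanWord = word.encode('ascii', 'ignore')
    if cleanWord:
        print(cleanWord, count)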

b'Self-Employment:' 1
b'Building' 5
b'an' 172
b'Internet' 13
b'Business' 19
b'of' 941
b'One' 12
b'Achieving' 1
b'Financial' 3
b'and' 901
b'Personal' 3
b'Freedom' 7
b'through' 55
b'a' 1148
b'Lifestyle' 5
b'Technology' 2
b'By' 9
b'Frank' 10
b'Kane' 7
b'Copyright' 1
b'2015' 3
b'Kane.' 1
b'All' 13
b'rights' 3
b'reserved' 2
b'worldwide.' 2
b'CONTENTS' 1
b'Disclaimer' 1
b'Preface' 1
b'Part' 2
b'I:' 2
b'Making' 5
b'the' 1176
b'Big' 1
b'Decision' 1
b'Overcoming' 1
b'Inertia' 1
b'Fear' 1
b'Failure' 1
b'Career' 1
b'Indoctrination' 2
b'The' 88
b'Carrot' 1
b'on' 399
b'Stick' 2
b'Ego' 1
b'Protection' 1
b'Your' 62
b'Employer' 2
b'as' 297
b'Security' 2
b'Blanket' 1
b'Why' 3
b'its' 28
b'Worth' 1
b'it' 311
b'Unlimited' 2
b'Growth' 4
b'Potential' 1
b'Investing' 3
b'in' 552
b'Yourself,' 1
b'Not' 7
b'Someone' 2
b'Else' 1
b'No' 14
b'Dependencies' 1
b'Commute' 1
b'to' 1789
b'Live' 3
b'Where' 2
b'You' 144
b'Want' 5
b'Work' 4
b'When' 31
b'How' 29
b'Is' 17
b'Self-Employment' 1
b'for' 500
b'You?' 1
b'Flowchart:

b'interests,' 3
b'yours.' 7
b'time,' 25
b'thrived' 1
b'Im' 5
b'case' 10
b'hit' 3
b'dead' 1
b'difficulty' 1
b'promoted' 1
b'Director-level' 1
b'today' 3
b'earns' 3
b'Director' 1
b'would.' 1
b'efforts' 10
b'building' 25
b'instead' 20
b'self-promotion' 1
b'lobbying' 1
b'required' 14
b'ended' 8
b'earning' 4
b'Lets' 2
b'quantify' 1
b'10%' 6
b'growth' 24
b'sounds' 5
b'running' 9
b'small' 39
b'grows' 3
b'year,' 10
b'asking' 3
b'wrong.' 6
b'offers' 12
b'intentionally' 2
b'limited' 8
b'minimize' 4
b'expense,' 2
b'prevent' 2
b'inequities.' 1
b'yourself,' 12
b'And,' 2
b'control' 3
b'dependent' 2
b'reward.' 1
b'based' 9
b'quality' 7
b'products' 50
b'demand' 5
b'story.' 1
b'Let' 1
b'state' 7
b'again,' 5
b'important' 49
b'self-employed.' 10
b"There's" 11
b'(other' 1
b'zero),' 1
b'careful' 6
b'mitigate' 1
b'choose' 6
b'accept.' 1
b'Face' 1
b'unless' 8
b'C-suite.' 1
b"that's" 40
b'attainable,' 1
b'odds' 7
b'attaining' 1
b'significant' 5
b'wealth' 1
b'millionaire' 1
b'smaller' 3
b'longer' 5
b'More' 2

b'emails' 9
b'drive' 3
b'school' 2
b'morning,' 2
b'freeing' 1
b'rest' 6
b'needs' 22
b'personal.' 1
b'room' 1
b'calendar' 1
b"doctor's" 1
b'appointment,' 1
b'catching' 1
b"kids'" 2
b'concert' 1
b'award' 1
b'presentation.' 1
b'boss,' 1
b'customers.' 4
b'travel' 4
b'occasional' 2
b'7' 3
b'3?' 1
b'11' 2
b'7?' 1
b"Nobody's" 4
b'stop' 9
b'eight-hour' 1
b'arbitrary' 1
b'construct' 2
b'fatigued,' 1
b'achieved' 2
b'today.' 2
b'worthless' 3
b'lunch,' 1
b'relax' 2
b'pool' 4
b'mental' 4
b'there?' 3
b'tradeoffs' 2
b'assistants' 1
b'handle' 6
b'8' 4
b'Tim' 1
b"Ferriss's" 1
b'"The' 4
b'Four' 1
b'Hour' 1
b'Week"' 1
b'depth' 4
b'strategy.' 1
b'title' 2
b'hyperbole.' 1
b'4' 6
b'week,' 3
b'handling' 2
b'inquiries' 1
b'batched' 1
b'directing' 1
b'contractors.' 1
b'vacations' 1
b'family,' 2
b'exactly' 2
b'40' 1
b'vacation,' 1
b'grow.' 2
b'ethic' 1
b'dividends.' 1
b'"Flexibility' 1
b'schedule"' 1
b'top-rated' 1
b'Report.' 1
b'surprised' 5
b'46%' 2
b'respondents' 1
b'themselves;' 1
b'myth' 1
b'extra-hard' 1


b'earlier' 2
b'more?"' 1
b'did.' 2
b'Both' 2
b'statements' 2
b'true' 1
b'businesses.' 5
b'designed' 6
b'fast,' 1
b'above' 2
b'else.' 6
b'contrast,' 4
b'exists' 1
b'sole' 6
b'purpose' 3
b'family-owned' 1
b'restaurants,' 1
b'barber' 1
b'shops,' 1
b'cleaners,' 1
b'landscapers,' 1
b'roofing' 1
b'companies,' 2
b'freelancers,' 2
b'selling' 31
b'crafts' 1
b'imported' 1
b'goods' 8
b'employs' 1
b'Google' 53
b'billions' 1
b'love' 4
b'terms,' 5
b'be,' 7
b'surrounded' 1
b'landscaper' 1
b'restaurateur' 1
b'it;' 4
b'options' 4
b'involve' 4
b'responsibility' 1
b'hiring' 13
b'managing' 2
b'others,' 7
b'physical' 17
b'location.' 1
b'IDEAL' 1
b'discussed' 5
b'earlier.' 2
b'start.' 4
b'review' 4
b'fullest.' 1
b'minimal' 3
b'limited,' 1
b'scale.' 2
b"freelancer's" 2
b'can.' 6
b'ideal' 1
b'digital' 10
b'front,' 2
b'indefinitely' 2
b'distribution,' 3
b'completely' 7
b'automatically.' 3
b'largely' 6
b'automated' 9
b'targeted' 14
b'ads' 46
b'optimized' 1
b'conversions.' 1
b'Physical' 1
b'outsource' 5
b'produc

b'simulation' 4
b'industry,' 2
b'stagnant' 1
b'Canada,' 1
b'Europe' 1
b'Asia' 1
b'globally,' 1
b'competing' 5
b'globally.' 1
b'Understanding' 2
b'global' 2
b'especially' 5
b'cost?' 1
b'problem.' 6
b'automate' 5
b'marketing?' 1
b'globally' 3
b'techniques' 3
b'cold-calling' 1
b'in-person' 1
b'demos' 1
b'salesperson' 5
b'everywhere' 1
b'Google,' 4
b'demographics' 1
b'Facebook?' 1
b'websites' 25
b'advertise' 6
b'on?' 3
b'obligation' 1
b'expense' 5
b'yes' 1
b'questions,' 1
b'marketing.' 3
b'human' 4
b'resources,' 3
b'systems' 3
b'allow' 6
b'discover,' 1
b'evaluate,' 1
b'personally.' 3
b'replicate?' 1
b'seven' 1
b'born' 1
b'minute,' 1
b'idea.' 7
b'serving' 1
b'apply' 9
b'provisional' 1
b'patent' 10
b'protect' 4
b'profit' 9
b'application' 2
b'(that' 1
b'way).' 1
b'defend' 1
b'lawyer,' 3
b'yours' 4
b'details.' 2
b'lends' 1
b'itself' 3
b'copyright' 2
b'protection,' 3
b'unlikely' 4
b'replicated.' 1
b'Sometimes' 4
b'leveraging' 2
b'talents' 1
b'insurmountably' 1
b'stolen' 1
b'starts' 4
b'own?' 1


b'different.' 1
b'done.' 4
b'definition' 2
b'course.' 2
b'working,' 1
b'feedback' 5
b'world.' 1
b'afraid' 3
b'straightforward' 1
b'service,' 2
b'cloud' 1
b'Amazon' 2
b'Web' 1
b'Services' 2
b'(AWS)' 1
b'offering,' 1
b'scaling' 1
b'manufacture' 2
b'garage' 1
b'manufacturer' 1
b'alibaba.com' 1
b'identifying' 3
b'overseas' 1
b'manufacturers' 1
b'experience' 10
b'double-checking' 1
b'it!)' 1
b'Searching' 1
b'"How' 1
b'China"' 1
b'detailed,' 1
b'articles' 4
b'subject.' 1
b'Neomek' 1
b'low-volume' 1
b"(I've" 1
b'endorsement).' 1
b'pocket,' 2
b'NAMING' 1
b'Brainstorm' 1
b'names' 8
b'enlist' 1
b'minds' 1
b'better!' 1
b'judging' 1
b'brainstorming.' 1
b'names.' 1
b'Which' 1
b'remember?' 1
b'aspect' 3
b'theyve' 1
b'seen' 2
b'Things' 1
b'spell' 1
b'good,' 1
b'memorable' 1
b'search,' 3
b'cant' 3
b'name!' 1
b'Regardless' 1
b'registered' 1
b'federal' 1
b'trademark' 1
b'trademarks' 1
b'whoever' 1
b'would' 1
b'businesses?' 1
b'letter' 2
b'web-based,' 1
b'GoDaddy.com' 1
b'availability' 1
b'domains' 2
b'.

b'lecture' 1
b'upholding' 1
b'confidentiality' 1
b'signed.' 1
b'spared' 1
b'indignities,' 1
b'applied' 1
b'Leaving' 2
b'trigger' 1
b'whirlwind,' 1
b'chaos' 2
b'wrap' 2
b'lock' 1
b'office:' 1
b'moving,' 1
b'forms,' 1
b'certificate' 1
b'prior' 2
b'file' 1
b'included' 3
b'H.S.A.' 1
b'(health' 1
b'account,)' 1
b'automatically;' 1
b'bleed' 1
b'dry' 1
b'gone.' 1
b'Accept' 2
b'COBRA' 2
b'healthcare.gov' 1
b'Compare' 1
b'lasts' 1
b'exchange.' 2
b'exit' 2
b'interview,' 1
b'vent' 1
b'frustrations' 1
b'Leave' 1
b'friendly' 1
b'forgivable' 1
b'competitor.' 1
b'left,' 1
b'employee!' 1
b'LAST' 1
b'cardboard' 1
b'belongings,' 1
b'feels' 2
b'weird.' 1
b'remember,' 1
b'unemployed' 1
b'Soon,' 1
b'unproductive' 1
b'artificially-set' 1
b'deadlines' 1
b'appeasing' 1
b'executives,' 1
b'fearing' 1
b'reputation' 2
b'advancement' 1
b'politics.' 1
b'paycheck!' 1
b'hard,' 1
b'survival' 1
b'instinct' 1
b'effort.' 3
b'final' 2
b'tips' 2
b'fuel' 1
b'growth.' 2
b'bits' 1
b'permanent' 1
b'FUELING' 1
b'FIRE' 1
b"now's

b'one' 1
b'extensions' 2
b'Ad' 2
b'"call-outs"' 1
b'snippets' 1
b'horn,' 1
b'extra,' 1
b'screen' 1
b'"ad' 1
b'extensions"' 1
b'remarketing' 4
b'browse' 1
b'Remarketing' 1
b'around,' 3
b'effective;' 1
b'automatic' 2
b'salespeople.' 1
b"CPA's" 2
b'"similar' 1
b'visitors".' 1
b'"audience"' 1
b'composed' 1
b'times.' 1
b'strategist.' 1
b'"account' 1
b'strategist"' 1
b'amazingly' 1
b'knowledgeable' 3
b'considered.' 1
b'definitely' 2
b'techniques.' 2
b'(the' 1
b'"landing' 1
b'page")' 1
b'closely' 1
b'split' 2
b'personas,' 1
b'keywords,' 1
b'group.' 1
b'Buying' 1
b'Ads' 3
b'offers,' 3
b'directly.' 1
b'buys' 1
b'Sometimes,' 1
b'unused' 1
b'("impressions")' 1
b'AdSense' 1
b'left-over' 1
b'impressions' 2
b'"remnant' 1
b'ads,"' 1
b'waters' 3
b'remnant' 1
b'lower-value' 1
b'gauge' 1
b'better-placed' 1
b'worthwhile?' 1
b'generate.' 1
b'substantially' 1
b'impressions,' 2
b'Oftentimes,' 1
b'math' 1
b'"brand' 1
b'awareness"' 1
b'intangible' 1
b'ad,' 10
b'bargain' 1
b'attractive' 1
b'bundle' 1
b'URL' 4


b'nudge' 1
b'Automated' 1
b'nudges' 1
b'AVOIDING' 1
b'PITFALLS' 1
b'economy.' 2
b'Administration,' 1
b"States'" 1
b'gross' 1
b'domestic' 1
b'world' 1
b'predators' 2
b'BEWARE' 5
b'LEECHES' 1
b'Being' 2
b'intimidating.' 1
b'unprepared' 1
b'started,' 1
b'fortune' 1
b'wary.' 2
b'Very' 1
b'willingness' 1
b'owners,' 2
b'scam' 1
b'recently' 2
b'directors,' 1
b'co-founder' 1
b'well-meaning,' 1
b'partners' 2
b'partner' 1
b'divorce?' 1
b'transferred' 1
b'not?' 2
b'die?' 1
b'contributing' 1
b'contingencies,' 1
b'entirely.' 1
b'Apart' 1
b'divorced' 1
b'Whoever' 1
b'decision-maker,' 1
b'acquirers' 1
b'complement' 1
b'knowledge' 2
b'incubator,' 2
b'council,' 1
b'Giving' 1
b'sense,' 1
b'beast' 1
b'entirely' 1
b'cut' 2
b'unsure' 1
b'WELL-MEANING' 1
b'ADVICE' 1
b'small-business' 1
b'great' 1
b'distracting.' 2
b'challenges.' 1
b'locally' 1
b"person's" 1
b'own?"' 1
b'unsolicited' 1
b'call.' 1
b'bears' 1
b'repeating.' 1
b'know,' 1
b'prospects,' 1
b'in-house' 1
b'testified' 1
b'Based' 1
b'convinced' 1
b"bu

If we look closely, there are several problems with the output. Since we only separated the words on whitespace, punctuation and capitalization are not taken into account. This means that some tokens are the same word but are counted separately, for example 'Kane' and 'Kane.', or 'The' and 'the'.
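The problem is easy to reproduce on a single sentence. This is a small plain-Python sketch in which collections.Counter serves as a local stand-in for countByValue(); the sample sentence is made up for illustration.

```python
from collections import Counter

sample = "The book is by Kane. Kane wrote the book."

# Splitting on whitespace keeps punctuation attached to the words.
counts = Counter(sample.split())

for token in ("The", "the", "Kane", "Kane.", "book", "book."):
    print(token, counts[token])
```

Each of the six tokens is counted once, even though the sentence only contains three distinct words among them.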

There is more than one way to resolve this issue, such as using the Natural Language Toolkit (NLTK). However, we'll keep it simple by using a *regular expression*.

A regular expression lets you write a pattern that tells your computer how you want to match or split up a string.
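As a quick illustration, here is what splitting on the pattern \W+ (one or more non-word characters) does to two short strings. This is a standalone sketch using Python's re module; the second sample string is made up.

```python
import re

# Hyphens, colons and spaces are all non-word characters.
tokens_a = re.split(r"\W+", "Self-Employment: Building an Internet Business")
print(tokens_a)  # ['Self', 'Employment', 'Building', 'an', 'Internet', 'Business']

# Apostrophes are non-word characters too, and a match at the very end
# of the string leaves an empty string as the last item.
tokens_b = re.split(r"\W+", "It's worth it.")
print(tokens_b)  # ['It', 's', 'worth', 'it', '']
```

The trailing empty string and the stray 's' are worth keeping in mind when reading the word counts.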

We're going to do the same thing as before, but instead of only using the split function to separate the words, we'll create our own function.

In [4]:
import re

def normalizeWords(text):
    # \W+ matches one or more non-word characters (anything other than
    # letters, digits and underscores), so splitting on it strips out
    # punctuation and whitespace and leaves only the words.
    # We also set everything to lowercase with lower() in order
    # to take into account words that begin a sentence or are
    # found in a title.
    return re.compile(r'\W+', re.UNICODE).split(text.lower())

words = input.flatMap(normalizeWords)
wordCounts = words.countByValue()

for word, count in wordCounts.items():
    cleanWord = word.encode('ascii', 'ignore')
    if (cleanWord):
        print(cleanWord, count)

b'self' 111
b'employment' 75
b'building' 33
b'an' 178
b'internet' 26
b'business' 383
b'of' 970
b'one' 100
b'achieving' 1
b'financial' 17
b'and' 934
b'personal' 48
b'freedom' 41
b'through' 57
b'a' 1191
b'lifestyle' 44
b'technology' 11
b'by' 122
b'frank' 11
b'kane' 10
b'copyright' 3
b'2015' 4
b'all' 137
b'rights' 3
b'reserved' 2
b'worldwide' 4
b'contents' 1
b'disclaimer' 2
b'preface' 2
b'part' 33
b'i' 387
b'making' 25
b'the' 1292
b'big' 42
b'decision' 12
b'overcoming' 2
b'inertia' 2
b'fear' 3
b'failure' 3
b'career' 31
b'indoctrination' 5
b'carrot' 4
b'on' 428
b'stick' 6
b'ego' 3
b'protection' 7
b'your' 1420
b'employer' 44
b'as' 343
b'security' 8
b'blanket' 2
b'why' 25
b'it' 649
b's' 391
b'worth' 39
b'unlimited' 6
b'growth' 39
b'potential' 38
b'investing' 16
b'in' 616
b'yourself' 78
b'not' 203
b'someone' 62
b'else' 33
b'no' 76
b'dependencies' 6
b'commute' 14
b'to' 1828
b'live' 25
b'where' 53
b'you' 1878
b'want' 122
b'work' 144
b'when' 102
b'how' 163
b'is' 560
b'for' 537
b'flowchart' 4
b's

b'accept' 9
b'interests' 7
b'yours' 15
b'thrived' 1
b'hit' 3
b'dead' 1
b'difficulty' 1
b'promoted' 1
b'director' 2
b'earns' 3
b'efforts' 17
b'instead' 37
b'lobbying' 1
b'required' 14
b'ended' 10
b'earning' 5
b'quantify' 1
b'sounds' 5
b'running' 11
b'grows' 5
b'asking' 3
b'offers' 16
b'intentionally' 2
b'limited' 12
b'expense' 10
b'prevent' 2
b'inequities' 2
b'control' 6
b'dependent' 2
b'based' 19
b'quality' 9
b'products' 67
b'demand' 9
b'story' 6
b'state' 16
b'again' 19
b'important' 58
b'careful' 6
b'mitigate' 1
b'choose' 6
b'face' 14
b'unless' 13
b'c' 3
b'suite' 1
b'attainable' 5
b'odds' 10
b'attaining' 1
b'significant' 5
b'wealth' 2
b'millionaire' 1
b'smaller' 3
b'importantly' 3
b'comes' 12
b'experiences' 4
b'buy' 15
b'everyone' 7
b'wants' 7
b'liked' 1
b'respected' 1
b'formed' 1
b'close' 11
b'bonds' 1
b'alongside' 4
b'others' 37
b'situations' 2
b'guess' 4
b'ceo' 9
b'president' 1
b'whatever' 26
b'call' 11
b'friends' 10
b'won' 39
b'respect' 1
b'guts' 1
b'yes' 5
b'daily' 3
b'interaction

b'ignore' 2
b'voice' 1
b'mails' 1
b'manage' 3
b'overwhelmed' 1
b'task' 5
b'hand' 5
b'tuning' 2
b'distractions' 1
b'communication' 5
b'monitoring' 1
b'score' 2
b'possess' 2
b'pull' 1
b'two' 17
b'items' 6
b'augment' 3
b'putting' 4
b'motivate' 1
b'missing' 3
b'address' 14
b'services' 20
b'0' 1
b'different' 21
b'independent' 2
b'child' 1
b'data' 24
b'aggregated' 1
b'quickbooks' 4
b'runaway' 1
b'partnering' 1
b'possesses' 1
b'whom' 2
b'implicitly' 1
b'partner' 6
b'catch' 3
b'guarantee' 2
b'determined' 1
b'whether' 21
b'rough' 1
b'outlook' 1
b'embark' 2
b'forget' 2
b'obvious' 3
b'section' 11
b'maximize' 9
b'benefits' 20
b'evolve' 3
b'offer' 29
b'variety' 1
b'diversify' 1
b'offerings' 3
b'reliable' 4
b'moleskine' 1
b'notebooks' 1
b'notes' 4
b'information' 25
b'learnings' 1
b'acquire' 2
b'revisiting' 1
b'continually' 1
b'refine' 2
b'period' 7
b'freelancer' 7
b'site' 38
b'cases' 5
b'employing' 1
b'mentioned' 4
b'earlier' 7
b'gestating' 1
b'expectation' 3
b'tasks' 8
b'pursue' 1
b'objectives' 1

b'leverage' 3
b'paragraph' 5
b'uncertain' 1
b'extend' 3
b'measurable' 2
b'knowing' 1
b'came' 11
b'clarify' 2
b'looks' 2
b'convince' 2
b'approve' 2
b'hugely' 2
b'detail' 3
b'analysis' 3
b'type' 3
b'percentage' 10
b'population' 1
b'lazy' 1
b'overstated' 1
b'cater' 5
b'reports' 2
b'discuss' 1
b'shrinking' 1
b'organizations' 1
b'belong' 2
b'membership' 1
b'landscape' 1
b'solves' 3
b'including' 5
b'item' 5
b'homework' 1
b'critique' 1
b'launch' 24
b'feature' 2
b'creep' 1
b'consumers' 2
b'measurement' 3
b'cram' 1
b'features' 5
b'delay' 1
b'complicate' 1
b'unnecessarily' 2
b'listen' 3
b'feedback' 8
b'inform' 1
b'updates' 3
b'talks' 1
b'mvp' 2
b'applies' 1
b'equally' 1
b'eric' 3
b'ries' 2
b'covers' 3
b'view' 7
b'consumer' 2
b'release' 32
b'announcing' 2
b'advertisements' 2
b'approach' 3
b'pricing' 1
b'standpoint' 7
b'arriving' 1
b'windfall' 2
b'rafi' 2
b'mohammed' 2
b'precise' 2
b'approaches' 1
b'determining' 2
b'optimal' 1
b'difference' 4
b'flow' 7
b'transaction' 2
b'subscription' 2
b'payment'

b'secret' 1
b'grabbing' 3
b'bare' 1
b'exposure' 2
b'shying' 1
b'cornerstone' 1
b'attracts' 1
b'repelling' 1
b'storefronts' 1
b'cares' 1
b'referrals' 3
b'profiles' 2
b'accordingly' 3
b'immortalize' 1
b'heavily' 1
b'adapt' 2
b'mobile' 9
b'tablet' 3
b'devices' 5
b'seconds' 2
b'emulate' 1
b'credits' 1
b'built' 4
b'maintainable' 1
b'blog' 10
b'basis' 4
b'answers' 1
b'perform' 3
b'action' 21
b'conversion' 4
b'integrated' 2
b'platform' 4
b'securing' 1
b'modify' 2
b'library' 2
b'plugins' 2
b'capabilities' 1
b'sketches' 1
b'navigation' 2
b'pages' 14
b'wireframes' 1
b'approved' 1
b'approval' 1
b'lightly' 2
b'recommendations' 1
b'adjust' 2
b'designs' 1
b'clear' 4
b'poor' 2
b'misunderstandings' 1
b'image' 5
b'visitors' 11
b'scroll' 2
b'trend' 2
b'lately' 2
b'tradeoff' 1
b'challenging' 4
b'static' 1
b'headline' 5
b'geared' 2
b'catchy' 1
b'insert' 2
b'quote' 1
b'blurb' 1
b'history' 1
b'affiliated' 1
b'copywriting' 1
b'complimentary' 1
b'journalism' 1
b'refines' 1
b'military' 3
b'rift' 4
b'gaming' 4


b'originally' 1
b'his' 5
b'agreed' 1
b'burdened' 1
b'spot' 1
b'mainly' 1
b'manipulation' 1
b'ineffective' 1
b'ulterior' 1
b'motive' 1
b'negotiating' 1
b'scammer' 1
b'maximizes' 1
b'questionable' 1
b'ramp' 1
b'handed' 1
b'prospecting' 1
b'outsourced' 1
b'emailing' 1
b'qualified' 1
b'independence' 1
b'compromises' 1
b'ethical' 1
b'dilemma' 1
b'advocates' 1
b'preserve' 1
b'asks' 1
b'employ' 1
b'flows' 1
b'measured' 1
b'pitfall' 1
b'random' 1
b'variations' 1
b'disappearance' 1
b'sees' 1
b'variation' 2
b'seasonal' 1
b'cycles' 1
b'aggregate' 1
b'susceptible' 1
b'outliers' 1
b'analyzing' 1
b'dummies' 3
b'deviation' 1
b'gamed' 1
b'congratulated' 1
b'believing' 1
b'dug' 1
b'vietnam' 1
b'documentation' 1
b'blocking' 1
b'suspicious' 1
b'variance' 1
b'thriving' 1
b'forever' 1
b'saturate' 1
b'provides' 1
b'metaphorical' 1
b'supporting' 1
b'adopt' 1
b'mindset' 1
b'conservatively' 1
b'104' 1
b'312' 1
b'mindful' 1
b'reserve' 1
b'outsourcing' 3
b'bankrupt' 1
b'automation' 1
b'automating' 2
b'sink' 1
b'

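Now we're going to reorganize the data so that we can output the number of times each word shows up, in descending order. Instead of countByValue(), which returns the whole result to the driver as a Python dictionary, we use map() and reduceByKey(), which keep the counts in an RDD so that Spark can sort them before we collect the results.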
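The steps of this reorganization can be traced on a small list in plain Python. This is only a local sketch: the word list is made up, no Spark is involved, and the comments name the Spark operation each step stands in for.

```python
words = ["you", "to", "you", "the", "to", "you"]

# map(lambda x: (x, 1)) analog: pair every word with an initial count of 1.
pairs = [(word, 1) for word in words]

# reduceByKey(lambda x, y: x + y) analog: add up the values sharing a key.
totals = {}
for word, n in pairs:
    totals[word] = totals.get(word, 0) + n

# Swap to (count, word) and sort by the count in descending order
# (a stand-in for map(...) followed by sortByKey(ascending=False)).
by_count = sorted(((n, word) for word, n in totals.items()),
                  key=lambda pair: pair[0], reverse=True)

print(by_count)  # [(3, 'you'), (2, 'to'), (1, 'the')]
```

In Spark the same steps run in parallel across partitions, and the relative order of words with equal counts should not be relied on.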

In [9]:
# We create a key/value pair of (word, 1) and then add up the values
# of all pairs that share the same key (word).
wordCounts = words.map(lambda x: (x,1)).reduceByKey(lambda x, y: x + y)
# We swap each pair to (count, word) so that the count becomes the key,
# and then sort by that key in descending order.
wordCountsSorted = wordCounts.map(lambda x: (x[1], x[0])).sortByKey(ascending=False)

# collect() brings the sorted (count, word) pairs back to the driver as a list.
results = wordCountsSorted.collect()

for result in results:
    count = str(result[0])
    word = result[1].encode('ascii', 'ignore')
    if word:
        # str() on bytes keeps the b'...' prefix seen in the output below.
        print(str(word) + ":\t\t" + count)

b'you':		1878
b'to':		1828
b'your':		1420
b'the':		1292
b'a':		1191
b'of':		970
b'and':		934
b'that':		747
b'it':		649
b'in':		616
b'is':		560
b'for':		537
b'on':		428
b'are':		424
b'if':		411
b's':		391
b'i':		387
b'business':		383
b'can':		376
b'be':		369
b'as':		343
b'have':		321
b'with':		315
b't':		301
b'this':		280
b'or':		278
b'time':		255
b'but':		242
b'they':		234
b'will':		231
b'what':		229
b'at':		220
b'my':		215
b're':		214
b'do':		207
b'not':		203
b'about':		202
b'more':		200
b'product':		182
b'an':		178
b'up':		177
b'need':		174
b'them':		166
b'from':		166
b'how':		163
b'there':		162
b'out':		161
b'new':		153
b'people':		145
b'work':		144
b'so':		143
b'just':		142
b'own':		140
b'all':		137
b'don':		133
b'get':		123
b'customers':		123
b'by':		122
b'want':		122
b'company':		122
b'their':		122
b'some':		121
b'll':		114
b'self':		111
b'website':		109
b'make':		108
b'may':		107
b'even':		104
b'when':		102
b'one':		100
b've':		95
b'than':		92
b'also':		91
b'job':		90
b'much':		

b'deliver':		10
b'necessary':		10
b'expensive':		10
b'plans':		10
b'investors':		10
b'exchange':		10
b'course':		10
b'hopefully':		10
b'given':		10
b'unique':		10
b'popular':		10
b'percentage':		10
b'blog':		10
b'boss':		9
b'ultimately':		9
b'manager':		9
b'deal':		9
b'position':		9
b'house':		9
b'4':		9
b'college':		9
b'following':		9
b'meet':		9
b'hear':		9
b'increase':		9
b'quality':		9
b'demand':		9
b'ceo':		9
b'share':		9
b'created':		9
b'rate':		9
b'creative':		9
b'clients':		9
b'cannot':		9
b'savings':		9
b'phone':		9
b'maximize':		9
b'goods':		9
b'amazon':		9
b'sometimes':		9
b'said':		9
b'mobile':		9
b'he':		9
b'fraud':		9
b'measuring':		9
b'alone':		9
b'reading':		9
b'bit':		9
b'minimum':		9
b'stock':		9
b'stream':		9
b'tech':		9
b'achieve':		9
b'viable':		9
b'answer':		9
b'option':		9
b'loan':		9
b'relationship':		9
b'affect':		9
b'accept':		9
b'simply':		9
b'called':		9
b'across':		9
b'happens':		9
b'involved':		9
b'system':		9
b'purchasing':		9
b'government':		9
b'easier':

b'projections':		2
b'science':		2
b'consideration':		2
b'orders':		2
b'maximizing':		2
b'amateur':		2
b'apart':		2
b'database':		2
b'sections':		2
b'alternative':		2
b'secrets':		2
b'shelton':		2
b'measurable':		2
b'clarify':		2
b'looks':		2
b'belong':		2
b'feature':		2
b'consumers':		2
b'consumer':		2
b'subscription':		2
b'payment':		2
b'removes':		2
b'issues':		2
b'treat':		2
b'stages':		2
b'tried':		2
b'taxable':		2
b'reputable':		2
b'refer':		2
b'congregate':		2
b'ranking':		2
b'magazines':		2
b'television':		2
b'defined':		2
b'envelope':		2
b'spreadsheet':		2
b'estimated':		2
b'mouth':		2
b'computers':		2
b'accounts':		2
b'keen':		2
b'mentally':		2
b'policies':		2
b'provision':		2
b'involves':		2
b'assumption':		2
b'series':		2
b'hanging':		2
b'precious':		2
b'meets':		2
b'minds':		2
b'domains':		2
b'variants':		2
b'filing':		2
b'w2':		2
b'realistically':		2
b'bookkeeping':		2
b'activities':		2
b'announce':		2
b'variable':		2
b'bear':		2
b'ticket':		2
b'kept':		2
b'private':		2

b'ethic':		1
b'myth':		1
b'opposite':		1
b'psychological':		1
b'association':		1
b'arises':		1
b'feeling':		1
b'depression':		1
b'alter':		1
b'disease':		1
b'complaints':		1
b'welcome':		1
b'conform':		1
b'architecture':		1
b'environment':		1
b'serious':		1
b'undertaking':		1
b'betting':		1
b'device':		1
b'assuming':		1
b'cell':		1
b'multiply':		1
b'looming':		1
b'spins':		1
b'depleted':		1
b'god':		1
b'severance':		1
b'passage':		1
b'affordable':		1
b'obamacare':		1
b'subsidizes':		1
b'u':		1
b'sticker':		1
b'angry':		1
b'pesky':		1
b'prudent':		1
b'horribly':		1
b'merely':		1
b'finished':		1
b'unfinished':		1
b'voice':		1
b'distractions':		1
b'monitoring':		1
b'pull':		1
b'motivate':		1
b'possesses':		1
b'rough':		1
b'outlook':		1
b'variety':		1
b'moleskine':		1
b'notebooks':		1
b'revisiting':		1
b'combinator':		1
b'talented':		1
b'apple':		1
b'outrageous':		1
b'male':		1
b'republic':		1
b'hell':		1
b'ageist':		1
b'212':		1
b'energetic':		1
b'hopeful':		1
b'drawn':		1
b'begin':		1

And there we have it: the words that show up most often, in descending order.

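The regular expression did not work perfectly: apostrophes count as non-word characters, so contractions are split and fragments like "s", "t", "re" and "ll" show up in the output. Still, it works well enough for what we're trying to do. If we wanted the best possible output, we could use a proper tokenizer such as the ones provided by NLTK.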
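Short of switching to NLTK, one possible refinement is to match the words themselves instead of splitting on non-word characters, so that apostrophes inside a word are kept. The sketch below is an assumption, not part of the original notebook: the function name normalize_words_keep_apostrophes and the pattern are hypothetical, and the pattern only recognizes ASCII letters and digits.

```python
import re

# Hypothetical pattern: runs of ASCII letters/digits, optionally joined
# by a straight (') or curly (\u2019) apostrophe.
TOKEN = re.compile(r"[a-z0-9]+(?:['\u2019][a-z0-9]+)*")

def normalize_words_keep_apostrophes(text):
    # findall() returns the matched words, so no empty strings are
    # produced and punctuation is dropped.
    return TOKEN.findall(text.lower())

result = normalize_words_keep_apostrophes("It's worth it. Don\u2019t stop!")
print(result)
```

Since it takes a line of text and returns a list of words, a function like this could be passed to flatMap() in place of normalizeWords. Contractions would then be counted as single words, at the cost of dropping accented and other non-ASCII letters.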