# Introduction

<br></br>
Take me to the [code](https://github.com/AMoazeni/Word-Count/blob/master/Code/Word%20Count.py) for Word Counting!

<br></br>
Counting how often each word occurs in a document is tedious to do by hand. Let's solve this problem with a few lines of Python code. The provided 'WordCount' function takes a file path (document_name) and a minimum frequency (min_occurrence), then prints and returns every word that occurs at least that many times. This code can be used to generate insights from documents.
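
As a rough illustration of the idea, here is a minimal sketch that counts words in a string with 'collections.Counter'. The name 'count_words' and the sample sentence are hypothetical and not part of the repository, whose own 'WordCount' function reads from a file instead.

```python
import re
from collections import Counter


def count_words(text, min_occurrence):
    """Return [count, word] pairs for words seen at least min_occurrence times."""
    # Lowercase the text and keep runs of 2 to 30 letters
    words = re.findall(r'\b[a-z]{2,30}\b', text.lower())
    # most_common() yields (word, count) pairs from most to least frequent
    return [[count, word] for word, count in Counter(words).most_common()
            if count >= min_occurrence]


# Made-up sample text, used only for illustration
sample = "The cat saw the dog. The dog saw a bird."
print(count_words(sample, 2))
```

Raising 'min_occurrence' shortens the list to the most frequent words, which is usually what you want for long documents.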

<br></br>
Once you download this repository, you can drop any '.txt' documents into the 'Data' folder and use the following code to analyze them. Here are some insights derived from the example documents in the 'Data' folder.
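
To analyze every '.txt' file in a folder in one pass, something like the following could work. This is a sketch under assumptions: 'count_folder' is a hypothetical helper that is not part of the repository, and it relies on 'collections.Counter' rather than the repository's own functions.

```python
import re
from collections import Counter
from pathlib import Path


def count_folder(folder, min_occurrence):
    """Map each '.txt' file name in folder to its [count, word] pairs."""
    results = {}
    for path in sorted(Path(folder).glob('*.txt')):
        # Ignore undecodable bytes so one odd character does not stop the run
        text = path.read_text(encoding='utf-8', errors='ignore').lower()
        words = re.findall(r'\b[a-z]{2,30}\b', text)
        results[path.name] = [[count, word]
                              for word, count in Counter(words).most_common()
                              if count >= min_occurrence]
    return results


if __name__ == '__main__':
    # 'Data' is the folder named in this README; adjust the path to where you run this
    for name, pairs in count_folder('Data', 10).items():
        print(name, pairs[:5])
```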

<br></br>
<div align="center"><img src="https://raw.githubusercontent.com/AMoazeni/Word-Count/master/Jupyter%20Notebook/Images/01%20-%20Counting.gif" width=40% alt="Counting"></div>


<br></br>

# Popular CEO Names

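An analysis of Fortune 1000 CEO names shows that David, James, and John are the most common, with 46, 44, and 41 occurrences respectively. So it turns out that you have the best chances of becoming CEO if you have one of those names.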

<br></br>
<div align="center"><img src="https://raw.githubusercontent.com/AMoazeni/Word-Count/master/Jupyter%20Notebook/Images/02%20-%20CEO%20Names.png" width=70% alt="CEO-Names"></div>


<br></br>

# Software Engineering Jobs

Apparently Software Engineer job postings frequently mention technical skills, scale, and communication. Good to know when writing your resume.

<br></br>
<div align="center"><img src="https://raw.githubusercontent.com/AMoazeni/Word-Count/master/Jupyter%20Notebook/Images/03%20-%20Software%20Engineer.png" width=70% alt="Software-Engineer"></div>
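
The raw counts for a job posting are dominated by common words such as 'and' and 'to', so surfacing terms like 'experience' or 'skills' means filtering those out first. A minimal sketch, assuming a small hand-picked stop-word set; 'STOP_WORDS' and 'drop_stop_words' are hypothetical names that are not part of the repository:

```python
# Hypothetical, hand-picked stop-word set; a real analysis would use a fuller list
STOP_WORDS = {'and', 'to', 'the', 'in', 'with', 'you', 'or', 'of',
              'for', 'on', 'as', 'be', 'we'}


def drop_stop_words(word_counts, stop_words=STOP_WORDS):
    """Keep only the [count, word] pairs whose word is not a stop word."""
    return [pair for pair in word_counts if pair[1] not in stop_words]


# A few [count, word] pairs taken from the Software Engineer output in this document
counts = [[67, 'and'], [58, 'to'], [35, 'the'],
          [21, 'experience'], [13, 'software'], [7, 'skills']]
print(drop_stop_words(counts))
```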


<br></br>

# Shakespeare Word Count

It turns out 'father', 'think', and 'queen' are used quite often in Shakespeare's writing. This may give us some insight into his personality.

<br></br>
<div align="center"><img src="https://raw.githubusercontent.com/AMoazeni/Word-Count/master/Jupyter%20Notebook/Images/04%20-%20Shakespeare.png" width=70% alt="Shakespeare"></div>


<br></br>

# Sherlock Holmes

Interestingly, 'night' and 'morning' come up often in the Sherlock Holmes stories, which suggests that Holmes likes to investigate mysteries at night and in the morning.

<br></br>
<div align="center"><img src="https://raw.githubusercontent.com/AMoazeni/Word-Count/master/Jupyter%20Notebook/Images/05%20-%20Sherlock.png" width=70% alt="Sherlock"></div>


<br></br>

# Code

1. Install [Anaconda](https://www.anaconda.com/download/).
2. Download this repository and navigate to it.
3. Copy the '.txt' file you want to analyze into the 'Data' folder.
4. Open the 'Word Count.ipynb' file with Jupyter Notebook.
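5. Set 'document_name' to the path of your '.txt' file (for example '../Data/Fortune CEO.txt') and choose a 'min_occurrence' value for the 'WordCount' function.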
6. Click 'Run' to step through the code.


<br></br>
```shell
$ git clone https://github.com/AMoazeni/Word-Count.git
$ cd Word-Count
```

<br></br>

# Happy Coding!

Check out [AMoazeni's GitHub](https://github.com/AMoazeni/) for more Data Science, Machine Learning, and Robotics repositories.

<br></br>
<div align="center"><img src="https://raw.githubusercontent.com/AMoazeni/Word-Count/master/Jupyter%20Notebook/Images/06%20-%20Cat%20Typing.gif" width=40% alt="Cat-Typing"></div>





In [2]:
# Author - AMoazeni
# License - MIT

import re  # Parsing Library


def ExtractWords(document_name):
    # Read the whole file and lowercase it; 'with' closes the file afterwards
    with open(document_name, 'r') as document_text:
        text_string = document_text.read().lower()
    
    # r'\b[a-z]{2,30}\b' - Extract words made of the letters a to z, 2 to 30 characters long
    all_words = re.findall(r'\b[a-z]{2,30}\b', text_string)
    return all_words



def WordFrequency(all_words):
    word_occur = {}
    
    for word in all_words:
        count = word_occur.get(word,0)
        word_occur[word] = count + 1
        
    return word_occur



def WordSort(word_occur):
    word_list = []
    
    # Store each entry as [count, word] so the count comes first
    for key, value in word_occur.items():
        word_list.append([value, key])
    
    # Sort by count, highest first
    words_sorted = sorted(word_list, key=lambda pair: pair[0], reverse=True)
    
    return words_sorted
    


def PrintList(words_sorted, min_occurrence):
    word_parse = []
    
    # Print and collect every [count, word] pair that meets the threshold
    for pair in words_sorted:
        if pair[0] >= min_occurrence:
            print(pair)
            word_parse.append(pair)
            
    return word_parse



def WordCount(document_name, min_occurrence):
    
    # Extract all words
    all_words = ExtractWords(document_name)
    
    # Count word occurrence frequency
    word_occur = WordFrequency(all_words)
    
    # Convert word occurrence dictionary into sorted list
    words_sorted = WordSort(word_occur)
    
    # Print and collect the [count, word] pairs that occur at least min_occurrence times
    word_parse = PrintList(words_sorted, min_occurrence)
    
    return word_parse
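
The pattern in 'ExtractWords' keeps only runs of 2 to 30 letters, so apostrophes and digits act as separators and single-letter words are dropped. That is why fragments such as 're' and 'll' show up in the results. A small illustration, using a made-up sample sentence:

```python
import re

# Made-up sentence with contractions, a digit, and single-letter words
sample = "We're sure you'll find 3 bugs; I found a few."

# Same pattern as in ExtractWords, applied to the lowercased text
tokens = re.findall(r'\b[a-z]{2,30}\b', sample.lower())
print(tokens)
```

Here "We're" and "you'll" are each split in two, while '3', 'I', and 'a' are left out entirely.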



In [4]:
# Main function - Fortune CEO names
if __name__ == '__main__':
    
    try:    
        # Enter file name here, make sure the '.txt' file is in the Data folder
        document_name = '../Data/Fortune CEO.txt'
        min_occurrence = 10
        
        # Count word occurrence in provided document
        word_parse = WordCount(document_name, min_occurrence)
        
        
    except Exception as e:
        print('Error:', e)

[46, 'david']
[44, 'james']
[41, 'john']
[38, 'michael']
[37, 'robert']
[37, 'thomas']
[34, 'jr']
[28, 'mark']
[22, 'william']
[20, 'richard']
[19, 'jeffrey']
[18, 'timothy']
[18, 'steven']
[17, 'christopher']
[15, 'iii']
[14, 'gary']
[13, 'gregory']
[13, 'stephen']
[12, 'brian']
[12, 'joseph']
[12, 'scott']
[11, 'smith']
[10, 'douglas']
[10, 'george']
[10, 'paul']
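
Words with equal counts, such as 'timothy' and 'steven' above, are not listed alphabetically. On Python 3.7 and later they keep the order in which they were first encountered, because dictionaries preserve insertion order and 'sorted' is stable. If an alphabetical order for ties is preferred, one option is to sort on a compound key. This is a sketch, not the repository's method, and 'sort_with_ties' is a hypothetical name:

```python
def sort_with_ties(word_occur):
    """Sort by count (highest first), then alphabetically for equal counts."""
    pairs = [[count, word] for word, count in word_occur.items()]
    # Negating the count sorts it in descending order while the word stays ascending
    return sorted(pairs, key=lambda pair: (-pair[0], pair[1]))


# Counts taken from the CEO output above
print(sort_with_ties({'timothy': 18, 'steven': 18, 'jeffrey': 19}))
```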


In [9]:
# Main function - US Constitution
if __name__ == '__main__':
    try:
        
        # Enter file name here, make sure the '.txt' file is in the Data folder
        document_name = '../Data/US Constitution.txt'
        min_occurrence = 10
        
        # Count word occurrence in provided document
        word_parse = WordCount(document_name, min_occurrence)
        
        
    except Exception as e:
        print('Error:', e)

[748, 'the']
[515, 'of']
[306, 'shall']
[270, 'and']
[209, 'to']
[183, 'be']
[162, 'or']
[149, 'in']
[135, 'states']
[120, 'president']
[104, 'by']
[89, 'united']
[84, 'for']
[80, 'state']
[80, 'any']
[67, 'as']
[63, 'have']
[62, 'congress']
[55, 'section']
[52, 'such']
[44, 'which']
[44, 'not']
[44, 'may']
[43, 'all']
[42, 'no']
[41, 'from']
[40, 'on']
[39, 'law']
[37, 'office']
[36, 'this']
[36, 'amendment']
[35, 'vice']
[34, 'house']
[34, 'person']
[33, 'but']
[31, 'constitution']
[31, 'representatives']
[31, 'other']
[29, 'article']
[29, 'senate']
[29, 'their']
[29, 'one']
[28, 'he']
[26, 'each']
[26, 'that']
[26, 'number']
[25, 'two']
[24, 'at']
[24, 'if']
[22, 'years']
[22, 'thereof']
[22, 'power']
[22, 'time']
[20, 'within']
[20, 'they']
[20, 'it']
[18, 'several']
[18, 'an']
[18, 'his']
[18, 'citizens']
[17, 'been']
[17, 'when']
[17, 'under']
[17, 'with']
[17, 'nor']
[16, 'electors']
[16, 'who']
[16, 'vote']
[16, 'same']
[16, 'duties']
[15, 'persons']
[15, 'them']
[14, 'powers']

In [10]:
# Main function - Software Engineer Jobs
if __name__ == '__main__':
    
    try:
        
        # Enter file name here, make sure the '.txt' file is in the Data folder
        document_name = '../Data/Software Engineer Jobs.txt'
        min_occurrence = 4
        
        # Count word occurrence in provided document
        word_parse = WordCount(document_name, min_occurrence)
        
        
    except Exception as e:
        print('Error:', e)

[67, 'and']
[58, 'to']
[35, 'the']
[25, 'in']
[23, 'with']
[21, 'experience']
[21, 'you']
[19, 'or']
[19, 'of']
[14, 'for']
[14, 'on']
[13, 'as']
[13, 'be']
[13, 'software']
[12, 'work']
[12, 'we']
[9, 'that']
[9, 'are']
[8, 'design']
[7, 'time']
[7, 'from']
[7, 'skills']
[7, 'about']
[7, 'google']
[7, 'our']
[7, 'problems']
[7, 'building']
[6, 'writing']
[6, 'systems']
[6, 'computer']
[6, 'programming']
[6, 'will']
[6, 'able']
[6, 'team']
[6, 'development']
[6, 'networking']
[6, 'an']
[6, 'your']
[6, 'performance']
[5, 'working']
[5, 'real']
[5, 'qualifications']
[5, 'science']
[5, 'is']
[5, 'if']
[5, 'end']
[5, 'technical']
[5, 'one']
[5, 'more']
[5, 'large']
[5, 'engineering']
[5, 'not']
[5, 'technology']
[5, 'engineers']
[5, 'scale']
[5, 're']
[5, 'streaming']
[5, 'platform']
[5, 'such']
[4, 'code']
[4, 'communication']
[4, 'system']
[4, 'need']
[4, 'data']
[4, 'great']
[4, 'information']
[4, 'language']
[4, 'other']
[4, 'but']
[4, 'technologies']
[4, 'products']
[4, 'have']
...

In [11]:
# Main function - Shakespeare
if __name__ == '__main__':
    try:
        
        # Enter file name here, make sure the '.txt' file is in the Data folder
        document_name = '../Data/Shakespeare.txt'
        min_occurrence = 3
        
        # Count word occurrence in provided document
        word_parse = WordCount(document_name, min_occurrence)
        
        
    except Exception as e:
        print('Error:', e)

[27597, 'the']
[26738, 'and']
[19771, 'to']
[18138, 'of']
[13826, 'you']
[12489, 'my']
[11536, 'that']
[11112, 'in']
[9755, 'is']
[8730, 'not']
[8309, 'for']
[8009, 'with']
[7777, 'me']
[7725, 'it']
[7115, 'be']
[6875, 'your']
[6859, 'his']
[6827, 'this']
[6679, 'he']
[6277, 'but']
[5967, 'as']
[5902, 'have']
[5549, 'thou']
[5276, 'so']
[5205, 'him']
[5008, 'will']
[4808, 'what']
[4448, 'by']
[4034, 'thy']
[3960, 'all']
[3882, 'are']
[3850, 'her']
[3828, 'do']
[3797, 'no']
[3614, 'we']
[3600, 'shall']
[3511, 'if']
[3188, 'on']
[3181, 'thee']
[3094, 'lord']
[3083, 'or']
[3061, 'our']
[3041, 'king']
[2834, 'good']
[2792, 'now']
[2764, 'sir']
[2647, 'from']
[2531, 'they']
[2519, 'come']
[2516, 'at']
[2410, 'she']
[2409, 'll']
[2369, 'let']
[2357, 'enter']
[2331, 'here']
[2321, 'which']
[2299, 'would']
[2292, 'more']
[2249, 'was']
[2241, 'well']
[2223, 'then']
[2210, 'there']
[2198, 'love']
[2168, 'am']
[2167, 'how']
[2075, 'their']
[2054, 'when']
[2034, 'man']
[1980, 'them']
[1942, 'hath']

[44, 'waste']
[44, 'fore']
[44, 'fate']
[44, 'west']
[44, 'desert']
[44, 'weight']
[44, 'root']
[44, 'weeds']
[44, 'unknown']
[44, 'bind']
[44, 'runs']
[44, 'stomach']
[44, 'ladyship']
[44, 'knaves']
[44, 'offended']
[44, 'bade']
[44, 'horn']
[44, 'yond']
[44, 'disposition']
[44, 'pinch']
[44, 'dagger']
[44, 'sooth']
[44, 'sleeps']
[44, 'silvius']
[44, 'sore']
[44, 'maids']
[44, 'prevail']
[44, 'bodies']
[44, 'reverence']
[44, 'smiles']
[44, 'fac']
[44, 'fatal']
[44, 'forces']
[44, 'prisoners']
[44, 'tempest']
[44, 'robert']
[44, 'cromwell']
[44, 'lodovico']
[43, 'attending']
[43, 'numbers']
[43, 'wrought']
[43, 'elder']
[43, 'dumb']
[43, 'farther']
[43, 'pour']
[43, 'picture']
[43, 'greet']
[43, 'ripe']
[43, 'fingers']
[43, 'courtier']
[43, 'advice']
[43, 'native']
[43, 'friendship']
[43, 'methought']
[43, 'peril']
[43, 'tread']
[43, 'utter']
[43, 'soothsayer']
[43, 'iras']
[43, 'govern']
[43, 'odds']
[43, 'capitol']
[43, 'device']
[43, 'regard']
[43, 'slaughter']
[43, 'gent']
...

[18, 'roof']
[18, 'minded']
[18, 'lofty']
[18, 'tiger']
[18, 'created']
[18, 'belongs']
[18, 'sullen']
[18, 'forlorn']
[18, 'trespass']
[18, 'respects']
[18, 'vassal']
[18, 'shames']
[18, 'brass']
[18, 'contain']
[18, 'enrich']
[18, 'added']
[18, 'limit']
[18, 'profane']
[18, 'chose']
[18, 'dwells']
[18, 'stops']
[18, 'scandal']
[18, 'compound']
[18, 'minion']
[18, 'bait']
[18, 'despise']
[18, 'falsely']
[18, 'etc']
[18, 'commodity']
[18, 'returns']
[18, 'especially']
[18, 'frenchmen']
[18, 'reports']
[18, 'arriv']
[18, 'safer']
[18, 'mort']
[18, 'turk']
[18, 'ice']
[18, 'stake']
[18, 'vent']
[18, 'manly']
[18, 'helm']
[18, 'trusty']
[18, 'address']
[18, 'disguise']
[18, 'assay']
[18, 'hedge']
[18, 'cozen']
[18, 'confession']
[18, 'marvellous']
[18, 'straw']
[18, 'description']
[18, 'velvet']
[18, 'stool']
[18, 'pleases']
[18, 'diet']
[18, 'fulvia']
[18, 'hatch']
[18, 'violence']
[18, 'witchcraft']
[18, 'impatience']
[18, 'article']
[18, 'dangers']
[18, 'priests']
[18, 'julius']
...

[13, 'blank']
[13, 'tenour']
[13, 'verily']
[13, 'howling']
[13, 'studies']
[13, 'outrage']
[13, 'male']
[13, 'drunkard']
[13, 'fowl']
[13, 'spain']
[13, 'meets']
[13, 'ecstasy']
[13, 'sanctuary']
[13, 'approved']
[13, 'holiness']
[13, 'rabble']
[13, 'tullus']
[13, 'hecuba']
[13, 'bows']
[13, 'approbation']
[13, 'puff']
[13, 'ingrateful']
[13, 'accusation']
[13, 'omit']
[13, 'temples']
[13, 'churchyard']
[13, 'chances']
[13, 'dragon']
[13, 'vast']
[13, 'thumb']
[13, 'recreant']
[13, 'imprison']
[13, 'protection']
[13, 'gallia']
[13, 'taper']
[13, 'steeds']
[13, 'dishes']
[13, 'quake']
[13, 'afore']
[13, 'chambers']
[13, 'despis']
[13, 'quantity']
[13, 'perilous']
[13, 'lane']
[13, 'disorder']
[13, 'wretches']
[13, 'accurs']
[13, 'wisest']
[13, 'assume']
[13, 'rey']
[13, 'opinions']
[13, 'carrion']
[13, 'bawdy']
[13, 'unmannerly']
[13, 'daggers']
[13, 'merriment']
[13, 'model']
[13, 'gadshill']
[13, 'nicholas']
[13, 'beef']
[13, 'coventry']
[13, 'disgrac']
[13, 'follies']
[13, 'morton']

[8, 'bullets']
[8, 'bragging']
[8, 'tewksbury']
[8, 'commotion']
[8, 'melody']
[8, 'mischiefs']
[8, 'attendance']
[8, 'lineal']
[8, 'wilderness']
[8, 'somebody']
[8, 'swing']
[8, 'creditors']
[8, 'gentles']
[8, 'wooden']
[8, 'devout']
[8, 'dukedoms']
[8, 'fun']
[8, 'castles']
[8, 'worshipp']
[8, 'waxen']
[8, 'profits']
[8, 'family']
[8, 'maidens']
[8, 'abound']
[8, 'aunchient']
[8, 'te']
[8, 'seigneur']
[8, 'pless']
[8, 'outrun']
[8, 'bedlam']
[8, 'suburbs']
[8, 'keepers']
[8, 'servile']
[8, 'timeless']
[8, 'betroth']
[8, 'broker']
[8, 'pursuivant']
[8, 'betide']
[8, 'thump']
[8, 'firmly']
[8, 'uncivil']
[8, 'bastardy']
[8, 'injustice']
[8, 'skins']
[8, 'gardener']
[8, 'wiltshire']
[8, 'tigers']
[8, 'toads']
[8, 'bridal']
[8, 'brew']
[8, 'sways']
[8, 'tough']
[8, 'doubted']
[8, 'capucius']
[8, 'abergavenny']
[8, 'legitimate']
[8, 'savours']
[8, 'popilius']
[8, 'dardanius']
[8, 'valor']
[8, 'revenged']
[8, 'checks']
[8, 'knots']
[8, 'hovel']
[8, 'doct']
[8, 'woful']
[8, 'aquitaine']
...

[5, 'crowd']
[5, 'blooded']
[5, 'sherris']
[5, 'immodest']
[5, 'births']
[5, 'newest']
[5, 'prophesied']
[5, 'observing']
[5, 'shorten']
[5, 'imp']
[5, 'mortified']
[5, 'unloose']
[5, 'authors']
[5, 'pillage']
[5, 'southampton']
[5, 'crete']
[5, 'refresh']
[5, 'brabant']
[5, 'redoubted']
[5, 'mightiness']
[5, 'stillness']
[5, 'greyhounds']
[5, 'anglais']
[5, 'doigts']
[5, 'mots']
[5, 'nous']
[5, 'dit']
[5, 'sauf']
[5, 'honneur']
[5, 'pridge']
[5, 'foaming']
[5, 'russian']
[5, 'cripple']
[5, 'slough']
[5, 'contrived']
[5, 'provender']
[5, 'crispian']
[5, 'tip']
[5, 'signieur']
[5, 'luxurious']
[5, 'fer']
[5, 'esteems']
[5, 'contaminated']
[5, 'macedon']
[5, 'leeks']
[5, 'lowliness']
[5, 'forge']
[5, 'cudgell']
[5, 'interview']
[5, 'hedges']
[5, 'mead']
[5, 'meads']
[5, 'cited']
[5, 'curl']
[5, 'contending']
[5, 'commonly']
[5, 'poictiers']
[5, 'bonfires']
[5, 'mice']
[5, 'descry']
[5, 'rays']
[5, 'renounce']
[5, 'touraine']
[5, 'cuff']
[5, 'thirteen']
[5, 'prophetess']
[5, 'chastise']
...

[3, 'adventures']
[3, 'dainties']
[3, 'cates']
[3, 'bridget']
[3, 'muffle']
[3, 'harbinger']
[3, 'conquers']
[3, 'ell']
[3, 'spherical']
[3, 'bogs']
[3, 'chalky']
[3, 'unfinish']
[3, 'pentecost']
[3, 'locking']
[3, 'fineness']
[3, 'tilting']
[3, 'pleaded']
[3, 'sere']
[3, 'commodities']
[3, 'viol']
[3, 'sorceress']
[3, 'parings']
[3, 'harshly']
[3, 'wizard']
[3, 'customers']
[3, 'priory']
[3, 'prevailing']
[3, 'reprehended']
[3, 'sauc']
[3, 'preserving']
[3, 'rately']
[3, 'scorch']
[3, 'harlots']
[3, 'disturbed']
[3, 'mountebank']
[3, 'juggler']
[3, 'gnawing']
[3, 'unbound']
[3, 'untun']
[3, 'froze']
[3, 'wasting']
[3, 'accidentally']
[3, 'arose']
[3, 'velutus']
[3, 'nicanor']
[3, 'altitude']
[3, 'bats']
[3, 'vigilant']
[3, 'trumpeter']
[3, 'agents']
[3, 'storehouse']
[3, 'foxes']
[3, 'surer']
[3, 'swims']
[3, 'marriages']
[3, 'quarry']
[3, 'arguing']
[3, 'demerits']
[3, 'singularity']
[3, 'certainties']
[3, 'sincerely']
[3, 'butterfly']
[3, 'larum']
[3, 'ladders']
[3, 'boils']
...

In [13]:
# Main function - Sherlock Holmes
if __name__ == '__main__':
    
    try:
        
        # Enter file name here, make sure the '.txt' file is in the Data folder
        document_name = '../Data/Sherlock Holmes.txt'
        min_occurrence = 3
        
        # Count word occurrence in provided document
        word_parse = WordCount(document_name, min_occurrence)
        
        
    except Exception as e:
        print('Error:', e)

[5810, 'the']
[3088, 'and']
[2823, 'to']
[2778, 'of']
[1823, 'in']
[1767, 'that']
[1749, 'it']
[1572, 'you']
[1486, 'he']
[1411, 'was']
[1159, 'his']
[1150, 'is']
[1007, 'my']
[929, 'have']
[877, 'with']
[863, 'as']
[830, 'had']
[784, 'at']
[778, 'which']
[752, 'for']
[657, 'not']
[656, 'but']
[646, 'be']
[635, 'me']
[539, 'we']
[535, 'this']
[517, 'there']
[512, 'from']
[486, 'said']
[467, 'holmes']
[467, 'upon']
[450, 'so']
[434, 'him']
[430, 'her']
[426, 'she']
[410, 'all']
[405, 'your']
[401, 'very']
[400, 'no']
[393, 'been']
[391, 'on']
[391, 'what']
[378, 'one']
[371, 'by']
[367, 'then']
[355, 'are']
[349, 'were']
[338, 'an']
[327, 'would']
[323, 'out']
[323, 'when']
[305, 'man']
[305, 'up']
[303, 'do']
[287, 'could']
[286, 'has']
[280, 'if']
[276, 'or']
[275, 'into']
[275, 'mr']
[274, 'who']
[270, 'will']
[269, 'little']
[245, 'some']
[234, 'now']
[232, 'see']
[230, 'down']
[212, 'may']
[212, 'our']
[212, 'should']
[202, 'they']
[201, 'well']
[190, 'can']
[185, 'am']
[184, 'us']

[4, 'fond']
[4, 'wheeler']
[4, 'treated']
[4, 'unforeseen']
[4, 'vanish']
[4, 'directed']
[4, 'upward']
[4, 'material']
[4, 'fully']
[4, 'torn']
[4, 'violet']
[4, 'gaiters']
[4, 'elastic']
[4, 'bears']
[4, 'puffing']
[4, 'chemical']
[4, 'linen']
[4, 'succeed']
[4, 'defect']
[4, 'devoted']
[4, 'utterly']
[4, 'warm']
[4, 'seek']
[4, 'engagement']
[4, 'whip']
[4, 'bitter']
[4, 'recognise']
[4, 'list']
[4, 'stated']
[4, 'platform']
[4, 'murdered']
[4, 'largest']
[4, 'lake']
[4, 'mile']
[4, 'mentioned']
[4, 'excited']
[4, 'blunt']
[4, 'inquest']
[4, 'wednesday']
[4, 'season']
[4, 'confession']
[4, 'forgotten']
[4, 'deceased']
[4, 'hideous']
[4, 'allusion']
[4, 'disturbed']
[4, 'consciousness']
[4, 'hypothesis']
[4, 'sky']
[4, 'newspapers']
[4, 'bless']
[4, 'absurd']
[4, 'forming']
[4, 'rattle']
[4, 'dignity']
[4, 'wandered']
[4, 'backed']
[4, 'novel']
[4, 'wander']
[4, 'unhappy']
[4, 'surgeon']
[4, 'accused']
[4, 'tale']
[4, 'deductions']
[4, 'inferences']
[4, 'animal']
[4, 'highroad']
...