## `lab06`—Data Analysis

**Objectives**

-   Use lists and list features to process data from a file.
-   Apply the standard pipeline of data analysis:  data cleaning and preparation, data processing, and output.

There are 6 questions to answer.

Double-click here to list your and your collaborators' names and **StudentIDs**:  李瑞琦 3180111638

![](https://upload.wikimedia.org/wikipedia/en/9/9c/MilesDavisKindofBlue.jpg)

For this lab project, we are going to work with a list of blues musicians in various traditions.  You have the following files in your `lab06/data` directory:
    
    blues.txt
    swamppop.txt
    zydeco.txt

Open the file `blues.txt` in a text editor and examine its format.
    
    # Delta Blues
    Cecil Augusta
    Mose Allison
    Tommy Bankhead
    ...
    Elder Roma Wilson
    
    # Chicago Blues
    Alberta Adams
    Luther Allison
    ...

We are going to read the contents of this file and run it through the standard pipeline of data analysis:  data cleaning and preparation, data processing, and output.

### Data Cleaning and Preparation

A file simply contains an ordered collection of characters and digits.  Any logical interpretation is created by the user as he or she uses the data.  Thus we first need to ensure that incoming data are in a suitable format and structure for further analysis.

What does a suitable format look like?  If you were organizing a library of these musicians' works, you could sort by surname or by musical style.  There are several formats which could make sense, such as a collection of *database records* containing *fields* such as "artist name" and "musical style".  We will opt here for a spreadsheet-like organization:  three "columns" of data:  "surname", "first name", and "style".  Each entry will be a tuple inside of a list.

**Logical (Human-Readable) Format of the Data:**

    artist surname  artist first name   musical style
    Augusta         Cecil               Delta Blues
    Alexander       Alger "Texas"       Country Blues

**One Machine Representation of the Data:**

    musicians[0] = ('Augusta', 'Cecil', 'Delta Blues')
    musicians[1] = ('Alexander', 'Alger "Texas"', 'Country Blues')
    ...
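
As a small illustration of this representation, the sketch below builds the two records shown above and reads fields back, first by position and then by tuple unpacking. The variable name `first_surname` is only for illustration.

```python
# Two sample records, copied from the table above.
musicians = [
    ('Augusta', 'Cecil', 'Delta Blues'),
    ('Alexander', 'Alger "Texas"', 'Country Blues'),
]

# Fields can be read by position ...
first_surname = musicians[0][0]

# ... or unpacked into named variables in one step.
surname, first_name, style = musicians[1]

print(first_surname)
print(surname, first_name, style)
```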

Thus a single record should have one entry from each of these fields.  The data file we import from, however, is not in this format, so when importing the data we have to:

1.  track which musical style we are currently importing;
2.  *tokenize* the name into first-name and surname components; and
3.  add the tuple of these three items to a list.

Let's see how this works with a single record as a string.  If you have a line of text as a string,

In [93]:
example = 'Jimmy Clanton'
print(example)

Jimmy Clanton


and you'd like to turn it into a tuple `record = (surname, first_name)`, how would you do it?

First, you'd probably want to split it into pieces so there is a separate `first_name` and `surname` to assign:

In [94]:
names = example.split(' ')
print(names)

['Jimmy', 'Clanton']


Now you have a `list` called `names` which you can use to assign the separate variables in your tuple either in two steps:

In [95]:
# step 1:  get the surname and first name separately
surname = names[1]
first_name = names[0]

# step 2:  populate the record tuple
record = (surname, first_name)
print(record)

('Clanton', 'Jimmy')


or equivalently in a single step:

In [96]:
record = (names[1], names[0])
print(record)

('Clanton', 'Jimmy')


Since we're just using a simple tuple of strings, that's all you need to worry about to create a record.

This gets a little trickier with multiple first names, though:

In [97]:
example = 'John Henry Barbee'

In this case, I suggest the following method.  First, split the name into its pieces like this:

In [98]:
names = example.split(' ')
print(names)

['John', 'Henry', 'Barbee']


Next, pull the surname out as a single name.  (We have already removed Jrs. and Srs. from these data, so there's nothing exceptional left.)

In [99]:
surname = names[-1]
print(surname)

Barbee


Finally, join the other names back together using the handy `join` method:

In [100]:
first_name = ' '.join(names[:-1])  # note how this joins a list of strings by a blank space---it's an odd way of writing this
print(first_name)

John Henry


In [101]:
record = (surname, first_name)
print(record)

('Barbee', 'John Henry')
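
An alternative sketch, not the method this lab asks for: `str.rsplit` with a limit of one split cuts the string only at its last space, which yields the same two pieces without a `join`. The helper name `split_name` is hypothetical, and treating a one-word name as a surname with an empty first name is an assumption.

```python
def split_name(full_name):
    """Return (surname, first_name) for a space-separated name."""
    parts = full_name.strip().rsplit(' ', 1)   # split at the last space only
    if len(parts) == 1:
        # Assumption: a one-word name is treated as a surname.
        return (parts[0], '')
    first_name, surname = parts
    return (surname, first_name)

print(split_name('John Henry Barbee'))
print(split_name('Jimmy Clanton'))
```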


<div class="alert alert-info">
Note that we are using tuples for the records.  Recall that tuples are like `list`s, but can't be changed (*immutable*).  Since we won't need to change a record once it's created, it makes sense in this case.
</div>
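
To make the immutability concrete, here is a minimal sketch: assigning to an element of a tuple raises a `TypeError`, so "changing" a record means building a new tuple. The replacement surname `'Hooker'` is just an illustrative value.

```python
record = ('Barbee', 'John Henry')

try:
    record[0] = 'Hooker'          # tuples do not support item assignment
    changed = True
except TypeError as err:
    changed = False
    print('Cannot modify a tuple:', err)

# To "change" a record, build a new tuple instead.
new_record = ('Hooker',) + record[1:]
print(new_record)
```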

Let's now store a list of records.  Practice this process on the following entries:

In [102]:

# Setup code for your exercise---do NOT edit this cell
records = [] # an empty list
entries = ['Ivory Joe Hunter', 'Etta James', 'Little Willie Littlefield', 'Robert Lowery', 'J. J. Malone', 'Percy Mayfield', 'Jimmy McCracklin']

In [103]:
###################################
####### Q1 [5 points] ############
###################################

# Cycle through each musician name and add it to records as a (surname, first_name) tuple.
for entry in entries:
    names = entry.split(' ')             # get the name elements out of entry
    surname = names[-1]                  # the last element is the surname
    first_name = ' '.join(names[:-1])    # everything else is the first name
    record = (surname, first_name)       # create a new record
    records.append(record)               # append the record to the list records
print(records)

[('Hunter', 'Ivory Joe'), ('James', 'Etta'), ('Littlefield', 'Little Willie'), ('Lowery', 'Robert'), ('Malone', 'J. J.'), ('Mayfield', 'Percy'), ('McCracklin', 'Jimmy')]


In [104]:
# it should pass this test---do NOT edit this cell
assert records[0]  == ('Hunter', 'Ivory Joe')
assert records[-1] == ('McCracklin', 'Jimmy')
print('Success!')

Success!


If you append the tuple in a single step, as in `records.append((surname, first_name))`, make sure you understand why there are two sets of parentheses, and ask a classmate or TA to explain if you do not.
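
A short sketch of the point about parentheses: the inner pair builds one tuple, and the outer pair is the call to `append`. Leaving out the inner pair passes two arguments, which `list.append` rejects. The sample names are taken from the exercise entries above.

```python
records = []

# The inner parentheses build one tuple; the outer ones call `append`.
records.append(('Hunter', 'Ivory Joe'))

# Without the inner pair, `append` receives two arguments and fails.
try:
    records.append('James', 'Etta')
except TypeError as err:
    print('append takes exactly one argument:', err)

print(records)
```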

Now we will turn the above process into a general script to load and process the data.

-   Compose a function `process` which accepts a string `filename`.  `process` should `return` a list of records contained in the file.

In [105]:
###################################
####### Q2 [38 points] ############
###################################

def process(filename):
    # Create a blank list called `entries` and an empty string called `current_style`.
    entries = []
    current_style = ''

    # Open the file `filename`, read the data using readlines(), and close it.
    # (The `with` block closes the file automatically, even if reading fails.)
    with open(filename, 'r') as myfile:
        lines = myfile.readlines()

    # Loop through each line of the file and do the following:
    for line in lines:
        # Strip the whitespace off of the ends of the line using the `strip` method.
        line = line.strip()

        # If the line is blank, `continue` execution.
        # (The `continue` statement makes Python go back to the `for` loop with the next value;
        # no more code is executed for the current value.)
        if not line:
            continue

        # If a line starts with `#`, it contains the musical style to be assigned to the
        # musicians below until the next line with `#`.  Remove the `'#'` (and any
        # whitespace after it) and assign the musical style to `current_style`.
        if line[0] == '#':
            current_style = line[1:].strip()
            continue

        # Otherwise, a line contains a blues musician.  Process the record much as you did above,
        # adding the musical style to the tuple as the third element.
        names = line.split(' ')
        surname = names[-1]
        first_name = ' '.join(names[:-1])
        entries.append((surname, first_name, current_style))

    # Finally, `return` the list `entries`.
    return entries

In [106]:
zydeco = process('./data/zydeco.txt')
# test your code here.  You may edit this cell, and you may use the files 'blues.txt', 'swamppop.txt', and 'zydeco.txt',
# all of which are located in the 'data/' directory.


In [107]:
# it should pass this test---do NOT edit this cell
# test basic single-genre case
zydeco = process('./data/zydeco.txt')
assert zydeco[0]  == ('Chavis', 'Boozoo', 'Zydeco')
assert zydeco[-3] == ('Dopsy', "Rockin'", 'Zydeco')
print('Success!')

Success!


In [108]:
# it should pass this test---do NOT edit this cell
# test case with multiple genres
blues = process('./data/blues.txt')
assert blues[0]  == ('Augusta', 'Cecil', 'Delta Blues')
assert blues[-1] == ('Rose', 'Bayless', 'Piano Blues')
print('Success!')

Success!


Typically, you would also need to test for duplicates.  In this case, we've already removed the duplicate entries that were present.  (Duplicates can arise because many musicians performed in several styles within the genre of blues, so the same name may be listed under more than one style.)
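
If you did need to test for duplicates, one possible approach is sketched below. The function name `find_duplicates` is hypothetical, and the sample uses two Barbecue Bob records from the blues data plus an invented exact repeat. Note that only exact repeats of a whole record count as duplicates here; the same name under a different style is kept.

```python
def find_duplicates(records):
    """Return the records that occur more than once, in first-seen order."""
    seen = set()
    duplicates = []
    for record in records:
        if record in seen and record not in duplicates:
            duplicates.append(record)
        seen.add(record)
    return duplicates

sample = [
    ('Bob', 'Barbecue', 'Country Blues'),
    ('Bob', 'Barbecue', 'Piedmont Blues'),   # same name, different style: not a duplicate
    ('Bob', 'Barbecue', 'Country Blues'),    # invented exact repeat
]
print(find_duplicates(sample))
```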

###  Data Processing

You should now have a function which loads a formatted text file and converts that file into a collection of records.  You can use these records to ask and answer questions about the properties of the data set.

In [109]:
blues = process('./data/blues.txt')

#### Sorting

For instance, sort by surname and list the first ten entries.

In [110]:
blues.sort()  # note that sort sorts in place, rather than returning a result to you---this will trip you up if you are not careful!
blues[:10]

[('Abshire', 'Nathan', 'Swamp Blues'),
 ('Adams', 'Alberta', 'Chicago Blues'),
 ('Alexander', 'Alger "Texas"', 'Country Blues'),
 ('Alexander', 'Dave', 'West Coast Blues'),
 ('Alexander', 'Linsey', 'Chicago Blues'),
 ('Allison', 'Luther', 'Chicago Blues'),
 ('Allison', 'Mose', 'Delta Blues'),
 ('Ammons', 'Albert', 'Piano Blues'),
 ('Anderson', 'Pink', 'Piedmont Blues'),
 ('Armstrong', 'Howard "Louie Bluie"', 'Country Blues')]

Tuples will automatically be sorted by the first element, then the second, then the third; this makes sorting by surname easy.  Sorting by the other fields, such as first name or style, is a bit more involved.  You have to provide a `key` to `sort` so it knows what to sort by.  The easiest way is to write a function which `return`s the second element (or the first name, in this case):

In [111]:
def second_element(a_list):
    return a_list[1]
blues = sorted(blues, key=second_element)
blues  # note that `"Baby Face" Leroy` sorts by `"` not `B`

[('Foster', '"Baby Face" Leroy', 'Chicago Blues'),
 ('Reed', 'A. C.', 'Chicago Blues'),
 ('Ammons', 'Albert', 'Piano Blues'),
 ('Collins', 'Albert', 'Chicago Blues'),
 ('Adams', 'Alberta', 'Chicago Blues'),
 ('Seward', 'Alec', 'Country Blues'),
 ('Alexander', 'Alger "Texas"', 'Country Blues'),
 ('Milburn', 'Amos', 'Piano Blues'),
 ('Odom', 'Andrew', 'Chicago Blues'),
 ('Stidham', 'Arbee', 'Chicago Blues'),
 ('Edwards', 'Archie', 'Piedmont Blues'),
 ('Burton', 'Aron', 'Chicago Blues'),
 ('Crudup', 'Arthur "Big Boy"', 'Delta Blues'),
 ('Spires', 'Arthur "Big Boy"', 'Chicago Blues'),
 ('Tate', 'Baby', 'Country Blues'),
 ('Bob', 'Barbecue', 'Country Blues'),
 ('Bob', 'Barbecue', 'Piedmont Blues'),
 ('Smith', "Barkin' Bill", 'Chicago Blues'),
 ('Chuck', 'Barrelhouse', 'Chicago Blues'),
 ('Rose', 'Bayless', 'Piano Blues'),
 ('Tucker', 'Bessie', 'Country Blues'),
 ('Broonzy', 'Big Bill', 'Chicago Blues'),
 ('Kinsey', 'Big Daddy', 'Chicago Blues'),
 ('McNeely', 'Big Jay', 'West Coast Blues'),
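
As an alternative to a named helper, the standard library's `operator.itemgetter` or a `lambda` can serve as the `key`. The sketch below sorts a few records copied from the output above by style, and then by style and surname together; the variable names are illustrative only.

```python
from operator import itemgetter

# A few records copied from the output above.
sample = [
    ('Rose', 'Bayless', 'Piano Blues'),
    ('Adams', 'Alberta', 'Chicago Blues'),
    ('Ammons', 'Albert', 'Piano Blues'),
    ('Collins', 'Albert', 'Chicago Blues'),
]

# Sort by style only (index 2); ties keep their original order.
by_style = sorted(sample, key=itemgetter(2))

# Sort by style first, then by surname within each style.
by_style_then_surname = sorted(sample, key=lambda record: (record[2], record[0]))

for record in by_style_then_surname:
    print(record)
```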


-   How many musical styles are there, and how many musicians in each style?
    
    To answer this question, we'll use a `dict`ionary.  A `dict` is like a `list`, except that a `list` is indexed by `int`s while a `dict` can be indexed by many different data types.
    
    A `list` that contains strings, for example, can be indexed directly by the position of each string in the `list`:
    
    ![](./img/list-pic.png)

In [112]:
list_example = ['alpha', 'beta', 'gamma', 'delta', 'omega']
print(list_example)

['alpha', 'beta', 'gamma', 'delta', 'omega']


A `dict`, in contrast, can use strings or floating-point values or tuples (and many other things) to index.  Consider a `dict` that uses English color names to identify HTML color codes:

![](./img/dict-pic.png)

In [113]:
dict_example = {'red': '#FF0000', 'green': '#00FF00', 'yellow': '#FFFF00', 'blue': '#0000FF', 'black': '#000000'}
print(dict_example)

{'red': '#FF0000', 'green': '#00FF00', 'yellow': '#FFFF00', 'blue': '#0000FF', 'black': '#000000'}


In [114]:
dict_example['yellow']

'#FFFF00'

You can add new entries to the `dict` very easily:

In [115]:
dict_example['grey'] = '#888888'
dict_example

{'red': '#FF0000',
 'green': '#00FF00',
 'yellow': '#FFFF00',
 'blue': '#0000FF',
 'black': '#000000',
 'grey': '#888888'}

So to count the number of musicians by style, we can use a `dict` to organize our counting.

1.  Loop over the records.
2.  For each record, get the musical style field.
3.  If the style is in the `dict`, add one to the count.
4.  If the style is *not* in the `dict`, add it (and give it a count of `1`).

This is how you see which keys (indices) are in the `dict`:

In [116]:
list(dict_example.keys())

['red', 'green', 'yellow', 'blue', 'black', 'grey']

This is how you see if the `dict` has a particular key or not:

In [117]:
'yellow' in dict_example.keys()

True

In [118]:
'teal' in dict_example.keys()

False
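
A side note in sketch form: the `in` operator applied directly to a `dict` also tests its keys, so `.keys()` is optional in these checks. The `get` method is a related tool that returns a fallback value for a missing key; the fallback `'unknown'` below is an arbitrary choice for illustration.

```python
dict_example = {'red': '#FF0000', 'green': '#00FF00', 'yellow': '#FFFF00'}

# `in` on a dict tests its keys, so `.keys()` is optional here.
has_yellow = 'yellow' in dict_example
has_teal = 'teal' in dict_example

# `get` returns a fallback instead of raising KeyError for a missing key.
teal_code = dict_example.get('teal', 'unknown')

print(has_yellow, has_teal, teal_code)
```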

-   Now that we've gotten this far with `dict`ionaries, let's revisit the original question:  how many artists are in each musical style?  Compose a function `count_styles` below which accepts a `list` of entries `records` (`records` contains the same kind of elements as your earlier list `entries`).  `count_styles` should `return` a `dict` containing the musical styles represented and their respective counts.

In [119]:
###################################
####### Q3 [10 points] ############
# Full marks if the test passes; otherwise marks are deducted according to the errors.
###################################
# define your function here
def count_styles(records):  # records is a list of tuples
    styles = {}  # Create a blank dictionary

    # Loop over the records.
    for record in records:
        # For each record, get the musical style field.
        style = record[-1]
        if style in styles.keys():
            # If the style is in the dict, add one to the count.
            styles[style] += 1
        else:
            # If the style is not in the dict, add it (and give it a count of 1).
            styles[style] = 1

    # Return the resulting dict of styles
    return styles

In [120]:
# test your code here.  You may edit this cell, and you may use the files 'blues.txt', 'swamppop.txt', and 'zydeco.txt',
# all of which are located in the 'data/' directory.



In [121]:
# it should pass this test---do NOT edit this cell
zydeco = process('./data/zydeco.txt')
zydeco_styles = count_styles(zydeco)
assert zydeco_styles['Zydeco'] == 29
print('Success!')
zydeco_styles

Success!


{'Zydeco': 29}

In [122]:
# it should pass this test---do NOT edit this cell
blues = process('./data/blues.txt')
blues_styles = count_styles(blues)
assert blues_styles['Gospel Blues'] == 19
print('Success!')
blues_styles

Success!


{'Delta Blues': 54,
 'Chicago Blues': 148,
 'Piedmont Blues': 35,
 'Country Blues': 51,
 'Swamp Blues': 20,
 'Gospel Blues': 19,
 'West Coast Blues': 42,
 'Piano Blues': 31}
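
For comparison only (the exercise asks you to write the loop yourself), the standard library's `collections.Counter` performs the same kind of counting. The sketch below uses three records copied from the blues data shown earlier.

```python
from collections import Counter

# Three records copied from the blues data shown earlier.
sample = [
    ('Augusta', 'Cecil', 'Delta Blues'),
    ('Allison', 'Mose', 'Delta Blues'),
    ('Adams', 'Alberta', 'Chicago Blues'),
]

# Count the last field (the style) of every record.
style_counts = Counter(record[-1] for record in sample)

print(dict(style_counts))
print(style_counts.most_common(1))
```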

So that tells you *how many times each thing occurs*.  What if you want to know *how many kinds of things there are*?

We need to make our list contain only unique elements—that is, remove multiple copies from it.  This is called *uniqifying* the list.

To uniqify your list, you can use the following code:

In [123]:
# a short function to remove repeat elements of a list
def uniqify(input_list):
    # make an empty dictionary
    keys = {}
    for e in input_list:
        # add a key for each item in the list---duplicate keys will be overwritten
        keys[e] = 1
    return list(keys.keys())  # return a list of all unique keys

As an example, test this function on a simple list:

In [124]:
my_list = [1,2,2,4,3,2,5,7,1,2]
uniqify(my_list)

[1, 2, 4, 3, 5, 7]
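
Two built-in alternatives to `uniqify`, sketched under the assumption that you are running Python 3.7 or later (where a `dict` keeps insertion order): `dict.fromkeys` preserves the order of first occurrence, while a `set` removes repeats without any guaranteed order and so is sorted here.

```python
my_list = [1, 2, 2, 4, 3, 2, 5, 7, 1, 2]

# dict.fromkeys keeps the first occurrence of each item, in order.
unique_in_order = list(dict.fromkeys(my_list))

# A set also removes repeats, but it has no defined order, so sort the result.
unique_sorted = sorted(set(my_list))

print(unique_in_order)
print(unique_sorted)
```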

In [125]:
surnames = []
for musician in blues:
    surnames.append(musician[0])

unique_surnames = uniqify(surnames)
unique_surnames.sort()
unique_surnames

['Abshire',
 'Adams',
 'Alexander',
 'Allison',
 'Ammons',
 'Anderson',
 'Armstrong',
 'Arnold',
 'Augusta',
 'B.',
 'Babe',
 'Bailey',
 'Baker',
 'Ball',
 'Band',
 'Bankhead',
 'Banks',
 'Barbee',
 'Barnes',
 'Bates',
 'Baty',
 'Becker',
 'Belfour',
 'Bell',
 'Belly',
 'Benoit',
 'Benton',
 'Billy',
 'Blackwell',
 'Blake',
 'Bloomfield',
 'Blue',
 'Bob',
 'Bogan',
 'Bonds',
 'Bonner',
 'Booker',
 'Boyd',
 'Bracey',
 'Bradley',
 'Branch',
 'Brim',
 'Brooks',
 'Broonzy',
 'Brown',
 'Brozman',
 'Buford',
 'Burnside',
 'Burton',
 'Butler',
 'Butterfield',
 'Campbell',
 'Carr',
 'Carroll',
 'Carter',
 'Caston',
 'Cephas',
 'Charles',
 'Charlie',
 'Chuck',
 'Clark',
 'Clarke',
 'Clayton',
 'Clearwater',
 'Coleman',
 'Collette',
 'Collins',
 'Cotten',
 'Cotton',
 'Council',
 'Cox',
 'Crayton',
 'Crudup',
 'Crutchfield',
 'Davenport',
 'Davis',
 'Dawkins',
 'DeSanto',
 'Diddley',
 'Dixon',
 'Dizz',
 'Domino',
 'Dorsey',
 'Doyle',
 'Drummer',
 'Duncan',
 'Dupree',
 'Edwards',
 'Estes',
 'Flynn

-   Now, use this ability to `uniqify` a list to determine how many different surnames there are.
    
    Compose a function `unique_surname_list` which accepts a `list` `records` containing the tuple entries.  `unique_surname_list` will `return` a `list` of unique surnames.

In [126]:
###################################
####### Q4 [5 points] ############
# Full marks if the test passes; otherwise marks are deducted according to the errors.
###################################
# define your function here
def unique_surname_list(records):
    # Collect the surname (the first field) of every record.
    surnames = []
    for record in records:
        surnames.append(record[0])

    # Remove repeated surnames and sort the result alphabetically.
    unique_surnames = uniqify(surnames)
    unique_surnames.sort()

    # Return the resulting list of surnames
    return unique_surnames

In [127]:
# test your code here.  You may edit this cell, and you may use the files 'blues.txt', 'swamppop.txt', and 'zydeco.txt',
# all of which are located in the 'data/' directory.
zydeco = process('./data/zydeco.txt')
zydeco_surnames = unique_surname_list(zydeco)
zydeco_surnames

['Adcock',
 'Ardoin',
 'Billington',
 'Broussard',
 'Carrier',
 'Chavis',
 'Chenier',
 'Delafose',
 'Dopsy',
 'Fontenot',
 'Frank',
 'Ida',
 'Jocque',
 'Leday',
 'Mojo',
 'Nate',
 'Salmon',
 'Sidney',
 'Simien',
 'Thierry',
 'Watson',
 'Wayne',
 'Williams',
 'Zydeco']

In [128]:
# it should pass this test---do NOT edit this cell
zydeco = process('./data/zydeco.txt')
zydeco_surnames = unique_surname_list(zydeco)
assert len(zydeco_surnames) == 24
print('Success!')

Success!


#### Tokenizing

To *tokenize* is to split a string into pieces (or *tokens*) by some rule.  For instance, you've done this with `split` before:

In [129]:
"The Well at the World's End".split(' ')

['The', 'Well', 'at', 'the', "World's", 'End']
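
One detail worth knowing when tokenizing: `split(' ')` and `split()` behave differently on repeated spaces and on empty strings. The sketch below uses a made-up string with two spaces to show the difference.

```python
text = 'Blind  Lemon'                # made-up input; note the two spaces

tokens_explicit = text.split(' ')    # splits at every single space
tokens_default = text.split()        # splits on runs of whitespace

empty_explicit = ''.split(' ')       # an empty string still yields one (empty) token
empty_default = ''.split()           # an empty string yields no tokens

print(tokens_explicit)
print(tokens_default)
print(empty_explicit, empty_default)
```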

We can extract some statistical information about the naming of blues artists by tokenizing all components of their first names and then counting how many times each token (or bit) occurs.  For instance, `Blind Lemon Jefferson` became `('Jefferson', 'Blind Lemon', 'Gospel Blues')`; we now wish to tokenize `'Blind Lemon'` into `'Blind'` and `'Lemon'` (since `'Blind'` is a common moniker among Delta Blues performers and their musical descendants).

-   How many times does the name element `'Blind'` occur in the file `./data/blues.txt`?
    
    To find this, you will need to:
    
    1.  Tokenize all name bits.
    2.  Count the number of times each name bit occurs.  This is very similar to the counting of musical styles previously.

First, let's get all of the names (first and last) together in one `list`.

In [130]:
blues = process('./data/blues.txt')
names = []
for musician in blues:
    names.append(musician[1])
    names.append(musician[0])

Next, tokenize each name and add the components to a master `list` of name bits.  Since you are adding a list of strings in each case to the list, you may wish to use the `extend` method instead of the `append` method.
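
The difference between the two methods can be seen in a minimal sketch; the list names here are illustrative and are not used elsewhere in the lab.

```python
name_bits_append = []
name_bits_extend = []

tokens = 'John Henry'.split()

name_bits_append.append(tokens)   # adds the whole list as ONE element
name_bits_extend.extend(tokens)   # adds each token as its own element

print(name_bits_append)
print(name_bits_extend)
```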

In [131]:
###################################
####### Q5 [4 points] ############
###################################
name_bits = []
# loop over names
for name in names:
    tokens = name.split()       # tokenize each name into name tokens by the split method
    name_bits.extend(tokens)    # add the name tokens to name_bits

name_bits

['Cecil',
 'Augusta',
 'Mose',
 'Allison',
 'Tommy',
 'Bankhead',
 'John',
 'Henry',
 'Barbee',
 'Kid',
 'Bailey',
 'Robert',
 'Belfour',
 'Charley',
 'Booker',
 'Ishman',
 'Bracey',
 'Willie',
 'Brown',
 'R.',
 'L.',
 'Burnside',
 'Sam',
 'Carr',
 'Bo',
 'Carter',
 'James',
 'Cotton',
 'Arthur',
 '"Big',
 'Boy"',
 'Crudup',
 'Delta',
 'Blind',
 'Billy',
 'David',
 'Honeyboy',
 'Edwards',
 'Jessie',
 'Mae',
 'Hemphill',
 'King',
 'Solomon',
 'Hill',
 'John',
 'Lee',
 'Hooker',
 'Son',
 'House',
 'Howlin',
 'Wolf',
 'Elmore',
 'James',
 'Skip',
 'James',
 'Robert',
 'Johnson',
 'Tommy',
 'Johnson',
 'Junior',
 'Kimbrough',
 'Little',
 'Freddie',
 'King',
 'Robert',
 'Lockwood',
 'Willie',
 'Love',
 'Lead',
 'Belly',
 'Tommy',
 'McClennan',
 'Papa',
 'Charlie',
 'McCoy',
 'Mississippi',
 'Fred',
 'McDowell',
 'Mississippi',
 'John',
 'Hurt',
 'Sonny',
 'Boy',
 'Nelson',
 'Jack',
 'Owens',
 'Charley',
 'Patton',
 'Pinetop',
 'Perkins',
 'Doctor',
 'Ross',
 'Sonny',
 'Rhodes',
 'Johnny',
 

Finally, create a `dict` of the count of each name bit, as you did with `count_styles`.

In [132]:
name_counts = {}  # a blank dictionary

for name_bit in name_bits:
    if name_bit in name_counts.keys():
        name_counts[name_bit] += 1
    else:
        name_counts[name_bit] = 1

name_counts

{'Cecil': 2,
 'Augusta': 1,
 'Mose': 2,
 'Allison': 2,
 'Tommy': 3,
 'Bankhead': 1,
 'John': 13,
 'Henry': 4,
 'Barbee': 1,
 'Kid': 1,
 'Bailey': 2,
 'Robert': 7,
 'Belfour': 1,
 'Charley': 4,
 'Booker': 2,
 'Ishman': 1,
 'Bracey': 2,
 'Willie': 10,
 'Brown': 5,
 'R.': 1,
 'L.': 4,
 'Burnside': 1,
 'Sam': 5,
 'Carr': 2,
 'Bo': 3,
 'Carter': 1,
 'James': 7,
 'Cotton': 1,
 'Arthur': 3,
 '"Big': 4,
 'Boy"': 3,
 'Crudup': 1,
 'Delta': 1,
 'Blind': 9,
 'Billy': 3,
 'David': 1,
 'Honeyboy': 1,
 'Edwards': 3,
 'Jessie': 1,
 'Mae': 1,
 'Hemphill': 1,
 'King': 3,
 'Solomon': 1,
 'Hill': 1,
 'Lee': 4,
 'Hooker': 2,
 'Son': 3,
 'House': 1,
 'Howlin': 1,
 'Wolf': 2,
 'Elmore': 1,
 'Skip': 1,
 'Johnson': 9,
 'Junior': 4,
 'Kimbrough': 2,
 'Little': 11,
 'Freddie': 4,
 'Lockwood': 1,
 'Love': 1,
 'Lead': 1,
 'Belly': 1,
 'McClennan': 1,
 'Papa': 2,
 'Charlie': 7,
 'McCoy': 1,
 'Mississippi': 4,
 'Fred': 1,
 'McDowell': 1,
 'Hurt': 1,
 'Sonny': 4,
 'Boy': 4,
 'Nelson': 2,
 'Jack': 2,
 'Owens': 1,
 'P

In [133]:
# Sort the list and output it from most frequently occurring name bits to least.
def value(k):
    return name_counts[k]
name_count_list = sorted(name_counts, key=value, reverse=True)

for name in name_count_list:
    print(name, name_counts[name])

Johnny 14
John 13
Little 11
Willie 10
Blind 9
Johnson 9
Big 9
Joe 9
Smith 9
Robert 7
James 7
Charlie 7
Thomas 7
Eddie 7
Jimmy 7
Slim 7
J. 6
Brown 5
Sam 5
B. 5
C. 5
Jones 5
Red 5
Henry 4
Charley 4
L. 4
"Big 4
Lee 4
Junior 4
Freddie 4
Mississippi 4
Sonny 4
Boy 4
Luther 4
George 4
Brooks 4
Charles 4
Buddy 4
Otis 4
Bob 4
Taylor 4
Memphis 4
Jackson 4
Lewis 4
Tommy 3
Bo 3
Arthur 3
Boy" 3
Billy 3
Edwards 3
King 3
Son 3
Patton 3
Williams 3
Alexander 3
Bell 3
Bill 3
Walter 3
Floyd 3
Moore 3
Walker 3
Young 3
Fuller 3
White 3
Roy 3
Cecil 2
Mose 2
Allison 2
Bailey 2
Booker 2
Bracey 2
Carr 2
Hooker 2
Wolf 2
Kimbrough 2
Papa 2
Nelson 2
Jack 2
Pinetop 2
D. 2
Sims 2
Wilson 2
Banks 2
Chuck 2
Lefty 2
Buster 2
Lonnie 2
Baker 2
T. 2
Rockin' 2
William 2
"The 2
Band 2
Coleman 2
Albert 2
Lester 2
Davenport 2
Davis 2
Dixon 2
Foster 2
Leroy 2
Nick 2
Guy 2
Phil 2
Harris 2
Danny 2
Kinsey 2
Tony 2
Moss 2
"Guitar" 2
Phillips 2
A. 2
Reed 2
Robinson 2
Smothers 2
Dave 2
Sugar 2
Melvin 2
Ed 2
Etta 2
Barbecue 2
Frank 2
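
A variant of the sorting cell above, offered as a sketch: sorting the `(key, value)` pairs from `items()` means the helper does not need to reach into the global `name_counts`. The helper name `count_of` is hypothetical, and the sample counts are copied from the output above.

```python
# A few counts copied from the output above.
sample_counts = {'Blind': 9, 'Johnny': 14, 'Tommy': 3, 'John': 13}

def count_of(pair):
    return pair[1]                # pair is (name_bit, count)

# Sort the (name_bit, count) pairs from most to least frequent.
ranked = sorted(sample_counts.items(), key=count_of, reverse=True)

for name_bit, count in ranked:
    print(name_bit, count)
```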

It's now simple for you to answer the original question of how many `'Blind'`s there are in the `blues.txt` file:

In [134]:
print(name_counts['Blind'])

9


In [135]:
# clear earlier data so we don't mask the behavior of the function `count_names` below
names = None
name_bits = None
name_count_list = None
name_counts = None

Go ahead and write this into a function `count_names`, which accepts a `list` `records` containing the tuple entries.  `count_names` will `return` a `dict` containing the number of times each name bit occurs.

In [154]:
###################################
####### Q6 [30 points] ############
# Full marks if the test passes; otherwise marks are deducted according to the errors.
###################################
# define your function here
def count_names(records):
    name_bits = []
    # loop over tuples in records
    for record in records:
        # tokenize the name in the tuple into name tokens (both the first name and the
        # surname entries) and add the name tokens to name_bits
        name_bits.extend(record[1].split(' '))
        name_bits.extend(record[0].split(' '))

    # create a dict of the count of each name bit as you did with count_styles.
    name_counts = {}  # a blank dictionary
    for name_bit in name_bits:
        if name_bit in name_counts.keys():
            name_counts[name_bit] += 1
        else:
            name_counts[name_bit] = 1

    # return the resulting dict containing the number of times each name bit occurs.
    # (Sorting by frequency is only needed for display and can be done by the caller.)
    return name_counts

In [155]:
# test your code here.  You may edit this cell, and you may use the files 'blues.txt', 'swamppop.txt', and 'zydeco.txt',
# all of which are located in the 'data/' directory.



In [156]:
# it should pass this test---do NOT edit this cell
# test number of name elements found
blues = process('./data/blues.txt')
blues_names = count_names(blues)
assert len(list(blues_names.keys())) == 594
print('Success!')

Success!


In [157]:
# it should pass this test---do NOT edit this cell
# test success in counting name elements
blues = process('./data/blues.txt')
blues_names = count_names(blues)
assert blues_names['Wheatstraw'] == 1
assert blues_names['Johnny'] == 14
print('Success!')

Success!


# Before you submit...

### Submission [8 points]
Make sure that you have filled in your name and student ID and answered all 6 questions.

Save this file as lab06-studentID.ipynb then UPLOAD to RELATE!