# Fundamentals of Data Analysis
## Tasks

These are my solutions to the Tasks assessment for this module. The author is Brian Doheny.

### Task 1

Write a Python function called counts that takes a list as input and returns a dictionary of unique items in the list as keys and the number of times each item appears as values. So, the input ['A', 'A', 'B', 'C', 'A'] should have output {'A': 3, 'B': 1, 'C': 1} . Your code should not depend on any module from the standard library or otherwise. You should research the task first and include a description with references of your algorithm in the notebook.

# Answer
----

In [None]:
def counts(items):
    '''Takes a list as input, and produces a dictionary containing the unique contents of the list, 
    and the number of times each item appeared in the list'''
    #Create an empty dictionary to store the results.
    results = {}
    
    #Defining a function to check if the next item is a list.
    def count_list(item):
        '''Checks if the list item is itself a list. 
        If it is a list, it will check if the next item of that list is itself a list.
        It will keep doing this until it finds an element that is not a list.
        At this point it will call count_item on that element.'''
        for subitem in item:
            #If the subitem is itself a list, it will call count_list again. 
            #This means no matter how many nested lists there are, it will iterate through all them.
            if type(subitem) == list: 
                count_list(subitem)
            #If the subitem is not a list, then it'll just run the count_item function defined below.
            else:
                count_item(subitem)
    
    #Because count_list iterates through a list, it would throw an error if it encountered a non-iterable object,
    #for example an integer. Therefore I need a second inner function specifically for the non-list items.
    #These will be the dictionary keys.
    def count_item(item):
        '''Checks if the list item already exists as a dictionary key. If it does, it will add 1 to its value.
        If it does not exist as a dictionary key, it will create a new key for that item, and set the value to 1'''
        if item in results:
            results[item] += 1 
        else:
            results[item] = 1
    
    #Finally, this FOR loop will go through the initial list (the items parameter).
    #If the item within the list is itself a list, that item will go through the count_list inner function.
    #Otherwise it goes through count_item.
    for item in items:
        if type(item) == list:
            count_list(item)
        else:
            count_item(item)
        
    return results            

In [38]:
test = ['A', 'A', 'B', 'C', 'A']
counts(test)

{'A': 3, 'B': 1, 'C': 1}

# Approach
----
## Checking for Dictionary keys

As the program must return a dictionary with the unique items in the list as keys, and how frequently each appears as values, we must first know how to check a dictionary's keys. This will allow us to check whether a list item already exists as a key, so that our program can take the necessary next step: creating a new key for a new item, or updating the value of the existing key if the item is a duplicate.

This can be done by using a Dictionary Membership Test[6], which will return a boolean (True or False) depending on whether or not the variable already exists in the dictionary. The syntax for a dictionary membership test is simply:

variable in dictionary

or alternatively you can check if a key does not already exist in the dictionary by using:

variable not in dictionary
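
As a quick illustration of both tests, here is a small sketch using a made-up dictionary called `stock` (the name and its contents are assumptions for this example only):

```python
# Hypothetical dictionary used only to illustrate the membership tests.
stock = {'apples': 3, 'pears': 0}

# The membership test looks at the keys, not the values.
print('apples' in stock)      # True
print('plums' in stock)       # False
print('plums' not in stock)   # True
print(3 in stock)             # False, because 3 is a value rather than a key
```

Note that the test only ever looks at keys, which is exactly what is needed when the list items themselves become the keys.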

With this membership test, we can then use a straightforward If/Else statement to determine whether or not a new key must be created, or if an existing key needs its value altered[2][3]:

if variable in dictionary:
    dictionary[variable] += 1
else:
    dictionary[variable] = 1

This gives us a fairly straightforward initial solution:

In [18]:
# Define the test list, as offered in the problem.
test = ['A', 'A', 'B', 'C', 'A']

In [26]:
def counts(items):
    '''Takes a list as input, and produces a dictionary containing the unique contents of the list, 
    and the number of times each item appeared in the list'''
    
    #Create an empty dictionary to store the results.
    results = {}
    
    #Iterate over the list items, and see if a key already exists.
    for item in items:
        if item in results:
            results[item] += 1 
        else:
            results[item] = 1
    #Return the completed dictionary.
    return results       

In [25]:
counts(test)

{'A': 3, 'B': 1, 'C': 1}

## Nested lists and how to deal with them

While the example list included in the problem statement doesn't contain any nested lists, they are something that the program might encounter in the real world. Therefore it's important to find a solution for this now, and one that will ideally scale with the program.

As lists are a mutable data type (i.e. they can be altered), they are unhashable and cannot be used as a dictionary key, as outlined in the Python Data Structures documentation[4]:

```Unlike sequences, which are indexed by a range of numbers, dictionaries are indexed by keys, which can be any immutable type; strings and numbers can always be keys. Tuples can be used as keys if they contain only strings, numbers, or tuples; if a tuple contains any mutable object either directly or indirectly, it cannot be used as a key. You can’t use lists as keys, since lists can be modified in place using index assignments, slice assignments, or methods like append() and extend().```
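
The rule quoted above can be checked directly with Python's built-in `hash()` function, which raises a `TypeError` for anything that cannot be used as a key. The following is only an illustrative sketch; the helper name `is_hashable` is made up for this example.

```python
def is_hashable(obj):
    '''Returns True if obj can be hashed, and so used as a dictionary key.'''
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable('A'))          # True: strings are immutable
print(is_hashable((1, 2)))       # True: a tuple containing only numbers
print(is_hashable([1, 2]))       # False: lists are mutable
print(is_hashable((1, [2, 3])))  # False: the tuple contains a list
```

So a list cannot be a key, whether on its own or tucked inside a tuple.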

This therefore opens up two potential solutions - change the nested list to a tuple, or find a way to iterate over the nested lists.

### Converting a nested list into a tuple

Converting the inner list into a tuple, and using that as the key, is a straightforward solution. The FOR loop would just need to check the type of the next item it's iterating through, and if it's a list, convert that item to a tuple:

if type(item) == list:
    item = tuple(item)

However, this would encounter an issue if there's yet another list within this second layer of lists. The Python Data Structures documentation[4] explicitly says that if the "tuple contains any mutable object" (in this case a list) "either directly or indirectly, it cannot be used as a key". Therefore an additional nested list will prevent this solution from working in all situations. Alongside this, tuples as keys might not be the ideal real-world solution anyway, as they open up the possibility of lengthy dictionary keys that may not serve much purpose for analysis.

If we were to follow this path, it would look something like this: 

In [29]:
#If I were to follow path 1, this is how it would look.
def counts(items):
    '''Takes a list as input, and produces a dictionary containing the unique contents of the list, 
    and the number of times each item appeared in the list'''
    
    #Create an empty dictionary to store the results.
    results = {}
    
    #Turning the list into a tuple lets us add it as a dictionary key.
    #We would need to check that the item is a list before we go through the dictionary keys.
    for item in items:
        if type(item) == list: #check if the next item is a list.
            item = tuple(item) #if so, change it to be a tuple.
        else:
            pass
        
        #Once any lists are changed to tuples, we can see if they already exist in the dictionary keys.
        if item in results.keys():
            results[item] += 1 
        else:
            results[item] = 1 
   
    #Return the completed dictionary.
    return results        

In [30]:
# Create a list containing a list to test this approach.
bigtest = [1, 5, 17, 82, 91, 'horse', 'cow', 'cheese', 47.7777, 1, 1, 1, 1, 1, 5, 17, 1, 82, 91, 'horse',
           'cow', 'cheese', 15, 81, 'cheese', ['another', 'list', 'in', 'the', 'list', 77]]

counts(bigtest)

{1: 7,
 5: 2,
 17: 2,
 82: 2,
 91: 2,
 'horse': 2,
 'cow': 2,
 'cheese': 3,
 47.7777: 1,
 15: 1,
 81: 1,
 ('another', 'list', 'in', 'the', 'list', 77): 1}

In [32]:
# If there is yet another level of lists, we receive an error: "TypeError: unhashable type: 'list'"

lists_in_lists = [1, 2, 3, [1, 2, 3, 27.124, [1, 2, 3, 'hi', [[1,[1, 'hello'], 2, 3], 1, 2, 3]]]]

counts(lists_in_lists)

TypeError: unhashable type: 'list'
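
If tuple keys were wanted despite this error, one possible workaround (not the approach taken in the final answer) would be to convert nested lists to tuples recursively, so that no list remains at any depth. A minimal sketch, in which `to_tuple` and `counts_with_tuple_keys` are hypothetical names:

```python
def to_tuple(item):
    '''Recursively converts a list, and any lists inside it, into tuples.'''
    if type(item) == list:
        return tuple(to_tuple(subitem) for subitem in item)
    return item

def counts_with_tuple_keys(items):
    '''Counts items, using nested tuples as keys in place of nested lists.'''
    results = {}
    for item in items:
        key = to_tuple(item)
        if key in results:
            results[key] += 1
        else:
            results[key] = 1
    return results

print(counts_with_tuple_keys([1, [2, [3, 4]], [2, [3, 4]]]))
# {1: 1, (2, (3, 4)): 2}
```

This removes the error, but the concern about long, unwieldy keys still applies.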

### Iterating over nested lists

A more scalable and potentially useful solution is to iterate over any nested lists too. This will involve checking the data type of the next list item, and if it is a list, running the FOR loop again. This seems to be the most likely solution, as it will allow the program to count every individual item in every list. A program that can unpack nested lists could be used for counting the frequency of words in files or when scraping websites, and so has a real world use case. I am therefore interpreting this as the desired outcome for this problem.

However in order to do this, I must find a way to repeat the FOR loop as many times as necessary, ideally without having to repeat my code over and over again. 

### Nested Functions and DRY[7]

If I were to rely solely on FOR loops, I would require a loop for every level of nested list. As I do not know what lists this program will be used for, I therefore cannot estimate how many I would need. Should I include a dozen nested for loops? Will that cover every necessity? 

Obviously in such a situation my code would be unnecessarily repetitive in most scenarios, and not repetitive enough in others. I had to find a way to reuse my original FOR loop for updating the dictionary keys, without having to repeat that code endlessly, and fortunately Python has just the tool I need - nested functions[5]. This will ensure that my code is DRY - Don't Repeat Yourself[7].

By defining my initial FOR loop as a nested function, I will be able to call it as many times as I need without having to repeat those 5 lines of code each time. Better yet, as this function is nested within the counts function, it can make use of the same input parameters as were provided when the counts function was called. Similarly, if I am using a nested function for the dictionary keys FOR loop, I could also use one for the FOR loop which checks if the next list item is of list type itself.


In [35]:
def counts(items):
    '''Takes a list as input, and produces a dictionary containing the unique contents of the list, 
    and the number of times each item appeared in the list'''
    #Create an empty dictionary to store the results.
    results = {}
    
    #Defining a function to check if the next item is a list.
    def count_list(item):
        '''Checks if the list item is itself a list. 
        If it is a list, it will check if the next item of that list is itself a list.
        It will keep doing this until it finds an element that is not a list.
        At this point it will call count_item on that element.'''
        for subitem in item:
            #If the subitem is itself a list, it will call count_list again. 
            #This means no matter how many nested lists there are, it will iterate through all them.
            if type(subitem) == list: 
                count_list(subitem)
            #If the subitem is not a list, then it'll just run the count_item function defined below.
            else:
                count_item(subitem)
    
    #Because count_list iterates through a list, it would throw an error if it encountered a non-iterable object,
    #for example an integer. Therefore I need a second inner function specifically for the non-list items.
    #These will be the dictionary keys.
    def count_item(item):
        '''Checks if the list item already exists as a dictionary key. If it does, it will add 1 to its value.
        If it does not exist as a dictionary key, it will create a new key for that item, and set the value to 1'''
        if item in results:
            results[item] += 1 
        else:
            results[item] = 1
    
    #Finally, this FOR loop will go through the initial list (the items parameter).
    #If the item within the list is itself a list, that item will go through the count_list inner function.
    #Otherwise it goes through count_item.
    for item in items:
        if type(item) == list:
            count_list(item)
        else:
            count_item(item)
        
    return results            

In [37]:
#This version will now work with lists nested at multiple levels.
counts(lists_in_lists)

{1: 6, 2: 5, 3: 5, 27.124: 1, 'hi': 1, 'hello': 1}
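
Recursion in Python is bounded by the interpreter's recursion limit, so extremely deep nesting could in principle raise a `RecursionError`. As an alternative sketch (not the version used in the final answer), the same counts can be produced with an explicit stack instead of recursive calls. The name `counts_iterative` is hypothetical, and like the original it uses no modules.

```python
def counts_iterative(items):
    '''Counts items in a possibly nested list without using recursion.'''
    results = {}
    # The stack holds lists that still need to be walked through.
    stack = [items]
    while stack:
        current = stack.pop()
        for item in current:
            if type(item) == list:
                # Defer nested lists; they are walked on a later pass of the while loop.
                stack.append(item)
            elif item in results:
                results[item] += 1
            else:
                results[item] = 1
    return results

lists_in_lists = [1, 2, 3, [1, 2, 3, 27.124, [1, 2, 3, 'hi', [[1, [1, 'hello'], 2, 3], 1, 2, 3]]]]
expected = {1: 6, 2: 5, 3: 5, 27.124: 1, 'hi': 1, 'hello': 1}
print(counts_iterative(lists_in_lists) == expected)  # True
```

The totals match the recursive version, although the order in which keys are first seen can differ because nested lists are visited later.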

### Other considerations

The above function does what is requested within the problem statement, and will work for lists nested at multiple levels, provided the non-list items are hashable (an item such as a dictionary would still raise a TypeError). That said, there are potentially other additions that can be included in the answer for this. For example, how should this program deal with strings that contain the same letters, but different combinations of capitalized letters?

For example, let's say our list contains two strings (among many): 
* "DRY" (the above mentioned "Don't Repeat Yourself" acronym), 
* "dry" (to mean lacking moisture)

If I wish for my program to count the frequency of combinations of letters, then I would want these two to be treated exactly the same, and so they should both contribute to the same count. 

However, if I am looking to capture the content and meaning of the strings in the list, then I would want them to be counted separately, and perhaps later combine their totals if it made sense.

The program above would follow the second of these approaches - treating these two words as fundamentally different, and therefore each having its own count. However, if we did wish to treat the two words as being the same, we could add an IF statement as the first step in the count_item function. Like so:

def count_item(item):

    if type(item) == str:
        item = item.lower()

    if item in results:
        results[item] += 1
    else:
        results[item] = 1


In [59]:
def counts(items):
    '''Takes a list as input, and produces a dictionary containing the unique contents of the list, 
    and the number of times each item appeared in the list'''
    #Create an empty dictionary to store the results.
    results = {}
    
    #Defining a function to check if the next item is a list.
    def count_list(item):
        '''Checks if the list item is itself a list. 
        If it is a list, it will check if the next item of that list is itself a list.
        It will keep doing this until it finds an element that is not a list.
        At this point it will call count_item on that element.'''
        for subitem in item:
            #If the subitem is itself a list, it will call count_list again. 
            #This means no matter how many nested lists there are, it will iterate through all them.
            if type(subitem) == list: 
                count_list(subitem)
            #If the subitem is not a list, then it'll just run the count_item function defined below.
            else:
                count_item(subitem)
    
    #Because count_list iterates through a list, it would throw an error if it encountered a non-iterable object,
    #for example an integer. Therefore I need a second inner function specifically for the non-list items.
    #These will be the dictionary keys.
    def count_item(item):
        '''If the list item is a string, it'll convert it to lower case.
        Then checks if the list item already exists as a dictionary key. If it does, it will add 1 to its value.
        If it does not exist as a dictionary key, it will create a new key for that item, and set the value to 1'''
        if type(item) == str:
            item = item.lower()
            
        if item in results:
            results[item] += 1 
        else:
            results[item] = 1
    
    #Finally, this FOR loop will go through the initial list (the items parameter).
    #If the item within the list is itself a list, that item will go through the count_list inner function.
    #Otherwise it goes through count_item.
    for item in items:
        if type(item) == list:
            count_list(item)
        else:
            count_item(item)
        
    return results            

In [60]:
dry_list = ['dry', 'DRY', ['dry', 'DrY', 'DRY']]

counts(dry_list)

{'dry': 5}

As my interpretation of the problem is to treat these two variations, "dry" and "DRY", as being distinct, I have not included this additional IF statement in my final answer (found at the top).
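
If both behaviours were ever needed, one option would be an optional flag rather than two separate functions. The sketch below is only an illustration: the function name `counts_with_case_option` and the parameter name `ignore_case` are assumptions, and the default keeps the case-sensitive behaviour of the final answer.

```python
def counts_with_case_option(items, ignore_case=False):
    '''Counts items in a nested list, optionally treating strings case-insensitively.'''
    results = {}

    def count_item(item):
        # Only strings are lower-cased, and only when the caller asks for it.
        if ignore_case and type(item) == str:
            item = item.lower()
        if item in results:
            results[item] += 1
        else:
            results[item] = 1

    def count_list(sublist):
        for subitem in sublist:
            if type(subitem) == list:
                count_list(subitem)
            else:
                count_item(subitem)

    count_list(items)
    return results

dry_list = ['dry', 'DRY', ['dry', 'DrY', 'DRY']]
print(counts_with_case_option(dry_list))                    # {'dry': 2, 'DRY': 2, 'DrY': 1}
print(counts_with_case_option(dry_list, ignore_case=True))  # {'dry': 5}
```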

### References
<p>[1] Real Python; How to Iterate Through a Dictionary in Python; https://realpython.com/iterate-through-dictionary-python/</p>
<p>[2] Career Karma; Python Add to Dictionary: A Guide; https://careerkarma.com/blog/python-add-to-dictionary/</p>
<p>[3] Geeks for Geeks; Python | Get specific keys' values; https://www.geeksforgeeks.org/python-get-specific-keys-values/?ref=rp</p>
<p>[4] Python; Data Structures; https://docs.python.org/3/tutorial/datastructures.html#dictionaries</p>
<p>[5] Real Python; Python Inner Functions—What Are They Good For?; https://realpython.com/inner-functions-what-are-they-good-for/ </p>
<p>[6] Programiz; Python Dictionary; https://www.programiz.com/python-programming/dictionary</p>
<p>[7] Wikipedia; Don't repeat yourself; https://en.wikipedia.org/wiki/Don%27t_repeat_yourself</p>

### Task 2

Write a Python function called dicerolls that simulates rolling dice. Your function should take two parameters: the number of dice k and the number of times to roll the dice n. The function should simulate randomly rolling k dice n times, keeping track of each total face value. It should then return a dictionary with the number of times each possible total face value occurred. So, calling the function as diceroll(k=2, n=1000) should return a dictionary like:

{2:19,3:50,4:82,5:112,6:135,7:174,8:133,9:114,10:75,11:70,12:36} 
You can use any module from the Python standard library you wish and you should include a description with references of your algorithm in the notebook.

Notes

* Will likely use NumPy for this; randint could probably do it.
* Want to ensure all possible totals exist as dictionary keys, even if a total is rolled 0 times.
* Can then use NumPy to pick a number from 1 to 6 k times and sum the total, then add 1 to the corresponding dictionary key.
* This would then be run n times, after which the dictionary is returned.
* Need to ensure the dictionary is reset each time, so it will be defined within the function.

Note - Can't use NumPy as it's not in the standard library! Will have to use the random module instead.

In [61]:
import itertools
import random

def diceroll(k, n):
    '''Simulates rolling k six-sided dice n times, and returns a dictionary of
    how many times each possible total occurred.'''
    #Suggestion for how to create the initial dictionary - https://stackoverflow.com/questions/39400257/python-itertools-how-to-roll-n-dice
    #Creating dictionary keys for all possible results first, as smaller samples might not see a particular total rolled.
    dice_results = {}

    for dice in itertools.product(range(1, 7), repeat=k):
        combination = sum(dice)
        if combination not in dice_results:
            dice_results[combination] = 0

    #random documentation - https://docs.python.org/3/library/random.html
    #Roll the k dice n times, adding 1 to the count for whichever total comes up.
    for i in range(n):
        total = 0
        for j in range(k):
            #randrange(1, 7) returns a whole number from 1 to 6, as the stop value is excluded.
            total += random.randrange(1, 7)
        dice_results[total] += 1

    return dice_results

In [62]:
diceroll(k=2, n=1000)

{2: 31,
 3: 51,
 4: 89,
 5: 124,
 6: 143,
 7: 162,
 8: 138,
 9: 97,
 10: 79,
 11: 63,
 12: 23}
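
Since k six-sided dice can only total between k and 6k, the keys could also be created directly from that range, which avoids enumerating all 6^k combinations when k is large. The following is a sketch of that variation rather than a replacement for the function above: `diceroll_range` is a hypothetical name, and the `seed` parameter is an added assumption used only to make the example repeatable.

```python
import random

def diceroll_range(k, n, seed=None):
    '''Rolls k six-sided dice n times and counts how often each total occurs.'''
    # A separate generator, so seeding it does not affect the module-level random state.
    rng = random.Random(seed)
    # Every possible total, from all ones (k) to all sixes (6 * k), starts at 0.
    results = {total: 0 for total in range(k, 6 * k + 1)}
    for _ in range(n):
        # randint(1, 6) includes both end points, so it behaves like randrange(1, 7).
        total = sum(rng.randint(1, 6) for _ in range(k))
        results[total] += 1
    return results

rolls = diceroll_range(k=2, n=1000, seed=1)
print(list(rolls.keys()))   # [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(sum(rolls.values()))  # 1000
```

Checking that the values sum to n is a simple sanity test that no roll was lost or counted twice.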