### Advanced Collections in Python
Python's built-in collections, lists in particular, are used throughout data science, including AI. This exercise demonstrates some of their more advanced functionality.

### Slicing in Python Lists
Slicing is a simple but powerful feature of Python lists. It lets you quickly extract sublists from a much larger collection. Below are some of the rules associated with slicing.

In [12]:
listy = [1, 2, 3, 4, 5, 6]

# A slice has the form listy[start:stop]; start is included and stop is excluded.
# Leaving out start (colon first) begins the slice at the first element.
# Leaving out stop (colon last) runs the slice through to the last element.

print("Before: ", listy[:2]) #elements before index 2, i.e. indices 0 and 1
print("After: ", listy[2:]) #elements from index 2 (inclusive) to the end of the list

# A negative index counts from the end of the list, so [:-2] keeps everything except the last two elements.
print("Extract up to second last index from the start: ", listy[:-2])

# We can also define a sublist from within the list by placing the colon between values
print("Sublist: ", listy[2:4])

# A step of -1 walks the list backwards, which returns a reversed copy
print("Reverse: ", listy[::-1])

Before:  [1, 2]
After:  [3, 4, 5, 6]
Extract up to second last index from the start:  [1, 2, 3, 4]
Sublist:  [3, 4]
Reverse:  [6, 5, 4, 3, 2, 1]
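
The rules above generalise to the form `listy[start:stop:step]`. The following is a minimal sketch of a few further cases; the variable names are illustrative and not part of the original exercise.

```python
listy = [1, 2, 3, 4, 5, 6]

# General form: listy[start:stop:step]; start is included, stop is excluded.
every_second = listy[::2]      # indices 0, 2, 4
middle = listy[1:5]            # indices 1, 2, 3, 4
last_two = listy[-2:]          # the final two elements
past_the_end = listy[4:100]    # slice bounds are clamped instead of raising IndexError

print(every_second)   # [1, 3, 5]
print(middle)         # [2, 3, 4, 5]
print(last_two)       # [5, 6]
print(past_the_end)   # [5, 6]

# Slicing returns a new list, so changing the copy leaves the original alone.
copy_of_listy = listy[:]
copy_of_listy[0] = 99
print(listy[0])       # 1
```

Because a slice is always a new list, it is a convenient way to take a copy before modifying data.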


### Multi-Dimensional Lists
Lists are a way of associating multiple pieces of data with a single variable. The data is not constrained to a particular type, meaning that you can create lists of integers, strings or even a list of lists. Python also allows a single list to mix types, for example an integer alongside strings, although keeping each list to one type usually makes the data easier to process.

In [24]:
x = ['a', 'b', 'c'] #a typical list
y = [['a', 'b', 'c'], ['x', 'y', 'z']] #a list of lists - in this instance, a 2-dimensional list

print(x[0]) #print the first element of x
print(y[1][0]) #print the first value of the second element in y

a
x


A list of lists is multi-dimensional, meaning it has more than one axis. A 1-dimensional list has a single axis, while a 3-dimensional list has three. In this way, Python lists allow the manipulation of multi-dimensional data.

Note that the number of dimensions is _not_ the number of elements in the list, but the number of axes. This corresponds to the depth of nested square brackets. The result is that when accessing an item in a list, you must specify an index along each axis.

In [32]:
two_dimension = [['a', 'b', 'c'], ['x', 'y', 'z']]
three_dimension = [[['a', 'b', 'c']], [['x', 'y', 'z']]]

print(two_dimension[0][2]) #retrieve the item with index 0 in axis 0, and index 2 in axis 1 
print(three_dimension[0][0][2]) #for each additional dimension, we must supply an additional index

c
c
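
To make the idea of axes concrete, here is a small sketch that walks every item of a 2-dimensional list and counts nesting depth. The `nesting_depth` helper is hypothetical, written only for this illustration, and it assumes every level is nested as deeply as its first element.

```python
two_dimension = [['a', 'b', 'c'], ['x', 'y', 'z']]

# Walk axis 0 (rows), then axis 1 (items within each row).
for row_index, row in enumerate(two_dimension):
    for col_index, item in enumerate(row):
        print(row_index, col_index, item)

def nesting_depth(value):
    """Hypothetical helper: count how many list levels wrap the first element."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        if not value:      # an empty list has no deeper level to inspect
            break
        value = value[0]
    return depth

print(nesting_depth(['a', 'b', 'c']))        # 1
print(nesting_depth(two_dimension))          # 2
print(nesting_depth([[['a', 'b', 'c']]]))    # 3
```

Each nested `for` loop corresponds to one axis, which is why a 3-dimensional list would need three loops or three indices.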


All regular list methods will continue to work, regardless of the number of dimensions that a list has.

In [33]:
two_dimension.append(['m', 'n', 'o']) #append a list of items to the end of the current list of lists
print(two_dimension)

del three_dimension[0][0] #delete the list at the given index
print(three_dimension)

[['a', 'b', 'c'], ['x', 'y', 'z'], ['m', 'n', 'o']]
[[], [['x', 'y', 'z']]]
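
One caveat when applying regular list operations to nested lists is that the inner lists can end up shared between rows or copies. The sketch below illustrates this behaviour; the variable names are illustrative only.

```python
# Replicating an inner list with * repeats the same row object.
shared_rows = [[0, 0, 0]] * 2
shared_rows[0][0] = 1
print(shared_rows)        # [[1, 0, 0], [1, 0, 0]]

# A list comprehension builds a separate inner list for every row.
independent_rows = [[0, 0, 0] for _ in range(2)]
independent_rows[0][0] = 1
print(independent_rows)   # [[1, 0, 0], [0, 0, 0]]

# list.copy() is shallow: the outer list is new, the inner lists are shared.
original = [['a', 'b'], ['x', 'y']]
shallow = original.copy()
shallow.append(['m', 'n'])   # affects only the copy
shallow[0].append('c')       # affects both, because row 0 is shared
print(original)           # [['a', 'b', 'c'], ['x', 'y']]
print(shallow)            # [['a', 'b', 'c'], ['x', 'y'], ['m', 'n']]
```

When fully independent copies of nested data are needed, `copy.deepcopy` from the standard library is one option.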


### Combining Lists in Python
We can combine lists using the built-in zip function. This is different from appending one list to another: zip pairs up the values that share an index, so each index of the result holds a tuple of those values. The example will make this clearer.

In [20]:
list1 = [1, 2, 3]
list2 = [9, 8, 7]

list_combined = zip(list1, list2) #combine the lists
list_combined = list(list_combined) #zip returns a lazy zip object (an iterator), so we convert it to a list before indexing or printing it

print(list_combined) #we now have a list of tuple objects

print(list_combined[0][0]) #we can extract any index of the combined objects as we would a multi-dimension list


[(1, 9), (2, 8), (3, 7)]
1
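
A few properties of zip are worth seeing in action: it can be reversed, it stops at the shortest input, and its result can only be consumed once. This is a minimal sketch with illustrative variable names.

```python
list1 = [1, 2, 3]
list2 = [9, 8, 7]

pairs = list(zip(list1, list2))
print(pairs)                         # [(1, 9), (2, 8), (3, 7)]

# zip(*pairs) reverses the operation; each result is a tuple.
firsts, seconds = zip(*pairs)
print(list(firsts), list(seconds))   # [1, 2, 3] [9, 8, 7]

# zip stops at the shortest input, silently dropping the extra items.
short = list(zip([1, 2, 3, 4], ['a', 'b']))
print(short)                         # [(1, 'a'), (2, 'b')]

# A zip object is an iterator, so it can only be consumed once.
zipped = zip(list1, list2)
first_pass = list(zipped)
second_pass = list(zipped)
print(second_pass)                   # []
```

The silent truncation is easy to miss, so it is worth checking that paired lists have the same length before zipping them.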


### Randomness in Python
There will be times when you want to use random elements in your code, particularly with lists. For example, when splitting a dataset you may want 75% of the data for training and 25% for testing, with the items in each part chosen at random.

In [9]:
import random #first import random
random.seed(1337) #Python's generator is pseudo-random, so set a seed whenever results need to be reproducible

floaty = random.random() #This generates a random float in the range 0.0 (inclusive) to 1.0 (exclusive)
print(floaty) 

#To create a random int, multiply the random float by the size of your range and cast it to an int
inty = int(random.random() * 10) #here the range is 10, so the result is an integer from 0 to 9
print(inty)

#The random library also provides randint, which returns an integer directly
inty = random.randint(0, 10) #randint includes both endpoints, so this gives an integer from 0 to 10
print(inty)

0.6177528569514706
5
5
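
The effect of seeding, and the difference between inclusive and exclusive upper bounds, can be checked directly. This is a small sketch; `random.randrange` is a standard-library function not used elsewhere in this exercise, and the variable names are illustrative.

```python
import random

# The same seed always produces the same sequence of values.
random.seed(1337)
first_run = [random.random() for _ in range(3)]
random.seed(1337)
second_run = [random.random() for _ in range(3)]
print(first_run == second_run)    # True

# randint includes both endpoints; randrange excludes the stop value,
# just like range() and slicing.
draws_randint = {random.randint(0, 10) for _ in range(10_000)}
draws_randrange = {random.randrange(0, 10) for _ in range(10_000)}
print(max(draws_randint))     # never above 10
print(max(draws_randrange))   # never above 9
```

Using `randrange(len(some_list))` avoids the `len() - 1` adjustment that `randint` needs when picking a list index.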


### Randomness in Python Lists
There are a variety of reasons you may want to make a random selection from a list. The most important is that you may want a subset that is representative of the original list. A hand-picked selection cannot guarantee this, whereas random selection tends toward it efficiently.

In [22]:
import random
random.seed(1337)

letters = ['a', 'b', 'c', 'd', 'e']
words = ['ant', 'bat', 'cat', 'div', 'ear']

random_letter = letters[random.randint(0, len(letters)-1)] #pick a random index between 0 and the last index of the letters list
print(random_letter)

random_letter = words[0][random.randint(0, len(words[0])-1)] #take a random letter from the first element of the words list
print(random_letter) #randint includes its upper bound, so we use len()-1: a string of three characters has indices [0, 1, 2]

random_splice = letters[random.randint(0, len(letters)-1):] #create a sublist by slicing the list from a random index to the end - note the colon
print(random_splice)

random_sublist = random.sample(letters, 3) #pick three distinct elements in random order, without modifying letters
print(random_sublist)

e
t
['c', 'd', 'e']
['e', 'b', 'd']
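
The standard library also offers shortcuts for the index arithmetic shown above. The sketch below uses `random.choice` and `random.shuffle`, which are not part of the original exercise; no output is shown for the random draws because it depends on the seed.

```python
import random
random.seed(1337)

letters = ['a', 'b', 'c', 'd', 'e']

# random.choice picks one element without any index arithmetic.
picked = random.choice(letters)
print(picked)

# random.shuffle reorders a list in place and returns None,
# so shuffle a copy if the original order still matters.
shuffled = letters[:]
random.shuffle(shuffled)
print(shuffled)
print(letters)   # unchanged: ['a', 'b', 'c', 'd', 'e']

# random.sample never repeats a position, so asking for more items
# than the list holds raises ValueError.
try:
    random.sample(letters, 6)
except ValueError as error:
    print("Cannot sample:", error)
```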


### Dataset Example

Let's use what we have covered so far in a concrete example. Suppose we have a dataset for identifying flower types. Each flower has a label (its name) and three measurements, representing petal length, stalk length and petal radius respectively.

For example: 0.9, 5.5, 0.4, rose

We want to preprocess the dataset before sending it to a machine learning algorithm, so we will split it into roughly two-thirds training data and one-third testing data (6 and 3 of the 9 samples).

In [30]:
random.seed(12)

data = [] #first let's create the dataset
labels = []

data.append([0.9, 5.5, 0.4])
labels.append('Rose')
data.append([0.8, 5.0, 0.6])
labels.append('Rose')
data.append([1.2, 4.9, 0.4])
labels.append('Rose')
data.append([0.7, 6.5, 0.4])
labels.append('Petunia')
data.append([0.6, 6.0, 0.4])
labels.append('Petunia')
data.append([0.5, 6.2, 0.3])
labels.append('Petunia')
data.append([1.5, 10.0, 0.9])
labels.append('Sunflower')
data.append([1.3, 10.5, 0.8])
labels.append('Sunflower')
data.append([1.6, 11.0, 0.9])
labels.append('Sunflower')

#Each item at an index in the data list corresponds to the label at the same index in the labels list
print(data[0])
print(labels[0])

#We can observe that just slicing the dataset will be ineffective
dataset_splice = data[:6]
label_splice = labels[:6]

print("No Random Method: ", dataset_splice)
print(label_splice) #notice that one class is not even represented in the training set

#Instead, let's try randomly selecting items from the list
dataset_random = []
label_random = []
random_block = [] #indices we have already selected, so that no index is picked twice

for r in range(6):
    random_range = len(data) - 1
    random_index = random.randint(0, random_range)
    while random_index in random_block: #checks to see if we have used this index before
        random_index = random.randint(0, random_range) #if so, we create a new random value
    random_block.append(random_index)
    dataset_random.append(data[random_index])
    label_random.append(labels[random_index])
    
print("Loop Method: ", dataset_random)
print(label_random) #though this method works, it wastes time retrying whenever an index repeats

#A better option is to zip the two lists together and call random.sample on the pairs
dataset_pairs = zip(data, labels) 
dataset_random = random.sample(list(dataset_pairs), 6)
dataset_zipped = []
labels_zipped = []
for d in dataset_random:
    dataset_zipped.append(d[0])
    labels_zipped.append(d[1])
    
print("Zip Method: ", dataset_zipped)
print(labels_zipped)

[0.9, 5.5, 0.4]
Rose
No Random Method:  [[0.9, 5.5, 0.4], [0.8, 5.0, 0.6], [1.2, 4.9, 0.4], [0.7, 6.5, 0.4], [0.6, 6.0, 0.4], [0.5, 6.2, 0.3]]
['Rose', 'Rose', 'Rose', 'Petunia', 'Petunia', 'Petunia']
Loop Method:  [[1.3, 10.5, 0.8], [0.6, 6.0, 0.4], [1.6, 11.0, 0.9], [0.5, 6.2, 0.3], [1.2, 4.9, 0.4], [1.5, 10.0, 0.9]]
['Sunflower', 'Petunia', 'Sunflower', 'Petunia', 'Rose', 'Sunflower']
Zip Method:  [[0.9, 5.5, 0.4], [0.5, 6.2, 0.3], [0.7, 6.5, 0.4], [1.2, 4.9, 0.4], [1.5, 10.0, 0.9], [0.8, 5.0, 0.6]]
['Rose', 'Petunia', 'Petunia', 'Rose', 'Sunflower', 'Rose']
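
The random methods above build only the training portion. One possible way to obtain both portions at once is to shuffle the indices and slice them in two. This is a sketch under stated assumptions: `random_split` is a hypothetical helper, and the six rows are a shortened stand-in for the data and labels lists used above.

```python
import random
random.seed(12)

# A shortened stand-in for the data and labels lists used above.
data = [[0.9, 5.5, 0.4], [0.7, 6.5, 0.4], [1.5, 10.0, 0.9],
        [0.8, 5.0, 0.6], [0.6, 6.0, 0.4], [1.3, 10.5, 0.8]]
labels = ['Rose', 'Petunia', 'Sunflower', 'Rose', 'Petunia', 'Sunflower']

def random_split(data, labels, train_size):
    """Hypothetical helper: shuffle the indices, then slice them in two."""
    indices = list(range(len(data)))
    random.shuffle(indices)
    train_idx, test_idx = indices[:train_size], indices[train_size:]
    train = [(data[i], labels[i]) for i in train_idx]
    test = [(data[i], labels[i]) for i in test_idx]
    return train, test

train, test = random_split(data, labels, 4)
train_data, train_labels = zip(*train)   # unzip back into two sequences
print(list(train_labels))
print([label for _, label in test])
```

Working with indices keeps each row paired with its label and guarantees that no item lands in both portions.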


Random selection clearly gives a better spread of the data than a plain slice. However, random selection does not guarantee an even distribution across labels. In both random samples above, one class contributed three of the six items while another contributed only one. To solve this, we can use stratified random sampling.

### Stratified Random
Stratified random sampling ensures that, although items are still chosen at random, each label is fairly represented. Here that means drawing the same number of training items from every class.

In [53]:
random.seed(1337)

label_dictionary = {}

for counter, value in enumerate(labels): #enumerate yields (index, value) pairs, so counter is the index of each label
    if value not in label_dictionary:
        label_dictionary[value] = [counter]
    else:
        label_dictionary[value].append(counter) #here we map each label to the list of indices where it occurs

training_indices = []

for label in label_dictionary:
    random_selection = random.sample(label_dictionary[label], 2) #we randomly select two indices from each label
    training_indices += random_selection #these form our training data for that label

print(training_indices)

training_data = []
training_labels = []

test_data = []
test_labels = []

for i in range(len(data)): #we then refer to the full dataset
    if i in training_indices: #if the data is at an index within our training_indices list
        training_data.append(data[i])
        training_labels.append(labels[i]) #we add that to the training_data
    else:
        test_data.append(data[i]) #otherwise, we add it to test data
        test_labels.append(labels[i])

print(training_data)
print(training_labels)

print(test_data)
print(test_labels) #this means we extract data randomly, but evenly from each label

[2, 1, 5, 3, 7, 8]
[[0.8, 5.0, 0.6], [1.2, 4.9, 0.4], [0.7, 6.5, 0.4], [0.5, 6.2, 0.3], [1.3, 10.5, 0.8], [1.6, 11.0, 0.9]]
['Rose', 'Rose', 'Petunia', 'Petunia', 'Sunflower', 'Sunflower']
[[0.9, 5.5, 0.4], [0.6, 6.0, 0.4], [1.5, 10.0, 0.9]]
['Rose', 'Petunia', 'Sunflower']
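
The steps above can be wrapped into a reusable function that takes a training fraction instead of a fixed count per label. This is a sketch, not a definitive implementation: `stratified_split`, its parameters, and the rule of keeping at least one training item per class are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction, seed=None):
    """Hypothetical helper: return (train_indices, test_indices)."""
    rng = random.Random(seed)            # private generator, leaves the global seed alone
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    train_indices = []
    for label, indices in indices_by_label.items():
        # Round to the nearest whole item, but keep at least one per class (an assumption).
        count = max(1, round(len(indices) * train_fraction))
        train_indices += rng.sample(indices, count)

    train_set = set(train_indices)       # set membership checks are fast
    test_indices = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train_indices), test_indices

labels = ['Rose'] * 3 + ['Petunia'] * 3 + ['Sunflower'] * 3
train_idx, test_idx = stratified_split(labels, 2 / 3, seed=1337)
print([labels[i] for i in train_idx])   # ['Rose', 'Rose', 'Petunia', 'Petunia', 'Sunflower', 'Sunflower']
print([labels[i] for i in test_idx])    # ['Rose', 'Petunia', 'Sunflower']
```

Returning indices rather than the rows themselves lets the same split be applied to both the data list and the labels list.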
