# CHAPTER 2 DATA STRUCTURES

### BSD2333 DATA WRANGLING

### 2.3 DATA STRUCTURES AND SEQUENCES

#### LISTS

Lists are fundamental Python data structures that are stored in contiguous memory locations, can hold
different data types (such as strings, integers, and floats), and can be accessed by
index.

We will start with lists and list comprehensions. A list comprehension is syntactic sugar (or
shorthand) for a for loop that builds a list. We will generate a list of numbers and then
examine which of them are even. We will sort, reverse, and check for duplicates. We will
also see the different ways we can access the list elements, iterate over them, and check the
membership of an element.

The following is an example of a simple list:

In [1]:
list_example = [51, 27, 34, 46, 90, 45, -19]

The following is also an example of a list:

In [2]:
list_example2 = [15, "Yellow car", True, 9.456, [12, "Hello"]]

As you can see, a list can contain any number of the allowed data types, such as int, float, string,
and boolean, and a list can also be a mix of different data types (including nested lists).
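
As a small illustration of the nested case, the sketch below reuses the values of list_example2 under the hypothetical name mixed and reaches into the inner list with a chained index. The variable names are ours, not part of the original notebook.

```python
# Hypothetical list mirroring list_example2 above
mixed = [15, "Yellow car", True, 9.456, [12, "Hello"]]

inner = mixed[4]        # the nested list [12, "Hello"]
greeting = mixed[4][1]  # a second index reaches inside the nested list
types = [type(item).__name__ for item in mixed]

print(inner)     # [12, 'Hello']
print(greeting)  # Hello
print(types)     # ['int', 'str', 'bool', 'float', 'list']
```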

List Functions

In this section, we will discuss a few basic functions for handling lists. You can access list elements
using the following code:


In [3]:
list_example = [51, 27, 34, 46, 90, 45, -19]
list_example[0]

51

To find out the length of a list, we simply use the len function. The len function in Python returns
the length of the specified list:

In [4]:
len(list_example)

7

We can append new elements to the end of the list. append is a built-in method of the list data
type in Python:


In [5]:
list_example.append(11)
list_example

[51, 27, 34, 46, 90, 45, -19, 11]

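Exercise 2.01: Accessing the List Members

1. Open a new Jupyter Notebook and define a list called ssn. Read from the ssn.csv file using
the read_csv function and print the list elements. Note that calling list on a DataFrame returns its
column labels, so this approach relies on ssn.csv storing all the values in its first (header) row: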


In [8]:
import pandas as pd
ssn = list(pd.read_csv("ssn.csv"))
print(ssn)

['218-68-9955', '165-73-3124', '432-47-4043', '563-93-1393', '153-93-3401', '670-09-7369', '123-05-9652', '812-13-2476', '726-13-1007', '825-05-4836']


2. Access the first element of ssn using its forward index:

In [9]:
ssn[0]

'218-68-9955'

3. Access the fourth element of ssn using its forward index:

In [10]:
ssn[3]

'563-93-1393'

4. Access the last element of ssn using the len function:

In [11]:
ssn[len(ssn) - 1]

'825-05-4836'

5. Access the last element of ssn using its backward index:

In [12]:
ssn[-1]

'825-05-4836'

6. Access the first three elements of ssn using forward indices:

In [13]:
ssn[0:3]

['218-68-9955', '165-73-3124', '432-47-4043']

7. Access the last two elements of ssn by slicing:

In [14]:
ssn[-2:]

['726-13-1007', '825-05-4836']

8. Access all but the last two elements using a backward index:

In [15]:
ssn[:-2]

['218-68-9955',
 '165-73-3124',
 '432-47-4043',
 '563-93-1393',
 '153-93-3401',
 '670-09-7369',
 '123-05-9652',
 '812-13-2476']

When we leave one side of the colon (:) blank, we are basically telling Python either to go until the
end or start from the beginning of the list. It will automatically apply the rule of list slices that
we just learned.

9. Reverse the elements in the list:


In [16]:
ssn[-1::-1]

['825-05-4836',
 '726-13-1007',
 '812-13-2476',
 '123-05-9652',
 '670-09-7369',
 '153-93-3401',
 '563-93-1393',
 '432-47-4043',
 '165-73-3124',
 '218-68-9955']
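
The slicing rules used in the steps above can be summarized in one short sketch. It uses a made-up list of letters rather than the contents of ssn.csv, so the names and values here are illustrative only.

```python
letters = ["a", "b", "c", "d", "e", "f"]  # illustrative data, not from ssn.csv

first_three = letters[0:3]        # indices 0, 1, 2 (the stop index is excluded)
same_first_three = letters[:3]    # an omitted start defaults to the beginning
last_two = letters[-2:]           # an omitted stop runs to the end
all_but_last_two = letters[:-2]   # stops two elements before the end
every_second = letters[::2]       # the third value is the step
reversed_copy = letters[::-1]     # a negative step walks backwards

print(first_three, last_two, all_but_last_two)
print(every_second, reversed_copy)
```

Every slice returns a new list, so letters itself is left unchanged.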

In this exercise, we learned how to access the list members with forward and backward indices.
We’ll create a list in the next exercise.


Exercise 2.02: Generating and Iterating through a List
    
In this exercise, we are going to examine various ways of generating a list and a nested list using
the same file containing the list of social security numbers (ssn.csv) that we used in the previous
exercise.

We are going to use the append method to add new elements to the list and a while loop to iterate
through the list. To do so, let’s go through the following steps:
    
1. Open a new Jupyter Notebook and import the necessary Python libraries. Read from the
ssn.csv file:

In [17]:
import pandas as pd
ssn = list(pd.read_csv("ssn.csv"))

2. Create a list using the append method. The built-in append method of a list
adds one item to the end of the list:

In [18]:
ssn_2 = []
for x in ssn:
    ssn_2.append(x)
ssn_2

['218-68-9955',
 '165-73-3124',
 '432-47-4043',
 '563-93-1393',
 '153-93-3401',
 '670-09-7369',
 '123-05-9652',
 '812-13-2476',
 '726-13-1007',
 '825-05-4836']

3. Generate a new list that prefixes each element with "soc: " using the following command:


In [19]:
ssn_3 = ["soc: " + x for x in ssn_2]
ssn_3

['soc: 218-68-9955',
 'soc: 165-73-3124',
 'soc: 432-47-4043',
 'soc: 563-93-1393',
 'soc: 153-93-3401',
 'soc: 670-09-7369',
 'soc: 123-05-9652',
 'soc: 812-13-2476',
 'soc: 726-13-1007',
 'soc: 825-05-4836']

This is a list comprehension, which is a very powerful tool that we need to master. The power
of list comprehensions comes from the fact that we can combine a for..in loop with conditionals such as
if inside the comprehension itself. This will be discussed in detail in Chapter 2, Advanced Operations on Built-in
Data Structures.
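
To make the relationship between a loop and a comprehension concrete, here is a minimal sketch that filters the even numbers out of a small range in both styles. The variable names are illustrative and do not come from the exercise.

```python
numbers = list(range(10))

# Explicit loop with a condition
evens_loop = []
for n in numbers:
    if n % 2 == 0:
        evens_loop.append(n)

# Equivalent list comprehension: the trailing "if" filters elements
evens_comp = [n for n in numbers if n % 2 == 0]

# A conditional expression before "for" transforms every element instead
labels = ["even" if n % 2 == 0 else "odd" for n in numbers]

print(evens_loop == evens_comp)  # True
print(labels[:4])                # ['even', 'odd', 'even', 'odd']
```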

4. Use a while loop to iterate over the list:

In [21]:
i = 0
while i < len(ssn_3):
    print(ssn_3[i])
    i += 1

soc: 218-68-9955
soc: 165-73-3124
soc: 432-47-4043
soc: 563-93-1393
soc: 153-93-3401
soc: 670-09-7369
soc: 123-05-9652
soc: 812-13-2476
soc: 726-13-1007
soc: 825-05-4836


5. Search for all the social security numbers with the digit 9 in them:


In [22]:
numbers = [x for x in ssn_3 if "9" in x]
numbers

['soc: 218-68-9955',
 'soc: 563-93-1393',
 'soc: 153-93-3401',
 'soc: 670-09-7369',
 'soc: 123-05-9652']

Let’s explore a few more list operations. We are going to use the + operator to concatenate
two lists into a new list and the extend method to append the contents of one list to an existing list in place.

6. Generate a list by adding the two lists. Here, we will just use the + operator:


In [23]:
ssn_4 = ["102-90-0314" , "247-17-2338" , "318-22-2760"]
ssn_5 = ssn + ssn_4
ssn_5

['218-68-9955',
 '165-73-3124',
 '432-47-4043',
 '563-93-1393',
 '153-93-3401',
 '670-09-7369',
 '123-05-9652',
 '812-13-2476',
 '726-13-1007',
 '825-05-4836',
 '102-90-0314',
 '247-17-2338',
 '318-22-2760']

7. Extend a list using the extend method:


In [24]:
ssn_2.extend(ssn_4)
ssn_2

['218-68-9955',
 '165-73-3124',
 '432-47-4043',
 '563-93-1393',
 '153-93-3401',
 '670-09-7369',
 '123-05-9652',
 '812-13-2476',
 '726-13-1007',
 '825-05-4836',
 '102-90-0314',
 '247-17-2338',
 '318-22-2760']
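
The two results above look identical, but the operations differ in what they modify. The sketch below, with short illustrative lists, shows that + builds a new list while extend changes the existing list in place.

```python
a = ["102-90-0314"]  # illustrative values
b = ["247-17-2338"]

combined = a + b        # a new list; a and b are left unchanged
a_before = list(a)      # copy of a taken before extending

alias = a               # a second name for the same list object
result = a.extend(b)    # modifies a in place and returns None

print(combined)   # ['102-90-0314', '247-17-2338']
print(a_before)   # ['102-90-0314']
print(alias)      # ['102-90-0314', '247-17-2338'] (same object as a)
print(result)     # None
```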

8. Now, let’s loop over the first list and write a nested loop inside it that goes over the
second list:


In [25]:
for x in ssn_2:
    for y in ssn_5:
        print(str(x) + ' , ' + str(y))

218-68-9955 , 218-68-9955
218-68-9955 , 165-73-3124
218-68-9955 , 432-47-4043
218-68-9955 , 563-93-1393
218-68-9955 , 153-93-3401
218-68-9955 , 670-09-7369
218-68-9955 , 123-05-9652
218-68-9955 , 812-13-2476
218-68-9955 , 726-13-1007
218-68-9955 , 825-05-4836
218-68-9955 , 102-90-0314
218-68-9955 , 247-17-2338
218-68-9955 , 318-22-2760
165-73-3124 , 218-68-9955
165-73-3124 , 165-73-3124
165-73-3124 , 432-47-4043
165-73-3124 , 563-93-1393
165-73-3124 , 153-93-3401
165-73-3124 , 670-09-7369
165-73-3124 , 123-05-9652
165-73-3124 , 812-13-2476
165-73-3124 , 726-13-1007
165-73-3124 , 825-05-4836
165-73-3124 , 102-90-0314
165-73-3124 , 247-17-2338
165-73-3124 , 318-22-2760
432-47-4043 , 218-68-9955
432-47-4043 , 165-73-3124
432-47-4043 , 432-47-4043
432-47-4043 , 563-93-1393
432-47-4043 , 153-93-3401
432-47-4043 , 670-09-7369
432-47-4043 , 123-05-9652
432-47-4043 , 812-13-2476
432-47-4043 , 726-13-1007
432-47-4043 , 825-05-4836
432-47-4043 , 102-90-0314
432-47-4043 , 247-17-2338
432-47-4043 

In this exercise, we used the built-in methods of Python to manipulate lists. In the next exercise,
we’ll check whether the elements or members in a dataset are present as per our expectations.

Exercise 2.03: Iterating over a List and Checking Membership

This exercise will demonstrate how we can iterate over a list and verify that the values are as
expected. This is a manual test that can often be done while dealing with a reasonably sized
dataset for business case scenarios. Let’s go through the following steps to check the membership
of values and whether they exist in the .csv file:

1. Import the necessary Python libraries and read from the car_models.csv file:


In [26]:
import pandas as pd
car_models = list(pd.read_csv("car_models.csv"))
car_models

['Escalade ',
 ' X5 M',
 'D150',
 'Camaro',
 'F350',
 'Aurora',
 'S8',
 'E350',
 'Tiburon',
 'F-Series Super Duty ']

2. Iterate over the list using an index:


In [27]:
list_1 = [x for x in car_models]
for i in range(0, len(list_1)):
    print(list_1[i])

Escalade 
 X5 M
D150
Camaro
F350
Aurora
S8
E350
Tiburon
F-Series Super Duty 


However, this is not very Pythonic. Being Pythonic means following a set of best
practices and conventions that have been established over the years by thousands of capable developers.
In this case, it means iterating with the for..in loop directly over the list, because Python does
not require index initialization, bounds checking, or index incrementing, unlike many traditional languages.
Python uses syntactic sugar to make iterating through lists easy and readable. In other languages,
you might have to create an index variable (index initialization), check it against the length of the list
on every pass (bounds checking), and increase it inside the loop (index incrementing).

3. Write the following code to see the Pythonic way of iterating over a list:

In [28]:
for i in list_1:
    print(i)

Escalade 
 X5 M
D150
Camaro
F350
Aurora
S8
E350
Tiburon
F-Series Super Duty 


Notice that in the second method, we do not need a counter anymore to access the list index;
instead, the for..in loop gives us each element directly.
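
When the position of each element is still needed, the built-in enumerate function is one common way to get it without managing a counter by hand. The sketch below uses a shortened, illustrative list of car models.

```python
car_models = ["D150", "Camaro", "F350"]  # shortened illustrative list

# enumerate yields (index, element) pairs, so no manual counter is needed
for position, model in enumerate(car_models):
    print(position, model)

# The optional start argument changes the first index
numbered = [f"{position}: {model}"
            for position, model in enumerate(car_models, start=1)]
print(numbered)  # ['1: D150', '2: Camaro', '3: F350']
```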

4. Check whether the strings D150 and Mustang are in the list using the in operator:

In [29]:
"D150" in list_1

True

In [30]:
"Mustang" in list_1

False

In this exercise, we’ve seen how to iterate over a list and how to check whether a given value is a member of it.
This is an important skill. Often, when working with large applications, manually checking a list
could be useful. If at any time you are unsure of a list, you can easily verify what values are present.
Now, we will see how we can perform a sort operation on a list.
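
One caveat is visible in the car_models output above: some values carry leading or trailing spaces ('Escalade ', ' X5 M'), and the in operator compares strings exactly. A minimal sketch of the effect, using a subset of those values and assuming that stripping whitespace is acceptable for the data at hand:

```python
car_models = ['Escalade ', ' X5 M', 'D150', 'F-Series Super Duty ']  # subset of the output above

raw_hit = "Escalade" in car_models     # False: the stored value has a trailing space

# Assumption: surrounding whitespace carries no meaning, so we strip it
cleaned = [model.strip() for model in car_models]
clean_hit = "Escalade" in cleaned      # True after stripping

print(raw_hit, clean_hit)
print(cleaned)
```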

Exercise 2.04: Sorting a List

In this exercise, we will sort a list of numbers, first by using the sort method and then by using the
reverse method. To do so, let’s go through the following steps:

1. Open a new Jupyter Notebook and import the necessary Python libraries:

In [31]:
import pandas as pd
ssn = list(pd.read_csv("ssn.csv"))

2. Use the sort method with reverse=True:

In [32]:
list_1 = [*range(0, 101, 1)]
list_1.sort(reverse=True)
list_1

[100,
 99,
 98,
 97,
 96,
 95,
 94,
 93,
 92,
 91,
 90,
 89,
 88,
 87,
 86,
 85,
 84,
 83,
 82,
 81,
 80,
 79,
 78,
 77,
 76,
 75,
 74,
 73,
 72,
 71,
 70,
 69,
 68,
 67,
 66,
 65,
 64,
 63,
 62,
 61,
 60,
 59,
 58,
 57,
 56,
 55,
 54,
 53,
 52,
 51,
 50,
 49,
 48,
 47,
 46,
 45,
 44,
 43,
 42,
 41,
 40,
 39,
 38,
 37,
 36,
 35,
 34,
 33,
 32,
 31,
 30,
 29,
 28,
 27,
 26,
 25,
 24,
 23,
 22,
 21,
 20,
 19,
 18,
 17,
 16,
 15,
 14,
 13,
 12,
 11,
 10,
 9,
 8,
 7,
 6,
 5,
 4,
 3,
 2,
 1,
 0]

3. Use the reverse method to reverse the list in place, which restores the ascending order here:

In [33]:
list_1.reverse()
list_1

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99,
 100]

The difference between the sort method and the reverse method is that we can use sort with
customized sorting, whereas we can only use reverse to reverse a list. Also, both methods work
in place (they modify the original list and return None), so be aware of this while using them. Now, let’s
create a list with random numbers. Random numbers can be very useful in a variety of situations, and
preprocessing data is a common step in machine learning.
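
As a brief sketch of what customized sorting can look like, the example below sorts a made-up list of words with the key argument. It also contrasts the in-place sort method with the built-in sorted function, which returns a new list. The data is illustrative only.

```python
words = ["banana", "Apple", "cherry", "fig"]  # illustrative data

# sorted returns a new list and leaves the original untouched
by_length = sorted(words, key=len)
print(by_length)  # ['fig', 'Apple', 'banana', 'cherry']
print(words)      # ['banana', 'Apple', 'cherry', 'fig']

# sort works in place and returns None; key=str.lower ignores case
result = words.sort(key=str.lower)
print(result)     # None
print(words)      # ['Apple', 'banana', 'cherry', 'fig']

# reverse also works in place
words.reverse()
print(words)      # ['fig', 'cherry', 'banana', 'Apple']
```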

Exercise 2.05: Generating a Random List

In this exercise, we will be generating a list with random numbers using the random library in
Python and performing mathematical operations on them. To do so, let’s go through the following
steps:

1. Import the random library:

In [34]:
import random

2. Use the randint method to generate some random integers and add them to a list:


In [35]:
list_1 = [random.randint(0, 30) for x in range (0, 100)]

3. Let’s print the list. Note that there will be duplicate values in list_1:


In [36]:
list_1

[16,
 9,
 9,
 1,
 6,
 11,
 23,
 15,
 21,
 9,
 24,
 4,
 21,
 21,
 30,
 1,
 30,
 8,
 19,
 2,
 28,
 16,
 30,
 29,
 20,
 3,
 21,
 22,
 16,
 7,
 30,
 16,
 29,
 9,
 22,
 22,
 30,
 21,
 10,
 26,
 9,
 21,
 17,
 9,
 9,
 6,
 6,
 24,
 4,
 4,
 15,
 11,
 10,
 6,
 19,
 6,
 22,
 26,
 1,
 18,
 0,
 12,
 4,
 9,
 10,
 28,
 5,
 25,
 20,
 26,
 3,
 9,
 19,
 10,
 22,
 8,
 1,
 22,
 12,
 25,
 21,
 23,
 30,
 14,
 5,
 6,
 28,
 3,
 7,
 8,
 4,
 16,
 8,
 22,
 10,
 15,
 16,
 24,
 3,
 30]

4. Let’s find the square of each element:

In [37]:
list_2 = [x**2 for x in list_1]
list_2

[256,
 81,
 81,
 1,
 36,
 121,
 529,
 225,
 441,
 81,
 576,
 16,
 441,
 441,
 900,
 1,
 900,
 64,
 361,
 4,
 784,
 256,
 900,
 841,
 400,
 9,
 441,
 484,
 256,
 49,
 900,
 256,
 841,
 81,
 484,
 484,
 900,
 441,
 100,
 676,
 81,
 441,
 289,
 81,
 81,
 36,
 36,
 576,
 16,
 16,
 225,
 121,
 100,
 36,
 361,
 36,
 484,
 676,
 1,
 324,
 0,
 144,
 16,
 81,
 100,
 784,
 25,
 625,
 400,
 676,
 9,
 81,
 361,
 100,
 484,
 64,
 1,
 484,
 144,
 625,
 441,
 529,
 900,
 196,
 25,
 36,
 784,
 9,
 49,
 64,
 16,
 256,
 64,
 484,
 100,
 225,
 256,
 576,
 9,
 900]

5. Now let’s find the base-10 log of each element of list_2 plus 1 (adding 1 avoids taking the log of 0):

In [38]:
import math
list_2 = [math.log(x+1,10) for x in list_2]
list_2

[2.4099331233312946,
 1.9138138523837167,
 1.9138138523837167,
 0.30102999566398114,
 1.5682017240669948,
 2.086359830674748,
 2.7242758696007887,
 2.3541084391474008,
 2.6454222693490914,
 1.9138138523837167,
 2.7611758131557314,
 1.2304489213782739,
 2.6454222693490914,
 2.6454222693490914,
 2.9547247909790624,
 0.30102999566398114,
 2.9547247909790624,
 1.8129133566428552,
 2.558708570533166,
 0.6989700043360187,
 2.8948696567452523,
 2.4099331233312946,
 2.9547247909790624,
 2.925312091499649,
 2.603144372620182,
 1.0,
 2.6454222693490914,
 2.6857417386022635,
 2.4099331233312946,
 1.6989700043360185,
 2.9547247909790624,
 2.4099331233312946,
 2.925312091499649,
 1.9138138523837167,
 2.6857417386022635,
 2.6857417386022635,
 2.9547247909790624,
 2.6454222693490914,
 2.0043213737826426,
 2.8305886686851442,
 1.9138138523837167,
 2.6454222693490914,
 2.4623979978989556,
 1.9138138523837167,
 1.9138138523837167,
 1.5682017240669948,
 1.5682017240669948,
 2.7611758131557314,
 1.2304489

In this exercise, we worked on random numbers, list comprehensions, and preprocessing data. Let’s
put what we have learned so far together and go through an activity to practice how to handle
lists.
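
Because the list is random, your output will differ from the one printed above on every run. If repeatable results are needed, one option is to seed the generator first. The sketch below assumes an arbitrary seed value of 42, which is our choice and not part of the exercise.

```python
import random

random.seed(42)  # 42 is an arbitrary, hypothetical seed
first = [random.randint(0, 30) for _ in range(100)]

random.seed(42)  # reseeding with the same value replays the same sequence
second = [random.randint(0, 30) for _ in range(100)]

same = first == second
# 100 draws from only 31 possible values must contain duplicates
has_duplicates = len(set(first)) < len(first)

print(same, has_duplicates)  # True True
```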

#### SETS

A set, mathematically speaking, is just a collection of well-defined distinct objects. Python gives
us a straightforward way to deal with them using its set data type.

Introduction to Sets

Using the last list that we generated in the previous section, we are going to revisit the problem of
getting rid of duplicates. We can achieve that with the following line of code:

In [39]:
list_12 = list(set(list_1))

If we print this, we will see that it only contains unique numbers. We used the set data type to turn
the first list into a set, thus getting rid of all duplicate elements, and then used the list function to
turn it into a list from a set once more:


In [40]:
list_12

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 28,
 29,
 30]

Union and Intersection of Sets

In mathematical terms, a collection of unique objects is a set. There are many ways of combining sets,
and Python supports them with the same meaning they have in mathematics. One such way is the union.

A union simply means taking everything from all the sets but taking the common elements only once.
To try this, first define three sets:

In [41]:
set1 = {"Apple", "Orange", "Banana","Guava"}
set2 = {"Pear", "Peach", "Mango", "Banana", "Guava"}
set3 = {"Durian","Kiwi", "Grape","Avocado","Banana", "Guava"}

To find the union of the three sets, the following code should be used:

In [42]:
set1 | set2 | set3

{'Apple',
 'Avocado',
 'Banana',
 'Durian',
 'Grape',
 'Guava',
 'Kiwi',
 'Mango',
 'Orange',
 'Peach',
 'Pear'}

Notice that the common elements, Banana and Guava, appear only once in the resulting set. The common
elements of the sets can be identified by obtaining their intersection, as follows:

In [43]:
set1 & set2 & set3

{'Banana', 'Guava'}

In this section, we went through sets and how to perform basic set operations. Sets are used
throughout database programming and design, and they are very useful for data wrangling.
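
The | and & operators have method equivalents, union and intersection, which also accept any iterable (such as a list) as an argument. A minimal sketch using two of the fruit sets from above:

```python
set1 = {"Apple", "Orange", "Banana", "Guava"}
set2 = {"Pear", "Peach", "Mango", "Banana", "Guava"}

union_op = set1 | set2
union_method = set1.union(set2)          # same result as the | operator
common_op = set1 & set2
common_method = set1.intersection(set2)  # same result as the & operator

# The methods accept any iterable, whereas the operators require sets
with_list = set1.union(["Kiwi", "Apple"])

print(union_op == union_method)    # True
print(sorted(common_method))       # ['Banana', 'Guava']
print(sorted(with_list))
```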

Creating Null Sets

In mathematical terms, a set that has nothing inside it is called a null set or an empty set.

You can create a null set by creating a set containing no elements. You can do this by using the
following code:

In [44]:
null_set_1 = set()  # set({}) also works, but plain {} creates a dict
null_set_1

set()

However, a pair of empty braces does not create a null set. It creates an empty dictionary:

In [45]:
null_set_2 = {}
null_set_2

{}

We are going to learn about this in detail in the next section.
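
A quick way to confirm which of the two you have is to check the type. The sketch below uses our own variable names to keep the distinction visible.

```python
empty_set = set()
empty_braces = {}

print(type(empty_set).__name__)     # set
print(type(empty_braces).__name__)  # dict

# Each type has its own way of adding content
empty_set.add("Apple")
empty_braces["fruit"] = "Apple"

print(empty_set)     # {'Apple'}
print(empty_braces)  # {'fruit': 'Apple'}
```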

#### DICTIONARY

A dictionary is like a list in that it is a collection of several elements. However, a
dictionary is a collection of key-value pairs, where the key can be any hashable object (in practice, an
immutable value such as a number, a string, or a tuple). Generally, we use numbers or strings as keys.

To create a dictionary, use the following code:

In [46]:
dict_1 = {"key1": "value1", "key2": "value2"}
dict_1

{'key1': 'value1', 'key2': 'value2'}

This is also a valid dictionary:

In [47]:
dict_2 = {"key1": 1, "key2": ["list_element1", 34], \
"key3": "value3","key4": {"subkey1": "v1"}, \
"key5": 4.5}
dict_2

{'key1': 1,
 'key2': ['list_element1', 34],
 'key3': 'value3',
 'key4': {'subkey1': 'v1'},
 'key5': 4.5}

The keys must be unique in a dictionary.
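
The sketch below illustrates two consequences of these rules, using illustrative names: repeating a key in a literal keeps only the last value, and an unhashable object such as a list cannot be used as a key.

```python
# A repeated key does not raise an error; the last value silently wins
repeated = {"key1": 1, "key1": 2}
print(repeated)  # {'key1': 2}

# A tuple is hashable, so it can serve as a key
by_point = {(0, 0): "origin"}
print(by_point[(0, 0)])  # origin

# A list is mutable and unhashable, so it cannot be a key
try:
    bad = {[0, 0]: "origin"}
except TypeError as error:
    message = str(error)
    print(message)
```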

Exercise 2.06: Accessing and Setting Values in a Dictionary

In this exercise, we are going to access the elements and set values in a dictionary. When working
with dictionaries, it’s important to be able to iterate through each key-value pair, which will allow
you to process the data as needed. To do so, let’s go through the following steps:

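1. Define the stocks dictionary. To access a value in the dictionary, you must provide the key. Keep in mind
that elements are looked up by key rather than by position (since Python 3.7, a dictionary does preserve the
insertion order of its pairs):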

In [48]:
stocks = \
{"Solar Capital Ltd.":"$920.44M", \
"Zoe's Kitchen, Inc.":"$262.32M",\
"Toyota Motor Corp Ltd Ord":"$156.02B",\
"Nuveen Virginia Quality Municipal Income Fund":"$238.33M",\
"Kinross Gold Corporation":"$5.1B",\
"Vulcan Materials Company":"$17.1B",\
"Hi-Crush Partners LP":"$955.69M",\
"Lennox International, Inc.":"$8.05B",\
"WMIH Corp.":"$247.66M",\
"Comerica Incorporated":"n/a"}

2. Print a particular element from the stocks dictionary:

In [49]:
stocks["WMIH Corp."]

'$247.66M'

3. Set a value using the same method we use to access a value:

In [50]:
stocks["WMIH Corp."] = "$300M"

4. Define a blank dictionary and then use the key notation to assign values to it:

In [51]:
dict_3 = {} # Not a null set. It is a dict
dict_3["key1"] = "Value1"
dict_3

{'key1': 'Value1'}

As we can see, the manipulation techniques of a dictionary are pretty simple. Now, just like a list,
iterating through a dictionary is very important in order to process the data.
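
One detail worth knowing before iterating: looking up a key that does not exist with square brackets raises a KeyError. The sketch below shows this on a two-entry excerpt of stocks, together with the get method and the in operator as common ways to avoid the error. The key "Unknown Corp." is hypothetical.

```python
stocks = {"WMIH Corp.": "$247.66M", "Comerica Incorporated": "n/a"}  # excerpt

present = stocks.get("WMIH Corp.")            # '$247.66M'
missing = stocks.get("Unknown Corp.")         # None instead of an error
fallback = stocks.get("Unknown Corp.", "n/a") # an explicit default value

try:
    stocks["Unknown Corp."]
    raised = False
except KeyError:
    raised = True

print(present, missing, fallback, raised)
print("WMIH Corp." in stocks)  # True: membership tests check the keys
```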

Exercise 2.07: Iterating over a Dictionary

In this exercise, we are going to iterate over a dictionary and print the values and keys. To do so,
let’s go through the following steps:

1. Open a new Jupyter Notebook and define the stocks dictionary, which maps each company name (key) to a value string.
Keep in mind that elements are looked up by key rather than by position:


In [52]:
stocks = \
{"Solar Capital Ltd.":"$920.44M",\
"Zoe's Kitchen, Inc.":"$262.32M",\
"Toyota Motor Corp Ltd Ord":"$156.02B",\
"Nuveen Virginia Quality Municipal Income Fund":"$238.33M",\
"Kinross Gold Corporation":"$5.1B",\
"Vulcan Materials Company":"$17.1B",\
"Hi-Crush Partners LP":"$955.69M",\
"Lennox International, Inc.":"$8.05B",\
"WMIH Corp.":"$247.66M",\
 "Comerica Incorporated":"n/a"}

2. Remove the $ character from the stocks dictionary:

In [53]:
for key,val in stocks.items():
    stocks[key] = val.replace('$', '')
stocks

{'Solar Capital Ltd.': '920.44M',
 "Zoe's Kitchen, Inc.": '262.32M',
 'Toyota Motor Corp Ltd Ord': '156.02B',
 'Nuveen Virginia Quality Municipal Income Fund': '238.33M',
 'Kinross Gold Corporation': '5.1B',
 'Vulcan Materials Company': '17.1B',
 'Hi-Crush Partners LP': '955.69M',
 'Lennox International, Inc.': '8.05B',
 'WMIH Corp.': '247.66M',
 'Comerica Incorporated': 'n/a'}

3. Iterate over the stocks dictionary again and split each value into a list with the price (val) and
the multiplier (mult) as separate elements, so that each key is assigned a single two-element list:

In [54]:
for key,val in stocks.items():
    mult = val[-1]
    stocks[key] = [val[:-1],mult]
stocks

{'Solar Capital Ltd.': ['920.44', 'M'],
 "Zoe's Kitchen, Inc.": ['262.32', 'M'],
 'Toyota Motor Corp Ltd Ord': ['156.02', 'B'],
 'Nuveen Virginia Quality Municipal Income Fund': ['238.33', 'M'],
 'Kinross Gold Corporation': ['5.1', 'B'],
 'Vulcan Materials Company': ['17.1', 'B'],
 'Hi-Crush Partners LP': ['955.69', 'M'],
 'Lennox International, Inc.': ['8.05', 'B'],
 'WMIH Corp.': ['247.66', 'M'],
 'Comerica Incorporated': ['n/', 'a']}

Notice the difference between how we did the iteration on the list and how we are doing it here. A
dictionary always contains key-value pairs, and we always need to access the value of any element
in a dictionary with its key. In a dictionary, all the keys are unique.
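
The output above also shows a side effect of splitting blindly on the last character: the placeholder 'n/a' becomes ['n/', 'a']. A minimal sketch of one way to guard against that follows. It works on a three-entry excerpt of stocks and assumes, as our own convention, that missing values should be stored as None.

```python
stocks = {
    "Solar Capital Ltd.": "$920.44M",
    "Kinross Gold Corporation": "$5.1B",
    "Comerica Incorporated": "n/a",
}  # excerpt of the dictionary above

parsed = {}
for name, raw in stocks.items():
    value = raw.replace("$", "")
    if value[-1] in ("M", "B"):
        # Split into [price, multiplier] only when a multiplier is present
        parsed[name] = [value[:-1], value[-1]]
    else:
        # Assumption: None marks a value that could not be parsed
        parsed[name] = None

print(parsed)
```

Building a separate parsed dictionary also leaves the original stocks values available for inspection.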

In the next exercise, we will revisit the problem that we encountered with the list earlier in this
chapter to create a list with unique values. We will look at another workaround to fix this problem.

Exercise 2.08: Revisiting the Unique Valued List Problem

In this exercise, we will use the unique nature of a dictionary, and we will drop the duplicate values
from a list. First, we will create a random list with duplicate values. Then, we’ll use the fromkeys
and keys methods of a dictionary to create a unique valued list. To do so, let’s go through the
following steps:

1. First, generate a random list with duplicate values:


In [55]:
import random
list_1 = [random.randint(0, 30) for x in range (0, 100)]

2. Create a unique valued list from list_1:

In [56]:
list(dict.fromkeys(list_1).keys())

[3,
 14,
 24,
 8,
 29,
 19,
 5,
 6,
 11,
 26,
 20,
 7,
 1,
 12,
 0,
 22,
 28,
 30,
 18,
 16,
 21,
 15,
 27,
 23,
 13,
 2,
 4,
 25,
 17,
 9,
 10]

Here, we have used two useful methods of the dict data type in Python, fromkeys and keys. fromkeys
is a class method that creates a new dictionary whose keys come from the given sequence of elements and whose
values are all set to an optional value supplied by the user (None by default), while the keys method gives us
the keys of a dictionary.
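
The following sketch, on a short illustrative list, shows what fromkeys produces and how the result compares with the set-based approach from the previous section. Since a dictionary preserves insertion order (Python 3.7 and later), the dictionary approach keeps the first occurrence of each value in its original position.

```python
values = [3, 14, 3, 8, 14, 29]  # illustrative data with duplicates

placeholder = dict.fromkeys(values)           # every value defaults to None
with_default = dict.fromkeys(["a", "b"], 0)   # or to a value you supply

unique_ordered = list(dict.fromkeys(values))  # order of first appearance
unique_sorted = sorted(set(values))           # a set has no order, so sort explicitly

print(placeholder)     # {3: None, 14: None, 8: None, 29: None}
print(with_default)    # {'a': 0, 'b': 0}
print(unique_ordered)  # [3, 14, 8, 29]
print(unique_sorted)   # [3, 8, 14, 29]
```

Iterating over a dictionary yields its keys, so the explicit call to keys is optional here.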

Exercise 2.09: Deleting a Value from a Dictionary

In this exercise, we are going to delete a value from a dict using the del statement. Perform the following
steps:

1. Create dict_1 with five elements:


In [57]:
dict_1 = {"key1": 1, "key2": ["list_element1", 34], \
"key3": "value3","key4": {"subkey1": "v1"}, \
"key5": 4.5}
dict_1

{'key1': 1,
 'key2': ['list_element1', 34],
 'key3': 'value3',
 'key4': {'subkey1': 'v1'},
 'key5': 4.5}

2. We will use the del statement and specify the key of the element we want to delete:


In [58]:
del dict_1["key2"]
dict_1

{'key1': 1, 'key3': 'value3', 'key4': {'subkey1': 'v1'}, 'key5': 4.5}

3. Let’s delete key3 and key4:

In [59]:
del dict_1["key3"]
del dict_1["key4"]

4. Now, let’s print the dictionary to see its content:

In [60]:
dict_1

{'key1': 1, 'key5': 4.5}

In this exercise, we learned how to delete elements from a dictionary. This is a very useful functionality of dictionaries, and you will find that it’s used heavily when writing Python applications.
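
Note that del raises a KeyError when the key is not present. The sketch below shows that behavior and, as an alternative, the pop method, which returns the removed value and can take a default for missing keys. The keys "key2" and "key9" are simply examples of keys that are absent.

```python
dict_1 = {"key1": 1, "key5": 4.5}

try:
    del dict_1["key2"]   # "key2" is not in the dictionary
    deleted = True
except KeyError:
    deleted = False

removed = dict_1.pop("key5")       # removes the pair and returns its value
absent = dict_1.pop("key9", None)  # a default avoids the KeyError

print(deleted, removed, absent)  # False 4.5 None
print(dict_1)                    # {'key1': 1}
```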

In our final exercise on dict, we will go over a less commonly used kind of comprehension called dictionary
comprehension. We will also examine two other ways to create a dict, which can be very
useful for building dictionaries in one line. This is handy, for example, for a
range of key-value pairs such as name and age or credit card number and credit card owner. A dictionary
comprehension works exactly the same way as a list comprehension, but we need to specify both the
key and the value.

Exercise 2.10: Dictionary Comprehension

In this exercise, we will generate a dictionary using the following steps:

1. Generate a dict that has 0 to 9 as the keys and the key raised to the 50th power as the values:

In [61]:
list_1 = [x for x in range(0, 10)]
dict_1 = {x : x**50 for x in list_1}
dict_1

{0: 0,
 1: 1,
 2: 1125899906842624,
 3: 717897987691852588770249,
 4: 1267650600228229401496703205376,
 5: 88817841970012523233890533447265625,
 6: 808281277464764060643139600456536293376,
 7: 1798465042647412146620280340569649349251249,
 8: 1427247692705959881058285969449495136382746624,
 9: 515377520732011331036461129765621272702107522001}

Can you generate a dict using dict comprehension without using a list? Let’s try this now.

2. Generate a dictionary by passing a list of (key, value) tuples to the dict function:


In [62]:
dict_2 = dict([('Tom', 100), ('Dick', 200), ('Harry', 300)])
dict_2

{'Tom': 100, 'Dick': 200, 'Harry': 300}

3. You can also create a dictionary by passing keyword arguments to the dict function, as follows:

In [63]:
dict_3 = dict(Tom=100, Dick=200, Harry=300)
dict_3

{'Tom': 100, 'Dick': 200, 'Harry': 300}

Dictionaries are very flexible and can be used for a variety of tasks. The compact nature of
comprehensions makes them very popular. The strange-looking pair of values that we just looked at,
('Harry', 300), is called a tuple. This is another important fundamental data type in Python. We
will learn about tuples in the next section.
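
Returning to the question raised in the exercise, a dictionary comprehension does not need an intermediate list: it can iterate over any iterable, such as a range or the pairs produced by zip. The sketch below uses squares instead of the 50th power to keep the output short, and the names and scores are illustrative.

```python
# Iterate over range directly; no list is built first
squares = {x: x**2 for x in range(10)}

# Pair two sequences with zip and build the dictionary from the pairs
names = ["Tom", "Dick", "Harry"]
scores = [100, 200, 300]
paired = {name: score for name, score in zip(names, scores)}

# A condition filters the pairs, just as in a list comprehension
high = {name: score for name, score in paired.items() if score >= 200}

print(squares[9])  # 81
print(paired)      # {'Tom': 100, 'Dick': 200, 'Harry': 300}
print(high)        # {'Dick': 200, 'Harry': 300}
```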

#### TUPLES

A tuple is another data type in Python. Tuples in Python are similar to lists, with one key
difference: a tuple is immutable. Immutable basically means you
can’t modify it by adding, removing, or replacing its elements. Like a list, it is sequential in nature.

A tuple consists of values separated by commas, as follows:


In [64]:
tuple_1 = 24, 42, 2.3456, "Hello"
tuple_1

(24, 42, 2.3456, 'Hello')

Notice that, unlike lists, we did not open and close square brackets here; the parentheses shown in the
output are optional when writing a tuple.
When referring to a tuple, the length of the tuple is called its cardinality. This term comes from database
and set theory and is a common way to refer to its length.

Creating a Tuple with Different Cardinalities

This is how we create an empty tuple:

In [65]:
tuple_1 = ()

This is how we create a tuple with only one value:


In [66]:
tuple_1 = "Hello",

Notice the trailing comma here.
We can nest tuples, similar to lists and dicts, as follows:

In [67]:
tuple_1 = "hello", "there"
tuple_12 = tuple_1, 45, "Sam"

One special thing about tuples is the fact that they are an immutable data type. So, once they’re
created, we cannot change their values; we can only read them. Trying to assign to an element fails, as follows:

In [68]:
tuple_1 = "Hello", "World!"
tuple_1[1] = "Universe!"

TypeError: 'tuple' object does not support item assignment

The last line of the preceding code will result in a TypeError as a tuple does not allow modification.
This makes the use case for tuples a bit different from lists, although they look and behave very
similarly in a few ways.

We can access the elements of a tuple in the same manner we can for lists:


In [70]:
tuple_1 = ("good", "morning!" , "how", "are", "you?")
tuple_1[0]

'good'

Let’s access another element:

In [71]:
tuple_1[3]

'are'

Unpacking a Tuple

The expression “unpacking a tuple” simply means getting the values contained in the tuple in
different variables:


In [72]:
tuple_1 = "Hello", "World"
hello, world = tuple_1
print(hello)
print(world)

Hello
World


Of course, as soon as we do that, we can modify the values contained in those variables.
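
The sketch below illustrates both points with illustrative names: rebinding an unpacked variable leaves the tuple untouched, and the number of variables on the left must match the number of elements in the tuple.

```python
pair = ("Hello", "World")
hello, world = pair

hello = "Goodbye"   # rebinds the variable only; the tuple is unchanged
print(hello, world) # Goodbye World
print(pair)         # ('Hello', 'World')

# Unpacking into the wrong number of variables raises a ValueError
try:
    a, b, c = pair
except ValueError as error:
    message = str(error)
    print(message)
```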

Exercise 2.11: Handling Tuples

In this exercise, we will walk through the basic functionalities of tuples. Let’s go through the steps
one by one:

1. Create a tuple, which we will use to demonstrate how tuples are immutable, and display its elements,
as follows:

In [73]:
tupleE = "1", "3", "5"
tupleE

('1', '3', '5')

2. Try to overwrite an element of the tupleE tuple:


In [74]:
tupleE[1] = "5"

TypeError: 'tuple' object does not support item assignment

This step will result in TypeError as the tuple does not allow modification.

3. Try to unpack the tupleE tuple into literals instead of variable names:

In [75]:
1, 3, 5 = tupleE

SyntaxError: cannot assign to literal (3968916292.py, line 1)

This step results in a SyntaxError stating that Python cannot assign to a literal. The targets on the left side of an unpacking assignment must be variable names, not values.

4. Print variables at 0th and 1st positions:

In [76]:
print(tupleE[0])
print(tupleE[1])

1
3


We have seen two different types of data so far. One is represented by numbers, while the other is
represented by textual data. Now it’s time to look into textual data in a bit more detail.

#### STRINGS

Strings in Python are similar to strings in any other programming language.

This is a string:


In [77]:
string1 = 'Hello World!'

A string can also be declared in this manner:

In [78]:
string2 = "Hello World 2!"

You can use single quotes and double quotes to define a string.

A slice of a string is written as:

str[inclusive start position : exclusive end position]

Strings in Python behave similarly to lists, apart from one big caveat. Strings are immutable, whereas
lists are mutable data structures.
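
A minimal sketch of what this caveat means in practice: assigning to a character position fails, so a changed string has to be built as a new object. The variable names are illustrative.

```python
greeting = "Hello World!"

# Item assignment works on lists but not on strings
try:
    greeting[0] = "J"
    assigned = True
except TypeError as error:
    assigned = False
    message = str(error)

# Build a new string from slices instead
changed = "J" + greeting[1:]

# Or go through a mutable list and join the characters back together
letters = list(greeting)
letters[0] = "J"
rebuilt = "".join(letters)

print(assigned)  # False
print(changed)   # Jello World!
print(greeting)  # Hello World! (the original is untouched)
```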

Exercise 2.12: Accessing Strings

In this exercise, we are going to use indexing to access the individual characters of a string. Let’s go through
the following steps:

1. Create a string called str_1:


In [79]:
str_1 = "Hello World!"
str_1

'Hello World!'

You can access the elements of the string by specifying the location of the element, like we did for
lists.

2. Access the first member of the string:

In [80]:
str_1[0]

'H'

3. Access the fifth member of the string:


In [81]:
str_1[4]

'o'

4. Access the last member of the string:


In [82]:
str_1[len(str_1) - 1]

'!'

5. Access the last member of the string, in a different way this time:

In [83]:
str_1[-1]

'!'

Each of the preceding operations will give you the character at the specific index. The method for
accessing the elements of a string is like accessing a list. Let’s do a couple more exercises to
manipulate strings.

Exercise 2.13: String Slices

This exercise will demonstrate how we can slice strings the same way as we did with lists. Although
strings are not lists, the functionality will work in the same way.

Let’s go through the following steps:

1. Create a string, str_1:

In [84]:
str_1 = "Hello World! I am learning data wrangling"
str_1

'Hello World! I am learning data wrangling'

2. Specify the slicing values and slice the string:


In [85]:
str_1[2:10]

'llo Worl'

3. Slice a string by omitting the end value, so that the slice runs to the end of the string:


In [86]:
str_1[-31:]

'd! I am learning data wrangling'

4. Use negative numbers to slice the string:

In [87]:
str_1[-10:-5]

' wran'

As we can see, it is quite simple to manipulate strings with basic operations.

String Functions

To find out the length of a string, we simply use the len function:


In [88]:
str_1 = "Hello World! I am learning data wrangling"
len(str_1)

41

The length of the string is 41. To convert a string’s case, we can use the lower and upper methods:

In [89]:
str_1 = "A COMPLETE UPPER CASE STRING"
str_1.lower()

'a complete upper case string'

To convert a string to uppercase, use the upper method (lower did not modify str_1 itself):

In [90]:
str_1.upper()

'A COMPLETE UPPER CASE STRING'

To search for a string within a string, we can use the find method:

In [None]:
str_1 = "A complicated string looks like this"
str_1.find("complicated")
str_1.find("hello")

The output is -1, which comes from the last call, find("hello"). Can you figure out whether the find
method is case-sensitive or not? Also, what do you think the find method returns when it actually
finds the string?

To replace one string with another, we have the replace method. Since we know that a string is an
immutable data structure, replace returns a new string and leaves the original string unchanged:


In [91]:
str_1 = "A complicated string looks like this"
str_1.replace("complicated", "simple")

'A simple string looks like this'
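The immutability described above can be checked directly. The following is a minimal sketch; the variable names original, replaced, and assignment_failed are illustrative and not part of the exercise:

```python
# replace() returns a new string and leaves the original alone.
original = "A complicated string looks like this"
replaced = original.replace("complicated", "simple")

print(original)   # the original text is unchanged
print(replaced)

# Item assignment works on lists but raises TypeError on strings.
try:
    original[0] = "a"
    assignment_failed = False
except TypeError:
    assignment_failed = True

print(assignment_failed)
```

To keep the modified text, assign the return value of replace to a variable, as done here with replaced.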

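Strings have two useful methods: split and join. Here are their definitions:

str.split(separator)

The separator argument is a delimiter that you define. join works in the opposite direction: it is
called on the separator string and takes the sequence of strings to combine:

separator.join(iterable)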

Exercise 2.14: Splitting and Joining a String

This exercise will demonstrate how to perform split and join operations on a string. These two
methods complement each other: split converts a string into a list, and join converts a list of
strings back into a single string. Let’s go through the following steps to do so:

1. Create a string and convert it into a list using the split method:


In [92]:
str_1 = "Name, Age, Sex, Address"
list_1 = str_1.split(",")
list_1

['Name', ' Age', ' Sex', ' Address']

2. Combine this list into another string using the join method:


In [93]:
s = " | "
s.join(list_1)

'Name |  Age |  Sex |  Address'
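Notice that splitting on "," leaves a leading space on every field except the first. A common follow-up, sketched here with the illustrative names fields and joined, is to strip each piece before joining:

```python
str_1 = "Name, Age, Sex, Address"

# Split on the comma, then strip the whitespace left around each field.
fields = [field.strip() for field in str_1.split(",")]
print(fields)

# Join the cleaned fields back into one string.
joined = " | ".join(fields)
print(joined)
```

Splitting on ", " (comma plus space) would also work for this particular string, but stripping is more forgiving when the spacing in the data is inconsistent.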

### 2.4 FUNCTIONS

Introduction

We were introduced to the basic concepts of different fundamental data structures in the previous
section. We learned about lists, sets, dictionaries, tuples, and strings. However, what we have
covered so far were only basic operations on those data structures. They have much more to
offer once you learn how to utilize them effectively. In this section, we will venture further into
the land of data structures. We will learn about advanced operations and manipulations and use
fundamental data structures to represent more complex and higher-level data structures; this is
often handy while wrangling data in real life. These higher-level topics will include stacks, queues,
iterators, and file operations.

In this chapter, we will also learn how to open a file using built-in Python methods and about the
many different file operations, such as reading and writing data, and safely closing files once we are
done. We will also take a look at some of the problems to avoid while dealing with files.

Iterator

Iterators in Python are very useful when dealing with data as they allow you to parse the data one
unit at a time. Iterators are stateful: each one keeps track of its current position, so it knows
which element to return next. An iterator is an object that implements the __next__ method, which
the built-in next function calls, and you can obtain an iterator from collections such as lists,
tuples, dictionaries, and more. Practically, this means that each time we call next on the iterator,
it gives us the next element from the collection; if there are no further elements, it raises a
StopIteration exception.
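The behavior described above can be sketched with the built-in iter and next functions. The list numbers and the other variable names are made up for illustration:

```python
numbers = [10, 20, 30]
it = iter(numbers)      # ask the list for an iterator

first = next(it)        # the iterator remembers its position between calls
second = next(it)
third = next(it)
print(first, second, third)

# A fourth call has nothing left to return, so StopIteration is raised.
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

print(exhausted)
```

A for loop does the same thing behind the scenes: it calls next repeatedly and stops quietly when StopIteration is raised.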

Let’s learn about some of the functions provided by the itertools module. In a Jupyter Notebook,
appending ? to a name displays its documentation, so as you execute each line of the code after the
import statement, you will be able to see details about what that particular function does and how
to use it:

In [94]:
from itertools import (permutations, combinations,
                       dropwhile, repeat, zip_longest)

permutations?
combinations?
dropwhile?
repeat?
zip_longest?

Exercise 2.15: Introduction to the Iterator

In this exercise, we’re going to generate a long list containing numbers. We will first check the
memory occupied by the generated list. We will then check how we can use the itertools module to
reduce memory utilization, and finally, we will use the resulting iterator to loop over the values.
To do this, let’s go through the following steps:

1. Open a new Jupyter Notebook and generate a list that will contain 10000000 ones. Then,
store this list in a variable called big_list_of_numbers:

In [95]:
big_list_of_numbers = [1 for x in range (0, 10000000)]
big_list_of_numbers

[1,
 1,
 1,
 ...]


2. Check the size of this variable:

In [96]:
from sys import getsizeof
getsizeof(big_list_of_numbers)

89095160

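The value shown is 89095160 bytes (about 89 MB); the exact figure depends on your Python version
and platform, and getsizeof counts only the list object itself, not the integers it refers to. This
is a huge chunk of memory for the list to occupy. The big_list_of_numbers variable is also only
available once the list comprehension has finished, and the list can exhaust the available system
memory if you try too big a number.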

3. Let’s use the repeat() function from itertools to get the same numbers but with less memory:


In [97]:
from itertools import repeat
small_list_of_numbers = repeat(1, times=10000000)
getsizeof(small_list_of_numbers)

48

The last line shows that small_list_of_numbers is only 48 bytes in size. Note that it is not a list
but an iterator. It is also lazy: lazy evaluation is a technique used in functional programming that
delays a computation until its result is actually needed. In this case, Python will not generate all
the elements initially. It will, instead, generate them one by one when asked, thus saving us
memory. In fact, if you omit the times keyword argument of repeat() in the preceding code, you get
an iterator that produces ones indefinitely.

4. Loop over the newly generated iterator:


In [98]:
for i, x in enumerate(small_list_of_numbers):
    print(x)
    if i > 10:
        break

1
1
1
1
1
1
1
1
1
1
1
1


We use the enumerate function so that we get the loop counter, along with the values. This will
help us break the loop once the counter passes a certain number (10, for example), which is why
twelve values are printed.
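As an alternative to enumerate and break, the standard library function itertools.islice can take a fixed number of items from a lazy iterator. This is a small sketch; the names endless_ones and first_five are illustrative:

```python
from itertools import islice, repeat

endless_ones = repeat(1)                     # no times argument: never runs out
first_five = list(islice(endless_ones, 5))   # only five items are ever produced
print(first_five)
```

Because repeat is lazy, asking for five items does only five steps of work, even though the underlying iterator is unbounded.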

In this exercise, we first learned how to use an iterator from itertools to reduce memory usage.
Then, we looped over that iterator. Now, we’ll see how to create stacks.

Stacks
                                                       
A stack is a very useful data structure. If you know a bit about CPU internals and how a program
gets executed, then you will know that a stack is present in many such cases. It is simply a list
with one restriction, Last In First Out (LIFO), meaning that the element that comes in last goes out
first when a value is read from the stack.

We will implement a stack using a Python list. Python lists have a method called pop, which does
the exact same pop operation that you can see in the preceding illustration. Basically, the pop
method takes an element off the end of the list, following the Last In First Out (LIFO) rule. We
will use that to implement a stack in the following exercise.

Let’s look at an example to understand how the push (append) and pop operations work:

Exercise 2.16: Stack Using List

In [99]:
# Python code to demonstrate Implementing
# stack using list
stack = ["Goku", "Bezita", "Gohan"]
# Let's push some other names into our list
stack.append("Trunks")
stack.append("Goten")
print(stack)

# Removes the last item
print(stack.pop())

print(stack)

# Removes the last item
print(stack.pop())

print(stack)

['Goku', 'Bezita', 'Gohan', 'Trunks', 'Goten']
Goten
['Goku', 'Bezita', 'Gohan', 'Trunks']
Trunks
['Goku', 'Bezita', 'Gohan']


Exercise 2.17: Implementing a Stack in Python
    
In this exercise, we’ll implement a stack in Python. We will first create an empty stack and add
new elements to it using the append method. Next, we’ll take out elements from the stack using
the pop method. Let’s go through the following steps:

1. Define an empty stack. No library import is needed, because the stack is a plain Python list:

In [100]:
stack = []

2. Use the append method to add multiple elements to the stack. Thanks to the append method,
the element will always be appended at the end of the list:


In [101]:
stack.append('my_test@test.edu')
stack.append('rahul.subhramanian@test.edu')
stack.append('sania.test@test.edu')
stack.append('alec_baldwin@test.edu')
stack.append('albert90@test.edu')
stack.append('stewartj@test.edu')
stack

['my_test@test.edu',
 'rahul.subhramanian@test.edu',
 'sania.test@test.edu',
 'alec_baldwin@test.edu',
 'albert90@test.edu',
 'stewartj@test.edu']

3. Let’s read a value from our stack using the pop method. This method returns the element at the
current last index of the list and also removes that element from the list:


In [102]:
tos = stack.pop()
tos

'stewartj@test.edu'

As you can see, the last value of the stack has been retrieved. Now, if we add another value to the
stack, the new value will be appended at the end of the stack.

4. Append Hello@test.com to the stack:

In [103]:
stack.append("Hello@test.com")
stack

['my_test@test.edu',
 'rahul.subhramanian@test.edu',
 'sania.test@test.edu',
 'alec_baldwin@test.edu',
 'albert90@test.edu',
 'Hello@test.com']

5. Pop the value that was just appended:

In [104]:
tos = stack.pop()
tos

'Hello@test.com'

From the exercise, we can see that the basic stack operations, append and pop, are pretty easy to
perform.

Let’s visualize a problem where you are scraping a web page and you want to follow each URL
present there (backlinks). Let’s split the solution to this problem into three parts. In the first
part, we would append all the URLs scraped off the page into the stack. In the second part, we would
pop each element in the stack, and then lastly, we would examine every URL, repeating the same
process for each page. We will examine a part of this task in the next exercise.

Exercise 2.18: Implementing a Stack Using User-Defined Methods

In this exercise, we will continue the topic of stacks from the last exercise. This time, we will
implement the append and pop functions by creating user-defined methods. We will implement a
stack, and this time with a business use case example (taking Wikipedia as a source). The aim
of this exercise is twofold. In the first few steps, we will extract the URLs scraped off a web page
and append them to a stack, which also involves the string methods discussed earlier. In
the next few steps, we will use the stack_pop function to iterate over the stack and print its
elements. This exercise will show us a subtle feature of Python and how it handles passing list
variables to functions. Let’s go through the following steps:

1. First, define two functions: stack_push and stack_pop. We named them this way so that we do not
have a namespace conflict. Also, create a stack called url_stack for later use:

In [105]:
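def stack_push(s, value):
    # Return a new list with value on top; s itself is not modified
    return s + [value]

def stack_pop(s):
    # Remove and return the top element; this mutates s in place
    tos = s[-1]
    del s[-1]
    return tos

url_stack = []
url_stack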

[]

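The first function takes the already existing stack and returns a new list with the value added at
the end of it. The original list is left unchanged, which is why step 6 assigns the result back to
url_stack.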

Now, we are going to have a string with a few URLs in it.

2. Analyze the string so that we push the URLs in the stack one by one as we encounter them,
and then use a for loop to pop them one by one. Let’s take the first line from the Wikipedia
article (https://en.wikipedia.org/wiki/Data_mining) about data science:


In [106]:
wikipedia_datascience = """Data science is an interdisciplinary
field that uses scientific methods, processes, algorithms and systems
to extract knowledge [https://en.wikipedia.org/wiki/Knowledge] and
insights from data [https://en.wikipedia.org/wiki/Data] in various
forms, both structured and unstructured,similar to data mining
[https://en.wikipedia.org/wiki/Data_mining]"""

For the sake of the simplicity of this exercise, we have kept the links in square brackets beside the
target words.

3. Find the length of the string:

In [107]:
len(wikipedia_datascience)

347

4. Convert this string into a list by using the split method from the string, and then calculate
its length:

In [108]:
wd_list = wikipedia_datascience.split()
wd_list

['Data',
 'science',
 'is',
 'an',
 'interdisciplinary',
 'field',
 'that',
 'uses',
 'scientific',
 'methods,',
 'processes,',
 'algorithms',
 'and',
 'systems',
 'to',
 'extract',
 'knowledge',
 '[https://en.wikipedia.org/wiki/Knowledge]',
 'and',
 'insights',
 'from',
 'data',
 '[https://en.wikipedia.org/wiki/Data]',
 'in',
 'various',
 'forms,',
 'both',
 'structured',
 'and',
 'unstructured,similar',
 'to',
 'data',
 'mining',
 '[https://en.wikipedia.org/wiki/Data_mining]']

5. Check the length of the list:

In [109]:
len(wd_list)

34

6. Use a for loop to go over each word and check whether it is a URL. To do that, we will use
the startswith method from the string, and if it is a URL, then we push it into the stack:


In [110]:
for word in wd_list:
    if word.startswith("[https://"):
        url_stack = stack_push(url_stack, word[1:-1])
        print(word[1:-1])


https://en.wikipedia.org/wiki/Knowledge
https://en.wikipedia.org/wiki/Data
https://en.wikipedia.org/wiki/Data_mining


Notice the use of string slicing to remove the surrounding square brackets, “[” and “]”.

7. Print the value in url_stack:


In [111]:
print(url_stack)

['https://en.wikipedia.org/wiki/Knowledge', 'https://en.wikipedia.org/wiki/Data', 'https://en.wikipedia.org/wiki/Data_mining']


8. Iterate over the list and print the URLs one by one by using the stack_pop function:

In [114]:
for i in range(0, len(url_stack)):
    print(stack_pop(url_stack))

https://en.wikipedia.org/wiki/Data_mining
https://en.wikipedia.org/wiki/Data
https://en.wikipedia.org/wiki/Knowledge


9. Print it again to make sure that the stack is empty after the final for loop:

In [113]:
print(url_stack)

[]


In this exercise, we have noticed a subtle behavior in the stack_pop function. We passed the
list variable to it and used the del operator inside the function (step 1), and this changed the
original variable by deleting its last element each time we called the function. If you come from
languages such as C or C++, this may be unexpected because, in those languages, this can only
happen if we pass the variable by pointer or reference. In Python, a function receives a reference
to the same list object, so in-place changes are visible to the caller, and this can lead to subtle
bugs in Python code. So, be careful when mutating arguments inside user-defined functions.
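The difference between mutating the caller's list and working on a copy can be sketched as follows. The function names pop_in_place and pop_from_copy are hypothetical and exist only for this illustration:

```python
def pop_in_place(s):
    # Mutates the list object that the caller passed in.
    return s.pop()

def pop_from_copy(s):
    # Works on a shallow copy, so the caller's list is untouched.
    local = list(s)
    return local.pop()

urls = ["a", "b", "c"]

pop_from_copy(urls)
after_copy = list(urls)        # snapshot: still three items

pop_in_place(urls)
after_in_place = list(urls)    # snapshot: now two items

print(after_copy)
print(after_in_place)
```

Copying costs extra memory, so whether to copy or to document the mutation is a design choice; the point is only that the caller should never be surprised.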

Lambda Expressions

In general, it is not a good idea to change a variable’s value inside a function. Any variable that is
passed to the function should be considered and treated as immutable. This is close to the principles
of functional programming. For small pieces of logic in that style, we can use unnamed functions,
which are typically not stored in a variable. Such a function, called a lambda expression in Python,
is a way to construct one-line, nameless functions that are, by convention, side-effect-free and are
loosely considered as implementing functional programming.

Let’s look at the following exercise to understand how we use a lambda expression.

Exercise 2.19: Implementing a Lambda Expression

In this exercise, we will use a lambda expression to verify numerically the famous trigonometric
identity:

$$\sin^2(x) + \cos^2(x) = 1$$

Let’s go through the following steps to do this:

1. Import the math package:

In [115]:
import math

2. Define two functions, my_sine and my_cosine, using the def keyword. The reason we are
declaring these functions is the original sin and cos functions from the math package take
radians as input, but we are more familiar with degrees. So, we will use a lambda expression
to define a wrapper function for sine and cosine, then use it. This lambda function will
automatically convert our degree input to radians and then apply sin or cos on it and return
the value:

In [116]:
def my_sine():
    return lambda x: math.sin(math.radians(x))
def my_cosine():
    return lambda x: math.cos(math.radians(x))


3. Define sine and cosine for our purpose:

In [117]:
sine = my_sine()
cosine = my_cosine()
math.pow(sine(30), 2) + math.pow(cosine(30), 2)

1.0

Notice that we have assigned the return value from both my_sine and my_cosine to two variables,
and then used them directly as the functions. It is a much cleaner approach than using them
explicitly. Notice that we did not explicitly write a return statement inside the lambda function;
it is assumed.

Now, in the next exercise, we will be using lambda functions, also known as anonymous functions,
which come from lambda calculus. Lambda functions are useful for creating temporary functions
that are not named. The lambda expression will take an input and then return the first element
of that input.

Exercise 2.20: Lambda Expression for Sorting

In this exercise, we will use a lambda function as the key for the sort method. What makes this
exercise useful is that you will learn how to define your own ordering for sorting a dataset.
The syntax for a lambda function is as follows:

lambda arguments: expression

A lambda expression can take one or more inputs and returns the value of its single expression. The
sort method can also sort in reverse order when its reverse parameter is set to True. Let’s go
through the following steps:

1. Let’s store the list of tuples we want to sort in a variable called capitals:

In [118]:
capitals = [("USA", "Washington"), ("India", "Delhi"), ("France",
"Paris"), ("UK", "London")]

2. Print the output of this list:

In [119]:
capitals

[('USA', 'Washington'),
 ('India', 'Delhi'),
 ('France', 'Paris'),
 ('UK', 'London')]

3. Sort this list by the name of each country, using a simple lambda expression. The following
code uses a lambda function as the sort key. It will sort based on the first element in each
tuple:


In [120]:
capitals.sort(key=lambda item: item[0])
capitals

[('France', 'Paris'),
 ('India', 'Delhi'),
 ('UK', 'London'),
 ('USA', 'Washington')]

As we can see, lambda expressions are powerful if we master them and use them in our data
wrangling jobs. By convention, they are also side-effect-free, meaning that they do not change the
values of the variables that are passed to them in place.
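The same idea extends to other sort keys. As a sketch, the following sorts by capital (the second element of each tuple) and also shows reverse=True. It uses the built-in sorted function, which returns a new list instead of sorting in place; the names by_capital and by_capital_desc are illustrative:

```python
capitals = [("USA", "Washington"), ("India", "Delhi"),
            ("France", "Paris"), ("UK", "London")]

# Sort by the capital's name (index 1 of each tuple).
by_capital = sorted(capitals, key=lambda item: item[1])

# The same key, in descending order.
by_capital_desc = sorted(capitals, key=lambda item: item[1], reverse=True)

print(by_capital)
print(by_capital_desc)
```

Using sorted keeps the original capitals list untouched, which fits the side-effect-free style discussed above.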

We will now move on to the next section, where we will discuss membership checking for each
element. Here, membership checking means testing whether given elements are present in a
collection, which is a common way to validate that a dataset contains the values we expect.

Exercise 2.21: Multi-Element Membership Checking

In this exercise, we will create a list of words and then validate that all the elements of a
second list are present in the first list. Let’s see how:

1. Create a list_of_words list with words scraped from a text corpus:

In [121]:
list_of_words = ["Hello", "there.", "How", "are", "you", "doing?"]
list_of_words

['Hello', 'there.', 'How', 'are', 'you', 'doing?']

2. Define a check_for list, which will contain two elements that also appear in list_of_words:


In [122]:
check_for = ["How", "are"]
check_for

['How', 'are']

There is an elaborate solution, which involves a for loop and a few if/else conditions (and you
should try to write it), but there is also an elegant Pythonic solution to this problem, which takes
one line and uses the all function. The all function returns True if all elements of the iterable are
True.

3. Use the in keyword to check membership of the elements in the check_for list in
list_of_words:

In [123]:
all(b in list_of_words for b in check_for)

True

It is indeed elegant and simple to reason about, and this neat trick is very important while dealing
with lists. Basically, what we are doing is looping over the check_for list with a generator
expression and, for each of its elements, using the in keyword to search list_of_words. What makes
this elegant is how compactly we can represent this complex process. Caution should be taken when
using very complex comprehensions: the more complex you make them, the harder they are to read.
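For comparison, here is one possible version of the elaborate for-loop solution mentioned above, next to the one-liner and a set-based alternative. The helper name contains_all is hypothetical:

```python
list_of_words = ["Hello", "there.", "How", "are", "you", "doing?"]
check_for = ["How", "are"]

def contains_all(haystack, needles):
    # Explicit loop: stop at the first element that is missing.
    for needle in needles:
        if needle not in haystack:
            return False
    return True

loop_result = contains_all(list_of_words, check_for)
one_liner = all(word in list_of_words for word in check_for)
set_result = set(check_for).issubset(list_of_words)

# A word that is not in the list makes the check fail.
missing_result = contains_all(list_of_words, ["How", "absent"])

print(loop_result, one_liner, set_result, missing_result)
```

All three approaches agree; the set-based form is convenient when duplicates do not matter.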

Let’s look at the next data structure: a queue.

Queue

Apart from stacks, another high-level data structure type that we are interested in is queues. A
queue is like a stack in that you continue adding elements one by one. With a queue, however,
the reading of elements obeys the First In First Out (FIFO) strategy.

We will accomplish this first using list methods and will show you that, for this purpose, they are
inefficient. Then, we will learn about the deque data structure from the collections module of
Python. A queue is a very important data structure. We can think of a scenario in a
producer-consumer system design. When doing data wrangling, you will often come across a problem
where you must process very big files. One of the ways to deal with this problem is to split
the contents of the file into smaller chunks and then push them into a queue while creating small,
dedicated worker processes to read off the queue and process one small chunk at a time. This is a
very powerful design, and you can even use it efficiently to design huge multi-node data wrangling
pipelines.
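The chunk-and-queue idea can be sketched in a single process as follows. This is only an illustration under simplifying assumptions: the list lines stands in for the contents of a big file, the helper split_into_chunks is hypothetical, and a real pipeline would hand the chunks to separate worker processes instead of a while loop:

```python
from collections import deque

def split_into_chunks(lines, chunk_size):
    """Yield successive lists of at most chunk_size lines."""
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]

lines = [f"row {i}" for i in range(10)]   # stand-in for the lines of a big file

# Producer side: push each chunk onto the right end of the queue.
work_queue = deque()
for chunk in split_into_chunks(lines, chunk_size=4):
    work_queue.append(chunk)

# Consumer side: take chunks from the left end, so the oldest goes first (FIFO).
processed_sizes = []
while work_queue:
    chunk = work_queue.popleft()
    processed_sizes.append(len(chunk))

print(processed_sizes)
```

The chunks are processed in the order in which they were produced, which is exactly the FIFO behavior that distinguishes a queue from a stack.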

Below is a list implementation of a queue. We use pop(0) to remove the first item from the list.

Exercise 2.22: Queue Using List

In [124]:
# Python code to demonstrate Implementing
# Queue using list
queue = ["Goku", "Bezita", "Gohan"]
queue.append("Trunks")
queue.append("Goten")
print(queue)

# Removes the first item
print(queue.pop(0))

print(queue)

# Removes the first item
print(queue.pop(0))

print(queue)

['Goku', 'Bezita', 'Gohan', 'Trunks', 'Goten']
Goku
['Bezita', 'Gohan', 'Trunks', 'Goten']
Bezita
['Gohan', 'Trunks', 'Goten']


Exercise 2.23: Implementing a Queue in Python

In this exercise, we’ll implement a queue in Python. We’ll use the append function to add elements
to the queue and use the pop function to take elements out of the queue. We’ll also use the deque
data structure and compare it with the queue in order to understand the wall time required to
complete the execution of an operation. To do so, perform the following steps:

1. Create a Python queue with the plain list methods. To record the time the append operation
in the queue data structure takes, we use the %%time command:

In [140]:
%%time
queue = []
for i in range(0, 100000):
    queue.append(i)
print("Queue created")
queue

Queue created
Wall time: 9.97 ms


[0,
 1,
 2,
 ...]


2. Use the pop function to empty the queue:

In [138]:
for i in range(0, 100000):
    queue.pop(0)
print("Queue emptied")

Queue emptied


Next, re-create the queue as in step 1 and run the same code again, this time with the %%time magic
command, to see that it takes a while to finish:


In [141]:
%%time
for i in range(0, 100000):
    queue.pop(0)
print("Queue emptied")
queue

Queue emptied
Wall time: 879 ms


[]

The run shown above took 879 ms. On a modern MacBook with a quad-core processor and 8 GB of RAM,
it took around 1.20 seconds to finish, and on Windows 10 it took around 2.24 seconds. It takes this
amount of time because of the pop(0) operation: every time we pop a value from the left of the list
(the current 0 index), Python has to rearrange all the other elements of the list by shifting them
one space left. Indeed, it is not a very optimized implementation.

3. Implement the same queue using the deque data structure from Python’s collections package
and perform the append and pop functions on this data structure:

In [129]:
%%time
from collections import deque
queue2 = deque()
for i in range(0, 100000):
    queue2.append(i)
print("Queue created")
for i in range(0, 100000):
    queue2.popleft()
print("Queue emptied")

Queue created
Queue emptied
Wall time: 16 ms


With the specialized and optimized queue implementation from Python’s standard library, both
operations together take only a few tens of milliseconds (16 ms in the run shown above). This is a
huge improvement on the previous one.

We will end the discussion on data structures here. What we discussed here is just the tip of the
iceberg. Data structures are a fascinating subject. There are many other data structures that we
did not touch on and that, when used efficiently, can offer enormous added value. We strongly
encourage you to explore data structures more. Try to learn about linked lists, trees, graphs, and
all the different variations of them as much as you can; you will find there are many similarities
between them and you will benefit greatly from studying them. Not only do they offer the joy of
learning, but they are also the secret mega-weapons in the arsenal of a data practitioner that you
can bring out every time you are challenged with a difficult data wrangling job.

Exercise 2.24 : Using Deque

In [142]:
# Python code to demonstrate Implementing
# Stack using deque
from collections import deque
queue = deque(["Luffy", "Zorro", "Sanji", "Nami"])
print(queue)
queue.append("Franky")
print(queue)
queue.append("Usop")
print(queue)
print(queue.pop())
print(queue.pop())
print(queue)

deque(['Luffy', 'Zorro', 'Sanji', 'Nami'])
deque(['Luffy', 'Zorro', 'Sanji', 'Nami', 'Franky'])
deque(['Luffy', 'Zorro', 'Sanji', 'Nami', 'Franky', 'Usop'])
Usop
Franky
deque(['Luffy', 'Zorro', 'Sanji', 'Nami'])


Basic File Operations in Python

In the previous topic, we investigated a few advanced data structures and also learned neat and
useful functional programming methods to manipulate them without side effects. In this topic, we
will learn about a few OS-level functions in Python, such as working with files, but these could also
include working with printers, and even the internet. We will concentrate mainly on file-related
functions and learn how to open a file, read the data line by line or all at once, and finally, how to
cleanly close the file we opened. Closing a file should be done carefully, yet developers ignore
this step most of the time. When handling file operations, we often run into very strange and
hard-to-track-down bugs because a process opened a file and did not close it properly.

We will apply a few of the techniques we have learned about to a file that we will read to practice
our data wrangling skills further.

Exercise 2.25: File Operations

In this exercise, we will learn about the OS module of Python, and we will also look at two very
useful ways to write and read environment variables. The power of writing and reading environment
variables is often very important when designing and developing data-wrangling pipelines.

The purpose of the OS module is to give you ways to interact with OS-dependent functionalities.
In general, it is pretty low-level and most of the functions from there are not useful on a day-to-day
basis; however, some are worth learning. os.environ is the collection Python maintains with all the
present environment variables in your OS. It gives you the power to create new ones. The os.getenv
function gives you the ability to read an environment variable:

1. Import the os module.

In [143]:
import os

2. Set a few environment variables:


In [144]:
os.environ['MY_KEY'] = "MY_VAL"
os.getenv('MY_KEY')

'MY_VAL'

3. Print the environment variable when it is not set:

In [145]:
print(os.getenv('MY_KEY_NOT_SET'))

None


4. Print the os environment:

In [146]:
print(os.environ)

environ({'ALLUSERSPROFILE': 'C:\\ProgramData', 'APPDATA': 'C:\\Users\\user\\AppData\\Roaming', 'COMMONPROGRAMFILES': 'C:\\Program Files\\Common Files', ...})

After executing the preceding code, you will be able to see that you have successfully printed
the value of MY_KEY, and when you tried to print MY_KEY_NOT_SET, it printed None.
Therefore, utilizing the OS module, you will be able to set the value of environment variables in
your system.
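Two related patterns are often useful in pipelines: supplying a default for a variable that may be missing, and removing a variable again. The following sketch uses only standard os calls; the string "default-value" is an arbitrary placeholder chosen for this illustration:

```python
import os

os.environ["MY_KEY"] = "MY_VAL"

value = os.getenv("MY_KEY")
# The second argument is returned when the variable is not set.
fallback = os.getenv("MY_KEY_NOT_SET", "default-value")
# os.environ behaves like a dictionary, so get() works as well.
via_mapping = os.environ.get("MY_KEY_NOT_SET")

# Deleting the key removes the variable for this process.
del os.environ["MY_KEY"]
after_delete = os.getenv("MY_KEY")

print(value, fallback, via_mapping, after_delete)
```

Changes made through os.environ affect the current Python process and any child processes it starts; they do not alter the settings of the parent shell.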

File Handling

In this section, we will learn about how to open a file in Python. We will learn about the different
modes that we can use and what they stand for when opening a file. Python has a built-in open
function that we will use to open a file. The open function takes a few arguments as input. Among
them, the first one, which stands for the name of the file you want to open, is the only one that’s
mandatory. Everything else has a default value. When you call open, Python uses underlying
system-level calls to open a file handler and return it to the caller.

Usually, a file can be opened either for reading or writing. If we open a file in one mode, the other
operation is not supported. Whereas reading usually means we start to read from the beginning
of an existing file, writing can mean either starting a new file and writing from the beginning or
opening an existing file and appending to it.

You can open a file for reading with the command that follows. The path (highlighted) would need
to be changed based on the location of the file on your system.

In [158]:
fd = open("data_temporary_files.txt")

We will discuss some more functions in the following section.

By default, the file is opened in rt mode (read + text mode). You can also open a file in
binary mode if you want. To do so, use the rb (read + bytes) mode:


In [172]:
fd = open('AA.txt',"rb")
fd

<_io.BufferedReader name='AA.txt'>

This is how we open a file for writing:

In [160]:
fd = open("data_temporary_files.txt", "w")
fd

<_io.TextIOWrapper name='data_temporary_files.txt' mode='w' encoding='cp1252'>
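
The difference between writing from the beginning and appending, mentioned earlier in this section, can be sketched as follows. The file name modes_demo.txt and the three short lines are assumptions made for this illustration.

```python
import io
import os
import tempfile

# Throwaway file in a temporary directory; the name is arbitrary
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

fd = open(path, "w")       # "w" creates the file or truncates an existing one
fd.write("first\n")
fd.close()

fd = open(path, "w")       # opening with "w" again discards "first"
fd.write("second\n")
fd.close()

fd = open(path, "a")       # "a" keeps the content and writes at the end
fd.write("third\n")
fd.close()

fd = open(path)            # default mode: "r" (read, text)
content = fd.read()
fd.close()
print(repr(content))

# Reading from a handler that was opened only for writing is rejected
fd = open(path, "a")
try:
    fd.read()
    read_failed = False
except io.UnsupportedOperation:
    read_failed = True
finally:
    fd.close()
print(read_failed)
```

Only the lines written after the last "w" open survive, and the attempt to read in append mode fails, which matches the statement that a file opened in one mode does not support the other operation.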

Exercise 2.26: Opening and Closing a File

In this exercise, we will learn how to close a file after opening it.

We must close a file once we have finished with it. A lot of system-level bugs can occur due to a
dangling file handler, which means the operating system still treats the file as in use, even though
the application is done using it. Once we close a file, no further operations can be performed on
that file using that specific file handler.

1. Open a file in binary mode:


In [173]:
fd = open("AA.txt", "rb")

2. Close a file using close():

In [174]:
fd.close()

Python also gives us a closed flag with the file handler. If we print it before closing, then we will
see False, whereas if we print it after closing, then we will see True. If our logic checks whether a
file is properly closed or not, then this is the flag we want to use.
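
A minimal sketch of the closed flag is shown below. It assumes a throwaway file named closed_demo.txt in a temporary directory rather than the AA.txt file used in the exercise.

```python
import os
import tempfile

# Throwaway file in a temporary directory; the name is arbitrary
path = os.path.join(tempfile.mkdtemp(), "closed_demo.txt")
out = open(path, "w")
out.write("hello\n")
out.close()

fd = open(path, "rb")
before = fd.closed         # False: the handler is still open
fd.close()
after = fd.closed          # True: the handler has been closed
print(before, after)

# Any further operation on the closed handler raises ValueError
try:
    fd.read()
    error = None
except ValueError as exc:
    error = str(exc)
print(error)
```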

The ‘With’ Statement

In this section, we will learn about the with statement in Python and how we can effectively use it
in the context of opening and closing files.

The with statement is a compound statement in Python, like if and for, designed to combine
multiple lines. Like any compound statement, with also affects the execution of the code enclosed
by it. In the case of with, it is used to wrap a block of code in the scope of what we call a context
manager in Python. A context manager is a convenient way to work with resources and helps us
avoid forgetting to close the resource. A detailed discussion of context managers is out of the scope
of this exercise and this topic in general, but it is sufficient to say that the file object returned by
open is a context manager, so a close call is guaranteed to be made automatically when we wrap
the open call inside a with statement.
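
Although the details are out of scope, a rough sketch can show what a context manager does. The class ManagedResource below is hypothetical and exists only for this illustration; it records the order in which with calls its two special methods, __enter__ and __exit__.

```python
class ManagedResource:
    """Hypothetical resource used only to show the protocol."""

    def __init__(self):
        self.closed = False
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self              # this is the value bound by "as"

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append("exit")
        self.closed = True       # clean-up happens here
        return False             # do not suppress exceptions


with ManagedResource() as res:
    res.events.append("body")
    inside = res.closed          # still open inside the block

print(inside, res.closed)
print(res.events)
```

File objects follow the same pattern: their clean-up step closes the file when the with block ends.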

Opening a File Using the with Statement

Open a file using the with statement:

In [175]:
with open("AA.txt") as fd:
    print(fd.closed)
print(fd.closed)

False
True


If we execute the preceding code, we will see that the first print will end up printing False, whereas
the second one will print True. This means that as soon as the control goes out of the with block,
the file descriptor is automatically closed.
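
The same guarantee holds when the block is left because of an error. The following sketch assumes a throwaway file named with_demo.txt and a deliberately raised RuntimeError; both are made up for this illustration.

```python
import os
import tempfile

# Throwaway file in a temporary directory; the name is arbitrary
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")
with open(path, "w") as fd:
    fd.write("some text\n")

try:
    with open(path) as fd:
        # Simulate a failure in the middle of processing the file
        raise RuntimeError("simulated failure inside the block")
except RuntimeError:
    pass

closed_after_error = fd.closed   # the file was closed despite the exception
print(closed_after_error)
```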

Exercise 2.27: Reading a File Line by Line

In this exercise, we’ll read a file line by line. Let’s go through the following steps to do so:

1. Open a file and then read the file line by line and print it as we read it:

In [176]:
with open("Alice`s Adventures in Wonderland, "\
"by Lewis Carroll", encoding="utf8") as fd:
    for line in fd:
        print(line)


FileNotFoundError: [Errno 2] No such file or directory: 'Alice`s Adventures in Wonderland, by Lewis Carroll'

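Looking at the preceding code, we can see why this pattern is important. (The FileNotFoundError
in the output above only means that no file with that name was found in the working directory.)
Iterating over the file object fetches one line at a time, so with this short snippet of code you can
even open and read files that are many gigabytes in size, line by line, without flooding or
overrunning the system memory. There is another explicit method in the file descriptor object,
called readline, which reads one line at a time from a file.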

2. Duplicate the same for loop, just after the first one:

In [177]:
with open("Alice`s Adventures in Wonderland, "\
"by Lewis Carroll", encoding="utf8") as fd:
    for line in fd:
        print(line)
    print("Ended first loop")
    for line in fd:
        print(line)

FileNotFoundError: [Errno 2] No such file or directory: 'Alice`s Adventures in Wonderland, by Lewis Carroll'
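
Because the book file is not available here, the sketch below uses a small throwaway file (lines_demo.txt, an assumed name) to illustrate two points from this exercise: a second loop over the same open handler finds nothing left to read, and readline fetches exactly one line. The seek(0) call, which moves the handler back to the start of the file, is not part of the exercise and is added here only to make the file readable again.

```python
import os
import tempfile

# Throwaway file in a temporary directory; the name is arbitrary
path = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")
with open(path, "w", encoding="utf8") as fd:
    fd.write("line one\nline two\n")

with open(path, encoding="utf8") as fd:
    first_pass = [line.rstrip("\n") for line in fd]
    second_pass = [line for line in fd]        # already at end of file: nothing left
    fd.seek(0)                                 # rewind to the beginning
    first_line = fd.readline()                 # explicit single-line read
    rest = [line.rstrip("\n") for line in fd]  # continues after that line

print(first_pass)
print(second_pass)
print(repr(first_line), rest)
```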

Exercise 2.28: Writing to a File

In this exercise, we’ll look into file operations by taking the entries of a dictionary and writing
them to a file. We will write a few lines to a file and then read the file back. Let’s go through the
following steps:

1. Use the write function from the file descriptor object:


In [166]:
data_dict = {"India": "Delhi", "France": "Paris",
             "UK": "London", "USA": "Washington"}
with open("data_temporary_files.txt", "w") as fd:
    for country, capital in data_dict.items():
        fd.write("The capital of {} is {}\n"
                 .format(country, capital))

2. Read the file using the following command:

In [167]:
with open("data_temporary_files.txt", "r") as fd:
    for line in fd:
        print(line)

The capital of India is Delhi

The capital of France is Paris

The capital of UK is London

The capital of USA is Washington



3. Use the print function to write to a file using the following command:

In [168]:
data_dict_2 = {"China": "Beijing", "Japan": "Tokyo"}
with open("data_temporary_files.txt", "a") as fd:
    for country, capital in data_dict_2.items():
        print("The capital of {} is {}"
              .format(country, capital), file=fd)

4. Read the file using the following command:

In [169]:
with open("data_temporary_files.txt", "r") as fd:
    for line in fd:
        print(line)

The capital of India is Delhi

The capital of France is Paris

The capital of UK is London

The capital of USA is Washington

The capital of China is Beijing

The capital of Japan is Tokyo
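
The blank lines in the outputs above appear because every line read from the file still ends with a newline character, and print adds one more. The sketch below, which assumes a throwaway file named capitals_demo.txt, shows the difference between write and print when writing, and strips the trailing newline when reading.

```python
import os
import tempfile

# Throwaway file in a temporary directory; the name is arbitrary
path = os.path.join(tempfile.mkdtemp(), "capitals_demo.txt")

with open(path, "w") as fd:
    fd.write("The capital of India is Delhi\n")         # write needs an explicit "\n"
    print("The capital of China is Beijing", file=fd)   # print adds the "\n" itself

with open(path) as fd:
    # Remove the trailing newline so that print does not double it
    lines = [line.rstrip("\n") for line in fd]

for line in lines:
    print(line)
```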

