# <center>Introduction to Data</center>

<center>Dr. W.J.B. Mattingly</center>

<center>Smithsonian Data Science Lab and United States Holocaust Memorial Museum</center>

<center>January 2022</center>

## Covered in this Chapter

1) What is Data?<br>
2) Types of Python Data<br>
3) Strings<br>
4) Numbers (Integers and Floats)<br>
5) Booleans<br>
6) Manipulating Strings<br>
7) Mathematical Operations<br>

## Some Quick Notes about Terminology and Commands we will Use

### Objects

When we import or create data within Python, we are essentially creating an object, or a variable. These two words mean slightly different things, but are often used interchangeably. We will get into the differences later in this textbook, but for now, view an object as something that is created by a Python script. An object is stored in your computer's memory so that it can be used later in a program. Think of your computer's memory rather like your own brain. Imagine if you needed to remember the word for "hello" in German. You may use your memory rather like a flashcard, where "hello" in English equates to "hallo" in German. In Python, we create objects in a similar way so that our computer understands what that object name corresponds to.

Objects can be created by typing a unique word, followed by an = sign, followed by the specific data. As we will learn throughout this chapter, there are many types of data, each of which is created differently. Let's create our first object before we begin. This will be a string, or a piece of text. (We will learn about these in more detail below.) In my case, I want to create the object author. I want author to be associated with my name in memory. In the cell, or block of code, below, let's do this.

In [1]:
author = "William Mattingly"

### The Print Function

Excellent! We have created our first object. Now, it is time to use that object. Below, we will learn about ways we can manipulate strings, but for now, let's simply see if that object exists in memory. We can do this with the print function.

The print function will become your best friend in Python. It is, perhaps, the function I use most commonly. The reason for this is that the print function allows you to easily debug your code, that is, to identify problems and fix them. It allows us to print off objects that are stored in memory.

To use the print function, we type the word print followed by an open parenthesis. After the open parenthesis, we place the object or piece of data that we want to print. After that, we close the function with a close parenthesis. Let's try to print off our new object author to make sure it is in memory.

In [2]:
print (author)

William Mattingly


Notice that when I execute the cell above, I see an output that relates to the object we created above. What would happen if I tried to print off that object, but I used a capital letter, rather than a lowercase one at the beginning, so Author, rather than author?

### Case Sensitivity

In [3]:
print (Author)

NameError: name 'Author' is not defined

The scary looking block of text above indicates that we have produced an error in Python. This mistake teaches us two things. First, Python is case sensitive. This means that any object name (or string) must be matched not only in its letters, but also in the case of those letters. Second, this mistake teaches us that we can only call objects that have been created and stored in memory.

## What is Data?

In Python there are seven key pieces of data and data structures with which we will be working: strings, numbers (integers and floats), booleans, lists, tuples, and dictionaries. In the next two chapters, we will explore each of these.

Data are pieces of information (the singular is datum), e.g., integers, floats, and strings. Data structures are objects that make data relational, e.g., lists, tuples, and dictionaries. Before you proceed to lesson three, you MUST have a basic understanding of the ways in which you create data in Python and the ways in which you make that data relational through data structures. Start to train your brain to recognize the Python syntax for the pieces of data and data structures discussed below.

## Strings

<b>Strings</b> are sequences of characters. These can be digits or they can be letters or symbols, but what makes a string distinctly different from an integer or a float is the presence of quotation marks, i.e. " " or ' '. The opening quotation mark indicates to Python that a string has begun and the closing quotation mark of the same style indicates the close of the string. It is important to use the same style of quotation mark to open and close a string, either a double or a single. In the examples below, we have two string objects, str1 and str2: the former is created with double quotation marks and the latter with single quotation marks. The third example, str3, shows the error that results when double quotation marks appear inside a string that is itself wrapped in double quotation marks. The final example, str4, uses three quotation marks in a row to create a string that spans multiple lines.

### Examples of Strings

In [1]:
#Strings - any kind of text
str1 = "This is a string."

In [2]:
print (str1)

This is a string.


In [3]:
str2 = 'This is a string too.'

In [4]:
print (str2)

This is a string too.


In [11]:
str3 = "This is a "bad" example of a string"

SyntaxError: invalid syntax (<ipython-input-11-34cf070f294b>, line 1)

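In [10]:
#Wrapping the string in single quotes lets it contain double quotes
str3 = 'This is a "bad" example of a string'
print (str3)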

This is a "bad" example of a string


In [12]:
str4 = '''
This is a verrry long string.

'''

In [13]:
print (str4)


This is a verrry long string.




## Numbers (Integers and Floats)

Numbers are represented in programming languages in several ways. The two we will deal with are integers and floats.

An <b>integer</b> is a number that does not contain a decimal point, e.g. 1 or 2 or 3. This can be a number of any size, such as 100,001,200. A float, on the other hand, is a number with a decimal point. So, while 1 is an integer, 1.0 is a float. Floats, like integers, can be large or small, but they necessarily have a decimal point, e.g. 200.0020002938. In Python, you do not need any special characters to create an integer or float object. You simply need an equal sign. In the examples below, we have two objects which are created with a single equal sign. These objects are titled int1 and float1, with the former being an object that corresponds to the integer 1 and the latter being an object that corresponds to the float 1.1.

### Examples of Numbers

In [15]:
int1 = 1

In [16]:
print (int1)

1


In [17]:
float1 = 1.1

In [18]:
print (float1)

1.1
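
As a quick check of the distinction described above, Python's built-in type function reports which kind of data an object holds. This is only a small sketch; the object names whole_number and decimal_number are made up for illustration.

```python
# Hypothetical object names, chosen only for this illustration
whole_number = 1
decimal_number = 1.0

# type() reports the kind of data stored in each object
print(type(whole_number))    # <class 'int'>
print(type(decimal_number))  # <class 'float'>
```

Even though both objects represent the same quantity, Python stores them as two different types of data because one was typed with a decimal point and the other was not.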


## Booleans

The term boolean comes from Boolean algebra, which is a type of mathematics that works in binary logic. Binary is the basis for all computers, save for the more nascent quantum computers. Binary is 0 or 1; off or on; true or false. A boolean object in programming languages is either True or False. True is 1, while False is 0. In Python we can express these concepts with capitalized T or F in True or False. Let's make one such object now.

### Examples of Booleans

In [19]:
bool1 = True

In [20]:
print (bool1)

True


In [22]:
bool2 = True

In [23]:
bool3 = False

In [24]:
print (bool3)

False


In [26]:
bool4 = False
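
The statement that True is 1 and False is 0 can be checked directly. A minimal sketch, using nothing beyond Python's built-in int function:

```python
# Converting booleans to integers shows the numbers behind them
print(int(True))   # 1
print(int(False))  # 0

# Because True behaves like 1, booleans can even be added together
print(True + True)  # 2
```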

## Working with Strings as Data

Strings are the key way in which we handle text data in Python. This means that for the digital humanist, strings will be fundamentally necessary to understand as they are the chief form of data we use.

In order to interact with strings as pieces of data, we use methods and functions. The chief functions for interacting with strings on a basic level come standard with Python. This means that you do not need to install third-party libraries. Later in this textbook we will do more advanced things with strings using other tools, such as regular expressions (Regex), but for now, we will simply work with the basic functions.

Although I will discuss functions in greater detail in a later lesson, it is important to understand what a function is. A function is a block of code stored outside (or inside) your Python script. We call a function by using the function name and adding an open and a close parenthesis. Often when we call functions, we need to pass arguments. These are the pieces of data that the function will perform the operations on. The arguments are contained within the parentheses and separated by commas. In some cases, there will be named arguments, which require you to state the name of the argument along with its value.

To see how this works, imagine an object, a_string, which holds the string "Hello,Bye", and the split function, which will split a string. First, we state the object upon which we want to perform the function. Next, we place a period, followed by the name of the specific function that we want to run on the string. Then comes an open parenthesis, a single argument, and a close parenthesis: a_string.split(","). The argument is a string (created by quotation marks) that consists of a single comma. This means that we are telling Python to split the string any time it finds a comma. Note, though, that strings are immutable: they cannot be changed in place. This means that in order to store the result of this function in memory, we need to create a new object. Here that object is new_string, which we create with an equal sign: new_string = a_string.split(","). Were we to print off new_string, we would see a list of strings that would look like this: ['Hello', 'Bye'].
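
The split example described above can be sketched as follows. The starting value "Hello,Bye" for a_string is an assumption, inferred from the output the paragraph describes.

```python
# Assumed starting value: two words separated by a single comma
a_string = "Hello,Bye"

# split returns a new list; a_string itself is left unchanged
new_string = a_string.split(",")

print(new_string)  # ['Hello', 'Bye']
print(a_string)    # Hello,Bye
```

Printing a_string a second time shows that the original string is untouched, which is what immutability means in practice.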

Let's learn to manipulate strings now through code, but first we need to create a string. Let's call it str6.

In [27]:
str6 = "This is a new string."

It is not a very clever name, but it will work for our purposes. Now, let's try to convert the entire string into all uppercase letters. We can do this with the method .upper(). Notice that .upper() comes after the string, joined to it by a period, and that within the () there are no arguments. The period after an object is a way you can easily identify a method (as opposed to a function). We will learn more about these distinctions in the chapters on functions and classes.

In [28]:
print (str6.upper())

THIS IS A NEW STRING.


Notice that our string is now all uppercase. We can do the same thing with the .lower() method, but this method will make everything in the string lowercase.

In [29]:
print (str6.lower())

this is a new string.


On the surface, these methods may appear to only be useful in niche circumstances. While these methods are useful for making strings look the way you want them to look, they have far greater utility. Imagine if you wanted to search for a name, "William", in a string. What if the data you are examining is from emails, text messages, etc.? William may be capitalized or not. This means that you would have to run at least two searches for William across a string. If, however, you lowercase the string before you search, you can simply search for "william" and you will find all hits. This is one of the things that happens on the back-end of most search engines to ensure that your search is not strictly case-sensitive. In Python, however, it is important to do this step of data cleaning yourself before running searches over strings.
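
A small sketch of this idea follows. The message text is invented for illustration, and the example uses the string method .count(), which has not been introduced yet in this chapter; it simply counts how many times a substring appears.

```python
# Hypothetical message in which the name is capitalized inconsistently
message = "WILLIAM wrote to william, and then William replied."

# A case-sensitive search only finds the exact spelling "william"
print(message.count("william"))  # 1

# Lowercasing first lets one search find every spelling
print(message.lower().count("william"))  # 3
```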

Let's explore another method, .capitalize(). This method capitalizes the first letter of a string and makes the rest of the string lowercase.

In [30]:
str7 = "william"

In [31]:
print (str7.capitalize())

William


I will use this in niche circumstances, particularly when I am performing data cleaning and need to ensure that all names or proper nouns in a dataset are cleaned and well-structured.

Perhaps the most useful string method is .replace(). Notice in the cells below, replace takes two mandatory arguments, or things passed between the parentheses. Each is separated by a comma. The first argument is the substring or piece of the string that you want to replace and the second argument is what you want to replace it with. Why is this so useful? If you are using Python to analyze texts, those texts will, I promise, never be well-cleaned. They may have bad encoding, characters that will throw off searches, bad OCR, multiple line breaks, hyphenated characters, the list goes on. Replace allows you to quickly and effectively clean textual data so that it can be standardized.

In the example below, let's try and replace the period at the end of "Mattingly."

In [2]:
str8 = "My name is William Mattingly."

In [3]:
print (str8.replace(".", ""))

My name is William Mattingly


Excellent! Now, let's try and reprint off str8 and see what happens.

In [4]:
print (str8)

My name is William Mattingly.


Uh oh! Something is not right. Nothing has changed! Indeed, this is because strings are immutable objects. In order to change a string, you must recreate it in memory or create a new string object from it. Let's try and do that below.

In [5]:
str9 = str8.replace(".", "")

In [6]:
print (str9)

My name is William Mattingly


Excellent! Now we have a new string that has been cleaned, but let's say I am only interested in grabbing what comes after the phrase "My name is". We can use replace again to remove that phrase.

In [7]:
str10 = str9.replace("My name is", "")

Everything looks good when we print it off below, but notice, there is a leading white space before the W. This is uncleaned data. We often want to remove leading or trailing whitespaces so that all data is consistent in our dataset. We can do this by using the .strip() method.

In [8]:
print (str10)

 William Mattingly


In [42]:
print (str10.strip())

William Mattingly


Now, our data is cleaned.
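
Note that printing str10.strip() does not change str10 itself, for the same reason that replace did not change str8. One way to keep the cleaned result is to store it in a new object. The sketch below does this with a hypothetical object name, clean_name, and also shows that methods can be chained one after another.

```python
str8 = "My name is William Mattingly."

# Each method returns a new string, so the calls can be chained:
# remove the period, remove the opening phrase, then strip whitespace
clean_name = str8.replace(".", "").replace("My name is", "").strip()

print(clean_name)  # William Mattingly
```

Chaining is compact, but storing each step in its own object, as the cells above do, can be easier to read and debug.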

In [43]:
print (str9)

My name is William Mattingly


Strings have a lot of other useful methods that we will be learning about throughout this textbook, such as the split() method which returns a list of substrings that are split by the delimiter, which is the argument of the method. By default, split() will split your string at the whitespace.

In [9]:
print (str9.split())

['My', 'name', 'is', 'William', 'Mattingly']


As we learn about lists in the next chapter, you will be able to use split to grab specific items from that list. Notice that in the output below, we have the same output as our method of replace and strip above. It is important to remember that in programming, there is rarely one right answer. Usually a problem can be solved many ways, but some are more efficient or easier to parse when another programmer tries to read your code.

In [10]:
print (str9.split("My name is ")[1])

William Mattingly


In [47]:
#Pythonic => the standard way to do something in Python

## Working with Numbers as Data

Now that you understand how strings work, let’s begin exploring another type of data: numbers. Numbers in Python exist in two chief forms: integers and floats. As noted in Lesson 02, integers are numbers without a decimal point, whereas floats are numbers with a decimal point. This is an important distinction that you MUST remember, especially when working with data imported from and exported to Excel.

As digital humanists, you might be thinking to yourself, “I just work with text, why should I care so much about numbers?” The answer? Numbers allow us to perform quantitative analysis. What if you want to know the number of times a specific author wrote to a colleague or to which places he wrote most frequently, as was the case with the Republic of Letters project at Stanford? In order to perform that kind of analysis, you MUST have a command of how numbers work in Python, how to perform basic mathematical functions on those numbers, and how to interact with them. Further, numbers are essential to understand for performing more advanced functions in Python, such as Loops, explored in Lesson 09.

The way in which you create a number object in Python is to create an object name, use the equal sign, and type the number. If your number has a decimal point, Python will automatically consider it a float. If it does not, it will automatically consider it an integer.

In [48]:
num1 = 1
num2 = 2

Throughout your DH project, you will very likely need to manipulate numbers through mathematical operations. Here is a list of the common operations:

1. Addition +
2. Subtraction -
3. Multiplication *
4. Exponentiation **
5. Division /
6. Modulo % #This will return the remainder, e.g. 7%2 will yield 1.
7. Floor // #This will return the number of whole times one number goes into another, e.g. 7//2 will yield 3.

In [49]:
print (num1+num2)

3


In [50]:
print (num1-num2)

-1


In [51]:
print (num1*num2)

2


In [52]:
print (num1/num2)

0.5


In [53]:
print (100*8)

800


In [54]:
print (num1/num2)

0.5


In [55]:
print (num1//num2)

0


In [56]:
print (5/2)

2.5


In [57]:
print (5//2)

2
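
The cells above do not show exponents or modulo. Here is a brief sketch of those operators, using the numbers 7 and 2, which were chosen only for illustration:

```python
# Exponent: 7 raised to the power of 2
print(7 ** 2)  # 49

# Modulo: the remainder left over after dividing 7 by 2
print(7 % 2)  # 1

# Floor division: how many whole times 2 goes into 7
print(7 // 2)  # 3

# Floor division and modulo fit together: 3 * 2 + 1 gives back 7
print((7 // 2) * 2 + (7 % 2))  # 7
```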


In addition to this, you will very often need to use comparison operators (equal to, less than, etc.), particularly in loops. Here is a list of those:
1. Equal to (==)
2. Greater than (>)
3. Less than (<)
4. Less than or equal to (<=)
5. Greater than or equal to (>=)
6. Not equal to (!=)

We will address these in greater detail in a later chapter.
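
As a brief preview, each comparison produces one of the booleans introduced earlier in this chapter. A minimal sketch, reusing the values of num1 and num2 from above:

```python
num1 = 1
num2 = 2

# Each comparison produces a boolean: True or False
print(num1 == num2)  # False
print(num1 < num2)   # True
print(num1 >= num2)  # False
print(num1 != num2)  # True
```

Note the difference between a single equal sign, which creates an object, and a double equal sign, which asks whether two things are equal.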

## Conclusion

This chapter has introduced you to some of the essential types of data: strings, integers, floats, and booleans. It has also introduced you to some of the key methods and operations that you can perform on strings and numbers. Before moving on to the next chapter, I recommend spending some time testing out these methods on your own data. Try to manipulate an input text to locate and retrieve specific information.

If you haven't yet installed Python, that's okay! There are free tools online for running Python in your browser, including the one I have available at PythonHumanities.com.