# Week 4 Data Munging - Data Types, Strings and Text Data 

Week 4 reading: **Pandas for Everyone** chapters 7,8 (pages 146 - 170)

Outline:

* Chapter 7 - Data Types
    1. Data types
        * How computers use binary and what does 64-bit mean?
        * 16, 32, 64 bit? Why?
    2. Numbers vs. characters
* Chapter 8 - Strings and Text Data
    1. Regular expressions
    
## Overview

Python, and particularly the NumPy library, recognizes and can work with a wide variety of data types, but in the end, it all really comes down to 3 types: **integer, floating point, and string**.

Pandas, utilizing NumPy arrays for storage, sees data as 64-bit integers (int64), 64-bit floating point (float64), or string (object). **Pandas for Everyone** does a great job of discussing how to convert from one type to another (chapter 7) as well as a more in-depth look at string slicing, functions that operate on strings, formatting, and pattern matching using "regular expressions." 

As with last week, this week's From The Expert will mainly serve to highlight important concepts from the book. Students are encouraged to read the book and follow along with the examples.

## 1. Data Types

### How computers use binary and what does 64-bit mean?

Computers use binary (`0 and 1` or `off and on`) at the lowest level to store everything -- integers, floating point numbers, strings, etc. This is done primarily because transistors (the building blocks of electronic circuits) function as on/off switches. RAM is essentially just large square matrices of transistors, each one in an *off* or *on* state, plus some logic circuits that can read those states and report on them.

Data moving from RAM to CPU or hard drive to RAM is represented as electrical voltages on a wire. 0 volts means "off" and 5 volts (for example) represents "on". The voltage on the wire (data line) will stay high or low for a prescribed amount of time to distinguish two 0s or 1s from just a very long pulse.

Now, let's say we have a sequence of 4 bits -- 0110 -- that we need to transmit from memory to the CPU. Given that each high or low pulse must be maintained for some length of time (to make the math easy we will use 1 second), the data transmission would look like this:

0 (1 sec.)<br>
-<br>
1 (1 sec.)<br>
-<br>
1 (1 sec.)<br>
-<br>
0 (1 sec.)<br>

This is called "serial transmission" and it should be fairly apparent that 4 bits would take 4 seconds. But, what would happen if we could have four of those wires, each one carrying a bit?

0 | 1 | 1 | 0 (1 sec.)

Four bits would only take 1 second, and then another four bits could be sent the next second, thus quadrupling the amount of data that can be moved in one time unit. If this represented 4 data lines from RAM to the CPU, it would be a **4-bit computer**.

### 16, 32, 64 bit? Why?

The computer **does not** know what an 'A' is. Similarly, it does not know what 'か' is or '食' or any other character in any human language. It knows '0' and '1' and somehow humans must find a clever way to make a bunch of 0s and 1s into an 'A'.

Originally, this clever way was called ASCII (American Standard Code for Information Interchange) which used 7 bits to encode the English alphabet (upper and lower case) plus numbers and some symbols. 

**CAUTION: MATH AHEAD**

Now, since $2^7=128$ we could represent 128 characters with ASCII on our 8-bit computers. 

Letter|ASCII|Binary
------|-----|------
A | 65 | 01000001
B | 66 | 01000010
C | 67 | 01000011
D | 68 | 01000100
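
The table above can be reproduced with Python's built-in `ord()` and `format()` functions. This is only a small sketch; the helper name `to_binary` is made up for illustration.

```python
def to_binary(ch, bits=8):
    """Return the character's numeric code as a zero-padded binary string."""
    return format(ord(ch), f'0{bits}b')

for letter in 'ABCD':
    print(letter, ord(letter), to_binary(letter))
# A 65 01000001
# B 66 01000010
# C 67 01000011
# D 68 01000100
```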

Going back to our hypothetical computer that takes 1 second per transmission, sending 'ABCD' would look like this:

01000001 (1 sec.)<br>
-<br>
01000010 (1 sec.)<br>
-<br>
01000011 (1 sec.)<br>
-<br>
01000100 (1 sec.)<br>

But, what if we had a 16-bit computer?

01000001 | 01000010 (1 sec.)<br>
-<br>
01000011 | 01000100 (1 sec.)<br>

Or 32-bits?

01000001 | 01000010 | 01000011 | 01000100 (1 sec.)<br>
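
The grouping shown above can be sketched in a few lines of Python. The function name `transfers` is hypothetical, and the sketch assumes plain ASCII text and a bus width that is a multiple of 8.

```python
def transfers(text, bus_bits):
    """Group the 8-bit codes of `text` into chunks that fit `bus_bits` wires."""
    codes = [format(ord(ch), '08b') for ch in text]
    per_transfer = bus_bits // 8          # characters that fit in one transfer
    return [' | '.join(codes[i:i + per_transfer])
            for i in range(0, len(codes), per_transfer)]

for width in (8, 16, 32):
    steps = transfers('ABCD', width)
    print(f'{width}-bit: {len(steps)} transfer(s)')
# 8-bit: 4 transfer(s)
# 16-bit: 2 transfer(s)
# 32-bit: 1 transfer(s)
```

At our imaginary rate of one transfer per second, the number of transfers is also the number of seconds needed.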

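Another benefit is the size of the numbers that can be represented.

We count the distinct values a chunk can hold by taking the 2 possible states (on or off) and raising that to the power of the number of data bits (8 bits):

$$2^8=256$$

Because counting starts at 0, the largest unsigned integer that fits in 8 bits is $2^8-1=255$.

If we increase our data lines to the next power of 2, which is 16, the number of distinct values becomes:

$$2^{16}=65536$$

so the largest unsigned integer is 65535.

For many years, Microsoft Windows was primarily a 32-bit operating system, meaning the number of distinct values it could handle as one chunk was:

$$2^{32}=4294967296$$

Now, Windows is 64-bit, and Pandas natively uses 64 bits for integers. A 64-bit chunk can hold this many distinct values:

$$2^{64}=18446744073709551616$$

An int64 is signed, however, so one bit is spent on the sign and the values are split between negative and non-negative numbers. The largest integer value we can put in a Pandas int64 column is therefore:

$$2^{63}-1=9223372036854775807$$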
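
These sizes are easy to check in plain Python, whose own integers are not limited to a fixed width. The sketch below counts the values an n-bit chunk can hold, then gives the largest unsigned integer and the largest signed integer (one bit reserved for the sign) that fit. The function names are hypothetical.

```python
def value_count(bits):
    """Number of distinct bit patterns in a chunk of `bits` bits."""
    return 2 ** bits

def max_unsigned(bits):
    """Largest integer when every pattern is a non-negative number."""
    return 2 ** bits - 1

def max_signed(bits):
    """Largest integer when one bit is reserved for the sign."""
    return 2 ** (bits - 1) - 1

for bits in (8, 16, 32, 64):
    print(bits, value_count(bits), max_unsigned(bits), max_signed(bits))
# 8 256 255 127
# 16 65536 65535 32767
# 32 4294967296 4294967295 2147483647
# 64 18446744073709551616 18446744073709551615 9223372036854775807
```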

## 2. Numbers vs. characters

A point of confusion for many students is the difference between a number and a text character of a number. To illustrate:

In [9]:
num = 4
char = '4'

Notice the single quotes around the value in char. Let's see what happens if we try to do some math:

In [10]:
num + num

8

In [11]:
char + char

'44'

In [12]:
num + char

TypeError: unsupported operand type(s) for +: 'int' and 'str'

Why did char + char print out 44? That is obviously the wrong answer.

It happened because '4' is a string, and + for strings joins them together.
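
If the goal is to mix the two, one of them has to be converted first. A minimal sketch using the built-in `int()` and `str()` functions:

```python
num = 4
char = '4'

print(num + int(char))   # 8: int() turns the text '4' into the number 4
print(str(num) + char)   # 44: str() turns the number 4 into the text '4'
```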

<hr> 

Quick quiz:

Which one of these is really a number?

    A. Social Security Number
    B. Phone number
    C. Employee ID
    D. Yearly salary
    E. All of the above

<hr>

The ONLY right answer is D. 

D? Why D? 

Because D is the only one that you might want to do math with. Think about it a second -- does your SSN + my SSN have any meaning? Does your phone number have a square root? What would you use it for? And does 2 * employee_id mean s/he does twice the work or gets paid twice as much?

Let's revisit our employee dataframe from Week 2:

In [14]:
import pandas as pd
emp_df = pd.DataFrame({'emp_id':['1','2','3','4'], 'emp_name':['Tom', 'Mary','John', 'Tim'], 'dept_id':['1','2', '3', '1'] })

In [15]:
emp_df.head()

index|emp_id|emp_name|dept_id
-----|------|--------|-------
0 | 1 | Tom | 1
1 | 2 | Mary | 2
2 | 3 | John | 3
3 | 4 | Tim | 1


`emp_id` and `dept_id` both look like numbers, correct?

In [16]:
emp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
emp_id      4 non-null object
emp_name    4 non-null object
dept_id     4 non-null object
dtypes: object(3)
memory usage: 176.0+ bytes


Pandas thinks they are strings. Does it matter? Not in the short term, but many machine learning algorithms cannot work with strings directly.

One final word on the topic. `dept_id` looks like an integer and is classified as a string, but in reality it isn't either one. What is it? A category. You wouldn't do math with the number itself, but you may want to use it to group employees.
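
As a sketch of what that could look like, the example below rebuilds the small employee dataframe and converts `dept_id` to the pandas `category` dtype with `astype()`. The variable names `headcount` and `codes` are chosen for illustration only.

```python
import pandas as pd

emp_df = pd.DataFrame({
    'emp_id': ['1', '2', '3', '4'],
    'emp_name': ['Tom', 'Mary', 'John', 'Tim'],
    'dept_id': ['1', '2', '3', '1'],
})

# Treat dept_id as a label to group by, not a quantity to do math on.
emp_df['dept_id'] = emp_df['dept_id'].astype('category')
print(emp_df['dept_id'].dtype)    # category

# Use the category to count employees per department.
headcount = emp_df['dept_id'].value_counts().to_dict()
print(headcount['1'])             # 2 (Tom and Tim)

# Each category also has an integer code, for tools that cannot take strings.
codes = emp_df['dept_id'].cat.codes.tolist()
print(codes)                      # [0, 1, 2, 0]
```

The integer codes are only labels for the categories; they carry no more arithmetic meaning than the original `dept_id` values did.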

**LOOK** at your data. **THINK** about what each variable *really means*. Does a dept_id of 3 mean there are three of them? Or maybe the employee has worked up from 1? Of course not. 

# Chapter 8 - Strings and Text Data

Hopefully, most of chapter 8 is review and easily understood. Once again, students are encouraged to work through the examples in the book on their own to make sure they understand string manipulation. 

The section that may cause confusion is regular expressions, so that is where we will focus our attention.

## 1. Regular expressions

There is a programmer's joke that goes
> A programmer is trying to solve a problem and thinks to himself, "I'll use regular expressions!" <br>
> Now he has **two** problems.

Regular expressions come from the study of *formal languages* in theoretical computer science in the 1950s. Regex was built into the Unix operating system in the late 1960s as part of a text editor and has since come into common use by Unix/Linux system administrators as well as many programmers. Most, if not all, programming languages have some form of regular expression pattern matching built in or available as an add-on library. The Perl programming language makes particularly heavy use of regex.

Regex is a particularly fast pattern matching engine, often used to find particular words or phrases in very large log files. It is also used to verify input patterns, for example, ensuring that telephone numbers or Social Security Numbers are entered in the proper format or that email addresses have an '@' symbol and a '.' in the latter section.

*reference: https://en.wikipedia.org/wiki/Regular_expression*

Regular expressions work on strings. The simplest form will match the pattern you specify exactly. For example, if your regular expression is `Python`, it will **only** match `Python`. It will skip both `python` and `PYTHON`. Regex can be told to ignore case (in Python, by passing the `re.IGNORECASE` flag).

Table 8.5 on Page 165 of the book shows **Basic RegEx Syntax**. Many sources will call these **metacharacters**, meaning characters used for control (meta - analyze at a higher level, thus meta-characters are characters used to analyze characters).

When working with regex we need to import the `re` library and the function we often use is `search()`.

In [17]:
import re

In order to use a regex, we need two things:

1. A pattern to match.
2. A string to search for the pattern in.

Let's start with the most basic case: finding a literal string inside another string. Note the use of a **raw string** for our pattern -- the `r` prefix tells Python to take the string as it is and not to interpret backslash sequences such as `\n` before the regex engine sees them. We will also use the match object's `group()` method to show the string matched.

In [18]:
pattern = r'Python'
strng = 'I love Python programming!'

re.search(pattern, strng).group()

'Python'
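
The match above is case-sensitive. As a small sketch of the ignore-case option, the standard `re.IGNORECASE` flag can be passed as a third argument to `search()`. The sample sentence is made up for illustration.

```python
import re

strng = 'I love PYTHON programming!'

print(re.search(r'Python', strng))                         # None: case differs
print(re.search(r'Python', strng, re.IGNORECASE).group())  # PYTHON
```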

Not very exciting, I know. What would happen if the search didn't find the pattern?

In [20]:
re.search(pattern, 'Functional programming rocks, too!').group()

AttributeError: 'NoneType' object has no attribute 'group'
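
The error occurs because `search()` returns `None` when nothing matches, and `None` has no `group()` method. One way to guard against it, sketched below, is to test the result before using it. The variable name `found` is arbitrary.

```python
import re

pattern = r'Python'

for text in ('I love Python programming!', 'Functional programming rocks, too!'):
    found = re.search(pattern, text)
    if found is None:
        print('no match')
    else:
        print(found.group())
# Python
# no match
```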

`.` is used to match any single character, like so:

In [21]:
pattern = r'P.th.n'

re.search(pattern, strng).group()

'Python'
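
Because `.` matches any character, a pattern that needs a literal period has to escape it as `\.`, which is one place where raw strings earn their keep. A minimal sketch with made-up sample strings:

```python
import re

print(re.search(r'3.14', 'version 3514').group())   # 3514: '.' matched the '5'
print(re.search(r'3\.14', 'version 3514'))          # None: '\.' needs a real dot
print(re.search(r'3\.14', 'pi is 3.14').group())    # 3.14
```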

The book demonstrates the `match()` function whereas we used `search()` above. What is the difference? `match()` only checks the beginning of the string but `search()` checks the entire string. For contrast:

In [22]:
re.match(pattern, strng).group()

AttributeError: 'NoneType' object has no attribute 'group'
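
`match()` failed here because our string begins with `I love`, not with the pattern. The sketch below puts the two functions side by side, using short sample strings invented for the comparison.

```python
import re

pattern = r'P.th.n'

print(re.match(pattern, 'Python is fun').group())    # Python: pattern is at the start
print(re.match(pattern, 'I love Python'))            # None: the string starts with 'I'
print(re.search(pattern, 'I love Python').group())   # Python: search scans the whole string
```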

Finally, let's take a look at a common usage for regex -- checking if user input matches a particular pattern.

Imagine for a moment that we have a web page that is asking users to input their email address and we want to verify that the format of their response is correct. 

In [28]:
good_user_email = 'noname@example.com' # This is what we got from the user
email_pattern = r'[\w\.-]+@[\w\.-]+'   # Check regex syntax in book for info

valid_input = re.search(email_pattern, good_user_email)

if (bool(valid_input)):    # bool() turns the search return into either a True or False
    print('Email address is properly formatted.')
else:
    print('Email address is improperly formatted.')

Email address is properly formatted.


Obviously we could have done more significant processing than just printing a message.

Let's try a bad one:

In [30]:
bad_user_email = 'noname#example.com'    # Maybe they just fat-fingered the @
email_pattern = r'[\w\.-]+@[\w\.-]+'  

valid_input = re.search(email_pattern, bad_user_email)

if (bool(valid_input)):    
    print('Email address is properly formatted.')
else:
    print('Email address is improperly formatted.')

Email address is improperly formatted.


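The pattern above does only the most basic check: one or more word characters, dots, or hyphens on each side of an '@'. Because `search()` accepts a match anywhere in the string, it will not reject extra text around the address or a domain with no '.', but it is usually sufficient for catching simple typos.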
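
If a somewhat stricter check is wanted, one option is `re.fullmatch()`, which requires the whole string to fit the pattern. The sketch below is an illustration only: the pattern `STRICTER_EMAIL` and the helper `looks_like_email` are assumptions made for this example, not the book's method, and they are still far from a complete email validator.

```python
import re

# Assumed pattern for illustration: name, '@', domain, '.', ending of 2+ letters.
STRICTER_EMAIL = r'[\w.-]+@[\w-]+(\.[\w-]+)*\.[A-Za-z]{2,}'

def looks_like_email(text):
    """Return True only if the entire string fits the assumed pattern."""
    return re.fullmatch(STRICTER_EMAIL, text) is not None

basic_pattern = r'[\w\.-]+@[\w\.-]+'
for candidate in ('noname@example.com', 'noname@example', 'see noname@example.com now'):
    basic = bool(re.search(basic_pattern, candidate))
    print(candidate, basic, looks_like_email(candidate))
# noname@example.com True True
# noname@example True False
# see noname@example.com now True False
```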

Just for fun, here is the General Email Regex as discussed in the RFC 5322 Official Standard.

<code>
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?: (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
</code><br>
<br>
BTW - you may remember that these "From The Expert" pages were generated from Jupyter notebooks, which use regular expressions to parse the markdown (text) cells. That makes it incredibly difficult to put a complex regex like the one above in a cell and have it formatted correctly!
<br><br>
Reference: https://emailregex.com/