# Persnickety Python: Text Data Cleaning Workshop

## Overview of Topics Covered:

## Main Topics (This Notebook):

### Part I - Base Python

* Python strings
* Built-in string methods
* Type conversions
* Lists and Dictionaries
* List & Dictionary Comprehensions
* Functions and order of operations

### Part II - `pandas`

* `pandas` and `DataFrame` objects
* Filtering rows in pandas using `Series`
* Filtering rows in pandas using `.query()`
* Exploratory Data Analysis methods in `pandas`
* `lambda` functions and pandas `.apply()`
* Types of Character Encodings
* Parsing dates with `dateutil`
* `merge` - combining two DataFrames along a common column

## Part III - Supplementary Topics:
Your choice of the following:

## Option 1: Notebook A

### Notebook A - Ask A Manager - Salary Survey

An in-depth view of a complicated data-cleaning process. This one is a *very* long demonstration... we may not get through all of it during the workshop.

* *Messy* Data from Google Forms surveys
* ...Like, *really* messy
* CSV vs. Excel files in Pandas
* Renaming columns in a Pandas DataFrame
* Converting strings to integers
* String methods in action
* Manipulating data with `.apply()` and `lambda`
* Filtering rows in Pandas
* Using data from multiple DataFrame columns
* Data preprocessing/integration/enrichment
* `datetime` and `dateutil` (for more, see PPW2 - Doctor Who)
* Pandas `.merge()` to join datasets

## Option 2: Choose one or two of the following: B, C, D:

### Notebook B - Doctor Who - Actor Timeline

Using regular expressions (regex) to find numerical data in text

* Extracting text data from a Wikipedia table
* Pandas `.read_html()`
* Regular expressions (Python `re` module)
* Getting time series data out of text data
* Python's `datetime` module and the `dateutil` package
* Timeline visualization

### Notebook C - Goodreads - Book Ratings

How to fix a broken .csv

* Reading in a badly-formatted dataset
* Working with `bytes` objects
* Cleaning *before* loading into pandas

### Notebook D - Behavioral Risk Factor Surveillance System (BRFSS) 2014

How to extract data from PDF files

* `pdfminer.six`
* Cleaning up excess whitespace in strings
* Using a dictionary to replace values in a column

### Appendix - Odds & Ends

* Syntax updates and warnings in Pandas

# PPW Part I - Python Review & Data Cleaning Introduction

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
import dateutil

# Strings

### What are Strings?

Strings are basic units of text. They can contain any character. Python recognizes input as a string when it's enclosed in matching quotation marks.

Strings can be combined using the "+" operator. Say we have two string variables that denote the start date and end date of a process, and we want to print them out in a coherent sentence. We can do so like this:

In [3]:
start_date = '2024-05-01'
end_date = '2024-05-31'

#Notice the spaces around "to".
full_string = start_date + " to " + end_date

print(full_string)

2024-05-01 to 2024-05-31


In the example above, the strings themselves are bounded by quotation marks. The dates use single-quotes, but the `" to "` uses double-quotes. The two types of quotation marks both do the same thing, but in order for a string to be considered complete by Python, there must be a matching pair at the start and end of the string.

This is helpful when you want your string to include quotation marks. For strings with both single- and double-quotes, we have to make use of a different character, the backslash: `\`
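As a small sketch of the first point (the strings here are made up for illustration): wrapping a string in one kind of quotation mark lets the other kind appear inside it with no extra work.

```python
# Double quotes on the outside: the apostrophe inside needs no special handling.
apostrophe_str = "Maria's alarm"

# Single quotes on the outside: the double quotes inside need no special handling.
dialogue_str = 'She said "hello" and left.'

print(apostrophe_str)
print(dialogue_str)

# The choice of outer quotation mark does not change the string itself.
same_string = ('2024-05-01' == "2024-05-01")
print(same_string)
```

Neither approach works by itself when a string contains both kinds of quotation mark, which is where the backslash comes in.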

### Escape Characters

"`\`" is called an "escape character" in Python (and in markdown cells). Placing an escape character before another character in a string will cause a different behavior from the character by itself (it "escapes" its normal use). In Python strings, "\\t" represents a tab, and "\\n" represents a newline character.

In [4]:
print('line 1\n\n\n\n\n\n\tline 2 (tab-indented)')

line 1





	line 2 (tab-indented)


Also, in order for a "\\" to show up correctly in markdown cells\*, it has to have another \\ in front of it. "\`" (grave) is another special character that functions as an escape character in some contexts in markdown, but not in code cells.

\**(Double-click in this markdown cell to see how many backslashes and graves there actually are in the markdown text!)*

In [5]:
backslash_str = "The backslash (\\) allows you to use both 'single' and \"double\" quotes in the same string."
print(backslash_str)

The backslash (\) allows you to use both 'single' and "double" quotes in the same string.


### Coding Exercise: Strings and Quotation Marks

How would you turn the following into Python strings?


`When they ascended the steps to the hall, Maria's alarm was every moment increasing, and even Sir William did not look perfectly calm.`

`"Every man look out along his oars!" cried Starbuck. "Thou, Queequeg, stand up!"`

What about this one?

`"Well, sir," he said, with a suspicious sort of modesty, "I think I can; but I don't know as 'ow you'd be satisfied with the theory."`

Copy-paste these three sentences into the cell below, and try to get them to print out correctly.

In [165]:
################################################################################
################################################################################

sentence1 = "When they ascended the steps to the hall, Maria's alarm was every moment increasing, and even Sir William did not look perfectly calm."
sentence2 = '\"Every man look out along his oars!\" cried Starbuck. \"Thou, Queequeg, stand up!\"'
sentence3 = '"Well, sir," he said, with a suspicious sort of modesty, "I think I can; but I don\'t know as \'ow you\'d be satisfied with the theory."'

print(sentence1, '\n')
print(sentence2, '\n')
print(sentence3)

################################################################################
################################################################################

When they ascended the steps to the hall, Maria's alarm was every moment increasing, and even Sir William did not look perfectly calm. 

"Every man look out along his oars!" cried Starbuck. "Thou, Queequeg, stand up!" 

"Well, sir," he said, with a suspicious sort of modesty, "I think I can; but I don't know as 'ow you'd be satisfied with the theory."


## Indexing & Slicing Strings

"Containers" in Python are objects that contain other objects. There are four kinds of objects built into base Python that are explicitly meant to be used as containers: lists, tuples, dictionaries, and sets.

With the exception of sets, containers are all "subscriptable" – that is to say: objects inside a container (other than those contained in sets) can be retrieved from the container object. In the case of lists and tuples, this is done with square brackets and a positional index; in the case of dictionaries, this is done with square brackets and the name of a "key" that specifies what to retrieve. We'll cover more about lists and dictionaries in a bit, but for now, we must talk about the secret fifth type of container...

Python strings!

Strings are container objects that hold individual characters in a specific order, much like the elements of a Python list or tuple. As such, strings may be indexed in much the same way, using a positional index.

Elements of strings are accessed with square brackets with index numbers in them. A single number will yield the element at that index (important note: in Python, counting starts at 0, not at 1), while two numbers separated by a colon will yield indices from the first number up to (but not including) the second number. For example, an index of `[0:3]` returns elements at indices 0, 1, and 2, *but not 3*.

A negative number in one of these positions indicates that the index starts at the end of the string rather than the beginning.

In [7]:
start_date = '2024-05-01'

print(start_date)
print(start_date[0])
print(start_date[0:4])
print(start_date[-5:])
print(start_date[-5:-3])

2024-05-01
2
2024
05-01
05


A third number passed in the square brackets indicates a "stride" (or step): starting from the first index, only every *n*th character is returned. A negative stride steps backward through the string, so the characters come back in reverse order.

In [8]:
print(start_date[0:4:2])
print(start_date[3::-1])

22
4202


### Coding Exercise: Slicing a String

Use square brackets in the following cell to print only the word "pages".

*Hint: spaces also count towards elements in a string.*

In [167]:
################################################################################
################################################################################

sentence4 = "How many pages are there in the Learning Python book by Mark Lutz?"

print(sentence4[9:14])

################################################################################
################################################################################

pages


# String Methods

String objects in Python have built-in functions called "methods" that allow specific operations to be performed without the need to write additional code. There are over 40 string methods, each with its own specific task.

Like all methods, these are called by writing the string variable, then the dot operator, then the name of the method followed by a pair of parentheses, i.e.: `string_name.method()`.

https://www.w3schools.com/python/python_ref_string.asp

In [10]:
string_var3 = "The quick brown fox jumped over the lazy dog." 
string_var4 = "\nSphinx of black quartz, judge my vow."
string_var5 = string_var3 + string_var4

### Case manipulation
Several string methods deal with changing the case of one or more characters in the string. `.upper()` and `.lower()` are the most commonly-used of these.

In [11]:
print(string_var5)

The quick brown fox jumped over the lazy dog.
Sphinx of black quartz, judge my vow.


In [12]:
print(string_var5.upper())

THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG.
SPHINX OF BLACK QUARTZ, JUDGE MY VOW.


In [13]:
print(string_var5.lower())

the quick brown fox jumped over the lazy dog.
sphinx of black quartz, judge my vow.


### Properties
Some string methods allow you to determine specific properties of a string.

In [14]:
print(string_var5.isnumeric())

False


In [15]:
print(string_var5.isalpha())

False


In [16]:
print(string_var5.count('i'))

2
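These checks look at the whole string, so a single space or punctuation mark is enough to make `.isalpha()` or `.isnumeric()` return `False`. A minimal sketch with shorter strings (the variable names are illustrative, not part of the workshop data):

```python
word = 'Sphinx'
year_text = '2024'
date_text = '2024-05-01'

# Every character is a letter, so this is True.
print(word.isalpha())

# Every character is a digit, so this is True.
print(year_text.isnumeric())

# The hyphens are not digits, so this is False.
print(date_text.isnumeric())

# .count() reports how many times a substring appears.
print(date_text.count('-'))
```

A check like `.isnumeric()` can be a useful guard before attempting a type conversion on messy text.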


### Segmentation
Splitting strings on a particular value can be very important to data cleaning.

In [17]:
print(string_var5.split())

['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.', 'Sphinx', 'of', 'black', 'quartz,', 'judge', 'my', 'vow.']


In [18]:
print(string_var5.split(', '))

['The quick brown fox jumped over the lazy dog.\nSphinx of black quartz', 'judge my vow.']
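One common cleaning pattern builds on the no-argument form of `.split()`, which splits on any run of whitespace (spaces, tabs, newlines). Joining the pieces back together with a single space collapses the excess whitespace. This is a sketch using a made-up string, and it assumes the `.join()` string method, which appears again in the Functions section:

```python
# A hypothetical messy value with extra spaces, a tab, and a newline.
messy = '  The   quick brown\tfox \n jumped  '

# With no argument, .split() discards all runs of whitespace.
words = messy.split()

# Rejoin the words with exactly one space between them.
tidy = ' '.join(words)

print(words)
print(tidy)
```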


### Replacing Values
`.replace()` is perhaps the most important string method for data cleaning, because of its versatility and specificity.

One of the most useful tricks `.replace()` lets you do is replacing a character with an empty string, so it gets removed entirely.

In [19]:
print(string_var5.replace('brown', 'red'))

The quick red fox jumped over the lazy dog.
Sphinx of black quartz, judge my vow.


In [20]:
print(string_var5.replace(' ', ''))

Thequickbrownfoxjumpedoverthelazydog.
Sphinxofblackquartz,judgemyvow.


### Chaining String Methods

You can use multiple string methods at once by adding them one after another. They will be applied in the sequence you write them. Remember to use a `.` before each method.

In [21]:
print(string_var3.replace('fox', 'bear').upper())

THE QUICK BROWN BEAR JUMPED OVER THE LAZY DOG.


What happens when you switch the order the string methods are applied?

In [22]:
print(string_var3.upper().replace('fox', 'bear'))

THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG.


If you use `.upper()` first, replacing 'fox' with 'bear' won't do anything, because the lowercase 'fox' no longer appears in the string; you'd have to look for 'FOX' instead.

### Coding Exercise: String Methods

Let's try these out to solve a common problem: normalizing data.

Imagine you have conducted a survey with write-in options for country of residence. The United States residents who responded to the survey all filled in their answers in a different way. If you want to group them all in a single category, you'll need some way of making all their responses fit the same format.

Use some string methods to make sure that all the abbreviations print out the same. How can you order the string methods to accomplish this *most efficiently?*


In [168]:
################################################################################
################################################################################

united_states = ['USA', 'u.s.a.', 'America', 'U.S.A.', 'usa', 'U.S. of A.', 'america']

for abbrev in united_states:
    print(abbrev.lower().replace('.','').replace(' of ','').replace('america','usa'))

################################################################################
################################################################################

usa
usa
usa
usa
usa
usa
usa


# Converting between Types

Python has built-in functions that let you convert objects to different data types, provided they meet the criteria for the new data type.

`str()` lets you change an integer or a float into a string. Any integer or floating point number can be changed to a string.

`int()` and `float()` convert to integer and floating point data types, respectively. Text data can't always be formatted as an integer or float, so this only works when all the characters of the text string *could be* components of a number. The text string "1000" could become the integer 1000, but you'd get an error if you tried to convert "1,000" to an integer.

It's also important to note that converting a floating point to an integer *does not* round the number; it just removes the decimal component entirely.

In [24]:

#Floating point to integer, no decimal value
print(int(3.0))

#Floating point to integer, decimal value is truncated
print(int(3.8))

#Integer to floating point
print(float(3))

#Floating point to text string
print(str(3.8))

#Text string to integer
print(int('3'))

#Text string to floating point
print(float('3'))

#Text string with punctuation to floating point (this raises an error):
print(float("1,000"))

3
3
3.0
3.8
3
3.0


ValueError: could not convert string to float: '1,000'
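One way around the error above is to strip the comma out with `.replace()` before converting. The sketch below also contrasts `int()`, which truncates, with the built-in `round()`, which rounds; the variable names are illustrative.

```python
raw_value = "1,000"

# Remove the thousands separator, then convert the remaining digits.
cleaned_value = raw_value.replace(',', '')
number = int(cleaned_value)
print(number, type(number) == int)

# int() drops the decimal part; round() rounds to the nearest integer.
truncated = int(3.8)
rounded = round(3.8)
print(truncated, rounded)
```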

### Coding Exercise: Convert `start_date` to Integers

Using slices along with the `int()` function, use the `start_date` variable to set the values of the `year`, `month`, and `day` variables 

In [173]:
################################################################################
################################################################################

start_date = '2024-05-01'

year = int(start_date[0:4])
month = int(start_date[5:7])
day = int(start_date[8:10])

print(year, type(year) == int)
print(month, type(month) == int)
print(day, type(day) == int)

################################################################################
################################################################################

2024 True
5 True
1 True


Now, try using the `.split()` string method instead of slices.

In [175]:
################################################################################
################################################################################

start_date = '2024-05-01'

year = int(start_date.split('-')[0])
month = int(start_date.split('-')[1])
day = int(start_date.split('-')[2])

print(year, type(year) == int)
print(month, type(month) == int)
print(day, type(day) == int)

################################################################################
################################################################################

2024 True
5 True
1 True


# Lists and Dictionaries

You've probably encountered container objects if you've taken an introductory Python class or workshop before, but we'll briefly review the characteristics of lists and dictionaries here:

### List - constructed with `[ ]` or list()
An array that contains elements in a specific order. Because lists store items in a sequence, their elements can be indexed using the numerical value of their position in the list (remember, counting in Python starts at "0"). To access the first element of a list, you have to put square brackets after the name of the list, and the number 0 inside them. Lists are "mutable", which means their order and contents can be changed after they are created.

Individual items can be added at the end of a list with `.append()` and removed from the end of a list with `.pop()`.

To add the contents of a list to another list, instead of `.append()`, use `.extend()`.
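The difference between the two methods is easiest to see side by side. In this sketch (with made-up list contents), `.append()` adds the second list as a single nested element, while `.extend()` adds each of its items individually:

```python
appended = ['a', 'b']
appended.append(['c', 'd'])   # the whole list becomes one new element
print(appended)

extended = ['a', 'b']
extended.extend(['c', 'd'])   # each item becomes its own new element
print(extended)

print(len(appended), len(extended))
```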


### Dictionary - constructed with `{ }` or dict()

A collection of "`key: value`" pairs. Keys must be unique. Dictionaries resemble sets in this respect: instead of containing unique elements, they contain unique `keys` that can be used to retrieve corresponding elements called `values`, much as you can find the definition of a word in a physical dictionary by looking the word up. Dictionaries are also mutable. To access a value, you must put square brackets after the name of the dictionary, and put the name of the related key in the brackets.

Dictionaries don't use `.append()` but you can remove a key:value pair by using `.pop(key)` to specify the key.
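A minimal sketch of these operations, using a small made-up dictionary (the `'format'` entry is a hypothetical value added only so there is something to remove):

```python
book = {'title': 'Learning Python', 'author': 'Mark Lutz'}

# Retrieve a value by putting its key in square brackets.
print(book['title'])

# Add a new key:value pair by assigning to a key that doesn't exist yet.
book['format'] = 'paperback'

# .pop(key) removes the pair and returns the value that was stored.
removed_value = book.pop('format')
print(removed_value)
print(book)
```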

### Coding Exercise - Adding Elements to a List

Try adding some elements to `sample_list` by filling in the blanks.

They don't have to be the same type!

In [177]:
################################################################################
################################################################################

sample_list = ['a', 682, [1,2,3], 'b']

sample_list.append(793)

print(sample_list)

################################################################################
################################################################################

['a', 682, [1, 2, 3], 'b', 793]


### Removing an Element from a List

Adding `.pop()` by itself at the end of a list will remove the last element. Adding `.pop(0)` will remove the element at the `0` index (the first element), `pop(1)` will remove the second element, and so on.

You can also use `.pop()` to get an element out of a list and use it for something else, by setting a variable equal to the value returned by `pop()`.

In [145]:
sample_list.pop()

sample_list

['a', 682, [1, 2, 3], 'b']

In [29]:
sample_list.pop(1)

sample_list

['a', [1, 2, 3], 'b']

In [30]:
#Use the last element as a new variable
list_element_b = sample_list.pop()
list_element_b

'b'

In [31]:
sample_list

['a', [1, 2, 3]]

### Coding Exercise - Extending a List

Within `sample_list`, the third element (`sample_list[2]`) is itself a list.

Try using a combination of `.pop()` and `.extend()` to remove that element from `sample_list` and extend `sample_list` with its contents.

In [159]:
################################################################################
################################################################################

sample_list = ['a', 682, [1,2,3], 'b']

sample_list.extend(sample_list.pop(2))

sample_list

################################################################################
################################################################################

['a', 682, 'b', 1, 2, 3]

#### Looping Through a List with `for`

Lists (and other kinds of containers) are especially useful in saving time for the programmer because they can be *iterated* through. Most container objects can be iterated through, and so they are often also called "iterables".

In [32]:
#Let's reset our original values in `sample_list`
sample_list = ['a', 682, [1,2,3], 'b']

for item in sample_list:
    print(item)

a
682
[1, 2, 3]
b
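The same `for` syntax works on other iterables. As a rough sketch with illustrative values: looping over a string yields one character at a time, and looping over a dictionary yields its keys (the `.items()` method yields key and value together).

```python
# Iterating through a string visits each character in order.
seen_chars = []
for char in '2024':
    seen_chars.append(char)
print(seen_chars)

# A hypothetical dictionary for demonstration purposes.
profile = {'first_name': 'Ada', 'last_name': 'Lovelace'}

# Iterating through a dictionary visits its keys.
seen_keys = []
for key in profile:
    seen_keys.append(key)
print(seen_keys)

# .items() gives each key together with its value.
for key, value in profile.items():
    print(key, value)
```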


#### Accessing Elements of a List

In [33]:
sample_list[0]

'a'

In [34]:
sample_list[2]

[1, 2, 3]

In [35]:
sample_list[-1]

'b'

### Adding Key:Value Pairs to a Dictionary

Dictionaries work very differently from lists in terms of how they store and retrieve data. Here's an example of using dictionaries to store contact information for people here at the Library.

In [156]:
#Here's my contact info:

profile1 = {}

profile1['first_name'] = 'David'
profile1['last_name'] = 'Merten-Jones'
profile1['institution'] = 'The Claremont Colleges Library'
profile1['occupation'] = 'Data Services Specialist'
profile1['email address'] = 'david.merten-jones@claremont.edu'

print(profile1)

# Here is my supervisor's contact info:

profile2 = {}

profile2['first_name'] = 'Jeanine'
profile2['last_name'] = 'Finn'
profile2['institution'] = 'The Claremont Colleges Library'
profile2['occupation'] = 'Head of Data and Digital Scholarship Services'
profile2['email_address'] = 'jeanine.finn@claremont.edu'

print(profile2)

{'first_name': 'David', 'last_name': 'Merten-Jones', 'institution': 'The Claremont Colleges Library', 'occupation': 'Data Services Specialist', 'email address': 'david.merten-jones@claremont.edu'}
{'first_name': 'Jeanine', 'last_name': 'Finn', 'institution': 'The Claremont Colleges Library', 'occupation': 'Head of Data and Digital Scholarship Services', 'email_address': 'jeanine.finn@claremont.edu'}


#### "Nested" Dictionaries

What if we wanted to store both these contact profiles in the same Python object? 

As we saw in `sample_list`, lists can store other lists. The same is true of dictionaries.

Let's make an empty dictionary called "contacts" and add each of the profile dictionaries to it.

In [157]:
contacts = {}

contacts['contact1'] = profile1
contacts['contact2'] = profile2

contacts

{'contact1': {'first_name': 'David',
  'last_name': 'Merten-Jones',
  'institution': 'The Claremont Colleges Library',
  'occupation': 'Data Services Specialist',
  'email address': 'david.merten-jones@claremont.edu'},
 'contact2': {'first_name': 'Jeanine',
  'last_name': 'Finn',
  'institution': 'The Claremont Colleges Library',
  'occupation': 'Head of Data and Digital Scholarship Services',
  'email_address': 'jeanine.finn@claremont.edu'}}

To access a dictionary within a dictionary, you can use one set of square brackets with the `key` that stores the dictionary as a value, and a second set of square brackets with the key to retrieve the value you want.

In [38]:
contacts['contact1']['first_name']

'David'

In [147]:
contacts['contact2']['first_name']

'Jeanine'

In [39]:
contacts['contact1']['email address']

'david.merten-jones@claremont.edu'

### Coding Exercise - Adding Key:Value Pairs to a Dictionary

Try adding your info to the "contacts" dictionary.

In [149]:
################################################################################
################################################################################

profile3 = {
    # Your profile here
}

contacts

################################################################################
################################################################################

{'contact1': {'first_name': 'David',
  'last_name': 'Merten-Jones',
  'institution': 'The Claremont Colleges Library',
  'occupation': 'Data Services Specialist',
  'email address': 'david.merten-jones@claremont.edu'},
 'contact2': {'first_name': 'Jeanine',
  'last_name': 'Finn',
  'institution': 'The Claremont Colleges Library',
  'occupation': 'Head of Data and Digital Scholarship Services',
  'email_address': 'jeanine.finn@claremont.edu'}}

# Using `for` Loops & Comprehensions with Lists and Dictionaries

In Python, you can accomplish a lot using `for` loops. A `for` loop lets you iterate through a container object (or generator, but that's a topic for another day) and procedurally do a task for every element in it.

`for` loops are written in a particular syntax that requires everything to be set up ahead of time; if you want to create a new list and add elements to it by using a `for` loop, the list must exist before you can add elements to it.

When you're operating on the elements of a list (or dictionary) it is often quicker and more space-efficient to use something called a "comprehension" to perform the same actions. Comprehensions create new objects in-place, so you don't need to pre-define the space in which to store their output.

Say you want to change all the items in a list of first names to start with a capital letter. In a `for` loop, that would look something like this:


In [None]:
first_name_list = ['abhimanyu', 'beatrice', 'carlos', 'daoud', 'ekaterina', 'francesca', 'george', 'hinah']

In [None]:
capitalized_names = []

for name in first_name_list:
    capitalized_names.append(name.title())

capitalized_names

['Abhimanyu',
 'Beatrice',
 'Carlos',
 'Daoud',
 'Ekaterina',
 'Francesca',
 'George',
 'Hinah']

That does the job, but it takes up several lines of code. Let's see that same operation expressed as a comprehension:

In [42]:
capitalized_names = [name.title() for name in first_name_list]

capitalized_names

['Abhimanyu',
 'Beatrice',
 'Carlos',
 'Daoud',
 'Ekaterina',
 'Francesca',
 'George',
 'Hinah']

By surrounding the statement with square brackets, we have automatically formatted the result as a list.
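Comprehensions combine naturally with chained string methods. As a sketch, here is the country-normalization chain from the earlier string methods exercise applied to the whole list at once; wrapping the result in `set()` is just one quick way to confirm that only a single distinct value remains.

```python
united_states = ['USA', 'u.s.a.', 'America', 'U.S.A.', 'usa', 'U.S. of A.', 'america']

# Apply the same chain of string methods to every element of the list.
normalized = [
    abbrev.lower().replace('.', '').replace(' of ', '').replace('america', 'usa')
    for abbrev in united_states
]

print(normalized)

# A set keeps only unique elements, so one entry means every value now matches.
unique_values = set(normalized)
print(unique_values)
```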

It is also possible to create a dictionary using a comprehension. If you have two lists of equal length that contain associated values, you can create a dictionary from them using a comprehension, and the handy builtin `zip()` function, which combines two or more lists and splits them by element into tuples containing one element of each list. 

First, let's see how to add elements to a dictionary using a `for` loop:

In [43]:
fruit_list = ['apple', 'banana', 'cherimoya', 'durian']
qty_list = [2, 5, 2, 1]

In [44]:
#How `zip()` works:

for tup in zip(fruit_list, qty_list):
    print(tup)

('apple', 2)
('banana', 5)
('cherimoya', 2)
('durian', 1)


Using a `for` loop:

In [None]:
fruit_dict = {}

for fruit, qty in zip(fruit_list, qty_list):
    fruit_dict[fruit] = qty

fruit_dict

Using a dictionary comprehension:

In [46]:
fruit_dict = {fruit:qty for fruit, qty in zip(fruit_list, qty_list)}

fruit_dict

{'apple': 2, 'banana': 5, 'cherimoya': 2, 'durian': 1}
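The requirement that the lists be of equal length matters because `zip()` stops as soon as the shortest input runs out, without raising an error. A small sketch, where `short_qty_list` is a hypothetical list with two quantities missing:

```python
fruit_list = ['apple', 'banana', 'cherimoya', 'durian']
short_qty_list = [2, 5]   # hypothetical: two quantities are missing

# zip() pairs elements only until the shorter list is exhausted.
partial_dict = {fruit: qty for fruit, qty in zip(fruit_list, short_qty_list)}
print(partial_dict)

# Comparing lengths first is a cheap way to catch the mismatch.
lengths_match = len(fruit_list) == len(short_qty_list)
print(lengths_match)
```

Because the extra fruits are dropped silently, checking the lengths before zipping is a reasonable habit when the lists come from messy data.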

# Functions

Here are a few functions that will let us manipulate strings.

https://docs.python.org/3.5/tutorial/controlflow.html#documentation-strings

In [150]:
def separate_words(sample_string, delimiter=' '):
    words = sample_string.split(delimiter)
    return words

def add_elipses(sample_string):
    return sample_string + '...'

def join_words(sample_list, delimiter=' '):
    title = delimiter.join(sample_list)
    return title

input_string = 'Please speak more slowly'

print(separate_words(input_string))
print('\n', add_elipses(input_string))
print('\n', join_words(separate_words(input_string)))
print('\n', join_words([add_elipses(word) for word in separate_words(input_string)]))

['Please', 'speak', 'more', 'slowly']

 Please speak more slowly...

 Please speak more slowly

 Please... speak... more... slowly...
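The calls above all rely on the default `delimiter=' '`. As a sketch of what the optional parameter allows, the two functions are repeated here in condensed form (so the block runs by itself) and called with other delimiters to reformat a date string:

```python
def separate_words(sample_string, delimiter=' '):
    # Split the string wherever the delimiter appears.
    return sample_string.split(delimiter)

def join_words(sample_list, delimiter=' '):
    # Glue the list elements together with the delimiter between them.
    return delimiter.join(sample_list)

# Override the default delimiter to split on hyphens instead of spaces.
date_parts = separate_words('2024-05-01', delimiter='-')
print(date_parts)

# Rejoin the same parts with a different delimiter.
slash_date = join_words(date_parts, delimiter='/')
print(slash_date)
```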


### OPTIONAL - Coding Exercise: Functions as an Assembly Line

This one is more free-form, since we're taking a short break at the halfway point of the workshop. You can play around with this if you feel like exploring what functions do when applied in different sequences.

Try using each function (`separate_words()`, `add_elipses()`, `join_words()`) by itself, using the same input string. Combine more than one function, and change the order in which the functions are applied.

Here's an oddball to try out: 

`print(join_words(join_words(input_string)))`

Why is Python giving us this particular output?

# *Intermission*

# Part II - Pandas

![](images/pandas.jpg)

No, not that kind of pandas! The Python package `pandas` (a name derived from "panel data", a term for a particular format of data used in Economics) is built for handling tabular data in many shapes and sizes, using a special kind of data table called a `DataFrame`.

Pandas DataFrames are at the core of the vast majority of data science and data analytics in Python. Pandas provides an incredibly powerful suite of tools for working with data, with more precision and power than software like Microsoft Excel. If you are studying data science, get to know Pandas as well as you can.

![](images/pandaslogo.jpg)

Pandas's most-used class is the DataFrame. DataFrames are specialized container objects that store data in named columns. Column names must be unique, and all columns must be the same length. A DataFrame is like a spreadsheet, but you don't click in cells and type into it; you use functions and methods to manipulate the data it contains.

Though individual cells are more difficult to edit in a DataFrame than in an application like Microsoft Excel or Google Sheets, the platform lends itself well to making sweeping edits quickly. This is a huge advantage when you want to clean data or to engineer new features in a dataset.

Pandas also has integrated statistical functions, so it's easy to get summary statistics for an entire dataset.

Pandas is also capable of [reading data from and writing data to external files](https://pandas.pydata.org/pandas-docs/stable/reference/io.html) in a wide variety of formats, from `.csv` files to `.json` to `.xlsx` (Excel workbooks) and others.

Let's begin by making a very simple DataFrame to illustrate how we can work with text data in a tabular format in Pandas.

## Coding Exercise - DataFrames from Lists

In [49]:
fruit_list = ['apple', 'banana', 'cherimoya', 'durian']
qty_list = [2, 5, 2, 1]

#One way to construct a DataFrame is to use a dictionary with column names as keys, lists as values. 
fruit_df = pd.DataFrame(
    {
        'fruit':fruit_list,
        'qty':qty_list
    }
)

In [50]:
fruit_df

| | fruit | qty |
|---|---|---|
| 0 | apple | 2 |
| 1 | banana | 5 |
| 2 | cherimoya | 2 |
| 3 | durian | 1 |


Individual rows can be added to a DataFrame using `.loc[]` and setting the position to the next index after the last existing row.

In [51]:
fruit_df.loc[len(fruit_df)] = ['elderberry', 34]

In [52]:
fruit_df

| | fruit | qty |
|---|---|---|
| 0 | apple | 2 |
| 1 | banana | 5 |
| 2 | cherimoya | 2 |
| 3 | durian | 1 |
| 4 | elderberry | 34 |
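With a DataFrame in hand, the integrated statistical functions mentioned earlier are one method call away. This sketch rebuilds the five-row fruit table directly (so it can run without the earlier cells) and summarizes the `qty` column:

```python
import pandas as pd

fruit_df = pd.DataFrame(
    {
        'fruit': ['apple', 'banana', 'cherimoya', 'durian', 'elderberry'],
        'qty': [2, 5, 2, 1, 34],
    }
)

# Individual summary statistics for a single column.
qty_total = fruit_df['qty'].sum()
qty_mean = fruit_df['qty'].mean()
qty_median = fruit_df['qty'].median()
print(qty_total, qty_mean, qty_median)

# .describe() reports count, mean, standard deviation, and quartiles together.
print(fruit_df['qty'].describe())
```

The gap between the mean and the median here comes from the single large elderberry quantity, the kind of outlier that summary statistics help surface during cleaning.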


## DataFrames from Nested Dictionaries

DataFrames can be constructed in several different ways... we saw lists already, but `pandas` can also transform nested dictionaries into DataFrames. This can be a little more difficult, depending on how the dictionaries are stored.

We can do this with the `contacts` dictionary we made earlier... but in order to, we'll have to fix an error in one of the profile dictionaries inside `contacts`.

Let's try putting the dictionary into a DataFrame and see what the problem is:

In [158]:
df_contacts = pd.DataFrame(contacts)
df_contacts

| | contact1 | contact2 |
|---|---|---|
| first_name | David | Jeanine |
| last_name | Merten-Jones | Finn |
| institution | The Claremont Colleges Library | The Claremont Colleges Library |
| occupation | Data Services Specialist | Head of Data and Digital Scholarship Services |
| email address | david.merten-jones@claremont.edu | NaN |
| email_address | NaN | jeanine.finn@claremont.edu |


Oh no! The contacts dictionary contains two profile dictionaries with mismatched key names. One of them has an underscore in the key "email_address" and the other has a space. This results in extra cells in the DataFrame, each holding `NaN` ("Not a Number"), the marker `pandas` uses by default for missing values.

Let's correct the issue in place and then reconstruct the DataFrame.

As we saw earlier with lists, `.pop()` can be used to remove a key:value pair from a dictionary. We can then set a new key:value pair using the data removed using `.pop()`.

In [54]:


contacts['contact1']['email_address'] = contacts['contact1'].pop('email address')

In [55]:
contacts

{'contact1': {'first_name': 'David',
  'last_name': 'Merten-Jones',
  'institution': 'The Claremont Colleges Library',
  'occupation': 'Data Services Specialist',
  'email_address': 'david.merten-jones@claremont.edu'},
 'contact2': {'first_name': 'Jeanine',
  'last_name': 'Finn',
  'institution': 'The Claremont Colleges Library',
  'occupation': 'Head of Data and Digital Scholarship Services',
  'email_address': 'jeanine.finn@claremont.edu'}}
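If many profiles had inconsistent keys, fixing them one at a time would get tedious. The sketch below shows one possible way to normalize every key at once with a nested dictionary comprehension. The `profiles` dictionary and its example.org addresses are hypothetical stand-ins, not the workshop's `contacts` data.

```python
# Hypothetical example data: two profiles whose keys are formatted inconsistently
profiles = {
    'contact1': {'first_name': 'David', 'email address': 'david@example.org'},
    'contact2': {'first_name': 'Jeanine', 'email_address': 'jeanine@example.org'},
}

# Rebuild each inner dictionary, replacing spaces in key names with underscores
cleaned = {
    contact: {key.replace(' ', '_'): value for key, value in profile.items()}
    for contact, profile in profiles.items()
}

print(cleaned['contact1'])
```

This builds a new dictionary rather than editing the original in place, so `profiles` is left untouched if you need to compare before and after.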

In [56]:
df_contacts = pd.DataFrame(contacts)
df_contacts

Unnamed: 0,contact1,contact2
first_name,David,Jeanine
last_name,Merten-Jones,Finn
institution,The Claremont Colleges Library,The Claremont Colleges Library
occupation,Data Services Specialist,Head of Data and Digital Scholarship Services
email_address,david.merten-jones@claremont.edu,jeanine.finn@claremont.edu


The data is all there, but it's in the wrong orientation - we want rows for contacts, columns for data categories.

We can transpose the axes by using `.T` at the end of the DataFrame.

In [57]:
df_contacts = df_contacts.T
df_contacts

Unnamed: 0,first_name,last_name,institution,occupation,email_address
contact1,David,Merten-Jones,The Claremont Colleges Library,Data Services Specialist,david.merten-jones@claremont.edu
contact2,Jeanine,Finn,The Claremont Colleges Library,Head of Data and Digital Scholarship Services,jeanine.finn@claremont.edu
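As an alternative to transposing, `pandas` can build the DataFrame in the desired orientation from the start. The sketch below uses `pd.DataFrame.from_dict()` with `orient='index'`; `contacts_demo` is a shortened, hypothetical stand-in for the `contacts` dictionary.

```python
import pandas as pd

# Small stand-in for the `contacts` dictionary (fields shortened for illustration)
contacts_demo = {
    'contact1': {'first_name': 'David', 'last_name': 'Merten-Jones'},
    'contact2': {'first_name': 'Jeanine', 'last_name': 'Finn'},
}

# orient='index' treats each outer key as a row label instead of a column name
df_demo = pd.DataFrame.from_dict(contacts_demo, orient='index')
print(df_demo)
```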


## Making a New Column in a DataFrame

The syntax for generating a new column in a `pandas` DataFrame is very similar to creating a new key:value pair in a dictionary. You can use the contents of other columns in the DataFrame to generate a new column. In this case, if we want to combine first and last names into one column, we can use the "+" operator (and throw in an extra " " so the words aren't right next to each other).

In [58]:
df_contacts['full_name'] = df_contacts['first_name'] + ' ' + df_contacts['last_name']

df_contacts

Unnamed: 0,first_name,last_name,institution,occupation,email_address,full_name
contact1,David,Merten-Jones,The Claremont Colleges Library,Data Services Specialist,david.merten-jones@claremont.edu,David Merten-Jones
contact2,Jeanine,Finn,The Claremont Colleges Library,Head of Data and Digital Scholarship Services,jeanine.finn@claremont.edu,Jeanine Finn


## Coding Exercise - DataFrames from Lists

In [59]:
################################################################################
################################################################################

# Below are four lists, each of which has ten elements. The first list
# contains the names of the 10 most populous cities in California. The second
# list contains their populations. The third list contains their respective
# counties. The fourth list contains their land areas in square miles.

# We have used `pd.DataFrame()` to construct a DataFrame, and added a column
# for "city"

# Population estimates (2024) from: https://en.wikipedia.org/wiki/List_of_largest_cities_in_California_by_population
# Area (sq miles) from: https://en.wikipedia.org/wiki/List_of_municipalities_in_California

cities = [
    'Los Angeles', 'San Diego', 'San Jose', 'San Francisco', 'Fresno',
    'Sacramento', 'Long Beach', 'Oakland', 'Bakersfield', 'Anaheim'
]

population_est_2024 = [
    3878704, 1404452, 997368, 827526, 550105,
    535798, 450901, 443554, 417468, 344561
]

counties = [
    'Los Angeles County', 'San Diego County', 'Santa Clara County',
    'San Francisco County', 'Fresno County', 'Sacramento County',
    'Los Angeles County', 'Alameda County', 'Kern County', 'Orange County']

area_sq_miles = [
    469.49, 325.88, 178.26, 46.91, 115.18,
    98.61, 50.71, 55.93, 149.78, 50.27
]

df_cal = pd.DataFrame()

df_cal['city'] = cities

# Fill in the blanks to make DataFrame columns for the remaining three lists:

df_cal['pop'] = population_est_2024
df_cal['county'] = counties
df_cal['area_sq_miles'] = area_sq_miles

# Run this cell by pressing `shift-enter`

################################################################################
################################################################################

In [60]:
df_cal

Unnamed: 0,city,pop,county,area_sq_miles
0,Los Angeles,3878704,Los Angeles County,469.49
1,San Diego,1404452,San Diego County,325.88
2,San Jose,997368,Santa Clara County,178.26
3,San Francisco,827526,San Francisco County,46.91
4,Fresno,550105,Fresno County,115.18
5,Sacramento,535798,Sacramento County,98.61
6,Long Beach,450901,Los Angeles County,50.71
7,Oakland,443554,Alameda County,55.93
8,Bakersfield,417468,Kern County,149.78
9,Anaheim,344561,Orange County,50.27


In [61]:
df_cal['pop_dens'] = df_cal['pop'] / df_cal['area_sq_miles']

In [62]:
df_cal

Unnamed: 0,city,pop,county,area_sq_miles,pop_dens
0,Los Angeles,3878704,Los Angeles County,469.49,8261.526337
1,San Diego,1404452,San Diego County,325.88,4309.72137
2,San Jose,997368,Santa Clara County,178.26,5595.018512
3,San Francisco,827526,San Francisco County,46.91,17640.716265
4,Fresno,550105,Fresno County,115.18,4776.046189
5,Sacramento,535798,Sacramento County,98.61,5433.50573
6,Long Beach,450901,Los Angeles County,50.71,8891.75705
7,Oakland,443554,Alameda County,55.93,7930.520293
8,Bakersfield,417468,Kern County,149.78,2787.207905
9,Anaheim,344561,Orange County,50.27,6854.207281


## Filtering Rows in Pandas

If you've used Microsoft Excel, Google Sheets, or other spreadsheet applications, you may have had to apply a filter to isolate a subset of records on a sheet. Pandas also has that capability, but you don't use a dropdown menu to do it - you write the filter in a line of code instead.

If we want to look only at rows that match a certain condition in a DataFrame, we can make use of Boolean comparison operators, `==`, `!=`, `>=`, `<=`, `>`, *and* `<`. These can be used to create a Pandas `Series` object that contains Boolean values for each item in a column.

In [63]:
df_cal['county'] == 'Los Angeles County'

0     True
1    False
2    False
3    False
4    False
5    False
6     True
7    False
8    False
9    False
Name: county, dtype: bool

In [64]:
df_cal['pop'] >= 550000

0     True
1     True
2     True
3     True
4     True
5    False
6    False
7    False
8    False
9    False
Name: pop, dtype: bool

We can then use that Series to filter out rows that have a `False` value. If we want to look at the rows of `df_cal` where `pop` is at least 550000, we can apply a filter by placing that Series object inside square brackets after the name of the DataFrame: `df_cal[df_cal['pop'] >= 550000]`.

This may look a little weird with the name of the DataFrame showing twice, but it makes sense if you read it as "show only the rows of the DataFrame where the values in the DataFrame's `pop` column are greater than or equal to 550000".
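Here is a small runnable sketch of a single-condition filter. It uses `df_small`, a three-city excerpt of the `df_cal` data, so that it can be run on its own.

```python
import pandas as pd

# A three-city excerpt of the df_cal data used above
df_small = pd.DataFrame({
    'city': ['Los Angeles', 'Fresno', 'Sacramento'],
    'pop': [3878704, 550105, 535798],
})

# The inner expression builds a Boolean Series;
# the outer brackets keep only the rows where that Series is True
big_cities = df_small[df_small['pop'] >= 550000]
print(big_cities)
```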

You can also apply multiple conditions when filtering rows... for this, you need to put parentheses around each condition, and join them with either a `&` for "and" or a `|` for "or".

In [65]:
df_cal[(df_cal['county'] == 'Los Angeles County') & (df_cal['pop'] >= 550000)]

Unnamed: 0,city,pop,county,area_sq_miles,pop_dens
0,Los Angeles,3878704,Los Angeles County,469.49,8261.526337


In [66]:
df_cal[(df_cal['pop'] < 550000) & (df_cal['pop_dens'] >= 7000)]

Unnamed: 0,city,pop,county,area_sq_miles,pop_dens
6,Long Beach,450901,Los Angeles County,50.71,8891.75705
7,Oakland,443554,Alameda County,55.93,7930.520293


You can also assign the Series objects to variables, and filter on those. This makes the operation more modular, in case you need to run it multiple times with different conditions.

In [67]:
condition1 = df_cal['pop'] < 550000
condition2 = df_cal['pop_dens'] >= 7000

df_cal[condition1 & condition2]

Unnamed: 0,city,pop,county,area_sq_miles,pop_dens
6,Long Beach,450901,Los Angeles County,50.71,8891.75705
7,Oakland,443554,Alameda County,55.93,7930.520293
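The examples so far all combine conditions with `&`. As a sketch of the other operators, the block below uses `|` for "or" and `~` for "not" on `df_small`, a four-city excerpt of the data above. The 420000 threshold is an arbitrary value chosen for illustration.

```python
import pandas as pd

df_small = pd.DataFrame({
    'city': ['Los Angeles', 'Long Beach', 'Oakland', 'Bakersfield'],
    'county': ['Los Angeles County', 'Los Angeles County',
               'Alameda County', 'Kern County'],
    'pop': [3878704, 450901, 443554, 417468],
})

in_la_county = df_small['county'] == 'Los Angeles County'
under_420k = df_small['pop'] < 420000

# "or": rows that satisfy at least one of the two conditions
either = df_small[in_la_county | under_420k]

# "not": rows where the condition is False
not_la = df_small[~in_la_county]

print(either)
print(not_la)
```

Note that Python's `and`, `or`, and `not` keywords do not work element by element on a Series, which is why `pandas` filters use `&`, `|`, and `~` instead.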


### `.query()`

Lastly, one of the most powerful, oft-overlooked tools for filtering data in `pandas` is the DataFrame's `.query()` method.

***Note: if you are a first-time `pandas` user, `.query()` can be a little overwhelming, and using Series objects to filter (e.g.: `df[df[column] == value]`) may be easier to work with.***

Filtering with Series objects is very reliable, but the code can get long and repetitive, since the DataFrame's name appears in every condition.

Using `.query()` requires that you know a little more about syntax, but you can produce concise filters very quickly with it.

To get the same result as our last filter on `pop` and `pop_dens`, you can write the following query, as a single string:

In [68]:
df_cal.query('pop < 550000 & pop_dens >= 7000')

Unnamed: 0,city,pop,county,area_sq_miles,pop_dens
6,Long Beach,450901,Los Angeles County,50.71,8891.75705
7,Oakland,443554,Alameda County,55.93,7930.520293


If you are passing a string as part of one of the conditions, you will need to use quotes inside your query string, like so:

In [69]:
df_cal.query("county == 'Los Angeles County' & pop >= 550000")

Unnamed: 0,city,pop,county,area_sq_miles,pop_dens
0,Los Angeles,3878704,Los Angeles County,469.49,8261.526337
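Query strings can also refer to ordinary Python variables by prefixing the variable name with `@`. This is a minimal sketch on a three-city excerpt of the data; `max_pop` and `min_density` are hypothetical variable names with thresholds chosen for illustration.

```python
import pandas as pd

df_small = pd.DataFrame({
    'city': ['Los Angeles', 'Long Beach', 'Oakland'],
    'pop': [3878704, 450901, 443554],
    'pop_dens': [8261.5, 8891.8, 7930.5],
})

max_pop = 550000
min_density = 8000

# `@name` tells .query() to look up a Python variable rather than a column
result = df_small.query('pop < @max_pop & pop_dens >= @min_density')
print(result)
```

Keeping the thresholds in variables means the query string itself does not need to change when the cutoffs do.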


# Cleaning an Entire Dataset

Let's look closer at the source data for the California city area: https://en.wikipedia.org/wiki/List_of_municipalities_in_California

This data table has several fields that can convey information to humans, but which are not formatted in a way that is machine-readable for purposes of summary statistics or using as inputs in a regression.

The column names are a bit clunky because they contain text references to the article's footnotes.

As you can see in the first row, Adelanto's land area is listed as "52.87 sq mi (136.9 km2)". If we want to use that column to get the mean, maximum, or minimum land area, we will need to extract the numerical data (either 52.87, for square miles, or 136.9, for square kilometers) from the text string.

In [117]:
df_cal2 = pd.read_csv('ca_cities.csv', index_col=0)

In [118]:
df_cal2

Unnamed: 0,Name,Type,County,Population (2020)[1],Population (2010)[9],Change,Land area[10],Population density[10],Incorporated[8]
0,Adelanto,City,San Bernardino,38046,31765,+19.8%,52.87 sq mi (136.9 km2),719.6/sq mi (277.8/km2),"December 22, 1970"
1,Agoura Hills,City,Los Angeles,20299,20330,−0.2%,7.80 sq mi (20.2 km2),"2,602.4/sq mi (1,004.8/km2)","December 8, 1982"
2,Alameda,City,Alameda,78280,73812,+6.1%,10.45 sq mi (27.1 km2),"7,490.9/sq mi (2,892.3/km2)","April 19, 1854"
3,Albany,City,Alameda,20271,18539,+9.3%,1.79 sq mi (4.6 km2),"11,324.6/sq mi (4,372.4/km2)","September 22, 1908"
4,Alhambra,City,Los Angeles,82868,83089,−0.3%,7.63 sq mi (19.8 km2),"10,860.8/sq mi (4,193.4/km2)","July 11, 1903"
...,...,...,...,...,...,...,...,...,...
478,Yountville,Town,Napa,3436,2933,+17.1%,1.49 sq mi (3.9 km2),"2,306.0/sq mi (890.4/km2)","February 4, 1965"
479,Yreka†,City,Siskiyou,7807,7765,+0.5%,9.99 sq mi (25.9 km2),781.5/sq mi (301.7/km2),"April 21, 1857"
480,Yuba City†,City,Sutter,70117,64925,+8.0%,14.98 sq mi (38.8 km2),"4,680.7/sq mi (1,807.2/km2)","January 23, 1908"
481,Yucaipa,City,San Bernardino,54542,51367,+6.2%,28.27 sq mi (73.2 km2),"1,929.3/sq mi (744.9/km2)","November 27, 1989"


### Renaming Columns

`pandas` DataFrames have a `.rename()` method that lets you pass a dictionary to replace existing column names with new ones. Passing `inplace=True` modifies the existing DataFrame directly instead of returning a renamed copy.

When choosing column names for DataFrames you're going to be working with in `pandas`, it is best to avoid spaces or excess punctuation, as these can make certain operations (such as `.query()`) more difficult.

In [119]:
df_cal2.rename(
    columns={
        'Name':'name',
        'Type':'type',
        'County':'county',
        'Population (2020)[1]':'pop2020',
        'Population (2010)[9]':'pop2010',
        'Change':'change',
        'Land area[10]':"area",
        'Population density[10]':'pop_dens',
        'Incorporated[8]':'incorporated'
        
    }, inplace=True
)

df_cal2.head()

Unnamed: 0,name,type,county,pop2020,pop2010,change,area,pop_dens,incorporated
0,Adelanto,City,San Bernardino,38046,31765,+19.8%,52.87 sq mi (136.9 km2),719.6/sq mi (277.8/km2),"December 22, 1970"
1,Agoura Hills,City,Los Angeles,20299,20330,−0.2%,7.80 sq mi (20.2 km2),"2,602.4/sq mi (1,004.8/km2)","December 8, 1982"
2,Alameda,City,Alameda,78280,73812,+6.1%,10.45 sq mi (27.1 km2),"7,490.9/sq mi (2,892.3/km2)","April 19, 1854"
3,Albany,City,Alameda,20271,18539,+9.3%,1.79 sq mi (4.6 km2),"11,324.6/sq mi (4,372.4/km2)","September 22, 1908"
4,Alhambra,City,Los Angeles,82868,83089,−0.3%,7.63 sq mi (19.8 km2),"10,860.8/sq mi (4,193.4/km2)","July 11, 1903"
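If you would rather keep the original DataFrame untouched, you can leave out `inplace=True` and assign the result of `.rename()` to a new variable. The sketch below uses a one-row, hypothetical `df_demo` to show the difference.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Adelanto'],
    'Land area[10]': ['52.87 sq mi (136.9 km2)'],
})

# Without inplace=True, .rename() returns a new DataFrame
# and leaves the original unchanged
df_renamed = df_demo.rename(columns={'Name': 'name', 'Land area[10]': 'area'})

print(list(df_demo.columns))     # original names are still here
print(list(df_renamed.columns))  # new names
```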


## Exploratory Data Analysis - `pandas` Methods

Data analysis is impossible without a way of discovering the characteristics of the data you're working with.

You have to know what parts of the data need to be cleaned, and what format the data exists in, so you can change that format if necessary. The process of familiarizing yourself with the contents of a dataset is often called "Exploratory Data Analysis" (EDA for short).

Finding out what's in a dataset is really simple if it's four columns by ten rows, but if you're dealing with larger datasets, it quickly becomes an incredibly laborious task.

Fortunately, `pandas` comes with a suite of built-in tools that let you conduct EDA quickly.

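### `.head()` and `.tail()`

DataFrames have `.head()` and `.tail()` methods that return the first or last five rows by default (or however many rows you pass in the parentheses). We used `.head()` above to preview the renamed columns.

### `.info()`

DataFrames have a `.info()` method that lets you see what type of data each column contains. The datatypes "int64", "float64", and "object" correspond to integers, floats, and (typically) strings.

In this case, although the last four columns all represent numeric information, they are stored as text strings. We want to try to convert them (as much as possible) to numeric data.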

In [120]:
df_cal2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 483 entries, 0 to 482
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name          483 non-null    object
 1   type          483 non-null    object
 2   county        483 non-null    object
 3   pop2020       483 non-null    int64 
 4   pop2010       483 non-null    int64 
 5   change        483 non-null    object
 6   area          483 non-null    object
 7   pop_dens      483 non-null    object
 8   incorporated  483 non-null    object
dtypes: int64(2), object(7)
memory usage: 37.7+ KB


Currently, only the population columns are in a numeric format (int64 is a 64-bit integer). We can see summary statistics for these columns by using the `.describe()` method.

### `.describe()`

In [121]:
df_cal2.describe()

Unnamed: 0,pop2020,pop2010
count,483.0,483.0
mean,68375.68,64009.5
std,205842.8,197865.7
min,200.0,112.0
25%,11331.5,10861.0
50%,31051.0,28976.0
75%,72382.5,66768.5
max,3898747.0,3792621.0


### `.unique()` and `.value_counts()`

When looking at individual columns in a dataset, pandas also lets you see which distinct values a column contains, with `.unique()`. Often, an even more useful command is `.value_counts()`, which counts how many times each value appears and sorts the values from most to least frequent.

In [122]:
df_cal2['type'].unique()

array(['City', 'Town', 'City and county'], dtype=object)

In [123]:
df_cal2['type'].value_counts()

type
City               459
Town                23
City and county      1
Name: count, dtype: int64

You can use `.head()` in conjunction with `.value_counts()` to look at only the most frequent categories (the top 5 by default, or however many you pass in the parentheses of `.head()`), like so:

In [124]:
df_cal2['county'].value_counts().head(10)

county
Los Angeles       88
Orange            34
Riverside         26
San Bernardino    24
San Mateo         20
Contra Costa      19
San Diego         18
Fresno            15
Santa Clara       15
Alameda           14
Name: count, dtype: int64
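Two related tools are worth knowing: `.nunique()` returns the *number* of distinct values, and `.value_counts(normalize=True)` reports proportions instead of raw counts. The sketch below uses a tiny hypothetical Series rather than the full `type` column.

```python
import pandas as pd

types = pd.Series(['City', 'City', 'Town', 'City'], name='type')

n_types = types.nunique()                    # number of distinct values
shares = types.value_counts(normalize=True)  # proportions instead of counts

print(n_types)
print(shares)
```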

### `.shape`

You can quickly get the shape of a DataFrame, as a (rows, columns) tuple, by using `.shape` (note that there are no parentheses after `.shape`, because it is an attribute rather than a method).

In [125]:
df_cal2.shape

(483, 9)

### `.duplicated()` and `.any()`

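We're fortunate in this case to have a dataset with no duplicate rows (that is, rows with the exact same values across all columns), but `.duplicated()` and `.any()` can be chained to check for them. `.duplicated()` returns a Boolean Series marking each row that repeats an earlier row, and `.any()` returns `True` if at least one value in that Series is `True`.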

In [126]:
df_cal2.duplicated().any()

False
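To see what these methods do when duplicates *are* present, here is a sketch on a small hypothetical DataFrame with one repeated row. It also shows `.drop_duplicates()`, which removes the repeats.

```python
import pandas as pd

# Hypothetical data: the first and third rows are identical
df_dupes = pd.DataFrame({
    'name': ['Albany', 'Alameda', 'Albany'],
    'county': ['Alameda', 'Alameda', 'Alameda'],
})

flags = df_dupes.duplicated()         # True for each row that repeats an earlier row
has_dupes = flags.any()               # True if at least one duplicate exists
deduped = df_dupes.drop_duplicates()  # keeps the first copy of each row

print(flags)
print(has_dupes)
print(deduped)
```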

## Cleaning Data using `.apply()` and `lambda`

A quick way to create or modify columns in a DataFrame is to call the `.apply()` method and pass it a function, often (but not always) one written with the `lambda` keyword. Here we use the `.apply()` method of the `pd.Series` class, which calls the function once for each value in the column.

A Lambda function is an "anonymous" function (one that doesn't need to be named since it's used only in a local context). "Lambda" can mean different things in different programming languages, but in Python it allows for a quick ad hoc function to be used without needing to define the function officially with a `def` statement.

Assuming the parameters of a function are appropriate to the data in the Pandas series, we can pass in a single function as an argument of `.apply()` without the need for a lambda statement.
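As a sketch of passing a function without `lambda`, the block below applies the built-in `len` and a named function to a short Series of city names. The `initials` helper is hypothetical and defined only for this illustration.

```python
import pandas as pd

names = pd.Series(['Adelanto', 'Agoura Hills', 'Alameda'])

def initials(text):
    """Return the first letter of each word in the string."""
    return ''.join(word[0] for word in text.split())

lengths = names.apply(len)      # built-in function, passed by name
abbrev = names.apply(initials)  # named function defined with `def`

print(lengths)
print(abbrev)
```

Notice that the function is passed without parentheses: `.apply(len)`, not `.apply(len())`. We are handing `.apply()` the function itself so that it can call the function on each value.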

We can use a lambda expression to remove the '†' character from the city names column, like so:

In [127]:
df_cal2['name'] = df_cal2['name'].apply(lambda x: x.replace('†',''))

In [128]:
df_cal2['name']

0          Adelanto
1      Agoura Hills
2           Alameda
3            Albany
4          Alhambra
           ...     
478      Yountville
479           Yreka
480       Yuba City
481         Yucaipa
482    Yucca Valley
Name: name, Length: 483, dtype: object

We can also use lambda expressions to remove non-numerical characters from text strings that we intend to change into numerical data types.

In [129]:
df_cal2['pct_change'] = df_cal2['change'].apply(lambda x: x.strip('+%'))

In [130]:
df_cal2.head()

Unnamed: 0,name,type,county,pop2020,pop2010,change,area,pop_dens,incorporated,pct_change
0,Adelanto,City,San Bernardino,38046,31765,+19.8%,52.87 sq mi (136.9 km2),719.6/sq mi (277.8/km2),"December 22, 1970",19.8
1,Agoura Hills,City,Los Angeles,20299,20330,−0.2%,7.80 sq mi (20.2 km2),"2,602.4/sq mi (1,004.8/km2)","December 8, 1982",−0.2
2,Alameda,City,Alameda,78280,73812,+6.1%,10.45 sq mi (27.1 km2),"7,490.9/sq mi (2,892.3/km2)","April 19, 1854",6.1
3,Albany,City,Alameda,20271,18539,+9.3%,1.79 sq mi (4.6 km2),"11,324.6/sq mi (4,372.4/km2)","September 22, 1908",9.3
4,Alhambra,City,Los Angeles,82868,83089,−0.3%,7.63 sq mi (19.8 km2),"10,860.8/sq mi (4,193.4/km2)","July 11, 1903",−0.3


In [131]:
df_cal2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 483 entries, 0 to 482
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name          483 non-null    object
 1   type          483 non-null    object
 2   county        483 non-null    object
 3   pop2020       483 non-null    int64 
 4   pop2010       483 non-null    int64 
 5   change        483 non-null    object
 6   area          483 non-null    object
 7   pop_dens      483 non-null    object
 8   incorporated  483 non-null    object
 9   pct_change    483 non-null    object
dtypes: int64(2), object(8)
memory usage: 41.5+ KB


Hmm... the "pct_change" column is still stored as "object". Can you think of what we should do to change it to a decimal number?

In [132]:
df_cal2['pct_change'] = df_cal2['change'].apply(lambda x: float(x.strip('+%')))

ValueError: could not convert string to float: '−0.2'

That's weird! `float()` should be able to read a leading '-' as a minus sign... what character are we actually seeing in the table?

In [133]:
df_cal2.iloc[1]['change'][0]

'−'

In [134]:
df_cal2.iloc[1]['change'][0] == '-'

False

*...What?*

## Character Encodings: ASCII, ISO-8859-1, UTF-8

There are several systems for storing alphanumeric characters; the most basic of these is [ASCII](https://en.wikipedia.org/wiki/ASCII), which covers upper- and lower-case letters in the English alphabet, numbers, some punctuation, and control characters. There are 128 characters in ASCII, so each one fits in 7 bits and is stored in a single byte of computer memory.

But there are many thousands of distinct characters when other languages are taken into consideration. ISO 8859-1 (Latin-1) still uses one byte per character, but uses the other 128 values for additional characters such as accented Western European letters. UTF-8, an encoding of Unicode, uses between one and four bytes per character and can represent many other characters besides the ones in ASCII... including some that *look* like ASCII characters but aren't. The ASCII '-' character (officially called "hyphen-minus") does double duty as both a hyphen and a minus sign.

However, there is also a separate Unicode character, number 8722 (written U+2212 in hexadecimal), which is explicitly a minus sign. This is how the minus signs are encoded on some Wikipedia articles.

It's important to note that the first 128 characters in Unicode correspond to the characters in ASCII.

#### `ord()` and `chr()`

If you want to convert a character to its Unicode code point (the unique integer that identifies it), you can use the built-in Python function `ord()`. To go the opposite direction and convert from code point to character, you can use `chr()`.

In [135]:
# ASCII hyphen-minus (used as both a hyphen and a minus sign)
ord('-')

45

In [136]:
chr(45)

'-'

In [137]:
#Special Unicode minus
ord(df_cal2.iloc[1]['change'][0])

8722

In [138]:
chr(8722)

'−'
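When two characters look alike, the standard library's `unicodedata` module can report each character's official Unicode name. The sketch below compares three lookalikes; the en dash is included as an extra example and does not appear in the dataset as far as this notebook shows.

```python
import unicodedata

# hyphen-minus, minus sign, en dash (written as escapes so they are unambiguous)
lookalikes = ['-', '\u2212', '\u2013']

for char in lookalikes:
    print(ord(char), unicodedata.name(char))
```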

In [139]:
df_cal2['pct_change'] = df_cal2['change'].apply(lambda x: float(x.replace(chr(8722), '-').strip('+%')))
df_cal2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 483 entries, 0 to 482
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          483 non-null    object 
 1   type          483 non-null    object 
 2   county        483 non-null    object 
 3   pop2020       483 non-null    int64  
 4   pop2010       483 non-null    int64  
 5   change        483 non-null    object 
 6   area          483 non-null    object 
 7   pop_dens      483 non-null    object 
 8   incorporated  483 non-null    object 
 9   pct_change    483 non-null    float64
dtypes: float64(1), int64(2), object(7)
memory usage: 41.5+ KB


There! Now the minus sign can be correctly interpreted by `float()` and used to convert a string to a negative decimal number.
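The same cleanup can also be written without `.apply()`, using the `.str` accessor that `pandas` provides for string columns. This is an alternative sketch rather than the approach used in this notebook, and it runs on a three-value sample of the `change` column.

```python
import pandas as pd

# '\u2212' is the Unicode minus sign found in the Wikipedia table
change = pd.Series(['+19.8%', '\u22120.2%', '+6.1%'])

pct_change = (
    change
    .str.replace('\u2212', '-', regex=False)  # swap in the ASCII hyphen-minus
    .str.strip('+%')                          # drop the plus and percent signs
    .astype(float)                            # convert the cleaned text to floats
)

print(pct_change)
```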

## Coding Exercise - String Methods in Lambda Expressions

How do we get our area into a numeric data type?

In [140]:
################################################################################
################################################################################

df_cal2['area_sq_miles'] = df_cal2['area'].apply(lambda x: float(x.split('sq')[0].replace(',','')))

################################################################################
################################################################################

In [142]:
df_cal2.head()

Unnamed: 0,name,type,county,pop2020,pop2010,change,area,pop_dens,incorporated,pct_change,area_sq_miles
0,Adelanto,City,San Bernardino,38046,31765,+19.8%,52.87 sq mi (136.9 km2),719.6/sq mi (277.8/km2),"December 22, 1970",19.8,52.87
1,Agoura Hills,City,Los Angeles,20299,20330,−0.2%,7.80 sq mi (20.2 km2),"2,602.4/sq mi (1,004.8/km2)","December 8, 1982",-0.2,7.8
2,Alameda,City,Alameda,78280,73812,+6.1%,10.45 sq mi (27.1 km2),"7,490.9/sq mi (2,892.3/km2)","April 19, 1854",6.1,10.45
3,Albany,City,Alameda,20271,18539,+9.3%,1.79 sq mi (4.6 km2),"11,324.6/sq mi (4,372.4/km2)","September 22, 1908",9.3,1.79
4,Alhambra,City,Los Angeles,82868,83089,−0.3%,7.63 sq mi (19.8 km2),"10,860.8/sq mi (4,193.4/km2)","July 11, 1903",-0.3,7.63
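It can help to run a chained lambda expression one step at a time on a single value. The sketch below walks through the same three steps on the Los Angeles entry from the table; the intermediate variable names are hypothetical.

```python
raw = '469.49 sq mi (1,216.0 km2)'  # the Los Angeles value from the table

before_sq = raw.split('sq')[0]          # keep the text before the first 'sq'
no_commas = before_sq.replace(',', '')  # remove thousands separators, if any
area = float(no_commas)                 # float() ignores surrounding whitespace

print(repr(before_sq))
print(area)
```

The `.replace(',', '')` step changes nothing for this value, but `float()` would raise an error on any number that does contain a comma, such as "1,216.0".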


What about population density?

In [95]:
################################################################################
################################################################################

df_cal2['pop_dens_sq_miles'] = df_cal2['pop_dens'].apply(lambda x: float(x.split('/')[0].replace(',','')))

################################################################################
################################################################################

In [143]:
df_cal2.head()

Unnamed: 0,name,type,county,pop2020,pop2010,change,area,pop_dens,incorporated,pct_change,area_sq_miles,pop_dens_sq_miles
0,Adelanto,City,San Bernardino,38046,31765,+19.8%,52.87 sq mi (136.9 km2),719.6/sq mi (277.8/km2),"December 22, 1970",19.8,52.87,719.6
1,Agoura Hills,City,Los Angeles,20299,20330,−0.2%,7.80 sq mi (20.2 km2),"2,602.4/sq mi (1,004.8/km2)","December 8, 1982",-0.2,7.8,2602.4
2,Alameda,City,Alameda,78280,73812,+6.1%,10.45 sq mi (27.1 km2),"7,490.9/sq mi (2,892.3/km2)","April 19, 1854",6.1,10.45,7490.9
3,Albany,City,Alameda,20271,18539,+9.3%,1.79 sq mi (4.6 km2),"11,324.6/sq mi (4,372.4/km2)","September 22, 1908",9.3,1.79,11324.6
4,Alhambra,City,Los Angeles,82868,83089,−0.3%,7.63 sq mi (19.8 km2),"10,860.8/sq mi (4,193.4/km2)","July 11, 1903",-0.3,7.63,10860.8


### Machine-Readable Dates - `dateutil`

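The `dateutil` module has a parser that can turn date strings like "December 22, 1970" into `datetime` objects, which display in a format like "1970-12-22". We'll use it here to clean our "incorporated" column. The `.split('[')[0]` step drops any trailing footnote marker (such as "[13]") before the date is parsed.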

In [96]:
df_cal2['incorp_parse'] = df_cal2['incorporated'].apply(lambda x: dateutil.parser.parse(x.split('[')[0]))

In [97]:
df_cal2

Unnamed: 0,name,type,county,pop2020,pop2010,change,area,pop_dens,incorporated,pct_change,area_sq_miles,pop_dens_sq_miles,incorp_parse
0,Adelanto,City,San Bernardino,38046,31765,+19.8%,52.87 sq mi (136.9 km2),719.6/sq mi (277.8/km2),"December 22, 1970",19.8,52.87,719.6,1970-12-22
1,Agoura Hills,City,Los Angeles,20299,20330,−0.2%,7.80 sq mi (20.2 km2),"2,602.4/sq mi (1,004.8/km2)","December 8, 1982",-0.2,7.80,2602.4,1982-12-08
2,Alameda,City,Alameda,78280,73812,+6.1%,10.45 sq mi (27.1 km2),"7,490.9/sq mi (2,892.3/km2)","April 19, 1854",6.1,10.45,7490.9,1854-04-19
3,Albany,City,Alameda,20271,18539,+9.3%,1.79 sq mi (4.6 km2),"11,324.6/sq mi (4,372.4/km2)","September 22, 1908",9.3,1.79,11324.6,1908-09-22
4,Alhambra,City,Los Angeles,82868,83089,−0.3%,7.63 sq mi (19.8 km2),"10,860.8/sq mi (4,193.4/km2)","July 11, 1903",-0.3,7.63,10860.8,1903-07-11
...,...,...,...,...,...,...,...,...,...,...,...,...,...
478,Yountville,Town,Napa,3436,2933,+17.1%,1.49 sq mi (3.9 km2),"2,306.0/sq mi (890.4/km2)","February 4, 1965",17.1,1.49,2306.0,1965-02-04
479,Yreka,City,Siskiyou,7807,7765,+0.5%,9.99 sq mi (25.9 km2),781.5/sq mi (301.7/km2),"April 21, 1857",0.5,9.99,781.5,1857-04-21
480,Yuba City,City,Sutter,70117,64925,+8.0%,14.98 sq mi (38.8 km2),"4,680.7/sq mi (1,807.2/km2)","January 23, 1908",8.0,14.98,4680.7,1908-01-23
481,Yucaipa,City,San Bernardino,54542,51367,+6.2%,28.27 sq mi (73.2 km2),"1,929.3/sq mi (744.9/km2)","November 27, 1989",6.2,28.27,1929.3,1989-11-27


`dateutil` turned the dates in the "incorporated" column into `datetime` values, which `pandas` stores as Timestamp data, a special data type that allows for various operations dealing with time.

In [98]:
df_cal2['incorp_parse'].max()

Timestamp('2024-07-01 00:00:00')

In [99]:
df_cal2['incorp_parse'].min()

Timestamp('1850-02-27 00:00:00')
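As a sketch of the "various operations" that Timestamp data allows, the block below parses three of the dates from the table, pulls out the year with the `.dt` accessor, and subtracts two dates to get the span between them.

```python
import pandas as pd
from dateutil import parser

incorporated = pd.Series(['December 22, 1970', 'April 15, 1850[13]', 'May 4, 1852'])

# Drop any trailing footnote marker such as '[13]' before parsing
parsed = incorporated.apply(lambda x: parser.parse(x.split('[')[0]))

years = parsed.dt.year                          # the year of each date
span_days = (parsed.max() - parsed.min()).days  # subtraction gives a Timedelta

print(years)
print(span_days)
```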

In [100]:
df_cal2.describe()

Unnamed: 0,pop2020,pop2010,pct_change,area_sq_miles,pop_dens_sq_miles,incorp_parse
count,483.0,483.0,483.0,483.0,483.0,483
mean,68375.68,64009.5,6.062112,17.154654,4313.461491,1928-08-03 22:15:39.130434816
min,200.0,112.0,-81.8,0.31,22.4,1850-02-27 00:00:00
25%,11331.5,10861.0,0.95,3.535,2067.4,1902-05-25 12:00:00
50%,31051.0,28976.0,4.7,8.42,3540.9,1919-02-28 00:00:00
75%,72382.5,66768.5,9.4,19.41,5352.1,1960-05-28 00:00:00
max,3898747.0,3792621.0,153.2,469.49,21485.5,2024-07-01 00:00:00
std,205842.8,197865.7,12.913795,32.904595,3344.961557,


## Coding Exercise - Cities Incorporated Before California Statehood

California was admitted to the Union as the 31st state on September 9, 1850.

Can you filter the `df_cal2` DataFrame to find all the cities that were incorporated before this date?

How about cities that were incorporated within ten years of this date?

In [101]:
################################################################################
################################################################################

df_cal2[df_cal2['incorp_parse'] < dateutil.parser.parse('September 9, 1850')]

################################################################################
################################################################################

Unnamed: 0,name,type,county,pop2020,pop2010,change,area,pop_dens,incorporated,pct_change,area_sq_miles,pop_dens_sq_miles,incorp_parse
36,Benicia,City,Solano,27131,26997,+0.5%,12.81 sq mi (33.2 km2),"2,118.0/sq mi (817.7/km2)","March 27, 1850",0.5,12.81,2118.0,1850-03-27
239,Los Angeles,City,Los Angeles,3898747,3792621,+2.8%,"469.49 sq mi (1,216.0 km2)","8,304.2/sq mi (3,206.3/km2)","April 4, 1850",2.8,469.49,8304.2,1850-04-04
360,Sacramento‡,City,Sacramento,524943,466488,+12.5%,98.61 sq mi (255.4 km2),"5,323.4/sq mi (2,055.4/km2)","February 27, 1850",12.5,98.61,5323.4,1850-02-27
368,San Diego,City,San Diego,1386932,1307402,+6.1%,325.88 sq mi (844.0 km2),"4,256.0/sq mi (1,643.2/km2)","March 27, 1850",6.1,325.88,4256.0,1850-03-27
371,San Francisco,City and county,San Francisco,873965,805235,+8.5%,46.91 sq mi (121.5 km2),"18,630.7/sq mi (7,193.3/km2)","April 15, 1850[13]",8.5,46.91,18630.7,1850-04-15
375,San Jose,City,Santa Clara,1013240,945942,+7.1%,178.26 sq mi (461.7 km2),"5,684.1/sq mi (2,194.6/km2)","March 27, 1850",7.1,178.26,5684.1,1850-03-27
389,Santa Barbara,City,Santa Barbara,88665,88410,+0.3%,19.51 sq mi (50.5 km2),"4,544.6/sq mi (1,754.7/km2)","April 9, 1850",0.3,19.51,4544.6,1850-04-09
422,Stockton,City,San Joaquin,320804,291707,+10.0%,62.21 sq mi (161.1 km2),"5,156.8/sq mi (1,991.0/km2)","July 23, 1850",10.0,62.21,5156.8,1850-07-23


\****For more time-related Python concepts, join us two weeks from now on Tuesday, November 4 from 2-3 PM, for the Punctual Python workshop. We'll go over a brief selection of tools for handling temporal data, converting between time zones, timing code executions, and creating a progress bar for long Python processes.***

# Merging DataFrames

Notice that there are different population figures between the two DataFrames, `df_cal` and `df_cal2`. The `df_cal` DataFrame has population estimates for 2024, while `df_cal2` has figures from the 2010 and 2020 censuses.

What if we want to compare populations between the 2020 census and the 2024 estimates?

For that, we'll need to merge the DataFrames before we continue.

## Conceptual Overview: Joining Tables

Whenever you merge data, you must consider which variables/columns you want to merge your data on, and what type of merge you want to perform. In order to merge two datasets, they must have at least one variable in common. If they have multiple variables in common, depending on your desired outcome, you may want to include more than one column to merge the data on.

Merges (also called "joins") come in several forms:

**Inner Join:** only include data from rows where there is a match, where both dataframes have a record that matches on the common variable(s).

**Outer Join (or Full Outer Join):** include all data from both dataframes, insert null values where there is no match on the common variable.

**Left (or Left Outer) Join:** include all data from the first dataframe, inserting null values for columns from the second dataframe where there is no match on the common variable.

**Right Join (or Right Outer Join):** include all data from the second dataframe, inserting null values for columns from the first dataframe where there is no match on the common variable.

You may not have to learn SQL (Structured Query Language) for your projects, but it is a highly flexible and useful tool for retrieving data from databases, and it provides a framework for understanding how data can be merged.

[Geeksforgeeks.org's SQL Joins page](https://www.geeksforgeeks.org/sql/sql-join-set-1-inner-left-right-and-full-joins/) is helpful for visualizing different types of joins.
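The four join types can be compared side by side with a small sketch. The two DataFrames below, `left` and `right`, are hypothetical three-row tables built from figures that appear elsewhere in this notebook; Anaheim appears only in `left` and Albany only in `right`.

```python
import pandas as pd

left = pd.DataFrame({
    'name': ['Oakland', 'Fresno', 'Anaheim'],
    'pop_2024_est': [443554, 550105, 344561],
})
right = pd.DataFrame({
    'name': ['Oakland', 'Fresno', 'Albany'],
    'pop2020': [440646, 542107, 20271],
})

# Only cities found in both tables
inner = pd.merge(left, right, on='name', how='inner')

# Every city from either table, with NaN where there is no match
outer = pd.merge(left, right, on='name', how='outer')

# Every city from `left`; pop2020 is NaN for Anaheim
left_join = pd.merge(left, right, on='name', how='left')

# Every city from `right`; pop_2024_est is NaN for Albany
right_join = pd.merge(left, right, on='name', how='right')

for label, frame in [('inner', inner), ('outer', outer),
                     ('left', left_join), ('right', right_join)]:
    print(label)
    print(frame)
```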


## Joining DataFrames with `pd.merge()`

`pd.merge()` provides robust syntax for joining DataFrames. You can choose the order of your DataFrames, how they are merged (inner, outer, left, right), and which columns they must have in common.

In order to merge our `df_cal` DataFrame, which contains the 2024 population estimate column, with our `df_cal2` DataFrame, which has the 2010 and 2020 census population counts, we need to make sure the other columns match up. The "city" column in the first DataFrame must be renamed "name" to match the second, and we should probably also rename "pop" to "pop_2024_est" so it is clear which year it refers to. The county names also need to match, so we will strip the word " County" from the `county` column of `df_cal`.

In [102]:
df_cal.rename(columns={'pop':'pop_2024_est', 'city':'name'}, inplace=True)
df_cal

Unnamed: 0,name,pop_2024_est,county,area_sq_miles,pop_dens
0,Los Angeles,3878704,Los Angeles County,469.49,8261.526337
1,San Diego,1404452,San Diego County,325.88,4309.72137
2,San Jose,997368,Santa Clara County,178.26,5595.018512
3,San Francisco,827526,San Francisco County,46.91,17640.716265
4,Fresno,550105,Fresno County,115.18,4776.046189
5,Sacramento,535798,Sacramento County,98.61,5433.50573
6,Long Beach,450901,Los Angeles County,50.71,8891.75705
7,Oakland,443554,Alameda County,55.93,7930.520293
8,Bakersfield,417468,Kern County,149.78,2787.207905
9,Anaheim,344561,Orange County,50.27,6854.207281
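
Renaming is one way to line up the key columns. As an alternative sketch, `pd.merge()` also accepts `left_on` and `right_on`, which let each DataFrame keep its own column name. The two toy frames below (`cities_2024` and `cities_census`) are hypothetical stand-ins for `df_cal` and `df_cal2`, with a couple of values copied from this notebook's tables.

```python
import pandas as pd

# Toy stand-ins for df_cal (before renaming) and df_cal2
cities_2024 = pd.DataFrame({"city": ["Fresno", "Oakland"], "pop": [550105, 443554]})
cities_census = pd.DataFrame({"name": ["Fresno", "Oakland"], "pop2020": [542107, 440646]})

# left_on/right_on let the key columns keep different names in each DataFrame
merged = pd.merge(cities_2024, cities_census, left_on="city", right_on="name", how="left")
print(merged.columns.tolist())
```

Note that both key columns (`city` and `name`) are kept in the result, which is one reason renaming first can give a tidier merged DataFrame.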


In [103]:
df_cal['county'] = df_cal['county'].apply(lambda x: x.replace(' County', ''))

In [104]:
df_cal

Unnamed: 0,name,pop_2024_est,county,area_sq_miles,pop_dens
0,Los Angeles,3878704,Los Angeles,469.49,8261.526337
1,San Diego,1404452,San Diego,325.88,4309.72137
2,San Jose,997368,Santa Clara,178.26,5595.018512
3,San Francisco,827526,San Francisco,46.91,17640.716265
4,Fresno,550105,Fresno,115.18,4776.046189
5,Sacramento,535798,Sacramento,98.61,5433.50573
6,Long Beach,450901,Los Angeles,50.71,8891.75705
7,Oakland,443554,Alameda,55.93,7930.520293
8,Bakersfield,417468,Kern,149.78,2787.207905
9,Anaheim,344561,Orange,50.27,6854.207281


In [105]:
df_cal2[df_cal2['name'].isin(df_cal['name'].values)]

Unnamed: 0,name,type,county,pop2020,pop2010,change,area,pop_dens,incorporated,pct_change,area_sq_miles,pop_dens_sq_miles,incorp_parse
9,Anaheim,City,Orange,346824,336265,+3.1%,50.27 sq mi (130.2 km2),"6,899.2/sq mi (2,663.8/km2)","March 18, 1876",3.1,50.27,6899.2,1876-03-18
26,Bakersfield,City,Kern,403455,347483,+16.1%,149.78 sq mi (387.9 km2),"2,693.7/sq mi (1,040.0/km2)","January 11, 1898",16.1,149.78,2693.7,1898-01-11
150,Fresno,City,Fresno,542107,494665,+9.6%,115.18 sq mi (298.3 km2),"4,706.6/sq mi (1,817.2/km2)","October 12, 1885",9.6,115.18,4706.6,1885-10-12
234,Long Beach,City,Los Angeles,466742,462257,+1.0%,50.71 sq mi (131.3 km2),"9,204.1/sq mi (3,553.7/km2)","December 13, 1897",1.0,50.71,9204.1,1897-12-13
239,Los Angeles,City,Los Angeles,3898747,3792621,+2.8%,"469.49 sq mi (1,216.0 km2)","8,304.2/sq mi (3,206.3/km2)","April 4, 1850",2.8,469.49,8304.2,1850-04-04
291,Oakland,City,Alameda,440646,390724,+12.8%,55.93 sq mi (144.9 km2),"7,878.5/sq mi (3,041.9/km2)","May 4, 1852",12.8,55.93,7878.5,1852-05-04
368,San Diego,City,San Diego,1386932,1307402,+6.1%,325.88 sq mi (844.0 km2),"4,256.0/sq mi (1,643.2/km2)","March 27, 1850",6.1,325.88,4256.0,1850-03-27
371,San Francisco,City and county,San Francisco,873965,805235,+8.5%,46.91 sq mi (121.5 km2),"18,630.7/sq mi (7,193.3/km2)","April 15, 1850[13]",8.5,46.91,18630.7,1850-04-15
375,San Jose,City,Santa Clara,1013240,945942,+7.1%,178.26 sq mi (461.7 km2),"5,684.1/sq mi (2,194.6/km2)","March 27, 1850",7.1,178.26,5684.1,1850-03-27


In [106]:
df_cal_merged = pd.merge(left=df_cal, right=df_cal2, on=['name','county','area_sq_miles'], how='left')
df_cal_merged

Unnamed: 0,name,pop_2024_est,county,area_sq_miles,pop_dens_x,type,pop2020,pop2010,change,area,pop_dens_y,incorporated,pct_change,pop_dens_sq_miles,incorp_parse
0,Los Angeles,3878704,Los Angeles,469.49,8261.526337,City,3898747.0,3792621.0,+2.8%,"469.49 sq mi (1,216.0 km2)","8,304.2/sq mi (3,206.3/km2)","April 4, 1850",2.8,8304.2,1850-04-04
1,San Diego,1404452,San Diego,325.88,4309.72137,City,1386932.0,1307402.0,+6.1%,325.88 sq mi (844.0 km2),"4,256.0/sq mi (1,643.2/km2)","March 27, 1850",6.1,4256.0,1850-03-27
2,San Jose,997368,Santa Clara,178.26,5595.018512,City,1013240.0,945942.0,+7.1%,178.26 sq mi (461.7 km2),"5,684.1/sq mi (2,194.6/km2)","March 27, 1850",7.1,5684.1,1850-03-27
3,San Francisco,827526,San Francisco,46.91,17640.716265,City and county,873965.0,805235.0,+8.5%,46.91 sq mi (121.5 km2),"18,630.7/sq mi (7,193.3/km2)","April 15, 1850[13]",8.5,18630.7,1850-04-15
4,Fresno,550105,Fresno,115.18,4776.046189,City,542107.0,494665.0,+9.6%,115.18 sq mi (298.3 km2),"4,706.6/sq mi (1,817.2/km2)","October 12, 1885",9.6,4706.6,1885-10-12
5,Sacramento,535798,Sacramento,98.61,5433.50573,,,,,,,,,,NaT
6,Long Beach,450901,Los Angeles,50.71,8891.75705,City,466742.0,462257.0,+1.0%,50.71 sq mi (131.3 km2),"9,204.1/sq mi (3,553.7/km2)","December 13, 1897",1.0,9204.1,1897-12-13
7,Oakland,443554,Alameda,55.93,7930.520293,City,440646.0,390724.0,+12.8%,55.93 sq mi (144.9 km2),"7,878.5/sq mi (3,041.9/km2)","May 4, 1852",12.8,7878.5,1852-05-04
8,Bakersfield,417468,Kern,149.78,2787.207905,City,403455.0,347483.0,+16.1%,149.78 sq mi (387.9 km2),"2,693.7/sq mi (1,040.0/km2)","January 11, 1898",16.1,2693.7,1898-01-11
9,Anaheim,344561,Orange,50.27,6854.207281,City,346824.0,336265.0,+3.1%,50.27 sq mi (130.2 km2),"6,899.2/sq mi (2,663.8/km2)","March 18, 1876",3.1,6899.2,1876-03-18


*Wait a minute... what happened this time?* Is Sacramento missing from the second DataFrame?

Let's check by "county"!

In [107]:
df_cal2[df_cal2['county'] == 'Sacramento']

Unnamed: 0,name,type,county,pop2020,pop2010,change,area,pop_dens,incorporated,pct_change,area_sq_miles,pop_dens_sq_miles,incorp_parse
75,Citrus Heights,City,Sacramento,87583,83301,+5.1%,14.22 sq mi (36.8 km2),"6,159.1/sq mi (2,378.1/km2)","January 1, 1997",5.1,14.22,6159.1,1997-01-01
127,Elk Grove,City,Sacramento,176124,153015,+15.1%,41.99 sq mi (108.8 km2),"4,194.4/sq mi (1,619.5/km2)","July 1, 2000",15.1,41.99,4194.4,2000-07-01
141,Folsom,City,Sacramento,80454,72203,+11.4%,27.88 sq mi (72.2 km2),"2,885.7/sq mi (1,114.2/km2)","April 20, 1946",11.4,27.88,2885.7,1946-04-20
152,Galt,City,Sacramento,25383,23647,+7.3%,7.15 sq mi (18.5 km2),"3,550.1/sq mi (1,370.7/km2)","August 16, 1946",7.3,7.15,3550.1,1946-08-16
195,Isleton,City,Sacramento,794,804,−1.2%,0.44 sq mi (1.1 km2),"1,804.5/sq mi (696.7/km2)","May 14, 1923",-1.2,0.44,1804.5,1923-05-14
334,Rancho Cordova,City,Sacramento,79332,64776,+22.5%,34.57 sq mi (89.5 km2),"2,294.8/sq mi (886.0/km2)","July 1, 2003",22.5,34.57,2294.8,2003-07-01
360,Sacramento‡,City,Sacramento,524943,466488,+12.5%,98.61 sq mi (255.4 km2),"5,323.4/sq mi (2,055.4/km2)","February 27, 1850",12.5,98.61,5323.4,1850-02-27


Some of you may have noticed this earlier when we did the exercise on finding California cities incorporated before California became a state...

The `‡` symbol, like the `†` we saw earlier, is used for footnotes. Let's remove it!

Can you remove it without explicitly typing it within the lambda expression?

In [108]:
df_cal2['name'] = df_cal2['name'].apply(lambda x: x.replace('‡',''))
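
One possible answer to the question above is to refer to the symbols by their Unicode escape codes instead of typing them. This is a sketch on a made-up Series (`names` is a hypothetical stand-in for `df_cal2['name']`, and the dagger on "Los Angeles" is invented for illustration).

```python
import pandas as pd

# Toy Series standing in for df_cal2['name']
names = pd.Series(["Sacramento\u2021", "Fresno", "Los Angeles\u2020"])

# \u2020 is the dagger and \u2021 is the double dagger,
# so neither symbol has to be typed literally
cleaned = names.apply(lambda x: x.replace("\u2021", "").replace("\u2020", ""))
print(cleaned.tolist())
```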

In [109]:
df_cal_merged = pd.merge(left=df_cal, right=df_cal2, on=['name','county','area_sq_miles'], how='left')
df_cal_merged

Unnamed: 0,name,pop_2024_est,county,area_sq_miles,pop_dens_x,type,pop2020,pop2010,change,area,pop_dens_y,incorporated,pct_change,pop_dens_sq_miles,incorp_parse
0,Los Angeles,3878704,Los Angeles,469.49,8261.526337,City,3898747,3792621,+2.8%,"469.49 sq mi (1,216.0 km2)","8,304.2/sq mi (3,206.3/km2)","April 4, 1850",2.8,8304.2,1850-04-04
1,San Diego,1404452,San Diego,325.88,4309.72137,City,1386932,1307402,+6.1%,325.88 sq mi (844.0 km2),"4,256.0/sq mi (1,643.2/km2)","March 27, 1850",6.1,4256.0,1850-03-27
2,San Jose,997368,Santa Clara,178.26,5595.018512,City,1013240,945942,+7.1%,178.26 sq mi (461.7 km2),"5,684.1/sq mi (2,194.6/km2)","March 27, 1850",7.1,5684.1,1850-03-27
3,San Francisco,827526,San Francisco,46.91,17640.716265,City and county,873965,805235,+8.5%,46.91 sq mi (121.5 km2),"18,630.7/sq mi (7,193.3/km2)","April 15, 1850[13]",8.5,18630.7,1850-04-15
4,Fresno,550105,Fresno,115.18,4776.046189,City,542107,494665,+9.6%,115.18 sq mi (298.3 km2),"4,706.6/sq mi (1,817.2/km2)","October 12, 1885",9.6,4706.6,1885-10-12
5,Sacramento,535798,Sacramento,98.61,5433.50573,City,524943,466488,+12.5%,98.61 sq mi (255.4 km2),"5,323.4/sq mi (2,055.4/km2)","February 27, 1850",12.5,5323.4,1850-02-27
6,Long Beach,450901,Los Angeles,50.71,8891.75705,City,466742,462257,+1.0%,50.71 sq mi (131.3 km2),"9,204.1/sq mi (3,553.7/km2)","December 13, 1897",1.0,9204.1,1897-12-13
7,Oakland,443554,Alameda,55.93,7930.520293,City,440646,390724,+12.8%,55.93 sq mi (144.9 km2),"7,878.5/sq mi (3,041.9/km2)","May 4, 1852",12.8,7878.5,1852-05-04
8,Bakersfield,417468,Kern,149.78,2787.207905,City,403455,347483,+16.1%,149.78 sq mi (387.9 km2),"2,693.7/sq mi (1,040.0/km2)","January 11, 1898",16.1,2693.7,1898-01-11
9,Anaheim,344561,Orange,50.27,6854.207281,City,346824,336265,+3.1%,50.27 sq mi (130.2 km2),"6,899.2/sq mi (2,663.8/km2)","March 18, 1876",3.1,6899.2,1876-03-18
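
In the first merge attempt, the Sacramento row went unmatched without any warning. One way to catch that kind of miss is the `indicator=True` argument of `pd.merge()`, which adds a `_merge` column describing where each row came from. The sketch below uses two small made-up frames (`left` and `right` are hypothetical names) rather than the full workshop data.

```python
import pandas as pd

# Toy stand-ins: the second frame carries a footnote mark on one name
left = pd.DataFrame({"name": ["Fresno", "Sacramento"], "pop_2024_est": [550105, 535798]})
right = pd.DataFrame({"name": ["Fresno", "Sacramento\u2021"], "pop2020": [542107, 524943]})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only"
checked = pd.merge(left, right, on="name", how="left", indicator=True)

# Rows that found no partner in the right DataFrame
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched["name"].tolist())
```

Unmatched rows also explain why `pop2020` and `pop2010` displayed as floats (for example `3898747.0`) in the first merge: a `NaN` in an `int64` column forces the whole column to `float64`.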


In [110]:
df_cal_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   name               10 non-null     object        
 1   pop_2024_est       10 non-null     int64         
 2   county             10 non-null     object        
 3   area_sq_miles      10 non-null     float64       
 4   pop_dens_x         10 non-null     float64       
 5   type               10 non-null     object        
 6   pop2020            10 non-null     int64         
 7   pop2010            10 non-null     int64         
 8   change             10 non-null     object        
 9   area               10 non-null     object        
 10  pop_dens_y         10 non-null     object        
 11  incorporated       10 non-null     object        
 12  pct_change         10 non-null     float64       
 13  pop_dens_sq_miles  10 non-null     float64       
 14  incorp_parse       10 non-null     datetime64[ns]
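
The `pop_dens_x` and `pop_dens_y` columns listed above appear because both DataFrames had a `pop_dens` column that was not part of the `on=` list. pandas keeps both and appends the default suffixes `_x` (left) and `_y` (right). As a sketch, the `suffixes=` argument can supply more descriptive labels. The frames and the suffix names `_2024` and `_2020` below are made up for illustration.

```python
import pandas as pd

# Toy frames that share a key ("name") and a non-key column ("pop_dens")
left = pd.DataFrame({"name": ["Fresno"], "pop_dens": [4776.05]})
right = pd.DataFrame({"name": ["Fresno"], "pop_dens": ["4,706.6/sq mi"]})

# Default behaviour: overlapping non-key columns get "_x" and "_y"
default_cols = pd.merge(left, right, on="name").columns.tolist()

# suffixes= swaps in more descriptive labels (these two are made up)
named_cols = pd.merge(left, right, on="name", suffixes=("_2024", "_2020")).columns.tolist()
print(default_cols)
print(named_cols)
```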

In [113]:
# Percent change from the 2020 census count to the 2024 estimate
df_cal_merged['pct_chg_20_24'] = ((df_cal_merged['pop_2024_est'] - df_cal_merged['pop2020']) / df_cal_merged['pop2020']) * 100

In [114]:
df_cal_merged

Unnamed: 0,name,pop_2024_est,county,area_sq_miles,pop_dens_x,type,pop2020,pop2010,change,area,pop_dens_y,incorporated,pct_change,pop_dens_sq_miles,incorp_parse,pct_chg_20_24
0,Los Angeles,3878704,Los Angeles,469.49,8261.526337,City,3898747,3792621,+2.8%,"469.49 sq mi (1,216.0 km2)","8,304.2/sq mi (3,206.3/km2)","April 4, 1850",2.8,8304.2,1850-04-04,-0.514088
1,San Diego,1404452,San Diego,325.88,4309.72137,City,1386932,1307402,+6.1%,325.88 sq mi (844.0 km2),"4,256.0/sq mi (1,643.2/km2)","March 27, 1850",6.1,4256.0,1850-03-27,1.26322
2,San Jose,997368,Santa Clara,178.26,5595.018512,City,1013240,945942,+7.1%,178.26 sq mi (461.7 km2),"5,684.1/sq mi (2,194.6/km2)","March 27, 1850",7.1,5684.1,1850-03-27,-1.56646
3,San Francisco,827526,San Francisco,46.91,17640.716265,City and county,873965,805235,+8.5%,46.91 sq mi (121.5 km2),"18,630.7/sq mi (7,193.3/km2)","April 15, 1850[13]",8.5,18630.7,1850-04-15,-5.3136
4,Fresno,550105,Fresno,115.18,4776.046189,City,542107,494665,+9.6%,115.18 sq mi (298.3 km2),"4,706.6/sq mi (1,817.2/km2)","October 12, 1885",9.6,4706.6,1885-10-12,1.475354
5,Sacramento,535798,Sacramento,98.61,5433.50573,City,524943,466488,+12.5%,98.61 sq mi (255.4 km2),"5,323.4/sq mi (2,055.4/km2)","February 27, 1850",12.5,5323.4,1850-02-27,2.067844
6,Long Beach,450901,Los Angeles,50.71,8891.75705,City,466742,462257,+1.0%,50.71 sq mi (131.3 km2),"9,204.1/sq mi (3,553.7/km2)","December 13, 1897",1.0,9204.1,1897-12-13,-3.393952
7,Oakland,443554,Alameda,55.93,7930.520293,City,440646,390724,+12.8%,55.93 sq mi (144.9 km2),"7,878.5/sq mi (3,041.9/km2)","May 4, 1852",12.8,7878.5,1852-05-04,0.65994
8,Bakersfield,417468,Kern,149.78,2787.207905,City,403455,347483,+16.1%,149.78 sq mi (387.9 km2),"2,693.7/sq mi (1,040.0/km2)","January 11, 1898",16.1,2693.7,1898-01-11,3.47325
9,Anaheim,344561,Orange,50.27,6854.207281,City,346824,336265,+3.1%,50.27 sq mi (130.2 km2),"6,899.2/sq mi (2,663.8/km2)","March 18, 1876",3.1,6899.2,1876-03-18,-0.652492
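
The `pct_chg_20_24` column follows the usual percent-change formula, (new - old) / old * 100. As a quick sanity check, this sketch recomputes it for two rows, with the population figures copied from the table above into a small hypothetical frame named `check`.

```python
import pandas as pd

# Two rows copied from the merged table above
check = pd.DataFrame({
    "name": ["San Francisco", "Sacramento"],
    "pop2020": [873965, 524943],
    "pop_2024_est": [827526, 535798],
})

# (new - old) / old * 100, computed column-wise for every row at once
check["pct_chg_20_24"] = (check["pop_2024_est"] - check["pop2020"]) / check["pop2020"] * 100
print(check[["name", "pct_chg_20_24"]].round(2))
```

A negative value means the 2024 estimate is below the 2020 census count, and a positive value means it is above.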


In [115]:
df_cal_merged.describe()

Unnamed: 0,pop_2024_est,area_sq_miles,pop_dens_x,pop2020,pop2010,pct_change,pop_dens_sq_miles,incorp_parse,pct_chg_20_24
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10,10.0
mean,985043.7,154.102,7248.022693,989760.1,934908.2,7.96,7358.05,1866-02-23 09:36:00,-0.250098
min,344561.0,46.91,2787.207905,346824.0,336265.0,1.0,2693.7,1850-02-27 00:00:00,-5.3136
25%,445390.8,52.015,4940.411074,447170.0,408607.2,3.85,4860.8,1850-03-29 00:00:00,-1.337968
50%,542951.5,106.895,6224.612896,533525.0,480576.5,7.8,6291.65,1851-04-25 00:00:00,0.072926
75%,954907.5,171.14,8178.774826,978421.2,910765.2,11.775,8197.775,1883-05-22 06:00:00,1.422321
max,3878704.0,469.49,17640.716265,3898747.0,3792621.0,16.1,18630.7,1898-01-11 00:00:00,3.47325
std,1068541.0,140.147819,4123.4569,1074066.0,1051899.0,4.900839,4437.318027,,2.641947


In [116]:
df_cal_merged.query('pct_chg_20_24 < 0')

Unnamed: 0,name,pop_2024_est,county,area_sq_miles,pop_dens_x,type,pop2020,pop2010,change,area,pop_dens_y,incorporated,pct_change,pop_dens_sq_miles,incorp_parse,pct_chg_20_24
0,Los Angeles,3878704,Los Angeles,469.49,8261.526337,City,3898747,3792621,+2.8%,"469.49 sq mi (1,216.0 km2)","8,304.2/sq mi (3,206.3/km2)","April 4, 1850",2.8,8304.2,1850-04-04,-0.514088
2,San Jose,997368,Santa Clara,178.26,5595.018512,City,1013240,945942,+7.1%,178.26 sq mi (461.7 km2),"5,684.1/sq mi (2,194.6/km2)","March 27, 1850",7.1,5684.1,1850-03-27,-1.56646
3,San Francisco,827526,San Francisco,46.91,17640.716265,City and county,873965,805235,+8.5%,46.91 sq mi (121.5 km2),"18,630.7/sq mi (7,193.3/km2)","April 15, 1850[13]",8.5,18630.7,1850-04-15,-5.3136
6,Long Beach,450901,Los Angeles,50.71,8891.75705,City,466742,462257,+1.0%,50.71 sq mi (131.3 km2),"9,204.1/sq mi (3,553.7/km2)","December 13, 1897",1.0,9204.1,1897-12-13,-3.393952
9,Anaheim,344561,Orange,50.27,6854.207281,City,346824,336265,+3.1%,50.27 sq mi (130.2 km2),"6,899.2/sq mi (2,663.8/km2)","March 18, 1876",3.1,6899.2,1876-03-18,-0.652492


Comparing the 2024 estimates with the 2020 census counts, five of these ten cities lost population. San Francisco shrank by over 5%, and Long Beach dropped by more than 3%.

# End of Parts I & II