# Data Cleaning

This lesson covers one of the most important topics when working with data: cleaning.

## Learning Objectives

After completing this lesson you will:

- Understand how missing data is represented in Python and in Pandas with `None` and `NaN`
- Know how to identify missing data values in your data
- Perform calculations with missing values
- Identify when data values are missing with `isna()` and `notna()`
- TODO Drop rows with missing values from Dataframes using `dropna()`
- TODO Fill in numeric values using `fillna()`
- Use vectorized string operations to clean dirty data 

## Data Used in this Lesson

- [City of Pittsburgh Trees](https://data.wprdc.org/dataset/city-trees) - This dataset contains information about trees cared for and managed by the City of Pittsburgh Department of Public Works Forestry Division. Each row represents a tree with information about the type, location, size, and various measures of economic and environmental benefits. The dataset contains 45,709 entries.
    - The dataset is stored as a CSV file with the name `pgh-trees.csv`
- [Mayors of Pittsburgh](https://en.wikipedia.org/wiki/List_of_mayors_of_Pittsburgh) - This dataset contains information about the Mayors of the city of Pittsburgh going back to 1816 when Pittsburgh formally became a city. Each row represents a mayor with information about their name, term, political party, opposition, and notes about them. The dataset contains 61 entries.
    - The dataset is stored as a TSV file with the name `pgh-mayors.tsv`

In [None]:
# load the necessary libraries
import pandas as pd
import numpy as np


## Representing Missing Data Values

* One of the challenges you may face when working with messy data is *missing* or *null* values
* There are multiple ways to represent missing values in Python
* There is a Pythonic way using the `None` object
* There is a Numpy/Pandas-y way using `NaN`

### `None`, A *Pythonic* way to represent Missing Data

* `None` is the standard way of representing nothing in plain Python; it is one of the basic data types.
* It is useful, but it is not a memory-efficient way to represent missing values because it is an Object.
* It can be used in numeric and programmatic contexts, but it has a computational cost.

#### Task - Adding None values to your arrays

1. Run the code cell below to generate a numpy array of integer values.
2. Make a note of the data type. Is it a numeric data type?
3. Modify the code cell to include at least 3 `None` values in the Python list `my_list` and re-run the code cell.
4. Make a note of how the data type has changed. Is it still numeric?

In [None]:
# create a python list of integer values
my_list = [1, 45, 2, 48, 2, 456, 672, 43, 121, 123, 56]

# convert the Python list to a numpy array
my_array = np.array(my_list)
# print the data type of the array
print("The data type is: ", my_array.dtype)
# display the array values
my_array

#### Answer - Adding None values to your arrays

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
# create a python list of integer values
my_list = [1, None, 2, 48, 2, None, 672, 43, 121, None, 56]

# convert the Python list to a numpy array
my_array = np.array(my_list)
# print the data type of the array
print("The data type is: ", my_array.dtype)
# display the array values
my_array

#### Task - Performing Computations with Objects

1. Run the code cell below to calculate the time it takes to sum an array of integer values vs. object values
2. Compare the results (it will be the first number), which one was faster? Why?

In [None]:
# create a list of objects and a list of integers
# compute their sum and time how long it takes
for dtype in ['object','int']:
    print("Performing calculations on an array of data type = ", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()
print("Finished!")

*your answer here*


#### Answer - Performing Computations with Objects

Click on the ellipses (...) below to see the answers.

*Answer*

Ints.

Because Numpy arrays (and Pandas Series) require all values to be the same data type, they fall back to the most general, and least efficient, data type (`object`) when you mix data types. This means any computational operations running over the array/series are going to run slower than they could if the data type was numeric.
    
There is also one final issue with using `None` to represent missing values...

#### Task - Performing Calculations with `None`

1. In the code cell below, compute the sum of the `my_array` variable by adding `.sum()` at the end of the variable name.
2. Make a note of the error, what is the error message saying?

In [None]:
# your code here


#### Answer - Performing Calculations with `None`

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
my_array.sum()

*Answer*

The error message is saying the addition operator doesn't work to add an Integer to a None.
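
The failure described above can be reproduced and inspected without stopping the notebook. The following is a minimal sketch; the array `values` is a made-up stand-in for `my_array`, not data from the lesson.

```python
import numpy as np

# hypothetical stand-in for my_array: integers mixed with None
values = np.array([1, None, 2])
print("dtype:", values.dtype)  # the mixed values are stored as Python objects

message = None
try:
    # summing falls back to Python's + operator, which cannot add int and None
    values.sum()
except TypeError as err:
    # keep the error text so it can be read without a traceback
    message = str(err)

print("TypeError:", message)
```

Catching the `TypeError` like this is only a way to read the message; it does not make the sum work.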

#### Task - Convert `my_array` to a Series

1. In the code cell below, use the Pandas Series function to create a series from the Python list, `my_list`, you created earlier that contains a `None` value. 
2. Display the Series
3. What happened to the `None` type?
4. What is the data type of the Series?

In [None]:
# your code here


#### Answer - Convert `my_array` to a Series

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
my_series = pd.Series(my_list)
my_series

*Answer*

The `None` values were converted to `NaN`s and the data type of all the values in the Series is floating point.

#### Task - Performing Calculations with `NaN`

1. In the code cell below, compute the sum of the `my_series` variable by adding `.sum()` at the end of the variable name.
2. Is there an error? What happens to the missing value?

In [None]:
# your code here


#### Answer - Performing Calculations with `NaN`

Click on the ellipses (...) below to see the answers.

In [None]:
# answer 
my_series.sum()

*Answer* 

There is no error and it looks like the missing values were just ignored.
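
The two answers above (the conversion to `NaN` and the sum that skips it) can be checked on a small example. This is a sketch with made-up numbers; `skipna` is a real parameter of `Series.sum()` and is shown here only to make the default behaviour visible.

```python
import pandas as pd

# hypothetical values; the None becomes NaN when the Series is built
series = pd.Series([1, None, 2])
print(series.dtype)

# by default, sum() skips missing values
total = series.sum()

# skipna=False lets the missing value propagate into the result
strict_total = series.sum(skipna=False)

print(total, strict_total)
```

Whether skipping is the right behaviour depends on the analysis, so it is worth knowing that the default hides missing values.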

### `NaN`, A Numeric Way to Represent Missing Data

* The third-party Numpy library has a mechanism for representing missing numeric values as `NaN`. You can explicitly create a `NaN` value using `np.nan`.
* Pandas will automatically convert `None` values in numeric data into `NaN` values for convenience. (Note: Numpy does not do this.)
* In terms of data types, NaNs are considered floating point numbers in accordance with the [IEEE 754 Standard](https://en.wikipedia.org/wiki/IEEE_754)
* This means you can use them with other numeric arrays for fast computations
* Note for R users: There is no `Null`, only `NaN`
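
A few properties of `NaN` follow from the points above and are easy to verify. The sketch below uses a made-up array; `np.nansum` is a NumPy function that ignores `NaN` values and is included only for contrast.

```python
import numpy as np

missing = np.nan

# NaN is an ordinary Python float
is_float = isinstance(missing, float)

# under IEEE 754, NaN is not equal to anything, including itself
equals_itself = missing == missing

# hypothetical numeric array containing one missing value
data = np.array([1.0, np.nan, 3.0])

plain_sum = data.sum()     # NaN propagates through plain NumPy arithmetic
nan_sum = np.nansum(data)  # nansum skips the NaN

print(is_float, equals_itself, data.dtype, plain_sum, nan_sum)
```

Because `NaN == NaN` is false, missing values have to be located with functions such as `isna()` rather than with `==`.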

## Working with Missing Data Values

Pandas has several mechanisms for working with missing data. Each of these methods has a version that works with Series and a version that works with DataFrames:
* `isna()` - Generate a boolean mask of the missing values 
* `notna()` - Do the opposite of `isna()` 
* `dropna()` - Create a filtered copy of the data with no null values
* `fillna(value)` - Create a copy of the data and fill in missing values.
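
The four methods listed above can be sketched on a small Series. The values in `readings` are made up for illustration, and the fill value `0` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# hypothetical Series with two missing values
readings = pd.Series([1.0, np.nan, 3.0, np.nan])

missing_mask = readings.isna()   # True where a value is missing
present_mask = readings.notna()  # True where a value is present
dropped = readings.dropna()      # copy without the missing values
filled = readings.fillna(0)      # copy with missing values replaced by 0

print(missing_mask.tolist())
print(dropped.tolist())
print(filled.tolist())
```

Each call returns a new object; `readings` itself still contains its missing values afterwards.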

In [None]:
# fill in the missing values (0 is only an example fill value)
my_series.fillna(0)

#### Task - Identify Missing Values

1. Using the Series from the previous task, `my_series`, add `.isna()` to the end of the variable name.
2. Look at the results, what do you think the `isna()` function is doing?

In [None]:
# your code here


#### Answer - Identify Missing Values

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
my_series.isna()

#### Task - Identify NOT Missing Values

1. Using the Series from the previous task, `my_series`, add `.notna()` to the end of the variable name.
2. Look at the results, what do you think the `notna()` function is doing?

In [None]:
# your code here


#### Answer - Identify NOT Missing Values

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
my_series.notna()

#### Task - Counting Missing or Not Missing Values

This task is currently under development.

In [None]:
# How many missing values?
my_series[my_series.isna()].size

In [None]:
# How many not missing values?
my_series[my_series.notna()].size
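
An alternative to filtering and checking `.size` is to sum the boolean mask directly, since `True` counts as 1. The sketch below uses a made-up Series rather than `my_series`.

```python
import numpy as np
import pandas as pd

# hypothetical Series with two missing values
readings = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# summing a boolean mask counts the True values
n_missing = int(readings.isna().sum())
n_present = int(readings.notna().sum())

# count() also reports the number of non-missing values
n_present_alt = int(readings.count())

print(n_missing, n_present, n_present_alt)
```

Both approaches give the same counts; summing the mask simply avoids building a filtered copy first.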

### Missing Values in Dataframes

This section is currently under development.


### Filling in Missing values

This section is currently under development.


## Vectorized String Operations

The Pandas Series data structure (which means columns of a DataFrame) has a set of methods for working with strings of textual data. These methods function similarly to their plain Python equivalents, but they are *vectorized*, meaning they operate on all values in a column. This makes it easier to select a column of textual data from a DataFrame, specify a string method, and operate on every entry in that column.

String methods are not only computationally fast; they are also concise to write and cope with dirty data. We can see this in practice in the code example below.

In [None]:
# create a Python list of names
names = ['peter', 'Paul', 'MARY', 'gUIDO']

# loop over the list of names 
for name in names:
    print(name.capitalize())

#### Task - Processing Missing Values in Python

1. Copy the code cell from the example above.
2. Add a `None` or `np.nan` value to the Python list and execute. What happens? 
3. What would need to happen to accommodate dirty data if you were processing it with plain Python?

In [None]:
# your code here


#### Answer

Click on the ellipses (...) below to see the answers.

In [None]:
# create a list with null value 
names = ['peter', 'Paul', np.nan, 'MARY', 'gUIDO']

# loop over the names; the loop stops with an AttributeError at the missing value
for name in names:
    print(name.capitalize())

*answer*

Without additional handling for different data types, Python will try to run the `capitalize()` method on the `None` or `nan`, which don't have any string methods because they aren't strings. You would have to add some conditional statements or error handling to deal with those cases.

It would mean you have to write a lot more code.
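
As a rough illustration of that extra code, here is one way the plain Python loop could guard against missing values. It is a sketch, not the only option; the list `cleaned` and the choice to pass missing values through unchanged are assumptions made for this example.

```python
import numpy as np

# the example list with two kinds of missing value added
names = ['peter', 'Paul', np.nan, 'MARY', None, 'gUIDO']

cleaned = []
for name in names:
    if isinstance(name, str):
        # only real strings have a capitalize() method
        cleaned.append(name.capitalize())
    else:
        # pass None and nan through unchanged
        cleaned.append(name)

print(cleaned)
```

Every string operation in a plain Python pipeline would need a similar check.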

#### Task - Vectorized Capitalize

1. Run the code cell below to convert the Python list into a Series and use the *vectorized* string method `capitalize()`.
2. What happened to the missing data values?
3. Do you think this was fewer lines of code than working with vanilla Python?

In [None]:
# convert our list into a Series
names = pd.Series(names)
# Use the string vector function to capitalize everything
names.str.capitalize()

#### Answer

Click on the ellipses (...) below to see the answers.

*answer*

2. The missing value was automatically ignored and passed over.
3. Yes ;)

### Pandas String Methods

Pandas includes a large set of string methods for working with strings. Visit the [Pandas documentation on string methods](https://pandas.pydata.org/docs/user_guide/text.html#method-summary) to read about what they do.

|  Functions  |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |


#### Task - Playing with String Methods

1. In the cells below, try three of the string operations listed above on the Pandas Series `monte`

Remember, if you place your cursor after the `str.` you can hit tab to auto-complete and shift-tab to see suggestions.

In [None]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
monte

In [None]:
# First
monte.str.


In [None]:
# Second
monte.str.




In [None]:
# Third
monte.str.




## Dealing With Dirty Data

The following section will walk you through the process of working with real-world dirty data. In this case we will be loading the [City of Pittsburgh Trees](https://data.wprdc.org/dataset/city-trees) dataset and doing a bit of cleaning using string operations.

#### Task - Load Pittsburgh Tree Dataset

1. Use the `read_csv()` function to load the `pgh-trees.csv` file in the `datasets` directory into a dataframe called `trees`.
2. Display the Dataframe

In [None]:
# your code here


#### Answer

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
trees = pd.read_csv("datasets/pgh-trees.csv")
trees

#### Task - Why the Warning?

Loading this dataset raises a warning because one of the columns has mixed data types.

1. Which column do you think is "Column (1)" from the warning above?
2. Look at the display which shows the first and last 5 rows of the dataframe. What is the row index of the row with the problematic value?

#### Answer - Why the Warning?

Click on the ellipses (...) below to see the answers.

*answer*

The `address_number` column contains dirty data values. It should be, and mostly is, numbers representing the street number. However, there is at least one row, with the index 45704, that has the full street address.

#### Task - Select the dirty data value

1. Use the row and column indexing property, `iloc`, to select just the dirty data value identified in the previous task.
2. What is the data type of the result? How does that compare with other values in that column?
3. Why is this a problem?

In [None]:
# your answer here

#### Answer - Select the Dirty data value

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
trees.iloc[45704, 1]

*answer*

The result is a string but most of the other values in that column are integers. However, because both types of data are present in the CSV file the data type of the entire column must be `object`.
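
The effect of a single text value on a column's data type can be seen on a tiny, made-up CSV. The sketch below reads from in-memory strings instead of `pgh-trees.csv`; the rows are hypothetical, and a file this small will not trigger the mixed-types warning, only the non-numeric column type.

```python
import io
import pandas as pd

# hypothetical CSV where one address_number holds a full street address
mixed_csv = "id,address_number\n1,1200\n2,499 N LANG AVE\n3,18\n"
# the same data without the dirty row
clean_csv = "id,address_number\n1,1200\n3,18\n"

mixed = pd.read_csv(io.StringIO(mixed_csv))
numeric_only = pd.read_csv(io.StringIO(clean_csv))

# one text value is enough to stop the column from being numeric
mixed_is_numeric = pd.api.types.is_numeric_dtype(mixed["address_number"])
clean_is_numeric = pd.api.types.is_numeric_dtype(numeric_only["address_number"])

print(mixed["address_number"].dtype, mixed_is_numeric)
print(numeric_only["address_number"].dtype, clean_is_numeric)
```

The exact name of the text data type can differ between pandas versions, which is why the sketch only checks whether the column is numeric.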

## Finding Dirty Rows with Pandas

The first step in working with dirty data values is to determine how many dirty data values you have.

The following tasks will use several Pandas concepts and functions in order to generate the table below.
- Select the column containing dirty data values from the dataframe
- Create a *boolean mask* of True/False values for each value in the column determining if the value is clean or dirty
- Filter the whole dataframe with the mask to show only the rows with dirty data

The following table shows just the rows of the `trees` dataset that contain dirty data values in the `address_number` column.

|       |         id | address_number   | street     | common_name       | scientific_name      |   height |   width |...|
|------:|-----------:|:-----------------|:-----------|:------------------|:---------------------|---------:|--------:|---|
| 45294 |  293638645 | 1200 Diana       | nan        | Locust: Black     | Robinia pseudoacacia |       65 |     nan |...|
| 45308 |   33483504 | 1402 w north ave | nan        | nan               | nan                  |      nan |     nan |...|
| 45322 |   49984623 | 18 sprain st     | nan        | Mulberry: Red     | Morus rubra          |       40 |      30 |...|
| 45323 | 1817752170 | 18 sprain st     | nan        | Maple: Sugar      | Acer saccharum       |       30 |     nan |...|
| 45324 | 1918808238 | 502 Foreland     | nan        | Pear: Callery     | Pyrus calleryana     |       15 |     nan |...|
| 45325 |  472070793 | 502 Foreland st  | nan        | Pear: Callery     | Pyrus calleryana     |      nan |     nan |...|
| 45333 |  785059502 | 345 dalton ave   | nan        | Oak: Northern Red | Quercus rubra        |       70 |     nan |...|
| 45704 |   39047675 | 499 N LANG AVE   | N LANG AVE | Maple: Norway     | Acer platanoides     |       15 |      15 |...|
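
The three steps listed above (select a column, build a boolean mask, filter with it) can be sketched on a toy DataFrame before trying them on `trees`. The DataFrame `toy` and its values are made up for illustration.

```python
import pandas as pd

# hypothetical miniature of the trees data
toy = pd.DataFrame({
    "address_number": ["1200", "18 sprain st", None, "502 Foreland"],
    "common_name": ["Maple: Sugar", "Mulberry: Red",
                    "Oak: Northern Red", "Pear: Callery"],
})

# step 1: select the column that may contain dirty values
column = toy["address_number"]

# step 2: build a True/False mask; True marks values containing letters
mask = column.str.contains("[A-Za-z]", na=False)

# step 3: use the mask to keep only the dirty rows
dirty_rows = toy[mask]

print(mask.tolist())
print(dirty_rows)
```

The mask has one entry per row, so it can be used to index any DataFrame that shares the same row index.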


#### Task - Select the column with dirty data

1. Use column indexing to select only the `address_number` column from the `trees` DataFrame
2. What is the pandas data structure you get back?

In [None]:
# your code here


#### Answer - Selecting the column with dirty data 

Click on the ellipses (...) below to see the answers.

In [None]:
# answer 
trees['address_number']
# series

### How many dirty values?

The next step in our data cleaning is to identify which values in the `address_number` column are not numbers. There are several string methods that *might* be helpful for identifying values that are only numbers or that contain alphabetic characters, such as `isalpha()`, `isdigit()`, `isnumeric()` and `isdecimal()`. However, you will notice there are also many missing values.

The column contains three possible values:
- An integer representing the street address number
- A string representing the street number and name
- A `nan` representing a missing street number


#### Task - Identify values with alphabetic characters

1. Execute the code cell below, which adds the code `.str.contains("[A-Za-z]", na=False)` to the results of the previous task.
2. Review the results and look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html) for the `contains()` function. Consider the following questions:
    1. What do you think a result of `True` means? What do you think a result of `False` means?
    2. What does the `str.contains()` function do?
    3. The parameter `"[A-Za-z]"` is a [regular expression](https://en.wikipedia.org/wiki/Regular_expression). What do you think it matches?
    4. Why do you need to add the `na=False` parameter? What happens to the results if you remove it?


In [None]:
# Check for alphabetic characters
trees['address_number'].str.contains("[A-Za-z]", na=False)

#### Answer -  Check for alphabetic characters

Click on the ellipses (...) below to see the answers.

*answer*

1. `True` means the value contains alphabetic characters. `False` means it does not.
2. The `contains()` function returns `True` if the value matches the pattern and `False` otherwise.
3. It is a regular expression that matches any alphabetic character, upper or lower case.
4. To treat missing values as `False` instead of leaving them as missing in the result.
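
The role of `na=False` is easiest to see by running the same pattern with and without it. This is a minimal sketch on a made-up Series; `dtype=object` is set explicitly as an assumption so the example mirrors a mixed column like `address_number`.

```python
import numpy as np
import pandas as pd

# hypothetical values: a clean number, a dirty address, and a missing value
values = pd.Series(["1200", "18 sprain st", np.nan], dtype=object)

# without na=..., the missing value stays missing in the result
without_na = values.str.contains("[A-Za-z]")

# with na=False, the missing value is reported as False
with_na = values.str.contains("[A-Za-z]", na=False)

print(without_na.tolist())
print(with_na.tolist())
```

Only the second result is a clean True/False mask, which is what filtering a DataFrame needs.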

#### Task - Filter with a Boolean Mask

1. Save the results of the previous task into a variable called `mask` 
2. Use the `mask` variable as an index for the `trees` dataframe. The results should match the data table with 8 rows at the beginning of the section.
3. How many rows contain dirty data? 

In [None]:
# your code here


#### Answer - Filter with a Boolean Mask

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
mask = trees['address_number'].str.contains("[A-Za-z]", na=False)
trees[mask]

*answer* 

There are 8 rows with dirty data values in the address number column.

## Much Ado about Dirty Data

The data table from the previous task shows the rows in the Pittsburgh Trees dataset that have dirty data values.

The question is what do we do about it?


#### Task - How Big are the data in Memory?

Because we are about to start cleaning the dataset it is not a bad idea to create a copy of the data (in case we screw up while cleaning). However, it is important to check and make sure the DataFrame isn't so large that creating a copy will use up too much memory.

1. Use the `info()` method to see how big the `trees` DataFrame is in memory.
2. At what size do you think making a copy of the data is not a good idea?

In [None]:
# your code here


#### Answer

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
trees.info()

*answer*

The trees dataset takes about 20 megabytes of memory. This is a small enough dataset that the original plus a copy would only take up around 40 megabytes.

Because Pandas loads everything into memory (RAM), you need to be mindful if your dataset is larger than half of the available memory on your computer.
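
If you want the memory figure as a number rather than reading it from the `info()` output, `memory_usage()` returns the size of each column in bytes. The sketch below uses a made-up DataFrame; `deep=True` asks pandas to measure the actual contents of text columns.

```python
import pandas as pd

# hypothetical DataFrame with one numeric and one text column
toy = pd.DataFrame({
    "id": range(1000),
    "common_name": ["Maple: Norway"] * 1000,
})

# bytes used by the index and by each column
per_column = toy.memory_usage(deep=True)

# total size, converted to megabytes
total_bytes = int(per_column.sum())
total_megabytes = total_bytes / 1024**2

print(per_column)
print(f"{total_megabytes:.3f} MB")
```

Comparing `total_bytes` with the memory available on your machine gives a quick sense of whether a copy is affordable.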

#### Task - Create a Copy

1. Run the code cell below if you think the dataset is small enough to create a duplicate copy in memory.

FYI: The JupyterHub server has 62 Gigabytes of memory.

In [None]:
# create a copy of the tree data
clean_trees = trees.copy()

#### Task - Missing Comments when cleaning the `clean_trees` DataFrame

1. The following Python code cell will loop over the index values of the rows with dirty data. For each of the rows it will extract the street number and name and assign them to the correct column. However, the code is missing comments and it is hard to understand what each line is doing. Add descriptive comments above each line after the `#` mark in the code below.

Once you have run the code, the `clean_trees` DataFrame will actually be clean. If you need to re-execute the code below and see the proper output, re-create the copy of the `clean_trees` dataset by re-running the code from the previous task.

Hint: The `address, *rest = dirty_value.split()` is some fancy Pythonic code. A less fancy way to perform the same operations would be:

```python
values = dirty_value.split()
address = values[0]
rest = values[1:]
```

In [None]:
# 
for index in clean_trees[mask].index:
    # 
    dirty_value = clean_trees.iloc[index,1]
    # 
    print(dirty_value)
    # 
    address, *rest = dirty_value.split()
    # 
    street = " ".join(rest).upper()
    # 
    print(address, "---", street)
    print()
    #
    clean_trees.iloc[index,1] = address
    # 
    clean_trees.iloc[index,2] = street

#### Answer

Click on the ellipses (...) below to see the answers.

In [None]:
# loop over the index values of the dirty rows
for index in clean_trees[mask].index:
    # extract dirty data value from the address_number column
    dirty_value = clean_trees.iloc[index,1]
    # print out the value
    print(dirty_value)
    # split on spaces. put the first value in the variable address and
    # the remaining values in a list variable called rest
    address, *rest = dirty_value.split()
    # join all the values in the rest list on spaces and 
    # call the uppercase string method
    street = " ".join(rest).upper()
    # Print the address and street separated by dashes then a blank line
    print(address, "---", street)
    print()
    # update the address_number column value in the dataframe 
    clean_trees.iloc[index,1] = address
    # update the street name column value in the dataframe
    clean_trees.iloc[index,2] = street
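
The loop above can also be written without a loop by using vectorized string methods and label-based indexing with `loc`. This is an alternative sketch, not the lesson's method; the DataFrame `toy` and its values are made up, and `n=1` is used so the value is split only at the first space.

```python
import pandas as pd

# hypothetical miniature of the trees data
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "address_number": ["1200", "18 sprain st", "502 Foreland"],
    "street": ["MAIN ST", None, None],
})

# mark the rows whose address_number contains letters
mask = toy["address_number"].str.contains("[A-Za-z]", na=False)

# split each dirty value once: column 0 is the number, column 1 the rest
parts = toy.loc[mask, "address_number"].str.split(n=1, expand=True)

# write both pieces back to the matching rows by label
toy.loc[mask, "address_number"] = parts[0]
toy.loc[mask, "street"] = parts[1].str.upper()

print(toy)
```

Using `loc` with column names also avoids relying on column positions and on the row index being the default 0, 1, 2, ... sequence.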

#### Task - Compare the clean and dirty trees 

1. Review the [dataframes documentation for the `compare()` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html?highlight=compare#pandas.DataFrame.compare) and write code to compare `clean_trees` with the original dirty `trees` dataframe.
2. What do the `self` and `other` column labels mean?
3. Does the result match your expectations?

In [None]:
# your code here


#### Answer

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

# use the compare method to compare clean and dirty tree data
clean_trees.compare(trees)

*answer* 

2. Self refers to `clean_trees` dataframe because it was the dataframe whose `compare()` method was called. Other refers to the `trees` data because it was the dataframe passed as a parameter. 
3. Yes, self shows the `address_number` column split whereas other shows the data combined in the `address_number` column.
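
The shape of the `compare()` output is easier to read on a small example. The sketch below builds two made-up DataFrames that differ in one row; the names `original` and `cleaned` are placeholders for `trees` and `clean_trees`.

```python
import pandas as pd

# hypothetical "dirty" data
original = pd.DataFrame({
    "address_number": ["18 sprain st", "1200"],
    "street": [None, "MAIN ST"],
})

# copy it and clean the first row by hand
cleaned = original.copy()
cleaned.loc[0, "address_number"] = "18"
cleaned.loc[0, "street"] = "SPRAIN ST"

# self = cleaned (the caller), other = original (the argument)
differences = cleaned.compare(original)

print(differences)
```

Only rows and columns that differ appear in the result, so the unchanged second row is left out.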

#### Task - Save clean to disk

1. Write your `clean_trees` DataFrame to disk as a CSV file called `pgh-trees-clean.csv` (you don't need to save it in a separate directory). 
2. Don't save the row index values when saving the CSV files. Review the [to_csv documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html?highlight=to_csv#pandas.DataFrame.to_csv) to identify the appropriate parameter for disabling this behavior.
3. Open the newly saved CSV file in Jupyter via the File Browser in the Left Menu Bar.

In [None]:
# your code here


#### Answer

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
clean_trees.to_csv("pgh-trees-clean.csv", index=False)

#### Task - Reload the Trees dataset

1. Reload the trees dataset from the CSV file on disk. Do you see a warning?
2. Use `info()` to inspect the column data types. What data type is the `address_number` column? Why is it that data type?

In [None]:
# your code here


#### Answer

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
clean_trees = pd.read_csv("pgh-trees-clean.csv")
clean_trees.info()

*answer*

1. No warning!
2. The data type of the `address_number` column is floating point. It is good that it's a numeric data type. It isn't an integer because there are missing values.
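
The same round trip can be sketched in memory to see why the column comes back as floating point. The DataFrame below is made up, and an `io.StringIO` buffer stands in for the CSV file on disk.

```python
import io
import pandas as pd

# hypothetical cleaned data: numbers stored as text plus one missing value
cleaned = pd.DataFrame({
    "id": [1, 2, 3],
    "address_number": ["1200", "18", None],
})

# write to an in-memory buffer instead of a file, without the row index
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# rewind the buffer and read the CSV text back in
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.dtypes)
```

The missing value is read back as `NaN`, which is a float, so the whole column becomes floating point rather than integer.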

## Working with Textual Data

String methods are useful not only for cleaning dirty data, but for processing textual data and creating new, more useful, data values for performing computations. The following series of tasks will use additional string methods to perform some complicated textual data manipulation on a dataset about the Mayors of Pittsburgh.

#### Task - Load the Pittsburgh Mayors dataset

1. Load the `pgh-mayors.tsv` file from the `datasets` directory into a DataFrame called `mayors`. Make a note of the file extension, `tsv` or tab-separated values, and specify your separator accordingly.
2. Display the Dataframe

In [None]:
# your code here


#### Answer - Load the Pittsburgh Mayors dataset

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
mayors = pd.read_csv("datasets/pgh-mayors.tsv", sep="\t")
mayors

#### Task - Split Mayor Names

The `Mayor` column contains the full name of each Mayor, but it would be nice if we had separate columns for their first and last names. To do this we need to use several string methods to process the data values so they can be assigned to new columns.

1. Review the list of [String Methods](https://pandas.pydata.org/docs/user_guide/text.html#method-summary) from the Pandas user guide.
2. Add a string method to the end of the code below to separate each Mayor's first and last name
3. What type of Pandas data structure gets returned? What data type are the individual values?

In [None]:
# finish this code
mayors["Mayor"].str.

#### Answer - Split Mayor Names

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
mayors["Mayor"].str.split()

#### Task - Extract Values from Complex Data Types

The results of the previous task have separated the Mayoral names, but the results are stored as Python lists *inside* a Pandas Series. Pandas provides a mechanism for extracting specific values from complex data structures that are stored as values inside of a Series. Review the [documentation for `str.get()` method](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.get.html) to see how to extract values at particular index positions.

1. Use `str.get()` method with method chaining to extract a Series of just the first names 

In [None]:
# your code here


#### Answer - Extract Values from Complex Data Types

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
mayors["Mayor"].str.split().str.get(0)

#### Task - Extract Other Values from Complex Data Types

1. Use `str.get()` method with method chaining to extract a Series of just the last names

In [None]:
# your code here


#### Answer

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
mayors["Mayor"].str.split().str.get(1)

#### Task - Create Columns using Assignment

Now that we can extract sequences of values representing the first and last names of Pittsburgh's mayors, the next step is to modify the DataFrame and add columns with the extracted values. The example code below shows how to use Python's assignment operator in combination with the column named index to define new column names with the first and last names.

```python
mayors["First"] = mayors["Mayor"].str.split().str.get(0)
mayors["Last"] = mayors["Mayor"].str.split().str.get(1)
```
However, there is a quicker (in terms of writing code) way to add both columns. The code cell below provides an alternative way to insert columns. Consider the following questions when executing the code cell below.

1. Why is the `expand` parameter necessary when assigning the results to multiple columns?
2. Where does this code place the new columns using this approach?

In [None]:
# alternative way to create columns
mayors[["First","Last"]] = mayors["Mayor"].str.split(expand=True)
mayors

#### Answer 

Click on the ellipses (...) below to see the answers.

*answer*

1. The `expand` parameter is necessary because it returns the results as a DataFrame with two columns, which fits the multiple column assignment. Without the `expand` parameter the result is a Series made of Python lists, and the multiple assignment can't infer which values should go where.
2. When creating columns using an assignment operator it will create the columns at the end of the dataframe.
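
The difference between a Series of lists and an expanded DataFrame can be seen on a short Series of names. The sketch below reuses a few names from the `monte` example instead of the mayors data; `get(-1)` (the last element) is a variation not used in the tasks, shown as one way to handle names with more than two parts.

```python
import pandas as pd

# a few names borrowed from the earlier monte example
names = pd.Series(["Graham Chapman", "John Cleese", "Terry Gilliam"])

# without expand: each value is a Python list
as_lists = names.str.split()

# str.get() pulls one position out of every list
first = as_lists.str.get(0)
last = as_lists.str.get(-1)

# with expand=True: one column per piece
as_frame = names.str.split(expand=True)

print(type(as_lists.iloc[0]))
print(first.tolist(), last.tolist())
print(as_frame.shape)
```

Because `as_frame` has two columns, it lines up with an assignment to two column names.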

#### Task - Insert New Columns at a Specific Position

There is more than one way to add columns of data to your DataFrame. One of the more precise is the `insert` method of a DataFrame. Review the [documentation for a DataFrame's `insert` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.insert.html) and consider the code example below.

```python
mayors.insert(<column position>, <column name>, <values>)
```
1. What value would insert a new column immediately to the left of the `Mayor` column?
2. What happens if you use the same name as we used before? What about a new name?
3. Use the results from the previous task to specify the appropriate sequence of values to insert
4. Write your code twice to create columns for both the first and last names


In [None]:
# your code here


#### Answer - Insert New Columns at a Specific Position

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
mayors.insert(1, "First Name", mayors["Mayor"].str.split().str.get(0))
mayors.insert(2, "Last Name", mayors["Mayor"].str.split().str.get(1))
mayors
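
What `insert()` does with a repeated column name can be checked on a toy DataFrame. The sketch below uses made-up data in a DataFrame called `people`; the second `insert()` call is expected to fail, so it is wrapped in `try`/`except`.

```python
import pandas as pd

# hypothetical stand-in for the mayors data
people = pd.DataFrame({
    "Number": [1, 2],
    "Name": ["Graham Chapman", "John Cleese"],
})

first_names = people["Name"].str.split().str.get(0)

# position 1 places the new column immediately to the left of "Name"
people.insert(1, "First Name", first_names)

duplicate_error = None
try:
    # inserting the same column name again is rejected by default
    people.insert(1, "First Name", first_names)
except ValueError as err:
    duplicate_error = str(err)

print(people.columns.tolist())
print(duplicate_error)
```

Unlike most methods in this lesson, `insert()` modifies the DataFrame in place and returns nothing.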

#### Task - Rearrange Columns with the `reindex` method

Now we have a DataFrame with a bunch of duplicated columns and in a wonky order. You can change the order of columns using the [`reindex()` dataframe method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reindex.html?highlight=reindex#pandas.DataFrame.reindex). 

1. Replace the `???` with a Python list of strings representing the column names you want in a sensible order.
2. What happens if you forget to include a column name? 
3. What if you have a typo in a column name?

In [None]:
# Your code here
header_list = ???
mayors.reindex(columns=header_list)

#### Answer - Rearrange Columns with the `reindex` method

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
header_list = ["Number", "First Name", "Last Name", "Mayor", "Term", "Party", "Notes"]
mayors.reindex(columns=header_list)

*answer*

2. If you don't include a column name it will drop that column
3. If you have a typo it will create a new column of missing values.
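
Both behaviours can be confirmed on a toy DataFrame. This is a minimal sketch with made-up data; `"Frist"` is a deliberate typo.

```python
import pandas as pd

# hypothetical stand-in for the mayors data
people = pd.DataFrame({
    "Number": [1, 2],
    "First": ["Graham", "John"],
    "Last": ["Chapman", "Cleese"],
})

# leaving "First" out of the list drops that column from the result
dropped = people.reindex(columns=["Number", "Last"])

# a misspelled name creates a new column filled with missing values
typo = people.reindex(columns=["Number", "Frist", "Last"])

print(dropped.columns.tolist())
print(typo)
```

`reindex()` returns a new DataFrame, so `people` keeps its original columns until you assign the result back.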