# Transforming DataFrames

## Introducing DataFrames

### The pandas package

The pandas package is widely used across the Python data science community because it makes data manipulation and data visualization both easy and powerful. The package is built on NumPy and Matplotlib.

In [1]:
import pandas as pd

### The pandas DataFrame

A DataFrame is a data structure for handling and manipulating rectangular data, that is, data organized so that each row is an observation and each column is a property. Each column can hold a different data type.

In [23]:
homelessness = pd.read_csv('./data/homelessness.csv', index_col=0)
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
0,East South Central,Alabama,2570.0,864.0,4887681
1,Pacific,Alaska,1434.0,582.0,735139
2,Mountain,Arizona,7259.0,2606.0,7158024
3,West South Central,Arkansas,2280.0,432.0,3009733
4,Pacific,California,109008.0,20964.0,39461588
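
As a minimal sketch of the "one type per column" idea, the snippet below builds a small DataFrame by hand instead of reading the CSV. The name `sample` is hypothetical, and the three rows are copied from the `head()` output above.

```python
import pandas as pd

# Hypothetical three-row sample; values copied from the head() output above
sample = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Arizona"],
        "individuals": [2570.0, 1434.0, 7259.0],
        "state_pop": [4887681, 735139, 7158024],
    }
)

# Each column keeps its own dtype (text, float, integer)
print(sample.dtypes)
```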


### Exploring a DataFrame

In [35]:
# Print the first rows of a DataFrame
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
0,East South Central,Alabama,2570.0,864.0,4887681
1,Pacific,Alaska,1434.0,582.0,735139
2,Mountain,Arizona,7259.0,2606.0,7158024
3,West South Central,Arkansas,2280.0,432.0,3009733
4,Pacific,California,109008.0,20964.0,39461588


In [36]:
# Print information about the column types and missing values
homelessness.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   region          51 non-null     object 
 1   state           51 non-null     object 
 2   individuals     51 non-null     float64
 3   family_members  51 non-null     float64
 4   state_pop       51 non-null     int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 2.4+ KB


In [37]:
# Print the DataFrame's dimensions
homelessness.shape

(51, 5)

In [38]:
# Print some summary statistics of the DataFrame's contents
homelessness.describe()

Unnamed: 0,individuals,family_members,state_pop
count,51.0,51.0,51.0
mean,7225.784314,3504.882353,6405637.0
std,15991.025083,7805.411811,7327258.0
min,434.0,75.0,577601.0
25%,1446.5,592.0,1777414.0
50%,3082.0,1482.0,4461153.0
75%,6781.5,3196.0,7340946.0
max,109008.0,52070.0,39461590.0


### Components of a DataFrame

In [39]:
# A 2D NumPy array of the DataFrame's contents
homelessness.values

array([['East South Central', 'Alabama', 2570.0, 864.0, 4887681],
       ['Pacific', 'Alaska', 1434.0, 582.0, 735139],
       ['Mountain', 'Arizona', 7259.0, 2606.0, 7158024],
       ['West South Central', 'Arkansas', 2280.0, 432.0, 3009733],
       ['Pacific', 'California', 109008.0, 20964.0, 39461588],
       ['Mountain', 'Colorado', 7607.0, 3250.0, 5691287],
       ['New England', 'Connecticut', 2280.0, 1696.0, 3571520],
       ['South Atlantic', 'Delaware', 708.0, 374.0, 965479],
       ['South Atlantic', 'District of Columbia', 3770.0, 3134.0, 701547],
       ['South Atlantic', 'Florida', 21443.0, 9587.0, 21244317],
       ['South Atlantic', 'Georgia', 6943.0, 2556.0, 10511131],
       ['Pacific', 'Hawaii', 4131.0, 2399.0, 1420593],
       ['Mountain', 'Idaho', 1297.0, 715.0, 1750536],
       ['East North Central', 'Illinois', 6752.0, 3891.0, 12723071],
       ['East North Central', 'Indiana', 3776.0, 1482.0, 6695497],
       ['West North Central', 'Iowa', 1711.0, 1038.0, 3148618]

In [40]:
# Column names of the DataFrame
homelessness.columns

Index(['region', 'state', 'individuals', 'family_members', 'state_pop'], dtype='object')

In [41]:
# Index object of the DataFrame
homelessness.index

Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
            34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
            50],
           dtype='int64')
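
The three components can also be pulled apart on a tiny frame, which makes their shapes easier to see. This is only a sketch: `df` is a hypothetical two-row frame, and it uses `.to_numpy()`, which the pandas documentation recommends over `.values` for new code.

```python
import pandas as pd

# Hypothetical two-row frame used only for illustration
df = pd.DataFrame({"state": ["Alabama", "Alaska"], "individuals": [2570.0, 1434.0]})

values = df.to_numpy()      # 2D NumPy array holding the contents
columns = list(df.columns)  # column labels
index = list(df.index)      # row labels (0 and 1 by default)

print(values.shape, columns, index)
```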

## Sorting and subsetting

### Sorting

Finding interesting information may be easier if we reorder the rows based on one variable, or even on several variables.

In [42]:
# Sort homelessness by the number of homeless individuals, from smallest to largest
# Save this as homelessness_ind
homelessness_ind = homelessness.sort_values('individuals')
homelessness_ind.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
50,Mountain,Wyoming,434.0,205.0,577601
34,West North Central,North Dakota,467.0,75.0,758080
7,South Atlantic,Delaware,708.0,374.0,965479
39,New England,Rhode Island,747.0,354.0,1058287
45,New England,Vermont,780.0,511.0,624358


In [43]:
# Sort homelessness by the number of homeless family_members in descending order
# Save this as homelessness_fam
homelessness_fam = homelessness.sort_values('family_members', ascending=False)
homelessness_fam.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
32,Mid-Atlantic,New York,39827.0,52070.0,19530351
4,Pacific,California,109008.0,20964.0,39461588
21,New England,Massachusetts,6811.0,13257.0,6882635
9,South Atlantic,Florida,21443.0,9587.0,21244317
43,West South Central,Texas,19199.0,6111.0,28628666


In [44]:
# Sort homelessness first by region (ascending), and then by number of family members (descending) 
# Save this as homelessness_reg_fam
homelessness_reg_fam = homelessness.sort_values(['region', 'family_members'], ascending=[True, False])
homelessness_reg_fam.head(10)

Unnamed: 0,region,state,individuals,family_members,state_pop
13,East North Central,Illinois,6752.0,3891.0,12723071
35,East North Central,Ohio,6929.0,3320.0,11676341
22,East North Central,Michigan,5209.0,3142.0,9984072
49,East North Central,Wisconsin,2740.0,2167.0,5807406
14,East North Central,Indiana,3776.0,1482.0,6695497
42,East South Central,Tennessee,6139.0,1744.0,6771631
17,East South Central,Kentucky,2735.0,953.0,4461153
0,East South Central,Alabama,2570.0,864.0,4887681
24,East South Central,Mississippi,1024.0,328.0,2981020
32,Mid-Atlantic,New York,39827.0,52070.0,19530351
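
A self-contained sketch of the multi-key sort, for readers without the CSV at hand. The frame `small` is hypothetical and its four rows are taken from the outputs above. Note that `sort_values` returns a new DataFrame and leaves the original row order untouched.

```python
import pandas as pd

# Hypothetical four-row frame; values copied from the outputs above
small = pd.DataFrame(
    {
        "region": ["East South Central", "East North Central",
                   "East North Central", "East South Central"],
        "state": ["Alabama", "Illinois", "Indiana", "Tennessee"],
        "family_members": [864.0, 3891.0, 1482.0, 1744.0],
    }
)

# Region ascending, then family_members descending within each region
small_sorted = small.sort_values(
    ["region", "family_members"], ascending=[True, False]
)

print(small_sorted)
```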


### Subsetting columns

We can use bracket notation to select columns from a DataFrame.

In [46]:
# Print the first five values of the individuals column
homelessness['individuals'].head()

0      2570.0
1      1434.0
2      7259.0
3      2280.0
4    109008.0
Name: individuals, dtype: float64

In [47]:
# Create a DataFrame called state_fam that contains only the state and family_members columns of homelessness, in that order
# Print the head of the result
state_fam = homelessness[['state', 'family_members']]
state_fam.head()

Unnamed: 0,state,family_members
0,Alabama,864.0
1,Alaska,582.0
2,Arizona,2606.0
3,Arkansas,432.0
4,California,20964.0


### Subsetting rows

We can filter rows from a DataFrame based on a boolean condition.

In [49]:
# Filter homelessness for cases where the number of individuals is greater than ten thousand
homelessness[homelessness['individuals'] > 10000]

Unnamed: 0,region,state,individuals,family_members,state_pop
4,Pacific,California,109008.0,20964.0,39461588
9,South Atlantic,Florida,21443.0,9587.0,21244317
32,Mid-Atlantic,New York,39827.0,52070.0,19530351
37,Pacific,Oregon,11139.0,3337.0,4181886
43,West South Central,Texas,19199.0,6111.0,28628666
47,Pacific,Washington,16424.0,5880.0,7523869


We can filter rows based on a specific variable value.

In [50]:
# Filter homelessness for cases where the US Census region is "Mountain"
homelessness[homelessness['region'] == 'Mountain']

Unnamed: 0,region,state,individuals,family_members,state_pop
2,Mountain,Arizona,7259.0,2606.0,7158024
5,Mountain,Colorado,7607.0,3250.0,5691287
12,Mountain,Idaho,1297.0,715.0,1750536
26,Mountain,Montana,983.0,422.0,1060665
28,Mountain,Nevada,7058.0,486.0,3027341
31,Mountain,New Mexico,1949.0,602.0,2092741
44,Mountain,Utah,1904.0,972.0,3153550
50,Mountain,Wyoming,434.0,205.0,577601


We can use multiple conditions to filter a DataFrame.

In [52]:
# Filter homelessness for cases where the number of family_members is less than one thousand and the region is "Pacific"
homelessness[(homelessness['family_members'] < 1000) & (homelessness['region'] == 'Pacific')]

Unnamed: 0,region,state,individuals,family_members,state_pop
1,Pacific,Alaska,1434.0,582.0,735139
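
The sketch below shows how conditions combine: `&` for "and", `|` for "or", and `~` for "not". Each comparison needs its own parentheses because `&` and `|` bind more tightly than `<` and `==` in Python. The frame `small` is hypothetical, with rows copied from the outputs above.

```python
import pandas as pd

# Hypothetical four-row frame; values copied from the outputs above
small = pd.DataFrame(
    {
        "region": ["Pacific", "Mountain", "Pacific", "Mountain"],
        "state": ["Alaska", "Arizona", "California", "Nevada"],
        "family_members": [582.0, 2606.0, 20964.0, 486.0],
    }
)

few_fam = small["family_members"] < 1000
pacific = small["region"] == "Pacific"

both = small[few_fam & pacific]    # both conditions hold
either = small[few_fam | pacific]  # at least one condition holds
not_pacific = small[~pacific]      # condition negated

print(both["state"].tolist())         # ['Alaska']
print(either["state"].tolist())       # ['Alaska', 'California', 'Nevada']
print(not_pacific["state"].tolist())  # ['Arizona', 'Nevada']
```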


If we want to subset rows based on several values of a categorical variable, a concise way is to use the `.isin()` method.

In [53]:
# Filter homelessness for cases where the state is in the list of Mojave Desert states, canu

canu = ["California", "Arizona", "Nevada", "Utah"]
homelessness[homelessness['state'].isin(canu)]

Unnamed: 0,region,state,individuals,family_members,state_pop
2,Mountain,Arizona,7259.0,2606.0,7158024
4,Pacific,California,109008.0,20964.0,39461588
28,Mountain,Nevada,7058.0,486.0,3027341
44,Mountain,Utah,1904.0,972.0,3153550
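
As a rough sketch, `.isin()` gives the same rows as chaining `==` comparisons with `|`, and prefixing the mask with `~` keeps the rows that are not in the list. The frame `small` is hypothetical and holds only state names.

```python
import pandas as pd

# Hypothetical frame with state names only
small = pd.DataFrame(
    {"state": ["Arizona", "California", "Nevada", "Oregon", "Texas", "Utah"]}
)

canu = ["California", "Arizona", "Nevada", "Utah"]

in_canu = small[small["state"].isin(canu)]       # states in the list
not_in_canu = small[~small["state"].isin(canu)]  # all other states

# Equivalent, but longer, version with chained comparisons
chained = small[
    (small["state"] == "California")
    | (small["state"] == "Arizona")
    | (small["state"] == "Nevada")
    | (small["state"] == "Utah")
]

print(not_in_canu["state"].tolist())  # ['Oregon', 'Texas']
```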


## New columns

### Derivation of new columns 

You aren't stuck with just the data you are given. Instead, you can add new columns to a DataFrame. You can create new columns from scratch, but it is also common to derive them from other columns. This process goes by many names, such as transforming, mutating, and feature engineering.

In [54]:
# Add a new column to homelessness, named total, containing the sum of the individuals and family_members columns
homelessness['total'] = homelessness['individuals'] + homelessness['family_members']

# Add another column to homelessness, named p_individuals, containing the proportion of homeless people in each state who are individuals
homelessness['p_individuals'] = homelessness['individuals'] / homelessness['total']

homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop,total,p_individuals
0,East South Central,Alabama,2570.0,864.0,4887681,3434.0,0.748398
1,Pacific,Alaska,1434.0,582.0,735139,2016.0,0.71131
2,Mountain,Arizona,7259.0,2606.0,7158024,9865.0,0.735834
3,West South Central,Arkansas,2280.0,432.0,3009733,2712.0,0.840708
4,Pacific,California,109008.0,20964.0,39461588,129972.0,0.838704
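
Deriving a column combines naturally with the sorting and subsetting steps from the earlier sections. The sketch below is one possible way to chain them. The column name `indiv_per_10k` and the threshold of 10 are assumptions chosen for illustration, and the five rows are copied from the `head()` output above.

```python
import pandas as pd

# Hypothetical five-row frame; values copied from the head() output above
small = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
        "individuals": [2570.0, 1434.0, 7259.0, 2280.0, 109008.0],
        "state_pop": [4887681, 735139, 7158024, 3009733, 39461588],
    }
)

# New column (assumed name): homeless individuals per 10,000 residents
small["indiv_per_10k"] = 10000 * small["individuals"] / small["state_pop"]

# Keep states above an illustrative threshold, highest rate first
high = small[small["indiv_per_10k"] > 10]
high_sorted = high.sort_values("indiv_per_10k", ascending=False)

# Show only the columns of interest
result = high_sorted[["state", "indiv_per_10k"]]
print(result)
```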
