## `Section 01: Transforming DataFrames`

### 01- Inspecting a DataFrame
* Print the head of the `homelessness` DataFrame.
* Print information about `homelessness`.
* Print the shape of `homelessness`.
* Print a description of `homelessness`.



In [7]:
# Import the homelessness data
import pandas as pd
homelessness = pd.read_csv('data/homelessness.csv')
homelessness.head()

             region       state  individuals  family_members  state_pop
0  EastSouthCentral     Alabama         2570             864    4887681
1           Pacific      Alaska         1434             582     735139
2          Mountain     Arizona         7259            2606    7158024
3  WestSouthCentral    Arkansas         2280             432    3009733
4           Pacific  California       109008           20964   39461588


In [8]:
# Print the head of the homelessness data
print(homelessness.head())

# Print information about homelessness
# (info() prints its summary directly and returns None, so print() also shows "None")
print(homelessness.info())

# Print the shape of homelessness
print(homelessness.shape)

# Print a description of homelessness
print(homelessness.describe())


             region       state  individuals  family_members  state_pop
0  EastSouthCentral     Alabama         2570             864    4887681
1           Pacific      Alaska         1434             582     735139
2          Mountain     Arizona         7259            2606    7158024
3  WestSouthCentral    Arkansas         2280             432    3009733
4           Pacific  California       109008           20964   39461588
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   region          51 non-null     object
 1   state           51 non-null     object
 2   individuals     51 non-null     int64 
 3   family_members  51 non-null     int64 
 4   state_pop       51 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 2.1+ KB
None
(51, 5)
         individuals  family_members     state_pop
count      51.000000       51.000000  5.1000

### 02- Parts of a DataFrame
* Import `pandas` using the alias `pd`.
* Print a 2D NumPy array of the values in `homelessness`.
* Print the column names of `homelessness`.
* Print the index of `homelessness`.

In [9]:
# Import pandas using the alias pd
import pandas as pd 

# Print the values of homelessness
print(homelessness.values)
print(15 * "===")

# Print the column index of homelessness
print(homelessness.columns)
print(15 * "===")

# Print the row index of homelessness
print(homelessness.index)
print(15 * "===")


[['EastSouthCentral' 'Alabama' 2570 864 4887681]
 ['Pacific' 'Alaska' 1434 582 735139]
 ['Mountain' 'Arizona' 7259 2606 7158024]
 ['WestSouthCentral' 'Arkansas' 2280 432 3009733]
 ['Pacific' 'California' 109008 20964 39461588]
 ['Mountain' 'Colorado' 7607 3250 5691287]
 ['NewEngland' 'Connecticut' 2280 1696 3571520]
 ['SouthAtlantic' 'Delaware' 708 374 965479]
 ['SouthAtlantic' 'DistrictofColumbia' 3770 3134 701547]
 ['SouthAtlantic' 'Florida' 21443 9587 21244317]
 ['SouthAtlantic' 'Georgia' 6943 2556 10511131]
 ['Pacific' 'Hawaii' 4131 2399 1420593]
 ['Mountain' 'Idaho' 1297 715 1750536]
 ['EastNorthCentral' 'Illinois' 6752 3891 12723071]
 ['EastNorthCentral' 'Indiana' 3776 1482 6695497]
 ['WestNorthCentral' 'Iowa' 1711 1038 3148618]
 ['WestNorthCentral' 'Kansas' 1443 773 2911359]
 ['EastSouthCentral' 'Kentucky' 2735 953 4461153]
 ['WestSouthCentral' 'Louisiana' 2540 519 4659690]
 ['NewEngland' 'Maine' 1450 1066 1339057]
 ['SouthAtlantic' 'Maryland' 4914 2230 6035802]
 ['NewEngland' '
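The three attributes used above can be tried on a small frame. The sketch below is illustrative only: `sample` is a hypothetical three-row frame built from rows shown in the output, and it uses `to_numpy()`, which the pandas documentation recommends over `.values` for the same purpose.

```python
import pandas as pd

# Hypothetical sample: three rows copied from the homelessness output above
sample = pd.DataFrame({
    "region": ["EastSouthCentral", "Pacific", "Mountain"],
    "state": ["Alabama", "Alaska", "Arizona"],
    "individuals": [2570, 1434, 7259],
})

# to_numpy() returns the data as a 2D NumPy array, like .values
values = sample.to_numpy()

# Mixed string and integer columns share the generic object dtype
print(values.dtype)
print(values.shape)

# Column labels and row labels are stored separately from the values
print(list(sample.columns))
print(sample.index)
```

Because the columns hold both strings and integers, the array cannot use a numeric dtype, which is why the printed values above mix quoted strings and bare numbers.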

### 03- Sorting rows

* Sort `homelessness` by the number of homeless individuals, from smallest to largest, and save this as `homelessness_ind`. Print the head of the sorted DataFrame.

* Sort `homelessness` by the number of homeless `family_members` in descending order, and save this as `homelessness_fam`. Print the head of the sorted DataFrame.

* Sort `homelessness` first by region (ascending), and then by number of family members (descending). Save this as `homelessness_reg_fam`. Print the head of the sorted DataFrame.

In [10]:
# Sort homelessness by individuals
homelessness_ind = homelessness.sort_values("individuals")

# Print the top few rows
print(homelessness_ind.head())

              region        state  individuals  family_members  state_pop
50          Mountain      Wyoming          434             205     577601
34  WestNorthCentral  NorthDakota          467              75     758080
7      SouthAtlantic     Delaware          708             374     965479
39        NewEngland  RhodeIsland          747             354    1058287
45        NewEngland      Vermont          780             511     624358


In [11]:
# Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values("family_members", ascending=False)

# Print the top few rows
print(homelessness_fam.head())

              region          state  individuals  family_members  state_pop
32      Mid-Atlantic        NewYork        39827           52070   19530351
4            Pacific     California       109008           20964   39461588
21        NewEngland  Massachusetts         6811           13257    6882635
9      SouthAtlantic        Florida        21443            9587   21244317
43  WestSouthCentral          Texas        19199            6111   28628666


In [12]:
# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(["region", "family_members"], ascending = [True, False])

# Print the top few rows
print(homelessness_reg_fam.head())

              region      state  individuals  family_members  state_pop
13  EastNorthCentral   Illinois         6752            3891   12723071
35  EastNorthCentral       Ohio         6929            3320   11676341
22  EastNorthCentral   Michigan         5209            3142    9984072
49  EastNorthCentral  Wisconsin         2740            2167    5807406
14  EastNorthCentral    Indiana         3776            1482    6695497
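Sorted frames keep their original row labels, which is why the indexes above appear out of order. A minimal sketch on a hypothetical four-row `sample` (values taken from the data above) shows the multi-key sort and, as an optional extra step not used in the exercise, `reset_index(drop=True)` to renumber the rows.

```python
import pandas as pd

# Hypothetical sample: four rows copied from the homelessness data
sample = pd.DataFrame({
    "region": ["Pacific", "Mountain", "Pacific", "Mountain"],
    "state": ["Alaska", "Arizona", "California", "Colorado"],
    "family_members": [582, 2606, 20964, 3250],
})

# One ascending flag per sort key: region A-Z, then family_members high-to-low
sorted_sample = sample.sort_values(
    ["region", "family_members"], ascending=[True, False]
)
print(sorted_sample)

# Optional: discard the old row labels and renumber from 0
renumbered = sorted_sample.reset_index(drop=True)
print(renumbered)
```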


### 04- Subsetting columns

* Select only the `individuals` column of `homelessness` and save it as `individuals`. Selecting with single brackets returns a Series rather than a DataFrame.
* Print the head of the result.
* Create a DataFrame called `state_fam` that contains only the `state` and `family_members` columns of `homelessness`, in that order.
* Print the head of the result.
* Create a DataFrame called `ind_state` that contains the `individuals` and `state` columns of `homelessness`, in that order.
* Print the head of the result.

In [13]:
# Select the individuals column
individuals = homelessness["individuals"]

# Print the head of the result
print(individuals.head())

0      2570
1      1434
2      7259
3      2280
4    109008
Name: individuals, dtype: int64


In [14]:
# Select the state and family_members columns
state_fam = homelessness[["state", "family_members"]]

# Print the head of the result
print(state_fam.head())

        state  family_members
0     Alabama             864
1      Alaska             582
2     Arizona            2606
3    Arkansas             432
4  California           20964


In [15]:
# Select only the individuals and state columns, in that order
ind_state = homelessness[["individuals", "state"]]

# Print the head of the result
print(ind_state.head())

   individuals       state
0         2570     Alabama
1         1434      Alaska
2         7259     Arizona
3         2280    Arkansas
4       109008  California
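The first cell above prints a `Name: ..., dtype: ...` footer because single brackets return a Series, while a list of column names returns a DataFrame. A small sketch, assuming a hypothetical two-row `sample` built from the data above:

```python
import pandas as pd

# Hypothetical sample: two rows copied from the homelessness data
sample = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "individuals": [2570, 1434],
})

# Single brackets with one column name give a one-dimensional Series
as_series = sample["individuals"]

# Double brackets (a list of names) give a DataFrame, even for one column
as_frame = sample[["individuals"]]

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```

Use the double-bracket form when later steps expect a DataFrame, for example when the result should keep a column header in its printed output.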


### 05- Subsetting rows

* Filter `homelessness` for cases where the number of individuals is greater than ten thousand, assigning to `ind_gt_10k`. View the printed result.
* Filter `homelessness` for cases where the USA Census region is `"Mountain"`, assigning to `mountain_reg`. View the printed result.
* Filter `homelessness` for cases where the number of `family_members` is less than one thousand and the region is `"Pacific"`, assigning to `fam_lt_1k_pac`. View the printed result.




In [16]:
# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness["individuals"] > 10000]

# See the result
print(ind_gt_10k)

              region       state  individuals  family_members  state_pop
4            Pacific  California       109008           20964   39461588
9      SouthAtlantic     Florida        21443            9587   21244317
32      Mid-Atlantic     NewYork        39827           52070   19530351
37           Pacific      Oregon        11139            3337    4181886
43  WestSouthCentral       Texas        19199            6111   28628666
47           Pacific  Washington        16424            5880    7523869


In [17]:
# Filter for rows where region is Mountain
mountain_reg = homelessness[homelessness["region"] == "Mountain"]

# See the result
print(mountain_reg)

      region      state  individuals  family_members  state_pop
2   Mountain    Arizona         7259            2606    7158024
5   Mountain   Colorado         7607            3250    5691287
12  Mountain      Idaho         1297             715    1750536
26  Mountain    Montana          983             422    1060665
28  Mountain     Nevada         7058             486    3027341
31  Mountain  NewMexico         1949             602    2092741
44  Mountain       Utah         1904             972    3153550
50  Mountain    Wyoming          434             205     577601


In [18]:
# Filter for rows where family_members is less than 1000 
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness["family_members"] < 1000) & (homelessness["region"] == "Pacific")]

# See the result
print(fam_lt_1k_pac)

    region   state  individuals  family_members  state_pop
1  Pacific  Alaska         1434             582     735139
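Each comparison above produces a boolean Series, and the parentheses are needed because `&` binds more tightly than `<` and `==`. The sketch below, on a hypothetical three-row `sample`, names the two conditions before combining them and shows `DataFrame.query` as an alternative spelling; neither variation is part of the original exercise.

```python
import pandas as pd

# Hypothetical sample: three rows copied from the homelessness data
sample = pd.DataFrame({
    "region": ["Pacific", "Pacific", "Mountain"],
    "state": ["Alaska", "California", "Nevada"],
    "family_members": [582, 20964, 486],
})

# Naming each condition avoids the nested parentheses
is_small = sample["family_members"] < 1000
is_pacific = sample["region"] == "Pacific"
by_mask = sample[is_small & is_pacific]

# query() expresses the same filter as a string
by_query = sample.query("family_members < 1000 and region == 'Pacific'")

print(by_mask)
print(by_query)
```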


### 06- Subsetting rows by categorical variables

* Filter `homelessness` for cases where the USA census region is `"South Atlantic"` or it is `"Mid-Atlantic"`, assigning to `south_mid_atlantic`. View the printed result.
* Filter `homelessness` for cases where the USA census state is in the list of Mojave states, `canu`, assigning to `mojave_homelessness`. View the printed result.

In [19]:
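# Subset for rows in South Atlantic or Mid-Atlantic regions
# Caution: this dataset spells the region "SouthAtlantic" (no space), so the
# "South Atlantic" label matches no rows and only Mid-Atlantic rows are returned
south_mid_atlantic = homelessness[homelessness["region"].isin(["South Atlantic", "Mid-Atlantic"])]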

# See the result
print(south_mid_atlantic)

          region         state  individuals  family_members  state_pop
30  Mid-Atlantic     NewJersey         6048            3350    8886025
32  Mid-Atlantic       NewYork        39827           52070   19530351
38  Mid-Atlantic  Pennsylvania         8163            5349   12800922


In [20]:
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness["state"].isin(canu)]

# See the result
print(mojave_homelessness)

      region       state  individuals  family_members  state_pop
2   Mountain     Arizona         7259            2606    7158024
4    Pacific  California       109008           20964   39461588
28  Mountain      Nevada         7058             486    3027341
44  Mountain        Utah         1904             972    3153550
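`isin()` matches labels exactly, so spelling and spacing matter. The region filter earlier in this section returns only Mid-Atlantic rows, and the array printed in section 02 shows the other region stored as `SouthAtlantic`. A minimal sketch on a hypothetical four-row `sample` (labels copied from the data above) compares the two spellings:

```python
import pandas as pd

# Hypothetical sample: region and state labels copied from the data above
sample = pd.DataFrame({
    "region": ["SouthAtlantic", "SouthAtlantic", "Mid-Atlantic", "Pacific"],
    "state": ["Delaware", "Florida", "NewYork", "Alaska"],
})

# Listing the distinct labels first makes spelling mismatches easy to spot
print(sorted(sample["region"].unique()))

# "South Atlantic" (with a space) matches nothing in this data
with_space = sample[sample["region"].isin(["South Atlantic", "Mid-Atlantic"])]

# The exact stored label matches the South Atlantic rows as well
exact = sample[sample["region"].isin(["SouthAtlantic", "Mid-Atlantic"])]

print(with_space)
print(exact)
```

Checking `unique()` before filtering on a categorical column is a cheap way to catch this kind of silent mismatch, since `isin()` raises no error for labels that never occur.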


### 07- Adding new columns

* Add a new column to `homelessness`, named `total`, containing the sum of the `individuals` and `family_members` columns.
* Add another column to `homelessness`, named `p_individuals`, containing the proportion of homeless people in each state who are individuals.


In [23]:
# Add total col as sum of individuals and family_members
homelessness["total"] = homelessness["individuals"] + homelessness["family_members"]

# Add p_individuals col as proportion of total that are individuals
homelessness["p_individuals"] = homelessness["individuals"] / homelessness["total"]

# See the result
homelessness.head(10)

             region               state  individuals  family_members  state_pop   total  p_individuals
0  EastSouthCentral             Alabama         2570             864    4887681    3434       0.748398
1           Pacific              Alaska         1434             582     735139    2016       0.711310
2          Mountain             Arizona         7259            2606    7158024    9865       0.735834
3  WestSouthCentral            Arkansas         2280             432    3009733    2712       0.840708
4           Pacific          California       109008           20964   39461588  129972       0.838704
5          Mountain            Colorado         7607            3250    5691287   10857       0.700654
6        NewEngland         Connecticut         2280            1696    3571520    3976       0.573441
7     SouthAtlantic            Delaware          708             374     965479    1082       0.654344
8     SouthAtlantic  DistrictofColumbia         3770            3134     701547    6904       0.546060
9     SouthAtlantic             Florida        21443            9587   21244317   31030       0.691041
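The bracket assignments above modify `homelessness` in place. As a possible alternative, not used in the original exercise, `DataFrame.assign` returns a new frame and leaves the input untouched. The sketch below uses a hypothetical two-row `sample` taken from the data above.

```python
import pandas as pd

# Hypothetical sample: two rows copied from the homelessness data
sample = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "individuals": [2570, 1434],
    "family_members": [864, 582],
})

# assign() returns a copy with the new columns added.
# Later keyword arguments can use columns created by earlier ones.
with_totals = sample.assign(
    total=lambda df: df["individuals"] + df["family_members"],
    p_individuals=lambda df: df["individuals"] / df["total"],
)

print(with_totals)

# The original frame is unchanged
print(list(sample.columns))
```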


### 08- Combo-attack!

* Add a column to `homelessness`, `indiv_per_10k`, containing the number of homeless individuals per ten thousand people in each state.
* Subset rows where `indiv_per_10k` is higher than `20`, assigning to `high_homelessness`.
* Sort `high_homelessness` by descending `indiv_per_10k`, assigning to `high_homelessness_srt`.
* Select only the `state` and `indiv_per_10k` columns of `high_homelessness_srt` and save as `result`. Look at the `result`.

In [24]:
# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"]  / homelessness["state_pop"] 

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending=False)

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[["state", "indiv_per_10k"]]

# See the result
print(result)

                 state  indiv_per_10k
8   DistrictofColumbia      53.738381
11              Hawaii      29.079406
4           California      27.623825
37              Oregon      26.636307
28              Nevada      23.314189
47          Washington      21.829195
32             NewYork      20.392363
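The four steps above (add a column, filter, sort, select) can also be written as one method chain without intermediate variables. This is a sketch of an alternative style, not the approach used in the exercise; `sample` is a hypothetical three-row frame with values copied from the data above.

```python
import pandas as pd

# Hypothetical sample: three rows copied from the homelessness data
sample = pd.DataFrame({
    "state": ["Alabama", "Hawaii", "California"],
    "individuals": [2570, 4131, 109008],
    "state_pop": [4887681, 1420593, 39461588],
})

result = (
    sample
    # Homeless individuals per ten thousand residents
    .assign(indiv_per_10k=lambda df: 10000 * df["individuals"] / df["state_pop"])
    # Keep only states above 20 per ten thousand
    .query("indiv_per_10k > 20")
    # Highest rate first
    .sort_values("indiv_per_10k", ascending=False)
    # Keep just the two columns of interest
    .loc[:, ["state", "indiv_per_10k"]]
)

print(result)
```

The chained form avoids adding `indiv_per_10k` to the original frame, at the cost of being harder to inspect step by step.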


==================================
### `The End`
==================================