In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Inspecting a DataFrame

When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.

.head() returns the first few rows (the “head” of the DataFrame).

.info() shows information on each of the columns, such as the data type and the number of non-missing (non-null) values.

.shape returns the number of rows and columns of the DataFrame.

.describe() calculates a few summary statistics for each column.

homelessness is a DataFrame containing estimates of homelessness in each U.S. state in 2018. The individuals column is the number of homeless individuals who are not part of a family with children. The family_members column is the number of homeless individuals who are part of a family with children. The state_pop column is the state's total population.

Let us load the dataset into a pandas DataFrame.

In [3]:
file_path = 'https://raw.githubusercontent.com/HarisJafri-xcode/Data-Analyst-in-Python/refs/heads/main/04_Data_Manipulation_with_pandas/homelessness.csv'
homelessness = pd.read_csv(file_path)

In [4]:
# display the head of homelessness DataFrame
homelessness.head()

,region,state,individuals,family_members,state_pop
0,East South Central,Alabama,2570,864,4887681
1,Pacific,Alaska,1434,582,735139
2,Mountain,Arizona,7259,2606,7158024
3,West South Central,Arkansas,2280,432,3009733
4,Pacific,California,109008,20964,39461588


In [5]:
# display the information about the DataFrame
homelessness.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   region          51 non-null     object
 1   state           51 non-null     object
 2   individuals     51 non-null     int64 
 3   family_members  51 non-null     int64 
 4   state_pop       51 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 2.1+ KB


In [6]:
# display the number of rows and columns in the DataFrame
homelessness.shape

(51, 5)

In [7]:
# print summary statistics for the numeric columns (transposed so that each column becomes a row)
homelessness.describe().T

,count,mean,std,min,25%,50%,75%,max
individuals,51.0,7225.784,15991.03,434.0,1446.5,3082.0,6781.5,109008.0
family_members,51.0,3504.882,7805.412,75.0,592.0,1482.0,3196.0,52070.0
state_pop,51.0,6405637.0,7327258.0,577601.0,1777413.5,4461153.0,7340946.5,39461588.0


# Parts of a DataFrame

To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:

.values: A two-dimensional NumPy array of values.

.columns: An index of columns: the column names.

.index: An index for the rows: either row numbers or row names.

You can usually think of indexes as a list of strings or numbers, though the pandas Index data type allows for more sophisticated options.
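
As a small illustration of the three components, the sketch below uses a made-up dogs DataFrame (the names and values are hypothetical, not part of the homelessness data). It shows the default RangeIndex, and how .set_index() yields row names instead of row numbers.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "breed": ["Labrador", "Poodle", "Chow Chow"],
    "height_cm": [56, 43, 46],
})

# By default the rows are labelled 0, 1, 2, ... by a RangeIndex
print(type(dogs.index).__name__)      # RangeIndex

# .set_index() returns a new DataFrame whose row labels come from a column
dogs_by_name = dogs.set_index("name")
print(dogs_by_name.index.tolist())    # ['Bella', 'Charlie', 'Lucy']

# The column labels are stored in an Index as well
print(dogs_by_name.columns.tolist())  # ['breed', 'height_cm']

# .values holds only the data, without the row or column labels
print(dogs_by_name.values.shape)      # (3, 2)
```

Note that dogs itself is unchanged here, because .set_index() returns a new DataFrame by default.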

In [8]:
# print 2D NumPy Array of the values in DataFrame
homelessness.values

array([['East South Central', 'Alabama', 2570, 864, 4887681],
       ['Pacific', 'Alaska', 1434, 582, 735139],
       ['Mountain', 'Arizona', 7259, 2606, 7158024],
       ['West South Central', 'Arkansas', 2280, 432, 3009733],
       ['Pacific', 'California', 109008, 20964, 39461588],
       ['Mountain', 'Colorado', 7607, 3250, 5691287],
       ['New England', 'Connecticut', 2280, 1696, 3571520],
       ['South Atlantic', 'Delaware', 708, 374, 965479],
       ['South Atlantic', 'District of Columbia', 3770, 3134, 701547],
       ['South Atlantic', 'Florida', 21443, 9587, 21244317],
       ['South Atlantic', 'Georgia', 6943, 2556, 10511131],
       ['Pacific', 'Hawaii', 4131, 2399, 1420593],
       ['Mountain', 'Idaho', 1297, 715, 1750536],
       ['East North Central', 'Illinois', 6752, 3891, 12723071],
       ['East North Central', 'Indiana', 3776, 1482, 6695497],
       ['West North Central', 'Iowa', 1711, 1038, 3148618],
       ['West North Central', 'Kansas', 1443, 773, 2911359],
 

In [9]:
# print the column names of the DataFrame
homelessness.columns

Index(['region', 'state', 'individuals', 'family_members', 'state_pop'], dtype='object')

In [10]:
# print the index of the DataFrame
homelessness.index

RangeIndex(start=0, stop=51, step=1)

# Sorting Rows

Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to .sort_values().

In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.

Sorting on one column: df.sort_values("breed")

Sorting on multiple columns: df.sort_values(["breed", "weight_kg"])

By combining .sort_values() with .head(), you can answer questions in the form, "What are the top cases where…?".
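
A minimal sketch of the "top cases" pattern, assuming a small made-up dogs DataFrame (hypothetical data, chosen only to keep the output easy to check by eye):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Labrador", "Poodle"],
    "weight_kg": [25, 23, 22, 29, 20],
})

# "Which are the two heaviest dogs?": sort descending, then keep the first rows
heaviest = dogs.sort_values("weight_kg", ascending=False).head(2)
print(heaviest["name"].tolist())  # ['Cooper', 'Bella']

# Break ties within a breed: breed A to Z, then heaviest first
by_breed = dogs.sort_values(["breed", "weight_kg"], ascending=[True, False])
print(by_breed["name"].tolist())  # ['Lucy', 'Cooper', 'Bella', 'Charlie', 'Max']
```

.sort_values() returns a new, sorted DataFrame and keeps the original row labels, which is why the index values in sorted output appear out of order.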

Sort homelessness by the number of homeless individuals in the individuals column, from smallest to largest, and save this as homelessness_ind.

Print the head of the sorted DataFrame.

In [11]:
homelessness_ind = homelessness.sort_values('individuals')

In [12]:
homelessness_ind.head()

,region,state,individuals,family_members,state_pop
50,Mountain,Wyoming,434,205,577601
34,West North Central,North Dakota,467,75,758080
7,South Atlantic,Delaware,708,374,965479
39,New England,Rhode Island,747,354,1058287
45,New England,Vermont,780,511,624358


Sort homelessness by the number of homeless family_members in descending order, and save this as homelessness_fam.

In [13]:
homelessness_fam = homelessness.sort_values('family_members',ascending=False)

In [14]:
homelessness_fam.head()

,region,state,individuals,family_members,state_pop
32,Mid-Atlantic,New York,39827,52070,19530351
4,Pacific,California,109008,20964,39461588
21,New England,Massachusetts,6811,13257,6882635
9,South Atlantic,Florida,21443,9587,21244317
43,West South Central,Texas,19199,6111,28628666


Sort homelessness first by region (ascending), and then by number of family members (descending). Save this as homelessness_reg_fam.

In [15]:
homelessness_reg_fam = homelessness.sort_values(['region','family_members'],ascending=[True,False])

In [16]:
homelessness_reg_fam.head()

,region,state,individuals,family_members,state_pop
13,East North Central,Illinois,6752,3891,12723071
35,East North Central,Ohio,6929,3320,11676341
22,East North Central,Michigan,5209,3142,9984072
49,East North Central,Wisconsin,2740,2167,5807406
14,East North Central,Indiana,3776,1482,6695497


# Subsetting Columns

When working with data, you may not need all of the variables in your dataset. 

Square brackets ([]) can be used to select only the columns that matter to you in an order that makes sense to you. 

To select only "col_a" of the DataFrame df, use df["col_a"]

To select "col_a" and "col_b" of df, use df[["col_a", "col_b"]]
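
The difference between single and double square brackets is easy to miss, so here is a short sketch on a made-up DataFrame (df and its columns are placeholders, as in the text above). Single brackets give a Series, while a list of names gives a DataFrame, even when the list has one element.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    "col_a": [1, 2, 3],
    "col_b": ["x", "y", "z"],
    "col_c": [0.1, 0.2, 0.3],
})

one_col = df["col_a"]               # single brackets -> Series
one_col_df = df[["col_a"]]          # list with one name -> one-column DataFrame
two_cols = df[["col_c", "col_a"]]   # columns come back in the order listed

print(type(one_col).__name__)       # Series
print(type(one_col_df).__name__)    # DataFrame
print(two_cols.columns.tolist())    # ['col_c', 'col_a']
```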

Create a Series called individuals that contains only the individuals column of homelessness.

In [17]:
individuals = homelessness['individuals']
individuals.head()

0      2570
1      1434
2      7259
3      2280
4    109008
Name: individuals, dtype: int64

Create a DataFrame called state_fam that contains only the state and family_members columns of homelessness, in that order.

In [18]:
state_fam = homelessness[['state','family_members']]
state_fam.head()

,state,family_members
0,Alabama,864
1,Alaska,582
2,Arizona,2606
3,Arkansas,432
4,California,20964


Create a DataFrame called ind_state that contains the individuals and state columns of homelessness, in that order.

In [19]:
ind_state = homelessness[['individuals','state']]
ind_state.head()

,individuals,state
0,2570,Alabama
1,1434,Alaska
2,7259,Arizona
3,2280,Arkansas
4,109008,California


# Subsetting Rows

A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as filtering rows or selecting rows.

There are many ways to subset a DataFrame. Perhaps the most common is to use relational operators to produce True or False for each row, then pass the resulting boolean Series inside square brackets.

dogs[dogs["height_cm"] > 60]

dogs[dogs["color"] == "tan"]

You can filter for multiple conditions at once by using the "bitwise and" operator, &.

dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]
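
To make the mechanics visible, the sketch below builds the boolean masks separately before combining them. The dogs data is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "color": ["brown", "black", "tan", "tan"],
    "height_cm": [56, 43, 46, 77],
})

# A comparison on a column gives a boolean Series, one value per row
is_tall = dogs["height_cm"] > 60
print(is_tall.tolist())               # [False, False, False, True]

is_tan = dogs["color"] == "tan"

# Combine masks element-wise with & (and), | (or), ~ (not)
tall_and_tan = dogs[is_tall & is_tan]
print(tall_and_tan["name"].tolist())  # ['Cooper']

# When writing the conditions inline, keep each comparison in parentheses,
# because & binds more tightly than > and ==
same_result = dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]
print(same_result.equals(tall_and_tan))  # True
```

Python's and / or keywords do not work here, since they expect a single True or False rather than a Series of them.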

Filter homelessness for cases where the number of individuals is greater than ten thousand, assigning the result to ind_gt_10k.

In [20]:
ind_gt_10k_mask = homelessness['individuals'] > 10000

ind_gt_10k = homelessness[ind_gt_10k_mask]

In [21]:
ind_gt_10k

,region,state,individuals,family_members,state_pop
4,Pacific,California,109008,20964,39461588
9,South Atlantic,Florida,21443,9587,21244317
32,Mid-Atlantic,New York,39827,52070,19530351
37,Pacific,Oregon,11139,3337,4181886
43,West South Central,Texas,19199,6111,28628666
47,Pacific,Washington,16424,5880,7523869


Filter homelessness for cases where the USA Census region is "Mountain", assigning to mountain_reg.

In [22]:
mountain_reg_mask = (homelessness['region'] == 'Mountain')

mountain_reg = homelessness[mountain_reg_mask]

In [23]:
mountain_reg

,region,state,individuals,family_members,state_pop
2,Mountain,Arizona,7259,2606,7158024
5,Mountain,Colorado,7607,3250,5691287
12,Mountain,Idaho,1297,715,1750536
26,Mountain,Montana,983,422,1060665
28,Mountain,Nevada,7058,486,3027341
31,Mountain,New Mexico,1949,602,2092741
44,Mountain,Utah,1904,972,3153550
50,Mountain,Wyoming,434,205,577601


Filter homelessness for cases where the number of family_members is less than one thousand and the region is "Pacific", assigning to fam_lt_1k_pac.

In [24]:
fam_lt_1k_pac_condition_1 = homelessness['family_members'] < 1000
fam_lt_1k_pac_condition_2 = homelessness['region'] == "Pacific"

In [25]:
fam_lt_1k_pac = homelessness[fam_lt_1k_pac_condition_1 & fam_lt_1k_pac_condition_2]

In [26]:
fam_lt_1k_pac

,region,state,individuals,family_members,state_pop
1,Pacific,Alaska,1434,582,735139


# Subsetting rows by categorical variables

Subsetting data based on a categorical variable often involves using the or operator (|) to select rows from multiple categories. This can get tedious when you want all states in one of three different regions, for example. Instead, use the .isin() method, which will allow you to tackle this problem by writing one condition instead of three separate ones.

colors = ["brown", "black", "tan"]

condition = dogs["color"].isin(colors)

dogs[condition]
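
As a quick check that .isin() really replaces the chained | conditions, the sketch below compares the two forms on a made-up dogs DataFrame (hypothetical data):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "color": ["brown", "black", "tan", "gray", "white"],
})

colors = ["brown", "black", "tan"]

# One condition with .isin() ...
with_isin = dogs[dogs["color"].isin(colors)]

# ... selects the same rows as three comparisons joined with |
with_or = dogs[
    (dogs["color"] == "brown")
    | (dogs["color"] == "black")
    | (dogs["color"] == "tan")
]

print(with_isin.equals(with_or))    # True
print(with_isin["name"].tolist())   # ['Bella', 'Charlie', 'Lucy']

# ~ negates the mask, selecting every row whose color is NOT in the list
others = dogs[~dogs["color"].isin(colors)]
print(others["name"].tolist())      # ['Cooper', 'Max']
```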

Filter homelessness for cases where the state is in canu, the list of Mojave Desert states, assigning the result to mojave_homelessness.

In [27]:
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness['state'].isin(canu)]

In [28]:
mojave_homelessness

,region,state,individuals,family_members,state_pop
2,Mountain,Arizona,7259,2606,7158024
4,Pacific,California,109008,20964,39461588
28,Mountain,Nevada,7058,486,3027341
44,Mountain,Utah,1904,972,3153550


# Adding New Columns

You aren't stuck with just the data you are given. Instead, you can add new columns to a DataFrame. This process goes by many names, such as transforming, mutating, and feature engineering.

You can create new columns from scratch, but it is also common to derive them from other columns, for example, by adding columns together or by changing their units.
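
Both kinds of derived column can be sketched on a made-up dogs DataFrame (hypothetical data and column names). The example also shows .assign(), a pandas method that returns a new DataFrame instead of modifying the existing one. It is offered here as an alternative to bracket assignment, not as something the exercises below use.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "height_cm": [56, 43, 46],
    "weight_kg": [25, 23, 22],
})

# Change units: centimetres to metres (this modifies dogs in place)
dogs["height_m"] = dogs["height_cm"] / 100

# Combine columns: body mass index = weight / height squared
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

# .assign() returns a new DataFrame and leaves dogs untouched
dogs_lb = dogs.assign(weight_lb=dogs["weight_kg"] * 2.20462)

print(dogs.columns.tolist())
print("weight_lb" in dogs.columns)     # False
print("weight_lb" in dogs_lb.columns)  # True
```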

Add a new column to homelessness, named total, containing the sum of the individuals and family_members columns.

Add another column to homelessness, named p_homeless, containing the proportion of each state's population that is homeless: total divided by state_pop.

In [29]:
# Add total col as sum of individuals and family_members
homelessness['total'] = homelessness['individuals'] + homelessness['family_members']

# Add p_homeless col as proportion of total homeless population to the state population
homelessness['p_homeless'] = homelessness['total']/homelessness['state_pop']

homelessness.head()

,region,state,individuals,family_members,state_pop,total,p_homeless
0,East South Central,Alabama,2570,864,4887681,3434,0.000703
1,Pacific,Alaska,1434,582,735139,2016,0.002742
2,Mountain,Arizona,7259,2606,7158024,9865,0.001378
3,West South Central,Arkansas,2280,432,3009733,2712,0.000901
4,Pacific,California,109008,20964,39461588,129972,0.003294


# Combo Attack

"Which state has the highest number of homeless individuals per 10,000 people in the state?"

In [30]:
homelessness["indiv_per_10k"] = 10000 * (homelessness['individuals'] / homelessness['state_pop'] )
homelessness_srt = homelessness.sort_values('indiv_per_10k',ascending=False)
result = homelessness_srt[['state','indiv_per_10k']]

In [31]:
result.head()

,state,indiv_per_10k
8,District of Columbia,53.738381
11,Hawaii,29.079406
4,California,27.623825
37,Oregon,26.636307
28,Nevada,23.314189
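
The same add-column, sort, and subset steps can also be written as one method chain. The sketch below is one possible way to do it, run on five rows copied from the homelessness data shown earlier so that it works without downloading the file. The name sample is introduced here only for illustration.

```python
import pandas as pd

# Five rows copied from the homelessness data shown earlier in this notebook
sample = pd.DataFrame({
    "state": ["Alabama", "Alaska", "California", "District of Columbia", "Hawaii"],
    "individuals": [2570, 1434, 109008, 3770, 4131],
    "state_pop": [4887681, 735139, 39461588, 701547, 1420593],
})

top = (
    sample
    # add the derived column without modifying sample
    .assign(indiv_per_10k=10000 * sample["individuals"] / sample["state_pop"])
    # largest rate first
    .sort_values("indiv_per_10k", ascending=False)
    # keep only the columns of interest
    [["state", "indiv_per_10k"]]
    .head(3)
)

print(top["state"].tolist())  # ['District of Columbia', 'Hawaii', 'California']
```

Unlike the cell above, this version leaves the input DataFrame without an indiv_per_10k column, because .assign() works on a copy.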
