# **Inspecting a DataFrame**

When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.

**.head()** returns the first 5 rows by default (the “head” of the DataFrame).

**.info()** shows information on each of the columns, such as the data type and the number of non-missing (non-null) values.

**.shape** returns the number of rows and columns of the DataFrame.

**.describe()** calculates a few summary statistics for each column.


---


# **Parts of a DataFrame**

To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:

**.values**: A two-dimensional NumPy array of values.

**.columns**: An index of columns: the column names.

**.index**: An index for the rows: either row numbers or row names.
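
As a small self-contained illustration of the three components, the sketch below builds a tiny DataFrame by hand and looks at each attribute. The two rows are copied from the homelessness data printed further down; the variable name `small` is only an assumption for this example.

```python
import pandas as pd

# Tiny hand-built DataFrame; the name `small` is illustrative only
small = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska"],
        "individuals": [2570.0, 1434.0],
        "state_pop": [4887681, 735139],
    }
)

values = small.values          # 2-D NumPy array holding the data
columns = list(small.columns)  # column labels
index = list(small.index)      # row labels (default: 0, 1, ...)

print(values.shape)
print(columns)
print(index)
```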

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import pandas as pd
df=pd.read_csv('/content/drive/MyDrive/Data/homelessness.csv',index_col=0)


In [27]:
print(df.to_string()) # .to_string() is used to show the full DataFrame

                region                 state  individuals  family_members  state_pop     total     p_ind    ind_10k
0   East South Central               Alabama       2570.0           864.0    4887681    3434.0  0.748398   5.258117
1              Pacific                Alaska       1434.0           582.0     735139    2016.0  0.711310  19.506515
2             Mountain               Arizona       7259.0          2606.0    7158024    9865.0  0.735834  10.141067
3   West South Central              Arkansas       2280.0           432.0    3009733    2712.0  0.840708   7.575423
4              Pacific            California     109008.0         20964.0   39461588  129972.0  0.838704  27.623825
5             Mountain              Colorado       7607.0          3250.0    5691287   10857.0  0.700654  13.366045
6          New England           Connecticut       2280.0          1696.0    3571520    3976.0  0.573441   6.383837
7       South Atlantic              Delaware        708.0           374.

In [5]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   region          51 non-null     object 
 1   state           51 non-null     object 
 2   individuals     51 non-null     float64
 3   family_members  51 non-null     float64
 4   state_pop       51 non-null     int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 2.4+ KB
None


In [6]:
print(df.shape)

(51, 5)


In [7]:
print(df.describe())

         individuals  family_members     state_pop
count      51.000000       51.000000  5.100000e+01
mean     7225.784314     3504.882353  6.405637e+06
std     15991.025083     7805.411811  7.327258e+06
min       434.000000       75.000000  5.776010e+05
25%      1446.500000      592.000000  1.777414e+06
50%      3082.000000     1482.000000  4.461153e+06
75%      6781.500000     3196.000000  7.340946e+06
max    109008.000000    52070.000000  3.946159e+07


In [8]:
print(df.values)

[['East South Central' 'Alabama' 2570.0 864.0 4887681]
 ['Pacific' 'Alaska' 1434.0 582.0 735139]
 ['Mountain' 'Arizona' 7259.0 2606.0 7158024]
 ['West South Central' 'Arkansas' 2280.0 432.0 3009733]
 ['Pacific' 'California' 109008.0 20964.0 39461588]
 ['Mountain' 'Colorado' 7607.0 3250.0 5691287]
 ['New England' 'Connecticut' 2280.0 1696.0 3571520]
 ['South Atlantic' 'Delaware' 708.0 374.0 965479]
 ['South Atlantic' 'District of Columbia' 3770.0 3134.0 701547]
 ['South Atlantic' 'Florida' 21443.0 9587.0 21244317]
 ['South Atlantic' 'Georgia' 6943.0 2556.0 10511131]
 ['Pacific' 'Hawaii' 4131.0 2399.0 1420593]
 ['Mountain' 'Idaho' 1297.0 715.0 1750536]
 ['East North Central' 'Illinois' 6752.0 3891.0 12723071]
 ['East North Central' 'Indiana' 3776.0 1482.0 6695497]
 ['West North Central' 'Iowa' 1711.0 1038.0 3148618]
 ['West North Central' 'Kansas' 1443.0 773.0 2911359]
 ['East South Central' 'Kentucky' 2735.0 953.0 4461153]
 ['West South Central' 'Louisiana' 2540.0 519.0 4659690]
 ['New 

In [9]:
print(df.columns)

Index(['region', 'state', 'individuals', 'family_members', 'state_pop'], dtype='object')


In [10]:
print(df.index)

Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
            34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
            50],
           dtype='int64')


# **Sorting rows**


---


Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to **.sort_values()**.

In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.


one column:	`df.sort_values("breed")`

multiple columns:	`df.sort_values(["breed", "weight_kg"])`


By combining **.sort_values()** with .head(), you can answer questions in the form, "What are the top cases where…?".

By default, .sort_values() sorts in ascending order; pass ascending=False to sort in descending order.
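
The `breed` and `weight_kg` columns mentioned above do not exist in the homelessness data, so here is a minimal sketch on a made-up `dogs` DataFrame (all names and values are hypothetical). It shows a tie-breaking sort on two columns and the "sort, then take the top rows" pattern.

```python
import pandas as pd

# Hypothetical dogs data, made up for this example
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Max", "Lucy", "Cooper"],
        "breed": ["Poodle", "Labrador", "Poodle", "Labrador"],
        "weight_kg": [24, 29, 17, 31],
    }
)

# Sort by breed (A to Z), breaking ties by weight from heaviest to lightest
dogs_sorted = dogs.sort_values(["breed", "weight_kg"], ascending=[True, False])
print(dogs_sorted)

# The heaviest dogs overall: sort descending, then take the top rows
heaviest = dogs.sort_values("weight_kg", ascending=False).head(2)
print(heaviest)
```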

In [11]:
# Sort df by individuals
df_ind=df.sort_values('individuals')
print(df_ind.head())

                region         state  individuals  family_members  state_pop
50            Mountain       Wyoming        434.0           205.0     577601
34  West North Central  North Dakota        467.0            75.0     758080
7       South Atlantic      Delaware        708.0           374.0     965479
39         New England  Rhode Island        747.0           354.0    1058287
45         New England       Vermont        780.0           511.0     624358


In [12]:
# Sort df by descending family members
df_fam=df.sort_values('family_members', ascending=False)
print(df_fam.head())

                region          state  individuals  family_members  state_pop
32        Mid-Atlantic       New York      39827.0         52070.0   19530351
4              Pacific     California     109008.0         20964.0   39461588
21         New England  Massachusetts       6811.0         13257.0    6882635
9       South Atlantic        Florida      21443.0          9587.0   21244317
43  West South Central          Texas      19199.0          6111.0   28628666


In [34]:
# Sort df by region, then descending family members
df_r_f=df.sort_values(['region','family_members'],ascending=[True,False], ignore_index=True)
print(df_r_f)

                region                 state  individuals  family_members  \
0   East North Central              Illinois       6752.0          3891.0   
1   East North Central                  Ohio       6929.0          3320.0   
2   East North Central              Michigan       5209.0          3142.0   
3   East North Central             Wisconsin       2740.0          2167.0   
4   East North Central               Indiana       3776.0          1482.0   
5   East South Central             Tennessee       6139.0          1744.0   
6   East South Central              Kentucky       2735.0           953.0   
7   East South Central               Alabama       2570.0           864.0   
8   East South Central           Mississippi       1024.0           328.0   
9         Mid-Atlantic              New York      39827.0         52070.0   
10        Mid-Atlantic          Pennsylvania       8163.0          5349.0   
11        Mid-Atlantic            New Jersey       6048.0          3350.0   

# **Subsetting columns**


---


When working with data, you may not need all of the variables in your dataset. Square brackets [ ] can be used to select only the columns that matter to you in an order that makes sense to you. 

To select only "col_a" of the DataFrame df, use

`df["col_a"]`

To select "col_a" and "col_b" of df, use

`df[["col_a", "col_b"]]`

In [14]:
# Select only the individuals and state_pop columns, in that order
print(df[['individuals','state_pop']].head())

   individuals  state_pop
0       2570.0    4887681
1       1434.0     735139
2       7259.0    7158024
3       2280.0    3009733
4     109008.0   39461588


In [15]:
# Select the individuals column
print(df[['individuals']].head())

   individuals
0       2570.0
1       1434.0
2       7259.0
3       2280.0
4     109008.0


# **Subsetting rows**


---




A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as filtering rows or selecting rows.

There are many ways to subset a DataFrame. Perhaps the most common is to build a boolean pandas Series by comparing a column against a value, and then put that Series inside the square brackets of df. Only the rows where the Series is True are kept.



```
mask = df['value'] == 400   # boolean Series: True where the condition holds

print(df[mask])
```



The condition can also be written directly inside [ ]:

ex: print(df[df['value'] == 400])

Suppose we have a dogs DataFrame with the columns height_cm and color.
 
```
 dogs[dogs["height_cm"] > 60]

 dogs[dogs["color"] == "tan"]
```


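You can filter for multiple conditions at once by combining them with the "bitwise and" operator '&' or the "bitwise or" operator '|'. Wrap each condition in parentheses, because & and | bind more tightly than comparisons such as > and ==.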

`dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]`

`dogs[(dogs["height_cm"] > 60) | (dogs["color"] == "tan")]`
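
To make the dogs examples runnable, the sketch below uses a small made-up `dogs` DataFrame (names and values are hypothetical). Storing each condition in its own variable makes it easier to see that every comparison is just a boolean Series.

```python
import pandas as pd

# Hypothetical dogs data, made up for this example
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Max", "Lucy", "Cooper"],
        "height_cm": [56, 65, 43, 62],
        "color": ["tan", "black", "tan", "tan"],
    }
)

# Each comparison produces a boolean Series with one value per row
is_tall = dogs["height_cm"] > 60
is_tan = dogs["color"] == "tan"

tall_and_tan = dogs[is_tall & is_tan]  # both conditions must hold
tall_or_tan = dogs[is_tall | is_tan]   # at least one condition holds

print(tall_and_tan)
print(tall_or_tan)
```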


In [16]:
# Filter for rows where individuals is greater than 10000
print(df[df['individuals']>10000])

                region       state  individuals  family_members  state_pop
4              Pacific  California     109008.0         20964.0   39461588
9       South Atlantic     Florida      21443.0          9587.0   21244317
32        Mid-Atlantic    New York      39827.0         52070.0   19530351
37             Pacific      Oregon      11139.0          3337.0    4181886
43  West South Central       Texas      19199.0          6111.0   28628666
47             Pacific  Washington      16424.0          5880.0    7523869


In [17]:
# Filter for rows where region is Mountain
print(df[df['region']=='Mountain'])

      region       state  individuals  family_members  state_pop
2   Mountain     Arizona       7259.0          2606.0    7158024
5   Mountain    Colorado       7607.0          3250.0    5691287
12  Mountain       Idaho       1297.0           715.0    1750536
26  Mountain     Montana        983.0           422.0    1060665
28  Mountain      Nevada       7058.0           486.0    3027341
31  Mountain  New Mexico       1949.0           602.0    2092741
44  Mountain        Utah       1904.0           972.0    3153550
50  Mountain     Wyoming        434.0           205.0     577601


In [18]:
# Filter for rows where family_members is less than 1000 and region is Pacific
print(df[(df['family_members']<1000)&(df['region']=='Pacific')])

    region   state  individuals  family_members  state_pop
1  Pacific  Alaska       1434.0           582.0     735139


# **Subsetting rows by categorical variables**


---


Subsetting data based on a categorical variable often involves using the "or" operator (|) to select rows from multiple categories. This gets tedious when you want rows from several categories at once, for example all states in one of three regions, because every category needs its own condition (see the 1st code below).

Instead, use the **.isin()** method, which lets you write one condition instead of several separate ones (see the 2nd code below).

To use .isin(), first create a list of the values you want to keep, then pass that list to the method: .isin(list)

In [19]:
#1st code, using the | (or) operator
#Subset for rows of the desert states (Arizona, California, Nevada, or Utah)
print(df[(df['state']=='Arizona')|
         (df['state']=='California')|
         (df['state']=='Nevada')|
         (df['state']=='Utah')])

      region       state  individuals  family_members  state_pop
2   Mountain     Arizona       7259.0          2606.0    7158024
4    Pacific  California     109008.0         20964.0   39461588
28  Mountain      Nevada       7058.0           486.0    3027341
44  Mountain        Utah       1904.0           972.0    3153550


In [20]:
#2nd code, using .isin()
#Subset for rows of the desert states (Arizona, California, Nevada, or Utah)
acnu = ["Arizona", "California", "Nevada", "Utah"]
print(df[df['state'].isin(acnu)])

      region       state  individuals  family_members  state_pop
2   Mountain     Arizona       7259.0          2606.0    7158024
4    Pacific  California     109008.0         20964.0   39461588
28  Mountain      Nevada       7058.0           486.0    3027341
44  Mountain        Utah       1904.0           972.0    3153550


# **Subsetting rows and columns together**

---

We can subset rows and columns together in a single line of code.

Suppose we want only the rows where 'region' is 'New England', and for those rows we want to show only the 'state' and 'state_pop' columns of df.

In [31]:
print(df[df['region']=='New England'][['state','state_pop']]) #showing 'state' and 'state_pop' only for New England

            state  state_pop
6     Connecticut    3571520
19          Maine    1339057
21  Massachusetts    6882635
29  New Hampshire    1353465
39   Rhode Island    1058287
45        Vermont     624358


In [32]:
print(df[df['region']=='New England'][['state']]) #showing 'state' only for New England
                                                  #showing as a DataFrame

            state
6     Connecticut
19          Maine
21  Massachusetts
29  New Hampshire
39   Rhode Island
45        Vermont


In [33]:
print(df[df['region']=='New England']['state']) #showing as a series

6       Connecticut
19            Maine
21    Massachusetts
29    New Hampshire
39     Rhode Island
45          Vermont
Name: state, dtype: object
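
A common alternative to chaining two pairs of square brackets is the `.loc` accessor, which takes the row condition and the column list in one step. This is a different technique from the one used in the cells above, shown here only as a sketch. The rows are copied from the outputs in this notebook, and the name `sample` is an assumption for this example.

```python
import pandas as pd

# A few rows copied from the homelessness data; `sample` is an illustrative name
sample = pd.DataFrame(
    {
        "region": ["New England", "Pacific", "New England", "Mountain"],
        "state": ["Connecticut", "Alaska", "Maine", "Arizona"],
        "state_pop": [3571520, 735139, 1339057, 7158024],
    }
)

is_new_england = sample["region"] == "New England"

# Chained brackets: filter the rows first, then pick the columns
chained = sample[is_new_england][["state", "state_pop"]]

# .loc: rows and columns selected together
with_loc = sample.loc[is_new_england, ["state", "state_pop"]]

print(with_loc)
print(chained.equals(with_loc))
```

Both forms give the same result when reading data; `.loc` is usually the safer choice if you later want to assign values to the selected cells.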


# **Adding new columns**


---





You aren't stuck with just the data you are given. Instead, you can add new columns to a DataFrame. This has many names, such as transforming, mutating, and feature engineering.

You can create new columns from scratch, but it is also common to derive them from other columns, for example, by adding columns together or by changing their units.
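
As a minimal sketch of both ideas (changing units and combining columns), the example below uses a made-up `dogs` DataFrame. The column names and values are hypothetical and are not part of the homelessness data.

```python
import pandas as pd

# Hypothetical dogs data, made up for this example
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Max"],
        "height_cm": [56, 65],
        "weight_kg": [24, 29],
    }
)

# Change units: centimetres to metres
dogs["height_m"] = dogs["height_cm"] / 100

# Combine columns: body mass index = weight / height squared
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

print(dogs)
```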

In [21]:
#Add 'total' column as sum of individuals and family_members
df['total']=df['individuals']+df['family_members']
print(df.head())

               region       state  individuals  family_members  state_pop  \
0  East South Central     Alabama       2570.0           864.0    4887681   
1             Pacific      Alaska       1434.0           582.0     735139   
2            Mountain     Arizona       7259.0          2606.0    7158024   
3  West South Central    Arkansas       2280.0           432.0    3009733   
4             Pacific  California     109008.0         20964.0   39461588   

      total  
0    3434.0  
1    2016.0  
2    9865.0  
3    2712.0  
4  129972.0  


In [22]:
#Add 'p_ind' column as the proportion of individuals in the total
df['p_ind']=df['individuals']/df['total']
print(df.head())

               region       state  individuals  family_members  state_pop  \
0  East South Central     Alabama       2570.0           864.0    4887681   
1             Pacific      Alaska       1434.0           582.0     735139   
2            Mountain     Arizona       7259.0          2606.0    7158024   
3  West South Central    Arkansas       2280.0           432.0    3009733   
4             Pacific  California     109008.0         20964.0   39461588   

      total     p_ind  
0    3434.0  0.748398  
1    2016.0  0.711310  
2    9865.0  0.735834  
3    2712.0  0.840708  
4  129972.0  0.838704  


## Which state has the highest number of homeless individuals per 10,000 people in the state? Print out the 'result'.



---






In [23]:
#Add 'ind_10k' column as 'individuals' per 10,000 of 'state_pop'
df['ind_10k']=(df['individuals']/df['state_pop'])*10000
print(df.head())

               region       state  individuals  family_members  state_pop  \
0  East South Central     Alabama       2570.0           864.0    4887681   
1             Pacific      Alaska       1434.0           582.0     735139   
2            Mountain     Arizona       7259.0          2606.0    7158024   
3  West South Central    Arkansas       2280.0           432.0    3009733   
4             Pacific  California     109008.0         20964.0   39461588   

      total     p_ind    ind_10k  
0    3434.0  0.748398   5.258117  
1    2016.0  0.711310  19.506515  
2    9865.0  0.735834  10.141067  
3    2712.0  0.840708   7.575423  
4  129972.0  0.838704  27.623825  


In [24]:
#Subset rows for 'ind_10k' greater than 20
high_ind=df[df['ind_10k']>20]
print(high_ind)

            region                 state  individuals  family_members  \
4          Pacific            California     109008.0         20964.0   
8   South Atlantic  District of Columbia       3770.0          3134.0   
11         Pacific                Hawaii       4131.0          2399.0   
28        Mountain                Nevada       7058.0           486.0   
32    Mid-Atlantic              New York      39827.0         52070.0   
37         Pacific                Oregon      11139.0          3337.0   
47         Pacific            Washington      16424.0          5880.0   

    state_pop     total     p_ind    ind_10k  
4    39461588  129972.0  0.838704  27.623825  
8      701547    6904.0  0.546060  53.738381  
11    1420593    6530.0  0.632619  29.079406  
28    3027341    7544.0  0.935578  23.314189  
32   19530351   91897.0  0.433387  20.392363  
37    4181886   14476.0  0.769481  26.636307  
47    7523869   22304.0  0.736370  21.829195  


In [25]:
#Sorting high_ind by descending ind_10k
highest_ind=high_ind.sort_values('ind_10k',ascending=False)
print(highest_ind)

            region                 state  individuals  family_members  \
8   South Atlantic  District of Columbia       3770.0          3134.0   
11         Pacific                Hawaii       4131.0          2399.0   
4          Pacific            California     109008.0         20964.0   
37         Pacific                Oregon      11139.0          3337.0   
28        Mountain                Nevada       7058.0           486.0   
47         Pacific            Washington      16424.0          5880.0   
32    Mid-Atlantic              New York      39827.0         52070.0   

    state_pop     total     p_ind    ind_10k  
8      701547    6904.0  0.546060  53.738381  
11    1420593    6530.0  0.632619  29.079406  
4    39461588  129972.0  0.838704  27.623825  
37    4181886   14476.0  0.769481  26.636307  
28    3027341    7544.0  0.935578  23.314189  
47    7523869   22304.0  0.736370  21.829195  
32   19530351   91897.0  0.433387  20.392363  


In [26]:
result=highest_ind[['state','ind_10k']]
print(result)

                   state    ind_10k
8   District of Columbia  53.738381
11                Hawaii  29.079406
4             California  27.623825
37                Oregon  26.636307
28                Nevada  23.314189
47            Washington  21.829195
32              New York  20.392363


So, the District of Columbia (listed as a state in this dataset) has the highest number of homeless individuals per 10,000 people.
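
If only the single top row is needed, the threshold filter can be skipped: sort by the rate in descending order and keep the first row. The sketch below does this on four rows copied from the outputs above, so that it runs without the CSV file; the name `sample` is an assumption for this example.

```python
import pandas as pd

# Four rows copied from the outputs above; `sample` is an illustrative name
sample = pd.DataFrame(
    {
        "state": ["Alabama", "California", "District of Columbia", "Hawaii"],
        "individuals": [2570.0, 109008.0, 3770.0, 4131.0],
        "state_pop": [4887681, 39461588, 701547, 1420593],
    }
)

# Homeless individuals per 10,000 people
sample["ind_10k"] = sample["individuals"] / sample["state_pop"] * 10000

# Sort from highest to lowest rate and keep only the first row
top = sample.sort_values("ind_10k", ascending=False)[["state", "ind_10k"]].head(1)
print(top)
```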