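- Requirements:
    - Import the files and inspect the raw data
    - Merge the population data with the state-abbreviation data
    - Drop the redundant abbreviation column from the merged data
    - Find the columns that contain missing data
    - Find which state/region values leave state as NaN, and deduplicate them
    - Fill in the correct state value for those state/region entries, eliminating every NaN in the state column
    - Merge in the state area data (areas)
    - The area (sq. mi) column turns out to have missing data; find which rows
    - Drop the rows that contain missing data
    - Select the total-population data for 2010
    - Compute the population density of each state
    - Sort, and find the state with the highest population density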

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame

In [6]:
# Import the files and inspect the raw data
abb = pd.read_csv('./data/state-abbrevs.csv') # state: full state name; abbreviation: state abbreviation
abb.head()

Unnamed: 0,state,abbreviation
0,Alabama,AL
1,Alaska,AK
2,Arizona,AZ
3,Arkansas,AR
4,California,CA


In [7]:
area = pd.read_csv('./data/state-areas.csv') # state: full state name; area (sq. mi): state area in square miles
area.head()

Unnamed: 0,state,area (sq. mi)
0,Alabama,52423
1,Alaska,656425
2,Arizona,114006
3,Arkansas,53182
4,California,163707


In [8]:
pop = pd.read_csv('./data/state-population.csv') # state/region: abbreviation; ages: age group; year: year; population: head count
pop.head()

Unnamed: 0,state/region,ages,year,population
0,AL,under18,2012,1117489.0
1,AL,total,2012,4817528.0
2,AL,under18,2010,1130966.0
3,AL,total,2010,4785570.0
4,AL,under18,2011,1125763.0


In [12]:
# Merge the population data with the state-abbreviation data
abb_pop = pd.merge(abb,pop,left_on='abbreviation',right_on='state/region',how='outer')
abb_pop.head()

Unnamed: 0,state,abbreviation,state/region,ages,year,population
0,Alabama,AL,AL,under18,2012,1117489.0
1,Alabama,AL,AL,total,2012,4817528.0
2,Alabama,AL,AL,under18,2010,1130966.0
3,Alabama,AL,AL,total,2010,4785570.0
4,Alabama,AL,AL,under18,2011,1125763.0
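An outer merge keeps keys that appear on only one side, which is where the NaN values in state come from later on. As a rough sketch, the `indicator=True` option of `pd.merge` can list those unmatched keys directly. The frames `left` and `right` below are hypothetical toy data, not the notebook's files.

```python
import pandas as pd

# Hypothetical toy frames standing in for abb and pop
left = pd.DataFrame({'state': ['Alabama', 'Alaska'],
                     'abbreviation': ['AL', 'AK']})
right = pd.DataFrame({'state/region': ['AL', 'AK', 'PR'],
                      'population': [1.0, 2.0, 3.0]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
merged = pd.merge(left, right, left_on='abbreviation',
                  right_on='state/region', how='outer', indicator=True)

# Keys that did not find a partner on the other side
unmatched = merged.loc[merged['_merge'] != 'both', 'state/region'].tolist()
print(unmatched)
```

Rows tagged `right_only` are exactly the ones whose state ends up null after the merge.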


In [13]:
# Drop the redundant abbreviation column from the merged data
abb_pop.drop(labels='abbreviation',axis=1,inplace=True)
abb_pop.head()

Unnamed: 0,state,state/region,ages,year,population
0,Alabama,AL,under18,2012,1117489.0
1,Alabama,AL,total,2012,4817528.0
2,Alabama,AL,under18,2010,1130966.0
3,Alabama,AL,total,2010,4785570.0
4,Alabama,AL,under18,2011,1125763.0


In [14]:
# Find the columns that contain missing data
# Approach 1: isnull, notnull, any, all
abb_pop.isnull().any(axis=0)
# The state and population columns contain null values

state            True
state/region    False
ages            False
year            False
population       True
dtype: bool
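`any(axis=0)` only says whether a column has a null. A small variation, sketched here on a hypothetical toy frame with made-up values, counts the nulls per column and collects the affected column names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for abb_pop (values are made up)
toy = pd.DataFrame({
    'state': ['Alabama', np.nan, np.nan],
    'state/region': ['AL', 'PR', 'USA'],
    'population': [100.0, np.nan, 300.0],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()

# Names of the columns that contain at least one missing value
cols_with_nan = toy.columns[toy.isnull().any(axis=0)].tolist()

print(missing_counts)
print(cols_with_nan)
```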

In [15]:
# Approach 2: info() reports the non-null count of every column
abb_pop.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2544 entries, 0 to 2543
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   state         2448 non-null   object 
 1   state/region  2544 non-null   object 
 2   ages          2544 non-null   object 
 3   year          2544 non-null   int64  
 4   population    2524 non-null   float64
dtypes: float64(1), int64(1), object(3)
memory usage: 119.2+ KB


In [16]:
# Find which state/region values leave state as NaN, then deduplicate (locate the abbreviations that go with the null states and deduplicate them)
abb_pop.head()

Unnamed: 0,state,state/region,ages,year,population
0,Alabama,AL,under18,2012,1117489.0
1,Alabama,AL,total,2012,4817528.0
2,Alabama,AL,under18,2010,1130966.0
3,Alabama,AL,total,2010,4785570.0
4,Alabama,AL,under18,2011,1125763.0


Idea: pull out the rows where state is null; the abbreviation can then be read from those rows.

In [17]:
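# 1. Locate the null values in state
abb_pop['state'].isnull()
# 2. Use that boolean Series as the row indexer on the source data
abb_pop.loc[abb_pop['state'].isnull()] # rows whose state is null
# 3. Extract the abbreviations
abb_pop.loc[abb_pop['state'].isnull()]['state/region']
# 4. Deduplicate the abbreviations
abb_pop.loc[abb_pop['state'].isnull()]['state/region'].unique()

# Conclusion: only PR and USA lack a full name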

array(['PR', 'USA'], dtype=object)

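- Fill in the correct state value for those state/region entries, eliminating every NaN in the state column
- Question: can fillna fill the nulls for this requirement?
    - Not directly. fillna can fill a null from its neighboring values, and fillna(value='xxx') fills every null with the one specified value; here the PR rows and the USA rows each need a different full name.
    - Fill them by assigning to the elements instead!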
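As an aside, `fillna` also accepts a Series aligned on the index, so one possible alternative to element assignment is to map each abbreviation to its full name and fill from that. This is only a sketch: the toy frame and the `full_names` dict are assumptions for illustration, and the notebook itself uses assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for abb_pop
toy = pd.DataFrame({
    'state': ['Alabama', np.nan, np.nan],
    'state/region': ['AL', 'PR', 'USA'],
})

# Assumed lookup table: abbreviation -> full name
full_names = {'PR': 'Puerto Rico', 'USA': 'United States'}

# map() yields NaN for abbreviations not in the dict; fillna() only touches
# rows where state is already null, so existing names are left unchanged
toy['state'] = toy['state'].fillna(toy['state/region'].map(full_names))
print(toy)
```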

1. First, bulk-assign the full name to the null state entries for USA

In [20]:
# 1.1 Find the rows for USA (these rows hold the null state values)
abb_pop.loc[abb_pop['state/region'] == 'USA'] # rows for USA

Unnamed: 0,state,state/region,ages,year,population
2496,,USA,under18,1990,64218512.0
2497,,USA,total,1990,249622814.0
2498,,USA,total,1991,252980942.0
2499,,USA,under18,1991,65313018.0
2500,,USA,under18,1992,66509177.0
2501,,USA,total,1992,256514231.0
2502,,USA,total,1993,259918595.0
2503,,USA,under18,1993,67594938.0
2504,,USA,under18,1994,68640936.0
2505,,USA,total,1994,263125826.0


In [22]:
# 1.2 Get the row index labels of the USA rows whose full name is null
indexs = abb_pop.loc[abb_pop['state/region'] == 'USA'].index
indexs

Int64Index([2496, 2497, 2498, 2499, 2500, 2501, 2502, 2503, 2504, 2505, 2506,
            2507, 2508, 2509, 2510, 2511, 2512, 2513, 2514, 2515, 2516, 2517,
            2518, 2519, 2520, 2521, 2522, 2523, 2524, 2525, 2526, 2527, 2528,
            2529, 2530, 2531, 2532, 2533, 2534, 2535, 2536, 2537, 2538, 2539,
            2540, 2541, 2542, 2543],
           dtype='int64')

In [27]:
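abb_pop.loc[indexs] # indexs holds labels, so select with loc; this cell ran after the assignment in In [26], so state is already filled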

Unnamed: 0,state,state/region,ages,year,population
2496,United States,USA,under18,1990,64218512.0
2497,United States,USA,total,1990,249622814.0
2498,United States,USA,total,1991,252980942.0
2499,United States,USA,under18,1991,65313018.0
2500,United States,USA,under18,1992,66509177.0
2501,United States,USA,total,1992,256514231.0
2502,United States,USA,total,1993,259918595.0
2503,United States,USA,under18,1993,67594938.0
2504,United States,USA,under18,1994,68640936.0
2505,United States,USA,total,1994,263125826.0
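The index object collected above holds row labels, not positions. With a default 0..n-1 index the two coincide, but they need not. A minimal sketch on a hypothetical toy frame with a non-default index shows why `loc` is the safer selector for labels:

```python
import pandas as pd

# Hypothetical toy frame whose labels (10, 20, 30) differ from positions (0, 1, 2)
toy = pd.DataFrame({'state/region': ['AL', 'USA', 'USA']}, index=[10, 20, 30])

# Labels of the USA rows: 20 and 30
labels = toy.loc[toy['state/region'] == 'USA'].index

# Label-based selection works
by_label = toy.loc[labels]

# Position-based selection fails: positions 20 and 30 do not exist
try:
    toy.iloc[labels]
    iloc_ok = True
except IndexError:
    iloc_ok = False

print(by_label)
print(iloc_ok)
```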


In [26]:
abb_pop.loc[indexs,'state'] = 'United States'

In [29]:
# 2. Assign the full name for PR in the same way
indexs = abb_pop.loc[abb_pop['state/region'] == 'PR'].index
indexs

Int64Index([2448, 2449, 2450, 2451, 2452, 2453, 2454, 2455, 2456, 2457, 2458,
            2459, 2460, 2461, 2462, 2463, 2464, 2465, 2466, 2467, 2468, 2469,
            2470, 2471, 2472, 2473, 2474, 2475, 2476, 2477, 2478, 2479, 2480,
            2481, 2482, 2483, 2484, 2485, 2486, 2487, 2488, 2489, 2490, 2491,
            2492, 2493, 2494, 2495],
           dtype='int64')

In [32]:
abb_pop.loc[indexs,'state'] = 'PPPRRR' # 'PPPRRR' is a placeholder string; PR stands for Puerto Rico
abb_pop.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2544 entries, 0 to 2543
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   state         2544 non-null   object 
 1   state/region  2544 non-null   object 
 2   ages          2544 non-null   object 
 3   year          2544 non-null   int64  
 4   population    2524 non-null   float64
dtypes: float64(1), int64(1), object(3)
memory usage: 183.8+ KB


In [35]:
# Merge in the state area data (areas)
abb_pop_area = pd.merge(abb_pop,area,how='outer') # with no key given, joins on the shared state column
abb_pop_area.head()

Unnamed: 0,state,state/region,ages,year,population,area (sq. mi)
0,Alabama,AL,under18,2012.0,1117489.0,52423.0
1,Alabama,AL,total,2012.0,4817528.0,52423.0
2,Alabama,AL,under18,2010.0,1130966.0,52423.0
3,Alabama,AL,total,2010.0,4785570.0,52423.0
4,Alabama,AL,under18,2011.0,1125763.0,52423.0


In [40]:
# The area (sq. mi) column has missing data; find which rows
abb_pop_area['area (sq. mi)'].isnull()
abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()] # rows whose area is null
indexs = abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index
indexs

Int64Index([], dtype='int64')

In [41]:
# Drop the rows that contain missing data
abb_pop_area.drop(labels=indexs,axis=0,inplace=True)
abb_pop_area.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2449 entries, 0 to 2544
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   state          2449 non-null   object 
 1   state/region   2448 non-null   object 
 2   ages           2448 non-null   object 
 3   year           2448 non-null   float64
 4   population     2448 non-null   float64
 5   area (sq. mi)  2449 non-null   float64
dtypes: float64(3), object(3)
memory usage: 133.9+ KB
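Collecting the index of the null rows and passing it to `drop` removes only the rows whose area is null. Under that reading, `dropna` with a `subset` argument is a shorter way to get the same result; the sketch below uses a hypothetical toy frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; row 'B' lacks population, row 'C' lacks area
toy = pd.DataFrame({
    'state': ['A', 'B', 'C'],
    'population': [100.0, np.nan, 300.0],
    'area (sq. mi)': [10.0, 20.0, np.nan],
})

# Drop only the rows where area is null; nulls in other columns are kept
cleaned = toy.dropna(subset=['area (sq. mi)'])
print(cleaned)
```

Calling `dropna()` without `subset` would also drop row 'B', which is a different operation from the one in the notebook.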


In [42]:
# Select the total-population data for 2010 (conditional query on the df)
abb_pop_area.query('ages == "total" & year == 2010')

Unnamed: 0,state,state/region,ages,year,population,area (sq. mi)
3,Alabama,AL,total,2010.0,4785570.0,52423.0
91,Alaska,AK,total,2010.0,713868.0,656425.0
101,Arizona,AZ,total,2010.0,6408790.0,114006.0
189,Arkansas,AR,total,2010.0,2922280.0,53182.0
197,California,CA,total,2010.0,37333601.0,163707.0
283,Colorado,CO,total,2010.0,5048196.0,104100.0
293,Connecticut,CT,total,2010.0,3579210.0,5544.0
379,Delaware,DE,total,2010.0,899711.0,1954.0
389,District of Columbia,DC,total,2010.0,605125.0,68.0
475,Florida,FL,total,2010.0,18846054.0,65758.0
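The `query` string is shorthand for a boolean mask. For comparison, here is a sketch on a hypothetical toy frame showing that the two forms select the same rows:

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the notebook
toy = pd.DataFrame({
    'state': ['A', 'A', 'B'],
    'ages': ['total', 'under18', 'total'],
    'year': [2010, 2010, 2011],
})

# String expression evaluated against the column names
via_query = toy.query('ages == "total" & year == 2010')

# Equivalent boolean mask; each condition needs its own parentheses
via_mask = toy.loc[(toy['ages'] == 'total') & (toy['year'] == 2010)]

print(via_query.equals(via_mask))
```

The mask form is also the one to reach for with columns such as `state/region`, whose names are not valid Python identifiers and cannot be written bare inside a query string.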


In [43]:
# Compute each state's population density (population divided by area); the column name midu means density
abb_pop_area['midu'] = abb_pop_area['population'] / abb_pop_area['area (sq. mi)']
abb_pop_area

Unnamed: 0,state,state/region,ages,year,population,area (sq. mi),midu
0,Alabama,AL,under18,2012.0,1117489.0,52423.0,21.316769
1,Alabama,AL,total,2012.0,4817528.0,52423.0,91.897221
2,Alabama,AL,under18,2010.0,1130966.0,52423.0,21.573851
3,Alabama,AL,total,2010.0,4785570.0,52423.0,91.287603
4,Alabama,AL,under18,2011.0,1125763.0,52423.0,21.474601
...,...,...,...,...,...,...,...
2444,Wyoming,WY,total,1991.0,459260.0,97818.0,4.695046
2445,Wyoming,WY,under18,1991.0,136720.0,97818.0,1.397698
2446,Wyoming,WY,under18,1990.0,136078.0,97818.0,1.391135
2447,Wyoming,WY,total,1990.0,453690.0,97818.0,4.638103


In [47]:
# Sort, and find the state with the highest population density
abb_pop_area.sort_values(by='midu',axis=0,ascending=False).iloc[0]

state            District of Columbia
state/region                       DC
ages                            total
year                           2013.0
population                   646449.0
area (sq. mi)                    68.0
midu                      9506.602941
Name: 391, dtype: object
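The sort above ranks every year and age group together, which is why the top row is a 2013 record. If the goal is the densest state for a single snapshot such as the 2010 totals, one option is to filter first and then take `idxmax`. This is a sketch on hypothetical toy data; the column name `density` is an illustrative stand-in for midu.

```python
import pandas as pd

# Hypothetical toy frame (all numbers are made up)
toy = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B'],
    'ages': ['total', 'total', 'total', 'total'],
    'year': [2010, 2013, 2010, 2013],
    'population': [100.0, 400.0, 300.0, 350.0],
    'area (sq. mi)': [10.0, 10.0, 20.0, 20.0],
})
toy['density'] = toy['population'] / toy['area (sq. mi)']

# Ranking all rows together picks the densest row of any year
overall_top = toy.sort_values(by='density', ascending=False).iloc[0]

# Restrict to the 2010 totals first, then take the row with the largest density
snapshot = toy.query('ages == "total" & year == 2010')
top_2010 = snapshot.loc[snapshot['density'].idxmax()]

print(overall_top['state'], overall_top['year'])
print(top_2010['state'], top_2010['year'])
```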