## Data preprocessing

Based on the 2017 ACS 5-Year Estimates.

This notebook shows how to preprocess data from a .csv file into a NumPy integer array.

In [125]:
import pandas as pd
import numpy as np

The data records the population below the poverty level for each racial and ethnic group in each municipality of Wayne County, Michigan.

For each municipality, the columns are: total population, number of people below the poverty level, and percentage of people below the poverty level.
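The column labels in this file join the municipality, the measure, and the word "Estimate" with `!!`. As a minimal sketch, a label can be split back into those parts; `split_label` is a hypothetical helper for illustration and is not part of the original notebook.

```python
# Hypothetical helper (an assumption, not from the original notebook):
# split an ACS-style column label into its three "!!"-separated parts.
def split_label(column_name):
    place, measure, kind = column_name.split("!!")
    return place, measure, kind


label = "Allen Park city, Wayne County, Michigan!!Below poverty level!!Estimate"
place, measure, kind = split_label(label)
print(place)    # municipality part of the label
print(measure)  # which of the three columns this is
print(kind)     # trailing "Estimate" marker
```

This only works for labels with exactly three parts; the `Label (Grouping)` column has no `!!` separator and would need to be skipped.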

In [126]:
df = pd.read_csv("wayne_county_poverty_2017_race_only.csv")

df

Unnamed: 0,Label (Grouping),"Allen Park city, Wayne County, Michigan!!Total!!Estimate","Allen Park city, Wayne County, Michigan!!Below poverty level!!Estimate","Allen Park city, Wayne County, Michigan!!Percent below poverty level!!Estimate","Belleville city, Wayne County, Michigan!!Total!!Estimate","Belleville city, Wayne County, Michigan!!Below poverty level!!Estimate","Belleville city, Wayne County, Michigan!!Percent below poverty level!!Estimate","Dearborn city, Wayne County, Michigan!!Total!!Estimate","Dearborn city, Wayne County, Michigan!!Below poverty level!!Estimate","Dearborn city, Wayne County, Michigan!!Percent below poverty level!!Estimate",...,"Wayne city, Wayne County, Michigan!!Percent below poverty level!!Estimate","Westland city, Wayne County, Michigan!!Total!!Estimate","Westland city, Wayne County, Michigan!!Below poverty level!!Estimate","Westland city, Wayne County, Michigan!!Percent below poverty level!!Estimate","Woodhaven city, Wayne County, Michigan!!Total!!Estimate","Woodhaven city, Wayne County, Michigan!!Below poverty level!!Estimate","Woodhaven city, Wayne County, Michigan!!Percent below poverty level!!Estimate","Wyandotte city, Wayne County, Michigan!!Total!!Estimate","Wyandotte city, Wayne County, Michigan!!Below poverty level!!Estimate","Wyandotte city, Wayne County, Michigan!!Percent below poverty level!!Estimate"
0,Population for whom poverty status is determined,27136,1783,6.6%,3866,587,15.2%,94895,27573,29.1%,...,23.8%,81334,11887,14.6%,12393,910,7.3%,25157,2723,10.8%
1,White alone,25192,1574,6.2%,3437,540,15.7%,85934,25045,29.1%,...,20.1%,60371,8118,13.4%,10796,502,4.6%,24193,2524,10.4%
2,Black or African American alone,454,115,25.3%,347,47,13.5%,3526,682,19.3%,...,38.1%,14675,2878,19.6%,873,385,44.1%,229,75,32.8%
3,American Indian and Alaska Native alone,28,0,0.0%,0,0,-,113,34,30.1%,...,0.0%,381,39,10.2%,64,0,0.0%,45,15,33.3%
4,Asian alone,129,29,22.5%,24,0,0.0%,1960,518,26.4%,...,78.0%,3230,346,10.7%,430,0,0.0%,151,0,0.0%
5,Native Hawaiian and Other Pacific Islander alone,0,0,-,0,0,-,29,0,0.0%,...,-,51,0,0.0%,0,0,-,0,0,-
6,Some other race alone,571,32,5.6%,6,0,0.0%,358,187,52.2%,...,10.1%,877,143,16.3%,75,8,10.7%,207,44,21.3%
7,Two or more races,762,33,4.3%,52,0,0.0%,2975,1107,37.2%,...,17.9%,1749,363,20.8%,155,15,9.7%,332,65,19.6%
8,Hispanic or Latino origin (of any race),2687,100,3.7%,85,21,24.7%,2812,532,18.9%,...,24.8%,2613,596,22.8%,582,43,7.4%,1151,134,11.6%
9,"White alone, not Hispanic or Latino",23340,1501,6.4%,3368,535,15.9%,83643,24782,29.6%,...,19.6%,58713,7695,13.1%,10300,467,4.5%,23306,2438,10.5%


Dropping the percentage columns and the label column

In [127]:
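# Drop every "Percent below poverty level" column, keeping only the raw counts
sub_df = df[df.columns.drop(list(df.filter(regex='.*Percent below poverty level.*')))]
# Drop the "Label (Grouping)" column of row descriptions
sub_df = sub_df[sub_df.columns.drop(list(sub_df.filter(regex='Label.*')))]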
sub_df.head()

Unnamed: 0,"Allen Park city, Wayne County, Michigan!!Total!!Estimate","Allen Park city, Wayne County, Michigan!!Below poverty level!!Estimate","Belleville city, Wayne County, Michigan!!Total!!Estimate","Belleville city, Wayne County, Michigan!!Below poverty level!!Estimate","Dearborn city, Wayne County, Michigan!!Total!!Estimate","Dearborn city, Wayne County, Michigan!!Below poverty level!!Estimate","Dearborn Heights city, Wayne County, Michigan!!Total!!Estimate","Dearborn Heights city, Wayne County, Michigan!!Below poverty level!!Estimate","Detroit city, Wayne County, Michigan!!Total!!Estimate","Detroit city, Wayne County, Michigan!!Below poverty level!!Estimate",...,"Village of Grosse Pointe Shores city, Wayne County, Michigan!!Total!!Estimate","Village of Grosse Pointe Shores city, Wayne County, Michigan!!Below poverty level!!Estimate","Wayne city, Wayne County, Michigan!!Total!!Estimate","Wayne city, Wayne County, Michigan!!Below poverty level!!Estimate","Westland city, Wayne County, Michigan!!Total!!Estimate","Westland city, Wayne County, Michigan!!Below poverty level!!Estimate","Woodhaven city, Wayne County, Michigan!!Total!!Estimate","Woodhaven city, Wayne County, Michigan!!Below poverty level!!Estimate","Wyandotte city, Wayne County, Michigan!!Total!!Estimate","Wyandotte city, Wayne County, Michigan!!Below poverty level!!Estimate"
0,27136,1783,3866,587,94895,27573,55580,10665,668133,252897,...,2856,48,16884,4013,81334,11887,12393,910,25157,2723
1,25192,1574,3437,540,85934,25045,47501,9515,92379,35058,...,2657,41,12965,2604,60371,8118,10796,502,24193,2524
2,454,115,347,47,3526,682,4437,527,530864,199990,...,64,1,3179,1210,14675,2878,873,385,229,75
3,28,0,0,0,113,34,235,0,2301,1123,...,14,0,29,0,381,39,64,0,45,15
4,129,29,24,0,1960,518,1010,172,9536,4001,...,90,6,150,117,3230,346,430,0,151,0
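The same column selection can also be written with a boolean mask over the column names. The sketch below uses a small made-up frame (the `toy` data and the `A city` label are assumptions for illustration, not values from the CSV) and is an alternative phrasing rather than the notebook's own method.

```python
import pandas as pd

# Made-up stand-in for the real table, with the same kinds of columns.
toy = pd.DataFrame({
    "Label (Grouping)": ["All", "White alone"],
    "A city!!Total!!Estimate": ["1,000", "800"],
    "A city!!Below poverty level!!Estimate": ["100", "60"],
    "A city!!Percent below poverty level!!Estimate": ["10.0%", "7.5%"],
})

# True for every column that should be removed.
unwanted = toy.columns.str.contains(r"Percent below poverty level|^Label")

# Keep only the columns that are not flagged.
counts_only = toy.loc[:, ~unwanted]
print(list(counts_only.columns))
```

A single mask keeps both conditions in one place, which may be easier to extend if more columns need to be excluded later.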


Converting the string cells to integers for the NumPy array. Commas are removed first so that each cell is a plain digit string.

In [128]:
sub_df = sub_df.astype(str)  # ensure every cell is a string first (astype returns a new frame)
sub_df = sub_df.replace(',', '', regex=True)  # remove thousands separators
sub_df.head()


Unnamed: 0,"Allen Park city, Wayne County, Michigan!!Total!!Estimate","Allen Park city, Wayne County, Michigan!!Below poverty level!!Estimate","Belleville city, Wayne County, Michigan!!Total!!Estimate","Belleville city, Wayne County, Michigan!!Below poverty level!!Estimate","Dearborn city, Wayne County, Michigan!!Total!!Estimate","Dearborn city, Wayne County, Michigan!!Below poverty level!!Estimate","Dearborn Heights city, Wayne County, Michigan!!Total!!Estimate","Dearborn Heights city, Wayne County, Michigan!!Below poverty level!!Estimate","Detroit city, Wayne County, Michigan!!Total!!Estimate","Detroit city, Wayne County, Michigan!!Below poverty level!!Estimate",...,"Village of Grosse Pointe Shores city, Wayne County, Michigan!!Total!!Estimate","Village of Grosse Pointe Shores city, Wayne County, Michigan!!Below poverty level!!Estimate","Wayne city, Wayne County, Michigan!!Total!!Estimate","Wayne city, Wayne County, Michigan!!Below poverty level!!Estimate","Westland city, Wayne County, Michigan!!Total!!Estimate","Westland city, Wayne County, Michigan!!Below poverty level!!Estimate","Woodhaven city, Wayne County, Michigan!!Total!!Estimate","Woodhaven city, Wayne County, Michigan!!Below poverty level!!Estimate","Wyandotte city, Wayne County, Michigan!!Total!!Estimate","Wyandotte city, Wayne County, Michigan!!Below poverty level!!Estimate"
0,27136,1783,3866,587,94895,27573,55580,10665,668133,252897,...,2856,48,16884,4013,81334,11887,12393,910,25157,2723
1,25192,1574,3437,540,85934,25045,47501,9515,92379,35058,...,2657,41,12965,2604,60371,8118,10796,502,24193,2524
2,454,115,347,47,3526,682,4437,527,530864,199990,...,64,1,3179,1210,14675,2878,873,385,229,75
3,28,0,0,0,113,34,235,0,2301,1123,...,14,0,29,0,381,39,64,0,45,15
4,129,29,24,0,1960,518,1010,172,9536,4001,...,90,6,150,117,3230,346,430,0,151,0


In [129]:
data_array = sub_df.to_numpy()  # 2-D array; the cells are still Python strings
print(data_array[0][0], type(data_array[0][0]))

27136 <class 'str'>
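The string-to-integer path can be tried end to end on a small made-up frame. This is a minimal sketch: the `toy` frame and its column names `total` and `below` are assumptions for illustration, chosen to mimic count cells that carry thousands separators.

```python
import numpy as np
import pandas as pd

# Made-up count cells stored as strings, some with thousands separators.
toy = pd.DataFrame({"total": ["27,136", "454"], "below": ["1,783", "115"]})

# Force strings, then strip every comma from every cell.
cleaned = toy.astype(str).replace(",", "", regex=True)

# Convert to a 2-D NumPy array and parse each digit string as an integer.
as_int = cleaned.to_numpy().astype(int)
print(as_int.tolist())
print(np.issubdtype(as_int.dtype, np.integer))
```

Note that `astype(int)` raises a `ValueError` if any cell is not a valid integer literal, for example the `-` placeholder that appears in the percentage columns, so such columns have to be dropped or cleaned before the cast.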


In [130]:
data_array = data_array.astype(int)  # parse each digit string as an integer
print(data_array[0][0], type(data_array[0][0]))
data_array

27136 <class 'numpy.int64'>


array([[ 27136,   1783,   3866,    587,  94895,  27573,  55580,  10665,
        668133, 252897,   9240,   3399,   9881,    876,  26747,   2901,
          4536,    398,  10173,    277,   5206,    195,   9206,    153,
         11240,    721,  15630,    642,  21164,  10772,  13643,   1815,
         10704,   5245,  15692,   1282,  24368,   8084,  36899,   7352,
         93469,   5047,  10402,   2916,   2637,    218,  28812,    685,
          8872,    396,   7559,   2918,  11647,   1365,   3183,    133,
         23136,   4336,  29162,   3319,   9272,   1478,  60750,  12036,
         18115,   1191,   2856,     48,  16884,   4013,  81334,  11887,
         12393,    910,  25157,   2723],
       [ 25192,   1574,   3437,    540,  85934,  25045,  47501,   9515,
         92379,  35058,   3905,   1502,   9216,    789,  24994,   2607,
          4343,    377,   9784,    255,   4808,    195,   8689,    134,
          9646,    642,  13715,    545,  11634,   5383,   4851,    463,
           446,    247,