# Data cleaning for exploring Simpson's paradox

This is one of the Jupyter notebooks I used in my preparation of *[Probably Overthinking It: How to Use Data to Answer Questions, Avoid Statistical Traps, and Make Better Decisions](https://greenteapress.com/wp/probably-overthinking-it)*, University of Chicago Press, 2023.

Before you read these notebooks, please keep in mind:

* There is some explanatory text here, but some of the examples will not make sense until you have read the corresponding chapter in the book.

* While preparing these notebooks, I made some changes to improve the readability of the code. There might be small differences between what appears in the book and what you get when you run the code.

[Click here to run this notebook on Colab](https://colab.research.google.com/github/AllenDowney/ProbablyOverthinkingIt/blob/book/notebooks/clean_simpson.ipynb).

In [43]:
from os.path import basename, exists

def download(url):
    """Download `url` into the current directory unless the file already exists."""
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [45]:
# Download the GSS extract that this notebook cleans

DATA_PATH = "https://github.com/AllenDowney/ProbablyOverthinkingIt/raw/book/data/"

filename = "gss_eds.3.hdf"
download(DATA_PATH + filename)

Downloaded gss_eds.3.hdf


In [46]:
!ls -lh gss_eds.3.hdf

-rw-rw-r-- 1 downey downey 31M Jan  7 14:34 gss_eds.3.hdf


In [47]:
gss = pd.read_hdf('gss_eds.3.hdf', 'gss1')
gss.shape

(68846, 205)

In [48]:
recode_polviews = {1: 'Liberal',
                   2: 'Liberal',
                   3: 'Liberal',
                   4: 'Moderate',
                   5: 'Conservative',
                   6: 'Conservative',
                   7: 'Conservative'}

In [49]:
gss['polviews3'] = gss['polviews'].replace(recode_polviews)
gss['polviews3'].value_counts()

polviews3
Moderate        23125
Conservative    20165
Liberal         16056
Name: count, dtype: int64

> Generally speaking, do you usually think of yourself as a Republican, Democrat, Independent, or what?

The valid responses are:

```
0	Strong democrat
1	Not str democrat
2	Ind,near dem
3	Independent
4	Ind,near rep
5	Not str republican
6	Strong republican
7	Other party
```

You can [read the codebook for `partyid` here](https://gssdataexplorer.norc.org/projects/52787/variables/141/vshow).

In [10]:
recode_partyid = {0: 'Democrat',
                  1: 'Democrat',
                  2: 'Independent',
                  3: 'Independent',
                  4: 'Independent',
                  5: 'Republican',
                  6: 'Republican',
                  7: 'Other'}

In [11]:
gss['partyid4'] = gss['partyid'].replace(recode_partyid)
gss['partyid4'].value_counts()

partyid4
Independent    25271
Democrat       24560
Republican     17328
Other           1198
Name: count, dtype: int64

The variable `degree` records the respondent's highest degree:

```
0 	Lt high school
1 	High school
2 	Junior college
3 	Bachelor
4 	Graduate
8 	Don't know
9 	No answer
```



In [12]:
gss['degree'].value_counts()

degree
1.0    35623
0.0    14021
3.0     9895
4.0     4940
2.0     4158
Name: count, dtype: int64

> What is your religious preference? Is it Protestant, Catholic, Jewish, some other religion, or no religion?

```
1 	Protestant
2 	Catholic
3 	Jewish
4 	None
5 	Other
6 	Buddhism
7 	Hinduism
8 	Other eastern
9 	Moslem/islam
10 	Orthodox-christian
11 	Christian
12 	Native american
13 	Inter-nondenominational
```



In [13]:
recode_relig = {1: 'Protestant',
                2: 'Catholic',
                3: 'Other',
                4: 'None',
                5: 'Other',
                6: 'Other',
                7: 'Other',
                8: 'Other',
                9: 'Other',
                10: 'Other Christian',
                11: 'Other Christian',
                12: 'Other',
                13: 'Other'}

In [14]:
gss['relig5'] = gss['relig'].replace(recode_relig)
gss['relig5'].value_counts()

relig5
Protestant         37689
Catholic           17428
None                8875
Other               3301
Other Christian     1099
Name: count, dtype: int64

> If you were asked to use one of four names for your social class, which would you say you belong in: the lower class, the working class, the middle class, or the upper class?
 
```
1 	Lower class
2 	Working class
3 	Middle class
4 	Upper class
5 	No class
8 	Don't know
9 	No answer
0 	Not applicable
```

In [15]:
recode_class = {1: 'Lower class',
                2: 'Working class',
                3: 'Middle class',
                4: 'Upper class'}

In [16]:
gss['class'] = gss['class'].replace(recode_class)
gss['class'].value_counts()

class
Working class    30094
Middle class     29367
Lower class       3801
Upper class       2068
Name: count, dtype: int64
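
The recoding cells in this notebook use `Series.replace` with a dictionary. As a minimal sketch on synthetic codes (not GSS data, and with a hypothetical two-entry recode table), the following shows one property worth keeping in mind: `replace` passes unmapped codes through unchanged, while `Series.map` turns them into `NaN`. The counts above show only the four named classes, which suggests no unmapped codes remained in this column, but that is an inference from the output rather than something the notebook checks.

```python
import numpy as np
import pandas as pd

# Synthetic codes, not GSS data: 5 stands for a code missing from the
# recode table, NaN for a missing answer.
codes = pd.Series([1.0, 3.0, 5.0, np.nan])

# Hypothetical recode table covering only codes 1 and 3.
recode = {1: 'Lower class', 3: 'Middle class'}

replaced = codes.replace(recode)  # unmapped 5.0 is kept as 5.0
mapped = codes.map(recode)        # unmapped 5.0 becomes NaN

print(replaced.tolist())
print(mapped.tolist())
```

Either method leaves the genuinely missing answer as `NaN`; the difference only matters when the table does not cover every code present in the data.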

```
0 	Lt high school
1 	High school
2 	Junior college
3 	Bachelor
4 	Graduate
```

In [17]:
recode_degree = {0: 'Less than high school',
                 1: 'High school',
                 2: 'Junior college',
                 3: 'Bachelor',
                 4: 'Graduate'}

In [18]:
gss['degree5'] = gss['degree'].replace(recode_degree)
gss['degree5'].value_counts()

degree5
High school              35623
Less than high school    14021
Bachelor                  9895
Graduate                  4940
Junior college            4158
Name: count, dtype: int64

The variable `age` records the respondent's age in years:

In [19]:
gss['age'].describe()

count    68290.000000
mean        44.900966
std         17.139294
min         18.000000
25%         30.000000
50%         43.000000
75%         57.000000
max         89.000000
Name: age, dtype: float64

In [20]:
gss['age'].head()

0    60.0
1    51.0
2    23.0
3    42.0
4    43.0
Name: age, dtype: float64

In [21]:
bins = np.arange(17, 95, 5)
print(len(bins))
bins

16


array([17, 22, 27, 32, 37, 42, 47, 52, 57, 62, 67, 72, 77, 82, 87, 92])

In [22]:
labels = bins[:-1] + 3

gss['age5'] = pd.cut(gss['age'], bins, labels=labels).astype(float)
gss['age5'].head()

0    60.0
1    50.0
2    25.0
3    40.0
4    45.0
Name: age5, dtype: float64
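
To see how the bin edges and labels line up, here is a small sketch that applies the same `bins` and `labels` to a handful of synthetic ages chosen to sit on or near the edges. It relies on the default behavior of `pd.cut`, which uses right-closed intervals, so the first bin is (17, 22] and covers ages 18 through 22.

```python
import numpy as np
import pandas as pd

# Synthetic ages on or near bin edges (not GSS data).
ages = pd.Series([18, 22, 23, 51, 60, 89])

bins = np.arange(17, 95, 5)   # edges 17, 22, ..., 92
labels = bins[:-1] + 3        # 20, 25, ..., 90

# Default pd.cut intervals are right-closed: (17, 22], (22, 27], ...
age5 = pd.cut(ages, bins, labels=labels).astype(float)
print(age5.tolist())
```

Ages 18 and 22 both land in the bin labeled 20, and 23 starts the bin labeled 25, so each label is the middle of a five-year age range.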

In [23]:
gss['age5'].value_counts().sort_index()

age5
20.0    5565
25.0    7242
30.0    7022
35.0    7104
40.0    6620
45.0    6246
50.0    5919
55.0    5533
60.0    4840
65.0    3970
70.0    3208
75.0    2443
80.0    1439
85.0     774
90.0     365
Name: count, dtype: int64

In [24]:
gss['cohort'].head()

0    1912.0
1    1921.0
2    1949.0
3    1930.0
4    1929.0
Name: cohort, dtype: float64

In [25]:
bins = np.arange(1889, 2001, 10)
labels = bins[:-1] + 1

gss['cohort10'] = pd.cut(gss['cohort'], bins, labels=labels).astype(float)
gss['cohort10'].head()

0    1910.0
1    1920.0
2    1940.0
3    1930.0
4    1920.0
Name: cohort10, dtype: float64
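
With these edges, each bin (1889, 1899], (1899, 1909], and so on covers one calendar decade and is labeled with the decade's first year. As a cross-check, and not the method used in this notebook, the sketch below compares the `pd.cut` result with plain floor division on synthetic birth years. It assumes whole-number years inside the 1890 to 1999 range.

```python
import numpy as np
import pandas as pd

# Synthetic birth years inside the binned range, plus one missing value.
cohort = pd.Series([1890.0, 1899.0, 1912.0, 1949.0, 1999.0, np.nan])

bins = np.arange(1889, 2001, 10)   # edges 1889, 1899, ..., 1999
labels = bins[:-1] + 1             # 1890, 1900, ..., 1990

with_cut = pd.cut(cohort, bins, labels=labels).astype(float)

# Alternative for comparison only: round down to the start of the decade.
with_floor = np.floor(cohort / 10) * 10

print(with_cut.equals(with_floor))
```

One difference to keep in mind: `pd.cut` returns `NaN` for years outside the edges, while floor division would assign them a decade anyway.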

In [26]:
gss['cohort10'].value_counts().sort_index()

cohort10
1890.0      483
1900.0     1681
1910.0     3648
1920.0     5992
1930.0     7011
1940.0    10728
1950.0    13726
1960.0    11023
1970.0     7238
1980.0     4619
1990.0     1877
Name: count, dtype: int64

In [27]:
gss['year'].tail()

68841    2021
68842    2021
68843    2021
68844    2021
68845    2021
Name: year, dtype: int16

In [28]:
bins = np.arange(1970, 2026, 5)
bins

array([1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020,
       2025])

In [29]:
labels = bins[:-1] + 2

gss['year5'] = pd.cut(gss['year'], bins, labels=labels).astype(float)
gss['year5'].tail()

68841    2022.0
68842    2022.0
68843    2022.0
68844    2022.0
68845    2022.0
Name: year5, dtype: float64

In [30]:
gss['year5'].value_counts().sort_index()

year5
1972.0    6091
1977.0    6029
1982.0    6466
1987.0    7679
1992.0    6115
1997.0    8553
2002.0    5577
2007.0    8577
2012.0    4512
2017.0    5215
2022.0    4032
Name: count, dtype: int64

`realinc`: family income on 1972-2006 surveys in constant dollars (base = 1986).

In [31]:
gss['realinc'].describe()

count     61277.000000
mean      34892.588431
std       31337.050178
min         218.000000
25%       13595.000000
50%       25837.000000
75%       43600.000000
max      162607.000000
Name: realinc, dtype: float64

In [32]:
gss['log_realinc'] = np.log10(gss['realinc'])
gss['log_realinc'].describe()

count    61277.000000
mean         4.366593
std          0.441560
min          2.338456
25%          4.133379
50%          4.412242
75%          4.639486
max          5.211139
Name: log_realinc, dtype: float64
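
Because the transform uses base 10, values on the log scale are easy to read back as dollars: 4 corresponds to \$10,000 and 5 to \$100,000. The sketch below round-trips the median income reported by `describe()` above; the variable names are only for illustration.

```python
import numpy as np

# Median of realinc, copied from the describe() output above.
median_income = 25837.0

log_median = np.log10(median_income)   # position on the log10 scale
recovered = 10 ** log_median           # back to constant dollars

print(round(log_median, 3), round(recovered))
```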

In [33]:
temp, bins = pd.qcut(gss['log_realinc'], 10, retbins=True)
temp.head()

0    (4.412, 4.504]
1    (4.327, 4.412]
2    (4.214, 4.327]
3    (4.688, 4.861]
4    (4.587, 4.688]
Name: log_realinc, dtype: category
Categories (10, interval[float64, right]): [(2.337, 3.829] < (3.829, 4.059] < (4.059, 4.214] < (4.214, 4.327] ... (4.504, 4.587] < (4.587, 4.688] < (4.688, 4.861] < (4.861, 5.211]]

In [34]:
bins

array([2.33845649, 3.82872433, 4.05941194, 4.21351776, 4.32735893,
       4.41224209, 4.50416491, 4.5866998 , 4.68833082, 4.86117584,
       5.21113924])

In [35]:
labels = np.diff(bins) / 2 + bins[:-1]
labels

array([3.08359041, 3.94406813, 4.13646485, 4.27043835, 4.36980051,
       4.4582035 , 4.54543236, 4.63751531, 4.77475333, 5.03615754])

In [36]:
gss['log_realinc10'] = pd.cut(gss['log_realinc'], bins, labels=labels).astype(float)
gss['log_realinc10'].head()

0    4.458203
1    4.369801
2    4.270438
3    4.774753
4    4.637515
Name: log_realinc10, dtype: float64
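
One detail to check when reusing the edges from `pd.qcut` in `pd.cut`: the first edge returned by `retbins=True` equals the minimum of the data, and `pd.cut` leaves the lowest edge out of the first interval unless `include_lowest=True` is passed. The sketch below shows the effect on synthetic values (not GSS data). The output above does not report how many rows, if any, end up as `NaN` in `log_realinc10`, so treat this as something to verify rather than a confirmed problem.

```python
import numpy as np
import pandas as pd

# Synthetic values, not GSS data.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Quantile edges; the first edge equals the minimum of the data.
_, bins = pd.qcut(values, 4, retbins=True)

# Label each bin with its midpoint, as in the cell above.
labels = np.diff(bins) / 2 + bins[:-1]

default = pd.cut(values, bins, labels=labels).astype(float)
inclusive = pd.cut(values, bins, labels=labels,
                   include_lowest=True).astype(float)

print(default.isna().sum())     # the minimum falls outside the first bin
print(inclusive.isna().sum())   # the minimum is included
```

Comparing `gss['log_realinc10'].count()` with `gss['log_realinc'].count()` would show whether any respondents are affected.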

In [38]:
!rm -f gss_simpson.hdf

In [40]:
gss.to_hdf('gss_simpson.hdf', key='gss', complevel=6)

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block3_values] [items->Index(['class', 'polviews3', 'partyid4', 'relig5', 'degree5'], dtype='object')]

  gss.to_hdf('gss_simpson.hdf', key='gss', complevel=6)


In [41]:
!ls -lh gss_simpson.hdf

-rw-rw-r-- 1 downey downey 12M Jan  7 14:08 gss_simpson.hdf


Copyright 2023 Allen B. Downey

