
# Telling a Story with Data

In this notebook, I'll show you how to use the tools we've learned to pose a sociological question, find and read appropriate data, and then do a series of analyses that form an answer to the question. Through this process, we'll keep the principles of *Data Feminism* in mind.



The data source I identified as relevant to my question is IPUMS at the University of Minnesota: https://usa.ipums.org/usa/index.shtml

IPUMS provides US Census microdata to researchers, with identifying information removed. The name used to stand for "Integrated Public Use Microdata Series," but people now treat it simply as the name of the data source. It allows us to ask many different kinds of questions about the US population, at the level of both individual adults and households. It is a popular data source for many kinds of sociological research.



IPUMS data: https://usa.ipums.org/usa-action/variables/group

NYC Department of City Planning: https://www.nyc.gov/site/planning/data-maps/open-data.page

American Community Survey from US Census: https://www.census.gov/programs-surveys/acs/

Getting American Community Survey data from the US Census via their API:
https://www.census.gov/programs-surveys/acs/data/data-via-api.html

Here's a handy summary:
https://www.census.gov/content/dam/Census/programs-surveys/acs/data/census-data-api-flyer_ACS.pdf
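
As a rough illustration of what a Census API request looks like, the sketch below only builds a query URL; it does not download anything. The helper name `build_acs_url`, the choice of the 2022 ACS 5-year dataset, and the example variable and geography are assumptions made for illustration, so check the API documentation linked above before relying on them.

```python
from urllib.parse import urlencode

# Endpoint pattern for the ACS 5-year estimates (assumed dataset for this sketch).
BASE_URL = "https://api.census.gov/data/{year}/acs/acs5"


def build_acs_url(year, variables, geography):
    """Return a Census API query URL (hypothetical helper, not part of any library)."""
    # Keep commas and colons readable instead of percent-encoding them.
    query = urlencode({"get": ",".join(variables), "for": geography}, safe=",:*")
    return BASE_URL.format(year=year) + "?" + query


# Illustrative request: total population (B01001_001E) for New York State (FIPS 36).
url = build_acs_url(2022, ["NAME", "B01001_001E"], "state:36")
print(url)
```

The response is JSON in which the first row holds the column names, so it can be turned into a DataFrame once downloaded.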



In [1]:
# Code block 0: Install library
!pip install researchpy

Collecting researchpy
  Downloading researchpy-0.3.6-py3-none-any.whl.metadata (1.2 kB)
Downloading researchpy-0.3.6-py3-none-any.whl (34 kB)
Installing collected packages: researchpy
Successfully installed researchpy-0.3.6


In [None]:
# Code block 1: Libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sb
import matplotlib.pyplot as plt
import math
import researchpy as rp

In [None]:
# Code block 2: Reading the data
IPUMS_df = pd.read_csv('/content/drive/MyDrive/Data/usa_00019.csv')


In [None]:
IPUMS_df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20416 entries, 0 to 20415
Data columns (total 33 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   YEAR       20416 non-null  int64  
 1   SAMPLE     20416 non-null  int64  
 2   SERIAL     20416 non-null  int64  
 3   CBSERIAL   20416 non-null  int64  
 4   HHWT       20416 non-null  float64
 5   CLUSTER    20416 non-null  int64  
 6   STATEFIP   20416 non-null  int64  
 7   COUNTYFIP  20416 non-null  int64  
 8   STRATA     20416 non-null  int64  
 9   GQ         20416 non-null  int64  
 10  PERNUM     20416 non-null  int64  
 11  PERWT      20416 non-null  float64
 12  SEX        20416 non-null  int64  
 13  AGE        20416 non-null  int64  
 14  MARST      20416 non-null  int64  
 15  HISPAN     20416 non-null  int64  
 16  HISPAND    20416 non-null  int64  
 17  CITIZEN    20416 non-null  int64  
 18  RACASIAN   20416 non-null  int64  
 19  RACBLK     20416 non-null  int64  
 20  RACPAC

In [None]:
# Handle missing values: recode out-of-range codes as NaN (not a number)
IPUMS_df['SEX'] = np.where(IPUMS_df['SEX'] > 2, np.nan, IPUMS_df['SEX'])
IPUMS_df['AGE'] = np.where(IPUMS_df['AGE'] > 140, np.nan, IPUMS_df['AGE'])
IPUMS_df['EDUC'] = np.where(IPUMS_df['EDUC'] > 11, np.nan, IPUMS_df['EDUC'])
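
To see what this recoding does, here is a minimal sketch on a toy DataFrame. The rows and the codes 9, 999, and 99 are made up for illustration, and it uses `Series.where` as an alternative to `np.where`; the thresholds mirror the ones in the cell above.

```python
import numpy as np
import pandas as pd

# Toy data with made-up rows; 9, 999, and 99 stand in for missing-value codes.
toy = pd.DataFrame({
    "SEX": [1, 2, 9, 2],
    "AGE": [34, 999, 51, 27],
    "EDUC": [6, 10, 99, 11],
})

# Largest valid code for each column, mirroring the thresholds used above.
valid_max = {"SEX": 2, "AGE": 140, "EDUC": 11}

for col, max_code in valid_max.items():
    # Series.where keeps values that meet the condition and sets the rest to NaN.
    toy[col] = toy[col].where(toy[col] <= max_code)

# Count how many values in each column were recoded to NaN.
missing_counts = toy.isna().sum()
print(missing_counts)
```

Checking the missing counts after recoding is a quick way to confirm that only the out-of-range codes were affected.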


In [None]:
# Build a model to start the data narrative
# For a categorical variable, leave one category out as the comparison (reference)
# group and include a binary variable for each of the other categories
Y = IPUMS_df['UHRSWORK']  # Dependent variable (numeric): usual hours worked per week
# Independent variables (binary or numeric); White is the comparison category
# because each of the other race categories is entered as its own indicator
X = IPUMS_df[['SEX', 'AGE', 'RACASIAN', 'RACBLK', 'RACPACIS', 'RACOTHER', 'EDUC']]
X = sm.add_constant(X)
model0 = sm.OLS(Y, X, missing='drop').fit()
print(model0.summary())

                            OLS Regression Results                            
Dep. Variable:               UHRSWORK   R-squared:                       0.252
Model:                            OLS   Adj. R-squared:                  0.252
Method:                 Least Squares   F-statistic:                     982.9
Date:                Tue, 05 Nov 2024   Prob (F-statistic):               0.00
Time:                        18:58:44   Log-Likelihood:                -88073.
No. Observations:               20416   AIC:                         1.762e+05
Df Residuals:                   20408   BIC:                         1.762e+05
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.7045      3.350      0.509      0.6

**About 25% of the variation in usual hours worked is explained by the demographic variables in the model (R-squared = 0.252).**
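
To make the meaning of R-squared concrete, the sketch below computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares. It uses synthetic data and plain NumPy least squares rather than statsmodels, so the numbers have nothing to do with the IPUMS results above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: an outcome that depends on one predictor plus random noise.
n = 500
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=2.0, size=n)

# Fit OLS with an intercept using least squares.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-squared = 1 - (residual sum of squares / total sum of squares).
residuals = y - X @ beta
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```

An R-squared of 0.252 therefore means that the residual variation is roughly three quarters of the total variation in the outcome.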

This model treats the demographic variables as if their effects were independent of one another. But what about intersectionality? Variables don't exist in a vacuum.




So here we've asked whether race, gender, age, and education affect the number of hours usually worked per week. With OLS, each coefficient estimates the unique effect of one independent variable, because all of the other independent variables in the model act as controls.
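
The comparison-category idea from the model comments can be sketched with `pd.get_dummies`. The IPUMS extract already supplies separate race indicators, so this is not how the notebook builds its variables; the single `race` column and its values here are made up to show how one category gets left out as the reference.

```python
import pandas as pd

# Made-up categorical column, for illustration only.
toy = pd.DataFrame({"race": ["White", "Black", "Asian", "White", "Other"]})

# List the reference category first so that drop_first removes it.
toy["race"] = pd.Categorical(
    toy["race"], categories=["White", "Asian", "Black", "Other"]
)

# One 0/1 column per remaining category; White rows are all zeros.
dummies = pd.get_dummies(toy["race"], prefix="RACE", drop_first=True, dtype=int)
print(dummies)
```

Each coefficient on a dummy column is then read as a difference from the omitted category, holding the other variables constant.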

We see a pattern of significant effects. So our next step is to think about what would be meaningful follow-up questions. This is how we construct a data narrative, by thinking about how we can explain the meaning of the pattern we see in the original model.

One question we might ask is about interaction effects. This is a way to operationalize the concept of intersectionality.

In [None]:
# Create an interaction variable: being a woman and being Black (Black women vs. everyone else)
# Note: the product isolates Black women only if RACBLK is coded 0/1; check the IPUMS codebook
IPUMS_df['WOMAN'] = np.where(IPUMS_df['SEX'] == 2, 1, 0)
IPUMS_df['BLACK_WOMAN'] = IPUMS_df['WOMAN'] * IPUMS_df['RACBLK']  # Interaction between WOMAN and RACBLK
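
Multiplying two indicators gives a clean interaction term only when both are coded 0/1. The sketch below shows one way to recode before multiplying. It assumes, purely for illustration, that both indicators arrive coded 1 and 2; the actual coding of `RACBLK` should be confirmed in the IPUMS codebook, and the `BLACK` column name is hypothetical.

```python
import pandas as pd

# Made-up rows. Assumption: both variables are coded 1 = No/Male, 2 = Yes/Female.
toy = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "RACBLK": [1, 1, 2, 2],
})

# Recode each indicator to 0/1 so the product is 1 only when both are true.
toy["WOMAN"] = (toy["SEX"] == 2).astype(int)
toy["BLACK"] = (toy["RACBLK"] == 2).astype(int)  # hypothetical helper column
toy["BLACK_WOMAN"] = toy["WOMAN"] * toy["BLACK"]

print(toy)
```

A quick `value_counts()` on the resulting column is a simple check that it contains only 0 and 1.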


In [None]:
# Extend the original model with the interaction term
Y = IPUMS_df['UHRSWORK']  # Usual hours worked per week
# Same predictors as the original model, plus the new interaction variable BLACK_WOMAN
X = IPUMS_df[['SEX', 'AGE', 'RACASIAN', 'RACBLK', 'RACPACIS', 'RACOTHER', 'EDUC', 'BLACK_WOMAN']]
X = sm.add_constant(X)
model1 = sm.OLS(Y, X, missing='drop').fit()
print(model1.summary())

                            OLS Regression Results                            
Dep. Variable:               UHRSWORK   R-squared:                       0.253
Model:                            OLS   Adj. R-squared:                  0.253
Method:                 Least Squares   F-statistic:                     864.3
Date:                Tue, 05 Nov 2024   Prob (F-statistic):               0.00
Time:                        19:12:34   Log-Likelihood:                -88060.
No. Observations:               20416   AIC:                         1.761e+05
Df Residuals:                   20407   BIC:                         1.762e+05
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           8.4017      3.596      2.337      

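**Summary: the interaction coefficient indicates that Black women usually work about 3.9 hours more per week than the separate effects of gender and race alone would predict.**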

Here, in model 1 (the second model) we see a significant effect for the interaction variable we created to examine one aspect of intersectionality. We could (and should) create other interaction effects to reflect more fully the nature of intersectionality.
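
When reading an interaction coefficient, it helps to remember that it is added on top of the two main effects. The sketch below uses noise-free synthetic data with made-up coefficients (not the IPUMS estimates) and plain NumPy least squares to show that the gap between the group with both indicators set and the reference group is the sum of the two main effects and the interaction term.

```python
import itertools
import numpy as np

# Made-up coefficients used to generate a noise-free outcome.
b_const, b_woman, b_black, b_inter = 40.0, -5.0, -1.0, 3.0

# All four combinations of the two 0/1 indicators: (0,0), (0,1), (1,0), (1,1).
groups = np.array(list(itertools.product([0, 1], repeat=2)))
woman, black = groups[:, 0], groups[:, 1]
y = b_const + b_woman * woman + b_black * black + b_inter * woman * black

# Design matrix: constant, two main effects, and their product.
X = np.column_stack([np.ones(4), woman, black, woman * black])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gap between the (woman=1, black=1) group and the reference group (0, 0):
# both main effects plus the interaction term.
gap = coef[1] + coef[2] + coef[3]
print(np.round(coef, 2), round(gap, 2))
```

The interaction coefficient on its own is therefore not the full difference between the two groups; it is the part of that difference that the main effects do not account for.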

Then we want to think of the next question that would advance our data narrative.

## Activity
1. Think of a follow-up question to ask and compute a new model.

2. Interpret the results of your model.

3. Explain how your model advances the data narrative.

4. What would be a good next question?

5. What information about the American Community Survey from the US Census would you include in a data biography?