Luke Dauer
## Load Data

Copy/paste the following code block to the code cell below and make sure it runs without error:

In [5]:
import pandas as pd 

# load sample csv
df_gss_sample = pd.read_csv('data/gss_sample.csv', index_col=0, low_memory=False, encoding='utf8') 

# load row counts of full data
gss_full_row_counts = pd.read_csv('data/gss_full_row_counts.csv', index_col=0, low_memory=False, encoding='utf8')

# load years per variable data
gss_years_per_var = pd.read_csv('meta/gss_data_years_per_var.csv', index_col=0, low_memory=False, encoding='utf8')

# load data dictionary
gss_data_dictionary = pd.read_csv('meta/gss_data_dictionary.csv', index_col=0, low_memory=False, encoding='latin1')

## Question 1

In the code cell below, perform the following steps: 

1. Using only basic column assignment methods, use the respondent's `age` and `agewed` values in `df_gss_sample` to derive a new column called `years_since_married` that represents the number of years since the respondent was married. 
2. Using only basic column assignment methods, use the `year` and `years_since_married` variables in `df_gss_sample` to derive a new column called `approx_year_wed` that represents the approximate year the respondent was married. 
3. Display the first ten rows of the columns `year`, `age`, `agewed`, `marital`, `years_since_married`, and `approx_year_wed` in the DataFrame `df_gss_sample`

```python
# your code here 
```

---

In [9]:
df_gss_sample['years_since_married'] = df_gss_sample['age'] - df_gss_sample['agewed']

df_gss_sample['approx_year_wed'] = df_gss_sample['year'] - df_gss_sample['years_since_married']

print(df_gss_sample[['year', 'age', 'agewed', 'marital', 'years_since_married', 'approx_year_wed']].head(10))


       year   age  agewed  marital  years_since_married  approx_year_wed
1247   1972  68.0    18.0      2.0                 50.0           1922.0
31037  1994  33.0     NaN      1.0                  NaN              NaN
50404  2006  38.0     NaN      5.0                  NaN              NaN
34084  1996  77.0     NaN      2.0                  NaN              NaN
38488  2000  26.0     NaN      5.0                  NaN              NaN
24889  1989  26.0    22.0      3.0                  4.0           1985.0
22793  1988  40.0    22.0      1.0                 18.0           1970.0
64359  2018  43.0     NaN      5.0                  NaN              NaN
30582  1994  43.0    23.0      3.0                 20.0           1974.0
66423  2021  73.0     NaN      2.0                  NaN              NaN
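
The `NaN` rows above follow from how column arithmetic works: the subtraction is element-wise, so a missing `agewed` leaves both derived columns missing. A minimal sketch of that behavior, using a small toy DataFrame modeled on the first rows of the output (the name `toy` and the extra `approx_year_wed_int` column are illustrative assumptions, not part of the assignment):

```python
import numpy as np
import pandas as pd

# Toy rows modeled on the printed output; not a substitute for df_gss_sample.
toy = pd.DataFrame({
    "year":   [1972, 1994, 1989],
    "age":    [68.0, 33.0, 26.0],
    "agewed": [18.0, np.nan, 22.0],
})

# Element-wise subtraction: a missing agewed propagates as NaN.
toy["years_since_married"] = toy["age"] - toy["agewed"]
toy["approx_year_wed"] = toy["year"] - toy["years_since_married"]

# Optional (assumption): the nullable Int64 dtype drops the trailing ".0"
# while still representing missing values as <NA>.
toy["approx_year_wed_int"] = toy["approx_year_wed"].astype("Int64")

print(toy)
```

The result is only approximate because `age` and `agewed` are recorded in whole years.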


## Question 2

In the code cell below, perform the following steps: 

1. Write a function called `translate_regions` that takes a `DataFrame` row as its input. Assuming that the input `DataFrame` has the exact structure of `df_gss_sample`, write the function so that it converts each numerical code in the `reg16` and `region` columns respectively into their appropriate semantic/string categories.
2. Call the function on `df_gss_sample` using the `apply()` method and capture the results as new columns in `df_gss_sample` called `reg16_trans` and `region_trans` respectively.
3. Display the first ten rows of the columns `reg16`, `region`, `reg16_trans`, and `region_trans` in the DataFrame `df_gss_sample`

__Hint:__ Use the `regions` dictionary in the code cell below in your function. You may also need to handle some cases where the column value is not found in the dictionary.

```python
# your code here 
```

---

In [13]:
regions = {
    1: "New England",
    2: "Middle Atlantic",
    3: "East North Central",
    4: "West North Central",
    5: "South Atlantic",
    6: "East South Central",
    7: "West South Central",
    8: "Mountain",
    9: "Pacific"
}

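def translate_regions(row):
    # dict.get() falls back to "Unknown" for missing (NaN) or unlisted codes
    reg16_trans = regions.get(row['reg16'], "Unknown")
    region_trans = regions.get(row['region'], "Unknown")
    # return both labels so apply() yields two columns
    return pd.Series([reg16_trans, region_trans])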

df_gss_sample[['reg16_trans', 'region_trans']] = df_gss_sample.apply(translate_regions, axis=1)

print(df_gss_sample[['reg16', 'region', 'reg16_trans', 'region_trans']].head(10))


       reg16  region         reg16_trans        region_trans
1247     2.0       5     Middle Atlantic      South Atlantic
31037    5.0       5      South Atlantic      South Atlantic
50404    7.0       7  West South Central  West South Central
34084    5.0       5      South Atlantic      South Atlantic
38488    2.0       9     Middle Atlantic             Pacific
24889    8.0       8            Mountain            Mountain
22793    5.0       5      South Atlantic      South Atlantic
64359    7.0       7  West South Central  West South Central
30582    2.0       2     Middle Atlantic     Middle Atlantic
66423    3.0       3  East North Central  East North Central
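
The assignment asks for a row-wise `apply()`, which is what the cell above does. For comparison only, the same translation can be sketched column by column with `Series.map`, which usually runs faster on large frames. The toy DataFrame and its values below are invented for illustration:

```python
import numpy as np
import pandas as pd

regions = {
    1: "New England", 2: "Middle Atlantic", 3: "East North Central",
    4: "West North Central", 5: "South Atlantic", 6: "East South Central",
    7: "West South Central", 8: "Mountain", 9: "Pacific",
}

# Toy data (invented): includes a missing value and a code absent from the dict.
toy = pd.DataFrame({
    "reg16":  [2.0, 5.0, np.nan, 0.0],
    "region": [5, 5, 9, 3],
})

for col in ["reg16", "region"]:
    # map() leaves unmatched or missing codes as NaN; fillna() labels them.
    toy[f"{col}_trans"] = toy[col].map(regions).fillna("Unknown")

print(toy)
```

Both approaches should give the same labels; the `apply()` version stays closer to the wording of the question.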


## Question 3

In the code cell below, perform the following steps: 

1. Subset the `df_gss_sample` DataFrame so that you are working only with the `prestg10` and `sibs` columns.
2. Using `groupby` and `agg` methods, group your subset DataFrame on the respondent's number of siblings and display summary columns for: (a) the mean `prestg10` for each `sibs` count; and (b) the number of rows (n) used for each mean value.
3. Display all 28 rows of the grouped and aggregated DataFrame in the code cell below
4. In the markdown cell below, explain whether the mean `prestg10` scores suggest a relationship between professional prestige and number of siblings.


```python
# your code here 
```

[Your markdown here]

---

In [17]:
subset_df = df_gss_sample[['prestg10', 'sibs']]

grouped_df = subset_df.groupby('sibs').agg(mean_prestige=('prestg10', 'mean'),  count=('prestg10', 'count')).reset_index()  

print(grouped_df.to_string(index=False))


 sibs  mean_prestige  count
  0.0      47.315476    168
  1.0      45.131059    557
  2.0      45.145234    661
  3.0      43.897727    528
  4.0      42.340050    397
  5.0      42.086275    255
  6.0      43.210000    200
  7.0      39.595506    178
  8.0      41.018868    106
  9.0      38.282051     78
 10.0      37.942857     35
 11.0      37.488372     43
 12.0      38.615385     26
 13.0      40.500000     20
 14.0      38.454545     11
 15.0      41.125000      8
 16.0      33.000000      4
 17.0      33.666667      3
 18.0      43.000000      2
 19.0      46.000000      1
 21.0      31.500000      2
 22.0            NaN      0
 23.0      52.000000      1
 24.0      40.000000      1
 25.0      48.000000      1
 27.0      28.000000      1


### Relationship Between Prestige and Number of Siblings
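The grouped means suggest a negative association between number of siblings and occupational prestige. Mean `prestg10` falls from about 47.3 for respondents with no siblings to about 42 for those with four or five, and to roughly 38 for those with nine to eleven. The decline is not strictly monotonic (respondents with six siblings average 43.2, for example), and from 12 siblings upward each mean rests on 26 or fewer respondents, so those rows are too noisy to interpret. One possible explanation is that family resources are spread more thinly across more children, but group means alone cannot establish a cause.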
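
Because the means for large sibling counts rest on very few respondents, it can help to set those groups aside before reading the trend. A minimal sketch on invented toy data, where `min_n` is a hypothetical cutoff chosen for this example and not a value from the assignment:

```python
import pandas as pd

# Toy data (invented for illustration; not GSS values).
toy = pd.DataFrame({
    "sibs":     [0, 0, 0, 1, 1, 1, 2, 2, 2, 9],
    "prestg10": [50, 48, 46, 45, 44, 43, 42, 40, 38, 70],
})

grouped = (
    toy.groupby("sibs")
       .agg(mean_prestige=("prestg10", "mean"), count=("prestg10", "count"))
       .reset_index()
)

# Hypothetical cutoff: keep only groups with enough rows behind the mean.
# A larger value (for example 30) may be more sensible on the real sample.
min_n = 3
reliable = grouped[grouped["count"] >= min_n]

print(reliable.to_string(index=False))
print("Means decline steadily:", reliable["mean_prestige"].is_monotonic_decreasing)
```

In the toy data the single respondent with nine siblings has a high score; filtering on `count` keeps that one row from distorting the picture.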

## Question 4

In the code cell below, perform the following steps: 

1. Building on Question 3, alter your subset from the `df_gss_sample` DataFrame so that you are working with the `fund16` column (how fundamentalist was the respondent at age 16), in addition to the `prestg10` and `sibs` columns.
2. Using `groupby` and `agg` methods, group the subset DataFrame on the `sibs` column, and subgroup on the `fund16` column. Display summary columns for: (a) the mean `prestg10` for each `sibs` count; and (b) the number of rows (n) used for each mean value for each combination of `fund16` and `sibs` values. 
3. Display the entirety of the grouped and aggregated DataFrame in the code cell below
4. In the markdown cell below, explain how controlling for religiosity in this way affects your interpretation of the association between professional prestige and number of siblings.

```python
# your code here 
```

[Your markdown here]

In [28]:
subset_df = df_gss_sample[['prestg10', 'sibs', 'fund16']]

grouped_df = subset_df.groupby(['sibs', 'fund16']).agg(mean_prestige=('prestg10', 'mean'),  count=('prestg10', 'count')).reset_index()  

print(grouped_df.to_string())


    sibs  fund16  mean_prestige  count
0    0.0     1.0      44.900000     40
1    0.0     2.0      46.523810     63
2    0.0     3.0      50.264151     53
3    1.0     1.0      43.687023    131
4    1.0     2.0      43.867220    241
5    1.0     3.0      48.164474    152
6    2.0     1.0      43.312500    160
7    2.0     2.0      44.921053    304
8    2.0     3.0      46.857955    176
9    3.0     1.0      42.520958    167
10   3.0     2.0      43.792793    222
11   3.0     3.0      45.387387    111
12   4.0     1.0      40.881890    127
13   4.0     2.0      42.721311    183
14   4.0     3.0      43.312500     64
15   5.0     1.0      41.670732     82
16   5.0     2.0      40.992000    125
17   5.0     3.0      45.454545     33
18   6.0     1.0      42.477612     67
19   6.0     2.0      44.337079     89
20   6.0     3.0      41.696970     33
21   7.0     1.0      38.054545     55
22   7.0     2.0      40.107692     65
23   7.0     3.0      42.433333     30
24   8.0     1.0      39.

### Controlling for Religious Fundamentalism in Prestige & Siblings Analysis
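Adding `fund16` as a control shows that the negative association between siblings and prestige persists within each level of fundamentalism. In the rows displayed (0 to 7 siblings), mean `prestg10` drops from 44.9 to 38.1 for `fund16` = 1 and from 50.3 to 42.4 for `fund16` = 3. Prestige also differs across `fund16` levels at a given sibling count: the `fund16` = 3 group has the highest mean at every displayed sibling count except six. Religious background at age 16 therefore appears related to prestige separately from family size, and the sibling pattern from Question 3 is not simply an artifact of religious upbringing. Some cells at five to seven siblings contain only about 30 respondents, so differences there should be read cautiously.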
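
Comparing levels of `fund16` is easier when each level has its own column. One way to sketch this is to `unstack` the grouped means into a wide table; the toy data below is invented, and the name `wide` is an illustrative assumption:

```python
import pandas as pd

# Toy data (invented for illustration; not GSS values).
toy = pd.DataFrame({
    "sibs":     [0, 0, 0, 0, 1, 1, 1, 1],
    "fund16":   [1, 1, 3, 3, 1, 1, 3, 3],
    "prestg10": [44, 46, 50, 52, 42, 44, 47, 49],
})

grouped = toy.groupby(["sibs", "fund16"]).agg(
    mean_prestige=("prestg10", "mean"),
    count=("prestg10", "count"),
)

# Rows: sibling count. Columns: fund16 level. Cells: mean prestige.
wide = grouped["mean_prestige"].unstack("fund16")
print(wide)

# Row-to-row change within each fund16 column; negative values mean prestige
# falls as the sibling count rises within that level.
print(wide.diff().dropna())
```

Reading down a column shows the sibling trend within one `fund16` level, and reading across a row shows the `fund16` difference at a fixed sibling count.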