Lab 3 Data analysis QMSSGR5015

Bowen Bao

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
!pip install requests
import requests
import io
from tqdm import tqdm
import zipfile



I will use GSS data, restricted to the 2022 wave.

In [None]:
url = 'https://gss.norc.org/content/dam/gss/get-the-data/documents/stata/2022_stata.zip'

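# Stream the archive so download progress can be reported; stop early on HTTP errors
response = requests.get(url, stream=True, timeout=60)
response.raise_for_status()
total_size = int(response.headers.get('content-length', 0))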
block_size = 1024

tqdm_bar = tqdm(total=total_size, unit='iB', unit_scale=True)
content = io.BytesIO()

for data in response.iter_content(block_size):
    tqdm_bar.update(len(data))
    content.write(data)

tqdm_bar.close()

if total_size != 0 and tqdm_bar.n != total_size:
    print("Error in downloading the file")
else:
    print("Download completed")

100%|██████████| 5.28M/5.28M [00:00<00:00, 33.2MiB/s]

Download completed





1. Bivariate regression

Dependent variable: homosex (attitude toward same-sex sexual relations). The question asks for the respondent's opinion on sexual relations between two adults of the same sex on a 1-4 scale: always wrong (1), almost always wrong (2), wrong only sometimes (3), or not wrong at all (4).

In [None]:
with zipfile.ZipFile(content) as z:
    file_list = z.namelist()
    stata_files = [file for file in file_list if file.endswith('.dta')]

    if stata_files:
        stata_file = stata_files[0]

        with z.open(stata_file) as stata_file_stream:
          columns_to_load = ['homosex', 'age', 'polviews']
          print("Loading selected columns from Stata file with numeric labels")
          df_numeric = pd.read_stata(stata_file_stream, columns=columns_to_load, convert_categoricals=False)
          print("Data with numeric labels loaded successfully")

        with z.open(stata_file) as stata_file_stream:
          print("Loading selected columns from Stata file with string (categorical) labels")
          df_categorical = pd.read_stata(stata_file_stream, columns=columns_to_load)
          print("Data with categorical labels loaded successfully")

Loading selected columns from Stata file with numeric labels
Data with numeric labels loaded successfully
Loading selected columns from Stata file with string (categorical) labels
Data with categorical labels loaded successfully


The codes -100, -99, -98, and -97 mark missing or inapplicable values, so any rows that carry them are dropped.

In [None]:
df_numeric = df_numeric[~df_numeric['homosex'].isin([-100, -99, -98, -97])]
df_numeric = df_numeric[~df_numeric['age'].isin([-100, -99, -98, -97])]
df_numeric = df_numeric[~df_numeric['polviews'].isin([-100, -99, -98, -97])]
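One caveat worth checking: `isin` never matches NaN, so if `pd.read_stata` has already turned Stata missing values into NaN, the filters above leave those rows in place. The sketch below illustrates this on a made-up three-column frame (the values are hypothetical, not GSS data) and shows a single mask that covers all columns at once.

```python
import numpy as np
import pandas as pd

MISSING_CODES = [-100, -99, -98, -97]

# Hypothetical toy frame, for illustration only (not GSS data)
toy = pd.DataFrame({
    "homosex": [4.0, -100.0, np.nan, 1.0],
    "age": [30.0, 45.0, 60.0, -99.0],
    "polviews": [4.0, 2.0, 5.0, 7.0],
})

# True for rows where any column holds one of the sentinel codes
has_code = toy.isin(MISSING_CODES).any(axis=1)
cleaned = toy[~has_code]

# NaN is not a sentinel code, so it survives the isin filter
remaining_nan = int(cleaned["homosex"].isna().sum())

# dropna removes the rows that are still incomplete
complete = cleaned.dropna()

print(len(cleaned), remaining_nan, len(complete))
```

If NaN values remain after the sentinel filter, `describe()` will report a different count for each column, which is a quick way to spot them.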

Now we can look at the distribution of "homosex". The median is 4, so at least half of the respondents think sexual relations between two people of the same sex are "not wrong at all".

In [None]:
df = df_numeric
df[['homosex']].describe()

Unnamed: 0,homosex
count,2652.0
mean,2.983032
std,1.333241
min,1.0
25%,1.0
50%,4.0
75%,4.0
max,4.0


The independent variable is "age", the respondent's age in years, ranging from 18 to 89. The mean age of GSS respondents is about 48.

In [None]:
df[['age']].describe()

Unnamed: 0,age
count,3893.0
mean,48.244541
std,17.746931
min,18.0
25%,33.0
50%,47.0
75%,63.0
max,89.0


I expect age to be associated with attitudes toward homosexual relationships, because age groups differ in culture, education, and political views. The regression below restricts the analysis to respondents who are not missing on the other variable, "polviews".

In [None]:
homosex_age = smf.ols(formula = 'homosex~age', data = df, subset=df['polviews'].notna()).fit()
print (homosex_age.summary())

                            OLS Regression Results                            
Dep. Variable:                homosex   R-squared:                       0.058
Model:                            OLS   Adj. R-squared:                  0.058
Method:                 Least Squares   F-statistic:                     151.1
Date:                Sun, 27 Oct 2024   Prob (F-statistic):           9.67e-34
Time:                        00:35:46   Log-Likelihood:                -4095.8
No. Observations:                2450   AIC:                             8196.
Df Residuals:                    2448   BIC:                             8207.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.8673      0.075     51.489      0.0

The coefficient for age is -0.0180: each additional year of age is associated with a 0.0180-point decrease in approval of homosexual relationships on the four-point scale. The model is statistically significant with p < 0.001. The R-squared of 0.058 indicates that 5.8% of the variability observed in "homosex" is explained by the regression model. The association is weak, but the regression does suggest that older people are significantly less accepting of homosexual relationships.
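To make the size of the age coefficient concrete, the following sketch plugs two ages into the fitted line, using the intercept (3.8673) and slope (-0.0180) reported above. The ages 20 and 70 and the helper name `predicted_homosex` are illustrative choices, not part of the original analysis.

```python
# Coefficients as reported for the bivariate model homosex ~ age
INTERCEPT = 3.8673
AGE_SLOPE = -0.0180


def predicted_homosex(age):
    """Fitted value on the 1-4 scale for a respondent of the given age."""
    return INTERCEPT + AGE_SLOPE * age


young = predicted_homosex(20)  # illustrative age
old = predicted_homosex(70)    # illustrative age
gap = old - young

print(f"age 20: {young:.2f}")
print(f"age 70: {old:.2f}")
print(f"difference over 50 years: {gap:.2f}")
```

A 50-year age gap corresponds to a predicted difference of about 0.9 points, a little under one step on the four-point scale.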

2. Add an additional variable that might mediate or partly "explain" the initial association from that simple regression above -- and explain your results.

Assuming that older respondents are relatively more conservative, I add 'polviews' to help explain the effect of age. The variable 'polviews' measures the respondent's political views on a 1-7 scale, from extremely liberal (1) through moderate (4) to extremely conservative (7).

In [None]:
df[['polviews']].describe()

Unnamed: 0,polviews
count,4022.0
mean,3.959722
std,1.508121
min,1.0
25%,3.0
50%,4.0
75%,5.0
max,7.0


The distribution is centered on the midpoint of the scale (mean 3.96, median 4, quartiles at 3 and 5), which suggests roughly equal numbers of liberals and conservatives among the respondents. Now, how do my two independent variables correlate?

In [None]:
df_filtered = df[['age', 'polviews']].dropna()
stats.pearsonr(df_filtered['age'], df_filtered['polviews'])

PearsonRResult(statistic=0.13363555467749091, pvalue=1.5074616434477912e-16)

The correlation coefficient is +0.1336. This is a weak positive correlation, but it does show that older respondents tend to hold more conservative political views.
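As a quick gauge of how weak this correlation is, the squared coefficient gives the share of variance the two variables have in common. The snippet below only reuses the `statistic` value printed above.

```python
# Pearson r between age and polviews, copied from the output above
r = 0.13363555467749091

# r squared is the proportion of variance shared by the two variables
shared_variance = r ** 2

print(f"shared variance: {shared_variance:.4f}")
```

Age and political views share less than 2% of their variance, so the two predictors are far from redundant.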

We now run the regression with "polviews" added, ranging from 1 (extremely liberal) to 7 (extremely conservative). This lets us see whether differences in political views can explain the association between age and approval of homosexual relationships.

In [None]:
homosex_withpolview_age = smf.ols(formula = 'homosex ~ age + polviews', data = df).fit()
print (homosex_withpolview_age.summary())

                            OLS Regression Results                            
Dep. Variable:                homosex   R-squared:                       0.184
Model:                            OLS   Adj. R-squared:                  0.184
Method:                 Least Squares   F-statistic:                     276.4
Date:                Sun, 27 Oct 2024   Prob (F-statistic):          6.09e-109
Time:                        00:35:51   Log-Likelihood:                -3919.7
No. Observations:                2450   AIC:                             7845.
Df Residuals:                    2447   BIC:                             7863.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.9344      0.089     55.524      0.0

Age has a coefficient of -0.0149: holding political views constant, each additional year of age is associated with a 0.0149-point decrease in approval of homosexual relationships on the four-point scale.

Political views has a coefficient of -0.3102: holding age constant, each one-point move toward the conservative end of the seven-point scale is associated with a 0.3102-point decrease in approval of homosexual relationships on the four-point scale.
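Because age and political views are measured in different units, the per-unit coefficients are hard to compare directly. One rough way to compare them, sketched here, is to multiply each coefficient by the full observed range of its variable (18 to 89 for age, 1 to 7 for polviews, as reported above). This range-based comparison is an illustrative choice, not part of the original analysis.

```python
# Coefficients reported for the model homosex ~ age + polviews
AGE_COEF = -0.0149
POLVIEWS_COEF = -0.3102

# Observed ranges taken from the descriptive statistics
age_span = 89 - 18      # years
polviews_span = 7 - 1   # scale points

# Predicted change in homosex when moving across the whole range
age_effect = AGE_COEF * age_span
polviews_effect = POLVIEWS_COEF * polviews_span

print(f"youngest to oldest: {age_effect:.2f}")
print(f"extremely liberal to extremely conservative: {polviews_effect:.2f}")
```

Across their full ranges, political views move the prediction by about 1.9 points and age by about 1.1 points on the four-point scale.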

The adjusted R-squared is 0.184, which suggests that 18.4% of the variance in approval of homosexual relationships is explained by age and political views combined. Both predictors are statistically significant, with p-values below 0.001.

In conclusion, the regressions are consistent with the hypothesis. Older respondents are more conservative, and more conservative people tend to disapprove of same-sex sexual relationships. The age coefficient shrinks from -0.0180 to -0.0149 once political views are added, so political views account for part of the age association, but most of it remains. Age and political views also explain only a portion of the variance, so other variables must contribute to differences in attitudes toward homosexual relationships.
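The degree of mediation can be summarized as the proportional drop in the age coefficient between the two models. The snippet below is a minimal sketch using the two coefficients reported above. It is a descriptive comparison only, not a formal mediation test.

```python
# Age coefficient without and with polviews in the model
age_coef_bivariate = -0.0180
age_coef_controlled = -0.0149

# Share of the original age coefficient that disappears once polviews is added
attenuation = (age_coef_bivariate - age_coef_controlled) / age_coef_bivariate

print(f"attenuation of the age coefficient: {attenuation:.1%}")
```

Roughly 17% of the age association is accounted for by political views, and the remaining 83% or so persists after the control.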

3. More on extreme combinations. Find the top 5 entities that are ranked at the top on one variable and ranked at the bottom on another variable. Interpret your results.

Since my dataset is different from the professor's example, I will take a different approach. I look for respondents who are senior citizens and think that sexual relationships between two adults of the same sex are not wrong at all. To do this, I filter on homosex = 4 (not wrong at all) and sort by age in descending order. I expect that senior citizens who think homosexual relationships are not wrong at all are mostly liberal or moderate.



In [None]:
control_value = 4

filtered_df = df[df['homosex'] == control_value]
sorted_df = filtered_df.sort_values(by='age', ascending=False)
print(sorted_df)

      homosex   age  polviews
1781      4.0  89.0       4.0
2037      4.0  88.0       2.0
1041      4.0  87.0       1.0
27        4.0  86.0       4.0
693       4.0  86.0       3.0
...       ...   ...       ...
3970      4.0   NaN       5.0
4029      4.0   NaN       4.0
4100      4.0   NaN       NaN
4125      4.0   NaN       3.0
4136      4.0   NaN       3.0

[1582 rows x 3 columns]


In [None]:
top_extremes = sorted_df.head(10)

print("Top Extremes:\n", top_extremes)

Top Extremes:
       homosex   age  polviews
1781      4.0  89.0       4.0
2037      4.0  88.0       2.0
1041      4.0  87.0       1.0
27        4.0  86.0       4.0
693       4.0  86.0       3.0
594       4.0  86.0       4.0
3023      4.0  85.0       4.0
581       4.0  85.0       2.0
3147      4.0  84.0       4.0
843       4.0  84.0       2.0


As we can see, the 10 oldest respondents who said that homosexual relationships are "not wrong at all" hold political views ranging from 1 to 4: extremely liberal, liberal, slightly liberal, or moderate. The results support the hypothesis that senior citizens who accept same-sex sexual relationships mostly hold non-conservative political views.
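To see how those ten respondents split across the political scale, the sketch below tallies the `polviews` column copied from the "Top Extremes" table. Grouping 1-3 as liberal, 4 as moderate, and 5-7 as conservative follows the scale description given earlier.

```python
from collections import Counter

# polviews of the 10 oldest respondents with homosex == 4, copied from the table above
top10_polviews = [4, 2, 1, 4, 3, 4, 4, 2, 4, 2]

counts = Counter(top10_polviews)

# Group the 1-7 scale into three bands
liberal = sum(n for value, n in counts.items() if value <= 3)
moderate = counts[4]
conservative = sum(n for value, n in counts.items() if value >= 5)

print(f"liberal: {liberal}, moderate: {moderate}, conservative: {conservative}")
```

Five of the ten are moderates and five are on the liberal side, with no conservatives among them.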