# Estimate Multi-ancestry PRS versus PD risk
- **Project:** Multi-ancestry PRS
- **Version:** Python/3.9
- **Status:** COMPLETE
- **Last Updated:** 25-FEB-2024

## Notebook Overview
- Logistic regression models adjusted for covariates (age, sex, PCs)
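
For reference, the per-ancestry model fitted in the cells below can be sketched as follows, where sex is the `sex_for_qc` column, the PCs come from the ancestry-specific `.eigenvec` file, and zSCORE is the PRS standardized to the control mean and SD. The coefficient symbols are notation introduced here, not taken from the notebook:

$$
\operatorname{logit} P(\mathrm{CASE}=1) = \beta_0 + \beta_1\,\mathrm{zSCORE} + \beta_2\,\mathrm{sex} + \beta_3\,\mathrm{age} + \sum_{k=1}^{10} \gamma_k\,\mathrm{PC}_k,
\qquad
\mathrm{zSCORE} = \frac{\mathrm{SCORE} - \bar{x}_{\text{controls}}}{s_{\text{controls}}}
$$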

In [1]:
## Load packages
module load python
module load R

[+] Loading python 3.10  ... 
[+] Loading gcc  11.3.0  ... 
[+] Loading HDF5  1.12.2 
[+] Loading netcdf  4.9.0 
[-] Unloading gcc  11.3.0  ... 
[+] Loading gcc  11.3.0  ... 
[+] Loading openmpi/4.1.3/gcc-11.3.0  ... 
[+] Loading pandoc  2.18  on cn2458 
[+] Loading pcre2  10.40 
[+] Loading R 4.3.2 


In [1]:
## Change kernel to python
import pandas as pd

# Read the master key into a DataFrame (WORK_DIR must already be defined in this session)
file_path = f"{WORK_DIR}/GP2_master_key_release6.txt"
df = pd.read_csv(file_path, delimiter='\t')  # Assumes tab-separated columns; adjust if needed

# Add a placeholder first column filled with 0
df.insert(0, 'New_Column', 0)

# Keep the placeholder column plus sample ID, sex, and age
result_df = df[['New_Column', 'GP2sampleID', 'sex_for_qc', 'age']]

# Save the result as a tab-separated file with a header row
result_df.to_csv(f'{WORK_DIR}/covariates.txt', index=False, header=True, sep='\t')
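
The covariate table written above is later joined to the PCA and PRS tables on the sample ID, and the default inner join in pandas silently drops samples that are missing from either side. A minimal sketch, using made-up sample IDs and values rather than GP2 data, of how such a join could be audited with the `indicator` and `validate` options of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins for the PCA table and covariates.txt (IDs and values are invented)
pcs = pd.DataFrame({"IID": ["S1", "S2", "S3"], "PC1": [0.01, -0.02, 0.03]})
covariates = pd.DataFrame({
    "IID": ["S1", "S2", "S4"],
    "sex_for_qc": [1, 2, 1],
    "age": [61.0, None, 58.0],
})

# Outer join with an indicator column shows which table each sample came from;
# validate raises a MergeError if IID is duplicated in either table.
audit = pd.merge(pcs, covariates, on="IID", how="outer",
                 indicator=True, validate="one_to_one")

only_in_pcs = audit.loc[audit["_merge"] == "left_only", "IID"].tolist()
only_in_covariates = audit.loc[audit["_merge"] == "right_only", "IID"].tolist()

# Rows that would survive an inner join but still lack a covariate value
matched = audit[audit["_merge"] == "both"]
n_missing_age = int(matched["age"].isna().sum())

print("Only in PCA table:", only_in_pcs)
print("Only in covariates:", only_in_covariates)
print("Matched rows with missing age:", n_missing_age)
```

Running a check like this before the regression makes it easier to explain any gap between the number of genotyped samples and the number of observations in a model summary.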

In [1]:
cd ${WORK_DIR}/quality_control/release6/genotype_qc/
ls *eigenvec

GP2_release6_NOVEMBER_2023_AAC.eigenvec
GP2_release6_NOVEMBER_2023_AFR.eigenvec
GP2_release6_NOVEMBER_2023_AJ.eigenvec
GP2_release6_NOVEMBER_2023_AMR.eigenvec
GP2_release6_NOVEMBER_2023_CAH.eigenvec
GP2_release6_NOVEMBER_2023_CAS.eigenvec
GP2_release6_NOVEMBER_2023_EAS.eigenvec
GP2_release6_NOVEMBER_2023_EUR.eigenvec
GP2_release6_NOVEMBER_2023_FIN.eigenvec
GP2_release6_NOVEMBER_2023_MDE.eigenvec
GP2_release6_NOVEMBER_2023_SAS.eigenvec
GP2_release6_NOVEMBER_AAC_maf_hwe_pca.eigenvec
GP2_release6_NOVEMBER_AFR_maf_hwe_pca.eigenvec
GP2_release6_NOVEMBER_AJ_maf_hwe_pca.eigenvec
GP2_release6_NOVEMBER_AMR_maf_hwe_pca.eigenvec
GP2_release6_NOVEMBER_CAS_maf_hwe_pca.eigenvec
GP2_release6_NOVEMBER_EAS_maf_hwe_pca.eigenvec
GP2_release6_NOVEMBER_EUR_maf_hwe_pca.eigenvec
GP2_release6_NOVEMBER_FIN_maf_hwe_pca.eigenvec
GP2_release6_NOVEMBER_MDE_maf_hwe_pca.eigenvec
GP2_release6_NOVEMBER_SAS_maf_hwe_pca.eigenvec
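
The listing shows two naming patterns, and not every ancestry label has a `_maf_hwe_pca.eigenvec` file (CAH appears only under the first pattern). A small sketch, assuming the file-name pattern used by the regression cells below, of checking up front which labels have the PCA file. The helper name `missing_eigenvec` and the temporary directory are illustrative only:

```python
import tempfile
from pathlib import Path

def missing_eigenvec(qc_dir, ancestries):
    """Return the ancestry labels that lack a *_maf_hwe_pca.eigenvec file in qc_dir."""
    qc_dir = Path(qc_dir)
    return [
        ancestry for ancestry in ancestries
        if not (qc_dir / f"GP2_release6_NOVEMBER_{ancestry}_maf_hwe_pca.eigenvec").is_file()
    ]

# Demonstration on a throwaway directory holding empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for label in ["AAC", "AFR", "EUR"]:
        (Path(tmp) / f"GP2_release6_NOVEMBER_{label}_maf_hwe_pca.eigenvec").touch()
    missing = missing_eigenvec(tmp, ["AAC", "AFR", "CAH", "EUR"])

print(missing)
```

Pointing `qc_dir` at the real genotype QC directory would flag any label in the `ancestries` list whose covariate file is absent before the loop fails partway through.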


In [2]:
import pandas as pd
import statsmodels.api as sm

## RUN PRS versus RISK across ancestries (AFRICAN summary stats)
# List of ancestries
ancestries = ["AAC", "AFR", "AJ", "AMR", "EAS", "EUR", "CAS"]

for ancestry in ancestries:  
    print("Ancestry:", ancestry)
    # Construct file paths
    prs_file = f"{WORK_DIR}/imputed_data/" + ancestry + "/PRS_score_release_AFRICANS.profile"
    covs_file = f"{WORK_DIR}/quality_control/release6/genotype_qc/GP2_release6_NOVEMBER_" + ancestry + "_maf_hwe_pca.eigenvec"

    # Read PRS data (whitespace-delimited .profile file)
    temp_data = pd.read_csv(prs_file, sep=r"\s+")

    # Read covariates data
    # Column names for the eigenvec file: sample IDs plus the first 10 PCs
    custom_column_names = ["FID", "IID", "PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9", "PC10"]

    # Read the whitespace-delimited eigenvec file, skip the first line, and use custom column names
    temp_covs = pd.read_csv(covs_file, sep=r"\s+", skiprows=1, names=custom_column_names)

    # Read additional covariates (sex and age)
    temp_covs_2 = pd.read_csv(f"{WORK_DIR}/covariates.txt", sep="\t")
    temp_covs_2 = temp_covs_2.rename(columns={"GP2sampleID": "IID"})

    # Merge covariates
    covs = pd.merge(temp_covs, temp_covs_2, on="IID")

    # Merge PRS data and covariates
    data = pd.merge(temp_data, covs, on="IID")

    # print("looking at covs")
    # print(covs.head())
    # print("looking at temp_data")
    # print(temp_data.head())
    # print("looking at data")
    # print(data.head())
    # print(data.describe())

    # Remove samples with missing or unknown phenotype (coded -9); copy so later assignments do not trigger SettingWithCopyWarning
    dat = data[data["PHENO"] != -9].copy()

    # Recode phenotype for logistic regression: PHENO 1/2 becomes CASE 0/1
    dat["CASE"] = dat["PHENO"] - 1

    # Standardize PRS to the mean and SD of controls
    mean_controls = dat.loc[dat["CASE"] == 0, "SCORE"].mean()
    sd_controls = dat.loc[dat["CASE"] == 0, "SCORE"].std()
    dat["zSCORE"] = (dat["SCORE"] - mean_controls) / sd_controls

    # Quick peek at the data
    print("Taking a look at the dataset for " + ancestry + " regression models.")
    print(dat.describe())

    # Logistic regression model using statsmodels formula
    formula = "CASE ~ zSCORE + sex_for_qc + age + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10"
    model = sm.Logit.from_formula(formula, data=dat)
    result = model.fit()

    # Print the summary
    print(result.summary())

    # Report progress before moving to the next ancestry
    print("Done analyzing " + ancestry + " now on to the next thing.")

Ancestry: AAC
Taking a look at the dataset for AAC regression models.
        FID_x        PHENO          CNT         CNT2        SCORE   FID_y  \
count  1030.0  1030.000000  1030.000000  1030.000000  1030.000000  1030.0   
mean      0.0     1.242718   166.126214    81.626214    -0.004264     0.0   
std       0.0     0.428935     1.980368     5.026373     0.003391     0.0   
min       0.0     1.000000   156.000000    66.000000    -0.015085     0.0   
25%       0.0     1.000000   166.000000    78.000000    -0.006524     0.0   
50%       0.0     1.000000   166.000000    82.000000    -0.004318     0.0   
75%       0.0     1.000000   168.000000    85.000000    -0.002065     0.0   
max       0.0     2.000000   168.000000    97.000000     0.007922     0.0   

               PC1          PC2          PC3          PC4  ...          PC6  \
count  1030.000000  1030.000000  1030.000000  1030.000000  ...  1030.000000   
mean     -0.009980     0.003704     0.000382     0.000388  ...    -0.000329   

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dat['CASE'] = dat['PHENO'] - 1

Taking a look at the dataset for AFR regression models.
        FID_x        PHENO          CNT        CNT2        SCORE   FID_y  \
count  2569.0  2569.000000  2569.000000  2569.00000  2569.000000  2569.0   
mean      0.0     1.355002   162.507591    80.47100    -0.001763     0.0   
std       0.0     0.478607     2.821404     4.93265     0.002846     0.0   
min       0.0     1.000000   150.000000    64.00000    -0.012241     0.0   
25%       0.0     1.000000   160.000000    77.00000    -0.003739     0.0   
50%       0.0     1.000000   162.000000    81.00000    -0.001834     0.0   
75%       0.0     2.000000   164.000000    84.00000     0.000125     0.0   
max       0.0     2.000000   166.000000    97.00000     0.008498     0.0   

               PC1          PC2          PC3          PC4  ...          PC6  \
count  2569.000000  2569.000000  2569.000000  2569.000000  ...  2569.000000   
mean      0.000098     0.000521    -0.000131     0.000178  ...     0.000027   
std       0.019510    

Optimization terminated successfully.
         Current function value: 0.521712
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                   CASE   No. Observations:                 1193
Model:                          Logit   Df Residuals:                     1179
Method:                           MLE   Df Model:                           13
Date:                Mon, 04 Mar 2024   Pseudo R-squ.:                 0.02633
Time:                        16:40:10   Log-Likelihood:                -622.40
converged:                       True   LL-Null:                       -639.23
Covariance Type:            nonrobust   LLR p-value:                  0.001354
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.2555      0.539      2.328      0.020       0.199       2.312
zSCORE         0.0328      0.

Taking a look at the dataset for EUR regression models.
         FID_x         PHENO           CNT          CNT2         SCORE  \
count  19311.0  19311.000000  19311.000000  19311.000000  19311.000000   
mean       0.0      1.607426    159.724509     73.720470      0.002226   
std        0.0      0.488336      3.363237      5.405883      0.003889   
min        0.0      1.000000    142.000000     53.000000     -0.013118   
25%        0.0      1.000000    158.000000     70.000000     -0.000392   
50%        0.0      2.000000    160.000000     74.000000      0.002204   
75%        0.0      2.000000    162.000000     77.000000      0.004834   
max        0.0      2.000000    168.000000     96.000000      0.016607   

         FID_y           PC1           PC2           PC3           PC4  ...  \
count  19311.0  19311.000000  19311.000000  19311.000000  19311.000000  ...   
mean       0.0      0.000058     -0.000155     -0.000047      0.000197  ...   
std        0.0      0.006639      0.0068
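
The summaries report coefficients on the log-odds scale, so the zSCORE coefficient is the change in log-odds of PD per one control-SD increase in PRS. A minimal sketch of converting a coefficient and its standard error to an odds ratio with a Wald 95% confidence interval. The helper name `odds_ratio_ci` is illustrative; the inputs are the Intercept row and the zSCORE coefficient from the summary printed above (the zSCORE standard error is cut off in the output, so no interval is computed for it):

```python
import math

def odds_ratio_ci(coef, std_err, z=1.96):
    """Convert a logit coefficient and its SE to (OR, lower, upper) using a Wald interval."""
    lower = coef - z * std_err
    upper = coef + z * std_err
    return math.exp(coef), math.exp(lower), math.exp(upper)

# Sanity check on the log-odds scale against the printed Intercept row
coef, std_err = 1.2555, 0.539
wald_lower = coef - 1.96 * std_err
wald_upper = coef + 1.96 * std_err
print(round(wald_lower, 3), round(wald_upper, 3))

# Odds ratio per one control-SD increase in PRS for the printed zSCORE coefficient
zscore_or = math.exp(0.0328)
print(round(zscore_or, 3))

# The same Intercept row expressed on the odds scale
intercept_or, intercept_lo, intercept_hi = odds_ratio_ci(coef, std_err)
```

The first printed pair reproduces the interval [0.199, 2.312] shown for the Intercept, which confirms the summary uses a Wald interval on the log-odds scale. With a fitted statsmodels result, the same quantities are available from `result.params` and `result.conf_int()`.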

In [3]:
import pandas as pd
import statsmodels.api as sm

## RUN PRS versus RISK across ancestries (EUROPEAN summary stats)
# List of ancestries
ancestries = ["AAC", "AFR", "AJ", "AMR", "EAS", "EUR", "CAS"]

for ancestry in ancestries:  
    print("Ancestry:", ancestry)
    # Construct file paths
    prs_file = f"{WORK_DIR}/imputed_data/" + ancestry + "/PRS_score_release_EUROPEAN.profile"
    covs_file = f"{WORK_DIR}/quality_control/release6/genotype_qc/GP2_release6_NOVEMBER_" + ancestry + "_maf_hwe_pca.eigenvec"

    # Read PRS data (whitespace-delimited .profile file)
    temp_data = pd.read_csv(prs_file, sep=r"\s+")

    # Read covariates data
    # Column names for the eigenvec file: sample IDs plus the first 10 PCs
    custom_column_names = ["FID", "IID", "PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9", "PC10"]

    # Read the whitespace-delimited eigenvec file, skip the first line, and use custom column names
    temp_covs = pd.read_csv(covs_file, sep=r"\s+", skiprows=1, names=custom_column_names)

    # Read additional covariates (sex and age)
    temp_covs_2 = pd.read_csv(f"{WORK_DIR}/covariates.txt", sep="\t")
    temp_covs_2 = temp_covs_2.rename(columns={"GP2sampleID": "IID"})

    # Merge covariates
    covs = pd.merge(temp_covs, temp_covs_2, on="IID")

    # Merge PRS data and covariates
    data = pd.merge(temp_data, covs, on="IID")

    # print("looking at covs")
    # print(covs.head())
    # print("looking at temp_data")
    # print(temp_data.head())
    # print("looking at data")
    # print(data.head())
    # print(data.describe())

    # Remove samples with missing or unknown phenotype (coded -9); copy so later assignments do not trigger SettingWithCopyWarning
    dat = data[data["PHENO"] != -9].copy()

    # Recode phenotype for logistic regression: PHENO 1/2 becomes CASE 0/1
    dat["CASE"] = dat["PHENO"] - 1

    # Standardize PRS to the mean and SD of controls
    mean_controls = dat.loc[dat["CASE"] == 0, "SCORE"].mean()
    sd_controls = dat.loc[dat["CASE"] == 0, "SCORE"].std()
    dat["zSCORE"] = (dat["SCORE"] - mean_controls) / sd_controls

    # Quick peek at the data
    print("Taking a look at the dataset for " + ancestry + " regression models.")
    print(dat.describe())

    # Logistic regression model using statsmodels formula
    formula = "CASE ~ zSCORE + sex_for_qc + age + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10"
    model = sm.Logit.from_formula(formula, data=dat)
    result = model.fit()

    # Print the summary
    print(result.summary())

    # Report progress before moving to the next ancestry
    print("Done analyzing " + ancestry + " now on to the next thing.")

Ancestry: AAC
Taking a look at the dataset for AAC regression models.
        FID_x        PHENO          CNT         CNT2        SCORE   FID_y  \
count  1030.0  1030.000000  1030.000000  1030.000000  1030.000000  1030.0   
mean      0.0     1.242718   178.114563    72.143689    -0.016308     0.0   
std       0.0     0.428935     1.984998     5.317219     0.002766     0.0   
min       0.0     1.000000   168.000000    54.000000    -0.025477     0.0   
25%       0.0     1.000000   178.000000    69.000000    -0.018209     0.0   
50%       0.0     1.000000   178.000000    72.000000    -0.016281     0.0   
75%       0.0     1.000000   180.000000    76.000000    -0.014559     0.0   
max       0.0     2.000000   180.000000    90.000000    -0.007043     0.0   

               PC1          PC2          PC3          PC4  ...          PC6  \
count  1030.000000  1030.000000  1030.000000  1030.000000  ...  1030.000000   
mean     -0.009980     0.003704     0.000382     0.000388  ...    -0.000329   

Taking a look at the dataset for AFR regression models.
        FID_x        PHENO          CNT         CNT2        SCORE   FID_y  \
count  2569.0  2569.000000  2569.000000  2569.000000  2569.000000  2569.0   
mean      0.0     1.355002   172.455430    68.635267    -0.018585     0.0   
std       0.0     0.478607     2.857407     4.912285     0.002530     0.0   
min       0.0     1.000000   160.000000    49.000000    -0.026690     0.0   
25%       0.0     1.000000   170.000000    65.000000    -0.020366     0.0   
50%       0.0     1.000000   172.000000    69.000000    -0.018567     0.0   
75%       0.0     2.000000   174.000000    72.000000    -0.016979     0.0   
max       0.0     2.000000   176.000000    85.000000    -0.008618     0.0   

               PC1          PC2          PC3          PC4  ...          PC6  \
count  2569.000000  2569.000000  2569.000000  2569.000000  ...  2569.000000   
mean      0.000098     0.000521    -0.000131     0.000178  ...     0.000027   
std       0.0

Optimization terminated successfully.
         Current function value: 0.474546
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:                   CASE   No. Observations:                 1193
Model:                          Logit   Df Residuals:                     1179
Method:                           MLE   Df Model:                           13
Date:                Mon, 04 Mar 2024   Pseudo R-squ.:                  0.1144
Time:                        16:40:12   Log-Likelihood:                -566.13
converged:                       True   LL-Null:                       -639.23
Covariance Type:            nonrobust   LLR p-value:                 1.199e-24
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.5893      0.573      1.029      0.304      -0.534       1.712
zSCORE         0.6703      0.

Taking a look at the dataset for EAS regression models.
        FID_x        PHENO         CNT         CNT2        SCORE   FID_y  \
count  3328.0  3328.000000  3328.00000  3328.000000  3328.000000  3328.0   
mean      0.0     1.299279   156.29387    64.910156     0.000075     0.0   
std       0.0     0.458011     4.30169     5.538730     0.002857     0.0   
min       0.0     1.000000   138.00000    44.000000    -0.009778     0.0   
25%       0.0     1.000000   154.00000    61.000000    -0.001836     0.0   
50%       0.0     1.000000   156.00000    65.000000     0.000054     0.0   
75%       0.0     2.000000   160.00000    69.000000     0.002034     0.0   
max       0.0     2.000000   168.00000    82.000000     0.010797     0.0   

               PC1          PC2          PC3          PC4  ...          PC6  \
count  3328.000000  3328.000000  3328.000000  3328.000000  ...  3328.000000   
mean      0.001644    -0.000132    -0.004306    -0.000278  ...    -0.000873   
std       0.013886    

Taking a look at the dataset for EUR regression models.
         FID_x         PHENO           CNT          CNT2         SCORE  \
count  19311.0  19311.000000  19311.000000  19311.000000  19311.000000   
mean       0.0      1.607426    171.708353     77.871938     -0.009962   
std        0.0      0.488336      3.368953      5.675227      0.003123   
min        0.0      1.000000    154.000000     55.000000     -0.020786   
25%        0.0      1.000000    170.000000     74.000000     -0.012031   
50%        0.0      2.000000    172.000000     78.000000     -0.010059   
75%        0.0      2.000000    174.000000     82.000000     -0.008004   
max        0.0      2.000000    180.000000    100.000000      0.014443   

         FID_y           PC1           PC2           PC3           PC4  ...  \
count  19311.0  19311.000000  19311.000000  19311.000000  19311.000000  ...   
mean       0.0      0.000058     -0.000155     -0.000047      0.000197  ...   
std        0.0      0.006639      0.0068
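
The `Pseudo R-squ.` value in a statsmodels Logit summary is McFadden's pseudo R-squared, 1 - LL/LL-Null. As a quick worked check using the log-likelihoods from the summary printed above (the helper name is illustrative):

```python
def mcfadden_pseudo_r2(log_likelihood, log_likelihood_null):
    """McFadden's pseudo R-squared: 1 - LL_model / LL_null."""
    return 1.0 - log_likelihood / log_likelihood_null

# Log-Likelihood and LL-Null from the summary printed above
pseudo_r2 = mcfadden_pseudo_r2(-566.13, -639.23)
print(round(pseudo_r2, 4))

# Likelihood-ratio statistic behind the LLR p-value (13 model degrees of freedom)
llr_statistic = 2 * (-566.13 - (-639.23))
print(round(llr_statistic, 2))
```

The first value matches the 0.1144 reported in the summary. Because LL-Null is identical across the three summary-statistic sets for the same samples, the pseudo R-squared values can be compared directly between them.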


In [4]:
import pandas as pd
import statsmodels.api as sm

## RUN PRS versus RISK across ancestries (LATINO summary stats)
# List of ancestries
ancestries = ["AAC", "AFR", "AJ", "AMR", "EAS", "EUR", "CAS"]

for ancestry in ancestries:  
    print("Ancestry:", ancestry)
    # Construct file paths
    prs_file = f"{WORK_DIR}/imputed_data/" + ancestry + "/PRS_score_release_LATINO.profile"
    covs_file = f"{WORK_DIR}/quality_control/release6/genotype_qc/GP2_release6_NOVEMBER_" + ancestry + "_maf_hwe_pca.eigenvec"

    # Read PRS data (whitespace-delimited .profile file)
    temp_data = pd.read_csv(prs_file, sep=r"\s+")

    # Read covariates data
    # Column names for the eigenvec file: sample IDs plus the first 10 PCs
    custom_column_names = ["FID", "IID", "PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9", "PC10"]

    # Read the whitespace-delimited eigenvec file, skip the first line, and use custom column names
    temp_covs = pd.read_csv(covs_file, sep=r"\s+", skiprows=1, names=custom_column_names)

    # Read additional covariates (sex and age)
    temp_covs_2 = pd.read_csv(f"{WORK_DIR}/covariates.txt", sep="\t")
    temp_covs_2 = temp_covs_2.rename(columns={"GP2sampleID": "IID"})

    # Merge covariates
    covs = pd.merge(temp_covs, temp_covs_2, on="IID")

    # Merge PRS data and covariates
    data = pd.merge(temp_data, covs, on="IID")

    # print("looking at covs")
    # print(covs.head())
    # print("looking at temp_data")
    # print(temp_data.head())
    # print("looking at data")
    # print(data.head())
    # print(data.describe())

    # Remove samples with missing or unknown phenotype (coded -9); copy so later assignments do not trigger SettingWithCopyWarning
    dat = data[data["PHENO"] != -9].copy()

    # Recode phenotype for logistic regression: PHENO 1/2 becomes CASE 0/1
    dat["CASE"] = dat["PHENO"] - 1

    # Standardize PRS to the mean and SD of controls
    mean_controls = dat.loc[dat["CASE"] == 0, "SCORE"].mean()
    sd_controls = dat.loc[dat["CASE"] == 0, "SCORE"].std()
    dat["zSCORE"] = (dat["SCORE"] - mean_controls) / sd_controls

    # Quick peek at the data
    print("Taking a look at the dataset for " + ancestry + " regression models.")
    print(dat.describe())

    # Logistic regression model using statsmodels formula
    formula = "CASE ~ zSCORE + sex_for_qc + age + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10"
    model = sm.Logit.from_formula(formula, data=dat)
    result = model.fit()

    # Print the summary
    print(result.summary())

    # Report progress before moving to the next ancestry
    print("Done analyzing " + ancestry + " now on to the next thing.")

Ancestry: AAC
Taking a look at the dataset for AAC regression models.
        FID_x        PHENO          CNT         CNT2        SCORE   FID_y  \
count  1030.0  1030.000000  1030.000000  1030.000000  1030.000000  1030.0   
mean      0.0     1.242718   134.493204    54.601942    -0.001271     0.0   
std       0.0     0.428935     1.697609     4.717974     0.001997     0.0   
min       0.0     1.000000   128.000000    41.000000    -0.007731     0.0   
25%       0.0     1.000000   134.000000    51.000000    -0.002624     0.0   
50%       0.0     1.000000   134.000000    55.000000    -0.001272     0.0   
75%       0.0     1.000000   136.000000    58.000000     0.000063     0.0   
max       0.0     2.000000   136.000000    70.000000     0.005501     0.0   

               PC1          PC2          PC3          PC4  ...          PC6  \
count  1030.000000  1030.000000  1030.000000  1030.000000  ...  1030.000000   
mean     -0.009980     0.003704     0.000382     0.000388  ...    -0.000329   

Taking a look at the dataset for AFR regression models.
        FID_x        PHENO          CNT         CNT2        SCORE   FID_y  \
count  2569.0  2569.000000  2569.000000  2569.000000  2569.000000  2569.0   
mean      0.0     1.355002   133.293889    51.275983    -0.001513     0.0   
std       0.0     0.478607     2.438265     4.405168     0.001801     0.0   
min       0.0     1.000000   122.000000    34.000000    -0.008699     0.0   
25%       0.0     1.000000   132.000000    48.000000    -0.002697     0.0   
50%       0.0     1.000000   134.000000    51.000000    -0.001533     0.0   
75%       0.0     2.000000   136.000000    54.000000    -0.000319     0.0   
max       0.0     2.000000   136.000000    66.000000     0.005471     0.0   

               PC1          PC2          PC3          PC4  ...          PC6  \
count  2569.000000  2569.000000  2569.000000  2569.000000  ...  2569.000000   
mean      0.000098     0.000521    -0.000131     0.000178  ...     0.000027   
std       0.0

Taking a look at the dataset for AJ regression models.
        FID_x        PHENO          CNT         CNT2        SCORE   FID_y  \
count  1460.0  1460.000000  1460.000000  1460.000000  1460.000000  1460.0   
mean      0.0     1.687671   133.228767    62.817808    -0.000218     0.0   
std       0.0     0.463602     1.163646     4.989196     0.002242     0.0   
min       0.0     1.000000   124.000000    47.000000    -0.008780     0.0   
25%       0.0     1.000000   132.000000    60.000000    -0.001754     0.0   
50%       0.0     2.000000   134.000000    63.000000    -0.000137     0.0   
75%       0.0     2.000000   134.000000    66.000000     0.001289     0.0   
max       0.0     2.000000   134.000000    79.000000     0.007633     0.0   

               PC1          PC2          PC3          PC4  ...          PC6  \
count  1460.000000  1460.000000  1460.000000  1460.000000  ...  1460.000000   
mean      0.000386    -0.000949    -0.000961     0.000456  ...    -0.000814   
std       0.00

Optimization terminated successfully.
         Current function value: 0.517796
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                   CASE   No. Observations:                 1193
Model:                          Logit   Df Residuals:                     1179
Method:                           MLE   Df Model:                           13
Date:                Mon, 04 Mar 2024   Pseudo R-squ.:                 0.03364
Time:                        16:40:13   Log-Likelihood:                -617.73
converged:                       True   LL-Null:                       -639.23
Covariance Type:            nonrobust   LLR p-value:                 4.481e-05
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.1560      0.543      2.130      0.033       0.092       2.220
zSCORE         0.2110      0.

Taking a look at the dataset for EAS regression models.
        FID_x        PHENO          CNT         CNT2        SCORE   FID_y  \
count  3328.0  3328.000000  3328.000000  3328.000000  3328.000000  3328.0   
mean      0.0     1.299279   126.243389    53.582031     0.002200     0.0   
std       0.0     0.458011     3.755643     4.992739     0.001866     0.0   
min       0.0     1.000000   110.000000    36.000000    -0.003884     0.0   
25%       0.0     1.000000   124.000000    50.000000     0.000950     0.0   
50%       0.0     1.000000   126.000000    54.000000     0.002199     0.0   
75%       0.0     2.000000   128.000000    57.000000     0.003526     0.0   
max       0.0     2.000000   136.000000    69.000000     0.008708     0.0   

               PC1          PC2          PC3          PC4  ...          PC6  \
count  3328.000000  3328.000000  3328.000000  3328.000000  ...  3328.000000   
mean      0.001644    -0.000132    -0.004306    -0.000278  ...    -0.000873   
std       0.0

Taking a look at the dataset for EUR regression models.
         FID_x         PHENO           CNT          CNT2         SCORE  \
count  19311.0  19311.000000  19311.000000  19311.000000  19311.000000   
mean       0.0      1.607426    129.782611     60.405831     -0.000124   
std        0.0      0.488336      2.894623      4.991056      0.002155   
min        0.0      1.000000    116.000000     42.000000     -0.009660   
25%        0.0      1.000000    128.000000     57.000000     -0.001577   
50%        0.0      2.000000    130.000000     60.000000     -0.000118   
75%        0.0      2.000000    132.000000     64.000000      0.001328   
max        0.0      2.000000    136.000000     78.000000      0.008472   

         FID_y           PC1           PC2           PC3           PC4  ...  \
count  19311.0  19311.000000  19311.000000  19311.000000  19311.000000  ...   
mean       0.0      0.000058     -0.000155     -0.000047      0.000197  ...   
std        0.0      0.006639      0.0068
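
The Logit summaries above report coefficients on the log-odds scale, and PRS effects are usually quoted as odds ratios per standard deviation of the score. The sketch below is a minimal illustration of that conversion, assuming a Wald interval. The helper name `odds_ratio_ci` is hypothetical and not part of the original notebook. It is checked against the Intercept row printed above (coef 1.1560, std err 0.543) only because the zSCORE row is truncated in this export.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(coef, std_err, level=0.95):
    """Convert a log-odds coefficient and its standard error into an
    odds ratio with a Wald confidence interval (hypothetical helper)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    lower, upper = coef - z * std_err, coef + z * std_err
    return math.exp(coef), math.exp(lower), math.exp(upper)


# Intercept row from the summary above: coef = 1.1560, std err = 0.543.
or_, lo, hi = odds_ratio_ci(1.1560, 0.543)

# Back on the log-odds scale, the bounds should agree with the printed
# [0.025, 0.975] columns (0.092 and 2.220) up to rounding.
print(round(math.log(lo), 3), round(math.log(hi), 3))
print(f"OR = {or_:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

With a fitted statsmodels result, `result.params["zSCORE"]` and `result.bse["zSCORE"]` would supply the two inputs, which gives the odds ratio per one control-SD increase in the PRS.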

In [5]:
import pandas as pd
import statsmodels.api as sm

## Run PRS versus PD risk across ancestries (East Asian summary statistics)
# List of ancestries
ancestries = ["AAC", "AFR", "AJ", "AMR", "EAS", "EUR", "CAS"]

for ancestry in ancestries:  
    print("Ancestry:", ancestry)
    # Construct file paths
    prs_file = f"{WORK_DIR}/imputed_data/{ancestry}/PRS_score_release_EASTASIANS.profile"
    covs_file = f"{WORK_DIR}/quality_control/release6/genotype_qc/GP2_release6_NOVEMBER_{ancestry}_maf_hwe_pca.eigenvec"

    # Read PRS data (the .profile file is whitespace-delimited)
    temp_data = pd.read_csv(prs_file, sep=r"\s+")

    # Read the principal components.
    # The .eigenvec file is whitespace-delimited (not tab-delimited), so split on
    # any whitespace, skip the first line, and assign the column names explicitly.
    custom_column_names = ["FID", "IID", "PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9", "PC10"]
    temp_covs = pd.read_csv(covs_file, sep=r"\s+", skiprows=1, names=custom_column_names)

    # Read additional covariates (sex and age) and rename the sample ID column
    # so it matches the IID column used for merging
    temp_covs_2 = pd.read_csv(f"{WORK_DIR}/covariates.txt", sep="\t")
    temp_covs_2 = temp_covs_2.rename(columns={"GP2sampleID": "IID"})

    # Merge covariates
    covs = pd.merge(temp_covs, temp_covs_2, on="IID")

    # Merge PRS data and covariates
    data = pd.merge(temp_data, covs, on="IID")

    # print("looking at covs")
    # print(covs.head())
    # print("looking at temp_data")
    # print(temp_data.head())
    # print("looking at data")
    # print(data.head())
    # print(data.describe())

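    # Remove samples with a missing or unknown phenotype (coded -9).
    # .copy() returns an independent DataFrame, so the column assignments
    # below do not raise SettingWithCopyWarning.
    dat = data[data["PHENO"] != -9].copy()

    # Binary outcome for the logistic regression: PHENO is 1 (control) or 2 (case),
    # so CASE is 0 (control) or 1 (case)
    dat["CASE"] = dat["PHENO"] - 1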

    # Standardize the PRS against the mean and standard deviation of the controls
    mean_controls = dat.loc[dat["CASE"] == 0, "SCORE"].mean()
    sd_controls = dat.loc[dat["CASE"] == 0, "SCORE"].std()
    dat["zSCORE"] = (dat["SCORE"] - mean_controls) / sd_controls

    # Quick peek at the data
    print("Taking a look at the dataset for " + ancestry + " regression models.")
    print(dat.describe())

    # Logistic regression model using statsmodels formula
    formula = "CASE ~ zSCORE + sex_for_qc + age + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10"
    model = sm.Logit.from_formula(formula, data=dat)
    result = model.fit()

    # Print the summary
    print(result.summary())

    # Print
    print("Done analyzing " + ancestry + " now on to the next thing.")

Ancestry: AAC
Taking a look at the dataset for AAC regression models.
        FID_x        PHENO          CNT         CNT2        SCORE   FID_y  \
count  1030.0  1030.000000  1030.000000  1030.000000  1030.000000  1030.0   
mean      0.0     1.242718   126.631068    57.075728    -0.002478     0.0   
std       0.0     0.428935     1.628764     4.617549     0.003183     0.0   
min       0.0     1.000000   118.000000    40.000000    -0.010982     0.0   
25%       0.0     1.000000   126.000000    54.000000    -0.004699     0.0   
50%       0.0     1.000000   126.000000    57.000000    -0.002532     0.0   
75%       0.0     1.000000   128.000000    60.000000    -0.000287     0.0   
max       0.0     2.000000   128.000000    72.000000     0.006207     0.0   

               PC1          PC2          PC3          PC4  ...          PC6  \
count  1030.000000  1030.000000  1030.000000  1030.000000  ...  1030.000000   
mean     -0.009980     0.003704     0.000382     0.000388  ...    -0.000329   

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dat['CASE'] = dat['PHENO'] - 1
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dat["zSCORE"] = (dat["SCORE"] - mean_controls) / sd_controls

Taking a look at the dataset for AFR regression models.
        FID_x        PHENO          CNT         CNT2        SCORE   FID_y  \
count  2569.0  2569.000000  2569.000000  2569.000000  2569.000000  2569.0   
mean      0.0     1.355002   125.420786    54.637213    -0.004499     0.0   
std       0.0     0.478607     2.298765     4.439972     0.002793     0.0   
min       0.0     1.000000   114.000000    37.000000    -0.014022     0.0   
25%       0.0     1.000000   124.000000    52.000000    -0.006336     0.0   
50%       0.0     1.000000   126.000000    55.000000    -0.004533     0.0   
75%       0.0     2.000000   128.000000    58.000000    -0.002631     0.0   
max       0.0     2.000000   128.000000    70.000000     0.005817     0.0   

               PC1          PC2          PC3          PC4  ...          PC6  \
count  2569.000000  2569.000000  2569.000000  2569.000000  ...  2569.000000   
mean      0.000098     0.000521    -0.000131     0.000178  ...     0.000027   
std       0.0

Optimization terminated successfully.
         Current function value: 0.515641
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                   CASE   No. Observations:                 1193
Model:                          Logit   Df Residuals:                     1179
Method:                           MLE   Df Model:                           13
Date:                Mon, 04 Mar 2024   Pseudo R-squ.:                 0.03766
Time:                        16:40:14   Log-Likelihood:                -615.16
converged:                       True   LL-Null:                       -639.23
Covariance Type:            nonrobust   LLR p-value:                 6.180e-06
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.0925      0.543      2.013      0.044       0.029       2.156
zSCORE         0.2603      0.

Taking a look at the dataset for EAS regression models.
        FID_x        PHENO          CNT        CNT2        SCORE   FID_y  \
count  3328.0  3328.000000  3328.000000  3328.00000  3328.000000  3328.0   
mean      0.0     1.299279   116.534856    53.56881     0.000127     0.0   
std       0.0     0.458011     3.713191     4.99342     0.002984     0.0   
min       0.0     1.000000   104.000000    36.00000    -0.011113     0.0   
25%       0.0     1.000000   114.000000    50.00000    -0.001864     0.0   
50%       0.0     1.000000   116.000000    54.00000     0.000119     0.0   
75%       0.0     2.000000   120.000000    57.00000     0.002154     0.0   
max       0.0     2.000000   126.000000    69.00000     0.010134     0.0   

               PC1          PC2          PC3          PC4  ...          PC6  \
count  3328.000000  3328.000000  3328.000000  3328.000000  ...  3328.000000   
mean      0.001644    -0.000132    -0.004306    -0.000278  ...    -0.000873   
std       0.013886    

Taking a look at the dataset for EUR regression models.
         FID_x         PHENO           CNT          CNT2         SCORE  \
count  19311.0  19311.000000  19311.000000  19311.000000  19311.000000   
mean       0.0      1.607426    122.957796     61.305422      0.003880   
std        0.0      0.488336      2.604508      4.965706      0.002987   
min        0.0      1.000000    110.000000     40.000000     -0.008609   
25%        0.0      1.000000    122.000000     58.000000      0.001860   
50%        0.0      2.000000    124.000000     61.000000      0.003904   
75%        0.0      2.000000    124.000000     65.000000      0.005919   
max        0.0      2.000000    128.000000     78.000000      0.015825   

         FID_y           PC1           PC2           PC3           PC4  ...  \
count  19311.0  19311.000000  19311.000000  19311.000000  19311.000000  ...   
mean       0.0      0.000058     -0.000155     -0.000047      0.000197  ...   
std        0.0      0.006639      0.0068
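
The `SettingWithCopyWarning` messages in the output come from assigning new columns to a DataFrame that was produced by boolean filtering. The sketch below shows one common way to avoid the warning, by taking an explicit `.copy()` of the filtered rows, together with the control-based standardization used in the cell above. The toy table and its values are invented for illustration and are not GP2 data.

```python
import pandas as pd

# Toy stand-in for the merged PRS/covariate table (invented values).
data = pd.DataFrame({
    "IID": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "PHENO": [1, 1, 1, 2, 2, -9],  # 1 = control, 2 = case, -9 = missing
    "SCORE": [0.001, 0.002, 0.003, 0.004, 0.005, 0.002],
})

# .copy() makes `dat` an independent DataFrame, so the assignments below
# write to `dat` itself rather than to a possible view of `data`.
dat = data.loc[data["PHENO"] != -9].copy()
dat["CASE"] = dat["PHENO"] - 1

# Standardize the score against the controls only: controls end up with
# mean 0 and SD 1, and cases are expressed in control-SD units.
controls = dat.loc[dat["CASE"] == 0, "SCORE"]
dat["zSCORE"] = (dat["SCORE"] - controls.mean()) / controls.std()

print(dat[["IID", "CASE", "zSCORE"]])
```

Scaling by the control distribution means the zSCORE coefficient in the regression is read as the change in log-odds per one control standard deviation of the PRS.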