<div style="background-color: #007BFF; height: 4px; width: 100%;"></div>

# **Linking Preprocessing**

The data used in this thesis come from the **Substance Use and Risk Factor (SURF)** project, led by Dr. Randi Schuster through the
Massachusetts General Hospital Center for Addiction Medicine. Students at 60 middle and high schools in Massachusetts are surveyed annually. The longitudinal dataset created for the RL model uses survey results from the four years 2020 to 2023.

<div style="background-color: #007BFF; height: 4px; width: 100%;"></div>

### **Libraries and Dependencies**

In [5]:
import numpy as np
import pandas as pd

<div style="background-color: #007BFF; height: 4px; width: 100%;"></div>

### **Reading the Raw Data**

In [6]:
# Read data
surf2020 = pd.read_csv("data/SY2020.csv")
surf2021 = pd.read_csv("data/SY2021.csv")
surf2022 = pd.read_csv("data/SY2022.csv")
surf2023 = pd.read_csv("data/SY2023.csv")



In [7]:
# Store dataframes into a dictionary for easy access
dfs = {
    2020: surf2020, 
    2021: surf2021, 
    2022: surf2022, 
    2023: surf2023
}
years = range(2020, 2024)

In [8]:
# Add `Time_point` for linking code
for year in years:
    dfs[year]["SSS.INT.Time_point"] = year

<div style="background-color: #007BFF; height: 4px; width: 100%;"></div>

### **Suicide Identifiers**

The suicide questions were not asked in SURF 2020 and 2021. To keep the dataframes compatible when we combine them, we add a `NaN` column for each of those questions.

In [9]:
suicide_identifiers = [
    "INV.INT.SI.Thoughts",
    "INV.INT.SI.How",
    "INV.INT.SI.Attempt",
    "INV.INT.SI.Selfharm"
]

In [10]:
for identifier in suicide_identifiers:
    dfs[2020][identifier] = np.nan
    dfs[2021][identifier] = np.nan

In [11]:
dfs[2020][suicide_identifiers].head()

Unnamed: 0,INV.INT.SI.Thoughts,INV.INT.SI.How,INV.INT.SI.Attempt,INV.INT.SI.Selfharm
0,,,,
1,,,,
2,,,,
3,,,,
4,,,,
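
The padding step above can also be written as a small helper that adds a column only when it is absent, so a year that did ask the questions is never overwritten. This is a minimal sketch on toy frames; `add_missing_columns` and the toy values are illustrative assumptions, not part of the SURF pipeline.

```python
import numpy as np
import pandas as pd


def add_missing_columns(df, columns):
    """Return a copy of df with every absent column added and filled with NaN."""
    out = df.copy()
    for col in columns:
        if col not in out.columns:
            out[col] = np.nan
    return out


# Toy stand-ins for two survey years (made-up values, not SURF data)
toy_2020 = pd.DataFrame({"INV.FCT.PHQ4.Total": [1, 4]})
toy_2022 = pd.DataFrame({"INV.FCT.PHQ4.Total": [2, 0], "INV.INT.SI.Thoughts": [0, 1]})

padded = add_missing_columns(toy_2020, ["INV.INT.SI.Thoughts"])
untouched = add_missing_columns(toy_2022, ["INV.INT.SI.Thoughts"])
print(padded)
```

Because the helper works on a copy, the input frame is left unchanged, which makes the step safe to re-run in a notebook.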


<div style="background-color: #007BFF; height: 4px; width: 100%;"></div>

### **Standardize Column Names**

The identifiers for several questions changed slightly over the years. To keep the dataframes compatible, we standardize the column names so that later comparisons are easier.

In [12]:
# Rename columns to standardized names
column_mapping = {
    "INV.LGL.SUB.Cigarettes.Ever": "INV.LGL.SUB.Cigarettes.Life",
    "INV.LGL.SUB.Alcohol.Ever": "INV.LGL.SUB.Alcohol.Life",
    "INV.LGL.SUB.Cannabis.Ever": "INV.LGL.SUB.Cannabis.Life",
    "INV.LGL.SUB.Vapes.Ever": "INV.LGL.SUB.Vapes.Life",
    "INV.LGL.SUB.PrescriptionMisuse.Ever": "INV.LGL.SUB.Other.Prescription",
    "INV.INT.SI.Attempt": "INV.INT.SI.Attempt",
    "INV.INT.SI.Thoughts": "INV.INT.SI.Thoughts",
    "INV.INT.SI.How": "INV.INT.SI.How",
    "INV.INT.SI.Selfharm": "INV.INT.SI.Selfharm",
    "INV.LGL.SUB.Other.Hallucinogens": "INV.LGL.SUB.Other.Hallucinogens",
    "INV.LGL.SUB.Other.Psychedelic": "INV.LGL.SUB.Other.Psychedelic",
    "INV.LGL.SUB.Other.Cocaine": "INV.LGL.SUB.Other.Cocaine",
    "INV.LGL.SUB.Other.Meth": "INV.LGL.SUB.Other.Meth",
    "INV.LGL.SUB.Other.Heroin": "INV.LGL.SUB.Other.Heroin",
    "INV.LGL.SUB.Other.Inhalants": "INV.LGL.SUB.Other.Inhalants",
    "INV.LGL.SUB.Other.Steroids": "INV.LGL.SUB.Other.Steroids",
}

for year in years:
    dfs[year].rename(columns=column_mapping, inplace=True)
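
Before relying on the intersection of column names, it can help to list which identifiers are missing from at least one year, since those are the natural candidates for `column_mapping`. The sketch below does this on empty toy frames; `columns_not_shared` is a hypothetical helper and the column lists are made up for illustration.

```python
import pandas as pd


def columns_not_shared(frames):
    """Map each column absent from at least one year to the years that do contain it."""
    all_cols = set().union(*(df.columns for df in frames.values()))
    report = {}
    for col in sorted(all_cols):
        present = [year for year, df in frames.items() if col in df.columns]
        if len(present) < len(frames):
            report[col] = present
    return report


# Toy frames with only column names (no rows), keyed by survey year
toy = {
    2020: pd.DataFrame(columns=["INV.LGL.SUB.Alcohol.Ever", "SSS.INT.Cohort"]),
    2021: pd.DataFrame(columns=["INV.LGL.SUB.Alcohol.Life", "SSS.INT.Cohort"]),
}
report = columns_not_shared(toy)
print(report)
```

In this toy case the report pairs up `...Alcohol.Ever` and `...Alcohol.Life`, which is the kind of near-match the mapping is meant to resolve.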

In [13]:
common_qs = set.intersection(*(set(dfs[year].columns) for year in years))
common_qs = sorted(common_qs)

# Common questions over the years
for q in common_qs:
    print(q)

IDX.INT.Origin.Database
IDX.INT.Origin.Record
INV.CHR.HelpSeeking.Other
INV.DBL.APSS.Q1.MindReading
INV.DBL.APSS.Q2.TVRadio
INV.DBL.APSS.Q3.Spying
INV.DBL.APSS.Q4.Auditory
INV.DBL.APSS.Q5.Controlled
INV.DBL.APSS.Q6.Visual
INV.DBL.APSS.Q7.Grandiosity
INV.DBL.APSS.Total
INV.FCT.PHQ4.Total
INV.INT.ERS.IntensityArousalTotal
INV.INT.ERS.PersistenceTotal
INV.INT.ERS.Q01.Persistence1
INV.INT.ERS.Q02.Sensitivity1
INV.INT.ERS.Q03.IntensityArousal1
INV.INT.ERS.Q04.IntensityArousal2
INV.INT.ERS.Q05.Sensitivity2
INV.INT.ERS.Q06.IntensityArousal3
INV.INT.ERS.Q07.Sensitivity3
INV.INT.ERS.Q08.Persistence2
INV.INT.ERS.Q09.Sensitivity4
INV.INT.ERS.Q10.Persistence3
INV.INT.ERS.Q11.Persistence4
INV.INT.ERS.Q12.Sensitivity5
INV.INT.ERS.Q13.Sensitivity6
INV.INT.ERS.Q14.Sensitivity7
INV.INT.ERS.Q15.Sensitivity8
INV.INT.ERS.Q16.Sensitivity9
INV.INT.ERS.Q17.IntensityArousal4
INV.INT.ERS.Q18.Sensitivity10
INV.INT.ERS.Q19.IntensityArousal5
INV.INT.ERS.Q20.IntensityArousal6
INV.INT.ERS.Q21.IntensityArousal7
INV.

In [14]:
# Subset the columns to only include unique common questions over the years
trunc = {}
for year in years:
    trunc[year] = dfs[year][common_qs]
    trunc[year] = trunc[year].loc[:, ~trunc[year].columns.duplicated()]
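
The deduplication line relies on `Index.duplicated()`, which marks every repeat of a column name after its first occurrence. A toy illustration follows; the frame and its column names are made up.

```python
import pandas as pd

# One row with a repeated column name "a"
toy = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])

# duplicated() is False for the first "a" and "b", True for the second "a"
mask = toy.columns.duplicated()
deduped = toy.loc[:, ~mask]

print(list(mask))
print(list(deduped.columns))
```

Only the first occurrence of each name is kept, so if two same-named columns hold different values the later one is dropped silently.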

In [15]:
# Create new dataframes for each class (cohort)
cohorts = range(2023, 2027)
cohort_dataframes = {}
    
for cohort in cohorts:
    filtered_df = pd.concat([trunc[year][trunc[year]['SSS.INT.Cohort'] == cohort] for year in years], ignore_index=True)
    cohort_dataframes[cohort] = filtered_df
    cohort_dataframes[cohort].to_csv(f'cohorts/cohort{cohort}.csv', index=False)
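
The cohort split can be seen on a small example. This is a sketch under stated assumptions: `split_by_cohort` is a hypothetical helper that mirrors the loop above, and the toy frames contain invented values.

```python
import pandas as pd

# Toy yearly frames: two students, one per cohort, surveyed in two years
toy_years = {
    2020: pd.DataFrame({"SSS.INT.Cohort": [2023, 2024], "SSS.INT.Grade": [9, 8]}),
    2021: pd.DataFrame({"SSS.INT.Cohort": [2023, 2024], "SSS.INT.Grade": [10, 9]}),
}


def split_by_cohort(frames, cohorts, cohort_col="SSS.INT.Cohort"):
    """Stack each cohort's rows across all years, in ascending year order."""
    return {
        cohort: pd.concat(
            [df[df[cohort_col] == cohort] for _, df in sorted(frames.items())],
            ignore_index=True,
        )
        for cohort in cohorts
    }


toy_cohorts = split_by_cohort(toy_years, [2023, 2024])
print(toy_cohorts[2023])
```

When writing the real files, note that `to_csv` does not create missing directories, so calling `os.makedirs("cohorts", exist_ok=True)` beforehand avoids an error on a fresh checkout.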

In [28]:
common_qs

['IDX.INT.Origin.Database',
 'IDX.INT.Origin.Record',
 'INV.CHR.HelpSeeking.Other',
 'INV.DBL.APSS.Q1.MindReading',
 'INV.DBL.APSS.Q2.TVRadio',
 'INV.DBL.APSS.Q3.Spying',
 'INV.DBL.APSS.Q4.Auditory',
 'INV.DBL.APSS.Q5.Controlled',
 'INV.DBL.APSS.Q6.Visual',
 'INV.DBL.APSS.Q7.Grandiosity',
 'INV.DBL.APSS.Total',
 'INV.FCT.PHQ4.Total',
 'INV.INT.ERS.IntensityArousalTotal',
 'INV.INT.ERS.PersistenceTotal',
 'INV.INT.ERS.Q01.Persistence1',
 'INV.INT.ERS.Q02.Sensitivity1',
 'INV.INT.ERS.Q03.IntensityArousal1',
 'INV.INT.ERS.Q04.IntensityArousal2',
 'INV.INT.ERS.Q05.Sensitivity2',
 'INV.INT.ERS.Q06.IntensityArousal3',
 'INV.INT.ERS.Q07.Sensitivity3',
 'INV.INT.ERS.Q08.Persistence2',
 'INV.INT.ERS.Q09.Sensitivity4',
 'INV.INT.ERS.Q10.Persistence3',
 'INV.INT.ERS.Q11.Persistence4',
 'INV.INT.ERS.Q12.Sensitivity5',
 'INV.INT.ERS.Q13.Sensitivity6',
 'INV.INT.ERS.Q14.Sensitivity7',
 'INV.INT.ERS.Q15.Sensitivity8',
 'INV.INT.ERS.Q16.Sensitivity9',
 'INV.INT.ERS.Q17.IntensityArousal4',
 'INV.INT.ER

#### **Using Cohort Dataframes**

Now that the four years of survey data are separated by cohort, each cohort file can be passed to the linking algorithm provided by Michael Pascale and Kevin Potter. This algorithm is available in the R package `rettopnivek/camrprojects`.

<div style="background-color: #007BFF; height: 4px; width: 100%;"></div>

### **Clean Linked Data**

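Now that we have linked observations from each cohort, we read the linked files back in and reshape them so that each row holds one person's responses across the survey years.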

In [16]:
# Read linked
linked23 = pd.read_csv("linked/linked2023.csv")
linked24 = pd.read_csv("linked/linked2024.csv")
linked25 = pd.read_csv("linked/linked2025.csv")
linked26 = pd.read_csv("linked/linked2026.csv")

linked = {
    2023: linked23,
    2024: linked24,
    2025: linked25,
    2026: linked26,
}



In [22]:
for col in linked23.columns:
    print(col)

Unnamed: 0
IDX.INT.Origin.Database
IDX.INT.Origin.Record
INV.CHR.HelpSeeking.Other
INV.DBL.APSS.Q1.MindReading
INV.DBL.APSS.Q2.TVRadio
INV.DBL.APSS.Q3.Spying
INV.DBL.APSS.Q4.Auditory
INV.DBL.APSS.Q5.Controlled
INV.DBL.APSS.Q6.Visual
INV.DBL.APSS.Q7.Grandiosity
INV.DBL.APSS.Total
INV.FCT.PHQ4.Total
INV.INT.ERS.IntensityArousalTotal
INV.INT.ERS.PersistenceTotal
INV.INT.ERS.Q01.Persistence1
INV.INT.ERS.Q02.Sensitivity1
INV.INT.ERS.Q03.IntensityArousal1
INV.INT.ERS.Q04.IntensityArousal2
INV.INT.ERS.Q05.Sensitivity2
INV.INT.ERS.Q06.IntensityArousal3
INV.INT.ERS.Q07.Sensitivity3
INV.INT.ERS.Q08.Persistence2
INV.INT.ERS.Q09.Sensitivity4
INV.INT.ERS.Q10.Persistence3
INV.INT.ERS.Q11.Persistence4
INV.INT.ERS.Q12.Sensitivity5
INV.INT.ERS.Q13.Sensitivity6
INV.INT.ERS.Q14.Sensitivity7
INV.INT.ERS.Q15.Sensitivity8
INV.INT.ERS.Q16.Sensitivity9
INV.INT.ERS.Q17.IntensityArousal4
INV.INT.ERS.Q18.Sensitivity10
INV.INT.ERS.Q19.IntensityArousal5
INV.INT.ERS.Q20.IntensityArousal6
INV.INT.ERS.Q21.IntensityAr

In [37]:
linked23["SSS.INT.SurveyYear"]

0       2020
1       2020
2       2020
3       2020
4       2020
        ... 
9851    2023
9852    2023
9853    2023
9854    2023
9855    2023
Name: SSS.INT.SurveyYear, Length: 9856, dtype: int64

In [20]:
linked23.head(20)

Unnamed: 0.1,Unnamed: 0,IDX.INT.Origin.Database,IDX.INT.Origin.Record,INV.CHR.HelpSeeking.Other,INV.DBL.APSS.Q1.MindReading,INV.DBL.APSS.Q2.TVRadio,INV.DBL.APSS.Q3.Spying,INV.DBL.APSS.Q4.Auditory,INV.DBL.APSS.Q5.Controlled,INV.DBL.APSS.Q6.Visual,...,SSS.INT.Twelfth.Grade.Enrollment,IDX.INT.Row,IDX.CHR.Linked.ID,QCC.LGC.Linked.Attempted,QCC.LGC.Linked,QCC.LGC.Linked.No_issues,QCC.CHR.Linked.Score.Base,QCC.CHR.Linked.Score.Add,QCC.CHR.Linked.Rows,QCC.CHR.Linked.Dissimilarity
0,1,18297,986,,0.0,0.0,0.0,0.0,0.0,0.0,...,270.0,1,YL TP2020 1,True,True,True,1.9:0/7:1;2.9:0/7:1;3.9:0/7:1,,"1,722;1,2322;1,6676",1.9:00000000;2.9:00000000;3.9:00000000
1,2,18297,987,,0.0,0.0,0.0,0.0,0.0,0.0,...,270.0,2,YL TP2020 2,True,True,True,1.9:0/7:1;2.9:1/7:1;3.9:2/7:15,,2527,1.9:00000000;2.9:00100000;3.9:01001010
2,3,18297,988,,0.0,0.0,0.0,0.0,0.0,0.0,...,270.0,3,NL TP2020 3,True,False,False,1.9:1/7:1;2.9:1/7:1;3.9:1/7:1,,,1.9:00010000;2.9:00010000;3.9:00010000
3,4,18297,991,,0.0,0.5,0.5,0.5,0.0,0.0,...,270.0,4,YL TP2020 4,True,True,True,1.9:0/7:1;2.9:0/7:1;3.9:0/7:1,,"4,649;4,2289;4,6708",1.9:00000000;2.9:00000000;3.9:00000000
4,5,18297,994,,0.5,0.0,1.0,0.0,0.0,0.0,...,270.0,5,YL TP2020 5,True,True,True,1.9:0/7:1;2.9:0/7:1;3.9:1/7:1,,"5,612;5,2160",1.9:00000000;2.9:00000000;3.9:01000100
5,6,18297,995,,0.0,0.0,0.5,0.0,0.0,0.0,...,270.0,6,YL TP2020 6,True,True,True,1.9:2/7:5;2.9:2/7:7;3.9:0/7:1,,66563,1.9:01100010;2.9:01100010;3.9:00000000
6,7,18297,998,,0.0,0.0,0.0,0.0,0.0,0.0,...,270.0,7,YL TP2020 7,True,True,True,1.9:0/7:1;2.9:0/7:1;3.9:0/7:1,,"7,540;7,2209;7,6541",1.9:00000000;2.9:00000000;3.9:00000000
7,8,18297,1007,,0.0,0.0,0.0,0.0,0.0,0.0,...,270.0,8,NL TP2020 8,True,False,False,1.9:0/7:1;2.9:0/7:1;3.9:0/7:1,,,1.9:00000000;2.9:00000000;3.9:00000000
8,9,18297,1008,,0.0,0.0,0.5,0.5,0.0,0.0,...,270.0,9,NL TP2020 9,True,False,False,1.9:1/7:1;2.9:1/7:1;3.9:1/7:1,,,1.9:00000100;2.9:00000100;3.9:00000100
9,10,18297,1015,,0.0,0.0,0.0,0.0,0.0,0.0,...,270.0,10,YL TP2020 10,True,True,True,1.9:0/7:1;2.9:2/7:9;3.9:2/7:9,,10592,1.9:00000000;2.9:11000010;3.9:11000010


In [29]:
linking_qs = [
    "QCC.CHR.Linked.Rows",
    "SSS.INT.Cohort",
    "SSS.INT.Grade",
    "SBJ.FCT.Link.BirthMonth",
    "SBJ.FCT.Link.OlderSiblings",
    "SBJ.FCT.Link.EyeColor",
    "SBJ.FCT.Link.MiddleInitial",
    "SBJ.CHR.Link.Streetname",
    "SBJ.INT.Link.KindergartenYearEst"
]

In [35]:
linked23.iloc[[6, 539, 2208, 6540]][linking_qs]

Unnamed: 0,QCC.CHR.Linked.Rows,SSS.INT.Cohort,SSS.INT.Grade,SBJ.FCT.Link.BirthMonth,SBJ.FCT.Link.OlderSiblings,SBJ.FCT.Link.EyeColor,SBJ.FCT.Link.MiddleInitial,SBJ.CHR.Link.Streetname,SBJ.INT.Link.KindergartenYearEst
6,"7,540;7,2209;7,6541",2023,9,May,no older siblings,Blue,l,che,2011
539,"7,540;540,2209;540,6541",2023,10,May,no older siblings,Blue,l,che,2011
2208,"7,2209;540,2209;2209,6541",2023,11,May,no older siblings,Blue,l,che,2011
6540,"7,6541;540,6541;2209,6541",2023,12,May,no older siblings,Blue,l,che,2011
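
Judging from the rows above, `QCC.CHR.Linked.Rows` appears to hold semicolon-separated pairs of 1-based row numbers, where each pair names two rows judged to belong to the same person. Under that assumption (inferred from the output, not from the package documentation), a small parser can recover the full set of 0-based positions from a single cell. `parse_linked_rows` is a hypothetical helper.

```python
def parse_linked_rows(value):
    """Return the 0-based row positions named in an 'a,b;c,d' pair string."""
    if not isinstance(value, str) or not value.strip():
        return set()  # missing value (e.g. NaN) means no links were recorded
    positions = set()
    for pair in value.split(";"):
        for number in pair.split(","):
            positions.add(int(number) - 1)  # 1-based row number -> 0-based position
    return positions


group = sorted(parse_linked_rows("7,540;7,2209;7,6541"))
print(group)  # [6, 539, 2208, 6540]
```

Reading both members of every pair means any row in a linked group yields the same set of positions, which matches the four rows selected with `iloc` above.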


In [42]:
linked23.shape

(9856, 124)

In [81]:
import pandas as pd
import numpy as np

# Function to transform dataset into person-level linked responses
def transform_linked_dataset(df):
    # Extract unique survey years
    survey_years = sorted(df["SSS.INT.SurveyYear"].unique())
    
    # Extract all column names excluding the linking and survey-year columns
    original_columns = [col for col in df.columns if col not in ["QCC.CHR.Linked.Rows", "SSS.INT.SurveyYear"]]
    
    # Create new column names with survey year suffix
    new_columns = [f"{col}{year}" for col in original_columns for year in survey_years]
    new_columns += [f"HasResponse{year}" for year in survey_years]  # Indicator columns for each year
    new_columns.append("PersonID")  # Unique person identifier
    new_columns.append("NumResponses") # Number of years of responses
    
    # Dictionary to hold transformed data
    person_data = {}
    
    # Dictionary to store processed linked groups
    processed = {}
    
    # Iterate through each row
    for idx, row in df.iterrows():
        # Parse linked rows
        linked_ids = row["QCC.CHR.Linked.Rows"].split(";") if pd.notna(row["QCC.CHR.Linked.Rows"]) else []
        linked_ids = [int(pair.split(",")[1]) - 1 for pair in linked_ids] + [idx]  # Adjust for 1-based index
        linked_ids = sorted(set(linked_ids))  # Remove duplicates
        
        # Assign a unique person ID (smallest observation ID in linked group)
        person_id = min(linked_ids)
        
        # Skip if this person group has already been processed
        if idx in processed:
            continue
        
        # Initialize data row
        person_row = {col: np.nan for col in new_columns}
        person_row["PersonID"] = person_id
        person_row["NumResponses"] = 0
        
        # Iterate over linked observations
        for linked_idx in linked_ids:
            if linked_idx not in df.index:
                continue  # Skip if index not found
            linked_row = df.loc[linked_idx]
            year = linked_row["SSS.INT.SurveyYear"]
            
            # Populate response data for that year
            for col in original_columns:
                person_row[f"{col}{year}"] = linked_row[col]
            
            # Mark the response presence for that year
            person_row[f"HasResponse{year}"] = True
            person_row["NumResponses"] += 1
            processed[linked_idx] = True
        
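        # Fill missing response indicators as False (every indicator starts as NaN above,
        # so test the value rather than whether the key exists)
        for year in survey_years:
            if person_row[f"HasResponse{year}"] is not True:
                person_row[f"HasResponse{year}"] = False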
        
        # Store transformed data
        person_data[person_id] = person_row
        
        # Mark as processed
        processed[person_id] = True
    
    # Convert dictionary to DataFrame
    transformed_df = pd.DataFrame.from_dict(person_data, orient="index")
    return transformed_df

# Example usage:
# df = pd.read_csv("your_dataset.csv")
# transformed_df = transform_linked_dataset(df)
# transformed_df.to_csv("transformed_dataset.csv", index=False)
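
One property worth checking before reshaping: the function writes each linked row into the columns for its survey year, so if two rows in a linked group share a year, the later one overwrites the earlier while `NumResponses` still counts both. The sketch below flags such groups; `years_with_collisions` is a hypothetical helper and the toy frame is invented.

```python
import pandas as pd


def years_with_collisions(df, group_labels, year_col="SSS.INT.SurveyYear"):
    """Return the survey years that occur more than once among the given row labels."""
    counts = df.loc[list(group_labels), year_col].value_counts()
    return sorted(int(year) for year in counts[counts > 1].index)


# Toy frame: row labels 1 and 2 both come from the 2021 survey
toy = pd.DataFrame({"SSS.INT.SurveyYear": [2020, 2021, 2021, 2023]})

clean_group = years_with_collisions(toy, [0, 1, 3])
clashing_group = years_with_collisions(toy, [0, 1, 2])
print(clean_group, clashing_group)
```

An empty list means the group has at most one response per year, which is the case the wide format assumes.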


In [82]:
test = linked[2023].iloc[[6, 539, 2208, 6540]]
transformed_test = transform_linked_dataset(test)
display(transformed_test)

Unnamed: 0,Unnamed: 02020,Unnamed: 02021,Unnamed: 02022,Unnamed: 02023,IDX.INT.Origin.Database2020,IDX.INT.Origin.Database2021,IDX.INT.Origin.Database2022,IDX.INT.Origin.Database2023,IDX.INT.Origin.Record2020,IDX.INT.Origin.Record2021,...,QCC.CHR.Linked.Dissimilarity2020,QCC.CHR.Linked.Dissimilarity2021,QCC.CHR.Linked.Dissimilarity2022,QCC.CHR.Linked.Dissimilarity2023,HasResponse2020,HasResponse2021,HasResponse2022,HasResponse2023,PersonID,NumResponses
6,7,540,2209,6541,18297,18297,36844,44184,998,2133,...,1.9:00000000;2.9:00000000;3.9:00000000,4.9:00000000;5.9:00000000,6.9:00000000,,True,True,True,True,6,4


In [83]:
for cohort in cohorts:
    transform_linked_dataset(linked[cohort]).to_csv(f'final/final{cohort}.csv', index=False)

<div style="background-color: #007BFF; height: 4px; width: 100%;"></div>

In [84]:
# Read formatted
final23 = pd.read_csv("final/final2023.csv")
final24 = pd.read_csv("final/final2024.csv")
final25 = pd.read_csv("final/final2025.csv")
final26 = pd.read_csv("final/final2026.csv")

final = {
    2023: final23,
    2024: final24,
    2025: final25,
    2026: final26,
}



In [85]:
for col in final23.columns:
    print(col)

Unnamed: 02020
Unnamed: 02021
Unnamed: 02022
Unnamed: 02023
IDX.INT.Origin.Database2020
IDX.INT.Origin.Database2021
IDX.INT.Origin.Database2022
IDX.INT.Origin.Database2023
IDX.INT.Origin.Record2020
IDX.INT.Origin.Record2021
IDX.INT.Origin.Record2022
IDX.INT.Origin.Record2023
INV.CHR.HelpSeeking.Other2020
INV.CHR.HelpSeeking.Other2021
INV.CHR.HelpSeeking.Other2022
INV.CHR.HelpSeeking.Other2023
INV.DBL.APSS.Q1.MindReading2020
INV.DBL.APSS.Q1.MindReading2021
INV.DBL.APSS.Q1.MindReading2022
INV.DBL.APSS.Q1.MindReading2023
INV.DBL.APSS.Q2.TVRadio2020
INV.DBL.APSS.Q2.TVRadio2021
INV.DBL.APSS.Q2.TVRadio2022
INV.DBL.APSS.Q2.TVRadio2023
INV.DBL.APSS.Q3.Spying2020
INV.DBL.APSS.Q3.Spying2021
INV.DBL.APSS.Q3.Spying2022
INV.DBL.APSS.Q3.Spying2023
INV.DBL.APSS.Q4.Auditory2020
INV.DBL.APSS.Q4.Auditory2021
INV.DBL.APSS.Q4.Auditory2022
INV.DBL.APSS.Q4.Auditory2023
INV.DBL.APSS.Q5.Controlled2020
INV.DBL.APSS.Q5.Controlled2021
INV.DBL.APSS.Q5.Controlled2022
INV.DBL.APSS.Q5.Controlled2023
INV.DBL.APSS.Q6.

In [90]:
final23[[
    "PersonID", "NumResponses", "Unnamed: 02020", "Unnamed: 02021", "Unnamed: 02022", "Unnamed: 02023",
     # 2020 linking qs
    "SSS.INT.Cohort2020",
    "SSS.INT.Grade2020",
    "SBJ.FCT.Link.BirthMonth2020",
    "SBJ.FCT.Link.OlderSiblings2020",
    "SBJ.FCT.Link.EyeColor2020",
    "SBJ.FCT.Link.MiddleInitial2020",
    "SBJ.CHR.Link.Streetname2020",
    "SBJ.INT.Link.KindergartenYearEst2020",
    # 2021 linking qs
    "SSS.INT.Cohort2021",
    "SSS.INT.Grade2021",
    "SBJ.FCT.Link.BirthMonth2021",
    "SBJ.FCT.Link.OlderSiblings2021",
    "SBJ.FCT.Link.EyeColor2021",
    "SBJ.FCT.Link.MiddleInitial2021",
    "SBJ.CHR.Link.Streetname2021",
    "SBJ.INT.Link.KindergartenYearEst2021",
]].head()

Unnamed: 0,PersonID,NumResponses,Unnamed: 02020,Unnamed: 02021,Unnamed: 02022,Unnamed: 02023,SSS.INT.Cohort2020,SSS.INT.Grade2020,SBJ.FCT.Link.BirthMonth2020,SBJ.FCT.Link.OlderSiblings2020,...,SBJ.CHR.Link.Streetname2020,SBJ.INT.Link.KindergartenYearEst2020,SSS.INT.Cohort2021,SSS.INT.Grade2021,SBJ.FCT.Link.BirthMonth2021,SBJ.FCT.Link.OlderSiblings2021,SBJ.FCT.Link.EyeColor2021,SBJ.FCT.Link.MiddleInitial2021,SBJ.CHR.Link.Streetname2021,SBJ.INT.Link.KindergartenYearEst2021
0,0,4,1.0,722.0,2322.0,6676.0,2023.0,9.0,September,no older siblings,...,edg,2011.0,2023.0,10.0,September,no older siblings,Blue,p,edg,2011.0
1,1,2,2.0,527.0,,,2023.0,9.0,July,no older siblings,...,cle,2011.0,2023.0,10.0,July,no older siblings,Brown,no middle name,cle,2011.0
2,2,1,3.0,,,,2023.0,9.0,January,1 older sibling born in November,...,gre,2011.0,,,,,,,,
3,3,4,4.0,649.0,2289.0,6708.0,2023.0,9.0,August,"2 older siblings, the oldest born in November",...,win,2011.0,2023.0,10.0,August,"2 older siblings, the oldest born in November",Blue,j,win,2011.0
4,4,3,5.0,612.0,2160.0,,2023.0,9.0,August,no older siblings,...,sun,2011.0,2023.0,10.0,August,no older siblings,Blue,p,sun,2011.0


In [92]:
complete23 = final23[final23["NumResponses"] >= 3]
display(complete23.head())
complete23.shape

Unnamed: 0,Unnamed: 02020,Unnamed: 02021,Unnamed: 02022,Unnamed: 02023,IDX.INT.Origin.Database2020,IDX.INT.Origin.Database2021,IDX.INT.Origin.Database2022,IDX.INT.Origin.Database2023,IDX.INT.Origin.Record2020,IDX.INT.Origin.Record2021,...,QCC.CHR.Linked.Dissimilarity2020,QCC.CHR.Linked.Dissimilarity2021,QCC.CHR.Linked.Dissimilarity2022,QCC.CHR.Linked.Dissimilarity2023,HasResponse2020,HasResponse2021,HasResponse2022,HasResponse2023,PersonID,NumResponses
0,1.0,722.0,2322.0,6676.0,18297.0,18297.0,36844.0,44184.0,986.0,2746.0,...,1.9:00000000;2.9:00000000;3.9:00000000,4.9:00000000;5.9:00000000,6.9:00000000,,True,True,True,True,0,4
3,4.0,649.0,2289.0,6708.0,18297.0,18297.0,36844.0,44184.0,991.0,2483.0,...,1.9:00000000;2.9:00000000;3.9:00000000,4.9:00000000;5.9:00000000,6.9:00000000,,True,True,True,True,3,4
4,5.0,612.0,2160.0,,18297.0,18297.0,36844.0,,994.0,2362.0,...,1.9:00000000;2.9:00000000;3.9:01000100,4.9:00000000;5.9:01000100,6.9:01000100,,True,True,True,,4,3
6,7.0,540.0,2209.0,6541.0,18297.0,18297.0,36844.0,44184.0,998.0,2133.0,...,1.9:00000000;2.9:00000000;3.9:00000000,4.9:00000000;5.9:00000000,6.9:00000000,,True,True,True,True,6,4
13,14.0,636.0,2267.0,6584.0,18297.0,18297.0,36844.0,44184.0,1030.0,2431.0,...,1.9:00000000;2.9:00000000;3.9:00000000,4.9:00000000;5.9:00000000,6.9:00000000,,True,True,True,True,13,4


(316, 494)
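
To see how much data the `NumResponses >= 3` threshold discards, the distribution of response counts can be tabulated first. This is a minimal sketch on a toy frame whose `NumResponses` values mirror the first five rows of `final23` shown earlier; it is not run on the full data.

```python
import pandas as pd

# Toy frame mirroring the first five people in final23
toy_final = pd.DataFrame(
    {"PersonID": [0, 1, 2, 3, 4], "NumResponses": [4, 2, 1, 4, 3]}
)

# How many people answered in 1, 2, 3, or 4 survey years
response_counts = toy_final["NumResponses"].value_counts().sort_index()

# Same filter as above: keep people with at least three survey years
retained = toy_final[toy_final["NumResponses"] >= 3]

print(response_counts.to_dict())
print(len(retained))
```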

<div style="background-color: #007BFF; height: 4px; width: 100%;"></div>

## **TODO**

- Check that the data is linked
    - Reorganize data to make it easy to match people's responses over the years
- Clean the column names to make it easier to use
- Figure out which variables are most interesting