## Data Cleaning Script

This Python code opens the ABC News / Washington Post poll data file "data.txt", extracts the variables relevant to this project, and writes them to a dataCleaned.txt file in CSV format. Using the abcnum variable, the script also drops every respondent who lives outside the 21 states included in the Anti-Abortion Policymaking and Women's Representation dataset, so that the public opinion data cover the same states as the policymakers' data.


### Variables in dataCleaned.txt

The script in this notebook extracts the following variables from the data.txt file for analysis. The variables for political party and ideology were considered for the analysis but ultimately not used. More information on all variables can be found in the 04326-Codebook.pdf file.

``caseid`` : record identifier from original dataset

``abcnum`` : ABC state number

``stcode`` : FIPS state number

``Q21`` : Abortion opinion (1 = Always legal, 2 = Legal in most cases, 3 = Illegal in most cases, 4 = Illegal in all cases)

``Q21NET`` : Abortion Legal/Illegal NET (1 = Legal, 2 = Illegal)

``Q24`` : Political party aligning with values

``Q901`` : Political party ID

``Q908A`` : Ideology

``Q921`` : Gender (1 = Male, 2 = Female)
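
As an illustration of how the coded values listed above might be read back from dataCleaned.txt, here is a minimal sketch using Python's `csv` module. The label dictionaries are copied from the list above; the function name `label_rows`, the `"Other/missing"` fallback for any code not listed here, and the two sample rows are assumptions made for this example and do not come from the poll data.

```python
import csv
import io

# Value labels copied from the variable list above.
Q21_LABELS = {
    "1": "Always legal",
    "2": "Legal in most cases",
    "3": "Illegal in most cases",
    "4": "Illegal in all cases",
}
Q921_LABELS = {"1": "Male", "2": "Female"}


def label_rows(csv_file):
    """Yield (caseid, abortion opinion, gender) with codes replaced by labels.

    Codes that are not in the dictionaries above fall back to the
    placeholder "Other/missing" (an assumption of this sketch).
    """
    for row in csv.DictReader(csv_file):
        yield (
            row["caseid"],
            Q21_LABELS.get(row["Q21"], "Other/missing"),
            Q921_LABELS.get(row["Q921"], "Other/missing"),
        )


# Made-up rows in the dataCleaned.txt column layout, for illustration only.
sample = io.StringIO(
    "caseid,abcnum,stcode,Q21,Q21NET,Q24,Q901,Q908A,Q921\n"
    "1,4,5,2,1,1,2,3,2\n"
    "2,10,12,4,2,2,1,1,1\n"
)
labeled = list(label_rows(sample))
```

With the real output file, the same function could be called on `open("dataCleaned.txt", newline="")` instead of the in-memory sample.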

### Script Details

This script opens data.txt and reads it line by line. On each line, it reads the ``abcnum`` variable, which holds the ABC state code for the respondent's state. If the respondent is not from one of the 21 states included in the Anti-Abortion Policymaking and Women's Representation dataset of legislators, the script moves on to the next line without recording the respondent's data. Otherwise, it extracts the relevant values for the respondent, joins them with commas, and appends them to the output string as a new row. The character position of each variable in the original data.txt file is documented in 04326-Codebook.pdf. After the script has looped through every line of data, it writes the output string to dataCleaned.txt.

In [1]:
#open data.txt file - entire dataset for the ABC News/Wash Post Poll
f = open("data.txt")

#initialize output string with header row of variable names. Names are consistent with the codebook pdf
outstr = "caseid,abcnum,stcode,Q21,Q21NET,Q24,Q901,Q908A,Q921\n"

#ABC state codes for the 21 states included in the legislators dataset
states = {'3', '4', '5', '6', '10', '14', '19', '21', '25', '29', '31', '32', '35', '36',
          '39', '41', '43', '44', '45', '48', '50'}

#loop through each line in the original data.txt file
for line in f:

    #if the respondent's home state is not in the whitelist of 21 states, skip this line
    state = line[23:25].strip()
    if state not in states:
        continue
    
    #record the relevant variables, in order, separated by commas
    lstr = line[0:4].strip() + ','
    lstr = lstr + str.strip(line[23:25]) + ','
    lstr = lstr + str.strip(line[25:27]) + ','
    lstr = lstr + str.strip(line[143:145]) + ','
    lstr = lstr + str.strip(line[145:153]) + ','
    lstr = lstr + str.strip(line[167:169]) + ','
    lstr = lstr + str.strip(line[285:287]) + ','
    lstr = lstr + str.strip(line[287:289]) + ','
    lstr = lstr + str.strip(line[329:331])
    
    #add the variables for this respondent to the output string as a new row of data
    outstr = outstr + lstr + '\n'

#after looping through all rows of data, write the output string to dataCleaned.txt
o = open("dataCleaned.txt", "w")
o.write(outstr)
o.close()
f.close()
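
The same extraction could also be written in a table-driven style. The sketch below is one possible variant, not the version used for this project: the state codes and slice positions are copied from the script above, while the names `KEPT_STATES`, `FIELDS`, `clean`, and `make_line`, and the synthetic records used in the demo, are hypothetical and introduced only for illustration.

```python
import csv
import io

# ABC state codes kept by the script (copied from its filter condition).
KEPT_STATES = {
    "3", "4", "5", "6", "10", "14", "19", "21", "25", "29", "31",
    "32", "35", "36", "39", "41", "43", "44", "45", "48", "50",
}

# (variable name, slice start, slice end), copied from the script above.
FIELDS = [
    ("caseid", 0, 4),
    ("abcnum", 23, 25),
    ("stcode", 25, 27),
    ("Q21", 143, 145),
    ("Q21NET", 145, 153),
    ("Q24", 167, 169),
    ("Q901", 285, 287),
    ("Q908A", 287, 289),
    ("Q921", 329, 331),
]


def clean(lines, out):
    """Write a header row plus one CSV row per respondent in a kept state."""
    # lineterminator="\n" matches the line endings the original script writes.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([name for name, _, _ in FIELDS])
    for line in lines:
        if line[23:25].strip() not in KEPT_STATES:
            continue  # respondent lives outside the 21 states
        writer.writerow([line[start:end].strip() for _, start, end in FIELDS])


def make_line(values):
    """Build a made-up fixed-width record (hypothetical helper for the demo)."""
    chars = [" "] * 331
    for name, start, end in FIELDS:
        chars[start:end] = values[name].rjust(end - start)
    return "".join(chars) + "\n"


# Two synthetic respondents: state 4 is in the kept set, state 7 is not.
base = {"caseid": "1", "abcnum": "4", "stcode": "5", "Q21": "2", "Q21NET": "1",
        "Q24": "1", "Q901": "2", "Q908A": "3", "Q921": "2"}
other = dict(base, caseid="2", abcnum="7")

demo = io.StringIO()
clean([make_line(base), make_line(other)], demo)
```

Applied to the real files, `clean` could be called inside `with open("data.txt") as f, open("dataCleaned.txt", "w", newline="") as o:`, which also closes both files if an error occurs. Keeping the slice positions in one list makes them easier to check against the codebook.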
        
        