### IPUMS Data Cleaning Summary:
#### Variables:
- After exploring the data for all years, the geographic variable MET2013 was chosen to identify residents of the LA metro area (which effectively comprises LA and Orange counties). The goal was to analyze migration out of exactly the same geographic region for all years (2006-2016). MET2013 (1) is available for all years (unlike METAREA) and (2) provides enough data for all years (unlike PUMA). **Note: PUMA is included in the data for more specific insight on migration destinations if desired.**
- The migration variables (MIGMET1 for 2006-2011; MIGPUMA1 with MIGPLAC1 for 2012-2016) were filtered to reflect the same geographic area as MET2013 (i.e. LA and Orange counties).

#### Columns:
- Numeric values were replaced with the corresponding key words when a column has only a few values (e.g. SEX: 'Male' for 1, 'Female' for 2).
- Other columns retain their original numeric values, and an additional column with the key words/descriptions was added for potentially easier analysis (e.g. the educational attainment column EDUC is complemented by the column EDUC_DESC, which provides the text description corresponding to each numeric code).
- Additional columns (OCC2010_CAT & IND_CAT) were added for the 'occupation' and 'industry' variables, containing a description of the broader category of the occupation or industry.

-------------------

### Variables Included:

Demographic:
- SEX (https://usa.ipums.org/usa-action/variables/SEX#codes_section)
- AGE (https://usa.ipums.org/usa-action/variables/AGE#codes_section)
- MARST (https://usa.ipums.org/usa-action/variables/MARST#codes_section)
- RACE (https://usa.ipums.org/usa-action/variables/RACE#codes_section)
- HISPAN (https://usa.ipums.org/usa-action/variables/HISPAN#codes_section)

Economic:
- OWNERSHP (https://usa.ipums.org/usa-action/variables/OWNERSHP#codes_section)
- MORTAMT1 (https://usa.ipums.org/usa-action/variables/MORTAMT1#codes_section)
- RENT (https://usa.ipums.org/usa-action/variables/RENT#codes_section)
- EMPSTAT (https://usa.ipums.org/usa-action/variables/EMPSTAT#codes_section)
- OCC2010 (https://usa.ipums.org/usa-action/variables/OCC2010#codes_section)
- IND (https://usa.ipums.org/usa-action/variables/IND#codes_section)
- INCTOT (https://usa.ipums.org/usa-action/variables/INCTOT#codes_section)
- FTOTINC (https://usa.ipums.org/usa-action/variables/FTOTINC#codes_section)
- POVERTY (https://usa.ipums.org/usa-action/variables/POVERTY#codes_section)

Education:
- EDUC (https://usa.ipums.org/usa-action/variables/EDUC#codes_section)

Geographic/Migration:
- MET2013 (https://usa.ipums.org/usa-action/variables/MET2013#codes_section)
- PUMA (https://usa.ipums.org/usa-action/variables/PUMA#codes_section)
- MIGPUMA1 (https://usa.ipums.org/usa-action/variables/MIGPUMA1#codes_section)

Technical:
- PERNUM (https://usa.ipums.org/usa-action/variables/PERNUM#codes_section)
- PERWT (https://usa.ipums.org/usa-action/variables/PERWT#codes_section)
- HHWT (https://usa.ipums.org/usa-action/variables/HHWT#codes_section)

### Filtering Raw Data:

In [1]:
import pandas as pd

In [2]:
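#Loading each year's data; removing default columns that aren't useful
#Keeping only residents who moved out of LA metro: current metro (MET2013) is identified and is not LA,
#and residence one year ago was LA or Orange county (MIGMET1 before 2012; MIGPUMA1 within California, MIGPLAC1 == 6, from 2012 on)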
dfs = {}
for year in range(2006, 2017):
    name = str(year) + "Final.csv"
    file = pd.read_csv(name)
    file = file.drop(['OWNERSHP', 'EMPSTATD', 'EDUCD', 'RACED', 'HISPAND'], axis=1)
    if year < 2012:
        file = file[(~file.MET2013.isin([0, 31080])) & (file.MIGMET1.isin([4480,4482]))]
    else:
        file = file[(~file.MET2013.isin([0, 31080])) & (file.MIGPUMA1.isin([3700, 5900])) & (file.MIGPLAC1==6)]
    dfs[year] = file
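
The per-year filter above can also be sketched as a standalone function, which makes the pre-2012 and 2012+ branches easy to check on a small frame. The function name `filter_la_outmigrants`, the constant `LA_METRO`, and the toy rows are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

LA_METRO = 31080  # MET2013 code used in the filter above

def filter_la_outmigrants(df, year):
    """Keep rows for people who lived in LA/Orange county a year ago
    and now live in a different, identified metro area."""
    # Current residence: drop unidentified metros (0) and LA metro itself
    left_la = ~df["MET2013"].isin([0, LA_METRO])
    if year < 2012:
        # 2006-2011: previous residence comes from MIGMET1
        from_la = df["MIGMET1"].isin([4480, 4482])
    else:
        # 2012-2016: previous residence comes from MIGPUMA1 within California
        from_la = df["MIGPUMA1"].isin([3700, 5900]) & (df["MIGPLAC1"] == 6)
    return df[left_la & from_la]

# Toy rows (made up): only the third one left LA metro for another metro
toy = pd.DataFrame({
    "MET2013": [31080, 0, 41860, 38060],
    "MIGMET1": [4480, 4480, 4482, 0],
    "MIGPUMA1": [3700, 3700, 5900, 3700],
    "MIGPLAC1": [6, 6, 6, 4],
})
early = filter_la_outmigrants(toy, 2010)
late = filter_la_outmigrants(toy, 2014)
```

Inside the loop, this would be called as `file = filter_la_outmigrants(file, year)`.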

In [3]:
#Combining the filtered data for each year in a master data frame
master = pd.concat([dfs[year] for year in range(2006,2017)], sort=False, ignore_index=True)

In [4]:
#MIGPLAC1 Contains single value (6) and was only used for filtering, so no longer needed
master = master.drop('MIGPLAC1', axis=1)

In [5]:
#Filtered data contains 34,761 records for 2006-2016
len(master)

34761
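
The 34,761 figure counts sample records, not people. Since PERWT is included in the data, a weighted estimate can be obtained by summing it. The sketch below does this per year on toy data; it assumes the extract carries the standard IPUMS `YEAR` column, which is not shown in this notebook, and the numbers are made up.

```python
import pandas as pd

# Toy data: two sampled persons in 2006, one in 2007
toy = pd.DataFrame({
    "YEAR": [2006, 2006, 2007],
    "PERWT": [120.0, 80.0, 95.0],
})

# records = number of sample rows; est_persons = sum of person weights
summary = toy.groupby("YEAR")["PERWT"].agg(records="size", est_persons="sum")
```

On the real data, the same call on `master` would give one row per year.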

-------------------

### Manipulating & Creating Columns

In [6]:
#Replacing numeric values with key words
sex = {1:'Male', 2:'Female'}
series = master.SEX.map(sex)
master.SEX = series

-------------------

In [7]:
education = {
    0: 'N/A or no schooling',
    1: 'Nursery school to grade 4',
    2: 'Grade 5, 6, 7, or 8',
    3: 'Grade 9',
    4: 'Grade 10',
    5: 'Grade 11',
    6: 'Grade 12',
    7: '1 year of college',
    8: '2 years of college',
    9: '3 years of college',
    10: '4 years of college',
    11: '5+ years of college'}

series = master.EDUC.map(education)
master['EDUC_DESC'] = series
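
`Series.map` with a dictionary returns NaN for any code that has no key in the dictionary, so a quick check for unmapped codes can catch a missing entry in any of these lookup tables. A minimal sketch, where `unmapped_codes` is a hypothetical helper and the sample series is made up:

```python
import pandas as pd

def unmapped_codes(codes, mapping):
    """Return the sorted unique codes that have no entry in `mapping`."""
    described = codes.map(mapping)
    # Codes that were present but came back as NaN after mapping
    missing = codes[described.isna() & codes.notna()]
    return sorted(missing.unique().tolist())

# Toy example: 99 is deliberately absent from the dictionary
educ = pd.Series([0, 6, 10, 11, 99])
labels = {
    0: 'N/A or no schooling',
    6: 'Grade 12',
    10: '4 years of college',
    11: '5+ years of college',
}
missing = unmapped_codes(educ, labels)
```

In the notebook, `unmapped_codes(master.EDUC, education)` should return an empty list if every code is covered.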

-------------------

In [8]:
ownership = {
0:	'N/A',
12:	'Owned free and clear',
13:	'Owned with mortgage or loan',
21:	'No cash rent',
22:	'With cash rent'}

series = master.OWNERSHPD.map(ownership)
master['OWNERSHPD_DESC'] = series

-------------------

In [9]:
employment = {
0: 'N/A',
1:'Employed',
2:'Unemployed',
3:'Not in labor force'}

series = master.EMPSTAT.map(employment)
master['EMPSTAT'] = series

-------------------

In [10]:
marital = {
1:	'Married, spouse present',
2:	'Married, spouse absent',
3:	'Separated',
4:	'Divorced',
5:	'Widowed',
6:	'Never married/single'}

series = master.MARST.map(marital)
master['MARST_DESC'] = series

-------------------

In [11]:
race = {
1:	'White',
2:	'Black/African American/Negro',
3:	'American Indian or Alaska Native',
4:	'Chinese',
5:	'Japanese',
6:	'Other Asian or Pacific Islander',
7:	'Other race, nec',
8:	'Two major races',
9:	'Three or more major races'}

series = master.RACE.map(race)
master['RACE_DESC'] = series

-------------------

In [12]:
hispanic = {
0:	'Not Hispanic',
1:	'Mexican',
2:	'Puerto Rican',
3:	'Cuban',
4:	'Other',
9:	'Not Reported'}
series = master.HISPAN.map(hispanic)
master['HISPAN_DESC'] = series

-------------------

In [13]:
migmet = {4480: 'LA County', 4482: 'Orange County'}

series = master.MIGMET1.map(migmet)
master['MIGMET1_DESC'] = series

-------------------

In [14]:
migpuma = {3700: 'LA County', 5900: 'Orange County'}

series = master.MIGPUMA1.map(migpuma)
master['MIGPUMA1_DESC'] = series

-------------------

##### *The following variables have too many codes to copy/paste as above, so their code/description pairs are loaded and parsed from tab-delimited text files instead

In [15]:
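#Loading MET2013 codes and metro area names from a tab-delimited text file (first line is a header)
with open('met2013.txt') as text:
    file = text.readlines()[1:]

met2013 = {}
for line in file:
    fields = line.split('\t')
    #rstrip removes any trailing newline from the description
    met2013[int(fields[0])] = fields[1].rstrip()

series = master.MET2013.map(met2013)
master['MET2013_DESC'] = series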
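
Since the same parsing pattern is repeated for several text files, it could be factored into one helper. The sketch below assumes each file has a header line followed by `code<TAB>description` rows, as the parsing cells imply; `load_codebook` is a hypothetical name, and an in-memory sample stands in for the real file.

```python
import io

def load_codebook(handle):
    """Parse tab-separated code/description lines into {int code: description}."""
    next(handle, None)  # skip the header line
    codebook = {}
    for line in handle:
        if not line.strip():
            continue  # ignore blank lines
        code, label = line.rstrip('\n').split('\t')[:2]
        codebook[int(code)] = label.strip()
    return codebook

# In-memory stand-in for a file such as met2013.txt (sample rows only)
sample = io.StringIO(
    "code\tlabel\n"
    "31080\tLos Angeles-Long Beach-Anaheim, CA\n"
    "41860\tSan Francisco-Oakland-Hayward, CA\n"
)
codes = load_codebook(sample)
```

With a real file, this would be used as `with open('met2013.txt') as fh: met2013 = load_codebook(fh)`.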

-------------------

In [16]:
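#Loading OCC2010 codes and occupation titles from a tab-delimited text file (first line is a header)
with open('occ2010.txt') as text:
    file = text.readlines()[1:]

occ2010 = {}
for line in file:
    fields = line.split('\t')
    #rstrip removes any trailing newline from the description
    occ2010[int(fields[0])] = fields[1].rstrip()

series = master.OCC2010.map(occ2010)
master['OCC2010_DESC'] = series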

##### *Adding broader 'occupation category' column that identifies occupations more generally (from the bolded subheadings here: https://usa.ipums.org/usa-action/variables/OCC2010#codes_section)

In [17]:
occupations = {}
for number in range(0, 9930, 5):
    if number < 440:
        occupations[number] = 'MANAGEMENT, BUSINESS, SCIENCE, AND ARTS'
    elif number > 490 and number < 740:
        occupations[number] = 'BUSINESS OPERATIONS SPECIALISTS'
    elif number > 790 and number < 960:
        occupations[number] = 'FINANCIAL SPECIALISTS'
    elif number > 990 and number < 1550:
        occupations[number] = 'COMPUTER AND MATHEMATICAL'
    elif number > 1540 and number < 1570:
        occupations[number] = 'TECHNICIANS'
    elif number > 1590 and number < 1990:
        occupations[number] = 'LIFE, PHYSICAL, AND SOCIAL SCIENCE'
    elif number > 1990 and number < 2070:
        occupations[number] = 'COMMUNITY AND SOCIAL SERVICES'
    elif number > 2090 and number < 2160:
        occupations[number] = 'LEGAL'
    elif number > 2190 and number < 2560:
        occupations[number] = 'EDUCATION, TRAINING, AND LIBRARY'
    elif number > 2590 and number < 2930:
        occupations[number] = 'ARTS, DESIGN, ENTERTAINMENT, SPORTS, AND MEDIA'
    elif number > 2990 and number < 3550:
        occupations[number] = 'HEALTHCARE PRACTITIONERS AND TECHNICAL'
    elif number > 3590 and number < 3660:
        occupations[number] = 'HEALTHCARE SUPPORT'
    elif number > 3690 and number < 4000:
        occupations[number] = 'PROTECTIVE SERVICE'
    elif number > 4000 and number < 4160:
        occupations[number] = 'FOOD PREPARATION AND SERVING'
    elif number > 4190 and number < 4260:
        occupations[number] = 'BUILDING AND GROUNDS CLEANING AND MAINTENANCE'
    elif number > 4290 and number < 4660:
        occupations[number] = 'PERSONAL CARE AND SERVICE'
    elif number > 4690 and number < 5000:
        occupations[number] = 'SALES AND RELATED'
    elif number > 5000 and number < 5950:
        occupations[number] = 'OFFICE AND ADMINISTRATIVE SUPPORT'
    elif number > 6000 and number < 6135:
        occupations[number] = 'FARMING, FISHING, AND FORESTRY'
    elif number > 6190 and number < 6770:
        occupations[number] = 'CONSTRUCTION'
    elif number > 6790 and number < 6950:
        occupations[number] = 'EXTRACTION'
    elif number > 6990 and number < 7640:
        occupations[number] = 'INSTALLATION, MAINTENANCE, AND REPAIR'
    elif number > 7990 and number < 8970:
        occupations[number] = 'PRODUCTION'
    elif number > 8990 and number < 9760:
        occupations[number] = 'TRANSPORTATION AND MATERIAL MOVING'
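    #Upper bound keeps 9920 (unemployed, no recent work experience) out of the military category
    elif number > 9790 and number < 9920: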
        occupations[number] = 'MILITARY SPECIFIC'
        
series = master.OCC2010.map(occupations)
master['OCC2010_CAT'] = series
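
The range checks above can also be expressed as a small lookup table, which avoids pre-building a dictionary over every multiple of 5 and handles any code value. A minimal sketch under assumptions: `OCC_RANGES` and `categorize` are illustrative names, only the first three ranges from the cell are reproduced, and the bounds are exclusive to mirror the `>` / `<` comparisons.

```python
import pandas as pd

# (exclusive lower bound, exclusive upper bound, category)
# The first branch in the cell is simply `number < 440`, hence the -1.
OCC_RANGES = [
    (-1, 440, 'MANAGEMENT, BUSINESS, SCIENCE, AND ARTS'),
    (490, 740, 'BUSINESS OPERATIONS SPECIALISTS'),
    (790, 960, 'FINANCIAL SPECIALISTS'),
]

def categorize(code, ranges):
    """Return the category whose range contains `code`, or None if no range matches."""
    for low, high, label in ranges:
        if low < code < high:
            return label
    return None

# Toy codes: 9920 falls outside every range listed here
occ = pd.Series([10, 430, 520, 950, 9920])
cat = occ.map(lambda c: categorize(c, OCC_RANGES))
```

Codes that match no range come back as missing values, the same as with the dictionary-based `map`.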

-------------------

In [18]:
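#Loading IND codes and industry names from a tab-delimited text file (first line is a header)
with open('industry.txt') as text:
    file = text.readlines()[1:]

ind = {}
for line in file:
    fields = line.split('\t')
    #rstrip removes any trailing newline from the description
    ind[int(fields[0])] = fields[1].rstrip()

series = master.IND.map(ind)
master['IND_DESC'] = series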

##### *Adding broader 'industry category' column that identifies industries more generally (from the headings - and subheadings when available - here: https://usa.ipums.org/usa/volii/ind2013.shtml)

In [19]:
industries = {}
for number in range(0, 9930, 10):
    if number == 0:
        industries[number] = 'N/A (less than 16 years old/unemployed who never worked/NILF who last worked more than 5 years ago)'
    elif number > 160 and number < 300:
        industries[number] = 'Agriculture, Forestry, Fishing, and Hunting'
    elif number > 360 and number < 500:
        industries[number] = 'Mining, Quarrying, and Oil and Gas Extraction'
    elif number > 560 and number < 700:
        industries[number] = 'Utilities'
    elif number == 770:
        industries[number] = 'Construction'
    elif number > 1060 and number < 4000:
        industries[number] = 'Manufacturing'
    elif number > 4070 and number < 4600:
        industries[number] = 'Wholesale Trade'
    elif number > 4660 and number < 5800:
        industries[number] = 'Retail Trade'
    elif number > 6060 and number < 6400:
        industries[number] = 'Transportation and Warehousing'
    elif number > 6460 and number < 6790:
        industries[number] = 'Information'
    elif number > 6860 and number < 7000:
        industries[number] = 'Finance and Insurance'
    elif number > 7060 and number < 7200:
        industries[number] = 'Real Estate and Rental and Leasing'
    elif number > 7260 and number < 7500:
        industries[number] = 'Professional, Scientific, and Technical Services'
    elif number == 7570:
        industries[number] = 'Management of companies and enterprises'        
    elif number > 7570 and number < 7800:
        industries[number] = 'Administrative and support and waste management services'
    elif number > 7850 and number < 7900:
        industries[number] = 'Educational Services'
    elif number > 7960 and number < 8480:
        industries[number] = 'Health Care and Social Assistance'
    elif number > 8550 and number < 8600:
        industries[number] = 'Arts, Entertainment, and Recreation'
    elif number > 8650 and number < 8700:
        industries[number] = 'Accommodation and Food Services'
    elif number > 8760 and number < 9300:
        industries[number] = 'Other Services, Except Public Administration'
    elif number > 9360 and number < 9600:
        industries[number] = 'Public Administration'
    #Upper bound keeps 9920 (unemployed, no recent work experience) out of the military category
    elif number > 9660 and number < 9920:
        industries[number] = 'Active Duty Military'
        
series = master.IND.map(industries)
master['IND_CAT'] = series

-------------------

In [20]:
#Sorting columns alphabetically
cols = list(master.columns)
cols.sort()
master = master[cols]
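
pandas can also sort the column labels directly: for string column names, `DataFrame.sort_index(axis=1)` should give the same ordering as the list-based cell above. A sketch on a toy frame (the columns and values are just examples):

```python
import pandas as pd

# Toy frame with columns deliberately out of alphabetical order
df = pd.DataFrame({'SEX': ['Male'], 'AGE': [30], 'PERWT': [101.0]})

# axis=1 sorts by column labels rather than by the row index
df_sorted = df.sort_index(axis=1)
```

In the notebook this would read `master = master.sort_index(axis=1)`.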

In [21]:
#Exporting data to csv
master.to_csv('ipums_clean.csv', encoding='utf-8', index=False)