# Investigation of California Socioeconomic Relations Dataset

This notebook contains the chapter on how we initially parsed and manipulated the dataset.

- [Requirements](#library-imports)
- [Introduction](#intro)
- [Data processing](#data-processing)

## Importing required libraries<a class="anchor" id="library-imports"></a>

In [106]:
# Standard python packages
import os
import sys
from pathlib import Path

# Other package imports
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

## Introduction<a class="anchor" id="intro"></a>

We began by looking through `BG_METADATA_2016.csv` and the other CSV files to decide which variables we wanted to look for correlations between. We decided to investigate factors affecting educational attainment.

The possible factors we considered are:
 - Wealth
 - Household structure
 - Sex
 - Race

## Obtain and process data<a class="anchor" id="data-processing"></a>

We used pandas and Python dictionaries to map the coded column names in each CSV to the full descriptive names given in the metadata, so that we could identify what each column contains.

In [107]:
data_folder = Path("../data/")
raw_data_folder = data_folder / "raw" / "train"

metadata = pd.read_csv(raw_data_folder / "BG_METADATA_2016.csv")

In [108]:
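def replace_columns(df):
    # Build a {short code: full descriptive name} lookup from the metadata
    labels = pd.Series(metadata["Full_Name"].values, index=metadata["Short_Name"]).to_dict()
    # Columns without a metadata entry keep their original name
    df = df.rename(columns=labels)
    return df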
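
To make the renaming step concrete, here is a minimal sketch on a toy metadata table. Only the `Short_Name` and `Full_Name` column names come from the metadata file above; the codes `V1` and `V2`, the `id` column and all values are invented for illustration.

```python
import pandas as pd

# Toy metadata: same two column names as the real file, invented rows.
toy_metadata = pd.DataFrame({
    "Short_Name": ["V1", "V2"],
    "Full_Name": ["Total population", "Median age"],
})

# Build a {short code: full name} lookup.
toy_labels = dict(zip(toy_metadata["Short_Name"], toy_metadata["Full_Name"]))

# Hypothetical data table that uses the short codes as column names.
toy_df = pd.DataFrame({"id": ["a", "b"], "V1": [10, 20], "V2": [31.5, 40.2]})

# rename leaves columns that are missing from the lookup (here "id") unchanged.
toy_renamed = toy_df.rename(columns=toy_labels)
print(list(toy_renamed.columns))
```

Because `rename` silently ignores codes it cannot find, a misspelled code in the metadata would leave the coded column name in place rather than raise an error.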

We repeat this for every CSV in the raw data folder.

In [109]:
dfs = {}

# Reuse raw_data_folder from above instead of repeating the path as a string
for f in os.listdir(raw_data_folder):
    print(f)
    df = pd.read_csv(raw_data_folder / f)
    df = replace_columns(df)
    dfs[f] = df

X12_MARITAL_STATUS_AND_HISTORY.csv
X01_AGE_AND_SEX.csv
X07_MIGRATION.csv
X27_HEALTH_INSURANCE.csv
X08_COMMUTING.csv
X19_INCOME.csv
BG_METADATA_2016.csv
X22_FOOD_STAMPS.csv
X03_HISPANIC_OR_LATINO_ORIGIN.csv
X23_EMPLOYMENT_STATUS.csv
X15_EDUCATIONAL_ATTAINMENT.csv
X11_HOUSEHOLD_FAMILY_SUBFAMILIES.csv
X16_LANGUAGE_SPOKEN_AT_HOME.csv
X21_VETERAN_STATUS.csv
X09_CHILDREN_HOUSEHOLD_RELATIONSHIP.csv
X14_SCHOOL_ENROLLMENT.csv
X00_COUNTS.csv
X99_IMPUTATION.csv
X02_RACE.csv
X20_EARNINGS.csv
X17_POVERTY.csv
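
As an aside, the same load-everything loop can be sketched with `pathlib` alone. This is an alternative sketch, not the code used in this notebook; the temporary folder and the two file names below are stand-ins created only so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Two throwaway CSVs standing in for the raw files.
    pd.DataFrame({"x": [1, 2]}).to_csv(folder / "B_TABLE.csv", index=False)
    pd.DataFrame({"y": [3, 4]}).to_csv(folder / "A_TABLE.csv", index=False)

    # glob("*.csv") skips non-CSV files; sorted() makes the load order deterministic.
    toy_dfs = {p.name: pd.read_csv(p) for p in sorted(folder.glob("*.csv"))}

print(list(toy_dfs))
```

Sorting matters only for reproducibility: `os.listdir` returns files in arbitrary order, which is why the listing above is not alphabetical.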


In [110]:
# Our overall response variable: the estimated number of people aged 25 and over with a bachelor's degree
dfs['X15_EDUCATIONAL_ATTAINMENT.csv']["EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER: Bachelor's degree: Population 25 years and over -- (Estimate)"]

0        781
1        359
2        198
3        429
4        399
5        177
6        327
7        452
8        331
9        340
10       786
11       124
12       143
13       257
14       150
15       188
16       237
17       218
18       210
19       349
20       222
21       261
22       235
23       263
24       157
25       100
26       226
27       151
28       100
29       143
        ... 
18968    237
18969     21
18970    352
18971    397
18972    157
18973    278
18974      0
18975    620
18976    281
18977    294
18978    339
18979    303
18980    273
18981    424
18982    364
18983    238
18984    281
18985    265
18986    296
18987    621
18988     96
18989    247
18990    330
18991    182
18992    340
18993    588
18994    446
18995    620
18996    127
18997    327
Name: EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER: Bachelor's degree: Population 25 years and over -- (Estimate), Length: 18998, dtype: int64

## Remove superfluous data

Some columns are duplicated, so we keep only the first occurrence of each.

In [111]:
for (name, df) in dfs.items():
    # Keep the first occurrence of each column label and store the result back in dfs
    dfs[name] = df.loc[:, ~df.columns.duplicated()]
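
The duplicate-column filter can be checked on a toy frame. This is a small sketch with invented labels and values; it only illustrates how `columns.duplicated()` behaves.

```python
import pandas as pd

# Toy frame with a repeated column label (values invented for illustration).
toy = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

# duplicated() marks every repeat of a label after its first occurrence.
mask = toy.columns.duplicated()
print(mask.tolist())

# Selecting with the inverted mask keeps only the first "a".
deduped = toy.loc[:, ~mask]
print(list(deduped.columns))
```

Note that `.loc` returns a new frame, so the result has to be stored somewhere (for example back into the dictionary) for the removal to take effect.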

## Cleaning the data

### Incomplete Data

We now check for missing data, which appears as NaN values, and remove it from the dataset we are going to use.

In [112]:
def percent_na(df):
    # Note: `name` is not a parameter; it is read from the loop variable of the calling cell
    # Find rows and columns that contain at least one NaN value
    narows = df[df.isna().any(axis=1)]
    nacols = df.columns[df.isna().any()].tolist()
    # Calculate the percentage of rows and columns that contain a NaN value
    percent_narows = round(len(narows)/df.shape[0]*100, 1)
    percent_nacols = round(len(nacols)/df.shape[1]*100, 1)
    print("Percent missing data for {} : rows={}% columns={}%".format(name, percent_narows, percent_nacols))
    
for (name, df) in dfs.items():
    percent_na(df)
    

Percent missing data for X12_MARITAL_STATUS_AND_HISTORY.csv : rows=0.0% columns=0.0%
Percent missing data for X01_AGE_AND_SEX.csv : rows=100.0% columns=36.8%
Percent missing data for X07_MIGRATION.csv : rows=100.0% columns=50.3%
Percent missing data for X27_HEALTH_INSURANCE.csv : rows=0.0% columns=0.0%
Percent missing data for X08_COMMUTING.csv : rows=100.0% columns=7.5%
Percent missing data for X19_INCOME.csv : rows=100.0% columns=26.3%
Percent missing data for BG_METADATA_2016.csv : rows=0.0% columns=0.0%
Percent missing data for X22_FOOD_STAMPS.csv : rows=0.0% columns=0.0%
Percent missing data for X03_HISPANIC_OR_LATINO_ORIGIN.csv : rows=0.0% columns=0.0%
Percent missing data for X23_EMPLOYMENT_STATUS.csv : rows=0.0% columns=0.0%
Percent missing data for X15_EDUCATIONAL_ATTAINMENT.csv : rows=0.0% columns=0.0%
Percent missing data for X11_HOUSEHOLD_FAMILY_SUBFAMILIES.csv : rows=0.0% columns=0.0%
Percent missing data for X16_LANGUAGE_SPOKEN_AT_HOME.csv : rows=0.0% columns=0.0%
...
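
The two percentages measure different things, which is why a file can report `rows=100.0%` with a much smaller column figure: a single column that is entirely NaN makes every row count as incomplete. A minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

# Toy frame: "b" is entirely missing, "c" has one gap, "a" is complete.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": [1.0, np.nan, 3.0, 4.0],
})

# Share of rows with at least one NaN: all 4 rows, because of column "b".
row_pct = round(toy.isna().any(axis=1).mean() * 100, 1)
# Share of columns with at least one NaN: 2 of 3 columns.
col_pct = round(toy.isna().any(axis=0).mean() * 100, 1)
print(row_pct, col_pct)
```

This suggests dropping mostly empty columns before judging how many rows are affected.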

In [113]:
dfs_no_na = {}
for (name, df) in dfs.items():
    # Keep only columns in which at least 80% of the values are present
    threshold = len(df)*0.8
    dfs_no_na[name] = df.dropna(thresh=threshold, axis=1)
    print(f"{name} columns dropped : {dfs[name].shape[1] - dfs_no_na[name].shape[1]}")
    
del dfs

X12_MARITAL_STATUS_AND_HISTORY.csv columns dropped : 0
X01_AGE_AND_SEX.csv columns dropped : 36
X07_MIGRATION.csv columns dropped : 82
X27_HEALTH_INSURANCE.csv columns dropped : 0
X08_COMMUTING.csv columns dropped : 44
X19_INCOME.csv columns dropped : 61
BG_METADATA_2016.csv columns dropped : 0
X22_FOOD_STAMPS.csv columns dropped : 0
X03_HISPANIC_OR_LATINO_ORIGIN.csv columns dropped : 0
X23_EMPLOYMENT_STATUS.csv columns dropped : 0
X15_EDUCATIONAL_ATTAINMENT.csv columns dropped : 0
X11_HOUSEHOLD_FAMILY_SUBFAMILIES.csv columns dropped : 0
X16_LANGUAGE_SPOKEN_AT_HOME.csv columns dropped : 0
X21_VETERAN_STATUS.csv columns dropped : 0
X09_CHILDREN_HOUSEHOLD_RELATIONSHIP.csv columns dropped : 0
X14_SCHOOL_ENROLLMENT.csv columns dropped : 0
X00_COUNTS.csv columns dropped : 0
X99_IMPUTATION.csv columns dropped : 14
X02_RACE.csv columns dropped : 0
X20_EARNINGS.csv columns dropped : 0
X17_POVERTY.csv columns dropped : 10
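
The `thresh` argument of `dropna` is a minimum count of non-missing values, not a maximum count of missing ones. The sketch below shows the 80% rule on a toy frame; the column names and values are invented, and the threshold is cast to `int` here as a precaution.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_full": [1.0, 2.0, 3.0, 4.0, np.nan],             # 4 of 5 present
    "mostly_empty": [1.0, np.nan, np.nan, np.nan, np.nan],   # 1 of 5 present
    "full": [1.0, 2.0, 3.0, 4.0, 5.0],                       # 5 of 5 present
})

# A column survives only if it has at least this many non-missing values.
min_non_missing = int(len(toy) * 0.8)
kept = toy.dropna(thresh=min_non_missing, axis=1)
print(list(kept.columns))
```

Columns that pass the threshold can still contain some NaN values, which is why a second check is needed after this step.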


In [114]:
for (name, df) in dfs_no_na.items():
    percent_na(df)

Percent missing data for X12_MARITAL_STATUS_AND_HISTORY.csv : rows=0.0% columns=0.0%
Percent missing data for X01_AGE_AND_SEX.csv : rows=28.4% columns=18.9%
Percent missing data for X07_MIGRATION.csv : rows=0.0% columns=0.0%
Percent missing data for X27_HEALTH_INSURANCE.csv : rows=0.0% columns=0.0%
Percent missing data for X08_COMMUTING.csv : rows=0.0% columns=0.0%
Percent missing data for X19_INCOME.csv : rows=57.0% columns=13.2%
Percent missing data for BG_METADATA_2016.csv : rows=0.0% columns=0.0%
Percent missing data for X22_FOOD_STAMPS.csv : rows=0.0% columns=0.0%
Percent missing data for X03_HISPANIC_OR_LATINO_ORIGIN.csv : rows=0.0% columns=0.0%
Percent missing data for X23_EMPLOYMENT_STATUS.csv : rows=0.0% columns=0.0%
Percent missing data for X15_EDUCATIONAL_ATTAINMENT.csv : rows=0.0% columns=0.0%
Percent missing data for X11_HOUSEHOLD_FAMILY_SUBFAMILIES.csv : rows=0.0% columns=0.0%
Percent missing data for X16_LANGUAGE_SPOKEN_AT_HOME.csv : rows=0.0% columns=0.0%
...

## Imputation
We had to choose between single and multiple imputation.

In the interest of time, we use single imputation with the column mean. It is very fast, but it reduces the variance of the dataset.

In [115]:
for (name, df) in dfs_no_na.items():
    # Fill each NaN with its column mean; numeric_only skips any text columns
    dfs_no_na[name] = df.fillna(df.mean(numeric_only=True))
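
The variance reduction caused by mean imputation can be seen on a small invented series: the filled values sit exactly at the mean, so they add observations without adding any spread.

```python
import numpy as np
import pandas as pd

# Toy series with two missing values (numbers invented for illustration).
s = pd.Series([2.0, 4.0, np.nan, 6.0, np.nan, 8.0])

filled = s.fillna(s.mean())

# pandas skips NaN by default, so var_before uses the 4 observed values only.
var_before = s.var()
var_after = filled.var()

print(s.mean(), filled.mean())   # the mean is unchanged by the fill
print(var_before, var_after)     # the variance shrinks after the fill
```

The sum of squared deviations stays the same while the number of observations grows, so the sample variance falls. The more values are imputed, the stronger this effect.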

In [116]:
for (name, df) in dfs_no_na.items():
    percent_na(df)

Percent missing data for X12_MARITAL_STATUS_AND_HISTORY.csv : rows=0.0% columns=0.0%
Percent missing data for X01_AGE_AND_SEX.csv : rows=0.0% columns=0.0%
Percent missing data for X07_MIGRATION.csv : rows=0.0% columns=0.0%
Percent missing data for X27_HEALTH_INSURANCE.csv : rows=0.0% columns=0.0%
Percent missing data for X08_COMMUTING.csv : rows=0.0% columns=0.0%
Percent missing data for X19_INCOME.csv : rows=0.0% columns=0.0%
Percent missing data for BG_METADATA_2016.csv : rows=0.0% columns=0.0%
Percent missing data for X22_FOOD_STAMPS.csv : rows=0.0% columns=0.0%
Percent missing data for X03_HISPANIC_OR_LATINO_ORIGIN.csv : rows=0.0% columns=0.0%
Percent missing data for X23_EMPLOYMENT_STATUS.csv : rows=0.0% columns=0.0%
Percent missing data for X15_EDUCATIONAL_ATTAINMENT.csv : rows=0.0% columns=0.0%
Percent missing data for X11_HOUSEHOLD_FAMILY_SUBFAMILIES.csv : rows=0.0% columns=0.0%
Percent missing data for X16_LANGUAGE_SPOKEN_AT_HOME.csv : rows=0.0% columns=0.0%
...

In [117]:
for (name, df) in dfs_no_na.items():
    df.to_csv(data_folder / "processed" / name)
    print(f"Saved {name}")

Saved X12_MARITAL_STATUS_AND_HISTORY.csv
Saved X01_AGE_AND_SEX.csv
Saved X07_MIGRATION.csv
Saved X27_HEALTH_INSURANCE.csv
Saved X08_COMMUTING.csv
Saved X19_INCOME.csv
Saved BG_METADATA_2016.csv
Saved X22_FOOD_STAMPS.csv
Saved X03_HISPANIC_OR_LATINO_ORIGIN.csv
Saved X23_EMPLOYMENT_STATUS.csv
Saved X15_EDUCATIONAL_ATTAINMENT.csv
Saved X11_HOUSEHOLD_FAMILY_SUBFAMILIES.csv
Saved X16_LANGUAGE_SPOKEN_AT_HOME.csv
Saved X21_VETERAN_STATUS.csv
Saved X09_CHILDREN_HOUSEHOLD_RELATIONSHIP.csv
Saved X14_SCHOOL_ENROLLMENT.csv
Saved X00_COUNTS.csv
Saved X99_IMPUTATION.csv
Saved X02_RACE.csv
Saved X20_EARNINGS.csv
Saved X17_POVERTY.csv
