# Investigation of California Socioeconomic Relations Dataset

This contains the chapter on how we initially manipulated and parsed the dataset

- [Requirements](#library-imports)
- [Introduction](#intro)
- [Data processing](#data-processing)

## Importing required libraries<a class="anchor" id="library-imports"></a>

In [3]:
# Standard python packages
import os
import sys

# Other package imports
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

## Introduction<a class="anchor" id="intro"></a>

We began the task by looking at the BG_METADATA_2016 csv file as well as the various other csv files and checking out what sort of different areas we were interested to find a correlation between. We have decided to investigate factors affecting educational attainment. 

The possible factors we considered are:
 - Wealth
 - Household structure
 - Sex
 - Race

## Obtain and process data<a class="anchor" id="data-processing"></a>

We used pandas and dictionaries available in python in order to map the long column names for each csv into a more readable table so we could identify the different pieces of data

In [4]:
metadata = pd.read_csv("../data/raw/california/train/BG_METADATA_2016.csv")

In [5]:
def replace_columns(df):
    labels = pd.Series(metadata["Full_Name"].values,index=metadata["Short_Name"]).to_dict()
    df = df.rename(columns=labels)
    return df

In [6]:
dfs = {}

path = "../data/raw/california/train/"
for f in os.listdir(path):
    print(f)
    df = pd.read_csv(path+f)
    df = replace_columns(df)
    dfs[f] = df

X02_RACE.csv
X99_IMPUTATION.csv
X00_COUNTS.csv
X20_EARNINGS.csv
X01_AGE_AND_SEX.csv
X03_HISPANIC_OR_LATINO_ORIGIN.csv
X21_VETERAN_STATUS.csv
X17_POVERTY.csv
X12_MARITAL_STATUS_AND_HISTORY.csv
X16_LANGUAGE_SPOKEN_AT_HOME.csv
X22_FOOD_STAMPS.csv
X08_COMMUTING.csv
X09_CHILDREN_HOUSEHOLD_RELATIONSHIP.csv
X27_HEALTH_INSURANCE.csv
X11_HOUSEHOLD_FAMILY_SUBFAMILIES.csv
BG_METADATA_2016.csv
X19_INCOME.csv
X23_EMPLOYMENT_STATUS.csv
X14_SCHOOL_ENROLLMENT.csv
X15_EDUCATIONAL_ATTAINMENT.csv
X07_MIGRATION.csv


In [15]:
## We have identified an overall response variable 
dfs['X15_EDUCATIONAL_ATTAINMENT.csv']["EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER: Bachelor's degree: Population 25 years and over -- (Estimate)"]

0        781
1        359
2        198
3        429
4        399
5        177
6        327
7        452
8        331
9        340
10       786
11       124
12       143
13       257
14       150
15       188
16       237
17       218
18       210
19       349
20       222
21       261
22       235
23       263
24       157
25       100
26       226
27       151
28       100
29       143
        ... 
18968    237
18969     21
18970    352
18971    397
18972    157
18973    278
18974      0
18975    620
18976    281
18977    294
18978    339
18979    303
18980    273
18981    424
18982    364
18983    238
18984    281
18985    265
18986    296
18987    621
18988     96
18989    247
18990    330
18991    182
18992    340
18993    588
18994    446
18995    620
18996    127
18997    327
Name: EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER: Bachelor's degree: Population 25 years and over -- (Estimate), Length: 18998, dtype: int64

## Remove superfluous data

Some columns are duplicated. quick lil script to remove these

In [None]:
for (_, df) in dfs.items():
    df = df.loc[:, ~df.columns.duplicated()]

## Cleaning the data

### Incomplete Data

We now check the missing data and remove it from the dataset we're going to use. We do this by looking for NaN values. 

In [12]:
## For each csv file
for (name, df) in dfs.items():
    # Find rows and columns that contain a NaN value
    narows = df[df.isnull().any(axis=1)]
    nacols = df.columns[df.isna().any()].tolist()
    # Calculate the percentage of rows and columns that contain a NaN value
    percent_narows = round(len(narows)/df.shape[0]*100, 1)
    percent_nacols = round(len(nacols)/df.shape[1]*100, 1)
    print("Percent missing data for {} : rows={}% columns={}%".format(name, percent_narows, percent_nacols))
    
    

Percent missing data for X08_COMMUTING.csv : rows=100.0% columns=7.5%
Percent missing data for X12_MARITAL_STATUS_AND_HISTORY.csv : rows=0.0% columns=0.0%
Percent missing data for X15_EDUCATIONAL_ATTAINMENT.csv : rows=0.0% columns=0.0%
Percent missing data for X14_SCHOOL_ENROLLMENT.csv : rows=0.0% columns=0.0%
Percent missing data for X23_EMPLOYMENT_STATUS.csv : rows=0.0% columns=0.0%
Percent missing data for X02_RACE.csv : rows=0.0% columns=0.0%
Percent missing data for X21_VETERAN_STATUS.csv : rows=0.0% columns=0.0%
Percent missing data for X16_LANGUAGE_SPOKEN_AT_HOME.csv : rows=0.0% columns=0.0%
Percent missing data for BG_METADATA_2016.csv : rows=0.0% columns=0.0%
Percent missing data for X99_IMPUTATION.csv : rows=100.0% columns=2.5%
Percent missing data for X07_MIGRATION.csv : rows=100.0% columns=50.3%
Percent missing data for X17_POVERTY.csv : rows=98.7% columns=3.3%
Percent missing data for X19_INCOME.csv : rows=100.0% columns=26.3%
Percent missing data for X20_EARNINGS.csv : ro

In [9]:
def drop_na_cols(df, threshold):
    columns = df.columns
    for col in columns:
        narows = df[col][df[col].isnull()]
        percent_narows = round(len(narows)/df[col].shape[0]*100, 1)
        if percent_narows > 20:
            df = df.drop(col, axis=1)
            print("yeet {}".format(col))
    return df

In [None]:
for (_, df) in dfs.items():
    ## loop through each column and  
    drop_na_cols(df, 20)

In [13]:
drop_na_cols(dfs["X17_POVERTY.csv"], 20)

yeet AGGREGATE INCOME DEFICIT (DOLLARS) IN THE PAST 12 MONTHS FOR FAMILIES BY FAMILY TYPE: Total: Families with income in the past 12 months below the poverty level -- (Estimate)
yeet AGGREGATE INCOME DEFICIT (DOLLARS) IN THE PAST 12 MONTHS FOR FAMILIES BY FAMILY TYPE: Total: Families with income in the past 12 months below the poverty level -- (Margin of Error)
yeet AGGREGATE INCOME DEFICIT (DOLLARS) IN THE PAST 12 MONTHS FOR FAMILIES BY FAMILY TYPE: Married-couple family: Families with income in the past 12 months below the poverty level -- (Estimate)
yeet AGGREGATE INCOME DEFICIT (DOLLARS) IN THE PAST 12 MONTHS FOR FAMILIES BY FAMILY TYPE: Married-couple family: Families with income in the past 12 months below the poverty level -- (Margin of Error)
yeet AGGREGATE INCOME DEFICIT (DOLLARS) IN THE PAST 12 MONTHS FOR FAMILIES BY FAMILY TYPE: Other family: Families with income in the past 12 months below the poverty level -- (Estimate)
yeet AGGREGATE INCOME DEFICIT (DOLLARS) IN THE PAST 

Unnamed: 0.1,Unnamed: 0,GEOID,RATIO OF INCOME TO POVERTY LEVEL IN THE PAST 12 MONTHS: Total: Population for whom poverty status is determined -- (Estimate),RATIO OF INCOME TO POVERTY LEVEL IN THE PAST 12 MONTHS: Total: Population for whom poverty status is determined -- (Margin of Error),RATIO OF INCOME TO POVERTY LEVEL IN THE PAST 12 MONTHS: Under .50: Population for whom poverty status is determined -- (Estimate),RATIO OF INCOME TO POVERTY LEVEL IN THE PAST 12 MONTHS: Under .50: Population for whom poverty status is determined -- (Margin of Error),RATIO OF INCOME TO POVERTY LEVEL IN THE PAST 12 MONTHS: .50 to .99: Population for whom poverty status is determined -- (Estimate),RATIO OF INCOME TO POVERTY LEVEL IN THE PAST 12 MONTHS: .50 to .99: Population for whom poverty status is determined -- (Margin of Error),RATIO OF INCOME TO POVERTY LEVEL IN THE PAST 12 MONTHS: 1.00 to 1.24: Population for whom poverty status is determined -- (Estimate),RATIO OF INCOME TO POVERTY LEVEL IN THE PAST 12 MONTHS: 1.00 to 1.24: Population for whom poverty status is determined -- (Margin of Error),...,POVERTY STATUS OF INDIVIDUALS IN THE PAST 12 MONTHS BY LIVING ARRANGEMENT: Income in the past 12 months at or above poverty level: In non-family households and other living arrangement: Population for whom poverty status is determined -- (Margin of Error),POVERTY STATUS OF INDIVIDUALS IN THE PAST 12 MONTHS BY LIVING ARRANGEMENT: Income in the past 12 months at or above poverty level: In non-family households and other living arrangement: Householder: Population for whom poverty status is determined -- (Estimate),POVERTY STATUS OF INDIVIDUALS IN THE PAST 12 MONTHS BY LIVING ARRANGEMENT: Income in the past 12 months at or above poverty level: In non-family households and other living arrangement: Householder: Population for whom poverty status is determined -- (Margin of Error),"POVERTY STATUS OF INDIVIDUALS IN THE PAST 12 MONTHS BY LIVING ARRANGEMENT: Income in the past 12 months at or above poverty level: In non-family households and other living arrangement: Householder: Female householder, no husband present: Living alone: Population for whom poverty status is determined -- (Estimate)","POVERTY STATUS OF INDIVIDUALS IN THE PAST 12 MONTHS BY LIVING ARRANGEMENT: Income in the past 12 months at or above poverty level: In non-family households and other living arrangement: Householder: Female householder, no husband present: Living alone: Population for whom poverty status is determined -- (Margin of Error)","POVERTY STATUS OF INDIVIDUALS IN THE PAST 12 MONTHS BY LIVING ARRANGEMENT: Income in the past 12 months at or above poverty level: In non-family households and other living arrangement: Householder: Female householder, no husband present: Not living alone: Population for whom poverty status is determined -- (Estimate)","POVERTY STATUS OF INDIVIDUALS IN THE PAST 12 MONTHS BY LIVING ARRANGEMENT: Income in the past 12 months at or above poverty level: In non-family households and other living arrangement: Householder: Female householder, no husband present: Not living alone: Population for whom poverty status is determined -- (Margin of Error)",POVERTY STATUS OF INDIVIDUALS IN THE PAST 12 MONTHS BY LIVING ARRANGEMENT: Income in the past 12 months at or above poverty level: In non-family households and other living arrangement: Other living arrangement: Population for whom poverty status is determined -- (Estimate),POVERTY STATUS OF INDIVIDUALS IN THE PAST 12 MONTHS BY LIVING ARRANGEMENT: Income in the past 12 months at or above poverty level: In non-family households and other living arrangement: Other living arrangement: Population for whom poverty status is determined -- (Margin of Error),OBJECTID
0,0,15000US060014001001,3011,196,95,70,18,21,0,12,...,154,398,81,247,76,151,61,206,103,3
1,1,15000US060014002001,1105,103,4,7,9,13,20,14,...,66,154,36,88,29,66,28,98,41,4
2,2,15000US060014002002,847,95,66,45,27,27,11,26,...,94,190,50,135,45,55,26,114,67,5
3,3,15000US060014003001,1466,533,13,21,25,30,0,12,...,134,168,75,107,69,61,48,110,89,6
4,4,15000US060014003002,1229,244,43,39,67,76,14,22,...,185,493,135,365,129,128,75,117,89,7
5,5,15000US060014003003,985,259,71,108,108,90,0,12,...,112,256,97,192,97,64,40,60,42,8
6,6,15000US060014003004,1473,303,32,52,91,77,0,12,...,154,212,92,108,76,104,67,123,82,9
7,7,15000US060014004001,1387,209,41,49,37,44,11,16,...,101,339,71,243,71,96,40,159,63,10
8,8,15000US060014004002,1187,236,15,16,66,53,59,58,...,155,256,82,118,52,138,70,146,86,11
9,9,15000US060014004003,1584,270,36,38,73,50,63,53,...,200,210,79,65,40,145,70,235,130,12
