# Census Data Types Transformation

In this project we use pandas to clean, organise and prepare recently collected census data for later use by machine learning algorithms.

The dataset contains the following variables:

- `first_name`: the respondent’s first name.
- `last_name`: the respondent’s last name.
- `birth_year`: the respondent’s year of birth.
- `voted`: whether the respondent participated in the current voting cycle.
- `num_children`: the number of children the respondent has.
- `income_year`: the average yearly income the respondent earns.
- `higher_tax`: the respondent’s answer to the question: “Rate your agreement with the statement: the wealthy should pay higher taxes.”
- `marital_status`: the respondent’s current marital status.


## Assessing Variable Types

The `census` dataframe contains simulated census data representing the demographics of a small community in the U.S.

Let's import the pandas library and load the CSV file into a dataframe.

In [19]:
import pandas as pd

# Read in the census dataframe
census = pd.read_csv('census_data.csv', index_col=0)

Let's investigate the data more closely.

In [20]:
census.head(5)

Unnamed: 0,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_status
0,Denise,Ratke,2005,False,0,92129.41,disagree,single
1,Hali,Cummerata,1987,False,0,75649.17,neutral,divorced
2,Salomon,Orn,1992,True,2,166313.45,agree,single
3,Sarina,Schiller,1965,False,2,71704.81,strongly agree,married
4,Gust,Abernathy,1945,False,2,143316.08,agree,married


Let's compare the data types with the values returned by `.head()`. This is an important step in understanding what preprocessing the data will need.

In [21]:
census.dtypes

first_name         object
last_name          object
birth_year         object
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object

Data types explanation:

- `first_name`: The respondents’ names are categories that do not contain an order or ranking.
- `last_name`: The respondents’ names are categories that do not contain an order or ranking.
- `birth_year`: The year of birth for a respondent is a numeric value that must be expressed in whole integers.
- `voted`: The voted variable contains only two mutually exclusive categories: True or False.
- `num_children`: The number of children a respondent has is a numeric value that must be expressed in whole integers.
- `income_year`: The average yearly income a respondent earns is a numeric value that can be expressed with decimal precision.
- `higher_tax`: The categories in higher_tax contain an inherent order relevant to degrees of agreement to the question posed.
- `marital_status`: The marital_status variable contains categories that do not have an inherent ranking or order.
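
A quick way to spot a column whose stored type disagrees with its intended type, such as numbers kept as text, is to attempt a numeric conversion and count how many values survive. The sketch below is a minimal illustration on a made-up three-row sample; the `sample` frame and the `numeric_counts` name are assumptions for this example, not part of the census file.

```python
import pandas as pd

# Hypothetical sample that mimics three of the census columns
sample = pd.DataFrame({
    "birth_year": ["2005", "1987", "missing"],
    "num_children": [0, 0, 2],
    "higher_tax": ["disagree", "neutral", "agree"],
})

# For every non-numeric column, count the values that parse as numbers
numeric_counts = {}
for col in sample.select_dtypes(exclude="number").columns:
    parsed = pd.to_numeric(sample[col], errors="coerce")
    numeric_counts[col] = int(parsed.notna().sum())

print(numeric_counts)
```

A column where most values parse, like `birth_year` here, is a candidate for numeric conversion; a column where none parse, like `higher_tax`, is more likely categorical.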

## Altering Data

The manager of the census would like to know the average birth year of the respondents. We saw from `.dtypes` that `birth_year` has been assigned the `object` data type (its values are stored as strings), whereas it should be stored as `int`.

Let's print the unique values of the `birth_year`.

In [22]:
census.birth_year.unique()

array(['2005', '1987', '1992', '1965', '1945', '1951', '1963', '1949',
       '1950', '1971', '2007', '1944', '1995', '1973', '1946', '1954',
       '1994', '1989', '1947', '1993', '1976', '1984', 'missing', '1966',
       '1941', '2000', '1953', '1956', '1960', '2001', '1980', '1955',
       '1985', '1996', '1968', '1979', '2006', '1962', '1981', '1959',
       '1977', '1978', '1983', '1957', '1961', '1982', '2002', '1998',
       '1999', '1952', '1940', '1986', '1958'], dtype=object)

There appears to be a missing value in the `birth_year` column, recorded as the string `'missing'`. After some research, we find that this respondent's birth year is 1967.

We will replace the missing value with 1967 so that we can then change the data type to `int` and recheck the values in `birth_year`.

In [23]:
# Replace data point
census.birth_year = census.birth_year.replace('missing', 1967)
# Change data type to int
census.birth_year = census.birth_year.astype('int')
# Check unique data points
census.birth_year.unique()

array([2005, 1987, 1992, 1965, 1945, 1951, 1963, 1949, 1950, 1971, 2007,
       1944, 1995, 1973, 1946, 1954, 1994, 1989, 1947, 1993, 1976, 1984,
       1967, 1966, 1941, 2000, 1953, 1956, 1960, 2001, 1980, 1955, 1985,
       1996, 1968, 1979, 2006, 1962, 1981, 1959, 1977, 1978, 1983, 1957,
       1961, 1982, 2002, 1998, 1999, 1952, 1940, 1986, 1958])
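
An alternative to replacing the placeholder by hand is to let `pd.to_numeric` turn anything unparseable into `NaN` and then fill the gap. The sketch below shows this on a made-up four-value Series. The value 1967 comes from the text above; the sample data and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical slice of the birth_year column, still stored as strings
birth_year = pd.Series(["2005", "1987", "missing", "1966"], name="birth_year")

# Anything that is not a number (here the string 'missing') becomes NaN
as_numbers = pd.to_numeric(birth_year, errors="coerce")

# Fill the gap with the researched value, then cast to whole numbers
cleaned = as_numbers.fillna(1967).astype("int64")
print(cleaned.tolist())
```

Casting to `int64` explicitly, rather than to `int`, avoids depending on the platform's default integer width, which can differ between systems.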

In [24]:
census.dtypes

first_name         object
last_name          object
birth_year          int32
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object

Let's find some statistics about the birth years of the respondents within the census. 

In [25]:
census.birth_year.describe()

count     100.000000
mean     1973.400000
std        20.102264
min      1940.000000
25%      1955.000000
50%      1972.000000
75%      1992.000000
max      2007.000000
Name: birth_year, dtype: float64

The local community said it would help if the responses in the `higher_tax` variable were stored with their natural order, `strongly disagree` < `disagree` < `neutral` < `agree` < `strongly agree`, rather than as plain unordered strings.

Let's convert the `higher_tax` variable to the category data type with the appropriate order, then print it.

In [26]:
census.higher_tax = pd.Categorical(census.higher_tax, ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree'], ordered = True)
census.higher_tax.unique()

['disagree', 'neutral', 'agree', 'strongly agree', 'strongly disagree']
Categories (5, object): ['strongly disagree' < 'disagree' < 'neutral' < 'agree' < 'strongly agree']
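
Once the categories are ordered, comparisons between responses become meaningful. The following is a minimal sketch on a made-up Series (the `answers` data is an assumption, while the category order is the one used above) showing a comparison against a level and the use of `min` and `max`.

```python
import pandas as pd

levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Hypothetical responses stored as an ordered categorical
answers = pd.Series(
    pd.Categorical(
        ["disagree", "neutral", "agree", "strongly agree"],
        categories=levels,
        ordered=True,
    )
)

# Comparisons follow the category order, not alphabetical order
at_least_agree = answers >= "agree"
print(at_least_agree.tolist())

# min and max are defined only because the categorical is ordered
print(answers.min(), answers.max())
```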

The community manager would also like to know the median sentiment of the respondents on the issue of higher taxes for the wealthy. 

So we are going to label encode the `higher_tax` variable and print the median using the pandas `.median()` method.

In [27]:
census['higher_tax_encoded'] = census.higher_tax.cat.codes
census.higher_tax_encoded.median()

2.0
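
A median code of 2.0 is easier to report once it is translated back into its label. The sketch below uses made-up responses, and the `median_label` name is an assumption. It decodes the median only when it is a whole number, since a median that falls between two codes has no single label.

```python
import pandas as pd

levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Hypothetical responses; codes follow the position of each label in `levels`
answers = pd.Categorical(
    ["disagree", "neutral", "agree", "strongly agree", "neutral"],
    categories=levels,
    ordered=True,
)

median_code = pd.Series(answers.codes).median()

# Decode only when the median lands exactly on one category
median_label = None
if float(median_code).is_integer():
    median_label = answers.categories[int(median_code)]

print(median_code, median_label)
```

With the category order used in this project, code 2 corresponds to `neutral`.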

Now let's prepare the dataset for future use with machine learning models.

We'll **one-hot encode** `marital_status` to create a binary variable for each category, then print the result to check. This technique is useful for nominal variables because it encodes the variable without creating an order among the categories.

In [28]:
census = pd.get_dummies(data = census, columns=['marital_status'])
census.head(5)

Unnamed: 0,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,higher_tax_encoded,marital_status_divorced,marital_status_married,marital_status_single,marital_status_widowed
0,Denise,Ratke,2005,False,0,92129.41,disagree,1,0,0,1,0
1,Hali,Cummerata,1987,False,0,75649.17,neutral,2,1,0,0,0
2,Salomon,Orn,1992,True,2,166313.45,agree,3,0,0,1,0
3,Sarina,Schiller,1965,False,2,71704.81,strongly agree,4,0,1,0,0
4,Gust,Abernathy,1945,False,2,143316.08,agree,3,0,1,0,0


In [31]:
census.dtypes

first_name                   object
last_name                    object
birth_year                    int32
voted                          bool
num_children                  int64
income_year                 float64
higher_tax                 category
higher_tax_encoded             int8
marital_status_divorced       uint8
marital_status_married        uint8
marital_status_single         uint8
marital_status_widowed        uint8
dtype: object

As we can see, the new `marital_status_*` columns have the `uint8` data type and hold only the values 0 and 1.
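
The data type of the dummy columns depends on the pandas version: recent releases return `bool` by default, whereas the output above shows `uint8`. The sketch below uses a made-up sample to request the integer type explicitly through the `dtype` parameter of `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical sample of the marital_status column
sample = pd.DataFrame(
    {"marital_status": ["single", "divorced", "married", "widowed", "single"]}
)

# Ask for 0/1 integers regardless of the installed pandas version
dummies = pd.get_dummies(sample, columns=["marital_status"], dtype="uint8")

print(dummies.dtypes)
print(dummies.sum(axis=1).tolist())  # each row has exactly one active column
```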

We may want to use machine learning to predict whether a respondent thinks the wealthy should pay higher taxes based on the respondent's age group. Let's create a new variable called `age_group`, which groups respondents by age, computed from their birth year. Each group spans five years and is labelled by its lower bound, e.g., 15 covers ages 15-19, 20 covers ages 20-24, etc.

In [67]:
# Import library to get current year
from datetime import date
# Get current year
current_year = date.today().year

# Group respondents based on their age
census['age_group'] = (((current_year - census['birth_year']) // 5) * 5).astype('category')
census.age_group

0     15
1     35
2     30
3     55
4     75
      ..
95    60
96    20
97    35
98    35
99    60
Name: age_group, Length: 100, dtype: category
Categories (14, int64): [15, 20, 25, 30, ..., 65, 70, 75, 80]
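
If readable labels such as `15-19` are preferred over the lower bound of each bucket, `pd.cut` offers one way to build them. This is a sketch under assumptions: the `ages` values are made up, and the bin edges from 15 to 85 are chosen only to resemble the output above.

```python
import pandas as pd

# Hypothetical ages; in the notebook these would be current_year - birth_year
ages = pd.Series([18, 36, 31, 58, 78], name="age")

# Edges 15, 20, ..., 85 give fourteen five-year buckets
edges = list(range(15, 90, 5))
labels = [f"{low}-{low + 4}" for low in edges[:-1]]

# right=False makes each bucket include its lower edge and exclude its upper edge
age_group = pd.cut(ages, bins=edges, labels=labels, right=False)
print(age_group.tolist())
```

The result is an ordered categorical, so `.cat.codes` can still be used for label encoding. Because the bins are declared up front, empty buckets keep their position in the encoding.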

Next, we **label encode** the `age_group` variable, because its categories are ordinal and equally spaced.

In [70]:
census.age_group = census.age_group.cat.codes
census

Unnamed: 0,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,higher_tax_encoded,marital_status_divorced,marital_status_married,marital_status_single,marital_status_widowed,age_group
0,Denise,Ratke,2005,False,0,92129.41,disagree,1,0,0,1,0,0
1,Hali,Cummerata,1987,False,0,75649.17,neutral,2,1,0,0,0,4
2,Salomon,Orn,1992,True,2,166313.45,agree,3,0,0,1,0,3
3,Sarina,Schiller,1965,False,2,71704.81,strongly agree,4,0,1,0,0,8
4,Gust,Abernathy,1945,False,2,143316.08,agree,3,0,1,0,0,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Carisa,Hills,1958,False,3,157117.14,agree,3,0,1,0,0,9
96,Tameka,Collins,2001,False,1,61518.34,strongly disagree,0,0,0,1,0,1
97,Adams,Leuschke,1987,False,0,41784.87,strongly agree,4,0,0,1,0,4
98,Earnestine,Gutmann,1985,True,4,79021.46,disagree,1,0,0,0,1,4
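
Note that `.cat.codes` numbers only the categories that actually occur, so the codes are evenly spaced only when every five-year bucket is present in the data. The sketch below, on made-up bucket values, shows the spacing being lost when buckets are missing, along with one possible alternative that preserves it. The names `raw`, `codes` and `spaced` are illustrative.

```python
import pandas as pd

# Hypothetical bucket lower bounds; the 20 and 25 buckets do not occur
raw = pd.Series([15, 35, 30, 15])

# cat.codes gives the position of each value among the observed categories
codes = raw.astype("category").cat.codes
print(codes.tolist())

# Dividing by the bucket width keeps the empty buckets visible in the encoding
spaced = (raw - raw.min()) // 5
print(spaced.tolist())
```

Here the jump from 15 to 30 becomes a single step in `codes` but three steps in `spaced`.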


## Conclusion

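We converted `birth_year` to an integer type, turned `higher_tax` into an ordered category and label encoded it, one-hot encoded `marital_status`, and derived a label-encoded `age_group`. The census data is now prepared for further work with machine learning algorithms.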