## This workbook highlights data inspection and cleaning to help understand statistics better. 

In [68]:
import numpy as np
import pandas as pd
import os

In [69]:
os.listdir()
df_census=pd.read_csv('census.csv')

Here we see the first few rows of the dataframe.

In [70]:
df_census.head()

|   | Unnamed: 0 | first_name | last_name | birth_year | voted | num_children | income_year | higher_tax | marital_status |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Denise | Ratke | 2005 | False | 0 | 92129.41 | disagree | single |
| 1 | 1 | Hali | Cummerata | 1987 | False | 0 | 75649.17 | neutral | divorced |
| 2 | 2 | Salomon | Orn | 1992 | True | 2 | 166313.45 | agree | single |
| 3 | 3 | Sarina | Schiller | 1965 | False | 2 | 71704.81 | strongly agree | married |
| 4 | 4 | Gust | Abernathy | 1945 | False | 2 | 143316.08 | agree | married |


Let's see the datatypes of the dataframe variables.

In [71]:
df_census.dtypes

Unnamed: 0          int64
first_name         object
last_name          object
birth_year         object
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object
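The 'Unnamed: 0' column in the listing above usually appears when a CSV was saved together with its index and then read back without declaring that index. A minimal sketch, assuming that is what happened to census.csv (the notebook does not confirm this), using a small in-memory CSV with illustrative rows rather than the real file:

```python
import io

import pandas as pd

# Illustrative CSV text whose first header cell is empty, as happens when
# a dataframe is written with its index included.
csv_text = ",first_name,birth_year\n0,Denise,2005\n1,Hali,1987\n"

# Default read: the header-less first column is named 'Unnamed: 0'.
default_read = pd.read_csv(io.StringIO(csv_text))

# Declaring the first column as the index avoids the extra column.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))
print(list(with_index.columns))
```

Whether the column should be dropped depends on whether it carries information beyond the row position, which is worth checking before removing it.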

To compute the average birth year of the respondents, we need to change the datatype of 'birth_year' from object to a numeric type (int64 or float64). Let us first list the unique values of the variable.

In [72]:
df_census['birth_year'].unique()

array(['2005', '1987', '1992', '1965', '1945', '1951', '1963', '1949',
       '1950', '1971', '2007', '1944', '1995', '1973', '1946', '1954',
       '1994', '1989', '1947', '1993', '1976', '1984', 'missing', '1966',
       '1941', '2000', '1953', '1956', '1960', '2001', '1980', '1955',
       '1985', '1996', '1968', '1979', '2006', '1962', '1981', '1959',
       '1977', '1978', '1983', '1957', '1961', '1982', '2002', '1998',
       '1999', '1952', '1940', '1986', '1958'], dtype=object)

As the output shows, the 'birth_year' column contains the placeholder string 'missing', which must be replaced before the column can be converted to a numeric type.
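Before choosing a replacement, it can help to know how many rows carry a non-numeric entry, since unique() only shows that such a value exists. A minimal sketch on a toy series standing in for df_census['birth_year'] (the values and the name `birth_year` are illustrative assumptions):

```python
import pandas as pd

# Toy stand-in for the real column; values are illustrative only.
birth_year = pd.Series(['2005', '1987', 'missing', '1992', 'missing'])

# Coerce to numbers; anything that cannot be parsed becomes NaN.
as_number = pd.to_numeric(birth_year, errors='coerce')

# Tally the raw strings that failed to parse.
bad_values = birth_year[as_number.isna()].value_counts()
print(bad_values)
```

If the count is large, the choice of fill value will have a visible effect on the mean computed later.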

In [73]:
df_census['birth_year']=df_census['birth_year'].replace('missing','1967')
df_census['birth_year'].unique()

array(['2005', '1987', '1992', '1965', '1945', '1951', '1963', '1949',
       '1950', '1971', '2007', '1944', '1995', '1973', '1946', '1954',
       '1994', '1989', '1947', '1993', '1976', '1984', '1967', '1966',
       '1941', '2000', '1953', '1956', '1960', '2001', '1980', '1955',
       '1985', '1996', '1968', '1979', '2006', '1962', '1981', '1959',
       '1977', '1978', '1983', '1957', '1961', '1982', '2002', '1998',
       '1999', '1952', '1940', '1986', '1958'], dtype=object)

Now that the 'missing' placeholder has been replaced with the year '1967', we can change the datatype of the column to int64.

In [74]:
df_census['birth_year']=df_census['birth_year'].astype("int64")
df_census['birth_year'].dtypes

dtype('int64')
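The replacement year used above is a fixed value. One alternative, sketched here and not the approach this workbook takes, is to let pandas coerce unparseable entries to NaN and fill them with the median of the valid years. The toy data and variable names are assumptions for illustration:

```python
import pandas as pd

# Toy stand-in for the real column; values are illustrative only.
birth_year = pd.Series(['2005', '1987', 'missing', '1992'])

# Unparseable strings become NaN, so the result is float64.
numeric = pd.to_numeric(birth_year, errors='coerce')

# median() skips NaN by default: the median of 1987, 1992, 2005.
median_year = numeric.median()

# Fill the gaps, then convert to int64 now that no NaN remains.
filled = numeric.fillna(median_year).astype('int64')
print(filled.tolist())
```

Median imputation is only one option; dropping the affected rows or keeping them as missing may suit other analyses better.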

Now that the column is numeric, let's compute the mean birth year, rounded to the nearest whole year.

In [75]:
mean_birth_year=round(np.mean(df_census.birth_year))
mean_birth_year

1973

Suppose the manager now wants to impose an order on the 'higher_tax' variable such that
strongly disagree < disagree < neutral < agree < strongly agree. We can do this as follows.

In [76]:
df_census['higher_tax']=pd.Categorical(df_census['higher_tax'],categories=['strongly disagree','disagree','neutral','agree','strongly agree'],ordered=True)
df_census.higher_tax.unique()

['disagree', 'neutral', 'agree', 'strongly agree', 'strongly disagree']
Categories (5, object): ['strongly disagree' < 'disagree' < 'neutral' < 'agree' < 'strongly agree']
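Once the categories are ordered, pandas allows comparisons against a category label and order-aware min/max. A minimal sketch on toy responses (the data and the names `levels` and `responses` are illustrative assumptions, not taken from the census file):

```python
import pandas as pd

levels = ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree']

# Toy responses with the same ordered categories as in the workbook.
responses = pd.Series(
    pd.Categorical(
        ['disagree', 'neutral', 'agree', 'strongly agree', 'agree'],
        categories=levels,
        ordered=True,
    )
)

# Comparison uses the category order, not alphabetical order.
at_least_agree = responses >= 'agree'

# min/max are defined only because the categorical is ordered.
lowest = responses.min()
highest = responses.max()
print(at_least_agree.tolist(), lowest, highest)
```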

The manager now wants to know the median sentiment on the question: "Should the wealthy pay higher taxes?"
Since the categories are ordered, we first encode each label as its integer code and then take the median of the codes.

In [77]:
# Encode each ordered category as its position: 0 = 'strongly disagree', ..., 4 = 'strongly agree'.
df_census['higher_tax_coded']=df_census.higher_tax.cat.codes
df_census.higher_tax_coded.median()

2.0
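The median above is a numeric code, so it still has to be translated back into a label. A minimal sketch on toy responses (illustrative data and names), which also filters out the code -1 that pandas assigns to missing categorical values. It assumes the median is a whole number; with an even number of responses it can fall between two codes, and then no single label applies:

```python
import pandas as pd

levels = ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree']

# Toy responses; illustrative only.
responses = pd.Series(
    pd.Categorical(
        ['disagree', 'neutral', 'agree', 'strongly agree', 'agree'],
        categories=levels,
        ordered=True,
    )
)

# Integer position of each label; -1 would mark a missing response.
codes = responses.cat.codes
valid_codes = codes[codes >= 0]

# Sorted codes are 1, 2, 3, 3, 4, so the median is 3.0.
median_code = valid_codes.median()

# Map the code back to its label (assumes a whole-number median).
median_label = responses.cat.categories[int(median_code)]
print(median_code, median_label)
```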

## CONCLUSION: The median response is 'neutral' (code 2) on whether the rich should pay higher taxes.

To prepare the data for a machine learning model, the manager now wants to one-hot encode (OHE) the marital status.

In [78]:
df_census=pd.get_dummies(df_census,columns=['marital_status'])
df_census.head()

|   | Unnamed: 0 | first_name | last_name | birth_year | voted | num_children | income_year | higher_tax | higher_tax_coded | marital_status_divorced | marital_status_married | marital_status_single | marital_status_widowed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Denise | Ratke | 2005 | False | 0 | 92129.41 | disagree | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | Hali | Cummerata | 1987 | False | 0 | 75649.17 | neutral | 2 | 1 | 0 | 0 | 0 |
| 2 | 2 | Salomon | Orn | 1992 | True | 2 | 166313.45 | agree | 3 | 0 | 0 | 1 | 0 |
| 3 | 3 | Sarina | Schiller | 1965 | False | 2 | 71704.81 | strongly agree | 4 | 0 | 1 | 0 | 0 |
| 4 | 4 | Gust | Abernathy | 1945 | False | 2 | 143316.08 | agree | 3 | 0 | 1 | 0 | 0 |
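Depending on the pandas version, get_dummies may return boolean indicator columns rather than the 0/1 integers shown above. A minimal sketch, on a toy dataframe with illustrative names, that requests integers explicitly through the `dtype` parameter:

```python
import pandas as pd

# Toy dataframe; names and values are illustrative only.
people = pd.DataFrame({
    'first_name': ['A', 'B', 'C'],
    'marital_status': ['single', 'divorced', 'married'],
})

# One indicator column per category, stored as integers.
encoded = pd.get_dummies(people, columns=['marital_status'], dtype=int)
print(encoded)
```

For linear models, passing `drop_first=True` as well removes one indicator column to avoid perfectly collinear features; whether that is wanted depends on the model.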
