# Variable Types for Data Science

In [116]:
import pandas as pd
import numpy as np

#### 1.

The `census` dataframe (loaded below as `df`) contains simulated census data representing the demographics of a small community in the U.S. Call the `.head()` method on the dataframe and print the output to view the first five rows.

In [117]:
df = pd.read_csv('census_data.csv')
df.head()

Unnamed: 0,id,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_status
0,0,Denise,Ratke,2005,False,0,92129.41,disagree,single
1,1,Hali,Cummerata,1987,False,0,75649.17,neutral,divorced
2,2,Salomon,Orn,1992,True,2,166313.45,agree,single
3,3,Sarina,Schiller,1965,False,2,71704.81,strongly agree,married
4,4,Gust,Abernathy,1945,False,2,143316.08,agree,married


#### 2.

Review the dataframe description and the values returned by `.head()` to assess the type of each variable. This step is important for understanding what preprocessing will be necessary to work with the data.

<details><summary>Hint</summary>

- `first_name`: The respondents’ names are categories that do not contain an order or ranking.
- `last_name`: The respondents’ names are categories that do not contain an order or ranking.
- `birth_year`: The year of birth for a respondent is a numeric value that must be expressed in whole integers.
- `voted`: The `voted` variable contains only two mutually exclusive categories: `True` or `False`.
- `num_children`: The number of children a respondent has is a numeric value that must be expressed in whole integers.
- `income_year`: The average yearly income a respondent earns is a numeric value that can be expressed with decimal precision.
- `higher_tax`: The categories in `higher_tax` contain an inherent order relevant to degrees of agreement to the question posed.
- `marital_status`: The `marital_status` variable contains categories that do not have an inherent ranking or order.

</details>

In [118]:
df.describe(include='all')

Unnamed: 0,id,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_status
count,100.0,100,100,100.0,100,100.0,100.0,100,100
unique,,99,91,53.0,2,,,5,4
top,,Fumiko,Monahan,1949.0,False,,,disagree,married
freq,,2,3,4.0,51,,,37,36
mean,49.5,,,,,1.81,111380.7897,,
std,29.011492,,,,,1.433333,49015.171775,,
min,0.0,,,,,0.0,35635.14,,
25%,24.75,,,,,0.75,71246.52,,
50%,49.5,,,,,2.0,104990.805,,
75%,74.25,,,,,3.0,153492.09,,


#### 3.

Compare the values returned by the `.head()` method with the data type of each variable by calling `.dtypes` on the `census` dataframe and printing the result.

In [119]:
df.dtypes

id                  int64
first_name         object
last_name          object
birth_year         object
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object

#### 4.

The manager of the census would like to know the average birth year of the respondents. The `.dtypes` output shows that `birth_year` has the `object` dtype, meaning its values are stored as strings, whereas it should be an `int` column.

Print the unique values of the variable using the `.unique()` method.

In [120]:
df.birth_year.unique()

array(['2005', '1987', '1992', '1965', '1945', '1951', '1963', '1949',
       '1950', '1971', '2007', '1944', '1995', '1973', '1946', '1954',
       '1994', '1989', '1947', '1993', '1976', '1984', 'missing', '1966',
       '1941', '2000', '1953', '1956', '1960', '2001', '1980', '1955',
       '1985', '1996', '1968', '1979', '2006', '1962', '1981', '1959',
       '1977', '1978', '1983', '1957', '1961', '1982', '2002', '1998',
       '1999', '1952', '1940', '1986', '1958'], dtype=object)
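
Scanning `.unique()` by eye works for a small column. As an alternative, here is a minimal sketch that flags entries that cannot be parsed as numbers, using a toy series (the values and the names `as_numbers` and `bad_values` are illustrative assumptions, not part of the census data):

```python
import pandas as pd

# Toy stand-in for the `birth_year` column (illustrative values only).
birth_year = pd.Series(["2005", "1987", "missing", "1965"])

# Coerce to numbers; anything that cannot be parsed becomes NaN.
as_numbers = pd.to_numeric(birth_year, errors="coerce")

# Keep the original entries that failed to parse.
bad_values = birth_year[as_numbers.isna()]
print(bad_values.tolist())  # ['missing']
```

The index of `bad_values` points at the rows that need attention before the column can be converted.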

#### 5.

There appears to be a missing value in the `birth_year` column. With some research you find that the respondent’s birth year is 1967.

Use the `.replace()` method to replace the missing value with `1967`, so that the data type can be changed to int. Then recheck the values in `birth_year` by calling the `.unique()` method and printing the results.

In [121]:
df["birth_year"] = df["birth_year"].replace("missing", 1967)
df.birth_year.unique()


array(['2005', '1987', '1992', '1965', '1945', '1951', '1963', '1949',
       '1950', '1971', '2007', '1944', '1995', '1973', '1946', '1954',
       '1994', '1989', '1947', '1993', '1976', '1984', 1967, '1966',
       '1941', '2000', '1953', '1956', '1960', '2001', '1980', '1955',
       '1985', '1996', '1968', '1979', '2006', '1962', '1981', '1959',
       '1977', '1978', '1983', '1957', '1961', '1982', '2002', '1998',
       '1999', '1952', '1940', '1986', '1958'], dtype=object)

#### 6.

Now that we have adjusted the values in the `birth_year` variable, change its data type from `object` to `int` and print the data types of the `census` dataframe with `.dtypes`.

In [122]:
df.birth_year = df.birth_year.astype(int)
df.dtypes

id                  int64
first_name         object
last_name          object
birth_year          int64
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object
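
Steps 5 and 6 can also be chained. The sketch below uses a hypothetical three-row dataframe named `toy` and assigns the result back to the column rather than relying on `inplace=True`, since an in-place call on a selected column may not update the original dataframe in newer pandas versions:

```python
import pandas as pd

# Hypothetical toy data; the real column has 100 rows.
toy = pd.DataFrame({"birth_year": ["2005", "missing", "1965"]})

# Replace the placeholder, then convert the whole column to integers.
toy["birth_year"] = toy["birth_year"].replace("missing", "1967").astype(int)

print(toy["birth_year"].dtype)
print(toy["birth_year"].mean())  # 1979.0
```

Once the column is numeric, `.mean()` and other numeric summaries work as expected.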

#### 7.

Having assigned `birth_year` to the appropriate data type, print the average birth year of the respondents to the census using the pandas `.mean()` method.

In [123]:
df.birth_year.mean()

np.float64(1973.4)

#### 8.

Your manager would like to set an order to the `higher_tax` variable so that: `strongly disagree` < `disagree` < `neutral` < `agree` < `strongly agree`.

Convert the `higher_tax` variable to the `category` data type with the appropriate order, then print the new order using the `.unique()` method.

In [124]:
df.higher_tax = pd.Categorical(df.higher_tax, categories=['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree'], ordered=True)
df.higher_tax.unique()

['disagree', 'neutral', 'agree', 'strongly agree', 'strongly disagree']
Categories (5, object): ['strongly disagree' < 'disagree' < 'neutral' < 'agree' < 'strongly agree']
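
To see what the ordering makes possible, consider this small sketch on made-up responses (the series `responses` is an assumption for illustration). Ordered categories can be compared against a label and support `.min()` and `.max()`:

```python
import pandas as pd

levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Made-up responses stored as an ordered categorical.
responses = pd.Series(
    pd.Categorical(["agree", "disagree", "neutral", "agree"],
                   categories=levels, ordered=True)
)

# Comparisons follow the category order, not alphabetical order.
at_least_neutral = responses >= "neutral"
print(at_least_neutral.tolist())         # [True, False, True, True]
print(responses.min(), responses.max())  # disagree agree
```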

#### 9.

Your manager would also like to know the median sentiment of the respondents on the issue of higher taxes for the wealthy. Label encode the `higher_tax` variable and print the median using the pandas `.median()` method.

In [125]:
df.higher_tax = df.higher_tax.cat.codes
df.higher_tax.median(), df.higher_tax.unique()

(np.float64(2.0), array([1, 2, 3, 4, 0], dtype=int8))

In [126]:
df.head()

Unnamed: 0,id,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_status
0,0,Denise,Ratke,2005,False,0,92129.41,1,single
1,1,Hali,Cummerata,1987,False,0,75649.17,2,divorced
2,2,Salomon,Orn,1992,True,2,166313.45,3,single
3,3,Sarina,Schiller,1965,False,2,71704.81,4,married
4,4,Gust,Abernathy,1945,False,2,143316.08,3,married
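
The median code is easier to report when it is mapped back to its label. A minimal sketch on made-up responses (the names `codes`, `median_code`, and `median_label` are illustrative) could look like this:

```python
import pandas as pd

levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
responses = pd.Categorical(
    ["disagree", "neutral", "agree", "strongly agree", "agree"],
    categories=levels, ordered=True,
)

# Label encode: each category is replaced by its position in `levels`.
codes = pd.Series(responses.codes)
median_code = codes.median()
print(median_code)  # 3.0

# Map the code back to a label. This only works when the median is a
# whole number, which is guaranteed here because the row count is odd.
median_label = responses.categories[int(median_code)]
print(median_label)  # agree
```

Storing the codes in a separate variable, as done here, keeps the original labels available for reporting.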


#### 10.

Your manager is interested in using machine learning models on the census data in the future. To help, let’s One-Hot Encode `marital_status` to create a binary variable for each category. Use the pandas `get_dummies()` function to One-Hot Encode the `marital_status` variable.

Print the first five rows of the new dataframe with the `.head()` method. Note that you’ll have to scroll to the right or widen the browser window to see the dummy variables.

In [127]:
df = pd.get_dummies(df, columns=["marital_status"], dtype=int)
df.head()

Unnamed: 0,id,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_status_divorced,marital_status_married,marital_status_single,marital_status_widowed
0,0,Denise,Ratke,2005,False,0,92129.41,1,0,0,1,0
1,1,Hali,Cummerata,1987,False,0,75649.17,2,1,0,0,0
2,2,Salomon,Orn,1992,True,2,166313.45,3,0,0,1,0
3,3,Sarina,Schiller,1965,False,2,71704.81,4,0,1,0,0
4,4,Gust,Abernathy,1945,False,2,143316.08,3,0,1,0,0
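
As a quick sanity check on One-Hot Encoding, each row should have exactly one dummy column set to 1. The sketch below demonstrates this on a hypothetical three-row dataframe (`toy`, `encoded`, and `dummy_cols` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [0, 1, 2],
    "marital_status": ["single", "divorced", "married"],
})

# One binary column is created per category; the original column is dropped.
encoded = pd.get_dummies(toy, columns=["marital_status"], dtype=int)
print(encoded.columns.tolist())

# Every row belongs to exactly one category, so the dummies sum to 1.
dummy_cols = [c for c in encoded.columns if c.startswith("marital_status_")]
print(encoded[dummy_cols].sum(axis=1).tolist())  # [1, 1, 1]
```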


#### 11.

Congratulations! You have used your knowledge of variable types to help the census team manage their data. Feel free to explore the data further. There are additional operations you can perform on the data, such as:

- Create a new variable called `marital_codes` by Label Encoding the `marital_status` variable. This could help the Census team use machine learning to predict if a respondent thinks the wealthy should pay higher taxes based on their marital status.

- Create a new variable called `age_group`, which groups respondents based on their birth year. The groups should be in five-year increments, e.g., `26-30`, `31-35`, etc. Then label encode the `age_group` variable to assist the Census team in the event they would like to use machine learning to predict if a respondent thinks the wealthy should pay higher taxes based on their age group.
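
For the first suggestion, one possible approach is to convert the column to the `category` data type and read off its codes. The sketch below uses a hypothetical toy dataframe, since it must run on `marital_status` before the column is One-Hot Encoded. Because marital status is nominal, the resulting codes are arbitrary labels (assigned in alphabetical order here) and do not imply a ranking:

```python
import pandas as pd

toy = pd.DataFrame({"marital_status": ["single", "divorced", "married", "single"]})

# Unordered category: codes follow the sorted category labels.
as_category = toy["marital_status"].astype("category")
toy["marital_codes"] = as_category.cat.codes

# Show which code stands for which label.
print(dict(enumerate(as_category.cat.categories)))
print(toy["marital_codes"].tolist())  # [2, 0, 1, 2]
```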

In [128]:
from datetime import date

# Bin birth years into left-closed five-year intervals: [1940, 1945), [1945, 1950), ...
df["age_group"] = pd.cut(df['birth_year'], bins=range(1940, date.today().year, 5), right=False)
df.head()

Unnamed: 0,id,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_status_divorced,marital_status_married,marital_status_single,marital_status_widowed,age_group
0,0,Denise,Ratke,2005,False,0,92129.41,1,0,0,1,0,"[2005, 2010)"
1,1,Hali,Cummerata,1987,False,0,75649.17,2,1,0,0,0,"[1985, 1990)"
2,2,Salomon,Orn,1992,True,2,166313.45,3,0,0,1,0,"[1990, 1995)"
3,3,Sarina,Schiller,1965,False,2,71704.81,4,0,1,0,0,"[1965, 1970)"
4,4,Gust,Abernathy,1945,False,2,143316.08,3,0,1,0,0,"[1945, 1950)"


In [129]:
df.age_group = df.age_group.cat.codes
df.head()

Unnamed: 0,id,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_status_divorced,marital_status_married,marital_status_single,marital_status_widowed,age_group
0,0,Denise,Ratke,2005,False,0,92129.41,1,0,0,1,0,13
1,1,Hali,Cummerata,1987,False,0,75649.17,2,1,0,0,0,9
2,2,Salomon,Orn,1992,True,2,166313.45,3,0,0,1,0,10
3,3,Sarina,Schiller,1965,False,2,71704.81,4,0,1,0,0,5
4,4,Gust,Abernathy,1945,False,2,143316.08,3,0,1,0,0,1
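
The cells above bin birth years directly. If groups expressed as ages are preferred, a rough sketch might look like the following. The reference year of 2024, the bin edges, and the label format are all assumptions made for illustration:

```python
import pandas as pd

birth_year = pd.Series([2005, 1987, 1992, 1965, 1945])

# Assumption: ages are computed relative to a fixed reference year.
reference_year = 2024
age = reference_year - birth_year

# Left-closed five-year bins: [15, 20), [20, 25), ..., [75, 80).
edges = list(range(15, 85, 5))
labels = [f"{lo}-{lo + 4}" for lo in edges[:-1]]
age_group = pd.cut(age, bins=edges, labels=labels, right=False)

print(age_group.astype(str).tolist())  # ['15-19', '35-39', '30-34', '55-59', '75-79']
print(age_group.cat.codes.tolist())    # [0, 4, 3, 8, 12]
```

Fixing the reference year keeps the output reproducible, whereas a value derived from today's date changes over time.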


In [130]:
df.dtypes

id                           int64
first_name                  object
last_name                   object
birth_year                   int64
voted                         bool
num_children                 int64
income_year                float64
higher_tax                    int8
marital_status_divorced      int64
marital_status_married       int64
marital_status_single        int64
marital_status_widowed       int64
age_group                     int8
dtype: object