# Tutorial 1: Data Exploration and regression analysis

In this tutorial you will learn how to read and explore census data from the [IPUMS USA](https://usa.ipums.org/usa/) database. IPUMS USA collects, preserves and harmonizes U.S. census microdata and provides easy access to this data with enhanced documentation. Data includes decennial censuses from 1790 to 2010 and American Community Surveys (ACS) from 2000 to the present.

IPUMS provides census and survey data from around the world integrated across time and space. IPUMS integration and documentation makes it easy to study change, conduct comparative research, merge information across data types, and analyze individuals within family and community context.

In this tutorial, you will learn how to read the IPUMS USA data, explore and manipulate it (to prepare for analysis), and perform a simple correlation and regression analysis. We will use the IPUMS USA database throughout the next four tutorials.

### Important before we start
---
Make sure that you save this file before you continue, otherwise you will lose your work. To do so, go to **Bestand/File** and click on **Een kopie opslaan in Drive/Save a Copy on Drive**!

Now, rename the file to Week2_Tutorial1.ipynb. You can do so by clicking on the name at the top of this screen.

<h2>Tutorial Outline<span class="tocSkip"></span></h2>
<hr>
<div class="toc"><ul class="toc-item">
<li><span><a href="#introducing-the-packages" data-toc-modified-id="1.-Introducing-the-packages-2">1. Introducing the packages</a></span></li>
<li><span><a href="#reading-the-data-and-having-a-first-look" data-toc-modified-id="2.-Reading-the-data-and-having-a-first-look-3">2. Reading the data and having a first look</a></span></li>
<li><span><a href="#exploratory-data-analysis" data-toc-modified-id="3.-Exploratory-data-analysis-4">3. Exploratory data analysis</a></span></li>
<li><span><a href="#creating-new-variables" data-toc-modified-id="4.-Creating new variables-5">4. Creating new variables</a></span></li>
<li><span><a href="#nonlinear-relationships" data-toc-modified-id="5.-Nonlinear-relationships-6">5. Nonlinear relationships </a></span></li></ul></div>

## Learning Objectives
<hr>

- Work with Pandas DataFrames containing real-world data.
- Use basic functions to explore and understand the data.
- Make distribution plots of the data.
- Create a correlation map.
- Create new variables.
- Set up an OLS regression.

## 1. Introducing the packages
<hr>

Within this tutorial, we are going to make use of the following packages:

[**seaborn**](https://seaborn.pydata.org/index.html) is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

[**NumPy**](https://numpy.org/doc/stable/) is a Python library that provides a multidimensional array object, various derived objects, and an assortment of routines for fast operations on arrays.

[**Pandas**](https://pandas.pydata.org/docs/) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

[**statsmodels**](https://www.statsmodels.org/) is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

Now we will import these packages in the cell below:

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

sns.set_style("whitegrid",{'axes.grid' : True})
pd.options.mode.chained_assignment = None

## 2. Reading the data and having a first look
<hr>

In [2]:
## Read the data using the pandas package
data = pd.read_csv(r"https://github.com/ElcoK/BigData_AED/raw/main/week2/usadataforweek2tut1.csv")

There are hundreds of variables included in the IPUMS USA database. In the next line of code we read in a file with a short description of the variables that we selected for this tutorial. The dataset contains socio-demographic variables such as age, gender, education, but also economic variables, such as income, mortgage payments, energy costs, etc. In these two weeks we are going to work towards two machine learning models that can predict energy costs per household.

In [None]:
datadescription = pd.read_csv(r"https://github.com/ElcoK/BigData_AED/raw/main/week2/shortdescription_ipumsusa_variables.csv", sep = ';', encoding = 'unicode_escape')
print(datadescription[['Variable_name','Short_description']])

Now let’s take a look at the dataset. A useful start is to explore the first rows through the `head()` function with Pandas.

In [None]:
data.head()

And let's have a look at the total size of this dataset. To do so, we can use the `shape` attribute (note that it is an attribute, not a function, so it is written without parentheses).

In [None]:
data.shape

And some more information through the `describe()` function.

In [None]:
data.describe()

<div class="alert alert-block alert-success">
<b>Question 1:</b> What information do the attribute .shape and the function .describe() give you about the data?
</div>

Let's find out the data type of each column and how many non-null values each column contains.

In [None]:
data.info()

Another way to check for missing values is to use the following line of code.

In [None]:
data.isnull().sum().sort_values(ascending = False)

<div class="alert alert-block alert-success">
<b>Question 2:</b> <br>
    
- Please describe what the following line of Python code does:
<br>
    <b>data.isnull().sum().sort_values(ascending = False)</b>
<br>
Hint: Every dot (.) indicates a new function.
<br>
- Also describe what you see in the data. What does the output mean? Do you notice any surprises?
</div>

## 3. Exploratory data analysis
<hr>

Now we are going to take a look at some specific variables. We want to see how variables are distributed and how they are related to each other, and specifically how they are related to energy costs. We can do this in several ways.

The next cell shows the different values in the variable OWNERSHP and how many times each value occurs. If you want to know what the values mean, you can search for the variable on the IPUMS USA website.

In [None]:
data['OWNERSHP'].value_counts(dropna = False)

In [None]:
data['HHINCOME'].value_counts(dropna = False)
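To make the effect of the `dropna` argument concrete, here is a minimal sketch on a made-up column (the name `TOY_CODE` and its values are invented for illustration and are not part of the IPUMS data). By default `value_counts()` leaves missing values out; with `dropna = False` they get their own row.

```python
import numpy as np
import pandas as pd

# Made-up toy column (not IPUMS data): two codes plus one missing value
toy = pd.Series([1.0, 2.0, 1.0, np.nan, 1.0, 2.0], name="TOY_CODE")

# Default behaviour: missing values are left out of the count
counts_without_nan = toy.value_counts()

# dropna=False: missing values are counted as a separate category
counts_with_nan = toy.value_counts(dropna=False)

print(counts_without_nan)
print(counts_with_nan)
```

Comparing the totals of the two results is a quick way to see how many values are missing in a column.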

The function `value_counts()` is especially useful when the number of values in a variable is limited. For example, for income, which contains a lot of different values, value_counts is not so useful and a plot of the distribution would be more insightful. We can use the **seaborn** package to easily plot the distribution of the income.   

In [None]:
%matplotlib inline
sns.displot(data['HHINCOME'], bins=50,kde=False)

In [None]:
%matplotlib inline
# Fewer bins
sns.displot(data['HHINCOME'], bins=20,kde=False)

We can also include the kernel density by setting the `kde` parameter to **True**. Kernel Density Estimation (KDE) is a technique that lets you create a smooth curve from a set of data. This can be useful if you want to visualize just the “shape” of some data, as a kind of continuous replacement for the discrete histogram.

In [None]:
%matplotlib inline
sns.displot(data['HHINCOME'], bins=50,kde=True)
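To give an idea of what the smooth curve represents, the sketch below computes a Gaussian kernel density estimate by hand with NumPy: every observation contributes a small bell-shaped bump and the estimate is the average of all bumps. The function `gaussian_kde_sketch`, the made-up income values and the rule-of-thumb bandwidth are assumptions for illustration only; seaborn chooses its own bandwidth, so its curve may look somewhat different.

```python
import numpy as np

def gaussian_kde_sketch(values, grid, bandwidth):
    """Average of Gaussian bumps centred on each observation (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance between every grid point and every observation
    z = (grid[:, None] - values[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)

# Made-up, right-skewed "income" values (not IPUMS data)
rng = np.random.default_rng(0)
toy_income = rng.lognormal(mean=10.5, sigma=0.6, size=500)

# Rule-of-thumb bandwidth (an assumption): standard deviation times n^(-1/5)
bandwidth = toy_income.std(ddof=1) * len(toy_income) ** (-1 / 5)

grid = np.linspace(toy_income.min() - 4 * bandwidth,
                   toy_income.max() + 4 * bandwidth, 400)
density = gaussian_kde_sketch(toy_income, grid, bandwidth)

# A density should integrate to roughly 1 (simple Riemann sum over the grid)
area = density.sum() * (grid[1] - grid[0])
print(round(area, 2))
```

A smaller bandwidth makes the curve follow the data more closely (and more noisily), while a larger bandwidth smooths out detail, much like choosing more or fewer bins in a histogram.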

In [None]:
# code block to create your own plot of a distribution of a variable. 

Now we are going to take a look at the variable we want to predict in a later stage, energy costs.

In [None]:
%matplotlib inline
sns.displot(data['COSTENERGY'], bins=50,kde=False)

Next we want to see how different variables are related to each other. We are specifically interested in which variables are correlated with energy costs, as we are going to predict energy costs using a simple regression model in Step 6 of this tutorial. To identify potential predictors of energy costs, we can create a correlation matrix using the code below. You can add a few variables you are interested in yourself.
As you can see, there are several options in the seaborn (imported as sns) heatmap. `annot = True` makes sure that the numbers are included in the heatmap (try switching it off), `fmt = ".2f"` specifies that we want two decimals, `cmap` specifies the color palette, and `center = 0` scales the colors in such a way that a correlation of 0 corresponds to white.


In [None]:
%matplotlib inline
plt.figure(figsize=(8,6))
sns.heatmap(data[["COSTENERGY", "ROOMS", "RENT", "INCTOT", "HHINCOME", "BUILTYR2", "VEHICLES", "BEDROOMS", "FARM", "OWNERSHP", "MORTGAGE", "AGE", "SEX"]].corr(),
                           annot=True, fmt = ".2f", center=0, cmap = "RdBu")
plt.show()
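Each cell of the heatmap is a Pearson correlation coefficient, which is what `.corr()` computes by default. The sketch below, on made-up numbers rather than the IPUMS data, calculates one such coefficient by hand and compares it with the pandas result.

```python
import numpy as np
import pandas as pd

# Made-up toy data (not IPUMS): energy costs that loosely increase with rooms
toy = pd.DataFrame({
    "ROOMS": [3, 4, 5, 6, 7, 8],
    "COSTENERGY": [900, 1100, 1000, 1400, 1500, 1700],
})

# Pearson correlation by hand: covariance divided by the product of the spreads
x = toy["ROOMS"] - toy["ROOMS"].mean()
y = toy["COSTENERGY"] - toy["COSTENERGY"].mean()
r_by_hand = (x * y).sum() / np.sqrt((x**2).sum() * (y**2).sum())

# The same number taken from the pandas correlation matrix
r_pandas = toy.corr().loc["ROOMS", "COSTENERGY"]

print(round(r_by_hand, 3), round(r_pandas, 3))
```

The coefficient always lies between -1 and 1 and only measures linear association, so a value close to 0 does not rule out a nonlinear relationship.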

<div class="alert alert-block alert-success">
    <b>Question 3:</b> Which variable is most strongly correlated with <b>COSTENERGY</b> and which variable the least?
</div>

## 4. Creating new variables
<hr>

One variable that is not present in the dataset, but will most likely have a high impact on energy costs is the number of household members. We can create a new variable HHSIZE (household size) using the column SERIAL, which identifies household IDs (i.e., all members of the same household share the same ID). To create a new column HHSIZE, we will count the occurrences of each unique household ID in the SERIAL column. 

In [10]:
serialcount = data['SERIAL'].value_counts().reset_index()

In [None]:
print(serialcount)

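The column **count** indicates the number of persons in each household and each household is uniquely identified by the **SERIAL** column (in pandas versions before 2.0 the two columns are named **index** and **SERIAL** instead, with the counts in **SERIAL**). Rename the column **count** to **HHSIZE** in the empty code block below.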

Then we add the column **HHSIZE** (household size) to our data using the `.merge()` function. Within the `merge()` function, you have to define on which column you want to merge the two dataframes using the `on` argument. Merge on the **SERIAL** column.

In [11]:
data = data.merge(serialcount, on = 'SERIAL', how = 'left')
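As an aside, the same household size can be obtained in a single step with `groupby` and `transform`. This is an alternative sketch, not the approach used in this tutorial, and it runs on a made-up dataframe so that the result is easy to check by eye.

```python
import pandas as pd

# Made-up toy data (not IPUMS): one row per person, SERIAL is the household ID
toy = pd.DataFrame({
    "SERIAL": [101, 101, 101, 102, 103, 103],
    "AGE": [40, 38, 3, 71, 29, 30],
})

# Count the rows that share each SERIAL and write that count back to every row
toy["HHSIZE"] = toy.groupby("SERIAL")["SERIAL"].transform("size")

print(toy)
```

Because `transform` returns one value per original row, no separate merge is needed. The count-and-merge route used above is still worth practising, since the same pattern returns later in this section.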

It is important to realize that the dataset contains household variables and individual variables. Household variables have the same value for the entire household (e.g. energy costs) and individual variables differ within the household (e.g. age and gender). Individual variables are not very meaningful in predicting a household variable, such as energy costs. Think of a family with children, where a 3-year-old daughter has the same energy costs as the 40-year-old father. Therefore, it doesn't make sense to include AGE or SEX in the model in their current form. In this section, we will create new variables based on the variable AGE (we assume that SEX doesn't have a large effect on energy costs anyway). In the next steps, we are going to create a new variable (column) which indicates the number of young children in a household. We hypothesize that young children will increase the household's energy bill (more dirty laundry, so increased usage of washing machine and dryer).

First, we create a column "YOUNGERTHAN". Choose an age between 1 and 8. 

In [12]:
data['YOUNGERTHAN'] = data['AGE'] <= #fill in age


This will create a new column with True and False (True for individuals whose age is at most the given age, because the comparison uses `<=`). Let's convert this to integers because it's easier to work with numbers.

In [13]:
data['YOUNGERTHAN'] = data['YOUNGERTHAN'].astype(int)

Now we will count the number of young children in each household. In the following line of code, we create a new dataframe, <b>childcount</b>, which gives us the number of individuals within SERIAL and YOUNGERTHAN. Specifically, this dataframe will show the count of household members younger than the given age (YOUNGERTHAN == 1) and those older than the given age (YOUNGERTHAN == 0).  

In [None]:
childcount = data[['SERIAL', 'YOUNGERTHAN']].value_counts().reset_index()
print(childcount)

The following line of code gives the same counts, although the rows are ordered differently and the column with the counts may get a different name.

In [None]:
childcount = data[['SERIAL','YOUNGERTHAN']].groupby(['SERIAL', 'YOUNGERTHAN']).size().reset_index()
print(childcount)

<div class="alert alert-block alert-success">
    <b>Question 4:</b> Please describe what the following line of Python code does:
<br> childcount = data[['SERIAL','YOUNGERTHAN']].groupby(['SERIAL', 'YOUNGERTHAN']).size().reset_index()
</div>

We want to count the number of household members that are younger than the given age, so we run:

In [16]:
childcount1 = childcount[childcount['YOUNGERTHAN'] == 1]

This creates a new dataframe, <b> childcount1 </b>, which only contains the number of household members younger than the given age for each household ID (SERIAL). We drop the column YOUNGERTHAN and we rename the remaining two columns: 

In [17]:
childcount1 = childcount1.drop(columns = ['YOUNGERTHAN'])
childcount1.columns = ['SERIAL', 'NR_OF_YOUNGCHILDREN']

Then we merge childcount1 to data. 

In [18]:
data = data.merge(childcount1, on = 'SERIAL', how = 'left')

When we inspect the values in the column NR_OF_YOUNGCHILDREN of data, we notice something:

In [None]:
data['NR_OF_YOUNGCHILDREN'].value_counts(dropna = False).reset_index()

<div class="alert alert-block alert-success">
    <b>Question 5:</b> Why are there so many NaN values? Use the following code block to fill in the NaN and report the line of code. 
</div>

In [20]:
# fill in the nans in column NR_OF_YOUNGCHILDREN in data

Now we will visualize the relationship between the number of young children and energy costs. 

In [None]:
plt.figure()
sns.barplot(x = 'NR_OF_YOUNGCHILDREN', y = 'COSTENERGY', data = data)
plt.title('Nr of young children in household vs Cost of Energy')
plt.show()

In the next steps, you are going to create another variable based on age. We hypothesize that older people use relatively more energy than younger people, because older people are more likely to be retired and spend more time at home than younger people with a job. (The dataset dates from before COVID, so working from home was not yet common.) We don't have information on retirement, so we will proxy retirement by age.

We follow the same steps as we used to determine the number of young children. See how we can combine two lines of code in one: 

In [22]:
data['OLDERTHAN'] = (data['AGE'] >= ).astype(int) #fill in an age between 62 and 68. 

Create a dataframe, <b> eldercount </b>, that gives the number of people older than the given age and younger than the given age per household ID (SERIAL). 

In [23]:
eldercount = 

As before, we are only interested in the number of household members older than the given age.

In [24]:
eldercount1 = 

Then we drop the column OLDERTHAN and we rename the remaining columns.

<div class="alert alert-block alert-success">
    <b>Question 6:</b> If you run the next two lines of code, you will get an error. Explain what goes wrong and add the missing part. 
</div>

In [25]:
eldercount1.drop(columns = ['OLDERTHAN'])
eldercount1.columns = ['SERIAL', 'NR_OF_ELDERLY']

Then, we merge eldercount1 to data. Note that we create a new dataframe, <b> datatest </b>, and that we do a <b> right merge </b>. 

In [None]:
datatest = data.merge(eldercount1, on = 'SERIAL', how = 'right')

<div class="alert alert-block alert-success">
    <b>Question 7:</b> Can you explain what happens when we do a right merge? Hint: Inspect the number of rows in data, datatest and eldercount1. Additionally, describe the other options available for the how parameter in the merge function and explain how they affect the resulting dataframe (datatest). 
</div>

In [None]:
# code block to test different merges. datatest = data.merge(eldercount1, on = 'SERIAL', how = '')

Run the following lines to merge eldercount1 to data and to replace the resulting missing values with 0.

In [26]:
data = data.merge(eldercount1, on = 'SERIAL', how = 'left')
data['NR_OF_ELDERLY'] = data['NR_OF_ELDERLY'].fillna(0)

In [None]:
plt.figure()
sns.barplot(x = 'NR_OF_ELDERLY', y = 'COSTENERGY', data = data)
plt.title('Nr of elderly in household vs Cost of Energy')
plt.show()

<div class="alert alert-block alert-success">
    <b>Question 8:</b> The figure shows that there is no linear relationship between the number of elderly in a household and the household's energy costs. Therefore, we will also create a binary variable that just indicates the presence of elderly persons in the household. In the next code block, create this new variable <b> ELDERLY_PRESENT </b> and report your code. </div>

In [None]:
data['ELDERLY_PRESENT'] = 

## 6. Regression analysis
<hr>

In the last part we are going to estimate a first model to predict the energy costs of households. We will use a simple Ordinary Least Squares (OLS) regression model. To make life a bit easier, the next line of code loads a prepared dataset for the OLS regression (it includes the new variables you created and it deals with missing values). We use the package statsmodels to estimate the OLS regression. We could also use the sklearn library, but statsmodels has the advantage that it reports standard errors of the coefficient estimates. This means that we can say something about the statistical significance of the explanatory variables and about the size and direction of their association with the dependent variable, something that most machine learning models do not offer directly. Keep in mind that a significant coefficient by itself does not prove a causal effect.

In [28]:
dataOLS = pd.read_csv(r"https://github.com/ElcoK/BigData_AED/raw/main/week2/usadataforOLS.csv", sep = ',', encoding = 'unicode_escape')

In [30]:
Y = dataOLS[['COSTENERGY']]

In [43]:
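# .copy() gives X its own data, so adding a column does not affect dataOLS
X = dataOLS[["ROOMS", "HHINCOME", "BUILTYR2", "VEHICLES", "FARM", "OWNERSHP", "HHSIZE", "NR_OF_ELDERLY", "NR_OF_YOUNGCHILDREN"]].copy()
# Column of ones so that the model includes an intercept
X['Constant'] = 1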

In [None]:
regressionOLS = sm.OLS(Y, X)
resultsOLS = regressionOLS.fit()
print(resultsOLS.summary())
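To see where the coefficients and standard errors in such a summary come from, the sketch below fits an OLS regression with NumPy only. The data are simulated with made-up coefficients (500, 120 and 80 are assumptions for illustration, not estimates from the IPUMS data), and the standard errors use the classical formula that assumes errors with constant variance.

```python
import numpy as np

# Simulated toy data (not IPUMS): the "true" coefficients are chosen by hand
rng = np.random.default_rng(42)
n = 200
rooms = rng.integers(2, 10, size=n).astype(float)
hhsize = rng.integers(1, 6, size=n).astype(float)
noise = rng.normal(0, 100, size=n)
cost = 500 + 120 * rooms + 80 * hhsize + noise

# Design matrix with a column of ones for the intercept
X = np.column_stack([rooms, hhsize, np.ones(n)])

# Least-squares coefficients: minimise the sum of squared residuals
beta, *_ = np.linalg.lstsq(X, cost, rcond=None)

# Classical standard errors: sqrt of the diagonal of sigma^2 * (X'X)^-1
residuals = cost - X @ beta
dof = n - X.shape[1]                     # observations minus estimated parameters
sigma2 = residuals @ residuals / dof     # estimated error variance
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(cov_beta))
t_values = beta / std_err

for name, b, se, t in zip(["ROOMS", "HHSIZE", "Constant"], beta, std_err, t_values):
    print(f"{name:10s} coef={b:8.2f}  std err={se:6.2f}  t={t:6.2f}")
```

The estimated coefficients should land close to the values used to simulate the data, and a large t-value (coefficient divided by its standard error) indicates that the coefficient is unlikely to be zero.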

<div class="alert alert-block alert-success">
    <b>Question 9:</b> Interpret the results of the OLS regression. You can discuss the following questions in your answer. 1) Do you think that the coefficient estimates are plausible? You don't need to explain all coefficient estimates, but highlight a few. 2) Would you include other variables in the regression? Or do you want to exclude variables? Explain which ones. Make another correlation heatmap to support your explanation.
</div>

Interpret the results of the OLS regression. Make another correlation heatmap to support your explanation.

In [None]:
# code for correlation heatmap.