This is the R Jupyter notebook for the Earnings and Height analysis.

We will follow the exercise steps given below:
# PDF Page 1

![PDF Page 1](<../Used data and given exercise/Page 1.png>)

# PDF Page 2

![PDF Page 2](<../Used data and given exercise/Page 2.png>)

## Step 1 - importing the necessary data analysis libraries

We will start by importing the necessary libraries for our analysis. We will be using the tidyverse (which includes ggplot2) for plotting, readODS for reading the data file, and data.table for data manipulation.

In [None]:
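library(tidyverse)   # includes ggplot2 for plotting
library(readODS)     # read_ods() for OpenDocument spreadsheets
library(data.table)  # setDT() and the := syntax used later on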

## Step 2 - Reading the data file

We will read the spreadsheet using the readODS library and then convert it to a data.table. The file is named "Earnings_and_Height.ods" and sits in our "Used data and given exercise" folder. It contains the data on earnings and height of US workers.

In [None]:
df <- readODS::read_ods("../Used data and given exercise/Earnings_and_Height.ods")
setDT(df)

## Step 3 - Exploring the data

Let's first take a look at the data. We will use the head() function to see the first few rows of the data.

In [None]:
head(df)

As we can see, the data set is neatly organized. We have columns for the characteristics of the workers, such as their age, gender, education level, and earnings. We also have the height of the workers in inches.

## Step 4 - Scatter plot of Earnings and Height

We will focus on the relationship between earnings and height. We will draw a scatter plot to get a first impression of this relationship.

In [None]:
# Scatter plot of Earnings and Height
ggplot(df, aes(x = height, y = earnings)) + 
  geom_point() + 
  labs(title = "Scatter plot of Earnings vs Height", 
       x = "Height (inches)", 
       y = "Earnings ($)")

In [None]:
max_height <- max(df$height)
min_height <- min(df$height)
total_height_range <- max_height - min_height
print(paste("Total height range:", total_height_range))

On the scatter plot we see that, as the task states, Earnings takes only 23 distinct values, so the points line up in horizontal bands. The Earnings values are the averages of a tax bracket: each worker was assigned the average of their tax bracket, as stated in the data information PDF. Height is discrete as well, with only 36 distinct values in the data set, so the points also fall on vertical lines.
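
The discreteness described above can be checked by counting distinct values rather than computing a range. A minimal sketch in base R, using a small made-up data frame `toy` and a hypothetical helper `count_distinct` (neither is part of the exercise data); on the real data the same helper would be applied to `df$earnings` and `df$height`.

```r
# Toy data standing in for the real data set (values are invented for illustration)
toy <- data.frame(
  height   = c(60, 62, 62, 65, 65, 65, 70, 70),
  earnings = c(10000, 10000, 25000, 25000, 25000, 40000, 40000, 40000)
)

# Hypothetical helper: number of distinct values in a vector
count_distinct <- function(x) length(unique(x))

n_height   <- count_distinct(toy$height)
n_earnings <- count_distinct(toy$earnings)

print(paste("Distinct heights:", n_height))
print(paste("Distinct earnings values:", n_earnings))
```

Note that `max - min` only equals the number of distinct values minus one when every integer in between actually occurs, so counting is the safer check.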

From the scatter plot, we can see that there seems to be a positive correlation between height and earnings. Let's examine this further.

## Step 5 - Regression of Earnings and Height

We will now perform a regression of earnings on height. We will use the lm() function to perform the regression.

In [None]:
model <- lm(earnings ~ height, data = df)
summary(model)

From the summary we can see that the t-statistic of the Height coefficient is large and its p-value is below the significance level of alpha = 0.05 (5% significance level). This means that the slope is statistically significantly different from zero. The R-squared is small, which means that Height alone explains only a small share of the variance in Earnings.

We can also see the estimated slope parameter. It is the estimated average difference in Earnings associated with 1 extra inch of Height.
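
The quantities discussed above can also be pulled out of the fitted object instead of being read off the printed summary. The following is a sketch on simulated data (the data frame `toy`, its coefficients, and the 5-inch comparison are assumptions made for illustration, not results from the exercise data).

```r
# Simulated stand-in data; the numbers are arbitrary and not from the exercise
set.seed(1)
toy <- data.frame(height = rep(60:75, times = 5))
toy$earnings <- 1000 + 700 * toy$height + rnorm(nrow(toy), sd = 5000)

toy_model  <- lm(earnings ~ height, data = toy)
coef_table <- coef(summary(toy_model))  # matrix of estimates, SEs, t and p values

slope   <- coef_table["height", "Estimate"]
p_value <- coef_table["height", "Pr(>|t|)"]
r2      <- summary(toy_model)$r.squared

# Predicted earnings gap between two workers who differ by 5 inches in height
gap_5in <- 5 * slope

print(c(slope = slope, p_value = p_value, r_squared = r2, gap_5in = gap_5in))
```

With the notebook's own objects, replacing `toy_model` by `model` would give the corresponding values for the real data.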

## Step 6 - Omitted variable bias explanation

The task states that Case and Paxson (2008) suggested that Height correlates with an omitted factor, namely intelligence, because taller people tend to have had a healthier upbringing and hence both higher intelligence and greater height.

Since intelligence is hard to measure and not included in our data set, we cannot measure its effect.

The real impact of Height would be lower than we estimated, since in that case Height and Intelligence are positively correlated and both have a positive effect on Earnings. Our estimator picks up part of the effect of Intelligence and is therefore higher than the real impact of Height.
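
The direction of the bias can be made precise with the standard omitted variable bias formula. As a sketch, assume the true model is linear in both variables and that the error term $u_i$ is uncorrelated with Height (the symbols below are notation introduced here, not taken from the exercise):

$$
\mathrm{Earnings}_i = \beta_0 + \beta_1 \, \mathrm{Height}_i + \beta_2 \, \mathrm{Intelligence}_i + u_i
$$

If Intelligence is left out of the regression, the OLS slope on Height converges to

$$
\hat{\beta}_1 \xrightarrow{\,p\,} \beta_1 + \beta_2 \, \frac{\operatorname{Cov}(\mathrm{Height}_i, \mathrm{Intelligence}_i)}{\operatorname{Var}(\mathrm{Height}_i)}
$$

With $\beta_2 > 0$ and a positive covariance, the second term is positive, so the estimated slope overstates $\beta_1$.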

## Step 7 - Education as a substitute for Intelligence

We are asked to construct a regression model that includes Education as a substitute for Intelligence.

We will create 4 new variables:

- LT_HS = 1, if the worker has less than high school education (educ < 12).
- HS = 1, if the worker has high school education (educ = 12).
- Some_Col = 1, if the worker has some college education (12 < educ < 16).
- College = 1, if the worker has bachelor's degree education (educ >= 16).

In [None]:
# Creating the new variables and adding them to the data frame
# In data.table, := must go in the j position, hence the leading comma
df[, LT_HS := ifelse(educ < 12, 1, 0)]
df[, HS := ifelse(educ == 12, 1, 0)]
df[, Some_Col := ifelse(educ > 12 & educ < 16, 1, 0)]
df[, College := ifelse(educ >= 16, 1, 0)]

# View the new variables
names(df)
head(df)
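
A quick sanity check on the four education dummies is that every worker falls into exactly one category. The sketch below shows the idea in base R on a made-up `educ` vector (the data frame `toy` and its values are assumptions for illustration only).

```r
# Made-up years of education, chosen to cover all four categories
toy <- data.frame(educ = c(8, 11, 12, 12, 13, 15, 16, 18))

toy$LT_HS    <- as.integer(toy$educ < 12)
toy$HS       <- as.integer(toy$educ == 12)
toy$Some_Col <- as.integer(toy$educ > 12 & toy$educ < 16)
toy$College  <- as.integer(toy$educ >= 16)

dummy_cols <- c("LT_HS", "HS", "Some_Col", "College")

# Each row should have exactly one dummy equal to 1
all_covered <- all(rowSums(toy[, dummy_cols]) == 1)

# Number of workers in each category
category_counts <- colSums(toy[, dummy_cols])

print(all_covered)
print(category_counts)
```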

## Step 8 - Control regression of Earnings and Height for women

We will now perform the regression of earnings on height for women only. We will use the same regression method as before, but this time we will only include women in our data set.

In [None]:
# Subset data for women (sex == 0)
women_df <- df[sex == 0]

# Perform regression
women_model <- lm(earnings ~ height, data = women_df)
summary(women_model)

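## Step 9 - Regression including education variables for women

In [None]:
# College is the omitted reference category, which avoids perfect collinearity with the intercept
women_edu_model <- lm(earnings ~ height + LT_HS + HS + Some_Col, 
                      data = women_df)
summary(women_edu_model)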

## Step 11 - Regression for men (sex == 1)

In [None]:
# Subset data for men (sex == 1)
men_df <- df[sex == 1]
men_model <- lm(earnings ~ height + LT_HS + HS + Some_Col, 
                data = men_df)
summary(men_model)
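
The regression for men leaves out `College`, which makes college graduates the reference group. The reason is that the four dummies always sum to 1 and are therefore perfectly collinear with the intercept. The sketch below illustrates this on simulated data (the data frame `toy` and all numbers in it are invented for illustration): when all four dummies are included, `lm()` cannot estimate the last one and reports `NA` for it, while the fitted values are unchanged.

```r
# Simulated stand-in data; values are arbitrary and not from the exercise
set.seed(42)
toy <- data.frame(
  educ   = rep(c(10, 12, 14, 16), each = 10),
  height = rep(c(61, 63, 64, 66, 67, 68, 70, 71, 72, 74), times = 4)
)
toy$LT_HS    <- as.integer(toy$educ < 12)
toy$HS       <- as.integer(toy$educ == 12)
toy$Some_Col <- as.integer(toy$educ > 12 & toy$educ < 16)
toy$College  <- as.integer(toy$educ >= 16)
toy$earnings <- 5000 + 300 * toy$height + 2000 * toy$educ +
  rnorm(nrow(toy), sd = 3000)

# All four dummies plus an intercept: perfectly collinear design
fit_all <- lm(earnings ~ height + LT_HS + HS + Some_Col + College, data = toy)

# College omitted: it becomes the reference category
fit_ref <- lm(earnings ~ height + LT_HS + HS + Some_Col, data = toy)

print(coef(fit_all))  # the College coefficient is NA
print(coef(fit_ref))  # all coefficients are estimated
```

The dummy coefficients in `fit_ref` are then read as differences in average earnings relative to college graduates, holding height fixed.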


## Conclusion
### For women:
 - The impact of height on earnings is no longer significant when controlling for education.
 - R-squared increased to 0.138, explaining 13.8% of variance.
 - Education parameters are all significant.
 - More education is associated with higher earnings on average.

### For men:
 - Similar patterns are observed as for women.
 - Education remains a strong predictor of earnings.
 - Height's effect is reduced when controlling for education.