From b08830f4dafd9decaf0db74387e4476a59604879 Mon Sep 17 00:00:00 2001
From: mmcky
Date: Tue, 20 Feb 2024 16:33:29 +1100
Subject: [PATCH 1/7] [simple_linear_regression] Review lecture pandas code,
 spelling with update to american spelling

---
 lectures/simple_linear_regression.md | 27 ++++++++++-----------------
 1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/lectures/simple_linear_regression.md b/lectures/simple_linear_regression.md
index 88a119670..daa81945a 100644
--- a/lectures/simple_linear_regression.md
+++ b/lectures/simple_linear_regression.md
@@ -61,8 +61,8 @@ ax = df.plot(
     x='X',
     y='Y',
     kind='scatter',
-    ylabel='Ice-Cream Sales ($\'s)',
-    xlabel='Degrees Celcius'
+    ylabel='Ice-cream sales ($\'s)',
+    xlabel='Degrees Celsius'
 )
 ```

@@ -114,7 +114,7 @@ df.plot(x='X',y='Y', kind='scatter', ax=ax)
 df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
 ```

-However we need to think about formalising this guessing process by thinking of this problem as an optimization problem.
+However, we need to formalize this guessing process by thinking of it as an optimization problem.

 Let's consider the error $\epsilon_i$ and define the difference between the observed values $y_i$ and the estimated values $\hat{y}_i$ which we will call the residuals

@@ -140,7 +140,7 @@ df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
 plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r');
 ```

-The Ordinary Least Squares (OLS) method, as the name suggests, chooses $\alpha$ and $\beta$ in such a way that **minimises** the Sum of the Squared Residuals (SSR).
+The Ordinary Least Squares (OLS) method chooses $\alpha$ and $\beta$ in a way that **minimizes** the sum of the squared residuals (SSR).

 $$
 \min_{\alpha,\beta} \sum_{i=1}^{N}{\hat{e}_i^2} = \min_{\alpha,\beta} \sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)^2}
 $$

@@ -152,7 +152,7 @@ $$
 C = \sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)^2}
 $$

-that we would like to minimise with parameters $\alpha$ and $\beta$.
+that we would like to minimize with respect to the parameters $\alpha$ and $\beta$.

 ## How does error change with respect to $\alpha$ and $\beta$

@@ -173,7 +173,7 @@ for β in np.arange(20,100,0.5):
     errors[β] = abs((α_optimal + β * df['X']) - df['Y']).sum()
 ```

-Ploting the error
+Plotting the error

 ```{code-cell} ipython3
 ax = pd.Series(errors).plot(xlabel='β', ylabel='error')
 plt.axvline(β_optimal, color='r');
```

@@ -188,7 +188,7 @@ for α in np.arange(-500,500,5):
     errors[α] = abs((α + β_optimal * df['X']) - df['Y']).sum()
 ```

-Ploting the error
+Plotting the error

 ```{code-cell} ipython3
 ax = pd.Series(errors).plot(xlabel='α', ylabel='error')
 plt.axvline(α_optimal, color='r');
```

@@ -331,13 +331,6 @@ df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
 plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r');
 ```

-:::{admonition} Why use OLS?
-TODO
-
-1. Discuss mathematical properties for why we have chosen OLS
-:::
-
-
 :::{exercise}
 :label: slr-ex1

@@ -347,7 +340,7 @@ Let's consider two economic variables, GDP per capita and life expectancy.

 1. What do you think their relationship would be?
 2. Gather some data [from Our World in Data](https://ourworldindata.org)
-3. Use `pandas` to import the `csv` formated data and plot a few different countries of interest
+3. Use `pandas` to import the `csv` formatted data and plot a few different countries of interest
 4. Use {eq}`eq:optimal-alpha` and {eq}`eq:optimal-beta` to compute optimal values for $\alpha$ and $\beta$
 5. Plot the line of best fit found using OLS
 6. 
Interpret the coefficients and write a summary sentence describing the relationship between GDP per capita and life expectancy

@@ -528,9 +521,9 @@ plt.vlines(data['log_gdppc'], data['life_expectancy_hat'], data['life_expectancy
 :::{exercise}
 :label: slr-ex2

-Minimising the sum of squares is not the **only** way to generate the line of best fit.
+Minimizing the sum of squares is not the **only** way to generate the line of best fit.

-For example, we could also consider minimising the sum of the **absolute values**, that would give less weight to outliers.
+For example, we could also consider minimizing the sum of the **absolute values**, which would give less weight to outliers.

 Solve for $\alpha$ and $\beta$ using the least absolute values
 :::

From 5fb3ffd00c4bae6aa159141dfa78a5a339d48a12 Mon Sep 17 00:00:00 2001
From: mmcky
Date: Tue, 20 Feb 2024 16:36:44 +1100
Subject: [PATCH 2/7] update data location

---
 lectures/simple_linear_regression.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/lectures/simple_linear_regression.md b/lectures/simple_linear_regression.md
index daa81945a..af7ac9607 100644
--- a/lectures/simple_linear_regression.md
+++ b/lectures/simple_linear_regression.md
@@ -356,13 +356,13 @@ Let's consider two economic variables, GDP per capita and life expectancy.

 :::

-You can download {download}`a copy of the data here <_static/lecture_specific/simple_linear_regression/life-expectancy-vs-gdp-per-capita.csv>` if you get stuck
+You can download {download}`a copy of the data here <https://github.com/QuantEcon/lecture-python-intro/raw/main/lectures/_static/lecture_specific/simple_linear_regression/life-expectancy-vs-gdp-per-capita.csv>` if you get stuck

 **Q3:** Use `pandas` to import the `csv` formatted data and plot a few different countries of interest

 ```{code-cell} ipython3
-fl = "_static/lecture_specific/simple_linear_regression/life-expectancy-vs-gdp-per-capita.csv" # TODO: Replace with GitHub link
-df = pd.read_csv(fl, nrows=10)
+data_url = "https://github.com/QuantEcon/lecture-python-intro/raw/main/lectures/_static/lecture_specific/simple_linear_regression/life-expectancy-vs-gdp-per-capita.csv"
+df = pd.read_csv(data_url, nrows=10)
 ```

 ```{code-cell} ipython3
@@ -446,7 +446,7 @@ df = df[df.year == 2018].reset_index(drop=True).copy()
 ```

 ```{code-cell} ipython3
-df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="GDP per capita", ylabel="Life Expectancy (Years)",);
+df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="GDP per capita", ylabel="Life expectancy (years)",);
 ```

 This data shows a couple of interesting relationships.

@@ -463,7 +463,7 @@ ln -> ln == elasticities
 By specifying `logx` you can plot the GDP per capita data on a log scale

 ```{code-cell} ipython3
-df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="GDP per capita", ylabel="Life Expectancy (Years)", logx=True);
+df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="GDP per capita", ylabel="Life expectancy (years)", logx=True);
 ```

 As you can see from this transformation, a linear model fits the shape of the data more closely.
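To make the transformation step above concrete, here is a minimal sketch, assuming the cleaned dataframe `df` with `gdppc` and `life_expectancy` columns built in the cells above (the column name `log_gdppc` mirrors the one used later in the lecture's solution):

```python
import numpy as np

# Log-transform GDP per capita so that a straight line can capture
# the curved relationship visible in the raw scatter plot
df['log_gdppc'] = np.log(df['gdppc'])

df.plot(x='log_gdppc', y='life_expectancy', kind='scatter',
        xlabel="log GDP per capita",
        ylabel="Life expectancy (years)");
```

In this level-log specification, the slope measures the change in life expectancy (in years) associated with a one-unit change in log GDP per capita.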
From 7129b6bcf9ece91f4c6df8b7038e5488b6051a73 Mon Sep 17 00:00:00 2001
From: mmcky
Date: Tue, 20 Feb 2024 16:52:30 +1100
Subject: [PATCH 3/7] update all fl to data_url

---
 lectures/simple_linear_regression.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lectures/simple_linear_regression.md b/lectures/simple_linear_regression.md
index af7ac9607..770902bdb 100644
--- a/lectures/simple_linear_regression.md
+++ b/lectures/simple_linear_regression.md
@@ -379,7 +379,7 @@ So let's build a list of the columns we want to import

 ```{code-cell} ipython3
 cols = ['Code', 'Year', 'Life expectancy at birth (historical)', 'GDP per capita']
-df = pd.read_csv(fl, usecols=cols)
+df = pd.read_csv(data_url, usecols=cols)
 df
 ```

From 380bedc1f96e5e63e6556ef203b035fdc5e6ef8b Mon Sep 17 00:00:00 2001
From: mmcky
Date: Wed, 21 Feb 2024 10:50:03 +1100
Subject: [PATCH 4/7] TST: add label but no caption

---
 lectures/simple_linear_regression.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/lectures/simple_linear_regression.md b/lectures/simple_linear_regression.md
index 770902bdb..0c58229bd 100644
--- a/lectures/simple_linear_regression.md
+++ b/lectures/simple_linear_regression.md
@@ -57,6 +57,11 @@ df
 We can use a scatter plot of the data to see the relationship between $y_i$ (ice-cream sales in dollars (\$\'s)) and $x_i$ (degrees Celsius).

 ```{code-cell} ipython3
+---
+mystnb:
+  figure:
+    name: wpdisc
+---
 ax = df.plot(
     x='X',

From 8fa667bd38e3043e13e6bea2f7f73520e08bf103 Mon Sep 17 00:00:00 2001
From: mmcky
Date: Wed, 21 Feb 2024 11:08:24 +1100
Subject: [PATCH 5/7] update numbered and captioned figures

---
 lectures/simple_linear_regression.md | 59 ++++++++++++++++++++++++----
 1 file changed, 51 insertions(+), 8 deletions(-)

diff --git a/lectures/simple_linear_regression.md b/lectures/simple_linear_regression.md
index 0c58229bd..07f8ea625 100644
--- a/lectures/simple_linear_regression.md
+++ b/lectures/simple_linear_regression.md
@@ -60,7 +60,8 @@ We can use a scatter plot of the data to see the relationship between $y_i$ (ice
 ---
 mystnb:
   figure:
-    name: wpdisc
+    caption: "Scatter plot"
+    name: sales-v-temp
 ---
 ax = df.plot(
     x='X',
@@ -88,8 +89,14 @@ df['Y_hat'] = α + β * df['X']
 ```

 ```{code-cell} ipython3
+---
+mystnb:
+  figure:
+    caption: "Scatter plot with a line of fit"
+    name: sales-v-temp2
+---
 fig, ax = plt.subplots()
-df.plot(x='X',y='Y', kind='scatter', ax=ax)
+ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
 df.plot(x='X',y='Y_hat', kind='line', ax=ax)
 ```

@@ -103,8 +110,14 @@ df['Y_hat'] = α + β * df['X']
 ```

 ```{code-cell} ipython3
+---
+mystnb:
+  figure:
+    caption: "Scatter plot with a line of fit"
+    name: sales-v-temp3
+---
 fig, ax = plt.subplots()
-df.plot(x='X',y='Y', kind='scatter', ax=ax)
+ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
 df.plot(x='X',y='Y_hat', kind='line', ax=ax)
 ```

@@ -114,8 +127,14 @@ df['Y_hat'] = α + β * df['X']
 ```

 ```{code-cell} ipython3
+---
+mystnb:
+  figure:
+    caption: "Scatter plot with a line of fit"
+    name: sales-v-temp4
+---
 fig, ax = plt.subplots()
-df.plot(x='X',y='Y', kind='scatter', ax=ax)
+ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
 df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
 ```

@@ -139,9 +158,15 @@ df
 ```

 ```{code-cell} ipython3
+---
+mystnb:
+  figure:
+    caption: "Plot of the residuals"
+    name: plt-residuals
+---
 fig, ax = plt.subplots()
-df.plot(x='X',y='Y', kind='scatter', ax=ax)
-df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
+ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
+ax = 
df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
 plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r');
 ```

@@ -181,6 +206,12 @@ for β in np.arange(20,100,0.5):
 Plotting the error

 ```{code-cell} ipython3
+---
+mystnb:
+  figure:
+    caption: "Plotting the error"
+    name: plt-errors
+---
 ax = pd.Series(errors).plot(xlabel='β', ylabel='error')
 plt.axvline(β_optimal, color='r');
 ```

@@ -196,6 +227,12 @@ for α in np.arange(-500,500,5):
 Plotting the error

 ```{code-cell} ipython3
+---
+mystnb:
+  figure:
+    caption: "Plotting the error (2)"
+    name: plt-errors2
+---
 ax = pd.Series(errors).plot(xlabel='α', ylabel='error')
 plt.axvline(α_optimal, color='r');
 ```

@@ -327,12 +364,18 @@ print(α)
 Now we can plot the OLS solution

 ```{code-cell} ipython3
+---
+mystnb:
+  figure:
+    caption: "OLS line of best fit"
+    name: plt-ols
+---
 df['Y_hat'] = α + β * df['X']
 df['error'] = df['Y_hat'] - df['Y']

 fig, ax = plt.subplots()
-df.plot(x='X',y='Y', kind='scatter', ax=ax)
-df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
+ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
+ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
 plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r');
 ```

From 052ea5e77b2574e7c587286993b9aefea3987582 Mon Sep 17 00:00:00 2001
From: mmcky
Date: Wed, 21 Feb 2024 11:38:25 +1100
Subject: [PATCH 6/7] ensure only one figure is returned

---
 lectures/simple_linear_regression.md | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/lectures/simple_linear_regression.md b/lectures/simple_linear_regression.md
index 07f8ea625..ea6e1493b 100644
--- a/lectures/simple_linear_regression.md
+++ b/lectures/simple_linear_regression.md
@@ -61,7 +61,7 @@ We can use a scatter plot of the data to see the relationship between $y_i$ (ice
 mystnb:
   figure:
     caption: "Scatter plot"
-    name: sales-v-temp
+    name: sales-v-temp1
 ---
 ax = df.plot(
     x='X',
@@ -97,7 +97,8 @@ mystnb:
 ---
 fig, ax = plt.subplots()
 ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
-df.plot(x='X',y='Y_hat', kind='line', ax=ax)
+ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax)
+plt.show()
 ```

 We can see that this model does a poor job of estimating the relationship.

@@ -113,12 +114,13 @@ df['Y_hat'] = α + β * df['X']
 ---
 mystnb:
   figure:
-    caption: "Scatter plot with a line of fit"
+    caption: "Scatter plot with a line of fit #2"
     name: sales-v-temp3
 ---
 fig, ax = plt.subplots()
 ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
-df.plot(x='X',y='Y_hat', kind='line', ax=ax)
+ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax)
+plt.show()
 ```

 ```{code-cell} ipython3
@@ -130,12 +132,13 @@ df['Y_hat'] = α + β * df['X']
 ---
 mystnb:
   figure:
-    caption: "Scatter plot with a line of fit"
+    caption: "Scatter plot with a line of fit #3"
     name: sales-v-temp4
 ---
 fig, ax = plt.subplots()
 ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
-df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
+ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
+plt.show()
 ```

 However, we need to formalize this guessing process by thinking of it as an optimization problem.

@@ -167,7 +170,8 @@ mystnb:
 fig, ax = plt.subplots()
 ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
 ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
-plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r');
+plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r')
+plt.show()
 ```

 The Ordinary Least Squares (OLS) method chooses $\alpha$ and $\beta$ in a way that **minimizes** the sum of the squared residuals (SSR). 
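As a cross-check on the minimization referenced by {eq}`eq:optimal-alpha` and {eq}`eq:optimal-beta`, here is a minimal sketch of the closed-form OLS estimates, assuming the ice-cream dataframe `df` with columns `X` and `Y` used throughout the lecture:

```python
# Closed-form OLS estimates derived from the first-order conditions:
#   β̂ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²   and   α̂ = ȳ - β̂ x̄
x, y = df['X'], df['Y']
x_bar, y_bar = x.mean(), y.mean()
β_hat = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
α_hat = y_bar - β_hat * x_bar
print(α_hat, β_hat)
```

These values should agree with the grid-search and plotted solutions above.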
@@ -231,7 +235,7 @@ Plotting the error
 mystnb:
   figure:
     caption: "Plotting the error (2)"
-    name: plt-errors2
+    name: plt-errors-2
 ---
 ax = pd.Series(errors).plot(xlabel='α', ylabel='error')
 plt.axvline(α_optimal, color='r');
 ```

From c8bc07863d6cfa5c8b4e99d9c6a0b092812f7a2f Mon Sep 17 00:00:00 2001
From: mmcky
Date: Wed, 21 Feb 2024 12:07:43 +1100
Subject: [PATCH 7/7] remove tip

---
 lectures/simple_linear_regression.md | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/lectures/simple_linear_regression.md b/lectures/simple_linear_regression.md
index ea6e1493b..137d4539a 100644
--- a/lectures/simple_linear_regression.md
+++ b/lectures/simple_linear_regression.md
@@ -506,11 +506,7 @@ This data shows a couple of interesting relationships.

 1. there are a number of countries with similar GDP per capita levels but a wide range in life expectancy
 2. there appears to be a positive relationship between GDP per capita and life expectancy. Countries with higher GDP per capita tend to have higher life expectancy outcomes

-Even though OLS is solving linear equations -- one option we have is to transform the variables, such as through a log transform, and then use OLS to estimate the transformed variables
-
-:::{tip}
-ln -> ln == elasticities
-:::
+Even though OLS solves linear equations, one option we have is to transform the variables, such as through a log transform, and then use OLS to estimate the transformed variables.

 By specifying `logx` you can plot the GDP per capita data on a log scale
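For the least absolute values exercise ({ref}`slr-ex2`), one possible numerical approach is sketched below; this assumes `scipy` is available and that `df` again holds the ice-cream data, and the helper name `sum_abs_residuals` and the starting point are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def sum_abs_residuals(params, x, y):
    # Least absolute deviations objective: Σ|yᵢ - α - βxᵢ|
    α, β = params
    return np.abs(y - α - β * x).sum()

# Nelder-Mead copes with the kink in the absolute-value objective,
# where gradient-based methods can struggle
res = minimize(sum_abs_residuals, x0=np.array([0.0, 0.0]),
               args=(df['X'].to_numpy(), df['Y'].to_numpy()),
               method='Nelder-Mead')
α_lad, β_lad = res.x
print(α_lad, β_lad)
```

Compared with OLS, the resulting line is less sensitive to outlying observations, which is the point of the exercise.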