-We need to think about formalising this process by thinking of this problem as an optimization problem.
+However we need to think about formalising this guessing process by thinking of this problem as an optimization problem.
 
 Let's consider the error $\epsilon_i$ and define the difference between the observed values $y_i$ and the estimated values $\hat{y}_i$ which we will call the residuals
The Ordinary Least Squares (OLS) method, as the name suggests, chooses $\alpha$ and $\beta$ in such a way that **minimises** the Sum of the Squared Residuals (SSR).
@@ -150,9 +152,7 @@ $$
 C = \sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)^2}
 $$
 
-that we would like to minimise.
-
-We can then make use of calculus to find a solution by taking the partial derivative of the cost function $C$ with respect to $\alpha$ and $\beta$
+that we would like to minimise with parameters $\alpha$ and $\beta$.
 
 ## How does error change with respect to $\alpha$ and $\beta$
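To make the cost function $C$ concrete, here is a minimal sketch that evaluates the sum of squared residuals for a few candidate values of $\alpha$ and $\beta$. The function name `ssr` and the toy data are assumptions made for illustration only; they are not part of the lecture.

```python
import numpy as np

def ssr(alpha, beta, x, y):
    """Sum of squared residuals C for the candidate line alpha + beta * x."""
    residuals = y - alpha - beta * x
    return np.sum(residuals ** 2)

# Hypothetical toy data that lies exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

print(ssr(1.0, 2.0, x, y))  # parameters that generated the data
print(ssr(0.0, 2.0, x, y))  # intercept too low
print(ssr(1.0, 1.0, x, y))  # slope too low
```

The cost is zero only at the parameters that generated the data and grows as either parameter moves away from them, which is the behaviour the minimisation relies on.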
-Now let us use calculus to compute the optimal values for $\alpha$ and $\beta$ to solve the ordinary least squares solution.
+Now let us use calculus to solve the optimization problem and compute the optimal values for $\alpha$ and $\beta$ to find the ordinary least squares solution.
 
 First taking the partial derivative with respect to $\alpha$
 
@@ -212,8 +212,7 @@ $$
 0 = \sum_{i=1}^{N}{-2(y_i - \alpha - \beta x_i)}
 $$
 
-we can remove the constant $-2$ from the summation and devide both sides by $-2$
-
+we can remove the constant $-2$ from the summation by dividing both sides by $-2$
We can now use {eq}`eq:optimal-alpha` and {eq}`eq:optimal-beta` to calculate the optimal values for $\alpha$ and $\beta$
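As a rough sketch of that calculation, the snippet below assumes the two referenced equations take the standard closed form for simple linear regression, $\beta = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$ and $\alpha = \bar{y} - \beta \bar{x}$. The equations themselves are not shown in this diff, so treat that as an assumption. The helper name `ols_fit` and the data are hypothetical.

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form OLS estimates (alpha, beta) for y = alpha + beta * x."""
    x_bar, y_bar = x.mean(), y.mean()
    beta = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    alpha = y_bar - beta * x_bar
    return alpha, beta

# Hypothetical toy data, roughly on the line y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 4.1, 6.2, 7.9, 10.1])

alpha_hat, beta_hat = ols_fit(x, y)

# Cross-check against NumPy's degree-1 least squares polynomial fit
slope, intercept = np.polyfit(x, y, 1)
print(alpha_hat, beta_hat)
print(intercept, slope)
```

Both approaches minimise the same sum of squared residuals, so the two printed pairs should agree up to floating point error.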
 
 Calculating $\beta$
@@ -343,14 +341,13 @@ TODO
 :::{exercise}
 :label: slr-ex1
 
-Now that you know the equations to solve the simple linear regression model using OLS
-you can now run your own regressions to build a model between $y$ and $x$.
+Now that you know the equations that solve the simple linear regression model using OLS you can now run your own regressions to build a model between $y$ and $x$.
 
-Consider two economic variables GDP per capita and Life Expectancy.
+Let's consider two economic variables GDP per capita and Life Expectancy.
 
 1. What do you think their relationship would be?
 2. Gather some data [from our world in data](https://ourworldindata.org)
-3. Use `pandas` to import the `csv` formatted data and plot a few different countries of interest
+3. Use `pandas` to import the `csv` formated data and plot a few different countries of interest
 4. Use {eq}`eq:optimal-alpha` and {eq}`eq:optimal-beta` to compute optimal values for $\alpha$ and $\beta$
 5. Plot the line of best fit found using OLS
 6. Interpret the coefficients and write a summary sentence of the relationship between GDP per capita and Life Expectancy
@@ -381,9 +378,9 @@ df
 
 You can see that the data downloaded from Our World in Data has provided a global set of countries with the GDP per capita and Life Expectancy Data.
 
-It is often a good idea to at first import a few lines of data from a csv to understand its structure so that you can then choose the columns that you want to read into your program.
+It is often a good idea to at first import a few lines of data from a csv to understand its structure so that you can then choose the columns that you want to read into your DataFrame.
 
-There are a bunch of columns we won't need to import such as `Continent`
+You can observe that there are a bunch of columns we won't need to import such as `Continent`
 
 So let's built a list of the columns we want to import
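The preview-then-select workflow might look like the sketch below. It uses `pandas.read_csv` with its `nrows` and `usecols` parameters on a small in-memory CSV. The header names, the country rows, and the short name `cntry` are invented for illustration and are not the real Our World in Data layout; `gdppc` and `life_expectency` follow the names used elsewhere in this lecture.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for the downloaded file
csv_text = """Entity,Code,Year,Life expectancy,GDP per capita,Continent
Freedonia,FRD,2017,70.1,10000,Europe
Freedonia,FRD,2018,70.5,10500,Europe
Sylvania,SYL,2018,65.2,4000,Europe
"""

# Step 1: read only the first couple of rows to inspect the structure
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(preview.columns.tolist())

# Step 2: list the columns worth keeping and read just those
cols = ["Code", "Year", "Life expectancy", "GDP per capita"]
df = pd.read_csv(io.StringIO(csv_text), usecols=cols)

# Shorter column names, assigned in the order the columns appear in the file
df.columns = ["cntry", "year", "life_expectency", "gdppc"]
print(df)
```

With a real file you would pass its path or URL in place of the `io.StringIO` object.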
-We can see there are `NaN` values or missing data so let us go ahead and drop those
+We can see there are `NaN` values which represents missing data so let us go ahead and drop those
 
 ```{code-cell} ipython3
 df.dropna(inplace=True)
@@ -416,7 +413,7 @@ Now we have a dataset containing life expectency and GDP per capita for a range
 
 It is always a good idea to spend a bit of time understanding what data you actually have.
 
-For example, you may want to explore this data to see if data is consistently reported for all countries across years
+For example, you may want to explore this data to see if there is consistent reporting for all countries across years
 
 Let's first look at the Life Expectency Data
 
@@ -427,7 +424,7 @@ le_years
 
 As you can see there are a lot of countries where data is not available for the Year 1543!
 
-Which country does report this data
+Which country does report this data?
 
 ```{code-cell} ipython3
 le_years[~le_years[1543].isna()]
@@ -443,12 +440,12 @@ le_years.loc['GBR'].plot()
 
 In fact we can use pandas to quickly check how many countries are captured in each year
 
-So it is clear that if you are doing cross-sectional comparisons then more recent data will include a wider set of countries
-
 ```{code-cell} ipython3
 le_years.stack().unstack(level=0).count(axis=1).plot(xlabel="Year", ylabel="Number of countries");
 ```
 
+So it is clear that if you are doing cross-sectional comparisons then more recent data will include a wider set of countries
+
 Now let us consider the most recent year in the dataset 2018
 
 ```{code-cell} ipython3
@@ -462,9 +459,9 @@ df.plot(x='gdppc', y='life_expectency', kind='scatter', xlabel="GDP per capita"
 This data shows a couple of interesting relationships.
 
 1. there are a number of countries with similar GDP per capita levels but a wide range in Life Expectency
-2. appears to be a positive relationship between GDP per capita and life expectency. Countries with higher GDP per capita tend to have higher life expectency outcomes
+2. there appears to be a positive relationship between GDP per capita and life expectency. Countries with higher GDP per capita tend to have higher life expectency outcomes
 
-Even though OLS is solving linear equations -- one option is to transform the variables, such as through a log transform, and then use OLS to estimate the relationships
+Even though OLS is solving linear equations -- one option we have is to transform the variables, such as through a log transform, and then use OLS to estimate the transformed variables
 
 :::{tip}
 ln -> ln == elasticities
@@ -476,7 +473,7 @@ By specifying `logx` you can plot the GDP per Capita data on a log scale
 df.plot(x='gdppc', y='life_expectency', kind='scatter', xlabel="GDP per capita", ylabel="Life Expectency (Years)", logx=True);
 ```
 
-As you can see from this transformation -- a linear model fits the shape of the data more closely.
+As you can see from this transformation -- a linear model fits the shape of the data more closely.
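One way to act on this observation is to take the natural log of GDP per capita and apply the same OLS formulas to the transformed variable. The sketch below assumes the standard closed-form estimates and uses synthetic data built to be exactly linear in log GDP. The arrays `gdppc` and `life_exp` are invented for illustration and are not drawn from the Our World in Data file.

```python
import numpy as np

# Synthetic data: life expectancy constructed as 20 + 5 * ln(GDP per capita)
gdppc = np.array([500.0, 1000.0, 5000.0, 10000.0, 50000.0])
life_exp = 20.0 + 5.0 * np.log(gdppc)

# Transform the regressor, then apply the usual OLS formulas to log_x
log_x = np.log(gdppc)
x_bar, y_bar = log_x.mean(), life_exp.mean()
beta = np.sum((log_x - x_bar) * (life_exp - y_bar)) / np.sum((log_x - x_bar) ** 2)
alpha = y_bar - beta * x_bar

print(alpha, beta)
```

In a model of this form the slope is read differently from the untransformed case: a 1% increase in GDP per capita is associated with a change of roughly $\beta / 100$ years in life expectancy.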