Commit 55a0119: a proof read
1 parent: 27b451a

1 file changed: 29 additions, 32 deletions

lectures/simple_linear_regression.md

@@ -19,15 +19,15 @@ import pandas as pd
 import matplotlib.pyplot as plt
 ```
 
-The simple regression model estimates the relationship between two variables $X$ and $Y$
+The simple regression model estimates the relationship between two variables $x_i$ and $y_i$
 
 $$
 y_i = \alpha + \beta x_i + \epsilon_i, i = 1,2,...,N
 $$
 
-where $\epsilon_i$ represents the error in the estimates.
+where $\epsilon_i$ represents the error between the line of best fit and the sample values for $y_i$ given $x_i$.
 
-We would like to choose values for $\alpha$ and $\beta$ to build a line of "best" fit for some data that is available for variables $x_i$ and $y_i$.
+Our goal is to choose values for $\alpha$ and $\beta$ to build a line of "best" fit for some data that is available for variables $x_i$ and $y_i$.
 
 Let us consider a simple dataset of 10 observations for variables $x_i$ and $y_i$:
 
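Before looking at real data, note that the model itself is easy to simulate for assumed parameter values; a minimal sketch, where the parameters and noise scale are made up purely for illustration:

```python
# Simulate N observations from y_i = α + β·x_i + ε_i,
# with made-up parameter values purely for illustration
import numpy as np

rng = np.random.default_rng(seed=0)
N = 10
α, β = 5.0, 10.0                # assumed "true" parameters
x = rng.uniform(0, 35, size=N)  # e.g. temperatures in Celsius
ε = rng.normal(0, 20, size=N)   # the error term
y = α + β * x + ε
```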

@@ -44,7 +44,7 @@ Let us consider a simple dataset of 10 observations for variables $x_i$ and $y_i
 |9| 1800 | 27 |
 |10 | 250 | 2 |
 
-Let us think about $y_i$ as sales for an ice-cream cart, while $x_i$ is a variable the records the temperature in Celcius.
+Let us think about $y_i$ as sales for an ice-cream cart, while $x_i$ is a variable that records the day's temperature in Celsius.
 
 ```{code-cell} ipython3
 x = [32, 21, 24, 35, 10, 11, 22, 21, 27, 2]
@@ -71,7 +71,7 @@ as you can see the data suggests that more ice-cream is typically sold on hotter
 To build a linear model of the data we need to choose values for $\alpha$ and $\beta$ that represents a line of "best" fit such that
 
 $$
-\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i
+\hat{y_i} = \hat{\alpha} + \hat{\beta} x_i
 $$
 
 Let's start with $\alpha = 5$ and $\beta = 10$
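For reference, the fitted values for this guess can be computed directly; a minimal sketch, assuming `df` holds the `X` and `Y` columns built from the lists above (the lecture's own cell may differ):

```python
# Fitted values under the guessed parameters, assuming df holds the
# X and Y columns built from the lists above
α = 5
β = 10
df['Y_hat'] = α + β * df['X']
```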
@@ -88,6 +88,8 @@ df.plot(x='X',y='Y', kind='scatter', ax=ax)
 df.plot(x='X',y='Y_hat', kind='line', ax=ax)
 ```
 
+We can see that this model does a poor job of estimating the relationship.
+
 We can continue to guess and iterate towards a line of "best" fit by adjusting the parameters
 
 ```{code-cell} ipython3
@@ -112,7 +114,7 @@ df.plot(x='X',y='Y', kind='scatter', ax=ax)
 df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
 ```
 
-We need to think about formalising this process by thinking of this problem as an optimization problem.
+However, we need to formalise this guessing process by treating it as an optimization problem.
 
 Let's consider the error $\epsilon_i$ and define the difference between the observed values $y_i$ and the estimated values $\hat{y}_i$ which we will call the residuals
 
@@ -135,7 +137,7 @@ df
 fig, ax = plt.subplots()
 df.plot(x='X',y='Y', kind='scatter', ax=ax)
 df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
-plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r')
+plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r');
 ```
 
 The Ordinary Least Squares (OLS) method, as the name suggests, chooses $\alpha$ and $\beta$ in such a way that **minimises** the Sum of the Squared Residuals (SSR).
@@ -150,9 +152,7 @@
 C = \sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)^2}
 $$
 
-that we would like to minimise.
-
-We can then make use of calculus to find a solution by taking the partial derivative of the cost function $C$ with respect to $\alpha$ and $\beta$
+that we would like to minimise with parameters $\alpha$ and $\beta$.
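The cost function is easy to evaluate numerically for any candidate pair of parameters; a minimal sketch, assuming the `x` and `y` lists defined earlier in the lecture:

```python
# Sum of squared residuals as a function of the parameters,
# assuming the x and y lists defined earlier in the lecture
def C(α, β, x, y):
    return sum((y_i - α - β * x_i) ** 2 for x_i, y_i in zip(x, y))

print(C(5, 10, x, y))  # cost of the initial guess
```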
 
 ## How does error change with respect to $\alpha$ and $\beta$
 
@@ -198,7 +198,7 @@ plt.axvline(α_optimal, color='r');
 (slr:optimal-values)=
 ## Calculating Optimal Values
 
-Now let us use calculus to compute the optimal values for $\alpha$ and $\beta$ to solve the ordinary least squares solution.
+Now let us use calculus to solve the optimization problem and compute the optimal values for $\alpha$ and $\beta$ to find the ordinary least squares solution.
 
 First taking the partial derivative with respect to $\alpha$
 
@@ -212,8 +212,7 @@
 0 = \sum_{i=1}^{N}{-2(y_i - \alpha - \beta x_i)}
 $$
 
-we can remove the constant $-2$ from the summation and devide both sides by $-2$
-
+we can remove the constant $-2$ from the summation by dividing both sides by $-2$
 
 $$
 0 = \sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)}
@@ -225,7 +224,7 @@
 0 = \sum_{i=1}^{N}{y_i} - \sum_{i=1}^{N}{\alpha} - \beta \sum_{i=1}^{N}{x_i}
 $$
 
-The middle term is simple to sum from $i=1,...N$ by a constant $\alpha$
+The middle term is a straightforward sum of the constant $\alpha$ from $i=1,...,N$
 
 $$
 0 = \sum_{i=1}^{N}{y_i} - N*\alpha - \beta \sum_{i=1}^{N}{x_i}
@@ -237,7 +236,7 @@
 \alpha = \frac{\sum_{i=1}^{N}{y_i} - \beta \sum_{i=1}^{N}{x_i}}{N}
 $$
 
-Both fractions resolve to the means $\bar{y_i}$ and $\bar{x_i}$
+We observe that both fractions resolve to the means $\bar{y_i}$ and $\bar{x_i}$
 
 $$
 \alpha = \bar{y_i} - \beta\bar{x_i}
@@ -267,7 +266,7 @@
 0 = \sum_{i=1}^{N}{(x_i y_i - \alpha x_i - \beta x_i^2)}
 $$
 
-now substituting $\alpha$
+now substituting for $\alpha$
 
 $$
 0 = \sum_{i=1}^{N}{(x_i y_i - (\bar{y_i} - \beta \bar{x_i}) x_i - \beta x_i^2)}
@@ -285,13 +284,12 @@
 0 = \sum_{i=1}^{N}(x_i y_i - \bar{y_i} x_i) + \beta \sum_{i=1}^{N}(\bar{x_i} x_i - x_i^2)
 $$
 
-and solving for $\beta$
+and solving for $\beta$ yields
 
 $$
 \beta = \frac{\sum_{i=1}^{N}(x_i y_i - \bar{y_i} x_i)}{\sum_{i=1}^{N}(x_i^2 - \bar{x_i} x_i)}
 $$ (eq:optimal-beta)
-
 We can now use {eq}`eq:optimal-alpha` and {eq}`eq:optimal-beta` to calculate the optimal values for $\alpha$ and $\beta$
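For reference, the two closed-form expressions translate directly into code; a sketch assuming the `x` and `y` lists from the toy dataset above (the lecture's own cells may compute the same quantities via the DataFrame):

```python
# OLS estimates following eq:optimal-beta and eq:optimal-alpha,
# assuming the x and y lists from the toy dataset above
import numpy as np

x_arr, y_arr = np.array(x), np.array(y)
β = ((x_arr * y_arr).sum() - y_arr.mean() * x_arr.sum()) / \
    ((x_arr ** 2).sum() - x_arr.mean() * x_arr.sum())
α = y_arr.mean() - β * x_arr.mean()
print(f"α = {α:.2f}, β = {β:.2f}")
```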
 
 Calculating $\beta$
@@ -343,14 +341,13 @@ TODO
 :::{exercise}
 :label: slr-ex1
 
-Now that you know the equations to solve the simple linear regression model using OLS
-you can now run your own regressions to build a model between $y$ and $x$.
+Now that you know the equations that solve the simple linear regression model using OLS you can run your own regressions to build a model between $y$ and $x$.
 
-Consider two economic variables GDP per capita and Life Expectancy.
+Let's consider two economic variables, GDP per capita and Life Expectancy.
 
 1. What do you think their relationship would be?
 2. Gather some data [from our world in data](https://ourworldindata.org)
 3. Use `pandas` to import the `csv` formatted data and plot a few different countries of interest
 4. Use {eq}`eq:optimal-alpha` and {eq}`eq:optimal-beta` to compute optimal values for $\alpha$ and $\beta$
 5. Plot the line of best fit found using OLS
 6. Interpret the coefficients and write a summary sentence of the relationship between GDP per capita and Life Expectancy
@@ -381,9 +378,9 @@ df
 
 You can see that the data downloaded from Our World in Data has provided a global set of countries with the GDP per capita and Life Expectancy Data.
 
-It is often a good idea to at first import a few lines of data from a csv to understand its structure so that you can then choose the columns that you want to read into your program.
+It is often a good idea to first import a few lines of data from a csv to understand its structure so that you can then choose the columns that you want to read into your DataFrame.
 
-There are a bunch of columns we won't need to import such as `Continent`
+You can observe that there are a bunch of columns we won't need to import, such as `Continent`
 
 So let's build a list of the columns we want to import
 
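A sketch of that import step; the filename and column names below are placeholders that would need to match the csv actually downloaded from Our World in Data (the lecture renames the columns to `cntry`, `year`, `life_expectency`, and `gdppc`):

```python
# Import only the columns we need; the filename and column names are
# placeholders and must be adjusted to match the downloaded csv
import pandas as pd

cols = ['Code', 'Year', 'Life expectancy', 'GDP per capita']  # placeholder names
df = pd.read_csv('gdp-vs-life-expectancy.csv', usecols=cols)  # placeholder file
df.columns = ["cntry", "year", "life_expectency", "gdppc"]    # names used in the lecture
```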
@@ -400,7 +397,7 @@ df.columns = ["cntry", "year", "life_expectency", "gdppc"]
 df
 ```
 
-We can see there are `NaN` values or missing data so let us go ahead and drop those
+We can see there are `NaN` values, which represent missing data, so let us go ahead and drop those
 
 ```{code-cell} ipython3
 df.dropna(inplace=True)
@@ -416,7 +413,7 @@ Now we have a dataset containing life expectency and GDP per capita for a range
 
 It is always a good idea to spend a bit of time understanding what data you actually have.
 
-For example, you may want to explore this data to see if data is consistently reported for all countries across years
+For example, you may want to explore this data to see if there is consistent reporting for all countries across years
 
 Let's first look at the Life Expectency Data
 
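One way to get a country-by-year view of the life expectancy data is to pivot the DataFrame; a sketch assuming the `df` built above (the lecture's own construction of `le_years` may differ):

```python
# Reshape to one row per country and one column per year, assuming the
# df built above; the lecture's own construction of le_years may differ
le_years = df.pivot_table(index='cntry', columns='year', values='life_expectency')
le_years
```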
@@ -427,7 +424,7 @@ le_years
 
 As you can see there are a lot of countries where data is not available for the Year 1543!
 
-Which country does report this data
+Which country does report this data?
 
 ```{code-cell} ipython3
 le_years[~le_years[1543].isna()]
@@ -443,12 +440,12 @@ le_years.loc['GBR'].plot()
 
 In fact we can use pandas to quickly check how many countries are captured in each year
 
-So it is clear that if you are doing cross-sectional comparisons then more recent data will include a wider set of countries
-
 ```{code-cell} ipython3
 le_years.stack().unstack(level=0).count(axis=1).plot(xlabel="Year", ylabel="Number of countries");
 ```
 
+So it is clear that if you are doing cross-sectional comparisons then more recent data will include a wider set of countries
+
 Now let us consider the most recent year in the dataset 2018
 
 ```{code-cell} ipython3
@@ -462,9 +459,9 @@ df.plot(x='gdppc', y='life_expectency', kind='scatter', xlabel="GDP per capita"
 This data shows a couple of interesting relationships.
 
 1. there are a number of countries with similar GDP per capita levels but a wide range in Life Expectency
-2. appears to be a positive relationship between GDP per capita and life expectency. Countries with higher GDP per capita tend to have higher life expectency outcomes
+2. there appears to be a positive relationship between GDP per capita and life expectency. Countries with higher GDP per capita tend to have higher life expectency outcomes
 
-Even though OLS is solving linear equations -- one option is to transform the variables, such as through a log transform, and then use OLS to estimate the relationships
+Even though OLS is solving linear equations, one option we have is to transform the variables, such as through a log transform, and then use OLS to estimate the transformed relationship
 
 :::{tip}
 ln -> ln == elasticities
@@ -476,7 +473,7 @@ By specifying `logx` you can plot the GDP per Capita data on a log scale
 df.plot(x='gdppc', y='life_expectency', kind='scatter', xlabel="GDP per capita", ylabel="Life Expectency (Years)", logx=True);
 ```
 
-As you can see from this transformation -- a linear model fits the shape of the data more closely.
+As you can see from this transformation, a linear model fits the shape of the data more closely.
 
 ```{code-cell} ipython3
 df['log_gdppc'] = df['gdppc'].apply(np.log10)
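On the tip above: in a log-log (ln to ln) specification the slope is interpreted as an elasticity, the percentage change in $y$ associated with a one percent change in $x$. A sketch of that variant, assuming the `df` from this exercise; the lecture itself transforms only GDP per capita, via `log10`:

```python
# A log-log (elasticity) variant of the regression, assuming the df from
# this exercise; the lecture itself transforms only gdppc, via log10
import numpy as np

x = np.log(df['gdppc'])
y = np.log(df['life_expectency'])

# OLS slope and intercept following eq:optimal-beta and eq:optimal-alpha
β = ((x * y).sum() - y.mean() * x.sum()) / ((x ** 2).sum() - x.mean() * x.sum())
α = y.mean() - β * x.mean()
print(f"elasticity estimate: β = {β:.3f}")
```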
