## Some motivation and more Stata commands
### `egen` and `bysort`

Next we will go through some extremely useful and idiomatic Stata commands, namely `egen` and `bysort`. `egen` stands for *extended generate* and is probably one of the most important Stata commands after `help` and `gen`.

Recall that the `gen` command requires the user to assign the variable's values either from existing Stata functions or by typing the values in directly. For instance:

```
gen dummy = 0
replace dummy = 1 if !missing(some_variable)

```

Or directly from the return value of the `missing` function:

```
gen dummy = !missing(some_variable)

```

It's clearly useful to use the return values of built-in Stata functions where possible. But where can we find useful Stata functions?

```{note}

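Strictly speaking, Stata functions are a different thing from commands: a function such as `missing()` returns a value and is used inside an expression, whereas a command such as `gen` is what you type at the start of a line. So far I've used both terms loosely when referring to things you can type on the command line.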

```

`missing` is one function we are already familiar with. So are `strpos`, `substr` and the other [string manipulation functions](https://www.stata.com/manuals/fnstringfunctions.pdf) mentioned in the previous exercise session. Other examples are [statistical functions](https://www.stata.com/manuals/fnstatisticalfunctions.pdf) such as `normal(z)`, which returns the cumulative standard normal probability of a given z-value. The statistical functions are mostly functions for statistical distributions. They can also be used with the `display` command.
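
Because functions return a value, they can be evaluated directly with `display`, without any data in memory. The lines below are a small sketch using the functions mentioned above; the string `"Hello world!"` is just an illustrative input.

```stata
* First five characters of the string: Hello
display substr("Hello world!", 1, 5)
* Position where "world" starts in the string: 7
display strpos("Hello world!", "world")
* Cumulative standard normal probability at z = 0: .5
display normal(0)
* missing() returns 1 when its argument is a missing value
display missing(.)
```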

Probably even more useful are the various [random number generators](https://www.stata.com/manuals/fnrandom-numberfunctions.pdf), which can be used for instance in simulations. More on those later.
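
As a brief preview, here is a minimal sketch of drawing random numbers reproducibly. The seed value `12345` and the variable names `u` and `z` are arbitrary choices made for illustration.

```stata
clear
set obs 5
* Fix the seed so that the same pseudo-random draws appear on every run
set seed 12345
* Uniform draws on the interval (0,1)
gen u = runiform()
* Normal draws with mean 10 and standard deviation 2
gen z = rnormal(10, 2)
list
```

Without `set seed`, each run of the same code would generally produce different draws.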

Next, here is a small example that demonstrates different options for `generate`. I use the `quietly` prefix to tell Stata not to print any output for the commands inside the braces.

In [1]:
quietly {
    clear
    set obs 10
    gen text = "Hello world!"
    gen first_char = substr(text,1,1)
    gen random = rnormal()
    gen p = normal(random)
}    

In [2]:
%browse

Unnamed: 0,text,first_char,random,p
1,Hello world!,H,2.0255878,0.97859645
2,Hello world!,H,1.0426311,0.85144043
3,Hello world!,H,0.29771236,0.61703867
4,Hello world!,H,-1.7221316,0.04252284
5,Hello world!,H,-0.72919953,0.23293981
6,Hello world!,H,0.86182606,0.80560839
7,Hello world!,H,-0.23935401,0.40541553
8,Hello world!,H,0.51654899,0.69726449
9,Hello world!,H,-1.8120158,0.034991879
10,Hello world!,H,-1.0151243,0.15502329


Oftentimes we want to generate variables based on some characteristics of existing variables (data). This is where we want to use `egen`. It has various functions that can do exactly that. Let's list a few:

1. mean(expression)
    * creates a variable that holds the mean of expression on every observation
2. sd(expression)
    * creates a variable that holds the standard deviation of expression on every observation
3. max(expression)
    * creates a variable that holds the maximum of expression on every observation
4. count(expression)
    * creates a variable that holds the number of nonmissing observations of expression
    
The list goes on. You can view the full list of `egen` functions by typing `help egen` on the command line. Let's now try the command in action:

In [3]:
qui{
    clear
    set obs 1000000
    gen random_var = rnormal()
    egen average = mean(random_var)
    egen sd = sd(random_var)  
}

In [4]:
%browse 5

Unnamed: 0,random_var,average,sd
1,1.0953676,-0.0010069405,1.0000892
2,1.3326492,-0.0010069405,1.0000892
3,-0.360688,-0.0010069405,1.0000892
4,0.73510557,-0.0010069405,1.0000892
5,-0.96148294,-0.0010069405,1.0000892
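
The other functions in the list work the same way. The sketch below uses Stata's bundled `auto` dataset (an assumption made for illustration, not the course data) because its `rep78` variable contains missing values, which shows what `count` actually counts. The names `n_rep78` and `max_price` are made up for this example.

```stata
sysuse auto, clear
* Number of cars with a nonmissing repair record
egen n_rep78 = count(rep78)
* Highest price in the data
egen max_price = max(price)
* Both new variables hold the same value on every row
list make price max_price rep78 n_rep78 in 1/5
```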


The real power of `egen` comes from the possibility of using some of its functions with `by(varlist)`. This is a way to execute Stata commands separately for each group of observations.

The best way to demonstrate this is by example. Let's continue working with our data from Sweden.

In [5]:
cd Z:/ECON-C4100 // change working dir
use data/sweden_prices.dta, clear
qui destring _all, replace dpcomma
describe


Z:\ECON-C4100




Contains data from data/sweden_prices.dta
  obs:        14,276                          
 vars:            15                          24 Jan 2021 22:20
 size:     6,395,648                          
------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------
Produktnamn     str50   %50s                  Produktnamn
Varunummer      long    %10.0g                Varunummer
ATCkod          str16   %16s                  ATC-ko

In this example, we calculate the conditional expectation of the price within each of [the anatomical main groups in our data](https://www.whocc.no/atc/structure_and_principles/), using only products whose pharmacy purchasing price ($ppp$, the variable `AIP` in the data) is more than 1000 kronor. More specifically, we calculate

$E(ppp \;| \;ATC_1, \;ppp > 1000)$


Steps:

1. Extract the 1st level from the ATC variable
2. Use `egen` with `by` and `cond`

`cond` is a regular (and very useful!) function: `cond(x, a, b)` returns `a` if `x` is true and `b` otherwise.
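
A minimal sketch of how `cond` behaves, using made-up prices of 1500 and 500 in place of `AIP`:

```stata
* The condition is true, so the second argument is returned: 1500
display cond(1500 > 1000, 1500, .)
* The condition is false, so the third argument is returned: missing (.)
display cond(500 > 1000, 500, .)
```

Inside `egen`'s `mean()`, the missing values returned for cheaper products are simply ignored, so only prices above 1000 enter the average. Note that Stata treats a missing value as larger than any number, so a missing price also satisfies `> 1000`; `cond` then returns that missing price itself, which is ignored as well.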

In [12]:
gen atc_1 = substr(ATCkod,1,1)
tab atc_1

bysort atc_1: egen cond_mean = mean(cond(AIP> 1000, AIP, .))
tab cond_mean




      atc_1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          A |      1,433       10.04       10.04
          B |        962        6.74       16.78
          C |      1,443       10.11       26.88
          D |        431        3.02       29.90
          G |        567        3.97       33.88
          H |        403        2.82       36.70
          J |      1,032        7.23       43.93
          L |      1,804       12.64       56.56
          M |        399        2.79       59.36
          N |      4,510       31.59       90.95
          P |         45        0.32       91.27
          R |        746        5.23       96.49
          S |        367        2.57       99.06
          V |        134        0.94      100.00
------------+-----------------------------------
      Total |     14,276      100.00



  cond_mean |      Freq.     Percent        Cum.
------------+-----------------------------------
   2022.123 |        567  

There's also a simpler way to do almost the same thing.

In [13]:
bysort atc_1: egen cond_mean_2 = mean(AIP) if AIP > 1000
tab cond_mean_2


(10504 missing values generated)


cond_mean_2 |      Freq.     Percent        Cum.
------------+-----------------------------------
   2022.123 |        133        3.53        3.53
   2409.912 |         38        1.01        4.53
   2670.016 |        562       14.90       19.43
   3240.064 |         10        0.27       19.70
   3439.819 |         91        2.41       22.11
    6087.08 |        252        6.68       28.79
   6588.213 |        183        4.85       33.64
   7943.088 |          5        0.13       33.78
   8277.331 |         58        1.54       35.31
   8402.929 |        565       14.98       50.29
   10566.04 |         95        2.52       52.81
   13632.82 |        322        8.54       61.35
   14347.02 |        135        3.58       64.93
      14645 |      1,323       35.07      100.00
------------+-----------------------------------
      Total |      3,772      100.00


However, notice that the expectation is now saved only for those observations that satisfy the condition; all other observations are set to missing.
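
The difference is easiest to see on a tiny dataset. The sketch below uses made-up data; the variable names `group`, `price`, `m_all` and `m_if` are hypothetical and chosen only for this illustration.

```stata
clear
input str1 group price
"A" 500
"A" 1500
"A" 2500
"B" 800
"B" 3000
end
* Condition inside the function: every row of the group receives the mean
bysort group: egen m_all = mean(cond(price > 1000, price, .))
* Condition as an if qualifier: rows that fail it are left missing
bysort group: egen m_if = mean(price) if price > 1000
list, sepby(group)
```

In group A the conditional mean is 2000 and in group B it is 3000. `m_all` holds these values on all five rows, while `m_if` is missing on the rows with prices 500 and 800.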

It's also worthwhile to note that `by` can be written with several different notations:

In [14]:
by atc_1, sort: egen cond_mean_3 = mean(cond(AIP> 1000, AIP, .))
egen cond_mean_4 = mean(cond(AIP> 1000, AIP, .)), by(atc_1)
tab cond_mean_3
tab cond_mean_4





cond_mean_3 |      Freq.     Percent        Cum.
------------+-----------------------------------
   2022.123 |        567        3.97        3.97
   2409.912 |        431        3.02        6.99
   2670.016 |      4,510       31.59       38.58
   3240.064 |         45        0.32       38.90
   3439.819 |        134        0.94       39.84
    6087.08 |      1,433       10.04       49.87
   6588.213 |        403        2.82       52.70
   7943.088 |        367        2.57       55.27
   8277.331 |        399        2.79       58.06
   8402.929 |        962        6.74       64.80
   10566.04 |      1,443       10.11       74.91
   13632.82 |      1,032        7.23       82.14
   14347.02 |        746        5.23       87.36
      14645 |      1,804       12.64      100.00
------------+-----------------------------------
      Total |     14,276      100.00


cond_mean_4 |      Freq.     Percent        Cum.
------------+-----------------------------------
   2022.123 |        567  

### User-written packages: `egenmore` and some `motivate`

Stata is a fully programmable language. While it is possible to write your own commands and functions, most of the time someone else has already done the work for you. The Stata community has produced many widely used user-written packages that you can download directly into Stata with the `ssc` or `net` command, depending on where the program is published. The best of these packages are published in [the Stata Journal](https://www.stata-journal.com/). One of the giants of the community, [Nick Cox](https://www.statalist.org/forums/member/6-nick-cox), is responsible for several of them.

The most downloaded packages can be printed with the following command:

In [17]:
ssc hot


Top 10 packages at SSC

        Dec 2020   
  Rank   # hits    Package       Author(s)
  ----------------------------------------------------------------------
     1  87810.0    ritest        Simon Hess                              
     2  42619.3    outreg2       Roy Wada                                
     3  40865.8    estout        Ben Jann                                
     4  22085.3    asdoc         Attaullah Shah                          
     5  17479.3    winsor2       Lian Yu-jun                             
     6  11304.8    ivreg2        Mark E Schaffer, Christopher F Baum,    
                                   Steven Stillman                         
     7  11267.2    ivreg210      Christopher F Baum, Steven Stillman,    
                                   Mark E Schaffer                         
     8  11082.4    ivreg29       Christopher F Baum, Mark E Schaffer,    
                                   Steven Stillman                         
     9  10893.8    

Next we will download two user-written packages which, despite their popularity, are not listed in the Dec 2020 hotlist: `motivate` and `egenmore`.

```
ssc install motivate, replace 
ssc install egenmore, replace

```
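
If you are unsure whether a package is already installed, one common pattern is to let Stata look for the command first and install only when the lookup fails. This is a sketch, not a required step; it assumes an internet connection and that the `motivate` package provides a command of the same name.

```stata
* which returns a nonzero code in _rc when the command cannot be found
capture which motivate
if _rc != 0 {
    ssc install motivate
}
* Show a short description of a package hosted on SSC
ssc describe egenmore
```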

I've already installed these packages, so we don't run the `ssc` commands above. However, we can see them in action, starting with the `egenmore` package, which adds functions to `egen` (there is no `egenmore` command). For example, `corr(x y)` can be used to calculate OLS estimates manually. We won't do it here, but you will have a similar exercise in Problem Set 3.

To calculate (and save) the covariance and correlation between `AIP` and the length of the name of the product, let's type:

In [18]:
qui {
    gen name_length = length(Produktnamn)
    egen correlation = corr(AIP name_length)
    egen covariance = corr(AIP name_length), covariance
}
tab correlation
tab covariance



. tab correlation

Correlation |
     of AIP |
name_length |      Freq.     Percent        Cum.
------------+-----------------------------------
  -.1616081 |     14,276      100.00      100.00
------------+-----------------------------------
      Total |     14,276      100.00

. tab covariance

 Covariance |
     of AIP |
name_length |      Freq.     Percent        Cum.
------------+-----------------------------------
  -11033.07 |     14,276      100.00      100.00
------------+-----------------------------------
      Total |     14,276      100.00


Having done that, we can congratulate ourselves with the `motivate` command:

In [19]:
motivate

'I find that the harder I work, the more luck I seem to have'
                                            Thomas Jefferson


### Basic univariate regressions in Stata

Doing OLS regressions in Stata is quite easy. You use the command `regress`, or `reg` for short, followed by the dependent variable and the independent variable(s). Let's try the following (linear) model in Stata:

$AIP = \beta_0 + \beta_1 \times nameLength + u$

In [20]:
reg AIP name_length


      Source |       SS           df       MS      Number of obs   =    14,276
-------------+----------------------------------   F(1, 14274)     =    382.79
       Model |  4.2959e+10         1  4.2959e+10   Prob > F        =    0.0000
    Residual |  1.6019e+12    14,274   112225807   R-squared       =    0.0261
-------------+----------------------------------   Adj R-squared   =    0.0260
       Total |  1.6449e+12    14,275   115227359   Root MSE        =     10594

------------------------------------------------------------------------------
         AIP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 name_length |   -272.763   13.94129   -19.57   0.000    -300.0898   -245.4363
       _cons |    6318.27    201.164    31.41   0.000     5923.962    6712.577
------------------------------------------------------------------------------


The regression table above tells us that products with longer names are less expensive on average. An increase of one character in the length of the product name is on average associated with a decrease of roughly 273 kronor in AIP (apotekens inköpspris, the pharmacy purchasing price). This is the slope.

The table also tells us that, based on our model, a product whose name is zero characters long is predicted to cost the pharmacy 6318.27 kronor (the constant). Finally, the R-squared tells us that our model is able to explain roughly 2.6% of the variation of AIP in the data.
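
The numbers in the table can also be retrieved after estimation, which makes the interpretation concrete. The sketch below assumes the Swedish price data and the `name_length` variable created earlier are still in memory; the 10-character name is an arbitrary illustrative choice.

```stata
* Re-run the regression so that its results are the most recent estimates
regress AIP name_length
* Estimated constant and slope
display _b[_cons]
display _b[name_length]
* Predicted AIP for a product with a 10-character name
display _b[_cons] + 10 * _b[name_length]
* Share of the variation in AIP explained by the model
display e(r2)
```

With the estimates above, the prediction for a 10-character name is 6318.27 - 10 × 272.763, or roughly 3,590 kronor.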