<img src="support_files/images/cropped-SummerWorkshop_Header.png">  

<h1 align="center">Python Bootcamp</h1> 
<h3 align="center">August 20-21, 2022</h3> 

<div class="alert alert-success">
<font size="6">Exercise</font>  
<font size="4"> </font>  
<font size="5">Numpy, Scipy, Pandas</font>
<p>
**Weisberg (1985) makes available a dataset of faculty salaries, along with several possible predictors. We will analyze these data using a general linear model.**
</p>
<p>
This exercise covers:
<ul style="list-style-type:disc">
  <li>Unexpected Formats</li>
  <li>Statistics with numpy and scipy</li>
  <li>Testing methods on dummy data</li>
</ul>
</p>
</div>

## Requirements

* pandas
* numpy
* scipy.stats

If you are running Python 2, you should also import division from \__future__, just to be safe. On Python 3 the import is harmless but unnecessary.

In [1]:
from __future__ import division

import pandas as pd
import numpy as np
import scipy.stats

## The Data

These data come from a study of salaries among university faculty. The data file is [here](http://data.princeton.edu/wws509/datasets/salary.dat) and a description of the coding is [here](http://data.princeton.edu/wws509/datasets/#salary) (You should probably at least glance at this).

Load these data into a pandas dataframe. Note - the delimiter is not a comma!

In [2]:
# The file is whitespace-delimited; a raw string keeps '\s' from being read as an escape
data = pd.read_csv('http://data.princeton.edu/wws509/datasets/salary.dat', sep=r'\s+')

## A fitting exercise

We'll use a general linear model to analyze these data. To do this, we need to be able to fit such models. Fortunately, numpy's linalg module contains a function for least-squares fitting. Learn how to use it by generating some noisy (Gaussian) data from a toy linear model (try numpy's random module) and then recovering your coefficients.

Note: functions are good.

In [3]:
def make_test_data(nobs, true_coefs, sigma):
    
    npar = len(true_coefs)
    design = np.random.rand(nobs, npar)
    target = np.dot(design, true_coefs) + np.random.randn(nobs) * sigma
    
    return design, target

In [4]:
test_design, test_target = make_test_data(20, np.array([2, 3, 7]), 0.1)

In [5]:
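# rcond=None selects the current default cutoff and avoids a FutureWarning on older numpy
coefficients, residuals, rank, sv = np.linalg.lstsq(test_design, test_target, rcond=None)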

In [6]:
print(coefficients)

[ 2.03368024  2.92780121  6.96839643]


## Reformatting the data

If you've taken a look at the data (hint), you probably know that it is not properly formatted for the method of least-squares fitting that we are using here. It has:

* categorical variables in single columns
* no distinction between the predictor and estimand columns
* no way to specify an intercept

Write a function to rectify this situation. Your function should have the following signature:

```python
def glm_data_reformat(dataframe, target_name, cont_pred=None, cat_pred=None, intercept=True):
    '''Sets up a dataframe for fitting with numpy (main effects only)
    
    Parameters
    ----------
    dataframe : pandas df
        contains mix of categorical and continuous predictors
    target_name : str
        column header of target variable (treated as continuous)
    cont_pred : list of str, optional
        column headers of continuous predictors, if any
    cat_pred : list of str, optional
        column headers of categorical predictors, if any
    intercept : bool, optional
        fit an intercept? Defaults to yes.
        
    Returns
    -------
    design : ndarray (n_observations x n_parameters)
        predictor data.
    target : ndarray (n_observations)
        estimand
    design_names : list of str
        names of parameters in design matrix columns
     
    '''

    # your code here

    return design, target, design_names
```

Note: You will need to code the categorical variables somehow. This will require spooling them out into multiple columns of the design matrix.
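To make the note above concrete, here is a minimal sketch of one common scheme, indicator (dummy) coding. The `rank` column and its levels are made up for illustration and do not come from the salary data. Each level except one gets its own 0/1 column, and the omitted level acts as the baseline that the intercept absorbs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with three levels (not taken from the salary data)
rank = pd.Series(['full', 'assoc', 'asst', 'full', 'asst'])

# Levels in order of first appearance: full, assoc, asst
levels = rank.unique()

# One 0/1 indicator column per level except the last.
# The last level ('asst') is the baseline and gets no column of its own.
indicators = np.column_stack(
    [(rank == level).to_numpy().astype(float) for level in levels[:-1]]
)

print(list(levels[:-1]))
print(indicators)
```

Dropping one level matters when the model also has an intercept: with all three indicator columns present, they would sum to the intercept column and the design matrix would be rank deficient.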

In [7]:
def glm_data_reformat(dataframe, target_name, cont_pred=None, cat_pred=None, intercept=True):
    '''Sets up a dataframe for fitting with numpy (main effects only)

    Parameters
    ----------
    dataframe : pandas df
        contains mix of categorical and continuous predictors
    target_name : str
        column header of target variable (treated as continuous)
    cont_pred : list of str, optional
        column headers of continuous predictors, if any
    cat_pred : list of str, optional
        column headers of categorical predictors, if any
    intercept : bool, optional
        fit an intercept? Defaults to yes.

    Returns
    -------
    design : ndarray (n_observations x n_parameters)
        predictor data.
    target : ndarray (n_observations)
        estimand
    design_names : list of str
        names of parameters in design matrix columns

    '''

    if cont_pred is None: cont_pred = []
    if cat_pred is None: cat_pred = []
        
    design_names = []
    columns = []
        
    for var_name in cont_pred:
        columns.append(dataframe[var_name])
        design_names.append(var_name)
        
    for var_name in cat_pred:
        
        levels = dataframe[var_name].unique()
        nlevels = len(levels)
        
        if nlevels < 2:
            continue
        
        for ii, level in enumerate(levels):
            
            if ii == nlevels - 1 :
                break
                
            indicator = np.zeros(dataframe.shape[0])
            indicator[np.where(dataframe[var_name] == level)] = 1
            columns.append(indicator)
            design_names.append('{0}_as_{1}'.format(var_name, level))
            
    if intercept:
        columns.append(np.ones(dataframe.shape[0]))
        design_names.append('intercept')
        

    return np.array(columns).T, np.array(dataframe[target_name]), design_names


In [8]:
full_design, full_target, full_design_names = glm_data_reformat(
    data, target_name='sl', cont_pred=['yr', 'yd'], cat_pred=['dg', 'rk', 'sx'], intercept=True
    )

If you have not already, test your function:
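One possible way to do this is to build a small dataframe whose target is an exact, noise-free linear function of the predictors, and confirm that the reformatted design matrix reproduces it. The sketch below is only a suggestion: `check_reformat` and the toy column names `x`, `group`, and `y` are hypothetical, and the function assumes it is handed something with the same signature as `glm_data_reformat`.

```python
import numpy as np
import pandas as pd

def check_reformat(reformat):
    """Run basic sanity checks on a glm_data_reformat-style function."""
    rng = np.random.RandomState(0)
    nobs = 60
    toy = pd.DataFrame({
        'x': rng.rand(nobs),
        # Deterministic cycle so that all three levels are present
        'group': np.tile(['a', 'b', 'c'], nobs // 3),
    })
    # Noise-free target: 2 * x + 1 everywhere, plus 3 where group is 'a'
    toy['y'] = 2.0 * toy['x'] + 3.0 * (toy['group'] == 'a') + 1.0

    design, target, names = reformat(
        toy, 'y', cont_pred=['x'], cat_pred=['group'], intercept=True
    )

    # 1 continuous column + (3 - 1) indicator columns + 1 intercept column
    assert design.shape == (nobs, 4)
    assert len(names) == design.shape[1]
    assert target.shape == (nobs,)

    # Without noise, the least-squares fit should reproduce the target exactly
    coefs = np.linalg.lstsq(design, target, rcond=None)[0]
    assert np.allclose(np.dot(design, coefs), target)

    return dict(zip(names, coefs))
```

Calling `check_reformat(glm_data_reformat)` should run without raising and return the fitted coefficients by name. The coefficient on `x` should be close to 2 regardless of which level of `group` your function treats as the baseline.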

Now use this function and the linalg module to format the data and fit a model of your choice.

In [9]:
full_coefficients, residuals, rank, sv = np.linalg.lstsq(full_design, full_target, rcond=None)

## Analysis

Now that you have a model, let's do something with it. In particular, we will investigate whether there is an effect of sex on salary in these data. We can use a sequential sum of squares F-test, where:

$$
f = \frac{\frac{SSE_{red} - SSE_{full}}{DFE_{red} - DFE_{full}}}  {\frac{SSE_{full}}{DFE_{full}}}
$$
Here SSE is the sum of squared errors (i.e. the sum of the squared residuals). DFE is the error degrees of freedom (number of observations - number of design matrix columns). The full model is exactly what it sounds like, while the red (reduced) model is the same model without the parameter of interest.

Fit a full and a reduced model for a parameter of interest and generate an F-value.

In [10]:
red_design, red_target, red_design_names = glm_data_reformat(
    data, target_name='sl', cont_pred=['yr', 'yd'], cat_pred=['dg', 'rk'], intercept=True
    )
red_coefficients, _, _, _ = np.linalg.lstsq(red_design, red_target, rcond=None)


In [11]:
full_sse = ((np.dot(full_design, full_coefficients) - full_target)**2).sum()
red_sse = ((np.dot(red_design, red_coefficients) - red_target)**2).sum()

full_dfm = len(full_design_names) 
red_dfm = len(red_design_names)

full_dfe = full_design.shape[0] - full_dfm
red_dfe = red_design.shape[0] - red_dfm

In [12]:
fhat = ( (red_sse - full_sse) / (red_dfe - full_dfe) ) / (full_sse / full_dfe)

In [13]:
print(fhat)

1.58802561117


Now get a p-value by using the cdf of an F-distributed random variable: the p-value is the probability of exceeding your F-value, i.e. one minus the cdf evaluated there. scipy.stats has a handy function for this.

Note that your f-distribution's parameters should be:

1. $DFM_{full} - DFM_{red}$ where DFM is the number of columns in a model's design matrix.
2. $DFE_{full}$

In [14]:
fvar = scipy.stats.f

pvalue = 1 - fvar.cdf(fhat, full_dfm - red_dfm, full_dfe)

In [15]:
print(pvalue)

0.214104335593
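In the spirit of testing methods on dummy data, the same F-test can be run on simulated data where the truth is known. The sketch below is not part of the original solution: `f_test`, the binary `flag` predictor, and the simulated targets are hypothetical names chosen for illustration. It uses `scipy.stats.f.sf`, the survival function, which equals one minus the cdf.

```python
import numpy as np
import scipy.stats

def f_test(full_design, red_design, target):
    """Sequential F-test comparing two nested least-squares models."""
    def sse_and_dfe(design):
        coefs = np.linalg.lstsq(design, target, rcond=None)[0]
        sse = ((np.dot(design, coefs) - target) ** 2).sum()
        # Error degrees of freedom: observations minus design columns
        return sse, design.shape[0] - design.shape[1]

    full_sse, full_dfe = sse_and_dfe(full_design)
    red_sse, red_dfe = sse_and_dfe(red_design)

    fhat = ((red_sse - full_sse) / (red_dfe - full_dfe)) / (full_sse / full_dfe)
    # Upper tail probability of the F distribution at fhat
    pvalue = scipy.stats.f.sf(fhat, red_dfe - full_dfe, full_dfe)
    return fhat, pvalue

rng = np.random.RandomState(1)
nobs = 200
x = rng.rand(nobs)
flag = np.tile([0.0, 1.0], nobs // 2)   # toy binary predictor
ones = np.ones(nobs)

full = np.column_stack([x, flag, ones])
red = np.column_stack([x, ones])        # same model without flag

# One target where flag truly matters, one where it does not
y_effect = 2 * x + 1.0 * flag + 1 + rng.randn(nobs) * 0.1
y_null = 2 * x + 1 + rng.randn(nobs) * 0.1

print(f_test(full, red, y_effect))
print(f_test(full, red, y_null))
```

With a real effect that is large relative to the noise, the p-value should be very small. When the predictor has no effect, the p-value is roughly uniform on [0, 1] across repeated simulations, so any single draw may be large or small.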


## Continuations

* extend your glm_data_reformat to handle interactions
* evaluate the model's performance using leave-one-out cross-validation
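As a starting point for the cross-validation item, here is a minimal sketch of leave-one-out cross-validation for a least-squares fit. The function name `loo_cv_mse` and the toy data are assumptions made for illustration; the loop refits the model once per observation, which is simple but not the fastest approach.

```python
import numpy as np

def loo_cv_mse(design, target):
    """Mean squared prediction error under leave-one-out cross-validation."""
    nobs = design.shape[0]
    errors = np.empty(nobs)
    for ii in range(nobs):
        # Fit on every observation except the ii-th
        keep = np.arange(nobs) != ii
        coefs = np.linalg.lstsq(design[keep], target[keep], rcond=None)[0]
        # Predict the held-out observation and record the error
        errors[ii] = target[ii] - np.dot(design[ii], coefs)
    return (errors ** 2).mean()

# Toy check: a linear model with Gaussian noise of standard deviation 0.1
rng = np.random.RandomState(2)
design = np.column_stack([rng.rand(30), np.ones(30)])
target = np.dot(design, [2.0, 1.0]) + rng.randn(30) * 0.1

print(loo_cv_mse(design, target))
```

On the salary data you could compare `loo_cv_mse(full_design, full_target)` with `loo_cv_mse(red_design, red_target)` to see whether the extra predictor improves out-of-sample prediction.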