This document is a Python exploration of the R-based document at http://m-clark.github.io/data-processing-and-visualization/programming.html. The code is *not* optimized for anything but learning. In addition, the explanatory content lives in the main document, not here, so many sections may not be included. I focus only on reproducing the code chunks.

# Objects

## Object Inspection & Exploration

### DataFrames

We'll start by inspecting a `DataFrame` object from `pandas`.

In [1]:
import pandas as pd
import numpy as np

In [2]:
diamonds = pd.read_csv('../data/diamonds.csv')

In [3]:
diamonds

       carat        cut  color  clarity  depth  table  price     x     y     z
0       0.23      Ideal      E      SI2   61.5   55.0    326  3.95  3.98  2.43
1       0.21    Premium      E      SI1   59.8   61.0    326  3.89  3.84  2.31
2       0.23       Good      E      VS1   56.9   65.0    327  4.05  4.07  2.31
3       0.29    Premium      I      VS2   62.4   58.0    334  4.20  4.23  2.63
4       0.31       Good      J      SI2   63.3   58.0    335  4.34  4.35  2.75
...      ...        ...    ...      ...    ...    ...    ...   ...   ...   ...
53935   0.72      Ideal      D      SI1   60.8   57.0   2757  5.75  5.76  3.50
53936   0.72       Good      D      SI1   63.1   55.0   2757  5.69  5.75  3.61
53937   0.70  Very Good      D      SI1   62.8   60.0   2757  5.66  5.68  3.56
53938   0.86    Premium      H      SI2   61.0   58.0   2757  6.15  6.12  3.74


In [4]:
type(diamonds)

pandas.core.frame.DataFrame

For pandas data frames there are a couple of ways we can get something similar to `str()` in R.

In [5]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object 
 2   color    53940 non-null  object 
 3   clarity  53940 non-null  object 
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64  
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


In [6]:
diamonds.dtypes

carat      float64
cut         object
color       object
clarity     object
depth      float64
table      float64
price        int64
x          float64
y          float64
z          float64
dtype: object
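
R's `str()` reports each column's type alongside its first few values, which neither `info()` nor `dtypes` shows. As a rough sketch of one way to get both in a single table, the hypothetical helper `glimpse` below (not part of pandas) is applied to a small made-up `toy` frame whose rows are copied from the `diamonds` output above.

```python
import pandas as pd

# Hypothetical stand-in for the diamonds data (first three rows, three columns)
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "cut": ["Ideal", "Premium", "Good"],
    "price": [326, 326, 327],
})


def glimpse(df, n=3):
    """Return one row per column: dtype, non-null count, and the first n values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        # Join the first n values into one string so each column fits on one row
        "first_values": [", ".join(map(str, df[col].head(n))) for col in df.columns],
    })


overview = glimpse(toy)
print(toy.shape)   # (rows, columns)
print(overview)
```

Calling `glimpse(diamonds)` should give the analogous overview for the full data, assuming the CSV has been loaded as above.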

### Model Objects

#### Attributes and Methods

As with the R example, we'll use a regression model via `statsmodels`.

In [7]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [8]:
mtcars = sm.datasets.get_rdataset("mtcars", "datasets").data

# Fit regression model
results = smf.ols('mpg ~ wt + hp + vs + am', data=mtcars).fit()

# Inspect the results
results.summary()

Dep. Variable:     mpg               R-squared:           0.85
Model:             OLS               Adj. R-squared:      0.828
Method:            Least Squares     F-statistic:         38.23
Date:              Wed, 12 Aug 2020  Prob (F-statistic):  9.45e-11
Time:              13:49:51          Log-Likelihood:      -72.029
No. Observations:  32                AIC:                 154.1
Df Residuals:      27                BIC:                 161.4
Df Model:          4
Covariance Type:   nonrobust

              coef  std err       t  P>|t|  [0.025  0.975]
Intercept  31.0788    3.393   9.160  0.000  24.117  38.040
wt         -2.5910    0.917  -2.824  0.009  -4.473  -0.709
hp         -0.0301    0.011  -2.751  0.010  -0.053  -0.008
vs          1.7855    1.327   1.345  0.190  -0.938   4.509
am          2.4171    1.379   1.752  0.091  -0.413   5.247

Omnibus:        2.428  Durbin-Watson:     1.59
Prob(Omnibus):  0.297  Jarque-Bera (JB):  2.073
Skew:           0.607  Prob(JB):          0.355
Kurtosis:       2.716  Cond. No.          1360.0


In [9]:
type(results)

statsmodels.regression.linear_model.RegressionResultsWrapper

As in R, the primary way to access the parts of a model object is through its methods and attributes. While being able to access every part of an object has its benefits, the object returned by `statsmodels` isn't a generic named list like the one `lm` returns in R.

In [10]:
dir(results)  # lists attributes and methods together, without indicating which are callable

['HC0_se',
 'HC1_se',
 'HC2_se',
 'HC3_se',
 '_HCCM',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_cache',
 '_data_attr',
 '_get_robustcov_results',
 '_is_nested',
 '_use_t',
 '_wexog_singular_values',
 'aic',
 'bic',
 'bse',
 'centered_tss',
 'compare_f_test',
 'compare_lm_test',
 'compare_lr_test',
 'condition_number',
 'conf_int',
 'conf_int_el',
 'cov_HC0',
 'cov_HC1',
 'cov_HC2',
 'cov_HC3',
 'cov_kwds',
 'cov_params',
 'cov_type',
 'df_model',
 'df_resid',
 'diagn',
 'eigenvals',
 'el_test',
 'ess',
 'f_pvalue',
 'f_test',
 'fittedvalues',
 'fvalue',
 'get_influence',
 'get_prediction',
 'get_robustcov_results',
 'initialize',
 'k_constant',
 'llf',
 'load',

For example, we can check the $R^2$ value as follows. In this case the $R^2$ value is an attribute of the fitted OLS results object.

In [11]:
results.rsquared

0.849949878625184

In [12]:
results.params   # coefficients

Intercept    31.078788
wt           -2.590999
hp           -0.030101
vs            1.785546
am            2.417142
dtype: float64

However, `predict` is an actual method, so it must be called with parentheses.

In [13]:
results.predict()

array([23.39642383, 22.73571914, 26.47098334, 21.22318397, 16.89811066,
       20.7388933 , 14.45422436, 22.73279765, 21.84311099, 20.24889875,
       20.24889875, 15.11527739, 15.99621698, 15.86666704, 11.30537866,
       10.5535368 , 10.30671361, 27.59462497, 29.53177055, 28.57044033,
       23.55774354, 17.44335092, 17.66358582, 13.75465468, 15.84875615,
       28.28123965, 25.21201858, 27.95990323, 17.33585029, 21.05122159,
       14.16229351, 24.79751099])

In Python, most modeling structures do not offer a generic way to access their parts, unlike lists in R, where we can use `$` or inspect the list elements. Each library you use will have custom classes with their own attributes and methods. However, if you really want a generic listing, you can use the `getmembers` function from the `inspect` module.

In [14]:
import inspect

# inspect.getmembers(results) # not shown due to verbosity
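
As a small illustration of what `getmembers` returns, the sketch below applies it to a hypothetical `ToyModel` class rather than to `results`, whose output is long. It also uses the built-in `vars()`, which returns only the data stored directly on the instance and is probably the closest analogue to listing the elements of an R list.

```python
import inspect


class ToyModel:
    """Hypothetical stand-in for a model object."""

    def __init__(self, slope, intercept):
        self.slope = slope
        self.intercept = intercept

    def predict(self, x):
        return self.intercept + self.slope * x


model = ToyModel(slope=2.0, intercept=1.0)

# Data stored directly on the instance, as a plain dict
stored = vars(model)

# getmembers returns (name, value) pairs sorted by name;
# the predicate keeps only bound methods
method_names = [name for name, _ in inspect.getmembers(model, predicate=inspect.ismethod)]

# Public, non-callable members only
public_data = {
    name: value
    for name, value in inspect.getmembers(model)
    if not name.startswith("_") and not callable(value)
}

print(stored)
print(method_names)
print(public_data)
```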

### Inspecting Functions

To inspect functions you can use `??` in a Jupyter notebook.

In [15]:
??smf.ols

Signature: smf.ols(formula, data, subset=None, drop_cols=None, *args, **kwargs)
Source:   
    @classmethod
    def from_formula(cls, formula, data, subset=None, drop_cols=None,
                     *args, **kwargs):
        """
        Create a Model from a formula and dataframe.

        Parameters
        ----------
        formula : str or g

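`??` is an IPython feature, so it is not available in a plain Python script or REPL. There, the standard-library `inspect` module offers roughly the same information. The sketch below uses `statistics.mean` as a stand-in target so that it runs without `statsmodels`; note that `getsource` only works for objects implemented in Python whose source file is available.

```python
import inspect
import statistics

# Call signature, similar to the "Signature:" line printed by ??
signature = inspect.signature(statistics.mean)
print(signature)

# Full source text, similar to the "Source:" section printed by ??
source = inspect.getsource(statistics.mean)
print(source[:200])  # first 200 characters only, for brevity
```
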
## Help Files

As in R, we can access help files with `?`.

In [16]:
?smf.ols

Signature: smf.ols(formula, data, subset=None, drop_cols=None, *args, **kwargs)
Docstring:
Create a Model from a formula and dataframe.

Parameters
----------
formula : str or generic Formula object
    The formula specifying the model.
data : array_like
    The data for the model. See Notes.
subset : array_like
    An array-like object of booleans, integers, or index values that
    indicate the subset of df to use in the model. Assumes df is a
    `pandas.DataFrame`.
drop_cols : array_like
    Columns to drop from the design matrix.  Cannot be used to
    drop terms involving categoricals.
*args
    Additional positional argument that are passed to the model.
**kwargs
    These are passed to the model with one exception. T
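
Like `??`, the `?` shortcut is specific to IPython and Jupyter. In plain Python the built-in `help()` prints the docstring, and `inspect.getdoc` returns it as a string. A brief sketch, using `statistics.mean` as a stand-in for `smf.ols` so that it runs without `statsmodels`:

```python
import inspect
import statistics

# Docstring as a string, with indentation normalized
doc = inspect.getdoc(statistics.mean)
first_line = doc.splitlines()[0]
print(first_line)

# help(statistics.mean) would print the full help page interactively
```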

## Objects Exercises

With one function, find out the class, number of rows, and number of columns of the following object, including what kind of object each of the last three columns is. Also inspect the help file.

In [17]:
iris = sm.datasets.get_rdataset("iris", "datasets").data