# NUSA Demo of the Fama-French Factor Model

In [1]:
import pandas as pd
# use any other libraries you may need
import numpy as np
from scipy.stats import mstats


Read the contents of **cleaned_factset_data.csv** into a Pandas DataFrame called **df** and drop any rows with NaN values. Note that the **CAP** column values are strings that use commas as thousands separators, so convert all values in the column to floats.

In [2]:
df = pd.read_csv('cleaned_factset_data.csv').dropna()
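# Convert CAP only if it was read as strings (e.g. "1,523.99"); a numeric column is left untouched.
if not pd.api.types.is_numeric_dtype(df['CAP']):
    df['CAP'] = df['CAP'].str.replace(',', '', regex=False).astype(float)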
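As a minimal sketch of the conversion described above, the snippet below parses a small made-up CSV (the tickers and values are hypothetical, not taken from the FactSet file) in two ways: stripping commas after loading, and passing `thousands=','` to `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical two-row CSV whose CAP values use commas as thousands separators.
raw = 'Ticker,CAP\nAAA,"1,523.99"\nBBB,"125,018.13"\n'

# Option 1: strip the commas after loading, then cast to float.
df_a = pd.read_csv(io.StringIO(raw))
df_a['CAP'] = df_a['CAP'].str.replace(',', '', regex=False).astype(float)

# Option 2: let read_csv interpret the separators while parsing.
df_b = pd.read_csv(io.StringIO(raw), thousands=',')

print(df_a['CAP'].tolist())
print(df_b['CAP'].tolist())
```

Either route should leave **CAP** as a float column; the second avoids a separate cleaning step when the file format is known in advance.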

To reduce the impact of outliers caused by the small number of large-cap companies, add a new column to **df** called **log_mktcap** and populate it with the natural log of each value in **CAP**.

In [3]:
df['log_mktcap'] = np.log(df['CAP'])
df

Unnamed: 0,Ticker,Company Name,monthly_return,capm_beta,book_price,CAP,GPM,log_mktcap
0,DDD,3D Systems Corporation,-6.02,1.555648,0.436308,1523.99630,48.936516,7.329091
1,MMM,3M Company,4.50,1.079156,0.074971,125018.13000,49.739280,11.736214
2,EGHT,"8x8, Inc.",-1.81,0.366954,0.236263,1241.35900,75.486600,7.123962
3,AOS,A. O. Smith Corporation,-1.52,1.536893,0.147579,11333.75300,41.665737,9.335541
4,SHLM,"A. Schulman, Inc.",9.18,1.600787,0.033661,1006.26390,16.560259,6.914000
...,...,...,...,...,...,...,...,...
1500,YUM,"Yum! Brands, Inc.",-0.80,0.855974,-0.214001,24953.79100,41.878730,10.124781
1501,ZBRA,Zebra Technologies Corporation Class A,3.12,1.604373,0.129115,5766.58000,39.653730,8.659834
1502,ZBH,"Zimmer Biomet Holdings, Inc.",0.98,1.182440,0.396316,23663.88900,60.496624,10.071705
1504,ZTS,"Zoetis, Inc. Class A",-1.51,1.016201,0.047275,31104.16800,64.566284,10.345097
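To illustrate why the log transform helps, here is a small sketch on synthetic, heavy-tailed values (drawn from a log-normal distribution as an assumption; this is not the FactSet data). It compares the skewness of the raw values with that of their logs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, heavy-tailed "market caps" (log-normal); NOT the FactSet data.
cap = pd.Series(rng.lognormal(mean=8.0, sigma=1.5, size=2000))
log_cap = np.log(cap)

skew_raw = cap.skew()      # dominated by a handful of very large values
skew_log = log_cap.skew()  # should be close to zero for log-normal data
print(f"skewness of raw values: {skew_raw:.2f}")
print(f"skewness of log values: {skew_log:.2f}")
```

A skewness near zero after the transform indicates that the few very large values no longer dominate the distribution.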


Then calculate the z-score of each of the numeric columns and put the results into new columns with **'zscore_'** prepended to each original column name. 


The z-score formula is:

$$Z = \frac{x - \mu}{\sigma}$$

Where $\mu$ is the column mean, $\sigma$ is the column standard deviation, and $x$ is the observed value.
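As a quick worked example of the formula, the sketch below standardizes a short made-up series. One detail worth knowing: pandas' `Series.std()` uses the sample standard deviation (`ddof=1`) by default, whereas `np.std` defaults to the population version (`ddof=0`), so the two can give slightly different z-scores.

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up values
mu = s.mean()      # column mean (5.0 for these values)
sigma = s.std()    # sample standard deviation (ddof=1), pandas' default
z = (s - mu) / sigma

print(z.round(3).tolist())
```

By construction, the resulting z-scores have mean 0 and (sample) standard deviation 1.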


In [4]:
for col in df.select_dtypes(include=[np.number]).columns:
    col_mean = df[col].mean()
    col_std = df[col].std()
    df['zscore_' + col] = (df[col] - col_mean) / col_std

df

Unnamed: 0,Ticker,Company Name,monthly_return,capm_beta,book_price,CAP,GPM,log_mktcap,zscore_monthly_return,zscore_capm_beta,zscore_book_price,zscore_CAP,zscore_GPM,zscore_log_mktcap
0,DDD,3D Systems Corporation,-6.02,1.555648,0.436308,1523.99630,48.936516,7.329091,-0.927340,0.650007,0.011774,-0.295252,0.563033,-0.680329
1,MMM,3M Company,4.50,1.079156,0.074971,125018.13000,49.739280,11.736214,0.938459,-0.065146,-0.730747,2.004678,0.599683,2.180214
2,EGHT,"8x8, Inc.",-1.81,0.366954,0.236263,1241.35900,75.486600,7.123962,-0.180666,-1.134073,-0.399304,-0.300516,1.775176,-0.813472
3,AOS,A. O. Smith Corporation,-1.52,1.536893,0.147579,11333.75300,41.665737,9.335541,-0.129232,0.621859,-0.581543,-0.112557,0.231086,0.622003
4,SHLM,"A. Schulman, Inc.",9.18,1.600787,0.033661,1006.26390,16.560259,6.914000,1.768491,0.717756,-0.815635,-0.304894,-0.915104,-0.949753
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1500,YUM,"Yum! Brands, Inc.",-0.80,0.855974,-0.214001,24953.79100,41.878730,10.124781,-0.001535,-0.400115,-1.324563,0.141100,0.240810,1.134277
1501,ZBRA,Zebra Technologies Corporation Class A,3.12,1.604373,0.129115,5766.58000,39.653730,8.659834,0.693705,0.723138,-0.619485,-0.216239,0.139228,0.183420
1502,ZBH,"Zimmer Biomet Holdings, Inc.",0.98,1.182440,0.396316,23663.88900,60.496624,10.071705,0.314161,0.089870,-0.070407,0.117077,1.090809,1.099827
1504,ZTS,"Zoetis, Inc. Class A",-1.51,1.016201,0.047275,31104.16800,64.566284,10.345097,-0.127459,-0.159635,-0.787659,0.255643,1.276609,1.277278


Winsorize the data in the **'zscore_'** columns at the 1st and 99th percentiles.
That is, censor the outliers: set any value below the 1st percentile to the 1st-percentile value, and any value above the 99th percentile to the 99th-percentile value.

In [5]:
zscore_cols = [col for col in df.columns if col.startswith('zscore_')]
for col in zscore_cols:
    df[col] = mstats.winsorize(df[col], limits=[0.01, 0.01])
df

Unnamed: 0,Ticker,Company Name,monthly_return,capm_beta,book_price,CAP,GPM,log_mktcap,zscore_monthly_return,zscore_capm_beta,zscore_book_price,zscore_CAP,zscore_GPM,zscore_log_mktcap
0,DDD,3D Systems Corporation,-6.02,1.555648,0.436308,1523.99630,48.936516,7.329091,-0.927340,0.650007,0.011774,-0.295252,0.563033,-0.680329
1,MMM,3M Company,4.50,1.079156,0.074971,125018.13000,49.739280,11.736214,0.938459,-0.065146,-0.730747,2.004678,0.599683,2.180214
2,EGHT,"8x8, Inc.",-1.81,0.366954,0.236263,1241.35900,75.486600,7.123962,-0.180666,-1.134073,-0.399304,-0.300516,1.775176,-0.813472
3,AOS,A. O. Smith Corporation,-1.52,1.536893,0.147579,11333.75300,41.665737,9.335541,-0.129232,0.621859,-0.581543,-0.112557,0.231086,0.622003
4,SHLM,"A. Schulman, Inc.",9.18,1.600787,0.033661,1006.26390,16.560259,6.914000,1.768491,0.717756,-0.815635,-0.304894,-0.915104,-0.949753
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1500,YUM,"Yum! Brands, Inc.",-0.80,0.855974,-0.214001,24953.79100,41.878730,10.124781,-0.001535,-0.400115,-1.222565,0.141100,0.240810,1.134277
1501,ZBRA,Zebra Technologies Corporation Class A,3.12,1.604373,0.129115,5766.58000,39.653730,8.659834,0.693705,0.723138,-0.619485,-0.216239,0.139228,0.183420
1502,ZBH,"Zimmer Biomet Holdings, Inc.",0.98,1.182440,0.396316,23663.88900,60.496624,10.071705,0.314161,0.089870,-0.070407,0.117077,1.090809,1.099827
1504,ZTS,"Zoetis, Inc. Class A",-1.51,1.016201,0.047275,31104.16800,64.566284,10.345097,-0.127459,-0.159635,-0.787659,0.255643,1.276609,1.277278
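The winsorizing rule described above can also be sketched with plain pandas, clipping at the 1st and 99th percentiles. The data here is synthetic with two planted outliers (an assumption for illustration). Note that `mstats.winsorize` replaces a fixed count of the lowest and highest observations rather than clipping at interpolated quantiles, so the two approaches can differ slightly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic standardized values with two planted outliers; NOT the FactSet data.
s = pd.Series(rng.standard_normal(1000))
s.iloc[0] = 25.0
s.iloc[1] = -30.0

lo = s.quantile(0.01)   # 1st percentile
hi = s.quantile(0.99)   # 99th percentile
w = s.clip(lower=lo, upper=hi)  # values outside [lo, hi] are set to the nearest bound

print(f"before: min={s.min():.2f}, max={s.max():.2f}")
print(f"after:  min={w.min():.2f}, max={w.max():.2f}")
```

The number of rows is unchanged; only the extreme values are pulled in to the percentile bounds.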


Run a **weighted least squares regression** using the standardized, winsorized data as explanatory variables and the monthly returns as the dependent variable.

In [6]:
pip install statsmodels

Note: you may need to restart the kernel to use updated packages.


In [7]:
import statsmodels.api as sm

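# Dependent variable: raw monthly returns
y = df['monthly_return']
# Explanatory variables: the standardized, winsorized factor exposures
X = df[['zscore_capm_beta', 'zscore_book_price', 'zscore_CAP', 'zscore_GPM', 'zscore_log_mktcap']]
X = sm.add_constant(X)  # add an intercept term
# Observation weights: absolute monthly return (note that this ties the weights to the dependent variable)
weights = np.abs(df['monthly_return'])
model = sm.WLS(y, X, weights=weights)
results = model.fit()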
print(results.summary())

                            WLS Regression Results                            
Dep. Variable:         monthly_return   R-squared:                       0.129
Model:                            WLS   Adj. R-squared:                  0.125
Method:                 Least Squares   F-statistic:                     38.45
Date:                Sat, 23 Sep 2023   Prob (F-statistic):           6.75e-37
Time:                        01:04:05   Log-Likelihood:                -5022.2
No. Observations:                1309   AIC:                         1.006e+04
Df Residuals:                    1303   BIC:                         1.009e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -1.6462      0.24
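To make clear what a weighted least squares fit computes, here is a minimal NumPy sketch on synthetic data. The factors, true coefficients, and weights are all illustrative assumptions, not the notebook's data. WLS minimizes $\sum_i w_i (y_i - x_i^\top \beta)^2$, which is equivalent to ordinary least squares after scaling each row by $\sqrt{w_i}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
# Synthetic data; the factors, coefficients, and weights are illustrative assumptions.
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])  # constant + 2 factors
true_beta = np.array([0.5, -1.0, 2.0])
y = X @ true_beta + 0.1 * rng.standard_normal(n)
w = rng.uniform(0.5, 2.0, size=n)  # positive observation weights

# Scale each row of X and y by sqrt(w_i), then solve the ordinary least squares problem.
sw = np.sqrt(w)
beta_hat, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

# Same estimate from the weighted normal equations (X' W X) beta = X' W y.
beta_normal = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

print(beta_hat.round(3))
```

Observations with larger weights pull the fitted coefficients toward themselves more strongly, which is why the choice of weights matters for interpretation.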

Write a sentence or two interpreting the results of the regression: what do the coefficients mean, and are they statistically significant?

The regression model explains about 12.9% of the variance in monthly returns (R-squared of 0.129). Because the explanatory variables are z-scores, each coefficient estimates the change in monthly return associated with a one-standard-deviation increase in that factor, holding the others fixed. Only the coefficients for zscore_book_price and zscore_log_mktcap are statistically significant: higher standardized book-to-price ratios and log market capitalizations are associated with lower monthly returns. The other variables do not show a statistically significant effect.
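For reference, the significance check behind a statement like this can be sketched as follows. The coefficient and standard error are hypothetical placeholders, not values from the regression output; only the residual degrees of freedom (1303) come from the summary.

```python
from scipy import stats

# Hypothetical coefficient and standard error, NOT taken from the regression output.
coef, std_err = -0.90, 0.30
df_resid = 1303  # residual degrees of freedom reported in the summary

t_stat = coef / std_err
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)  # two-sided p-value

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("significant at 5%" if p_value < 0.05 else "not significant at 5%")
```

With a fitted statsmodels result, the same quantities are available directly as `results.tvalues` and `results.pvalues`.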