<a href="https://colab.research.google.com/github/Matt-Brigida/FIN_420_Financial_Analytics_Colab/blob/master/week_1_session_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 1 Session 0

# Do Higher Levels of Institutional Ownership Increase Management Effectiveness?

Imagine you arrive at work on a Monday morning and this is the question your boss asks you.  Theory would prompt you to answer 'yes'; however, you'll want some data to back up your answer.  In this session we'll gather some relevant data and provide an answer.

Load the Pandas library.

In [None]:
import pandas as pd

### Get institutional ownership data.

The URL below points to an Excel spreadsheet which contains the amount of Institutional and Insider ownership by industry.

In [None]:
inst_own_url = "http://www.stern.nyu.edu/~adamodar/pc/datasets/inshold.xls"

Below we are going to import the data from the URL using the `read_excel` function.  This attempts to establish an SSL connection, and as of this writing the server did not present the required certificates.  The following code tells Python to skip certificate verification.  It is not the safest approach, though the risk is more contained when working in a notebook in the cloud than on your own machine.

In [None]:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

Note the data starts on row 8, so we skip the first 7 rows when creating the Pandas `DataFrame`.  Also, the data is in the second sheet, which is `sheet_name=1` because Python counts from 0.

In [None]:
inst_own = pd.read_excel(inst_own_url, skiprows = 7, sheet_name=1)
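To see what `skiprows` does without downloading anything, here is a minimal sketch that uses `read_csv` on an in-memory string.  The text in `raw_text` is made up for illustration; `read_excel` accepts the same `skiprows` argument.

```python
import io

import pandas as pd

# Hypothetical file contents: two lines of notes before the real header row.
raw_text = """Data compiled for illustration
Last updated: (date omitted)
Industry Name,Number of Firms
Advertising,58
Apparel,39
"""

# Skip the two note lines so the third line is read as the header.
demo = pd.read_csv(io.StringIO(raw_text), skiprows=2)

print(demo.columns.tolist())
print(demo.shape)
```

If `skiprows` were left out, the first note line would be treated as the header, which is why we count the rows before importing.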

### Inspecting Data

Often the first step will be to inspect the data.  We want to pay particular attention to missing data and the type of each variable (column).  We can first view the data with:

In [None]:
inst_own

Unnamed: 0,Industry Name,Number of Firms,CEO Holding,Corporate Holdings,Institutional Holdings,Insider Holdings
0,Advertising,58,0.079387,0.156317,0.381401,0.187405
1,Aerospace/Defense,77,0.036413,0.242979,0.487936,0.113848
2,Air Transport,21,0.025401,0.297433,0.507165,0.079854
3,Apparel,39,0.080286,0.055278,0.518454,0.170783
4,Auto & Truck,31,0.035413,0.218659,0.288059,0.136488
...,...,...,...,...,...,...
91,Trucking,35,0.051777,0.149969,0.542969,0.184851
92,Utility (General),15,0.001094,0.000000,0.832900,0.004574
93,Utility (Water),16,0.027208,0.442967,0.539700,0.090886
94,Total Market,7165,0.050500,0.143281,0.470277,0.126370


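EDIT:  The spreadsheet no longer contains a blank column, so the next line of code is no longer needed.  In an earlier version of the file there was a blank column (`Unnamed: 5`), which could be removed with the line below.  The `errors="ignore"` argument keeps the line from raising an error when the column is absent.

In [None]:
inst_own.drop(columns=["Unnamed: 5"], errors="ignore", inplace=True)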
inst_own

Now let's see how many rows and columns we have (the shape of our data frame):

In [None]:
inst_own.shape

(96, 6)

Or just get the rows:

In [None]:
inst_own.shape[0]

96

And the type of each column:

In [None]:
inst_own.dtypes

Industry Name              object
Number of Firms             int64
CEO Holding               float64
Corporate Holdings        float64
Institutional Holdings    float64
Insider Holdings          float64
dtype: object

Or we can also use `info()`.  This prints a collection of useful information about a `DataFrame`.

In [None]:
inst_own.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Industry Name           96 non-null     object 
 1   Number of Firms         96 non-null     int64  
 2   CEO Holding             96 non-null     float64
 3   Corporate Holdings      96 non-null     float64
 4   Institutional Holdings  96 non-null     float64
 5   Insider Holdings        96 non-null     float64
dtypes: float64(4), int64(1), object(1)
memory usage: 4.6+ KB


So Number of Firms is an integer type, and the remaining numeric columns are floats (numbers with decimal portions).  The `object` dtype for Industry Name means Pandas is storing it as strings. These columns have been imported correctly.  

A common problem when importing a numeric column, particularly in large data sets, is that a single cell may contain a non-numeric character.  In this case the whole column may be imported as strings ("123" instead of 123).
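Here is a small sketch of that problem on made-up data.  The two toy frames and their values are assumptions for illustration only; the point is that one bad cell is enough to stop the column from being numeric.

```python
import pandas as pd

# Illustrative values; the second entry in `dirty` has a stray letter, so it is a string.
clean = pd.DataFrame({"Institutional Holdings": [0.381401, 0.487936, 0.507165]})
dirty = pd.DataFrame({"Institutional Holdings": [0.381401, "0.48g7936", 0.507165]})

print(clean["Institutional Holdings"].dtype)
print(dirty["Institutional Holdings"].dtype)

# is_numeric_dtype gives a simple True/False check for each column.
clean_is_numeric = pd.api.types.is_numeric_dtype(clean["Institutional Holdings"])
dirty_is_numeric = pd.api.types.is_numeric_dtype(dirty["Institutional Holdings"])
print(clean_is_numeric, dirty_is_numeric)
```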

### Example: Importing Data with Error

In [None]:
inst_own_with_error = pd.read_excel("https://github.com/FinancialMarkets/industry_instown_with_data_error_to_clean/blob/master/which_column_has_bad_data.xls?raw=true", skiprows = 7)
inst_own_with_error.drop(columns=["Unnamed: 5"], inplace=True)

In [None]:
inst_own_with_error.dtypes

Industry Name              object
Number of Firms             int64
CEO Holding               float64
Institutional Holdings     object
Insider Holdings          float64
dtype: object

In [None]:
inst_own_with_error

Now we see `Institutional Holdings` is stored as strings, and not as numbers.  Let's take a look and try to figure out where the error is:

In [None]:
pd.set_option('display.max_rows', 1000)
inst_own_with_error["Institutional Holdings"]

We see in row 25 there is a letter `g` in the number.  We have two choices at this point: (1) remove the row, or (2) if we are sure the number is otherwise correct, remove the `g`.
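Scanning a long column by eye does not scale well.  As a sketch of one alternative, we can ask `pd.to_numeric` to coerce unparseable cells to `NaN` and then look at which rows failed.  The toy `holdings` series below is hypothetical.

```python
import pandas as pd

# Toy column with one bad cell (hypothetical values).
holdings = pd.Series([0.381401, "0.48g7936", 0.507165], name="Institutional Holdings")

# errors="coerce" turns anything that cannot be parsed into NaN.
as_numbers = pd.to_numeric(holdings, errors="coerce")

# Cells that were present originally but failed to parse are the bad ones.
bad_rows = holdings[as_numbers.isna() & holdings.notna()]
print(bad_rows)
```

The index of `bad_rows` gives the row labels to inspect or fix.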

In [None]:
inst_own_with_error.index

RangeIndex(start=0, stop=96, step=1)

#### Drop a Row

We can drop a row with the following.  Note the `inplace=True` is commented out.  We want the error to stay in the DataFrame for the second possible solution.  

In [None]:
inst_own_with_error.drop(axis=0, index=25) #, inplace=True)

In [None]:
inst_own_with_error.dtypes

Industry Name              object
Number of Firms             int64
CEO Holding               float64
Institutional Holdings     object
Insider Holdings          float64
dtype: object
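Because `inplace=True` is commented out, `drop` hands back a new `DataFrame` and leaves the original alone.  A minimal sketch on a toy frame (the names `toy` and `trimmed` and the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({"Industry Name": ["A", "B", "C"], "ROE": [0.1, 0.2, 0.3]})

# Without inplace=True, drop returns a new DataFrame and leaves `toy` untouched.
trimmed = toy.drop(index=1)

print(len(toy), len(trimmed))
print(trimmed.index.tolist())
```

Note the remaining rows keep their original index labels, so there is a gap where the dropped row used to be.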

#### Changing the Data in Row 25 of Institutional Holdings

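In [None]:
# Set the cell with a single .loc[row, column] call to avoid chained indexing.
inst_own_with_error.loc[25, "Institutional Holdings"] = 0.539263
#inst_own_with_error.loc[25, "Institutional Holdings"]

Note that chained indexing, as in `inst_own_with_error["Institutional Holdings"].loc[25] = 0.539263`, triggers the Pandas warning "A value is trying to be set on a copy of a slice from a DataFrame" and may not change the original `DataFrame`.  See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy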


In [None]:
inst_own_with_error["Institutional Holdings"] = pd.to_numeric(inst_own_with_error["Institutional Holdings"])

After converting the column with `pd.to_numeric`, we see below that Institutional Holdings is a float.

In [None]:
inst_own_with_error.dtypes

Industry Name              object
Number of Firms             int64
CEO Holding               float64
Institutional Holdings    float64
Insider Holdings          float64
dtype: object
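The two steps can be tried end to end on a toy frame.  This is only a sketch with made-up values: the cell is set with one `.loc[row, column]` call, and then the column is converted explicitly.

```python
import pandas as pd

toy = pd.DataFrame({"Institutional Holdings": [0.381401, "0.48g7936", 0.507165]})

# Set the cell with one .loc[row, column] call rather than chained indexing.
toy.loc[1, "Institutional Holdings"] = 0.487936

# Fixing the cell does not change the column's dtype; convert it explicitly.
toy["Institutional Holdings"] = pd.to_numeric(toy["Institutional Holdings"])
print(toy.dtypes)
```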

### Selecting Columns

In [None]:
pd.set_option('display.max_rows', 10)
inst_own["Industry Name"]

0                           Advertising
1                     Aerospace/Defense
2                         Air Transport
3                               Apparel
4                          Auto & Truck
                    ...                
91                             Trucking
92                    Utility (General)
93                      Utility (Water)
94                         Total Market
95    Total Market (without financials)
Name: Industry Name, Length: 96, dtype: object

### Adding Columns

Let's say we want to create a column which is the sum of Institutional and Insider Holdings.  Let's call it `II_Holdings`.

In [None]:
inst_own["II_Holdings"] = inst_own["Institutional Holdings"] + inst_own["Insider Holdings"]
inst_own

Unnamed: 0,Industry Name,Number of Firms,CEO Holding,Institutional Holdings,Insider Holdings,II_Holdings
0,Advertising,61,0.048845,0.303746,0.179132,0.482877
1,Aerospace/Defense,72,0.025742,0.534349,0.101948,0.636297
2,Air Transport,17,0.032045,0.649069,0.073218,0.722287
3,Apparel,51,0.075842,0.486852,0.189185,0.676037
4,Auto & Truck,19,0.087504,0.451652,0.158280,0.609932
...,...,...,...,...,...,...
91,Trucking,35,0.022909,0.531682,0.187574,0.719256
92,Utility (General),16,0.001024,0.815850,0.005227,0.821077
93,Utility (Water),17,0.016858,0.475340,0.045610,0.520950
94,Total Market,7582,0.047922,0.461086,0.128503,0.589589
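If you would rather not modify the original `DataFrame`, one option is `assign`, which returns a new frame with the extra column.  A small sketch on made-up values (`toy` and `with_sum` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame(
    {"Institutional Holdings": [0.38, 0.49], "Insider Holdings": [0.19, 0.11]}
)

# assign returns a new DataFrame with the extra column; `toy` itself is unchanged.
with_sum = toy.assign(
    II_Holdings=toy["Institutional Holdings"] + toy["Insider Holdings"]
)
print(with_sum)
```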


# Which Industries have the Highest Levels of Institutional Ownership?

We can sort the `DataFrame` on `Institutional Holdings` in descending order, and print the first 10 rows, with:

In [None]:
inst_own.sort_values("Institutional Holdings", ascending=False).head(10)

Unnamed: 0,Industry Name,Number of Firms,CEO Holding,Institutional Holdings,Insider Holdings,II_Holdings
67,Reinsurance,2,0.0027,0.90975,0.01867,0.92842
58,Paper/Forest Products,15,0.002919,0.817112,0.013544,0.830656
92,Utility (General),16,0.001024,0.81585,0.005227,0.821077
90,Transportation (Railroads),6,0.000615,0.7976,0.002185,0.799785
80,Shoe,11,0.010664,0.76635,0.066496,0.832846
41,Homebuilding,30,0.079937,0.731192,0.162065,0.893257
76,Rubber& Tires,3,0.039813,0.715833,0.049353,0.765187
62,R.E.I.T.,238,0.016343,0.713841,0.039719,0.753561
70,Retail (Building Supply),15,0.014032,0.699114,0.062814,0.761929
46,Insurance (General),21,0.016045,0.686328,0.110753,0.797081
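Pandas also has `nlargest`, which gets the top rows in one call.  The sketch below compares the two approaches on a toy frame with made-up values.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Industry Name": ["A", "B", "C", "D"],
        "Institutional Holdings": [0.30, 0.91, 0.45, 0.82],
    }
)

# Sort descending and keep the first two rows.
top_sorted = toy.sort_values("Institutional Holdings", ascending=False).head(2)

# nlargest picks the two rows with the largest values directly.
top_nlargest = toy.nlargest(2, "Institutional Holdings")

print(top_nlargest)
```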


We can also use `describe()` to get a feel for the data.

In [None]:
inst_own.describe()

Unnamed: 0,Number of Firms,CEO Holding,Institutional Holdings,Insider Holdings
count,96.0,96.0,96.0,96.0
mean,223.010417,0.049678,0.510069,0.127378
std,991.790051,0.036787,0.14365,0.062098
min,2.0,0.000177,0.16786,0.00143
25%,22.0,0.019778,0.418634,0.086168
50%,47.0,0.048007,0.498227,0.127803
75%,96.25,0.071761,0.605439,0.166659
max,7582.0,0.233067,0.90975,0.371873


## Get accounting returns data.

In [None]:
# returns_url = "http://www.stern.nyu.edu/~adamodar/pc/datasets/pbvdata.xls"
returns_url = "https://github.com/FinancialMarkets/industry_instown_with_data_error_to_clean/blob/master/pbvdata.xls?raw=true"
returns = pd.read_excel(returns_url, skiprows = 7)
returns

Unnamed: 0,Industry Name,Number of firms,PBV,ROE,EV/ Invested Capital,ROIC
0,Advertising,61,5.729582,0.029333,7.009509,0.515119
1,Aerospace/Defense,72,4.436925,0.085432,4.236895,0.191149
2,Air Transport,17,3.224150,-0.470269,1.762879,-0.160654
3,Apparel,51,4.111694,-0.081881,3.071836,0.075423
4,Auto & Truck,19,7.578762,0.044885,2.583233,0.011709
...,...,...,...,...,...,...
91,Trucking,35,4.811726,-0.176956,2.568211,-0.040357
92,Utility (General),16,1.840435,0.074857,1.476021,0.067851
93,Utility (Water),17,3.507385,0.082466,2.376167,0.060506
94,Total Market,7582,3.813854,0.082466,2.401274,0.060506


### Merge the two data sets on the Industry Name

In [None]:
all_data = pd.merge(inst_own, returns, on = 'Industry Name')

KeyError: ignored

In [None]:
returns.columns

Index(['Industry  Name', 'Number of firms', 'PBV', 'ROE',
       'EV/ Invested Capital', 'ROIC'],
      dtype='object')

In [None]:
inst_own.columns

Index(['Industry Name', 'Number of Firms', 'CEO Holding',
       'Institutional Holdings', 'Insider Holdings', 'II_Holdings'],
      dtype='object')

The merge above throws an error, specifically `KeyError: 'Industry Name'`.  So it is not finding the same column in each `DataFrame`.

With a little inspection of the two column listings, we find that the returns file has two spaces in `Industry  Name`, while the institutional ownership file has one.  

Having spaces in column names is not a good practice, but let's just change the column name in the returns file to have one space between Industry and Name.

In [None]:
returns.rename(columns={'Industry  Name': 'Industry Name'}, inplace=True)

In [None]:
returns.columns

Index(['Industry Name', 'Number of firms', 'PBV', 'ROE',
       'EV/ Invested Capital', 'ROIC'],
      dtype='object')
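Renaming one column by hand works here.  When several column names might have stray whitespace, a more general sketch is to normalize all of them at once.  The toy column names below are made up apart from `Industry  Name`.

```python
import pandas as pd

toy = pd.DataFrame(columns=["Industry  Name", " Number of firms ", "ROE"])

# Collapse runs of whitespace to a single space, then trim the ends.
toy.columns = toy.columns.str.replace(r"\s+", " ", regex=True).str.strip()
print(toy.columns.tolist())
```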

In [None]:
all_data = pd.merge(inst_own, returns, on = 'Industry Name')

In [None]:
all_data
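By default `pd.merge` keeps only the industries that appear in both data sets, so rows can disappear quietly if names do not match exactly.  One way to check, sketched here on toy frames with made-up values, is an outer merge with `indicator=True`.

```python
import pandas as pd

left = pd.DataFrame(
    {"Industry Name": ["A", "B", "C"], "Institutional Holdings": [0.3, 0.5, 0.7]}
)
right = pd.DataFrame({"Industry Name": ["A", "B", "D"], "ROE": [0.05, 0.08, 0.02]})

# The default (inner) merge keeps only industries present in both frames.
inner = pd.merge(left, right, on="Industry Name")

# An outer merge with indicator=True adds a `_merge` column showing each row's origin.
check = pd.merge(left, right, on="Industry Name", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]

print(len(inner))
print(unmatched[["Industry Name", "_merge"]])
```

If `unmatched` is empty, every industry was found in both data sets.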

# Exercise:  

Select only the columns you want and calculate a correlation matrix.  What is the correlation coefficient between Institutional Holdings and ROE?

[Hint](https://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html?highlight=correlation%20matrix)
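As a nudge, here is the general pattern on a toy frame; the column names `x`, `y`, and `z` and their values are made up, so you will still need to apply it to `all_data` yourself.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "label": ["a", "b", "c", "d"],
        "x": [1.0, 2.0, 3.0, 4.0],
        "y": [2.0, 4.0, 6.0, 8.0],
        "z": [4.0, 1.0, 3.0, 2.0],
    }
)

# Select only the numeric columns of interest, then compute pairwise correlations.
numeric = toy[["x", "y", "z"]]
corr_matrix = numeric.corr()
print(corr_matrix)

# A single coefficient can be read off by row and column label.
xy = corr_matrix.loc["x", "y"]
print(xy)
```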

# Addendum

At this point you may have noticed we call certain 'functions' with `name(a)` and others with `a.name()` where `a` is some object.  As examples, built-in functions such as `dir(module)` and `len(object)` are called the first way, while `inst_own.sort_values()` and `inst_own.drop()` above are called the second way.

The difference has to do with the distinction between functions and methods, and this is a topic within Python's class system which we don't want or need to worry about at this point. That said, we use `name(a)` when `name` is a globally defined function.  We use `a.name()` when `name` is a method defined by the class of `a` (that is, `a` is an instance of that class).  
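A tiny sketch may make the two call styles concrete.  The `Portfolio` class below is hypothetical and exists only for illustration.

```python
class Portfolio:
    """Hypothetical class used only to illustrate the two call styles."""

    def __init__(self, holdings):
        self.holdings = list(holdings)

    def __len__(self):
        # Lets the built-in function len() work on Portfolio objects.
        return len(self.holdings)

    def largest(self):
        # A method: defined on the class and called as a.largest().
        return max(self.holdings)


a = Portfolio([0.38, 0.49, 0.51])

print(len(a))       # function call style: name(a)
print(a.largest())  # method call style: a.name()
```

Here `len` is a built-in function that works on many kinds of objects, while `largest` only exists on objects of the `Portfolio` class.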

You can read more in Section 9.3.4 here: https://docs.python.org/3/tutorial/classes.html

There is further discussion, with additional links, here: https://stackoverflow.com/questions/28703834/why-do-some-methods-use-dot-notation-and-others-dont