# 1. Preamble

In [1]:
import pandas
import math
import re
import networkx as nx
import networkx.algorithms
import sklearn
import sklearn.cluster
import sklearn.preprocessing
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

**Task 2 - Fix 1: Code Repair and Documentation Improvements**

The original notebook had multiple dense code blocks without inline comments or explanations. This made it difficult to follow the logic and purpose behind key transformations.

To improve clarity and reproducibility, I added comments and separated the cells into sections and subsections for a clearer document structure. I also commented out the clustering code, which is irrelevant to the goal.

# 2. Data Import and Exploration

## 2.1 First Dataset ( Company and Directors )

In [2]:
# Reads first given dataset
company_director_raw = pandas.read_csv('/content/company_directorships.csv')
company_director_raw.head(5)

Unnamed: 0,company_name,cikcode,director_name,software_background,start_date,end_date
0,1ST SOURCE CORP,34782,ALLISON N. EGIDI,f,2011-03-14,2017-03-14
1,1ST SOURCE CORP,34782,ANDREA G. SHORT,f,2023-03-10,2025-03-14
2,1ST SOURCE CORP,34782,CHRISTOPHER J. MURPHY III,t,2008-03-14,2025-03-14
3,1ST SOURCE CORP,34782,CHRISTOPHER J. MURPHY IV,f,2011-03-14,2025-03-14
4,1ST SOURCE CORP,34782,CRAIG A. KAPSON,f,2008-03-14,2017-03-14


In [3]:
# Checks for datatypes, missing values and column names.
company_director_raw.info()
missing_values = company_director_raw.isnull().sum()
print("Missing values in each column:\n", missing_values)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13347 entries, 0 to 13346
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   company_name         13347 non-null  object
 1   cikcode              13347 non-null  int64 
 2   director_name        13347 non-null  object
 3   software_background  13347 non-null  object
 4   start_date           13347 non-null  object
 5   end_date             13347 non-null  object
dtypes: int64(1), object(5)
memory usage: 625.8+ KB
Missing values in each column:
 company_name           0
cikcode                0
director_name          0
software_background    0
start_date             0
end_date               0
dtype: int64


## 2.2 Second Dataset ( Director Details )

In [4]:
# Reads second given dataset
director_details_raw = pandas.read_csv('/content/director-details.csv')
director_details_raw.sample(5)

Unnamed: 0,id,url,name,age,role,compensation,source_excerpt,created_at,gender
4596,4586,https://www.sec.gov/Archives/edgar/data/315958...,Jerome S. Flum,83,"Chairman of the Board, Executive Chairman",150000,Jerome S. Flum was appointed President and Chi...,2025-05-16 16:35:39.811723,male
5904,5906,https://www.sec.gov/Archives/edgar/data/730272...,Carrie Eglinton Manner,51,Director,306651,"Carrie Eglinton Manner, Director, President an...",2025-05-16 16:37:44.478036,female
3342,3331,https://www.sec.gov/Archives/edgar/data/90498/...,Mark C. Doramus,66,Independent Director,155078,"Mark C. Doramus\nMr. Doramus,66, was elected t...",2025-05-16 16:33:48.784569,male
135,133,https://www.sec.gov/Archives/edgar/data/2969/0...,Charles Cogut,76,"Independent Director, Retired Partner, Simpson...",305353,Charles “Casey” Cogut is a retired partner of ...,2025-05-16 16:29:20.735005,male
1713,1705,https://www.sec.gov/Archives/edgar/data/40570/...,J. Randall Waterfield,51,Director,29500,J. Randall Waterfield (1)(4) Age 51 Position...,2025-05-16 16:31:31.031363,male


In [5]:
# Checks for datatypes, missing values and column names.
director_details_raw.info()
missing_values = director_details_raw.isnull().sum()
print("Missing values in each column:\n", missing_values)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5910 entries, 0 to 5909
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              5910 non-null   int64 
 1   url             5910 non-null   object
 2   name            5910 non-null   object
 3   age             5910 non-null   int64 
 4   role            5909 non-null   object
 5   compensation    5910 non-null   int64 
 6   source_excerpt  5910 non-null   object
 7   created_at      5910 non-null   object
 8   gender          5889 non-null   object
dtypes: int64(3), object(6)
memory usage: 415.7+ KB
Missing values in each column:
 id                 0
url                0
name               0
age                0
role               1
compensation       0
source_excerpt     0
created_at         0
gender            21
dtype: int64


## 2.3 Complementary Dataset (Task 4)

**Task 4: Complementary Dataset – Fortune 1000 (2024)**

To enhance the analysis, we incorporated a complementary dataset from Kaggle:  
**“2024 Fortune 1000 Companies”** — [Dataset Link](https://www.kaggle.com/datasets/jeannicolasduval/2024-fortune-1000-companies)

This dataset provides an updated list of Fortune 1000 companies for the year 2024, which helps us:

- Identify whether a director is currently associated with a **top-performing company**.
- Enrich centrality and influence analysis by **flagging high-profile affiliations**.
- Support visualizations or filtering of **top-tier networks**.

I also added a complementary feature, `worked_in_top_company`, using this dataset.

In [6]:
# Reads the complementary dataset
top_companies_raw = pandas.read_csv('/content/fortune1000_2024.csv')
top_companies_raw.sample(5)

Unnamed: 0,Rank,Company,Ticker,Sector,Industry,Profitable,Founder_is_CEO,FemaleCEO,Growth_in_Jobs,Change_in_Rank,...,Assets_M,CEO,Country,HeadquartersCity,HeadquartersState,Website,CompanyType,Footnote,MarketCap_Updated_M,Updated
948,949,Columbia Banking System,COLB,Financials,Commercial Banks,yes,no,no,yes,0.0,...,52173.6,Clint E. Stein,U.S.,Tacoma,Washington,https://www.columbiabankingsystem.com,Public,Columbia Banking System acquired Umpqua Holdin...,4050.0,2024-06-04
663,664,Marriott Vacations Worldwide,VAC,"Hotels, Restaurants & Leisure","Hotels, Casinos, Resorts",yes,no,no,yes,11.0,...,9680.0,John E. Geller Jr.,U.S.,Orlando,Florida,https://www.marriottvacationsworldwide.com,Public,"Market value as of March 28, 2024.",3790.0,2024-06-04
3,4,UnitedHealth Group,UNH,Health Care,Health Care: Insurance and Managed Care,yes,no,no,yes,1.0,...,273720.0,Andrew P. Witty,U.S.,Minnetonka,Minnesota,https://www.unitedhealthgroup.com,Public,"Market value as of July 15, 2024.",474339.0,2024-08-05
183,184,Carrier Global,CARR,Industrials,Industrial Machinery,yes,no,no,yes,12.0,...,32822.0,David L. Gitlin,U.S.,Palm Beach Gardens,Florida,https://www.corporate.carrier.com,Public,"Market value as of March 28, 2024.",52323.0,2024-06-04
787,788,Sally Beauty Holdings,SBH,Retailing,Specialty Retailers: Other,yes,no,yes,no,-20.0,...,2725.3,Denise A. Paulonis,U.S.,Denton,Texas,https://www.sallybeautyholdings.com,Public,"Figures are for fiscal year ended Sept. 30, 20...",1303.0,2024-06-04


In [7]:
# Checks for datatypes, missing values and column names.
top_companies_raw.info()
missing_values = top_companies_raw.isnull().sum()
print("Missing values in each column:\n", missing_values)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 32 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Rank                           1000 non-null   int64  
 1   Company                        1000 non-null   object 
 2   Ticker                         959 non-null    object 
 3   Sector                         1000 non-null   object 
 4   Industry                       1000 non-null   object 
 5   Profitable                     1000 non-null   object 
 6   Founder_is_CEO                 1000 non-null   object 
 7   FemaleCEO                      1000 non-null   object 
 8   Growth_in_Jobs                 1000 non-null   object 
 9   Change_in_Rank                 1000 non-null   float64
 10  Gained_in_Rank                 1000 non-null   object 
 11  Dropped_in_Rank                1000 non-null   object 
 12  Newcomer_to_the_Fortune500     500 non-null    ob

# 3. Data Cleaning and Preprocessing

## 3.1 Cleaning the names

In [8]:
def clean_name(name):
    name = re.sub(r'[^A-Z ]', ' ', name.upper())  # keeps only letters and spaces
    name = re.sub(r'\s+', ' ', name)              # collapse multiple spaces into one
    return name.strip()

In [9]:
# Cleans names in company_director dataset
company_director_raw['company_name'] = company_director_raw['company_name'].apply(clean_name)
company_director_raw['director_name'] = company_director_raw['director_name'].apply(clean_name)
# Cleans names in director_details dataset
director_details_raw['name'] = director_details_raw['name'].apply(clean_name)
# Cleans names in top_companies dataset
top_companies_raw['Company'] = top_companies_raw['Company'].apply(clean_name)

**Task 2 - Fix 2: Standardized names using regex cleanup**

I improved the cleaning step using regular expressions that replace every non-alphabetic character with a space and collapse repeated spaces. This improves name matching across datasets, especially with the complementary dataset.   
For example, "CLAIRE BABINEAUX- FONTENOT" and "CLAIRE BABINEAUX-FONTENOT" were previously treated as different people. Both now normalize to the same name.
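
The cleanup can be checked in isolation. The sketch below copies the body of `clean_name` from the cell above and runs it on the example name. The lowercase spelling is an assumed input added only for illustration. It also shows a side effect visible in later outputs: digits are removed, so `1ST SOURCE CORP` becomes `ST SOURCE CORP`.

```python
import re

def clean_name(name):
    # Same logic as the notebook's clean_name: keep A-Z and spaces only.
    name = re.sub(r'[^A-Z ]', ' ', name.upper())
    name = re.sub(r'\s+', ' ', name)
    return name.strip()

a = clean_name("CLAIRE BABINEAUX- FONTENOT")
b = clean_name("Claire Babineaux-Fontenot")   # assumed variant spelling
print(a)         # CLAIRE BABINEAUX FONTENOT
print(a == b)    # True

# Side effect: digits are treated like punctuation and dropped.
print(clean_name("1ST SOURCE CORP"))   # ST SOURCE CORP
```

Because the digit is dropped consistently in every dataset, joins on the cleaned name still line up, but the cleaned value is no longer the company's real name.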



In [10]:
# Renames columns so that director_name and company_name are consistent throughout the notebook.
top_companies_raw.rename(columns={'Company': 'company_name'}, inplace=True)
director_details_raw.rename(columns={'name': 'director_name'}, inplace=True)

In [11]:
company_director_raw.sample(5)

Unnamed: 0,company_name,cikcode,director_name,software_background,start_date,end_date
12968,WESBANCO INC,203596,MICHAEL L PERKINS,f,2022-03-16,2022-03-16
796,APOGEE ENTERPRISES INC,6845,FRANK G HEARD,f,2021-05-11,2023-05-12
1547,BK TECHNOLOGIES CORP,2186,LEWIS M JOHNSON,f,2016-04-01,2020-04-28
1450,BECTON DICKINSON CO,10795,CATHERINE M BURZIK,f,2016-12-15,2024-12-19
5477,HAEMONETICS CORP,313143,DIANE M BRYANT,t,2024-06-07,2024-06-07


## 3.2 Feature Engineering

In [12]:
# Calculates number of years served in the company
company_director_raw['service_years'] = round((pandas.to_datetime(company_director_raw.end_date) - pandas.to_datetime(company_director_raw.start_date)).dt.days / 365, 2)
display(company_director_raw.head())
print(company_director_raw.size)

Unnamed: 0,company_name,cikcode,director_name,software_background,start_date,end_date,service_years
0,ST SOURCE CORP,34782,ALLISON N EGIDI,f,2011-03-14,2017-03-14,6.01
1,ST SOURCE CORP,34782,ANDREA G SHORT,f,2023-03-10,2025-03-14,2.01
2,ST SOURCE CORP,34782,CHRISTOPHER J MURPHY III,t,2008-03-14,2025-03-14,17.01
3,ST SOURCE CORP,34782,CHRISTOPHER J MURPHY IV,f,2011-03-14,2025-03-14,14.01
4,ST SOURCE CORP,34782,CRAIG A KAPSON,f,2008-03-14,2017-03-14,9.01


93429


In [13]:
# The same director can appear several times for one company, so the service time is summed.
company_director = pandas.DataFrame({
    'service_years': company_director_raw.groupby(['director_name', 'company_name']).service_years.sum(),
}).reset_index()
display(company_director.head())
print(company_director.size)

Unnamed: 0,director_name,company_name,service_years
0,A A BUSCH III,EMERSON ELECTRIC CO,2.01
1,A ALEXANDER ARNOLD III,ACCELERATE DIAGNOSTICS INC,2.99
2,A ALEXANDER MCLEAN III,WORLD ACCEPTANCE CORP,6.02
3,A BARRY RAND,CAMPBELL S CO,1.02
4,A BART HOLADAY,MDU RESOURCES GROUP INC,7.04


39576


In [14]:
# Custom values for different role importance.
def role_importance(role):
    if pandas.isna(role): return 1
    role = role.lower()
    if 'chief executive officer' in role or 'ceo' in role:
        return 10
    elif 'chair' in role:
        return 7
    elif 'president' in role:
        return 5
    elif 'director' in role:
        return 3
    else:
        return 1

In [15]:
# Reads the roles of directors and maps a score according to role importance
director_details_raw['role_score'] = director_details_raw['role'].map(role_importance)
# Director names are duplicated, so rows are grouped by name.
director_details = pandas.DataFrame({
    'compensation': director_details_raw.groupby('director_name').compensation.sum(),
    'role_score': director_details_raw.groupby('director_name').role_score.max(),
})
# Creates log conversion of compensation.
director_details['log_compensation'] = (1 + director_details.compensation).map(math.log10)
director_details.drop('compensation', axis=1, inplace=True)
director_details

Unnamed: 0_level_0,role_score,log_compensation
director_name,Unnamed: 1_level_1,Unnamed: 2_level_1
A CATHERINE NGO,3,6.002665
A EUGENE WASHINGTON,3,5.511712
A F PETROCELLI,10,4.672107
A FARAZ ABBASI,3,5.055501
A G SULZBERGER,7,6.693304
...,...,...
ZACKARY IRANI,10,5.556428
ZAHID AFZAL,3,0.000000
ZENA SRIVATSA ARNOLD,3,5.278664
ZENON S NIE,7,5.306862


**Task 3 - Explore something in the existing dataset**

To enrich the analysis, we engineered two new features:

- **`role_score`**: A custom scoring metric based on the director's role in the company. For example, CEO or Chairperson may receive a higher score than an ordinary member.
- **`service_years`**: Derived from the difference between the start and end dates of a director's service, expressed in years, indicating their tenure on the board.

These features help capture both the **influence** (via role) and **experience** (via tenure) of directors — both potentially important for identifying high-impact individuals who could facilitate corporate acquisitions.
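
As a small worked example of both features, the sketch below recomputes `service_years` for the first row of the directorship table and scores a few role strings. It uses the standard library instead of pandas, and `role_importance` here is a plain-Python copy of the notebook function in which `None` stands in for a missing role. The role "Chairman and CEO" is a hypothetical input.

```python
from datetime import date

# First row of the directorship table: 2011-03-14 to 2017-03-14.
start, end = date(2011, 3, 14), date(2017, 3, 14)
days = (end - start).days              # six calendar years plus two leap days
service_years = round(days / 365, 2)   # same formula as the notebook cell
print(days, service_years)             # 2192 6.01

def role_importance(role):
    # Plain-Python copy of the scoring rules; None stands in for a missing role.
    if role is None:
        return 1
    role = role.lower()
    if 'chief executive officer' in role or 'ceo' in role:
        return 10
    elif 'chair' in role:
        return 7
    elif 'president' in role:
        return 5
    elif 'director' in role:
        return 3
    return 1

for role in ["Chairman of the Board, Executive Chairman",
             "Independent Director",
             "Chairman and CEO",   # hypothetical role string
             None]:
    print(role, role_importance(role))
```

Dividing by 365 ignores leap days, which is why a six-year term shows up as 6.01 and not 6.00. The order of the checks matters too: a role that mentions both a chair and a CEO position receives the CEO score.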


In [16]:
# Adds a new variable to check if the director worked in top-tier (Fortune-1000 2024) companies.
top_company_names = set(top_companies_raw['company_name'])
company_director['worked_in_top_company'] = company_director['company_name'].isin(top_company_names)
company_director

Unnamed: 0,director_name,company_name,service_years,worked_in_top_company
0,A A BUSCH III,EMERSON ELECTRIC CO,2.01,False
1,A ALEXANDER ARNOLD III,ACCELERATE DIAGNOSTICS INC,2.99,False
2,A ALEXANDER MCLEAN III,WORLD ACCEPTANCE CORP,6.02,False
3,A BARRY RAND,CAMPBELL S CO,1.02,False
4,A BART HOLADAY,MDU RESOURCES GROUP INC,7.04,False
...,...,...,...,...
13187,ZHONGLI LIU,SMART POWERR CORP,4.01,False
13188,ZI YAO LIM,KULICKE SOFFA INDUSTRIES INC,0.00,False
13189,ZIV SHOSHANI,VISHAY INTERTECHNOLOGY INC,15.99,False
13190,ZUHEIR SOFIA,LANCASTER COLONY CORP,17.99,False


In [17]:
company_director.groupby('worked_in_top_company').count()

Unnamed: 0_level_0,director_name,company_name,service_years
worked_in_top_company,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
False,12991,12991,12991
True,201,201,201


In [18]:
# Creates a dataframe with important features for the analysis
company_directors_details = pandas.merge(company_director, director_details, on='director_name', how='left').fillna(0)
company_directors_details

Unnamed: 0,director_name,company_name,service_years,worked_in_top_company,role_score,log_compensation
0,A A BUSCH III,EMERSON ELECTRIC CO,2.01,False,0.0,0.000000
1,A ALEXANDER ARNOLD III,ACCELERATE DIAGNOSTICS INC,2.99,False,0.0,0.000000
2,A ALEXANDER MCLEAN III,WORLD ACCEPTANCE CORP,6.02,False,0.0,0.000000
3,A BARRY RAND,CAMPBELL S CO,1.02,False,0.0,0.000000
4,A BART HOLADAY,MDU RESOURCES GROUP INC,7.04,False,0.0,0.000000
...,...,...,...,...,...,...
13187,ZHONGLI LIU,SMART POWERR CORP,4.01,False,3.0,3.847819
13188,ZI YAO LIM,KULICKE SOFFA INDUSTRIES INC,0.00,False,0.0,0.000000
13189,ZIV SHOSHANI,VISHAY INTERTECHNOLOGY INC,15.99,False,0.0,0.000000
13190,ZUHEIR SOFIA,LANCASTER COLONY CORP,17.99,False,0.0,0.000000


In [19]:
# Calculates the weight of network edges
company_directors_details['weight'] = 1 / (
    company_directors_details['log_compensation'].clip(lower=1) *
    company_directors_details['role_score'].clip(lower=1) *
    company_directors_details['service_years'].clip(lower=1)
)

In [20]:
company_directors_details.sample(10)

Unnamed: 0,director_name,company_name,service_years,worked_in_top_company,role_score,log_compensation,weight
6895,KEITH M GEHL,KEWAUNEE SCIENTIFIC CORP DE,8.97,False,7.0,5.060702,0.003147
12775,WEI JIANG,STAAR SURGICAL CO,0.0,False,3.0,0.0,0.333333
10154,RICHARD H MANDEL JR,THERAPEUTICSMD INC,0.0,False,0.0,0.0,1.0
10934,RONALD G LEE,LEE PHARMACEUTICALS INC,3.99,False,7.0,5.33656,0.006709
6411,JON ISAAC,U S GOLD CORP,0.0,False,0.0,0.0,1.0
11967,TERENCE TERRY WISE,FORWARD INDUSTRIES INC,0.98,False,0.0,0.0,1.0
7025,KEVIN A PRICE,PAYCHEX INC,2.99,False,3.0,5.480664,0.020341
840,BENJAMIN ROSENZWEIG,BK TECHNOLOGIES CORP,0.99,False,0.0,0.0,1.0
5877,JOEL S FRIEDMAN,ENVELA CORP,3.4,False,0.0,0.0,0.294118
2663,DENISE L RAMOS,ITT INC,6.04,False,3.0,5.569635,0.009909
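
The weights in this sample can be reproduced by hand. The sketch below is a plain-Python restatement of the cell above, with `max(x, 1)` standing in for `.clip(lower=1)`. The helper name `edge_weight` is introduced here for illustration only.

```python
def edge_weight(log_compensation, role_score, service_years):
    # Each factor is floored at 1, mirroring .clip(lower=1) in the notebook cell.
    return 1 / (max(log_compensation, 1) * max(role_score, 1) * max(service_years, 1))

print(round(edge_weight(5.060702, 7.0, 8.97), 6))   # KEITH M GEHL row    -> 0.003147
print(round(edge_weight(0.0, 3.0, 0.0), 6))         # WEI JIANG row       -> 0.333333
print(round(edge_weight(0.0, 0.0, 3.4), 6))         # JOEL S FRIEDMAN row -> 0.294118
```

A smaller weight means a shorter distance in the weighted graph, so well-paid, senior, long-serving directors sit "closer" to their companies. Directors with no match in the details dataset and under one year of service get the maximum weight of 1.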


# 4. Network Analysis

In [21]:
# Creates the graph of director-company network
graph = nx.Graph()
people = []
companies = []
for company, director, weight in zip(company_directors_details.company_name, company_directors_details.director_name, company_directors_details.weight):
    graph.add_edge(company, director, weight=weight)
    people.append(director)
    companies.append(company)

In [22]:
# Selects the largest connected component of the global network.
# connected_components() yields components in no guaranteed size order, so pick the largest explicitly.
components = list(nx.connected_components(graph))
biggest_connected_graph = graph.subgraph(max(components, key=len))
print(len(components))
print(len(biggest_connected_graph.nodes()), len(biggest_connected_graph.edges()))

196
8105 9135


In [23]:
# Figure of biggest network
plt.figure(figsize=(120,120))
nx.draw_networkx(biggest_connected_graph, node_size=10, alpha = 0.5, with_labels= True)
plt.title ('Biggest Network')
plt.show()


In [24]:
%%time
# Calculates different centrality measures of the biggest network.
# (%%time is a cell magic, so it must be the first line of the cell.)
centrality = pandas.DataFrame({
    'eigen': pandas.Series(nx.eigenvector_centrality(biggest_connected_graph)),
    'degree': pandas.Series(nx.algorithms.degree_centrality(biggest_connected_graph)),
    'closeness_weighted': pandas.Series(nx.closeness_centrality(biggest_connected_graph, distance='weight'))
})
people_set = set(people)  # set membership avoids scanning the full list for every node
centrality['is_person'] = centrality.index.map(lambda x: x in people_set)

CPU times: user 11min 49s, sys: 2.62 s, total: 11min 52s
Wall time: 11min 58s


**Task 1 – Centrality Extension**

**Existing Centrality Measures**

- **Degree Centrality**: In this context, a director with high degree centrality sits on many company boards and is widely connected. Such individuals are valuable when the goal is broad access and outreach.

- **Eigenvector Centrality**: A director with high eigenvector centrality is not only well-connected but also connected to other highly influential people. This is useful for identifying elite or prestigious directors who might have more say in high-level business decisions.

**Added Centrality Measure: Closeness Centrality**

I extended the analysis by adding **Closeness Centrality** (as `closeness_weighted` in the code). This measures how close a node is to all others in terms of the shortest weighted paths.

In this project, I used **edge weights** based on director-level attributes (like compensation, role, and service years), so closeness centrality helps reveal directors who are strategically placed to reach others **efficiently** across the network. A high closeness value suggests a director can communicate or influence others quickly, making them potential **connectors or brokers** for acquisition conversations.

This added metric gives us a more nuanced perspective alongside degree and eigenvector centrality — combining visibility, prestige, and reach into a richer analysis of influence.
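
To make the definition concrete, the sketch below computes weighted closeness on a tiny hypothetical director-company chain using only the standard library. It follows the usual definition for a connected graph, (n - 1) divided by the sum of shortest-path distances, and is not the networkx implementation. All node names and weights are made up.

```python
import heapq

# Hypothetical director-company chain; weights play the role of distances.
edges = [("D1", "C1", 0.25), ("C1", "D2", 0.5), ("D2", "C2", 0.5), ("C2", "D3", 1.0)]

adj = {}
for u, v, w in edges:
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w

def shortest_distances(adj, source):
    # Dijkstra's algorithm over non-negative edge weights.
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, w in adj[node].items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

def closeness(adj):
    # Assumes the graph is connected, as the largest component is.
    n = len(adj)
    return {u: (n - 1) / sum(shortest_distances(adj, u).values()) for u in adj}

scores = closeness(adj)
for node, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, round(value, 3))
```

The middle node `D2` scores highest, and its value exceeds 1. That is expected whenever edge weights are below 1, which is why values such as 1.467 appear in the `closeness_weighted` column.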


In [25]:
# Resets the index of centrality and moves the node name into a director_name column
print(centrality[centrality.is_person].reset_index().columns)
centrality = centrality[centrality.is_person].reset_index().rename(columns={'index': 'director_name'})
display(centrality)

Index(['index', 'eigen', 'degree', 'closeness_weighted', 'is_person'], dtype='object')


Unnamed: 0,director_name,eigen,degree,closeness_weighted,is_person
0,A A BUSCH III,1.013264e-17,0.000123,0.925667,True
1,A ALEXANDER ARNOLD III,6.009714e-10,0.000123,1.059258,True
2,A ALEXANDER MCLEAN III,1.026281e-23,0.000123,0.438744,True
3,A BARRY RAND,1.148973e-22,0.000123,0.632682,True
4,A BRAY CARY JR,2.234234e-12,0.000123,0.571249,True
...,...,...,...,...,...
7662,ZELL B MILLER,5.898645e-21,0.000123,0.327292,True
7663,ZENA SRIVATSA ARNOLD,6.429499e-18,0.000123,1.467003,True
7664,ZENON S NIE,3.225206e-28,0.000123,0.685415,True
7665,ZI YAO LIM,1.056443e-25,0.000123,0.601265,True


In [26]:
# Adds another variable to check if the director worked in top-tier companies.
people_df = pandas.merge(
    centrality,
    company_directors_details.groupby('director_name').worked_in_top_company.any().reset_index(),
    on="director_name",
    how="left"
)
people_df.drop('is_person', axis=1, inplace=True)
people_df

Unnamed: 0,director_name,eigen,degree,closeness_weighted,worked_in_top_company
0,A A BUSCH III,1.013264e-17,0.000123,0.925667,False
1,A ALEXANDER ARNOLD III,6.009714e-10,0.000123,1.059258,False
2,A ALEXANDER MCLEAN III,1.026281e-23,0.000123,0.438744,False
3,A BARRY RAND,1.148973e-22,0.000123,0.632682,False
4,A BRAY CARY JR,2.234234e-12,0.000123,0.571249,False
...,...,...,...,...,...
7662,ZELL B MILLER,5.898645e-21,0.000123,0.327292,False
7663,ZENA SRIVATSA ARNOLD,6.429499e-18,0.000123,1.467003,False
7664,ZENON S NIE,3.225206e-28,0.000123,0.685415,False
7665,ZI YAO LIM,1.056443e-25,0.000123,0.601265,False


In [27]:
# Scales the centrality measures
scaler = StandardScaler()
people_df_scaled = people_df.copy()
people_df_scaled[['degree', 'eigen', 'closeness_weighted']] = scaler.fit_transform(people_df[['degree', 'eigen', 'closeness_weighted']])

people_df_scaled['influence_score'] = (
    people_df_scaled['degree'] +
    people_df_scaled['eigen'] +
    people_df_scaled['closeness_weighted']
)

people_df_scaled.head()

Unnamed: 0,director_name,eigen,degree,closeness_weighted,worked_in_top_company,influence_score
0,A A BUSCH III,-0.084018,-0.199688,-0.522169,False,-0.805875
1,A ALEXANDER ARNOLD III,-0.084018,-0.199688,-0.225383,False,-0.509089
2,A ALEXANDER MCLEAN III,-0.084018,-0.199688,-1.603916,False,-1.887622
3,A BARRY RAND,-0.084018,-0.199688,-1.173063,False,-1.456769
4,A BRAY CARY JR,-0.084018,-0.199688,-1.309544,False,-1.59325


**Task 2 - Fix 4 : Enhanced Influence Scoring (New Metric)**  
Previously, the influence of directors was measured using only **eigenvector centrality**, which alone may not fully capture a node’s impact in the network.

We improved this by introducing a new feature: `influence_score`, a **composite score** derived from:
- Eigenvector centrality
- Closeness centrality
- Degree centrality

This allows for a more comprehensive ranking of influential directors by balancing reach, connectedness, and importance in the network. Each measure is standardized before summing, so all three contribute on a comparable scale.
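
The composite score can be sketched without scikit-learn. The example below standardizes three made-up centrality columns with a z-score and sums them with equal weight. The director names and values are hypothetical, and the population standard deviation is used because that is the convention `StandardScaler` follows.

```python
from statistics import fmean, pstdev

# Hypothetical centrality values for three directors.
directors = ["DIRECTOR A", "DIRECTOR B", "DIRECTOR C"]
measures = {
    "degree": [1.0, 2.0, 3.0],
    "eigen": [0.0, 0.0, 0.3],
    "closeness_weighted": [0.5, 1.5, 1.0],
}

def standardize(values):
    # z-score using the population standard deviation (ddof = 0).
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

scaled = {name: standardize(values) for name, values in measures.items()}

# Equal-weight sum of the standardized measures, one score per director.
influence_score = {
    d: sum(scaled[name][i] for name in scaled)
    for i, d in enumerate(directors)
}

ranking = sorted(influence_score, key=influence_score.get, reverse=True)
print(ranking)   # ['DIRECTOR C', 'DIRECTOR B', 'DIRECTOR A']
```

Equal weighting is a design choice, not a derived result. If one measure should count for more, each standardized column could be multiplied by a weight before summing.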


In [28]:
# Filters the directors who worked in top-tier companies and sorts them by influence score, highest first
result = people_df_scaled[people_df_scaled['worked_in_top_company']].sort_values('influence_score', ascending=False)

In [29]:
# Top 10 influential directors.
result.head(10)

Unnamed: 0,director_name,eigen,degree,closeness_weighted,worked_in_top_company,influence_score
303,ANNE M MULCAHY,-0.084018,1.886154,1.295568,True,3.097704
4545,MARILLYN A HEWSON,-0.084018,1.886154,1.270597,True,3.072733
123,ALEX GORSKY,-0.084018,1.886154,1.25711,True,3.059246
3418,JOHN G STRATTON,-0.084018,1.886154,1.254737,True,3.056873
6290,RONALD A WILLIAMS,-0.084018,1.886154,1.254261,True,3.056397
4787,MATTHEW J ESPE,-0.084018,1.886154,1.226091,True,3.028227
4744,MARY J STEELE GUILFOILE,-0.084018,1.886154,1.085927,True,2.888063
4921,MICHAEL F ROMAN,-0.084018,0.843233,1.336082,True,2.095297
4924,MICHAEL G O GRADY,-0.084018,0.843233,1.273507,True,2.032722
269,ANGEL R MARTINEZ,-0.084018,0.843233,1.189915,True,1.94913


**Task 5b – Implementation of Complementary Dataset (Fortune 1000)**

To enhance the analysis, I incorporated the Fortune 1000 dataset, which lists the top U.S. companies by revenue. Since the VC fund's goal is to facilitate an acquisition by a wealthy U.S.-based firm, this dataset is highly relevant for identifying directors with valuable corporate connections.

I flagged each director with a boolean indicator `worked_in_top_company`, which is `True` if they served on the board of any Fortune 1000 company. This flag was merged into the main dataset using cleaned and standardized company names.

The influence analysis was driven by three centrality measures: `degree`, `eigenvector`, and `closeness`. The `influence_score` was computed as the sum of these standardized values. We then used the `worked_in_top_company` flag **to filter** and prioritize directors with both strong network centrality and prior connections to top-tier companies.

This implementation added real-world business relevance to our analysis by distinguishing directors with direct experience in large, acquisition-capable firms.
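
The flagging logic reduces to a set lookup per directorship followed by an "any" per director. The sketch below restates it in plain Python on hypothetical rows. The names are made up, and the notebook itself does this with `isin` and `groupby(...).any()`.

```python
from collections import defaultdict

# Hypothetical (director_name, company_name) pairs after name cleaning.
rows = [
    ("DIRECTOR A", "BIG CO"),
    ("DIRECTOR A", "SMALL CO"),
    ("DIRECTOR B", "SMALL CO"),
    ("DIRECTOR C", "BIG CO INC"),   # same firm in spirit, different spelling
]
top_company_names = {"BIG CO"}      # stands in for the cleaned Fortune 1000 names

# A director is flagged if any of their companies is in the top-company set.
worked_in_top_company = defaultdict(bool)
for director, company in rows:
    worked_in_top_company[director] |= company in top_company_names

print(dict(worked_in_top_company))
```

Because the lookup is an exact string match, a spelling difference such as a trailing `INC` leaves `DIRECTOR C` unflagged in this toy data. The same effect may contribute to the low number of matched rows in the real data, though the notebook does not measure it.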


In [30]:
# df[df.director_name == 'ELIZABETH KRENTZMAN']

In [31]:
# robust_scaler = sklearn.preprocessing.RobustScaler()
# age_and_demographics_scaled = robust_scaler.fit_transform(people_df[['age', 'log_compensation', 'degree', 'eigen']])

In [32]:
# dbscan = sklearn.cluster.DBSCAN(eps=0.4)
# people_df['cluster_id'] = dbscan.fit_predict(age_and_demographics_scaled)
# people_df.cluster_id.value_counts()

In [33]:
# people_df.plot.scatter(x='age', y='log_compensation', c='cluster_id', cmap="rainbow", s=4)

In [34]:
# people_df.loc['ELIZABETH KRENTZMAN']