In [None]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='darkgrid')

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to scale the data using z-score
from sklearn.preprocessing import StandardScaler

# to compute distances
from scipy.spatial.distance import cdist, pdist

# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# to suppress warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
!pip install yellowbrick



## Data Overview

- Observations
- Sanity checks

## Loading the dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# load the stock data from the mounted Google Drive
data = pd.read_csv('/content/drive/MyDrive/stock_data.csv')

## Overview of the Dataset

The initial steps to get an overview of any dataset are to:
- observe the first few rows of the dataset, to check whether the dataset has been loaded properly or not
- get information about the number of rows and columns in the dataset
- find out the data types of the columns to ensure that data is stored in the preferred format and the value of each property is as expected.
- check the statistical summary of the dataset to get an overview of the numerical columns of the data

### Checking the shape of the dataset

In [None]:
# checking shape of the data
data.shape

The dataset has 340 rows and 15 columns

### Displaying a few rows of the dataset

In [None]:
# let's view a sample of the data
data.sample(n=10, random_state=1)

### Checking the data types of the columns for the dataset

In [None]:
# checking the column names and datatypes
data.info()

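- Ticker Symbol, Security, GICS Sector and GICS Sub Industry are categorical (object) variables; Ticker Symbol and Security identify each of the 340 stocks
- The remaining 11 variables are numeric (the statistical summary shows several of them take decimal values, so they are not all integers)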

### Creating a copy of original data

In [None]:
# copying the data to another variable to avoid any changes to original data
df = data.copy()

### Checking for duplicates and missing values

In [None]:
# checking for duplicate values
df.duplicated().sum()

There are no duplicate entries

In [None]:
# checking for missing values in the data
df.isna().sum()

There are no missing values in our data

In [None]:
# dropping the Ticker Symbol column: it is only an identifier, and Security already names each stock
df.drop("Ticker Symbol", axis=1, inplace=True)

### Statistical summary of the dataset

**Let's check the statistical summary of the data.**

In [None]:
df.describe(include='all').T

***Observations***
- There are 340 securities listed and each column has a value for each security
- The average current price is 80.86 with the lowest price of 4.50 and the highest current price of 1,274.95
- The average price change is +4.08 with the lowest price change of -47.13 and the highest price change of +55.05
- The average volatility rate is 1.53 with the lowest of 0.733 and the highest of 4.58
- The average ROE is 39.60 with the lowest of 1.0 and the highest of 917
- The average Cash Ratio is 70.02 with the lowest of 0 and the highest of 958
- The average Net Cash Flow is 55,537,620 with the lowest of -11,208,000,000 and the highest of 20,764,000,000
- The average Net Income is 1,494,384,602 with the lowest of -23,528,000,000 and the highest of 24,442,000,000
- The average Earnings Per Share is 2.7766 with the lowest of -61.2 and the highest of 50.09
- The average Estimated Shares Outstanding is 577,028,337 with the lowest of 27,672,156 and the highest of 6,159,292,035
- The average P/E Ratio is 32.61 with the lowest of 2.935 and the highest of 528.039
- The average P/B Ratio is -1.72 with the lowest of -76.12 and the highest of 129.06

## Exploratory Data Analysis (EDA)

- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.

**Questions**:

1. What does the distribution of stock prices look like?
2. The stocks of which economic sector have seen the maximum price increase on average?
3. How are the different variables correlated with each other?
4. Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
5. P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?

### Univariate analysis

In [None]:
# function to plot a boxplot and a histogram along the same scale


def histogram_boxplot(df, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    df: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None, i.e., let seaborn choose)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=df, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a marker will indicate the mean value of the column
    # histogram; bins is passed only when given, so seaborn's default binning applies otherwise
    hist_kwargs = {"bins": bins} if bins else {}
    sns.histplot(data=df, x=feature, kde=kde, ax=ax_hist2, **hist_kwargs)
    ax_hist2.axvline(
        df[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        df[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram

**`Current Price`**

In [None]:
histogram_boxplot(df, 'Current Price')

The distribution of Current Price is right-skewed and has a few outliers

**`Price Change`**

In [None]:
histogram_boxplot(df, 'Price Change')

Price Change is roughly symmetrically distributed

**`Volatility`**

In [None]:
histogram_boxplot(df, 'Volatility')

The distribution of Volatility is slightly right-skewed and bimodal

**`ROE`**

In [None]:
histogram_boxplot(df, 'ROE')

The distribution of ROE is right-skewed and has a few outliers

**`Cash Ratio`**

In [None]:
histogram_boxplot(df, 'Cash Ratio')

The distribution of Cash Ratio is right-skewed and bimodal, and has a few outliers

**`Net Cash Flow`**

In [None]:
histogram_boxplot(df, 'Net Cash Flow')

Net Cash Flow is roughly symmetrically distributed and has a few outliers

**`Net Income`**

In [None]:
histogram_boxplot(df, 'Net Income')

Net Income is roughly symmetrically distributed and has a few outliers

**`Earnings Per Share`**

In [None]:
histogram_boxplot(df, 'Earnings Per Share')

Earnings Per Share is roughly symmetrically distributed and has a few outliers

**`Estimated Shares Outstanding`**

In [None]:
histogram_boxplot(df, 'Estimated Shares Outstanding')

Estimated Shares Outstanding is right-skewed with outliers

**`P/E Ratio`**

In [None]:
histogram_boxplot(df, 'P/E Ratio')

P/E Ratio is right-skewed with outliers

**`P/B Ratio`**

In [None]:
histogram_boxplot(df, 'P/B Ratio')

P/B Ratio is roughly symmetrically distributed
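
The visual reads above can be cross-checked numerically. The following is a minimal sketch that uses a small synthetic dataframe in place of the stock data; the name `demo` and its columns `symmetric` and `right_skewed` are made up for illustration. `DataFrame.skew()` is close to 0 for a symmetric column and clearly positive for a right-skewed one, and a mean well above the median points the same way.

```python
import numpy as np
import pandas as pd

# synthetic stand-in data (hypothetical columns, not the stock dataset)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=5000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=5000),
})

# one row per column: mean, median and sample skewness
summary = pd.DataFrame({
    "mean": demo.mean(),
    "median": demo.median(),
    "skew": demo.skew(),
})
print(summary)
```

Running the same three aggregations on `df[num_col]` would give a numeric companion to each histogram.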

In [None]:
# function to create labeled barplots


def labeled_barplot(df, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    df: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(df[feature])  # length of the column
    count = df[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=df,
        x=feature,
        palette="Paired",
        order=df[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal centre of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

**`GICS Sector`**

In [None]:
labeled_barplot(df, 'GICS Sector', perc=True)

In [None]:
df['GICS Sector'].value_counts()

In [None]:
df['GICS Sector'].value_counts(normalize=True)
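
The two `value_counts` calls above can be combined into one table so that counts and percentages sit side by side. This is a minimal sketch on a toy series; the six sector labels are illustrative, not the real data.

```python
import pandas as pd

# toy stand-in for df['GICS Sector'] (values are illustrative only)
sectors = pd.Series(
    ["Industrials", "Industrials", "Financials", "Health Care", "Industrials", "Financials"],
    name="GICS Sector",
)

counts = sectors.value_counts()
sector_table = pd.DataFrame({
    "count": counts,
    "percent": (100 * counts / counts.sum()).round(1),
})
print(sector_table)
```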

***Observations***
- 15.6% (53) of the stocks are in the Industrials GICS Sector
- 14.4% (49) are in Financials
- 11.8% (40) are in Health Care
- 11.8% (40) are in Consumer Discretionary
- 9.7% (33) are in Information Technology

**`GICS Sub Industry`**

In [None]:
labeled_barplot(df, 'GICS Sub Industry', perc=False)

In [None]:
#let's display the top 5 Sub Industries
labeled_barplot(df, 'GICS Sub Industry', n=5, perc=True)

In [None]:
df['GICS Sub Industry'].value_counts(normalize=False)

***Observations***
- 16 or 4.7% of stocks are in the GICS Sub Industry Oil & Gas Exploration & Production
- 14 or 4.1% are REITs
- 14 or 4.1% are Industrial Conglomerates
- 12 or 3.5% are Electric Utilities
- 12 or 3.5% are Internet Software & Services

### Bivariate Analysis

In [None]:
# correlation check (numeric columns only; the text columns cannot be correlated)
plt.figure(figsize=(15, 7))
sns.heatmap(
    df.select_dtypes(include=np.number).corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()

***Observations***
- As expected, Earnings Per Share is positively correlated with Current Price and Net Income
- P/E Ratio is positively correlated with Current Price, Volatility and negatively correlated with Net Income and Earnings per Share.
- Net Income is positively correlated with Earnings Per Share and Estimated Shares Outstanding and negatively correlated with Volatility and ROE
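
Reading pairs off a heatmap by eye is error-prone, so it can help to rank them. The sketch below assumes a synthetic dataframe `demo` with made-up columns `a` to `d`. It keeps the upper triangle of the correlation matrix so each pair appears once, then sorts by absolute correlation.

```python
import numpy as np
import pandas as pd

# synthetic data with known relationships (hypothetical columns)
rng = np.random.default_rng(1)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # strongly positive with a
    "c": -a + rng.normal(scale=0.5, size=200),     # negative with a
    "d": rng.normal(size=200),                     # unrelated noise
})

corr = demo.corr()
# upper triangle only (k=1 excludes the diagonal), so each pair is listed once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
# order the pairs from strongest to weakest, keeping the sign
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked.head(3))
```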


**Let's check the stocks of which economic sector have seen the maximum price increase on average.**

In [None]:
plt.figure(figsize=(15,8))
sns.barplot(data=df, x='GICS Sector', y='Price Change', ci=False)
plt.xticks(rotation=90)
plt.show()

In [None]:
df.groupby(['GICS Sector'])['Price Change'].mean().sort_values(ascending = False)

***Observations***
- Health Care stocks saw the highest average price increase (9.59), followed by Consumer Staples and Information Technology
- Energy stocks saw the largest average price drop (10.22)

**Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. Let's see how the average cash ratio varies across economic sectors.**

In [None]:
plt.figure(figsize=(15,8))
sns.barplot(data=df, x='GICS Sector', y='Cash Ratio', ci=False)
plt.xticks(rotation=90)
plt.show()

In [None]:
df.groupby(['GICS Sector'])['Cash Ratio'].mean().sort_values(ascending = False)

***Observations***

The sectors with the highest average Cash Ratio are:
- IT
- Telecommunications Services
- Health Care

**P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. Let's see how the P/E ratio varies, on average, across economic sectors.**

In [None]:
plt.figure(figsize=(15,8))
sns.barplot(data=df, x='GICS Sector', y='P/E Ratio', ci=False)
plt.xticks(rotation=90)
plt.show()

In [None]:
df.groupby(['GICS Sector'])['P/E Ratio'].mean().sort_values(ascending = False)


***Observations***

The sectors with the highest average P/E ratios are:
- Energy
- IT
- Real Estate
- Health Care
- Consumer Discretionary


**Volatility accounts for the fluctuation in the stock price. A stock with high volatility will witness sharper price changes, making it a riskier investment. Let's see how volatility varies, on average, across economic sectors.**

In [None]:
plt.figure(figsize=(15,8))
sns.barplot(data=df, x='GICS Sector', y='Volatility', ci=False)
plt.xticks(rotation=90)
plt.show()

In [None]:
df.groupby(['GICS Sector'])['Volatility'].mean().sort_values(ascending = False)

***Observations***

The sectors with the highest average volatility, and therefore the riskier investments on this measure, are:
- Energy
- Materials
- IT
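
The per-sector comparisons above each repeat the same `groupby`. The following minimal sketch computes several sector averages in one pass and picks out the extremes with `idxmax`; the five-row dataframe `demo` is illustrative and does not contain the real values.

```python
import pandas as pd

# toy stand-in for df (values are illustrative only)
demo = pd.DataFrame({
    "GICS Sector": ["Energy", "Energy", "Health Care", "Health Care", "Materials"],
    "Price Change": [-12.0, -8.0, 10.0, 9.0, 5.0],
    "Volatility": [2.5, 2.1, 1.3, 1.5, 1.8],
})

# mean of each selected column per sector
sector_means = demo.groupby("GICS Sector")[["Price Change", "Volatility"]].mean()
print(sector_means.sort_values("Volatility", ascending=False))

# sector labels with the largest average value in each column
top_gainer = sector_means["Price Change"].idxmax()
most_volatile = sector_means["Volatility"].idxmax()
print(top_gainer, most_volatile)
```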


## Data Preprocessing

- Duplicate value check
- Missing value treatment
- Outlier check
- Feature engineering (if needed)
- Any other preprocessing steps (if needed)

### Outlier Check

- Let's plot the boxplots of all numerical columns to check for outliers.

In [None]:
plt.figure(figsize=(15, 12))

num_col = df.select_dtypes(include=np.number).columns.tolist()

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
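
The boxplots flag points beyond 1.5 times the interquartile range from the quartiles (`whis=1.5`). The sketch below counts such points per column on a toy dataframe. `count_iqr_outliers` is a hypothetical helper written for this illustration and is not part of the notebook.

```python
import pandas as pd


def count_iqr_outliers(frame, whis=1.5):
    """Count values outside [Q1 - whis*IQR, Q3 + whis*IQR] in each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < q1 - whis * iqr) | (frame > q3 + whis * iqr)
    return outside.sum()


# toy data: the second column has one extreme value
demo = pd.DataFrame({
    "no_outliers": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "one_outlier": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0],
})
outlier_counts = count_iqr_outliers(demo)
print(outlier_counts)
```

Applied to `df[num_col]`, the same helper would report how many stocks fall outside the whiskers for each variable.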

### Scaling

- Let's scale the data before we proceed with clustering.

In [None]:
# Scaling the data set before clustering
scaler = StandardScaler()
subset = df[num_col].copy()
subset_scaled = scaler.fit_transform(subset)

In [None]:
# creating a dataframe of the scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
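
`StandardScaler` replaces each value with its z-score, (x - mean) / standard deviation, computed column by column using the population standard deviation. The sketch below checks this by hand on a toy dataframe; the columns `price` and `shares` are made up to show two very different scales.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy data with two very different scales (hypothetical columns)
demo = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "shares": [1e6, 5e6, 2e6, 8e6],
})

scaled = StandardScaler().fit_transform(demo)
# the same transformation by hand; ddof=0 gives the population standard deviation
manual = (demo - demo.mean()) / demo.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so no single variable dominates the Euclidean distances used by the clustering steps.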

## K-means Clustering

### Checking Elbow Plot

In [None]:
k_means_df = subset_scaled_df.copy()

In [None]:
clusters = range(1, 15)
meanDistortions = []

for k in clusters:
    model = KMeans(n_clusters=k, random_state=1)
    model.fit(k_means_df)
    # average distance from each point to its nearest cluster centre
    distortion = (
        sum(np.min(cdist(k_means_df, model.cluster_centers_, "euclidean"), axis=1))
        / k_means_df.shape[0]
    )

    meanDistortions.append(distortion)

    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)

plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()

In [None]:
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
visualizer.fit(k_means_df)  # fit the data to the visualizer
visualizer.show()  # finalize and render figure

### Let's check the silhouette scores

In [None]:
sil_score = []
cluster_list = range(2, 15)
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    preds = clusterer.fit_predict(k_means_df)
    score = silhouette_score(k_means_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))

plt.plot(cluster_list, sil_score)
plt.show()

In [None]:
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2, 15), metric="silhouette", timings=True)
visualizer.fit(k_means_df)  # fit the data to the visualizer
visualizer.show()  # finalize and render figure

In [None]:
# silhouette plot for 3 clusters
visualizer = SilhouetteVisualizer(KMeans(3, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()

In [None]:
# silhouette plot for 4 clusters
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()

In [None]:
# silhouette plot for 7 clusters
visualizer = SilhouetteVisualizer(KMeans(7, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()

### Creating Final Model

**Let's take 4 as the appropriate number of clusters, as the silhouette score is high enough and there is a kink at 4 in the elbow curve.**
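
The silhouette side of this choice can also be made programmatically. The sketch below assumes synthetic data from `make_blobs` with four well-separated centres, not the stock data. It scores each candidate k and keeps the one with the highest silhouette score. On real data this would be one input to the decision alongside the elbow curve and the cluster profiles, not the decision itself.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic data: four well-separated groups (assumption for illustration)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.6,
    random_state=1,
)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# candidate k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```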

In [None]:
# final K-means model
kmeans = KMeans(n_clusters=4, random_state=1)
kmeans.fit(k_means_df)

In [None]:
# creating a copy of the original data
df1 = df.copy()

# adding kmeans cluster labels to the original and scaled dataframes
k_means_df["KM_segments"] = kmeans.labels_
df1["KM_segments"] = kmeans.labels_

### Cluster Profiling

In [None]:
km_cluster_profile = df1.groupby("KM_segments").mean(numeric_only=True)  # average of each numeric column per cluster

In [None]:
km_cluster_profile["count_in_each_segment"] = (
    df1.groupby("KM_segments")["Security"].count().values
)

In [None]:
km_cluster_profile.style.highlight_max(color="lightgreen", axis=0)

In [None]:
# let's see the names of the companies in each cluster
for cl in df1["KM_segments"].unique():
    print("In cluster {}, the following companies are present:".format(cl))
    print(df1[df1["KM_segments"] == cl]["Security"].unique())
    print()

In [None]:
df1.groupby(["KM_segments", "GICS Sector"])['Security'].count()

In [None]:
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")

# selecting numerical columns
num_col = df.select_dtypes(include=np.number).columns.tolist()

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=df1, x="KM_segments", y=variable)

plt.tight_layout(pad=2.0)

In [None]:
df1.groupby("KM_segments").mean(numeric_only=True).plot.bar(figsize=(30,15))

### Insights

- **Cluster 0**:
    - Net Income is very low for stocks in this cluster.
    - Estimated Shares Outstanding is very low
    - Net Cash Flow is almost 0 in this cluster

- **Cluster 1**:
    - Net Income is very high for stocks in this cluster.
    - Estimated Shares Outstanding is moderate.
    - Net Cash Flow is negative for this cluster


- **Cluster 2**:
    - Net Income is negative for stocks in this cluster.
    - Estimated Shares Outstanding is very low
    - Net Cash Flow is negative for this cluster


- **Cluster 3**:
    - Net Income is low in this cluster
    - Estimated Shares Outstanding is low
    - Net Cash Flow is very low

## Hierarchical Clustering

### Computing Cophenetic Correlation

In [None]:
hc_df = subset_scaled_df.copy()

In [None]:
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(hc_df, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(hc_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm

# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)

**Let's explore different linkage methods with Euclidean distance only.**

In [None]:
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]


high_cophenet_corr = 0
high_dm_lm = [0, 0]

for lm in linkage_methods:
    Z = linkage(hc_df, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(hc_df))
    print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm

# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
    "Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
        high_cophenet_corr, high_dm_lm[1]
    )
)
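
The loops above compare every linkage against `pdist(hc_df)`, which uses Euclidean distance by default whatever metric built the linkage. One alternative is to compute the cophenetic correlation against the pairwise distances in the linkage's own metric. Which reference distances to use is a modelling choice, and this sketch shows only the second option on random data. `cophenetic_corr` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# random stand-in for the scaled data
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))


def cophenetic_corr(X, metric, method):
    """Cophenetic correlation measured against distances in the same metric."""
    dists = pdist(X, metric=metric)    # condensed pairwise distances
    Z = linkage(dists, method=method)  # linkage accepts condensed distances
    c, _ = cophenet(Z, dists)
    return c


results = {
    (metric, method): cophenetic_corr(X, metric, method)
    for metric in ["euclidean", "cityblock", "chebyshev"]
    for method in ["single", "complete", "average", "weighted"]
}
best = max(results, key=results.get)
print(best, results[best])
```

Only the four linkage methods from the first loop are used here, because centroid and ward linkage are defined for Euclidean distance.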

**Let's view the dendrograms for the different linkage methods with Euclidean distance.**

### Checking Dendrograms

In [None]:
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"] ## Complete the code to add linkages

# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []

# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(hc_df, metric="euclidean", method=method)

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(hc_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

    compare.append([method, coph_corr])

In [None]:
# create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc = pd.DataFrame(compare, columns=compare_cols)
df_cc = df_cc.sort_values(by="Cophenetic Coefficient")
df_cc

### Creating model using sklearn

In [None]:
# Euclidean distance is the default; `affinity` is omitted because newer scikit-learn versions renamed it to `metric`
HCmodel = AgglomerativeClustering(n_clusters=4, linkage="average")
HCmodel.fit(hc_df)

In [None]:
# creating a copy of the original data
df2 = df.copy()

# adding hierarchical cluster labels to the original and scaled dataframes
hc_df["HC_Clusters"] = HCmodel.labels_
df2["HC_Clusters"] = HCmodel.labels_

### Cluster Profiling

In [None]:
cluster_profile = df2.groupby("HC_Clusters").mean(numeric_only=True)

In [None]:
cluster_profile["count_in_each_segments"] = (
    df2.groupby("HC_Clusters")["Current Price"].count().values
)

In [None]:
cluster_profile.style.highlight_max(color="lightgreen", axis=0)

In [None]:
# let's see the names of the securities in each cluster
for cl in df2["HC_Clusters"].unique():
    print("In cluster {}, the following securities are present:".format(cl))
    print(df2[df2["HC_Clusters"] == cl]["Security"].unique())
    print()

In [None]:
df2.groupby(["HC_Clusters", "GICS Sector"])['Security'].count()

In [None]:
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=df2, x="HC_Clusters", y=variable)

plt.tight_layout(pad=2.0)

In [None]:
df2.groupby("HC_Clusters").mean(numeric_only=True).plot.bar(figsize=(30,15))

### Insights

- **Cluster 0**:
    - Net Income is moderate for stocks in this cluster.
    - Estimated Shares Outstanding is low
    - Net Cash Flow is very low.

- **Cluster 1**:
    - Net Income is high for stocks in this cluster.
    - Estimated Shares Outstanding is extremely low
    - Net Cash Flow is negative for this cluster


- **Cluster 2**:
    - Net Income is negative for stocks in this cluster.
    - Estimated Shares Outstanding is low
    - Net Cash Flow is negative for this cluster


- **Cluster 3**:
    - Net Income is very high
    - Estimated Shares Outstanding is very high
    - Net Cash Flow is low

## K-means vs Hierarchical Clustering

## Insights and Recommendations

- K-Means took less time for execution
- Both cluster techniques share the following:
  - The majority of stocks fall in cluster 0 under both techniques
  - Both techniques place Apache Corp and Chesapeake Energy in Cluster 2
  - Both techniques place Facebook in Cluster 3
- Hierarchical clustering gave more distinct clusters than K-Means
- Similar clusters from each algorithm:
    - Net Cash Flow is negative for cluster 1 of both algorithms
    - Net Income is negative for stocks in cluster 2 of both algorithms
    - Net Cash Flow is negative for the cluster 2 of both algorithms  
- As expected, Earnings Per Share is positively correlated with Current Price and Net Income
- P/E Ratio is positively correlated with Current Price, Volatility and negatively correlated with Net Income and Earnings per Share.
- Net Income is positively correlated with Earnings Per Share and Estimated Shares Outstanding and negatively correlated with Volatility and ROE
- The sectors with the highest average P/E ratios are:
    - Energy
    - IT
    - Real Estate
    - Health Care
    - Consumer Discretionary

- The sectors with the highest average volatility, and therefore the riskier investments on this measure, are:
    - Energy
    - Materials
    - IT
- The sectors with the highest average Cash Ratio are:
    - IT
    - Telecommunications Services
    - Health Care
- Health Care stocks saw the highest average price increase (9.59), followed by Consumer Staples and Information Technology
- Energy stocks saw the largest average price drop (10.22)
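
Cluster numbers are arbitrary labels, so "cluster 2" in K-means need not correspond to "cluster 2" in hierarchical clustering. A cross-tabulation shows which clusters actually overlap, and the adjusted Rand index summarises the agreement in one number (1 means identical groupings up to relabelling). This is a minimal sketch with made-up label lists standing in for `df1["KM_segments"]` and `df2["HC_Clusters"]`.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# made-up labels for eight stocks (illustrative only)
km_labels = [0, 0, 0, 1, 1, 2, 2, 3]
hc_labels = [1, 1, 1, 0, 0, 2, 2, 2]

# rows: K-means clusters, columns: hierarchical clusters, cells: shared stocks
overlap = pd.crosstab(
    pd.Series(km_labels, name="KM_segments"),
    pd.Series(hc_labels, name="HC_Clusters"),
)
print(overlap)

ari = adjusted_rand_score(km_labels, hc_labels)
print("adjusted Rand index:", round(ari, 3))
```

In this toy case K-means cluster 0 lines up with hierarchical cluster 1, which a comparison by label number alone would miss.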




