## Generating a corpus with Gensim to classify companies with an unsupervised approach

We will use word embeddings to measure similarity among companies and to classify them under the Standard Industrial Classification (SIC) system. SIC codes are standardized numerical codes, developed by the U.S. government, that group businesses by their primary economic activity so that economic reporting and analysis stay uniform. Each SIC code is a four-digit number: the first two digits identify the major group, the third digit the industry group, and the fourth digit the specific industry. Major groups are in turn gathered into broad divisions such as Mining or Manufacturing. We will start this project using only that first, broadest level of classification. For more information, see https://en.wikipedia.org/wiki/Standard_Industrial_Classification#:~:text=The%20Standard%20Industrial%20Classification%20
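
To make the first classification level concrete, here is a minimal sketch that maps a four-digit SIC code to its top-level division. The helper name `sic_division` and the table `SIC_DIVISIONS` are hypothetical and not part of this notebook; the ranges follow the division list on the Wikipedia page linked above.

```python
# Hypothetical lookup table: (first major group, last major group, division name).
# The major group is the first two digits of the four-digit SIC code.
SIC_DIVISIONS = [
    (1, 9, "Agriculture, Forestry and Fishing"),
    (10, 14, "Mining"),
    (15, 17, "Construction"),
    (20, 39, "Manufacturing"),
    (40, 49, "Transportation, Communications, Electric, Gas and Sanitary Services"),
    (50, 51, "Wholesale Trade"),
    (52, 59, "Retail Trade"),
    (60, 67, "Finance, Insurance and Real Estate"),
    (70, 89, "Services"),
    (91, 97, "Public Administration"),
    (99, 99, "Nonclassifiable"),
]


def sic_division(code):
    """Return the division name for a SIC code, or None if the range is unused."""
    # zfill restores a leading zero lost when the code was stored as an integer.
    major_group = int(str(code).zfill(4)[:2])
    for low, high, name in SIC_DIVISIONS:
        if low <= major_group <= high:
            return name
    return None
```

For example, `sic_division("2834")` falls in major group 28 and therefore in the Manufacturing division.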

By leveraging embeddings, which represent words as vectors so that semantically related words lie close together, we aim to capture similarities in the text extracted from company websites and Wikipedia pages. Companies can then be grouped by how close their text is to the vocabulary of each industry, rather than by exact keyword matches alone.
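
As a rough illustration of what "similarity in a vector space" means, the sketch below computes cosine similarity between toy vectors. The three-dimensional vectors are invented for the example; the GloVe model loaded later in this notebook uses 300 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy embeddings, made up for illustration only.
toy_vectors = {
    "bank": [0.9, 0.1, 0.0],
    "finance": [0.8, 0.2, 0.1],
    "farm": [0.0, 0.2, 0.9],
}

bank_finance = cosine_similarity(toy_vectors["bank"], toy_vectors["finance"])
bank_farm = cosine_similarity(toy_vectors["bank"], toy_vectors["farm"])
```

With these toy values, `bank_finance` is much larger than `bank_farm`, which is the behavior we rely on when comparing company text to industry keywords.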

![SIC Classification](sic_codes.png)

In [None]:
import pandas as pd
import gensim  ## Topic modeling and document similarity.
import gensim.downloader as gensim_api  ## Downloads pre-trained models through the Gensim API.
import seaborn as sns
import matplotlib.pyplot as plt  ## pyplot provides plt.subplots(), used for the plot below.

In [None]:
nlp = gensim_api.load("glove-wiki-gigaword-300")

In [None]:
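## Expand a list of seed words with their nearest neighbours in the embedding space
def get_similar_words(lst_words, top, nlp):
    lst_out = list(lst_words)  # copy, so the caller's list is not modified
    # most_similar treats the list as positive examples and returns (word, score) tuples
    for tupla in nlp.most_similar(lst_words, topn=top):
        lst_out.append(tupla[0])
    return list(set(lst_out))  # drop duplicates; order is not preserved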

In [None]:
## Create Dictionary {category:[keywords]}
## Seed words are illustrative picks based on the SIC division names; adjust as needed.
dic_clusters = {}
dic_clusters["Farming"] = get_similar_words(['agriculture','fishing','forestry','farming'], top=30, nlp=nlp)
dic_clusters["Mining"] = get_similar_words(['gold','coal','silver','mining','extraction'], top=30, nlp=nlp)
dic_clusters["Construction"] = get_similar_words(['build','construction','state'], top=30, nlp=nlp)
dic_clusters["Manufacturing"] = get_similar_words(['manufacture','plant'], top=30, nlp=nlp)
dic_clusters["Transportation"] = get_similar_words(['transportation','communications','utilities'], top=30, nlp=nlp)
dic_clusters["Retail"] = get_similar_words(['wholesale','retail'], top=30, nlp=nlp)
dic_clusters["Banking"] = get_similar_words(['finance','insurance','banking'], top=30, nlp=nlp)
dic_clusters["Services"] = get_similar_words(['services','hotel','health'], top=30, nlp=nlp)

## print results to explore
for k,v in dic_clusters.items():
    print(k, ": ", v[0:5], "...", len(v))
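
One possible way to use such a keyword dictionary for classification is sketched below: average the vectors of a company's tokens, average the vectors of each cluster's keywords, and pick the cluster with the highest cosine similarity. This is an assumption about how the next step could look, not the notebook's confirmed method. The names `classify`, `mean_vector`, `cosine`, `toy_vectors` and `toy_clusters` are hypothetical, and the two-dimensional vectors are invented so the example runs without downloading a model.

```python
import math


def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(tokens, clusters, vectors):
    """Return the cluster whose keyword centroid is closest to the token centroid."""
    known = [vectors[t] for t in tokens if t in vectors]  # skip out-of-vocabulary tokens
    if not known:
        return None
    doc_vector = mean_vector(known)
    scores = {}
    for name, words in clusters.items():
        keyword_vectors = [vectors[w] for w in words if w in vectors]
        if keyword_vectors:
            scores[name] = cosine(doc_vector, mean_vector(keyword_vectors))
    return max(scores, key=scores.get) if scores else None


# Toy data, made up for illustration only.
toy_vectors = {
    "farming": [0.9, 0.1], "crops": [0.8, 0.2], "harvest": [0.85, 0.1],
    "mining": [0.1, 0.9], "coal": [0.2, 0.8], "ore": [0.15, 0.95],
}
toy_clusters = {"Farming": ["farming", "crops"], "Mining": ["mining", "coal"]}

label = classify(["harvest", "crops", "unknownword"], toy_clusters, toy_vectors)
```

Here `label` is `"Farming"`. With the real model, `dic_clusters` and `nlp` could probably be passed in place of the toy objects, assuming the loaded vectors support `in` checks and item lookup by word.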

In [None]:
tot_words = [word for v in dic_clusters.values() for word in v]
X = nlp[tot_words]
tot_words
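
Because each cluster's keyword list is built independently, the same word can end up in more than one cluster, and the flattened word list then contains it several times. The sketch below is a small sanity check for that; the helper name `shared_keywords` and the toy dictionary are hypothetical.

```python
from collections import defaultdict


def shared_keywords(clusters):
    """Map each word that appears in more than one cluster to those cluster names."""
    owners = defaultdict(list)
    for name, words in clusters.items():
        for word in set(words):  # count a word once per cluster
            owners[word].append(name)
    return {word: names for word, names in owners.items() if len(names) > 1}


# Toy clusters, made up for illustration only.
toy_clusters = {
    "Manufacturing": ["manufacture", "plant", "factory"],
    "Retail": ["wholesale", "plant", "store"],
}

overlap = shared_keywords(toy_clusters)
# overlap == {'plant': ['Manufacturing', 'Retail']}
```

Running `shared_keywords(dic_clusters)` on the real dictionary would show which points in the later scatter plot are drawn more than once under different cluster labels.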

In [None]:
## t-SNE: project the 300-dimensional word vectors onto 2 dimensions for plotting
from sklearn import manifold

tsne = manifold.TSNE(perplexity=40, n_components=2, init='pca')  # init='pca' only sets the starting layout
X = tsne.fit_transform(X)  # numpy array with one 2-D point per keyword
my_dataframe = pd.DataFrame(X, columns=['x', 'y'])
my_dataframe

In [None]:
## create a dataframe to plot the 2-D t-SNE coordinates, one row per keyword, labelled by cluster
dtf = pd.DataFrame()
for k,v in dic_clusters.items():
    size = len(dtf) + len(v)
    dtf_group = pd.DataFrame(X[len(dtf):size], columns=["x","y"], 
                             index=v)
    dtf_group["cluster"] = k
    dtf = pd.concat([dtf, dtf_group])
        
## plot
fig, ax = plt.subplots()
sns.scatterplot(data=dtf, x="x", y="y", hue="cluster", ax=ax)
ax.get_legend().set_title("")  # hide the legend title without touching the cluster labels
ax.set(xlabel=None, ylabel=None, xticks=[], xticklabels=[], 
       yticks=[], yticklabels=[])
