#### Author: Alex Sherman

# Document Clustering Homework

In this homework, we will cluster Oracle 10-K reports from 1994 to 2015. 

Text has been extracted from the reports and split into sections, with each section's section_name stored separately from its section_text. Neither the section names nor the section text are perfectly structured, so we must deal with inconsistencies in the text.

Our task is to cluster the sections into logical groups.

In [2]:
import pandas as pd

In [3]:
path = r'..\..\2_dataset\oracle_10k.csv'
sections = pd.read_csv(path, encoding='iso-8859-1')

### Observe the data

In [4]:
sections.head(20)

Unnamed: 0.1,Unnamed: 0,section_id,filename,section_name,section_text
0,0,1,oracle-corporation_annual_report_1994.docx,ORACLE SYSTEMS FORM 10-K,(Annual Report) Filed 07/27/94 for the Period ...
1,1,2,oracle-corporation_annual_report_1994.docx,"REDWOOD CITY, CA 94065",Telephone\t6505067000
2,2,3,oracle-corporation_annual_report_1994.docx,CIK\t0000777676,SIC Code\t7372 - Prepackaged Software Industry...
3,3,4,oracle-corporation_annual_report_1994.docx,ORACLE CORP /DE/ FORM 10-K,(Annual Report) Filed 7/27/1994 For Period End...
4,4,5,oracle-corporation_annual_report_1994.docx,SECURITIES AND EXCHANGE COMMISSION,"Washington, D.C. 20549"
5,5,6,oracle-corporation_annual_report_1994.docx,FORM 10-K [X] ANNUAL REPORT PURSUANT TO SECTIO...,(Exact name of registrant as specified in its ...
6,6,7,oracle-corporation_annual_report_1994.docx,SECURITIES REGISTERED PURSUANT TO SECTION 12(B...,(Title of class) Indicate by check mark whethe...
7,7,8,oracle-corporation_annual_report_1994.docx,ORACLE SYSTEMS CORPORATION 1994 FORM 10-K ANNU...,i
8,8,9,oracle-corporation_annual_report_1994.docx,PART I ITEM 1. BUSINESS,"The Company designs, develops, markets, and su..."
9,9,10,oracle-corporation_annual_report_1994.docx,BACKGROUND,Computer software can be classified into two b...


### Create X (why no y?)

In [5]:
X = sections['section_name']

### Create a CountVectorizer or TfidfVectorizer

Test out different parameters of the vectorizer (e.g. stop_words='english' or None) to observe how they change the clusters.

In [6]:
# import a vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# instantiate the vectorizer
vect = TfidfVectorizer(stop_words='english')

# fit_transform the vectorizer
fit_vect = vect.fit_transform(X)
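
One quick way to see what a vectorizer parameter changes, before looking at clusters, is to compare the shape of the resulting matrices. The sketch below is only an illustration: `sample_names` is a small hypothetical stand-in for `sections['section_name']`, and the `candidates` dictionary is an assumed set of settings, not the ones the homework requires.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# hypothetical stand-in for sections['section_name']
sample_names = [
    "ITEM 1. BUSINESS",
    "ITEM 6. SELECTED FINANCIAL DATA",
    "BASIS OF FINANCIAL STATEMENTS",
    "NOTES TO CONSOLIDATED FINANCIAL STATEMENTS",
]

# assumed settings to compare; swap in any others you want to test
candidates = {
    "tfidf, no stop words": TfidfVectorizer(stop_words=None),
    "tfidf, english stop words": TfidfVectorizer(stop_words="english"),
    "counts, english stop words": CountVectorizer(stop_words="english"),
}

vocab_sizes = {}
for name, candidate in candidates.items():
    # rows = documents, columns = distinct terms kept by this vectorizer
    matrix = candidate.fit_transform(sample_names)
    vocab_sizes[name] = matrix.shape[1]
    print(name, matrix.shape)
```

Removing stop words drops terms such as "of" and "to", so the vocabulary shrinks and clusters are less likely to form around filler words.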

### Cluster the sections 

Start with KMeans clustering, but feel free to try other clustering methods afterwards.

In [7]:
# import a clustering model
from sklearn.cluster import KMeans

# instantiate and fit the clustering model
km = KMeans(n_clusters=20)
km.fit(fit_vect)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=20, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
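
As a sketch of trying another clustering method, the example below fits KMeans and AgglomerativeClustering on the same TF-IDF matrix and compares them with a silhouette score. The data in `sample_names` is hypothetical, the choice of two clusters is arbitrary, and the silhouette score is only one assumed way to compare models. A fixed `random_state` is set so the KMeans result is repeatable between runs.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# hypothetical stand-in for sections['section_name']
sample_names = [
    "BASIS OF FINANCIAL STATEMENTS",
    "NOTES TO CONSOLIDATED FINANCIAL STATEMENTS",
    "ITEM 8. FINANCIAL STATEMENTS AND SUPPLEMENTARY DATA",
    "ITEM 11. EXECUTIVE COMPENSATION",
    "STOCK-BASED COMPENSATION",
    "EXECUTIVE OFFICERS",
]

sample_vect = TfidfVectorizer(stop_words="english")
# AgglomerativeClustering needs a dense array, so convert once up front
sample_dense = sample_vect.fit_transform(sample_names).toarray()

models = {
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=42),
    "agglomerative": AgglomerativeClustering(n_clusters=2),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(sample_dense)
    # silhouette ranges from -1 (poor) to 1 (well separated)
    scores[name] = silhouette_score(sample_dense, labels)
    print(name, round(scores[name], 3))
```

On the full dataset the dense conversion may use a lot of memory, so it is worth checking the matrix shape first.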

### Observe the clusters

- Add the labels_ back to the original dataframe
- Sort the values by the cluster to analyze the clusters one by one, observing which sections were placed into each cluster

In [8]:
# add the labels_ back to the sections dataframe
sections['clusters'] = km.labels_

In [9]:
# sort and observe the clusters
sections.sort_values('clusters').head()

Unnamed: 0.1,Unnamed: 0,section_id,filename,section_name,section_text,clusters
3333,3333,3334,oracle-corporation_annual_report_2014.docx,BASIS OF FINANCIAL STATEMENTS,The consolidated financial statements included...,0
167,167,168,oracle-corporation_annual_report_1995.docx,ORACLE CORPORATION NOTES TO CONSOLIDATED FINAN...,"Oracle Corporation designs, develops, markets,...",0
168,168,169,oracle-corporation_annual_report_1995.docx,BASIS OF FINANCIAL STATEMENTS,The consolidated financial statements include ...,0
2096,2096,2097,oracle-corporation_annual_report_2008.docx,ORACLE CORPORATION NOTES TO CONSOLIDATED FINAN...,Deferred revenues consisted of the following:,0
2098,2098,2099,oracle-corporation_annual_report_2008.docx,ORACLE CORPORATION NOTES TO CONSOLIDATED FINAN...,We lease certain facilities and furniture and ...,0
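
Before reading clusters one by one, it can help to check how many sections landed in each. The sketch below uses a small hypothetical dataframe named `toy` in place of `sections`; the cluster labels in it are made up for illustration.

```python
import pandas as pd

# hypothetical stand-in for the sections dataframe with a 'clusters' column
toy = pd.DataFrame({
    "section_name": [
        "BASIS OF FINANCIAL STATEMENTS",
        "NOTES TO CONSOLIDATED FINANCIAL STATEMENTS",
        "COMPETITION",
        "ITEM 11. EXECUTIVE COMPENSATION",
        "EXECUTIVE OFFICERS",
        "STOCK-BASED COMPENSATION",
    ],
    "clusters": [0, 0, 1, 2, 2, 2],
})

# number of sections per cluster, ordered by cluster label
cluster_sizes = toy["clusters"].value_counts().sort_index()
print(cluster_sizes)

# first section name in each cluster as a quick sample
examples = toy.groupby("clusters")["section_name"].first()
print(examples)
```

Very large or very small clusters are often the first ones worth inspecting.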


In [10]:
# optional helper to observe clusters
for label in set(km.labels_):
    print(sections[sections.clusters == label][['section_name','clusters']].head(10))
    print()

                                          section_name  clusters
36                     ITEM 6. SELECTED FINANCIAL DATA         0
37   ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS O...         0
45   ITEM 8. FINANCIAL STATEMENTS AND SUPPLEMENTARY...         0
51   PART IV ITEM 14. EXHIBITS, FINANCIAL STATEMENT...         0
67   ORACLE SYSTEMS CORPORATION NOTES TO CONSOLIDAT...         0
68                       BASIS OF FINANCIAL STATEMENTS         0
138  ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS O...         0
146  ITEM 8. FINANCIAL STATEMENTS AND SUPPLEMENTARY...         0
152  PART IV ITEM 14. EXHIBITS, FINANCIAL STATEMENT...         0
167  ORACLE CORPORATION NOTES TO CONSOLIDATED FINAN...         0

                                          section_name  clusters
0                             ORACLE SYSTEMS FORM 10-K         1
20   PORTABILITY ACROSS COMPUTER HARDWARE AND OPERA...         1
22                                 SYSTEMS INTEGRATION         1
93                      

#### Describe each cluster (optional)

The code below runs a topic model (LatentDirichletAllocation) on the section_name text of each cluster, one cluster at a time. It then prints the top five terms that best describe each cluster's topic.

- This code assumes your clustering model is named km

In [11]:
from sklearn.decomposition import LatentDirichletAllocation

def topic_model(fit_vect, n_topics=1):
    """
    :param fit_vect: document-term matrix from a vectorizer fit on the cluster text
    :param n_topics: num topics (often 1) to explain cluster
    
    :return: fit topic model
    """

    # lda for topic model
    lda = LatentDirichletAllocation(n_components=n_topics, learning_method='batch' , random_state=42)
    lda.fit(fit_vect)

    return lda

    
def topic_explanations(vect, model, n_top_words=5):
    """
    :param vect: fit vect
    :param model: fit topic model
    :param n_top_words: terms to describe topic
    
    :return: list of terms describing topic
    """

    # get feature names (all words from documents)
    # note: get_feature_names() was removed in scikit-learn 1.2; on older versions use it instead
    feature_names = vect.get_feature_names_out()

    # only one topic, so fine to have return in for loop
    for topic in model.components_:
        return " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])


for label in set(km.labels_):
    # filter to only the text in one cluster
    cluster_text = sections[sections.clusters == label]['section_name']
    
    # fit a vectorizer on the text in the cluster
    # (separate name so the full-corpus fit_vect is not overwritten)
    cluster_vect = TfidfVectorizer()
    cluster_fit_vect = cluster_vect.fit_transform(cluster_text)
    
    # run lda (topic model)
    lda_model = topic_model(cluster_fit_vect)
    
    # explain the text in the topic
    topic_explanation = topic_explanations(vect=cluster_vect, model=lda_model, n_top_words=5)
    
    print('Cluster={} | {}'.format(label, topic_explanation))

Cluster=0 | financial statements and of item
Cluster=1 | support systems and hardware software
Cluster=2 | item business properties proceedings legal
Cluster=3 | stock compensation based plans and
Cluster=4 | 31 may in millions share
Cluster=5 | of and oracle exhibit corporation
Cluster=6 | currency foreign translation risk presentation
Cluster=7 | services revenues and the fiscal
Cluster=8 | assets value fair goodwill intangible
Cluster=9 | and accounting of accountants disclosure
Cluster=10 | taxes income for net other
Cluster=11 | year ended 31 may in
Cluster=12 | information segment available geographic item
Cluster=13 | executive item 11 compensation officers
Cluster=14 | software and business new services
Cluster=15 | competition
Cluster=16 | and equity of item related
Cluster=17 | acquisitions fiscal and employees development
Cluster=18 | stock plan restructuring purchase repurchases
Cluster=19 | cash flows from and activities
