#### Author: Alex Sherman

# Document Clustering Homework

In this homework, we will cluster Oracle 10-K reports from 1994 to 2015. 

Text has been extracted from the reports and split into sections, with each section_name stored separately from its section_text. Neither the section names nor the section text is perfectly structured, so we must deal with inconsistencies in the text.

Our task is to cluster the sections into logical groups.

In [1]:
import pandas as pd

In [2]:
path = r'..\2_dataset\oracle_10k.csv'
sections = pd.read_csv(path, encoding='iso-8859-1')

### Observe the data

In [12]:
sections.head(20)

Unnamed: 0.1,Unnamed: 0,section_id,filename,section_name,section_text,clusters
0,0,1,oracle-corporation_annual_report_1994.docx,ORACLE SYSTEMS FORM 10-K,(Annual Report) Filed 07/27/94 for the Period ...,17
1,1,2,oracle-corporation_annual_report_1994.docx,"REDWOOD CITY, CA 94065",Telephone\t6505067000,5
2,2,3,oracle-corporation_annual_report_1994.docx,CIK\t0000777676,SIC Code\t7372 - Prepackaged Software Industry...,5
3,3,4,oracle-corporation_annual_report_1994.docx,ORACLE CORP /DE/ FORM 10-K,(Annual Report) Filed 7/27/1994 For Period End...,17
4,4,5,oracle-corporation_annual_report_1994.docx,SECURITIES AND EXCHANGE COMMISSION,"Washington, D.C. 20549",5
5,5,6,oracle-corporation_annual_report_1994.docx,FORM 10-K [X] ANNUAL REPORT PURSUANT TO SECTIO...,(Exact name of registrant as specified in its ...,5
6,6,7,oracle-corporation_annual_report_1994.docx,SECURITIES REGISTERED PURSUANT TO SECTION 12(B...,(Title of class) Indicate by check mark whethe...,5
7,7,8,oracle-corporation_annual_report_1994.docx,ORACLE SYSTEMS CORPORATION 1994 FORM 10-K ANNU...,i,17
8,8,9,oracle-corporation_annual_report_1994.docx,PART I ITEM 1. BUSINESS,"The Company designs, develops, markets, and su...",1
9,9,10,oracle-corporation_annual_report_1994.docx,BACKGROUND,Computer software can be classified into two b...,5


### Create X (why no y?)

In [4]:
X = sections['section_name']

### Create a CountVectorizer or TfidfVectorizer

Test out different parameters of the vectorizer (e.g. stop_words='english' or None) to observe how they change the clusters.

In [5]:
# import a vectorizer

# instantiate the vectorizer

# fit_transform the vectorizer
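
As a minimal sketch of the comparison suggested above, the snippet below fits two TfidfVectorizer instances on a handful of section names copied from the output in this notebook, once with stop_words=None and once with stop_words='english'. The list `sample_names` and the variable names are illustrative stand-ins, not part of the homework data; in the notebook you would pass `X` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy stand-in for sections['section_name'] (assumption: a few titles
# copied from the notebook output, not the full dataset)
sample_names = [
    "PART I ITEM 1. BUSINESS",
    "PROVISION FOR INCOME TAXES:",
    "INCOME TAXES",
    "OTHER INCOME (EXPENSE):",
    "CONSULTING, EDUCATION, AND SUPPORT SERVICES",
]

# keep every token, including common words such as "for" and "and"
vect_all = TfidfVectorizer(stop_words=None)
dtm_all = vect_all.fit_transform(sample_names)

# drop scikit-learn's built-in English stop words
vect_stop = TfidfVectorizer(stop_words='english')
dtm_stop = vect_stop.fit_transform(sample_names)

# rows = documents, columns = distinct terms kept by each vectorizer
print(dtm_all.shape, dtm_stop.shape)

# terms that disappear once stop words are removed
print(sorted(set(vect_all.vocabulary_) - set(vect_stop.vocabulary_)))
```

Fewer columns means the clustering step sees fewer, more distinctive terms, which is one reason the stop word setting can change which sections end up together.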


### Cluster the sections 

Start with KMeans clustering, but feel free to try out other clustering methods afterwards

In [13]:
# import a clustering model

# instantiate and fit the clustering model
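
A minimal sketch of the KMeans step on toy data follows. The names `sample_names`, `dtm`, and `km_demo` are hypothetical, and n_clusters=2 is chosen only because the toy list is tiny. The output later in this notebook lists cluster labels 0 through 19, which suggests 20 clusters were used there, but that setting is not shown in the source.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# toy stand-in for sections['section_name'] (assumption, not the real data)
sample_names = [
    "PROVISION FOR INCOME TAXES:",
    "INCOME TAXES",
    "INCOME TAXES",
    "PART I ITEM 1. BUSINESS",
    "CONSULTING, EDUCATION, AND SUPPORT SERVICES",
]

# turn the titles into a sparse document-term matrix
dtm = TfidfVectorizer(stop_words='english').fit_transform(sample_names)

# random_state fixes the centroid initialisation so reruns give the same labels
km_demo = KMeans(n_clusters=2, n_init=10, random_state=42)
km_demo.fit(dtm)

# one integer label per input row, in the same order as sample_names
print(km_demo.labels_)
```

The label numbers themselves are arbitrary; only which rows share a label is meaningful.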


### Observe the clusters

- Add the labels_ back to the original dataframe
- Sort the values by the cluster to analyze the clusters one by one, observing which sections were placed into each cluster

In [7]:
# add the labels_ back to the sections dataframe


In [14]:
# sort and observe the clusters
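
The two steps above (attach the labels, then sort by cluster) can be sketched on a toy DataFrame as follows. The frame `demo` and the model `km_demo` are hypothetical stand-ins for `sections` and your own clustering model; the column name `clusters` matches the one used by the helper code in this notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# toy stand-in for the sections dataframe (assumption, not the real data)
demo = pd.DataFrame({'section_name': [
    "PROVISION FOR INCOME TAXES:",
    "PART I ITEM 1. BUSINESS",
    "INCOME TAXES",
    "CONSULTING, EDUCATION, AND SUPPORT SERVICES",
    "OTHER INCOME (EXPENSE):",
]})

dtm = TfidfVectorizer(stop_words='english').fit_transform(demo['section_name'])
km_demo = KMeans(n_clusters=2, n_init=10, random_state=42).fit(dtm)

# labels_ is ordered like the rows that were fit, so it can be assigned directly
demo['clusters'] = km_demo.labels_

# sorting groups each cluster's rows together for side-by-side inspection
demo_sorted = demo.sort_values('clusters')
print(demo_sorted)

# cluster sizes help spot a catch-all cluster that absorbed unrelated titles
print(demo['clusters'].value_counts())
```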


In [9]:
# optional helper to observe clusters
for label in set(km.labels_):
    print(sections[sections.clusters == label][['section_name','clusters']].head(10))
    print()

                                          section_name  clusters
40                             OTHER INCOME (EXPENSE):         0
41                         PROVISION FOR INCOME TAXES:         0
77                                        INCOME TAXES         0
141                            OTHER INCOME (EXPENSE):         0
142                        PROVISION FOR INCOME TAXES:         0
160  OPERATING  INCOME................................         0
161  INCOME BEFORE PROVISION FOR INCOME TAXES AND C...         0
180                                       INCOME TAXES         0
247                        PROVISION FOR INCOME TAXES:         0
271  OPERATING  INCOME...........................\t...         0

                                          section_name  clusters
8                              PART I ITEM 1. BUSINESS         1
21         CONSULTING, EDUCATION, AND SUPPORT SERVICES         1
105                            PART I ITEM 1. BUSINESS         1
120        CONSULTING, E

#### Describe each cluster (optional)

The code below fits a topic model (LatentDirichletAllocation) on the section_name text of each cluster, one cluster at a time. It then prints the five highest-weighted terms for each cluster's topic.

- This code assumes your clustering model is named km and that its labels are stored in the clusters column of the sections dataframe

In [10]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer  # used in the loop below

def topic_model(fit_vect, n_topics=1):
    """
    :param fit_vect: vect fit with cluster text
    :param n_topics: num topics (often 1) to explain cluster
    
    :return: fit topic model
    """

    # lda for topic model
    lda = LatentDirichletAllocation(n_components=n_topics, learning_method='batch', random_state=42)
    lda.fit(fit_vect)

    return lda

    
def topic_explanations(vect, model, n_top_words=5):
    """
    :param vect: fit vect
    :param model: fit topic model
    :param n_top_words: terms to describe topic
    
    :return: space-separated string of the top terms for the first topic
    """

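    # get feature names (the vocabulary learned from the documents)
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0+
    feature_names = vect.get_feature_names_out()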

    # only one topic, so fine to have return in for loop
    for topic in model.components_:
        return " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])


for label in set(km.labels_):
    # filter to only the text in one cluster
    cluster_text = sections[sections.clusters == label]['section_name']
    
    # fit a vector on the text in the cluster
    cluster_vect = TfidfVectorizer()
    fit_vect = cluster_vect.fit_transform(cluster_text)
    
    # run lda (topic model)
    lda_model = topic_model(fit_vect)
    
    # explain the text in the topic
    topic_explanation = topic_explanations(vect=cluster_vect, model=lda_model, n_top_words=5)
    
    print('Cluster={} | {}'.format(label, topic_explanation))

Cluster=0 | taxes income for net accounting
Cluster=1 | business services software item and
Cluster=2 | equity stockholders and liabilities stock
Cluster=3 | item executive of compensation properties
Cluster=4 | cash flows from and activities
Cluster=5 | of and the information to
Cluster=6 | development research and product software
Cluster=7 | software recognition revenue and elements
Cluster=8 | exhibit of title date name
Cluster=9 | financial and of exhibits item
Cluster=10 | 31 may year ended in
Cluster=11 | revenues and support the software
Cluster=12 | plan restructuring stock purchase fiscal
Cluster=13 | legal proceedings item contingencies other
Cluster=14 | acquisitions fiscal other of inc
Cluster=15 | share per earnings net income
Cluster=16 | employees the significant registrant officers
Cluster=17 | oracle corporation to statements consolidated
Cluster=18 | property litigation other intellectual sap
Cluster=19 | currency foreign risk translation item
