**Clustering after NB 0**

In the first step, we will try clustering on the prelabeled data from notebook 0. At this stage, the only processing applied to the raw text is lemmatizing and masking.

A list of the clustering algorithms available in sklearn can be found here:
https://scikit-learn.org/stable/modules/clustering.html

X-means is another potential option, since it estimates k rather than requiring it up front.

In [1]:
import pandas as pd
import numpy as np
test = pd.read_table("../data/polymers.test",encoding="utf-8",header=None)

In [2]:
def process_df(df):
    """Process a df whose single column holds lines
    of the form '<label> <sentence>'."""
    # Split on the first space only, giving [label, sentence] per row
    df = pd.DataFrame(df[0].str.split(" ", n=1))
    df["ground_truth"] = df[0].str[0].apply(lambda x: 1 if x=='__label__yes' else 0)
    # Drop the leading 'b' (presumably left over from a bytes repr)
    df['sentence'] = df[0].str[1].apply(lambda x: x.lstrip('b'))
    del df[0]
    return df

df = process_df(test)

In [3]:
df.head()

Unnamed: 0,ground_truth,sentence
0,0,'Enantiopure Isotactic PCHC Synthesized by Rin...
1,0,"'William Guerin\xe2\x80\xa0, Abdou Khadri Dial..."
2,0,"'Macromolecules, 2014, 47 (13), pp 4230\xe2\x8..."
3,0,'DOI: 10.1021/ma5009397'
4,0,"'Publication Date (Web): June 24, 2014'"


In [3]:
df.ground_truth.value_counts()

0    28241
1     4349
Name: ground_truth, dtype: int64

In [29]:
df.shape

(32590, 3)

As can be seen above, the majority of sentences in the dataset do not contain a polymer (28,241 "no" versus 4,349 "yes"). First, let's try stemming the entire sentence.

The following explores simple count vectorization with k-means to see whether any trends emerge in the clusters without further preprocessing.
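
The `ClusterRunner` module is not shown in this notebook, so the following is only a rough sketch of the kind of pipeline such a runner presumably wraps: vectorize the text, fit k-means, and score the assignment against the labels with homogeneity. The helper name `cluster_and_score`, the toy sentences, and the toy labels are all assumptions made for illustration, not the author's implementation.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import homogeneity_score


def cluster_and_score(sentences, labels, ngram_range, n_clusters, seed=0):
    """Hypothetical helper: count-vectorize, run k-means, return (cluster_ids, homogeneity)."""
    X = CountVectorizer(ngram_range=ngram_range).fit_transform(sentences)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    return km.labels_, homogeneity_score(labels, km.labels_)


# Toy data (made up for illustration; 1 = mentions a polymer, 0 = does not)
toy_sentences = [
    "polystyrene was synthesized by radical polymerization",
    "the polystyrene film was annealed",
    "the authors declare no competing financial interest",
    "this material is available free of charge",
]
toy_labels = [1, 1, 0, 0]

cluster_ids, homogeneity = cluster_and_score(toy_sentences, toy_labels, (1, 2), 2)
print(cluster_ids, round(homogeneity, 3))
```

A homogeneity near 0 means the clusters carry almost no information about the labels, which is the regime the runs below end up in.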

In [3]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

ps = PorterStemmer()
# Build the stop-word set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))

def porter_stem(string):
    """Stem every token of a space-delimited sentence, leaving stop words untouched."""
    list_string = string.split(" ")
    stemmed = [word if word in stop_words else ps.stem(word) for word in list_string]
    return " ".join(stemmed)

df["porter_sentence"] = df.sentence.map(porter_stem)

In [7]:
df.head()

Unnamed: 0,ground_truth,sentence,porter_sentence
0,0,'Enantiopure Isotactic PCHC Synthesized by Rin...,'enantiopur isotact pchc synthes by ring-open ...
1,0,"'William Guerin\xe2\x80\xa0, Abdou Khadri Dial...","'william guerin\xe2\x80\xa0, abdou khadri dial..."
2,0,"'Macromolecules, 2014, 47 (13), pp 4230\xe2\x8...","'macromolecules, 2014, 47 (13), pp 4230\xe2\x8..."
3,0,'DOI: 10.1021/ma5009397','doi: 10.1021/ma5009397'
4,0,"'Publication Date (Web): June 24, 2014'","'public date (web): june 24, 2014'"


In [4]:
from ClusterRunner import KMeansRunner
from sklearn.feature_extraction.text import CountVectorizer

n_grams = [(1,2),(2,2),(1,3),(3,3)]
n_clusters = range(2,5)

kmeans_runs = KMeansRunner().run(n_grams,n_clusters,df.porter_sentence,df.ground_truth,CountVectorizer,
                                 min_size = 300,
                                 min_threshold=0.5)

(1, 2)
2


KeyboardInterrupt: 

In [10]:
kmeans_runs

{'homogeneity_all': {(1, 1): {2: 0.00574091569911332,
   3: 0.008260383068843349,
   4: 0.01026701108749708},
  (1, 2): {2: 0.006239336568233691,
   3: 0.007263586730323506,
   4: 0.010546138160955494},
  (2, 2): {2: 0.0017193928027511185,
   3: 0.0032423801075513138,
   4: 0.005917589377338861},
  (1, 3): {2: 0.006219906665485889,
   3: 0.007181215932669023,
   4: 0.010554662668325453},
  (3, 3): {2: 0.0002733894195257013,
   3: 5.952706844845892e-05,
   4: 0.00115413652394407}},
 'homogeneity_single': {(1,
   1): {2: [(1, 0.8821279245618717, 'no'),
    (0, 0.831882050267168, 'no')], 3: [(1, 0.8853747630276969, 'no'), (0,
     0.7776726584673604,
     'no'), (2, 0.8348985156013329, 'no')], 4: [(3, 0.895406680018089, 'no'),
    (1, 0.77734375, 'no'),
    (0, 0.8513195290296387, 'no'),
    (2, 0.8219214437367304, 'no')]},
  (1, 2): {2: [(1, 0.8823812439856531, 'no'), (0, 0.8293234628829941, 'no')],
   3: [(0, 0.8845057471264368, 'no'),
    (1, 0.8048359240069085, 'no'),
    (2, 0.835384

From the above output, no choice of cluster count produced a subset of the data that consisted heavily of polymers or of non-polymers. Every cluster was predominantly non-polymer, but not at a concentration significantly higher than in the overall population (about 87% "no"). Next, the same procedure is applied with TF-IDF.
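
Before moving on, here is a minimal sketch of how per-cluster figures like the `(cluster, fraction, label)` tuples can be read against the population baseline. The function name `cluster_purity` and the toy assignment are assumptions for illustration; only the baseline counts come from the value counts shown earlier.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, labels):
    """Return {cluster_id: (majority_fraction, majority_label, size)}."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, labels):
        members[cid].append(label)
    summary = {}
    for cid, cluster_labels in members.items():
        majority_label, count = Counter(cluster_labels).most_common(1)[0]
        summary[cid] = (count / len(cluster_labels), majority_label, len(cluster_labels))
    return summary


# Baseline from the value counts: 28241 "no" out of 32590 sentences
baseline_no = 28241 / 32590

# Toy assignment (made up): cluster 0 is pure "no", cluster 1 is mixed
toy_clusters = [0, 0, 0, 1, 1, 1, 1]
toy_labels = ["no", "no", "no", "no", "no", "no", "yes"]

purity = cluster_purity(toy_clusters, toy_labels)
for cid, (fraction, label, size) in sorted(purity.items()):
    # A cluster is only interesting if its fraction sits well away from the baseline
    print(cid, label, size, round(fraction - baseline_no, 3))
```

A cluster whose "no" fraction sits within a few points of the baseline says little more than the class imbalance already does.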

In [5]:
from ClusterRunner import KMeansRunner
from sklearn.feature_extraction.text import TfidfVectorizer

n_grams = [(1,2),(2,2),(1,3),(3,3)]
n_clusters = range(2,5)

kmeans_run_tf = KMeansRunner().run(n_grams,n_clusters,df.porter_sentence,df.ground_truth,TfidfVectorizer,
                                 min_size = 1000,
                                 min_threshold=0.5)

(1, 2)
2
3
4
(2, 2)
2
3
4
(1, 3)
2
3
4
(3, 3)
2
3
4


In [6]:
kmeans_run_tf

{'homogeneity_all': {(1, 2): {2: 0.015026917059195549,
   3: 0.00988356356676446,
   4: 0.013440405745402954},
  (2, 2): {2: 0.00020589232150894726,
   3: 0.0013393800798682093,
   4: 0.006616275659855298},
  (1, 3): {2: 0.012371289077022476,
   3: 0.012800834793106097,
   4: 0.013391848347362416},
  (3, 3): {2: 0.006528119466355282,
   3: 0.0020582855704386264,
   4: 0.0034908972569753414}},
 'homogeneity_single': {(1,
   2): {2: [(1, 0.8609476915206549, 'no', 26927),
    (0, 1.0, 'no', 1314)], 3: [(1, 0.8649063219020489, 'no', 23682), (2,
     0.8508832300986465,
     'no',
     3709)], 4: [(2, 0.8648109167453001, 'no', 23829),
    (1, 0.840531561461794, 'no', 3289)]},
  (2,
   2): {2: [(1, 0.8654200078874721, 'no', 26333),
    (0, 0.8825161887141536, 'no', 1908)], 3: [(2,
     0.869203282679546,
     'no',
     22136),
    (0, 0.8552949538024165, 'no', 6017)], 4: [(2,
     0.8681864929722317,
     'no',
     20260),
    (0, 0.8629611883085769, 'no', 1801),
    (3, 0.8510188679245283

When using the Porter-stemmed data with TF-IDF, several configurations (unigram-bigram with 3 clusters, bigram with 4 clusters, and unigram-trigram with 2 clusters) each produced a cluster that is 100% "no". We could potentially use these clusters to filter out candidates either before or after prediction.

The current problem is that although 100% "no" clusters can be identified, where they appear varies from run to run. They typically appear from bigrams upward, but not consistently. In runs where a configuration does not yield a 100% "no" cluster, its clusters fall back to the 8x% "no" range, which makes them unusable for our purposes.

One option would be to set the seed to guarantee we get the same cluster, but let's explore other options instead.
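
For reference, a minimal sketch of the seeding option using scikit-learn's `KMeans` directly. Whether `KMeansRunner` exposes a seed is not shown in this notebook, so that part is left as an assumption; the synthetic matrix below merely stands in for the vectorized sentences.

```python
import numpy as np
from sklearn.cluster import KMeans

# Small synthetic matrix standing in for the vectorized sentences (illustrative only)
rng = np.random.RandomState(0)
X = rng.rand(60, 5)

# Same seed, same data: the two fits should produce identical assignments
labels_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print(np.array_equal(labels_a, labels_b))
```

Note that a fixed seed only makes a run repeatable; it does not make the resulting 100% "no" cluster any more stable under changes to the data.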

Given the behavior above, deterministic clustering seems more desirable for helping to reduce false positives.
Let's explore the approaches listed here: https://stats.stackexchange.com/questions/205833/deterministic-clustering-approaches


In [4]:
from ClusterRunner import DBScanRunner
from sklearn.feature_extraction.text import TfidfVectorizer

n_grams = [(1,2),(2,2),(1,3),(3,3)]
#n_clusters = range(2,5)

dbscan_tf = DBScanRunner().run(n_grams,df.porter_sentence,df.ground_truth,TfidfVectorizer,
                                 dbscan_kwargs = {"n_jobs":-1}, # -1 for multiprocessing
                                 min_size = 0, #setting a low min size here to start to see if any 'yes' cluster trends may show up
                                 min_threshold=0.5)

(1, 2)
(2, 2)
(1, 3)
(3, 3)


In [5]:
dbscan_tf

{'homogeneity_all': {(1, 2): 0.022641047925681737,
  (2, 2): 0.02078263796314288,
  (1, 3): 0.022557623726731556,
  (3, 3): 0.02201095951119997},
 'homogeneity_single': {(1, 2): [(-1, 0.8584601348929654, 'no', 26347),
   (0, 1.0, 'no', 12),
   (1, 1.0, 'no', 360),
   (2, 1.0, 'no', 129),
   (3, 1.0, 'no', 131),
   (4, 1.0, 'no', 111),
   (5, 1.0, 'no', 38),
   (6, 1.0, 'no', 89),
   (7, 1.0, 'no', 88),
   (8, 1.0, 'no', 64),
   (9, 1.0, 'no', 139),
   (10, 1.0, 'no', 123),
   (11, 1.0, 'no', 12),
   (12, 1.0, 'no', 74),
   (13, 1.0, 'no', 13),
   (14, 1.0, 'no', 10),
   (15, 1.0, 'no', 12),
   (16, 1.0, 'no', 11),
   (17, 1.0, 'no', 96),
   (18, 1.0, 'no', 64),
   (19, 1.0, 'no', 5),
   (20, 1.0, 'no', 24),
   (21, 1.0, 'no', 21),
   (22, 1.0, 'no', 9),
   (23, 1.0, 'no', 6),
   (24, 1.0, 'no', 15),
   (25, 1.0, 'no', 18),
   (26, 1.0, 'no', 5),
   (27, 1.0, 'no', 41),
   (28, 1.0, 'no', 6),
   (29, 1.0, 'no', 5),
   (30, 1.0, 'no', 9),
   (31, 1.0, 'no', 10),
   (32, 1.0, 'no', 13),
 

No majority "yes" clusters were found with the default parameters. One point of interest is the larger "no" cluster in the trigram (3,3) output.
Furthermore, in every n-gram iteration, almost 100% of the "yes" labels fell within the "noise" cluster (DBSCAN assigns the label -1 to points it considers noise). This could indicate that the algorithm is grouping together sentences (those outside -1) that have an extremely high chance of not containing polymers. To examine this idea, let's look at the set of words that appear in clusters other than -1 but do not appear in -1, to get some idea of what these sentences may be about.
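
The vocabulary comparison described here could be sketched as follows. This is a simplified illustration using plain whitespace tokenization; the helper names and the toy data are assumptions, and on the real data one might pass `df.porter_sentence` together with the `labels_` array of a fitted model.

```python
def vocabulary(sentences):
    """Lower-cased whitespace tokens across a collection of sentences."""
    return {token for sentence in sentences for token in sentence.lower().split()}


def words_only_in_clusters(sentences, cluster_labels, noise_label=-1):
    """Tokens seen in clustered rows (label != noise_label) but never in noise rows."""
    clustered = [s for s, c in zip(sentences, cluster_labels) if c != noise_label]
    noise = [s for s, c in zip(sentences, cluster_labels) if c == noise_label]
    return vocabulary(clustered) - vocabulary(noise)


# Toy example (made up for illustration)
toy_sentences = [
    "the polymer was dissolved in toluene",
    "acknowledgment",
    "supporting information available",
    "the supporting film was cast",
]
toy_cluster_labels = [-1, 0, 1, -1]

print(sorted(words_only_in_clusters(toy_sentences, toy_cluster_labels)))
# → ['acknowledgment', 'available', 'information']
```

Here "supporting" is excluded because it also occurs in a noise row, which is exactly the filtering the comparison is meant to provide.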

Further exploration of model tuning should be performed here.
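
As a starting point for that tuning, here is a hedged sketch of a small sweep over DBSCAN's `eps` and `min_samples`, using scikit-learn directly on toy data. The parameter values are arbitrary and the sentences are made up; with the runner above, the same parameters could presumably be passed through `dbscan_kwargs`, though that forwarding is an assumption.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data: repeated boilerplate lines plus two content sentences (illustrative only)
toy_sentences = [
    "abstract", "abstract", "abstract",
    "references", "references", "references",
    "the polymer was dissolved in toluene",
    "a block copolymer film was cast from chloroform",
]

X = TfidfVectorizer().fit_transform(toy_sentences)

# Sweep a few (eps, min_samples) pairs and record cluster and noise counts
results = {}
for eps in (0.5, 1.0, 1.4):
    for min_samples in (2, 3):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
        n_clusters = len(set(labels) - {-1})   # -1 is the noise label, not a cluster
        n_noise = int(np.sum(labels == -1))
        results[(eps, min_samples)] = (n_clusters, n_noise)
        print(eps, min_samples, n_clusters, n_noise)
```

With tight settings, only the exact-duplicate boilerplate lines cluster and the content sentences stay in -1; loosening `eps` starts pulling content sentences into clusters, which is the trade-off to watch when tuning on the real data.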

In [9]:
dbscan_tf["models"][(3,3)].components_

<2233x444669 sparse matrix of type '<class 'numpy.float64'>'
	with 2593 stored elements in Compressed Sparse Row format>

In [11]:
len(dbscan_tf["models"][(3,3)].labels_)

32590

In [10]:
# Lastly, we will create a text file with ONLY the rows that are in cluster -1.
# Collect the row indices assigned to a real cluster (label != -1) so they can be dropped.
cluster_labels = dbscan_tf["models"][(3,3)].labels_
remove_ixs = [ix for ix, label in enumerate(cluster_labels) if label != -1]
len(remove_ixs)


2233

In [15]:
modified_df = df.drop(remove_ixs)

In [16]:
modified_df.shape

(30357, 3)

In [23]:
modified_df = modified_df.reset_index(drop=True)  # renumber rows 0..n-1 after the drop
modified_df.to_csv("df_post_clustering.csv")

In [27]:
# Let's see what this algorithm discarded (df has a default integer index, so labels match positions)
discarded_df = df.loc[remove_ixs]
discarded_df.shape

(2233, 3)

In [31]:
for sentence in discarded_df.sentence:
    print(sentence)

'Copyright \xc2\xa9 2014 American Chemical Society'
')'
'Abstract'
'Introduction'
'Results and Discussion'
'DFT Computations'
'Conclusion'
' Supporting Information'
'This material is available free of charge via the Internet at http://pubs.acs.org'
'The authors declare no competing financial interest'
'Acknowledgment'
'References'
'Copyright \xc2\xa9 2015 American Chemical Society'
')'
'Abstract'
'Introduction'
'Experimental Methods'
'Results and Discussion'
'Conclusions'
' Supporting Information'
'The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.macromol.5b00979'
'The authors declare no competing financial interest'
'Acknowledgment'
'References'
'Copyright \xc2\xa9 2006 American Chemical Society'
'Acknowledgment'
'Supporting Information Available'
'This material is available free of charge via the Internet at http://pubs.acs.org'
'Copyright \xc2\xa9 2010 American Chemical Society'
'*Corresponding authors'
'Abstract'
'Introducti

It appears the algorithm mostly removed footers, section headers, and section numerical identifiers.