Models used in this assignment

In [1]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Used for task 1
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.util import ngrams


#used for task 2
from nltk import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

#used in part B task 1
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split, cross_validate, cross_val_score
from sklearn.preprocessing import LabelEncoder

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
#used in part B task 1 unsupervised
from sklearn.cluster import KMeans

#used in part B task 2
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import LatentDirichletAllocation

from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion



%matplotlib inline

For the first task of the assignment we use a drug review dataset.

In [58]:
data = pd.read_csv('/Users/joel/Desktop/drugLib_raw/drugLibTrain_raw.tsv', sep='\t')

In [72]:
data.head()

Unnamed: 0.1,Unnamed: 0,urlDrugName,rating,effectiveness,sideEffects,condition,benefitsReview,sideEffectsReview,commentsReview
0,2202,enalapril,4,Highly Effective,Mild Side Effects,management of congestive heart failure,slowed the progression of left ventricular dys...,"cough, hypotension , proteinuria, impotence , ...","monitor blood pressure , weight and asses for ..."
1,3117,ortho-tri-cyclen,1,Highly Effective,Severe Side Effects,birth prevention,Although this type of birth control has more c...,"Heavy Cycle, Cramps, Hot Flashes, Fatigue, Lon...","I Hate This Birth Control, I Would Not Suggest..."
2,1146,ponstel,10,Highly Effective,No Side Effects,menstrual cramps,I was used to having cramps so badly that they...,Heavier bleeding and clotting than normal.,I took 2 pills at the onset of my menstrual cr...
3,3947,prilosec,3,Marginally Effective,Mild Side Effects,acid reflux,The acid reflux went away for a few months aft...,"Constipation, dry mouth and some mild dizzines...",I was given Prilosec prescription at a dose of...
4,1951,lyrica,2,Marginally Effective,Severe Side Effects,fibromyalgia,I think that the Lyrica was starting to help w...,I felt extremely drugged and dopey. Could not...,See above


#### For the following tasks we consider only the column "sideEffectsReview"

In [73]:
X = data['sideEffectsReview']

In [98]:
X[1]

"Heavy Cycle, Cramps, Hot Flashes, Fatigue, Long Lasting Cycles. It's only been 5 1/2 months, but i'm concidering changing to a different bc. This is my first time using any kind of bc, unfortunately due to the constant hassel, i'm not happy with the results."

The output above shows one row of our dataset. Typically people describe their side effects and how much or how long they have taken a drug. We treat each review as a document, so X is a collection of documents.

#### Task 1: Build a pipeline that removes punctuation and stop words and tokenizes the data. The result should be a list of bigrams

In [74]:
# a class to clean the text data
# 1. Each document (one or several sentences) is tokenized row-wise into a list of separate tokens
# 2. Non-alphabetic tokens (punctuation, numbers) are removed from each list
# 3. Returns the cleaned lists
class CleanText():
    
    def __init__(self):
        return
        
    def fit(self, X, y = None):
        return self
    
    def clean_punctuation(self, x):
        
        #returns a list of tokenized words. Each string is a separate element of the list
        x = str(x)
        w = word_tokenize(x)
        
        #keeps only purely alphabetic tokens (drops punctuation and numbers)
        clean_list = [i for i in w if i.isalpha()]
        
        return clean_list
    
    def transform(self, X, y = None):
        
        l = []
        

        
        for i in X:
            
            cp = self.clean_punctuation(i)
            l.append(cp)
        
        
        return l #X.apply(self.clean_punctuation)

In [75]:
# a class to remove stopwords

class Stop():
    
    def __init__(self, language):
        self.lang = language
        
    def fit(self, X, y = None):
        return self
    
    def remove_stop_words(self, x):
        # build the stop-word set once per call instead of once per token
        stop_words = set(stopwords.words(self.lang))
        cleaned_list = [w for w in x if w not in stop_words]
        return cleaned_list
    
    def transform(self, X, y = None):
        l = []
        for i in X:
            cleaned_sublist = self.remove_stop_words(i)
            l.append(cleaned_sublist)
        
        return l
    

In [76]:
# a class to generate n-grams
# returns a nested list where each sublist is a list for each row with n-grams
class NGram():
    
    def __init__(self, n):
        self.n = n
        
    def fit(self, X, y = None):
        return self
    
    def transform(self, X, y=None):
        ngram_total_list = []
        for i in range(0, len(X)):
            ngram = ngrams(X[i], n = self.n)
            
            ngram = list(ngram)
            
            ngram_total_list.append(ngram)
            
        return ngram_total_list

In [77]:
pipe = Pipeline([("Remove_Punktuation", CleanText()),
                 ("Stopwords", Stop('english')),
               ("NGram", NGram(2))])

In [78]:
pipe_result = pipe.fit_transform(X)

In [102]:
len(pipe_result)

3107

In [104]:
pipe_result[1]

[('Heavy', 'Cycle'),
 ('Cycle', 'Cramps'),
 ('Cramps', 'Hot'),
 ('Hot', 'Flashes'),
 ('Flashes', 'Fatigue'),
 ('Fatigue', 'Long'),
 ('Long', 'Lasting'),
 ('Lasting', 'Cycles'),
 ('Cycles', 'It'),
 ('It', 'months'),
 ('months', 'concidering'),
 ('concidering', 'changing'),
 ('changing', 'different'),
 ('different', 'bc'),
 ('bc', 'This'),
 ('This', 'first'),
 ('first', 'time'),
 ('time', 'using'),
 ('using', 'kind'),
 ('kind', 'bc'),
 ('bc', 'unfortunately'),
 ('unfortunately', 'due'),
 ('due', 'constant'),
 ('constant', 'hassel'),
 ('hassel', 'happy'),
 ('happy', 'results')]

For the first task we removed punctuation and stop words and tokenized the text. Then we built bigrams for each document. Above we see the result: a nested list in which each inner list holds the bigrams of one document.
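
The bigram output above still contains 'It' and 'This', presumably because the stop-word list is lower case while the tokens are compared case-sensitively. The following is a minimal sketch of the same two steps in plain Python, assuming a case-insensitive comparison is wanted. `STOP_WORDS` is a small hypothetical subset, and `remove_stop_words` and `bigrams` are illustrative names rather than the classes used in the notebook.

```python
# Hypothetical stop-word subset; the notebook uses nltk's English list instead.
STOP_WORDS = {"it", "this", "is", "my", "to", "a", "the", "of", "not", "with"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words, comparing in lower case so 'It' and 'This' are caught."""
    return [t for t in tokens if t.lower() not in stop_words]


def bigrams(tokens):
    """Pair every token with its successor."""
    return list(zip(tokens, tokens[1:]))


tokens = ["It", "is", "my", "first", "time", "using", "this", "drug"]
filtered = remove_stop_words(tokens)
print(filtered)           # ['first', 'time', 'using', 'drug']
print(bigrams(filtered))  # [('first', 'time'), ('time', 'using'), ('using', 'drug')]
```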

In [116]:
pipe2 = Pipeline([("Remove_Punktuation", CleanText()),
                 ("Stopwords", Stop('english'))])

In [117]:
pipe_result2 = pipe2.fit_transform(X)

In [120]:
pipe_result2[1]

['Heavy',
 'Cycle',
 'Cramps',
 'Hot',
 'Flashes',
 'Fatigue',
 'Long',
 'Lasting',
 'Cycles',
 'It',
 'months',
 'concidering',
 'changing',
 'different',
 'bc',
 'This',
 'first',
 'time',
 'using',
 'kind',
 'bc',
 'unfortunately',
 'due',
 'constant',
 'hassel',
 'happy',
 'results']

#### Task 2: Split the corpus into sentences and vectorize it as a bag of words and as TF-IDF

In [155]:
X[1]

"Heavy Cycle, Cramps, Hot Flashes, Fatigue, Long Lasting Cycles. It's only been 5 1/2 months, but i'm concidering changing to a different bc. This is my first time using any kind of bc, unfortunately due to the constant hassel, i'm not happy with the results."

In [84]:
# a class to split the text into sentences
# the function is applied row-wise to the input data
class text2sentence():
    
    def __init__(self):
        return
    
    def fit(self, X, y = None):
        return self
    
    def sentence(self, x):
        x = str(x)
        s = sent_tokenize(x)
        
        return s
    
    def transform(self, X, y = None):
        
        #X_sentence = X.apply(self.sentence)
        X_sentence = [self.sentence(w) for w in X]
        return X_sentence #list(X_sentence)
        

In [109]:
class BagOfWords():
    
    def __init__(self):
        self.vectorizer = CountVectorizer(analyzer='word', lowercase = True)
        
    def fit(self, X, y = None):
        #bow = self.vectorizer.fit(X)
        return self
    
    def transform(self, X, y = None):

        bow_transform = self.vectorizer.fit_transform(X)
        # get_feature_names() is deprecated since scikit-learn 1.0 in favour of get_feature_names_out()
        names_of_vectors = self.vectorizer.get_feature_names()
        
        return bow_transform, names_of_vectors

In [86]:
# a class to flatten a nested list into a list
# This is necessary if we split our data into sentences before applying the bag-of-words algorithm to the data
class Flatten():
    def __init__(self):
        return
    
    def fit(self, X, y = None):
        return self
    
    def transform(self, X, y = None):
        
        flatten = [item for sublist in X for item in sublist]
        
        return flatten   

In [210]:
''' a class to lower-case capitalized words. This class also merges each sublist into a string.
 Used before applying bag of words or TF-IDF '''

class Lower():
    def __init__(self):
        return
    
    def fit(self, X, y = None):
        return self
    
    def transform(self, X, y = None):
        
        lower_list = []
        for i in X:
            X_actual = i
            lower = [x.lower() for x in X_actual]
            lower_list.append(lower)
        
        list_of_list_to_string = [''.join(l) for l in lower_list]
                          
        return list_of_list_to_string
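
The separator used in the join step decides what a word-level vectorizer sees afterwards. A minimal sketch, assuming each document should end up as one string whose tokens stay separated by spaces; `lower_and_join` and its `sep` parameter are hypothetical names and not part of the notebook:

```python
def lower_and_join(token_lists, sep=" "):
    """Lower-case every token and join each document's tokens with `sep`."""
    return [sep.join(tok.lower() for tok in tokens) for tokens in token_lists]


docs = [["Heavy", "Cycle", "Cramps"], ["Dry", "Mouth"]]
spaced = lower_and_join(docs)         # word boundaries are preserved
fused = lower_and_join(docs, sep="")  # tokens run together
print(spaced)  # ['heavy cycle cramps', 'dry mouth']
print(fused)   # ['heavycyclecramps', 'drymouth']
```

With an empty separator all tokens of a document fuse into one long token, so a word-level vectorizer would see a single term per document.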

In [223]:
class TF_ID():
    
    def __init__(self, explainable):
        self.vectorizer = TfidfVectorizer(analyzer='word')
        self.explain = explainable
        
    def fit(self, X, y = None):
        # fit the wrapped vectorizer but return self, so that a Pipeline
        # calls this class's transform and not the vectorizer's
        self.vectorizer.fit(X)
        return self
    
    def transform(self, X, y = None):
        
        tfid_transform = self.vectorizer.transform(X)
        
        if self.explain:
            # also return the feature names to label the columns
            # (get_feature_names_out() on scikit-learn >= 1.0)
            names_of_vectors = self.vectorizer.get_feature_names()
            return tfid_transform, names_of_vectors
        
        # keep the matrix sparse; NotSparse converts it to a dense array
        return tfid_transform
            

In [110]:
pipe_1 = Pipeline([
                ("Sentences", text2sentence()),
                 ("Flatten", Flatten()),
                 ("BoW", BagOfWords())])

#we applied a flatten function to merge the sublists into one list.
#We need to do this so that BoW is applied to the complete corpus and not just to a single document

In [111]:
pipe_result_1, names_bow = pipe_1.fit_transform(X)

In [112]:
bow_matrix = pd.DataFrame(pipe_result_1.toarray(), columns = names_bow)
bow_matrix.head()

Unnamed: 0,00,000,000mg,025,05,07,08,10,100,1000,...,zithromycin,zofran,zoloft,zombie,zombing,zomig,zone,zyban,zyprexa,zyrtec
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


For better demonstration we built a bag-of-words class that also returns the feature names. With these we can build a data frame in which each row is one entry of the flattened list (a sentence) and each column counts the occurrences of one word. As we see, the result is a very sparse matrix. Note also that we only split the original corpus into sentences and applied the bag-of-words algorithm directly to them. As a result the vocabulary contains tokens such as 00, 000mg and other numeric values. We may clean these in later tasks, but doing so is out of scope for this exercise.
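
To make the counting explicit, here is a minimal bag-of-words sketch in plain Python. It assumes simple whitespace tokenization, which is not the tokenizer `CountVectorizer` uses, and it drops non-alphabetic tokens such as `000mg`. The function name `bag_of_words` and the example sentences are made up for illustration.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    # keep purely alphabetic tokens so entries such as '00' or '000mg' are dropped
    tokenized = [[t for t in tokens if t.isalpha()] for tokens in tokenized]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


docs = ["dry mouth and nausea", "nausea after 000mg dose", "no nausea no headache"]
vocab, matrix = bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)
```

Even in this tiny example most entries are zero, which is the sparsity visible in the data frame above.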

In [224]:
pipe_2 = Pipeline([("Remove_Punktuation", CleanText()),
                 ("Stopwords", Stop('english')),
                 ("Sentences", text2sentence()),
                 ("Flatten", Flatten()),
                 ("TFID", TF_ID(explainable = True))])

In [225]:
pipe_result_2, names = pipe_2.fit_transform(X)


After splitting the corpus into sentences we applied a flatten function to the result. The reason is that the splitting produces a list of nested lists, which flatten turns into a single list. On this list we then apply the bag-of-words or TF-IDF method. Both pipelines output a very sparse matrix.

In the TF-IDF pipeline we also applied the classes for removing punctuation and stop words. The original data also contains numeric values such as 10mg, 1-2 days, etc. This information is not directly useful for us and only increases the dimension of our matrix.

In [312]:
document_term_matrix = pd.DataFrame(pipe_result_2.toarray(), columns = names)

In [313]:
document_term_matrix.head()

Unnamed: 0,abandon,abandoning,abated,abbsessed,abdomen,abdominal,abfter,abilify,abilities,ability,...,zithromycin,zofran,zoloft,zombie,zombing,zomig,zone,zyban,zyprexa,zyrtec
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


For better visualization of the result we constructed a data frame from the TF-IDF matrix. As we see above, it is a very sparse matrix in which most elements are zero.
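
The weights in such a matrix can be reproduced by hand. The sketch below assumes the smoothed IDF variant, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row, which to my knowledge matches the defaults of `TfidfVectorizer`. The function name `tfidf` and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tfidf(tokenized_docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    # document frequency: number of documents that contain each term
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # avoid division by zero
        rows.append([w / norm for w in weights])
    return vocabulary, rows


docs = [["nausea", "headache"], ["nausea", "dizziness"], ["nausea"]]
vocab, rows = tfidf(docs)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

A term that occurs in every document, here "nausea", gets the lowest IDF, so within a document the rarer terms carry more weight.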

# Part B

For the second part of the assignment we do not reuse the pipelines constructed before unchanged. We skip two steps and add some others: we do not split the documents into sentences, and therefore we also do not apply the flatten function. We have to remember that each row is a document and that each document has a predefined label. If we split the documents into sentences, each document would turn into several entries and the total list would become longer.

The labels from before would then no longer fit: we would have more data points than labels, and we would not know which label belongs to which data point.
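
The mismatch can be illustrated with a small sketch. The notebook avoids the problem by not splitting at all; the code below only shows, under that caveat, how a document index could be carried along if sentence-level rows were wanted. `split_sentences` is a naive stand-in for `sent_tokenize`, and `sentences_with_labels` is a hypothetical helper.

```python
import re


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace (stand-in for sent_tokenize)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentences_with_labels(documents, labels):
    """Split every document and repeat its label and index for each sentence."""
    sentences, sentence_labels, doc_ids = [], [], []
    for doc_id, (doc, label) in enumerate(zip(documents, labels)):
        for sentence in split_sentences(doc):
            sentences.append(sentence)
            sentence_labels.append(label)
            doc_ids.append(doc_id)
    return sentences, sentence_labels, doc_ids


docs = ["Heavy cramps. Not happy with the results.", "No side effects."]
labels = ["Severe Side Effects", "No Side Effects"]
sents, sent_labels, doc_ids = sentences_with_labels(docs, labels)
print(len(docs), "documents ->", len(sents), "sentences")
print(doc_ids)
```

Two documents become three sentences here, so the original two labels no longer line up unless they are repeated per sentence.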

In [185]:
a = np.array(word_tokenize(X[1]))

In [203]:
a

array(['Heavy', 'Cycle', ',', 'Cramps', ',', 'Hot', 'Flashes', ',',
       'Fatigue', ',', 'Long', 'Lasting', 'Cycles', '.', 'It', "'s",
       'only', 'been', '5', '1/2', 'months', ',', 'but', 'i', "'m",
       'concidering', 'changing', 'to', 'a', 'different', 'bc', '.',
       'This', 'is', 'my', 'first', 'time', 'using', 'any', 'kind', 'of',
       'bc', ',', 'unfortunately', 'due', 'to', 'the', 'constant',
       'hassel', ',', 'i', "'m", 'not', 'happy', 'with', 'the', 'results',
       '.'], dtype='<U13')

In [188]:
list_of_list_to_string = [''.join(i) for i in a]

In [211]:
l = Lower()

In [212]:
b = l.transform(a)

In [215]:
b[0]

'heavy'

#### Task 1

#### Classification
We construct two models for text classification: the well-known k-NN algorithm and a random forest.

Our hypothesis is that we can classify the severity of the side effects based on the review comment about them.

In [226]:
class NotSparse():
    
    def __init__(self):
        return
    def fit(self, X, y = None):
        return self
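    def transform(self, X, y = None):
        # densify sparse input; pass already dense arrays through unchanged
        return X.toarray() if hasattr(X, 'toarray') else X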

##### 1st. Creating a training / test dataset

In [7]:
#loading the labels
y_label = data['sideEffects']

#encode the string values into numeric values
le = LabelEncoder()

y = le.fit_transform(y_label)

In [20]:
plt.hist(y, bins = 5, rwidth = 0.5)
plt.title("Histogram")
plt.xlabel('Side Effects')
plt.ylabel('Frequency')
plt.show()

In [33]:
# creating the data
X = data['sideEffectsReview'].values

In [34]:
#splitting the data into a training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42, stratify = y)

##### 2nd. Creating a pipeline:

The following workflow shows how the pipeline works.
1. remove punctuation
2. remove stop words

> After we removed punctuation and stop words, each document is a list of words. In the next step we have to merge each list back into a single string so that the corpus can be transformed into a vector space model

3. lower-case capitalized letters (we do this so that the same word is not counted separately just because it is capitalized)
4. create a TF-IDF matrix
5. transform the sparse matrix into a NumPy array (the TF-IDF matrix is very sparse; to avoid memory problems the result is stored in a sparse format, which unfortunately does not allow us to apply an ML algorithm directly, so we convert it into a NumPy array)
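
The steps above rely on the fit/transform contract of pipeline steps: `fit` learns whatever state is needed and returns the object itself, and `transform` applies that state to new data. The sketch below imitates this contract in plain Python without scikit-learn. `LowerCase`, `VocabularyIndexer`, `fit_transform_chain` and `transform_chain` are hypothetical names chosen for illustration.

```python
class LowerCase:
    """Stateless step: fit learns nothing and returns self."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [doc.lower() for doc in X]


class VocabularyIndexer:
    """Stateful step: fit learns a vocabulary, transform reuses it."""

    def fit(self, X, y=None):
        terms = sorted({t for doc in X for t in doc.split()})
        self.vocabulary_ = {t: i for i, t in enumerate(terms)}
        return self  # returning self keeps fit(...).transform(...) on this object

    def transform(self, X):
        return [[self.vocabulary_[t] for t in doc.split() if t in self.vocabulary_]
                for doc in X]


def fit_transform_chain(steps, X):
    """Fit every step on the output of the previous one (training data)."""
    for step in steps:
        X = step.fit(X).transform(X)
    return X


def transform_chain(steps, X):
    """Apply already fitted steps to new data without refitting."""
    for step in steps:
        X = step.transform(X)
    return X


steps = [LowerCase(), VocabularyIndexer()]
train = fit_transform_chain(steps, ["Dry Mouth", "Mild Nausea"])
test = transform_chain(steps, ["Severe nausea"])
print(train)  # [[0, 2], [1, 3]]
print(test)   # [[3]]
```

Words that were not seen during fitting, such as "severe" here, are simply ignored at transform time, which is also how a fitted vectorizer treats unseen words in the test set.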

In [227]:
base_pipeline = Pipeline([("Remove_Punktuation", CleanText()),
                         ("Stopwords", Stop('english')),
                         ("Lower", Lower()),
                          ("TFID", TF_ID(explainable = False)),
                         ("Matrix", NotSparse())])

In [228]:
base_pipeline.fit_transform(X)

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

##### 3rd. apply kNN Classifier to the data

In [153]:
pipeline_kNN = Pipeline([("base", base_pipeline),
                        ("kNN", KNeighborsClassifier(algorithm = 'brute'))])

In [154]:
pipeline_kNN.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('base',
                 Pipeline(memory=None,
                          steps=[('Remove_Punktuation',
                                  <__main__.CleanText object at 0x7feccf9f1c90>),
                                 ('Stopwords',
                                  <__main__.Stop object at 0x7feccf9f1fd0>),
                                 ('Lower',
                                  <__main__.Lower object at 0x7feccf9f1f50>),
                                 ('TFID',
                                  <__main__.TF_ID object at 0x7feccf9f1250>)],
                          verbose=False)),
                ('kNN',
                 KNeighborsClassifier(algorithm='brute', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=None, n_neighbors=5, p=2,
                                      weights='uniform'))],
         verbose=False)

In [459]:
y_pred = pipeline_kNN.predict(X_test)

In [460]:
y_pred

array([2, 2, 3, 3, 3, 1, 1, 3, 3, 3, 1, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3,
       2, 1, 1, 3, 2, 3, 2, 1, 3, 1, 3, 1, 3, 3, 2, 1, 1, 2, 3, 3, 3, 3,
       3, 3, 2, 2, 1, 2, 3, 3, 2, 3, 3, 1, 3, 3, 3, 1, 3, 3, 3, 3, 2, 1,
       3, 3, 1, 3, 2, 3, 3, 3, 1, 1, 3, 1, 1, 2, 1, 2, 3, 3, 1, 3, 3, 3,
       3, 3, 3, 1, 2, 3, 2, 1, 2, 1, 1, 2, 3, 3, 2, 2, 3, 3, 2, 1, 1, 3,
       2, 2, 3, 3, 1, 2, 1, 2, 3, 3, 1, 1, 2, 1, 3, 2, 1, 1, 3, 3, 1, 3,
       3, 3, 3, 3, 3, 2, 2, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 3, 2,
       2, 3, 3, 1, 3, 1, 3, 2, 3, 2, 3, 3, 2, 2, 2, 1, 1, 3, 1, 1, 3, 3,
       1, 3, 2, 2, 3, 3, 3, 3, 3, 2, 2, 3, 1, 2, 1, 1, 3, 3, 3, 3, 3, 3,
       3, 3, 1, 2, 2, 3, 3, 2, 2, 2, 3, 2, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3,
       3, 3, 2, 3, 2, 1, 3, 2, 3, 1, 3, 3, 1, 3, 1, 2, 3, 2, 3, 3, 3, 3,
       2, 3, 3, 3, 3, 2, 2, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 3, 2, 1, 2,
       3, 1, 3, 3, 3, 2, 3, 3, 2, 3, 3, 1, 2, 1, 1, 2, 2, 2, 1, 3, 3, 3,
       3, 3, 3, 1, 2, 1, 3, 3, 4, 3, 3, 1, 2, 3, 3,

In [155]:
pipeline_kNN.score(X_test, y_test)

0.40836012861736337

In [41]:
pipeline_RF = Pipeline([("base", base_pipeline),
                        ("RandomForest", RandomForestClassifier(n_estimators = 10))])

In [42]:
pipeline_RF.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('base',
                 Pipeline(memory=None,
                          steps=[('Remove_Punktuation',
                                  <__main__.CleanText object at 0x7fecd5722ed0>),
                                 ('Stopwords',
                                  <__main__.Stop object at 0x7fecd57227d0>),
                                 ('Lower',
                                  <__main__.Lower object at 0x7fecd57229d0>),
                                 ('TFID',
                                  <__main__.TF_ID object at 0x7fecd5722a10>)],
                          verbose=False)),
                ('RandomForest',
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                              

In [43]:
pipeline_RF.score(X_test, y_test)

0.5144694533762058

#### Clustering

In [508]:
np.unique(y)

array([0, 1, 2, 3, 4])

In [509]:
pipeline_kMean = Pipeline([("base", base_pipeline),
                        ("k-Means", KMeans(n_clusters = 5))])

In [512]:
pipeline_kMean.fit(X)

Pipeline(memory=None,
         steps=[('base',
                 Pipeline(memory=None,
                          steps=[('Remove_Punktuation',
                                  <__main__.CleanText object at 0x7f837cdf9450>),
                                 ('Stopwords',
                                  <__main__.Stop object at 0x7f837cdf9bd0>),
                                 ('Lower',
                                  <__main__.Lower object at 0x7f837cdf9e10>),
                                 ('TFID',
                                  <__main__.TF_ID object at 0x7f837cdf9850>)],
                          verbose=False)),
                ('k-Means',
                 KMeans(algorithm='auto', copy_x=True, init='k-means++',
                        max_iter=300, n_clusters=5, n_init=10, n_jobs=None,
                        precompute_distances='auto', random_state=None,
                        tol=0.0001, verbose=0))],
         verbose=False)

In [513]:
y_cluster_pred = pipeline_kMean.predict(X)

In [519]:
y_cluster_pred[0:10]

array([0, 0, 0, 1, 0, 0, 0, 2, 0, 0], dtype=int32)

#### Task 2

##### Topic modeling

Typically the TF-IDF matrix is very sparse: most of its elements are zero. At the same time it has a high dimensionality. Algorithms based on distance measures, such as k-NN, struggle in high-dimensional spaces because the distances between points become less informative. This problem is known as the "curse of dimensionality". A common remedy is to reduce the dimensionality without losing too much of the variance, for example with principal component analysis. In the cells below we fit a latent Dirichlet allocation (LDA) topic model with five components.
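
The effect on distances can be observed with random data. The sketch below is an illustration under simple assumptions (points drawn uniformly from the unit cube, Euclidean distance, a fixed seed) and is not taken from the notebook. The hypothetical function `relative_contrast` measures how much the farthest neighbour of a query differs from the nearest one.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from one query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


low = relative_contrast(dim=2)
high = relative_contrast(dim=1000)
print(f"2 dimensions: {low:.2f}, 1000 dimensions: {high:.2f}")
```

In many dimensions the nearest and the farthest point are almost equally far away, so "nearest neighbour" carries little information.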

In [47]:
class NotSparse():
    
    def __init__(self):
        return
    def fit(self, X, y = None):
        return self
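    def transform(self, X, y = None):
        # densify sparse input; pass already dense arrays through unchanged
        return X.toarray() if hasattr(X, 'toarray') else X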

In [174]:
pipeline_base2 = Pipeline([("base", base_pipeline),
                         ("Converter", NotSparse())])

In [175]:
pipeline_base2.fit(X)

Pipeline(memory=None,
         steps=[('base',
                 Pipeline(memory=None,
                          steps=[('Remove_Punktuation',
                                  <__main__.CleanText object at 0x7feccf9f1c90>),
                                 ('Stopwords',
                                  <__main__.Stop object at 0x7feccf9f1fd0>),
                                 ('Lower',
                                  <__main__.Lower object at 0x7feccf9f1f50>),
                                 ('TFID',
                                  <__main__.TF_ID object at 0x7feccf9f1250>)],
                          verbose=False)),
                ('Converter', <__main__.NotSparse object at 0x7fecd5179910>)],
         verbose=False)

In [176]:
X_lda = pipeline_base2.transform(X)

In [168]:
X_lda.shape

(3107, 5)

In [177]:
lda = LatentDirichletAllocation(n_components = 5)

In [178]:
lda_X = lda.fit_transform(X_lda)

In [181]:
X[0]

'cough, hypotension , proteinuria, impotence , renal failure , angina pectoris , tachycardia , eosinophilic pneumonitis, tastes disturbances , anusease anorecia , weakness fatigue insominca weakness'

In [184]:
topic_terms=lda.components_
topic_terms

array([[0.37429717, 0.32348255, 0.25830738, ..., 0.20003413, 0.20682502,
        0.20000873],
       [0.20002131, 0.2000165 , 0.51165644, ..., 0.51095742, 0.20002475,
        0.2007486 ],
       [0.20002225, 0.20001694, 1.27249151, ..., 0.20005047, 0.76770701,
        0.20001188],
       [0.20002215, 0.20001764, 0.21004712, ..., 0.39927703, 0.20002777,
        1.33643811],
       [0.20002152, 0.20001768, 0.20008651, ..., 0.4898156 , 0.20002923,
        0.20001237]])

In [187]:
# Extracting the 10 most important terms for each topic
# feature names come from the fitted TF-IDF step of the base pipeline
# (use get_feature_names_out() on scikit-learn >= 1.0)
names = np.array(base_pipeline.named_steps['TFID'].vectorizer.get_feature_names())
topic_terms=lda.components_
top_terms=10 # number of 'top terms'
topic_key_terms_idxs=np.argsort(-np.absolute(topic_terms), axis=1)[:,:top_terms]
topic_keyterms=names[topic_key_terms_idxs]
topics=[', '.join(topic) for topic in topic_keyterms]
pd.set_option('display.max_colwidth', None) # None shows the full column text (pandas >= 1.0)
topics_df=pd.DataFrame(topics,columns=['Term per Topic'], index=['Topic'+str(t) for t in range(1,5+1)])
topics_df

#### Task 3

To perform document summarization we use the book of Genesis from the Bible. The dataset was downloaded from Kaggle and can be found under the following link: https://www.kaggle.com/nltkdata/genesis.
Specifically, we use the English web file (english-web.txt). The alternative would be the King James Version of the Bible, but it contains special characters such as ";". In addition, its sentences are written in an older style, which can cause trouble with models trained on modern English.

In [128]:
bible = open("/Users/joel/Downloads/genesis/english-web.txt", "r")

In [127]:
print(bible.read(7500))

In the beginning God created the heavens and the earth.
Now the earth was formless and empty.  Darkness was on the surface
of the deep.  God's Spirit was hovering over the surface
of the waters.
God said, "Let there be light," and there was light.
God saw the light, and saw that it was good.  God divided
the light from the darkness.
God called the light Day, and the darkness he called Night.
There was evening and there was morning, one day.
God said, "Let there be an expanse in the middle of the waters,
and let it divide the waters from the waters."
God made the expanse, and divided the waters which were under
the expanse from the waters which were above the expanse;
and it was so.
God called the expanse sky.  There was evening and there
was morning, a second day.
God said, "Let the waters under the sky be gathered together
to one place, and let the dry land appear;" and it was so.
God called the dry land Earth, and the gathering together
of the waters he called Seas.  God saw that it 

In [223]:
len(bible.read())

195315

In [129]:
bible_corpus = bible.read(7500)

Above we see the beginning of the Genesis text. As the length check shows, the text contains many characters (195315). We will now use TextRank, a variation of the PageRank algorithm, to extract the key sentences and keywords from the text.

In [25]:
class TextRank4Sentences():
    def __init__(self):
        self.damping = 0.85  # damping coefficient, usually is .85
        self.min_diff = 1e-5  # convergence threshold
        self.steps = 100  # iteration steps
        self.text_str = None
        self.sentences = None
        self.pr_vector = None

    def _sentence_similarity(self, sent1, sent2, stopwords=None):
        if stopwords is None:
            stopwords = []

        sent1 = [w.lower() for w in sent1]
        sent2 = [w.lower() for w in sent2]

        all_words = list(set(sent1 + sent2))

        vector1 = [0] * len(all_words)
        vector2 = [0] * len(all_words)

        # build the vector for the first sentence
        for w in sent1:
            if w in stopwords:
                continue
            vector1[all_words.index(w)] += 1

        # build the vector for the second sentence
        for w in sent2:
            if w in stopwords:
                continue
            vector2[all_words.index(w)] += 1

        return core_cosine_similarity(vector1, vector2)

    def _build_similarity_matrix(self, sentences, stopwords=None):
        # create an empty similarity matrix
        sm = np.zeros([len(sentences), len(sentences)])

        for idx1 in range(len(sentences)):
            for idx2 in range(len(sentences)):
                if idx1 == idx2:
                    continue

                sm[idx1][idx2] = self._sentence_similarity(sentences[idx1], sentences[idx2], stopwords=stopwords)

        # Get symmetric matrix
        sm = get_symmetric_matrix(sm)

        # Normalize matrix by column; columns that sum to 0 stay 0
        norm = np.sum(sm, axis=0)
        sm_norm = np.divide(sm, norm, out=np.zeros_like(sm), where=norm != 0)

        return sm_norm

    def _run_page_rank(self, similarity_matrix):

        pr_vector = np.array([1] * len(similarity_matrix))

        # Iteration
        previous_pr = 0
        for epoch in range(self.steps):
            pr_vector = (1 - self.damping) + self.damping * np.matmul(similarity_matrix, pr_vector)
            if abs(previous_pr - sum(pr_vector)) < self.min_diff:
                break
            else:
                previous_pr = sum(pr_vector)

        return pr_vector

    def _get_sentence(self, index):

        try:
            return self.sentences[index]
        except IndexError:
            return ""

    def get_top_sentences(self, number=5):

        top_sentences = []

        if self.pr_vector is not None:

            sorted_pr = np.argsort(self.pr_vector)
            sorted_pr = list(sorted_pr)
            sorted_pr.reverse()

            index = 0
            for epoch in range(number):
                sent = self.sentences[sorted_pr[index]]
                sent = normalize_whitespace(sent)
                top_sentences.append(sent)
                index += 1

        return top_sentences

    def analyze(self, text, stop_words=None):
        self.text_str = text
        self.sentences = sent_tokenize(self.text_str)

        tokenized_sentences = [word_tokenize(sent) for sent in self.sentences]

        similarity_matrix = self._build_similarity_matrix(tokenized_sentences, stop_words)

        self.pr_vector = self._run_page_rank(similarity_matrix)
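
The class above calls `get_symmetric_matrix` and `normalize_whitespace`, which are not defined in any cell shown here. The following is a possible reconstruction and should be read as an assumption about what these helpers do, not as the original code: the first mirrors a square matrix across its diagonal, the second collapses whitespace so that sentences spanning several lines print on one line.

```python
import re

import numpy as np


def get_symmetric_matrix(matrix):
    """Assumed helper: combine a square matrix with its transpose, M + M^T - diag(M)."""
    return matrix + matrix.T - np.diag(matrix.diagonal())


def normalize_whitespace(text):
    """Assumed helper: collapse runs of whitespace (including line breaks) into one space."""
    return re.sub(r"\s+", " ", text).strip()


upper = np.array([[1.0, 2.0], [0.0, 3.0]])
print(get_symmetric_matrix(upper))
print(normalize_whitespace("God  divided\nthe light "))
```

Because the similarity matrix in the class is already filled for both index orders, this helper doubles its off-diagonal entries; the column normalization that follows cancels that factor.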

In [57]:
pipe_TextRank = Pipeline([("Remove_Punktuation", CleanText()),
                 ("Stopwords", Stop('english')),
                 ("Lower", Lower())])
                #("Flatten", Flatten())])
                # ("BoW", CountVectorizer()),
                 #("NotASparceMatrix", NotSparse())])

In [59]:
X_bible = pipe_TextRank.fit_transform(bible)

In [30]:
X_bible = list(bible)

In [64]:
X_bible[0]

'in beginning god created heaven earth'

In [70]:
all_words = list(set(X_bible[0] + X_bible[1]))

In [71]:
vector2 = [0] * len(all_words)

In [74]:
vector1 = [0] * len(all_words)

In [72]:
for w in X_bible[0]:

    vector2[all_words.index(w)] += 1

In [75]:
for w in X_bible[1]:

    vector1[all_words.index(w)] += 1

In [81]:
from nltk.cluster.util import cosine_distance

In [82]:
core_cosine_similarity(vector1, vector2)

0.7787915579452677

In [79]:
def core_cosine_similarity(vector1, vector2):
    """
    measure cosine similarity between two vectors
    :param vector1:
    :param vector2:
    :return: 0 < cosine similarity value < 1
    """
    return 1 - cosine_distance(vector1, vector2)

In [None]:
import spacy
import pytextrank

# example text
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank", last=True)
doc = nlp(str(X_bible))

# examine the top-ranked phrases in the document
for p in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(p.rank, p.count, p.text))
    print(p.chunks)

In [86]:
# gensim.summarization is only available in gensim < 4.0
from gensim.summarization.summarizer import summarize

In [137]:
print(bible_corpus[0:450])

In the beginning God created the heavens and the earth.
Now the earth was formless and empty.  Darkness was on the surface
of the deep.  God's Spirit was hovering over the surface
of the waters.
God said, "Let there be light," and there was light.
God saw the light, and saw that it was good.  God divided
the light from the darkness.
God called the light Day, and the darkness he called Night.
There was evening and there was morning, one day.
God s


In [130]:
print(summarize(bible_corpus, word_count=300))

In the beginning God created the heavens and the earth.
God called the light Day, and the darkness he called Night.
God said, "Let there be an expanse in the middle of the waters,
God said, "Let the waters under the sky be gathered together
God called the dry land Earth, and the gathering together
God said, "Let the earth put forth grass, herbs yielding seed,
The earth brought forth grass, herbs yielding seed after their kind,
God said, "Let there be lights in the expanse of sky to
God set them in the expanse of sky to give light to the earth,
God said, "Let the waters swarm with swarms of living creatures,
and let birds fly above the earth in the open expanse of sky."
the waters in the seas, and let birds multiply on the earth."
God said, "Let the earth bring forth living creatures after
their kind, livestock, creeping things, and animals of the earth
God made the animals of the earth after their kind,
God said, "Let us make man in our image, after our likeness:
God created man in his

We did not use the complete Genesis text. We focused on the first 7500 characters, which tell the story of the creation of the earth, the elements, the animals and humans. The summarization works well. In the original corpus the tale starts with the creation of heaven and earth and of night and day, and many filler words and sentences are used to tell it. For this part of the story Genesis needs 9 sentences; through summarization we could reduce it to two sentences.