## Problem Definition
Carry out topic modelling on the text to find the hidden topics and their supporting words by maximising the posterior
probability of the topics and their words given the whole corpus.
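
For reference, the posterior mentioned above can be written out for the standard (smoothed) LDA model. The symbols below follow the usual textbook notation and are an assumption, since the notebook itself does not define them: $\beta_k$ is the word distribution of topic $k$, $\theta_d$ the topic mixture of document $d$, $z_{d,n}$ the topic assigned to the $n$-th word $w_{d,n}$ of document $d$, and $\alpha$, $\eta$ the Dirichlet hyperparameters.

```latex
p(\beta, \theta, z \mid w, \alpha, \eta) \;\propto\;
\prod_{k=1}^{K} p(\beta_k \mid \eta)
\prod_{d=1}^{D} p(\theta_d \mid \alpha)
\prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{z_{d,n}})
```

This posterior is intractable to compute exactly, so in practice it is approximated. Gensim's `LdaModel`, used below, relies on a variational Bayes approximation rather than an exact maximisation.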

### The steps I am going to take to solve the problem are as follows:
  *  Import packages.
  *  Import the documents.
  *  Clean the data.
  *  Create a dictionary from the clean words.
  *  Create a document-term matrix using the BoW model.
  *  Generate the LDA model.
  *  Set the number of topics to 2.
  *  Evaluate the results using a graphical representation.
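
The dictionary and document-term matrix steps above can be sketched in plain Python to show what they produce. This is only an illustration under stated assumptions: `build_vocabulary`, `to_bow` and `toy_docs` are hypothetical names, and the ids are assigned in order of first appearance, which is not necessarily the order gensim's `Dictionary` would use.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from the vocabulary are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Two tiny, already-cleaned documents (hypothetical data)
toy_docs = [["goal", "team", "goal"], ["machin", "human", "team"]]
vocab = build_vocabulary(toy_docs)
bows = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(bows)
```

Each document becomes a sparse list of (word id, count) pairs, which is the same shape of data that the LDA model is trained on later in this notebook.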

## Implementation

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora


In [None]:
documents = [
  """
  Artificial intelligence (AI), sometimes called machine
  intelligence, is intelligence demonstrated by machines, unlike
  the natural intelligence displayed by humans and animals. Leading
  AI textbooks define the field as the study of "intelligent
  agents": any device that perceives its environment and takes
  actions that maximize its chance of successfully achieving its
  goals. Colloquially, the term "artificial intelligence" is often
  used to describe machines (or computers) that mimic "cognitive"
  functions that humans associate with the human mind, such
  as "learning" and "problem solving".
  """,
  """
  Association football, more commonly known as football or
  soccer, is a team sport played with a spherical ball between
  two teams of 11 players. It is played by approximately 250
  million players in over 200 countries and dependencies, making it
  the world's most popular sport. The game is played on a
  rectangular field called a pitch with a goal at each end. The
  object of the game is to outscore the opposition by moving the
  ball beyond the goal line into the opposing goal. The team with
  the higher number of goals wins the game.  
  """
]

In [None]:
# Clean the data by using stemming and stopwords removal
nltk.download('stopwords')
stemmer = SnowballStemmer('english')
stop_words = stopwords.words('english')
texts = [
  [stemmer.stem(word) for word in document.lower().split() if word not in stop_words]
  for document in documents
  ]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
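
One thing to note about the cleaning step above: `str.split()` only splits on whitespace, so punctuation stays attached to words and a token such as `(ai),` is treated as different from `ai`. A minimal sketch of a regex-based alternative is shown below; the `tokenize` helper and the `sample` string are hypothetical and not part of the original notebook.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only runs of letters (apostrophes allowed inside a word)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


sample = 'Artificial intelligence (AI), sometimes called "machine intelligence".'

# Whitespace splitting keeps brackets, commas and quotes attached to the words
print(sample.lower().split())

# The regex tokenizer returns bare words only
print(tokenize(sample))
```

Swapping `document.lower().split()` for such a tokenizer would likely change the learned topic weights, so the outputs shown in this notebook correspond to the whitespace-split version.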


In [None]:
# Create a dictionary from the words
dictionary = corpora.Dictionary(texts)

# Create a document-term matrix
doc_term_mat = [dictionary.doc2bow(text) for text in texts]

# Generate the LDA model 
num_topics = 2
ldamodel = models.ldamodel.LdaModel(doc_term_mat, 
        num_topics=num_topics, id2word=dictionary, passes=25)



In [None]:
num_words = 5
for i in range(num_topics):
  print(ldamodel.print_topic(i, topn=num_words))

print('\nTop ' + str(num_words) + ' contributing words to each topic:')
for item in ldamodel.print_topics(num_topics=num_topics, num_words=num_words):
    print('\nTopic', item[0])
    list_of_strings = item[1].split(' + ')
    for text in list_of_strings:
        details = text.split('*')
        print("%-12s:%0.2f%%" %(details[1], 100*float(details[0])))


0.036*"goal" + 0.036*"play" + 0.036*"team" + 0.026*"game" + 0.026*"ball"
0.035*"intellig" + 0.035*"human" + 0.025*"machin" + 0.015*"field" + 0.015*"call"

Top 5 contributing words to each topic:

Topic 0
"goal"      :3.60%
"play"      :3.60%
"team"      :3.60%
"game"      :2.60%
"ball"      :2.60%

Topic 1
"intellig"  :3.50%
"human"     :3.50%
"machin"    :2.50%
"field"     :1.50%
"call"      :1.50%

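The cell above recovers words and weights by splitting the string returned by `print_topics`. The same parsing can be sketched as a small standalone function that also strips the surrounding quotes; `parse_topic_string` is a hypothetical helper, and the input string is copied from the output above. As far as I know, gensim's `LdaModel.show_topic(topic_id, topn=...)` returns (word, probability) pairs directly, which may avoid the string parsing altogether.

```python
def parse_topic_string(topic_string):
    """Split a string such as '0.036*"goal" + 0.036*"play"' into (word, weight) pairs."""
    pairs = []
    for term in topic_string.split(" + "):
        weight, word = term.split("*", 1)
        # Remove the double quotes that gensim puts around each word
        pairs.append((word.strip('"'), float(weight)))
    return pairs


topic_0 = '0.036*"goal" + 0.036*"play" + 0.036*"team" + 0.026*"game" + 0.026*"ball"'
for word, weight in parse_topic_string(topic_0):
    print(f"{word:<12}:{100 * weight:0.2f}%")
```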

In [None]:
new_docs = [
  """
  Jager thinks this is just the start of AI eating the beautiful
  game. “We have a dedicated team that focuses only on artificial
  intelligence and machine learning for sports teams,” he
  says. “That is not only for soccer, but for Formula One and
  American football. We have a baseball team, and we're talking
  right now with cricket teams.”
  """
]

new_texts = [
  [stemmer.stem(word) for word in document.lower().split() if word not in stop_words]
  for document in new_docs
  ]
new_doc_term_mat = [dictionary.doc2bow(text) for text in new_texts]

vector = ldamodel[new_doc_term_mat]
print(vector[0])


[(0, 0.49326453), (1, 0.50673556)]
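
The vector above is a list of (topic id, probability) pairs for the new document. A small sketch of how it might be read is given below; `dominant_topic` is a hypothetical helper and the numbers are copied from the output above. Note that `doc2bow` ignores words that are not in the dictionary, so with a vocabulary built from only two short paragraphs a new document may contribute few known words, which could explain the almost even split.

```python
def dominant_topic(topic_vector):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_vector, key=lambda pair: pair[1])


vector_0 = [(0, 0.49326453), (1, 0.50673556)]
topic_id, prob = dominant_topic(vector_0)

# Gap between the strongest and weakest topic; a small gap means a weak assignment
margin = prob - min(p for _, p in vector_0)
print(topic_id, round(prob, 3), round(margin, 3))
```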


### My topic for analysis from scratch

In [None]:

import numpy as np
import json
import glob

#Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

#spacy
import spacy
from nltk.corpus import stopwords

#vis
import pyLDAvis
import pyLDAvis.gensim

In [None]:
Doc = [
          """
          Experts usually rely on common sense when they solve problems. They also use
vague and ambiguous terms. For example, an expert might say, ‘Though the
power transformer is slightly overloaded, I can keep this load for a while.’ Other
experts have no difficulties with understanding and interpreting this statement
because they have the background to hearing problems described like this.
However, a knowledge engineer would have difficulties providing a computer
with the same level of understanding. How can we represent expert knowledge
that uses vague and ambiguous terms in a computer? Can it be done at all?
This chapter attempts to answer these questions by exploring the fuzzy set
theory (or fuzzy logic). We review the philosophical ideas behind fuzzy logic,
study its apparatus and then consider how fuzzy logic is used in fuzzy expert
systems.
""",
"""
Let us begin with a trivial, but still basic and essential, statement: fuzzy logic is
not logic that is fuzzy, but logic that is used to describe fuzziness. Fuzzy logic
is the theory of fuzzy sets, sets that calibrate vagueness. Fuzzy logic is based on
the idea that all things admit of degrees. Temperature, height, speed, distance,
beauty – all come on a sliding scale. The motor is running really hot. Tom is
a very tall guy. Electric cars are not very fast. High-performance drives require
very rapid dynamics and precise regulation. Hobart is quite a short distance
from Melbourne. Sydney is a beautiful city. Such a sliding scale often makes it
impossible to distinguish members of a class from non-members. When does a
hill become a mountain?
"""
]

### Importing my stop words from nltk

In [None]:
stopwords = stopwords.words("english")

In [None]:
#  Going to look at the words I just imported
print(stopwords)
# Kool

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
# Since my data is already loaded, I am going to take a peek at the first 100 characters
print(Doc[0][:100])


          Experts usually rely on common sense when they solve problems. They also use
vague and am


In [None]:
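def lemmatization(texts, allowed_postags=("NOUN", "ADJ", "VERB", "ADV")):
    # Load the small English pipeline; the parser and NER are not needed for lemmas
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    texts_out = []
    for text in texts:  # iterate over the argument, not the global Doc
        doc = nlp(text)
        # Keep the lemma of every token whose part of speech is allowed
        new_text = [token.lemma_ for token in doc if token.pos_ in allowed_postags]
        texts_out.append(" ".join(new_text))
    return texts_out


# Lemmatize the fuzzy-logic documents (not the stemmed `texts` from the first analysis)
lemmatized_texts = lemmatization(Doc)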




In [None]:
# Now to compare the results with the raw text above
print(lemmatized_texts[0][:90])

expert usually rely common sense when solve problem also use vague ambiguous term example 


In [None]:
# Reduce each text to a list of lowercase word tokens
# (simple_preprocess tokenizes and strips accents; it does not remove stop words)
def gen_words(texts):
    final = []
    for text in texts:
        new = gensim.utils.simple_preprocess(text, deacc=True)
        final.append(new)
    return final

data_words = gen_words(lemmatized_texts)

print(data_words[0][:10])  # Let's see if it worked

['expert', 'usually', 'rely', 'common', 'sense', 'when', 'solve', 'problem', 'also', 'use']


In [None]:
# Create the bag-of-words (BoW) representation: (word id, frequency) pairs per document
id2word = corpora.Dictionary(data_words)

corpus = []
for text in data_words:
    new = id2word.doc2bow(text)
    corpus.append(new)

print (corpus[0][0:20])


[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 3), (9, 1), (10, 1), (11, 2), (12, 1), (13, 1), (14, 2), (15, 1), (16, 1), (17, 1), (18, 5), (19, 1)]


In [None]:
# Let us now look up the word behind the first tuple of the first document
first_word_id = corpus[0][0][0]
word = id2word[first_word_id]
print(word)

all


In [None]:
# Reuse the dictionary (id2word) and the document-term matrix (corpus) built above
# from the fuzzy-logic documents, then generate the LDA model
num_topics = 2
lda_model = models.ldamodel.LdaModel(corpus=corpus,
        num_topics=num_topics, id2word=id2word, passes=25)


### Vector Projection

In [None]:
num_words = 5
for i in range(num_topics):
  print(lda_model.print_topic(i, topn=num_words))

print('\nTop ' + str(num_words) + ' contributing words to each topic:')
for item in lda_model.print_topics(num_topics=num_topics, num_words=num_words):
    print('\nTopic', item[0])
    list_of_strings = item[1].split(' + ')
    for text in list_of_strings:
        details = text.split('*')
        print("%-12s:%0.2f%%" %(details[1], 100*float(details[0])))

0.053*"fuzzy" + 0.053*"logic" + 0.032*"very" + 0.022*"distance" + 0.022*"member"
0.048*"expert" + 0.048*"fuzzy" + 0.030*"logic" + 0.030*"use" + 0.030*"can"

Top 5 contributing words to each topic:

Topic 4
"logic"     :0.90%
"fuzzy"     :0.90%
"very"      :0.90%
"scale"     :0.90%
"slide"     :0.90%

Topic 1
"expert"    :4.80%
"fuzzy"     :4.80%
"logic"     :3.00%
"use"       :3.00%
"can"       :3.00%


In [None]:
new_docs = [
  """
  Fuzzy or multi-valued logic was introduced in the 1930s by Jan Lukasiewicz, a
  Polish logician and philosopher (Lukasiewicz, 1930). He studied the mathematical 
  representation of fuzziness based on such terms as tall, old and hot. While
  classical logic operates with only two values, 1 (true) and 0 (false), Lukasiewicz
  introduced logic that extended the range of truth values to all real numbers in
  the interval between 0 and 1. He used a number in this interval to represent the
  possibility that a given statement was true or false. For example, the possibility
  that a man 181 cm tall is really tall might be set to a value of 0.86. It is likely that
  the man is tall. This work led to an inexact reasoning technique often called
  possibility theory.
 
  """
  ]

# Tokenize the new document with simple_preprocess (lemmatization is skipped here),
# then project it onto the fuzzy-logic model using that model's own dictionary
new_texts = [simple_preprocess(document, deacc=True) for document in new_docs]
new_doc_term_mat = [id2word.doc2bow(text) for text in new_texts]

vector = lda_model[new_doc_term_mat]
print(vector[0])



[(0, 0.42149878), (1, 0.5785012)]


### Graphical representation

In [None]:
!pip install pyLDAvis==2.1.2
import pyLDAvis.gensim




### Graphical representation of my topic

In [None]:
# Visualize my topic
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word, mds="mmds", R=11)
vis

### Graphical representation of the first analysis

**Conclusion** There are two distinct clusters with no overlap between them in both cases. Also, in both cases the words with the highest overall term frequency were meaningful. Hence the topics were well harmonized and unique. For example, in my analysis words such as fuzzy, logic and expert are unique to their cluster. Therefore the vector analysis looks correct. Words such as go, and, is, etc., which have no real value in cluster formation, are excluded. In the first text analysis we see an overall term frequency of 5 for the top words, with an estimated term frequency within the topic of 3.
