![alt text](https://bdaaosu.org/img/Logo.png)

# <center> Intro to Text Analytics with Python </center>

---



## Importing packages / downloading stuff

Packages are the backbone of Python. We'll need a few very popular packages for our task, so let's import them.

In [0]:
import nltk
from nltk.tokenize import word_tokenize # Word tokenizer
from nltk.probability import FreqDist # Distribution generator for token frequency
from nltk.corpus import stopwords # Stopwords
from nltk.stem.wordnet import WordNetLemmatizer # Lemmatizer using NLTK's "wordnet"
from nltk.stem.porter import PorterStemmer # Word stemmer
import gensim # LDA topic modeling
import pandas as pd # Data manipulation package; keystone Python data science package
import matplotlib.pyplot as plt # Plotting library
import collections # Sort tokens by count
flatten = lambda l: [item for sublist in l for item in sublist]
import io
from google.colab import files

In [0]:
# The following statements are needed to download nltk "data packages," 
# which contain a lot of the material we'll need to use NLTK functions
nltk.download('punkt') # Sentence tokenizer data package
nltk.download('stopwords') # Stopwords data package
nltk.download('wordnet') # Wordnet (word associations) data package

## Loading in the data: TED Talks!

Find the data on GDrive here: [TED Talks Data](http://go.osu.edu/BDAA_TED_Data)

Data source: [Kaggle Ted Talks Dataset](https://www.kaggle.com/rounakbanik/ted-talks#transcripts.csv)

This dataset contains transcripts of 2,467 TED Talks from 2006 to 2017.

Say that you want to find a few great TED Talks in this dataset to digest. Where should you start? Reading through all 2,000+ of them would be a little silly: you want to find some interesting talks to read or watch, _and fast_. To get better insight into where to direct your attention in the transcripts data, we'll see if we can programmatically find a set number of topics to start from, and build a model that drops any new transcripts you come across into these topic "buckets".

In [0]:
uploaded = files.upload()

In [0]:
# Pandas is great for reading in and manipulating CSV files!
ted = pd.read_csv(io.StringIO(uploaded['transcripts.csv'].decode('utf-8')))

In [0]:
ted.transcript[0]

## Preprocessing the transcripts

Before passing the transcripts to the tokenizer, we need to correct any inconsistencies in the text and check for artifacts that are not relevant to our analysis. To do this well, it helps to understand how models are built from the tokens that are generated from the transcripts.

In [0]:
# Remove transcript "artifacts," like talk cues or audience responses
# Use the apply() method on a DF series to apply a transformation to every data point in the series
ted['transcript'] = ted['transcript'].apply(lambda x: x.replace('(Laughter)', ' ').replace('(Applause)', ' ').replace('(Music)', ' ').replace('♫', ' '))
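
As a possible alternative to chaining `replace()` calls, the same cleanup can be sketched with one regular expression. The helper name `strip_artifacts` and the sample string below are made up for illustration; they are not part of the dataset or the notebook.

```python
import re

# Hypothetical helper (an assumption, not part of the notebook): match the
# parenthesized stage directions and the music-note symbol in a single pattern.
ARTIFACT_PATTERN = re.compile(r"\((?:Laughter|Applause|Music)\)|♫")

def strip_artifacts(text):
    """Replace each known artifact with a single space."""
    return ARTIFACT_PATTERN.sub(" ", text)

sample = "Thank you.(Applause) That was fun. (Laughter)"  # made-up sample text
cleaned = strip_artifacts(sample)
print(cleaned)
```

With pandas, a helper like this could be passed directly to `apply()` in place of the lambda.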

In [0]:
## ANSWER 1
# Hint for statement #1: "BDAA. Inspire. Empower. Connect.".replace('. ', ' ') -> "BDAA Inspire Empower Connect"
# Hint for statement #2: "BDAA Inspire Empower Connect".lower() -> "bdaa inspire empower connect"

## Tokenizing

Many text analytics algorithms assume that input data (i.e. text) is _tokenized_. That is, we will be feeding a list/array of words or (what we, as data scientists, identify to be) meaningful tokens into our algorithms.
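
To make the idea concrete, here is a minimal sketch of a tokenizer built on a regular expression. It is only a simplified stand-in: NLTK's `word_tokenize` follows more elaborate rules (for example, it treats contractions differently), and the function name `simple_tokenize` is an assumption made for this example.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    # \w+(?:'\w+)? keeps a contraction such as "don't" in one piece;
    # [^\w\s] captures each punctuation mark as its own token.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

tokens = simple_tokenize("Good morning. How are you?")
print(tokens)
# ['Good', 'morning', '.', 'How', 'are', 'you', '?']
```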

In [0]:
# Use the word_tokenize function from NLTK to tokenize the first transcript
ted_tokens = word_tokenize(ted.transcript[0])

In [0]:
# What do "tokens" look like?
ted_tokens

In [0]:
# Create a frequency distribution of all tokens in the first transcript
fdist = FreqDist(ted_tokens)
print(fdist)

In [0]:
# Plot the distribution! Matplotlib is perhaps the most popular visualization package in Python
fdist.plot(30,cumulative=False)
plt.show()
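
For intuition, a frequency distribution is essentially a tally of how often each distinct token occurs. The sketch below reproduces that idea with the standard library's `collections.Counter` on a made-up token list; it is not a replacement for `FreqDist`, which adds plotting and other conveniences.

```python
from collections import Counter

# Made-up tokens for illustration only
tokens = ["the", "talk", "was", "the", "best", "talk", "of", "the", "year"]

counts = Counter(tokens)             # maps each token to its number of occurrences
top_two = counts.most_common(2)      # the two most frequent tokens, highest count first
print(top_two)
# [('the', 3), ('talk', 2)]
```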

## Stopwords

The English language contains lots of rather meaningless words that only serve to glue _meaningful_ words together. 

Take the following two sentences, for example:

> _"I was just in Scott, trying to eat, when all of a sudden a food fight broke out"_

> _"I was just in the hospital, trying to diagnose a patient, when all of a sudden a code blue came over the speaker"_


What's the general topic of each sentence? Do the "glue" words help us determine what the topic is? When we're trying to model topics, these words won't do us any service. So, let's remove them! 
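
As a rough sketch of what stopword removal does, the snippet below filters a shortened version of the hospital sentence against a tiny hand-picked stopword set. The set `DEMO_STOPWORDS` and the helper `remove_stopwords` are assumptions made for this example; NLTK's English list is much longer.

```python
# A tiny, hand-picked stopword set for illustration only (not NLTK's list)
DEMO_STOPWORDS = {"i", "was", "just", "in", "the", "to", "a"}

def remove_stopwords(sentence, stopword_set):
    """Lowercase the sentence, drop commas, and keep only non-stopwords."""
    words = sentence.lower().replace(",", "").split()
    return [word for word in words if word not in stopword_set]

hospital = "I was just in the hospital, trying to diagnose a patient"
kept = remove_stopwords(hospital, DEMO_STOPWORDS)
print(kept)
# ['hospital', 'trying', 'diagnose', 'patient']
```

The words that survive are the ones that hint at the topic.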

In [0]:
# NLTK has compiled a list of common stopwords that we can remove from our transcripts
stop_words=set(stopwords.words("english"))
print(stop_words)

In [0]:
# Remove both stopwords and tokens of 4 characters or fewer from our list of tokens
ted_tokens = [token for token in ted_tokens if token not in stop_words and len(token) > 4]

In [0]:
ted_tokens

## Stemming / Lemmatization

Because of all of the tenses and qualifiers, many of the verbs and nouns in the English language take several different forms, depending on context, but semantically convey the same idea.

Take the verb _break_ for example:

> **Forms:** break, breaks, broke, broken, breaking

Also, take the noun _person_:

> **Forms:** person, people, personification

Because these forms are different on a character by character level, however, a computer will not be able to discern that they are the same without background knowledge. This is where _stemming_ and _lemmatization_ come in. 

_Stemming_ strips common suffixes from words using fixed rules. It will not change the base form of the word, so an irregular form like _broke_ is left as-is.

<br>
![Stemming example](https://nlp.stanford.edu/IR-book/html/htmledition/img102.png)
<br>

_Lemmatization_ involves a full morphological analysis to identify the correct _lemma_ (i.e. dictionary form) of each word. It will change the base form of the word if necessary.
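
The difference can be sketched with two toy functions. These are deliberately simplistic illustrations, not the Porter algorithm or WordNet: the suffix list and the lookup table below are hypothetical and cover only the example words.

```python
# Toy illustrations only: NOT the Porter stemmer or the WordNet lemmatizer.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny suffix list
IRREGULAR_LEMMAS = {"broke": "break", "broken": "break", "flew": "fly"}  # hypothetical lookup

def toy_stem(word):
    """Strip the first matching suffix; irregular forms pass through unchanged."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look up irregular forms first, then fall back to suffix stripping."""
    return IRREGULAR_LEMMAS.get(word, toy_stem(word))

for form in ["breaks", "breaking", "broke", "broken"]:
    print(form, "->", toy_stem(form), "|", toy_lemmatize(form))
```

The toy stemmer maps _breaks_ and _breaking_ to _break_ but leaves _broke_ and _broken_ alone, while the toy lemmatizer maps all four to _break_ because it has background knowledge (the lookup table).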

In [0]:
# Create a Stemmer object (PorterStemmer was imported above)
stem = PorterStemmer()

# Create a Lemmatizer object (WordNetLemmatizer was imported above)
lem = WordNetLemmatizer()


word = "flew"
print("Stemmed Word:",stem.stem(word))
print("Lemmatized Word:",lem.lemmatize(word,"v"))

In [0]:
[WordNetLemmatizer().lemmatize(token,"v") for token in ted_tokens]


## Cleaning up our tokens

Let's apply all the transformation operations to every transcript in the dataset, and save the resulting tokens in a _list of lists_.

In [0]:
text_data = []

# Define a function to do the cleaning for us, given an iterable of transcripts
# (it relies on the stop_words set defined earlier)
def tokenize_and_clean(transcripts):
  
  data = []
  lemmatizer = WordNetLemmatizer() # Create the lemmatizer once and reuse it for every token
  # Apply all cleaning operations to each transcript and append the resulting tokens to a "list of lists"
  for transcript in transcripts:
    ted_tokens = word_tokenize(transcript)
    ted_tokens = [token for token in ted_tokens if token not in stop_words and len(token) > 4]
    ted_tokens = [lemmatizer.lemmatize(token,"v") for token in ted_tokens]
    data.append(ted_tokens)
    
  return data
  
  
text_data = tokenize_and_clean(ted.transcript)

## Topic Modeling!

Gensim is a _topic modeling framework_: a suite of tools for converting tokenized data into "corpora" that can be fed to _Latent Dirichlet Allocation_ (LDA), the model we'll use to generate topics from transcript tokens.



In [0]:
# Build a dictionary that maps each unique token to an integer ID
dictionary = gensim.corpora.Dictionary(text_data)
dict(dictionary)

In [0]:
# Create a text "corpus": one bag-of-words, i.e. a list of (token ID, count) pairs, per transcript
corpus = [dictionary.doc2bow(text) for text in text_data]
corpus[1870]

In [0]:
# How is the computer interpreting the words?
corpus_1870 = corpus[1870]

for word_id in corpus_1870:
  print("Token {} (\"{}\") appears {} time(s).".format(word_id[0], dictionary[word_id[0]], word_id[1]))
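
To see what a bag-of-words representation contains, here is a minimal pure-Python sketch of the same idea on two made-up documents. The names `token2id` and `to_bow` are assumptions for this example, and Gensim may assign IDs in a different order than this sketch does.

```python
from collections import Counter

# Two made-up tokenized documents
docs = [["climate", "ocean", "climate"], ["ocean", "music"]]

# Assign each unique token an integer ID in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(token2id)  # {'climate': 0, 'ocean': 1, 'music': 2}
print(bows)      # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded; only which tokens occur, and how often, is kept. Tokens missing from the dictionary are skipped.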

In [0]:
NUM_TOPICS = 5
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)

In [0]:
WORDS_PER_TOPIC = 7
topics = ldamodel.print_topics(num_words = WORDS_PER_TOPIC)
for topic in topics:
    print(topic)

Are these topics what we're looking for? Are the words associated with each topic informative?

_Not really._

What are the most common words across all transcripts?

In [0]:
# Create a list of all tokens across transcripts
all_tokens = flatten(text_data)
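
For reference, flattening a list of lists can also be done with `itertools.chain` from the standard library. This is just an equivalent sketch on made-up data, not a required change.

```python
from itertools import chain

# Made-up nested token lists, shaped like the "list of lists" above
nested = [["ocean", "climate"], ["music"], []]

# chain.from_iterable walks each inner list in turn and yields its items
flat = list(chain.from_iterable(nested))
print(flat)
# ['ocean', 'climate', 'music']
```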

In [0]:
# Create a frequency distribution of all tokens in all transcripts
fdist = FreqDist(all_tokens)
print(fdist)
# Plot the distribution! Matplotlib is perhaps the most popular visualization package in Python
fdist.plot(30,cumulative=False)
plt.show()

In [0]:
# Filter out any tokens that appear in more than 50% of the transcripts
# (filter_extremes also applies its other defaults, e.g. no_below=5, which drops very rare tokens)
dictionary.filter_extremes(no_above = 0.5)
corpus = [dictionary.doc2bow(text) for text in text_data]
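
The `no_above` idea can be sketched in plain Python: count how many documents each token appears in, then drop tokens whose share of documents exceeds the threshold. The function name `filter_by_doc_frequency` and the sample documents are assumptions for illustration; Gensim's own method also applies other limits that this sketch leaves out.

```python
# Made-up tokenized documents; "people" appears in every one of them
docs = [
    ["people", "climate", "ocean"],
    ["people", "music", "piano"],
    ["people", "brain", "music"],
    ["people", "climate", "energy"],
]

def filter_by_doc_frequency(documents, no_above):
    """Drop tokens that occur in more than `no_above` (a fraction) of the documents."""
    doc_freq = {}
    for doc in documents:
        for token in set(doc):  # count each token at most once per document
            doc_freq[token] = doc_freq.get(token, 0) + 1
    limit = no_above * len(documents)
    keep = {token for token, freq in doc_freq.items() if freq <= limit}
    return [[token for token in doc if token in keep] for doc in documents]

filtered = filter_by_doc_frequency(docs, no_above=0.5)
print(filtered)
```

Here "people" is removed because it appears in all four documents, while "climate" and "music" (two of four, exactly 50%) are kept.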

In [0]:
# Retrain the model!
NUM_TOPICS = 5
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)

In [0]:
WORDS_PER_TOPIC = 7
topics = ldamodel.print_topics(num_words = WORDS_PER_TOPIC)
for topic in topics:
    print(topic)

## Model Evaluation

How is our model generating topic "scores" under the hood? How does it react to transcripts it has never seen before?

In [0]:
ted.loc[1870]

In [0]:
# Let's evaluate our model's performance. How is it generating topic "scores" for each transcript?
for index, score in sorted(ldamodel[corpus[1870]], key = lambda x: -1*x[1]):
  print("\nScore: {}\t \nTopic: {}".format(score, ldamodel.print_topic(index, 10)))
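
The loop above ranks topics by score, highest first. A minimal sketch of that ranking step on made-up numbers follows; the `(topic_id, probability)` pairs are hypothetical values shaped like the model's output for one document, not real results.

```python
# Hypothetical (topic_id, probability) pairs for a single document
doc_topics = [(0, 0.05), (2, 0.70), (4, 0.25)]

# Sort by the probability (second element of each pair), highest first
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
best_topic, best_score = ranked[0]
print(ranked)
# [(2, 0.7), (4, 0.25), (0, 0.05)]
```

Using `reverse=True` gives the same ordering as negating the score in the sort key.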

In [0]:
# How does our model react to a transcript it hasn't seen yet?
ts = """
to do two things at once is to do
neither it's great smack down of
multitasking isn't it often attributed
to the Roman writer probably leus serous
although you know how these things are
he probably never said it what I'm
interested in though is is it true I
mean it's obviously true for emailing at
the dinner table or texting while
driving or possibly for live tweeting a
TED talk as well but I'd like to argue
that for an important kind of activity
doing two things at once all three you
even four is exactly what we should be
aiming for look no further than Albert
Einstein in 1905 he published four
remarkable scientific papers one of them
was on Brownian motion it provided
empirical evidence that atoms exist and
it laid out the basic mathematics behind
most of financial economics another one
was on the theory of special relativity
another one was on the photoelectric
effect
that's why solar panels work it's a nice
one gave him the Nobel Prize for that
one and the fourth introduced an
equation you might have heard of e
equals MC squared so tell me again how
you shouldn't do several things at once
now obviously working simultaneously on
Brownian motion special relativity in
the photoelectric effect it's not
exactly the same kind of multitasking a
snapchatting while you're watching
Westworld a very different and einstein
well Einsteins he's Einstein was one of
a kind he's unique but the pattern of
behavior that Einstein was demonstrating
that's not unique at all it's very
common among highly creative people both
artists and scientists and I'd like to
give it a name slow motion multitasking
slow motion multitasking feels like a
counterintuitive idea what I'm
describing here is having multi
projects on the go at the same time when
you move backwards and forwards between
topics as the mood takes you or as the
situation demands but the reason it
seems counterintuitive is because we're
used to lapsing into multitasking out of
desperation we're in a hurry who want to
do everything at once if we were willing
to slow multitasking down we might find
that it works quite brilliantly 60 years
ago a young psychologist by the name of
Bernice a Jason began a long research
project into the personalities and the
working habits of 40 leading scientists
Einstein was already dead but four of
her subjects won Nobel prizes including
Linus Pauling and Richard Feynman the
research went on for decades in fact it
continued even after professor Aegis and
herself had died and one of the
questions that it answered was how is it
that some scientists are able to go on
producing important work right through
their lives what is it about these
people is it their personality is their
skillset their daily routines oh well
the pattern that emerged was clear and I
think to some people surprising the top
scientists kept changing the subject
they would shift topics repeatedly
during their first hundred published
research papers you want to guess how
often three times five times no
on average the most enduringly creative
scientists switched topics 43 times in
their first hundred research papers
seems that the secret to creativity is
multitasking in slow motion her a
Jason's research suggests we need to
reclaim multitasking and remind
ourselves how powerful it can be and
she's not the only person to have found
this different researchers using
different methods to study different
highly creative people have found that
very often
have multiple projects in progress at
the same time and they're also far more
likely than most of us to have serious
hobbies slow motion multitasking among
creative people is ubiquitous so why I
think there were three reasons and the
first is the simplest a creativity often
comes when you take an idea from its
original context and you move it
somewhere else it's easier to think
outside the box if you spend your time
clambering from one box into another for
an example of this consider the original
Eureka moment Archimedes is wrestling
with a difficult problem and he realizes
in a flash he can solve it using the
displacement of water and if you believe
the story this idea comes to him as he's
taking a bath lowering himself in and
he's watching the water level rise and
fall and if solving a problem while
having a bath isn't multitasking I don't
know what is the second reason that
multitasking can work is that learning
to do one thing well can often help you
do something else
any athlete can tell you about the
benefits of cross-training it's possible
to cross train your mind to a few years
ago researchers took 18 randomly chosen
medical students and they enrolled them
in a course at the Philadelphia Museum
of Art where they learned to criticise
and analyze works of visual art and at
the end of the course these students
were compared with a control group of
their fellow medical students and the
ones who had taken the art course had
become substantially better at
performing tasks such as diagnosing
diseases of the eye by analyzing
photographs
they've become better eye doctors so if
we want to become better what we do
maybe we should spend some time doing
something else even if the two fields
appear to be as completely distinct as
ophthalmology and the history of art and
if you'd like an example of this should
we go for a less intimidating example
and
okay michael crichton creator of
Jurassic Park and ER so in the 1970s he
originally trained as a doctor but then
he wrote novels and he directed the
original Westworld movie but also and
this is less well known
he also wrote non-fiction books about
art medicine that computer programming
so in 1995 he enjoyed the fruits of all
this variety by penning the world's most
commercially successful book and the
world's most commercially successful TV
series and the world's most commercially
successful movie in 1996 he did it all
over again there's a third reason why
slow motion multitasking can help us
solve problems it can provide assistance
when we're stuck this can happen in an
instant so imagine that feeling of
working on a crossword puzzle and you
can't figure out the answer and the
reason you can't is because the wrong
answer is stuck in your head it's very
easy just go and do something else here
switch topics switch context you'll
forget the wrong answer and that gives
the right answer space to pop into the
front of your mind but on the slower
time scale that interests me being stuck
is a much more serious thing yeah you
you get turned down for funding these
cell cultures won't grow your rockets
keep crashing nobody wants to publish
your fantasy novel about a school for
wizards or maybe you just can't find a
solution to the problem that you're
working on and being stuck like that
I mean stasis stress possibly even
depression but if you have another
exciting challenging project to work on
or being stuck on one it's just an
opportunity to do something else we
could all get stuck sometimes even
Albert Einstein ten years after the
original miraculous year that I
described
Einstein was putting together the pieces
of his theory of general
Relativity his greatest achievement and
he was exhausted and so he turned to an
easier problem he proposed the
stimulated emission of radiation which
as you may know is the sir in laser so
he's laying down the theoretical
foundation for the laser beam and then
while he's doing that he moves back to
general relativity and he's refreshed he
sees what the theory implies that the
universe isn't static it's expanding
it's an idea so staggering Einstein
can't bring himself to believe it for
years look if you get started and you
lay the book you get the ball rolling on
laser beams you're in pretty good shape
so that's the case for slow motion
multitasking you know I'm not promising
that it's going to turn you into
Einstein
I'm not even promising it's going to
turn you into Mach Michael Crichton but
it is a powerful way to organize our
creative lives but there's a problem how
do we stop all of these projects
becoming completely overwhelming how do
we keep all these ideas straight in our
minds well here's a simple solution a
practical solution from the great
American choreographer Twyla Tharp over
the last few decades
she's blurred boundaries mixed genres
won prizes danced to the music of
everybody from Philip Glass to Billy
Joel she's written three books I mean
she's a Jesus slow-motion multitasker of
course she is she says you have to be
all things why exclude you have to be
everything and thoughts method for
preventing all of these different
projects from becoming overwhelming is a
simple one she gives each project a big
cardboard box writes the name of the
project on the side of the box and into
it she tosses DVDs and books magazine
cuttings
theater programs physical objects really
anything that's provided a source of
creative inspiration and she writes the
Box means I never have to worry about
forgetting one of the biggest fears for
a creative person is that some brilliant
idea will get lost because you didn't
write it down and put it in a safe place
I don't worry about that because I know
where to find it it's all in the box you
can manage many ideas like this either
in physical boxes or in their digital
equivalents so I would like to urge you
to embrace the art of slow motion
multitasking not because you're in a
hurry but because you're in no hurry at
all and I want to give you one final
example my favorite example Charles
Darwin a man whose slow-burning
multitasking is so staggering and he's a
diagram to explain it all to you
we know what Darwin was doing at
different times because the creativity
researchers Howard Gruber and Sarah
Davis have analyzed his Diaries and his
notebooks so when he left school age of
18 he was initially interested in two
fields
so it's zoology and geology pretty soon
he signed up to be the on board
naturalist on the Beagle this is the the
ship that eventually took five years to
sail all the way around the southern
oceans of the earth stopping at the Gila
Africa's passing through the Indian
Ocean while he was on the Beagle he
began researching coral reefs this is a
great synergy between his two interests
in zoology and geology and it starts to
get him thinking about slow processes
but when he gets back from the voyage
his interests start to expand even
further psychology botany for the rest
of his life he's moving backwards and
forwards between these different fields
he never quite abandons any of them in
1837 he begins work on two very
interesting projects one of them
earthworms
the other a little notebook which he
titles the transmutation of species then
Darwin starts studying my field
economics he reads a book by the
economist Thomas Malthus and he has his
Eureka moment in a flash he realizes how
species could emerge and evolve slowly
through this process of the survival of
the fittest
it all comes to him he writes it all
down every single important element of
the theory of evolution in that notebook
but then a new project is some Williams
born well as a natural experiment right
there you get to observe the development
of a human infant so immediately Darwin
starts making notes now of course he's
still working on a theory of evolution
and the development of the human infant
but during all of this he realizes he
doesn't really know enough about
taxonomy they starts studying that and
in the end he spends eight years
becoming the world's leading expert on
barnacles then natural selection a book
that he's to continue working on for his
entire life he never finishes it Origin
of Species is finally published twenty
years after Darwin sent out all the
basic elements then the descent a man
controversial book and then the book
about the development of the human
infant the one that was inspired when it
would he could see his son William
crawling on the the sitting room floor
in front of him when the book was
published
William was 37 years old and all this
time
Darwin's working on earthworms he fills
his billiard room with earthworms in
pots with gas covers he shines lights on
them to see if they'll respond
he holds a hot poker next and see if
they move away he chews tobacco and he
blows on the earthworms to see if they
have a sense of smell he even plays the
bassoon at the earthworms
I like to think of this great man when
he's tired he's stressed he's anxious
about the reception of his book The
Descent of Man you or I might log into
Facebook or turn on the television
Darwin would go into the billiard room
to relax by studying the earthworm's
intensely and that's why it's
appropriate that one of his last great
works is the formation of vegetable
mould through the action of worms
he worked upon that book for 44 years we
don't live in the 19th century anymore I
don't think any of us could sit on our
creative or scientific projects for 44
years but we do have something to learn
from the great slow motion multitaskers
from Einstein and Darwin to Michael
Crichton and Twyla Tharp the modern
world seems to present us with a choice
if we're not get a fast twitch from
browser window to browser window we have
to live like a hermit focus on one thing
to the exclusion of everything else I
think that's a false dilemma we can make
multitasking work for us unleashing our
natural creativity we just need to slow
it down so make a list of your projects
put down your phone pick up a couple of
cardboard boxes and get to work thank
you very much
[Applause]
"""
ts = ts.replace('[Applause]', ' ').replace('[Laughter]', ' ')
ts = ts.replace('. ', ' ').lower()

ts_list = []
ts_list.append(ts)
test_tokens = tokenize_and_clean(ts_list)

In [0]:
# Make sure the tokens look okay before proceeding!
test_tokens

In [0]:
# We'll need to create another corpus for our test transcript
test_corpus = [dictionary.doc2bow(text) for text in test_tokens]

In [0]:
# How does the model perform on test data?
for index, score in sorted(ldamodel[test_corpus[0]], key = lambda x: -1*x[1]):
  print("\nScore: {} \nTopic: {}".format(score, ldamodel.print_topic(index, 10)))

# Answers

In [0]:
# ANSWER 1
# Fix an odd inconsistency in the transcript data, and convert all transcripts to lowercase
ted['transcript'] = ted['transcript'].apply(lambda x: x.replace('.', ' '))
ted['transcript'] = ted['transcript'].apply(lambda x: x.lower())

# Resources

> [DataCamp](https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk)

> [AnalyticsVidhya](https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/)

> [Coursera](https://www.coursera.org/learn/python-text-mining)