# Fundamentals of Data Science - Week 3 and Week 4 

###  <span style='color: green'>Scroll down to the bottom of the notebook to see your assignment</span> 
<p></p>
<span style='color: red'>Deadline: **NOT YET SET**</span>


In this notebook we are going to cover the following practical aspects of data science:
+ Gathering data (scraping the Twitter Streaming API)
+ Storing and organizing it (store to file or a database)
+ Preprocess the data
+ Perform sentiment, topical and correlation analysis
+ Visualize

To complete this assignment you need to have a running Anaconda installation with Python 2.7 on your device. If this is not the case, refer back to Week 1. Python package prerequisites include:
+ **Twitter API Client** [Tweepy](https://github.com/tweepy/tweepy) [Install command: **pip install tweepy**]
+ **Python Data Analysis Library** [Pandas](https://pandas.pydata.org/pandas-docs/stable/install.html) [Install command: **pip install pandas**]
+ **Python Visualization Library** [MatPlotLib](https://matplotlib.org/) [Install command: **python -m pip install matplotlib**]
+ **Python-Mongo Database Client** [PyMongo](https://api.mongodb.com/python/current/) [Install command: **python -m pip install pymongo**]
+ **Python Topic Modelling Library** [GENSIM](https://radimrehurek.com/gensim/install.html) [Install command: **pip install --upgrade gensim**]

An additional requirement if **you would like to use a database** is MongoDB:
+ MongoDB database server instance [MongoDB Installation Instructions](https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/#install-mongodb-community-edition)
+ **Windows download** (and perhaps linux, untested): [link](https://www.mongodb.com/download-center?jmp=nav#community)

## Gathering Data - Twitter API

The public Twitter API consists of a REST API and a Streaming API. Most application developers mix and match the APIs to produce their application. The Streaming API provides low-latency high-volume access to Tweets. Additionally, there are some families of APIs (such as the Ads API) which require your application to be whitelisted in order to make use of them. For this assignment we are going to use the [**Twitter Streaming API**](https://dev.twitter.com/streaming/overview).

### Twitter Streaming API

The Streaming APIs give developers low latency access to Twitter’s global stream of Tweet data. A streaming client will be pushed messages indicating Tweets and other events have occurred, without any of the overhead associated with polling a REST endpoint.

Twitter offers several basic streaming endpoints, each customized to certain use cases:
+ **Public Streams** - Streams of the public data flowing through Twitter. Suitable for following specific users or topics, and data mining.
+ **User Streams** - Single-user streams, containing roughly all of the data corresponding with a single user’s view of Twitter.
+ **Site Streams** - The multi-user version of user streams. Site streams are intended for servers which must connect to Twitter on behalf of many users. Site Streams is a closed beta. Applications are no longer being accepted.

In this assignment we are going to use the **Twitter Public Streams** to gather data about certain topics of interest. For using the Twitter API we need to create a Twitter Account, a Twitter APP and obtain the API Keys.

### Obtaining Twitter API Keys

In order to access Twitter Streaming API, we need to get 4 pieces of information from Twitter: API key, API secret, Access token and Access token secret. Follow the steps below to get all 4 elements:

+ Create a twitter account if you do not already have one.
+ Go to https://apps.twitter.com/ and log in with your twitter credentials.
+ Click "Create New App"
+ Fill out the form, agree to the terms, and click "Create your Twitter application"
+ In the next page, click on "API keys" tab, and copy your "API key" and "API secret".
+ Scroll down and click "Create my access token", and copy your "Access token" and "Access token secret".

### Connecting to Twitter Streaming API and downloading data

Now that we have the necessary credentials we can use the Tweepy library we installed in the previous step to connect to Twitter and start gathering data.

First we import the required classes from the Tweepy library:


In [None]:
# Import the necessary classes from the tweepy library
# (StreamListener is only available in Tweepy releases before 4.0)

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream


Next we copy the credentials in separate variables that we are going to use through the entire assignment.

**NOTE**: It is a general best practice not to keep sensitive data like API keys in raw form in your scripts. For simplicity and demonstration purposes we do so in this exercise, but this is not acceptable in a real-life scenario.

In [None]:
access_token = ""
access_token_secret = ""
consumer_key = ''
consumer_secret = ''
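
One way to follow the note above is to read the credentials from environment variables instead of typing them into the notebook. The sketch below is only an illustration: the function `load_credentials` and the four variable names (`TWITTER_CONSUMER_KEY` and so on) are made up for this example, and any naming works as long as the variables are set before Jupyter is started.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}

def load_credentials(environ=os.environ):
    """Return a dict with the four credentials, raising if any variable is missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

The returned dictionary can then be unpacked into the four variables used in the rest of the notebook, for example `consumer_key = load_credentials()["consumer_key"]`.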

Next we specify:
+ The location where we are going to dump the tweets that we obtained through the Streaming API
+ A basic function that formats and stores the tweets in a text file for later usage
+ A class consisting of a listener that attaches to a particular stream and displays the tweets directly onto the screen.

In [None]:
# We need to import json for dumping the tweets into our file
import json 

# Here we specify where the tweets would be stored
tweets_collection = 'data/tweets.txt'
tweet_file = open(tweets_collection, 'a')

# This is a basic Python function that appends one raw tweet (a JSON string) to a text file
def dump_tweet_to_json(tweet, dump_file):
    dump_file.write(tweet + '\n')

# This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):

    # When we get data through the API, print the data on screen
    def on_data(self, data):
        print(data)
        return True

    # When an error occurs, print the status code so that we know what it is.
    def on_error(self, status):
        print(status)

Using the defined class we can authenticate using the example below and attach our listener to a stream that is particularly interested in these topics:
+ Data Science
+ University Of Amsterdam
+ Python
+ Artificial Intelligence

The code section below **will not stop automatically** once you run it (which is the whole point of the streaming API). To stop the execution and move on to the next section, interrupt the kernel using the **stop symbol** in the top toolbar.

In [None]:
if __name__ == '__main__':

    # This handles Twitter authentication and the connection to the Twitter Streaming API

    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    stream.filter(track=['data science', 'university of amsterdam', 'python', 'artificial intelligence'])

If we now modify our Listener class to store the tweets to a file instead of printing it on screen, we would be able to use the tweets for our later analysis.

In [None]:
# This listener stores received tweets in our file instead of printing them.
class StdOutListener(StreamListener):

    # When we get data through the API, append it to the tweets file
    def on_data(self, data):
        print("Stored a new tweet.")
        dump_tweet_to_json(data, tweet_file)
        return True

    # When an error occurs, print the status code so that we know what it is.
    def on_error(self, status):
        print("Error: ", status)

If we run the main section for a short period of time again and check the tweets.txt file in the data folder we will find the streamed tweets.

In [None]:
if __name__ == '__main__':

    # This handles Twitter authentication and the connection to the Twitter Streaming API

    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    stream.filter(track=['data science', 'amsterdam', 'python', 'artificial intelligence', 'netherlands'])

If MongoDB is installed and running on your device, the tweets can be stored as database entries using the following code. The example writes to a collection named `twitter` in the database `test`; MongoDB creates both on the first insert.

**Note**: If MongoDB is not installed or not running on your device, the code raises a Connection Refused error (on Windows the message may read 'actively refused'). In that case you need to install the MongoDB Community Server.

In [None]:
from pymongo import MongoClient

client = MongoClient()
db = client.test

db.twitter.insert_one({"sample":"tweet"})

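The final line of the code would replace the **dump_tweet_to_json()** function call in the Listener class. Note that `insert_one()` expects a dictionary, so the raw tweet string has to be parsed with `json.loads()` first.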
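
As a minimal sketch of that replacement, the helper below parses the raw tweet and hands it to a collection. The name `store_tweet` is hypothetical, and the only assumption about `collection` is that it offers an `insert_one(dict)` method, as the `db.twitter` collection above does.

```python
import json

def store_tweet(raw_tweet, collection):
    """Parse one raw tweet (a JSON string) and insert it into `collection`.

    `collection` is assumed to expose insert_one(dict), like a PyMongo
    collection such as db.twitter.
    """
    document = json.loads(raw_tweet)
    collection.insert_one(document)
    return document
```

Inside the listener, `on_data` would then call `store_tweet(data, db.twitter)` instead of `dump_tweet_to_json(data, tweet_file)`.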

## Preprocessing the data

Assuming that our stream listener has been running for a while and we have gathered some tweets, our tweets.txt file has grown to contain quite a few tweets now. If we open the file and read it line by line, we can import the tweets as json objects in a list and see their contents:

In [None]:
#Pprint is 'pretty print', simply a print function that gives 'nicer' outputs than print
from pprint import pprint

tweets_data = []
tweets_file = open(tweets_collection, "r")

for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
        print ("Imported tweet created at:", tweet['created_at'])
        print ("Tweet content: \n", tweet['text'], "\n")
    except Exception as e:
        print (e)
        continue

print ('#############################################' )       
print ('We have gathered:',len(tweets_data), 'tweets.')
print ('#############################################' ) 

print ("Information contained in a single tweet: \n")
pprint(tweets_data[0].keys())
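
To make the file format concrete, here is a small self-contained sketch of the same parsing loop that works on an in-memory sample instead of `data/tweets.txt`. The function name `parse_tweet_lines` and the sample lines are made up for illustration.

```python
import json

def parse_tweet_lines(lines):
    """Parse an iterable of JSON lines, skipping blank and malformed ones."""
    tweets, skipped = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are not errors, just ignore them
        try:
            tweets.append(json.loads(line))
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            skipped += 1
    return tweets, skipped

# Made-up sample: one valid tweet, one blank line, one malformed line
sample = [
    '{"created_at": "Mon Jan 01 10:00:00 +0000 2018", "text": "hello", "lang": "en"}',
    '',
    'not json at all',
]
tweets_sample, skipped = parse_tweet_lines(sample)
print(len(tweets_sample), "parsed,", skipped, "skipped")
```

Counting the skipped lines, rather than printing every exception, keeps the output readable when the file is large.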


In [None]:
for i in range(len(tweets_data)):
    print(tweets_data[i]['lang'])

In [None]:
list(map(lambda tweet: tweet['text'], tweets_data))

Notice that the data is very noisy. It contains a lot of HTML artifacts, emojis, links and extra metadata that we either do not need at this time or that obscures the content of the tweet.

In cases like these, a preprocessing step is required before analysis can be performed.

As a first step in this direction we will structure the tweets data into a pandas DataFrame to simplify the data manipulation. We will start by creating an empty DataFrame called `tweets` and add 3 columns to it called `text`, `lang`, and `country`. The `text` column contains the tweet, the `lang` column contains the language in which the tweet was written, and the `country` column contains the country from which the tweet was sent.

In [None]:
import pandas as pd

tweets = pd.DataFrame()

tweets['text'] =    list(map(lambda tweet: tweet['text'], tweets_data))
tweets['lang'] =    list(map(lambda tweet: tweet['lang'], tweets_data))
tweets['country'] = list(map(lambda tweet: tweet['place']['country'] if tweet['place'] is not None else None, tweets_data))


Next, we will create 2 charts
+ The first one describing the Top 5 languages in which the tweets were written
+ The second the Top 5 countries from which the tweets were sent.

We will create these two charts using MatPlotLib (the library we installed at the beginning of the assignment).

In [None]:
import matplotlib.pyplot as plt

# This is a directive that enables displaying charts in iPython notebooks.
%matplotlib inline


tweets_by_lang = tweets['lang'].value_counts()

fig, ax = plt.subplots()
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=10)
ax.set_xlabel('Languages', fontsize=15)
ax.set_ylabel('Number of tweets' , fontsize=15)
ax.set_title('Top 5 languages', fontsize=15, fontweight='bold')
tweets_by_lang[:5].plot(ax=ax, kind='bar', color='red')
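
`value_counts()` counts how often each value occurs and sorts the result in descending order, which is why slicing with `[:5]` yields the five most frequent languages. The same idea can be sketched in plain Python; the list of language codes below is made up and simply stands in for `tweets['lang']`.

```python
from collections import Counter

# Made-up language codes standing in for tweets['lang']
langs = ["en", "en", "nl", "es", "en", "nl", "und"]

# most_common(n) returns the n most frequent values together with their counts
top_two = Counter(langs).most_common(2)
print(top_two)  # [('en', 3), ('nl', 2)]
```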

We can do the same thing for countries:

In [None]:
#tweets['country']

In [None]:
# Note: often no country is stored with a tweet, so you might have very few entries in this chart (perhaps none or 1)

tweets_by_country = tweets['country'].value_counts()

fig, ax = plt.subplots()
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=10)
ax.set_xlabel('Countries', fontsize=15)
ax.set_ylabel('Number of tweets' , fontsize=15)
ax.set_title('Top 5 countries', fontsize=15, fontweight='bold')
tweets_by_country[:5].plot(ax=ax, kind='bar', color='blue')

#### Food for thought

There are plenty of ways in which gathered data can be skewed or manipulated to paint a false picture of something. This technique is used very often in marketing. Can you think of ways to show that the statistics we displayed in the previous section are biased?

### Extracting links from tweets

Tweets very often carry additional context for the statement they are making in a hyperlink. Extracting these hyperlinks from the tweets might provide an expansion for the dataset you are collecting or analyzing. A useful skill in data science is to extract this type of information. We will do this by using regular expressions. Python provides a library for regular expressions called re.

We will start by importing this library and creating a function that extracts the hyperlink from the tweet's content and a second function that checks if a specific keyword is present in a text.

In [None]:
import re

# A function that extracts the hyperlinks from the tweet's content.
def extract_link(text):
    regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    match = re.search(regex, text)
    if match:
        return match.group()
    return ''

# A function that checks whether a word is included in the tweet's content
def word_in_text(word, text):
    # re.escape makes sure characters such as '+' or '.' in the keyword are matched literally
    pattern = re.escape(word.lower())
    return re.search(pattern, text.lower()) is not None

Next we add a column with the extracted links to our previously defined DataFrame:

In [None]:
tweets['link'] = tweets['text'].apply(lambda tweet: extract_link(tweet))

With the help of this frame we can then print all the links from the tweets that we gathered and show them on the screen below:

In [None]:
print(tweets['link'])
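
Note that `extract_link` returns only the first match, while a tweet can contain several links. A possible variant, sketched here under the hypothetical name `extract_links`, uses `re.findall` with the same pattern to return all of them; the example sentence is made up.

```python
import re

# Same pattern as in extract_link above
LINK_REGEX = r'https?://[^\s<>"]+|www\.[^\s<>"]+'

def extract_links(text):
    """Return every hyperlink found in `text` (possibly an empty list)."""
    return re.findall(LINK_REGEX, text)

print(extract_links("Read https://example.com/a and www.example.org today"))
# ['https://example.com/a', 'www.example.org']
```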

**Removing Hyperlinks**

An additional use case that follows from detecting the hyperlinks in the tweets is their removal. Before tokenizing the tweet's content or mapping it to some sort of embedding, it is recommended that artifacts like hyperlinks be removed first.

In [None]:
#NOTE you have to define your own 'index_in_dataframe_containing_link'. Pick one of the non-null entries you see when you ran the above line "print(tweets['link'])"

index_in_dataframe_containing_link = 86
unescaped_tweet = tweets_data[index_in_dataframe_containing_link]['text']
# With link in the content
print("With link:\n", unescaped_tweet)

# With the link removed
result = re.sub(r"http\S+", "", unescaped_tweet)
print ("\n\nLink free:\n",result)
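
The pattern `http\S+` used above leaves links that start with `www.` untouched. As a sketch of one alternative, the hypothetical helper `strip_links` below reuses the pattern from `extract_link` and also collapses the extra whitespace a removed link leaves behind; the example text is made up.

```python
import re

# Same pattern as in extract_link, so 'www.' links are covered as well
LINK_REGEX = r'https?://[^\s<>"]+|www\.[^\s<>"]+'

def strip_links(text):
    """Remove hyperlinks and collapse the whitespace they leave behind."""
    without_links = re.sub(LINK_REGEX, "", text)
    return " ".join(without_links.split())

print(strip_links("New post: https://example.com/post  see also www.example.org !"))
# New post: see also !
```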

### Tokenization

**Definition**: Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.

In [None]:
from nltk.tokenize import RegexpTokenizer
import html.parser as HTMLParser  # In Python 3.4+ you can simply import html
import nltk

tokenizer = RegexpTokenizer(r'\w+')

dirty_tweet_tokens = tokenizer.tokenize(unescaped_tweet.lower())

cleaned_tweet_tokens = tokenizer.tokenize(result.lower())

print("Clean tokens:\n", cleaned_tweet_tokens)

print("\n\nDirty tokens:\n", dirty_tweet_tokens)

In [None]:
print("Actually got to this point and understood everything(!!!)")

Notice that the tokens we get from the tweet that still contains the URL include fragments that are not natural language (for example pieces of the link) and would therefore distort any further analysis based on them.
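
For the pattern `\w+`, the tokenizer behaves like `re.findall`, so the effect of a leftover link can be seen without NLTK. In the sketch below, `simple_tokenize` is a hypothetical helper and the tweet is made up.

```python
import re

def simple_tokenize(text):
    """Lowercase `text` and return its runs of word characters."""
    return re.findall(r'\w+', text.lower())

print(simple_tokenize("Great read https://t.co/AbC123 #DataScience"))
# ['great', 'read', 'https', 't', 'co', 'abc123', 'datascience']
```

The tokens `https`, `t`, `co` and `abc123` all come from the link and carry no meaning for sentiment or topic analysis.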

## Sentiment Analysis

**Sentiment Analysis** is the process of determining whether a piece of writing is positive, negative or neutral. It’s also known as opinion mining, deriving the opinion or attitude of a speaker. A common use case for this technology is to discover how people feel about a particular topic.

There are two general directions in which you can steer your sentiment analysis pipeline:
+ Lexicon Based Approaches
+ Machine Learning Approaches

#### A Basic Machine Learning Approach

NLTK comes with all the pieces you need to get started on sentiment analysis: a movie reviews corpus with reviews categorized into pos and neg categories, and a number of trainable classifiers. We’ll start with a simple NaiveBayesClassifier as a baseline, using boolean word feature extraction.

**What is a Classifier?**

Wikipedia says: "<span style='color:red'>An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category.</span>"

It is basically a computer program that learns how to map a certain input to a certain output. It translates data from the input space into a segmented output space in which each datum is assigned to a category. For our example we are going to use the Naive Bayes classifier. This classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. This makes the classifier simple and easy to use; however, the assumption of independent predictors is also its main limitation. In real life, it is almost impossible to get a set of predictors that are completely independent.
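
Written as a formula, the independence assumption lets the classifier score a label $c$ for a text with features $f_1, \dots, f_n$ as the prior probability of the label times one factor per feature (the symbols $c$ and $f_i$ are introduced here only for illustration):

$$P(c \mid f_1, \dots, f_n) \propto P(c) \prod_{i=1}^{n} P(f_i \mid c)$$

The predicted label is the one with the highest score.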

So before we start doing anything we need a ready corpus of data we can use to demonstrate the concept of Sentiment Analysis.

In [None]:
# This snippet downloads the most popular datasets for experimenting with NLTK functionalities.
import nltk
nltk.download('popular')

As a first step we import the required NLTK modules and define a simple function that is going to extract our features:


In [None]:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews


# A function that turns a list of words into a feature dictionary of the form {word: True}.
def word_feats(words):
    return dict([(word, True) for word in words])

# Get the negative reviews for movies    
negids = movie_reviews.fileids('neg')

# Get the positive reviews for movies
posids = movie_reviews.fileids('pos')
 
# Build a (features, 'neg') pair for every negative review
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]

# Build a (features, 'pos') pair for every positive review
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# We would only use 1500 instances to train on. The quarter of the reviews left is for testing purposes.
negcutoff = int(len(negfeats)*3/4)
poscutoff = int(len(posfeats)*3/4)

In [None]:
# Construct the training dataset containing 50% positive reviews and 50% negative reviews
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

# Construct the test dataset containing 50% positive reviews and 50% negative reviews
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

print ('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

# Train a NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(trainfeats)

# Test the trained classifier and display the most informative features.
print ('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()

### Tweets Example

The previous example was somewhat large scale: we had a dataset of 2000 film reviews. A smaller tweet dataset better serves our purpose. See the example below:

In [None]:
# For this example we define our own dataset of 5 positive and 5 negative tweets.

# Positive tweets and their sentiment label
pos_tweets = [('I love this car', 'positive'),
              ('This view is amazing', 'positive'),
              ('I feel great this morning', 'positive'),
              ('I am so excited about the concert', 'positive'),
              ('He is my best friend', 'positive')]

# Negative tweets and their sentiment label
neg_tweets = [('I do not like this car', 'negative'),
              ('This view is horrible', 'negative'),
              ('I feel tired this morning', 'negative'),
              ('I am not looking forward to the concert', 'negative'),
              ('He is my enemy', 'negative')]

# The list of tweets we are going to use for testing (groundtruth)
test_tweets = [(['feel', 'happy', 'this', 'morning'], 'positive'),
    (['larry', 'friend'], 'positive'),
    (['not', 'like', 'that', 'man'], 'negative'),
    (['house', 'not', 'great'], 'negative'),
    (['your', 'song', 'annoying'], 'negative')]


We take both of those lists and create a single list of tuples, each containing two elements. The first element is a list containing the words and the second element is the sentiment label. We get rid of words shorter than 3 characters and we lowercase everything.

In [None]:
# pprint is a module for pretty printing
from pprint import pprint


tweets = []

# In this for loop we create a list of tuples like: ([words_of_at_least_3_letters], sentiment_label)
for (words, sentiment) in pos_tweets + neg_tweets:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    tweets.append((words_filtered, sentiment))

# Printing how our dataset looks like after we have performed our own 'custom' tokenization of the tweets.
print("### Training examples ###\n")
pprint(tweets)

print("\n\n### Testing examples ###\n")
pprint(test_tweets)


As in the example above, we define two functions: one for extracting the list of words in our tweet corpus and a second one that builds the list of word features on which we will train a classifier. In this case the word frequencies are printed for inspection, while the features themselves record whether each word appears in a tweet.

In [None]:
# Get the separate words in tweets
# Input:  A list of tweets
# Output: A list of all words in the tweets
def get_words_in_tweets(tweets):
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words

# Count word frequencies and collect the distinct words
# Input: the list of words
# Output: the distinct words appearing in the tweets (the frequencies are printed)
def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    print ("Word frequency list\n")
    pprint(wordlist)
    return word_features




To create a classifier, we need to decide what features are relevant. To do that, we first need a feature extractor. The one we are going to use returns a dictionary indicating what words are contained in the input passed. Here, the input is the tweet. We use the word features list defined above along with the input to create the dictionary.

With our feature extractor, we can apply the features to our classifier using the method apply_features. We pass the feature extractor along with the tweets list defined above.

In [None]:
word_features = get_word_features(get_words_in_tweets(tweets))

# Construct our features based on which tweets contain which word
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features


As you can see, **‘this’** is the most used word in our tweets, followed by **‘car’**, followed by **‘concert’**…

The variable ‘training_set’ contains the labeled feature sets. It is a list of tuples, with each tuple containing the feature dictionary and the sentiment string for one tweet. The sentiment string is also called the ‘label’.

In [None]:
# Here we apply the features we constructed to our tweets data.
training_set = nltk.classify.apply_features(extract_features, tweets)

# Printing the resulting training set shows the features we are going to pass to the classifier.
pprint(training_set)

Now that we have our training set, we can train our classifier like in the previous example.

In [None]:
# This is the line of code that we use to train our classifier. Training is performed in a streamlined way so no output is visible.
classifier = nltk.NaiveBayesClassifier.train(training_set)

The Naive Bayes classifier uses the prior probability of each label, which is the frequency of each label in the training set, and the contribution from each feature. In our case, the frequency of each label is the same for ‘positive’ and ‘negative’. The word ‘amazing’ appears in 1 of 5 of the positive tweets and none of the negative tweets. This means that the likelihood of the ‘positive’ label will be multiplied by roughly 0.2 when this word is seen as part of the input (the exact factor depends on the smoothing the classifier applies).

So in our dataset the probability of each label is 0.5 as we can see below.

### <span style='color:red'>**Interesting observation**</span>

If we observe the output of the function below, something interesting jumps out. Line one of the output has this content:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;contains(not) = False          positi : negati =      1.6 : 1.0

This line tells us that if a tweet doesn't contain the word not **(contains(not) = False)**, then it is 60% more likely to be positive than negative, as shown by **(positi : negati = 1.6 : 1.0)**.

In [None]:
# show_most_informative_features prints its output itself, so no print() is needed
classifier.show_most_informative_features(32)
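
The 1.6 : 1.0 ratio can be reproduced by hand. The sketch below assumes that the classifier estimates feature probabilities with add-0.5 smoothing (the expected likelihood estimate, which to our knowledge is the default estimator of `NaiveBayesClassifier.train`); the helper name `smoothed` is made up for this example.

```python
def smoothed(count, total, bins=2, gamma=0.5):
    """Add-gamma estimate of P(feature value | label).

    bins=2 because a contains(...) feature is either True or False.
    """
    return (count + gamma) / (total + gamma * bins)

# 'not' is absent from 5 of the 5 positive and from 3 of the 5 negative training tweets
p_false_given_pos = smoothed(5, 5)
p_false_given_neg = smoothed(3, 5)

ratio = p_false_given_pos / p_false_given_neg
print(round(ratio, 1))  # 1.6
```

Without smoothing the ratio would be 1.0 / 0.6, which is about 1.7, so the smoothing assumption matters for matching the displayed value.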

Now that we have seen the feature distribution for our data, we can see how our classifier behaves in a real scenario where we apply it to a tweet it has not seen before and that is not part of the train or test set.

If our tweet is:
**Larry is my friend**

We would expect that the attributed sentiment would be: **positive**

In [None]:
# The tweet we are about to classify
tweet = 'Larry is my friend'
print (classifier.classify(extract_features(tweet.split())))


On the other hand if our tweet is: **This dish is horrible**

We would expect that the attributed sentiment would be: **negative**

In [None]:
# The tweet we are about to classify
tweet = 'This dish is horrible'
print (classifier.classify(extract_features(tweet.split())))

However our simple classifier, trained on just 10 tweets is easy to fool. For example we have not encountered the word **horrendous**. So if our tweet would be:

**Ivo listens to horrendous electronic music.**

We would not know what to expect.

In [None]:
tweet = 'Ivo listens to horrendous electronic music'
print (classifier.classify(extract_features(tweet.split())))

In our case, the simple classifier made a mistake.
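
The reason is that the feature extractor only looks for words from the training vocabulary. A tweet made entirely of unseen words produces a feature dictionary in which every entry is False, so the classifier can only judge it by the absence of known words. The sketch below illustrates this with a hypothetical excerpt of the vocabulary rather than the full `word_features` list.

```python
# A hypothetical excerpt of the vocabulary learned from the training tweets
word_features = ['love', 'car', 'horrible', 'friend', 'not']

def extract_features(document):
    """Mark for every known word whether it occurs in the tokenized document."""
    document_words = set(document)
    return {'contains(%s)' % w: (w in document_words) for w in word_features}

features = extract_features('ivo listens to horrendous electronic music'.split())
print(any(features.values()))  # False: no known word occurs in the tweet
```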

## Topic Modeling

One technique for text mining in Data Science is Topic Modelling. As the name suggests, it is a process to automatically identify topics present in a text object and to derive hidden patterns exhibited by a text corpus, thus assisting better decision making.

Topic Modelling is different from rule-based text mining approaches that use regular expressions or dictionary-based keyword searching techniques. It is an unsupervised approach used for finding and observing groups of words (called “topics”) in large clusters of texts.

Topics can be defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model should result in “health”, “doctor”, “patient”, “hospital” for a topic “Healthcare”, and “farm”, “crops”, “wheat” for a topic “Farming”.

Topic Models are very useful for document clustering, organizing large blocks of textual data, information retrieval from unstructured text and feature selection. For example, the New York Times uses topic models to boost its user-article recommendation engines. Recruitment professionals use topic models to extract latent features of job descriptions and map them to the right candidates. Topic models are also used to organize large datasets of emails, customer reviews, and user social media profiles.

### What is Latent Dirichlet Allocation (LDA)?

There are many approaches for obtaining topics from a text, such as Term Frequency-Inverse Document Frequency (TF-IDF) and Non-negative Matrix Factorization techniques. Latent Dirichlet Allocation is the most popular topic modeling technique, and it is the one we will discuss in this notebook.

LDA assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.

LDA is a matrix factorization technique. In vector space, any corpus (collection of documents) can be represented as a document-term matrix. 
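
To make the document-term matrix concrete, the sketch below builds one by hand for two made-up, already tokenized documents. It only illustrates the representation and is not how gensim stores it internally.

```python
# Two made-up, already tokenized documents
docs = [["red", "car", "fast", "car"], ["fitness", "body", "fitness"]]

# One column per distinct term, in sorted order
vocabulary = sorted({term for doc in docs for term in doc})

# One row per document; each cell counts how often the term occurs in it
doc_term = [[doc.count(term) for term in vocabulary] for doc in docs]

print(vocabulary)  # ['body', 'car', 'fast', 'fitness', 'red']
for row in doc_term:
    print(row)     # [0, 2, 1, 0, 1] then [1, 0, 0, 2, 0]
```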



### LDA Parameters

Alpha and Beta Hyperparameters – Alpha represents document-topic density and beta represents topic-word density. The higher the value of alpha, the more topics each document is composed of; the lower the value of alpha, the fewer topics each document contains. Similarly, with a higher beta, topics are composed of a large number of words in the corpus, and with a lower beta they are composed of few words.

Number of Topics – Number of topics to be extracted from the corpus. Researchers have developed approaches to obtain an optimal number of topics by using the Kullback-Leibler divergence score. We will not discuss this in detail, as it is too mathematical. For more background, one can refer to the original paper[1] on the use of KL divergence.

Number of Topic Terms – Number of terms that make up a single topic. It is generally decided according to the requirement. If the problem statement talks about extracting themes or concepts, it is recommended to choose a higher number; if the problem statement talks about extracting features or terms, a low number is recommended.

Number of Iterations / passes – Maximum number of iterations allowed for the LDA algorithm to converge.
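
The effect of alpha can be illustrated without any topic modelling library by sampling document-topic mixtures from a symmetric Dirichlet distribution. The sketch below is a simulation under stated assumptions (5 topics, 500 sampled documents, a fixed seed), and both function names are made up; it is not part of the LDA training itself.

```python
import random

def sample_topic_mixture(alpha, num_topics, rng):
    """Draw one document-topic distribution from a symmetric Dirichlet(alpha)."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, num_topics=5, samples=500, seed=0):
    """Average weight of the dominant topic over many sampled documents."""
    rng = random.Random(seed)
    return sum(max(sample_topic_mixture(alpha, num_topics, rng))
               for _ in range(samples)) / samples

print(mean_largest_share(0.1))   # low alpha: one topic tends to dominate a document
print(mean_largest_share(10.0))  # high alpha: weights are spread over more topics
```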


### Sample Topic Modeling Assignment

As step one of this assignment we will construct our own dataset containing 5 documents on different topics like below:


In [None]:
doc1 = "Working out is great for the body. Fitness makes you feel good."
doc2 = "Red cars are faster than blue cars."
doc3 = "Doctors suggest that fitness increases muscle mass and speeds up metabolism."
doc4 = "Cars with electrical engines cause less pollution than cars with internal combustion engines."
doc5 = "Pushups make a good upper body exercise."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

We already noted that cleaning is an important step before any text mining task. In this step, we will remove punctuation and stopwords and normalize the corpus, which makes it suitable for further analysis.

We introduced the word <i>stopwords</i>. Stopwords are words that are filtered out before any analysis of natural language in order to increase efficiency and remove clutter. Examples of stop words are:
+ and
+ there
+ want
+ thus
+ if 
+ etc...

Another new thing we will encounter in the code below is a lemmatizer. A lemmatizer performs lemmatization, which essentially reduces a word to its base form and removes any inflectional endings. A more formal definition is provided below:

<span style='color:red'>Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. </span>

**Note**: If unclear about the implementation of the methods please consult the NLTK documentation.

In [None]:
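from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

# The NLTK data packages 'stopwords' and 'wordnet' must be available; if they are
# missing, fetch them once with nltk.download('stopwords') and nltk.download('wordnet').

# Create a set of English stopwords
stop = set(stopwords.words('english'))

# Create a set of punctuation characters
exclude = set(string.punctuation)

# This object performs the lemmatization
lemma = WordNetLemmatizer()

# This function performs the entire cleaning:
# stopword removal, punctuation removal and lemmatization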
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

# This is the clean corpus.
doc_clean = [clean(doc).split() for doc in doc_complete] 

### Preparing Document-Term Matrix

All the text documents combined are known as the corpus. To run a mathematical model on a text corpus, it is good practice to convert it into a matrix representation. The LDA model looks for repeating term patterns in the entire document-term (DT) matrix. Python offers many libraries for text mining; "gensim" is one that is clean, scalable, robust and efficient.

The following code shows how to convert a corpus into a document-term matrix.

In [None]:
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
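
To see what the dictionary and `doc2bow` produce, the sketch below rebuilds the same kind of structure in plain Python: each document becomes a sparse list of `(token_id, count)` pairs. The names `toy_docs`, `token2id`, `to_bow` and `toy_matrix` are made up for this illustration, and the ids gensim assigns may be in a different order.

```python
from collections import Counter

# Two already-cleaned, tokenized documents (shortened for illustration).
toy_docs = [
    ["red", "car", "faster", "blue", "car"],
    ["car", "electrical", "engine"],
]

# Assign every unique term an integer id, in order of first appearance.
# (gensim assigns ids in its own order; only the id -> term mapping matters.)
token2id = {}
for doc in toy_docs:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)

def to_bow(doc):
    # Count each term and return sorted (token_id, count) pairs.
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_matrix = [to_bow(doc) for doc in toy_docs]
print(token2id)
print(toy_matrix)
```

Terms that do not occur in a document are simply left out of its list, which keeps the matrix small even for a large vocabulary.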

### Run the LDA model

The next step is to create an LDA model object and train it on the document-term matrix. Training also requires a few parameters as input, which are explained in the section above. The gensim module supports both LDA model estimation from a training corpus and inference of the topic distribution on new, unseen documents.


In the code below we run the LDA model with two topics, using the words defined in our dictionary, for 100 passes over the corpus. Feel free to change the number of passes and see the outcome.

In [None]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=2, id2word = dictionary, passes=100)

In the output of the cell below, each line is a topic with its individual topic terms and weights. One topic can be termed Vehicles and Engines, and the other Benefits of Fitness; which of the two gets which index can differ between runs.

In [None]:
# Print 2 topics and describe them with 4 words each.
topics = ldamodel.print_topics(num_topics=2, num_words=4)

for i, topic in enumerate(topics):
    print("Topic {} -> {}".format(i, topic))
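
If you want to work with the terms and weights rather than just read them, the printed topic strings can be split into pairs. The sketch below assumes a topic string of the form `weight*"term" + weight*"term"`; the string `sample_topic` and its weights are made up for illustration, and `parse_topic` is a hypothetical helper, not part of gensim.

```python
import re

# Made-up example in the style of a gensim topic string; real weights will differ.
sample_topic = '0.120*"car" + 0.080*"engine" + 0.050*"blue" + 0.050*"red"'

# Match weight*term pairs; the quotes around the term are treated as optional
# in case the installed gensim version prints the terms unquoted.
pair_pattern = re.compile(r'([0-9.]+)\*"?([^"+\s]+)"?')

def parse_topic(topic_string):
    # Return a list of (term, weight) tuples in the order they appear.
    return [(term, float(weight)) for weight, term in pair_pattern.findall(topic_string)]

parsed = parse_topic(sample_topic)
print(parsed)
```

A list of `(term, weight)` tuples like this is easy to load into a Pandas DataFrame or to plot as a bar chart.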


We notice that our LDA model did well at recovering the two topics our documents cover.

## Assignment
### This assignment should result in a report following the report guidelines found on Blackboard.
### Due date: <b style='color: red'>  NOT YET SET </b>

So far we have covered the following sections:

+ Basic Python development
+ Pandas data management
+ Gathering data (scraping the Twitter Streaming API)
+ Storing and organizing it (store to file or a database)
+ Preprocessing the data
+ Performing sentiment and topical analysis
+ Visualizing insight

Given the newly acquired skills, your assignment is to perform an analysis on a dataset of tweets already provided in the course. The analysis should contain topic modelling and sentiment analysis. Use different splits of the data to perform your analysis, then compare, correlate and visualize the results.

The dataset can be found at this link in the <i>tweets</i> folder:

[Dataset + SentiStrength](https://surfdrive.surf.nl/files/index.php/s/OohMHoiTurxOa8I)

All of the tweets carry a geolocation, so it would be natural to show the geographical distribution of your analysis on the map you designed in Week 1 of the course. When visualizing the results, keep in mind that you can change the size, color, location or even boundaries of the map. You can also hide and show regions, depending on the point you are trying to make.

Make sure to correlate the findings with:
+ Demographics
+ Education
+ Income
+ Health care
+ Religion

This information can be obtained from the sources mentioned in the lectures.
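
One simple way to correlate two region-level quantities is the Pearson correlation coefficient. The sketch below uses placeholder numbers that are NOT real measurements; `avg_sentiment`, `indicator` and `pearson` are hypothetical names chosen for this illustration, and the values are deliberately perfectly linear.

```python
from math import sqrt

# Placeholder values for illustration only, NOT real measurements:
# one average sentiment score and one indicator value per region.
avg_sentiment = [0.10, 0.25, 0.40, 0.55]
indicator = [1.0, 2.0, 3.0, 4.0]

def pearson(xs, ys):
    # Pearson correlation coefficient of two equally long sequences.
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

r = pearson(avg_sentiment, indicator)
print("Pearson r = {:.3f}".format(r))
```

If your data is already in Pandas, `Series.corr` computes the same coefficient by default. Keep in mind that a high correlation does not by itself show that one quantity causes the other.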

As an additional resource to help with the sentiment analysis, there is a Java-based utility named SentiStrength in the same folder. You can use it to confirm your results or as an additional case study.