# Analysis of News Headlines: Topic Modelling with LSA
This project explores topic modelling with Latent Semantic Analysis (LSA). The technique is applied to the 'A Million News Headlines' dataset, a corpus of over one million news article headlines published by the ABC.

In [1]:
## Import required packages and modules

In [2]:
## Import Dataset

## Exploratory Data Analysis
As usual, it is prudent to begin with some basic exploratory analysis.

In [3]:
## Check the count of NaN Values

In [4]:
## If there are only a few NaN values, drop them; otherwise replace them with suitable values

In [5]:
## Check new data values available

First develop a list of the top words used across all one million headlines, giving us a glimpse into the core vocabulary of the source data. Stop words should be omitted here to avoid any trivial conjunctions, prepositions, etc.

In [6]:
## Download stopwords in case you haven't done so before
## You can use nltk.download('stopwords')

In [7]:
## Define certain new stopwords that may have no significance in determining top news headlines

In [8]:
## Define helper functions to get top n words
## Defined function must return a tuple of the top n words in a sample and their 
## accompanying counts, given a CountVectorizer object and text sample

In [9]:
## Plot the top 25 words in the headlines dataset and their number of occurrences
## Pass the newly created set of stopwords to the CountVectorizer
## Initially try to work on a batch of data instead of the entire dataset (say, 200000 examples)


Next, you can generate a histogram of headline lengths (in words) and use part-of-speech tagging to understand the types of words used across the corpus. This requires first converting each headline string to a TextBlob and accessing its ```pos_tags``` property, yielding a list of tagged words for each headline.

In [10]:
## You can download punkt and averaged perceptron tagger for NLTK if required using
## nltk.download('punkt')
## nltk.download('averaged_perceptron_tagger')

In [14]:
## Identify Tagged Headlines

In [11]:
## For further analysis, one can try finding the average headline word length
## and part-of-speech tagging for the headline corpus

In [12]:
## By plotting the number of headlines published per day, per month and per year,
## one can also get a sense of the sample density.

## Topic Modelling
You can now apply a clustering algorithm to the headlines corpus in order to study the topic focus of ABC News, as well as how it has evolved through time. To do so, first experiment with a small subsample of the dataset, then scale up to a larger portion of the available data.

### Preprocessing
The only preprocessing step required in our case is feature construction, where we take the sample of text headlines and represent them in some tractable feature space. In practice, this simply means converting each string to a numerical vector. This can be done using the ```CountVectorizer``` object from scikit-learn, which yields an $n \times K$ document-term matrix, where $K$ is the number of distinct words across the $n$ headlines in our sample (excluding stop words and capped at ```max_features```).
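
As a rough illustration of this step, the sketch below builds a document-term matrix from a handful of made-up headlines. The sample text and the `max_features` value are assumptions chosen for the example, not values taken from the notebook.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for a small sample of headlines (not real data).
small_text_sample = [
    "police probe fatal crash",
    "council backs water plan",
    "rain helps drought hit farmers",
    "police seek crash witness",
]

# max_features caps the vocabulary size K; 40000 is an illustrative value.
small_count_vectorizer = CountVectorizer(stop_words="english", max_features=40000)
small_document_term_matrix = small_count_vectorizer.fit_transform(small_text_sample)

# Rows are headlines (n), columns are distinct retained words (K).
n, K = small_document_term_matrix.shape
print(n, K, issparse(small_document_term_matrix))
```

The result is a SciPy sparse matrix, which matters at full scale: a dense array of one million rows by tens of thousands of columns would be far too large to hold in memory.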

Thus you have your (very high-rank and sparse) training data, ```small_document_term_matrix```, and can now implement a clustering algorithm. Your choice, Latent Semantic Analysis, takes the document-term matrix as input and yields an $n \times N$ topic matrix as output, where $N$ is the number of topic categories (which we supply as a parameter). For the moment, we shall take this to be 15.

In [14]:
## To find top 15 topics we set
## n_topics = 15

### Latent Semantic Analysis
Let's start by experimenting with LSA. This is effectively just a truncated singular value decomposition of a (very high-rank and sparse) document-term matrix, with only the $r=$```n_topics``` largest singular values preserved.

In [15]:
## Define LSA Model
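
A minimal sketch of an LSA model, assuming scikit-learn's `TruncatedSVD` is used for the decomposition. The toy headlines and the small `n_topics` value are illustrative only; the notebook itself works with 15 topics on a much larger sample.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the headline sample (not real data).
headlines = [
    "police probe fatal crash",
    "police seek crash witness",
    "council backs water plan",
    "council rejects water plan",
    "rain helps drought hit farmers",
    "farmers welcome drought aid",
    "court hears fraud case",
    "court jails man over fraud",
]
document_term_matrix = CountVectorizer().fit_transform(headlines)

# n_topics must be smaller than the vocabulary size; 3 suits this toy corpus.
n_topics = 3
lsa_model = TruncatedSVD(n_components=n_topics, random_state=0)

# fit_transform returns the n x n_topics topic matrix (U_r scaled by Sigma_r).
lsa_topic_matrix = lsa_model.fit_transform(document_term_matrix)
print(lsa_topic_matrix.shape)
```

Each row of `lsa_topic_matrix` holds one headline's weights across the retained components, and `lsa_model.components_` holds the corresponding word loadings.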

Taking the $\arg \max$ of each row (one row per headline) in this topic matrix will give the predicted topic of each headline in the sample. We can then sort these into counts of each topic.

In [16]:
## Define helper functions to get keys that returns an integer list of predicted topic 
## categories for a given topic matrix
## and KeysToCount that returns a tuple of topic categories and their 
## accompanying magnitudes for a given list of keys
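
The two helpers could be sketched as follows. The names `get_keys` and `keys_to_counts` follow the cell description above, while the exact return types are an assumption.

```python
from collections import Counter

import numpy as np


def get_keys(topic_matrix):
    """Return the index of the highest-weighted topic for each row."""
    return np.asarray(topic_matrix).argmax(axis=1).tolist()


def keys_to_counts(keys):
    """Return (categories, counts), sorted by topic index."""
    count_pairs = sorted(Counter(keys).items())
    categories = [topic for topic, _ in count_pairs]
    counts = [count for _, count in count_pairs]
    return categories, counts


# Tiny hand-made topic matrix: three headlines, two topics.
toy_topic_matrix = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
toy_keys = get_keys(toy_topic_matrix)
toy_categories, toy_counts = keys_to_counts(toy_keys)
print(toy_keys, toy_categories, toy_counts)
```

Topics that receive no headlines are simply absent from `categories`, which is worth keeping in mind when labelling a bar chart.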

However, these topic categories are in and of themselves a little meaningless. In order to better characterise them, it will be helpful to find the most frequent words in each.

In [17]:
## Define helper function get_top_n_words that returns a list of n_topic strings, 
## where each string contains the n most common words in a predicted category, in order

Thus we have converted our initial small sample of headlines into a list of predicted topic categories, where each category is characterised by its most frequent words. The relative magnitudes of each of these categories can then be easily visualised through use of a bar chart.

In [18]:
## Visualise each topic vs number of headlines. These will be the most discussed topics.
## In case you want to do further analysis, you can try dimensionality reduction and
## compare its results to other techniques like LDA; that is left as an optional assignment for you

However, this does not provide a great point of conclusion. You can instead use a dimensionality-reduction technique called $t$-SNE, which will also serve to better illuminate the success of the clustering process.
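
A minimal sketch of this reduction, assuming scikit-learn's `TSNE` is used. The random matrix below is only a stand-in for a real topic matrix, and the `perplexity` value is an illustrative choice that must stay below the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-in for an n x n_topics topic matrix (30 headlines, 5 topics).
rng = np.random.default_rng(0)
toy_topic_matrix = rng.normal(size=(30, 5))

# Project each n_topics-dimensional row down to two dimensions.
tsne_model = TSNE(n_components=2, perplexity=5, random_state=0)
tsne_vectors = tsne_model.fit_transform(toy_topic_matrix)
print(tsne_vectors.shape)
```

On the full sample this step is considerably slower than the SVD itself, so it is reasonable to run it on a subsample first.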

Now that you have reduced these ```n_topics```-dimensional vectors to two-dimensional representations, you can then plot the clusters using Bokeh. Before doing so however, it will be useful to derive the centroid location of each topic, so as to better contextualise our visualisation.

In [28]:
# Define a helper function that returns a list of centroid vectors, one for each predicted topic category
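
The centroid helper might be sketched as below. The name `get_mean_topic_vectors` and the NaN fallback for topics without any headlines are assumptions made for the sketch.

```python
import numpy as np


def get_mean_topic_vectors(keys, two_dim_vectors, n_topics):
    """Return one centroid (mean 2-D vector) per predicted topic category."""
    keys = np.asarray(keys)
    two_dim_vectors = np.asarray(two_dim_vectors, dtype=float)
    centroids = []
    for topic in range(n_topics):
        members = two_dim_vectors[keys == topic]
        if len(members) == 0:
            # No headline fell into this topic, so there is no centroid.
            centroids.append(np.full(two_dim_vectors.shape[1], np.nan))
        else:
            centroids.append(members.mean(axis=0))
    return centroids


# Three points: two in topic 0, one in topic 1.
toy_centroids = get_mean_topic_vectors([0, 0, 1], [[0, 0], [2, 2], [5, 1]], 2)
print(toy_centroids)
```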

All that remains is to plot the clustered headlines. Also included are the top three words in each cluster, which are placed at the centroid for that topic.