## REI602M Machine Learning - Homework 8 (**UNDER CONSTRUCTION!**)
### Due: *Monday* 11.3.2019

**Objectives**: Topic discovery with NMF, Image compression with PCA and NMF, Spectral clustering

**Name**: Emil Gauti Friðriksson, **email:** egf3@hi.is, **collaborators:** (if any)

1\. [*Topic discovery with NMF*, 40 points]. Here you will use non-negative matrix factorization (NMF) to analyze the content of tweets from Donald Trump. In particular, you will attempt to discover the main topics of his tweets by applying NMF to a document-term matrix derived from the tweets (or rather to a "tweet-term" matrix).

The NMF approximates a non-negative $n \times p$ matrix $X$ of rank $r$ with a rank $k \leq r$ matrix such that

$$
X \approx WH
$$

where $W$ is an $n \times k$ matrix with $W_{ij} \geq 0$ and $H$ is a $k \times p$ matrix with $H_{ij} \geq 0$. Provided that $k$ is appropriately chosen, the *weight matrix* $W$ and *coefficient matrix* $H$ can reveal interesting structures in the data. Column $j$ of $X$ is approximated with (see comment 1 below)

$$
X_{:,j} \approx (WH)_{:,j} = H_{1j}W_{:,1} + H_{2j}W_{:,2} + \ldots + H_{kj}W_{:,k}
$$

where the subscript $:,j$ denotes column $j$. The columns of $W$ in this context correspond to the main topics of Trump's tweets and column $j$ of $H$ contains information on how the topics are "mixed" together to form (approximately) column $j$ of $X$.
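
The column interpretation above can be checked numerically. The following is a minimal sketch on a small random matrix (hypothetical toy data, not the tweet-term matrix); the choices `init="nndsvda"` and `max_iter=1000` are assumptions made only to get a deterministic, converged fit.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical toy data: a small non-negative 6 x 4 matrix.
rng = np.random.RandomState(0)
X = rng.rand(6, 4)

k = 2
model = NMF(n_components=k, init="nndsvda", max_iter=1000, random_state=0)
W = model.fit_transform(X)   # n x k weight matrix
H = model.components_        # k x p coefficient matrix

# Column j of WH written as a weighted sum of the columns of W,
# with the weights taken from column j of H.
j = 1
column_as_sum = sum(H[i, j] * W[:, i] for i in range(k))
column_from_product = (W @ H)[:, j]

print(W.shape, H.shape)
print(np.allclose(column_as_sum, column_from_product))
```

Both ways of computing the column agree because a matrix product is a collection of matrix-vector products, one per column of $H$.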

a) Download all tweets by Trump from http://www.trumptwitterarchive.com/archive from the period 20.1.2017 (inauguration day) to present, omitting retweets, as a CSV file (approx. 5800 tweets). Create a tweet-term matrix using word counts (see below). For a given value of $k$, perform NMF on the matrix and, for each topic $i=1,\ldots,k$, list the words corresponding to the largest $H_{ij}$ values in row $i$ of $H$. You need to experiment with different values of $k$ (a.k.a. the *Trump-dimension*) to get interesting topic groupings. If $k$ is too low, different topics will be mixed together; when $k$ gets large, the same subject will appear in multiple clusters. Report your results (ca. 20 words on each topic) for the value of $k$ that you end up picking.

b) Select two topics of "interest" (e.g. Trump's nemesis Hillary Clinton). Identify the corresponding columns in $W$ and list approx. 5 tweets using the largest $W$-values as indices. Does the content of the tweets match the selected topics?
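
One way to pick the tweets for a topic is to sort the corresponding column of $W$. The sketch below assumes a tiny hand-made `W` and a hypothetical list `tweets`; the helper name `top_documents` is not part of any library.

```python
import numpy as np

def top_documents(W, topic, n_top=5):
    """Return the row indices of the n_top largest entries in column `topic` of W."""
    order = np.argsort(W[:, topic])[::-1]   # indices sorted by decreasing weight
    return order[:n_top]

# Hypothetical stand-ins for the real tweets and the fitted W matrix.
tweets = ["tweet A", "tweet B", "tweet C", "tweet D"]
W = np.array([[0.9, 0.0],
              [0.1, 0.8],
              [0.5, 0.2],
              [0.0, 0.7]])

for topic in range(W.shape[1]):
    idx = top_documents(W, topic, n_top=2)
    print(f"Topic {topic}:", [tweets[i] for i in idx])
```

With the real data, `W` would be the output of `fit_transform` and `tweets` the list of raw tweet texts in the same row order as the tweet-term matrix.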

*Comments*:

1) The matrix-vector product $y=Ax$, where $A$ is an $n \times k$ matrix with columns $a_1,\ldots,a_k$, can be interpreted as a weighted sum of the columns of $A$,
$$
y=
\begin{array}{ccc}
~\mid &  & ~\mid \\
x_1 a_1 & + \ldots + & x_k a_k \\
~\mid & & ~\mid \\
\end{array}
$$
and matrix multiplication can be considered as multiple matrix-vector products.

2) Use the NMF implementation in `sklearn.decomposition.NMF`. You can use the Wikipedia data set from HW7 to test your NMF-based topic discovery code. Once you get convincing results, apply your code to the newly constructed tweet-term matrix.

3) Use `sklearn.feature_extraction.text.CountVectorizer` to create the document-term matrix based on word counts from the raw tweets. This class performs tokenization, counting and normalization and removes stop words. Use the following parameter values: `max_features=k`, `max_df=0.95` (remove words that occur in more than 95% of the documents), `min_df=2` (remove words that occur in fewer than two documents), `stop_words='english'`.

4) Use `CountVectorizer.get_feature_names()` (renamed `get_feature_names_out()` in scikit-learn 1.0 and later) to get the list of words that were retained. Sidenote: Rare words are downplayed by the term-frequency encoding used here, but they are often found to be informative. Therefore people often encode the text using term frequency-inverse document frequency (TF-IDF).

5) Scikit's NMF function obtains the factorization $X \approx WH$ by minimizing the objective function $0.5||X - WH||_F^2$ (here $||A||_F$ denotes the Frobenius norm of an $n \times p$ matrix $A$, $||A||_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^p A_{ij}^2}$). The NMF implementation provides means to regularize the solution via the parameters `alpha` and `l1_ratio`. You may want to experiment with these parameters to see if you can improve the list of topics.

6) The $H$ matrix is stored in `nmf.components_`

7) The NMF is described briefly in section 14.6 of ESL. A more detailed account can be found in the original article
http://www.columbia.edu/~jwp2128/Teaching/E4903/papers/nmf_nature.pdf
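
The comments above can be put together roughly as follows. This is a minimal sketch on a hypothetical six-document toy corpus, not the tweet data: `max_features` is left out because the vocabulary is tiny, `init="nndsvd"` and `max_iter=500` are assumptions chosen for a deterministic fit, and `top_words` is an illustrative helper rather than a scikit-learn function.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the tweets (two obvious themes).
docs = [
    "tax cuts jobs economy",
    "jobs economy growth tax",
    "border wall security border",
    "wall security immigration border",
    "economy jobs tax cuts growth",
    "immigration border security wall",
]

vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
X = vectorizer.fit_transform(docs)   # sparse (documents x terms) count matrix

# get_feature_names() was renamed get_feature_names_out() in scikit-learn 1.0.
if hasattr(vectorizer, "get_feature_names_out"):
    terms = list(vectorizer.get_feature_names_out())
else:
    terms = list(vectorizer.get_feature_names())

n_topics = 2
model = NMF(n_components=n_topics, init="nndsvd", max_iter=500, random_state=0)
W = model.fit_transform(X)   # documents x topics
H = model.components_        # topics x terms

def top_words(H, terms, n_top=3):
    """For each row of H (one topic), return the n_top terms with the largest values."""
    topics = []
    for row in H:
        best = np.argsort(row)[::-1][:n_top]
        topics.append([terms[j] for j in best])
    return topics

topics = top_words(H, terms)
for i, words in enumerate(topics):
    print(f"Topic {i}: {', '.join(words)}")

# Objective from comment 5: 0.5 * squared Frobenius norm of the residual.
objective = 0.5 * np.linalg.norm(X.toarray() - W @ H, "fro") ** 2
print("objective:", objective)
```

For the real assignment, `docs` would be the list of tweet texts and `n_top` would be around 20.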

In [123]:
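from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Read the CSV file as an array of strings (fields are split on commas)
data = np.genfromtxt('trump_tweets.CSV', encoding="utf-8", delimiter=',', dtype=str)

# k is used both as the vocabulary size (max_features) and as the number of topics
k = 20
vectorizer = CountVectorizer(max_features=k, max_df=0.95, min_df=2, stop_words='english')
X = vectorizer.fit_transform(data)  # tweet-term matrix of word counts

# Factorize X ~ WH with k components
model = NMF(n_components=k, init='random', random_state=0)
W = model.fit_transform(X)  # weight matrix, one row per tweet
H = model.components_       # coefficient matrix, one column per retained word

print(W)
print(H)
print(W.shape)
print(H.shape)
print(X.shape)
print(vectorizer.get_feature_names())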

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0.00000000e+00 0.00000000e+00 0.00000000e+00 1.95210374e-01
  0.00000000e+00 9.58956267e-12 3.86694029e-04 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  6.40598027e-01 0.00000000e+00 0.00000000e+00 0.00000000e+00
  3.54374702e-03 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 3.82407515e-01 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00

2\. [*Image compression with PCA and NMF*, 30 points]

a) Fit a non-negative matrix factorization model to the zero-digits in the subset of the MNIST database from the Jupyter workbook `v07_pca_tsne_kmeans` (download from Piazza). Perform the following using 25 basis elements in the factorization:

i) Display the $W$ matrix as an image (see Fig. 14.33 in ESL) as well as an image for the part of $H$ that corresponds to the first image in the data set.

ii)  Compare a reconstruction of the first image in the data set with the original image. What compression ratio is achieved with 25 basis elements?

b) Repeat the analysis in a) using a 24-component (plus mean) PCA model (see Fig. 14.33 in ESL). Briefly compare with the results in a).
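
For the NMF part, the reconstruction and compression ratio can be sketched as below. The MNIST subset from the workbook is not reproduced here, so this example assumes scikit-learn's built-in 8x8 `load_digits` data as a stand-in; the scaling by 16, `init="nndsvda"` and `max_iter=2000` are likewise assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF

# Stand-in data (assumption): 8x8 zero-digits, pixel values scaled to [0, 1].
digits = load_digits()
zeros = digits.data[digits.target == 0] / 16.0   # (n_images, 64)
X = zeros.T                                      # columns are images: (64, n_images)

k = 25
model = NMF(n_components=k, init="nndsvda", max_iter=2000, random_state=0)
W = model.fit_transform(X)   # (64, k): each column is a basis image
H = model.components_        # (k, n_images): column j mixes the basis images for image j

# Reconstruct the first image and clip it to the valid pixel range.
first_original = X[:, 0]
first_reconstructed = np.clip(W @ H[:, 0], 0.0, 1.0)
rel_error = np.linalg.norm(first_original - first_reconstructed) / np.linalg.norm(first_original)

# Compression ratio: numbers stored for X versus numbers stored for W and H.
compression_ratio = X.size / (W.size + H.size)

print("relative error of first image:", rel_error)
print("compression ratio:", compression_ratio)
```

A basis image can then be shown with, for example, `plt.imshow(W[:, 0].reshape(8, 8), cmap="gray_r")`, where the reversed gray-scale map draws zeros as white and positive values as dark.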

*Comments*:

1) Use the NMF implementation in `sklearn.decomposition.NMF`. The *columns* of the input matrix should contain the pixel values for each image (as opposed to how we treated image data earlier). The `fit_transform` function returns the $W$ matrix and the attribute `components_` contains the $H$ matrix.

2) When reconstructing images you may need to "clip" the data, i.e. set pixel values above 1.0 to 1.

3) Many elements of the $W$ matrix will be zero and when you use a gray-scale color map, these elements will show up as black. You might therefore want to represent positive values with black and zeros with white.

4) Use scikit to perform PCA.
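
A corresponding PCA sketch is given below, again assuming the 8x8 `load_digits` zero-digits as a stand-in for the workbook's MNIST subset. Note that scikit-learn's `PCA` expects one image per *row*, and that the 24 components plus the mean give 25 stored basis vectors, matching the 25 NMF basis elements.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption): 8x8 zero-digits, pixel values scaled to [0, 1].
digits = load_digits()
X = digits.data[digits.target == 0] / 16.0   # rows are images: (n_images, 64)

pca = PCA(n_components=24)
scores = pca.fit_transform(X)                # (n_images, 24) coordinates per image

# inverse_transform adds the mean back; clip to the valid pixel range.
X_hat = np.clip(pca.inverse_transform(scores), 0.0, 1.0)
rel_error = np.linalg.norm(X[0] - X_hat[0]) / np.linalg.norm(X[0])

# Numbers stored: 24 components, the mean image, and 24 scores per image.
stored = pca.components_.size + pca.mean_.size + scores.size
compression_ratio = X.size / stored

print("relative error of first image:", rel_error)
print("compression ratio:", compression_ratio)
```

Unlike the NMF basis images, the rows of `pca.components_` contain both positive and negative values, so a diverging color map may be easier to read when displaying them.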

3\. [*Spectral clustering*] Under construction!