### NMF notebook
Show how NMF can determine movie topics based on user ratings.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
movie_df = pd.read_csv('den20_movie_ratings.csv')

In [None]:
movie_df.head()

In [None]:
pd.set_option("display.max_columns",100)

In [None]:
movie_df

In [None]:
movie_df.columns

In [None]:
# clean up data frame
movie_df.columns = [col.lower().replace(' ', '_') for col in movie_df.columns]
movie_df.fillna(0, inplace = True)
movie_df.set_index('name', inplace = True)
movie_df.replace({" " : 0}, inplace = True) # Who put in a space!
movie_df = movie_df.apply(pd.to_numeric)

In [None]:
# sanity check (print explicitly: only a cell's last expression is displayed)
print(movie_df.head())
movie_df.info()


## NMF for topic analysis: motivation

You've seen with PCA and SVD that you can decompose a matrix (in this running example, users' ratings of movies) into latent topics that relate groups of movies (or words, books, or whatever the features of your matrix are).


In [None]:
from numpy.linalg import svd

mat = movie_df.values
movies = movie_df.columns
names = movie_df.index

In [None]:
# Compute SVD
U, sigma, VT = svd(mat)

# do 3 topics...for now 
k = 3
topics = ['latent_topic_{}'.format(i) for i in range(k)]

# Keep top k concepts for comparison
U = U[:,:k]
sigma = sigma[:k]
VT = VT[:k,:]

In [None]:
# Make pretty
U, sigma, VT = (np.around(x,2) for x in (U,sigma,VT))
U = pd.DataFrame(U, index = names, columns = topics)
VT = pd.DataFrame(VT, index = topics, columns = movies)

In [None]:
print('\nMatrix U: people-topic')
print(U)
print('\nMatrix S: singular values')
print(sigma)
print('\nMatrix V^T: topic-movies')
print(VT)

## Problems with SVD for topic analysis

**Recall:** $M = U S V^T$

Values in $U$ and $V^T$ can be negative, which is weird and hard to interpret. For example, suppose a latent feature is the genre 'Sci-fi'. This feature can be positive (makes sense), zero (makes sense), or negative (what does that mean?).
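To make the sign problem concrete, here is a minimal sketch on a made-up ratings matrix. The matrix `toy` and the `_toy` variable names are assumptions for illustration and are not part of the movie data. Every entry of the input is positive, yet the SVD factors still contain negative entries.

```python
import numpy as np

# Hypothetical 4 users x 4 movies ratings matrix (all entries positive)
toy = np.array([[5., 4., 1., 1.],
                [4., 5., 1., 2.],
                [1., 1., 5., 4.],
                [2., 1., 4., 5.]])

U_toy, s_toy, VT_toy = np.linalg.svd(toy)

# The first two left singular vectors (columns of U)
print(np.round(U_toy[:, :2], 2))

# Count negative entries in each factor
print("negative entries in U:  ", int((U_toy < 0).sum()))
print("negative entries in V^T:", int((VT_toy < 0).sum()))
```

The second singular vector has to be orthogonal to the first, so it must mix positive and negative entries. That is why the negative weights cannot be avoided by picking a different sign convention.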

#### Let's try using NMF instead....

In [None]:
from sklearn.decomposition import NMF

k = 3 # the number of topics

nmf = NMF(n_components = k)
nmf.fit(mat) # mat = the matrix made by movie_df.values

W = nmf.transform(mat) # the n by k matrix
H = nmf.components_ # the k by m matrix

# Make the matrices pretty DataFrames
W = pd.DataFrame(W, index = names, columns = topics)
H = pd.DataFrame(H, index = topics, columns = movies)

# Round the decimals
W,H = (np.around(x,2) for x in (W, H))

# this shows the components 
print(W.head(30), '\n\n', H.head(k))

#### Check Reconstruction Error:

In [None]:
# stop truncation
np.set_printoptions(threshold=np.inf, linewidth=np.inf)
# prevent exponential notation
np.set_printoptions(suppress=True)

# original matrix
print("\nOriginal matrix")
print(mat)

# # svd reconstruction
# print("\nSVD reconstruction")
# print('\n', np.around(np.dot(U, np.diag(sigma)).dot(VT), 2))

# nmf reconstruction
print("\nNMF reconstruction")
print('\n', np.around(W.dot(H), 2))
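Comparing two printed matrices by eye only goes so far. The error can also be summarized as one number, the Frobenius norm of $M - WH$. The sketch below is self-contained and rests on assumptions: the matrix `toy`, the names `W_toy`, `H_toy`, and `k_toy`, and the solver are all illustrative. It uses hand-rolled multiplicative updates (Lee and Seung), which is not scikit-learn's default solver (coordinate descent), so the factors will not match what `NMF` returns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4 users x 4 movies ratings matrix
toy = np.array([[5., 4., 1., 1.],
                [4., 5., 1., 2.],
                [1., 1., 5., 4.],
                [2., 1., 4., 5.]])

k_toy = 2                                   # number of latent topics
W_toy = rng.random((4, k_toy)) + 0.1        # users x topics, strictly positive start
H_toy = rng.random((k_toy, 4)) + 0.1        # topics x movies, strictly positive start
eps = 1e-10                                 # guards against division by zero

errs = []
for _ in range(500):
    # Multiplicative updates keep every entry non-negative
    H_toy *= (W_toy.T @ toy) / (W_toy.T @ W_toy @ H_toy + eps)
    W_toy *= (toy @ H_toy.T) / (W_toy @ H_toy @ H_toy.T + eps)
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm
    errs.append(np.linalg.norm(toy - W_toy @ H_toy))

err = errs[-1]
rel_err = err / np.linalg.norm(toy)
print("Frobenius error: {:.3f} (relative: {:.3f})".format(err, rel_err))
```

With the default Frobenius loss, this norm should be the same quantity that a fitted scikit-learn model exposes as `reconstruction_err_`.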

## Interpreting Concepts
#### Think of NMF like 'fuzzy clustering' or 'soft clustering'
- The concepts are clusters
- Each row (document, user, etc...) can belong to more than one concept
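One way to read the "soft clustering" idea is to turn each row of the user-topic matrix into membership shares that sum to 1. The sketch below assumes a made-up weight matrix `W_toy`; the values are hypothetical, and row normalization is just one reasonable convention, not something NMF does for you.

```python
import numpy as np

# Hypothetical user-topic weights (rows: users, columns: topics)
W_toy = np.array([[2.0, 0.0, 0.5],
                  [0.3, 1.2, 0.0],
                  [0.0, 0.0, 0.0],   # a user with no weight on any topic
                  [1.0, 1.0, 1.0]])

row_sums = W_toy.sum(axis=1, keepdims=True)

# Divide each row by its sum; rows that sum to zero stay all-zero
membership = np.divide(W_toy, row_sums,
                       out=np.zeros_like(W_toy),
                       where=row_sums > 0)

print(np.round(membership, 2))
```

The first user ends up 80% in topic 0 and 20% in topic 2, while the last user is split evenly across all three topics. A hard clustering would force each user into exactly one topic.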

#### Top Questions:
1. What do the concepts (clusters) mean?
2. To which concept(s) does each user/document belong?

### What are the topics?

In [None]:
# Top 10 movies in topic 0
tpic = 0
num_movies = 10
top_movies = H.iloc[tpic].sort_values(ascending=False).index[:num_movies]
top_movies

In [None]:
# Top 10 movies in topic 1
tpic = 1
num_movies = 10
top_movies = H.iloc[tpic].sort_values(ascending=False).index[:num_movies]
top_movies

In [None]:
# Top 10 movies in topic 2
tpic = 2
num_movies = 10
top_movies = H.iloc[tpic].sort_values(ascending=False).index[:num_movies]
top_movies

### Which users align with concept 0?

In [None]:
# Top 5 users for topic 0
tpic = 0
top_users = W.iloc[:,tpic].sort_values(ascending=False).index[:5]
top_users

In [None]:
# Top 5 users for topic 1
tpic = 1
top_users = W.iloc[:,tpic].sort_values(ascending=False).index[:5]
top_users

In [None]:
# Top 5 users for topic 2
tpic = 2
top_users = W.iloc[:,tpic].sort_values(ascending=False).index[:5]
top_users

### Which concepts do I align with?

In [None]:
# feel free to fill in your name and check it out for yourself 
W.loc['Jess Curley']

In [None]:
# these are the movies associated with the latent topic I align most with  
H.loc['latent_topic_1'].sort_values(ascending=False).head()

### What are all the movies in each topic?

In [None]:
# Movies in each concept
thresh = .6  # movie is included if at least 60% of max weight
for g in range(k):
    all_movies = H.iloc[g,:]
    included = H.columns[all_movies >= (thresh * all_movies.max())]
    print("\nTopic %i contains: %s" % (g, ', '.join(included)))

### Which users are associated with each topic?

In [None]:
# Users in each concept
thresh = .3  # user is included if at least 30% of max weight
for g in range(k):
    all_users = W.iloc[:,g]
    included = W.index[all_users >= (thresh * all_users.max())]
    print("\nTopic {} contains: {}".format(g, ', '.join(included)))

## Choosing number of (latent) topics by looking at reconstruction error

In [None]:
# Fit NMF for a range of k and record the reconstruction error
from sklearn.decomposition import NMF

def fit_nmf(k):
    """Fit NMF with k topics on mat and return its reconstruction error."""
    nmf = NMF(n_components=k)
    nmf.fit(mat)
    return nmf.reconstruction_err_

error = [fit_nmf(i) for i in range(1,10)]
plt.plot(range(1,10), error)
plt.xlabel('k')
plt.ylabel('Reconstruction Error')


### Some other stuff you may find helpful with your assignment....

In [None]:
A = np.array([[1, 2], [-3, 4]])
b = np.array([7, -9])

print(np.linalg.solve(A, b))

### Least Squares Solver

What if we have an overdetermined system of linear equations? E.g.

$$ \begin{bmatrix} 1 & 2 \\ -3 & 4 \\ 1 & -4 \end{bmatrix} \left[ \begin{array}{c} x_1 \\ x_2 \end{array} \right] = \left[ \begin{array}{c} 7 \\ -9 \\ 17 \end{array} \right] $$

An exact solution is not guaranteed, so we must do something else. Least Squares dictates that we find the $x$ that minimizes the residual sum of squares (RSS).

(Note: This is the solver we use when doing Linear Regression!)
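Minimizing the RSS $\|Ax - b\|^2$ leads to the normal equations $A^T A x = A^T b$. As a hedged sketch, the snippet below solves them directly for the system above and compares the result with `np.linalg.lstsq`. The names `A_ls`, `b_ls`, `x_normal`, and `x_lstsq` are illustrative, and NumPy's solver does not necessarily compute its answer this way.

```python
import numpy as np

# The overdetermined system from the text: 3 equations, 2 unknowns
A_ls = np.array([[1., 2.], [-3., 4.], [1., -4.]])
b_ls = np.array([7., -9., 17.])

# Solve the normal equations  (A^T A) x = A^T b
x_normal = np.linalg.solve(A_ls.T @ A_ls, A_ls.T @ b_ls)

# Same problem via NumPy's least squares routine
x_lstsq = np.linalg.lstsq(A_ls, b_ls, rcond=None)[0]

# Residual sum of squares at the solution
rss = np.sum((A_ls @ x_normal - b_ls) ** 2)

print("normal equations:", x_normal)   # [ 2.88 -1.38]
print("lstsq:           ", x_lstsq)
print("RSS:             ", rss)
```

Forming $A^T A$ is fine for a small, well-conditioned example like this one, but it can lose accuracy on ill-conditioned problems, which is one reason to prefer a dedicated least squares routine in practice.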

In [None]:
A = np.array([[1, 2], [-3, 4], [1, -4]])
b = np.array([7, -9, 17])

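# rcond=None selects the current default cutoff and avoids a FutureWarning on older NumPy
print(np.linalg.lstsq(A, b, rcond=None)[0])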


In [None]:
# Clip negative entries to zero (returns a new array; A itself is unchanged)
A.clip(min=0)

### Non-negative Least Squares Solver

What if you want to constrain the solution to be non-negative? (Doing such a thing will be important to us today.)

We have optimizers for that too!

In [None]:
from scipy.optimize import nnls

A = np.array([[1, 2], [-3, 4], [1, -4]])
b = np.array([7, -9, 17])

print(nnls(A, b))
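It is tempting to think that non-negative least squares just clips the ordinary least squares answer at zero. The sketch below suggests otherwise on the same system. `nnls_bruteforce` is a hypothetical helper written for this illustration: it tries every subset of variables allowed to be non-zero, which is only practical for tiny problems with full column rank, and it is not meant to mirror how `scipy.optimize.nnls` works internally.

```python
import numpy as np
from itertools import combinations

A_nn = np.array([[1., 2.], [-3., 4.], [1., -4.]])
b_nn = np.array([7., -9., 17.])

def nnls_bruteforce(A, b):
    """Hypothetical helper: best least squares fit over all non-negative supports."""
    n = A.shape[1]
    best_x = np.zeros(n)
    best_rss = np.sum(b ** 2)            # baseline: every variable fixed at zero
    for size in range(1, n + 1):
        for free in combinations(range(n), size):
            cols = list(free)
            # Unconstrained fit using only the "free" columns
            sol = np.linalg.lstsq(A[:, cols], b, rcond=None)[0]
            if np.any(sol < 0):
                continue                 # violates the non-negativity constraint
            x = np.zeros(n)
            x[cols] = sol
            rss = np.sum((A @ x - b) ** 2)
            if rss < best_rss:
                best_x, best_rss = x, rss
    return best_x, np.sqrt(best_rss)

x_nn, rnorm = nnls_bruteforce(A_nn, b_nn)

# Naive alternative: solve without constraints, then clip negatives to zero
x_clip = np.linalg.lstsq(A_nn, b_nn, rcond=None)[0].clip(min=0)
rnorm_clip = np.linalg.norm(A_nn @ x_clip - b_nn)

print("non-negative solution:", x_nn, "residual norm:", rnorm)
print("clipped solution:     ", x_clip, "residual norm:", rnorm_clip)
```

The clipped solution keeps the value of $x_1$ that was tuned alongside a negative $x_2$, so its residual is larger than the residual of the true non-negative optimum.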

### Cosine Similarity
- [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html)
- On L2-normalized data, cosine similarity reduces to a plain dot product, so it gives the same result as calling `linear_kernel`
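A quick NumPy check of that equivalence, as a sketch: the matrix `X` is made up, and the dot product `X_unit @ X_unit.T` stands in for what a linear kernel computes.

```python
import numpy as np

# Hypothetical data: 3 rows (e.g. users), 3 features
X = np.array([[3., 4., 0.],
              [1., 0., 1.],
              [0., 2., 2.]])

# L2 norm of each row, kept as a column for broadcasting
norms = np.linalg.norm(X, axis=1, keepdims=True)

# Cosine similarity from its definition: x . y / (|x| |y|)
cos_direct = (X @ X.T) / (norms * norms.T)

# Normalize the rows first, then take plain dot products
X_unit = X / norms
cos_from_unit = X_unit @ X_unit.T

print(np.round(cos_direct, 3))
print(np.allclose(cos_direct, cos_from_unit))
```

If the data is already normalized, the plain dot product skips the redundant norm computation.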