In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Word frequency arrays

In some cases, an alternative implementation of PCA needs to be used. Word frequency arrays are a great example. 



Only some of the words from the vocabulary appear in any one document, so most entries of the word frequency array are zero. Arrays like this are said to be "sparse", and are often represented using a special type of array called a "csr_matrix". A csr_matrix saves space by remembering only the non-zero entries of the array.
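
To make the space saving concrete, here is a minimal sketch using `scipy.sparse.csr_matrix` on a small, made-up array (the values are purely illustrative and are not taken from any dataset used here). It shows that only the non-zero entries are stored, together with their column positions.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small, mostly-zero array standing in for a word-frequency array
# (hypothetical values, for illustration only)
dense = np.array([
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.7],
])

sparse = csr_matrix(dense)

# Only the non-zero entries are kept
print(sparse.nnz)      # number of stored entries
print(sparse.data)     # the stored values
print(sparse.indices)  # the column index of each stored value

# Converting back recovers the original dense array
print(np.array_equal(sparse.toarray(), dense))
```

For a real word-frequency array with thousands of columns, the stored entries are typically a small fraction of the full array.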

### tf-idf

In a word-frequency array, each row corresponds to a document, and each column corresponds to a word from a fixed vocabulary. 
The entries of the word-frequency array measure how often each word appears in each document. We'll measure the frequency of each word in each document using what is known as "tf-idf". 

"tf" is the frequency of the word in the document. For example: if 10% of the words in the document are "datacamp", then the tf of "datacamp" for that document is 0.1. 

"idf" is a weighting scheme that reduces the influence of frequent words like "the".

In [2]:
documents = ['cats say meow', 'dogs say woof', 'dogs chase cats']
documents

['cats say meow', 'dogs say woof', 'dogs chase cats']

#### A tf-idf word-frequency array

In this exercise, you'll create a tf-idf word-frequency array for a toy collection of documents: a list of short documents about pets. 

For this, use the TfidfVectorizer from sklearn. It transforms a list of documents into a word frequency array, which it outputs as a csr_matrix. It has fit() and transform() methods like other sklearn objects.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer() 

In [5]:
tfidf

TfidfVectorizer()

In [6]:
# Apply fit_transform to document: csr_mat
csr_mat = tfidf.fit_transform(documents)

In [7]:
csr_mat

<3x6 sparse matrix of type '<class 'numpy.float64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [8]:
# Print result of toarray() method
print(csr_mat.toarray())

[[0.51785612 0.         0.         0.68091856 0.51785612 0.        ]
 [0.         0.         0.51785612 0.         0.51785612 0.68091856]
 [0.51785612 0.68091856 0.51785612 0.         0.         0.        ]]
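
The values in this array can be reproduced by hand. The sketch below assumes the default settings of TfidfVectorizer as we understand them: raw word counts, a smoothed idf of the form ln((1 + n) / (1 + df)) + 1, and scaling of each row to unit length. The names `counts`, `idf` and `tfidf_by_hand` are ours, not part of scikit-learn.

```python
import numpy as np

documents = ['cats say meow', 'dogs say woof', 'dogs chase cats']
vocabulary = sorted({word for doc in documents for word in doc.split()})

# Raw word counts: one row per document, one column per word
counts = np.array(
    [[doc.split().count(word) for word in vocabulary] for doc in documents],
    dtype=float,
)

# Smoothed idf (assumed default): ln((1 + n_docs) / (1 + doc_freq)) + 1
n_docs = len(documents)
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by idf, then scale each row to unit (L2) length
weighted = counts * idf
tfidf_by_hand = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(vocabulary)
print(tfidf_by_hand.round(8))
```

Because each row is rescaled to unit length, it makes no difference here whether "tf" is a raw count or a fraction of the document's words: the two differ only by a constant factor within each row.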


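The columns of the array correspond to words. Calling the .get_feature_names() method of tfidf returns the list of words. (This method was deprecated in scikit-learn 1.0 and removed in 1.2; in newer versions, use .get_feature_names_out() instead, which returns a numpy array rather than a list.)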

In [9]:
# Get the words: words
words = tfidf.get_feature_names()

In [10]:
# Print words
print(words)

['cats', 'chase', 'dogs', 'meow', 'say', 'woof']


#### TruncatedSVD and csr_matrix

Scikit-learn's PCA doesn't support csr_matrices, so you'll need to use TruncatedSVD instead. TruncatedSVD performs a very similar transformation to PCA (the main difference is that it does not center the data first), and it accepts csr_matrices as input. Other than that, you interact with TruncatedSVD and PCA in exactly the same way.
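
As a minimal sketch of this, the snippet below fits TruncatedSVD directly on a csr_matrix. The data is random and purely illustrative, and the choice of 3 components is arbitrary.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Build a random, mostly-zero array (illustrative data only)
rng = np.random.default_rng(0)
dense = rng.random((20, 10))
dense[dense < 0.7] = 0.0
X = csr_matrix(dense)

# TruncatedSVD accepts the sparse matrix without converting it to dense
svd = TruncatedSVD(n_components=3, random_state=0)
features = svd.fit_transform(X)

print(features.shape)
```

The sparse input never has to be converted with toarray(), which matters when the full array would not fit comfortably in memory.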


### Clustering Wikipedia

TruncatedSVD is able to perform PCA on sparse arrays in csr_matrix format, such as word-frequency arrays. You are given an array articles of tf-idf word-frequencies of some popular Wikipedia articles, and a list titles of their titles. Build a pipeline of TruncatedSVD followed by KMeans and use it to cluster the Wikipedia articles.

In [11]:
# Use the first column (the article titles) as the index
articles = pd.read_csv("Wikipedia_Articles_Dataset.csv", index_col=0)

In [12]:
articles.shape 

(60, 13125)

In [13]:
articles

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13115,13116,13117,13118,13119,13120,13121,13122,13123,13124
HTTP 404,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Alexa Internet,0.0,0.0,0.029607,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Internet Explorer,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003772,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011594,0.0,0.0
HTTP cookie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Google Search,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006649,0.0
Tumblr,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Hypertext Transfer Protocol,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Social search,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Firefox,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.031222,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
LinkedIn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
titles = articles.index
titles.shape

(60,)

In [15]:
# Perform the necessary imports
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

In [16]:
# Create a TruncatedSVD instance: svd
svd = TruncatedSVD(n_components = 50)

In [17]:
# Create a KMeans instance: kmeans
kmeans = KMeans(n_clusters=6)

In [18]:
# Create a pipeline: pipeline
pipeline = make_pipeline(svd,kmeans)

In [19]:
# Fit the pipeline to articles
pipeline.fit(articles)

Pipeline(steps=[('truncatedsvd', TruncatedSVD(n_components=50)),
                ('kmeans', KMeans(n_clusters=6))])

In [20]:
# Calculate the cluster labels: labels
labels = pipeline.predict(articles)
labels

array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [21]:
labels.shape

(60,)

In [22]:
# Create a DataFrame aligning labels and titles: df
df = pd.DataFrame({'label': labels, 'article': titles})


In [23]:
# Display df sorted by cluster label
print(df.sort_values(by = "label"))

    label                                        article
44      0                                           Gout
41      0                                    Hepatitis B
42      0                                    Doxycycline
43      0                                       Leukemia
45      0                                    Hepatitis C
46      0                                     Prednisone
47      0                                          Fever
48      0                                     Gabapentin
40      0                                    Tonsillitis
49      0                                       Lymphoma
37      1                                       Football
36      1              2014 FIFA World Cup qualification
35      1                Colombia national football team
34      1                             Zlatan Ibrahimović
30      1                  France national football team
33      1                                 Radamel Falcao
32      1                      

#### Non-negative matrix factorization

NMF stands for "non-negative matrix factorization". NMF, like PCA, is a dimension reduction technique. In contrast to PCA, however, NMF models are interpretable. This means NMF models are easier to understand yourself, and much easier for you to explain to others. 

NMF cannot be applied to every dataset, however. It requires that the sample features be "non-negative", that is, greater than or equal to 0.

#### Interpretable parts

NMF achieves its interpretability by decomposing samples as sums of their parts. For example, NMF decomposes

- documents as combinations of common themes, and
- images as combinations of common patterns. 

#### Using scikit-learn NMF

NMF is available in scikit-learn, and follows the same fit/transform pattern as PCA. However, unlike with PCA, you should always specify the desired number of components. 

NMF works both with numpy arrays and sparse arrays in the csr_matrix format.
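
The following sketch illustrates both points on random, non-negative data (illustrative only). The parameters `init="random"`, `random_state=0` and `max_iter=500` are our choices to make the run repeatable, not settings used elsewhere in this notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF

# Random non-negative data, as a numpy array and as a csr_matrix
rng = np.random.default_rng(0)
X_dense = rng.random((8, 5))
X_sparse = csr_matrix(X_dense)

# The number of components is given explicitly
model = NMF(n_components=2, init="random", random_state=0, max_iter=500)

# Both input formats are accepted
features_dense = model.fit_transform(X_dense)
features_sparse = model.fit_transform(X_sparse)

print(features_dense.shape, features_sparse.shape)

# Features and components are all non-negative
print((features_sparse >= 0).all(), (model.components_ >= 0).all())
```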

### NMF applied to Wikipedia articles

Firstly, import NMF. Create a model, specifying the desired number of components. Let's specify 6. Fit the model to the samples, then use the fit model to perform the transformation.

In [24]:
# Import NMF
from sklearn.decomposition import NMF

#### NMF learns interpretable parts

The components of NMF represent patterns that frequently occur in the samples. Let's consider a concrete example, where articles are represented by their word frequencies. There are 60 articles, and 13125 words. So the array has 13125 columns.

#### Applying NMF to the articles

Let's fit an NMF model with 6 components to the articles. The 6 components are stored as the 6 rows of a 2-dimensional numpy array.

In [25]:
# Create an NMF instance: model
model = NMF(n_components = 6)

In [26]:
# Fit the model to articles
model.fit(articles)



NMF(n_components=6)

#### NMF components

Just as PCA has principal components, NMF has components which it learns from the samples, and as with PCA, the dimension of the components is the same as the dimension of the samples. The entries of the NMF components are always non-negative. 

In [27]:
nmf_components = model.components_

In [28]:
nmf_components.shape

(6, 13125)

#### NMF features

The NMF feature values are non-negative, as well.

In [29]:
# Transform the articles: nmf_features
nmf_features = model.transform(articles)

In [30]:
# Print the NMF features
#print(nmf_features.round(2))

In [31]:
# Create a pandas DataFrame: df
df = pd.DataFrame(nmf_features, index = titles)

In [32]:
df.head()

|  | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| HTTP 404 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.44046 |
| Alexa Internet | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.566597 |
| Internet Explorer | 0.003821 | 0.0 | 0.0 | 0.0 | 0.0 | 0.398641 |
| HTTP cookie | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.381735 |
| Google Search | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.48551 |


In [33]:
# Print the row for 'Anne Hathaway'
print(df.loc['Anne Hathaway'])


0    0.003846
1    0.000000
2    0.000000
3    0.575604
4    0.000000
5    0.000000
Name: Anne Hathaway, dtype: float64


In [34]:
# Print the row for 'Denzel Washington'
print(df.loc['Denzel Washington'])


0    0.000000
1    0.005601
2    0.000000
3    0.422302
4    0.000000
5    0.000000
Name: Denzel Washington, dtype: float64


Notice that for both actors, NMF feature 3 (the column with index 3) has by far the highest value. This means that both articles are reconstructed using mainly the NMF component with index 3.  

Next we'll see that NMF components represent topics (for instance, acting!).

#### Reconstruction of a sample

The features and the components of an NMF model can be combined to approximately reconstruct the original data samples. If we multiply each NMF component by the corresponding NMF feature value, and add up each column, we get something very close to the original sample.
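
This reconstruction is a matrix product of the features and the components. The sketch below demonstrates it on small synthetic data that we construct to be exactly a non-negative combination of two parts. The data, `init="nndsvda"` and `max_iter=2000` are our assumptions for the illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic non-negative samples built from two "parts" (illustrative only)
rng = np.random.default_rng(0)
true_features = rng.random((6, 2))
true_components = rng.random((2, 4))
samples = true_features @ true_components

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=2000)
nmf_features = model.fit_transform(samples)

# Reconstruct one sample: scale each component by its feature value and add up
first = sum(
    nmf_features[0, k] * model.components_[k]
    for k in range(model.n_components)
)

# The same thing for all samples at once, as a matrix product
reconstruction = nmf_features @ model.components_

print(np.allclose(first, reconstruction[0]))
print(np.abs(samples - reconstruction).max())  # largest reconstruction error
```

On real data such as word-frequency arrays, the reconstruction is only approximate, since a handful of components cannot capture every detail of every sample.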

NMF can only be applied to arrays of non-negative data, such as word-frequency arrays. There are many other great examples as well, such as 
- arrays encoding collections of images, 
- arrays encoding audio spectrograms, and 
- arrays representing the purchase histories on e-commerce sites.

#### NMF learns topics of documents

When NMF is applied to documents, the components correspond to topics of documents, and the NMF features reconstruct the documents from the topics. We'll verify this by using the NMF model that we built earlier using the Wikipedia articles. 

Previously, you saw that the value of NMF feature 3 was high for the articles about actors Anne Hathaway and Denzel Washington. In this exercise, identify the topic of the corresponding NMF component.

In [35]:
words = pd.read_csv("wikipedia-vocabulary-utf8.csv", header=None)
words = words.rename(columns = {0: "word"})

In [36]:
# words

In [37]:
# Create a DataFrame: components_df
components_df = pd.DataFrame(model.components_, columns = words["word"])

In [38]:
components_df

word,aaron,abandon,abandoned,abandoning,abandonment,abbas,abbey,abbreviated,abbreviation,abc,...,zealand,zenith,zeppelin,zero,zeus,zimbabwe,zinc,zone,zones,zoo
0,0.011375,0.00121,0.0,0.001739,0.000136,0.0,0.0,0.002463,2.446059e-07,0.000834,...,0.02578,0.0,0.008324,0.0,0.0,0.0,0.0,0.0,0.000424,0.0
1,0.0,1e-05,0.005663,0.0,2e-06,0.0,0.0,0.000566,0.000500241,0.0,...,0.008106,0.0,0.0,0.00171,0.0,0.0,0.0,0.002813,0.000297,0.0
2,0.0,8e-06,0.0,0.0,0.004692,0.0,0.0,0.000758,1.604255e-05,0.0,...,0.00873,0.0,0.0,0.001317,0.0,0.0,0.0,0.0,0.000143,0.0
3,0.004149,0.0,0.003057,0.0,0.000614,0.0,0.0,0.002437,8.144764e-05,0.003985,...,0.012596,0.0,0.0,0.0,0.0,0.0,0.0,0.001742,0.006721,0.0
4,0.0,0.000568,0.004918,0.0,0.0,0.0,0.0,8.9e-05,4.259536e-05,0.0,...,0.001809,0.0,0.0,1.7e-05,0.0,0.0,0.0,0.000192,0.001351,0.0
5,0.000139,0.0,0.008749,0.0,0.000185,0.0,0.0,0.008629,1.53041e-05,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002401,0.001682,0.0


In [39]:
# Print the shape of the DataFrame
print(components_df.shape)

(6, 13125)


#### NMF components are topics

The rows, or components, live in a 13125-dimensional space: there is one dimension for each of the words. Aligning the words of our vocabulary with the columns of the NMF components allows them to be interpreted.

In [40]:
# Select row 3: component
component = components_df.iloc[3, :]
component.sort_values(ascending = False)

word
film       0.627992
award      0.253178
starred    0.245329
role       0.211490
actress    0.186432
             ...   
halo       0.000000
halt       0.000000
halves     0.000000
hamburg    0.000000
zoo        0.000000
Name: 3, Length: 13125, dtype: float64

In [41]:
# Print result of nlargest
print(component.nlargest(n=10))


word
film           0.627992
award          0.253178
starred        0.245329
role           0.211490
actress        0.186432
played         0.169764
actor          0.157396
performance    0.148378
washington     0.145891
drama          0.129315
Name: 3, dtype: float64


Choosing a component, such as this one, and looking at which words have the highest values, we see that they fit a theme: the top words include 'film', 'award', 'starred', 'role' and 'actress'.

#### NMF components

So if NMF is applied to documents, then the components correspond to topics, and the NMF features reconstruct the documents from the topics. 
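
The same inspection can be repeated for every component at once. The sketch below uses a tiny, hypothetical vocabulary and hand-written component values (not the values learned above) to show one way of listing the top words per component with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical vocabulary and NMF components, for illustration only
vocab = ["film", "award", "goal", "match"]
components = np.array([
    [0.60, 0.30, 0.00, 0.01],
    [0.00, 0.02, 0.50, 0.40],
])

# Align the vocabulary with the columns of the components
components_df = pd.DataFrame(components, columns=vocab)

# For each component (row), keep the words with the largest values
top_words = {
    index: row.nlargest(2).index.tolist()
    for index, row in components_df.iterrows()
}

print(top_words)
```

With the real components_df built earlier, the same loop would give a short list of words that can serve as a label for each topic.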

### Building an article recommender system using NMF

#### Finding similar articles

Suppose that you are an engineer at a large online newspaper. You've been given the task of recommending articles that are similar to the article currently being read by a customer. 

####  Strategy

Our strategy for solving this problem is to apply NMF to the word-frequency array of the articles, and to use the resulting NMF features. NMF features describe the topic mixture of an article, so similar articles will have similar NMF features. 

But how can two articles be compared using their NMF features?

Before answering this question, let's set the scene by doing the first step.

#### Apply NMF to the word-frequency array : 1st Step

You are given a word frequency array articles corresponding to the collection of newspaper articles in question. 

Import NMF, create the model, and use the fit_transform method to obtain the transformed articles. Now we've got NMF features for every article, given by the columns of the new array.

Now we need to define how to compare articles using their NMF features.

#### Versions of articles

Similar documents have similar topics, but it isn't always the case that the NMF feature values are exactly the same. 

For instance, one version of a document might use very direct language, whereas other versions might interleave the same content with meaningless chatter. Meaningless chatter reduces the frequency of the topic words overall, which reduces the values of the NMF features representing the topics. However, on a scatter plot of the NMF features, all these versions lie on a single line passing through the origin.

For this reason, when comparing two documents, it's a good idea to compare these lines.

####  Cosine similarity -- 2nd Step

We'll compare them using what is known as the cosine similarity, which uses the angle between the two lines. Higher values indicate greater similarity. 

To compute the cosine similarity:

- Firstly, import the normalize function, and apply it to the array of all NMF features. 

- Now select the row corresponding to the current article, and pass it to the dot method of the array of all normalized features. 

This results in the cosine similarities.
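
These two steps can be sketched on a small, made-up array of NMF features (the values are hypothetical). The second row is a scaled-down copy of the first, standing in for a "chattier" version of the same document, and the third row has a different topic mixture.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical NMF features: one row per article, one column per topic
nmf_features = np.array([
    [0.60, 0.0, 0.30],  # article 0
    [0.24, 0.0, 0.12],  # article 1: same topic mixture as article 0, diluted
    [0.00, 0.5, 0.10],  # article 2: a different topic mixture
])

# Scale every row to unit length (L2 norm is the default)
norm_features = normalize(nmf_features)

# Select the current article and take dot products with all rows
current_article = norm_features[0]
similarities = norm_features.dot(current_article)

print(similarities)
```

Articles 0 and 1 lie on the same line through the origin, so their cosine similarity is 1 (up to rounding), while article 2 scores much lower.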

#### Which articles are similar to 'Cristiano Ronaldo'?

In this exercise we'll use NMF features and the cosine similarity to find similar articles. We'll apply this to the NMF model for popular Wikipedia articles, by finding the articles most similar to the article about the footballer Cristiano Ronaldo.


In this exercise, we'll build a pipeline and transform the array into normalized NMF features. 

The first step in the pipeline, MaxAbsScaler, divides each column of the array by its largest absolute value. For the word-frequency array, this means every word's column ends up in the range 0 to 1, so that no single word dominates the model just because its raw values are large.
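
As a quick illustration of what MaxAbsScaler does to an array, the sketch below applies it to a small, made-up array (the values are hypothetical). Each column is divided by its own largest absolute value, and zeros stay zero, which is why the scaler is also suitable for sparse data.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical non-negative data: rows are samples, columns are features
X = np.array([
    [0.0, 2.0, 0.50],
    [4.0, 0.0, 0.25],
    [2.0, 1.0, 0.00],
])

scaled = MaxAbsScaler().fit_transform(X)

# Each column now has a maximum absolute value of 1
print(scaled)
print(np.abs(scaled).max(axis=0))
```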

In [60]:
# Perform the necessary imports
from sklearn.preprocessing import Normalizer, MaxAbsScaler
from sklearn.pipeline import make_pipeline

In [61]:
# Create a MaxAbsScaler: scaler
scaler = MaxAbsScaler()


In [62]:
# Create an NMF model: nmf
nmf = NMF(n_components = 6)


In [63]:
# Create a Normalizer: normalizer
normalizer = Normalizer()

In [64]:
# Create a pipeline: pipeline
pipeline = make_pipeline(scaler, nmf, normalizer)

In [65]:
# Apply fit_transform to articles: norm_features
norm_features = pipeline.fit_transform(articles)



With the help of a pandas DataFrame, we can label the similarities with the article titles. Start by importing pandas. After normalizing the NMF features, create a DataFrame whose rows are the normalized features, using the titles as an index. 

In [66]:
# Create a DataFrame: df
df = pd.DataFrame(norm_features,index = titles)


Now use the loc method of the DataFrame to select the normalized feature values for the current article, using its title 'Cristiano Ronaldo'.

In [67]:
# Select the row corresponding to 'Cristiano Ronaldo': article
article = df.loc['Cristiano Ronaldo']

Calculate the cosine similarities using the dot method of the DataFrame.

In [68]:
# Compute the dot products: similarities
similarities = df.dot(article)

Finally, use the nlargest method of the resulting pandas Series to find the articles with the highest cosine similarity.

In [69]:
# Display those with highest cosine similarity
print(similarities.nlargest())

Cristiano Ronaldo                1.000000
Radamel Falcao                   0.999941
Zlatan Ibrahimović               0.999941
Franck Ribéry                    0.999941
France national football team    0.999922
dtype: float64
