# Homework Problem 1: LSA-based Recommender
- Work off of the LSA notebook that we went through in class
- Use the Reuters 10K article corpus raw_text_dataset.pickle (obtained from https://github.com/chrisjmccormick/LSA_Classification)
    
- Create a doc2vec(doc, tfidf_vectorizer) function corresponding to a TF-IDF vectorizer
    - INPUTS: doc, tfidf_vectorizer
        - doc - any string
        - tfidf_vectorizer - a TfidfVectorizer instance
    - OUTPUTS: vec, doc_features, doc_counts
        - vec - a vector with $L_2$ norm of $1$
        - doc_features - the features after tokenization and pre-processing
        - doc_counts - the counts of each feature in this document
    - train tfidf_vectorizer on the Reuters 10K article corpus
- For each of the following doc strings, calculate their corresponding vectors
    - doc1: "The cocoa cadabra"
    - doc2: "AAPL SE"
    - doc3: "bullish stocks"
    - doc4: "I walked through a random forest and earned a high premium"
- Create a **recommend(vec, X_model, X_corpus)** function:
    - which projects any document vector onto a given X_model
        - here X_model = {X_train_tfidf, and X_train_lsa}
    - and returns doc_vec, idx_top10, sim_top10, X_top10 as follows
        - doc_vec - the (sparse) vector of similarity scores of vec and members of X_model. 
        This vector should be size (Dx1)
        - idx_top10: the indices of the top-10 similarity scores
        - sim_top10: the top-10 similarity scores
        - X_top10: the top-10 corpus articles most similar to the input model
    - what does your recommend() function output for the doc vectors in the previous exercise?
        - Do you see an improvement of the LSA similarity recommendation relative to the TF-IDF similarity recommendation?
        
- BONUS:
    - repeat the same exercise but instead of the Reuters 10K dataset, use this corpus of 200K English plaintext jokes https://github.com/taivop/joke-dataset . Does your recommender system actually find similar jokes? Give examples of good recommendations and bad recommendations. 


In [1]:
import pickle
import os

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.metrics.pairwise import euclidean_distances

import json

## Create a doc2vec(doc, tfidf_vectorizer) function corresponding to a TF-IDF vectorizer

In [2]:
def doc2vec(doc, tfidf_vectorizer):
    """A transform function
    This function transforms a document into a TF-IDF vector.
    
    Parameters
    ----------
    doc : string
          The document you would like to transform.
    
    tfidf_vectorizer : sklearn.feature_extraction.text.TfidfVectorizer
                       A trained TfidfVectorizer used to transform doc into a vector.
                       
    Returns
    -------
    vec : np.array (dense)
          The TF-IDF vector that doc is transformed into.
              
    doc_features : list of strings
                   The vocabulary of tfidf_vectorizer, in the same order as vec and doc_counts.
                
    doc_counts : np.array
                 The count of each vocabulary feature in this document.
    
    """
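    # TF-IDF weights for this single document, flattened to a dense 1-D array.
    doc_tfidf = tfidf_vectorizer.transform([doc])
    vec = doc_tfidf.toarray().ravel()
    # Feature names in column order (newer scikit-learn versions use get_feature_names_out()).
    doc_features = tfidf_vectorizer.get_feature_names()
    # Count raw term occurrences using the same vocabulary, so columns line up with vec.
    count_vectorizer = CountVectorizer(vocabulary=tfidf_vectorizer.vocabulary_)
    doc_counts = count_vectorizer.transform([doc]).toarray().ravel()
    return vec, doc_features, doc_counts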
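
The assignment describes doc_features as the features of the document itself, whereas the function above returns the vectorizer's whole vocabulary. Below is a minimal sketch of a per-document variant. It assumes a toy corpus and hypothetical names (`toy_corpus`, `toy_vectorizer`) that are not part of the notebook, and it relies on `build_analyzer()`, which applies the vectorizer's own preprocessing, tokenization and stop-word filtering.

```python
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the Reuters corpus (illustrative only).
toy_corpus = [
    "cocoa prices rise in ghana",
    "ghana cocoa board buys cocoa",
    "oil prices rise again",
]
toy_vectorizer = TfidfVectorizer(stop_words="english", norm="l2")
toy_vectorizer.fit(toy_corpus)

doc = "The cocoa cadabra cocoa"

# The analyzer lowercases, tokenizes and drops stop words the same way fit() did.
tokens = toy_vectorizer.build_analyzer()(doc)

# Keep only tokens the vectorizer knows; unseen words such as "cadabra" are dropped.
counts = Counter(t for t in tokens if t in toy_vectorizer.vocabulary_)
doc_features = sorted(counts)
doc_counts = [counts[f] for f in doc_features]

# The TF-IDF vector of the same document has unit L2 norm.
vec = toy_vectorizer.transform([doc]).toarray().ravel()

print(doc_features, doc_counts, np.linalg.norm(vec))
```

Only the terms that survive the analyzer and appear in the fitted vocabulary are reported, which matches the "features after tokenization and pre-processing" wording of the assignment.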

### train tfidf_vectorizer on the Reuters 10K article corpus

In [3]:
fname = "raw_text_dataset.pickle"
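filepath = os.path.join(os.getcwd(), fname)
# Load the pickled Reuters dataset; the context manager closes the file afterwards.
with open(filepath, "rb") as f:
    raw_text_dataset = pickle.load(f)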

In [4]:
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.5, # ignore terms which occur in more than half of the documents
    max_features=10000,
    min_df=2, # ignore terms which occur in fewer than 2 documents
    stop_words='english',
    norm='l2',
    use_idf=True, 
    analyzer='word',
#     token_pattern='(?u)\\b\\w\\w+\\b'
    token_pattern = '(?u)\\b[a-zA-Z]\\w+\\b'
    )

In [5]:
corpus = raw_text_dataset[0] + raw_text_dataset[2] 
corpus_tfidf = tfidf_vectorizer.fit_transform(corpus)

### calculate their corresponding vectors

In [6]:
doc1 = "The cocoa cadabra"
doc2 = "AAPL SE"
doc3 = "bullish stocks"
doc4 = "I walked through a random forest and earned a high premium"

In [7]:
def vectorcalculation(doc,tfidf_vectorizer = tfidf_vectorizer):
    vec, doc_features, doc_counts = doc2vec(doc, tfidf_vectorizer)
    index = np.where(vec)[0]
    
    for i in index:
        print ("word : ", doc_features[i], " tfidf : ", vec[i], "numbers : ",doc_counts[i])
    
    return vec

In [8]:
for i in [1,2,3,4]:
    print ("Below are the result of word ", i)
    print ("----------")
    doc = globals () ["doc"+str(i)]
    globals () ["vec"+str(i)] = vectorcalculation(doc,tfidf_vectorizer)
    print ("")

Below are the result of word  1
----------
word :  cocoa  tfidf :  1.0 numbers :  1

Below are the result of word  2
----------
word :  aapl  tfidf :  0.7267972294274684 numbers :  1
word :  se  tfidf :  0.6868520854569459 numbers :  1

Below are the result of word  3
----------
word :  bullish  tfidf :  0.8130834300057836 numbers :  1
word :  stocks  tfidf :  0.5821471771382473 numbers :  1

Below are the result of word  4
----------
word :  earned  tfidf :  0.3417258810296154 numbers :  1
word :  forest  tfidf :  0.4421975166235162 numbers :  1
word :  high  tfidf :  0.2477341391645119 numbers :  1
word :  premium  tfidf :  0.3478755645367025 numbers :  1
word :  random  tfidf :  0.4947816947727578 numbers :  1
word :  walked  tfidf :  0.5103785271100408 numbers :  1
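
The assignment requires each vector to have an $L_2$ norm of $1$. As a quick sanity check, the sketch below recomputes the norm from the non-zero weights printed above (all other entries are zero, so they do not contribute). The dictionary name `weights` is illustrative only.

```python
import math

# TF-IDF weights copied from the printed output above.
weights = {
    "doc1": [1.0],
    "doc2": [0.7267972294274684, 0.6868520854569459],
    "doc3": [0.8130834300057836, 0.5821471771382473],
    "doc4": [0.3417258810296154, 0.4421975166235162, 0.2477341391645119,
             0.3478755645367025, 0.4947816947727578, 0.5103785271100408],
}

# L2 norm = square root of the sum of squared weights.
norms = {name: math.sqrt(sum(w * w for w in ws)) for name, ws in weights.items()}
for name, norm in norms.items():
    print(name, round(norm, 6))
```

This also explains why "cocoa" gets a weight of exactly 1.0 in doc1: it is the only in-vocabulary term, so normalization scales its weight to the full unit length.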



In [9]:
# The corresponding vectors have been stored in the variables vec1, vec2, vec3 and vec4.
# For example, we can look at the vector that "The cocoa cadabra" corresponds to.
vec1

array([0., 0., 0., ..., 0., 0., 0.])

## Create a recommend(vec, X_model, X_corpus) function

In [10]:
def recommend(vec, X_model, X_corpus):
    """Recommend the corpus articles closest to a document vector.
    
    Articles are ranked by Euclidean distance to vec, so a smaller score
    means a closer match.
    
    Parameters
    ----------
    vec : array-like or sparse matrix, shape (1, n_features)
          A document vector expressed in the same space as X_model.
          
    X_model : {X_train_tfidf, X_train_lsa}
              The matrix of corpus vectors that vec is compared against.
              
    X_corpus : a list of strings
               The original articles, in the same row order as X_model.
               
    Returns
    -------
    doc_vec : np.array, shape (D,)
              Euclidean distance between vec and each of the D rows of X_model.
              
    idx_top10 : np.array of ints, len = 10
                The indices of the 10 smallest distances.
                
    sim_top10 : np.array of floats, len = 10
                The 10 smallest distances, in ascending order.
                
    X_top10 : list of strings, len = 10
              The 10 corpus articles closest to vec.
    """
    
    
    doc_vec = euclidean_distances(vec,X_model).flatten()
    idx_top10 = np.argsort(doc_vec.flatten())[0:10]
    sim_top10 = doc_vec[idx_top10]
    X_top10 = []
    for i in idx_top10:
        X_top10.append(X_corpus[i])
    
    return doc_vec, idx_top10, sim_top10, X_top10
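
The assignment asks for similarity scores, while the function above ranks by Euclidean distance (smaller is closer). Below is a minimal sketch of a cosine-similarity variant, assuming dense NumPy inputs. The helper `recommend_cosine`, its `k` parameter and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np


def recommend_cosine(vec, X_model, X_corpus, k=10):
    """Rank corpus rows by cosine similarity to vec (hypothetical helper)."""
    vec = np.asarray(vec, dtype=float).ravel()
    X_model = np.asarray(X_model, dtype=float)

    # Guard against division by zero for all-zero rows or queries.
    row_norms = np.linalg.norm(X_model, axis=1)
    denom = np.maximum(row_norms * np.linalg.norm(vec), 1e-12)
    doc_vec = (X_model @ vec) / denom

    # Highest similarity first.
    idx_top = np.argsort(-doc_vec)[:k]
    sim_top = doc_vec[idx_top]
    X_top = [X_corpus[i] for i in idx_top]
    return doc_vec, idx_top, sim_top, X_top


# Toy data: three "documents" in a two-dimensional feature space.
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
corpus_toy = ["doc a", "doc b", "doc c"]
scores, idx, sims, docs = recommend_cosine([2.0, 0.1], X_toy, corpus_toy, k=2)
print(docs, sims)
```

With cosine similarity the scores are sorted in descending order, so the first element is the best match rather than the smallest distance.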

### what does your recommend() function output for the doc vectors in the previous exercise

In [11]:
corpus_train = raw_text_dataset[0]
X_train_tfidf = tfidf_vectorizer.fit_transform(corpus_train)

In [12]:
# Project the tfidf vectors onto the first N principal components.
# Though this is significantly fewer features than the original tfidf vector,
# they are stronger features, and the accuracy is higher.
svd = TruncatedSVD(
    n_components=200,
    random_state=42,
    algorithm='arpack'
)

lsa = make_pipeline(
    svd, 
#     Normalizer(copy=False) # try commenting this out. Do you get a better result?
)

# Run SVD on the training data, then project the training data.
X_train_lsa = lsa.fit_transform(X_train_tfidf)
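
The commented-out `Normalizer` step would rescale each LSA vector to unit length. Below is a minimal NumPy sketch of what the projection and the optional normalization do, assuming a small random matrix in place of `X_train_tfidf`. All names are illustrative, and `TruncatedSVD` may use a different solver and return components with flipped signs.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 5))   # stand-in for a (documents x terms) TF-IDF matrix
k = 2                    # number of latent components to keep

# Thin SVD: X = U @ diag(S) @ Vt, singular values sorted in descending order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Project each document onto the top-k right singular vectors.
X_lsa = X @ Vt[:k].T

# Equivalent view: the projection equals U_k scaled by the singular values.
assert np.allclose(X_lsa, U[:, :k] * S[:k])

# Optional normalization step: rescale every row to unit L2 norm.
X_lsa_unit = X_lsa / np.linalg.norm(X_lsa, axis=1, keepdims=True)
print(X_lsa.shape, np.linalg.norm(X_lsa_unit, axis=1))
```

Without the normalization step, rows of the projected matrix can have very different lengths, which matters when they are compared by Euclidean distance.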

In [13]:
def doc_vector_recommend(doc, X_model, method, X_corpus=corpus_train, tfidf_vectorizer=tfidf_vectorizer, lsa=lsa):
    # Vectorize the raw string in the space that matches X_model.
    if method == "tfidf":
        vec = tfidf_vectorizer.transform([doc])
    elif method == "lsa":
        vec = lsa.transform(tfidf_vectorizer.transform([doc]))
    else:
        raise ValueError("method must be 'tfidf' or 'lsa'")
    doc_vec, idx_top10, sim_top10, X_top10 = recommend(vec, X_model, X_corpus)
    return doc_vec, idx_top10, sim_top10, X_top10

In [14]:
for i in [1,2,3,4]:
    doc = globals () ["doc"+str(i)]
    for j in ["tfidf","lsa"]:
        if j == "tfidf":
            X_model = X_train_tfidf
        elif j == "lsa":
            X_model = X_train_lsa
        doc_vec, idx_top10, sim_top10, X_top10 = doc_vector_recommend(doc, X_model , method = j)
        globals () ["doc_vec"+"_"+str(i)+"_"+j] = doc_vec
        globals () ["idx_top10"+"_"+str(i)+"_"+j] = idx_top10
        globals () ["sim_top10"+"_"+str(i)+"_"+j] = sim_top10
        globals () ["X_top10"+"_"+str(i)+"_"+j] = X_top10

The outputs are stored in variables such as doc_vec_1_tfidf, which holds the scores between document 1 and the corpus articles under the TF-IDF model. Because recommend() uses Euclidean distance, a smaller score means a closer match.
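
Writing to `globals()` works, but it makes the results hard to iterate over. As an alternative, a dictionary keyed by (document number, method) can hold the same information. The sketch below uses a hypothetical stand-in, `fake_recommend`, in place of `doc_vector_recommend`, so it runs without the corpus; its return values are dummies.

```python
def fake_recommend(doc, method):
    """Hypothetical stand-in for doc_vector_recommend; returns dummy values."""
    doc_vec = [float(len(doc)), 0.0]
    idx_top10 = [0]
    sim_top10 = [doc_vec[0]]
    X_top10 = [f"{method} match for {doc!r}"]
    return doc_vec, idx_top10, sim_top10, X_top10


docs = {1: "The cocoa cadabra", 2: "AAPL SE", 3: "bullish stocks"}

# One dictionary entry per (document number, method) pair.
results = {}
for i, doc in docs.items():
    for method in ("tfidf", "lsa"):
        doc_vec, idx_top10, sim_top10, X_top10 = fake_recommend(doc, method)
        results[(i, method)] = {
            "doc_vec": doc_vec,
            "idx_top10": idx_top10,
            "sim_top10": sim_top10,
            "X_top10": X_top10,
        }

# Equivalent of X_top10_3_tfidf[0] in the notebook.
print(results[(3, "tfidf")]["X_top10"][0])
```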

For example, let us look at the article that is most similar to "bullish stocks".

In [15]:
X_top10_3_tfidf[0]

'SUBROTO SEES OIL MARKET CONTINUING BULLISH\n\nIndonesian Energy Minister Subroto said he sees the oil market continuing bullish, with underlying demand expected to rise later in the year. He told a press conference in Jakarta at the end of a two-day meeting of South-East Asian Energy Ministers that he saw prices stabilizing around 18 dlrs a barrel. "The sentiment in the market is bullish and I think it will continue that way as demand will go up in the third or fourth quarters," Subroto said. Asked about the prospect for oil prices, he said: "I think they will stabilise around 18 dlrs, although there is a little turbulence ..." "Of course the spot price will fluctuate, but the official price will remain at 18 dlrs," he added. REUTER '

If we first find the index of the article most similar to "bullish stocks" and then use that index to look up the article in the corpus, we get the same result as above.

In [16]:
corpus[idx_top10_3_tfidf[0]]

'SUBROTO SEES OIL MARKET CONTINUING BULLISH\n\nIndonesian Energy Minister Subroto said he sees the oil market continuing bullish, with underlying demand expected to rise later in the year. He told a press conference in Jakarta at the end of a two-day meeting of South-East Asian Energy Ministers that he saw prices stabilizing around 18 dlrs a barrel. "The sentiment in the market is bullish and I think it will continue that way as demand will go up in the third or fourth quarters," Subroto said. Asked about the prospect for oil prices, he said: "I think they will stabilise around 18 dlrs, although there is a little turbulence ..." "Of course the spot price will fluctuate, but the official price will remain at 18 dlrs," he added. REUTER '

Let us calculate the minimum score, which is the distance to the closest article.

In [17]:
doc_vec_3_tfidf.min()

1.1799903470933173

We get the same result from the first element of the top-10 scores.

In [18]:
sim_top10_3_tfidf[0]

1.1799903470933173
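
Because both the query vector and the TF-IDF rows are L2-normalized, a Euclidean distance $d$ between them corresponds to a cosine similarity of $1 - d^2/2$. The sketch below checks this identity on random unit vectors and then applies it to the minimum distance reported above. It assumes the matched row is not all zeros.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two random unit vectors.
a = rng.random(8)
a /= np.linalg.norm(a)
b = rng.random(8)
b /= np.linalg.norm(b)

# For unit vectors: ||a - b||^2 = 2 - 2 * (a . b).
dist = np.linalg.norm(a - b)
cosine = a @ b
assert np.isclose(dist ** 2, 2 - 2 * cosine)

# Apply the identity to the minimum distance from the notebook output.
d_min = 1.1799903470933173
cos_best = 1 - d_min ** 2 / 2
print(round(cos_best, 4))
```

Under that assumption, the best TF-IDF match for "bullish stocks" has a cosine similarity of roughly 0.30, so even the top hit overlaps only partly with the query.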

### Do you see an improvement of the LSA similarity recommendation relative to the TF-IDF similarity recommendation?

In [19]:
X_top10_1_tfidf[0]

'GHANA COCOA PURCHASES FALL, CUMULATIVE STILL UP\n\nThe Ghana Cocoa Board said it purchased 1,323 tonnes of cocoa in the 21st week, ended February 26, of the 1986/87 main crop season, compared with 1,961 tonnes the previous week and 1,344 tonnes in the 21st week ended March six of the 1985/86 season, the board said. Cumulative purchases so far this season stand at 216,095 tonnes, still up on the 201,966 tonnes purchased by the 21st week of last season, the Board said. Reuter '

In [20]:
X_top10_1_lsa[0]

'CALNY INC CLNY> SUES PEPSICO INC PEP>\n\nCalny Inc said it has filed a multi-million-dlr suit against PepsiCo Inc and its La Petite Boulangerie unit. Calny, which holds 15 La Petite Boulangerie franchises, alleges it and PepsiCo breached their agreements with Calny by failing to support the franchises in a number of ways. The company further alleges that PepsiCo and La Petite Boulangerie had fiduciary responsibilities to Calny because of the longstanding relationship between Calny and Taco Bell, also a PepsiCo subsidiary. Calny operates 143 Taco Bell restaurants. Calny said Pepsico misrepresented the readiness of the La Petite Boulangerie to expand outside San Francisco and misrepresented costs involved in operating the restaurants. Reuter '

In [21]:
X_top10_2_tfidf[0]

'APPLE COMPUTER AAPL> UPGRADES MACINTOSH LINE\n\nApple Computer Inc today will announce the addition of two new machines to its profitable Macintosh line of personal computers, both aimed at the business market. The Macintosh was first introduced in January 1984 and has been upgraded several times since then. Both of the new machines, the Macintosh SE and the Macintosh II, will be faster and more versatile, but considerably more expensive than earlier models. The Mac SE (SE stands for "system expansion"), which Apple says will operate 15-20 pct faster than its current Mac Plus, goes on sale today. It carries a suggested retail price ranging from 2,899 to 3,699 dlrs depending on its features. The Mac II, designed to run about four times faster than the Mac Plus, is to be ready for shipping in May and priced between 4,798 and 6,998 dlrs. Mac Plus, which went on the market one year ago, sells for about 2,200 dlrs. Both new computers are to be unveiled at the AppleWorld Conference in Los A

In [22]:
X_top10_2_lsa[0]

'CALNY INC CLNY> SUES PEPSICO INC PEP>\n\nCalny Inc said it has filed a multi-million-dlr suit against PepsiCo Inc and its La Petite Boulangerie unit. Calny, which holds 15 La Petite Boulangerie franchises, alleges it and PepsiCo breached their agreements with Calny by failing to support the franchises in a number of ways. The company further alleges that PepsiCo and La Petite Boulangerie had fiduciary responsibilities to Calny because of the longstanding relationship between Calny and Taco Bell, also a PepsiCo subsidiary. Calny operates 143 Taco Bell restaurants. Calny said Pepsico misrepresented the readiness of the La Petite Boulangerie to expand outside San Francisco and misrepresented costs involved in operating the restaurants. Reuter '

In [23]:
X_top10_3_tfidf[0]

'SUBROTO SEES OIL MARKET CONTINUING BULLISH\n\nIndonesian Energy Minister Subroto said he sees the oil market continuing bullish, with underlying demand expected to rise later in the year. He told a press conference in Jakarta at the end of a two-day meeting of South-East Asian Energy Ministers that he saw prices stabilizing around 18 dlrs a barrel. "The sentiment in the market is bullish and I think it will continue that way as demand will go up in the third or fourth quarters," Subroto said. Asked about the prospect for oil prices, he said: "I think they will stabilise around 18 dlrs, although there is a little turbulence ..." "Of course the spot price will fluctuate, but the official price will remain at 18 dlrs," he added. REUTER '

In [24]:
X_top10_3_lsa[0]

'CALNY INC CLNY> SUES PEPSICO INC PEP>\n\nCalny Inc said it has filed a multi-million-dlr suit against PepsiCo Inc and its La Petite Boulangerie unit. Calny, which holds 15 La Petite Boulangerie franchises, alleges it and PepsiCo breached their agreements with Calny by failing to support the franchises in a number of ways. The company further alleges that PepsiCo and La Petite Boulangerie had fiduciary responsibilities to Calny because of the longstanding relationship between Calny and Taco Bell, also a PepsiCo subsidiary. Calny operates 143 Taco Bell restaurants. Calny said Pepsico misrepresented the readiness of the La Petite Boulangerie to expand outside San Francisco and misrepresented costs involved in operating the restaurants. Reuter '

In [25]:
X_top10_4_tfidf[0]

"NORANDA TO SPIN OFF FOREST SUBSIDIARIES\n\nNoranda Inc> said the company planned a public share offer within three months of its Noranda Forest Inc unit, which holds Noranda's forest products interests. Size of the offer is still undetermined, Noranda said. Noranda said Noranda Forest would operate as a freestanding subsidiary of Noranda. Noranda Forest holds 100 pct of Fraser Inc, James Maclaren Industries and Noranda Forest Sales and 50 pct stakes in MacMillan Bloedel Ltd MMBLF> and Northwood Pulp and Timber. Noranda Forest's consolidated 1986 revenues were more than three billion dlrs, with earnings of 158 mln dlrs. "

In [26]:
X_top10_4_lsa[0]

'ANGOLA, URUGUAY ESTABLISH DIPLOMATIC RELATIONS\n\nAngola and Uruguay have established diplomatic relations at the ambassadorial level, according to a joint communique signed by their U.N. representatives and circulated here today. Reuter '

I read the documents that each method recommended. For the first two queries, TF-IDF performs better, and for the third one, LSA performs better. Both do very poorly on the fourth (random forest) query. So overall, LSA does not show an improvement over TF-IDF here.
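
One possible factor behind the LSA results, offered as an assumption rather than something verified on this corpus: the LSA vectors here are not normalized, and Euclidean distance on unnormalized vectors depends on length as well as direction. A small NumPy sketch with made-up two-dimensional vectors shows how the two rankings can disagree.

```python
import numpy as np

# Made-up vectors: row 0 points the same way as the query but is long,
# row 1 points elsewhere but sits close to the origin.
X = np.array([[3.0, 0.3],
              [0.0, 0.2]])
query = np.array([0.2, 0.02])

# Euclidean distance: smaller means closer.
dists = np.linalg.norm(X - query, axis=1)

# Cosine similarity: larger means closer in direction.
cosines = (X @ query) / (np.linalg.norm(X, axis=1) * np.linalg.norm(query))

nearest_by_distance = int(np.argmin(dists))
nearest_by_cosine = int(np.argmax(cosines))
print(nearest_by_distance, nearest_by_cosine)
```

Re-enabling the `Normalizer` step in the pipeline, or ranking by cosine similarity, would remove the length effect. Whether that changes the recommendations on the Reuters corpus would need to be checked.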

## BONUS

In [27]:
with open("reddit_jokes.json") as f:
    plaintext_jokes = json.load(f)

In [28]:
joke_doc = []
for i in plaintext_jokes:
    
    joke_doc.append(i['title']+' '+i['body'])

In [29]:
jokes_tfidf = tfidf_vectorizer.fit_transform(joke_doc)
jokes_lsa = lsa.fit_transform(jokes_tfidf)
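
Note that `fit_transform` refits the shared `tfidf_vectorizer`, so its vocabulary now comes from the jokes and the earlier Reuters matrices no longer line up with it. Below is a toy sketch of that effect, using made-up two-document corpora and an illustrative `vectorizer` object.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# First fit: a tiny "news" corpus.
vectorizer.fit(["cocoa prices rise", "oil prices fall"])
news_vocab = set(vectorizer.vocabulary_)

# Second fit on a different corpus discards the first vocabulary.
vectorizer.fit(["why was the cheesemaker lopsided", "one stilton"])
joke_vocab = set(vectorizer.vocabulary_)

print(sorted(news_vocab))
print(sorted(joke_vocab))
```

Keeping a separate vectorizer and LSA pipeline per corpus avoids overwriting the Reuters model.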

In [30]:
doc_vector_recommend(doc3, jokes_tfidf , "tfidf" ,joke_doc, tfidf_vectorizer = tfidf_vectorizer,lsa=lsa)[3][0]

'Why was the cheesemaker lopsided? Because he only had one Stilton!'

In [31]:
doc_vector_recommend(doc3, jokes_lsa , "lsa" ,joke_doc, tfidf_vectorizer = tfidf_vectorizer,lsa=lsa)[3][0]

'What do you call Nitrogen after the sunrises? Daytrogen.'

Well, the system does not work as expected: neither joke seems to have anything to do with bullish stocks. These are bad recommendations.

In [32]:
doc_vector_recommend("cheesemaker", jokes_tfidf , 
                     "tfidf" ,joke_doc, tfidf_vectorizer = tfidf_vectorizer,lsa=lsa)[3][0]

'Why was the cheesemaker lopsided? Because he only had one Stilton!'

This is a good example.