### TF-IDF-based recommendation system
#### Recommender system using TF-IDF vectors to represent documents

- Represent each article as a bag of words
- Represent the user by the words of the articles they have read
- Build a TF-IDF matrix for all articles and a TF-IDF vector for the user's read articles
- Compute the cosine similarity between the user vector and every article
- Recommend the most similar articles the user has not read yet

Parameters:

- NEWS_ARTICLES: path to the news articles CSV file (`NewsArticles.csv` in the cell below)
- ARTICLES_READ: list of Article_Ids read by the user
- NUM_RECOMMENDED_ARTICLES: number of articles to recommend

In [None]:
NEWS_ARTICLES="NewsArticles.csv"
ARTICLES_READ=[1,10]
NUM_RECOMMENDED_ARTICLES=5

In [None]:
try:
    import numpy
    import pandas as pd
    import pickle as pk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import re
    from nltk.stem.snowball import SnowballStemmer
    import nltk
    stemmer = SnowballStemmer("english")
except ImportError:
    print('You are missing some packages! '
          'We will try installing them before continuing!')
    !pip install "numpy" "pandas" "scikit-learn" "nltk"
    import numpy
    import pandas as pd
    import pickle as pk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import re
    from nltk.stem.snowball import SnowballStemmer
    import nltk
    stemmer = SnowballStemmer("english")
    print('Done!')

#### Text sanitization

- Remove punctuation and other symbols
- Tokenize the text
- Stem each token

Loading the dataframe may fail with a UTF-8 decoding error like the one below. Passing an explicit `encoding` argument to `pd.read_csv` resolves it.
Error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 28: invalid start byte`

In [None]:
news_articles = pd.read_csv(NEWS_ARTICLES, encoding='unicode_escape', index_col=0)
# Drop all the unnamed columns
unnamed_columns = news_articles.columns[news_articles.columns.str.contains('unnamed', case=False)]
news_articles.drop(columns=unnamed_columns, inplace=True)
news_articles.head()

| article_id | publish_date | article_source_link | title | subtitle | text |
|---|---|---|---|---|---|
| 1 | 2017/2/7 | http://abcnews.go.com/Politics/pence-break-tie... | Betsy DeVos Confirmed as Education Secretary, ... | | Michigan billionaire education activist Betsy ... |
| 2 | 2017/2/7 | http://abcnews.go.com/Politics/wireStory/melan... | Melania Trump Says White House Could Mean Mill... | | First lady Melania Trump has said little about... |
| 3 | 2017/2/7 | http://abcnews.go.com/Politics/wireStory/trump... | As Trump Fears Fraud, GOP Eliminates Election ... | | A House committee voted on Tuesday to eliminat... |
| 4 | 2017/2/7 | http://abcnews.go.com/Politics/appeals-court-d... | Appeals Court to Decide on Challenge to Trump'... | | This afternoon, three federal judges from the ... |
| 5 | 2017/2/7 | http://abcnews.go.com/US/23-states-winter-weat... | At Least 4 Tornadoes Reported in Southeast Lou... | | At least four tornadoes touched down in Louisi... |


In [None]:
news_articles = news_articles[['title','text']].dropna()
# articles holds the text of every remaining article, in dataframe order
articles = news_articles['text'].tolist()
articles[0]  # an uncleaned article

'Michigan billionaire education activist Betsy DeVos was confirmed today to serve as the secretary of education in President Trump\'s administration, after Vice President Mike Pence cast a tie-breaking vote in the Senate. The Senate voted on DeVos"?highly contentious nomination this afternoon, and the tally was split evenly, requiring Pence to use his authority as president of the upper chamber of Congress to break the impasse. This was the first time that a vice president has broken a tie to confirm a Cabinet nominee. Pence read the vote count 50-50 and then voted himself, rendering the tally 51-50. The day before the vote, Democrats staged a 24-hour marathon of speeches, with more than 30 lawmakers taking to the floor to urge at least one additional Republican to vote against DeVos and block her confirmation. "It is hard to imagine a worse choice,"?Sen. Elizabeth Warren, D-Mass., said before she read letters from constituents urging her to vote no. DeVos stirred up vehement oppositio

In [None]:
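def clean_text(document):
    # Replace punctuation marks and other symbols with spaces (word characters, whitespace and hyphens are kept)
    document = re.sub(r'[^\w\s-]', ' ', document)
    # Split the text into tokens (requires NLTK's Punkt tokenizer data to be downloaded)
    tokens = nltk.word_tokenize(document)
    # Stem each token and join the stems back into one string
    cleaned_article = ' '.join(stemmer.stem(item) for item in tokens)
    return cleaned_article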

In [None]:
cleaned_articles = list(map(clean_text, articles))
cleaned_articles[0]  #a cleaned, tokenized and stemmed article

'michigan billionair educ activist betsi devo was confirm today to serv as the secretari of educ in presid trump s administr after vice presid mike penc cast a tie-break vote in the senat the senat vote on devo high contenti nomin this afternoon and the talli was split even requir penc to use his author as presid of the upper chamber of congress to break the impass this was the first time that a vice presid has broken a tie to confirm a cabinet nomine penc read the vote count 50-50 and then vote himself render the talli 51-50 the day befor the vote democrat stage a 24-hour marathon of speech with more than 30 lawmak take to the floor to urg at least one addit republican to vote against devo and block her confirm it is hard to imagin a wors choic sen elizabeth warren d-mass said befor she read letter from constitu urg her to vote no devo stir up vehement opposit from teacher union and all 48 senat democrat mani cite concern about her support of school voucher which critic believ will we
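
To see what the substitution step in `clean_text` keeps and what it drops, here is a small standard-library sketch. It applies an equivalent character class (`\w` already covers the underscore) to a made-up sentence and leaves out the NLTK tokenizing and stemming. The helper name `strip_symbols` and the sample string are assumptions for illustration, not part of the notebook.

```python
import re

# Keep word characters, whitespace and hyphens; everything else becomes a space.
KEEP_PATTERN = re.compile(r"[^\w\s-]")

def strip_symbols(document):
    """Replace each character that is not a word character, whitespace or '-' with a space."""
    return KEEP_PATTERN.sub(" ", document)

sample = "Pence's vote: 51-50, a tie-breaker!"   # hypothetical input
print(strip_symbols(sample).split())
# ['Pence', 's', 'vote', '51-50', 'a', 'tie-breaker']
```

Hyphenated tokens such as `51-50` survive, while the apostrophe splits `Pence's` into two tokens, which matches the stray `s` visible in the cleaned article above.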

#### Get the content of the articles read by the user

In [None]:
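# cleaned_articles is a plain list, so each value in ARTICLES_READ is used here as a
# 0-based position, not as an article_id label (position 1 holds article_id 2, as the output shows)
user_articles = ' '.join(cleaned_articles[i] for i in ARTICLES_READ)
user_articles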

'first ladi melania trump has said littl about what she intend to do with her promin posit but in new court document her lawyer say that the multi-year term dure which she is one of the most photograph women in the world could mean million of dollar for her person brand while the new document don t specif mention her term as first ladi the unusu statement about her expect profit drew swift condemn from ethic watchdog as inappropri profit from her high-profil posit which is typic center on public servic the statement came monday in a libel lawsuit the first ladi re-fil in a state trial court in manhattan trump has been su the corpor that publish the daili mail s websit over a now-retract report that claim she onc work as an escort in the file monday trump s lawyer argu that the report was not onli fals and libel but also damag her abil to profit off her high profil and affect her busi opportun trump had the uniqu once-in-a-lifetim opportun as an extrem famous and well-known person as we
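
Because `cleaned_articles` is indexed by position while `ARTICLES_READ` is described as a list of Article_Ids, the two can drift apart, especially after `dropna()` removes rows. One possible way to translate ids into positions is sketched below. The helper `ids_to_positions` and the toy data are hypothetical; in the notebook, the id list could be taken from `news_articles.index.tolist()` after the `dropna()` call.

```python
def ids_to_positions(article_ids, wanted_ids):
    """Map article_id labels to 0-based positions in a list built in the same order."""
    position_of = {article_id: pos for pos, article_id in enumerate(article_ids)}
    return [position_of[article_id] for article_id in wanted_ids]

# Hypothetical index: ids start at 1 and id 3 is missing (for example, dropped by dropna()).
article_ids = [1, 2, 4, 5]
cleaned = ["text one", "text two", "text four", "text five"]

read_positions = ids_to_positions(article_ids, [1, 4])
user_profile = " ".join(cleaned[pos] for pos in read_positions)
print(read_positions)   # [0, 2]
print(user_profile)     # text one text four
```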

### Generate TF-IDF for read and unread articles

In [None]:
# Create the TF-IDF vectorizer and fit it on the entire corpus
tfidf_matrix = TfidfVectorizer(stop_words='english', min_df=2)
# min_df: when building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold
article_tfidf_matrix = tfidf_matrix.fit_transform(cleaned_articles)
article_tfidf_matrix  # TF-IDF matrix with one row per article

<3732x18771 sparse matrix of type '<class 'numpy.float64'>'
	with 670119 stored elements in Compressed Sparse Row format>
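
For reference, this is the weighting `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), as used in the cell above. The symbols are chosen here for illustration: $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{tf}(t,d)$ the raw count of $t$ in document $d$.

$$
\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1, \qquad
w_{t,d} = \mathrm{tf}(t,d)\,\mathrm{idf}(t), \qquad
\hat{w}_{t,d} = \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}}
$$

Each row of the matrix holds the normalized weights $\hat{w}_{t,d}$ for one article, so every non-empty row has unit length.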

In [None]:
# Transform the concatenated read articles into a single TF-IDF vector using the fitted vocabulary
user_article_tfidf_vector = tfidf_matrix.transform([user_articles])
user_article_tfidf_vector

<1x18771 sparse matrix of type '<class 'numpy.float64'>'
	with 355 stored elements in Compressed Sparse Row format>

In [None]:
user_article_tfidf_vector.toarray()

array([[0., 0., 0., ..., 0., 0., 0.]])

### Cosine similarity between user read articles and unread articles

In [None]:
# One similarity score per article (shape: number of articles x 1)
articles_similarity_score = cosine_similarity(article_tfidf_matrix, user_article_tfidf_vector)
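
As a rough illustration of what the score measures, the sketch below computes cosine similarity by hand for sparse vectors stored as `{term: weight}` dictionaries. It is a standard-library toy, not how scikit-learn implements `cosine_similarity`; the function name `cosine` and all weights are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0   # convention used in this sketch for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical weights for a user profile and three articles.
user = {"senat": 3.0, "vote": 4.0}
article_a = {"senat": 6.0, "vote": 8.0}     # same direction as the user vector
article_b = {"grammi": 1.0, "award": 2.0}   # no shared terms
article_c = {"vote": 1.0, "court": 1.0}     # one shared term

for name, article in [("a", article_a), ("b", article_b), ("c", article_c)]:
    print(name, round(cosine(user, article), 3))
```

The score depends only on the direction of the vectors, so an article with the same term proportions as the user profile scores 1 regardless of its length, and an article sharing no terms scores 0.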

In [None]:
# Row positions of the articles sorted by similarity, highest first
recommended_articles_id = articles_similarity_score.flatten().argsort()[::-1]

In [None]:
recommended_articles_id

array([  10, 1093,    1, ..., 1895, 2879,  276], dtype=int64)

In [None]:
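# Remove read articles from the recommendations and keep the top NUM_RECOMMENDED_ARTICLES
# (the values compared here are row positions returned by argsort)
final_recommended_articles_id = [article_id for article_id in recommended_articles_id
                                 if article_id not in ARTICLES_READ][:NUM_RECOMMENDED_ARTICLES]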

In [None]:
final_recommended_articles_id

[1093, 55, 233, 372, 3098]
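
The ranking step can also be written so that positions are converted to article ids before the read articles are filtered out, which keeps the comparison label-to-label. The following is a minimal sketch under that assumption, using plain lists; `top_unread` and the toy scores are hypothetical and not part of the notebook.

```python
def top_unread(scores, article_ids, read_ids, k):
    """Return the ids of the k highest-scoring articles the user has not read."""
    read = set(read_ids)
    # Sort row positions by score, highest first.
    ranked_positions = sorted(range(len(scores)), key=lambda pos: scores[pos], reverse=True)
    # Translate positions to article ids before filtering.
    ranked_ids = (article_ids[pos] for pos in ranked_positions)
    unread = [article_id for article_id in ranked_ids if article_id not in read]
    return unread[:k]

# Hypothetical similarity scores, aligned with article_ids by position.
scores = [0.9, 0.1, 0.8, 0.4, 0.7]
article_ids = [1, 2, 4, 5, 7]
print(top_unread(scores, article_ids, read_ids=[1, 4], k=2))   # [7, 5]
```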

### Article Recommendation

In [None]:
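# Titles of the read and the recommended articles
# Note: .loc selects by article_id label, whereas the ids computed above are row positions from argsort
print('Articles Read')
print(news_articles.loc[ARTICLES_READ]['title'])
print('\n')
print('Recommendations ')
print(news_articles.loc[final_recommended_articles_id]['title'])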

Articles Read
article_id
1     Betsy DeVos Confirmed as Education Secretary, ...
10    Multi-State Manhunt in Southeast Intensifies f...
Name: title, dtype: object


Recommendations 
article_id
1093             Adele sweeps the boards at Grammy Awards
55      How child predator was caught by tiny clue in ...
233     What goes on in a far-right Facebook filter bu...
372     Beijing, Manila agree on $3.7b in shared projects
3098    Iconic New York Columnist Jimmy Breslin Dead A...
Name: title, dtype: object
