# Article Recommendation System
                                                            - MANSI AGARWAL

### Importing the Necessary Libraries 

In [34]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Importing and Reading the Dataset by creating a data frame df 

In [35]:
df = pd.read_csv(r'C:\Users\Dell\Desktop\5 sem\Mini Project\shared_articles.csv')

In [36]:
df.head(1)

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
0,1459192779,CONTENT REMOVED,-6451309518266745024,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en


### Analyzing each feature/column of the dataset to select required features/columns for recommendation

1. TimeStamp - This column records when the event occurred.

In [37]:
print(df['timestamp'].head(5))

0    1459192779
1    1459193988
2    1459194146
3    1459194474
4    1459194497
Name: timestamp, dtype: int64


This column is not helpful for our recommendation system.
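For reference, here is a minimal sketch of how one of these values could be read as a date. It assumes the timestamps are Unix epoch seconds, which the notebook does not state explicitly; `first_event` and `event_time` are names introduced only for this illustration.

```python
from datetime import datetime, timezone

# First timestamp from the output above, ASSUMED to be Unix epoch seconds
first_event = 1459192779

# Convert to a timezone-aware UTC datetime
event_time = datetime.fromtimestamp(first_event, tz=timezone.utc)
print(event_time.isoformat())
```

Under that assumption the date falls on 2016-03-28, which agrees with the date in the URL of the first row.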

2. EventType - This tells us whether the article was shared or removed at a particular timestamp.

In [38]:
print(df['eventType'].value_counts())

CONTENT SHARED     3047
CONTENT REMOVED      75
Name: eventType, dtype: int64


Now I will filter the dataframe and remove all rows with the CONTENT REMOVED status, as they will not help in the recommendation system.

In [39]:
df = df[df['eventType']=='CONTENT SHARED']

In [40]:
print(df.shape)

(3047, 13)


3. ContentID - This is the article ID in numeric format.

In [41]:
print(df['contentId'].head(5))

1   -4110354420726924665
2   -7292285110016212249
3   -6151852268067518688
4    2448026894306402386
5   -2826566343807132236
Name: contentId, dtype: int64


4. AuthorPersonID - This will help us identify unique authors and count them.

In [42]:
print(df['authorPersonId'].head(5))

1    4340306774493623681
2    4340306774493623681
3    3891637997717104548
4    4340306774493623681
5    4340306774493623681
Name: authorPersonId, dtype: int64


In [43]:
print(len(df['authorPersonId'].unique()))

252


5. AuthorSessionID - This is the sessionId of the author

In [44]:
print(df['authorSessionId'].head(5))

1    8940341205206233829
2    8940341205206233829
3   -1457532940883382585
4    8940341205206233829
5    8940341205206233829
Name: authorSessionId, dtype: int64


6. AuthorUserAgent - This is the user-agent string, which tells us the browser and operating system the author used.

In [45]:
print(df['authorUserAgent'].tail(5))

3117    Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebK...
3118    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3...
3119    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0...
3120    Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...
3121    Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...
Name: authorUserAgent, dtype: object


7. AuthorRegion - This gives the state/region of the author.

In [46]:
print(df['authorRegion'].tail(5))


3117    SP
3118    GA
3119    SP
3120    MG
3121    SP
Name: authorRegion, dtype: object


In [47]:
print(len(df['authorRegion'].unique()))
print(df['authorRegion'].isnull().sum())

20
2378


Since 2378 of the 3047 values (about 78%) are null, I won't use this column in the recommendation system.

8. AuthorCountry - This gives the country of the author of the article.

In [48]:
print(df['authorCountry'].tail(5))

3117    BR
3118    US
3119    BR
3120    BR
3121    BR
Name: authorCountry, dtype: object


In [49]:
print(df['authorCountry'].unique())
print(df['authorCountry'].isnull().sum())

[nan 'BR' 'CA' 'US' 'AU' 'PT']
2378


Since 2378 of the country values are null, I won't use this column either.
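The null checks above can be generalised into a per-column share of missing values. The sketch below uses a small made-up frame (`toy`) rather than the real dataset, and the 0.5 cutoff is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "authorRegion": ["SP", None, None, None],
    "authorCountry": ["BR", None, None, None],
    "title": ["a", "b", "c", "d"],
})

# Fraction of missing values in each column
null_share = toy.isnull().mean()

# Hypothetical cutoff: flag columns that are more than half empty
threshold = 0.5
sparse_columns = null_share[null_share > threshold].index.tolist()

print(null_share)
print(sparse_columns)
```

Columns flagged this way are candidates for exclusion, which is the same decision made by hand for the region and country columns.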

9. ContentType - The format of the article shared.

In [50]:
print(df['contentType'].unique())
print(df['contentType'].isnull().sum())

['HTML' 'RICH' 'VIDEO']
0


10. URL - This is the URL of the article; it can be used as a reference to navigate to the article.

In [51]:
print(df['url'].tail(5))


3117    https://startupi.com.br/2017/02/liga-ventures-...
3118    https://thenextweb.com/apps/2017/02/14/amazon-...
3119                          https://code.org/about/2016
3120    https://www.bloomberg.com/news/articles/2017-0...
3121    https://www.acquia.com/blog/partner/2017-acqui...
Name: url, dtype: object


In [52]:
print(df['url'].isnull().sum())
print(df['url'].isna().sum())

0
0


This column has no missing values (isnull and isna are aliases in pandas, so both counts are 0) and can be used to navigate directly to the article, so it is useful in the system.

11. Title - Headline of the Article

In [53]:
print(df['title'].head(5))

1    Ethereum, a Virtual Currency, Enables Transact...
2    Bitcoin Future: When GBPcoin of Branson Wins O...
3                         Google Data Center 360° Tour
4    IBM Wants to "Evolve the Internet" With Blockc...
5    IEEE to Talk Blockchain at Cloud Computing Oxf...
Name: title, dtype: object


In [54]:
print(df['title'].isnull().sum())
print(df['title'].isna().sum())

0
0


Since this column has no missing values, I will use it to identify the recommended articles and also as the input to the recommendation system.

12. Text - This is the content of the Article

In [55]:
print(df['text'].head())

1    All of this work is still very early. The firs...
2    The alarm clock wakes me at 8:00 with stream o...
3    We're excited to share the Google Data Center ...
4    The Aite Group projects the blockchain market ...
5    One of the largest and oldest organizations fo...
Name: text, dtype: object


In [56]:
print(df['text'].isnull().sum())
print(df['text'].isna().sum())

0
0


It has no missing values. This is the most critical column, as the project is a content-based recommendation system. It will be used to create the TF-IDF matrix for analysis.

13. Lang - It gives the language in which the article is written

In [57]:
print(df['lang'].unique())

['en' 'pt' 'es' 'la' 'ja']


In [58]:
print(df['lang'].isnull().sum())
print(df['lang'].isna().sum())

0
0


Since English is the most common language, only the articles written in English are kept in the dataframe.

In [59]:
df=df[df['lang']== 'en']
print(df.shape)

(2211, 13)
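One way to check which language dominates before filtering is to count the values in the column. This is a sketch on made-up data: `toy`, `lang_counts`, `most_common` and `english_only` are illustrative names, and the counts are not those of the real dataset.

```python
import pandas as pd

# Made-up language codes standing in for the 'lang' column
toy = pd.DataFrame({"lang": ["en", "pt", "en", "es", "en", "pt"]})

# Number of articles per language, most frequent first
lang_counts = toy["lang"].value_counts()

# Label of the most frequent language
most_common = lang_counts.idxmax()

# Keep only rows written in that language
english_only = toy[toy["lang"] == most_common]

print(lang_counts)
print(english_only.shape)
```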


### Exploratory Data Analysis :

Next, the dataframe is reduced to only the columns necessary for analysis.

In [60]:
# Note: 'content' is not a column of the dataset, so it is created as an all-NaN column
df = pd.DataFrame(df,columns=['contentId','authorPersonId','content','title','text'])

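create_test is a function applied to each row to build a 'test' column from the 'text' column. Because x['text'] is a single string, ' '.join(...) puts a space between every character, as the output below shows. The 'test' column is not used later; the TF-IDF matrix is built from 'text'.

In [61]:
def create_test(x):
    # x is one row; x['text'] is a string, so join iterates over its characters
    test = ' '.join(x['text'])
    return test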

In [62]:
df['test']=df.apply(create_test,axis=1)

In [63]:
df.head(2)

Unnamed: 0,contentId,authorPersonId,content,title,text,test
1,-4110354420726924665,4340306774493623681,,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,A l l o f t h i s w o r k i s s t i ...
2,-7292285110016212249,4340306774493623681,,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,T h e a l a r m c l o c k w a k e s m ...
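The spaced-out letters in the 'test' column come from how str.join treats a string argument. The short sketch below shows the effect and, for contrast, joins a list of whole fields; combining title and text is a hypothetical choice made only for this example, not something the notebook does.

```python
text = "All of this"

# str.join iterates over its argument; iterating a string yields characters,
# so a space is placed between every character (including existing spaces)
spaced = " ".join(text)
print(spaced)

# Joining a list of strings concatenates whole fields instead
row = {"title": "Ethereum", "text": "All of this"}  # made-up row
combined = " ".join([row["title"], row["text"]])
print(combined)
```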


A TfidfVectorizer object is created with stop_words='english', so that common English stop words such as 'a' and 'the' are removed.

In [64]:
tfidf = TfidfVectorizer(stop_words='english')

Now the required TF-IDF matrix is built by fitting the vectorizer to the article text and transforming it.

In [65]:
tfidf_matrix = tfidf.fit_transform(df['text'])

In [66]:
print(tfidf_matrix.shape)

(2211, 45496)
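To make the weighting behind this matrix concrete, here is a small pure-Python sketch on a made-up corpus. It assumes scikit-learn's default settings (raw term counts, smoothed idf of ln((1 + n) / (1 + df)) + 1, then L2 normalisation of each row) and ignores stop-word removal and the library's tokenisation rules; `docs`, `vocab` and `tfidf_vector` are illustrative names.

```python
import math
from collections import Counter

# Made-up corpus, already lower-cased and whitespace-tokenisable
docs = [
    "bitcoin blockchain bank",
    "blockchain cloud",
    "cloud data center",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(t for doc in tokenized for t in set(doc))

def tfidf_vector(tokens):
    """Return the L2-normalised TF-IDF row for one tokenised document."""
    counts = Counter(tokens)
    # term count * smoothed idf, one weight per vocabulary term
    raw = [
        counts[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t in vocab
    ]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

matrix = [tfidf_vector(doc) for doc in tokenized]
for row in matrix:
    print([round(w, 3) for w in row])
```

In the first document, 'bitcoin' (found in one document) gets a larger weight than 'blockchain' (found in two), which is the behaviour that lets rarer terms dominate the similarity.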


The fitted vectorizer now maps each integer feature index (a column of the matrix) to its feature name (a term).

In [77]:
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
print(tfidf.get_feature_names())



Now I will use cosine similarity to measure how similar the articles are. It computes the cosine of the angle between each pair of TF-IDF vectors: the smaller the angle (the closer the cosine is to 1), the more similar the articles are.
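As a rough illustration of the measure itself, the sketch below computes the cosine of the angle between two plain Python lists. The `cosine` helper is a hypothetical name written for this example and is not the scikit-learn function used in the notebook.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Treat an all-zero vector as having no similarity to anything
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine([1, 0], [1, 0]))  # same direction
print(cosine([1, 0], [0, 1]))  # no shared terms
print(cosine([1, 1], [1, 0]))  # 45 degrees apart
```

Because TF-IDF weights are never negative, the values for articles range from 0 (no terms in common) to 1 (same direction).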

In [68]:
# dense_output=True returns a dense NumPy array rather than a sparse matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix, dense_output=True)

In [69]:
print(cosine_sim.shape)

(2211, 2211)


In [70]:
print(cosine_sim)

[[1.         0.02842053 0.01414884 ... 0.04717028 0.08436331 0.01859574]
 [0.02842053 1.         0.02096081 ... 0.0256655  0.03187741 0.01120014]
 [0.01414884 0.02096081 1.         ... 0.02092281 0.04240744 0.        ]
 ...
 [0.04717028 0.0256655  0.02092281 ... 1.         0.05457163 0.07645089]
 [0.08436331 0.03187741 0.04240744 ... 0.05457163 1.         0.03965418]
 [0.01859574 0.01120014 0.         ... 0.07645089 0.03965418 1.        ]]


Now I will create a map from each article title to its row index and remove duplicate titles, if any.

In [71]:
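metadata = df.reset_index()
indices = pd.Series(metadata.index,index=metadata['title'])
# drop_duplicates() compares values (row positions), so repeated titles are removed via the index
indices = indices[~indices.index.duplicated(keep='first')]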

In [72]:
print(indices[:10])

title
Ethereum, a Virtual Currency, Enables Transactions That Rival Bitcoin's                                          0
Bitcoin Future: When GBPcoin of Branson Wins Over USDcoin of Trump                                               1
Google Data Center 360° Tour                                                                                     2
IBM Wants to "Evolve the Internet" With Blockchain Technology                                                    3
IEEE to Talk Blockchain at Cloud Computing Oxford-Con - CoinDesk                                                 4
Banks Need To Collaborate With Bitcoin and Fintech Developers                                                    5
Blockchain Technology Could Put Bank Auditors Out of Work                                                        6
Why Decentralized Conglomerates Will Scale Better than Bitcoin - Interview with OpenLedger CEO - Bitcoin News    7
The Rise And Growth of Ethereum Gets Mainstream Coverage                  
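A note on duplicate titles in a lookup Series like this one: Series.drop_duplicates() compares the values (here, the row positions), while Index.duplicated() flags repeated labels. The sketch below shows the difference on made-up titles; `lookup`, `by_value` and `by_label` are illustrative names.

```python
import pandas as pd

# Made-up titles; "A" appears twice
titles = ["A", "B", "A"]
lookup = pd.Series(range(len(titles)), index=titles)

# Values 0, 1, 2 are all distinct, so nothing is removed here
by_value = lookup.drop_duplicates()

# Repeated index labels are flagged, keeping the first occurrence of each title
by_label = lookup[~lookup.index.duplicated(keep="first")]

print(len(by_value), len(by_label))
print(lookup["A"])    # two rows: a Series, not a single position
print(by_label["A"])  # a single position
```

A lookup that returns a Series for a repeated title cannot be used directly as a single row index, so de-duplicating on the labels keeps the later lookup unambiguous.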

Now I will create a simple recommendation function that takes an article title, the title-to-index map, the cosine similarity matrix and the dataset as input, and returns the titles of similar articles as output.

In [73]:
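def get_recommendations(title,indices,cosine_sim,data):
    # Row position of the requested article
    index = indices[title]
    # Pair every article position with its similarity to the requested article
    sim_scores = list(enumerate(cosine_sim[index]))
    # Most similar first
    sim_scores = sorted(sim_scores,key=lambda x:x[1],reverse=True)
    # Skip position 0 (normally the article itself) and keep the next 10
    sim_scores = sim_scores[1:11]
    article_indices = [i[0] for i in sim_scores]
    return data['title'].iloc[article_indices]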
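The function relies on the requested article ranking first in its own similarity row, so that dropping the first entry removes it. If another article had identical text, the two would tie at 1.0. A possible variant, sketched below on a made-up similarity row, excludes the query position explicitly; `top_similar` is a hypothetical helper, not part of the notebook.

```python
def top_similar(sim_row, query_index, k=10):
    """Positions of the k most similar items, never including the query itself."""
    # Pair each position with its score, leaving out the query position
    scored = [(i, s) for i, s in enumerate(sim_row) if i != query_index]
    # Highest similarity first; ties keep their original order
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:k]]

# Made-up similarity row for article 0; article 2 has identical text
sim_row = [1.0, 0.2, 1.0, 0.7, 0.05]
neighbours = top_similar(sim_row, query_index=0, k=3)
print(neighbours)
```

With this approach the identical article at position 2 is still recommended, while the query itself never appears in the result.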

Now we can call the function with an article title as input, and the recommendation system will return the 10 most similar articles.

In [74]:
print(get_recommendations('Life Beyond Email: Chatbot Marketing',indices,cosine_sim,metadata))

381     How Facebook's Big Bet On Chatbots Might Remak...
2004                             Introduction To Chatbots
1262    What the future will look like when we use cha...
692     Facebook says 10K+ developers are building cha...
562             The 200 billion dollar chatbot disruption
1582          Chatbots: Are they better without the chat?
376     Facebook sends a loud message about Messenger ...
938     A new Facebook chatbot could help you find you...
45      Behind Facebook Messenger's plan to be an app ...
1601    CHATBOTS EXPLAINED: Why businesses should be p...
Name: title, dtype: object


In [75]:
print(get_recommendations('Another option for file sharing',indices,cosine_sim,metadata))

1472    [Security] How to Set Expiration Dates for Sha...
1039    Apple File System (APFS) announced for 2017, s...
1335    Git for Windows accidentally creates NTFS alte...
445     Google Drive grows more powerful, feature by f...
532     5 reasons your employees aren't sharing their ...
1476    Add the Same File to Multiple Folders in Googl...
1386    AWS Certified Solutions Architect Professional...
366                                   Building for HTTP/2
1861    Announcing new storage classes for Google Clou...
1127    What's the Version of my Deployed Application?...
Name: title, dtype: object
