# Importing libraries

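## What is TF-IDF Vectorizer?
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical representation of text documents that weights each word by its importance: the weight rises with the word's frequency within a document and falls the more documents in the corpus contain it. scikit-learn's `TfidfVectorizer` converts a collection of raw text documents into a matrix of these weights.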
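
As a minimal sketch of this weighting, the snippet below fits `TfidfVectorizer` on a three-document toy corpus. The documents and variable names are made up for illustration and are not part of the dataset used later. A word that occurs in every document should receive a lower weight than a word that occurs in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not part of the shared-articles dataset.
toy_docs = [
    "bitcoin price rises",
    "bitcoin network grows",
    "bitcoin wallet stolen",
]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_docs)

# Look up the column positions of a common term and a rare term.
common_col = toy_tfidf.vocabulary_["bitcoin"]
rare_col = toy_tfidf.vocabulary_["price"]

# Weights in the first document: the rare term outweighs the common one.
common_weight = toy_matrix[0, common_col]
rare_weight = toy_matrix[0, rare_col]
print(f"bitcoin: {common_weight:.3f}, price: {rare_weight:.3f}")
```

"bitcoin" appears in all three documents, so its inverse document frequency is the lowest possible and its weight in the first document is smaller than that of "price".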

## What is cosine similarity?
Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them. It is commonly used in information retrieval and recommendation systems to compare the similarity of documents or items based on their feature vectors.
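
The definition can be sketched in a few lines of NumPy. The helper name `cosine` and the example vectors are illustrative assumptions; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a

print(cosine(a, b))  # close to 1.0: same orientation
print(cosine(a, c))  # 0.0: no shared components
```

Doubling the length of a vector leaves the score unchanged, because only the angle between the vectors matters.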

In [None]:
import pandas as pd
from IPython.display import display
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Importing the dataset of shared articles

In [None]:
articles_df = pd.read_csv('/content/drive/MyDrive/archive/shared_articles.csv')
print(articles_df.shape)
print(articles_df.columns)

(3122, 13)
Index(['timestamp', 'eventType', 'contentId', 'authorPersonId',
       'authorSessionId', 'authorUserAgent', 'authorRegion', 'authorCountry',
       'contentType', 'url', 'title', 'text', 'lang'],
      dtype='object')


# Understanding the dataset

In [None]:
print(articles_df['timestamp'].head(5))

0    1459192779
1    1459193988
2    1459194146
3    1459194474
4    1459194497
Name: timestamp, dtype: int64
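
These values look like Unix epoch seconds. That is an assumption based on their magnitude, not something the data states. If it holds, `pd.to_datetime` with `unit="s"` makes them readable. The sketch below uses the five values printed above rather than the full DataFrame.

```python
import pandas as pd

# First five timestamp values copied from the output above.
sample_ts = pd.Series([1459192779, 1459193988, 1459194146, 1459194474, 1459194497])

# Assumption: values are seconds since 1970-01-01 UTC.
sample_dt = pd.to_datetime(sample_ts, unit="s")
print(sample_dt.iloc[0])  # 2016-03-28 19:19:39
```

The resulting date agrees with the `2016/03/28` segment in the URL of the first article, which supports the epoch-seconds reading.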


In [None]:
articles_df

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
0,1459192779,CONTENT REMOVED,-6451309518266745024,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en
3,1459194474,CONTENT SHARED,-6151852268067518688,3891637997717104548,-1457532940883382585,,,,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en
4,1459194497,CONTENT SHARED,2448026894306402386,4340306774493623681,8940341205206233829,,,,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3117,1487946604,CONTENT SHARED,9213260650272029784,3609194402293569455,7144190892417579456,Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebK...,SP,BR,HTML,https://startupi.com.br/2017/02/liga-ventures-...,"Conheça a Liga IoT, plataforma de inovação abe...","A Liga Ventures, aceleradora de startups espec...",pt
3118,1487947067,CONTENT SHARED,-3295913657316686039,6960073744377754728,-8193630595542572738,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3...,GA,US,HTML,https://thenextweb.com/apps/2017/02/14/amazon-...,Amazon takes on Skype and GoToMeeting with its...,"Amazon has launched Chime, a video conferencin...",en
3119,1488223224,CONTENT SHARED,3618271604906293310,1908339160857512799,-183341653743161643,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0...,SP,BR,HTML,https://code.org/about/2016,Code.org 2016 Annual Report,"February 9, 2017 - We begin each year with a l...",en
3120,1488300719,CONTENT SHARED,6607431762270322325,-1393866732742189886,2367029511384577082,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...,MG,BR,HTML,https://www.bloomberg.com/news/articles/2017-0...,JPMorgan Software Does in Seconds What Took La...,"At JPMorgan Chase & Co., a learning machine is...",en


In [None]:
print(articles_df['eventType'].value_counts())

CONTENT SHARED     3047
CONTENT REMOVED      75
Name: eventType, dtype: int64


In [None]:
articles_df = articles_df[articles_df['eventType'] == 'CONTENT SHARED']
print(articles_df.shape)

(3047, 13)


In [None]:
print(articles_df['contentId'].head(5))

1   -4110354420726924665
2   -7292285110016212249
3   -6151852268067518688
4    2448026894306402386
5   -2826566343807132236
Name: contentId, dtype: int64


In [None]:
print(articles_df['authorPersonId'].head(5))

1    4340306774493623681
2    4340306774493623681
3    3891637997717104548
4    4340306774493623681
5    4340306774493623681
Name: authorPersonId, dtype: int64


In [None]:
# Count the distinct authors
print(len(articles_df['authorPersonId'].unique()))

252


In [None]:
print(articles_df['authorSessionId'].head(5))

1    8940341205206233829
2    8940341205206233829
3   -1457532940883382585
4    8940341205206233829
5    8940341205206233829
Name: authorSessionId, dtype: int64


In [None]:
print(articles_df['authorUserAgent'].tail(5))

3117    Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebK...
3118    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3...
3119    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0...
3120    Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...
3121    Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...
Name: authorUserAgent, dtype: object


In [None]:
# Count the distinct user-agent strings (unique() also counts NaN as one value)
print(len(articles_df['authorUserAgent'].unique()))

115


In [None]:
print(articles_df['authorRegion'].tail(5))
print(len(articles_df['authorRegion'].unique()))
print(articles_df['authorRegion'].isnull().sum(axis=0))

3117    SP
3118    GA
3119    SP
3120    MG
3121    SP
Name: authorRegion, dtype: object
20
2378


In [None]:
print(articles_df['authorCountry'].tail(5))
print(articles_df['authorCountry'].unique())
print(articles_df['authorCountry'].isnull().sum(axis=0))

3117    BR
3118    US
3119    BR
3120    BR
3121    BR
Name: authorCountry, dtype: object
[nan 'BR' 'CA' 'US' 'AU' 'PT']
2378


In [None]:
print(articles_df['contentType'].unique())
print(articles_df['contentType'].isnull().sum(axis=0))

['HTML' 'RICH' 'VIDEO']
0


In [None]:
print(articles_df['url'].head(5))
print(articles_df['url'].isnull().sum(axis=0))
print(articles_df['url'].isna().sum(axis=0))

1    http://www.nytimes.com/2016/03/28/business/dea...
2    http://cointelegraph.com/news/bitcoin-future-w...
3    https://cloudplatform.googleblog.com/2016/03/G...
4    https://bitcoinmagazine.com/articles/ibm-wants...
5    http://www.coindesk.com/ieee-blockchain-oxford...
Name: url, dtype: object
0
0


In [None]:
print(articles_df['title'].head(5))
print(articles_df['title'].isnull().sum(axis=0))
print(articles_df['title'].isna().sum(axis=0))

1    Ethereum, a Virtual Currency, Enables Transact...
2    Bitcoin Future: When GBPcoin of Branson Wins O...
3                         Google Data Center 360° Tour
4    IBM Wants to "Evolve the Internet" With Blockc...
5    IEEE to Talk Blockchain at Cloud Computing Oxf...
Name: title, dtype: object
0
0


In [None]:
print(articles_df['text'].head(5))
print(articles_df['text'].isnull().sum(axis=0))
print(articles_df['text'].isna().sum(axis=0))

1    All of this work is still very early. The firs...
2    The alarm clock wakes me at 8:00 with stream o...
3    We're excited to share the Google Data Center ...
4    The Aite Group projects the blockchain market ...
5    One of the largest and oldest organizations fo...
Name: text, dtype: object
0
0


In [None]:
print(articles_df['lang'].unique())
print(articles_df['lang'].isnull().sum(axis=0))
print(articles_df['lang'].isna().sum(axis=0))

['en' 'pt' 'es' 'la' 'ja']
0
0


In [None]:
articles_df = articles_df[articles_df['lang'] == 'en']
print(articles_df.shape)

(2264, 13)


In [None]:
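# Keep only the columns needed for modelling. The original column list also named
# 'content', which is not in the dataset and would only add an all-NaN column.
articles_df = articles_df[['contentId', 'authorPersonId', 'title', 'text']].copy()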

# Creating functions for modelling the dataset

In [None]:
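def create_soup(x):
    # Assumption: the soup is meant to combine the title and the body text.
    # Joining x['text'] alone would put a space between every character,
    # because str.join iterates over a string character by character.
    soup = ' '.join([x['title'], x['text']])
    return soup
articles_df['soup'] = articles_df.apply(create_soup, axis=1)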

## Cleaning the dataset by removing punctuation and stop words

In [None]:
# Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(articles_df['text'])

## Mapping the array of features to indices

In [None]:
# Output the shape of tfidf_matrix
print(tfidf_matrix.shape)
# print(tfidf.get_stop_words().pop())

# Array mapping from feature integer indices to feature name.
print(tfidf.get_feature_names_out()[5000:5010])

(2264, 45514)
['banter' 'baptista' 'baptiste' 'baptized' 'bar' 'barack' 'barani'
 'baratheon' 'barauskas' 'barb']


## ML model: Cosine similarity

---


*   Cosine similarity measures how similar two vectors are in a high-dimensional space by calculating the cosine of the angle between them.
*   It depends on the orientation of the vectors rather than their magnitude, so it is unaffected by vector length or scaling. This makes it suitable for comparing documents of different lengths.
*   In general the score ranges from -1 to 1, where 1 indicates vectors pointing in the same direction, 0 indicates orthogonal vectors (no similarity), and -1 indicates exactly opposite vectors. TF-IDF weights are never negative, so the scores in this notebook fall between 0 and 1.
*   On TF-IDF vectors it compares which weighted terms two documents share. It does not capture semantic meaning: two documents that express the same idea with different words still score low.
*   It is particularly useful for high-dimensional, sparse data, such as text documents represented by the bag-of-words model or TF-IDF vectors.
*   It is computationally efficient and widely used in information retrieval, natural language processing, and recommendation systems, for tasks such as document similarity, content-based filtering, collaborative filtering, clustering, and text classification.
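
Two of these properties can be checked on a tiny example. The sketch below uses a made-up corpus and relies on `TfidfVectorizer` normalising each row to unit length by default (`norm='l2'`), in which case the plain dot product from `linear_kernel` gives the same scores as `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical mini corpus; the documents differ a lot in length.
docs = [
    "blockchain",
    "blockchain blockchain blockchain blockchain",
    "google data center tour",
]

X = TfidfVectorizer().fit_transform(docs)   # rows are L2-normalised by default

cos = cosine_similarity(X)
dot = linear_kernel(X)

# Doc 0 and doc 1 point in the same direction, so length does not matter.
print(round(cos[0, 1], 6))
# With unit-length rows, the plain dot product equals the cosine.
print(np.allclose(cos, dot))
# TF-IDF weights are non-negative, so no score is below zero.
print(cos.min() >= 0.0)
```

On large matrices `linear_kernel` skips the normalisation step that `cosine_similarity` performs, which can save some time when the rows are already unit length.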

In [None]:
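# Pairwise similarity between every pair of articles; dense_output=True returns a NumPy array
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix, dense_output=True)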
display(cosine_sim.shape)
display(cosine_sim)

(2264, 2264)

array([[1.        , 1.        , 0.02838652, ..., 0.04728543, 0.08469279,
        0.01861113],
       [1.        , 1.        , 0.02838652, ..., 0.04728543, 0.08469279,
        0.01861113],
       [0.02838652, 0.02838652, 1.        , ..., 0.02562211, 0.03176227,
        0.01113377],
       ...,
       [0.04728543, 0.04728543, 0.02562211, ..., 1.        , 0.05475127,
        0.0767066 ],
       [0.08469279, 0.08469279, 0.03176227, ..., 0.05475127, 1.        ,
        0.03956185],
       [0.01861113, 0.01861113, 0.01113377, ..., 0.0767066 , 0.03956185,
        1.        ]])
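
The first two rows of the matrix are identical and their mutual score is 1.0, which suggests that the same article text appears more than once. One possible way to list such pairs is sketched below on a small stand-in matrix. The `find_duplicate_pairs` helper and the 0.999 threshold are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def find_duplicate_pairs(sim, threshold=0.999):
    """Return (i, j) pairs with i < j whose similarity meets the threshold."""
    # Keep only the strict upper triangle so each pair is reported once
    # and the diagonal (self-similarity) is ignored.
    upper = np.triu(sim, k=1)
    rows, cols = np.where(upper >= threshold)
    return [(int(i), int(j)) for i, j in zip(rows, cols)]

# Stand-in matrix built from the top-left values shown above.
demo_sim = np.array([
    [1.0,        1.0,        0.02838652],
    [1.0,        1.0,        0.02838652],
    [0.02838652, 0.02838652, 1.0],
])

print(find_duplicate_pairs(demo_sim))  # [(0, 1)]
```

Applied to the full `cosine_sim` array, the same helper would return every pair of rows that are effectively the same document.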

In [None]:
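# Reset the index so row positions line up with cosine_sim, then map each title to its row position.
# Note: drop_duplicates() compares the values (positions), so repeated titles remain in the index.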
metadata = articles_df.reset_index()
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()
display(indices[:10])

title
Ethereum, a Virtual Currency, Enables Transactions That Rival Bitcoin's                                          0
Ethereum, a Virtual Currency, Enables Transactions That Rival Bitcoin's                                          1
Bitcoin Future: When GBPcoin of Branson Wins Over USDcoin of Trump                                               2
Google Data Center 360° Tour                                                                                     3
IBM Wants to "Evolve the Internet" With Blockchain Technology                                                    4
IEEE to Talk Blockchain at Cloud Computing Oxford-Con - CoinDesk                                                 5
Banks Need To Collaborate With Bitcoin and Fintech Developers                                                    6
Blockchain Technology Could Put Bank Auditors Out of Work                                                        7
Why Decentralized Conglomerates Will Scale Better than Bitcoin - Interview
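
The first two entries above share a title, so a lookup such as `indices[title]` returns two positions for it instead of one. One possible way to keep a single position per title is sketched below with made-up titles; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical titles; the first one is repeated.
demo_titles = ["Article A", "Article A", "Article B"]
demo_indices = pd.Series(range(len(demo_titles)), index=demo_titles)

# drop_duplicates() looks at the values 0, 1, 2, so nothing is removed.
unchanged = demo_indices.drop_duplicates()
print(len(unchanged))  # 3

# Filtering on duplicated index labels keeps the first position per title.
deduped = demo_indices[~demo_indices.index.duplicated(keep="first")]
print(deduped["Article A"])  # 0
```

With a unique index, every title lookup yields a single integer that can be used directly as a row of the similarity matrix.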

In [None]:
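def get_recommendations(title, indices, cosine_sim, data):
    """Return the titles of the 10 articles most similar to `title`."""
    idx = indices[title]
    # A repeated title makes the lookup return a Series; use its first position.
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    # Pair each article position with its similarity to the query article.
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Drop the query article itself, then sort by similarity, highest first.
    sim_scores = [s for s in sim_scores if s[0] != idx]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    article_indices = [i[0] for i in sim_scores[:10]]
    return data['title'].iloc[article_indices]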

In [None]:
print(get_recommendations('Intel\'s internal IoT platform for real-time enterprise analytics', indices, cosine_sim, metadata))

1148    Comparing IoT Platforms: Compare 4 IoT platfor...
1682    Bring a dinosaur to life with Watson IoT Platf...
1220       Decentralizing IoT networks through blockchain
264     IoT Day: A timeline of how IoT is changing the...
2150    Relating a Problem Definition to IoT Architect...
460     How IoT security can benefit from machine lear...
131     Is the Internet of Things in Your Home? Or on ...
1883    IoT Insurance: Trends in Home, Life & Auto Ins...
1580    Popular Internet of Things Forecast of 50 Bill...
1730    The Internet of Things is looking for its Visi...
Name: title, dtype: object


In [None]:
print(get_recommendations('Google Data Center 360° Tour', indices, cosine_sim, metadata))

126     Google shares data center security and design ...
739     YouTube's New Messenger Means You'll Never Hav...
555                          This year's Founders' Letter
213     Google Cloud Platform: The smart person's guid...
452     Top 5 GCP NEXT breakout sessions on YouTube (s...
724     Google I/O 2016 Preview: A Chrome/Android merg...
2111    [Tools] How to Record your Desktop Screen with...
778     Google I/O 2016 preview: Android N, Android VR...
786     Here's proof that Google is getting serious ab...
781     Google I/O 2016 Preview: Machine Learning, Vir...
Name: title, dtype: object


In [None]:
print(get_recommendations('The Rise And Growth of Ethereum Gets Mainstream Coverage', indices, cosine_sim, metadata))