# Problem Statement
You are provided with a text corpus from a digital media news channel in two languages, English and Hindi. The task is to find the most relevant keywords in the text that best represent each story. These keywords will then be used as tags for the stories.

In [0]:
# some common libraries
import pandas as pd
import numpy as np

In [0]:
shiv = pd.read_csv('/content/drive/My Drive/Bluepi/article.csv', header=None)

### By default, `read_csv` treats the first row of the file as column names. This file has no header row, so `header=None` is passed to keep the first row as data.
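
The effect of `header=None` can be seen on a small in-memory file. This is only a sketch: the two rows below are made-up stand-ins for `article.csv`, and the names `raw`, `with_header` and `without_header` are hypothetical.

```python
import io
import pandas as pd

# Two made-up rows with no header line, standing in for the real file
raw = "676946,english,Headline one\n676943,english,Headline two\n"

# Default behaviour: the first data row is consumed as column names
with_header = pd.read_csv(io.StringIO(raw))

# header=None: every row is kept as data and columns are numbered 0, 1, 2
without_header = pd.read_csv(io.StringIO(raw), header=None)

print(list(with_header.columns), len(with_header))
print(list(without_header.columns), len(without_header))
```

Without `header=None` one article would silently disappear into the column labels, which is why the integer column names are renamed explicitly further down.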

In [249]:
shiv.head()


Unnamed: 0,0,1,2,3,4
0,676946,english,"Fadnavis, Piyush Goyal perform 'bhoomi pujan' ...",,standard
1,676943,english,Trump freezes $200 mn in Syrian recovery funds...,,standard
2,676946,english,"Fadnavis, Piyush Goyal perform 'bhoomi pujan' ...","<p>Latur (Maharashtra) [India], Apr. 1 (ANI): ...",standard
3,676941,english,Bhagalpur violence: Arijit Shashwat surrenders,"<p>Patna (Bihar) [India], Apr. 1 (ANI): Union ...",standard
4,676941,english,Bhagalpur violence: Arijit Shashwat surrenders,,standard


In [0]:
# renaming columns for better understanding
shiv = shiv.rename(columns = {0: "articleid", 1: "language", 2: "content", 3: "title", 4: "type"})

In [251]:
shiv.head()

Unnamed: 0,articleid,language,content,title,type
0,676946,english,"Fadnavis, Piyush Goyal perform 'bhoomi pujan' ...",,standard
1,676943,english,Trump freezes $200 mn in Syrian recovery funds...,,standard
2,676946,english,"Fadnavis, Piyush Goyal perform 'bhoomi pujan' ...","<p>Latur (Maharashtra) [India], Apr. 1 (ANI): ...",standard
3,676941,english,Bhagalpur violence: Arijit Shashwat surrenders,"<p>Patna (Bihar) [India], Apr. 1 (ANI): Union ...",standard
4,676941,english,Bhagalpur violence: Arijit Shashwat surrenders,,standard


In [252]:
# Punkt Sentence Tokenizer divides a text into a list of sentences, by using an unsupervised algorithm
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [253]:
# convert the column to str so that word_tokenize can act on it;
# astype returns a new Series, so the result is assigned back
shiv['content'] = shiv['content'].astype(str)
shiv['content']

0       Fadnavis, Piyush Goyal perform 'bhoomi pujan' ...
1       Trump freezes $200 mn in Syrian recovery funds...
2       Fadnavis, Piyush Goyal perform 'bhoomi pujan' ...
3          Bhagalpur violence: Arijit Shashwat surrenders
4          Bhagalpur violence: Arijit Shashwat surrenders
                              ...                        
5644    बैंक कर्मी के शर्मनाक बोल, कहा- कठुआ गैंगरेप प...
5645    ऐसा हमला कोई इंसान नहीं बल्कि कोई दानव ही कर स...
5646    मथुरा में अवैध रूप से रह रहे 24 बांग्लादेशी सम...
5647    पुष्कर में वीकेंड एंजॉय करने पहुंचीं किश्वर सह...
5648    बैंक कर्मी के शर्मनाक बोल, कहा- कठुआ गैंगरेप प...
Name: content, Length: 5649, dtype: object
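
`astype` returns a new Series rather than modifying the column in place, so the converted values persist only if they are assigned back. A minimal sketch with a hypothetical two-row Series (`numbers` and `as_text` are illustrative names, not part of the notebook):

```python
import pandas as pd

numbers = pd.Series([10, 20])

# astype builds a new Series; `numbers` itself is left untouched
as_text = numbers.astype(str)

print(numbers.dtype)      # still an integer dtype
print(as_text.iloc[0])    # the string "10"
```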

In [0]:
# Tokenize each row of the content column
from nltk.tokenize import word_tokenize
shiv['tokenized_sents'] = shiv['content'].apply(word_tokenize)

In [0]:
# token count of each row (punctuation marks are counted as tokens too)
shiv['word_count'] = shiv['tokenized_sents'].apply(len)

In [256]:
shiv.head()

Unnamed: 0,articleid,language,content,title,type,tokenized_sents,word_count
0,676946,english,"Fadnavis, Piyush Goyal perform 'bhoomi pujan' ...",,standard,"[Fadnavis, ,, Piyush, Goyal, perform, 'bhoomi,...",13
1,676943,english,Trump freezes $200 mn in Syrian recovery funds...,,standard,"[Trump, freezes, $, 200, mn, in, Syrian, recov...",12
2,676946,english,"Fadnavis, Piyush Goyal perform 'bhoomi pujan' ...","<p>Latur (Maharashtra) [India], Apr. 1 (ANI): ...",standard,"[Fadnavis, ,, Piyush, Goyal, perform, 'bhoomi,...",13
3,676941,english,Bhagalpur violence: Arijit Shashwat surrenders,"<p>Patna (Bihar) [India], Apr. 1 (ANI): Union ...",standard,"[Bhagalpur, violence, :, Arijit, Shashwat, sur...",6
4,676941,english,Bhagalpur violence: Arijit Shashwat surrenders,,standard,"[Bhagalpur, violence, :, Arijit, Shashwat, sur...",6


In [257]:
shiv[['tokenized_sents', 'word_count']].head()

Unnamed: 0,tokenized_sents,word_count
0,"[Fadnavis, ,, Piyush, Goyal, perform, 'bhoomi,...",13
1,"[Trump, freezes, $, 200, mn, in, Syrian, recov...",12
2,"[Fadnavis, ,, Piyush, Goyal, perform, 'bhoomi,...",13
3,"[Bhagalpur, violence, :, Arijit, Shashwat, sur...",6
4,"[Bhagalpur, violence, :, Arijit, Shashwat, sur...",6
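
Note that `word_tokenize` emits punctuation marks as separate tokens, so `word_count` above includes them (row 3 has five words plus a colon). A minimal sketch of counting only tokens that contain a letter or digit; the token list is written by hand here in place of real tokenizer output, and `toy_tokens` and `toy_words` are hypothetical names:

```python
# Hand-written token list mirroring row 3 of the table above
toy_tokens = ["Bhagalpur", "violence", ":", "Arijit", "Shashwat", "surrenders"]

# Keep a token only if at least one of its characters is alphanumeric
toy_words = [tok for tok in toy_tokens if any(ch.isalnum() for ch in tok)]

print(len(toy_tokens), len(toy_words))
```

Whether punctuation should count depends on what the number is used for; for keyword extraction it is removed later anyway.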


In [258]:
# Identify the 20 most common whitespace-separated tokens across all rows
comm = pd.Series(' '.join(shiv['content']).split()).value_counts()[:20]
comm

के        1523
में       1240
की        1121
ने         961
in         835
का         755
to         752
को         727
से         635
पर         633
of         576
for        323
उन्नाव     308
और         297
Unnao      291
है         290
on         281
CWG        260
BJP        240
2018:      239
dtype: int64

In [259]:
# Libraries for text preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [260]:
# Stemming is a rudimentary rule-based process that strips suffixes ("ing", "ly", "es", "s", etc.) from a word; PorterStemmer is used for it here.
# Lemmatization maps a word to its dictionary root form (lemma) using vocabulary lookup; WordNetLemmatizer is used for it here.
# Note: the "v" argument below tells the lemmatizer to treat the word as a verb, so the adverb "perfectly" comes back unchanged.

lem = WordNetLemmatizer()
stem = PorterStemmer()
word = "perfectly"
print("stemming:",stem.stem(word))
print("lemmatization:", lem.lemmatize(word, "v"))

stemming: perfectli
lemmatization: perfectly


In [261]:
# creating a list of stop words
stop_words = stopwords.words("english")
stop_words[:15]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours']

In [0]:
# Clean each row: keep ASCII letters only, lowercase, drop stop words, lemmatize
data = []
lem = WordNetLemmatizer()
stop_set = set(stop_words)

for i in range(len(shiv)):
    # Replace every character that is not an ASCII letter with a space.
    # This removes punctuation and digits, and it also removes all Hindi (Devanagari) text.
    txt = re.sub(r'[^a-zA-Z]', ' ', shiv['content'][i])

    # Convert to lowercase
    txt = txt.lower()

    # Split on whitespace; this also collapses the runs of spaces left by the substitution
    tokens = txt.split()

    # Drop English stop words and lemmatize the remaining tokens
    tokens = [lem.lemmatize(word) for word in tokens if word not in stop_set]
    data.append(" ".join(tokens))
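
The cleaning step can be tried on single strings to see what it keeps. The sketch below is a simplified stand-in, not the cell above: `clean_text` and `toy_stop_words` are hypothetical names, the stop word set is a tiny hand-picked sample, and lemmatization is left out so that the example runs without NLTK data.

```python
import re

# Tiny hand-picked stop word sample (the notebook uses NLTK's English list)
toy_stop_words = {"in", "the", "of", "to"}

def clean_text(text, stop_words):
    """Keep ASCII letters only, lowercase, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)

english = clean_text("Trump freezes $200 mn in Syrian recovery funds", toy_stop_words)
hindi = clean_text("बैंक कर्मी के शर्मनाक बोल", toy_stop_words)

print(repr(english))
print(repr(hindi))
```

The second call returns an empty string: the `[^a-zA-Z]` pattern discards every Devanagari character, so Hindi rows contribute nothing to the vocabulary unless the pattern and the stop word list are extended for Hindi.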

In [263]:
data[10]

'kodandaram party get green signal'

### CountVectorizer

- It converts text data into numeric vectors, since a model can process only numerical data.

- It only counts the number of times each word appears in a document, which biases the representation in favour of the most frequent words.
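
The counting idea can be sketched in plain Python. This illustrates the concept only and is not scikit-learn's implementation; the headlines and the names `toy_headlines`, `vocabulary`, `count_vectors` and `corpus_totals` are made up for the example.

```python
from collections import Counter

# Made-up, already-cleaned headlines
toy_headlines = [
    "unnao case bjp leader questioned",
    "unnao case hearing today",
    "cwg india wins gold",
]

# One column per distinct word, in alphabetical order
vocabulary = sorted({word for line in toy_headlines for word in line.split()})

# One row per headline: how often each vocabulary word occurs in it
count_vectors = [
    [Counter(line.split())[term] for term in vocabulary]
    for line in toy_headlines
]

# Raw totals across the corpus: words repeated in many headlines dominate
corpus_totals = Counter(word for line in toy_headlines for word in line.split())

print(vocabulary)
print(count_vectors[0])
print(corpus_totals.most_common(2))
```

Raw counts say nothing about how informative a word is, which is the bias the second bullet describes.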

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_df=1.0, stop_words=stop_words, max_features=10000)
x = cv.fit_transform(data)

In [265]:
list(cv.vocabulary_.keys())[:10]

['fadnavis',
 'piyush',
 'goyal',
 'perform',
 'bhoomi',
 'pujan',
 'latur',
 'rail',
 'coach',
 'factory']

### TfidfVectorizer 

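- It is also a method of converting text data into numeric vectors, since a model can process only numerical data.
- TF-IDF weights each word by how often it occurs in a document and by how rare it is across all documents, so words that are frequent everywhere are penalized. Below, `TfidfTransformer` is applied to the counts from `CountVectorizer`, which is equivalent to what `TfidfVectorizer` does in one step.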
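
To make the weighting concrete, the sketch below computes TF-IDF by hand. It assumes the smoothed formula that scikit-learn documents for `smooth_idf=True`, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row (the default `norm='l2'`). The corpus and the names `toy_docs`, `smooth_idf_value` and `tfidf_row` are hypothetical.

```python
import math

# Hypothetical toy corpus of already-cleaned, tokenized documents
toy_docs = [
    ["england", "test", "christchurch"],
    ["england", "win"],
    ["england", "court"],
]
n_docs = len(toy_docs)

def smooth_idf_value(term):
    # Document frequency: number of documents that contain the term
    df = sum(term in doc for doc in toy_docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    # Term frequency times smoothed IDF, then scale the row to unit L2 length
    raw = {term: doc.count(term) * smooth_idf_value(term) for term in set(doc)}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

row = tfidf_row(toy_docs[0])
print(sorted(row.items(), key=lambda item: item[1], reverse=True))
```

"england" occurs in every toy document, so it receives the lowest weight in the first row, while "test" and "christchurch" tie at a higher value.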

In [0]:
from sklearn.feature_extraction.text import TfidfTransformer
 
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(x)
 
# fetch the document for which keywords need to be extracted
doc = data[101]
 
# generate tf-idf for the given document
tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))

In [267]:
data[101]

'england take charge day christchurch test'

In [268]:
# inspect the tf-idf scores of the selected document, one row per vocabulary term

# get_feature_names() was removed in scikit-learn 1.2; versions before 1.0 need cv.get_feature_names()
feature_names = cv.get_feature_names_out()
 
# tf_idf_vector holds a single row: the document selected above
first_document_vector = tf_idf_vector[0]
 
# show the scores, highest first
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"], ascending=False)

# Some words may be missing from this list: CountVectorizer's default token pattern
# keeps only tokens of two or more characters, and max_features caps the vocabulary size.

Unnamed: 0,tfidf
christchurch,0.495083
england,0.495083
test,0.400995
charge,0.395854
take,0.336943
...,...
footballer,0.000000
force,0.000000
forced,0.000000
forget,0.000000


The scores above make sense: the more common a word is across documents, the lower its score, and the more specific a word is to the selected document, the higher its score.

In [0]:
# Helpers for sorting tf-idf scores in descending order
def sort_coo(coo):
    """Return (column index, score) pairs sorted by score, then by index, both descending."""
    tuples = zip(coo.col, coo.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)
 
def top_vector(feature_names, sorted_items, topn=10):
    """Map the topn feature names to their tf-idf scores, rounded to 3 decimals."""
    results = {}

    # each item is a word index and its corresponding tf-idf score
    for idx, score in sorted_items[:topn]:
        results[feature_names[idx]] = round(score, 3)

    return results
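
The sort key `(score, index)` with `reverse=True` decides what happens when several words have the same score: the word with the higher column index comes first. A minimal sketch with made-up `(index, score)` pairs (`toy_pairs` and `ranked` are hypothetical names):

```python
# Made-up (column index, score) pairs; three of them tie on the score
toy_pairs = [(3, 0.33), (7, 0.33), (1, 0.50), (9, 0.33)]

# Same key as sort_coo: score first, column index as the tie-breaker
ranked = sorted(toy_pairs, key=lambda x: (x[1], x[0]), reverse=True)

print(ranked)  # [(1, 0.5), (9, 0.33), (7, 0.33), (3, 0.33)]
```

Since `CountVectorizer` assigns column indices in alphabetical order of the terms, tied keywords are likely to appear in reverse alphabetical order, and a small `topn` can cut off tied words purely on that basis.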

In [270]:
# to show how the code below works, only the first six rows are processed (range(0, 6)); in the actual scenario the range would be (0, 5649)

for i in range(0, 6):
    doc = data[i]
    tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))
    sorted_items = sort_coo(tf_idf_vector.tocoo())
    keywords = top_vector(feature_names, sorted_items, 4)

    print("\nRow no.:", i, "\nContent:", doc)
    print("\nKeywords:")
    for kw in keywords:
        print(kw, keywords[kw])


Row no.: 0 
Content: fadnavis piyush goyal perform bhoomi pujan latur rail coach factory

Keywords:
pujan 0.33
piyush 0.33
perform 0.33
latur 0.33

Row no.: 1 
Content: trump freeze mn syrian recovery fund suggesting exit

Keywords:
suggesting 0.379
recovery 0.379
mn 0.379
exit 0.379

Row no.: 2 
Content: fadnavis piyush goyal perform bhoomi pujan latur rail coach factory

Keywords:
pujan 0.33
piyush 0.33
perform 0.33
latur 0.33

Row no.: 3 
Content: bhagalpur violence arijit shashwat surrender

Keywords:
surrender 0.461
arijit 0.461
shashwat 0.449
bhagalpur 0.449

Row no.: 4 
Content: bhagalpur violence arijit shashwat surrender

Keywords:
surrender 0.461
arijit 0.461
shashwat 0.449
bhagalpur 0.449

Row no.: 5 
Content: lawyer mp engaging judge impeachment barred practice court

Keywords:
practice 0.387
judge 0.387
impeachment 0.387
engaging 0.387


## In this way, you can pass in the row number of any statement and get its keywords

In [271]:
# if we want to find important keywords of statement at row number 2500

doc=data[2500]
tf_idf_vector=tfidf_transformer.transform(cv.transform([doc]))

sorted_items=sort_coo(tf_idf_vector.tocoo())
keywords=top_vector(feature_names,sorted_items,3)
 
print("\ncontent:", doc)
print("\nKeywords:")
for i in keywords:
    print(i,keywords[i])


content: defexpo pm aim transform defence production procurement

Keywords:
transform 0.413
production 0.413
procurement 0.413
