## Web Mining : Information Retrival and Natural Language Processing

### Extracting most frequent bigrams based on TF-IDF Scores

### Problem Statment
Consider some results from tutorial sites by giving a query say “ tutorials on data structure” or result from any other repository. Store the results. The files are expected to be of type DOC ,pdf,HTML (ignore other files), use document parser, pdf parser and html parser to parse the contents in form of : text tables and images. 
* Divide the text into sentences and words.
* Remove stop words.
* Give a weight to each word based on following criteria : tf,tfX idf, df,tfX df. Rank the words on the basis of these weights and identify top n words as keywords . Compare and analyse the result and comment on the result.

### Step 1 : Importing all necessary libraries

In [173]:
import numpy as np
import nltk
import pandas as pd
import re
import urllib
import bs4
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import scipy

### Step 2 : Parse a website from "Tutorials on Data structures" and process the content

In [100]:
site1=urllib.request.urlopen("https://www.javatpoint.com/data-structure-introduction").read()

In [101]:
source_site1=bs4.BeautifulSoup(site1,'lxml')
print((source_site1))

<!DOCTYPE html>
<html lang="en"><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><title>DS Introduction - javatpoint</title><link href="https://static.javatpoint.com/images/favicon2.png" rel="SHORTCUT ICON"/>
<link href="https://static.javatpoint.com/link.css" rel="stylesheet" type="text/css"/><link href="https://clients1.google.com" rel="dns-prefetch"/><link href="https://static.javatpoint.com" rel="dns-prefetch"/><link href="https://googleads.g.doubleclick.net" rel="dns-prefetch"/><link href="https://www.google.com" rel="dns-prefetch"/><link href="https://feedify.net" rel="dns-prefetch"/><meta content="#4CAF50" name="theme-color"/><meta content="DS Introduction - javatpoint" property="og:title"/><meta content="DS Introduction with Introduction, Asymptotic Analysis, Array, Pointer, Structure, Singly Linked List, Doubly Linked List, Circular Linked List, Binary Search, Linear Search, Sorting, Bucket Sort, Comb Sort, Shell Sort, Heap Sort, Merge Sort, Selection 

In [37]:
data_site1=[]
for paragraph in source_site1.find_all('p'):
    data_site1.append(paragraph.string)
#deleting last 75 useless text
data_site1=(data_site1[0:31:1])
    #print(len(data_site1[-1:-76:-1]))
print(data_site1)

['Data Structure can be defined as the group of data elements which provides an efficient way of storing and organising data in the computer so that it can be used efficiently. Some examples of Data Structures are arrays, Linked List, Stack, Queue, etc. Data Structures are widely used in almost every aspect of Computer Science i.e. Operating System, Compiler Design, Artifical intelligence, Graphics and many more.', " Data Structures are the main part of many computer science algorithms as they enable the programmers to handle the data in an efficient way. It plays a vitle role in enhancing the performance of a software or a program as the main function of the software is to store and retrieve the user's data as fast as possible ", ' Data structures are the building blocks of any program or the software. Choosing the appropriate data structure for a program is the most difficult task for a programmer. Following terminology is used as far as data structures are concerned', None, None, No

In [143]:
new=[]
new.append(source_site1.find_all('p'))
new_data=str(new)

### Step 3: Preprocessing the content
* Converting to lower case
* Removing stop words
* Removing Punctuations
* Removing apostrophe
* Stemming
* Removing tags

In [150]:
def lower_case(data):
    return np.char.lower(data)

def remove_stopwords(data):
    stop_words=nltk.corpus.stopwords.words('english')
    #print(stop_words)
    new_data=""
    word_tokens=nltk.tokenize.word_tokenize(str(data))
    for w in word_tokens:
        if w not in stop_words:
            new_data=new_data+" "+w
    return new_data

def remove_punctuations(data):
    symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n"
    for i in range(len(symbols)):
        data = np.char.replace(data, symbols[i], ' ')
        data = np.char.replace(data, "  ", " ")
    data = np.char.replace(data, ',', '')
    return data

def remove_apostrophe(data):
    return np.char.replace(data, "'", "")

def stemming(data):
    stemmer= nltk.stem.PorterStemmer()
    
    tokens = nltk.tokenize.word_tokenize(str(data))
    new_text = ""
    for w in tokens:
        new_text = new_text + " " + stemmer.stem(w)
    return new_text
def clean_tag(data):
    data=((re.sub(r'[<p>|</p>|]',r'',data)))
    return data

In [151]:
def preprocessing_data(data):
    data=clean_tag(data)
    data=lower_case(data)
    data=remove_punctuations(data)
    data=remove_apostrophe(data)
    data=stemming(data)
    #Once more we need to remove the punctuations
    data=remove_punctuations(data)
    data=remove_stopwords(data)
    return data

In [156]:
cleaned_data=preprocessing_data(new_data)
#new_data=((re.sub(r'[<p>|</p>|]',r'',new_data)))
cleaned_list=[]
cleaned_list.append(cleaned_data)
print(type(cleaned_list))

<class 'list'>


In [66]:
#u_grams=cleaned_data.split(" ")
#print(type(u_grams))

### Step 4: Bigram Vectorization

In [190]:
#Bigram vectorization
svm=CountVectorizer(stop_words='english',ngram_range=(2,2)) 
final_out=svm.fit_transform(cleaned_list)
#Converting the sparse matrix to dense matrix
scipy.sparse.csr_matrix.todense(final_out)

matrix([[ 1,  3,  2,  1,  1,  1,  1,  1,  1,  1,  1,  2,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  2,  2,  1,  1,  1,  1,  2,  1,  1,  1,
          1,  1,  5,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  2,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  2,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  2,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  2,  2,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  2,  1,  1,  1,  1,  1,  1,  1,  2,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  4,  1,  1,  1,  1,  1,  4,  1,
          1,  1,  1,  1, 37,  1,  1,  1,  1,  1,  2,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  2,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  2,  1,  1,  1,  1,  1,  1,  1,  1,
          2,  2,  2,  1,  1,  5,  1,  1,  1,  1,  1

### Step 5 : Converting to the TF-IDF Vector Space Model

In [191]:
tf_idf_vect=TfidfVectorizer(ngram_range=(2,2))
final_tf_idf=tf_idf_vect.fit_transform(cleaned_list)
print(type(final_tf_idf))
#converting the sparse matrix to dense matrix
tf_idf_mat=scipy.sparse.csr_matrix.todense(final_tf_idf)
tf_idf_mat.shape

<class 'scipy.sparse.csr.csr_matrix'>


(1, 843)

In [192]:
#just to visualise better in Dataframe
features=tf_idf_vect.get_feature_names()
df = pd.DataFrame(tf_idf_mat, 
                  columns=features, 
                  index=['site1'])
df

Unnamed: 0,0120 4256464,0x 20x,106 item,13 2nd,1strong data,20 record,2011 2018,201301 india,2018 www,20x 0x,...,websit design,websit develo,week weekbr,weekbr websit,wherea node,wide use,within data,without get,world stack,www javatoint
site1,0.019905,0.059714,0.039809,0.019905,0.019905,0.019905,0.019905,0.019905,0.019905,0.019905,...,0.019905,0.019905,0.019905,0.019905,0.019905,0.019905,0.019905,0.019905,0.019905,0.019905


In [193]:
(final_tf_idf.get_shape())
final_tf_idf2=final_tf_idf.toarray()
print(type(final_tf_idf2))
print(shape)

<class 'numpy.ndarray'>
(1, 843)


### Step 6 : Ranking top Bigrams based on TF-Idf Occourance
I have used bigrams instead of unigrams becasue it is better and efficient to deal with bi-words (ex. : Data Structures, New York etc)

In [184]:
#to get top tf_idf values:
#source : https://buhrmann.github.io/tfidf-analysis.html

def top_tf_idf_feat(row,features,top_n):
    topn_ids=np.argsort(raw)[::-1][:top_n]
    #print(topn_ids)
    #type(topn_ids)
    top_feats = [(features[i], raw[i]) for i in topn_ids]
    df=pd.DataFrame(top_feats)
    df.columns=['Features','if_idf_score']
    return (df)

In [186]:
raw=final_tf_idf[0,:].toarray()[0]
top_tfidf=top_tf_idf_feat(raw,features,10)
#print(type(final_tf_idf[1,:].toarray()[0]))
print(top_tfidf)

        Features  if_idf_score
0  data structur      0.736473
1        age age      0.099523
2    strong data      0.099523
3    linear data      0.099523
4      data item      0.079619
5   element data      0.079619
6   data element      0.079619
7       end call      0.059714
8      list list      0.059714
9  javatoint com      0.059714


### Conclusion or Comments : 

As we have analysed by taking a website content on search "Tutorials on Data Structures", Preprocessed it and convert it into Vector Space Model. By taking the Tf-idf Vectorization we can conclude that the bigram 'data structure' occures the most of time with having tf-idf score of 0.736473. And Similarly we have the top ten bigrams with their tf-idf scores.