<a href="https://colab.research.google.com/github/IfrazQazi/Web-Scraping-and-Sentiment-Analysis-/blob/main/Web_scraping_and_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Objective**
The objective of this assignment is to extract the article text from each of the given URLs and to perform text analysis that computes the variables explained below.

##**Data Extraction**


For each of the articles listed in the input.xlsx file, extract the article text and save it in a text file with the URL_ID as its file name.
While extracting text, make sure the program extracts only the article title and the article text. It should not extract the website header, footer, or anything other than the article text.
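
A minimal sketch of the saving step, assuming the title and body have already been extracted. The function name `save_article` and the `articles` output folder are hypothetical and do not appear in the notebook.

```python
from pathlib import Path

def save_article(url_id, title, body, out_dir="articles"):
    """Write one article to <out_dir>/<URL_ID>.txt and return the path."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # URL_ID is read from Excel as a float (1.0), so cast it before naming the file
    path = folder / f"{int(url_id)}.txt"
    path.write_text(f"{title}\n\n{body}", encoding="utf-8")
    return path
```

Casting the ID to `int` keeps file names such as `1.txt` instead of `1.0.txt`.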

##**Data Analysis**
For each extracted article text, perform textual analysis and compute the variables listed below.
##**Variables**

* POSITIVE SCORE
* NEGATIVE SCORE
* POLARITY SCORE
* SUBJECTIVITY SCORE
* AVG SENTENCE LENGTH
* PERCENTAGE OF COMPLEX WORDS
* FOG INDEX
* AVG NUMBER OF WORDS PER SENTENCE
* COMPLEX WORD COUNT
* WORD COUNT
* SYLLABLE PER WORD
* PERSONAL PRONOUNS
* AVG WORD LENGTH



## Let's understand the variables


##**Sentiment Analysis**
Sentiment analysis is the process of determining whether a piece of writing is positive, negative or neutral.

* **POSITIVE SCORE**
* **NEGATIVE SCORE**

##**POLARITY SCORE**
Polarity lies in the range [-1, 1]: -1 indicates a negative sentiment and 1 indicates a positive sentiment.

##**SUBJECTIVITY SCORE**
Subjectivity quantifies how much of the text is personal opinion rather than factual information.
A higher subjectivity score means that the text contains more personal opinion and less factual information.

##**Analysis of Readability**
Readability is calculated using the Gunning Fog index formula described below.

* **Average Sentence Length** = the number of words / the number of sentences

* **Percentage of Complex words** = the number of complex words / the number of words

* **Fog Index** = 0.4 * (Average Sentence Length + Percentage of Complex words)

*  **Fog Index :**

  The fog index is commonly used to confirm that text can be read easily by the intended audience. Texts for a wide audience generally need a fog index less than 12. Texts requiring near-universal understanding generally need an index less than 8.

## **Average Number of Words Per Sentence**
The formula is:

**Average Number of Words Per Sentence** = the total number of words / the total number of sentences

## **Complex Word Count**
Complex words are words in the text that contain more than two syllables.

## **Word Count**
We count the total cleaned words present in the text by

1. removing the stop words (using the stopwords corpus of the nltk package).
2. removing punctuation marks like ? ! , . from the words before counting.



## **Syllable Count Per Word**
We count the number of syllables in each word of the text by counting the vowels present in each word. We also handle some exceptions, such as words ending with "es" or "ed", by not counting those endings as a syllable.

## **Personal Pronouns**
To count the personal pronouns mentioned in the text, we use a regex to find occurrences of the words

* “I,” “we,” “my,” “ours,” and “us”. Special care is taken so that the country name US is not counted.
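
A small illustration of how a case-sensitive pattern with word boundaries keeps "US" out of the count. The pattern is the one the notebook compiles in its scraping loop; the sample sentence is made up for this sketch.

```python
import re

# word boundaries (\b) prevent matches inside longer words such as "economy";
# the match is case-sensitive, so the country name "US" is not matched by "us"
pronoun_regex = re.compile(r'\bI\b|\bwe\b|\bWe\b|\bmy\b|\bMy\b|\bours\b|\bus\b')

sample = "We think the US economy helps us, and I agree."
matches = pronoun_regex.findall(sample)
print(matches)       # ['We', 'us', 'I']
print(len(matches))  # 3
```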

##**Average Word Length:**
Average Word Length is calculated by the formula:

* Sum of the number of characters in each word / Total number of words
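
The formula can be sketched as a small helper. The name `average_word_length` and the sample words are assumptions made for illustration only.

```python
def average_word_length(words):
    """Total characters across all words divided by the number of words."""
    if not words:
        return 0.0  # avoid dividing by zero for an empty text
    return sum(len(w) for w in words) / len(words)

# (4 + 7 + 5) characters over 3 words
print(round(average_word_length(["data", "science", "rocks"]), 2))  # 5.33
```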


## First, let's import the libraries

In [1]:
## importing libraries
import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup 

In [2]:
import nltk                 ## importing nltk library
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
from nltk.corpus import stopwords

## Creating some functions which we will use later

In [4]:
# extracting the stopwords from nltk library
sw = set(stopwords.words('english'))  # a set gives fast membership tests

In [5]:
def stopwords(text):
    '''a function for removing the stop words'''
    # note: this name shadows the nltk `stopwords` import, so `sw` must be built first
    # removing the stop words and lowercasing the remaining words
    text = [word.lower() for word in text.split() if word.lower() not in sw]
    # joining the list of words with a space separator
    return " ".join(text)

In [6]:

nltk.download('punkt')
from nltk.tokenize import sent_tokenize

def number_of_sentences(text):
    '''a function for counting the sentences in the text'''
    return len(sent_tokenize(text))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [7]:
def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    # replacing the punctuations with no space, 
    # which in effect deletes the punctuation marks 
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)

In [8]:
df= pd.read_excel('/content/drive/MyDrive/Web_scraping/Input.xlsx') ## read excel file

In [9]:
df.head()

Unnamed: 0,URL_ID,URL
0,1.0,https://insights.blackcoffer.com/how-is-login-...
1,2.0,https://insights.blackcoffer.com/how-does-ai-h...
2,3.0,https://insights.blackcoffer.com/ai-and-its-im...
3,4.0,https://insights.blackcoffer.com/how-do-deep-l...
4,5.0,https://insights.blackcoffer.com/how-artificia...


In [10]:
df.tail()

Unnamed: 0,URL_ID,URL
165,167.0,https://insights.blackcoffer.com/role-big-data...
166,168.0,https://insights.blackcoffer.com/sales-forecas...
167,169.0,https://insights.blackcoffer.com/detect-data-e...
168,170.0,https://insights.blackcoffer.com/data-exfiltra...
169,171.0,https://insights.blackcoffer.com/impacts-of-co...


In [11]:
import re

In [12]:
j = 0
PERSONAL_PRONOUNS = []
lengths = []      # sentence count per article
word_count = []   # cleaned word count per article
text = []         # stop-word-free text per article (punctuation kept)

## personal pronouns; the match is case-sensitive, so the country name "US" is skipped
pronounRegex = re.compile(r'\bI\b|\bwe\b|\bWe\b|\bmy\b|\bMy\b|\bours\b|\bus\b')

for i in df.URL:
    j += 1
    globals()['r%s' % j] = requests.get(i, headers={"User-Agent": "XY"}).text

    globals()['soup%s' % j] = BeautifulSoup(globals()['r%s' % j], 'lxml')

    ## extracting the text of the article
    globals()['pra%s' % j] = globals()['soup%s' % j].find('div', class_='td-post-content').text

    ## counting the personal pronouns
    globals()['pronouns%s' % j] = pronounRegex.findall(globals()['pra%s' % j])
    PERSONAL_PRONOUNS.append(len(globals()['pronouns%s' % j]))

    ## removing stop words (this also lowercases the text)
    globals()['pra%s' % j] = stopwords(globals()['pra%s' % j])

    ## counting sentences while the punctuation is still present
    count_of_sentences = number_of_sentences(globals()['pra%s' % j])
    lengths.append(count_of_sentences)

    text.append(globals()['pra%s' % j])

    ## removing punctuation
    globals()['pra%s' % j] = remove_punctuation(globals()['pra%s' % j])
    words = re.sub(r'[^\w\s]', '', globals()['pra%s' % j])

    ## counting the remaining words with a regex
    res = len(re.findall(r'\w+', words))
    word_count.append(res)


In [13]:
df['sentence_count']=lengths

In [14]:
df['URL_ID'] = df['URL_ID'].astype(int)  ## converting the float IDs to integers

##**Word Count**

We count the total cleaned words present in the text by

1. removing the stop words (using the stopwords corpus of the nltk package).
2. removing punctuation marks like ? ! , . from the words before counting.
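
The two cleaning steps can be sketched on a single sentence. This is an illustration under assumptions: `STOP_WORDS` is a tiny hypothetical list standing in for nltk's English stop words, and `cleaned_word_count` is not a function from the notebook.

```python
import re
import string

# hypothetical mini stop-word list; the notebook uses nltk's English list instead
STOP_WORDS = {"the", "is", "a", "of", "and", "in"}

def cleaned_word_count(text):
    # step 1: drop stop words (comparison is done in lowercase)
    kept = [w.lower() for w in text.split() if w.lower() not in STOP_WORDS]
    # step 2: strip punctuation marks from the remaining words
    no_punct = " ".join(kept).translate(str.maketrans("", "", string.punctuation))
    # count what is left
    return len(re.findall(r"\w+", no_punct))

print(cleaned_word_count("The model is accurate, and the results are clear!"))  # 5
```

Because the stop words are removed before the punctuation, a token such as `the,` would survive the first step, which is also how the notebook's pipeline behaves.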

In [15]:
df['WORD COUNT'] = word_count

--------------------------------------------------------------------------------------------------------------------------------------------------------------
##**Sentiment Analysis**

* **POSITIVE SCORE**

* **NEGATIVE SCORE**
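
Both scores are plain counts of lexicon hits. A minimal sketch, assuming two tiny hypothetical word sets in place of nltk's `opinion_lexicon` and simple whitespace tokenization:

```python
# hypothetical mini lexicons, used here only to keep the example self-contained
POSITIVE_WORDS = {"good", "great", "improve"}
NEGATIVE_WORDS = {"bad", "risk", "fail"}

def sentiment_counts(text):
    """Return (positive_count, negative_count) for whitespace-separated tokens."""
    pos = neg = 0
    for word in text.lower().split():
        if word in POSITIVE_WORDS:
            pos += 1
        elif word in NEGATIVE_WORDS:
            neg += 1
    return pos, neg

print(sentiment_counts("great tools improve results but bad data is a risk"))  # (2, 2)
```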

In [16]:
nltk.download('opinion_lexicon')

[nltk_data] Downloading package opinion_lexicon to /root/nltk_data...
[nltk_data]   Unzipping corpora/opinion_lexicon.zip.


True

In [17]:
from nltk.corpus import opinion_lexicon

In [18]:
pos_list=set(opinion_lexicon.positive()) ## set of positive words
neg_list=set(opinion_lexicon.negative()) ## set of negative words

In [19]:
from nltk.tokenize import TreebankWordTokenizer #Tokenizers divide strings into lists of substrings
tokenizer =  TreebankWordTokenizer()

In [20]:
## counting the positive and negative words in each article
POSITIVE = []
NEGATIVE = []
for i in range(1, len(text) + 1):
    senti = 0
    neg_senti = 0
    words = [word.lower() for word in tokenizer.tokenize(globals()['pra%s' % i])]
    for word in words:
        if word in pos_list:
            senti += 1
        elif word in neg_list:
            neg_senti += 1
    POSITIVE.append(senti)
    NEGATIVE.append(neg_senti)

In [21]:
df['POSITIVE SCORE']=POSITIVE ## storing the number of positive and negative words
df['NEGATIVE SCORE']=NEGATIVE

##**POLARITY SCORE**
Polarity lies in the range [-1, 1]: -1 indicates a negative sentiment and 1 indicates a positive sentiment.

Polarity Score = (Positive Score - Negative Score) / ((Positive Score + Negative Score) + 0.000001)
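
As a worked example of the formula, the sketch below plugs in the positive and negative counts of the first and fourth articles (17 and 9, 14 and 0). The helper name `polarity_score` is an assumption for illustration.

```python
def polarity_score(positive, negative):
    """(P - N) / (P + N + 0.000001); the small constant avoids division by zero."""
    return (positive - negative) / ((positive + negative) + 0.000001)

print(round(polarity_score(17, 9), 6))  # 0.307692
print(round(polarity_score(14, 0), 6))  # 1.0
print(polarity_score(0, 0))             # 0.0
```

An article with no sentiment words at all gets a score of 0 rather than raising a division error.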

In [22]:
## calculating polarity score
df['POLARITY SCORE']= (df['POSITIVE SCORE']-df['NEGATIVE SCORE'])/((df['POSITIVE SCORE']+df['NEGATIVE SCORE'])+0.000001) # polarity score

In [23]:
df['POLARITY SCORE'].describe()

count    170.000000
mean       0.211395
std        0.400521
min       -0.866667
25%       -0.133882
50%        0.233362
75%        0.515178
max        1.000000
Name: POLARITY SCORE, dtype: float64

In [24]:
negative=[]
positive=[]
neutral=[]
for i in df['POLARITY SCORE']:
    if i <0:
        negative.append(i)
    elif i>0:
        positive.append(i)
    else:
        neutral.append(i)

In [25]:
print(f' share of negative sentiment {len(negative)/len(df)}\n',f'share of positive sentiment {len(positive)/len(df)}\n',
      f'share of neutral sentiment {len(neutral)/len(df)}')

 share of negative sentiment 0.3235294117647059
 share of positive sentiment 0.6705882352941176
 share of neutral sentiment 0.0058823529411764705


From the above output we can say that more than 67% of the articles have a positive sentiment, about 32% have a negative sentiment, and only one article is neutral.

##**SUBJECTIVITY SCORE**
Subjectivity Score = (Positive Score + Negative Score)/ ((Total Words after cleaning) + 0.000001)

Range is from 0 to +1
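
A worked example of the subjectivity formula, using the counts of the first article (17 positive words, 9 negative words, 432 cleaned words). The helper name `subjectivity_score` is an assumption for illustration.

```python
def subjectivity_score(positive, negative, word_count):
    """Share of sentiment-bearing words among all cleaned words."""
    return (positive + negative) / (word_count + 0.000001)

print(round(subjectivity_score(17, 9, 432), 6))  # 0.060185
```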

In [26]:
## calculating subjectivity score
df['SUBJECTIVITY SCORE'] = (df['POSITIVE SCORE']+df['NEGATIVE SCORE'])/((df['WORD COUNT'])+0.000001) # Subjectivity Score

In [27]:
df.head()

Unnamed: 0,URL_ID,URL,sentence_count,WORD COUNT,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE
0,1,https://insights.blackcoffer.com/how-is-login-...,24,432,17,9,0.307692,0.060185
1,2,https://insights.blackcoffer.com/how-does-ai-h...,28,384,27,3,0.8,0.078125
2,3,https://insights.blackcoffer.com/ai-and-its-im...,77,1086,82,21,0.592233,0.094843
3,4,https://insights.blackcoffer.com/how-do-deep-l...,15,254,14,0,1.0,0.055118
4,5,https://insights.blackcoffer.com/how-artificia...,57,706,53,13,0.606061,0.093484


Subjectivity quantifies how much of the text is personal opinion rather than factual information.
A higher subjectivity score means that the text contains more personal opinion and less factual information.

In [28]:
df['SUBJECTIVITY SCORE'].describe()

count    170.000000
mean       0.102234
std        0.031552
min        0.021798
25%        0.084920
50%        0.098744
75%        0.117735
max        0.205263
Name: SUBJECTIVITY SCORE, dtype: float64

From the above output we can conclude that the articles in this dataset are mostly factual rather than opinionated: the subjectivity score has a mean of about 0.10 and never exceeds 0.21.

### -----------------------------------------------------------------------------------------------------------------------------------------------------------

##**Analysis of Readability**
Readability is calculated using the Gunning Fog index formula described below.

Average Sentence Length = the number of words / the number of sentences

Percentage of Complex words = the number of complex words / the number of words

Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex words)
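
A worked example with the first article's numbers (432 words, 24 sentences, 324 complex words). The formulas above use the complex-word share as a fraction. The standard Gunning Fog formula expresses that share as a percentage (multiplied by 100), so the sketch shows both variants. The helper `fog_index` and its `as_percent` flag are hypothetical names.

```python
def fog_index(word_count, sentence_count, complex_word_count, as_percent=False):
    avg_sentence_length = word_count / sentence_count
    complex_share = complex_word_count / word_count
    if as_percent:
        # standard Gunning Fog scaling: share expressed in percent
        complex_share *= 100
    return 0.4 * (avg_sentence_length + complex_share)

print(round(fog_index(432, 24, 324), 2))                   # 7.5
print(round(fog_index(432, 24, 324, as_percent=True), 2))  # 37.2
```

The first value follows the formula as written in this notebook. The second shows how much the scaling changes the result.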

In [29]:
## calculating avg sentence length
df['AVG SENTENCE LENGTH'] = df['WORD COUNT'] / df['sentence_count'] 

In [30]:
df.head()

Unnamed: 0,URL_ID,URL,sentence_count,WORD COUNT,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH
0,1,https://insights.blackcoffer.com/how-is-login-...,24,432,17,9,0.307692,0.060185,18.0
1,2,https://insights.blackcoffer.com/how-does-ai-h...,28,384,27,3,0.8,0.078125,13.714286
2,3,https://insights.blackcoffer.com/ai-and-its-im...,77,1086,82,21,0.592233,0.094843,14.103896
3,4,https://insights.blackcoffer.com/how-do-deep-l...,15,254,14,0,1.0,0.055118,16.933333
4,5,https://insights.blackcoffer.com/how-artificia...,57,706,53,13,0.606061,0.093484,12.385965


In [31]:
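def syllable_count(text):
    '''a function for estimating the number of syllables in the text by counting vowels'''
    syllable = 0
    for words in text.split():
        # every vowel letter counts as one syllable
        for vowel in ['a', 'e', 'i', 'o', 'u']:
            syllable = syllable + words.count(vowel)
        # the endings 'es', 'ed' and a final 'e' are not counted as a syllable
        for ending in ['es', 'ed', 'e']:
            if words.endswith(ending):
                syllable = syllable - 1
        # a final 'le' is pronounced, so the syllable is added back
        if words.endswith('le'):
            syllable = syllable + 1
    return syllable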

In [32]:
def complex_word(sentences):
    '''a function for counting the complex words in the text'''
    number = 0
    for word in sentences.split():
        # note: the test is >= 2, so words with two or more syllables are counted;
        # "more than two syllables" would be > 2
        if syllable_count(word) >= 2:
            number += 1
    return number

In [33]:
complex_words=[]
for i in range(1, len(text) + 1):
    complex_=complex_word(globals()['pra%s' % i]) ## calculating complex words
    complex_words.append(complex_)

## **Complex Word Count**

Complex words are words in the text that contain more than two syllables.

In [34]:
df['COMPLEX WORD COUNT']=complex_words ### complex word count

In [35]:
df.head()

Unnamed: 0,URL_ID,URL,sentence_count,WORD COUNT,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,COMPLEX WORD COUNT
0,1,https://insights.blackcoffer.com/how-is-login-...,24,432,17,9,0.307692,0.060185,18.0,324
1,2,https://insights.blackcoffer.com/how-does-ai-h...,28,384,27,3,0.8,0.078125,13.714286,286
2,3,https://insights.blackcoffer.com/ai-and-its-im...,77,1086,82,21,0.592233,0.094843,14.103896,864
3,4,https://insights.blackcoffer.com/how-do-deep-l...,15,254,14,0,1.0,0.055118,16.933333,201
4,5,https://insights.blackcoffer.com/how-artificia...,57,706,53,13,0.606061,0.093484,12.385965,539


In [36]:
#Percentage of Complex words = the number of complex words / the number of words
df['PERCENTAGE OF COMPLEX WORDS'] = df['COMPLEX WORD COUNT'] / df['WORD COUNT']

In [37]:
#Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex words)
df['FOG INDEX'] = 0.4 * (df['AVG SENTENCE LENGTH'] + df['PERCENTAGE OF COMPLEX WORDS'])

##**Fog Index**
The fog index is commonly used to confirm that text can be read easily by the intended audience. Texts for a wide audience generally need a fog index less than 12. Texts requiring near-universal understanding generally need an index less than 8.
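
The two thresholds can be turned into a small classifier. This is a sketch: the function name `readability_band` and the band labels are made up for illustration, and only the cut-offs 8 and 12 come from the text above.

```python
def readability_band(fog_index):
    """Map a fog index to a rough audience band using the 8 and 12 cut-offs."""
    if fog_index < 8:
        return "near-universal"
    if fog_index < 12:
        return "wide audience"
    return "harder to read"

for value in (5.88, 10.0, 42.76):
    print(value, readability_band(value))
```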

In [38]:
df['FOG INDEX'].describe()

count    170.000000
mean       5.883277
std        3.229066
min        3.023622
25%        4.739380
50%        5.522036
75%        6.362888
max       42.762947
Name: FOG INDEX, dtype: float64

In [39]:
high_readability=[]
low_readability=[]
for i in df['FOG INDEX']:
    if i < 12:      # below 12: readable by a wide audience
        high_readability.append(i)
    else:
        low_readability.append(i)

In [40]:
print(f' articles with high readability :{len(high_readability)/len(df)}\n',f'articles with low readability :{len(low_readability)/len(df)}')

 articles with high readability :0.9882352941176471
 articles with low readability :0.011764705882352941


##From the above output we can say that more than 98% of the articles in this dataset are high readability, which means the text can be read by a wide audience
-----------------------------------------------------------------------------------------------------------------------------------------------------------

##**Average Number of Words Per Sentence**

The formula for calculating is:

Average Number of Words Per Sentence = the total number of words / the total number of sentences

In [41]:
df['AVG NUMBER OF WORDS PER SENTENCE'] =  df['WORD COUNT'] / df['sentence_count']

In [42]:
df.head()

Unnamed: 0,URL_ID,URL,sentence_count,WORD COUNT,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,COMPLEX WORD COUNT,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,AVG NUMBER OF WORDS PER SENTENCE
0,1,https://insights.blackcoffer.com/how-is-login-...,24,432,17,9,0.307692,0.060185,18.0,324,0.75,7.5,18.0
1,2,https://insights.blackcoffer.com/how-does-ai-h...,28,384,27,3,0.8,0.078125,13.714286,286,0.744792,5.783631,13.714286
2,3,https://insights.blackcoffer.com/ai-and-its-im...,77,1086,82,21,0.592233,0.094843,14.103896,864,0.79558,5.95979,14.103896
3,4,https://insights.blackcoffer.com/how-do-deep-l...,15,254,14,0,1.0,0.055118,16.933333,201,0.791339,7.089869,16.933333
4,5,https://insights.blackcoffer.com/how-artificia...,57,706,53,13,0.606061,0.093484,12.385965,539,0.763456,5.259768,12.385965


##**Syllable Count Per Word**

We count the number of Syllables in each word of the text by counting the vowels present in each word.

We also handle some exceptions like words ending with "es","ed" by not counting them as a syllable.
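
The heuristic is easier to follow on single words. The sketch below applies the same vowel-counting rules to one lowercase word at a time. The name `syllables_in_word` is hypothetical, and the result is an approximation rather than a true syllable count.

```python
def syllables_in_word(word):
    """Vowel-count heuristic for one lowercase word (an approximation)."""
    count = sum(word.count(vowel) for vowel in "aeiou")
    # endings that are not counted as a syllable
    for ending in ("es", "ed", "e"):
        if word.endswith(ending):
            count -= 1
    # a final "le" is pronounced, so add the syllable back
    if word.endswith("le"):
        count += 1
    return count

for w in ("table", "jumped", "analysis"):
    print(w, syllables_in_word(w))
# table 2
# jumped 1
# analysis 3
```

"analysis" is reported as 3 although it is usually pronounced with four syllables, because "y" is not treated as a vowel here.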

In [43]:
syllables_count=[]             ## total syllable count of each article
for i in range(1,len(text)+1):
    syllables=syllable_count(globals()['pra%s' % i])
    syllables_count.append(syllables)

In [44]:
df['SYLLABLE PER WORD'] = syllables_count  ## total syllables in each article; divide by df['WORD COUNT'] for a per-word average