# Data Extraction & Text Analysis - Blackcoffer Data Science Assignment

### Submitted by Manjiri H. Sawant

#### **Objectives:**

The objective of this assignment is to extract textual data articles from the given URL and perform text analysis to compute variables.

1. `Data Extraction`
2. `Data Analysis`

`Tool Used`: **Python Jupyter Lab**

## Data Extraction

1. `BeautifulSoup` - Format and Scrap the data from the HTML
2. `Selenium` - Controlling web browser through programs

**Steps**

1. Identify URL
2. Inspect HTML code
3. Find the HTML tag for the element that you want to extract.
4. Write some code to scrap this data

## Data Analysis

`Look for these variables in the analysis:`

1.	POSITIVE SCORE
2.	NEGATIVE SCORE
3.	POLARITY SCORE
4.	SUBJECTIVITY SCORE
5.	AVG SENTENCE LENGTH
6.	PERCENTAGE OF COMPLEX WORDS
7.	FOG INDEX
8.	AVG NUMBER OF WORDS PER SENTENCE
9.	COMPLEX WORD COUNT
10.	WORD COUNT
11.	SYLLABLE PER WORD
12.	PERSONAL PRONOUNS
13.	AVG WORD LENGTH

#### Import Necessary Libraries

##### The following code written in Python 3.x. Libraries provide pre-written functionally to perform necessary tasks.

## 1. Data Extraction

* For each of the articles, given in the `input.xlsx` file, extract the article text and save the extracted article in a text file with URL_ID as its file name.
* While extracting text, please make sure your program extracts only the article title and the article text. It should not extract the website header, footer, or anything other than the article text. 

In [1]:
# Imported Required Libraries

import time
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options as FirefoxOptionsx

In [2]:
# Load files

df = pd.read_excel("Input.xlsx")

In [3]:
# Here 1st Column will be extracted ie 'URL_ID'
# Converting specific df columns

u_id = df["URL_ID"].astype(int).values.tolist()

In [4]:
# Here 2nd Column will be extracted ie 'URL'
# Converting specific df columns

u_l = df["URL"].values.tolist()

In [5]:
completed = []

Identify below features based on them and try to scrape out the relevant data from website :
* article title ='?'
* article text = '?'
* Article Title = **`h1`** `class - 'entry-title'`
* Article Text = **`div`** `class  - 'td-post-content'`

**Tags are very important while scraping data from particular website.**

**soup.find()** is used for finding out the first tag with the specified name or id and 
returning an object of type bs4.

* Through WebDriver, Selenium supports all major browsers such as Chrome/Chromium, Firefox, Internet Explorer, Edge, Opera, and Safari. 
* WebDriver drives the browser using the browser’s built-in support for automation.

* **Gecko driver** links between selenium test and the Firefox browser keep **headless** to access any website easily nothing will appear on screen. 
* Everything is done on the backend side.

In [36]:
# %%time

# URL as an input 
# Iterating till the range and export as txt file


time_out = time.time() + 60*1 # 1 min from now

for ele in u_l:
    try:
        options = FirefoxOptions()
        options.add_argument("--headless")
        driver = webdriver.Firefox(options=options)
        
        
        driver.get(ele)
        content = driver.page_source
    
    finally:
        try:
            driver.close()
        except:
            pass
        

    soup = BeautifulSoup(content, features = "html.parser")

    dataset = []
    
    #title
    title = soup.find("h1", attrs = {"class" : "entry-title"})
    if title is None:
        dataset.append(np.NaN)
    else:
        dataset.append(title.text)
    
    
    #post
    post = soup.find("div", attrs = {"class": "td-post-content"})
    if post is None:
        dataset.append(np.NaN)
    else:
        dataset.append(post.text)


    
    x = u_id[0]
    print(x)
    

    with open('F:/dataset/{}.txt'.format(x), 'w+', encoding='utf-8') as t:
        for items in dataset:
            t.write('%s\n' %items)
    t.close()
    

    completed.append(ele)
    
#     print(completed)
    
    print("File written successfully")
    
    

    u_id.pop(0)
    
    if time.time() > time_out:
        break
        
for ele in completed:
    if ele in u_l:
        u_l.remove(ele)        

150
File written successfully


# 2. Text Analysis

Perform text analysis to drive sentimental opinion, sentiment scores, readability, passive words, personal pronouns and etc.

### Sentiment Analysis

Sentimental analysis is the process of determining whether a piece of writing is positive, negative, or neutral.

In [67]:
# Import required libraries

import os
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import sent_tokenize

In [68]:
# get the list of all text files

path = "C:/Users/User/dataset/"

dir_list = os.listdir(path)

print("Files and directories in '", path, "' :")

# prints all files

# print(dir_list)

Files and directories in ' C:/Users/User/dataset/ ' :


In [69]:
# Python program to sort a list of strings
# Using sort() function with key as len sort 
# sort within the given list based on the length of the element present withing

dir_list.sort(key = len)

In [70]:
print(dir_list)

['37.txt', '38.txt', '39.txt', '40.txt', '41.txt', '42.txt', '43.txt', '44.txt', '45.txt', '46.txt', '47.txt', '48.txt', '49.txt', '50.txt', '51.txt', '52.txt', '53.txt', '54.txt', '55.txt', '56.txt', '57.txt', '58.txt', '59.txt', '60.txt', '61.txt', '62.txt', '63.txt', '64.txt', '65.txt', '66.txt', '67.txt', '68.txt', '69.txt', '70.txt', '71.txt', '72.txt', '73.txt', '74.txt', '75.txt', '76.txt', '77.txt', '78.txt', '79.txt', '80.txt', '81.txt', '82.txt', '83.txt', '84.txt', '85.txt', '86.txt', '87.txt', '88.txt', '89.txt', '90.txt', '91.txt', '92.txt', '93.txt', '94.txt', '95.txt', '96.txt', '97.txt', '98.txt', '99.txt', '100.txt', '101.txt', '102.txt', '103.txt', '104.txt', '105.txt', '106.txt', '107.txt', '108.txt', '109.txt', '110.txt', '111.txt', '112.txt', '113.txt', '114.txt', '115.txt', '116.txt', '117.txt', '118.txt', '119.txt', '120.txt', '121.txt', '122.txt', '123.txt', '124.txt', '125.txt', '126.txt', '127.txt', '128.txt', '129.txt', '130.txt', '131.txt', '132.txt', '133.t

**The Master Dictionary Found Here:**
* https://sraf.nd.edu/loughranmcdonald-master-dictionary/
* https://drive.google.com/file/d/17CmUZM9hGUdGYjCXcjQLyybjTrcjrhik/view

In [71]:
# Create Pandas Dataframe
# change the string to lowercase in Pandas Dataframe using df['column name'].str.lower()
# Python zip() method takes iterable or containers and 
# returns a single iterator object, having mapped values from all the containers. 

df = pd.read_csv('Loughran-McDonald_MasterDictionary_1993-2021.csv')

sydict = dict(zip(df.Word.str.lower(),df.Syllables))

In [72]:
def count(text):
    num_of_words = 0
    lines = text.split()
    for w in lines:
        if not w.isnumeric():
            num_of_words += 1
    return num_of_words

## Word Count 

`We count the total cleaned words present in the text by:`
1.	removing the stop words (using stopwords class of nltk package).
2.	removing any punctuations like ? ! , . from the word before counting.

**The StopWords list Found Here:**
* https://sraf.nd.edu/textual-analysis/stopwords/
* https://drive.google.com/file/d/0B4niqV00F3msSktONVhfaElXeEk/view?resourcekey=0-3hFK5VYPXA7R_Q2LvA-SOw

In [73]:
stopwords = open('C:/Users/User/Case Study/blackcoffer assignment/files/StopWords_GenericLong.txt').read().lower()

In [74]:
# simple python function 1
# total cleaned word present in the text


def clean_word(text):
    filtered = []
    
    tokenizer = RegexpTokenizer(r'\w+')
    clean = tokenizer.tokenize(text)
    
    for w in clean:
        if w not in stopwords:
            if w.isalpha():
                filtered.append(w)
    return filtered

In [75]:
# simple python function 2
# count total cleaned word present in the text


def word_count(text):
    filtered = []
    
    tokenizer = RegexpTokenizer(r'\w+')
    clean = tokenizer.tokenize(text)
    
    for w in clean:
        if w not in stopwords:
            if w.isalpha():
                filtered.append(w)
    return len(filtered)

## Positive Score

**This file contains a list of POSITIVE opinion words (or sentiment words).**

* This file and the papers can all be downloaded from 
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

If you use this list, please cite one of the following two papers:
   *  Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International
       Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA, 
   *   Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web." Proceedings of the 14th International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.

In [76]:
# positve words
# opening the file in read mode
file1 = open('E:/Blackcoffer case study/positive-words.txt','r')

# reading the file
data1 = file1.read()

# splitting the text 
p_word = data1.split()
# print(p_word)

In [77]:
def positive_score(text):
    values = []
    for item in text:
        if item in p_word:
            score = +1
            values.append(score)
    return sum(values)

## Negative Score

**Dictionary of Negative Words Found Here :**

* This file contains a list of NEGATIVE opinion words (or sentiment words).

* This file and the papers can all be downloaded from:
    
  http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

If you use this list, please cite one of the following two papers:
   *  Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International
       Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA, 
   *   Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web." Proceedings of the 14th International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.

https://gist.github.com/mkulakowski2/4289441

In [78]:
# negative words

#opening the file in read mode
file2 = open('E:/Blackcoffer case study/negative-words.txt','r')

#reading the file
data2 = file2.read()

#spliting the text
n_word = data2.split()
# print(n_word)

In [79]:
def negative_score(text):
    values = []
    for item in text:
        if item in n_word:
            score = -1
            values.append(score)
    final = sum(values) * -1
    return final

## Polarity Score

This is the score that determines if a given text is positive or negative in nature. 

It is calculated by using the formula: 
**Polarity Score** = (Positive Score – Negative Score)/ ((Positive Score + Negative Score) + 0.000001)

`Range is from -1 to +1`


## Subjectivity Score

This is the score that determines if a given text is objective or subjective. 

It is calculated by using the formula: 

**Subjectivity Score** = (Positive Score + Negative Score)/ ((Total Words after cleaning) + 0.000001)

`Range is from 0 to +1`


In [80]:
%%time

cl_w = []         # total number of clean words
score_1 = []      # Positive Score
score_2 = []      # Negative Score
score_3 = []      # Polarity Score
score_4 = []      # Subjectivity Score


for x in dir_list:
    #absolute path is given
    article = open(("C:/Users/User/dataset/{}".format(x)), encoding = "mbcs").read().lower()
    
    # Clean Words
    f = clean_word(article)
   
    
    # Clean Word Count
    m = word_count(article)
    cl_w.append(m)
   
    
    # Positive Score
    ps = positive_score(f)
    score_1.append(ps)
   
   
    # Negative Score
    ns = negative_score(f)
    score_2.append(ns)
   
   
    # Polarity Score
    w = round(((ps-ns)/((ps+ns) + 0.000001)),2)
    score_3.append(w)
    
    # Subjectivity Score
    v = round(((ps + ns)/(m + 0.000001)),2)
    score_4.append(v)
    
    
    print(x)
    print('Successful')


37.txt
Successful
38.txt
Successful
39.txt
Successful
40.txt
Successful
41.txt
Successful
42.txt
Successful
43.txt
Successful
44.txt
Successful
45.txt
Successful
46.txt
Successful
47.txt
Successful
48.txt
Successful
49.txt
Successful
50.txt
Successful
51.txt
Successful
52.txt
Successful
53.txt
Successful
54.txt
Successful
55.txt
Successful
56.txt
Successful
57.txt
Successful
58.txt
Successful
59.txt
Successful
60.txt
Successful
61.txt
Successful
62.txt
Successful
63.txt
Successful
64.txt
Successful
65.txt
Successful
66.txt
Successful
67.txt
Successful
68.txt
Successful
69.txt
Successful
70.txt
Successful
71.txt
Successful
72.txt
Successful
73.txt
Successful
74.txt
Successful
75.txt
Successful
76.txt
Successful
77.txt
Successful
78.txt
Successful
79.txt
Successful
80.txt
Successful
81.txt
Successful
82.txt
Successful
83.txt
Successful
84.txt
Successful
85.txt
Successful
86.txt
Successful
87.txt
Successful
88.txt
Successful
89.txt
Successful
90.txt
Successful
91.txt
Successful
92.txt
Suc

In [47]:
# print(cl_w)
# print(score_1)
# print(score_2)
# print(score_3)
# print(score_4)

## Average Number of Words Per Sentence

`The formula for calculating is:`

**Average Number of Words Per Sentence** = the total number of words / the total number of sentences

In [81]:
# simple python function
# calculate average sentence length

def avg_sent_len(text):
    s = sent_tokenize(text)
    ns = len(s)
    asl = round((k/ns),2)
    return asl

## Complex Word Count

`Complex words are words in the text that contain more than two syllables.`

In [82]:
# simple python function 4
# count complex word which has more than two syllables

def complex_word_count(t_dict,t_lst):
    complx = []
    for w in t_lst:
        for key,val in t_dict.items():
            if w == key:
                if val > 2:
                    complx.append(w)
    return(len(complx))

## Analysis of Readability

Analysis of Readability is calculated using the Gunning Fox index formula described below.

* **Average Sentence Length** = the number of words / the number of sentences
* **Percentage of Complex words** = the number of complex words / the number of words 
* **Fog Index** = 0.4 * (Average Sentence Length + Percentage of Complex words)


* **The Gunning Fog** formula generates a grade level between `0 and 20.` 
* It estimates the education level required to understand the text.

In [84]:
%%time

num = []    # Total Number of Words
hard = []   # Complex Word Count
sent = []   # Average Sentence Length
percent = []  # Percentage of Complex Words
fog_index = [] # Fog Index


for x in dir_list:
    article = open(("C:/Users/User/dataset/{}".format(x)), encoding = "mbcs").read().lower()
    
    # Clean Words
    f = clean_word(article)
    
    # Number of Words
    k = count(article)
    num.append(k)
    
    # Complex Word Count
    cw = complex_word_count(sydict,f)
    hard.append(cw)
    
    # Average Sentence Length
    al = avg_sent_len(article)
    sent.append(al)
    
    # Percentage of Complex Words
    pcw = round((cw/k),2)
    percent.append(pcw)
    
    # Fog Index
    fi = round((0.4* (al+pcw)),2)
    fog_index.append(fi)
    
    
    print(x)
    print('Successful')

37.txt
Successful
38.txt
Successful
39.txt
Successful
40.txt
Successful
41.txt
Successful
42.txt
Successful
43.txt
Successful
44.txt
Successful
45.txt
Successful
46.txt
Successful
47.txt
Successful
48.txt
Successful
49.txt
Successful
50.txt
Successful
51.txt
Successful
52.txt
Successful
53.txt
Successful
54.txt
Successful
55.txt
Successful
56.txt
Successful
57.txt
Successful
58.txt
Successful
59.txt
Successful
60.txt
Successful
61.txt
Successful
62.txt
Successful
63.txt
Successful
64.txt
Successful
65.txt
Successful
66.txt
Successful
67.txt
Successful
68.txt
Successful
69.txt
Successful
70.txt
Successful
71.txt
Successful
72.txt
Successful
73.txt
Successful
74.txt
Successful
75.txt
Successful
76.txt
Successful
77.txt
Successful
78.txt
Successful
79.txt
Successful
80.txt
Successful
81.txt
Successful
82.txt
Successful
83.txt
Successful
84.txt
Successful
85.txt
Successful
86.txt
Successful
87.txt
Successful
88.txt
Successful
89.txt
Successful
90.txt
Successful
91.txt
Successful
92.txt
Suc

In [59]:
# print(hard)
# print(sent)
# print(fog_index)

## Average Word Length

Average Word Length is calculated by the formula:

`Sum of the total number of characters in each word/Total number of words`

In [85]:
# Using regular Expression

def avg_word_length(text):
    tk1 = RegexpTokenizer("[\w']+") # tokenize word
    tk2 = RegexpTokenizer("[\w']")  # tokenize char in each word
    
    x = tk1.tokenize(text)
    z = []
    
    for item in x:
        y = tk2.tokenize(item)
        total = len(y)
        z.append(total)
    
    awl = round((sum(z)/k),0)
    
    return int(awl)

In [87]:
%%time

awl = []


for x in dir_list:
    article = open(("C:/Users/User/dataset/{}".format(x)), encoding = 'mbcs').read().lower()
    
        
    # Number of Words
    k = count(article)
    
    
    # Average Word Length
    a = avg_word_length(article)
    awl.append(a)
    
    
    print(x)
    print('Successful')

37.txt
Successful
38.txt
Successful
39.txt
Successful
40.txt
Successful
41.txt
Successful
42.txt
Successful
43.txt
Successful
44.txt
Successful
45.txt
Successful
46.txt
Successful
47.txt
Successful
48.txt
Successful
49.txt
Successful
50.txt
Successful
51.txt
Successful
52.txt
Successful
53.txt
Successful
54.txt
Successful
55.txt
Successful
56.txt
Successful
57.txt
Successful
58.txt
Successful
59.txt
Successful
60.txt
Successful
61.txt
Successful
62.txt
Successful
63.txt
Successful
64.txt
Successful
65.txt
Successful
66.txt
Successful
67.txt
Successful
68.txt
Successful
69.txt
Successful
70.txt
Successful
71.txt
Successful
72.txt
Successful
73.txt
Successful
74.txt
Successful
75.txt
Successful
76.txt
Successful
77.txt
Successful
78.txt
Successful
79.txt
Successful
80.txt
Successful
81.txt
Successful
82.txt
Successful
83.txt
Successful
84.txt
Successful
85.txt
Successful
86.txt
Successful
87.txt
Successful
88.txt
Successful
89.txt
Successful
90.txt
Successful
91.txt
Successful
92.txt
Suc

In [63]:
# print(awl)

## Personal Pronouns

To calculate Personal Pronouns mentioned in the text, we use regex to find the counts of the words - “I,” “we,” “my,” “ours,” and “us”.

In [88]:
def pronouns_count(text):
    pk = RegexpTokenizer("((?:^I[\s]|your|you|^he|^she|[\s]its|[\s]it[\s]|^we|they|[\s]them[\s]|[\s]+us+[\s]|[\s]him|[\s]her|[\s]his|theirs|[\s]our|[\s]my[\s]))")
    
    pp = pk.tokenize(text)
    
    return len(pp)

In [90]:
%%time

pro = []


for x in dir_list:
    article = open(("C:/Users/User/dataset/{}".format(x)), encoding = 'mbcs').read().lower()
    
        
    # Average Word Length
    p = pronouns_count(article)
    pro.append(p)
    
    print(x)
    print('Successful')

37.txt
Successful
38.txt
Successful
39.txt
Successful
40.txt
Successful
41.txt
Successful
42.txt
Successful
43.txt
Successful
44.txt
Successful
45.txt
Successful
46.txt
Successful
47.txt
Successful
48.txt
Successful
49.txt
Successful
50.txt
Successful
51.txt
Successful
52.txt
Successful
53.txt
Successful
54.txt
Successful
55.txt
Successful
56.txt
Successful
57.txt
Successful
58.txt
Successful
59.txt
Successful
60.txt
Successful
61.txt
Successful
62.txt
Successful
63.txt
Successful
64.txt
Successful
65.txt
Successful
66.txt
Successful
67.txt
Successful
68.txt
Successful
69.txt
Successful
70.txt
Successful
71.txt
Successful
72.txt
Successful
73.txt
Successful
74.txt
Successful
75.txt
Successful
76.txt
Successful
77.txt
Successful
78.txt
Successful
79.txt
Successful
80.txt
Successful
81.txt
Successful
82.txt
Successful
83.txt
Successful
84.txt
Successful
85.txt
Successful
86.txt
Successful
87.txt
Successful
88.txt
Successful
89.txt
Successful
90.txt
Successful
91.txt
Successful
92.txt
Suc

## Syllable Count Per Word

We count the number of **Syllables** in each word of the text by counting the vowels present in each word. We also handle some exceptions like words ending with "es","ed" by not counting them as a syllable.

In [91]:
def syllable_count(text):
    sk = RegexpTokenizer("(?:[aeiouAEIOU][r|w]|[aA|eE|iI|oO|uU|yY][aeiouyAEIOUY]|[aeiouyAEIOUY]r|[a-zA-Z]y|es|ed|ear|[aeiouAEIOU])")
    
    d = sk.tokenize(text)
    sc = len(d)
    
    return sc

In [92]:
def syllable_count_per_word(text):
    sk = RegexpTokenizer("(?:[aeiouAEIOU][r|w]|[aA|eE|iI|oO|uU|yY][aeiouyAEIOUY]|[aeiouyAEIOUY]r|[a-zA-Z]y|es|ed|ear|[aeiouAEIOU])")
    tk1 = RegexpTokenizer("[\w']+") # tokenize word
    
    d = sk.tokenize(text)
    sc = len(d)
    
    r = tk1.tokenize(text)
    wc = len(r)
    
    pw = sc/wc
    
    return pw    

In [94]:
%%time

syl = []
syl_w = []


for x in dir_list:
    article = open(("C:/Users/User/dataset/{}".format(x)), encoding = 'mbcs').read().lower()
    
        
    # Syllable Count
    h = syllable_count(article)
    syl.append(h)
    
    
    # Syllable Count Per Word
    g = syllable_count_per_word(article)
    syl_w.append(g)
    
    
    print(x)
    print('Successful')

37.txt
Successful
38.txt
Successful
39.txt
Successful
40.txt
Successful
41.txt
Successful
42.txt
Successful
43.txt
Successful
44.txt
Successful
45.txt
Successful
46.txt
Successful
47.txt
Successful
48.txt
Successful
49.txt
Successful
50.txt
Successful
51.txt
Successful
52.txt
Successful
53.txt
Successful
54.txt
Successful
55.txt
Successful
56.txt
Successful
57.txt
Successful
58.txt
Successful
59.txt
Successful
60.txt
Successful
61.txt
Successful
62.txt
Successful
63.txt
Successful
64.txt
Successful
65.txt
Successful
66.txt
Successful
67.txt
Successful
68.txt
Successful
69.txt
Successful
70.txt
Successful
71.txt
Successful
72.txt
Successful
73.txt
Successful
74.txt
Successful
75.txt
Successful
76.txt
Successful
77.txt
Successful
78.txt
Successful
79.txt
Successful
80.txt
Successful
81.txt
Successful
82.txt
Successful
83.txt
Successful
84.txt
Successful
85.txt
Successful
86.txt
Successful
87.txt
Successful
88.txt
Successful
89.txt
Successful
90.txt
Successful
91.txt
Successful
92.txt
Suc

In [95]:
output = pd.read_excel('Output Data Structure.xlsx')

In [96]:
output.columns

Index(['Unnamed: 0', 'URL_ID', 'URL', 'POSITIVE SCORE', 'NEGATIVE SCORE',
       'POLARITY SCORE', 'SUBJECTIVITY SCORE', 'AVG SENTENCE LENGTH',
       'PERCENTAGE OF COMPLEX WORDS', 'FOG INDEX',
       'AVG NUMBER OF WORDS PER SENTENCE', 'COMPLEX WORD COUNT', 'WORD COUNT',
       'SYLLABLE PER WORD', 'PERSONAL PRONOUNS', 'AVG WORD LENGTH'],
      dtype='object')

In [97]:
# save computed variable in output data structure

output['POSITIVE SCORE'] = score_1
output['NEGATIVE SCORE'] = score_2
output['POLARITY SCORE'] = score_3
output['SUBJECTIVITY SCORE'] = score_4

output['AVG SENTENCE LENGTH'] = sent
output['PERCENTAGE OF COMPLEX WORDS'] = percent
output['FOG INDEX'] = fog_index

output['AVG NUMBER OF WORDS PER SENTENCE'] = sent
output['COMPLEX WORD COUNT'] = hard
output['WORD COUNT'] = cl_w

output['SYLLABLE PER WORD'] =  syl_w
output['PERSONAL PRONOUNS'] = pro
output['AVG WORD LENGTH'] = awl

In [98]:
output.head()

Unnamed: 0.1,Unnamed: 0,URL_ID,URL,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,AVG NUMBER OF WORDS PER SENTENCE,COMPLEX WORD COUNT,WORD COUNT,SYLLABLE PER WORD,PERSONAL PRONOUNS,AVG WORD LENGTH
0,0,37,https://insights.blackcoffer.com/ai-in-healthc...,67,36,0.3,0.1,23.95,0.22,9.67,23.95,401,1010,1.952794,25,6
1,1,38,https://insights.blackcoffer.com/what-if-the-c...,60,35,0.26,0.16,18.64,0.13,7.51,18.64,188,590,1.676712,43,5
2,2,39,https://insights.blackcoffer.com/what-jobs-wil...,66,36,0.29,0.12,20.37,0.2,8.23,20.37,345,854,1.941109,30,5
3,3,40,https://insights.blackcoffer.com/will-machine-...,70,27,0.44,0.12,19.03,0.12,7.66,19.03,221,793,1.611111,45,5
4,4,41,https://insights.blackcoffer.com/will-ai-repla...,63,26,0.42,0.11,26.38,0.16,10.62,26.38,279,824,1.733406,60,5


In [99]:
rl_dict = {6.0 : "Sixth grade", 7.0: "Seventh grade", 8.0: "Eighth grade",
                 9.0 : "High school freshman", 10.0 : "High school sophomore",
                 11.0 : "High school junior", 12.0 : "High school senior",
                 13.0 : "College freshman", 14.0 : "College sophomore",
                 15.0 : "College junior" , 16.0 : "College senior",
                 17.0 : "College graduate"}

In [100]:
def reading_level(num,grade):
    for key,val in grade.items():
            if key == round(num):
                return val               

In [101]:
%%time

rl = []

for i in fog_index:
    
    #Reading level by grade
    x = reading_level(i,rl_dict)
    rl.append(x)

CPU times: total: 15.6 ms
Wall time: 997 µs


In [102]:
# Creating new columns "Total Syllable Count" and "Reading Level Grade"

output["TOTAL SYLLABLE COUNT"] = syl
output["READING LEVEL BY GRADE"] = rl

In [103]:
# Reordering the columns in pandas df

output = output[["URL_ID", "URL", "POSITIVE SCORE", "NEGATIVE SCORE", "POLARITY SCORE",
       "SUBJECTIVITY SCORE", "AVG SENTENCE LENGTH",
       "PERCENTAGE OF COMPLEX WORDS", "FOG INDEX", "READING LEVEL BY GRADE",
       "AVG NUMBER OF WORDS PER SENTENCE", "COMPLEX WORD COUNT", "WORD COUNT",
       "SYLLABLE PER WORD", "TOTAL SYLLABLE COUNT","PERSONAL PRONOUNS", "AVG WORD LENGTH"]]

In [104]:
output.head()

Unnamed: 0,URL_ID,URL,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,READING LEVEL BY GRADE,AVG NUMBER OF WORDS PER SENTENCE,COMPLEX WORD COUNT,WORD COUNT,SYLLABLE PER WORD,TOTAL SYLLABLE COUNT,PERSONAL PRONOUNS,AVG WORD LENGTH
0,37,https://insights.blackcoffer.com/ai-in-healthc...,67,36,0.3,0.1,23.95,0.22,9.67,High school sophomore,23.95,401,1010,1.952794,3599,25,6
1,38,https://insights.blackcoffer.com/what-if-the-c...,60,35,0.26,0.16,18.64,0.13,7.51,Eighth grade,18.64,188,590,1.676712,2448,43,5
2,39,https://insights.blackcoffer.com/what-jobs-wil...,66,36,0.29,0.12,20.37,0.2,8.23,Eighth grade,20.37,345,854,1.941109,3362,30,5
3,40,https://insights.blackcoffer.com/will-machine-...,70,27,0.44,0.12,19.03,0.12,7.66,Eighth grade,19.03,221,793,1.611111,2900,45,5
4,41,https://insights.blackcoffer.com/will-ai-repla...,63,26,0.42,0.11,26.38,0.16,10.62,High school junior,26.38,279,824,1.733406,3160,60,5


In [105]:
# determining the name of the file
file_name = 'Output Data Structure.xlsx'
  
# saving the excel
output.to_excel(file_name)
print('DataFrame is written to Excel File successfully.')

DataFrame is written to Excel File successfully.
