# Data Extraction & Text Analysis

<img src="images.png" width="150" height="150" align="left"/> 
<img src="data-science-isolated-icon-simple-element-illustration-general-concept-icons-editable-logo-sign-symbol-design-white-142289844.jpg" width="150" height="150" align="left"/> 

### Submitted By Manjiri H. Sawant

### Objectives : 

The objective of this assignment is to extract the article text from each given URL and perform text analysis to compute a set of variables. The work has two parts:

1. `Data Extraction`
2. `Data Analysis`

`Tool Used`: **Python Jupyter Notebook**

## Data Extraction

1. `BeautifulSoup` - parses the HTML and scrapes the data from it
2. `Selenium` - controls a web browser programmatically

**Steps**

1. Identify URL
2. Inspect HTML code
3. Find the HTML tag for the element that you want to extract.
4. Write code to scrape this data

## Data Analysis

`Look for these variables in the analysis:`

1.	POSITIVE SCORE
2.	NEGATIVE SCORE
3.	POLARITY SCORE
4.	SUBJECTIVITY SCORE
5.	AVG SENTENCE LENGTH
6.	PERCENTAGE OF COMPLEX WORDS
7.	FOG INDEX
8.	AVG NUMBER OF WORDS PER SENTENCE
9.	COMPLEX WORD COUNT
10.	WORD COUNT
11.	SYLLABLE PER WORD
12.	PERSONAL PRONOUNS
13.	AVG WORD LENGTH

#### Import Necessary Libraries

##### The following code is written in Python 3.x. Libraries provide pre-written functionality to perform the necessary tasks.

# 1. Data Extraction

In [1]:
#Importing Required Libraries

import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options as FirefoxOptions

In [2]:
# load file

df = pd.read_excel('C:/Users/User/Case Study/blackcoffer assignment/files/Input.xlsx')

In [3]:
# Extract the 1st column, 'URL_ID',
# and convert it to a Python list

u_id = df['URL_ID'].values.tolist()

In [4]:
# Extract the 2nd column, 'URL',
# and convert it to a Python list

u_l = df['URL'].values.tolist()

In [5]:
completed = []

Identify the HTML tags that hold the features below, so that the relevant data can be scraped from the website:
* Article Title = `'title'`
* Article Text = **`div`** `'td-post-content'`

`Tags are very important when scraping data from a particular website.`

**soup.find()** returns the first tag that matches the given name and attributes as a `bs4.element.Tag` object, or `None` if nothing matches.

* Through WebDriver, Selenium supports all major browsers such as Chrome/Chromium, Firefox, Internet Explorer, Edge, Opera, and Safari. 
* WebDriver drives the browser using the browser’s built-in support for automation.

* **Gecko driver** is the link between the Selenium script and the Firefox browser. Running Firefox in **headless** mode means that no browser window appears on screen while the website is accessed.
* Everything is done in the background.

In [6]:
%%time

# Take each URL as input
# Iterate over the list and export each article as a txt file

time_out = time.time() + 60*1 # 1 min from now

# the same headless options are reused for every browser session
options = FirefoxOptions()
options.add_argument("--headless")

for ele in u_l:
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(ele)
        content = driver.page_source
    finally:
        # always shut the browser down, even if the page fails to load
        driver.quit()

    soup = BeautifulSoup(content, features = "html.parser")

    #title
    title = soup.find('title')

    #post
    post = soup.find('div', attrs = {'class': 'td-post-content'})

    data = [title.text, post.text]

    # URL_ID that belongs to the current URL
    x = u_id[0]
    print(x)

    # the with block closes the file automatically
    with open('E:/data/{}.txt'.format(x), 'w+', encoding='utf-8') as t:
        for items in data:
            t.write('%s\n' %items)

    completed.append(ele)

#     print(completed)

    print("File written successfully")

    u_id.pop(0)

    if time.time() > time_out:
        break

# drop the URLs that are already done from the pending list
for ele in completed:
    if ele in u_l:
        u_l.remove(ele)        

1
File written successfully
2
File written successfully
3
File written successfully
Wall time: 1min 20s


# 2. Text Analysis

In [7]:
# Importing required libraries

import os
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import sent_tokenize

In [8]:
# get the list of all text files

path = 'C:/Users/User/Case Study/blackcoffer assignment/'

dir_list = os.listdir(path)

print("Files in '", path, "' :")

Files in ' C:/Users/User/Case Study/blackcoffer assignment/ ' :


In [9]:
# append only text files present in my folder
t_file = []

for x in dir_list:
    if x.endswith('.txt'):
        t_file.append(x)

In [10]:
# Sort the file names by length first and then alphabetically,
# so that '2.txt' comes before '10.txt'

t_file.sort(key = lambda name: (len(name), name))

**The Master Dictionary can be found here:**
* https://sraf.nd.edu/textual-analysis/resources/ 
* https://drive.google.com/file/d/17CmUZM9hGUdGYjCXcjQLyybjTrcjrhik/view

In [11]:
# Load the Loughran-McDonald Master Dictionary into a pandas DataFrame
# Lowercase the words with df['column name'].str.lower(), then use zip() to pair
# each word with its syllable count and build a lookup dictionary

df = pd.read_csv('C:/Users/User/Downloads/Loughran-McDonald_MasterDictionary_1993-2021.csv')

sydict = dict(zip(df.Word.str.lower(),df.Syllables))

In [12]:
def count(text):
    num_of_words = 0
    lines = text.split()
    for w in lines:
        if not w.isnumeric():
            num_of_words += 1
    return num_of_words

## Word Count 

`We count the total cleaned words present in the text by:`
1.	removing the stop words (loaded from the StopWords list linked below).
2.	removing punctuation such as ? ! , . from the words before counting.
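
The two cleaning steps can be sketched with the standard library alone. This is a minimal illustration rather than the notebook's implementation: `clean_tokens` and the tiny `STOP_WORDS` set are hypothetical names, and a real run would load the full stop-word list from a file.

```python
import re

# Hypothetical stop-word set for illustration; a real run would load the full list from a file.
STOP_WORDS = {"the", "is", "a", "of"}

def clean_tokens(text, stop_words=STOP_WORDS):
    """Lowercase the text, strip punctuation, and drop stop words and non-alphabetic tokens."""
    tokens = re.findall(r"\w+", text.lower())   # \w+ leaves punctuation behind
    return [t for t in tokens if t.isalpha() and t not in stop_words]

sample = "The data, is clean! 42 rows of text."
cleaned = clean_tokens(sample)
print(cleaned)        # ['data', 'clean', 'rows', 'text']
print(len(cleaned))   # 4
```

Keeping the stop words in a `set` matters: `in` on a plain string tests for substrings, so `"at" in "data science"` is `True` even though "at" is not one of the words.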

**The StopWords list can be found here:**
* https://sraf.nd.edu/textual-analysis/stopwords/
* https://drive.google.com/file/d/0B4niqV00F3msSktONVhfaElXeEk/view?resourcekey=0-3hFK5VYPXA7R_Q2LvA-SOw

In [13]:
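# read the stop words into a set, so that membership tests match whole words
# (with a plain string, `w not in stopwords` would test for substrings)
with open('C:/Users/User/Case Study/blackcoffer assignment/files/StopWords_GenericLong.txt') as sw_file:
    stopwords = set(sw_file.read().lower().split())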

In [14]:
# simple python function 1
# total cleaned word present in the text


def clean_word(text):
    filtered = []
    
    tokenizer = RegexpTokenizer(r'\w+')
    clean = tokenizer.tokenize(text)
    
    for w in clean:
        if w not in stopwords:
            if w.isalpha():
                filtered.append(w)
    return filtered

In [15]:
# simple python function 2
# count total cleaned words present in the text


def word_count(text):
    # reuse clean_word() so that both functions always apply the same cleaning rules
    return len(clean_word(text))

## Positive Score

**This file contains a list of POSITIVE opinion words (or sentiment words).**

* This file and the papers can all be downloaded from 
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

If you use this list, please cite one of the following two papers:
   *  Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.
   *  Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web." Proceedings of the 14th International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.

In [16]:
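# positive words
# opening the file in read mode; the with block closes it when done
with open('E:/Blackcoffer case study/positive-words.txt','r') as file1:
    # keep one word per line, skipping blank lines and ';' comment lines (the lexicon header, if present)
    p_word = {line.strip() for line in file1 if line.strip() and not line.startswith(';')}
# print(p_word)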

In [17]:
def positive_score(text):
    # one point for every token that appears in the positive word list
    return sum(1 for item in text if item in p_word)

## Negative Score

**The dictionary of negative words can be found here:**

* This file contains a list of NEGATIVE opinion words (or sentiment words).

* This file and the papers can all be downloaded from:
    
  http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

If you use this list, please cite one of the following two papers:
   *  Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.
   *  Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web." Proceedings of the 14th International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.

https://gist.github.com/mkulakowski2/4289441

In [18]:
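# negative words

# opening the file in read mode; the with block closes it when done
with open('E:/Blackcoffer case study/negative-words.txt','r') as file2:
    # keep one word per line, skipping blank lines and ';' comment lines (the lexicon header, if present)
    n_word = {line.strip() for line in file2 if line.strip() and not line.startswith(';')}
# print(n_word)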

In [19]:
def negative_score(text):
    # one point for every token that appears in the negative word list;
    # the score is reported as a positive number
    return sum(1 for item in text if item in n_word)

## Polarity Score

This is the score that determines if a given text is positive or negative in nature. 

It is calculated by using the formula: 
**Polarity Score** = (Positive Score - Negative Score) / ((Positive Score + Negative Score) + 0.000001)

`Range is from -1 to +1`


## Subjectivity Score

This is the score that determines if a given text is objective or subjective. 

It is calculated by using the formula: 

**Subjectivity Score** = (Positive Score + Negative Score)/ ((Total Words after cleaning) + 0.000001)

`Range is from 0 to +1`
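
As a worked example of the two formulas above, here is a minimal sketch. The function names and the counts (13 positive words, 9 negative words, 367 cleaned words) are hypothetical values chosen for illustration.

```python
EPSILON = 0.000001  # keeps both denominators non-zero

def polarity_score(positive, negative):
    """(P - N) / ((P + N) + epsilon); lies between -1 and +1."""
    return (positive - negative) / ((positive + negative) + EPSILON)

def subjectivity_score(positive, negative, total_clean_words):
    """(P + N) / (total cleaned words + epsilon); lies between 0 and +1."""
    return (positive + negative) / (total_clean_words + EPSILON)

# Hypothetical counts for one article
pos, neg, words = 13, 9, 367
print(round(polarity_score(pos, neg), 2))              # 0.18
print(round(subjectivity_score(pos, neg, words), 2))   # 0.06
print(polarity_score(0, 0))                            # 0.0
```

The small constant in the denominator is what lets an article with no sentiment words at all score 0.0 instead of raising a division-by-zero error.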


In [20]:
%%time

cl_w = []         # total number of clean words
score_1 = []      # Positive Score
score_2 = []      # Negative Score
score_3 = []      # Polarity Score
score_4 = []      # Subjectivity Score


for x in t_file:
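    # the files were written as UTF-8, so read them back with the same encoding
    with open(os.path.join(path, x), encoding = 'utf-8') as fh:
        article = fh.read().lower()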
    
    # Clean Words
    f = clean_word(article)
   
    
    # Clean Word Count
    m = word_count(article)
    cl_w.append(m)
   
    
    # Positive Score
    ps = positive_score(f)
    score_1.append(ps)
   
   
    # Negative Score
    ns = negative_score(f)
    score_2.append(ns)
   
   
    # Polarity Score
    w = round(((ps-ns)/((ps+ns) + 0.000001)),2)
    score_3.append(w)
    
    # Subjectivity Score
    v = round(((ps + ns)/(m + 0.000001)),2)
    score_4.append(v)
    
    
    print(x)
    print('Successful')


1.txt
Successful
2.txt
Successful
3.txt
Successful
4.txt
Successful
5.txt
Successful
6.txt
Successful
7.txt
Successful
8.txt
Successful
9.txt
Successful
10.txt
Successful
11.txt
Successful
12.txt
Successful
13.txt
Successful
14.txt
Successful
15.txt
Successful
16.txt
Successful
17.txt
Successful
18.txt
Successful
19.txt
Successful
20.txt
Successful
21.txt
Successful
22.txt
Successful
23.txt
Successful
24.txt
Successful
25.txt
Successful
26.txt
Successful
27.txt
Successful
28.txt
Successful
29.txt
Successful
30.txt
Successful
31.txt
Successful
32.txt
Successful
33.txt
Successful
34.txt
Successful
35.txt
Successful
36.txt
Successful
37.txt
Successful
38.txt
Successful
39.txt
Successful
40.txt
Successful
41.txt
Successful
42.txt
Successful
43.txt
Successful
45.txt
Successful
46.txt
Successful
47.txt
Successful
48.txt
Successful
49.txt
Successful
50.txt
Successful
51.txt
Successful
52.txt
Successful
53.txt
Successful
54.txt
Successful
55.txt
Successful
56.txt
Successful
57.txt
Successful
5

In [21]:
# print(cl_w)
# print(score_1)
# print(score_2)
# print(score_3)
# print(score_4)

## Average Number of Words Per Sentence

`The formula for calculating is:`

**Average Number of Words Per Sentence** = the total number of words / the total number of sentences
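
The formula can be illustrated with a small standard-library sketch. The regex-based sentence split is an assumption made to keep the example self-contained; it is a naive stand-in for a proper sentence tokenizer such as NLTK's `sent_tokenize`, which copes better with abbreviations. The function name is hypothetical.

```python
import re

def avg_words_per_sentence(text):
    """Total number of words divided by total number of sentences."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    words = text.split()
    return round(len(words) / len(sentences), 2)

sample = "The cat sat. The dog barked loudly! Why?"
print(avg_words_per_sentence(sample))   # 2.67 (8 words / 3 sentences)
```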

In [22]:
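# simple python function
# calculate average sentence length (words per sentence)

def avg_sent_len(text):
    s = sent_tokenize(text)
    ns = len(s)
    # count the words here instead of relying on a global variable set by the caller
    asl = round((count(text)/ns),2)
    return asl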

## Complex Word Count

`Complex words are words in the text that contain more than two syllables.`
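
A minimal sketch of this rule, assuming a word-to-syllable lookup table is available. The `SYLLABLES` table and the function name are hypothetical; a real run would use a full dictionary.

```python
# Hypothetical word -> syllable-count table; a real run would use a full dictionary.
SYLLABLES = {"data": 2, "analysis": 4, "extraction": 3, "text": 1}

def count_complex_words(words, syllables=SYLLABLES):
    """Count the words that have more than two syllables."""
    # Words missing from the table are treated as 0 syllables, i.e. not complex.
    return sum(1 for w in words if syllables.get(w, 0) > 2)

tokens = ["data", "analysis", "extraction", "text", "unknown"]
print(count_complex_words(tokens))   # 2 ("analysis" and "extraction")
```

A direct dictionary lookup per word keeps the count linear in the number of words, instead of scanning the whole table for every word.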

In [23]:
# simple python function 4
# count complex words, i.e. words with more than two syllables

def complex_word_count(t_dict,t_lst):
    # look each word up directly; words missing from the dictionary count as 0 syllables
    return sum(1 for w in t_lst if t_dict.get(w, 0) > 2)

## Analysis of Readability

Readability is measured with the Gunning Fog index, using the formulas described below.

* **Average Sentence Length** = the number of words / the number of sentences
* **Percentage of Complex words** = the number of complex words / the number of words 
* **Fog Index** = 0.4 * (Average Sentence Length + Percentage of Complex words)


* **The Gunning Fog** formula generates a grade level between `0 and 20.` 
* It estimates the education level required to understand the text.
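
The three formulas chain together as in the sketch below, which follows the definitions above and therefore uses the plain complex-word ratio. Note that the commonly published form of the Gunning fog formula scales that ratio by 100, so values computed this way will differ from it. The function name and the sample counts are hypothetical.

```python
def fog_index(avg_sentence_length, complex_word_ratio):
    """0.4 * (average sentence length + share of complex words), as defined above."""
    return 0.4 * (avg_sentence_length + complex_word_ratio)

# From precomputed values
print(round(fog_index(29.08, 0.15), 2))   # 11.69

# From hypothetical raw counts: 1000 words, 50 sentences, 150 complex words
words, sentences, complex_words = 1000, 50, 150
avg_len = words / sentences               # 20.0
ratio = complex_words / words             # 0.15
print(round(fog_index(avg_len, ratio), 2))   # 8.06
```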

In [24]:
%%time

num = []    # Total Number of Words
hard = []   # Complex Word Count
sent = []   # Average Sentence Length
percent = []  # Percentage of Complex Words
fog_index = [] # Fog Index


for x in t_file:
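    # the files were written as UTF-8, so read them back with the same encoding
    with open(os.path.join(path, x), encoding = 'utf-8') as fh:
        article = fh.read().lower()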
    
    # Clean Words
    f = clean_word(article)
    
    # Number of Words
    k = count(article)
    num.append(k)
    
    # Complex Word Count
    cw = complex_word_count(sydict,f)
    hard.append(cw)
    
    # Average Sentence Length
    al = avg_sent_len(article)
    sent.append(al)
    
    # Percentage of Complex Words
    pcw = round((cw/k),2)
    percent.append(pcw)
    
    # Fog Index
    fi = round((0.4* (al+pcw)),2)
    fog_index.append(fi)
    
    
    print(x)
    print('Successful')

1.txt
Successful
2.txt
Successful
3.txt
Successful
4.txt
Successful
5.txt
Successful
6.txt
Successful
7.txt
Successful
8.txt
Successful
9.txt
Successful
10.txt
Successful
11.txt
Successful
12.txt
Successful
13.txt
Successful
14.txt
Successful
15.txt
Successful
16.txt
Successful
17.txt
Successful
18.txt
Successful
19.txt
Successful
20.txt
Successful
21.txt
Successful
22.txt
Successful
23.txt
Successful
24.txt
Successful
25.txt
Successful
26.txt
Successful
27.txt
Successful
28.txt
Successful
29.txt
Successful
30.txt
Successful
31.txt
Successful
32.txt
Successful
33.txt
Successful
34.txt
Successful
35.txt
Successful
36.txt
Successful
37.txt
Successful
38.txt
Successful
39.txt
Successful
40.txt
Successful
41.txt
Successful
42.txt
Successful
43.txt
Successful
45.txt
Successful
46.txt
Successful
47.txt
Successful
48.txt
Successful
49.txt
Successful
50.txt
Successful
51.txt
Successful
52.txt
Successful
53.txt
Successful
54.txt
Successful
55.txt
Successful
56.txt
Successful
57.txt
Successful
5

In [28]:
# print(hard)
# print(sent)
# print(fog_index)

## Average Word Length

Average Word Length is calculated by the formula:

`Sum of the total number of characters in each word/Total number of words`
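
A minimal standard-library sketch of this formula. The function name is hypothetical, and the tokenising pattern (word characters and apostrophes) is an assumption about what counts as a word.

```python
import re

def average_word_length(text):
    """Sum of the characters in every word divided by the number of words."""
    words = re.findall(r"[\w']+", text)   # word characters and apostrophes
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

print(round(average_word_length("data science rocks"), 2))   # 5.33 (16 characters / 3 words)
```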

In [29]:
# Using regular Expression

def avg_word_length(text):
    tk1 = RegexpTokenizer(r"[\w']+") # tokenize word

    x = tk1.tokenize(text)

    # total number of characters over all words
    total_chars = sum(len(item) for item in x)

    # divide by the word count computed here rather than by a global variable
    awl = round((total_chars/count(text)),0)

    return int(awl)

In [30]:
%%time

awl = []


for x in t_file:
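    # the files were written as UTF-8, so read them back with the same encoding
    with open(os.path.join(path, x), encoding = 'utf-8') as fh:
        article = fh.read().lower()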
    
        
    # Number of Words
    k = count(article)
    
    
    # Average Word Length
    a = avg_word_length(article)
    awl.append(a)
    
    
    print(x)
    print('Successful')

1.txt
Successful
2.txt
Successful
3.txt
Successful
4.txt
Successful
5.txt
Successful
6.txt
Successful
7.txt
Successful
8.txt
Successful
9.txt
Successful
10.txt
Successful
11.txt
Successful
12.txt
Successful
13.txt
Successful
14.txt
Successful
15.txt
Successful
16.txt
Successful
17.txt
Successful
18.txt
Successful
19.txt
Successful
20.txt
Successful
21.txt
Successful
22.txt
Successful
23.txt
Successful
24.txt
Successful
25.txt
Successful
26.txt
Successful
27.txt
Successful
28.txt
Successful
29.txt
Successful
30.txt
Successful
31.txt
Successful
32.txt
Successful
33.txt
Successful
34.txt
Successful
35.txt
Successful
36.txt
Successful
37.txt
Successful
38.txt
Successful
39.txt
Successful
40.txt
Successful
41.txt
Successful
42.txt
Successful
43.txt
Successful
45.txt
Successful
46.txt
Successful
47.txt
Successful
48.txt
Successful
49.txt
Successful
50.txt
Successful
51.txt
Successful
52.txt
Successful
53.txt
Successful
54.txt
Successful
55.txt
Successful
56.txt
Successful
57.txt
Successful
5

In [31]:
# print(awl)

## Personal Pronouns

To count the personal pronouns mentioned in the text, we use a regex to find occurrences of words such as “I,” “we,” “my,” “ours,” and “us”.
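
A minimal sketch of such a regex using only the standard `re` module, limited to the five words named above. The pattern and function name are illustrative assumptions; the `\b` word boundaries keep the pattern from matching inside longer words.

```python
import re

# \b anchors each alternative to whole words, so "us" does not match inside "bus".
PRONOUN_PATTERN = re.compile(r"\b(?:i|we|my|ours|us)\b", flags=re.IGNORECASE)

def count_personal_pronouns(text):
    """Count whole-word occurrences of the listed pronouns, ignoring case."""
    return len(PRONOUN_PATTERN.findall(text))

sample = "We shared our data; I think it helps us. My bus is late."
print(count_personal_pronouns(sample))   # 4 (We, I, us, My)
```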

In [31]:
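def pronouns_count(text):
    # \b word boundaries keep the pattern from matching inside longer words (e.g. "you" in "young");
    # the text is lowercased before this call, so the pattern is lowercase too
    pk = RegexpTokenizer(r"\b(?:i|your|you|he|she|its|it|we|they|them|us|him|her|his|theirs|ours|our|my)\b")
    
    pp = pk.tokenize(text)
    
    return len(pp)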

In [32]:
%%time

pro = []


for x in t_file:
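    # the files were written as UTF-8, so read them back with the same encoding
    with open(os.path.join(path, x), encoding = 'utf-8') as fh:
        article = fh.read().lower()

    # Personal Pronoun Count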
    p = pronouns_count(article)
    pro.append(p)
    
    print(x)
    print('Successful')

1.txt
Successful
2.txt
Successful
3.txt
Successful
4.txt
Successful
5.txt
Successful
6.txt
Successful
7.txt
Successful
8.txt
Successful
9.txt
Successful
10.txt
Successful
11.txt
Successful
12.txt
Successful
13.txt
Successful
14.txt
Successful
15.txt
Successful
16.txt
Successful
17.txt
Successful
18.txt
Successful
19.txt
Successful
20.txt
Successful
21.txt
Successful
22.txt
Successful
23.txt
Successful
24.txt
Successful
25.txt
Successful
26.txt
Successful
27.txt
Successful
28.txt
Successful
29.txt
Successful
30.txt
Successful
31.txt
Successful
32.txt
Successful
33.txt
Successful
34.txt
Successful
35.txt
Successful
36.txt
Successful
37.txt
Successful
38.txt
Successful
39.txt
Successful
40.txt
Successful
41.txt
Successful
42.txt
Successful
43.txt
Successful
45.txt
Successful
46.txt
Successful
47.txt
Successful
48.txt
Successful
49.txt
Successful
50.txt
Successful
51.txt
Successful
52.txt
Successful
53.txt
Successful
54.txt
Successful
55.txt
Successful
56.txt
Successful
57.txt
Successful
5

In [59]:
# print(pro)

## Syllable Count Per Word

We count the number of **Syllables** in each word of the text by counting the vowels present in each word. We also handle some exceptions, such as words ending in "es" or "ed", by not counting those endings as a syllable.
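
The rule as stated can be sketched per word as follows. This is a simplified illustration with hypothetical function names, not the regex used in the cells below: it counts each vowel letter (a, e, i, o, u) and subtracts one for a trailing "es" or "ed", never going below one vowel.

```python
import re

VOWELS = set("aeiou")

def syllables_in_word(word):
    """Count the vowel letters; a trailing 'es' or 'ed' is not counted as a syllable."""
    word = word.lower()
    n = sum(1 for ch in word if ch in VOWELS)
    if word.endswith(("es", "ed")) and n > 1:
        n -= 1
    return n

def syllables_per_word(text):
    """Average of the per-word counts over all alphabetic words in the text."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    return sum(syllables_in_word(w) for w in words) / len(words)

print(syllables_in_word("jumped"))        # 1
print(syllables_in_word("data"))          # 2
print(syllables_per_word("data boxes"))   # 1.5
```

Counting every vowel letter overestimates words with vowel runs (for example "boat"), which is why this heuristic is only an approximation.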

In [33]:
def syllable_count(text):
    sk = RegexpTokenizer("(?:[aeiouAEIOU][r|w]|[aA|eE|iI|oO|uU|yY][aeiouyAEIOUY]|[aeiouyAEIOUY]r|[a-zA-Z]y|es|ed|ear|[aeiouAEIOU])")
    
    d = sk.tokenize(text)
    sc = len(d)
    
    return sc

In [34]:
def syllable_count_per_word(text):
    tk1 = RegexpTokenizer(r"[\w']+") # tokenize word
    
    # reuse syllable_count() so that both functions share one syllable pattern
    sc = syllable_count(text)
    
    r = tk1.tokenize(text)
    wc = len(r)
    
    pw = sc/wc
    
    return pw    

In [35]:
%%time

syl = []
syl_w = []


for x in t_file:
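    # the files were written as UTF-8, so read them back with the same encoding
    with open(os.path.join(path, x), encoding = 'utf-8') as fh:
        article = fh.read().lower()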
    
        
    # Syllable Count
    h = syllable_count(article)
    syl.append(h)
    
    
    # Syllable Count Per Word
    g = syllable_count_per_word(article)
    syl_w.append(g)
    
    
    print(x)
    print('Successful')

1.txt
Successful
2.txt
Successful
3.txt
Successful
4.txt
Successful
5.txt
Successful
6.txt
Successful
7.txt
Successful
8.txt
Successful
9.txt
Successful
10.txt
Successful
11.txt
Successful
12.txt
Successful
13.txt
Successful
14.txt
Successful
15.txt
Successful
16.txt
Successful
17.txt
Successful
18.txt
Successful
19.txt
Successful
20.txt
Successful
21.txt
Successful
22.txt
Successful
23.txt
Successful
24.txt
Successful
25.txt
Successful
26.txt
Successful
27.txt
Successful
28.txt
Successful
29.txt
Successful
30.txt
Successful
31.txt
Successful
32.txt
Successful
33.txt
Successful
34.txt
Successful
35.txt
Successful
36.txt
Successful
37.txt
Successful
38.txt
Successful
39.txt
Successful
40.txt
Successful
41.txt
Successful
42.txt
Successful
43.txt
Successful
45.txt
Successful
46.txt
Successful
47.txt
Successful
48.txt
Successful
49.txt
Successful
50.txt
Successful
51.txt
Successful
52.txt
Successful
53.txt
Successful
54.txt
Successful
55.txt
Successful
56.txt
Successful
57.txt
Successful
5

In [36]:
output = pd.read_excel('Output Data Structure.xlsx')

In [37]:
output.columns

Index(['URL_ID', 'URL', 'POSITIVE SCORE', 'NEGATIVE SCORE', 'POLARITY SCORE',
       'SUBJECTIVITY SCORE', 'AVG SENTENCE LENGTH',
       'PERCENTAGE OF COMPLEX WORDS', 'FOG INDEX',
       'AVG NUMBER OF WORDS PER SENTENCE', 'COMPLEX WORD COUNT', 'WORD COUNT',
       'SYLLABLE PER WORD', 'PERSONAL PRONOUNS', 'AVG WORD LENGTH',
       'Unnamed: 15', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18',
       'Unnamed: 19', 'Unnamed: 20'],
      dtype='object')

In [38]:
# save computed variable in output data structure

output['POSITIVE SCORE'] = score_1
output['NEGATIVE SCORE'] = score_2
output['POLARITY SCORE'] = score_3
output['SUBJECTIVITY SCORE'] = score_4

output['AVG SENTENCE LENGTH'] = sent
output['PERCENTAGE OF COMPLEX WORDS'] = percent
output['FOG INDEX'] = fog_index

output['AVG NUMBER OF WORDS PER SENTENCE'] = sent
output['COMPLEX WORD COUNT'] = hard
output['WORD COUNT'] = cl_w

output['SYLLABLE PER WORD'] =  syl_w
output['PERSONAL PRONOUNS'] = pro
output['AVG WORD LENGTH'] = awl

<img src="Gunning fog.jpg" width="200" height="200" align="center"/> 

In [64]:
# New Computed Variable 'Reading Level By Grade'

rl = []

for num in fog_index:
    if 6 <= num <= 7:
        rl.append('6th Grade')
    elif 7 <= num <= 8:
        rl.append('7th Grade')
    elif 8 <= num <= 9:
        rl.append('8th Grade')
    elif 9 <= num <= 10:
        rl.append('High School freshman')
    elif 10 <= num <= 11:
        rl.append('High School sophomore')
    elif 11 <= num <= 12:
        rl.append('High School junior')
    elif 12 <= num <= 13:
        rl.append('High School senior')
    elif 13 <= num <= 14:
        rl.append('College freshman')
    elif 14 <= num <= 15:
        rl.append('College sophomore')
    elif 15 <= num <= 16:
        rl.append('College junior')
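    elif 16 <= num <= 17:
        rl.append('College Senior')
    elif 17 < num <= 20:
        # fills the gap between 17 and 20, so that rl stays the same length as fog_index
        rl.append('College graduate')
    elif num > 20:
        rl.append('Hard to read')
    elif num < 6:
        rl.append('Easy to read')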
    
    

In [66]:
# Creating new columns 'TOTAL SYLLABLE COUNT' & 'READING LEVEL BY GRADE'

output['TOTAL SYLLABLE COUNT'] = syl
output['READING LEVEL BY GRADE'] = rl

In [68]:
output.columns

Index(['URL_ID', 'URL', 'POSITIVE SCORE', 'NEGATIVE SCORE', 'POLARITY SCORE',
       'SUBJECTIVITY SCORE', 'AVG SENTENCE LENGTH',
       'PERCENTAGE OF COMPLEX WORDS', 'FOG INDEX',
       'AVG NUMBER OF WORDS PER SENTENCE', 'COMPLEX WORD COUNT', 'WORD COUNT',
       'SYLLABLE PER WORD', 'PERSONAL PRONOUNS', 'AVG WORD LENGTH',
       'Unnamed: 15', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18',
       'Unnamed: 19', 'Unnamed: 20', 'TOTAL SYLLABLE COUNT',
       'READING LEVEL BY GRADE'],
      dtype='object')

In [69]:
# Reordering the columns in pandas df

output = output[['URL_ID', 'URL', 'POSITIVE SCORE', 'NEGATIVE SCORE', 'POLARITY SCORE',
       'SUBJECTIVITY SCORE', 'AVG SENTENCE LENGTH',
       'PERCENTAGE OF COMPLEX WORDS', 'FOG INDEX', 'READING LEVEL BY GRADE',
       'AVG NUMBER OF WORDS PER SENTENCE', 'COMPLEX WORD COUNT', 'WORD COUNT',
       'SYLLABLE PER WORD', 'TOTAL SYLLABLE COUNT','PERSONAL PRONOUNS', 'AVG WORD LENGTH',
        'Unnamed: 15','Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18',
       'Unnamed: 19', 'Unnamed: 20']]

In [70]:
output.head()

Unnamed: 0,URL_ID,URL,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,READING LEVEL BY GRADE,...,SYLLABLE PER WORD,TOTAL SYLLABLE COUNT,PERSONAL PRONOUNS,AVG WORD LENGTH,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20
0,1,https://insights.blackcoffer.com/how-is-login-...,13,9,0.18,0.06,29.08,0.15,11.69,High School junior,...,1.761589,1330,13,5,,,,,,
1,2,https://insights.blackcoffer.com/how-does-ai-h...,22,3,0.76,0.08,22.21,0.18,8.96,8th Grade,...,1.88172,1225,21,5,,,,,,
2,3,https://insights.blackcoffer.com/ai-and-its-im...,70,18,0.59,0.09,23.57,0.19,9.5,High School freshman,...,1.836702,3453,45,5,,,,,,
3,4,https://insights.blackcoffer.com/how-do-deep-l...,13,0,1.0,0.06,34.77,0.18,13.98,College freshman,...,1.893013,867,7,5,,,,,,
4,5,https://insights.blackcoffer.com/how-artificia...,47,12,0.59,0.1,22.0,0.17,8.87,8th Grade,...,1.791503,2277,40,5,,,,,,


In [74]:
# dropping unnamed columns
output.drop(['Unnamed: 15','Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18',
       'Unnamed: 19', 'Unnamed: 20'],axis = 1, inplace = True)

In [76]:
# determining the name of the file
file_name = 'Output Data Structure.xlsx'
  
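# saving the excel; index = False keeps the row index from being written as an extra 'Unnamed: 0' column
output.to_excel(file_name, index = False)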
print('DataFrame is written to Excel File successfully.')

DataFrame is written to Excel File successfully.
