Blackcoffer is an enterprise software and analytics consulting firm based in India and the European Union (Malta). It is a data-driven technology and decision science firm focused exclusively on big data and analytics, data-driven dashboards, application development, information management, and consulting on data of any kind, from any source, at massive scale. We are a young, global consulting shop that helps enterprises and entrepreneurs solve problems in these areas to minimize risk, explore opportunities for future growth, and increase profits more effectively. We provide intelligence, accelerate innovation, and implement technology with breadth and depth of global insight by combining specialist services with high-level human expertise.

# Data Extraction and NLP
### Test Assignment :
    https://drive.google.com/drive/folders/1ltdsXAS_zaZ3hI-q9eze_QCzHciyYAJY
## Objective
    The objective of this assignment is to extract textual data articles from the given URL and perform text analysis to compute variables that are explained below. 
    
## Data Extraction
    Input.xlsx
    For each of the articles, given in the input.xlsx file, extract the article text and save the extracted article in a text file with URL_ID as its file name.
    While extracting text, please make sure your program extracts only the article title and the article text. It should not extract the website header, footer, or anything other than the article text. 
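
As a minimal sketch of the saving step described above, the helper below writes the title and article paragraphs to a text file named after the `URL_ID`. The function name `save_article` and the `articles` output folder are assumptions made for illustration; they do not come from the assignment.

```python
from pathlib import Path


def save_article(url_id, title, paragraphs, out_dir="articles"):
    """Write the title and paragraphs to <out_dir>/<url_id>.txt and return the path.

    `out_dir` is a hypothetical default; the assignment only fixes the file name.
    """
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{url_id}.txt"
    # Title first, then one paragraph per line.
    path.write_text("\n".join([title, *paragraphs]), encoding="utf-8")
    return path
```

With these assumptions, `save_article(37, title, paragraphs)` would write `articles/37.txt`.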

* *NOTE: YOU MUST USE PYTHON PROGRAMMING TO EXTRACT DATA FROM THE URLs. YOU CAN USE BEAUTIFULSOUP, SELENIUM, SCRAPY, OR ANY OTHER PYTHON LIBRARY THAT YOU PREFER FOR DATA CRAWLING.*

## Data Analysis
    For each extracted article text, perform textual analysis and compute the variables listed in the output structure Excel file. Save the output in the exact order given in the output structure file, “Output Data Structure.xlsx”.

* *NOTE: YOU MUST USE PYTHON PROGRAMMING FOR THE DATA ANALYSIS.*


## Variables
    Each of the variables below is defined in the “Text Analysis.docx” file.
    POSITIVE SCORE
    NEGATIVE SCORE
    POLARITY SCORE
    SUBJECTIVITY SCORE
    AVG SENTENCE LENGTH
    PERCENTAGE OF COMPLEX WORDS
    FOG INDEX
    AVG NUMBER OF WORDS PER SENTENCE
    COMPLEX WORD COUNT
    WORD COUNT
    SYLLABLE PER WORD
    PERSONAL PRONOUNS
    AVG WORD LENGTH

## Output Data Structure
    Output Variables: 
    All input variables in “Input.xlsx”
    POSITIVE SCORE
    NEGATIVE SCORE
    POLARITY SCORE
    SUBJECTIVITY SCORE
    AVG SENTENCE LENGTH
    PERCENTAGE OF COMPLEX WORDS
    FOG INDEX
    AVG NUMBER OF WORDS PER SENTENCE
    COMPLEX WORD COUNT
    WORD COUNT
    SYLLABLE PER WORD
    PERSONAL PRONOUNS
    AVG WORD LENGTH
    Check the output data structure spreadsheet, “Output Data Structure.xlsx”, for the format of your output.


## Sentiment Analysis
    Sentiment analysis is the process of determining whether a piece of writing is positive, negative, or neutral. The algorithm below is designed for use with financial texts. It consists of the following steps:

## Cleaning using Stop Words Lists
    The Stop Words Lists (found in the folder StopWords) are used to clean the text so that Sentiment Analysis can be performed by excluding the words found in Stop Words List. 

## Creating a dictionary of Positive and Negative words
    The Master Dictionary (found in the folder MasterDictionary) is used for creating a dictionary of Positive and Negative words. We add only those words in the dictionary if they are not found in the Stop Words Lists. 
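
The filtering rule above can be sketched as follows. This is an illustration only: the function name `build_sentiment_dictionary` and the tiny sample word lists are assumptions, and in practice the lines would come from the files in the MasterDictionary and StopWords folders.

```python
def build_sentiment_dictionary(positive_lines, negative_lines, stop_words):
    """Return positive and negative word sets, excluding any stop words."""
    stop = {w.strip().casefold() for w in stop_words}

    def keep(lines):
        # Normalise each entry and drop blanks and stop words.
        cleaned = (line.strip().casefold() for line in lines)
        return {word for word in cleaned if word and word not in stop}

    return {"positive": keep(positive_lines), "negative": keep(negative_lines)}


# Hypothetical sample data standing in for the real dictionary files.
dictionary = build_sentiment_dictionary(
    ["Good\n", "best\n", "\n"], ["Bad\n", "loss\n"], ["BEST", "loss"]
)
print(dictionary)
```

Sets are used so that later membership checks stay fast even with several thousand dictionary entries.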

## Extracting Derived variables
    We convert the text into a list of tokens using the nltk tokenize module and use these tokens to calculate the 4 variables described below:

* Positive Score: This score is calculated by assigning the value of +1 for each word if found in the Positive Dictionary and then adding up all the values.
* Negative Score: This score is calculated by assigning the value of -1 for each word if found in the Negative Dictionary and then adding up all the values. We multiply the score with -1 so that the score is a positive number.
* Polarity Score: This is the score that determines if a given text is positive or negative in nature. It is calculated by using the formula: 
    Polarity Score = (Positive Score – Negative Score) / ((Positive Score + Negative Score) + 0.000001). The range is from -1 to +1.
* Subjectivity Score: This is the score that determines if a given text is objective or subjective. It is calculated by using the formula: 
    Subjectivity Score = (Positive Score + Negative Score) / ((Total Words after cleaning) + 0.000001). The range is from 0 to +1.
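
The four derived variables can be sketched directly from the formulas above. The function name `sentiment_scores` and the sample tokens and word sets are assumptions for illustration; the tokens are taken to be already cleaned.

```python
def sentiment_scores(tokens, positive_words, negative_words):
    """Return (positive, negative, polarity, subjectivity) for cleaned tokens."""
    positive = sum(1 for t in tokens if t in positive_words)
    # Each negative hit contributes -1; multiplying by -1 keeps the score positive.
    negative = -1 * sum(-1 for t in tokens if t in negative_words)
    polarity = (positive - negative) / ((positive + negative) + 0.000001)
    subjectivity = (positive + negative) / (len(tokens) + 0.000001)
    return positive, negative, polarity, subjectivity


# Hypothetical sample input.
tokens = ["growth", "strong", "loss", "market", "profit"]
pos, neg, pol, subj = sentiment_scores(
    tokens, {"growth", "strong", "profit"}, {"loss"}
)
print(pos, neg, round(pol, 2), round(subj, 2))  # 3 1 0.5 0.8
```

The small constant in each denominator avoids a division by zero when a text has no sentiment words or no tokens at all.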

## Analysis of Readability
    Analysis of Readability is calculated using the Gunning Fog index formula described below.
    Average Sentence Length = the number of words / the number of sentences
    Percentage of Complex words = the number of complex words / the number of words 
    Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex words)
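
A worked example of the three readability formulas, using made-up counts. The function name `readability` is an assumption, and the complex-word term is kept as a ratio exactly as the formula above is written.

```python
def readability(word_count, sentence_count, complex_word_count):
    """Return (average sentence length, complex-word ratio, fog index)."""
    avg_sentence_length = word_count / sentence_count
    pct_complex = complex_word_count / word_count
    fog = 0.4 * (avg_sentence_length + pct_complex)
    return avg_sentence_length, pct_complex, fog


# Hypothetical counts: 120 words, 8 sentences, 30 complex words.
asl, pct, fog = readability(word_count=120, sentence_count=8, complex_word_count=30)
print(asl, pct, round(fog, 2))  # 15.0 0.25 6.1
```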

## Average Number of Words Per Sentence
    The formula for calculating is:
    Average Number of Words Per Sentence = the total number of words / the total number of sentences

## Complex Word Count
    Complex words are words in the text that contain more than two syllables.

## Word Count
    We count the total cleaned words present in the text by 
    removing the stop words (using stopwords class of nltk package).
    removing any punctuations like ? ! , . from the word before counting.
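
A minimal sketch of the word-count cleaning described above. To keep it self-contained it uses a tiny hand-written stop-word set in place of the NLTK list; `cleaned_words` and `sample_stop_words` are names assumed for illustration.

```python
import string


def cleaned_words(text, stop_words):
    """Return lowercased words with surrounding punctuation and stop words removed."""
    words = []
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        if word and word not in stop_words:
            words.append(word)
    return words


# Hypothetical stand-in for the NLTK English stop-word list.
sample_stop_words = {"the", "is", "a", "of"}
words = cleaned_words("The growth of the market is strong, is it not?", sample_stop_words)
print(words, len(words))  # ['growth', 'market', 'strong', 'it', 'not'] 5
```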

## Syllable Count Per Word
    We count the number of Syllables in each word of the text by counting the vowels present in each word. We also handle some exceptions like words ending with "es","ed" by not counting them as a syllable.
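
One way to read the rule above is sketched below: every vowel letter counts as a syllable, except that a trailing "es" or "ed" is dropped first. The function names are assumptions, and `is_complex` applies the more-than-two-syllables definition from the Complex Word Count section.

```python
import re


def count_syllables(word):
    """Count vowel letters, ignoring a trailing 'es' or 'ed'."""
    word = word.lower()
    # Drop the suffix so its vowel is not counted as a syllable.
    stem = re.sub(r"(es|ed)$", "", word)
    return len(re.findall(r"[aeiou]", stem))


def is_complex(word):
    """A word is complex when it has more than two syllables."""
    return count_syllables(word) > 2


sample = ["analysis", "created", "boxes", "data", "information"]
print([count_syllables(w) for w in sample])  # [3, 2, 1, 2, 5]
print(sum(is_complex(w) for w in sample))    # 2
```

Counting vowel letters is only an approximation of spoken syllables, so words with vowel pairs such as "created" are slightly overcounted.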

## Personal Pronouns
    To calculate Personal Pronouns mentioned in the text, we use regex to find the counts of the words - “I,” “we,” “my,” “ours,” and “us”. Special care is taken so that the country name US is not included in the list.
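
A minimal sketch of the pronoun count, assuming that only the all-caps token "US" refers to the country. The names `PRONOUN_PATTERN` and `count_personal_pronouns` are illustrative.

```python
import re

# Word boundaries stop matches inside longer words such as "because".
PRONOUN_PATTERN = re.compile(r"\b(I|we|my|ours|us)\b", re.IGNORECASE)


def count_personal_pronouns(text):
    """Count the listed pronouns, skipping the all-caps country name 'US'."""
    matches = PRONOUN_PATTERN.findall(text)
    return sum(1 for m in matches if m != "US")


sample_text = "We moved to the US because my team asked us to. I agreed."
print(count_personal_pronouns(sample_text))  # 4
```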

## Average Word Length
    Average Word Length is calculated by the formula:
    Sum of the total number of characters in each word/Total number of words
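
The formula above as a short sketch; `average_word_length` is an assumed name, and an empty word list is taken to give 0.0.

```python
def average_word_length(words):
    """Return total characters divided by the number of words (0.0 if empty)."""
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


print(round(average_word_length(["data", "analysis", "ai"]), 2))  # 4.67
```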


In [1]:
import pandas as pd
import numpy as np
import gdown
import requests
import os
from docx import Document
import warnings
from pprint import pprint
warnings.filterwarnings('ignore')

In [2]:
input_file = r'https://docs.google.com/spreadsheets/d/1D7QkDHxUSKnQhR--q0BAwKMxQlUyoJTQ/edit?usp=drive_link'
objective_file = r'https://docs.google.com/document/d/1wHMJDDvEKksgPRFajZXeycUcldC57lqr/edit?usp=drive_link'
output_file = r'https://docs.google.com/spreadsheets/d/1kHcx9epaZKB96zRItudnrDi57cFEndFI/edit?usp=drive_link'
text_analysis = r'https://docs.google.com/document/d/11FuBgszZwCSpVWekJ6rR5tBLjU--xfIC/edit?usp=drive_link'

In [3]:
def download_docs(url, file_name):
    gdown.download(url, file_name)

In [4]:
download_docs(input_file, 'input.xlsx')
download_docs(objective_file, 'objective.docx')
download_docs(output_file, 'output.xlsx')
download_docs(text_analysis, 'text_analysis.docx')

Downloading...
From (uriginal): https://docs.google.com/spreadsheets/d/1D7QkDHxUSKnQhR--q0BAwKMxQlUyoJTQ/edit?usp=drive_link
From (redirected): https://docs.google.com/spreadsheets/d/1D7QkDHxUSKnQhR--q0BAwKMxQlUyoJTQ/export?format=xlsx
To: C:\Users\PythonFiles\PYcharm\Sentiment_analysis_Blackcoffer\notebook\input.xlsx
14.6kB [00:00, 59.2kB/s]
Downloading...
From (uriginal): https://docs.google.com/document/d/1wHMJDDvEKksgPRFajZXeycUcldC57lqr/edit?usp=drive_link
From (redirected): https://docs.google.com/document/d/1wHMJDDvEKksgPRFajZXeycUcldC57lqr/export?format=docx
To: C:\Users\PythonFiles\PYcharm\Sentiment_analysis_Blackcoffer\notebook\objective.docx
12.5kB [00:00, 153kB/s]
Downloading...
From (uriginal): https://docs.google.com/spreadsheets/d/1kHcx9epaZKB96zRItudnrDi57cFEndFI/edit?usp=drive_link
From (redirected): https://docs.google.com/spreadsheets/d/1kHcx9epaZKB96zRItudnrDi57cFEndFI/export?format=xlsx
To: C:\Users\PythonFiles\PYcharm\Sentiment_analysis_Blackcoffer\notebook\output

In [5]:
master_dic = r'https://drive.google.com/drive/folders/1YRcVlJO3ZaC78iTC6JcunfZl7Fz4AL8v?usp=drive_link'
stopwords = r'https://drive.google.com/drive/folders/1rd7YdoX8tED9mujc0c-6evJU4y7LFc_R?usp=drive_link'

In [6]:
gdown.download_folder(master_dic, output='master_dictionary')
gdown.download_folder(stopwords, output='stopwords')

Retrieving folder list


Processing file 1qqMwc_-ayS38HEOB97osO_nkIxRkbnvh negative-words.txt
Processing file 1seAj8G42SmfgUUx8lqVDJofm4Tuh2TOT positive-words.txt
Building directory structure completed


Retrieving folder list completed
Building directory structure
Downloading...
From: https://drive.google.com/uc?id=1qqMwc_-ayS38HEOB97osO_nkIxRkbnvh
To: C:\Users\PythonFiles\PYcharm\Sentiment_analysis_Blackcoffer\notebook\master_dictionary\negative-words.txt
100%|██████████████████████████████████████████████████████████████████████████████| 44.8k/44.8k [00:00<00:00, 502kB/s]
Downloading...
From: https://drive.google.com/uc?id=1seAj8G42SmfgUUx8lqVDJofm4Tuh2TOT
To: C:\Users\PythonFiles\PYcharm\Sentiment_analysis_Blackcoffer\notebook\master_dictionary\positive-words.txt
100%|██████████████████████████████████████████████████████████████████████████████| 19.1k/19.1k [00:00<00:00, 370kB/s]
Download completed
Retrieving folder list


Processing file 1aWxyJI0d9MOk59OZ_unfBY5E-Nvg_ezW StopWords_Auditor.txt
Processing file 1K-6MjPq5AQg4ICYY6PDfapB7JECUnryD StopWords_Currencies.txt
Processing file 13LXnH6vaJhvY4s2ai_2oW2qwongU_iAI StopWords_DatesandNumbers.txt
Processing file 1tTDfLXNPxNuUGZXHQkQhW6wPf4Xnivwr StopWords_Generic.txt
Processing file 1PnZhcsfjBVxnzwa4N6MrLWf6Kuhhjpdk StopWords_GenericLong.txt
Processing file 1RKxMOHzBdLrGuYb7MCJRTKKPwDG9Agbe StopWords_Geographic.txt
Processing file 1mBOuggD8AVNFjr9sprLoD2_6mVWAgRGE StopWords_Names.txt
Building directory structure completed


Retrieving folder list completed
Building directory structure
Downloading...
From: https://drive.google.com/uc?id=1aWxyJI0d9MOk59OZ_unfBY5E-Nvg_ezW
To: C:\Users\PythonFiles\PYcharm\Sentiment_analysis_Blackcoffer\notebook\stopwords\StopWords_Auditor.txt
100%|███████████████████████████████████████████████████████████████████████████████████████| 88.0/88.0 [00:00<?, ?B/s]
Downloading...
From: https://drive.google.com/uc?id=1K-6MjPq5AQg4ICYY6PDfapB7JECUnryD
To: C:\Users\PythonFiles\PYcharm\Sentiment_analysis_Blackcoffer\notebook\stopwords\StopWords_Currencies.txt
100%|█████████████████████████████████████████████████████████████████████████████| 1.76k/1.76k [00:00<00:00, 1.76MB/s]
Downloading...
From: https://drive.google.com/uc?id=13LXnH6vaJhvY4s2ai_2oW2qwongU_iAI
To: C:\Users\PythonFiles\PYcharm\Sentiment_analysis_Blackcoffer\notebook\stopwords\StopWords_DatesandNumbers.txt
100%|█████████████████████████████████████████████████████████████████████████████████████████| 832/832 [00:00<?, 

['stopwords\\StopWords_Auditor.txt',
 'stopwords\\StopWords_Currencies.txt',
 'stopwords\\StopWords_DatesandNumbers.txt',
 'stopwords\\StopWords_Generic.txt',
 'stopwords\\StopWords_GenericLong.txt',
 'stopwords\\StopWords_Geographic.txt',
 'stopwords\\StopWords_Names.txt']

In [7]:
input_file = pd.read_excel('input.xlsx')
input_file.head()

Unnamed: 0,URL_ID,URL
0,37,https://insights.blackcoffer.com/ai-in-healthc...
1,38,https://insights.blackcoffer.com/what-if-the-c...
2,39,https://insights.blackcoffer.com/what-jobs-wil...
3,40,https://insights.blackcoffer.com/will-machine-...
4,41,https://insights.blackcoffer.com/will-ai-repla...


In [8]:
output_file = pd.read_excel('output.xlsx')
output_file.head()

Unnamed: 0,URL_ID,URL,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,AVG NUMBER OF WORDS PER SENTENCE,COMPLEX WORD COUNT,WORD COUNT,SYLLABLE PER WORD,PERSONAL PRONOUNS,AVG WORD LENGTH
0,37,https://insights.blackcoffer.com/ai-in-healthc...,,,,,,,,,,,,,
1,38,https://insights.blackcoffer.com/what-if-the-c...,,,,,,,,,,,,,
2,39,https://insights.blackcoffer.com/what-jobs-wil...,,,,,,,,,,,,,
3,40,https://insights.blackcoffer.com/will-machine-...,,,,,,,,,,,,,
4,41,https://insights.blackcoffer.com/will-ai-repla...,,,,,,,,,,,,,


# Getting the text

In [9]:
import requests
from bs4 import BeautifulSoup
import json
from typing import List

In [10]:
def fetch_urls(input_file) -> List[str]:
    # The second column of the input sheet holds the URLs.
    return input_file.iloc[:, 1].tolist()

In [11]:
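def fetch_contents(url) -> List[str]:
    """Return the text of every <p> tag on the page, or an empty list on failure."""
    out_text = []
    try:
        print(url)
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        # Name the parser explicitly so the result does not depend on what is installed.
        soup_obj = BeautifulSoup(response.content, 'html.parser')
        # Every <p> tag is collected, so paragraphs outside the article body are included too.
        for paragraph in soup_obj.find_all('p'):
            out_text.append(paragraph.text)
    except Exception as e:
        print(e)
    return out_text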

In [12]:
urls = fetch_urls(input_file)
my_text = fetch_contents(urls[5])

https://insights.blackcoffer.com/man-and-machines-together-machines-are-more-diligent-than-humans-blackcoffe/


In [13]:
my_text

['NLP-based Approach for Data Transformation',
 'An ETL tool to pull data from Shiphero to Google Bigquery Data Warehouse',
 'Plaid Financial Analytics – A Data-Driven Dashboard to generate insights',
 'Recommendation Engine for Insurance Sector to Expand Business in the Rural Area',
 'Grafana Dashboard – Oscar Awards',
 'AutoGPT Setup',
 'Playstore & Appstore to Google Analytics (GA) or Firebase to Google Data Studio Mobile App KPI Dashboard',
 'Google Local Service Ads LSA API To Google BigQuery to Google Data Studio',
 'Rise of telemedicine and its Impact on Livelihood by 2040',
 'Rise of e-health and its impact on humans by the year 2030',
 'Rise of e-health and its impact on humans by the year 2030',
 'Rise of telemedicine and its Impact on Livelihood by 2040',
 'AI/ML and Predictive Modeling',
 'Solution for Contact Centre Problems',
 'How to Setup Custom Domain for Google App Engine Application?',
 'Code Review Checklist',
 'Where is this disruptive technology taking us? Take it

# Sentiment Analysis
    Sentiment analysis is the process of determining whether a piece of writing is positive, negative, or neutral. The algorithm below is designed for use with financial texts. It consists of the following steps:

##### Cleaning using Stop Words Lists
    The Stop Words Lists (found in the folder StopWords) are used to clean the text so that Sentiment Analysis can be performed by excluding the words found in Stop Words List. 

In [14]:
from nltk.stem import WordNetLemmatizer, PorterStemmer
import nltk
import re
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to C:\Users\RANJIT
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\RANJIT
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\RANJIT
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


##### Average Number of Words Per Sentence
    Average Number of Words Per Sentence = the total number of words / the total number of sentences
##### Word Count
    removing the stop words (using stopwords class of nltk package).
    removing any punctuations like ? ! , . from the word before counting.
##### Average Word Length
    Sum of the total number of characters in each word/Total number of words
##### Syllable Count Per Word
    We count the number of Syllables in each word of the text by counting the vowels present in each word. We also handle some exceptions like words ending with "es","ed" by not counting them as a syllable.
##### Complex Word Count
    Complex words are words in the text that contain more than two syllables.
##### Analysis of Readability
    Average Sentence Length = the number of words / the number of sentences
    Percentage of Complex words = the number of complex words / the number of words 
    Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex words)
##### Personal Pronouns
    To calculate Personal Pronouns mentioned in the text, we use regex to find the counts of the words - “I,” “we,” “my,” “ours,” and “us”. Special care is taken so that the country name US is not included in the list.

In [15]:
class TextProcessorAndCounter:
    def __init__(self, sentences: list):
        self.sentences = sentences
    
    @staticmethod
    def get_stopwords() -> tuple:
        """Return (custom stop words from the stopwords folder, NLTK English stop words)."""
        folder = 'stopwords'
        custom_stopwords = []
        for name in os.listdir(folder):
            with open(os.path.join(folder, name), 'r') as file:
                for line in file:
                    line = line.replace('\n', '').casefold()
                    # Some lines hold two entries separated by '|'; keep the first two parts.
                    for part in line.split('|')[:2]:
                        custom_stopwords.append(part.strip())
        english_stopwords = nltk.corpus.stopwords.words('english')
        return custom_stopwords, english_stopwords
    
    #Clean sentences: keep letters only, casefold, and drop NLTK English stop words
    def clean_sentences(self) -> List[str]:
        # The custom lists are loaded here, but only the NLTK list is used for filtering.
        custom_stopwords, english_stopwords = self.get_stopwords()
        english_stopwords = set(english_stopwords)  # set lookups are faster than list lookups
        clean_lines = []
        for sentence in self.sentences:
            line = re.sub('[^a-zA-Z]', ' ', sentence)
            new_line = nltk.word_tokenize(line.casefold())
            clean_words = [word for word in new_line if word not in english_stopwords]
            clean_lines.append(' '.join(clean_words))
        return clean_lines
    
    #AVG SENTENCE LENGTH
    def avg_sentence_length(self) -> int:
        # Length is measured in characters per element of self.sentences, rounded to an integer.
        length = sum(len(sentence) for sentence in self.sentences)
        avg_length = int(np.round(length / len(self.sentences)))
        return avg_length
    
    #COMPLEX WORD COUNT
    #PERCENTAGE OF COMPLEX WORDS
    #SYLLABLE PER WORD
    def complex_word(self) -> tuple:
        """Return (complex word count, percentage of complex words, average syllables per word)."""
        sentence_list = self.clean_sentences()
        pattern = r'(es|ed)$'
        vowels = r'[aeiou]'
        complex_count = 0
        complex_words = []
        syllable = 0
        word_count = 0
        for sentence in sentence_list:
            words = nltk.word_tokenize(sentence)
            for word in words:
                # Words ending in "es" or "ed" are skipped entirely, as are words with no vowels.
                if re.search(pattern, word):
                    continue
                vowel_count = len(re.findall(vowels, word))
                if vowel_count == 0:
                    continue
                # Each vowel letter counts as one syllable.
                syllable += vowel_count
                word_count += 1
                if vowel_count > 2:
                    complex_count += 1
                    complex_words.append(word)
        # The percentage is taken over all cleaned words, including the skipped ones.
        total_word_count = len(nltk.word_tokenize(' '.join(sentence_list)))
        complex_percentage = np.round((complex_count / total_word_count)*100, 2)
        # The average is taken over counted words only and rounded to a whole number.
        avg_syllable = int(np.round(syllable / word_count))
        return complex_count, complex_percentage, avg_syllable
    
    #FOG INDEX
    def fog_index(self) -> float:
        avg_length = self.avg_sentence_length()
        _, complex_percentage, _ = self.complex_word()
        value = np.round((avg_length + complex_percentage) * 0.4, 2)
        return value
    
    #PERSONAL PRONOUNS
    def personal_pronouns(self) -> str:
        pronouns_list = []
        pronouns = r'\b(I|we|my|ours|us|We|My|Ours|Us)\b'
        for sentence in self.sentences:
            found = re.findall(pronouns, sentence)
            pronouns_list.extend(found)
        return ', '.join(pronouns_list)
    
    #AVG NUMBER OF WORDS PER SENTENCE
    #WORD COUNT
    #AVG WORD LENGTH
    def sentences_words_count(self) -> tuple:
        """Return (average words per sentence, word count, average word length)."""
        clean_sentences = self.clean_sentences()
        # Each element of self.sentences is treated as one sentence.
        total_sentence_count = len(clean_sentences)
        words = nltk.word_tokenize(' '.join(clean_sentences))
        word_count = len(words)
        total_char_count = sum(len(word) for word in words)
        average_words = int(np.round(word_count / total_sentence_count))
        average_word_length = int(np.round(total_char_count / word_count))
        return average_words, word_count, average_word_length

In [16]:
TextProcessorAndCounter(my_text).avg_sentence_length()

258

In [17]:
TextProcessorAndCounter(my_text).complex_word()

(301, 38.05, 3)

In [18]:
TextProcessorAndCounter(my_text).fog_index()

118.42

In [19]:
TextProcessorAndCounter(my_text).personal_pronouns()

'us, I, I, we, we, we, us, I, We, us, I, my, my, we, I, we, we, My, we, we, We, We, us'

In [20]:
TextProcessorAndCounter(my_text).sentences_words_count()

(23, 791, 7)

##### Positive Score: 
    This score is calculated by assigning the value of +1 for each word if found in the Positive Dictionary and then adding up all the values.
##### Negative Score: 
    This score is calculated by assigning the value of -1 for each word if found in the Negative Dictionary and then adding up all the values. We multiply the score with -1 so that the score is a positive number.
##### Polarity Score: 
    This is the score that determines if a given text is positive or negative in nature. It is calculated by using the formula: 
* Polarity Score = (Positive Score – Negative Score)/ ((Positive Score + Negative Score) + 0.000001) Range is from -1 to +1
##### Subjectivity Score: 
    This is the score that determines if a given text is objective or subjective. It is calculated by using the formula: 
* Subjectivity Score = (Positive Score + Negative Score)/ ((Total Words after cleaning) + 0.000001) Range is from 0 to +1

In [21]:
class Scores(TextProcessorAndCounter):
    def __init__(self, sentences: list):
        super().__init__(sentences)
        self.sentences = sentences
    
    @staticmethod
    def positive_lib() -> List[str]:
        folder = 'master_dictionary/positive-words.txt'
        positive_words = []
        with open(folder, 'r') as file:
            for word in file.readlines():
                word = lemmatizer.lemmatize(word.replace('\n', ''))
                positive_words.append(word)
        return positive_words
    
    @staticmethod
    def negative_lib() -> List[str]:
        folder = 'master_dictionary/negative-words.txt'
        negative_words = []
        with open(folder, 'r') as file:
            for word in file.readlines():
                word = lemmatizer.lemmatize(word.replace('\n', ''))
                negative_words.append(word)
        return negative_words
    
    #POSITIVE SCORE
    def positive_score(self):
        # Tokenizes the raw sentences and counts tokens whose lemma is in the positive dictionary.
        positive_words = set(self.positive_lib())
        count = 0
        for sentence in self.sentences:
            for word in nltk.word_tokenize(sentence):
                if lemmatizer.lemmatize(word) in positive_words:
                    count += 1
        return count
    
    #NEGATIVE SCORE
    def negative_score(self):
        # Tokenizes the raw sentences and counts tokens whose lemma is in the negative dictionary.
        negative_words = set(self.negative_lib())
        count = 0
        for sentence in self.sentences:
            for word in nltk.word_tokenize(sentence):
                if lemmatizer.lemmatize(word) in negative_words:
                    count += 1
        return count
    
    #POLARITY SCORE
    def polarity_score(self):
        pos_score = self.positive_score()
        neg_score = self.negative_score()
        score = np.round((pos_score - neg_score)  / ((pos_score + neg_score) + 0.000001), 2)
        return score
    
    #SUBJECTIVITY SCORE
    def subjective_score(self):
        _, word_count, _ = self.sentences_words_count()
        pos_score = self.positive_score()
        neg_score = self.negative_score()
        score = np.round((pos_score + neg_score) / (word_count + 0.000001), 2)
        return score

In [22]:
Scores(my_text).positive_score()

58

In [23]:
Scores(my_text).negative_score()

24

In [24]:
Scores(my_text).polarity_score()

0.41

In [25]:
Scores(my_text).subjective_score()

0.1

In [26]:
obj = Document('objective.docx')

In [32]:
processor = TextProcessorAndCounter(my_text)
# AVG SENTENCE LENGTH
avg_sentence_len = processor.avg_sentence_length()
# COMPLEX WORD COUNT
# PERCENTAGE OF COMPLEX WORDS
# SYLLABLE PER WORD
complex_word, percentage_complex, syllable_avg = processor.complex_word()
# FOG INDEX
fog_index = processor.fog_index()
# PERSONAL PRONOUNS
personal_pronouns = processor.personal_pronouns()
# AVG NUMBER OF WORDS PER SENTENCE
# WORD COUNT
# AVG WORD LENGTH
average_words, word_count, average_word_length = processor.sentences_words_count()
score_counter = Scores(my_text)
# POSITIVE SCORE
p_score = score_counter.positive_score()
# NEGATIVE SCORE
n_score = score_counter.negative_score()
# POLARITY SCORE
pol_score = score_counter.polarity_score()
# SUBJECTIVITY SCORE
sub_score = score_counter.subjective_score()

In [33]:
result = {
    'POSITIVE SCORE': p_score,
    'NEGATIVE SCORE': n_score,
    'POLARITY SCORE': pol_score,
    'SUBJECTIVITY SCORE': sub_score,
    'AVG SENTENCE LENGTH': avg_sentence_len,
    'PERCENTAGE OF COMPLEX WORDS': percentage_complex,
    'FOG INDEX': fog_index,
    'AVG NUMBER OF WORDS PER SENTENCE': average_words,
    'COMPLEX WORD COUNT': complex_word,
    'WORD COUNT': word_count,
    'SYLLABLE PER WORD': syllable_avg,
    'PERSONAL PRONOUNS': personal_pronouns,
    'AVG WORD LENGTH': average_word_length
}

In [43]:
# Placeholder: the same result is repeated three times to preview the output layout.
new_data = pd.DataFrame([result, result, result])
new_data

Unnamed: 0,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,AVG NUMBER OF WORDS PER SENTENCE,COMPLEX WORD COUNT,WORD COUNT,SYLLABLE PER WORD,PERSONAL PRONOUNS,AVG WORD LENGTH
0,58,24,0.41,0.1,258,38.05,118.42,23,301,791,3,"us, I, I, we, we, we, us, I, We, us, I, my, my...",7
1,58,24,0.41,0.1,258,38.05,118.42,23,301,791,3,"us, I, I, we, we, we, us, I, We, us, I, my, my...",7
2,58,24,0.41,0.1,258,38.05,118.42,23,301,791,3,"us, I, I, we, we, we, us, I, We, us, I, my, my...",7


In [42]:
my_data = input_file.iloc[:3, :]
my_data

Unnamed: 0,URL_ID,URL
0,37,https://insights.blackcoffer.com/ai-in-healthc...
1,38,https://insights.blackcoffer.com/what-if-the-c...
2,39,https://insights.blackcoffer.com/what-jobs-wil...


In [48]:
pd.concat((my_data, new_data), axis=1)

Unnamed: 0,URL_ID,URL,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,AVG NUMBER OF WORDS PER SENTENCE,COMPLEX WORD COUNT,WORD COUNT,SYLLABLE PER WORD,PERSONAL PRONOUNS,AVG WORD LENGTH
0,37,https://insights.blackcoffer.com/ai-in-healthc...,58,24,0.41,0.1,258,38.05,118.42,23,301,791,3,"us, I, I, we, we, we, us, I, We, us, I, my, my...",7
1,38,https://insights.blackcoffer.com/what-if-the-c...,58,24,0.41,0.1,258,38.05,118.42,23,301,791,3,"us, I, I, we, we, we, us, I, We, us, I, my, my...",7
2,39,https://insights.blackcoffer.com/what-jobs-wil...,58,24,0.41,0.1,258,38.05,118.42,23,301,791,3,"us, I, I, we, we, we, us, I, We, us, I, my, my...",7
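
The concatenation above reuses one article's metrics for all three rows, which is enough to check the column layout. A minimal sketch of how one row per article might be assembled is given below. `build_rows`, `fetch`, and `analyze` are hypothetical names: `fetch` stands in for a function like `fetch_contents`, and `analyze` for one that returns a dictionary shaped like `result`.

```python
def build_rows(records, fetch, analyze):
    """Return one output row per input record.

    records: iterable of dicts with at least a "URL" key (for example, rows of Input.xlsx).
    fetch:   callable taking a URL and returning a list of paragraphs (or None on failure).
    analyze: callable taking the paragraphs and returning a dict of metric name to value.
    """
    rows = []
    for record in records:
        paragraphs = fetch(record["URL"]) or []
        # Leave the metric columns empty when nothing could be extracted.
        metrics = analyze(paragraphs) if paragraphs else {}
        rows.append({**record, **metrics})
    return rows


# Stub callables used only to demonstrate the shape of the result.
demo_rows = build_rows(
    [{"URL_ID": 1, "URL": "u1"}, {"URL_ID": 2, "URL": "u2"}],
    fetch=lambda url: ["some text"] if url == "u1" else None,
    analyze=lambda paragraphs: {"WORD COUNT": len(paragraphs)},
)
print(demo_rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` and reordered to match the column order of “Output Data Structure.xlsx”.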
