Blackcoffer is an enterprise software and analytics consulting firm based in India and the European Union (Malta). It is a data-driven technology and decision science firm focused exclusively on big data and analytics, data-driven dashboards, application development, information management, and consulting of any kind, from any source, on a massive scale. We are a young, global consulting shop that helps enterprises and entrepreneurs solve problems in these areas to minimize risk, explore opportunities for future growth, and increase profits more effectively. We provide intelligence, accelerate innovation, and implement technology with extraordinary breadth and depth of global insight for organizations by combining unique specialist services with high-level human expertise.

# Data Extraction and NLP
### Test Assignment
## Objective
    The objective of this assignment is to extract the article text from each given URL and perform text analysis to compute the variables explained below.
    
## Data Extraction
    Input.xlsx
    For each of the articles, given in the input.xlsx file, extract the article text and save the extracted article in a text file with URL_ID as its file name.
    While extracting text, please make sure your program extracts only the article title and the article text. It should not extract the website header, footer, or anything other than the article text. 

* *NOTE: YOU MUST USE PYTHON PROGRAMMING TO EXTRACT DATA FROM THE URLs. YOU CAN USE BEAUTIFULSOUP, SELENIUM, SCRAPY, OR ANY OTHER PYTHON LIBRARY THAT YOU PREFER FOR DATA CRAWLING.*
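
The extraction rule above (keep only the title and the article text) could be sketched as follows. This is a minimal illustration, not the assignment's reference solution: it uses only the standard library's `html.parser` so that it runs without extra installs, whereas BeautifulSoup would be the more usual choice. The assumption that the title sits in `<h1>` and the body in `<p>` tags inside `<article>` is hypothetical and must be checked against the real pages; `ArticleTextParser` and `extract_article` are illustrative names.

```python
from html.parser import HTMLParser


class ArticleTextParser(HTMLParser):
    """Collect text from <h1> and from <p> tags nested inside <article>."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0  # > 0 while inside an <article> element
        self.capture = None     # tag being captured: "h1", "p", or None
        self.title_parts = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "h1" and self.capture is None:
            self.capture = "h1"
        elif tag == "p" and self.article_depth > 0 and self.capture is None:
            self.capture = "p"
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1
        elif tag == self.capture:
            self.capture = None

    def handle_data(self, data):
        if self.capture == "h1":
            self.title_parts.append(data)
        elif self.capture == "p":
            self.paragraphs[-1] += data


def extract_article(html):
    """Return (title, body) with whitespace normalised; header/footer text is skipped."""
    parser = ArticleTextParser()
    parser.feed(html)
    title = " ".join("".join(parser.title_parts).split())
    body = "\n".join(" ".join(p.split()) for p in parser.paragraphs if p.strip())
    return title, body
```

The returned title and body can then be written to a text file named after the article's URL_ID, as the assignment requires.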

## Data Analysis
    For each of the extracted article texts, perform textual analysis and compute the variables given in the output structure Excel file. Save the output in the exact order given in the output structure file, “Output Data Structure.xlsx”.

* *NOTE: YOU MUST USE PYTHON PROGRAMMING FOR THE DATA ANALYSIS.*


## Variables
    The definition of each variable is given in the “Text Analysis.docx” file.
    POSITIVE SCORE
    NEGATIVE SCORE
    POLARITY SCORE
    SUBJECTIVITY SCORE
    AVG SENTENCE LENGTH
    PERCENTAGE OF COMPLEX WORDS
    FOG INDEX
    AVG NUMBER OF WORDS PER SENTENCE
    COMPLEX WORD COUNT
    WORD COUNT
    SYLLABLE PER WORD
    PERSONAL PRONOUNS
    AVG WORD LENGTH

## Output Data Structure
    Output Variables: 
    All input variables in “Input.xlsx”
    POSITIVE SCORE
    NEGATIVE SCORE
    POLARITY SCORE
    SUBJECTIVITY SCORE
    AVG SENTENCE LENGTH
    PERCENTAGE OF COMPLEX WORDS
    FOG INDEX
    AVG NUMBER OF WORDS PER SENTENCE
    COMPLEX WORD COUNT
    WORD COUNT
    SYLLABLE PER WORD
    PERSONAL PRONOUNS
    AVG WORD LENGTH
    Check out the output data structure spreadsheet, “Output Data Structure.xlsx”, for the format of your output.


## Sentiment Analysis
    Sentiment analysis is the process of determining whether a piece of writing is positive, negative, or neutral. The algorithm below is designed for use on financial texts. It consists of the following steps:

## Cleaning using Stop Words Lists
    The Stop Words Lists (found in the folder StopWords) are used to clean the text so that Sentiment Analysis can be performed by excluding the words found in Stop Words List. 
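
The cleaning step might look like the sketch below. It assumes each stop-word file holds one entry per line, that anything after a `|` on a line is a comment, and that the files decode as Latin-1; none of this is confirmed by the assignment text. `load_stop_words` and `clean_tokens` are illustrative names, and a simple regex stands in for a real tokenizer.

```python
import re
from pathlib import Path


def load_stop_words(folder):
    """Read every .txt file in `folder` into one lower-cased set of stop words."""
    stop_words = set()
    for path in Path(folder).glob("*.txt"):
        # Assumption: one entry per line; text after "|" is a comment.
        for line in path.read_text(encoding="latin-1").splitlines():
            word = line.split("|")[0].strip().lower()
            if word:
                stop_words.add(word)
    return stop_words


def clean_tokens(text, stop_words):
    """Lower-case the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [t for t in tokens if t not in stop_words]
```

With the folder downloaded in this notebook, the call would be `load_stop_words("stopwords")`.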

## Creating a dictionary of Positive and Negative words
    The Master Dictionary (found in the folder MasterDictionary) is used for creating a dictionary of Positive and Negative words. We add a word to the dictionary only if it is not found in the Stop Words Lists.
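
A possible way to build the two word sets is sketched here. The function takes any iterable of lines, so an open file handle works as well as a list; the assumption of one word per line is not stated in the assignment, and `build_sentiment_sets` is an illustrative name.

```python
def build_sentiment_sets(positive_lines, negative_lines, stop_words):
    """Return (positive, negative) word sets with stop words removed."""
    stop = {w.lower() for w in stop_words}
    # Assumption: each line holds a single word; blank lines are skipped.
    positive = {line.strip().lower() for line in positive_lines if line.strip()}
    negative = {line.strip().lower() for line in negative_lines if line.strip()}
    return positive - stop, negative - stop
```

In this notebook the inputs would come from `master dictionary/positive-words.txt` and `master dictionary/negative-words.txt`.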

## Extracting Derived variables
    We convert the text into a list of tokens using the nltk tokenize module and use these tokens to calculate the 4 variables described below:

* Positive Score: This score is calculated by assigning the value of +1 for each word if found in the Positive Dictionary and then adding up all the values.
* Negative Score: This score is calculated by assigning the value of -1 for each word found in the Negative Dictionary and then adding up all the values. We multiply the score by -1 so that the result is a positive number.
* Polarity Score: This is the score that determines if a given text is positive or negative in nature. It is calculated by using the formula: 
    Polarity Score = (Positive Score - Negative Score) / ((Positive Score + Negative Score) + 0.000001). The range is from -1 to +1.
* Subjectivity Score: This is the score that determines if a given text is objective or subjective. It is calculated by using the formula: 
    Subjectivity Score = (Positive Score + Negative Score) / ((Total Words after cleaning) + 0.000001). The range is from 0 to +1.
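
The four derived variables can be computed directly from the formulas above. The sketch below takes an already cleaned, lower-cased token list (for example from nltk's `word_tokenize` followed by stop-word removal) so that it has no dependencies; `sentiment_scores` and its parameter names are illustrative.

```python
def sentiment_scores(tokens, positive_words, negative_words):
    """Return the four derived scores for a list of cleaned tokens."""
    # +1 for every token found in the positive dictionary.
    positive_score = sum(1 for t in tokens if t in positive_words)
    # -1 for every token found in the negative dictionary, then flip the sign.
    negative_score = -1 * sum(-1 for t in tokens if t in negative_words)
    # The small constant keeps the division defined when a denominator is zero.
    polarity = (positive_score - negative_score) / (
        (positive_score + negative_score) + 0.000001
    )
    subjectivity = (positive_score + negative_score) / (len(tokens) + 0.000001)
    return {
        "POSITIVE SCORE": positive_score,
        "NEGATIVE SCORE": negative_score,
        "POLARITY SCORE": polarity,
        "SUBJECTIVITY SCORE": subjectivity,
    }
```

For the tokens `["good", "bad", "great", "market"]` with two positive words and one negative word, the polarity is close to 1/3 and the subjectivity close to 3/4.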

## Analysis of Readability
    Analysis of Readability is calculated using the Gunning Fog Index formula described below.
    Average Sentence Length = the number of words / the number of sentences
    Percentage of Complex words = the number of complex words / the number of words 
    Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex words)
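
As a worked sketch of the three readability formulas, the function below takes plain counts and follows the definitions exactly as written here, so the percentage of complex words is a fraction between 0 and 1. The name `readability` and the zero-count guard are additions for illustration.

```python
def readability(word_count, sentence_count, complex_word_count):
    """Return (average sentence length, percentage of complex words, Fog Index)."""
    if word_count == 0 or sentence_count == 0:
        # Guard added for empty texts; not part of the original definitions.
        return 0.0, 0.0, 0.0
    avg_sentence_length = word_count / sentence_count
    pct_complex_words = complex_word_count / word_count
    fog_index = 0.4 * (avg_sentence_length + pct_complex_words)
    return avg_sentence_length, pct_complex_words, fog_index
```

A text with 100 words, 5 sentences, and 20 complex words gives an average sentence length of 20, a complex-word fraction of 0.2, and a Fog Index of 0.4 * 20.2, which is about 8.08.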

## Average Number of Words Per Sentence
    The formula for calculating is:
    Average Number of Words Per Sentence = the total number of words / the total number of sentences

## Complex Word Count
    Complex words are words in the text that contain more than two syllables.

## Word Count
    We count the total cleaned words present in the text by:
    removing the stop words (using the stopwords class of the nltk package), and
    removing any punctuation such as ? ! , . from each word before counting.
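
A minimal sketch of this count follows. To stay self-contained it accepts the stop words as a set; in the notebook that set could be built with `set(nltk.corpus.stopwords.words("english"))`. The function name `cleaned_word_count` is illustrative.

```python
import string


def cleaned_word_count(text, stop_words):
    """Count words left after stripping punctuation and removing stop words."""
    strip_punct = str.maketrans("", "", string.punctuation)
    # Remove punctuation from each whitespace-separated word, then lower-case it.
    words = (w.translate(strip_punct).lower() for w in text.split())
    # Skip tokens that were pure punctuation (now empty) and stop words.
    return sum(1 for w in words if w and w not in stop_words)
```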

## Syllable Count Per Word
    We count the number of syllables in each word of the text by counting the vowels present in each word. We also handle exceptions such as words ending with "es" or "ed" by not counting those endings as a syllable.
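
The rule above, together with the complex-word definition (more than two syllables), might be implemented as below. This sketch reads the rule literally and counts every vowel letter; counting groups of consecutive vowels is a common alternative that is closer to real syllables. Keeping at least one syllable for short words such as "red" is an added assumption, and both function names are illustrative.

```python
def count_syllables(word):
    """Approximate syllables as the number of vowel letters in the word."""
    word = word.lower()
    count = sum(1 for ch in word if ch in "aeiou")
    # Do not count the "e" of a trailing "es" or "ed" as a syllable.
    # Assumption: keep at least one syllable for short words like "red".
    if word.endswith(("es", "ed")) and count > 1:
        count -= 1
    return count


def complex_word_count(words):
    """Count words with more than two syllables."""
    return sum(1 for w in words if count_syllables(w) > 2)
```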

## Personal Pronouns
    To calculate Personal Pronouns mentioned in the text, we use regex to find the counts of the words - “I,” “we,” “my,” “ours,” and “us”. Special care is taken so that the country name US is not included in the list.
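
One way to express this with a regex is sketched below. Treating only the all-caps token "US" as the country name is an assumption about what "special care" means here; `PRONOUN_PATTERN` and `count_personal_pronouns` are illustrative names.

```python
import re

# Whole-word matches only, so "us" inside "trust" or "my" inside "economy" is ignored.
PRONOUN_PATTERN = re.compile(r"\b(?:I|we|my|ours|us)\b", re.IGNORECASE)


def count_personal_pronouns(text):
    """Count I, we, my, ours, and us, skipping the upper-case country name US."""
    return sum(1 for m in PRONOUN_PATTERN.finditer(text) if m.group() != "US")
```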

## Average Word Length
    Average Word Length is calculated by the formula:
    Sum of the total number of characters in each word/Total number of words
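
The formula translates to a few lines of Python. The sketch assumes the words have already been tokenized and stripped of punctuation; returning 0.0 for an empty list is an added guard, and `average_word_length` is an illustrative name.

```python
def average_word_length(words):
    """Return total characters across all words divided by the number of words."""
    if not words:
        # Guard added for empty input; not part of the original formula.
        return 0.0
    return sum(len(w) for w in words) / len(words)
```

For example, `["big", "data", "analysis"]` has 3 + 4 + 8 = 15 characters over 3 words, giving 5.0.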


In [1]:
import pandas as pd
import gdown
import requests
import os
from docx import Document
import warnings
from pprint import pprint
warnings.filterwarnings('ignore')

In [2]:
input_file = r'https://docs.google.com/spreadsheets/d/1D7QkDHxUSKnQhR--q0BAwKMxQlUyoJTQ/edit?usp=drive_link'
objective_file = r'https://docs.google.com/document/d/1wHMJDDvEKksgPRFajZXeycUcldC57lqr/edit?usp=drive_link'
output_file = r'https://docs.google.com/spreadsheets/d/1kHcx9epaZKB96zRItudnrDi57cFEndFI/edit?usp=drive_link'
text_analysis = r'https://docs.google.com/document/d/11FuBgszZwCSpVWekJ6rR5tBLjU--xfIC/edit?usp=drive_link'

In [3]:
def download_docs(url, file_name):
    """Download the file at `url` with gdown and save it as `file_name`."""
    gdown.download(url, file_name)

In [4]:
download_docs(input_file, 'input.xlsx')
download_docs(objective_file, 'objctive.docx')
download_docs(output_file, 'output.xlsx')
download_docs(text_analysis, 'test_analysis.docx')

Downloading...
From (uriginal): https://docs.google.com/spreadsheets/d/1D7QkDHxUSKnQhR--q0BAwKMxQlUyoJTQ/edit?usp=drive_link
From (redirected): https://docs.google.com/spreadsheets/d/1D7QkDHxUSKnQhR--q0BAwKMxQlUyoJTQ/export?format=xlsx
To: C:\Users\PythonFiles\PYcharm\Sentiment_analysis_Blackcoffer\notebook\input.xlsx
14.6kB [00:00, 166kB/s]
Downloading...
From (uriginal): https://docs.google.com/document/d/1wHMJDDvEKksgPRFajZXeycUcldC57lqr/edit?usp=drive_link
From (redirected): https://docs.google.com/document/d/1wHMJDDvEKksgPRFajZXeycUcldC57lqr/export?format=docx
To: C:\Users\PythonFiles\PYcharm\Sentiment_analysis_Blackcoffer\notebook\objctive.docx
12.5kB [00:00, 82.8kB/s]
Downloading...
From (uriginal): https://docs.google.com/spreadsheets/d/1kHcx9epaZKB96zRItudnrDi57cFEndFI/edit?usp=drive_link
From (redirected): https://docs.google.com/spreadsheets/d/1kHcx9epaZKB96zRItudnrDi57cFEndFI/export?format=xlsx
To: C:\Users\PythonFiles\PYcharm\Sentiment_analysis_Blackcoffer\notebook\output.

In [5]:
master_dic = r'https://drive.google.com/drive/folders/1YRcVlJO3ZaC78iTC6JcunfZl7Fz4AL8v?usp=drive_link'
stopwords = r'https://drive.google.com/drive/folders/1rd7YdoX8tED9mujc0c-6evJU4y7LFc_R?usp=drive_link'

In [6]:
gdown.download_folder(master_dic, output='master dictionary')
gdown.download_folder(stopwords, output='stopwords')

Retrieving folder list


Processing file 1qqMwc_-ayS38HEOB97osO_nkIxRkbnvh negative-words.txt
Processing file 1seAj8G42SmfgUUx8lqVDJofm4Tuh2TOT positive-words.txt
Building directory structure completed


Retrieving folder list completed
Building directory structure
Downloading...
From: https://drive.google.com/uc?id=1qqMwc_-ayS38HEOB97osO_nkIxRkbnvh
To: C:\Users\PythonFiles\PYcharm\Sentiment_analysis_Blackcoffer\notebook\master dictionary\negative-words.txt
100%|█████████████████████████████████████████████████████████████████████████████| 44.8k/44.8k [00:01<00:00, 32.8kB/s]
Downloading...
From: https://drive.google.com/uc?id=1seAj8G42SmfgUUx8lqVDJofm4Tuh2TOT
To: C:\Users\PythonFiles\PYcharm\Sentiment_analysis_Blackcoffer\notebook\master dictionary\positive-words.txt
100%|█████████████████████████████████████████████████████████████████████████████| 19.1k/19.1k [00:00<00:00, 25.8kB/s]
Download completed
Retrieving folder list


Processing file 1aWxyJI0d9MOk59OZ_unfBY5E-Nvg_ezW StopWords_Auditor.txt
Processing file 1K-6MjPq5AQg4ICYY6PDfapB7JECUnryD StopWords_Currencies.txt
Processing file 13LXnH6vaJhvY4s2ai_2oW2qwongU_iAI StopWords_DatesandNumbers.txt
Processing file 1tTDfLXNPxNuUGZXHQkQhW6wPf4Xnivwr StopWords_Generic.txt
Processing file 1PnZhcsfjBVxnzwa4N6MrLWf6Kuhhjpdk StopWords_GenericLong.txt
Processing file 1RKxMOHzBdLrGuYb7MCJRTKKPwDG9Agbe StopWords_Geographic.txt
Processing file 1mBOuggD8AVNFjr9sprLoD2_6mVWAgRGE StopWords_Names.txt
Building directory structure completed


Retrieving folder list completed
Building directory structure
Downloading...
From: https://drive.google.com/uc?id=1aWxyJI0d9MOk59OZ_unfBY5E-Nvg_ezW
To: C:\Users\PythonFiles\PYcharm\Sentiment_analysis_Blackcoffer\notebook\stopwords\StopWords_Auditor.txt
100%|███████████████████████████████████████████████████████████████████████████████████████| 88.0/88.0 [00:00<?, ?B/s]
Downloading...
From: https://drive.google.com/uc?id=1K-6MjPq5AQg4ICYY6PDfapB7JECUnryD
To: C:\Users\PythonFiles\PYcharm\Sentiment_analysis_Blackcoffer\notebook\stopwords\StopWords_Currencies.txt
100%|█████████████████████████████████████████████████████████████████████████████| 1.76k/1.76k [00:00<00:00, 1.74MB/s]
Downloading...
From: https://drive.google.com/uc?id=13LXnH6vaJhvY4s2ai_2oW2qwongU_iAI
To: C:\Users\PythonFiles\PYcharm\Sentiment_analysis_Blackcoffer\notebook\stopwords\StopWords_DatesandNumbers.txt
100%|█████████████████████████████████████████████████████████████████████████████████████████| 832/832 [00:00<?, 

['stopwords\\StopWords_Auditor.txt',
 'stopwords\\StopWords_Currencies.txt',
 'stopwords\\StopWords_DatesandNumbers.txt',
 'stopwords\\StopWords_Generic.txt',
 'stopwords\\StopWords_GenericLong.txt',
 'stopwords\\StopWords_Geographic.txt',
 'stopwords\\StopWords_Names.txt']

In [7]:
input_file = pd.read_excel('input.xlsx')
input_file.head()

Unnamed: 0,URL_ID,URL
0,37,https://insights.blackcoffer.com/ai-in-healthc...
1,38,https://insights.blackcoffer.com/what-if-the-c...
2,39,https://insights.blackcoffer.com/what-jobs-wil...
3,40,https://insights.blackcoffer.com/will-machine-...
4,41,https://insights.blackcoffer.com/will-ai-repla...


In [8]:
output_file = pd.read_excel('output.xlsx')
output_file.head()

Unnamed: 0,URL_ID,URL,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,AVG NUMBER OF WORDS PER SENTENCE,COMPLEX WORD COUNT,WORD COUNT,SYLLABLE PER WORD,PERSONAL PRONOUNS,AVG WORD LENGTH
0,37,https://insights.blackcoffer.com/ai-in-healthc...,,,,,,,,,,,,,
1,38,https://insights.blackcoffer.com/what-if-the-c...,,,,,,,,,,,,,
2,39,https://insights.blackcoffer.com/what-jobs-wil...,,,,,,,,,,,,,
3,40,https://insights.blackcoffer.com/will-machine-...,,,,,,,,,,,,,
4,41,https://insights.blackcoffer.com/will-ai-repla...,,,,,,,,,,,,,


In [10]:
import requests