## Objective: 
The objective of this project is to extract article text from the URLs listed in the input Excel file ("input.xlsx") and to perform text analysis that computes the variables described in the text analysis file.

In [3]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import os

df = pd.read_excel(r"C:\Users\ohkba\OneDrive\Documents\Assignment Black\input.xlsx")
df.head()

Unnamed: 0,URL_ID,URL
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...


## 1. Data Extraction

For each article listed in the input.xlsx file, we extract the article text and save it in a text file with the URL_ID as its file name.
While extracting, the program keeps only the article title and the article text. It does not extract the website header, footer, or anything else outside the article.


In [4]:


def extract_article_text(url):
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        # Attempt to extract the title
        title = soup.find('h1', class_="entry-title")
        title_text = title.get_text(strip=True) if title else 'Title Not Found'
        # Extract article text 
        article_text = soup.find("div", class_="td-post-content tagdiv-type").get_text()
        return title_text, article_text
    
    except Exception as e:
        print(f"Error extracting article from {url} : {e}") # getting error in 11 websites as div class name has been changed
        return None, None
    
def save_article_to_file(url_id, title_text, article_text):
    directory = r"C:\Users\ohkba\OneDrive\Documents\Assignment Black\Text_data"
    filename = f"{url_id}.txt"
    full_path = os.path.join(directory, filename)
    os.makedirs(directory, exist_ok=True)  # Ensure the directory exists
    with open(full_path, 'w', encoding='utf-8') as file:
#         file.write(f"Title: {title_text}\n\n")
#         file.write(f"Article Text: {article_text}")
        file.write(f"{title_text}\n\n")
        file.write(f"{article_text}")
    
    return full_path    


        
for index, row in df.iterrows():
    url_id = row['URL_ID']
    url = row['URL']
    title, article_text = extract_article_text(url)
    if article_text is None:
        continue  # extraction failed; skip so the literal text "None" is not written to a file
    article_path = save_article_to_file(url_id, title, article_text)  # save the extracted article to a text file


# print("All files have been extracted and saved")


Error extracting article from https://insights.blackcoffer.com/rise-of-e-health-and-its-imapct-on-humans-by-the-year-2030-2/ : 'NoneType' object has no attribute 'get_text'
Error extracting article from https://insights.blackcoffer.com/how-advertisement-increase-your-market-value/ : 'NoneType' object has no attribute 'get_text'
Error extracting article from https://insights.blackcoffer.com/ai-in-healthcare-to-improve-patient-outcomes/ : 'NoneType' object has no attribute 'get_text'
Error extracting article from https://insights.blackcoffer.com/how-neural-networks-can-be-applied-in-various-areas-in-the-future/ : 'NoneType' object has no attribute 'get_text'
Error extracting article from https://insights.blackcoffer.com/future-of-work-how-ai-has-entered-the-workplace/ : 'NoneType' object has no attribute 'get_text'
Error extracting article from https://insights.blackcoffer.com/covid-19-environmental-impact-for-the-future/ : 'NoneType' object has no attribute 'get_text'
Error extracting a

## 2. Sentiment Analysis
Sentiment analysis is the process of determining whether a piece of writing is positive, negative, or neutral. The algorithm below is designed for use on financial texts and consists of the following steps.


#### 2.1 Creating a dictionary of Positive and Negative words
The Master Dictionary (found in the folder MasterDictionary) is used for creating a dictionary of Positive and Negative words.

In [5]:
def read_article(path):
    with open(path, 'r', encoding='utf-8') as file:  # articles were saved as UTF-8
        article_content = file.read()
    return article_content


def loading_Master_dictionary(file_path):
    words_list = []

    with open(file_path, 'r') as file:
        for line in file:
            stripped_line = line.strip()
            words_list.append(stripped_line)

    return words_list
    #print(stop_words)
    
    
Positive_list = loading_Master_dictionary(r"C:\Users\ohkba\OneDrive\Documents\Assignment Black\MasterDictionary\positive-words.txt")
Negative_list = loading_Master_dictionary(r"C:\Users\ohkba\OneDrive\Documents\Assignment Black\MasterDictionary\negative-words.txt")

print(len(set(Positive_list)))
print(len(set(Negative_list)))


2006
4783


#### 2.2 Cleaning using Stop Words Lists
The Stop Words Lists are used to clean the text: words found in a Stop Words List are excluded before Sentiment Analysis is performed. We use NLTK (Natural Language Toolkit), a comprehensive Python library for natural language processing (NLP) tasks.


In [6]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
import syllapy
import re

nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ohkba\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ohkba\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Extracting Derived variables
We convert the text into a list of tokens using the nltk tokenize module and use these tokens to calculate the four variables described below.

Positive Score: This score is calculated by assigning a value of +1 to each word found in the Positive Dictionary and then adding up all the values.

Negative Score: This score is calculated by assigning a value of -1 to each word found in the Negative Dictionary, adding up all the values, and multiplying the sum by -1 so that the score is a positive number. Equivalently, we can assign +1 to each word found in the Negative Dictionary and add up the values.

Polarity Score: This is the score that determines if a given text is positive or negative in nature. It is calculated by using the formula: 

Polarity Score = (Positive Score – Negative Score)/ ((Positive Score + Negative Score) + 0.000001)
Range is from -1 to +1

Subjectivity Score: This is the score that determines if a given text is objective or subjective. It is calculated by using the formula: 

Subjectivity Score = (Positive Score + Negative Score)/ ((Total Words after cleaning) + 0.000001)
Range is from 0 to +1
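
The four scores above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's own function: the name `sentiment_scores` and the tiny word sets are hypothetical stand-ins for the Master Dictionary lists.

```python
def sentiment_scores(tokens, positive_words, negative_words):
    """Compute the four derived sentiment variables from cleaned tokens."""
    positive = sum(1 for t in tokens if t in positive_words)
    negative = sum(1 for t in tokens if t in negative_words)
    # The small constant keeps the division defined when a denominator is zero.
    polarity = (positive - negative) / ((positive + negative) + 0.000001)
    subjectivity = (positive + negative) / (len(tokens) + 0.000001)
    return {
        "positive": positive,
        "negative": negative,
        "polarity": polarity,
        "subjectivity": subjectivity,
    }


# Hypothetical miniature dictionaries, for illustration only.
positive_words = {"good", "growth", "strong"}
negative_words = {"loss", "weak"}
tokens = ["strong", "growth", "offset", "weak", "demand"]

scores = sentiment_scores(tokens, positive_words, negative_words)
print(scores)
```

Using sets for the dictionaries makes each membership check a constant-time lookup, which matters when the real lists hold thousands of words.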

#### Analysis of Readability
Readability is calculated using the Gunning Fog index formula described below.

Average Sentence Length = the number of words / the number of sentences

Percentage of Complex words = the number of complex words / the number of words 

Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex words)
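
As a worked check of the formula as written above (with the complex-word share kept as a fraction), the sketch below recomputes the Fog Index for the first article from its Word Count (792) and Complex Word Count (168). The function name `fog_index` is hypothetical, and the sentence count of 64 is an inferred assumption: it is the value implied by 792 words at an average sentence length of 12.375.

```python
def fog_index(word_count, sentence_count, complex_word_count):
    """Fog Index following the formula in this section."""
    average_sentence_length = word_count / sentence_count
    complex_word_fraction = complex_word_count / word_count
    return 0.4 * (average_sentence_length + complex_word_fraction)


# 792 words and 168 complex words are reported for blackassign0001;
# 64 sentences is inferred from 792 / 12.375 and is an assumption.
value = fog_index(word_count=792, sentence_count=64, complex_word_count=168)
print(round(value, 6))
```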

#### Average Number of Words Per Sentence
The formula is:

Average Number of Words Per Sentence = the total number of words / the total number of sentences

#### Complex Word Count
Complex words are words in the text that contain more than two syllables.

#### Word Count
We count the total cleaned words present in the text after removing the stop words (using the stopwords corpus of the nltk package) and removing punctuation such as ? ! , . from each word before counting.
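
A minimal sketch of this cleaning step, using only the standard library. The helper name `clean_tokens` and the four-word stop list are hypothetical; in the notebook the stop words come from NLTK.

```python
import string


def clean_tokens(tokens, stop_words):
    """Lowercase tokens, strip surrounding punctuation, drop stop words and empties."""
    cleaned = []
    for token in tokens:
        word = token.lower().strip(string.punctuation)
        # A token made only of punctuation becomes an empty string and is skipped.
        if word and word not in stop_words:
            cleaned.append(word)
    return cleaned


stop_words = {"the", "is", "a", "of"}  # hypothetical miniature stop list
tokens = ["The", "market", "is", "growing", ",", "fast", "!"]

cleaned = clean_tokens(tokens, stop_words)
print(cleaned, len(cleaned))
```

Stripping punctuation before the stop-word check means that standalone tokens such as "," or "!" never contribute to the word count.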

#### Syllable Count Per Word
We count the number of Syllables in each word of the text by counting the vowels present in each word. We also handle some exceptions like words ending with "es","ed" by not counting them as a syllable.
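
The vowel-based heuristic can be sketched as follows. This is an illustration under stated assumptions, not the notebook's method (the analysis code relies on the syllapy package): the function name `count_syllables` is hypothetical, a run of consecutive vowels is counted as one syllable, and every word is given at least one syllable.

```python
import re


def count_syllables(word):
    """Estimate syllables by counting vowel groups, ignoring a trailing 'es' or 'ed'."""
    word = word.lower()
    # Assumption: drop a final "es"/"ed" so its vowel is not counted as a syllable.
    if word.endswith(("es", "ed")):
        word = word[:-2]
    # Each run of consecutive vowels counts once ("ea" in "reading" is one group).
    vowel_groups = re.findall(r"[aeiou]+", word)
    return max(1, len(vowel_groups))


def is_complex(word):
    """A word is complex when it has more than two syllables."""
    return count_syllables(word) > 2


for sample in ["jumped", "boxes", "reading", "analysis"]:
    print(sample, count_syllables(sample), is_complex(sample))
```

A heuristic like this is fast and dependency-free, but it will miscount irregular words; a dictionary-backed counter is generally more accurate.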

#### Personal Pronouns
To calculate Personal Pronouns mentioned in the text, we use regex to find the counts of the words - “I,” “we,” “my,” “ours,” and “us”. Special care is taken so that the country name US is not included in the list.
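
One way to keep the country name US out of the count is to match case-sensitively on the original text, before any lowercasing. The sketch below is an assumption about how that could be done, with a hypothetical pattern and sample sentence; it accepts "us" and "Us" but not the all-caps "US".

```python
import re

# "I" is matched only in upper case; the other pronouns may be capitalised
# at the start of a sentence. All-caps "US" never matches "[Uu]s".
PRONOUN_PATTERN = re.compile(r"\b(?:I|[Ww]e|[Mm]y|[Oo]urs|[Uu]s)\b")


def count_personal_pronouns(text):
    """Count whole-word occurrences of I, we, my, ours, and us in the raw text."""
    return len(PRONOUN_PATTERN.findall(text))


sample = "We moved our team to the US. It helped us, and I think my plan worked."
print(count_personal_pronouns(sample))
```

The distinction between "us" and "US" only survives if the text has not been lowercased, so a counter like this has to run on the raw article text rather than on the cleaned tokens.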

#### Average Word Length
Average Word Length is calculated by the formula:
Sum of the total number of characters in each word/Total number of words


## 3. Data Analysis
For each extracted article text, we perform textual analysis and compute the variables explained above. We then save the output to an Excel file, “Output.xlsx”, containing URL_ID, URL, and the computed variables in the exact order in which the analysis_results dictionary below is built.


In [7]:

def perform_textual_analysis(article_text, file_name):
    
    tokens = word_tokenize(article_text.lower())
    cleaned_tokens = [word for word in tokens if word not in stop_words]
    
    cleaned_text = ' '.join(cleaned_tokens)
    
#     def calculating_variables(cleaned_tokens):
    Positive_score = sum(1 for word in cleaned_tokens if word in Positive_list)
    Negative_score = sum(1 for word in cleaned_tokens if word in Negative_list)
    Polarity_score = (Positive_score - Negative_score) / (Positive_score + Negative_score + 0.000001)
    Total_words = len(cleaned_tokens)
    Subjectivity_score = (Positive_score + Negative_score) / (Total_words + 0.000001)

#         return Positive_score, Negative_score, Polarity_score, Subjectivity_score
    
    sentences = sent_tokenize(cleaned_text)
    sentences_count = len(sentences)
    Average_sentence_length = Total_words/sentences_count
    complex_words = [word for word in cleaned_tokens if syllapy.count(word) >= 3]
    complex_word_count = len(complex_words)
    Percentage_complex_words = complex_word_count/Total_words
    Fog_index = 0.4 * (Average_sentence_length + Percentage_complex_words)
    
    Average_no_words_per_sentence = Total_words/sentences_count
    total_chars = sum(len(word) for word in cleaned_tokens)
    Avg_word_length = total_chars/ Total_words
    syllables_per_word = sum(syllapy.count(word) for word in cleaned_tokens)/Total_words
    
    def count_personal_pronouns(cleaned_text):
        # Define the pattern for personal pronouns
        # Use word boundaries (\b) to ensure we're capturing the pronouns as whole words
        # Case insensitive matching except for "us" vs "US"
        # For "us" we make sure it is either followed by a non-uppercase letter or punctuation or spaces, or at the end of the string
        pattern = r'\b(I|we|my|ours)\b|\bus\b(?![A-Z])'

        # Find all matches using the pattern
        matches = re.findall(pattern, cleaned_text, re.IGNORECASE)

        # Return the count of matches
        return len(matches) #, matches we can also return the matches
    
    personal_pronouns = count_personal_pronouns(cleaned_text)
#     file_name = str(file_name)
    file_ID = file_name.replace(".txt", "")
    
    # Creating a dictionary with analysis results, will also add URL_ID for later merging the dataframes through left join on URL_ID as it's unique
    analysis_results = {
        "URL_ID": file_ID,
        "Positive Score": Positive_score,
        "Negative Score": Negative_score,
        "Polarity Score": Polarity_score,
        "Subjectivity Score": Subjectivity_score,
        "Avg Sentence Length": Average_sentence_length,
        "Percentage Complex Words": Percentage_complex_words,
        "Fog Index": Fog_index,
        "Avg No of Words per Sentence": Average_no_words_per_sentence,
        "Complex Word Count": complex_word_count,
        "Word Count": Total_words,
        "Syllables per Word": syllables_per_word,
        "Personal Pronouns": personal_pronouns,
        "Avg Word Length": Avg_word_length
        
    }
    
    return analysis_results

The loading_content function below goes through each article text file saved during extraction, runs perform_textual_analysis on its content, and concatenates the resulting one-row dataframe onto the running result after each analysis.

In [8]:

# result_df = pd.DataFrame(columns=['URL_ID', 'Positive Score', 'Negative Score', 'Polarity Score',
#        'Subjectivity Score', 'Avg Sentence Length', 'Percentage Complex Words',
#        'Fog Index', 'Avg No of Words per Sentence', 'Complex Word Count',
#        'Word Count', 'Syllables per Word', 'Personal Pronouns',
#        'Avg Word Length'])



def loading_content(dir_path):
    result_df = pd.DataFrame()
    # Sort the listing so the rows come out in a deterministic order
    for file_name in sorted(os.listdir(dir_path)):
        file_path = os.path.join(dir_path, file_name)
        if os.path.isfile(file_path):
            # Use a separate name for the file handle so it does not shadow the loop variable
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()

            analysis_results_dict = perform_textual_analysis(content, file_name)
            new_df = pd.DataFrame([analysis_results_dict])
            result_df = pd.concat([result_df, new_df], ignore_index=True)

    return result_df


df2 = loading_content(r"C:\Users\ohkba\OneDrive\Documents\Assignment Black\Text_data")
df2.head()


Unnamed: 0,URL_ID,Positive Score,Negative Score,Polarity Score,Subjectivity Score,Avg Sentence Length,Percentage Complex Words,Fog Index,Avg No of Words per Sentence,Complex Word Count,Word Count,Syllables per Word,Personal Pronouns,Avg Word Length
0,blackassign0001,44,6,0.76,0.063131,12.375,0.212121,5.034848,12.375,168,792,1.683081,2,5.380051
1,blackassign0002,66,31,0.360825,0.087545,13.85,0.32852,5.671408,13.85,364,1108,1.937726,3,5.990975
2,blackassign0003,40,24,0.25,0.076739,14.631579,0.425659,6.022895,14.631579,355,834,2.238609,1,6.779376
3,blackassign0004,41,75,-0.293103,0.138756,17.787234,0.38756,7.269918,17.787234,324,836,2.084928,0,6.526316
4,blackassign0005,23,8,0.483871,0.061876,12.525,0.329341,5.141737,12.525,165,501,1.962076,0,6.275449


We merge the input dataframe with the dataframe produced by the analysis of each article, so that the final dataframe has its columns in the exact order required.

In [9]:
df3 = df
merged_df = pd.merge(df3, df2, on='URL_ID', how='left')
merged_df.head()

Unnamed: 0,URL_ID,URL,Positive Score,Negative Score,Polarity Score,Subjectivity Score,Avg Sentence Length,Percentage Complex Words,Fog Index,Avg No of Words per Sentence,Complex Word Count,Word Count,Syllables per Word,Personal Pronouns,Avg Word Length
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...,44,6,0.76,0.063131,12.375,0.212121,5.034848,12.375,168,792,1.683081,2,5.380051
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...,66,31,0.360825,0.087545,13.85,0.32852,5.671408,13.85,364,1108,1.937726,3,5.990975
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...,40,24,0.25,0.076739,14.631579,0.425659,6.022895,14.631579,355,834,2.238609,1,6.779376
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...,41,75,-0.293103,0.138756,17.787234,0.38756,7.269918,17.787234,324,836,2.084928,0,6.526316
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...,23,8,0.483871,0.061876,12.525,0.329341,5.141737,12.525,165,501,1.962076,0,6.275449


## 4. Output Data Structure
Output Variables: 
All input variables in “Input.xlsx”,
Positive Score,
Negative Score,
Polarity Score,
Subjectivity Score,
Avg Sentence Length,
Percentage of Complex Words,
Fog Index,
Avg Number of Words Per Sentence,
Complex Word Count,
Word Count,
Syllable Per Word,
Personal Pronouns,
Avg Word Length.
Check out the output data structure spreadsheet for the format of your output, i.e. “Output.xlsx”.


In [10]:
excel_file_path= r"C:\Users\ohkba\OneDrive\Documents\Assignment Black\output.xlsx"
merged_df.to_excel(excel_file_path, index=False, engine='openpyxl', sheet_name='Output')
print("File saved as Output.xlsx")

File saved as Output.xlsx
