<a href="https://colab.research.google.com/github/Shrutijana/Comprehensive-Text-Analysis-for-Sentiment-and-Readability-Assessment/blob/main/Comprehensive_Text_Analysis_for_Sentiment_and_Readability_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***Comprehensive Text Analysis for Sentiment and Readability Evaluation***

# ***Introduction***
This project analyzes the text of articles fetched from a list of URLs. For each article it computes sentiment scores (positive, negative, polarity, and subjectivity) and readability metrics such as the Fog Index, average sentence length, complex word count, syllables per word, and average word length. This document explains how the text is extracted, how the sentiment scores are calculated, and how the readability metrics are computed. The results are saved to an Excel file for reference and further use.

# **Methodology**
The methodology involves three main steps: data extraction, sentiment analysis, and readability analysis. Each of these steps is described in detail below.

# 1. Data Extraction
The first step extracts the article text from the given URLs. The URLs are read from an Excel file (`Input.xlsx`) that has `URL_ID` and `URL` columns. For each page, the text of every `<p>` element is joined into a single string that the later steps analyze.

In [None]:
# Import necessary libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Function to extract article text from URL
def extract_text_from_url(url):
    try:
        response = requests.get(url, timeout=30)  # Timeout prevents hanging on an unresponsive server
        response.raise_for_status()  # Raise HTTPError for bad responses (4xx and 5xx)
        soup = BeautifulSoup(response.text, 'html.parser')
        paragraphs = soup.find_all('p')
        article_text = " ".join(paragraph.get_text() for paragraph in paragraphs)
        return article_text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

# Read input URLs from uploaded Excel file
input_file_path = '/content/Input.xlsx'
input_data = pd.read_excel(input_file_path)
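
To see what the paragraph-joining step produces without making a network request, the sketch below applies the same idea to an inline HTML string. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without third-party packages; `ParagraphCollector`, `join_paragraphs`, and `sample_html` are hypothetical names introduced for this illustration, and the stand-in may behave differently from BeautifulSoup on malformed HTML.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of <p> tags currently open
        self.paragraphs = []  # one string per top-level paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self.depth == 0:
                self.paragraphs.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while we are inside a paragraph.
        if self.depth > 0:
            self.paragraphs[-1] += data


def join_paragraphs(html):
    """Return the text of all <p> elements joined by single spaces."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.paragraphs)


sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<div>Sidebar text</div>"
    "<p>Second paragraph.</p></body></html>"
)
article_text = join_paragraphs(sample_html)
print(article_text)
```

Text outside `<p>` elements, such as the heading and the sidebar here, is dropped, which is also why pages that keep their main content in other tags may yield little or no text with this approach.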



# 2. Sentiment Analysis
Sentiment analysis estimates whether a piece of writing is positive, negative, or neutral. The text is tokenized and lowercased, tokens that are not alphanumeric or that appear in the stop word lists are removed, and the remaining tokens are matched against sets of positive and negative words. The match counts are then used to calculate polarity and subjectivity scores.
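
The polarity and subjectivity scores computed in the code cell that follows can be written compactly as below. The symbols $P$, $N$, and $T$ are notation introduced here for the positive score, the negative score, and the number of cleaned tokens; they are not names used in the code.

```latex
\text{Polarity} = \frac{P - N}{(P + N) + 10^{-6}},
\qquad
\text{Subjectivity} = \frac{P + N}{T + 10^{-6}}
```

The small constant $10^{-6}$ keeps both denominators non-zero when an article contains no matched words or no cleaned tokens. Polarity therefore lies between $-1$ and $1$, and subjectivity between $0$ and $1$.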

In [None]:
# Import necessary libraries
from nltk.tokenize import word_tokenize
import nltk

# Download required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Load stop words, positive words, and negative words
def load_word_set(file_path, encoding=None):
    with open(file_path, 'r', encoding=encoding) as file:
        return set(file.read().splitlines())

stop_words_names = load_word_set('/content/StopWords_Names.txt')
stop_words_geographic = load_word_set('/content/StopWords_Geographic.txt')
stop_words_genericlong = load_word_set('/content/StopWords_GenericLong.txt')
stop_words_generic = load_word_set('/content/StopWords_Generic.txt')
stop_words_datesandnumbers = load_word_set('/content/StopWords_DatesandNumbers.txt')
stop_words_currencies = load_word_set('/content/StopWords_Currencies.txt', encoding='latin-1')
stop_words_auditor = load_word_set('/content/StopWords_Auditor.txt')
positive_words = load_word_set('/content/positive-words.txt')
negative_words = load_word_set('/content/negative-words.txt', encoding='latin-1')

# Combine all sets of stop words into one single set
all_stop_words = stop_words_names.union(stop_words_geographic, stop_words_genericlong,
                                         stop_words_generic, stop_words_datesandnumbers,
                                         stop_words_currencies, stop_words_auditor)

# Function to perform sentiment analysis
def sentiment_analysis(text):
    tokens = word_tokenize(text.lower())
    cleaned_tokens = [word for word in tokens if word.isalnum() and word not in all_stop_words]
    positive_score = sum(1 for word in cleaned_tokens if word in positive_words)
    negative_score = sum(1 for word in cleaned_tokens if word in negative_words)
    polarity_score = (positive_score - negative_score) / ((positive_score + negative_score) + 0.000001)
    subjectivity_score = (positive_score + negative_score) / (len(cleaned_tokens) + 0.000001)
    return {
        'Positive Score': positive_score,
        'Negative Score': negative_score,
        'Polarity Score': polarity_score,
        'Subjectivity Score': subjectivity_score
    }

# Collect one single-row DataFrame per URL in a list
output_data_list = []

# Process each URL
for index, row in input_data.iterrows():
    url_id = row['URL_ID']
    url = row['URL']
    article_text = extract_text_from_url(url)
    if article_text:
        analysis_result = sentiment_analysis(article_text)
        output_data_list.append(pd.DataFrame({
            'URL_ID': [url_id],
            'URL': [url],
            'Positive Score': [analysis_result['Positive Score']],
            'Negative Score': [analysis_result['Negative Score']],
            'Polarity Score': [analysis_result['Polarity Score']],
            'Subjectivity Score': [analysis_result['Subjectivity Score']]
        }))




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Error fetching https://insights.blackcoffer.com/how-neural-networks-can-be-applied-in-various-areas-in-the-future/: 404 Client Error: Not Found for url: https://insights.blackcoffer.com/how-neural-networks-can-be-applied-in-various-areas-in-the-future/
Error fetching https://insights.blackcoffer.com/covid-19-environmental-impact-for-the-future/: 404 Client Error: Not Found for url: https://insights.blackcoffer.com/covid-19-environmental-impact-for-the-future/
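
As a small worked example of the scoring logic, the sketch below runs the same arithmetic on one sentence. The word sets are toy stand-ins for the files under `/content/`, and `str.split()` replaces `word_tokenize` so that the example has no external dependency; `toy_sentiment` and the `toy_*` sets are hypothetical names.

```python
# Toy word lists: stand-ins for the contents of the /content/*.txt files.
toy_stop_words = {"the", "is", "and", "but", "a"}
toy_positive = {"good", "great", "excellent"}
toy_negative = {"bad", "poor"}


def toy_sentiment(text):
    # Split on whitespace and strip punctuation by hand (assumption:
    # this approximates what word_tokenize plus isalnum() would keep).
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    cleaned = [t for t in tokens if t.isalnum() and t not in toy_stop_words]
    positive = sum(1 for t in cleaned if t in toy_positive)
    negative = sum(1 for t in cleaned if t in toy_negative)
    # Same formulas as sentiment_analysis in the notebook.
    polarity = (positive - negative) / ((positive + negative) + 0.000001)
    subjectivity = (positive + negative) / (len(cleaned) + 0.000001)
    return cleaned, positive, negative, polarity, subjectivity


cleaned, pos, neg, polarity, subjectivity = toy_sentiment(
    "The service is good and the food is great, but the parking is bad."
)
print(cleaned)
print(pos, neg, round(polarity, 4), round(subjectivity, 4))
```

Six tokens survive cleaning, two of them positive and one negative, so the polarity is about one third and the subjectivity is about one half.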


# 3. Readability Analysis
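The readability analysis calculates several metrics that describe how complex the text is to read: the average sentence length, the share of complex words (words with more than two syllables), the Fog Index, the complex word count, the word count, the average number of syllables per word, and the average word length. In the code, the "Percentage of Complex Words" is stored as a fraction between 0 and 1, and that fraction is used directly in the Fog Index.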
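
The main readability formulas, as implemented in the code cell that follows, can be summarized as below. The symbols $W$, $S$, and $C$ are notation introduced here for the number of words, sentences, and complex words; they are not names used in the code.

```latex
\text{ASL} = \frac{W}{S},
\qquad
\text{PCW} = \frac{C}{W},
\qquad
\text{Fog} = 0.4 \, (\text{ASL} + \text{PCW})
```

Here ASL is the average sentence length and PCW is the share of complex words, where a word counts as complex when the syllable heuristic returns more than two syllables.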

In [None]:
import re

# Function to count syllables in a word
def count_syllables(word):
    word = word.lower()
    vowels = "aeiou"
    current_syllables = 0
    if word[0] in vowels:
        current_syllables += 1
    for index in range(1, len(word)):
        if word[index] in vowels and word[index - 1] not in vowels:
            current_syllables += 1
    if word.endswith("e"):
        current_syllables -= 1
    if word.endswith("le") and len(word) > 2 and word[-3] not in vowels:
        current_syllables += 1
    if current_syllables == 0:
        current_syllables += 1
    return current_syllables

# Function to calculate readability metrics
def readability_analysis(text):
    sentences = re.split(r'[.!?]', text)
    sentences = [sentence for sentence in sentences if len(sentence) > 0]
    words = text.split()
    complex_words = [word for word in words if count_syllables(word) > 2]

    avg_sentence_length = len(words) / len(sentences) if sentences else 0
    percentage_complex_words = len(complex_words) / len(words) if words else 0
    fog_index = 0.4 * (avg_sentence_length + percentage_complex_words)
    avg_syllables_per_word = sum(count_syllables(word) for word in words) / len(words) if words else 0
    avg_word_length = sum(len(word) for word in words) / len(words) if words else 0

    return {
        'Average Sentence Length': avg_sentence_length,
        'Percentage of Complex Words': percentage_complex_words,
        'Fog Index': fog_index,
        'Complex Word Count': len(complex_words),
        'Word Count': len(words),
        'Syllable Count Per Word': avg_syllables_per_word,
        'Average Word Length': avg_word_length
    }
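
To get a feel for the syllable heuristic, the sketch below repeats `count_syllables` unchanged and applies it to a few sample words chosen for illustration. The last sample keeps its trailing period, because `readability_analysis` obtains its words from `text.split()` and so passes punctuation through to the counter.

```python
def count_syllables(word):
    # Copy of the notebook's heuristic: count vowel groups, drop a
    # trailing silent "e", add one back for consonant + "le" endings.
    word = word.lower()
    vowels = "aeiou"
    current_syllables = 0
    if word[0] in vowels:
        current_syllables += 1
    for index in range(1, len(word)):
        if word[index] in vowels and word[index - 1] not in vowels:
            current_syllables += 1
    if word.endswith("e"):
        current_syllables -= 1
    if word.endswith("le") and len(word) > 2 and word[-3] not in vowels:
        current_syllables += 1
    if current_syllables == 0:
        current_syllables += 1
    return current_syllables


samples = ["table", "code", "the", "readability", "sentence", "sentence."]
counts = {sample: count_syllables(sample) for sample in samples}
for sample, count in counts.items():
    print(sample, count)
```

The heuristic gives 2 for "sentence" but 3 for "sentence.", since the period hides the silent "e" from the `endswith("e")` check. The letter "y" is not treated as a vowel, so "readability" is counted as 4 syllables. Stripping punctuation before counting would avoid the first effect, but it would also change the reported metrics.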



# **Output**
The final output consists of an Excel file containing the results of the text analysis, including sentiment scores and readability metrics.
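
Each output row combines the URL identifiers with the two result dictionaries. One compact way to build such a row is dictionary unpacking, sketched below. The values are made up for illustration, and only a few of the metric keys are shown.

```python
# Hypothetical results, shaped like the dictionaries returned by
# sentiment_analysis and readability_analysis (values are invented).
sentiment_result = {
    "Positive Score": 3,
    "Negative Score": 1,
    "Polarity Score": 0.5,
    "Subjectivity Score": 0.1,
}
readability_result = {
    "Average Sentence Length": 18.0,
    "Percentage of Complex Words": 0.25,
    "Fog Index": 7.3,
}

# Later keys would overwrite earlier ones on a name clash; the two
# dictionaries here share no keys, so nothing is lost.
row = {
    "URL_ID": "example0001",           # placeholder identifier
    "URL": "https://example.com/article",  # placeholder URL
    **sentiment_result,
    **readability_result,
}
print(list(row))
```

Because dictionaries preserve insertion order, the columns appear in the order URL fields, sentiment metrics, readability metrics when a list of such rows is passed to `pd.DataFrame`.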

In [None]:
import os

# Collect one result dictionary per URL in a list
output_data_list = []

# Process each URL
for index, row in input_data.iterrows():
    url_id = row['URL_ID']
    url = row['URL']
    article_text = extract_text_from_url(url)
    if article_text:
        sentiment_result = sentiment_analysis(article_text)
        readability_result = readability_analysis(article_text)
        output_data_list.append({
            'URL_ID': url_id,
            'URL': url,
            'Positive Score': sentiment_result['Positive Score'],
            'Negative Score': sentiment_result['Negative Score'],
            'Polarity Score': sentiment_result['Polarity Score'],
            'Subjectivity Score': sentiment_result['Subjectivity Score'],
            'Average Sentence Length': readability_result['Average Sentence Length'],
            'Percentage of Complex Words': readability_result['Percentage of Complex Words'],
            'Fog Index': readability_result['Fog Index'],
            'Complex Word Count': readability_result['Complex Word Count'],
            'Word Count': readability_result['Word Count'],
            'Syllable Count Per Word': readability_result['Syllable Count Per Word'],
            'Average Word Length': readability_result['Average Word Length']
        })

# Convert list of dictionaries to DataFrame
output_data = pd.DataFrame(output_data_list)

# Define the directory path for output
output_directory = '/content/'

# Save output to Excel file
output_file_path = os.path.join(output_directory, 'output.xlsx')
output_data.to_excel(output_file_path, index=False)

# Displaying first few rows of the output DataFrame
output_data.head()





Error fetching https://insights.blackcoffer.com/how-neural-networks-can-be-applied-in-various-areas-in-the-future/: 404 Client Error: Not Found for url: https://insights.blackcoffer.com/how-neural-networks-can-be-applied-in-various-areas-in-the-future/
Error fetching https://insights.blackcoffer.com/covid-19-environmental-impact-for-the-future/: 404 Client Error: Not Found for url: https://insights.blackcoffer.com/covid-19-environmental-impact-for-the-future/


| | URL_ID | URL | Positive Score | Negative Score | Polarity Score | Subjectivity Score | Average Sentence Length | Percentage of Complex Words | Fog Index | Complex Word Count | Word Count | Syllable Count Per Word | Average Word Length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | blackassign0001 | https://insights.blackcoffer.com/rising-it-cit... | 12 | 2 | 0.714286 | 0.048443 | 17.064516 | 0.200378 | 6.905958 | 106 | 529 | 1.756144 | 5.338374 |
| 1 | blackassign0002 | https://insights.blackcoffer.com/rising-it-cit... | 61 | 32 | 0.311828 | 0.106407 | 18.682353 | 0.236776 | 7.567652 | 376 | 1588 | 1.868388 | 5.743703 |
| 2 | blackassign0003 | https://insights.blackcoffer.com/internet-dema... | 44 | 25 | 0.275362 | 0.09465 | 20.112903 | 0.317562 | 8.172186 | 396 | 1247 | 2.073777 | 6.315156 |
| 3 | blackassign0004 | https://insights.blackcoffer.com/rise-of-cyber... | 42 | 76 | -0.288136 | 0.162534 | 21.631579 | 0.294404 | 8.770393 | 363 | 1233 | 1.977291 | 6.201135 |
| 4 | blackassign0005 | https://insights.blackcoffer.com/ott-platform-... | 27 | 9 | 0.5 | 0.074689 | 19.2 | 0.237269 | 7.774907 | 205 | 864 | 1.846065 | 5.871528 |
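
The first output row can be checked by hand. The sketch below recomputes its polarity, complex-word share, and Fog Index from the other values in that row. It also shows what the Fog Index would be if the complex-word share were scaled to a percentage, as in the commonly cited Gunning Fog formula; that variant is included only for comparison and is not what the notebook computes.

```python
# Values copied from row 0 of the output above.
positive_score = 12
negative_score = 2
complex_word_count = 106
word_count = 529
avg_sentence_length = 17.064516

# Polarity, using the same formula as sentiment_analysis.
polarity = (positive_score - negative_score) / (
    (positive_score + negative_score) + 0.000001
)

# Share of complex words, stored as a fraction in the notebook.
complex_fraction = complex_word_count / word_count

# Fog Index as implemented in readability_analysis.
fog_as_implemented = 0.4 * (avg_sentence_length + complex_fraction)

# Comparison only: the same inputs with the share scaled to a percentage.
fog_percent_scaled = 0.4 * (avg_sentence_length + 100 * complex_fraction)

print(round(polarity, 6))
print(round(complex_fraction, 6))
print(round(fog_as_implemented, 4))
print(round(fog_percent_scaled, 4))
```

The first three printed values should agree with the `Polarity Score`, `Percentage of Complex Words`, and `Fog Index` columns of row 0 up to rounding. The percentage-scaled variant is roughly twice as large, so the reported Fog Index values should not be compared directly with grade-level tables based on that formula.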


# **Conclusion**
This project demonstrates how to extract article text from URLs and compute sentiment scores and readability metrics from it. The resulting metrics describe the tone and complexity of each article and can support applications such as content analysis and opinion mining. The results are stored in a structured Excel file, which makes them easy to reference in further studies or reports.