# Text Analysis and Sentiment Scoring Project

## Problem Statement

Requires a system to extract textual data from given URLs and perform text analysis to compute various variables. This analysis will help in understanding the sentiment, readability, and complexity of the articles, which can be valuable for content strategy, market research, and competitor analysis.

## Tasks

1. Data Extraction:
   - Read URLs from the provided 'Input.xlsx' file
   - Extract the article text from each URL, focusing only on the main content

2. Text Analysis:
   - Compute the following variables for each article:
     a. Sentiment Scores:
        - Positive Score
        - Negative Score
        - Polarity Score
        - Subjectivity Score
     b. Readability Metrics:
        - Average Sentence Length
        - Percentage of Complex Words
        - Fog Index
        - Average Number of Words Per Sentence
     c. Other Metrics:
        - Complex Word Count
        - Word Count
        - Syllable Count Per Word
        - Personal Pronouns Count
        - Average Word Length

3. Data Processing:
   - Clean the extracted text by removing stop words and irrelevant content
   - Utilize provided positive and negative word lists for sentiment analysis
   - Handle various text encodings and potential errors in file reading

4. Output Generation:
   - Compile all computed variables into a structured format
   - Generate an Excel file ('Output Data Structure.xlsx') containing the results for each analyzed URL

## Import libraries and download NLTK data

In [3]:
# Import required libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import re

# Download necessary NLTK data
nltk.download('punkt', quiet=True)

print("Libraries imported and NLTK data downloaded.")

Libraries imported and NLTK data downloaded.


## Define utility functions

In [4]:
def safe_read_file(file_path):
    """Safely read a file with multiple encoding attempts."""
    encodings = ['utf-8', 'latin-1', 'ascii', 'utf-16']
    for encoding in encodings:
        try:
            with open(file_path, 'r', encoding=encoding) as f:
                return set(f.read().lower().split())
        except UnicodeDecodeError:
            continue
        except FileNotFoundError:
            print(f"Warning: File not found: {file_path}")
            return set()
    print(f"Error: Unable to read {file_path} with any of the attempted encodings.")
    return set()

def extract_article(url):
    """Extract article text from a given URL."""
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        article = soup.find('article')
        if article:
            return article.get_text(strip=True)
        else:
            return soup.get_text(strip=True)
    except Exception as e:
        print(f"Error extracting article from {url}: {e}")
        return ""

print("Utility functions defined.")

Utility functions defined.


## Define text processing functions

In [5]:
def clean_text(text, stop_words):
    """Clean the text by removing stop words and non-alphabetic tokens."""
    words = word_tokenize(text.lower())
    cleaned_words = [word for word in words if word.isalpha() and word not in stop_words]
    return ' '.join(cleaned_words)

def calculate_sentiment_scores(text, positive_words, negative_words):
    """Calculate sentiment scores for the given text."""
    words = text.split()
    positive_score = sum(1 for word in words if word in positive_words)
    negative_score = sum(1 for word in words if word in negative_words)
    
    polarity_score = (positive_score - negative_score) / ((positive_score + negative_score) + 0.000001)
    subjectivity_score = (positive_score + negative_score) / (len(words) + 0.000001)
    
    return {
        'positive': positive_score,
        'negative': negative_score,
        'polarity': polarity_score,
        'subjectivity': subjectivity_score
    }

def calculate_readability_metrics(text):
    """Calculate readability metrics for the given text."""
    sentences = sent_tokenize(text)
    words = word_tokenize(text)
    
    avg_sentence_length = len(words) / len(sentences) if len(sentences) > 0 else 0
    complex_words = [word for word in words if count_syllables(word) > 2]
    percentage_complex_words = len(complex_words) / len(words) if len(words) > 0 else 0
    fog_index = 0.4 * (avg_sentence_length + percentage_complex_words)
    
    return {
        'avg_sentence_length': avg_sentence_length,
        'percentage_complex_words': percentage_complex_words,
        'fog_index': fog_index,
        'avg_words_per_sentence': avg_sentence_length
    }

def count_complex_words(text):
    """Count the number of complex words in the text."""
    words = word_tokenize(text)
    return sum(1 for word in words if count_syllables(word) > 2)

def count_syllables(word):
    """Count the number of syllables in a word."""
    return len(re.findall(r'[aeiou]', word.lower())) + 1

def count_personal_pronouns(text):
    """Count the number of personal pronouns in the text."""
    pronouns = ['I', 'we', 'my', 'ours', 'us']
    return sum(len(re.findall(r'\b' + pronoun + r'\b', text, re.IGNORECASE)) for pronoun in pronouns)

print("Text processing functions defined.")

Text processing functions defined.


## Define main analysis function

In [6]:
def analyze_text(url, stop_words, positive_words, negative_words):
    """Perform full text analysis on the article at the given URL."""
    article_text = extract_article(url)
    cleaned_text = clean_text(article_text, stop_words)
    
    sentiment_scores = calculate_sentiment_scores(cleaned_text, positive_words, negative_words)
    readability_metrics = calculate_readability_metrics(cleaned_text)
    complex_word_count = count_complex_words(cleaned_text)
    word_count = len(cleaned_text.split())
    syllable_count = sum(count_syllables(word) for word in cleaned_text.split())
    personal_pronouns = count_personal_pronouns(article_text)
    avg_word_length = sum(len(word) for word in cleaned_text.split()) / word_count if word_count > 0 else 0
    
    return {
        'positive_score': sentiment_scores['positive'],
        'negative_score': sentiment_scores['negative'],
        'polarity_score': sentiment_scores['polarity'],
        'subjectivity_score': sentiment_scores['subjectivity'],
        'avg_sentence_length': readability_metrics['avg_sentence_length'],
        'percentage_complex_words': readability_metrics['percentage_complex_words'],
        'fog_index': readability_metrics['fog_index'],
        'avg_words_per_sentence': readability_metrics['avg_words_per_sentence'],
        'complex_word_count': complex_word_count,
        'word_count': word_count,
        'syllable_per_word': syllable_count / word_count if word_count > 0 else 0,
        'personal_pronouns': personal_pronouns,
        'avg_word_length': avg_word_length
    }

print("Main analysis function defined.")

Main analysis function defined.


## Load input data and dictionaries

In [7]:
# Load input data
try:
    input_df = pd.read_excel('Input.xlsx')
    print(f"Loaded {len(input_df)} URLs from Input.xlsx")
except FileNotFoundError:
    print("Error: Input.xlsx not found. Please ensure it's in the current directory.")
    raise

# Load positive and negative word dictionaries
positive_words = safe_read_file('positive-words.txt')
print(f"Loaded {len(positive_words)} positive words")
negative_words = safe_read_file('negative-words.txt')
print(f"Loaded {len(negative_words)} negative words")

# Load stop words from multiple files
stop_words_files = [
    'StopWords_Names.txt',
    'StopWords_Geographic.txt',
    'StopWords_GenericLong.txt',
    'StopWords_Generic.txt',
    'StopWords_DatesandNumbers.txt',
    'StopWords_Currencies.txt',
    'StopWords_Auditor.txt'
]

stop_words = set()
for file in stop_words_files:
    words = safe_read_file(file)
    stop_words.update(words)
    print(f"Loaded {len(words)} words from {file}")

print(f"Total stop words: {len(stop_words)}")

Loaded 147 URLs from Input.xlsx
Loaded 2006 positive words
Loaded 4783 negative words
Loaded 11905 words from StopWords_Names.txt
Loaded 199 words from StopWords_Geographic.txt
Loaded 570 words from StopWords_GenericLong.txt
Loaded 121 words from StopWords_Generic.txt
Loaded 116 words from StopWords_DatesandNumbers.txt
Loaded 190 words from StopWords_Currencies.txt
Loaded 8 words from StopWords_Auditor.txt
Total stop words: 12840


## Process URLs and generate output

In [8]:
# Process each URL and store results
results = []
for index, row in input_df.iterrows():
    url_id = row['URL_ID']
    url = row['URL']
    
    print(f"Processing {url_id}: {url}")
    
    analysis_results = analyze_text(url, stop_words, positive_words, negative_words)
    results.append({
        'URL_ID': url_id,
        'URL': url,
        **analysis_results
    })

# Create output DataFrame and save to Excel
output_df = pd.DataFrame(results)
output_df.to_excel('Output Data Structure.xlsx', index=False)

print("Analysis complete. Results saved to 'Output Data Structure.xlsx'")

# Displaying the first few rows of the output
output_df.head()

Processing bctech2011: https://insights.blackcoffer.com/ml-and-ai-based-insurance-premium-model-to-predict-premium-to-be-charged-by-the-insurance-company/
Processing bctech2012: https://insights.blackcoffer.com/streamlined-integration-interactive-brokers-api-with-python-for-desktop-trading-application/
Processing bctech2013: https://insights.blackcoffer.com/efficient-data-integration-and-user-friendly-interface-development-navigating-challenges-in-web-application-deployment/
Processing bctech2014: https://insights.blackcoffer.com/effective-management-of-social-media-data-extraction-strategies-for-authentication-security-and-reliability/
Processing bctech2015: https://insights.blackcoffer.com/streamlined-trading-operations-interface-for-metatrader-4-empowering-efficient-management-and-monitoring/
Processing bctech2016: https://insights.blackcoffer.com/efficient-aws-infrastructure-setup-and-management-addressing-security-scalability-and-compliance/
Processing bctech2017: https://insights

Unnamed: 0,URL_ID,URL,positive_score,negative_score,polarity_score,subjectivity_score,avg_sentence_length,percentage_complex_words,fog_index,avg_words_per_sentence,complex_word_count,word_count,syllable_per_word,personal_pronouns,avg_word_length
0,bctech2011,https://insights.blackcoffer.com/ml-and-ai-bas...,103,37,0.471429,0.09498,1474.0,0.931479,589.972592,1474.0,1373,1474,4.058345,2,7.976255
1,bctech2012,https://insights.blackcoffer.com/streamlined-i...,18,2,0.8,0.064516,310.0,0.906452,124.362581,310.0,281,310,4.267742,1,8.629032
2,bctech2013,https://insights.blackcoffer.com/efficient-dat...,17,8,0.36,0.061728,405.0,0.903704,162.361481,405.0,366,405,4.135802,1,8.190123
3,bctech2014,https://insights.blackcoffer.com/effective-man...,13,2,0.733333,0.04918,305.0,0.92459,122.369836,305.0,282,305,4.281967,1,8.357377
4,bctech2015,https://insights.blackcoffer.com/streamlined-t...,13,1,0.857143,0.03794,369.0,0.915989,147.966396,369.0,338,369,4.178862,1,8.233062
