Let's walk through, step by step, how to run the text analysis with the code and file structure described below:


Set Up Your Environment

pip install pandas nltk textblob


Create Folders:
Create a folder called "MasterDictionary" and place the following files inside:
positive-words.txt
negative-words.txt

Create a folder called "StopWords" and place the following files inside:
StopWords_Generic.txt
StopWords_GenericLong.txt
StopWords_Auditor.txt
StopWords_DatesandNumbers.txt
StopWords_Geographic.txt
StopWords_Names.txt
StopWords_Currencies.txt
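You will also need a folder (e.g., "Text Files") to store the .txt files that contain the text you want to analyze.
Note that the code below reads every file directly from /content/ (for example '/content/StopWords_Generic.txt'), so either place the files there or change the paths in the code to match your folders.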
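
Before running the analysis, it can help to confirm that the folders are laid out as described. The following is a minimal sketch that assumes the folder and file names listed above sit under one base directory; the helper name find_missing_files is hypothetical and not part of the original code.

```python
from pathlib import Path

# Expected layout, taken from the folder lists above.
EXPECTED_FILES = {
    "MasterDictionary": ["positive-words.txt", "negative-words.txt"],
    "StopWords": [
        "StopWords_Generic.txt",
        "StopWords_GenericLong.txt",
        "StopWords_Auditor.txt",
        "StopWords_DatesandNumbers.txt",
        "StopWords_Geographic.txt",
        "StopWords_Names.txt",
        "StopWords_Currencies.txt",
    ],
}


def find_missing_files(base_dir):
    """Return 'folder/file' entries that do not exist under base_dir."""
    base = Path(base_dir)
    missing = []
    for folder, names in EXPECTED_FILES.items():
        for name in names:
            if not (base / folder / name).is_file():
                missing.append(f"{folder}/{name}")
    return missing


# Check the current directory and report anything that is absent.
for entry in find_missing_files("."):
    print("missing:", entry)
```

An empty result means every dictionary and stop word file is where this layout expects it.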


Prepare Your Data Files
MasterDictionary:
positive-words.txt: Each line should contain a single positive word (e.g., "happy", "joy", "amazing").
negative-words.txt: Each line should contain a single negative word (e.g., "sad", "terrible", "bad").
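
As an illustration of the one-word-per-line format, here is a small sketch that reads such a file into a set using only the standard library. The notebook itself uses pd.read_csv for this; load_word_set is a hypothetical helper, and lowercasing the words is an assumption made here so that lookups against lowercased tokens match.

```python
from pathlib import Path
import tempfile


def load_word_set(path, encoding="utf-8"):
    """Read a one-word-per-line file into a set of lowercase words."""
    words = set()
    for line in Path(path).read_text(encoding=encoding).splitlines():
        word = line.strip().lower()
        if word:  # skip blank lines
            words.add(word)
    return words


# Demonstrate on a tiny temporary file shaped like positive-words.txt.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "positive-words.txt"
    sample.write_text("happy\nJoy\n\namazing\n", encoding="utf-8")
    positive_words = load_word_set(sample)

print(sorted(positive_words))  # ['amazing', 'happy', 'joy']
```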

Text Files:
Create .txt files in your "Text Files" folder, and each file should contain the text you want to analyze.



Understand the Code
Import Libraries: The code imports necessary libraries for data handling (pandas), text processing (nltk), sentiment analysis (textblob), and regular expressions (re).

Create Empty Lists: It sets up lists to store the results of the analysis for each text file.

calculate_readability_scores Function: This function calculates readability metrics like average sentence length, percentage of complex words, and the Gunning Fog Index.
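
To make the Fog Index arithmetic concrete, the sketch below plugs in the values from the first row of the results table shown later in this document (average sentence length 20.946903, complex-word share 0.289396). The two function names are hypothetical. The first mirrors what the script computes, where the complex-word share is a fraction; the second shows the conventional Gunning Fog formula, which uses a percentage, for comparison only.

```python
def fog_index_as_in_script(avg_sentence_length, complex_fraction):
    """Fog index as the script computes it: complex-word share is a fraction (0 to 1)."""
    return 0.4 * (avg_sentence_length + complex_fraction)


def fog_index_conventional(avg_sentence_length, complex_fraction):
    """Conventional Gunning Fog index: complex-word share is a percentage (0 to 100)."""
    return 0.4 * (avg_sentence_length + 100 * complex_fraction)


avg_len = 20.946903   # AVG SENTENCE LENGTH, first output row
fraction = 0.289396   # PERCENTAGE OF COMPLEX WORDS, first output row

print(round(fog_index_as_in_script(avg_len, fraction), 4))   # 8.4945
print(round(fog_index_conventional(avg_len, fraction), 4))   # 19.9546
```

The first value agrees with the FOG INDEX column for that row, which confirms that the script adds the fraction as-is rather than a percentage.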

syllables_count Function: This function counts syllables in a word (a simple vowel-counting approach).

analyze_text Function: This is the heart of the code. It:
Loads Stop Words: Reads stop words from the files you've created.

Cleans Text: Removes stop words, punctuation, and converts words to lowercase.

Calculates Sentiment Scores: Uses the positive and negative dictionaries to determine sentiment.
Calculates Readability Scores: Uses the calculate_readability_scores function.
Calculates Other Variables: Extracts additional linguistic features like average word length, pronoun count, etc.
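
The sentiment step can be sketched in isolation as follows. The formulas and the small constant are the ones used in analyze_text further down; the function name polarity_and_subjectivity is hypothetical, and the inputs are the positive score, negative score, and word count from the first row of the results table.

```python
EPSILON = 0.000001  # same small constant the script adds to avoid division by zero


def polarity_and_subjectivity(positive_score, negative_score, word_count):
    """Return (polarity, subjectivity) from dictionary hit counts."""
    total = positive_score + negative_score
    polarity = (positive_score - negative_score) / (total + EPSILON)
    subjectivity = total / (word_count + EPSILON)
    return polarity, subjectivity


polarity, subjectivity = polarity_and_subjectivity(93, 34, 1258)
print(round(polarity, 6), round(subjectivity, 6))  # 0.464567 0.100954
```

Polarity ranges from -1 (only negative words) to +1 (only positive words), and the constant keeps both scores at 0 rather than raising an error when a text has no dictionary hits.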

Load Dictionaries: The code loads the positive and negative word dictionaries.

Process Text Files: It iterates through each text file in the file_name list, reads the content, analyzes the text, and stores the results in the lists.
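
One possible variation on this loop, sketched below under stated assumptions, collects one dictionary per file and records missing files instead of stopping at the first one. The names process_files and count_words are hypothetical; count_words is only a placeholder standing in for analyze_text.

```python
from pathlib import Path
import tempfile


def count_words(text):
    """Placeholder analysis (hypothetical): stands in for analyze_text."""
    return {"WORD COUNT": len(text.split())}


def process_files(folder, file_names, analyze=count_words):
    """Analyze each <name>.txt in folder; return (rows, names of missing files)."""
    rows, missing = [], []
    for name in file_names:
        path = Path(folder) / f"{name}.txt"
        if not path.is_file():
            missing.append(name)
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        rows.append({"URL_ID": name, **analyze(text)})
    return rows, missing


# Demonstrate with one existing file and one missing file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "bctech2011.txt").write_text("A short sample text.", encoding="utf-8")
    rows, missing = process_files(tmp, ["bctech2011", "bctech2012"])

print(rows)     # [{'URL_ID': 'bctech2011', 'WORD COUNT': 4}]
print(missing)  # ['bctech2012']
```

A list of dictionaries like rows can be passed straight to pd.DataFrame, which avoids keeping a separate list per metric in sync.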

Create Output DataFrame: It creates a pandas DataFrame to hold the analysis results for each text file.
Save to CSV: It saves the results to a CSV file (named output_1.csv in the code below).

file_name: You need to define a list named file_name that contains the names of your text files (without the .txt extension), e.g.:
file_name = ['bctech2011', 'bctech2012', 'bctech2013', ...]

Encoding: If any of your stop word files or dictionaries are not in UTF-8 encoding, you might need to add the encoding='latin-1' parameter to the pd.read_csv calls to handle the encoding correctly.
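
If you are unsure which encoding a file uses, a small check like the one below can tell you which value to pass as the encoding parameter. This is a sketch using only the standard library; read_text_with_fallback is a hypothetical helper, and trying UTF-8 first and latin-1 second is an assumption that fits the two encodings mentioned above.

```python
from pathlib import Path
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return (text, encoding_used)."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")


# Demonstrate on a file containing a pound sign stored as a single latin-1 byte.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample_currencies.txt"
    sample.write_bytes("POUND | \u00a3\n".encode("latin-1"))
    text, used = read_text_with_fallback(sample)

print(used)  # latin-1
```

Because latin-1 maps every byte to a character, it never fails to decode, so it works as a last resort but may show the wrong characters if the file actually uses some other encoding.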

Run the Code
Save the code as a Python file (e.g., text_analyzer.py).
Run the code from your terminal:

python text_analyzer.py

In [1]:
import pandas as pd

# Read the input sheet and collect the URL_ID column as the list of file names.
df = pd.read_csv("/content/Input.csv")
file_name = df.URL_ID.tolist()
print(file_name)

['bctech2011', 'bctech2012', 'bctech2013', 'bctech2014', 'bctech2015', 'bctech2016', 'bctech2017', 'bctech2018', 'bctech2019', 'bctech2020', 'bctech2021', 'bctech2022', 'bctech2023', 'bctech2024', 'bctech2025', 'bctech2026', 'bctech2027', 'bctech2028', 'bctech2029', 'bctech2030', 'bctech2031', 'bctech2032', 'bctech2033', 'bctech2034', 'bctech2035', 'bctech2036', 'bctech2037', 'bctech2038', 'bctech2039', 'bctech2040', 'bctech2041', 'bctech2042', 'bctech2043', 'bctech2044', 'bctech2045', 'bctech2046', 'bctech2047', 'bctech2048', 'bctech2049', 'bctech2050', 'bctech2051', 'bctech2052', 'bctech2053', 'bctech2054', 'bctech2055', 'bctech2056', 'bctech2057', 'bctech2058', 'bctech2059', 'bctech2060', 'bctech2061', 'bctech2062', 'bctech2063', 'bctech2064', 'bctech2065', 'bctech2066', 'bctech2067', 'bctech2068', 'bctech2069', 'bctech2070', 'bctech2071', 'bctech2072', 'bctech2073', 'bctech2074', 'bctech2075', 'bctech2076', 'bctech2077', 'bctech2078', 'bctech2079', 'bctech2080', 'bctech2081', 'bcte

In [2]:
# Collect the URL column as a list.
url = df.URL.tolist()

url


['https://insights.blackcoffer.com/ml-and-ai-based-insurance-premium-model-to-predict-premium-to-be-charged-by-the-insurance-company/',
 'https://insights.blackcoffer.com/streamlined-integration-interactive-brokers-api-with-python-for-desktop-trading-application/',
 'https://insights.blackcoffer.com/efficient-data-integration-and-user-friendly-interface-development-navigating-challenges-in-web-application-deployment/',
 'https://insights.blackcoffer.com/effective-management-of-social-media-data-extraction-strategies-for-authentication-security-and-reliability/',
 'https://insights.blackcoffer.com/streamlined-trading-operations-interface-for-metatrader-4-empowering-efficient-management-and-monitoring/',
 'https://insights.blackcoffer.com/efficient-aws-infrastructure-setup-and-management-addressing-security-scalability-and-compliance/',
 'https://insights.blackcoffer.com/streamlined-equity-waterfall-calculation-and-deal-management-system/',
 'https://insights.blackcoffer.com/automated-or

In [3]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:

import pandas as pd       # Data manipulation and analysis.
from nltk.corpus import stopwords   # NLTK's built-in English stop words (only referenced in the commented-out code below).
from nltk.tokenize import word_tokenize   # Splits text into word tokens.
from textblob import TextBlob    # Sentiment analysis class (imported but not used below).
import re   # Regular expressions.


# Create empty lists to store the analysis results, such as sentiment scores, readability scores, and word counts.
positive_scores = []
negative_scores = []
polarity_scores = []
subjectivity_scores = []
avg_sentence_lengths = []
percentage_complex_words = []
fog_indexes = []
avg_words_per_sentences = []
complex_word_counts = []
word_counts = []
syllable_per_words = []
personal_pronouns = []
avg_word_lengths = []


# calculate_readability_scores computes readability metrics for a text and returns:
#   avg_sentence_length: average number of tokens per sentence.
#   percentage_complex_words: share of tokens with more than two syllables, as a fraction (0 to 1).
#   fog_index: 0.4 * (avg_sentence_length + percentage_complex_words).


def calculate_readability_scores(text):
    """Calculates readability scores based on the Gunning Fog index."""
    # Sentences are approximated by splitting on periods; the piece after the
    # final period (often empty) is counted too.
    sentences = text.split('.')
    words = word_tokenize(text)

    # Calculate average sentence length (tokens per sentence)
    avg_sentence_length = len(words) / len(sentences)

    # Calculate the share of complex words (more than two syllables) as a fraction
    complex_words = [word for word in words if syllables_count(word) > 2]
    percentage_complex_words = len(complex_words) / len(words)

    # Calculate Fog Index. The conventional Gunning Fog formula uses a percentage
    # (0 to 100) for complex words; this code adds the fraction as-is.
    fog_index = 0.4 * (avg_sentence_length + percentage_complex_words)

    return avg_sentence_length, percentage_complex_words, fog_index

# syllables_count function:
# Estimates the number of syllables in a word by counting groups of consecutive vowels (including "y"),
# then subtracts one for words ending in "es" or "ed".

def syllables_count(word):
    """Counts the number of syllables in a word."""
    vowels = 'aeiouyAEIOUY'
    count = 0
    prev_was_vowel = False
    for char in word:
        if char in vowels:
            if not prev_was_vowel:
                count += 1
            prev_was_vowel = True
        else:
            prev_was_vowel = False

    # Handle exceptions like words ending with "es", "ed"
    if word.endswith('es') or word.endswith('ed'):
        count -= 1

    return count

#analyze_text Function:
#This is the core function for analyzing the text. It takes the text, positive dictionary (positive_dict), and negative dictionary (negative_dict) as input.
#Stop Words: It loads stop words from multiple files:
#StopWords_Generic.txt
#StopWords_GenericLong.txt
#StopWords_Auditor.txt
#StopWords_DatesandNumbers.txt
#StopWords_Geographic.txt
#StopWords_Names.txt
#StopWords_Currencies.txt
#Tokenization: The text is broken down into words using word_tokenize.
#Text Cleaning: It removes stop words and punctuation, converts words to lowercase, and keeps only alphanumeric words.
#Sentiment Calculation:
#It calculates the positive_score and negative_score by counting the occurrences of words from the provided dictionaries.
#Polarity and Subjectivity: It calculates the polarity and subjectivity scores based on the positive and negative scores.
#Readability Scores: It calls the calculate_readability_scores function to compute readability metrics.
#Other Variables: It calculates additional variables like:
#avg_words_per_sentence
#complex_word_count
#word_count
#syllable_per_word
#personal_pronouns
#avg_word_length


def analyze_text(text, positive_dict, negative_dict):
    """Performs text analysis to extract required variables."""

    # Earlier variant (disabled): clean the text using NLTK's built-in English stop words.
    #stop_words = set(stopwords.words('english'))
    #words = word_tokenize(text)
    #cleaned_words = [word.lower() for word in words if word.lower() not in stop_words and word.isalnum()]

    # Load stop words from files
    stop_words_generic = set(pd.read_csv('/content/StopWords_Generic.txt', header=None, sep='|')[0].tolist())
    stop_words_generic_long = set(pd.read_csv('/content/StopWords_GenericLong.txt', header=None)[0].tolist())
    stop_words_auditor = set(pd.read_csv('/content/StopWords_Auditor.txt', header=None)[0].tolist())
    stop_words_datesandnumbers = set(pd.read_csv('/content/StopWords_DatesandNumbers.txt', header=None, sep='|')[0].tolist())
    stop_words_geographic = set(pd.read_csv('/content/StopWords_Geographic.txt', header=None, sep='|')[0].tolist())
    stop_words_names = set(pd.read_csv('/content/StopWords_Names.txt', header=None)[0].tolist())
    stop_words_currency = set(pd.read_csv('/content/StopWords_Currencies.txt', header=None, sep='|',encoding='latin-1')[0].tolist())

    stop_words = stop_words_generic.union(stop_words_generic_long, stop_words_auditor, stop_words_datesandnumbers,
                                      stop_words_geographic, stop_words_names, stop_words_currency)

    # Tokenize the text
    words = word_tokenize(text)
    cleaned_words = [word.lower() for word in words if word.lower() not in stop_words and word.isalnum()]


    # Calculate sentiment scores (using dictionaries)
    positive_score = sum(1 for word in cleaned_words if word in positive_dict)
    negative_score = sum(1 for word in cleaned_words if word in negative_dict)

    # Calculate polarity and subjectivity scores
    polarity_score = (positive_score - negative_score) / ((positive_score + negative_score) + 0.000001)
    subjectivity_score = (positive_score + negative_score) / (len(cleaned_words) + 0.000001)

    # Calculate readability scores
    avg_sentence_length, percentage_complex_words, fog_index = calculate_readability_scores(text)

    # Calculate other variables
    avg_words_per_sentence = len(cleaned_words) / len(text.split('.'))
    complex_word_count = len([word for word in cleaned_words if syllables_count(word) > 2])
    word_count = len(cleaned_words)
    syllable_per_word = sum([syllables_count(word) for word in cleaned_words]) / word_count
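    # Note: because of re.IGNORECASE, this pattern also counts the abbreviation "US" as the pronoun "us".
    personal_pronouns = len(re.findall(r'\b(I|we|my|ours|us)\b', text, re.IGNORECASE))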
    avg_word_length = sum([len(word) for word in cleaned_words]) / word_count

    return positive_score, negative_score, polarity_score, subjectivity_score, avg_sentence_length, \
           percentage_complex_words, fog_index, avg_words_per_sentence, complex_word_count, \
           word_count, syllable_per_word, personal_pronouns, avg_word_length

# Load the input data from the CSV file
#input_df = pd.read_csv('Input.xlsx.csv')

#Loading Dictionaries:
#The code loads the positive and negative dictionaries from files.

# Load your positive and negative dictionaries
positive_dict = set(pd.read_csv('/content/positive-words.txt', header=None)[0].tolist())
negative_dict = set(pd.read_csv('/content/negative-words.txt',header=None,encoding='latin-1')[0].tolist())


#Processing Text Files:
#The for loop iterates over a list of text files named in the file_name variable.
#It opens each file, reads its content, and calls the analyze_text function to process it.
#The results of the analysis are appended to the appropriate lists


for i in file_name:
  with open(f"/content/{i}.txt", "r") as file:
    # Read the entire file content
    content = file.read()

    # Perform text analysis
    positive_score, negative_score, polarity_score, subjectivity_score, avg_sentence_length, \
    percentage_complex_word, fog_index, avg_words_per_sentence, complex_word_count, \
    word_count, syllable_per_word, personal_pronoun, avg_word_length = analyze_text(content, positive_dict, negative_dict)

    # Append the results to the respective lists
    positive_scores.append(positive_score)
    negative_scores.append(negative_score)
    polarity_scores.append(polarity_score)
    subjectivity_scores.append(subjectivity_score)
    avg_sentence_lengths.append(avg_sentence_length)
    percentage_complex_words.append(percentage_complex_word)
    fog_indexes.append(fog_index)
    avg_words_per_sentences.append(avg_words_per_sentence)
    complex_word_counts.append(complex_word_count)
    word_counts.append(word_count)
    syllable_per_words.append(syllable_per_word)
    personal_pronouns.append(personal_pronoun)
    avg_word_lengths.append(avg_word_length)


print("positive_scores:", positive_scores)
print("negative_scores:",negative_scores)
print("polarity_scores:",polarity_scores)
print("subjectivity_scores:",subjectivity_scores)
print("avg_sentence_lengths:",avg_sentence_lengths)
print("percentage_complex_words:",percentage_complex_words)
print("fog_indexes:",fog_indexes)
print("avg_words_per_sentences:",avg_words_per_sentences)
print("complex_word_counts:",complex_word_counts)
print("word_counts:",word_counts)
print("syllable_per_words:",syllable_per_words)
print("personal_pronouns:",personal_pronouns)
print("avg_word_lengths:",avg_word_lengths)



positive_scores: [93, 1, 8, 1, 14, 1, 1, 2, 5, 20, 4, 15, 15, 22, 21, 28, 18, 26, 34, 22, 21, 1, 8, 31, 22, 25, 17, 8, 3, 2, 2, 3, 3, 7, 4, 3, 7, 3, 6, 6, 9, 11, 6, 10, 14, 20, 6, 12, 8, 4, 8, 9, 4, 2, 5, 7, 13, 10, 4, 3, 3, 5, 3, 3, 10, 12, 5, 18, 6, 8, 23, 6, 4, 1, 6, 2, 1, 4, 0, 2, 2, 5, 4, 8, 0, 5, 1, 2, 0, 8, 4, 21, 24, 1, 5, 5, 4, 0, 7, 3, 3, 0, 1, 3, 35, 12, 0, 5, 7, 7, 10, 6, 19, 1, 8, 0, 25, 0, 0, 1, 4, 4, 2, 2, 13, 0, 2, 8, 8, 8, 3, 4, 2, 2, 1, 2, 1, 4, 5, 12, 10, 9, 6, 13, 7, 3, 1]
negative_scores: [34, 0, 5, 0, 1, 0, 0, 1, 10, 7, 0, 3, 3, 8, 6, 4, 3, 10, 10, 14, 3, 0, 7, 26, 18, 11, 16, 1, 1, 1, 2, 2, 1, 2, 4, 1, 2, 2, 4, 2, 4, 1, 1, 4, 8, 8, 1, 3, 9, 2, 4, 5, 9, 3, 2, 7, 5, 3, 5, 1, 5, 4, 1, 8, 1, 4, 3, 12, 1, 0, 4, 5, 0, 2, 2, 4, 2, 16, 12, 1, 3, 1, 5, 3, 0, 0, 3, 1, 0, 1, 7, 2, 1, 4, 3, 3, 5, 1, 2, 0, 3, 2, 4, 0, 5, 0, 0, 0, 4, 3, 2, 1, 3, 3, 1, 9, 14, 3, 0, 0, 0, 5, 0, 0, 7, 0, 7, 3, 3, 5, 0, 1, 0, 4, 0, 0, 0, 1, 1, 2, 8, 9, 8, 15, 10, 0, 0]
polarity_scores: [0.46456692

In [5]:
#Output Dataframe:
#The results from the analysis are stored in a Pandas DataFrame, combining the data from the input file with the calculated variables.
output_data = {
    'URL_ID': file_name,
    'URL': url,
    'POSITIVE SCORE': positive_scores,
    'NEGATIVE SCORE': negative_scores,
    'POLARITY SCORE': polarity_scores,
    'SUBJECTIVITY SCORE': subjectivity_scores,
    'AVG SENTENCE LENGTH': avg_sentence_lengths,
    'PERCENTAGE OF COMPLEX WORDS': percentage_complex_words,
    'FOG INDEX': fog_indexes,
    'AVG NUMBER OF WORDS PER SENTENCE': avg_words_per_sentences,
    'COMPLEX WORD COUNT': complex_word_counts,
    'WORD COUNT': word_counts,
    'SYLLABLE PER WORD': syllable_per_words,
    'PERSONAL PRONOUNS': personal_pronouns,
    'AVG WORD LENGTH': avg_word_lengths,
}
output_df=pd.DataFrame(output_data)
output_df

Unnamed: 0,URL_ID,URL,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,AVG NUMBER OF WORDS PER SENTENCE,COMPLEX WORD COUNT,WORD COUNT,SYLLABLE PER WORD,PERSONAL PRONOUNS,AVG WORD LENGTH
0,bctech2011,https://insights.blackcoffer.com/ml-and-ai-bas...,93,34,0.464567,0.100954,20.946903,0.289396,8.494519,11.132743,647,1258,2.671701,2,7.783784
1,bctech2012,https://insights.blackcoffer.com/streamlined-i...,1,0,0.999999,0.013699,12.500000,0.213333,5.085333,6.083333,26,73,2.301370,1,7.287671
2,bctech2013,https://insights.blackcoffer.com/efficient-dat...,8,5,0.230769,0.049430,17.851852,0.221992,7.229537,9.740741,94,263,2.273764,1,7.178707
3,bctech2014,https://insights.blackcoffer.com/effective-man...,1,0,0.999999,0.010989,12.200000,0.196721,4.958689,6.066667,30,91,2.274725,1,6.967033
4,bctech2015,https://insights.blackcoffer.com/streamlined-t...,14,1,0.866667,0.050167,18.303030,0.223510,7.410616,9.060606,127,299,2.384615,1,7.247492
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142,bctech2153,https://insights.blackcoffer.com/population-an...,6,8,-0.142857,0.034913,27.629630,0.190349,11.127991,14.851852,132,401,2.119701,3,6.511222
143,bctech2154,https://insights.blackcoffer.com/google-lsa-ap...,13,15,-0.071429,0.046589,25.760870,0.146835,10.363082,13.065217,155,601,2.094842,6,6.357737
144,bctech2155,https://insights.blackcoffer.com/healthcare-da...,7,10,-0.176471,0.106918,20.842105,0.095960,8.375226,8.368421,34,159,1.949686,14,6.044025
145,bctech2156,https://insights.blackcoffer.com/budget-sales-...,3,0,1.000000,0.096774,25.500000,0.333333,10.333333,15.500000,16,31,2.806452,1,8.000000


In [6]:
#Saving Output to CSV:
#The final DataFrame is saved to a CSV file named "output_1.csv".
output_df.to_csv('output_1.csv')
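
A note on the saved file: by default, to_csv also writes the DataFrame's row index as an unnamed first column, which pandas labels "Unnamed: 0" when the file is read back. The sketch below illustrates the difference on a one-row stand-in DataFrame (the values are taken from the first output row); passing index=False is an optional change, not something the original code does.

```python
import io

import pandas as pd

# One-row stand-in for output_df.
df = pd.DataFrame({"URL_ID": ["bctech2011"], "WORD COUNT": [1258]})

with_index = io.StringIO()
df.to_csv(with_index)                  # default: row index written as first column
without_index = io.StringIO()
df.to_csv(without_index, index=False)  # row index omitted

print(with_index.getvalue().splitlines()[0])     # ,URL_ID,WORD COUNT
print(without_index.getvalue().splitlines()[0])  # URL_ID,WORD COUNT

# Reading the default output back produces the extra column.
reloaded = pd.read_csv(io.StringIO(with_index.getvalue()))
print(list(reloaded.columns))  # ['Unnamed: 0', 'URL_ID', 'WORD COUNT']
```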

In summary, this code is designed to process text data from files, extract various linguistic features and sentiment scores, and save the results in a structured CSV file.