<h2> Objective - 1</h2>
<h3>The objective of this assignment is to extract the article text from each given URL and perform text analysis to compute the variables explained below.</h3>



<h3>1. Extracting textual data articles from the given links</h3>

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import os

In [3]:

excel_file = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\Input.xlsx"
df = pd.read_excel(excel_file)  


In [4]:
df.head()

Unnamed: 0,URL_ID,URL
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...


In [5]:
# Directory where the extracted article text files are saved
output_directory = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\Data Of articles"

# Function to extract title and text from HTML content
def extract_title_and_text(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    title = soup.find('title').get_text().strip() if soup.find('title') else ""
    article_text = ""
    for p in soup.find_all('p'):
        article_text += p.get_text().strip() + '\n'
    return title, article_text

In [6]:
# Function to save text content to a text file
def save_text_to_file(text, filename):
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(text)

### Iterate over links and save article title and text to separate text files


In [7]:
for index, row in df.iterrows():
    link = row['URL']
    URL_ID = row['URL_ID']
    try:
        # A timeout keeps one unresponsive link from stalling the whole loop
        response = requests.get(link, timeout=30)
        if response.status_code == 200:
            html_content = response.text
            title, article_text = extract_title_and_text(html_content)
            output_file = os.path.join(output_directory, f'{URL_ID}.txt')
            save_text_to_file(f"{title}\n\n{article_text}", output_file)
            print(f"Article extracted from link {link} and saved to {output_file}")
        else:
            print(f"Failed to fetch content from link {link}. Status code: {response.status_code}")
    except Exception as e:
        print(f"Error occurred while processing link {link}: {str(e)}")

Article extracted from link https://insights.blackcoffer.com/rising-it-cities-and-its-impact-on-the-economy-environment-infrastructure-and-city-life-by-the-year-2040-2/ and saved to C:\Users\Ideapad\Desktop\blackcoffer\Data Of articles\blackassign0001.txt
Article extracted from link https://insights.blackcoffer.com/rising-it-cities-and-their-impact-on-the-economy-environment-infrastructure-and-city-life-in-future/ and saved to C:\Users\Ideapad\Desktop\blackcoffer\Data Of articles\blackassign0002.txt
Article extracted from link https://insights.blackcoffer.com/internet-demands-evolution-communication-impact-and-2035s-alternative-pathways/ and saved to C:\Users\Ideapad\Desktop\blackcoffer\Data Of articles\blackassign0003.txt
Article extracted from link https://insights.blackcoffer.com/rise-of-cybercrime-and-its-effect-in-upcoming-future/ and saved to C:\Users\Ideapad\Desktop\blackcoffer\Data Of articles\blackassign0004.txt
Article extracted from link https://insights.blackcoffer.com/ott-

<h2>Text Analysis</h2>

<h3>The objective of this document is to explain the methodology adopted to perform text analysis and derive sentiment opinion, sentiment scores, readability, passive words, personal pronouns, and related variables.</h3>
    

<h3>2. Here we calculate the following variables from the document file while converting paragraphs into sentences and sentences into words</h3>
<p>    a. Average Sentence Length </p>
<p>    b. Average Words Per Sentence </p>
<p>    c. Fog Index </p>
<p>    d. Average Syllable Count Per Word </p>
<p>    e. Total Complex Words </p>
<p>    f. Percentage of Complex Words </p>
<p> <b>Note:</b> I calculated these variables at this stage because cleaning and preprocessing remove stop words and punctuation, including full stops, so each article collapses into a single unpunctuated run of words and the sentence boundaries are lost. The sentence-level variables therefore have to be computed before preprocessing the article. </p>
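
As a hedged illustration of how the variables listed above relate to each other, the sketch below computes them for a short string. It uses a simplified regular-expression sentence splitter and a naive vowel-group syllable counter, both assumptions made only to keep the sketch dependency-free; the notebook itself relies on NLTK's tokenizers and its own `count_syllables`. The names `naive_syllables` and `fog_summary` are hypothetical.

```python
import re

def naive_syllables(word):
    """Rough syllable estimate (assumption): count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_summary(text):
    """Return sentence-level readability variables for `text`."""
    # Simplified splitting: a sentence ends at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    # A word counts as complex when it has more than two syllables.
    complex_words = [w for w in words if naive_syllables(w) > 2]

    avg_sentence_length = len(words) / len(sentences)
    pct_complex = len(complex_words) / len(words) * 100
    return {
        "avg_sentence_length": avg_sentence_length,
        "pct_complex": pct_complex,
        "fog_index": 0.4 * (avg_sentence_length + pct_complex),
    }

summary = fog_summary("Cities grow quickly. Infrastructure must adapt.")
print(summary)
```

The same text with its full stops removed would count as one sentence, which is why these variables are taken from the raw article text.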



In [8]:
import string
import nltk.data
from nltk.tokenize import word_tokenize, sent_tokenize

# Load the pre-trained sentence tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Function to count syllables in a word
def count_syllables(word):
    vowels = 'aeiouy'
    count = 0
    word = word.lower()
    
    if word.endswith(('es', 'ed')):
        return 0
    
    if word[0] in vowels:
        count += 1
        
    for index in range(1, len(word)):
        if word[index] in vowels and word[index - 1] not in vowels:
            count += 1
            
    if word.endswith('e'):
        count -= 1
        
    if count == 0:
        count += 1
        
    return count

# Function to calculate readability using the Gunning Fog index formula
def calculate_readability(text):
    # Tokenize text into sentences
    sentences = sent_tokenize(text)
    num_sentences = len(sentences)
    
    # Tokenize text into words
    words = word_tokenize(text.lower())
    num_words = len(words)
    
    # Calculate average sentence length
    average_sentence_length = num_words / num_sentences
    
    # Count total syllables and complex words
    total_syllables = 0
    total_complex_words = 0
    for word in words:
        syllable_count = count_syllables(word)
        total_syllables += syllable_count
        if syllable_count > 2:
            total_complex_words += 1
    
    # Calculate percentage of complex words
    percentage_complex_words = total_complex_words / num_words * 100
    
    # Calculate Fog Index
    fog_index = 0.4 * (average_sentence_length + percentage_complex_words)
    
    # Calculate average number of words per sentence
    average_words_per_sentence = num_words / num_sentences

    # Calculate average syllable count per word
    average_syllable_count_per_word = total_syllables / num_words
    
    return {
        'average_sentence_length': average_sentence_length,
        'fog_index': fog_index,
        'average_words_per_sentence': average_words_per_sentence,
        'average_syllable_count_per_word': average_syllable_count_per_word,
        'total_complex_words': total_complex_words,
        'percentage_complex_words': percentage_complex_words
    }

# Define full file paths
text_files_directory = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\Data Of articles"
stopwords_directory = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\StopWords-20240228T165233Z-001"
output_directory = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\Without_stopwords_data_articles"

# List stopwords files
stopwords_files = [os.path.join(stopwords_directory, f) for f in os.listdir(stopwords_directory) if f.endswith('.txt')]

# Iterate over each text file in the directory
for filename in os.listdir(text_files_directory):
    if filename.endswith('.txt'):
        text_file_path = os.path.join(text_files_directory, filename)
        output_file_path = os.path.join(output_directory, filename)

        # Read text from text file
        with open(text_file_path, 'r', encoding='utf-8') as file:
            text = file.read()

        # Perform readability analysis
        readability_results = calculate_readability(text)
        
        # Print readability results for the current file
        print(f"Readability Analysis for {filename}:")
        print("Average Sentence Length:", readability_results['average_sentence_length'])
        print("Fog Index:", readability_results['fog_index'])
        print("Average Words Per Sentence:", readability_results['average_words_per_sentence'])
        print("Average Syllable Count Per Word:", readability_results['average_syllable_count_per_word'])
        print("Total Complex Words:", readability_results['total_complex_words'])
        print("Percentage of Complex Words:", readability_results['percentage_complex_words'])
        print()

print("Readability analysis for all text files completed successfully!")


Readability Analysis for blackassign0001.txt:
Average Sentence Length: 21.466666666666665
Fog Index: 14.549399585921325
Average Words Per Sentence: 21.466666666666665
Average Syllable Count Per Word: 1.5372670807453417
Total Complex Words: 96
Percentage of Complex Words: 14.906832298136646

Readability Analysis for blackassign0002.txt:
Average Sentence Length: 22.9390243902439
Fog Index: 15.831643780552639
Average Words Per Sentence: 22.9390243902439
Average Syllable Count Per Word: 1.5598086124401913
Total Complex Words: 313
Percentage of Complex Words: 16.640085061137693

Readability Analysis for blackassign0003.txt:
Average Sentence Length: 24.114754098360656
Fog Index: 18.429042359942496
Average Words Per Sentence: 24.114754098360656
Average Syllable Count Per Word: 1.698164513936098
Total Complex Words: 323
Percentage of Complex Words: 21.957851801495583

Readability Analysis for blackassign0004.txt:
Average Sentence Length: 26.035714285714285
Fog Index: 19.27573976092495
Average 

<h3>3. Here I clean the text by removing punctuation and stop words (i.e. words that carry little meaning), using the stop-word files provided with the assignment</h3>
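
The cleaning step can be sketched on a small string as follows. This is a minimal sketch, not the notebook's exact cell: it lowercases the stop-word entries before comparing, on the assumption that the entries may differ in case from the article text, and it stores them in a set for faster lookups. The name `clean_text` and the sample stop words are hypothetical.

```python
import string

def clean_text(text, stopwords):
    """Lowercase, strip punctuation, and drop stop words and non-alphabetic tokens."""
    # Normalise the stop words once so the comparison is case-insensitive.
    stopword_set = {w.strip().lower() for w in stopwords if w.strip()}
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.lower().split()
    return " ".join(t for t in tokens if t.isalpha() and t not in stopword_set)

sample_stopwords = ["THE", "is", "Of"]  # illustrative entries, not the provided lists
cleaned = clean_text("The rise of AI is fast, isn't it?", sample_stopwords)
print(cleaned)
```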

In [9]:
# Load the pre-trained sentence tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Function to read stopwords from file
def read_stopwords(stopwords_files):
    stopwords = []
    for stopwords_file in stopwords_files:
        with open(stopwords_file, 'r', encoding='latin-1') as file:
            stopwords.extend(file.readlines())
    return [word.strip() for word in stopwords]

# Function to clean text by removing stopwords, punctuation, and non-alphabetic tokens
def remove_stopwords(text, stopwords):
    # Tokenize text into sentences
    sentences = sent_tokenize(text)
    
    # Initialize list to store cleaned words
    cleaned_words = []
    
    for sentence in sentences:
        # Remove punctuation and convert to lowercase
        sentence = sentence.translate(str.maketrans('', '', string.punctuation))
        sentence = sentence.lower()
        
        # Tokenize sentence into words
        words = sentence.split()
        
        # Filter out stopwords and tokens that are not purely alphabetic
        cleaned_words.extend([word for word in words if word not in stopwords and word.isalpha()])
    
    return ' '.join(cleaned_words)

# Define full file paths
text_files_directory = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\Data Of articles"
stopwords_directory = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\StopWords-20240228T165233Z-001"
output_directory = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\Without_stopwords_data_articles"

# List stopwords files
stopwords_files = [os.path.join(stopwords_directory, f) for f in os.listdir(stopwords_directory) if f.endswith('.txt')]

# Iterate over each text file in the directory
for filename in os.listdir(text_files_directory):
    if filename.endswith('.txt'):
        text_file_path = os.path.join(text_files_directory, filename)
        output_file_path = os.path.join(output_directory, filename)

        # Read text from text file
        with open(text_file_path, 'r', encoding='utf-8') as file:
            text = file.read()

        # Read stopwords from stopwords files
        stopwords = read_stopwords(stopwords_files)

        # Clean text by removing stopwords, punctuation, and non-alphabetic tokens
        cleaned_text = remove_stopwords(text, stopwords)
        
        # Save cleaned text to new file
        with open(output_file_path, 'w', encoding='utf-8') as file:
            file.write(cleaned_text)

        print(f"Text cleaned and saved to {output_file_path}")

print("All text files cleaned successfully!")


Text cleaned and saved to C:\Users\Ideapad\Desktop\blackcoffer\Without_stopwords_data_articles\blackassign0001.txt
Text cleaned and saved to C:\Users\Ideapad\Desktop\blackcoffer\Without_stopwords_data_articles\blackassign0002.txt
Text cleaned and saved to C:\Users\Ideapad\Desktop\blackcoffer\Without_stopwords_data_articles\blackassign0003.txt
Text cleaned and saved to C:\Users\Ideapad\Desktop\blackcoffer\Without_stopwords_data_articles\blackassign0004.txt
Text cleaned and saved to C:\Users\Ideapad\Desktop\blackcoffer\Without_stopwords_data_articles\blackassign0005.txt
Text cleaned and saved to C:\Users\Ideapad\Desktop\blackcoffer\Without_stopwords_data_articles\blackassign0006.txt
Text cleaned and saved to C:\Users\Ideapad\Desktop\blackcoffer\Without_stopwords_data_articles\blackassign0007.txt
Text cleaned and saved to C:\Users\Ideapad\Desktop\blackcoffer\Without_stopwords_data_articles\blackassign0008.txt
Text cleaned and saved to C:\Users\Ideapad\Desktop\blackcoffer\Without_stopwords

<h3>4. In this step I create the dictionaries of positive and negative words using the master dictionary provided with the assignment.</h3>
<p>     a. Function to read positive and negative words from files</p>
<p>     b. Function to create positive and negative word dictionaries.</p>
<p>     c. Iterate over cleaned text files.</p>
<p>     d. Define file paths. </p>
<p>     e. Create positive and negative word dictionaries. </p>
<p>     f. Print the dictionaries. </p>
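
The core of the steps above is a membership test of each cleaned word against the two master lists. A minimal sketch using set intersection is shown below; the function name `find_sentiment_words` and the tiny word sets are hypothetical stand-ins for the master dictionary files.

```python
def find_sentiment_words(cleaned_text, positive_master, negative_master):
    """Return the positive and negative master words that occur in the text."""
    tokens = set(cleaned_text.lower().split())
    return {
        "positive_words": sorted(tokens & positive_master),
        "negative_words": sorted(tokens & negative_master),
    }

# Illustrative word sets only; the real lists come from the master dictionary.
positive_master = {"growth", "improve", "benefit"}
negative_master = {"crime", "risk", "loss"}

found = find_sentiment_words(
    "growth brings benefit but crime adds risk", positive_master, negative_master
)
print(found)
```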

In [10]:

# Function to read positive and negative words from files
def read_word_file(word_file, encoding='utf-8'):
    with open(word_file, 'r', encoding=encoding) as file:
        words = file.readlines()
    return {word.strip().lower() for word in words}

# Function to create positive and negative word dictionaries
def create_word_dictionary(master_positive_words_file, master_negative_words_file, cleaned_text_directory):
    # Read positive and negative words from master dictionary files
    master_positive_words = read_word_file(master_positive_words_file, encoding='latin-1')
    master_negative_words = read_word_file(master_negative_words_file, encoding='latin-1')


    # Initialize dictionaries to store positive and negative words found in cleaned text files
    positive_word_dict = {}
    negative_word_dict = {}

    # Iterate over cleaned text files
    for filename in os.listdir(cleaned_text_directory):
        if filename.endswith('.txt'):
            text_file_path = os.path.join(cleaned_text_directory, filename)
            with open(text_file_path, 'r', encoding='utf-8') as file:
                text = file.read().lower()  # Convert text to lowercase for case-insensitive matching
                for word in text.split():
                    if word in master_positive_words:
                        positive_word_dict[word] = True
                    elif word in master_negative_words:
                        negative_word_dict[word] = True

    return {'positive_words': list(positive_word_dict.keys()), 'negative_words': list(negative_word_dict.keys())}

# Define file paths
master_positive_words_file = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\MasterDictionary-20240228T165453Z-001\\MasterDictionary\\positive-words.txt"
master_negative_words_file = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\MasterDictionary-20240228T165453Z-001\\MasterDictionary\\negative-words.txt"
cleaned_text_directory = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\Without_stopwords_data_articles"

# Create positive and negative word dictionaries
word_dictionaries = create_word_dictionary(master_positive_words_file, master_negative_words_file, cleaned_text_directory)

# Print the dictionaries
print(word_dictionaries)






<h3>5. In this step I calculate the variables that have to be computed after cleaning and preprocessing</h3>
<p>     a. Positive Score</p>
<p>     b. Negative Score</p>
<p>     c. Polarity Score</p>
<p>     d. Subjectivity Score</p>
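
As a worked example of the two derived scores, the sketch below applies the formulas used in this notebook to the counts reported for blackassign0001.txt (positive score 10, negative score 3, 300 cleaned words). The helper name `sentiment_scores` is hypothetical; the small constant only guards against division by zero.

```python
def sentiment_scores(positive_score, negative_score, total_words):
    """Return (polarity, subjectivity) from raw positive/negative counts."""
    eps = 0.000001  # avoids division by zero when no words match
    polarity = (positive_score - negative_score) / (positive_score + negative_score + eps)
    subjectivity = (positive_score + negative_score) / (total_words + eps)
    return polarity, subjectivity

polarity, subjectivity = sentiment_scores(10, 3, 300)
print(round(polarity, 6), round(subjectivity, 6))
```

Polarity ranges from -1 (only negative words) to +1 (only positive words), while subjectivity is the share of cleaned words that carry any sentiment.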

In [11]:
def read_word_file(word_file, encoding='latin-1'):
    with open(word_file, 'r', encoding=encoding) as file:
        words = file.readlines()
    return {word.strip().lower() for word in words}


# Function to calculate derived variables
def calculate_derived_variables(text, positive_words, negative_words):
    # Tokenize the text
    tokens = word_tokenize(text.lower())

    # Initialize variables
    positive_score = sum(1 for token in tokens if token in positive_words)
    negative_score = sum(1 for token in tokens if token in negative_words)
    total_words = len(tokens)

    # Calculate derived variables
    polarity_score = (positive_score - negative_score) / (positive_score + negative_score + 0.000001)
    subjectivity_score = (positive_score + negative_score) / (total_words + 0.000001)

    return positive_score, negative_score, polarity_score, subjectivity_score

# Define file paths
positive_words_file = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\MasterDictionary-20240228T165453Z-001\\MasterDictionary\\positive-words.txt"
negative_words_file = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\MasterDictionary-20240228T165453Z-001\\MasterDictionary\\negative-words.txt"
# cleaned_text_directory = "without_stopword_directory"

# Read positive and negative words from files
positive_words = read_word_file(positive_words_file)
negative_words = read_word_file(negative_words_file)

# Iterate over cleaned text files
for filename in os.listdir(cleaned_text_directory):
    if filename.endswith('.txt'):
        text_file_path = os.path.join(cleaned_text_directory, filename)

        # Read text from text file
        with open(text_file_path, 'r', encoding='utf-8') as file:
            text = file.read()

        # Calculate derived variables
        positive_score, negative_score, polarity_score, subjectivity_score = calculate_derived_variables(text, positive_words, negative_words)

        # Print or save the derived variables
        print(f"File: {filename}")
        print(f"Positive Score: {positive_score}")
        print(f"Negative Score: {negative_score}")
        print(f"Polarity Score: {polarity_score}")
        print(f"Subjectivity Score: {subjectivity_score}")
        print("---------------------------------------------")


File: blackassign0001.txt
Positive Score: 10
Negative Score: 3
Polarity Score: 0.5384614970414233
Subjectivity Score: 0.04333333318888889
---------------------------------------------
File: blackassign0002.txt
Positive Score: 59
Negative Score: 33
Polarity Score: 0.2826086925803403
Subjectivity Score: 0.10574712631523318
---------------------------------------------
File: blackassign0003.txt
Positive Score: 42
Negative Score: 26
Polarity Score: 0.23529411418685128
Subjectivity Score: 0.09103078970410872
---------------------------------------------
File: blackassign0004.txt
Positive Score: 41
Negative Score: 77
Polarity Score: -0.30508474317724793
Subjectivity Score: 0.16208791186526386
---------------------------------------------
File: blackassign0005.txt
Positive Score: 24
Negative Score: 10
Polarity Score: 0.41176469377162667
Subjectivity Score: 0.06813627240854454
---------------------------------------------
File: blackassign0006.txt
Positive Score: 90
Negative Score: 28
Polarity

<p>e. Word Count</p>
<p>f. Average Word Length</p>
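
Both variables reduce to simple arithmetic on the cleaned words: the word count is the number of tokens, and the average word length is the total number of characters divided by that count. A minimal sketch, assuming whitespace-separated cleaned text (the name `word_stats` is hypothetical):

```python
def word_stats(cleaned_text):
    """Return (word_count, average_word_length) for whitespace-separated text."""
    words = cleaned_text.split()
    if not words:
        return 0, 0.0
    return len(words), sum(len(w) for w in words) / len(words)

count, avg_length = word_stats("rise ai fast")
print(count, round(avg_length, 2))
```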

In [12]:
cleaned_text_directory = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\Without_stopwords_data_articles"

# Function to count words and calculate average word length in a file
def count_words_and_avg_word_length(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
        words = word_tokenize(text.lower())
        word_count = len(words)
        total_word_length = sum(len(word) for word in words)
        avg_word_length = total_word_length / word_count if word_count > 0 else 0
        return word_count, avg_word_length

# Iterate over each file in the directory
for filename in os.listdir(cleaned_text_directory):
    if filename.endswith('.txt'):
        file_path = os.path.join(cleaned_text_directory, filename)
        word_count, avg_word_length = count_words_and_avg_word_length(file_path)
        print(f"File: {filename}")
        print(f"Word Count: {word_count}")
        print(f"Average Word Length: {avg_word_length:.2f}")
        print("---------------------------------------------")


File: blackassign0001.txt
Word Count: 300
Average Word Length: 6.90
---------------------------------------------
File: blackassign0002.txt
Word Count: 870
Average Word Length: 7.52
---------------------------------------------
File: blackassign0003.txt
Word Count: 747
Average Word Length: 8.17
---------------------------------------------
File: blackassign0004.txt
Word Count: 728
Average Word Length: 8.01
---------------------------------------------
File: blackassign0005.txt
Word Count: 499
Average Word Length: 7.51
---------------------------------------------
File: blackassign0006.txt
Word Count: 1187
Average Word Length: 7.95
---------------------------------------------
File: blackassign0007.txt
Word Count: 602
Average Word Length: 7.28
---------------------------------------------
File: blackassign0008.txt
Word Count: 595
Average Word Length: 8.18
---------------------------------------------
File: blackassign0009.txt
Word Count: 705
Average Word Length: 8.12
-------------------

<p>g. Personal Pronoun</p>
<p><b>Note: I have counted only the personal pronouns listed in the assignment document. </b></p>
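
One way to keep the pronoun `I` and the country abbreviation `US` distinguishable is to match on the original casing instead of lowercased text. The sketch below is an illustrative variant under that assumption, not the cell used to produce the outputs in this notebook; the name `count_pronouns` is hypothetical.

```python
import re

def count_pronouns(text):
    """Count the listed personal pronouns, using case to skip 'US' and stray 'i'."""
    counts = {p: 0 for p in ("I", "we", "my", "ours", "us")}
    for token in re.findall(r"[A-Za-z]+", text):
        if token == "I":
            # Only the capitalised form is treated as the pronoun.
            counts["I"] += 1
        elif token != "US" and token.lower() in counts:
            # All-caps "US" is assumed to be the country name and is skipped.
            counts[token.lower()] += 1
    return counts

pronouns = count_pronouns("We think the US market helps us. I agree, i.e. my view is ours.")
print(pronouns)
```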

In [13]:
import re

text_files_directory = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\Data Of articles"

# Function to count personal pronouns in a file
def count_personal_pronouns(file_path):
    pronoun_counts = {'I': 0, 'we': 0, 'my': 0, 'ours': 0, 'us': 0}

    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read().lower()
        for pronoun in pronoun_counts.keys():
            if pronoun == 'us':
                # Negative lookahead skips 'us' when it is immediately followed by a period,
                # a rough attempt to exclude the abbreviation 'US' as a country name
                pattern = r'\b' + pronoun + r'\b(?!\.)'
            else:
                pattern = r'\b' + pronoun + r'\b'
            pronoun_counts[pronoun] = len(re.findall(pattern, text))

    return pronoun_counts

# Iterate over each file in the directory
for filename in os.listdir(text_files_directory):
    if filename.endswith('.txt'):
        file_path = os.path.join(text_files_directory, filename)
        pronoun_counts = count_personal_pronouns(file_path)
        
        total_count = sum(pronoun_counts.values())
        
        print(f"File: {filename}")
        for pronoun, count in pronoun_counts.items():
            print(f"{pronoun.capitalize()}: {count}")
        print(f"Total Count: {total_count}")
        print("---------------------------------------------")


File: blackassign0001.txt
I: 0
We: 4
My: 0
Ours: 0
Us: 1
Total Count: 5
---------------------------------------------
File: blackassign0002.txt
I: 0
We: 4
My: 0
Ours: 0
Us: 4
Total Count: 8
---------------------------------------------
File: blackassign0003.txt
I: 0
We: 13
My: 0
Ours: 0
Us: 2
Total Count: 15
---------------------------------------------
File: blackassign0004.txt
I: 0
We: 6
My: 0
Ours: 0
Us: 1
Total Count: 7
---------------------------------------------
File: blackassign0005.txt
I: 0
We: 5
My: 0
Ours: 0
Us: 1
Total Count: 6
---------------------------------------------
File: blackassign0006.txt
I: 0
We: 7
My: 0
Ours: 0
Us: 1
Total Count: 8
---------------------------------------------
File: blackassign0007.txt
I: 0
We: 1
My: 0
Ours: 0
Us: 2
Total Count: 3
---------------------------------------------
File: blackassign0008.txt
I: 0
We: 4
My: 0
Ours: 0
Us: 1
Total Count: 5
---------------------------------------------
File: blackassign0009.txt
I: 0
We: 4
My: 0
Ours: 0
Us:

<h3>6. Here I collect all the derived variables and write them to a CSV file</h3>
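
The writing step follows the usual `csv.DictWriter` pattern: one dictionary per article, with keys matching the column names. The sketch below writes to an in-memory buffer so it can be run without touching the file system. It also strips the `.txt` extension from the file name, on the assumption that the `URL_ID` column should match the identifiers in the input sheet; the sample row reuses the scores reported for the first article and only three of the columns.

```python
import csv
import io
import os

field_names = ["URL_ID", "POSITIVE SCORE", "NEGATIVE SCORE"]

# One dictionary per article; keys must match the field names.
rows = [
    {
        "URL_ID": os.path.splitext("blackassign0001.txt")[0],
        "POSITIVE SCORE": 10,
        "NEGATIVE SCORE": 3,
    },
]

buffer = io.StringIO()  # stands in for the output file in this sketch
writer = csv.DictWriter(buffer, fieldnames=field_names)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```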

In [15]:
import csv

# Function to count syllables in a word
def count_syllables(word):
    vowels = 'aeiou'
    count = 0
    word = word.lower()
    
    if word.endswith(('es', 'ed')):
        return 0
    
    if word[0] in vowels:
        count += 1
        
    for index in range(1, len(word)):
        if word[index] in vowels and word[index - 1] not in vowels:
            count += 1
            
    if word.endswith('e'):
        count -= 1
        
    if count == 0:
        count += 1
        
    return count

# Function to calculate readability using the Gunning Fog index formula
def calculate_readability(text):
    # Tokenize text into sentences
    sentences = sent_tokenize(text)
    num_sentences = len(sentences)
    
    # Tokenize text into words
    words = word_tokenize(text.lower())
    num_words = len(words)
    
    # Calculate average sentence length
    average_sentence_length = num_words / num_sentences
    
    # Count total syllables and complex words
    total_syllables = 0
    total_complex_words = 0
    for word in words:
        syllable_count = count_syllables(word)
        total_syllables += syllable_count
        if syllable_count > 2:
            total_complex_words += 1
    
    # Calculate percentage of complex words
    percentage_complex_words = total_complex_words / num_words * 100
    
    # Calculate Fog Index
    fog_index = 0.4 * (average_sentence_length + percentage_complex_words)
    
    # Calculate average number of words per sentence
    average_words_per_sentence = num_words / num_sentences

    # Calculate average syllable count per word
    average_syllable_count_per_word = total_syllables / num_words
    
    return {
        'average_sentence_length': average_sentence_length,
        'fog_index': fog_index,
        'average_words_per_sentence': average_words_per_sentence,
        'average_syllable_count_per_word': average_syllable_count_per_word,
        'total_complex_words': total_complex_words,
        'percentage_complex_words': percentage_complex_words
    }

# Function to read stopwords from file
def read_stopwords(stopwords_files):
    stopwords = []
    for stopwords_file in stopwords_files:
        with open(stopwords_file, 'r', encoding='latin-1') as file:
            stopwords.extend(file.readlines())
    return [word.strip() for word in stopwords]

# Function to clean text by removing stopwords, punctuation, and non-alphabetic tokens
def remove_stopwords(text, stopwords):
    # Tokenize text into sentences
    sentences = sent_tokenize(text)
    
    # Initialize list to store cleaned words
    cleaned_words = []
    
    for sentence in sentences:
        # Remove punctuation and convert to lowercase
        sentence = sentence.translate(str.maketrans('', '', string.punctuation))
        sentence = sentence.lower()
        
        # Tokenize sentence into words
        words = sentence.split()
        
        # Filter out stopwords and tokens that are not purely alphabetic
        cleaned_words.extend([word for word in words if word not in stopwords and word.isalpha()])
    
    return ' '.join(cleaned_words)

def read_word_file(word_file, encoding='latin-1'):
    with open(word_file, 'r', encoding=encoding) as file:
        words = file.readlines()
    return {word.strip().lower() for word in words}


# Function to calculate derived variables
def calculate_derived_variables(text, positive_words, negative_words):
    # Tokenize the text
    tokens = word_tokenize(text.lower())

    # Initialize variables
    positive_score = sum(1 for token in tokens if token in positive_words)
    negative_score = sum(1 for token in tokens if token in negative_words)
    total_words = len(tokens)

    # Calculate derived variables
    polarity_score = (positive_score - negative_score) / (positive_score + negative_score + 0.000001)
    subjectivity_score = (positive_score + negative_score) / (total_words + 0.000001)

    return positive_score, negative_score, polarity_score, subjectivity_score

# Define full file paths
text_files_directory = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\Data Of articles"
stopwords_directory = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\StopWords-20240228T165233Z-001"
output_directory = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\Without_stopwords_data_articles"
master_positive_words_file = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\MasterDictionary-20240228T165453Z-001\\MasterDictionary\\positive-words.txt"
master_negative_words_file = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\MasterDictionary-20240228T165453Z-001\\MasterDictionary\\negative-words.txt"
cleaned_text_directory = "C:\\Users\\Ideapad\\Desktop\\blackcoffer\\Without_stopwords_data_articles"

# List stopwords files
stopwords_files = [os.path.join(stopwords_directory, f) for f in os.listdir(stopwords_directory) if f.endswith('.txt')]

# Read positive and negative words from files
positive_words = read_word_file(master_positive_words_file)
negative_words = read_word_file(master_negative_words_file)

# Initialize list to store data rows
data_rows = []

# Iterate over each text file in the directory
for filename in os.listdir(text_files_directory):
    if filename.endswith('.txt'):
        text_file_path = os.path.join(text_files_directory, filename)
        output_file_path = os.path.join(output_directory, filename)

        # Read text from text file
        with open(text_file_path, 'r', encoding='utf-8') as file:
            text = file.read()

        # Perform readability analysis
        readability_results = calculate_readability(text)

        # Read stopwords from stopwords files
        stopwords = read_stopwords(stopwords_files)

        # Clean text by removing stopwords, punctuation, and non-alphabetic tokens
        cleaned_text = remove_stopwords(text, stopwords)

        # Calculate derived variables
        positive_score, negative_score, polarity_score, subjectivity_score = calculate_derived_variables(cleaned_text, positive_words, negative_words)

        # Count words and calculate average word length
        words = word_tokenize(cleaned_text.lower())
        word_count = len(words)
        avg_word_length = sum(len(word) for word in words) / word_count if word_count > 0 else 0

        # Count personal pronouns
        pronoun_counts = count_personal_pronouns(text_file_path)
        total_personal_pronouns = sum(pronoun_counts.values())

        # Append the data for the current file to the data rows list
        data_rows.append({
            "URL_ID": filename,
            "POSITIVE SCORE": positive_score,
            "NEGATIVE SCORE": negative_score,
            "POLARITY SCORE": polarity_score,
            "SUBJECTIVITY SCORE": subjectivity_score,
            "AVG SENTENCE LENGTH": readability_results['average_sentence_length'],
            "PERCENTAGE OF COMPLEX WORDS": readability_results['percentage_complex_words'],
            "FOG INDEX": readability_results['fog_index'],
            "AVG NUMBER OF WORDS PER SENTENCE": readability_results['average_words_per_sentence'],
            "COMPLEX WORD COUNT": readability_results['total_complex_words'],
            "WORD COUNT": word_count,
            "SYLLABLE PER WORD": readability_results['average_syllable_count_per_word'],
            "PERSONAL PRONOUNS": total_personal_pronouns,
            "AVG WORD LENGTH": avg_word_length,
        })

# Define the CSV file path
csv_file_path = "output_data.csv"

# Define the field names
field_names = [
    "URL_ID", "POSITIVE SCORE", "NEGATIVE SCORE", "POLARITY SCORE", "SUBJECTIVITY SCORE",
    "AVG SENTENCE LENGTH", "PERCENTAGE OF COMPLEX WORDS", "FOG INDEX",
    "AVG NUMBER OF WORDS PER SENTENCE", "COMPLEX WORD COUNT", "WORD COUNT",
    "SYLLABLE PER WORD", "PERSONAL PRONOUNS", "AVG WORD LENGTH"
]


# Write the data to the CSV file
with open(csv_file_path, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=field_names)
    
    # Write the header
    writer.writeheader()
    
    # Write the data rows
    writer.writerows(data_rows)

print("Data successfully written to CSV file:", csv_file_path)


Data successfully written to CSV file: output_data.csv


In [16]:
df3 = pd.read_csv("C:\\Users\\Ideapad\\Desktop\\blackcoffer\\output_data.csv")

In [17]:
df3.head()

Unnamed: 0,URL_ID,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,AVG NUMBER OF WORDS PER SENTENCE,COMPLEX WORD COUNT,WORD COUNT,SYLLABLE PER WORD,PERSONAL PRONOUNS,AVG WORD LENGTH
0,blackassign0001.txt,10,3,0.538461,0.043333,21.466667,13.819876,14.114617,21.466667,89,300,1.484472,5,6.903333
1,blackassign0002.txt,59,33,0.282609,0.105747,22.939024,15.417331,15.342542,22.939024,290,870,1.507709,8,7.52069
2,blackassign0003.txt,42,26,0.235294,0.091031,24.114754,20.666213,17.912387,24.114754,304,747,1.646499,15,8.165997
3,blackassign0004.txt,41,77,-0.305085,0.162088,26.035714,19.958848,18.397825,26.035714,291,728,1.631001,7,8.013736
4,blackassign0005.txt,24,10,0.411765,0.068136,23.090909,15.15748,15.299356,23.090909,154,499,1.531496,6,7.509018


<h3>7. After creating the CSV file, I merge the derived variables with the output structure file provided with the assignment</h3>

In [28]:
import pandas as pd
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows

# Read the Excel file and extract the "URL" column
excel_file_path = r"C:\Users\Ideapad\Desktop\Output Data Structure.xlsx"
df_excel = pd.read_excel(excel_file_path)

# Extract URL column as a DataFrame
url_column = df_excel[['URL']]

# Read the CSV file
csv_file_path = r"C:\Users\Ideapad\Desktop\blackcoffer\output_data.csv"
df_csv = pd.read_csv(csv_file_path)

# Insert URL column as the second attribute
df_combined = pd.concat([df_csv.iloc[:, :1], url_column, df_csv.iloc[:, 1:]], axis=1)

# Create a new Excel workbook and worksheet
wb = Workbook()
ws = wb.active

# Write DataFrame to worksheet
for r in dataframe_to_rows(df_combined, index=False, header=True):
    ws.append(r)

# Save the workbook
output_excel_file_path = r"C:\Users\Ideapad\Desktop\blackcoffer\output_data.xlsx"
wb.save(output_excel_file_path)

print("URL column added as the second attribute to Excel file:", output_excel_file_path)


URL column added as the second attribute to Excel file: C:\Users\Ideapad\Desktop\blackcoffer\output_data.xlsx


<h3>8. In this snippet I make the header row bold</h3>

In [29]:
from openpyxl import load_workbook
from openpyxl.styles import Font

# Load the workbook
wb = load_workbook(filename=r"C:\Users\Ideapad\Desktop\blackcoffer\output_data.xlsx")

# Select the active worksheet
ws = wb.active

# Make the header row bold
for cell in ws[1]:
    cell.font = Font(bold=True)

# Save the workbook
wb.save(r"C:\Users\Ideapad\Desktop\blackcoffer\output_data.xlsx")

print("Header row made bold in the Excel file.")


Header row made bold in the Excel file.
