## Instructions:
### My approach is as follows.
1. First, I visited each website listed in the provided input file.
2. To extract the title and text of each page, I used Python's Beautiful Soup package.
3. After extracting the text, I cleaned it by removing the unnecessary parts, i.e. stopwords, pronouns, etc.
4. After that, I calculated all the required values.
5. Finally, I exported the sheet as an Excel file.

## Importing the required libraries

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import re
import nltk
from nltk.tokenize import word_tokenize
import textstat
from nltk.corpus import stopwords

## Reading the Excel file that contains the URLs

In [2]:
df_url = pd.read_excel('C:/Users/DHEERAJ/OneDrive/Documents/Desktop/BlackCoffer/Input/Input.xlsx')
df_url[:5]

Unnamed: 0,URL_ID,URL
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...


## Creating the list of URLs

In [3]:
# Create the list of all the URLs to fetch and parse with BeautifulSoup
url_list = df_url['URL'].tolist()

## Extract the Title and Text

In [10]:
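# Extract the title and the article text from a page
def extract_content(url):
    try:
        # A timeout keeps one unresponsive page from stalling the whole loop
        response = requests.get(url, timeout=30)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Some pages may have no <title> tag at all
        title = soup.title.text.strip() if soup.title else None
        # The article body sits in this div on the pages being scraped
        section = soup.find('div', class_='td-post-content tagdiv-type')
        text = section.text.strip() if section else 'Section not found'
        return title, text
    except Exception as e:
        print(f"Error occurred while extracting content from {url}: {e}")
        return None, None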

titles = []
texts = []
for url in url_list:
    title, text = extract_content(url)
    titles.append(title)
    texts.append(text)
df = pd.DataFrame({'Title': titles, 'Text': texts})
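
The parsing step can be tried without any network access. The sketch below is only an illustration: `SAMPLE_HTML` and `parse_title_and_text` are hypothetical names, and the HTML fragment is made up to mimic the `td-post-content tagdiv-type` div that the cell above searches for.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the real pages are fetched with requests.
SAMPLE_HTML = """
<html>
  <head><title> Sample article </title></head>
  <body>
    <div class="td-post-content tagdiv-type">
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </body>
</html>
"""

def parse_title_and_text(html):
    """Return (title, text) from an HTML string; a missing part becomes None."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else None
    section = soup.find("div", class_="td-post-content tagdiv-type")
    # Join the text of the nested tags with single spaces.
    text = section.get_text(" ", strip=True) if section else None
    return title, text

title, text = parse_title_and_text(SAMPLE_HTML)
print(title)
print(text)
```

Returning `None` for a missing section, rather than the string 'Section not found', is a design choice of this sketch only. With the placeholder string, a page without the div is later counted as a three-word text, as the last row of the final table in this notebook shows.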

## Check the Positive Words

In [6]:
# Check which words from the positive-word list occur in the extracted text
excel_file_path = 'Raw_data.xlsx' 
try:
    df_excel = pd.read_excel(excel_file_path)
except FileNotFoundError:
    print("Excel file not found.")
    exit()

def read_words_from_text(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        words = file.read().split()
    return words

text_document_path = 'C:/Users/DHEERAJ/OneDrive/Documents/Desktop/BlackCoffer/Words Database/MasterDictionary-20240331T100457Z-001/MasterDictionary/positive-words.txt' 

try:
    words = read_words_from_text(text_document_path)
except FileNotFoundError:
    print("Text document not found.")
    exit()

df_excel['Text'] = df_excel['Text'].str.strip().str.lower() 
results = []
for word in words:
    word = word.strip().lower() 
    pattern = r'\b{}\b'.format(re.escape(word)) 
    if df_excel['Text'].str.contains(pattern, regex=True).any():
        results.append((word, True))
    else:
        results.append((word, False))

if results:
    output_df = pd.DataFrame(results, columns=['Word', 'Match'])
    output_df.to_excel('positive.xlsx', index=False)
    print("Results saved")
else:
    print("No results to save.")

df_temp = pd.read_excel('positive.xlsx')
df_temp[df_temp["Match"]==True]


Results saved


Unnamed: 0,Word,Match
4,abundant,True
6,accessible,True
8,acclaimed,True
11,accolades,True
14,accomplish,True
...,...,...
1989,works,True
1991,worth,True
1994,worthwhile,True
1995,worthy,True
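
The whole-word matching used in the cell above can be checked on a small in-memory sample. This is a minimal sketch, not the notebook's code: `match_words`, `sample_texts` and `sample_words` are hypothetical names, and the sample data is invented.

```python
import re

def match_words(texts, words):
    """Map each word to True if it occurs as a whole word in any of the texts."""
    lowered = [t.strip().lower() for t in texts]
    results = {}
    for word in words:
        word = word.strip().lower()
        # \b on both sides keeps "worth" from matching inside "worthless".
        pattern = re.compile(r"\b{}\b".format(re.escape(word)))
        results[word] = any(pattern.search(t) for t in lowered)
    return results

sample_texts = ["The results were worthless.", "An abundant and accessible supply."]
sample_words = ["abundant", "accessible", "worth", "Worthy"]
print(match_words(sample_texts, sample_words))
```

Here "worth" is reported as not found even though "worthless" contains it, which is the behaviour the `\b` anchors are there to provide.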


## Check the Negative Words

In [7]:
# Check which words from the negative-word list occur in the extracted text
excel_file_path = 'Raw_data.xlsx' 
try:
    df_excel = pd.read_excel(excel_file_path)
except FileNotFoundError:
    print("Excel file not found.")
    exit()

def read_words_from_text(file_path):
    with open(file_path, 'r') as file:
        words = file.read().split()
    return words

text_document_path = 'C:/Users/DHEERAJ/OneDrive/Documents/Desktop/BlackCoffer/Words Database/MasterDictionary-20240331T100457Z-001/MasterDictionary/negative-words.txt'  # Update with your text document path

try:
    words = read_words_from_text(text_document_path)
except FileNotFoundError:
    print("Text document not found.")
    exit()

df_excel['Text'] = df_excel['Text'].str.strip().str.lower()  
results = []
for word in words:
    word = word.strip().lower() 
    pattern = r'\b{}\b'.format(re.escape(word))  
    if df_excel['Text'].str.contains(pattern, regex=True).any():
        results.append((word, True))
    else:
        results.append((word, False))

if results:
    output_df = pd.DataFrame(results, columns=['Word', 'Match'])
    output_df.to_excel('negative.xlsx', index=False)
    print("Results saved")
else:
    print("No results to save.")
df_tem = pd.read_excel('negative.xlsx')
df_tem[df_tem["Match"]==True]

Results saved


Unnamed: 0,Word,Match
3,abolish,True
13,abrupt,True
16,absence,True
23,abuse,True
53,adamant,True
...,...,...
4745,worsening,True
4746,worst,True
4754,wreak,True
4770,writhe,True


## Count the Words and Sentences

In [12]:
# Count the words and sentences in each text
nltk.download('punkt') 
def count_words(paragraph):
    return len(nltk.word_tokenize(paragraph))

def count_sentences(paragraph):
    return len(nltk.sent_tokenize(paragraph))

df['Word Count'] = df['Text'].apply(count_words)
df['Sentence Count'] = df['Text'].apply(count_sentences)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DHEERAJ\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
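
Note that `nltk.word_tokenize` returns punctuation marks as tokens of their own, so the word counts above include them. If punctuation should not count as a word, one option is a regex-based count. The sketch below is an assumption about what "word" should mean here, and `count_words_no_punct` is a hypothetical helper, not part of the notebook.

```python
import re

# A word is a run of letters/digits, optionally joined by an inner
# apostrophe or hyphen (e.g. "It's", "up-to-date").
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def count_words_no_punct(text):
    """Count words in text, ignoring standalone punctuation."""
    return len(WORD_PATTERN.findall(text))

print(count_words_no_punct("Hello, world! It's 2024."))
```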


## Count the Syllables

In [32]:
# Count the total number of syllables in each text
def syllable_count(word):
    return textstat.syllable_count(word)

def total_syllable_count(sentence):
    words = word_tokenize(sentence)
    syllable_counts = [syllable_count(word) for word in words]
    return sum(syllable_counts)

df['Syllable Count'] = df['Text'].apply(total_syllable_count)

## Removing the Stopwords

In [15]:
# Remove the stopwords from each text
nltk.download('stopwords')
# Build the set once; set membership tests are faster than list lookups
stop_words = set(stopwords.words('english'))

def remove_stopwords(sentence):
    words = word_tokenize(sentence)
    filtered_words = [word for word in words if word.lower() not in stop_words]
    filtered_sentence = ' '.join(filtered_words)
    return filtered_sentence

df['cleaned_sentences'] = df['Text'].apply(remove_stopwords)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DHEERAJ\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Removing the Punctuation Symbols

In [16]:
# Remove the punctuation symbols from the stopword-free text
nltk.download('punkt')

def remove_symbols(sentence):
    words = word_tokenize(sentence)
    pattern = re.compile(r'[^a-zA-Z0-9\s]')
    filtered_words = [pattern.sub('', word) for word in words]
    filtered_text = ' '.join(filtered_words)
    return filtered_text

df['After removing Symbols'] = df['cleaned_sentences'].apply(remove_symbols)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DHEERAJ\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Count the Characters in the Text

In [17]:
# Count the characters (excluding spaces) in the cleaned text
def count_letters(paragraph):
    return len(paragraph.replace(" ", ""))

df['letter_count'] = df['After removing Symbols'].apply(count_letters)

## Compute the Average Sentence Length

In [19]:
# Average sentence length = number of words / number of sentences
df["Avg_sentence_length"]=df["Word Count"]/df['Sentence Count']

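## Compute the Complex Word Measure

In [20]:
# Despite the column name, this is the average number of syllables per word
# (total syllables / total words), not a percentage of complex words
df["Percentage_Complex"]=df['Syllable Count']/df['Word Count']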
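
The `Percentage_Complex` column above divides the syllable total by the word total. If an actual share of complex words is wanted, one possible reading is "words with more than two syllables", which is the usual convention for the Gunning Fog Index but is an assumption here. The sketch below uses a rough vowel-group heuristic instead of `textstat`, so its counts will not always agree with the notebook's; `estimate_syllables` and `complex_word_percentage` are hypothetical helpers.

```python
import re

def estimate_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    word = word.lower()
    # Drop a trailing "es" or "ed", which usually adds no syllable.
    if len(word) > 2 and word.endswith(("es", "ed")):
        word = word[:-2]
    groups = re.findall(r"[aeiouy]+", word)
    # Every word counts as at least one syllable.
    return max(1, len(groups))

def complex_word_percentage(text):
    """Percentage of words with more than two (estimated) syllables."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    complex_words = [w for w in words if estimate_syllables(w) > 2]
    return 100.0 * len(complex_words) / len(words)

print(complex_word_percentage("The internet changed communication"))
```

In the sample sentence, "internet" and "communication" are estimated at more than two syllables, so two of the four words count as complex.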

## Compute the Fog Index

In [21]:
# Fog Index = 0.4 * (average sentence length + complex word measure)
df['Fog Index']=0.4*(df['Avg_sentence_length']+df['Percentage_Complex'])
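
As a quick check of the formula, the calculation can be reproduced by hand for a single article. The sketch below uses the word, sentence and syllable counts from the first row of the final table in this notebook; `fog_index` is a hypothetical helper that simply restates the expression above.

```python
def fog_index(avg_sentence_length, complex_term):
    """Fog Index as computed in this notebook: 0.4 * (ASL + complex term)."""
    return 0.4 * (avg_sentence_length + complex_term)

# Counts taken from the first row of the notebook's final table.
word_count, sentence_count, syllable_count = 1371, 78, 1803

avg_sentence_length = word_count / sentence_count
syllables_per_word = syllable_count / word_count
fog = fog_index(avg_sentence_length, syllables_per_word)
print(round(fog, 6))
```

The result agrees with the `Fog Index` value of 7.556809 listed for that row.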

## Compute the Average Number of Words per Sentence

In [22]:
# Average number of words per sentence across all texts
sum1=df['Word Count'].sum()
sum2=df['Sentence Count'].sum()
Result=sum1/sum2
print(Result)

22.84016784283807


## Count the Personal Pronouns

In [23]:
# Count the personal pronouns (I, we, my, ours, us) in each text.
# Note: with re.IGNORECASE the pattern also matches the abbreviation "US".
pattern = r'\b(I|we|my|ours|us)\b'

def pro_cnt(text):
    matches = re.findall(pattern, text, flags=re.IGNORECASE)
    return len(matches)

df['Personal_Pronoun_Counts'] = df['Text'].apply(pro_cnt)
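
Because the match is case-insensitive, the country abbreviation "US" is counted as the pronoun "us". If that is not wanted, one possible variant filters out the all-caps form after matching. This is a sketch under that assumption; `PRONOUN_PATTERN` and `count_personal_pronouns` are hypothetical names.

```python
import re

PRONOUN_PATTERN = re.compile(r"\b(?:I|we|my|ours|us)\b", flags=re.IGNORECASE)

def count_personal_pronouns(text):
    """Count I/we/my/ours/us, skipping the all-caps abbreviation 'US'."""
    matches = PRONOUN_PATTERN.findall(text)
    return sum(1 for m in matches if m != "US")

sample = "We moved to the US because it suited us and my family."
print(count_personal_pronouns(sample))
```

In the sample sentence, "We", "us" and "my" are counted, while "US" is skipped. The trade-off is that a genuine pronoun written in capitals, for example in an all-caps heading, would be skipped as well.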

## Compute the Average Word Length

In [24]:
# Average word length for each text, rounded to the nearest integer
df['Avg_word_length']=((df['letter_count']/df['Word Count']).round(0)).astype(int)

In [25]:
df 

Unnamed: 0,Title,Text,Word Count,Sentence Count,Syllable Count,cleaned_sentences,After removing Symbols,letter_count,Avg_sentence_length,Percentage_Complex,Fog Index,Personal_Pronoun_Counts,Avg_word_length
0,Rising IT cities and its impact on the economy...,We have seen a huge development and dependence...,1371,78,1803,seen huge development dependence people techno...,seen huge development dependence people techno...,4037,17.576923,1.315098,7.556809,12,3
1,Rising IT Cities and Their Impact on the Econo...,"Throughout history, from the industrial revolu...",1689,80,2571,"Throughout history , industrial revolution 18t...",Throughout history industrial revolution 18th...,6341,21.112500,1.522202,9.053881,6,4
2,"Internet Demand's Evolution, Communication Imp...",Introduction\nIn the span of just a few decade...,1218,57,2149,"Introduction span decades , internet undergone...",Introduction span decades internet undergone ...,5417,21.368421,1.764368,9.253116,13,4
3,Rise of Cybercrime and its Effect in upcoming ...,"The way we live, work, and communicate has unq...",1219,52,2046,"way live , work , communicate unquestionably c...",way live work communicate unquestionably cha...,5253,23.442308,1.678425,10.048293,5,4
4,OTT platform and its impact on the entertainme...,The year 2040 is poised to witness a continued...,762,39,1176,year 2040 poised witness continued revolution ...,year 2040 poised witness continued revolution ...,3003,19.538462,1.543307,8.432707,6,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Due to the COVID-19 the repercussion of the en...,"Epidemics, in general, have both direct and in...",1215,50,1877,"Epidemics , general , direct indirect costs as...",Epidemics general direct indirect costs asso...,4584,24.300000,1.544856,10.337942,4,4
96,Impact of COVID-19 pandemic on office space an...,COVID 19 has bought the world to its knees. Wi...,1206,38,1621,COVID 19 bought world knees . businesses shut ...,COVID 19 bought world knees businesses shut ...,3394,31.736842,1.344113,13.232382,7,3
97,Contribution of handicrafts (Visual Arts & Lit...,Handicrafts is an art of making crafts by hand...,431,12,672,Handicrafts art making crafts hand India calle...,Handicrafts art making crafts hand India calle...,1782,35.916667,1.559165,14.990333,0,4
98,How COVID-19 is impacting payment preferences?...,Section not found,3,1,4,Section found,Section found,12,3.000000,1.333333,1.733333,0,4


## Sum of Letter Counts

In [27]:
# Sum the letter counts of all texts after cleaning
a= df['letter_count'].sum()
print(a)

413775


## Removing the "es" and "ed" Endings

In [30]:
# Recount the syllables in each text, ignoring a trailing "es" or "ed" on each word
def syllable_count(word):
    exceptions = ['es', 'ed']
    if word[-2:] in exceptions:
        return textstat.syllable_count(word[:-2])
    else:
        return textstat.syllable_count(word)
        
def total_syllable_count(sentence):
    words = word_tokenize(sentence)
    syllable_counts = [syllable_count(word) for word in words]
    return sum(syllable_counts)
df['Syllable Count Per Word'] = df['Text'].apply(total_syllable_count)


## Save the Dataframe

In [35]:
df.to_excel('Output_Data.xlsx', index=False)