* **APPROACH TO THE SOLUTION :**

1. **Data Frame:** For working with data frame, 'pandas' library is imported along with 'numpy'.
2. **Data Scraping:** The script starts by importing necessary libraries for web scraping, such as 'requests' and 'BeautifulSoup'. Then the google drive is mounted with the colab noot book, and taking input the provided excel file contains the information of URLs.
3. **Text Processing and Analysis:** After scraping,the script proceeds to process the text data. Here, the libraries of NLKT is used for text processing. Such as, tokenization, removing stopwords. Apart from this, 'syllapy' is use to find the syllable.
4. **Calculation and Reporting:** Using the processed text data, the script calculates different linguistic features for each URL, including positive and negative scores, polarity score, subjectivity score, etc. These features are then compiled into a final report.
5. **Exporting Results:** Finally, the script exports the final report as a CSV file.

* **HOW TO RUN THE .PY FILE :**

1. First of all, python should be installed on the system.
2. Install the required the libraries by running 'pip install pandas numpy requests beautifulsoup4 nltk syllapy'.
3. Download the MasterDictionary folder and place it in the same directory as your .py file.
4. Open the excel file named 'Input.xlsx' in python program.
6. Run the .py file using a Python interpreter.
7. Once execution is complete, you'll find the final report saved as Final.csv in the specified directory.


In [270]:
# Import important libraries
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [271]:
# Mount google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [272]:
# Input file path
path="/content/drive/MyDrive/BlackCoffer/Input.xlsx"

In [273]:
file_link=pd.read_excel(path)

In [274]:
file_link

Unnamed: 0,URL_ID,URL
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...
...,...,...
95,blackassign0096,https://insights.blackcoffer.com/what-is-the-r...
96,blackassign0097,https://insights.blackcoffer.com/impact-of-cov...
97,blackassign0098,https://insights.blackcoffer.com/contribution-...
98,blackassign0099,https://insights.blackcoffer.com/how-covid-19-...


In [275]:
# URL_ID is the the list of URL IDs
URL_ID=file_link.iloc[:,0].tolist()

# df is the list of URL links
df_URL_LINK=file_link.iloc[:,1].tolist()

In [276]:
#------------------------------------------------------------------------------WEB SCRAPING------------------------------------------------------------------------------------------------------------------------------------------------

def scraping_from_url(url):
    try:
        # Title and text lists
        texts = []

        # Fetch webpage content
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "lxml")

        # Get title
        title = soup.title.text.strip()

        # Get all paragraphs
        paragraphs = soup.find_all("p")
        text = end=".".join([p.text.strip() for p in paragraphs])
        texts.append(text)

        return title, texts

    except Exception as e:
        print("Error:", e)
        return [], []

# Initialize lists to store data
all_titles = []
all_texts = []
url_link=[]
# Iterate over URLs and scrape data
for url in df_URL_LINK:
    url_link.append(url)
    titles, texts = scraping_from_url(url)
    all_titles.append(titles)
    all_texts.append(texts)






In [277]:
# For checking the number of rows

print(f"No. of URL_ID {len(URL_ID)}")
print(f"No. of URL {len(url_link)}")
print(f"No. of Titles {len(all_titles)}")
print(f"No. of Texts {len(all_texts)}")


No. of URL_ID 100
No. of URL 100
No. of Titles 100
No. of Texts 100


In [278]:
# Flattened the 'text' format
flattened_texts=[text for sublist in all_texts for text in sublist]

In [279]:
# A Data Frame contains the lists of URL_ID, URL, Title, and text
scraping_details=pd.DataFrame({'URL_ID':URL_ID,'URL':url_link, 'Title':all_titles,'Text':flattened_texts})
display(scraping_details)

Unnamed: 0,URL_ID,URL,Title,Text
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...,Rising IT cities and its impact on the economy...,Efficient Supply Chain Assessment: Overcoming ...
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...,Rising IT Cities and Their Impact on the Econo...,Efficient Supply Chain Assessment: Overcoming ...
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...,"Internet Demand's Evolution, Communication Imp...",Efficient Supply Chain Assessment: Overcoming ...
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...,Rise of Cybercrime and its Effect in upcoming ...,Efficient Supply Chain Assessment: Overcoming ...
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...,OTT platform and its impact on the entertainme...,Efficient Supply Chain Assessment: Overcoming ...
...,...,...,...,...
95,blackassign0096,https://insights.blackcoffer.com/what-is-the-r...,Due to the COVID-19 the repercussion of the en...,Efficient Supply Chain Assessment: Overcoming ...
96,blackassign0097,https://insights.blackcoffer.com/impact-of-cov...,Impact of COVID-19 pandemic on office space an...,Efficient Supply Chain Assessment: Overcoming ...
97,blackassign0098,https://insights.blackcoffer.com/contribution-...,Contribution of handicrafts (Visual Arts & Lit...,Efficient Supply Chain Assessment: Overcoming ...
98,blackassign0099,https://insights.blackcoffer.com/how-covid-19-...,How COVID-19 is impacting payment preferences?...,Efficient Supply Chain Assessment: Overcoming ...


In [239]:
!pip install syllapy

Collecting syllapy
  Downloading syllapy-0.7.2-py3-none-any.whl (24 kB)
Installing collected packages: syllapy
Successfully installed syllapy-0.7.2


In [280]:
# Import all essential libreries of nltk

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.corpus import cmudict
import string
import syllapy
#import re

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('cmudict')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Package cmudict is already up-to-date!


True

In [281]:
# For text looping (Getting only the rows of text column)
text_loop=scraping_details.iloc[:,3]

In [282]:

# Step 1: Text cleaning
def clean_text(text):
    stop_words=set(stopwords.words('english'))
    tokens=word_tokenize(text)
    cleaned_tokens=[word.lower() for word in tokens if word.isalnum() and word.lower() not in stop_words]
    return cleaned_tokens

# Step 2: Create Positive and Negative word dictionary from the MasterDictionary
def create_word_dic():
    positive_words=set()
    negative_words=set()

    positive_path='/content/drive/MyDrive/BlackCoffer/MasterDictionary-20240425T054227Z-001/MasterDictionary/positive-words.txt'
    negative_path='/content/drive/MyDrive/BlackCoffer/MasterDictionary-20240425T054227Z-001/MasterDictionary/negative-words.txt'

    with open(positive_path,'r',encoding='latin-1') as f:
        for word in f.readlines():
            positive_words.add(word.strip())

    with open(negative_path,'r',encoding='latin-1') as f:
        for word in f.readlines():
            negative_words.add(word.strip())

    return positive_words,negative_words

# Step 3: Calculation of positive and negative words
def calculate_score(cleaned_token,positive_words,negative_words):
    positive_score=sum(1 for word in cleaned_token if word in positive_words)
    negative_score=sum(1 for word in cleaned_token if word in negative_words)
    return round(positive_score,2),round(negative_score,2)

# Step 4: Popularity Score
def popularity_score(positive_score,negative_score):
    pop_score=(positive_score-negative_score)/((positive_score+negative_score)+0.000001)
    return round(pop_score,2)

# Step 5: Subjectivity Score
def subjectivity_score(cleaned_token,positive_score,negative_score):
    sub_score=(positive_score+negative_score)/((len(cleaned_token))+0.000001)
    return round(sub_score,2)

# Step 6: Average Sentence Length
def average_sentence_length(text):
    sentences=sent_tokenize(text)
    total_words=sum(len(word_tokenize(sentence)) for sentence in sentences)
    total_sentences=len(sentences)
    return round(total_words/(total_sentences + 0.000001),2)

# Step 7: Remove punctuation
def remove_punctuation(text):
    translator=str.maketrans('', '', string.punctuation)
    text=text.translate(translator)
    return text

# Step 8: Word count
def word_count(cleaned_token):
    return len(cleaned_token)

# Step 9: Syllable Count Per Word
def syllable_count(cleaned_token):
    syllable_counts=[syllapy.count(word) for word in cleaned_token]
    return round(sum(syllable_counts)/len(cleaned_token),2)

# Step 10: Percentage of Complex words
def percentage_of_complex_words(cleaned_token):
    syllable_counts=[syllapy.count(word) for word in cleaned_token if syllapy.count(word) > 2]
    percentage=(len(syllable_counts)/len(cleaned_token))*100
    return round(percentage,2)

# Step 11: Fog index
def fog_index(avg_sentence_length, percentage_complex):
    return round(0.4*(avg_sentence_length+percentage_complex),2)

# Step 12: Average number of words per sentence
def avg_word_sentence(text):
    sentences=sent_tokenize(text)
    total_words=sum(len(word_tokenize(sentence)) for sentence in sentences)
    total_sentences=len(sentences)
    return round(total_words/total_sentences,2)

# Step 13: Complex word count
def complex_word_count(cleaned_token):
    complex_words=[word for word in cleaned_token if syllapy.count(word) > 2]
    return len(complex_words)

# Step 14: PERSONAL PRONOUNS
def count_personal_pronouns(text):
    personal_pronouns=['I','we','my','ours','us']
    tokens=word_tokenize(text)
    tokens_excluding_US=[token for token in tokens if token!='US']
    tokens_lower=[token.lower() for token in tokens_excluding_US]
    count=sum(1 for token in tokens_lower if token in personal_pronouns)
    return count

# Step 15: AVG WORD LENGTH
def avg_word_length(cleaned_token):
    total_char=sum(len(word) for word in cleaned_token)
    return round(total_char/(len(cleaned_token) + 0.000001),2)



# Lists to store results
positive_scores=[]
negative_scores=[]
polarity_scores=[]
subjectivity_scores=[]
avg_sentence_lengths=[]
percentage_of_complex_words_list=[]
fog_indices=[]
avg_number_of_words_per_sentence=[]
complex_word_counts=[]
word_counts=[]
syllable_per_words=[]
personal_pronouns_counts=[]
avg_word_lengths=[]

for text in text_loop:
    cleaned_token=clean_text(text)
    positive_words,negative_words=create_word_dic()

    pos_score,neg_score=calculate_score(cleaned_token,positive_words,negative_words)
    positive_scores.append(pos_score)
    negative_scores.append(neg_score)

    pop_score=popularity_score(pos_score,neg_score)
    polarity_scores.append(pop_score)

    sub_score=subjectivity_score(cleaned_token,pos_score,neg_score)
    subjectivity_scores.append(sub_score)

    avg_sent_length=average_sentence_length(text)
    avg_sentence_lengths.append(avg_sent_length)

    percentage_complex=percentage_of_complex_words(cleaned_token)
    percentage_of_complex_words_list.append(percentage_complex)

    fog=fog_index(avg_sent_length, percentage_complex)
    fog_indices.append(fog)

    avg_word_per_sent=avg_word_sentence(text)
    avg_number_of_words_per_sentence.append(avg_word_per_sent)

    complex_word_count_val=complex_word_count(cleaned_token)
    complex_word_counts.append(complex_word_count_val)

    word_count_val=word_count(cleaned_token)
    word_counts.append(word_count_val)

    syllable_count_val=syllable_count(cleaned_token)
    syllable_per_words.append(syllable_count_val)

    personal_pronouns_count=count_personal_pronouns(text)
    personal_pronouns_counts.append(personal_pronouns_count)

    avg_word_length_val=avg_word_length(cleaned_token)
    avg_word_lengths.append(avg_word_length_val)


In [283]:

# Final report

final_report=pd.DataFrame({'URL_ID':URL_ID,'URL':url_link,'POSITIVE SCORE':positive_scores,'NEGATIVE SCORE':negative_scores,'POLARITY SCORE':polarity_scores,'SUBJECTIVITY SCORE':subjectivity_scores,'AVG SENTENCE LENGTH':avg_sentence_lengths,'PERCENTAGE OF COMPLEX WORDS':percentage_of_complex_words_list,'FOG INDEX':fog_indices,'AVG NUMBER OF WORDS PER SENTENCE':avg_number_of_words_per_sentence,'COMPLEX WORD COUNT':complex_word_counts,'WORD COUNT':word_counts,'SYLLABLE PER WORD':syllable_per_words,'PERSONAL PRONOUNS':personal_pronouns_counts,'AVG WORD LENGTH':avg_word_lengths})
display(final_report)




Unnamed: 0,URL_ID,URL,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,AVG NUMBER OF WORDS PER SENTENCE,COMPLEX WORD COUNT,WORD COUNT,SYLLABLE PER WORD,PERSONAL PRONOUNS,AVG WORD LENGTH
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...,13,1,0.86,0.05,25.35,33.68,23.61,25.35,96,285,2.22,4,6.65
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...,66,31,0.36,0.10,28.56,40.95,27.80,28.56,387,945,2.40,6,7.17
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...,45,24,0.30,0.09,30.15,50.86,32.40,30.15,384,755,2.69,15,7.89
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...,44,75,-0.26,0.15,36.41,47.28,33.48,36.41,365,772,2.55,7,7.70
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...,27,8,0.54,0.07,37.31,39.14,30.58,37.31,200,511,2.32,6,7.15
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,blackassign0096,https://insights.blackcoffer.com/what-is-the-r...,35,57,-0.24,0.12,30.91,39.78,28.28,30.91,294,739,2.31,3,7.02
96,blackassign0097,https://insights.blackcoffer.com/impact-of-cov...,37,35,0.03,0.12,52.30,34.03,34.53,52.30,213,626,2.13,8,6.58
97,blackassign0098,https://insights.blackcoffer.com/contribution-...,6,0,1.00,0.03,94.00,41.26,54.10,94.00,85,206,2.33,1,6.95
98,blackassign0099,https://insights.blackcoffer.com/how-covid-19-...,28,3,0.81,0.08,40.75,29.80,28.22,40.75,121,406,2.09,3,6.45


In [284]:
# Saving as a csv file

final_report.to_csv('/content/drive/MyDrive/BlackCoffer/Final_report/Final.csv')