# Introduction
This Python notebook extracts article text from a list of URLs supplied in an Excel file ("Input.xlsx")
and computes a set of text-analysis variables for each article, covering sentiment, readability, and word-level statistics.
The code uses the pandas, requests, BeautifulSoup, TextBlob, and syllables libraries.

# Import Libraries 

In [5]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from textblob import TextBlob
import syllables


# Step 1: Read the Input Data

In [6]:
input_data = pd.read_excel("Input.xlsx")

# Note
The script begins by loading the input data from "Input.xlsx" with pandas.
The file is assumed to contain a `URL` column and a corresponding `URL_ID` column.
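
Because the rest of the notebook relies on that assumption, it can help to check the column names before doing any work. The following is a minimal sketch, not part of the original script; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration.

```python
REQUIRED_COLUMNS = ("URL_ID", "URL")  # column names used later in the notebook


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = {str(name).strip() for name in columns}
    return [name for name in required if name not in present]


# Example with plain lists; in the notebook, pass input_data.columns instead.
print(missing_columns(["URL_ID", "URL"]))   # nothing is missing
print(missing_columns(["URL_ID", "Link"]))  # "URL" is missing
```

In the notebook this could be called as `missing_columns(input_data.columns)` right after the file is read, raising an error if the returned list is not empty.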

# Initialize an empty list to store the output data



In [33]:
output_data = []

# Step 2: Data Extraction and Analysis

In [8]:
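# Read the ID and URL from each row. Note: the cells below run outside this loop,
# so as written they only see the values from the last row. To process every URL,
# the per-article steps must be placed inside the loop body.
for index, row in input_data.iterrows():
    url_id = row['URL_ID']
    url = row['URL']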
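
Since the notebook splits the per-article steps into separate cells, it may be useful to see how they could be gathered into a single loop. The sketch below is one possible arrangement under stated assumptions: `process_rows`, `fetch`, and `analyze` are hypothetical names, and the stand-in functions replace the real HTTP request and TextBlob analysis so the example runs without network access.

```python
def process_rows(rows, fetch, analyze):
    """Apply fetch and analyze to every (url_id, url) pair and collect the results.

    `fetch` takes a URL and returns the article text, or None if the page
    could not be retrieved. `analyze` takes the text and returns a dict of
    metrics. Both are placeholders for the steps shown in the later cells.
    """
    output_data = []
    for url_id, url in rows:
        text = fetch(url)
        if not text:
            # Skip pages that failed to load or contained no text
            continue
        record = {"URL_ID": url_id, "URL": url}
        record.update(analyze(text))
        output_data.append(record)
    return output_data


# Stand-in functions so the sketch runs without network access
fake_pages = {"https://example.com/a": "one two three", "https://example.com/b": None}
results = process_rows(
    [(1, "https://example.com/a"), (2, "https://example.com/b")],
    fetch=fake_pages.get,
    analyze=lambda text: {"WORD COUNT": len(text.split())},
)
print(results)
```

With the real data, the rows could be supplied as `zip(input_data['URL_ID'], input_data['URL'])`.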


# Send an HTTP request to fetch the webpage content

In [13]:
import requests
from bs4 import BeautifulSoup

# A timeout prevents the request from hanging indefinitely on an unresponsive server
response = requests.get(url, timeout=10)
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

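# Extract article title and text

In [16]:
# Some pages have no <title> tag, in which case soup.find returns None
title_tag = soup.find('title')
title = title_tag.get_text(strip=True) if title_tag else ""
article_text = " ".join(p.get_text() for p in soup.find_all('p'))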


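# Perform NLP analysis using TextBlob


In [18]:
blob = TextBlob(article_text)
polarity_score = blob.sentiment.polarity          # ranges from -1.0 (negative) to 1.0 (positive)
subjectivity_score = blob.sentiment.subjectivity  # ranges from 0.0 (objective) to 1.0 (subjective)
# TextBlob does not report separate positive and negative scores. As an approximation,
# split the polarity by sign so that the negative score is no longer a copy of subjectivity.
positive_score = max(polarity_score, 0.0)
negative_score = max(-polarity_score, 0.0)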


# Calculate other variables as needed

In [20]:
sentences = blob.sentences
avg_sentence_length = sum(len(sentence.split()) for sentence in sentences) / len(sentences)



**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\pant/nltk_data'
    - 'C:\\Users\\pant\\anaconda3\\nltk_data'
    - 'C:\\Users\\pant\\anaconda3\\share\\nltk_data'
    - 'C:\\Users\\pant\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\pant\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************



MissingCorpusError: 
Looks like you are missing some required data for this feature.

To download the necessary data, simply run

    python -m textblob.download_corpora

or use the NLTK downloader to download the missing data: http://nltk.org/data.html
If this doesn't fix the problem, file an issue at https://github.com/sloria/TextBlob/issues.


In [24]:
import nltk
nltk.download('punkt')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\pant\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Calculate other variables as needed

In [25]:
sentences = blob.sentences
avg_sentence_length = sum(len(sentence.split()) for sentence in sentences) / len(sentences)


# Count complex words (words with more than 2 syllables)

In [26]:
complex_word_count = sum(1 for word in blob.words if syllables.estimate(word) > 2)


# Count total words and syllables

In [27]:
word_count = len(blob.words)
syllable_per_word = sum(syllables.estimate(word) for word in blob.words) / word_count
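
The complex-word count and the syllables-per-word average both depend on the same per-word syllable estimates, and the division fails when an article yields no words. A minimal sketch of one way to compute them together is shown below; `word_metrics` is a hypothetical helper, and the hand-written syllable counts stand in for a real estimator such as `syllables.estimate`.

```python
def word_metrics(words, count_syllables, threshold=2):
    """Return complex-word count, complex-word percentage, and syllables per word.

    `count_syllables` is any callable mapping a word to a syllable count.
    Each word is passed to it exactly once.
    """
    counts = [count_syllables(word) for word in words]
    total = len(counts)
    if total == 0:
        # Avoid dividing by zero when the page contained no text
        return {"complex_word_count": 0, "complex_word_pct": 0.0, "syllable_per_word": 0.0}
    complex_count = sum(1 for c in counts if c > threshold)
    return {
        "complex_word_count": complex_count,
        "complex_word_pct": 100 * complex_count / total,
        "syllable_per_word": sum(counts) / total,
    }


# Hand-written syllable counts stand in for a real estimator
known = {"the": 1, "analysis": 4, "is": 1, "readable": 3}
print(word_metrics(list(known), known.get))
```

In the notebook, the call would look like `word_metrics(blob.words, syllables.estimate)`.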


# Count personal pronouns (you can customize this list)

In [28]:
personal_pronouns = sum(1 for word in blob.words if word.lower() in ["i", "me", "my", "mine", "myself"])
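
As an example of customizing the list, the sketch below also counts first-person plural pronouns and skips the all-caps abbreviation "US", which would otherwise be mistaken for the pronoun "us". The pronoun list and the names `PRONOUN_PATTERN` and `count_personal_pronouns` are illustrative assumptions, not part of the original script.

```python
import re

# Illustrative pronoun list; extend or trim it to match your own definition.
PRONOUN_PATTERN = re.compile(
    r"\b(i|me|my|mine|myself|we|us|our|ours|ourselves)\b", re.IGNORECASE
)


def count_personal_pronouns(text):
    """Count pronoun matches, ignoring the all-caps abbreviation 'US'."""
    return sum(1 for m in PRONOUN_PATTERN.finditer(text) if m.group(0) != "US")


sample = "We shipped our report to the US office, and I kept my copy."
print(count_personal_pronouns(sample))  # counts "We", "our", "I", "my"
```

The function works on raw text, so in the notebook it could be called as `count_personal_pronouns(article_text)`.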


# Calculate Fog Index

In [29]:
# Gunning Fog Index = 0.4 * (average sentence length + percentage of complex words)
fog_index = 0.4 * (avg_sentence_length + 100 * (complex_word_count / word_count))
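
For reference, the standard Gunning Fog formula is 0.4 × (words per sentence + 100 × complex words / words). The short sketch below works through it with made-up numbers; `gunning_fog` is a hypothetical helper name, and returning 0.0 for empty input is an assumption made here for convenience.

```python
def gunning_fog(word_count, sentence_count, complex_word_count):
    """Gunning Fog index: 0.4 * (words per sentence + percent complex words)."""
    if word_count == 0 or sentence_count == 0:
        # Assumed convention for empty input; the formula itself is undefined here
        return 0.0
    avg_sentence_length = word_count / sentence_count
    complex_pct = 100 * complex_word_count / word_count
    return 0.4 * (avg_sentence_length + complex_pct)


# 100 words in 5 sentences with 10 complex words:
# 0.4 * (100 / 5 + 100 * 10 / 100) = 0.4 * (20 + 10)
print(gunning_fog(100, 5, 10))
```

The score is usually read as the approximate number of years of schooling needed to follow the text on a first reading.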


# Calculate average number of words per sentence

In [30]:
avg_words_per_sentence = word_count / len(sentences)

# Calculate average word length

In [31]:
avg_word_length = sum(len(word) for word in blob.words) / word_count


In [34]:
output_data.append({
    'URL_ID': url_id,
    'URL': url,
    'POSITIVE SCORE': positive_score,
    'NEGATIVE SCORE': negative_score,
    'POLARITY SCORE': polarity_score,
    'SUBJECTIVITY SCORE': subjectivity_score,
    'AVG SENTENCE LENGTH': avg_sentence_length,
    'PERCENTAGE OF COMPLEX WORDS': (complex_word_count / word_count) * 100,
    'FOG INDEX': fog_index,
    'AVG NUMBER OF WORDS PER SENTENCE': avg_words_per_sentence,
    'COMPLEX WORD COUNT': complex_word_count,
    'WORD COUNT': word_count,
    'SYLLABLE PER WORD': syllable_per_word,
    'PERSONAL PRONOUNS': personal_pronouns,
    'AVG WORD LENGTH': avg_word_length,
})

# Step 3: Create and save the output data

In [35]:
output_df = pd.DataFrame(output_data)

# Save the output data to an Excel file

In [36]:
output_df.to_excel("Output_Data.xlsx", index=False)

# Conclusion
This project shows how Python and a handful of libraries (pandas, requests, BeautifulSoup,
TextBlob, and syllables) can be combined to extract and analyze textual data from web articles.
The resulting metrics summarize the sentiment and readability of each article, which makes the
script a useful starting point for content analysis and web scraping tasks.

The project can be extended with additional features or deeper text analysis,
depending on specific needs and requirements.