<a href="https://colab.research.google.com/github/Pratikshathorat96/Data-Extraction-and-NLP/blob/main/Dataextracton%26NLP_Pratiksha.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Extraction and NLP for **Blackcoffer**

## Project By - Pratiksha Thorat

**Output link to Google Drive** - https://docs.google.com/spreadsheets/d/1ikjquITX32sNksW8R6tx30tBCdFz0fGK/edit?usp=sharing&ouid=113051524455290143470&rtpof=true&sd=true

## Objective

The objective of this document is to explain the methodology adopted to perform text analysis and derive sentiment opinion, sentiment scores, readability, passive words, personal pronouns, and related variables.

## Summary

The project involves extracting textual data from a set of URLs provided in an input Excel file and performing text analysis to compute various variables. The objective is to extract article text from the URLs, clean the text, and analyze it to calculate sentiment scores, readability metrics, and other textual variables.

Here's an overview of the key components and steps involved:

1. Data Extraction: Python code is developed to extract article text from the URLs provided in the input Excel file. Libraries such as requests and BeautifulSoup are used for web scraping. The extracted text is cleaned to remove unnecessary elements such as HTML tags, punctuation, and stopwords.

2. Text Analysis: The cleaned text is subjected to text analysis to compute various variables. This includes sentiment analysis to determine positive and negative scores, polarity score, and subjectivity score using the TextBlob library. Readability metrics such as average sentence length, percentage of complex words, and Fog index are calculated. Additionally, the count of personal pronouns and the average word length are computed.

3. Output Generation: The computed variables are structured according to the required output format specified in an output Excel file. The output includes URL IDs, URLs, sentiment scores, readability metrics, and other textual variables.

4. Code Implementation: Python code is developed to automate the entire process. NLTK (Natural Language Toolkit) handles text preprocessing tasks such as tokenization, stop word removal, and stemming, while syllables are approximated with a simple vowel count. The output is generated as a new Excel file containing the computed variables.

Overall, the project involves a combination of web scraping, text preprocessing, sentiment analysis, and readability analysis techniques to extract and analyze textual data from URLs and generate structured output for further analysis or reporting. The code is designed to be modular and scalable, allowing for easy adaptation to different sets of URLs and analytical requirements.

## Let's Begin

In [None]:
!pip install requests beautifulsoup4 nltk textblob pandas openpyxl  # Install required libraries (openpyxl is needed for Excel I/O)




In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
from textblob import TextBlob

In [None]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
# Function to extract article text from URL
def extract_text_from_url(url):
    try:
        # A timeout keeps one unresponsive site from stalling the whole run
        response = requests.get(url, timeout=30)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Assuming article text is enclosed in <p> tags
        article_text = ' '.join(p.get_text() for p in soup.find_all('p'))
        return article_text
    except requests.RequestException as e:
        print(f"Error extracting text from URL: {url}")
        print(e)
        return None
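
The function above assumes the article body is whatever sits inside `<p>` tags. The sketch below illustrates what that heuristic collects on a small, made-up HTML page. It uses Python's built-in `html.parser` instead of BeautifulSoup so that it runs without any installs; `ParagraphCollector` and `paragraphs_from_html` are hypothetical names, not part of this notebook.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0        # > 0 while inside at least one <p>
        self._current = []     # text fragments of the paragraph being read
        self.paragraphs = []   # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._current).strip()
                if text:
                    self.paragraphs.append(text)
                self._current = []

    def handle_data(self, data):
        # Only keep text that appears inside a <p> element
        if self._depth > 0:
            self._current.append(data)


def paragraphs_from_html(html):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up page: one <p> in the navigation, two in the article, one in the footer
sample_html = """
<html><body>
  <nav><p>Home | About</p></nav>
  <article><p>First <b>article</b> paragraph.</p><p>Second one.</p></article>
  <footer><p>Copyright notice</p></footer>
</body></html>
"""
paragraphs = paragraphs_from_html(sample_html)
print(paragraphs)
```

On this sample the navigation and footer paragraphs are collected along with the article text, so on real pages it may be worth narrowing the search to the element that wraps the article before collecting paragraphs.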

In [None]:
# Function to clean text
def clean_text(text):
    # Tokenize text
    tokens = word_tokenize(text)
    # Keep alphabetic tokens only (drops punctuation and numbers) and lowercase them
    tokens = [word.lower() for word in tokens if word.isalpha()]
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]
    # Join tokens back into text
    cleaned_text = ' '.join(tokens)
    return cleaned_text

In [None]:
# Function to calculate readability metrics
def calculate_readability(text):
    # Average sentence length (words per sentence); 0.0 if no sentence is found
    sentences = sent_tokenize(text)
    num_sentences = len(sentences)
    words_per_sentence = len(word_tokenize(text)) / num_sentences if num_sentences else 0.0

    # Percentage of complex words (words with more than two syllables)
    words = [word for word in text.split() if word.isalpha()]
    complex_words = [word for word in words if count_syllables(word) > 2]
    percentage_complex_words = (len(complex_words) / len(words)) * 100 if words else 0.0

    # Fog Index
    fog_index = 0.4 * (words_per_sentence + percentage_complex_words)

    return words_per_sentence, percentage_complex_words, fog_index
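
The Fog index used here is 0.4 * (average sentence length + percentage of complex words), so it depends on sentence boundaries being present in the input. The sketch below is a dependency-free illustration, not the notebook's implementation: `fog_on_text` and `count_vowel_letters` are hypothetical names, and a regular-expression split stands in for `sent_tokenize`. It runs the same passage with and without punctuation.

```python
import re


def count_vowel_letters(word):
    # Same vowel-letter approximation of syllables that this notebook uses
    return max(1, sum(1 for ch in word.lower() if ch in "aeiou"))


def fog_on_text(text):
    # Split after ., ! or ? followed by whitespace (simple stand-in for sent_tokenize)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0, 0.0, 0.0
    avg_sentence_length = len(words) / len(sentences)
    complex_words = [w for w in words if count_vowel_letters(w) > 2]
    pct_complex = 100 * len(complex_words) / len(words)
    fog = 0.4 * (avg_sentence_length + pct_complex)
    return avg_sentence_length, pct_complex, fog


raw = "The cat sat. It was a sunny day. Everyone enjoyed the beautiful weather."
stripped = re.sub(r"[^A-Za-z\s]", "", raw)  # same words, punctuation removed

raw_metrics = fog_on_text(raw)
stripped_metrics = fog_on_text(stripped)
print("with punctuation:   ", raw_metrics)
print("without punctuation:", stripped_metrics)
```

With punctuation the passage splits into three sentences; without it the whole passage counts as one sentence, so the average sentence length equals the word count and the Fog index rises accordingly.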

In [None]:
# Function to count syllables
# Approximation: counts vowel letters (a, e, i, o, u), so "beautiful" scores 5; minimum of 1
def count_syllables(word):
    return max(1, sum(1 for char in word if char.lower() in 'aeiou'))

# Function to count personal pronouns
# The match is case-sensitive: "US" (the country) is excluded, but so are "We" and "My"
def count_personal_pronouns(text):
    personal_pronouns = re.findall(r'\b(?:I|we|my|ours|us)\b', text)
    return len(personal_pronouns)
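
The two helpers above are deliberately simple. As a hedged illustration of how they might be refined, the sketch below groups consecutive vowels when estimating syllables and accepts capitalised pronouns while still skipping the all-caps "US". The names `count_syllables_grouped` and `count_personal_pronouns_cased` are hypothetical, and both functions remain heuristics rather than exact counters.

```python
import re


def count_syllables_grouped(word):
    word = word.lower()
    # Trailing "es"/"ed" are often silent, so drop them before counting
    if len(word) > 3 and word.endswith(("es", "ed")):
        word = word[:-2]
    # Count runs of consecutive vowels instead of individual vowel letters
    return max(1, len(re.findall(r"[aeiou]+", word)))


# "I" must be uppercase; the other pronouns may start with either case.
# "US" in all caps does not match, so the country name is not counted.
PRONOUN_PATTERN = re.compile(r"\b(?:I|[Ww]e|[Mm]y|[Oo]urs|[Uu]s)\b")


def count_personal_pronouns_cased(text):
    return len(PRONOUN_PATTERN.findall(text))


sample = "We think the US economy helps us. My view: I agree, i.e. ours wins."
pronoun_count = count_personal_pronouns_cased(sample)
syllable_counts = {w: count_syllables_grouped(w) for w in ["beautiful", "weather", "enjoyed", "the"]}
print(pronoun_count)
print(syllable_counts)
```

On the sample sentence the pattern matches "We", "us", "My", "I", and "ours", while "US" and the "i" in "i.e." are left out.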

In [None]:
# Function to calculate average word length
def calculate_avg_word_length(text):
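    # Note: word_tokenize also returns punctuation marks as tokens, and they are counted here
    words = word_tokenize(text)
    total_chars = sum(len(word) for word in words)
    avg_word_length = total_chars / len(words) if words else 0.0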
    return avg_word_length
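
Because `word_tokenize` emits punctuation marks as separate tokens, they are included in the average above. A minimal sketch of a letters-only variant follows, assuming whitespace-separated chunks are an acceptable definition of a word; `avg_word_length_letters` is a hypothetical name.

```python
def avg_word_length_letters(text):
    # Count only the letters in each whitespace-separated chunk
    letter_counts = [sum(ch.isalpha() for ch in chunk) for chunk in text.split()]
    # Drop chunks with no letters at all (pure punctuation or numbers)
    letter_counts = [n for n in letter_counts if n > 0]
    if not letter_counts:
        return 0.0
    return sum(letter_counts) / len(letter_counts)


example_length = avg_word_length_letters("Hello, world! It's fine.")
print(example_length)
```

In this example the four words contribute 5, 5, 3, and 4 letters, and the commas, exclamation mark, apostrophe, and full stop are ignored.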

In [None]:
# Function to perform sentiment analysis
def perform_sentiment_analysis(text):
    blob = TextBlob(text)
    sentiment = blob.sentiment
    # Positive/negative scores are counts of sentences with polarity above/below zero
    positive_score = sum(1 for sentence in blob.sentences if sentence.sentiment.polarity > 0)
    negative_score = sum(1 for sentence in blob.sentences if sentence.sentiment.polarity < 0)
    # Polarity (-1 to 1) and subjectivity (0 to 1) of the text as a whole
    polarity_score = sentiment.polarity
    subjectivity_score = sentiment.subjectivity
    return positive_score, negative_score, polarity_score, subjectivity_score

In [None]:
# Read input Excel file
input_df = pd.read_excel('/content/Input.xlsx')

In [None]:
# Iterate over each row in the input dataframe
output_data = []
for index, row in input_df.iterrows():
    url = row['URL']
    url_id = row['URL_ID']
    article_text = extract_text_from_url(url)
    if article_text:
        cleaned_text = clean_text(article_text)
        cleaned_words = cleaned_text.split()
        if not cleaned_words:
            continue  # nothing left after cleaning; skip to avoid dividing by zero
        # cleaned_text has no punctuation, so sent_tokenize and TextBlob see it as a single sentence
        words_per_sentence, percentage_complex_words, fog_index = calculate_readability(cleaned_text)
        complex_word_count = len([word for word in cleaned_words if count_syllables(word) > 2])
        word_count = len(cleaned_words)
        syllables_per_word = sum(count_syllables(word) for word in cleaned_words) / word_count
        # Pronouns and average word length are measured on the raw article text
        personal_pronouns_count = count_personal_pronouns(article_text)
        avg_word_length = calculate_avg_word_length(article_text)
        positive_score, negative_score, polarity_score, subjectivity_score = perform_sentiment_analysis(cleaned_text)

        # Append results to output data list
        output_data.append([url_id, url, positive_score, negative_score, polarity_score, subjectivity_score,
                            words_per_sentence, percentage_complex_words, fog_index, words_per_sentence,
                            complex_word_count, word_count, syllables_per_word,
                            personal_pronouns_count, avg_word_length])
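
The loop above skips any URL whose text could not be extracted, without recording which one it was. A minimal sketch of one way to keep track of those URLs follows. `build_rows`, `fetch`, and `analyze` are hypothetical names, and the stand-in functions replace the real scraping and analysis so the example runs offline.

```python
def build_rows(records, fetch, analyze):
    """Return (rows, failed_ids) for an iterable of {'URL_ID', 'URL'} records."""
    rows, failed_ids = [], []
    for record in records:
        text = fetch(record["URL"])
        if not text:
            # Remember which URL produced no text instead of dropping it silently
            failed_ids.append(record["URL_ID"])
            continue
        row = {"URL_ID": record["URL_ID"], "URL": record["URL"]}
        row.update(analyze(text))
        rows.append(row)
    return rows, failed_ids


# Stand-ins for extract_text_from_url and the analysis functions (assumptions)
fake_pages = {
    "https://example.com/a": "Some article text here",
    "https://example.com/b": None,
}
records = [
    {"URL_ID": "id_1", "URL": "https://example.com/a"},
    {"URL_ID": "id_2", "URL": "https://example.com/b"},
]
rows, failed_ids = build_rows(
    records,
    fetch=fake_pages.get,
    analyze=lambda text: {"WORD COUNT": len(text.split())},
)
print(rows)
print(failed_ids)
```

A list of dictionaries such as `rows` can be passed directly to `pd.DataFrame`, which takes the column names from the keys.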

In [None]:
# Create output dataframe
output_df = pd.DataFrame(output_data, columns=['URL_ID', 'URL', 'POSITIVE SCORE', 'NEGATIVE SCORE', 'POLARITY SCORE',
                                               'SUBJECTIVITY SCORE', 'AVG SENTENCE LENGTH', 'PERCENTAGE OF COMPLEX WORDS',
                                               'FOG INDEX', 'AVG NUMBER OF WORDS PER SENTENCE', 'COMPLEX WORD COUNT', 'WORD COUNT',
                                               'SYLLABLE PER WORD', 'PERSONAL PRONOUNS', 'AVG WORD LENGTH'])


In [None]:
output_df.info()  # Check column names, dtypes, and non-null counts in the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 15 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   URL_ID                            100 non-null    object 
 1   URL                               100 non-null    object 
 2   POSITIVE SCORE                    100 non-null    int64  
 3   NEGATIVE SCORE                    100 non-null    int64  
 4   POLARITY SCORE                    100 non-null    float64
 5   SUBJECTIVITY SCORE                100 non-null    float64
 6   AVG SENTENCE LENGTH               100 non-null    float64
 7   PERCENTAGE OF COMPLEX WORDS       100 non-null    float64
 8   FOG INDEX                         100 non-null    float64
 9   AVG NUMBER OF WORDS PER SENTENCE  100 non-null    float64
 10  COMPLEX WORD COUNT                100 non-null    int64  
 11  WORD COUNT                        100 non-null    int64  
 12  SYLLABLE 

In [None]:
output_df.describe()  # Summary statistics for the numeric columns

| | POSITIVE SCORE | NEGATIVE SCORE | POLARITY SCORE | SUBJECTIVITY SCORE | AVG SENTENCE LENGTH | PERCENTAGE OF COMPLEX WORDS | FOG INDEX | AVG NUMBER OF WORDS PER SENTENCE | COMPLEX WORD COUNT | WORD COUNT | SYLLABLE PER WORD | PERSONAL PRONOUNS | AVG WORD LENGTH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 0.87 | 0.13 | 0.072109 | 0.358864 | 704.6 | 26.469823 | 292.427929 | 704.6 | 184.73 | 704.6 | 2.171523 | 6.37 | 4.863712 |
| std | 0.337998 | 0.337998 | 0.077283 | 0.067278 | 309.509485 | 3.44468 | 123.579366 | 309.509485 | 84.25105 | 309.509485 | 0.076307 | 5.868345 | 0.29474 |
| min | 0.0 | 0.0 | -0.201276 | 0.160476 | 146.0 | 20.034101 | 70.728767 | 146.0 | 45.0 | 146.0 | 1.98433 | 1.0 | 4.273816 |
| 25% | 1.0 | 0.0 | 0.036571 | 0.328083 | 501.5 | 23.940183 | 210.948207 | 501.5 | 125.25 | 501.5 | 2.114487 | 2.0 | 4.630254 |
| 50% | 1.0 | 0.0 | 0.068836 | 0.361099 | 720.0 | 26.070335 | 299.139875 | 720.0 | 182.5 | 720.0 | 2.161977 | 5.0 | 4.849509 |
| 75% | 1.0 | 0.0 | 0.104471 | 0.392872 | 904.25 | 29.403897 | 371.609604 | 904.25 | 232.0 | 904.25 | 2.227163 | 8.0 | 5.047494 |
| max | 1.0 | 1.0 | 0.406108 | 0.622491 | 2052.0 | 35.920177 | 831.560234 | 2052.0 | 552.0 | 2052.0 | 2.398214 | 33.0 | 5.594937 |

In this summary, AVG SENTENCE LENGTH, AVG NUMBER OF WORDS PER SENTENCE, and WORD COUNT share identical statistics. The readability metrics are computed on the cleaned text, which no longer contains punctuation, so the sentence tokenizer finds a single sentence per article. The FOG INDEX is inflated by the same effect, and POSITIVE SCORE and NEGATIVE SCORE never exceed 1 because TextBlob also sees one sentence.


In [None]:
# Write output to Excel file
output_df.to_excel('/content/OutputDF_by_Pratiksh_Thorat.xlsx', index=False)

## Conclusion

In conclusion, the assignment achieved its objective of extracting textual data from the provided URLs and computing the required text analysis variables. Article text was scraped from each page with requests and BeautifulSoup.

Text preprocessing used NLTK for tokenization, stop word removal, and stemming. Sentiment analysis with TextBlob provided the polarity and subjectivity of the extracted text, and readability was assessed through average sentence length, percentage of complex words, and the Fog index.

The entire process is automated in Python, so the same notebook can be rerun on a different list of URLs. The output, structured according to the specified format, is written to an Excel file for further analysis or reporting.

Overall, the project combines web scraping, text preprocessing, sentiment analysis, and readability analysis to turn unstructured web text into structured data that can support informed, data-driven decisions.

# Thank you so much for reaching the end 😀