
# Objective
* The objective of this assignment is to extract the article text from each given URL and perform text analysis to compute the variables explained below.


# Text Extraction
## Approach

### Part 1: Text Extraction

**Fetch Article Content:**
   - Read URLs from an Excel file.
   - Extract article text and title from each URL using BeautifulSoup.
   - Save each article's text into a file named with its `URL_ID`.



## How to Run the .ipynb File to Generate Output

### Ensure Required Libraries are Installed
You can install them using pip:
```bash
pip install pandas openpyxl requests beautifulsoup4 nltk
```


In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import os

input_file = 'input.xlsx'
df = pd.read_excel(input_file)

output_dir = 'articles'
os.makedirs(output_dir, exist_ok=True)

def extract_article_content(url):
    try:
        response = requests.get(url, timeout=30)  # Time out rather than hang on a slow server
        response.raise_for_status()  # Raise on HTTP errors such as 404
        soup = BeautifulSoup(response.content, 'html.parser')

        # Assume the title is in the first <h1> and the body is in <article>
        title_tag = soup.find('h1')
        article_body = soup.find('article')

        if not article_body:
            # Fall back to common content containers if <article> is absent
            article_body = soup.find('div', class_='article-content') or soup.find('div', class_='post-content')

        if title_tag is None or article_body is None:
            raise ValueError("could not locate title or article body")

        title = title_tag.get_text(strip=True)
        paragraphs = article_body.find_all('p')
        article_text = '\n'.join([para.get_text(strip=True) for para in paragraphs])

        return title, article_text
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None, None

# Iterate through each URL and save the article content
for index, row in df.iterrows():
    url_id = row['URL_ID']
    url = row['URL']
    title, content = extract_article_content(url)

    if title and content:
        output_file_path = os.path.join(output_dir, f"{url_id}.txt")
        with open(output_file_path, 'w', encoding='utf-8') as file:
            file.write(title + "\n\n" + content)

print("Articles have been successfully extracted and saved.")

Error fetching https://insights.blackcoffer.com/how-neural-networks-can-be-applied-in-various-areas-in-the-future/: 404 Client Error: Not Found for url: https://insights.blackcoffer.com/how-neural-networks-can-be-applied-in-various-areas-in-the-future/
Error fetching https://insights.blackcoffer.com/covid-19-environmental-impact-for-the-future/: 404 Client Error: Not Found for url: https://insights.blackcoffer.com/covid-19-environmental-impact-for-the-future/
Articles have been successfully extracted and saved.


### Two articles returned a 404 error. Let's retry those specific URLs.

In [None]:
missing_url_ids = []
for index, row in df.iterrows():
    url_id = row['URL_ID']
    output_file_path = os.path.join(output_dir, f"{url_id}.txt")
    if not os.path.isfile(output_file_path):
        missing_url_ids.append(url_id)

print(f"Retrying extraction for {len(missing_url_ids)} missing articles.")

# Retry extraction for missing URL_IDs
for url_id in missing_url_ids:
    url = df.loc[df['URL_ID'] == url_id, 'URL'].values[0]
    title, content = extract_article_content(url)

    if title and content:
        output_file_path = os.path.join(output_dir, f"{url_id}.txt")
        with open(output_file_path, 'w', encoding='utf-8') as file:
            file.write(title + "\n\n" + content)
    else:
        print(f"Failed to retrieve content for URL_ID {url_id}")


print("Retry process completed.")

Retrying extraction for 2 missing articles.
Error fetching https://insights.blackcoffer.com/how-neural-networks-can-be-applied-in-various-areas-in-the-future/: 404 Client Error: Not Found for url: https://insights.blackcoffer.com/how-neural-networks-can-be-applied-in-various-areas-in-the-future/
Failed to retrieve content for URL_ID blackassign0036
Error fetching https://insights.blackcoffer.com/covid-19-environmental-impact-for-the-future/: 404 Client Error: Not Found for url: https://insights.blackcoffer.com/covid-19-environmental-impact-for-the-future/
Failed to retrieve content for URL_ID blackassign0049
Retry process completed.


### Of the 100 articles listed in input.xlsx, URL_IDs blackassign0036 and blackassign0049 return a 404 error, which indicates that the requested page is not available on the server. This can happen when the page has been removed or the URL is incorrect.
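
Because a 404 means the page is not on the server, repeating the request is unlikely to help, whereas timeouts and 5xx responses are often temporary. The sketch below shows one way to separate the two cases before retrying. The set of retryable status codes and the helper names `is_retryable` and `split_failures` are illustrative assumptions, not part of the original notebook.

```python
from http import HTTPStatus

# Status codes that usually signal a temporary condition worth retrying.
# This selection is an assumption made for illustration.
RETRYABLE_STATUSES = {
    HTTPStatus.REQUEST_TIMEOUT,        # 408
    HTTPStatus.TOO_MANY_REQUESTS,      # 429
    HTTPStatus.INTERNAL_SERVER_ERROR,  # 500
    HTTPStatus.BAD_GATEWAY,            # 502
    HTTPStatus.SERVICE_UNAVAILABLE,    # 503
    HTTPStatus.GATEWAY_TIMEOUT,        # 504
}


def is_retryable(status_code):
    """Return True if a request that failed with this status might succeed later."""
    return status_code in RETRYABLE_STATUSES


def split_failures(failures):
    """Split {url_id: status_code} into (retry_later, treat_as_missing) lists."""
    retry_later = sorted(uid for uid, code in failures.items() if is_retryable(code))
    missing = sorted(uid for uid, code in failures.items() if not is_retryable(code))
    return retry_later, missing


# The two failures reported above both returned 404.
failures = {"blackassign0036": 404, "blackassign0049": 404}
retry_later, missing = split_failures(failures)
print("Retry later:", retry_later)
print("Treat as missing:", missing)
```

With the two failures seen here, both IDs land in the "treat as missing" list, which matches the outcome of the retry cell.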

# Text Analysis Instructions

## Text Analysis Process

For each file in the `articles` directory, perform the following text analysis:

1. **Sentiment Analysis:**
   - Exclude words from the "StopWords" directory during sentiment analysis.
   - Use the "MasterDictionary" directory containing "positive-words.txt" and "negative-words.txt".
   - Convert the text into tokens using the NLTK tokenize module.
   - Calculate the following variables:
     - **Positive Score:** Assign +1 for each word found in the Positive Dictionary and sum all values.
     - **Negative Score:** Assign -1 for each word found in the Negative Dictionary, sum all values, and multiply the total by -1 so that the score is a positive number.
     - **Polarity Score:** Calculate as (Positive Score – Negative Score)/((Positive Score + Negative Score) + 0.000001). Range is from -1 to +1.
     - **Subjectivity Score:** Calculate as (Positive Score + Negative Score)/((Total Words after cleaning) + 0.000001). Range is from 0 to +1.

2. **Readability Analysis:**
   - Use the Gunning Fog index formula:
     - **Average Sentence Length:** Calculate as the number of words / the number of sentences.
     - **Percentage of Complex Words:** Calculate as the number of complex words / the number of words.
     - **Fog Index:** Calculate as 0.4 * (Average Sentence Length + Percentage of Complex words).

3. **Additional Metrics:**
   - **Average Number of Words Per Sentence:** Calculate as the total number of words / the total number of sentences.
   - **Complex Word Count:** Count words in the text containing more than two syllables.
   - **Word Count:** Count the total cleaned words in the text after removing stop words (using NLTK stopwords) and punctuation such as ? ! , . from each word.
   - **Syllable Count Per Word:** Count the number of syllables in each word of the text by counting the vowels present in each word, handling exceptions like words ending with "es" or "ed".
   - **Personal Pronouns:** Use regex to find counts of words such as “I,” “we,” “my,” “ours,” and “us”, excluding the country name "US".
   - **Average Word Length:** Calculate the average length of words.
   
4. **Tokenization and Syllable Counting:**
   - Use the `nltk` library to tokenize the text into words and sentences.
   - Count syllables in each word.

5. **Merge and Save Results:**
   - Merge the original DataFrame with the results DataFrame on the `URL_ID` column.
   - Save the combined DataFrame to an Excel file named `output.xlsx`.
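
The sentiment formulas in step 1 can be checked by hand on a small input. The sketch below uses made-up dictionaries and tokens (not the contents of the MasterDictionary files), and `sentiment_scores` is a hypothetical helper name chosen for this example.

```python
# Toy dictionaries and tokens, invented purely for illustration.
positive_words = {"good", "growth", "improve"}
negative_words = {"risk", "loss"}
tokens = ["growth", "good", "risk", "market", "improve", "data", "loss", "growth"]


def sentiment_scores(tokens, positive_words, negative_words):
    """Apply the four sentiment formulas from step 1 to a list of cleaned tokens."""
    positive = sum(1 for t in tokens if t in positive_words)
    # Each negative word contributes -1; multiply by -1 to report a positive number.
    negative = sum(-1 for t in tokens if t in negative_words) * -1
    polarity = (positive - negative) / ((positive + negative) + 0.000001)
    subjectivity = (positive + negative) / (len(tokens) + 0.000001)
    return positive, negative, polarity, subjectivity


pos, neg, pol, subj = sentiment_scores(tokens, positive_words, negative_words)
print(pos, neg, round(pol, 3), round(subj, 3))  # 4 2 0.333 0.75
```

Four of the eight tokens are positive and two are negative, so polarity is (4 - 2) / 6 and subjectivity is 6 / 8. The small constant in each denominator only guards against division by zero.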

## How to Run the Text Analysis

1. **Ensure Required Libraries are Installed:**
   You can install them using pip:
   ```bash
   pip install pandas nltk
   ```


In [None]:
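import os
import re

import nltk
import pandas as pd
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

# Download the tokenizer and stop-word data on first run
# ('punkt_tab' is needed by newer NLTK releases, 'punkt' by older ones)
for resource in ('punkt', 'punkt_tab', 'stopwords'):
    nltk.download(resource, quiet=True)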

stopwords_dir = 'StopWords'
master_dict_dir = 'MasterDictionary'

# Load positive and negative words
with open(os.path.join(master_dict_dir, 'positive-words.txt'), 'r') as f:
    positive_words = set(f.read().split())
with open(os.path.join(master_dict_dir, 'negative-words.txt'), 'r') as f:
    negative_words = set(f.read().split())

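# Load all stopwords (lowercased, because clean_tokenize lowercases the tokens it compares against them)
stop_words = set(stopwords.words('english'))
for file in os.listdir(stopwords_dir):
    with open(os.path.join(stopwords_dir, file), 'r') as f:
        stop_words.update(word.lower() for word in f.read().split())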

# Function to clean and tokenize text
def clean_tokenize(text):
    # Remove punctuation and tokenize
    text = re.sub(r'[^\w\s]', '', text)
    tokens = word_tokenize(text.lower())
    # Remove stopwords
    cleaned_tokens = [word for word in tokens if word not in stop_words]
    return cleaned_tokens

# Function to calculate sentiment scores
def sentiment_analysis(tokens):
    positive_score = sum(1 for word in tokens if word in positive_words)
    negative_score = sum(-1 for word in tokens if word in negative_words) * -1
    polarity_score = (positive_score - negative_score) / ((positive_score + negative_score) + 0.000001)
    subjectivity_score = (positive_score + negative_score) / (len(tokens) + 0.000001)
    return positive_score, negative_score, polarity_score, subjectivity_score

# Function to calculate readability scores
def readability_analysis(text):
    sentences = sent_tokenize(text)
    words = word_tokenize(text)
    avg_sentence_length = len(words) / len(sentences)
    complex_words_count = sum(1 for word in words if count_syllables(word) > 2)
    percent_complex_words = complex_words_count / len(words)
    fog_index = 0.4 * (avg_sentence_length + percent_complex_words)
    avg_words_per_sentence = len(words) / len(sentences)
    return avg_sentence_length, percent_complex_words, fog_index, avg_words_per_sentence, complex_words_count

# Function to count syllables in a word
def count_syllables(word):
    word = word.lower()
    syllable_count = len(re.findall(r'[aeiouy]+', word))
    if word.endswith('es') or word.endswith('ed'):
        syllable_count = max(1, syllable_count - 1)
    return syllable_count

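# Function to count personal pronouns
def count_personal_pronouns(text):
    # Match case-insensitively, then drop the all-caps "US" so the country name is not counted
    pronouns = re.findall(r'\b(?:I|we|my|ours|us)\b', text, re.I)
    return sum(1 for pronoun in pronouns if pronoun != 'US')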

# Function to analyze each file in the articles directory
def analyze_files(directory):
    results = []
    for file_name in os.listdir(directory):
        if file_name.endswith('.txt'):
            file_path = os.path.join(directory, file_name)
            with open(file_path, 'r', encoding='utf-8') as file:
                text = file.read()
            tokens = clean_tokenize(text)
            if len(tokens) < 50:
                # Skip files with fewer than 50 cleaned tokens
                continue

            positive_score, negative_score, polarity_score, subjectivity_score = sentiment_analysis(tokens)
            avg_sentence_length, percent_complex_words, fog_index, avg_words_per_sentence, complex_words_count = readability_analysis(text)
            total_words = len(tokens)
            syllable_counts = [count_syllables(word) for word in tokens]
            avg_syllables_per_word = sum(syllable_counts) / total_words
            personal_pronouns_count = count_personal_pronouns(text)
            avg_word_length = sum(len(word) for word in tokens) / total_words

            results.append({
                'URL_ID': file_name.replace('.txt', ''),
                'Positive Score': positive_score,
                'Negative Score': negative_score,
                'Polarity Score': polarity_score,
                'Subjectivity Score': subjectivity_score,
                'Average Sentence Length': avg_sentence_length,
                'Percentage of Complex Words': percent_complex_words,
                'Fog Index': fog_index,
                'Average Words Per Sentence': avg_words_per_sentence,
                'Complex Word Count': complex_words_count,
                'Word Count': total_words,
                'Syllable Count Per Word': avg_syllables_per_word,
                'Personal Pronouns Count': personal_pronouns_count,
                'Average Word Length': avg_word_length
            })
    return results

# Directory where the articles are saved
output_dir = 'articles'

# Perform analysis on each file
results = analyze_files(output_dir)

# Convert results to DataFrame for tabular output
results_df = pd.DataFrame(results)
combined_df = df.merge(results_df, on='URL_ID')

# Write the combined DataFrame to an Excel file
output_file = 'output.xlsx'
combined_df.to_excel(output_file, index=False)

print("Analysis and concatenation completed. Results saved to output.xlsx.")


Analysis and concatenation completed. Results saved to output.xlsx.
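
As a quick sanity check of the readability formulas, the sketch below applies the same syllable heuristic to a two-sentence sample. The sample text, the `fog_index` helper, and the regex-based sentence and word splitting are simplifying assumptions. The notebook itself uses NLTK's tokenizers, so its numbers on real articles will differ somewhat (for example, `word_tokenize` emits punctuation marks as separate tokens).

```python
import re


def count_syllables(word):
    """Count vowel groups, discounting a trailing 'es' or 'ed' (same heuristic as the notebook)."""
    word = word.lower()
    count = len(re.findall(r'[aeiouy]+', word))
    if word.endswith('es') or word.endswith('ed'):
        count = max(1, count - 1)
    return count


def fog_index(text):
    """Return (fog index, average sentence length, complex word count) for the text."""
    # Simplified tokenization (assumption): sentences end at . ! ?, words are runs of letters.
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r'[A-Za-z]+', text)
    avg_sentence_length = len(words) / len(sentences)
    complex_count = sum(1 for w in words if count_syllables(w) > 2)
    pct_complex = complex_count / len(words)
    fog = 0.4 * (avg_sentence_length + pct_complex)
    return fog, avg_sentence_length, complex_count


sample = "The company reported remarkable growth. Analysts expected weaker results."
fog, avg_len, n_complex = fog_index(sample)
print(avg_len, n_complex, round(fog, 3))  # 4.5 3 1.933
```

Here "company", "remarkable", and "Analysts" count as complex, while "reported" and "expected" drop to two syllables because of the "ed" rule.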


## Output

In [None]:
combined_df.head()

| | URL_ID | URL | Positive Score | Negative Score | Polarity Score | Subjectivity Score | Average Sentence Length | Percentage of Complex Words | Fog Index | Average Words Per Sentence | Complex Word Count | Word Count | Syllable Count Per Word | Personal Pronouns Count | Average Word Length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | blackassign0001 | https://insights.blackcoffer.com/rising-it-cit... | 6 | 1 | 0.714286 | 0.042169 | 15.76 | 0.126904 | 6.354761 | 15.76 | 50 | 166 | 2.168675 | 4 | 6.487952 |
| 1 | blackassign0002 | https://insights.blackcoffer.com/rising-it-cit... | 54 | 29 | 0.301205 | 0.109499 | 21.44 | 0.212687 | 8.661075 | 21.44 | 342 | 758 | 2.490765 | 6 | 7.544855 |
| 2 | blackassign0003 | https://insights.blackcoffer.com/internet-dema... | 38 | 24 | 0.225806 | 0.1 | 21.803571 | 0.292383 | 8.838382 | 21.803571 | 357 | 620 | 2.824194 | 13 | 8.293548 |
| 3 | blackassign0004 | https://insights.blackcoffer.com/rise-of-cyber... | 37 | 75 | -0.339286 | 0.184211 | 23.745098 | 0.270025 | 9.606049 | 23.745098 | 327 | 608 | 2.720395 | 5 | 8.080592 |
| 4 | blackassign0005 | https://insights.blackcoffer.com/ott-platform-... | 20 | 8 | 0.428571 | 0.076294 | 19.589744 | 0.201571 | 7.916526 | 19.589744 | 154 | 367 | 2.381471 | 6 | 7.53406 |
