# **Project Name** - **Data Extraction and NLP**



**Introduction:**

The objective of this project is to extract article text from a given list of URLs and to compute the text-analysis variables described below. The program is written in Python: it uses a web scraping library such as BeautifulSoup or Scrapy to extract the text of each article, then analyzes that text to compute the variables. The output is a CSV or Excel file containing the input variables and the computed variables.

**Input Data:**

The input data is provided in the form of an Excel file named Input.xlsx, which contains a list of URLs along with their associated URL_ID, Title, and Publication Date. The program will use this file to extract article text for each URL, and perform text analysis on each article.

**Output Data Structure:**

The output data structure is provided in the form of an Excel file named Output Data Structure.xlsx. This file contains a list of input variables along with computed variables such as Positive Score, Negative Score, Polarity Score, Subjectivity Score, Average Sentence Length, Percentage of Complex Words, FOG Index, Average Number of Words per Sentence, Complex Word Count, Word Count, Syllable per Word, Personal Pronouns, and Average Word Length. The program will compute these variables for each article and store the results in a new CSV or Excel file with the same format as Output Data Structure.xlsx.

**Background:**

Web scraping is the process of extracting data from websites using automated tools such as web crawlers or scraping scripts. It involves parsing HTML or XML code to extract specific data elements, such as text or images. Web scraping can be useful for data mining, market research, or content aggregation.

Text analysis is the process of analyzing unstructured text data to extract meaningful insights or patterns. It involves techniques such as natural language processing (NLP), sentiment analysis, and topic modeling. Text analysis can be useful for understanding customer feedback, social media sentiment, or content classification.

In this project, web scraping and text analysis are combined to extract article text from URLs and compute various variables such as sentiment scores, readability measures, and word usage statistics. The program will use web scraping libraries such as BeautifulSoup or Scrapy to extract article text, and NLP libraries such as NLTK or spaCy to perform text analysis.

# **Project Summary**



The objective of this project is to extract article text from a list of URLs provided in an Excel file and perform text analysis on each article to compute various variables. The project involves developing a Python program that uses web scraping libraries such as BeautifulSoup or Scrapy to extract article text from each URL, and NLP libraries such as NLTK or spaCy to perform text analysis on each article.

The output data structure is provided in the form of an Excel file named Output Data Structure.xlsx, which contains a list of input variables along with computed variables such as Positive Score, Negative Score, Polarity Score, Subjectivity Score, Average Sentence Length, Percentage of Complex Words, FOG Index, Average Number of Words per Sentence, Complex Word Count, Word Count, Syllable per Word, Personal Pronouns, and Average Word Length. The program will compute these variables for each article and store the results in a new CSV or Excel file with the same format as Output Data Structure.xlsx.

The planned timeline for this project is six days, which can be adjusted depending on the complexity of the web scraping and text analysis tasks. The two steps are combined: the scraper extracts article text from the URLs, and the analysis computes sentiment scores, readability measures, and word usage statistics from that text.

Overall, this project aims to provide a useful tool for content aggregation, market research, and data analysis, which can be used in various domains such as journalism, social media, and e-commerce.

# **Problem Statement**


In today's digital era, there is a vast amount of textual data available online. This data is a valuable source of information for various applications, such as market research, content aggregation, and data analysis. However, the challenge is to extract the relevant textual data and analyze it effectively to derive meaningful insights.

The problem addressed by this project is to extract article text from a list of URLs and compute text-analysis variables from it. The scraping program must extract each article's text accurately, excluding website headers, footers, and other non-relevant content. It must then compute variables such as sentiment scores, readability measures, and word usage statistics.

The program must also cope with data problems such as missing data, inconsistent data formats, and data quality issues, and still produce accurate and reliable results.

The overall objective of this project is to provide a useful tool for content aggregation, market research, and data analysis, which can be used in various domains such as journalism, social media, and e-commerce. By addressing the challenges of web scraping and text analysis, this project aims to facilitate the extraction and analysis of textual data, leading to meaningful insights and better decision-making.

# **General Guidelines**

General Guidelines for Text Analysis of Articles:

**1. Data Extraction:**

a. Use the pandas library in Python to read the URLs from the Input.xlsx file.

b. Iterate over each URL and use a library such as BeautifulSoup, Scrapy, or Selenium to extract the article text.

c. Clean the extracted text by removing any unnecessary information such as website headers, footers, and advertisements.

d. Save each extracted article in a text file with URL_ID as its file name.
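
Step (c) can be illustrated with the standard library alone. The following is a minimal sketch, not the method this notebook uses (the scraping cell further down relies on BeautifulSoup). It collects the text of `<p>` elements and ignores anything nested inside `<header>`, `<footer>`, `<nav>`, and similar tags. The class `ParagraphExtractor`, the helper `extract_paragraphs`, the list of skipped tags, and the sample HTML are all assumptions made for illustration; real sites usually need site-specific rules.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as page chrome, not article text (assumed list)
SKIPPED_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class ParagraphExtractor(HTMLParser):
    """Collect the text of <p> elements that are not inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # greater than 0 while inside a skipped tag
        self.in_paragraph = False
        self.current = []          # text pieces of the paragraph being read
        self.paragraphs = []       # finished paragraphs

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self.current = []

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Collapse runs of whitespace left over from the markup
            text = " ".join("".join(self.current).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.current.append(data)


def extract_paragraphs(html):
    """Return the article paragraphs found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical page: only the middle paragraph belongs to the article
sample_html = """
<html><body>
  <header><p>Site menu</p></header>
  <h1>Title</h1>
  <p>First <b>paragraph</b>.</p>
  <footer><p>Copyright notice</p></footer>
</body></html>
"""
paragraphs = extract_paragraphs(sample_html)
print(paragraphs)
```

Tracking a depth counter instead of a boolean keeps the filter correct when skipped tags are nested inside each other.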



**2. Data Analysis:**

a. Use the NLTK library in Python for the textual analysis of each extracted article.

b. Compute the variables as per the definitions given in the Text Analysis.docx file.

c. Save the computed variables in a dictionary or a list.

d. Iterate over all the URLs in Input.xlsx and perform the above steps.
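
The dictionary-based sentiment scores computed by `calculate_variables` later in this notebook use two formulas: polarity = (P - N) / (P + N + 0.000001) and subjectivity = (P + N) / (total words + 0.000001), where P and N are the counts of positive and negative words. The sketch below applies them to one sentence. The word sets `POSITIVE` and `NEGATIVE`, the function `sentiment_scores`, and the sample sentence are made up for illustration and stand in for the project's master dictionary; tokenization uses `str.split` rather than NLTK so the snippet runs on its own.

```python
import re

# Tiny made-up word lists standing in for the master dictionary
POSITIVE = {"good", "great", "gain"}
NEGATIVE = {"bad", "loss"}


def sentiment_scores(text):
    """Return (positive, negative, polarity, subjectivity) for a text."""
    # Lowercase, drop punctuation, and split on whitespace
    tokens = re.sub(r"[^\w\s]", "", text.lower()).split()
    positive = sum(1 for token in tokens if token in POSITIVE)
    negative = sum(1 for token in tokens if token in NEGATIVE)
    # The small constant avoids division by zero when no words match
    polarity = (positive - negative) / (positive + negative + 0.000001)
    subjectivity = (positive + negative) / (len(tokens) + 0.000001)
    return positive, negative, polarity, subjectivity


positive, negative, polarity, subjectivity = sentiment_scores(
    "Good results, great gain, one bad loss."
)
print(positive, negative, round(polarity, 2), round(subjectivity, 2))
```

Polarity falls between -1 and +1, and subjectivity between 0 and 1, so the two scores can be compared across articles of different lengths.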



**3. Output Data Structure:**

a. Create a pandas dataframe with the columns as given in the Output Data Structure.xlsx file.

b. Populate the dataframe with the extracted variables and the computed variables.

c. Export the dataframe to a CSV or Excel file as per the format given in the Output Data Structure.xlsx file.
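
For the CSV case, step (c) can be sketched with the standard library `csv` module. This is only an illustration of the export step: the column subset, the placeholder row, the `example.com` URL, and the helper `rows_to_csv` are assumptions, and the snippet writes to an in-memory buffer instead of a file.

```python
import csv
import io

# A subset of the output columns, in the order they should appear
columns = ["URL_ID", "URL", "POSITIVE SCORE", "NEGATIVE SCORE", "POLARITY SCORE"]

# Placeholder row; real rows come from the analysis step
rows = [
    {"URL_ID": "1", "URL": "https://example.com/a", "POSITIVE SCORE": 3,
     "NEGATIVE SCORE": 2, "POLARITY SCORE": 0.2},
]


def rows_to_csv(rows, columns):
    """Serialize a list of dictionaries to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, columns)
print(csv_text)
```

Passing `fieldnames` explicitly fixes the column order, which matters when the result has to match a prescribed output layout.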

**4. Code to Generate Expected Output:**

a. Load the URLs from Input.xlsx using the pandas library.

b. Iterate over each URL and use a library such as BeautifulSoup, Scrapy, or Selenium to extract the article text.

c. Perform textual analysis on each extracted article text to compute the required variables using the NLTK library.

d. Save the computed variables in a dictionary or a list.

e. Populate the pandas dataframe with the extracted variables and the computed variables.

f. Export the dataframe to a CSV or Excel file as per the format given in the Output Data Structure.xlsx file.

# ***Let's Begin!***

### Import Libraries

In [None]:
# Install the third-party packages before importing them
!pip install textstat
!pip install vaderSentiment

import os
import string
import re
import pandas as pd
import requests
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import cmudict
from textstat import flesch_reading_ease, smog_index, automated_readability_index
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

# Download required NLTK packages
import nltk
nltk.download('punkt')
# Newer NLTK releases look for 'punkt_tab'; downloading both covers either case
nltk.download('punkt_tab')
nltk.download('cmudict')

### Dataset Loading

In [None]:
# Mount Google Drive first so the project folder is reachable
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Read the input file
input_path = '/content/drive/MyDrive/Blackcoffer Project/Copy of Input.xlsx'
input_df = pd.read_excel(input_path)

### Code to Create a folder to store the text files

In [None]:
# Create the folder that stores the text files; the analysis cells read from it
text_dir = '/content/drive/MyDrive/Blackcoffer Project/textFiles1'
os.makedirs(text_dir, exist_ok=True)

url_list = []
# Loop through each URL and extract the article text
for index, row in input_df.iterrows():
    url_id = row['URL_ID']
    url = row['URL']
    url_list.append(url)
    try:
        # A timeout keeps one unresponsive site from stalling the whole run
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f'Error fetching {url_id}: {exc}')
        continue
    soup = BeautifulSoup(response.content, 'html.parser')

    # Use the first <h1> as the title and every non-empty <p> as body text
    title_tag = soup.find('h1')
    article_title = title_tag.text.strip() if title_tag else ''
    paragraphs = [p.text.strip() for p in soup.find_all('p')]
    article_text = ' '.join(text for text in paragraphs if text)

    # Save the extracted text to a text file with URL_ID as its file name
    if article_title and article_text:
        with open(os.path.join(text_dir, f'{url_id}.txt'), 'w', encoding='utf-8') as f:
            f.write(f'{article_title}\n{article_text}')
        print(f'Saved text file for {url_id}')
    else:
        print(f'Error extracting text for {url_id}')

### Loading the data required for analysis

In [None]:
# Load data into pandas dataframe
df = pd.read_excel("/content/drive/MyDrive/Blackcoffer Project/Copy of Input.xlsx")

# Load the stop words from every file in the StopWords folder
stop_words_folder = '/content/drive/MyDrive/Blackcoffer Project/StopWords'
stop_words = set()
for filename in os.listdir(stop_words_folder):
    with open(os.path.join(stop_words_folder, filename), "r", encoding="latin-1") as f:
        # clean_text lowercases its tokens, so store the stop words in lowercase too
        stop_words.update(line.strip().lower() for line in f if line.strip())

### Define a function to clean the text by removing stop words and punctuation

In [None]:

# Define a function to clean the text by removing stop words and punctuation
def clean_text(text):
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Convert to lowercase
    text = text.lower()
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stop words
    tokens = [token for token in tokens if token not in stop_words]
    # Join the tokens back into a string
    text = " ".join(tokens)
    return text

# Load and clean each article file
for filename in os.listdir('/content/drive/MyDrive/Blackcoffer Project/textFiles1'):
    if filename.endswith('.0.txt'):
        file_path = os.path.join('/content/drive/MyDrive/Blackcoffer Project/textFiles1', filename)
        # The files were written as UTF-8, so read them back with the same encoding
        with open(file_path, 'r', encoding='utf-8') as f:
            article = f.read()
        cleaned_text = clean_text(article)
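
To see what `clean_text` does to a single sentence, here is a dependency-free sketch of the same three steps: strip punctuation, lowercase, drop stop words. It substitutes `str.split` for NLTK's `word_tokenize` and uses a three-word stop list, both assumptions made so the snippet runs without the project files; `clean_text_sketch` and `sample_stop_words` are illustrative names.

```python
import string

# Made-up stop list; the notebook loads its list from the StopWords folder
sample_stop_words = {"the", "is", "a"}


def clean_text_sketch(text, stop_words):
    """Remove punctuation, lowercase the text, and drop stop words."""
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.lower().split()
    return " ".join(token for token in tokens if token not in stop_words)


cleaned = clean_text_sketch("The market is UP, a lot!", sample_stop_words)
print(cleaned)
```

Lowercasing before the stop-word filter is what lets "The" match the entry "the"; the stop list itself must therefore be lowercase as well.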


### Functions that calculate the required variables and store them in a file

In [None]:
#2 Final code
from nltk.tokenize import sent_tokenize, word_tokenize
master_dictionary_folder='/content/drive/MyDrive/Blackcoffer Project/MasterDictionary'

# Read one word list from the master dictionary folder into a lowercase set
def load_word_set(filename):
    with open(os.path.join(master_dictionary_folder, filename), "r", encoding='latin-1') as f:
        return {line.strip().lower() for line in f if line.strip()}

# Load the master dictionary once instead of on every function call
positive_words = load_word_set("positive-words.txt")
negative_words = load_word_set("negative-words.txt")

# Define a function to calculate the dictionary-based sentiment scores for a given text file
def calculate_variables(file_id, article):
    # Clean the text: lowercase it and drop punctuation
    cleaned_text = article.lower().strip()
    cleaned_text = re.sub(r'[^\w\s]','',cleaned_text)
    # Calculate the positive and negative scores by counting dictionary hits
    positive_score = 0
    negative_score = 0
    for word in word_tokenize(cleaned_text):
        if word in positive_words:
            positive_score += 1
        elif word in negative_words:
            negative_score += 1
    # Calculate the polarity score; the small constant avoids division by zero
    polarity_score = (positive_score - negative_score) / (positive_score + negative_score + 0.000001)
    # Calculate the subjectivity score as the share of words that carry sentiment
    subjectivity_score = (positive_score + negative_score) / (len(cleaned_text.split()) + 0.000001)
    return {
        "file_id": file_id,
        "positive_score": positive_score,
        "negative_score": negative_score,
        "polarity_score": polarity_score,
        "subjectivity_score": subjectivity_score,
    }

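# define function to count number of syllables in a word (vowel-group heuristic)
def syllable_count(word):
    # lowercase first so that capitalized vowels are counted too
    word = word.lower()
    vowels = "aeiouy"
    count = 0
    if word and word[0] in vowels:
        count += 1
    for i in range(1, len(word)):
        if word[i] in vowels and word[i-1] not in vowels:
            count += 1
    # a trailing 'e' is treated as silent
    if word.endswith('e'):
        count -= 1
    # every word has at least one syllable
    if count == 0:
        count += 1
    return count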

# define function to calculate the FOG index
def fog_index(word_count, complex_word_count, sentence_count):
    return 0.4 * ((word_count / sentence_count) + (100 * complex_word_count / word_count))


# define function to perform text analysis tasks
def analyze_text(text):
    # tokenize text into sentences and words; drop tokens that are pure punctuation
    sentences = sent_tokenize(text)
    words = [token for token in word_tokenize(text) if any(ch.isalnum() for ch in token)]

    # calculate number of words and complex words (three or more syllables)
    word_count = len(words)
    complex_words = [word for word in words if syllable_count(word) >= 3]
    complex_word_count = len(complex_words)

    # calculate average number of words per sentence and average sentence length
    avg_words_per_sentence = word_count / len(sentences)
    avg_sentence_length = sum(len(sentence.split()) for sentence in sentences) / len(sentences)
    avg_word_length = sum(len(word) for word in words) / word_count

    # count personal pronouns in any letter case, skipping the all-caps "US",
    # which is more likely the country than the pronoun
    pronoun_matches = re.findall(r'\b(?:I|me|my|mine|we|us|our|ours)\b', text, flags=re.IGNORECASE)
    personal_pronouns = sum(1 for match in pronoun_matches if match != 'US')

    # TextBlob polarity and subjectivity, kept alongside the dictionary-based
    # scores returned by calculate_variables
    blob = TextBlob(text)
    polarity_score = blob.sentiment.polarity
    subjectivity_score = blob.sentiment.subjectivity

    # calculate FOG index and syllables per word
    fog_index_score = fog_index(word_count, complex_word_count, len(sentences))
    syllables_per_word = sum(syllable_count(word) for word in words) / word_count

    # calculate percentage of complex words
    percentage_of_complex_words = (complex_word_count / word_count) * 100

    # create dictionary with analysis results
    result = {
        "Word Count": word_count,
        "Complex Word Count": complex_word_count,
        "Average Words per Sentence": avg_words_per_sentence,
        "Average Word Length": avg_word_length,
        "Personal Pronouns Count": personal_pronouns,
        "Polarity Score": polarity_score,
        "Subjectivity Score": subjectivity_score,
        "FOG Index": fog_index_score,
        "Syllables per Word": syllables_per_word,
        "Percentage of complex words": percentage_of_complex_words,
        "Average Sentence Length": avg_sentence_length
    }

    return result


# Store results in a list
results = []

# Loop through all files in the directory
for filename in os.listdir('/content/drive/MyDrive/Blackcoffer Project/textFiles1'):
    if filename.endswith('.0.txt'):
        file_path = os.path.join('/content/drive/MyDrive/Blackcoffer Project/textFiles1', filename)
        with open(file_path, 'r', encoding='utf-8') as f:
            article = f.read()
            # Call both functions and store results in a dictionary
            result1 = calculate_variables(filename, article)
            result2 = analyze_text(article)
            result = {**result1, **result2}
            results.append(result)
        print(f"Processed {filename}")


# Write results to the output file as an Excel workbook
output_file = "/content/drive/MyDrive/Blackcoffer Project/Result/output.xlsx"
# Map the result keys to the output column names, in output order
column_map = {
    "positive_score": "POSITIVE SCORE", "negative_score": "NEGATIVE SCORE",
    "polarity_score": "POLARITY SCORE", "subjectivity_score": "SUBJECTIVITY SCORE",
    "Average Sentence Length": "AVG SENTENCE LENGTH",
    "Percentage of complex words": "PERCENTAGE OF COMPLEX WORDS", "FOG Index": "FOG INDEX",
    "Average Words per Sentence": "AVG NUMBER OF WORDS PER SENTENCE",
    "Complex Word Count": "COMPLEX WORD COUNT", "Word Count": "WORD COUNT",
    "Syllables per Word": "SYLLABLES PER WORD", "Personal Pronouns Count": "PERSONAL PRONOUNS",
    "Average Word Length": "AVG WORD LENGTH",
}
out_df = pd.DataFrame(results).rename(columns=column_map)
# File names look like "<URL_ID>.0.txt"; strip that suffix to recover the URL_ID
out_df.insert(0, "URL_ID", out_df["file_id"].str[:-len(".0.txt")])
out_df = out_df[["URL_ID", *column_map.values()]].round(2)
out_df.to_excel(output_file, index=False)
print(f"Wrote {len(out_df)} results to {output_file}")


In [None]:
# Write the results again, this time with the source URL next to each URL_ID
output_file = "/content/drive/MyDrive/Blackcoffer Project/Result/output.xlsx"
column_map = {
    "positive_score": "POSITIVE SCORE", "negative_score": "NEGATIVE SCORE",
    "polarity_score": "POLARITY SCORE", "subjectivity_score": "SUBJECTIVITY SCORE",
    "Average Sentence Length": "AVG SENTENCE LENGTH",
    "Percentage of complex words": "PERCENTAGE OF COMPLEX WORDS", "FOG Index": "FOG INDEX",
    "Average Words per Sentence": "AVG NUMBER OF WORDS PER SENTENCE",
    "Complex Word Count": "COMPLEX WORD COUNT", "Word Count": "WORD COUNT",
    "Syllables per Word": "SYLLABLES PER WORD", "Personal Pronouns Count": "PERSONAL PRONOUNS",
    "Average Word Length": "AVG WORD LENGTH",
}
# Look up each URL by file name: os.listdir order does not follow the order of url_list
url_by_file = {f"{row['URL_ID']}.txt": row['URL'] for _, row in input_df.iterrows()}
out_df = pd.DataFrame(results).rename(columns=column_map)
out_df.insert(0, "URL_ID", out_df["file_id"].str[:-len(".0.txt")])
out_df.insert(1, "URL", out_df["file_id"].map(url_by_file))
out_df = out_df[["URL_ID", "URL", *column_map.values()]].round(2)
out_df.to_excel(output_file, index=False)
print(f"Wrote {len(out_df)} results to {output_file}")
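
As a worked example of the readability measures, the sketch below applies the same vowel-group syllable heuristic as `syllable_count` and the same formula as `fog_index`, 0.4 × (words per sentence + percentage of complex words), to a two-sentence sample. It splits sentences on terminal punctuation with a regular expression instead of NLTK's `sent_tokenize`, an assumption that keeps the snippet self-contained. The names ending in `_sketch` and the sample text are illustrative.

```python
import re


def syllable_count_sketch(word):
    """Count vowel groups, treat a trailing 'e' as silent, return at least 1."""
    word = word.lower()
    vowels = "aeiouy"
    count = 0
    for i, ch in enumerate(word):
        # A vowel starts a new syllable only if the previous letter is not a vowel
        if ch in vowels and (i == 0 or word[i - 1] not in vowels):
            count += 1
    if word.endswith("e"):
        count -= 1
    return max(count, 1)


def fog_index_sketch(text):
    """Compute 0.4 * (average sentence length + percentage of complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    # Complex words are those with three or more syllables
    complex_words = [w for w in words if syllable_count_sketch(w) >= 3]
    avg_sentence_length = len(words) / len(sentences)
    percent_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_length + percent_complex)


sample = "Readability matters. Short words help everybody understand."
score = fog_index_sketch(sample)
print(round(score, 2))
```

The sample has 7 words in 2 sentences, and the heuristic marks 3 of them ("Readability", "everybody", "understand") as complex, so the score is 0.4 × (3.5 + 300/7). The heuristic is approximate: it will miscount some English words, which is the trade-off for not needing a pronunciation dictionary.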


# **Conclusion**

In this project, we extracted article text from the given URLs using Python with the requests and BeautifulSoup libraries. We kept only the title and paragraph text of each page, which drops most irrelevant markup, and saved each extracted article in a text file with URL_ID as its file name.

We then performed textual analysis on each extracted article text using the NLTK library to compute variables such as positive score, negative score, polarity score, subjectivity score, average sentence length, percentage of complex words, fog index, average number of words per sentence, complex word count, word count, syllable per word, personal pronouns, and average word length.

We saved the computed variables in a dictionary or a list and populated a pandas dataframe with the extracted variables and the computed variables. We exported the dataframe to a CSV or Excel file as per the format given in the Output Data Structure.xlsx file.

In conclusion, we extracted data from the given URLs and performed textual analysis to compute the variables as per the given definitions. The project demonstrates the use of Python and libraries such as requests, BeautifulSoup, NLTK, and TextBlob for data extraction and analysis.