<a href="https://colab.research.google.com/github/Satyam1018/Data_Extraction-Text_Analysis/blob/main/Data_Extraction_and_Text_Analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Personal Information**

1. **Name** - Satyam Ojha
2. **Contact Detail**
  - **LinkedIn** - https://www.linkedin.com/in/ojha-satyam
  - **Email**    - ojhasatyam10@gmail.com
3. **Skills**  
  - Python
  - SQL
  - Power BI / Tableau
  - Machine learning
  - Deep learning
  - Natural language processing (NLP)
  - Neural Network
4. **GitHub** - https://github.com/Satyam1018



# **Introduction**

This document explains the methodology for performing text analysis to derive insights such as sentiment scores, readability indices, and word complexity from a given text corpus. The analysis involves several steps:

- **Sentiment Analysis**: Clean the text by removing stop words, build dictionaries of positive and negative words, and calculate the positive, negative, polarity, and subjectivity scores.

- **Readability Analysis**: Assess the readability of the text using the Gunning Fog Index.

Each section includes a detailed explanation and Python code snippets to demonstrate the implementation, providing a practical guide to text analysis for data science projects.

# **Explanation of Columns**

- **URL_ID**: Unique identifier for each URL.

- **URL**: The URL of the article or text.

- **POSITIVE SCORE**: Count of positive words in the text.

- **NEGATIVE SCORE**: Count of negative words in the text.

- **POLARITY SCORE**: (Positive Score - Negative Score) / (Positive Score + Negative Score + 0.000001)

- **SUBJECTIVITY SCORE**: (Positive Score + Negative Score) / (Total Number of Words + 0.000001)

- **AVG SENTENCE LENGTH**: Average number of words per sentence.

- **PERCENTAGE OF COMPLEX WORDS**: (Number of complex words / Total Number of Words) * 100

- **FOG INDEX**: 0.4 * (AVG SENTENCE LENGTH + PERCENTAGE OF COMPLEX WORDS)

- **AVG NUMBER OF WORDS PER SENTENCE**: Total Number of Words / Total Number of Sentences.

- **COMPLEX WORD COUNT**: Number of complex words (words with more than 2 syllables).

- **WORD COUNT**: Total number of words in the text.

- **SYLLABLE PER WORD**: Average number of syllables per word.

- **PERSONAL PRONOUNS**: Count of personal pronouns in the text (I, we, my, ours, us).

- **AVG WORD LENGTH**: Average length of words in the text.
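As a worked check of the formulas above, the following sketch recomputes the derived scores for the first row of the final output table (`blackassign0001`: 14 positive words, 2 negative words, 300 words, 141 complex words). The variable names are illustrative and not part of the notebook, and the sentence count of 1 is an inference from the reported average sentence length of 299.9997.

```python
EPSILON = 0.000001  # small constant used in the formulas to avoid division by zero

# Counts for blackassign0001, taken from the final output table
positive_score = 14
negative_score = 2
word_count = 300
complex_word_count = 141
sentence_count = 1  # assumption: inferred from the reported average sentence length

polarity_score = (positive_score - negative_score) / (positive_score + negative_score + EPSILON)
subjectivity_score = (positive_score + negative_score) / (word_count + EPSILON)
avg_sentence_length = word_count / (sentence_count + EPSILON)
percentage_complex_words = complex_word_count / (word_count + EPSILON) * 100
fog_index = 0.4 * (avg_sentence_length + percentage_complex_words)

print(f"polarity={polarity_score:.6f} subjectivity={subjectivity_score:.6f} fog={fog_index:.6f}")
```

Rounded to six decimal places, these values agree with the ones reported for that row (0.750000, 0.053333, and 138.799880).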

# **GitHub Link**

https://github.com/Satyam1018/Data_Extraction-Text_Analysis

# **Problem Statement**

The goal of this project is to develop a methodology for text analysis to extract meaningful insights from a given corpus. Specifically, the project aims to:

**Determine the sentiment of the text, categorizing it as positive, negative, or neutral.**

**Assess the readability of the text to understand its complexity and ease of comprehension.**

**Analyze various linguistic features, including word count, sentence length, word complexity, syllable count, and the usage of personal pronouns.**

# ***Let's Begin !***

## **1. Import Libraries**

In [1]:
import re
import string

import nltk
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the NLTK resources used below (stop words, lemmatizer data, tokenizer models)
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## **2. Dataset Loading and Exploration**

In [4]:
df = pd.read_excel('/content/Input.xlsx')

In [5]:
df.head(10)

Unnamed: 0,URL_ID,URL
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...
5,blackassign0006,https://insights.blackcoffer.com/the-rise-of-t...
6,blackassign0007,https://insights.blackcoffer.com/rise-of-cyber...
7,blackassign0008,https://insights.blackcoffer.com/rise-of-inter...
8,blackassign0009,https://insights.blackcoffer.com/rise-of-cyber...
9,blackassign0010,https://insights.blackcoffer.com/rise-of-cyber...


In [6]:
df.tail()

Unnamed: 0,URL_ID,URL
95,blackassign0096,https://insights.blackcoffer.com/what-is-the-r...
96,blackassign0097,https://insights.blackcoffer.com/impact-of-cov...
97,blackassign0098,https://insights.blackcoffer.com/contribution-...
98,blackassign0099,https://insights.blackcoffer.com/how-covid-19-...
99,blackassign0100,https://insights.blackcoffer.com/how-will-covi...


In [7]:
df.shape

(100, 2)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   URL_ID  100 non-null    object
 1   URL     100 non-null    object
dtypes: object(2)
memory usage: 1.7+ KB


## **4. Data Extraction**

In [9]:
def extract_text_from_url(url, timeout=30):
    """Return the text of all <p> tags on the page, or None if the fetch fails."""
    try:
        # Fetch the page; the timeout keeps one slow URL from stalling the whole run
        response = requests.get(url, timeout=timeout)
    except requests.RequestException as exc:
        print(f"Error fetching {url}: {exc}")
        return None

    # Check if the request was successful
    if response.status_code != 200:
        print(f"Error fetching {url}: Status code {response.status_code}")
        return None

    # Parse the HTML content and join the text of every <p> tag
    soup = BeautifulSoup(response.text, 'html.parser')
    paragraphs = soup.find_all('p')
    return ' '.join(para.get_text() for para in paragraphs)

# Apply the function to each URL in the DataFrame and store the result in a new column
df['Text'] = df['URL'].apply(extract_text_from_url)

# Display the DataFrame with the extracted text
print(df.head())

# Optionally, save the DataFrame to a new Excel file
df.to_excel('extracted_text.xlsx', index=True)


Error fetching https://insights.blackcoffer.com/how-neural-networks-can-be-applied-in-various-areas-in-the-future/: Status code 404
Error fetching https://insights.blackcoffer.com/covid-19-environmental-impact-for-the-future/: Status code 404
            URL_ID                                                URL  \
0  blackassign0001  https://insights.blackcoffer.com/rising-it-cit...   
1  blackassign0002  https://insights.blackcoffer.com/rising-it-cit...   
2  blackassign0003  https://insights.blackcoffer.com/internet-dema...   
3  blackassign0004  https://insights.blackcoffer.com/rise-of-cyber...   
4  blackassign0005  https://insights.blackcoffer.com/ott-platform-...   

                                                Text  
0  Efficient Supply Chain Assessment: Overcoming ...  
1  Efficient Supply Chain Assessment: Overcoming ...  
2  Efficient Supply Chain Assessment: Overcoming ...  
3  Efficient Supply Chain Assessment: Overcoming ...  
4  Efficient Supply Chain Assessment: Overc

In [10]:
df.head()

Unnamed: 0,URL_ID,URL,Text
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...,Efficient Supply Chain Assessment: Overcoming ...
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...,Efficient Supply Chain Assessment: Overcoming ...
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...,Efficient Supply Chain Assessment: Overcoming ...
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...,Efficient Supply Chain Assessment: Overcoming ...
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...,Efficient Supply Chain Assessment: Overcoming ...


In [11]:
print(df.iloc[5]['Text'])

Efficient Supply Chain Assessment: Overcoming Technical Hurdles for Web Application Development Streamlined Integration: Interactive Brokers API with Python for Desktop Trading Application Efficient Data Integration and User-Friendly Interface Development: Navigating Challenges in Web Application Deployment Effective Management of Social Media Data Extraction: Strategies for Authentication, Security, and Reliability AI Bot Audio to audio Methodology for ETL Discovery Tool using LLMA, OpenAI, Langchain Methodology for database discovery tool using openai, LLMA, Langchain Chatbot using VoiceFlow Rising IT cities and its impact on the economy, environment, infrastructure, and city life by the year 2040. Rising IT Cities and Their Impact on the Economy, Environment, Infrastructure, and City Life in Future Internet Demand’s Evolution, Communication Impact, and 2035’s Alternative Pathways Rise of Cybercrime and its Effect in upcoming Future AI/ML and Predictive Modeling Solution for Contact 

- **Define Extraction Function:**
  - **Function: extract_text_from_url(url)**
  - **Action:** Fetches and parses HTML content from the URL, extracts text from `<p>` tags, and returns the combined text. When the request fails, it prints an error and returns `None`, as happened for the two URLs above that returned status 404.

- **Apply Function:**
  - **Action:** Applied the function to each URL in the DataFrame's URL column.
  - **Result:** Stored the extracted text in a new column named Text.

- **Verify and Save:**
  - **Verify:** Displayed the first few rows of the DataFrame to confirm that text was extracted.
  - **Save:** Saved the DataFrame to an Excel file (extracted_text.xlsx).

This process automates text extraction from URLs and stores it in a DataFrame for further analysis.
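In the output above, the extracted text of every displayed row begins with the same list of headlines, which suggests that collecting every `<p>` tag also picks up site-wide content. The following is a minimal, dependency-free sketch of one way to skip such regions. It uses the standard library's `html.parser` instead of BeautifulSoup, and the set of container tags to skip (`nav`, `header`, `footer`, `aside`) is an assumption about the page layout, not something confirmed for the target site.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect text inside <p> tags, ignoring paragraphs nested in skipped containers."""

    SKIP_TAGS = {"nav", "header", "footer", "aside"}  # assumption: boilerplate lives here

    def __init__(self):
        super().__init__()
        self.skip_depth = 0      # how many skipped containers we are currently inside
        self.in_paragraph = False
        self.paragraphs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            # Collapse runs of whitespace left over from the markup
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self.in_paragraph:
            self._buffer.append(data)


def extract_article_text(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.paragraphs)


sample_html = (
    "<nav><p>Menu link</p></nav>"
    "<article><p>First <b>point</b>.</p><p>Second point.</p></article>"
    "<footer><p>Contact</p></footer>"
)
article_text = extract_article_text(sample_html)
print(article_text)
```

On the sample markup, only the two paragraphs inside `<article>` are kept. Whether the same rule fits the real pages would need to be checked against their HTML.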


### **5. Loading Stop Words and Cleaning Text**

In [12]:
# Function to load stop words from multiple files
def load_stop_words(files, encoding='utf-8'):
    stop_words = set(stopwords.words('english'))
    for file in files:
        with open(file, 'r', encoding=encoding, errors='ignore') as f:
            stop_words.update([line.strip() for line in f if line.strip()])
    return stop_words

# Create the lemmatizer once and reuse it for every document
lemmatizer = WordNetLemmatizer()

# Data cleaning steps, including lemmatization
def clean_text(text, stop_words):
    # Rows whose URL could not be fetched have no text
    if text is None:
        return ''

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Convert text to lowercase
    text = text.lower()

    # Tokenize text into words
    words = word_tokenize(text)

    # Remove stop words, then lemmatize the remaining words
    clean_words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

    # Join the cleaned words back into a single string
    return ' '.join(clean_words)

# Load stop words from multiple files
stop_words_files = [
    '/content/drive/MyDrive/DS_Assignment_Data/StopWords_Auditor.txt',
    '/content/drive/MyDrive/DS_Assignment_Data/StopWords_Currencies.txt',
    '/content/drive/MyDrive/DS_Assignment_Data/StopWords_DatesandNumbers.txt',
    '/content/drive/MyDrive/DS_Assignment_Data/StopWords_Generic.txt',
    '/content/drive/MyDrive/DS_Assignment_Data/StopWords_GenericLong.txt',
    '/content/drive/MyDrive/DS_Assignment_Data/StopWords_Geographic.txt',
    '/content/drive/MyDrive/DS_Assignment_Data/StopWords_Names.txt'
]
stop_words = load_stop_words(stop_words_files)

# Apply the clean_text function to the 'Text' column
df['Cleaned_Text'] = df['Text'].apply(lambda x: clean_text(x, stop_words))

# Print the raw text of the first row for comparison
print(df.iloc[0]['Text'])

# Print the cleaned text of the first row to check that cleaning and lemmatization worked
print(df.iloc[0]['Cleaned_Text'])

Efficient Supply Chain Assessment: Overcoming Technical Hurdles for Web Application Development Streamlined Integration: Interactive Brokers API with Python for Desktop Trading Application Efficient Data Integration and User-Friendly Interface Development: Navigating Challenges in Web Application Deployment Effective Management of Social Media Data Extraction: Strategies for Authentication, Security, and Reliability AI Bot Audio to audio Methodology for ETL Discovery Tool using LLMA, OpenAI, Langchain Methodology for database discovery tool using openai, LLMA, Langchain Chatbot using VoiceFlow Rising IT cities and its impact on the economy, environment, infrastructure, and city life by the year 2040. Rising IT Cities and Their Impact on the Economy, Environment, Infrastructure, and City Life in Future Internet Demand’s Evolution, Communication Impact, and 2035’s Alternative Pathways Rise of Cybercrime and its Effect in upcoming Future AI/ML and Predictive Modeling Solution for Contact 

**Explanation**

- **Load Stop Words:**

  - Created a function to load stop words from multiple text files.

  - Combined them with NLTK's English stop words into a single set, reading each file as UTF-8 and ignoring bytes that cannot be decoded.

- **Data Cleaning Function:**

  - Developed a function to clean text by:

    - Returning an empty string when no text was extracted.

    - Removing punctuation.

    - Converting text to lowercase.

    - Tokenizing text into words.

    - Removing stop words.

    - Lemmatizing the remaining words to their base form.

    - Joining cleaned words back into a single string.

- **Apply Cleaning Function:**

  - Loaded stop words using the defined function.

  - Applied the cleaning function to the 'Text' column of the DataFrame.

  - Stored cleaned text in a new column 'Cleaned_Text'.
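Because the text is lowercased before stop words are removed, a stop word only matches if its entry in the set is lowercase as well. The sketch below shows one way the file entries could be normalized while loading. It assumes, without confirmation from the files themselves, that some entries may be uppercase or carry a trailing `|` annotation; `parse_stop_words` is a hypothetical helper and not part of the notebook.

```python
import io


def parse_stop_words(lines):
    """Return a set of lowercase stop words from an iterable of text lines."""
    stop_words = set()
    for line in lines:
        # Assumption: anything after a '|' is an annotation, not part of the word
        entry = line.split("|", 1)[0].strip().lower()
        if entry:
            stop_words.add(entry)
    return stop_words


# An in-memory stand-in for one of the stop word files
sample_file = io.StringIO("ERNST\nYOUNG | audit firm\n\nthe\n")
parsed = parse_stop_words(sample_file)
print(sorted(parsed))
```

If the files already contain only lowercase words, this normalization changes nothing.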


### **6. Define Sentiment Analysis Function**

Define a function **sentiment_analysis** to calculate positive, negative, polarity, and subjectivity scores.

In [13]:
def sentiment_analysis(text, positive_words, negative_words):
    words = word_tokenize(text)

    positive_score = sum(1 for word in words if word in positive_words)
    negative_score = sum(1 for word in words if word in negative_words)
    polarity_score = (positive_score - negative_score) / ((positive_score + negative_score) + 0.000001)
    subjectivity_score = (positive_score + negative_score) / (len(words) + 0.000001)

    return positive_score, negative_score, polarity_score, subjectivity_score


### **7. Load Positive and Negative Word Lists**

Load the lists of **positive and negative words** from the provided files.

In [14]:
# Define encoding variable
encoding = 'utf-8'

# Load positive word list
with open('/content/drive/MyDrive/DS_Assignment_Data/positive-words.txt', 'r', encoding=encoding, errors='ignore') as file:
    positive_words = set(line.strip() for line in file if line.strip())

# Load negative word list
with open('/content/drive/MyDrive/DS_Assignment_Data/negative-words.txt', 'r', encoding=encoding, errors='ignore') as file:
    negative_words = set(line.strip() for line in file if line.strip())


### **8. Apply Sentiment Analysis**

In [15]:
df[['Positive_Score', 'Negative_Score', 'Polarity_Score', 'Subjectivity_Score']] = df['Cleaned_Text'].apply(
    lambda x: pd.Series(sentiment_analysis(x, positive_words, negative_words)))
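The problem statement mentions categorizing text as positive, negative, or neutral, while the notebook stores only numeric scores. A minimal sketch of how a label might be derived from the polarity score follows. The function name `label_sentiment` and the `threshold` of 0.05 are assumptions chosen for illustration, not values taken from the notebook.

```python
def label_sentiment(polarity_score, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if polarity_score > threshold:
        return "positive"
    if polarity_score < -threshold:
        return "negative"
    # Scores close to zero, including texts with no sentiment words, count as neutral
    return "neutral"


# Two polarity scores from the final output table, plus a score of zero
labels = [label_sentiment(score) for score in (0.750000, -0.291339, 0.0)]
print(labels)
```

With the DataFrame in hand, the same function could be applied to the polarity column with `apply`. The right threshold depends on how strict the neutral band should be.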

### **9. Define Readability Analysis Function**

Define a function **readability_analysis** to calculate readability metrics.

In [16]:
def readability_analysis(text):
    sentences = sent_tokenize(text)
    words = word_tokenize(text)
    num_sentences = len(sentences)
    num_words = len(words)
    avg_sentence_length = num_words / (num_sentences + 0.000001)  # Avoid division by zero

    # Approximate syllables by counting vowel letters; a word is "complex" if it has more than two
    vowel_counts = [len(re.findall(r'[aeiou]', word)) for word in words]
    num_complex_words = sum(1 for count in vowel_counts if count > 2)

    # Calculate percentage of complex words
    percentage_complex_words = (num_complex_words / (num_words + 0.000001)) * 100  # Avoid division by zero

    # Calculate fog index
    fog_index = 0.4 * (avg_sentence_length + percentage_complex_words)

    # Calculate average word length
    avg_word_length = sum(len(word) for word in words) / (num_words + 0.000001)  # Avoid division by zero

    # Calculate syllables per word
    syllables_per_word = sum(vowel_counts) / (num_words + 0.000001)  # Avoid division by zero

    # Count personal pronouns; the text is lowercased first, so the pattern must use lowercase 'i'
    personal_pronouns = len(re.findall(r'\b(?:i|we|my|ours|us)\b', text.lower()))

    # AVG NUMBER OF WORDS PER SENTENCE is the same quantity as AVG SENTENCE LENGTH
    return (avg_sentence_length, percentage_complex_words, fog_index, avg_sentence_length,
            num_complex_words, num_words, syllables_per_word, personal_pronouns, avg_word_length)
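The function above treats every vowel letter as one syllable, which tends to overcount words with adjacent vowels. The sketch below compares that count with a slightly different heuristic that counts runs of consecutive vowels instead. Both helper names are hypothetical, treating `y` as a vowel is an assumption, and neither heuristic is a true syllable counter.

```python
import re


def count_vowel_letters(word):
    """Count individual vowel letters, as the notebook's function does."""
    return len(re.findall(r"[aeiou]", word.lower()))


def count_vowel_groups(word):
    """Count runs of consecutive vowels (with 'y'), a rough syllable estimate."""
    return len(re.findall(r"[aeiouy]+", word.lower()))


for word in ("source", "readability", "analysis"):
    print(word, count_vowel_letters(word), count_vowel_groups(word))
```

For example, "source" has three vowel letters and so counts as complex under the letter rule, but it has only two vowel groups and would not pass the "more than two" test under the group rule.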


### **10. Apply Sentiment and Readability Analysis**

Apply the sentiment and readability analysis functions to each row in the **Cleaned_Text** column and add the results as new columns in the DataFrame, using the column names expected in the output file.

In [17]:
# Apply sentiment analysis to the dataframe
df[['POSITIVE SCORE', 'NEGATIVE SCORE', 'POLARITY SCORE', 'SUBJECTIVITY SCORE']] = df['Cleaned_Text'].apply(
    lambda x: pd.Series(sentiment_analysis(x, positive_words, negative_words)))

# Apply readability analysis to the dataframe
df[['AVG SENTENCE LENGTH', 'PERCENTAGE OF COMPLEX WORDS', 'FOG INDEX', 'AVG NUMBER OF WORDS PER SENTENCE',
    'COMPLEX WORD COUNT', 'WORD COUNT', 'SYLLABLE PER WORD', 'PERSONAL PRONOUNS', 'AVG WORD LENGTH']] = df['Cleaned_Text'].apply(
    lambda x: pd.Series(readability_analysis(x)))
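Here the readability function receives **Cleaned_Text**, from which stop words have already been removed. Pronouns such as `i`, `we`, and `my` are in NLTK's English stop word list, so they cannot be counted at this stage. The sketch below counts pronouns on raw text instead. Skipping the all-caps token `US` on the assumption that it is the country abbreviation is a choice made for this example, and `count_personal_pronouns` is a hypothetical helper.

```python
import re

# Same pronoun list as the notebook, matched case-insensitively on word boundaries
PRONOUN_PATTERN = re.compile(r"\b(?:I|we|my|ours|us)\b", re.IGNORECASE)


def count_personal_pronouns(raw_text):
    """Count personal pronouns in unmodified text, skipping the all-caps token 'US'."""
    count = 0
    for match in PRONOUN_PATTERN.finditer(raw_text):
        if match.group(0) == "US":  # assumption: all-caps 'US' is the country, not a pronoun
            continue
        count += 1
    return count


sample_text = "We think the US market suits us, and I trust my team."
pronoun_count = count_personal_pronouns(sample_text)
print(pronoun_count)
```

In the sample sentence the matches are "We", "us", "I", and "my", while "US" is skipped.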


### **11. Adjust Column Names and Order**

In [18]:
# URL_ID and URL already have the expected names, so no renaming is needed

# Reorder columns to match the expected output format
df = df[['URL_ID', 'URL', 'POSITIVE SCORE', 'NEGATIVE SCORE', 'POLARITY SCORE', 'SUBJECTIVITY SCORE',
         'AVG SENTENCE LENGTH', 'PERCENTAGE OF COMPLEX WORDS', 'FOG INDEX', 'AVG NUMBER OF WORDS PER SENTENCE',
         'COMPLEX WORD COUNT', 'WORD COUNT', 'SYLLABLE PER WORD', 'PERSONAL PRONOUNS', 'AVG WORD LENGTH']]


### **12. Save the Final DataFrame and View**

Save the DataFrame with all the **calculated metrics** to an Excel file.

In [19]:
# Save the final DataFrame with all calculated metrics
df.to_excel('cleaned_data_with_metrics.xlsx', index=False)

# Display the DataFrame to verify
df

Unnamed: 0,URL_ID,URL,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,AVG NUMBER OF WORDS PER SENTENCE,COMPLEX WORD COUNT,WORD COUNT,SYLLABLE PER WORD,PERSONAL PRONOUNS,AVG WORD LENGTH
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...,14.0,2.0,0.750000,0.053333,299.999700,47.000000,138.799880,299.999700,141.0,300.0,2.616667,0.0,6.866667
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...,63.0,32.0,0.326316,0.104510,908.999091,55.775577,385.909867,908.999091,507.0,909.0,2.754675,0.0,7.165016
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...,48.0,27.0,0.280000,0.098945,757.999242,63.852243,328.740594,757.999242,484.0,758.0,3.072559,0.0,7.976253
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...,45.0,82.0,-0.291339,0.169108,750.999249,60.452730,324.580791,750.999249,454.0,751.0,2.944075,0.0,7.768309
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...,25.0,9.0,0.470588,0.067194,505.999494,54.150198,224.059877,505.999494,274.0,506.0,2.826087,0.0,7.363636
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,blackassign0096,https://insights.blackcoffer.com/what-is-the-r...,35.0,65.0,-0.300000,0.139470,716.999283,50.906555,307.162335,716.999283,365.0,717.0,2.691771,0.0,7.138075
96,blackassign0097,https://insights.blackcoffer.com/impact-of-cov...,28.0,37.0,-0.138462,0.115248,563.999436,48.404255,244.961476,563.999436,273.0,564.0,2.597518,0.0,6.671986
97,blackassign0098,https://insights.blackcoffer.com/contribution-...,8.0,1.0,0.777778,0.039648,226.999773,54.185022,112.473918,226.999773,123.0,227.0,2.731278,0.0,7.127753
98,blackassign0099,https://insights.blackcoffer.com/how-covid-19-...,20.0,4.0,0.666667,0.060302,397.999602,45.728643,177.491298,397.999602,182.0,398.0,2.512563,0.0,6.623116
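In the table above, AVG SENTENCE LENGTH is almost identical to WORD COUNT in every displayed row. That is what happens when sentences are counted after punctuation has been removed: with no sentence-ending marks left, the whole text counts as one sentence. The sketch below illustrates the effect with a simple regular-expression splitter, which is a stand-in for `sent_tokenize` and not the tokenizer used in the notebook.

```python
import re
import string


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]


raw_text = "IT cities are growing. Demand is rising! Will it last?"
stripped_text = raw_text.translate(str.maketrans("", "", string.punctuation))

raw_count = len(split_sentences(raw_text))
stripped_count = len(split_sentences(stripped_text))
word_total = len(raw_text.split())

print("sentences in raw text:", raw_count)
print("sentences after removing punctuation:", stripped_count)
print("average sentence length (raw):", word_total / raw_count)
print("average sentence length (stripped):", word_total / stripped_count)
```

Computing the sentence-based metrics on the raw `Text` column, and the word-based metrics on `Cleaned_Text`, would be one way to avoid this effect.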


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 15 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   URL_ID                            100 non-null    object 
 1   URL                               100 non-null    object 
 2   POSITIVE SCORE                    100 non-null    float64
 3   NEGATIVE SCORE                    100 non-null    float64
 4   POLARITY SCORE                    100 non-null    float64
 5   SUBJECTIVITY SCORE                100 non-null    float64
 6   AVG SENTENCE LENGTH               100 non-null    float64
 7   PERCENTAGE OF COMPLEX WORDS       100 non-null    float64
 8   FOG INDEX                         100 non-null    float64
 9   AVG NUMBER OF WORDS PER SENTENCE  100 non-null    float64
 10  COMPLEX WORD COUNT                100 non-null    float64
 11  WORD COUNT                        100 non-null    float64
 12  SYLLABLE 

# **Summary and Findings**

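Our analysis examined the **text content across multiple dimensions, highlighting variations in sentiment, readability, and complexity.** In the rows displayed above, the polarity score ranges from about -0.30 (blackassign0096) to 0.78 (blackassign0098), and roughly 46% to 64% of words are flagged as complex. Because the readability metrics are computed on **Cleaned_Text**, which no longer contains sentence-ending punctuation, each article is treated as a single sentence, so AVG SENTENCE LENGTH is effectively equal to WORD COUNT and the FOG INDEX values (roughly 112 to 386 in the rows shown) should be read with that in mind. These insights are valuable for understanding the tone, accessibility, and potential impact of the content.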

# **Conclusion**

- **Data Extraction**: We fetched and parsed HTML content from the given URLs, extracting text from `<p>` tags. Two of the 100 URLs returned status 404 and therefore have no text. This raw text formed the foundation for our subsequent analysis.
- **Data Cleaning and Preprocessing**: The text was cleaned by removing punctuation, converting it to lowercase, removing stop words, and lemmatizing the remaining words, leaving it in a format suitable for further analysis.
- **Sentiment Analysis**: Using predefined lists of positive and negative words, we calculated:
  - **Positive Score**: The number of positive words in the text.
  - **Negative Score**: The number of negative words in the text.
  - **Polarity Score**: A measure of the overall sentiment of the text, calculated as the normalized difference between positive and negative scores.
  - **Subjectivity Score**: The share of words in the cleaned text that appear in either the positive or the negative word list.
- **Readability Analysis**: We computed various readability metrics to assess the complexity and readability of the text:
  - **Average Sentence Length**: The average number of words per sentence.
  - **Percentage of Complex Words**: The percentage of words with more than two syllables, approximated here by counting vowel letters.
  - **Fog Index**: A readability test that estimates the years of formal education needed to understand the text.
  - **Complex Word Count**: The number of complex words in the text.
  - **Word Count**: The total number of words in the text.
  - **Syllables per Word**: The average number of syllables per word.
  - **Personal Pronouns**: The count of personal pronouns in the text.
  - **Average Word Length**: The average number of characters per word.
- **Results Compilation**: We compiled the calculated metrics into a comprehensive DataFrame, aligning with the expected output format. This included columns for URL_ID, URL, sentiment scores, readability scores, and other relevant metrics.
