
# **This code performs text analysis on articles obtained from a given set of URLs. The process has two main steps: data extraction and data analysis.**

## **Download the TextBlob corpora**

In [None]:
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.


## **Install requests, beautifulsoup4, textblob, and nltk**

In [None]:
pip install requests beautifulsoup4 textblob nltk



# **Data Extraction**

We start by reading the input data from the 'Input.xlsx' file using the 'pandas' library. This file contains a list of URLs along with their corresponding URL IDs.

Next, we use 'requests' and 'BeautifulSoup' to fetch each URL and extract the article title and text from its HTML. The title comes from the page's title tag and the text from its paragraph ('p') tags, which leaves out most header, footer, and navigation markup, although any paragraphs inside those regions are still included. The title and text for each URL are then saved together in one text file named after its URL ID.

In [None]:
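import requests
from bs4 import BeautifulSoup
import pandas as pd

# Read the input file (columns: URL_ID, URL)
input_data = pd.read_excel("/content/Input.xlsx")

for index, row in input_data.iterrows():
    url_id = row['URL_ID']
    url = row['URL']
    # A timeout keeps one unresponsive site from stalling the whole run
    response = requests.get(url, timeout=30)
    # Stop with an error on HTTP failures instead of saving an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Some pages have no title tag; fall back to an empty string
    title_tag = soup.find('title')
    article_title = title_tag.get_text() if title_tag else ""
    # Join the text of every paragraph tag into one string
    article_text = " ".join([p.get_text() for p in soup.find_all('p')])
    # Save the title and body to a file named after the URL ID
    with open(f"{url_id}.txt", 'w', encoding='utf-8') as file:
        file.write(f"{article_title}\n\n{article_text}")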
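
The paragraph-based extraction described above can be tightened so that text from page headers and footers is skipped. The following is a minimal sketch, not the method used in this notebook: the helper name `extract_article` is hypothetical, and it assumes that pages mark boilerplate with 'header', 'footer', 'nav', or 'aside' tags and, where present, wrap the body in an 'article' tag.

```python
from bs4 import BeautifulSoup


def extract_article(html):
    """Return (title, text) from an HTML string, skipping page boilerplate.

    Hypothetical helper: assumes boilerplate lives inside header, footer,
    nav, or aside tags, which is common but not guaranteed.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Pages without a title tag yield an empty title
    title_tag = soup.find("title")
    title = title_tag.get_text(strip=True) if title_tag else ""

    # Prefer the article element when the page has one
    container = soup.find("article") or soup

    # Keep only paragraphs that are not nested in a boilerplate region
    paragraphs = [
        p.get_text(" ", strip=True)
        for p in container.find_all("p")
        if p.find_parent(["header", "footer", "nav", "aside"]) is None
    ]
    return title, " ".join(text for text in paragraphs if text)


sample_html = (
    "<html><head><title> Demo </title></head><body>"
    "<header><p>Site menu</p></header>"
    "<p>First paragraph.</p><p>Second one.</p>"
    "<footer><p>Copyright</p></footer>"
    "</body></html>"
)
title, text = extract_article(sample_html)
print(title)
print(text)
```

Real sites vary widely in structure, so the tag list usually needs adjusting per site.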

# **Download cmudict**

In [None]:
import nltk
nltk.download('cmudict')

[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.


True

# **Data Analysis:**

### For the data analysis part, we use the 'TextBlob' library for sentiment analysis (polarity and subjectivity) and for splitting the text into words and sentences. We also use the CMU Pronouncing Dictionary from the 'nltk' library to count syllables.

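### To reduce run time, we run both the reading of the text files and the text analysis through a thread pool from the 'concurrent.futures' module. Threads help most with the file reads, which are I/O-bound; the analysis is CPU-bound, so Python's global interpreter lock limits the gain there.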
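
As a minimal sketch of the thread-pool pattern, the example below reads a few throwaway files concurrently. The `directory` parameter, the `read_all` helper, and the file IDs are assumptions made for illustration; they are not part of the notebook.

```python
import concurrent.futures
import tempfile
from pathlib import Path


def read_text_file(directory, url_id):
    """Return (url_id, file contents) for one saved article."""
    path = Path(directory) / f"{url_id}.txt"
    return url_id, path.read_text(encoding="utf-8")


def read_all(directory, url_ids, max_workers=4):
    """Read all article files with a thread pool, keeping the input order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the order of url_ids,
        # regardless of which thread finishes first.
        return list(executor.map(lambda uid: read_text_file(directory, uid), url_ids))


# Demo with throwaway files (hypothetical IDs, not the real input data)
with tempfile.TemporaryDirectory() as tmp:
    for uid in ("a1", "a2", "a3"):
        (Path(tmp) / f"{uid}.txt").write_text(f"Title {uid}\n\nBody of {uid}", encoding="utf-8")
    for uid, content in read_all(tmp, ["a1", "a2", "a3"]):
        print(uid, len(content))
```

Because `executor.map` preserves input order, each result can be matched back to its URL ID without extra bookkeeping.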

# **Output Data Structure:**

### The results of our text analysis are organized into a DataFrame using 'pandas' and saved in the 'Output Data Structure.xlsx' file. Each row holds one URL ID with its positive and negative scores, polarity, subjectivity, and readability measures such as average sentence length, percentage of complex words, fog index, word count, and syllables per word.

In [None]:
import concurrent.futures
from functools import partial
from textblob import TextBlob
from nltk.corpus import cmudict
import pandas as pd
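# Load the CMU Pronouncing Dictionary once; rebuilding it on every call is very slow
CMU_DICT = cmudict.dict()
def syllable_count(word):
    # Vowel phonemes end in a stress digit, so counting them gives the syllable count.
    # Words with several pronunciations use the largest count; unknown words count as 0.
    pronunciations = CMU_DICT.get(word.lower())
    if not pronunciations:
        return 0
    return max(sum(1 for phoneme in p if phoneme[-1].isdigit()) for p in pronunciations)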
def analyze_text(url_id, content):
    blob = TextBlob(content)
    words = blob.words
    word_count = len(words)
    sentence_count = len(blob.sentences)
    if word_count == 0 or sentence_count == 0:
        # Empty article: return zeros rather than dividing by zero
        return [url_id] + [0] * 13
    # Count syllables once per word and reuse the result below
    syllable_counts = [syllable_count(word) for word in words]
    # TextBlob reports a single polarity value in [-1, 1]; both scores below are derived from it
    polarity_score = blob.sentiment.polarity
    positive_score = polarity_score
    negative_score = 1 - positive_score
    subjectivity_score = blob.sentiment.subjectivity
    # Average sentence length is words per sentence, not the number of sentences
    avg_sentence_length = word_count / sentence_count
    # Complex words are those with more than two syllables
    complex_word_count = sum(1 for count in syllable_counts if count > 2)
    percentage_of_complex_words = complex_word_count / word_count * 100
    fog_index = 0.4 * (avg_sentence_length + percentage_of_complex_words)
    avg_number_of_words_per_sentence = word_count / sentence_count
    syllable_per_word = sum(syllable_counts) / word_count
    personal_pronouns = len([word for word in words if word.lower() in ['i', 'me', 'my', 'mine', 'we', 'us', 'our', 'ours']])
    avg_word_length = sum(len(word) for word in words) / word_count

    return [url_id, positive_score, negative_score, polarity_score, subjectivity_score,
            avg_sentence_length, percentage_of_complex_words, fog_index, avg_number_of_words_per_sentence,
            complex_word_count, word_count, syllable_per_word, personal_pronouns, avg_word_length]
def read_text_file(url_id):
    with open(f"{url_id}.txt", 'r', encoding='utf-8') as file:
        return url_id, file.read()
input_data = pd.read_excel("/content/Input.xlsx")
# Read the saved articles concurrently; executor.map returns results in input order
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(executor.map(read_text_file, input_data['URL_ID']))
# Analyze each (url_id, content) pair
with concurrent.futures.ThreadPoolExecutor() as executor:
    output_data = list(executor.map(lambda result: analyze_text(*result), results))
columns = ['URL_ID', 'POSITIVE SCORE', 'NEGATIVE SCORE', 'POLARITY SCORE', 'SUBJECTIVITY SCORE',
           'AVG SENTENCE LENGTH', 'PERCENTAGE OF COMPLEX WORDS', 'FOG INDEX', 'AVG NUMBER OF WORDS PER SENTENCE',
           'COMPLEX WORD COUNT', 'WORD COUNT', 'SYLLABLE PER WORD', 'PERSONAL PRONOUNS', 'AVG WORD LENGTH']
output_df = pd.DataFrame(output_data, columns=columns)
output_df.to_excel("Output Data Structure.xlsx", index=False)
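
The FOG INDEX column follows the formula 0.4 × (average sentence length + percentage of complex words), where a complex word has more than two syllables. The sketch below works through that arithmetic with made-up counts; the function name `fog_index_from_counts` and the numbers are illustrative assumptions, not values from the analyzed articles.

```python
def fog_index_from_counts(word_count, sentence_count, complex_word_count):
    """Compute the fog index from raw counts.

    Raises ValueError for empty input instead of dividing by zero.
    """
    if word_count <= 0 or sentence_count <= 0:
        raise ValueError("word_count and sentence_count must be positive")
    # Average sentence length: words per sentence
    avg_sentence_length = word_count / sentence_count
    # Share of words with more than two syllables, as a percentage
    percentage_of_complex_words = complex_word_count / word_count * 100
    return 0.4 * (avg_sentence_length + percentage_of_complex_words)


# Made-up example: 100 words in 5 sentences, 10 of them complex.
# Average sentence length is 20 and complex words make up 10 percent,
# so the index is 0.4 * (20 + 10).
print(round(fog_index_from_counts(100, 5, 10), 2))
```

Longer sentences and a higher share of complex words both raise the index, which is why the average sentence length must be words per sentence rather than the sentence count.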

# **Conclusion:**
### In summary, this code extracts article titles and text from a set of URLs, computes sentiment and readability measures for each article, and organizes the results in a structured output file.

## **Additional Points:**
### We also implemented a syllable counting function using the CMU Pronouncing Dictionary to enhance our text analysis.
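
To show how the dictionary-based count works, here is a minimal sketch that uses a few hand-written entries in the CMU Pronouncing Dictionary style instead of the downloaded corpus. The `SAMPLE_PRONUNCIATIONS` table and the `count_syllables` name are assumptions for illustration; the notebook itself reads pronunciations from 'nltk'.

```python
# Illustrative entries in the CMU dictionary style: each word maps to a list
# of pronunciations, and each pronunciation is a list of phonemes.
# Vowel phonemes carry a trailing stress digit (0, 1, or 2).
SAMPLE_PRONUNCIATIONS = {
    "text": [["T", "EH1", "K", "S", "T"]],
    "hello": [["HH", "AH0", "L", "OW1"], ["HH", "EH0", "L", "OW1"]],
    "analysis": [["AH0", "N", "AE1", "L", "AH0", "S", "AH0", "S"]],
}


def count_syllables(word, pronunciations):
    """Count syllables as the number of stress-marked phonemes.

    Uses the largest count across pronunciation variants and returns 0
    for words missing from the table.
    """
    variants = pronunciations.get(word.lower())
    if not variants:
        return 0
    return max(
        sum(1 for phoneme in variant if phoneme[-1].isdigit())
        for variant in variants
    )


for word in ("Text", "hello", "analysis", "xyzzy"):
    print(word, count_syllables(word, SAMPLE_PRONUNCIATIONS))
```

Words that are not in the dictionary count as zero syllables, so rare words and misspellings lower the syllable and complex-word figures.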

### Running the file reads and the analysis through a thread pool can shorten the overall run time, particularly when dealing with a large number of URLs.