# Result using "Assignment.xlsx" file 

Result stored in "result_with_excel.csv" file (the name the code below writes to).

Best Perplexity: 1313.8285671354606



## What perplexity signifies

Perplexity is an evaluation metric commonly used for probabilistic language and topic models such as Latent Dirichlet Allocation (LDA). It measures how well the model predicts a sample of text: it is the exponential of the negative average log-likelihood per word.

1. Lower perplexity: A lower score indicates that the model is better at predicting the sample text. In other words, the lower the perplexity, the better the model captures the underlying structure of the text data.

2. Higher perplexity: Conversely, a higher score indicates poorer performance. The model has more difficulty predicting the sample text and may not be capturing the underlying patterns effectively.

#### I used perplexity instead of precision, recall, or F1 score because LDA is unsupervised: the data has no ground-truth topic labels to compare against, whereas perplexity needs only the text itself.
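To make the definition concrete, here is a minimal sketch of the perplexity formula in plain Python. The function name `perplexity` and the probability lists are illustrative assumptions, not part of the notebook; scikit-learn computes an approximation of this quantity internally for LDA.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(-(1/N) * sum(log p_i)) over the N predicted tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))

# Hypothetical per-token probabilities assigned by two models to the same text.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])   # guesses among 4 equally likely words
confident = perplexity([0.9, 0.8, 0.9, 0.7])     # assigns high probability to each word

print(uniform, confident)
```

A model that spreads its probability evenly over four words has a perplexity of about 4, which is why perplexity is often read as an "effective number of choices" per word; the more confident model scores lower.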

# News Analysis Project

In [23]:
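import re

import pandas as pd
import spacy
from bs4 import BeautifulSoup
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the dataset (a single text column, read without a header row).
# Note: header=None keeps the sheet's first row as data, which is likely why
# "Article" shows up as row 0 in the output further down.
df = pd.read_excel('/kaggle/input/dataset/Assignment.xlsx', header=None, names=['Description'])

# spaCy pipeline used for lemmatization and entity recognition
nlp = spacy.load('en_core_web_sm')

# Function to clean text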
def clean_text(text):
    # Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Convert text to lowercase
    text = text.lower()
    
    # Lemmatize using spaCy
    doc = nlp(text)
    lemmatized_tokens = [token.lemma_ for token in doc if not token.is_stop]
    
    # Join tokens back into a string
    cleaned_text = ' '.join(lemmatized_tokens)
    
    return cleaned_text

df['Cleaned_Description'] = df['Description'].apply(clean_text)


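# Function to get sentiment
# Requires the VADER lexicon (nltk.download('vader_lexicon') if it is not already present)
sid = SentimentIntensityAnalyzer()  # build once instead of once per row

def get_sentiment(text):
    # VADER's compound score lies in [-1, 1]; +/-0.05 are the conventional cutoffs
    compound = sid.polarity_scores(text)['compound']
    if compound >= 0.05:
        return 'Positive'
    elif compound <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'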

df['Sentiment'] = df['Cleaned_Description'].apply(get_sentiment)

# Find topics
def find_topics(texts):
    vectorizer = TfidfVectorizer(max_df=0.9, min_df=0.03, stop_words='english', max_features=150)
    tfidf = vectorizer.fit_transform(texts)
    
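    # Hyperparameter tuning for LDA: keep the model with the lowest perplexity.
    # Note: perplexity is scored on the same TF-IDF matrix used for fitting, so it
    # reflects fit to the training data rather than performance on unseen text.
    best_perplexity = float('inf')
    best_lda_model = None
    for n_topics in range(5, 50):  # tries 5 through 49 topics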
        lda_model = LatentDirichletAllocation(n_components=n_topics, random_state=42)
        lda_model.fit(tfidf)
        perplexity = lda_model.perplexity(tfidf)
        if perplexity < best_perplexity:
            best_perplexity = perplexity
            best_lda_model = lda_model
    
    print(f'Best Perplexity: {best_perplexity}')
    
    # Label each document with the 10 highest-weighted terms of its most probable topic
    feature_names = vectorizer.get_feature_names_out()
    topics = []
    for text in texts:
        doc = nlp(text)
        tokens = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]
        tfidf_vector = vectorizer.transform([' '.join(tokens)])
        topic_distribution = best_lda_model.transform(tfidf_vector)
        top_topic = topic_distribution.argmax(axis=1)[0]
        top_features_ind = best_lda_model.components_[top_topic].argsort()[:-10 - 1:-1]
        top_features = [feature_names[i] for i in top_features_ind]
        topics.append(', '.join(top_features))
    return topics

df['Topics'] = find_topics(df['Cleaned_Description'])

# Function to get aspects (named entities found by spaCy)
ASPECT_ENTITY_TYPES = {
    'ORG', 'PRODUCT', 'EVENT', 'PERSON', 'GPE', 'NORP', 'FAC', 'LOC', 'LAW',
    'LANGUAGE', 'DATE', 'TIME', 'QUANTITY', 'ORDINAL', 'CARDINAL',
}

def aspect_analysis(text):
    doc = nlp(text)
    # A set comprehension drops duplicate (text, label) pairs
    aspects = {(ent.text, ent.label_) for ent in doc.ents if ent.label_ in ASPECT_ENTITY_TYPES}
    return list(aspects)

df['Aspects'] = df['Description'].apply(aspect_analysis)

# Save the results to a new CSV file
result_df1 = df[['Description', 'Cleaned_Description', 'Sentiment', 'Topics', 'Aspects']]
result_df1.to_csv('result_with_excel.csv', index=False)


Best Perplexity: 1313.8285671354606
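The score above was computed on the same TF-IDF matrix the model was fitted on. As a point of comparison, here is a minimal sketch of selecting the number of topics by held-out perplexity on raw word counts. The toy corpus, the `CountVectorizer` swap, the split ratio, and the candidate topic counts are all assumptions for illustration, not the notebook's method.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus; in the notebook this would be df['Cleaned_Description'].
docs = [
    "drug weight loss patient", "drug approve patient trial",
    "drug cancer therapy cell", "drug treatment cancer patient",
    "drug weight people approve", "drug therapy treatment trial",
    "shoe run account app", "shoe nike run track",
    "shoe strava account run", "shoe app track nike",
    "shoe run strava app", "shoe nike account track",
]

# Hold out a quarter of the documents for evaluation only.
train_docs, test_docs = train_test_split(docs, test_size=0.25, random_state=42)

# LDA models word counts, so counts are used here instead of TF-IDF weights.
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_docs)
test_counts = vectorizer.transform(test_docs)  # reuse the training vocabulary

held_out = {}
for n_topics in (2, 3, 4):  # hypothetical candidate values
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    lda.fit(train_counts)
    held_out[n_topics] = lda.perplexity(test_counts)  # scored on unseen documents

best_n = min(held_out, key=held_out.get)
print(held_out, best_n)
```

Scores obtained this way are not comparable with the figure reported above, since both the input representation and the evaluation set differ.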


In [24]:
result_df1.head()

| | Description | Cleaned_Description | Sentiment | Topics | Aspects |
|---|---|---|---|---|---|
| 0 | Article | article | Neutral | account, nike, accord, cancer, strava, use, cl... | [] |
| 1 | Retailers, the makers of foods marketed for we... | retailer maker food market weight loss type co... | Positive | drug, weight, like, people, approve, chief, ye... | [(U.S., GPE), (Ozempic, PERSON), (C.E.O.s, ORG... |
| 2 | Move over, Ozempic — there’s a new drug in tow... | ozempic s new drug town \n\n eli lillys zepb... | Negative | drug, weight, like, people, approve, chief, ye... | [(the end of the year, DATE), (Zepbound, PERSO... |
| 3 | Sept 14 (Reuters) - Bristol Myers Squibb (BMY.... | sept reuters bristol myers squibb bmyn say... | Negative | therapy, cancer, cart, treatment, drug, cell, ... | [(Pfizer, ORG), (10, CARDINAL), (Eliquis, PERS... |
| 4 | Austin Wolcott was 18 years old and pretty sur... | austin wolcott year old pretty sure not surv... | Negative | therapy, cancer, cart, treatment, drug, cell, ... | [(John McCain, PERSON), (the ‘90s, DATE), (Mic... |
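The `Topics` column holds the highest-weighted terms of each document's most probable topic, selected in `find_topics` with the slice `argsort()[:-10 - 1:-1]`. The following sketch shows what that slice does on a small example; the feature names and weights are hypothetical stand-ins for `vectorizer.get_feature_names_out()` and one row of `components_`.

```python
import numpy as np

# Hypothetical vocabulary and one topic's term weights (one row of components_).
feature_names = np.array(["cancer", "drug", "therapy", "weight", "cell"])
topic_weights = np.array([0.2, 3.1, 0.4, 2.5, 0.1])

n_top = 3
# argsort() orders indices by ascending weight; the slice [:-n_top - 1:-1]
# walks backwards from the end, yielding the n_top largest weights first.
top_idx = topic_weights.argsort()[:-n_top - 1:-1]
top_words = feature_names[top_idx].tolist()

print(top_words)  # ['drug', 'weight', 'therapy']
```

An equivalent and arguably more readable form is `topic_weights.argsort()[::-1][:n_top]`.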


# Result using CNN articles
### Result stored in "result_with_sentiment_and_aspects.csv" file

#### Best perplexity score: 189.35439836654564




In [9]:
# import pandas as pd
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.decomposition import LatentDirichletAllocation
# from nltk.sentiment.vader import SentimentIntensityAnalyzer
# import spacy

# # Load the dataset
# df = pd.read_csv('/kaggle/input/cnn-articles-after-basic-cleaning/CNN_Articels_clean/CNN_Articels_clean.csv')

# # Function to clean text
# nlp = spacy.load('en_core_web_sm')

# def clean_text(text):
#     doc = nlp(text)
#     lemmatized_tokens = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]
#     cleaned_text = ' '.join(lemmatized_tokens)
#     return cleaned_text

# df['Cleaned_Description'] = df['Description'].apply(clean_text)

# # Function to get sentiment
# def get_sentiment(text):
#     sid = SentimentIntensityAnalyzer()
#     sentiment_scores = sid.polarity_scores(text)
#     if sentiment_scores['compound'] >= 0.05:
#         return 'Positive'
#     elif sentiment_scores['compound'] <= -0.05:
#         return 'Negative'
#     else:
#         return 'Neutral'

# df['Sentiment'] = df['Cleaned_Description'].apply(get_sentiment)

# # Find topics
# def find_topics(texts):
#     vectorizer = TfidfVectorizer(max_df=0.8, min_df=0.02, stop_words='english', max_features=3500)
#     tfidf = vectorizer.fit_transform(texts)
    
#     # Hyperparameter tuning for LDA
#     best_perplexity = float('inf')
#     best_lda_model = None
#     for n_topics in range(5, 51):  # Trying different numbers of topics
#         lda_model = LatentDirichletAllocation(n_components=n_topics, random_state=42)
#         lda_model.fit(tfidf)
#         perplexity = lda_model.perplexity(tfidf)
#         if perplexity < best_perplexity:
#             best_perplexity = perplexity
#             best_lda_model = lda_model
    
#     print(f'Best Perplexity: {best_perplexity}')
    
#     # Extracting topics from the best model
#     feature_names = vectorizer.get_feature_names_out()
#     topics = []
#     for text in texts:
#         doc = nlp(text)
#         tokens = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]
#         tfidf_vector = vectorizer.transform([' '.join(tokens)])
#         topic_distribution = best_lda_model.transform(tfidf_vector)
#         top_topic = topic_distribution.argmax(axis=1)[0]
#         top_features_ind = best_lda_model.components_[top_topic].argsort()[:-10 - 1:-1]
#         top_features = [feature_names[i] for i in top_features_ind]
#         topics.append(', '.join(top_features))
#     return topics

# df['Topics'] = find_topics(df['Cleaned_Description'])

# # Function to get aspects
# def aspect_analysis(text):
#     doc = nlp(text)
#     aspects = set()  # Using a set to avoid duplicates
#     # Additional entity types to consider
#     additional_entity_types = ['PERSON', 'GPE', 'NORP', 'FAC', 'LOC', 'LAW', 'LANGUAGE', 'DATE', 'TIME', 'QUANTITY', 'ORDINAL', 'CARDINAL']
#     for ent in doc.ents:
#         if ent.label_ in ['ORG', 'PRODUCT', 'EVENT'] or ent.label_ in additional_entity_types:
#             aspects.add((ent.text, ent.label_))
#     return list(aspects)

# df['Aspects'] = df['Description'].apply(aspect_analysis)

# # Save the results to a new CSV file
# result_df = df[['Description', 'Cleaned_Description', 'Sentiment', 'Topics', 'Aspects']]
# result_df.to_csv('result_with_sentiment_and_aspects.csv', index=False)



In [None]:
# result_df.head(5)



# Future Considerations

1. We can feed more data to the model.

2. We can integrate more advanced NLP models to get better results.

3. Dynamic Topic Modeling: Develop dynamic topic modeling techniques to capture the evolving nature of topics in large text corpora over time, enabling more accurate and up-to-date topic analysis.

4. We can create interactive visualization tools that let users explore topics, sentiment, and other insights in text data, facilitating deeper understanding and analysis.

5. Incorporate multimodal analysis techniques to analyze both textual and visual data together, enabling richer insights from diverse data sources such as images, videos, and text.

6. Implement real-time topic detection algorithms to identify emerging topics and trends as they happen, enabling timely decision-making and response in various applications such as social media monitoring and news analysis.

7. Cross-domain Analysis
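As one possible starting point for items 1 and 6, scikit-learn's `LatentDirichletAllocation` supports incremental updates through `partial_fit`. The sketch below is a minimal illustration under assumptions: the fixed vocabulary, the batches, and the two-topic setting are hypothetical, and a production system would need a strategy for vocabulary growth that is not shown here.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical fixed vocabulary; a stream needs one because new batches
# cannot add columns to an already-initialized model.
vocabulary = ["drug", "weight", "cancer", "therapy", "shoe", "run", "nike", "app"]
vectorizer = CountVectorizer(vocabulary=vocabulary)

# Hypothetical batches standing in for articles arriving over time.
batches = [
    ["drug weight drug", "cancer therapy drug"],
    ["shoe run nike", "app run shoe"],
    ["therapy cancer weight", "nike app shoe run"],
]

lda = LatentDirichletAllocation(n_components=2, random_state=42)
for batch in batches:
    counts = vectorizer.transform(batch)  # fixed vocabulary, so no fit is needed
    lda.partial_fit(counts)               # online update using this batch only

# Topic mixture for a new, unseen document (each row sums to 1).
doc_topics = lda.transform(vectorizer.transform(["drug therapy cancer"]))
print(doc_topics)
```

Because each call only sees one batch, the model can keep absorbing new articles without refitting on the full corpus, at the cost of some sensitivity to batch order.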

