# Analyze Times of India Articles

[Reference Lab](https://microsoftlearning.github.io/mslearn-ai-language/)

This notebook performs the following analyses:
* Get language
* Get sentiment
* Get key phrases
* Get entities
* Get linked entities



# Part-1: Setup

## 1.1 Imports

In [1]:
import keyring
import getpass
import os
import json
import re
import datetime
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient



## 1.2 Get Azure AI Language Service Endpoint and Key:

In [2]:
endpoint = getpass.getpass("Enter Azure AI Language Service endpoint URL: ").strip()

Enter Azure AI Language Service endpoint URL:  ········


In [3]:
api_key = getpass.getpass("Enter Azure AI Language Service API key: ").strip()

Enter Azure AI Language Service API key:  ········


In [4]:
# # Keyring: optional code block
# SERVICE_NAME = "azure_ai_language"
# # Store the endpoint and key using keyring
# keyring.set_password(SERVICE_NAME, "endpoint", endpoint)
# keyring.set_password(SERVICE_NAME, "api_key", api_key)
# # Retrieve the stored values
# endpoint = keyring.get_password(SERVICE_NAME, "endpoint")
# api_key = keyring.get_password(SERVICE_NAME, "api_key")
# # Delete a stored value
# keyring.delete_password(SERVICE_NAME, "api_key")
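
As an alternative to typing the endpoint and key on every run, the values could be read from environment variables, with the `getpass` prompt as a fallback. The sketch below is only an illustration: the helper `get_setting` and the variable names `AZURE_LANGUAGE_ENDPOINT` and `AZURE_LANGUAGE_KEY` are hypothetical and not part of the original notebook.

```python
import getpass
import os


def get_setting(env_var, prompt, prompt_fn=getpass.getpass):
    """Return the value of env_var, or prompt for it when it is unset or empty.

    env_var names are hypothetical; choose whatever fits your environment.
    """
    value = os.environ.get(env_var, "").strip()
    if not value:
        # Fall back to an interactive, non-echoing prompt
        value = prompt_fn(prompt).strip()
    return value


# Possible usage (assumed variable names):
# endpoint = get_setting("AZURE_LANGUAGE_ENDPOINT", "Enter Azure AI Language Service endpoint URL: ")
# api_key = get_setting("AZURE_LANGUAGE_KEY", "Enter Azure AI Language Service API key: ")
```

The `prompt_fn` parameter exists only so the fallback can be swapped out in a non-interactive run.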

## 1.3 Create client using endpoint and key:


In [5]:
credential = AzureKeyCredential(api_key)
text_analytics_client = TextAnalyticsClient(endpoint=endpoint, credential=credential)

## 1.4 Set Path: 

In [6]:
# Folder containing the article JSON files; uncomment the category to analyze
# articles_path = "Output/toi-articles/toi-articles-Most Commented/"
articles_path = "Output/toi-articles/toi-articles-Science/"

In [7]:
article_files = [f for f in os.listdir(articles_path) if f.endswith(".json")]

print("Found files:")
for f in article_files:
    print(os.path.join(articles_path, f))


Found files:
Output/toi-articles/toi-articles-Science/124065263-20250926-195737.json
Output/toi-articles/toi-articles-Science/124065406-20250926-195737.json
Output/toi-articles/toi-articles-Science/124069242-20250926-195737.json
Output/toi-articles/toi-articles-Science/124085777-20250926-195737.json
Output/toi-articles/toi-articles-Science/124087991-20250926-195737.json
Output/toi-articles/toi-articles-Science/124089977-20250926-195737.json
Output/toi-articles/toi-articles-Science/124090517-20250926-195737.json
Output/toi-articles/toi-articles-Science/124093219-20250926-195737.json
Output/toi-articles/toi-articles-Science/124099264-20250926-195737.json
Output/toi-articles/toi-articles-Science/124107834-20250926-195737.json
Output/toi-articles/toi-articles-Science/124108001-20250926-195737.json
Output/toi-articles/toi-articles-Science/124108906-20250926-195737.json
Output/toi-articles/toi-articles-Science/124112683-20250926-195737.json
Output/toi-articles/toi-articles-Science/124112893-

## 1.5 Data Pre-processing:

In [8]:
all_articles = []
for f in article_files:
    file_path = os.path.join(articles_path, f)
    # Use a context manager so the file handle is closed after reading
    with open(file_path, encoding="utf-8") as fh:
        article_dict = json.load(fh)  # parse the JSON file into a dict
    #print(type(article_dict), file_path)

    article_link = article_dict.get("article_link")
    
    article_headline = article_dict.get("article_details").get("headline")
    
    date_published = article_dict.get("article_details").get("datePublished")
    
    article_body = article_dict.get("article_details").get("articleBody")
    # Remove all non-ascii chars
    article_body = re.sub(r'[^\x00-\x7F]+', '', article_body)
    # Remove escaped double quotes from string \"
    article_body = article_body.replace(r'\"', '"')

    comments = article_dict.get("comments")

    
    all_articles.append(
        {
            "article_link"     : article_link,
            "article_headline" : article_headline,
            "article_body"     : article_body,
            "date_published"   : date_published,
            "comments"         : comments
        }
    )

    #print(article_body)
    #print("--"*30)

print(f"Number of Articles:{len(all_articles)}")

Number of Articles:20
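
The regular expression in the cell above deletes every non-ASCII character. That also removes typographic apostrophes and quotes, which is probably why words such as "Americas" and "agencys" appear without apostrophes in the cleaned text. A minimal sketch of a gentler variant follows, assuming the source text uses curly quotes and dashes; the helper `clean_body` and its mapping table are illustrative and not part of the original notebook.

```python
import re

# Typographic characters mapped to ASCII equivalents before stripping (assumed set)
_ASCII_EQUIVALENTS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u00a0": " ",                  # non-breaking space
})


def clean_body(text):
    """Normalize common typographic characters, then drop remaining non-ASCII."""
    text = text.translate(_ASCII_EQUIVALENTS)
    text = re.sub(r"[^\x00-\x7F]+", "", text)
    # Replace escaped double quotes (backslash + quote) with a plain quote
    return text.replace('\\"', '"')
```

With this variant, `clean_body("America\u2019s best")` keeps the apostrophe, while characters with no mapping are still removed as before.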


# Part-2: Text Analytics

In [10]:
processed_articles = []

language_detection_char_limit = 100 # max = 5120
key_phrase_char_limit = 100
ner_char_limit = 50
entity_link_char_limit = 50

for article in all_articles:
     
    print("Processing:" + article.get('article_link'))

    # processed article dict
    processed_article = article.copy()

    # The service rejects documents longer than 5120 text elements (InvalidDocument error),
    # so each input is truncated; see https://aka.ms/text-analytics-data-limits
    # Language Detection
    detectedLanguage = text_analytics_client.detect_language(documents=[article.get("article_body")[:language_detection_char_limit]])[0]
    detected_lang_name = detectedLanguage.primary_language.name
    
    processed_article["detected_language"] = detected_lang_name
    
    # Sentiment Analysis for each comment
    # Results go into a new "comments_sentiment" list; the original "comments" list is kept
    comments = article.get("comments") or []  # tolerate articles without comments
    sentiment_comments = []
    for c in comments:        
        sentimentAnalysis = text_analytics_client.analyze_sentiment(documents=[ c ])[0]
        comment_sentiment = sentimentAnalysis.sentiment
        # print("Sentiment: {}".format(sentimentAnalysis.sentiment))
        comment_dict = {
            "comment"   : c,
            "sentiment" : comment_sentiment
        }
        sentiment_comments.append(comment_dict)
        
    processed_article["comments_sentiment"] = sentiment_comments

    # Key Phrases: only the first key_phrase_char_limit characters are analyzed
    key_phrases = text_analytics_client.extract_key_phrases(documents=[article.get("article_body")[:key_phrase_char_limit]])[0].key_phrases
    processed_article["key_phrases"] = key_phrases

    # Named Entity Recognition:
    entities = text_analytics_client.recognize_entities(documents=[article.get("article_body")[:ner_char_limit]])[0].entities

    named_entities = []
    for entity in entities:
        temp_entity = {"entity_name": entity.text, "entity_type": entity.category}
        named_entities.append(temp_entity)

    processed_article["named_entiry_recognition"] = named_entities

    # Entity Linking 
    link_entities = text_analytics_client.recognize_linked_entities(documents=[article.get("article_body")[:entity_link_char_limit]])[0].entities
    link_entity_list = []

    for link_entity in link_entities:
        temp_link_entity = {"entity_name": link_entity.name, "entity_url": link_entity.url}
        link_entity_list.append(temp_link_entity)

    processed_article["linked_entities"] = link_entity_list

    # Finally append

    processed_articles.append(processed_article)



Processing:https://timesofindia.indiatimes.com/science/nasa-announces-all-american-class-of-10-astronauts-for-moon-and-mars-mission-women-outnumber-men-for-the-first-time/articleshow/124065263.cms
Processing:https://timesofindia.indiatimes.com/science/nasa-parker-solar-probe-sets-speed-record-at-687000-kilometers-per-hour-during-25th-flyby-new-insights/articleshow/124065406.cms
Processing:https://timesofindia.indiatimes.com/science/did-earths-oceans-originate-from-an-ancient-near-earth-asteroid-scientists-find-clues/articleshow/124069242.cms
Processing:https://timesofindia.indiatimes.com/science/asteroid-2024-yr4-nasa-proposes-blowing-up-hazardous-space-rock-to-protect-moonandsatellites/articleshow/124085777.cms
Processing:https://timesofindia.indiatimes.com/science/artemis-ii-2026-nasa-prepares-first-crewed-mission-to-circle-around-the-moon-in-50-years-scheduled-for-february/articleshow/124087991.cms
Processing:https://timesofindia.indiatimes.com/science/these-dogs-can-actually-think-
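
The loop above stays under the 5120-text-element document limit by truncating each article to its first 50 to 100 characters, so most of the article body is never analyzed. One possible alternative is to split the body into chunks that each fit the limit and merge the results. The sketch below shows this for key phrases only. The helpers `chunk_text` and `extract_key_phrases_full` are hypothetical, and the batch size of 10 documents per request is an assumption that should be checked against the data limits page. Python's `len` counts code points, which is never fewer than the number of text elements, so it serves as a conservative bound.

```python
def chunk_text(text, max_chars=5120):
    """Split text into pieces of at most max_chars, breaking on spaces where possible."""
    chunks = []
    remaining = text.strip()
    while len(remaining) > max_chars:
        # Look for the last space that keeps the chunk within the limit
        split_at = remaining.rfind(" ", 0, max_chars + 1)
        if split_at <= 0:
            split_at = max_chars  # no space found: hard split
        chunks.append(remaining[:split_at].rstrip())
        remaining = remaining[split_at:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks


def extract_key_phrases_full(client, text, max_chars=5120, batch_size=10):
    """Collect unique key phrases across all chunks of text.

    client is expected to behave like TextAnalyticsClient: extract_key_phrases
    returns one result per document, each with is_error and key_phrases.
    batch_size=10 is an assumed per-request document limit.
    """
    chunks = chunk_text(text, max_chars)
    phrases, seen = [], set()
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        for doc in client.extract_key_phrases(documents=batch):
            if doc.is_error:
                continue  # skip chunks the service could not process
            for phrase in doc.key_phrases:
                if phrase.lower() not in seen:
                    seen.add(phrase.lower())
                    phrases.append(phrase)
    return phrases
```

In the loop this could replace the truncated call, for example `extract_key_phrases_full(text_analytics_client, article.get("article_body"))`. Key phrases that span a chunk boundary may be missed, which is the trade-off of splitting on character count rather than on sentences.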

In [11]:
print(json.dumps(processed_articles[0]))

{"article_link": "https://timesofindia.indiatimes.com/science/nasa-announces-all-american-class-of-10-astronauts-for-moon-and-mars-mission-women-outnumber-men-for-the-first-time/articleshow/124065263.cms", "article_headline": "NASA announces all-American class of 10 astronauts for moon and Mars mission; women outnumber men for the first time", "article_body": "NASA on Monday unveiled its newest class of astronaut candidates at Johnson Space Center in Houston, selecting 10 individuals from a competitive pool of 8,000 applicants. This group, comprising six women and four men, marks the first time women outnumber men in a NASA astronaut class. Acting NASA Administrator Sean Duffy described them as Americas best and brightest, emphasizing the critical role they will play in the agencys ambitious plans, including returning humans to the moon and preparing for an unprecedented crewed mission to Mars. The selection reflects NASAs focus on diversity, expertise, and preparing the next generatio

In [12]:
# Make timestamp for filename
timestamp = datetime.datetime.now().strftime("%d%m%y-%H%M%S")

# Build folder and file path
output_folder = "Output/text-analytics"
os.makedirs(output_folder, exist_ok=True)

output_file = os.path.join(output_folder, f"{timestamp}.json")

# Write JSON to file
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(processed_articles, f, indent=4, ensure_ascii=False)

print(f"Processed articles written to file: {output_file}")


Processed articles written to file: Output/text-analytics\260925-200435.json
