# Telugu Token Frequency Analysis

This notebook reads a tokenized Telugu text file, keeps the tokens that contain Telugu characters, removes stopwords, and plots the 100 most frequent tokens with Matplotlib.

In [None]:
import matplotlib.pyplot as plt
from collections import Counter
import matplotlib.font_manager as fm
import os

## Load the Tokenized Telugu File

We read the file as UTF-8 and split each line on whitespace, collecting all tokens into one flat list.

In [None]:
# Path to your tokenized Telugu file
token_file = "tokenized_telugu.txt"

# Load tokens
with open(token_file, "r", encoding="utf-8") as f:
    tokens = [token for line in f for token in line.strip().split()]
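
As a quick check of how the flattening works, here is a minimal sketch that runs the same kind of comprehension on an in-memory stand-in for the file. The sample text and the names `sample_file` and `sample_tokens` are made up for illustration and are not part of the corpus.

```python
import io

# Hypothetical stand-in for the tokenized file: whitespace-separated tokens,
# with a blank line and repeated spaces thrown in on purpose.
sample_file = io.StringIO("ఇది ఒక పుస్తకం\n\nనేను   పుస్తకం చదివాను\n")

# str.split() with no argument splits on any run of whitespace,
# so blank lines contribute nothing and repeated spaces are collapsed.
sample_tokens = [token for line in sample_file for token in line.split()]

print(sample_tokens)
print(len(sample_tokens))
```

Because `split()` already ignores leading and trailing whitespace, the `strip()` call in the loading cell is harmless but not strictly needed.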

## Filter Telugu Tokens

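We keep only the tokens that contain at least one character from the Telugu Unicode block (U+0C00 to U+0C7F). Tokens that mix Telugu with digits or Latin letters are therefore kept as well.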

In [None]:
# Filter Telugu tokens (Unicode range U+0C00 to U+0C7F)
def is_telugu(token):
    return any('\u0C00' <= ch <= '\u0C7F' for ch in token)

telugu_tokens = [token for token in tokens if is_telugu(token)]
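
To make the behaviour of the filter concrete, here is a small sketch on hand-picked tokens. The sample list and the stricter `is_pure_telugu` helper are assumptions added for illustration; the notebook itself only uses `is_telugu`.

```python
def is_telugu(token):
    """Return True if any character lies in the Telugu block (U+0C00 to U+0C7F)."""
    return any("\u0C00" <= ch <= "\u0C7F" for ch in token)


def is_pure_telugu(token):
    """Hypothetical stricter variant: every character must be Telugu."""
    return bool(token) and all("\u0C00" <= ch <= "\u0C7F" for ch in token)


# Hypothetical sample tokens, not taken from the corpus.
samples = ["తెలుగు", "hello", "2024", "టీవీ9", "।", ""]

kept = [tok for tok in samples if is_telugu(tok)]
strict = [tok for tok in samples if is_pure_telugu(tok)]

print(kept)    # Telugu and mixed tokens
print(strict)  # Telugu-only tokens
```

The `any` check keeps a mixed token such as "టీవీ9", while the `all` variant drops it. Which one is appropriate depends on whether such mixed tokens should count in the frequency chart.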

## Remove Telugu Stopwords

We remove common Telugu stopwords so the analysis focuses on meaningful words.

In [None]:
# Telugu stop words (customize as needed)
telugu_stopwords = {
    "ఇది", "అందరు", "మరియు", "తర్వాత", "కాని", "అయితే", "కూడా", "ఉంది", "ఉన్న", "చేసి",
    "వారు", "మీ", "నా", "తన", "తాము", "ఇంకా", "పైన", "కింద", "పాటు", "తో", "లో", "కి", "పై"
}

filtered_tokens = [token for token in telugu_tokens if token not in telugu_stopwords]
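
The stopword filter is an exact string match, which the following sketch illustrates on a tiny made-up sample. The names `sample_stopwords`, `sample` and `sample_filtered` are hypothetical and only serve the example.

```python
# A small subset of the stopword set, for illustration only.
sample_stopwords = {"మరియు", "కూడా", "లో"}

# Hypothetical token list; "ఇంటిలో" ends in "లో" but is a single token.
sample = ["రాము", "మరియు", "సీత", "కూడా", "ఇంటిలో", "లో"]

# Only tokens that are exactly equal to a stopword are removed.
sample_filtered = [tok for tok in sample if tok not in sample_stopwords]

print(sample_filtered)
```

A token is removed only when it equals a stopword in full. If the tokenizer leaves postpositions such as "లో" attached to the preceding word, those longer tokens pass through the filter unchanged, so the short entries in the stopword set only matter when the tokenizer splits them off.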

## Count Word Frequencies

We use `Counter` to tally the filtered tokens and `most_common(100)` to get the 100 most frequent ones together with their counts.

In [None]:
# Frequency distribution
freq_dist = Counter(filtered_tokens)
top_tokens = freq_dist.most_common(100)
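# Split the (token, count) pairs into two tuples.
# Note: this rebinds `tokens` from the full token list to the top-100 tokens.
tokens, counts = zip(*top_tokens)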
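
Here is a minimal sketch of the counting and unzipping step on a tiny made-up list. The sample data and the empty-input guard are assumptions for illustration; the cell above does not include the guard.

```python
from collections import Counter

# Hypothetical sample tokens.
sample = ["పుస్తకం", "ఊరు", "పుస్తకం", "నది", "ఊరు", "పుస్తకం"]

freq = Counter(sample)

# most_common(n) returns (token, count) pairs sorted by descending count.
top = freq.most_common(2)

# zip(*pairs) transposes the pairs into one tuple of tokens and one of counts.
# Unpacking an empty zip raises ValueError, so fall back to two empty tuples.
words, word_counts = zip(*top) if top else ((), ())

print(top)
print(words)
print(word_counts)
```

The guard matters when the filters remove every token, for example when the input file contains no Telugu text.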

## Load Telugu Font

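We load the Noto Sans Telugu font so that Telugu labels render properly in the chart. If the font file is not found, Matplotlib falls back to its default font, which may not contain Telugu glyphs.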

In [None]:
# Adjust this path to where the font file is stored on your machine
font_path = r"C:\Users\Mahitha\OneDrive\Desktop\SVNIT\Semester - 5\Lab\NLP\NotoSansTelugu-VariableFont_wdth,wght.ttf"
telugu_font = None

if os.path.exists(font_path):
    telugu_font = fm.FontProperties(fname=font_path)
    print(f"✅ Telugu font loaded: {telugu_font.get_name()}")
else:
    print("⚠ Font not found, using default font.")
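
Since the font path above is specific to one machine, one option is to try several candidate locations and take the first that exists. The sketch below assumes this approach; the helper `first_existing_font` and every path in `candidate_fonts` are hypothetical and need to be adapted to where the font is actually installed.

```python
from pathlib import Path


def first_existing_font(candidates):
    """Return the first candidate path that points to an existing file, else None."""
    for candidate in candidates:
        path = Path(candidate).expanduser()  # expand a leading "~"
        if path.is_file():
            return str(path)
    return None


# Hypothetical candidate locations, checked in order.
candidate_fonts = [
    "NotoSansTelugu-VariableFont_wdth,wght.ttf",  # next to the notebook
    "~/.fonts/NotoSansTelugu-Regular.ttf",
    "/usr/share/fonts/truetype/noto/NotoSansTelugu-Regular.ttf",
]

found_font = first_existing_font(candidate_fonts)
print(found_font or "No candidate font found, the default font will be used.")
```

The result can then be passed to `fm.FontProperties(fname=...)` in the same way as `font_path` when it is not `None`.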

## Plot the Graph

We display the top 100 Telugu tokens in a horizontal bar chart.

In [None]:
# Plotting
plt.figure(figsize=(14, 25))  # Tall figure so that 100 labels at 12 pt do not overlap
# Plot in descending order; the y-axis is inverted below so the first bar ends up on top
plt.barh(tokens, counts, color="teal")
plt.xlabel("Frequency", fontsize=16, fontproperties=telugu_font)
plt.title("Top 100 Telugu Tokens", fontsize=18, fontproperties=telugu_font)
plt.xticks(fontsize=14)

# Apply the Telugu font to the y-axis labels
# (fontproperties=None falls back to the default font)
plt.yticks(range(len(tokens)), tokens, fontsize=12, fontproperties=telugu_font)

plt.gca().invert_yaxis()  # Highest frequency at top
plt.tight_layout()
plt.show()
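
With 100 tick labels, the figure height largely decides whether the labels overlap. The sketch below estimates a height from the label count and font size. The helper `barh_figure_height` and its default spacing and margin values are assumptions chosen for illustration, not Matplotlib settings.

```python
def barh_figure_height(n_labels, fontsize_pt=12, line_spacing=1.4, margin_in=1.5):
    """Estimate a figure height in inches for a horizontal bar chart.

    Each label needs roughly fontsize_pt * line_spacing points of vertical
    space; margin_in leaves room for the title and the x-axis.
    """
    points_per_inch = 72  # typographic points per inch
    label_height_in = fontsize_pt * line_spacing / points_per_inch
    return n_labels * label_height_in + margin_in


height = barh_figure_height(100)
print(round(height, 1))
```

The result can be used as the second value of `figsize`, for example `plt.figure(figsize=(14, height))`. A smaller font or fewer tokens gives a shorter figure.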
