# Exploratory Data Analysis (EDA) for Tamil and Malayalam Text Classification

This notebook performs exploratory data analysis (EDA) on a dataset of Tamil and Malayalam texts labeled "Abusive" or "Non-Abusive". The analysis covers text statistics, script detection, class distribution, and visualizations of each.

---

In [30]:
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns

## 1. Dataset Analysis Function

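We start by defining a function that analyzes a dataset and prints a report: sample count, null and unique-value checks, class distribution, character and word counts, and the distribution of language scripts. The function expects a DataFrame with a `Text` column and a `Class` column, and returns the DataFrame with `text_length`, `word_count`, and `script` columns added.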

In [31]:
def analyze_dataset(data, set_name="Dataset"):
    """Analyze text dataset and print insights, including data quality checks"""
    print(f"\n{set_name} Analysis:")
    print("=" * 50)
    
    # Basic statistics
    print("\n1. Basic Statistics:")
    print(f"Total samples: {len(data)}")
    
    # Null values check
    print("\nNull Values Check:")
    print(data.isnull().sum())
    
    # Unique values check for each column
    print("\nUnique Values per Column:")
    for col in data.columns:
        unique_vals = data[col].nunique()
        print(f"{col}: {unique_vals} unique values")
    
    # Class distribution
    class_dist = data['Class'].value_counts()
    print("\nClass Distribution:")
    for cls, count in class_dist.items():
        percentage = (count/len(data))*100
        print(f"{cls}: {count} ({percentage:.2f}%)")
    
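    # Text length statistics
    # Work on a copy so the caller's DataFrame is not modified, and treat
    # missing texts as empty strings rather than the literal string "nan"
    data = data.copy()
    text = data['Text'].fillna('').astype(str)
    data['text_length'] = text.str.len()
    data['word_count'] = text.str.split().str.len()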
    
    print("\n2. Text Length Statistics:")
    print("\nCharacter Count:")
    print(f"Mean: {data['text_length'].mean():.2f}")
    print(f"Median: {data['text_length'].median():.2f}")
    print(f"Min: {data['text_length'].min()}")
    print(f"Max: {data['text_length'].max()}")
    
    print("\nWord Count:")
    print(f"Mean: {data['word_count'].mean():.2f}")
    print(f"Median: {data['word_count'].median():.2f}")
    print(f"Min: {data['word_count'].min()}")
    print(f"Max: {data['word_count'].max()}")
    
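    # Language script analysis: count characters in the Tamil (U+0B80 to U+0BFF)
    # and Malayalam (U+0D00 to U+0D7F) Unicode blocks; ties, including texts
    # with no characters from either block, are labeled 'Other/Mixed'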
    def detect_script(text):
        tamil = len(re.findall(r'[\u0B80-\u0BFF]', str(text)))
        malayalam = len(re.findall(r'[\u0D00-\u0D7F]', str(text)))
        if tamil > malayalam:
            return 'Tamil'
        elif malayalam > tamil:
            return 'Malayalam'
        else:
            return 'Other/Mixed'
    
    data['script'] = data['Text'].apply(detect_script)
    
    print("\n3. Script Distribution:")
    script_dist = data['script'].value_counts()
    for script, count in script_dist.items():
        percentage = (count/len(data))*100
        print(f"{script}: {count} ({percentage:.2f}%)")
    
    # Cross tabulation of script and class
    print("\n4. Script vs Class Distribution:")
    cross_tab = pd.crosstab(data['script'], data['Class'])
    print(cross_tab)
    
    return data
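
To make the script labels easier to interpret, here is a minimal standalone sketch of the same detection logic applied to a few hand-written strings. The sample strings are illustrative and are not taken from the dataset. Note that romanized (Latin-script) text and exact ties both end up as 'Other/Mixed', so that label does not necessarily mean a text mixes the two scripts.

```python
import re

# Unicode blocks used by the notebook's detect_script
TAMIL = re.compile(r'[\u0B80-\u0BFF]')
MALAYALAM = re.compile(r'[\u0D00-\u0D7F]')

def detect_script(text):
    """Label text by whichever script block contributes more characters."""
    tamil = len(TAMIL.findall(str(text)))
    malayalam = len(MALAYALAM.findall(str(text)))
    if tamil > malayalam:
        return 'Tamil'
    if malayalam > tamil:
        return 'Malayalam'
    return 'Other/Mixed'  # ties, including zero characters of either script

# Illustrative samples (not from the dataset): Tamil, Malayalam,
# romanized text, and a string with equal counts of both scripts
samples = ["வணக்கம்", "നമസ്കാരം", "vanakkam", "வா വാ"]
labels = [detect_script(s) for s in samples]
print(labels)
```

If romanized text is common in the data, it may be worth splitting 'Other/Mixed' into separate "neither script" and "tie" labels before reading anything into that category.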

## 2. Visualization Function

Next, we define a function that draws a 2×2 grid of plots: class distribution, text length by class, word count by class, and script distribution by class. The figure is saved to a PNG file and then closed, so it is not displayed inline.

In [32]:
def plot_distributions(data, set_name="Dataset"):
    """Create visualizations for the dataset"""
    plt.figure(figsize=(15, 10))
    
    # 1. Class Distribution
    plt.subplot(2, 2, 1)
    sns.countplot(data=data, x='Class')
    plt.title(f'Class Distribution in {set_name}')
    plt.xticks(rotation=45)
    
    # 2. Text Length Distribution by Class
    plt.subplot(2, 2, 2)
    sns.boxplot(data=data, x='Class', y='text_length')
    plt.title('Text Length Distribution by Class')
    plt.xticks(rotation=45)
    
    # 3. Word Count Distribution by Class
    plt.subplot(2, 2, 3)
    sns.boxplot(data=data, x='Class', y='word_count')
    plt.title('Word Count Distribution by Class')
    plt.xticks(rotation=45)
    
    # 4. Script Distribution
    plt.subplot(2, 2, 4)
    sns.countplot(data=data, x='script', hue='Class')
    plt.title('Script Distribution by Class')
    plt.xticks(rotation=45)
    
    plt.tight_layout()
    plt.savefig(f'{set_name.lower()}_analysis.png')
    plt.close()
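
The output filename is built from `set_name.lower()`, so a set name containing spaces, such as "Dev Malayalam", produces a filename with a space in it (`dev malayalam_analysis.png`). If that is inconvenient, one option is to normalize the name first. The sketch below uses a hypothetical helper, `safe_stem`, which is not part of the notebook.

```python
import re

def safe_stem(set_name):
    """Hypothetical helper: lowercase the name and collapse every run of
    characters other than a-z and 0-9 into a single underscore."""
    return re.sub(r'[^a-z0-9]+', '_', set_name.lower()).strip('_')

stems = [safe_stem(n) for n in ["Dev Malayalam", "Train Tamil", "Dataset"]]
print(stems)
```

Under this assumption, the save call could become `plt.savefig(f'{safe_stem(set_name)}_analysis.png')`.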

## 3. Running the Analysis

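Now we run the analysis on the four dataset files: the development and training splits for Malayalam and Tamil. Each file is analyzed separately, and one figure is saved per file.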

In [None]:
dev_ma = pd.read_csv('../data/dev_ma.csv')
dev_ta = pd.read_csv('../data/dev_ta.csv')
train_ma = pd.read_csv('../data/train_ma.csv')
train_ta = pd.read_csv('../data/train_ta.csv')

for data, set_name in [(dev_ma, "Dev Malayalam"), (dev_ta, "Dev Tamil"),
                       (train_ma, "Train Malayalam"), (train_ta, "Train Tamil")]:
    analyzed_data = analyze_dataset(data, set_name)
    plot_distributions(analyzed_data, set_name)
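
The loop prints one report per file, which makes it hard to compare splits side by side. As a possible follow-up, the sketch below collects class percentages into a single table. The helper name `class_balance` and the toy frames are assumptions for illustration; the toy data is made up and does not reflect the real CSVs.

```python
import pandas as pd

def class_balance(frames, label_col="Class"):
    """Hypothetical helper: return class percentages, one row per named split."""
    columns = {
        name: df[label_col].value_counts(normalize=True).mul(100).round(2)
        for name, df in frames.items()
    }
    # Build a classes-by-splits table, transpose it to splits-by-classes,
    # and show classes absent from a split as 0.0 rather than NaN
    return pd.DataFrame(columns).T.fillna(0.0)

# Toy frames standing in for the real splits (made-up labels)
toy_train = pd.DataFrame({"Class": ["Abusive", "Non-Abusive", "Non-Abusive", "Abusive"]})
toy_dev = pd.DataFrame({"Class": ["Abusive", "Non-Abusive", "Non-Abusive", "Non-Abusive"]})

balance = class_balance({"Train": toy_train, "Dev": toy_dev})
print(balance)
```

With the real data, the same call would take the loaded frames, for example `class_balance({"Train Tamil": train_ta, "Dev Tamil": dev_ta})`.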