# **DEVELOPING A PREDICTIVE MODEL FOR EARLY DETECTION OF MENTAL HEALTH DISORDERS** 

**DISCLAIMER: After using this model, seek further consultation with a certified healthcare professional in your area.**

---
**Authors:**

1. Elvis Wanjohi (Team Leader)

2. Jessica Gichimu

3. Jesse Ngugi

4. Stephen Gachingu

5. Latifa Riziki

---

## **1. Business Understanding**   

### 1.1 Business Overview
Given the fast pace of economic and technological change in today's world, many people, especially in the younger generation, experience some form of mental health issue in their lifetime. There has been a significant increase in the number of individuals experiencing suicidal ideation. Although such individuals may not explicitly communicate their distress, a closer examination of their online activity, such as social media posts, comments, and engagement patterns, often reveals emotional states indicative of psychological distress. Analysing this activity could help researchers, students, and practitioners develop early detection models for mental health support. The goal is to encourage data-driven approaches to mental health awareness, prevention, and support systems. Mental health awareness sits primarily in the healthcare and psychology domains, which focus on the assessment, diagnosis, and treatment of mental health conditions.

The target audience for this NLP model is healthcare professionals (such as therapists, psychologists, and psychiatrists) and mental health organizations and clinics, which can use it to prioritize high-risk cases or monitor trends in mental health conditions across populations. The model could be used to identify early symptoms of mental health conditions in individuals in our society. The idea for this project came from a brief description of mental health in the book *Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems*. The motivation for the project is to improve the diagnosis and treatment of mental health conditions by identifying underlying conditions at an early stage.



---

### 1.2 Problem Statement

Mental health care professionals often rely on their experience, qualifications and studies to diagnose an individual's mental health. We aim to build a tool that works alongside their expertise to classify individuals with mental health conditions accurately. Professionals can pass statements made by patients to our model, which analyses the words used and suggests a likely mental health condition. The model is meant to complement the knowledge of health care professionals and should be used cautiously to minimise misdiagnosis. It will also be available to the public for anyone who would like a quick self-assessment should they show signs of a mental breakdown, but they should follow up with their mental health care provider.


---

### 1.3 Business Objective

#### 1.3.1 Main Objective 

The main objective is to build a model that can correctly classify mental health conditions based on given text or statements used by the public or individuals.


#### 1.3.2 Specific Objectives

The specific objectives of the project are:
 
1. Translate our text data to Swahili to localize the dataset.

2. Determine the most common mental health condition.

3. Preprocess the data through steps such as tokenization and vectorization, handling missing values, and creating new features such as character, word and sentence counts.

4. Use word clouds to visualize and identify the most commonly used words for a specific mental health condition.

5. Use text length to classify a mental health condition or to show its correlation with a mental health condition.

6. 

7. Evaluate the model performance using precision, recall, F1 score, accuracy and ROC.

8. Compare different classification models to determine which performs best for this dataset.

9. Scrape data from an online platform such as Twitter to demonstrate the efficiency of our model.

10. Create a translation feature that converts English text to Swahili text, to allow for interpretability and diversity in our model.
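
As a rough illustration of objective 7, per-class precision, recall and F1 can be computed directly from true and predicted labels. The sketch below is a minimal pure-Python version; the function name `per_class_metrics` and the toy labels are assumptions for illustration only and are not taken from the project code or dataset.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # label p was predicted, but wrongly
            fn[t] += 1   # true label t was missed
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

# Toy labels (hypothetical, for illustration only)
y_true = ["anxiety", "stress", "anxiety", "bipolar", "stress", "anxiety"]
y_pred = ["anxiety", "anxiety", "anxiety", "bipolar", "stress", "stress"]

metrics = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)
print(metrics, accuracy, macro_f1)
```

Macro-averaging, as used for `macro_f1` here, weights every class equally, which may matter if some conditions are much rarer than others in the dataset.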

#### 1.3.3 Research Questions

1. Can we translate our data?

2. Which is the most common mental health condition?

3. Which features influence the mental health condition?

4. Which words are specific to a certain class?

5. Which classifier has the best precision, recall, F1 score, accuracy and ROC?

6. Which classification model performs best for this dataset?

7. Was our model efficient after making use 

---

### 1.5 Success Criteria  

The success of this project will be assessed in the following ways:

1. It should generate insights into how users feel about their products and services.

2. A machine learning model should be successfully developed that automatically determines the sentiment of a tweet based on words and tone used in the text.

 
---


## DATA PREPROCESSING


In [1]:
import nltk
import re
import stopwordsiso as stopwords_iso  # stopwords for many languages, each identified by its ISO code
from nltk.corpus import stopwords  # kept under a separate name so it does not shadow stopwordsiso
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('omw-1.4', quiet=True)

True

In [2]:
import pandas as pd
data = pd.read_csv('translated_dataset.csv')
data.head()

Unnamed: 0,text,status,text_sw
0,"""My mind is a never-ending cycle of worry, and...",anxiety,"""Akili yangu ni hali isiyobadilika ya wasiwasi..."
1,Despite the sun shining and birds singing outs...,bipolar,Licha ya jua kung'aa na ndege wakiimba nje ya ...
2,"I'm drowning in responsibilities, each one dem...",stress,"Mimi ninalemewa na madaraka, kila mmoja akitak..."
3,"""My emotions shift like the wind, leaving me u...",personality disorder,"""Hisia zangu hubadilika kama upepo, zikiniacha..."
4,"I'm trapped in a whirlwind of thoughts, unable...",anxiety,"Nimenaswa na mawazo mengi sana, nashindwa kuka..."


In [3]:
pd.set_option('display.max_colwidth',None)
df=data.drop('text', axis=1)
df.head()

Unnamed: 0,status,text_sw
0,anxiety,"""Akili yangu ni hali isiyobadilika ya wasiwasi, na hata kazi rahisi zaidi nahisi kuwa haiwezi kushindwa."" ""Ninakuwa na hofu na mashaka, na kila uamuzi unajisikia kama uwanja wa mabomu ya kutegwa chini ya ardhi unaongojea kulipuka."""
1,bipolar,"Licha ya jua kung'aa na ndege wakiimba nje ya dirisha langu, huzuni yangu inapungua sana, kana kwamba nimenaswa katika abiso isiyo na mwisho."
2,stress,"Mimi ninalemewa na madaraka, kila mmoja akitaka uangalifu wangu, hata hivyo nahofu kwamba hata nijaribu kadiri gani, huenda nisiweze kamwe kushinda kazi nyingi mno mbele yangu."
3,personality disorder,"""Hisia zangu hubadilika kama upepo, zikiniacha nikiwa na wasiwasi kuhusu mimi ni nani kwa kweli. Ninatamani kuwa imara, lakini ninahofia kupoteza uwezo wangu wa kinyonga wa kuchangamana na mazingira yoyote."""
4,anxiety,"Nimenaswa na mawazo mengi sana, nashindwa kukazia fikira kitu chochote huku akili yangu ikiona mandhari mbaya sana, ikiniacha nikiwa hoi na nikiwa nimelemewa na hofu."


In [4]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103488 entries, 0 to 103487
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   text     103481 non-null  object
 1   status   103488 non-null  object
 2   text_sw  103481 non-null  object
dtypes: object(3)
memory usage: 2.4+ MB


In [5]:
# checking for missing values
df.isna().sum()

status     0
text_sw    7
dtype: int64

In [6]:
# dropping rows with missing values
df.dropna(axis=0,inplace= True)

In [7]:
df.isna().sum()

status     0
text_sw    0
dtype: int64

In [8]:
df.duplicated().sum()

3148

There are 3,148 duplicate rows.

In [9]:
#dropping duplicated rows
df = df.drop_duplicates(keep='first')

In [10]:
# lowercase and strip surrounding whitespace
df['text_sw']= df['text_sw'].str.lower().str.strip()
df.head()

Unnamed: 0,status,text_sw
0,anxiety,"""akili yangu ni hali isiyobadilika ya wasiwasi, na hata kazi rahisi zaidi nahisi kuwa haiwezi kushindwa."" ""ninakuwa na hofu na mashaka, na kila uamuzi unajisikia kama uwanja wa mabomu ya kutegwa chini ya ardhi unaongojea kulipuka."""
1,bipolar,"licha ya jua kung'aa na ndege wakiimba nje ya dirisha langu, huzuni yangu inapungua sana, kana kwamba nimenaswa katika abiso isiyo na mwisho."
2,stress,"mimi ninalemewa na madaraka, kila mmoja akitaka uangalifu wangu, hata hivyo nahofu kwamba hata nijaribu kadiri gani, huenda nisiweze kamwe kushinda kazi nyingi mno mbele yangu."
3,personality disorder,"""hisia zangu hubadilika kama upepo, zikiniacha nikiwa na wasiwasi kuhusu mimi ni nani kwa kweli. ninatamani kuwa imara, lakini ninahofia kupoteza uwezo wangu wa kinyonga wa kuchangamana na mazingira yoyote."""
4,anxiety,"nimenaswa na mawazo mengi sana, nashindwa kukazia fikira kitu chochote huku akili yangu ikiona mandhari mbaya sana, ikiniacha nikiwa hoi na nikiwa nimelemewa na hofu."
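
Note that the cells above drop duplicates before lowercasing and stripping, so rows that differ only in letter case or surrounding whitespace would not have been counted among the 3,148 duplicates. A minimal sketch of normalising first is shown below, using plain Python tuples in place of the DataFrame; the sample rows and the helper name `dedupe_normalised` are hypothetical.

```python
def dedupe_normalised(rows):
    """Lowercase and strip the text, then keep the first of each (status, text) pair."""
    seen = set()
    unique_rows = []
    for status, text in rows:
        key = (status, text.lower().strip())
        if key not in seen:
            seen.add(key)
            unique_rows.append(key)
    return unique_rows

rows = [
    ("anxiety", "Nina wasiwasi sana."),
    ("anxiety", "nina wasiwasi sana. "),   # same text, different case/whitespace
    ("stress", "Nina wasiwasi sana."),     # same text, different label: kept
]
deduped = dedupe_normalised(rows)
print(deduped)
```

With pandas, the equivalent would be to run the `str.lower().str.strip()` step before `drop_duplicates`.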


In [11]:
vitenzi = pd.read_csv('swahili_vitenzi_v1.csv')

In [12]:
vitenzi.shape

(2185, 146)

In [13]:
print(vitenzi.columns)

Index(['mzizi_wa_neno', 'nafsi_ya_kwanza_umoja_wakati_uliopita',
       'nafsi_ya_kwanza_umoja_wakati_uliopo',
       'nafsi_ya_kwanza_umoja_wakati_timilifu',
       'nafsi_ya_kwanza_umoja_wakati_ujao',
       'nafsi_ya_kwanza_wingi_wakati_uliopita',
       'nafsi_ya_kwanza_wingi_wakati_uliopo',
       'nafsi_ya_kwanza_umoja_wingi_timilifu',
       'nafsi_ya_kwanza_wingi_wakati_ujao',
       'nafsi_ya_pili_umoja_wakati_uliopita',
       ...
       'utaye', 'mliye.1', 'mnaye', 'mtaye', 'aliye.1', 'anaye', 'ataye',
       'waliye.1', 'wanaye', 'wataye'],
      dtype='object', length=146)


In [14]:
print(vitenzi.head())

  mzizi_wa_neno nafsi_ya_kwanza_umoja_wakati_uliopita  \
0          acha                              niliacha   
1        achama                            niliachama   
2        achana                            niliachana   
3     achanisha                         niliachanisha   
4         achia                             niliachia   

  nafsi_ya_kwanza_umoja_wakati_uliopo nafsi_ya_kwanza_umoja_wakati_timilifu  \
0                            ninaacha                              nimeacha   
1                          ninaachama                            nimeachama   
2                          ninaachana                            nimeachana   
3                       ninaachanisha                         nimeachanisha   
4                           ninaachia                             nimeachia   

  nafsi_ya_kwanza_umoja_wakati_ujao nafsi_ya_kwanza_wingi_wakati_uliopita  \
0                          nitaacha                              tuliacha   
1                        nit

# Build a lookup from conjugated verb forms to roots

In [15]:
# Create lookup dictionary: every conjugated form → root
verb_dict = {}
for _, row in vitenzi.iterrows():
    root = row["mzizi_wa_neno"]
    if not isinstance(root, str):
        continue  # skip rows that have no root
    for form in row.values:
        if isinstance(form, str):  # ignore NaN / empty cells
            verb_dict[form.lower()] = root.lower()
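
One thing to be aware of: if the same conjugated form appears under more than one root, the loop above silently keeps whichever root comes last. The sketch below shows one way such collisions could be detected, using a hypothetical toy table in place of `vitenzi`; the rows, the made-up shared form `"X"`, and the name `build_verb_lookup` are illustrative assumptions.

```python
from collections import defaultdict

def build_verb_lookup(rows):
    """Map each conjugated form to the set of roots it appears under."""
    form_to_roots = defaultdict(set)
    for root, forms in rows:
        root = root.lower()
        form_to_roots[root].add(root)          # a root maps to itself
        for form in forms:
            if isinstance(form, str):          # skip NaN / missing cells
                form_to_roots[form.lower()].add(root)
    return form_to_roots

# Toy rows: (root, conjugated forms). "X" is a made-up form shared by both roots.
rows = [
    ("acha", ["niliacha", "ninaacha", "X"]),
    ("achia", ["niliachia", "ninaachia", "X", float("nan")]),
]
lookup = build_verb_lookup(rows)
ambiguous = {form: roots for form, roots in lookup.items() if len(roots) > 1}
unambiguous_dict = {form: next(iter(roots)) for form, roots in lookup.items() if len(roots) == 1}
print("Ambiguous forms:", ambiguous)
```

Whether ambiguous forms should be dropped, or resolved by some priority rule, is a design choice this sketch leaves open.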


In [16]:
def swahili_lemmatize_sentence(sentence, verb_dict):
    if not isinstance(sentence, str):
        return ""

    # Tokenize sentence
    words = re.findall(r"\w+|[^\w\s]", sentence.lower())

    lemmatized_words = []
    for word in words:
        lemma = verb_dict.get(word, word)  # replace if found
        lemmatized_words.append(lemma)

    return " ".join(lemmatized_words)


In [17]:
df['lemma'] = df['text_sw'].apply(lambda x: swahili_lemmatize_sentence(x, verb_dict))


In [18]:
df['lemma'].head()

0    " akili yangu ni hali isiyobadilika ya wasiwasi , na hata kazi rahisi zaidi nahisi wa haiwezi shindwa . " " ninakuwa na hofu na mashaka , na kila uamuzi jisikia kama uwanja wa mabomu ya kutegwa chini ya ardhi unaongojea lipuka . "
1                                                                                                   licha ya jua kung ' aa na ndege imba nje ya dirisha langu , huzuni yangu pungua sana , kana kwamba naswa katika abiso isiyo na mwisho .
2                                                          mimi ninalemewa na madaraka , kila mmoja akitaka uangalifu wangu , hata hivyo nahofu kwamba hata nijaribu kadiri gani , enda nisiweze kamwe shinda kazi nyingi mno mbele yangu .
3                                 " hisia zangu hubadilika kama upepo , zikiniacha wa na wasiwasi husu mimi ni nani kwa kweli . ninatamani wa imara , lakini ninahofia poteza uwezo wangu wa kinyonga wa changamana na mazingira yoyote . "
4                                                       

In [25]:
import re

# Common Swahili nouns (expandable list)
swahili_nouns = {
    "akili", "hali", "nguvu", "moyo", "wasiwasi", "hofu", "hisia", "mawazo",
    "mashaka", "kazi", "fikira", "mandhari", "dirisha", "machozi", "raha",
    "upendo", "hasira", "huruma", "mtu", "watu", "mwana", "mwili", "siku",
    "ndoto", "maisha", "teknolojia", "muda", "eneo", "ubongo", "ugonjwa",
    "ugumu", "nchi", "ubaya", "ukweli", "uzoefu"
}

prefix_patterns = [
    r'^(?:ni|u|a|tu|m|wa|ki|vi|li|ta|zi|ku|ha|si|isi|mi|wi|na|ya|ka|hu)',
]

def strip_prefixes(word):
    """Strip up to three leading Swahili prefixes; return the original word if nothing is left."""
    original = word
    for _ in range(3):  # try multiple passes
        for pat in prefix_patterns:
            new_word = re.sub(pat, '', word)
            if len(new_word) < len(word):
                word = new_word
    return word if word != "" else original

def clean_complex_forms(word):
    """Handle negations and suffix patterns."""
    patterns = [
        (r'^isiyo(\w+)', r'\1'),
        (r'^si(\w+)', r'\1'),
        (r'^ha(\w+)', r'\1'),
        (r'(\w+)wa$', r'\1'),
        (r'(\w+)ye$', r'\1'),
        (r'(\w+)ni$', r'\1'),
    ]
    for pat, rep in patterns:
        new_word = re.sub(pat, rep, word)
        if new_word != word:
            return new_word
    return word

def is_probably_noun(word):
    """Heuristic check to skip lemmatizing nouns."""
    if word in swahili_nouns:
        return True
    if len(word) <= 3:
        return True
    if word.endswith(("i", "a", "u", "e")) and not word.endswith(("ka", "la", "ta", "na")):
        # many nouns end in vowels, but not typical verb suffixes
        if not re.match(r'^(ni|ki|zi|wa|ta|na)', word):
            return True
    return False

def swahili_lemmatize_sentence(sentence, verb_dict):
    if not isinstance(sentence, str):
        return ""

    words = re.findall(r"\w+|[^\w\s]", sentence.lower())
    lemmatized_words = []

    for word in words:
        # If word looks like a noun, keep it
        if is_probably_noun(word):
            lemmatized_words.append(word)
            continue

        # Try direct dictionary lookup
        lemma = verb_dict.get(word, None)
        if lemma is None:
            cleaned = clean_complex_forms(word)
            stripped = strip_prefixes(cleaned)
            lemma = verb_dict.get(stripped, stripped)

        lemmatized_words.append(lemma)

    return " ".join(lemmatized_words)


In [26]:
df['lemma'] = df['text_sw'].apply(lambda x: swahili_lemmatize_sentence(x, verb_dict))


In [40]:
df['lemma'].head()

0    " akili yangu ni hali badilika ya wasiwasi , na ta kazi rahisi zaidi hisi kuwa haiwezi kushindwa . " " ninaku na hofu na mashaka , na la uamuzi unajisikia kama uwanja wa mabomu ya kutegwa chini ya ardhi unaongojea lipuka . "
1                                                                                                     licha ya jua ng ' aa na ndege imba nje ya dirisha langu , huzuni yangu inapungua sana , kana kwamba naswa tika biso yo na sho .
2                                                                             mimi leme na daraka , la mmoja ka uangalifu ngu , ta hivyo hofu kwamba ta jaribu kadiri gani , huenda weze kamwe kushinda kazi nyingi mno mbele yangu .
3                                                 " hisia zangu badilika kama pepo , acha wa na wasiwasi kuhusu mimi ni na kwa kweli . ma kuwa imara , lakini hofia kupoteza wezo ngu wa nyonga wa changamana na mazingira yoyote . "
4                                                                               
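
The output above shows signs of over-stripping: for example, `hata` becomes `ta` and `kila` becomes `la`. One possible guard, sketched below under the assumption that a stripped form should only be accepted when it is a known verb root, is to fall back to the original word otherwise. The abbreviated prefix pattern, `known_roots`, and `guarded_strip` are hypothetical stand-ins, not the notebook's actual data.

```python
import re

# Hypothetical, abbreviated prefix pattern and root set for illustration
PREFIX = re.compile(r'^(?:ni|tu|wa|na|li|ta|me|ku|ha|ki|u|a|m)')
known_roots = {"acha", "shinda", "soma", "tembea"}

def guarded_strip(word, roots, max_passes=3):
    """Strip up to `max_passes` prefixes; accept the result only if it is a known root."""
    candidate = word
    for _ in range(max_passes):
        stripped = PREFIX.sub('', candidate, count=1)
        if stripped == candidate or stripped == '':
            break  # nothing more to strip, or the word would vanish
        candidate = stripped
        if candidate in roots:
            return candidate
    return word  # no known root found: leave the word untouched

for w in ["hata", "kila", "walitembea", "kushinda", "ninasoma"]:
    print(w, "->", guarded_strip(w, known_roots))
```

In the notebook, the keys or values of `verb_dict` could play the role of `known_roots`, at the cost of leaving verbs outside the dictionary unlemmatized.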

In [None]:
from collections import Counter
import re

def get_word_frequencies(text_series):
    all_words = []
    for text in text_series.dropna():
        words = re.findall(r"\b[a-zA-Z’']+\b", text.lower())
        all_words.extend(words)
    return Counter(all_words)


In [30]:
word_freqs = get_word_frequencies(df['text_sw'])
print(len(word_freqs), "unique words found.")


77802 unique words found.


In [31]:
def auto_detect_nouns(word_freqs, verb_dict, min_freq=2):
    noun_prefixes = ('m', 'wa', 'ki', 'vi', 'u', 'ma', 'mi')
    verb_prefixes = ('ni', 'na', 'ta', 'me', 'li', 'hu', 'ka', 'si', 'ku', 'ha')
    detected_nouns = set()

    for word, freq in word_freqs.items():
        if freq < min_freq:
            continue

        # Skip words known to be verbs
        if word in verb_dict:
            continue

        # Likely noun: starts with a noun prefix but not a verb one
        if word.startswith(noun_prefixes) and not word.startswith(verb_prefixes):
            detected_nouns.add(word)

        # Extra rule: words ending in vowels and long enough
        elif word.endswith(("a", "i", "o", "u", "e")) and len(word) > 4:
            detected_nouns.add(word)

    return detected_nouns


In [32]:
auto_nouns = auto_detect_nouns(word_freqs, verb_dict)
print("Auto-detected nouns:", list(auto_nouns)[:40])


Auto-detected nouns: ['wanasaikolojia', 'sitaenda', 'inaporudi', 'wangu', 'vikiniacha', 'whine', 'sitawahi', 'nighewe', 'kuyafanya', 'yapaswavyo', 'ninachojifunza', 'nilimfukuza', 'wanajihisi', 'kushonwa', 'wakaiga', 'amekasirika', 'hajazungumza', 'nisielewe', 'iliyovunjwa', 'bonigani', 'huzama', 'nitawakatisha', 'mipwito', 'nishikwe', 'yanatia', 'pengi', 'sigara', 'kujifahamu', 'yanayoingiliana', 'inakuumiza', 'mungu', 'nisaidieni', 'alivyoanza', 'hawatasikiliza', 'istima', 'unaovuja', 'miongozo', 'nipoa', 'hupigana', 'nijipe']


In [33]:
final_noun_list = swahili_nouns.union(auto_nouns)


In [35]:
# Flatten all tense columns into one big set of verb forms
verb_forms = {v.lower().strip() for v in vitenzi.values.flatten() if isinstance(v, str)}

# Also include the roots (mzizi_wa_neno)
verb_roots = set(vitenzi['mzizi_wa_neno'].str.lower().tolist())
all_verbs = verb_forms.union(verb_roots)

print(f"✅ Loaded {len(all_verbs)} unique Swahili verb forms.")


✅ Loaded 303581 unique Swahili verb forms.


In [36]:
import re

def refined_auto_detect_nouns(word_freqs, all_verbs, min_freq=2):
    noun_prefixes = ('m', 'wa', 'ki', 'vi', 'u', 'ma', 'mi', 'chi', 'ny', 'ji')
    verb_prefixes = ('ku', 'ni', 'na', 'ta', 'me', 'li', 'hu', 'ka', 'si', 'ha')
    detected_nouns = set()

    for word, freq in word_freqs.items():
        if freq < min_freq or not word.isalpha():
            continue

        w = word.lower().strip()

        # Skip if the word or root is in verbs list
        if w in all_verbs:
            continue

        # Skip if it starts with a verb prefix and ends with 'a'
        if w.startswith(verb_prefixes) and w.endswith("a"):
            continue

        # Likely noun: starts with a noun prefix, not a verb one
        if w.startswith(noun_prefixes) and not w.startswith(verb_prefixes):
            detected_nouns.add(w)

        # Catch other likely nouns: longer words not ending in "a" (e.g. "akili", "shule")
        elif len(w) > 4 and not w.endswith("a"):
            detected_nouns.add(w)

    return detected_nouns


In [37]:
refined_nouns = refined_auto_detect_nouns(word_freqs, all_verbs)
print(f"✅ Auto-detected {len(refined_nouns)} likely nouns")
print(list(refined_nouns)[:40])


✅ Auto-detected 12559 likely nouns
['prose', 'karne', 'umetatanika', 'plani', 'sinki', 'mapokezi', 'flesheni', 'wanasaikolojia', 'kitekinolojia', 'mwizi', 'hawajali', 'hakiwezi', 'ukajifunza', 'hanifanyi', 'waendelee', 'nyuma', 'inaporudi', 'ulichotaka', 'macy', 'wakapatwa', 'warembo', 'unaloona', 'mono', 'kinaendelea', 'wangu', 'wanapoingia', 'hawaoni', 'vikiniacha', 'whine', 'sitawahi', 'kimedumaa', 'unaofanya', 'itambidi', 'septle', 'ninavyowatumaini', 'nighewe', 'ifuatayo', 'hunidhibiti', 'waonekanapo', 'mwingilio']
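
The sample printed above still contains words that look like conjugated verbs (for example `hawajali` and `hakiwezi`). One way to quantify this would be to hand-label a small sample and compute the detector's precision on it. The sketch below assumes such labels exist; the six labels shown are illustrative assumptions and should be replaced by a properly annotated sample, and `precision_on_sample` is a hypothetical helper.

```python
def precision_on_sample(detected, labelled_sample):
    """Share of detected words in the labelled sample that are labelled as nouns."""
    checked = [w for w in labelled_sample if w in detected]
    if not checked:
        return 0.0
    return sum(labelled_sample[w] for w in checked) / len(checked)

# Words taken from the printed output; True means "noun" (illustrative labels)
labelled_sample = {
    "mwizi": True, "karne": True, "warembo": True,
    "hawajali": False, "hakiwezi": False, "waendelee": False,
}
detected = set(labelled_sample)  # all six words appear in the detector's output
precision = precision_on_sample(detected, labelled_sample)
print(f"Precision on sample: {precision:.2f}")
```

In the notebook, `detected` would be `refined_nouns`, and the labelled sample would ideally be drawn at random from it.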


In [38]:
final_noun_list = swahili_nouns.union(refined_nouns)


In [39]:
final_noun_list = {n for n in final_noun_list if not re.match(r"^ku[aeiou]", n)}


In [44]:
def rule_based_swahili_lemmatizer(word):
    """Strip a subject prefix followed by a tense marker from a single word."""
    # Subject prefixes by person and number
    subject_prefixes = {
        'ni': 'I', 'u': 'you (sg)', 'a': 'he/she',
        'tu': 'we', 'm': 'you (pl)', 'wa': 'they'
    }

    # Tense markers
    tense_markers = {
        'na': 'present', 'li': 'past', 'ta': 'future', 'me': 'perfect'
    }

    # Very simple morphological parsing: subject prefix + tense marker + root
    for sp_prefix in subject_prefixes:
        if word.startswith(sp_prefix):
            stem = word[len(sp_prefix):]  # remove the subject prefix
            for t_marker in tense_markers:
                if stem.startswith(t_marker):
                    return stem[len(t_marker):]  # remove the tense marker

    # Fallback: no prefix/tense pattern matched, so return the word unchanged
    return word

# Example usage with verb conjugations
print(f"Lemma for 'walitembea': {rule_based_swahili_lemmatizer('walitembea')}")
print(f"Lemma for 'unasoma': {rule_based_swahili_lemmatizer('unasoma')}")
print(f"Lemma for 'tutapika': {rule_based_swahili_lemmatizer('tutapika')}")
print(f"Lemma for 'nimefanya': {rule_based_swahili_lemmatizer('nimefanya')}")
print(f"Lemma for 'mwanafunzi': {rule_based_swahili_lemmatizer('mwanafunzi')}")



Lemma for 'walitembea': tembea
Lemma for 'unasoma': soma
Lemma for 'tutapika': pika
Lemma for 'nimefanya': fanya
Lemma for 'mwanafunzi': mwanafunzi


In [45]:
df['lemma_1']=df['text_sw'].apply(rule_based_swahili_lemmatizer)

In [46]:
df['lemma_1'].head()

0    "akili yangu ni hali isiyobadilika ya wasiwasi, na hata kazi rahisi zaidi nahisi kuwa haiwezi kushindwa." "ninakuwa na hofu na mashaka, na kila uamuzi unajisikia kama uwanja wa mabomu ya kutegwa chini ya ardhi unaongojea kulipuka."
1                                                                                              licha ya jua kung'aa na ndege wakiimba nje ya dirisha langu, huzuni yangu inapungua sana, kana kwamba nimenaswa katika abiso isiyo na mwisho.
2                                                           mimi ninalemewa na madaraka, kila mmoja akitaka uangalifu wangu, hata hivyo nahofu kwamba hata nijaribu kadiri gani, huenda nisiweze kamwe kushinda kazi nyingi mno mbele yangu.
3                            "hisia zangu hubadilika kama upepo, zikiniacha nikiwa na wasiwasi kuhusu mimi ni nani kwa kweli. ninatamani kuwa imara, lakini ninahofia kupoteza uwezo wangu wa kinyonga wa kuchangamana na mazingira yoyote."
4                                                   
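
The `lemma_1` column above is identical to `text_sw` because `rule_based_swahili_lemmatizer` expects a single word but receives a whole sentence from `apply`. A minimal sketch of a sentence-level wrapper is shown below. `lemmatize_sentence` is a hypothetical helper, the tokenisation regex mirrors the one used earlier in the notebook, and the word-level function is a condensed copy so that the block runs on its own.

```python
import re

SUBJECT_PREFIXES = ('ni', 'u', 'a', 'tu', 'm', 'wa')
TENSE_MARKERS = ('na', 'li', 'ta', 'me')

def rule_based_swahili_lemmatizer(word):
    """Condensed copy of the word-level rule: subject prefix + tense marker."""
    for sp in SUBJECT_PREFIXES:
        if word.startswith(sp):
            stem = word[len(sp):]
            for tm in TENSE_MARKERS:
                if stem.startswith(tm):
                    return stem[len(tm):]
    return word

def lemmatize_sentence(sentence, lemmatize_word):
    """Tokenise, lemmatize each token, and re-join with spaces."""
    if not isinstance(sentence, str):
        return ""
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    return " ".join(lemmatize_word(tok) for tok in tokens)

result = lemmatize_sentence("Walitembea na ninasoma kitabu", rule_based_swahili_lemmatizer)
print(result)
```

In the notebook this could be used as `df['text_sw'].apply(lambda s: lemmatize_sentence(s, rule_based_swahili_lemmatizer))`; the wrapper accepts any word-level function.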

In [43]:
from transformers import pipeline

# Load the Swahili POS tagger pipeline
tagger = pipeline("token-classification", model="masakhane/swahili-pos-tagger-afroxlmr", aggregation_strategy="simple")

text = "Mwanafunzi anasoma kitabu."
result = tagger(text)

print(result)

# Expected output will show the word, its tag, and confidence score
# Example (simplified):
# [
#     {'word': 'Mwanafunzi', 'entity_group': 'N', ...},
#     {'word': 'anasoma', 'entity_group': 'V', ...},
#     {'word': 'kitabu', 'entity_group': 'N', ...}
# ]


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 10.5M/2.24G [00:00<?, ?B/s]

KeyboardInterrupt: 

In [47]:
from simplemma import lemmatize

# Lemmatize a conjugated verb
word = "walitembea"
lemma = lemmatize(word, lang='sw')
print(f"The lemma for '{word}' is: {lemma}")

# Lemmatize a plural noun
word = "vitabu"
lemma = lemmatize(word, lang='sw')
print(f"The lemma for '{word}' is: {lemma}")

# Lemmatize a word that is already in its base form
word = "shule"
lemma = lemmatize(word, lang='sw')
print(f"The lemma for '{word}' is: {lemma}")


The lemma for 'walitembea' is: tembea
The lemma for 'vitabu' is: kitabu
The lemma for 'shule' is: shule


In [None]:
from simplemma import lemmatize

def swahili_lemmatize(word):
    """Return the simplemma lemma for a single Swahili word."""
    return lemmatize(word, lang='sw')

In [49]:
df['lemma_2'] = df['text_sw'].apply(swahili_lemmatize)

In [50]:
df['lemma_2'].head()

0    "akili yangu ni hali isiyobadilika ya wasiwasi, na hata kazi rahisi zaidi nahisi kuwa haiwezi kushindwa." "ninakuwa na hofu na mashaka, na kila uamuzi unajisikia kama uwanja wa mabomu ya kutegwa chini ya ardhi unaongojea kulipuka."
1                                                                                              licha ya jua kung'aa na ndege wakiimba nje ya dirisha langu, huzuni yangu inapungua sana, kana kwamba nimenaswa katika abiso isiyo na mwisho.
2                                                           mimi ninalemewa na madaraka, kila mmoja akitaka uangalifu wangu, hata hivyo nahofu kwamba hata nijaribu kadiri gani, huenda nisiweze kamwe kushinda kazi nyingi mno mbele yangu.
3                            "hisia zangu hubadilika kama upepo, zikiniacha nikiwa na wasiwasi kuhusu mimi ni nani kwa kweli. ninatamani kuwa imara, lakini ninahofia kupoteza uwezo wangu wa kinyonga wa kuchangamana na mazingira yoyote."
4                                                   
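
The `lemma_2` column above also matches `text_sw`, which suggests that `lemmatize` was given whole sentences rather than single tokens. A quick sanity check that would catch this is to measure how many rows a lemmatization step actually changed. The sketch below uses stand-in functions rather than simplemma itself: `changed_fraction`, `sentence_level`, `token_level`, and the toy lemma table (taken from the three example outputs shown earlier) are assumptions for illustration.

```python
def changed_fraction(originals, lemmatized):
    """Fraction of rows whose text differs after lemmatization."""
    pairs = list(zip(originals, lemmatized))
    if not pairs:
        return 0.0
    return sum(o != l for o, l in pairs) / len(pairs)

# Toy word-level lemma table based on the example outputs shown earlier
toy_lemmas = {"walitembea": "tembea", "vitabu": "kitabu", "shule": "shule"}

def sentence_level(text):
    """Stand-in for a word-level lookup given a full sentence: no entry matches."""
    return toy_lemmas.get(text, text)

def token_level(text):
    """Same lookup, applied token by token."""
    return " ".join(toy_lemmas.get(tok, tok) for tok in text.split())

texts = ["walitembea shule", "vitabu shule", "shule"]
frac_sentence = changed_fraction(texts, [sentence_level(t) for t in texts])
frac_token = changed_fraction(texts, [token_level(t) for t in texts])
print(frac_sentence, frac_token)
```

A fraction at or near zero after a lemmatization step is a strong hint that the function was applied at the wrong granularity.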

In [None]:
from transformers import pipeline

# Load the Masakhane Swahili POS tagger
# Note: a token-classification pipeline predicts a label per token (here a POS tag), not a lemma
model_name = "masakhane/swahili-pos-tagger-afroxlmr"
pos_tagger = pipeline("token-classification", model=model_name, aggregation_strategy="simple")

text = "Wanafunzi walitembea shuleni."
result = pos_tagger(text)

# Each item holds the word, its predicted tag, and a confidence score.
# Lemmas would have to come from a separate lemmatization step.
for token in result:
    print(f"Word: {token['word']}, POS: {token['entity_group']}")

