The project is a swahili sentiment analyzer. <br> The model will be using Spark NLP, a natural language processing library built on top of Apache spark and its ML library. <br> The library has been trained using natural language datasets, especially on the swahili language. Using the XLM-RoBERTa model, a multilingual language model, the purpose of the project is to evaluate the sentiments of swahili text and news headlines. The data was fetched from [Zenodo](https://zenodo.org/record/3553423).

1. Installing pyspark and spark-nlp

In [2]:
! pip install -q pyspark==3.3.2 spark-nlp==4.3.1

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m281.4/281.4 MB[0m [31m4.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m471.7/471.7 KB[0m [31m19.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.7/199.7 KB[0m [31m9.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for pyspark (setup.py) ... [?25l[?25hdone


2. Importing libraries

In [6]:
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import nltk
nltk.download('punkt')

import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline
from pyspark.sql.types import StringType, IntegerType
import pandas as pd
import re
from nltk.tokenize import word_tokenize


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


3. Starting a spark NLP session

In [4]:
spark = sparknlp.start()

4. The NLP Model

In [41]:
#Document 
document_assembler = DocumentAssembler() \
      .setInputCol("text") \
      .setOutputCol("document")
#tokenize
tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")
    
normalizer = Normalizer() \
      .setInputCols(["token"]) \
      .setOutputCol("normalized")
#Removing stopwords
stopwords_cleaner = StopWordsCleaner.pretrained("stopwords_sw", "sw") \
        .setInputCols(["normalized"]) \
        .setOutputCol("cleanTokens")\
        .setCaseSensitive(False)

embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base_finetuned_swahili", "sw")\
    .setInputCols(["document", "cleanTokens"])\
    .setOutputCol("embeddings")

embeddingsSentence = SentenceEmbeddings() \
      .setInputCols(["document", "embeddings"]) \
      .setOutputCol("sentence_embeddings") \
      .setPoolingStrategy("AVERAGE")
#Sentiment classifier
sentimentClassifier = ClassifierDLModel.pretrained("classifierdl_xlm_roberta_sentiment", "sw") \
  .setInputCols(["sentence_embeddings"]) \
  .setOutputCol("class_")

stopwords_sw download started this may take some time.
Approximate size to download 1.5 KB
[OK!]
xlm_roberta_base_finetuned_swahili download started this may take some time.
Approximate size to download 994.1 MB
[OK!]
classifierdl_xlm_roberta_sentiment download started this may take some time.
Approximate size to download 21.9 MB
[OK!]


5. Spark sentiment analysis pipeline

In [17]:
sw_pipeline = Pipeline(
    stages=[
        document_assembler, 
        tokenizer, 
        normalizer, 
        stopwords_cleaner, 
        embeddings, 
        embeddingsSentence, 
        sentimentClassifier
        ])

6. Importing swahili stopwords (Common Swahili Stop-words.csv). This csv file will be packaged together with the notebook.

In [7]:
from google.colab import files
swastopwords=files.upload()

Saving Common Swahili Stop-words.csv to Common Swahili Stop-words.csv


In [8]:
st=pd.read_csv('Common Swahili Stop-words.csv')
st

Unnamed: 0,StopWords
0,na
1,lakini
2,ingawa
3,ingawaje
4,kwa
...,...
250,nini
251,hasa
252,huu
253,zako


7. Cleaning unwanted characters

In [9]:
def preprocess_texts(text):
    text = text.strip().lower()
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    words = word_tokenize(text)
    words = [w for w in words if not w in st]
    return ' '.join(words)

8. Importing the data (swatxt .csv).This csv file will be packaged together with the notebook.

In [10]:
from google.colab import files
swatxt=files.upload()

Saving swatxt .csv to swatxt .csv


9. Inserting the imported file into a dataframe

In [11]:
dt = pd.read_csv('swatxt .csv')
swatxtdf= dt['text']
swatxtdf

0        taarifa hiyo ilisema kuwa ongezeko la joto la ...
1        aidha ilisema kuwa mwelekeo wa kupungua kwa jo...
2        mwelekeo wa mvua wa septemba hadi desemba ishi...
3        ilifafanua kuwa msimu wa vuli UNK maeneo ambay...
4        katika maeneo hayo mvua zinatarajiwa kunyesha ...
                               ...                        
42063    sioni kama bwana mengi amefanya ujasiri wowote...
42064    jambo la busara ilikuwa kuchukua ushahidi alio...
42065    kama rais UNK ushahidi huo ndipo UNK na kuelez...
42066    profesa lipumba alidai bwana mengi anacheza ka...
42067    kwa kutaja majina hayo anataka kuufanya umma w...
Name: text, Length: 42068, dtype: object

10. Cleaning the data above

In [21]:
cleanedswatxtdf=swatxtdf.apply(preprocess_texts)

11. Changing the dataframe to a list

In [22]:
swatxtlist=cleanedswatxtdf.values.tolist()
swatxtlist

['taarifa hiyo ilisema kuwa ongezeko la joto la maji juu ya wastani katikati ya bahari ya unk inaashiria kuwepo kwa mvua za el nino unk hadi mwishoni mwa april ishirini moja sifuri imeelezwa kuwa ongezeko la joto magharibi mwa bahari ya hindi linatarajiwa kuhamia katikati ya bahari hiyo hali ambayo itasababisha pepo kutoka kaskazini mashariki kuvuma kuelekea bahari ya hindi',
 'aidha ilisema kuwa mwelekeo wa kupungua kwa joto kusini mashariki mwa bahari ya atlantic unk kusababisha pepo kutoka magharibi kuvuma kuelekea magharibi mwa tanzania katika maeneo ya ziwa victoria',
 'mwelekeo wa mvua wa septemba hadi desemba ishirini sifuri tisa unatarajiwa kuwa katika namna tofauti ambapo baadhi ya maeneo yanaweza kunufaika huku mengine unk',
 'ilifafanua kuwa msimu wa vuli unk maeneo ambayo hupata mvua mara mbili ambayo ni kaskazini mwa nchi ikiwa ni nyanda za juu kaskazini mashariki kanda ya ziwa victoria na pwani ya kaskazini',
 'katika maeneo hayo mvua zinatarajiwa kunyesha wiki ya pili na

12. Sentiment analysis of the headlines imported at 10

In [37]:
analyzedswadf = spark.createDataFrame(swatxtlist,StringType()).toDF("text")
result = sw_pipeline.fit(analyzedswadf).transform(analyzedswadf)

13. Display the sentiment analysis result

In [38]:
result.select(F.explode(F.arrays_zip(result.document.result, 
                                     result.class_.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("text"),
              F.expr("cols['1']").alias('sentiment')).show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+
|text                                                                                                                                                                                                                                                                                                                                                                                  |sentiment|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [39]:
light_pipeline = LightPipeline(sw_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

14. Testing the model

In [42]:
swahilistatement = input("Enter text: ")
result1 = light_pipeline.annotate(swahilistatement)
print("The sentiment is:")
print(result1["class_"])

Enter text: Nafurahia kuwaone siku leo hii
The sentiment is:
['Positive']


References

---
<br>

Swahili NLP Model[John Snow Labs](https://nlp.johnsnowlabs.com/2021/12/29/classifierdl_xlm_roberta_sentiment_sw.html)