# Nigerian English classification model

## Table of contents:

1. [Importing libraries](#Libraries)
2. [Loading dataset](#Data)
3. [Merging Dataset](#merge)
4. [Preprocessing Dataset](#Preprocess)
5. [Model Building](#Modelling)

<a name="Libraries"></a>
## 1. Importing libraries

In [72]:
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from transformers import BertTokenizer, BertModel
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

<a name="Data"></a>
## 2. Loading dataset

#### Entering the Nigerian English dataset manually from the Nigerian dictionary

In [57]:
# Nigerian dataset
nig_data = [
    ("I’m about leaving", "Nig_enlish"),
    ("We love ourselves abroad", "Nig_enlish"),
    ("it is an abuse", "Nig_enlish"),
    ("that boy abused him well- well", "Nig_enlish"),
    ("State government has many people chopping money. Actually!", "Nig_enlish"),
    ("Add more!", "Nig_enlish"),
    ("not African time, please!", "Nig_enlish"),
    ("Some time after", "Nig_enlish"),
    ("We will go to the market", "Nig_enlish"),
    ("I have no money again", "Nig_enlish"),
    ("These agric beans do not taste well or He bought an agric plough", "Nig_enlish"),
    ("The alhajis have bought up all the petrol", "Nig_enlish"),
    ("I believe all what you say", "Nig_enlish"),
    ("they are among", "Nig_enlish"),
    ("he gave me some amount", "Nig_enlish"),
    ("Kwaran, Bayelsan", "Nig_enlish"),
    ("they are sewing and co. for the wedding", "Nig_enlish"),
    ("he answers Obi", "Nig_enlish"),
    ("the children were playing in the garden, anyhow", "Nig_enlish"),
    ("he is an applicant", "Nig_enlish"),
    ("art by Laranto Arts,Jos", "Nig_enlish"),
    ("art-works", "Nig_enlish"),
    ("The designers…spared no details with aso-oke", "Nig_enlish"),
    ("Husband and wife were wearing aso oke", "Nig_enlish"),
    ("they give us too many assignments", "Nig_enlish"),
    ("people were not enjoying, at all", "Nig_enlish"),
    ("Is this man here?", "Nig_enlish"),
    ("now he is elected Governor, automatically he will chop money", "Nig_enlish"),
    ("my papa is in the village", "Nig_enlish"),
    ("In America they don't back their babies as we do here", "Nig_enlish"),
    ("I just backed him and pretended not to see him", "Nig_enlish"),
    ("He was standing at my back", "Nig_enlish"),
    ("a bag of money", "Nig_enlish"),
    ("Adamu has bagged the governorship", "Nig_enlish"),
    ("He has bagged an M.A", "Nig_enlish"),
    ("I will balance you fifty Naira", "Nig_enlish"),
    ("See her just balancing there feeling cool about herself", "Nig_enlish"),
    ("I have no time to bandy with them", "Nig_enlish"),
    ("She is barbing him", "Nig_enlish"),
    ("Viable small businesses … are mostly in such trades as hair dressing and barbing salons, tailoring shops as well as “pure water” packaging businesses.", "Nig_enlish"),
    ("Have you taken your bath?", "Nig_enlish"),
    ("Do you want to bath?", "Nig_enlish"),
    ("this moto is for you", "Nig_enlish"),
    ("it is with him", "Nig_enlish"),
    ("If you do that again I am going to beat you", "Nig_enlish"),
    ("rain will beat you", "Nig_enlish"),
    ("I will put bedsheet for you", "Nig_enlish"),
    ("I get Beetle", "Nig_enlish"),
    ("Do not do this thing-o, I beg!", "Nig_enlish"),
    ("We enjoy bitterleaf soup", "Nig_enlish"),
    ("Blindman can often be seen begging outside the mosque", "Nig_enlish"),
    ("That black boy came to visit you", "Nig_enlish"),
    ("That boy is just bluffing", "Nig_enlish"),
    ("How is Bomboy?", "Nig_enlish"),
    ("Borrow me this money", "Nig_enlish"),
    ("But it’s the bottom men that make the leader", "Nig_enlish"),
    ("I am going to branch at his house", "Nig_enlish"),
    ("give me ten bread!", "Nig_enlish"),
    ("NEPA has brought light again", "Nig_enlish"),
    ("Bring money!", "Nig_enlish"),
    ("Broken English", "Nig_enlish"),
    ("take this bush track and you will burst out onto the Onitsha road", "Nig_enlish"),
    ("In the rickety wooden huts of Tselekwu, kerosine bush lamps fight a losing battle to dispel some of the pervasive darkness.", "Nig_enlish"),
    ("bushmeat soup", "Nig_enlish"),
    ("buttered bread", "Nig_enlish"),
    ("He is butchering the fish", "Nig_enlish"),
    ("this cash-madam is too tough", "Nig_enlish"),
    ("sickness has caught me", "Nig_enlish"),
    ("he prefers chaiking to reading", "Nig_enlish"),
    ("bring money, or I will charge you to court", "Nig_enlish"),
    ("we will go check him", "Nig_enlish"),
    ("these people only give chicken-change", "Nig_enlish"),
    ("the thorn has chooked me", "Nig_enlish"),
    ("can you chop rice?", "Nig_enlish"),
    ("He has chopped too much money", "Nig_enlish"),
    ("let us go for chop", "Nig_enlish"),
    ("I chop-congo today", "Nig_enlish"),
    ("since we had civilisation", "Nig_enlish"),
    ("He is my class-mate", "Nig_enlish"),
    ("Clean off that writing!", "Nig_enlish"),
    ("Clear the car!", "Nig_enlish"),
    ("Park well!", "Nig_enlish"),
    ("climb down from that machine", "Nig_enlish"),
    ("they have closed", "Nig_enlish"),
    ("co-wives always quarrel", "Nig_enlish"),
    ("Now we have reached the coal-tar again", "Nig_enlish"),
    ("Collect your loads", "Nig_enlish"),
    ("Our colonial masters taught us this", "Nig_enlish"),
    ("He is showing colonial mentality", "Nig_enlish"),
    ("I will come down here", "Nig_enlish"),
    ("he is a company man", "Nig_enlish"),
    ("the community forwarded their complains to higher authorities", "Nig_enlish"),
    ("your money is not complete", "Nig_enlish"),
    ("the market burned for complete ten days", "Nig_enlish"),
    ("He has conceived her", "Nig_enlish"),
    ("the moto is condemned", "Nig_enlish"),
    ("the place is not conducive", "Nig_enlish"),
    ("he is just a confusionist", "Nig_enlish"),
    ("she is cooking the water", "Nig_enlish"),
    ("That song is copyright.", "Nig_enlish"),
    ("they are not correct", "Nig_enlish")
]



In [58]:
nig_data

[('I’m about leaving', 'Nig_enlish'),
 ('We love ourselves abroad', 'Nig_enlish'),
 ('it is an abuse', 'Nig_enlish'),
 ('that boy abused him well- well', 'Nig_enlish'),
 ('State government has many people chopping money. Actually!', 'Nig_enlish'),
 ('Add more!', 'Nig_enlish'),
 ('not African time, please!', 'Nig_enlish'),
 ('Some time after', 'Nig_enlish'),
 ('We will go to the market', 'Nig_enlish'),
 ('I have no money again', 'Nig_enlish'),
 ('These agric beans do not taste well or He bought an agric plough',
  'Nig_enlish'),
 ('The alhajis have bought up all the petrol', 'Nig_enlish'),
 ('I believe all what you say', 'Nig_enlish'),
 ('they are among', 'Nig_enlish'),
 ('he gave me some amount', 'Nig_enlish'),
 ('Kwaran, Bayelsan', 'Nig_enlish'),
 ('they are sewing and co. for the wedding', 'Nig_enlish'),
 ('he answers Obi', 'Nig_enlish'),
 ('the children were playing in the garden, anyhow', 'Nig_enlish'),
 ('he is an applicant', 'Nig_enlish'),
 ('art by Laranto Arts,Jos', 'Nig_enlish'),
 ...]

In [59]:
# Converting the list of tuples into a DataFrame
columns = ['Text', 'label']
nig_data = pd.DataFrame(nig_data, columns=columns)

In [60]:
nig_data

| | Text | label |
|---|---|---|
| 0 | I’m about leaving | Nig_enlish |
| 1 | We love ourselves abroad | Nig_enlish |
| 2 | it is an abuse | Nig_enlish |
| 3 | that boy abused him well- well | Nig_enlish |
| 4 | State government has many people chopping mone... | Nig_enlish |
| ... | ... | ... |
| 96 | the place is not conducive | Nig_enlish |
| 97 | he is just a confusionist | Nig_enlish |
| 98 | she is cooking the water | Nig_enlish |
| 99 | That song is copyright. | Nig_enlish |


#### Importing the English dataset

In [44]:
eng_data = pd.read_csv(r'C:\Users\ADMIN\Documents\BERT\brown.csv')
column_en = eng_data['tokenized_text'][:101]

In [43]:
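# Remove symbols and digits, keeping only letters and spaces
def remove_symbols(text):
    # Missing values are float NaN, not strings, and `x == np.nan` is always
    # False, so test the type instead and return an empty string
    if not isinstance(text, str):
        return ''
    # Keep alphabetic characters and spaces; drop everything else
    return ''.join(ch for ch in text if ch.isalpha() or ch == ' ')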

In [None]:
column_en = column_en.apply(remove_symbols)

In [46]:
column_en

0      Furthermore  as an encouragement to revisionis...
1      The Unitarian clergy were an exclusive club of...
2      Ezra Stiles Gannett  an honorable representati...
3      Even so  Gannett judiciously argued  the Assoc...
4      We today are not entitled to excoriate honest ...
                             ...                        
96     It is hardly possible to emphasize this too much 
97     Most people do not realize that the congregati...
98     The idea that it is a feature of all religions...
99     The Jewish synagogue affords a parallel to the...
100    Their characteristic experience is that of the...
Name: tokenized_text, Length: 101, dtype: object
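
To make the effect of the cleaning step concrete, here is a minimal, self-contained sketch. The helper name `keep_letters_and_spaces` and the sample sentences are illustrative assumptions, not part of the notebook. The logic follows the intent of `remove_symbols`: keep alphabetic characters and spaces, drop everything else, and treat a missing value as an empty string.

```python
def keep_letters_and_spaces(text):
    """Hypothetical stand-in for remove_symbols: keep letters and spaces only."""
    # Missing values read by pandas arrive as float NaN, not str
    if not isinstance(text, str):
        return ''
    return ''.join(ch for ch in text if ch.isalpha() or ch == ' ')

# Punctuation is dropped, but the spaces around it are kept
print(repr(keep_letters_and_spaces("Even so, Gannett argued.")))
# → 'Even so Gannett argued'

# Text that was tokenized with spaces around punctuation keeps those spaces,
# which is why double spaces appear in the cleaned column above
print(repr(keep_letters_and_spaces("Even so , Gannett argued")))

# A missing value becomes an empty string instead of raising an error
print(repr(keep_letters_and_spaces(float("nan"))))
# → ''
```

Collapsing the leftover runs of spaces is not required for the vectorizers used later, since they split on word boundaries anyway.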

<a name="merge"></a>
## 3. Merging Dataset

In [61]:
df_1 = nig_data.copy()
df_1["label"] = 1

df_2 = pd.DataFrame([column_en]).transpose()
df_2["label"] = 0
df_2.rename(columns = {'tokenized_text':'Text'}, inplace = True)

df = pd.concat([df_1, df_2])
print(df)

                                                  Text  label
0                                    I’m about leaving      1
1                             We love ourselves abroad      1
2                                       it is an abuse      1
3                       that boy abused him well- well      1
4    State government has many people chopping mone...      1
..                                                 ...    ...
96   It is hardly possible to emphasize this too much       0
97   Most people do not realize that the congregati...      0
98   The idea that it is a feature of all religions...      0
99   The Jewish synagogue affords a parallel to the...      0
100  Their characteristic experience is that of the...      0

[202 rows x 2 columns]
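
Note that the printed index restarts at 0 for the second block, so every index label from 0 to 100 appears twice in `df`. This is harmless for the positional indexing used later, but label-based lookups would return two rows. A minimal sketch of the difference, using two tiny stand-in frames (their contents are illustrative only) and the `ignore_index` option of `pd.concat`:

```python
import pandas as pd

# Toy stand-ins for the two labelled frames
nig = pd.DataFrame({"Text": ["Borrow me this money", "Add more!"], "label": 1})
eng = pd.DataFrame({"Text": ["It is hardly possible", "Most people do not realize"], "label": 0})

stacked = pd.concat([nig, eng])                        # keeps each frame's own index
renumbered = pd.concat([nig, eng], ignore_index=True)  # assigns a fresh 0..n-1 index

print(stacked.index.tolist())     # → [0, 1, 0, 1]
print(renumbered.index.tolist())  # → [0, 1, 2, 3]
print(stacked.index.is_unique, renumbered.index.is_unique)
```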


<a name="Preprocess"></a>
## 4. Preprocessing Dataset

In [64]:
# Download the NLTK tokenizer models and stopword list
nltk.download('punkt')
nltk.download('stopwords')

# Build these once instead of on every call (or every token)
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Preprocessing function: tokenize, lowercase, drop non-alphabetic tokens
# and stopwords, then stem
def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [word.lower() for word in tokens if word.isalpha()]
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [stemmer.stem(word) for word in tokens]
    return ' '.join(tokens)

df['Preprocessed Text'] = df['Text'].apply(preprocess_text)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [65]:
df.head()

| | Text | label | Preprocessed Text |
|---|---|---|---|
| 0 | I’m about leaving | 1 | leav |
| 1 | We love ourselves abroad | 1 | love abroad |
| 2 | it is an abuse | 1 | abus |
| 3 | that boy abused him well- well | 1 | boy abus well |
| 4 | State government has many people chopping mone... | 1 | state govern mani peopl chop money actual |


In [66]:
# Split dataset into features and labels
# (X is replaced by the vectorized features in the next cell)
X = df.drop('label', axis=1)
y = df['label']

In [70]:
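# Note: this vectorizes the raw 'Text' column, not 'Preprocessed Text'
# (the stemmed column created above is left unused)
vectorizer = CountVectorizer(max_features=300, min_df=1, max_df=0.8)
X = vectorizer.fit_transform(df['Text']).toarray()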

In [73]:
# Reweight the raw term counts as TF-IDF scores
tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()
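
The two cells above build TF-IDF features in two steps, counts first and reweighting second. The `TfidfVectorizer` class imported at the top does the same thing in one step. The sketch below checks this on a four-sentence toy corpus (an illustrative assumption, not the notebook's data) with the same vectorizer settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus for illustration only
corpus = [
    "I will balance you fifty Naira",
    "He has chopped too much money",
    "It is hardly possible to emphasize this too much",
    "Most people do not realize that",
]
params = dict(max_features=300, min_df=1, max_df=0.8)

# Two steps: raw counts, then TF-IDF reweighting
counts = CountVectorizer(**params).fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts).toarray()

# One step: TfidfVectorizer combines both
one_step = TfidfVectorizer(**params).fit_transform(corpus).toarray()

print(two_step.shape == one_step.shape)
print(np.allclose(two_step, one_step))
```

With the default settings each row is also scaled to unit Euclidean length, so long and short sentences end up on a comparable scale.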

In [75]:
# Shuffle the rows, then hold out 20% for testing.
# The permutation is unseeded, so the split and the scores below vary between runs.
p = np.random.permutation(len(X))
X_train, X_test, y_train, y_test = train_test_split(X[p], df['label'].iloc[p], test_size=0.2, random_state=0)

In [81]:
X_train

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.23859303, 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
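
With 202 rows and `test_size=0.2`, the test set holds 41 rows, which is the support total shown in the classification reports. In the run recorded here those 41 rows split 18 to 23 between the two classes. The sketch below reproduces the split sizes on placeholder data of the same shape and, as an optional variation that the notebook does not use, passes `stratify=y` so that both classes are represented almost equally in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels with the same shape as the merged data:
# 202 rows, 101 per class (the feature values themselves are arbitrary)
X = np.arange(202).reshape(-1, 1)
y = np.array([1] * 101 + [0] * 101)

# stratify=y is an assumption added for illustration; it keeps the class
# proportions of y in both the training and the test portion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(len(X_train), len(X_test))  # → 161 41
print(np.bincount(y_test))        # test rows per class, differing by at most one
```

Because `train_test_split` shuffles by default and is seeded through `random_state`, no separate permutation is needed in this variant.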

<a name="Modelling"></a>
## 5. Model Building

### KNN Classifier

In [76]:
# Train KNN classifier
knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(X_train, y_train)
knn_predictions = knn_classifier.predict(X_test)
knn_accuracy = accuracy_score(y_test, knn_predictions)

print("KNN Accuracy:", knn_accuracy)
print("KNN Classification Report:\n", classification_report(y_test, knn_predictions))


KNN Accuracy: 0.5609756097560976
KNN Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00        18
           1       0.56      1.00      0.72        23

    accuracy                           0.56        41
   macro avg       0.28      0.50      0.36        41
weighted avg       0.31      0.56      0.40        41





### Naive Bayes Classifier

In [77]:
# Naive Bayes Classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
nb_predictions = nb_classifier.predict(X_test)
nb_accuracy = accuracy_score(y_test, nb_predictions)

print("Naive Bayes Accuracy:", nb_accuracy)
print("Naive Bayes Classification Report:\n", classification_report(y_test, nb_predictions))

Naive Bayes Accuracy: 0.7073170731707317
Naive Bayes Classification Report:
               precision    recall  f1-score   support

           0       0.61      0.94      0.74        18
           1       0.92      0.52      0.67        23

    accuracy                           0.71        41
   macro avg       0.77      0.73      0.70        41
weighted avg       0.78      0.71      0.70        41
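
The two reports can be cross-checked by rebuilding the confusion matrices they imply. The matrices below are a reconstruction from the rounded recall and support figures, not output produced by the notebook, and `per_class_metrics` is a hypothetical helper written for this sketch. The KNN matrix shows that the classifier predicted class 1 for all 41 test sentences, so its accuracy is simply the share of class 1 in the test set (23 of 41).

```python
def per_class_metrics(matrix):
    """matrix[i][j] = number of samples with true class i predicted as class j."""
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n_classes))
    metrics = {}
    for c in range(n_classes):
        predicted_c = sum(matrix[i][c] for i in range(n_classes))
        actual_c = sum(matrix[c])
        # Precision is undefined when a class is never predicted; report 0.0,
        # matching the 0.00 shown in the KNN report
        precision = matrix[c][c] / predicted_c if predicted_c else 0.0
        recall = matrix[c][c] / actual_c if actual_c else 0.0
        metrics[c] = (round(precision, 2), round(recall, 2))
    return correct / total, metrics

# Rows: true class 0, true class 1. Columns: predicted 0, predicted 1.
knn_matrix = [[0, 18], [0, 23]]   # everything predicted as class 1
nb_matrix = [[17, 1], [11, 12]]   # reconstructed from recall 0.94 and 0.52

knn_accuracy, knn_metrics = per_class_metrics(knn_matrix)
nb_accuracy, nb_metrics = per_class_metrics(nb_matrix)

print(knn_accuracy, knn_metrics)  # accuracy 23/41; class 0 is never predicted
print(nb_accuracy, nb_metrics)    # accuracy 29/41
```

Read this way, Naive Bayes recovers almost all standard English sentences but labels 11 of the 23 Nigerian English sentences as standard English, which is where most of its errors come from.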

