### Sihle_Riti_Classification_Hack

## 1. Introduction

South Africa is a multicultural society characterised by rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and to contribute to the social, cultural, intellectual, economic and political life of South African society.

1. Introduction
2. Import libraries and load data
3. Data pre-processing
4. Exploratory Data Analysis

In [58]:
# Standard
import pandas as pd
import numpy as np
import time
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Natural language Processing
import nltk
import string
import re
from sklearn.utils import resample
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

# Models
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LogisticRegression, SGDClassifier, RidgeClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV, RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier, StackingClassifier

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, precision_score, recall_score

# Performance (f1_score is already imported from sklearn.metrics above)
from sklearn import metrics

# Exploratory Data Analysis
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud, STOPWORDS

## 2. Import libraries and load data

In [59]:
train = pd.read_csv('train_set.csv')
test = pd.read_csv('test_set.csv')
sample = pd.read_csv('sample_submission.csv')

In [60]:
train.head(11)

Unnamed: 0,lang_id,text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...
2,eng,the province of kwazulu-natal department of tr...
3,nso,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...
5,nso,dinyakišišo tše tša go dirwa gabedi ka ngwaga ...
6,tsn,kgetse nngwe le nngwe e e sa faposiwang mo tsh...
7,ven,mbadelo dze dza laelwa dzi do kwama mahatulele...
8,nso,maloko a dikhuduthamaga a ikarabela mongwe le ...
9,tsn,fa le dirisiwa lebone le tshwanetse go bontsha...


In [61]:
test.head(11)

Unnamed: 0,index,text
0,1,"Mmasepala, fa maemo a a kgethegileng a letlele..."
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.
3,4,Kube inja nelikati betingevakala kutsi titsini...
4,5,Winste op buitelandse valuta.
5,6,"Ke feela dilense tše hlakilego, tša pono e tee..."
6,7,<fn>(762010101403 AM) 1495 Final Gems Birthing...
7,8,Ntjhafatso ya konteraka ya mosebetsi: Etsa bon...
8,9,u-GEMS uhlinzeka ngezinzuzo zemithi yezifo ezi...
9,10,"So, on occasion, are statistics misused."


In [62]:
sample.head()

Unnamed: 0,index,lang_id
0,1,tsn
1,2,nbl


In [63]:
def clean_text(text):
    """Lower-case the text, mask URLs, and strip punctuation, digits and stray characters."""
    # convert all words to lower case
    text = text.lower()

    # replace every URL with the placeholder 'url-web'
    # (the hyphen becomes a space in the next step, leaving 'url web')
    url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
    web = 'url-web'
    text = re.sub(url, web, text)

    # turn hyphens and underscores into spaces, then drop remaining punctuation and digits
    text = re.sub(r'[-_]', ' ', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'[0-9]+', '', text)
    text = re.sub(r'[âã]', ' ', text)  # remove stray mis-encoded characters
    text = re.sub(r'\s+', ' ', text)   # collapse runs of whitespace into a single space
    return text.strip()                # trim leading and trailing whitespace

In [64]:
# Apply the cleaning function to the train and test data
train['clean_text'] = train['text'].apply(clean_text)
test['clean_text'] = test['text'].apply(clean_text)

In [65]:
train.head(11)

Unnamed: 0,lang_id,text,clean_text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...,umgaqo siseko wenza amalungiselelo kumaziko ax...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...,i dha iya kuba nobulumko bokubeka umsebenzi na...
2,eng,the province of kwazulu-natal department of tr...,the province of kwazulu natal department of tr...
3,nso,o netefatša gore o ba file dilo ka moka tše le...,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...,khomishini ya ndinganyiso ya mbeu yo ewa maana...
5,nso,dinyakišišo tše tša go dirwa gabedi ka ngwaga ...,dinyakišišo tše tša go dirwa gabedi ka ngwaga ...
6,tsn,kgetse nngwe le nngwe e e sa faposiwang mo tsh...,kgetse nngwe le nngwe e e sa faposiwang mo tsh...
7,ven,mbadelo dze dza laelwa dzi do kwama mahatulele...,mbadelo dze dza laelwa dzi do kwama mahatulele...
8,nso,maloko a dikhuduthamaga a ikarabela mongwe le ...,maloko a dikhuduthamaga a ikarabela mongwe le ...
9,tsn,fa le dirisiwa lebone le tshwanetse go bontsha...,fa le dirisiwa lebone le tshwanetse go bontsha...


In [66]:
test.head(11)

Unnamed: 0,index,text,clean_text
0,1,"Mmasepala, fa maemo a a kgethegileng a letlele...",mmasepala fa maemo a a kgethegileng a letlelel...
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...,uzakwaziswa ngokufaneleko nakungafuneka eminye...
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.,tshivhumbeo tshi fana na ngano dza vhathu
3,4,Kube inja nelikati betingevakala kutsi titsini...,kube inja nelikati betingevakala kutsi titsini...
4,5,Winste op buitelandse valuta.,winste op buitelandse valuta
5,6,"Ke feela dilense tše hlakilego, tša pono e tee...",ke feela dilense tše hlakilego tša pono e tee ...
6,7,<fn>(762010101403 AM) 1495 Final Gems Birthing...,fn am final gems birthing options zulutxtfn
7,8,Ntjhafatso ya konteraka ya mosebetsi: Etsa bon...,ntjhafatso ya konteraka ya mosebetsi etsa bonn...
8,9,u-GEMS uhlinzeka ngezinzuzo zemithi yezifo ezi...,u gems uhlinzeka ngezinzuzo zemithi yezifo ezi...
9,10,"So, on occasion, are statistics misused.",so on occasion are statistics misused


In [67]:
train.lang_id.value_counts()

ssw    3000
tsn    3000
afr    3000
zul    3000
xho    3000
nso    3000
ven    3000
sot    3000
tso    3000
nbl    3000
eng    3000
Name: lang_id, dtype: int64
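
The counts above show a perfectly balanced training set: 11 languages with 3000 rows each. As a rough reference point for the accuracy figures later on, a classifier that always predicts a single language would be right about 9% of the time. A minimal sketch of that arithmetic, using the counts printed above (`majority_baseline` is a hypothetical helper name, not part of the notebook):

```python
from collections import Counter

# Class counts copied from the value_counts() output above.
counts = Counter({lang: 3000 for lang in
                  ["ssw", "tsn", "afr", "zul", "xho", "nso",
                   "ven", "sot", "tso", "nbl", "eng"]})

def majority_baseline(class_counts):
    """Accuracy of always predicting the most frequent class (hypothetical helper)."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

baseline = majority_baseline(counts)
print(f"{len(counts)} classes, {sum(counts.values())} rows, "
      f"baseline accuracy {baseline:.4f}")
```

Because the classes are balanced, plain accuracy and the weighted F1 score reported below should tell much the same story.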

In [68]:
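# Note: the models below are trained on the raw 'text' column, not on 'clean_text'
X = train['text']
y = train['lang_id']

# hold out 10% of the labelled data for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)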

In [69]:
# TF-IDF over character n-grams of length 3 to 6
tfidf = TfidfVectorizer(ngram_range=(3, 6), analyzer='char')
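
The vectorizer above is configured with `analyzer='char'` and `ngram_range=(3,6)`, so its features are overlapping character sequences of three to six characters rather than whole words. The sketch below is a simplified, standard-library illustration of which substrings that produces for a short phrase; `char_ngrams` is a hypothetical helper, not scikit-learn's implementation, and it ignores the TF-IDF weighting entirely.

```python
import re

def char_ngrams(text, min_n=3, max_n=6):
    """List every character n-gram with min_n <= n <= max_n (simplified illustration)."""
    # Lower-case and collapse whitespace runs, roughly mirroring the vectorizer's defaults.
    text = re.sub(r"\s\s+", " ", text.lower())
    grams = []
    for n in range(min_n, min(max_n, len(text)) + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams

grams = char_ngrams("dilo ka")
print(len(grams))   # number of n-grams extracted from the 7-character phrase
print(grams[:5])    # the five 3-grams, including ones that span the space
```

Character n-grams pick up prefixes, suffixes and letter combinations that differ between closely related languages, which is one plausible reason this representation suits language identification.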

In [70]:
# LinearSVC on character TF-IDF features
lsvc = LinearSVC(C=100, class_weight='balanced', random_state=42)
clf_lsvc = Pipeline([('tfidf', tfidf), ('clf', lsvc)])
clf_lsvc.fit(X_train, y_train)
y_pred_lsvc = clf_lsvc.predict(X_test)

# scikit-learn metrics take (y_true, y_pred) in that order
print('accuracy %s' % accuracy_score(y_test, y_pred_lsvc))
print('f1_score %s' % metrics.f1_score(y_test, y_pred_lsvc, average='weighted'))
print(classification_report(y_test, y_pred_lsvc))



accuracy 0.9993939393939394
f1_score 0.9993939934069507
              precision    recall  f1-score   support

         afr       1.00      1.00      1.00       300
         eng       1.00      1.00      1.00       272
         nbl       1.00      1.00      1.00       312
         nso       1.00      1.00      1.00       277
         sot       1.00      1.00      1.00       299
         ssw       1.00      1.00      1.00       320
         tsn       1.00      1.00      1.00       295
         tso       1.00      1.00      1.00       299
         ven       1.00      1.00      1.00       306
         xho       1.00      1.00      1.00       308
         zul       1.00      1.00      1.00       312

    accuracy                           1.00      3300
   macro avg       1.00      1.00      1.00      3300
weighted avg       1.00      1.00      1.00      3300
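
The `average='weighted'` F1 score printed above is the mean of the per-class F1 scores, each weighted by that class's support (its number of true samples in the hold-out set). A minimal sketch of that calculation follows; the per-class F1 values are made up for illustration, the supports are taken from three rows of the report, and `weighted_f1` is a hypothetical helper rather than a scikit-learn function.

```python
def weighted_f1(per_class_f1, support):
    """Support-weighted mean of per-class F1 scores (hypothetical helper)."""
    total = sum(support.values())
    return sum(per_class_f1[label] * support[label] for label in support) / total

# Hypothetical per-class F1 scores; supports copied from the report above.
f1 = {"afr": 1.00, "eng": 0.99, "nbl": 0.98}
support = {"afr": 300, "eng": 272, "nbl": 312}

score = weighted_f1(f1, support)
macro = sum(f1.values()) / len(f1)  # unweighted (macro) mean, for comparison
print(round(score, 4), round(macro, 4))
```

With supports this close to one another, the weighted and macro averages differ only slightly, which matches the identical `macro avg` and `weighted avg` rows in the report.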



In [73]:
y_test_pred_lsvc= clf_lsvc.predict(test['text'])

prediction_lsvc = pd.DataFrame({'index':test['index'],
                          'lang_id':y_test_pred_lsvc})

prediction_lsvc.to_csv('classification_lsvc3.csv',index=False)
y_test_pred_lsvc

array(['tsn', 'nbl', 'ven', ..., 'sot', 'sot', 'nbl'], dtype=object)

In [74]:
# Multinomial Naive Bayes on the same character TF-IDF features
nb = MultinomialNB()
clf_nb = Pipeline([('tfidf', tfidf), ('clf', nb)])
clf_nb.fit(X_train, y_train)
y_pred_nb = clf_nb.predict(X_test)

# scikit-learn metrics take (y_true, y_pred) in that order
print('accuracy %s' % accuracy_score(y_test, y_pred_nb))
print('f1_score %s' % metrics.f1_score(y_test, y_pred_nb, average='weighted'))
print(classification_report(y_test, y_pred_nb))

accuracy 1.0
f1_score 1.0
              precision    recall  f1-score   support

         afr       1.00      1.00      1.00       300
         eng       1.00      1.00      1.00       272
         nbl       1.00      1.00      1.00       312
         nso       1.00      1.00      1.00       277
         sot       1.00      1.00      1.00       299
         ssw       1.00      1.00      1.00       320
         tsn       1.00      1.00      1.00       295
         tso       1.00      1.00      1.00       299
         ven       1.00      1.00      1.00       306
         xho       1.00      1.00      1.00       308
         zul       1.00      1.00      1.00       312

    accuracy                           1.00      3300
   macro avg       1.00      1.00      1.00      3300
weighted avg       1.00      1.00      1.00      3300



In [78]:
y_test_pred_NB= clf_nb.predict(test['text'])

NB = pd.DataFrame({'index':test['index'],
                          'lang_id':y_test_pred_NB})

NB.to_csv('NB.csv',index=False)
y_test_pred_NB

array(['tsn', 'nbl', 'ven', ..., 'sot', 'sot', 'nbl'], dtype='<U3')
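
The visible entries of the two prediction arrays are the same, but the output is truncated, so it does not show how often the two models agree across the whole test set. One way to check is sketched below with short made-up label lists; `agreement_rate` is a hypothetical helper, and in the notebook it could be called as `agreement_rate(y_test_pred_lsvc, y_test_pred_NB)`.

```python
def agreement_rate(preds_a, preds_b):
    """Fraction of positions where two prediction sequences match (hypothetical helper)."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction sequences must have the same length")
    matches = sum(a == b for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)

# Made-up predictions for illustration only; not taken from the notebook's output.
svc_preds = ["tsn", "nbl", "ven", "sot", "zul"]
nb_preds = ["tsn", "nbl", "ven", "sot", "xho"]

rate = agreement_rate(svc_preds, nb_preds)
print(rate)
```

Rows where the models disagree are natural candidates for manual inspection before choosing which submission file to keep.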