# EXPLORE Data Science Academy Classification Hackathon

## 1. Introduction

South Africa is a multicultural society characterised by rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and to contribute to the social, cultural, intellectual, economic and political life of South African society. With such a multilingual population, it is only natural that our systems and devices should also communicate in multiple languages.
In this challenge, you will take text written in any of South Africa's 11 official languages and identify which language it is in.


## 1.1 Problem Statement

The task is to develop a machine learning model that identifies which of South Africa's 11 official languages a given piece of text is written in.

## 2. Import Python Libraries


Let's import everything we need to begin. This will include techniques for text feature extraction and ways to divide our data. The models we want to train will be included in the subsequent sections.

In [90]:
# Core libraries
import pandas as pd
import numpy as np
import string
import re
import time

# Exploratory Data Analysis
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from matplotlib.pyplot import rcParams
from matplotlib.colors import ListedColormap

# Text feature extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Data Preprocessing
from sklearn.utils import resample

# NLTK (Natural Language Toolkit)
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize, TreebankWordTokenizer

ps = PorterStemmer()
wn = WordNetLemmatizer()



## 2.1 Load Datasets

In [91]:
# Load the train & test data sets
train_df = pd.read_csv('train_set.csv')
test_df = pd.read_csv('test_set.csv')

## 3. Exploratory Data Analysis

This section conducts an initial investigation of the dataset to identify patterns and anomalies and to suggest hypotheses. Among other things, we look at the summary information and shape of both the train and test sets.

In [92]:
# Viewing the first 5 rows of train_df
train_df.head()


Unnamed: 0,lang_id,text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...
2,eng,the province of kwazulu-natal department of tr...
3,nso,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...


In [93]:
#viewing the first 5 rows of test_df
test_df.head()

Unnamed: 0,index,text
0,1,"Mmasepala, fa maemo a a kgethegileng a letlele..."
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.
3,4,Kube inja nelikati betingevakala kutsi titsini...
4,5,Winste op buitelandse valuta.


In [94]:
print(train_df.info()) #checking the data type of each column in the train data
print('\n')
print(test_df.info()) #checking the data type of each column in the test data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   lang_id  33000 non-null  object
 1   text     33000 non-null  object
dtypes: object(2)
memory usage: 515.8+ KB
None


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5682 entries, 0 to 5681
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   index   5682 non-null   int64 
 1   text    5682 non-null   object
dtypes: int64(1), object(1)
memory usage: 88.9+ KB
None


In [95]:
# Display of the data statistics using the transpose method
train_df.describe().T


Unnamed: 0,count,unique,top,freq
lang_id,33000,11,xho,3000
text,33000,29948,ngokwesekhtjheni yomthetho ophathelene nalokhu...,17
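
The summary above reports 11 unique labels, with the most frequent one occurring 3000 times, and 29948 unique texts among 33000 rows, so some texts repeat. A minimal sketch of how class balance and duplicate texts could be checked directly is shown below. It uses a small hypothetical frame, `toy_df`, as a stand-in for `train_df`; the rows are invented for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for train_df, with the same two columns
toy_df = pd.DataFrame({
    "lang_id": ["xho", "xho", "eng", "eng", "afr", "afr"],
    "text": ["molo", "molo", "hello", "good day", "hallo", "goeie dag"],
})

# Rows per language: equal counts indicate a balanced dataset
class_counts = toy_df["lang_id"].value_counts()

# Rows whose text already appeared earlier in the frame
n_duplicates = int(toy_df["text"].duplicated().sum())

print(class_counts)
print("duplicate texts:", n_duplicates)
```

Running the same two calls on `train_df` would show whether every language has the same number of rows and how many rows are repeated texts.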


In [96]:
test_df.describe().T


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
index,5682.0,2841.5,1640.396446,1.0,1421.25,2841.5,4261.75,5682.0


In [97]:
# Check for test data missing values 
test_df.isnull().sum()

index    0
text     0
dtype: int64

In [98]:
# Check for train data missing values 
train_df.isnull().sum()

lang_id    0
text       0
dtype: int64

In [99]:
lang = train_df['text']

# Creating a dataframe from the text column
lang_df = pd.DataFrame(lang)

# Add the lang_id (language label) column to the language dataframe
lang_df['lang_id'] = train_df['lang_id']

# View the top 3 rows of languages
lang_df.head(3)

Unnamed: 0,text,lang_id
0,umgaqo-siseko wenza amalungiselelo kumaziko ax...,xho
1,i-dha iya kuba nobulumko bokubeka umsebenzi na...,xho
2,the province of kwazulu-natal department of tr...,eng


## 4. Data Processing

To process the data, we define a function, cleaner, that:
1. Converts the text to lower case
2. Removes URLs, hashtags (if any), numeric values, and replacement characters (the U+FFFD diamond)
3. Strips punctuation and special characters
4. Removes excess white space and RTs (if any)
        

In [100]:
def cleaner(texter):
    """
    Clean a single text string (applied row by row with Series.apply):
    - convert letters to lowercase
    - remove account mentions and URL links
    - remove hashtags
    - remove numbers
    - remove punctuation and the U+FFFD replacement character
    Returns the cleaned string.
    """
    texter = texter.lower()  # convert text to lowercase
    to_remove = [
        r"@\w*",  # strip account mentions
        r"https?://\S+",  # strip URLs up to the next white-space
        r"#\w*",  # strip hashtags
        r"\d+",  # delete numeric values
        "\ufffd",  # remove the U+FFFD replacement character (the "diamond")
        r"\brt\b",  # remove RT only as a standalone token, not inside words
    ]
    for pattern in to_remove:
        texter = re.sub(pattern, "", texter)

    # strip punctuation and special characters
    texter = re.sub(r"[,.;':@#?!&/$]+ *", " ", texter)

    # collapse excess white-space
    texter = re.sub(r"\s\s+", " ", texter)

    return texter.strip()
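
Two of the patterns in a cleaner of this kind are easy to get subtly wrong: an un-anchored `rt` pattern and a spelled-out `U+FFFD`. The sketch below compares each with a stricter variant on a hypothetical sample string; the sample and variable names are illustrative assumptions, not part of the original notebook.

```python
import re

# Hypothetical sample: a retweet marker, ordinary words containing "rt",
# and a trailing U+FFFD replacement character
sample = "rt part of the report\ufffd"

# Un-anchored pattern: also deletes "rt " at the end of ordinary words
unanchored = re.sub(r"rt[\s]+", "", sample)

# Word-boundary pattern: only a standalone "rt" token is removed
anchored = re.sub(r"\brt\b\s*", "", sample)

# r"U+FFFD" means "one or more U, then FFFD", not the character U+FFFD
literal_attempt = re.sub(r"U+FFFD", "", sample)
escaped = re.sub("\ufffd", "", sample)

print(repr(unanchored))
print(repr(anchored))
print(repr(literal_attempt))
print(repr(escaped))
```

The un-anchored pattern turns "part of" into "paof", while the word-boundary version leaves ordinary words intact.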

Now let us apply the cleaner function to both the train and test datasets.

In [101]:
#apply function to remove noise in the data
train_df['text'] = train_df['text'].apply(cleaner)  
test_df['text'] = test_df['text'].apply(cleaner)  


In [102]:
train_df.tail()

Unnamed: 0,lang_id,text
32995,tsn,popo ya dipolateforomo tse ke go tlisa boetele...
32996,sot,modise mosadi na o ntse o sa utlwe hore thaban...
32997,eng,closing date for the submission of completed t...
32998,xho,nawuphina umntu ofunyenwe enetyala phantsi kwa...
32999,sot,mafapha a mang le ona a lokela ho etsa ditlale...


Assigning X and y variables

In [103]:
y=train_df['lang_id']
X=train_df['text']

## 5. Model Building

We specify the models by name and instantiate the classes that implement them. Some classifiers take arguments when they are instantiated; these arguments are the model's hyperparameters.

In [104]:
#Feature Engineering and Model Building
from sklearn.feature_extraction.text import TfidfTransformer  # used by the pipelines below
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from xgboost import XGBClassifier
from textblob import TextBlob
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [65]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.33, random_state = 42)
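
The split above is random with a fixed seed. If the per-language proportions should be identical in both parts, `train_test_split` also accepts a `stratify` argument. A minimal sketch on hypothetical toy data (the texts and labels below are invented for illustration):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: two languages, four texts each
texts = ["a1", "a2", "a3", "a4", "e1", "e2", "e3", "e4"]
labels = ["afr"] * 4 + ["eng"] * 4

# stratify=labels keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels
)

print(Counter(y_tr))
print(Counter(y_te))
```

With a dataset that is already balanced, the effect is small, but it removes one source of fold-to-fold variation.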

In [67]:
pipeline1 = Pipeline([
    ('bow',CountVectorizer(stop_words='english', 
                             min_df=2, 
                             max_df=0.5, 
                             ngram_range=(1, 1))),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])
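
To make the three pipeline stages concrete, here is a minimal self-contained sketch on a hypothetical four-sentence corpus. It uses default vectorizer settings rather than the `min_df`, `max_df` and `stop_words` values above, because those thresholds would discard most of such a tiny vocabulary; the sentences and names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: two Afrikaans and two English sentences
toy_texts = [
    "die kat sit op die mat",
    "the cat sat on the mat",
    "die hond blaf",
    "the dog barks",
]
toy_labels = ["afr", "eng", "afr", "eng"]

toy_pipeline = Pipeline([
    ("bow", CountVectorizer()),       # strings to token counts
    ("tfidf", TfidfTransformer()),    # counts to TF-IDF weights
    ("classifier", MultinomialNB()),  # Naive Bayes on the TF-IDF features
])
toy_pipeline.fit(toy_texts, toy_labels)

# Inspect the intermediate bag-of-words matrix: one row per text,
# one column per distinct token
counts = toy_pipeline.named_steps["bow"].transform(toy_texts)
preds = toy_pipeline.predict(["die kat", "the dog"])

print(counts.shape)
print(preds)
```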

In [68]:
pipeline1.fit(X_train,y_train)
predictions = pipeline1.predict(X_test)
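# scikit-learn documents the order as (y_true, y_pred). With predictions passed
# first, as here, precision and recall are swapped and the confusion matrix is
# transposed; accuracy is unaffected.
print(classification_report(predictions,y_test))
print(confusion_matrix(predictions,y_test))
print(accuracy_score(predictions,y_test))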

              precision    recall  f1-score   support

         afr       1.00      1.00      1.00       986
         eng       1.00      1.00      1.00       994
         nbl       1.00      0.99      1.00       955
         nso       1.00      1.00      1.00      1025
         sot       1.00      1.00      1.00      1023
         ssw       1.00      1.00      1.00       998
         tsn       1.00      1.00      1.00       983
         tso       1.00      1.00      1.00       952
         ven       1.00      1.00      1.00      1034
         xho       1.00      1.00      1.00      1004
         zul       0.99      1.00      1.00       936

    accuracy                           1.00     10890
   macro avg       1.00      1.00      1.00     10890
weighted avg       1.00      1.00      1.00     10890

[[ 984    0    1    0    0    0    1    0    0    0    0]
 [   0  991    1    0    0    0    0    0    0    0    2]
 [   0    0  950    0    0    0    0    0    0    2    3]
 [   0    0  
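
The scikit-learn metric functions take the true labels first and the predictions second. The effect of swapping them can be checked on a toy example; the labels below are hypothetical and chosen only so that the two orderings give visibly different matrices.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: one Afrikaans text is misclassified as English
y_true = ["afr", "afr", "eng"]
y_pred = ["afr", "eng", "eng"]
labels = ["afr", "eng"]

# Documented order: rows are true classes, columns are predicted classes
cm_documented = confusion_matrix(y_true, y_pred, labels=labels)

# Swapped order: rows and columns trade places
cm_swapped = confusion_matrix(y_pred, y_true, labels=labels)

print(cm_documented)
print(cm_swapped)
```

The swapped matrix is the transpose of the documented one, which is why per-class precision and recall trade places while overall accuracy stays the same.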

In [85]:
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline


pipe_nb = make_pipeline(
    CountVectorizer(),
    MultinomialNB(alpha=0.006)
)
scores = cross_validate(pipe_nb, X_train, y_train, return_train_score=True)
pd.DataFrame(scores)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,1.591866,0.471963,0.998869,1.0
1,1.464393,0.271945,0.998869,1.0
2,1.519857,0.34398,0.999548,1.0
3,2.015833,0.199982,0.999095,1.0
4,1.615862,0.199962,0.998643,1.0
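
Per-fold scores are easier to compare across models once they are reduced to a mean and a standard deviation. A small sketch, using the `test_score` and `train_score` values copied from the table above (the frame name `cv_scores` is an assumption):

```python
import pandas as pd

# Scores copied from the cross-validation table above
cv_scores = pd.DataFrame({
    "test_score": [0.998869, 0.998869, 0.999548, 0.999095, 0.998643],
    "train_score": [1.0, 1.0, 1.0, 1.0, 1.0],
})

# Mean and standard deviation of each column across the five folds
summary = cv_scores.agg(["mean", "std"])
print(summary)
```

In the notebook itself, the same call can be applied to `pd.DataFrame(scores)` directly.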


In [70]:
pipe_nb2 = make_pipeline(
    CountVectorizer(),
    LogisticRegression()
)
scores = cross_validate(pipe_nb2, X_train, y_train, return_train_score=True)
pd.DataFrame(scores)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Unnamed: 0,fit_time,score_time,test_score,train_score
0,34.240282,0.279974,0.99299,1.0
1,37.071507,0.27197,0.996834,1.0
2,34.201093,0.288005,0.995025,1.0
3,34.295636,0.311972,0.993894,1.0
4,34.527701,0.279975,0.993894,1.0
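
The warning above says the solver hit its iteration limit and suggests raising `max_iter`. A minimal sketch of doing so inside a pipeline is shown below; the value 1000 and the toy corpus are assumptions for illustration, not settings taken from the original run.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus
toy_texts = [
    "die kat sit op die mat",
    "the cat sat on the mat",
    "die hond blaf",
    "the dog barks",
]
toy_labels = ["afr", "eng", "afr", "eng"]

# max_iter=1000 is an assumed value; the library default is lower
pipe_lr = make_pipeline(
    CountVectorizer(),
    LogisticRegression(max_iter=1000),
)
pipe_lr.fit(toy_texts, toy_labels)

# n_iter_ reports how many iterations the solver actually used
lr_step = pipe_lr.named_steps["logisticregression"]
print(lr_step.max_iter, lr_step.n_iter_)
```

Comparing `n_iter_` with `max_iter` after fitting shows whether the limit was still reached.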


In [71]:
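pipe_nb3 = make_pipeline(
    CountVectorizer(),
    SVC(C = 50, degree = 1, gamma = "auto", kernel = "rbf", probability = True)
)
# Evaluate pipe_nb3. The recorded run passed pipe_nb2 here, so the output
# below repeats the logistic regression warnings and scores.
scores = cross_validate(pipe_nb3, X_train, y_train, return_train_score=True)
pd.DataFrame(scores)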

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Unnamed: 0,fit_time,score_time,test_score,train_score
0,35.824829,0.287993,0.99299,1.0
1,32.456189,0.34399,0.996834,1.0
2,32.921337,0.255979,0.995025,1.0
3,34.856634,0.287973,0.993894,1.0
4,34.767986,0.279969,0.993894,1.0


In [72]:
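pipe_nb4 = make_pipeline(
    CountVectorizer(),
    # max_features="sqrt" is what "auto" meant for classifiers; "auto" was removed in scikit-learn 1.3
    RandomForestClassifier(n_estimators = 500, criterion = "gini", max_depth = 10,
                                     max_features = "sqrt", min_samples_leaf = 0.005,
                                     min_samples_split = 0.005, n_jobs = -1, random_state = 1000)
)
# Evaluate pipe_nb4. The recorded run passed pipe_nb2 here, so the output
# below repeats the logistic regression warnings and scores.
scores = cross_validate(pipe_nb4, X_train, y_train, return_train_score=True)
pd.DataFrame(scores)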

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Unnamed: 0,fit_time,score_time,test_score,train_score
0,37.489657,0.263992,0.99299,1.0
1,35.800047,0.38398,0.996834,1.0
2,36.926783,0.264207,0.995025,1.0
3,37.135484,0.391966,0.993894,1.0
4,35.680167,0.34397,0.993894,1.0


In [73]:
classifiers = [
    LogisticRegression, 
    KNeighborsClassifier,
    SVC,
    DecisionTreeClassifier,
    RandomForestClassifier,    
]
predictions_list = []
models_list =[]
for classifier in classifiers:
    pipeline2 = Pipeline([
        ('bow',CountVectorizer(stop_words='english', 
                                 min_df=2, 
                                 max_df=0.5, 
                                 ngram_range=(1, 1))),  # strings to token integer counts
        ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
        ('classifier', classifier()),  # train on TF-IDF vectors 
    ])
 
    pipeline2.fit(X_train,y_train)
    predictions = pipeline2.predict(X_test)
    models_list.append(pipeline2)
    predictions_list.append(predictions)
    print('...............................')
    print(classifier)
    print(classification_report(predictions,y_test)) 
    print(confusion_matrix(predictions,y_test))
    print(accuracy_score(predictions,y_test))
    print(" ")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


...............................
<class 'sklearn.linear_model._logistic.LogisticRegression'>
              precision    recall  f1-score   support

         afr       0.99      1.00      1.00       978
         eng       1.00      0.99      0.99      1002
         nbl       0.99      0.99      0.99       952
         nso       1.00      1.00      1.00      1023
         sot       1.00      1.00      1.00      1019
         ssw       0.99      1.00      1.00       994
         tsn       1.00      1.00      1.00       986
         tso       1.00      1.00      1.00       952
         ven       1.00      1.00      1.00      1034
         xho       1.00      0.99      0.99      1013
         zul       0.98      0.99      0.98       937

    accuracy                           0.99     10890
   macro avg       0.99      0.99      0.99     10890
weighted avg       0.99      0.99      0.99     10890

[[ 978    0    0    0    0    0    0    0    0    0    0]
 [   6  991    1    2    0    0    1 

In [74]:
# Classifiers
from sklearn.svm import NuSVC, SVC

from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingCVClassifier # <- Here is our boy

# Used to ignore warnings generated from StackingCVClassifier
import warnings
warnings.simplefilter('ignore')

In [None]:
###############################################################################
#                              5. Good Ol' Classifiers                        #
###############################################################################
# Initializing Support Vector classifier
classifier1 = SVC(C = 50, degree = 1, gamma = "auto", kernel = "rbf", probability = True)



# Initializing Nu Support Vector classifier
classifier3 = NuSVC(degree = 1, kernel = "rbf", nu = 0.25, probability = True)

# Initializing Random Forest classifier
# max_features="sqrt" is what "auto" meant for classifiers; "auto" was removed in scikit-learn 1.3
classifier4 = RandomForestClassifier(n_estimators = 500, criterion = "gini", max_depth = 10,
                                     max_features = "sqrt", min_samples_leaf = 0.005,
                                     min_samples_split = 0.005, n_jobs = -1, random_state = 1000)

In [None]:
###############################################################################
#                             6. Stacking Classifier                          #
###############################################################################
# Initializing the StackingCV classifier
sclf = StackingCVClassifier(classifiers = [classifier1, classifier3, classifier4],
                            shuffle = False,
                            use_probas = True,
                            cv = 5,
                            meta_classifier = SVC(probability = True))

In [None]:
# pipeline4 = Pipeline([
#     ('bow',CountVectorizer(stop_words='english', 
#                              min_df=2, 
#                              max_df=0.5, 
#                              ngram_range=(1, 1))),  # strings to token integer counts
#     ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
#     ('classifier', sclf),  # sclf is already an instance, so it is not called
# ])

In [None]:
# ###############################################################################
# #                       7. Putting classifiers in a dictionary                #
# ###############################################################################
# # Create dictionary to store classifiers
# classified = {"SVC": classifier1,
#                "NuSVC": classifier3,
#                "RF": classifier4,
#                "Stack": sclf}

In [None]:
# sclf.fit(X_train, y_train)
# y_pred = sclf.predict(X_test).astype(str)


## Exporting Predictions on the Test Data

In [25]:


#Make Kaggle Submission
test = test_df['text']
# Refit the Naive Bayes pipeline on the full training set and predict with
# that same pipeline
pipeline1.fit(X,y)
y_pred = pipeline1.predict(test)

submission = pd.DataFrame(y_pred, columns = ['lang_id'])
submission['index'] = test_df['index']
submission = submission[['index','lang_id']]
submission.to_csv('submission_pipeline4.csv', index=False)

In [84]:
test = test_df['text']
pipe_nb.fit(X,y)
y_pred = pipe_nb.predict(test)

submission = pd.DataFrame(y_pred, columns = ['lang_id'])
submission['index'] = test_df['index']
submission = submission[['index','lang_id']]
submission.to_csv('submission_pipeline9.csv', index=False)
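
Before uploading, it can be worth checking that the submission frame has the expected shape. The sketch below runs a few such checks on a hypothetical three-row submission; the set of valid labels is taken from the language codes in the classification reports above, while the toy predictions and the particular checks are assumptions rather than competition requirements.

```python
import pandas as pd

# Hypothetical stand-ins for test_df["index"] and the model's predictions
toy_index = pd.Series([1, 2, 3], name="index")
toy_pred = ["tsn", "nbl", "ven"]

submission = pd.DataFrame({"index": toy_index, "lang_id": toy_pred})

# The 11 language codes that appear in the classification reports
valid_labels = {"afr", "eng", "nbl", "nso", "sot", "ssw",
                "tsn", "tso", "ven", "xho", "zul"}

checks = {
    "column_order": list(submission.columns) == ["index", "lang_id"],
    "no_missing": not submission.isnull().any().any(),
    "unique_index": bool(submission["index"].is_unique),
    "known_labels": set(submission["lang_id"]) <= valid_labels,
}
print(checks)
```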

In [45]:
test = test_df['text']
pipe_nb2.fit(X,y)
y_pred = pipe_nb2.predict(test)

submission = pd.DataFrame(y_pred, columns = ['lang_id'])
submission['index'] = test_df['index']
submission = submission[['index','lang_id']]
submission.to_csv('submission_pipeline6.csv', index=False)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
