<a id="top"></a>
#  **Classification Hackathon** 
## Xolisile Sibiya

<a id="intro"></a>
## **Introduction**

South Africa is a multicultural society that is characterised by its rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and also contribute to the social, cultural, intellectual, economic and political life of the South African society.


### Problem Statement

The country is multilingual with 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and able to speak two or more of the official languages. With such a multilingual population, it is only natural that our systems and devices should also communicate in multiple languages.

### Objectives
The key objective is to build a machine learning model that takes text written in any of South Africa's 11 official languages and identifies which language it is in.

In [1]:
# Data manipulation
import pandas as pd
import numpy as np
import nltk
#nltk.download('wordnet')
#nltk.download('stopwords')
#nltk.download('averaged_perceptron_tagger')
#nltk.download('punkt')
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
from nltk import pos_tag, pos_tag_sents
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from collections import Counter
# For searching patterns in the text (regex)
import re
# datetime
import datetime

# Libraries for data preparation and model building
from sklearn.pipeline import Pipeline
import statsmodels.formula.api as sm
from statsmodels.formula.api import ols
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.feature_selection import RFE
from sklearn.feature_selection import RFECV
from sklearn.feature_selection import SelectKBest, chi2
from scipy.stats import boxcox, zscore
from sklearn.metrics import mean_squared_error
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, PolynomialFeatures

# visualizations
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
# saving my model
import pickle

#ignoring warnings
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
# Read data
df_test = pd.read_csv('test_set.csv')
df_train = pd.read_csv('train_set.csv')
df_train.head()

|   | lang_id | text |
|---|---------|------|
| 0 | xho | umgaqo-siseko wenza amalungiselelo kumaziko ax... |
| 1 | xho | i-dha iya kuba nobulumko bokubeka umsebenzi na... |
| 2 | eng | the province of kwazulu-natal department of tr... |
| 3 | nso | o netefatša gore o ba file dilo ka moka tše le... |
| 4 | ven | khomishini ya ndinganyiso ya mbeu yo ewa maana... |


In [3]:
df_test.head()

|   | index | text |
|---|-------|------|
| 0 | 1 | Mmasepala, fa maemo a a kgethegileng a letlele... |
| 1 | 2 | Uzakwaziswa ngokufaneleko nakungafuneka eminye... |
| 2 | 3 | Tshivhumbeo tshi fana na ngano dza vhathu. |
| 3 | 4 | Kube inja nelikati betingevakala kutsi titsini... |
| 4 | 5 | Winste op buitelandse valuta. |


### Data Overview

**Train dataset**

In [4]:
# Checking what our training dataset looks like
print("Rows    : ", df_train.shape[0])

print("Columns : ", df_train.shape[1])

print("\nMissing values: ", df_train.isnull().sum().values.sum())

print("\nInformation about the data: ")
df_train.info()  # info() prints its summary directly and returns None
 
print("\nAbout the data: \n")

# Check how many unique items are in each column of the dataframe
for col_name in df_train.columns:
    unique_out = len(df_train[col_name].unique())
    print(f"Feature '{col_name}' has {unique_out} unique categories") 

Rows    :  33000
Columns :  2

Missing values:  0

Information about the data: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   lang_id  33000 non-null  object
 1   text     33000 non-null  object
dtypes: object(2)
memory usage: 515.8+ KB

About the data: 

Feature 'lang_id' has 11 unique categories
Feature 'text' has 29948 unique categories


**Test dataset**

In [None]:
# Checking what our test dataset looks like
print("Rows    : ", df_test.shape[0])

print("Columns : ", df_test.shape[1])

print("\nMissing values: ", df_test.isnull().sum().values.sum())

print("\nInformation about the data: ")
df_test.info()  # info() prints its summary directly and returns None
 
print("\nAbout the data: \n")

# Check how many unique items are in each column of the dataframe
for col_name in df_test.columns:
    unique_out = len(df_test[col_name].unique())
    print(f"Feature '{col_name}' has {unique_out} unique categories") 

<a id="cleaning"></a>
## **Data Preprocessing**

Data preprocessing is a technique that involves taking in raw data and transforming it into an understandable format. The technique includes data cleaning, integration, transformation, reduction and discretization. The data preprocessing plan will include the following processes:

- **Data cleaning**



### Data Cleaning 

Data cleaning is a process of improving the quality of the data by identifying corrupt or erroneous records from a data set and rectifying them.

The data cleaning process will include the following:
- Converting text to lowercase
- Removal of the noise:
    - punctuation
    - numbers

#### Convert text to lowercase

In [5]:
def lowercase(text):
    text = text.lower() # making text to be lowercase
    return text

df_train['text'] = df_train['text'].apply(lowercase)

df_test['text'] = df_test['text'].apply(lowercase)

#### Remove Noise from Text

Data that cannot be processed or interpreted by a machine is classified as noisy data. Text data contains a lot of noise in the form of special characters, punctuation and numbers. During this process, punctuation, special characters and numbers are replaced with spaces; accented letters such as š are kept as they are.

In [6]:
 
def clean_text(text):
    """Take a string and return it with punctuation replaced by spaces
    and surplus whitespace removed."""

    # Replace punctuation and special characters with a space
    text = re.sub(r"[#.;:!'?,\"|\-()\[\]%$<>{}]", " ", text)

    # Collapse the runs of whitespace left behind by the replacements
    text = re.sub(r"\s+", " ", text)

    # Remove leading and trailing whitespace
    return text.strip()

df_train['clean_text'] = df_train['text'].apply(clean_text)

df_test['clean_text'] = df_test['text'].apply(clean_text)

In [7]:
# remove numbers

def remove_numbers(text):
    """Take a string of text and replace each run of digits with a space."""
    return re.sub(r'\d+', ' ', text)

df_train['clean_text'] = df_train['clean_text'].apply(remove_numbers)

df_test['clean_text'] = df_test['clean_text'].apply(remove_numbers)

In [8]:
df_train.head()

|   | lang_id | text | clean_text |
|---|---------|------|------------|
| 0 | xho | umgaqo-siseko wenza amalungiselelo kumaziko ax... | umgaqo siseko wenza amalungiselelo kumaziko ax... |
| 1 | xho | i-dha iya kuba nobulumko bokubeka umsebenzi na... | i dha iya kuba nobulumko bokubeka umsebenzi na... |
| 2 | eng | the province of kwazulu-natal department of tr... | the province of kwazulu natal department of tr... |
| 3 | nso | o netefatša gore o ba file dilo ka moka tše le... | o netefatša gore o ba file dilo ka moka tše le... |
| 4 | ven | khomishini ya ndinganyiso ya mbeu yo ewa maana... | khomishini ya ndinganyiso ya mbeu yo ewa maana... |
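As a quick sanity check, the sketch below chains the cleaning steps on a single made-up string. The helper name `clean_pipeline` and the sample sentence are illustrative assumptions and not part of the notebook; the patterns approximate the ones used in the cells above.

```python
import re

# Hypothetical helper: lowercasing, punctuation removal and number removal in one place
PUNCTUATION = re.compile(r"[#.;:!'?,\"|\-()\[\]%$<>{}]")
DIGITS = re.compile(r"\d+")
WHITESPACE = re.compile(r"\s+")

def clean_pipeline(text: str) -> str:
    text = text.lower()                 # convert to lowercase
    text = PUNCTUATION.sub(" ", text)   # punctuation -> space
    text = DIGITS.sub(" ", text)        # runs of digits -> space
    return WHITESPACE.sub(" ", text).strip()  # tidy up whitespace

# Made-up sample built around a fragment of the training data
sample = "O netefatša gore o ba file dilo ka moka: 25% (test)!"
cleaned = clean_pipeline(sample)
print(cleaned)
```

Note that the accented letter š passes through unchanged, which matters for languages such as Sepedi where it is part of the spelling.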


<a id="features"></a>
## **Modelling**

The process of training an ML model involves providing an ML algorithm (that is, the learning algorithm) with training data to learn from. The term ML model refers to the model artifact that is created by the training process.

The training data must contain the correct answer, which is known as a target or target attribute. The learning algorithm finds patterns in the training data that map the input data attributes to the target (the answer that you want to predict), and it outputs an ML model that captures these patterns.
We train different models on the training data.

- **Training** set: a subset of the dataset used to build predictive models.

- **Test** set (unseen examples): a subset of the dataset used to assess the likely future performance of a model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.

Data is randomly split into training and validation data sets. 80% is for training the model and 20% is for validation. 

In [9]:
y = df_train['lang_id']

In [10]:
X_train, X_test, y_train, y_test = train_test_split(df_train['clean_text'],
                                                    y,
                                                    test_size = 0.20,
                                                    random_state = 42)
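The split above is purely random. As a hedged aside, the sketch below shows how the `stratify` argument of `train_test_split` keeps the class proportions identical in both subsets. The toy DataFrame is an assumption made for illustration; the notebook itself does not stratify.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples for each of three languages
toy = pd.DataFrame({
    "clean_text": [f"sample text {i}" for i in range(30)],
    "lang_id": ["xho"] * 10 + ["eng"] * 10 + ["nso"] * 10,
})

X_tr, X_val, y_tr, y_val = train_test_split(
    toy["clean_text"],
    toy["lang_id"],
    test_size=0.20,
    random_state=42,
    stratify=toy["lang_id"],  # keep the same language mix in both subsets
)

print(len(X_tr), len(X_val))
print(y_val.value_counts().to_dict())
```

With balanced classes, as in this dataset, a random split is already close to stratified, so this mainly matters when some languages are rare.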

Models are put in a pipeline with a vectorizer. Machine learning algorithms operate on a numeric feature space, expecting input as a two-dimensional array where rows are instances and columns are features. In order to perform machine learning on text, we need to transform our documents into vector representations such that we can apply numeric machine learning. This process is called feature extraction or more simply, vectorization, and is an essential first step toward language-aware analysis.

- CountVectorizer creates a matrix in which each unique word is represented by a column and each text sample is a row. The value of each cell is the count of that word in that text sample.

- TfidfVectorizer (Term Frequency–Inverse Document Frequency) transforms text into feature vectors that can be used as input to an estimator. Its `vocabulary_` attribute is a dictionary that maps each token (word) to a feature index in the matrix; each unique token gets its own index.
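To make the difference between the two vectorizers concrete, here is a minimal sketch on a three-sentence toy corpus. The sentences are invented for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (an assumption, not from the training data)
docs = ["the cat sat", "the dog sat", "die kat sit"]

# Raw counts: one row per document, one column per unique token
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
vocab = count_vec.vocabulary_  # token -> column index
print(sorted(vocab.items(), key=lambda kv: kv[1]))
print(counts.toarray())

# TF-IDF: tokens that appear in fewer documents get a larger weight
tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs)
print(weights.toarray().round(2))
```

In the first document, "cat" occurs in only one document while "the" occurs in two, so the TF-IDF weight of "cat" is higher even though both have a raw count of 1.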

**Linear Support Vector Classification**

A support vector machine takes the vectorized text samples as data points and outputs the hyperplane (which in two dimensions is simply a line) that best separates the classes. This hyperplane is the decision boundary.
Linear SVC (Support Vector Classifier) fits to the data you provide, returning a "best fit" hyperplane that divides, or categorizes, your data.

LinearSVC is similar to SVC with parameter `kernel = 'linear'`, but it is implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
This class supports both dense and sparse input, and multiclass support is handled according to a one-vs-the-rest scheme.
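The one-vs-the-rest scheme can be seen in the shape of the fitted model: one hyperplane (one row of coefficients) per class. The sketch below assumes a tiny invented corpus with three language labels; the sentences are illustrative only and the default `LinearSVC` settings are used rather than the notebook's.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy sentences (assumptions for illustration, not from the dataset)
texts = [
    "the cat sat on the mat", "a dog ran in the park",
    "die kat sit op die mat", "die hond hardloop in die park",
    "ikati ihleli phezu kocansi", "inja igijima epaki",
]
labels = ["eng", "eng", "afr", "afr", "zul", "zul"]

model = Pipeline([("tfidf", TfidfVectorizer()), ("lsvc", LinearSVC())])
model.fit(texts, labels)

clf = model.named_steps["lsvc"]
print(clf.classes_)      # class labels in sorted order
print(clf.coef_.shape)   # (number of classes, number of features)

# One decision score per class; the highest score wins
scores = model.decision_function(["die kat"])
print(scores.shape)
```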

In [None]:
# Linear Support Vector Classifier with TfidfVectorizer
from sklearn import preprocessing

lsvc = Pipeline([('tfidf', TfidfVectorizer(ngram_range = (1,3))),  # word uni-, bi- and trigrams
                 
                 ('scaler', preprocessing.MaxAbsScaler()),  # scale each feature by its maximum absolute value
                 
                 ('lsvc', LinearSVC(C = 5, class_weight ='balanced',
                                  max_iter = 8000))])


In [None]:
lsvc.fit(X_train, y_train)  # fit on the training split only, so X_test stays unseen for evaluation

**Naive Bayes**

In [12]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn import preprocessing

# Multinomial Naive Bayes with TfidfVectorizer
# Note: stop_words = 'english' only filters English stop words
nb = Pipeline([('tfidf', TfidfVectorizer(stop_words = 'english',
                                         ngram_range = (1,2))),  # word uni- and bigrams
                 
                 ('scaler', preprocessing.MaxAbsScaler()),
                 
                 ('nb',MultinomialNB())])


In [13]:
nb.fit(X_train, y_train)

<a id="evaluation"></a>
## **Model Evaluation**

Model Evaluation is an integral part of the model development process. It helps to find the best model that represents our data. It also focuses on how well the chosen model will work in the future.

We use the macro-averaged F1-score to evaluate model performance. The F1-score is the harmonic mean of precision and recall, computed here on the validation set. Precision is the number of true positive results divided by the number of all results predicted as positive, including those predicted incorrectly; recall is the number of true positive results divided by the number of all samples that should have been identified as positive. With `average = 'macro'`, the score is computed for each language separately and the per-language scores are averaged with equal weight.
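Written out, these definitions take the following form. The symbols are notation introduced here for illustration: $TP_c$, $FP_c$ and $FN_c$ are the true positives, false positives and false negatives for language $c$, and $K = 11$ is the number of languages.

$$
\begin{aligned}
\text{precision}_c &= \frac{TP_c}{TP_c + FP_c}, \qquad
\text{recall}_c = \frac{TP_c}{TP_c + FN_c}, \\
F1_c &= \frac{2 \cdot \text{precision}_c \cdot \text{recall}_c}{\text{precision}_c + \text{recall}_c}, \qquad
F1_{\text{macro}} = \frac{1}{K} \sum_{c=1}^{K} F1_c
\end{aligned}
$$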

A confusion matrix is a summary of prediction results on a classification problem. Correct and incorrect predictions are counted and broken down by class: each row corresponds to a true class and each column to a predicted class, so the diagonal holds the correct predictions.
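A minimal sketch of how to read such a matrix, using six invented predictions over three language labels (the labels and predictions are assumptions for illustration):

```python
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions
y_true = ["eng", "eng", "afr", "afr", "zul", "zul"]
y_pred = ["eng", "afr", "afr", "afr", "zul", "eng"]
labels = ["afr", "eng", "zul"]  # fixes the row/column order

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Rows are true classes, columns are predicted classes,
# so recall per class is the diagonal divided by the row sums
recall_per_class = cm.diagonal() / cm.sum(axis=1)
print(dict(zip(labels, recall_per_class)))
```

Here one "eng" sample was predicted as "afr" and one "zul" sample as "eng", which shows up as the two off-diagonal entries.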

**Linear Support Vector Classification**

In [None]:
lsvc_pred = lsvc.predict(X_test)  # predict once and reuse the result

print('f1- score:')

print(metrics.f1_score(y_test, lsvc_pred, average = 'macro'))

print('\nConfusion Matrix\n')

print(confusion_matrix(y_test, lsvc_pred))

**Naive Bayes**

In [14]:
print('f1- score: '+ str(metrics.f1_score(y_test, nb.predict(X_test), average = 'macro')))

print('\nConfusion Matrix\n')

print(confusion_matrix(y_test, nb.predict(X_test)))

f1- score: 0.9986255849206738

Confusion Matrix

[[583   0   0   0   0   0   0   0   0   0   0]
 [  0 615   0   0   0   0   0   0   0   0   0]
 [  0   0 583   0   0   0   0   0   0   0   0]
 [  0   0   0 623   2   0   0   0   0   0   0]
 [  0   0   0   0 618   0   0   0   0   0   0]
 [  0   0   0   0   0 584   0   0   0   0   0]
 [  1   0   0   0   0   0 597   0   0   0   0]
 [  0   0   0   0   0   0   0 561   0   0   0]
 [  0   0   0   0   0   0   0   0 634   0   0]
 [  0   0   0   0   0   0   0   0   0 609   0]
 [  0   0   3   0   0   2   0   0   0   1 584]]
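The reported score can be reproduced from the confusion matrix alone. The sketch below copies the matrix printed above into NumPy and derives accuracy and macro F1 from it; it assumes the scikit-learn convention of rows as true classes and columns as predicted classes. The language code for each row is not shown in the output, so the rows are left unlabelled.

```python
import numpy as np

# Confusion matrix copied from the Naive Bayes output above
cm = np.array([
    [583, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 615, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 583, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 623, 2, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 618, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 584, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 597, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 561, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 634, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 609, 0],
    [0, 0, 3, 0, 0, 2, 0, 0, 0, 1, 584],
])

correct = np.diag(cm)                 # correctly classified samples per class
precision = correct / cm.sum(axis=0)  # column sums = samples predicted as each class
recall = correct / cm.sum(axis=1)     # row sums = true samples of each class
f1 = 2 * precision * recall / (precision + recall)

accuracy = correct.sum() / cm.sum()
macro_f1 = f1.mean()
print("validation samples:", cm.sum())
print("accuracy:", accuracy)
print("macro F1:", macro_f1)
```

Only 9 of the 6600 validation samples are misclassified, and 6 of those belong to the class in the last row.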


In [15]:
results = pd.DataFrame( data = {'index': df_test['index'],
                             'lang_id': nb.predict(df_test['clean_text']) })
results.to_csv('submission.csv', index = False)