# EDSA 2022 Classification Hackathon - South African Language Identification

**Overview**

South Africa is a multicultural society that is characterised by its rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and also contribute to the social, cultural, intellectual, economic and political life of the South African society.

The country is multilingual with 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and able to speak at least two or more of the official languages.
Source: South African Government

With such a multilingual population, it is only natural that our systems and devices should also communicate in multiple languages.

In this challenge, you will take text written in any of South Africa's 11 official languages and identify which language it is in. This is an example of language identification, the natural language processing (NLP) task of determining the natural language that a piece of text is written in.

# Honour Code

I, **Christian Divinefavour**, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the EDSA honour code.

Non-compliance with the honour code constitutes a material breach of contract.


# Problem Statement

South Africa has 11 diverse official languages, so a system is needed that can take in text written in any of these languages and accurately identify which language it is in. Such a system would aid everyday communication between communities in the country.


<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Preprocessing</a>

<a href=#five>5. Modeling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

<a id="one"></a>
## 1. Importing Packages 

In [None]:

#import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from wordcloud import WordCloud, STOPWORDS , ImageColorGenerator
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

import re
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn import preprocessing
from sklearn.utils import resample
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier


<a id = "two"></a>
## 2. Loading Data

**Load original data**

In [None]:
df_train = pd.read_csv('train_set.csv')
df_test = pd.read_csv('test_set.csv')
df_train.head()

In [None]:
df_test.head()

**Create a copy of data for further analysis and grouping by distinct language**

In [None]:
train_copy = df_train.copy()
test_copy = df_test.copy()

<a id = "three"></a>
## 3. Exploratory Data Analysis

**Now that we've successfully loaded our data, let's see what we're working with. First, let's check the data types in the training data**

In [None]:
df_train.info()

**Next, let's check for null values**

In [None]:
df_train.isnull().sum()

**There are no null values in the data frame. Let's check for unique values of lang_id**

In [None]:
unique_vals = df_train['lang_id'].unique()
count_of_unique_vals = df_train['lang_id'].nunique()

print(unique_vals, "\nThere are ", count_of_unique_vals, "unique values in total")

**Let's visualize how often each language occurs in the data frame by plotting the value counts**

In [None]:
fig, ax = plt.subplots()
df_train['lang_id'].value_counts().plot(kind='barh', ax=ax)
# In a horizontal bar chart the counts run along the x-axis
ax.set_xlabel('Number of entries')
ax.set_ylabel('Language')
plt.show()


**It seems each language has 3000 entries. Let's confirm that by creating a dictionary that maps each language to its number of occurrences**

In [None]:
unique_plot = {}
for i in unique_vals:
    unique_plot[i] = df_train[df_train['lang_id'] == i]['lang_id'].count()
unique_plot

**Clearly, each language has 3000 entries**
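
The same check can be written more compactly with `value_counts`. This is a minimal sketch, assuming a small invented frame named `toy` in place of `df_train`; the rows are made up for illustration only.

```python
import pandas as pd

# Toy stand-in for df_train; rows are invented for illustration only.
toy = pd.DataFrame({
    'lang_id': ['xho', 'eng', 'xho', 'eng', 'afr', 'afr'],
    'text': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Count rows per language and convert the result to a plain dict.
counts = toy['lang_id'].value_counts().to_dict()

# The classes are balanced when every language has the same count.
is_balanced = len(set(counts.values())) == 1
print(counts, is_balanced)
```

On the real data, replacing `toy` with `df_train` should give the same dictionary as the loop above.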

**Let us now visualize what the words look like, using word clouds.**

1. Group the data into a list of 11 dataframes of the unique languages
2. Visualize, using word cloud
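
The grouping step can also be sketched with `groupby`. This assumes a small invented frame named `toy` instead of `train_copy`, and the names `frames` and `joined` are hypothetical.

```python
import pandas as pd

# Toy stand-in for train_copy; rows are invented for illustration only.
toy = pd.DataFrame({
    'lang_id': ['xho', 'eng', 'xho', 'afr'],
    'text': ['molo', 'hello', 'enkosi', 'hallo'],
})

# One sub-frame per language, keyed by its lang_id code.
frames = {lang: group for lang, group in toy.groupby('lang_id')}

# Joining a group's text column gives the single string a word cloud needs.
joined = {lang: ' '.join(group['text']) for lang, group in frames.items()}
print(joined['xho'])
```

A dictionary keyed by language code avoids keeping one variable per language.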

In [None]:
column_list = [train_copy[train_copy['lang_id'] == j] for j in unique_vals]
column_list

In [None]:
# Languages: 'xho' 'eng' 'nso' 'ven' 'tsn' 'nbl' 'zul' 'ssw' 'tso' 'sot' 'afr'
#--VISUALIZATION: draw one word cloud per language
for lang in unique_vals:
    lang_df = train_copy[train_copy['lang_id'] == lang]
    plt.figure()
    wc = WordCloud(max_words = 200).generate(" ".join(lang_df.text))
    plt.imshow(wc , interpolation = 'bilinear')
    plt.title(lang.upper(),fontsize=20)

<a id = "four"></a>
## 4. Data Preprocessing

Let's clean the data.

In [None]:
nltk.download('stopwords')
stop = stopwords.words('english')
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [None]:
nltk.download('wordnet')
from bs4 import BeautifulSoup

def review_to_words(raw_message):
    # 1. Strip any HTML markup
    message_text = BeautifulSoup(raw_message, 'html.parser').get_text()
    # 2. Remove URLs
    letters3 = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', message_text)
    letters_only = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))', '', letters3, flags=re.MULTILINE)
    # 3. Replace every non-letter character with a space
    letters = re.sub('[^a-zA-Z]', ' ',  letters_only)
    letters1 = re.sub(r'http', ' ', letters)
    letters2 = re.sub("\n", "", letters1)
    # 4. Lowercase and split into words
    words = letters2.lower().split()
    # 5. Remove stopwords (the list loaded above covers English only)
    meaningful_words = [w for w in words if w not in stop]
    # 6. Lemmatize each remaining word
    lemmitize_words = [lemmatizer.lemmatize(w) for w in meaningful_words]
    # 7. Join the words back into one space-separated string
    return ' '.join(lemmitize_words)
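
To see what the URL-removal, letters-only and lowercasing steps do in isolation, here is a minimal sketch that uses only the standard library. The helper name `basic_clean` and the sample string are hypothetical, the URL pattern is a simplified stand-in for the ones above, and HTML stripping, stopword removal and lemmatization are left out.

```python
import re

def basic_clean(raw_text):
    """Hypothetical helper: drop URLs, keep letters only, lowercase."""
    # Remove anything that looks like a URL (simplified pattern).
    no_urls = re.sub(r'https?://\S+', '', raw_text)
    # Replace every character outside a-z/A-Z with a space.
    letters = re.sub(r'[^a-zA-Z]', ' ', no_urls)
    # Lowercase and collapse runs of whitespace.
    return ' '.join(letters.lower().split())

sample = "Umthetho 12: see https://example.org/page for MORE!"
print(basic_clean(sample))
```

Note that the `[^a-zA-Z]` pattern also drops accented letters, such as the ê in the Afrikaans word "sê", which may remove characters that help tell languages apart.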

In [None]:
df_train['cleaned_text'] = df_train['text'].apply(review_to_words)
df_test['cleaned_text'] = df_test['text'].apply(review_to_words)

In [None]:
df_test.head()

In [None]:
use_train = df_train[['lang_id', 'cleaned_text']]
use_test = df_test[['cleaned_text']]

use_train.head()

In [None]:
count_vector = CountVectorizer(max_features=20000,analyzer='word', ngram_range=(2, 2))
tfidf1 = TfidfVectorizer()
vect1 = [count_vector , tfidf1]

X = count_vector.fit_transform(use_train['cleaned_text'].values.astype(str))
X.shape

In [None]:
y = use_train['lang_id']

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
train_model = pd.DataFrame(data=X.toarray(),columns = count_vector.get_feature_names_out())
train_model.head()
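
Calling `X.toarray()` stores every zero explicitly. With 33,000 rows (11 languages with 3000 entries each) and 20,000 `int64` columns, that is roughly 5 GB. A minimal sketch, assuming the dense DataFrame is not needed for inspection: scikit-learn estimators such as `MultinomialNB`, `LinearSVC` and `RandomForestClassifier` accept the sparse matrix directly. The sentences below are invented toy data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, invented for illustration only.
texts = ['the cat sat on the mat', 'die kat sit op die mat',
         'the dog ran to the park', 'die hond hardloop na die park']
labels = ['eng', 'afr', 'eng', 'afr']

vectorizer = CountVectorizer(ngram_range=(2, 2))
X_sparse = vectorizer.fit_transform(texts)  # stays a scipy.sparse matrix

# Fit directly on the sparse matrix; no DataFrame or .toarray() needed.
clf = MultinomialNB().fit(X_sparse, labels)
prediction = clf.predict(vectorizer.transform(['die kat sit']))
print(X_sparse.shape, prediction)
```

The same vectorizer object must be reused with `transform` on new text so that the columns line up.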

<a id = "five"></a>
## 5. Modeling

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_model, y, test_size=0.2, random_state=42)
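
Because every language has the same number of rows, the split can be made to keep that balance in both parts. This is a minimal sketch, assuming the `stratify` argument of `train_test_split` is wanted here (the call above does not use it), on invented toy labels.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy features and labels: 10 rows for each of three languages.
X_toy = [[i] for i in range(30)]
y_toy = ['afr'] * 10 + ['eng'] * 10 + ['zul'] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Each language contributes the same share to the held-out part.
print(Counter(y_te))
```

Without `stratify`, a random split of balanced data is usually close to balanced but not exactly so.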

In [None]:
rF_model = RandomForestClassifier(n_estimators=2, random_state=0)
# Fit on the training split only, so that X_test stays unseen for evaluation
rF_model.fit(X_train, y_train)
y_pred = rF_model.predict(X_test)

In [None]:
classification_report(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))
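
For reference, the per-class precision, recall and F1 columns of the report can be reproduced by hand. The sketch below uses only the standard library; the function name `precision_recall_f1` and the toy labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall and F1 for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy labels, invented for illustration only.
y_true = ['zul', 'zul', 'xho', 'xho', 'eng']
y_pred = ['zul', 'xho', 'xho', 'xho', 'eng']
print(precision_recall_f1(y_true, y_pred, 'xho'))
```

Here one `zul` row is predicted as `xho`, which lowers the precision for `xho` and the recall for `zul`.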

In [None]:
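svc = LinearSVC(C=1)
p = [0.001, 0.01, 0.1, 1, 10]  # candidate values for C (not used in this cell)

# Fit on the training split and evaluate the SVM's own predictions
svc.fit(X_train, y_train)
y_pred1 = svc.predict(X_test)
print(classification_report(y_test, y_pred1))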
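
The list `p` is defined but never used, and `GridSearchCV` is imported but never called. Assuming the list was meant as candidate values for `C`, a search over it might look like the sketch below. The synthetic data from `make_classification` is a stand-in for the vectorized text, and `max_iter=10000` and `cv=3` are illustrative choices, not values from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic three-class data standing in for the vectorized text.
X_toy, y_toy = make_classification(
    n_samples=150, n_features=20, n_informative=5,
    n_classes=3, random_state=0)

p = [0.001, 0.01, 0.1, 1, 10]  # candidate values for C

# Try every C with 3-fold cross-validation and keep the best one.
search = GridSearchCV(LinearSVC(max_iter=10000), {'C': p}, cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is a `LinearSVC` refit on all the data passed to `fit`, using the best `C`.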


In [None]:
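from sklearn.naive_bayes import MultinomialNB
# Note: despite its name, `logreg` holds a multinomial Naive Bayes model
logreg = MultinomialNB()
logreg.fit(X_train, y_train)
# Evaluate the Naive Bayes model's own predictions
y_pred2 = logreg.predict(X_test)
print(classification_report(y_test, y_pred2))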


Let's apply the same transformations we used on df_train to the test data

In [None]:
use_test.head()

In [None]:
count_test = count_vector.transform(use_test['cleaned_text'].values.astype(str))
count_test.shape

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
test_model = pd.DataFrame(data = count_test.toarray(),columns = count_vector.get_feature_names_out())
test_model.head()

In [None]:
rF_pred = rF_model.predict(test_model)

In [None]:
n =  test_model.index.tolist()
index = [x+1 for x in n]

submission = pd.DataFrame({'index': index, 'lang_id': rF_pred})
submission.head(15)

In [None]:
submission.to_csv('submission.csv',index=False)

In [None]:
svc_pred = svc.predict(test_model)
submission1 = pd.DataFrame({'index': index, 'lang_id': svc_pred})
submission1.head(15)

In [None]:
log_pred = logreg.predict(test_model)
submission3 = pd.DataFrame({'index': index, 'lang_id': log_pred})
submission3.head()