# Language Identification Hackathon

©  Explore Data Science Academy

---

### Honour Code

I, **MARVIC** **COCOUVI**, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

<img src="climate_change.jpg" width="800px">

## **Language identification**

### Overview

South Africa is a multicultural society that is characterised by its rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and also contribute to the social, cultural, intellectual, economic and political life of the South African society.

The country is multilingual with 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and able to speak at least two or more of the official languages.

Source: South African Government

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Modeling</a>

<a href=#five>5. Submission</a>



<a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section we import the required packages and briefly describe the libraries used throughout the analysis and modelling. |

In [1]:
# Libraries for data loading, data manipulation and data visualisation
import numpy as np # for linear algebra
import pandas as pd # for importing, creating and manipulating dataframes

#Visualization Packages
import matplotlib.pyplot as plt
import seaborn as sns

# Packages for text manipulation and Natural language processing
import re
from string import punctuation
import nltk
nltk.download(['stopwords','punkt'])
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud, STOPWORDS

# Train-test split package
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.utils import resample

from imblearn.pipeline import Pipeline

# Libraries for data preparation and model building
# Classifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier 

from sklearn.ensemble import StackingClassifier

from sklearn.metrics import classification_report,confusion_matrix

from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from sklearn import metrics
import warnings
warnings.simplefilter('ignore')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cocou\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\cocou\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


<a id="two"></a>
## 2. Loading Data
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section the training set, the test set and the sample submission are loaded from their CSV files into DataFrames. |

---

In [3]:
test_df = pd.read_csv('C:/Users/cocou/OneDrive/Desktop/Books/test_set.csv')
train_df = pd.read_csv('C:/Users/cocou/OneDrive/Desktop/Books/train_set.csv')
sample_submission_df = pd.read_csv('C:/Users/cocou/OneDrive/Desktop/Books/sample_submission.csv')

In [4]:
train_df.head(10)

|   | lang_id | text |
| :--- | :--- | :--- |
| 0 | xho | umgaqo-siseko wenza amalungiselelo kumaziko ax... |
| 1 | xho | i-dha iya kuba nobulumko bokubeka umsebenzi na... |
| 2 | eng | the province of kwazulu-natal department of tr... |
| 3 | nso | o netefatša gore o ba file dilo ka moka tše le... |
| 4 | ven | khomishini ya ndinganyiso ya mbeu yo ewa maana... |
| 5 | nso | dinyakišišo tše tša go dirwa gabedi ka ngwaga ... |
| 6 | tsn | kgetse nngwe le nngwe e e sa faposiwang mo tsh... |
| 7 | ven | mbadelo dze dza laelwa dzi do kwama mahatulele... |
| 8 | nso | maloko a dikhuduthamaga a ikarabela mongwe le ... |
| 9 | tsn | fa le dirisiwa lebone le tshwanetse go bontsha... |


In [36]:
test_df.head()

|   | index | text | processed_text |
| :--- | :--- | :--- | :--- |
| 0 | 1 | Mmasepala, fa maemo a a kgethegileng a letlele... | mmasepala, fa maemo a a kgethegileng a letlele... |
| 1 | 2 | Uzakwaziswa ngokufaneleko nakungafuneka eminye... | uzakwaziswa ngokufaneleko nakungafuneka eminye... |
| 2 | 3 | Tshivhumbeo tshi fana na ngano dza vhathu. | tshivhumbeo tshi fana na ngano dza vhathu. |
| 3 | 4 | Kube inja nelikati betingevakala kutsi titsini... | kube inja nelikati betingevakala kutsi titsini... |
| 4 | 5 | Winste op buitelandse valuta. | winste op buitelandse valuta. |


In [None]:
train_df.shape

In [None]:
test_df.shape

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA) and Data Pre-processing
<a href=#cont>Back to Table of Contents</a>

In [None]:
train_df['lang_id'].value_counts()
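
The class counts show whether the languages are evenly represented, which in turn decides whether any resampling is needed before modelling. A minimal sketch of that check on made-up labels (`toy_df` and `is_balanced` are hypothetical names, not part of the notebook):

```python
import pandas as pd

# Toy labels standing in for train_df['lang_id'] (made-up data).
toy_df = pd.DataFrame({'lang_id': ['xho', 'eng', 'xho', 'eng', 'zul', 'zul']})

# Absolute count and relative share of each class.
counts = toy_df['lang_id'].value_counts()
shares = toy_df['lang_id'].value_counts(normalize=True)

# The dataset is perfectly balanced when every class has the same count.
is_balanced = bool(counts.nunique() == 1)

print(counts)
print(shares)
print(is_balanced)
```

If the counts differ widely, a stratified split and a weighted metric (both used later in this notebook) become more important.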

In [8]:
def text_preprocessing(text):
    '''
    Clean a text string: lowercase it and remove line breaks, mentions,
    URLs, tokens containing digits, hash symbols and repeated spaces.
    '''

    text = text.lower() # to lower case
    text = text.replace('\n', ' ') # replace line breaks with spaces
    text = re.sub(r'@\w*', '', text) # remove mentions (str.replace does not accept regex patterns)
    text = re.sub(r'\bhttps://t\.co/\w+', '', text) # remove t.co URLs
    text = re.sub(r'\w*\d\w*', '', text) # remove any token that contains a digit
    text = re.sub(r'#', '', text) # remove the hash symbol only. To remove the full hashtag: r'#\w*'
    text = re.sub(r' +', ' ', text) # collapse repeated spaces into one

    return text
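
`str.replace` matches literal substrings only, whereas `re.sub` interprets its first argument as a regular expression, so pattern-based steps such as mention removal need `re.sub`. A minimal sketch with a made-up sample string (`sample` and `strip_mentions` are hypothetical names, not part of the notebook):

```python
import re

def strip_mentions(text):
    """Remove @mentions using a regular expression."""
    return re.sub(r'@\w+', '', text)

sample = "hello @user this is a test"

# str.replace searches for the literal characters '@(\w*)', which never
# occur in the sample, so the text comes back unchanged.
literal_result = sample.replace(r'@(\w*)', '')

# re.sub treats the pattern as a regex and removes the mention.
regex_result = strip_mentions(sample)

# Collapse the double space left behind by the removal.
regex_result = re.sub(r' +', ' ', regex_result)

print(literal_result)
print(regex_result)
```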

<a id="four"></a>
## 4. Modeling
<a href=#cont>Back to Table of Contents</a>

In [10]:
# Splitting the labels and features
train_df['processed_text'] = train_df['text'].apply(text_preprocessing)
# Note: the models below are fitted on the raw 'text' column, not on 'processed_text'
X = train_df['text'].values
y = train_df['lang_id'].values

In [11]:
test_df['processed_text'] = test_df['text'].apply(text_preprocessing)

In [12]:
# Splitting the labels and features into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10,random_state=42,stratify=y)
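
Passing `stratify=y` asks `train_test_split` to keep the class proportions the same in both parts of the split. A small sketch on made-up data (the `_toy` names and the 20/10 class sizes are assumptions chosen for illustration):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up toy data: 20 samples of class 'a' and 10 of class 'b'.
X_toy = [f"text {i}" for i in range(30)]
y_toy = ['a'] * 20 + ['b'] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.10, random_state=42, stratify=y_toy
)

# Both parts keep the 2:1 ratio of 'a' to 'b'.
print(Counter(y_tr))
print(Counter(y_te))
```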

In [13]:
mnb = Pipeline([('Count',CountVectorizer()),('classify',MultinomialNB())])
#fitting the model
mnb.fit(X_train, y_train)

#apply model on test data
y_pred_mnb = mnb.predict(X_test)

In [14]:
# Classification report
print(classification_report(y_test, y_pred_mnb))

              precision    recall  f1-score   support

         afr       1.00      1.00      1.00       300
         eng       0.99      1.00      1.00       300
         nbl       1.00      1.00      1.00       300
         nso       1.00      1.00      1.00       300
         sot       1.00      1.00      1.00       300
         ssw       1.00      1.00      1.00       300
         tsn       1.00      1.00      1.00       300
         tso       1.00      1.00      1.00       300
         ven       1.00      1.00      1.00       300
         xho       1.00      0.99      1.00       300
         zul       1.00      1.00      1.00       300

    accuracy                           1.00      3300
   macro avg       1.00      1.00      1.00      3300
weighted avg       1.00      1.00      1.00      3300



In [15]:
#SVC
svc = Pipeline([('Count',CountVectorizer()),('classify',SVC(max_iter=300,C=1))])

In [16]:
#linearSVC
linsvc = Pipeline([('Count',CountVectorizer()),('classify',LinearSVC(max_iter=300,C=1))])

In [17]:
#Logistic Regression
lr = Pipeline([('Count',CountVectorizer()),('classify',LogisticRegression(max_iter=300))])

In [18]:
# Invoke the KNN classifier
knn = Pipeline([('Count',CountVectorizer()),('classify',KNeighborsClassifier(n_neighbors=3))])

In [19]:
# Call up the Random Forest classifier
rf = Pipeline([('Count',CountVectorizer()),('classify',RandomForestClassifier())])

In [20]:
num=3
# SVC
scores = cross_val_score(
        svc, X, y, cv=num, scoring='f1_weighted')
print('The average weighted F1 score over '+str(num)+' SVC models is ' + str(sum(scores)/len(scores)))

The average weighted F1 score over 3 SVC models is 0.9845704691936438


In [21]:
#linearSVC
scores = cross_val_score(
        linsvc, X, y, cv=num, scoring='f1_weighted')
print('The average weighted F1 score over '+str(num)+ ' LinearSVC models is ' + str(sum(scores)/len(scores)))

The average weighted F1 score over 3 LinearSVC models is 0.9965452288607383


In [22]:
#Logistic Regression
scores = cross_val_score(
        lr, X, y, cv=num, scoring='f1_weighted')
print('The average weighted F1 score over '+str(num)+' Logistic Regression models is ' + str(sum(scores)/len(scores)))

The average weighted F1 score over 3 Logistic Regression models is 0.9951497185965673


In [24]:
#KNN
scores = cross_val_score(
        knn, X, y, cv=num, scoring='f1_weighted')
print('The average weighted F1 score over '+str(num)+' KNN models is ' + str(sum(scores)/len(scores)))

The average weighted F1 score over 3 KNN models is 0.9012927517522525


In [25]:
#Random Forest
scores = cross_val_score(
        rf, X, y, cv=num, scoring='f1_weighted')
print('The average weighted F1 score over '+str(num)+' Random Forest models is ' + str(sum(scores)/len(scores)))

The average weighted F1 score over 3 Random Forest models is 0.9868922435102471
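
The cross-validation cells above repeat the same call with a different pipeline each time, so the comparison could also be written as one loop. The sketch below runs on a made-up toy corpus, uses `sklearn.pipeline.Pipeline` instead of the `imblearn` pipeline imported earlier, and covers only two of the classifiers; `texts`, `labels`, `models` and `mean_scores` are hypothetical names:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus: two "languages" with disjoint vocabularies.
texts = [
    "the cat sat", "the dog ran", "a cat ran",
    "the dog sat", "a dog sat", "the cat ran",
    "die kat sit", "die hond hardloop", "n kat hardloop",
    "die hond sit", "n hond sit", "die kat hardloop",
]
labels = ['eng'] * 6 + ['afr'] * 6

models = {
    'MultinomialNB': MultinomialNB(),
    'LogisticRegression': LogisticRegression(max_iter=300),
}

num = 3
mean_scores = {}
for name, estimator in models.items():
    # Same two-step pipeline as in the notebook: count features, then classify.
    pipe = Pipeline([('Count', CountVectorizer()), ('classify', estimator)])
    scores = cross_val_score(pipe, texts, labels, cv=num, scoring='f1_weighted')
    mean_scores[name] = scores.mean()
    print(f"The average weighted F1 score over {num} {name} folds is {scores.mean():.4f}")
```

Taking the label from the dictionary key keeps each printed message tied to the model that produced the score.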


In [None]:
#Decision Tree
dt = Pipeline([('Count',CountVectorizer()),('classify',DecisionTreeClassifier())])

**Hyperparameter Tuning** 



In [26]:
from sklearn.model_selection import GridSearchCV
# Candidate values for the inverse regularisation strength C
Cs = [0.001, 0.01, 0.1, 1, 10]
param_grid = {
    'C'     : Cs
    }
# Grid search for Logistic Regression on count features
grid_lr = GridSearchCV(LogisticRegression(), param_grid, scoring='f1_weighted', cv=3)
grid_lr.fit(CountVectorizer().fit_transform(X), y)
grid_lr.best_params_

{'C': 10}

In [27]:
param_grid = {'C'     : Cs }
grid_SVM = GridSearchCV(LinearSVC(), param_grid, scoring='f1_weighted', cv=3)
grid_SVM.fit(CountVectorizer().fit_transform(X), y)
grid_SVM.best_params_

{'C': 10}
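
The two searches above vectorise the full dataset once before cross-validation, so the vocabulary is learned from every row, including each validation fold. One alternative is to tune the whole pipeline, in which case the vectoriser is refitted inside each fold and the parameter is addressed as `<step name>__<parameter>`. A hedged sketch on a made-up toy corpus (`texts`, `labels`, `pipe` and `grid` are hypothetical names, and `sklearn.pipeline.Pipeline` stands in for the `imblearn` one):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy corpus: two "languages" with disjoint vocabularies.
texts = [
    "the cat sat", "the dog ran", "a cat ran",
    "the dog sat", "a dog sat", "the cat ran",
    "die kat sit", "die hond hardloop", "n kat hardloop",
    "die hond sit", "n hond sit", "die kat hardloop",
]
labels = ['eng'] * 6 + ['afr'] * 6

pipe = Pipeline([('Count', CountVectorizer()), ('classify', LinearSVC())])

# 'classify__C' targets the C parameter of the step named 'classify'.
param_grid = {'classify__C': [0.001, 0.01, 0.1, 1, 10]}

grid = GridSearchCV(pipe, param_grid, scoring='f1_weighted', cv=3)
grid.fit(texts, labels)

print(grid.best_params_)
```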

In [40]:
estimators = [
        ('rf', Pipeline([('Count',CountVectorizer(ngram_range=(1,2))),('classify',RandomForestClassifier())])),
         
        ('lnsvc', Pipeline([('Count',CountVectorizer(ngram_range=(1,2))),('classify',LinearSVC(C=0.1))])),
         
        ('MNB',Pipeline([('Count',CountVectorizer()),('classify',MultinomialNB())])),
    
        ('lr', Pipeline([('Count',CountVectorizer(ngram_range=(1,2))),('classify',LogisticRegression(C=1))]))]

In [41]:
clf = StackingClassifier(
        estimators=estimators, final_estimator=LogisticRegression()
    )

#fitting the model
clf.fit(X, y)
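
A stacking classifier fits each base pipeline, collects their cross-validated predictions, and trains the final estimator on those predictions. The sketch below shows the same structure at toy scale; the corpus is made up, only two base pipelines are used, `cv=3` is chosen purely because the data is tiny, and all `toy_` names are hypothetical:

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy corpus (not the competition data).
X_toy = np.array([
    "the cat sat", "the dog ran", "a cat ran",
    "the dog sat", "a dog sat", "the cat ran",
    "die kat sit", "die hond hardloop", "n kat hardloop",
    "die hond sit", "n hond sit", "die kat hardloop",
])
y_toy = np.array(['eng'] * 6 + ['afr'] * 6)

# Each base estimator is a full text pipeline, as in the notebook.
toy_estimators = [
    ('lnsvc', Pipeline([('Count', CountVectorizer()), ('classify', LinearSVC(C=0.1))])),
    ('MNB', Pipeline([('Count', CountVectorizer()), ('classify', MultinomialNB())])),
]

# The final estimator learns from the base models' out-of-fold predictions.
toy_clf = StackingClassifier(
    estimators=toy_estimators, final_estimator=LogisticRegression(), cv=3
)
toy_clf.fit(X_toy, y_toy)

toy_pred = toy_clf.predict(np.array(["the cat sat", "die hond sit"]))
print(toy_pred)
```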

<a id="five"></a>
<a id="save"></a>
## 5. Submission
<a href=#cont>Back to Table of Contents</a>

Let's create the CSV file for our Kaggle submission.

In [42]:
x_unseen = test_df['processed_text']

submission = pd.DataFrame(
    {'index': test_df['index'],
     'lang_id': clf.predict(x_unseen)
    })

# save DataFrame to csv file for submission
submission.to_csv("Submission_final.csv", index=False)