## Classification Hackathon 2021 - Language Classifier
Henri Edwards - Explore Data Science - Class of July 2021

### 1. Introduction

South Africa is a multicultural society characterised by rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and to contribute to the social, cultural, intellectual, economic and political life of South African society.

The country is multilingual with 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and able to speak at least two of the official languages.

With such a multilingual population, it makes sense that our systems and devices should also communicate in multiple languages.

### 2. Challenge

In this challenge, you will take text written in any of South Africa's 11 official languages and identify which language it is in. This is an example of language identification in NLP: the task of determining the natural language that a piece of text is written in.
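
As a rough illustration of the task (not the model built later in this notebook), language identification can be framed as ordinary text classification over word counts. The sketch below assumes scikit-learn is installed; the four sentences and the `eng`/`afr` labels are made-up toy data, not taken from the competition files.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical): two English and two Afrikaans sentences
texts = [
    "the cat sits on the mat",
    "this is a good day",
    "die kat sit op die mat",
    "dit is 'n goeie dag",
]
labels = ["eng", "eng", "afr", "afr"]

# Bag-of-words counts feeding a multinomial Naive Bayes classifier
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(texts, labels)

# Each unseen sentence shares most of its words with one language only
predictions = toy_model.predict(["the day is good", "die dag is goeie"])
print(list(predictions))
```

With so little data the model simply picks the language whose training sentences share the most words with the input; the real dataset works on the same principle at a much larger scale.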

### 3. Importing Packages

In [1]:
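# Install Prerequisites
import sys
import nltk

# Download the NLTK resources used below (skipped if already present);
# newer NLTK releases may also need 'punkt_tab' for word_tokenize
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)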

# Exploratory Data Analysis
import re
import ast
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Data Preprocessing
import string
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import SnowballStemmer, PorterStemmer, LancasterStemmer
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
stemmer = PorterStemmer()

# Classification Model
from sklearn.naive_bayes import MultinomialNB

# Performance Evaluation
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, classification_report, confusion_matrix

In [2]:
# Ignore spammy warnings
import warnings
warnings.filterwarnings('ignore')

### 4. Loading the Data

In [3]:
train_df = pd.read_csv('train_set.csv')
test_df = pd.read_csv('test_set.csv')

In [None]:
# View train_df
train_df.head(3)

In [None]:
# View test_df
test_df.head(3)

### 5. Data Cleaning and Preprocessing

In [None]:
def cleaning(text):

    """Function that takes in input text, transforms it to lowercase, and removes hyperlinks, punctuation and English stop words"""

    stopwords_list = set(stopwords.words('english')) # A set makes the membership check below fast

    text = text.lower() # Changes input text to lowercase for better cleaning
    # Remove hyperlinks and punctuation in one pass (raw string avoids invalid-escape warnings).
    # Note: this also strips every character outside 0-9, A-Z, a-z, including accented letters.
    text = ' '.join(re.sub(r"([^0-9A-Za-z \t])|(\w+://\S+)", " ", text).split())
    text = ' '.join(word for word in text.split() if word not in stopwords_list) # Remove stop words
    return text

train_df['text'] = train_df['text'].apply(cleaning) # Apply function to train_df
test_df['text'] = test_df['text'].apply(cleaning) # Apply function to test_df
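
To see what the regular expression inside `cleaning` does on its own, the sketch below applies the same pattern to two made-up sentences using only the standard library. The helper name `strip_noise`, the example URL and the sample sentences are hypothetical and are not part of the notebook or the dataset.

```python
import re

# Same pattern as in `cleaning`: any character outside 0-9, A-Z, a-z, space, tab,
# or a whole "scheme://..." hyperlink
PATTERN = re.compile(r"([^0-9A-Za-z \t])|(\w+://\S+)")

def strip_noise(text):
    """Lowercase, replace matches with spaces, then collapse repeated whitespace."""
    return " ".join(PATTERN.sub(" ", text.lower()).split())

# Hyperlink and punctuation are dropped
with_link = strip_noise("Sien https://example.com vir meer: dit is 'n TOETS!")
print(with_link)

# Accented letters are not in the allowed set, so they are dropped as well
with_accents = strip_noise("Hy s\u00ea m\u00f4re")
print(with_accents)
```

The second call shows a side effect worth keeping in mind: letters with diacritics are treated like punctuation, so words containing them are split into fragments.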

**Tokenisation**

In [None]:
def tokenize_text(text):
    
    """Function tokenizes input string, and outputs tokenized text"""
    
    # Named differently from `cleaning` so the earlier function is not overwritten
    return word_tokenize(text)

# Applies function and creates a new feature with function output
train_df['tokens'] = train_df['text'].apply(tokenize_text)
test_df['tokens'] = test_df['text'].apply(tokenize_text)

In [None]:
# view tokenized feature
train_df.head(3)

**Stemming**

Stemming outperformed lemmatization.
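
For readers unfamiliar with stemming, the following minimal sketch shows what NLTK's `PorterStemmer` does to a few tokens. It assumes NLTK is installed; the example words are made up for illustration and do not come from the dataset.

```python
from nltk.stem import PorterStemmer

demo_stemmer = PorterStemmer()

# Hypothetical example tokens; the stemmer strips common English suffixes
tokens = ["running", "languages", "classification"]
stems = [demo_stemmer.stem(token) for token in tokens]
print(stems)
```

The Porter algorithm was designed for English suffixes, so its effect on words from the other official languages is a side effect of those rules rather than true linguistic stemming.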

In [None]:
def stemming(text, stemmer):
    
    """Function performs stemming on a tokenized feature, and outputs a stemmed feature"""
    
    return [stemmer.stem(word) for word in text]

train_df['stem'] = train_df['tokens'].apply(stemming, args=(stemmer, )) # Apply function to train_df
test_df['stem'] = test_df['tokens'].apply(stemming, args=(stemmer, )) # Apply function to test_df

In [None]:
def list_to_sentence(tokens):
    
    """Function joins a list of tokens into a single space-separated string"""
    
    return ' '.join(tokens)

train_df['tokens'] = train_df['tokens'].apply(list_to_sentence) # Apply function to train_df
train_df['stem'] = train_df['stem'].apply(list_to_sentence) # Apply function to train_df

test_df['tokens'] = test_df['tokens'].apply(list_to_sentence) # Apply function to test_df
test_df['stem'] = test_df['stem'].apply(list_to_sentence) # Apply function to test_df

In [None]:
train_df.head(3)

### 6. Exploratory Data Analysis

In [None]:
# View column names, dtypes and non-null counts
train_df.info()

In [None]:
# Count the distinct languages to predict
print('Total languages to predict: '+ str(train_df['lang_id'].nunique()))

In [None]:
# List the language labels to predict
train_df['lang_id'].unique()

### 7. Modeling

In [None]:
# Assign the independent variable to X and the dependent variable to y
X = train_df['stem']
y = train_df['lang_id']  

In [None]:
# Vectorize X
vect = CountVectorizer()
X = vect.fit_transform(X)

# Split train_df into train and validation sets, holding out 1% for validation
# so that most of the data is used for training
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.01, random_state=42)

#### Training Model

In [None]:
# Assign algorithm to clf
clf = MultinomialNB(alpha=1)

# Fit data to MNB model
clf.fit(X_train, y_train)

In [None]:
# Make predictions on the validation data
y_pred = clf.predict(X_val)

In [None]:
# Print classification report
print(metrics.classification_report(y_val, y_pred))

# Print f1 score
print('F1_score: ',round(metrics.f1_score(y_val, y_pred, average = 'macro'),8))

### 8. Model Performance
Hyperparameter tuning with scikit-learn's `GridSearchCV` to improve prediction performance.
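
The sketch below shows the general shape of such a search on a tiny made-up dataset, assuming scikit-learn is installed. The toy sentences, the `eng`/`afr`/`zul` labels and `cv=2` are assumptions chosen only so the example runs quickly; the alpha grid mirrors the one used in this notebook. Macro-averaged F1 (`'f1_macro'`) is used because the plain `'f1'` scorer is meant for binary targets. Putting the vectorizer inside the pipeline is a design choice of this sketch, whereas the notebook vectorizes before splitting.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy three-class data (hypothetical): four short texts per language
texts = [
    "the cat sits on the mat", "this is a good day",
    "the dog is on the mat", "this is the good cat",
    "die kat sit op die mat", "dit is 'n goeie dag",
    "die hond is op die mat", "dit is die goeie kat",
    "sawubona unjani", "ngiyabonga kakhulu",
    "yebo ngiyabonga", "sawubona abantu",
]
labels = ["eng"] * 4 + ["afr"] * 4 + ["zul"] * 4

# make_pipeline names each step after its lowercased class name
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

alphas = [0.0005, 0.0025, 0.005]
param_grid = {"multinomialnb__alpha": alphas}

# Two stratified folds because each class has only four samples
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.predict(...)` uses the pipeline refitted on all the data with the best alpha, which is the same behaviour the notebook relies on when it calls `predict` on the fitted grid search.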

In [None]:
# View available hyperparameters for MNB algorithm
MultinomialNB().get_params()

In [None]:
# Hyperparameter selection for gridsearch
alphas = [0.0005, 0.0025, 0.005]

In [None]:
# Reference the hyperparameter selection
param_grid = {'alpha': alphas}

# Assign algorithm to MNB
MNB = MultinomialNB()

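# Assign gridsearch to grid_MNB
# ('f1' alone is for binary targets; the multiclass target needs an averaged score such as 'f1_macro')
grid_MNB = GridSearchCV(MNB, param_grid, scoring='f1_macro')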

# Fit model to training data using Gridsearch
grid_MNB.fit(X_train, y_train)

# Get best performing hyperparameters
grid_MNB.best_params_

In [None]:
# Make predictions on the validation data using the best performing hyperparameters
y_pred = grid_MNB.predict(X_val)

In [None]:
# Print classification report
print(metrics.classification_report(y_val, y_pred))

# Print F1 score
print('F1_score: ',round(metrics.f1_score(y_val, y_pred, average = 'macro'),8)) 

### 9. Submission
For Kaggle submission only.

In [None]:
# Make predictions on test data
pred_test_data = grid_MNB.predict(vect.transform(test_df['stem']))
pred_df = pd.DataFrame(data=test_df['index'], columns=['index'])
pred_df.insert(1, 'lang_id', pred_test_data, allow_duplicates=False)
pred_df.to_csv(path_or_buf='Submission.csv', index=False) 