# South African Language Identification Hack 2022

© Explore Data Science Academy

---
### Honour Code

I {**Joshua, Olalemi**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Hackathon Overview: South African Language Identification Challenge

South Africa is a multicultural society that is characterised by its rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and also contribute to the social, cultural, intellectual, economic and political life of the South African society.

The country is multilingual with 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and able to speak two or more of the official languages.
(Source: South African Government)
<br></br>

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2205222%2F7f34544c1b1f61d1a5949bddacfd84a9%2FSouth_Africa_languages_2011.jpg?generation=1604393669339034&alt=media"
     alt="South Africa's Languages"
     style="float: center; padding-bottom=0.5em"
     width=800px/>
</div>


With such a multilingual population, it is only natural that our systems and devices also communicate in multiple languages.

In this challenge, you will take text written in any of South Africa's 11 official languages and identify which language it is in. This is an example of Language Identification in NLP: the task of determining the natural language that a piece of text is written in.

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section, I imported and briefly discussed the libraries that I used throughout the analysis and modelling. |

---

In [4]:
# Libraries for data loading, data manipulation and data visualisation
import pandas as pd # This will be used for data loading and manipulation
import numpy as np # This will be used for linear algebra on data
#import seaborn as sns #
#import matplotlib.pyplot as plt #This will be used for data visualization
#from matplotlib import rc
#%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
# Libraries for data preparation and model building
#import *

# Setting global constants to ensure notebook results are reproducible
#PARAMETER_CONSTANT = ###

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section, I loaded the "train" and "test" datasets into separate DataFrames. |

---

In [5]:
df = pd.read_csv("train_set.csv") # load the data
df.head() #This is just to be sure the dataset is loaded and to have a glance at it

Unnamed: 0,lang_id,text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...
2,eng,the province of kwazulu-natal department of tr...
3,nso,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...


<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, I performed an in-depth analysis of all the variables in the DataFrame.|

---


In [6]:
# look at the first five rows of the data
df.head()

Unnamed: 0,lang_id,text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...
2,eng,the province of kwazulu-natal department of tr...
3,nso,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...


Looking at the top five rows of our data we can see some of our features as well as the types of data we are working with.

Observe that the contents of the `text` column are truncated in the display, as shown by the trailing `...` on each row.

To get all the features of our train dataset, we will use the `.columns` attribute.

In [7]:
df.columns #This will return all the column names in the dataset

Index(['lang_id', 'text'], dtype='object')

In [8]:
df.shape #This will show us the 'shape' of our dataset.

(33000, 2)

As displayed, our data has **33000** rows and **2** columns.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   lang_id  33000 non-null  object
 1   text     33000 non-null  object
dtypes: object(2)
memory usage: 515.8+ KB


The `info` method summarises the dataset: the index range, each column's name, its non-null count and its data type.

A model can only read numeric data types, such as 'float64' or 'int64'. Categorical or text features ('object') must be converted to numbers first, otherwise they will break the model.

So, observe that both columns, 'lang_id' and 'text', have the 'object' data type.
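
Since the text is stored as strings, it has to be mapped to numbers before a model can use it. One common way to do this for language identification is to count character n-grams. The block below is a minimal sketch of that idea, not the method used in this notebook: the helper names `char_ngrams` and `vectorise` and the choice of trigrams are assumptions made for illustration, and the two sample strings are the truncated previews shown by `df.head()`.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def vectorise(texts, n=3):
    """Map each text to a fixed-length list of n-gram counts over a shared vocabulary."""
    counts = [Counter(char_ngrams(t, n)) for t in texts]
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for n-grams a text does not contain.
    return vocab, [[c[g] for g in vocab] for c in counts]


# Truncated previews taken from the df.head() output (illustrative only).
samples = [
    "the province of kwazulu-natal department of",
    "khomishini ya ndinganyiso ya mbeu yo ewa",
]
vocab, matrix = vectorise(samples)
print(len(vocab), len(matrix), len(matrix[0]))
```

Each row of `matrix` has one entry per n-gram in `vocab`, so every text ends up as a numeric vector of the same length.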

Null values also have to be dealt with before modelling.

Remember that from the `df.shape` command we ran earlier, our dataset has `33000` rows. So, from our table info cell above, any column whose **Non-Null Count** is less than *33000* contains nulls.

Now, observe the **Non-Null Count** of 'lang_id' and 'text'. Both are 33000, so neither feature has any nulls in it. This can be confirmed below:

In [10]:
df.isnull().sum() #This returns the count of null values in the features.

lang_id    0
text       0
dtype: int64

This confirms that neither 'lang_id' nor 'text' has any null values in it.

In [11]:
df.mean(numeric_only=True) # There are no numeric columns, so this returns an empty Series

Series([], dtype: float64)

In [12]:
# plot relevant feature interactions

In [13]:
# evaluate correlation

In [14]:
# have a look at feature distributions

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section, I engineered the dataset by cleaning it and also adding new features - as identified in the EDA phase. |

---

In [15]:
# remove missing values/ features

In [16]:
# create new features

In [17]:
# engineer existing features

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, I created several Machine Learning (ML) models to classify the languages accordingly. |

---

In [18]:
# split data

In [19]:
# create targets and features dataset

In [20]:
# create one or more ML models

In [21]:
# evaluate one or more ML models
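
As a rough illustration of what a language classifier could look like, here is a minimal sketch of a multinomial Naive Bayes model over character trigrams, written with the standard library only. This is an assumption about one workable approach, not the model built in this notebook. The class name `NgramNaiveBayes` and its methods are hypothetical, and the training texts are just the five truncated previews from `df.head()`. In practice the full `text` column and a library implementation would normally be used.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Hypothetical minimal multinomial Naive Bayes over character n-grams."""

    def fit(self, texts, labels):
        self.gram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.label_counts = Counter(labels)      # label -> number of texts
        for text, label in zip(texts, labels):
            self.gram_counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.gram_counts.values() for g in c}
        return self

    def predict_one(self, text):
        n_docs = sum(self.label_counts.values())
        grams = char_ngrams(text)
        best_label, best_score = None, -math.inf
        for label, counts in self.gram_counts.items():
            total = sum(counts.values())
            # Log prior plus add-one smoothed log likelihood of each n-gram.
            score = math.log(self.label_counts[label] / n_docs)
            for g in grams:
                score += math.log((counts[g] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Truncated previews and labels copied from the df.head() output.
texts = [
    "umgaqo-siseko wenza amalungiselelo kumaziko ax",
    "i-dha iya kuba nobulumko bokubeka umsebenzi na",
    "the province of kwazulu-natal department of tr",
    "o netefatša gore o ba file dilo ka moka tše le",
    "khomishini ya ndinganyiso ya mbeu yo ewa maana",
]
labels = ["xho", "xho", "eng", "nso", "ven"]

model = NgramNaiveBayes().fit(texts, labels)
print(model.predict_one("the department of transport"))
```

Character n-grams are a common choice for this task because spelling patterns differ between languages even when individual words are rare or unseen.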

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section, I compared the relative performance of the various trained ML models on a holdout dataset and gave a brief comment on what model is the best and why. |

---

In [22]:
# Compare model performance

In [23]:
# Choose best model and motivate why it is the best choice
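
To compare models on a holdout set, each model's predictions can be scored against the true labels. The sketch below computes accuracy and macro-averaged F1, two common metrics for multi-class classification. The choice of metrics, the function names `accuracy` and `macro_f1`, and the toy label lists are assumptions for illustration only; they are not results from this notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, made up purely to exercise the functions.
y_true = ["eng", "eng", "xho", "ven"]
y_pred = ["eng", "xho", "xho", "ven"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Macro averaging gives every language the same weight regardless of how many samples it has.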

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| This is where I explained the chosen model and briefly discussed how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [24]:
# discuss chosen methods logic