# Language Identification Hackathon

© Explore Data Science Academy

---
### Honour Code

I {**Buhle, Nonjojo**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

---

### Language Identification Challenge 
---
#### Introduction

South Africa is a nation brimming with cultural richness, woven together by the threads of its diverse languages. Language transcends mere communication; it is the lifeblood of this democracy, enriching social interactions, fostering cultural expression, and empowering intellectual, economic, and political spheres.

Eleven official languages hold equal standing, reflecting the vibrant tapestry of South African society. Notably, many South Africans are multilingual, effortlessly navigating two or more of these languages, a testament to the nation's remarkable cultural exchange and understanding.

#### The Challenge

Given South Africa's rich linguistic tapestry, where 11 official languages flourish, it's natural for our technology to reflect this diversity. In this challenge, I will embark on a Natural Language Processing (NLP) adventure – identifying the language of a given text, regardless of whether it's in isiZulu, Afrikaans, or any other of the nation's official languages.

<a id="cont"></a>

## Table of Contents

><a href=#one>1. Importing Packages</a>

><a href=#two>2. Loading the Data</a>

><a href=#three>3. Exploratory Data Analysis (EDA)</a>

><a href=#four>4. Data Engineering and NLP Preprocessing</a>

><a href=#five>5. Modelling</a>

><a href=#six>6. Model Performance</a>

<a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section I will import, and briefly discuss, the libraries that will be used throughout our analysis and modelling. |

---

First I will import all necessary modules:

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk


In [3]:
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nonjo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nonjo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nonjo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
# Define stop words and lemmatizer (NLTK's English stop-word list and the English WordNet lemmatizer)
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section I will load the data from the files into a DataFrame. |

---

In [5]:
# Loading the datasets
df_train = pd.read_csv('train_set.csv')
df_evaluate = pd.read_csv('test_set.csv')

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, I will inspect the structure of the training DataFrame: its first rows, column data types, and missing values. |

---


Let's start by looking at the first few rows of the training DataFrame:

In [31]:
# A look into the data:
print(df_train.head())

  lang_id                                               text
0     xho  umgaqo-siseko wenza amalungiselelo kumaziko ax...
1     xho  i-dha iya kuba nobulumko bokubeka umsebenzi na...
2     eng  the province of kwazulu-natal department of tr...
3     nso  o netefatša gore o ba file dilo ka moka tše le...
4     ven  khomishini ya ndinganyiso ya mbeu yo ewa maana...



Let's now look at the datatypes contained within our dataset:

In [32]:
print(df_train.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   lang_id  33000 non-null  object
 1   text     33000 non-null  object
dtypes: object(2)
memory usage: 515.8+ KB
None
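
A natural follow-up to `info()` is to check how the rows are spread across the language labels and how long the texts are, since class balance affects how the macro F1 score used later should be read. The sketch below runs on a small made-up frame (`toy_train` and its rows are illustrative assumptions, not the real data); on the actual data the same calls would be applied to `df_train`.

```python
import pandas as pd

# Hypothetical stand-in for df_train; the real training file has 33000 rows.
toy_train = pd.DataFrame({
    "lang_id": ["xho", "xho", "eng", "nso", "ven", "eng"],
    "text": [
        "umgaqo-siseko wenza amalungiselelo",
        "i-dha iya kuba nobulumko",
        "the province of kwazulu-natal",
        "o netefatša gore o ba file dilo",
        "khomishini ya ndinganyiso ya mbeu",
        "department of transport invites tenders",
    ],
})

# Number of rows per language label
label_counts = toy_train["lang_id"].value_counts()

# Character length of each text, averaged per language
toy_train["n_chars"] = toy_train["text"].str.len()
length_by_lang = toy_train.groupby("lang_id")["n_chars"].mean()

print(label_counts)
print(length_by_lang)
```

If the counts turn out to be roughly equal, accuracy and macro F1 will tell a similar story; if not, the per-class view matters more.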


<a id="four"></a>
## 4. Data Engineering and NLP Preprocessing
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering and NLP Preprocessing ⚡ |
| :--------------------------- |
| In this section I will encode the language labels, clean the text, and convert it into TF-IDF features. |

---

In [33]:
# Combine the training and testing data

df_combined = pd.concat((df_train, df_evaluate))

In [34]:
# Encode labels

encoder = LabelEncoder()
df_combined['lang_id'] = encoder.fit_transform(df_combined['lang_id'].astype(str))
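
Note that the evaluation set has no `lang_id` column, so after concatenation those rows hold missing labels; `astype(str)` stringifies them (typically as `'nan'`) and the encoder learns one extra class that never appears in training. A possible alternative, sketched here on made-up labels (`train_labels` and `toy_encoder` are hypothetical names), is to fit the encoder on the training labels only:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training labels; in the notebook these would be df_train['lang_id'].
train_labels = pd.Series(["xho", "eng", "nso", "ven", "eng"])

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(train_labels)

# classes_ is sorted alphabetically; each integer code indexes into it.
print(list(toy_encoder.classes_))  # ['eng', 'nso', 'ven', 'xho']
print(list(encoded))               # [3, 0, 1, 2, 0]

# inverse_transform maps predictions back to language codes for the submission.
decoded = toy_encoder.inverse_transform(encoded)
```

The round trip through `inverse_transform` is what the submission step relies on later.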

In [35]:
# Preprocess the text data

def preprocess_text(text):
    tokenized_text = word_tokenize(text.lower())
    lemmatized_text = [lemmatizer.lemmatize(word) for word in tokenized_text if word not in stop_words]
    return ' '.join(lemmatized_text)

df_combined['text'] = df_combined['text'].apply(preprocess_text)

In [36]:
# Vectorize the preprocessed text

vectorizer = TfidfVectorizer(sublinear_tf=True)
features = vectorizer.fit_transform(df_combined['text'])
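
With `sublinear_tf=True`, scikit-learn replaces the raw term frequency tf with 1 + log(tf), which dampens the effect of words repeated many times in one text. The cell above fits the vectorizer on the combined training and evaluation text. A common alternative, shown here only as a sketch on made-up sentences (`train_texts`, `eval_texts` and `toy_vectorizer` are hypothetical), is to fit on the training text and merely transform the evaluation text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical texts standing in for the training and evaluation sets.
train_texts = [
    "the province of the north",
    "umgaqo siseko wenza",
    "khomishini ya mbeu",
]
eval_texts = ["the province of mbeu unknownword"]

toy_vectorizer = TfidfVectorizer(sublinear_tf=True)

# Learn the vocabulary and IDF weights from the training texts only.
train_features = toy_vectorizer.fit_transform(train_texts)

# Reuse that vocabulary; words never seen in training are ignored.
eval_features = toy_vectorizer.transform(eval_texts)

print(train_features.shape, eval_features.shape)
```

Both matrices share the same columns, so no slicing by row count is needed afterwards.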

In [37]:
# Split the data back into the respective training and evaluation datasets

x_train = features[:df_train.shape[0]]
x_eval = features[df_train.shape[0]:]
y_train = df_combined['lang_id'][:df_train.shape[0]]

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, I will train a Multinomial Naive Bayes model to predict the language of a given piece of text, holding out part of the training data for validation. |

---

In [38]:
# Split the training data into training and validation sets

X_train, X_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=42)
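
The split above is random. If the classes were uneven, passing `stratify` would keep each language's share the same in both parts. This is an optional variation rather than what the notebook does; the sketch uses made-up data (`toy_X` and `toy_y` are hypothetical).

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, two equally sized classes.
toy_X = list(range(20))
toy_y = ["eng"] * 10 + ["xho"] * 10

X_tr, X_va, y_tr, y_va = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Each class keeps its 50% share in the validation part.
print(Counter(y_va))
```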

In [39]:
# Train a Multinomial Naive Bayes model

nb = MultinomialNB()
nb.fit(X_train, y_train)

In [40]:
# Predict on validation set

y_pred_val = nb.predict(X_val)
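
To make the vectorize, fit, predict sequence easier to follow, here is a minimal end-to-end sketch on four made-up sentences. It wraps the two steps in a scikit-learn pipeline, which is a convenience I am assuming here and not how the notebook is structured; `toy_texts`, `toy_labels` and `toy_model` are hypothetical names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training sentences, two per language.
toy_texts = [
    "the province of kwazulu-natal department",
    "the department of transport",
    "umgaqo siseko wenza amalungiselelo",
    "iya kuba nobulumko bokubeka umsebenzi",
]
toy_labels = ["eng", "eng", "xho", "xho"]

# The vectorizer and the classifier are fitted together on the training text.
toy_model = make_pipeline(TfidfVectorizer(sublinear_tf=True), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

new_text = ["department of the province"]
pred = toy_model.predict(new_text)
proba = toy_model.predict_proba(new_text)

print(pred[0], proba.round(3))
```

`predict_proba` returns one column per class, in the order given by `toy_model.classes_`.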

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section, I will evaluate the model on the validation set using the macro-averaged F1 score, then predict on the evaluation set and save the submission file. |

---

In [41]:
# Check the model performance

print('F1 score on validation set:', f1_score(y_val, y_pred_val, average='macro'))

F1 score on validation set: 0.9987793870574063
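
The macro average computes an F1 score for each language separately and then takes their unweighted mean, so every language counts equally regardless of how many rows it has. A small worked example on made-up labels (not the notebook's predictions) shows the per-class scores behind the single number:

```python
from sklearn.metrics import f1_score

# Hypothetical true and predicted labels: one 'xho' text is mistaken for 'zul'.
toy_true = ["eng", "eng", "xho", "xho", "zul", "zul"]
toy_pred = ["eng", "eng", "xho", "zul", "zul", "zul"]

labels = ["eng", "xho", "zul"]

# average=None returns one F1 score per label, in the order given by `labels`.
per_class = f1_score(toy_true, toy_pred, average=None, labels=labels)

# The macro average is the plain mean of those per-class scores.
macro = f1_score(toy_true, toy_pred, average="macro")

for name, score in zip(labels, per_class):
    print(name, round(score, 4))
print("macro", round(macro, 4))
```

Here `eng` scores 1.0, `xho` scores 2/3 (precision 1, recall 1/2) and `zul` scores 0.8 (precision 2/3, recall 1), giving a macro F1 of about 0.8222.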


In [42]:
# Predict on evaluation set

y_pred_eval = nb.predict(x_eval)


In [29]:
# Create a submission dataframe

submission = pd.DataFrame({
    'index': df_evaluate['index'],
    'lang_id': encoder.inverse_transform(y_pred_eval)
})

# Save the submission dataframe to a csv file

submission.to_csv('submission.csv', index=False)
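
Before uploading, it can help to sanity-check the submission frame. The helper below is a hypothetical sketch (`check_submission` is not part of any library); it assumes the expected columns are `index` and `lang_id`, as used in the cell above.

```python
import pandas as pd


def check_submission(submission, n_expected_rows, valid_labels):
    """Return a list of problems found; an empty list means the frame looks fine."""
    problems = []
    if list(submission.columns) != ["index", "lang_id"]:
        problems.append("unexpected columns")
        return problems
    if len(submission) != n_expected_rows:
        problems.append("row count mismatch")
    if submission["index"].duplicated().any():
        problems.append("duplicate index values")
    # Missing labels also fail this check, because NaN is not in valid_labels.
    if not submission["lang_id"].isin(valid_labels).all():
        problems.append("unknown or missing language label")
    return problems


# Hypothetical submission with three rows.
toy_submission = pd.DataFrame({"index": [1, 2, 3], "lang_id": ["eng", "xho", "zul"]})
print(check_submission(toy_submission, 3, {"eng", "xho", "zul"}))
```

In the notebook, the equivalent call would pass `len(df_evaluate)` and the training labels from `df_train['lang_id']`.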