# Language Identification Hackathon

© Explore Data Science Academy

---
### Honour Code

I {**Buhle, Nonjojo**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

---

### Language Identification Challenge 
---
#### Introduction

South Africa is a nation brimming with cultural richness, woven together by the threads of its diverse languages. Language transcends mere communication; it is the lifeblood of this democracy, enriching social interactions, fostering cultural expression, and empowering intellectual, economic, and political spheres.

Eleven official languages hold equal standing, reflecting the vibrant tapestry of South African society. Notably, many South Africans are multilingual, effortlessly navigating two or more of these languages, a testament to the nation's remarkable cultural exchange and understanding.

#### The Challenge

Given South Africa's rich linguistic tapestry, where 11 official languages flourish, it's natural for our technology to reflect this diversity. In this challenge, I will embark on a Natural Language Processing (NLP) adventure – identifying the language of a given text, regardless of whether it's in isiZulu, Afrikaans, or any other of the nation's official languages.

<a id="cont"></a>

## Table of Contents

><a href=#one>1. Importing Packages</a>

><a href=#two>2. Loading the Data</a>

><a href=#three>3. Exploratory Data Analysis (EDA)</a>

><a href=#four>4. Data Engineering and NLP Preprocessing</a>

><a href=#five>5. Modelling</a>

><a href=#six>6. Model Performance</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section I will import, and briefly discuss, the libraries that will be used throughout our analysis and modelling. |

---

First, I include a list of packages that may need to be installed on your system:

In [19]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [20]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [21]:
pip install nltk




Then import all necessary modules:

In [22]:

import nltk
import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize


In [23]:
# Download necessary NLTK data

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nonjo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nonjo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nonjo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [24]:
# Define stop words and lemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section I will load the data from the files into a DataFrame. |

---

In [25]:
# Loading the datasets
df_train = pd.read_csv('train_set.csv')
df_test = pd.read_csv('test_set.csv')

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, I will perform an in-depth analysis of all the variables in the DataFrame. |

---


Let's start by viewing the first few rows of the DataFrame:

In [26]:
# A look into the data:

print(df_train.head())

  lang_id                                               text
0     xho  umgaqo-siseko wenza amalungiselelo kumaziko ax...
1     xho  i-dha iya kuba nobulumko bokubeka umsebenzi na...
2     eng  the province of kwazulu-natal department of tr...
3     nso  o netefatša gore o ba file dilo ka moka tše le...
4     ven  khomishini ya ndinganyiso ya mbeu yo ewa maana...


Let's now look at the datatypes contained within our dataset:

In [27]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   lang_id  33000 non-null  object
 1   text     33000 non-null  object
dtypes: object(2)
memory usage: 515.8+ KB


Let us check whether the dataset contains missing values:

In [28]:
df_train.isnull().sum()

lang_id    0
text       0
dtype: int64
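
Beyond missing values, two quick checks are often useful for a classification dataset like this one: how many rows each language has, and how long the texts are per language. The sketch below uses a small hypothetical stand-in (`toy_train`) with the same two columns as `df_train`; the rows are illustrative only and are not taken from the real file.

```python
import pandas as pd

# Toy stand-in for df_train with the same two columns (illustrative values only)
toy_train = pd.DataFrame({
    "lang_id": ["xho", "xho", "eng", "nso", "eng", "xho"],
    "text": [
        "umgaqo-siseko wenza amalungiselelo",
        "i-dha iya kuba nobulumko",
        "the province of kwazulu-natal",
        "o netefatša gore o ba file dilo",
        "department of transport",
        "umsebenzi",
    ],
})

# Number of rows per language: a balanced dataset has roughly equal counts
class_counts = toy_train["lang_id"].value_counts()
print(class_counts)

# Average text length (in characters) per language
mean_length = toy_train["text"].str.len().groupby(toy_train["lang_id"]).mean()
print(mean_length)
```

On the real data, the same two calls would run on `df_train` directly.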

<a id="four"></a>
## 4. Data Engineering and NLP Preprocessing
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering and NLP Processing ⚡ |
| :--------------------------- |
| In this section I will: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

In [31]:
# Combine the training and testing data
# (df_test is aliased as df_evaluate, the name the later cells use)
df_evaluate = df_test
df_combined = pd.concat((df_train, df_evaluate), ignore_index=True)

In [30]:
# Encode labels
encoder = LabelEncoder()
df_combined['lang_id'] = encoder.fit_transform(df_combined['lang_id'].astype(str))


In [18]:
# Preprocess the text data
def preprocess_text(text):
    """Lowercase, tokenize, drop English stop words, and lemmatize the remaining tokens."""
    tokenized_text = word_tokenize(text.lower())
    lemmatized_text = [lemmatizer.lemmatize(word) for word in tokenized_text if word not in stop_words]
    return ' '.join(lemmatized_text)

df_combined['text'] = df_combined['text'].apply(preprocess_text)


In [None]:
# Vectorize the preprocessed text
vectorizer = TfidfVectorizer(sublinear_tf=True)
features = vectorizer.fit_transform(df_combined['text'])
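
The vectorizer above works on word tokens. A common alternative for language identification is to build TF-IDF features from character n-grams, since short character sequences tend to be distinctive for each language. This is not the approach used in this notebook; the sketch below only illustrates the idea on three hypothetical snippets, and the names `texts`, `char_vectorizer`, and `char_features` are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative snippets only; the real input would be df_combined['text']
texts = [
    "umgaqo-siseko wenza amalungiselelo",
    "the province of kwazulu-natal",
    "khomishini ya ndinganyiso ya mbeu",
]

# Character n-grams (1 to 3 characters, built inside word boundaries)
# instead of the default word tokens
char_vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3), sublinear_tf=True)
char_features = char_vectorizer.fit_transform(texts)

# One row per text, one column per distinct character n-gram
print(char_features.shape)
print(sorted(char_vectorizer.vocabulary_)[:10])
```

Whether character n-grams would improve on the word-level features here would have to be checked on the validation set.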

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, I will train a model that predicts the language of a given text. |

---

In [None]:
# Split the data back into the respective training and evaluation datasets
x_train = features[:df_train.shape[0]]
x_eval = features[df_train.shape[0]:]
y_train = df_combined['lang_id'][:df_train.shape[0]]

In [None]:
# Split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=42)

In [None]:
# Train a Multinomial Naive Bayes model
nb = MultinomialNB()
nb.fit(X_train, y_train)

In [None]:
# Predict on validation set
y_pred_val = nb.predict(X_val)
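
The vectorize, fit, and predict steps above can also be chained in a single scikit-learn pipeline, which keeps the vectorizer and the classifier together. The following is a minimal sketch on four hypothetical training texts, not the notebook's actual data; `train_texts`, `train_labels`, and `nb_pipeline` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set (two languages, two texts each)
train_texts = [
    "umgaqo siseko wenza amalungiselelo",
    "iya kuba nobulumko bokubeka umsebenzi",
    "the province of kwazulu natal department",
    "the department of transport in the province",
]
train_labels = ["xho", "xho", "eng", "eng"]

# TF-IDF features feed directly into the Naive Bayes classifier
nb_pipeline = make_pipeline(TfidfVectorizer(sublinear_tf=True), MultinomialNB())
nb_pipeline.fit(train_texts, train_labels)

# Predict the language of two unseen snippets
toy_predictions = nb_pipeline.predict([
    "the department of the province",
    "umsebenzi wenza amalungiselelo",
])
print(toy_predictions)
```

Because the pipeline is fitted only on the training texts, the TF-IDF vocabulary is learned without seeing the texts being predicted.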

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section, I will evaluate the model on the validation set, compare alternative models, and create the submission file. |

---

In [None]:
# Check the model performance
print('F1 score on validation set:', f1_score(y_val, y_pred_val, average='macro'))
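
The `average` argument controls how the per-class F1 scores are combined. With `average='macro'` every class counts equally, while `average='weighted'` weights each class by its number of true samples. The small worked example below uses made-up labels (`y_true`, `y_hat`) to show that the two can differ when classes are imbalanced.

```python
from sklearn.metrics import f1_score

# Made-up labels: class 0 has four samples, class 1 has two
y_true = [0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 1, 0]  # one class-1 sample is misclassified as class 0

# Per-class F1: class 0 -> 8/9, class 1 -> 2/3
per_class = f1_score(y_true, y_hat, average=None)

# Macro: plain mean of the per-class scores -> 7/9
macro = f1_score(y_true, y_hat, average="macro")

# Weighted: mean weighted by class support -> (4 * 8/9 + 2 * 2/3) / 6 = 22/27
weighted = f1_score(y_true, y_hat, average="weighted")

print(per_class, macro, weighted)
```

On a nearly balanced dataset the two averages should be close, so the choice matters less.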

In [None]:
# Predict on evaluation set
y_pred_eval = nb.predict(x_eval)


In [None]:
# Create a submission dataframe
submission = pd.DataFrame({
    'index': df_evaluate['index'],
    'lang_id': encoder.inverse_transform(y_pred_eval)
})

# Save the submission dataframe to a csv file
submission.to_csv('submission.csv', index=False)

In [43]:
#Given the task at hand, the feature 'text' needs to be vectorized. A TF-IDF Vectorizer is used for this.
vectorizer = TfidfVectorizer(sublinear_tf=True, encoding='utf-8', decode_error='ignore')
vectorizer.fit(df_train['text'])

In [44]:
# Next, the 'lang_id' is encoded using LabelEncoder.
encoder = LabelEncoder()
df_train['lang_id'] = encoder.fit_transform(df_train['lang_id'].astype(str))

In [45]:
# The training dataset is then split into training and validation sets.
X_train, X_valid, y_train, y_valid = train_test_split(df_train['text'], df_train['lang_id'], 
                                                      test_size=0.2, random_state=42)

In [46]:
# Next, the text data in the train and validation sets are transformed into TF-IDF vectors.
X_train = vectorizer.transform(X_train)
X_valid = vectorizer.transform(X_valid)

In [47]:
# Now, a Logistic Regression model is trained.
model = LogisticRegression(random_state=0, C=2.5, max_iter=1000, n_jobs=-1)
model.fit(X_train, y_train)

In [48]:
# The performance of the model can be evaluated using the validation set.
y_pred = model.predict(X_valid)
print('F1 Score:', f1_score(y_valid, y_pred, average='macro'))

F1 Score: 0.995552204421297


In [49]:
# Next, the test data is transformed into a TF-IDF vector and the trained model is used to predict the language of the texts in the test data.
X_test = vectorizer.transform(df_test['text'])
predictions = model.predict(X_test)

In [50]:
# The predicted language IDs are then decoded back into the original form.
predictions = encoder.inverse_transform(predictions)

In [51]:
# Create a submission dataframe with the index column from the test data
submission = pd.DataFrame({
    'index': df_test['index'],
    'lang_id': predictions
})

# Save the DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)
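
Before uploading, it can be worth checking that the submission has the expected columns and one row per test text. The sketch below assumes a hypothetical three-row test set with the same `index` and `text` columns as `test_set.csv`, and it writes to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Hypothetical miniature test set and predictions (illustrative values only)
toy_test = pd.DataFrame({"index": [1, 2, 3], "text": ["a b", "c d", "e f"]})
toy_predictions = ["eng", "xho", "nso"]

toy_submission = pd.DataFrame({
    "index": toy_test["index"],
    "lang_id": toy_predictions,
})

# Round-trip through CSV to check what would actually be written
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.columns.tolist())              # column names as they appear in the file
print(len(reloaded) == len(toy_test))         # one prediction per test row
print(reloaded["index"].is_unique)            # no duplicated ids
```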

To avoid duplicating the preprocessing steps across the training and test datasets, I join them together:

In [52]:
# Discard all changes and reload the data
df_train = pd.read_csv('train_set.csv')
df_evaluate = pd.read_csv('test_set.csv')

# Combine the training and evaluation data
df_combined = pd.concat([df_train, df_evaluate], ignore_index=True)

In [53]:
# Data preprocessing: lowercasing text
df_combined['text'] = df_combined['text'].apply(lambda x: x.lower())

In [54]:
# Feature extraction
vectorizer = TfidfVectorizer(sublinear_tf=True, encoding='utf-8', decode_error='ignore', stop_words='english')
X = vectorizer.fit_transform(df_combined['text'])

In [55]:
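# Encode labels. Test rows have no lang_id, so astype(str) turns the missing
# values into the string 'nan', which the encoder treats as an extra class
# (this is why label 2 never appears in the validation report below).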
encoder = LabelEncoder()
df_combined['lang_id'] = encoder.fit_transform(df_combined['lang_id'].astype(str))
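
The cell above encodes `lang_id` on the combined frame, where the rows from the test set carry no label. An alternative, sketched here under the assumption that only the training labels need encoding, is to fit the encoder on the training labels alone so that every encoded class corresponds to a real language. `toy_train` and `train_only_encoder` are hypothetical names used for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the training labels (illustrative values only)
toy_train = pd.DataFrame({
    "lang_id": ["xho", "eng", "afr", "xho"],
    "text": ["umsebenzi", "the province", "die provinsie", "umgaqo"],
})

# Fit on training labels only: classes are sorted alphabetically
train_only_encoder = LabelEncoder()
encoded = train_only_encoder.fit_transform(toy_train["lang_id"].tolist())
print(list(train_only_encoder.classes_))

# inverse_transform maps the integer codes back to the language ids
decoded = train_only_encoder.inverse_transform(encoded)
print(list(decoded))
```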

In [56]:
# Split the combined data back into training and evaluation data
df_train = df_combined.iloc[:len(df_train)]
df_evaluate = df_combined.iloc[len(df_train):]

y = df_train['lang_id']

In [57]:
# Split the training data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X[:len(df_train)], y, test_size=0.2, random_state=42)
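
`train_test_split` also accepts a `stratify` argument, which keeps the class proportions the same in the training and validation subsets. The notebook does not use it; the sketch below shows the effect on a hypothetical two-class toy set (`toy_texts`, `toy_labels`).

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 10 samples for each of two languages
toy_texts = [f"text {i}" for i in range(20)]
toy_labels = ["eng"] * 10 + ["xho"] * 10

# stratify keeps the 50/50 class ratio in both subsets
X_tr, X_va, y_tr, y_va = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print(Counter(y_tr))
print(Counter(y_va))
```

With a large and roughly balanced dataset, a plain random split is likely to give similar proportions anyway, so this mainly matters for small or skewed classes.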

In [58]:
# Define and train the model

model = LinearSVC()
model.fit(X_train, y_train)



In [59]:
# Evaluate the model

y_pred = model.predict(X_valid)

# Print classification report
print('classification_report:\n', classification_report(y_valid, y_pred))

# Print confusion matrix
print('Confusion Matrix:\n', confusion_matrix(y_valid, y_pred))

# Print F1 score
print('F1 Score:', f1_score(y_valid, y_pred, average='weighted'))

classification_report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       583
           1       1.00      1.00      1.00       615
           3       0.99      0.99      0.99       583
           4       1.00      1.00      1.00       625
           5       1.00      1.00      1.00       618
           6       1.00      1.00      1.00       584
           7       1.00      1.00      1.00       598
           8       1.00      1.00      1.00       561
           9       1.00      1.00      1.00       634
          10       0.99      1.00      1.00       609
          11       0.99      0.99      0.99       590

    accuracy                           1.00      6600
   macro avg       1.00      1.00      1.00      6600
weighted avg       1.00      1.00      1.00      6600

Confusion Matrix:
 [[581   2   0   0   0   0   0   0   0   0   0]
 [  0 615   0   0   0   0   0   0   0   0   0]
 [  0   1 577   0   0   0   0   0   0   2   3]
 [ 
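
The report and matrix above are indexed by the integer codes from the label encoder. Passing `target_names` (and an explicit `labels` list) makes the output show language ids instead. The following is a minimal sketch on made-up labels; `toy_encoder`, `y_true`, and `y_hat` are hypothetical names.

```python
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

# Made-up true labels, encoded to integers (afr=0, eng=1, xho=2)
toy_encoder = LabelEncoder()
y_true = toy_encoder.fit_transform(["afr", "eng", "xho", "eng", "afr", "xho"])

# Copy the true labels and misclassify one 'eng' sample as 'xho'
y_hat = y_true.copy()
y_hat[1] = toy_encoder.transform(["xho"])[0]

labels = list(range(len(toy_encoder.classes_)))
names = list(toy_encoder.classes_)

# Rows of the report are named after the languages instead of 0, 1, 2
report = classification_report(y_true, y_hat, labels=labels, target_names=names)
print(report)

# Rows are true classes, columns are predicted classes, in the order of `names`
matrix = confusion_matrix(y_true, y_hat, labels=labels)
print(matrix)
```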

In [60]:
# Predict on evaluation data
predictions = model.predict(X[len(df_train):])

In [61]:
# Decode the predicted labels
predictions = encoder.inverse_transform(predictions)

In [62]:
# Prepare a submission file
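# Use the test set's own 'index' column (float after the concat, so cast back to int)
submission = pd.DataFrame({'index': df_evaluate['index'].astype(int), 'lang_id': predictions})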
submission.to_csv('submission.csv', index=False)