# **ExploreAI Academy Classification Hackathon**

© Explore Data Science Academy

Honour Code
I, Antonia Bardo, confirm, by submitting this document, that the solutions in this notebook are the result of my own work and that I abide by the EDSA honour code.

Non-compliance with the honour code constitutes a material breach of contract.

South Africa is a multicultural society that is characterized by its rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and also contribute to the social, cultural, intellectual, economic and political life of the South African society.

The country is multilingual, with 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and able to speak at least two of the official languages. [Source: South African Government]

With such a multilingual population, it makes sense for our systems and devices to communicate in multiple languages as well.

**In this challenge, you will take text written in any of South Africa's 11 official languages and identify which language it is in. This is an example of language identification in NLP: the task of determining the natural language that a piece of text is written in.**

Table of Contents
1. Importing Packages
2. Loading Data
3. Exploratory Data Analysis (EDA)
4. Data Engineering
5. Modelling
6. Model Performance
7. Model Explanations


# Importing packages

In [1]:
# !pip install wordcloud

In [2]:
import csv
import re
import string

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Libraries for data preparation and model building
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler




# Loading data

In [3]:
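# Read the training and test CSV files into separate DataFrames,
# so that the second read does not overwrite the first
df = pd.read_csv('train_set.csv')      # labelled training data: lang_id, text
test_df = pd.read_csv('test_set.csv')  # unlabelled test data: index, text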

# Perform operations on the DataFrame
# ...

# Exploratory data analysis

In [4]:
df.head()

   lang_id                                               text
0      xho  umgaqo-siseko wenza amalungiselelo kumaziko ax...
1      xho  i-dha iya kuba nobulumko bokubeka umsebenzi na...
2      eng  the province of kwazulu-natal department of tr...
3      nso  o netefatša gore o ba file dilo ka moka tše le...
4      ven  khomishini ya ndinganyiso ya mbeu yo ewa maana...


In [5]:
df.tail()

       lang_id                                               text
32995      tsn  popo ya dipolateforomo tse ke go tlisa boetele...
32996      sot  modise mosadi na o ntse o sa utlwe hore thaban...
32997      eng  closing date for the submission of completed t...
32998      xho  nawuphina umntu ofunyenwe enetyala phantsi kwa...
32999      sot  mafapha a mang le ona a lokela ho etsa ditlale...


In [6]:
# Checking the columns in the dataset
df.columns

Index(['lang_id', 'text'], dtype='object')

In [7]:
df.shape

(33000, 2)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   lang_id  33000 non-null  object
 1   text     33000 non-null  object
dtypes: object(2)
memory usage: 515.8+ KB


In [9]:
# Checking the data types of the columns
df.dtypes

lang_id    object
text       object
dtype: object

In [10]:
# Summary statistics
df.describe() 

        lang_id                                              text
count     33000                                             33000
unique       11                                             29948
top         xho  ngokwesekhtjheni yomthetho ophathelene nalokhu...
freq       3000                                                17
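
The summary above reports 11 labels, a top frequency of 3000 out of 33000 rows (so every language must have exactly 3000 rows), and only 29948 unique texts, which means some sentences repeat. A minimal sketch of how both points could be checked directly follows. It uses a small made-up DataFrame, `toy_df`, whose rows are illustrative assumptions and not the competition data:

```python
import pandas as pd

# Hypothetical stand-in for the training DataFrame (same column names)
toy_df = pd.DataFrame({
    'lang_id': ['xho', 'xho', 'eng', 'eng', 'zul', 'zul'],
    'text': ['molo', 'molo', 'hello', 'good day', 'sawubona', 'yebo'],
})

# Rows per language: equal counts indicate balanced classes
class_counts = toy_df['lang_id'].value_counts()

# Number of rows whose text repeats an earlier row
n_duplicate_texts = int(toy_df['text'].duplicated().sum())

print(class_counts)
print("Duplicate texts:", n_duplicate_texts)
```

Running the same two calls on `df` would show the per-language counts and how many of the 33000 rows are repeats.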


# Data engineering

In [11]:
# Read the test data from its CSV file
data = pd.read_csv('test_set.csv')

# Count the missing values in each column
missing_values = data.isnull().sum()

# Option 1: drop rows with missing values
data_cleaned = data.dropna()

# Option 2: fill missing values with an empty string,
# which keeps the text column a string column
data_filled = data.fillna('')

# Print the first rows of the cleaned and filled data
print(data_cleaned.head())
print(data_filled.head())


   index                                               text
0      1  Mmasepala, fa maemo a a kgethegileng a letlele...
1      2  Uzakwaziswa ngokufaneleko nakungafuneka eminye...
2      3         Tshivhumbeo tshi fana na ngano dza vhathu.
3      4  Kube inja nelikati betingevakala kutsi titsini...
4      5                      Winste op buitelandse valuta.
   index                                               text
0      1  Mmasepala, fa maemo a a kgethegileng a letlele...
1      2  Uzakwaziswa ngokufaneleko nakungafuneka eminye...
2      3         Tshivhumbeo tshi fana na ngano dza vhathu.
3      4  Kube inja nelikati betingevakala kutsi titsini...
4      5                      Winste op buitelandse valuta.
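
The test sentences printed above contain capital letters and punctuation, while the training rows shown earlier are lowercase. One possible normalisation step is sketched below using the `re` and `string` modules imported at the top of the notebook. The helper name `clean_text` and the exact cleaning rules are assumptions for illustration, not a step taken from the original notebook:

```python
import re
import string

# Translation table that maps every ASCII punctuation character to a space
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def clean_text(text):
    """Lowercase, remove ASCII punctuation, and collapse whitespace (illustrative helper)."""
    text = text.lower()
    text = text.translate(_PUNCT_TO_SPACE)
    # Collapse runs of whitespace into a single space and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text("Winste op buitelandse valuta."))
# → winste op buitelandse valuta
```

Note that this also splits hyphenated words such as `umgaqo-siseko`; whether hyphens should be kept is a design choice that is worth testing against validation scores.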


# Modelling

In [12]:
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

# Create an engine for a local SQLite database and open a session
engine = create_engine('sqlite:///database.db')
Base = declarative_base()
Session = sessionmaker(bind=engine)
session = Session()

# Define the data model
class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String(50))
    age = Column(Integer)

    def __repr__(self):
        return f"User(id={self.id}, name='{self.name}', age={self.age})"

# Create the table in the database (the engine is passed explicitly)
Base.metadata.create_all(engine)

# Query data from the table
users = session.query(User).all()
for user in users:
    print(user)
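
For the language identification task itself, a common baseline is a character n-gram TF-IDF representation fed into the `LogisticRegression` classifier imported at the top of the notebook. The sketch below is one possible setup under stated assumptions, not the notebook's confirmed method. The training sentences are made up for illustration, and the choice of `char_wb` n-grams of length 1 to 3 is an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (illustrative only, not the competition data)
train_texts = [
    'the closing date for the submission',
    'the province of the department',
    'winste op buitelandse valuta',
    'die sluitingsdatum vir die indiening',
]
train_labels = ['eng', 'eng', 'afr', 'afr']

# Character n-grams capture spelling patterns that differ between languages
model = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# Predict the language of unseen text
preds = model.predict(['the date of the submission', 'die valuta'])
print(list(preds))
```

With the real data, the same pipeline could be fitted on `df['text']` and `df['lang_id']`, after holding out a validation split with `train_test_split`.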


# Model performance

In [13]:
from sklearn.metrics import f1_score

# Ground-truth labels (a small hard-coded binary example)
true_labels = [0, 1, 1, 0, 1, 0, 1]

# Labels predicted by the model
predicted_labels = [0, 1, 1, 0, 1, 1, 1]

# Calculate the F1 score (default binary averaging, positive class = 1)
f1 = f1_score(true_labels, predicted_labels)

print("F1 score:", f1)


F1 score: 0.888888888888889
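
The example above is binary, but the language labels form an 11-class problem, where `f1_score` needs an explicit `average` argument (the default `'binary'` raises an error on multi-class targets). A minimal sketch with made-up labels follows. The use of macro averaging is an assumption here, since the notebook does not state which average the competition scores on:

```python
from sklearn.metrics import f1_score

# Made-up multi-class labels (illustrative only)
y_true = ['afr', 'afr', 'eng', 'eng', 'zul', 'zul']
y_pred = ['afr', 'eng', 'eng', 'eng', 'zul', 'zul']

# average='macro' computes F1 per class and takes the unweighted mean:
# afr = 2/3, eng = 0.8, zul = 1.0
macro_f1 = f1_score(y_true, y_pred, average='macro')
print("Macro F1:", round(macro_f1, 4))
# → Macro F1: 0.8222
```

Because every language has the same number of training rows, macro and weighted averages should give similar values on a validation split drawn evenly from that data.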


In [14]:
# Placeholder predictions: the 11 language codes, standing in for real model output
predictions = ['afr', 'eng', 'nso', 'nbl', 'sot', 'ssw', 'tsn', 'tso', 'ven', 'xho', 'zul']

# Number of rows in the submission file
num_rows = 5682

# Build the 'lang_id' column by cycling through the placeholder predictions
lang_ids = [predictions[i % len(predictions)] for i in range(num_rows)]

# Create the submission DataFrame with a 1-based index column
submission = pd.DataFrame({'index': range(1, num_rows + 1), 'lang_id': lang_ids})

# Save the DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

# Print the first few rows of the submission DataFrame
print(submission.head())




   index lang_id
0      1     afr
1      2     eng
2      3     nso
3      4     nbl
4      5     sot
