# Language Recognition - Data Wrangling and EDA

The goal of this project is to identify which of 22 languages a piece of text is written in. I aim to do this by building eight models: Logistic Regression and Naive Bayes classifiers, each paired with one of four text representations (Count Vectorizer, Tf-idf, word embeddings, and document vectors).

### Import dependencies

I will start by importing the necessary modules.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from gensim.models.word2vec import Word2Vec
import gensim.downloader as gensim_api
from gensim.utils import simple_preprocess

### Import and display the data

This data comes from the Kaggle language identification data set (https://www.kaggle.com/datasets/zarajamshaid/language-identification-datasst), which is drawn from the WiLI-2018 Wikipedia dataset. WiLI-2018 contains 235,000 paragraphs in 235 languages.

In [3]:
# Import and display the data
df = pd.read_csv('language.csv')
df.head(10)

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch
5,エノが行きがかりでバスに乗ってしまい、気分が悪くなった際に助けるが、今すぐバスを降りたいと運...,Japanese
6,tsutinalar i̇ngilizce tsuutina kanadada albert...,Turkish
7,müller mox figura centralis circulorum doctoru...,Latin
8,برقی بار electric charge تمام زیرجوہری ذرات کی...,Urdu
9,シャーリー・フィールドは、サン・ベルナルド・アベニュー沿い市民センターとrtマーティン高校に...,Japanese


The data contains two columns: one holds natural language text and the other appears to be a categorical label.
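One way to check that impression is to summarize each column's dtype, number of distinct values, and number of missing values. The sketch below is only illustrative: `summarize_columns` is a hypothetical helper, and `sample` is a tiny stand-in frame built from shortened versions of the rows shown above, not the real data. On the real data, the same helper could be called as `summarize_columns(df)`.

```python
import pandas as pd

# Hypothetical stand-in for `df`; the real frame is loaded from language.csv.
sample = pd.DataFrame({
    "Text": [
        "de spons behoort tot het geslacht haliclona",
        "müller mox figura centralis circulorum",
        "klement gottwaldi surnukeha palsameeriti ning",
        "sebes joseph pereira thomas",
    ],
    "language": ["Dutch", "Latin", "Estonian", "Swedish"],
})

def summarize_columns(frame):
    """Return dtype, distinct-value count, and missing-value count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(),
        "n_missing": frame.isna().sum(),
    })

summary = summarize_columns(sample)
print(summary)
```

A label column would be expected to show few distinct values relative to the number of rows, while a free-text column would show close to one distinct value per row.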

In [4]:
# Examine the shape of the data.
df.shape

(22000, 2)

In [5]:
# Examine the data in more detail.
df['language'].value_counts()

Tamil         1000
Swedish       1000
Latin         1000
Korean        1000
Indonesian    1000
Spanish       1000
Hindi         1000
English       1000
Estonian      1000
Chinese       1000
Turkish       1000
Pushto        1000
Thai          1000
Urdu          1000
Dutch         1000
Japanese      1000
French        1000
Portugese     1000
Romanian      1000
Russian       1000
Persian       1000
Arabic        1000
Name: language, dtype: int64

Based on this initial inspection, the data consists of 1,000 examples for each of 22 languages, so the classes are perfectly balanced. This is plenty of data for my purposes.
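The balance seen in the counts above can also be checked programmatically, and it gives a useful reference point: with 22 equally sized classes, always guessing a single language would be correct 1/22 of the time, or about 4.5%. The following is a minimal sketch of that check, assuming a label column like `df['language']`. The `check_balance` helper and the toy `labels` series are hypothetical and stand in for the real data.

```python
import pandas as pd

# Hypothetical toy labels; on the real data this would be df['language'].
labels = pd.Series(["Dutch", "Latin", "Thai"] * 4, name="language")

def check_balance(label_series):
    """Return (is_balanced, majority_baseline) for a series of class labels.

    is_balanced is True when every class has the same number of rows.
    majority_baseline is the accuracy of always predicting the largest class.
    """
    counts = label_series.value_counts()
    is_balanced = bool(counts.min() == counts.max())
    majority_baseline = float(counts.max() / counts.sum())
    return is_balanced, majority_baseline

balanced, baseline = check_balance(labels)
print(f"balanced: {balanced}, majority-class baseline accuracy: {baseline:.3f}")
```

Because the classes are balanced, plain accuracy is a reasonable first metric for the models that follow, and any model worth keeping should comfortably beat the majority-class baseline.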