# Tweets Language Classification

## Author: Luis Eduardo Ferro Diez <a href="mailto:luis.ferro1@correo.icesi.edu.co">luis.ferro1@correo.icesi.edu.co</a>

This notebook prepares the data and the model for classifying the language of tweets.

## Dataset
* Tweets dataset in Parquet format (produced by the first Spark transformation pipeline)

## Resources
* https://machinelearningmastery.com/best-practices-document-classification-deep-learning/

### Prepare the data

In [1]:
tweets_path = "../../datasets/tweets_parquet"

In [3]:
import pandas as pd

tweets = pd.read_parquet(tweets_path, engine="pyarrow")

In [4]:
tweets.head()

Unnamed: 0,id,tweet,lang,favorite_count,retweet_count,is_retweet,user_id,user_name,user_followers_count,user_following_count,...,place_full_name,country,country_code,place_type,place_url,is_spam,year,month,day,hour
0,374048987034419200,@adambeyer234: I already miss Jace like hell,en,0.0,0.0,0.0,363516745,rilez_sharp,582.0,666.0,...,"New York, US",United States,US,admin,https://api.twitter.com/1.1/geo/id/94965b2c453...,0.0,2013,9,1,1
1,374048991224160256,I really really hate texting unless we're talk...,en,0.0,0.0,0.0,542867684,ivonne_xoxo,367.0,310.0,...,"Chicago, IL",United States,US,city,https://api.twitter.com/1.1/geo/id/1d9a5370a35...,0.0,2013,9,1,1
2,374048995414659072,"Wind 4.0 mph SSE. Barometer 1040.0 mb, Falling...",en,0.0,0.0,0.0,1035302827,MossleyWX,38.0,60.0,...,"Craven, North Yorkshire",United Kingdom,GB,city,https://api.twitter.com/1.1/geo/id/4e008be7a8d...,0.0,2013,9,1,1
3,374048999642120192,"""@3gerardpique: Congratulations to Bayern Münc...",en,0.0,0.0,0.0,1716178933,456ronnys,6.0,65.0,...,"Cakung, Jakarta Timur",Indonesia,ID,city,https://api.twitter.com/1.1/geo/id/ac9f3b0d4a9...,0.0,2013,9,1,1
4,374048999625723904,@612wildabeast ima do that brah lol,en,0.0,0.0,0.0,373470202,BrandonWarren40,75.0,36.0,...,"Tampa, FL",United States,US,city,https://api.twitter.com/1.1/geo/id/dc62519fda1...,0.0,2013,9,1,1


We consider the most relevant parts to predict the tweet language to be:
* The tweet text
* The country code

The tweet text might contain user mentions. For simplicity, we first transform the text by replacing each user mention with the text "@usermention".

In [18]:
import re

tweets.tweet = tweets.tweet.apply(lambda x: re.sub(r"@[\w\d]+", "@usermention", x))
tweets = tweets[["tweet", "country_code", "lang"]]
tweets.head()

Unnamed: 0,tweet,country_code,lang
0,@usermention: I already miss Jace like hell,US,en
1,I really really hate texting unless we're talk...,US,en
2,"Wind 4.0 mph SSE. Barometer 1040.0 mb, Falling...",GB,en
3,"""@usermention: Congratulations to Bayern Münch...",ID,en
4,@usermention ima do that brah lol,US,en
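
To see what the substitution does on individual strings, here is a small sketch that uses only the standard library. The sample strings and the helper name `mask_mentions` are made up for illustration. Since `\w` already matches digits, `@\w+` matches the same handles as `@[\w\d]+`.

```python
import re

# Compiled once; `\w` covers letters, digits and underscore.
MENTION = re.compile(r"@\w+")

def mask_mentions(text):
    """Replace every @handle with the literal token '@usermention'."""
    return MENTION.sub("@usermention", text)

# Illustrative inputs (not taken from the full dataset).
samples = [
    "@adambeyer234: I already miss Jace like hell",
    "no mentions here",
    "cc @a_b and @c9",
]
masked = [mask_mentions(s) for s in samples]
for s in masked:
    print(s)
```

Note that the pattern matches any `@` followed by word characters, so it would also rewrite the part after the `@` in an e-mail address.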


In [28]:
tweets.lang.unique()

array(['en'], dtype=object)
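
Because `unique()` returns only `'en'`, the data loaded here contains a single class, so a classifier trained on it has nothing to distinguish. A quick way to check label balance before training is to count the labels. The sketch below uses a made-up list of labels and `collections.Counter`; with pandas, `tweets.lang.value_counts()` gives the same counts.

```python
from collections import Counter

# Hypothetical labels, standing in for the `lang` column.
labels = ["en", "en", "es", "en", "pt", "es"]

counts = Counter(labels)
total = sum(counts.values())
for lang, n in counts.most_common():
    print(f"{lang}: {n} ({n / total:.0%})")

# Classification only makes sense with at least two distinct labels.
has_multiple_classes = len(counts) > 1
print(has_multiple_classes)
```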

With this information we'll create embeddings to train a CNN + fully connected ANN to predict the language.

For comparison, we will also train an LSTM + fully connected ANN and compare the results.

First, let's split the dataset into train and test subsets.

In [13]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(tweets, test_size=0.2, random_state=1234)

Now, let's tokenize the text and create the embeddings.

In [23]:
from keras.preprocessing.text import Tokenizer

words = 1000  # maximum vocabulary size kept by the tokenizer
tokenizer = Tokenizer(num_words=words)
tokenizer.fit_on_texts(tweets.tweet)

In [24]:
print(f"Unique tokens: {len(tokenizer.word_index)}")
tokenizer.word_index

Unique tokens: 504


{'usermention': 1,
 'i': 2,
 'to': 3,
 'the': 4,
 't': 5,
 'co': 6,
 'me': 7,
 'you': 8,
 'and': 9,
 'a': 10,
 'of': 11,
 'my': 12,
 'for': 13,
 'in': 14,
 'with': 15,
 'http': 16,
 'be': 17,
 'it': 18,
 '0': 19,
 '—': 20,
 'https': 21,
 'that': 22,
 'is': 23,
 'but': 24,
 'at': 25,
 'lol': 26,
 'by': 27,
 'so': 28,
 'she': 29,
 'all': 30,
 'im': 31,
 'up': 32,
 'go': 33,
 'should': 34,
 'when': 35,
 'this': 36,
 'on': 37,
 "don't": 38,
 'today': 39,
 'love': 40,
 'u': 41,
 'do': 42,
 'now': 43,
 'text': 44,
 'your': 45,
 'come': 46,
 'off': 47,
 'work': 48,
 'hard': 49,
 'home': 50,
 'out': 51,
 'been': 52,
 'hoes': 53,
 'have': 54,
 'was': 55,
 'pic': 56,
 'like': 57,
 'really': 58,
 'hate': 59,
 'wind': 60,
 'rain': 61,
 'just': 62,
 'give': 63,
 'good': 64,
 'music': 65,
 'famous': 66,
 'dessert': 67,
 'high': 68,
 'day': 69,
 'lmao': 70,
 'birthday': 71,
 'would': 72,
 'her': 73,
 'plz': 74,
 '😂': 75,
 'worth': 76,
 'k': 77,
 'only': 78,
 'take': 79,
 'lazy': 80,
 'world': 81,
 ...}

In [25]:
sequences = tokenizer.texts_to_sequences(tweets.tweet)
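
As a rough picture of what `fit_on_texts` and `texts_to_sequences` produce, the following pure-Python sketch builds a frequency-ranked word index and maps each text to a list of integer ids. It is a simplification written for illustration: the function names are hypothetical, ties are broken alphabetically here, and the real Keras `Tokenizer` also filters punctuation and supports other options. It does show why the resulting sequences have different lengths.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Map each text to word ids, dropping unknown words and ids >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

# Illustrative texts (not the notebook's data).
texts = ["i really really hate texting", "i miss you"]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index)
print(word_index)
print(sequences)
```

Each text yields as many ids as it has known words, so the lists are ragged and need to be brought to a common length before they can be fed to a layer with a fixed `input_length`.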

In [33]:
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding

model = Sequential()
model.add(Embedding(1000, 64, input_length=20))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 20, 64)            64000     
Total params: 64,000
Trainable params: 64,000
Non-trainable params: 0
_________________________________________________________________
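
The parameter count in the summary follows directly from the layer's arguments: the embedding layer stores one vector of size 64 for each of the 1000 vocabulary indices. The arithmetic, with variable names chosen here for illustration:

```python
vocab_size = 1000     # first argument to Embedding
embedding_dim = 64    # second argument to Embedding
input_length = 20     # tokens per input sequence

# One trainable weight per (vocabulary index, embedding dimension) pair.
embedding_params = vocab_size * embedding_dim

# None stands for the batch dimension, as in the summary above.
output_shape = (None, input_length, embedding_dim)

print(embedding_params)  # 64000
print(output_shape)      # (None, 20, 64)
```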


In [34]:
%%time


ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 1 array(s), but instead got the following list of 87 arrays: [array([[  1],
       [  2],
       [106],
       [107],
       [108],
       [ 57],
       [109]]), array([[  2],
       [ 58],
       [ 58],
       [ 59],
       [110],
       [111],
       [112],
 ...
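
The message says the model received a list of 87 arrays of different lengths where it expected a single array. One likely cause, assuming the cell passed `sequences` to the model directly, is that the token sequences were never padded to the `input_length=20` declared in the `Embedding` layer. Keras provides `pad_sequences` in `keras.preprocessing.sequence` for this purpose. The sketch below reimplements the idea in plain Python to show the target shape; the helper `pad_to_length` is hypothetical, and the second example sequence is made up.

```python
def pad_to_length(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each sequence so all have length `maxlen`."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:]   # keep the last `maxlen` ids
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

# First sequence copied from the error output; second one is illustrative.
sequences = [[1, 2, 106, 107, 108, 57, 109], [2, 58, 59]]
padded = pad_to_length(sequences, maxlen=20)

for row in padded:
    print(len(row), row)
```

After padding, every row has the same length, so the rows can be stacked into a single 2-D array of shape `(n_samples, 20)`, which is the single input array the model expects.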