# Practical Machine Learning
# Toader Liviu Eduard - Group 407

In this notebook I tried out several model types (linear regression, ridge, lasso, etc.), but no neural networks.

Because none of these models scored well enough, I switched to a neural network, which can be found in the other ipynb file.

## 1. Imports

In [1]:
import pandas as pd
import numpy as np

# Only the German spaCy pipeline is used below; the unused spaCy imports
# (including spacy.lemmatizer, which no longer exists in spaCy 3) were dropped
import de_core_news_sm

from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection
from sklearn.linear_model import LinearRegression, Ridge, MultiTaskElasticNet, MultiTaskLasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error

## 2. Reading the data

Read the training and validation files

The Twitter messages are the features <b>x</b> 

The latitude and longitude coordinates are the labels <b>y</b>

In [2]:
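def read(file):
    # Column names are supplied here, so the first row of the file is read as data
    df = pd.read_csv(file, names=['id', 'lat', 'long', 'message'], delimiter=',')
    # Copy so a column can be added later without a SettingWithCopyWarning
    x = df[['message']].copy()
    y = df[['lat', 'long']]
    return x, y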

train_x, train_y = read('training.txt')
validation_x, validation_y = read('validation.txt')

Print the messages from the training set

In [3]:
train_x

| | message |
|---|---|
| 0 | Seit d Vase: "Wenn ich kaputt gang, bringt das... |
| 1 | Haha bin au w isch der amig au so richtig lang... |
| 2 | isch d hiltl dachterrasse amne samstig viel bs... |
| 3 | Ich fühle mich wie die Weimarer Republik... ..... |
| 4 | Eui liebschte Lunchidee zum Mitneh? 😬 En Grill... |
| ... | ... |
| 22578 | Bin grad in Bus igstige, da seit de Buschauffe... |
| 22579 | Rien ne surpassera Dragostea Din Tei de O-zone... |
| 22580 | het öpert au kei bock meh zum schaffa und lust... |
| 22581 | Oh wenn wedermol en jodel -5 het wos ned verdi... |


Print the coordinates from the training set

In [4]:
train_y

| | lat | long |
|---|---|---|
| 0 | 51.810067 | 10.191331 |
| 1 | 51.918188 | 10.599245 |
| 2 | 52.711074 | 9.987374 |
| 3 | 52.386711 | 11.700612 |
| 4 | 52.314631 | 9.701835 |
| ... | ... | ... |
| 22578 | 51.884863 | 10.487841 |
| 22579 | 49.935479 | 7.051477 |
| 22580 | 50.597534 | 12.055682 |
| 22581 | 51.848082 | 8.554886 |


## 3. Preprocessing the data

Standardize the coordinates (zero mean and unit variance per column), fitting the scaler on the training labels only

In [5]:
scaler = StandardScaler()
scaler.fit(train_y)

train_y = scaler.transform(train_y)
validation_y = scaler.transform(validation_y)

Print the scaled coordinates from the training set

In [6]:
train_y

array([[ 0.09224194,  0.60083834],
       [ 0.2128527 ,  0.90798381],
       [ 1.09733166,  0.44726561],
       ...,
       [-1.260361  ,  2.00463273],
       [ 0.13464846, -0.63135015],
       [ 1.36553917, -0.06534156]])
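
Because the models are trained on standardized coordinates, every error reported later in this notebook is expressed in standard deviations, not in degrees. The following is a minimal pure-Python sketch of what `StandardScaler` computes for one column (subtract the mean, divide by the population standard deviation) and of how a value can be mapped back to degrees. The sample latitudes and the helper names `fit_column`, `transform` and `inverse_transform` are made up for illustration; in the notebook itself, `scaler.inverse_transform` would play that role.

```python
from statistics import fmean, pstdev

def fit_column(values):
    """Return (mean, std) of one column, using the population standard deviation."""
    return fmean(values), pstdev(values)

def transform(values, mean, std):
    """Standardize: zero mean and unit variance."""
    return [(v - mean) / std for v in values]

def inverse_transform(scaled, mean, std):
    """Map standardized values back to the original unit (degrees)."""
    return [s * std + mean for s in scaled]

# Hypothetical latitudes, not taken from the dataset
latitudes = [51.8, 51.9, 52.7, 52.4, 49.9]

mean, std = fit_column(latitudes)
scaled = transform(latitudes, mean, std)
restored = inverse_transform(scaled, mean, std)

# An error of 0.5 in standardized units corresponds to 0.5 * std degrees
error_in_degrees = 0.5 * std

print(scaled)
print(restored)
print(error_in_degrees)
```

Multiplying a standardized error by the column's standard deviation is enough to read it in degrees, because the mean cancels out when two values are subtracted.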

Use spaCy's de_core_news_sm for German natural language processing

Drop stop words and keep only the parts of speech most likely to carry relevant information (nouns, proper nouns, adjectives, verbs, adverbs)

Lemmatize the words

In [7]:
german = de_core_news_sm.load()
allowed_pos = ['NOUN', 'PROPN', 'ADJ', 'VERB', 'ADV']

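def preprocess(df):
    preprocessed = []
    for n, message in enumerate(df['message'], start=1):
        text = german(message)
        # Keep the lemma of every non-stop-word token with an allowed part of speech
        lemmas = [token.lemma_ for token in text
                  if not token.is_stop and token.pos_ in allowed_pos]
        preprocessed.append(str(lemmas))

        if n % 2500 == 0:
            print('Preprocessed', n, 'messages...')

    # Assign the whole column at once instead of writing strings row by row
    # into a column that was initialised with NaN
    df['preprocessed'] = preprocessed
    return df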
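
The filtering rule inside `preprocess` can be tried out without loading a spaCy model. The sketch below uses a hypothetical `Token` stand-in that carries only the three attributes the loop reads (`lemma_`, `pos_`, `is_stop`). The tokens and their tags are invented for illustration and are not real spaCy output.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only the attributes the filter reads
Token = namedtuple('Token', ['lemma_', 'pos_', 'is_stop'])

allowed_pos = ['NOUN', 'PROPN', 'ADJ', 'VERB', 'ADV']

def keep_lemmas(tokens):
    """Keep the lemmas of tokens that are not stop words and have an allowed POS tag."""
    return [t.lemma_ for t in tokens if not t.is_stop and t.pos_ in allowed_pos]

# Invented tags for the sentence "Ich fühle mich wie die Weimarer Republik"
tokens = [
    Token('ich', 'PRON', True),
    Token('fühlen', 'VERB', False),
    Token('sich', 'PRON', True),
    Token('wie', 'ADP', True),
    Token('der', 'DET', True),
    Token('Weimarer', 'ADJ', False),
    Token('Republik', 'NOUN', False),
]

lemmas = keep_lemmas(tokens)
print(lemmas)
```

Pronouns, prepositions and determiners are dropped twice over here: they are stop words and their tags are not in `allowed_pos`.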

In [8]:
train_x = preprocess(train_x)

Preprocessed 2500 messages...
Preprocessed 5000 messages...
Preprocessed 7500 messages...
Preprocessed 10000 messages...
Preprocessed 12500 messages...
Preprocessed 15000 messages...
Preprocessed 17500 messages...
Preprocessed 20000 messages...
Preprocessed 22500 messages...


Print the preprocessed training set

In [9]:
train_x

| | message | preprocessed |
|---|---|---|
| 0 | Seit d Vase: "Wenn ich kaputt gang, bringt das... | ['d', 'Vase', 'kaputt', 'gang', 'bringen', 'Un... |
| 1 | Haha bin au w isch der amig au so richtig lang... | ['Haha', 'au', 'isch', 'amig', 'au', 'langwili... |
| 2 | isch d hiltl dachterrasse amne samstig viel bs... | ['isch', 'hiltl', 'dachterrasse', 'amne', 'sam... |
| 3 | Ich fühle mich wie die Weimarer Republik... ..... | ['fühlen', 'Weimarer', 'Republik', 'Verfassung... |
| 4 | Eui liebschte Lunchidee zum Mitneh? 😬 En Grill... | ['Eui', 'liebschte', 'Lunchidee', 'Mitneh', '😬... |
| ... | ... | ... |
| 22578 | Bin grad in Bus igstige, da seit de Buschauffe... | ['grad', 'Bus', 'de', 'Buschauffeur', 'set', '... |
| 22579 | Rien ne surpassera Dragostea Din Tei de O-zone... | ['Rien', 'surpassera', 'Dragostea', 'Din', 'Te... |
| 22580 | het öpert au kei bock meh zum schaffa und lust... | ['het', 'öpert', 'kei', 'bock', 'meh', 'schaff... |
| 22581 | Oh wenn wedermol en jodel -5 het wos ned verdi... | ['wedermol', 'jodel', '-5', 'het', 'wos', 'ned... |


We can observe that natural language processing may not be very effective on this type of dataset.

The tweets may contain spelling mistakes or non-German words; for example, tweet 22579 printed above is written in French.

Convert the words into numbers by calculating the <b>term frequency - inverse document frequency</b> (TF-IDF) of the words

In [10]:
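# Keep only the 5000 most frequent terms
vectorizer = TfidfVectorizer(max_features=5000)
# The vocabulary is learned from the lemmatized text...
vectorizer.fit(train_x['preprocessed'])

# ...but the raw messages are what gets vectorized here: validation_x was never
# passed through preprocess(), so 'message' is the only column both sets share
train_x = vectorizer.transform(train_x['message'])
validation_x = vectorizer.transform(validation_x['message'])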
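
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch of the formula `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It is a simplification under stated assumptions: it splits on whitespace instead of using the vectorizer's regular-expression tokenizer, it has no `max_features` cut-off, and the toy corpus and the function name `tfidf` are invented.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term appears in
    df = Counter(term for doc in tokenized for term in set(doc))
    # Smoothed inverse document frequency: rare terms get a larger weight
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in row))
        rows.append([w / norm for w in row] if norm else row)
    return vocabulary, rows

# Invented toy corpus, already reduced to space-separated tokens
documents = ['bus fahren bus', 'bus warten', 'vase kaputt']
vocabulary, rows = tfidf(documents)

for doc, row in zip(documents, rows):
    print(doc, [round(w, 3) for w in row])
```

In the second document, `warten` receives a higher weight than `bus` even though both occur once, because `bus` also appears in another document.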

## 4. Creating the models

Try out multiple model types (linear regression, ridge, lasso, etc.)

Print each model's mean absolute error (MAE) and mean squared error (MSE) on the validation set

In [11]:
models = [
    LinearRegression(),
    Ridge(),
    MultiTaskElasticNet(),
    MultiTaskLasso(),
    DecisionTreeRegressor(),
    KNeighborsRegressor(),
    RadiusNeighborsRegressor(radius=2)
]

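for index, model in enumerate(models):
    if index in [2, 3]:
        # The two multi-task models are fitted on a dense array
        model.fit(train_x.toarray(), train_y)
    else:
        model.fit(train_x, train_y)
    validation_y_predicted = model.predict(validation_x)

    # scikit-learn metrics take (y_true, y_pred); both metrics are symmetric,
    # so the printed values do not depend on the argument order
    print('Model', index,
          'MAE', mean_absolute_error(validation_y, validation_y_predicted),
          'MSE', mean_squared_error(validation_y, validation_y_predicted))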

Model 0 MAE 0.596357884085078 MSE 0.6061944035894757
Model 1 MAE 0.5511657177935221 MSE 0.518076656795322
Model 2 MAE 0.8149175336652377 MSE 0.9477759252791536
Model 3 MAE 0.8149175336652377 MSE 0.9477759252791536
Model 4 MAE 0.7164650209519143 MSE 1.0486597740225576
Model 5 MAE 0.808148797257151 MSE 0.9907357608073597
Model 6 MAE 0.8149175336652379 MSE 0.9477759252791534
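
For reference, both metrics are plain averages of per-coordinate errors. The sketch below re-implements them in pure Python for two-column targets (with equally sized columns this matches scikit-learn's default uniform averaging over outputs) and evaluates a constant baseline that always predicts the training mean, which is (0, 0) after standardization. The target values and the function names `mae` and `mse` are invented. Such a baseline is a useful yardstick: models 2, 3 and 6 above report identical errors, which would be consistent with them predicting a constant, although this notebook does not verify that.

```python
def mae(y_true, y_pred):
    """Mean absolute error over all samples and all output columns."""
    errors = [abs(t - p)
              for row_true, row_pred in zip(y_true, y_pred)
              for t, p in zip(row_true, row_pred)]
    return sum(errors) / len(errors)

def mse(y_true, y_pred):
    """Mean squared error over all samples and all output columns."""
    errors = [(t - p) ** 2
              for row_true, row_pred in zip(y_true, y_pred)
              for t, p in zip(row_true, row_pred)]
    return sum(errors) / len(errors)

# Invented standardized (lat, long) targets
y_true = [[0.5, -1.0], [-1.5, 0.5], [1.0, 2.0]]

# Constant baseline: always predict the training mean, i.e. (0, 0) after scaling
baseline = [[0.0, 0.0] for _ in y_true]

baseline_mae = mae(y_true, baseline)
baseline_mse = mse(y_true, baseline)
print('Baseline MAE', baseline_mae, 'MSE', baseline_mse)
```

A model is only worth keeping if it beats this baseline on both metrics.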


Run a grid search over the ridge hyperparameters (`alpha` and `fit_intercept`)

In [12]:
grid = GridSearchCV(
    Ridge(),
    param_grid = {
        'alpha': [0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8],
        'fit_intercept': [True, False],
    },
    scoring = 'neg_mean_absolute_error'
)

grid.fit(train_x, train_y)

GridSearchCV(estimator=Ridge(),
             param_grid={'alpha': [0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8],
                         'fit_intercept': [True, False]},
             scoring='neg_mean_absolute_error')

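Print the best cross-validated score found by the grid search (this is the negated MAE on the training folds, so values closer to zero are better)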

In [13]:
grid.best_score_

-0.5612044511105517
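
What the grid search does internally can be sketched in a few lines. The example below is a simplified pure-Python illustration, not scikit-learn's implementation: it assumes five contiguous, unshuffled folds, replaces `Ridge` with a one-feature closed-form ridge fit without intercept, and scores each `alpha` by the mean negated MAE over the folds. The data and all helper names (`kfold_indices`, `fit_ridge_1d`, `neg_mae`, `grid_search`) are invented for illustration.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def fit_ridge_1d(x, y, alpha):
    """Closed-form ridge slope for a single feature and no intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)

def neg_mae(x, y, slope):
    """Negated mean absolute error, so that larger is better."""
    return -sum(abs(b - slope * a) for a, b in zip(x, y)) / len(x)

def grid_search(x, y, alphas, n_splits=5):
    """Return (best_alpha, best_score), where score is the mean over all folds."""
    best_alpha, best_score = None, float('-inf')
    for alpha in alphas:
        scores = []
        for train_idx, test_idx in kfold_indices(len(x), n_splits):
            slope = fit_ridge_1d([x[i] for i in train_idx],
                                 [y[i] for i in train_idx], alpha)
            scores.append(neg_mae([x[i] for i in test_idx],
                                  [y[i] for i in test_idx], slope))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_alpha, best_score = alpha, mean_score
    return best_alpha, best_score

# Invented one-feature data in which y is roughly 2 * x
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

alphas = [0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8]
best_alpha, best_score = grid_search(x, y, alphas)
print(best_alpha, best_score)
```

Because the score is averaged over held-out training folds, it is not directly comparable to the validation-set MAE printed for the individual models.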

## 5. Conclusion

Ridge was usually the best-performing model in my runs, with a mean absolute error (MAE) of around 0.55, followed by linear regression at around 0.60 and the other models at roughly 0.70 to 0.80. All of these errors are measured on the standardized coordinates, not in degrees.

I didn't make the final submission with any model from this notebook. Instead, I built a neural network (which can be found in the other ipynb file) that has lower mean errors.