# Homework 5 – Analysis of distributed data sources

The dataset we already used in the exercise contains another target attribute: `age`. Apply what you have learned to this new target and assess the performance of each classifier.

1. Build and test a text classifier that predicts a user's age class (0-10, 11-20, 21-30, 31+).

2. Build an ML name classifier that predicts a user's age class (0-10, 11-20, 21-30, 31+).

3. Build a meta classifier that combines the two previously built classifiers to predict the same age classes (0-10, 11-20, 21-30, 31+).
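
The class boundaries listed above can be pinned down with a small helper. This is only a sketch: the function name `age_to_class` and the class indices 0 to 3 are assumptions made for illustration and are not prescribed by the task. Note that plain floor division by 10 would not reproduce these boundaries, since it puts age 10 into class 1 and does not cap the top class at 31+.

```python
def age_to_class(age: int) -> int:
    """Map an age in years to a class index.

    0 -> 0-10, 1 -> 11-20, 2 -> 21-30, 3 -> 31+
    (hypothetical helper; the indices are an illustrative choice)
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 10:
        return 0
    if age <= 20:
        return 1
    if age <= 30:
        return 2
    return 3


# Check the boundary values of every class.
for age in (0, 10, 11, 20, 21, 30, 31, 65):
    print(age, age_to_class(age))
```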

As introduced in a previous homework, please save each classifier with `dump`, for example `dump(tree_clf, 'clf1.joblib')`.<br/> Name the files `clf1.joblib`, `clf2.joblib` and `clf3.joblib`.
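
A saved classifier can be checked by loading it again and comparing predictions. The following is a minimal sketch, assuming `joblib` and scikit-learn are installed; the toy data and the temporary directory are made up for illustration, and in the homework the file would be written next to the notebook instead.

```python
import tempfile
from pathlib import Path

from joblib import dump, load
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): one feature, two classes.
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

tree_clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Written to a temporary directory here to avoid cluttering the working directory.
path = Path(tempfile.mkdtemp()) / "clf1.joblib"
dump(tree_clf, path)

# Reload the file and confirm the restored model predicts like the original.
restored = load(path)
same_predictions = list(restored.predict(X)) == list(tree_clf.predict(X))
print(same_predictions)
```

Only load joblib files from sources you trust, since the format is based on pickle and can execute arbitrary code on load.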


**Please make sure:**

- Each cell (essential step) is commented on with a short sentence
- New variables / fields are output in sufficient length (e.g., `df.head(10)`)
- Each of the tasks is answered with a short written statement
- Tidy up your code

No functions are predefined; we expect you to structure your code on your own (functions are not mandatory). Don't forget to upload the joblib files alongside your notebook!

<hr/>

## Coding Area

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline  
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB as bayes
from sklearn.feature_extraction.text import CountVectorizer as countvec
from sklearn.ensemble import RandomForestClassifier 
from joblib import dump

In [None]:
data = pd.read_pickle('data/twitterData.pkl')
data.shape

In [None]:
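# map age to the four required classes: 0 = 0-10, 1 = 11-20, 2 = 21-30, 3 = 31+ (assumes whole-number ages)
# plain floor division would put age 10 into class 1 and create extra classes above age 39
data['age_class'] = data['age'].sub(1).floordiv(10).clip(lower=0, upper=3)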
data.head(10)

In [None]:

# plot the distribution of the age classes
data['age_class'].hist()

In [None]:
# split into 60% base-classifier training, 24% meta-classifier training and 16% test data
train_sub, temp = train_test_split(data, test_size=0.4, random_state=42)
train_meta, test = train_test_split(temp, test_size=0.4, random_state=42)
print(train_sub.shape, train_meta.shape, test.shape)

In [None]:
train_sub_tweets = train_sub['tweets_concatenated']
train_meta_tweets = train_meta['tweets_concatenated']
test_tweets = test['tweets_concatenated']

train_sub_names = train_sub['name']
train_meta_names = train_meta['name']
test_names = test['name']

In [None]:
y_train_sub = train_sub['age_class']
y_train_meta = train_meta['age_class']
y_test = test['age_class']

In [None]:
train_sub_tweets.head(10)

In [None]:
y_train_sub.head(4)

In [None]:
countvectorizer_tweets = countvec()
x_train_sub_tweets = countvectorizer_tweets.fit_transform(train_sub_tweets)
x_train_meta_tweets = countvectorizer_tweets.transform(train_meta_tweets)
x_test_tweets = countvectorizer_tweets.transform(test_tweets)

In [None]:
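# show the bag-of-words counts for the first 10 users (get_feature_names_out needs scikit-learn >= 1.0)
pd.DataFrame(x_train_sub_tweets[:10].toarray(), columns=countvectorizer_tweets.get_feature_names_out())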

In [None]:
# train bayes clf on tweets

bayes_tweets = bayes()
bayes_tweets.fit(x_train_sub_tweets, y_train_sub)
tweet_score = bayes_tweets.score(x_test_tweets, y_test)

In [None]:
# dump clf as required
dump(bayes_tweets, 'clf1.joblib') 

In [None]:
tweet_score_text = f"Tweet Score is {tweet_score:0.2%}"
print(tweet_score_text)

In [None]:
# stack input for later use in meta clf
stacked_input_1 = pd.Series(bayes_tweets.predict(x_train_meta_tweets))
stacked_input_1_test = pd.Series(bayes_tweets.predict(x_test_tweets))

## Name Classifier

In [None]:
# apply count vectorizer to names 

cvectorizer_names = countvec()
x_train_sub_names = cvectorizer_names.fit_transform(train_sub_names)

x_train_meta_names = cvectorizer_names.transform(train_meta_names)
x_test_names = cvectorizer_names.transform(test_names)

In [None]:
# train bayes clf
bayes_names = bayes()
bayes_names.fit(x_train_sub_names, y_train_sub)

In [None]:
name_score = bayes_names.score(x_test_names, y_test)
name_score_text = f"Name Score is {name_score:0.2%}"
print(name_score_text)

In [None]:
# dump bayes clf as required
dump(bayes_names, 'clf2.joblib') 

In [None]:
# stack input for later use in meta clf
stacked_input_2 = pd.Series(bayes_names.predict(x_train_meta_names))
stacked_input_2_test = pd.Series(bayes_names.predict(x_test_names))

## Meta Classifier

In [None]:
# initialize RF classifier
forest = RandomForestClassifier(random_state=42)  # fixed seed so the meta score is reproducible

In [None]:
# compose meta results for training
meta_data_train = {'input_1': stacked_input_1, 'input_2': stacked_input_2}
meta_data_train = pd.DataFrame(meta_data_train)

meta_data_train.head(10)

In [None]:
# compose meta results for testing
meta_data_test = {'input_1': stacked_input_1_test, 'input_2': stacked_input_2_test}
meta_data_test = pd.DataFrame(meta_data_test)
# fit the meta classifier on the base classifiers' predictions for the meta training split
forest.fit(meta_data_train, y_train_meta)

In [None]:
meta_score = forest.score(meta_data_test, y_test)
meta_score_text = f"Meta Score is {meta_score:0.2%}"

In [None]:
# dump clf as required
dump(forest, 'clf3.joblib') 

In [None]:
# final comparison
print(tweet_score_text)
print(name_score_text)
print(meta_score_text)