# Homework 5 – Analysis of distributed data sources

In the data set we already used in the exercise, there is another target attribute: age.
Complete task group A **or** B. Always assess the performance of the corresponding classifier/regressor.

**Task group A**
1. Build and test a text classifier that assigns a user to one of the age classes (0-10, 11-20, 21-30, 31+).

2. Build an ML name classifier that assigns a user to one of the age classes (0-10, 11-20, 21-30, 31+).

3. Build a meta classifier that combines the previously built classifiers to predict the age classes (0-10, 11-20, 21-30, 31+).

**Task group B**
1. Build and test a text regressor that predicts the specific age of a user (regression).

2. Build an ML name regressor that predicts the specific age of a user (regression).

3. Build a meta regressor that combines the previously built regressors to predict the specific age of a user (regression).

<i>**Please make sure:**

- each cell (essential step) is commented on with a short sentence
- new variables / fields are output in sufficient length (e.g., df.head(10))
- each of the tasks is answered with a short written statement

This makes the evaluation much easier and, thus, would help us a lot.</i>


## Task group A

### Preparation

Import the necessary libraries and functions.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.metrics import accuracy_score as accuracy
from sklearn.ensemble import RandomForestClassifier 
from sklearn.svm import SVC

Load the data set.

In [2]:
data = pd.read_pickle('data/twitterData.pkl')
data.shape

(2916, 6)

Get familiar with the structure of the data.

In [3]:
data.sample(10)

Unnamed: 0,screen_name,name,tweets_concatenated,avatar_url,gender,age
0,Celli_Dragonfly,Celine Schwaiger,"@otasatic Ist nicht schlimm, ich kenne das zu ...",http://pbs.twimg.com/profile_images/4912167509...,F,18.0
0,530bw,T. Jennings,Katastrophe für Blogs dank EuGH-Urteil: Urhebe...,http://pbs.twimg.com/profile_images/2448380322...,F,
0,LochisTabea,⚜Tabea⚜,@Juleelchs @_DeenisTJ_ auf jeden 😏😂||Wieso ist...,http://pbs.twimg.com/profile_images/7696618319...,F,
0,die_nuff,Räubertochter,@1337gut heute funktioniert wieder alles o.O i...,http://pbs.twimg.com/profile_images/7604922231...,F,
0,hanshans1963,Hanshans Fruehsammer,Die Attraction Marketing Zeitung von #Hanshans...,http://pbs.twimg.com/profile_images/2796422108...,M,49.0
0,Anton2365_2,Anton Islitzer,Ich habe ein @YouTube-Video positiv bewertet: ...,http://pbs.twimg.com/profile_images/7123707339...,M,
0,the_witchchild,Nessa,Ich gähne. Und gähne. Und gähne. 😴||Wenn einem...,http://pbs.twimg.com/profile_images/3788000008...,F,
0,Sarah_x393,Sarah Mothes,,http://pbs.twimg.com/profile_images/7568505502...,F,22.0
0,xXChrisi95Xx,Christina Scheffler,Oh man mein display zerbröselt armes iphon||Oh...,http://pbs.twimg.com/profile_images/1909240700...,F,16.0
0,JBurstner,Querfeldeindenkerin,,http://pbs.twimg.com/profile_images/6917191334...,F,


Filter out all data instances with empty tweets.

In [4]:
data = data[data.tweets_concatenated != '']
data.shape

(2486, 6)

Filter out all data instances with missing age.

In [5]:
data = data[data.age.notnull()]
data.shape


(1137, 6)

Filter out all data instances with missing name.

In [6]:
data = data[data.name !='']
data.shape

(1137, 6)

Check whether the filtering is correct:

In [7]:
data.sample(15)

Unnamed: 0,screen_name,name,tweets_concatenated,avatar_url,gender,age
0,zeyyil1999,Zeynep :D :**,@fabrice10701 danke fürs folgen :))||Darf morg...,http://pbs.twimg.com/profile_images/3433295581...,F,16.0
0,BlackPlayerX,Maik,Spielt #UNCHARTED2 (Mehrspieler): http://www.n...,http://pbs.twimg.com/profile_images/1624391996...,M,30.0
0,Tobi_2022,Tobi Slr,RT @bushido: Unfassbar!!! ROLI auf 3 in den Si...,http://pbs.twimg.com/profile_images/7297590509...,M,15.0
0,nina_tbp,Nina Teller,whitney du warst die beste||ich höre gerade tb...,http://pbs.twimg.com/profile_images/1825389989...,F,15.0
0,EhringStef,Stefan Ehring,"@Regio_NRW Moin, im #RE11 Abf. 7:46 von Essen ...",http://pbs.twimg.com/profile_images/3788000000...,M,23.0
0,KingGor95,Gor Karapetyan,RT @SPORT1: .@HenrikhMkh lässt auf der Asienre...,http://pbs.twimg.com/profile_images/5797777481...,M,20.0
0,smiilecat,Jenny,"Kinder die Sonne scheint , ab raus gehen und S...",http://pbs.twimg.com/profile_images/3445132615...,F,17.0
0,mximilin,maximilian,Ich Twitter. :) Bin jetzt #essen. Die ersten B...,http://pbs.twimg.com/profile_images/678166364/...,M,17.0
0,MusicFreakFever,Nadine (:,Heute nur DREI (!) Schulstunden !\n Ich LIIIIE...,http://pbs.twimg.com/profile_images/3788000003...,M,15.0
0,Badmulder,Alex Mulder,Guten Morgen. Die Sonne scheint - viel Spaß eu...,http://pbs.twimg.com/profile_images/3399873707...,M,38.0


In [8]:
data['age'].describe()

count    1137.000000
mean       22.794195
std        12.000068
min         3.000000
25%        16.000000
50%        19.000000
75%        24.000000
max        82.000000
Name: age, dtype: float64

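The oldest user in the data set is 82, so 100 is a valid upper bound for the age bins. The intervals are right-closed, so (0, 10], (10, 20], (20, 30] and (30, 100] correspond to the age classes 0-10, 11-20, 21-30 and 31+.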

In [9]:
bins = pd.IntervalIndex.from_tuples([(0,10),(10,20),(20,30),(30,100)])
data['age_class'] = pd.cut(data['age'],bins = bins)
data.head(10)

Unnamed: 0,screen_name,name,tweets_concatenated,avatar_url,gender,age,age_class
0,DatZerooo,David,Warum riecht mein Bruder nach Pizza wenn er ei...,http://pbs.twimg.com/profile_images/7569661512...,M,16.0,"(10, 20]"
0,reap705,Oliver Gast,[CSS] Ein Off-canvas-Menü mit Dropdown-Navigat...,http://pbs.twimg.com/profile_images/1366984169...,M,15.0,"(10, 20]"
0,eduUu06,eduUu,heut abend kogge und morgen endlich haare ab :...,http://pbs.twimg.com/profile_images/896480580/...,M,46.0,"(30, 100]"
0,Narutofreak935,Avengar,@GrandlineTV gib nicht auf und mach dein Ding ...,http://pbs.twimg.com/profile_images/7317971734...,F,19.0,"(10, 20]"
0,miley_sarah,Sarah,"RT @bomelino: Das ""Backe, backe Kuchen""-Lied i...",http://pbs.twimg.com/profile_images/7584397626...,M,18.0,"(10, 20]"
0,DerIncubus,Der Incubus,@NicoleAllm Na ... gut ins neue Jahr gestartet...,http://pbs.twimg.com/profile_images/5808649200...,F,22.0,"(20, 30]"
0,Petouser,ペトユサ (Petoyusa),Verschwörungstheorie: Pokemon Go wird von der ...,http://pbs.twimg.com/profile_images/6626943925...,M,26.0,"(20, 30]"
0,ChrisWhite126,Chris White,Ach du scheiße ist das warm. :( Hab locker 5kg...,http://pbs.twimg.com/profile_images/6463341220...,M,37.0,"(30, 100]"
0,MusicFreakFever,Nadine (:,Heute nur DREI (!) Schulstunden !\n Ich LIIIIE...,http://pbs.twimg.com/profile_images/3788000003...,M,15.0,"(10, 20]"
0,LukasAlthoff,Luk Alt,RT @cem_oezdemir: #Pazar will mir eine Ehrenbü...,http://pbs.twimg.com/profile_images/5035102215...,F,30.0,"(20, 30]"


Encode the age intervals as integer codes (0 to 3).

In [10]:
data['age_class'] = data['age_class'].astype('category')
data['age_class'] = data['age_class'].cat.codes

data.head(10)

Unnamed: 0,screen_name,name,tweets_concatenated,avatar_url,gender,age,age_class
0,DatZerooo,David,Warum riecht mein Bruder nach Pizza wenn er ei...,http://pbs.twimg.com/profile_images/7569661512...,M,16.0,1
0,reap705,Oliver Gast,[CSS] Ein Off-canvas-Menü mit Dropdown-Navigat...,http://pbs.twimg.com/profile_images/1366984169...,M,15.0,1
0,eduUu06,eduUu,heut abend kogge und morgen endlich haare ab :...,http://pbs.twimg.com/profile_images/896480580/...,M,46.0,3
0,Narutofreak935,Avengar,@GrandlineTV gib nicht auf und mach dein Ding ...,http://pbs.twimg.com/profile_images/7317971734...,F,19.0,1
0,miley_sarah,Sarah,"RT @bomelino: Das ""Backe, backe Kuchen""-Lied i...",http://pbs.twimg.com/profile_images/7584397626...,M,18.0,1
0,DerIncubus,Der Incubus,@NicoleAllm Na ... gut ins neue Jahr gestartet...,http://pbs.twimg.com/profile_images/5808649200...,F,22.0,2
0,Petouser,ペトユサ (Petoyusa),Verschwörungstheorie: Pokemon Go wird von der ...,http://pbs.twimg.com/profile_images/6626943925...,M,26.0,2
0,ChrisWhite126,Chris White,Ach du scheiße ist das warm. :( Hab locker 5kg...,http://pbs.twimg.com/profile_images/6463341220...,M,37.0,3
0,MusicFreakFever,Nadine (:,Heute nur DREI (!) Schulstunden !\n Ich LIIIIE...,http://pbs.twimg.com/profile_images/3788000003...,M,15.0,1
0,LukasAlthoff,Luk Alt,RT @cem_oezdemir: #Pazar will mir eine Ehrenbü...,http://pbs.twimg.com/profile_images/5035102215...,F,30.0,2
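
The binning and encoding steps above can be checked on a few boundary ages. The following is a minimal sketch with made-up ages (not taken from the data set); it assumes the same bins as in the notebook and illustrates that ages outside all intervals end up with the code -1.

```python
import pandas as pd

# Same right-closed bins as in the notebook: (0, 10], (10, 20], (20, 30], (30, 100].
bins = pd.IntervalIndex.from_tuples([(0, 10), (10, 20), (20, 30), (30, 100)])

# Toy ages (hypothetical), chosen to hit every bin edge plus two out-of-range values.
ages = pd.Series([3, 10, 11, 20, 21, 30, 31, 82, 0, 101])

# pd.cut returns a categorical Series whose categories are the intervals.
age_class = pd.cut(ages, bins=bins)

# Integer codes follow the order of the intervals; values that fall into
# no interval (here 0 and 101) become NaN and receive the code -1.
codes = age_class.cat.codes
print(codes.tolist())
```

Since the smallest age in the filtered data is 3 and the largest is 82, no user should receive the code -1 here, but the check is cheap if the bins are ever changed.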


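Split the data set into two training sets (one for the base classifiers, one for the meta classifier) and one test set. Two successive 70/30 splits yield roughly 70% / 21% / 9% of the data.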

In [11]:
trainSub, tempData = train_test_split(data, test_size = 0.3, random_state = 0)
trainMeta, test = train_test_split(tempData,test_size = 0.3, random_state = 0)
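
To see how many users end up in each part, the two splits can be reproduced with plain arithmetic. This is a rough sketch, assuming that `train_test_split` rounds a fractional test size up and assigns the remaining rows to the training part; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
from math import ceil


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test part is rounded up and the train part gets the rest.
    """
    n_test = ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_total = 1137  # rows left after filtering (see data.shape above)

# First split: base-classifier training set vs. temporary remainder.
n_train_sub, n_temp = split_sizes(n_total, 0.3)

# Second split: meta-classifier training set vs. final test set.
n_train_meta, n_test = split_sizes(n_temp, 0.3)

print(n_train_sub, n_train_meta, n_test)
```

Under that assumption the test set holds only about a hundred users, so a single user accounts for roughly one percentage point of accuracy. Small differences between scores should therefore be read with caution.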

Get tweet data.

In [12]:
trainSub_tweets = trainSub['tweets_concatenated']
trainMeta_tweets = trainMeta['tweets_concatenated']
test_tweets = test['tweets_concatenated']

Get name data.

In [13]:
trainSub_name = trainSub['name']
trainMeta_name = trainMeta['name']
test_name = test['name']

Get target data.

In [14]:
y_trainSub = trainSub['age_class']
y_trainMeta = trainMeta['age_class']
y_test = test['age_class']

### Task 1 - Classifier based on tweets

Vectorize tweet data.

In [15]:
vect = CountVectorizer()
vect_sub = vect.fit_transform(trainSub_tweets)
vect_meta = vect.transform(trainMeta_tweets)
vect_test = vect.transform(test_tweets)
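
The vectorizer is fitted on the first training set only and then reused for the other two sets. A small sketch with invented example sentences (not from the data set) shows what this means: words that were not seen during fitting are silently dropped when transforming new text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents standing in for concatenated tweets.
train_docs = ["guten morgen twitter", "morgen schule"]
new_docs = ["guten abend twitter"]

vect_toy = CountVectorizer()

# fit_transform learns the vocabulary from the training documents only.
X_train = vect_toy.fit_transform(train_docs)

# transform reuses that vocabulary; "abend" is unknown and therefore ignored.
X_new = vect_toy.transform(new_docs)

# Vocabulary terms ordered by their column index.
vocab = sorted(vect_toy.vocabulary_, key=vect_toy.vocabulary_.get)
print(vocab)
print(X_new.toarray().tolist())
```

Fitting on the training set alone keeps information from the meta-training and test sets out of the feature space.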

Train an SVM classifier on the tweet data and score it on the test set.

In [16]:
svm_tweets = SVC()
svm_tweets.fit(vect_sub,y_trainSub)
tweet_score = svm_tweets.score(vect_test,y_test)
tweet_score_text = 'Tweet Score is: {:0.2%}'.format(tweet_score)
print(tweet_score_text)



Tweet Score is: 62.14%
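
An accuracy value is easier to judge next to a simple baseline, especially when the classes are unevenly sized. The sketch below uses hypothetical label lists (not the notebook's data) and a helper named `majority_baseline_accuracy`, which is an illustration and not part of scikit-learn; it computes the accuracy of always predicting the most frequent training class.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return hits / len(y_test)


# Toy age-class codes (hypothetical): class 1 dominates the training labels.
y_train_toy = [1, 1, 1, 2, 2, 3, 0, 1]
y_test_toy = [1, 2, 1, 3, 1]

baseline = majority_baseline_accuracy(y_train_toy, y_test_toy)
print('Baseline Score is: {:0.2%}'.format(baseline))
```

In the notebook, the same function could be called with `y_trainSub` and `y_test`; a classifier is only informative to the extent that it beats this value.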


Store the predictions on the meta-training and test sets; they serve as input features for the meta classifier.

In [18]:
stacked_input1 = pd.Series(svm_tweets.predict(vect_meta))
stacked_input1_test = pd.Series(svm_tweets.predict(vect_test))

### Task 2 - Classifier based on the names

Vectorize name data.

In [19]:
vect2 = CountVectorizer()
# Use the separate name vectorizer so the fitted tweet vectorizer `vect` is not overwritten
vect_sub_name = vect2.fit_transform(trainSub_name)
vect_meta_name = vect2.transform(trainMeta_name)
vect_test_name = vect2.transform(test_name)

Train a multinomial naive Bayes classifier on the name data and score it on the test set.

In [20]:
bayes_names = MultinomialNB()
bayes_names.fit(vect_sub_name,y_trainSub)
name_score = bayes_names.score(vect_test_name,y_test)
name_score_text = 'Name Score is: {:0.2%}'.format(name_score)
print(name_score_text)

Name Score is: 65.05%


Store the predictions on the meta-training and test sets; they serve as input features for the meta classifier.

In [21]:
stacked_input2 = pd.Series(bayes_names.predict(vect_meta_name))
stacked_input2_test = pd.Series(bayes_names.predict(vect_test_name))

### Task 3 - Meta Classifier

Initialize a RandomForestClassifier as the meta classifier.

In [22]:
rfc = RandomForestClassifier(n_estimators = 100, random_state = 0)

Combine the outputs from tasks 1 and 2 into DataFrames.

In [23]:
meta_data_train = {'input1': stacked_input1,
                   'input2': stacked_input2}
meta_data_train = pd.DataFrame(meta_data_train)

meta_data_test = {'input1': stacked_input1_test,
                  'input2': stacked_input2_test}
meta_data_test = pd.DataFrame(meta_data_test)
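
The stacked inputs carry a fresh 0..n-1 index, while the target series keeps the shuffled index from the split. The following sketch, using invented predictions and labels, illustrates why this is unproblematic as long as the rows are in the same order: scikit-learn estimators work on positions, not on index labels. All names and values here are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical hard-label predictions of two base classifiers
# for eight meta-training users.
pred_tweets = np.array([1, 1, 2, 3, 1, 2, 1, 3])
pred_names = np.array([1, 2, 2, 3, 1, 1, 1, 2])

# True age-class codes with a shuffled index, as left behind by a random split.
y_meta = pd.Series([1, 2, 2, 3, 1, 2, 1, 3],
                   index=[40, 7, 93, 12, 5, 61, 28, 77])

# One column per base classifier; the DataFrame gets a fresh 0..7 index.
meta_X = pd.DataFrame({'input1': pred_tweets, 'input2': pred_names})

# Passing the values makes the positional pairing of rows and labels explicit.
meta_clf = RandomForestClassifier(n_estimators=100, random_state=0)
meta_clf.fit(meta_X, y_meta.to_numpy())

# Predict for two new users described only by their base-classifier outputs.
new_X = pd.DataFrame({'input1': [1, 3], 'input2': [1, 3]})
meta_pred = meta_clf.predict(new_X)
print(meta_pred)
```

With only two categorical inputs the meta classifier can distinguish at most 16 input combinations, which limits how much it can add over the better base classifier.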

Train the meta classifier on the meta-training data and score it on the test set.

In [24]:
rfc.fit(meta_data_train,y_trainMeta)
meta_score = rfc.score(meta_data_test,y_test)
meta_score_text = 'Meta Score is: {:0.2%}'.format(meta_score)

Performance of all classifiers:

In [25]:
print(tweet_score_text)
print(name_score_text)
print(meta_score_text)

Tweet Score is: 62.14%
Name Score is: 65.05%
Meta Score is: 64.08%
