## Feature Selection 

**To run the end of this notebook, first run the notebook cleanup.ipynb to obtain the csv files!**
In the notebook 'cleaned_dataset_analysis', plots were made for many variables to study which of them may strongly influence the model's outcomes and which add no value. This notebook examines that question more closely by running t-tests on the binary variables.

In [1]:
import pandas as pd
import numpy as np
from functions.csv_tools import to_csv
import warnings
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('./cleaned_data/all_columns.csv')
data.columns

Index(['Unnamed: 0', 'full_text', 'retweet_count', 'favorite_count',
       'user_description', 'user_followers_count',
       'user_normal_followers_count', 'user_friends_count',
       'user_listed_count', 'user_favourites_count', 'user_statuses_count',
       'user_media_count', 'user_translator_type', 'hashtags_count',
       'username', 'user_profile_location', 'has_user_url', 'text_length',
       'sent_via_twitter', 'twitter_android_user', 'twitter_apple_user',
       'tweeted_in_weekend', 'user_created_in_weekend',
       'possibly_sensitive_media', 'user_is_verified',
       'user_has_translation_enabled', 'user_has_default_profile',
       'user_has_default_profile_image', 'has_pinned_tweet',
       'user_has_custom_timeline', 'user_is_advertiser',
       'user_service_level_analytics', 'user_service_level_dso',
       'user_service_level_media_studio', 'user_service_level_mms',
       'user_service_level_reseller', 'user_service_level_smb',
       'user_service_level_subscri

In [3]:
# drop the index, free-text and numeric count columns
binary_data = data.drop(['Unnamed: 0', 'full_text', 'retweet_count', 'favorite_count', 'user_description', 'user_followers_count', 'user_normal_followers_count', 
                        'user_friends_count', 'user_listed_count', 'user_favourites_count', 'user_statuses_count', 'user_media_count', 'hashtags_count', 'username', 'text_length'], axis = 1)

In [4]:
binary_data.head()

Unnamed: 0,user_translator_type,user_profile_location,has_user_url,sent_via_twitter,twitter_android_user,twitter_apple_user,tweeted_in_weekend,user_created_in_weekend,possibly_sensitive_media,user_is_verified,...,user_creation_tweet_diff,tweeted_in_daypart_day,tweeted_in_daypart_evening,tweeted_in_daypart_morning,tweeted_in_daypart_night,user_created_in_daypart_day,user_created_in_daypart_evening,user_created_in_daypart_morning,user_created_in_daypart_night,real_fake_grade
0,0,0,1,1,0,0,1,1,0,1,...,16384932,0,1,0,0,0,0,1,0,1.0
1,0,0,1,0,0,0,0,0,0,0,...,293776787,1,0,0,0,0,0,1,0,-1.0
2,0,0,1,1,0,0,1,1,0,1,...,9039963,0,1,0,0,0,0,1,0,1.0
3,0,1,1,0,0,0,1,0,0,1,...,375950159,0,0,0,1,0,1,0,0,1.0
4,0,0,1,1,0,0,1,1,1,1,...,2470004,0,1,0,0,0,0,1,0,1.0


In [5]:
binary_data.columns

Index(['user_translator_type', 'user_profile_location', 'has_user_url',
       'sent_via_twitter', 'twitter_android_user', 'twitter_apple_user',
       'tweeted_in_weekend', 'user_created_in_weekend',
       'possibly_sensitive_media', 'user_is_verified',
       'user_has_translation_enabled', 'user_has_default_profile',
       'user_has_default_profile_image', 'has_pinned_tweet',
       'user_has_custom_timeline', 'user_is_advertiser',
       'user_service_level_analytics', 'user_service_level_dso',
       'user_service_level_media_studio', 'user_service_level_mms',
       'user_service_level_reseller', 'user_service_level_smb',
       'user_service_level_subscription', 'tweet_contains_url',
       'tweet_contains_media', 'user_decscription_has_urls', 'is_quoted_tweet',
       'tweet_is_reply', 'part_of_thread', 'user_description_sentiment',
       'tweet_sentiment', 'user_creation_tweet_diff', 'tweeted_in_daypart_day',
       'tweeted_in_daypart_evening', 'tweeted_in_daypart_morning'

### T-test

H0: the mean real_fake_grade is equal for both groups (1 and 0) of the explanatory variable  
H1: the mean real_fake_grade differs between the two groups

significance level = 0.05

Because the sample size is larger than 30, the sample means can be treated as approximately normally distributed (central limit theorem). **The assumption that the variables are independent does not hold: the daypart variables (morning, day, evening, night), for example, depend strongly on each other. *How do we deal with that?***

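The intent below is to use Welch's t-test, which corrects for unequal variances. Note that `rp.ttest` defaults to `equal_variances=True`, so the calls as written run the pooled-variance independent t-test (hence the "Independent t-test" label in the output); passing `equal_variances=False` selects Welch's test.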
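For reference, a minimal sketch of how Welch's test differs from the pooled-variance test. It uses `scipy.stats.ttest_ind` instead of researchpy, and the two groups are synthetic stand-ins (an assumption for illustration, not data from this notebook).

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the real_fake_grade values of two groups
# (hypothetical data, not taken from the notebook's dataset).
rng = np.random.default_rng(0)
non_verified = rng.normal(loc=0.0, scale=1.0, size=500)
verified = rng.normal(loc=0.5, scale=2.0, size=80)

# equal_var=False selects Welch's t-test (no pooled variance).
welch = stats.ttest_ind(non_verified, verified, equal_var=False)
# equal_var=True is the classic pooled-variance (Student's) t-test.
pooled = stats.ttest_ind(non_verified, verified, equal_var=True)

print(f"Welch:  t = {welch.statistic:.4f}, p = {welch.pvalue:.4f}")
print(f"Pooled: t = {pooled.statistic:.4f}, p = {pooled.pvalue:.4f}")
```

With unequal group sizes and variances, as in this toy example, the two statistics differ, which is why the choice of test matters.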

##### Example for verified users

In [6]:
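# build the mask from binary_data itself instead of relying on data sharing its index
group1 = binary_data.loc[binary_data['user_is_verified'] == 0, 'real_fake_grade']
group2 = binary_data.loc[binary_data['user_is_verified'] == 1, 'real_fake_grade']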


In [7]:
import researchpy as rp

In [8]:
des, res = rp.ttest(group1 = group1, group2=group2, group1_name= 'non_verified', group2_name='verified') # returns two dataframes

In [9]:
res['results'][2]

-31.3507

In [10]:
results = {}
results['t'] = res.iloc[2,:]
results['p'] = res.iloc[3,:]
pd.DataFrame.from_dict(results)

Unnamed: 0,t,p
Independent t-test,t =,Two side test p value =
results,-31.3507,0.0


In [11]:
def t_toets(col):
    """Run researchpy's t-test on real_fake_grade for the rows where col is 0 versus 1."""

    # define groups
    group1 = binary_data.loc[binary_data[col] == 0, 'real_fake_grade']
    group2 = binary_data.loc[binary_data[col] == 1, 'real_fake_grade']

    print(col)
    # researchpy t-test; the default equal_variances=True gives the pooled-variance
    # test, pass equal_variances=False for Welch's correction
    des, res = rp.ttest(group1=group1, group2=group2)  # returns two dataframes

    t_waarde, p_waarde = res['results'][2], res['results'][3]

    return t_waarde, p_waarde

results = {'feature': [], 't_waarde': [], 'p_waarde': []}

# loop over all columns except the last one, real_fake_grade
for col in binary_data.columns[:-1]:
    t_waarde, p_waarde = t_toets(col)
    results['feature'].append(col)
    results['t_waarde'].append(t_waarde)
    results['p_waarde'].append(p_waarde)

results_df = pd.DataFrame(results)

user_translator_type
user_profile_location
has_user_url
sent_via_twitter
twitter_android_user
twitter_apple_user
tweeted_in_weekend
user_created_in_weekend
possibly_sensitive_media
user_is_verified
user_has_translation_enabled
user_has_default_profile
user_has_default_profile_image
has_pinned_tweet
user_has_custom_timeline
user_is_advertiser
user_service_level_analytics
user_service_level_dso
user_service_level_media_studio
user_service_level_mms
user_service_level_reseller
user_service_level_smb
user_service_level_subscription
tweet_contains_url
tweet_contains_media
user_decscription_has_urls
is_quoted_tweet
tweet_is_reply
part_of_thread
user_description_sentiment
tweet_sentiment
user_creation_tweet_diff
tweeted_in_daypart_day
tweeted_in_daypart_evening
tweeted_in_daypart_morning
tweeted_in_daypart_night
user_created_in_daypart_day
user_created_in_daypart_evening
user_created_in_daypart_morning
user_created_in_daypart_night


In [12]:
results_df = results_df.sort_values(by = 'p_waarde', ascending=False)
results_df = results_df.reset_index(drop=True)

In [13]:
results_df

Unnamed: 0,feature,t_waarde,p_waarde
0,user_description_sentiment,0.0973,0.9225
1,user_created_in_daypart_night,0.2596,0.7952
2,user_profile_location,0.3334,0.7388
3,twitter_apple_user,0.7089,0.4784
4,user_created_in_daypart_morning,0.8067,0.4199
5,user_translator_type,-0.8285,0.4074
6,is_quoted_tweet,-1.1158,0.2645
7,user_service_level_subscription,-1.208,0.2271
8,tweeted_in_daypart_night,-2.1894,0.0286
9,tweeted_in_weekend,2.9819,0.0029


For now I don't think this outcome tells us much, because the dataset contains strongly dependent variables such as morning and evening. That still needs some tinkering, after which this should be rerun.
Off we go :)
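One possible way to handle the mutually dependent daypart dummies is to fold them back into a single categorical variable and run one chi-square test of independence on it, rather than four separate tests. The sketch below does this on a hypothetical toy frame that only borrows the notebook's column names; `scipy.stats.chi2_contingency` is a swapped-in technique, not something the notebook itself uses.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy frame using the notebook's column names.
toy = pd.DataFrame({
    'tweeted_in_daypart_morning': [1, 0, 0, 0, 1, 0, 0, 0],
    'tweeted_in_daypart_day':     [0, 1, 0, 0, 0, 1, 0, 0],
    'tweeted_in_daypart_evening': [0, 0, 1, 0, 0, 0, 1, 0],
    'tweeted_in_daypart_night':   [0, 0, 0, 1, 0, 0, 0, 1],
    'real_fake_grade':            [1, 1, -1, -1, 1, 0, -1, -1],
})

dummy_cols = [c for c in toy.columns if c.startswith('tweeted_in_daypart_')]

# Exactly one dummy is 1 per row, so idxmax recovers the original category.
daypart = toy[dummy_cols].idxmax(axis=1).str.replace('tweeted_in_daypart_', '')

# One test for the whole daypart variable instead of four dependent tests.
table = pd.crosstab(daypart, toy['real_fake_grade'])
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p = {p_value:.3f}")
```

The toy counts are far too small for a meaningful p-value; the point is only the reshaping from dummies to one categorical column.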

### Numeric Variables

- Histograms: are the variables normally distributed?
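A rough sketch of how such a normality check could look for a count column. The data is a synthetic, heavily right-skewed stand-in (an assumption, loosely modelled on something like user_followers_count), and the log transform is only one possible remedy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical stand-in for a count column such as user_followers_count.
followers = rng.lognormal(mean=6.0, sigma=2.0, size=2000)

skews = {}
for name, values in [('raw', followers), ('log1p', np.log1p(followers))]:
    counts, _ = np.histogram(values, bins=10)   # text version of a histogram
    skews[name] = stats.skew(values)            # 0 for a symmetric distribution
    _, p = stats.normaltest(values)             # D'Agostino-Pearson normality test
    print(f"{name}: skew = {skews[name]:.2f}, normaltest p = {p:.4f}")
    print("  histogram counts:", counts.tolist())
```

A strongly positive skew on the raw values and a skew near zero after `np.log1p` would suggest that the variable is closer to log-normal than normal.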

In [14]:
from scipy import stats
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

In [15]:
data_chi2 = data[[  
                 'real_fake_grade',           
                 'tweet_contains_url',
                 'user_service_level_analytics',
                 'user_service_level_dso',
                 'user_service_level_mms',
                 'user_service_level_reseller',
                 'user_service_level_smb',
                 'user_service_level_subscription',                   
                 'tweeted_in_weekend', 
                 'user_created_in_weekend',  
                 'sent_via_twitter',           
                 'twitter_android_user',              
                 'twitter_apple_user',            
                 'possibly_sensitive_media',       
                 'user_is_verified',
                 'user_has_translation_enabled',          
                 'user_has_default_profile',
                 'user_has_default_profile_image',
                 'user_has_custom_timeline',
                 'is_quoted_tweet',
                 'has_pinned_tweet',
                 'user_profile_location',
                 'has_user_url',
                 'tweet_contains_media',
                 'tweet_is_reply',   
                 'part_of_thread',
                 'user_is_advertiser',
                 'tweeted_in_daypart_day',
                 'tweeted_in_daypart_evening',
                 'tweeted_in_daypart_morning',             
                 'tweeted_in_daypart_night',   
                 'user_created_in_daypart_day',      
                 'user_created_in_daypart_evening',
                 'user_created_in_daypart_morning',           
                 'user_created_in_daypart_night' 
                  ]].copy()  # copy so the label recoding below does not write into a view of data

In [16]:
# shift the labels from {-1, 0, 1} to {0, 1, 2}
data_chi2['real_fake_grade'] = data_chi2['real_fake_grade'] + 1

In [17]:
data_chi2['real_fake_grade'].value_counts()

2.0    4643
0.0    2354
1.0     908
Name: real_fake_grade, dtype: int64

In [18]:
X = data_chi2.drop(labels=['real_fake_grade'], axis=1)
y = data_chi2['real_fake_grade']

In [19]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=0)

In [20]:
chi2_selector = SelectKBest(score_func=chi2, k='all')
chi2_selector.fit_transform(X, y)

array([[0, 0, 0, ..., 0, 1, 0],
       [1, 1, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 1, 0],
       ...,
       [0, 1, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 1, 0]], dtype=int64)

In [21]:
features_score = pd.DataFrame(chi2_selector.scores_)
features_pvalue = pd.DataFrame(np.round(chi2_selector.pvalues_,4))
features = pd.DataFrame(X.columns)
feature_score = pd.concat([features,features_score,features_pvalue],axis=1)
# Assign the column name
feature_score.columns = ["Input_Features","Score","P_Value"]
print(feature_score.nlargest(46,columns="Score"))

                     Input_Features       Score  P_Value
13                 user_is_verified  679.309929   0.0000
24                   part_of_thread  500.657815   0.0000
2            user_service_level_dso  218.538391   0.0000
22             tweet_contains_media  188.910815   0.0000
3            user_service_level_mms  165.310468   0.0000
23                   tweet_is_reply  160.327743   0.0000
21                     has_user_url  155.745652   0.0000
14     user_has_translation_enabled  127.609097   0.0000
19                 has_pinned_tweet  114.072723   0.0000
11               twitter_apple_user   92.730322   0.0000
26           tweeted_in_daypart_day   81.877156   0.0000
32  user_created_in_daypart_morning   73.805841   0.0000
5            user_service_level_smb   66.014012   0.0000
16   user_has_default_profile_image   62.772541   0.0000
1      user_service_level_analytics   58.368827   0.0000
4       user_service_level_reseller   55.029115   0.0000
17         user_has_custom_time

The chi2 test returns two values for each column: the P_Value and the chi2 score. We already have a p-value from the t-test, but the chi2 score measures how strongly each column depends on real_fake_grade.  
The higher the score, the more dependent the column is.  
This means that columns with a low score (weak dependence) probably do not contribute to the model.
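To illustrate how the score behaves, here is a small sketch with two synthetic binary features: one constructed to depend on the class label and one drawn independently of it. The data and variable names are hypothetical; only `sklearn.feature_selection.chi2` is the same scoring function used above.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 3, size=n)  # three classes, like the recoded real_fake_grade

# Feature that depends on the class: mostly 1 when y == 2, otherwise 0.
dependent = ((y == 2) & (rng.random(n) < 0.9)).astype(int)
# Feature drawn independently of y.
independent = rng.integers(0, 2, size=n)

X = np.column_stack([dependent, independent])
scores, p_values = chi2(X, y)

for name, score, p in zip(['dependent', 'independent'], scores, p_values):
    print(f"{name}: score = {score:.2f}, p = {p:.4f}")
```

The dependent feature receives a much higher score and a p-value close to zero, which matches the reading of the table above.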

### New csv files

Below, the subsets are loaded from the cleaned_data folder. The csv files were created in cleanup.ipynb.

In [22]:

all_data = pd.read_csv('../data/cleaned_data/all_columns.csv')
tweet_data = pd.read_csv('../data/cleaned_data/tweet_data_columns.csv')
user_data = pd.read_csv('../data/cleaned_data/user_data_columns.csv')

Columns with a p-value above 0.05 in the t-test are removed from the dataset, because there is insufficient evidence to reject the null hypothesis. We take this as an indication that these columns have little influence on the model's results. By removing them we hope to reduce noise and improve performance.
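The rule "p-value above 0.05" can also be applied as a threshold filter instead of a positional slice, which avoids counting rows by hand. The sketch below uses a small stand-in frame (`results_demo`, a hypothetical name) filled with a few of the values from the results table above.

```python
import pandas as pd

ALPHA = 0.05  # significance level used in this notebook

# Small stand-in for results_df, using values shown in the results table.
results_demo = pd.DataFrame({
    'feature': ['user_description_sentiment', 'is_quoted_tweet',
                'user_service_level_subscription', 'tweeted_in_daypart_night',
                'tweeted_in_weekend'],
    'p_waarde': [0.9225, 0.2645, 0.2271, 0.0286, 0.0029],
})

# Keep every feature whose p-value exceeds the significance level.
drop_candidates = results_demo.loc[results_demo['p_waarde'] > ALPHA, 'feature'].tolist()
print(drop_candidates)
```

Under this rule user_service_level_subscription (p = 0.2271) is selected as well, since its p-value is also above 0.05.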

In [23]:
t_test_drop = list(results_df[0:7]['feature'])

In [24]:
list(results_df[0:7]['feature'])

['user_description_sentiment',
 'user_created_in_daypart_night',
 'user_profile_location',
 'twitter_apple_user',
 'user_created_in_daypart_morning',
 'user_translator_type',
 'is_quoted_tweet']

The chi^2 test shows two variables with a low score (weak dependence on real_fake_grade) combined with a high p-value: *user_created_in_daypart_night* and *user_profile_location*. Both are already among the variables with a high p-value in the t-test, so they do not need to be removed separately.

The correlation matrix shows that some variables correlate very strongly with each other, which could also introduce noise. The correlation matrix can be found in cleanup.ipynb. From every pair of variables with a mutual correlation coefficient greater than 0.8, one variable is removed.
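The list in the next cell was picked by hand from the correlation matrix. As a sketch of how the same rule could be automated, the hypothetical helper `correlated_columns_to_drop` below scans the upper triangle of the absolute correlation matrix and returns the second column of every pair above the threshold. The demo data is synthetic and only borrows column names from the dataset.

```python
import numpy as np
import pandas as pd

def correlated_columns_to_drop(df, threshold=0.8):
    """Return one column from every pair whose |Pearson r| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(0)
followers = rng.normal(size=200)
demo = pd.DataFrame({
    'user_followers_count': followers,
    # Nearly identical to the column above, so strongly correlated with it.
    'user_normal_followers_count': followers + rng.normal(scale=0.05, size=200),
    'text_length': rng.normal(size=200),
})
print(correlated_columns_to_drop(demo))
```

Which member of a pair is dropped here depends purely on column order; a manual choice, as in this notebook, can take domain knowledge into account instead.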

In [25]:
corr_drop = ['user_normal_followers_count', 'user_listed_count', 'favorite_count', 'user_service_level_mms']

In [26]:
all_data_selected = all_data.drop(t_test_drop,  axis=1)

In [27]:
all_data_selected = all_data_selected.drop(corr_drop,  axis=1)
all_data_selected = all_data_selected.drop('Unnamed: 0',  axis=1)

In [28]:
print('shape before feature selection:', all_data.shape)
print('shape after feature selection:', all_data_selected.shape)

# so 12 columns were removed: 11 metadata columns and 1 'Unnamed: 0'

shape before feature selection: (7905, 56)
shape after feature selection: (7905, 44)


In total, 11 metadata columns were removed from all_data.

In [29]:
tweet_drop = ['favorite_count', 'twitter_apple_user', 'is_quoted_tweet']
user_drop = ['user_created_in_daypart_night',
                'user_profile_location',
                'user_created_in_daypart_morning',
                'user_translator_type',
                'user_service_level_subscription',
                'user_listed_count', 
                'user_service_level_mms', 
                'user_normal_followers_count']

In [30]:
tweet_data_selected = tweet_data.drop(tweet_drop, axis = 1)
user_data_selected = user_data.drop(user_drop, axis = 1)

In [31]:
print('tweet metadata shape before feature selection:', tweet_data.shape)
print('tweet metadata shape after feature selection:', tweet_data_selected.shape)

tweet metadata shape before feature selection: (7905, 22)
tweet metadata shape after feature selection: (7905, 19)


In [32]:
print('user metadata shape before feature selection:', user_data.shape)
print('user metadata shape after feature selection:', user_data_selected.shape)

user metadata shape before feature selection: (7905, 33)
user metadata shape after feature selection: (7905, 25)


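The tweet and user subsets lose 3 and 8 columns respectively, 11 in total. These are not exactly the same 11 columns that were removed from all_data: user_drop contains user_service_level_subscription (p = 0.2271 in the t-test), which the slice `results_df[0:7]` leaves in all_data, while user_description_sentiment is removed from all_data but appears in neither tweet_drop nor user_drop.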

### Save to csv files

In [33]:
directory = './selected_data'

In [34]:
# all data selected
to_csv(all_data_selected, directory, 'all_data_selected.csv')

Succesfully saved data to ./selected_data/all_data_selected.csv


In [35]:
# tweet data selected
to_csv(tweet_data_selected, directory, 'tweet_data_selected.csv')

Succesfully saved data to ./selected_data/tweet_data_selected.csv


In [36]:
# user data selected
to_csv(user_data_selected, directory, 'user_data_selected.csv')

Succesfully saved data to ./selected_data/user_data_selected.csv
