# Twitter users gender classification

Schloesing Benjamin, Yao Yuan, Ramet Gaétan

## Introduction

The objective of this project is to find features which can help to determine a Twitter user's gender using machine learning.

## Step 1 : Import data

The dataset we will use is the [Twitter User Gender Classification](https://www.kaggle.com/crowdflower/twitter-user-gender-classification) dataset made available by [Crowdflower](https://www.crowdflower.com/). This dataset contains 20000 entries, each of which is a tweet from a different user, along with many associated features, listed here:

* **_unit_id** : a unique id for each user
* **_golden** : a boolean which states whether the user is included in the golden standard for the model
* **_unit_state** : the state of the observation, either *golden* for gold standards or *finalized* for contributor-judged
* **_trusted_judgments** : the number of judgments on a user's gender. 3 for non-golden, or a unique id for golden
* **_last_judgment_at** : date and time of the last judgment, blank for golden observations
* **gender** : either *male*, *female* or *brand* for non-human profiles
* **gender:confidence** : a float representing the confidence of the gender judgment
* **profile_yn** : either *yes* or *no*, *no* meaning that the user's profile was not available when contributors went to judge it
* **profile_yn:confidence** : confidence in the existence/non-existence of the profile
* **created** : date and time of when the profile was created
* **description** : the user's Twitter profile description
* **fav_number** : the number of tweets the user has favorited
* **gender_gold** : the gender if the profile is golden
* **link_color** : the link color of the profile as a hex value
* **name** : the Twitter user's name
* **profile_yn_gold** : *yes* or *no* whether the profile y/n value is golden
* **profileimage** : a link to the profile image
* **retweet_count** : the number of times the user has retweeted something
* **sidebar_color** : color of the profile sidebar as a hex value
* **text** : text of a random tweet from the user
* **tweet_coord** : if the location was available at the time of the tweet, the coordinates as a string with the format [latitude, longitude]
* **tweet_count** : the number of tweets posted by the user
* **tweet_created** : the time of the random tweet in **text**
* **tweet_id** : the tweet id of the random tweet
* **tweet_location** : the location of the tweet, based on the coordinates
* **user_timezone** : the timezone of the user

Most of these features are not relevant for our analysis; we will focus on only a few of them.
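As an illustration of narrowing the data to the fields of interest, here is a minimal sketch that loads only the columns the rest of this notebook actually touches (`gender`, `gender:confidence`, `description`, `text`). It uses the `usecols` parameter of `pd.read_csv`; the in-memory CSV and its two rows are made up for the example and stand in for the real file.

```python
import io
import pandas as pd

# Toy CSV standing in for the real file; the rows are invented for illustration.
toy_csv = io.StringIO(
    "_unit_id,gender,gender:confidence,description,text,tweet_count\n"
    "1,male,1.0,father and engineer,Looking at the url,1693\n"
    "2,female,0.66,,Watching tv tonight,31462\n"
)

# Columns used later in this notebook.
wanted = ['gender', 'gender:confidence', 'description', 'text']
subset = pd.read_csv(toy_csv, usecols=wanted)

print(subset.columns.tolist())
# Empty CSV fields are read as NaN, so missing descriptions are easy to count.
print(subset['description'].isna().sum())
```

The notebook itself loads every column, which is harmless for a file of this size; restricting the columns mainly keeps `head()` output readable.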

In [10]:
import pandas as pd
import numpy as np

# we read the file as latin-1 because it contains special characters (é,...) that the default UTF-8 decoder rejects
dataFrame = pd.read_csv('gender-classifier-DFE-791531.csv', encoding='latin-1')

#Show a sample of the dataset
dataFrame.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,gender,gender:confidence,profile_yn,profile_yn:confidence,created,...,profileimage,retweet_count,sidebar_color,text,tweet_coord,tweet_count,tweet_created,tweet_id,tweet_location,user_timezone
0,815719226,False,finalized,3,10/26/15 23:24,male,1.0,yes,1.0,12/5/13 1:48,...,https://pbs.twimg.com/profile_images/414342229...,0,FFFFFF,Robbie E Responds To Critics After Win Against...,,110964,10/26/15 12:40,6.5873e+17,main; @Kan1shk3,Chennai
1,815719227,False,finalized,3,10/26/15 23:30,male,1.0,yes,1.0,10/1/12 13:51,...,https://pbs.twimg.com/profile_images/539604221...,0,C0DEED,ÛÏIt felt like they were my friends and I was...,,7471,10/26/15 12:40,6.5873e+17,,Eastern Time (US & Canada)
2,815719228,False,finalized,3,10/26/15 23:33,male,0.6625,yes,1.0,11/28/14 11:30,...,https://pbs.twimg.com/profile_images/657330418...,1,C0DEED,i absolutely adore when louis starts the songs...,,5617,10/26/15 12:40,6.5873e+17,clcncl,Belgrade
3,815719229,False,finalized,3,10/26/15 23:10,male,1.0,yes,1.0,6/11/09 22:39,...,https://pbs.twimg.com/profile_images/259703936...,0,C0DEED,Hi @JordanSpieth - Looking at the url - do you...,,1693,10/26/15 12:40,6.5873e+17,"Palo Alto, CA",Pacific Time (US & Canada)
4,815719230,False,finalized,3,10/27/15 1:15,female,1.0,yes,1.0,4/16/14 13:23,...,https://pbs.twimg.com/profile_images/564094871...,0,0,Watching Neighbours on Sky+ catching up with t...,,31462,10/26/15 12:40,6.5873e+17,,


In [23]:
dataFrame.loc[:1,['name']]
# Normalize text in the descriptions and tweet messages
import re

def text_normalizer(s):
    # normalize the text: cast to string, lowercase, and strip punctuation that touches whitespace
    s = str(s)
    s = s.lower()
    s = re.sub(r'\W\s', ' ', s)
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\s+', ' ', s)  # collapse runs of whitespace into a single space

    return s
dataFrame['text_norm'] = [text_normalizer(s) for s in dataFrame['text']]
dataFrame['description_norm'] = [text_normalizer(s) for s in dataFrame['description']]

# Extract separate genders dataframes
male_data = dataFrame[(dataFrame['gender']=='male')&(dataFrame['gender:confidence']==1)]
female_data = dataFrame[(dataFrame['gender']=='female')&(dataFrame['gender:confidence']==1)]
brand_data = dataFrame[(dataFrame['gender']=='brand')&(dataFrame['gender:confidence']==1)]
male_data.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,gender,gender:confidence,profile_yn,profile_yn:confidence,created,...,sidebar_color,text,tweet_coord,tweet_count,tweet_created,tweet_id,tweet_location,user_timezone,text_norm,description_norm
0,815719226,False,finalized,3,10/26/15 23:24,male,1.0,yes,1.0,12/5/13 1:48,...,FFFFFF,Robbie E Responds To Critics After Win Against...,,110964,10/26/15 12:40,6.5873e+17,main; @Kan1shk3,Chennai,robbie e responds to critics after win against...,i sing my own rhythm.
1,815719227,False,finalized,3,10/26/15 23:30,male,1.0,yes,1.0,10/1/12 13:51,...,C0DEED,ÛÏIt felt like they were my friends and I was...,,7471,10/26/15 12:40,6.5873e+17,,Eastern Time (US & Canada),ûïit felt like they were my friends and i was...,i'm the author of novels filled with family dr...
3,815719229,False,finalized,3,10/26/15 23:10,male,1.0,yes,1.0,6/11/09 22:39,...,C0DEED,Hi @JordanSpieth - Looking at the url - do you...,,1693,10/26/15 12:40,6.5873e+17,"Palo Alto, CA",Pacific Time (US & Canada),hi jordanspieth looking at the url do you use ...,mobile guy 49ers shazam google kleiner perkins...
7,815719233,False,finalized,3,10/26/15 23:48,male,1.0,yes,1.0,12/3/12 21:54,...,C0DEED,Gala Bingo clubs bought for å£241m: The UK's l...,,112117,10/26/15 12:40,6.5873e+17,,,gala bingo clubs bought for å£241m the uk's la...,the secret of getting ahead is getting started.
17,815719243,False,finalized,3,10/26/15 22:50,male,1.0,yes,1.0,10/18/09 11:41,...,C0DEED,@coolyazzy94 Ditto - I'm still learning the fa...,,91,10/26/15 12:40,6.5873e+17,Glasgow,London,@coolyazzy94 ditto i'm still learning the favo...,over enthusiastic f1 fan model collector music...
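To make the behaviour of the normalizer concrete, the sketch below reruns the same three substitutions on a few strings. It shows two things visible in the table above: punctuation is only removed where it touches whitespace (so a trailing period or an apostrophe inside a word survives), and a missing value is turned into the literal string `nan` by `str()`, which is why `nan` shows up among the most frequent description words in the next cell. `text_normalizer_safe` is a hypothetical variant, not part of the notebook, that maps missing values to an empty string instead.

```python
import math
import re

def text_normalizer(s):
    # Same steps as the notebook's normalizer.
    s = str(s).lower()
    s = re.sub(r'\W\s', ' ', s)
    s = re.sub(r'\s\W', ' ', s)
    return re.sub(r'\s+', ' ', s)

def text_normalizer_safe(s):
    # Hypothetical variant: missing values (None or NaN) become an empty string.
    if s is None or (isinstance(s, float) and math.isnan(s)):
        return ''
    return text_normalizer(s)

print(text_normalizer("Hi @JordanSpieth - Looking at the url"))
# hi jordanspieth looking at the url
print(text_normalizer("I sing my own rhythm."))
# i sing my own rhythm.
print(repr(text_normalizer(float('nan'))))
# 'nan'
print(repr(text_normalizer_safe(float('nan'))))
# ''
```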


In [98]:
#Exploration of which words are most used by which gender
from sklearn.feature_extraction.text import CountVectorizer
from IPython.display import display

def compute_bag_of_words(text):
    vectorizer = CountVectorizer()
    vectors = vectorizer.fit_transform(text)
    vocabulary = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
    return vectors, vocabulary

def print_most_frequent(bow, vocab, n=20):
    idx = np.argsort(bow.sum(axis=0))
    for i in range(1,n+1):
        j = idx[0, -i]
        print(vocab[j])

male_bow, male_voc = compute_bag_of_words(male_data['description_norm'])

print_most_frequent(male_bow, male_voc)
#nothing special about these words really
print('---')
female_bow, female_voc = compute_bag_of_words(female_data['description_norm'])

print_most_frequent(female_bow, female_voc)
#nothing special about these words really
print('---')

brand_bow, brand_voc = compute_bag_of_words(brand_data['description_norm'])

print_most_frequent(brand_bow, brand_voc)
#nothing special about these words really

and
the
of
to
my
in
for
nan
you
co
is
êû
on
me
at
it
love
http
life
all
---
and
the
of
nan
to
my
you
in
is
me
for
love
with
it
life
on
co
be
are
all
---
the
and
for
to
of
nan
in
co
news
http
is
on
we
from
your
you
all
with
our
at
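The three lists are dominated by function words (`and`, `the`, `of`, ...). One common way to look past them, sketched here as an assumption rather than something the notebook does, is to pass `stop_words='english'` to `CountVectorizer` so that scikit-learn's built-in English stop-word list is dropped before counting. The three documents are made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration.
docs = [
    "i love the life of a father and engineer",
    "the news of the day and the week",
    "love and life for all of you",
]

vectorizer = CountVectorizer(stop_words='english')
bow = vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = np.asarray(bow.sum(axis=0)).ravel()

# Print the five most frequent remaining words.
for j in np.argsort(counts)[::-1][:5]:
    print(vocab[j], counts[j])
```

The built-in list is generic English; tokens specific to this dataset such as `http`, `co` or `nan` would still need to be filtered separately.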


In [145]:
#Looking at the most used words per gender doesn't yield anything particular since we all use the same common words,
#so let's try to find predictors

#first let's put all the interesting text in one string for each tweet
dataFrame['all_text'] =dataFrame['text_norm'].str.cat(dataFrame['description_norm'],sep=' ')
dataFrameConf = dataFrame[(dataFrame['gender:confidence']==1)&(dataFrame['gender']!='unknown')]

from sklearn.preprocessing import LabelEncoder

full_bow, full_voc = compute_bag_of_words(dataFrameConf['all_text'])
X = full_bow
y = LabelEncoder().fit_transform(dataFrameConf['gender'])
# Encoder : 2 = male, 1 = female, 0 = brand

# Create Training and testing sets.
n,d = X.shape
test_size = n // 5
print('Split: {} testing and {} training samples'.format(test_size, y.size - test_size))
perm = np.random.permutation(y.size)
X_test  = X[perm[:test_size]]
X_train = X[perm[test_size:]]
y_test  = y[perm[:test_size]]
y_train = y[perm[test_size:]]

# Linear classification models
from sklearn import linear_model, metrics

def model_test(model,X_train,y_train,X_test,y_test):
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)

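    # note: the MSE is computed on the integer class codes, so it depends on the arbitrary label encoding
    mse = metrics.mean_squared_error(y_test,y_pred)
    print('mse: {:.4f}'.format(mse))

    # one row of weights per class, in the order of the encoded labels
    W = model.coef_
    # score() returns the mean accuracy on the test set
    print('score: ', model.score(X_test,y_test))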

    print('Best 20 male predictors and anti-predictors:')
    idx_male = np.argsort(abs(W[2,:]))
    for i in range(20):
        j = idx_male[-1-i]
        print('weight: {:5.2f}, word: {}'.format(W[2,j], full_voc[j]))
        
    print('Best 20 female predictors and anti-predictors:')
    idx_female = np.argsort(abs(W[1,:]))
    for i in range(20):
        j = idx_female[-1-i]
        print('weight: {:5.2f}, word: {}'.format(W[1,j], full_voc[j]))
        
    print('Best 20 brand predictors and anti-predictors:')
    idx_brand = np.argsort(abs(W[0,:]))
    for i in range(20):
        j = idx_brand[-1-i]
        print('weight: {:5.2f}, word: {}'.format(W[0,j], full_voc[j]))

model = linear_model.RidgeClassifier()
print('Testing Ridge Classifier model:')
model_test(model,X_train,y_train,X_test,y_test)


Split: 2760 testing and 11044 training samples
Testing Ridge Classifier model:
mse: 0.4808
score:  0.678985507246
Best 20 male predictors and anti-predictors:
weight:  0.49, word: father
weight:  0.36, word: director
weight:  0.35, word: man
weight:  0.35, word: guy
weight:  0.34, word: photographer
weight: -0.33, word: mom
weight:  0.33, word: dad
weight:  0.33, word: journalist
weight:  0.32, word: producer
weight:  0.31, word: niggas
weight:  0.30, word: engineer
weight: -0.30, word: girl
weight:  0.29, word: husband
weight: -0.26, word: feminist
weight:  0.25, word: ceo
weight:  0.25, word: fan
weight:  0.25, word: political
weight:  0.25, word: season
weight: -0.24, word: mum
weight:  0.24, word: actor
Best 20 female predictors and anti-predictors:
weight:  0.42, word: mom
weight:  0.41, word: girl
weight:  0.38, word: mother
weight:  0.35, word: lover
weight:  0.34, word: mum
weight:  0.34, word: 17
weight: -0.33, word: father
weight:  0.32, word: feminist
weight:  0.30, w
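The manual permutation split and the MSE in this cell could also be expressed with scikit-learn helpers. The sketch below is an illustration under stated assumptions, not the authors' code: the 15-document corpus is made up, `train_test_split` is called with `stratify` and a fixed `random_state` so that each class is represented in the test set and the split is reproducible, and the evaluation uses accuracy and a confusion matrix, which, unlike the MSE, do not depend on which integer each class happens to receive.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy corpus, invented for illustration: five short texts per class.
texts = [
    "father engineer dad", "dad guy director", "father guy man",
    "engineer man husband", "director dad father",
    "mom girl mother", "mum girl feminist", "mother mom lover",
    "girl mum mom", "feminist mother girl",
    "news official updates", "official news weekly", "our news updates",
    "weekly official news", "updates from our news",
]
labels = ['male'] * 5 + ['female'] * 5 + ['brand'] * 5

encoder = LabelEncoder()
y = encoder.fit_transform(labels)
print(list(encoder.classes_))  # classes are sorted, so brand=0, female=1, male=2

X = CountVectorizer().fit_transform(texts)

# Stratified, reproducible 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = RidgeClassifier().fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)             # same quantity as model.score(...)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print('accuracy:', acc)
print(cm)  # rows: true class, columns: predicted class
```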

In [149]:
model = linear_model.PassiveAggressiveClassifier()
print('Testing Passive Aggressive classifier model:')
model_test(model,X_train,y_train,X_test,y_test)

Testing Passive Aggressive classifier model:
mse: 0.5181
score:  0.665579710145
Best 20 male predictors and anti-predictors:
weight:  1.32, word: father
weight:  0.93, word: engineer
weight:  0.92, word: smoking
weight:  0.91, word: a6geg73buc
weight:  0.91, word: retire
weight:  0.89, word: goat
weight:  0.88, word: loyal
weight:  0.87, word: king
weight:  0.86, word: director
weight:  0.85, word: dad
weight:  0.84, word: niggas
weight:  0.84, word: smoke
weight:  0.82, word: steal
weight:  0.81, word: bread
weight: -0.81, word: feminist
weight: -0.81, word: _ùªä
weight:  0.80, word: marketer
weight:  0.79, word: photographer
weight:  0.79, word: mate
weight:  0.77, word: nickiminaj
Best 20 female predictors and anti-predictors:
weight: -1.14, word: father
weight: -0.98, word: loyal
weight: -0.96, word: schaeavery
weight: -0.96, word: smoke
weight: -0.95, word: king
weight: -0.95, word: bread
weight: -0.94, word: mate
weight:  0.90, word: adam
weight:  0.90, word: mum
weight:  0.88, w
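The "predictors and anti-predictors" lists are obtained by sorting one row of `model.coef_` by absolute value: a large positive weight pushes toward the class, a large negative weight pushes away from it. The three near-identical loops in `model_test` could be folded into one helper, sketched below. `top_predictors` is a hypothetical name; the toy weights reuse three values from the first model's output, and the weight for `the` is made up.

```python
import numpy as np

def top_predictors(weights, vocab, n=20):
    """Return the n (weight, word) pairs with the largest absolute weight."""
    weights = np.asarray(weights)
    order = np.argsort(np.abs(weights))[::-1][:n]
    return [(float(weights[j]), vocab[j]) for j in order]

# Toy example: one weight row for the 'male' class.
vocab = ['father', 'mom', 'the', 'director']
male_row = np.array([0.49, -0.33, 0.05, 0.36])

result = top_predictors(male_row, vocab, n=3)
for weight, word in result:
    print('weight: {:5.2f}, word: {}'.format(weight, word))
```

With a fitted model, the call would look like `top_predictors(model.coef_[2], full_voc)` for the class encoded as 2.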

In [147]:
model = linear_model.SGDClassifier()
print('Testing SGD classifier model:')
model_test(model,X_train,y_train,X_test,y_test)

Testing SGD classifier model:
mse: 0.5801
score:  0.647101449275
Best 20 male predictors and anti-predictors:
weight: -5.69, word: kath
weight:  4.27, word: father
weight:  3.74, word: marketer
weight:  3.20, word: niggas
weight:  3.20, word: photographer
weight:  3.20, word: engineer
weight:  3.02, word: journalist
weight: -3.02, word: beauty
weight: -2.85, word: queen
weight:  2.85, word: nickiminaj
weight:  2.85, word: dad
weight:  2.85, word: actor
weight: -2.85, word: girl
weight: -2.67, word: mum
weight:  2.67, word: comedian
weight:  2.67, word: producer
weight:  2.67, word: director
weight:  2.49, word: cars
weight:  2.49, word: drummer
weight:  2.49, word: pre
Best 20 female predictors and anti-predictors:
weight:  5.69, word: kath
weight: -3.38, word: father
weight: -3.20, word: êû
weight:  3.20, word: mom
weight:  3.02, word: mommy
weight:  2.85, word: communication
weight:  2.85, word: mum
weight: -2.85, word: dad
weight: -2.85, word: official
weight:  2.67, word: differ
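The three scores above (roughly 0.68, 0.67 and 0.65) come from a single random split, so small differences between them may be noise. A hedged sketch of a steadier comparison is to run the same three classifiers through `cross_val_score` and compare mean accuracies. The corpus below is made up, and the fixed `random_state` values are an assumption added for reproducibility.

```python
from sklearn import linear_model
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

# Toy corpus, invented for illustration: five short texts per class.
texts = [
    "father engineer dad", "dad guy director", "father guy man",
    "engineer man husband", "director dad father",
    "mom girl mother", "mum girl feminist", "mother mom lover",
    "girl mum mom", "feminist mother girl",
    "news official updates", "official news weekly", "our news updates",
    "weekly official news", "updates from our news",
]
labels = ['male'] * 5 + ['female'] * 5 + ['brand'] * 5

X = CountVectorizer().fit_transform(texts)

models = {
    'Ridge': linear_model.RidgeClassifier(),
    'Passive Aggressive': linear_model.PassiveAggressiveClassifier(random_state=0),
    'SGD': linear_model.SGDClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # For classifiers, cv=5 uses stratified folds and accuracy by default.
    fold_scores = cross_val_score(model, X, labels, cv=5)
    scores[name] = float(fold_scores.mean())
    print('{}: mean accuracy {:.3f}'.format(name, scores[name]))
```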