# Twitter users gender classification

Schloesing Benjamin, Yao Yuan, Ramet Gaétan

## Introduction

The objective of this project is to find features that help determine a Twitter user's gender using machine learning.

## Step 1 : Import data

The dataset we will use is the [Twitter User Gender Classification](https://www.kaggle.com/crowdflower/twitter-user-gender-classification) dataset made available by [Crowdflower](https://www.crowdflower.com/). This dataset contains roughly 20,000 entries, each of them a tweet from a different user, along with many associated features, which are listed here:

* **_unit_id** : a unique id for each user
* **_golden** : a boolean which states whether the user is included in the golden standard for the model
* **_unit_state** : the state of the observation, either *golden* for gold standards or *finalized* for contributor-judged
* **_trusted_judgments** : the number of judgments on a user's gender. 3 for non-golden, or a unique id for golden
* **_last_judgment_at** : date and time of the last judgment, blank for golden observations
* **gender** : either *male*, *female* or *brand* for non-human profiles
* **gender:confidence** : a float representing the confidence of the gender judgment
* **profile_yn** : either *yes* or *no*, *no* meaning that the user's profile was not available when contributors went to judge it
* **profile_yn:confidence** : confidence in the existence/non-existence of the profile
* **created** : date and time of when the profile was created
* **description** : the user's Twitter profile description
* **fav_number** : the number of tweets the user has favorited
* **gender_gold** : the gender if the profile is golden
* **link_color** : the link color of the profile as a hex value
* **name** : the Twitter user's name
* **profile_yn_gold** : *yes* or *no* whether the profile y/n value is golden
* **profileimage** : a link to the profile image
* **retweet_count** : the number of times the user has retweeted something
* **sidebar_color** : color of the profile sidebar as a hex value
* **text** : text of a random tweet from the user
* **tweet_coord** : if the location was available at the time of the tweet, the coordinates as a string with the format [latitude, longitude]
* **tweet_count** : the number of tweets posted by the user
* **tweet_created** : the time of the random tweet in **text**
* **tweet_id** : the tweet id of the random tweet
* **tweet_location** : the location of the tweet, based on the coordinates
* **user_timezone** : the timezone of the user

Most of these features are not relevant to our analysis; we will focus on only a few of them.
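
As a minimal sketch of what focusing on a few features could look like, the snippet below keeps three columns of a small made-up DataFrame. The toy values and the names `toy` and `kept_columns` are assumptions for illustration; the column names themselves come from the feature list above.

```python
import pandas as pd

# Tiny stand-in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame({
    '_unit_id': [1, 2, 3],
    'gender': ['male', 'female', 'brand'],
    'gender:confidence': [1.0, 0.66, 1.0],
    'profileimage': ['a.jpeg', 'b.jpeg', 'c.jpeg'],
    'tweet_count': [10, 20, 30],
})

# Keep only the columns of interest (hypothetical selection for this sketch).
kept_columns = ['gender', 'gender:confidence', 'profileimage']
subset = toy[kept_columns]
print(subset.shape)
```

Selecting with a list of column names returns a new DataFrame with the columns in the order given, which keeps later cells independent of the original column layout.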

In [132]:
import pandas as pd
import numpy as np

# Read with latin-1: the files contain special characters (é, ...) whose bytes are not valid UTF-8
raw_data = pd.read_csv('gender-classifier-DFE-791531.csv', encoding='latin-1')
new_data = pd.read_csv('new_data.csv', encoding='latin-1')
# Show a sample of the dataset
new_data.head()

| Unnamed: 0.1 | Unnamed: 0 | gender | gender:confidence | profileimage | pic_text |
|---|---|---|---|---|---|
| 0 | 0 | male | 1.0 | https://pbs.twimg.com/profile_images/414342229... | music people fun fashion |
| 1 | 1 | male | 1.0 | https://pbs.twimg.com/profile_images/539604221... | man people portrait one |
| 2 | 2 | male | 0.6625 | https://pbs.twimg.com/profile_images/657330418... | |
| 3 | 3 | male | 1.0 | https://pbs.twimg.com/profile_images/259703936... | |
| 4 | 4 | female | 1.0 | https://pbs.twimg.com/profile_images/564094871... | man people portrait two |
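
To illustrate why the files are read with `encoding='latin-1'`, here is a small standard-library sketch. It assumes the CSV stores accented characters as single Latin-1 bytes; the sample string is made up for the demonstration.

```python
# 'é' is the single byte 0xE9 in Latin-1, and that byte on its own is not valid UTF-8.
raw_bytes = 'Gaétan'.encode('latin-1')

try:
    raw_bytes.decode('utf-8')
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Decoding with the encoding the bytes were written in recovers the text.
decoded_text = raw_bytes.decode('latin-1')
print(decoded_as_utf8, decoded_text)
```

The same mismatch is what makes a default UTF-8 read fail on such a file, while a Latin-1 read succeeds.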


In [112]:
#import imghdr 
#import requests 
#import io
#import urllib.request as ur
#from clarifai.rest import ClarifaiApp
#app = ClarifaiApp()


## I use the Clarifai API to extract tags from the profile images. The for-loop from 0 to 20049 takes about 12 hours
## and cannot easily be repeated, so all of the code is commented out to avoid running it again.

#raw_data['pic_text']=' '
#for i in range(20048,20049): 
    
#    url=raw_data['profileimage'][i]
#    index=url.rfind('.')
#    url_new=url[:index-7]+url[index:]
#    if '.gif' not in url_new:
#        if 'pb.com'not in url_new:
#            response = requests.get(url_new)
#            if(response.status_code != 404):
#                result=app.tag_urls([url_new])
#                for k in range(0,4):
#                    if(k==0):
#                        raw_data['pic_text'][i]=result['outputs'][0]['data']['concepts'][k]['name']
#                    else:
#                        raw_data['pic_text'][i]=raw_data['pic_text'][i] + ' ' + result['outputs'][0]['data']['concepts'][k]['name']
##                print(i)
##            print(raw_data['pic_text'][i])    



#new_data=raw_data
#df = pd.DataFrame(new_data, columns = ['gender', 'gender:confidence', 'profileimage', 'pic_text'])
#df.to_csv('new_data.csv')
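
The two string manipulations in the commented-out loop can be sketched without calling any external service. In the sketch below, the helper names, the example URL, and the mock response are all hypothetical; the mock only mirrors the nested keys used above (`outputs`, `data`, `concepts`, `name`). We also assume that the 7 characters removed before the extension are a size suffix such as `_normal`, which the original code does not state.

```python
def strip_size_suffix(url, suffix_length=7):
    """Drop the `suffix_length` characters that precede the file extension."""
    dot = url.rfind('.')
    return url[:dot - suffix_length] + url[dot:]


def join_top_concepts(result, k=4):
    """Join the names of the first `k` concepts into one space-separated string."""
    concepts = result['outputs'][0]['data']['concepts'][:k]
    return ' '.join(concept['name'] for concept in concepts)


# Hypothetical inputs, for illustration only.
example_url = 'https://example.com/profile_images/123/avatar_normal.jpeg'
mock_result = {'outputs': [{'data': {'concepts': [
    {'name': 'man'}, {'name': 'people'}, {'name': 'portrait'},
    {'name': 'one'}, {'name': 'adult'},
]}}]}

full_size_url = strip_size_suffix(example_url)
pic_text = join_top_concepts(mock_result)
print(full_size_url)
print(pic_text)
```

Building the string with `' '.join(...)` avoids the repeated `raw_data['pic_text'][i] = ...` assignments, so each row would be written only once.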

In [170]:
# Create a separate dataset for each gender
male_data = new_data[(new_data['gender']=='male')&(new_data['gender:confidence']==1)&(new_data['pic_text']!=' ')]
female_data = new_data[(new_data['gender']=='female')&(new_data['gender:confidence']==1)&(new_data['pic_text']!=' ')]
brand_data = new_data[(new_data['gender']=='brand')&(new_data['gender:confidence']==1)&(new_data['pic_text']!=' ')]
genderConf = new_data[(new_data['gender:confidence']==1)&(new_data['gender']!='unknown')&(new_data['pic_text']!=' ')]
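
The filters above compare `pic_text` against a single space. A slightly more defensive variant, sketched here on made-up data, also drops missing values and strings made only of whitespace. The toy DataFrame and the mask names are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Made-up rows covering the cases the filter has to handle.
toy = pd.DataFrame({
    'gender': ['male', 'female', 'brand', 'unknown', 'male'],
    'gender:confidence': [1.0, 1.0, 0.7, 1.0, 1.0],
    'pic_text': ['man people', ' ', 'logo text', 'cat', None],
})

has_tags = toy['pic_text'].fillna('').str.strip() != ''  # non-empty tag string
confident = toy['gender:confidence'] == 1                # unanimous judgment
known = toy['gender'] != 'unknown'                       # labeled gender

selected = toy[has_tags & confident & known]
print(selected)
```

Naming each mask makes it easier to reuse the same conditions for the male, female, and brand subsets.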


In [164]:
#Exploration of which words are most used by which gender
from sklearn.feature_extraction.text import CountVectorizer
from IPython.display import display

def compute_bag_of_words(text):
    vectorizer = CountVectorizer()
    vectors = vectorizer.fit_transform(text)
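    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
    vocabulary = vectorizer.get_feature_names_out()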
    return vectors, vocabulary

def print_most_frequent(bow, vocab, n=20):
    idx = np.argsort(bow.sum(axis=0))
    for i in range(1,n+1):
        j = idx[0, -i]
        print(vocab[j])

male_bow, male_voc = compute_bag_of_words(male_data['pic_text'])
#print_most_frequent(male_bow, male_voc)
#print('----------------------------')

female_bow, female_voc = compute_bag_of_words(female_data['pic_text'])
#print_most_frequent(female_bow, female_voc)
#print('----------------------------')

brand_bow, brand_voc = compute_bag_of_words(brand_data['pic_text'])
#print_most_frequent(brand_bow, brand_voc)
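
To make the bag-of-words step concrete, here is a small sketch on three made-up tag strings. The data and variable names are hypothetical; the vocabulary is read from the fitted `vocabulary_` attribute, which maps each word to its column index.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up image tags, one string per profile.
tags = ['man people portrait', 'people fun', 'people portrait']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(tags)  # sparse matrix: rows = profiles, columns = words

# Words ordered by their column index in `counts`.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Total count of each word over all profiles, most frequent first.
totals = np.asarray(counts.sum(axis=0)).ravel()
order = np.argsort(totals)[::-1]
ranked = [(vocab[j], int(totals[j])) for j in order]
print(ranked[:2])
```

Flattening the column sums into a 1-D array first keeps the indexing simple compared with indexing into a `1 x V` matrix.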

In [179]:
from sklearn.preprocessing import LabelEncoder

full_bow, full_voc = compute_bag_of_words(genderConf['pic_text'])
X = full_bow
y = LabelEncoder().fit_transform(genderConf['gender'])
# Encoder : 2 = male, 1 = female, 0 = brand
# Create Training and testing sets.
n,d = X.shape
test_size = n // 5

print('Split: {} testing and {} training samples'.format(test_size, y.size - test_size))
perm = np.random.permutation(y.size)
X_test  = X[perm[:test_size]]
X_train = X[perm[test_size:]]
y_test  = y[perm[:test_size]]
y_train = y[perm[test_size:]]

from sklearn import linear_model, metrics

def model_test(model,X_train,y_train,X_test,y_test):
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)

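    # Note: the MSE is computed on the integer class codes, so it depends on the label ordering
    mse = metrics.mean_squared_error(y_test,y_pred)
    print('mse: {:.4f}'.format(mse))

    # One row of weights per class (0 = brand, 1 = female, 2 = male)
    W = model.coef_
    # score() returns the mean accuracy on the test set
    print('score: ', model.score(X_test,y_test))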

    print('Best 20 male predictors and anti-predictors:')
    idx_male = np.argsort(abs(W[2,:]))
    for i in range(20):
        j = idx_male[-1-i]
        print('weight: {:5.2f}, word: {}'.format(W[2,j], full_voc[j]))
        
    print('Best 20 female predictors and anti-predictors:')
    idx_female = np.argsort(abs(W[1,:]))
    for i in range(20):
        j = idx_female[-1-i]
        print('weight: {:5.2f}, word: {}'.format(W[1,j], full_voc[j]))
        
    print('Best 20 brand predictors and anti-predictors:')
    idx_brand = np.argsort(abs(W[0,:]))
    for i in range(20):
        j = idx_brand[-1-i]
        print('weight: {:5.2f}, word: {}'.format(W[0,j], full_voc[j]))

model = linear_model.RidgeClassifier()
print('Testing Ridge Classifier model:')
model_test(model,X_train,y_train,X_test,y_test)

Split: 1055 testing and 4223 training samples
Testing Ridge Classifier model:
mse: 0.3488
score:  0.844549763033
Best 20 male predictors and anti-predictors:
weight:  1.06, word: amusing
weight:  1.01, word: pop
weight:  0.92, word: metallic
weight:  0.92, word: theater
weight:  0.91, word: real
weight:  0.90, word: performance
weight:  0.90, word: cardboard
weight:  0.88, word: identify
weight:  0.85, word: viral
weight: -0.85, word: cooking
weight:  0.84, word: adventure
weight:  0.83, word: vase
weight:  0.81, word: bookstore
weight:  0.80, word: stadium
weight:  0.79, word: man
weight:  0.79, word: angry
weight:  0.78, word: tiger
weight: -0.78, word: seat
weight:  0.77, word: texture
weight: -0.76, word: actress
Best 20 female predictors and anti-predictors:
weight:  1.25, word: artistic
weight: -1.16, word: body
weight:  1.16, word: cooking
weight:  1.08, word: actress
weight: -1.07, word: lips
weight:  1.06, word: guitar
weight:  0.94, word: calligraphy
weight: -0.93, word: pop
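
The split in the cell above uses an unseeded random permutation, so the numbers change from run to run. One possible alternative, sketched here on synthetic data, is scikit-learn's `train_test_split` with a fixed `random_state` and stratification on the labels. The toy arrays and the seed values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bag-of-words matrix and the encoded labels.
rng = np.random.RandomState(0)
X_toy = rng.randint(0, 3, size=(50, 6))
y_toy = np.repeat([0, 1, 2], [10, 15, 25])  # 0 = brand, 1 = female, 2 = male

# Hold out 20% for testing; stratify keeps the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))
```

With a fixed seed the reported scores become repeatable, and stratification avoids a test set in which one class is under-represented by chance.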


In [180]:
model = linear_model.PassiveAggressiveClassifier()
print('Testing Passive Aggressive classifier model:')
model_test(model,X_train,y_train,X_test,y_test)

Testing Passive Aggressive classifier model:
mse: 0.4540
score:  0.81327014218
Best 20 male predictors and anti-predictors:
weight: -2.85, word: woman
weight: -2.19, word: silhouette
weight:  2.08, word: pop
weight: -2.07, word: cooking
weight: -2.02, word: fall
weight: -2.02, word: mask
weight: -2.02, word: little
weight: -1.96, word: romance
weight:  1.96, word: theater
weight: -1.95, word: internet
weight:  1.95, word: metallic
weight:  1.94, word: award
weight: -1.88, word: inside
weight:  1.84, word: bookstore
weight: -1.83, word: pattern
weight:  1.81, word: river
weight: -1.79, word: fantasy
weight: -1.75, word: actress
weight: -1.74, word: blond
weight:  1.73, word: texture
Best 20 female predictors and anti-predictors:
weight:  2.79, word: cooking
weight: -2.74, word: body
weight:  2.74, word: artistic
weight: -2.60, word: lips
weight: -2.45, word: lingerie
weight: -2.15, word: partnership
weight:  2.13, word: squad
weight:  2.03, word: binoculars
weight: -1.99, word: pop
weig
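
The `mse` values printed above are computed on the integer class codes (0 = brand, 1 = female, 2 = male), so they depend on how the classes happen to be numbered. The sketch below, on made-up predictions, shows two prediction vectors with the same accuracy but different MSE, and a confusion matrix as one alternative summary. All arrays are hypothetical.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([2, 2, 1, 0, 1, 2])
y_pred_a = np.array([1, 2, 1, 0, 1, 2])  # one male predicted as female
y_pred_b = np.array([0, 2, 1, 0, 1, 2])  # one male predicted as brand

# Both vectors contain exactly one mistake, so the accuracy is identical.
acc_a = metrics.accuracy_score(y_true, y_pred_a)
acc_b = metrics.accuracy_score(y_true, y_pred_b)

# The MSE penalizes the second mistake four times as much (2 - 0 squared vs 2 - 1 squared).
mse_a = metrics.mean_squared_error(y_true, y_pred_a)
mse_b = metrics.mean_squared_error(y_true, y_pred_b)

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true, y_pred_b)
print(acc_a, acc_b, mse_a, mse_b)
print(cm)
```

For comparing the classifiers, the `score` line (mean accuracy) is therefore the more interpretable of the two printed numbers.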

In [181]:
model = linear_model.SGDClassifier()
print('Testing SGD classifier model:')
model_test(model,X_train,y_train,X_test,y_test)

Testing SGD classifier model:
mse: 0.3810
score:  0.829383886256
Best 20 male predictors and anti-predictors:
weight: -4.07, word: woman
weight:  3.17, word: texture
weight:  3.17, word: amusing
weight: -3.17, word: actress
weight: -3.17, word: fun
weight: -3.17, word: girl
weight:  2.71, word: outfit
weight:  2.71, word: pop
weight:  2.71, word: performance
weight: -2.71, word: blond
weight: -2.71, word: pet
weight: -2.71, word: glamour
weight: -2.26, word: rabbit
weight: -2.26, word: steel
weight: -2.26, word: mountain
weight:  2.26, word: real
weight: -2.26, word: basketball
weight: -2.26, word: mask
weight: -2.26, word: cooking
weight:  2.26, word: vase
Best 20 female predictors and anti-predictors:
weight: -4.97, word: lips
weight:  4.07, word: actress
weight:  3.62, word: brunette
weight:  3.17, word: artistic
weight: -3.17, word: body
weight: -2.71, word: club
weight:  2.71, word: guitar
weight: -2.71, word: pop
weight:  2.71, word: blond
weight:  2.71, word: cooking
weight: -2.
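
The three near-identical loops in `model_test` rank words by the absolute value of their weight. As a sketch, the same ranking can be written once as a helper and applied to each row of the coefficient matrix. The function name, the toy weights, and the toy vocabulary below are hypothetical.

```python
import numpy as np

def top_weighted_words(weights, vocab, n=3):
    """Return the n (word, weight) pairs with the largest absolute weight."""
    order = np.argsort(np.abs(weights))[::-1][:n]
    return [(vocab[j], float(weights[j])) for j in order]


# Made-up coefficients: one row per class, one column per word.
W_toy = np.array([
    [0.1, -2.0, 0.5, 1.5],
    [1.2, 0.3, -0.9, 0.0],
])
vocab_toy = ['actress', 'man', 'pop', 'woman']

for row in W_toy:
    print(top_weighted_words(row, vocab_toy, n=2))
```

Sorting on the absolute value keeps both strong predictors (positive weights) and strong anti-predictors (negative weights) in the same list, as in the outputs above.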

In [182]:
model = linear_model.LogisticRegression()
print('Testing Logistic Regression model:')
model_test(model,X_train,y_train,X_test,y_test)

Testing Logistic Regression model:
mse: 0.3223
score:  0.854028436019
Best 20 male predictors and anti-predictors:
weight: -2.44, word: woman
weight:  2.37, word: man
weight: -2.21, word: girl
weight: -1.89, word: fun
weight: -1.89, word: actress
weight: -1.37, word: internet
weight:  1.32, word: race
weight:  1.31, word: texture
weight: -1.30, word: glamour
weight:  1.28, word: pop
weight:  1.28, word: performance
weight: -1.27, word: shining
weight:  1.26, word: sketch
weight:  1.25, word: bill
weight: -1.23, word: disjunct
weight: -1.19, word: architecture
weight: -1.16, word: silhouette
weight:  1.16, word: amusing
weight:  1.10, word: stadium
weight:  1.09, word: insubstantial
Best 20 female predictors and anti-predictors:
weight:  2.65, word: actress
weight:  2.55, word: woman
weight: -2.43, word: man
weight: -2.12, word: lips
weight: -1.68, word: text
weight:  1.66, word: girl
weight:  1.46, word: glamour
weight:  1.46, word: brunette
weight: -1.44, word: body
weight:  1.41, wor