# Twitter users gender classification

Ramet Gaétan, Schloesing Benjamin, Yao Yuan

## Introduction

The objective of this project is to find features which can help to determine a Twitter user's gender using machine learning.

# Step 1: Import data

The dataset we will use is the [Twitter User Gender Classification](https://www.kaggle.com/crowdflower/twitter-user-gender-classification) dataset made available by [Crowdflower](https://www.crowdflower.com/). This dataset contains 20000 entries, each of them being a tweet from a different user, along with many other associated features, which are listed here:

* **_unit_id** : a unique id for each user
* **_golden** : a boolean which states whether the user is included in the golden standard for the model
* **_unit_state** : the state of the observation, either *golden* for gold standards or *finalized* for contributor-judged
* **_trusted_judgments** : the number of judgments on a user's gender. 3 for non-golden, or a unique id for golden
* **_last_judgment_at** : date and time of the last judgment, blank for golden observations
* **gender** : either *male*, *female* or *brand* for non-human profiles
* **gender:confidence** : a float representing the confidence of the gender judgment
* **profile_yn** : either *yes* or *no*, *no* meaning that the user's profile was not available when contributors went to judge it
* **profile_yn:confidence** : confidence in the existence/non-existence of the profile
* **created** : date and time of when the profile was created
* **description** : the user's Twitter profile description
* **fav_number** : the number of tweets the user has favorited
* **gender_gold** : the gender if the profile is golden
* **link_color** : the link color of the profile as a hex value
* **name** : the Twitter user's name
* **profile_yn_gold** : *yes* or *no* whether the profile y/n value is golden
* **profileimage** : a link to the profile image
* **retweet_count** : the number of times the user has retweeted something
* **sidebar_color** : color of the profile sidebar as a hex value
* **text** : text of a random tweet from the user
* **tweet_coord** : if the location was available at the time of the tweet, the coordinates as a string with the format [latitude, longitude]
* **tweet_count** : the number of tweets posted by the user
* **tweet_created** : the time of the random tweet in **text**
* **tweet_id** : the tweet id of the random tweet
* **tweet_location** : the location of the tweet, based on the coordinates
* **user_timezone** : the timezone of the user

Most of these features are not relevant to our analysis. We will focus on only a few of them: the sidebar and link colors, the text of the descriptions and tweets, and finally the content of the profile pictures.

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display

#graph
from bokeh.plotting import output_notebook, figure, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource

%matplotlib notebook 
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
from mpl_toolkits.axes_grid1 import make_axes_locatable
from scipy import ndimage

from matplotlib import pyplot as plt
# 3D visualization
import pylab
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import pyplot

from collections import Counter


from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from IPython.display import display
from sklearn import linear_model, metrics


# we need latin-1 encoding because the file contains special characters (é,...) that cannot be decoded as UTF-8, the pandas default
dataFrame = pd.read_csv('gender-classifier-DFE-791531.csv', encoding='latin-1')

#Show a sample of the dataset
dataFrame.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,gender,gender:confidence,profile_yn,profile_yn:confidence,created,...,profileimage,retweet_count,sidebar_color,text,tweet_coord,tweet_count,tweet_created,tweet_id,tweet_location,user_timezone
0,815719226,False,finalized,3,10/26/15 23:24,male,1.0,yes,1.0,12/5/13 1:48,...,https://pbs.twimg.com/profile_images/414342229...,0,FFFFFF,Robbie E Responds To Critics After Win Against...,,110964,10/26/15 12:40,6.5873e+17,main; @Kan1shk3,Chennai
1,815719227,False,finalized,3,10/26/15 23:30,male,1.0,yes,1.0,10/1/12 13:51,...,https://pbs.twimg.com/profile_images/539604221...,0,C0DEED,ÛÏIt felt like they were my friends and I was...,,7471,10/26/15 12:40,6.5873e+17,,Eastern Time (US & Canada)
2,815719228,False,finalized,3,10/26/15 23:33,male,0.6625,yes,1.0,11/28/14 11:30,...,https://pbs.twimg.com/profile_images/657330418...,1,C0DEED,i absolutely adore when louis starts the songs...,,5617,10/26/15 12:40,6.5873e+17,clcncl,Belgrade
3,815719229,False,finalized,3,10/26/15 23:10,male,1.0,yes,1.0,6/11/09 22:39,...,https://pbs.twimg.com/profile_images/259703936...,0,C0DEED,Hi @JordanSpieth - Looking at the url - do you...,,1693,10/26/15 12:40,6.5873e+17,"Palo Alto, CA",Pacific Time (US & Canada)
4,815719230,False,finalized,3,10/27/15 1:15,female,1.0,yes,1.0,4/16/14 13:23,...,https://pbs.twimg.com/profile_images/564094871...,0,0,Watching Neighbours on Sky+ catching up with t...,,31462,10/26/15 12:40,6.5873e+17,,


# Step 2: Data exploration




## Color features exploration

The first features we are going to use for our analysis are the **link_color** and **sidebar_color**. On Twitter, it is possible to personalize your account by changing the colors of the links or the sidebars, and we expect people of different genders to personalize their page differently. For example, we can expect females to use more "girly" colors such as pink or purple, while men would keep it more "manly", with some blue maybe.

In [2]:
#Definition of function for data exploration for the colors
#feature : 'sidebar_color', 'link_color'
# The colorGraphs function plots the most used colors by gender in 3 bar graphs
def colorsGraphs(df, feature, genderConfidence = 1, nbToRemove = 1):

    dfCol = df.loc[:,['gender:confidence', 'gender', feature]]
    # Remove weird values stored in scientific notation (E+17...)
    dfColFiltered = dfCol[(dfCol['gender:confidence'] >= genderConfidence)&((dfCol[feature]).str.contains(r'E\+') != True)]

    nCommon = 30
    bar_width = 0.5
    # One bar graph per gender, in the same order as before: females, males, brands
    for gender, groupName in [('female', 'Females'), ('male', 'Males'), ('brand', 'Brands')]:
        colors = dfColFiltered[dfColFiltered['gender'] == gender][feature]

        # Most common colors, without the nbToRemove most used ones (standard themes)
        common = Counter(list(colors.values.flatten())).most_common(nCommon)
        del common[0:nbToRemove]

        colorNames = [x[0] for x in common]
        colorCounts = [x[1] for x in common]

        # Build '#RRGGBB' strings, padding short codes with zeros
        colorsHex = [('#' + x + '000000')[0:7] for x in colorNames]
        rangeCol = list(range(len(colorNames)))

        fig, ax = plt.subplots()
        plt.barh(rangeCol, colorCounts, bar_width, label = groupName, color = colorsHex)
        plt.yticks(rangeCol, colorsHex)
        # With barh, the x axis holds the counts and the y axis the colors
        plt.xlabel('Number of users')
        plt.ylabel(feature)
        plt.title('Most used colors by ' + groupName)
        plt.tight_layout()
        plt.show()


We wrote the **colorsGraphs** function to extract and plot the colors most used for sidebars and links by each gender. Since a color is not easy to deduce from its hex code, we plot each bar in the color it represents, which makes the graphs easier to read. The first thing we can notice is that most users do not personalize their page much and keep one of the standard Twitter themes, regardless of their gender. To better visualize how the personalization differs, we removed these most used themes from the bar graphs (the `nbToRemove` argument).
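As a rough illustration of the counting step behind these graphs, the sketch below tallies hex codes and drops the most frequent entries, much like `nbToRemove` does. The sample values and the `to_hex` and `top_colors` helpers are hypothetical. In particular, left-padding short codes with zeros assumes that all-digit codes such as `0` lost their leading zeros when the CSV was exported, whereas `colorsGraphs` pads on the right.

```python
from collections import Counter

def to_hex(code):
    """Turn a raw color value such as 'C0DEED' or '0' into '#RRGGBB'.

    Assumption: short codes lost their leading zeros, so we pad on the left.
    """
    return '#' + str(code).upper().zfill(6)[-6:]

def top_colors(codes, n_common=30, n_skip=1):
    """Return the n_common most frequent colors, minus the n_skip most common ones."""
    counts = Counter(to_hex(c) for c in codes)
    return counts.most_common(n_common)[n_skip:]

# Hypothetical sample in which one default theme color dominates.
sample = ['C0DEED', 'C0DEED', 'C0DEED', 'FF3399', 'FF3399', '0', '9999']
print(top_colors(sample, n_skip=1))
# [('#FF3399', 2), ('#000000', 1), ('#009999', 1)]
```

Skipping the dominant entries keeps the remaining bars on a readable scale, which is the reason the default themes are removed from the graphs.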

In [3]:
#Data Exploration Colors
colorsGraphs(dataFrame, 'sidebar_color', 1, 4)
colorsGraphs(dataFrame, 'link_color', 1, 1)

[Figures: six horizontal bar charts showing the most used colors by females, males and brands, first for sidebar_color, then for link_color]

From the graphs, we can draw the following conclusions:
* First, it seems that users tend to change their link color more than their sidebar color
* Female users do indeed have a preference for purple, pink and red, while male users tend to use more green and blue. Brands usually have their pages in blue or green as well

These are only intuitions confirmed by the data, but if we want to predict the gender using the colors, we need a prediction model. 

## Text features exploration

Now that we have seen how users personalize the colors of their pages, let's have a deeper look at what they actually write on Twitter. Here, we will explore both the text from the users' descriptions and the text from the tweets themselves. As these two texts lie in different cells of the dataframe, we first need to process them a bit. The first thing we wanted to do was to normalize the text by removing separators such as commas and converting everything to lowercase letters. To do so, we wrote the **text_normalizer** function. We then grouped the description and tweet texts together. Finally, we wrote the **compute_bag_of_words** and **print_most_frequent** functions to visualize which words are most used by which gender.

In [4]:
#Data Exploration - Text
# Normalize text in the descriptions and tweet messages
import re

def text_normalizer(s):
    # Normalize the text: cast to string, lowercase, and strip punctuation adjacent to whitespace
    s = str(s)
    s = s.lower()
    s = re.sub(r'\W\s',' ',s)
    s = re.sub(r'\s\W',' ',s)
    #s = re.sub(r'\s[^[@\w]]',' ',s) #to keep the @ symbols used for "addressing"
    #s = re.sub('@',' search_arobass_sign ',s) #The CountVectorizer cant handle the @
    s = re.sub(r'\s+',' ',s) # collapse runs of whitespace into a single space
    
    return s

# Add columns containing the normalized texts to the dataframe
dataFrameText = dataFrame
dataFrameText['text_norm'] = [text_normalizer(s) for s in dataFrameText['text']]
dataFrameText['description_norm'] = [text_normalizer(s) for s in dataFrameText['description']]

# Now let's put all the interesting text, i.e. the description and the tweet itself, in one string for each tweet
dataFrameText['all_text'] =dataFrameText['text_norm'].str.cat(dataFrameText['description_norm'],sep=' ')
dataFrameText = dataFrameText[(dataFrameText['gender:confidence']==1)&(dataFrameText['gender']!='unknown')]

# Extract separate genders dataframes
male_data = dataFrameText[dataFrameText['gender']=='male']
female_data = dataFrameText[dataFrameText['gender']=='female']
brand_data = dataFrameText[dataFrameText['gender']=='brand']
male_data.head()

# The compute_bag_of_words function returns a table with the # of occurence of a word in the text
# and a vocabulary of all the different words
def compute_bag_of_words(text):
    vectorizer = CountVectorizer()
    vectors = vectorizer.fit_transform(text)
    vocabulary = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
    return vectors, vocabulary


#Exploration of which words are most used by which gender
def print_most_frequent(bow, vocab, gender, n=20):
    color_idx = ['brand', 'female', 'male']
    color_table = ['#4a913c', '#f5abb5', '#0084b4']
    label_table = ['Most used words by brands', 'Most used words by females', 'Most used words by Males']
    # Total number of occurrences of each word, as a flat array
    counts = np.asarray(bow.sum(axis=0)).ravel()
    # Integer indices of the n most frequent words, most frequent first
    # (integer indices avoid the non-integer index warning of the previous version)
    idx_most_used = np.argsort(counts)[::-1][:n]
    words_most_used = [vocab[j] for j in idx_most_used]
    occurence_number = counts[idx_most_used]

    fig, ax = plt.subplots()
    
    bar_width = 0.5
    word_number = np.arange(n)+1
    rects1 = plt.barh(word_number,occurence_number, bar_width, label = label_table[color_idx.index(gender)], color = color_table[color_idx.index(gender)])
    plt.yticks(word_number,words_most_used)
    plt.ylabel('Most used words')
    plt.xlabel('Number of occurences')
    plt.title(label_table[color_idx.index(gender)])
    plt.tight_layout()
    plt.show()
        
male_bow, male_voc = compute_bag_of_words(male_data['all_text'])
print_most_frequent(male_bow, male_voc, 'male')

female_bow, female_voc = compute_bag_of_words(female_data['all_text'])
print_most_frequent(female_bow, female_voc, 'female')

brand_bow, brand_voc = compute_bag_of_words(brand_data['all_text'])
print_most_frequent(brand_bow, brand_voc, 'brand')
#nothing special about these words really

[Figures: three horizontal bar charts showing the most used words by males, females and brands]

The results are not quite as conclusive as with the colors. In fact, the most used words, regardless of the gender, are very simple words such as "the", "and", "to" or "of", which do not give us any information about the gender. One interesting thing we noticed is that brands tend to use the words "weather", "channel" and "news" more than regular male and female users. This probably means that our dataset contains many news or weather channel accounts. Another interesting fact concerns the usage of the word "https": it seems that brands tend to post more links than standard users.
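Because function words dominate the raw counts, one possible refinement (not part of the original notebook) is to ignore a stop-word list before counting. The sketch below uses a tiny hand-written stop list and made-up sentences purely for illustration; `most_frequent_words` is a hypothetical helper. With scikit-learn, the `stop_words='english'` option of `CountVectorizer` serves a similar purpose.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; a real one would be much longer).
STOP_WORDS = {'the', 'and', 'to', 'of', 'a', 'in', 'is', 'for', 'on'}

def most_frequent_words(texts, n=2, stop_words=STOP_WORDS):
    """Count lowercase word tokens across all texts, ignoring stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r'\w+', str(text).lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

# Made-up examples, not taken from the dataset.
texts = [
    'The weather channel: news and weather for the city',
    'Latest news on the weather',
]
print(most_frequent_words(texts))
# [('weather', 3), ('news', 2)]
```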

## Profile picture features exploration

Using the profile picture information is a bit more difficult than using simple text or color codes. The first thing we need to do is to extract the content of each picture. To do so, we used the [Clarifai API](https://www.clarifai.com/api). However, as the process takes very long to run on the whole dataFrame (approximately 12 hours), we do not recommend running the code. Instead, we saved the resulting picture-content keywords in a new dataFrame, which we will use in the rest of the analysis.

Now that we have the content of the profile pictures as text, we can run the same data exploration process as earlier, and see which content is most used by which gender.

In [5]:
#Data Exploration - Pictures

#import imghdr 
#import requests 
#import io
#import urllib.request as ur
#from clarifai.rest import ClarifaiApp
#app = ClarifaiApp()


## We used the Clarifai API to extract the image content, which takes about 12 hours.
## Thus, we commented out all of this code to avoid running it again.

#raw_data['pic_text']=' '
#for i in range(20048,20049): 
    
#    url=raw_data['profileimage'][i]
#    index=url.rfind('.')
#    url_new=url[:index-7]+url[index:]
#    if '.gif' not in url_new:
#        if 'pb.com'not in url_new:
#            response = requests.get(url_new)
#            if(response.status_code != 404):
#                result=app.tag_urls([url_new])
#                for k in range(0,4):
#                    if(k==0):
#                        raw_data['pic_text'][i]=result['outputs'][0]['data']['concepts'][k]['name']
#                    else:
#                        raw_data['pic_text'][i]=raw_data['pic_text'][i] + ' ' + result['outputs'][0]['data']['concepts'][k]['name']
##                print(i)
##            print(raw_data['pic_text'][i])    



#new_data=raw_data
#df = pd.DataFrame(new_data, columns = ['gender', 'gender:confidence', 'profileimage', 'pic_text'])
#df.to_csv('new_data.csv')

new_data=pd.read_csv('new_data.csv',encoding='latin-1')
#Show a sample of the dataset
new_data.head()

Unnamed: 0.1,Unnamed: 0,gender,gender:confidence,profileimage,pic_text
0,0,male,1.0,https://pbs.twimg.com/profile_images/414342229...,music people fun fashion
1,1,male,1.0,https://pbs.twimg.com/profile_images/539604221...,man people portrait one
2,2,male,0.6625,https://pbs.twimg.com/profile_images/657330418...,
3,3,male,1.0,https://pbs.twimg.com/profile_images/259703936...,
4,4,female,1.0,https://pbs.twimg.com/profile_images/564094871...,man people portrait two


In [6]:
#create separate gender dataFrames
male_data = new_data[(new_data['gender']=='male')&(new_data['gender:confidence']==1)&(new_data['pic_text']!=' ')]
female_data = new_data[(new_data['gender']=='female')&(new_data['gender:confidence']==1)&(new_data['pic_text']!=' ')]
brand_data = new_data[(new_data['gender']=='brand')&(new_data['gender:confidence']==1)&(new_data['pic_text']!=' ')]
genderConf = new_data[(new_data['gender:confidence']==1)&(new_data['gender']!='unknown')&(new_data['pic_text']!=' ')]

male_bow, male_voc = compute_bag_of_words(male_data['pic_text'])
print_most_frequent(male_bow, male_voc,'male')
#print('----------------------------')

female_bow, female_voc = compute_bag_of_words(female_data['pic_text'])
print_most_frequent(female_bow, female_voc,'female')
#print('----------------------------')

brand_bow, brand_voc = compute_bag_of_words(brand_data['pic_text'])
print_most_frequent(brand_bow, brand_voc,'brand')

[Figures: three horizontal bar charts showing the most used picture keywords by males, females and brands]

The results here are very interesting. First, we can notice that, as expected for a "profile picture", most male and female users use a portrait of a person, which is why the words "adult", "people", "portrait" and respectively "man" and "woman" are the most recurrent. On the brands' side, most of the pictures are tagged "symbol", "illustration" or "design", which means they are probably the logos of the brands themselves.

# Step 3: Prediction model

In [7]:
# Definition of functions for data analysis and classification

# The model_test function is used to extract the best word predictors and
# anti-predictors for each gender. The model used must have a coef_ attribute
# representing the weight of each word
def model_test(model,X_train,y_train,X_test,y_test, full_voc, displayResults = True, displayColors = False):
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)

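    # compute MSE on the encoded labels (for classification, the accuracy printed below is the more meaningful metric)
    mse = metrics.mean_squared_error(y_test,y_pred)
    print('mse: {:.4f}'.format(mse))

    # W contains the weight of each predictor, for each gender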
    W = model.coef_
    
    # Prints the accuracy of the gender prediction
    print('score: ', model.score(X_test,y_test))
    if(displayResults):
        # Rows of W follow the alphabetical LabelEncoder order: 0 = brand, 1 = female, 2 = male
        genders = [(2, 'male', '#0084b4'), (1, 'female', '#f5abb5'), (0, 'brand', '#4a913c')]
        bar_width = 0.5
        pred_number = np.arange(20)+1
        for row, gender, genderColor in genders:
            order = np.argsort(W[row,:])  # indices sorted by ascending weight
            # Predictors: the 20 largest weights; anti-predictors: the 20 smallest (most negative) weights
            for kind, idx in [('predictors', order[::-1][:20]), ('anti-predictors', order[:20])]:
                title = 'Best 20 {} {}'.format(gender, kind)
                print(title + ':')
                weights = W[row,idx]
                labels = [full_voc[j] for j in idx]

                fig, ax = plt.subplots()
                if(displayColors):
                    # The labels are color codes: draw each bar in the color it represents
                    colorsHex = [('#' + x + '000000')[0:7] for x in labels]
                    plt.barh(pred_number, weights, bar_width, label = title, color = colorsHex)
                    plt.yticks(pred_number, colorsHex)
                else:
                    plt.barh(pred_number, weights, bar_width, label = title, color = genderColor)
                    plt.yticks(pred_number, labels)
                # With barh, the x axis holds the weights and the y axis the predictors
                plt.xlabel('Weight')
                plt.ylabel('Anti-predictor' if kind == 'anti-predictors' else 'Predictor')
                plt.title(title)
                plt.tight_layout()
                plt.show()
    
    return model

# feature is a string in order to use df[feature]

# The predictors function takes a dataframe, a specific feature (should be a string) and a model
# and performs the gender prediction. One fifth of the samples, chosen at random, is held out for testing
def predictors(df, feature, model, modelname, displayResults = True, displayColors = False):
    print('Testing', modelname, 'model for gender prediction using', feature)
    full_bow, full_voc = compute_bag_of_words(df[feature])
    X = full_bow
    y = LabelEncoder().fit_transform(df['gender'])
    # Create Training and testing sets.
    n,d = X.shape
    test_size = n // 5
    print('Split: {} testing and {} training samples'.format(test_size, y.size - test_size))
    perm = np.random.permutation(y.size)
    X_test  = X[perm[:test_size]]
    X_train = X[perm[test_size:]]
    y_test  = y[perm[:test_size]]
    y_train = y[perm[test_size:]]
    print('model: ', modelname)
    model = model_test(model,X_train,y_train,X_test,y_test, full_voc, displayResults = displayResults, displayColors = displayColors)
    
    return model, full_bow, full_voc


## Gender Prediction based on color features

We wrote the **predictors** function to extract the best predictors and anti-predictors of one specific feature for gender prediction. Here, we applied it to the **link_color**, using different linear models for the prediction. We chose linear models because they are simple, yet good enough to be effective, and have a nice implementation in the **sklearn** library.

More specifically, these models have an attribute called **coef_** which gives the weight of each word (here, the color hex codes) in the model. A high weight for a given gender means that a user who uses this word has a strong probability of being of that gender.
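To make the reading of these weights concrete, the sketch below ranks the entries of a small weight matrix in the same spirit as `model_test` ranks `coef_`. The vocabulary and the numbers are invented for illustration, and `top_predictors` is a hypothetical helper. Rows are assumed to follow the alphabetical class order produced by `LabelEncoder` (brand, female, male).

```python
def top_predictors(weights, vocabulary, k=2):
    """Return the k highest-weight and the k lowest-weight words for one class."""
    ranked = sorted(zip(weights, vocabulary), reverse=True)
    predictors = [word for _, word in ranked[:k]]
    anti_predictors = [word for _, word in ranked[::-1][:k]]
    return predictors, anti_predictors

# Invented toy values: one row of weights per class, one column per word.
vocabulary = ['news', 'mom', 'football', 'https']
W = [
    [0.9, -0.4, -0.2, 0.7],   # brand
    [-0.5, 0.8, -0.6, -0.1],  # female
    [-0.3, -0.7, 0.6, -0.2],  # male
]

for label, row in zip(['brand', 'female', 'male'], W):
    preds, antis = top_predictors(row, vocabulary, k=1)
    print(label, 'best predictor:', preds, 'best anti-predictor:', antis)
```

A strongly negative weight is as informative as a strongly positive one: it marks a word whose presence makes the class less likely.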

First, we performed the classification using the color features:

In [8]:
# Classifier colors

dataFrameColor = dataFrame.loc[:,['gender:confidence', 'gender', 'link_color']]
dataFrameColorFiltered = dataFrameColor[(dataFrameColor['gender:confidence'] == 1)&(dataFrameColor['link_color'].str.contains(r'E\+') != True)]

feature = 'link_color'
df = dataFrameColorFiltered

# List of the classifiers we tested
modelList = [linear_model.RidgeClassifier(), 
             linear_model.SGDClassifier(),
             linear_model.LogisticRegression(),
             linear_model.PassiveAggressiveClassifier(),
             ]
modelNamesList = ['Ridge Classifier', 
                  'SGD Classifier',
                  'Logistic regression',
                  'Passive Aggressive Classifier',
                  ]

# for i in range(0, len(modelList)):
for i in range(0,1):
    model = modelList[i]
    modelName = modelNamesList[i]
    predictors(df, feature, model, modelName, displayResults = True, displayColors=True)



Testing Ridge Classifier model for gender prediction using link_color
Split: 2780 testing and 11120 training samples
model:  Ridge Classifier
mse: 1.0978
score:  0.440647482014
Best 20 male predictors:
[Figure: bar chart of the best 20 male predictors]

Best 20 male anti-predictors:
[Figure: bar chart of the best 20 male anti-predictors]

Best 20 female predictors:
[Figure: bar chart of the best 20 female predictors]

Best 20 female anti-predictors:
[Figure: bar chart of the best 20 female anti-predictors]

Best 20 brand predictors:
[Figure: bar chart of the best 20 brand predictors]

Best 20 brand anti-predictors:
[Figure: bar chart of the best 20 brand anti-predictors]

From these bar graphs, we can see that the models confirm our intuitions. The strongest female color predictors are almost all pink, red or purple, and these colors are also quite strong anti-predictors for both males and brands. However, the linear models only achieve about 45% accuracy (0.44 for the Ridge classifier run shown above) when predicting the gender from the colors alone. Once again, this is mostly because the vast majority of users do not change their sidebar and link colors.
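A useful reference point when reading such an accuracy is the score of a trivial classifier that always predicts the most frequent class. The sketch below computes that baseline. The label counts are placeholders rather than the dataset's actual class distribution, and `majority_baseline` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_label, hits = counts.most_common(1)[0]
    return most_common_label, hits / len(labels)

# Placeholder labels (not the real class proportions).
labels = ['female'] * 4 + ['male'] * 3 + ['brand'] * 3
label, accuracy = majority_baseline(labels)
print(label, accuracy)  # female 0.4
```

Comparing a model's score with this baseline, computed on the real labels, would show how much the color features add beyond the class frequencies.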



## Gender prediction based on text features

Now, let's do the same and try to predict the user's gender using the text features:

In [14]:
# Classifier - Text
#Looking at the most used words per gender doesnt yield anything particular since we all use the same common words,
#so let's try to find predictors

feature = 'all_text'
df = dataFrameText

# for i in range(0, len(modelList)):
for i in range(0,1):
    model = modelList[i]
    modelName = modelNamesList[i]
    model_text, full_bow, full_voc = predictors(df, feature, model, modelName, displayResults = True)

Testing Ridge Classifier model for gender prediction using all_text
Split: 2760 testing and 11044 training samples
model:  Ridge Classifier
mse: 0.5025
score:  0.68115942029
Best 20 male predictors:
[Figure: bar chart of the best 20 male predictors]

Best 20 male anti-predictors:
[Figure: bar chart of the best 20 male anti-predictors]

Best 20 female predictors:
[Figure: bar chart of the best 20 female predictors]

Best 20 female anti-predictors:
[Figure: bar chart of the best 20 female anti-predictors]

Best 20 brand predictors:
[Figure: bar chart of the best 20 brand predictors]

Best 20 brand anti-predictors:
[Figure: bar chart of the best 20 brand anti-predictors]

Here, as the data is much more varied and meaningful than simple color codes, we manage to obtain a prediction accuracy of about 68% with the Ridge classifier. Strong predictors for male users are words such as "father", "boy", "man" or "niggas", while predictors for female users are "mom", "girl", "feminist" or "makeup". As expected, these words are also anti-predictors of the opposite gender.

The anti-predictors suggest that female users tweet less about sports ("player", "hit", "team", "season", "game"), while male users are less likely to tweet about girls ("girl", "mother", "queen").

For brands, our intuitions are confirmed: posting a link ("https") or tweeting about "news" and "weather" is typical of the brand "gender".
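With a linear classifier, every word gets one weight per class. The words with the largest positive weights act as predictors, and those with the most negative weights act as anti-predictors. The `predictors` helper is not shown in this section, so the following is only a sketch of that ranking step. The function name `top_predictors` and the weights are made up for illustration. With scikit-learn, the weights would come from the fitted model's `coef_` attribute.

```python
def top_predictors(coefficients, k=3):
    """Split a {word: weight} mapping into the k largest and k smallest weights."""
    ranked = sorted(coefficients.items(), key=lambda item: item[1], reverse=True)
    predictors = ranked[:k]              # largest positive weights first
    anti_predictors = ranked[-k:][::-1]  # most negative weights first
    return predictors, anti_predictors

# Made-up weights for one class, for illustration only
female_weights = {'mom': 0.9, 'girl': 0.8, 'makeup': 0.6, 'the': 0.0,
                  'game': -0.5, 'team': -0.7, 'player': -0.8}

best, worst = top_predictors(female_weights, k=2)
print(best)
print(worst)
```

Repeating this for each class would give the six rankings plotted above.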

Finally, some predictors for the female gender might look quite odd, "\_ù" or "ï_" for example. We think these are the remains of emojis whose Unicode encoding was garbled, but we did not manage to find which ones.
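The mechanism suspected here can be reproduced in a few lines: an emoji encoded as UTF-8 and then decoded with a single-byte codec turns into a run of accented characters. This is only a sketch of the hypothesis. The heart emoji and the `latin-1` codec are arbitrary choices for illustration and do not identify the characters actually present in the dataset.

```python
def misdecode(text, wrong_encoding='latin-1'):
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode('utf-8').decode(wrong_encoding)

heart = '\u2764\ufe0f'  # red heart followed by an emoji variation selector
garbled = misdecode(heart)
print(repr(garbled))  # contains 'ï', produced by the 0xEF byte of the selector

# The damage is reversible only as long as no byte was dropped or replaced
restored = garbled.encode('latin-1').decode('utf-8')
print(restored == heart)
```

If some bytes were replaced along the way, as the underscores in the odd predictors might suggest, the original character can no longer be recovered this way.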

## Gender prediction based on profile pictures features

Now, let's finally apply our classifiers using the profile picture contents:

In [10]:
# Classifier - Profile picture
# Same approach as for the text features, applied to the pic_text feature

feature = 'pic_text'
df = genderConf

# for i in range(0, len(modelList)):
for i in range(2,3):
    model = modelList[i]
    modelName = modelNamesList[i]
    predictors(df, feature, model, modelName, displayResults = True)

Testing Logistic regression model for gender prediction using pic_text
Split: 1055 testing and 4223 training samples
model:  Logistic regression
mse: 0.3223
score:  0.842654028436
Best 20 male predictors: [bar chart]
Best 20 male anti-predictors: [bar chart]
Best 20 female predictors: [bar chart]
Best 20 female anti-predictors: [bar chart]
Best 20 brand predictors: [bar chart]
Best 20 brand anti-predictors: [bar chart]

As the content of the profile picture is very representative of the user, the classifier manages to get about 84% of its predictions right, which is quite impressive. However, the predictors are not exactly what we expected. Although "man" and "woman" are among the best predictors for their respective gender, we expected them to carry far more weight in the prediction for most of the classifiers.

Also, it seems that we have many Twitter accounts belonging to "actresses", or maybe many female users use pictures of actresses as their profile pictures. Unsurprisingly, "bikini" is an anti-predictor for male users. What is more surprising, though, is that "lingerie", "leather" and "lips" are strong predictors for brands.

In [11]:
#Get profile pictures, to have it big just remove '_normal'
pd.options.display.max_colwidth = 100
print(dataFrame.loc[1, 'profileimage'])

from PIL import Image 
from io import BytesIO
import requests

url = dataFrame.loc[1, 'profileimage']
response = requests.get(url)
img = Image.open(BytesIO(response.content))
print(img.format)  # 'JPEG'

https://pbs.twimg.com/profile_images/539604221532700673/WW16tBbU_normal.jpeg
JPEG
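The comment in the cell above notes that removing `_normal` from the URL gives the larger picture. A minimal sketch of that step follows, assuming the suffix convention holds for every profile image. The helper name `full_size_url` is hypothetical.

```python
def full_size_url(thumbnail_url):
    """Drop the last '_normal' marker from a profile image URL.

    If the marker is absent, the URL is returned unchanged.
    """
    head, _, tail = thumbnail_url.rpartition('_normal')
    return head + tail

url = 'https://pbs.twimg.com/profile_images/539604221532700673/WW16tBbU_normal.jpeg'
print(full_size_url(url))
```

Only the last occurrence is removed, so a path that happened to contain `_normal` earlier on would be left intact.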


In [12]:
dataFrameGold = dataFrame[dataFrame['_golden']]

dataFrameGold.head()

| Unnamed: 0 | _unit_id | _golden | _unit_state | _trusted_judgments | _last_judgment_at | gender | gender:confidence | profile_yn | profile_yn:confidence | created | ... | text | tweet_coord | tweet_count | tweet_created | tweet_id | tweet_location | user_timezone | text_norm | description_norm | all_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20000 | 815746503 | True | golden | 249 | | male | 0.9612 | yes | 0.9612 | 8/5/10 8:31 | ... | Reimagining the #webdesign process by @InVisionApp - https://t.co/Vmb0OZU67e https://t.co/hFlWR8... | | 3874 | 10/26/15 12:40 | 6.5873e+17 | 127.0.0.1 | Athens | reimagining the webdesign process by invisionapp https://t.co/vmb0ozu67e https://t.co/hflwr8tfol | maker conceptor creative developer 0xbac3a9bd #conception geek dev agile neutralite opensource p... | reimagining the webdesign process by invisionapp https://t.co/vmb0ozu67e https://t.co/hflwr8tfol... |
| 20001 | 815750089 | True | golden | 271 | | brand | 0.9622 | yes | 1.0 | 9/10/14 16:30 | ... | #WestHam Tweets: 52: Goal. @FulhamFC double their lead. No7 Dean O'Halloran gets in behind and s... | | 24827 | 10/26/15 13:20 | 6.5874e+17 | | | #westham tweets 52 goal fulhamfc double their lead no7 dean o'halloran gets in behind and scores... | we cover west ham united fc and soccer 24/7 player press is a curator of interesting sports cont... | #westham tweets 52 goal fulhamfc double their lead no7 dean o'halloran gets in behind and scores... |
| 20002 | 815750297 | True | golden | 245 | | brand | 1.0 | yes | 1.0 | 5/11/09 15:31 | ... | Webber: 'It's a chance for the lads to pit their wits against a League club and see where they s... | | 42075 | 10/26/15 12:40 | 6.5873e+17 | Wembley Stadium, London | London | webber it's a chance for the lads to pit their wits against a league club and see where they sta... | official twitter account of the football association tweeting news on england teams emirates fa ... | webber it's a chance for the lads to pit their wits against a league club and see where they sta... |
| 20003 | 815750417 | True | golden | 245 | | brand | 0.6408 | yes | 1.0 | 8/1/14 13:20 | ... | Get Weather Updates from The Weather Channel. 15:40:07 | | 63240 | 10/26/15 12:40 | 6.5873e+17 | | | get weather updates from the weather channel 15:40:07 | | get weather updates from the weather channel 15:40:07 nan |
| 20004 | 815750696 | True | golden | 261 | | male | 1.0 | yes | 1.0 | 3/26/12 14:40 | ... | @TheFalcoholic is like the mailman... Because he delivers! #MuteBuck &amp; #BlitzTheBooth on @Ra... | | 3296 | 10/26/15 12:40 | 6.5873e+17 | Parts Unknown | Pacific Time (US & Canada) | @thefalcoholic is like the mailman.. because he delivers mutebuck amp blitzthebooth on rabbletv ... | comedian writer @rabbletv broadcaster host of youcalledit wrestling baseball boxing and mma fan ... | @thefalcoholic is like the mailman.. because he delivers mutebuck amp blitzthebooth on rabbletv ... |


In [19]:
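def test_external_data(test_bow, test_voc, full_bow, full_voc, model):
    # Project the bag of words of a single document onto the training vocabulary.
    # Assumes test_bow is a 1 x len(test_voc) matrix and that test_voc and
    # full_voc are lists of words given in column order.
    new_bow = np.zeros((1, full_bow.shape[1]))
    for idx_test, s in enumerate(test_voc):
        if s in full_voc:
            idx_full = full_voc.index(s)
            new_bow[0, idx_full] = test_bow[0, idx_test]

    # Return the prediction instead of discarding it
    return model.predict(new_bow)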
    
text_test = 'niggas in Paris, gospel with the squad and my father sport team'

norm_text = text_normalizer(text_test)
test_bow, test_voc = compute_bag_of_words(norm_text)

ValueError: empty vocabulary; perhaps the documents only contain stop words
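
The message above matches the one scikit-learn's `CountVectorizer` raises when no token survives tokenization. Since `compute_bag_of_words` is not shown in this section, the exact cause cannot be confirmed here. One way to sidestep building a second vocabulary is to count the new text's words directly against the training vocabulary. The sketch below does this in plain Python. The function `vectorize` and the small vocabulary are hypothetical, and the whitespace tokenization is a simplification of whatever `text_normalizer` does.

```python
from collections import Counter

def vectorize(text, vocabulary):
    """Count how often each training-vocabulary word occurs in text.

    Words outside the vocabulary are ignored, so the result always has
    len(vocabulary) entries, matching the columns the model was trained on.
    """
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical training vocabulary, for illustration only
train_voc = ['father', 'girl', 'gospel', 'sport', 'team']
row = vectorize('gospel with the squad and my father sport team', train_voc)
print(row)
```

The resulting row could then be wrapped in a list and passed to the fitted model's `predict` method. With scikit-learn, passing the training vocabulary through the `vocabulary` parameter of `CountVectorizer` should achieve the same alignment.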