[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jD82HMRRkkO0qqYYFfDW35_rgS0G3Zny)

# Notebook #6: demographics

## Description:

In this notebook, we will show you how to infer the gender of a Twitter user based on their user name. We will focus on Brazilian Twitter users and use public data containing Brazilian first names and their associated gender. This data can be found [here](https://gist.githubusercontent.com/augustohp/2c59ceb96e195ea375abadb311637e7f/raw/d2007f9ad2ab3d317a9b45ec65e56be741758d8c/brazilian-names-and-gender.csv).

## Import modules:

In [1]:
import pandas as pd
import os
import re
import unicodedata
import sys
import uuid
import tweepy
from IPython.display import Image
from IPython.core.display import HTML 

## Data import and preprocessing

We first load the dataset containing Brazilian first names and their associated gender into a Pandas dataframe. Set `names_data_path` to the path of this CSV file on your local machine.

In [2]:
names_data_path = './brazilian-names-and-gender.csv'
names_df = pd.read_csv(names_data_path)
names_df.head()

Unnamed: 0,Name,Gender
0,Abel,M
1,Abelardo,M
2,Abner,M
3,Abraão,M
4,Absalom,M


We then preprocess the data by lowercasing the first names and renaming the dataframe columns. We have 6100 Brazilian first names in total.

In [3]:
names_df['Name'] = names_df['Name'].str.lower()
names_df.columns = ['name', 'gender']
names_df.shape

(6100, 2)

To avoid problems later, we keep only first names that contain at least two characters. This means that we discard the first name `'h'`.

In [4]:
names_df['len_name'] = names_df['name'].str.len()
names_df = names_df.loc[names_df['len_name'] > 1].reset_index(drop=True)
names_df.shape

(6099, 3)
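
The same length filter can also be written as a boolean mask, without a helper column. The following is a minimal sketch on a toy dataframe (`toy_names_df` and its rows are invented for illustration and are not part of the notebook's data):

```python
import pandas as pd

# Toy stand-in for names_df; the rows are invented for illustration.
toy_names_df = pd.DataFrame({"name": ["h", "ana", "jo"],
                             "gender": ["M", "F", "F"]})

# Boolean mask built with the vectorized string accessor.
long_enough = toy_names_df["name"].str.len() >= 2

# Keep only the rows where the mask is True and renumber the index.
filtered_df = toy_names_df.loc[long_enough].reset_index(drop=True)

print(filtered_df["name"].tolist())
```

Resetting the index matters later in the notebook, because the matching loop looks up genders by row position.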

Out of our remaining 6099 first names, 3537 are male first names and 2562 are female first names.

In [5]:
names_df['gender'].value_counts(dropna=False)

M    3537
F    2562
Name: gender, dtype: int64
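
Before relying on such a table, it may be worth checking whether any first name is listed with more than one gender, since the lookup further down assumes one gender per name. Here is a hedged sketch on a toy dataframe (the rows, including the ambiguous `ariel`, are invented for illustration):

```python
import pandas as pd

# Toy stand-in for names_df; the rows are invented for illustration.
toy_names_df = pd.DataFrame({
    "name": ["ana", "joão", "ariel", "ariel"],
    "gender": ["F", "M", "M", "F"],
})

# Share of each gender label instead of raw counts.
gender_share = toy_names_df["gender"].value_counts(normalize=True)

# Number of distinct gender labels attached to each first name.
labels_per_name = toy_names_df.groupby("name")["gender"].nunique()

# First names that appear with more than one gender label.
ambiguous_names = labels_per_name[labels_per_name > 1].index.tolist()

print(ambiguous_names)
```

Whether the real dataset contains such names is not stated in this notebook; the check is only a suggestion.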

After that, we load a dataset of tweets we previously downloaded from users located in Brazil. Because of Twitter policy, we unfortunately cannot publish this dataset. Feel free to download tweets using notebook #1 to reproduce this example with your own data.

In [6]:
tweet_path = './brazilian_tweets.parquet'
tweet_df = pd.read_parquet(tweet_path)

We keep only the first 1000 rows of this dataset to reduce computing time.

In [7]:
tweet_df = tweet_df[:1000]

## Gender inference based on Twitter user name

We will now infer the gender of Twitter users based on their names. The idea is to look for first names from our database inside each Twitter user name. If a first name is found, the user is assigned the gender associated with that first name.

For instance, in the case of Barack Obama, if we have the first name "Barack" in our database defined as a male name, since "Barack" is contained in the string "Barack Obama", our algorithm will assign the former US president to the male gender. 

Obviously, this is meant as an example and it is a very simplistic approach. Therefore, inference results might not always be correct.
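
The idea described above can be sketched in a few lines. This is an illustration only, not the implementation used later in this notebook: the lookup table `toy_name_to_gender` and the function `infer_gender_from_display_name` are hypothetical names introduced here.

```python
import re

# Hypothetical two-entry lookup table; the notebook itself uses names_df.
toy_name_to_gender = {"barack": "M", "michelle": "F"}


def infer_gender_from_display_name(display_name, name_to_gender):
    """Return the gender of the first known first name found, else None."""
    # Lowercase, then split on non-word characters so that punctuation
    # and emojis do not stick to the tokens.
    tokens = [t for t in re.split(r"\W+", display_name.lower()) if t]
    for token in tokens:
        if token in name_to_gender:
            return name_to_gender[token]
    return None


print(infer_gender_from_display_name("Barack Obama", toy_name_to_gender))
print(infer_gender_from_display_name("SE Palmeiras", toy_name_to_gender))
```

A user name with no known first name yields `None`, which is the common case for organizations and nicknames.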

<img src="https://www.researchgate.net/profile/Tiffany_Gallicano/publication/325827578/figure/fig1/AS:638921738289154@1529342221910/Sample-Twitter-profile-and-relevant-fields.png" width="600" height="600" />

Image source: Wesslen, R., Nandu, S., Eltayeby, O., Gallicano, T., Levens, S., Jiang, M., & Shaikh, S. (2018, June). Bumper Stickers on the Twitter Highway: Analyzing the Speed and Substance of Profile Changes. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 12, No. 1).

On Twitter, users have two types of names. The first one is the "display name" (Barack Obama in the example above) and the second one is the "screen name", commonly called the Twitter handle, which can be found after the `@` symbol (BarackObama here). We will look for existing first names in the display name rather than in the handle, since first name and last name are separated in the former, which makes finding first names easier.

### Extract user handles from tweets

We first need to generate a list of user names we can then use to perform gender inference.

The first step is to generate a list of user handles. We will use the data from our Brazilian tweet dataset to generate a list of handles mentioned in these tweets. To do so, we will loop through all of the tweets from this dataset and look for user handles in each tweet. If a user handle is found, it is added to the `handles_list` list.

In [8]:
handles_list = []
# capture everything after '@' up to the next whitespace or colon
p = re.compile(r'@([^\s:]+)')
for text in tweet_df['text']:
    handles_list.extend(p.findall(text))

This way, we are able to gather 821 Twitter user handles.

In [9]:
len(handles_list)

821
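
The pattern used above captures everything after `@` up to the next whitespace or colon, so trailing punctuation can end up inside a handle. The sketch below compares it with a stricter pattern on an invented sample tweet. The stricter pattern is an assumption on our side (handles made only of ASCII letters, digits and underscores), not the notebook's choice:

```python
import re

# Pattern used in the notebook: everything after "@" up to whitespace or ":".
loose_pattern = re.compile(r'@([^\s:]+)')

# Stricter alternative (an assumption): ASCII letters, digits, underscores.
strict_pattern = re.compile(r'@(\w+)', flags=re.ASCII)

# Invented sample text for illustration.
sample_text = "RT @BTS_twt's new song, cc @Palmeiras: vamos!"

loose_handles = loose_pattern.findall(sample_text)
strict_handles = strict_pattern.findall(sample_text)

print(loose_handles)   # ["BTS_twt's", 'Palmeiras']
print(strict_handles)  # ['BTS_twt', 'Palmeiras']
```

Switching patterns would change the number of handles collected, so the counts reported in this notebook apply to the original pattern only.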

### Get display name from Twitter handle

Now that we have a list of handles, we need to extract the display names related to each of these handles. We do this with Tweepy, by using the `get_user` method, which takes as input a Twitter handle and returns information about the related Twitter user.  

To use Tweepy, as seen in Notebook #1, you need to define your API keys and access tokens below. 

In [10]:
api_dict = {"API Key": "Your API key",
            "API Secret Key": "Your API secret key",
            "access_token": "Your access token",
            "access_token_secret": "Your secret access token"}

In [11]:
def get_auth(api_dict):
    # OAuth process, using the keys and tokens
    auth = tweepy.OAuthHandler(api_dict['API Key'], api_dict['API Secret Key'])
    auth.set_access_token(api_dict['access_token'], api_dict['access_token_secret'])

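    # Creation of the actual interface, using authentication.
    # Note: wait_on_rate_limit_notify is accepted by Tweepy 3.x;
    # Tweepy 4.0 removed this argument, so drop it on newer versions.
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    try:
        api.verify_credentials()
    except Exception as error:
        # Do not print api_dict here: it contains the secret keys.
        print(f"Error during authentication: {error}")
        sys.exit('Exit')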
    return api

In [12]:
api = get_auth(api_dict)
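
To avoid hard-coding credentials in the notebook, `api_dict` could be built from environment variables instead. The following is a sketch under assumptions: the environment variable names and the helper `load_api_dict_from_env` are hypothetical and not part of this notebook or of Tweepy.

```python
import os

# Hypothetical environment variable names; use whatever names you export.
ENV_VAR_BY_KEY = {
    "API Key": "TWITTER_API_KEY",
    "API Secret Key": "TWITTER_API_SECRET_KEY",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_api_dict_from_env(environ=os.environ):
    """Build api_dict from environment variables, failing early if one is missing."""
    missing = [var for var in ENV_VAR_BY_KEY.values() if var not in environ]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: environ[var] for key, var in ENV_VAR_BY_KEY.items()}


# Demonstration with a fake environment so the sketch runs anywhere.
fake_environ = {var: "placeholder" for var in ENV_VAR_BY_KEY.values()}
example_api_dict = load_api_dict_from_env(fake_environ)
print(sorted(example_api_dict))
```

The returned dictionary has the same keys as `api_dict` above, so it could be passed to `get_auth` unchanged.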

Now that the credentials have been verified, we loop through our list of handles `handles_list` and extract the display name related to each handle. Note that some users might have deleted their account or changed their handle. In that case, the handle is considered invalid and the information cannot be extracted.

In [13]:
names_dict = dict()
for handle in handles_list:
    try:
        # the screen_name keyword works with both Tweepy 3.x and 4.x
        name = api.get_user(screen_name=handle).name
        names_dict[name] = handle
    except Exception:
        print(f'Error with handle: {handle}')

Error with handle: _raafarocha
Error with handle: globoesportecem
Error with handle: ei__frases
Error with handle: vivimorena_makeup
Error with handle: IramilsonManaus
Error with handle: raayvitoria1
Error with handle: VerdadeJH
Error with handle: bncaique
Error with handle: Luixpest
Error with handle: disneyanims
Error with handle: dissebktt
Error with handle: jpica10
Error with handle: memesalegres
Error with handle: SDADHH1
Error with handle: MlNHYUKEY
Error with handle: murilinmebeja
Error with handle: sccpkarev
Error with handle: vhribeiro2
Error with handle: joaopedro_ribas
Error with handle: laneeoficial
Error with handle: favelaprogresso
Error with handle: Juverissimow
Error with handle: cnzinhah
Error with handle: laneeoficial
Error with handle: _Vn_Oliveira
Error with handle: _patrickpeixoto
Error with handle: Nallandaaguedaa
Error with handle: BTS_twt's
Error with handle: rabettao_
Error with handle: LaurinhaBonoro
Error with handle: ElisioDado
Error with handle: __jaol
Erro
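
The error log lists `laneeoficial` twice, which suggests that `handles_list` contains repeated handles. Since each lookup is an API call, removing duplicates first may save requests. A minimal sketch on a toy list (`toy_handles_list` is invented for illustration):

```python
# Toy list standing in for handles_list; "laneeoficial" appears twice,
# as it does in the error log above.
toy_handles_list = ["laneeoficial", "Palmeiras", "laneeoficial", "BotSentinel"]

# dict.fromkeys keeps the first occurrence of each handle and preserves order.
unique_handles = list(dict.fromkeys(toy_handles_list))

print(unique_handles)
print(len(toy_handles_list), "->", len(unique_handles))
```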

The results are saved in a dictionary containing:
- the display names as keys 
- the handles as values:

In [14]:
names_dict

{'Samuel Nascimento💯': 'SamuKNascimento',
 'The QuasiEconomist Formerly Known As Caet': 'CaetCaetano1',
 'Maria Oaquim': 'mariaomed',
 'jhony': 'joaocgvitor',
 'ju': 'eaixu',
 'Gaby': 'Gabyspedroso',
 'espiã russa': 'agressivictor',
 'viciada em BIN para todas as mulheres': 'pekibete',
 'se mulher goza pq homem n engravida': 'tisgopjl',
 'Francisco De Laurentiis': 'f_delaurentiis',
 'SE Palmeiras': 'Palmeiras',
 'nemseioqtofzdd': 'Carvalho_joaoj',
 'Lare': 'laressa',
 '☭Lisa Débora Indarráus🏡 ♀️ 🏳️\u200d🌈🤘🏼': 'passarosErosas',
 'Reumalho': 'reumalho',
 'Kamis.': 'kamisfariias',
 '烤肉君KoRu': 'korukun',
 'Danny 🌹': 'Danypulper',
 'vivi morena': 'vivimakeup5',
 'Lorena⭐️': 'badgallore',
 'PetitAbel #BBB21': 'petitabell',
 'Tífany': 'tifanyfaner',
 'Isaque Morais': 'isaquemorais6',
 'Vera Magalhães VACINA JÁ 💉': 'veramagalhaes',
 'comentarista do bbb': 'maaferocha',
 'loló': 'loomarcal',
 'Rayssa Olivetti': 'RayssaOlivetti',
 'Bot Sentinel': 'BotSentinel',
 'Thomas Santana': 'thomassantanas
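
Because the dictionary is keyed by display name, two users who share a display name overwrite each other. The sketch below illustrates this on invented pairs and shows one possible alternative, keying by handle, which is not what this notebook does:

```python
# Toy (display name, handle) pairs; invented for illustration.
toy_users = [("Maria", "maria_a"), ("Maria", "maria_b"), ("Gaby", "gabys")]

# Keyed by display name, as in the notebook:
# the second "Maria" overwrites the first.
by_display_name = {name: handle for name, handle in toy_users}

# Keyed by handle (an alternative): no user is lost.
by_handle = {handle: name for name, handle in toy_users}

print(len(by_display_name))  # 2
print(len(by_handle))        # 3
```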

### Find existing names in display names

We generate a list `names_list` of our display names by extracting the keys from our dictionary `names_dict`. We lowercase the display names and drop potential duplicates.

In [15]:
# lowercase the display names (the dictionary keys)
names_dict = {k.lower(): v for k, v in names_dict.items()}
# drop duplicates while preserving order
names_list = list(names_dict.keys())
names_list = list(dict.fromkeys(names_list))

In total, we now have 574 display names.

In [16]:
len(names_list)

574
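
The notebook imports `unicodedata` without using it. One possible use, shown here as a sketch and not as a step of this notebook, is to strip accents so that a display name such as "tífany" can be compared with an unaccented spelling. The helper `strip_accents` is a name introduced for illustration, and it would have to be applied to both the first-name list and the display names to be useful:

```python
import unicodedata


def strip_accents(text):
    """Remove combining accent marks, e.g. 'tífany' becomes 'tifany'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("tífany"))        # tifany
print(strip_accents("sérgio alves"))  # sergio alves
```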

We now loop through our display names. For each display name, we check whether it starts with a first name from our database of Brazilian first names and genders (`pattern.match` only tests the beginning of the string, so first names appearing later in the display name are not picked up). If it does, we assign the gender of the matched first name to the related handle.

In [17]:
results_dict = dict()
for twitter_name in names_list:
    for num, name in enumerate(names_df['name'].tolist()):
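        # raw f-string avoids the invalid "\W" escape; re.escape protects
        # against regex metacharacters inside a first name
        pattern = re.compile(rf"(^|\W){re.escape(name)}(\W|$)")
        if pattern.match(twitter_name):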
            print(f'***Display name on Twitter: {twitter_name} ***')
            print(f'Matched first name: {name}')
            print(f"Gender associated with matched first name: {names_df['gender'][num]}")
            handle = names_dict[twitter_name]
            results_dict[handle] = names_df['gender'][num]

***Display name on Twitter: samuel nascimento💯 ***
Matched first name: samuel
Gender associated with matched first name: M
***Display name on Twitter: maria oaquim ***
Matched first name: maria
Gender associated with matched first name: F
***Display name on Twitter: gaby ***
Matched first name: gaby
Gender associated with matched first name: F
***Display name on Twitter: francisco de laurentiis ***
Matched first name: francisco
Gender associated with matched first name: M
***Display name on Twitter: lorena⭐️ ***
Matched first name: lorena
Gender associated with matched first name: F
***Display name on Twitter: isaque morais ***
Matched first name: isaque
Gender associated with matched first name: M
***Display name on Twitter: vera magalhães vacina já 💉 ***
Matched first name: vera
Gender associated with matched first name: F
***Display name on Twitter: thomas santana ***
Matched first name: thomas
Gender associated with matched first name: M
***Display name on Twitter: joice hasselmann

***Display name on Twitter: nazaré amarga ***
Matched first name: nazaré
Gender associated with matched first name: M
***Display name on Twitter: luana pandoca 🔮 ***
Matched first name: luana
Gender associated with matched first name: F
***Display name on Twitter: amanda ***
Matched first name: amanda
Gender associated with matched first name: F
***Display name on Twitter: sérgio alves🚩 #forabolsonaro ***
Matched first name: sérgio
Gender associated with matched first name: M
***Display name on Twitter: mari ***
Matched first name: mari
Gender associated with matched first name: F
***Display name on Twitter: lorena nascimento ***
Matched first name: lorena
Gender associated with matched first name: F
***Display name on Twitter: sophie kwidsan🇺🇸 ***
Matched first name: sophie
Gender associated with matched first name: F
***Display name on Twitter: bruno covas ***
Matched first name: bruno
Gender associated with matched first name: M
***Display name on Twitter: pedro hales ***
Matched fi

***Display name on Twitter: chico - mundinho caio caloteiro ***
Matched first name: chico
Gender associated with matched first name: M
***Display name on Twitter: maria nas nuvens 📖☁️ ***
Matched first name: maria
Gender associated with matched first name: F
***Display name on Twitter: joão lucas 🏳️‍🌈🇧🇷 ***
Matched first name: joão
Gender associated with matched first name: M
***Display name on Twitter: michael clifford ***
Matched first name: michael
Gender associated with matched first name: M
***Display name on Twitter: gabriel ***
Matched first name: gabriel
Gender associated with matched first name: M
***Display name on Twitter: maria ***
Matched first name: maria
Gender associated with matched first name: F
***Display name on Twitter: igor reale ***
Matched first name: igor
Gender associated with matched first name: M
***Display name on Twitter: ruth💟 ***
Matched first name: ruth
Gender associated with matched first name: F
***Display name on Twitter: maurício moura ***
Matched f

We were able to infer the gender of 153 of our 574 display names. When looking at the first few results, the quality looks good: "samuel nascimento" is rightly classified as male whereas "maria oaquim" is rightly classified as female.

In [18]:
len(results_dict)

153
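
The matching loop above uses `pattern.match`, which only tests the pattern at the beginning of the string, whereas `pattern.search` scans the whole string. The sketch below shows the difference on one of the display names from the output, assuming hypothetically that "caio" is in the first-name list:

```python
import re

# Hypothetical first name; whether it is in the real database is an assumption.
name = "caio"
display_name = "chico - mundinho caio caloteiro"

pattern = re.compile(rf"(^|\W){re.escape(name)}(\W|$)")

# match() only tries the pattern at position 0 of the string.
found_with_match = bool(pattern.match(display_name))

# search() scans the whole string, so a name in the middle is found.
found_with_search = bool(pattern.search(display_name))

print(found_with_match)   # False
print(found_with_search)  # True
```

Using `search` would therefore match more display names, at the cost of more false positives when a display name contains several first names.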

## Conclusion:

In this notebook, we learned how to infer the gender of Brazilian Twitter users based on their display names. 

While the display name is valuable information for gender inference, other data sources such as the profile picture, the user description or even the social network of the user could be used. Other user demographics could also be inferred from these information sources, such as age or whether the Twitter account belongs to an organization. For further information on Twitter demographic inference, you can have a look at the [M3 GitHub repository](https://github.com/euagendas/m3inference), which predicts gender, age, and human-vs-organization status, using both text and image inputs, for tweets in multiple languages.

## References:

Wang, Z., Hale, S., Adelani, D., Grabowicz, P., Hartmann, T., Flöck, F., & Jurgens, D. (2019). Demographic Inference and Representative Population Estimates from Multilingual Social Media Data. In Proceedings of the 2019 World Wide Web Conference.

Wesslen, R., Nandu, S., Eltayeby, O., Gallicano, T., Levens, S., Jiang, M., & Shaikh, S. (2018, June). Bumper Stickers on the Twitter Highway: Analyzing the Speed and Substance of Profile Changes. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 12, No. 1).