# CS579: Lecture 12  

**Demographic Inference I**

*[Dr. Aron Culotta](http://cs.iit.edu/~culotta)*  
*[Illinois Institute of Technology](http://iit.edu)*

**dem·o·graph·ics**

statistical data relating to the population and particular groups within it.

E.g., age, ethnicity, gender, income, ...

# Why Demographics?

- Marketing
  - Who are my customers?
  - Who are my competitors' customers?
  - E.g., [DemographicsPro](http://www.demographicspro.com/samples#c=%40FamilyGuyonFOX)
  
- Social Media as Surveys
  - E.g., 45% of tweets express positive sentiment toward Pres. Obama
  - Who wrote those tweets?
  
- Health
  - 2% of Facebook users are expressing flu-like symptoms
  - Are they representative of the full population?



**User profiles vary from site to site.**

![rahm](rahm.png)

![rahm-fb](rahm-fb.png)

![rahm-li](rahm-li.png)

# Approaches

- Clever use of external data
  - E.g., U.S. Census name lists for gender
- Look for keywords in profile
  - "African American Male"
  - "Happy 21st birthday to me"
- Machine Learning
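
The keyword approach can be sketched with a regular expression. The pattern and the helper name `guess_age_from_text` below are illustrative assumptions, not part of the lecture code; real profiles would need many more patterns and would still leave most users unmatched.

```python
import re

# Hypothetical pattern: matches phrases like "happy 21st birthday to me".
BIRTHDAY_RE = re.compile(
    r'happy (\d{1,2})(?:st|nd|rd|th) birthday to me', re.IGNORECASE)

def guess_age_from_text(text):
    """Return the age stated in a self-directed birthday message, or None."""
    match = BIRTHDAY_RE.search(text or '')
    return int(match.group(1)) if match else None

print(guess_age_from_text('Happy 21st birthday to me!'))   # 21
print(guess_age_from_text('Happy birthday to my sister'))  # None
```

Requiring "to me" is a deliberate choice: it trades coverage for precision, since birthday wishes addressed to someone else say nothing about the author's age.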

In [1]:
# Guessing gender
# Collect 1000 tweets matching query "i"
import configparser
import sys
from TwitterAPI import TwitterAPI

def get_twitter(config_file):
    """ Read the config_file and construct an instance of TwitterAPI.
    Args:
      config_file ... A config file in ConfigParser format with Twitter credentials
    Returns:
      An instance of TwitterAPI.
    """
    config = configparser.ConfigParser()
    config.read(config_file)
    twitter = TwitterAPI(
                   config.get('twitter', 'consumer_key'),
                   config.get('twitter', 'consumer_secret'),
                   config.get('twitter', 'access_token'),
                   config.get('twitter', 'access_token_secret'))
    return twitter

twitter = get_twitter('twitter.cfg')
tweets = []
n_tweets=1000
for r in twitter.request('statuses/filter', {'track': 'i'}):
    tweets.append(r)
    if len(tweets) % 100 == 0:
        print('%d tweets' % len(tweets))
    if len(tweets) >= n_tweets:
        break
print('fetched %d tweets' % len(tweets))

100 tweets
200 tweets
300 tweets
400 tweets
500 tweets
600 tweets
700 tweets
800 tweets
900 tweets
1000 tweets
fetched 1000 tweets


In [2]:
# not all tweets are returned
# https://dev.twitter.com/streaming/overview/messages-types#limit_notices
[t for t in tweets if 'user' not in t][:6]

[{'limit': {'timestamp_ms': '1519674109806', 'track': 77}},
 {'limit': {'timestamp_ms': '1519674109815', 'track': 67}},
 {'limit': {'timestamp_ms': '1519674109824', 'track': 78}},
 {'limit': {'timestamp_ms': '1519674109848', 'track': 87}},
 {'limit': {'timestamp_ms': '1519674110828', 'track': 173}},
 {'limit': {'timestamp_ms': '1519674110827', 'track': 165}}]

In [3]:
# restrict to actual tweets
# (remove limit notices and other non-tweet messages)
tweets = [t for t in tweets if 'user' in t]
print('fetched %d tweets' % len(tweets))

fetched 925 tweets
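
Rather than discarding the non-tweet messages, they can be kept aside for inspection. A minimal sketch, assuming each streamed message is a dict and that limit notices look like the `{'limit': {...}}` records printed above; `split_messages` is a hypothetical helper name.

```python
def split_messages(messages):
    """Partition streamed messages into tweets and non-tweet notices."""
    tweets, notices = [], []
    for m in messages:
        # Real tweets carry a 'user' field; limit notices do not.
        (tweets if 'user' in m else notices).append(m)
    return tweets, notices

# Two hand-written messages in the shapes shown above.
sample = [
    {'user': {'name': 'Vinny'}, 'text': 'i agree'},
    {'limit': {'timestamp_ms': '1519674109806', 'track': 77}},
]
kept, dropped = split_messages(sample)
print(len(kept), len(dropped))  # 1 1
```

Keeping the notices makes it easy to report how much of the stream was filtered out, which matters when the sample is later treated as representative.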


In [27]:
# Print last 10 names.
names = [t['user']['name'] for t in tweets]
names[-10:]

['Vinny',
 '💎Such a good time💎🇵🇱 will see Harry',
 'Taking Lunches',
 'Aug.14♌❤',
 'bianca ama e tem muito orgulho do exo',
 'T A M',
 'Av.Gazi KOZANOĞLU',
 'Neccar',
 'Lexi 💋🔥',
 'Miguel Sanchez']

In [28]:
# Fetch census name data from:
# http://www2.census.gov/topics/genealogy/1990surnames/
import requests
from pprint import pprint
males_url = 'http://www2.census.gov/topics/genealogy/' + \
            '1990surnames/dist.male.first'
females_url = 'http://www2.census.gov/topics/genealogy/' + \
              '1990surnames/dist.female.first'
males = requests.get(males_url).text.split('\n')
females = requests.get(females_url).text.split('\n')
print('males:')
pprint(males[:10])
print('females:')
pprint(females[:10])

males:
['JAMES          3.318  3.318      1',
 'JOHN           3.271  6.589      2',
 'ROBERT         3.143  9.732      3',
 'MICHAEL        2.629 12.361      4',
 'WILLIAM        2.451 14.812      5',
 'DAVID          2.363 17.176      6',
 'RICHARD        1.703 18.878      7',
 'CHARLES        1.523 20.401      8',
 'JOSEPH         1.404 21.805      9',
 'THOMAS         1.380 23.185     10']
females:
['MARY           2.629  2.629      1',
 'PATRICIA       1.073  3.702      2',
 'LINDA          1.035  4.736      3',
 'BARBARA        0.980  5.716      4',
 'ELIZABETH      0.937  6.653      5',
 'JENNIFER       0.932  7.586      6',
 'MARIA          0.828  8.414      7',
 'SUSAN          0.794  9.209      8',
 'MARGARET       0.768  9.976      9',
 'DOROTHY        0.727 10.703     10']


In [29]:
# Get names. 
male_names = set([m.split()[0].lower() for m in males if m])
female_names = set([f.split()[0].lower() for f in females if f])
print('%d male and %d female names' % (len(male_names), len(female_names)))
print('males:\n' + '\n'.join(list(male_names)[:10]))
print('\nfemales:\n' + '\n'.join(list(female_names)[:10]))

1219 male and 4275 female names
males:
bruce
vance
roy
dale
jessie
malcolm
saul
eddie
anton
daren

females:
trudi
jessie
maegan
mandy
allen
christie
brigida
janay
nannie
clementina


In [30]:
# Initialize gender of all tweets to unknown.
for t in tweets:
    t['gender'] = 'unknown'

In [31]:
# label a Twitter user's gender by matching name list.
import re
def gender_by_name(tweets, male_names, female_names):
    for t in tweets:
        name = t['user']['name']
        if name:
            # remove punctuation.
            name_parts = re.findall(r'\w+', name.split()[0].lower())
            if len(name_parts) > 0:
                first = name_parts[0].lower()
                if first in male_names:
                    t['gender'] = 'male'
                elif first in female_names:
                    t['gender'] = 'female'
                else:
                    t['gender'] = 'unknown'

gender_by_name(tweets, male_names, female_names)
# What's wrong with this approach?

In [32]:
from collections import Counter

def print_genders(tweets):
    counts = Counter([t['gender'] for t in tweets])
    print('%.2f of accounts are labeled with gender' % 
          ((counts['male'] + counts['female']) / sum(counts.values())))
    print('gender counts:\n', counts)
    for t in tweets[:20]:
        print(t['gender'], t['user']['name'])
    
print_genders(tweets)

0.31 of accounts are labeled with gender
gender counts:
 Counter({'unknown': 641, 'male': 154, 'female': 130})
unknown 🌹Mrs McK 🌹
female Melanie Senpai
male Shelby Kocher
unknown catmom
unknown 🦅Kapalı Üst⭐️⭐️⭐️
male Glenn Fiedler
unknown K.Bundii✨
female Liz Ryan Murray
unknown MKL
unknown ♋️
unknown J
unknown buse atalay
unknown Comeden
unknown 🦁
unknown Görkem
unknown M.S.
female miriam
female Brooke Dunaway
female Missy Moo
unknown AK Avrupa


In [34]:
# What about ambiguous names?
def print_ambiguous_names(male_names, female_names):
    ambiguous = [n for n in male_names if n in female_names]  # names on both lists
    print('found %d ambiguous names:\n'% len(ambiguous))
    print('\n'.join(ambiguous[:20]))
    
print_ambiguous_names(male_names, female_names)

found 331 ambiguous names:

roy
dale
jessie
eddie
allen
blake
thomas
dong
shane
shelby
leslie
jerry
arthur
dallas
kim
david
stacy
chung
vernon
rene


In [36]:
# Keep names that are more frequent in one gender than the other.
def get_percents(name_list):
    # parse raw data to extract, e.g., the percent of males named John.
    return dict([(n.split()[0].lower(), float(n.split()[1]))
                  for n in name_list if n])

males_pct = get_percents(males)
females_pct = get_percents(females)

# Assign a name as male if it is more common among males than females.
male_names = set([m for m in male_names if m not in female_names or
              males_pct[m] > females_pct[m]])
female_names = set([f for f in female_names if f not in male_names or
              females_pct[f] > males_pct[f]])

print_ambiguous_names(male_names, female_names)
print('%d male and %d female names' % (len(male_names), len(female_names)))

found 0 ambiguous names:


1146 male and 4017 female names
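
The rule above assigns a name to whichever gender uses it more often, even when the two percentages are nearly equal. One possible refinement, sketched here on the assumption that near-ties are better left unlabeled, is to require a minimum ratio between the two percentages. The function name `resolve_names`, the `min_ratio` parameter, and the toy percentages are all hypothetical.

```python
def resolve_names(males_pct, females_pct, min_ratio=2.0):
    """Assign each name to one gender only if its percentage in that gender
    is at least min_ratio times its percentage in the other gender."""
    male_names, female_names = set(), set()
    for name in set(males_pct) | set(females_pct):
        m = males_pct.get(name, 0.0)
        f = females_pct.get(name, 0.0)
        if m > 0 and m >= min_ratio * f:
            male_names.add(name)
        elif f > 0 and f >= min_ratio * m:
            female_names.add(name)
        # Otherwise the name is too ambiguous and is left out of both sets.
    return male_names, female_names

# Toy percentages (made up for illustration, not census values).
toy_males = {'james': 3.0, 'jessie': 0.04, 'kim': 0.01}
toy_females = {'mary': 2.5, 'jessie': 0.05, 'kim': 0.10}
m_names, f_names = resolve_names(toy_males, toy_females)
print(sorted(m_names))  # ['james']
print(sorted(f_names))  # ['kim', 'mary']
```

A stricter threshold labels fewer users but with more confidence, so the choice of `min_ratio` trades coverage against accuracy.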


In [37]:
# Relabel twitter users (compare with above)
gender_by_name(tweets, male_names, female_names)
print_genders(tweets)

0.31 of accounts are labeled with gender
gender counts:
 Counter({'unknown': 641, 'female': 148, 'male': 136})
unknown 🌹Mrs McK 🌹
female Melanie Senpai
female Shelby Kocher
unknown catmom
unknown 🦅Kapalı Üst⭐️⭐️⭐️
male Glenn Fiedler
unknown K.Bundii✨
female Liz Ryan Murray
unknown MKL
unknown ♋️
unknown J
unknown buse atalay
unknown Comeden
unknown 🦁
unknown Görkem
unknown M.S.
female miriam
female Brooke Dunaway
female Missy Moo
unknown AK Avrupa


In [38]:
# Who are the unknowns?
# "Filtered" data can have big impact on analysis.
unknown_names = Counter(t['user']['name']
                        for t in tweets if t['gender'] == 'unknown')
unknown_names.most_common(20)

[('Ayşe şentürk', 10),
 ('J', 2),
 ('🇵🇦', 2),
 ('✝️', 2),
 ('twogeekgirls', 2),
 ('idris kınay', 2),
 ('imheckifiknow', 2),
 ('em', 2),
 ('Kalb-i Selim ERDOĞAN', 2),
 ('Ahmetcan', 2),
 ('🌹Mrs McK 🌹', 1),
 ('catmom', 1),
 ('🦅Kapalı Üst⭐️⭐️⭐️', 1),
 ('K.Bundii✨', 1),
 ('MKL', 1),
 ('♋️', 1),
 ('buse atalay', 1),
 ('Comeden', 1),
 ('🦁', 1),
 ('Görkem', 1)]

In [40]:
# How do the profiles of male Twitter users differ from
# those of female users?

male_profiles = [t['user']['description'] for t in tweets
                if t['gender'] == 'male']

female_profiles = [t['user']['description'] for t in tweets
                if t['gender'] == 'female']
#male_profiles = [t['text'] for t in tweets
#                if t['gender'] == 'male']

#female_profiles = [t['text'] for t in tweets
#                if t['gender'] == 'female']

import re
def tokenize(s):
    return re.sub(r'\W+', ' ', s).lower().split() if s else []

male_words = Counter()
female_words = Counter()

for p in male_profiles:
    male_words.update(Counter(tokenize(p)))
                      
for p in female_profiles:
    female_words.update(Counter(tokenize(p)))

print('Most Common Male Terms:')
pprint(male_words.most_common(10))
    
print('\nMost Common Female Terms:')
pprint(female_words.most_common(10))

Most Common Male Terms:
[('i', 36),
 ('a', 33),
 ('the', 32),
 ('and', 30),
 ('of', 27),
 ('for', 21),
 ('to', 17),
 ('my', 16),
 ('in', 14),
 ('s', 13)]

Most Common Female Terms:
[('and', 30),
 ('i', 29),
 ('a', 27),
 ('to', 25),
 ('my', 18),
 ('the', 17),
 ('of', 16),
 ('in', 15),
 ('is', 13),
 ('love', 11)]


In [41]:
print(len(male_words))
print(len(female_words))

866
839


In [43]:
# Compute difference
diff_counts = dict([(w, female_words[w] - male_words[w])
                    for w in
                    set(female_words.keys()) | set(male_words.keys())])

sorted_diffs = sorted(diff_counts.items(), key=lambda x: x[1])

print('Top Male Terms (diff):')
pprint(sorted_diffs[:10])

print('\nTop Female Terms (diff):')
pprint(sorted_diffs[-10:])

Top Male Terms (diff):
[('the', -15),
 ('for', -14),
 ('of', -11),
 ('m', -9),
 ('i', -7),
 ('fan', -7),
 ('we', -6),
 ('a', -6),
 ('if', -5),
 ('be', -5)]

Top Female Terms (diff):
[('love', 4),
 ('insta', 4),
 ('life', 4),
 ('an', 4),
 ('im', 4),
 ('girl', 5),
 ('21', 5),
 ('little', 5),
 ('to', 8),
 ('is', 8)]


**A problem with difference of counts:**

<br><br><br><br>
What if we have more male than female words in total?

<br><br><br><br>
Instead, consider "the probability that a male user writes the word **w**"

<br><br><br><br>

$$p(w|male) = \frac{freq(w, male)}
{\sum_i freq(w_i, male)} $$
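
To see why raw differences mislead, consider a toy case where one class simply has more text. The counts below are invented for illustration; only the normalization mirrors the formula above, and `relative_freq` is a hypothetical helper name.

```python
from collections import Counter

# Invented counts: the male sample has ten times as many words overall.
male_counts = Counter({'the': 100, 'football': 20, 'love': 80})
female_counts = Counter({'the': 10, 'football': 1, 'love': 9})

def relative_freq(counts, word):
    """p(w | class) = freq(w, class) / total words observed in the class."""
    return counts[word] / sum(counts.values())

# The raw difference suggests 'love' is strongly male (9 - 80 = -71) ...
print(female_counts['love'] - male_counts['love'])
# ... but the relative frequency is higher for females (9/20 vs 80/200).
print(relative_freq(female_counts, 'love'), relative_freq(male_counts, 'love'))
```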

**Odds Ratio (OR)**

The ratio of a word's probability under each class (strictly a ratio of probabilities rather than of odds, but called OR throughout these notes):

$$ OR(w) = \frac{p(w|female)}{p(w|male)} $$


- High values --> more likely to be written by females
- Low values --> more likely to be written by males
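
As a small worked case, the ratio can be computed for words seen in both classes. The probabilities are made up, and `ratio_for_shared_words` is a hypothetical helper; it deliberately skips words missing from either class, since their ratio is undefined without further assumptions.

```python
def ratio_for_shared_words(male_probs, female_probs):
    """OR(w) = p(w|female) / p(w|male), only for words present in both dicts."""
    shared = set(male_probs) & set(female_probs)
    return {w: female_probs[w] / male_probs[w] for w in shared}

# Made-up probabilities for illustration.
toy_male = {'love': 0.010, 'football': 0.020, 'the': 0.050}
toy_female = {'love': 0.030, 'football': 0.005, 'the': 0.050, 'books': 0.002}

toy_ors = ratio_for_shared_words(toy_male, toy_female)
for word, value in sorted(toy_ors.items(), key=lambda x: -x[1]):
    print('%-10s %.2f' % (word, value))
```

Here `love` comes out near 3 (female-leaning), `the` near 1 (neutral), and `football` near 0.25 (male-leaning); `books` is absent from the result because it has no male probability.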


In [45]:
def counts_to_probs(gender_words):
    """ Compute probability of each term according to the frequency
    in a gender. """
    total = sum(gender_words.values())
    return dict([(word, count / total)
                 for word, count in gender_words.items()])

male_probs = counts_to_probs(male_words)
female_probs = counts_to_probs(female_words)

print('p(w|male)')
pprint(sorted(male_probs.items(), key=lambda x: -x[1])[:10])

print('\np(w|female)')
pprint(sorted(female_probs.items(), key=lambda x: -x[1])[:10])

p(w|male)
[('i', 0.025806451612903226),
 ('a', 0.023655913978494623),
 ('the', 0.022939068100358423),
 ('and', 0.021505376344086023),
 ('of', 0.01935483870967742),
 ('for', 0.015053763440860216),
 ('to', 0.012186379928315413),
 ('my', 0.011469534050179211),
 ('in', 0.01003584229390681),
 ('s', 0.00931899641577061)]

p(w|female)
[('and', 0.02243829468960359),
 ('i', 0.02169035153328347),
 ('a', 0.02019446522064323),
 ('to', 0.018698578908002993),
 ('my', 0.013462976813762155),
 ('the', 0.012715033657442034),
 ('of', 0.011967090501121914),
 ('in', 0.011219147344801795),
 ('is', 0.009723261032161555),
 ('love', 0.008227374719521317)]


In [46]:
def odds_ratios(male_probs, female_probs):
    return dict([(w, female_probs[w] / male_probs[w])
                 for w in
                 set(male_probs) | set(female_probs)])

ors = odds_ratios(male_probs, female_probs)

KeyError: 'selfcare'

In [47]:
print(len(male_probs))
print(len(female_probs))
print(female_probs['selfcare'])
'selfcare' in male_probs

866
839
0.0014958863126402393


False

**How to deal with 0-probabilities?**

$$p(w|male) = \frac{freq(w, male)}
{\sum_i freq(w_i, male)} $$

$freq(w, male) = 0$

Do we really believe there is **0** probability of a male using this term?

(Recall over-fitting discussion.)
<br><br><br><br>

**Additive Smoothing**

Reserve a small count (e.g., 1) for unseen observations.

E.g., assume we've seen each word at least once in each class.

$$p(w|male) = \frac{1 + freq(w, male)}
{|W| + \sum_i freq(w_i, male)} $$

$|W|$: number of unique words.
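
The formula can also be applied directly, without adding pseudo-counts to the counters themselves. A minimal sketch; `smoothed_probs` and the `alpha` parameter are hypothetical names, with `alpha=1` matching the formula above, and the counts are invented.

```python
from collections import Counter

def smoothed_probs(counts, vocabulary, alpha=1):
    """p(w|class) = (alpha + freq(w, class)) / (alpha * |W| + total count)."""
    denom = alpha * len(vocabulary) + sum(counts[w] for w in vocabulary)
    return {w: (alpha + counts[w]) / denom for w in vocabulary}

# Invented counts for illustration.
male_counts = Counter({'football': 3, 'the': 5})
female_counts = Counter({'selfcare': 2, 'the': 4})
vocab = set(male_counts) | set(female_counts)

male_p = smoothed_probs(male_counts, vocab)
print(male_p['selfcare'])    # 1 / (3 + 8), about 0.0909
print(sum(male_p.values()))  # sums to 1, up to floating-point rounding
```

Leaving the original counters untouched means the cell can be re-run safely; updating the counters in place adds another pseudo-count on every execution.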

In [49]:
# Additive smoothing. Add count of 1 for all words.
all_words = set(male_words) | set(female_words)
male_words.update(all_words)  
female_words.update(all_words)

male_probs = counts_to_probs(male_words)
female_probs = counts_to_probs(female_words)
print('\n'.join(str(x) for x in 
                sorted(male_probs.items(), key=lambda x: -x[1])[:10]))

('i', 0.012732278045423262)
('a', 0.01169993117687543)
('the', 0.011355815554026153)
('and', 0.010667584308327599)
('of', 0.009635237439779766)
('for', 0.007570543702684102)
('to', 0.006194081211286993)
('my', 0.005849965588437715)
('in', 0.00516173434273916)
('s', 0.004817618719889883)


In [50]:
# Even though the word never appears in male profiles, it has non-zero probability.
print(male_probs['selfcare'])

0.00034411562284927734


In [51]:
ors = odds_ratios(male_probs, female_probs)

sorted_ors = sorted(ors.items(), key=lambda x: -x[1])

print('Top Female Terms (OR):')
pprint(sorted_ors[:20])

print('\nTop Male Terms (OR):')
pprint(sorted_ors[-20:])

Top Female Terms (OR):
[('girl', 6.122191011235955),
 ('21', 6.122191011235955),
 ('little', 6.122191011235955),
 ('old', 5.10182584269663),
 ('an', 5.10182584269663),
 ('refuse', 5.10182584269663),
 ('im', 5.10182584269663),
 ('yourself', 4.081460674157303),
 ('cat', 4.081460674157303),
 ('wife', 4.081460674157303),
 ('mind', 4.081460674157303),
 ('books', 4.081460674157303),
 ('mom', 4.081460674157303),
 ('maga', 4.081460674157303),
 ('4', 4.081460674157303),
 ('them', 4.081460674157303),
 ('make', 4.081460674157303),
 ('country', 3.0610955056179776),
 ('read', 3.0610955056179776),
 ('going', 3.0610955056179776)]

Top Male Terms (OR):
[('women', 0.34012172284644193),
 ('m', 0.3139585133967156),
 ('great', 0.25509129213483145),
 ('director', 0.25509129213483145),
 ('etchednm', 0.25509129213483145),
 ('football', 0.25509129213483145),
 ('host', 0.25509129213483145),
 ('mac', 0.25509129213483145),
 ('future', 0.25509129213483145),
 ('artist', 0.25509129213483145),
 ('editor', 0.25509129