# Demographic labeling

## Goal

In this notebook, we will be associating our cleaned ads data with demographics. Associating a demographic that the IRA attempted to target with their ads is a very subjective process. Studies which inspired this analysis have not clearly communicated how they labeled their ads. From browsing the ads, it is very clear that most of the ads targeting is based on implicit biases and stereotypes about certain demographics. As a result, whenever a term is said to be related to a demographic, I have left a note indicating why and the add number refering to the pdf ad that was consulted to make the decision.

Labelling is also difficult as:

* There is overlap between possible target categories (You can be both LGBT and African-American)
* Some ad campaign didn't seem to have clear targets

## Summary of strategy

My first attempt at labeling was to use a [well known document clustering approach](https://iksinc.online/2016/05/16/topic-modeling-and-document-clustering-whats-the-difference/) using tf-idf followed by k-means. tf-idf is used to calculated the frequency of terms in a document and compare with their frequency accross all documents. Terms that are seen more frequently are less important where terms seen less frequently are more important.

K-Means is then used to identify clusters. To associate a cluster with a demographic, the different targeting interests will be listed by frequency of appearance in the ads of the cluster.

This first approach was not successful because a lot of the ads target the same demographic. We can see this by looking at the frequency of terms in each clusters. Most clusters contain 'African-American' when we would really like for this to be in their own cluster. This is due to ads targeting 'African-American' to be numerous in addition of having an overlap between 'African-American' and other categories. The [\[TEST\]\_k\_means_demographic_labeling]([TEST]_k_means_demographic_labeling.ipynb) notebook file contains code for this approach.

Inspired by the Oxford study which used a graph algorithms in order to identify communities within a graph, the second approach (and the one which will be shown in this notebook) was to a do a manual "Breadth first search" within the data. The criterions for belonging to a group are described below as well as the process used.

### Criterion

* Similar words are used
* Similar social and or cutlural interests are shared

As nothing forbids overlap between demographics such as "Enjoy memes", "LGBT" or "Mexican-American", whenever a it is unclear why a term was tied to a demographic, I have left a note indicating why and the add number refering to the pdf ad that was consulted to make the decision.

### Process

1. Create the list of interests as arrays of interest strings.
2. List the top 25 most frequent interests.
3. Attempt to create a demographic group with the top 25 interests.
4. Label rows based on the demographic group
5. Analyse whether terms related to those used to label the group should also be used as part of the interests labelling. If so add these to the labeling for the group and repeat 4-5, otherwise go to 6.
6. Stop listing interests frequency once the number of ads in a group stablelizes. Go to 3 with data not including the group identified.

We first import the packages required and load our "clean_data" dataframe which contains the fields detailed below.

| Field name             | Type     | Description                         |
|------------------------|----------|-------------------------------------|
| ad_targeting_interests | string   | Interests used to target users      |
| ad_impressions         | int      | Number of users who saw the ads     |
| ad_clicks              | int      | Number of times the ads was clicked |
| ad_spend               | float    | Money spent on the ad in RUB        |
| ad_creation_date       | datetime | Creation date of the ad             |
| ad_end_date            | datetime | Date at which the ad stopped        |

In [14]:
import pandas as pd
import numpy as np
import re

ads_df = pd.read_csv('../clean_data/clean_data.csv', parse_dates=['ad_creation_date', 'ad_end_date'])
ads_df.head(3)

Unnamed: 0,file_name,ad_targeting_interests,ad_impressions,ad_clicks,ad_spend,ad_creation_date,ad_end_date
0,P(1)0002823.txt,"Pan-Africanism, African-American Civil Rights...",10496,1823,200.0,2017-04-21,2017-04-22
1,P(1)0002837.txt,"Pan-Africanism, African-American Civil Rights...",16305,1337,499.49,2017-04-13,2017-04-14
2,P(1)0006304.txt,"Martin Luther King, Jr., Stop Racism!!, Afric...",8210,1788,1570.03,2017-05-29,2017-05-29


## Spliting the "ad_targeting_interests" string

We need to do a bit more work before we can use the interests to identify demographics. First we need to split the long ad_targeting_interests into a string. The interests are separated by a comma and an 'or' statement between the last two elements like so:

"interest1, interest2, interest3 or interest4"

We use a regular expression to transform this string in an array of interests.

In [15]:
# We compile the regular expression to improve performance
comma_separation = re.compile(r'(?u)(?:^|,)([^\",\n]*)')

# Returns an array of interests found in the ad_targeting_interests column
def get_arr_of_interests(interest_string):
    arr = []
    
    # Iterate through matches found by regular expression
    for matches in re.finditer(comma_separation, interest_string):
        
        # Remove whitespace from both side of the string
        match_value = matches.group(1).strip()
        
        # If " or " is present in the string split it.
        if match_value and ' or ' in match_value:
            arr.extend(match_value.split(' or '))
        elif match_value:
            arr.append(match_value)
    
    # Sometimes the string has one interest,
    # we add the entire string as the interest.
    if len(arr) == 0:
        arr.push(interest_string)
        
    return arr

# Obtain the interests array for each row using apply
ads_df['ad_interests_array'] = ads_df.ad_targeting_interests.apply(get_arr_of_interests)

# Add a column for demographic initialized as nan
ads_df['demographic'] = np.nan

# Show the first three entries
ads_df.head(3)

Unnamed: 0,file_name,ad_targeting_interests,ad_impressions,ad_clicks,ad_spend,ad_creation_date,ad_end_date,ad_interests_array,demographic
0,P(1)0002823.txt,"Pan-Africanism, African-American Civil Rights...",10496,1823,200.0,2017-04-21,2017-04-22,"[Pan-Africanism, African-American Civil Rights...",
1,P(1)0002837.txt,"Pan-Africanism, African-American Civil Rights...",16305,1337,499.49,2017-04-13,2017-04-14,"[Pan-Africanism, African-American Civil Rights...",
2,P(1)0006304.txt,"Martin Luther King, Jr., Stop Racism!!, Afric...",8210,1788,1570.03,2017-05-29,2017-05-29,"[Martin Luther King, Jr., Stop Racism!!, Afric...",


## Outputting most frequent words

We create a function which we can use to output the most frequent words found in the ad_interests_arrays.

In [16]:
# We use a default dictionary which simplifies are code
# by skipping check for entires not in our dictionary
from collections import defaultdict

# Given the ad_interests_arrays column (a pandas' series)
# Output the top n most frequent words and their associated row count
def print_top_words_from_arrays(series, n):
    
    # Create an empty interests dictionary
    word_dict = defaultdict(int)
    # For every array of interests
    for arr in series:
        # For every word
        for val in arr:
            # Increment count
            word_dict[val] += 1
    
    # Sort dictionary by count descending
    count = 0
    for w in sorted(word_dict, key=word_dict.get, reverse=True):
        print(w, word_dict[w])
        
        # output top n
        count +=1
        if count == n:
            break;
            
# Output top 25 words
print_top_words_from_arrays(ads_df.ad_interests_array, 25)

African-American history 673
Malcolm X 601
Martin Luther King 590
Jr. 508
African-American Civil Rights Movement (1954-68) 455
Black (Color) 398
African-American culture 252
Pan-Africanism 242
La Raza 185
Chicano rap 183
Lowrider 175
Black Consciousness Movement 148
Black Matters 132
Stop Police Brutality 129
Black nationalism 118
Police misconduct 113
African-American Civil Rights Movement(1954-68) 107
HuffPost Black Voices 101
BlackNews.com 99
BuzzFeed 98
CollegeHumor 91
Mexico 88
. Hispanidad 88
9GAG 86
Latin hip hop 83


## Create demographic, add similar words, rince, repeat

This is the very repetitive part of the process where we follow the steps below:

3. Attempt to create a demographic group with the top 25 interests.
4. Label rows based on the demographic group
5. Analyse whether terms related to those used to label the group should also be used as part of the interests labelling. If so add these to the labeling for the group and repeat 4-5, otherwise go to 6.
6. Stop listing interests frequency once the number of ads in a group stablelizes. Go to 3 with data not including the group identified.

Notes behind the choice to include or exclude words are included in the process below. 

In [17]:
# Labels a single row with a demographic given:
# the row, the themes associated with the group, the name of the group
def label_demographic(row, themes, group_name):
    interests_array = row['ad_interests_array']
    if pd.isnull(row['demographic']):
        for theme in themes:
            if theme in interests_array:
                    row['demographic'] = group_name
                    break
    return row

# Labels rows given the dataframe, themes and group name for the demographic
# Outputs the total number of rows labeled with the demographic given the themes
def label_demographic_rows(df, themes, group_name):
    df = df.apply(label_demographic, args=(themes, group_name), axis=1)
    print('This labelled %d rows!'% (df['demographic'] == group_name).sum())
    return df

## First demographic: African-American

From this list of words above, many seem to target African-Americans lets make this a first grouping.

In [18]:
african_american_themes = {
    'African-American history',
    'Malcolm X',
    'Martin Luther King',
    'African-American Civil Rights Movement (1954-68)',
    'Black (Color)',
    'African-American culture',
    'Pan-Africanism',
    'Black Consciousness Movement',
    'Black Matters',
    'Black nationalism',
    'African-American Civil Rights Movement(1954-68)',
    'HuffPost Black Voices',
    'BlackNews.com'
}

ads_df = label_demographic_rows(ads_df, african_american_themes, 'African-American')

This labelled 1139 rows!


We now look at the frequency of interests referenced by these rows. To do so, we create a function that prints the top 25 interests and their frequency for ads tagged with the demographic group.

In [22]:
# Given the dataframe, the themes for the demographic and its name
# Output the top 25 interests related to these ads that are not yet part of the themes array.
def print_top_references_for_theme(df, themes, group_name):
    rows = df[df['demographic'] == group_name]
    ad_interest_arrays = rows.ad_interests_array.apply(lambda x: list(set(x) - themes))
    print_top_words_from_arrays(ad_interest_arrays, 25)

In [23]:
print_top_references_for_theme(ads_df, african_american_themes, 'African-American')

Jr. 508
Stop Police Brutality 127
Police misconduct 113
Black history 74
Martin Luther King Ill 73
African-American Civil Rights Movement ( 1954-68) 63
AfricanAmerican culture 60
AfricanAmerican history 58
Jr.; African-American Civil Rights Movement (1954-68) 55
Black Power 43
Black History Month 43
Black Panther Party 41
Martin Luther King III 40
Black is beautiful 34
Say To No Racism 31
Angela Davis 31
Stop Racism!! 27
African American 27
African-American Civil Rights Movement (1954--68) 26
AfricanAmerican Civil Rights Movement(1954-68) 25
Black Girls Rock! 25
My Black is Beautiful 23
Anti-discrimination 23
African-American Civil Rights Movement (1954-68). African-American history 22
Human rights 19


In [36]:
african_american_themes = african_american_themes | {
    'Jr.',
    'Stop Police Brutality',
    'Police misconduct',
    'Black history',
    'Martin Luther King Ill',
    'African-American Civil Rights Movement ( 1954-68)',
    'AfricanAmerican culture',
    'AfricanAmerican history',
    'Jr.; African-American Civil Rights Movement (1954-68)',
    'Black Power',
    'Black History Month',
    'Black Panther Party',
    'Martin Luther King III',
    'Black is beautiful',
    'Say To No Racism',
    'Angela Davis',
    'Stop Racism!!',
    'African American',
    'African-American Civil Rights Movement (1954--68)',
    'AfricanAmerican Civil Rights Movement(1954-68)',
    'Black Girls Rock!',
    'My Black is Beautiful',
    'Anti-discrimination',
    'African-American Civil Rights Movement (1954-68). African-American history',
    'Human rights'
}

ads_df = label_demographic_rows(ads_df, african_american_themes, 'African-American')

This labelled 1196 rows!


In this case, we might wonder why 'Stop Police Brutality' was accepted in the list of African-American themes. One way to test is to output the list of ads which have this interest and are not already labeled African-American.

In [35]:
# Makes sure the interest string contains 'Stop Police Brutality' and that the row is not already tagged 'African-American'
ads_df[ads_df['ad_targeting_interests'].str.contains('Stop Police Brutality') & pd.isnull(ads_df['demographic'])]

Unnamed: 0,file_name,ad_targeting_interests,ad_impressions,ad_clicks,ad_spend,ad_creation_date,ad_end_date,ad_interests_array,demographic
23,P(1)0000588.txt,Say It Loud - I'm Black and I'm Proud or Stop...,1589,4,405.25,2016-01-15,NaT,"[Say It Loud - I'm Black and I'm Proud, Stop P...",
424,P(1)0003084.txt,Copwatch Rodney King; Police brutality in the ...,2195,91,2293.09,2016-05-11,2016-05-19,[Copwatch Rodney King; Police brutality in the...,
1214,P(1)0000483.txt,Police brutality in the United States or Stop...,33183,1,5505.22,2016-05-05,NaT,"[Police brutality in the United States, Stop P...",


From this output we see that the first row belongs in the group by the mention of "I'm Black ..." the second mentions [Rodney King](https://en.wikipedia.org/wiki/Rodney_King) an African-American activist against police brutality. The third row is more ambiguous, but after looking at file P(1)0000483.txt we find that another part of this entry references African-Americans.

A similar exercise reveals similar results for "Police misconduct", "Say To No Racisms", "Anti-discrimination", "Human rights" etc...

In [37]:
print_top_references_for_theme(ads_df, african_american_themes, 'African-American')

Cop Block 23
Social justice 19
Gun Owners of America 18
2nd Amendment 18
Self Defense Family 18
The Self Defense Company 18
Martial arts 18
Racism in the United States 14
Martin Luther King; Jr. 14
Concealed carry in the United States 14
Huey P. Newton 14
Justice 13
Police brutality in the United States 12
Gun Rights 12
Police Brutality is a Crime 11
Nelson Mandela 10
Malcolm X Memorial Foundation 10
Black Enterprise 9
Jr.; African-American culture 9
African National Congress 9
HuffPost Politics 9
Black Business Works 9
Jr.; African-American Civil Rights Movement (1954-68). African-American history 8
Violence prevention 8
Pan Africanist Congress of Azania 8


In [39]:
african_american_themes = african_american_themes | {
    'Cop Block',
    'Social justice',
    'Racism in the United States',
    'Martin Luther King; Jr.',
    'Huey P. Newton',
    'Police brutality in the United States',
    'Police Brutality is a Crime',
    'Nelson Mandela',
    'Malcolm X Memorial Foundation',
    'Black Enterprise',
    'Jr.; African-American culture',
    'African National Congress',
    'HuffPost Politics',
    'Black Business Works',
    'Jr.; African-American Civil Rights Movement (1954-68). African-American history',
    'Pan Africanist Congress of Azania',
    'Violence prevention'
}

ads_df = label_demographic_rows(ads_df, african_american_themes, 'African-American')

This labelled 1245 rows!


In [40]:
print_top_references_for_theme(ads_df, african_american_themes, 'African-American')

Gun Owners of America 18
2nd Amendment 18
Self Defense Family 18
The Self Defense Company 18
Martial arts 18
Concealed carry in the United States 14
St. Louis 13
Justice 13
Gun Rights 12
Union of Huffington Post Writers and Bloggers 8
Baptism 8
Afrocentrism 7
Humanitarianism 7
Humanitarian aid 7
Fight the Power 7
2016 7
United States presidential election 7
Black Tea Patriots 7
mother jones 7
National Museum of American History 7
Maya Angelou 7
Mumia AbuJamal 7
The Raw Story 7
Gospel 7
BLACK BUSINESS GLOBAL 7


In [41]:
african_american_themes = african_american_themes | {
    'St. Louis',
    'Union of Huffington Post Writers and Bloggers',
    'Baptism',
    'Afrocentrism',
    'Fight the Power',
    'United States presidential election',
    'Black Tea Patriots',
    'The Raw Story',
    'mother jones',
    'National Museum of American History',
    'Maya Angelou',
    'Mumia AbuJamal',
    'Gospel',
    'BLACK BUSINESS GLOBAL'
}

ads_df = label_demographic_rows(ads_df, african_american_themes, 'African-American')

This labelled 1276 rows!


In [42]:
print_top_references_for_theme(ads_df, african_american_themes, 'African-American')

Mumia Abu-Jamal 29
Gun Owners of America 18
2nd Amendment 18
Self Defense Family 18
The Self Defense Company 18
Martial arts 18
Concealed carry in the United States 14
Justice 13
Gun Rights 12
2016 9
Humanitarianism 7
Humanitarian aid 7
Black Business Builders Club 7
Kemetism 6
Medgar Evers 6
I Have a Dream 6
Mahatma Gandhi 6
African-American history. Malcolm X 6
Visual perception 6
Freckle 6
Baltimore 6
Color 6
Civil and political rights 6
AfricanAmerican culture. African-American Civil Rights Movement (1954-68) 5
Guns & Ammo 5


In [43]:
african_american_themes = african_american_themes | {
    'Mumia Abu-Jamal',
    'Black Business Builders Club',
    'Medgar Evers',
    'I Have a Dream',
    'African-American history. Malcolm X',
    'Baltimore',
    'Civil and political rights',
    'AfricanAmerican culture. African-American Civil Rights Movement (1954-68)',
}

ads_df = label_demographic_rows(ads_df, african_american_themes, 'African-American')

This labelled 1277 rows!


At this point the number of labeled rows changes very little, we now identify the next demographic by only examining rows not previously labeled. A new heading is added to the next demographic:

## Second demographic: Mexican-American

In [45]:
print_top_words_from_arrays(ads_df.ad_interests_array[pd.isnull(ads_df['demographic'])], 25)

La Raza 185
Chicano rap 183
Lowrider 175
BuzzFeed 98
CollegeHumor 91
Mexico 88
. Hispanidad 88
9GAG 86
Latin hip hop 83
Hispanidad 79
Mexico. Latin hip hop. Chicano Movement 79
Mexican Pride 73
Chicano Movement 73
Hispanic and latino american culture 70
Hispanic culture 70
Mexican american culture 68
Latino culture 66
LGBT United 64
Patriotism 56
Being Patriotic 55
Independence 53
Don't Shoot 47
Chicano 46
LGBT community 31
Donald Trump for President 24


Many of the terms are part of Mexican-American identity politics notably:

* [La Raza](https://en.wikipedia.org/wiki/La_Raza)
* [Chicano / Chicana](https://en.wikipedia.org/wiki/Chicano)
* [Lowrider](https://en.wikipedia.org/wiki/Lowrider)
* [Hispanidad](https://en.wikipedia.org/wiki/Hispanidad)

This will be the second manually-labelled group.

In [46]:
mexican_american_themes = {
    'La Raza',
    'Chicano rap',
    'Lowrider',
    'Mexico',
    '. Hispanidad',
    'Latin hip hop',
    'Hispanidad',
    'Maxico. Latin hip hop. Chicano Movement',
    'Mexican Pride',
    'Chicano Movement',
    'Hispanic and latino american culture',
    'Hispanic culture'
    'Mexican american culture',
    'Latino culture',
    'Chicano'
}

ads_df = label_demographic_rows(ads_df, mexican_american_themes, 'Mexican-American')

This labelled 189 rows!


In [47]:
print_top_references_for_theme(ads_df, mexican_american_themes, 'Mexican-American')

Mexico. Latin hip hop. Chicano Movement 79
Hispanic culture 70
Mexican american culture 68
Being Chicano 24
Chicano. Chicano Movement 23
Chicano Movement. Hispanidad 9
Mexico. Latin hip hop 9
Culture of Mexico 8
Latin hip hop. Chicano 4
Being Chicano. Mexican american culture 4
Being Mexican 4
Mexican American Pride 3
So Mexican 3
Lowrider; Chicano rap 2
Chicano Movement. Being Latino 2
Hispanic american culture 2
Latin hip hop. Chicano Movement 1
Mexican american culture; Hispanic american culture 1
Mexico; Latin hip hop. Chicano Movement. Hispanidad 1
Hispanic american culture. Chicano Movement 1
Being Latino 1


In [48]:
mexican_american_themes = mexican_american_themes | {
    'Mexico. Latin hip hop. Chicano Movement',
    'Hispanic culture',
    'Mexican american culture',
    'Being Chicano',
    'Chicano. Chicano Movement',
    'Chicano Movement. Hispanidad',
    'Mexico. Latin hip hop',
    'Culture of Mexico',
    'Being Chicano. Mexican american culture',
    'Latin hip hop. Chicano',
    'Being Mexican',
    'Mexican American Pride',
    'So Mexican',
    'Lowrider; Chicano rap',
    'Chicano Movement. Being Latino',
    'Hispanic american culture',
    'Latin hip hop. Chicano Movement',
    'Mexican american culture; Hispanic american culture',
    'Mexico; Latin hip hop. Chicano Movement. Hispanidad',
    'Hispanic american culture. Chicano Movement',
    'Being Latino'
}

ads_df = label_demographic_rows(ads_df, mexican_american_themes, 'Mexican-American')

This labelled 189 rows!


At this point the number of labeled rows no longer changes, we now identify the next demographic by only examining rows not previously labeled.

## Third demographic: Memes

In [49]:
print_top_words_from_arrays(ads_df.ad_interests_array[pd.isnull(ads_df['demographic'])], 25)

BuzzFeed 98
CollegeHumor 91
9GAG 86
LGBT United 64
Patriotism 56
Being Patriotic 55
Independence 53
Don't Shoot 47
LGBT community 31
Donald Trump for President 24
Gun Owners of America 21
Donald Trump 21
iFunny 21
Imgur 20
Homosexuality 19
Politics 19
Funny Pics 19
Muslim Brotherhood 18
Funny Photo's 18
Funny Pictures 18
LOL 18
Native American Indian Wisdom 18
Cherokee language 18
Cherokee Nation 18
Republican Party (United States) 17


The most popular three terms describe their own category, people whom enjoy memes.

In [51]:
memes_themes = {
    'BuzzFeed',
    'CollegeHumor',
    '9GAG',
    'iFunny',
    'Imgur',
    'Funny Pics',
    'Funny Photo\'s',
    'Funny Pictures',
    'LOL'
}
ads_df = label_demographic_rows(ads_df, memes_themes, 'Memes')

This labelled 132 rows!


In [52]:
print_top_references_for_theme(ads_df, memes_themes, 'Memes')

Meme 9
Internet meme 9
Reddit 8
Fail Blog 8
NBA Memes 7
Meme Center 7
lmgur 3
Entertainment 3
Hobbies and activities 3
Humour 2
Reddit; BuzzFeed 1
Meme Center; NBA Memes 1
Imgur; CollegeHumor 1
Government 1
Politics and social issues 1


In [53]:
memes_themes = memes_themes | {
    'Meme',
    'Internet meme',
    'Reddit',
    'Fail Blog',
    'NBA Memes',
    'Meme Center',
    'lmgur',
    'Humour',
    'Reddit; BuzzFeed',
    'Meme Center; NBA Memes',
    'Imgur; CollegeHumor'
}
ads_df = label_demographic_rows(ads_df, memes_themes, 'Memes')

This labelled 137 rows!


At this point, very few new rows has been labeled. We find the next category.

## Fourth demographic: LGBT

In [54]:
print_top_words_from_arrays(ads_df.ad_interests_array[pd.isnull(ads_df['demographic'])], 25)

LGBT United 64
Patriotism 56
Being Patriotic 55
Independence 53
Don't Shoot 47
LGBT community 31
Donald Trump for President 24
Gun Owners of America 21
Donald Trump 21
Homosexuality 19
Politics 19
Muslim Brotherhood 18
Native American Indian Wisdom 18
Cherokee language 18
Cherokee Nation 18
Republican Party (United States) 17
Mixed martial arts 17
Martial arts 17
The Women's Self Defense Institute 17
PERSONAL & HOME DEFENSE 17
Self-defense 17
Selfdefense (United States) 17
Personal Defense 17
Right of self-defense 17
Self Defense Family 17


In [55]:
lgbt_themes = {
    'LGBT United',
    'LGBT community',
    'Homosexuality'
}
ads_df = label_demographic_rows(ads_df, lgbt_themes, 'LGBT')

This labelled 96 rows!


In [58]:
print_top_references_for_theme(ads_df, lgbt_themes, 'LGBT')

Fitness and wellness 4
Indiana 4
Motherhood 4
territory 4
Society 3
Philosophy 3
Photography 2
Family 2
News broadcasting 1
Political party 1
Breaking news 1
Hillary Clinton 1
Bernie Sanders 1
Same-sex marriage in the United States 1
Rainbow flag (LGBT movement) 1
LGBT Equality 1
Transgenderism 1
Puppy 1
Dogs 1
LGBT history 1
Liberalism 1
Libertarianism 1


In [59]:
lgbt_themes = lgbt_themes  | {
    'Same-sex marriage',
    'LGBT culture',
    'Gay pride',
    'Love',
    'LGBT rights by country',
    'Lesbian community',
    'LGBT social movements',
    'Politics and social issues',
    'Yoga',
    'Gay Rights',
    'Human Sexuality',
    'Bisexuality'
}
ads_df = label_demographic_rows(ads_df, lgbt_themes, 'LGBT')

This labelled 97 rows!


At this point, very few new row have been labeled. We find the next category.

## Fifth demographic: Right wing

In [61]:
print_top_words_from_arrays(ads_df.ad_interests_array[pd.isnull(ads_df['demographic'])], 15)

Patriotism 56
Being Patriotic 55
Independence 53
Don't Shoot 47
Donald Trump for President 24
Gun Owners of America 21
Donald Trump 21
Politics 19
Muslim Brotherhood 18
Native American Indian Wisdom 18
Cherokee language 18
Cherokee Nation 18
Republican Party (United States) 17
Mixed martial arts 17
Martial arts 17


In [62]:
right_wing_themes = {
    'Patriotism',
    'Being Patriotic',
    'Independence',
    'Donald Trump for President',
    'Gun Owners of America',
    'Donald Trump',
    'Republican Party (United States)'
}
ads_df = label_demographic_rows(ads_df, right_wing_themes, 'Right wing')

This labelled 170 rows!


In [63]:
print_top_references_for_theme(ads_df, right_wing_themes, 'Right wing')

Donald Trump Jr. 13
Concealed carry in the United States 9
National Rifle Association 7
The Tea Party 7
Gun Rights 7
Syria 7
Politics 7
Right to keep and bear arms 6
National Association for Gun Rights 6
Conservatism 6
The Second Amendment 6
2nd Amendment 5
Guns & Ammo 4
Young Republicans 3
Second Amendment to the United States Constitution 3
Dixie 3
Confederate States of America 3
conservative daily 2
dead hands 2
From my cold 2
lvanka Trump Fine Jewelry 2
donald j trump 2
Second Amendment Sisters 2
Gun Rights Across America 2
Veterans Day 2


In [64]:
right_wing_themes = right_wing_themes | {
    'Donald Trump Jr.',
    'Concealed carry in the United States',
    'National Rifle Association',
    'The Tea Party',
    'Gun Rights',
    'Right to keep and bear arms',
    'National Association for Gun Rights',
    'Conservatism',
    'The Second Amendment',
    '2nd Amendment',
    'Guns & Ammo',
    'Young Republicans',
    'Second Amendment to the United States Constitution',
    'Confederate States of America',
    'Dixie',
    'conservative daily',
    'dead hands',
    'From my cold',
    'lvanka Trump Fine Jewelry',
    'donald j trump',
    'Gun Rights Across America',
    'Second Amendment Sisters',
    'Veterans Day'
}
ads_df = label_demographic_rows(ads_df, right_wing_themes, 'Right wing')

This labelled 200 rows!


In [65]:
print_top_references_for_theme(ads_df, right_wing_themes, 'Right wing')

Flags of the Confederate States of America 7
Syria 7
Politics 7
Guns & Patriots 4
Immigration 4
Hart of Dixie 4
Tea Party Patriots 4
Anything About Guns 3
Guns.com 3
American Guns 3
Second Amendment Supporters 3
Proud to be an American 3
Redneck Nation 3
Mud & Trucks 3
100 Percent FED Up 3
Chicks On The Right 3
Conservative Tribune 3
Sons of Confederate Veterans 2
Southern United States 2
Support Our Veterans 2
Students for Concealed Carry 2
Protect the Second Amendment 2
AR-15 2
ForAmerica 2
United Daughters of the Confederacy 1


In [66]:
right_wing_themes = right_wing_themes | {
    'Flags of the Confederate States of America',
    'Guns & Patriots',
    'Immigration',
    'Hart of Dixie',
    'Tea Party Patriots',
    'Anything About Guns',
    'American Guns',
    'Guns.com',
    'Second Amendment Supporters',
    'Proud to be an American',
    'Redneck Nation',
    'Mud & Trucks',
    'Conservative Tribune',
    '100 Percent FED Up',
    'Chicks On The Right',
    'Sons of Confederate Veterans',
    'Southern United States',
    'Support Our Veterans',
    'Students for Concealed Carry',
    'Protect the Second Amendment',
    'AR-15',
    'ForAmerica',
    'Confederate Flag',
}
ads_df = label_demographic_rows(ads_df, right_wing_themes, 'Right wing')

This labelled 225 rows!


In [67]:
print_top_references_for_theme(ads_df, right_wing_themes, 'Right wing')

Politics 16
Syria 7
Confederate States Army 4
Support our troops 4
Supporting Our Veterans 3
Southern Pride 3
The Invaders 3
Vietnam Veterans of America 3
Veterans For America 3
Iraq and Afghanistan Veterans of America 3
Vietnam Veterans of America Foundation 3
American Patriot 2
American Patriots 2
Flag of the United States 2
American patriotism 2
Illegal immigration 2
Human migration 2
Disabled American Veterans 2
US Military Veterans 2
Concerned Veterans for America 2
Patriot Nation 2
United Daughters of the Confederacy 1
confederate states america 1
Robert E. Lee 1
Redneck Social Club 1


In [68]:
right_wing_themes = right_wing_themes | {
    'Confederate States Army',
    'Support our troops',
    'Supporting Our Veterans',
    'Southern Pride',
    'The Invaders',
    'Veterans For America',
    'Iraq and Afghanistan Veterans of America',
    'Vietnam Veterans of America Foundation',
    'Vietnam Veterans of America',
    'American patriotism',
    'Flag of the United States',
    'American Patriots',
    'American Patriot',
    'Human migration',
    'Illegal immigration',
    'Disabled American Veterans',
    'US Military Veterans',
    'Concerned Veterans for America',
    'Patriot Nation',
    'confederate states america',
    'Robert E. Lee',
    'Redneck Social Club',
    'United Daughters of the Confederacy'
}
ads_df = label_demographic_rows(ads_df, right_wing_themes, 'Right wing')

This labelled 238 rows!


In [69]:
print_top_references_for_theme(ads_df, right_wing_themes, 'Right wing')

Politics 16
Veterans 8
Syria 7
United States Department of Veterans Affairs 6
Vietnam Veterans Memorial Fund 3
Stop Illegal Immigration 3
Vietnam Veterans Memorial 2
Institute for Veterans and Military Families 2
Vietnam Veterans Against the War 2
vietnam veterans america 2
Chris Kyle 2
Wounded Warrior Project 2
Immigration law 2
Deportation 2
Fox News Politics 1
American History 1
Patriot(American Revolution) 1
The Conservative 1
Conservative News Today 1
College Republicans 1
Conservatism in the United States 1
Concealed carry 1
Tea Party movement 1
Ted Cruz 1
Conservative Republicans of Texas 1


In [70]:
right_wing_themes = right_wing_themes | {
    'Veterans',
    'United States Department of Veterans Affairs',
    'Vietnam Veterans Memorial Fund',
    'Stop Illegal Immigration',
    'vietnam veterans america',
    'Vietnam Veterans Against the War',
    'Institute for Veterans and Military Families',
    'Vietnam Veterans Memorial',
    'Chris Kyle',
    'Wounded Warrior Project',
    'Deportation',
    'Immigration law',
    'Fox News Politics',
    'American History',
    'Patriot(American Revolution)',
    'Conservative News Today',
    'The Conservative',
    'College Republicans',
    'Conservatism in the United States',
    'Concealed carry',
    'Tea Party movement',
    'Ted Cruz',
    'Conservative Republicans of Texas'
}
ads_df = label_demographic_rows(ads_df, right_wing_themes, 'Right wing')

This labelled 247 rows!


In [71]:
print_top_references_for_theme(ads_df, right_wing_themes, 'Right wing')

Politics 17
Syria 7
Fox News Channel 3
Military 2
Laura Ingraham 2
Bill O'Reilly (political commentator) 2
Donald Trump Jr.; Politics US politics (very conservative) 2
Michael Savage 2
Conservatism in the United States; Sean Hannity 2
Christianity 2
Andrew Breitbart breitbart 2
Michelle Malkin 2
Mike Huckabee 2
Jesus; TheBlaze 2
Mike Pence 2
Rush Limbaugh 2
Tucker Carlson 2
Bible 2
Ron Paul 2
Rand Paul 2
The Patriot Post 1
Thank A Soldier 1
U.S. Patriot Tactical 1
AMVETS 1
Emigration 1


In [72]:
right_wing_themes = right_wing_themes | {
    'Fox News Channel',
    'Military',
    'Bill O\'Reilly (political commentator)',
    'Mike Pence',
    'Jesus; TheBlaze',
    'Christianity',
    'Rand Paul',
    'Tucker Carlson',
    'Andrew Breitbart breitbart',
    'Bible',
    'Ron Paul',
    'Michael Savage',
    'Michelle Malkin',
    'Rush Limbaugh',
    'Mike Huckabee',
    'Donald Trump Jr.; Politics US politics (very conservative)',
    'Laura Ingraham',
    'Conservatism in the United States; Sean Hannity',
    'Thank A Soldier',
    'U.S. Patriot Tactical',
    'The Patriot Post',
    'AMVETS',
    'Emigration',
}
ads_df = label_demographic_rows(ads_df, right_wing_themes, 'Right wing')

This labelled 248 rows!


Row count has stopped increasing we move on to the next group.

## New entries to: African-American

In [75]:
print_top_words_from_arrays(ads_df.ad_interests_array[pd.isnull(ads_df['demographic'])], 5)

Don't Shoot 47
Muslim Brotherhood 18
Native American Indian Wisdom 18
Cherokee language 18
Cherokee Nation 18


A web search revealed that 'Don't Shoot' was a group constructed by the IRA after the [shooting of Philando Castille](https://en.wikipedia.org/wiki/Internet_Research_Agency). This was most-likely targeting African-Americans. We add it ot this category before proceeding.

In [76]:
african_american_themes = african_american_themes | {
    'Don\'t Shoot'
}

ads_df = label_demographic_rows(ads_df, african_american_themes, 'African-American')

This labelled 1324 rows!


In [77]:
print_top_references_for_theme(ads_df, african_american_themes, 'African-American')

Gun Owners of America 18
2nd Amendment 18
Self Defense Family 18
The Self Defense Company 18
Martial arts 18
Concealed carry in the United States 14
Justice 13
Gun Rights 12
2016 9
Humanitarianism 7
Humanitarian aid 7
Kemetism 6
Mahatma Gandhi 6
Visual perception 6
Freckle 6
Color 6
Guns & Ammo 5
Jr.. Stop Racism!!. AfricanAmerican culture. African-American Civil Rights Movement (1954-68) 5
Black history. AfricanAmerican Civil Rights Movement(1954-68) 5
Filming Cops 5
Melanin 5
Black panther 5
Equal opportunity 5
Slavery in the United States 5
Culture 5


In [78]:
african_american_themes = african_american_themes | {
    'Jr.. Stop Racism!!. AfricanAmerican culture. African-American Civil Rights Movement (1954-68)',
    'Black history. AfricanAmerican Civil Rights Movement(1954-68)',
    'Filming Cops',
    'Melanin',
    'Black panther',
    'Slavery in the United States'
}

ads_df = label_demographic_rows(ads_df, african_american_themes, 'African-American')

This labelled 1325 rows!


The number of rows has stabilized, we move on to the next group.

## Sixth demographic: Native-American

In [80]:
print_top_words_from_arrays(ads_df.ad_interests_array[pd.isnull(ads_df['demographic'])], 15)

Muslim Brotherhood 18
Native American Indian Wisdom 18
Cherokee language 18
Cherokee Nation 18
Mixed martial arts 17
Martial arts 17
The Women's Self Defense Institute 17
PERSONAL & HOME DEFENSE 17
Self-defense 17
Selfdefense (United States) 17
Personal Defense 17
Right of self-defense 17
Self Defense Family 17
Active Self Protection 17
American Indian Movement 17


Out of the top values of 18 rows, many have to do with Native-American hence our new category.

In [81]:
native_american_themes = {
    'Native American Indian Wisdom',
    'Cherokee language',
    'Cherokee Nation',
    'American Indian Movement'
}
ads_df = label_demographic_rows(ads_df, native_american_themes, 'Native-American')

This labelled 18 rows!


In [82]:
print_top_references_for_theme(ads_df, native_american_themes, 'Native-American')

Native News Online 4
Indian Country Today Media Network 3
Native American civil rights 2
Cherokee 1
Native american culture in the united states 1
American Indian Wars 1
All Things Cherokee 1
Native American Times 1
National Congress of American Indians 1
Native American music 1


In [83]:
native_american_themes = native_american_themes | {
    'Native News Online',
    'Indian Country Today Media Network',
    'Native American civil rights',
    'All Things Cherokee',
    'Cherokee',
    'American Indian Wars',
    'Native american culture in the united states',
    'National Congress of American Indians',
    'Native American Times',
    'Native American music'
}
ads_df = label_demographic_rows(ads_df, native_american_themes, 'Native-American')

This labelled 18 rows!


No new rows were added, we move to the next demographic.

## Seventh demographic: Muslim-American

In [85]:
print_top_words_from_arrays(ads_df.ad_interests_array[pd.isnull(ads_df['demographic'])], 15)

Muslim Brotherhood 18
Mixed martial arts 17
Martial arts 17
The Women's Self Defense Institute 17
PERSONAL & HOME DEFENSE 17
Self-defense 17
Selfdefense (United States) 17
Personal Defense 17
Right of self-defense 17
Self Defense Family 17
Active Self Protection 17
Libertarianism 13
Liberalism 13
Williams&Kalvin 13
Islam 12


In [86]:
muslim_american_themes = {
    'Muslim-Brotherhood',
    'Islam'
}

ads_df = label_demographic_rows(ads_df, muslim_american_themes, 'Muslim-American')

This labelled 12 rows!


In [87]:
print_top_references_for_theme(ads_df, muslim_american_themes, 'Muslim-American')

Allah 6
Quran 6
Muslim Brotherhood 5
Islamism 4
Muhammad 4
Islam in the United States 3
All-american muslim culture 3
Muslim world 3
Muslim American Society 2
State of Palestine 2
Mosque 2
Sunnah 2
Sharia 2
Glossary of Islam 2
Current events 1
Religion 1
Muslim Students' Association 1
Al Jazeera 1
Muslims Are Not Terrorists 1
ProductiveMuslim 1
Hadith 1
Muhammad al-Baqir 1
Hasan ibn Ali 1
Hajj 1
Ahl al-Bayt 1


In [88]:
muslim_american_themes = muslim_american_themes | {
    'Quran',
    'Allah',
    'Muslim Brotherhood',
    'Islamism',
    'Muhammad',
    'Islam in the United States',
    'Muslim world',
    'All-american muslim culture',
    'Muslim American Society',
    'State of Palestine',
    'Mosque',
    'Sunnah',
    'Glossary of Islam',
    'Sharia',
    'Muslim Students\' Association',
    'Religion',
    'Al Jazeera',
    'ProductiveMuslim',
    'Muslims Are Not Terrorists',
    'Muhammad al-Baqir',
    'Hajj',
    'Hasan ibn Ali',
    'Assalamu alaykum',
    'Ahl al-Bayt'
}
ads_df = label_demographic_rows(ads_df, muslim_american_themes, 'Muslim-American')

This labelled 30 rows!


In [89]:
print_top_references_for_theme(ads_df, muslim_american_themes, 'Muslim-American')

Hijra (Islam) 3
Mecca 2
As-salamu alaykum 2
Prophets and messengers in Islam 2
Proud to be A Muslim 2
Zaid Shakir 1
Abu Eesa Niamatullah 1
Islam ; Quran 1
State of Palestine; Muslim world; Mosque 1
Current events 1
Haram 1
Hadith 1
Fiqh 1
All Pakistan Muslim League 1
Ja'far al-Sadiq 1
Zakat 1
Medina 1
Muslim Youth 1
Muslim League (Pakistan) 1
Islam Book 1
Imam Ali Mosque 1
Ana muslim 1
History of Islam 1
Muslims Are Not Terrorists. Islamism 1
Arab world 1


In [90]:
muslim_american_themes = muslim_american_themes | {
    'Hijra (Islam)',
    'Mecca',
    'Prophets and messengers in Islam',
    'As-salamu alaykum',
    'Proud to be A Muslim',
    'Abu Eesa Niamatullah',
    'Zaid Shakir',
    'State of Palestine; Muslim world; Mosque',
    'Islam ; Quran',
    'Haram',
    'Hadith',
    'Muslim Youth',
    'Zakat',
    'Medina',
    'Muslim League (Pakistan)',
    'Imam Ali Mosque',
    'History of Islam',
    'All Pakistan Muslim League',
    'Islam Book',
    'Ana muslim',
    'Fiqh',
    'Ja\'far al-Sadiq',
    'Muslims Are Not Terrorists. Islamism',
    'Arab world'
}
ads_df = label_demographic_rows(ads_df, muslim_american_themes, 'Muslim-American')

This labelled 40 rows!


In [91]:
print_top_references_for_theme(ads_df, muslim_american_themes, 'Muslim-American')

Muslims for America 10
Current events 1
Prod uctiveMuslim 1
Hillary Clinton 1
Islam ism 1
Ramadan 1
Allah Akbr 1


In [92]:
muslim_american_themes = muslim_american_themes | {
    'Muslims for America',
    'Current events',
    'Prod uctiveMuslim',
    'Hillary Clinton',
    'Islam ism',
    'Ramadan',
    'Allah Akbr'
}
ads_df = label_demographic_rows(ads_df, muslim_american_themes, 'Muslim-American')

This labelled 40 rows!


The number of rows has stabilized we move on to the next demographic.

## Eight demographic: Self-Defence

In [94]:
print_top_words_from_arrays(ads_df.ad_interests_array[pd.isnull(ads_df['demographic'])], 15)

Mixed martial arts 17
Martial arts 17
The Women's Self Defense Institute 17
PERSONAL & HOME DEFENSE 17
Self-defense 17
Selfdefense (United States) 17
Personal Defense 17
Right of self-defense 17
Self Defense Family 17
Active Self Protection 17
Libertarianism 13
Liberalism 13
Williams&Kalvin 13
Free software 10
Stop A.1. 10


In [95]:
self_defense = {
    'Mixed martial arts',
    'Martial arts',
    'The Women\'s Self Defense Institute',
    'PERSONAL & HOME DEFENSE',
    'Self-defense',
    'Selfdefense (United States)',
    'Personal Defense',
    'Right of self-defense',
    'Self Defense Family',
    'Active Self Protection'
}
ads_df = label_demographic_rows(ads_df, self_defense, 'Self-Defense')

This labelled 17 rows!


In [96]:
print_top_references_for_theme(ads_df, self_defense, 'Self-Defense')

No new terms where related with these rows. We move on to the next demographic.

## Adding to Muslim-American and Right wing

In [97]:
print_top_words_from_arrays(ads_df.ad_interests_array[pd.isnull(ads_df['demographic'])], 15)

Libertarianism 13
Liberalism 13
Williams&Kalvin 13
Free software 10
Stop A.1. 10
United Muslims of America 9
Antelope Valley College 9
Bernie Sanders 8
Police 8
Blacktivist 6
Police officer 6
Jesus Daily 6
Texas 6
free music 5
Law enforcement 5


Libertarianism seems to belong in the right-wing category, while United Muslims of America should be put in Muslim Americain.

In [98]:
muslim_american_themes = muslim_american_themes | {
    'United Muslims of America'
}
ads_df = label_demographic_rows(ads_df, muslim_american_themes, 'Muslim-American')

This labelled 49 rows!


In [99]:
print_top_references_for_theme(ads_df, muslim_american_themes, 'Muslim-American')

No new rows where found!

In [101]:
right_wing_themes = right_wing_themes | {
    'Libertarianism',
    'Williams&Kalvin'
}
ads_df = label_demographic_rows(ads_df, right_wing_themes, 'Right wing')

This labelled 274 rows!


Although we have seen the number of 'Right wing' entries increase, we also have very few rows left and will use a manual process for these last entries.

In [105]:
print('At this point, only ' + str(pd.isnull(ads_df['demographic']).sum()) + ' ads are unlabelled.')

At this point, only 120 ads are unlabelled.


In [106]:
 for v in ads_df[pd.isnull(ads_df['demographic'])]['ad_targeting_interests']:
    print(v)

Free software 
Free software 
Bernie Sanders
 free music or Free software
Reading
Blacktivist
 Grooveshark, Last.fm, SoundCloud, Vevo, Shazam (service) or Google Play Music
 Bernie Sanders, Social democracy, Liberalism or Democratic Party (United States)
Right-wing politics 
Blacktivist
 Antelope Valley College, 
 Free software 
 Antelope Valley College, 
Stop A.1.
Blacktivist
 Black Economic Empowerment 
Copwatch Rodney King; Police brutality in the United States; Stop Police Brutality; Cop Block or Photography is Not a Crime
Born Liberal
 Bernie Sanders or Liberalism
 Security alarm, Police, National security, Security guard, Police officer or Safety
Detroit
History 
Stop A.1.
Police, Law enforcement or Police officer
Stop A.1.
 Music or Rock music 
Iraq War Veterans, Veterans of Foreign Wars, Support Our Veterans. Veterans benefits support; Disabled American Veterans, Veterans Advantage or Dysfunctional Veterans
 State police, Law enforcement in the United States, Police, Sheriffs i

I manually looked at ads from the output above and identified the following trends:

* Adds containing Grooveshark, Music and Free  software were aimed at people willing to install free software from the group. The software would spread more adds to the user's friends.


* African-American entries containing (Black,  Blacktivist)


* Right-wing politics (Politics, Veterans, jesus, Secured Borders, Texas) seem to be aim at republican leaning demographic and will be included under the "Right-wing" umbrella category.


* Left-wing politics (Bernie Sanders, Innocence Project, Born Liberal)

In [107]:
# Reading is related to a group called South United
# Stop A.1. is about illegal immigration
# History is about a confederate group
# Fitness and wellness, Sports is about a pro police movement
# Automobiles as the ad stated "drive like a patriot"
right_wing_themes = right_wing_themes | {
    'Police',
    'Texas',
    'Heart of Texas',
    'Secured Borders',
    'Politics',
    'Veterans',
    'Jesus',
    'jesus love u',
    'Right-wing politics',
    'National Police Wives Association',
    'Police; Law Enforcement Today',
    'Right Wing News',
    'Syria',
    'State police',
    'Iraq War Veterans',
    'jesus love u. I Am a Child of God. Jesus Daily',
    'Jesus Daily',
    'Reading',
    'Stop A.1.',
    'Stop A. I.',
    'History',
    'Fitness and wellness',
    'Sports',
    'Automobiles'
}

ads_df = label_demographic_rows(ads_df, right_wing_themes, 'Right wing')

This labelled 329 rows!


In [108]:
free_music_software_themes = {
    'Grooveshark',
    'Free software',
    'Music'
}

ads_df = label_demographic_rows(ads_df, free_music_software_themes, 'Free music software')

This labelled 18 rows!


In [109]:
# Antelope Valley College is about stop police brutality
# Facism is also about police brutality
# Tax is targeted at African-American through the Behavior filter
# BM is for Black Matters
african_american_themes = african_american_themes | {
    'Blacktivist',
    'Black Economic Empowerment',
    'Racism in the United States Interest',
    'Understanding racial segregation in the united states',
    'African Methodist Episcopal Zion Church',
    'Copwatch Rodney King; Police brutality in the United States; Stop Police Brutality; Cop Block',
    'Detroit',
    'Stop Police Brutality',
    'Antelope Valley College',
    'Fascism',
    'Tax',
    'Trayvon Martin',
    'BM'
}

ads_df = label_demographic_rows(ads_df, african_american_themes, 'African-American')

This labelled 1351 rows!


In [110]:
left_wing_themes = {
    'Bernie Sanders',
    'Innocence Project',
    'Born Liberal',
    'Liberalism',
    'Homeless shelter',
}

ads_df = label_demographic_rows(ads_df, left_wing_themes, 'Left wing')

This labelled 18 rows!


In [111]:
native_american_themes = native_american_themes | {
    'Standing Rock Indian Reservation'
}
ads_df = label_demographic_rows(ads_df, native_american_themes, 'Native-American')

This labelled 19 rows!


In [112]:
memes_themes = memes_themes | {
    'Memopolis'
}
ads_df = label_demographic_rows(ads_df, memes_themes, 'Memes')

This labelled 138 rows!


At this point, the entry below is the only entry remaining and even after looking at the file it is difficult to identify what demographic the group was targeting. We will simply drop this row.

In [114]:
ads_df[pd.isnull(ads_df['demographic'])]

Unnamed: 0,file_name,ad_targeting_interests,ad_impressions,ad_clicks,ad_spend,ad_creation_date,ad_end_date,ad_interests_array,demographic
1710,P(1)0002143.txt,Landscape painting or Landscape,162,3,18.1,2015-06-09,2015-06-10,"[Landscape painting, Landscape]",


In [115]:
ads_df['demographic'].value_counts()

African-American       1351
Right wing              329
Mexican-American        189
Memes                   138
LGBT                     97
Muslim-American          49
Native-American          19
Free music software      18
Left wing                18
Self-Defense             17
Name: demographic, dtype: int64

In [116]:
ads_df = ads_df[~pd.isnull(ads_df['demographic'])]

We have now added a demographic tag to nearly all our rows and taken the time to periodicaly assess if the association we were making were correct by verify the related ads keywords.

In [118]:
ads_df.to_csv('../clean_data/labeled_clean_data.csv', index=None, header=True)

Shortcomings of this approach:

* Order matters
  * Labeled data is not visited after it is labeled the first time. This technique may not work as well if ads had belonged strongly to more than 1 category.
  
* Not as thorough as a manual review
  * Some studies used Mechanical Turks to review their labeling. Ideally we should do so as well.

* Subjective
  * Although choosing whether an ad targeted a demographic is a subjective exercise when using this method, I believe this manual labeling gave us a better understanding of the way the IRA picked the interests themselves. In comparison, a purely computational approach result's were difficult to tie back with the ads and interpret even though the method used (tfidf and kmeans) were very simple. Additionally, the name of the demographic group is also subjective and can influence the perception and understanding of the IRA's intent when used in graph. To reduce this influence, I have tried to keep away from names that are politically charged as much as possible.

This concludes demographic labeling. We can move on to the analysis notebook.