<a href="https://colab.research.google.com/github/MatthewAlexOBrien/All-The-News/blob/master/code/ethnicity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Identifying Race and Gender**

See: ***ethnicity.ipynb***

The race of named subjects is identified with two methods.
* i. DEEPFACE We use Google Images and DeepFace https://pypi.org/project/deepface/ to estimate ethnicity and gender from photographs. DeepFace has a reported overall accuracy of 97% for predicting gender and 72% for predicting race. We obtain images of named subjects by searching for each name on Google and downloading the first 4 images.

* ii. ETHNICOLR We use the ethnicolr package https://ethnicolr.readthedocs.io/ethnicolr.html to predict race from the letter sequences in each subject's name. The package authors describe their procedure in a corresponding paper https://arxiv.org/pdf/1805.02109.pdf: they train deep neural models on US voter registration data to identify the letter sequences that correspond most closely to specific ethnic origins. The measure provides 85% accuracy when both the first and last name are identified.
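
The 85% figure applies only when both a first and a last name are available, so a full name has to be split into those two parts before a name-based model can use it. The helper below is a minimal sketch of that step. `split_full_name` is a hypothetical name that does not appear in the notebook, and treating the first token as the first name and the final token as the last name is an assumption that will mishandle some names (multi-word surnames, for example).

```python
def split_full_name(full_name):
    """Split a full name into (first, last); hypothetical helper, not from the notebook.

    Assumption: the first whitespace-separated token is the first name and the
    final token is the last name. Single-token names get an empty first name.
    """
    tokens = full_name.strip().split()
    if not tokens:
        return ("", "")
    if len(tokens) == 1:
        return ("", tokens[0])
    return (tokens[0], tokens[-1])


# made-up example inputs
examples = ["Angela Merkel", "Madonna", "Martin Luther King"]
for full_name in examples:
    first, last = split_full_name(full_name)
    print(repr(full_name), "-> first:", repr(first), "last:", repr(last))
```

Names that come back with an empty first name fall outside the case where both names are identified, so the quoted accuracy should not be assumed for them.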

**Install and Import Libraries**


In [None]:
%%capture

# Packages not pre-installed on Python 3.7
!pip3 install ethnicolr
!pip3 install opencv-python  # provides the cv2 module; "cv2" is not a pip package name
!pip3 install deepface
!pip3 install google_images_download

In [None]:
%%capture

# Imports
import glob
import os      # used below for os.getcwd() and os.remove()
import random

import cv2
import numpy as np
import pandas as pd
from deepface import DeepFace
from google_images_download import google_images_download

**Static Functions**

In [None]:
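# Search Google Images for a query and download the first n images.
# Relies on the global `response` downloader created in the
# "Get Images of Subjects From Google" section below.
def downloadimages(query, n=3):
    arguments = {"keywords": query,
                 "format": "jpg",
                 "limit": n,
                 "print_urls": True,
                 "size": "medium"}
    try:
        response.download(arguments)
    except Exception as err:
        # report the failure and move on rather than stopping the whole run
        print('download failed for ' + str(query) + ': ' + str(err))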


**Import and Clean Data**

In [None]:
# import
path = '.../Sentences'
all_files = glob.glob(path + "/*.csv")
li = []

# concatenating sentence dataframes together
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

# basic cleaning: keep the relevant columns, strip list punctuation ([, ], ')
# from the subject names, and keep only sentences with exactly one named subject
sentences = pd.concat(li, axis=0, ignore_index=True)
sentences = sentences[['year', 'month', 'day', 'publication', 'article_id', 'sentence_id', 'sentence', 'sentence_subject_names']]
sentences['sentence_subject_names'] = sentences['sentence_subject_names'].replace(r"[\[\]']", '', regex=True)
sentences = sentences[sentences['sentence_subject_names'].map(lambda d: len(d.split(','))) == 1]
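
The two cleaning steps on `sentence_subject_names` are easier to follow on plain strings. The sketch below applies the same idea outside pandas. It assumes the column holds stringified lists such as `"['Angela Merkel']"`, which the bracket-and-quote stripping suggests but the notebook does not show. The function names and sample values are made up for illustration.

```python
import re


def clean_subject_names(raw):
    """Strip the list punctuation ([, ], ') from a stringified list of names."""
    return re.sub(r"[\[\]']", "", raw)


def has_single_subject(cleaned):
    """True when the cleaned string holds exactly one comma-separated entry."""
    return len(cleaned.split(",")) == 1


# made-up sample values in the assumed stringified-list form
raw_values = ["['Angela Merkel']", "['Angela Merkel', 'Barack Obama']"]
for raw in raw_values:
    cleaned = clean_subject_names(raw)
    print(cleaned, "| single subject:", has_single_subject(cleaned))
```

Note that an empty string also counts as a single entry under this rule, so rows with no named subject would need a separate filter if they occur.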

**Get Images of Subjects From Google**

In [None]:
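# list of names to extract images for
names = sentences['sentence_subject_names'].unique()

# names processed by the loops below; assumed here to be every name
# (slice it, e.g. names[:100], for a trial run)
names_small = names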

# inputs
response = google_images_download.googleimagesdownload()
cascPath = ".../python3.7/site-packages/cv2/data/haarcascade_frontalface_default.xml"
faceCascade = cv2.CascadeClassifier(cascPath)

In [None]:
# download images for each name, keeping only images with exactly 1 face
for name in names_small:

    # downloading images
    try:
        downloadimages(name, n=4)
    except Exception:
        print('issue with ' + str(name))

    # get list of downloaded images for this name
    folder = os.path.join(os.getcwd(), 'downloads', str(name))
    files = glob.glob(os.path.join(folder, '*'))

    # remove every image that does not contain exactly 1 face
    for image_path in files:
        image = cv2.imread(image_path)
        if image is None:
            # cv2.imread returns None for unreadable or corrupt files
            os.remove(image_path)
            continue
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        faces = faceCascade.detectMultiScale(
            gray,
            scaleFactor=1.1,
            minNeighbors=5,
            minSize=(30, 30),
            flags=cv2.CASCADE_SCALE_IMAGE
        )
        if len(faces) != 1:
            os.remove(image_path)

    # if more than 2 clean images remain, randomly keep only 2
    files_clean = glob.glob(os.path.join(folder, '*'))
    if len(files_clean) > 2:
        keep_images = random.sample(files_clean, 2)
        for drop_image in files_clean:
            if drop_image not in keep_images:
                os.remove(drop_image)
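
The selection rule in the loop above (keep only single-face images, then randomly keep 2 if more remain) can be checked without touching the file system. The sketch below is one way to express it as a pure function. `select_images`, its `seed` parameter, and the sample file names are hypothetical and not part of the notebook.

```python
import random


def select_images(face_counts, keep=2, seed=None):
    """Return the image paths to keep.

    face_counts maps an image path to the number of faces detected in it.
    Only single-face images are eligible; if more than `keep` are eligible,
    a random sample of size `keep` is returned.
    """
    eligible = sorted(path for path, count in face_counts.items() if count == 1)
    if len(eligible) <= keep:
        return eligible
    rng = random.Random(seed)  # a fixed seed makes the choice repeatable
    return sorted(rng.sample(eligible, keep))


# made-up detection counts for five downloaded images
counts = {"a.jpg": 1, "b.jpg": 0, "c.jpg": 3, "d.jpg": 1, "e.jpg": 1}
kept = select_images(counts, keep=2, seed=0)
print(kept)
```

Separating the decision from the deletion this way also makes it possible to log which images were dropped before removing anything.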
    

**Get Gender and Race**

In [None]:
# collect one row per name: race, gender, and match distance
rows = []

for name in names_small:
    image_folder = os.path.join(os.getcwd(), 'downloads', str(name))
    images = glob.glob(os.path.join(image_folder, '*'))

    # distance metric between the two images (only defined when exactly 2 remain)
    distance = np.nan
    if len(images) == 2:
        try:
            verify = DeepFace.verify(images[0], images[1])
            distance = verify['distance']
        except Exception:
            print('issue with verifying ' + str(name))

    # race and gender from this name's first image;
    # an empty image list raises IndexError and falls through to NaN
    try:
        result = DeepFace.analyze(images[0], actions=['gender', 'race'])
        gender, dominant_race = result['gender'], result['dominant_race']
        dominant_race_pct, white_pct = result['race'][dominant_race], result['race']['white']
    except Exception:
        print('issue with image ' + str(name))
        gender, dominant_race, dominant_race_pct, white_pct = np.nan, np.nan, np.nan, np.nan

    rows.append([name, distance, gender, dominant_race, dominant_race_pct, white_pct])

# build the dataframe of names in one step
names_df = pd.DataFrame(rows, columns=['sentence_subject_names', 'distance', 'gender', 'dominant_race', 'dominant_race_pct', 'white_pct'])
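
The field extraction in the loop above depends on the shape of the analysis result, so it can help to isolate it in a small function that is testable without running a model. The sketch below assumes the result is a dict with `'gender'`, `'dominant_race'` and `'race'` keys, as the loop uses. It also accepts a list whose first item is such a dict, on the assumption (not confirmed by this notebook) that some deepface releases return one result per detected face. `summarize_analysis` and the mock values are hypothetical.

```python
import math


def summarize_analysis(result):
    """Pull the gender and race fields out of one analysis result.

    Assumption: `result` is a dict with 'gender', 'dominant_race' and 'race'
    keys, or a list whose first item is such a dict. Anything empty yields NaNs.
    """
    if isinstance(result, list):
        result = result[0] if result else None
    if not result:
        return {"gender": math.nan, "dominant_race": math.nan,
                "dominant_race_pct": math.nan, "white_pct": math.nan}
    dominant_race = result["dominant_race"]
    return {"gender": result["gender"],
            "dominant_race": dominant_race,
            "dominant_race_pct": result["race"][dominant_race],
            "white_pct": result["race"]["white"]}


# hand-written stand-in for an analysis result; the numbers are made up
mock_result = {"gender": "Woman",
               "dominant_race": "asian",
               "race": {"asian": 80.0, "white": 5.0, "black": 15.0}}
print(summarize_analysis(mock_result))
```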

In [None]:
# merging race and gender to sentence dataframe
sentences = sentences.merge(names_df, on='sentence_subject_names', how='left')