The same name can be spelled out in a many ways (for example, Marc and Mark, or Elizabeth and Elisabeth). Sound can, therefore, be a better way to match names than spelling. In this project, we use the Python package Fuzzy to find out the genders of authors that have appeared in the New York Times Best Seller list for Children's Picture books.
Fuzzy is a python library implementing common phonetic algorithms quickly. Typically this is in string similarity exercises, but they’re pretty versatile.
It uses C Extensions (via Cython) for speed.
First, using fuzzy (sound) name matching, you will search for author names in a dataset provided by the US Social Security Administration that contains names and genders of all individuals who have applied for Social Security Cards. Next, we'll aggregate the author dataset by including gender. Finally, we will use the new dataset to plot the gender distribution of children's picture books authors over time.
Technologies used:
- Pandas DataFrames
- Numpy for basic statistics
- Matplotlib for plotting.