# <center> <div style="width: 370px;"> ![Exercise](./../Pandas/pictures/Exercise.jpg)

# <center> DataFrame Mini Project: Predict Gender from Name

In this project we generate a dataset of fake names and predict each person's gender from their first name. For example, a name like 'Alireza' is most likely male.

## Required Libraries

To do so, we use the following libraries:

- `Faker`: To generate fake names
- `names-dataset`: To get gender and country info

Let's first install the libraries by running:

In [1]:
!pip install Faker
!pip install names-dataset

Collecting Faker
  Obtaining dependency information for Faker from https://files.pythonhosted.org/packages/73/51/cbc859707aa0fc0ad3819ffb3bdaeee28d10d5ef30150ed9d16691ac3795/Faker-19.6.1-py3-none-any.whl.metadata
  Using cached Faker-19.6.1-py3-none-any.whl.metadata (15 kB)
Using cached Faker-19.6.1-py3-none-any.whl (1.7 MB)
Installing collected packages: Faker
Successfully installed Faker-19.6.1
Collecting names-dataset
  Using cached names-dataset-3.1.0.tar.gz (58.4 MB)
  Preparing metadata (setup.py) ... done
Collecting pycountry (from names-dataset)
  Using cached pycountry-22.3.5.tar.gz (10.1 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: names-dataset, pycountry
  Building wheel for names-dataset (setup.py) ... done
  Created wheel for names-dataset: filename=names_dataset-3.1.0-py3-none-any.whl s

In [3]:
from names_dataset import NameDataset
from faker import Faker
import numpy as np
import pandas as pd

In [4]:
en_name = Faker()         # default locale (en_US)
fa_name = Faker('fa_IR')  # Persian (Iran) locale

## Make Fake Name Dataset

In [5]:
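def get_name():
    """Return a fake full name, Persian or English with equal probability."""
    rand = np.random.randint(0, 2)  # 0 or 1 (the upper bound is exclusive)
    if rand == 0:
        return fa_name.name()
    return en_name.name()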

In [6]:
df = pd.DataFrame({'Name': [get_name() for _ in range(10)]})

In [7]:
df

|   | Name |
|---|---|
| 0 | Elizabeth Walsh |
| 1 | الناز پویان |
| 2 | Christine Mercado |
| 3 | Timothy Garcia |
| 4 | Rebecca Davis |
| 5 | معصومه عبدالعلی |
| 6 | جناب آقای علي هاشمی |
| 7 | Leslie Moore |
| 8 | محمدحسین رستمی |
| 9 | Karen Pham |
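
Because `get_name` draws from random generators, the sample above changes on every run. Below is a minimal sketch of making the coin flip reproducible. It makes two substitutions so that it runs without extra installs: the lists `FA_NAMES` and `EN_NAMES` are hypothetical stand-ins for the real `Faker` objects (filled with names from the sample output), and a seeded `random.Random` from the standard library replaces `np.random`.

```python
import random

# Stand-ins for fa_name.name() / en_name.name() (hypothetical data,
# filled with names from the sample output above).
FA_NAMES = ['الناز پویان', 'معصومه عبدالعلی', 'محمدحسین رستمی']
EN_NAMES = ['Elizabeth Walsh', 'Timothy Garcia', 'Karen Pham']


def make_names(n, seed):
    """Return n names, each Persian or English with equal probability."""
    rng = random.Random(seed)  # private, seeded generator
    names = []
    for _ in range(n):
        pool = FA_NAMES if rng.randint(0, 1) == 0 else EN_NAMES
        names.append(rng.choice(pool))
    return names


# The same seed always yields the same sample.
first_run = make_names(10, seed=42)
second_run = make_names(10, seed=42)
print(first_run == second_run)  # True
```

With the real generators, the analogous step would presumably be to seed both NumPy (`np.random.seed`) and Faker (`Faker.seed`) before building the DataFrame.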


In [8]:
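def real_name(name):
    """Remove honorifics and titles (e.g. 'جناب', 'Dr.') from a full name."""
    list_name = name.split()
    # Words that are titles rather than part of the actual name
    forbidden = ['جناب', 'آقای', 'دکتر', 'سرکار', 'خانم', 'مهندس', 'Dr.']
    list_name = [word for word in list_name if word not in forbidden]
    return " ".join(list_name)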
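
To see what the cleaning step does, here is a small self-contained sketch of the same filtering idea. The function name `strip_titles` and the English titles `'Mr.'`, `'Mrs.'`, `'Ms.'` and `'Miss'` are additions for illustration and are not part of the original list.

```python
# Titles to drop. The English ones besides 'Dr.' are an assumption about
# which prefixes may show up in generated names.
TITLES = {'جناب', 'آقای', 'دکتر', 'سرکار', 'خانم', 'مهندس',
          'Dr.', 'Mr.', 'Mrs.', 'Ms.', 'Miss'}


def strip_titles(name):
    """Return the name with every word listed in TITLES removed."""
    return ' '.join(word for word in name.split() if word not in TITLES)


print(strip_titles('جناب آقای علي هاشمی'))  # علي هاشمی
print(strip_titles('Dr. Leslie Moore'))      # Leslie Moore
print(strip_titles('Karen Pham'))            # Karen Pham
```

A set is used here only because membership checks read naturally with it; for a list this short the performance difference is negligible.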


In [9]:
# First name: first word once titles are removed; last name: last word of the full name
df['first_name'] = df['Name'].apply(lambda full_name: real_name(full_name).split()[0])
df['last_name'] = df['Name'].apply(lambda full_name: full_name.split()[-1])

In [10]:
df

|   | Name | first_name | last_name |
|---|---|---|---|
| 0 | Elizabeth Walsh | Elizabeth | Walsh |
| 1 | الناز پویان | الناز | پویان |
| 2 | Christine Mercado | Christine | Mercado |
| 3 | Timothy Garcia | Timothy | Garcia |
| 4 | Rebecca Davis | Rebecca | Davis |
| 5 | معصومه عبدالعلی | معصومه | عبدالعلی |
| 6 | جناب آقای علي هاشمی | علي | هاشمی |
| 7 | Leslie Moore | Leslie | Moore |
| 8 | محمدحسین رستمی | محمدحسین | رستمی |
| 9 | Karen Pham | Karen | Pham |


## Predict Gender and Country

In this section, we use the `names-dataset` library to look up the `gender`, `gender probability`, `country` and `country probability` of each first name, and store the results as new `pandas` columns.

---

Using `names-dataset`, you can get gender and country info from a name:

In [16]:
nd = NameDataset()

In [22]:
nd.search('Alireza')

{'first_name': {'country': {'United Arab Emirates': 0.007,
   'Afghanistan': 0.008,
   'Canada': 0.021,
   'Germany': 0.016,
   'United Kingdom': 0.01,
   'Iran, Islamic Republic of': 0.887,
   'Italy': 0.006,
   'Netherlands': 0.009,
   'Sweden': 0.01,
   'United States': 0.026},
  'gender': {'Female': 0.02, 'Male': 0.98},
  'rank': {'United Arab Emirates': 3654,
   'Afghanistan': 259,
   'Canada': 653,
   'Germany': 1256,
   'United Kingdom': 2583,
   'Iran, Islamic Republic of': 13,
   'Italy': 6458,
   'Netherlands': 2024,
   'Sweden': 283,
   'United States': 4266}},
 'last_name': {'country': {'United Arab Emirates': 0.013,
   'Belgium': 0.006,
   'Bahrain': 0.021,
   'Canada': 0.016,
   'Germany': 0.018,
   'United Kingdom': 0.013,
   'Iraq': 0.008,
   'Iran, Islamic Republic of': 0.744,
   'Saudi Arabia': 0.143,
   'United States': 0.018},
  'gender': {},
  'rank': {'Bahrain': 5587,
   'Iran, Islamic Republic of': 547,
   'United Arab Emirates': None,
   'Belgium': None,
   'Can

---
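
The lookup result above is a nested dictionary: for each of `first_name` and `last_name` it holds `country`, `gender` and `rank` mappings. Below is a minimal sketch of reading the most probable gender and country out of such a result. It uses a hand-copied part of the output above rather than a live lookup, and the helper name `most_probable` is chosen here for illustration.

```python
# Part of the nd.search('Alireza') output shown above, copied by hand.
result = {
    'first_name': {
        'country': {'Canada': 0.021,
                    'Iran, Islamic Republic of': 0.887,
                    'United States': 0.026},
        'gender': {'Female': 0.02, 'Male': 0.98},
    },
}


def most_probable(scores):
    """Return (label, probability) for the highest-scoring label."""
    label = max(scores, key=scores.get)
    return label, scores[label]


info = result['first_name']
print(most_probable(info['gender']))   # ('Male', 0.98)
print(most_probable(info['country']))  # ('Iran, Islamic Republic of', 0.887)
```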

In [11]:
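def get_gender(first_name):
    """Return the most probable gender for a first name, or None if unknown."""
    info = nd.search(first_name)['first_name']
    # No entry for this name, or an entry without gender statistics
    if info is None or not info['gender']:
        return None

    return max(info['gender'], key=info['gender'].get)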

In [12]:
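def get_gender_p(first_name):
    """Return the probability of the most probable gender, or None if unknown."""
    info = nd.search(first_name)['first_name']
    if info is None or not info['gender']:
        return None

    return max(info['gender'].values())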

In [13]:
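def get_country(first_name):
    """Return the most probable country for a first name, or None if unknown."""
    info = nd.search(first_name)['first_name']
    if info is None or not info['country']:
        return None

    return max(info['country'], key=info['country'].get)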

In [14]:
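def get_country_p(first_name):
    """Return the probability of the most probable country, or None if unknown."""
    info = nd.search(first_name)['first_name']
    if info is None or not info['country']:
        return None

    return max(info['country'].values())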
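
Each of the four helpers above runs its own search for the same name. One possible alternative is a single function that searches once and returns all four values. The sketch below assumes a `search` callable that returns a dictionary of the same shape as `nd.search`. The names `fake_search` and `describe_name` are hypothetical, and `fake_search` is a tiny stand-in so the example runs without loading the real dataset.

```python
def fake_search(first_name):
    """Stand-in for nd.search with two hand-written entries."""
    known = {
        'Alireza': {'country': {'Iran, Islamic Republic of': 0.887,
                                'United States': 0.026},
                    'gender': {'Female': 0.02, 'Male': 0.98}},
        'NoGender': {'country': {'Canada': 1.0}, 'gender': {}},
    }
    return {'first_name': known.get(first_name), 'last_name': None}


def describe_name(first_name, search=fake_search):
    """Return (gender, gender_p, country, country_p); None where unknown."""
    info = search(first_name)['first_name'] or {}
    values = []
    for key in ('gender', 'country'):
        scores = info.get(key) or {}
        if scores:
            best = max(scores, key=scores.get)
            values += [best, scores[best]]
        else:
            values += [None, None]  # name or statistic not available
    return tuple(values)


print(describe_name('Alireza'))
# ('Male', 0.98, 'Iran, Islamic Republic of', 0.887)
print(describe_name('NoGender'))  # (None, None, 'Canada', 1.0)
print(describe_name('Xyzzy'))     # (None, None, None, None)
```

With the real dataset, the intended use would be to pass `search=nd.search`.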

In [17]:
df['gender'] = df['first_name'].apply(get_gender)

In [18]:
df['gender probability'] = df['first_name'].apply(get_gender_p)

In [19]:
df['country'] = df['first_name'].apply(get_country)

In [20]:
df['country probability'] = df['first_name'].apply(get_country_p)

In [21]:
df

|   | Name | first_name | last_name | gender | gender probability | country | country probability |
|---|---|---|---|---|---|---|---|
| 0 | Elizabeth Walsh | Elizabeth | Walsh | Female | 0.991 | United States | 0.324 |
| 1 | الناز پویان | الناز | پویان | Female | 0.922 | Iran, Islamic Republic of | 0.753 |
| 2 | Christine Mercado | Christine | Mercado | Female | 0.994 | France | 0.383 |
| 3 | Timothy Garcia | Timothy | Garcia | Male | 0.988 | United States | 0.53 |
| 4 | Rebecca Davis | Rebecca | Davis | Female | 0.993 | United States | 0.327 |
| 5 | معصومه عبدالعلی | معصومه | عبدالعلی | Female | 0.912 | Iraq | 0.413 |
| 6 | جناب آقای علي هاشمی | علي | هاشمی | Male | 0.964 | Iraq | 0.467 |
| 7 | Leslie Moore | Leslie | Moore | Female | 0.82 | United States | 0.519 |
| 8 | محمدحسین رستمی | محمدحسین | رستمی | Male | 0.969 | Iran, Islamic Republic of | 0.546 |
| 9 | Karen Pham | Karen | Pham | Female | 0.981 | United States | 0.286 |
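
Not every prediction is equally certain: 'Leslie' gets a gender probability of only 0.82 in the table above. Below is a minimal sketch of flagging low-confidence rows, using a few rows copied from that table. The threshold of 0.9 and the `uncertain` column name are assumptions for illustration, not part of the original project.

```python
import pandas as pd

# A few rows copied from the result table above.
sample = pd.DataFrame({
    'first_name': ['Elizabeth', 'Leslie', 'Timothy', 'Karen'],
    'gender': ['Female', 'Female', 'Male', 'Female'],
    'gender probability': [0.991, 0.82, 0.988, 0.981],
})

THRESHOLD = 0.9  # arbitrary cut-off chosen for illustration

# Mark predictions whose probability falls below the threshold.
sample['uncertain'] = sample['gender probability'] < THRESHOLD

print(sample.loc[sample['uncertain'], 'first_name'].tolist())  # ['Leslie']
```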
