# Name Data 

Author: Jade Aidoghie  
Date: 3/3/2024

# 1. Background
This notebook explores a dataset of first names labeled by gender before I integrate it into my Name Generator application. The upcoming update will let users select names by gender. Last names are typically unisex, so this feature focuses on the gender of first names only.

The data consists of baby names from the US, UK, Canada, and Australia. More information about the data is available on Kaggle, where I sourced it.

> Data: [Gender By Name - Kaggle](https://www.kaggle.com/datasets/rupindersinghrana/gender-by-name)

# 2. Preliminary Exploration
This section explores the dataset to understand its structure and contents before altering the data.

In [125]:
import pandas as pd
import plotly.express as px

names = pd.read_csv('name_gender_dataset.csv')
names.head()


Unnamed: 0,Name,Gender,Count,Probability
0,James,M,5304407,0.014517
1,John,M,5260831,0.014398
2,Robert,M,4970386,0.013603
3,Michael,M,4579950,0.012534
4,William,M,4226608,0.011567


In [126]:
names.info() # Full report on columns and data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147269 entries, 0 to 147268
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Name         147269 non-null  object 
 1   Gender       147269 non-null  object 
 2   Count        147269 non-null  int64  
 3   Probability  147269 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 4.5+ MB


In [127]:
names.shape # The dimensions of the data 

(147269, 4)

In [128]:
names.isnull().sum() # NULLS for each column

Name           0
Gender         0
Count          0
Probability    0
dtype: int64

In [129]:
names.Name.nunique() # Unique values

133910
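
The row count (147,269) is larger than the number of unique names (133,910), which suggests that some names appear on more than one row, presumably once per gender. A minimal sketch of how such names could be listed, using a made-up toy DataFrame rather than the real dataset (the rows and the `shared` variable are illustrative assumptions):

```python
import pandas as pd

# Toy rows for illustration only; not taken from the dataset
toy = pd.DataFrame({
    "Name": ["Alex", "Alex", "Mary", "John"],
    "Gender": ["M", "F", "F", "M"],
})

# Count how many distinct gender labels each name carries
genders_per_name = toy.groupby("Name")["Gender"].nunique()

# Names that carry more than one gender label
shared = sorted(genders_per_name[genders_per_name > 1].index)
print(shared)  # ['Alex']
```

Running the same `groupby` on `names` should show which names the later sections will count under both F and M.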

In [130]:
names.describe() # Looking for rarity in count

Unnamed: 0,Count,Probability
count,147269.0,147269.0
mean,2481.161,6.790295e-06
std,46454.72,0.0001271345
min,1.0,2.73674e-09
25%,5.0,1.36837e-08
50%,17.0,4.65246e-08
75%,132.0,3.6125e-07
max,5304407.0,0.01451679


# 3. Data Transformation
I only need the name and gender from this dataset, so I'll remove the other columns. I'll also remove names that aren't used frequently (rows with a count below 499), since rare names would clutter the results the application offers. The cleaned version, with duplicate name and gender pairs removed, will be saved to a new CSV file and used in the application.

In [131]:
# Rows with a Count below 499 are treated as infrequent
infrequent = names[names['Count'] < 499]
infrequent.head(30)

Unnamed: 0,Name,Gender,Count,Probability
18159,Akshaya,F,498,1e-06
18160,Alyia,F,498,1e-06
18161,Bama,F,498,1e-06
18162,Cherrell,F,498,1e-06
18163,Deandre,F,498,1e-06
18164,Debanhi,F,498,1e-06
18165,Digna,F,498,1e-06
18166,Gwendlyn,F,498,1e-06
18167,Jaysa,F,498,1e-06
18168,Krystian,F,498,1e-06


In [132]:
# Dropping the infrequent names from the original DataFrame
names = names.drop(infrequent.index)
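
An equivalent way to apply the cutoff is a boolean mask that keeps the rows at or above the threshold, which avoids building the intermediate `infrequent` frame. A minimal sketch on a toy DataFrame; `MIN_COUNT`, the toy rows, and their counts are assumptions for illustration, not values from the dataset:

```python
import pandas as pd

MIN_COUNT = 499  # mirrors the cutoff used above; rows below it are dropped

# Toy rows with made-up counts, for illustration only
toy = pd.DataFrame({
    "Name": ["NameA", "NameB", "NameC", "NameD"],
    "Gender": ["M", "F", "M", "F"],
    "Count": [5000, 499, 498, 12],
})

# Keep only rows whose Count is at least MIN_COUNT
kept = toy[toy["Count"] >= MIN_COUNT].reset_index(drop=True)
print(kept)
```

Note that the filter works per row, so a name listed under both genders could keep one row and lose the other if only one of its counts reaches the cutoff.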

In [133]:
# Removing unnecessary columns
names = names[['Name', 'Gender']]

# Removing duplicate rows to make sure there are unique name and gender combinations
names = names.drop_duplicates()

names.head(10)

Unnamed: 0,Name,Gender
0,James,M
1,John,M
2,Robert,M
3,Michael,M
4,William,M
5,Mary,F
6,David,M
7,Joseph,M
8,Richard,M
9,Charles,M


In [134]:
names.Name.nunique() # Checking for unique values again

16522

In [135]:
# Saving the cleaned dataset to a new CSV file
names.to_csv('cleaned_names.csv', index=False)
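
To give an idea of how the application might consume the cleaned file, here is a rough sketch that groups names by gender and picks one at random. It is not the application's actual code: `load_names_by_gender`, `pick_name`, and the in-memory `SAMPLE_CSV` (a stand-in for `cleaned_names.csv`) are hypothetical names introduced for this example.

```python
import csv
import io
import random

# Stand-in for cleaned_names.csv (same two columns); rows are illustrative
SAMPLE_CSV = "Name,Gender\nJames,M\nMary,F\nJohn,M\nLinda,F\n"


def load_names_by_gender(handle):
    """Group names by gender code from a CSV with Name and Gender columns."""
    by_gender = {}
    for row in csv.DictReader(handle):
        by_gender.setdefault(row["Gender"], []).append(row["Name"])
    return by_gender


def pick_name(by_gender, gender, rng=random):
    """Return one random name for the given gender code ('F' or 'M')."""
    return rng.choice(by_gender[gender])


names_by_gender = load_names_by_gender(io.StringIO(SAMPLE_CSV))
print(pick_name(names_by_gender, "F"))
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by `open('cleaned_names.csv', newline='')`.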

# 4. Data Visualization
For additional context, I want to know how many unique names there are for each gender.

In [136]:
# Filtering the dataset based on gender
female_names = names[names['Gender'] == 'F']['Name'].unique()
male_names = names[names['Gender'] == 'M']['Name'].unique()

# Getting the count of unique names for each gender
female_count = len(female_names)
male_count = len(male_names)

print(f"Unique female names: {female_count}")
print(f"Unique male names: {male_count}")

Unique female names: 11001
Unique male names: 7158


In [137]:
# Counting the rows (unique name and gender pairs) for each gender
gender_counts = names['Gender'].value_counts()

# Pie chart
fig = px.pie(gender_counts, 
             values=gender_counts.values, 
             names=gender_counts.index, 
             title='Total Number of Male and Female Names',
             color_discrete_sequence=['#EA738D', '#89ABE3'])

fig.show()

# 5. Report

* There are 16,522 unique names in the cleaned dataset
* 11,001 of the entries are feminine names
* 7,158 of the entries are masculine names
* Together there are 18,159 F and M entries (this is more than the number of unique names because some names are used by both F and M)
* 60.6% of the entries are feminine names and 39.4% are masculine names
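
The figures above can be cross-checked with a little arithmetic. The sketch below assumes the cleaned data contains only the gender codes F and M, so that each name appears on at most two rows; under that assumption, the gap between the entry total and the unique-name count is the number of names listed under both genders.

```python
# Values copied from the notebook outputs above
unique_names = 16_522
female = 11_001
male = 7_158

total = female + male            # all name and gender entries
both = total - unique_names      # names listed under both F and M (assumes only F/M codes)
female_share = round(100 * female / total, 1)
male_share = round(100 * male / total, 1)

print(total, both, female_share, male_share)  # 18159 1637 60.6 39.4
```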