# Name Data 

Author: Jade Aidoghie  
Date: 5/24/2024

# 1. Background
This notebook explores a dataset of first names labeled by gender before I integrate it into my Name Generator application. A recent update to the application will let users choose names by gender. Last names are typically unisex, so this feature focuses on the gender of first names.

The data consists of baby names from the US, UK, Canada, and Australia. More information about the dataset is available on Kaggle, where I sourced it.

> Data: [Gender By Name - Kaggle](https://www.kaggle.com/datasets/rupindersinghrana/gender-by-name)

# 2. Preliminary Exploration
This section explores the dataset to understand its structure before any of the data is altered.

In [53]:
import pandas as pd
import plotly.express as px

names = pd.read_csv('name_gender_dataset.csv')
names.head()


| | Name | Gender | Count | Probability |
|---|---|---|---|---|
| 0 | James | M | 5304407 | 0.014517 |
| 1 | John | M | 5260831 | 0.014398 |
| 2 | Robert | M | 4970386 | 0.013603 |
| 3 | Michael | M | 4579950 | 0.012534 |
| 4 | William | M | 4226608 | 0.011567 |


In [54]:
names.info() # Full report on columns and data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147269 entries, 0 to 147268
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Name         147269 non-null  object 
 1   Gender       147269 non-null  object 
 2   Count        147269 non-null  int64  
 3   Probability  147269 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 4.5+ MB


In [55]:
names.shape # The dimensions of the data 

(147269, 4)

In [56]:
names.isnull().sum() # Number of null values in each column

Name           0
Gender         0
Count          0
Probability    0
dtype: int64

In [57]:
names.Name.nunique() # Number of unique names

133910
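
The gap between 147,269 rows and 133,910 unique names suggests that some names appear under both genders. A minimal sketch of how that overlap could be counted, using a small made-up sample in place of the real file (the `sample` frame and the `count_shared_names` helper are illustrative and not part of the original notebook):

```python
import pandas as pd

def count_shared_names(df: pd.DataFrame) -> int:
    """Return how many names are listed under more than one gender."""
    # Number of distinct gender labels attached to each name
    genders_per_name = df.groupby("Name")["Gender"].nunique()
    return int((genders_per_name > 1).sum())

# Hypothetical sample; the real data would come from name_gender_dataset.csv
sample = pd.DataFrame({
    "Name": ["James", "Mary", "Jordan", "Jordan"],
    "Gender": ["M", "F", "M", "F"],
})

shared = count_shared_names(sample)
print(shared)
```

On the full dataset, assuming every (name, gender) pair appears only once and the only labels are M and F, this count should come out to the row count minus the unique-name count (147,269 - 133,910 = 13,359).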

# 3. Data Transformation
I only need the name and gender from this dataset, so I'll remove the other columns. A cleaned version without duplicate rows will be saved to a new CSV file and used in the application.

In [58]:
# Removing unnecessary columns
names = names[['Name', 'Gender']]

# Removing duplicate rows to make sure there are unique name and gender combinations
names = names.drop_duplicates()

# Saving the cleaned dataset to a new CSV file
names.to_csv('cleaned_names.csv', index=False)

names.head(10)


| | Name | Gender |
|---|---|---|
| 0 | James | M |
| 1 | John | M |
| 2 | Robert | M |
| 3 | Michael | M |
| 4 | William | M |
| 5 | Mary | F |
| 6 | David | M |
| 7 | Joseph | M |
| 8 | Richard | M |
| 9 | Charles | M |
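
Since the cleaned file is meant for the Name Generator, here is a rough sketch of how a name might be picked by gender from a frame with these two columns. The `random_first_name` helper, its `seed` parameter, and the sample data are assumptions for illustration only; the application's actual code is not shown in this notebook.

```python
import pandas as pd

def random_first_name(df: pd.DataFrame, gender: str, seed=None) -> str:
    """Pick one random name labeled with the given gender ('M' or 'F')."""
    matches = df.loc[df["Gender"] == gender, "Name"]
    if matches.empty:
        raise ValueError(f"No names found for gender {gender!r}")
    # random_state makes the pick reproducible when a seed is given
    return matches.sample(n=1, random_state=seed).iloc[0]

# Hypothetical sample; the application would load cleaned_names.csv instead
sample = pd.DataFrame({
    "Name": ["James", "John", "Mary"],
    "Gender": ["M", "M", "F"],
})

print(random_first_name(sample, "F"))
```

Filtering first and sampling second keeps the function simple, at the cost of re-filtering on every call. For a large file it may be worth splitting the names by gender once at startup.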


In [59]:
names.Name.nunique() # Number of unique names after cleaning

133910
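
Note that `nunique()` on `Name` alone cannot show whether `drop_duplicates()` removed anything, because a repeated (name, gender) row does not change the number of distinct names. A small sketch of a more direct check, run here on made-up data rather than the real file:

```python
import pandas as pd

# Hypothetical sample with one repeated (Name, Gender) pair
sample = pd.DataFrame({
    "Name": ["James", "James", "Jordan", "Jordan"],
    "Gender": ["M", "M", "M", "F"],
})

# Rows that repeat an earlier (Name, Gender) pair
duplicate_rows = int(sample.duplicated(subset=["Name", "Gender"]).sum())

# Comparing lengths before and after gives the same answer
deduped = sample.drop_duplicates()
rows_removed = len(sample) - len(deduped)

print(duplicate_rows, rows_removed)
```

Running the same two checks on `names` before the `drop_duplicates()` call would show how many rows the cleaning step actually removes.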

# 4. Data Visualization
For more context, I want to know how many unique names there are for each gender.

In [61]:
# Filtering the dataset based on gender
female_names = names[names['Gender'] == 'F']['Name'].unique()
male_names = names[names['Gender'] == 'M']['Name'].unique()

# Getting the count of unique names for each gender
female_count = len(female_names)
male_count = len(male_names)

print(f"Unique female names: {female_count}")
print(f"Unique male names: {male_count}")

Unique female names: 89749
Unique male names: 57520


In [63]:
# Counting the total number of male and female names
gender_counts = names['Gender'].value_counts()

# Pie chart
fig = px.pie(gender_counts, 
             values=gender_counts.values, 
             names=gender_counts.index, 
             title='Total Number of Male and Female Names',
             color_discrete_sequence=['#EA738D', '#89ABE3'])

fig.show()

# 5. Report

About the data:
* There are 133,910 unique names in the dataset
* There are 89,749 feminine names
* There are 57,520 masculine names
* In total there are 147,269 F and M name entries. This is more than the number of unique names because some names are used by both F and M
* 60.9% are feminine names and 39.1% are masculine names
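
The percentages above can be reproduced from the two reported counts. A minimal sketch, where the `gender_counts` Series is built by hand from the figures in this report rather than read from the file:

```python
import pandas as pd

# Counts taken from the report above
gender_counts = pd.Series({"F": 89749, "M": 57520})

# Share of each gender as a percentage, rounded to one decimal place
shares = (gender_counts / gender_counts.sum() * 100).round(1)
print(shares)
```

With the full frame loaded, `names['Gender'].value_counts(normalize=True)` should give the same proportions as fractions.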