# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [68]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Step 2. Import the dataset from the file given in the repo US_Baby_Names_right.csv 

In [69]:
baby_names = pd.read_csv("./US_baby_Names_right.csv")

### Step 3. Assign it to a variable called baby_names.

In [70]:
# done in previous cell
baby_names

Unnamed: 0.1,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
...,...,...,...,...,...,...,...
1016390,5647421,5647422,Seth,2014,M,WY,5
1016391,5647422,5647423,Spencer,2014,M,WY,5
1016392,5647423,5647424,Tyce,2014,M,WY,5
1016393,5647424,5647425,Victor,2014,M,WY,5


### Step 4. See the first 10 entries

In [71]:
baby_names.head(10)

Unnamed: 0.1,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
5,11354,11355,Abigail,2004,F,AK,37
6,11355,11356,Olivia,2004,F,AK,33
7,11356,11357,Isabella,2004,F,AK,30
8,11357,11358,Alyssa,2004,F,AK,29
9,11358,11359,Sophia,2004,F,AK,28


### Step 5. Delete the column 'Unnamed: 0' and 'Id'

In [72]:
# drop the two columns by label: axis=1 targets columns, and inplace=True modifies baby_names directly instead of returning a copy
baby_names.drop(["Unnamed: 0", "Id"], axis=1, inplace=True)
baby_names

Unnamed: 0,Name,Year,Gender,State,Count
0,Emma,2004,F,AK,62
1,Madison,2004,F,AK,48
2,Hannah,2004,F,AK,46
3,Grace,2004,F,AK,44
4,Emily,2004,F,AK,41
...,...,...,...,...,...
1016390,Seth,2014,M,WY,5
1016391,Spencer,2014,M,WY,5
1016392,Tyce,2014,M,WY,5
1016393,Victor,2014,M,WY,5
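As an alternative to `inplace=True`, a minimal sketch of the non-mutating form is shown below. The `toy` frame is a hypothetical stand-in that copies the first two rows displayed above; `trimmed` is just an illustrative name.

```python
import pandas as pd

# Hypothetical two-row stand-in with the same column layout as the dataset.
toy = pd.DataFrame({
    "Unnamed: 0": [11349, 11350],
    "Id": [11350, 11351],
    "Name": ["Emma", "Madison"],
    "Year": [2004, 2004],
    "Gender": ["F", "F"],
    "State": ["AK", "AK"],
    "Count": [62, 48],
})

# drop(columns=...) returns a new DataFrame and leaves `toy` untouched.
trimmed = toy.drop(columns=["Unnamed: 0", "Id"])

print(list(trimmed.columns))  # ['Name', 'Year', 'Gender', 'State', 'Count']
print("Id" in toy.columns)    # True, the original frame still has the column
```

Reassigning the result (`baby_names = baby_names.drop(columns=[...])`) gives the same end state as `inplace=True` while keeping the operation explicit.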


### Step 6. Are there more male or female names in the dataset?

In [73]:
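# We can get the value counts of each gender in the dataset and compare them
gender_counts = baby_names["Gender"].value_counts()
print(gender_counts)
# look the counts up by label ("F"/"M"); positional access such as [0] on a label-indexed Series is deprecated
print(f'There are {gender_counts["F"]} female names and {gender_counts["M"]} male names and therefore there are more {"female" if gender_counts["F"] > gender_counts["M"] else "male"} names in the dataset.')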

# or we can filter through and get the lengths of each filtered dataset
amount_male = len(baby_names[baby_names["Gender"] == "M"]) 
amount_female = len(baby_names[baby_names["Gender"] == "F"])
if (amount_male >  amount_female):
    print(f"There are {amount_female} female names and {amount_male} male names and therefore there are more male names in the dataset.")
elif (amount_male == amount_female):
    print(f"There are {amount_female} female names and {amount_male} male names and therefore there is the same number of male and female names in the dataset.")
else:
    print(f"There are {amount_female} female names and {amount_male} male names and therefore there are more female names in the dataset.")


Gender
F    558846
M    457549
Name: count, dtype: int64
There are 558846 female names and 457549 male names and therefore there are more female names in the dataset.
There are 558846 female names and 457549 male names and therefore there are more female names in the dataset.
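The result of `value_counts()` is indexed by the values themselves, so looking counts up by label is the safer habit. A small sketch on made-up data (the `genders` Series is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical gender column with two "F" rows and one "M" row.
genders = pd.Series(["F", "M", "F"], name="Gender")

counts = genders.value_counts()

# Label-based lookups do not depend on the sort order of the counts.
print(counts["F"])      # 2
print(counts["M"])      # 1

# idxmax() returns the label with the largest count.
print(counts.idxmax())  # F
```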



### Step 7. Group the dataset by name and assign to names

In [74]:
names = baby_names.groupby("Name")
# show the first row of each group (one row per distinct name)
names.first()

Unnamed: 0_level_0,Year,Gender,State,Count
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Aaban,2013,M,NY,6
Aadan,2008,M,CA,7
Aadarsh,2009,M,IL,5
Aaden,2007,M,AL,5
Aadhav,2014,M,CA,6
...,...,...,...,...
Zyra,2012,F,CA,6
Zyrah,2011,F,CA,5
Zyren,2013,M,TX,6
Zyria,2007,F,GA,6
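To get a feel for what the grouped object holds, here is a small sketch on a hypothetical four-row frame (the rows are made up for illustration). It uses `ngroups`, `size()` and `get_group()`, which are standard pandas GroupBy members.

```python
import pandas as pd

# Hypothetical miniature version of the dataset.
toy = pd.DataFrame({
    "Name": ["Emma", "Seth", "Emma", "Riley"],
    "Year": [2004, 2014, 2005, 2010],
    "Count": [62, 5, 40, 12],
})

grouped = toy.groupby("Name")

print(grouped.ngroups)            # 3 distinct names
print(grouped.size())             # number of rows per name
print(grouped.get_group("Emma"))  # the two rows that belong to "Emma"
```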


### Step 8. How many different names exist in the dataset?

In [81]:
# the number of groups equals the number of distinct names
print(f'There are {names.ngroups} different names in the dataset')
# or count the distinct values directly on the baby_names dataset
print(f'There are {baby_names["Name"].nunique()} different names in the dataset')

There are 17632 different names in the dataset
There are 17632 different names in the dataset


### Step 9. What is the name with most occurrences?

In [82]:
# count the rows per name with the grouped names variable and take the label of the largest count
print(f'The name with the most occurrences is {names.size().idxmax()}')
# or
# use the baby_names dataset and mode, which is much easier
print(f'The name with the most occurrences is {baby_names["Name"].mode()[0]}')

The name with the most occurrences is Riley
The name with the most occurrences is Riley
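Note that "occurrences" here counts rows, that is, how often a name appears across year, gender and state combinations. If the question were read as "most babies", we would sum the `Count` column instead. A sketch on hypothetical data where the two readings disagree:

```python
import pandas as pd

# Hypothetical rows: "Riley" appears in more rows, "Emma" has more births.
toy = pd.DataFrame({
    "Name": ["Riley", "Riley", "Riley", "Emma", "Emma"],
    "Count": [5, 5, 5, 62, 48],
})

# Reading 1: the name that appears in the most rows.
by_rows = toy["Name"].value_counts().idxmax()

# Reading 2: the name with the largest total of the Count column.
by_births = toy.groupby("Name")["Count"].sum().idxmax()

print(by_rows)    # Riley
print(by_births)  # Emma
```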


### Step 10. How many different names have the least occurrences?

In [83]:
# count the rows per name and keep only the names whose count equals the smallest count,
# then take the length of that selection to get how many different names have the least occurrences
name_counts = baby_names["Name"].value_counts()
print(f'The number of different names with the least occurrences is {len(name_counts[name_counts == name_counts.min()])}')

The number of different names with the least occurrences is 3682
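Comparing against the smallest count, rather than against a fixed 1, keeps the filter correct even if no name happens to occur exactly once. A sketch on a hypothetical list of names:

```python
import pandas as pd

# Hypothetical name column: Emma x2, Riley x2, Seth x1, Tyce x1.
toy_names = pd.Series(["Emma", "Emma", "Seth", "Tyce", "Riley", "Riley"])

counts = toy_names.value_counts()

# Keep only the names whose count equals the minimum count.
rarest = counts[counts == counts.min()]

print(len(rarest))           # 2
print(sorted(rarest.index))  # ['Seth', 'Tyce']
```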


### Step 11. What is the median name occurrence?

In [85]:
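# a median of the name strings themselves is undefined, but the question asks about occurrences:
# count the rows per name and take the median of those counts
print(baby_names["Name"].value_counts().median())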

### Step 12. What is the standard deviation of names?

In [87]:
# as in Step 11, work on the number of rows per name rather than on the name strings
print(baby_names["Name"].value_counts().std())
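The idea behind Steps 11 and 12 can be checked by hand on a tiny example: statistics of "name occurrences" are statistics of the per-name counts, not of the strings. The `toy_names` Series below is hypothetical, chosen so the counts are 3, 2 and 1.

```python
import pandas as pd

# Hypothetical name column: Emma x3, Seth x2, Tyce x1.
toy_names = pd.Series(["Emma", "Emma", "Emma", "Seth", "Seth", "Tyce"])

# Rows per name: Emma 3, Seth 2, Tyce 1.
occurrences = toy_names.value_counts()

# Median of the counts [3, 2, 1].
print(occurrences.median())  # 2.0

# pandas computes the sample standard deviation (ddof=1) by default.
print(occurrences.std())     # 1.0
```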

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [88]:
baby_names.describe()

Unnamed: 0,Year,Count
count,1016395.0,1016395.0
mean,2009.053,34.85012
std,3.138293,97.39735
min,2004.0,5.0
25%,2006.0,7.0
50%,2009.0,11.0
75%,2012.0,26.0
max,2014.0,4167.0
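By default `describe()` on a DataFrame summarises only the numeric columns, which is why `Name`, `Gender` and `State` are missing above. Calling it on a single text column gives a different summary (count, unique, top, freq). A sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical miniature frame with one text and one numeric column.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Seth"],
    "Count": [62, 48, 5],
})

# Numeric summary: only the Count column is included.
print(toy.describe())

# Summary of a text column: count, unique, top and freq.
name_summary = toy["Name"].describe()
print(name_summary)
```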
