# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [2]:
import pandas as pd
import numpy as np

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [3]:
url='https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv'
baby_names = pd.read_csv(url)

### Step 4. See the first 10 entries

In [5]:
baby_names.head(10)

Unnamed: 0.1,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
5,11354,11355,Abigail,2004,F,AK,37
6,11355,11356,Olivia,2004,F,AK,33
7,11356,11357,Isabella,2004,F,AK,30
8,11357,11358,Alyssa,2004,F,AK,29
9,11358,11359,Sophia,2004,F,AK,28


### Step 5. Delete the column 'Unnamed: 0' and 'Id'

In [6]:
# columns= already identifies the axis, so axis=1 is redundant
baby_names.drop(columns=['Unnamed: 0', 'Id'], inplace=True)

In [7]:
baby_names.head()

Unnamed: 0,Name,Year,Gender,State,Count
0,Emma,2004,F,AK,62
1,Madison,2004,F,AK,48
2,Hannah,2004,F,AK,46
3,Grace,2004,F,AK,44
4,Emily,2004,F,AK,41
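The same cleanup can also be written without `inplace=True`, by keeping the frame that `drop` returns. The sketch below is only an illustration: `toy` is a hand-made stand-in for the raw file (an assumption, so that it runs without a download), and `toy_clean` is a hypothetical name.

```python
import pandas as pd

# Hypothetical miniature of the raw file, including the two helper columns.
toy = pd.DataFrame({
    "Unnamed: 0": [11349, 11350],
    "Id": [11350, 11351],
    "Name": ["Emma", "Madison"],
    "Year": [2004, 2004],
    "Gender": ["F", "F"],
    "State": ["AK", "AK"],
    "Count": [62, 48],
})

# drop() returns a new frame; the original `toy` is left untouched.
toy_clean = toy.drop(columns=["Unnamed: 0", "Id"])
print(toy_clean.columns.tolist())
```

Reassigning rather than mutating makes it easier to re-run a cell without hitting a `KeyError` for columns that are already gone.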


In [8]:
baby_names.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 5 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   Name    1016395 non-null  object
 1   Year    1016395 non-null  int64 
 2   Gender  1016395 non-null  object
 3   State   1016395 non-null  object
 4   Count   1016395 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 38.8+ MB


In [None]:
# More than a million entries!

### Step 6. Are there more male or female names in the dataset?

In [9]:
# Counting the entries (rows) for each gender: this is neither the number of
# distinct names nor the number of babies given those names

b1 = (
    baby_names
    .groupby('Gender')
    .Name
    .count()
    .to_frame()
)

b1.loc['Total'] = b1.sum()
b1

Unnamed: 0_level_0,Name
Gender,Unnamed: 1_level_1
F,558846
M,457549
Total,1016395


In [10]:
# matches total number of entries from info() above

In [29]:
baby_names['Gender'].value_counts().to_frame()

Unnamed: 0,Gender
F,558846
M,457549
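The question in this step can be read three ways: rows per gender, distinct names per gender, or babies per gender. A minimal sketch on made-up data (the frame `toy` and its values are illustrative assumptions, not taken from the file) computes all three at once with named aggregation:

```python
import pandas as pd

# Hypothetical toy data: one row per (name, gender, state) combination.
toy = pd.DataFrame({
    "Name":   ["Emma", "Emma", "Grace", "Jacob", "Jacob", "Ethan"],
    "Gender": ["F", "F", "F", "M", "M", "M"],
    "State":  ["AK", "AL", "AK", "AK", "AL", "AK"],
    "Count":  [62, 50, 44, 70, 65, 30],
})

summary = toy.groupby("Gender").agg(
    entries=("Name", "count"),           # rows per gender
    distinct_names=("Name", "nunique"),  # unique names per gender
    babies=("Count", "sum"),             # total babies per gender
)
print(summary)
```

On the real `baby_names` frame, the `entries` column would correspond to the counts shown above, while the other two columns answer different questions.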


### Step 7. Group the dataset by name and assign to names

In [13]:
names = baby_names.groupby(['Name'])

### Step 8. How many different names exist in the dataset?

In [14]:
names.count()

Unnamed: 0_level_0,Year,Gender,State,Count
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Aaban,2,2,2,2
Aadan,4,4,4,4
Aadarsh,1,1,1,1
Aaden,196,196,196,196
Aadhav,1,1,1,1
...,...,...,...,...
Zyra,7,7,7,7
Zyrah,2,2,2,2
Zyren,1,1,1,1
Zyria,10,10,10,10


In [15]:
names.ngroups

17632
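If the grouped object is not needed for anything else, `Series.nunique()` should give the same answer directly. A small sketch, assuming a made-up frame `toy` in place of `baby_names`:

```python
import pandas as pd

# Hypothetical toy data with one repeated name.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Grace", "Jacob"],
    "Count": [62, 50, 44, 70],
})

n_groups = toy.groupby("Name").ngroups  # number of groups formed by Name
n_unique = toy["Name"].nunique()        # number of distinct values in Name
print(n_groups, n_unique)
```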

### Step 9. What is the name with most occurrences?

In [16]:
baby_names.head()

Unnamed: 0,Name,Year,Gender,State,Count
0,Emma,2004,F,AK,62
1,Madison,2004,F,AK,48
2,Hannah,2004,F,AK,46
3,Grace,2004,F,AK,44
4,Emily,2004,F,AK,41


In [17]:
(
    names
    .Count
    .sum()
    .sort_values(ascending=False)
    .to_frame()    # Need dataframe to preserve index in output
    .iloc[[0], :]  # Need bracket around index to preserve dataframe visual structure
)

Unnamed: 0_level_0,Count
Name,Unnamed: 1_level_1
Jacob,242874


In [33]:
b3 = \
(
    baby_names
    .groupby('Name')
    .Count
    .sum()
    .to_frame()
)

b3

Unnamed: 0_level_0,Count
Name,Unnamed: 1_level_1
Aaban,12
Aadan,23
Aadarsh,5
Aaden,3426
Aadhav,6
...,...
Zyra,42
Zyrah,11
Zyren,6
Zyria,59


In [34]:
b3.Count.idxmax()

'Jacob'

In [36]:
b3.loc[b3.Count == b3.Count.max()]

Unnamed: 0_level_0,Count
Name,Unnamed: 1_level_1
Jacob,242874
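Another way to get the top of the ranking, without sorting the whole result, is `Series.nlargest`. The sketch below assumes a made-up frame `toy`; the names and counts are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: several rows per name, as in the real file.
toy = pd.DataFrame({
    "Name":  ["Emma", "Emma", "Grace", "Jacob", "Jacob"],
    "Count": [62, 50, 44, 70, 65],
})

totals = toy.groupby("Name")["Count"].sum()  # total occurrences per name
top = totals.nlargest(2)                     # the two most frequent names
print(top)
```

Note that `idxmax()` returns a single label even when several names tie for first place, whereas the boolean filter on `max()` keeps every tied row.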


### Step 10. How many different names have the least occurrences?

In [18]:
minval = (
    names
    .Count
    .sum()
    .min()  # no need to sort first: min() scans every value anyway
)

minval

5

In [19]:
b2 = (
    names
    .Count
    .sum()
    .sort_values(ascending=True)
    .to_frame()
)

b2

Unnamed: 0_level_0,Count
Name,Unnamed: 1_level_1
Destenie,5
Janisha,5
Lizvet,5
Arsalan,5
Janira,5
...,...
Isabella,204798
Ethan,209277
Michael,214405
Emma,214852


In [20]:
(b2
 .loc[b2.Count == minval]
)

Unnamed: 0_level_0,Count
Name,Unnamed: 1_level_1
Destenie,5
Janisha,5
Lizvet,5
Arsalan,5
Janira,5
...,...
Alyxandria,5
Ethanjoseph,5
Almira,5
Davarion,5


In [21]:
(b2
 .loc[b2.Count == minval]
 .shape[0]
)

2578
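The count of names tied at the minimum can also be obtained by summing a boolean mask, which skips building the filtered frame. A minimal sketch, using a small hand-built Series `totals` as a stand-in for the per-name totals:

```python
import pandas as pd

# Small stand-in for the per-name totals (a handful of rows only).
totals = pd.Series(
    {"Destenie": 5, "Janisha": 5, "Aaden": 3426, "Jacob": 242874},
    name="Count",
)

# True counts as 1, so summing the mask counts the names at the minimum.
n_min = int((totals == totals.min()).sum())
print(n_min)
```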

### Step 11. What is the median name occurrence?

In [22]:
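# Take the median of the Count column directly; positional [0] on a
# labelled Series is deprecated in recent pandas versions
med_name_num = b2.Count.median()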

In [23]:
med_name_num

49.0

In [24]:
b2

Unnamed: 0_level_0,Count
Name,Unnamed: 1_level_1
Destenie,5
Janisha,5
Lizvet,5
Arsalan,5
Janira,5
...,...
Isabella,204798
Ethan,209277
Michael,214405
Emma,214852


In [25]:
(
    b2
    .loc[b2.Count == med_name_num]
)

Unnamed: 0_level_0,Count
Name,Unnamed: 1_level_1
Caleah,49
Anely,49
Alysse,49
Emmanuela,49
Carlota,49
...,...
Sanjuanita,49
Jeovany,49
Deante,49
Jaice,49
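Filtering on equality with the median works here because the median happens to be a value that occurs in the data. With an even number of names the median is the average of the two middle values and may match no row at all. The sketch below, on made-up totals, shows that case and one possible fallback (picking the names closest to the median); the fallback is an assumption about what one might want, not part of the exercise.

```python
import pandas as pd

# Hypothetical per-name totals with an even number of entries.
totals = pd.Series({"A": 5, "B": 10, "C": 20, "D": 40}, name="Count")

med = totals.median()          # average of the two middle values, 10 and 20
exact = totals[totals == med]  # empty: no name has exactly this total

# Fallback: keep the names whose total is nearest to the median.
dist = (totals - med).abs()
closest = totals[dist == dist.min()]
print(med, len(exact), closest.index.tolist())
```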


### Step 12. What is the standard deviation of name occurrences?

In [26]:
b2.Count.std()

11006.06946789057
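Worth keeping in mind when comparing with other tools: pandas computes the sample standard deviation by default (`ddof=1`), while NumPy's `np.std` defaults to the population version (`ddof=0`). A short sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical counts, chosen so the population std is a round number.
counts = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = counts.std()            # pandas default, ddof=1
population_std = counts.std(ddof=0)  # divides by n instead of n - 1
numpy_std = np.std(counts.to_numpy())  # NumPy default, ddof=0

print(sample_std, population_std, numpy_std)
```

With more than 17,000 names the two versions differ only marginally, so the distinction matters mainly for small samples.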

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [27]:
b2.describe()

Unnamed: 0,Count
count,17632.0
mean,2008.932169
std,11006.069468
min,5.0
25%,11.0
50%,49.0
75%,337.0
max,242874.0
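The individual figures in this summary can also be requested one by one, which is handy when only some of them are needed. The sketch below uses a tiny made-up Series `totals` (an assumption for illustration) and checks that `agg` and `quantile` agree with `describe`:

```python
import pandas as pd

# Hypothetical per-name totals.
totals = pd.Series([5, 11, 49, 337], name="Count")

stats = totals.agg(["mean", "min", "max", "std"])  # selected statistics
quartiles = totals.quantile([0.25, 0.5, 0.75])     # the three quartiles
summary = totals.describe()                        # full summary, for comparison

print(stats)
print(quartiles)
```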
