# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [1]:
import numpy as np
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [2]:
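# read_csv accepts a URL, so the file from Step 2 can be loaded directly
url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv'
baby_names = pd.read_csv(url)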
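
The `Unnamed: 0` column that shows up in the next step usually comes from an index that was written to the CSV without a header. As a minimal sketch, assuming the file's first header field is blank, `index_col=0` makes pandas treat that column as the index instead. The three-row `sample_csv` below is a hypothetical stand-in that mirrors the file's layout, not the real data.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mirrors the file's column layout;
# the blank first header field is what pandas names 'Unnamed: 0'.
sample_csv = io.StringIO(
    ",Id,Name,Year,Gender,State,Count\n"
    "11349,11350,Emma,2004,F,AK,62\n"
    "11350,11351,Madison,2004,F,AK,48\n"
    "11351,11352,Hannah,2004,F,AK,46\n"
)

# index_col=0 uses the first column as the index, so no 'Unnamed: 0' column appears.
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.tolist())
```

If the real file were loaded this way, only `Id` would remain to be deleted in Step 5.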

### Step 4. See the first 10 entries

In [3]:
baby_names.head(10)

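|   | Unnamed: 0 | Id | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|---|---|
| 0 | 11349 | 11350 | Emma | 2004 | F | AK | 62 |
| 1 | 11350 | 11351 | Madison | 2004 | F | AK | 48 |
| 2 | 11351 | 11352 | Hannah | 2004 | F | AK | 46 |
| 3 | 11352 | 11353 | Grace | 2004 | F | AK | 44 |
| 4 | 11353 | 11354 | Emily | 2004 | F | AK | 41 |
| 5 | 11354 | 11355 | Abigail | 2004 | F | AK | 37 |
| 6 | 11355 | 11356 | Olivia | 2004 | F | AK | 33 |
| 7 | 11356 | 11357 | Isabella | 2004 | F | AK | 30 |
| 8 | 11357 | 11358 | Alyssa | 2004 | F | AK | 29 |
| 9 | 11358 | 11359 | Sophia | 2004 | F | AK | 28 |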


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [4]:
# baby_names = baby_names.drop(['Unnamed: 0', 'Id'], axis=1)

In [None]:
del baby_names['Unnamed: 0']
del baby_names['Id']

In [5]:
baby_names.head()

|   | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|
| 0 | Emma | 2004 | F | AK | 62 |
| 1 | Madison | 2004 | F | AK | 48 |
| 2 | Hannah | 2004 | F | AK | 46 |
| 3 | Grace | 2004 | F | AK | 44 |
| 4 | Emily | 2004 | F | AK | 41 |
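
`del` modifies the DataFrame in place, so running the cell a second time raises a `KeyError`. A non-mutating alternative is `DataFrame.drop` with the `columns` keyword, sketched here on a hypothetical two-row frame named `toy`.

```python
import pandas as pd

# Hypothetical two-row frame with the same helper columns as the dataset.
toy = pd.DataFrame({
    "Unnamed: 0": [11349, 11350],
    "Id": [11350, 11351],
    "Name": ["Emma", "Madison"],
    "Count": [62, 48],
})

# drop returns a new DataFrame and leaves `toy` unchanged.
trimmed = toy.drop(columns=["Unnamed: 0", "Id"])

print(trimmed.columns.tolist())
print(toy.shape)
```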


### Step 6. Are there more male or female names in the dataset?

In [37]:
baby_names.Gender.value_counts()

F    558846
M    457549
Name: Gender, dtype: int64
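
Note that `value_counts` counts rows (one per name, year, gender and state), not babies. If the question were about the number of babies, the `Count` column would need to be summed per gender. A minimal sketch on a hypothetical three-row frame shows how the two answers can differ:

```python
import pandas as pd

# Hypothetical rows: two female records with small counts, one male record with a large count.
toy = pd.DataFrame({
    "Name": ["Emma", "Madison", "Jacob"],
    "Gender": ["F", "F", "M"],
    "Count": [62, 48, 150],
})

rows_per_gender = toy["Gender"].value_counts()           # number of records
babies_per_gender = toy.groupby("Gender")["Count"].sum() # number of babies

print(rows_per_gender)
print(babies_per_gender)
```

Here `F` has more rows while `M` has more babies, so the two measures should not be used interchangeably.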

### Step 7. Group the dataset by name and assign to names

In [9]:
names = baby_names.groupby('Name')

In [38]:
# Sum only the Count column, so Year, Gender and State are not aggregated
names = baby_names.groupby('Name')[['Count']].sum()

In [41]:
names.shape

(17632, 1)

### Step 8. How many different names exist in the dataset?

In [13]:
baby_names.Name.nunique()

17632

In [42]:
len(names)

17632

In [43]:
names.count()

Count    17632
dtype: int64

### Step 9. What is the name with most occurrences?

In [19]:
names.Count.sort_values(ascending=False).head(1)

Name
Jacob    242874
Name: Count, dtype: int64

In [44]:
names.Count.idxmax()

'Jacob'

In [45]:
names[names.Count == names.Count.max()]

| Name | Count |
|---|---|
| Jacob | 242874 |
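
`idxmax` returns only the first label when several names share the maximum. One way to keep ties, sketched on a hypothetical `totals` Series with a deliberate tie, is `Series.nlargest` with `keep="all"`:

```python
import pandas as pd

# Hypothetical totals with a deliberate tie for first place.
totals = pd.Series({"Jacob": 300, "Emma": 300, "Chuck": 5}, name="Count")

first_only = totals.idxmax()              # label of the first maximum only
all_top = totals.nlargest(1, keep="all")  # keeps every name tied for the top spot

print(first_only)
print(sorted(all_top.index))
```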


### Step 10. How many different names have the fewest occurrences?

In [20]:
names.Count.sort_values().head(10)

Name
Destenie     5
Janisha      5
Lizvet       5
Arsalan      5
Janira       5
Chuck        5
Sadrac       5
Theodoros    5
Sady         5
Janila       5
Name: Count, dtype: int64

In [48]:
len(names[names.Count == names.Count.min()])

2578

### Step 11. What is the median name occurrence?

In [32]:
names.Count.median()

49.0

In [50]:
names[names.Count == names.Count.median()]

| Name | Count |
|---|---|
| Aishani | 49 |
| Alara | 49 |
| Alysse | 49 |
| Ameir | 49 |
| Anely | 49 |
| Antonina | 49 |
| Aveline | 49 |
| Aziah | 49 |
| Baily | 49 |
| Caleah | 49 |
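
Filtering on equality with the median works here because some names total exactly 49. With an even number of names, however, the median is the average of the two middle values and may match no name at all. The sketch below uses a hypothetical four-name Series to show that case and one possible fallback, selecting the names closest to the median.

```python
import pandas as pd

# Hypothetical totals with an even number of entries.
totals = pd.Series({"Ana": 5, "Ben": 10, "Cleo": 20, "Dov": 40}, name="Count")

median = totals.median()          # average of the two middle values
exact = totals[totals == median]  # empty: no name has exactly this total

# Fallback: names whose total is closest to the median.
distance = (totals - median).abs()
closest = totals[distance == distance.min()]

print(median)
print(exact.empty)
print(closest.index.tolist())
```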


### Step 12. What is the standard deviation of name occurrences?

In [52]:
names.Count.std()

11006.06946789057

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [36]:
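# Per-name statistics computed over the raw rows (one row per name, year, gender and state)
baby_names.groupby('Name').Count.agg(['mean', 'min', 'max'])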

| Name | mean | min | max |
|---|---|---|---|
| Aaban | 6.000000 | 6 | 6 |
| Aadan | 5.750000 | 5 | 7 |
| Aadarsh | 5.000000 | 5 | 5 |
| Aaden | 17.479592 | 5 | 158 |
| Aadhav | 6.000000 | 6 | 6 |
| Aadhya | 11.325000 | 5 | 36 |
| Aadi | 8.078947 | 5 | 16 |
| Aadin | 5.000000 | 5 | 5 |
| Aadit | 6.000000 | 5 | 7 |
| Aaditya | 6.928571 | 5 | 11 |


In [53]:
names.describe()

|   | Count |
|---|---|
| count | 17632.0 |
| mean | 2008.932169 |
| std | 11006.069468 |
| min | 5.0 |
| 25% | 11.0 |
| 50% | 49.0 |
| 75% | 337.0 |
| max | 242874.0 |
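
The `50%` row of `describe()` is the median from Step 11. If only some of these statistics are needed, they can also be computed individually. The following is a minimal sketch on a hypothetical five-value Series (the values are borrowed from the summary for illustration, they are not the dataset), combining `agg` with `quantile`:

```python
import pandas as pd

# Hypothetical five-value Series used only to illustrate the calls.
totals = pd.Series([5, 11, 49, 337, 242874], name="Count")

# Named aggregations return a Series indexed by the statistic name.
summary = totals.agg(["mean", "min", "max", "std"])

# Quartiles are the 0.25, 0.5 and 0.75 quantiles.
quartiles = totals.quantile([0.25, 0.5, 0.75])

print(summary)
print(quartiles)
```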
