# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 to 2014.


### Step 1. Import the necessary libraries

In [10]:
import pandas as pd
import numpy as np

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [11]:
url = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv"
baby_names = pd.read_csv(url, sep=",")

baby_names.head()

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
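
The `Unnamed: 0` column in the output above is the name pandas gives to a CSV column whose header is empty, which usually means the file was saved together with its index. A minimal sketch, assuming a small in-memory sample (`sample_csv` is hypothetical and only mirrors the first rows shown above) in place of the remote file, of how `index_col=0` could absorb that column at read time:

```python
import io

import pandas as pd

# Hypothetical in-memory sample mirroring the first rows shown above.
# The first header cell is empty, which pandas would report as "Unnamed: 0".
sample_csv = io.StringIO(
    ",Id,Name,Year,Gender,State,Count\n"
    "11349,11350,Emma,2004,F,AK,62\n"
    "11350,11351,Madison,2004,F,AK,48\n"
    "11351,11352,Hannah,2004,F,AK,46\n"
)

# index_col=0 uses the first column as the row index instead of a data column.
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.tolist())
```

Whether this is preferable to dropping the column afterwards depends on whether the saved index values are still needed.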


### Step 4. See the first 10 entries

In [12]:
baby_names.head(10)

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
5,11354,11355,Abigail,2004,F,AK,37
6,11355,11356,Olivia,2004,F,AK,33
7,11356,11357,Isabella,2004,F,AK,30
8,11357,11358,Alyssa,2004,F,AK,29
9,11358,11359,Sophia,2004,F,AK,28


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [13]:
baby_names.columns

Index(['Unnamed: 0', 'Id', 'Name', 'Year', 'Gender', 'State', 'Count'], dtype='object')

In [14]:

baby_names.drop(columns=["Unnamed: 0", "Id"], inplace=True)
baby_names

,Name,Year,Gender,State,Count
0,Emma,2004,F,AK,62
1,Madison,2004,F,AK,48
2,Hannah,2004,F,AK,46
3,Grace,2004,F,AK,44
4,Emily,2004,F,AK,41
...,...,...,...,...,...
1016390,Seth,2014,M,WY,5
1016391,Spencer,2014,M,WY,5
1016392,Tyce,2014,M,WY,5
1016393,Victor,2014,M,WY,5
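
Because the drop above runs in place, re-executing the same cell raises a `KeyError` once the columns are gone. A minimal sketch of a re-runnable variant, assuming a three-row stand-in frame (`df` is hypothetical, not the real dataset):

```python
import pandas as pd

# Hypothetical three-row stand-in for baby_names.
df = pd.DataFrame({
    "Unnamed: 0": [11349, 11350, 11351],
    "Id": [11350, 11351, 11352],
    "Name": ["Emma", "Madison", "Hannah"],
    "Count": [62, 48, 46],
})

# drop() returns a new frame by default; df itself is left untouched.
trimmed = df.drop(columns=["Unnamed: 0", "Id"])

# errors="ignore" skips labels that are already missing, so a re-run is harmless.
trimmed_again = trimmed.drop(columns=["Unnamed: 0", "Id"], errors="ignore")
print(trimmed_again.columns.tolist())
```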


### Step 6. Are there more male or female names in the dataset?

In [31]:
baby_names.groupby(by="Gender").count()

Gender,Name,Year,State,Count
F,558846,558846,558846,558846
M,457549,457549,457549,457549
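
Note that `count()` tallies rows (one per name, year, gender and state), not distinct names or babies. The sketch below, on a hypothetical five-row frame (`toy`, with made-up counts), contrasts three readings of the question:

```python
import pandas as pd

# Hypothetical rows; the counts are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Olivia", "Seth", "Victor"],
    "Gender": ["F", "F", "F", "M", "M"],
    "State": ["AK", "WY", "AK", "WY", "WY"],
    "Count": [62, 10, 33, 5, 7],
})

# 1) Number of rows per gender (what groupby(...).count() reports).
rows_per_gender = toy["Gender"].value_counts()

# 2) Number of distinct names per gender.
distinct_names = toy.groupby("Gender")["Name"].nunique()

# 3) Number of babies per gender (sum of the Count column).
babies_per_gender = toy.groupby("Gender")["Count"].sum()

print(rows_per_gender, distinct_names, babies_per_gender, sep="\n\n")
```

Which of the three is the right answer depends on how "more names" is interpreted.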


### Step 7. Group the dataset by name and assign to names

In [33]:
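# numeric_only=True keeps Year and Count and skips the string columns (Gender, State)
names = baby_names.groupby(by="Name").sum(numeric_only=True)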
names.sort_values("Count",ascending=False)

Name,Year,Count
Jacob,1141099,242874
Emma,1137085,214852
Michael,1161152,214405
Ethan,1139091,209277
Isabella,1137090,204798
...,...,...
Eniola,2012,5
Atlantis,2005,5
Marci,2004,5
Simarpreet,2011,5
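
The `Year` column in this result is a sum of year values (for example 1141099 for Jacob), which carries no useful meaning. A minimal sketch, on a hypothetical four-row frame, of aggregating only `Count`:

```python
import pandas as pd

# Hypothetical rows; the counts are illustrative only.
toy = pd.DataFrame({
    "Name": ["Emma", "Jacob", "Emma", "Jacob"],
    "Year": [2004, 2004, 2005, 2005],
    "Gender": ["F", "M", "F", "M"],
    "Count": [62, 70, 10, 30],
})

# Selecting the column before aggregating yields a Series of totals per name.
name_counts = toy.groupby("Name")["Count"].sum().sort_values(ascending=False)
print(name_counts)
```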


### Step 8. How many different names exist in the dataset?

In [21]:
baby_names.Name.nunique()

17632
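
`nunique()` counts each spelling once, even when a name is recorded for both genders. If name and gender pairs should be counted separately, something like the following sketch could be used (the frame `toy` and its rows are hypothetical):

```python
import pandas as pd

# Hypothetical rows: "Riley" appears under both genders.
toy = pd.DataFrame({
    "Name": ["Riley", "Riley", "Emma", "Emma"],
    "Gender": ["F", "M", "F", "F"],
    "Count": [12, 9, 62, 10],
})

# Distinct spellings, regardless of gender.
n_names = toy["Name"].nunique()

# Distinct (Name, Gender) combinations.
n_name_gender_pairs = len(toy[["Name", "Gender"]].drop_duplicates())

print(n_names, n_name_gender_pairs)
```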

### Step 9. What is the name with the most occurrences?

In [34]:

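# numeric_only=True keeps Year and Count and skips the string columns (Gender, State)
names = baby_names.groupby(by="Name").sum(numeric_only=True)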
names.sort_values("Count",ascending=False).head(10)


Name,Year,Count
Jacob,1141099,242874
Emma,1137085,214852
Michael,1161152,214405
Ethan,1139091,209277
Isabella,1137090,204798
William,1131058,197894
Joshua,1141085,191551
Sophia,1131067,191446
Daniel,1163179,191440
Emily,1135066,190318
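
Sorting shows the top ten, but the single most frequent name can also be read off directly. A minimal sketch, assuming a hypothetical Series of per-name totals (`counts`) with a deliberate tie:

```python
import pandas as pd

# Hypothetical per-name totals; Jacob and Ethan are tied on purpose.
counts = pd.Series({"Jacob": 100, "Emma": 72, "Ethan": 100})

# idxmax() returns the first label at which the maximum occurs.
top_name = counts.idxmax()

# Filtering on the maximum returns every tied name.
tied = counts[counts == counts.max()].index.tolist()

print(top_name, tied)
```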


### Step 10. How many different names have the least occurrences?

In [37]:
len(names[names.Count == names.Count.min()])

2578
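
The same count can be obtained by summing a boolean mask, which avoids materialising the filtered frame. A minimal sketch on a hypothetical Series of per-name totals:

```python
import pandas as pd

# Hypothetical per-name totals.
counts = pd.Series({"Eniola": 5, "Marci": 5, "Emma": 72, "Jacob": 100})

# True counts as 1, so the sum is the number of names at the minimum.
is_least = counts == counts.min()
n_least = int(is_least.sum())

# The names themselves, if they are needed.
least_names = counts.index[is_least].tolist()

print(n_least, least_names)
```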

### Step 11. What is the median name occurrence?

In [43]:
names[names.Count == names.Count.median()]

Name,Year,Count
Aishani,14078,49
Alara,16079,49
Alysse,16057,49
Ameir,16086,49
Anely,16071,49
...,...,...
Sriram,14054,49
Trinton,16069,49
Vita,14075,49
Yoni,16060,49
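
Filtering on equality with the median only works when the median is an observed value. With an even number of names the median can fall between two totals, and the filter returns nothing. A minimal sketch of that case, on hypothetical totals, with one possible fallback to the closest names:

```python
import pandas as pd

# Hypothetical per-name totals with an even number of entries.
counts = pd.Series({"Marci": 5, "Yoni": 10, "Vita": 20, "Emma": 100})

# The median is the mean of the two middle values: (10 + 20) / 2.
median_value = counts.median()

# No name has exactly this total, so the equality filter is empty.
exact = counts[counts == median_value]

# Fallback: keep the names whose totals are closest to the median.
distance = (counts - median_value).abs()
closest = counts[distance == distance.min()]

print(median_value, exact.empty, closest.index.tolist())
```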


### Step 12. What is the standard deviation of name occurrences?

In [45]:
names.Count.std()

11006.069467891111
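
pandas computes the sample standard deviation (`ddof=1`) by default, whereas `numpy.std` defaults to the population form (`ddof=0`), so the two can disagree on the same data. A minimal sketch on hypothetical totals:

```python
import math

import numpy as np
import pandas as pd

# Hypothetical per-name totals.
counts = pd.Series([5, 5, 49, 100])

sample_std = counts.std()              # pandas default: ddof=1
population_std = counts.std(ddof=0)    # divide by n instead of n - 1
numpy_std = np.std(counts.to_numpy())  # NumPy default: ddof=0

# The two forms differ by a factor of sqrt(n / (n - 1)).
n = len(counts)
ratio = math.sqrt(n / (n - 1))

print(sample_std, population_std, numpy_std)
```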

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [46]:
baby_names.describe()

,Year,Count
count,1016395.0,1016395.0
mean,2009.053,34.85012
std,3.138293,97.39735
min,2004.0,5.0
25%,2006.0,7.0
50%,2009.0,11.0
75%,2012.0,26.0
max,2014.0,4167.0
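
This summary is computed over individual rows (one per name, year, gender and state), while the previous steps work with totals per name. A minimal sketch, on a hypothetical four-row frame, of how the two summaries differ:

```python
import pandas as pd

# Hypothetical rows; the counts are illustrative only.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob", "Marci"],
    "Year": [2004, 2005, 2004, 2004],
    "Count": [62, 10, 100, 5],
})

# Row-level summary: one observation per record.
per_row = toy["Count"].describe()

# Name-level summary: one observation per name, after summing its records.
per_name = toy.groupby("Name")["Count"].sum().describe()

print(per_row, per_name, sep="\n\n")
```

On the full data, the name-level variant would be the one consistent with the median and standard deviation computed in the earlier steps.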
