# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [1]:
import numpy as np
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [2]:
baby_names = pd.read_csv("https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv",
                 sep=",", index_col=False)


### Step 4. See the first 10 entries

In [3]:
baby_names.head(10)

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
5,11354,11355,Abigail,2004,F,AK,37
6,11355,11356,Olivia,2004,F,AK,33
7,11356,11357,Isabella,2004,F,AK,30
8,11357,11358,Alyssa,2004,F,AK,29
9,11358,11359,Sophia,2004,F,AK,28


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [9]:
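# drop() returns a new DataFrame; baby_names itself keeps both columns unless the result is reassigned
baby_names.drop(columns=["Unnamed: 0", "Id"])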

,Name,Year,Gender,State,Count
0,Emma,2004,F,AK,62
1,Madison,2004,F,AK,48
2,Hannah,2004,F,AK,46
3,Grace,2004,F,AK,44
4,Emily,2004,F,AK,41
...,...,...,...,...,...
1016390,Seth,2014,M,WY,5
1016391,Spencer,2014,M,WY,5
1016392,Tyce,2014,M,WY,5
1016393,Victor,2014,M,WY,5
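
Because `DataFrame.drop` returns a new DataFrame by default, the cell above only displays the trimmed result. A minimal sketch on a made-up three-row frame (the `toy` and `trimmed` names are assumptions for illustration, not part of the exercise) shows the difference between displaying the result and keeping it:

```python
import pandas as pd

# Hypothetical toy frame that mimics part of the dataset's column layout.
toy = pd.DataFrame({
    "Unnamed: 0": [11349, 11350, 11351],
    "Id": [11350, 11351, 11352],
    "Name": ["Emma", "Madison", "Hannah"],
    "Count": [62, 48, 46],
})

# drop() returns a copy without the listed columns; toy is left untouched.
trimmed = toy.drop(columns=["Unnamed: 0", "Id"])

print(list(toy.columns))      # all four columns are still present
print(list(trimmed.columns))  # only Name and Count remain
```

Reassigning the result (here to `trimmed`) is what makes the deletion persist in later cells.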


### Step 6. Are there more male or female names in the dataset?

In [19]:
# Select the rows recorded as female and count them
female = baby_names[baby_names['Gender'] == "F"]
print(len(female))

558846


In [18]:
# Select the rows recorded as male and count them
male = baby_names[baby_names['Gender'] == "M"]
print(len(male))

457549
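
The two lengths above are record counts, so the dataset holds more female records (558846) than male ones (457549). The same comparison can be sketched in one pass with `value_counts`; the `toy` frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical toy rows; each row stands for one record in the dataset.
toy = pd.DataFrame({
    "Name": ["Emma", "Madison", "Liam", "Olivia"],
    "Gender": ["F", "F", "M", "F"],
})

# Count the records for every gender at once, largest first.
rows_per_gender = toy["Gender"].value_counts()

# Label of the gender with the most records.
most_common = rows_per_gender.idxmax()

print(rows_per_gender)
print(most_common)
```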


### Step 7. Group the dataset by name and assign to names

In [23]:
# count() tallies the non-null rows per name, so every column holds the same number
names = baby_names.groupby('Name').count()
names

Name,Unnamed: 0,Id,Year,Gender,State,Count
Aaban,2,2,2,2,2,2
Aadan,4,4,4,4,4,4
Aadarsh,1,1,1,1,1,1
Aaden,196,196,196,196,196,196
Aadhav,1,1,1,1,1,1
...,...,...,...,...,...,...
Zyra,7,7,7,7,7,7
Zyrah,2,2,2,2,2,2
Zyren,1,1,1,1,1,1
Zyria,10,10,10,10,10,10
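
Note that the `Count` column in this grouped table is the number of records per name, not the number of babies. If total births per name are what is wanted, summing the original `Count` column is one option. A small sketch on made-up rows (the `toy` frame and both result names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy rows: Emma appears in two records, Liam in one.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Liam"],
    "State": ["AK", "AL", "AK"],
    "Count": [62, 10, 40],
})

# Number of records per name (what groupby(...).count() reports).
records_per_name = toy.groupby("Name").size()

# Number of births per name, adding up the Count column.
births_per_name = toy.groupby("Name")["Count"].sum()

print(records_per_name)
print(births_per_name)
```

The later steps in this notebook work with the record counts, so their outputs correspond to the first of these two tables.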


### Step 8. How many different names exist in the dataset?

In [24]:
len(names)

17632
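
`len(names)` works here because the grouped table has one row per name. As a sketch of an alternative that does not need the grouped table, `Series.nunique` counts distinct values directly (the `toy` frame is made up for illustration):

```python
import pandas as pd

# Hypothetical toy rows with repeated names.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Liam", "Olivia", "Liam"],
    "Count": [62, 10, 40, 33, 7],
})

# Distinct names, counted straight from the column.
distinct = toy["Name"].nunique()

# Same figure obtained from the number of groups.
groups = toy.groupby("Name").ngroups

print(distinct, groups)
```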

### Step 9. What is the name with the most occurrences?

In [27]:
names.sort_values("Count", ascending=False).head()


Name,Unnamed: 0,Id,Year,Gender,State,Count
Riley,1112,1112,1112,1112,1112,1112
Avery,1080,1080,1080,1080,1080,1080
Jordan,1073,1073,1073,1073,1073,1073
Peyton,1064,1064,1064,1064,1064,1064
Hayden,1049,1049,1049,1049,1049,1049
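
Sorting the whole table is not strictly needed to find the top entry. A minimal sketch, using a hand-built stand-in for `names` whose three values are copied from the outputs above (the stand-in itself is an assumption for illustration):

```python
import pandas as pd

# Hand-built stand-in for the grouped table, indexed by Name.
names = pd.DataFrame(
    {"Count": [1112, 1080, 2]},
    index=pd.Index(["Riley", "Avery", "Aaban"], name="Name"),
)

# Index label of the largest Count.
top_name = names["Count"].idxmax()

# The two rows with the largest Count, without a full sort.
top_two = names.nlargest(2, "Count")

print(top_name)
print(top_two)
```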


### Step 10. How many different names have the least occurrences?

In [30]:
names.sort_values("Count", ascending = True).head(10)


Name,Unnamed: 0,Id,Year,Gender,State,Count
Katherina,1,1,1,1,1,1
Breyona,1,1,1,1,1,1
Greidy,1,1,1,1,1,1
Shriyan,1,1,1,1,1,1
Briah,1,1,1,1,1,1
Shrihan,1,1,1,1,1,1
Shreyansh,1,1,1,1,1,1
Merit,1,1,1,1,1,1
Brianah,1,1,1,1,1,1
Greggory,1,1,1,1,1,1
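
The sorted view shows that the lowest value is 1, but not how many names share it. One way to count them is to compare against the minimum. The stand-in for `names` below is hand-built for illustration, with values copied from the outputs above:

```python
import pandas as pd

# Hand-built stand-in for the grouped table, indexed by Name.
names = pd.DataFrame(
    {"Count": [1, 1, 2, 1112]},
    index=pd.Index(["Katherina", "Breyona", "Aaban", "Riley"], name="Name"),
)

# Smallest number of occurrences in the table.
least = names["Count"].min()

# Boolean mask of the names tied at that minimum.
tied = names["Count"] == least

# How many names are tied, and which ones.
n_least = int(tied.sum())
least_names = names.index[tied].tolist()

print(least, n_least, least_names)
```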


### Step 11. What is the median name occurrence?

In [32]:
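# Restrict to numeric columns; recent pandas versions raise on the text columns otherwise
baby_names.groupby('Name').median(numeric_only=True)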


Name,Unnamed: 0,Id,Year,Count
Aaban,3866900.5,3866901.5,2013.5,6.0
Aadan,702439.0,702440.0,2008.5,5.5
Aadarsh,1728030.0,1728031.0,2009.0,5.0
Aaden,2867679.5,2867680.5,2010.0,10.0
Aadhav,709606.0,709607.0,2014.0,6.0
...,...,...,...,...
Zyra,1069863.0,1069864.0,2013.0,6.0
Zyrah,2743536.5,2743537.5,2012.0,5.5
Zyren,5074229.0,5074230.0,2013.0,6.0
Zyria,2139380.0,2139381.0,2008.0,6.0
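
The table above gives a median per name for each numeric column. If the question is read as "what is the median number of occurrences across names", a single number can be taken from the grouped counts instead. A sketch on a hand-built stand-in for `names`, using the first five counts from Step 7 (the stand-in is an assumption for illustration):

```python
import pandas as pd

# Hand-built stand-in for the grouped table, indexed by Name.
names = pd.DataFrame(
    {"Count": [2, 4, 1, 196, 1]},
    index=pd.Index(["Aaban", "Aadan", "Aadarsh", "Aaden", "Aadhav"], name="Name"),
)

# Median of the per-name occurrence counts: one number for the whole table.
median_occurrences = names["Count"].median()

print(median_occurrences)
```

On the full `names` table this value would be expected to match the 50% row of the `describe()` output in Step 13.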


### Step 12. What is the standard deviation of name occurrences?

In [33]:
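# Restrict to numeric columns; recent pandas versions raise on the text columns otherwise
baby_names.groupby('Name').std(numeric_only=True)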


Name,Unnamed: 0,Id,Year,Count
Aaban,1.463004e+03,1.463004e+03,0.707107,0.000000
Aadan,2.181189e+06,2.181189e+06,2.872281,0.957427
Aadarsh,NaN,NaN,NaN,NaN
Aaden,1.615565e+06,1.615565e+06,2.044322,21.154974
Aadhav,NaN,NaN,NaN,NaN
...,...,...,...,...
Zyra,2.273476e+06,2.273476e+06,2.115701,1.154701
Zyrah,3.097243e+06,3.097243e+06,1.414214,0.707107
Zyren,NaN,NaN,NaN,NaN
Zyria,1.704179e+06,1.704179e+06,2.685351,0.737865
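
Names with a single record (Aadarsh, Aadhav, Zyren) have missing values in the output above. That follows from the sample standard deviation (`ddof=1`, the pandas default), which is undefined for one observation. A small sketch on made-up rows (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy rows: Aaban has two records, Aadarsh only one.
toy = pd.DataFrame({
    "Name": ["Aaban", "Aaban", "Aadarsh"],
    "Count": [6, 6, 5],
})

grouped = toy.groupby("Name")["Count"]

# Sample standard deviation (ddof=1): NaN for single-record groups.
sample_std = grouped.std()

# Population standard deviation (ddof=0): defined even for one record.
population_std = grouped.std(ddof=0)

print(sample_std)
print(population_std)
```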


### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [34]:
names.describe()

,Unnamed: 0,Id,Year,Gender,State,Count
count,17632.0,17632.0,17632.0,17632.0,17632.0,17632.0
mean,57.644907,57.644907,57.644907,57.644907,57.644907,57.644907
std,122.029964,122.029964,122.029964,122.029964,122.029964,122.029964
min,1.0,1.0,1.0,1.0,1.0,1.0
25%,2.0,2.0,2.0,2.0,2.0,2.0
50%,8.0,8.0,8.0,8.0,8.0,8.0
75%,39.0,39.0,39.0,39.0,39.0,39.0
max,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0
