# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 to 2014.


### Step 1. Import the necessary libraries

In [1]:
import pandas as pd
from numpy import nan

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

In [2]:
address="https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv"

### Step 3. Assign it to a variable called baby_names.

In [3]:
baby_names=pd.read_csv(address)
baby_names

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
...,...,...,...,...,...,...,...
1016390,5647421,5647422,Seth,2014,M,WY,5
1016391,5647422,5647423,Spencer,2014,M,WY,5
1016392,5647423,5647424,Tyce,2014,M,WY,5
1016393,5647424,5647425,Victor,2014,M,WY,5
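
The `Unnamed: 0` column above is what pandas typically produces when the first header cell of a CSV is empty. As a minimal sketch, assuming that is the cause here, `index_col=0` reads that column as the index instead. The three-row CSV below is a hypothetical stand-in with the same layout as the real file, so the example runs without a download.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the real file's layout: the first header
# cell is empty, so a plain read_csv would name that column "Unnamed: 0".
sample_csv = io.StringIO(
    ",Id,Name,Year,Gender,State,Count\n"
    "11349,11350,Emma,2004,F,AK,62\n"
    "11350,11351,Madison,2004,F,AK,48\n"
    "11351,11352,Hannah,2004,F,AK,46\n"
)

# index_col=0 uses the unnamed first column as the row index.
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.tolist())
```

Whether this is preferable to dropping the column afterwards depends on whether the original row numbers are still needed.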


### Step 4. See the first 10 entries

In [4]:
baby_names.head(10)

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
5,11354,11355,Abigail,2004,F,AK,37
6,11355,11356,Olivia,2004,F,AK,33
7,11356,11357,Isabella,2004,F,AK,30
8,11357,11358,Alyssa,2004,F,AK,29
9,11358,11359,Sophia,2004,F,AK,28


### Step 5. Delete the column 'Unnamed: 0' and 'Id'

In [5]:
d1=baby_names.drop(["Unnamed: 0","Id"],axis=1)
d1

,Name,Year,Gender,State,Count
0,Emma,2004,F,AK,62
1,Madison,2004,F,AK,48
2,Hannah,2004,F,AK,46
3,Grace,2004,F,AK,44
4,Emily,2004,F,AK,41
...,...,...,...,...,...
1016390,Seth,2014,M,WY,5
1016391,Spencer,2014,M,WY,5
1016392,Tyce,2014,M,WY,5
1016393,Victor,2014,M,WY,5
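
Note that `drop` returns a new DataFrame and leaves the original untouched. Since the result is stored in `d1`, `baby_names` still carries both columns, which is why they reappear in the grouped outputs of later steps. The sketch below shows this on a small made-up frame (`toy` and `trimmed` are illustrative names, not part of the exercise).

```python
import pandas as pd

# Small made-up frame with the same helper columns as the real data.
toy = pd.DataFrame({
    "Unnamed: 0": [11349, 11350],
    "Id": [11350, 11351],
    "Name": ["Emma", "Madison"],
    "Count": [62, 48],
})

# drop() returns a copy without the listed columns; `toy` is not modified.
trimmed = toy.drop(columns=["Unnamed: 0", "Id"])

print(toy.columns.tolist())      # still includes "Unnamed: 0" and "Id"
print(trimmed.columns.tolist())  # only "Name" and "Count"
```

To carry the change forward, either keep working with the returned frame or reassign it to the original variable.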


### Step 6. Are there more male or female names in the dataset?

In [6]:
d2=baby_names['Gender'].value_counts()
d2

F    558846
M    457549
Name: Gender, dtype: int64

In [7]:
d3=baby_names.groupby("Gender").count()
d3

Gender,Unnamed: 0,Id,Name,Year,State,Count
F,558846,558846,558846,558846,558846,558846
M,457549,457549,457549,457549,457549,457549
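
Both cells above count rows, that is, name/year/state records per gender, not babies. If the question is read as "which gender has more births", the `Count` column has to be summed instead. A minimal sketch on made-up data (the frame `toy` is hypothetical) shows how the two readings can disagree:

```python
import pandas as pd

# Made-up records: fewer female rows, but far more female births.
toy = pd.DataFrame({
    "Name": ["Emma", "Madison", "Seth", "Victor", "Tyce"],
    "Gender": ["F", "F", "M", "M", "M"],
    "Count": [62, 48, 5, 5, 5],
})

# Number of records per gender (what value_counts measures).
rows_per_gender = toy["Gender"].value_counts()

# Number of births per gender (sum of the Count column).
births_per_gender = toy.groupby("Gender")["Count"].sum()

print(rows_per_gender)
print(births_per_gender)
```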


### Step 7. Group the dataset by name and assign to names

In [8]:
d4 = baby_names.groupby("Name").sum()
d4

Name,Unnamed: 0,Id,Year,Count
Aaban,7733801,7733803,4027,12
Aadan,7158061,7158065,8039,23
Aadarsh,1728030,1728031,2009,5
Aaden,555052029,555052225,393963,3426
Aadhav,709606,709607,2014,6
...,...,...,...,...
Zyra,17538998,17539005,14085,42
Zyrah,5487073,5487075,4024,11
Zyren,5074229,5074230,2013,6
Zyria,29787029,29787039,20089,59
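
Summing every column also adds up `Unnamed: 0`, `Id`, and `Year`, which are identifiers rather than quantities (a `Year` total of 4027 for Aaban has no meaning). Depending on the pandas version, a bare `.sum()` may also try to aggregate the text columns. One way to sidestep both issues, sketched here on made-up data, is to select `Count` before aggregating. The variable name `names` follows the step title; the frame `toy` is hypothetical.

```python
import pandas as pd

# Made-up records spanning several years and states.
toy = pd.DataFrame({
    "Name": ["Jacob", "Emma", "Jacob", "Emma", "Tyce"],
    "Year": [2004, 2004, 2005, 2005, 2014],
    "State": ["AK", "AK", "WY", "WY", "WY"],
    "Count": [10, 7, 12, 9, 5],
})

# Select the column first so only Count is summed per name.
names = toy.groupby("Name")["Count"].sum()

print(names)
print(names.idxmax())  # name with the largest total
```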


### Step 8. How many different names exist in the dataset?

In [9]:
d5=baby_names["Name"].nunique()
d5

17632

### Step 9. What is the name with most occurrences?

In [10]:
d6=d4.Count.idxmax()
d6

'Jacob'

In [11]:
d4.loc["Jacob"]


Unnamed: 0    1665680788
Id            1665681356
Year             1141099
Count             242874
Name: Jacob, dtype: int64

### Step 10. How many different names have the least occurrences?

In [12]:
d4.Count.idxmin()

'Aadarsh'
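
`idxmin` returns only the first name that reaches the minimum, so it does not say how many names share it. A minimal sketch of the counting step, using a small series whose totals are illustrative (`Name_X` is a hypothetical entry added to create a tie):

```python
import pandas as pd

# Illustrative per-name totals; "Name_X" is made up to create a tie.
counts = pd.Series(
    {"Aadarsh": 5, "Aadhav": 6, "Name_X": 5, "Jacob": 242874, "Zyren": 6},
    name="Count",
)

least = counts.min()

# Boolean mask of names at the minimum; summing it counts them.
is_least = counts == least
n_least = int(is_least.sum())
names_least = counts[is_least].index.tolist()

print(n_least)
print(names_least)
```

Applied to the grouped frame from Step 7, the same idea would read `(d4.Count == d4.Count.min()).sum()`.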

### Step 11. What is the median name occurrence?

In [13]:
d4

Name,Unnamed: 0,Id,Year,Count
Aaban,7733801,7733803,4027,12
Aadan,7158061,7158065,8039,23
Aadarsh,1728030,1728031,2009,5
Aaden,555052029,555052225,393963,3426
Aadhav,709606,709607,2014,6
...,...,...,...,...
Zyra,17538998,17539005,14085,42
Zyrah,5487073,5487075,4024,11
Zyren,5074229,5074230,2013,6
Zyria,29787029,29787039,20089,59


In [39]:
median=d4["Count"].median()
median

49.0

In [38]:
d4.Count.describe()

count     17632.000000
mean       2008.932169
std       11006.069468
min           5.000000
25%          11.000000
50%          49.000000
75%         337.000000
max      242874.000000
Name: Count, dtype: float64
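
The summary shows a mean of about 2009 against a 50% value (the median) of 49. The gap comes from a few very popular names pulling the mean upward. The sketch below reproduces the effect on a tiny illustrative series; its five values are borrowed from the quartile figures above purely as toy input.

```python
import pandas as pd

# Toy input: mostly small totals plus one very large one.
counts = pd.Series([5, 11, 49, 337, 242874], name="Count")

# The median is the middle value; the mean is dominated by the outlier.
median_count = counts.median()
mean_count = counts.mean()

print(median_count)
print(mean_count)
```

For a distribution this skewed, the median is usually the more representative "typical" value.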

### Step 12. What is the standard deviation of names?

In [43]:
desv=d4["Count"].std()
desv.round()

11006.0

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [44]:
d4.Count.describe()

count     17632.000000
mean       2008.932169
std       11006.069468
min           5.000000
25%          11.000000
50%          49.000000
75%         337.000000
max      242874.000000
Name: Count, dtype: float64
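
`describe` reports the 25%, 50%, and 75% quantiles by default. If other cut points are of interest, they can be requested through the `percentiles` argument. A short sketch on an illustrative series (the values are toy input, not results from the dataset):

```python
import pandas as pd

# Toy totals standing in for the per-name counts.
counts = pd.Series([5, 11, 49, 337, 242874], name="Count")

# Request the 10th, 50th and 90th percentiles instead of the quartiles.
summary = counts.describe(percentiles=[0.1, 0.5, 0.9])

print(summary)
```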