# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 to 2014.


### Step 1. Import the necessary libraries

In [1]:
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [2]:
address = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv'

In [3]:
baby_names = pd.read_csv(address)

In [6]:
baby_names.shape

(1016395, 7)

### Step 4. See the first 10 entries

In [4]:
baby_names.head(10)

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
5,11354,11355,Abigail,2004,F,AK,37
6,11355,11356,Olivia,2004,F,AK,33
7,11356,11357,Isabella,2004,F,AK,30
8,11357,11358,Alyssa,2004,F,AK,29
9,11358,11359,Sophia,2004,F,AK,28


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [8]:
baby_names.drop(['Unnamed: 0', 'Id'], axis=1, inplace=True)

In [9]:
baby_names.head()

,Name,Year,Gender,State,Count
0,Emma,2004,F,AK,62
1,Madison,2004,F,AK,48
2,Hannah,2004,F,AK,46
3,Grace,2004,F,AK,44
4,Emily,2004,F,AK,41


In [10]:
# can also use

# del baby_names['Unnamed: 0']
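
A column named `Unnamed: 0` usually appears when a CSV was saved together with its index and then read back without `index_col`. The sketch below reproduces that on a small made-up frame (the data is illustrative and not taken from the dataset) and uses `drop(columns=...)`, which returns a new frame instead of modifying in place.

```python
import io
import pandas as pd

# Hypothetical miniature frame, for illustration only
sample = pd.DataFrame({"Id": [1, 2], "Name": ["Emma", "Jacob"], "Count": [62, 48]})

# Saving with the default index=True writes the index as a header-less first column
csv_text = sample.to_csv()

# Reading it back without index_col yields an 'Unnamed: 0' column
reloaded = pd.read_csv(io.StringIO(csv_text))
print(reloaded.columns.tolist())

# drop(columns=...) returns a new frame and leaves `reloaded` untouched
cleaned = reloaded.drop(columns=["Unnamed: 0", "Id"])
print(cleaned.columns.tolist())
```

Passing `index_col=0` to `pd.read_csv` would avoid the extra column at load time, assuming the first column really is a saved index.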

### Step 6. Are there more male or female names in the dataset?

In [11]:
baby_names.Gender.value_counts()

F    558846
M    457549
Name: Gender, dtype: int64
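
Note that `value_counts()` counts rows, and each row is one name in one state and year, not one baby. If the question is read as "which gender has more registered babies", summing `Count` per gender is one way to answer it. A minimal sketch on made-up rows (the values are hypothetical):

```python
import pandas as pd

# Hypothetical rows mimicking the dataset's columns (values are made up)
toy = pd.DataFrame({
    "Name":   ["Emma", "Grace", "Jacob", "Emma"],
    "Gender": ["F", "F", "M", "F"],
    "State":  ["AK", "AK", "AK", "WY"],
    "Count":  [10, 5, 40, 7],
})

rows_per_gender = toy["Gender"].value_counts()             # number of rows
births_per_gender = toy.groupby("Gender")["Count"].sum()   # number of babies
names_per_gender = toy.groupby("Gender")["Name"].nunique() # distinct names

print(rows_per_gender)
print(births_per_gender)
print(names_per_gender)
```

In this toy data the three readings disagree: there are more female rows and more distinct female names, but more male babies.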

### Step 7. Group the dataset by name and assign to names_gb

In [12]:
# Year is not needed for per-name totals, so drop it before grouping
del baby_names['Year']

In [29]:
names_gb = baby_names.groupby(by='Name').agg({
    'Count' : 'sum'
})

In [30]:
names_gb

Name,Count
Aaban,12
Aadan,23
Aadarsh,5
Aaden,3426
Aadhav,6
...,...
Zyra,42
Zyrah,11
Zyren,6
Zyria,59


In [31]:
names_gb.reset_index(inplace=True)

In [32]:
names_gb.sort_values(by='Count', ascending=False)

,Name,Count
7198,Jacob,242874
5378,Emma,214852
12111,Michael,214405
5579,Ethan,209277
6973,Isabella,204798
...,...,...
5453,Eniola,5
2037,Atlantis,5
11478,Marci,5
15236,Simarpreet,5
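
The group, reset and sort steps above can also be written more compactly. The following is a sketch on made-up data, not the notebook's own method: `as_index=False` keeps `Name` as a regular column, and `nlargest` picks the top rows directly.

```python
import pandas as pd

# Hypothetical rows (values are made up)
toy = pd.DataFrame({
    "Name":  ["Emma", "Jacob", "Emma", "Nita", "Jacob"],
    "State": ["AK", "AK", "WY", "WY", "WY"],
    "Count": [62, 80, 30, 5, 20],
})

# as_index=False keeps Name as a regular column, so no reset_index is needed
totals = toy.groupby("Name", as_index=False)["Count"].sum()

# nlargest returns the top rows by Count, already ordered from largest down
top2 = totals.nlargest(2, "Count")
print(top2)
```

Selecting `["Count"]` before `sum()` also avoids trying to aggregate the non-numeric columns.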


### Step 8. How many different names exist in the dataset?

In [33]:
baby_names.Name.nunique()

17632

### Step 9. Which name has the most occurrences?

In [35]:
baby_names.head()

,Name,Gender,State,Count
0,Emma,F,AK,62
1,Madison,F,AK,48
2,Hannah,F,AK,46
3,Grace,F,AK,44
4,Emily,F,AK,41


In [47]:
sorted_names = names_gb.sort_values(by='Count', ascending=False)

In [53]:
sorted_names.head(3)

,Name,Count
7198,Jacob,242874
5378,Emma,214852
12111,Michael,214405


In [51]:
sorted_names.iloc[0,0]

'Jacob'

In [57]:
sorted_names.tail()

,Name,Count
5453,Eniola,5
2037,Atlantis,5
11478,Marci,5
15236,Simarpreet,5
13010,Nita,5
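
Sorting the whole table is not required just to find the top name. As an alternative sketch, assuming the totals are held in a Series indexed by name (the values below are hypothetical), `idxmax` returns the label of the largest value:

```python
import pandas as pd

# Hypothetical per-name totals indexed by name
totals = pd.Series({"Emma": 92, "Jacob": 100, "Nita": 5}, name="Count")

top_name = totals.idxmax()   # index label of the largest value
top_count = totals.max()     # the largest value itself
print(top_name, top_count)
```

If several names tie for the maximum, `idxmax` returns only the first one it encounters.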


### Step 10. How many different names have the least occurrences?

In [71]:
names_gb.Count.min()

5

In [72]:
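# Note: min_names is only assigned in In [76] below; the output here appears to come
# from an earlier definition that filtered the ungrouped rows of baby_names.
min_names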

[              Name Gender State  Count
 166       Adrianna      F    AK      5
 167          Alice      F    AK      5
 168         Aliyah      F    AK      5
 169          Amaya      F    AK      5
 170      Anastasia      F    AK      5
 ...            ...    ...   ...    ...
 1016390       Seth      M    WY      5
 1016391    Spencer      M    WY      5
 1016392       Tyce      M    WY      5
 1016393     Victor      M    WY      5
 1016394     Waylon      M    WY      5
 
 [147086 rows x 4 columns]]

In [73]:
names_gb.columns

Index(['Name', 'Count'], dtype='object')

In [76]:
min_names = names_gb[names_gb['Count'] == 5]

In [78]:
min_names.shape[0]

2578
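
Filtering the ungrouped rows and filtering the per-name totals answer different questions, which is why the row count and the name count differ so much. A minimal sketch with made-up rows (the names and values are hypothetical):

```python
import pandas as pd

# Hypothetical rows: Alice appears in two states with the minimum count each time
toy = pd.DataFrame({
    "Name":  ["Alice", "Alice", "Nita", "Seth"],
    "State": ["AK", "WY", "AK", "WY"],
    "Count": [5, 5, 5, 9],
})

# Rows whose per-state count equals the smallest per-state count
rows_at_min = (toy["Count"] == toy["Count"].min()).sum()

# Names whose total across all rows equals the smallest total
totals = toy.groupby("Name")["Count"].sum()
names_at_min = (totals == totals.min()).sum()

print(rows_at_min, names_at_min)
```

Summing a boolean mask counts its `True` values, so no intermediate filtered frame is needed.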

### Step 11. What is the median name occurrence?

In [81]:
median_names = names_gb[names_gb.Count == names_gb.Count.median()]

In [82]:
median_names

,Name,Count
524,Aishani,49
637,Alara,49
986,Alysse,49
1083,Ameir,49
1360,Anely,49
...,...,...
15403,Sriram,49
16250,Trinton,49
16622,Vita,49
17119,Yoni,49


### Step 12. What is the standard deviation of name occurrences?

In [87]:
st_dev = round(names_gb.Count.std(),2)

In [89]:
st_dev

11006.07
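
pandas computes the sample standard deviation by default (`ddof=1`), whereas NumPy's `np.std` defaults to the population version (`ddof=0`). A small sketch on made-up numbers shows the difference:

```python
import numpy as np
import pandas as pd

# Made-up values with mean 5 and population variance 4
counts = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = counts.std()               # pandas default: ddof=1
population_std = counts.std(ddof=0)     # divide by n instead of n - 1
numpy_std = np.std(counts.to_numpy())   # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```

With more than 17,000 names the two versions differ very little, but the distinction matters on small samples.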

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [90]:
names_gb.describe()

,Count
count,17632.0
mean,2008.932169
std,11006.069468
min,5.0
25%,11.0
50%,49.0
75%,337.0
max,242874.0
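
`describe()` reports the quartiles by default; other cut points can be requested with the `percentiles` argument. A minimal sketch on a hypothetical Series (the values are chosen only for illustration):

```python
import pandas as pd

# Hypothetical totals, for illustration only
counts = pd.Series([5, 11, 49, 337, 1000], name="Count")

# Ask for the 10th, 50th and 90th percentiles instead of the quartiles
summary = counts.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```

The percentile rows are labelled `10%`, `50%` and `90%` in the result, and values between data points are linearly interpolated.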
