# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 to 2014.


### Step 1. Import the necessary libraries

In [1]:
import numpy as np
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [3]:
baby_names = pd.read_csv("https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv", sep=",")
baby_names

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
...,...,...,...,...,...,...,...
1016390,5647421,5647422,Seth,2014,M,WY,5
1016391,5647422,5647423,Spencer,2014,M,WY,5
1016392,5647423,5647424,Tyce,2014,M,WY,5
1016393,5647424,5647425,Victor,2014,M,WY,5


In [4]:
baby_names.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 7 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Unnamed: 0  1016395 non-null  int64 
 1   Id          1016395 non-null  int64 
 2   Name        1016395 non-null  object
 3   Year        1016395 non-null  int64 
 4   Gender      1016395 non-null  object
 5   State       1016395 non-null  object
 6   Count       1016395 non-null  int64 
dtypes: int64(4), object(3)
memory usage: 54.3+ MB
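
A column called `Unnamed: 0` typically appears when the first header field of a CSV is empty, for example because a DataFrame was saved together with its index. The sketch below uses a small in-memory CSV (`toy_csv` is made up for illustration, not the real file) to show one possible way to avoid the extra column by reading it as the index with `index_col=0`.

```python
import io
import pandas as pd

# Hypothetical toy CSV whose first header field is empty, as happens when a
# DataFrame is written to disk together with its index.
toy_csv = io.StringIO(
    ",Id,Name,Count\n"
    "11349,11350,Emma,62\n"
    "11350,11351,Madison,48\n"
)

# Default read: pandas labels the unnamed first column "Unnamed: 0".
default_read = pd.read_csv(toy_csv)
print(default_read.columns.tolist())

# Rewind the buffer and read the first column as the index instead.
toy_csv.seek(0)
indexed_read = pd.read_csv(toy_csv, index_col=0)
print(indexed_read.columns.tolist())
```

Whether this is preferable to deleting the column afterwards depends on whether the old index values are still needed.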


### Step 4. See the first 10 entries

In [6]:
baby_names.head(10)

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
5,11354,11355,Abigail,2004,F,AK,37
6,11355,11356,Olivia,2004,F,AK,33
7,11356,11357,Isabella,2004,F,AK,30
8,11357,11358,Alyssa,2004,F,AK,29
9,11358,11359,Sophia,2004,F,AK,28


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [7]:
# deletes Unnamed: 0
del baby_names['Unnamed: 0']

# deletes Id
del baby_names['Id']

baby_names.head()

,Name,Year,Gender,State,Count
0,Emma,2004,F,AK,62
1,Madison,2004,F,AK,48
2,Hannah,2004,F,AK,46
3,Grace,2004,F,AK,44
4,Emily,2004,F,AK,41
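
As an alternative to `del`, both columns can be removed in one call with `DataFrame.drop`, which returns a new frame and leaves the original untouched. This is a minimal sketch on a hypothetical `toy` frame that only mimics the layout of the real data.

```python
import pandas as pd

# Hypothetical toy frame with the same two helper columns as the dataset.
toy = pd.DataFrame({
    "Unnamed: 0": [11349, 11350],
    "Id": [11350, 11351],
    "Name": ["Emma", "Madison"],
    "Count": [62, 48],
})

# drop() returns a copy without the listed columns; `toy` itself is unchanged.
trimmed = toy.drop(columns=["Unnamed: 0", "Id"])
print(trimmed.columns.tolist())
```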


### Step 6. Are there more male or female names in the dataset?

In [8]:
baby_names["Gender"].value_counts()

Gender
F    558846
M    457549
Name: count, dtype: int64
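
Note that `value_counts()` counts rows, and each row is one name in one state, year and gender. Depending on how the question is read, one might instead want the number of births or the number of distinct names per gender. The sketch below contrasts the three readings on a hypothetical `toy` frame; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: one row per (name, gender, state) combination.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob"],
    "Gender": ["F", "F", "M"],
    "State": ["AK", "WY", "AK"],
    "Count": [62, 40, 150],
})

# Reading 1: number of rows per gender (what value_counts() reports).
row_counts = toy["Gender"].value_counts()

# Reading 2: total number of births per gender.
births = toy.groupby("Gender")["Count"].sum()

# Reading 3: number of distinct names per gender.
distinct_names = toy.groupby("Gender")["Name"].nunique()

print(row_counts, births, distinct_names, sep="\n\n")
```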

### Step 7. Group the dataset by name and assign to names

In [12]:
# summing the Year column would be meaningless, so delete it before grouping
del baby_names["Year"]
names = baby_names.groupby("Name").sum(numeric_only=True)
names.head()


Name,Count
Aaban,12
Aadan,23
Aadarsh,5
Aaden,3426
Aadhav,6


In [13]:
names.sort_values("Count", ascending=False).head()

Name,Count
Jacob,242874
Emma,214852
Michael,214405
Ethan,209277
Isabella,204798
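
Deleting `Year` changes `baby_names` permanently. A possible alternative, sketched here on a hypothetical `toy` frame, is to select only the `Count` column after grouping, so the original frame keeps all of its columns.

```python
import pandas as pd

# Hypothetical toy data that still contains the Year column.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob"],
    "Year": [2004, 2005, 2004],
    "Count": [62, 40, 150],
})

# Select Count after grouping so Year is never summed and never deleted.
totals = toy.groupby("Name")["Count"].sum().sort_values(ascending=False)
print(totals)
```

The result here is a Series rather than a one-column DataFrame; `toy.groupby("Name")[["Count"]].sum()` would keep the DataFrame shape.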


### Step 8. How many different names exist in the dataset?

In [14]:
# as we have already grouped by the name, all the names are unique already. 
# get the length of names
len(names)

17632

In [20]:
baby_names["Name"].nunique()

17632

### Step 9. What is the name with most occurrences?

In [17]:
names["Count"].idxmax()

'Jacob'
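
`idxmax()` returns only the label of the largest value (the first one, if there are ties). If the count itself or the runners-up are also of interest, `nlargest` is one option. A minimal sketch on hypothetical totals:

```python
import pandas as pd

# Hypothetical per-name totals, invented for illustration.
toy_totals = pd.Series({"Emma": 102, "Jacob": 150, "Aaban": 5}, name="Count")

# Label of the single largest value.
top_name = toy_totals.idxmax()

# The two largest values together with their labels.
top_two = toy_totals.nlargest(2)

print(top_name)
print(top_two)
```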

### Step 10. How many different names have the least occurrences?

In [18]:
names.sort_values("Count", ascending=True).head(1)

Name,Count
Destenie,5


In [21]:
len(names[names.Count == 5])  # the least occurrence is 5

2578

In [22]:
len(names[names.Count == names.Count.min()])

2578
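
Sorting and taking `head(1)` shows only one of the names tied at the minimum. The sketch below, on a hypothetical `toy_names` frame, shows two ways to get all of them: a boolean mask against `min()`, and `nsmallest` with `keep="all"`.

```python
import pandas as pd

# Hypothetical per-name totals with two names tied at the minimum.
toy_names = pd.DataFrame(
    {"Count": [5, 5, 12, 3426]},
    index=pd.Index(["Aadarsh", "Destenie", "Aaban", "Aaden"], name="Name"),
)

# Boolean mask: True for every name whose total equals the minimum.
is_min = toy_names["Count"] == toy_names["Count"].min()
n_tied = int(is_min.sum())

# nsmallest with keep="all" returns every row tied for the smallest value.
tied_rows = toy_names.nsmallest(1, "Count", keep="all")

print(n_tied)
print(tied_rows)
```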

### Step 11. What is the median name occurrence?

In [28]:
names["Count"].median()

49.0

In [23]:
names[names["Count"] == names["Count"].median()]

Name,Count
Aishani,49
Alara,49
Alysse,49
Ameir,49
Anely,49
...,...
Sriram,49
Trinton,49
Vita,49
Yoni,49
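
Filtering on equality with the median works here because some names total exactly 49. In general, with an even number of entries the median is the mean of the two middle values and may match no row at all. The sketch below illustrates this on hypothetical totals and shows one possible fallback: picking the names closest to the median.

```python
import pandas as pd

# Hypothetical per-name totals with an even number of entries.
toy_counts = pd.Series(
    [5, 10, 20, 100],
    index=["Aaban", "Aadan", "Aaden", "Emma"],
    name="Count",
)

# The median is the mean of the two middle values (10 and 20).
median_count = toy_counts.median()

# No name has exactly this total, so the equality filter is empty.
exact_matches = toy_counts[toy_counts == median_count]

# Fallback: the names whose totals are closest to the median.
distance = (toy_counts - median_count).abs()
closest = toy_counts[distance == distance.min()]

print(median_count, exact_matches.empty, closest.index.tolist())
```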


### Step 12. What is the standard deviation of name occurrences?

In [25]:
names["Count"].std()

11006.06946789057
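
By default, pandas `std()` computes the sample standard deviation (`ddof=1`), whereas NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below compares the two on a few hypothetical values, which may help when a result from another tool does not match.

```python
import numpy as np
import pandas as pd

# Hypothetical totals, invented for illustration.
toy_counts = pd.Series([5, 11, 49, 337], name="Count")

# pandas default: sample standard deviation (divides by n - 1).
sample_std = toy_counts.std()

# Population standard deviation (divides by n), matching NumPy's default.
population_std = toy_counts.std(ddof=0)
numpy_std = np.std(toy_counts.to_numpy())

print(sample_std, population_std, numpy_std)
```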

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [26]:
names.describe()

,Count
count,17632.0
mean,2008.932169
std,11006.069468
min,5.0
25%,11.0
50%,49.0
75%,337.0
max,242874.0


In [27]:
baby_names.describe()

,Count
count,1016395.0
mean,34.85012
std,97.39735
min,5.0
25%,7.0
50%,11.0
75%,26.0
max,4167.0
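
The two summaries above describe different things: `names.describe()` summarises the total per name, while `baby_names.describe()` summarises individual rows, where a name appears once per state, year and gender. The sketch below reproduces the contrast on a hypothetical `toy` frame.

```python
import pandas as pd

# Hypothetical toy rows: Emma appears twice, the other names once.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob", "Aaban"],
    "Count": [62, 40, 150, 5],
})

# Summary over individual rows (4 observations).
per_row = toy["Count"].describe()

# Summary over per-name totals (3 observations; Emma's rows are summed first).
per_name = toy.groupby("Name")["Count"].sum().describe()

print(per_row)
print(per_name)
```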
