# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [65]:
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [66]:
baby_names = pd.read_csv('https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv')

### Step 4. See the first 10 entries

In [67]:
baby_names.head(10)

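| Unnamed: 0.1 | Unnamed: 0 | Id | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|---|---|
| 0 | 11349 | 11350 | Emma | 2004 | F | AK | 62 |
| 1 | 11350 | 11351 | Madison | 2004 | F | AK | 48 |
| 2 | 11351 | 11352 | Hannah | 2004 | F | AK | 46 |
| 3 | 11352 | 11353 | Grace | 2004 | F | AK | 44 |
| 4 | 11353 | 11354 | Emily | 2004 | F | AK | 41 |
| 5 | 11354 | 11355 | Abigail | 2004 | F | AK | 37 |
| 6 | 11355 | 11356 | Olivia | 2004 | F | AK | 33 |
| 7 | 11356 | 11357 | Isabella | 2004 | F | AK | 30 |
| 8 | 11357 | 11358 | Alyssa | 2004 | F | AK | 29 |
| 9 | 11358 | 11359 | Sophia | 2004 | F | AK | 28 |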


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [68]:
baby_names.drop(['Unnamed: 0', 'Id'], axis=1, inplace=True)
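
As a side note, `drop` can also return a new DataFrame instead of modifying the original in place. The following is a minimal sketch on a made-up three-row sample (the `csv_text` string and the names `sample` and `trimmed` are assumptions for illustration, not the real file):

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics the layout of the real file.
csv_text = """Unnamed: 0,Id,Name,Year,Gender,State,Count
11349,11350,Emma,2004,F,AK,62
11350,11351,Madison,2004,F,AK,48
11351,11352,Hannah,2004,F,AK,46
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Without inplace=True, drop() returns a new DataFrame and leaves `sample` untouched.
trimmed = sample.drop(columns=['Unnamed: 0', 'Id'])

print(trimmed.columns.tolist())
print(sample.columns.tolist())
```

Keeping the original frame around can make it easier to re-run a notebook cell without hitting a "column not found" error on the second run.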

### Step 6. Are there more male or female names in the dataset?

In [69]:
baby_names['Gender'].value_counts()

Gender
F    558846
M    457549
Name: count, dtype: int64

In [70]:
# Alternative: groupby + count gives the same numbers, but is more verbose than value_counts()
baby_names.groupby('Gender')['Gender'].count()

Gender
F    558846
M    457549
Name: Gender, dtype: int64
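
Note that both approaches count rows, where each row is one name in one state and year. Depending on how the question is read, one might instead want the number of distinct names or the number of babies per gender. A small sketch on made-up data (the `toy` frame is hypothetical) shows how the three readings can differ:

```python
import pandas as pd

# Hypothetical toy data: each row is one (name, state, year) record.
toy = pd.DataFrame({
    'Name':   ['Emma', 'Emma', 'Olivia', 'Jacob'],
    'Gender': ['F', 'F', 'F', 'M'],
    'Count':  [10, 5, 5, 50],
})

# Rows per gender (what value_counts() reports).
rows_per_gender = toy['Gender'].value_counts()

# Distinct names per gender.
names_per_gender = toy.groupby('Gender')['Name'].nunique()

# Babies per gender (sum of the Count column).
births_per_gender = toy.groupby('Gender')['Count'].sum()

print(rows_per_gender, names_per_gender, births_per_gender, sep='\n\n')
```

In this toy example, F has more rows and more distinct names, while M has more births.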

### Step 7. Group the dataset by name and assign it to the variable 'names'. What are the top 5 names across all years and states?

In [71]:
names = baby_names.groupby('Name')
names['Count'].sum().sort_values(ascending=False).head(5)

Name
Jacob       242874
Emma        214852
Michael     214405
Ethan       209277
Isabella    204798
Name: Count, dtype: int64

In [72]:
# Alternative way
baby_names.groupby('Name')['Count'].sum().nlargest(5)

Name
Jacob       242874
Emma        214852
Michael     214405
Ethan       209277
Isabella    204798
Name: Count, dtype: int64
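
One difference between the two approaches is how ties at the cutoff are handled. `nlargest` accepts a `keep` argument for this. A minimal sketch on a made-up Series (the names and totals below are invented for illustration):

```python
import pandas as pd

# Hypothetical totals with a tie for second place.
totals = pd.Series({'Ann': 30, 'Bob': 20, 'Cy': 20, 'Di': 10}, name='Count')

# Default keep='first': exactly 2 rows, the first of the tied names wins.
top2_first = totals.nlargest(2)

# keep='all': every name tied for the last slot is kept, so more than 2 rows may come back.
top2_all = totals.nlargest(2, keep='all')

print(top2_first)
print(top2_all)
```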

### Step 8. How many different names exist in the dataset?

In [73]:
baby_names['Name'].nunique()

17632

In [74]:
# Since we have already grouped by name, each group is one unique name, so the number of groups gives the same answer
len(names)

17632
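
A GroupBy object also exposes the number of groups directly through its `ngroups` attribute. The sketch below, on a made-up frame (`toy` is hypothetical), checks that the three ways of counting agree:

```python
import pandas as pd

# Hypothetical toy data with three distinct names across four rows.
toy = pd.DataFrame({
    'Name':  ['Emma', 'Emma', 'Jacob', 'Noah'],
    'Count': [3, 4, 5, 6],
})

grouped = toy.groupby('Name')

n_unique = toy['Name'].nunique()  # distinct values in the column
n_len = len(grouped)              # number of groups via len()
n_groups = grouped.ngroups        # number of groups via the attribute

print(n_unique, n_len, n_groups)
```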

### Step 9. Which name has the most occurrences?

In [82]:
names['Count'].sum().idxmax()

'Jacob'

In [76]:
# Alternative way
names['Count'].sum().sort_values(ascending=False).head(1).index[0]

'Jacob'

In [77]:
# Alternative way
names.sum()[names.sum()['Count'] == names.sum()['Count'].max()].index[0]

'Jacob'

In [78]:
# Alternative way
names_count = names.sum()
names_count[names_count['Count'] == names_count['Count'].max()].index[0]

# As you can see, there are many ways to answer a question, some better than others. It is always a trade-off between readability and efficiency.

'Jacob'
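
The approaches above behave differently when several names share the maximum: `idxmax` returns only the first label, while a boolean mask can return all of them. A small sketch on made-up totals (the tie below is invented and does not occur in the real data):

```python
import pandas as pd

# Hypothetical totals where two names tie for the maximum.
totals = pd.Series({'Jacob': 9, 'Emma': 9, 'Noah': 4}, name='Count')

# idxmax() returns the label of the first occurrence of the maximum.
first_top = totals.idxmax()

# A boolean mask keeps every label whose value equals the maximum.
all_top = totals[totals == totals.max()].index.tolist()

print(first_top)
print(all_top)
```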

### Step 10. How many different names have the fewest occurrences?

In [99]:
# Let's first see what the smallest number of occurrences is
sorted_occ_names = names['Count'].sum().sort_values()

print('Min Occ:', sorted_occ_names.min())


# and now we can count the names whose total equals that minimum (5)
len(sorted_occ_names[sorted_occ_names == sorted_occ_names.min()])

Min Occ: 5


2578
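
The same count can be obtained by summing the boolean mask, since `True` counts as 1. A minimal sketch on a made-up Series (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical per-name totals; two names share the minimum of 5.
totals = pd.Series({'Ann': 5, 'Bob': 5, 'Cy': 7, 'Di': 12}, name='Count')

least = totals.min()

# Each True in the mask counts as 1, so the sum is the number of matching names.
n_least = int((totals == least).sum())

print('Min Occ:', least)
print(n_least)
```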

### Step 11. What is the median name occurrence for this dataset?

In [103]:
sorted_occ_names.median()

np.float64(49.0)

### Step 12. What is the standard deviation of name occurrences?

In [104]:
sorted_occ_names.std()


np.float64(11006.069467890571)

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [105]:
sorted_occ_names.describe()

count     17632.000000
mean       2008.932169
std       11006.069468
min           5.000000
25%          11.000000
50%          49.000000
75%         337.000000
max      242874.000000
Name: Count, dtype: float64
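
In the summary above, the mean (about 2009) is far larger than the median (49), which suggests a heavily right-skewed distribution: a few very popular names dominate. To look further into the upper tail, `describe` accepts a `percentiles` argument. A minimal sketch on made-up values (the Series below is invented and much smaller than the real data):

```python
import pandas as pd

# Hypothetical right-skewed totals: most values are small, one is very large.
totals = pd.Series([5, 5, 10, 50, 1000], name='Count')

# Request the median plus two upper-tail percentiles instead of the default quartiles.
summary = totals.describe(percentiles=[0.5, 0.9, 0.99])

print(summary)
```

As in the real dataset, the mean of this toy Series sits well above its median because of the single large value.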