# US - Baby Names

### Introduction:

We will use a subset of the [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) dataset from Kaggle.  
The file contains names from 2004 to 2014.


### Step 1. Import the necessary libraries

In [12]:
import pandas as pd
pd.set_option('display.max_rows',5)

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [13]:
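# Assumes the file from Step 2 has been downloaded to the working directory.
# The exercise asks for `baby_names`; this notebook uses the shorter name `df` throughout.
df = pd.read_csv("US_Baby_Names_right.csv")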
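
`pd.read_csv` also accepts a URL, so the address from Step 2 could be passed to it directly. The `Unnamed: 0` column that shows up in the next step is what pandas calls a column whose header is empty, which usually means the CSV was saved together with its index. The sketch below illustrates this on a two-row inline sample (rows copied from the output in Step 4; the variable names are illustrative, not part of the exercise) and shows how `index_col=0` would absorb that column at load time.

```python
import io

import pandas as pd

# Two rows taken from the notebook output; note the empty first header,
# which pandas turns into a column named 'Unnamed: 0'.
sample_csv = (
    ",Id,Name,Year,Gender,State,Count\n"
    "11349,11350,Emma,2004,F,AK,62\n"
    "11350,11351,Madison,2004,F,AK,48\n"
)

# Default read: the unnamed column is kept as ordinary data.
default_read = pd.read_csv(io.StringIO(sample_csv))

# index_col=0: the unnamed column becomes the index instead.
indexed_read = pd.read_csv(io.StringIO(sample_csv), index_col=0)

print(default_read.columns.tolist())
print(indexed_read.columns.tolist())
```

Whether to use `index_col=0` here is a judgment call: the exercise explicitly asks for the column to be deleted in Step 5, so the notebook keeps the default read.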

### Step 4. See the first 10 entries

In [14]:
df.head(10)

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
...,...,...,...,...,...,...,...
8,11357,11358,Alyssa,2004,F,AK,29
9,11358,11359,Sophia,2004,F,AK,28


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [15]:
# Keep only the columns from 'Name' through 'Count', which drops 'Unnamed: 0' and 'Id'
df = df.loc[:,'Name':'Count']
df.head(2)

,Name,Year,Gender,State,Count
0,Emma,2004,F,AK,62
1,Madison,2004,F,AK,48
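
The label slice above works because the two unwanted columns happen to sit to the left of `Name`. A more explicit alternative, sketched here on a two-row stand-in frame built from the Step 4 output (the names `raw` and `trimmed` are illustrative), is to name the columns to remove with `DataFrame.drop`.

```python
import pandas as pd

# Stand-in for the loaded data: two rows copied from the Step 4 output.
raw = pd.DataFrame({
    "Unnamed: 0": [11349, 11350],
    "Id": [11350, 11351],
    "Name": ["Emma", "Madison"],
    "Year": [2004, 2004],
    "Gender": ["F", "F"],
    "State": ["AK", "AK"],
    "Count": [62, 48],
})

# Drop the columns by name; this does not depend on column order.
trimmed = raw.drop(columns=["Unnamed: 0", "Id"])

print(trimmed.columns.tolist())
```

On this sample both approaches give the same frame; `drop` would keep working if the column order changed.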


### Step 6. Are there more male or female names in the dataset?

In [16]:
# Solution 1
df.groupby('Gender').size()

Gender
F    558846
M    457549
dtype: int64

In [17]:
# Solution 2
df['Gender'].value_counts()

F    558846
M    457549
Name: Gender, dtype: int64
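
Both solutions count rows, where each row is one (name, year, state, gender) record. Depending on how the question is read, one might instead want the number of distinct names per gender or the total number of births per gender. The sketch below contrasts the three readings on a small made-up frame (the `toy` data is hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy data: 'Emma' appears in two records, 'Jacob' in one.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob"],
    "Gender": ["F", "F", "M"],
    "Count": [10, 5, 40],
})

# Reading 1: number of records per gender (what Step 6 computes).
rows_per_gender = toy.groupby("Gender").size()

# Reading 2: number of distinct names per gender.
names_per_gender = toy.groupby("Gender")["Name"].nunique()

# Reading 3: number of births per gender (sum of Count).
births_per_gender = toy.groupby("Gender")["Count"].sum()

print(rows_per_gender, names_per_gender, births_per_gender, sep="\n\n")
```

In this toy example the three readings disagree about which gender is larger, which is why it helps to state which one is meant.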

### Step 7. Group the dataset by name and assign to names

In [18]:
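# Total count per name, summed over all years, states and genders
gb_names = df.groupby('Name')['Count'].agg('sum')
# Sorted only for display; gb_names itself stays in alphabetical (group key) order
gb_names.sort_values(ascending=False)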

Name
Jacob       242874
Emma        214852
             ...  
Janisha          5
Destenie         5
Name: Count, Length: 17632, dtype: int64

### Step 8. How many different names exist in the dataset?

In [19]:
len(gb_names), df['Name'].nunique()

(17632, 17632)

### Step 9. What is the name with the most occurrences?

In [20]:
# Solution 1
gb_names.sort_values(ascending=False)[0:1]

Name
Jacob    242874
Name: Count, dtype: int64

In [21]:
# Solution 2
gb_names.idxmax()

'Jacob'
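
Both solutions return a single name even if several names share the maximum: `idxmax` reports the first label at which the maximum occurs. If ties matter, `Series.nlargest` with `keep="all"` is one way to return every tied name. A minimal sketch on made-up totals (the `toy_totals` values are hypothetical):

```python
import pandas as pd

# Hypothetical totals in which two names tie for first place.
toy_totals = pd.Series({"Jacob": 40, "Emma": 40, "Liam": 7}, name="Count")

# idxmax returns only the first label holding the maximum.
first_max = toy_totals.idxmax()

# nlargest(1, keep="all") returns every entry tied for the top spot.
all_max = toy_totals.nlargest(1, keep="all")

print(first_max)
print(all_max)
```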

### Step 10. How many different names have the least occurrences?

In [22]:
gb_names[gb_names==gb_names.min()].shape

(2578,)
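
The `.shape` of the filtered Series is a one-element tuple. If a plain integer is preferred, summing the boolean mask gives the same count directly. A sketch on made-up totals (the `toy_totals` values are hypothetical):

```python
import pandas as pd

# Hypothetical totals in which two names share the minimum.
toy_totals = pd.Series(
    {"Jacob": 40, "Emma": 15, "Janisha": 5, "Destenie": 5}, name="Count"
)

# True for every name whose total equals the smallest total.
is_min = toy_totals == toy_totals.min()

# Each True counts as 1, so the sum is the number of names at the minimum.
n_min = int(is_min.sum())

# The names themselves, if they are needed.
min_names = toy_totals[is_min].index.tolist()

print(n_min, min_names)
```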

### Step 11. What is the median name occurrence?

In [24]:
print("median name occurrence: ", gb_names.median() )
gb_names[gb_names==gb_names.median()]

median name occurrence:  49.0


Name
Aishani    49
Alara      49
           ..
Yoni       49
Zuleima    49
Name: Count, Length: 66, dtype: int64
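
Filtering on equality with the median works here because 49 is a value that actually occurs. With an even number of entries (there are 17632 names), the median is the mean of the two middle values, so in general it may match no entry at all. The sketch below shows that case on made-up totals and one possible fallback, selecting the entries closest to the median (the data and variable names are hypothetical).

```python
import pandas as pd

# Hypothetical totals with an even number of entries.
even_totals = pd.Series({"A": 5, "B": 48, "C": 50, "D": 900}, name="Count")

# Median of an even-length Series: mean of the two middle values (48 and 50).
med = even_totals.median()

# No entry equals the median exactly, so this filter is empty.
exact = even_totals[even_totals == med]

# Fallback: keep the entries with the smallest distance to the median.
distance = (even_totals - med).abs()
closest = even_totals[distance == distance.min()]

print(med, len(exact), closest.index.tolist())
```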

### Step 12. What is the standard deviation of the name counts?

In [25]:
print("name count standard deviation: ", gb_names.std() )

name count standard deviation:  11006.069467891111


### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [26]:
gb_names.describe()

count     17632.000000
mean       2008.932169
             ...      
75%         337.000000
max      242874.000000
Name: Count, Length: 8, dtype: float64
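
The summary above is cut off because Step 1 set `display.max_rows` to 5, which hides `std`, `min`, `25%` and `50%`. One way to see all eight rows without changing the global setting is `pd.option_context`, which applies an option only inside a `with` block. A sketch on made-up values (the `toy_totals` data is hypothetical):

```python
import pandas as pd

# Hypothetical totals, just to have something to summarise.
toy_totals = pd.Series([5, 5, 49, 337, 242874], name="Count")
stats = toy_totals.describe()

# With a small max_rows the printed summary is truncated.
with pd.option_context("display.max_rows", 5):
    truncated = repr(stats)

# max_rows=None lifts the limit, but only inside this block.
with pd.option_context("display.max_rows", None):
    full = repr(stats)

print(full)
```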

### BONUS

In [27]:
'Maryse' in df['Name'].unique()

True

In [28]:
'Maryse' in gb_names[gb_names==gb_names.min()].index

True
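
A note on the two membership tests above: the `in` operator applied directly to a pandas Series checks the index labels, not the values. That is why the first test goes through `.unique()` (an array of values) and the second through `.index`. The sketch below shows the difference on a small made-up Series (the `names` data is hypothetical).

```python
import pandas as pd

# Hypothetical Series of names with the default integer index 0, 1, 2.
names = pd.Series(["Emma", "Jacob", "Maryse"])

# `in` on a Series looks at the index labels, so this is False.
in_series = "Maryse" in names

# Looking at the underlying values gives the expected answer.
in_values = "Maryse" in names.values

# An explicit element-wise comparison works as well.
any_equal = bool(names.eq("Maryse").any())

print(in_series, in_values, any_equal)
```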

In [29]:
df[df['Name']=='Saad']

,Name,Year,Gender,State,Count
110029,Saad,2004,M,CA,5
112801,Saad,2005,M,CA,5
...,...,...,...,...,...
951412,Saad,2008,M,VA,5
952527,Saad,2009,M,VA,5
