# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 to 2014.


### Step 1. Import the necessary libraries

In [2]:
import numpy as np
import pandas as pd 
import io
import requests

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [3]:
url = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv"
s = requests.get(url).content
baby_names = pd.read_csv(io.StringIO(s.decode('utf-8')))
baby_names

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
...,...,...,...,...,...,...,...
1016390,5647421,5647422,Seth,2014,M,WY,5
1016391,5647422,5647423,Spencer,2014,M,WY,5
1016392,5647423,5647424,Tyce,2014,M,WY,5
1016393,5647424,5647425,Victor,2014,M,WY,5


### Step 4. See the first 10 entries

In [4]:
baby_names.head(10) 

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
5,11354,11355,Abigail,2004,F,AK,37
6,11355,11356,Olivia,2004,F,AK,33
7,11356,11357,Isabella,2004,F,AK,30
8,11357,11358,Alyssa,2004,F,AK,29
9,11358,11359,Sophia,2004,F,AK,28


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [5]:
df = baby_names.drop(columns=['Unnamed: 0', 'Id'])
df

,Name,Year,Gender,State,Count
0,Emma,2004,F,AK,62
1,Madison,2004,F,AK,48
2,Hannah,2004,F,AK,46
3,Grace,2004,F,AK,44
4,Emily,2004,F,AK,41
...,...,...,...,...,...
1016390,Seth,2014,M,WY,5
1016391,Spencer,2014,M,WY,5
1016392,Tyce,2014,M,WY,5
1016393,Victor,2014,M,WY,5
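As an alternative to dropping columns after loading, the unwanted columns can be skipped at read time with `usecols`. The sketch below uses a three-row inline sample instead of the real download. The sample's header layout (a leading empty column name, which pandas labels `Unnamed: 0`) is an assumption about the file, and `sample_csv`, `wanted` and `sample` are illustrative names.

```python
import io

import pandas as pd

# Assumed layout of the real file: the first header cell is empty,
# which is what makes pandas create the 'Unnamed: 0' column.
sample_csv = """,Id,Name,Year,Gender,State,Count
11349,11350,Emma,2004,F,AK,62
11350,11351,Madison,2004,F,AK,48
11351,11352,Hannah,2004,F,AK,46
"""

# Keep only the columns needed for the exercises.
wanted = ["Name", "Year", "Gender", "State", "Count"]
sample = pd.read_csv(io.StringIO(sample_csv), usecols=wanted)

print(sample.columns.tolist())
```

With the real data the same `usecols` argument could be passed to the `pd.read_csv` call in Step 3, which would make the `drop` in this step unnecessary.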


### Step 6. Are there more male or female names in the dataset?

In [6]:
df['Name']

0             Emma
1          Madison
2           Hannah
3            Grace
4            Emily
            ...   
1016390       Seth
1016391    Spencer
1016392       Tyce
1016393     Victor
1016394     Waylon
Name: Name, Length: 1016395, dtype: object

In [7]:
df['Name'].info() 

<class 'pandas.core.series.Series'>
RangeIndex: 1016395 entries, 0 to 1016394
Series name: Name
Non-Null Count    Dtype 
--------------    ----- 
1016395 non-null  object
dtypes: object(1)
memory usage: 7.8+ MB


In [8]:
# isna() tests for real missing values; isin(['NaN']) only matches the literal string 'NaN'
df['Name'].isna()

0          False
1          False
2          False
3          False
4          False
           ...  
1016390    False
1016391    False
1016392    False
1016393    False
1016394    False
Name: Name, Length: 1016395, dtype: bool
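The cells above inspect the `Name` column, but the question is about `Gender`. One possible way to answer it is sketched here on a small made-up frame. The rows in `toy` are hypothetical and not taken from the dataset. "More names" can be read three ways, so the sketch shows each reading.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the real dataset.
toy = pd.DataFrame({
    "Name":   ["Emma", "Emma", "Liam", "Noah", "Ava"],
    "Gender": ["F", "F", "M", "M", "F"],
    "Count":  [10, 7, 12, 5, 6],
})

# Reading 1: number of rows (name/year/state entries) per gender.
rows_per_gender = toy["Gender"].value_counts()

# Reading 2: number of babies per gender, using the Count column.
births_per_gender = toy.groupby("Gender")["Count"].sum()

# Reading 3: number of distinct names per gender.
names_per_gender = toy.groupby("Gender")["Name"].nunique()

print(rows_per_gender)
print(births_per_gender)
print(names_per_gender)
```

On the real data, `df['Gender'].value_counts()` would give the first reading.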

### Step 7. Group the dataset by name and assign to names

In [9]:
names = df.groupby("Name")
g = names  # short alias used by the later cells
g

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000292EDD900D0>
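Displaying a `groupby` object only shows its memory address, because the grouping is evaluated lazily. A minimal sketch of how such an object can be inspected follows. The frame `toy` holds made-up rows and is only illustrative.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the real dataset.
toy = pd.DataFrame({
    "Name":  ["Emma", "Liam", "Emma", "Noah"],
    "State": ["AK", "AK", "WY", "WY"],
    "Count": [62, 40, 9, 5],
})

names = toy.groupby("Name")

n_groups = names.ngroups               # number of distinct names
emma_rows = names.get_group("Emma")    # all rows belonging to one name
rows_per_name = names.size()           # rows per name
births_per_name = names["Count"].sum() # total Count per name

print(n_groups)
print(emma_rows)
print(rows_per_name)
print(births_per_name)
```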

### Step 8. How many different names exist in the dataset?

In [10]:
df['Name'].nunique()

17632

In [11]:
occur = g.size()
display(occur)

Name
Aaban        2
Aadan        4
Aadarsh      1
Aaden      196
Aadhav       1
          ... 
Zyra         7
Zyrah        2
Zyren        1
Zyria       10
Zyriah       9
Length: 17632, dtype: int64

In [12]:
df.Name.describe()   

count     1016395
unique      17632
top         Riley
freq         1112
Name: Name, dtype: object

### Step 9. What is the name with most occurrences?

In [13]:
# Most frequent name by number of rows. Series.max() on strings only returns
# the alphabetically last name; describe() in Step 8 reports Riley (1112 rows).
g.size().idxmax()

In [14]:
g.max()

Name,Year,Gender,State,Count
Aaban,2014,M,NY,6
Aadan,2014,M,TX,7
Aadarsh,2009,M,IL,5
Aaden,2014,M,WV,158
Aadhav,2014,M,CA,6
...,...,...,...,...
Zyra,2014,F,TX,8
Zyrah,2013,F,TX,6
Zyren,2013,M,TX,6
Zyria,2014,F,TX,7


### Step 10. How many different names have the least occurrences?

In [None]:
# Count how many distinct names share the lowest number of rows
occur = g.size()
(occur == occur.min()).sum()
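A vectorized way to find the names with the fewest occurrences is to count rows per name, take the minimum, and keep the names that match it. This is a sketch on a made-up series; `toy_names` is a hypothetical stand-in for `df['Name']`.

```python
import pandas as pd

# Hypothetical stand-in for df['Name'].
toy_names = pd.Series(
    ["Emma", "Emma", "Emma", "Liam", "Liam", "Aaban", "Zyren"], name="Name"
)

counts = toy_names.value_counts()    # rows per name
fewest = counts.min()                # smallest number of rows
rarest = counts[counts == fewest]    # names tied at that minimum
n_rarest = len(rarest)

print(fewest, n_rarest, sorted(rarest.index))
```

This avoids a Python-level loop over every row, which is slow on a frame with about a million entries.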

### Step 11. What is the median name occurrence?

In [None]:
# Median number of rows per name
g.size().median()

### Step 12. What is the standard deviation of names?

In [None]:
# Standard deviation of the number of rows per name
g.size().std()

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [None]:
df.describe()
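Steps 11 to 13 ask about the distribution of name occurrences, so the statistics could be taken over the per-name counts rather than over the raw columns. A minimal sketch on made-up data follows; `toy_names` is a hypothetical stand-in for `df['Name']`.

```python
import pandas as pd

# Hypothetical stand-in for df['Name']: per-name counts are 4, 2, 2 and 1.
toy_names = pd.Series(
    ["Emma"] * 4 + ["Liam"] * 2 + ["Noah"] * 2 + ["Ava"], name="Name"
)

counts = toy_names.value_counts()

median_occurrence = counts.median()  # Step 11
std_occurrence = counts.std()        # Step 12 (sample standard deviation)
summary = counts.describe()          # Step 13: count, mean, std, min, quartiles, max

print(median_occurrence)
print(std_occurrence)
print(summary)
```

If occurrences are meant as numbers of babies instead of rows, the same three calls could be applied to `df.groupby('Name')['Count'].sum()`.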