# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 to 2014.


### Step 1. Import the necessary libraries

In [1]:
import pandas as pd
import numpy as np

from azureml import Workspace

ws = Workspace()

### Step 2. Import the dataset from US_Baby_Names_right.csv

In [2]:
ds = ws.datasets['US_Baby_Names_right.csv']

### Step 3. Assign it to a variable called baby_names.

In [3]:
baby_names = ds.to_dataframe()
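
Steps 1 to 3 rely on an Azure ML workspace. If only a local copy of the CSV is available, a minimal sketch using `pd.read_csv` could look like the following. The in-memory text is a hypothetical stand-in for the file (its rows are copied from the preview in Step 4), and `baby_names_local` is an illustrative name, not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for a local copy of US_Baby_Names_right.csv.
# The first header cell is empty, which pandas names 'Unnamed: 0'.
csv_text = """,Id,Name,Year,Gender,State,Count
11349,11350,Emma,2004,F,AK,62
11350,11351,Madison,2004,F,AK,48
11351,11352,Hannah,2004,F,AK,46
"""

# With a real file, pass its path instead of the StringIO buffer.
baby_names_local = pd.read_csv(io.StringIO(csv_text))
print(baby_names_local.columns.tolist())
```

The empty header cell explains where the `Unnamed: 0` column dropped in Step 5 comes from.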

### Step 4. See the first 10 entries

In [5]:
baby_names[:10]

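|   | Unnamed: 0 | Id | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|---|---|
| 0 | 11349 | 11350 | Emma | 2004 | F | AK | 62 |
| 1 | 11350 | 11351 | Madison | 2004 | F | AK | 48 |
| 2 | 11351 | 11352 | Hannah | 2004 | F | AK | 46 |
| 3 | 11352 | 11353 | Grace | 2004 | F | AK | 44 |
| 4 | 11353 | 11354 | Emily | 2004 | F | AK | 41 |
| 5 | 11354 | 11355 | Abigail | 2004 | F | AK | 37 |
| 6 | 11355 | 11356 | Olivia | 2004 | F | AK | 33 |
| 7 | 11356 | 11357 | Isabella | 2004 | F | AK | 30 |
| 8 | 11357 | 11358 | Alyssa | 2004 | F | AK | 29 |
| 9 | 11358 | 11359 | Sophia | 2004 | F | AK | 28 |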


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [6]:
baby_names = baby_names.drop(['Unnamed: 0', 'Id'], axis = 1)

In [7]:
baby_names.head()

|   | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|
| 0 | Emma | 2004 | F | AK | 62 |
| 1 | Madison | 2004 | F | AK | 48 |
| 2 | Hannah | 2004 | F | AK | 46 |
| 3 | Grace | 2004 | F | AK | 44 |
| 4 | Emily | 2004 | F | AK | 41 |


### Step 6. Are there more male or female names in the dataset?

In [8]:
# Count the rows (name entries) per gender; this is not the number of babies.
count = baby_names.groupby('Gender')['Count'].count()
if count['F'] > count['M']:
    print('There are more females.')
elif count['M'] > count['F']:
    print('There are more males.')
else:
    print('There are as many males as females.')

count

There are more females.


Gender
F    558846
M    457549
Name: Count, dtype: int64
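
Step 6 compares the number of rows per gender, not the number of babies. A minimal sketch of the difference, using a small made-up DataFrame (the rows and the variable names below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data with the same column names as baby_names.
toy = pd.DataFrame({
    "Name": ["Emma", "Madison", "Riley", "Riley", "Liam"],
    "Gender": ["F", "F", "F", "M", "M"],
    "Count": [62, 48, 10, 5, 200],
})

# Number of rows (name entries) per gender, as in Step 6.
rows_per_gender = toy.groupby("Gender")["Count"].count()

# Total of the Count column per gender, i.e. the number of babies.
births_per_gender = toy.groupby("Gender")["Count"].sum()

print(rows_per_gender)
print(births_per_gender)
```

In this toy example the two measures disagree: there are more female rows but a larger male total, so the choice of aggregation depends on which question is being asked.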

### Step 7. Group the dataset by name and assign the result to names

In [20]:
names = baby_names.groupby('Name').count()['Count']

### Step 8. How many different names exist in the dataset?

In [10]:
len(baby_names.Name.unique())

17632
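
As an alternative sketch, pandas also offers `Series.nunique()`, which returns the number of distinct values directly. The toy Series below is hypothetical and only illustrates that both approaches agree when there are no missing values:

```python
import pandas as pd

# Hypothetical toy column standing in for baby_names.Name.
toy_names = pd.Series(["Emma", "Riley", "Riley", "Liam", "Emma"], name="Name")

# nunique() counts distinct values (missing values are excluded by default).
n_unique = toy_names.nunique()

# Same result as taking the length of the array of unique values.
n_unique_via_len = len(toy_names.unique())

print(n_unique, n_unique_via_len)
```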

### Step 9. What is the name with most occurrences?

In [21]:
name_list_sorted = names.sort_values(ascending = False)
# Use .iloc for positional access; integer keys on a label-indexed Series are deprecated.
print('Name: {}, occurrences: {}'.format(name_list_sorted.index[0], name_list_sorted.iloc[0]))

Name: Riley, occurrences: 1112
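
If only the top entry is needed, sorting can be skipped. A minimal sketch with `idxmax()` and `max()`, using a hypothetical toy Series shaped like `names` (the values are made up):

```python
import pandas as pd

# Hypothetical toy Series: index holds names, values hold occurrence counts.
toy_counts = pd.Series(
    [3, 1, 2],
    index=pd.Index(["Riley", "Emma", "Liam"], name="Name"),
    name="Count",
)

top_name = toy_counts.idxmax()   # label of the largest value
top_count = toy_counts.max()     # the largest value itself

print('Name: {}, occurrences: {}'.format(top_name, top_count))
```

Note that `idxmax()` returns only the first label when several names tie for the maximum.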


### Step 10. How many different names have the fewest occurrences?

In [12]:
# Compare against the minimum once and count the matches.
int((name_list_sorted == name_list_sorted.min()).sum())

3682

### Step 11. What is the median name occurrence?

In [13]:
median_value = name_list_sorted.median()
median_value

8.0

In [14]:
name_list_sorted.index[name_list_sorted == median_value]

Index(['Lukasz', 'Jamaree', 'Ambree', 'Amyri', 'Tovia', 'Judas', 'Willian',
       'Jalani', 'Jakya', 'Woodrow',
       ...
       'Zanaya', 'Kayliana', 'Kasie', 'Kaliana', 'Kamaile', 'Zakia', 'Kenzlie',
       'Katelynne', 'Kaileen', 'Kaio'],
      dtype='object', name='Name', length=360)

### Step 12. What is the standard deviation of name occurrences?

In [15]:
name_list_sorted.std()

122.02996350814088

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [16]:
name_list_sorted.describe()

count    17632.000000
mean        57.644907
std        122.029964
min          1.000000
25%          2.000000
50%          8.000000
75%         39.000000
max       1112.000000
Name: Count, dtype: float64
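
Each entry of the `describe()` summary corresponds to an individual Series method, which ties this step back to Steps 11 and 12. A minimal sketch on a hypothetical five-value Series (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy Series of occurrence counts.
toy_counts = pd.Series([1, 2, 8, 39, 1112], name="Count")

summary = toy_counts.describe()

# The summary entries match the dedicated methods.
median_matches = summary["50%"] == toy_counts.median()
q1_matches = summary["25%"] == toy_counts.quantile(0.25)
q3_matches = summary["75%"] == toy_counts.quantile(0.75)
std_matches = abs(summary["std"] - toy_counts.std()) < 1e-9

print(median_matches, q1_matches, q3_matches, std_matches)
```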