# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [1]:
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [2]:
url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv'
baby_names = pd.read_csv(url)

### Step 4. See the first 10 entries

In [3]:
baby_names.head(10)

Unnamed: 0.1,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
5,11354,11355,Abigail,2004,F,AK,37
6,11355,11356,Olivia,2004,F,AK,33
7,11356,11357,Isabella,2004,F,AK,30
8,11357,11358,Alyssa,2004,F,AK,29
9,11358,11359,Sophia,2004,F,AK,28


In [4]:
baby_names_sorted_by_name = baby_names.sort_values(by='Name')
baby_names_sorted_by_name.head()

Unnamed: 0.1,Unnamed: 0,Id,Name,Year,Gender,State,Count
693699,3865866,3865867,Aaban,2013,M,NY,6
695768,3867935,3867936,Aaban,2014,M,NY,6
897674,5061278,5061279,Aadan,2008,M,TX,5
120728,691905,691906,Aadan,2008,M,CA,7
138678,709855,709856,Aadan,2014,M,CA,5


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [5]:
baby_names.columns

Index(['Unnamed: 0', 'Id', 'Name', 'Year', 'Gender', 'State', 'Count'], dtype='object')

In [6]:
baby_names.drop(['Unnamed: 0', 'Id'], axis=1, inplace=True)
baby_names.head()

Unnamed: 0,Name,Year,Gender,State,Count
0,Emma,2004,F,AK,62
1,Madison,2004,F,AK,48
2,Hannah,2004,F,AK,46
3,Grace,2004,F,AK,44
4,Emily,2004,F,AK,41
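
The same cleanup can be written without `inplace=True`. The sketch below is a minimal illustration on a two-row, hand-typed CSV string (an assumption standing in for the real file, so no download is needed). It shows that a blank first header in the CSV is what produces the 'Unnamed: 0' column, and that `drop(columns=...)` returns a new frame.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the layout of the real file:
# the first header cell is empty, so pandas names that column 'Unnamed: 0'.
csv_text = (
    ",Id,Name,Year,Gender,State,Count\n"
    "11349,11350,Emma,2004,F,AK,62\n"
    "11350,11351,Madison,2004,F,AK,48\n"
)

raw = pd.read_csv(io.StringIO(csv_text))

# drop(columns=...) returns a new DataFrame and leaves `raw` untouched.
cleaned = raw.drop(columns=['Unnamed: 0', 'Id'])

print(list(raw.columns))
print(list(cleaned.columns))
```

Assigning the result to a new variable keeps the raw frame available in case a later step needs the dropped columns.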


In [7]:
baby_names.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 5 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   Name    1016395 non-null  object
 1   Year    1016395 non-null  int64 
 2   Gender  1016395 non-null  object
 3   State   1016395 non-null  object
 4   Count   1016395 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 38.8+ MB


In [8]:
baby_names.duplicated().value_counts()

False    1016395
Name: count, dtype: int64
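
If only the number of fully duplicated rows is needed, summing the boolean mask is a shorter check. This is a small sketch on made-up rows (the values are illustrative, not taken from the dataset).

```python
import pandas as pd

# Hypothetical rows; the first two are exact duplicates of each other.
toy = pd.DataFrame({
    'Name':  ['Emma', 'Emma', 'Jacob'],
    'Year':  [2004, 2004, 2004],
    'Count': [62, 62, 40],
})

# duplicated() marks every repeat after the first occurrence as True.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row.
deduped = toy.drop_duplicates()

print(n_dupes, len(deduped))
```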

### Step 6. Are there more male or female names in the dataset?

In [9]:
# The simplest way is:
baby_names_by_gender = baby_names.groupby('Gender').Name.count().reset_index()
baby_names_by_gender.rename(columns={'Name':'Count'}, inplace=True)
baby_names_by_gender

Unnamed: 0,Gender,Count
0,F,558846
1,M,457549
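
The row count per gender can also be obtained with `value_counts`, which avoids the `groupby`, `reset_index` and `rename` steps. A minimal sketch on made-up rows:

```python
import pandas as pd

# Hypothetical rows, only the Gender column matters here.
toy = pd.DataFrame({
    'Name':   ['Emma', 'Madison', 'Jacob'],
    'Gender': ['F', 'F', 'M'],
})

# value_counts() returns a Series indexed by gender, sorted by frequency.
rows_per_gender = toy['Gender'].value_counts()

print(rows_per_gender)
```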


In [10]:
# Names repeat across years and states, so we can also base the comparison on unique names only.
# Note: drop_duplicates keeps the first row found for each name.
baby_names_only_one_name = baby_names.drop_duplicates(subset='Name')
baby_names_only_one_name.Name.duplicated().value_counts()

Name
False    17632
Name: count, dtype: int64

In [11]:
baby_names_only_one_name.head()

Unnamed: 0,Name,Year,Gender,State,Count
0,Emma,2004,F,AK,62
1,Madison,2004,F,AK,48
2,Hannah,2004,F,AK,46
3,Grace,2004,F,AK,44
4,Emily,2004,F,AK,41


In [12]:
baby_names_by_gender = baby_names_only_one_name.groupby('Gender').Name.count().reset_index()
baby_names_by_gender.rename(columns={'Name':'Count'}, inplace=True)
baby_names_by_gender

Unnamed: 0,Gender,Count
0,F,10393
1,M,7239
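
One caveat with the approach above: because `drop_duplicates(subset='Name')` keeps a single row per name, a name given to both girls and boys is counted under only one gender. If the intent is to count distinct names within each gender, `nunique` after a `groupby` is one possible alternative. The sketch below uses made-up rows to show the difference.

```python
import pandas as pd

# Hypothetical rows; 'Jordan' appears under both genders.
toy = pd.DataFrame({
    'Name':   ['Emma', 'Emma', 'Jordan', 'Jordan', 'Liam'],
    'Gender': ['F', 'F', 'F', 'M', 'M'],
    'Count':  [10, 12, 5, 7, 9],
})

# Approach used above: one row per name, so 'Jordan' is counted as F only.
first_only = toy.drop_duplicates(subset='Name').groupby('Gender').Name.count()

# Alternative: distinct names within each gender, so 'Jordan' counts for both.
per_gender = toy.groupby('Gender').Name.nunique()

print(first_only)
print(per_gender)
```

With this alternative the per-gender counts no longer add up to the total number of distinct names, since shared names are counted twice.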


### Step 7. Group the dataset by name and assign the result to names

In [13]:
names = baby_names.groupby('Name').Count.sum().reset_index()
names.sort_values(by='Count', ascending=False, inplace=True)
names.head()

Unnamed: 0,Name,Count
7198,Jacob,242874
5378,Emma,214852
12111,Michael,214405
5579,Ethan,209277
6973,Isabella,204798


In [14]:
names.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17632 entries, 7198 to 13010
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    17632 non-null  object
 1   Count   17632 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 413.2+ KB


### Step 8. How many different names exist in the dataset?

In [15]:
names.Name.duplicated().value_counts()

Name
False    17632
Name: count, dtype: int64

In [16]:
names.Name.count()

17632
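
The number of distinct names can also be read straight from the original column with `nunique`, without grouping first. A minimal sketch on made-up rows:

```python
import pandas as pd

# Hypothetical rows; 'Emma' appears twice.
toy = pd.DataFrame({'Name': ['Emma', 'Emma', 'Jacob', 'Nita']})

# nunique() counts distinct non-null values.
n_unique = toy['Name'].nunique()

print(n_unique)
```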

### Step 9. What is the name with most occurrences?

In [17]:
names[names.Count == names.Count.max()]

Unnamed: 0,Name,Count
7198,Jacob,242874


In [18]:
names_sorted = names.sort_values(by='Count', ascending=False)
names_sorted.head(10)

Unnamed: 0,Name,Count
7198,Jacob,242874
5378,Emma,214852
12111,Michael,214405
5579,Ethan,209277
6973,Isabella,204798
16746,William,197894
8568,Joshua,191551
15373,Sophia,191446
4166,Daniel,191440
5367,Emily,190318
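
When the totals are kept as a Series indexed by name, `idxmax` returns the top name directly. This is a sketch on made-up counts, not the notebook's own method:

```python
import pandas as pd

# Hypothetical rows; counts are summed per name below.
toy = pd.DataFrame({
    'Name':  ['Emma', 'Jacob', 'Emma', 'Jacob', 'Nita'],
    'Count': [10, 20, 15, 8, 5],
})

# Without reset_index(), the result is a Series indexed by Name.
totals = toy.groupby('Name')['Count'].sum()

top_name = totals.idxmax()   # index label of the largest total
top_count = totals.max()     # the largest total itself

print(top_name, top_count)
```

Note that `idxmax` returns only the first label if several names tie for the maximum, whereas the boolean filter used above returns all of them.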


### Step 10. How many different names have the least occurrences?

In [19]:
names_sorted.tail()

Unnamed: 0,Name,Count
547,Aiyla,5
7539,Jamela,5
7541,Jamera,5
541,Aivy,5
13010,Nita,5


In [25]:
names[names.Count == names.Count.min()].Name.count()

2578
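
An equivalent way to count the names tied at the minimum is to sum the boolean mask, which skips building the filtered DataFrame. A small sketch with made-up totals:

```python
import pandas as pd

# Hypothetical per-name totals; two names share the minimum of 5.
toy_names = pd.DataFrame({
    'Name':  ['Aivy', 'Nita', 'Emma', 'Jacob'],
    'Count': [5, 5, 100, 200],
})

least = toy_names['Count'].min()

# True counts as 1, so the sum is the number of names at the minimum.
n_least = int((toy_names['Count'] == least).sum())

print(least, n_least)
```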

### Step 11. What is the median name occurrence?

In [21]:
names.Count.median()

49.0

### Step 12. What is the standard deviation of name occurrences?

In [22]:
names.Count.std()

11006.0694678915

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [23]:
names.describe()

Unnamed: 0,Count
count,17632.0
mean,2008.932169
std,11006.069468
min,5.0
25%,11.0
50%,49.0
75%,337.0
max,242874.0
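
If only some of these statistics are needed, `agg` and `quantile` can request them individually. The sketch below uses five made-up values; note that `std` in pandas is the sample standard deviation (`ddof=1`) by default, which is also what `describe()` reports.

```python
import pandas as pd

# Hypothetical counts, chosen only to illustrate the calls.
counts = pd.Series([5, 11, 49, 337, 242874], name='Count')

# agg() with a list of function names returns a Series indexed by those names.
summary = counts.agg(['mean', 'median', 'std', 'min', 'max'])

# quantile() with a list returns the requested quartiles (linear interpolation by default).
quartiles = counts.quantile([0.25, 0.5, 0.75])

print(summary)
print(quartiles)
```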
