# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 to 2014.


### Step 1. Import the necessary libraries

In [6]:
import pandas as pd
import numpy as np

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [7]:
baby_names = pd.read_csv("https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv")
baby_names.head()

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
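An `Unnamed: 0` column typically appears when a DataFrame's index was written into the CSV. As a minimal sketch, assuming that is what happened with this file, `index_col=0` asks `read_csv` to treat the first column as the index rather than as data. The CSV text below is an inline stand-in built from the rows shown above, not the real download.

```python
import io
import pandas as pd

# Inline stand-in for the file: an empty first header cell, as produced
# when an index is saved to CSV (assumption about how the file was made).
csv_text = (
    ",Id,Name,Year,Gender,State,Count\n"
    "11349,11350,Emma,2004,F,AK,62\n"
    "11350,11351,Madison,2004,F,AK,48\n"
)

# index_col=0 uses the first column as the row index instead of
# loading it as a column called "Unnamed: 0".
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.columns.tolist())
print(sample.index.tolist())
```

With this option there is no `Unnamed: 0` column to delete afterwards, although `Id` would still need to be dropped.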


### Step 4. See the first 10 entries

In [8]:
baby_names.head(10)
#.head() shows the first rows of the DataFrame
#with empty parentheses it returns the first 5 rows
#to get 10, pass 10 as the argument

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
5,11354,11355,Abigail,2004,F,AK,37
6,11355,11356,Olivia,2004,F,AK,33
7,11356,11357,Isabella,2004,F,AK,30
8,11357,11358,Alyssa,2004,F,AK,29
9,11358,11359,Sophia,2004,F,AK,28


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [9]:
baby_names.drop(["Unnamed: 0", "Id"], axis=1, inplace=True)
#.drop() deletes rows or columns; here we list the columns to remove
#axis=1 targets columns; columns=[...] can be used instead of axis=1
#inplace=True modifies baby_names directly instead of returning a copy
baby_names.head()

,Name,Year,Gender,State,Count
0,Emma,2004,F,AK,62
1,Madison,2004,F,AK,48
2,Hannah,2004,F,AK,46
3,Grace,2004,F,AK,44
4,Emily,2004,F,AK,41
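The `columns=` form mentioned in the comments can be sketched as follows. The toy frame below is a hypothetical stand-in with the same column layout, and the example returns a new frame instead of using `inplace=True`, which is a common alternative rather than the approach taken in this exercise.

```python
import pandas as pd

# Toy frame (hypothetical stand-in for baby_names).
toy = pd.DataFrame({
    "Unnamed: 0": [11349, 11350],
    "Id": [11350, 11351],
    "Name": ["Emma", "Madison"],
    "Count": [62, 48],
})

# columns= names the axis explicitly, so axis=1 is not needed.
# Without inplace=True, drop() returns a new frame and leaves toy untouched.
trimmed = toy.drop(columns=["Unnamed: 0", "Id"])

print(trimmed.columns.tolist())
print(toy.columns.tolist())
```

Reassigning the result (`toy = toy.drop(...)`) has the same effect as `inplace=True` and makes the change visible in the code.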


### Step 6. Are there more male or female names in the dataset?

In [23]:
baby_names['Gender'].value_counts()
#.value_counts() counts how many rows hold each value in the specified column
#There are over 100,000 more female entries than male ones
#To format the output as a statement, assign the result above to a variable, say "count"
# count = baby_names['Gender'].value_counts()
# winner = count.idxmax()
#.idxmax() returns the index label with the highest value
# print(count) - to verify the counts match the statement
# print("There are more " + winner)
#This will print out "There are more F"

F    558846
M    457549
Name: Gender, dtype: int64
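The question can be read in more than one way, and `.value_counts()` answers only one of them (rows per gender). The sketch below uses a small made-up frame to contrast three readings: rows per gender, babies per gender (summing `Count`), and distinct names per gender. Which reading the exercise intends is an assumption on the reader's part.

```python
import pandas as pd

# Made-up rows; only the column names mirror the exercise.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob", "Grace"],
    "Gender": ["F", "F", "M", "F"],
    "Count": [62, 10, 100, 5],
})

# 1. Rows per gender: what .value_counts() reports.
rows_per_gender = toy["Gender"].value_counts()
winner = rows_per_gender.idxmax()

# 2. Babies per gender: add up the Count column within each gender.
babies_per_gender = toy.groupby("Gender")["Count"].sum()

# 3. Distinct names per gender.
names_per_gender = toy.groupby("Gender")["Name"].nunique()

print("Most rows:", winner)
print(babies_per_gender)
print(names_per_gender)
```

In this toy data `F` has the most rows and the most distinct names, while `M` has the most babies, so the three readings need not agree.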

### Step 7. Group the dataset by name and assign to names

In [19]:
names = baby_names.groupby('Name').sum(numeric_only=True)
#.groupby() groups the rows by name, and .sum() adds up the numeric columns within each group
#numeric_only=True skips the text columns (Gender, State), which pandas 2.x would otherwise try to sum as strings
names.drop("Year", axis=1, inplace=True)
#summing the years is meaningless, so we remove the Year column using .drop()
names.head()

Name,Count
Aaban,12
Aadan,23
Aadarsh,5
Aaden,3426
Aadhav,6
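An alternative to summing everything and then dropping `Year` is to select the column of interest before aggregating. This is a minimal sketch on made-up rows, not the approach used above; the double brackets keep the result as a one-column DataFrame.

```python
import pandas as pd

# Made-up rows with the same columns as baby_names after Step 5.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob"],
    "Year": [2004, 2005, 2004],
    "Gender": ["F", "F", "M"],
    "State": ["AK", "AK", "AL"],
    "Count": [62, 10, 100],
})

# Selecting [["Count"]] first means Year, Gender and State are never summed.
names_toy = toy.groupby("Name")[["Count"]].sum()

print(names_toy)
```

Single brackets (`["Count"]`) would give a Series instead, which is often convenient for the later steps.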


### Step 8. How many different names exist in the dataset?

In [20]:
baby_names.Name.nunique() 
#.nunique() returns how many unique values are in the Name column
#To format the output as a statement, assign the result above to a variable, say "number"
#and print("There are " + str(number) + " names in the dataset")
#which will print "There are 17632 names in the dataset"
#another way is to take the length of names with len(names), since it is already grouped by name

17632
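The two approaches mentioned in the comments can be checked against each other. The sketch below uses a made-up frame and shows that counting unique values and counting grouped rows give the same number.

```python
import pandas as pd

# Made-up rows; "Emma" appears twice on purpose.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob", "Grace"],
    "Count": [62, 10, 100, 5],
})

# Approach 1: count distinct values in the column.
number = toy["Name"].nunique()

# Approach 2: group by name and count the resulting rows.
grouped = toy.groupby("Name")[["Count"]].sum()
number_from_groups = len(grouped)

print("There are " + str(number) + " names in the dataset")
```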

### Step 9. What is the name with most occurrences?

In [27]:
names.idxmax() 
#.idxmax() returns the index label with the highest value
#names has a single column, so we did not need to specify one
#with more than one column, select the column first, e.g. names.Count.idxmax()

Count    Jacob
dtype: object
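Calling `.idxmax()` on the whole DataFrame returns a Series with one entry per column, as in the output above. Selecting the column first returns the name directly, and `.nlargest()` gives a short ranking. A minimal sketch on made-up totals:

```python
import pandas as pd

# Made-up totals, indexed by name like the grouped names frame.
names_toy = pd.DataFrame(
    {"Count": [72, 100, 5]},
    index=pd.Index(["Emma", "Jacob", "Grace"], name="Name"),
)

# On a single column, idxmax() returns the index label itself.
top_name = names_toy["Count"].idxmax()

# nlargest(n) returns the n highest values, largest first.
top_two = names_toy["Count"].nlargest(2)

print(top_name)
print(top_two)
```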

### Step 10. How many different names have the least occurrences?

In [43]:
least = names.Count == names.Count.min()
#.min() gives us the smallest value in the Count column
#comparing the Count column to it gives a Series of True/False values
#indicating which names are equal to the minimum
least.sum()
#True counts as 1, so the sum is the number of names at the minimum
#another way to get the answer is finding the length
#len(names[names.Count == names.Count.min()])

2578
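The boolean mask built above can also be used to list the matching names, not just count them. A minimal sketch on a made-up Series of totals:

```python
import pandas as pd

# Made-up totals; three names share the minimum of 5.
counts = pd.Series(
    [5, 5, 12, 7, 5],
    index=["Aadarsh", "Aadhav", "Aaban", "Aaron", "Abby"],
    name="Count",
)

# Element-wise comparison yields a True/False Series (a boolean mask).
is_min = counts == counts.min()

# Summing the mask counts the True values.
how_many = is_min.sum()

# Indexing with the mask keeps only the rows where it is True.
rarest = counts[is_min].index.tolist()

print(how_many)
print(rarest)
```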

### Step 11. What is the median name occurrence?

In [61]:
median = names.Count == names.Count.median()
#same approach as above, but with .median() instead of .min()
occurrence = median.sum()
print(occurrence)
#the number of names whose count equals the median
med_val = names.Count.median()
print(med_val)
#the median value itself

66
49.0
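The two printed numbers answer different questions: one is the median value, the other is how many names have exactly that value. The sketch below separates them on a made-up Series with an odd number of entries, so the median is one of the actual values.

```python
import pandas as pd

# Made-up totals; sorted they are 5, 11, 49, 49, 337.
counts = pd.Series([49, 5, 337, 11, 49], name="Count")

# The median is the middle value of the sorted data.
med_val = counts.median()

# How many entries are exactly equal to the median.
at_median = (counts == med_val).sum()

print(med_val)
print(at_median)
```

With an even number of entries the median is the mean of the two middle values, so it may match no entry at all and the second number can be 0.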


### Step 12. What is the standard deviation of names?

In [65]:
std_num = names.std()
#.std() returns the standard deviation
#to target a specific column, select it first, e.g. names.Count.std()
#names has only one column, so that is not necessary here
std_num

Count    11006.069468
dtype: float64
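By default pandas computes the sample standard deviation (`ddof=1`, dividing by n - 1). If the population standard deviation is wanted instead, `ddof=0` can be passed. A minimal sketch on made-up values chosen so the population figure is a round number:

```python
import pandas as pd

# Made-up values with mean 5 and population variance 4.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Default: sample standard deviation (ddof=1).
sample_std = values.std()

# Population standard deviation (ddof=0).
population_std = values.std(ddof=0)

print(sample_std)
print(population_std)
```

For a dataset with thousands of names the two figures are nearly identical, so the choice matters mainly for small samples.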

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [67]:
names.describe()
#the summary statistics can be obtained using .describe() on the dataset

,Count
count,17632.0
mean,2008.932169
std,11006.069468
min,5.0
25%,11.0
50%,49.0
75%,337.0
max,242874.0
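Individual entries of the summary can be pulled out by label, and `.quantile()` returns the same quartiles directly. The sketch below uses made-up values rather than the real totals.

```python
import pandas as pd

# Made-up values; small enough to verify by hand.
counts = pd.Series([1, 2, 3, 4, 5], name="Count")

# describe() on a Series returns a Series indexed by statistic name.
summary = counts.describe()
median_from_summary = summary["50%"]

# quantile() accepts a list of fractions and returns one value per fraction.
quartiles = counts.quantile([0.25, 0.5, 0.75])

print(summary)
print(quartiles)
```

In the real output above, the mean (about 2009) is far larger than the median (49), which suggests the totals are heavily skewed by a few very common names.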
