# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [14]:
import pandas as pd
import numpy as np

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [15]:
# baby_names holds the URL of the CSV; the loaded DataFrame is stored in baby_names_df
baby_names = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv'
baby_names_df = pd.read_csv(baby_names)
baby_names_df.head()

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41


### Step 4. See the first 10 entries

In [16]:
baby_names_df.head(10)

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
5,11354,11355,Abigail,2004,F,AK,37
6,11355,11356,Olivia,2004,F,AK,33
7,11356,11357,Isabella,2004,F,AK,30
8,11357,11358,Alyssa,2004,F,AK,29
9,11358,11359,Sophia,2004,F,AK,28


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [19]:
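# columns= already selects the column axis, so axis=1 is redundant;
# errors='ignore' lets the cell be re-run after the columns are gone
baby_names_df.drop(columns=['Unnamed: 0', 'Id'], inplace=True, errors='ignore')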

print("Columns after deletion:\n", baby_names_df.columns.tolist())

Columns after deletion:
 ['Name', 'Year', 'Gender', 'State', 'Count']
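
A column named `Unnamed: 0` usually appears when a CSV was saved together with its row index and then read back without saying so. That is most likely what happened here, though the source file does not state it. The sketch below reproduces the effect on a made-up three-row frame (`sample` and its values are hypothetical, not taken from the real file) and shows `index_col=0` as one way to avoid the extra column at load time.

```python
import io

import pandas as pd

# Hypothetical sample data, used only to illustrate the effect.
sample = pd.DataFrame({"Name": ["Emma", "Madison", "Hannah"],
                       "Count": [62, 48, 46]})

# to_csv() writes the row index as an unnamed first column by default.
csv_text = sample.to_csv()

# Reading it back without options turns that column into 'Unnamed: 0'.
naive = pd.read_csv(io.StringIO(csv_text))
print(naive.columns.tolist())

# index_col=0 tells pandas to use the first column as the index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```

With the real file, the `Id` column would still need to be dropped as in the cell above.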


### Step 6. Are there more male or female names in the dataset?

In [None]:
gender_counts = baby_names_df['Gender'].value_counts()

print("Count of names by gender:")
print(gender_counts)

if gender_counts['M'] > gender_counts['F']:
    print("\nThere are more MALE (M) names in the dataset.")
else:
    print("\nThere are more FEMALE (F) names in the dataset.")

Count of names by gender:
Gender
F    558846
M    457549
Name: count, dtype: int64

There are more FEMALE (F) names in the dataset.
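
Note that `value_counts()` counts rows, and each row is one name in one state and year, so the comparison above is about name records rather than babies. A minimal sketch on made-up data (the `toy` frame and its values are hypothetical) shows how the two measures can disagree:

```python
import pandas as pd

# Hypothetical toy data: two female name records, one male name record.
toy = pd.DataFrame({
    "Name": ["Emma", "Olivia", "Jacob"],
    "Gender": ["F", "F", "M"],
    "Count": [10, 5, 40],
})

# Number of rows (name records) per gender.
rows_per_gender = toy["Gender"].value_counts()

# Number of babies per gender: sum the Count column within each gender.
babies_per_gender = toy.groupby("Gender")["Count"].sum()

print(rows_per_gender)
print(babies_per_gender)
```

In this toy frame there are more female records but more male babies, which is why it matters which of the two questions is being asked.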


### Step 7. Group the dataset by name and assign to names

In [22]:
names = baby_names_df.groupby('Name')
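
A `groupby` call on its own only records how the rows are split; nothing is computed until a group is requested or an aggregation is applied. The sketch below, on a hypothetical three-row frame (`toy` and its values are made up), shows a few ways to inspect such an object.

```python
import pandas as pd

# Hypothetical toy data with one repeated name.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob"],
    "State": ["AK", "AL", "AK"],
    "Count": [62, 50, 40],
})

names = toy.groupby("Name")

# Number of distinct groups, i.e. distinct names.
print(names.ngroups)

# All rows belonging to a single group.
emma_rows = names.get_group("Emma")
print(emma_rows)

# Aggregating turns the grouped object into a regular Series.
totals = names["Count"].sum()
print(totals)
```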

### Step 8. How many different names exist in the dataset?

In [23]:
unique_name_count = baby_names_df['Name'].nunique()

print(f"Number of unique baby names in the dataset: {unique_name_count}")

Number of unique baby names in the dataset: 17632


### Step 9. Which name has the most occurrences?

In [24]:
name_counts = baby_names_df.groupby('Name')['Count'].sum()
top_name = name_counts.idxmax()
top_count = name_counts.max()

print(f"The most common name is '{top_name}' with {top_count:,} occurrences.")

The most common name is 'Jacob' with 242,874 occurrences.


### Step 10. How many different names have the least occurrences?

In [25]:
min_count = name_counts.min()

num_names_with_min = (name_counts == min_count).sum()

print(f"Number of names with the least occurrences ({min_count}): {num_names_with_min}")

Number of names with the least occurrences (5): 2578
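
The same boolean mask can also return the names themselves rather than just their number. A minimal sketch, using a hypothetical `toy_counts` Series in place of the real `name_counts`:

```python
import pandas as pd

# Hypothetical per-name totals; the real name_counts has the same shape
# (names as the index, total occurrences as the values).
toy_counts = pd.Series({"Emma": 112, "Jacob": 40, "Zyla": 5, "Abdiel": 5})

min_count = toy_counts.min()

# Boolean mask keeps only the names whose total equals the minimum.
rarest_names = toy_counts[toy_counts == min_count].index.tolist()

print(min_count, rarest_names)
```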


### Step 11. What is the median name occurrence?

In [26]:
median_occurrence = name_counts.median()

print(f"Median name occurrence: {median_occurrence:.0f}")

Median name occurrence: 49


### Step 12. What is the standard deviation of name occurrences?

In [27]:
std_dev = name_counts.std()

print(f"Standard deviation of name occurrences: {std_dev:,.0f}")

Standard deviation of name occurrences: 11,006


### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [28]:
summary = name_counts.describe(percentiles=[.25, .5, .75])
summary = summary.rename({
    '50%': 'median',
    '25%': 'Q1',
    '75%': 'Q3'
})

print("Name Occurrence Statistics:")
print(summary.to_string(float_format='{:,.0f}'.format))

Name Occurrence Statistics:
count     17,632
mean       2,009
std       11,006
min            5
Q1            11
median        49
Q3           337
max      242,874
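
The gap between the mean (2,009) and the median (49) points to a strongly right-skewed distribution: a few very popular names pull the mean far above the typical value. The sketch below shows the same effect on five made-up totals (`toy_counts` is hypothetical) and computes the interquartile range with `Series.quantile`.

```python
import pandas as pd

# Hypothetical totals: four rare names and one very popular name.
toy_counts = pd.Series([5, 10, 50, 300, 200000])

# Quartiles, indexed by the requested quantile levels.
quartiles = toy_counts.quantile([0.25, 0.5, 0.75])

# Interquartile range: spread of the middle half of the values.
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

mean_value = toy_counts.mean()
median_value = toy_counts.median()

print(f"mean={mean_value:,.0f} median={median_value:,.0f} IQR={iqr:,.0f}")
```

For data shaped like this, the median and the IQR tend to describe a typical name better than the mean and the standard deviation do.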
