# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 to 2014.


### Step 1. Import the necessary libraries

In [40]:
import polars as pl

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [52]:
# scan_csv returns a LazyFrame: the query is only planned here and runs on .collect(),
# which lets Polars optimize it (often faster than an eager read_csv DataFrame)
baby_names = pl.scan_csv('https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv')

# baby_names.collect()

### Step 4. See the first 10 entries

In [42]:
baby_names.head(10).collect()

Unnamed: 0_level_0,Id,Name,Year,Gender,State,Count
i64,i64,str,i64,str,str,i64
11349,11350,"""Emma""",2004,"""F""","""AK""",62
11350,11351,"""Madison""",2004,"""F""","""AK""",48
11351,11352,"""Hannah""",2004,"""F""","""AK""",46
11352,11353,"""Grace""",2004,"""F""","""AK""",44
11353,11354,"""Emily""",2004,"""F""","""AK""",41
11354,11355,"""Abigail""",2004,"""F""","""AK""",37
11355,11356,"""Olivia""",2004,"""F""","""AK""",33
11356,11357,"""Isabella""",2004,"""F""","""AK""",30
11357,11358,"""Alyssa""",2004,"""F""","""AK""",29
11358,11359,"""Sophia""",2004,"""F""","""AK""",28


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [43]:

# Polars reads the unnamed leading index column as '' (empty string), so drop it by that name
baby_names = baby_names.drop(['', 'Id'])

# Collect to see the result
baby_names.head(5).collect()

Name,Year,Gender,State,Count
str,i64,str,str,i64
"""Emma""",2004,"""F""","""AK""",62
"""Madison""",2004,"""F""","""AK""",48
"""Hannah""",2004,"""F""","""AK""",46
"""Grace""",2004,"""F""","""AK""",44
"""Emily""",2004,"""F""","""AK""",41


### Step 6. Are there more male or female names in the dataset?

In [44]:
baby_names.group_by('Gender').len().collect()

Gender,len
str,u32
"""M""",457549
"""F""",558846


### Step 7. Group the dataset by name and assign to names

In [45]:
names = baby_names.drop('Year').group_by('Name').sum()

sorted_names = names.sort(by='Count', descending=True)

sorted_names.collect()

Name,Gender,State,Count
str,str,str,i64
"""Jacob""",,,242874
"""Emma""",,,214852
"""Michael""",,,214405
"""Ethan""",,,209277
"""Isabella""",,,204798
…,…,…,…
"""Jobani""",,,5
"""Nadeem""",,,5
"""Amanpreet""",,,5
"""Shalim""",,,5


### Step 8. How many different names exist in the dataset?

In [46]:
len(names.collect())

17632

### Step 9. What is the name with most occurrences?

In [47]:
sorted_names.select('Name').head(1).collect().item()

'Jacob'

### Step 10. How many different names have the least occurrences?

In [48]:
# pandas equivalent: len(names[names.Count == names.Count.min()])

# First get the min count value
min_count = sorted_names.select(pl.min('Count')).collect().item()
# Then filter names with this min count and count them
counts_df = sorted_names.filter(pl.col('Count') == min_count).collect()
len(counts_df)

2578

### Step 11. What is the median name occurrence?

In [49]:
# First get the median count value
median_count = sorted_names.select(pl.median('Count')).collect().item()
# Then show the names whose total count equals the median
counts_df = sorted_names.filter(pl.col('Count') == median_count).collect()
counts_df

Name,Gender,State,Count
str,str,str,i64
"""Nassir""",,,49
"""Rubin""",,,49
"""Gurshaan""",,,49
"""Nabeel""",,,49
"""Kaedence""",,,49
…,…,…,…
"""Kaelee""",,,49
"""Jaiyana""",,,49
"""Elizah""",,,49
"""Kaio""",,,49


### Step 12. What is the standard deviation of name occurrences?

In [50]:
sorted_names.select('Count').std().collect().item()

11006.06946789057

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [51]:
sorted_names.select('Count').describe()

statistic,Count
str,f64
"""count""",17632.0
"""null_count""",0.0
"""mean""",2008.932169
"""std""",11006.069468
"""min""",5.0
"""25%""",11.0
"""50%""",49.0
"""75%""",337.0
"""max""",242874.0
