# US - Baby Names

## Introduction:

We are going to use a subset of the [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) dataset from Kaggle.

The file contains names from 2004 through 2014.

In [53]:
# Import the necessary libraries
from optimus import Optimus
from pyspark.sql.functions import col, desc  # import only what is used; a wildcard import shadows built-ins such as min, max and sum
import pandas as pd
import numpy as np
op = Optimus()

# Import the dataset and assign it to baby_names

In [54]:
baby_names_pd = pd.read_csv('https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv')

In [55]:
baby_names = op.spark.createDataFrame(baby_names_pd)

In [56]:
baby_names.describe().table()

| summary (string, nullable) | Unnamed: 0 (string, nullable) | Id (string, nullable) | Name (string, nullable) | Year (string, nullable) | Gender (string, nullable) | State (string, nullable) | Count (string, nullable) |
|---|---|---|---|---|---|---|---|
| count | 1016395.0 | 1016395.0 | 1016395 | 1016395.0 | 1016395 | 1016395 | 1016395.0 |
| mean | 2830990.4619178567 | 2830991.4619178567 | Infinity | 2009.053189950757 | | | 34.85012421351935 |
| stddev | 1652475.6514804524 | 1652475.6514804524 | | 3.1382928281815494 | | | 97.3973464861767 |
| min | 11349.0 | 11350.0 | Aaban | 2004.0 | F | AK | 5.0 |
| max | 5647425.0 | 5647426.0 | Zyriah | 2014.0 | M | WY | 4167.0 |


# See the first 10 entries

In [57]:
baby_names.table(10)

| Unnamed: 0 (bigint, nullable) | Id (bigint, nullable) | Name (string, nullable) | Year (bigint, nullable) | Gender (string, nullable) | State (string, nullable) | Count (bigint, nullable) |
|---|---|---|---|---|---|---|
| 11349 | 11350 | Emma | 2004 | F | AK | 62 |
| 11350 | 11351 | Madison | 2004 | F | AK | 48 |
| 11351 | 11352 | Hannah | 2004 | F | AK | 46 |
| 11352 | 11353 | Grace | 2004 | F | AK | 44 |
| 11353 | 11354 | Emily | 2004 | F | AK | 41 |
| 11354 | 11355 | Abigail | 2004 | F | AK | 37 |
| 11355 | 11356 | Olivia | 2004 | F | AK | 33 |
| 11356 | 11357 | Isabella | 2004 | F | AK | 30 |
| 11357 | 11358 | Alyssa | 2004 | F | AK | 29 |
| 11358 | 11359 | Sophia | 2004 | F | AK | 28 |


# Delete the column 'Unnamed: 0' and 'Id'

In [58]:
baby_names = baby_names.drop("Unnamed: 0", "Id")
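
The two dropped columns could also be left out when the CSV is read. The sketch below is one possible way to do that in pandas with `usecols`. It assumes the first header cell of the file is empty (which is what makes pandas name the column `Unnamed: 0`), and it uses a two-row inline sample instead of the real file.

```python
import io

import pandas as pd

# Inline stand-in for the CSV; assumes the real file starts with an empty header cell.
csv_text = (
    ",Id,Name,Year,Gender,State,Count\n"
    "11349,11350,Emma,2004,F,AK,62\n"
    "11350,11351,Madison,2004,F,AK,48\n"
)

# Default read: pandas labels the empty header cell 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# Reading only the needed columns makes the later drop unnecessary.
wanted = ["Name", "Year", "Gender", "State", "Count"]
trimmed = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(list(trimmed.columns))
```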

In [59]:
baby_names.table(5)


# Are there more male or female names in the dataset?

In [33]:
baby_names.groupby("Gender").count().orderBy("count").table()

| Gender (string, nullable) | count (bigint, not nullable) |
|---|---|
| M | 457549 |
| F | 558846 |
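
The count above tallies rows, and each row is one Name/Year/Gender/State combination, so it is not the same as the number of distinct names or the number of babies. The pandas sketch below shows the three quantities side by side. The sample values are made up for illustration, and pandas is used here only as a stand-in for the Spark dataframe.

```python
import pandas as pd

# Made-up sample with the same columns as the dataset (illustrative values only).
sample = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob", "Ethan"],
    "Gender": ["F", "F", "M", "M"],
    "State": ["AK", "AL", "AK", "AK"],
    "Count": [62, 100, 40, 30],
})

by_gender = sample.groupby("Gender")
rows_per_gender = by_gender.size()                # what groupby(...).count() measures
names_per_gender = by_gender["Name"].nunique()    # distinct names per gender
births_per_gender = by_gender["Count"].sum()      # babies per gender

print(rows_per_gender.to_dict())
print(names_per_gender.to_dict())
print(births_per_gender.to_dict())
```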


# Group the dataset by name and assign to names

In [34]:
baby_names = baby_names.drop("Year")  # drop Year so the sum() below aggregates only Count

In [43]:
names = baby_names.groupby("Name").sum().cols.rename("sum(Count)","count")

In [44]:
names.table(5)

| Name (string, nullable) | count (bigint, nullable) |
|---|---|
| Kiana | 5965 |
| Alayna | 14171 |
| Ember | 3181 |
| Tyler | 129989 |
| Maddox | 20716 |
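
Dropping `Year` is needed above because `sum()` with no arguments aggregates every numeric column. An alternative is to name the column being summed. The sketch below shows that idea in pandas on a made-up sample; it is not the Optimus call used in this notebook.

```python
import pandas as pd

# Made-up sample; Year stays in the frame on purpose.
sample = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob"],
    "Year": [2004, 2005, 2004],
    "Count": [62, 58, 40],
})

# Selecting "Count" before summing leaves Year out of the aggregation.
names = (
    sample.groupby("Name", as_index=False)["Count"]
    .sum()
    .rename(columns={"Count": "count"})
)
print(names)
```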


In [42]:
print(f'Columns: {op.profiler.dataset_info(names)["cols_count"]}', 
      f'Rows: {op.profiler.dataset_info(names)["rows_count"]}')

Columns: 2 Rows: 17632


In [49]:
names.sort(desc("count")).table(10)

| Name (string, nullable) | count (bigint, nullable) |
|---|---|
| Jacob | 242874 |
| Emma | 214852 |
| Michael | 214405 |
| Ethan | 209277 |
| Isabella | 204798 |
| William | 197894 |
| Joshua | 191551 |
| Sophia | 191446 |
| Daniel | 191440 |
| Emily | 190318 |


# How many different names exist in the dataset?

In [50]:
names.count()

17632

# Which name has the most occurrences?

In [62]:
names.sort(desc("count")).select("Name").table(1)

| Name (string, nullable) |
|---|
| Jacob |
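
Sorting and taking the first row returns a single name even if several names share the top total. Filtering on the maximum keeps every tied name. The sketch below shows this in pandas, using the top three totals from the table above as sample data.

```python
import pandas as pd

# Top three totals taken from the sorted table above.
names = pd.DataFrame({
    "Name": ["Jacob", "Emma", "Michael"],
    "count": [242874, 214852, 214405],
})

top_count = names["count"].max()
# Keep every name whose total equals the maximum, so ties are not lost.
top_names = names.loc[names["count"] == top_count, "Name"].tolist()
print(top_names)
```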


# How many different names have the least occurrences?

In [86]:
min_oc = names.cols.min("count") # This will give you the minimum occurrence count across names
names.where(col("count") == min_oc).count()

2578
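
To see how many names share each of the smallest totals, not only the minimum, a frequency table of the totals can help. The sketch below uses pandas `value_counts` on made-up totals with placeholder names.

```python
import pandas as pd

# Placeholder names and made-up totals, for illustration only.
names = pd.DataFrame({
    "Name": ["name_a", "name_b", "name_c", "name_d"],
    "count": [5, 5, 7, 120],
})

# Number of names per total, ordered from the smallest total upwards.
names_per_total = names["count"].value_counts().sort_index()
print(names_per_total.to_dict())

# The first entry is the number of names with the minimum total.
names_at_minimum = int(names_per_total.iloc[0])
print(names_at_minimum)
```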

# What is the median name occurrence?

In [99]:
median_oc = names.approxQuantile("count", [0.5], relativeError=0)[0] # This will give you the median occurrence for names; relativeError=0 requests the exact quantile
median_oc

49.0
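
The PySpark documentation describes `approxQuantile` as returning a sample from the column, so with an even number of rows the result may differ from a median that averages the two middle values. The sketch below illustrates the difference in pandas on made-up totals. Using `interpolation="lower"` is only an approximation of Spark's behaviour, chosen here because it also returns an actual element.

```python
import pandas as pd

# Made-up totals with an even number of values.
counts = pd.Series([5, 11, 49, 51])

# Default median: average of the two middle values.
interpolated_median = counts.median()

# Quantile restricted to values that occur in the data.
element_median = counts.quantile(0.5, interpolation="lower")

print(interpolated_median, element_median)
```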

# What is the standard deviation of name occurrences?

In [100]:
names.cols.std("count")

11006.06947
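
The value above agrees with the `stddev` that Spark's `describe()` reports for the same column, which is the sample standard deviation (dividing by n - 1). The sketch below contrasts the sample and population versions in pandas on made-up numbers.

```python
import pandas as pd

# Made-up values: mean is 5 and the squared deviations sum to 32.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = values.std(ddof=1)      # divides by n - 1 (pandas default)
population_std = values.std(ddof=0)  # divides by n

print(sample_std, population_std)
```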

# Get a summary with the mean, min, max, std and quartiles.

In [111]:
names.select("count").describe().table()

| summary (string, nullable) | count (string, nullable) |
|---|---|
| count | 17632.0 |
| mean | 2008.932168784029 |
| stddev | 11006.069467890566 |
| min | 5.0 |
| max | 242874.0 |


In [117]:
names.approxQuantile("count",[0.25,0.5,0.75], relativeError=0)

[11.0, 49.0, 337.0]
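
The two cells above can be combined into one summary. PySpark's `DataFrame.summary()` reports the 25%, 50% and 75% percentiles alongside the basic statistics, and pandas `describe()` does the same. The sketch below shows the pandas version on made-up values; note that pandas interpolates between values when computing quartiles.

```python
import pandas as pd

# Made-up values, for illustration only.
counts = pd.Series([1, 2, 3, 4, 5], name="count")

# describe() returns count, mean, std, min, 25%, 50%, 75% and max in one call.
stats = counts.describe()
print(stats)

quartiles = [stats["25%"], stats["50%"], stats["75%"]]
print(quartiles)
```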