# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [1]:
!pip install pyspark

Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [2]:
from pyspark.sql import SparkSession, functions as f
from pyspark.files import SparkFiles

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

In [4]:
url = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv"

spark = SparkSession.builder.appName("exercise60").getOrCreate()

spark.sparkContext.addFile(url)

df = spark.read.csv("file://" + SparkFiles.get("US_Baby_Names_right.csv"), header=True, inferSchema=True)
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- Id: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Count: integer (nullable = true)



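### Step 3. Assign it to a variable called baby_names.

The DataFrame loaded in Step 2 is already assigned to `df`, which the remaining steps use in place of `baby_names`.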

### Step 4. See the first 10 entries

In [5]:
df.show(10)

+-----+-----+--------+----+------+-----+-----+
|  _c0|   Id|    Name|Year|Gender|State|Count|
+-----+-----+--------+----+------+-----+-----+
|11349|11350|    Emma|2004|     F|   AK|   62|
|11350|11351| Madison|2004|     F|   AK|   48|
|11351|11352|  Hannah|2004|     F|   AK|   46|
|11352|11353|   Grace|2004|     F|   AK|   44|
|11353|11354|   Emily|2004|     F|   AK|   41|
|11354|11355| Abigail|2004|     F|   AK|   37|
|11355|11356|  Olivia|2004|     F|   AK|   33|
|11356|11357|Isabella|2004|     F|   AK|   30|
|11357|11358|  Alyssa|2004|     F|   AK|   29|
|11358|11359|  Sophia|2004|     F|   AK|   28|
+-----+-----+--------+----+------+-----+-----+
only showing top 10 rows



### Step 5. Delete the columns '_c0' (the unnamed index column, 'Unnamed: 0' in pandas) and 'Id'

In [9]:
df = df.drop("_c0", "Id")
df.show()

+---------+----+------+-----+-----+
|     Name|Year|Gender|State|Count|
+---------+----+------+-----+-----+
|     Emma|2004|     F|   AK|   62|
|  Madison|2004|     F|   AK|   48|
|   Hannah|2004|     F|   AK|   46|
|    Grace|2004|     F|   AK|   44|
|    Emily|2004|     F|   AK|   41|
|  Abigail|2004|     F|   AK|   37|
|   Olivia|2004|     F|   AK|   33|
| Isabella|2004|     F|   AK|   30|
|   Alyssa|2004|     F|   AK|   29|
|   Sophia|2004|     F|   AK|   28|
|   Alexis|2004|     F|   AK|   27|
|Elizabeth|2004|     F|   AK|   27|
|   Hailey|2004|     F|   AK|   27|
|     Anna|2004|     F|   AK|   26|
|  Natalie|2004|     F|   AK|   25|
|    Sarah|2004|     F|   AK|   25|
|   Sydney|2004|     F|   AK|   25|
|      Ava|2004|     F|   AK|   23|
|  Trinity|2004|     F|   AK|   22|
|    Haley|2004|     F|   AK|   21|
+---------+----+------+-----+-----+
only showing top 20 rows



### Step 6. Are there more male or female names in the dataset?

In [14]:
"Males" if df.filter(df.Gender == "M").count() > df.filter(df.Gender == "F").count() else "Females"

'Females'
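
The comparison above counts rows (one per name, year, gender and state), not births. As a minimal plain-Python sketch of the difference, the records below reuse three female rows shown earlier and add two made-up male rows (`Liam` and `Noah` with their counts are hypothetical). In PySpark the two views would roughly correspond to `df.groupBy("Gender").count()` and `df.groupBy("Gender").agg(f.sum("Count"))`.

```python
from collections import defaultdict

# (Name, Gender, Count) records. The female rows mirror the sample output
# above; the male rows are hypothetical and only there for illustration.
records = [
    ("Emma", "F", 62),
    ("Madison", "F", 48),
    ("Hannah", "F", 46),
    ("Liam", "M", 100),  # hypothetical
    ("Noah", "M", 90),   # hypothetical
]

rows_per_gender = defaultdict(int)    # how many records per gender
births_per_gender = defaultdict(int)  # sum of the Count column per gender

for _name, gender, count in records:
    rows_per_gender[gender] += 1
    births_per_gender[gender] += count

print(dict(rows_per_gender))
print(dict(births_per_gender))
```

In this toy data there are more female rows but more male births, so the two readings of the question can give different answers.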

### Step 7. Group the dataset by name and assign to names

In [33]:
names = df.groupBy("Name")
counted_names = names.count()

### Step 8. How many different names exist in the dataset?

In [39]:
len_counted_names = counted_names.count()
len_counted_names

17632

### Step 9. What is the name with most occurrences?

In [35]:
counted_names.orderBy(f.desc("count")).show(1)

+-----+-----+
| Name|count|
+-----+-----+
|Riley| 1112|
+-----+-----+
only showing top 1 row



### Step 10. How many different names have the least occurrences?

In [38]:

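# Smallest number of occurrences of any name
least_occurrences = counted_names.orderBy("count").limit(1).collect()[0][1]

# Number of names that occur exactly that many times
counted_names.filter(f.col("count") == least_occurrences).count()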

3682
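
The same two-step logic (find the minimum, then count the names that share it) can be sketched in plain Python. The dictionary below is a made-up stand-in for `counted_names`; its names and values are hypothetical, not taken from the dataset.

```python
# Hypothetical name -> occurrence mapping standing in for counted_names.
occurrences = {"Ana": 40, "Bo": 1, "Cy": 7, "Di": 1, "Eli": 3}

# Step 1: the smallest occurrence count.
least = min(occurrences.values())

# Step 2: every name whose count equals that minimum.
names_with_least = [name for name, n in occurrences.items() if n == least]

print(least)
print(len(names_with_least))
```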

### Step 11. What is the median name occurrence?

In [40]:
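# limit()/tail() on an unordered DataFrame return an arbitrary row, not the median,
# so compute the 0.5 quantile directly; relativeError=0.0 requests the exact value.
counted_names.approxQuantile("count", [0.5], 0.0)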
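
A median is defined on sorted values, so taking "the middle row" only works after ordering by the count. The plain-Python sketch below uses made-up counts (not taken from the dataset) to show how much the middle element of an unsorted list can differ from the median.

```python
import statistics

# Hypothetical occurrence counts, deliberately left unsorted.
counts = [39, 1, 8, 1112, 2, 10, 5]

# Middle element without sorting: depends entirely on the arbitrary order.
middle_unsorted = counts[len(counts) // 2]

# Middle element after sorting: the median for an odd-length list.
middle_sorted = sorted(counts)[len(counts) // 2]

print(middle_unsorted)
print(middle_sorted)
print(statistics.median(counts))
```

For an even number of values, `statistics.median` averages the two middle elements, whereas indexing at `len(counts) // 2` picks the upper one.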

### Step 12. What is the standard deviation of name occurrences?

In [43]:
counted_names.describe().filter(f.col("summary")=="stddev").show()

+-------+----+------------------+
|summary|Name|             count|
+-------+----+------------------+
| stddev|null|122.02996350813885|
+-------+----+------------------+
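
As far as we know, the `stddev` row reported by `describe()` is the sample standard deviation (dividing by n - 1), which matches the pandas default. A small plain-Python sketch with made-up values shows how it differs from the population version (dividing by n):

```python
import math
import statistics

# Hypothetical values, not taken from the dataset.
values = [2, 4, 4, 4, 5, 5, 7, 9]

sample_sd = statistics.stdev(values)       # divides by n - 1
population_sd = statistics.pstdev(values)  # divides by n

print(sample_sd)
print(population_sd)
```

With 17632 distinct names the two variants differ only marginally, but the distinction matters on small groups.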



### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [46]:
counted_names.summary().show()

+-------+--------+------------------+
|summary|    Name|             count|
+-------+--------+------------------+
|  count|   17632|             17632|
|   mean|Infinity|57.644906987295826|
| stddev|    null|122.02996350813885|
|    min|   Aaban|                 1|
|    25%|Infinity|                 2|
|    50%|Infinity|                 8|
|    75%|Infinity|                39|
|    max|  Zyriah|              1112|
+-------+--------+------------------+

