# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [2]:
!pip install -q kaggle

In [4]:
from google.colab import files

files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"mohamadalkiswani","key":"<redacted>"}'}

In [6]:
! mkdir -p ~/.kaggle && cp kaggle.json ~/.kaggle/

In [7]:
! chmod 600 ~/.kaggle/kaggle.json
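
The same credential setup can be done from Python instead of shell commands. This is a minimal sketch, not part of the original notebook: the helper name `install_kaggle_token` and its `home` parameter are hypothetical, and it only assumes that the token belongs in `~/.kaggle/kaggle.json` with owner-only permissions, as the shell cells above do.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home=None):
    """Copy an API token to <home>/.kaggle/kaggle.json, readable by the owner only.

    `home` is a hypothetical override that defaults to the current user's home.
    """
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p ~/.kaggle`
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)            # like `cp kaggle.json ~/.kaggle/`
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)      # like `chmod 600`
    return target
```

Calling `install_kaggle_token("kaggle.json")` after the upload cell would place the token for the current user.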

In [8]:
! kaggle datasets list

ref                                                         title                                           size  lastUpdated          downloadCount  voteCount  usabilityRating  
----------------------------------------------------------  ---------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
sudarshan24byte/online-food-dataset                         Online Food Dataset                              3KB  2024-03-02 18:50:30          24056        478  0.9411765        
bhavikjikadara/student-study-performance                    Student Study Performance                        9KB  2024-03-07 06:14:09          11894        153  1.0              
sukhmandeepsinghbrar/housing-price-dataset                  Housing Price Dataset                          780KB  2024-04-04 19:45:43           1395         26  1.0              
muhammadkashif724/netflix-tv-shows-2021                     Netflix TV Shows 2021                        

In [9]:
!kaggle datasets download -d kaggle/us-baby-names

Downloading us-baby-names.zip to /content
 91% 157M/173M [00:01<00:00, 124MB/s]
100% 173M/173M [00:01<00:00, 117MB/s]


In [11]:
!unzip us-baby-names.zip

Archive:  us-baby-names.zip
  inflating: NationalNames.csv       
  inflating: NationalReadMe.pdf      
  inflating: StateNames.csv          
  inflating: StateReadMe.pdf         
  inflating: database.sqlite         
  inflating: hashes.txt              


In [12]:
!ls

database.sqlite  kaggle.json	    NationalReadMe.pdf	StateNames.csv	 us-baby-names.zip
hashes.txt	 NationalNames.csv  sample_data		StateReadMe.pdf


In [14]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 3.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=d91d9e726d35b97b3b9d3999972ba3bcf567eade43a50a890de33efea4076581
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [15]:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Aggregates (min, max, sum, ...) are called through F so they do not shadow Python builtins.
spark = SparkSession.builder.master("local[*]").getOrCreate()

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv).

In [13]:
!wget https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv

--2024-04-12 11:24:21--  https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35762695 (34M) [text/plain]
Saving to: ‘US_Baby_Names_right.csv’


2024-04-12 11:24:21 (182 MB/s) - ‘US_Baby_Names_right.csv’ saved [35762695/35762695]



### Step 3. Assign it to a variable called baby_names.

In [16]:
baby_names = spark.read.csv("US_Baby_Names_right.csv", sep=',', header=True, inferSchema=True)

### Step 4. See the first 10 entries

In [17]:
baby_names.show(10)

+-----+-----+--------+----+------+-----+-----+
|  _c0|   Id|    Name|Year|Gender|State|Count|
+-----+-----+--------+----+------+-----+-----+
|11349|11350|    Emma|2004|     F|   AK|   62|
|11350|11351| Madison|2004|     F|   AK|   48|
|11351|11352|  Hannah|2004|     F|   AK|   46|
|11352|11353|   Grace|2004|     F|   AK|   44|
|11353|11354|   Emily|2004|     F|   AK|   41|
|11354|11355| Abigail|2004|     F|   AK|   37|
|11355|11356|  Olivia|2004|     F|   AK|   33|
|11356|11357|Isabella|2004|     F|   AK|   30|
|11357|11358|  Alyssa|2004|     F|   AK|   29|
|11358|11359|  Sophia|2004|     F|   AK|   28|
+-----+-----+--------+----+------+-----+-----+
only showing top 10 rows



### Step 5. Delete the columns '_c0' (called 'Unnamed: 0' in pandas) and 'Id'

In [18]:
baby_names = baby_names.drop('_c0', 'Id')

In [19]:
baby_names.show()

+---------+----+------+-----+-----+
|     Name|Year|Gender|State|Count|
+---------+----+------+-----+-----+
|     Emma|2004|     F|   AK|   62|
|  Madison|2004|     F|   AK|   48|
|   Hannah|2004|     F|   AK|   46|
|    Grace|2004|     F|   AK|   44|
|    Emily|2004|     F|   AK|   41|
|  Abigail|2004|     F|   AK|   37|
|   Olivia|2004|     F|   AK|   33|
| Isabella|2004|     F|   AK|   30|
|   Alyssa|2004|     F|   AK|   29|
|   Sophia|2004|     F|   AK|   28|
|   Alexis|2004|     F|   AK|   27|
|Elizabeth|2004|     F|   AK|   27|
|   Hailey|2004|     F|   AK|   27|
|     Anna|2004|     F|   AK|   26|
|  Natalie|2004|     F|   AK|   25|
|    Sarah|2004|     F|   AK|   25|
|   Sydney|2004|     F|   AK|   25|
|      Ava|2004|     F|   AK|   23|
|  Trinity|2004|     F|   AK|   22|
|    Haley|2004|     F|   AK|   21|
+---------+----+------+-----+-----+
only showing top 20 rows



### Step 6. Are there more male or female names in the dataset?

In [20]:
f_count = baby_names.filter(col("Gender") == 'F').count()
m_count = baby_names.filter(col("Gender") == 'M').count()

In [21]:
print(f"Females: {f_count}, Males: {m_count}")

Females: 558846, Males: 457549


### Step 7. Group the dataset by name and assign to names

In [23]:
# Sum only Count per name; baby_names itself keeps its Year column.
names = baby_names.groupBy("Name").sum("Count")
names.show()

+--------+----------+
|    Name|sum(Count)|
+--------+----------+
|   Kiana|      5965|
|  Alayna|     14171|
|   Ember|      3181|
|   Tyler|    129989|
|  Maddox|     20716|
|  Kellen|      6989|
|  Heaven|     12277|
|Julianne|      3465|
| Susanna|      1250|
|  Kenlee|       578|
|    Kloe|      1359|
|   Anyah|       472|
|   Tegan|      2721|
| Jazzlyn|      1173|
|Brileigh|       130|
|Analeigh|       505|
|Kamarion|      1030|
|   Aryan|      3322|
| Galilea|      2641|
|    Faye|      1211|
+--------+----------+
only showing top 20 rows



### Step 8. How many different names exist in the dataset?

In [24]:
names.count()

17632

### Step 9. What is the name with most occurrences?

In [25]:
names.orderBy('sum(Count)', ascending=False).show(1)

+-----+----------+
| Name|sum(Count)|
+-----+----------+
|Jacob|    242874|
+-----+----------+
only showing top 1 row



### Step 10. How many different names have the least occurrences?

In [34]:
min_count = names.select(F.min('sum(Count)')).collect()[0][0]
names.filter(col('sum(Count)')==min_count).count()

2578

### Step 11. What is the median name occurrence?

In [37]:
median_count = names.select(F.median('sum(Count)')).collect()[0][0]
median_count

49.0

In [38]:
names.filter(col('sum(Count)')==median_count).show()

+---------+----------+
|     Name|sum(Count)|
+---------+----------+
|    Baily|        49|
|    Jaice|        49|
| Antonina|        49|
|  Rebecka|        49|
|   Maisha|        49|
|Malillany|        49|
|    Jkwon|        49|
|    Anely|        49|
|Emmanuela|        49|
|     Vita|        49|
|  Zuleima|        49|
|   Alysse|        49|
|  Mariann|        49|
|   Kaelee|        49|
| Marquell|        49|
|   Kyndle|        49|
|  Jeovany|        49|
|   Ridwan|        49|
|     Riot|        49|
|  Deserae|        49|
+---------+----------+
only showing top 20 rows
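
To illustrate why the median sits far below the mean for skewed totals like these, the sketch below applies Python's `statistics` module to the 20 per-name totals printed in Step 7. It is only a small cross-check on that sample, not a recomputation of the full result.

```python
import statistics

# The 20 per-name totals printed by names.show() in Step 7.
sample_totals = [
    5965, 14171, 3181, 129989, 20716, 6989, 12277, 3465, 1250, 578,
    1359, 472, 2721, 1173, 130, 505, 1030, 3322, 2641, 1211,
]

sample_median = statistics.median(sample_totals)  # average of the two middle values
sample_mean = statistics.mean(sample_totals)      # pulled upward by the large Tyler total

print(sample_median, sample_mean)  # 2681.0 10657.25
```

One very popular name is enough to lift the mean well above the median, which matches the pattern in the full dataset.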



### Step 12. What is the standard deviation of name occurrences?

In [39]:
std_dev = names.select(F.stddev('sum(Count)')).collect()[0][0]
std_dev

11006.069467890555
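
`F.stddev` is an alias for `F.stddev_samp`, so the value above is the sample standard deviation, which divides by n - 1 rather than n. The sketch below spells out that formula in plain Python and compares it with the population version. The helper `sample_stddev` and the toy totals are hypothetical and not taken from the dataset.

```python
import math
import statistics


def sample_stddev(values):
    """Sample standard deviation: squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


# Hypothetical toy totals for illustration.
toy_totals = [5, 11, 49, 337]

print(sample_stddev(toy_totals))      # n - 1 denominator, like F.stddev / F.stddev_samp
print(statistics.pstdev(toy_totals))  # n denominator, like F.stddev_pop; smaller here
```

With 17632 names the two versions differ only marginally, but the distinction matters on small samples.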

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [42]:
names.summary().show()

+-------+--------+------------------+
|summary|    Name|        sum(Count)|
+-------+--------+------------------+
|  count|   17632|             17632|
|   mean|Infinity| 2008.932168784029|
| stddev|    NULL|11006.069467890555|
|    min|   Aaban|                 5|
|    25%|Infinity|                11|
|    50%|Infinity|                49|
|    75%|Infinity|               337|
|    max|  Zyriah|            242874|
+-------+--------+------------------+

