
In [12]:
"""
NO NEED TO RUN THIS CELL

Data Analyst Assessment Task

Identify the 10 most popular names given to male and female babies in the United
States in 2007 that have significantly decreased in popularity, falling outside
the top 100 names in 2020. Provide a brief explanation of your methodology for
determining these names.


Methodology:
1. Download Social Security Administration baby names data.
2. Load data for years 2007 and 2020 into pandas DataFrames.
3. Compute rank within each gender for both years by descending count.
4. Select names with rank <= 50 in 2007 and rank > 100 in 2020. No top-10 name
   from 2007 fell outside the top 100 in 2020, so the pool is widened to the
   top 50.
5. Calculate the drop (2020 rank - 2007 rank) for each qualifying name.
6. Sort by largest drop and keep up to 10 names for each gender.
7. Output the resulting lists of male and female names.
"""


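The methodology above (steps 3 to 6) can be sketched on a tiny made-up dataset, so the rank, filter, and drop logic is easy to check by hand. This is only an illustration: the helper names `rank_by_sex` and `biggest_drops`, the toy names and counts, and the small thresholds (`top_old=2`, `cutoff_new=3`) are all assumptions, not part of the assessment data. It also assumes `method="min"` for ties, so tied names share a whole-number rank.

```python
import pandas as pd


def rank_by_sex(df):
    """Return a copy of df with a per-sex rank by descending Count."""
    out = df.copy()
    # method="min" is an assumption: tied counts share the best (lowest) rank
    out["Rank"] = (
        out.groupby("Sex")["Count"].rank(ascending=False, method="min").astype(int)
    )
    return out


def biggest_drops(df_old, df_new, top_old=50, cutoff_new=100, n=10):
    """Names ranked <= top_old in the old year and > cutoff_new in the new year."""
    merged = pd.merge(
        rank_by_sex(df_old)[["Name", "Sex", "Rank"]],
        rank_by_sex(df_new)[["Name", "Sex", "Rank"]],
        on=["Name", "Sex"],  # inner join: a name must appear in both years
        suffixes=("_old", "_new"),
    )
    fell = merged[
        (merged["Rank_old"] <= top_old) & (merged["Rank_new"] > cutoff_new)
    ].copy()
    fell["Drop"] = fell["Rank_new"] - fell["Rank_old"]
    # Largest drops first within each sex, then keep up to n rows per sex
    fell = fell.sort_values(["Sex", "Drop"], ascending=[True, False])
    return fell.groupby("Sex").head(n)


# Toy data (hypothetical names and counts)
old = pd.DataFrame({
    "Name": ["Ann", "Bea", "Cat", "Dee", "Eve", "Al", "Bo", "Cy", "Di", "Ed"],
    "Sex": ["F"] * 5 + ["M"] * 5,
    "Count": [100, 90, 80, 70, 60, 100, 90, 80, 70, 60],
})
new = pd.DataFrame({
    "Name": ["Cat", "Dee", "Eve", "Ann", "Bea", "Al", "Cy", "Di", "Ed", "Bo"],
    "Sex": ["F"] * 5 + ["M"] * 5,
    "Count": [100, 90, 80, 70, 10, 100, 90, 80, 70, 60],
})

result = biggest_drops(old, new, top_old=2, cutoff_new=3)
print(result)
```

With the real data, the same call would use the defaults `top_old=50` and `cutoff_new=100`. A group can return fewer than `n` rows when fewer names qualify.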
In [13]:
!wget -q https://www.ssa.gov/oact/babynames/names.zip
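
Outside Colab, where the `!wget` shell escape and the `/content` path may not exist, the archive could instead be fetched and read in memory. The sketch below is one possible approach using only the standard library plus pandas. The helper names `fetch_zip_bytes` and `read_year` are hypothetical, and whether the server accepts Python's default HTTP client has not been verified here.

```python
import io
import urllib.request
import zipfile

import pandas as pd

NAMES_URL = "https://www.ssa.gov/oact/babynames/names.zip"  # same URL as the wget cell


def fetch_zip_bytes(url=NAMES_URL):
    """Download the archive and return its raw bytes (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def read_year(zip_bytes, year):
    """Load one yobYYYY.txt member into a DataFrame with Name, Sex, Count columns."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(f"yob{year}.txt") as member:
            # The files have no header row, so column names are supplied here
            return pd.read_csv(member, names=["Name", "Sex", "Count"])
```

A typical use would be `zip_bytes = fetch_zip_bytes()` followed by `read_year(zip_bytes, 2007)` and `read_year(zip_bytes, 2020)`, which avoids writing the archive to disk.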

In [14]:
import zipfile

import pandas as pd

# Read from the downloaded file on disk
with zipfile.ZipFile('/content/names.zip') as z:
    # Load 2007 and 2020 data
    data_frame_2007 = pd.read_csv(z.open('yob2007.txt'), names=['Name', 'Sex', 'Count'])
    data_frame_2020 = pd.read_csv(z.open('yob2020.txt'), names=['Name', 'Sex', 'Count'])

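# Add per-sex ranks by descending count. method='min' gives tied names the same
# whole-number rank instead of a fractional average that int() would truncate.
data_frame_2007['Rank'] = data_frame_2007.groupby('Sex')['Count'].rank(ascending=False, method='min')
data_frame_2020['Rank'] = data_frame_2020.groupby('Sex')['Count'].rank(ascending=False, method='min')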

# Inner join on name and sex: only names present in both years are kept
merged = pd.merge(data_frame_2007[['Name', 'Sex', 'Rank']],
                  data_frame_2020[['Name', 'Sex', 'Rank']],
                  on=['Name', 'Sex'],
                  suffixes=('_2007', '_2020'))

# No top-10 name from 2007 fell outside the top 100 in 2020, so widen the pool to the top 50
print("Looking at top 50 names from 2007 that dropped outside top 100 in 2020:")

filtered = merged[(merged['Rank_2007'] <= 50) & (merged['Rank_2020'] > 100)]

# Sort by biggest drops and get top 10 for each gender
male_drops = filtered[filtered['Sex'] == 'M'].copy()
female_drops = filtered[filtered['Sex'] == 'F'].copy()

# Calculate drop amount for sorting
male_drops['Drop'] = male_drops['Rank_2020'] - male_drops['Rank_2007']
female_drops['Drop'] = female_drops['Rank_2020'] - female_drops['Rank_2007']

# Keep up to 10 biggest drops for each gender (fewer rows if fewer names qualify)
top_male_drops = male_drops.nlargest(10, 'Drop')
top_female_drops = female_drops.nlargest(10, 'Drop')

print("\nTop 10 male names that dropped most significantly:")
for _, row in top_male_drops.iterrows():
    print(f"{row['Name']:12s} (2007 rank: {int(row['Rank_2007']):2d}, 2020 rank: {int(row['Rank_2020']):3d})")

print("\nTop 10 female names that dropped most significantly:")
for _, row in top_female_drops.iterrows():
    print(f"{row['Name']:12s} (2007 rank: {int(row['Rank_2007']):2d}, 2020 rank: {int(row['Rank_2020']):3d})")

# Get just the names as requested
male_names = top_male_drops['Name'].tolist()
female_names = top_female_drops['Name'].tolist()

print(f"\nMale names: {male_names}")
print(f"Female names: {female_names}")

Looking at top 50 names from 2007 that dropped outside top 100 in 2020:

Top 10 male names that dropped most significantly:
Brandon      (2007 rank: 31, 2020 rank: 165)
Justin       (2007 rank: 45, 2020 rank: 166)
Kevin        (2007 rank: 39, 2020 rank: 157)
Gavin        (2007 rank: 32, 2020 rank: 142)
Tyler        (2007 rank: 21, 2020 rank: 130)
Zachary      (2007 rank: 42, 2020 rank: 135)
Evan         (2007 rank: 40, 2020 rank: 105)

Top 10 female names that dropped most significantly:
Kaitlyn      (2007 rank: 44, 2020 rank: 488)
Jessica      (2007 rank: 42, 2020 rank: 399)
Destiny      (2007 rank: 41, 2020 rank: 347)
Alexis       (2007 rank: 19, 2020 rank: 309)
Makayla      (2007 rank: 47, 2020 rank: 325)
Jocelyn      (2007 rank: 50, 2020 rank: 269)
Lauren       (2007 rank: 28, 2020 rank: 232)
Sydney       (2007 rank: 37, 2020 rank: 241)
Brooke       (2007 rank: 45, 2020 rank: 236)
Alexa        (2007 rank: 40, 2020 rank: 230)

Male names: ['Brandon', 'Justin', 'Kevin', 'Gavin', 'Tyl