## A) Descriptive analysis
1. Please describe the format of the data files. Can you identify any limitations or distortions of the data?
2. What is the most popular name of all time? (Of either gender.)
3. What is the most gender ambiguous name in 2013? 1945?
4. Of the names represented in the data, find the name that has had the largest percentage increase in popularity since 1980. Largest decrease?
5. Can you identify names that may have had an even larger increase or decrease in popularity?

#### Part 1 - Understanding the file

- Q: Please describe the format of the data files. Can you identify any limitations or distortions of the data?

- Data Format:

    - The data is provided in comma-delimited text files (.TXT), one for each state.

    - Each record contains five fields: State Code, Sex (M/F), Year of Birth, Name, and the number of occurrences.

- Data Limitations & Distortions:

    1. Exclusion of Uncommon Names: The most significant limitation is that names with fewer than 5 occurrences in a given state and year are completely omitted. This distorts the data by over-representing popular names and making it impossible to study the "long tail" of rare or unique names.

    2. Inaccurate National Aggregates: As a direct result of the first point, summing the state data will not produce an accurate national count. The true national total for any name will be higher than what can be calculated from these files.

    3. Source-Related Bias: The data is sourced exclusively from Social Security records. This means individuals born before the widespread adoption of Social Security in the mid-1930s, or those who never received a Social Security Number for other reasons, are not included. This particularly impacts the reliability of the data for the earliest years in the dataset.
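The effect of the suppression rule on national totals can be illustrated with a small sketch. The state counts below are made up for illustration, and the threshold of 5 is taken from the limitation described above; nothing here reads the real files.

```python
# Hypothetical true counts of one name in one year, by state (made-up numbers).
true_counts = {"CA": 12, "TX": 7, "DE": 4, "VT": 2, "WY": 3}

# Counts below this value are omitted from the state files.
SUPPRESSION_THRESHOLD = 5

# What the state files would contain after suppression.
published = {s: n for s, n in true_counts.items() if n >= SUPPRESSION_THRESHOLD}

true_total = sum(true_counts.values())
published_total = sum(published.values())
missing = true_total - published_total

print(f"true total: {true_total}, published total: {published_total}, missing: {missing}")
```

In this toy case three of the five states fall below the threshold, so summing the published rows undercounts the name nationally.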

#### Part 2 - Searching through the data

- Q: What is the most popular name of all time? (Of either gender.)

In [6]:
# Import necessary libraries
import os
import glob             # Finds all the files ending with .txt
import pandas as pd
from dotenv import load_dotenv

In [7]:
# Load variables from the .env file
load_dotenv()

# Get the data path from the environment variables
path = os.getenv("DATA_PATH")

# Check if the path was loaded correctly
if not path:
    raise ValueError("DATA_PATH was not found!")

# Locate the state files (they use an uppercase .TXT extension)
all_files = glob.glob(os.path.join(path, "*.TXT"))

# Create an empty list for all the dataframes
df_list = []

# Provide the column names since the text files do not include a header row
column_names = ['State','Gender','Birth_year','Name','Name_occurrence']

# loop through all the text files and create a dataframe for each
# Add each dataframe to the df_list
for filename in all_files:
    df = pd.read_csv(filename, header=None, names=column_names)
    df_list.append(df)

# Concatenate the dataframes row-wise (axis=0) into a single dataframe
# and display the result
names_df = pd.concat(df_list, axis=0, ignore_index=True)
print("Data was successfully loaded!")
names_df


Data was successfully loaded!


        State Gender  Birth_year     Name  Name_occurrence
0          IN      F        1910     Mary              619
1          IN      F        1910    Helen              324
2          IN      F        1910     Ruth              238
3          IN      F        1910  Dorothy              215
4          IN      F        1910  Mildred              200
...       ...    ...         ...      ...              ...
6311499    DE      M        2021   Thiago                5
6311500    DE      M        2021   Travis                5
6311501    DE      M        2021     Troy                5
6311502    DE      M        2021   Walker                5


In [8]:
# Groupby name and sum the occurrences each time it is found
# Sort the names in descending order to determine max value
total_name_counts = names_df.groupby('Name')['Name_occurrence'].sum()
most_popular_name = total_name_counts.sort_values(ascending=False).index[0]
print(f'The most popular name of all time is: {most_popular_name} with a total count of {total_name_counts.loc[most_popular_name]}')

The most popular name of all time is: James with a total count of 5054074
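As a side note, the sort-then-take-first step can also be written with `idxmax`, which returns the index label of the largest value directly. The sketch below uses a tiny made-up frame with the same column names as `names_df`; it is an alternative phrasing, not a change to the result above.

```python
import pandas as pd

# Made-up rows using the same column names as names_df.
toy = pd.DataFrame({
    "Name": ["James", "Mary", "James", "Mary", "Ruth"],
    "Name_occurrence": [10, 12, 9, 5, 3],
})

# Sum occurrences per name, then ask for the label of the largest total.
totals = toy.groupby("Name")["Name_occurrence"].sum()
top_name = totals.idxmax()
top_count = int(totals.max())

print(f"{top_name}: {top_count}")
```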


#### Part 3 - Boy or Girl
- Q: What is the most gender ambiguous name in 2013? 1945?

In [9]:
# Function to isolate the names by a specific year
def names_by_year(dataframe, column_name, year):
    """
    Filters a DataFrame to select rows for a specific year.

    Args:
        dataframe (pd.DataFrame): The input DataFrame.
        column_name (str): The name of the column containing the year.
        year (int): The year to filter by.

    Returns:
        pd.DataFrame: A new DataFrame containing only the data for the specified year.
    """
    names_in_year = dataframe[dataframe[column_name] == year]
    return names_in_year

# Function to calculate the total for each name in a specific year
def get_name_totals(dataframe_year, name_col, sex_col, count_col):
    """
    Groups a DataFrame by name and sex to calculate the total count for each name.

    Args:
        dataframe_year (pd.DataFrame): The input DataFrame (e.g., names_2013).
        name_col (str): The column name for the names.
        sex_col (str): The column name for the sex/gender.
        count_col (str): The column name for the counts/occurrences.

    Returns:
        pd.DataFrame: A new DataFrame with the summed counts for each name/sex combination.
    """
    # Group by the name and sex columns, then sum the count column for each group.
    name_totals = dataframe_year.groupby([name_col, sex_col])[count_col].sum().reset_index()
    return name_totals

# Function to calculate the ambiguity score of the names in the dataframe
def calculate_ambiguity_score(dataframe_name, female_col='F', male_col='M'):
    """
    Calculates and adds Total, Min_count, and Ambiguity_score columns to a DataFrame.

    Args:
        dataframe_name (pd.DataFrame): The input DataFrame. Must have columns for female and male counts.
        female_col (str): The name of the column for female counts. Defaults to 'F'.
        male_col (str): The name of the column for male counts. Defaults to 'M'.

    Returns:
        pd.DataFrame: The DataFrame with the new columns added.
    """
    # Create a copy to avoid modifying the original DataFrame unexpectedly
    df_copy = dataframe_name.copy()

    # Calculate the total and minimum counts
    df_copy['Total'] = df_copy[female_col] + df_copy[male_col]
    df_copy['Min_count'] = df_copy[[female_col, male_col]].min(axis=1)

    # Calculate the ambiguity score
    df_copy['Ambiguity_score'] = df_copy['Min_count'] / df_copy['Total']

    return df_copy

In [10]:
# Use function to isolate the names by year
names_2013 = names_by_year(names_df, 'Birth_year', 2013)
names_1945 = names_by_year(names_df, 'Birth_year', 1945)

# Use function to determine the total count for each name
names_total_2013 = get_name_totals(names_2013, 'Name','Gender', 'Name_occurrence')
names_total_1945 = get_name_totals(names_1945, 'Name','Gender', 'Name_occurrence')

In [11]:
# Create pivot table to place genders in separate columns
ambiguity_df_2013 = names_total_2013.pivot_table(
    index='Name',
    columns='Gender',
    values='Name_occurrence',
    fill_value=0 # Replace NaN with a 0
)

ambiguity_df_1945 = names_total_1945.pivot_table(
    index='Name',
    columns='Gender',
    values='Name_occurrence',
    fill_value=0 # Replace NaN with a 0 for names that are exclusively M or F
)

In [12]:
# Check if NaN or missing values were replaced with a zero
ambiguity_df_2013.isnull().sum()

Gender
F    0
M    0
dtype: int64

In [13]:
# Use the function to determine the ambiguity score of each name for the desired year
ambiguity_score_1945_df = calculate_ambiguity_score(ambiguity_df_1945, 'F', 'M')
ambiguity_score_2013_df = calculate_ambiguity_score(ambiguity_df_2013, 'F', 'M')

# Sort the rows in descending order and pull the top 10 names
ten_most_ambiguous_names_2013 = ambiguity_score_2013_df.sort_values(by=['Ambiguity_score'], ascending=False).head(10)
ten_most_ambiguous_names_1945 = ambiguity_score_1945_df.sort_values(by=['Ambiguity_score'], ascending=False).head(10)

# Print the names with the highest ambiguity score for 2013 and 1945
print(f'The most ambiguous name in the year 1945 was {ten_most_ambiguous_names_1945.index[0]}!\n')
print(f'The most ambiguous name in the year 2013 was {ten_most_ambiguous_names_2013.index[0]}!')

The most ambiguous name in the year 1945 was Maxie!

The most ambiguous name in the year 2013 was Arlin!
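One caveat with this ranking: the score reaches its maximum of 0.5 whenever a name has equal F and M counts, so several names can tie for first place, and which tied name ends up at `index[0]` may depend on sort order rather than on the data. A minimal sketch of one possible tie-break, sorting by total births as a secondary key, is shown below. The names and counts are made up, and the choice of `Total` as the tie-breaker is an assumption, not part of the original analysis.

```python
import pandas as pd

# Made-up per-name counts, already pivoted into F and M columns.
toy = pd.DataFrame(
    {"F": [5, 600, 40], "M": [5, 600, 900]},
    index=pd.Index(["RareName", "CommonName", "MostlyMale"], name="Name"),
)

toy["Total"] = toy["F"] + toy["M"]
toy["Ambiguity_score"] = toy[["F", "M"]].min(axis=1) / toy["Total"]

# Sort by score first, then by total births, so that among tied scores
# the name backed by more births is ranked first.
ranked = toy.sort_values(by=["Ambiguity_score", "Total"], ascending=[False, False])

print(ranked)
```

Here `RareName` and `CommonName` both score 0.5, and the secondary key places `CommonName` first because it is supported by far more births.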


#### Part 4 - Mr & Ms popular

- Q: Of the names represented in the data, find the name that has had the largest percentage increase in popularity since 1980. Largest decrease?

In [14]:
# Get the total births for each year
total_births_per_year = names_df.groupby(['Birth_year'])['Name_occurrence'].sum().reset_index()
total_births_per_year.rename({'Name_occurrence': 'Total_births'}, axis=1, inplace=True)

# Merge total births back into the main dataframe
names_df = pd.merge(names_df,total_births_per_year,on='Birth_year')

# This lets us compare how often a name appeared against the total number
# of births recorded in the data for that year

# Calculate the popularity of each name
names_df['Popularity'] = (names_df['Name_occurrence'] / names_df['Total_births'])

# Sum the popularity of each name per year, regardless of gender or state
name_popularity = names_df.groupby(['Birth_year','Name'])['Popularity'].sum().reset_index()

# Select the two endpoints to compare: 1980 and the most recent year in the data
start_year = 1980
end_year = name_popularity['Birth_year'].max()
pop_1980 = name_popularity[name_popularity['Birth_year'] == start_year]
pop_end = name_popularity[name_popularity['Birth_year'] == end_year]

In [15]:
# Use an outer merge to keep every name, even if it appears in only one of the two years
popularity_change_df = pd.merge(
    pop_1980[['Name','Popularity']],
    pop_end[['Name','Popularity']],
    on = 'Name',
    how = 'outer',
    suffixes = ('_1980','_end')
)

# Replace NaN with a very small placeholder (1e-9) to avoid division-by-zero errors
# The placeholder is roughly 1/1000 the size of the real values, so filled rows stay clearly separated from real data
popularity_change_df.fillna(1e-9, inplace=True)

In [16]:
# Percentage change: the difference in popularity divided by the 1980 value, times 100
popularity_change_df['Percentage_change'] = (popularity_change_df['Popularity_end'] - popularity_change_df['Popularity_1980']) / popularity_change_df['Popularity_1980'] * 100

# Sort the popularity change in descending order
# Print the 10 largest increases
increase_sorted = popularity_change_df.sort_values(by=['Percentage_change'], ascending=False)
print(f'Largest popularity increase: {start_year} - {end_year}')
print(increase_sorted.head(10))

Largest popularity increase: 1980 - 2021
          Name  Popularity_1980  Popularity_end  Percentage_change
4380    Harper     1.000000e-09        0.003015       3.015064e+08
306      Aiden     1.000000e-09        0.002934       2.933652e+08
2059    Camila     1.000000e-09        0.002842       2.842287e+08
4541    Hudson     1.000000e-09        0.002714       2.713593e+08
5130    Jayden     1.000000e-09        0.002479       2.479313e+08
7138      Luca     1.000000e-09        0.002454       2.454427e+08
7719  Maverick     1.000000e-09        0.002343       2.342798e+08
7284   Madison     1.000000e-09        0.002110       2.109940e+08
5104     Jaxon     1.000000e-09        0.002024       2.023907e+08
4708      Isla     1.000000e-09        0.001960       1.960271e+08
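Every row in this top 10 has `Popularity_1980` equal to the `1e-9` placeholder, so the percentages mostly reflect the fill value rather than a measured 1980 popularity. One way to rank only names that have real values at both endpoints is an inner merge, sketched below on made-up numbers. The frames `pop_start` and `pop_final` are hypothetical stand-ins for `pop_1980` and `pop_end`.

```python
import pandas as pd

# Made-up popularity shares; "Cy" has no 1980 entry.
pop_start = pd.DataFrame({"Name": ["Ann", "Bob"], "Popularity": [0.002, 0.004]})
pop_final = pd.DataFrame({"Name": ["Ann", "Bob", "Cy"], "Popularity": [0.003, 0.001, 0.002]})

# An inner merge keeps only names present in both years, so no placeholder is needed.
both = pd.merge(pop_start, pop_final, on="Name", how="inner", suffixes=("_1980", "_end"))

both["Percentage_change"] = (
    (both["Popularity_end"] - both["Popularity_1980"]) / both["Popularity_1980"] * 100
)

print(both.sort_values("Percentage_change", ascending=False))
```

The trade-off is that names which first appear after 1980 drop out entirely, so this answers a narrower question than the outer-merge version.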


In [17]:
# Determine how many times a name was used in 1980
counts_1980 = names_df[names_df['Birth_year'] == 1980].groupby('Name')['Name_occurrence'].sum().reset_index()
counts_1980

# Left-merge the 1980 counts onto the change table so rare names can be filtered out before ranking the decreases
popularity_change_df = pd.merge(popularity_change_df,counts_1980, on = 'Name', how = 'left')
popularity_change_df.fillna(0, inplace=True)

In [18]:
# Keep only names with at least 1,000 occurrences in 1980 to remove statistical noise from rare names.
# Sort in ascending order so the names with the greatest percentage decrease come first
meaningful_names_df = popularity_change_df[popularity_change_df['Name_occurrence'] >= 1000]
decrease_sorted = meaningful_names_df.sort_values(by='Percentage_change', ascending=True)

In [19]:
print(f'Largest popularity decrease: {start_year} - {end_year}')
print(decrease_sorted.head(10))

Largest popularity decrease: 1980 - 2021
          Name  Popularity_1980  Popularity_end  Percentage_change  \
10828    Tonya         0.000982    1.000000e-09         -99.999898   
1656      Beth         0.000908    1.000000e-09         -99.999890   
6434    Kristi         0.000805    1.000000e-09         -99.999876   
7909   Michele         0.000798    1.000000e-09         -99.999875   
6713    Latoya         0.000793    1.000000e-09         -99.999874   
1765      Brad         0.000756    1.000000e-09         -99.999868   
10565    Tasha         0.000721    1.000000e-09         -99.999861   
6686   Latasha         0.000652    1.000000e-09         -99.999847   
3271     Ebony         0.000502    1.000000e-09         -99.999801   
9222    Rhonda         0.000484    1.000000e-09         -99.999793   

       Name_occurrence  
10828           3075.0  
1656            2844.0  
6434            2521.0  
7909            2499.0  
6713            2482.0  
1765            2368.0  
10565        

#### Part 5

- Q: Can you identify names that may have had an even larger increase or decrease in popularity?

In [36]:
from scipy.stats import linregress
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)


In [38]:
# Define a function to calculate the slope of name popularity over time
def calculate_slope(group):
    """
    Fits a linear regression to the popularity trend of a name over time.

    Parameters:
        group (DataFrame): Subset of data for one name, with columns 'Birth_year' and 'Popularity'.

    Returns:
        float: The slope of the regression line (indicating trend direction and strength).
    """
    return linregress(x=group['Birth_year'], y=group['Popularity'])[0]

# Inform the user that the slope calculation is starting
print('Calculating the slopes for each name...')

# Apply the slope calculation to each name group
slopes = name_popularity.groupby('Name').apply(calculate_slope)

# Convert the Series to a DataFrame and drop names with NaN slopes
slopes_df = slopes.reset_index(name='Slope').dropna()

# Preview: 5 names with steepest **decline** in popularity (most negative slope)
print("\nTop 5 Declining Names:")
print(slopes_df.sort_values(by='Slope', ascending=True).head())

# Preview: 5 names with steepest **increase** in popularity (most positive slope)
print("\nTop 5 Rising Names:")
print(slopes_df.sort_values(by='Slope', ascending=False).head())


Calculating the slopes for each name...





Top 5 Declining Names:
          Name     Slope
20630     Mary -0.000391
14747     John -0.000269
25279   Robert -0.000255
13374    James -0.000233
31023  William -0.000206

Top 5 Rising Names:
          Name     Slope
655      Aiden  0.000173
13900   Jayden  0.000161
10020   Everly  0.000135
16342  Kehlani  0.000115
13861    Jaxon  0.000106
