# Names of Children

#### National Data on the relative frequency of given names in the population of U.S. births where the individual has a Social Security Number (Tabulated based on Social Security records as of March 6, 2022)
#### For each year of birth YYYY after 1879, we created a comma-delimited file called yobYYYY.txt. Each record in the individual annual files has the format "name,sex,number," where name is 2 to 15 characters, sex is M (male) or F (female) and "number" is the number of occurrences of the name. Each file is sorted first on sex and then on number of occurrences in descending order. When there is a tie on the number of occurrences, names are listed in alphabetical order. This sorting makes it easy to determine a name's rank. The first record for each sex has rank 1, the second record for each sex has rank 2, and so forth.
#### To safeguard privacy, we restrict our list of names to those with at least 5 occurrences.
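
Because each file is already sorted by sex and then by descending count, a name's rank is simply its position within its sex group. Below is a minimal sketch of that idea using only the standard library. The five records are copied from the `df.head()` output of the single-file load further down, and `rank_records` is a hypothetical helper name, not part of the dataset's tooling.

```python
import csv
import io
from collections import defaultdict

# First five records of one yearly file, copied from a df.head() output in this notebook.
SAMPLE = """Olivia,F,17728
Emma,F,15433
Charlotte,F,13285
Amelia,F,12952
Ava,F,12759
"""

def rank_records(lines):
    """Yield (name, sex, count, rank), assuming the input is already sorted
    by sex and then by descending count, as the file description states."""
    position = defaultdict(int)  # running position per sex
    for name, sex, count in csv.reader(lines):
        position[sex] += 1
        yield name, sex, int(count), position[sex]

ranked = list(rank_records(io.StringIO(SAMPLE)))
for row in ranked:
    print(row)
```

The counter is kept per sex, so the first male record in a full file would restart at rank 1.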

#### Insights I want to gain from the data
#### Which names are the most commonly used for both sexes?
#### Which names have been phased out over the years?
#### Which names were introduced in more recent years?
#### What is the ratio of male to female births?

In [2]:
# Import modules and libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
from pandasql import sqldf


In [28]:
# Folder path
path = "C:/Users/ogabi/Documents/Data Analysis/Practice datasets/names"
# Change the working directory
os.chdir(path)

# Print the full contents of a text file
def read_text_file(file_path):
    with open(file_path, 'r') as f:
        print(f.read())

In [4]:
# Iterate through all files in the working directory
for file in os.listdir():
    # Check whether the file is in text format
    if file.endswith(".txt"):
        # Build the full path in a platform-independent way
        file_path = os.path.join(path, file)

        # Call the read_text_file function
        read_text_file(file_path)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
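
The rate-limit message appears because the loop prints the full contents of every yearly file. One way to inspect the files without flooding the output is to read only the first few lines of each. The sketch below assumes nothing beyond the standard library. `preview_text_file` and the throwaway demo file are illustrative names, not part of the original notebook.

```python
from itertools import islice
import os
import tempfile

def preview_text_file(file_path, n_lines=3):
    """Return the first n_lines of a text file without reading the rest."""
    with open(file_path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n_lines)]

# Demonstration on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "yob1880.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Mary,F,7065\nAnna,F,2604\nEmma,F,2003\nElizabeth,F,1939\n")
    preview = preview_text_file(demo_path, n_lines=2)

print(preview)
```

In the notebook itself, the same function could be called on each `file_path` inside the loop in place of `read_text_file`.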



In [6]:
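# Load the files into a NumPy array. Only one file ends up loaded: the loop
# never updates file_path, so every iteration re-reads the last path set in an earlier cell.
for file in os.listdir():
    arr = np.loadtxt(file_path, dtype='object', delimiter=',')
print(arr)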

[['Olivia' 'F' '17728']
 ['Emma' 'F' '15433']
 ['Charlotte' 'F' '13285']
 ...
 ['Zyian' 'M' '5']
 ['Zylar' 'M' '5']
 ['Zyn' 'M' '5']]
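
Only one file shows up here because `file_path` is not rebuilt inside the loop. A hedged sketch of how all files could be stacked into one array follows, assuming NumPy is installed. `load_all_txt` is a hypothetical helper, and the two small files are written to a temporary folder purely for demonstration, with rows copied from outputs in this notebook.

```python
import os
import tempfile
import numpy as np

def load_all_txt(folder):
    """Stack every .txt file in folder into one 2-D array of strings."""
    arrays = []
    for file in sorted(os.listdir(folder)):
        if file.endswith(".txt"):
            file_path = os.path.join(folder, file)  # rebuilt on every iteration
            # ndmin=2 keeps single-record files two-dimensional
            arrays.append(np.loadtxt(file_path, dtype=str, delimiter=",", ndmin=2))
    return np.vstack(arrays)

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "yob1880.txt"), "w") as f:
        f.write("Mary,F,7065\nAnna,F,2604\n")
    with open(os.path.join(tmp, "yob2021.txt"), "w") as f:
        f.write("Olivia,F,17728\n")
    arr_all = load_all_txt(tmp)

print(arr_all.shape)
```

The stacked array still carries no year information, which is one reason the pandas approach later in the notebook is more convenient.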


In [11]:
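# Read the text files into a pandas DataFrame. Same problem as with NumPy:
# file_path is never updated inside the loop, so only one file is read.
# "Year" is listed in names but each record has only three fields, so that column stays empty.
for file in os.listdir():
    if file.endswith(".txt"):
        df = pd.read_csv(file_path, sep=',', header=None, names=["Name", "Sex", "Frequency", "Year"])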

In [12]:
df.head()

,Name,Sex,Frequency,Year
0,Olivia,F,17728,
1,Emma,F,15433,
2,Charlotte,F,13285,
3,Amelia,F,12952,
4,Ava,F,12759,


In [13]:
df.shape

(31537, 4)

In [14]:
df.tail()

,Name,Sex,Frequency,Year
31532,Zyeire,M,5,
31533,Zyel,M,5,
31534,Zyian,M,5,
31535,Zylar,M,5,
31536,Zyn,M,5,


In [30]:
#Trying a different approach
# List to store dataframes for each year
df_list = []

# Iterate through all files
for file in os.listdir():
    
    # Check whether file is in text format or not
    if file.endswith(".txt"):
        
        # Extract the year from the file name
        year = file[3:7]
        file_path = os.path.join(path, file)
        #print(file_path)
        
        # Read the text file into a DataFrame
        df_year = pd.read_csv(file_path, header=None, names=["Name", "Sex", "Frequency"])
        
        # Add a "Year" column with the current year
        df_year["Year"] = year
        
        # Append the DataFrame for the current year to the list
        df_list.append(df_year)
        
# Concatenate all dataframes into a single dataframe
df = pd.concat(df_list, ignore_index=True)

# Print the last 5 rows of the DataFrame
df.tail()

#success!

,Name,Sex,Frequency,Year
2052776,Zyeire,M,5,2021
2052777,Zyel,M,5,2021
2052778,Zyian,M,5,2021
2052779,Zylar,M,5,2021
2052780,Zyn,M,5,2021
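
The slice `file[3:7]` works for names of the form `yobYYYY.txt`, but it stores the year as a string (hence the `object` dtype that `df.info()` reports for `Year`) and would silently accept any other `.txt` file in the folder. A small sketch of a stricter alternative is shown below. `year_from_filename` and the non-matching file names are hypothetical and used only for illustration.

```python
import re

# Matches exactly the documented naming scheme: yobYYYY.txt
YOB_PATTERN = re.compile(r"^yob(\d{4})\.txt$")

def year_from_filename(filename):
    """Return the birth year encoded in 'yobYYYY.txt' as an int, or None."""
    match = YOB_PATTERN.match(filename)
    return int(match.group(1)) if match else None

# The last two names are made-up examples of files that should be skipped.
names = ["yob1880.txt", "yob2021.txt", "readme.pdf", "notes.txt"]
years = [year_from_filename(n) for n in names]
print(years)
```

Inside the loading loop, a file could be skipped whenever the helper returns `None`, and the integer result assigned to the `Year` column so that later numeric comparisons on years behave as expected.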


In [31]:
df.head()

,Name,Sex,Frequency,Year
0,Mary,F,7065,1880
1,Anna,F,2604,1880
2,Emma,F,2003,1880
3,Elizabeth,F,1939,1880
4,Minnie,F,1746,1880


In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2052781 entries, 0 to 2052780
Data columns (total 4 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   Name       object
 1   Sex        object
 2   Frequency  int64 
 3   Year       object
dtypes: int64(1), object(3)
memory usage: 62.6+ MB


In [32]:
df.describe()

,Frequency
count,2052781.0
mean,176.2917
std,1492.565
min,5.0
25%,7.0
50%,12.0
75%,32.0
max,99693.0


In [33]:
df.shape

(2052781, 4)

In [37]:
df.duplicated().sum()

0

In [49]:
df['Name'].unique()

array(['Mary', 'Anna', 'Emma', ..., 'Zeland', 'Zemariam', 'Zhayd'],
      dtype=object)

In [69]:
# Most frequently used name overall
# Sum the frequency of occurrence of each name across all years
df1 = df.groupby(['Name'])[['Frequency']].sum()
df1.head(10)

Name,Frequency
Aaban,120
Aabha,51
Aabid,16
Aabidah,5
Aabir,10
Aabriella,51
Aada,13
Aadam,320
Aadan,130
Aadarsh,233


In [48]:
df1.tail(10)

Name,Frequency
Zytaveon,17
Zytavion,5
Zytavious,43
Zyus,11
Zyva,38
Zyvion,5
Zyvon,7
Zyyanna,6
Zyyon,6
Zzyzx,10


In [67]:
df1.columns

Index(['Frequency'], dtype='object')

In [65]:
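# Name with the highest occurrence
# hf = df1['Frequency'].max()
# print(hf)
# The next line raises the KeyError below: after the groupby, 'Name' is the index of df1, not a column
# df1[df1['Frequency'] == 5226309][['Name']]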

KeyError: "None of [Index(['Name'], dtype='object')] are in the [columns]"
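
The `KeyError` arises because `groupby(['Name'])` moves the names into the index of `df1`, so `'Name'` is no longer a column that can be selected. Two possible ways around this are sketched below, assuming pandas is installed. The five rows are copied from the `head()` outputs above and cover only a sliver of the data, so the result illustrates the mechanics rather than the real overall answer.

```python
import pandas as pd

# Five rows copied from the head() outputs above; far too few to answer the real question.
sample = pd.DataFrame(
    {
        "Name": ["Mary", "Anna", "Emma", "Olivia", "Emma"],
        "Sex": ["F", "F", "F", "F", "F"],
        "Frequency": [7065, 2604, 2003, 17728, 15433],
        "Year": ["1880", "1880", "1880", "2021", "2021"],
    }
)

totals = sample.groupby("Name")[["Frequency"]].sum()

# Option 1: the names are the index, so idxmax returns the top name directly.
top_name = totals["Frequency"].idxmax()

# Option 2: turn the index back into a column, then filter as originally intended.
flat = totals.reset_index()
top_rows = flat[flat["Frequency"] == flat["Frequency"].max()][["Name"]]

print(top_name)
print(top_rows)
```

Comparing against `flat["Frequency"].max()` rather than a hard-coded number keeps the filter valid if the underlying data changes.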