Argentina's Baby Name Analysis
This project will help you practice your data wrangling and data analysis skills by inspecting the names given to Argentina's newborns from 1922 to 2015.
For this project, you'll have to use some groupby() skills as well as filtering, sorting and visualizations.
The dataset was extracted from the official Argentina opendata website: https://www.datos.gob.ar/dataset/otros-nombres-personas-fisicas
Let's get started!

### Imports + Reading data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
The CSV we're using is rather large, so for this lab we have stored it compressed as a zip archive. By default, pd.read_csv infers the compression method from the file extension and reads the data directly from the compressed file.

df = pd.read_csv('data/historic-names-argentina.zip')
df.head()
name	quantity	year
0	Maria	314	1922
1	Rosa	203	1922
2	Jose	163	1922
3	Maria Luisa	127	1922
4	Carmen	117	1922
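As a minimal sketch of the compression inference described above, the snippet below writes a tiny zip archive and reads it back. The file name toy-names.zip and its two rows are made up for illustration; the lab itself reads data/historic-names-argentina.zip.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical toy data, not the lab's real dataset.
csv_text = "name,quantity,year\nMaria,314,1922\nRosa,203,1922\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "toy-names.zip")
    # The archive must contain exactly one file for read_csv to accept it.
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("toy-names.csv", csv_text)

    # compression defaults to "infer", so the .zip extension is enough.
    toy_df = pd.read_csv(zip_path)

print(toy_df)
```

Passing compression='zip' explicitly should behave the same way and can be useful when the file extension is missing or misleading.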

### Data Cleaning
1. How many rows contain null values in the name column?
The first step is to identify null values in our dataframe. Let's start with the name column: count its null values. How many np.nan values are there?

df['name'].isnull().sum()
13

2. Drop any rows with null values, do it inplace
Let's clean the dataframe now by removing any rows that have null values (in any columns). Perform the cleaning task in-place, that is, modifying the original df variable.

df.dropna(inplace=True,how='any',axis=0)

df.head()
name	quantity	year
0	Maria	314	1922
1	Rosa	203	1922
2	Jose	163	1922
3	Maria Luisa	127	1922
4	Carmen	117	1922
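The two cleaning steps can be tried on a small frame first. This is a sketch on made-up rows (toy_df is a hypothetical stand-in for df), showing the null count and the in-place drop.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing name.
toy_df = pd.DataFrame({
    "name": ["Maria", np.nan, "Jose"],
    "quantity": [314, 5, 163],
    "year": [1922, 1922, 1922],
})

# Count the missing names before cleaning.
null_names = toy_df["name"].isnull().sum()

# Drop every row that has a null in any column, modifying toy_df itself.
toy_df.dropna(how="any", axis=0, inplace=True)

print(null_names)
print(toy_df)
```

dropna keeps the original index labels rather than renumbering the rows. That is why a later output in this notebook shows a last index label of 9761608 alongside a length of 9761596: the 13 dropped rows leave gaps in the index.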

### Data Analysis
Now that our data is clean, it's time to start doing some analysis!
3. What's the most popular name from 1953?

df[df.year==1953].sort_values(by='quantity',ascending=False).values[0]
array(['Juan Carlos', 7357, 1953], dtype=object)

4. What's the most popular name from 1992?

df[df.year==1992].sort_values(by='quantity',ascending=False).values[0]
array(['Maria Belen', 6248, 1992], dtype=object)

5. What's the least popular name from 1978?

df[df.year==1978].sort_values(by='quantity').values[0]
array(['Valeria Karina Elizabeth', 1, 1978], dtype=object)

6. What's the least popular name from 2007?

df_2007 = df[df.year==2007]
df_2007[df_2007['quantity']==df_2007['quantity'].min()].sort_index(ascending=False).iloc[0]['name']
'Walter Alxander'
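Sorting the whole year just to read the first row works, but the same answers can be reached with idxmax() and idxmin(). The sketch below uses made-up rows in a hypothetical toy_df; it is an alternative, not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical toy data for two years.
toy_df = pd.DataFrame({
    "name": ["Juan Carlos", "Maria", "Ana", "Luis"],
    "quantity": [50, 40, 1, 60],
    "year": [1953, 1953, 1953, 1954],
})

# Filter to one year, then look up the row label of the extreme quantity.
year_df = toy_df[toy_df["year"] == 1953]
most_popular = year_df.loc[year_df["quantity"].idxmax(), "name"]
least_popular = year_df.loc[year_df["quantity"].idxmin(), "name"]

print(most_popular, least_popular)
```

When several names tie, idxmax() and idxmin() return the first matching row. Many names appear only once in a given year, so the "least popular" name depends on which tied row is picked, as the two different approaches in questions 5 and 6 illustrate.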

7. How many people were born in the year 1950?

df_1950 = df[df.year==1950]
df_1950.quantity.sum()
505873

8. How many people were born in the year 1980?

df[df.year==1980].quantity.sum()
961605

9. What's the Growth Rate of newborns from 1930 to 1990?

df_1930 = df[df.year==1930].quantity.sum()
df_1990 = df[df.year==1990].quantity.sum()
(df_1990-df_1930)/df_1930
9.071349113956797
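The growth rate used here is the relative change, (final - initial) / initial. A small sketch with made-up totals (growth_rate is a hypothetical helper, not part of the lab):

```python
def growth_rate(initial, final):
    """Relative change from initial to final, as a fraction of initial."""
    return (final - initial) / initial


# Hypothetical totals: 200 newborns in the first year, 500 in the last.
rate = growth_rate(200, 500)
print(rate)  # 1.5, i.e. a 150% increase
```

Read this way, the result of about 9.07 above means the 1990 total is roughly ten times the 1930 total in this dataset.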

10. What's the year with the most babies born?

df.groupby('year')['quantity'].sum().idxmax()
1993

11. What's the year with the least babies born?

df.groupby('year')['quantity'].sum().idxmin()
1922
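Both questions reduce to one grouped sum per year followed by a lookup of the extreme label. A sketch on made-up rows (toy_df is hypothetical):

```python
import pandas as pd

# Hypothetical toy data covering three years.
toy_df = pd.DataFrame({
    "name": ["Maria", "Jose", "Maria", "Jose", "Ana"],
    "quantity": [10, 5, 20, 15, 1],
    "year": [1922, 1922, 1923, 1923, 1924],
})

# Total newborns per year; the index of the result is the year.
births_per_year = toy_df.groupby("year")["quantity"].sum()

busiest_year = births_per_year.idxmax()   # year label with the largest total
quietest_year = births_per_year.idxmin()  # year label with the smallest total

print(births_per_year)
print(busiest_year, quietest_year)
```

Because idxmax() and idxmin() scan the values directly, sorting the series beforehand is not needed.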

12. Plot the number of babies born per year
Create a plot showing the total babies born per year.

yrange = list(range(0,1400001,200000))
yrange2 = ['{:,}'.format(i) for i in yrange ]

yrange2
['0',
 '200,000',
 '400,000',
 '600,000',
 '800,000',
 '1,000,000',
 '1,200,000',
 '1,400,000']

fig, ax = plt.subplots(figsize=(14, 7))
# your code...
grouped = df.groupby('year')['quantity'].agg('sum').reset_index()
ax.plot(grouped.year,grouped.quantity)
plt.title('Number of babies born per year')
# set the tick positions first, then label them, to avoid the FixedFormatter warning
ax.set_yticks(yrange)
ax.set_yticklabels(yrange2)
plt.xlabel('year')
Text(0.5, 0, 'year')
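An alternative to building the tick labels by hand is to attach a formatter to the y axis, so that whatever ticks matplotlib chooses get thousands separators. This is a sketch on made-up totals, not the lab's expected solution; the Agg backend line is only there so the snippet runs without a display and can be dropped inside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; remove this line in a notebook

import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# Hypothetical yearly totals.
years = [1922, 1923, 1924]
totals = [200000, 400000, 1200000]

fig, ax = plt.subplots(figsize=(14, 7))
ax.plot(years, totals)
ax.set_title("Number of babies born per year")
ax.set_xlabel("year")

# Format every y tick with thousands separators, e.g. 1200000 -> 1,200,000.
ax.yaxis.set_major_formatter(FuncFormatter(lambda value, pos: f"{int(value):,}"))

# Call the formatter directly to see the label it would produce.
label = ax.yaxis.get_major_formatter()(1200000, 0)
print(label)

plt.close(fig)
```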

13. Create a dataframe representing the "uniqueness" of names

unique_names_df = (
    df.groupby('year')[['quantity', 'name']]
    .agg({'quantity': 'sum', 'name': 'size'})
    .rename(columns={'quantity': 'Total Newborns', 'name': 'Total Unique Names'})
    .sort_values(by='year')
)
unique_names_df['Uniqueness']= unique_names_df['Total Unique Names']/unique_names_df['Total Newborns']
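To make the "uniqueness" ratio concrete, here is a sketch on made-up rows. It uses named aggregation instead of a dict plus rename; the snake_case column names and toy_df are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: 2 names in 1922, 4 names in 1923.
toy_df = pd.DataFrame({
    "name": ["Maria", "Jose", "Maria", "Jose", "Ana", "Luz"],
    "quantity": [6, 4, 20, 12, 4, 4],
    "year": [1922, 1922, 1923, 1923, 1923, 1923],
})

toy_unique = toy_df.groupby("year").agg(
    total_newborns=("quantity", "sum"),     # babies registered that year
    total_unique_names=("name", "size"),    # rows, i.e. distinct names, that year
)

# Uniqueness = distinct names per newborn; higher means more variety.
toy_unique["uniqueness"] = (
    toy_unique["total_unique_names"] / toy_unique["total_newborns"]
)

print(toy_unique)
```

Counting rows with size only equals the number of distinct names if each name appears once per year, which the lab's dataset appears to guarantee.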

14. What's the year with the most "variation" of names?

unique_names_df.sort_values(ascending=False, by='Uniqueness').iloc[0]
Total Newborns        23667.000000
Total Unique Names    10333.000000
Uniqueness                0.436599
Name: 1922, dtype: float64

15. What's the year with the lowest "variation" of names?

unique_names_df.sort_values(ascending=False, by='Uniqueness').iloc[-1]
Total Newborns        961605.000000
Total Unique Names     71522.000000
Uniqueness                 0.074378
Name: 1980, dtype: float64

16. Create a visualization of the "uniqueness" of names across the years
Following the concepts from the previous activities, evaluate the "uniqueness" of names across the years and plot it.

fig, ax = plt.subplots(figsize=(14, 7))
# your code...
ax.plot(unique_names_df.index,unique_names_df.Uniqueness, label='Uniqueness of names')
plt.xlabel('year')
plt.title("Baby name uniqueness across the years")
plt.legend()
<matplotlib.legend.Legend at 0x7f31680afb90>

17. How many babies were named "carlos"?
Warning! All of the following count as a valid "Carlos", so match on the substring and be mindful of casing: Juan Carlos, Carlos, Giancarlos.

df.name.apply(lambda a: 1 if 'carlos' in a.lower() else 0)
0          0
1          0
2          0
3          0
4          0
          ..
9761604    0
9761605    0
9761606    0
9761607    0
9761608    0
Name: name, Length: 9761596, dtype: int64

df.loc[df.name.str.lower().str.contains('carlos'),'quantity'].sum()
1339111
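The casing warning can be checked on a few made-up rows. The sketch below uses str.contains with case=False, which gives the same mask as lowercasing first; toy_df is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with three "Carlos" variants and one other name.
toy_df = pd.DataFrame({
    "name": ["Juan Carlos", "Carlos", "Giancarlos", "Maria"],
    "quantity": [5, 3, 1, 10],
})

# Case-insensitive substring match: True wherever "carlos" appears in the name.
is_carlos = toy_df["name"].str.contains("carlos", case=False)

# Sum the quantities of the matching rows only.
total_carlos = toy_df.loc[is_carlos, "quantity"].sum()

print(is_carlos.tolist())
print(total_carlos)
```

A case-sensitive search for 'Carlos' would miss Giancarlos here, since its "carlos" is lowercase.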

18. What is the most popular "Carlos" name?

df.loc[df['name'].str.contains('carlos',case=False),['name','quantity']].sort_values(by='quantity',ascending=False).iloc[0]['name']
'Juan Carlos'

19. The "Diego" phenomenon

diegos = df.loc[df['name'].str.contains('Diego'),:]
diegos_per_year_s = diegos.groupby('year')['quantity'].agg('sum')

20. What's the year with the most "Diegos" born?

diegos_per_year_s.idxmax()
1979

21. Create a visualization of "Diegos" born between 1970 and 2015
Both limits are inclusive.

fig, ax = plt.subplots(figsize=(14, 7))

# ... your code ...
diegos_per_year_s[
    (diegos_per_year_s.index >= 1970) &
    (diegos_per_year_s.index <= 2015)
].plot(ax=ax, title="Total 'Diegos' born per year [1970-2015]")
plt.xlabel('year')
Text(0.5, 0, 'year')

22. Extract the most popular names per year

# this is here just to show the format of the expected dataframe
# we must use a screenshot for the final project

df['max_year'] = df.groupby('year')['quantity'].transform('max')
most_popular_per_year_df = df.loc[df['quantity'] == df['max_year'], ['year', 'name', 'quantity']].sort_values(by='year')
most_popular_per_year_df.head(5)
year	name	quantity
0	1922	Maria	314
10333	1923	Maria	351
23110	1924	Maria	416
39091	1925	Maria	496
57887	1926	Maria	575

23. Which was the most popular name among the most popular names?
Which name got the "most popular name of the year" the most times?

most_popular_per_year_df['name'].value_counts().head()
name
Juan Carlos    38
Maria          11
Valentina       8
Maria Belen     8
Benjamin        7
Name: count, dtype: int64