# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [1]:
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

In [2]:
DATA_URI = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv'

### Step 3. Assign it to a variable called baby_names.

In [3]:
baby_names = pd.read_csv(DATA_URI)

### Step 4. See the first 10 entries

In [4]:
baby_names.head()  # head() returns 5 rows by default; use head(10) to see the first 10

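| | Unnamed: 0 | Id | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|---|---|
| 0 | 11349 | 11350 | Emma | 2004 | F | AK | 62 |
| 1 | 11350 | 11351 | Madison | 2004 | F | AK | 48 |
| 2 | 11351 | 11352 | Hannah | 2004 | F | AK | 46 |
| 3 | 11352 | 11353 | Grace | 2004 | F | AK | 44 |
| 4 | 11353 | 11354 | Emily | 2004 | F | AK | 41 |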


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [10]:
columns = ['Unnamed: 0', 'Id']

In [12]:
baby_names.drop(columns, axis=1, inplace=True)
baby_names

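| | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|
| 0 | Emma | 2004 | F | AK | 62 |
| 1 | Madison | 2004 | F | AK | 48 |
| 2 | Hannah | 2004 | F | AK | 46 |
| 3 | Grace | 2004 | F | AK | 44 |
| 4 | Emily | 2004 | F | AK | 41 |
| ... | ... | ... | ... | ... | ... |
| 1016390 | Seth | 2014 | M | WY | 5 |
| 1016391 | Spencer | 2014 | M | WY | 5 |
| 1016392 | Tyce | 2014 | M | WY | 5 |
| 1016393 | Victor | 2014 | M | WY | 5 |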
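
A column named `Unnamed: 0` usually appears when the first header cell of a CSV is empty, for example when a DataFrame index was written out with the data. Whether that is how this particular file was produced is an assumption. The sketch below reproduces the effect on a small, made-up CSV string (`csv_text` and `toy` are hypothetical names) and drops the columns without `inplace=True`.

```python
import io

import pandas as pd

# Hypothetical toy CSV: the first header cell is empty, as happens when a
# DataFrame's index is saved alongside the data.
csv_text = ",Id,Name,Count\n0,1,Emma,62\n1,2,Madison,48\n"

toy = pd.read_csv(io.StringIO(csv_text))
print(list(toy.columns))  # pandas labels the header-less column 'Unnamed: 0'

# drop(columns=...) returns a new DataFrame and leaves `toy` untouched.
cleaned = toy.drop(columns=["Unnamed: 0", "Id"])
print(list(cleaned.columns))
```

Assigning the result of `drop` to a new name keeps the original frame available, which makes it easier to re-run a cell without hitting a "column not found" error.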


### Step 6. Are there more male or female names in the dataset?

In [59]:
baby_names.groupby('Gender').count()

| Gender | Name | Year | State | Count |
|---|---|---|---|---|
| F | 558846 | 558846 | 558846 | 558846 |
| M | 457549 | 457549 | 457549 | 457549 |


In [60]:
baby_count = baby_names.groupby('Gender').count()
baby_count['Count']

Gender
F    558846
M    457549
Name: Count, dtype: int64

In [88]:
# idxmax() returns the index label (here, the gender) with the largest row count
top_gender = baby_count['Count'].idxmax()
print(f"There are more {top_gender} records, with a total count of {baby_count['Count'].max()}")

There are more F records, with a total count of 558846
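
The same comparison can be written without a grouped frame. The following is a minimal sketch on made-up data (`toy` and its values are hypothetical), using `value_counts()` to count records per gender and `idxmax()` to pick the larger group.

```python
import pandas as pd

# Hypothetical toy data: three female records and two male records.
toy = pd.DataFrame({
    "Name": ["Emma", "Ava", "Mia", "Liam", "Noah"],
    "Gender": ["F", "F", "F", "M", "M"],
})

# value_counts() tallies how many rows carry each gender label.
records_per_gender = toy["Gender"].value_counts()

# idxmax() returns the label with the highest tally.
top_gender = records_per_gender.idxmax()
print(f"There are more {top_gender} records: {records_per_gender.max()}")
```

This avoids positional indexing into a Series whose index holds string labels, which is fragile and depends on the pandas version.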


### Step 7. Group the dataset by name and assign to names

In [87]:
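# count() tallies the rows (records) per name, so the 'Count' column below
# holds the number of records for each name, not the number of babies
names = baby_names.groupby('Name').count()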
names

| Name | Year | Gender | State | Count |
|---|---|---|---|---|
| Aaban | 2 | 2 | 2 | 2 |
| Aadan | 4 | 4 | 4 | 4 |
| Aadarsh | 1 | 1 | 1 | 1 |
| Aaden | 196 | 196 | 196 | 196 |
| Aadhav | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... |
| Zyra | 7 | 7 | 7 | 7 |
| Zyrah | 2 | 2 | 2 | 2 |
| Zyren | 1 | 1 | 1 | 1 |
| Zyria | 10 | 10 | 10 | 10 |


### Step 8. How many different names exist in the dataset?

In [34]:
names.index

Index(['Aaban', 'Aadan', 'Aadarsh', 'Aaden', 'Aadhav', 'Aadhya', 'Aadi',
       'Aadin', 'Aadit', 'Aaditya',
       ...
       'Zymire', 'Zyon', 'Zyonna', 'Zyquan', 'Zyquavious', 'Zyra', 'Zyrah',
       'Zyren', 'Zyria', 'Zyriah'],
      dtype='object', name='Name', length=17632)

In [93]:
names['Count']

Name
Aaban        2
Aadan        4
Aadarsh      1
Aaden      196
Aadhav       1
          ... 
Zyra         7
Zyrah        2
Zyren        1
Zyria       10
Zyriah       9
Name: Count, Length: 17632, dtype: int64
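
Because `groupby('Name').count()` counts rows, the values above are the number of records per name rather than the number of babies. The sketch below, on made-up data (`toy` and its values are hypothetical), contrasts `count()` with `sum()` on the `Count` column and shows `nunique()` as a direct way to count distinct names.

```python
import pandas as pd

# Hypothetical toy data: 'Riley' appears in three records with small counts,
# 'Jacob' appears in one record with a large count.
toy = pd.DataFrame({
    "Name": ["Riley", "Riley", "Riley", "Jacob"],
    "State": ["AK", "AL", "AZ", "CA"],
    "Count": [5, 6, 7, 200],
})

grouped = toy.groupby("Name")["Count"]

records_per_name = grouped.count()  # number of rows per name
babies_per_name = grouped.sum()     # total of the Count column per name

# Number of distinct names, without grouping at all.
distinct_names = toy["Name"].nunique()

print(records_per_name)
print(babies_per_name)
print(distinct_names)
```

Which aggregation is appropriate depends on how "occurrences" is read: `count()` favors names that show up in many state, year, and gender combinations, while `sum()` favors names given to the most babies.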

### Step 9. What is the name with most occurrences?

In [94]:
names['Count'].sort_values(ascending=False)

Name
Riley      1112
Avery      1080
Jordan     1073
Peyton     1064
Hayden     1049
           ... 
Ethon         1
Malajah       1
Euan          1
Azyria        1
Nissen        1
Name: Count, Length: 17632, dtype: int64

In [95]:
names['Count'].max()

1112

In [106]:
names['Count'].sort_values(ascending=False).head(10)

Name
Riley     1112
Avery     1080
Jordan    1073
Peyton    1064
Hayden    1049
Taylor    1033
Jayden    1031
Alexis     984
Payton     971
Angel      962
Name: Count, dtype: int64
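
Sorting the whole Series is not required to find the top entries. As a minimal sketch, the example below builds a small Series from four of the values shown above (the variable name `counts` is hypothetical) and uses `idxmax()` for the top name and `nlargest()` for the top few.

```python
import pandas as pd

# Four of the record counts from the output above, deliberately unsorted.
counts = pd.Series(
    {"Jordan": 1073, "Riley": 1112, "Peyton": 1064, "Avery": 1080},
    name="Count",
)

# idxmax() returns the label of the largest value; max() returns the value.
top_name = counts.idxmax()
top_value = counts.max()

# nlargest(n) returns the n largest values in descending order.
top3 = counts.nlargest(3)

print(top_name, top_value)
print(top3)
```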

### Step 10. How many different names have the least occurrences?

In [103]:
# Compare against the actual minimum instead of hard-coding 1
least_occured_names = names[names['Count'] == names['Count'].min()]
least_occured_names

| Name | Year | Gender | State | Count |
|---|---|---|---|---|
| Aadarsh | 1 | 1 | 1 | 1 |
| Aadhav | 1 | 1 | 1 | 1 |
| Aadin | 1 | 1 | 1 | 1 |
| Aahna | 1 | 1 | 1 | 1 |
| Aaima | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... |
| Zykeriah | 1 | 1 | 1 | 1 |
| Zykierra | 1 | 1 | 1 | 1 |
| Zymari | 1 | 1 | 1 | 1 |
| Zyquavious | 1 | 1 | 1 | 1 |


In [117]:
len(least_occured_names)

3682

In [118]:
least_occured_names.index

Index(['Aadarsh', 'Aadhav', 'Aadin', 'Aahna', 'Aaima', 'Aalaya', 'Aaminah',
       'Aaniya', 'Aaria', 'Aariana',
       ...
       'Zunaira', 'Zyairah', 'Zyeria', 'Zyien', 'Zyire', 'Zykeriah',
       'Zykierra', 'Zymari', 'Zyquavious', 'Zyren'],
      dtype='object', name='Name', length=3682)
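
If only the number of least-frequent names is needed, a boolean mask can be summed directly. This is a small sketch on a hand-picked subset of the values shown earlier (the variable names are hypothetical).

```python
import pandas as pd

# A few record counts taken from the outputs above.
counts = pd.Series(
    {"Aadarsh": 1, "Aaden": 196, "Aadhav": 1, "Zyria": 10},
    name="Count",
)

least = counts.min()

# True counts as 1 when summed, so this is the number of names at the minimum.
mask = counts == least
n_least = int(mask.sum())

# The matching names themselves, in index order.
least_names = counts.index[mask].tolist()

print(least, n_least, least_names)
```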

### Step 11. What is the median name occurrence?

In [119]:
name_occurance = names['Count'].sort_values()
name_occurance.median()

8.0

### Step 12. What is the standard deviation of name occurrences?

In [113]:
name_occurance.std()

122.02996350814125

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [114]:
name_occurance.describe()

count    17632.000000
mean        57.644907
std        122.029964
min          1.000000
25%          2.000000
50%          8.000000
75%         39.000000
max       1112.000000
Name: Count, dtype: float64
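
Note that `Series.std()` and the `std` row of `describe()` use the sample standard deviation (`ddof=1`) by default, and that the `50%` row is the median. The sketch below checks this on a small made-up Series (`s` is a hypothetical name) whose population standard deviation is exactly 2.

```python
import math

import pandas as pd

# Hypothetical toy values with mean 5 and population standard deviation 2.
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

summary = s.describe()

sample_std = s.std()            # ddof=1: divides by n - 1
population_std = s.std(ddof=0)  # ddof=0: divides by n

print(summary)
print(sample_std, population_std)

# The 50% row of describe() equals the median,
# and its std row equals the default (sample) std.
print(math.isclose(summary["50%"], s.median()))
print(math.isclose(summary["std"], sample_std))
```

For a Series as long as the one in this exercise (17632 names), the difference between the two conventions is negligible, but it is visible on small samples like this one.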