<a href="https://colab.research.google.com/github/Akshay069/Age-dataset-analysis/blob/main/Age_Data_Analysis_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Age dataset: life, work, and death of 1.22M people**

**About Dataset**
The dataset contains structured information on the life, work, and death of more than 1 million deceased famous people.

**Paper abstract (ICWSM proceedings)**\
We developed a five-step method and inferred birth and death years, binary gender, and occupation from community-submitted data to all language versions of the Wikipedia project. The dataset is the largest on notable deceased people and includes individuals from a variety of social groups, including but not limited to 107k females, 124 non-binary people, and 90k researchers, who are spread across more than 300 contemporary or historical regions. The final product provides new insights into the demographics of mortality in relation to gender and profession in history. The technical method demonstrates the usability of the latest text mining approaches to accurately clean historical data and reduce the missing values.

In [None]:
# Import the required libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Load the dataset from Google Drive (the path assumes Drive is mounted at /content/drive).
file_location = '/content/drive/MyDrive/AgeDataset-V1.csv/AgeDataset-V1.csv'
df = pd.read_csv(file_location)

In [None]:
# Preview the first five rows of the dataset.
df.head()

,Id,Name,Short description,Gender,Country,Occupation,Birth year,Death year,Manner of death,Age of death
0,Q23,George Washington,1st president of the United States (1732–1799),Male,United States of America; Kingdom of Great Bri...,Politician,1732,1799.0,natural causes,67.0
1,Q42,Douglas Adams,English writer and humorist,Male,United Kingdom,Artist,1952,2001.0,natural causes,49.0
2,Q91,Abraham Lincoln,16th president of the United States (1809-1865),Male,United States of America,Politician,1809,1865.0,homicide,56.0
3,Q254,Wolfgang Amadeus Mozart,Austrian composer of the Classical period,Male,Archduchy of Austria; Archbishopric of Salzburg,Artist,1756,1791.0,,35.0
4,Q255,Ludwig van Beethoven,German classical and romantic composer,Male,Holy Roman Empire; Austrian Empire,Artist,1770,1827.0,,57.0


In [None]:
# Preview the last five rows of the dataset.
df.tail()

,Id,Name,Short description,Gender,Country,Occupation,Birth year,Death year,Manner of death,Age of death
1223004,Q77247326,Marie-Fortunée Besson,Frans model (1907-1996),,France,Tailor; model,1907,1996.0,,89.0
1223005,Q77249504,Ron Thorsen,xugador de baloncestu canadianu (1948–2004),,Canada; United States of America,Athlete,1948,2004.0,,56.0
1223006,Q77249818,Diether Todenhagen,German navy officer and world war II U-boat co...,,Germany,Military personnel,1920,1944.0,,24.0
1223007,Q77253909,Reginald Oswald Pearson,"English artist, working in stained glass, prin...",Male,United Kingdom,Artist,1887,1915.0,,28.0
1223008,Q77254864,Horst Lerche,German painter,Male,Germany,Artist,1938,2017.0,,79.0


In [None]:
# Check the shape (rows, columns) of the dataset.
df.shape

(1223009, 10)

In [None]:
# List all the columns in the dataset.
df.columns

Index(['Id', 'Name', 'Short description', 'Gender', 'Country', 'Occupation',
       'Birth year', 'Death year', 'Manner of death', 'Age of death'],
      dtype='object')

In [None]:
# Summarize the dataset: column dtypes and non-null counts.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1223009 entries, 0 to 1223008
Data columns (total 10 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   Id                 1223009 non-null  object 
 1   Name               1223009 non-null  object 
 2   Short description  1155109 non-null  object 
 3   Gender             1089363 non-null  object 
 4   Country            887500 non-null   object 
 5   Occupation         1016095 non-null  object 
 6   Birth year         1223009 non-null  int64  
 7   Death year         1223008 non-null  float64
 8   Manner of death    53603 non-null    object 
 9   Age of death       1223008 non-null  float64
dtypes: float64(2), int64(1), object(7)
memory usage: 93.3+ MB


In [None]:
# Count the null values in each column.
df.isna().sum()

Id                         0
Name                       0
Short description      67900
Gender                133646
Country               335509
Occupation            206914
Birth year                 0
Death year                 1
Manner of death      1169406
Age of death               1
dtype: int64
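
Raw null counts are easier to compare as a share of all rows. From the counts above, `Manner of death` is missing in roughly 96% of the 1,223,009 rows and `Country` in roughly 27%. The following is a minimal sketch of that calculation on a small made-up frame; `toy` and its values are illustrative stand-ins, not part of the dataset.

```python
import pandas as pd

# Toy frame standing in for `df`; the values are made up for illustration.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Gender": ["Male", None, "Female", None],
    "Manner of death": [None, None, None, "homicide"],
})

# isna() yields booleans, so mean() is the fraction of missing values per column.
missing_pct = (toy.isna().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

On the real data, the same expression applied to `df` would rank the columns by how incomplete they are.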

In [None]:
# Work on a copy so the original DataFrame stays unchanged.
data = df.copy()

In [None]:
# Show the row(s) where 'Death year' is missing.
data[data['Death year'].isna()]

,Id,Name,Short description,Gender,Country,Occupation,Birth year,Death year,Manner of death,Age of death
361,Q3611993,Issa Annamoradnejad,Creator of this dataset,Male,,,1992,,,


In [None]:
# Only one row has a null 'Death year': the entry for the dataset's creator.
# Drop that row so it does not skew the analysis.
data.dropna(subset=['Death year'], inplace=True)

In [None]:
# Confirm that the creator's row is gone: 'Death year' and 'Age of death' now have no nulls.
data.isna().sum()

Id                         0
Name                       0
Short description      67900
Gender                133646
Country               335508
Occupation            206913
Birth year                 0
Death year                 0
Manner of death      1169405
Age of death               0
dtype: int64
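
As an alternative to `inplace=True`, the drop can return a new frame, which makes it easy to check how many rows were removed. This is a hedged sketch on a made-up frame (`toy` and `cleaned` are hypothetical names); the call to `reset_index(drop=True)` is an optional design choice that closes the gap the dropped row leaves in the index.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; one row has no death year.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Birth year": [1900, 1992, 1950],
    "Death year": [1970.0, np.nan, 2001.0],
})

rows_before = len(toy)

# dropna returns a new frame; reset_index renumbers the remaining rows 0..n-1.
cleaned = toy.dropna(subset=["Death year"]).reset_index(drop=True)

rows_dropped = rows_before - len(cleaned)
print(f"Dropped {rows_dropped} row(s)")
```

Without the reset, the original labels are kept, which is why the notebook's later `data.info()` output still reports an index running to 1223008 for 1223008 entries.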

In [None]:
# Convert 'Death year' and 'Age of death' from float to int now that they contain no nulls.
data[['Death year', 'Age of death']] = data[['Death year', 'Age of death']].astype('int')


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1223008 entries, 0 to 1223008
Data columns (total 10 columns):
 #   Column             Non-Null Count    Dtype 
---  ------             --------------    ----- 
 0   Id                 1223008 non-null  object
 1   Name               1223008 non-null  object
 2   Short description  1155108 non-null  object
 3   Gender             1089362 non-null  object
 4   Country            887500 non-null   object
 5   Occupation         1016095 non-null  object
 6   Birth year         1223008 non-null  int64 
 7   Death year         1223008 non-null  int64 
 8   Manner of death    53603 non-null    object
 9   Age of death       1223008 non-null  int64 
dtypes: int64(3), object(7)
memory usage: 102.6+ MB
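
The cast to a plain integer type only works because the single null row was dropped first; NumPy integers cannot represent missing values. If missing rows had to be kept, pandas' nullable `Int64` dtype would be one option. The sketch below contrasts the two on a made-up series (`toy`, `nullable`, and `strict` are hypothetical names).

```python
import numpy as np
import pandas as pd

# Toy column standing in for 'Death year', with one missing entry.
toy = pd.DataFrame({"Death year": [1799.0, 2001.0, np.nan]})

# Sanity check before casting: every present value is a whole number.
all_whole = (toy["Death year"].dropna() % 1 == 0).all()

# Nullable integer dtype keeps the missing entry as <NA>.
nullable = toy["Death year"].astype("Int64")

# Plain int64 needs the missing entry removed first.
strict = toy["Death year"].dropna().astype("int64")

print(nullable.dtype, strict.dtype)
```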


In [None]:
# Replace the remaining null values with placeholder labels.
# Assigning the result back avoids calling fillna(inplace=True) on a selected column.
data['Manner of death'] = data['Manner of death'].fillna('Unknown')
data['Short description'] = data['Short description'].fillna('Unavailable')
data['Gender'] = data['Gender'].fillna('unknown')
data['Country'] = data['Country'].fillna('Unknown')
data['Occupation'] = data['Occupation'].fillna('Unknown')

In [None]:
# All null values are now handled, so we can proceed with the analysis.
data.isna().sum()

Id                   0
Name                 0
Short description    0
Gender               0
Country              0
Occupation           0
Birth year           0
Death year           0
Manner of death      0
Age of death         0
dtype: int64
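
The per-column fills can also be written as one call by passing `DataFrame.fillna` a dictionary that maps column names to placeholder values. This is a minimal sketch on a made-up frame; `toy`, `fill_values`, and `filled` are hypothetical names, and the placeholders mirror the ones used in the notebook.

```python
import pandas as pd

# Toy frame standing in for `data`; values are illustrative only.
toy = pd.DataFrame({
    "Gender": ["Male", None],
    "Country": [None, "France"],
    "Age of death": [67, 89],
})

# Column name -> placeholder; columns not listed are left untouched.
fill_values = {"Gender": "unknown", "Country": "Unknown"}

filled = toy.fillna(value=fill_values)
print(filled)
```

Keeping the placeholders in one dictionary also makes inconsistencies, such as the lowercase `'unknown'` used for `Gender`, easier to spot.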