# Context
The dataset contains information about the top 100 celebrities from the IMDb site. The data has been collected through web scraping from the 'https://m.imdb.com/chart/starmeter/' webpage.

The dataset consists of the following columns:

Rank: The celebrity's position on the chart, based on their popularity and starmeter rating.
Name: The name of the celebrity.
Date_of_birth: The date of birth of the celebrity, in YYYY-MM-DD format.
height: The height of the celebrity, in feet and inches with metres in parentheses, e.g. 5′ 6″ (1.68 m).
Role: The roles credited to the celebrity, such as actor, director, producer, and other related roles.
Awards: The celebrity's award wins and nominations, stored as text such as "5 wins , 43 nominations total".
Famous_for: The movie or show for which the celebrity is best known.
Birth_place: The birthplace of the celebrity.

Together, these columns cover each celebrity's ranking, personal details such as birth date and height, roles in the entertainment industry, award recognition, best-known project, and birthplace. The dataset can be used to analyze the popularity and achievements of these celebrities.

# Purpose
As part of a challenge to build an ETL pipeline, we clean this dataset so that it is ready for further analysis.

# Extract
Extracting the original data from a .csv file.

In [29]:
import pandas as pd

In [30]:
df = pd.read_csv('celeb_data.csv')
df.head()

Unnamed: 0,Rank,Name,Date_of_birth,height,Role,Awards,Famous_for,Birth_place
0,1,Anya Chalotra,1995-07-21,5′ 6″ (1.68 m),['Actress'],1 win,The Witcher,"Wolverhampton, Staffordshire, England, UK"
1,2,Hayley Atwell,1982-04-05,5′ 6½″ (1.69 m),"['Actress', 'Soundtrack']",15 nominations,Captain America: The First Avenger,"London, England, UK"
2,3,Rebecca Ferguson,1983-10-19,5′ 5″ (1.65 m),"['Actress', 'Soundtrack', 'Producer']",9 wins,The Greatest Showman,Sweden
3,4,Vanessa Kirby,1988-04-18,5′ 7″ (1.70 m),"['Actress', 'Soundtrack', 'Producer']","5 wins , 43 nominations total",The Crown,"London, England, UK"
4,5,Tom Cruise,1962-07-03,5′ 7″ (1.70 m),"['Actor', 'Director', 'Producer']","59 wins , 108 nominations total",Top Gun,"Syracuse, New York, USA"


In [31]:
df.tail()

Unnamed: 0,Rank,Name,Date_of_birth,height,Role,Awards,Famous_for,Birth_place
95,96,Elizabeth Debicki,1990-08-24,6′ 2¾″ (1.90 m),"['Actress', 'Producer']","6 wins , 38 nominations total",The Great Gatsby,"Paris, France"
96,97,John Krasinski,1979-10-20,6′ 3″ (1.91 m),"['Actor', 'Director', 'Producer']","18 wins , 79 nominations total",Tom Clancy's Jack Ryan,"Newton, Massachusetts, USA"
97,98,Idris Elba,1972-09-06,6′ 2½″ (1.89 m),"['Actor', 'Producer', 'Writer']","30 wins , 102 nominations total",Beasts of No Nation,"Hackney, London, England, UK"
98,99,Chloë Grace Moretz,1997-02-10,5′ 4″ (1.63 m),"['Actress', 'Soundtrack', 'Producer']",28 wins,Kick-Ass,"Atlanta, Georgia, USA"
99,100,Juno Temple,1989-07-21,5′ 2″ (1.57 m),"['Actress', 'Soundtrack']","6 wins , 19 nominations total",Atonement,"London, England, UK"
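
The `Role` values shown above look like Python list literals stored as text (for example `['Actress', 'Soundtrack']`), so after `read_csv` each cell would be a string rather than a list. A minimal sketch of converting them, assuming every cell is a valid list literal; the `sample` frame and the `Role_list` column are illustrative names, not part of the notebook:

```python
import ast

import pandas as pd

# Two rows copied from the output above; the rest of the columns are omitted.
sample = pd.DataFrame({
    "Name": ["Anya Chalotra", "Hayley Atwell"],
    "Role": ["['Actress']", "['Actress', 'Soundtrack']"],
})

# literal_eval safely parses the text into a real Python list.
sample["Role_list"] = sample["Role"].apply(ast.literal_eval)

# Number of roles per celebrity.
n_roles = sample["Role_list"].str.len()
```

Once the roles are real lists, something like `sample.explode("Role_list")` could give one row per role for counting.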


# Transform
Performing data verification and cleaning.

In [32]:
df.shape

(100, 8)

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Rank           100 non-null    int64 
 1   Name           100 non-null    object
 2   Date_of_birth  100 non-null    object
 3   height         96 non-null     object
 4   Role           100 non-null    object
 5   Awards         98 non-null     object
 6   Famous_for     100 non-null    object
 7   Birth_place    98 non-null     object
dtypes: int64(1), object(7)
memory usage: 6.4+ KB


In [34]:
# Checking for duplicate rows
df.duplicated().sum()

0
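
`df.duplicated()` only flags rows that match on every column, so a celebrity listed twice under different ranks would not be counted. A small sketch of a stricter check on a single column, using an invented three-row sample (the duplicated name is made up for illustration and does not occur in the dataset as far as the outputs show):

```python
import pandas as pd

# Hypothetical sample: the same name appears under two different ranks.
sample = pd.DataFrame({
    "Rank": [1, 2, 3],
    "Name": ["Tom Cruise", "Idris Elba", "Tom Cruise"],
})

# Full-row comparison: no duplicates, because the ranks differ.
full_row_dupes = int(sample.duplicated().sum())

# Comparison on Name only: the second "Tom Cruise" is flagged.
name_dupes = int(sample.duplicated(subset=["Name"]).sum())
```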

In [35]:
# Checking for null values
pd.isnull(df).sum()

Rank             0
Name             0
Date_of_birth    0
height           4
Role             0
Awards           2
Famous_for       0
Birth_place      2
dtype: int64

In [36]:
# Rows with a missing height
df[pd.isna(df['height'])]

Unnamed: 0,Rank,Name,Date_of_birth,height,Role,Awards,Famous_for,Birth_place
15,16,Ariana Greenblatt,2007-08-27,,['Actress'],1 nomination,Barbie,"New York, USA"
75,76,Miss Benny,1999-02-19,,"['Actress', 'Soundtrack']",,Glamorous,
77,78,Christopher McQuarrie,1968-10-25,,"['Director', 'Producer', 'Writer']","10 wins , 17 nominations total",The Usual Suspects,"Princeton, New Jersey, USA"
79,80,Ayo Edebiri,1995-10-03,,"['Actress', 'Producer', 'Writer']","3 wins , 13 nominations total",The Bear,"Boston, Massachusetts, USA"


In [37]:
# Rows with missing awards
df[pd.isna(df['Awards'])]

Unnamed: 0,Rank,Name,Date_of_birth,height,Role,Awards,Famous_for,Birth_place
33,34,Liz Katz,1988-07-08,5′ (1.52 m),"['Actress', 'Director', 'Writer']",,Guest House,"Randolph, New Jersey, USA"
75,76,Miss Benny,1999-02-19,,"['Actress', 'Soundtrack']",,Glamorous,


In [38]:
# Rows with a missing birthplace
df[pd.isna(df['Birth_place'])]

Unnamed: 0,Rank,Name,Date_of_birth,height,Role,Awards,Famous_for,Birth_place
24,25,Kingsley Ben-Adir,1986-11-20,5′ 7½″ (1.71 m),['Actor'],"10 wins , 14 nominations total",The OA,
75,76,Miss Benny,1999-02-19,,"['Actress', 'Soundtrack']",,Glamorous,


Dropping every row that has a missing value with .dropna() would discard 6 of the 100 rows. Instead, we fill the missing values with a placeholder suited to each column.
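
The number of rows that `.dropna()` would remove can be counted directly, since a row is dropped as soon as any of its columns is null. A minimal sketch on four rows taken from the outputs above (the `sample` frame is an illustrative subset, not the full dataset):

```python
import numpy as np
import pandas as pd

# Anya Chalotra, Ariana Greenblatt, Miss Benny, Liz Katz (three columns only).
sample = pd.DataFrame({
    "height": ["5′ 6″ (1.68 m)", np.nan, np.nan, "5′ (1.52 m)"],
    "Awards": ["1 win", "1 nomination", np.nan, np.nan],
    "Birth_place": [
        "Wolverhampton, Staffordshire, England, UK",
        "New York, USA",
        np.nan,
        "Randolph, New Jersey, USA",
    ],
})

# Rows with at least one null; a row with several nulls is counted once.
rows_with_nulls = int(sample.isna().any(axis=1).sum())

# Rows that would survive a plain dropna().
rows_after_dropna = len(sample.dropna())
```

The same expression applied to `df` should give the number of rows at stake, which is lower than the sum of the per-column null counts because one row is missing all three values.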

In [39]:
values = {'height': '-', 'Awards': 'without nominations', 'Birth_place': 'uninformed'}
df.fillna(value=values, inplace=True)

In [40]:
pd.isnull(df).sum()

Rank             0
Name             0
Date_of_birth    0
height           0
Role             0
Awards           0
Famous_for       0
Birth_place      0
dtype: int64
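
Because `Awards` is free text, further analysis would probably need numeric counts. A hedged sketch of extracting them with regular expressions, assuming the only formats are the ones visible in the outputs above plus the `'without nominations'` placeholder; the `wins` and `nominations` names are illustrative:

```python
import pandas as pd

# The text formats seen in the dataset, plus the fill value used above.
awards = pd.Series([
    "1 win",
    "15 nominations",
    "5 wins , 43 nominations total",
    "without nominations",
])

# Capture the number directly before "win(s)"; no match becomes 0.
wins = pd.to_numeric(
    awards.str.extract(r"(\d+)\s+wins?", expand=False), errors="coerce"
).fillna(0).astype(int)

# Capture the number directly before "nomination(s)"; no match becomes 0.
nominations = pd.to_numeric(
    awards.str.extract(r"(\d+)\s+nominations?", expand=False), errors="coerce"
).fillna(0).astype(int)
```

The placeholder contains no digits, so it falls through to zero in both columns.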

In [41]:
df.shape

(100, 8)

In [42]:
# Inspecting the data types
df.dtypes

Rank              int64
Name             object
Date_of_birth    object
height           object
Role             object
Awards           object
Famous_for       object
Birth_place      object
dtype: object
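
Every column except `Rank` is stored as `object`, including the dates and heights. A possible next cleaning step, sketched here under the assumption that dates are always `YYYY-MM-DD` and heights always carry a metre value in parentheses; the `height_m` column is a hypothetical addition, not something the notebook creates:

```python
import pandas as pd

# Three rows from the outputs above; "-" is the placeholder for a missing height.
sample = pd.DataFrame({
    "Date_of_birth": ["1995-07-21", "1988-07-08", "1999-02-19"],
    "height": ["5′ 6″ (1.68 m)", "5′ (1.52 m)", "-"],
})

# Parse the date strings into datetime64 values.
sample["Date_of_birth"] = pd.to_datetime(sample["Date_of_birth"], format="%Y-%m-%d")

# Pull the metre value out of the parentheses; the placeholder yields NaN.
metres = sample["height"].str.extract(r"\((\d+(?:\.\d+)?)\s*m\)", expand=False)
sample["height_m"] = pd.to_numeric(metres, errors="coerce")

birth_years = sample["Date_of_birth"].dt.year.tolist()
```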

In [43]:
# Summary statistics; by default describe() covers numeric columns only (here, Rank)
df.describe()

In [44]:
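# rename() returns a new DataFrame; the result is displayed here but not assigned, so df keeps the "height" column
df.rename(columns={"height": "Height"})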

Unnamed: 0,Rank,Name,Date_of_birth,Height,Role,Awards,Famous_for,Birth_place
0,1,Anya Chalotra,1995-07-21,5′ 6″ (1.68 m),['Actress'],1 win,The Witcher,"Wolverhampton, Staffordshire, England, UK"
1,2,Hayley Atwell,1982-04-05,5′ 6½″ (1.69 m),"['Actress', 'Soundtrack']",15 nominations,Captain America: The First Avenger,"London, England, UK"
2,3,Rebecca Ferguson,1983-10-19,5′ 5″ (1.65 m),"['Actress', 'Soundtrack', 'Producer']",9 wins,The Greatest Showman,Sweden
3,4,Vanessa Kirby,1988-04-18,5′ 7″ (1.70 m),"['Actress', 'Soundtrack', 'Producer']","5 wins , 43 nominations total",The Crown,"London, England, UK"
4,5,Tom Cruise,1962-07-03,5′ 7″ (1.70 m),"['Actor', 'Director', 'Producer']","59 wins , 108 nominations total",Top Gun,"Syracuse, New York, USA"
...,...,...,...,...,...,...,...,...
95,96,Elizabeth Debicki,1990-08-24,6′ 2¾″ (1.90 m),"['Actress', 'Producer']","6 wins , 38 nominations total",The Great Gatsby,"Paris, France"
96,97,John Krasinski,1979-10-20,6′ 3″ (1.91 m),"['Actor', 'Director', 'Producer']","18 wins , 79 nominations total",Tom Clancy's Jack Ryan,"Newton, Massachusetts, USA"
97,98,Idris Elba,1972-09-06,6′ 2½″ (1.89 m),"['Actor', 'Producer', 'Writer']","30 wins , 102 nominations total",Beasts of No Nation,"Hackney, London, England, UK"
98,99,Chloë Grace Moretz,1997-02-10,5′ 4″ (1.63 m),"['Actress', 'Soundtrack', 'Producer']",28 wins,Kick-Ass,"Atlanta, Georgia, USA"
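
`DataFrame.rename` returns a renamed copy by default and leaves the original frame untouched, which would explain why the file reloaded in the Load step still shows `height`. A short sketch of keeping the rename, on a one-row illustrative `sample`:

```python
import pandas as pd

sample = pd.DataFrame({"Rank": [1], "height": ["5′ 6″ (1.68 m)"]})

# The renamed copy is discarded here, so sample is unchanged.
sample.rename(columns={"height": "Height"})
columns_before = list(sample.columns)

# Assigning the result back makes the new name stick.
sample = sample.rename(columns={"height": "Height"})
columns_after = list(sample.columns)
```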


# Load
Saving the cleaned data to a new .csv file and reading it back to check the result.

In [45]:
df.to_csv('new_celeb_data.csv', index=False)

In [46]:
new_df = pd.read_csv('new_celeb_data.csv')
new_df.head()

Unnamed: 0,Rank,Name,Date_of_birth,height,Role,Awards,Famous_for,Birth_place
0,1,Anya Chalotra,1995-07-21,5′ 6″ (1.68 m),['Actress'],1 win,The Witcher,"Wolverhampton, Staffordshire, England, UK"
1,2,Hayley Atwell,1982-04-05,5′ 6½″ (1.69 m),"['Actress', 'Soundtrack']",15 nominations,Captain America: The First Avenger,"London, England, UK"
2,3,Rebecca Ferguson,1983-10-19,5′ 5″ (1.65 m),"['Actress', 'Soundtrack', 'Producer']",9 wins,The Greatest Showman,Sweden
3,4,Vanessa Kirby,1988-04-18,5′ 7″ (1.70 m),"['Actress', 'Soundtrack', 'Producer']","5 wins , 43 nominations total",The Crown,"London, England, UK"
4,5,Tom Cruise,1962-07-03,5′ 7″ (1.70 m),"['Actor', 'Director', 'Producer']","59 wins , 108 nominations total",Top Gun,"Syracuse, New York, USA"
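
Beyond looking at `head()`, the round trip can be checked programmatically. A minimal sketch that writes a small frame to an in-memory buffer instead of a file and compares it with what is read back; the `sample` frame and `buffer` are illustrative stand-ins for `df` and `new_celeb_data.csv`:

```python
import io

import pandas as pd

sample = pd.DataFrame({
    "Rank": [1, 2],
    "Name": ["Anya Chalotra", "Hayley Atwell"],
    "Birth_place": ["Wolverhampton, Staffordshire, England, UK", "uninformed"],
})

# Write without the index, as in the cell above, then read it back.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises AssertionError if values, column names, or dtypes differ.
pd.testing.assert_frame_equal(sample, reloaded)
```

Values containing commas, such as the birthplaces, are quoted on write and should survive the round trip unchanged.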
