ETL Pipeline for Fortune 1000 Data
===

## Introduction

In [None]:
'''
=================================================
ETL Pipeline for Fortune 1000 Data
Owner: Sam

This program loads the raw Fortune 1000 dataset (P2M3_Sam_data_raw.csv) and performs data exploration and quality checks on it.
=================================================
'''



## Library

In [2]:
import pandas as pd

## Data Loading

In [3]:
# display all columns and rows without truncation
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# load data
df = pd.read_csv('P2M3_Sam_data_raw.csv')

# data info
df.info()

# show the first 5 rows (the default for head())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   company            1000 non-null   object 
 1   rank               1000 non-null   int64  
 2   rank_change        1000 non-null   float64
 3   revenue            1000 non-null   float64
 4   profit             997 non-null    float64
 5   num. of employees  999 non-null    float64
 6   sector             1000 non-null   object 
 7   city               1000 non-null   object 
 8   state              1000 non-null   object 
 9   newcomer           1000 non-null   object 
 10  ceo_founder        1000 non-null   object 
 11  ceo_woman          1000 non-null   object 
 12  profitable         1000 non-null   object 
 13  prev_rank          1000 non-null   object 
 14  CEO                1000 non-null   object 
 15  Website            1000 non-null   object 
 16  Ticker             951 no

Unnamed: 0,company,rank,rank_change,revenue,profit,num. of employees,sector,city,state,newcomer,ceo_founder,ceo_woman,profitable,prev_rank,CEO,Website,Ticker,Market Cap
0,Walmart,1,0.0,572754.0,13673.0,2300000.0,Retailing,Bentonville,AR,no,no,no,yes,1.0,C. Douglas McMillon,https://www.stock.walmart.com,WMT,352037
1,Amazon,2,0.0,469822.0,33364.0,1608000.0,Retailing,Seattle,WA,no,no,no,yes,2.0,Andrew R. Jassy,www.amazon.com,AMZN,1202717
2,Apple,3,0.0,365817.0,94680.0,154000.0,Technology,Cupertino,CA,no,no,no,yes,3.0,Timothy D. Cook,www.apple.com,AAPL,2443962
3,CVS Health,4,0.0,292111.0,7910.0,258000.0,Health Care,Woonsocket,RI,no,no,yes,yes,4.0,Karen Lynch,https://www.cvshealth.com,CVS,125204
4,UnitedHealth Group,5,0.0,287597.0,17285.0,350000.0,Health Care,Minnetonka,MN,no,no,no,yes,5.0,Andrew P. Witty,www.unitedhealthgroup.com,UNH,500468
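
The column names in the output above mix styles (`num. of employees`, `CEO`, `Market Cap`), which makes attribute-style access awkward. A minimal sketch of one way to normalize them follows. The helper `normalize_column` and the stand-in frame `sample` are hypothetical and not part of the original notebook.

```python
import re

import pandas as pd


def normalize_column(name: str) -> str:
    """Lowercase a column name and collapse non-alphanumeric runs into underscores."""
    cleaned = re.sub(r'[^0-9a-zA-Z]+', '_', name.strip().lower())
    return cleaned.strip('_')


# Hypothetical stand-in frame that reuses a few of the raw column names
sample = pd.DataFrame(columns=['company', 'num. of employees', 'CEO', 'Market Cap'])

# rename() accepts a function and applies it to every column label
sample = sample.rename(columns=normalize_column)
print(list(sample.columns))
```

Applied to the real `df`, the same `rename` call would leave already-clean names such as `rank_change` unchanged.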


In [4]:
df.prev_rank.unique()

array(['1.0', '2.0', '3.0', '4.0', '5.0', '10.0', '6.0', '9.0', '7.0',
       '8.0', '12.0', '13.0', '11.0', '15.0', '14.0', '27.0', '18.0',
       '16.0', '32.0', '23.0', '17.0', '21.0', '20.0', '19.0', '22.0',
       '24.0', '34.0', '26.0', '48.0', '53.0', '28.0', '30.0', '25.0',
       '35.0', '31.0', '29.0', '36.0', '51.0', '45.0', '41.0', '37.0',
       '39.0', '77.0', '33.0', '44.0', '40.0', '43.0', '38.0', '42.0',
       '46.0', '55.0', '52.0', '50.0', '81.0', '49.0', '47.0', '59.0',
       '57.0', '56.0', '54.0', '61.0', '62.0', '68.0', '82.0', '100.0',
       '70.0', '72.0', '66.0', '64.0', '60.0', '65.0', '67.0', '78.0',
       '63.0', '97.0', '69.0', '156.0', '71.0', '74.0', '76.0', '73.0',
       '75.0', '85.0', '88.0', '83.0', '89.0', '58.0', '127.0', '105.0',
       '79.0', '80.0', '95.0', '93.0', '84.0', '103.0', '87.0', '90.0',
       '149.0', '92.0', '123.0', '86.0', '96.0', '98.0', '102.0', '94.0',
       '91.0', '124.0', '99.0', '117.0', '101.0', '147.0', '114.0',
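
The values above are strings such as `'1.0'`, which is why `prev_rank` is reported with dtype `object`. A possible conversion to a numeric column is sketched below on a small stand-in Series. The blank entry is a hypothetical non-numeric placeholder, since the truncated output does not show which value prevents pandas from parsing the column as numbers.

```python
import pandas as pd

# Stand-in values; the blank string is a hypothetical placeholder
# for a company that has no previous rank
prev_rank = pd.Series(['1.0', '2.0', ' ', '156.0'], name='prev_rank')

# errors='coerce' turns anything unparseable into NaN instead of raising
prev_rank_num = pd.to_numeric(prev_rank, errors='coerce')

# The nullable integer dtype keeps missing entries as <NA>
prev_rank_int = prev_rank_num.astype('Int64')
print(prev_rank_int)
```

Counting the NaN values before and after the conversion would show how many entries were non-numeric.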
  

## Find duplicates in the data

In [5]:
# check for duplicates
print(f'Number of duplicates: {df.duplicated().sum()}')

Number of duplicates: 0
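
Note that `df.duplicated()` only flags rows that match on every column. A complementary check, sketched here on a hypothetical stand-in frame, restricts the comparison to a key column. Treating `company` as the key is an assumption, not something the original notebook states.

```python
import pandas as pd

# Hypothetical stand-in data: same company listed twice with different ranks
sample = pd.DataFrame({
    'company': ['Acme', 'Acme', 'Globex'],
    'rank': [1, 2, 3],
})

# Full-row comparison: no two rows are identical
full_row_dupes = sample.duplicated().sum()

# Key-only comparison: the second 'Acme' row is flagged
key_dupes = sample.duplicated(subset=['company']).sum()

print(f'Full-row duplicates: {full_row_dupes}')
print(f'Duplicates on company: {key_dupes}')
```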


## Evaluate missing values

In [6]:
# check for NaN values in each column
nan_count = df.isnull().sum()

print("Number of NaN values per column:")
print(nan_count)

Number of NaN values per column:
company               0
rank                  0
rank_change           0
revenue               0
profit                3
num. of employees     1
sector                0
city                  0
state                 0
newcomer              0
ceo_founder           0
ceo_woman             0
profitable            0
prev_rank             0
CEO                   0
Website               0
Ticker               49
Market Cap           31
dtype: int64
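
Raw counts are easier to judge as a share of the rows; with 1000 rows, the 49 missing `Ticker` values are 4.9% of the column. One way to report both at once is sketched below on a hypothetical stand-in frame. The names `sample` and `summary` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with a few missing entries
sample = pd.DataFrame({
    'company': ['A', 'B', 'C', 'D'],
    'profit': [1.0, np.nan, 3.0, 4.0],
    'Ticker': ['AAA', None, None, 'DDD'],
})

# isnull().mean() gives the fraction of missing values per column
summary = pd.DataFrame({
    'missing': sample.isnull().sum(),
    'missing_pct': sample.isnull().mean() * 100,
})

# Keep only columns that have gaps, worst first
summary = summary[summary['missing'] > 0].sort_values('missing', ascending=False)
print(summary)
```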
