## Cleaning the Survey dataset
The dataset below is a survey on salaries of Managers across different career fields that was conducted between 2021 to 2024. I will use the Python library Pandas to perform the data cleaning  

In [1]:
#importing the library
import pandas as pd
#importing the dataet
df=pd.read_excel("Salary Survey.xlsx")

## Data Exploration

In [2]:
display(df.head())
print(f"\nShape :{df.shape}")
print(f"\nData Types: {df.dtypes}")

Unnamed: 0,Timestamp,How old are you?,What industry do you work in?,Job title,"If your job title needs additional context, please clarify here:","What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)","How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.",Please indicate the currency,"If ""Other,"" please indicate the currency here:","If your income needs additional context, please provide it here:",What country do you work in?,"If you're in the U.S., what state do you work in?",What city do you work in?,How many years of professional work experience do you have overall?,How many years of professional work experience do you have in your field?,What is your highest level of education completed?,What is your gender?,What is your race? (Choose all that apply.)
0,2021-04-27 11:02:09.743,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,2021-04-27 11:02:21.562,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,2021-04-27 11:02:38.125,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,2021-04-27 11:02:40.643,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,2021-04-27 11:02:41.793,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White



Shape :(28091, 18)

Data Types: Timestamp                                                                                                                                                                                                                               datetime64[ns]
How old are you?                                                                                                                                                                                                                                object
What industry do you work in?                                                                                                                                                                                                                   object
Job title                                                                                                                                                                                                                         

## Data Preprocessing
We'll start by renaming the columns for easier visibility and readbility

In [3]:
#renaming the columns
df.columns=[
    "Timestamp",
    "Age",
    "Job Industry",
    "Job Title",
    "Job Specifics",
    "Annual Salary",
    "Monetary Compensation",
    "Currency",
    "Other Currency",
    "Income Specifics",
    "Occupation Country",
    "US States",
    "Occupation City",
    "Overall Experience",
    "Experience",
    "Education Level",
    "Gender",
    "Race"
]
#lets display our new dataset
display(df.head())

Unnamed: 0,Timestamp,Age,Job Industry,Job Title,Job Specifics,Annual Salary,Monetary Compensation,Currency,Other Currency,Income Specifics,Occupation Country,US States,Occupation City,Overall Experience,Experience,Education Level,Gender,Race
0,2021-04-27 11:02:09.743,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,2021-04-27 11:02:21.562,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,2021-04-27 11:02:38.125,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,2021-04-27 11:02:40.643,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,2021-04-27 11:02:41.793,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


* lets check the data types of a new columns

In [4]:
df.dtypes

Timestamp                datetime64[ns]
Age                              object
Job Industry                     object
Job Title                        object
Job Specifics                    object
Annual Salary                     int64
Monetary Compensation           float64
Currency                         object
Other Currency                   object
Income Specifics                 object
Occupation Country               object
US States                        object
Occupation City                  object
Overall Experience               object
Experience                       object
Education Level                  object
Gender                           object
Race                             object
dtype: object

* Lets check for missing values

In [5]:
#checking for missing values
df.isna().sum()

Timestamp                    0
Age                          0
Job Industry                74
Job Title                    1
Job Specifics            20824
Annual Salary                0
Monetary Compensation     7308
Currency                     0
Other Currency           27884
Income Specifics         25047
Occupation Country           0
US States                 5029
Occupation City             82
Overall Experience           0
Experience                   0
Education Level            223
Gender                     171
Race                       177
dtype: int64

* From the checking for missing values, we note that some columns we need to be dropped and other columns we'll need to drop some of the rows

In [6]:
#dropping the columns with missing values
df.drop(columns=["Job Specifics","Other Currency","Income Specifics","US States"],inplace=True)
#dropping the missing values
df.dropna(inplace=True)

* Lets get the month and year columns from the Timestamp and drop the Timestamp column because we'll no longer need it

In [7]:
#getting the months
df["Record Month"]=df["Timestamp"].dt.month_name()
#getting the years
df["Record Year"]=df["Timestamp"].dt.year
#dropping the Timestamp column
df.drop(columns="Timestamp",inplace=True)

In [8]:
df["Monetary Compensation"]=df["Monetary Compensation"].astype(int)

* Lets get the Total Annual Salary by summing the Annual Salary and Monetary Compensation`

In [9]:
df["Total Annual Salary"]=df["Annual Salary"] + df["Monetary Compensation"]

In [10]:
df["Currency"].unique()

array(['USD', 'GBP', 'CAD', 'EUR', 'AUD/NZD', 'Other', 'CHF', 'ZAR',
       'HKD', 'SEK', 'JPY'], dtype=object)

* lets add these currencies beside their values

In [11]:
df["Annual Salary"]=df["Currency"] + " " + df["Annual Salary"].astype(str)
df["Monetary Compensation"]=df["Currency"] + " " + df["Monetary Compensation"].astype(str)
df["Total Annual Salary"]=df["Currency"] + " " + df["Total Annual Salary"].astype(str)
#dropping the currency column
df.drop(columns="Currency",inplace=True)

In [12]:
df.head()

Unnamed: 0,Age,Job Industry,Job Title,Annual Salary,Monetary Compensation,Occupation Country,Occupation City,Overall Experience,Experience,Education Level,Gender,Race,Record Month,Record Year,Total Annual Salary
0,25-34,Education (Higher Education),Research and Instruction Librarian,USD 55000,USD 0,United States,Boston,5-7 years,5-7 years,Master's degree,Woman,White,April,2021,USD 55000
1,25-34,Computing or Tech,Change & Internal Communications Manager,GBP 54600,GBP 4000,United Kingdom,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White,April,2021,GBP 58600
3,25-34,Nonprofits,Program Manager,USD 62000,USD 3000,USA,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White,April,2021,USD 65000
4,25-34,"Accounting, Banking & Finance",Accounting Manager,USD 60000,USD 7000,US,Greenville,8 - 10 years,5-7 years,College degree,Woman,White,April,2021,USD 67000
6,25-34,Publishing,Publishing Assistant,USD 33000,USD 2000,USA,Columbia,2 - 4 years,2 - 4 years,College degree,Woman,White,April,2021,USD 35000


cols_move=[("Year",0),("Month",1),("Total Annual Salary",5)]
cols=list(df.columns)
for col,position in cols_move:
    cols.remove(col)
    cols.insert(position,col)
df=df[cols]    

* Lets create categories for the Age column

In [13]:
df["Age"].unique()

array(['25-34', '45-54', '35-44', '18-24', '65 or over', '55-64',
       'under 18'], dtype=object)

In [14]:
def age_group(age):
    if age =="18-24":
        return "Young"
    elif age =="25-34":
        return "Mid Life"
    elif age =="35-44":
        return "Junior"
    elif age =="45-54":
        return "Experienced"
    elif age =="55-64":
        return "Senior"
    else:
        return "Retired"
df["Age Group"]=df["Age"].apply(age_group)

In [15]:
df.head()

Unnamed: 0,Age,Job Industry,Job Title,Annual Salary,Monetary Compensation,Occupation Country,Occupation City,Overall Experience,Experience,Education Level,Gender,Race,Record Month,Record Year,Total Annual Salary,Age Group
0,25-34,Education (Higher Education),Research and Instruction Librarian,USD 55000,USD 0,United States,Boston,5-7 years,5-7 years,Master's degree,Woman,White,April,2021,USD 55000,Mid Life
1,25-34,Computing or Tech,Change & Internal Communications Manager,GBP 54600,GBP 4000,United Kingdom,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White,April,2021,GBP 58600,Mid Life
3,25-34,Nonprofits,Program Manager,USD 62000,USD 3000,USA,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White,April,2021,USD 65000,Mid Life
4,25-34,"Accounting, Banking & Finance",Accounting Manager,USD 60000,USD 7000,US,Greenville,8 - 10 years,5-7 years,College degree,Woman,White,April,2021,USD 67000,Mid Life
6,25-34,Publishing,Publishing Assistant,USD 33000,USD 2000,USA,Columbia,2 - 4 years,2 - 4 years,College degree,Woman,White,April,2021,USD 35000,Mid Life


* Lets do the same for the Experience column 

In [16]:
df["Overall Experience"].unique()

array(['5-7 years', '8 - 10 years', '2 - 4 years', '21 - 30 years',
       '11 - 20 years', '41 years or more', '31 - 40 years',
       '1 year or less'], dtype=object)

In [17]:
df["Overall Experience"].value_counts()

Overall Experience
11 - 20 years       7132
8 - 10 years        3861
5-7 years           3440
21 - 30 years       2781
2 - 4 years         2069
31 - 40 years        682
1 year or less       338
41 years or more     105
Name: count, dtype: int64

In [18]:
def expe(year):
    if year =="1 year or less":
        return "Entry Level"
    elif year =="2 - 4 years":
        return "Junior Level"
    elif year =="5 - 7 years":
        return "Mid Level"
    elif year == "8 - 10 years":
        return "Experienced Level"
    elif year == "11 - 20 years":
        return "Senior Level"
    elif year == "21 - 30 years":
        return "Veteran Level"
    elif year == "31 - 40 years":
        return "Expert Level"
    else:
        return "Legendary Level"
df["Experience Level"]=df["Overall Experience"].apply(expe)

In [19]:
df.head()

Unnamed: 0,Age,Job Industry,Job Title,Annual Salary,Monetary Compensation,Occupation Country,Occupation City,Overall Experience,Experience,Education Level,Gender,Race,Record Month,Record Year,Total Annual Salary,Age Group,Experience Level
0,25-34,Education (Higher Education),Research and Instruction Librarian,USD 55000,USD 0,United States,Boston,5-7 years,5-7 years,Master's degree,Woman,White,April,2021,USD 55000,Mid Life,Legendary Level
1,25-34,Computing or Tech,Change & Internal Communications Manager,GBP 54600,GBP 4000,United Kingdom,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White,April,2021,GBP 58600,Mid Life,Experienced Level
3,25-34,Nonprofits,Program Manager,USD 62000,USD 3000,USA,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White,April,2021,USD 65000,Mid Life,Experienced Level
4,25-34,"Accounting, Banking & Finance",Accounting Manager,USD 60000,USD 7000,US,Greenville,8 - 10 years,5-7 years,College degree,Woman,White,April,2021,USD 67000,Mid Life,Experienced Level
6,25-34,Publishing,Publishing Assistant,USD 33000,USD 2000,USA,Columbia,2 - 4 years,2 - 4 years,College degree,Woman,White,April,2021,USD 35000,Mid Life,Junior Level


* Lets clean the Race column

In [20]:
df["Race"].unique()

array(['White', 'Hispanic, Latino, or Spanish origin, White',
       'Asian or Asian American, White', 'Asian or Asian American',
       'Another option not listed here or prefer not to answer',
       'Middle Eastern or Northern African',
       'Hispanic, Latino, or Spanish origin', 'Black or African American',
       'Black or African American, Hispanic, Latino, or Spanish origin, White',
       'Native American or Alaska Native, White',
       'Hispanic, Latino, or Spanish origin, Another option not listed here or prefer not to answer',
       'White, Another option not listed here or prefer not to answer',
       'Black or African American, Native American or Alaska Native, White',
       'Asian or Asian American, Another option not listed here or prefer not to answer',
       'Middle Eastern or Northern African, White',
       'Black or African American, White',
       'Asian or Asian American, Black or African American, White',
       'Black or African American, Hispanic, Latino

In [21]:
def clean_race(value):
    if "White" in value and "," not in value:
        return "White"
    elif "Black or African American" in value and "," not in value:
        return "Black or African American"
    elif "Asian or Asian American" in value and "," not in value:
        return "Asian or Asian American"
    elif "Hispanic, Latino, or Spanish origin" in value and "," not in value:
        return "Hispanic, Latino, or Spanish origin"
    elif "Native American or Alaska Native" in value and "," not in value:
        return "Native American or Alaska Native"
    elif "Middle Eastern or Northern African" in value and "," not in value:
        return "Middle Eastern or Northern African"
    elif "Another option not listed here or prefer not to answer" in value:
        return "Other or Prefer not to answer"
    else:
        return "Multiracial"
        
df["Race"] = df["Race"].apply(clean_race)

print(df["Race"].unique())
print(df["Race"].value_counts())

['White' 'Multiracial' 'Asian or Asian American'
 'Other or Prefer not to answer' 'Middle Eastern or Northern African'
 'Black or African American' 'Native American or Alaska Native']
Races
White                                 17160
Multiracial                            1245
Asian or Asian American                 974
Other or Prefer not to answer           526
Black or African American               427
Middle Eastern or Northern African       47
Native American or Alaska Native         29
Name: count, dtype: int64


In [22]:
df.head()

Unnamed: 0,Age,Job Industry,Job Title,Annual Salary,Monetary Compensation,Occupation Country,Occupation City,Overall Experience,Experience,Education Level,Gender,Race,Record Month,Record Year,Total Annual Salary,Age Group,Experience Level,Races
0,25-34,Education (Higher Education),Research and Instruction Librarian,USD 55000,USD 0,United States,Boston,5-7 years,5-7 years,Master's degree,Woman,White,April,2021,USD 55000,Mid Life,Legendary Level,White
1,25-34,Computing or Tech,Change & Internal Communications Manager,GBP 54600,GBP 4000,United Kingdom,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White,April,2021,GBP 58600,Mid Life,Experienced Level,White
3,25-34,Nonprofits,Program Manager,USD 62000,USD 3000,USA,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White,April,2021,USD 65000,Mid Life,Experienced Level,White
4,25-34,"Accounting, Banking & Finance",Accounting Manager,USD 60000,USD 7000,US,Greenville,8 - 10 years,5-7 years,College degree,Woman,White,April,2021,USD 67000,Mid Life,Experienced Level,White
6,25-34,Publishing,Publishing Assistant,USD 33000,USD 2000,USA,Columbia,2 - 4 years,2 - 4 years,College degree,Woman,White,April,2021,USD 35000,Mid Life,Junior Level,White


* Lets drop the unnecessary columns

In [24]:
df.drop(columns=["Age","Overall Experience","Experience","Race"],inplace=True)

In [25]:
df.head()

Unnamed: 0,Job Industry,Job Title,Annual Salary,Monetary Compensation,Occupation Country,Occupation City,Education Level,Gender,Record Month,Record Year,Total Annual Salary,Age Group,Experience Level,Races
0,Education (Higher Education),Research and Instruction Librarian,USD 55000,USD 0,United States,Boston,Master's degree,Woman,April,2021,USD 55000,Mid Life,Legendary Level,White
1,Computing or Tech,Change & Internal Communications Manager,GBP 54600,GBP 4000,United Kingdom,Cambridge,College degree,Non-binary,April,2021,GBP 58600,Mid Life,Experienced Level,White
3,Nonprofits,Program Manager,USD 62000,USD 3000,USA,Milwaukee,College degree,Woman,April,2021,USD 65000,Mid Life,Experienced Level,White
4,"Accounting, Banking & Finance",Accounting Manager,USD 60000,USD 7000,US,Greenville,College degree,Woman,April,2021,USD 67000,Mid Life,Experienced Level,White
6,Publishing,Publishing Assistant,USD 33000,USD 2000,USA,Columbia,College degree,Woman,April,2021,USD 35000,Mid Life,Junior Level,White


* lets clean the Gender column

In [28]:
df["Gender"].unique()

array(['Woman', 'Non-binary', 'Man', 'Other or prefer not to answer',
       'Prefer not to answer'], dtype=object)

In [29]:
def gender(value):
    if value =="Woman":
        return "Female"
    elif value =="Man":
        return "Male"
    elif value =="Non-binary":
        return "Non Binary"
    elif value =="Other or prefer not to answer":
        return "Other"
    else:
        return "Prefer not to answer"
df["Gender"]=df["Gender"].apply(gender)

In [30]:
df.head(4)

Unnamed: 0,Job Industry,Job Title,Annual Salary,Monetary Compensation,Occupation Country,Occupation City,Education Level,Gender,Record Month,Record Year,Total Annual Salary,Age Group,Experience Level,Race
0,Education (Higher Education),Research and Instruction Librarian,USD 55000,USD 0,United States,Boston,Master's degree,Female,April,2021,USD 55000,Mid Life,Legendary Level,White
1,Computing or Tech,Change & Internal Communications Manager,GBP 54600,GBP 4000,United Kingdom,Cambridge,College degree,Non Binary,April,2021,GBP 58600,Mid Life,Experienced Level,White
3,Nonprofits,Program Manager,USD 62000,USD 3000,USA,Milwaukee,College degree,Female,April,2021,USD 65000,Mid Life,Experienced Level,White
4,"Accounting, Banking & Finance",Accounting Manager,USD 60000,USD 7000,US,Greenville,College degree,Female,April,2021,USD 67000,Mid Life,Experienced Level,White


In [31]:
df["Occupation City"].unique()

array(['Boston', 'Cambridge', 'Milwaukee', ..., 'Shenzhen',
       'Bennettsville', 'Jhonston'], dtype=object)