<a href="https://colab.research.google.com/github/Pranav-Bhatlapenumarthi/Exploratory-Data-Analysis-EDA/blob/main/Pandas_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Exploratory Data Analysis using Pandas**

Dataset used: Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1

Source: https://oscarbaruffa.com/messy/

Step 1: Importing Pandas library and increasing maximum number of rows to be displayed

In [1]:
import pandas as pd
print(pd.options.display.max_rows)

pd.options.display.max_rows = 150 #Increasing displayed rows
print(pd.options.display.max_rows)

60
150
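If the higher row limit is only needed for one cell, a scoped alternative is `pd.option_context`, which restores the previous value on exit. The snippet below is a minimal sketch of that idea; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Remember whatever limit is currently configured
default_rows = pd.get_option("display.max_rows")

# Raise the limit only inside the with-block
with pd.option_context("display.max_rows", 150):
    rows_inside = pd.get_option("display.max_rows")

# Outside the block the earlier setting is back in effect
rows_after = pd.get_option("display.max_rows")
print(rows_inside, rows_after == default_rows)
```

Setting `pd.options.display.max_rows` directly, as done above, keeps the new limit for the rest of the session instead.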


Step 2: Read the CSV file using Pandas and print the entire dataset

In [5]:
df = pd.read_csv("/content/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.csv") # Reading the original CSV file
print(df.to_string()) # Printing out the contents of the dataset
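Printing every row can be slow for a large survey file, so a quicker first look is often enough. The sketch below uses a three-row in-memory stand-in (the rows and the name `sample_df` are made up for illustration, not taken from the survey) to show `shape`, `head` and `info`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the survey CSV; these rows are invented
csv_text = """Timestamp,How old are you?,Job title
4/27/2021 11:02:10,25-34,Analyst
4/27/2021 11:02:22,25-34,Manager
4/27/2021 11:02:38,35-44,Engineer
"""
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)    # (number of rows, number of columns)
print(sample_df.head(2))  # first two rows only
sample_df.info()          # column names, non-null counts and dtypes
```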

Step 3: Rename the columns (the original column names are full survey questions, which are too verbose to work with)

In [None]:
df.rename(columns={ # Renaming the columns
    "How old are you?": "Age",
    "What industry do you work in?": "Industry",
    "Job title": "Job_Title",
    "If your job title needs additional context, please clarify here:": "Job_Title_Context",
    "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)": "Annual_Salary",
    "How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.": "Additional compensation",
    "Please indicate the currrency": "Currency",
    "What country do you work in?": "Country",
    "What city do you work in?": "City",
    "How many years of professional work experience do you have overall?": "Years of experience",
    "How many years of professional work experience do you have in your field?": "Years of experience in field",
    "What is your highest level of education completed?": "Formal education",
    "What is your gender?": "Gender",
    "What is your race? (Choose all that apply.)": "Race"
}, inplace = True)
print(df.columns)
print(df.head())
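By default `rename` silently skips mapping keys that do not match any column, so a mistyped survey question would go unnoticed. One way to guard against that, sketched here on a made-up one-row frame named `toy`, is to pass `errors="raise"`.

```python
import pandas as pd

# Toy frame with two of the survey-style column names (values are invented)
toy = pd.DataFrame({"How old are you?": ["25-34"], "Job title": ["Analyst"]})

mapping = {"How old are you?": "Age", "Job title": "Job_Title"}
renamed = toy.rename(columns=mapping, errors="raise")
print(list(renamed.columns))

# A key that is not an existing column raises KeyError instead of being skipped
try:
    toy.rename(columns={"Job Title": "Job_Title"}, errors="raise")
    missing_key_detected = False
except KeyError:
    missing_key_detected = True
print(missing_key_detected)
```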


Step 4: Drop the columns in which every value is null (NaN)

In [None]:
df.drop(['Job_Title_Context', 'If "Other," please indicate the currency here: ', 'If your income needs additional context, please provide it here:'], axis=1, inplace=True)
print(df.columns)
print(df.head())
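Rather than listing such columns by hand, they can also be found programmatically. The following is a small sketch on an invented frame (`toy` and its values are assumptions for illustration) that identifies fully empty columns and drops them with `dropna(axis=1, how="all")`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": ["25-34", "35-44"],
    "Notes": [np.nan, np.nan],   # entirely empty column
    "City": ["Boston", np.nan],  # only partly empty, so it is kept
})

# Names of the columns in which every value is null
all_null = toy.columns[toy.isna().all()].tolist()
print(all_null)

# Drop only the columns that are completely empty
trimmed = toy.dropna(axis=1, how="all")
print(list(trimmed.columns))
```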

Step 5: Converting data to appropriate datatype to aid further analysis

In [30]:
df["Annual_Salary"] = df["Annual_Salary"].str.replace(',','').astype(int)
df["Timestamp"] = pd.to_datetime(df["Timestamp"])

print(type(df.loc[1, "Timestamp"]))
print(type(df.loc[1, "Annual_Salary"]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>
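`astype(int)` fails as soon as one entry is not a clean number. A more forgiving variant, sketched below on invented values, is `pd.to_numeric(..., errors="coerce")`, which turns unparseable entries into NaN. The explicit timestamp format is an assumption about how the survey export writes dates.

```python
import pandas as pd

# Toy values; the second salary is deliberately not a number
toy = pd.DataFrame({
    "Timestamp": ["4/27/2021 11:02:10", "4/27/2021 11:02:22"],
    "Annual_Salary": ["55,000", "not sure"],
})

# Strip thousands separators, then convert; bad entries become NaN
cleaned = toy["Annual_Salary"].str.replace(",", "", regex=False)
toy["Annual_Salary"] = pd.to_numeric(cleaned, errors="coerce")

# Assumed month/day/year layout; adjust if the real file differs
toy["Timestamp"] = pd.to_datetime(toy["Timestamp"], format="%m/%d/%Y %H:%M:%S")

print(toy.dtypes)
```

The resulting NaN entries can then be handled together with the other empty cells.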


Step 6: Cleaning empty cells

For numeric columns, we replace the null values with the mean of the column.

For string columns, we remove the entire row (removing a few rows should not affect the analysis for such a large dataset).

In [None]:
annual_salary_mean = df["Annual_Salary"].mean()
addn_compensation_mean = df["Additional compensation"].mean()

# Assign the result back: fillna(inplace=True) on a selected column may not update df
df["Annual_Salary"] = df["Annual_Salary"].fillna(annual_salary_mean)
df["Additional compensation"] = df["Additional compensation"].fillna(addn_compensation_mean)

null_columns = df.isnull().any() # Checking if any null values exist
print(null_columns)

df.dropna(inplace = True) # Removing rows having empty values
print(df.isnull().any()) # Re-checking after the rows are dropped
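The two cleaning rules can be checked on a tiny example. This is a minimal sketch with made-up values (the frame `toy` is not part of the survey data): the numeric gap is filled with the column mean, and the row with a missing string is dropped.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Additional compensation": [1000.0, np.nan, 3000.0],
    "Gender": ["Woman", "Man", np.nan],
})

# mean() skips NaN, so this is the mean of 1000 and 3000
mean_value = toy["Additional compensation"].mean()
toy["Additional compensation"] = toy["Additional compensation"].fillna(mean_value)

# Drop the rows that still contain a null (here: the missing Gender)
toy = toy.dropna()
remaining_nulls = int(toy.isna().sum().sum())
print(len(toy), remaining_nulls)
```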

Step 7: Importing Matplotlib and Seaborn for graphical plots

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
print(sns.__version__)
print(df.columns)

Step 8: Finally, analysing numeric values graphically

In [None]:
col_study = ['Annual_Salary', 'Additional compensation'] # the columns that we want to focus on
sns.pairplot(df[col_study], height = 2.5)
plt.show()
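A numeric summary can be read alongside the pair plot. The sketch below runs `describe` and `corr` on an invented four-row frame (the values are illustrative only, chosen to be perfectly linear); on the real data the same two calls would be applied to `df[col_study]`.

```python
import pandas as pd

# Toy numbers, not survey results
toy = pd.DataFrame({
    "Annual_Salary": [40000, 60000, 80000, 100000],
    "Additional compensation": [0.0, 2000.0, 4000.0, 6000.0],
})
col_study = ["Annual_Salary", "Additional compensation"]

summary = toy[col_study].describe()  # count, mean, std, quartiles per column
correlation = toy[col_study].corr()  # pairwise Pearson correlation

print(summary)
print(correlation)
```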