<font size="6">**Step - 01 ---> Introduction**</font>


<font size = "5">**Data Description** </font>
<font size = "4">**The dataset is derived from the Aspiring Minds Employment Outcome 2015 (AMEO) study, which focuses on employment outcomes among engineering graduates. It includes information such as salary, job titles, job locations, standardized scores in cognitive, technical, and personality skills, and demographic features. The dataset contains approximately 4000 data points with around 40 independent variables, comprising both continuous and categorical data.** </font>
<font size = "5">**Objectives:** </font>

<font size="4">**1. Explore employment outcomes, focusing on salary, job titles, and locations.** </font>
<font size="4">**2. Understand data distribution and identify outliers.** </font>
<font size="4">**3. Investigate relationships between variables.** </font>
<font size="4">**4. Address research questions regarding salary and gender-specialization preferences.** </font>
<font size="4">**5. Provide actionable insights for career development, recruitment, and education in the engineering domain.** </font>






<font size = "6">**Step 2 - Import the data and display the head, shape, and description of the data:** </font>

In [None]:
import pandas as pd
import numpy as np

In [None]:
# load the dataset
data = pd.read_excel("data.xlsx")
data

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.describe()

<font size="5">**Data Manipulation** </font>

<font size="4">**We have to clean and prepare the dataset by addressing missing values, dropping unnecessary columns, encoding categorical variables, preprocessing for modeling if needed, conducting exploratory data analysis (EDA) etc.** </font>
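<font size="4">**A minimal sketch of the first cleaning check, counting missing entries per column. It runs on a small made-up frame (`toy`), not on the AMEO data, and it assumes that -1 and 0 act as placeholder codes for missing entries, as the later cleaning cells suggest. `count_placeholders` is a hypothetical helper, not part of pandas.** </font>

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (all values are made up).
toy = pd.DataFrame({
    "JobCity": ["Bangalore", -1, "Pune", -1],
    "10board": ["cbse", 0, "state board", "cbse"],
    "Salary": [300000, 450000, np.nan, 250000],
})

def count_placeholders(df, placeholders=(-1, 0)):
    """Count NaN cells and cells equal to a placeholder code, per column."""
    return pd.DataFrame({
        "n_nan": df.isna().sum(),
        "n_placeholder": df.isin(list(placeholders)).sum(),
    })

summary = count_placeholders(toy)
print(summary)
```

<font size="4">**On the real data the same helper could be called as `count_placeholders(data)` to decide which columns need cleaning first.** </font>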


In [None]:
data.columns

In [None]:
data.drop(["Unnamed: 0", "DOJ", "DOL", "CollegeCityID", "CollegeCityTier", "Domain", "ComputerProgramming", "ElectronicsAndSemicon", "ComputerScience",
"MechanicalEngg", "ElectricalEngg", "TelecomEngg", "CivilEngg"], axis=1, inplace=True)

In [None]:
data

In [None]:
data.info()

In [None]:
data["Designation"].unique()

In [None]:
data[data["Designation"] == "get"]

In [None]:
data.drop(data[data["Designation"] == "get"].index, inplace = True)
data

In [None]:
data["Designation"].value_counts()

In [None]:
# Replace only designations that are exactly "ase"; a substring replace would also corrupt words such as "database"
data["Designation"] = data["Designation"].replace("ase", "application support engineer")

In [None]:
data[data["Designation"] == "ase"]

In [None]:
res = data["Designation"].unique()
res.sort()
print(len(res))
res
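
<font size="4">**Scanning the sorted list by eye works, but spelling variants can also be surfaced automatically. The sketch below uses `difflib` from the standard library on a few made-up labels. `near_duplicates` and the 0.85 cutoff are illustrative assumptions, and every suggested pair still needs a manual check before it goes into a mapping.** </font>

```python
import difflib

# Made-up labels; on the real data use data["Designation"].unique() instead.
labels = sorted(["software engineer", "software enginner", "software engg",
                 "team lead", "team leader", "web designer"])

def near_duplicates(names, cutoff=0.85):
    """Return pairs of labels whose similarity ratio is at least `cutoff`."""
    pairs = []
    for i, name in enumerate(names):
        # Compare only against later labels so each pair is reported once.
        for match in difflib.get_close_matches(name, names[i + 1:], n=5, cutoff=cutoff):
            pairs.append((name, match))
    return pairs

candidates = near_duplicates(labels)
for pair in candidates:
    print(pair)
```

<font size="4">**Lowering the cutoff finds more abbreviations such as "engg" but also produces more false matches.** </font>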

In [None]:
data["Designation"] = data["Designation"].replace(to_replace = "dotnet developer", value = ".net developer" ,regex = False)
data["Designation"] = data["Designation"].replace(to_replace = ".net web developer", value = ".net developer" ,regex = False)

In [None]:
designation_mapping = {
    'assistant system engineer - trainee': 'assistant system engineer trainee',
    'assistant systems engineer': 'assistant system engineer',
    'associate software engg': 'associate software engineer',
    'business development managerde': 'business development manager',
    'business systems analyst': 'business system analyst',
    'co faculty': 'computer faculty',
    'db2 dba': 'database administrator',
    'dba': 'database administrator',
    'executive engg': 'executive engineer',
    'front end web developer': 'front end developer',
    'graduate trainee engineer': 'graduate engineer trainee',
    
    'hr executive': 'executive hr',
    'jr. software engineer': 'junior software engineer',
    'operations engineer and jetty handling': 'operations engineer',
    'qa analyst': 'quality analyst',
    'qa engineer': 'quality engineer',
    'qa trainee': 'quality trainee',
    'r & d': 'r&d engineer',
    'rf/dt engineer': 'rf engineer',
    'sales and service engineer': 'sales & service engineer',
    'seo': 'seo analyst',
    'software devloper': 'software developer',
    'software eng': 'software engineer',
    'software engg': 'software engineer',
    'software engineere': 'software engineer',
    'software enginner': 'software engineer',
    'software test engineer (etl)': 'software test engineer',
    'software test engineerte': 'software test engineer',
    'sr. engineer': 'senior engineer',
    'team leader': 'team lead',
    'systems analyst': 'system analyst',
    'systems administrator': 'system administrator',
    'testing engineer': 'test engineer',
    'web designer and joomla administrator': 'web designer',
    'web designer and seo': 'web designer',
}

data['Designation'] = data['Designation'].replace(designation_mapping)

In [None]:
len(data["Designation"].unique())

In [None]:
data["JobCity"].value_counts()

In [None]:
data["JobCity"].unique()

In [None]:
data["JobCity"] = data["JobCity"].apply(lambda x : str(x))

In [None]:
data["JobCity"].dtype

In [None]:
data["JobCity"] = data["JobCity"].apply(lambda x : x.replace("-1", "India") if "-1" in x else x )
data["JobCity"]   # data["JobCity"] = data.JobCity

In [None]:
data["JobCity"].value_counts()

In [None]:
# Convert 'JobCity' column to uppercase
data["JobCity"] = data["JobCity"].str.upper()

In [None]:
# Remove leading and trailing whitespaces from 'JobCity' column
data["JobCity"] = data["JobCity"].str.strip()

In [None]:
res = data["JobCity"].unique()
res.sort()
print(len(res))
res

In [None]:
data["JobCity"] = data["JobCity"].replace(to_replace="A-64,SEC-64,NOIDA", value="NOIDA", regex=False)

In [None]:
# Replace specific city names with standardized names
data['JobCity'] = data['JobCity'].replace({
    'AL JUBAIL,SAUDI ARABIA': 'INDIA',
    'AM': 'AMBALA',
    'AMBALA CITY': 'AMBALA',
    'AUSTRALIA': 'INDIA',
    'BANAGALORE': 'BANGALORE',
    'BANAGLORE': 'BANGALORE',
    'ASIFABADBANGLORE': 'BANGALORE',
    'BENGALURU': 'BANGALORE',
    'BANGLORE': 'BANGALORE',
    'BHUBANESHWAR': 'BHUBANESWAR',
    'BHUBNESHWAR': 'BHUBANESWAR',
    'CHENNAI & MUMBAI': 'CHENNAI',
    'CHENNAI, BANGALORE': 'CHENNAI',
    'DELHI/NCR': 'DELHI',
    'DUBAI': 'INDIA',
    'GAZIABAAD': 'GHAZIABAD',
    'GAZIBAAD': 'GHAZIABAD',
    'INDIRAPURAM, GHAZIABAD': 'GHAZIABAD',
    'GANDHINAGAR': 'GANDHI NAGAR',
    'GREATER NOIDA': 'NOIDA',
    'GURAGAON': 'GURGAON',
    'GURGA': 'GURGAON',
    'GURGOAN': 'GURGAON',
    'HDERABAD': 'HYDERABAD',
    'HYDERABAD(BHADURPALLY)': 'HYDERABAD',
    'JEDDAH SAUDI ARABIA': 'INDIA',
    'KOCHI/COCHIN': 'KOCHI',
    'KOCHI/COCHIN, CHENNAI AND COIMBATORE': 'KOCHI',
    'KOLKATA`': 'KOLKATA',
    'KUDANKULAM ,TARAPUR': 'KUDANKULAM',
    'LATUR (MAHARASHTRA )': 'LATUR',
    'LONDON': 'INDIA',
    'METTUR, TAMIL NADU': 'METTUR',
    # MUZAFFARNAGAR (Uttar Pradesh) and MUZAFFARPUR (Bihar) are different cities, so MUZAFFARNAGAR is left unchanged
    'MUZZAFARPUR': 'MUZAFFARPUR',
    'NAVI MUMBAI': 'MUMBAI',
    'NAVI MUMBAI , HYDERABAD': 'MUMBAI',
    'NEW DEHLI': 'DELHI',
    'NEW DELHI': 'DELHI',
    'DEHLI': 'DELHI',
    'NEW DELHI - JAISALMER': 'DELHI',
    'NOUDA': 'NOIDA',
    'PONDI': 'PONDICHERRY',
    'PONDY': 'PONDICHERRY',
    'PUNR': 'PUNE',
    'RAYAGADA, ODISHA': 'ODISHA',
    'ORISSA': 'ODISHA',
    'SADULPUR,RAJGARH,DISTT-CHURU,RAJASTHAN': 'RAJASTHAN',
    'SONIPAT': 'SONEPAT',
    'TIRUPATI': 'TIRUPATHI',
    'TECHNOPARK, TRIVANDRUM': 'TRIVANDRUM',
    'UNA': 'UNNAO',
    'VIZAG': 'VISAKHAPATNAM',
    'VSAKHAPTTNAM': 'VISAKHAPATNAM',
    'KALMAR, SWEDEN': 'INDIA',
})


In [None]:
len(data["JobCity"].unique())

In [None]:
data["Gender"].value_counts()

In [None]:
data["Gender"].unique()

In [None]:
data["10board"].value_counts()

In [None]:
data["10board"].unique()

In [None]:
data["10board"] = data["10board"].astype(str)

# Replace only cells that are exactly "0" (missing board); a substring replace would also alter names containing a 0
data["10board"] = data["10board"].replace("0", "Indian Board of Secondary Education")

In [None]:
data["10board"].value_counts()

In [None]:
data["12board"].value_counts()

In [None]:
data["12board"].unique()

In [None]:
data["12board"] = data["12board"].astype(str)

# Replace only cells that are exactly "0" (missing board); a substring replace would also alter names containing a 0
data["12board"] = data["12board"].replace("0", "Indian Board of Secondary Education")

In [None]:
data["12board"].value_counts()
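
<font size="4">**When a placeholder such as "0" is replaced, it matters whether the replacement acts on whole cell values or on substrings. The sketch below contrasts the two on made-up board names. The label "unknown" and the name "up board 2010" are illustrative assumptions, not values taken from the dataset.** </font>

```python
import pandas as pd

# Made-up board names; "0" is assumed to mark a missing board.
boards = pd.Series(["cbse", "0", "up board 2010", "icse"])

# Substring replacement touches every "0" character, even inside real names.
by_substring = boards.apply(lambda x: x.replace("0", "unknown"))

# Whole-value replacement only changes cells that are exactly equal to "0".
by_value = boards.replace("0", "unknown")

print(by_substring.tolist())
print(by_value.tolist())
```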

In [None]:
data["Degree"].value_counts()

In [None]:
data["Degree"].unique()

In [None]:
data["Specialization"].value_counts()

In [None]:
data["Specialization"] = data["Specialization"].replace(to_replace="electronics & instrumentation eng", value="electronics and instrumentation engineering", regex=False)

In [None]:
data["CollegeState"].value_counts()

In [None]:
data["GraduationYear"].value_counts()

In [None]:
data.drop(data[data["GraduationYear"] == 0].index,inplace=True)
data

In [None]:
data["GraduationYear"].value_counts()

<font size="5">**Cleaned Dataset**</font>


In [None]:
data.head()

<font size="6">**Data Visualization** </font>
<font size="6">**Step 3: Univariate Analysis:** </font>
<font size="5">**I) For numerical columns:** </font>

<font size="4">------>**Plot probability density functions (PDFs), histograms, boxplots, and KDE plots for all educational percentages.** </font>  

<font size="4">------>**Find the outliers in each numerical column.** </font>

<font size="4">------>**Understand the probability and frequency distribution of each numerical column** </font>

<font size="5">**II) For categorical columns:** </font>

<font size="4">------>**Plot countplots to understand the frequency distribution.** </font>
<font size="4">**Mention observations after each plot.** </font>






In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
data.head()

<font size="6">**I) For Numerical Columns:** </font>

<font size="5">**Histogram and PDFs in Passed Out Years:** </font>

<font size="4">---------->**Both Histograms and PDFs provide visualizations of the frequency distribution of numerical data.** </font>
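<font size="4">**The difference between a frequency histogram and a density (PDF-style) histogram can be sketched with `np.histogram`. The years below are made up for illustration. With `density=True` the bar heights are rescaled so that the total bar area equals 1.** </font>

```python
import numpy as np

# Made-up graduation years, for illustration only.
years = np.array([2008, 2009, 2009, 2009, 2010, 2010, 2011])
edges = np.arange(2008, 2013)  # one bin per year: [2008, 2009), ..., [2011, 2012]

counts, _ = np.histogram(years, bins=edges)                 # frequency per bin
density, _ = np.histogram(years, bins=edges, density=True)  # rescaled so the area is 1

print(counts)
print(density)
print((density * np.diff(edges)).sum())
```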


In [None]:
data.describe()

In [None]:
# Histogram and PDFs of 12graduation
plt.figure(figsize=(8, 5))
plt.xticks(rotation=90)
sns.histplot(data["12graduation"], color="red")
plt.title("Distribution of 12th Graduation Year")
plt.show()

# OR
# plt.figure(figsize=(10, 5))
# plt.xticks(rotation=90)
# plt.hist(data["12graduation"], color="red", bins=50)  # Adjust the number of bins as needed
# plt.xlabel("12th Graduation")
# plt.ylabel("Frequency")
# plt.title("Distribution of 12th Graduation Year")
# plt.show()

<font size="4">------>**This histogram shows that most employees completed their 12th grade in 2009.** </font>


In [None]:
# Histogram and PDFs of GraduationYear
plt.figure(figsize=(8, 5))
plt.xticks(rotation=90)
sns.histplot(data["GraduationYear"], color="red")
plt.title("Distribution of Graduation Year")
plt.show()

<font size="4">------>**This histogram shows that most employees graduated in 2013.** </font>

<font size="5">**KDE plots of all educational percentages** </font>

<font size="4">----->**A Kernel Density Estimation (KDE) plot is a non-parametric way to estimate the probability density function of a continuous random variable.** </font>

<font size="4">----->**KDE plot can be used to visualize the distribution of a single continuous variable. It is similar to a histogram but provides a smoother curve instead of discrete bins.** </font>
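<font size="4">**To make the idea concrete, here is a minimal hand-written Gaussian KDE in NumPy. It is only a sketch: `gaussian_kde_1d` is a hypothetical helper, the scores are made up, and the fixed bandwidth of 5.0 is an arbitrary assumption, whereas `sns.kdeplot` picks a bandwidth automatically.** </font>

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Made-up percentages, for illustration only.
scores = np.array([62.0, 70.5, 74.0, 78.0, 81.5, 85.0, 88.0])
grid = np.linspace(20, 130, 1101)
density = gaussian_kde_1d(scores, grid, bandwidth=5.0)

# The estimated density should integrate to roughly 1 over a wide enough grid.
area = (density * (grid[1] - grid[0])).sum()
print(round(area, 3))
```

<font size="4">**A smaller bandwidth follows the individual samples more closely, while a larger one gives a smoother curve.** </font>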

In [None]:
# Kdeplot of 10percentage
plt.figure(figsize=(8, 5))
sns.kdeplot(data["10percentage"], color="k")

<font size="4">------>**This KDE plot illustrates the density of employees' 10th percentage, revealing a left-skewed distribution where the majority of employees have around 85% in their 10th-grade exams.** </font>
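<font size="4">**Skewness read off a plot can be cross-checked numerically. The sketch below uses made-up scores with a long tail toward low values. A negative `skew()` and a mean below the median both point to a left-skewed distribution.** </font>

```python
import pandas as pd

# Made-up percentages with a long tail toward low values.
toy_scores = pd.Series([45, 60, 72, 80, 84, 85, 86, 88, 90, 92])

skewness = toy_scores.skew()  # negative -> left-skewed, positive -> right-skewed
mean, median = toy_scores.mean(), toy_scores.median()

print(round(skewness, 2), mean, median)
```

<font size="4">**On the real data the same check would be `data["10percentage"].skew()`.** </font>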

In [None]:
# Kdeplot of 12percentage
plt.figure(figsize=(8, 5))
sns.kdeplot(data["12percentage"], color="k")

<font size="4">------>**This KDE plot shows the density distribution of employees' 12th-grade percentages. It resembles a normal distribution, with most scores falling between 70% and 80%.** </font>

In [None]:
# Kde plot of collegeGPA
plt.figure(figsize=(8, 5))
sns.kdeplot(data["collegeGPA"], color="k")

<font size="4">------>**This KDE plot shows the density distribution of employees' college GPA. It resembles a normal distribution, with most GPAs falling between 65 and 75.** </font>

<font size="5">**Boxplot in General Intelligence** </font>

In [None]:
# Box plot of English column
plt.figure(figsize=(8, 5))
plt.boxplot(data["English"])  # sns.boxplot(data["English"])
plt.title("English")
plt.show()

<font size = "4">------->**This box plot shows the distribution of marks in the English column for each employee, indicating the presence of numerous outliers, both at the high and low extremes, suggesting substantial variability in the English scores among employees.** </font>

In [None]:
# Box plot of Logical column
plt.figure(figsize=(8, 5))
plt.boxplot(data["Logical"])  
plt.title("Logical")
plt.show()

<font size = "4">------->**This box plot illustrates the distribution of marks in the Logical column for each employee, revealing the presence of numerous outliers, both at the high and low extremes, indicating considerable variability in Logical scores among employees.** </font>

In [None]:
# Box plot of Quant column
plt.figure(figsize=(8, 5))
plt.boxplot(data["Quant"]) 
plt.title("Quant")
plt.show()

<font size = "4">------->**This box plot shows the distribution of marks in the Quant column, which has many outliers at both the high and low extremes.** </font>

In [None]:
# Box plot of Salary column
plt.figure(figsize=(8, 5))
plt.boxplot(data["Salary"]) 
plt.title("Salary")
plt.show()

<font size = "4">------->**This box plot shows the distribution of the Salary column, which has outliers at the high extreme only. These outliers indicate that some individuals earn exceptionally high salaries compared to the rest of the group.** </font>

<font size = "5">**Boxplot of all numerical columns** </font>

In [None]:
## Boxplot of all numerical columns
numerical_cols = data.select_dtypes(include=['int64', 'float64']).columns
for col in numerical_cols:
    plt.figure(figsize=(8, 6))
    plt.boxplot(x=data[col])
    plt.title(col)
    plt.show()

<font size = "5">**After performing the box plot for each column, it can be observed that:** </font>

<font size = "5">**Columns with Both Low and High Extreme Outliers:** </font>

<font size = "4">----->**12graduation: This column exhibits outliers at both the low and high extremes.** </font>

<font size = "4">----->**collegeGPA: Similar to 12graduation, it displays outliers at both ends of the distribution.** </font>

<font size = "4">----->**English: There are outliers present at both high and low extremes.** </font>

<font size = "4">----->**Logical: This column also shows outliers at both ends of the distribution.** </font>

<font size = "4">----->**Quant: Similar to Logical, it has outliers at both high and low extremes.** </font>

<font size = "5">**Columns with No Outliers:** </font>

<font size = "4">----->**ID: There are no outliers observed in this column.** </font>

<font size = "4">----->**CollegeID: Similarly, this column does not contain any outliers.** </font>
 
<font size = "5">**Columns with High Extreme Outliers Only:** </font>

<font size = "4">----->**Salary: Outliers are predominantly located at the high extreme end, indicating significantly higher salaries compared to the majority of employees.** </font>

<font size = "5">**Columns with Low Extreme Outliers Only:** </font>

<font size = "4">----->**10percentage: Outliers are present only at the low extreme end.** </font>

<font size = "4">----->**12percentage: Similar to 10percentage, outliers are located at the low extreme.** </font>

<font size = "4">----->**GraduationYear: This column exhibits outliers solely at the low extreme.** </font>

<font size = "4">**These observations provide insights into the distribution and presence of outliers across different columns in the dataset.** </font>
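<font size="4">**The visual reading of the box plots can be backed by counts. The sketch below applies the usual 1.5 × IQR whisker rule, which is also the default whisker length in matplotlib box plots, to a made-up frame. `iqr_outlier_counts` is a hypothetical helper, not a pandas function.** </font>

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values below Q1 - k*IQR and above Q3 + k*IQR for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    return pd.DataFrame({
        "low": (numeric < q1 - k * iqr).sum(),
        "high": (numeric > q3 + k * iqr).sum(),
    })

# Made-up values: one very high salary, one very low percentage.
toy = pd.DataFrame({
    "Salary": [200000, 250000, 300000, 320000, 350000, 4000000],
    "10percentage": [40.0, 78.0, 80.0, 82.0, 85.0, 88.0],
})
outliers = iqr_outlier_counts(toy)
print(outliers)
```

<font size="4">**Calling `iqr_outlier_counts(data)` on the cleaned dataset would give one row per numerical column, which makes the grouping above easy to verify.** </font>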

<font size="6">**II) For categorical columns:** </font>

<font size="4">------>**Plot countplots to understand the frequency distribution.** </font>

In [None]:
# Count plot of JobCity column with color variation and adjusted label visibility
plt.figure(figsize=(20, 50))
plt.xticks(rotation=90)  # Rotate x-axis labels vertically
sns.countplot(y = data["JobCity"], palette="Set3")  # Set color palette 
plt.xlabel("Frequency")
plt.ylabel("Job City")
plt.title("Frequency Distribution of Job City")
plt.tight_layout()  # Adjust layout to improve label visibility
plt.show()


<font size = "4">----->**This countplot indicates that Bangalore has the highest number of employees, while Rajpura has relatively fewer employees.** </font>

In [None]:
# Count plot of Degree column with color variation
plt.figure(figsize=(10, 5))
plt.xticks(rotation=90)  # Rotate x-axis labels vertically
sns.countplot(x = data["Degree"], palette="Set3")  # Set color palette 
plt.title("Frequency Distribution of Degree")
plt.show()


<font size = "4">----->**The countplot shows that the majority of employees hold a BTech/BE degree, while only a few have an MSc degree. The plot also gives insight into the distribution of the other degree streams.** </font>

In [None]:
# Count plot of CollegeState column with color variation 
plt.figure(figsize=(10, 5))
plt.xticks(rotation=90)  # Rotate x-axis labels vertically
sns.countplot(x = data["CollegeState"], palette="Set3")  # Set color palette 
plt.title("Frequency Distribution of CollegeState")
plt.show()


<font size = "4">----->**This Countplot indicates the distribution of employees based on the states where they pursued their education. It reveals that the highest number of employees are from Uttar Pradesh, while there are relatively fewer employees from Meghalaya.** </font>

<font size = "6">**Step 4: Bivariate Analysis:** </font>

<font size = "4">----->**Explore relationships between numerical columns using scatter plots, hexbin plots, pair plots, etc.** </font>

<font size = "4">----->**Investigate patterns between categorical and numerical columns using swarm plots, boxplots, barplots, etc.** </font>

<font size = "4">----->**Identify relationships between categorical columns using stacked bar plots.** </font>

<font size = "4">----->**Mention observations after each plot.** </font>


<font size = "5">**Explore relationships between numerical columns using scatter plots, hexbin plots, pair plots, etc.** </font>


In [None]:
# Scatterplot on Specialization,Salary comparing the Gender

plt.figure(figsize = (10, 5))
plt.xticks(rotation=90)
sns.scatterplot(x = "Specialization", y = "Salary", data = data, hue = "Gender")

<font size = "4" >------->**This scatter plot visualizes the relationship between employee specialization and salary, highlighting the gender distribution among employees. It allows us to observe how salaries vary across different specializations, while also showing the proportion of male and female employees within each specialization category.** </font>

In [None]:
# Scatter plot of Salary vs. CollegeGPA
plt.figure(figsize=(8, 6))
plt.scatter(data["Salary"], data["collegeGPA"], alpha=0.5)
plt.xlabel("Salary")
plt.ylabel("college GPA")
plt.title("Scatter Plot of Salary vs. College GPA")
plt.show()

<font size = "4" >------->**The scatter plot shows salary versus college GPA. Clusters of dots suggest similar salaries across GPAs at lower levels, while fewer dots are scattered at higher salaries, indicating a trend of fewer individuals earning higher incomes regardless of GPA.** </font>

In [None]:
# Hexbin plot of Salary vs. 10percentage
plt.figure(figsize=(8, 6))
plt.hexbin(data["Salary"], data["10percentage"], gridsize=20, cmap='Blues')
plt.xlabel("Salary")
plt.ylabel("10th Percentage")
plt.title("Hexbin Plot of Salary vs. 10th Percentage")
plt.colorbar(label='count in bin')
plt.show()

<font size = "4" >------->**The hexbin plot illustrates the correlation between salary and 10th percentage using hexagonal bins. Darker shades of blue indicate higher counts, revealing regions where salary and 10th percentage are more densely distributed.** </font>
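<font size="4">**A plot shows where the points concentrate, but the strength of the relationship is easier to judge from a correlation coefficient. The sketch below computes Pearson and Spearman correlations on made-up rows. Spearman is obtained here as the Pearson correlation of the ranks, which needs no extra library.** </font>

```python
import pandas as pd

# Made-up rows, for illustration only; the column names match the dataset.
toy = pd.DataFrame({
    "Salary": [200000, 250000, 300000, 320000, 350000, 4000000],
    "10percentage": [70.0, 75.0, 72.0, 85.0, 88.0, 80.0],
})

# Pearson measures linear association and is sensitive to extreme salaries.
pearson = toy["Salary"].corr(toy["10percentage"])

# Spearman is the Pearson correlation of the ranks, so outliers matter less.
spearman = toy["Salary"].rank().corr(toy["10percentage"].rank())

print(round(pearson, 2), round(spearman, 2))
```

<font size="4">**Because Salary has strong high-end outliers, the rank-based value is usually the safer summary for this kind of data.** </font>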

In [None]:
# Pair plot of numerical columns
sns.pairplot( data, x_vars=["10percentage", "12percentage", "collegeGPA"],
                  y_vars=["10percentage", "12percentage","collegeGPA"],
                  hue="Gender")

<font size = "4" >------->**This pair plot shows the pairwise relationships between the 10th percentage, 12th percentage, and college GPA, with points colored by gender so that male and female employees can be compared.** </font>

<font size = "5">**Investigate patterns between categorical and numerical columns using swarm plots, boxplots, barplots, etc.** </font>


In [None]:
sns.swarmplot(data = data, x='Degree', y='collegeGPA')


In [None]:
# Swarm plot of Degree vs. Salary
plt.figure(figsize=(10, 8))
sns.swarmplot(x="Degree", y="Salary", data=data)
plt.xlabel("Degree")
plt.ylabel("Salary")
plt.title("Swarm Plot of Degree vs. Salary")
plt.xticks(rotation=45)
plt.show()

# Box plot of Degree vs. Salary
plt.figure(figsize=(10, 8))
sns.boxplot(x="Degree", y="Salary", data=data)
plt.xlabel("Degree")
plt.ylabel("Salary")
plt.title("Box Plot of Degree vs. Salary")
plt.xticks(rotation=45)
plt.show()

# Bar plot of Specialization vs. Salary
plt.figure(figsize=(10, 8))
sns.barplot(x="Specialization", y="Salary", data=data)
plt.xlabel("Specialization")
plt.ylabel("Salary")
plt.title("Bar Plot of Specialization vs. Salary")
plt.xticks(rotation=45)
plt.show()

# Grouped bar plot of mean Salary by Degree and Specialization
# (sns.barplot with hue draws side-by-side bars, not stacked ones)
plt.figure(figsize=(10, 8))
sns.barplot(x="Degree", y="Salary", hue="Specialization", data=data, ci=None)
plt.xlabel("Degree")
plt.ylabel("Salary")
plt.title("Grouped Bar Plot of Degree and Specialization vs. Salary")
plt.xticks(rotation=45)
plt.legend(title="Specialization")
plt.show()
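
<font size="4">**For the relationship between two categorical columns, a stacked bar plot can be built from a cross-tabulation. The sketch below uses made-up rows. The column names match the dataset, but the category values, including the gender codes "m" and "f", are assumptions for illustration.** </font>

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "Specialization": ["computer engineering", "computer engineering", "civil engineering",
                       "civil engineering", "civil engineering", "information technology"],
    "Gender": ["m", "f", "m", "m", "f", "f"],
})

# Rows: specialization, columns: gender, cells: number of people.
counts = pd.crosstab(toy["Specialization"], toy["Gender"])

# stacked=True places the gender counts on top of each other within each bar.
ax = counts.plot(kind="bar", stacked=True, figsize=(8, 5))
ax.set_xlabel("Specialization")
ax.set_ylabel("Count")
ax.set_title("Specialization by Gender (stacked)")
plt.tight_layout()
```

<font size="4">**On the real data the equivalent call would be `pd.crosstab(data["Specialization"], data["Gender"])`. Passing `normalize="index"` to `pd.crosstab` turns the counts into within-specialization proportions.** </font>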