# Tasks

1. **Basic Data Exploration**: Identify the number of rows and columns in the dataset, determine the data types of each column, and check for missing values in each column.

2. **Descriptive Statistics**: Calculate basic statistics (mean, median, mode, minimum, and maximum salary), determine the range of salaries, and find the standard deviation.

3. **Data Cleaning**: Handle missing data with a suitable method and explain why you chose it.

4. **Basic Data Visualization**: Create histograms or bar charts to visualize the distribution of salaries, and use pie charts to represent the proportion of employees in different departments.

5. **Grouped Analysis**: Group the data by one or more columns and calculate summary statistics for each group, and compare the average salaries across different groups.

6. **Simple Correlation Analysis**: Identify any correlation between salary and another numerical column, and plot a scatter plot to visualize the relationship.

7. **Summary of Insights**: Write a brief report summarizing the findings and insights from the analyses.

In [None]:
import pandas as pd
import numpy as np

#Load dataset
df = pd.read_csv('/content/Salaries.csv')
df.head()

In [None]:
df.columns

In [None]:
# @title 1) Basic Data Exploration
# Identify the number of rows and columns in the dataset
df.shape
# returns (number of rows, number of columns)

In [None]:
#determine the data types of each column
df.dtypes

In [None]:
#check for missing values in each column
df.isnull().sum()
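The raw counts above are easier to judge next to the share of rows they represent, since a column that is entirely empty calls for a different treatment than one with a few gaps. A minimal sketch of such a report follows; the `missing_report` helper and the small `demo` frame are hypothetical stand-ins, and in the notebook the helper would be called on `df` instead.

```python
import pandas as pd

def missing_report(data: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = data.isnull().sum()
    percent = (counts / len(data) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})

# Hypothetical stand-in for the salary data
demo = pd.DataFrame({
    "BasePay": [1000.0, None, 3000.0, 4000.0],  # one gap
    "Notes": [None, None, None, None],          # completely empty column
})
report = missing_report(demo)
print(report)
```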

In [None]:
# @title 2) Descriptive Statistics
df.describe()
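`df.describe()` reports the mean, minimum, maximum, standard deviation, and the median (the 50% row), but not the mode or the range that the task asks for. A minimal sketch of computing all of them for one column follows, assuming `TotalPay` is the salary column of interest; the `salary_stats` helper and the `demo` frame are hypothetical and only illustrate the calls.

```python
import pandas as pd

def salary_stats(salary: pd.Series) -> dict:
    """Return basic descriptive statistics for a numeric salary column."""
    salary = salary.dropna()
    return {
        "mean": salary.mean(),
        "median": salary.median(),
        "mode": salary.mode().iloc[0],  # smallest value if several modes tie
        "min": salary.min(),
        "max": salary.max(),
        "range": salary.max() - salary.min(),
        "std": salary.std(),  # sample standard deviation (ddof=1)
    }

# Hypothetical stand-in for df['TotalPay']
demo = pd.DataFrame({"TotalPay": [50000.0, 60000.0, 60000.0, 80000.0, 150000.0]})
stats = salary_stats(demo["TotalPay"])
for name, value in stats.items():
    print(f"{name}: {value:,.2f}")
```

In the notebook, the same helper could be applied as `salary_stats(df['TotalPay'])`.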

In [None]:
# @title 3) Data Cleaning


numeric_cols = ['BasePay', 'OvertimePay', 'OtherPay', 'TotalPay', 'TotalPayBenefits', 'Year']
# Numerical data: fill gaps with the column mean and write the result back into df
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
# No recorded benefits is treated as zero benefits
df['Benefits'] = df['Benefits'].fillna(0)
# Every value in these two columns is missing, so they carry no information
df = df.drop(columns=['Notes', 'Status'])
print(df.shape)
df.head()
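The task also asks for a reason behind the chosen fill method. One common way to justify it is to look at skewness: the mean is pulled toward extreme salaries, so the median is often the safer fill value for strongly skewed columns. The sketch below illustrates that idea; `choose_fill_value`, the cutoff of 1.0, and the `demo` frame are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

def choose_fill_value(column: pd.Series, skew_threshold: float = 1.0):
    """Pick a fill value: median for strongly skewed columns, mean otherwise.

    skew_threshold is an assumed rule-of-thumb cutoff, not a fixed standard.
    """
    skew = column.skew()  # missing values are skipped by default
    if abs(skew) > skew_threshold:
        return "median", column.median()
    return "mean", column.mean()

# Hypothetical stand-in data
demo = pd.DataFrame({
    "BasePay": [40000.0, 42000.0, 45000.0, None, 300000.0],  # one extreme value
    "Year": [2011.0, 2012.0, 2013.0, 2014.0, None],          # symmetric values
})
for name in demo.columns:
    method, value = choose_fill_value(demo[name])
    demo[name] = demo[name].fillna(value)
    print(f"{name}: filled with {method} = {value}")
```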

In [None]:
# @title 4) Basic Data Visualization
#Histogram to visualize the distribution of salaries
import matplotlib.pyplot as plt
import seaborn as sns

# sns.distplot is deprecated; sns.histplot draws the same histogram
ax = sns.histplot(df['TotalPay'], bins=50)
plt.axvline(x=df['TotalPay'].mean(), color='g', label='mean')
plt.axvline(x=df['TotalPay'].median(), color='orange', label='median')
plt.legend(loc='upper right')
plt.show()

In [None]:
df['JobTitle'].value_counts()

In [None]:
# Pie chart of the proportion of employees per job title
# (the dataset has no department column, so JobTitle is used instead)

plt.figure(figsize=(15, 10))

jobCounts = df['JobTitle'].value_counts()
plt.pie(jobCounts, labels=jobCounts.index, autopct='%1.1f%%', startangle=180, colors=sns.color_palette('pastel'))
plt.title('Proportion of Employees in Different Job Titles')
plt.show()
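A pie chart with one slice per job title is hard to read when there are many distinct titles. One possible workaround, sketched below, keeps only the most frequent titles and sums the rest into a single slice. The `top_n_with_other` helper, the label `Other`, and the sample titles are hypothetical.

```python
import pandas as pd

def top_n_with_other(labels: pd.Series, n: int = 10) -> pd.Series:
    """Count labels, keep the n most frequent, and sum the rest into 'Other'."""
    counts = labels.value_counts()
    top = counts.iloc[:n]
    rest = counts.iloc[n:].sum()
    if rest > 0:
        top = pd.concat([top, pd.Series({"Other": rest})])
    return top

# Hypothetical stand-in for df['JobTitle']
demo_titles = pd.Series(["Nurse"] * 5 + ["Clerk"] * 3 + ["Pilot", "Chef", "Judge"])
pie_counts = top_n_with_other(demo_titles, n=2)
print(pie_counts)
```

In the notebook, the result of `top_n_with_other(df['JobTitle'])` could be passed to `plt.pie(pie_counts, labels=pie_counts.index, autopct='%1.1f%%')`.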



In [None]:
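# Hierarchical clustering on the pay components

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


# Cluster a random sample: a linkage matrix over every row would be too large to compute
sample_df = df.sample(n=500, random_state=42)
# Take the features from the sample so the cluster labels line up with sample_df
features = sample_df[['BasePay', 'OvertimePay', 'OtherPay', 'Benefits']].copy()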
# Handling missing values
imputer = SimpleImputer(strategy='median')
features_imputed = imputer.fit_transform(features)
# Scaling the data
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features_imputed)

from scipy.cluster.hierarchy import dendrogram, linkage
# Generating the linkage matrix
Z = linkage(features_scaled, 'ward')
# Plotting the dendrogram to help decide the number of clusters
plt.figure(figsize=(10, 7))
dendrogram(Z, no_labels=True)
plt.title('Hierarchical Clustering Dendrogram (Sampled Data)')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()

from scipy.cluster.hierarchy import fcluster
# Assuming you decided on a distance threshold or directly specifying the number of clusters
k = 5  # for example, or use a distance threshold in fcluster
clusters = fcluster(Z, k, criterion='maxclust')
# Add cluster assignments back to the sample DataFrame
sample_df['Cluster'] = clusters

# Count the number of occurrences of each cluster
cluster_counts = sample_df['Cluster'].value_counts()
# Create a pie chart
plt.figure(figsize=(10, 7))
cluster_counts.plot.pie(autopct='%1.1f%%', startangle=90)
plt.title('Proportion of Employees by Cluster (Sampled Data)')
plt.ylabel('')  # This hides the y-label
plt.show()

# For hierarchical clustering, calculate the mean or median for each cluster
for c in sample_df['Cluster'].unique():
    print(f"Cluster {c} characteristics:")
    print(sample_df[sample_df['Cluster'] == c][['BasePay', 'OvertimePay', 'OtherPay', 'Benefits']].mean())
    print("\n")

In [None]:
# @title 5) Grouped Analysis

print('The average before grouping:', df['TotalPay'].mean())

print('\n\n#################################################\n\n')

groupedData = df.groupby(['JobTitle', 'BasePay'])
# Summary statistics per group, restricted to a numeric column so that
# mean/median/std are well defined, and printed so the result is visible
print(groupedData['TotalPay'].agg(['mean', 'median', 'std']))

print('\n\n#################################################\n\n')

# Print the average salaries across different groups
print('The average after grouping:', groupedData['TotalPay'].mean())
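Comparing average salaries across groups is usually easier to read with a single categorical key and a minimum group size, because very small groups give unstable means. The sketch below groups by job title only. The `average_pay_by_group` helper, its `min_size` parameter, and the `demo` frame are hypothetical.

```python
import pandas as pd

def average_pay_by_group(data: pd.DataFrame, group_col: str, pay_col: str,
                         min_size: int = 2) -> pd.DataFrame:
    """Mean, median, and count of pay per group, keeping groups with at least min_size rows."""
    summary = data.groupby(group_col)[pay_col].agg(["mean", "median", "count"])
    summary = summary[summary["count"] >= min_size]
    return summary.sort_values("mean", ascending=False)

# Hypothetical stand-in for the salary data
demo = pd.DataFrame({
    "JobTitle": ["Nurse", "Nurse", "Clerk", "Clerk", "Clerk", "Pilot"],
    "TotalPay": [90000.0, 110000.0, 40000.0, 50000.0, 60000.0, 200000.0],
})
by_title = average_pay_by_group(demo, "JobTitle", "TotalPay")
print(by_title)
```

In the notebook, `average_pay_by_group(df, 'JobTitle', 'TotalPay')` would list the job titles from highest to lowest average pay.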




In [None]:
# @title 6) Simple Correlation Analysis
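correlation = df['TotalPay'].corr(df['BasePay'])
print('Correlation between BasePay and TotalPay:', correlation)
plt.figure(figsize=(8, 6))
# x and y follow the axis labels below: BasePay on x, TotalPay on y
plt.scatter(df['BasePay'], df['TotalPay'], color='yellow')
plt.title('Scatter Plot of Total Salary vs Base Salary')
plt.xlabel('BasePay')
plt.ylabel('TotalPay')
plt.grid(True)
plt.show()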
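To decide which numerical column to pair with salary, it can help to compare the correlation of `TotalPay` with every other numeric column at once. A minimal sketch using the Pearson correlation that pandas computes by default follows; the `correlations_with` helper and the `demo` frame are hypothetical.

```python
import pandas as pd

def correlations_with(data: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every other numeric column with the target column."""
    numeric = data.select_dtypes(include="number")
    return numeric.corr()[target].drop(target).sort_values(ascending=False)

# Hypothetical stand-in for the salary data
demo = pd.DataFrame({
    "BasePay": [40000.0, 50000.0, 60000.0, 70000.0],
    "OvertimePay": [5000.0, 0.0, 5000.0, 0.0],
    "TotalPay": [45000.0, 50000.0, 65000.0, 70000.0],
    "JobTitle": ["Clerk", "Clerk", "Nurse", "Nurse"],  # non-numeric, ignored
})
pay_corr = correlations_with(demo, "TotalPay")
print(pay_corr)
```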

In [None]:
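import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import LabelEncoder

#Load dataset
df = pd.read_csv('/content/Salaries.csv')

# Work on a sample: agglomerative clustering over every row would be too expensive
sample_titles = df[['JobTitle']].sample(n=500, random_state=42).copy()

# Encode categorical feature using label encoding
# Note: the integer codes are arbitrary, so distances between them have no real meaning
label_encoder = LabelEncoder()
sample_titles['JobTitleEncoded'] = label_encoder.fit_transform(sample_titles['JobTitle'])

# Apply hierarchical clustering; distance_threshold=0 builds the full tree and stores distances_
agg_clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=0)
cluster_labels = agg_clustering.fit_predict(sample_titles[['JobTitleEncoded']])

# scipy's dendrogram needs a linkage matrix, not children_ alone:
# each row is [child 1, child 2, merge distance, number of samples in the merged cluster]
n_samples = len(agg_clustering.labels_)
counts = np.zeros(agg_clustering.children_.shape[0])
for i, merge in enumerate(agg_clustering.children_):
    counts[i] = sum(1 if child < n_samples else counts[child - n_samples] for child in merge)
linkage_matrix = np.column_stack(
    [agg_clustering.children_, agg_clustering.distances_, counts]
).astype(float)

# Plot the dendrogram
plt.figure(figsize=(10, 6))
dendrogram(linkage_matrix, labels=sample_titles['JobTitle'].values, leaf_rotation=90)
plt.xlabel("Job Titles")
plt.ylabel("Distance")
plt.title("Dendrogram of Hierarchical Clustering")
plt.show()

# Print cluster labels
print("Cluster Labels:")
print(cluster_labels)
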


# 7) Summary of Insights
**1)** The dataset contains **148654 rows and 13 columns before cleaning** and **148654 rows and 11 columns after cleaning**.

**2)** There are **missing values** in this dataset.

**3)** This dataset describes the financial information of the employees of an organization.

**4) Data Visualization**: Figure 1 (histogram) shows the distribution of salaries (TotalPay), including its range, mean, and median. Figure 2 (pie chart) represents the proportion of employees in different job titles, but it is not practical. The dataset has no column showing the departments, so I worked with my team at the university (AI Team) to cluster the job titles into departments using hierarchical clustering on a sample of 500 rows. The results are shown in Figure 3 and Figure 4.


**5) Grouped Analysis**: I grouped the data by Job Title and Base Salary and compared the mean salaries to understand the salary distribution within the organization. I noticed that the mean salary of some job titles is higher than that of others.

**6) Correlation Analysis**: I chose to measure the correlation between total salary and base salary to find out how strongly the two are connected. The correlation coefficient indicates the strength and direction of the relationship. On average, base salary makes up most of what the organization pays, and there are some outliers.