# A Basic Data Analysis of Employee Dataset using Python

## Objective
The objective of this project is to perform a basic data analysis on an employee dataset using Python.
This project focuses on understanding how to load a CSV file and explore the dataset using basic Pandas functions.

## Importing Pandas Library

Pandas is a Python library used for data manipulation and analysis.
In this project, Pandas is used to read the CSV file and explore the employee dataset.
The library is imported under the alias `pd`, which keeps the code shorter and easier to read.

In [4]:
import pandas as pd

## Loading the Dataset

In this step, the employee dataset stored in a CSV file is loaded into the program.
The `read_csv()` function of Pandas is used to read the CSV file.
The data from the file is stored in a DataFrame, which is a tabular data structure with rows and columns.

In [5]:
# Raw string (r"...") so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r"D:\Projects\Salary_Data.csv")
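
The loading step can be tried without the original file. The sketch below is only an illustration: it assumes an in-memory CSV (the hypothetical `csv_text`, with three rows typed by hand to mirror the dataset's columns) in place of `Salary_Data.csv`, and it shows why a raw string is a safe way to write a Windows path.

```python
import io

import pandas as pd

# Stand-in for Salary_Data.csv: three hand-typed rows (an assumption, not the real file).
csv_text = """Age,Gender,Education Level,Job Title,Years of Experience,Salary
32,Male,Bachelor's,Software Engineer,5,90000
28,Female,Master's,Data Analyst,3,65000
45,Male,PhD,Senior Manager,15,150000
"""

# read_csv accepts a file path or any file-like object, so StringIO works for a quick test.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)  # (3, 6)

# In a raw string every backslash is kept literally, which suits Windows paths.
windows_path = r"D:\Projects\Salary_Data.csv"
print(windows_path)
```

Forward slashes (`"D:/Projects/Salary_Data.csv"`) are another common way to write the same path.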

## Displaying the Dataset

After loading the dataset, it is important to view the data to ensure it has been loaded correctly.
In this cell, the DataFrame is displayed to show the employee data in a tabular format.

In [6]:
df

|      | Age  | Gender | Education Level   | Job Title             | Years of Experience | Salary   |
|------|------|--------|-------------------|-----------------------|---------------------|----------|
| 0    | 32.0 | Male   | Bachelor's        | Software Engineer     | 5.0                 | 90000.0  |
| 1    | 28.0 | Female | Master's          | Data Analyst          | 3.0                 | 65000.0  |
| 2    | 45.0 | Male   | PhD               | Senior Manager        | 15.0                | 150000.0 |
| 3    | 36.0 | Female | Bachelor's        | Sales Associate       | 7.0                 | 60000.0  |
| 4    | 52.0 | Male   | Master's          | Director              | 20.0                | 200000.0 |
| ...  | ...  | ...    | ...               | ...                   | ...                 | ...      |
| 6699 | 49.0 | Female | PhD               | Director of Marketing | 20.0                | 200000.0 |
| 6700 | 32.0 | Male   | High School       | Sales Associate       | 3.0                 | 50000.0  |
| 6701 | 30.0 | Female | Bachelor's Degree | Financial Manager     | 4.0                 | 55000.0  |
| 6702 | 46.0 | Male   | Master's Degree   | Marketing Manager     | 14.0                | 140000.0 |


## Checking the Type of the Dataset

In this step, the type of the dataset is checked to understand the data structure.
The dataset loaded using Pandas is stored in a DataFrame.
The `type()` function is used to identify the type of the object stored in the variable `df`.

In [8]:
type(df)

pandas.core.frame.DataFrame

## Viewing First Five Records using head()

The `head()` method displays the first rows of the dataset.
This helps in understanding the structure, column names, and sample values of the data.
By default, `head()` returns the first five records; a different count can be passed as an argument, for example `head(10)`.

In [9]:
df.head()

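|   | Age  | Gender | Education Level | Job Title         | Years of Experience | Salary   |
|---|------|--------|-----------------|-------------------|---------------------|----------|
| 0 | 32.0 | Male   | Bachelor's      | Software Engineer | 5.0                 | 90000.0  |
| 1 | 28.0 | Female | Master's        | Data Analyst      | 3.0                 | 65000.0  |
| 2 | 45.0 | Male   | PhD             | Senior Manager    | 15.0                | 150000.0 |
| 3 | 36.0 | Female | Bachelor's      | Sales Associate   | 7.0                 | 60000.0  |
| 4 | 52.0 | Male   | Master's        | Director          | 20.0                | 200000.0 |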


## Viewing Last Five Records using tail()

The `tail()` method displays the last rows of the dataset.
This helps in checking the ending records and confirming that the file was read through to its final row; it does not by itself show whether any values are missing.
By default, `tail()` returns the last five records.

In [10]:
df.tail()

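|      | Age  | Gender | Education Level   | Job Title             | Years of Experience | Salary   |
|------|------|--------|-------------------|-----------------------|---------------------|----------|
| 6699 | 49.0 | Female | PhD               | Director of Marketing | 20.0                | 200000.0 |
| 6700 | 32.0 | Male   | High School       | Sales Associate       | 3.0                 | 50000.0  |
| 6701 | 30.0 | Female | Bachelor's Degree | Financial Manager     | 4.0                 | 55000.0  |
| 6702 | 46.0 | Male   | Master's Degree   | Marketing Manager     | 14.0                | 140000.0 |
| 6703 | 26.0 | Female | High School       | Sales Executive       | 1.0                 | 35000.0  |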
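
Both `head()` and `tail()` accept an optional row count. The following is a minimal sketch on a hypothetical eight-row frame named `demo` (not the employee dataset), showing how the count changes what is returned.

```python
import pandas as pd

# Hypothetical 8-row frame used only for illustration.
demo = pd.DataFrame({"Age": [32, 28, 45, 36, 52, 49, 32, 30]})

first_three = demo.head(3)  # first 3 rows instead of the default 5
last_two = demo.tail(2)     # last 2 rows instead of the default 5

print(first_three)
print(last_two)
```

The original index labels are kept, so the rows returned by `tail(2)` here are still labelled 6 and 7.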


## Checking the Shape of the Dataset

The `shape` attribute is used to find the number of rows and columns in the dataset.
It returns the output in the form of (rows, columns).
This helps in understanding the size of the dataset.

In [11]:
df.shape

(6704, 6)
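
Because `shape` is an ordinary Python tuple, it can be unpacked into separate variables. The sketch below uses a small hypothetical frame `demo`; the variable names `n_rows` and `n_cols` are chosen for illustration.

```python
import pandas as pd

# Small hypothetical frame; values are for illustration only.
demo = pd.DataFrame({
    "Age": [32.0, 28.0, 45.0],
    "Salary": [90000.0, 65000.0, 150000.0],
})

n_rows, n_cols = demo.shape  # shape is a tuple: (rows, columns)
print(f"{n_rows} rows, {n_cols} columns")
print(len(demo))             # len() of a DataFrame is its row count
```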

## Viewing Column Names

In this step, the column names of the dataset are displayed.
Column names represent different attributes of employee data such as age, gender, education level, and salary.
The `columns` attribute is used to list all column names present in the dataset.

In [12]:
df.columns

Index(['Age', 'Gender', 'Education Level', 'Job Title', 'Years of Experience',
       'Salary'],
      dtype='object')

## Accessing a Single Column

In this step, a single column is accessed from the dataset.
Column access allows us to view and analyze data related to a specific attribute.
Here, the 'Age' column is accessed from the DataFrame.

In [13]:
df['Age']

0       32.0
1       28.0
2       45.0
3       36.0
4       52.0
        ... 
6699    49.0
6700    32.0
6701    30.0
6702    46.0
6703    26.0
Name: Age, Length: 6704, dtype: float64
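
Selecting with a single column name returns a Series, while selecting with a list of names returns a DataFrame. A minimal sketch on a hypothetical frame `demo` that reuses two of the dataset's column names:

```python
import pandas as pd

# Hypothetical frame; values are for illustration only.
demo = pd.DataFrame({
    "Age": [32.0, 28.0, 45.0],
    "Salary": [90000.0, 65000.0, 150000.0],
})

one_column = demo["Age"]               # single label  -> Series
two_columns = demo[["Age", "Salary"]]  # list of labels -> DataFrame

print(one_column)
print(two_columns)
```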

## Checking Data Type of a Column

In this step, the Python type of a single column is checked.
The `type()` function returns the class of the object, which shows that a column selected from a DataFrame is a Pandas `Series`.
The data type of the values inside the column (here `float64`) is a separate property, available through the `dtype` attribute.

In [14]:
type(df['Age'])

pandas.core.series.Series
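
The difference between the container type and the element type can be seen side by side. This is a small sketch on a hypothetical two-row frame `demo`: `type()` reports the class of the object, while `dtype` and `dtypes` report how the values are stored.

```python
import pandas as pd

# Hypothetical frame; values are for illustration only.
demo = pd.DataFrame({"Age": [32.0, 28.0], "Gender": ["Male", "Female"]})

print(type(demo["Age"]))  # the container: a pandas Series
print(demo["Age"].dtype)  # the element type: float64
print(demo.dtypes)        # element type of every column at once
```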

## Dataset Information using info()

The `info()` function provides a concise summary of the dataset.
It displays the column names, data types, and the number of non-null values in each column.
This helps in understanding the structure and completeness of the dataset.

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6704 entries, 0 to 6703
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Age                  6702 non-null   float64
 1   Gender               6702 non-null   object 
 2   Education Level      6701 non-null   object 
 3   Job Title            6702 non-null   object 
 4   Years of Experience  6701 non-null   float64
 5   Salary               6699 non-null   float64
dtypes: float64(3), object(3)
memory usage: 314.4+ KB
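
The non-null counts reported by `info()` can be turned into explicit missing-value counts with `isna().sum()`. The sketch below assumes a hypothetical three-row frame `demo` with one missing value placed in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
demo = pd.DataFrame({
    "Age": [32.0, np.nan, 45.0],
    "Salary": [90000.0, 65000.0, np.nan],
    "Gender": ["Male", "Female", None],
})

# isna() marks missing cells as True; sum() counts the True values per column.
missing_per_column = demo.isna().sum()
print(missing_per_column)
```

Applied to the employee dataset, the same call would be expected to report 5 missing values for Salary, since `info()` lists 6699 non-null entries out of 6704.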


## Statistical Summary using describe()

The `describe()` function provides statistical information about numerical columns in the dataset.
It shows the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum of each numerical column.
This helps in understanding the distribution of numerical data.

In [16]:
df.describe()

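|       | Age       | Years of Experience | Salary        |
|-------|-----------|---------------------|---------------|
| count | 6702.0    | 6701.0              | 6699.0        |
| mean  | 33.620859 | 8.094687            | 115326.964771 |
| std   | 7.614633  | 6.059003            | 52786.183911  |
| min   | 21.0      | 0.0                 | 350.0         |
| 25%   | 28.0      | 3.0                 | 70000.0       |
| 50%   | 32.0      | 7.0                 | 115000.0      |
| 75%   | 38.0      | 12.0                | 160000.0      |
| max   | 62.0      | 34.0                | 250000.0      |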
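
The result of `describe()` is itself a DataFrame, so individual statistics can be read from it with `.loc`. A minimal sketch, assuming a hypothetical frame `demo` built from the first five rows shown earlier:

```python
import pandas as pd

# The first five Age and Salary values from the dataset, typed in by hand.
demo = pd.DataFrame({
    "Age": [32.0, 28.0, 45.0, 36.0, 52.0],
    "Salary": [90000.0, 65000.0, 150000.0, 60000.0, 200000.0],
})

summary = demo.describe()

# Rows are statistic names and columns are the numeric columns of the frame.
median_salary = summary.loc["50%", "Salary"]
mean_salary = summary.loc["mean", "Salary"]
print(median_salary, mean_salary)
```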


## Accessing Salary Column

In this step, the Salary column is accessed from the dataset.
This allows us to view salary details of employees separately for analysis.

In [17]:
df['Salary']

0        90000.0
1        65000.0
2       150000.0
3        60000.0
4       200000.0
          ...   
6699    200000.0
6700     50000.0
6701     55000.0
6702    140000.0
6703     35000.0
Name: Salary, Length: 6704, dtype: float64

## Checking Data Type of Salary Column

The `type()` function is applied to the Salary column.
As with the Age column, the result is a Pandas `Series`; the values themselves are stored as `float64`, as shown in the output of the previous cell.

In [18]:
type(df['Salary'])

pandas.core.series.Series

## Counting Values using value_counts()

The `value_counts()` method counts how many times each unique value occurs in a column, excluding missing values by default.
Here, it is applied to the Gender column to understand the distribution of employees based on gender.

In [19]:
df['Gender'].value_counts()

Gender
Male      3674
Female    3014
Other       14
Name: count, dtype: int64
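
The three counts above add up to 6702, not 6704, because `value_counts()` skips missing values unless told otherwise. The sketch below uses a hypothetical six-element Series `gender` to show the `dropna` and `normalize` parameters.

```python
import pandas as pd

# Hypothetical column with one missing entry.
gender = pd.Series(["Male", "Female", "Male", None, "Other", "Male"], name="Gender")

counts = gender.value_counts()                    # missing values are skipped
with_missing = gender.value_counts(dropna=False)  # the missing entry is counted too
shares = gender.value_counts(normalize=True)      # fractions of the non-missing values

print(counts)
print(with_missing)
print(shares)
```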

## Calculating Mean of Salary

The mean represents the average value of a numerical column.
In this step, the mean salary of employees is calculated using the `mean()` function.
This helps in understanding the average salary level in the dataset.

In [8]:
df['Salary'].mean()

np.float64(115326.96477086132)

## Calculating Median of Salary

The median represents the middle value of a numerical dataset.
It helps in understanding the central value of salary without being affected by extreme values.
The `median()` function is used to calculate the median salary.

In [9]:
df['Salary'].median()

115000.0
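
The effect of an extreme value on the mean and the median can be checked directly. This is a sketch with hypothetical salaries (the names `base` and `with_outlier` are illustrative), not figures from the employee dataset.

```python
import pandas as pd

# Hypothetical salaries, evenly spread around 60,000.
base = pd.Series([50000.0, 55000.0, 60000.0, 65000.0, 70000.0])

# The same salaries plus one extreme value.
with_outlier = pd.concat([base, pd.Series([1000000.0])], ignore_index=True)

print(base.mean(), base.median())                  # 60000.0 60000.0
print(with_outlier.mean(), with_outlier.median())  # the mean jumps, the median barely moves
```

With the extra value, the mean rises above 200,000 while the median only moves to 62,500.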

## Calculating Mode of Salary

The mode represents the most frequently occurring value in a dataset.
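In this step, the mode of the salary column is calculated using the `mode()` method, which returns a Series rather than a single number because several values can be tied for most frequent.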
This helps in identifying the most common salary value among employees.

In [10]:
df['Salary'].mode()

0    140000.0
Name: Salary, dtype: float64
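
The output above is a one-element Series, not a plain number. A minimal sketch with a hypothetical Series `tied` shows what `mode()` returns when two values are equally frequent, and how to take a single value from the result.

```python
import pandas as pd

# Hypothetical salaries in which two values occur twice each.
tied = pd.Series([50000.0, 50000.0, 60000.0, 60000.0, 70000.0])

modes = tied.mode()         # Series holding every tied value, in ascending order
first_mode = modes.iloc[0]  # a single scalar, if only one number is needed

print(modes)
print(first_mode)
```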

## Finding Minimum Salary

The `min()` function is used to find the lowest salary value in the dataset.
This helps in identifying the minimum salary among employees.

In [11]:
df['Salary'].min()

350.0

## Finding Maximum Salary

The `max()` function is used to find the highest salary value in the dataset.
This helps in understanding the upper range of employee salaries.

In [12]:
df['Salary'].max()

250000.0
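
To see which record holds the lowest or highest salary, `idxmin()` and `idxmax()` return the index label of that row. The sketch below assumes a hypothetical three-row frame `demo`; the job titles and salaries are copied from rows shown earlier.

```python
import pandas as pd

# Three rows typed in by hand for illustration.
demo = pd.DataFrame({
    "Job Title": ["Software Engineer", "Data Analyst", "Director"],
    "Salary": [90000.0, 65000.0, 200000.0],
})

lowest = demo.loc[demo["Salary"].idxmin()]   # full row with the smallest salary
highest = demo.loc[demo["Salary"].idxmax()]  # full row with the largest salary
salary_range = demo["Salary"].max() - demo["Salary"].min()

print(lowest["Job Title"], highest["Job Title"], salary_range)
```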

## Counting Total Number of Records

The `count()` method returns the number of non-null values in a column.
Here it shows that 6699 of the 6704 records have a salary recorded, so 5 salary values are missing; the total number of records is given by `len(df)` or `df.shape[0]`.

In [13]:
df['Salary'].count()

np.int64(6699)
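
The difference between counting non-null values and counting rows can be illustrated on a small example. This sketch uses a hypothetical four-element Series `salary` with one missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing salary.
salary = pd.Series([90000.0, np.nan, 150000.0, 60000.0], name="Salary")

non_null = salary.count()     # values that are present
total = len(salary)           # all rows, missing or not
missing = salary.isna().sum() # values that are absent

print(non_null, total, missing)  # 3 4 1
```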

## Observations

- The employee dataset contains 6,704 records and 6 columns.
- The dataset includes both numerical and categorical attributes such as age, gender, education level, job title, years of experience, and salary.
- The data is well-structured in the form of rows and columns, making it suitable for analysis using Pandas.
- Using functions like `head()` and `tail()`, the initial and ending records of the dataset were examined.
- The `shape` attribute showed the total number of rows and columns present in the dataset.
- The `info()` function provided details about data types and showed that every column has a few missing values (between 2 and 5 out of 6,704 rows).
- The `value_counts()` function helped in analyzing categorical data such as gender distribution.
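- Statistical functions such as mean, median, and mode helped in understanding the salary distribution: the mean (about 115,327) and median (115,000) are close, while the most common salary is 140,000.
- The minimum (350) and maximum (250,000) salary values gave the salary range in the dataset; the minimum is far below the 25th percentile of 70,000 and may be worth checking as a possible data-entry error.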

## Conclusion

This project focused on performing a basic data analysis of an employee dataset using Python and the Pandas library.
The dataset was successfully loaded and explored using fundamental Pandas methods and attributes such as `head()`, `tail()`, `shape`, `columns`, and `info()`.
Basic statistical analysis was performed using `mean()`, `median()`, `mode()`, `min()`, and `max()` to understand salary patterns.
Categorical data analysis was also carried out using `value_counts()`.
Overall, this project provided a clear understanding of how Python and Pandas can be used for basic data exploration and analysis in a structured and systematic manner.
