# 1. Importing Essentials and Exploring the Data

**Libraries**

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("ggplot")
#pd.set_option('display.max_columns', 200)



**Importing CSV file**


In [None]:
df = pd.read_csv('../input/employee-dataset/Employee.csv')

**Data Preparation**

In [None]:
df.head()

**df.columns** returns the column names

In [None]:
df.columns


**df.dtypes** returns the data type of each column

In [None]:
df.dtypes

**Making changes as needed** (optional)

In [None]:
#Change type:
df['Age'] = df['Age'].astype(float)

df.dtypes
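
As a minimal sketch of what the type change does, the cell below casts an integer column to float on a small made-up frame. The values are illustrative only and are not taken from Employee.csv.

```python
import pandas as pd

# Hypothetical toy data; the notebook itself reads Employee.csv.
toy = pd.DataFrame({"Age": [34, 28, 38], "City": ["Pune", "Bangalore", "Pune"]})

before = toy["Age"].dtype               # integer dtype inferred from the values
toy["Age"] = toy["Age"].astype(float)   # astype returns a new Series, so assign it back
after = toy["Age"].dtype                # float64 after the cast

print(before, "->", after)
```

The values themselves are unchanged; only their storage type differs.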

In [None]:
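#Rename columns
#rename() returns a new DataFrame; the result is not assigned back to df
#because the later cells still use the original column names
df.rename(columns={'PaymentTier': 'PaymentLeague',
                   'ExperienceInCurrentDomain': 'ExperienceInDomain'})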
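
Note that `rename` returns a new DataFrame rather than modifying the original, so the cell above leaves `df` unchanged unless the result is assigned. A small sketch on a made-up frame (the values are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame using two of the dataset's column names.
toy = pd.DataFrame({"PaymentTier": [3, 1], "ExperienceInCurrentDomain": [0, 3]})

renamed = toy.rename(columns={"PaymentTier": "PaymentLeague",
                              "ExperienceInCurrentDomain": "ExperienceInDomain"})

print(list(toy.columns))      # the original names are untouched
print(list(renamed.columns))  # the new names live on the returned copy
```

To keep the new names, assign the result back (`df = df.rename(...)`) and update any later code that refers to the old names.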

**Data Exploration**

**Data insights** | df.describe()

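The describe function summarizes each numeric column (count, mean, standard deviation, minimum, quartiles and maximum), which gives a quick feel for the range and spread of the data.

These summary statistics help us interpret the data and decide what to look at next.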

For example, the mean of the Age column gives the **average employee age** in this data set, which is around **29.3 years**.


In [None]:
df.describe()
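
To show how the summary table can be read, here is a minimal sketch on made-up ages (not the real data) that picks the mean out of `describe()` and compares it with a direct calculation.

```python
import pandas as pd

# Hypothetical ages, chosen so the statistics are easy to check by hand.
toy = pd.DataFrame({"Age": [25.0, 30.0, 35.0, 40.0]})

summary = toy.describe()                      # rows: count, mean, std, min, 25%, 50%, 75%, max
mean_from_describe = summary.loc["mean", "Age"]
mean_direct = toy["Age"].mean()

print(mean_from_describe, mean_direct)
```

Both values agree because the `mean` row of `describe()` is the column's arithmetic mean.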

**Checking for null values, duplicates etc**

In [None]:
df.isnull().sum()



The **isna()** function detects missing values (**isnull()**, used above, is an alias for it).

It returns a DataFrame of boolean values (True for missing values, False otherwise).

In [None]:
#Boolean mask of missing values
df.isna()

#Number of missing values per column
df.isna().sum()
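
A minimal sketch of the boolean mask and the per-column count, using a made-up frame with two deliberately missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
toy = pd.DataFrame({"Age": [34.0, np.nan, 38.0],
                    "City": ["Pune", None, "Bangalore"]})

mask = toy.isna()                      # True where a value is missing
missing_per_column = toy.isna().sum()  # True counts as 1 when summed

print(mask)
print(missing_per_column)
```

Summing the mask works because `True` is treated as 1 and `False` as 0.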

**df.info()** A helpful snapshot of the structure and data types

In [None]:
df.info()

**Duplicates**

In [None]:
#Duplicates
df.duplicated()


In [None]:
#Duplicates based on a subset of columns
df.duplicated(subset=['Age'])

#Locate the duplicates
df.loc[df.duplicated(subset=['JoiningYear'])].head(5)
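
The difference between full-row and subset duplicates is easier to see on a tiny example. The sketch below uses made-up rows; the `keep=False` variant is an optional extra that flags every member of a duplicate group instead of only the later ones.

```python
import pandas as pd

# Hypothetical toy data: row 3 repeats row 0 exactly.
toy = pd.DataFrame({
    "JoiningYear": [2017, 2017, 2015, 2017],
    "Age": [34, 28, 34, 34],
})

full_row = toy.duplicated()                                       # compares all columns
by_year = toy.duplicated(subset=["JoiningYear"])                  # compares JoiningYear only
by_year_all = toy.duplicated(subset=["JoiningYear"], keep=False)  # flags first occurrences too

print(full_row.tolist())
print(by_year.tolist())
print(by_year_all.tolist())
```

With a single-column subset, almost every row of a real dataset is flagged, so the result says more about repeated values than about duplicated records.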

In [None]:
#Filter rows with query(), e.g. all employees in one city
df.query('City == "Bangalore"')

In [None]:
#Remove duplicates
# "~" inverts the boolean mask, so rows flagged as duplicates are dropped and only the first occurrence of each combination is kept.
df = df.loc[~df.duplicated(subset=['JoiningYear', 'Age'])]\
      .reset_index(drop=True).copy()
        #reset_index(drop=True) discards the old index and renumbers the rows from 0

df.shape
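
The mask-based removal above can also be written with `drop_duplicates`. The sketch below, on made-up rows, checks that both forms give the same result and shows a side effect worth keeping in mind: deduplicating on a subset discards rows that differ in the other columns.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 3 share JoiningYear and Age but differ in City.
toy = pd.DataFrame({
    "JoiningYear": [2017, 2017, 2015, 2017],
    "Age": [34, 28, 34, 34],
    "City": ["Pune", "Pune", "Bangalore", "New Delhi"],
})

# Approach used in the notebook: invert the duplicate mask.
via_mask = toy.loc[~toy.duplicated(subset=["JoiningYear", "Age"])].reset_index(drop=True)

# Equivalent built-in method; ignore_index=True renumbers the rows from 0.
via_method = toy.drop_duplicates(subset=["JoiningYear", "Age"], ignore_index=True)

print(via_mask)
print(via_mask.equals(via_method))
```

Here the New Delhi row is removed even though it is a different record, so a narrow subset should be a deliberate choice.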

# 3. Feature Understanding
**Plotting Feature Distributions**

* Histogram
* KDE
* Boxplot

While **df.duplicated()** flags rows that repeat an earlier row across all columns, **df['JoiningYear'].value_counts()** counts how often each unique value occurs in the 'JoiningYear' column **specifically**.

This gives a quick view of how joining years are distributed across the dataset.

For example, 2017 seems to be the most common joining year, followed by 2015.

It could be interesting to explore whether there are trends related to the year employees joined the company (e.g., hiring trends, economic conditions).

In [None]:
df['JoiningYear'].value_counts()
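
A minimal sketch of `value_counts` on made-up joining years, including two optional variations: `normalize=True` for shares instead of counts, and `sort_index()` for chronological order.

```python
import pandas as pd

# Hypothetical joining years, not taken from the dataset.
toy = pd.Series([2017, 2015, 2017, 2013, 2017, 2015], name="JoiningYear")

counts = toy.value_counts()                # sorted by frequency, most common first
shares = toy.value_counts(normalize=True)  # relative frequencies that sum to 1
by_year = counts.sort_index()              # same counts, ordered by year

print(counts)
print(shares)
print(by_year)
```

Ordering by year is usually the more readable form when looking for trends over time.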

**What are the top 3 cities with the most employees?**

In [None]:
ax = df['City'].value_counts() \
    .head(3) \
    .plot(kind='bar', title='Top 3 Cities')
ax.set_ylabel('Count')


**How are the employee ages distributed?**

**Histogram to visualize the distribution of employee ages**

In [None]:

dist = df['Age'].plot(kind='hist',
                      bins=25,
                      title='Age distribution')
dist.set_xlabel('Age')

**Kernel density estimation (KDE) plot to visualize the distribution of employee ages**

Unlike histograms, which use bars, KDE plots use a smooth curve to represent the estimated probability density of the data.

In [None]:
dist = df['Age'].plot(kind='kde',
                      # no bins
                      title='Age distribution')
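
The list at the top of this section also mentions a boxplot, which none of the cells draws. A minimal sketch on made-up ages is given below; the `Agg` backend line is only there so the code runs without a display and can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption: no display available)
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ages, not taken from the dataset.
toy = pd.DataFrame({"Age": [22, 25, 26, 27, 28, 29, 30, 34, 41]})

# The box spans the first to third quartile; the line inside it marks the median.
ax = toy["Age"].plot(kind="box", title="Age distribution")
ax.set_ylabel("Age")

# The same quartiles, computed directly for comparison with the plot.
q1, median, q3 = toy["Age"].quantile([0.25, 0.5, 0.75])
print(q1, median, q3)
```

Points drawn beyond the whiskers are the values matplotlib treats as potential outliers.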

# 4. Feature Relationships
* Scatterplot
* Heatmap Correlation
* Pairplot
* Groupby Comparisons

In [None]:
#Scatterplot: compare two numeric columns
df.plot(kind='scatter', x='Age', y='ExperienceInCurrentDomain')

#plt.show()

In [None]:
#Seaborn (for more complex analysis)

sns.scatterplot(x='Age',
                y='ExperienceInCurrentDomain',
                hue='JoiningYear',
                data=df)

 

In [None]:
sns.pairplot(df,
             vars=['JoiningYear', 'Age', 'ExperienceInCurrentDomain', 'PaymentTier'],
             hue='LeaveOrNot')

In [None]:
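#Correlation
#dropna() removes rows with missing values
#corr() computes the pairwise (Pearson) correlation between the columns
df_corr = df[['JoiningYear','Age','ExperienceInCurrentDomain','PaymentTier']].dropna().corr()
df_corr
#Values close to +1 or -1 indicate a strong positive or negative linear relationship; values near 0 indicate a weak one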
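
To make the correlation values less of a black box, the sketch below recomputes one Pearson coefficient by hand on made-up data and compares it with `corr()`. The numbers are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: experience rises with age, payment tier falls.
toy = pd.DataFrame({
    "Age": [25, 30, 35, 40],
    "ExperienceInCurrentDomain": [1, 2, 3, 4],
    "PaymentTier": [3, 3, 2, 1],
})

corr = toy.corr()  # Pearson correlation is the default method

# Manual Pearson coefficient for Age vs PaymentTier:
# covariance of the centered columns divided by the product of their spreads.
x = toy["Age"] - toy["Age"].mean()
y = toy["PaymentTier"] - toy["PaymentTier"].mean()
manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(corr)
print(manual)
```

A negative coefficient here means that higher ages go with lower tier numbers in the toy data; it says nothing about causation.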

In [None]:
sns.heatmap(df_corr, annot=True)
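
Groupby comparisons are listed at the start of this section but not shown. A minimal sketch on made-up rows follows; it assumes `LeaveOrNot` is coded as 0/1, so that its mean can be read as a leave rate.

```python
import pandas as pd

# Hypothetical toy data; LeaveOrNot is assumed to be 0 (stayed) or 1 (left).
toy = pd.DataFrame({
    "City": ["Pune", "Pune", "Bangalore", "Bangalore", "New Delhi"],
    "Age": [30, 34, 26, 28, 40],
    "LeaveOrNot": [1, 0, 0, 0, 1],
})

# One row per city with named aggregations, sorted by leave rate.
summary = (
    toy.groupby("City")
       .agg(mean_age=("Age", "mean"),
            leave_rate=("LeaveOrNot", "mean"),
            employees=("Age", "count"))
       .sort_values("leave_rate", ascending=False)
)

print(summary)
```

The same pattern works for other grouping columns, and the resulting table can be passed to `.plot(kind='bar')` for a visual comparison.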