## In this Exploratory Data Analysis, we are going to focus on data analyst jobs based on the DataAnalyst dataset.

First things first, let's import the libraries that we will use.

- The first two are our standard libraries for data manipulation; 
- The following three are main data visualization libraries; 
- The following 7 are plotly data visualization libraries and methods that enable us to create dynamic and interactive graphs; 
- The last one is for filtering warnings.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib as mpl 
import seaborn as sns

import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff


import warnings
warnings.filterwarnings('ignore') 

# Knowing Dataset

As in any other field of study or profession, it is important to know your dataset before diving into the analysis. First and foremost, we need to understand what we want to do with a given dataset and what can be done with it.




Let's read our CSV dataset and have a look at its basics.

In [None]:
df = pd.read_csv("../input/data-analyst-jobs/DataAnalyst.csv")

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

It can be easily seen that even though we have information about the salary ('Salary Estimate'), it is not stored as a numeric type; the same goes for 'Size' and 'Revenue'. Based on our research objectives, we need to make some adjustments to the above-mentioned variables.

# Preparing Dataset

- Are there any missing values?

In [None]:
df.isnull().sum()

At first glance, it seems that we have only one missing value. However, in the output of head() we can see (-1) values. In real-world datasets, unknown and missing values are often not recorded in a form that software recognizes as missing. In such cases we need to check for, detect, and deal with them. Let's define (-1) as a missing value.

In [None]:
df.replace("-1", np.nan)

As can be seen above, the string (-1) values can be treated as missing values. Note that the call above only returns a modified copy and does not change df itself. We are also not done with missing value detection yet: there are integer and float (-1) values in our dataset as well. Therefore, we need to deal with all of these versions before moving on to the analysis.

In [None]:
df.replace("-1", np.nan, inplace=True)
df.replace(-1, np.nan, inplace=True)
df.replace(-1.0, np.nan, inplace=True)
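
The same idea can be tried on a small toy frame before touching the real data. The sketch below uses made-up values and column contents (they are assumptions, not rows from DataAnalyst.csv). It also shows that -1 and -1.0 compare equal, so one numeric replace covers both.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with the three sentinel variants:
# the string "-1", the integer -1, and the float -1.0.
toy = pd.DataFrame({
    "Rating": [3.5, -1.0, 4.1],
    "Founded": [1999, -1, 2010],
    "Competitors": ["-1", "Acme", "-1"],
})

# One pass for the string sentinel, one for the numeric sentinel.
# -1 and -1.0 compare equal, so a single numeric replace covers both.
cleaned = toy.replace("-1", np.nan).replace(-1, np.nan)

missing_per_column = cleaned.isnull().sum()
print(missing_per_column)
```

Counting the missing values per column right after the replacement is a quick way to confirm that every sentinel variant was caught.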

Now, let's recheck missing values in our dataset. 

In [None]:
df.isnull().sum()

This time, the result seems more realistic.

In this analysis, we will concentrate on the following areas: 
- Job Title
- Salary Estimate
- Job Description
- Rating
- Company name
- Industry
- Sector

Let's start with **Job Title**

In [None]:
df["Job Title"].value_counts().head(20)

If we pay attention, we can see that the same job appears under different names (for example, "**Senior Data Analyst**" and "**Sr. Data Analyst**"). For a healthy analysis, we need to unify these names.

In [None]:
# regex=False makes the "." in "Sr." and "Jr." match a literal dot on every pandas version
df["Job Title"] = df["Job Title"].str.replace("Sr. Data Analyst", "Senior Data Analyst", regex=False)
df["Job Title"] = df["Job Title"].str.replace("Sr Data Analyst", "Senior Data Analyst", regex=False)
df["Job Title"] = df["Job Title"].str.replace("Data Analyst Senior", "Senior Data Analyst", regex=False)
df["Job Title"] = df["Job Title"].str.replace('Jr. Data Analyst', 'Junior Data Analyst', regex=False)
df["Job Title"] = df["Job Title"].str.replace('Jr Data Analyst', 'Junior Data Analyst', regex=False)
df["Job Title"] = df["Job Title"].str.replace('Data Analyst Junior', 'Junior Data Analyst', regex=False)
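
When the list of variants grows, the renaming rules can be kept in one place. The following is a minimal sketch on a toy Series; the `aliases` dictionary is a hypothetical helper introduced here for illustration, not something from the dataset.

```python
import pandas as pd

titles = pd.Series([
    "Sr. Data Analyst",
    "Data Analyst Junior",
    "Jr Data Analyst",
    "Data Analyst",
])

# Hypothetical alias table: variant spelling -> canonical title.
aliases = {
    "Sr. Data Analyst": "Senior Data Analyst",
    "Sr Data Analyst": "Senior Data Analyst",
    "Data Analyst Senior": "Senior Data Analyst",
    "Jr. Data Analyst": "Junior Data Analyst",
    "Jr Data Analyst": "Junior Data Analyst",
    "Data Analyst Junior": "Junior Data Analyst",
}

normalized = titles
for variant, canonical in aliases.items():
    # regex=False treats the "." in "Sr." as a literal dot.
    normalized = normalized.str.replace(variant, canonical, regex=False)

print(normalized.tolist())
```

Adding a new spelling then only requires one more dictionary entry.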

In [None]:
df["Job Title"].value_counts()[:20]

In [None]:
df["Job Title"].isnull().sum()

So far so good with Job Title. Let's move on to **Salary Estimate**

In [None]:
df["Salary Estimate"].sample(5)

In [None]:
df["Salary Estimate"].isnull().sum()

It seems that we have one missing value. Since we will use "Salary Estimate" in detail in our EDA, we want every row to have a value. Secondly, the salary estimates are not in the form we want (the column is of object type and contains symbols such as **$** and **K**).

Let's find the row with missing salary estimate.

In [None]:
df[df["Salary Estimate"].isnull()]

There are several approaches to dealing with missing values, such as dropping them, replacing them, or leaving them as they are. Here, we could replace the missing value with the salary estimate for a similar job advertised by the same company.

Let's see whether there is any other job ad by the same company.

In [None]:
df[df["Company Name"] == "Protingent\n4.4"]

There is another ad by the same company.

In [None]:
df["Job Description"][2123]

In [None]:
df["Job Description"][2149]

Based on the job descriptions above, there are some similarities between the two jobs. However, on a closer look, it can be inferred that the job with the missing salary estimate requires more qualifications than the other one. Since we cannot make a sound estimate in this case, it is better to drop the row with the missing value for the sake of our analysis.

In [None]:
df.shape

In [None]:
df.drop(2149, inplace=True)
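
Dropping by a hard-coded label works here, but the label would change if the data were re-read or re-indexed. As a hedged alternative, the sketch below finds and drops rows with a missing salary by condition. The toy frame and its index labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows and index labels.
toy = pd.DataFrame(
    {
        "Job Title": ["Data Analyst", "Senior Data Analyst", "Data Analyst"],
        "Salary Estimate": ["$37K-$66K", np.nan, "$46K-$87K"],
    },
    index=[2123, 2149, 2150],
)

# Labels of the rows whose salary estimate is missing.
missing_labels = toy.index[toy["Salary Estimate"].isnull()].tolist()

# Drop those rows without naming the label explicitly.
without_missing = toy.dropna(subset=["Salary Estimate"])

print(missing_labels)
print(without_missing.shape)
```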

In [None]:
df["Salary Estimate"].isnull().sum()

In [None]:
df.shape

In [None]:
df["Salary Estimate"].sample(5)

We need to get rid of the $ and K signs, convert the object type into a numeric type, and split the range into minimum, maximum, and average salary estimates to make the column suitable for our analysis.

Let's get the numbers out of it and by using the numbers, let's make:

- maximum salary column
- minimum salary column
- average salary column

In [None]:
df["Salary Minimum"] = df["Salary Estimate"].str.lstrip("$").str[:3].str.replace("K", "").str.strip().astype("float")

In [None]:
df["Salary Maximum"] = df['Salary Estimate'].str[6:10].str.replace('K','').str.lstrip('$').str.strip().astype('float')

In [None]:
df["Salary Average"] = (df["Salary Maximum"] + df["Salary Minimum"]) / 2
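
The slicing above depends on fixed character positions. A regular expression is one possible alternative that does not care how many digits each bound has. This is only a sketch: the sample strings below are assumed to follow the pattern "$<min>K-$<max>K" followed by a source note, and they are illustrative values, not rows quoted from the dataset.

```python
import pandas as pd

# Sample strings in the assumed "$<min>K-$<max>K ..." format.
salary = pd.Series([
    "$37K-$66K (Glassdoor est.)",
    "$110K-$190K (Glassdoor est.)",
    "$99K-$178K (Glassdoor est.)",
])

# Two capture groups: the digits before each "K".
bounds = salary.str.extract(r"\$(\d+)K-\$(\d+)K").astype(float)
bounds.columns = ["Salary Minimum", "Salary Maximum"]

# Average of the two bounds, row by row.
bounds["Salary Average"] = bounds[["Salary Minimum", "Salary Maximum"]].mean(axis=1)

print(bounds)
```

Rows that do not match the pattern would come out as NaN, which makes unexpected formats easy to spot.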

Let's check what we have after the changes we have made.

In [None]:
df[["Salary Estimate", "Salary Maximum", "Salary Minimum", "Salary Average"]].sample(5)

All seems OK. Let's move on to **Job Description**

In [None]:
df["Job Description"][0]

It tells us some specifics about the given job. We will focus on Python, Excel, Tableau, and SQL skill requirements for a given job.

In [None]:
df["Job Description"].isnull().sum()

How many jobs are there that require Python skills?

In [None]:
df["python"] = df["Job Description"].str.contains("python", na=False, case=False)
df["python"].value_counts()

How many jobs are there that require SQL skills?

In [None]:
df["SQL"] = df["Job Description"].str.contains("sql", na=False, case=False)
df["SQL"].value_counts()

How many jobs are there that require Excel skills?

In [None]:
df["Excel"] = df["Job Description"].str.contains("excel", na=False, case=False)
df["Excel"].value_counts()

How many jobs are there that require Tableau skills?

In [None]:
df["Tableau"] = df["Job Description"].str.contains("tableau", na=False, case=False)
df["Tableau"].value_counts()
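
One caveat with plain substring matching is that "excel" also matches words such as "excellent", and "sql" also matches "NoSQL". The sketch below compares the substring check with a word-boundary pattern on three made-up descriptions, so the difference is visible. It is an illustration, not a replacement for the counts above.

```python
import pandas as pd

# Made-up descriptions chosen to expose false positives.
descriptions = pd.Series([
    "Excellent communication skills required.",
    "Advanced Excel and SQL experience.",
    "Experience with NoSQL stores and Python.",
])

# Plain substring search, as used above.
naive_excel = descriptions.str.contains("excel", case=False, na=False)
naive_sql = descriptions.str.contains("sql", case=False, na=False)

# \b marks a word boundary, so only the standalone word matches.
strict_excel = descriptions.str.contains(r"\bexcel\b", case=False, na=False)
strict_sql = descriptions.str.contains(r"\bsql\b", case=False, na=False)

print(pd.DataFrame({
    "naive_excel": naive_excel,
    "strict_excel": strict_excel,
    "naive_sql": naive_sql,
    "strict_sql": strict_sql,
}))
```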

It is time to move on to **Rating**

In [None]:
df.Rating.sample(5)

In [None]:
df.Rating.isnull().sum()

We have 272 missing values. Let's leave them as they are and be careful not to use Salary Estimates and Job Titles on the rows with missing ratings.

Now **Company Name**

In [None]:
df["Company Name"].sample(10)

It seems that some company names have their rating score attached at the very end. Let's make sure of it.

In [None]:
df[["Company Name", "Rating"]].sample(10)

Our first guess was right. Therefore, we can remove the attached scores. Additionally, the company names without a rating attached at the end have missing rating scores.
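
Instead of relying on a random sample, the relationship between the attached score and the Rating column can also be checked for every row. The sketch below does this on a toy frame. Apart from "Protingent\n4.4", which appears earlier in this notebook, the company names and ratings are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Company Name": ["Protingent\n4.4", "Acme Analytics\n3.9", "No Rating Inc"],
    "Rating": [4.4, 3.9, np.nan],
})

# Pull out a trailing "<newline><digit>.<digit>" suffix, if there is one.
suffix = toy["Company Name"].str.extract(r"\n(\d\.\d)$", expand=False).astype(float)

# A row is consistent when the suffix equals the rating,
# or when both the suffix and the rating are missing.
matches = (suffix == toy["Rating"]) | (suffix.isna() & toy["Rating"].isna())

print(matches.all())
```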

In [None]:
df["Company Name"] = df["Company Name"].str.split("\n").str[0]
df["Company Name"].head()

In [None]:
df["Company Name"].isnull().sum()

In [None]:
df[df["Company Name"].isnull()]

We have only one missing value. We can leave it as it is.

Now **Industry**

In [None]:
df.Industry.value_counts()

In [None]:
df.Industry.isnull().sum()

They can be left as they are, as well.

And finally **Sector**

In [None]:
df.Sector.value_counts()

In [None]:
df.Sector.isnull().sum()

Let's check general info of our dataset after applying above arrangements before moving on to analyzing it. 

In [None]:
df.info()

# Analyzing Dataset

We will analyze the following variables:
- Job title
- Rating
- Company Name
- Industry
- Sector
- Salary Minimum
- Salary Maximum
- Salary Average
- python
- SQL 
- Excel 
- Tableau 

In [None]:
df.columns

In [None]:
df_analyis = df[['Job Title', 'Company Name', 'Rating', 'Industry', 'Sector', 'Salary Minimum','Salary Maximum', 'Salary Average','python', 'SQL', 'Excel', 'Tableau']]
df_analyis.head()

In [None]:
df_analyis.isnull().sum()

Let's see what we have in terms of statistics on our overall dataset.

In [None]:
df_analyis.describe()

- **Rating**: The mean is close to the median (3.73 and 3.70 respectively), so the distribution is close to symmetric, with at most a slight right skew. We may expect a few outliers.
- **Salary Minimum**: The mean is noticeably higher than the median (54.26 and 50.00 respectively): the average minimum salary is around 54K, while the median is 50K. This points to a right-skewed distribution, so we may expect outliers on the maximum side of the distribution.
- **Salary Maximum**: The mean is higher than the median (89.97 and 87.00 respectively): the average maximum salary is around 90K, while the median is 87K. This also points to a right-skewed distribution, so we may expect outliers on the maximum side of the distribution.
- **Salary Average**: The average salary is around 72K, but we can still expect several outliers for this variable (min = 33.5K, max = 150K).
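
The reasoning above (comparing the mean with the median and guessing where outliers sit) can be made explicit in code. The sketch below uses a small made-up sample, not the real column, and the common 1.5 × IQR rule, which is an assumption on our side about how to define an outlier.

```python
import pandas as pd

# Made-up sample with one large value on the high side.
salary_min = pd.Series([40, 45, 50, 50, 55, 60, 150], name="Salary Minimum")

mean, median = salary_min.mean(), salary_min.median()
skewness = salary_min.skew()  # a positive value indicates right skew

# 1.5 * IQR fences, the same rule a box plot uses for its whiskers.
q1, q3 = salary_min.quantile(0.25), salary_min.quantile(0.75)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = salary_min[(salary_min < lower_fence) | (salary_min > upper_fence)]

print(f"mean={mean:.2f}, median={median:.2f}, skew={skewness:.2f}")
print(outliers.tolist())
```

When the mean is above the median and the skewness is positive, the long tail (and the outliers) sit on the high side.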

In [None]:
fig = px.histogram(data_frame=df_analyis, x="Salary Minimum", title="Data Analyst Jobs - Minimum Salary", marginal="box", hover_data=df_analyis[["Job Title", "Company Name"]])
fig.show()

It can be inferred from the histogram above that the minimum salary for data analysts mostly falls between 40K and 70K, but the distribution is quite skewed, as we concluded before drawing the graph.

In [None]:
fig = px.histogram(data_frame=df_analyis, x="Salary Maximum", title="Data Analyst Jobs - Maximum Salary", marginal="box", hover_data=df_analyis[["Job Title", "Company Name"]])
fig.show()

The maximum salary for data analysts mostly falls between 65K and 90K, but the distribution is quite skewed.

In [None]:
fig = px.histogram(data_frame=df_analyis, x="Salary Average", marginal="box", hover_data=df_analyis[["Job Title", "Company Name"]], title="Average Salary of Data Analyst Jobs")
fig.show()

Again, we can infer from the graph above that the average salary for data analysts mostly falls between 60K and 80K, but the distribution is skewed.

This time, let's see what we have in terms of "**Data Analyst**" job title advertisements and their salary scale.
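
A statement such as "most average salaries fall between 60K and 80K" can also be checked numerically rather than read off the histogram. The sketch below computes the share of values inside a band on made-up numbers. The band limits are taken from the text above; the data is hypothetical.

```python
import pandas as pd

# Made-up average salaries, in thousands.
salary_avg = pd.Series([45.0, 58.5, 62.0, 71.5, 76.0, 79.5, 95.0, 120.0])

# True for values inside the closed interval [60, 80].
in_band = salary_avg.between(60, 80)

# The mean of a boolean Series is the share of True values.
share_in_band = in_band.mean()

print(f"{share_in_band:.0%} of the values fall between 60K and 80K")
```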

Firstly, we create a new dataset that contains only "**Data Analyst**" jobs.

In [None]:
data_analyst_title = df_analyis[df_analyis["Job Title"] == "Data Analyst"]
data_analyst_title.head()

In [None]:
data_analyst_title.describe()

Based on descriptive statistical infos we have above, we can conclude that:
- **Rating**: The mean and median are close to each other (3.85 and 3.90 respectively). The mean rating for the Data Analyst title is slightly higher than the mean rating of the overall dataset (3.73). We can expect a slightly left-skewed distribution.
- **Salary Minimum**: The average minimum salary is around 54K and the median is 51K (almost the same as the mean and median of the whole dataset's Salary Minimum). We can expect a right-skewed distribution and some outliers.
- **Salary Maximum**: The average maximum salary is almost 90K and the median is 85K. We can expect a right-skewed distribution and some outliers.
- **Salary Average**: The average salary is around 72K, but we can still expect several outliers for this variable (min = 33.5K, max = 150K).

In [None]:
fig = px.histogram(data_frame=data_analyst_title, x="Salary Minimum", title="Minimum Salary of Data Analyst", marginal="box", hover_data=data_analyst_title[['Job Title', 'python', 'SQL', 'Excel', 'Tableau']])
fig.show()

Based on the histogram above, we can see that the minimum salary for the Data Analyst title mostly falls between 40K and 60K, but the distribution is quite skewed.

In [None]:
fig = px.histogram(data_frame=data_analyst_title, x="Salary Maximum", title="Maximum Salary of Data Analyst", marginal="box", hover_data=data_analyst_title[['Job Title', 'python', 'SQL', 'Excel', 'Tableau']])
fig.show()

Based on the histogram above, we can see that the maximum salary for the Data Analyst title mostly falls between 67K and 95K, but the distribution is quite skewed.

In [None]:
fig = px.histogram(data_frame=data_analyst_title, x="Salary Average", title="Average Salary of Data Analyst", marginal="box", hover_data=data_analyst_title[['Job Title', 'python', 'SQL', 'Excel', 'Tableau']])
fig.show()

Based on the histogram above, we can see that the average salary for the Data Analyst title mostly falls between 55K and 80K, but the distribution is skewed.

Let's see **Job Openings based on Job Titles**. Top 10 job titles in data analyst job advertisements.

In [None]:
a = df_analyis["Job Title"].value_counts()[:10]
a
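
Raw counts are easier to interpret next to their share of all advertisements. The sketch below shows one way to get both with `value_counts`, using a made-up list of titles.

```python
import pandas as pd

# Made-up titles: 5 + 3 + 2 = 10 advertisements.
titles = pd.Series(
    ["Data Analyst"] * 5 + ["Senior Data Analyst"] * 3 + ["Junior Data Analyst"] * 2
)

counts = titles.value_counts()                # absolute number of openings
shares = titles.value_counts(normalize=True)  # fraction of all openings

# Assigning a Series as a column aligns it on the index (the job title).
top = counts.to_frame("openings")
top["share"] = shares
top = top.head(2)

print(top)
```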

In [None]:
plt.figure(figsize=(6, 4), dpi=(120))
sns.scatterplot(x=a.index, y=a.values)
plt.title("Number of Job Openings by Job Titles")
plt.xlabel("Job Title")
plt.ylabel("Number of Job Openings")
for i, ii in enumerate(a):
    plt.text(i, ii, str(ii))
plt.xticks(rotation=270);

Based on our observation from the scatterplot above, the most used titles in the job advertisements are:

- Data Analyst
- Senior Data Analyst
- Junior Data Analyst
- Business Data Analyst

Let's see **Job Openings based on Industry**. Top 10 industries in data analyst job advertisements.

In [None]:
b = df_analyis["Industry"].value_counts()[:10]
b

In [None]:
plt.figure(figsize=(8, 4), dpi=(120))
sns.scatterplot(x=b.index, y=b.values)
plt.title("Number of Job Openings by Industry")
plt.xlabel("Industry")
plt.ylabel("Number of Job Openings")
for i, ii in enumerate(b):
    plt.text(i, ii, str(ii), va="center")
plt.xticks(rotation=270);

As shown in the plot, the industries with the most data analyst job openings are:

- IT Services
- Staffing & Outsourcing
- Health Care Services & Hospitals
- Consulting
- Computer Hardware & Software

Let's see **Job Openings based on Sector**. Top 10 sectors in data analyst job advertisements.

In [None]:
c = df_analyis["Sector"].value_counts()[:10]
c

In [None]:
plt.figure(figsize=(8, 4), dpi=(120))
sns.scatterplot(x=c.index, y=c.values)
plt.title("Number of Job Openings by Sector")
plt.xlabel("Sector")
plt.ylabel("Number of Job Openings")
for i, ii in enumerate(c):
    plt.text(i, ii, str(ii), va="center")
plt.xticks(rotation=270);

As shown in the plot, the sectors with the most data analyst job openings are:

- Information Technology
- Business Services
- Finance
- Health Care

Let's see **programming language skills** required by Job Titles

In [None]:
lang_skills = df_analyis[["Job Title", "python", "Excel", "SQL", "Tableau"]]
lang_skills_1 = lang_skills.groupby("Job Title")[["python", "Excel", "SQL", "Tableau"]].sum().sort_values(by="python", ascending=False)[:10]
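# Assign the counts as a Series so pandas aligns them with the job titles in the index;
# positional .values would only be correct if both tables happened to share the same order.
lang_skills_1['number_of_job_openings'] = df_analyis['Job Title'].value_counts()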
lang_skills_1
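
Skill counts and the number of openings can also be computed in a single grouped aggregation, which keeps every number tied to its own job title. The following is a sketch on made-up rows; the `python_share` column name is introduced here only for illustration.

```python
import pandas as pd

# Made-up rows: three "Data Analyst" ads and two "Senior Data Analyst" ads.
jobs = pd.DataFrame({
    "Job Title": ["Data Analyst", "Data Analyst", "Data Analyst",
                  "Senior Data Analyst", "Senior Data Analyst"],
    "python": [True, False, True, True, True],
    "SQL": [True, True, False, True, False],
})

# Named aggregation: each output column is (input column, function).
summary = jobs.groupby("Job Title").agg(
    python=("python", "sum"),                   # ads mentioning Python
    SQL=("SQL", "sum"),                         # ads mentioning SQL
    number_of_job_openings=("python", "size"),  # all ads for the title
)

# Share of ads per title that mention Python.
summary["python_share"] = summary["python"] / summary["number_of_job_openings"]

print(summary)
```

Because the counts and the totals come from the same `groupby`, they cannot get out of order with respect to each other.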

Based on the requirements in the job advertisements, SQL and Excel keep their importance. As a programming language, Python is required in almost 1 out of 3 job advertisements; the same is true for the visualization tool Tableau.

In [None]:
lang_skills_1.index

In [None]:
fig = px.bar(data_frame=lang_skills_1, x=lang_skills_1.index, y=["python", "Excel", "SQL", "Tableau"], title="Programming Languages")
fig.show()

And finally, let's see **Salary Distributions by Company Names**.

In [None]:
fig = px.scatter(df_analyis, x="Salary Minimum", y="Company Name", color="Rating", hover_data=['Industry', 'Job Title'], 
title = "Minimum Salary by Company Name with Rating Scores")
fig.show()

In [None]:
fig = px.scatter(df_analyis, x="Salary Maximum", y="Company Name", color="Rating", hover_data=['Industry', 'Job Title'], 
title = "Maximum Salary by Company Name with Rating Scores")
fig.show()

In [None]:
fig = px.scatter(df_analyis, x="Salary Average", y="Company Name", color="Rating", hover_data=['Industry', 'Job Title'], 
title = "Average Salary by Company Name with Rating Scores")
fig.show()
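
With many companies on the y-axis, the scatter plots above can be hard to read. One possible complement is a per-company summary table. The sketch below shows the idea on made-up rows; the company names and the output column names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Company Name": ["Acme", "Acme", "Globex", "Globex", "Initech"],
    "Salary Average": [60.0, 80.0, 90.0, 110.0, 55.0],
    "Rating": [3.5, 3.5, 4.2, 4.2, 3.0],
})

# One row per company: number of ads, median average salary, mean rating.
by_company = (
    toy.groupby("Company Name")
    .agg(
        openings=("Salary Average", "size"),
        median_salary=("Salary Average", "median"),
        rating=("Rating", "mean"),
    )
    .sort_values("median_salary", ascending=False)
)

print(by_company)
```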

That was the final step of our EDA. I hope you had fun reading and studying it.