- Recently I received a lot of feedback from friends who have changed, or are about to change, their careers toward Data Analysis, Data Science, and Machine Learning. They pointed to a lack of material that bridges the start of the analysis journey and the advanced techniques.

- They are looking for Exploratory Data Analysis (EDA) examples that are detailed yet beginner friendly and not overly complicated (no assortment of regression or normalization techniques, etc.), and that show how to start and, most importantly, how to read descriptive statistics and graphs.

- After this feedback, I decided to create a series of EDAs on different datasets, keeping things uncomplicated for people taking their first steps in the DS/ML journey.

###  This notebook is part of the 9 Beginner Friendly EDAs. If these EDAs are helpful to anyone, I will be more than happy.


 #### **INTRO**

- This Exploratory Data Analysis (EDA) focuses on Data Analyst job advertisements: job title, job description, and salary estimate.

- Let's import the required libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt 
import seaborn as sns 
import matplotlib as mpl


import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff


import warnings
warnings.filterwarnings('ignore') 

### Overview Stage

- Read the csv
- Look for basic information about the dataset

In [None]:
df = pd.read_csv('../input/data-analyst-jobs/DataAnalyst.csv')

In [None]:
df.head()

In [None]:
df.info()

- Even though we have salary information ('Salary Estimate'), it is not stored as a number. The same is true for 'Size' and 'Revenue'.

- Depending on our research interest, we can adjust these variables.


### Preparing dataset for the EDA

- Look for the missing values
- Look for the area of interest (based on the research question we have)
- Make adjustments

- Let's look at the missing values

In [None]:
df.isnull().sum()

- Do we have only 1 missing value?

- Nope. Even in the top 5 rows of the data we can easily see (-1).
- In real-world datasets, unknown and missing values are often not recorded in a form that software recognizes as missing. As data people, we usually have to detect and handle them ourselves.
- So let's treat (-1) as a missing value.

In [None]:
# Preview only: without inplace=True (or reassignment), df itself is not modified
df.replace('-1', np.nan)

- Are we done detecting 'unrecognizable' missing values?
- Sorry, I don't think so.
- We have only looked at the string version of (-1); we can still see (-1) in the 'Founded' column, which is numeric.
- So we have to deal with every possible version of (-1).

In [None]:
df.replace(-1, np.nan, inplace=True)

In [None]:
df.replace('-1', np.nan, inplace=True)

In [None]:
df.replace(-1.0, np.nan, inplace=True)

- Now let's see whether we still have only one missing value

In [None]:
df.isnull().sum()

- Yes, that's the reality of real-life data.
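The three replacements above can also be sketched as a single call. This is a minimal illustration on a toy frame with made-up values (not the real dataset), assuming the placeholder appears only as the number -1 or the string '-1'.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with the placeholder stored as float, int, and string
toy = pd.DataFrame({
    "Rating": [3.2, -1.0, 4.1],
    "Founded": [1999, -1, 2010],
    "Industry": ["IT Services", "-1", "Consulting"],
})

# One call covers the numeric and the string version; -1 also matches -1.0
cleaned = toy.replace([-1, "-1"], np.nan)

missing_per_column = cleaned.isnull().sum()
print(missing_per_column)
```

Counting missing values before and after the replacement is a quick way to confirm that the placeholder was really present in each column.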

In this EDA, we will focus on:
   - Job Title
   - Salary Estimate
   - Job Description
   - Rating
   - Company name
   - Industry
   - Sector

- Let's look at each variable in detail and make the required adjustments.

#### Job Title

In [None]:
df['Job Title'].value_counts()

In [None]:
df['Job Title'].value_counts()[:20]

- As we can see, some job titles can be standardized (such as Sr. --> Senior, Jr. --> Junior).

In [None]:
# regex=False forces a literal match, so the '.' in 'Sr.' is never treated as a wildcard
df['Job Title'] = df['Job Title'].str.replace('Sr. Data Analyst', 'Senior Data Analyst', regex=False)

In [None]:
df['Job Title'] = df['Job Title'].str.replace('Sr Data Analyst', 'Senior Data Analyst', regex=False)

In [None]:
df['Job Title'] = df['Job Title'].str.replace('Data Analyst Senior', 'Senior Data Analyst', regex=False)

In [None]:
df['Job Title'] = df['Job Title'].str.replace('Jr. Data Analyst', 'Junior Data Analyst', regex=False)

In [None]:
df['Job Title'] = df['Job Title'].str.replace('Jr Data Analyst', 'Junior Data Analyst', regex=False)

In [None]:
df['Job Title'] = df['Job Title'].str.replace('Data Analyst Junior', 'Junior Data Analyst', regex=False)

In [None]:
df['Job Title'].value_counts()[:20]

In [None]:
df['Job Title'].isnull().sum()

- So far seems OK.
- Let's move on to the **Salary Estimate**
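As a side note on the title clean-up above, the 'Sr.'/'Sr' and 'Jr.'/'Jr' variants could also be handled with one regular expression each. This is only a sketch on a hypothetical list of titles; it does not cover reordered titles such as 'Data Analyst Senior'.

```python
import pandas as pd

# Hypothetical titles, used only to illustrate the pattern
titles = pd.Series([
    "Sr. Data Analyst",
    "Sr Data Analyst",
    "Jr. Data Analyst",
    "Data Analyst",
])

# \b anchors at a word start, \.? allows an optional dot,
# and (?=\s) requires whitespace right after the abbreviation
normalized = (
    titles
    .str.replace(r"\bSr\.?(?=\s)", "Senior", regex=True)
    .str.replace(r"\bJr\.?(?=\s)", "Junior", regex=True)
)

print(normalized.tolist())
```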

### Salary Estimate

In [None]:
df['Salary Estimate']

In [None]:
df['Salary Estimate'].isnull().sum()

OK, we have two issues to address.

- Missing values. In our case there is only 1, but since we will use the salary estimate in detail, we don't want any row without one.

- The salary estimate is not in the form we expected. It is not numeric: it contains a **$** sign, **text**, and numbers given as a range between two values.

##### **Dealing with missing value in the Salary Estimate**

- Beyond relying on domain knowledge and expertise, there are many different ways to deal with missing values.
- In this EDA, we will use just one of them.

In [None]:
# First let's find the row with the missing value in the salary estimate column

In [None]:
df[df['Salary Estimate'].isnull()]

In [None]:
# Let's see whether this company has any other advertisement

In [None]:
df[df['Company Name']=='Protingent\n4.4']

In [None]:
# Ok, we have another advertisement from the same company. Let's see the similarities in the job description.

In [None]:
df['Job Description'][2123]

In [None]:
df['Job Description'][2149]

- Although the two job descriptions are similar, the job in the row with the missing value requires more qualifications than the other one. Based on the given information, we cannot assume the same salary estimate for the missing value.
- It is better to drop the row with the missing value.
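Dropping rows with a missing salary estimate can also be done without hard-coding the row label. The sketch below uses a toy frame; the salary strings and job titles are assumptions for illustration, and only the row labels mirror the ones inspected above.

```python
import pandas as pd

# Toy frame; values are illustrative, not taken from the dataset
toy = pd.DataFrame(
    {
        "Job Title": ["Data Analyst", "Data Analyst", "Senior Data Analyst"],
        "Salary Estimate": ["$37K-$66K (Glassdoor est.)", None, "$46K-$87K (Glassdoor est.)"],
    },
    index=[2123, 2149, 2150],
)

# Keep only the rows where 'Salary Estimate' is present
kept = toy.dropna(subset=["Salary Estimate"])

print(kept.index.tolist())
```

`dropna(subset=...)` keeps working if the row order or the number of missing rows changes, which a fixed label like 2149 does not.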

In [None]:
df.drop(2149, inplace=True)

In [None]:
df['Salary Estimate'].isnull().sum()

#### To get Numeric values from Salary Estimate

- First let's remember, what format we have in the salary estimate variable

In [None]:
df['Salary Estimate'].sample(3)

- We have quite a lot of useful information, and we should make use of it.

Let's get the numbers out of it and by using the numbers, let's make: 
- maximum salary column
- minimum salary column
- average salary column

In [None]:
df['Salary_minimum']= df['Salary Estimate'].str.lstrip('$').str[:3].str.replace('K','').str.strip().astype('float')

In [None]:
df['Salary_maximum'] = df['Salary Estimate'].str[6:10].str.replace('K','').str.lstrip('$').str.strip().astype('float')

In [None]:
df['Salary_average'] = (df['Salary_maximum']+df['Salary_minimum'])/2

- Let's check that everything is in order.

In [None]:
df[['Salary Estimate','Salary_minimum','Salary_maximum','Salary_average']].sample(10)

- Seems quite OK
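The slicing above depends on fixed character positions. An alternative sketch is to pull both numbers out with a regular expression, assuming every value looks like '$37K-$66K (Glassdoor est.)'. The sample strings below are assumptions for illustration.

```python
import pandas as pd

# Sample strings in the assumed '$<min>K-$<max>K (...)' format
salary = pd.Series([
    "$37K-$66K (Glassdoor est.)",
    "$110K-$190K (Glassdoor est.)",
    "$99K-$178K (Glassdoor est.)",
])

# Two capture groups: the digits after each '$' and before each 'K'
bounds = salary.str.extract(r"\$(\d+)K-\$(\d+)K").astype(float)
bounds.columns = ["Salary_minimum", "Salary_maximum"]

# Row-wise mean of the two bounds
bounds["Salary_average"] = bounds.mean(axis=1)

print(bounds)
```

Rows that do not match the pattern come back as NaN, which makes unexpected formats easy to spot.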

#### **Job Description**

In [None]:
df['Job Description'][0]

- Yes, I agree, it is quite a long one. But job descriptions contain a lot of useful information. For this EDA we will focus on programming language requirements, specifically Python and SQL.

In [None]:
df['Job Description'].isnull().sum()

- That's good, we can safely use job description variable for further analysis.

- Let's start with Python and see how many job descriptions mention Python.

In [None]:
df['python_job'] = df['Job Description'].str.contains('python', na=False, case=False)
df['python_job'].value_counts()

- Then let's look at SQL.

In [None]:
df['SQL_job'] = df['Job Description'].str.contains('sql', na=False, case=False)
df['SQL_job'].value_counts()

- It is also good to see our old friend MS Excel.

In [None]:
df['excel_job'] = df['Job Description'].str.contains('excel', na=False, case=False)
df['excel_job'].value_counts()

- And lastly, let's look at Tableau.

In [None]:
df['tableau_job'] = df['Job Description'].str.contains('tableau', na=False, case=False)
df['tableau_job'].value_counts()
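One caveat about the substring checks above: a plain search for 'excel' also matches words such as 'excellent', and 'sql' also matches 'MySQL'. The sketch below, on made-up descriptions, compares the plain search with a word-boundary pattern. Whether the stricter pattern is the better choice for this dataset is a judgment call.

```python
import pandas as pd

# Made-up job description snippets
desc = pd.Series([
    "Excellent communication skills required",
    "Advanced Excel and SQL",
    "Experience with MySQL and Python",
])

# Plain substring search, as used in the cells above
loose = desc.str.contains("excel", case=False, na=False)

# \b requires word boundaries, so 'Excellent' no longer counts as 'Excel'
strict = desc.str.contains(r"\bexcel\b", case=False, na=False, regex=True)

print(pd.DataFrame({"loose": loose, "strict": strict}))
```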

- Let's move on to 'Rating'.

#### **Ratings**

In [None]:
df['Rating'].sample(10)

In [None]:
df['Rating'].isnull().sum()

- For now, we will keep the rows with a missing Rating, because we want to use the salary and job title information in these rows.

#### **Company Name**

In [None]:
df['Company Name'].sample(10)

- OK, something is odd about the company names: they include a number, which most probably is the rating score.
- Let's explore it.

In [None]:
df[['Company Name', 'Rating']].sample(5)

- Yep, so we can safely remove the rating part from the company name.
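The visual check above can also be sketched programmatically: split the name at the newline and compare the trailing number with the 'Rating' column. The toy frame below reuses company names mentioned in this notebook; the missing rating for the last row is an assumption for illustration.

```python
import pandas as pd

# Toy frame; the last row has no rating suffix (assumed for illustration)
toy = pd.DataFrame({
    "Company Name": ["Protingent\n4.4", "Netflix\n3.9", "Enjoy"],
    "Rating": [4.4, 3.9, None],
})

# Split once at the newline: column 0 is the name, column 1 the suffix (if any)
parts = toy["Company Name"].str.split("\n", n=1, expand=True)
name = parts[0]
suffix = pd.to_numeric(parts[1], errors="coerce")

# A row matches when the suffix equals the rating, or when both are missing
matches = suffix.eq(toy["Rating"]) | (suffix.isna() & toy["Rating"].isna())

print(name.tolist(), bool(matches.all()))
```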

In [None]:
df['Company Name']= df['Company Name'].str.split('\n').str[0]

df['Company Name'].sample(3)

- Let's see whether there are any missing values in the company name.

In [None]:
df['Company Name'].isnull().sum()

In [None]:
df[df['Company Name'].isnull()]

- We can keep this row for its job title, salary, and job description.

#### Industry

In [None]:
df['Industry'].value_counts()

- Seems OK.
- Let's see the missing values, if any.

In [None]:
df['Industry'].isnull().sum()

- We will keep the rows that have a missing Industry value.

#### Sector

In [None]:
df['Sector'].value_counts()

- Seems quite OK.
- Let's look at the missing values. We already know of at least one, but let's see whether there are others in this variable.

In [None]:
df['Sector'].isnull().sum()

- We will keep the rows that have a missing Sector value.

In [None]:
df.info()

### Analysis Part

Let's start the analysis part with the variables we need, namely:
- Job title
- Rating
- Company Name
- Industry
- Sector
- Salary Minimum
- Salary Maximum
- Salary Average
- Python job
- SQL job
- Excel job
- Tableau job

In [None]:
df.columns

In [None]:
df_analyst = df[['Job Title', 'Company Name', 'Rating', 'Industry', 'Sector', 'Salary_minimum','Salary_maximum', 'Salary_average','python_job', 'SQL_job', 'excel_job', 'tableau_job']]
df_analyst.head()

In [None]:
df_analyst.isnull().sum()

In [None]:
df_analyst.describe()

Let's look at some of the information we can get from the table above:

- The mean and median company ratings are both close to 4 (around 3.7).
- The mean minimum salary is around 54K and the median is around 50K. Because the mean exceeds the median, we can expect outliers and a right-skewed distribution of the minimum salary.
- The mean maximum salary is almost 90K (89.97) and the median is 87K. Again, we can expect outliers and a right-skewed distribution of the maximum salary.
- The mean average salary is around 72K, and we can still expect several outliers for this variable (min = 33.5K, max = 150K).
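The reading above relies on a simple rule of thumb: when the mean is noticeably larger than the median, a few large values are probably pulling the mean up, which suggests a right-skewed distribution. A small sketch with made-up salaries (in thousands):

```python
import pandas as pd

# Made-up salaries (in K) with one large outlier
s = pd.Series([40, 45, 50, 55, 60, 150])

mean_value = s.mean()      # pulled upward by the outlier
median_value = s.median()  # robust to the outlier
skewness = s.skew()        # positive for a right-skewed sample

print(round(mean_value, 2), median_value, round(skewness, 2))
```

Here the mean is well above the median, and the sample skewness is positive, which matches the right-skew reading.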

In [None]:
fig = px.histogram(df_analyst, x='Salary_minimum', title='Minimum Salary of Data Analyst Jobs', marginal="box", hover_data=['Job Title', 'Company Name'])

fig.show()

- As seen in the histogram, the minimum salary for data analyst jobs typically falls between 40K and 70K, but the distribution is quite skewed.

In [None]:
fig = px.histogram(df_analyst, x='Salary_maximum', title='Maximum Salary of Data Analyst Jobs', marginal="box", hover_data=['Job Title', 'Company Name'])

fig.show()

- As seen in the histogram, the maximum salary for data analyst jobs typically falls between 65K and 90K, but the distribution is quite skewed.

In [None]:
fig = px.histogram(df_analyst, x='Salary_average', title='Average Salary of Data Analyst Jobs', marginal="box", hover_data=['Job Title', 'Company Name'])

fig.show()

- As seen in the histogram, the average salary for data analyst jobs typically falls between 60K and 80K, but the distribution is skewed.

- OK, we now have an overall picture of the salaries across all data analyst job advertisements.
- Next, we will look specifically at advertisements titled 'Data Analyst' and their salary scale.

### Data Analyst

In [None]:
data_analyst_df = df_analyst[df_analyst['Job Title']=='Data Analyst']
data_analyst_df.head()

In [None]:
data_analyst_df.describe()

Let's look at some of the information we can get from the table above:

- The mean and median company ratings are both close to 4 (around 3.9), a little higher than the average rating across all job advertisements (3.7).
- The mean minimum salary is around 54K and the median is 51K, almost the same as the mean and median for the whole dataset. We can expect outliers and a right-skewed distribution of the minimum salary.
- The mean maximum salary is almost 90K (89.97) and the median is 85K. The median is lower than the median maximum salary for the whole dataset. We can expect outliers and a right-skewed distribution of the maximum salary.
- The mean average salary is around 72K, and we can still expect several outliers for this variable (min = 33.5K, max = 150K).

In [None]:

fig = px.histogram(data_analyst_df, x='Salary_minimum', title='Minimum Salary of Data Analyst', marginal="box", hover_data=['Job Title', 'python_job', 'SQL_job', 'excel_job', 'tableau_job'])
fig.show()

- As seen in the histogram, the minimum salary for the 'Data Analyst' title typically falls between 40K and 60K, but the distribution is quite skewed.

In [None]:
fig = px.histogram(data_analyst_df, x='Salary_maximum', title='Maximum Salary of Data Analyst', marginal="box", hover_data=['Job Title', 'python_job', 'SQL_job', 'excel_job', 'tableau_job'])
fig.show()

- As seen in the histogram, the maximum salary for the 'Data Analyst' title typically falls between 67K and 95K, but the distribution is quite skewed.

In [None]:
fig = px.histogram(data_analyst_df, x='Salary_average', title='Average Salary of Data Analyst', marginal="box", hover_data=['Job Title', 'python_job', 'SQL_job', 'excel_job', 'tableau_job'])
fig.show()

- As seen in the histogram, the average salary for the 'Data Analyst' title typically falls between 55K and 80K, but the distribution is skewed.

### Job Openings by Job Title

- Let's see the top 20 job titles in the data analyst job advertisements.

In [None]:
df_analyst['Job Title'].value_counts()[:20]

In [None]:
fig=go.Figure()
fig.add_trace(go.Scatter(
    x= df_analyst['Job Title'].value_counts()[:20].index,
    y= df_analyst['Job Title'].value_counts()[:20].values,
    name='Number of Job Openings',
    mode='markers+text+lines',
    marker_color='blue',
    marker_size=10,
    text=df_analyst['Job Title'].value_counts()[:20].values,
    textposition='top center',
    line=dict(color='red',dash='dash'),
))
fig.update_layout(
    title= "<b>Number of Job Openings by Job titles</b>",
    xaxis_title="<b>Job Titles</b>",
    yaxis_title="<b>Number of Job Openings</b>",
    template='seaborn',
    font=dict(
        size=12,
        color="Black",
        family="Oswald, sans-serif"
        ),
    xaxis=dict(showgrid=True),
    yaxis=dict(showgrid=True),
    yaxis2=dict(showgrid=True,overlaying='y',side='right',title='<b>Number of Job Openings</b>'),
    legend=dict(yanchor="top",
    y=1.3,
    xanchor="left",
    x=0.78)
)
fig.show()

As shown in the plot, the most used titles in the job advertisements are:
- Data Analyst
- Senior Data Analyst
- Junior Data Analyst
- Business Data Analyst
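The plotting cell above calls `value_counts()` three times. One way to avoid the repetition is to build a small tidy table of counts once and reuse it for the x values, y values, and text labels. The sketch below uses a made-up list of titles; `top_titles` and `openings` are names chosen for this example.

```python
import pandas as pd

# Made-up titles, only to illustrate the reshaping
titles = pd.Series(
    ["Data Analyst"] * 3 + ["Senior Data Analyst"] * 2 + ["Junior Data Analyst"],
    name="Job Title",
)

# Count once, keep the top 2, and turn the result into a two-column table
top_titles = (
    titles.value_counts()
    .head(2)
    .rename_axis("Job Title")
    .reset_index(name="openings")
)

print(top_titles)
```

A table in this shape can be passed to a plotting function with `x='Job Title'` and `y='openings'`.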

### Job Openings by Industry

In [None]:
df_analyst['Industry'].value_counts()[:20]

In [None]:
fig=go.Figure()
fig.add_trace(go.Scatter(
    x= df_analyst['Industry'].value_counts()[:20].index,
    y= df_analyst['Industry'].value_counts()[:20].values,
    name='Number of Job Openings',
    mode='markers+text+lines',
    marker_color='red',
    marker_size=10,
    text=df_analyst['Industry'].value_counts()[:20].values,
    textposition='top center',
    line=dict(color='white',dash='dash'),
))
fig.update_layout(
    title= "<b>Number of Job Openings by Industry</b>",
    xaxis_title="<b>Industry</b>",
    yaxis_title="<b>Number of Job Openings</b>",
    template='seaborn',
    font=dict(
        size=12,
        color="Black",
        family="Oswald, sans-serif"
        ),
    xaxis=dict(showgrid=False),
    yaxis=dict(showgrid=False),
    yaxis2=dict(showgrid=True,overlaying='y',side='right',title='<b>Number of Job Openings</b>'),
    legend=dict(yanchor="top",
    y=1.3,
    xanchor="left",
    x=0.78)
)
fig.show()

As shown in the plot, the industries that advertised the most data analyst job openings are:

- IT Services
- Staffing & Outsourcing
- Health Care Services & Hospitals
- Consulting
- Computer Hardware & Software

- It is also worth mentioning that Banking (Investment Banks and Banks combined) advertised 129 data analyst job openings.

### Job Openings by Sector

In [None]:
df_analyst['Sector'].value_counts()[:20]

In [None]:
fig=go.Figure()
fig.add_trace(go.Scatter(
    x= df_analyst['Sector'].value_counts()[:20].index,
    y= df_analyst['Sector'].value_counts()[:20].values,
    name='Number of Job Openings',
    mode='markers+text+lines',
    marker_color='blue',
    marker_size=10,
    text=df_analyst['Sector'].value_counts()[:20].values,
    textposition='top center',
    line=dict(color='white',dash='dash'),
))
fig.update_layout(
    title= "<b>Number of Job Openings by Sector</b>",
    xaxis_title="<b>Sector</b>",
    yaxis_title="<b>Number of Job Openings</b>",
    template='seaborn',
    font=dict(
        size=12,
        color="Black",
        family="Oswald, sans-serif"
        ),
    xaxis=dict(showgrid=True),
    yaxis=dict(showgrid=True),
    yaxis2=dict(showgrid=True,overlaying='y',side='right',title='<b>Number of Job Openings</b>'),
    legend=dict(yanchor="top",
    y=1.3,
    xanchor="left",
    x=0.78)
)
fig.show()

As shown in the plot, the sectors that advertised the most data analyst job openings are:

- Information Technology
- Business Services
- Finance
- Health Care


- Now let's see **Python, SQL, Excel, and Tableau** in action.

### Required Programming Languages & Program Skills by Job Title

In [None]:
df4 = df_analyst[['Job Title','python_job', 'SQL_job','excel_job','tableau_job']].copy()

# Sum the boolean skill flags per job title and keep the 10 titles with the most Python mentions
df_lang = (df4.groupby('Job Title')[['python_job', 'SQL_job','excel_job','tableau_job']]
              .sum()
              .sort_values(by='python_job', ascending=False)
              .head(10)
              .reset_index())

# Look up each title's own posting count by name; assigning value_counts()[:10].values
# by position would pair the counts with the wrong titles, because the row order differs
df_lang['number_of_job_openings'] = df_lang['Job Title'].map(df_analyst['Job Title'].value_counts())
columnsTitles = ['Job Title', 'number_of_job_openings','python_job', 'SQL_job','excel_job','tableau_job']

df_lang = df_lang.reindex(columns=columnsTitles)

df_lang


- Based on the job advertisements' requirements, we can safely say that SQL and Excel keep their importance. Python, as a programming language, is required in almost 1 out of 3 job advertisements, and the same is true for the visualization tool Tableau.
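Statements like '1 out of 3' are proportions, and the mean of a boolean column gives exactly that. The sketch below computes overall shares and per-title shares on a made-up frame; the column names follow this notebook, the values do not come from the dataset.

```python
import pandas as pd

# Made-up skill flags for four postings
toy = pd.DataFrame({
    "Job Title": ["Data Analyst", "Data Analyst", "Data Analyst", "Senior Data Analyst"],
    "python_job": [True, False, False, True],
    "SQL_job": [True, True, False, True],
})

# Mean of a boolean column = share of postings where the flag is True
overall_share = toy[["python_job", "SQL_job"]].mean()

# Per title: number of postings and share of postings mentioning each skill
by_title = toy.groupby("Job Title").agg(
    openings=("python_job", "size"),
    python_share=("python_job", "mean"),
    sql_share=("SQL_job", "mean"),
)

print(overall_share)
print(by_title)
```

Shares are easier to compare across titles than raw sums, because titles with more postings naturally have larger sums.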

In [None]:
fig = px.bar(df_lang, x='Job Title', y=['python_job', 'SQL_job','excel_job','tableau_job'], title="Languages")
fig.show()

- Finally, let's see the salary distributions by company name.

### Data Analyst Jobs Salary Range Based on Company Name with Rating Scores

In [None]:
fig = px.scatter(df_analyst, x="Salary_minimum", y="Company Name", 
                 color="Rating", 
                 hover_data=['Industry', 'Job Title'], 
                 title = "Minimum Salary by Company Name with Rating Scores")
fig.show()

- A minimum salary of 110K is quite, I mean quite, OK for working at a company with a 3.9 rating, like Netflix :)

In [None]:
fig = px.scatter(df_analyst, x="Salary_maximum", y="Company Name", 
                 color="Rating", 
                 hover_data=['Industry', 'Job Title'], 
                 title = "Maximum Salary by Company Name with Rating Scores")
fig.show()

- The company named 'Enjoy' offers a maximum salary of 190K; most probably I would also 'Enjoy' working there.

In [None]:
fig = px.scatter(df_analyst, x="Salary_average", y="Company Name", 
                 color="Rating", 
                 hover_data=['Industry', 'Job Title'], 
                 title = "Average Salary by Company Name with Rating Scores")
fig.show()
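With many companies on one axis, the scatter plots above can be hard to read. A complementary sketch is to aggregate per company first and look at a short ranked table. The company names and salaries below are placeholders, not values from the dataset.

```python
import pandas as pd

# Placeholder companies and average salaries (in K)
toy = pd.DataFrame({
    "Company Name": ["Company A", "Company A", "Company B", "Company C", "Company C", "Company C"],
    "Salary_average": [60.0, 80.0, 100.0, 50.0, 70.0, 60.0],
})

# Mean salary and number of postings per company, highest mean first
by_company = (
    toy.groupby("Company Name")["Salary_average"]
    .agg(mean_salary="mean", postings="count")
    .sort_values("mean_salary", ascending=False)
)

# Optionally keep only companies with at least two postings
repeat_posters = by_company[by_company["postings"] >= 2]

print(by_company)
```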

## This notebook is a part of the 9 Beginner Friendly EDAs
## If you like this one, you can also check out other notebooks in the Beginner Friendly EDAs series!

* [Top Games on Google Play Store](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-top-games)
* [Hollywood Top Movies- EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-top-movies)
* [UDEMY Courses EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-udemy)
* [World Happiness Report - EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-eda)
* [Countries Life Expectancy](https://www.kaggle.com/kaanboke/plotly-beginner-friendly)
* [Netflix Movies- EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-netflix)
* [Amazon Top 50 Bestselling Books EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-amazon)
* [London bike Sharing EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-london-bike)


- It was quite a pleasure to share this detailed, beginner friendly EDA with you. Thanks for your time.

- All the best 