Commit
Merge pull request #106 from sanskritilabroo/main
Fixing Issue #81 - added logreg metrics + visualization for job sats
sanjay-kv committed May 18, 2024
2 parents 32fb8a4 + 59b67f3 commit 763a98c
Showing 2 changed files with 89 additions and 25 deletions.
54 changes: 29 additions & 25 deletions Learn.md
@@ -20,7 +20,7 @@
<summary><h2 style="display: inline-block">Table of Contents</h2></summary>
<ol>
<li>
<a href="#1 Project Description">Project Description</a>
<a href="#1-Project-Description">Project Description</a>
</li>
<li>
<a href="#2 Data Source">Data Source</a>
@@ -66,8 +66,7 @@
</ol>
</details>

# <a name="1 Project Description">Project description:</a>

<h1 id="1-Project-Description">Project description:</h1>

Stack Overflow is a professional community for developers. It has conducted a developer survey every year since 2011, and the collected data is openly available on the web. The latest dataset, for 2020, was released on March 5th, 2021. With proper analysis, the dataset can help us answer real-world questions. For instance, we can find the most popular language among developers and the developer role that pays the highest salary. Our project analyzes the last three years of the developer survey and gathers meaningful insights from it.

@@ -76,12 +75,13 @@ As a first step, we will clean the data by removing null values and outliers in
The questions that we answered as part of the analysis are given in the `Data Analysis and Visualization` section. Please refer to the Jupyter notebook file for all the code. This `readme.md` file explains the key steps and results of our project.


# <a name="2 Data Source">Data source:</a>
<h1 id="2 Data Source">Data Source</h1>

The dataset is very diverse and comes from the Stack Overflow developer survey, which covers 180 countries. Stack Overflow has collected survey data from 2011 to 2020. We chose 2018, 2019, and 2020 for this project. The participants are mostly from the US, India, and the EMEA region. The majority of the survey respondents have a developer or coding background. We performed various analyses, and our key results are given in the `Data Analysis` section.

The dataset can be downloaded from the links below:


**Download Link** -> https://insights.stackoverflow.com/survey

**Available in GitHub community Exchange** -> https://education.github.com/globalcampus/exchange?utf8=%E2%9C%93&q=sanjay
@@ -90,7 +90,7 @@ The data are available in the CSV format ranging from 40 to 150 MB with data of

We chose this dataset because of its diverse nature and because it was completely uncleaned. As developers, we use Stack Overflow to find answers to most of the questions we run into, which encouraged us to explore the survey results and derive key insights from them. The insights can also help in understanding the information technology industry, support employers in hiring, and help job seekers prepare their resumes.

# <a name="3 Key Insights">Key Insights</a>
<h1 id="3 Key Insights">Key Insights</h1>

1. JavaScript has maintained its stronghold as the most commonly used programming language. Almost 70% of the respondents use JavaScript. HTML/CSS stands as the second most popular language with about 63%.
2. About `55%` of respondents identify themselves as **full-stack developers**, and about `20%` consider themselves as **mobile developers**.
@@ -102,10 +102,8 @@ The data are available in the CSV format ranging from 40 to 150 MB with data of
8. Most of the data scientist respondents came from the United States (1,550). The country with the second highest number of data scientists is India (540).
9. The country that pays the highest salary for data scientists is Ireland ($275,851). The second highest is Luxembourg ($272,796). Australia pays about $146,803.



# <a name="4 Data Cleaning">Data Cleaning</a>

<h1 id="4 Data Cleaning">Data Cleaning</h1>
<img src="https://recodehive.com/wp-content/uploads/2021/05/Data-Cleaning-1024x361.png">

As our first step, we gathered information on all three datasets and looked into the columns that answer our research questions. The columns below were chosen as key factors for our analysis:
@@ -127,7 +125,8 @@ Some of the column names were not easily understandable, for example, the column
| JobSat | CurrentJobSatis |
| JobSeek | JobStatus |
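
The renames in the table above can be sketched with `DataFrame.rename`. This is a minimal illustration, not the notebook's actual code: the tiny frame and its values are made up, and only the two mappings visible in the table are included.

```python
import pandas as pd

# Hypothetical two-row frame standing in for one year of survey data.
df = pd.DataFrame({
    "JobSat": ["Very satisfied", "Slightly dissatisfied"],
    "JobSeek": ["Not looking", "Actively looking"],
})

# Old name -> clearer name, taken from the rename table.
RENAMES = {"JobSat": "CurrentJobSatis", "JobSeek": "JobStatus"}

# rename() returns a new frame; columns missing from the mapping are left unchanged.
df = df.rename(columns=RENAMES)
print(list(df.columns))
```

Keeping the mapping in one dictionary makes it easy to apply the same renames to each yearly data frame.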

## <a name="4.1 Data Refactoring">4.1) Data Refactoring</a>

<h2 id="4.1 Data Refactoring">4.1) Data Refactoring</h2>

Most of the column values were very detailed and difficult to analyze. For instance, the values in the `EdLevel` column were as below.

@@ -185,7 +184,7 @@ Professional 1037

Similarly, we followed the same approach for other columns such as `Gender`, `Profession`, `UndergradMajor`, `JobStatus`, and `Employment`.
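
One way to shorten verbose answers like these is a lookup table passed to `Series.replace`. The sketch below is an assumption about the approach, not the notebook's code: the long answer strings and the helper name `shorten_values` are hypothetical, and the real survey wording may differ.

```python
import pandas as pd

# Hypothetical verbose answers mapped to short category names.
EDLEVEL_MAP = {
    "Bachelor's degree (BA, BS, B.Eng., etc.)": "Bachelors",
    "Master's degree (MA, MS, M.Eng., MBA, etc.)": "Masters",
    "Professional degree (JD, MD, etc.)": "Professional",
}

def shorten_values(series: pd.Series, mapping: dict) -> pd.Series:
    """Replace exact matches from `mapping`; leave every other value as it is."""
    return series.replace(mapping)

raw = pd.Series([
    "Bachelor's degree (BA, BS, B.Eng., etc.)",
    "Professional degree (JD, MD, etc.)",
    "Something not in the mapping",
], name="EdLevel")

short = shorten_values(raw, EDLEVEL_MAP)
print(short.value_counts())
```

Because unmapped values pass through untouched, a quick `value_counts()` afterwards shows which answers still need a category.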

## <a name="4.2 Categorising the data">4.2) Categorising the data</a>
<h2 id="4.2 Categorising the data">4.2) Categorising the data</h2>

One of our columns, `Ethnicity`, had 173 values and various subcategories. Some of the values are given below for reference.

@@ -239,7 +238,7 @@ df2020.loc[df['Ethnicity'].str.match('Multiracial') == True, 'Ethnicity'] = 'Mul

The above process was carried out for all three data frames: `2018`, `2019`, and `2020`.
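
Repeating the categorisation for each year can be sketched as a small helper applied in a loop. This is an illustration under assumptions: the helper name `collapse_ethnicity`, the sample rows, and the dictionary of frames are hypothetical; only the `str.match` on `'Multiracial'` comes from the snippet above.

```python
import pandas as pd

def collapse_ethnicity(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Set every Ethnicity value that starts with `label` to `label` itself."""
    # na=False treats missing answers as non-matches instead of propagating NaN.
    mask = df["Ethnicity"].str.match(label, na=False)
    df.loc[mask, "Ethnicity"] = label
    return df

# Hypothetical stand-ins for the three yearly data frames.
frames = {
    2018: pd.DataFrame({"Ethnicity": ["Multiracial;South Asian", "East Asian"]}),
    2019: pd.DataFrame({"Ethnicity": ["Multiracial;Black or of African descent", None]}),
    2020: pd.DataFrame({"Ethnicity": ["White or of European descent", "Multiracial"]}),
}

for year, frame in frames.items():
    collapse_ethnicity(frame, "Multiracial")
    print(year, frame["Ethnicity"].tolist())
```

Building the mask from the same frame that is being updated keeps the row index aligned, which matters when the yearly frames have different lengths.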

## <a name="4.3 Handling the null values">4.3) Handling the null values</a>
<h2 id="4.3 Handling the null values">4.3) Handling the null values</h2>

<img src="https://recodehive.com/wp-content/uploads/2021/05/Message-from-Founder-1024x576.png">

@@ -306,19 +305,20 @@ All the null values were handled for all three data sets and ensured the dataset
| YearsCodePro | 18112 | 0 |
| JobSeek | 2153 | 0 |
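
A before/after null count like the table above can be produced as follows. The filling strategy here (median for numeric columns, most frequent value for the rest) is an assumption chosen for illustration, since this section does not show which strategy the notebook used; the helper `fill_nulls` and the sample values are hypothetical.

```python
import pandas as pd

def fill_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric nulls set to the median and other nulls to the mode."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode(dropna=True)
            # Fall back to a placeholder if the column has no non-null value at all.
            out[col] = out[col].fillna(mode.iloc[0] if not mode.empty else "Unknown")
    return out

# Hypothetical sample with the two column names from the table.
before = pd.DataFrame({
    "YearsCodePro": [2.0, None, 10.0],
    "JobSeek": ["Not looking", None, "Not looking"],
})
after = fill_nulls(before)

# Side-by-side null counts, mirroring the layout of the table.
summary = pd.DataFrame({
    "nulls_before": before.isnull().sum(),
    "nulls_after": after.isnull().sum(),
})
print(summary)
```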

# <a name="5 Data Analysis and Visualization">Data Analysis and Visualization</a>
<h1 id="5 Data Analysis and Visualization">Data Analysis and Visualization</h1>

After cleaning and handling outliers in all three datasets, we started looking for valuable insights that we could draw from them.

<img src="https://recodehive.com/wp-content/uploads/2021/05/Message-from-Founder-1024x576.jpg">

## <a name="5.1 Distribution of respondents based on country">5.1) Distribution of respondents based on country</a>
<h2 id="5.1 Distribution of respondents based on country">5.1) Distribution of respondents based on country</h2>

We used `plotly` to create a geo plot showing where the respondents are from and how they are distributed around the world. We found that most of the respondents are from the United States. India is in second position in terms of the number of respondents.

<img src="Data/Images/Geo plot.png">

## <a name="5.2 Impact of participation rate due to different ethnicity">5.2) Impact of participation rate due to different ethnicity</a>

<h2 id="5.2 Impact of participation rate due to different ethnicity">5.2) Impact of participation rate due to different ethnicity</h2>

Consistently across all three years, we found that `white or european descent` has the highest participation rate overall.

@@ -337,29 +337,29 @@ for i, v in enumerate(count):

<img src="Data/Images/Ethnicity vs participation.png">

## <a name="5.3 Most popular programming language in three years">5.3) Most popular programming language in three years</a>
<h2 id="5.3 Most popular programming language in three years">5.3) Most popular programming language in three years</h2>

The most popular language that developers worked with between 2018 and 2020 is JavaScript (14%). The second and third most used languages are HTML/CSS (13%) and SQL (11%). JavaScript and SQL showed the same steadily increasing trend over the three years. The percentage of HTML/CSS increased slightly from 2018 to 2019 but dropped back to its 2018 level in 2020. Python accounted for about 9% in 2018, decreased to 8% in 2019, and rose by 1% in 2020.

Some languages appeared only in 2019: Elixir, Clojure, F#, WebAssembly, and Erlang. Perl, Haskell, and Julia appeared in 2019 and 2020 with small percentages.
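
Percentages like these can be computed as each language's share of all language mentions. The sketch below assumes the answers are stored as semicolon-separated strings in a single column; the helper name `language_share` and the sample answers are hypothetical.

```python
import pandas as pd

def language_share(answers: pd.Series) -> pd.Series:
    """Percentage of all language mentions that each language accounts for."""
    # One row per mentioned language: split on ';', then explode the lists.
    langs = answers.dropna().str.split(";").explode().str.strip()
    return (langs.value_counts(normalize=True) * 100).round(1)

# Hypothetical multi-select answers; None represents a skipped question.
answers = pd.Series([
    "JavaScript;HTML/CSS;SQL",
    "JavaScript;Python",
    None,
    "SQL",
])
share = language_share(answers)
print(share)
```

A share of mentions is much smaller than the fraction of respondents who use a language, because each respondent can list several languages.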

<img src="Data/Images/popular language distribution.png">

## <a name="5.4 Distribution of developers based on their developer role">5.4) Distribution of developers based on their developer role</a>

<h2 id="5.4 Distribution of developers based on their developer role">5.4) Distribution of developers based on their developer role</h2>

Most of the respondents were either back-end or full-stack developers. Marketing and sales professionals make up the lowest percentage compared to the other roles.

<img src="Data/Images/devtype distribution.png">



## <a name="5.5 Distribution of respondents based on age">5.5) Distribution of respondents based on age</a>
<h2 id="5.5 Distribution of respondents based on age">5.5) Distribution of respondents based on age</h2>

Most of the respondents are in the age range 25-29. This suggests that most of the respondents recently joined their companies or have less than 5 years of experience.

<img src="Data/Images/age distribution.png">

## <a name="5.6 Salary distribution of top ten countries">5.6) Salary distribution of top ten countries</a>
<h2 id="5.6 Salary distribution of top ten countries">5.6) Salary distribution of top ten countries</h2>

Overall, the country with the highest mean annual salary is the United States of America ($240,000). The second highest is Australia ($164,926). Though India has a higher number of respondents, it has the lowest mean salary, $25,213, which suggests that the mean salary in developed countries is much higher than in developing countries.

@@ -385,29 +385,33 @@ plt.show()

<img src="Data/Images/salary top ten countries.png">

## <a name="5.7 Analysis of impact of education on salary">5.7) Analysis of impact of education on salary</a>

<h2 id="5.7 Analysis of impact of education on salary">5.7) Analysis of impact of education on salary</h2>

The respondents who hold a Doctorate have the highest mean salary among all education levels. Next, respondents with a Bachelors degree have a higher mean salary than Masters degree holders. This may be due to years of professional coding experience and to the higher number of respondents in that category (35,659 respondents with a Bachelors degree versus 16,940 with a Masters degree).

What is interesting is that the respondents who do not have any degree have a mean salary of $90k. This may reflect the improvement in online learning and advances in technology that are shifting companies away from relying on university degrees.
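
A comparison of this kind needs both the mean salary and the group size per education level. The sketch below shows one way to get them with `groupby`; the column name `Salary` and all values are made up for illustration and do not come from the survey.

```python
import pandas as pd

# Hypothetical rows; real data would have one row per respondent.
df = pd.DataFrame({
    "EdLevel": ["Doctorate", "Doctorate", "Bachelors", "Bachelors", "Bachelors", "Masters"],
    "Salary": [120000, 140000, 90000, 110000, 100000, 95000],
})

# Mean salary and respondent count per education level, highest mean first.
summary = (
    df.groupby("EdLevel")["Salary"]
      .agg(["mean", "count"])
      .sort_values("mean", ascending=False)
)
print(summary)
```

Reporting the count next to the mean helps when reading the result, since a mean over a small group is less reliable than one over a large group.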

<img src="Data/Images/salary on edlevel.png">

## <a name="5.8 Gender distribution among top five countries in 2019">5.8) Gender distribution among top five countries in 2019</a>

<h2 id="5.8 Gender distribution among top five countries in 2019">5.8) Gender distribution among top five countries in 2019</h2>

For the 5 countries with the most survey respondents, we categorized male and female respondents.

In terms of male and female statistics, the US has the largest female percentage at about 10.9%, followed by Canada and the UK at 9.6% and 8.0% respectively. Female respondents were around 5% in India and Germany, the lowest among the top 5 countries.
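
Per-country gender percentages can be sketched with `pd.crosstab` normalised by row. This is an illustration, not the notebook's code: the helper name `gender_share_top_countries`, the category labels, and the sample rows are hypothetical, and the example keeps the top 2 countries instead of 5 to stay small.

```python
import pandas as pd

def gender_share_top_countries(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Row-wise gender percentages for the n countries with the most respondents."""
    top = df["Country"].value_counts().head(n).index
    subset = df[df["Country"].isin(top)]
    # normalize="index" makes each country's row sum to 1 before scaling to percent.
    return pd.crosstab(subset["Country"], subset["Gender"], normalize="index") * 100

# Hypothetical respondents.
df = pd.DataFrame({
    "Country": ["United States"] * 4 + ["India"] * 3 + ["Germany"],
    "Gender": ["Man", "Man", "Man", "Woman", "Man", "Man", "Woman", "Man"],
})
share = gender_share_top_countries(df, n=2)
print(share.round(1))
```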

<img src="Data/Images/gender distribution top 5.png">

## <a name="5.9 Where most data scientist came from in 2019?">5.9) Where most data scientist came from in 2019?</a>
<h2 id="5.9 Where most data scientist came from in 2019?">5.9) Where most data scientist came from in 2019?</h2>

In `2019`, 5,788 data scientists responded to the Stack Overflow survey. Most are from the US, with 1,550 people, about 3 times the number from India. Germany and the UK follow with 427 and 339 people respectively. The rest are Canada, France, Netherlands, Brazil, Russia, and Australia, each with fewer than 200 data scientists.
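
Counts like these can be obtained by filtering on the developer role and counting countries. The sketch assumes roles are stored as semicolon-separated text in a `DevType` column; the helper name `data_scientists_by_country` and the sample rows are hypothetical.

```python
import pandas as pd

def data_scientists_by_country(df: pd.DataFrame, top: int = 10) -> pd.Series:
    """Count respondents whose DevType mentions a data scientist role, per country."""
    # na=False treats respondents who skipped the question as non-matches.
    is_ds = df["DevType"].str.contains("Data scientist", case=False, na=False)
    return df.loc[is_ds, "Country"].value_counts().head(top)

# Hypothetical respondents.
df = pd.DataFrame({
    "Country": ["United States", "United States", "India", "Germany", "India"],
    "DevType": [
        "Data scientist or machine learning specialist;Developer, back-end",
        "Data scientist or machine learning specialist",
        "Data scientist or machine learning specialist",
        None,
        "Developer, mobile",
    ],
})
counts = data_scientists_by_country(df)
print(counts)
```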

<img src="Data/Images/DS_top contries.png">

## <a name="5.10 Countries which pays the most for data scientist in 2019">5.10) Countries which pays the most for data scientist in 2019</a>

<h2 id="5.10 Countries which pays the most for data scientist in 2019">5.10) Countries which pays the most for data scientist in 2019</h2>


In 2019, the three countries with the highest mean annual salary for data scientists are Ireland (`$275,851`), Luxembourg (`$272,769`), and the USA (`$265,211`). In the remaining countries, the mean salary is less than `$200,000` per year. Japan provides the highest mean annual salary among Asian countries (`$118,969`).

@@ -517,7 +521,7 @@ Top 2 features negatively effecting Job Satisfaction are age, country. So, in th
- UndergradMajor and other Science are mostly satisfied.
- Most satisfied countries: Malta, Ghana, Cyprus.

# <a name="7 Conclusion">Conclusion</a>
<h1 id="7 Conclusion">Conclusion:</h1>

Overall, we performed various analyses on the Stack Overflow developer survey and derived insights from it.
We found which country has the highest number of respondents, which language is the most popular, the education level of respondents, the different roles of developers, and so on.