## Explore data

In [24]:
# importing necessary libraries
import pandas as pd

In [25]:
# setting options
pd.set_option('display.max_columns', None)
pd.set_option("display.float_format", "{:.2f}".format)

In [26]:
df = pd.read_csv('developer_dataset.csv')

In [27]:
# which columns in the data
df.columns

Index(['RespondentID', 'Year', 'Country', 'Employment', 'UndergradMajor',
       'DevType', 'LanguageWorkedWith', 'LanguageDesireNextYear',
       'DatabaseWorkedWith', 'DatabaseDesireNextYear', 'PlatformWorkedWith',
       'PlatformDesireNextYear', 'Hobbyist', 'OrgSize', 'YearsCodePro',
       'JobSeek', 'ConvertedComp', 'WorkWeekHrs', 'NEWJobHunt',
       'NEWJobHuntResearch', 'NEWLearn'],
      dtype='object')

The columns fall into the following kinds of information:
- A variety of columns that identify the person (RespondentID, Year, Country)
- Information about their experiences (LanguageWorkedWith, DatabaseWorkedWith, UndergradMajor, etc.)
- Information about what they might want to do in the future (LanguageDesireNextYear, DatabaseDesireNextYear, etc.)

In [28]:
# non-null values per column (equals the row count only for columns with no missing data)
df.count()

RespondentID              111209
Year                      111209
Country                   111209
Employment                109425
UndergradMajor             98453
DevType                   100433
LanguageWorkedWith        102018
LanguageDesireNextYear     96044
DatabaseWorkedWith         85859
DatabaseDesireNextYear     74234
PlatformWorkedWith         91609
PlatformDesireNextYear     85376
Hobbyist                   68352
OrgSize                    54804
YearsCodePro               94793
JobSeek                    60556
ConvertedComp              91333
WorkWeekHrs                51089
NEWJobHunt                 19127
NEWJobHuntResearch         18683
NEWLearn                   24226
dtype: int64
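Note that `count()` reports non-null entries per column, not the number of rows. A minimal sketch of the difference, using a small made-up frame (`toy` and its values are illustrative assumptions, not part of the survey data):

```python
import pandas as pd

# Toy frame standing in for the survey data (values are made up for illustration).
toy = pd.DataFrame({
    "RespondentID": [1, 2, 3, 4],
    "WorkWeekHrs": [40.0, None, 45.0, None],
})

# shape gives total rows and columns, regardless of missing values
n_rows, n_cols = toy.shape

# count() gives the number of non-null entries in each column
non_null = toy.count()

print(n_rows, n_cols)
print(non_null)
```

On the real data, `df.shape` or `len(df)` would give the row count directly.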

In [29]:
# basic summary statistics on the dataset
df.describe()

       RespondentID      Year  YearsCodePro  ConvertedComp  WorkWeekHrs
count      111209.0  111209.0       94793.0        91333.0      51089.0
mean       19262.04   2018.85          9.55      125177.66        41.05
std        11767.01      0.78          7.55      246121.76        13.83
min             1.0    2018.0           0.0            0.0          1.0
25%          9268.0    2018.0           4.0        46000.0         40.0
50%         18535.0    2019.0           8.0        79000.0         40.0
75%         28347.0    2019.0          14.0       120000.0         42.0
max         42857.0    2020.0          50.0      2000000.0        475.0


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111209 entries, 0 to 111208
Data columns (total 21 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   RespondentID            111209 non-null  int64  
 1   Year                    111209 non-null  int64  
 2   Country                 111209 non-null  object 
 3   Employment              109425 non-null  object 
 4   UndergradMajor          98453 non-null   object 
 5   DevType                 100433 non-null  object 
 6   LanguageWorkedWith      102018 non-null  object 
 7   LanguageDesireNextYear  96044 non-null   object 
 8   DatabaseWorkedWith      85859 non-null   object 
 9   DatabaseDesireNextYear  74234 non-null   object 
 10  PlatformWorkedWith      91609 non-null   object 
 11  PlatformDesireNextYear  85376 non-null   object 
 12  Hobbyist                68352 non-null   object 
 13  OrgSize                 54804 non-null   object 
 14  YearsCodePro        

#### Observations about the dataset

The data set contains 21 columns and 111,209 rows.

Three columns (NEWJobHunt, NEWJobHuntResearch, NEWLearn) have a lot of missing values: each is non-null in fewer than 25,000 of the 111,209 rows.

The interesting columns are:
- ConvertedComp
- LanguageWorkedWith
- WorkWeekHrs

It would be interesting to see how salary (ConvertedComp) relates to the programming languages that people use and to how many hours they work per week.
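One way to approach the salary-versus-language question is sketched below. It assumes that `LanguageWorkedWith` stores several languages in one string separated by semicolons; that separator is an assumption and should be checked against the real data. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy data; the ';' separator in LanguageWorkedWith is an assumption.
toy = pd.DataFrame({
    "ConvertedComp": [50000.0, 90000.0, 120000.0],
    "LanguageWorkedWith": ["Python;SQL", "Python", "Go;SQL"],
})

# Split each string into a list, then give every language its own row
langs = toy.assign(
    Language=toy["LanguageWorkedWith"].str.split(";")
).explode("Language")

# Median compensation per language (median is less sensitive to outliers than mean)
median_comp = langs.groupby("Language")["ConvertedComp"].median()
print(median_comp)
```

A respondent who lists several languages contributes to each of them, so the groups overlap.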

## Delete highly missing data

In [31]:
# percentage of missing data in each column
# RespondentID has no missing values, so its count equals the total number of rows
maxRows = df['RespondentID'].count()
print('% Missing Data:')
print((1 - df.count() / maxRows) * 100)

% Missing Data:
RespondentID              0.00
Year                      0.00
Country                   0.00
Employment                1.60
UndergradMajor           11.47
DevType                   9.69
LanguageWorkedWith        8.26
LanguageDesireNextYear   13.64
DatabaseWorkedWith       22.79
DatabaseDesireNextYear   33.25
PlatformWorkedWith       17.62
PlatformDesireNextYear   23.23
Hobbyist                 38.54
OrgSize                  50.72
YearsCodePro             14.76
JobSeek                  45.55
ConvertedComp            17.87
WorkWeekHrs              54.06
NEWJobHunt               82.80
NEWJobHuntResearch       83.20
NEWLearn                 78.22
dtype: float64
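The same percentages can also be computed without relying on a fully populated column. A minimal sketch using `isna().mean()`, shown on a made-up frame (`toy` and its values are illustrative assumptions):

```python
import pandas as pd

# Toy frame standing in for the survey data (values are made up for illustration).
toy = pd.DataFrame({
    "RespondentID": [1, 2, 3, 4],
    "OrgSize": ["2-9", None, None, "10-19"],
    "NEWLearn": [None, None, None, "Once a year"],
})

# isna() marks missing cells as True; the mean of booleans is the missing fraction
missing_pct = toy.isna().mean() * 100
print(missing_pct)
```

This avoids the assumption that `RespondentID` is never missing.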


Based on the above numbers, we remove the following columns, each of which has more than 60% missing values (between 78% and 83%):

- NEWJobHunt
- NEWJobHuntResearch
- NEWLearn
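The 60% cutoff can also be applied programmatically rather than by listing column names by hand. A minimal sketch on a made-up frame; `THRESHOLD`, `toy`, and `trimmed` are hypothetical names introduced for illustration:

```python
import pandas as pd

THRESHOLD = 60.0  # percent missing; the cutoff used in the text

# Toy frame standing in for the survey data (values are made up for illustration).
toy = pd.DataFrame({
    "RespondentID": [1, 2, 3, 4, 5],
    "WorkWeekHrs": [40.0, None, 45.0, None, 38.0],         # 40% missing, kept
    "NEWLearn": [None, None, None, None, "Once a year"],   # 80% missing, dropped
})

# Percentage of missing values per column
missing_pct = toy.isna().mean() * 100

# Names of the columns whose missing share exceeds the cutoff
cols_to_drop = missing_pct[missing_pct > THRESHOLD].index.tolist()

# drop() returns a new frame here; the original toy frame is left unchanged
trimmed = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
print(trimmed.columns.tolist())
```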

In [None]:
# drop the three mostly-empty columns (columns= already implies the axis, so axis=1 is not needed)
df.drop(columns=['NEWJobHunt', 'NEWJobHuntResearch', 'NEWLearn'], inplace=True)

In [33]:
df.columns

Index(['RespondentID', 'Year', 'Country', 'Employment', 'UndergradMajor',
       'DevType', 'LanguageWorkedWith', 'LanguageDesireNextYear',
       'DatabaseWorkedWith', 'DatabaseDesireNextYear', 'PlatformWorkedWith',
       'PlatformDesireNextYear', 'Hobbyist', 'OrgSize', 'YearsCodePro',
       'JobSeek', 'ConvertedComp', 'WorkWeekHrs'],
      dtype='object')