# The task is divided into two parts:
- ## compare how parameters such as *country*, *professional status*, *education* (formal and non-formal), *IT skills* and *work experience* affect salary,
- ## predict the salary of respondents whose salary value is `NaN`.

### Because the survey schema contains a large number of different kinds of questions (some of them unrelated to IT), not every column is needed to get the job done.

In [1]:
import pandas as pd
import numpy as np

### Load the cleaned CSV file into the ***'df_answers'*** DataFrame

In [2]:
df_answers = pd.read_csv('survey_results_public_clean.csv', index_col = False)
df_answers.head()

Unnamed: 0,Professional,ProgramHobby,Country,University,EmploymentStatus,FormalEducation,MajorUndergrad,HomeRemote,CompanySize,CompanyType,YearsProgram,YearsCodedJob,CareerSatisfaction,JobSatisfaction,Gender,Race,DeveloperType,HaveWorkedLanguage,Currency,Salary
0,Student,"Yes, both",United States,No,"Not employed, and not looking for work",Secondary school,,,,,2 to 3 years,,,,Male,White or of European descent,,Swift,,
1,Student,"Yes, both",United Kingdom,"Yes, full-time",Employed part-time,Some college/university study without earning ...,Computer science or software engineering,"More than half, but not all, the time",20 to 99 employees,"Privately-held limited company, not in startup...",9 to 10 years,,,,Male,White or of European descent,,JavaScript; Python; Ruby; SQL,British pounds sterling (£),
2,Professional developer,"Yes, both",United Kingdom,No,Employed full-time,Bachelor's degree,Computer science or software engineering,"Less than half the time, but at least one day ...","10,000 or more employees",Publicly-traded corporation,20 or more years,20 or more years,8.0,9.0,Male,White or of European descent,Other,Java; PHP; Python,British pounds sterling (£),113750.0
3,Professional non-developer who sometimes write...,"Yes, both",United States,No,Employed full-time,Doctoral degree,A non-computer-focused engineering discipline,"Less than half the time, but at least one day ...","10,000 or more employees",Non-profit/non-governmental organization or pr...,14 to 15 years,9 to 10 years,6.0,3.0,Male,White or of European descent,,Matlab; Python; R; SQL,,
4,Professional developer,"Yes, I program as a hobby",Switzerland,No,Employed full-time,Master's degree,Computer science or software engineering,Never,10 to 19 employees,"Privately-held limited company, not in startup...",20 or more years,10 to 11 years,6.0,8.0,,,Mobile developer; Graphics programming; Deskto...,,,


### Info about our new dataset

In [3]:
df_answers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51392 entries, 0 to 51391
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Professional        51392 non-null  object 
 1   ProgramHobby        51392 non-null  object 
 2   Country             51392 non-null  object 
 3   University          51392 non-null  object 
 4   EmploymentStatus    51392 non-null  object 
 5   FormalEducation     51392 non-null  object 
 6   MajorUndergrad      42841 non-null  object 
 7   HomeRemote          44008 non-null  object 
 8   CompanySize         38922 non-null  object 
 9   CompanyType         38823 non-null  object 
 10  YearsProgram        51145 non-null  object 
 11  YearsCodedJob       40890 non-null  object 
 12  CareerSatisfaction  42695 non-null  float64
 13  JobSatisfaction     40376 non-null  float64
 14  Gender              35047 non-null  object 
 15  Race                33033 non-null  object 
 16  Deve

***

### Let's check how many unique values there are in every column (excluding 'Salary')

In [4]:
def Dataframe_unique_values(df):
    '''
        Count the unique values in every column, then report their sum
        and the total number of possible value combinations across all columns.
        This gives a rough idea of how complex the data set is.
        Missing values (NaN) are counted as a separate unique value.
        
        new_df: DataFrame without the 'Salary' column
    '''
    
    # Starting values for the running product and the running sum
    multiply, summary = 1, 0
    # Exclude 'Salary' from the DataFrame
    new_df = df.drop('Salary', axis=1)
    
    for val in new_df.columns:
        unique_num = len(new_df[val].unique())
        print(f"Number of unique values in '{val}': {unique_num}")
        multiply *= unique_num
        summary += unique_num
    print(f'Sum of all unique values in our columns are: {summary}')
    print(f'Number of all possible unique values combinations in dataset are: {multiply} and it is more than 10**{np.log10(float(multiply)):.0f}')
    return summary, multiply

unique_values_before = Dataframe_unique_values(df_answers)

Number of unique values in 'Professional': 5
Number of unique values in 'ProgramHobby': 4
Number of unique values in 'Country': 201
Number of unique values in 'University': 4
Number of unique values in 'EmploymentStatus': 7
Number of unique values in 'FormalEducation': 9
Number of unique values in 'MajorUndergrad': 17
Number of unique values in 'HomeRemote': 8
Number of unique values in 'CompanySize': 11
Number of unique values in 'CompanyType': 12
Number of unique values in 'YearsProgram': 22
Number of unique values in 'YearsCodedJob': 22
Number of unique values in 'CareerSatisfaction': 12
Number of unique values in 'JobSatisfaction': 12
Number of unique values in 'Gender': 30
Number of unique values in 'Race': 98
Number of unique values in 'DeveloperType': 1824
Number of unique values in 'HaveWorkedLanguage': 8439
Number of unique values in 'Currency': 18
Sum of all unique values in our columns are: 10755
Number of all possible unique values combinations in dataset are: 1032483080631
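
The same counts can also be obtained with pandas' built-in `DataFrame.nunique`. The sketch below is only an illustrative alternative, not the notebook's own function: the helper name `unique_value_summary` and the toy data are made up, and `math.prod` (Python 3.8+) is assumed so that the product stays an exact Python integer however large it grows.

```python
import math

import pandas as pd


def unique_value_summary(df, exclude=("Salary",)):
    """Return per-column unique counts, their sum, and their product.

    Hypothetical helper: NaN is counted as its own value (dropna=False),
    which mirrors len(Series.unique()).
    """
    counts = df.drop(columns=list(exclude), errors="ignore").nunique(dropna=False)
    total = int(counts.sum())
    # Multiply as Python ints so a very large product cannot overflow
    combinations = math.prod(int(c) for c in counts)
    return counts, total, combinations


# Toy data, invented for this example only
toy = pd.DataFrame({
    "Country": ["Poland", "India", "Poland", None],
    "Professional": ["Student", "Student", "Professional developer", "Student"],
    "Salary": [None, 1000.0, 2000.0, None],
})

counts, total, combinations = unique_value_summary(toy)
print(counts)
print(f"sum = {total}, combinations = {combinations}")
```

On the toy frame, 'Country' has three unique values (two names plus the missing entry) and 'Professional' has two.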

## This is a big mess, so we must reduce the number of unique values as far as we can.

### The columns will be checked in the order below:
1) 'Country'
2) 'Professional', 'ProgramHobby', 'University', 'EmploymentStatus', 'FormalEducation'
3) 'MajorUndergrad', 'HomeRemote'
4) 'CompanySize', 'CompanyType'
5) 'YearsProgram', 'YearsCodedJob'
6) 'CareerSatisfaction', 'JobSatisfaction'
7) 'Gender'
8) 'Race'
9) 'DeveloperType'
10) 'HaveWorkedLanguage'
11) 'Currency'

### List of unique values in columns

In [5]:
def Unique_values_list(*cols):
    '''
        List the first n unique values (names and counts) in each given column.
        
        n: upper limit of unique values printed per column
    '''
    
    n = 25
    
    for col in cols:
        # Count every value, treating NaN as its own category
        col_unique_names = df_answers[col].value_counts(dropna=False)
        unique_numbers = len(col_unique_names)
    
        if unique_numbers > n:
            print(f"Number of unique values in '{col}': {unique_numbers}\
            \n\nFirst {n} unique values:\n{col_unique_names[:n]}\n\n")
        else:
            print(f"Number of unique values in '{col}': {unique_numbers}\
            \n\nAll unique values:\n{col_unique_names}\n\n")
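
The function above relies on `value_counts(dropna=False)`, which is why missing answers show up as a separate `NaN` row in the listings. A minimal sketch of the difference, on a toy Series invented for this example:

```python
import numpy as np
import pandas as pd

# Toy answers: three missing values, two distinct real answers
s = pd.Series(["Never", "Never", np.nan, "About half the time", np.nan, np.nan])

with_nan = s.value_counts(dropna=False)   # NaN is counted as its own category
without_nan = s.value_counts()            # default: NaN is silently skipped

print(with_nan)
print(without_nan)
```

With `dropna=False` the Series reports three categories, and the missing answers are the most frequent one; with the default it reports only two.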

#### **1) 'Country'**

In [6]:
Unique_values_list('Country')

Number of unique values in 'Country': 201            

First 25 unique values:
United States         11455
India                  5197
United Kingdom         4395
Germany                4143
Canada                 2233
France                 1740
Poland                 1290
Australia               913
Russian Federation      873
Spain                   864
Netherlands             855
Italy                   781
Brazil                  777
Sweden                  611
Switzerland             595
Israel                  575
Romania                 561
Iran                    507
Austria                 477
Pakistan                454
Czech Republic          411
Belgium                 404
South Africa            380
Turkey                  363
Ukraine                 356
Name: Country, dtype: int64




#### Because many countries in this Series are probably represented by only a few respondents,
#### we will keep only countries with at least 50 respondents. This is the simplest way to detect and remove outliers.

In [7]:
countries_names = df_answers.groupby('Country').filter(lambda x: len(x) >= 50)['Country'].unique()

print(f"Number of the filtered countries: {len(countries_names)}\n")
print(f"Names of the filtered countries: \n{countries_names}")

Number of the filtered countries: 81

Names of the filtered countries: 
['United States' 'United Kingdom' 'Switzerland' 'New Zealand' 'Poland'
 'Colombia' 'France' 'Canada' 'Germany' 'Greece' 'Brazil' 'Israel' 'Italy'
 'Belgium' 'India' 'Chile' 'Croatia' 'Argentina' 'Netherlands' 'Denmark'
 'Ukraine' 'Sri Lanka' 'Malaysia' 'Finland' 'Turkey' 'Spain' 'Austria'
 'Mexico' 'Russian Federation' 'Bulgaria' 'Uruguay' 'Estonia' 'Iran'
 'Bangladesh' 'Sweden' 'Lithuania' 'Romania' 'Costa Rica' 'Serbia'
 'Slovenia' 'United Arab Emirates' 'Tunisia' 'Kenya' 'Norway'
 'Dominican Republic' 'Belarus' 'Portugal' 'Czech Republic' 'Albania'
 'I prefer not to say' 'South Africa' 'Moldavia' 'Ireland' 'Nepal'
 'Pakistan' 'Slovak Republic' 'Hungary' 'Egypt' 'Australia' 'Japan'
 'South Korea' 'Vietnam' 'Saudi Arabia' 'Macedonia' 'Bosnia-Herzegovina'
 'Indonesia' 'Nigeria' 'Peru' 'Morocco' 'Armenia' 'Lebanon' 'China'
 'Latvia' 'Singapore' 'Thailand' 'Philippines' 'Hong Kong' 'Taiwan'
 'Afghanistan' 'Ghana' 'Ve
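
A threshold filter of this kind can also be written with `value_counts`, which avoids calling a Python lambda once per group. This is a sketch under stated assumptions: the helper name `frequent_values` is hypothetical, and the toy data uses a threshold of 2 instead of 50 so the effect is visible.

```python
import pandas as pd


def frequent_values(series, min_count):
    """Return the values that occur at least `min_count` times (hypothetical helper)."""
    counts = series.value_counts()
    return counts[counts >= min_count].index.tolist()


# Toy data, invented for this example only
toy = pd.DataFrame({"Country": ["Poland"] * 3 + ["India"] * 2 + ["Fiji"]})

kept = frequent_values(toy["Country"], min_count=2)
filtered = toy[toy["Country"].isin(kept)].reset_index(drop=True)

print(kept)
print(len(filtered))
```

Here 'Fiji' appears only once, so it falls below the threshold and its row is dropped.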

#### After this simple filtering, the list is limited to 81 entries (only about 40% of the original 201).
#### We can also remove the 'I prefer not to say' value, which carries no information; that is done in the next step.

In [8]:
df_answers = df_answers[(df_answers['Country'].isin(countries_names)) & 
                        (~df_answers['Country'].isin(['I prefer not to say']))].reset_index(drop=True)
df_answers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50048 entries, 0 to 50047
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Professional        50048 non-null  object 
 1   ProgramHobby        50048 non-null  object 
 2   Country             50048 non-null  object 
 3   University          50048 non-null  object 
 4   EmploymentStatus    50048 non-null  object 
 5   FormalEducation     50048 non-null  object 
 6   MajorUndergrad      41807 non-null  object 
 7   HomeRemote          43004 non-null  object 
 8   CompanySize         38109 non-null  object 
 9   CompanyType         38015 non-null  object 
 10  YearsProgram        49825 non-null  object 
 11  YearsCodedJob       39961 non-null  object 
 12  CareerSatisfaction  41691 non-null  float64
 13  JobSatisfaction     39500 non-null  float64
 14  Gender              34324 non-null  object 
 15  Race                32380 non-null  object 
 16  Deve

#### **2) 'Professional', 'ProgramHobby', 'University', 'EmploymentStatus', 'FormalEducation'**

In [9]:
Unique_values_list('Professional')

Number of unique values in 'Professional': 5            

All unique values:
Professional developer                                  35368
Student                                                  7903
Professional non-developer who sometimes writes code     4955
Used to be a professional developer                       951
None of these                                             871
Name: Professional, dtype: int64




In [10]:
df_answers['Professional'] = df_answers['Professional'].replace('Used to be a professional developer', 'Professional developer')

In [11]:
Unique_values_list('ProgramHobby')

Number of unique values in 'ProgramHobby': 4            

All unique values:
Yes, I program as a hobby                    24193
Yes, both                                    13421
No                                            9505
Yes, I contribute to open source projects     2929
Name: ProgramHobby, dtype: int64




In [12]:
df_answers['ProgramHobby'] = df_answers['ProgramHobby'].replace(['Yes, both', 'Yes, I program as a hobby', 'Yes, I contribute to open source projects'], 'Yes')

In [13]:
Unique_values_list('University')

Number of unique values in 'University': 4            

All unique values:
No                     36742
Yes, full-time          9037
Yes, part-time          3218
I prefer not to say     1051
Name: University, dtype: int64




In [14]:
df_answers['University'] = df_answers['University'].replace('I prefer not to say', 'No')

In [15]:
Unique_values_list('EmploymentStatus')

Number of unique values in 'EmploymentStatus': 7            

All unique values:
Employed full-time                                      35399
Independent contractor, freelancer, or self-employed     5035
Employed part-time                                       3099
Not employed, and not looking for work                   2698
Not employed, but looking for work                       2672
I prefer not to say                                       994
Retired                                                   151
Name: EmploymentStatus, dtype: int64




In [16]:
df_answers['EmploymentStatus'] = df_answers['EmploymentStatus'].replace(['I prefer not to say', 'Retired'], 'Not employed, and not looking for work')

In [17]:
Unique_values_list('FormalEducation')

Number of unique values in 'FormalEducation': 9            

All unique values:
Bachelor's degree                                                    21031
Master's degree                                                      10939
Some college/university study without earning a bachelor's degree     7932
Secondary school                                                      5752
Doctoral degree                                                       1277
I prefer not to answer                                                1032
Primary/elementary school                                             1001
Professional degree                                                    686
I never completed any formal education                                 398
Name: FormalEducation, dtype: int64




In [18]:
df_answers['FormalEducation'] = df_answers['FormalEducation'].replace('I prefer not to answer', 'I never completed any formal education')
df_answers['FormalEducation'] = df_answers['FormalEducation'].replace("Some college/university study without earning a bachelor's degree", 'Secondary school')
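
The column-by-column merges in this section could also be collected in one place. The sketch below assumes a nested dictionary passed to `DataFrame.replace`, where the outer key is the column and the inner dictionary maps old values to new ones; the name `merge_rules` and the toy data are invented for illustration, and only two of the columns are shown.

```python
import pandas as pd

# Hypothetical rule table: {column: {old value: new value}}
merge_rules = {
    "ProgramHobby": {
        "Yes, both": "Yes",
        "Yes, I program as a hobby": "Yes",
        "Yes, I contribute to open source projects": "Yes",
    },
    "University": {"I prefer not to say": "No"},
}

# Toy data, invented for this example only
toy = pd.DataFrame({
    "ProgramHobby": ["Yes, both", "No", "Yes, I program as a hobby"],
    "University": ["No", "I prefer not to say", "Yes, full-time"],
})

# Each inner mapping is applied only to its own column
merged = toy.replace(merge_rules)
print(merged)
```

Keeping the rules in a single dictionary makes it easier to review which answers were merged and why.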

#### What does it look like after the modifications?

In [19]:
Unique_values_list('Professional', 'ProgramHobby', 'University', 'EmploymentStatus', 'FormalEducation')

Number of unique values in 'Professional': 4            

All unique values:
Professional developer                                  36319
Student                                                  7903
Professional non-developer who sometimes writes code     4955
None of these                                             871
Name: Professional, dtype: int64


Number of unique values in 'ProgramHobby': 2            

All unique values:
Yes    40543
No      9505
Name: ProgramHobby, dtype: int64


Number of unique values in 'University': 3            

All unique values:
No                37793
Yes, full-time     9037
Yes, part-time     3218
Name: University, dtype: int64


Number of unique values in 'EmploymentStatus': 5            

All unique values:
Employed full-time                                      35399
Independent contractor, freelancer, or self-employed     5035
Not employed, and not looking for work                   3843
Employed part-time                                     

#### **3) 'MajorUndergrad', 'HomeRemote'**

In [20]:
Unique_values_list('MajorUndergrad')

Number of unique values in 'MajorUndergrad': 17            

All unique values:
Computer science or software engineering                        20875
NaN                                                              8241
Computer engineering or electrical/electronics engineering       4269
Computer programming or Web development                          3760
Information technology, networking, or system administration     2074
A natural science                                                1840
A non-computer-focused engineering discipline                    1762
Mathematics or statistics                                        1618
Something else                                                   1025
A humanities discipline                                           891
A business discipline                                             884
Fine arts or performing arts                                      639
Management information systems                                    630
A social s

### Gather all non-technical disciplines into one list and replace them with the 'A humanities discipline' option. All missing cells will be filled with 'I never declared a major', because a missing answer probably means the same thing.

In [21]:
non_tech_array = ['A natural science', 'Fine arts or performing arts',
                  'A social science', 'Psychology', 'A health science']

df_answers['MajorUndergrad'] = df_answers['MajorUndergrad'].replace(non_tech_array, 'A humanities discipline')
df_answers['MajorUndergrad'] = df_answers['MajorUndergrad'].fillna('I never declared a major')

In [22]:
Unique_values_list('MajorUndergrad')

Number of unique values in 'MajorUndergrad': 11            

All unique values:
Computer science or software engineering                        20875
I never declared a major                                         8821
A humanities discipline                                          4330
Computer engineering or electrical/electronics engineering       4269
Computer programming or Web development                          3760
Information technology, networking, or system administration     2074
A non-computer-focused engineering discipline                    1762
Mathematics or statistics                                        1618
Something else                                                   1025
A business discipline                                             884
Management information systems                                    630
Name: MajorUndergrad, dtype: int64




In [23]:
Unique_values_list('HomeRemote')

Number of unique values in 'HomeRemote': 8            

All unique values:
A few days each month                                      15186
Never                                                      13706
NaN                                                         7044
All or almost all the time (I'm full-time remote)           4741
Less than half the time, but at least one day each week     4045
More than half, but not all, the time                       1822
It's complicated                                            1813
About half the time                                         1691
Name: HomeRemote, dtype: int64




In [24]:
df_answers['HomeRemote'] = df_answers['HomeRemote'].replace('Less than half the time, but at least one day each week', 'A few days each month')
df_answers['HomeRemote'] = df_answers['HomeRemote'].replace("It's complicated", 'About half the time')
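
Before deciding how to treat gaps in a column, it can help to measure them. A minimal sketch, on toy data invented for this example, of computing the percentage of missing answers per column:

```python
import numpy as np
import pandas as pd

# Toy data, invented for this example only
toy = pd.DataFrame({
    "HomeRemote": ["Never", np.nan, "Never", "About half the time"],
    "Country": ["Poland", "India", "Poland", "Germany"],
})

# isna() gives True/False per cell; the mean of booleans is the share of True
missing_share = toy.isna().mean().mul(100).round(1)
print(missing_share)
```

One of the four 'HomeRemote' answers is missing, so the column reports 25.0, while 'Country' reports 0.0.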

#### It looks fine, but what about the missing values (14% of all answers)?
#### I propose filling these values with the remaining options, in proportion to their frequencies, using a hand-written function.

In [25]:
def Missing_values_multireplacement(col):
    '''
        Replace missing (NaN) values with values drawn at random from the same Series,
        following the observed frequency distribution of the non-missing values.
        
        p: relative frequency of each unique value in the column
    '''
    
    # Boolean mask of the missing entries
    missing = df_answers[col].isnull()
    # Relative frequencies (decimal fractions) of the non-missing unique values
    col_stats = df_answers[col].value_counts(normalize=True)
    # Fill every missing entry with a random draw weighted by those frequencies
    # (no seed is set, so the exact result differs between runs)
    df_answers.loc[missing, col] = np.random.choice(col_stats.index, size=missing.sum(), p=col_stats.values)
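
Because the fill is random, two runs of the notebook will not produce identical columns. The sketch below shows one possible way to make the step repeatable; it is not the notebook's function. It assumes NumPy's `default_rng` with an explicit seed, works on a Series passed as an argument instead of the global DataFrame, and the name `fill_missing_by_distribution` is hypothetical.

```python
import numpy as np
import pandas as pd


def fill_missing_by_distribution(series, seed=0):
    """Return a copy of `series` with NaN replaced by weighted random draws.

    Hypothetical helper: draws follow the relative frequencies of the
    non-missing values, and a fixed seed makes the result repeatable.
    """
    rng = np.random.default_rng(seed)
    missing = series.isna()
    freqs = series.value_counts(normalize=True)
    filled = series.copy()
    filled.loc[missing] = rng.choice(
        freqs.index.to_numpy(), size=int(missing.sum()), p=freqs.to_numpy()
    )
    return filled


# Toy data, invented for this example only: 6 + 2 real answers, 4 missing
toy = pd.Series(["Never"] * 6 + ["About half the time"] * 2 + [np.nan] * 4)
filled = fill_missing_by_distribution(toy, seed=42)
print(filled.value_counts())
```

The original answers are left untouched, only the four gaps are filled, and calling the helper again with the same seed gives the same result.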

In [26]:
Missing_values_multireplacement('HomeRemote')
Unique_values_list('HomeRemote')

Number of unique values in 'HomeRemote': 5            

All unique values:
A few days each month                                22343
Never                                                15961
All or almost all the time (I'm full-time remote)     5538
About half the time                                   4088
More than half, but not all, the time                 2118
Name: HomeRemote, dtype: int64




#### **4) 'CompanySize', 'CompanyType'**

In [27]:
Unique_values_list('CompanySize')

Number of unique values in 'CompanySize': 11            

All unique values:
NaN                         11939
20 to 99 employees           8372
100 to 499 employees         7130
10,000 or more employees     5638
10 to 19 employees           4000
1,000 to 4,999 employees     3768
Fewer than 10 employees      3686
500 to 999 employees         2423
5,000 to 9,999 employees     1581
I don't know                  851
I prefer not to answer        660
Name: CompanySize, dtype: int64




### "I don't know" and 'I prefer not to answer' will be treated as missing data (they do not give us any useful information).
### 'Fewer than 10 employees' will be renamed to match the *'x to y employees'* format.

In [28]:
df_answers['CompanySize'] = df_answers['CompanySize'].replace(["I don't know", 'I prefer not to answer'], np.nan)
df_answers['CompanySize'] = df_answers['CompanySize'].replace('Fewer than 10 employees', '1 to 9 employees')
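
One practical benefit of a uniform label format is that every label now starts with a number, so the categories can be put in size order mechanically. A minimal sketch, assuming a hypothetical helper `lower_bound` that reads the leading number of each label:

```python
import re


def lower_bound(label):
    """Return the leading number of a label such as '1,000 to 4,999 employees'.

    Hypothetical helper for illustration; raises if the label has no leading number.
    """
    match = re.match(r"\s*([\d,]+)", label)
    if match is None:
        raise ValueError(f"No leading number in label: {label!r}")
    return int(match.group(1).replace(",", ""))


labels = [
    "1 to 9 employees",
    "10,000 or more employees",
    "20 to 99 employees",
    "1,000 to 4,999 employees",
]

# Sort by the numeric lower bound instead of alphabetically
ordered = sorted(labels, key=lower_bound)
print(ordered)
```

The original 'Fewer than 10 employees' label would raise an error here, which is the case the renaming avoids.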

In [29]:
Unique_values_list('CompanySize')

Number of unique values in 'CompanySize': 9            

All unique values:
NaN                         13450
20 to 99 employees           8372
100 to 499 employees         7130
10,000 or more employees     5638
10 to 19 employees           4000
1,000 to 4,999 employees     3768
1 to 9 employees             3686
500 to 999 employees         2423
5,000 to 9,999 employees     1581
Name: CompanySize, dtype: int64




In [30]:
Unique_values_list('CompanyType')

Number of unique values in 'CompanyType': 12            

All unique values:
Privately-held limited company, not in startup mode                      16377
NaN                                                                      12033
Publicly-traded corporation                                               5826
I don't know                                                              3171
Sole proprietorship or partnership, not in startup mode                   2765
Venture-funded startup                                                    2358
Government agency or public school/university                             2351
I prefer not to answer                                                    1770
Pre-series A startup                                                      1257
Non-profit/non-governmental organization or private school/university     1189
State-owned company                                                        616
Something else                                        

In [31]:
df_answers['CompanyType'] = df_answers['CompanyType'].replace(["I don't know", 'I prefer not to answer'], np.nan)

In [32]:
Unique_values_list('CompanyType')

Number of unique values in 'CompanyType': 10            

All unique values:
NaN                                                                      16974
Privately-held limited company, not in startup mode                      16377
Publicly-traded corporation                                               5826
Sole proprietorship or partnership, not in startup mode                   2765
Venture-funded startup                                                    2358
Government agency or public school/university                             2351
Pre-series A startup                                                      1257
Non-profit/non-governmental organization or private school/university     1189
State-owned company                                                        616
Something else                                                             335
Name: CompanyType, dtype: int64




### For both columns, we run the *Missing_values_multireplacement()* function

In [33]:
Missing_values_multireplacement('CompanySize')
Missing_values_multireplacement('CompanyType')

In [34]:
Unique_values_list('CompanySize', 'CompanyType')

Number of unique values in 'CompanySize': 8            

All unique values:
20 to 99 employees          11621
100 to 499 employees         9659
10,000 or more employees     7681
10 to 19 employees           5488
1,000 to 4,999 employees     5127
1 to 9 employees             5016
500 to 999 employees         3327
5,000 to 9,999 employees     2129
Name: CompanySize, dtype: int64


Number of unique values in 'CompanyType': 9            

All unique values:
Privately-held limited company, not in startup mode                      24839
Publicly-traded corporation                                               8789
Sole proprietorship or partnership, not in startup mode                   4210
Venture-funded startup                                                    3568
Government agency or public school/university                             3548
Pre-series A startup                                                      1885
Non-profit/non-governmental organization or private school/universit

#### **5) 'YearsProgram', 'YearsCodedJob'**

In [35]:
Unique_values_list('YearsProgram')

Number of unique values in 'YearsProgram': 22            

All unique values:
20 or more years    8690
4 to 5 years        3738
3 to 4 years        3573
5 to 6 years        3472
2 to 3 years        3123
9 to 10 years       3115
6 to 7 years        2773
1 to 2 years        2665
7 to 8 years        2395
10 to 11 years      2134
14 to 15 years      1972
8 to 9 years        1851
15 to 16 years      1647
Less than a year    1425
11 to 12 years      1357
12 to 13 years      1271
13 to 14 years      1075
16 to 17 years      1032
19 to 20 years      1010
17 to 18 years       871
18 to 19 years       636
NaN                  223
Name: YearsProgram, dtype: int64




### There are not many missing values in this column, so I decided to delete these rows using the dropna() method.

In [36]:
df_answers = df_answers.dropna(how='all', subset=['YearsProgram']).reset_index(drop=True)
df_answers['YearsProgram'] = df_answers['YearsProgram'].replace('Less than a year', '0 to 1 year')

Unique_values_list('YearsProgram')

Number of unique values in 'YearsProgram': 21            

All unique values:
20 or more years    8690
4 to 5 years        3738
3 to 4 years        3573
5 to 6 years        3472
2 to 3 years        3123
9 to 10 years       3115
6 to 7 years        2773
1 to 2 years        2665
7 to 8 years        2395
10 to 11 years      2134
14 to 15 years      1972
8 to 9 years        1851
15 to 16 years      1647
0 to 1 year         1425
11 to 12 years      1357
12 to 13 years      1271
13 to 14 years      1075
16 to 17 years      1032
19 to 20 years      1010
17 to 18 years       871
18 to 19 years       636
Name: YearsProgram, dtype: int64




In [37]:
Unique_values_list('YearsCodedJob')

Number of unique values in 'YearsCodedJob': 22            

All unique values:
NaN                 9896
1 to 2 years        5144
2 to 3 years        4631
3 to 4 years        3904
4 to 5 years        3303
20 or more years    3033
Less than a year    2940
5 to 6 years        2918
9 to 10 years       1918
6 to 7 years        1872
10 to 11 years      1643
7 to 8 years        1594
8 to 9 years        1265
15 to 16 years       832
14 to 15 years       825
11 to 12 years       817
12 to 13 years       731
16 to 17 years       685
17 to 18 years       535
13 to 14 years       521
19 to 20 years       420
18 to 19 years       398
Name: YearsCodedJob, dtype: int64




In [38]:
df_answers['YearsCodedJob'] = df_answers['YearsCodedJob'].replace('Less than a year', '0 to 1 year')

### Using the *Missing_values_multireplacement()* function for the 'YearsCodedJob' column

In [39]:
Missing_values_multireplacement('YearsCodedJob')
Unique_values_list('YearsCodedJob')

Number of unique values in 'YearsCodedJob': 21            

All unique values:
1 to 2 years        6412
2 to 3 years        5758
3 to 4 years        4893
4 to 5 years        4102
20 or more years    3752
0 to 1 year         3639
5 to 6 years        3624
9 to 10 years       2431
6 to 7 years        2374
10 to 11 years      2114
7 to 8 years        1951
8 to 9 years        1573
15 to 16 years      1055
11 to 12 years      1025
14 to 15 years      1015
12 to 13 years       931
16 to 17 years       846
17 to 18 years       651
13 to 14 years       646
19 to 20 years       537
18 to 19 years       496
Name: YearsCodedJob, dtype: int64




#### **6) 'CareerSatisfaction', 'JobSatisfaction'**

In [40]:
Unique_values_list('CareerSatisfaction')

Number of unique values in 'CareerSatisfaction': 12            

All unique values:
8.0     10806
7.0      9169
NaN      8248
9.0      5467
10.0     5211
6.0      4594
5.0      2979
4.0      1310
3.0      1014
2.0       472
0.0       361
1.0       194
Name: CareerSatisfaction, dtype: int64




### A missing answer is treated as 'no satisfaction', so it will be replaced by 0.
### All numbers will be converted to integers.

In [41]:
df_answers['CareerSatisfaction'] = df_answers['CareerSatisfaction'].fillna(0)
df_answers['CareerSatisfaction'] = df_answers['CareerSatisfaction'].astype('int8')

Unique_values_list('CareerSatisfaction')

Number of unique values in 'CareerSatisfaction': 11            

All unique values:
8     10806
7      9169
0      8609
9      5467
10     5211
6      4594
5      2979
4      1310
3      1014
2       472
1       194
Name: CareerSatisfaction, dtype: int64
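
Filling the gaps with 0 merges them with respondents who genuinely answered 0. If that distinction should matter later, one option is to record it before filling. The sketch below is an optional variation, not part of the notebook's pipeline: the indicator column name `CareerSatisfactionMissing` and the toy data are invented for this example.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this example only
toy = pd.DataFrame({"CareerSatisfaction": [8.0, np.nan, 0.0, 10.0]})

# Remember which answers were missing before they are overwritten
toy["CareerSatisfactionMissing"] = toy["CareerSatisfaction"].isna()

# Same treatment as in the notebook: fill with 0, then store as small integers
toy["CareerSatisfaction"] = toy["CareerSatisfaction"].fillna(0).astype("int8")
print(toy)
```

The second and third rows both end up with a score of 0, but only the second one is flagged as originally missing.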




In [42]:
Unique_values_list('JobSatisfaction')

Number of unique values in 'JobSatisfaction': 12            

All unique values:
NaN     10431
8.0      8785
7.0      7793
9.0      5473
6.0      4609
10.0     4031
5.0      3640
4.0      1813
3.0      1584
2.0       865
0.0       443
1.0       358
Name: JobSatisfaction, dtype: int64




### A missing answer is treated as 'no satisfaction', so it will be replaced by 0.
### All numbers will be converted to integers.

In [43]:
df_answers['JobSatisfaction'] = df_answers['JobSatisfaction'].fillna(0)
df_answers['JobSatisfaction'] = df_answers['JobSatisfaction'].astype('int8')

Unique_values_list('JobSatisfaction')

Number of unique values in 'JobSatisfaction': 11            

All unique values:
0     10874
8      8785
7      7793
9      5473
6      4609
10     4031
5      3640
4      1813
3      1584
2       865
1       358
Name: JobSatisfaction, dtype: int64


