In [1]:
import importlib
import toolbox
importlib.reload(toolbox)
#from ml_eda import edaDF

<module 'toolbox' from 'c:\\Users\\msieb\\OneDrive\\Documents\\1 NAIT\\1 - DATA3950 MACHINE LEARNING\\Repository\\3950-assignment-2-mlsiebold\\toolbox.py'>

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import SGDRegressor

from sklearn.pipeline import Pipeline

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV

from ml_utils import edaDF
import toolbox as tb

In [3]:
# Download and load the data
import keras
import os

f_path_1 = "data/Euro_Salary.csv"
url_1 = "https://github.com/AkeemSemper/ml_data/raw/main/Euro_Salary.csv"
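if not os.path.exists(f_path_1):
    # get_file caches under ~/.keras by default; passing an absolute fname
    # saves the file at the path that read_csv uses below
    os.makedirs(os.path.dirname(f_path_1), exist_ok=True)
    file_1 = keras.utils.get_file(os.path.abspath(f_path_1), url_1)

df = pd.read_csv(f_path_1)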

# Assignment 2 - Regression
### Predict the TOTAL COMPENSATION for this year. 

The data file provided is a salary survey for tech workers in Europe. We want to predict the total amount of compensation they bring in each year, based off of the details of their work. 

Some notes that will be important:
<ul>
    <li>The total compensation will need to be constructed, there is a column for salary, "Yearly brutto salary (without bonus and stocks) in EUR", as well as a column for bonus compensation, "Yearly bonus + stocks in EUR". 
    <li>Some categorical variables will need some work, and there isn't generally an exact answer. The main concern is things with categories that have a bunch of values with a very small count. For example, if there is only 1 person in City X, then that value likely needs to be addressed. We don't want it encoded into a new column of one 1 and thousands of 0s. 
    <li>There is an article exploring some of the data here: https://www.asdcode.de/2021/01/it-salary-survey-december-2020.html
    <li>Imputation and a bit of data manipulation will be required. 
    <li>Use any regression method you'd like. Some are closely related to what we've done and may be worth a look, e.g. ExtraTreesRegressor. 
    <li>Initial accuracy, and potentially final accuracy, may not be great. When I made a plain model with little optimization, the errors were large and the R2 was low. There is lots of room for optimization. 
    <li>Research challenge - try some work on the target, look into TransformedTargetRegressor and see if that helps. Recall in stats when we had skewed distributions... Maybe it helps, maybe it doesn't. 
    <li>EDA and data prep are up to you - you'll probably need to do a little exploring to figure out what cleanup is needed. When I did it, I worked iteratively: look at the value counts, figure out how to treat the different categories, clean something up, look at the results, and repeat if needed. After you figure out what needs to be done, you may be able to take some of those steps and incorporate them into a pipeline to be cleaner.
    <li><b>CRITICAL - Please make sure you publish it after having run it, all the output should be showing.</b>
</ul>
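
As a minimal sketch of the first note, the target can be built by summing the two compensation columns. The toy rows below are invented for illustration, the output column name `Total compensation` is hypothetical, and treating a missing bonus as 0 is an assumption rather than something the assignment specifies.

```python
import pandas as pd

SALARY_COL = "Yearly brutto salary (without bonus and stocks) in EUR"
BONUS_COL = "Yearly bonus + stocks in EUR"


def add_total_compensation(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a 'Total compensation' column (salary + bonus)."""
    out = frame.copy()
    # Coerce to numeric: free-text survey answers become NaN instead of raising
    salary = pd.to_numeric(out[SALARY_COL], errors="coerce")
    bonus = pd.to_numeric(out[BONUS_COL], errors="coerce")
    # Assumption: a missing bonus means no bonus was paid
    out["Total compensation"] = salary + bonus.fillna(0)
    return out


# Invented toy rows, not taken from the survey
toy = pd.DataFrame({
    SALARY_COL: [75000.0, 115000.0, 70000.0],
    BONUS_COL: [None, "70000", "0"],
})
toy_total = add_total_compensation(toy)
print(toy_total["Total compensation"].tolist())
```

Rows with a missing salary stay NaN in the new column, so they can be dropped or imputed as a separate, explicit decision.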

### Details and Deliverables

You'll need to build code to produce the predictions. In particular, there's a few things that'll be marked:
<ul>
    <li>Please add a "presentation version" at the bottom, where you show what you did, and the results. Basically, you start with the original data, you do some work to figure out what's needed, you try a few models and select the best. At the bottom, put what <i>you actually settled on</i>, i.e. after all the figuring and exploring, here's the code that goes from raw data to final results, and here's what the results were. I should be able to read this part and understand what you did clearly:
    <ul>
        <li> Please make a pipeline that does the prep work - you may need some exploration or several trials before settling on what exactly to use, that's normal. Once you've settled, build that into a pipeline so it's clear and repeatable.
        <li> What you settled on for data cleaning, along with what prompted it. 
        <li> Feature Selection - Please identify what you did for feature selection. No need for a long explanation, something along the lines of "I did X, and the result was that 4 features were removed". Try at least 2 things. 
        <li> Model selection - between selecting a model style and tuning it with hyperparameters, what did you test and what won?
        <li> Overall, how good was your model and what things may make sense to try to do even better? 
        <li> If you could use titles/bullet points I'd really appreciate it. 
    </ul>
    <li>Grade Breakdown:
    <ul>
        <li> Code is readable, there are comments: 20%
        <li> Explanation as defined above: 60% (20% each point)
        <li> Accuracy: 20% As compared to everyone else. This will be generously graded, I won't be surprised if overall accuracy is low for most people. 
    </ul>
</ul>
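
One possible shape for the prep pipeline described above is sketched here. The column lists, the choice of `Ridge`, and the log transform of the target are all assumptions made for illustration; the toy rows are invented, and the real lists depend on the cleaning decisions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical column split, chosen only for this example
numeric_cols = ["Age", "Total years of experience"]
categorical_cols = ["Seniority level", "Company size"]

numeric_prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])
categorical_prep = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
prep = ColumnTransformer([
    ("num", numeric_prep, numeric_cols),
    ("cat", categorical_prep, categorical_cols),
])

# log1p/expm1 on the target is one way to handle a right-skewed salary distribution
model = TransformedTargetRegressor(
    regressor=Pipeline([("prep", prep), ("reg", Ridge(alpha=1.0))]),
    func=np.log1p,
    inverse_func=np.expm1,
)

# Invented toy rows, only to show that the pipeline fits and predicts
X_toy = pd.DataFrame({
    "Age": [26, 31, np.nan, 39, 33, 25],
    "Total years of experience": [5, 10, 8, 15, np.nan, 1],
    "Seniority level": ["Middle", "Senior", "Senior", "Lead", np.nan, "Junior"],
    "Company size": ["11-50", "1000+", "101-1000", "1000+", "11-50", "51-100"],
})
y_toy = pd.Series([55000, 90000, 75000, 120000, 70000, 45000])

model.fit(X_toy, y_toy)
toy_pred = model.predict(X_toy)
print(toy_pred.round(0))
```

Keeping imputation, scaling, and encoding inside the pipeline means they are fit on training folds only during cross-validation, which avoids leaking information from the validation data.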

<b>The biggest challenge here is translating the data into something useful and clean. This will probably require a bit of exploration: examining the data, thinking about what it means, trying something, then making a model to see what the results are. In particular, think about what value some of the less clean bits of data may hold - binning/grouping, numerical transformations, outlier removal, etc. are all likely to be useful somewhere. You almost certainly need to look at it column by column and make a decision. I'll apologize up front, it isn't the most fun process in the world. There is not one specific correct answer.</b>
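
The grouping idea mentioned above (categories with very small counts) could look roughly like this. The helper name `group_rare`, the threshold of 5, and the label `"Other"` are hypothetical choices, and the toy series is invented.

```python
import pandas as pd


def group_rare(series: pd.Series, min_count: int = 5, other: str = "Other") -> pd.Series:
    """Replace categories seen fewer than min_count times with a single label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # where() keeps values that are not rare and substitutes `other` elsewhere;
    # missing values are left as they are
    return series.where(~series.isin(rare), other)


# Invented toy data: two frequent cities, two one-off answers, one missing value
cities = pd.Series(["Berlin"] * 6 + ["Munich"] * 5 + ["Samara", "Malta", None])
grouped = group_rare(cities, min_count=5)
print(grouped.value_counts().to_dict())
```

Inside a scikit-learn pipeline, the `min_frequency` parameter of `OneHotEncoder` offers a similar effect at encoding time, so the manual grouping is mainly useful during exploration.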

In [4]:
# Clean dataset (whitespace and weird characters)

import unicodedata

df = df.apply(lambda col: col.map(tb.clean_string))     # Clean entire df
df.columns = df.columns.map(tb.clean_string)            # Clean column headers
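
`tb.clean_string` lives in the local `toolbox` module and is not shown in this notebook. The sketch below is a guess at what such a helper might do, not the actual implementation: the function name `clean_string_sketch` and the look-alike character map are hypothetical.

```python
import unicodedata

# Hypothetical look-alike map: Cyrillic letters that render like Latin ones
HOMOGLYPHS = str.maketrans({"\u0421": "C", "\u0441": "c"})


def clean_string_sketch(value):
    """Normalize one cell; non-strings (numbers, NaN) pass through unchanged."""
    if not isinstance(value, str):
        return value
    # NFKC folds compatibility forms, e.g. a no-break space becomes a plain space
    text = unicodedata.normalize("NFKC", value)
    # Replace Cyrillic look-alikes with their Latin counterparts
    text = text.translate(HOMOGLYPHS)
    # Trim the ends and collapse internal runs of whitespace
    return " ".join(text.split())


print(clean_string_sketch("\u0421ontract  duration "))
```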

In [5]:
# Create data dictionary

data_dict = pd.DataFrame({
    'Dtype': df.dtypes,
    'Semantic Type': None,
    'Desc': None,
    '# of Nulls': df.isna().sum(),
    '# of Unique Values': df.nunique()
    })                                      #.reset_index(names='Column')

data_dict.index.name = 'Column'             # Give the index a name

In [6]:
# Create description dictionary to add manually input column descriptions to data dictionary

# Create list of columns
cols = df.columns.to_list()

# Build description dictionary scaffold
desc_dict = {col: '' for col in cols}
#desc_dict

# Build semantic type dictionary scaffold
semantic_dict = {col: '' for col in cols}
#semantic_dict

In [7]:
df.columns

Index(['Timestamp', 'Age', 'Gender', 'City', 'Position',
       'Total years of experience', 'Years of experience in Germany',
       'Seniority level', 'Your main technology / programming language',
       'Other technologies/programming languages you use often',
       'Yearly brutto salary (without bonus and stocks) in EUR',
       'Yearly bonus + stocks in EUR', 'Number of vacation days',
       'Employment status', 'Contract duration', 'Main language at work',
       'Company size', 'Company type'],
      dtype='object', name='Column')

In [8]:
# Manually update description dictionary

desc_dict = {
 "Timestamp": "Date and time survey response was submitted (acts as survey submission identifier)",
 "Age": "Respondent's age in years",
 "Gender": "Respondent's self-report gender",
 "City": "City in which the respondent works (primarily German cities, with a few international cities)",
 "Position": "Current job title or primary role",
 "Total years of experience": "Total number of years the respondent has worked in their profession",
 "Years of experience in Germany": "Number of years the respondent has worked specifically in Germany",
 "Seniority level": "Self‑reported seniority",
 "Your main technology / programming language": "Primary programming language or technology used in the respondent’s job",
 "Other technologies/programming languages you use often": "Additional languages, frameworks, or tools the respondent frequently uses",
 "Yearly brutto salary (without bonus and stocks) in EUR": "Annual gross base salary in euros, excluding bonuses and stock compensation",
 "Yearly bonus + stocks in EUR": "Annual bonus and/or stock compensation in euros",
 "Number of vacation days": "Number of paid vacation days per year",
 "Employment status": "Type of employment",
 "Contract duration": "Type of employment contract",
 "Main language at work": "Primary language used in the workplace",
 "Company size": "Approximate number of employees in the respondent’s company",
 "Company type": "Type of organization the respondent is employed by"}

In [9]:
df.sample(5)

Column,Timestamp,Age,Gender,City,Position,Total years of experience,Years of experience in Germany,Seniority level,Your main technology / programming language,Other technologies/programming languages you use often,Yearly brutto salary (without bonus and stocks) in EUR,Yearly bonus + stocks in EUR,Number of vacation days,Employment status,Contract duration,Main language at work,Company size,Company type
757,26/11/2020 10:59:10,33.0,Male,Amsterdam,Backend Developer,8,,Senior,C#,"Python, .NET, SQL",75000.0,,27,Full-time employee,Temporary contract,English,11-50,Product
186,24/11/2020 12:40:25,31.0,Male,Berlin,Data Scientist,10,5.0,Lead,Python,"Kubernetes, Docker",115000.0,70000.0,28,Full-time employee,Unlimited contract,English,1000+,Product
381,24/11/2020 18:53:16,33.0,Male,Berlin,Software Engineer,10,,Senior,Salesforce,"Python, Javascript / Typescript, Apex",68000.0,,25,Full-time employee,Temporary contract,English,11-50,Startup
1069,03/12/2020 10:59:45,26.0,Male,Munich,Software Engineer,6,2.0,Senior,TypeScript,"AWS, Docker",93000.0,10000.0,30,Full-time employee,Unlimited contract,English,11-50,Startup
423,24/11/2020 20:29:50,39.0,Male,Berlin,DevOps,15,3.0,Senior,Python,"Python, Ruby, Java / Scala, Go, Rust, AWS, Kub...",70000.0,0.0,21,Full-time employee,Unlimited contract,English,101-1000,Product


In [10]:
print('\033[1m# of unique values:\033[0m')
print(df.nunique())

# of unique values:
Column
Timestamp                                                 1248
Age                                                         40
Gender                                                       3
City                                                       107
Position                                                   141
Total years of experience                                   48
Years of experience in Germany                              52
Seniority level                                             23
Your main technology / programming language                243
Other technologies/programming languages you use often     562
Yearly brutto salary (without bonus and stocks) in EUR     201
Yearly bonus + stocks in EUR                               168
Number of vacation days                                     43
Employment status                                           11
Contract duration                                            3
Main language at wor

In [11]:
# Manually update semantic type dictionary

semantic_dict = {
 'Timestamp': 'categorical nominal',
 'Age': 'numeric discrete',
 'Gender': 'categorical nominal',
 'City': 'categorical nominal',
 'Position': 'categorical nominal',
 'Total years of experience': 'numeric continuous',
 'Years of experience in Germany': 'numeric continuous',
 'Seniority level': 'categorical ordinal',
 'Your main technology / programming language': 'categorical nominal',
 'Other technologies/programming languages you use often': 'categorical nominal',
 'Yearly brutto salary (without bonus and stocks) in EUR': 'numeric continuous',
 'Yearly bonus + stocks in EUR': 'numeric continuous',
 'Number of vacation days': 'numeric discrete',
 'Employment status': 'categorical nominal',
 'Contract duration': 'categorical nominal',
 'Main language at work': 'categorical nominal',
 'Company size': 'categorical ordinal',
 'Company type': 'categorical nominal'}

In [12]:
# Map descriptions and data type to data dictionary 

#data_dict['Desc'] = data_dict['Column'].map(desc_dict)
#data_dict['Semantic Type'] = data_dict['Column'].map(semantic_dict)

data_dict['Desc'] = data_dict.index.map(desc_dict)
data_dict['Semantic Type'] = data_dict.index.map(semantic_dict)

In [13]:
# Check my handiwork

data_dict

Column,Dtype,Semantic Type,Desc,# of Nulls,# of Unique Values
Timestamp,object,categorical nominal,Date and time survey response was submitted (a...,0,1248
Age,float64,numeric discrete,Respondent's age in years,27,40
Gender,object,categorical nominal,Respondent's self-report gender,10,3
City,object,categorical nominal,City in which the respondent works (primarily ...,0,107
Position,object,categorical nominal,Current job title or primary role,6,141
Total years of experience,object,numeric continuous,Total number of years the respondent has worke...,16,48
Years of experience in Germany,object,numeric continuous,Number of years the respondent has worked spec...,32,52
Seniority level,object,categorical ordinal,Self‑reported seniority,12,23
Your main technology / programming language,object,categorical nominal,Primary programming language or technology use...,127,243
Other technologies/programming languages you use often,object,categorical nominal,"Additional languages, frameworks, or tools the...",157,562


In [14]:
# Save data dictionary to a tab-separated (TSV) file

data_dict.reset_index().to_csv('data_dictionary.tsv', sep='\t', index=False, encoding='utf-8-sig')

In [None]:
# Change dtypes based on semantic type

df = tb.apply_semantic_dtypes(df, data_dict, tb.semantic_to_dtype)
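
`tb.apply_semantic_dtypes` and `tb.semantic_to_dtype` also come from the local `toolbox` module and are not shown here. A rough sketch of how such a helper could work follows; the names `SEMANTIC_TO_DTYPE` and `apply_semantic_dtypes_sketch` are hypothetical, and the mapping is an assumption chosen to match the dtypes printed in the next cell.

```python
import pandas as pd

# Hypothetical mapping from semantic type to a pandas dtype
SEMANTIC_TO_DTYPE = {
    "categorical nominal": "category",
    "categorical ordinal": "category",
    "numeric discrete": "float64",
    "numeric continuous": "float64",
}


def apply_semantic_dtypes_sketch(frame, dictionary, mapping):
    """Cast each column according to its 'Semantic Type' in the data dictionary."""
    out = frame.copy()
    for column, semantic in dictionary["Semantic Type"].items():
        target = mapping.get(semantic)
        if column not in out.columns or target is None:
            continue
        if target == "category":
            out[column] = out[column].astype("category")
        else:
            # errors="coerce" turns unparseable text into NaN
            out[column] = pd.to_numeric(out[column], errors="coerce").astype(target)
    return out


# Invented toy frame and dictionary, indexed by column name like data_dict
toy = pd.DataFrame({"Gender": ["Male", "Female"], "Number of vacation days": ["30", "unlimited"]})
toy_dict = pd.DataFrame(
    {"Semantic Type": ["categorical nominal", "numeric discrete"]},
    index=["Gender", "Number of vacation days"],
)
typed = apply_semantic_dtypes_sketch(toy, toy_dict, SEMANTIC_TO_DTYPE)
print(typed.dtypes)
```

Coercing silently discards free-text answers, so it may be worth counting the new NaN values per column before and after the cast.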

In [22]:
df.dtypes

Column
Timestamp                                                 category
Age                                                        float64
Gender                                                    category
City                                                      category
Position                                                  category
Total years of experience                                  float64
Years of experience in Germany                             float64
Seniority level                                           category
Your main technology / programming language               category
Other technologies/programming languages you use often    category
Yearly brutto salary (without bonus and stocks) in EUR     float64
Yearly bonus + stocks in EUR                               float64
Number of vacation days                                    float64
Employment status                                         category
Contract duration                                      

In [None]:
df.sample(10)

Column,Timestamp,Age,Gender,City,Position,Total years of experience,Years of experience in Germany,Seniority level,Your main technology / programming language,Other technologies/programming languages you use often,Yearly brutto salary (without bonus and stocks) in EUR,Yearly bonus + stocks in EUR,Number of vacation days,Employment status,Contract duration,Main language at work,Company size,Company type
927,30/11/2020 11:06:44,25.0,Male,Stuttgart,Data Scientist,0.0,0.0,Junior,Python,"R, SQL, Hadoop Hive",58000.0,,30.0,Full-time employee,Unlimited contract,German,1000+,Handel
937,30/11/2020 11:45:40,23.0,Male,Dublin,Data Analyst,3.0,0.0,Senior,Python,"Python, R, SQL",49200.0,2000.0,21.0,Full-time employee,Unlimited contract,English,1000+,Product
1096,05/12/2020 18:10:14,35.0,Male,Fr,Backend Developer,4.0,2.5,Middle,Java,"Javascript / Typescript, Java / Scala, SQL",42000.0,,28.0,Full-time employee,Unlimited contract,English,51-100,Consulting / Agency
193,24/11/2020 12:51:24,33.0,Male,Berlin,Software Engineer,12.0,5.0,Senior,Javascript,"PHP, Javascript / Typescript, SQL, Docker",120000.0,,28.0,Self-employed (freelancer),Temporary contract,English,101-1000,Product
86,24/11/2020 11:45:31,31.0,Male,Berlin,Software Engineer,9.0,,Senior,"C#, .net core","Python, .NET, AWS, Azure, Kubernetes, Docker",60000.0,,25.0,Full-time employee,Unlimited contract,English,11-50,Startup
966,30/11/2020 14:36:32,35.0,Male,Berlin,Data Engineer,6.0,6.0,Senior,Java,"Python, Java / Scala, SQL, Go, AWS, Kubernetes...",200000.0,200000.0,14.0,Self-employed (freelancer),Temporary contract,English,11-50,Startup
357,24/11/2020 18:10:54,26.0,Male,Karlsruhe,Software Engineer,5.0,5.0,,,"Python, Kotlin, Javascript / Typescript, Java ...",55000.0,0.0,30.0,Full-time employee,Unlimited contract,50/50,11-50,Consulting / Agency
154,24/11/2020 12:14:27,36.0,Female,Erlangen,Project Manager,14.0,2.0,Middle,,Python,62000.0,,30.0,Full-time employee,Unlimited contract,English,1000+,Product
56,24/11/2020 11:33:33,35.0,Male,Munich,Software Engineer,11.0,3.0,Senior,Php,"Javascript / Typescript, AWS, Kubernetes, Docker",65000.0,5000.0,29.0,Full-time employee,Unlimited contract,German,101-1000,Product
184,24/11/2020 12:37:20,35.0,Male,Berlin,Product Manager,10.0,5.0,Senior,php,"PHP, Javascript / Typescript",90000.0,10000.0,,Full-time employee,Unlimited contract,English,101-1000,Product


In [None]:
# EDA

df.describe(include="all").T

Column,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Timestamp,1253.0,1248.0,24/11/2020 14:07:23,2.0,,,,,,,
Age,1226.0,,,,32.509788,5.663804,20.0,29.0,32.0,35.0,69.0
Gender,1243.0,3.0,Male,1049.0,,,,,,,
City,1253.0,119.0,Berlin,681.0,,,,,,,
Position,1247.0,148.0,Software Engineer,387.0,,,,,,,
Total years of experience,1237.0,48.0,10,138.0,,,,,,,
Years of experience in Germany,1221.0,53.0,2,195.0,,,,,,,
Seniority level,1241.0,24.0,Senior,565.0,,,,,,,
Your main technology / programming language,1126.0,256.0,Java,184.0,,,,,,,
Other technologies/programming languages you use often,1096.0,562.0,Javascript / Typescript,44.0,,,,,,,


In [None]:
df[393:396]

Column,Timestamp,Age,Gender,City,Position,Total years of experience,Years of experience in Germany,Seniority level,Your main technology / programming language,Other technologies/programming languages you use often,Yearly brutto salary (without bonus and stocks) in EUR,Yearly bonus + stocks in EUR,Number of vacation days,Employment status,Сontract duration,Main language at work,Company size,Company type
393,24/11/2020 19:15:02,30.0,Male,Moscow,Software Engineer,5,0,Middle,C,C/C++,14712.0,0.0,30,Full-time employee,Unlimited contract,Russian,101-1000,Product
394,24/11/2020 19:15:49,33.0,Male,Berlin,Product Manager,5,5,Senior,,Python,70000.0,800.0,30,Full-time employee,,German,101-1000,Product
395,24/11/2020 19:19:30,35.0,Male,Berlin,QA Engineer,11,10,Senior,Java,"Python, Javascript / Typescript, .NET, Java / ...",74400.0,,30,Full-time employee,Unlimited contract,English,101-1000,Product


In [None]:
df.dtypes

Column
Timestamp                                                  object
Age                                                       float64
Gender                                                     object
City                                                       object
Position                                                   object
Total years of experience                                  object
Years of experience in Germany                             object
Seniority level                                            object
Your main technology / programming language                object
Other technologies/programming languages you use often     object
Yearly brutto salary (without bonus and stocks) in EUR    float64
Yearly bonus + stocks in EUR                               object
Number of vacation days                                    object
Employment status                                          object
Сontract duration                                          object
Mai

In [None]:
len(df)

1253

In [None]:
#df.info()
df["Age"].value_counts().sort_index()

Age
20.0      1
21.0      1
22.0      8
23.0     12
24.0     28
25.0     42
26.0     59
27.0     58
28.0     87
29.0     86
30.0    110
31.0     87
32.0     94
33.0     94
34.0     74
35.0     82
36.0     60
37.0     44
38.0     48
39.0     28
40.0     31
41.0     14
42.0     20
43.0     10
44.0      8
45.0     11
46.0      8
47.0      3
48.0      4
49.0      2
50.0      1
51.0      1
52.0      1
53.0      1
54.0      2
56.0      2
59.0      1
65.0      1
66.0      1
69.0      1
Name: count, dtype: int64

In [None]:
for column in df.columns:
    print(f'{column}:\n{df[column].unique()}\n')

Timestamp:
['24/11/2020 11:14:15' '24/11/2020 11:14:16' '24/11/2020 11:14:21' ...
 '18/01/2021 23:20:35' '19/01/2021 10:17:58' '19/01/2021 12:01:11']

Age:
[26. 29. 28. 37. 32. 24. 35. nan 34. 31. 41. 27. 25. 59. 36. 38. 40. 39.
 33. 30. 49. 48. 44. 66. 45. 43. 42. 46. 47. 56. 53. 65. 22. 23. 50. 51.
 21. 20. 54. 69. 52.]

Gender:
['Male' 'Female' nan 'Diverse']

City:
['Munich' 'Berlin' 'Hamburg' 'Wolfsburg' 'Stuttgart' 'Schleswig-Holstein'
 'London' 'Konstanz area' 'Frankfurt' 'Cologne' 'Kempten' 'Münster'
 'Erlangen' 'Vienna' 'Moldova' 'Rosenheim' 'Mannheim ' 'Boeblingen'
 'Düsseldorf' 'Ingolstadt' 'Nürnberg' 'Ansbach' 'Leipzig' 'Mannheim'
 'Tuttlingen' 'Bonn' 'Moscow' 'Koblenz' 'Warsaw' 'Heidelberg' 'Karlsruhe'
 'Köln' 'Aachen' 'Karlsruhe ' 'Samara' 'Riga, Latvia' 'Dusseldorf'
 'Zurich' 'Helsinki' 'Würzburg' 'Kiev' 'Den Haag' 'Amsterdam' 'Cracovia'
 'Tallinn' 'Prague' 'Utrecht' 'Stockholm' 'Braunschweig ' 'Dresden' 'Kyiv'
 'Stuttgart ' 'Malta' 'Lübeck' 'Nuremberg ' 'Bodensee' 'Mila

### **'Timestamp'**
- **Inconsistencies:**
    - dtype is object, should be datetime
- **Actions taken:**
    - Change dtype to datetime

In [None]:
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%d/%m/%Y %H:%M:%S')
df['Timestamp'].dtype

dtype('<M8[ns]')

### **'Age'**
- **Inconsistencies:**
    - dtype is float64, should be integer
- **Actions taken:**
    - Change dtype to integer

In [None]:
df['Age'] = df['Age'].astype('Int64')  # nullable integer dtype keeps the missing ages as <NA>
df['Age'].dtype

Int64Dtype()

### **'Gender'**
- **Inconsistencies:**
    - None
- **Actions taken:**
    - None

In [None]:
np.sort(df['City'].unique()).tolist()

['Aachen',
 'Amsterdam',
 'Ansbach',
 'Barcelona',
 'Basel',
 'Berlin',
 'Bielefeld',
 'Bodensee',
 'Boeblingen',
 'Bonn',
 'Braunschweig ',
 'Brunswick',
 'Brussels',
 'Brussels ',
 'Bucharest',
 'Bölingen',
 'Cambridge',
 'City in Russia',
 'Cologne',
 'Copenhagen',
 'Cracovia',
 'Cracow',
 'Cupertino',
 'Darmstadt',
 'Den Haag',
 'Dortmund',
 'Dresden',
 'Dublin',
 'Dublin ',
 'Duesseldorf',
 'Dusseldorf',
 'Dusseldurf',
 'Düsseldorf',
 'Düsseldorf ',
 'Eindhoven',
 'Erlangen',
 'Fr',
 'France',
 'Frankfurt',
 'Friedrichshafen',
 'Hamburg',
 'Hannover',
 'Heidelberg',
 'Heidelberg ',
 'Heilbronn',
 'Helsinki',
 'Hildesheim',
 'Hildesheim ',
 'Ingolstadt',
 'Ingolstadt ',
 'Innsbruck',
 'Istanbul',
 'Jena',
 'Karlsruhe',
 'Karlsruhe ',
 'Kempten',
 'Kiev',
 'Koblenz',
 'Konstanz',
 'Konstanz area',
 'Krakow',
 'Kyiv',
 'Köln',
 'Leipzig',
 'Lisbon',
 'London',
 'Luttich',
 'Lübeck',
 'Madrid',
 'Malta',
 'Mannheim',
 'Mannheim ',
 'Marseille',
 'Milan',
 'Milano',
 'Minsk',
 'Moldova

### **'City'**
- **Inconsistencies:**
    - Countries rather than cities
    - Special characters ex. "()"
    - Trailing spaces
    - Invalid answers and extra text ("Prefer not to say", "area", "City in")
    - Abbreviations
    - Spelling errors
    - Multiple spellings
    - Accents    
- **Actions taken:**
    - Removed trailing spaces
    - 
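
Several of the inconsistencies listed above could be handled in one pass, sketched below with the standard library only. The function name `normalize_city` and the alias table are hypothetical, and the rules (drop parenthetical notes, keep the text before a comma, remove the word "area", strip accents) are assumptions about how these answers should be treated.

```python
import re
import unicodedata

# Hypothetical alias table; extend it after reviewing the value counts
CITY_ALIASES = {"koln": "Cologne", "cracovia": "Krakow", "dusseldurf": "Dusseldorf"}


def normalize_city(raw: str) -> str:
    """Strip noise from a free-text city answer and apply known aliases."""
    text = re.sub(r"\(.*?\)", "", raw)  # drop parenthetical notes, e.g. "(Finland)"
    text = text.split(",")[0]           # keep the part before a comma: "Riga, Latvia"
    text = re.sub(r"\barea\b", "", text, flags=re.IGNORECASE)
    # Decompose accented letters, then drop the combining marks: "ü" -> "u"
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = " ".join(text.split())       # trim and collapse whitespace
    return CITY_ALIASES.get(text.lower(), text)


for sample in ["Tampere (Finland)", "Riga, Latvia", "Konstanz area", "Köln", "Düsseldorf "]:
    print(repr(sample), "->", repr(normalize_city(sample)))
```

Applied to the data, this would be something like `df['City'].map(normalize_city)`; answers that name a country or decline to answer still need a separate decision.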

In [None]:
df['City'].unique()

array(['Munich', 'Berlin', 'Hamburg', 'Wolfsburg', 'Stuttgart',
       'Schleswig-Holstein', 'London', 'Konstanz area', 'Frankfurt',
       'Cologne', 'Kempten', 'Münster', 'Erlangen', 'Vienna', 'Moldova',
       'Rosenheim', 'Mannheim ', 'Boeblingen', 'Düsseldorf', 'Ingolstadt',
       'Nürnberg', 'Ansbach', 'Leipzig', 'Mannheim', 'Tuttlingen', 'Bonn',
       'Moscow', 'Koblenz', 'Warsaw', 'Heidelberg', 'Karlsruhe', 'Köln',
       'Aachen', 'Karlsruhe ', 'Samara', 'Riga, Latvia', 'Dusseldorf',
       'Zurich', 'Helsinki', 'Würzburg', 'Kiev', 'Den Haag', 'Amsterdam',
       'Cracovia', 'Tallinn', 'Prague', 'Utrecht', 'Stockholm',
       'Braunschweig ', 'Dresden', 'Kyiv', 'Stuttgart ', 'Malta',
       'Lübeck', 'Nuremberg ', 'Bodensee', 'Milan', 'Salzburg', 'Rome',
       'Wroclaw', 'Cupertino', 'Paris', 'Dublin ', 'Paderborn',
       'Konstanz', 'Ulm', 'Düsseldorf ', 'Barcelona', 'Bölingen',
       'Tampere (Finland)', 'Hannover', 'Bucharest', 'Siegen', 'Minsk',
       'Nuremberg', 'M

In [None]:
# Remove leading and trailing spaces

print(f"# of unique values BEFORE strip: {len(df['City'].unique())}")

df['City'] = df['City'].str.strip()

print(f"# of unique values AFTER strip: {len(df['City'].unique())}")

# of unique values BEFORE strip: 119
# of unique values AFTER strip: 109


In [None]:
# Remove accents

print(f"# of unique values BEFORE normalizing accents: {len(df['City'].unique())}")

# Decompose accented letters (NFKD), then drop the non-ASCII combining marks
# and decode back to text so the column holds strings rather than bytes
df['City'] = (
    df['City']
    .str.normalize('NFKD')
    .str.encode('ascii', 'ignore')
    .str.decode('ascii')
)

print(f"# of unique values AFTER normalizing accents: {len(df['City'].unique())}")

In [None]:
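# Consolidate alternate spellings; keys and values are ASCII to match the accent-stripped city names
df["City"] = df["City"].replace({
    "Dusseldurf": "Dusseldorf",
    "Koln": "Cologne",
    "Cracovia": "Krakow"
})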

### 'Position'
- **Definition:**
    - 
- **Inconsistencies:**
    -
- **Actions taken:**
    - 

###  'Total years of experience'
- **Definition:**
    - 
- **Inconsistencies:**
    -
- **Actions taken:**
    - 

###  'Years of experience in Germany'
- **Definition:**
    - 
- **Inconsistencies:**
    -
- **Actions taken:**
    - 

###  'Seniority level'
- **Definition:**
    - 
- **Inconsistencies:**
    -
- **Actions taken:**
    - 

###  'Your main technology / programming language'
- **Definition:**
    - 
- **Inconsistencies:**
    -
- **Actions taken:**
    - 

###  'Other technologies/programming languages you use often'
- **Inconsistencies:**
    -
- **Actions taken:**
    - 

###  'Yearly brutto salary (without bonus and stocks) in EUR'
- **Definition:**
    - 
- **Inconsistencies:**
    -
- **Actions taken:**
    - 

###  'Yearly bonus + stocks in EUR'
- **Definition:**
    - 
- **Inconsistencies:**
    -
- **Actions taken:**
    - 

###  'Number of vacation days'
- **Definition:**
    - 
- **Inconsistencies:**
    -
- **Actions taken:**
    - 

###  'Employment status'
- **Definition:**
    - 
- **Inconsistencies:**
    -
- **Actions taken:**
    - 

###  'Contract duration'
- **Definition:**
    - 
- **Inconsistencies:**
    -
- **Actions taken:**
    - 

###  'Main language at work'
- **Definition:**
    - 
- **Inconsistencies:**
    -
- **Actions taken:**
    - 

###  'Company size'
- **Definition:**
    - 
- **Inconsistencies:**
    -
- **Actions taken:**
    - 

###  'Company type'
- **Definition:**
    - 
- **Inconsistencies:**
    -
- **Actions taken:**
    - 

# Answers and Explanations
(Expand/modify as needed)

### Here's the Data Cleaning Steps I Used

### Here's my Tuning/Feature Selection Steps

### Here's my Model's Performance

### Here's my Final Conclusion on What Worked Best