## Data Understanding  

**1.0. What is the domain area of the dataset?**  
The dataset *Position_Salaries.csv* lists the salaries a company pays for different positions.  

**1.1. Under which circumstances was it collected?**  
It was collected from several websites.  

**2.0. Which data format?**  
The data is in *CSV* format.  

In [1]:
# Importing libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

In [2]:
position_dataset = pd.read_csv("../Datasets/Position_Salaries.csv")

In [11]:
RANDOM_STATE = 42

## Basic Exploratory Data Analysis

In [3]:
position_dataset.head()

| | Position | Level | Salary |
|---|---|---|---|
| 0 | Business Analyst | 1 | 45000 |
| 1 | Junior Consultant | 2 | 50000 |
| 2 | Senior Consultant | 3 | 60000 |
| 3 | Manager | 4 | 80000 |
| 4 | Country Manager | 5 | 110000 |


In [4]:
position_dataset.describe()

| | Level | Salary |
|---|---|---|
| count | 10.0 | 10.0 |
| mean | 5.5 | 249500.0 |
| std | 3.02765 | 299373.883668 |
| min | 1.0 | 45000.0 |
| 25% | 3.25 | 65000.0 |
| 50% | 5.5 | 130000.0 |
| 75% | 7.75 | 275000.0 |
| max | 10.0 | 1000000.0 |


In [5]:
print(f"Number of features in the dataset is {position_dataset.shape[1]} and the number of observations/rows in the dataset is {position_dataset.shape[0]}")

Number of features in the dataset is 3 and the number of observations/rows in the dataset is 10


#### Checking Missing Values

In [6]:
position_dataset.isnull().sum()

Position    0
Level       0
Salary      0
dtype: int64

In [7]:
position_dataset.isna().sum()

Position    0
Level       0
Salary      0
dtype: int64
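
The two checks above return the same counts because `isnull()` is an alias of `isna()` in pandas, so one of them is enough. A minimal sketch of this, using a small made-up frame (`toy` is hypothetical and not part of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column (not the real dataset).
toy = pd.DataFrame({
    "Level": [1, 2, np.nan],
    "Salary": [45000, np.nan, 60000],
})

# isna() and isnull() produce identical boolean masks.
same_mask = toy.isna().equals(toy.isnull())
missing_per_column = toy.isna().sum()

print("Masks identical:", same_mask)
print(missing_per_column)
```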

## Model Building

* Because the dataset is very small (10 rows), we do not split it into training and test sets.  
* Here we build a decision tree regression model to get a better understanding of how *decision tree regression* works.  
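
One consequence of skipping the split is worth keeping in mind: an unpruned decision tree can memorise every training row, so its score on the training data says little about new inputs. The sketch below illustrates this under stated assumptions. It uses only the five rows shown by `head()` (not the full dataset), and leave-one-out cross-validation is swapped in purely as an illustration; it is not part of the original workflow.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Assumption: only the first five rows (levels 1-5) of the dataset.
X_small = np.array([[1], [2], [3], [4], [5]])
y_small = np.array([45000, 50000, 60000, 80000, 110000])

tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_small, y_small)

# R^2 on the training data: the fully grown tree reproduces every row.
train_r2 = tree.score(X_small, y_small)

# Leave-one-out: each row is predicted by a tree that never saw it.
loo_pred = cross_val_predict(
    DecisionTreeRegressor(random_state=42), X_small, y_small, cv=LeaveOneOut()
)
loo_mae = np.mean(np.abs(loo_pred - y_small))

print("Training R^2:", train_r2)
print("Leave-one-out predictions:", loo_pred)
print("Leave-one-out mean absolute error:", loo_mae)
```

The training score is perfect, while every held-out prediction misses its target, because the tree can only return salaries it has already seen.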

#### Decision Tree Regression Model

In [8]:
# Features: the Level column only, kept 2-D as scikit-learn expects
X = position_dataset.iloc[:, 1:-1].values
# Target: the Salary column
y = position_dataset.iloc[:, -1].values

In [13]:
decision_tree_reg_model = DecisionTreeRegressor(random_state=RANDOM_STATE)

decision_tree_reg_model.fit(X, y)

### Main Question:

> A candidate we are considering hiring is asking for a salary of 160,000 US dollars per year. He justifies this by claiming that he earned the same amount at his previous company. He has worked as a region manager for two years.  

> **Is his claim true or is it a bluff?**

### Answer

* **Predicting how much he should earn by decision tree regression**

He has worked as a region manager for over two years, so we place him at level *6.5*.  
In other words, his current salary should be between 150,000 and 200,000 US dollars.  

In [16]:
y_pred = decision_tree_reg_model.predict([[6.5]])

print("The predicted salary by DECISION TREE REGRESSION for the candidate is:", round(y_pred[0]), "US Dollars") 

The predicted salary by DECISION TREE REGRESSION for the candidate is: 150000 US Dollars
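
The prediction is exactly 150,000 rather than a value between 150,000 and 200,000 because a decision tree regressor returns the mean of the training targets in the leaf a query falls into, so it cannot interpolate between levels. The sketch below illustrates this step-like behaviour under stated assumptions: the first five rows are taken from the `head()` output, and levels 6 and 7 are assumed to pay 150,000 and 200,000 based on the range quoted above. It is a reduced toy version of the data, not the full file.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Assumption: levels 1-5 from head(); levels 6 and 7 inferred from the text.
levels = np.arange(1, 8).reshape(-1, 1)
salaries = np.array([45000, 50000, 60000, 80000, 110000, 150000, 200000])

step_tree = DecisionTreeRegressor(random_state=42)
step_tree.fit(levels, salaries)

# Query several points between level 6 and level 7.
queries = np.array([[6.0], [6.4], [6.5], [6.6], [7.0]])
step_preds = step_tree.predict(queries)

for level, salary in zip(queries.ravel(), step_preds):
    print(f"Level {level:.1f} -> predicted salary {salary:.0f}")
```

Every prediction is one of the two neighbouring training salaries, and the value jumps from one to the other around the midpoint between the levels.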
