# Salary: Simple linear regression
### In this assignment we are instructed to predict employee salaries from employee characteristics (features) using a simple supervised learning technique: linear regression. The goal is to build a simple model that shows how well Years Worked predicts an employee’s salary.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
%matplotlib inline
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [2]:
df_salary = pd.read_csv("salary.csv").dropna()
df_salary.head()

Unnamed: 0,salary,exprior,yearsworked,yearsrank,market,degree,otherqual,position,male,Field,yearsabs
0,53000.0,0,0,0,1.17,1,0,1,1,3,0
1,58000.0,1,0,0,1.24,1,0,1,1,2,0
2,45500.0,0,0,0,1.21,1,0,1,1,3,2
3,35782.0,0,2,1,0.99,1,0,1,1,4,1
4,34731.0,0,2,2,0.91,1,0,1,1,4,1


In [3]:
df_salary.info()
df_salary.shape

<class 'pandas.core.frame.DataFrame'>
Int64Index: 513 entries, 0 to 513
Data columns (total 11 columns):
salary         513 non-null float64
exprior        513 non-null int64
yearsworked    513 non-null int64
yearsrank      513 non-null int64
market         513 non-null float64
degree         513 non-null int64
otherqual      513 non-null int64
position       513 non-null int64
male           513 non-null int64
Field          513 non-null int64
yearsabs       513 non-null int64
dtypes: float64(2), int64(9)
memory usage: 48.1 KB


(513, 11)

In [4]:
#defining the y variable
y = df_salary.salary

#splitting the dataset into training and testing sets; the full DataFrame
#(including salary) is passed as X because the formula API below needs both columns
x_train, x_test, y_train, y_test = train_test_split(df_salary, y, test_size=0.3, random_state = 0)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

#fit an OLS model on the training set only
l_model = smf.ols(formula='salary ~ yearsworked', data=x_train).fit()

#print the coefficients
print(l_model.params)
print(l_model.summary())

(359, 11) (359,)
(154, 11) (154,)
Intercept      39952.042651
yearsworked      831.075426
dtype: float64
                            OLS Regression Results                            
Dep. Variable:                 salary   R-squared:                       0.382
Model:                            OLS   Adj. R-squared:                  0.380
Method:                 Least Squares   F-statistic:                     220.5
Date:                Mon, 20 May 2019   Prob (F-statistic):           3.57e-39
Time:                        12:20:12   Log-Likelihood:                -3814.8
No. Observations:                 359   AIC:                             7634.
Df Residuals:                     357   BIC:                             7641.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------

### Q1 - Using the statsmodels package, run a simple linear regression for Salary with one predictor variable: Years Worked.
#### a. Yes, the model significantly predicts salary. The overall F-test gives F = 220.5 with Prob (F-statistic) = 3.57e-39, which is far below 0.05, so the relationship is statistically significant. With an R-squared of 0.382 (adjusted 0.380) the fit is moderate: the points are fairly widely scattered around the regression line, so Years Worked is a useful but incomplete predictor.
#### b. Years Worked accounts for about 38.2% of the variance in employees’ salaries (R-squared = 0.382).
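
The two quantities used in this answer can also be read directly from a fitted statsmodels result instead of from the printed summary. The sketch below is only an illustration: the data are synthetic and the names `demo` and `demo_model` are made up here, so the printed values will not match the salary model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for salary.csv (assumption: values are invented for illustration)
rng = np.random.default_rng(0)
years = rng.integers(0, 40, size=200)
demo = pd.DataFrame({
    "yearsworked": years,
    "salary": 40000 + 800 * years + rng.normal(0, 10000, size=200),
})

demo_model = smf.ols("salary ~ yearsworked", data=demo).fit()

# Share of the variance in salary explained by yearsworked
r_squared = demo_model.rsquared
# p-value of the t-test on the yearsworked slope
slope_p = demo_model.pvalues["yearsworked"]

print(f"R-squared: {r_squared:.3f}")
print(f"p-value for yearsworked: {slope_p:.3g}")
```

On the real data the same attributes would be read from `l_model`.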

### Q2 - What does the unstandardized coefficient (B or 'coef' in statsmodels) tell you about the relationship between Years Worked and Salary?
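#### A. The unstandardized coefficient for yearsworked is 831.075426: each additional year worked is associated with an expected salary that is about 831.08 higher, so the relationship between years worked and salary is positive. The Intercept, 39952.042651, is the expected salary when yearsworked is zero (0).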

### Q3 - What do the 95% confidence intervals [0.025, 0.975] mean?
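#### A. The 95% confidence interval is a range of plausible values for the true (population) coefficient; it is not a range that 95% of employees fall into. If we repeatedly drew new samples and built an interval in the same way, about 95% of those intervals would contain the true coefficient. For yearsworked, the interval read from the summary table is 788.597 to 994.567, so we can be 95% confident that each additional year worked is associated with a salary increase within that range. Because the interval does not include zero, the effect is statistically significant at the 5% level.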
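
As a rough sketch of where the `[0.025, 0.975]` columns come from, the `conf_int` method of a fitted result returns the lower and upper bounds for each coefficient. The data below are synthetic and the names `demo` and `demo_model` are assumptions made for this example only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for salary.csv (assumption: values are invented for illustration)
rng = np.random.default_rng(0)
years = rng.integers(0, 40, size=200)
demo = pd.DataFrame({
    "yearsworked": years,
    "salary": 40000 + 800 * years + rng.normal(0, 10000, size=200),
})

demo_model = smf.ols("salary ~ yearsworked", data=demo).fit()

# alpha=0.05 gives the 95% interval; rows are coefficients,
# column 0 is the lower bound and column 1 the upper bound
ci = demo_model.conf_int(alpha=0.05)
lower, upper = ci.loc["yearsworked"]
slope = demo_model.params["yearsworked"]

print(f"slope: {slope:.2f}, 95% CI: [{lower:.2f}, {upper:.2f}]")
```

The interval is centred on the estimated slope, which is a quick sanity check when copying bounds out of a summary table.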

In [5]:
pre_df = pd.DataFrame({'yearsworked': [12]})
l_model.predict(pre_df)

0    49924.947761
dtype: float64

### Q4 - Calculate the expected salary for someone with 12 years’ work experience.
#### A. The expected salary for someone who has worked for 12 years is R49 924.95
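
The same figure can be checked by hand from the fitted equation, salary = Intercept + coef × yearsworked. The short sketch below uses the two coefficients printed in the training-set output; the variable names are chosen here for illustration.

```python
# Coefficients copied from the printed training-set parameters
intercept = 39952.042651
slope = 831.075426

years = 12
expected_salary = intercept + slope * years

print(round(expected_salary, 2))  # 49924.95
```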

In [6]:
pre_df = pd.DataFrame({'yearsworked': [80]})
l_model.predict(pre_df)

0    106438.076714
dtype: float64

### Q5 - Calculate the expected salary for someone with 80 years’ work experience. Are there any problems with this prediction? If so, what are they?
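#### A. The expected salary for someone who has worked for 80 years is R106 438.08. This prediction is problematic: 80 years of work experience is not realistic for an employee and is almost certainly far outside the range of years worked that the model was fitted on, so the model is extrapolating. A straight line also assumes that salary keeps rising by the same amount every year indefinitely, which is unlikely to hold.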
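
One simple way to flag this kind of problem is to compare the requested value with the range of the predictor seen during fitting. The helper `is_extrapolation` and the sample values below are hypothetical; with the real data the observed values would be the training set's yearsworked column.

```python
import pandas as pd

def is_extrapolation(value, observed):
    """Return True when value lies outside the observed min/max of a predictor."""
    return bool(value < observed.min() or value > observed.max())

# Hypothetical yearsworked values (assumption: not taken from salary.csv)
observed_years = pd.Series([0, 2, 5, 12, 20, 33, 41])

print(is_extrapolation(12, observed_years))  # False
print(is_extrapolation(80, observed_years))  # True
```

A value inside the observed range is not guaranteed to be predicted well, but a value outside it should always be treated with caution.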

### Q6 - We have only looked at the number of years an employee has worked. What other employee characteristics might influence their salary?
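#### A. Other employee characteristics that might influence salary include position, years in rank (yearsrank) and years absent (yearsabs). The dataset also records further candidates such as market, degree, otherqual, exprior, male and Field.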
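
To illustrate how an extra characteristic could be added, the sketch below extends the formula with a second predictor and compares adjusted R-squared. It assumes position is a categorical code and uses synthetic data, so the numbers say nothing about the real salary dataset; `demo`, `simple` and `multiple` are names invented for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption: position is a rank code from 1 to 3)
rng = np.random.default_rng(1)
n = 300
years = rng.integers(0, 40, size=n)
position = rng.integers(1, 4, size=n)
demo = pd.DataFrame({
    "yearsworked": years,
    "position": position,
    "salary": 30000 + 600 * years + 8000 * position + rng.normal(0, 8000, size=n),
})

# One predictor versus two; C(...) tells the formula API to treat position as categorical
simple = smf.ols("salary ~ yearsworked", data=demo).fit()
multiple = smf.ols("salary ~ yearsworked + C(position)", data=demo).fit()

print(f"Adjusted R-squared, one predictor:  {simple.rsquared_adj:.3f}")
print(f"Adjusted R-squared, two predictors: {multiple.rsquared_adj:.3f}")
```

Adjusted R-squared is used for the comparison because plain R-squared never decreases when a predictor is added.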

## Fit your model on test set

In [7]:
predictions = l_model.predict(x_test)

In [10]:
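#note: this refits the model on the test set and overwrites l_model,
#so later cells that use l_model get the test-fitted model, not the training-fitted one
l_model = smf.ols(formula='salary ~ yearsworked', data=x_test).fit()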

print(l_model.params)
print(l_model.summary())

Intercept      40496.039934
yearsworked      851.837961
dtype: float64
                            OLS Regression Results                            
Dep. Variable:                 salary   R-squared:                       0.407
Model:                            OLS   Adj. R-squared:                  0.403
Method:                 Least Squares   F-statistic:                     104.2
Date:                Mon, 20 May 2019   Prob (F-statistic):           5.82e-19
Time:                        12:23:35   Log-Likelihood:                -1632.8
No. Observations:                 154   AIC:                             3270.
Df Residuals:                     152   BIC:                             3276.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------

### Q7 - How does your model compare when running it on the test set - what is the difference in the Root Mean Square Error (RMSE) between the training and test sets?
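#### A. The RMSE on the test set is about 9773.09 and the RMSE computed on the training set is about 10003.15, a difference of roughly 230. The two errors are close, so the model performs similarly on unseen data and does not appear to overfit. Note that cell In [10] refits l_model on the test set, so the training-set figure comes from that refitted model rather than from the model fitted on the training data.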

In [11]:
from sklearn import metrics

In [14]:
print(np.sqrt(metrics.mean_squared_error(y_test, predictions)))

9773.089452635013


In [16]:
prediction = l_model.predict(x_train)
print(np.sqrt(metrics.mean_squared_error(y_train, prediction)))

10003.147741324348
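
For a like-for-like comparison, both RMSE values are usually computed from one model that was fitted on the training rows only. The following is a minimal sketch of that pattern under stated assumptions: the data are synthetic, and `train_model` and `rmse` are names introduced here, not part of the notebook above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split

# Synthetic stand-in for salary.csv (assumption: values are invented for illustration)
rng = np.random.default_rng(2)
years = rng.integers(0, 40, size=500)
demo = pd.DataFrame({
    "yearsworked": years,
    "salary": 40000 + 800 * years + rng.normal(0, 10000, size=500),
})

train_df, test_df = train_test_split(demo, test_size=0.3, random_state=0)

# Fit once, on the training rows only, and keep this model under its own name
train_model = smf.ols("salary ~ yearsworked", data=train_df).fit()

def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    diff = np.asarray(actual) - np.asarray(predicted)
    return float(np.sqrt(np.mean(diff ** 2)))

rmse_train = rmse(train_df["salary"], train_model.predict(train_df))
rmse_test = rmse(test_df["salary"], train_model.predict(test_df))

print(f"train RMSE: {rmse_train:.2f}")
print(f"test RMSE:  {rmse_test:.2f}")
print(f"difference: {rmse_test - rmse_train:.2f}")
```

Keeping the fitted model under a separate name avoids accidentally scoring the training set with a model that was refitted on other data.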
