#### Salary Prediction Web App – Project Process

##### 1) Collected and preprocessed workplace-related data for salary prediction.
##### 2) Performed exploratory data analysis (EDA) to understand feature relationships.
##### 3) Selected relevant features influencing salary (e.g., years at company, satisfaction level, average monthly hours).
##### 4) Built regression models using Scikit-learn (Linear Regression, Support Vector Regression, and Random Forest).
##### 5) Evaluated model performance using metrics like Mean Squared Error (MSE).
##### 6) Saved the trained model using joblib or pickle for deployment.
##### 7) Created an interactive web app using Streamlit to collect user input.
##### 8) Integrated the model into the Streamlit app for real-time predictions.
##### 9) Deployed the app locally.


In [None]:
import pandas as pd
data = pd.read_csv("employee_attrition_data.csv")

In [None]:
data.tail()

In [None]:
data.info()

In [None]:
data.isna().sum()

In [None]:
data.duplicated().sum()

In [None]:
data.columns

##### Dropping A Column

In [None]:
data.drop(columns="Employee_ID",inplace=True)
data.head()

#### Grouping Columns
##### i) Grouping two columns

In [None]:
import matplotlib.pyplot as plt

In [None]:
data["Gender"].value_counts().plot(kind="pie")
plt.ylabel("")
plt.title("Gender counts of Employees")
plt.show()

In [None]:
data.groupby("Job_Title")["Salary"].mean().sort_values(ascending = False)

In [None]:
data.groupby("Job_Title")["Salary"].mean().sort_values(ascending = False).plot(kind ="bar")
plt.title("Average salary by job title")
plt.ylabel("Mean Salary")
plt.show()

##### ii) Grouping multiple columns

In [None]:
data.head()

In [None]:
data.groupby(["Department","Promotion_Last_5Years"])["Salary"].mean()

#### Index Resetting

In [None]:
data.groupby(["Department","Promotion_Last_5Years"])["Salary"].mean().reset_index()
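
The effect of `reset_index()` is easier to see on a tiny frame. The sketch below uses made-up rows (the column names mirror this notebook, but the values are illustrative assumptions, not taken from the CSV).

```python
import pandas as pd

# Tiny made-up frame; values are illustrative only.
toy = pd.DataFrame({
    "Department": ["HR", "HR", "IT", "IT"],
    "Promotion_Last_5Years": [0, 1, 0, 0],
    "Salary": [40000, 50000, 60000, 80000],
})

# Grouping by two columns returns a Series with a two-level MultiIndex.
grouped = toy.groupby(["Department", "Promotion_Last_5Years"])["Salary"].mean()

# reset_index() turns the index levels back into ordinary columns.
flat = grouped.reset_index()
print(flat)
```

The flat frame is usually more convenient for plotting or merging, since the group keys are regular columns again.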

In [None]:
data.describe()

In [None]:
data["Salary"].describe()

In [None]:
data.columns

In [None]:
data.head()

In [None]:
X = data[["Years_at_Company","Satisfaction_Level","Average_Monthly_Hours"]]
y = data[["Salary"]]

#### Train Test Split

In [None]:
from sklearn.model_selection import train_test_split
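# random_state fixes the shuffle so the split (and the reported metrics) are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)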

#### Scaling

In [None]:
X

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

In [None]:
X_train

In [None]:
# Save the fitted scaler with joblib so the app can apply the same scaling at prediction time
import joblib
joblib.dump(scaler,"scaler.pkl")


In [None]:
# Only transform here: refitting on the test set would scale it inconsistently with the training data
X_test = scaler.transform(X_test)

In [None]:
X_test.shape
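
The usual convention for scaling is to learn the mean and standard deviation from the training rows only and then reuse them on the test rows. A minimal sketch of that pattern follows, using synthetic data as a stand-in for the three feature columns (the `*_demo` names and the generated values are assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three feature columns (assumed data, not the real CSV).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[5.0, 0.5, 200.0], scale=[2.0, 0.2, 30.0], size=(100, 3))
y_demo = rng.normal(loc=60000.0, scale=10000.0, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = demo_scaler.transform(X_te)      # reuse those statistics on the test rows
```

Because the test rows are scaled with training statistics, their column means will generally not be exactly zero, which is expected.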

# Training ML models

#### 1) Linear Regression ML Model

In [None]:
# Before training, define a helper that prints the regression metrics,
# so the mean absolute error and root mean squared error code is not repeated for every model.

import numpy as np
from sklearn.metrics import mean_absolute_error,mean_squared_error

def results(predictions):
    print("Mean absolute error on model is {}".format(mean_absolute_error(y_test,predictions)))
    # mean_squared_error returns the MSE; take the square root to report the RMSE
    print("Root mean squared error on model is {}".format(np.sqrt(mean_squared_error(y_test,predictions))))
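
To make the relationship between these metrics concrete, here is a small hand-checkable sketch with made-up numbers (the arrays are assumptions for illustration). The RMSE is the square root of the MSE, so it is expressed in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Absolute errors are 1, 0, 2, so the MAE is their average.
mae = mean_absolute_error(y_true, y_pred)
# Squared errors are 1, 0, 4, so the MSE is their average.
mse = mean_squared_error(y_true, y_pred)
# RMSE is the square root of the MSE.
rmse = np.sqrt(mse)

print(mae, mse, rmse)
```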




In [None]:
# Now Linear Regression

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train , y_train)

In [None]:
# Predict on the test set with linear regression and report the metrics
predictionslr = lr.predict(X_test)
results(predictionslr)

#### 2) Support Vector Regression ML Model

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
svrmodel = SVR()    

In [None]:
param_gridsvr = {"C":[0.01,0.1,0.5],"degree":[2,3,4], "kernel":["linear","rbf","poly"]}   # defining the parameters ("degree" only affects the "poly" kernel)
gridsvr= GridSearchCV(svrmodel,param_gridsvr)                                             # initializing the grid search

In [None]:
gridsvr.fit(X_train, y_train.values.ravel())

In [None]:
gridsvr.best_params_

In [None]:
predictionssvr = gridsvr.predict(X_test)
results(predictionssvr)
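
When `scoring` is not passed, `GridSearchCV` ranks candidates with the estimator's own `score` method, which for regressors is R². If the models are later compared by squared error, one option is to set the scoring explicitly. The following is a sketch on synthetic data; the `*_demo` names, the reduced grid, and the generated values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic regression problem (assumed data, not the notebook's dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

demo_grid = GridSearchCV(
    SVR(),
    param_grid={"C": [0.1, 1.0], "kernel": ["linear", "rbf"]},
    scoring="neg_mean_squared_error",  # higher is better, so the MSE is negated
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)
print(-demo_grid.best_score_)  # cross-validated MSE of the best candidate
```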

#### 3) Tree-based Random Forest Regression ML Model

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfrmodel = RandomForestRegressor()
param_gridrfr = {"n_estimators":[2,3,4,5,6],"max_depth":[5,10,15]}

In [None]:
gridrfr = GridSearchCV(rfrmodel, param_gridrfr)
gridrfr.fit(X_train,y_train.values.ravel())

In [None]:
gridrfr.best_params_

In [None]:
predictionsrfr = gridrfr.predict(X_test)
results(predictionsrfr)

In [None]:
gridsvr  # chosen because its mean squared error is lower than that of the other models

In [None]:
joblib.dump(gridsvr,"model.pkl")
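
At prediction time, the saved scaler and model have to be loaded and applied in the same order as during training. The sketch below shows one way that could look. It trains a small stand-in model on synthetic data and writes to a temporary directory so it runs on its own; `predict_salary` is a hypothetical helper, not part of this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic training data standing in for X and y (assumed values).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[5.0, 0.5, 200.0], scale=[2.0, 0.2, 30.0], size=(50, 3))
y_demo = 30000 + 2000 * X_demo[:, 0] + rng.normal(scale=500.0, size=50)

demo_scaler = StandardScaler().fit(X_demo)
demo_model = LinearRegression().fit(demo_scaler.transform(X_demo), y_demo)

# Save both artifacts, then load them back as a deployed app would.
with tempfile.TemporaryDirectory() as tmp:
    scaler_path = os.path.join(tmp, "scaler.pkl")
    model_path = os.path.join(tmp, "model.pkl")
    joblib.dump(demo_scaler, scaler_path)
    joblib.dump(demo_model, model_path)
    loaded_scaler = joblib.load(scaler_path)
    loaded_model = joblib.load(model_path)


def predict_salary(years, satisfaction, hours):
    """Hypothetical helper: scale one row of inputs, then predict."""
    row = np.array([[years, satisfaction, hours]])
    return float(loaded_model.predict(loaded_scaler.transform(row))[0])


prediction = predict_salary(4, 0.7, 180)
print(prediction)
```

The feature order passed to the helper must match the column order used for training (`Years_at_Company`, `Satisfaction_Level`, `Average_Monthly_Hours`).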

In [None]:
X.columns