# Problem Statement
- **Context and Company Background:** TechWorks Consulting is a company specializing in IT talent recruitment, with a unique approach to matching skilled IT professionals with job opportunities.
- **Data Description:** The dataset contains information about colleges, cities, roles, previous experience, and salary. This information is used to train and test the predictive model.
- **Regression Task:** The primary objective is a regression task: predicting a continuous variable, specifically the salary of newly hired employees.
- **Role of Statistics:** Statistics is used to build the model and to check its accuracy.
- **Data Preprocessing:** Data preprocessing is the most important task, as it involves handling missing values, outliers, and categorical variables, as well as normalization and feature selection.

# Creating a Salary Prediction Model: A Systematic Approach
- **Data Understanding:**
  - Begin by thoroughly understanding the provided dataset, including its structure, columns, and the meaning of each variable. Gain insights into the data's distribution, summary statistics, and potential outliers.
- **Data Preprocessing:**
  - Handle Missing Values: Identify and address missing data by imputation or removal, ensuring that data is complete.
  - Outlier Detection and Treatment: Detect and handle outliers in the dataset, which could impact the model's accuracy.
  - Convert Categorical Data: Transform categorical variables (e.g., "College" and "City") into numerical format.
  - Normalize Data: Normalize numerical features to bring them to a common scale to avoid any feature dominating the model.
  - Feature Selection: Use statistical techniques such as Lasso regularization or correlation analysis to select the most relevant features for salary prediction. Ridge shrinks coefficients but does not remove features.
- **Performing Exploratory Data Analysis (EDA)**
- **Model Selection:**
  - Choose different regression models (e.g., Simple Linear Regression, Multiple Linear Regression) to build and evaluate the predictive models.
- **Model Training and Evaluation:**
  - Split the dataset into training and testing sets to train the models and assess their performance.
  - Use appropriate evaluation metrics like Mean Squared Error (MSE), R-squared, and Mean Absolute Error (MAE) to measure the model's accuracy.
  - Experiment with different hyperparameters for each model and use cross-validation to avoid overfitting.
- **Model Comparison:**
  - Compare the performance of different models and select the one with the best accuracy and generalization.
- **Further Improvement:**
  - Consider additional techniques for model improvement, such as feature engineering, hyperparameter tuning, and ensemble methods.
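
The steps listed above can be sketched end to end with a scikit-learn pipeline. This is only an illustration under stated assumptions: the data is synthetic, and the column names (`experience_months`, `previous_salary`, `role`, `salary`) are hypothetical stand-ins, not the columns of the TechWorks dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical columns, NOT the TechWorks dataset)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "experience_months": rng.integers(0, 60, size=n).astype(float),
    "previous_salary": rng.normal(50_000, 8_000, size=n),
    "role": rng.choice(["Executive", "Manager"], size=n),
})
toy["salary"] = (
    1.1 * toy["previous_salary"]
    + 200 * toy["experience_months"]
    + np.where(toy["role"] == "Manager", 15_000, 0)
    + rng.normal(0, 2_000, size=n)
)
# Blank out a few values so the imputation step has something to do
toy.loc[toy.index[:5], "experience_months"] = np.nan

numeric_cols = ["experience_months", "previous_salary"]
categorical_cols = ["role"]

# Preprocessing: impute and scale numeric columns, one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(drop="first"), categorical_cols),
])
workflow_model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])

# Train/test split, fit, and evaluation
X_toy = toy.drop(columns="salary")
y_toy = toy["salary"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

workflow_model.fit(X_tr, y_tr)
workflow_pred = workflow_model.predict(X_te)
workflow_r2 = r2_score(y_te, workflow_pred)
workflow_mae = mean_absolute_error(y_te, workflow_pred)
print("r2_score:", workflow_r2)
print("MAE:", workflow_mae)
```

Keeping the preprocessing inside the pipeline means the imputer and scaler are fitted on the training split only, which avoids leaking test information into the model.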

# Available ML Model Options for This Task
Several machine learning models are available for the regression task of predicting employee salary at TechWorks Consulting. The choice of model depends on various factors, including the nature of the data, the complexity of the problem, and the need for model interpretability. Here are some of the available options:
**1. Linear Regression:**
- Linear regression is a simple and interpretable model that assumes a linear relationship between the features and the target variable (salary). It's a good starting point and can provide baseline performance.

**2. Ridge Regression and Lasso Regression:**
- Ridge and Lasso regression are regularization techniques that can be used to handle multicollinearity and prevent overfitting. They are variants of linear regression that add regularization terms to the cost function.

**3. Decision Trees and Tree Ensembles:**
- Decision tree-based models, including ensembles like Random Forest and Gradient Boosting, are capable of capturing non-linear relationships in the data. They can handle both numerical and categorical features and provide feature-importance estimates.

**4. K-Nearest Neighbors (KNN):**
- KNN is a non-parametric method that predicts by averaging the target values of the 'k' nearest data points. It can be effective for small to medium-sized datasets.

**5. Polynomial Regression:**
- Polynomial regression can be used to capture non-linear relationships by introducing polynomial features.

I will apply three of them, first with default parameters and then, for some, with changed parameters to showcase the effect.
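
As a rough illustration of how the options listed above can be compared, the sketch below scores each model family with 5-fold cross-validation. It assumes synthetic data with one mildly non-linear term, and the hyperparameter values are arbitrary illustrative choices, so the ranking it prints says nothing about the salary dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic regression data (illustrative only, not the salary dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = (
    3.0 * X_demo[:, 0]
    - 2.0 * X_demo[:, 1]
    + 0.5 * X_demo[:, 2] ** 2
    + rng.normal(scale=0.5, size=300)
)

# One candidate per model family discussed above
candidates = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "Polynomial (degree 2)": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Mean R-squared over 5 folds for each candidate
cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="r2")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean R-squared = {cv_scores[name]:.3f}")
```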

In [1]:
# Import the pandas library for data manipulation and analysis
# Import the numpy library for numerical operations and array processing
# Import the seaborn library for statistical data visualization
# Import matplotlib's pyplot module for figure control and saving plots

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Read a CSV file into a DataFrame

df = pd.read_csv("C:/Users/Preet/ML-PROJECT/data/ML case Study.csv")
college = pd.read_csv("C:/Users/Preet/ML-PROJECT/data/Colleges.csv")
cities = pd.read_csv("C:/Users/Preet/ML-PROJECT/data/cities.csv")

In [None]:
# Overview of Data

df.head()

In [None]:
# Overview of College data

college.head()

In [None]:
# Overview of City data

cities.head()

In [None]:
# Extract data from the "Tier 1," "Tier 2," and "Tier 3" columns of the 'college' DataFrame
# and store them in separate lists 'Tier1,' 'Tier2,' and 'Tier3' for further analysis.

Tier1 = college["Tier 1"].tolist()
Tier2 = college["Tier 2"].tolist()
Tier3 = college["Tier 3"].tolist()

In [None]:
# Display the contents of Tier1

Tier1

In [None]:
# Assign tier values to colleges in the DataFrame based on their tier classification
# - If a college is in 'Tier1', set its value to 3
# - If a college is in 'Tier2', set its value to 2
# - If a college is in 'Tier3', set its value to 1
# Tier 1 colleges get 3 and Tier 3 colleges get 1 because Tier 1 carries a higher weight than Tiers 2 and 3.

# Build one lookup table; Tier1 is added last so it wins if a college appears in several lists
tier_map = {}
for tier_list, value in ((Tier3, 1), (Tier2, 2), (Tier1, 3)):
    tier_map.update({name: value for name in tier_list if pd.notna(name)})

# Replace college names with tier values in one pass and let pandas infer the numeric dtype
df["College"] = df["College"].replace(tier_map).infer_objects()
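
The tier encoding above can be sanity-checked on a small example. The sketch below uses hypothetical college names and tier lists (none come from the actual CSV files) and `Series.map`, which returns NaN for any name missing from every list, so unmapped colleges are easy to spot.

```python
import pandas as pd

# Hypothetical tier lists and college names, for illustration only
tier1_demo = ["College A", "College B"]
tier2_demo = ["College C"]
tier3_demo = ["College D"]

demo = pd.DataFrame({"College": ["College A", "College C", "College D", "College X"]})

# Same weighting as above: Tier 1 -> 3, Tier 2 -> 2, Tier 3 -> 1
tier_lookup = {}
for names, value in ((tier3_demo, 1), (tier2_demo, 2), (tier1_demo, 3)):
    tier_lookup.update(dict.fromkeys(names, value))

# map() yields NaN for names that are not in any tier list
demo["College_tier"] = demo["College"].map(tier_lookup)
unmapped = demo.loc[demo["College_tier"].isna(), "College"].tolist()

print(demo)
print("Unmapped colleges:", unmapped)
```

If the unmapped list is not empty on the real data, the names in the main file and in the tier file probably differ in spelling or whitespace.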

In [None]:
df.head()

In [None]:
# Extracting lists of metropolitan and non-metropolitan cities from the 'cities' DataFrame

metro = cities['Metrio City'].tolist()
non_metro_cities = cities['non-metro cities'].tolist()

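In [None]:
# Encode cities: metropolitan cities become 1, non-metropolitan cities become 0

# Metro entries are added last so they win if a city appears in both lists
city_map = {name: 0 for name in non_metro_cities if pd.notna(name)}
city_map.update({name: 1 for name in metro if pd.notna(name)})

# Replace city names in one pass and let pandas infer the numeric dtype
df["City"] = df["City"].replace(city_map).infer_objects()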

In [None]:
df.head()

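In [None]:
# One-hot encode the remaining categorical columns (e.g. Role);
# drop_first=True drops one level per column to avoid redundant dummy columns

df = pd.get_dummies(df, drop_first=True)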

In [None]:
df.sample(5)

In [None]:
# Checking missing values in data

df.isna().sum()

In [None]:
# Information about data
df.info()

In [None]:
# Statistical info about numerical data

df.describe()

# Detection of Outliers

In [None]:
# Using seaborn library to plot box plot for detection of outliers
sns.boxplot(df['Previous CTC'])
plt.savefig('plots/previous_ctc_boxplot.png')

In [None]:
sns.boxplot(df['Graduation Marks'])
plt.savefig('plots/graduation_marks_boxplot.png')

In [None]:
sns.boxplot(df['EXP (Month)'])
plt.savefig('plots/exp_month_boxplot.png')

In [None]:
sns.boxplot(df['CTC'])
plt.savefig('plots/ctc_boxplot.png')

In [None]:
# Correlation between variables
corr = df.corr()
corr

In [None]:
# Visual representation of corr
# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.savefig('plots/correlation_heatmap.png')

#### Outliers present in Previous CTC column

In [None]:
percent25 = df['Previous CTC'].quantile(0.25)
percent75 = df['Previous CTC'].quantile(0.75)

In [None]:
iqr = percent75-percent25

In [None]:
upper_limit = percent75 + 1.5*iqr
lower_limit = percent25 - 1.5*iqr

In [None]:
df[(df['Previous CTC'] < lower_limit) | (df['Previous CTC'] > upper_limit)]

The rows in the DataFrame above are the outliers in the "Previous CTC" column. These outliers are not extreme, so in my opinion keeping them should not affect the model much.
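
The IQR rule used above can be wrapped in a small reusable helper. The function name `iqr_outliers` and the toy values are assumptions made for illustration; they are not part of the original notebook or dataset.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values of `series` outside [Q1 - k*IQR, Q3 + k*IQR] and both limits."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    mask = (series < lower) | (series > upper)
    return series[mask], lower, upper

# Toy values (not from the dataset): 100 lies far from the rest
values = pd.Series([10, 12, 11, 13, 12, 11, 14, 100])
outliers, lower, upper = iqr_outliers(values)

print("limits:", lower, upper)
print("outliers:", outliers.tolist())
```

Calling the helper once per column, for example on "Previous CTC" and "CTC", avoids repeating the quantile and limit cells.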

#### Outliers present in CTC column

In [None]:
percent25 = df['CTC'].quantile(0.25)
percent75 = df['CTC'].quantile(0.75)

In [None]:
iqr = percent75-percent25

In [None]:
upper_limit = percent75 + 1.5*iqr
lower_limit = percent25 - 1.5*iqr

In [None]:
df[(df['CTC'] < lower_limit) | (df['CTC'] > upper_limit)]

As seen above, there are some outliers in the "CTC" column, but they are not extreme enough to make a large difference to the predictions. In my opinion, keeping these outliers in the data is therefore more useful than removing them.

### Conclusion on detection of Outliers:
- There are no extreme outliers in the dataset that could make a large difference to the machine learning model. The describe() output also shows no extreme values.
- As seen above, "Previous CTC" and "CTC" contain some outliers, but from my perspective these are not going to affect the model.
- In the heatmap, there is some correlation between Role_manager and CTC, and between Previous CTC and CTC.
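
Relationships like the ones read off the heatmap can also be ranked numerically by sorting each feature's correlation with the target. The sketch below assumes synthetic data with hypothetical column names (`previous_pay`, `is_manager`, `noise_feature`, `pay`), so its numbers do not describe the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data with hypothetical column names (not the real dataset)
rng = np.random.default_rng(1)
n = 500
frame = pd.DataFrame({
    "previous_pay": rng.normal(50_000, 5_000, size=n),
    "is_manager": rng.integers(0, 2, size=n),
    "noise_feature": rng.normal(size=n),
})
frame["pay"] = (
    frame["previous_pay"]
    + 20_000 * frame["is_manager"]
    + rng.normal(0, 2_000, size=n)
)

# Correlation of every feature with the target, strongest first by absolute value
target_corr = frame.corr()["pay"].drop("pay")
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```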

# Applying Machine Learning models without Feature Scaling
The models are first applied without any feature scaling to check their baseline performance.

In [None]:
# Import necessary libraries for data splitting, modeling, and evaluation

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [None]:
# Split data into independent variables (X) and the dependent variable (y)

X = df.loc[:, df.columns != 'CTC']
y = df['CTC']

In [None]:
# Split Data into train and test with test_size = 0.2(80% data into train and 20% to test)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [None]:
y_test

In [None]:
# Create a LinearRegression model
linear_reg = LinearRegression()

# Fit the model to the training data
linear_reg.fit(X_train, y_train)

# Make predictions on the test data
linear_reg_pred = linear_reg.predict(X_test)

# Calculate and print the R-squared (r2) score
print("r2_score:",r2_score(y_test, linear_reg_pred))

# Calculate and print the Mean Absolute Error (MAE)
print("MAE:", mean_absolute_error(y_test, linear_reg_pred))

# Calculate and print the Mean Squared Error (MSE)
print("MSE:", mean_squared_error(y_test, linear_reg_pred))

print()

# Print the coefficients of the linear regression model
print("Coef:",linear_reg.coef_)

# Print the intercept of the linear regression model
print("Intercept:",linear_reg.intercept_)
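
The three metrics printed above can be reproduced by hand, which makes clear what each one measures. The numbers below are made-up example values, not output from the model in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Small made-up example values (not model output from the notebook)
actual = np.array([50.0, 60.0, 70.0, 80.0])
predicted = np.array([52.0, 58.0, 69.0, 85.0])

errors = actual - predicted
mae_manual = np.mean(np.abs(errors))            # average absolute error
mse_manual = np.mean(errors ** 2)               # average squared error
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot                 # share of variance explained

print("MAE:", mae_manual, "| sklearn:", mean_absolute_error(actual, predicted))
print("MSE:", mse_manual, "| sklearn:", mean_squared_error(actual, predicted))
print("R2: ", r2_manual, "| sklearn:", r2_score(actual, predicted))
```

MAE is in the same units as the salary, while MSE is in squared units and penalizes large errors more heavily. R-squared is unitless.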

In [None]:
# Create a Ridge regression model with default parameters
ridge = Ridge()

# Fit the model to training data
ridge.fit(X_train, y_train)

# Make prediction on test data
ridge_predict = ridge.predict(X_test)

# Calculate and print the R-squared (r2) score
print("r2_score:",r2_score(y_test, ridge_predict))

# Calculate and print the Mean Absolute Error (MAE)
print("MAE:", mean_absolute_error(y_test, ridge_predict))

# Calculate and print the Mean Squared Error (MSE)
print("MSE:", mean_squared_error(y_test, ridge_predict))

print()

# Print the coefficients of the Ridge regression model
print("Coef:",ridge.coef_)

# Print the intercept of the Ridge regression model
print("Intercept:",ridge.intercept_)

In [None]:
# Create a Ridge regression model with a specified alpha value and solver
ridge_tuned = Ridge(alpha=0.3, solver='cholesky')

# Fit the Ridge model to the training data
ridge_tuned.fit(X_train, y_train)

# Make predictions on the test data using the tuned Ridge model
ridge_predict_tuned = ridge_tuned.predict(X_test)

# Calculate and print the R-squared (r2) score to evaluate model performance
print("r2_score:",r2_score(y_test, ridge_predict_tuned))

# Calculate and print the Mean Absolute Error (MAE) to evaluate model performance
print("MAE:",mean_absolute_error(y_test, ridge_predict_tuned))

# Calculate and print the Mean Squared Error (MSE) to evaluate model performance
print("MSE:", mean_squared_error(y_test, ridge_predict_tuned))

print()

# Print the coefficients of the tuned Ridge regression model
print("Coef:",ridge_tuned.coef_)

# Print the intercept of the tuned Ridge regression model
print("Intercept:",ridge_tuned.intercept_)
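
The alpha value above is a single manual choice. One possible way to pick it more systematically is a cross-validated grid search, sketched below on synthetic data. The alpha grid is an arbitrary illustrative assumption, and the result does not indicate which alpha suits the salary data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (illustrative only)
rng = np.random.default_rng(7)
X_syn = rng.normal(size=(200, 5))
y_syn = X_syn @ np.array([4.0, -3.0, 0.0, 2.0, 0.0]) + rng.normal(scale=1.0, size=200)

X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Candidate alpha values form an arbitrary illustrative grid
param_grid = {"alpha": [0.01, 0.1, 0.3, 1.0, 10.0, 100.0]}

# 5-fold cross-validation on the training split only; the test split stays untouched
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X_syn_train, y_syn_train)

best_alpha = search.best_params_["alpha"]
test_r2 = search.score(X_syn_test, y_syn_test)
print("Best alpha:", best_alpha)
print("Mean CV R-squared:", search.best_score_)
print("Test R-squared:", test_r2)
```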