# Salary Prediction (Linear Regression) - Guided Notebook

**Goal:** Build a simple linear regression model to predict salary from years of experience. This notebook teaches the workflow step by step: you *write* the code in the `YOUR TURN` cells and run it. If you get stuck, each `YOUR TURN` cell contains a commented-out hint with a ready-to-run example.

---

**How to use this notebook**
- Open it in Jupyter Notebook / JupyterLab or Google Colab.
- Follow the sections in order. Each section has a short explanation (markdown) and a `YOUR TURN` code cell where you should type and run code.
- If you are stuck, uncomment the hint code in the cell and run it.



## 1) Imports & setup

In this section, import the libraries used throughout the notebook. Keep this cell simple.


In [None]:
# YOUR TURN: import the basic libraries (pandas, numpy, matplotlib, seaborn, sklearn parts)
# Hint (uncomment if you need):
# import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import StandardScaler
# from sklearn.linear_model import LinearRegression
# from sklearn import metrics

print('Write your imports above and run this cell.')

## 2) Load the dataset

Place your CSV in the repository (e.g. `data/Salary_dataset.csv`, relative to the notebook) or upload it to Colab and update the path. Load it with pandas and inspect the first rows.


In [None]:
# YOUR TURN: set DATA_PATH to your CSV file and load it using pandas
# Example hint (uncomment to run if stuck):
# DATA_PATH = 'data/Salary_dataset.csv'  # change path if needed
# df = pd.read_csv(DATA_PATH)
# display(df.head())

print('Load the CSV into a dataframe named df.')

## 3) Quick EDA (Exploratory Data Analysis)

Check `info()`, `describe()`, and missing values, then draw a simple scatter plot of `YearsExperience` vs `Salary`.


In [None]:
# YOUR TURN: run basic EDA on df
# Hint:
# df.info()  # info() prints its own summary; wrapping it in print() adds a stray "None"
# print(df.describe())
# print(df.isna().sum())
# df.plot.scatter(x='YearsExperience', y='Salary', figsize=(8,5))

print('Do basic EDA: info, describe, isna, and scatter plot.')
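
A numeric companion to the scatter plot is the Pearson correlation between the two columns. The sketch below runs on a small synthetic stand-in: the `demo` DataFrame and the numbers used to generate it are assumptions for illustration, not the real dataset. On your own data, call the same methods on `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): roughly linear salary plus noise.
rng = np.random.default_rng(0)
years = np.round(rng.uniform(1, 10, size=30), 1)
salary = 25000 + 9500 * years + rng.normal(0, 5000, size=30)
demo = pd.DataFrame({"YearsExperience": years, "Salary": salary})

# Pearson correlation: values near +1 suggest a straight line is a reasonable model.
r = demo["YearsExperience"].corr(demo["Salary"])

# Total missing cells and fully duplicated rows, two quick data-quality checks.
n_missing = int(demo.isna().sum().sum())
n_duplicates = int(demo.duplicated().sum())

print(f"correlation: {r:.3f}, missing: {n_missing}, duplicate rows: {n_duplicates}")
```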

## 4) Prepare features (X) and target (y)

Select the feature column(s) for `X` and the target column for `y`, dropping anything that is not needed. Then split the data with `train_test_split`.


In [None]:
# YOUR TURN: create X and y, then split to train/test
# Hint:
# X = df[['YearsExperience']]
# y = df['Salary']
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('Prepare X, y and split into train/test.')
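
To see what the split produces, it can help to check the shapes of the pieces. This is a minimal sketch on a 30-row synthetic DataFrame (`demo` and its values are assumptions, not the real data). Note the double brackets for `X`: scikit-learn expects a 2-D feature table even when there is only one feature.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), 30 rows.
rng = np.random.default_rng(0)
years = np.round(rng.uniform(1, 10, size=30), 1)
salary = 25000 + 9500 * years + rng.normal(0, 5000, size=30)
demo = pd.DataFrame({"YearsExperience": years, "Salary": salary})

X = demo[["YearsExperience"]]  # double brackets -> 2-D DataFrame
y = demo["Salary"]             # single brackets -> 1-D Series

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# With 30 rows and test_size=0.2, 6 rows are held out and 24 are used for training.
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```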

## 5) Scaling (StandardScaler) - optional

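For ordinary least-squares linear regression, scaling the features does not change the predictions, so it is optional here. It is still good practice, because many other models are sensitive to feature scale. Fit the scaler on the training set only, then use it to transform both the training and test sets, so that no information from the test set leaks into preprocessing.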


In [None]:
# YOUR TURN: apply StandardScaler to X_train and X_test
# Hint:
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)

print('Fit the scaler on the training data only, then transform both train and test. If you skip scaling, use X_train / X_test directly for Linear Regression.')
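
The sketch below illustrates both points from this section on synthetic data (the data-generating numbers are assumptions): the scaler's statistics come from the training rows only, and a plain `LinearRegression` gives the same predictions with or without scaling.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(30, 1))
y = 25000 + 9500 * X[:, 0] + rng.normal(0, 5000, size=30)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

# The stored mean is the training mean, not the mean of the full dataset.
print("scaler mean:", scaler.mean_[0], "| train mean:", X_train.mean())

# Fit one model on raw features and one on scaled features, then compare.
pred_raw = LinearRegression().fit(X_train, y_train).predict(X_test)
pred_scaled = LinearRegression().fit(X_train_scaled, y_train).predict(X_test_scaled)
print("same predictions:", np.allclose(pred_raw, pred_scaled))
```

The predictions agree up to floating-point error because the scaled feature is just a shifted and rescaled copy of the original one, and the fitted slope and intercept absorb that change.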

## 6) Train a Linear Regression model

Train `LinearRegression()` on training data, then predict on test set.


In [None]:
# YOUR TURN: train LinearRegression and predict
# Hint:
# model = LinearRegression()
# model.fit(X_train_scaled, y_train)     # or model.fit(X_train, y_train) if not scaling
# y_pred = model.predict(X_test_scaled)  # or model.predict(X_test)

print('Train a LinearRegression model and create predictions y_pred.')
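
After fitting, the learned line can be read from `coef_` and `intercept_`. The sketch below uses noise-free toy data (an assumption, chosen so the true line is known) and shows that a prediction is simply `intercept + slope * years`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free toy data (assumption): salary = 25000 + 9500 * years
years = np.array([[1.0], [2.0], [3.0], [5.0], [8.0]])
salary = 25000 + 9500 * years[:, 0]

model = LinearRegression().fit(years, salary)
slope = model.coef_[0]        # change in salary per additional year
intercept = model.intercept_  # predicted salary at 0 years
print(f"salary ~ {intercept:.0f} + {slope:.0f} * years")

# A prediction from the model equals the line evaluated by hand.
manual = intercept + slope * 4.0
predicted = model.predict(np.array([[4.0]]))[0]
print(manual, predicted)
```

If you trained on scaled features, the slope is expressed per standard deviation of `YearsExperience` rather than per year.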

## 7) Evaluation

Compute MAE, MSE, RMSE and R2. Then show a scatter plot of true vs predicted and the regression line.


In [None]:
# YOUR TURN: evaluate the model
# Hint:
# from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# mae = mean_absolute_error(y_test, y_pred)
# mse = mean_squared_error(y_test, y_pred)
# rmse = mse ** 0.5
# r2 = r2_score(y_test, y_pred)
# print(f'MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}, R2: {r2:.3f}')

# Plot
# plt.figure(figsize=(8,6))
# plt.scatter(X_test, y_test, label='True')
# plt.scatter(X_test, y_pred, label='Predicted')
# plt.plot(X_test, y_pred, color='red', linewidth=2)
# plt.legend()
# plt.xlabel('YearsExperience')
# plt.ylabel('Salary')
# plt.title('True vs Predicted (Test Set)')

print('Calculate MAE/MSE/RMSE/R2 and plot results.')
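
To make the four metrics less of a black box, the sketch below computes them by hand with NumPy and checks the results against scikit-learn. The four true and predicted values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values (assumption) standing in for y_test and y_pred.
y_true = np.array([40000.0, 55000.0, 62000.0, 81000.0])
y_hat = np.array([42000.0, 53000.0, 65000.0, 79000.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))  # average absolute error, in salary units
mse = np.mean(errors ** 2)     # average squared error, in squared salary units
rmse = np.sqrt(mse)            # back in salary units, penalises large errors more

# R2 compares the model's squared error with that of always predicting the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}, R2: {r2:.3f}")

# The hand-computed values agree with scikit-learn's implementations.
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)),
      np.isclose(mse, mean_squared_error(y_true, y_hat)),
      np.isclose(r2, r2_score(y_true, y_hat)))
```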

## 8) Save the model (optional)

Save the trained model using `joblib` or `pickle` so you can load it later without re-training.


In [None]:
# YOUR TURN: save your model
# Hint:
# import joblib
# joblib.dump(model, 'salary_model.joblib')

print('Save the model to disk so you can reuse it.')
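
Saving is only half of the round trip. The sketch below also loads the model back with `joblib.load` and confirms it predicts the same values. The toy data and the temporary directory are assumptions for illustration; in the notebook you would use your own path.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model (assumption) standing in for the one trained in section 6.
years = np.array([[1.0], [2.0], [3.0], [5.0], [8.0]])
salary = 25000 + 9500 * years[:, 0]
model = LinearRegression().fit(years, salary)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "salary_model.joblib"
    joblib.dump(model, model_path)
    loaded = joblib.load(model_path)

# The reloaded model gives the same predictions as the original.
new_years = np.array([[4.0], [6.5]])
same = np.allclose(model.predict(new_years), loaded.predict(new_years))
print("loaded model matches:", same)
```

If you trained on scaled features, save the fitted scaler as well, since new inputs must go through the same transformation. Only load model files from sources you trust, because joblib files are pickle-based.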

## 9) Try other models (YOUR TURN)

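Try `DecisionTreeRegressor`, `RandomForestRegressor`, or `SVR`, and compare their R2 and RMSE with the linear model. Cross-validation gives a more reliable comparison than a single train/test split, especially when the dataset is small.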


In [None]:
# YOUR TURN: try at least one other regressor and compare metrics
# Hint example:
# from sklearn.ensemble import RandomForestRegressor
# from sklearn.metrics import r2_score  # already imported if you ran the section 7 hint
# rf = RandomForestRegressor(n_estimators=100, random_state=42)
# rf.fit(X_train, y_train)
# y_pred_rf = rf.predict(X_test)
# print('R2 RF:', r2_score(y_test, y_pred_rf))

print('Try RandomForest or SVR and compare performance.')
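
One way to run the comparison with cross-validation is sketched below, using `cross_val_score` with 5 shuffled folds. The synthetic data, the choice of 5 folds, and the `models` and `results` dictionaries are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(30, 1))
y = 25000 + 9500 * X[:, 0] + rng.normal(0, 5000, size=30)

# Same folds for every model so the comparison is like for like.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
}

results = {}
for name, estimator in models.items():
    r2_scores = cross_val_score(estimator, X, y, cv=cv, scoring="r2")
    # scikit-learn reports RMSE negated so that higher is better; flip the sign back.
    rmse_scores = -cross_val_score(
        estimator, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    results[name] = {"r2": r2_scores.mean(), "rmse": rmse_scores.mean()}
    print(f"{name}: mean R2 = {r2_scores.mean():.3f}, mean RMSE = {rmse_scores.mean():.0f}")
```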

## 10) Conclusion & Next steps

Write a short conclusion: what did you learn? What can improve? Suggestions:
- Try feature engineering
- Try polynomial regression
- Add cross-validation and hyperparameter tuning
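
As a starting point for the polynomial regression suggestion, the sketch below chains `PolynomialFeatures` and `LinearRegression` in a pipeline. The curved toy data is an assumption, constructed so that a degree-2 model fits it exactly while a straight line does not. Scores are computed on the training data here purely to illustrate the difference in fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noise-free curved toy data (assumption): salary grows quadratically with years.
x = np.linspace(1, 10, 20).reshape(-1, 1)
y = 30000 + 5000 * x[:, 0] + 400 * x[:, 0] ** 2

# The pipeline expands x into [x, x^2] and then fits an ordinary linear model.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_model.fit(x, y)

# A straight line for comparison.
line_model = LinearRegression().fit(x, y)

poly_r2 = poly_model.score(x, y)
line_r2 = line_model.score(x, y)
print(f"degree 2 R2: {poly_r2:.4f}, straight line R2: {line_r2:.4f}")
```

On real data, compare the two on a held-out test set or with cross-validation, since higher degrees can overfit.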

---

Good job! Once you finish, push the notebook to GitHub and include the repository in your portfolio.