# **Learning About MSE**

### **Using a small dataset**

In [3]:
# 1. Import Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 2. Create the Dataset
data = {
    'YearsExperience': [1, 2, 3, 4, 5],
    'Salary': [35, 40, 50, 60, 65]
}


df = pd.DataFrame(data)

# 3. Split the Data into Training and Testing Sets
X = df[['YearsExperience']]  # Features (Years of Experience)
y = df['Salary']  # Target (Salary)

# Splitting the dataset into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Train the Linear Regression Model
model = LinearRegression()
model.fit(X_train, y_train)

# 5. Test the Model (Make Predictions)
y_pred = model.predict(X_test)

# 6. Evaluate the Model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Display the test set and predicted results
print("\nTest Set (Years of Experience vs Actual Salary):")
print(pd.DataFrame({'YearsExperience': X_test['YearsExperience'], 'Actual Salary': y_test}))

print("\nPredicted Salary based on Years of Experience:")
print(pd.DataFrame({'YearsExperience': X_test['YearsExperience'], 'Predicted Salary': y_pred}))

# 7. Take User Input to Predict the Salary
years_of_experience = float(input("\nEnter the years of experience to predict the salary: "))

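# Wrap the input in a one-row DataFrame so its column name matches the training data
# (passing a bare NumPy array to a model fitted on a DataFrame triggers a feature-name warning)
user_input = pd.DataFrame({'YearsExperience': [years_of_experience]})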

# Predict salary based on user input
predicted_salary = model.predict(user_input)

# Display the result
print(f"\nThe predicted salary for {years_of_experience} years of experience is: ${predicted_salary[0]:.2f}k")


Mean Squared Error: 8.163265306122472

Test Set (Years of Experience vs Actual Salary):
   YearsExperience  Actual Salary
1                2             40

Predicted Salary based on Years of Experience:
   YearsExperience  Predicted Salary
1                2         42.857143

The predicted salary for 10.0 years of experience is: $104.57k




### **Using a larger dataset**

In [7]:
# 1. Import Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 2. Create the Dataset
data = {
    'YearsExperience': [
        1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7, 
        3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6.0, 
        6.8, 7.1, 7.9, 8.2, 8.7, 9.0, 9.5, 9.6, 10.3, 10.5
    ],
    'Salary': [
        39343.00, 46205.00, 37731.00, 43525.00, 39891.00, 56642.00, 60150.00, 
        54445.00, 64445.00, 57189.00, 63218.00, 55794.00, 56957.00, 57081.00, 
        61111.00, 67938.00, 66029.00, 83088.00, 81363.00, 93940.00, 91738.00, 
        98273.00, 101302.00, 113812.00, 109431.00, 105582.00, 116969.00, 
        112635.00, 122391.00, 121872.00
    ]
}
df = pd.DataFrame(data)

# 3. Split the Data into Training and Testing Sets
X = df[['YearsExperience']]  # Features (Years of Experience)
y = df['Salary']  # Target (Salary)

# Splitting the dataset: test_size=0.0000001 is rounded up to a single row,
# so 29 rows train the model and 1 row tests it (this is not an 80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.0000001, random_state=52)

# 4. Train the Linear Regression Model
model = LinearRegression()
model.fit(X_train, y_train)

# 5. Test the Model (Make Predictions)
y_pred = model.predict(X_test)

# 6. Evaluate the Model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Display the test set and predicted results
print("\nTest Set (Years of Experience vs Actual Salary):")
print(pd.DataFrame({'YearsExperience': X_test['YearsExperience'], 'Actual Salary': y_test}))

print("\nPredicted Salary based on Years of Experience:")
print(pd.DataFrame({'YearsExperience': X_test['YearsExperience'], 'Predicted Salary': y_pred}))

# 7. Take User Input to Predict the Salary
years_of_experience = float(input("\nEnter the years of experience to predict the salary: "))

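# Wrap the input in a one-row DataFrame so its column name matches the training data
# (passing a bare NumPy array to a model fitted on a DataFrame triggers a feature-name warning)
user_input = pd.DataFrame({'YearsExperience': [years_of_experience]})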

# Predict salary based on user input
predicted_salary = model.predict(user_input)

# Display the result
# Salaries in this dataset are already in dollars, so no "k" suffix is needed
print(f"\nThe predicted salary for {years_of_experience} years of experience is: ${predicted_salary[0]:.2f}")


Mean Squared Error: 3102669.275094161

Test Set (Years of Experience vs Actual Salary):
    YearsExperience  Actual Salary
20              6.8        91738.0

Predicted Salary based on Years of Experience:
    YearsExperience  Predicted Salary
20              6.8      89976.560454

The predicted salary for 10.0 years of experience is: $120180.56




### **Conclusion**

# Why is the Mean Squared Error (MSE) Larger for the Bigger Dataset?

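The **Mean Squared Error (MSE)** is much larger for the bigger dataset mainly because of scale: the small dataset records salaries in thousands (35 to 65), while the bigger one records them in dollars (39,343 to 122,391), and MSE is expressed in squared units of the target. The complexity of the data and how well the linear regression model fits it can also contribute. Beyond scale, let’s break down the other possible reasons: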

## 1. Larger Dataset = More Variability:
A larger dataset like the second one includes more data points and potentially more variability in the relationship between `YearsExperience` and `Salary`. This variability might introduce more noise or data points that deviate from a perfect linear relationship, causing larger prediction errors.

The smaller dataset is simple and may follow a more linear pattern, which means a linear regression model can fit it quite easily, resulting in a lower MSE.
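
One scale-independent way to check how much scatter each dataset has around a straight line is the coefficient of determination, R². The sketch below refits a line to each full dataset (not the train/test splits used in the cells above) and reports R². The helper name `r2_on_full_data` is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Small dataset (salary in thousands)
small_x = np.array([1, 2, 3, 4, 5], dtype=float)
small_y = np.array([35, 40, 50, 60, 65], dtype=float)

# Larger dataset (salary in dollars)
large_x = np.array([
    1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7,
    3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6.0,
    6.8, 7.1, 7.9, 8.2, 8.7, 9.0, 9.5, 9.6, 10.3, 10.5,
])
large_y = np.array([
    39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189,
    63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088, 81363, 93940,
    91738, 98273, 101302, 113812, 109431, 105582, 116969, 112635, 122391, 121872,
], dtype=float)


def r2_on_full_data(x, y):
    """Fit a straight line to all points and return R^2 on those same points."""
    X = x.reshape(-1, 1)
    model = LinearRegression().fit(X, y)
    return r2_score(y, model.predict(X))


small_r2 = r2_on_full_data(small_x, small_y)
large_r2 = r2_on_full_data(large_x, large_y)
print(f"R^2 (small dataset): {small_r2:.4f}")
print(f"R^2 (larger dataset): {large_r2:.4f}")
```

Because R² is unitless, it is not affected by whether salary is recorded in thousands or in dollars, which makes it a fairer basis than raw MSE for comparing how linear the two datasets are.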

## 2. Outliers and Non-linearity:
In the larger dataset, there might be some **outliers** or a **non-linear relationship** between experience and salary, especially for higher years of experience. For example, the salaries might not increase proportionally with years of experience. These outliers or deviations from a straight line will increase the MSE because the linear regression model tries to fit a straight line to data that might not follow a perfect linear pattern.

The smaller dataset has fewer data points and likely no outliers or significant deviations, so the model can better fit those points.

## 3. Underfitting for Complex Data:
With the larger dataset, the model might be **underfitting** because linear regression assumes a simple linear relationship between the feature (years of experience) and the target (salary). If the relationship between experience and salary in the larger dataset is more complex, linear regression won’t capture it effectively, leading to higher errors.

For instance, salaries in real-world scenarios may not always increase linearly with years of experience; they might plateau or even dip at certain points.

## 4. Training/Test Split and Size:
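In both runs the test set contains a single point: `test_size=0.2` on 5 rows holds out one row, and `test_size=0.0000001` on 30 rows is rounded up to one row. An MSE computed from one point is just that point's squared error, so it can come out artificially low or high depending on which row happens to be drawn. A larger test set would not make errors accumulate, because MSE is an average over the test points, but it would give a more stable estimate.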
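
For reference, the standard definition of MSE over $n$ test points, with $y_i$ the actual value and $\hat{y}_i$ the prediction, is given below, followed by the arithmetic for the two single-point test sets printed in the outputs above.

```latex
\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
\]

% Small dataset: one test point, actual 40, predicted 42.857143 (thousands)
\[
n = 1:\quad (40 - 42.857143)^2 \approx 8.163
\]

% Larger dataset: one test point, actual 91738, predicted 89976.560454 (dollars)
\[
n = 1:\quad (91738 - 89976.560454)^2 \approx 3{,}102{,}669
\]
```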

---

## Example Summary:

### Small Dataset:
- Only 5 data points, simple pattern.
- Model fits well to the small number of points.
- **MSE**: 8.16 (very small)

### Larger Dataset:
- More data points (30), more variability.
- Potential non-linear relationships or outliers.
- **MSE**: 3,102,669.28 (larger)
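
The two printed MSE values are in different units: the small dataset stores salary in thousands, the larger one in dollars. A rough way to compare them, sketched below under that assumption, is to take the square root (RMSE) and express both in dollars. The MSE values are copied from the printed outputs; the variable names are illustrative.

```python
import math

# MSE values copied from the two runs above
mse_small_k2 = 8.163265306122472      # units: (thousand dollars)^2
mse_large_usd2 = 3102669.275094161    # units: dollars^2

# RMSE has the same units as the target, so it can be rescaled directly
rmse_small_usd = math.sqrt(mse_small_k2) * 1000   # thousands -> dollars
rmse_large_usd = math.sqrt(mse_large_usd2)        # already in dollars

print(f"RMSE, small dataset:  ${rmse_small_usd:,.2f}")   # $2,857.14
print(f"RMSE, larger dataset: ${rmse_large_usd:,.2f}")   # $1,761.44
```

On a common dollar scale the single-point errors are of the same order in both runs, which suggests that most of the gap between the two printed MSE values comes from units rather than from model quality.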

---

## How to Reduce MSE in Larger Datasets:

### 1. Use More Advanced Models:
Consider using more flexible models that can capture non-linear relationships, such as **decision trees** or **random forests**.
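
As a sketch of how such a comparison might be run, the block below scores a plain linear model against a random forest on the larger dataset using 5-fold cross-validation. The fold count, `n_estimators`, and the seeds are arbitrary assumptions, not tuned choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X = np.array([
    1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7,
    3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6.0,
    6.8, 7.1, 7.9, 8.2, 8.7, 9.0, 9.5, 9.6, 10.3, 10.5,
]).reshape(-1, 1)
y = np.array([
    39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189,
    63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088, 81363, 93940,
    91738, 98273, 101302, 113812, 109431, 105582, 116969, 112635, 122391, 121872,
], dtype=float)

# Shuffle before splitting because the rows are sorted by experience
cv = KFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

cv_mse = {}
for name, estimator in candidates.items():
    # scikit-learn reports negated MSE so that a higher score is always better
    scores = cross_val_score(estimator, X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()
    print(f"{name}: mean CV MSE = {cv_mse[name]:,.0f}")
```

A more flexible model is not guaranteed to win on a dataset this small, so the cross-validated score, rather than model complexity, should decide which one to keep.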

### 2. Feature Engineering:
Introduce more features that could explain salary variability better, such as **job role**, **location**, or **education**. This might reduce unexplained variability and improve model performance.

### 3. Outlier Detection:
Analyze the data for **outliers** that might be distorting the model's fit. Removing or handling outliers can improve performance.
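
One simple way to look for such points, sketched below, is to fit the line to the larger dataset and flag rows whose standardized residual is large. The cutoff of 2 is an assumed rule of thumb, and a flagged row is only a candidate for inspection, not proof of a bad data point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([
    1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7,
    3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6.0,
    6.8, 7.1, 7.9, 8.2, 8.7, 9.0, 9.5, 9.6, 10.3, 10.5,
])
y = np.array([
    39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189,
    63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088, 81363, 93940,
    91738, 98273, 101302, 113812, 109431, 105582, 116969, 112635, 122391, 121872,
], dtype=float)

X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Standardize the residuals; ddof=2 accounts for the fitted slope and intercept
z = residuals / residuals.std(ddof=2)

threshold = 2.0  # assumed cutoff; 2 and 3 are both common choices
flagged = np.flatnonzero(np.abs(z) > threshold)

for i in flagged:
    print(f"row {i}: experience={x[i]}, salary={y[i]:.0f}, z={z[i]:+.2f}")
print(f"{len(flagged)} of {len(y)} points exceed |z| > {threshold}")
```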

### 4. Polynomial Regression:
If the relationship between experience and salary is non-linear, a **polynomial regression** (degree 2 or 3) can help the model better fit the data.
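
A minimal sketch of a degree-2 fit on the larger dataset is shown below, using scikit-learn's `PolynomialFeatures` in a pipeline. The degree and the choice to score on the training data are assumptions made for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.array([
    1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7,
    3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6.0,
    6.8, 7.1, 7.9, 8.2, 8.7, 9.0, 9.5, 9.6, 10.3, 10.5,
]).reshape(-1, 1)
y = np.array([
    39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189,
    63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088, 81363, 93940,
    91738, 98273, 101302, 113812, 109431, 105582, 116969, 112635, 122391, 121872,
], dtype=float)

# Baseline: straight line
linear = LinearRegression().fit(X, y)

# Degree-2 polynomial: expand x into [x, x^2], then fit ordinary least squares
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X, y)

mse_linear = mean_squared_error(y, linear.predict(X))
mse_quadratic = mean_squared_error(y, quadratic.predict(X))
print(f"Training MSE, linear:    {mse_linear:,.0f}")
print(f"Training MSE, quadratic: {mse_quadratic:,.0f}")
```

Training MSE can only stay the same or decrease as the degree increases, so a lower number here does not by itself show that the polynomial generalizes better. Scoring on held-out data or with cross-validation is the safer check.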

