
### Short Coding Project: Non-linear Regression

#### Project Overview
This project consists of a few short tasks where you will apply the concepts learned in the Non-linear Regression lab. You will work with a dataset, implement non-linear regression models, and evaluate their performance.

- Replace each `# YOUR CODE HERE` comment with your own code.
- **Do not change** the variable names.



### Load the Dataset and Clean the Data
Start by loading the dataset and cleaning specific columns. The `Retype` function strips the `%`, `$`, and `,` characters from string values and converts the result to a float; non-string values are returned unchanged.
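
For comparison, the same kind of cleaning can be sketched with vectorized pandas string methods instead of a row-wise function. This is only an illustration on made-up values: the Series `raw` and its contents are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, similar in shape to formatted numeric strings
raw = pd.Series(["$1,000", "72.5%", np.nan])

# Remove %, $ and , from every string, then convert the column to numbers.
# Missing values stay missing.
cleaned = pd.to_numeric(raw.str.replace(r"[%$,]", "", regex=True))

print(cleaned)
```

The row-wise `Retype` function used in this project produces the same kind of result and is easier to read step by step, so either approach is reasonable here.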


In [None]:
import pandas as pd
import re

# Load the dataset
url = 'https://raw.githubusercontent.com/CyConProject/Lab/main/Datasets/world-data-2023.csv'
df = pd.read_csv(url)

# List of columns to be processed
columns = [
    "GDP",
    "Life expectancy",
    "Population",
]

# Function to convert and clean values in the columns
def Retype(x):
    if not isinstance(x, str):
        return x
    x = re.sub(r"%|\$|,", "", x)
    return float(x)

# Apply the Retype function to the relevant columns
for column in columns:
    df[column] = df[column].apply(Retype)

# Display the first few rows after data conversion
df.head()


### Question 1: Create a New Column for GDP Per Capita
Create a new column called `GDP Per Capita` by dividing the `GDP` by `Population`.


In [None]:
# Create a new column for GDP Per Capita
df['GDP Per Capita'] = # YOUR CODE HERE

# Display the first few rows with the new column
df[['GDP', 'Population', 'GDP Per Capita']].head()

### Visualize the Data
The scatter plot below shows the relationship between `GDP Per Capita` and `Life expectancy`.


In [None]:
import matplotlib.pyplot as plt

# Scatter plot of GDP Per Capita vs. Life Expectancy
plt.scatter(df['GDP Per Capita'], df['Life expectancy'])
plt.title('Scatter Plot of GDP Per Capita vs. Life Expectancy')
plt.xlabel('GDP Per Capita')
plt.ylabel('Life Expectancy')
plt.show()

### Question 2: Fit a Power Function Model
Fit a non-linear regression model to the data using the `GDP Per Capita` and `Life expectancy` columns. Use the NumPy library and write a power function of the form:
$$ y = a \cdot x^b $$
Then use `curve_fit` to fit the model to the `GDP Per Capita` and `Life expectancy` data.
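
As a rough illustration of how `curve_fit` works, the sketch below fits a power function to synthetic data, not to the project dataset. The names `power_law`, `x_demo`, and `y_demo`, the true parameters (3 and 0.5), and the noise level are all assumptions made for this example.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b):
    """Hypothetical example model: y = a * x**b."""
    return a * np.power(x, b)

# Synthetic data generated from y = 3 * x**0.5 with a little seeded noise
rng = np.random.default_rng(0)
x_demo = np.linspace(1.0, 50.0, 40)
y_demo = 3.0 * np.sqrt(x_demo) + rng.normal(scale=0.05, size=x_demo.size)

# curve_fit returns the best-fit parameters and their covariance matrix
popt_demo, pcov_demo = curve_fit(power_law, x_demo, y_demo)
a_hat, b_hat = popt_demo

# Predictions are obtained by calling the model with the fitted parameters
y_fit = power_law(x_demo, *popt_demo)

print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

Unpacking with `*popt_demo` passes the fitted parameters to the model in the same order as they appear in the function signature after `x`.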

In [None]:
from scipy.optimize import curve_fit
import numpy as np

# Define a non-linear function (e.g., power function)
def non_linear_func(x, a, b):
    return # YOUR CODE HERE

# Sort the DataFrame by the 'GDP Per Capita' column
df.sort_values(by='GDP Per Capita', inplace=True)

# Drop rows with NaN values in 'GDP Per Capita' and 'Life expectancy'
df = df.dropna(subset=['GDP Per Capita', 'Life expectancy'])

# Fit the non-linear model
popt, pcov = # YOUR CODE HERE

# Display the parameters
print(f"Optimal parameters: {popt}")



### Question 3: Make Predictions and Plot the Results
Use the fitted model to make predictions, then plot the fitted curve over the original data.

In [None]:

# Make predictions using the fitted model
predictions = # YOUR CODE HERE

# Plot the original data and the fitted curve
plt.scatter(df['GDP Per Capita'], df['Life expectancy'], color='blue', label='Actual')
plt.plot(df['GDP Per Capita'], predictions, color='red', label='Fitted Curve')
plt.title('Non-linear Regression Fit: GDP Per Capita vs. Life Expectancy')
plt.xlabel('GDP Per Capita')
plt.ylabel('Life Expectancy')
plt.legend()
plt.show()



### Question 4: Evaluate the Model
Calculate the Mean Squared Error (MSE) and R-squared values to evaluate the model's performance.
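
To make the two metrics concrete, the sketch below computes them by hand with NumPy on a tiny made-up example. The arrays `y_true_demo` and `y_pred_demo` are hypothetical. The formulas follow the standard definitions, MSE as the mean of squared residuals and R-squared as one minus the ratio of residual to total sum of squares, which are the same definitions used by `mean_squared_error` and `r2_score`.

```python
import numpy as np

# Hypothetical observed values and model predictions
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5])

residuals = y_true_demo - y_pred_demo

# Mean Squared Error: average of the squared residuals
mse_demo = np.mean(residuals ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot

print(f"MSE: {mse_demo:.4f}")
print(f"R-squared: {r2_demo:.4f}")
```

A lower MSE and an R-squared closer to 1 both indicate that the predictions track the observed values more closely.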


In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Calculate evaluation metrics
mse = # YOUR CODE HERE
r2 = # YOUR CODE HERE

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

### Question 5: Advanced Non-linear Regression Model

In this question, you will implement a more advanced non-linear regression model of the relationship between `GDP Per Capita` and `Life expectancy`. The model to be used is:

$$ y = \log(ax^3 + bx^2 + cx + d + \epsilon) $$

This model combines polynomial terms and a logarithmic transformation. Follow the steps below:

1. **Define the advanced non-linear function**: Define a function to represent the advanced non-linear model.
2. **Fit the model to the data**: Use the `curve_fit` function to fit the advanced model to the data, providing initial guesses for the parameters. You also need to set `maxfev=20000` so that the optimizer is allowed enough function evaluations to converge.

    **Hint**: If you want to learn more about initial guesses, you can refer to the SciPy documentation on `curve_fit` in this [link](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html).
    
3. **Make predictions**: Use the fitted model to make predictions for `Life expectancy` based on `GDP Per Capita`.
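
The role of the initial guess (`p0`) and of `maxfev` in the steps above can be sketched on a deliberately simpler model. Everything in this example is an assumption for illustration: the two-parameter model `log_linear`, the synthetic data, and the true parameter values (2 and 3) are not part of the project.

```python
import numpy as np
from scipy.optimize import curve_fit

EPSILON = 1e-6  # guards against log(0) only; a negative argument still gives NaN

def log_linear(x, c, d):
    """Hypothetical simplified model: y = log(c*x + d + EPSILON)."""
    return np.log(c * x + d + EPSILON)

# Noise-free synthetic data generated from y = log(2x + 3)
x_demo = np.linspace(1.0, 10.0, 50)
y_demo = np.log(2.0 * x_demo + 3.0)

# p0 sets the starting point of the optimizer; maxfev caps the number of
# function evaluations it may use before giving up
popt_demo, pcov_demo = curve_fit(
    log_linear, x_demo, y_demo, p0=[1.0, 1.0], maxfev=20000
)

# Predictions from the fitted parameters
y_fit = log_linear(x_demo, *popt_demo)

print(f"Fitted parameters: {popt_demo}")
```

The starting point matters for models like this one because the argument of the logarithm must stay positive for every `x`; a guess that makes it negative produces NaN values and the fit can fail.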


In [None]:
# Step 1: Define the advanced non-linear function with a small constant inside the log
def advanced_non_linear_func(x, a, b, c, d):
    epsilon = 1e-6  # Small positive constant that guards against log(0); it does not protect against a negative polynomial value
    return # YOUR CODE HERE

# Step 2: Fit the advanced non-linear model with initial guesses and increased maxfev
initial_guess = [1, 1, 1, 1]  # Initial guesses for a, b, c, d
popt_adv, pcov_adv = # YOUR CODE HERE

# Step 3: Make predictions using the fitted model
predictions_adv = # YOUR CODE HERE

# Step 4: Plot the original data and the fitted curve
plt.scatter(df['GDP Per Capita'], df['Life expectancy'], color='blue', label='Actual')
plt.plot(df['GDP Per Capita'], predictions_adv, color='purple', label='Advanced Non-linear Fit')
plt.title('Advanced Non-linear Model: GDP Per Capita vs. Life Expectancy')
plt.xlabel('GDP Per Capita')
plt.ylabel('Life Expectancy')
plt.legend()
plt.show()

# Step 5: Calculate evaluation metrics
mse_adv = mean_squared_error(df['Life expectancy'], predictions_adv)
r2_adv = r2_score(df['Life expectancy'], predictions_adv)

print(f'Advanced Non-linear Model - Mean Squared Error: {mse_adv}')
print(f'Advanced Non-linear Model - R-squared: {r2_adv}')

# Step 6: Compare with the power function model results
print(f'Power Function Model - Mean Squared Error: {mse}')
print(f'Power Function Model - R-squared: {r2}')

The results show that a more complex model does not necessarily perform better. The power function model, despite being simpler, achieves a slightly lower Mean Squared Error (17.18 vs. 17.55) and a slightly higher R-squared value (0.692 vs. 0.685) than the advanced model.