<a href="https://colab.research.google.com/github/Anissa7/Math-for-machine-learning-Summative-/blob/main/Math_for_machine_learning_summative_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This section summarizes the steps and results for **Task 1**.

---

### Task 1: Linear Regression and Model Comparison

#### **1. Dataset Preprocessing**
- Loaded the dataset `health_indicators_bfa.csv` relevant to public health forecasting in Burkina Faso.
- Preprocessed the data:
  - Converted the `YEAR (DISPLAY)` column to numeric format, coercing non-numeric entries to `NaN`.
  - Extracted the first numeric value from each entry of the `Value` column (target variable).
  - Dropped rows with missing values in `YEAR (DISPLAY)`, `Numeric`, or `Value`.
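
The extraction step can be illustrated on a few made-up strings. This is a minimal sketch using the standard `re` module with the same pattern the notebook passes to `str.extract`; the sample strings and the helper name `first_number` are assumptions for illustration, not values from the dataset.

```python
import re

# Same pattern the notebook passes to pandas' str.extract
PATTERN = re.compile(r'(\d+\.?\d*)')

def first_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = PATTERN.search(text)
    return float(match.group(1)) if match else None

# Hypothetical sample strings (not taken from the dataset)
samples = ["12.5 [10.1-14.9]", "No data", "1 234", "-3.2"]
for s in samples:
    print(repr(s), "->", first_number(s))
```

Under this pattern, a value written with a space as a thousands separator is cut at the first group of digits, and a leading minus sign is dropped, so it is worth checking the raw `Value` format before relying on the result.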

#### **2. Models Implemented**
- Built and trained three regression models using scikit-learn:
  1. **Linear Regression**
  2. **Decision Tree Regressor**
  3. **Random Forest Regressor**

#### **3. Model Comparison**
- Evaluated the models using the **Mean Squared Error (MSE)** on the held-out test data (20% of the rows, `random_state=42`):
  - **Linear Regression**: `MSE ≈ 4,915,692,843.56`
  - **Decision Tree Regressor**: `MSE ≈ 5,063,250,275.51`
  - **Random Forest Regressor**: `MSE ≈ 4,240,337,245.11`
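
For reference, MSE is the average of the squared differences between true and predicted values. A minimal pure-Python sketch follows; the function name `mse` and the toy numbers are illustrative assumptions, and the notebook itself uses `sklearn.metrics.mean_squared_error`.

```python
def mse(y_true, y_pred):
    """Mean of squared differences between paired true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy values: the errors are 1 and -2, so MSE = (1 + 4) / 2
print(mse([3.0, 5.0], [2.0, 7.0]))  # 2.5
```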

#### **4. Best Model Selection**
- The model with the lowest test-set MSE was identified as the **best-performing model**:
  - **Best Model**: Random Forest Regressor
  - **MSE**: ≈ 4,240,337,245.11
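
Because MSE is expressed in squared units of the target, its square root (RMSE) is often easier to read. The sketch below takes the test-set MSE values printed by the notebook and converts them; reporting RMSE is an added illustration, not part of the original workflow.

```python
import math

# Test-set MSE values as printed by the notebook
models_mse = {
    "Linear Regression": 4915692843.558872,
    "Decision Tree": 5063250275.507248,
    "Random Forest": 4240337245.1114206,
}

# RMSE is in the same units as the target variable
models_rmse = {name: math.sqrt(value) for name, value in models_mse.items()}

best_name = min(models_rmse, key=models_rmse.get)
for name, rmse in models_rmse.items():
    print(f"{name}: RMSE = {rmse:,.0f}")
print("Lowest RMSE:", best_name)
```

The ranking is unchanged, since the square root is monotonic; the Random Forest's RMSE comes out at about 65,000 in the units of `Value`.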

#### **5. Saved Model**
- Saved the best-performing model to a file named `best_model.pkl` for reuse in predictions.

#### **6. Prediction Script**
- Loaded the saved model with `joblib.load` and ran a prediction on a one-row input. For example, the input `{'YEAR (DISPLAY)': 2023, 'Numeric': 50}` yields a predicted indicator value of `50.0`.
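
One way to make the prediction step less error-prone is to validate the input before it reaches the model. The sketch below is a hypothetical helper, not code from the notebook: `build_row` is an assumed name, and it only checks that both training features are present and returns them in training order.

```python
# Feature order used when the models were trained
FEATURE_NAMES = ["YEAR (DISPLAY)", "Numeric"]

def build_row(sample):
    """Return a single-row 2D list with values in training feature order.

    Raises KeyError if a required feature is missing from sample.
    """
    missing = [name for name in FEATURE_NAMES if name not in sample]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [[float(sample[name]) for name in FEATURE_NAMES]]

row = build_row({"YEAR (DISPLAY)": 2023, "Numeric": 50})
print(row)  # [[2023.0, 50.0]]
```

The result can be wrapped as `pd.DataFrame(row, columns=FEATURE_NAMES)` and passed to the loaded model's `predict`, which keeps the column names consistent with training.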

---

### Output Example
Running the training and prediction cells produces:

```plaintext
Mean Squared Errors: {'Linear Regression': 4915692843.558872, 'Decision Tree': 5063250275.507248, 'Random Forest': 4240337245.1114206}
Best Model: Random Forest
Prediction: [50.]
```

---

In [1]:
pip install pandas numpy matplotlib seaborn scikit-learn




In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import joblib

# Load dataset
data = pd.read_csv('health_indicators_bfa.csv')

# Preprocessing
# Convert 'YEAR (DISPLAY)' to numeric; errors='coerce' replaces non-numeric values with NaN
data['YEAR (DISPLAY)'] = pd.to_numeric(data['YEAR (DISPLAY)'], errors='coerce')

# Extract the first number from each 'Value' string, i.e. the part before any
# spaces or brackets. This assumes 'Value' holds strings without thousands
# separators or negative values; adjust the regex if the data differs.
data['Value'] = data['Value'].str.extract(r'(\d+\.?\d*)').astype(float)

# Drop rows with missing values in any feature or in the target
data_cleaned = data.dropna(subset=['YEAR (DISPLAY)', 'Numeric', 'Value'])

# Select features and target from the cleaned data
X = data_cleaned[['YEAR (DISPLAY)', 'Numeric']]
y = data_cleaned['Value']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Linear Regression Model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred_lr)

# Train Decision Tree Regressor
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)
mse_dt = mean_squared_error(y_test, y_pred_dt)

# Train Random Forest Regressor
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Determine the best-performing model
models_mse = {
    "Linear Regression": mse_lr,
    "Decision Tree": mse_dt,
    "Random Forest": mse_rf
}
best_model_name = min(models_mse, key=models_mse.get)
best_model = {
    "Linear Regression": lr_model,
    "Decision Tree": dt_model,
    "Random Forest": rf_model
}[best_model_name]

# Save the best-performing model
joblib.dump(best_model, "best_model.pkl")

# Print results
print(f"Mean Squared Errors: {models_mse}")
print(f"Best Model: {best_model_name}")

Mean Squared Errors: {'Linear Regression': 4915692843.558872, 'Decision Tree': 5063250275.507248, 'Random Forest': 4240337245.1114206}
Best Model: Random Forest


In [16]:
import pandas as pd
import joblib

# Load the best model
loaded_model = joblib.load("best_model.pkl")

# Feature names must match the columns used during training
feature_names = ['YEAR (DISPLAY)', 'Numeric']

# Test the model with sample data
sample_input = pd.DataFrame([[2023, 50]], columns=feature_names)
prediction = loaded_model.predict(sample_input)
print(f"Prediction: {prediction}")


Prediction: [50.]
