# Lab 8: Implement Your Machine Learning Project Plan

In this lab assignment, you will implement the machine learning project plan you created in the written assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Perform exploratory data analysis on your data to determine which feature engineering and data preparation techniques you will use.
3. Prepare your data for your model and create features and a label.
4. Fit your model to the training data and evaluate your model.
5. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.

### Import Packages

Before you get started, import a few packages.

In [32]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need for this task.

In [33]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

## Part 1: Load the Data Set


You have chosen to work with one of four data sets. The data sets are located in a folder named "data." The file names of the four data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultData.csv`
* The Airbnb NYC "listings" data set is located in file `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`
* The book review data set is located in file `bookReviewsData.csv`



<b>Task:</b> In the code cell below, load your data with `pd.read_csv()`, the same method you have been using, and save it to DataFrame `df`.

In [34]:
# YOUR CODE HERE
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
df = pd.read_csv(WHRDataSet_filename)


## Part 2: Exploratory Data Analysis

The next step is to inspect and analyze your data set with your machine learning problem and project plan in mind. 

This step will help you determine data preparation and feature engineering techniques you will need to apply to your data to build a balanced modeling data set for your problem and model. These data preparation techniques may include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI
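
As a rough illustration of three of the techniques listed above (mean imputation, winsorization, and one-hot encoding), here is a minimal sketch on a made-up DataFrame. The column names `income` and `region` and the 5th/95th percentile cutoffs are assumptions chosen for the example; they do not come from any of the course data sets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "income": [30.0, 35.0, np.nan, 40.0, 1000.0],
    "region": ["north", "south", "south", "east", "north"],
})

# 1. Address missingness: replace missing values with the column mean.
toy["income"] = toy["income"].fillna(toy["income"].mean())

# 2. Winsorize: clip extreme values to the 5th and 95th percentiles
#    (the cutoffs are an assumption; pick ones that suit your data).
lower = toy["income"].quantile(0.05)
upper = toy["income"].quantile(0.95)
toy["income_wins"] = toy["income"].clip(lower=lower, upper=upper)

# 3. One-hot encode the categorical column.
encoded = pd.get_dummies(toy, columns=["region"])

print(encoded.head())
```

Whether any of these steps is appropriate depends on what your exploratory analysis shows; the sketch only demonstrates the mechanics.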


Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
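
The inspection steps described above might look like the following minimal sketch. It uses a small made-up DataFrame; the columns `score` and `label` are hypothetical stand-ins for whichever feature and label you chose.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for your own DataFrame `df`.
toy = pd.DataFrame({
    "score": [4.2, 5.1, np.nan, 6.8, 7.3, 5.5],
    "label": ["low", "low", "low", "high", "high", "low"],
})

# Key statistics for each numeric column.
summary = toy.describe()

# Data type of each column.
column_types = toy.dtypes

# Number of missing values per column.
missing_counts = toy.isnull().sum()

# A simple data filter: rows with a score above 6.
high_scores = toy[toy["score"] > 6]

# Class balance check for a classification label (shares sum to 1).
class_shares = toy["label"].value_counts(normalize=True)

print(summary)
print(column_types)
print(missing_counts)
print(class_shares)
```

A label whose classes have very unequal shares is a sign of class imbalance that you may want to address before modeling.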


<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. 

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [35]:
#Your Code Here

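# Data Preprocessing
# Fill missing numeric values with the column mean (for simplicity).
# numeric_only=True skips text columns such as 'country', which pandas 2.x cannot average.
df.fillna(df.mean(numeric_only=True), inplace=True)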

# Prepare data for the prediction tasks
X = df.drop(columns=['country', 'year', 'Life Ladder', 'GINI index (World Bank estimate)'])
y_happiness_score = df['Life Ladder']
y_happiness_rank = df['Life Ladder'].rank(ascending=False)
y_happiness_level = pd.cut(df['Life Ladder'], bins=[-np.inf, 4, 5, 6, 7, np.inf], labels=["Very Unhappy", "Unhappy", "Neutral", "Happy", "Very Happy"])
y_social_support = df['Social support']
y_healthy_life_expectancy = df['Healthy life expectancy at birth']
y_perceptions_of_corruption = df['Perceptions of corruption']

# Split data into training and testing sets consistently for all tasks
(
    X_train, X_test,
    y_train_happiness_score, y_test_happiness_score,
    y_train_happiness_rank, y_test_happiness_rank,
    y_train_happiness_level, y_test_happiness_level,
    y_train_social_support, y_test_social_support,
    y_train_healthy_life_expectancy, y_test_healthy_life_expectancy,
    y_train_perceptions_of_corruption, y_test_perceptions_of_corruption,
) = train_test_split(
    X, y_happiness_score, y_happiness_rank, y_happiness_level,
    y_social_support, y_healthy_life_expectancy, y_perceptions_of_corruption,
    test_size=0.2, random_state=42,
)

# Scale the data using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
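
The cell above fills missing values using means computed over the whole data set before splitting. One alternative, sketched here under stated assumptions, is to put imputation and scaling inside a scikit-learn pipeline so that both are re-fit on the training folds only during cross-validation. The synthetic arrays `X_demo` and `y_demo` are made up for the example, and `SimpleImputer` and `make_pipeline` are scikit-learn tools that the notebook itself does not use.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Knock out roughly 10% of the feature values to simulate missingness.
mask = rng.random(X_demo.shape) < 0.1
X_demo[mask] = np.nan

# Imputer and scaler are fit on each training fold, then applied to the held-out fold.
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LinearRegression(),
)

scores = cross_val_score(pipe, X_demo, y_demo, scoring="neg_mean_squared_error", cv=5)
rmse = np.sqrt(-scores.mean())
print("Cross-validated RMSE:", rmse)
```

The design choice here is that no statistic from a held-out fold ever influences the preprocessing, which keeps the cross-validation estimate honest.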

## Part 3: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. You will:

1. Prepare your data for your model and create features and a label.
2. Fit your model to the training data and evaluate your model.
3. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.


Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.
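
For the model selection step, one possible approach is a cross-validated grid search over hyperparameters. The sketch below assumes scikit-learn's `GridSearchCV`, synthetic data, and an arbitrary small parameter grid; none of these choices come from the project plan, so treat them as placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=150)

# Hypothetical grid; real projects usually search a wider range.
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negative MSE of the best setting, so negate before the square root.
best_rmse = np.sqrt(-search.best_score_)
print("Best parameters:", search.best_params_)
print("Best cross-validated RMSE:", best_rmse)
```

After the search, `search.best_estimator_` holds a model refit on all of the data passed to `fit`, which can then be evaluated once on a held-out test set.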

In [36]:
# YOUR CODE HERE

# Linear Regression for predicting Happiness Score
lr_model = LinearRegression()
lr_scores = cross_val_score(lr_model, X_train_scaled, y_train_happiness_score, scoring='neg_mean_squared_error', cv=5)
happiness_score_rmse = np.sqrt(-lr_scores.mean())

# Random Forest Regression for predicting Happiness Score
rf_model = RandomForestRegressor()
rf_scores = cross_val_score(rf_model, X_train_scaled, y_train_happiness_score, scoring='neg_mean_squared_error', cv=5)
happiness_score_rf_rmse = np.sqrt(-rf_scores.mean())

# Random Forest Regression for predicting Happiness Rank
rf_rank_model = RandomForestRegressor()
rf_rank_scores = cross_val_score(rf_rank_model, X_train, y_train_happiness_rank, scoring='neg_mean_squared_error', cv=5)
happiness_rank_rmse = np.sqrt(-rf_rank_scores.mean())

# Logistic Regression for predicting Happiness Level (Classification)
logreg_model = LogisticRegression(max_iter=1000, solver='saga')  # max_iter raised to 1000 and solver set to 'saga' to help convergence; adjust if needed
logreg_scores = cross_val_score(logreg_model, X_train_scaled, y_train_happiness_level, scoring='accuracy', cv=5)
happiness_level_accuracy = logreg_scores.mean()

# Random Forest Classifier for predicting Happiness Level (Classification)
rf_classifier = RandomForestClassifier()
rf_classifier_scores = cross_val_score(rf_classifier, X_train, y_train_happiness_level, scoring='accuracy', cv=5)
happiness_level_rf_accuracy = rf_classifier_scores.mean()

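# Regression for predicting Social Support
# Caution: X still contains 'Social support', 'Healthy life expectancy at birth',
# and 'Perceptions of corruption', so the three regressions below can read their
# own targets from the features. A near-zero RMSE here indicates target leakage,
# not predictive skill.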
social_support_model = LinearRegression()
social_support_scores = cross_val_score(social_support_model, X_train_scaled, y_train_social_support, scoring='neg_mean_squared_error', cv=5)
social_support_rmse = np.sqrt(-social_support_scores.mean())

# Regression for predicting Healthy Life Expectancy
healthy_life_model = LinearRegression()
healthy_life_scores = cross_val_score(healthy_life_model, X_train_scaled, y_train_healthy_life_expectancy, scoring='neg_mean_squared_error', cv=5)
healthy_life_rmse = np.sqrt(-healthy_life_scores.mean())

# Regression for predicting Perceptions of Corruption
corruption_model = LinearRegression()
corruption_scores = cross_val_score(corruption_model, X_train_scaled, y_train_perceptions_of_corruption, scoring='neg_mean_squared_error', cv=5)
corruption_rmse = np.sqrt(-corruption_scores.mean())

# Print evaluation results
print("Happiness Score RMSE (Linear Regression):", happiness_score_rmse)
print("Happiness Score RMSE (Random Forest):", happiness_score_rf_rmse)
print("Happiness Rank RMSE (Random Forest):", happiness_rank_rmse)
print("Happiness Level Accuracy (Logistic Regression):", happiness_level_accuracy)
print("Happiness Level Accuracy (Random Forest):", happiness_level_rf_accuracy)
print("Social Support RMSE (Linear Regression):", social_support_rmse)
print("Healthy Life Expectancy RMSE (Linear Regression):", healthy_life_rmse)
print("Perceptions of Corruption RMSE (Linear Regression):", corruption_rmse)




Happiness Score RMSE (Linear Regression): 0.3027692789139585
Happiness Score RMSE (Random Forest): 0.2081896063466236
Happiness Rank RMSE (Random Forest): 83.26532783542936
Happiness Level Accuracy (Logistic Regression): 0.8342425702811245
Happiness Level Accuracy (Random Forest): 0.7758265060240964
Social Support RMSE (Linear Regression): 1.3597283057474732e-16
Healthy Life Expectancy RMSE (Linear Regression): 9.608689386223203e-15
Perceptions of Corruption RMSE (Linear Regression): 2.335893102061827e-16
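
The RMSE values on the order of 1e-16 printed above for Social Support, Healthy Life Expectancy, and Perceptions of Corruption are suspiciously small. One likely explanation, given that `X` is built by dropping only four columns, is that each of those targets is still among the features. The sketch below reproduces the effect on synthetic data; the column names `a`, `b`, and `target` and the helper `cv_rmse` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Three independent random columns; "target" cannot truly be predicted from "a" and "b".
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "target"])


def cv_rmse(features, target):
    """Cross-validated RMSE of a linear regression (hypothetical helper)."""
    scores = cross_val_score(
        LinearRegression(), features, target,
        scoring="neg_mean_squared_error", cv=5,
    )
    return np.sqrt(-scores.mean())


# Leaky setup: the target column is also one of the features.
leaky_rmse = cv_rmse(demo[["a", "b", "target"]], demo["target"])

# Clean setup: the target column is excluded from the features.
clean_rmse = cv_rmse(demo[["a", "b"]], demo["target"])

print("RMSE with target among features:", leaky_rmse)
print("RMSE with target excluded:", clean_rmse)
```

In the leaky setup the model simply copies the target column, so the error collapses to floating-point noise. Dropping the target column from the features before fitting each of those three models would give a meaningful estimate.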
