# Loading Libraries

This section imports the libraries used throughout the notebook: pandas for data manipulation, matplotlib and seaborn for data visualization, scikit-learn (sklearn) for machine learning tasks, and numpy for numerical operations.

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Loading First Dataset

In [None]:
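# Change the working directory to the project folder (machine-specific path; adjust it for your environment)
os.chdir(r'C:\Users\guilh\OneDrive\Área de Trabalho\FinalProject\FinalProject')
# Load the dataset from a CSV file named 'calories.csv' into a pandas DataFrame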
data = pd.read_csv('calories.csv')
data.head() # Display the first few rows of the loaded dataset

# Loading Second Dataset

In [None]:
#Loading the second dataset from a CSV file named 'exercise.csv' into a pandas DataFrame
exercise_data = pd.read_csv('exercise.csv')
exercise_data.head()

# Merging the Datasets

In [None]:
#Merging the two datasets ('data' and 'exercise_data') on the 'User_ID' column
merged_data = pd.merge(data, exercise_data, on='User_ID')
merged_data.head()
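
The merge above relies on the default inner join, which silently drops any User_ID that appears in only one file. The following is a minimal sketch, on made-up toy frames rather than the real CSV files, of how one might check that assumption. The frame names and values are hypothetical; validate and indicator are standard pd.merge parameters.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are invented for illustration).
calories = pd.DataFrame({"User_ID": [1, 2, 3], "Calories": [231.0, 66.0, 26.0]})
exercise = pd.DataFrame({"User_ID": [1, 2, 4], "Duration": [29.0, 14.0, 5.0]})

# Default how="inner" keeps only User_IDs present in both frames.
# validate="one_to_one" raises an error if a User_ID is duplicated on either side.
inner = pd.merge(calories, exercise, on="User_ID", validate="one_to_one")

# An outer merge with indicator=True shows which rows failed to find a partner.
outer = pd.merge(calories, exercise, on="User_ID", how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]

print(len(inner))
print(unmatched[["User_ID", "_merge"]])
```

If unmatched is empty on the real data, the inner join loses no rows.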

***

# Checking Missing Values

Check for missing values in the merged dataset. The isnull() function returns a boolean mask of the same shape as the original DataFrame, where True indicates a missing value and False indicates a present value.   
The sum() function then counts the number of True values in each column, which gives the number of missing values per column.

In [None]:
# Check for missing values in the merged dataset
missing_values = merged_data.isnull().sum()
missing_values
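
To make the isnull().sum() pattern concrete, here is a small sketch on a hypothetical toy DataFrame (the column values are invented, not taken from the dataset). It also shows one possible follow-up, dropna(), which the notebook itself does not use.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column.
toy = pd.DataFrame({
    "Age": [25, np.nan, 40],
    "Weight": [70.0, 82.5, np.nan],
    "Gender": ["male", "female", None],
})

# Boolean mask -> per-column count of missing values.
missing_per_column = toy.isnull().sum()

# Total number of missing cells in the whole frame.
total_missing = int(missing_per_column.sum())

# One possible follow-up: keep only rows with no missing values.
complete_rows = toy.dropna()

print(missing_per_column)
print(total_missing, len(complete_rows))
```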

***

# Histogram of the Calories Column

The histogram displays the distribution of calories burned.   
plt.figure(figsize=(10, 6)) sets the figure size to 10 inches wide and 6 inches tall.   
In sns.histplot(merged_data['Calories'], bins=30, kde=True), bins=30 divides the histogram into 30 bins, and kde=True adds a kernel density estimate (KDE), a smoothed curve that estimates the underlying distribution of the data.

In [None]:
# Create a histogram of the 'Calories' column
plt.figure(figsize=(10, 6))  # Set the figure size to 10 inches wide and 6 inches tall
sns.histplot(merged_data['Calories'], bins=30, kde=True)  # 30 bins, with a KDE curve overlaid
plt.title('Distribution of Calories Burned')
plt.xlabel('Calories')
plt.ylabel('Frequency')
plt.show()
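
As a rough illustration of what an integer bins argument means, the sketch below bins synthetic values with np.histogram into 30 equal-width intervals. The data is randomly generated for the example, and this is only an approximation of the idea, not a description of how seaborn computes its bins internally.

```python
import numpy as np

# Synthetic "calories" values, generated only for this illustration.
rng = np.random.default_rng(0)
calories = rng.uniform(1, 300, size=1000)

# 30 equal-width bins spanning the minimum to the maximum value.
counts, edges = np.histogram(calories, bins=30)
bin_widths = np.diff(edges)

print(len(counts), len(edges))   # 30 bins are delimited by 31 edges
print(counts.sum())              # every observation falls into exactly one bin
```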

***

# Scatter Plot to Explore the Relationship between Duration and Calories Burned

This code creates a scatter plot of the 'Duration' and 'Calories' columns in the merged_data DataFrame using the sns.scatterplot function from the seaborn library.   
The scatter plot displays the relationship between the duration of exercise and the number of calories burned, with the x-axis representing the duration of exercise in minutes and the y-axis representing the number of calories burned.   
The hue parameter is used to color the points based on the 'Gender' column, and the style parameter is used to change the marker style based on the 'Gender' column. The alpha parameter is used to set the transparency of the points.

In [None]:
#Creating a scatter plot to explore the relationship between Duration and Calories burned
plt.figure(figsize=(10, 6))# Set the figure size to 10 inches wide and 6 inches tall
sns.scatterplot(x='Duration', y='Calories', data=merged_data, hue='Gender', style='Gender', alpha=0.6)
plt.title('Relationship Between Exercise Duration and Calories Burned by Gender')
plt.xlabel('Duration (minutes)')
plt.ylabel('Calories Burned')
plt.legend(title='Gender')
plt.grid(True)
plt.show()

***

# Boxplot for Comparing the Heart Rate Across Different Genders During Exercise

This code creates a boxplot of the 'Heart_Rate' column in the merged_data DataFrame using the sns.boxplot function from the seaborn library.    
The boxplot displays the distribution of heart rate during exercise for each gender, with the x-axis representing the gender of the individual and the y-axis representing the heart rate in beats per minute (bpm).

In [None]:
#Creating a boxplot to compare the Heart Rate across different Genders during exercise
plt.figure(figsize=(10, 6))# Set the figure size to 10 inches wide and 6 inches tall
sns.boxplot(x='Gender', y='Heart_Rate', data=merged_data)
plt.title('Heart Rate Distribution by Gender During Exercise')
plt.xlabel('Gender')
plt.ylabel('Heart Rate (bpm)')
plt.show()
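
A boxplot summarizes each group by its quartiles. As a sketch, the same numbers can be computed directly with groupby and quantile. The toy heart-rate values below are invented for illustration, and pandas' default linear interpolation is assumed for the quantiles.

```python
import pandas as pd

# Toy data: five heart-rate readings per gender (made-up values).
toy = pd.DataFrame({
    "Gender": ["male"] * 5 + ["female"] * 5,
    "Heart_Rate": [90, 95, 100, 105, 110, 88, 92, 96, 100, 104],
})

# Quartiles per gender: the box spans Q1 to Q3, with a line at the median.
quartiles = (
    toy.groupby("Gender")["Heart_Rate"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["Q1", "median", "Q3"]

# The interquartile range (IQR) is the height of the box.
quartiles["IQR"] = quartiles["Q3"] - quartiles["Q1"]

print(quartiles)
```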

***

# Correlation Matrix/Heatmap

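This code calculates the correlation matrix for the numeric columns in the merged_data DataFrame using the corr() function. The corr() function calculates the pairwise correlation of columns (the Pearson coefficient by default), and the result is a DataFrame with the correlation coefficient for each pair of columns. The matrix is then drawn as a heatmap with sns.heatmap(), where annot=True prints each coefficient in its cell and fmt=".2f" rounds it to two decimals.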

In [None]:
# Select only numeric columns
numeric_data = merged_data.select_dtypes(include=['int64', 'float64'])

# Calculate the correlation matrix
correlation_matrix = numeric_data.corr()

# Create a heatmap to visualize the correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True)
plt.title('Correlation Matrix Heatmap')
plt.show()
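
To show what a single cell of the correlation matrix contains, the sketch below computes the Pearson coefficient by hand on toy data and compares it with corr(). The values are invented. select_dtypes(include="number") is used here as an alternative that also picks up numeric columns stored with other widths such as int32; the notebook lists 'int64' and 'float64' explicitly.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values); 'Gender' is non-numeric and gets filtered out.
toy = pd.DataFrame({
    "Duration": [5.0, 10.0, 15.0, 20.0, 25.0],
    "Calories": [20.0, 55.0, 90.0, 140.0, 180.0],
    "Gender": ["male", "female", "male", "female", "male"],
})

numeric = toy.select_dtypes(include="number")
corr = numeric.corr()  # Pearson correlation by default

# Pearson coefficient by hand: covariance divided by the product of the spreads.
x, y = toy["Duration"], toy["Calories"]
dx, dy = x - x.mean(), y - y.mean()
manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(corr)
print(manual)
```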

***

# Dropping Useless Column (User_ID) from the Dataset

This code drops the 'User_ID' column from the merged_data DataFrame using the drop() function. User_ID is only an identifier, so it carries no information that should help predict calories. The axis=1 parameter specifies that a column should be dropped (as opposed to a row, which would be axis=0).
   
The resulting DataFrame, merged_data_dropped, is then displayed using the head() function, which shows the first few rows of the DataFrame.

In [None]:
# Drop the 'User_ID' column from the merged dataset (axis=1 selects columns)
merged_data_dropped = merged_data.drop('User_ID', axis=1)
merged_data_dropped.head()

***

# Linear Regression

This code prepares the data for a linear regression model by encoding the 'Gender' column as 1 for 'male' and 0 for every other value (here, 'female'), and by separating the features (X) from the target (y, the 'Calories' column). The data is then split into training and testing sets using the train_test_split() function.   
A linear regression model is created using the LinearRegression() function and trained on the training data using the fit() function. Predictions are then made on the training and testing data using the predict() function.   
The R^2 score (coefficient of determination) is calculated for both sets using the r2_score() function. The code stores these scores in variables named train_accuracy and test_accuracy, although R^2 measures goodness of fit for regression rather than classification accuracy. A plot then compares the predicted and actual values for the training and testing data; the dashed diagonal marks perfect predictions.   
   
Explanation:   
test_size: This parameter specifies the proportion of the dataset to include in the test split. The value 0.2 means that 20% of the dataset will be used for testing, while the remaining 80% will be used for training.   

random_state: This parameter controls the randomness of the data split. Specifying a fixed value (in this case, 42) ensures that the split is consistent and reproducible every time the code is run.
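
One detail of the encoding step is worth illustrating: a rule of the form "1 if the value equals 'male', else 0" also maps misspelled or missing labels to 0. The sketch below, on a hypothetical toy column, contrasts that rule with an explicit map() dictionary, which leaves unexpected labels as NaN so they can be detected. The map() variant is an alternative for illustration, not what the notebook does.

```python
import pandas as pd

# Toy column with one differently-cased label and one missing value (made-up data).
toy = pd.DataFrame({"Gender": ["male", "female", "Male", None]})

# Rule used in the notebook: anything that is not exactly 'male' becomes 0.
lambda_encoded = toy["Gender"].apply(lambda x: 1 if x == "male" else 0)

# Alternative: an explicit mapping turns unexpected labels into NaN instead of 0.
map_encoded = toy["Gender"].map({"male": 1, "female": 0})

print(lambda_encoded.tolist())
print(int(map_encoded.isna().sum()), "unexpected label(s) detected by map()")
```

If the 'Gender' column is known to contain only 'male' and 'female', both approaches give the same encoding.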

In [None]:
# Prepare the data: encode 'Gender' (1 for 'male', 0 otherwise) and split into features and target
# Note: run this cell only once; re-running it on the already encoded column maps every value to 0
merged_data_dropped['Gender'] = merged_data_dropped['Gender'].apply(lambda x: 1 if x == 'male' else 0)
X = merged_data_dropped.drop('Calories', axis=1)
y = merged_data_dropped['Calories']
# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a linear regression model and fit it to the training data
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
# Predict on the training and testing data
y_train_pred = lr_model.predict(X_train)
y_test_pred = lr_model.predict(X_test)
# Calculate the R^2 score on the training and testing data
train_accuracy = r2_score(y_train, y_train_pred)
test_accuracy = r2_score(y_test, y_test_pred)
#Training vs Testing plot
plt.figure(figsize=(10, 5))
plt.scatter(y_train, y_train_pred, color='blue', label='Training data', alpha=0.5)
plt.scatter(y_test, y_test_pred, color='red', label='Testing data', alpha=0.5)# The alpha=0.5 parameter sets the transparency of the points
plt.title('Calories Burned: Predicted vs Actual')
plt.xlabel('Actual Calories')
plt.ylabel('Predicted Calories')
plt.legend()
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--')
plt.grid(True)
plt.show()
(train_accuracy, test_accuracy)

The training R^2 score is approximately 0.9672, and the testing R^2 score is approximately 0.9673. This means that the model explains about 97% of the variance in calories burned on both the training and testing data, and the nearly identical scores suggest that it is not overfitting.
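
The values returned by r2_score() can be reproduced by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this on four made-up predictions and also computes the root mean squared error (RMSE) with mean_squared_error, which the notebook imports. Reporting RMSE alongside R^2 is a suggestion here, not something the notebook does.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)         # squared prediction errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared deviations from the mean
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = r2_score(y_true, y_pred)

# RMSE is expressed in the same units as the target (calories).
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(r2_manual, r2_sklearn, rmse)
```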

In the line X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42), test_size=0.2 reserves 20% of the rows for the testing set and leaves the remaining 80% for training.

The random_state parameter seeds the random number generator that shuffles the data before splitting. With random_state=42, the same shuffle, and therefore the same split, is produced each time the code is run, which ensures reproducibility of the results.
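
A quick way to see both parameters in action is to split a tiny synthetic dataset twice with the same settings. The arrays below are placeholders created for the example, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten synthetic samples with two features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls with the same random_state.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(y_train_a), len(y_test_a))         # 80% / 20% of ten samples
print(np.array_equal(y_test_a, y_test_b))    # same seed, same test rows
```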

***

# Random Forest Regressor

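This code fits a RandomForestRegressor (with random_state=42 for reproducibility) on the same training split used for the linear regression and makes predictions on the training and testing data. The R^2 score is calculated for both sets using the r2_score() function, and a plot is created to compare the predicted and actual values for the training and testing data.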

In [None]:
# Create a Random Forest Regressor model and fit it to the training data
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)
# Predict on the training and testing data
y_train_pred_rf = rf_model.predict(X_train)
y_test_pred_rf = rf_model.predict(X_test)
# Calculate the R^2 score on the training and testing data
train_accuracy_rf = r2_score(y_train, y_train_pred_rf)
test_accuracy_rf = r2_score(y_test, y_test_pred_rf)
# Training vs Testing plot
plt.figure(figsize=(10, 5))
plt.scatter(y_train, y_train_pred_rf, color='blue', label='Training data', alpha=0.5)
plt.scatter(y_test, y_test_pred_rf, color='red', label='Testing data', alpha=0.5)
plt.title('Calories Burned: Predicted vs Actual with Random Forest')
plt.xlabel('Actual Calories')
plt.ylabel('Predicted Calories')
plt.legend()
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--')
plt.grid(True)
plt.show()
(train_accuracy_rf, test_accuracy_rf)

The training R^2 score is approximately 0.9997, and the testing R^2 score is approximately 0.9982. This means that the model predicts the number of calories burned very closely on both the training and testing data.
Note that the training score is very close to 1, which indicates that the model fits the training data almost perfectly. Because the testing score is only slightly lower, there is little sign of harmful overfitting on this split.
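
A single train/test split gives only one estimate of how well the model generalizes. As a sketch of one way to check this more thoroughly, the example below runs 5-fold cross-validation with cross_val_score on synthetic data. The data, the n_estimators=50 setting, and the choice of five folds are all assumptions made for the illustration; the notebook does not use cross-validation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression problem: a linear signal plus noise (not the real dataset).
rng = np.random.default_rng(42)
X = rng.uniform(0, 30, size=(300, 2))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 3, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=42)

# Each fold is held out once while the model trains on the other four.
cv_scores = cross_val_score(model, X, y, cv=5, scoring="r2")

print(cv_scores)
print(cv_scores.mean(), cv_scores.std())
```

A small spread across folds, together with a mean close to the single-split test score, would support the conclusion that the model is not merely memorizing the training data.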

***