# Portfolio

This portfolio showcases the assignments completed for the Data Science learning unit. Each assignment demonstrates the application of data science techniques and tools to solve real-world problems.

#### Datasets Used:
- **[Gym Members Dataset](https://www.kaggle.com/datasets/valakhorasani/gym-members-exercise-dataset):** Contains information about gym members, their exercise habits, and demographics.
- **[Fraud Detection Dataset](https://www.kaggle.com/datasets/aryan208/financial-transactions-dataset-for-fraud-detection):** Includes financial transaction data for identifying fraudulent activities.
- **[Calorie Expenditure Dataset](https://www.kaggle.com/competitions/playground-series-s5e5/data):** Provides data on calorie expenditure during various activities.

# Assignment 1: What is Data Science?

Data Science is a multidisciplinary field that uses various scientific methods to extract insights and knowledge from structured and unstructured data. It involves the collection, cleaning, analysis, and visualization of data to inform decision-making and drive business strategies.

#### What does a Data Scientist do?

A Data Scientist is a person who is responsible for collecting, analyzing and organizing data to extract useful information, and then communicating the findings to people who can use them.

# Assignment 2: Popular Tools for Data Science

Some of the popular tools used in Data Science include:

- **Programming Languages**: Python, R, SQL
- **Data Manipulation and Analysis**: Pandas, NumPy
- **Data Visualization**: Matplotlib, Seaborn, Tableau
- **Machine Learning**: Scikit-learn, TensorFlow, PyTorch
- **Big Data Tools**: Apache Spark, Hadoop
- **Integrated Development Environments (IDEs)**: Jupyter Notebook, RStudio

# Assignment 3: Univariate Analysis

In [None]:
import pandas as pd
import seaborn as sns
penguins3 = sns.load_dataset("penguins")
penguins3.head()

#### Univariate Analysis on categorical data on the penguins dataset

In [None]:
penguins3['island'].value_counts(dropna=True).plot(kind='bar')
# penguins3['sex'].value_counts(dropna=True).plot(kind='bar')
# penguins3['species'].value_counts(dropna=True).plot(kind='bar')

# Assignment 4 - Finding a dataset

I have looked on Kaggle for a dataset to use for the univariate analysis. I found a dataset about gym members and their exercise habits. The dataset contains information about gym members, their exercise habits, and their demographics. The dataset includes columns such as age, gender, height, weight, and exercise frequency. The dataset is suitable for univariate analysis as it contains both categorical and numerical variables.

The dataset can be found here: [https://www.kaggle.com/datasets/valakhorasani/gym-members-exercise-dataset](https://www.kaggle.com/datasets/valakhorasani/gym-members-exercise-dataset)

In [None]:
import pandas as pd
gym4 = pd.read_csv("Datasets\\gym_members_exercise_tracking.csv")
gym4.head()

# Assignment 5 - Univariate Analysis

For this assignment I will be using the lifeExpectancyAtBirth dataset from the WHO. The dataset contains information about life expectancy at birth for different countries and regions. The dataset includes columns such as country, region, year, and life expectancy. The dataset is suitable for univariate analysis as it contains both categorical and numerical variables.

The dataset can be found here: [https://www.kaggle.com/datasets/utkarshxy/who-worldhealth-statistics-2020-complete](https://www.kaggle.com/datasets/utkarshxy/who-worldhealth-statistics-2020-complete)

In [None]:
import pandas as pd

# Load the dataset
life_expectancy5 = pd.read_csv("Datasets\\lifeExpectancyAtBirth.csv")
life_expectancy5.head()

In [None]:
# Filter the dataframe for 2019 and 'Both sexes'
expectancy_2019 = life_expectancy5[
    (life_expectancy5['Period'] == 2019) & 
    (life_expectancy5['Dim1'] == 'Both sexes')
]

In [None]:
# Show the five countries with the highest and lowest life expectancy
print("Five countries with the highest life expectancy:")
print(expectancy_2019.nlargest(5, 'First Tooltip')[['Location', 'First Tooltip']])

print("\n")

print("Five countries with the lowest life expectancy:")
print(expectancy_2019.nsmallest(5, 'First Tooltip')[['Location', 'First Tooltip']])

# Assignment 6

For this assignment I will be using the dataset I chose in assignment 4. The dataset contains information about gym members, their exercise habits, and their demographics. I will be performing univariate analysis on 2 categorical columns and 2 numerical columns.

In [None]:
import pandas as pd
import seaborn as sns
gym6 = pd.read_csv("Datasets\\gym_members_exercise_tracking.csv")

I will be using the following columns for the univariate analysis:
`Gender` and `Workout_Type` for the categorical columns and Age and Weight for the numerical columns.

The dataset can be found here: [https://www.kaggle.com/datasets/valakhorasani/gym-members-exercise-dataset](https://www.kaggle.com/datasets/valakhorasani/gym-members-exercise-dataset)

In [None]:
gym6['Gender'].value_counts(dropna=True).plot(kind='bar')

In [None]:
gym6['Workout_Type'].value_counts(dropna=True).plot(kind='bar')

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
gym6['Weight (kg)'].plot(kind='box', vert=True, patch_artist=True, boxprops=dict(facecolor='lightblue'))
plt.title('Distribution of Weight (kg)', fontsize=14)
plt.ylabel('Weight (kg)', fontsize=12)  # the box is vertical, so the values are on the y-axis
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
gym6['Workout_Frequency (days/week)'].value_counts().plot(kind='pie', autopct='%1.1f%%', ylabel='', title='Workout Frequency (days/week)')

# Assignment 7 - Distributions

I will analyse the histograms of at least two numerical columns from the dataset I chose in portfolio assignment 4. I will observe the distribution of the data and identify whether it resembles a uniform, normal, or another type of distribution.  

In [None]:
import pandas as pd
gym7 = pd.read_csv("Datasets\\gym_members_exercise_tracking.csv")

In [None]:
gym7['BMI'].plot(kind='hist', bins=25)

The above analysis was done on the BMI variable. The histogram of the BMI variable suggests that the data is approximately log-normally distributed. The histogram is skewed to the right, indicating that there are a few individuals with very high BMI values. The majority of the individuals have a BMI value between 18 and 28, which covers the normal-weight range (18.5 to 24.9) and the lower part of the overweight range.
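
A rough way to back up a "right-skewed, roughly log-normal" reading of a histogram is to compare the skewness before and after taking the logarithm. The sketch below uses synthetic, BMI-like numbers rather than the gym dataset, and the helper `sample_skewness` is a hypothetical name introduced only for this illustration.

```python
import math
import random
import statistics


def sample_skewness(values):
    """Return the (biased) sample skewness: the mean cubed z-score."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / std) ** 3 for v in values)


# Synthetic, BMI-like values (NOT the gym dataset): log-normal with median ~24.5.
rng = random.Random(42)
fake_bmi = [rng.lognormvariate(3.2, 0.25) for _ in range(5000)]

skew_raw = sample_skewness(fake_bmi)
skew_log = sample_skewness([math.log(v) for v in fake_bmi])

print(f"skewness of raw values: {skew_raw:.2f}")  # clearly positive: right-skewed
print(f"skewness of log values: {skew_log:.2f}")  # close to zero: roughly symmetric
```

If the real `BMI` column behaves the same way (positive skewness that mostly disappears after the log transform), that would support the log-normal interpretation, although it does not prove it.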

In [None]:
gym7['Session_Duration (hours)'].plot(kind='hist', bins=25)

The above analysis was done on the workout session duration. I have observed that the data resembles a Tukey-Lambda distribution. The histogram has the shape of a bell curve, indicating that the majority of the individuals have a workout session duration between 60 and 90 minutes (1 to 1.5 hours). There are still some individuals with a workout session duration of more than 90 minutes or less than 30 minutes.

# Assignment 8 - Confidence Intervals

In this assignment I will calculate the 90%, 95%, 99%, and 99.99% confidence intervals for at least two numerical columns in the dataset I chose in portfolio assignment 4. I will observe the impact of the confidence level on the interval. Next, I will calculate the 95% confidence interval again but using only the first 10% of the rows. I will compare this interval to the previous 95% confidence interval I calculated and analyze the impact of having less data.

In [None]:
import pandas as pd
import scipy.stats as st
gym8 = pd.read_csv("Datasets\\gym_members_exercise_tracking.csv")

Below is the code I used to calculate the confidence intervals for the Calories_Burned variable. I made a function that takes the dataset as input and prints the confidence intervals. I used the t-distribution because the population standard deviation is unknown and has to be estimated from the sample. This matters most for small samples, such as the 10% subset used further below; for large samples the t-distribution is almost identical to the normal distribution.
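
Written out, the interval that `st.t.interval` returns when it is given `len(dataset) - 1` degrees of freedom, the sample mean as `loc` and the standard error as `scale` is the usual t-based confidence interval for a mean. The symbols are introduced here for illustration: $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the number of rows and $1 - \alpha$ the confidence level.

$$
\bar{x} \;\pm\; t_{1-\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}}
$$

A higher confidence level increases the critical value $t_{1-\alpha/2,\; n-1}$, and a smaller $n$ increases the standard error $s / \sqrt{n}$. Both make the interval wider.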

In [None]:
def print_confidence(dataset: pd.DataFrame):
  """Print t-based confidence intervals for the mean of Calories_Burned."""
  confidence_levels = [0.90, 0.95, 0.99, 0.9999]
  for confidence in confidence_levels:
    # Degrees of freedom = n - 1; the scale is the standard error of the mean.
    interval = st.t.interval(
      confidence, 
      len(dataset) - 1, 
      loc=dataset['Calories_Burned'].mean(), 
      scale=st.sem(dataset['Calories_Burned'])
    )
    print(f"{float(confidence * 100)}% Confidence Interval: {interval}")

In [None]:
# Print confidence intervals for the entire dataset
print_confidence(gym8)
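
To see why a higher confidence level widens the interval, it can help to look at the critical values alone. The sketch below uses the normal approximation from the standard library instead of the t-distribution used above (an assumption made to avoid extra dependencies; for large samples the two are very close).

```python
from statistics import NormalDist

# Two-sided critical values under the normal approximation (an assumption;
# the notebook itself uses the t-distribution, which is slightly wider).
confidence_levels = [0.90, 0.95, 0.99, 0.9999]
critical_values = {
    level: NormalDist().inv_cdf(1 - (1 - level) / 2) for level in confidence_levels
}

for level, z in critical_values.items():
    print(f"{level:.2%} confidence -> mean ± {z:.3f} standard errors")
```

Each step up in confidence multiplies the same standard error by a larger factor, so the interval can only grow.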

Next, I will use the same function to calculate the confidence intervals for only the first 10% of the rows. I will compare this interval to the previous 95% confidence interval I calculated and analyze the impact of having less data and why this is important.

In [None]:
# Calculate the first 10% of the dataset
first_10_percent_count = int(len(gym8) * 0.1)
first_10_percent_data = gym8.iloc[:first_10_percent_count]

# Print confidence intervals for the first 10% of the dataset
print_confidence(first_10_percent_data)

With **less data**, we have less information about the overall population. This makes our estimates of the mean more uncertain which is reflected in a **wider confidence interval**. To still capture the true population mean with the same level of confidence (e.g. 95%), we need to account for this uncertainty by **expanding the range**. 
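
The effect of the sample size can be sketched with the standard error formula: the half-width of the interval shrinks with the square root of the number of rows. The numbers below are hypothetical, and `ci_half_width` is a helper name made up for this example; it uses the normal approximation rather than the t-distribution.

```python
import math
from statistics import NormalDist


def ci_half_width(sample_std, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for the mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sample_std / math.sqrt(n)


# Hypothetical numbers: same spread, but only 10% of the rows.
full = ci_half_width(sample_std=270.0, n=1000)
subset = ci_half_width(sample_std=270.0, n=100)

print(f"full data: ±{full:.1f}, 10% subset: ±{subset:.1f}, ratio: {subset / full:.2f}")
```

With the same spread, using 10% of the rows makes the interval about √10 ≈ 3.16 times wider.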

Additionally, if the dataset were **sorted** in any way, we would have to be careful about using the first 10% of the rows. For example, if the dataset were sorted by age, the first 10% of the rows might only contain the youngest individuals (say, those aged 18-25). This would **not be a representative sample** of the overall population and would lead to **inaccurate estimates** of the mean and its confidence intervals.
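
The risk of slicing a sorted dataset can be illustrated with made-up ages. The sketch compares the first 10% of a sorted list with a random 10% sample; the data is hypothetical and not taken from the gym dataset.

```python
import random
import statistics

# Hypothetical ages, sorted ascending (NOT taken from the gym dataset).
ages = sorted(18 + (i % 42) for i in range(1000))

n_subset = int(len(ages) * 0.1)
first_rows = ages[:n_subset]                            # what .iloc[:n] would take
random_rows = random.Random(42).sample(ages, n_subset)  # shuffled alternative

overall_mean = statistics.fmean(ages)
print(f"overall mean age:       {overall_mean:.1f}")
print(f"first 10% (sorted):     {statistics.fmean(first_rows):.1f}")
print(f"random 10% of the rows: {statistics.fmean(random_rows):.1f}")
```

In pandas, a comparable random subset could be drawn with `DataFrame.sample(frac=0.1, random_state=42)` instead of `iloc`.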

# Assignment 9 - Numerical Bivariate Analysis

I performed a bivariate analysis on the columns with numerical data in the penguins dataset. I used `corr()` on the DataFrame to calculate all the correlations. The correlation table was displayed with colors using the code example provided earlier.

In [None]:
import seaborn as sns
penguins9 = sns.load_dataset("penguins")

In [None]:
correlations9 = penguins9.corr(numeric_only=True)
correlations9.style.background_gradient(cmap='coolwarm', axis=None).format(precision=2)

# Assignment 10
I performed a bivariate analysis on the numerical columns of the dataset I chose in portfolio assignment 4. The Pearson correlation coefficients were calculated for all pairs of numerical columns, and scatter plots were created for `Calories_Burned` against `Session_Duration (hours)` and for `Fat_Percentage` against `Calories_Burned`. The correlation coefficient indicates the strength and direction of the linear relationship, while a scatter plot provides a visual representation of the data points.

In [None]:
import pandas as pd
gym10 = pd.read_csv("Datasets\\gym_members_exercise_tracking.csv")
correlation = gym10.corr(numeric_only=True)

In [None]:
correlation.style.background_gradient(cmap='coolwarm', axis=None).format(precision=2)
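
Each cell in the correlation table is a Pearson coefficient. As a minimal sketch of what that number means, the function below computes it by hand for two lists; `pearson_r` is a hypothetical helper and the sample values are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up session durations (hours) and calories, for illustration only.
durations = [0.5, 1.0, 1.5, 2.0]
calories = [300, 610, 880, 1210]
print(round(pearson_r(durations, calories), 3))
```

A value close to 1 means the points lie almost on a rising straight line, a value close to -1 on a falling one, and a value near 0 means there is no linear relationship.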

In [None]:
gym10.plot(kind='scatter', x='Calories_Burned', y='Session_Duration (hours)', title='Calories Burned vs Session Duration')

In [None]:
gym10.plot(kind='scatter', x='Fat_Percentage', y='Calories_Burned', title='Fat Percentage vs Calories Burned')

# Assignment 11 - Numerical vs Categorical Analysis

I am going to do a numerical vs categorical bivariate analysis on the penguins dataset, starting with the `species` categorical column. I will use `.groupby('<columnname>').mean()` to examine the means of the numerical columns and see if there are differences between the categories. Next, I will use the seaborn barplot to visualize the mean and confidence intervals for each of the numerical columns (`bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `body_mass_g`). I have done this for the `species`, `sex` and `island` columns.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
penguins11 = sns.load_dataset("penguins")
penguins11.head()

In [None]:
def display_barplot(data, x, y, title):
  plt.figure(figsize=(10, 6))
  sns.barplot(data=data, x=x, y=y)
  plt.title(title, fontsize=16)
  plt.xlabel(x, fontsize=14)
  plt.ylabel(y, fontsize=14)
  plt.xticks(rotation=45)
  plt.show()
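
By default, the error bars that `sns.barplot` draws are 95% confidence intervals for the mean, estimated by bootstrapping. The sketch below shows the idea with a simple percentile bootstrap in plain Python. It is an illustration under assumptions, not seaborn's exact implementation; `bootstrap_mean_ci` and the sample values are made up.

```python
import random
import statistics


def bootstrap_mean_ci(values, confidence=0.95, n_boot=1000, seed=42):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    # Resample with replacement many times and record the mean of each resample.
    boot_means = sorted(
        statistics.fmean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    lower_idx = int((1 - confidence) / 2 * n_boot)
    upper_idx = int((1 + confidence) / 2 * n_boot) - 1
    return boot_means[lower_idx], boot_means[upper_idx]


# Made-up flipper lengths (mm) for one group, for illustration only.
flippers = [181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 185, 197]
low, high = bootstrap_mean_ci(flippers)
print(f"mean = {statistics.fmean(flippers):.1f}, 95% CI ≈ ({low:.1f}, {high:.1f})")
```

Groups with few rows or a large spread get wider error bars, which is worth keeping in mind when comparing bars.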

In [None]:
penguins11.groupby('species').mean(numeric_only=True)


In [None]:
penguins11.groupby('island').mean(numeric_only=True)

In [None]:
penguins11.groupby('sex').mean(numeric_only=True)

In [None]:
# Loop through the numerical columns and create bar plots grouped by 'species'.
for column in ['bill_length_mm', 'body_mass_g', 'bill_depth_mm', 'flipper_length_mm']:
  display_barplot(data=penguins11, x='species', y=column, title=f'{column} by Species')

SCIENCE CONCLUSION SPECIES

In [None]:
# Loop through the numerical columns and create bar plots grouped by 'island'.
for column in ['bill_length_mm', 'body_mass_g', 'bill_depth_mm', 'flipper_length_mm']:
  display_barplot(data=penguins11, x='island', y=column, title=f'{column} by Island')

SCIENCE CONCLUSION ISLAND

In [None]:
# Loop through the numerical columns and create bar plots grouped by 'sex'.
for column in ['bill_length_mm', 'body_mass_g', 'bill_depth_mm', 'flipper_length_mm']:
  display_barplot(data=penguins11, x='sex', y=column, title=f'{column} by Sex')

SCIENCE CONCLUSION SEX

# Assignment 12

For this assignment, I performed a bivariate analysis on three combinations of numerical and categorical columns from the gym members dataset chosen in portfolio assignment 4. The combinations analyzed are:

1. **Workout_Type** (categorical) vs **Calories_Burned** (numerical)
2. **Gender** (categorical) vs **BMI** (numerical)
3. **Workout_Frequency (days/week)** (categorical) vs **Session_Duration (hours)** (numerical)

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
gym12 = pd.read_csv("Datasets\\gym_members_exercise_tracking.csv")
gym12.head()

In [None]:
def create_contingency(dataset, column1, column2):
  return dataset.groupby(column1)[column2].mean().reset_index().style.set_table_styles([
  {'selector': 'caption', 'props': [('font-size', '16px'), ('text-align', 'center')]}])

### Calories Burned per Workout Type

The dataset provides insights into the average calories burned for different workout types. By analyzing this data, we can identify which workout types are more effective in terms of calorie expenditure. This information is valuable for individuals aiming to optimize their fitness routines based on their goals, such as weight loss or endurance building.

In [None]:
create_contingency(gym12, "Workout_Type", "Calories_Burned")

In [None]:
display_barplot(data=gym12, x='Workout_Type', y='Calories_Burned', title='Calories Burned by Workout Type')

In the graph above, you can see that the average calories burned per workout type do not differ much. The 95% confidence intervals of all workout types overlap, which suggests that there is no significant difference between the average calories burned per workout type. This could be because the dataset is not large enough, because the workout types do not differ much in intensity, or because the individuals in the dataset are not representative of the overall population.

### BMI per Gender

The dataset provides insights into the average BMI for different genders. By analyzing this data, we can identify any significant differences in BMI between males and females. This information is valuable for understanding the impact of gender on body composition and health.

In [None]:
create_contingency(gym12, "Gender", "BMI")

In [None]:
display_barplot(data=gym12, x='Gender', y='BMI', title='BMI by Gender')

The graph above shows that the average BMI does differ per gender. The average BMI of men is higher than that of women. This may be because men have more muscle mass on average than women. Women also have more fat mass on average than men. This is a normal difference between the genders.

### Session_Duration (hours) per Workout_Frequency (days/week) 

The dataset provides insights into the average session duration for different workout frequencies. By analyzing this data, we can identify how the frequency of workouts affects the duration of each session. This information is valuable for individuals looking to optimize their workout schedules and ensure they are getting the most out of their gym time.

In [None]:
create_contingency(gym12, "Workout_Frequency (days/week)", "Session_Duration (hours)")

In [None]:
display_barplot(data=gym12, x='Workout_Frequency (days/week)', y='Session_Duration (hours)',
                title='Session Duration by Workout Frequency')

In the graph above, you can see that the average session duration does differ per workout frequency. The average session duration is higher for individuals who work out more frequently. This may be because individuals who work out more frequently are more experienced and can handle longer sessions. It may also be because individuals who work out more frequently have more time to spend at the gym or they enjoy working out more.

# Assignment 13 - Categorical Bivariate Analysis

I performed a bivariate analysis on the penguins dataset for the combinations of species vs sex and island vs sex. In this analysis, I first considered whether a correlation between the columns was expected. Then, I created a contingency table to observe the ratios between categories and visualized these ratios using a bar plot. Finally, I conducted a chi-squared test to determine the statistical significance of the correlation and interpreted the resulting p-value.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
penguins13 = sns.load_dataset("penguins")
penguins13.head()

#### My hypothesis:
I do not expect a correlation between the species and sex columns. The species of a penguin is not related to how many male or female penguins there are. <br>
I do not expect a correlation between the island and sex columns. The island a penguin lives on is not related to how many male or female penguins there are. <br>
I do expect a correlation between the species and island columns. The species of a penguin is related to which island it lives on. <br>

I am going to make use of a few functions to make the code more readable and reusable.

In [None]:
def create_contingency_table(dataset, column1, column2):
    return dataset.groupby([column1, column2]).size().unstack(column1, fill_value=0)

In [None]:
def make_cat_vs_cat_barplot(dataset, column1, column2):
  dataset.groupby([column1, column2]).size().unstack(column1, fill_value=0).plot(kind='bar')

In [None]:
from scipy.stats import chi2_contingency
def check_cat_vs_cat_correlation(dataset, column1, column2):
    contingency_table = create_contingency_table(dataset, column1, column2)
    chi2 = chi2_contingency(contingency_table)
    print(chi2)
    p_value = chi2[1]
    odds_of_correlation = 1 - p_value
    print(f"The odds of a correlation between {column1} and {column2} is {odds_of_correlation * 100}% (Based on a p value of {p_value}).")
    print("This percentage needs to be at least 95% for a significant correlation.")
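
To make clearer what `chi2_contingency` computes, here is a hand calculation for a 2 x 3 table. The counts are typed in by hand as an assumption for illustration and are not read from the dataset. For exactly 2 degrees of freedom the p-value has the closed form exp(-chi2 / 2), which keeps the sketch free of dependencies.

```python
import math

# Counts typed in by hand for illustration (rows: sex, columns: species);
# they are an assumption, not read from the dataset.
observed = {
    "Female": {"Adelie": 73, "Chinstrap": 34, "Gentoo": 58},
    "Male": {"Adelie": 73, "Chinstrap": 34, "Gentoo": 61},
}

row_totals = {row: sum(cols.values()) for row, cols in observed.items()}
col_totals = {
    col: sum(observed[row][col] for row in observed) for col in observed["Female"]
}
grand_total = sum(row_totals.values())

# Expected count under independence: row total * column total / grand total.
chi2_stat = 0.0
for row, cols in observed.items():
    for col, count in cols.items():
        expected = row_totals[row] * col_totals[col] / grand_total
        chi2_stat += (count - expected) ** 2 / expected

dof = (len(row_totals) - 1) * (len(col_totals) - 1)  # (2 - 1) * (3 - 1) = 2
# For exactly 2 degrees of freedom the chi-squared tail has a closed form.
p_value = math.exp(-chi2_stat / 2)

print(f"chi2 = {chi2_stat:.4f}, dof = {dof}, p = {p_value:.3f}")
```

The p-value is the probability of seeing a table at least this far from the expected counts if the two columns were independent. A large p-value therefore means the data gives no reason to doubt independence.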

### Species vs Sex

First, I will create a contingency table to observe the ratios between the categories so I can see if there is a correlation between the species and sex columns.

In [None]:
create_contingency_table(penguins13, 'species','sex')

The contingency table above shows the number of penguins per species and sex. The table shows that there are nearly equal numbers of males and females for each species. The only exception is the Gentoo species, where there are slightly more males than females, but this difference is small. Next, I will use a chi-squared test to determine if there is a significant correlation between the species and sex columns. The null hypothesis is that there is no correlation, which is also what the contingency table suggests.

In [None]:
check_cat_vs_cat_correlation(penguins13, 'species','sex')

As I expected, the p-value is far higher than 0.05. This means that there is no significant correlation between the species and sex columns. The value 1 - p is only about 2.4%, far below the 95% that is needed. This is also visible in the bar plot below.

In [None]:
make_cat_vs_cat_barplot(penguins13, 'species','sex')

### Island vs Sex

First, I will create a contingency table to observe the ratios between the categories so I can see if there is a correlation between the sex and island columns.

In [None]:
create_contingency_table(penguins13, 'island','sex')

The contingency table above shows the number of penguins per island and sex. The table shows that there are nearly equal numbers of males and females on each island. I do not expect a correlation between the island and sex columns. 

In [None]:
check_cat_vs_cat_correlation(penguins13, 'sex', 'island')

As I expected, the p-value is far higher than 0.05. This means that there is no significant correlation between the island and sex columns. The value 1 - p is only about 2.8%, far below the 95% that is needed. This is also visible in the bar plot below.

In [None]:
make_cat_vs_cat_barplot(penguins13, 'sex', 'island')

# Assignment 14

I performed a categorical bivariate analysis on the dataset that I chose in assignment 4. In this analysis, I first considered whether a correlation between the columns was expected. Then, I created a contingency table to observe the ratios between categories and visualized these ratios using a bar plot. Finally, I conducted a chi-squared test to determine the statistical significance of the correlation and interpreted the resulting p-value.

I reuse the helper functions defined in assignment 13 to keep the code readable.

In [None]:
import pandas as pd

gym14 = pd.read_csv("Datasets\\gym_members_exercise_tracking.csv")
gym14.head()

#### My hypothesis:
I do expect a correlation between the experience and workout frequency columns. The more experienced individuals are, the more frequently they work out. <br>
I do not expect a correlation between the workout type and workout frequency columns. The workout type does not determine how frequently individuals work out. <br>
I do not expect a correlation between the workout type and experience columns. The workout type does not determine how experienced individuals are. <br>

#### Experience vs Workout Frequency

In [None]:
create_contingency_table(gym14, 'Experience_Level', 'Workout_Frequency (days/week)')

In the contingency table above, I can already see that there is a correlation between the experience and workout frequency columns. The table shows that the more experienced individuals are, the more frequently they work out. This is also visible in the bar plot.

In [None]:
make_cat_vs_cat_barplot(gym14, 'Experience_Level', 'Workout_Frequency (days/week)')

I will now use a chi-squared test to determine if there is a significant correlation between the experience and workout frequency columns.

In [None]:
check_cat_vs_cat_correlation(gym14, 'Experience_Level', 'Workout_Frequency (days/week)')

#### Workout Type vs Workout Frequency

In [None]:
create_contingency_table(gym14, 'Workout_Type', 'Workout_Frequency (days/week)')

In the contingency table above, I can't clearly see a correlation. I will make a bar plot to make it more visual.

In [None]:
make_cat_vs_cat_barplot(gym14, 'Workout_Type', 'Workout_Frequency (days/week)')

The bar graph suggests that there is probably no correlation. I will now use a chi-squared test to make sure.

In [None]:
check_cat_vs_cat_correlation(gym14, 'Workout_Type', 'Workout_Frequency (days/week)')

As expected, the p-value is 0.76... (which is far higher than 0.05). This means that there is no significant correlation between the workout type and workout frequency columns. The value 1 - p is only about 23.6%, well below the required 95%.

#### Experience vs Workout Type

In [None]:
create_contingency_table(gym14, 'Experience_Level', 'Workout_Type')

In the contingency table above, I cannot clearly see a correlation between the experience and workout type columns. I will make a bar plot to make it more visual and see if there is a correlation.

In [None]:
make_cat_vs_cat_barplot(gym14, 'Experience_Level', 'Workout_Type')

The bar graph suggests that there is probably no correlation. I will run a chi-squared test to confirm this.

In [None]:
check_cat_vs_cat_correlation(gym14, 'Experience_Level', 'Workout_Type')

The chi-squared test confirms that there is no significant correlation between the experience and workout type columns. The value 1 - p is about 55.3%, below the required 95%.

# Assignment 15 - Classification

In this assignment, I am going to train a decision tree to predict the species of a penguin based on its characteristics. I will use the penguins dataset for this assignment. The dataset contains information about the species, island, bill length, bill depth, flipper length, and body mass of penguins. I will use the `DecisionTreeClassifier` from the `sklearn.tree` module to train the decision tree.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import graphviz as gv
import sklearn.tree as tree
import sklearn.model_selection as msn

penguins15 = sns.load_dataset("penguins")
penguins15.head()

I am going to train the model using the `DecisionTreeClassifier` from the `sklearn.tree` module, as shown in the code below. I am setting the max depth to 5 as a starting point. I will also set the random state to 42 to ensure that the results are reproducible. As I found in assignment 13, the sex column is not related to the species, so it is not useful for the prediction. I also leave out the island column and only use the four numerical measurements as features.

In [None]:
features15 = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
dt15 = tree.DecisionTreeClassifier(max_depth=5, random_state=42)
dt15.fit(penguins15[features15], penguins15["species"]) # First the columns I am using to predict, then the target column.

The code above has trained the decision tree. In order to visualize the decision tree, I will use a custom function shown below.

In [None]:
def plot_tree_classification(model, features, class_names):
  # Export the fitted tree to Graphviz DOT format and wrap it for display in the notebook.
  dot_data = tree.export_graphviz(model, out_file=None, feature_names=features, class_names=class_names, filled=True, rounded=True, special_characters=True)
  return gv.Source(dot_data)

Using this custom function, I can visualize the decision tree:

In [None]:
plot_tree_classification(dt15, features15, np.sort(penguins15.species.unique()))
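
Each node in the rendered tree reports a `gini` value, because Gini impurity is the default split criterion of `DecisionTreeClassifier`. The sketch below shows how that number can be computed for a node and for a candidate split. The helper names and the node contents are made up for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted impurity of the two child nodes produced by a split."""
    total = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / total


# Made-up node contents, for illustration only.
parent = ["Adelie"] * 4 + ["Gentoo"] * 4
print(gini_impurity(parent))                   # evenly mixed node
print(split_impurity(parent[:4], parent[4:]))  # perfectly separating split
```

The tree prefers splits that lower the weighted impurity the most, so a leaf with `gini = 0.0` contains penguins of a single species only.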

Now I will go on to validate the model. I will use `dt15.predict` to use the model to predict the species of the penguins that are in the dataset.

In [None]:
predictions = dt15.predict(penguins15[features15])

Using the code below, I can see the predictions that the model made.

In [None]:
predictions

Using another custom function, I can calculate the accuracy of the model. The accuracy is the percentage of correct predictions made by the model.

In [None]:
def calculate_accuracy(predictions, actuals):
    # Fraction of predictions that match the actual labels.
    if len(predictions) != len(actuals):
        raise ValueError("The number of predictions did not equal the number of actuals")
    return (predictions == actuals).sum() / len(actuals)

In [None]:
calculate_accuracy(predictions, penguins15.species)

As you can see, the accuracy of the model is 99.41%. This number is optimistic, because the model is evaluated on the same rows it was trained on. The dataset is also relatively small, so the decision tree is probably `overfitting` the data. To check this, I will now use a `70/30 split` to train and test the model. I will use the `train_test_split` function from the `sklearn.model_selection` module to do this. I will also set the random state to 42 to ensure that the results are reproducible.

In [None]:
penguins15_train, penguins15_test = msn.train_test_split(penguins15, test_size=0.3, random_state=42, stratify=penguins15['species'])
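
The `stratify` argument in the split above keeps the species ratios the same in the train and test set. As a rough sketch of the idea (not scikit-learn's actual implementation), a stratified split can be written by hand. `stratified_split` and the labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) with roughly equal class ratios in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut off the test share of that class.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Made-up, imbalanced labels for illustration only.
labels = ["Adelie"] * 50 + ["Gentoo"] * 30 + ["Chinstrap"] * 20
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))
```

Without stratification, a small or rare class could by chance end up almost entirely in one of the two sets, which would make the test accuracy less trustworthy.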

Then I will retrain the decision tree, this time using only the training set.

In [None]:
features15 = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
dt15 = tree.DecisionTreeClassifier(max_depth=5)
dt15.fit(penguins15_train[features15], penguins15_train["species"])

Now I will use the trained decision tree to predict the species of a penguin based on their characteristics. I will use the test set to evaluate the performance of the model. Also, I will use the train set to evaluate if the model is overfitting the data.

In [None]:
predictionsOnTrainset15 = dt15.predict(penguins15_train[features15])
predictionsOnTestset15 = dt15.predict(penguins15_test[features15])

accuracyTrain = calculate_accuracy(predictionsOnTrainset15, penguins15_train.species)
accuracyTest = calculate_accuracy(predictionsOnTestset15, penguins15_test.species)

print("Accuracy on training set " + str(accuracyTrain))
print("Accuracy on test set " + str(accuracyTest))

Unexpectedly, the model still reaches an accuracy of `97%` on the test set.

In [None]:
# Train and evaluate decision trees with max_depth=6 and max_depth=3
for depth in (6, 3):
    dt15 = tree.DecisionTreeClassifier(max_depth=depth)
    dt15.fit(penguins15_train[features15], penguins15_train["species"])
    train_predictions = dt15.predict(penguins15_train[features15])
    test_predictions = dt15.predict(penguins15_test[features15])
    train_accuracy = (train_predictions == penguins15_train["species"]).mean()
    test_accuracy = (test_predictions == penguins15_test["species"]).mean()
    print(f"Max Depth: {depth} - Train Accuracy: {train_accuracy:.2%}, Test Accuracy: {test_accuracy:.2%}")

Setting the max depth to 6 increased the accuracy of the model on the test set. Setting it to 3 decreased the accuracy both on the train and test set.

# Assignment 16

In this assignment, I am going to train a decision tree to predict whether a transaction is fraudulent. For this assignment, I will use a new dataset that I found on Kaggle. The dataset contains information about financial transactions and whether they are fraudulent or not. I will use the `DecisionTreeClassifier` from the `sklearn.tree` module to train the decision tree. The dataset can be found here: [https://www.kaggle.com/datasets/aryan208/financial-transactions-dataset-for-fraud-detection/](https://www.kaggle.com/datasets/aryan208/financial-transactions-dataset-for-fraud-detection/)

In [None]:
import numpy as np
import pandas as pd
import hashlib as hl
import graphviz as gv
import sklearn.tree as tree
import matplotlib.pyplot as plt
import sklearn.model_selection as msn

fraud16 = pd.read_csv("Datasets\\financial_fraud_detection_dataset.csv")
fraud16.head()

I am first going to drop a few columns from the dataset. These columns are not useful for the prediction.

In [None]:
fraud16.drop(columns=["transaction_id", "fraud_type", "device_hash"], inplace=True)

Then I will convert the timestamp column to multiple columns. I will convert the timestamp column to hour, day of week, month and is_weekend.

In [None]:
# Convert 'timestamp' to datetime
fraud16['timestamp'] = pd.to_datetime(fraud16['timestamp'], format='ISO8601')

# Add new columns
fraud16['hour'] = fraud16['timestamp'].dt.hour
fraud16['day_of_week'] = fraud16['timestamp'].dt.weekday
fraud16['month'] = fraud16['timestamp'].dt.month
fraud16['is_weekend'] = fraud16['timestamp'].dt.weekday.isin([5, 6])
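
The same feature extraction can be sketched for a single timestamp with the standard library, which makes the meaning of each new column explicit. The timestamp below is hypothetical and `time_features` is a helper name made up for this example. Like pandas' `dt.weekday`, `datetime.weekday()` counts Monday as 0 and Sunday as 6.

```python
from datetime import datetime


def time_features(timestamp):
    """Derive simple calendar features from an ISO 8601 timestamp string."""
    moment = datetime.fromisoformat(timestamp)
    weekday = moment.weekday()  # Monday = 0 ... Sunday = 6
    return {
        "hour": moment.hour,
        "day_of_week": weekday,
        "month": moment.month,
        "is_weekend": weekday in (5, 6),
    }


# Hypothetical timestamp, not taken from the dataset.
print(time_features("2023-08-12T14:30:00"))
```

Splitting the timestamp this way gives the decision tree numeric columns it can put thresholds on, such as "hour below 6".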

I will use a `70/30 split` to train and test the model. I will use the `train_test_split` function from the `sklearn.model_selection` module to do this. I will also set the random state to 42 to ensure that the results are reproducible.

In [None]:
fraud16_train, fraud16_test = msn.train_test_split(fraud16, test_size = 0.3, random_state = 42, stratify = fraud16["is_fraud"])
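
To illustrate what `stratify` does on an imbalanced target, here is a minimal sketch on synthetic data. The 5% fraud rate, the `demo` DataFrame and its values are made up for illustration and are not taken from the Kaggle dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 1000 rows with an assumed 5% fraud rate
demo = pd.DataFrame({
    "amount": np.arange(1000, dtype=float),
    "is_fraud": [True] * 50 + [False] * 950,
})

demo_train, demo_test = train_test_split(
    demo, test_size=0.3, random_state=42, stratify=demo["is_fraud"]
)

# With stratify, both parts keep the original share of fraudulent rows
train_rate = demo_train["is_fraud"].mean()
test_rate = demo_test["is_fraud"].mean()
print(f"train fraud rate: {train_rate:.3f}, test fraud rate: {test_rate:.3f}")
```

Without `stratify`, a random split of a rare class can leave the test set with noticeably more or fewer fraud cases than the training set.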

As before, I am going to train the model using the DecisionTreeClassifier from the sklearn.tree module. I am setting max depth to 5 as this is a good starting point. I will also set the random state to 42 to ensure that the results are reproducible.

In [None]:
numerical_cols = ["amount", "time_since_last_transaction", "spending_deviation_score", "velocity_score", "geo_anomaly_score", "hour", "day_of_week", "month", "is_weekend"]
dt16 = tree.DecisionTreeClassifier(max_depth=5, random_state=42)
dt16.fit(fraud16_train[numerical_cols], fraud16_train["is_fraud"]) # First the columns I am using to predict, then the target column.

Using my custom `plot_tree_classification` function, I can visualize the decision tree:

In [None]:
plot_tree_classification(dt16, numerical_cols, ['False', 'True'])

Now I will use the trained decision tree to predict if the transaction was fraudulent. I will use the test set to evaluate the performance of the model. Also, I will use the train set to evaluate if the model is overfitting the data.

In [None]:
predictionsOnTrainset16 = dt16.predict(fraud16_train[numerical_cols])
predictionsOnTestset16 = dt16.predict(fraud16_test[numerical_cols])

accuracyTrain = calculate_accuracy(predictionsOnTrainset16, fraud16_train.is_fraud)
accuracyTest = calculate_accuracy(predictionsOnTestset16, fraud16_test.is_fraud)

print("Model Performance:")
print(f"Accuracy on Training Set: {accuracyTrain:.2%}")
print(f"Accuracy on Test Set: {accuracyTest:.2%}")

The accuracy of the model is 96.4%. I read online that such a high accuracy is probably because the dataset is heavily imbalanced: there are a lot more non-fraudulent transactions than fraudulent transactions, so a model can score well by mostly predicting "not fraud". Either way, I still think it was a learning experience.
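
The effect of class imbalance on accuracy can be sketched with a "model" that always predicts the majority class. The labels below are synthetic (an assumed 4% fraud rate, not the real dataset), so the numbers only illustrate the idea.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Illustrative labels only: 960 non-fraud (0) and 40 fraud (1) transactions
y_true = np.array([0] * 960 + [1] * 40)

# A baseline that always predicts the majority class (non-fraud)
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
fraud_recall = recall_score(y_true, y_majority, zero_division=0)

print(f"Baseline accuracy: {baseline_accuracy:.2%}")
print(f"Recall on fraud:   {fraud_recall:.2%}")
print(confusion_matrix(y_true, y_majority))
```

This baseline gets a high accuracy without finding a single fraudulent transaction, which is why recall on the fraud class (or a confusion matrix) is a useful complement to accuracy. In the notebook, `fraud16_test["is_fraud"].value_counts(normalize=True)` would show the share of each class to compare against.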

# Assignment 17 - Regression

In this assignment, I am going to train a decision tree to predict the body mass (in grams) of a penguin. For this assignment, I will use the penguins dataset. The dataset contains information about the species, island, bill length, bill depth, flipper length, and body mass of penguins.

In [None]:
import numpy as np
import pandas as pd
import hashlib as hl
import seaborn as sns
import graphviz as gv
import sklearn.tree as tree
import matplotlib.pyplot as plt
import sklearn.model_selection as msn

penguins17 = sns.load_dataset("penguins")
penguins17.head()

I will first use a correlation matrix to see if there are any correlations between the columns. I will use the `corr()` function from the pandas library to do this, and the pandas `style.background_gradient()` method to color the matrix.

In [None]:
penguins17.corr(numeric_only=True).style.background_gradient(cmap='coolwarm', axis=None).format(precision=2)

Then, I will split the dataset into a train and test set. I will use the `train_test_split` function from the `sklearn.model_selection` module to do this. I will also set the random state to 42 to ensure that the results are reproducible.

In [None]:
penguins17_train, penguins17_test = msn.train_test_split(penguins17, test_size=0.3, random_state=42)

Next, I will train the model using the `DecisionTreeRegressor` from the `sklearn.tree` module. I am setting max depth to 4 as this is a good starting point. I will also set the random state to 42 to ensure that the results are reproducible. I am going to use all the columns except the species, sex and island columns, because these contain text values that the `DecisionTreeRegressor` cannot handle without encoding. I will also leave out the body mass column as a feature, because this is the column I am trying to predict.

In [None]:
# Drop rows with NaN in the target column
penguins17_train = penguins17_train.dropna(subset=['body_mass_g'])

# Define features and target
features17 = ['flipper_length_mm', 'bill_depth_mm', 'bill_length_mm']
dt_regression17 = tree.DecisionTreeRegressor(max_depth=4, random_state=42)

# Fit the model
dt_regression17.fit(penguins17_train[features17].values, penguins17_train['body_mass_g'].values)

I will make a custom function to visualize the regression decision tree. This is done with the code below. 

In [None]:
from sklearn import tree
import graphviz

def plot_tree_regression(model, features):
    dot_data = tree.export_graphviz(model, out_file=None, feature_names=features, filled=True, rounded=True, special_characters=True)
    graph = graphviz.Source(dot_data)  
    return graph 

In [None]:
plot_tree_regression(dt_regression17, features17)

In [None]:
def calculate_rmse(predictions, actuals):
    # Root mean squared error: sqrt(mean((prediction - actual)^2))
    if len(predictions) != len(actuals):
        raise ValueError("The number of predictions did not equal the number of actuals")

    return (((predictions - actuals) ** 2).sum() / len(actuals)) ** (1/2)

def calculate_rmsle(predictions, actuals):
    # Root mean squared logarithmic error: the same formula applied to log(1 + value)
    if len(predictions) != len(actuals):
        raise ValueError("The number of predictions did not equal the number of actuals")

    return (((np.log1p(predictions) - np.log1p(actuals)) ** 2).sum() / len(actuals)) ** (1/2)

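The code above is for calculating the RMSE and the RMSLE. The RMSE is the root mean squared error. It is a measure of how well the model fits the data. The lower the RMSE, the better the model fits the data. The RMSE is calculated by taking the square root of the mean of the squared differences between the predicted and actual values. The RMSLE (root mean squared logarithmic error) applies the same calculation to `log(1 + value)`, so it looks at relative rather than absolute differences.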
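
As a worked example of the calculation described above, here is a small sketch with four made-up body masses (in grams). The values are illustrative only and do not come from the penguins dataset.

```python
import numpy as np

# Made-up values purely for illustration (grams)
actuals = np.array([3500.0, 4000.0, 4500.0, 5000.0])
predictions = np.array([3600.0, 3900.0, 4500.0, 5200.0])

errors = predictions - actuals     # [100, -100, 0, 200]
mse = np.mean(errors ** 2)         # (10000 + 10000 + 0 + 40000) / 4 = 15000
rmse = np.sqrt(mse)                # about 122.47

# RMSLE: the same steps on log(1 + value), so it reflects relative differences
rmsle = np.sqrt(np.mean((np.log1p(predictions) - np.log1p(actuals)) ** 2))

print(f"RMSE:  {rmse:.2f}")
print(f"RMSLE: {rmsle:.4f}")
```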

Below is the result of the RMSE calculation for the train and test set.

In [None]:
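# Drop test rows with missing values so every prediction has a matching actual value
penguins17_test = penguins17_test.dropna(subset=features17 + ['body_mass_g'])

predictionsOnTrainset17 = dt_regression17.predict(penguins17_train[features17].values)
predictionsOnTestset17 = dt_regression17.predict(penguins17_test[features17].values)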

rmseTrain17 = calculate_rmse(predictionsOnTrainset17, penguins17_train.body_mass_g)
rmseTest17 = calculate_rmse(predictionsOnTestset17, penguins17_test.body_mass_g)

print("RMSE on training set " + str(rmseTrain17))
print("RMSE on test set " + str(rmseTest17))

The RMSE on the training set is significantly lower than the RMSE on the test set. This suggests that the model may be overfitting the training data. Overfitting occurs when a model learns the training data too well, including its noise and outliers, which reduces its ability to generalize to unseen data. This is evident here as the model performs much better on the training set compared to the test set.

To address this issue, I will experiment with different variations of the model by adjusting the `max_depth` parameter of the decision tree and the columns used for training.

In [None]:
from itertools import combinations
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import mean_squared_error, accuracy_score
import numpy as np

def find_best_tree_model(
    features,
    train_df,
    test_df,
    target_column,
    model_type='regressor',            # 'regressor' or 'classifier'
    algorithm='decision_tree',         # 'decision_tree' or 'random_forest'
    depth_range=(3, 8),                # inclusive range of depths to try
    n_estimators_list=(100,),          # only used if algorithm='random_forest'
):
    """
    Searches feature subsets, tree depths, (and for RF: n_estimators)
    to find the best DecisionTree/RandomForest model.

    Parameters:
        features (list of str): columns to consider as inputs
        train_df, test_df (pd.DataFrame): train & test sets
        target_column (str): name of the y-column
        model_type (str): 'regressor' or 'classifier'
        algorithm (str): 'decision_tree' or 'random_forest'
        depth_range (tuple): (min_depth, max_depth) inclusive
        n_estimators_list (iterable of int): RF n_estimators to try

    Returns:
        dict:
            'best_features': tuple of column names
            'best_depth': int
            'best_n_estimators': int or None
            'train_score': float
            'test_score': float
            'best_model': fitted estimator
    """

    # choose classes & metric functions
    if model_type == 'regressor':
        TreeCls = DecisionTreeRegressor
        RFCls   = RandomForestRegressor
        score_fn = lambda y_pred, y_true: np.sqrt(mean_squared_error(y_true, y_pred))
        better = lambda a, b: a < b      # lower RMSE is better
        default_n = None
    else:
        TreeCls = DecisionTreeClassifier
        RFCls   = RandomForestClassifier
        score_fn = lambda y_pred, y_true: accuracy_score(y_true, y_pred)
        better = lambda a, b: a > b      # higher accuracy is better
        default_n = 1

    best = {
        'best_features': None,
        'best_depth': None,
        'best_n_estimators': default_n,
        'train_score': -np.inf if model_type=='classifier' else np.inf,
        'test_score': -np.inf if model_type=='classifier' else np.inf,
        'best_model': None
    }

    # helper to instantiate
    def make_model(max_depth, n_estimators):
        if algorithm == 'decision_tree':
            return TreeCls(max_depth=max_depth)
        else:
            return RFCls(max_depth=max_depth, n_estimators=n_estimators)

    all_subsets = (
        [ (f,) for f in features ] +
        [ subset for r in range(2, len(features)+1)
                 for subset in combinations(features, r) ]
    )

    for subset in all_subsets:
        X_train = train_df[list(subset)].values
        X_test  = test_df[list(subset)].values
        y_train = train_df[target_column].values
        y_test  = test_df[target_column].values

        for depth in range(depth_range[0], depth_range[1] + 1):
            for n_est in (n_estimators_list if algorithm=='random_forest' else [None]):
                model = make_model(depth, n_est)
                model.fit(X_train, y_train)

                y_pred_train = model.predict(X_train)
                y_pred_test  = model.predict(X_test)

                train_score = score_fn(y_pred_train, y_train)
                test_score  = score_fn(y_pred_test, y_test)

                # print progress
                print(f"Features={list(subset)}, depth={depth}"
                      f"{(', n_est=' + str(n_est)) if n_est is not None else ''}"
                      f" → train={train_score:.4f}, test={test_score:.4f}")

                if better(test_score, best['test_score']):
                    best.update({
                        'best_features': subset,
                        'best_depth': depth,
                        'best_n_estimators': n_est,
                        'train_score': train_score,
                        'test_score': test_score,
                        'best_model': model
                    })

    return best

In [None]:
# Drop rows with NaN values in the features or target column
penguins17_train_clean = penguins17_train.dropna(subset=features17 + ['body_mass_g'])
penguins17_test_clean = penguins17_test.dropna(subset=features17 + ['body_mass_g'])

result = find_best_tree_model(
    features=features17,
    train_df=penguins17_train_clean,
    test_df=penguins17_test_clean,
    target_column='body_mass_g',
    model_type='regressor',
    algorithm='decision_tree',
    depth_range=(3, 8)
)

print(f"Best Features: {result['best_features']}")
print(f"Best Model Depth: {result['best_depth']}")
print(f"Train RMSE: {result['train_score']:.2f}")
print(f"Test RMSE: {result['test_score']:.2f}")

According to this very complex algorithm (written by me and my friend ChatGPT), the best RMSE is achieved with a max depth of 3 while using all the columns. The RMSE is 335.4 on the test set and 350.4 on the train set.
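
One thing to keep in mind with the search above is that it picks the best model by its test score, so the reported test RMSE may be a little optimistic. A common alternative is to choose `max_depth` with cross-validation on the training data and use the test set only once at the end. The sketch below shows that idea with scikit-learn's `GridSearchCV` on synthetic stand-in data (three numeric features, generated with `make_regression`); it is an assumption-laden illustration, not a replacement for the function above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: 3 numeric features, like features17)
X, y = make_regression(n_samples=300, n_features=3, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Pick max_depth with 5-fold cross-validation on the training data only
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": list(range(3, 9))},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
cv_rmse = -search.best_score_  # the scorer returns negative RMSE

# The test set is only used once, after the depth has been chosen
test_rmse = np.sqrt(np.mean((search.predict(X_test) - y_test) ** 2))
print(f"best depth={best_depth}, CV RMSE={cv_rmse:.2f}, test RMSE={test_rmse:.2f}")
```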

# Assignment 18

In this assignment, I am going to train a decision tree to predict a person's calorie expenditure based on their characteristics. I will use a new dataset for this assignment. I found a dataset on Kaggle that contains information about the characteristics of individuals and their calorie expenditure. The dataset can be found here: [https://www.kaggle.com/competitions/playground-series-s5e5/data](https://www.kaggle.com/competitions/playground-series-s5e5/data)

In [None]:
import numpy as np
import pandas as pd
import hashlib as hl
import graphviz as gv
import sklearn.tree as tree
import matplotlib.pyplot as plt
import sklearn.model_selection as msn

expenditure18 = pd.read_csv("Datasets\\calorie_expenditure.csv")
expenditure18.head()

I will first use a correlation matrix to see if there are any correlations between the columns. I will use the `corr()` function from the pandas library to do this, and the pandas `style.background_gradient()` method to color the matrix.

In [None]:
expenditure18.corr(numeric_only=True).style.background_gradient(cmap='coolwarm', axis=None).format(precision=2)

Then, I will split the dataset into a train and test set. I will use the `train_test_split` function from the `sklearn.model_selection` module to do this. I will also set the random state to 42 to ensure that the results are reproducible.

In [None]:
expenditure18_train, expenditure18_test = msn.train_test_split(expenditure18, test_size=0.3, random_state=42)

Next, I will train the model using the `DecisionTreeRegressor` from the `sklearn.tree` module. I am setting max depth to 4 as this is a good starting point. I will also set the random state to 42 to ensure that the results are reproducible. I am going to use only numerical columns because categorical columns are not supported by the `DecisionTreeRegressor` without encoding.

In [None]:
features18 = ['Duration', 'Heart_Rate', 'Body_Temp']
dt_regression18 = tree.DecisionTreeRegressor(max_depth=4, random_state=42)
dt_regression18.fit(expenditure18_train[features18].values, expenditure18_train['Calories'].values)

In [None]:
plot_tree_regression(dt_regression18, features18)

In [None]:
predictionsOnTrainset18 = dt_regression18.predict(expenditure18_train[features18].values)
predictionsOnTestset18 = dt_regression18.predict(expenditure18_test[features18].values)

rmseTrain18 = calculate_rmse(predictionsOnTrainset18, expenditure18_train.Calories)
rmseTest18 = calculate_rmse(predictionsOnTestset18, expenditure18_test.Calories)

print("RMSE on training set " + str(rmseTrain18))
print("RMSE on test set " + str(rmseTest18))

The RMSE is marginally lower on the test set than on the train set. This suggests that the model is not overfitting the training data and is able to generalize well to unseen data. The model performs well on both the training and test sets, indicating that it has learned the underlying patterns in the data effectively. Still, I will experiment with different variations of the model by adjusting the `max_depth` parameter of the decision tree and the columns used for training.

In [None]:
result = find_best_tree_model(
  features=features18,
  train_df=expenditure18_train,
  test_df=expenditure18_test,
  target_column='Calories',
  model_type='regressor',
  algorithm='decision_tree',
  depth_range=(3, 8)
)

print(f"Best Features: {result['best_features']}")
print(f"Best Model Depth: {result['best_depth']}")
print(f"Train RMSE: {result['train_score']:.2f}")
print(f"Test RMSE: {result['test_score']:.2f}")

After running my custom function for finding the best RMSE, I put my findings in the code below.

In [None]:
features18_2 = ['Duration', 'Heart_Rate']

dt_regression18_2 = tree.DecisionTreeRegressor(max_depth = 7)
dt_regression18_2.fit(expenditure18_train[features18_2].values, expenditure18_train['Calories'].values)

predictionsOnTrainset18_2 = dt_regression18_2.predict(expenditure18_train[features18_2].values)
predictionsOnTestset18_2 = dt_regression18_2.predict(expenditure18_test[features18_2].values)

rmseTrain18_2 = calculate_rmse(predictionsOnTrainset18_2, expenditure18_train.Calories)
rmseTest18_2 = calculate_rmse(predictionsOnTestset18_2, expenditure18_test.Calories)

print("RMSE on training set " + str(rmseTrain18_2))
print("RMSE on test set " + str(rmseTest18_2))

The result of testing the model with different combinations of columns and max depth is shown above. The best RMSE is achieved with a depth of 7 while using the `Duration` and `Heart_Rate` columns. The RMSE is only 11.53 on the test set and 11.52 on the train set. The overall RMSE is very low, which is a good sign. The test set RMSE is now 0.01 higher than the train set RMSE, so there has been an increase in accuracy and a very small decrease in generalization.

If I use the model to predict the calorie expenditure of myself, I get the following result:

In [None]:
import numpy as np
print(dt_regression18_2.predict(np.array([[60, 60]]))) # One hour at 60 BPM

## Question
 
> **Is the RMSE just a magical number or is it tied to the measurement of the data?**  
> **For example, does an RMSE of 11 mean I am 11 calories off or is it just an abstract number unrelated to the data?**

### 🤖 Answer from ChatGPT

**RMSE (Root Mean Squared Error)** is **not a magical number**.  
It is very much tied to the **scale** and **units** of your target variable — in this case, **calorie expenditure**.

If your `Calories` column is measured in calories (which it almost certainly is), then:

##### ✅ **An RMSE of 11.53 means:**
> ➡️ **On average, your model's predictions are off by about 11.53 calories.**

#### 🔍 A Bit More Depth

- 📐 **RMSE** is calculated as the **square root of the average of the squared differences** between predicted and actual values.
- ⚖️ It **penalizes larger errors more heavily** than smaller ones because of the squaring.
- 📎 The **unit of RMSE** is the same as your target variable (in this case, calories).

#### 📊 Real Example

If someone actually burned **250 calories**, and your model predicted **238.5**,  
that’s an error of **11.5 calories** — which would directly factor into the RMSE you're seeing.
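
A small caveat to the answer above: the plain "average miss" is the mean absolute error (MAE). RMSE is in the same unit, but because of the squaring it is pulled up by large misses and is never smaller than the MAE. The sketch below uses made-up calorie values (not the Kaggle data) to show the difference.

```python
import numpy as np

# Made-up calorie values for illustration only
actual = np.array([250.0, 100.0, 180.0, 60.0])
predicted = np.array([238.5, 101.0, 182.0, 61.0])

errors = predicted - actual              # [-11.5, 1.0, 2.0, 1.0]
mae = np.mean(np.abs(errors))            # (11.5 + 1 + 2 + 1) / 4 = 3.875
rmse = np.sqrt(np.mean(errors ** 2))     # sqrt(138.25 / 4), about 5.88

print(f"MAE:  {mae:.2f} calories")
print(f"RMSE: {rmse:.2f} calories")
```

Here a single miss of 11.5 calories is enough to push the RMSE well above the MAE, even though the other three predictions are close.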

# Assignment 19 - Random Forest

In this assignment, I am going to train a random forest to predict Experience_Level based on the characteristics of a gym member. I will use the same dataset as in assignment 4.

In [None]:
import numpy as np
import pandas as pd
import hashlib as hl
import graphviz as gv
import sklearn.tree as tree
import sklearn.ensemble as ens
import matplotlib.pyplot as plt
import sklearn.model_selection as msn

gym19 = pd.read_csv("Datasets\\gym_members_exercise_tracking.csv")
gym19 = gym19.dropna()  # dropna() returns a new DataFrame, so assign the result
gym19.head()

I will first check the correlation matrix to see if there are any correlations between the columns. I will use the `corr()` function from the pandas library to do this, and the pandas `style.background_gradient()` method to color the matrix.

In [None]:
gym19.corr(numeric_only=True).style.background_gradient(cmap='coolwarm', axis=None).format(precision=2)

First, I will define the features and target variable. The features are the columns that I will use to train the model and the target variable is the column that I want to predict. Based on the correlation matrix, I initially chose the following columns as features: 'Workout_Frequency (days/week)', 'Session_Duration (hours)', 'Calories_Burned', 'Water_Intake (liters)' and 'Fat_Percentage'. The code below uses the final selection of three features; the experiments that led to it are described at the end of this assignment.

In [None]:
features19 = ['Workout_Frequency (days/week)', 'Session_Duration (hours)', 'Calories_Burned']
target19 = 'Experience_Level'

Then, I will split the dataset into a train and test set. I will use the `train_test_split` function from the `sklearn.model_selection` module to do this. I will also set the random state to 42 to ensure that the results are reproducible.

In [None]:
X19 = gym19[features19]
y19 = gym19[target19]

X_train19, X_test19, y_train19, y_test19 = msn.train_test_split(X19, y19, test_size=0.3, random_state=42)

Next, I will train the model using RandomForestClassifier from the sklearn.ensemble module. I am setting the number of estimators to 100 as this is a good starting point. I will also set the random state to 42 to ensure that the results are reproducible. Also the `max_depth` is set to 6.

In [None]:
clf19 = ens.RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42)
clf19.fit(X_train19, y_train19)

After training the model, I will use the trained random forest to predict the experience levels. I will use the test set to evaluate the performance of the model. Also, I will use the train set to evaluate if the model is overfitting the data.

In [None]:
predictionsOnTrainset19 = clf19.predict(X_train19)
predictionsOnTestset19 = clf19.predict(X_test19)

# Calculate and print accuracy (predictions first, then actuals, as in the earlier assignments)
accuracyTrain19 = calculate_accuracy(predictionsOnTrainset19, y_train19)
accuracyTest19 = calculate_accuracy(predictionsOnTestset19, y_test19)

print(f"Train Accuracy: {accuracyTrain19:0.4f}")
print(f"Test Accuracy: {accuracyTest19:0.4f}")

The model achieved an accuracy of 91.10% on the test set and 90.60% on the training set. These results indicate that the model demonstrates strong generalization capabilities, as the performance on unseen data is consistent with the training data. The minimal difference between training and test accuracy suggests that the model is not overfitting and effectively captures the underlying patterns in the dataset.

Initially, the features selected for the model were 'Workout_Frequency (days/week)', 'Session_Duration (hours)', 'Calories_Burned', 'Water_Intake (liters)', and 'Fat_Percentage'. This configuration yielded an accuracy of 90.4% on the test set and 90% on the training set.

Subsequently, the feature set was refined to include only 'Workout_Frequency (days/week)', 'Session_Duration (hours)', 'Calories_Burned', and 'Fat_Percentage'. This adjustment resulted in a marginally improved training set accuracy of 91.19%, but the test set accuracy slightly decreased to 90.41%. The higher training accuracy suggested potential overfitting, prompting further experimentation with feature selection.

In the final iteration, the 'Fat_Percentage' and 'Water_Intake (liters)' features were excluded due to their lower correlation with the target variable. The model was retrained using the features 'Workout_Frequency (days/week)', 'Session_Duration (hours)', and 'Calories_Burned'. This approach, as detailed in the assignment above, aimed to balance model performance and generalization by focusing on the most relevant predictors. The final model achieved a training accuracy of 90.60% and a test accuracy of 91.10%, demonstrating strong generalization capabilities.
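
Besides the correlation matrix, a fitted random forest exposes `feature_importances_`, which can support this kind of feature selection. The sketch below uses synthetic data from `make_classification` with placeholder column names (`feature_0` to `feature_4`), so it only illustrates the mechanism and says nothing about the gym dataset itself.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the column names are placeholders, not the gym columns
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=42,
)
columns = [f"feature_{i}" for i in range(5)]
X = pd.DataFrame(X, columns=columns)

forest = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42)
forest.fit(X, y)

# Impurity-based importances: one value per feature, together summing to 1
importances = pd.Series(forest.feature_importances_, index=columns)
print(importances.sort_values(ascending=False))
```

In the notebook, the same idea would be `pd.Series(clf19.feature_importances_, index=features19)` after fitting `clf19`.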

# Assignment 20 - Perceptron

In this assignment, I will train a simple neural network (a multi-layer perceptron) to recognize handwritten digits from the MNIST dataset. I will use the fetch_openml function to load the dataset and the MLPClassifier from sklearn.neural_network for training. Hyperparameters such as learning rate and number of iterations will be adjusted to explore performance. At the end, I’ll evaluate how well the model performs on both the training and testing datasets.

First, I will import all the necessary Python libraries. These include tools for visualization (seaborn, matplotlib), loading the dataset (fetch_openml), preprocessing (StandardScaler), training the model (MLPClassifier), and evaluating it (accuracy_score).

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

Now I am going to load the MNIST dataset, which contains 70,000 images of handwritten digits (0–9). Each image is 28x28 pixels, flattened into a 784-length vector. The labels are cast to integers for classification.

In [None]:
mnist20 = fetch_openml('mnist_784')
X20 = mnist20.data
y20 = mnist20.target.astype(int)

Before training the neural network, I need to scale the input data using StandardScaler. This standardizes the features to have zero mean and unit variance. Scaling is essential for gradient-based optimization to converge properly.

In [None]:
scaler20 = StandardScaler()
X20_scaled = scaler20.fit_transform(X20)

Now I split the dataset into 70% training and 30% testing using train_test_split. This allows us to train the model on one portion and evaluate it on unseen data to measure generalization.

In [None]:
X20_train, X20_test, y20_train, y20_test = train_test_split(X20_scaled, y20, test_size=0.3, random_state=42)
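
One caveat on the scaling step above: fitting the `StandardScaler` on the full dataset before splitting lets statistics from the test rows influence the training data. A common way to avoid this is to split first and put the scaler and the model in one pipeline. The sketch below shows this on the small 8x8 `load_digits` dataset that ships with scikit-learn, used here only as a quick stand-in for MNIST; the network size and iteration count are arbitrary choices for the example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small bundled digits dataset, used as a quick stand-in for MNIST
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The pipeline fits the scaler on the training data only and reuses
# those statistics when transforming the test data
pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=42),
)
pipeline.fit(X_train, y_train)

train_accuracy = pipeline.score(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(f"Training accuracy: {train_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```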

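Here, I create and train an `MLPClassifier` model. On the first attempt I set the number of hidden units to 10 (a very small network), used a learning rate of 0.01, and limited training to 100 iterations. These parameters were chosen to keep training fast and interpretable for this experiment. The code below shows the settings of my final attempt; the results of each attempt are described after the code.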

After training, I use the model to make predictions on both the training and test sets, then evaluate the accuracy of those predictions.

In [None]:
learning_rate20 = 0.01
max_iter20 = 2000
hidden_layer_sizes20 = 200

model20 = MLPClassifier(
  hidden_layer_sizes=hidden_layer_sizes20,
  learning_rate_init=learning_rate20,
  max_iter=max_iter20,
  random_state=42
)

model20.fit(X20_train, y20_train)

y20_train_pred = model20.predict(X20_train)
y20_test_pred = model20.predict(X20_test)

train_accuracy20 = accuracy_score(y20_train, y20_train_pred)
test_accuracy20 = accuracy_score(y20_test, y20_test_pred)

print(f'Training Accuracy: {train_accuracy20:.4f}')
print(f'Test Accuracy: {test_accuracy20:.4f}')

On the first try, I had an accuracy of 96% on the training set and 92% on the test set. This is a good start, but I want to improve the model's performance by tuning parameters and increasing the complexity of the network. I used the following hyperparameters as a starting point: **Learning Rate**: 0.01, **Max Iterations**: 100, **Hidden Layer Size**: 10.

On the second try, I had an accuracy of 99.6% on the training set and 96.7% on the test set. This is a good improvement. I used the following hyperparameters for this attempt: **Learning Rate**: 0.01, **Max Iterations**: 1000, **Hidden Layer Size**: 100. Next, I increased the max iterations to 2000 and the hidden layer size to 200 to see if that improves the accuracy.

On the third try, I had an accuracy of 99.74% on the training set and 96.94% on the test set. This is my final result. I am happy with this result.

# Assignment 21

In this assignment, I will train a simple neural network (a multi-layer perceptron) to predict the experience level of a gym member based on their characteristics. I will use the same dataset as in assignment 4.

First, I will import all the necessary Python libraries. These include tools for visualization (seaborn, matplotlib), loading the dataset (pandas), preprocessing (StandardScaler), training the model (MLPClassifier), and evaluating it (accuracy_score).

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

Now I am going to load the gym members dataset which I chose in portfolio assignment 4. The dataset contains information about gym members, their exercise habits, and their demographics. I am going to predict the experience level of the gym members.

In [None]:
gym21 = pd.read_csv("Datasets\\gym_members_exercise_tracking.csv")
X21 = gym21[['Workout_Frequency (days/week)', 'Session_Duration (hours)', 'Calories_Burned']]
y21 = gym21['Experience_Level']

Before training the neural network, I need to scale the input data using StandardScaler. This standardizes the features to have zero mean and unit variance. Scaling is essential for gradient-based optimization to converge properly.

In [None]:
scaler21 = StandardScaler()
X21_scaled = scaler21.fit_transform(X21)

Now I split the dataset into 70% training and 30% testing using train_test_split. This allows us to train the model on one portion and evaluate it on unseen data to measure generalization.

In [None]:
X21_train, X21_test, y21_train, y21_test = train_test_split(X21_scaled, y21, test_size=0.3, random_state=42)

Here, I create and train an `MLPClassifier` model. On the first attempt I set the number of hidden units to 10 (a very small network), used a learning rate of 0.01, and limited training to 100 iterations. These parameters were chosen to keep training fast and interpretable for this experiment. The code below shows the settings of my final attempt (a learning rate of 0.05 and 1010 iterations).

After training, I use the model to make predictions on both the training and test sets, then evaluate the accuracy of those predictions.

In [None]:
learning_rate21 = 0.05
max_iter21 = 1010
hidden_layer_sizes21 = 10

model21 = MLPClassifier(
 hidden_layer_sizes=hidden_layer_sizes21,
 learning_rate_init=learning_rate21,
 max_iter=max_iter21,
 random_state=42
)

model21.fit(X21_train, y21_train)

y21_train_pred = model21.predict(X21_train)
y21_test_pred = model21.predict(X21_test)

train_accuracy21 = accuracy_score(y21_train, y21_train_pred)
test_accuracy21 = accuracy_score(y21_test, y21_test_pred)

print(f'Training Accuracy: {train_accuracy21:.4f}')
print(f'Test Accuracy: {test_accuracy21:.4f}')

On the first try, I had an accuracy of 89.57% on the training set and 90.07% on the test set. This is a good start, but I want to improve the model's performance by tuning parameters and increasing the complexity of the network.

On the second try, I had an accuracy of 89.57% on the training set and 91.10% on the test set. I used the following hyperparameters for this attempt: **Learning Rate**: 0.05, **Max Iterations**: 1010, **Hidden Layer Size**: 10. This is a slight improvement in test accuracy compared to the first try, but the training accuracy remained the same. This is my final result as I am happy with this.

The dataset and the columns I used for the training are the same as in assignment 19. The achieved test accuracy of 91.10% is also the same as in assignment 19 (random forest). This is a good result and I am happy with it. The model is able to generalize well to unseen data. The minimal difference between training and test accuracy suggests that the model is not overfitting and effectively captures the underlying patterns in the dataset.

# Assignment 22

In this assignment, I explore the application of the YOLO v8 model for object detection, specifically targeting traffic sign recognition. YOLO (You Only Look Once) is a state-of-the-art real-time object detection system known for its balance between speed and accuracy.

To begin, I import a variety of libraries necessary for different tasks involved in this project. This includes libraries for image processing (`cv2`, `PIL`), data manipulation (`numpy`, `pandas`), visualization (`matplotlib`), and the YOLO framework itself (`ultralytics`).

In [None]:
import os
import cv2
import torch
import shutil
import numpy as np
import pandas as pd
import pathlib as plb
import warnings as warn
import matplotlib.pyplot as plt

from PIL import Image
from ultralytics import YOLO
from IPython.display import display, Video
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split

# Check YOLO installation
import ultralytics
ultralytics.checks()

# Suppress specific warnings
warn.filterwarnings("ignore", category=ConvergenceWarning)


Loading the YOLO v8 model initializes the pretrained weights and network architecture. This step is crucial because it allows us to leverage transfer learning: the model has already learned to recognize a wide range of objects, and we will fine-tune it specifically for traffic sign detection.

By using a pretrained model (`yolov8m.pt`), training time is reduced significantly, and the model can achieve better performance on smaller datasets.

In [None]:
# I installed NVIDIA GPU acceleration; check that CUDA is available
import torch
print("CUDA available:", torch.cuda.is_available())
print("GPU device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU")


In [None]:
# Base path to the dataset containing traffic sign images
TSRpath = os.path.join(os.getcwd(), 'Datasets', 'TrafficSignRecognition', 'car')

# Load the pretrained YOLOv8 weights as the starting point for fine-tuning
model = YOLO('yolov8m.pt')

# Fine-tune (train) the YOLO model on the custom traffic sign dataset
results = model.train(
    data=os.path.join(TSRpath, 'data.yaml'),  # Path to dataset config
    epochs=50,                                # Number of training epochs
    imgsz=640,                                # Image size for training
    batch=16,                                 # Batch size
    name='tsd-r1',                            # Run name for experiment tracking
    device='cuda' if torch.cuda.is_available() else 'cpu',  # Use GPU if available
    workers=4,                                # Number of dataloader workers (adjust as needed)
    patience=10,                              # Early stopping if no improvement
    verbose=True                              # Print training progress
)
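
The effect of the `patience` setting can be pictured with a small stand-alone sketch. This is not the Ultralytics implementation: the function name `early_stop_epoch` and the validation scores are made up for illustration, and the sketch only assumes that training stops once the score has not improved for `patience` consecutive epochs.

```python
def early_stop_epoch(scores, patience):
    """Return the epoch index at which training would stop, or None.

    scores: one validation score per epoch (higher is better), hypothetical values.
    patience: number of epochs without improvement that are tolerated.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(scores):
        if score > best_score:
            # New best score: remember it and reset the waiting period
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here
            return epoch
    return None  # Training ran through all epochs


# Hypothetical validation scores, for illustration only
scores = [0.50, 0.60, 0.61, 0.60, 0.59, 0.61]
print("Stop at epoch index:", early_stop_epoch(scores, patience=3))
```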

After training, it is essential to evaluate the model on a validation dataset that it hasn't seen before. This step helps assess how well the model generalizes beyond the training data and identifies potential overfitting or underfitting.

Metrics such as precision, recall, and mAP (mean Average Precision) give insights into the model's detection accuracy.
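
To make these metrics concrete, here is a minimal sketch of how precision, recall, and mAP are defined. The counts, the AP values, and the class names (`stop`, `speed_limit`) are hypothetical and do not come from my training run; the helper names are also only for illustration.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from detection counts.

    tp: correct detections, fp: wrong detections, fn: missed objects.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class Average Precision (AP) values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical counts: 8 correct detections, 2 false alarms, 4 missed signs
precision, recall = precision_recall(tp=8, fp=2, fn=4)
print(f"precision={precision:.2f}, recall={recall:.2f}")

# Hypothetical per-class AP values
print("mAP:", mean_average_precision({"stop": 0.9, "speed_limit": 0.7}))
```

Precision answers "how many of the detections were correct?", while recall answers "how many of the real signs were found?".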

In [None]:
# Evaluate the trained model on the validation set to measure performance
metrics = model.val(
    data=os.path.join(TSRpath, 'data.yaml'),                # Path to dataset config
    device='cuda' if torch.cuda.is_available() else 'cpu'   # Use GPU if available
)
print(metrics)

Finally, the trained model is tested on a separate test set. This represents a real-world scenario where the model encounters new images and must detect traffic signs accurately. By setting a confidence threshold (`conf=0.25`), we control the sensitivity of detections, filtering out less certain predictions to reduce false positives. 
Saving the results allows for later review and analysis.
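
The idea behind the confidence threshold can be sketched without YOLO. The detections below are invented for illustration, and the helper `filter_by_confidence` is hypothetical; it only assumes that every detection carries a confidence score and that detections below the threshold are dropped.

```python
def filter_by_confidence(detections, conf=0.25):
    """Keep only the detections whose confidence reaches the threshold."""
    return [d for d in detections if d["confidence"] >= conf]


# Hypothetical detections for a single image
detections = [
    {"label": "stop", "confidence": 0.91},
    {"label": "speed_limit", "confidence": 0.40},
    {"label": "yield", "confidence": 0.12},
]

kept = filter_by_confidence(detections, conf=0.25)
print([d["label"] for d in kept])
```

A higher threshold removes more uncertain detections (fewer false positives) but can also discard correct ones (more missed signs).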

In [None]:
# Load the best weights from training to run inference on test images
trained = YOLO('runs/detect/tsd-r1/weights/best.pt')

# Run prediction on test images folder and save outputs
results = trained.predict(source=TSRpath + '/test/images', conf=0.25, save=True)
print("Inference completed, results saved in runs/detect/")

Visualization is a powerful tool to assess how well the model detects traffic signs in images. By displaying sample prediction images, we can judge the quality of detections, bounding box accuracy, and possible areas for improvement.
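
Besides judging the boxes by eye, bounding box accuracy can be expressed as a number with Intersection over Union (IoU), the overlap measure behind metrics such as mAP50. The sketch below assumes boxes in `(x1, y1, x2, y2)` pixel format; the example boxes are made up.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping region (0 if the boxes do not overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical predicted box and ground-truth box
predicted = (0, 0, 10, 10)
ground_truth = (5, 5, 15, 15)
print(f"IoU: {iou(predicted, ground_truth):.3f}")
```

An IoU of 1.0 means the predicted box matches the labeled box exactly, and 0.0 means the boxes do not overlap at all.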

In [None]:
# Display a sample detection image from the inference results
import glob
from IPython.display import display, Image

sample = 'runs/detect/tsd-r12'  # Folder in which the predictions were saved
if os.path.exists(sample):
    # Collect the predicted images and display the first one
    predicted_images = sorted(glob.glob(os.path.join(sample, '*.jpg')))
    if predicted_images:
        display(Image(filename=predicted_images[0]))
    else:
        print(f"No predicted images found in {sample}/")
else:
    print(f"Prediction folder {sample} does not exist")
