## Python notebook template block B

Until now, you created a new notebook for every study day with the related contents. However, when working on a project in real life, all your data and code need to be in one place. Going forward in this block, all of the code that you generate for the final project about NAC and the ILOs should be in this one template. Go back to the code you wrote in the previous weeks, evaluate whether it follows the [PEP8](https://peps.python.org/pep-0008/) style guide, and adjust where necessary. This template provides you with a natural flow through the steps of a traditional data science project. Do not forget to add clear comments to your code. If you would like to add more structure, add extra markdown blocks to explain what you are doing. You are **not** allowed to remove code blocks! All blocks in here need to be filled with code. If you did not write code for a section, leave the code block as is with the pre-filled comment. Adjust this template to your needs, and make sure that all your evidence for all of the ILOs is included.

⚠️ Important! Before handing it in, run all of your code. All your cells need to show outputs. This is necessary for grading! ⚠️

The ILOs for which you can provide evidence with this notebook are:

| ILO | Poor | Insufficient | Sufficient | Good | Excellent |
|-----|------|--------------|------------|------|-----------|
| 4.1 | x    | x            | x          | x    | x         |
| 4.2 | x    | x            | x          | x    | x*        |
| 5.0 | x    | x            | x          | x    | x         |
| 7.0 | x    | x            | x          | x    | x         |

4.2 excellent*: If you would like to showcase your graphs using Streamlit, you need to hand in a separate .py file. Provide the evidence accordingly in your learning log.




### Add imports here
When working in .py files, you usually have all your package imports at the top of your code. This makes it easy to get a good overview of the packages that you are using and importing. As of now, we are working in .ipynb, but it is good practice to already start implementing these structures. Add all the imports that you use in all of your code in the code block below. In this way, you do not need to add it in every cell.

In [4]:
# Add your package imports here
import pandas as pd
import numpy as np 
import missingno as msno 
import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    mean_squared_error,
    r2_score,
)
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

### Load the data

After your package imports, you usually load your data. This is what you will be working with and what your code will be based on.

In [5]:

# Fresh dataframe
df = pd.read_csv("N1.csv")



## Data Management and Understanding

### Data Cleansing
In the following section plug in all of your relevant python codes and explanations related to data cleansing. This is related to the poor and insufficient criteria of ILO 4.1 and 4.2.

#### explanations
I started my data preprocessing with some personal preferences: I used 'display.max_rows' to display all the rows in the output.
I used replace() to replace the spaces in the column names with '_' to make them easier to read and type.

My first pre-processing step was to see which columns have missing data. I visualized the dataset with msno.bar(), but that was not the ideal way to find missing data. By using isnull().sum() I could go through every column and see exactly how many values are missing in each of them.

"Contract expires" had a significant amount of missing data. There are many different ways to replace these missing values, but I chose to use the median of "Contract expires". I visualized the "Contract expires" data and saw that there were a couple of outliers, which is why I chose the median: it is less sensitive to outliers than the mean. Out of pure curiosity I checked the mode of "Contract expires" and it was the same as the median, namely "2024-06-30". As the values were strings, I had to convert the datatype to datetime to be able to calculate the median, which I did using 'to_datetime()'.
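
The reasoning above (convert to datetime, then prefer the median because of outliers) can be sketched on a small made-up series. The dates below are hypothetical and are not taken from the NAC data; they only show how one far-away date moves the mean but not the median.

```python
import pandas as pd

# Hypothetical contract dates: None marks a missing value and
# 2030-06-30 plays the role of an outlier.
contracts = pd.Series(
    ["2024-06-30", "2024-06-30", None, "2025-06-30", "2023-06-30", "2030-06-30"],
    name="Contract_expires",
)

# Strings must become datetimes before a median can be computed.
dates = pd.to_datetime(contracts, errors="coerce")

median_date = dates.median()  # missing values are skipped
mean_date = dates.mean()      # pulled towards the 2030 outlier

# Assign the result back instead of using inplace=True on a column.
filled = dates.fillna(median_date)

print("Median:", median_date.date())
print("Mean:", mean_date.date())
print("Missing after fill:", filled.isna().sum())
```

Assigning the filled column back (rather than `inplace=True`) is a design choice in this sketch; it avoids the chained-assignment warnings that recent pandas versions can raise.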

Then there were columns such as "Shots against per 90". Most of these columns had missing data because they depend on the player's position. I named these positional columns. A goalkeeper, for example, would generally not have data for a column such as "Received passes per 90". I replaced the empty values in these columns with 0.

The column "Foot" had me a bit stumped, as I was not sure what I should do with the missing values. Ultimately I decided that it wasn't an important variable in my dataset, so I replaced the missing values with the mode, which was "Right".

Afterwards I checked 'missing_values_count' again. There were many columns with a small amount of missing data, so I decided to move on to the rows. I deleted rows that had a lot of missing values. I used a threshold of 115, meaning that I kept all the rows with at least 115 non-NaN values. I didn't know what my threshold should be, so I kept trying different numbers until the number of missing values in every column was 0, which happens when the threshold is 115.
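
A minimal sketch of how `thresh` works in `dropna()`, and of the trial-and-error search described above, using a tiny made-up frame (the column names `a` to `d` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 4 columns; the rows have 4, 3 and 1
# non-missing values respectively.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],
    "b": [1.0, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan],
    "d": [1.0, 2.0, 3.0],
})

non_missing_per_row = toy.notna().sum(axis=1)
print(non_missing_per_row.tolist())

# thresh=3 keeps only rows with at least 3 non-NaN values.
kept = toy.dropna(thresh=3)
print(len(kept))

# Search for the smallest threshold that leaves no missing values.
for threshold in range(1, toy.shape[1] + 1):
    if toy.dropna(thresh=threshold).isna().sum().sum() == 0:
        break
print("Smallest threshold without missing values:", threshold)
```

The loop simply automates the manual search; on the real data the same idea would try thresholds up to the number of columns.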

I then saved the cleaned dataset to a new CSV file and loaded it again, because I wanted to add a new column to it.

I added a new column named "Risk_assessment". I binned the players into 3 categories based on different conditions: 'High risk', 'Risky' and 'Low risk'. I based my conditions on the information I found in an article called "Player's Guide to Red Cards, Yellow Cards and Accumulated Disciplinary Points" from the National Capital Soccer League. From reading that, I found out that 2 red cards equal 20 Disciplinary Points (which I will call DP from now on) and the player has to sit out the next game. 4 yellow cards also equal 20 DP and the player has to sit out the next game, so what I took from that is that 2 yellow cards equal 1 red card.

My conditions for 'High risk' are: a player has 2 or more red cards and 4 or more yellow cards; a player has 1 red card and 6 or more yellow cards; or a player has 0 red cards and more than 8 yellow cards. I classified these as high risk because 2 red cards or 4 yellow cards equal 20 DP, so 2 red and 4 yellow cards in my mind count as 4 red cards. The other 2 conditions for 'High risk' also add up to about 4 red cards. My conditions for 'Risky' are: a player has 2 red cards and fewer than 4 yellow cards; a player has 1 red card and fewer than 6 yellow cards; or a player has 0 red cards and between 4 and 8 yellow cards. If you convert all the yellow cards to red cards like I did before, these come out to at most 4 red cards, which is why I classified these conditions as risky. Lastly, for 'Low risk' my only condition was 0 red cards and fewer than 4 yellow cards.
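
The rules above can be sketched with `np.select` on a small made-up table. The card counts are hypothetical; the conditions follow the description above, grouped per category instead of listed one by one.

```python
import numpy as np
import pandas as pd

# Hypothetical card counts, chosen to hit every rule once.
toy = pd.DataFrame({
    "Red_cards":    [0, 0, 0, 1, 1, 2, 2, 3],
    "Yellow_cards": [2, 5, 9, 3, 6, 1, 4, 0],
})
red, yellow = toy["Red_cards"], toy["Yellow_cards"]

high_risk = (
    ((red >= 2) & (yellow >= 4))
    | ((red == 1) & (yellow >= 6))
    | ((red == 0) & (yellow > 8))
)
risky = (
    ((red == 2) & (yellow < 4))
    | ((red == 1) & (yellow < 6))
    | ((red == 0) & yellow.between(4, 8))
)
low_risk = (red == 0) & (yellow < 4)

# Rows that match no rule fall back to the default label.
toy["Risk_assessment"] = np.select(
    [high_risk, risky, low_risk],
    ["High risk", "Risky", "Low risk"],
    default="Not Classified",
)
print(toy)
```

Note that a player with more than 2 red cards and fewer than 4 yellow cards (the last row here) matches none of the rules and ends up as 'Not Classified'.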

Finally, I saved the fully pre-processed dataframe to a new CSV file.

In [6]:
# Display max rows 
pd.set_option('display.max_rows', None)

# Replace the space with underscore 
df.columns = df.columns.str.replace(' ','_')

# Count missing values per column
missing_values_count = df.isnull().sum()
print(missing_values_count)

# Convert the 'Contract_expires' column to datetime format
df['Contract_expires'] = pd.to_datetime(df['Contract_expires'], errors='coerce')

# Calculate and display the median date
median_date = df['Contract_expires'].median()
print("Median contract expiry date:", median_date)

# Replace missing values in 'Contract_expires' with the median date
df['Contract_expires'] = df['Contract_expires'].fillna(median_date)

# Replace missing values in positional columns with 0
positional_columns = ['Direct_free_kicks_on_target,_%', 'Direct_free_kicks_per_90', 
                      'Free_kicks_per_90', 'Aerial_duels_per_90.1', 'Exits_per_90', 
                      'Prevented_goals_per_90', 'Prevented_goals', 
                      'Shots_against_per_90', 'Conceded_goals_per_90']
df[positional_columns] = df[positional_columns].fillna(0)

# Replace missing values in 'Foot' with the mode
df['Foot'] = df['Foot'].fillna(df['Foot'].mode()[0])

# Keep only the rows with at least `threshold` non-missing values
threshold = 115
df = df.dropna(thresh=threshold)

# Count missing values again; every column should now be 0
missing_values_count = df.isnull().sum()
print(missing_values_count)


num_rows = df.shape[0]
print("Number of rows:", num_rows)

# Save the DataFrame to a new CSV file
df.to_csv('NAC1.csv', index=False)


KeyError: 'Contract_expires'

In [None]:
# Load the pre-processed dataframe
df2 = pd.read_csv("NAC1.csv")

# Define the conditions
conditions = [
    (df2['Red_cards'] >= 2) & (df2['Yellow_cards'] >= 4),
    (df2['Red_cards'] == 2) & (df2['Yellow_cards'] < 4),
    (df2['Red_cards'] == 1) & (df2['Yellow_cards'] >= 6),
    (df2['Red_cards'] == 1) & (df2['Yellow_cards'] < 6),
    (df2['Red_cards'] == 0) & (df2['Yellow_cards'] > 8),
    (df2['Red_cards'] == 0) & (df2['Yellow_cards'].between(4, 8, inclusive='both')),
    (df2['Red_cards'] == 0) & (df2['Yellow_cards'] < 4)
]

labels = [
    'High risk',
    'Risky',
    'High risk',
    'Risky',
    'High risk',
    'Risky',
    'Low risk'
]

# Create the column 'Risk_assessment'
df2['Risk_assessment'] = np.select(conditions, labels, default='Not Classified')

# Print the new column next to the card counts
print(df2[['Player', 'Yellow_cards', 'Red_cards', 'Risk_assessment']])

# Remove rows where 'Risk_assessment' is 'Not Classified'
df2 = df2[df2['Risk_assessment'] != 'Not Classified']

print(df2["Risk_assessment"].unique())

# Save the fully processed file
df2.to_csv('NAC.csv', index=False)

# Load the fully pre-processed dataframe
df3 = pd.read_csv("NAC.csv")


### Exploratory Data Analysis

Include all exploratory Data Analysis questions you studied in this section. This is related to the sufficient and good criteria of ILO 4.1 and 4.2. 

I'm going to start with the EDA I did to create the new column (Risk assessment), then I will provide some of the code I used for the EDA at the start of the block.

When I started with my idea I wasn't sure how I should categorize the players, so I called some basic functions on the columns I wanted to use to get some basic information about them.
I created a variable named 'cards' that contains all the red and yellow card columns so that I didn't have to keep writing them out.
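
One detail worth showing when looking for the player with the most cards: `idxmax()` returns the row label of the maximum, not the player's name. A minimal sketch with made-up players (names and counts are hypothetical):

```python
import pandas as pd

# Hypothetical players and card counts.
toy = pd.DataFrame({
    "Player": ["A. Example", "B. Sample", "C. Demo"],
    "Yellow_cards": [3, 18, 7],
})

# idxmax() gives the index label of the row with the highest value.
row_label = toy["Yellow_cards"].idxmax()

# Use that label with .loc to look up the player's name.
top_player = toy.loc[row_label, "Player"]
most_yellow = toy["Yellow_cards"].max()

print(f"{top_player} has the most yellow cards: {most_yellow}")
```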



In [None]:
# Specify which columns I want information about; in my case the red and yellow card columns
cards = ['Red_cards', 'Red_cards_per_90', 'Yellow_cards', 'Yellow_cards_per_90']

# Check the max and min for each column
cards_max = df3[cards].max()
cards_min = df3[cards].min()

# Check the mean, median and modes for each column 
cards_mean = df3[cards].mean()
cards_mode = df3[cards].mode()
cards_median = df3[cards].median()

# Calculate variance of cards
cards_var = df3[cards].var()

# Print them
print("\nThe maximum amount of each card that a player has gotten:\n",cards_max)
print("\nThe Minimum amount of each card that a player has gotten:\n",cards_min)
print("\nThe average amount of cards per player:\n",cards_mean)
print("\nThe median of cards:\n",cards_median)
print("\nThe mode of cards:\n",cards_mode)
print("\nThe variance of cards\n", cards_var)

# Find the player with the most red cards (idxmax returns the row label)
most_red_idx = df3['Red_cards'].idxmax()
player_most_red_cards = df3.loc[most_red_idx, 'Player']
most_red_cards = df3['Red_cards'].max()

# Find the player with the most yellow cards
most_yellow_idx = df3['Yellow_cards'].idxmax()
player_most_yellow_cards = df3.loc[most_yellow_idx, 'Player']
most_yellow_cards = df3['Yellow_cards'].max()

# Print the results
print(f"The player with the most red cards is {player_most_red_cards} with {most_red_cards} red cards.")
print(f"The player with the most yellow cards is {player_most_yellow_cards} with {most_yellow_cards} yellow cards.")


# Select the players with exactly 18 yellow cards
yellow_cards_18 = df3[df3['Yellow_cards'] == 18]

# Get the count of players with 18 yellow cards
num = yellow_cards_18.shape[0]

print("Number of players with 18 Yellow Cards:", num)

# Select the players with more than 16 yellow cards
more_than_16 = df3[df3['Yellow_cards'] > 16]

# Get the count of players with more than 16 yellow cards
amount = more_than_16.shape[0]

# Print the result
print("Number of players with more than 16 Yellow Cards:", amount)

# Frequency count for each category 
risk = df3['Risk_assessment'].value_counts()

# Display the frequency counts
print("Frequency counts for Risk Assessment:", risk)






#### Explanations 
The next cell contains the EDA we did at the beginning of the block; I included the parts that I found most interesting.
Further down, under 'Visualizations', I provide the questions that I plotted.

The first piece of code answers the question "Which country has the highest representation in the dataset in terms of player birthplace?". First I had to count how many players there were per country. I used idxmax() to find which country has the highest representation, and max() to get the number of times that country came up.

The second piece of code answers the question "What is the distribution of players' positions across different teams?". This was very simple to answer: I used .groupby() with "Team" and "Position" and then used .count() to count it all up.
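
A small sketch of the difference between counting players per team and counting positions within each team, on made-up data (team and position values are hypothetical). Grouping on "Team" alone gives one number per team, while grouping on both columns, or using `pd.crosstab`, gives the distribution of positions across teams.

```python
import pandas as pd

# Hypothetical squad list.
toy = pd.DataFrame({
    "Team": ["X", "X", "X", "Y", "Y"],
    "Position": ["GK", "CF", "CF", "GK", "CB"],
})

# One count per team: how many players have a position filled in.
players_per_team = toy.groupby("Team")["Position"].count()

# One count per (team, position) pair: the actual distribution.
positions_per_team = toy.groupby(["Team", "Position"]).size()

# The same distribution as a table, with 0 for absent combinations.
table = pd.crosstab(toy["Team"], toy["Position"])

print(players_per_team)
print(positions_per_team)
print(table)
```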



In [None]:
# Add your exploratory data analysis of the NAC data here. You can add Mark Down blocks (or output f-strings) to provide explanations to your code, alongside comments made in your code. 

# Count the number of players from each country
country_count = df3['Birth_country'].value_counts()

# idxmax() gives the country that appears most often in "Birth_country";
# max() gives how many players it has
highest_rep_country = country_count.idxmax()
highest_rep_count = country_count.max()

print(f"The country with the highest representation is {highest_rep_country} with {highest_rep_count} players")

# Count the players per team
team_position_count = df.groupby("Team")["Position"].count()
print("The number of players per team:", team_position_count)

# Calculate mean and median of age 
average_age = df3["Age"].mean()
median_age = df3["Age"].median()

# Show the total number of rows and columns in df 
print("Shape of df (rows, columns):", df.shape)

print("The average age of players is:", average_age)
print("The median age of players is:", median_age)

# Show the data type of every column to spot the non-numerical ones
df.info()


### Visualizations

Include all the visualizations you made in this section. This is related to the excellent criteria of ILO 4.2. Use the blocks below to enter the code for graphs you created with matplotlib (or seaborn, bokeh, or another visualization package). 

❗ If you would like to showcase your visualizations using Streamlit, you need to hand in a separate .py file for this. It is not possible to run Streamlit code from a Python notebook. Please note down below if you do so.

#### Explanations 
The first correlation plot answers the question "How does the market value of players correlate with their age?". From the correlation matrix you can see that the correlation is extremely weak, with a correlation coefficient of 0.0061.

The second visualization shows the top 10 countries with the most players.

The third visualization is the correlation between a player's height, weight, and goals. It shows that a player's height and weight are strongly correlated, with a correlation coefficient of 0.83. There is practically no correlation between weight and goals or between height and goals, with correlation coefficients of 0.064 and 0.065, respectively. While these correlations are a bit stronger than the one in the first visualization, they are still extremely weak.
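
To make the correlation coefficients above concrete, here is a minimal sketch that computes the Pearson coefficient by hand and compares it with `DataFrame.corr()`. The heights, weights and goals are made up and do not come from the NAC data.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements for five players.
toy = pd.DataFrame({
    "Height": [170, 175, 180, 185, 190],
    "Weight": [65, 72, 74, 82, 88],
    "Goals": [3, 0, 5, 1, 2],
})

# pandas computes the Pearson coefficient for every pair of columns.
corr = toy.corr()

# The same coefficient for Height and Weight, step by step:
# covariance of the centered values divided by the product of their spreads.
h = toy["Height"] - toy["Height"].mean()
w = toy["Weight"] - toy["Weight"].mean()
r_manual = (h * w).sum() / np.sqrt((h ** 2).sum() * (w ** 2).sum())

print(corr.round(3))
print("Manual Height/Weight coefficient:", round(r_manual, 3))
```

Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 a weak one.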

In [None]:
# Define size 
plt.figure(figsize=(12, 6))  

# Yellow cards
plt.subplot(1, 2, 1)
sns.histplot(df3['Yellow_cards'], color='orange')
plt.title('Distribution of Yellow Cards')
plt.xlabel('Yellow Cards')
plt.ylabel('Frequency')

# Red cards
plt.subplot(1, 2, 2)
sns.histplot(df3['Red_cards'], color='red')
plt.title('Distribution of Red Cards')
plt.xlabel('Red Cards')
plt.ylabel('Frequency')

plt.show()

In [None]:
# Set style 
sns.set(style="whitegrid") 

# Create chart
plt.figure(figsize=(8, 6))  

# Countplot to see the categories in "Risk_assessment"
sns.countplot(x='Risk_assessment', data=df3)

# Labels and titles 
plt.xlabel('Risk Assessment')
plt.ylabel('Number of Players')
plt.title('Number of Players in Each Risk Category')
plt.show()


In [None]:
plt.figure(figsize=(8, 4))
sns.boxplot(x='Red_cards', data=df3, color='salmon')
plt.title('Box Plot of Red Cards')
plt.xlabel('Number of Red Cards')
plt.show()

In [None]:


plt.figure(figsize=(8, 4))
sns.boxplot(x='Yellow_cards', data=df3, color='salmon')
plt.title('Box Plot of Yellow Cards')
plt.xlabel('Number of Yellow Cards')
plt.show()

In [None]:
# Add visualizations here that you made to present insights in the NAC data. Create a new codeblock for every graph. Add markdown blocks to describe your graphs where necessary.

# make a correlation matrix and plot it 
correlation_matrix = df3[['Market_value', 'Age']].corr()

sns.heatmap(correlation_matrix, annot=True)
plt.show()
# The correlation is 0.0061, which means the correlation is very weak, basically non-existent


In [None]:
# I want to show the top 10 countries with the most players 
# Specify top 10 
top_countries = country_count.head(10)

# Specify size 
plt.figure(figsize=(12, 6))

# Bar plot 
sns.barplot(x=top_countries.index, y=top_countries.values, palette='viridis')

# Give titles 
plt.title('Top 10 countries with the most players')
plt.xlabel('Country')
plt.ylabel('Number of players')

plt.show()

In [None]:
# I want to create a correlation matrix for variables that might be correlated
# Create and plot a correlation matrix
correlation_matrix = df3[['Weight', 'Height', 'Goals']].corr()

sns.heatmap(correlation_matrix, annot=True)
plt.show()

In [None]:
msno.bar(df3)

### Database and ETL

Include all the python code and explanations on your RESTful API and database operations in this section. This is related to the excellent criteria of ILO 4.1.

❗ You cannot showcase this code using the NAC data. Use the data provided for the homework and DataLab preparation of these modules.

In [None]:
# Include your code here for the API and ETL. This is not done on the NAC data.

## Machine Learning

### Identifying basic Machine Learning applications.
In the following subsection, show your understanding of each of the listed Machine Learning algorithms. Execute these algorithms on the NAC dataset. This is related to the poor (and insufficient) criteria of ILO 5.0. 

❗ Remember! All your package imports should be at the top of this notebook.

#### Simple machine learning modelling pipeline

#### Explanations
Simple machine learning modeling: my target variable is the risk assessment of players. I started off by using LabelEncoder() to encode "Risk_assessment", as it is a non-numerical column. Then I defined my features and target variable. I chose the features based on the correlation analysis at the bottom ("Correlation Analysis and Feature Selection"): I chose the top 8 columns with the highest correlation coefficient, excluding the red and yellow card columns. My target is of course the "Risk_assessment" column. After choosing my variables I used train_test_split to split my dataset. The test size I used was 20%, as that is what I am used to.
Update: I changed my test size to 0.40. I tried different percentages and found that most models had the same accuracy when I changed the test size between 0.20 and 0.40. One of the models had a better accuracy with the test size at 0.40, so I used 0.40 for the train/test split.
For the later models I used drop() to define my features: I decided to drop the columns I did not want rather than list the ones I did, because there are fewer columns to drop.
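
The encode-then-split steps described above can be sketched on made-up labels. The class sizes below are hypothetical, and `stratify` is an addition that the notebook does not use; it is shown here only as an option for keeping the class proportions similar in both sets when the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical, imbalanced risk labels and a dummy feature column.
risk = np.array(["Low risk"] * 30 + ["Risky"] * 15 + ["High risk"] * 5)
features = np.arange(len(risk)).reshape(-1, 1)

# LabelEncoder assigns integer codes in alphabetical order of the labels.
encoder = LabelEncoder()
target = encoder.fit_transform(risk)
print(list(encoder.classes_))  # ['High risk', 'Low risk', 'Risky']

# 40% of the rows go to the test set; stratify (an assumption, not in
# the notebook) keeps the class mix similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.4, random_state=42, stratify=target
)
print(len(X_train), len(X_test))
print(np.bincount(y_train), np.bincount(y_test))
```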

In [None]:
# Encode 'Risk_assessment' column
label = LabelEncoder()
df3['Risk_assessment'] = label.fit_transform(df3['Risk_assessment'])

# Define features and target (copy so later edits do not touch df3)
feature_columns = ['Matches_played', 'Minutes_played', 'Shots', 'xA',
                   'Conceded_goals_per_90', 'Shots_against_per_90',
                   'xG_against_per_90', 'Save_rate,_%']
XX = df3[feature_columns].copy()
yy = df3['Risk_assessment']

# Split the data 
XX_train, XX_test, yy_train, yy_test = train_test_split(XX, yy, test_size=0.4, random_state=42)




#### Linear regression

#### Explanations
I created a linear regression model using LinearRegression() and trained it on the training data using .fit(). I used predict() to make predictions on the testing data (XX_test) and stored the predicted values in yy_pred. I used the mean squared error (MSE) and R² to evaluate the performance of the model; the MSE was 0.63 and R² was 0.01. The MSE measures the average squared difference between the predicted and actual values, and the ideal value is 0, so an MSE of 0.63 means there is a fair amount of error in this model's predictions. This is supported by the R² score, which measures how much of the variance in the target variable the model explains. A score of 1 indicates a perfect fit and 0 means the model explains none of the variance; on test data the score can even be negative when the model does worse than always predicting the mean. A score of 0.01 means that the model captures almost none of the variability in the target variable.
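
A minimal sketch of how the two metrics are computed, using made-up true values and predictions (not the notebook's results). It also shows that R² drops below 0 when predictions are worse than always predicting the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical encoded targets and model predictions.
y_true = np.array([0, 1, 1, 2, 1, 2])
y_pred = np.array([1.0, 1.0, 1.2, 1.4, 0.9, 1.5])

# MSE: the average of the squared errors.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R²: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE:", mse_manual, "sklearn:", mean_squared_error(y_true, y_pred))
print("R2:", r2_manual, "sklearn:", r2_score(y_true, y_pred))

# Predictions that are worse than the mean give a negative R².
y_bad = np.array([2, 0, 0, 0, 2, 0])
r2_bad = r2_score(y_true, y_bad)
print("R2 of bad predictions:", r2_bad)
```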

In [None]:

# Create linear regression model
linear_model = LinearRegression()

# Train the model 
linear_model.fit(XX_train, yy_train)

# Predictions on the testing data
yy_pred = linear_model.predict(XX_test)

# Evaluate the model
mse = mean_squared_error(yy_test, yy_pred)
r2 = r2_score(yy_test, yy_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")



#### Logistic regression

#### Explanations
The steps for logistic regression are a lot like the steps for linear regression. I created the model using LogisticRegression() and trained it using fit(). Like before, I used predict() to make predictions on XX_test and stored them in yy_pred. Then I evaluated the model with accuracy_score, which compares the predictions to the test labels. The accuracy was 0.61 (61%), which is not bad, especially compared to the outcome of the linear regression model, but it could be better. I also added a confusion matrix and a classification report to get better insight; I will use this information when evaluating and comparing the performance of the different models.
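
Whether an accuracy like 61% is good depends on how common the largest class is. One hedged way to check this is a majority-class baseline with scikit-learn's `DummyClassifier`. The class counts below are made up and the features are placeholders, so the baseline value here says nothing about the NAC data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical encoded labels (0 = high risk, 1 = low risk, 2 = risky).
y_train = np.array([1] * 60 + [2] * 30 + [0] * 10)
y_test = np.array([1] * 12 + [2] * 6 + [0] * 2)

# The baseline ignores the features, so placeholders are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predict the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

print("Baseline accuracy:", baseline_acc)
```

A real model is only adding value if its accuracy is clearly above this kind of baseline.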

In [None]:
# Create logistic regression model
logreg_model = LogisticRegression()

# Train the model 
logreg_model.fit(XX_train, yy_train)

# Predictions on the testing data
yy_pred = logreg_model.predict(XX_test)

# Evaluate the model
accuracy = accuracy_score(yy_test, yy_pred)
conf_matrix = confusion_matrix(yy_test, yy_pred)
class_report = classification_report(yy_test, yy_pred)

print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n", class_report)
print("\nConfusion Matrix:\n", conf_matrix)


#### Tree-based model

#### Explanations 
I chose Random Forest for my tree-based model because there is already a Gradient Boosting Trees section below and Random Forest tends to produce more accurate predictions than a single decision tree. 
I started like with all the other models, by creating the Random Forest model with RandomForestClassifier. I trained the model using fit() and made predictions using predict(). I used accuracy_score to calculate the accuracy of the model, which is about 0.64 (64%). I also added a classification report and confusion matrix for my own insight. 

In [None]:
# Convert the non-numerical data
le = LabelEncoder()
for col in XX.columns:
    if XX[col].dtype == 'object':
        XX[col] = le.fit_transform(XX[col])

# Split the data 
XX_train, XX_test, yy_train, yy_test = train_test_split(XX, yy, test_size=0.4, random_state=42)


# Create Random Forest Classifier
rf_model = RandomForestClassifier()

# Train the model
rf_model.fit(XX_train, yy_train)

# Make predictions 
yy_pred = rf_model.predict(XX_test)

# Evaluate the model
accuracy = accuracy_score(yy_test, yy_pred)
conf_matrix = confusion_matrix(yy_test, yy_pred)
classification_rep = classification_report(yy_test, yy_pred)

print("Accuracy:", accuracy)
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", conf_matrix)

#### Gradient Boosting Trees and SVM

#### Explanations
I used GradientBoostingClassifier() to create the gradient boosting classifier. I used fit() to train the model and predict() to make predictions on the test set, which I stored in yy_gb. Lastly, I calculated the accuracy and the confusion matrix. The accuracy is the highest I've gotten across the different models: about 65% (0.65). 

In [None]:

# Create Gradient Boosting Classifier
gb = GradientBoostingClassifier()

# Train the model
gb.fit(XX_train, yy_train)

# Make predictions
yy_gb = gb.predict(XX_test)

# Evaluate the model
accuracy_gb = accuracy_score(yy_test, yy_gb)
conf_matrix_gb = confusion_matrix(yy_test, yy_gb)
classification_rep_gb = classification_report(yy_test, yy_gb)

print("Accuracy:", accuracy_gb)
print("\nConfusion Matrix:\n", conf_matrix_gb)



#### Explanations
We didn't cover SVM in our DataLab (Prep), so I had to google what it is first. The steps are the same as for the previous models: I created the SVM, trained the model and made predictions. The accuracy is 0.64, which is higher than that of the random forest. 

In [None]:
# Create SVM
svm = SVC()

# Train the model
svm.fit(XX_train, yy_train)

# Make predictions 
yy_svm = svm.predict(XX_test)

# Evaluate the model
accuracy_svm = accuracy_score(yy_test, yy_svm)
conf_matrix_svm = confusion_matrix(yy_test, yy_svm)
classification_rep_svm = classification_report(yy_test, yy_svm)

print("Accuracy:", accuracy_svm)
print("\nConfusion Matrix:\n", conf_matrix_svm)


#### Unsupervised learning with K-Means

I needed to combine the test and training sets because they were inconsistent. I used LabelEncoder to encode the training and testing sets like I did above, but it kept giving me an error. With the help of my sister, I combined the two sets, encoded them, and then split the combined set back into the original training and testing sets. To ensure that the features are on a similar scale, I used StandardScaler() to standardize the data. My earlier code returned something totally different, so I asked Edirlei for help, and we decided on using 3 clusters. fit_predict() was used to fit the model on the training set and assign its clusters, and predict() was used to assign clusters to the test set. We created a new column named "Clusters" and displayed the top rows using head(). 
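
The scale-then-cluster flow described above can be sketched on made-up points. The three groups and their scales are hypothetical, and `n_init=10` is set explicitly here as an assumption so the sketch behaves the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Made-up data: three groups of 30 points whose two features
# live on very different scales (roughly 0-20 and 0-1000).
centers = np.array([[0.0, 0.0], [10.0, 1000.0], [20.0, 0.0]])
points = np.vstack(
    [center + rng.normal(scale=[1.0, 50.0], size=(30, 2)) for center in centers]
)

# Shuffle, then use 60 points for fitting and 30 as unseen data.
order = rng.permutation(len(points))
train, test = points[order[:60]], points[order[60:]]

# Fit the scaler on the training data only, then reuse it on the test data.
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)

# fit_predict learns the clusters; predict assigns new points to them.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
train_clusters = kmeans.fit_predict(train_scaled)
test_clusters = kmeans.predict(test_scaled)

print(np.bincount(train_clusters))
print(np.bincount(test_clusters, minlength=3))
```

Without scaling, the second feature would dominate the distance calculation simply because its numbers are larger.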

In [None]:
# Combine sets 
combined_df = pd.concat([XX_train, XX_test])

le = LabelEncoder()
for col in combined_df.columns:
    if combined_df[col].dtype == 'object':
        combined_df[col] = le.fit_transform(combined_df[col])

# Split the sets back (copy so new columns can be added without warnings)
XX_train_cluster = combined_df.iloc[:len(XX_train)].copy()
XX_test_cluster = combined_df.iloc[len(XX_train):].copy()

# Standardize the data 
scaler = StandardScaler()
XX_train_cluster_scaled = scaler.fit_transform(XX_train_cluster)
XX_test_cluster_scaled = scaler.transform(XX_test_cluster)

# k-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)  
edirlei_train = kmeans.fit_predict(XX_train_cluster_scaled)
edirlei_test = kmeans.predict(XX_test_cluster_scaled)

# Add cluster assignments as new columns
XX_train_cluster["Clusters"] = edirlei_train
XX_test_cluster["Clusters"] = edirlei_test

print(XX_test_cluster.head())
kmeans.cluster_centers_



#### Correlation Analysis and Feature Selection


#### Explanations 
I did this correlation analysis at the beginning, to help me choose the features I used for the previous models. 
I started with LabelEncoder() to encode all the non-numerical variables, then I used .corr() to create the correlation matrix. I wanted to know which variables had the highest correlation with my target, so I set the target as "Risk_assessment" and sorted the values using ascending=False.
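
A sketch of ranking features by their correlation with the target, on made-up data. The feature names are hypothetical. One assumption differs from the notebook: instead of LabelEncoder (which codes the labels alphabetically), the risk labels are mapped to an explicit order from low to high, so that a larger number always means more risk and the correlation is easier to interpret.

```python
import pandas as pd

# Hypothetical players with two made-up numeric features.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F"],
    "Risk_assessment": ["Low risk", "Risky", "High risk",
                        "Risky", "High risk", "Low risk"],
    "feature_a": [1, 3, 6, 4, 5, 0],
    "feature_b": [5, 1, 4, 4, 1, 3],
})

# Explicit ordinal mapping (an assumption, not the notebook's encoding).
risk_order = {"Low risk": 0, "Risky": 1, "High risk": 2}
toy["Risk_level"] = toy["Risk_assessment"].map(risk_order)

# Correlate the numeric columns only, then rank by absolute correlation
# with the target, leaving out the target itself.
correlation = toy.select_dtypes("number").corr()
ranking = (
    correlation["Risk_level"]
    .drop("Risk_level")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```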

In [None]:
label_encoder = LabelEncoder()
for col in df3.columns:
    if df3[col].dtype == 'object':
        df3[col] = label_encoder.fit_transform(df3[col])

# Calculate correlation matrix
correlation = df3.corr()

# Specify target
target = correlation['Risk_assessment']

# Sort features from highest correlation
features = target.abs().sort_values(ascending=False)

print(features)


#### Explanations
The accuracy of the models wasn't as high as I had hoped; my highest was Gradient Boosting Trees with 65%. I'm going to use every single column except 'Risk_assessment', 'Red_cards', 'Yellow_cards', 'Red_cards_per_90' and 'Yellow_cards_per_90'. I'm dropping these columns because I used them to determine the risk level, and I want my model to predict the risk level based on other features. By using every other variable I'm hoping to increase the accuracy.

In [None]:
# Define the features and target variable; these are used for the rest of the models
yy = df3['Risk_assessment']
XX = df3.drop(['Risk_assessment', 'Red_cards', 'Yellow_cards', 'Red_cards_per_90', 'Yellow_cards_per_90'], axis=1)


XX_train, XX_test, yy_train, yy_test = train_test_split(XX, yy, test_size=0.4, random_state=42)

✍️ I chose the features above because ...

### Evaluating the performance of the model

In the following subsection include your Python code on how you evaluated your chosen model(s). This is related to the sufficient criteria of ILO 5.0. 

#### Explanations

After talking to different mentors, I concluded that we don't have to evaluate every single model. They told me to choose either the most relevant ones or the ones with the best results. I chose Gradient Boosting Trees and Random Forest, as they gave the best results and are both classification models. 

#### Random Forest

In [None]:
# Convert the non-numerical data
le = LabelEncoder()
for col in XX.columns:
    if XX[col].dtype == 'object':
        XX[col] = le.fit_transform(XX[col])
        
# Split the data 
XX_train, XX_test, yy_train, yy_test = train_test_split(XX, yy, test_size=0.4, random_state=42)

# Create Random Forest Classifier
rf = RandomForestClassifier()

# Train the model
rf.fit(XX_train, yy_train)

# Make predictions 
yy_pred = rf.predict(XX_test)

# Evaluate the model
accuracy = accuracy_score(yy_test, yy_pred)
conf_matrix = confusion_matrix(yy_test, yy_pred)
classification_rep = classification_report(yy_test, yy_pred)

print("Accuracy:", accuracy)
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", conf_matrix)


#### Gradient Boosting Trees

In [None]:
# Create Gradient Boosting Classifier
gb = GradientBoostingClassifier()

# Train the model
gb.fit(XX_train, yy_train)

# Make predictions
yy_gb = gb.predict(XX_test)

# Evaluate the model
accuracy_gb = accuracy_score(yy_test, yy_gb)
conf_matrix_gb = confusion_matrix(yy_test, yy_gb)
classification_rep_gb = classification_report(yy_test, yy_gb)

print("Accuracy:", accuracy_gb)
print("\nConfusion Matrix:\n", conf_matrix_gb)
print("\nClassification Report:\n", classification_rep_gb)

#### Interpretation

The classes are encoded as 0 = high risk, 1 = low risk, 2 = risky.

Class 1 (low risk) has the highest precision and recall, which means that the models are good at predicting low-risk players. Both models have a very low recall for class 0 (high risk), meaning that there is a high number of false negatives for that class. Random Forest has a higher precision for class 2 (risky), but Gradient Boosting Trees has a slightly better balance (f1-score). Based on the classification reports, Gradient Boosting Trees performs slightly better than Random Forest.
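
To show where the precision and recall in a classification report come from, here is a minimal sketch that derives them from a confusion matrix. The labels and predictions are made up and only imitate the pattern described above (weak recall for class 0).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true classes and predictions
# (0 = high risk, 1 = low risk, 2 = risky).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2])

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Recall: correct predictions divided by all true members of the class.
recall = np.diag(cm) / cm.sum(axis=1)

# Precision: correct predictions divided by all predictions of the class.
precision = np.diag(cm) / cm.sum(axis=0)

print("Recall per class:", recall.round(2))
print("Precision per class:", precision.round(2))
```

In this toy example only 1 of the 4 high-risk players is found (recall 0.25), while every player that is predicted as high risk really is high risk (precision 1.0).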

### Improving the performance of the model

In the following subsection include your Python code on how you improved your chosen model(s). This is related to the good criteria of ILO 5.0.  

#### Explanations 

First, I use cross_val_score to check the consistency of both models. Second, I used RandomizedSearchCV() to search for hyperparameters for the models. The search for the Random Forest returned max_depth=20, min_samples_leaf=1, min_samples_split=10 and n_estimators=150. I entered these values into the Random Forest model and got an accuracy of 69%, which after rounding is the same as without tuned hyperparameters. The accuracy improved only slightly, from 0.697 to 0.699; this is not a big difference, but setting the hyperparameters does add some value to my model. It can help with reproducibility, and the model can perform more consistently if new data were added to the dataset. For my Gradient Boosting model I also performed a random search, which is the code that I changed to markdown because it ran for hours. After adding the parameters the accuracy decreased ever so slightly, so I think it is a bit of a personal preference. I prefer adding the parameters, even though it decreased the accuracy a bit, to ensure reproducibility.
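As a rough illustration of what 5-fold cross-validation does with the data, the sketch below splits sample indices into five folds, each used once as the test set. The function name `k_fold_indices` is hypothetical, and the sketch is simplified: for classifiers, scikit-learn stratifies the folds by class, which is left out here.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, n_splits=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample appears in exactly one test fold, so the five scores together describe how stable a model is across different parts of the data.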

In [None]:
# Perform cross-validation
cv_rf = cross_val_score(rf, XX, yy, cv=5)
cv_gb = cross_val_score(gb, XX, yy, cv=5)

print("Cross validation scores for Random Forest:", cv_rf)
print("Cross validation scores for Gradient Boosting Trees:", cv_gb)

#### Random Forest 

In [None]:
# Define the grid
param_dist = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(1, 5)
}

# Create RandomizedSearchCV object
search = RandomizedSearchCV(rf, param_distributions=param_dist, n_iter=10, cv=5, scoring='accuracy', random_state=42)

# Perform random search on the training data
search.fit(XX_train, yy_train)

# Get the best hyperparameters
best_params_rf = search.best_params_
print(best_params_rf)


In [None]:
# Convert the non-numerical data
le = LabelEncoder()
for col in XX.columns:
    if XX[col].dtype == 'object':
        XX[col] = le.fit_transform(XX[col])
        
# Split the data 
XX_train, XX_test, yy_train, yy_test = train_test_split(XX, yy, test_size=0.4, random_state=42)

# Create Random Forest Classifier
rf = RandomForestClassifier(max_depth=20, min_samples_leaf=1,
                            min_samples_split=10, n_estimators=150)

# Train the model
rf.fit(XX_train, yy_train)

# Make predictions 
yy_pred = rf.predict(XX_test)

# Evaluate the model
accuracy = accuracy_score(yy_test, yy_pred)
conf_matrix = confusion_matrix(yy_test, yy_pred)
classification_rep = classification_report(yy_test, yy_pred)

print("Accuracy:", accuracy)
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", conf_matrix)

#### Gradient Boosting Trees

In [None]:
# Define the grid
param_dist = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(1, 5)
}

# Create RandomizedSearchCV object
search = RandomizedSearchCV(gb, param_distributions=param_dist, n_iter=10, cv=5, scoring='accuracy', random_state=42)

# Perform random search on the training data
search.fit(XX_train, yy_train)

# Get the best hyperparameters
best_params_gb = search.best_params_
print(best_params_gb)

In [None]:
# Create Gradient Boosting Classifier
gb = GradientBoostingClassifier(max_depth=20, min_samples_leaf=1,
                                min_samples_split=10, n_estimators=150)

# Train the model
gb.fit(XX_train, yy_train)

# Make predictions
yy_gb = gb.predict(XX_test)

# Evaluate the model
accuracy_gb = accuracy_score(yy_test, yy_gb)
conf_matrix_gb = confusion_matrix(yy_test, yy_gb)
classification_rep_gb = classification_report(yy_test, yy_gb)

print("Accuracy:", accuracy_gb)
print("\nConfusion Matrix:\n", conf_matrix_gb)
print("\nClassification Report:\n", classification_rep_gb) 

Steps I took to improve my model: I started by changing the features. Instead of choosing which features I wanted to use, I dropped the ones I didn't want to use (the ones I had used myself to determine the risk level). The accuracy increased after I changed the features. Afterwards, I used cross-validation to check the consistency of both models. I did a randomized search for both models; after adding the parameters to their respective models, I observed that they did not have a big impact on either accuracy score. While that is not ideal, it is also not bad: as I mentioned above, the parameters are valuable to the models because they help with reproducibility and more consistent performance.

Best Hyperparameters: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 150}
Test Set Accuracy: 0.6928719502471695

### Choose the best model

In the following subsection reflect on the most appropriate machine learning model. This is related to the excellent criteria of ILO 5.0.  

#### Explanations 
I concluded that the Random Forest model is the best model for me. While its accuracy score is slightly lower than that of Gradient Boosting Trees, there are other considerations to keep in mind. From the cross-validation you can see that the Random Forest is much more consistent and that the average of its scores is also higher, which means that the model is more reliable and stable. Furthermore, Random Forest models are a lot more computationally efficient: it took me almost 4 hours to perform a random search on the Gradient Boosting model, compared to 2 minutes for Random Forest. So while accuracy is important, it is not the sole criterion for choosing the best model.
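One way to put "more consistent" into numbers is to compare the mean and the standard deviation of the fold scores. The sketch below uses invented scores purely for illustration; they are not the cross-validation results of this notebook.

```python
from statistics import mean, stdev

# Hypothetical fold scores, made up for illustration only.
scores = {
    "Random Forest": [0.70, 0.69, 0.71, 0.70, 0.70],
    "Gradient Boosting": [0.72, 0.66, 0.71, 0.65, 0.70],
}

# A higher mean suggests better average accuracy,
# a lower standard deviation suggests more stable results across folds.
summary = {name: (mean(vals), stdev(vals)) for name, vals in scores.items()}
for name, (avg, spread) in summary.items():
    print(f"{name}: mean={avg:.3f} std={spread:.3f}")
```

With real results, the same two numbers can be computed from the arrays returned by cross_val_score.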

In [None]:
# Add your code here for comparing your models. Describe in the markdown below why the model you chose is the best model.
# Perform cross-validation
cv_rf = cross_val_score(rf, XX, yy, cv=5)
cv_gb = cross_val_score(gb, XX, yy, cv=5)

print("Cross validation scores for Random Forest:", cv_rf)
print("Cross validation scores for Gradient Boosting Trees:", cv_gb)

In [None]:
print("Cross validation score mean, Random Forest:", cv_rf.mean())
print("Cross validation score mean, Gradient Boost:", cv_gb.mean())

✍️ The model I chose is the best because ...

### Linear Algebra and Calculus

In the following subsection, provide the related evidences for ILO7.0.

### Assignment for "Elementary Operation on Matrices"

This task is associated with the 'Poor' criterion of ILO 7.0. You can find the assignment [here](https://adsai.buas.nl/Study%20Content/Advanced%20Python/AssignElemOpe.html).   

Please provide the related link to the PDF file for Task 1 of assignment on elementary operations on matrices. 

# Provide the link to the assignment on elementary operations on matrices here
https://github.com/BredaUniversityADSAI/2023-24b-fai1-adsai-CelineWu231265/blob/main/Deliverables%20/ILO%207%20/EleOpeMat_231265.pdf

Please provide your code for Task 2 of assignment on elementary operations on matrices.

Task 2.1


In [None]:
# Compute the transpose of matrices A and B
# To compute the transpose of a matrix you can use either the numpy.transpose function or the .T attribute.
# The first list is the first row of the matrix and the second list is the second row
import numpy as np

Matrix_A = [[3, -5],[-2, 7]]
Matrix_A_transpose = np.transpose(Matrix_A)
print("Matrix A transpose:")
print(Matrix_A_transpose)

Matrix_B = [[2, -3, 4], 
            [-5, 6, 7], 
            [-8, 9, 1]]
Matrix_B_transpose = np.transpose(Matrix_B)
print("Matrix B transpose:")
print(Matrix_B_transpose)

# Compute the element-wise product of matrices A and B 
# The matrices are both 3 x 3 so I'd have to have 3 lists per variable, I need 2 variables because there's matrix A and B 
# I'm not sure why i have to use .array here when I didn't above
array1 = np.array([[3, 2, -1], [-2, 7, 4], [1, 6, 8]])
array2 = np.array([[2, -3, -4], [-5, -6, 7], [-8, 9, 1]])

element_wise = np.multiply(array1, array2)
print("Element-wise product of matrices A and B:")
print(element_wise)

# Compute the matrix product of matrices A and B 
print("Matrix product of matrices A and B:")
print(np.dot(array1, array2))

# Compute the inverse of matrices A and B
array1 = np.array([[3, 2, -1], [-2, 7, 4], [1, 6, 8]])
array2 = np.array([[2, -3, -4], [-5, -6, 7], [-8, 9, 1]])

# Compute the inverse of each matrix
inverse_a = np.linalg.inv(array1)
inverse_b = np.linalg.inv(array2)

print("Inverse of matrix A:")
print(inverse_a)

print("Inverse of matrix B:")
print(inverse_b)


Task 2.2
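Several of the properties in this task involve the matrix product and the matrix inverse. In NumPy, `*` multiplies arrays element by element, `@` is the matrix product, and `^` is Python's bitwise XOR rather than a power or an inverse. The sketch below shows the difference on two small matrices chosen for illustration (they are not the matrices of the assignment).

```python
import numpy as np

# Small example matrices chosen for illustration (not the assignment's matrices).
P = np.array([[1, 2], [3, 4]])
Q = np.array([[0, 1], [1, 0]])

elementwise = P * Q        # multiplies matching entries
matrix_product = P @ Q     # row-by-column matrix multiplication

print("P * Q =\n", elementwise)
print("P @ Q =\n", matrix_product)

# Matrix multiplication is generally not commutative.
print("P @ Q equals Q @ P:", np.array_equal(P @ Q, Q @ P))

# Inverses produce floats, so compare with a tolerance instead of exact equality.
lhs = np.linalg.inv(P @ Q)
rhs = np.linalg.inv(Q) @ np.linalg.inv(P)
print("(PQ)^-1 equals Q^-1 P^-1:", np.allclose(lhs, rhs))
```

Using `np.allclose` for the inverse checks avoids false negatives caused by floating-point rounding.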

In [None]:
# Matrix multiplication 

# (A.T).T = A
# define the matrices 
# verify the properties listed in the task 
# check whether the properties are true or not using np.array_equal
A = np.array([[3,2,-1],[-2,7,4],[1,6,8]])
B = np.array([[-1,2,3],[5,-4,9],[-7,8,6]])
C = np.array([[-5,4,9],[6,1,3],[7,2,-8]])

First = np.array_equal((A.T).T, A)

print("(A.T).T is equal to A:", First)

#  A + B = B + A
Second = np.array_equal(A + B, B + A)
print("A + B is equal to B + A:", Second)

# A + (B + C) = (A + B) + C
Third = np.array_equal(A + (B + C), (A + B) + C)
print("A + (B + C) is equal to (A + B) + C:", Third)

# (A + B).T = A.T + B.T
Fourth = np.array_equal((A + B).T, A.T + B.T)
print("(A + B).T is equal to A.T + B.T:", Fourth)

# AB != BA (the matrix product uses @; * is the element-wise product)
Fifth = not np.array_equal(A @ B, B @ A)
print("AB is not equal to BA:", Fifth)

# A(BC) = (AB)C
Sixth = np.array_equal(A @ (B @ C), (A @ B) @ C)
print("A(BC) is equal to (AB)C:", Sixth)

# A(B + C) = AB + AC
Seventh = np.array_equal(A @ (B + C), A @ B + A @ C)
print("A(B+C) is equal to AB + AC:", Seventh)

# (AB).T = B.T A.T
Eighth = np.array_equal((A @ B).T, B.T @ A.T)
print("(AB).T is equal to B.T A.T:", Eighth)

# Inverses are computed with np.linalg.inv (^ is bitwise XOR in Python).
# They contain floats, so compare with np.allclose instead of np.array_equal.
inv = np.linalg.inv

# (AB)^-1 = B^-1 A^-1
Ninth = np.allclose(inv(A @ B), inv(B) @ inv(A))
print("(AB)^-1 is equal to B^-1 A^-1:", Ninth)

# (A.T)^-1 = (A^-1).T
Tenth = np.allclose(inv(A.T), inv(A).T)
print("(A.T)^-1 is equal to (A^-1).T:", Tenth)

# (α + β)A = αA + βA
alpha = 2
beta = 3
eleventh = np.array_equal((alpha + beta) * A, alpha * A + beta * A)
print("(α + β)A is equal to αA + βA:", eleventh)

# α(A + B) = αA + αB
twelve = np.array_equal(alpha * (A + B), alpha * A + alpha * B)
print("α(A + B) is equal to αA + αB:", twelve)

# (αA)^-1 = α^-1 A^-1
thirteenth = np.allclose(inv(alpha * A), (1 / alpha) * inv(A))
print("(αA)^-1 is equal to α^-1 A^-1:", thirteenth)

### Assignment for  "Linear Regression Model Using Normal Equations"

This task is associated with the 'Poor' criterion of ILO 7.0. You need to complete the assignment on linear regression using normal equations in the middle of [this page](https://adsai.buas.nl/Study%20Content/Advanced%20Python/6.AdvancedNumPyMatPlotlib.html).
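For reference, the normal equations estimate the coefficients as the solution of (XᵀX)θ = Xᵀy. The sketch below is one possible way to compute this on noise-free toy data (assumed here for illustration, with the `_demo` names being hypothetical). It solves the linear system with `np.linalg.solve` instead of forming the inverse explicitly, and cross-checks the result with `np.linalg.lstsq`.

```python
import numpy as np

# Noise-free toy data on the line y = 2x + 3 (chosen for illustration).
x_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = 2 * x_demo + 3

# Design matrix with a column of ones for the intercept.
X_demo = np.column_stack([np.ones_like(x_demo), x_demo])

# Normal equations: solve (X^T X) theta = X^T y without forming the inverse.
theta_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Cross-check with NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print("normal equations:", theta_normal)
print("lstsq:", theta_lstsq)
```

Both approaches should agree on an intercept of about 3 and a slope of about 2 for this toy data.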

In [None]:
# Task 1.1
# Coefficients
A = np.array([
    [1, 3, 1],
    [1, 1, 0],
    [1, -1, 1]
])

# Right-hand side values
B = np.array([9, 10, 8])

# Solve the system of equations
solution = np.linalg.solve(A, B)

print("Solution for task 1.1")
print("1 =", solution[0])
print("2 =", solution[1])
print("3 =", solution[2])

# Task 1.2
# Coefficients 
A = np.array([
    [5, 6, -7, 1],
    [1, 2, 3, 4],
    [1, 0, 1, 0],
    [1, -3, 0, 0]
])

# Right-hand side values
B = np.array([8, 7, 9, 12])

# Solve the system of equations
solution = np.linalg.solve(A, B)

print("Solution for task 1.2")
print("1 =", solution[0])
print("2 =", solution[1])
print("3 =", solution[2])
print("4 =", solution[3])

# Task 2 
np.random.seed(1358)

n_sample = 10
x = np.linspace(1, 5, n_sample)
e = 0.1 * np.random.randn(n_sample)

y = 2 * x + 3 + e

X = np.vstack([np.ones_like(x), x]).T  # Add a column of ones for the intercept
theta = np.linalg.inv(X.T @ X) @ X.T @ y

theta0, theta1 = theta[0], theta[1]

print("Theta0 (intercept):", theta0)
print("Theta1 (slope):", theta1)

# Task 3 
np.random.seed(1358)

n_sample = 30
x = np.linspace(1, 10, n_sample)
e = 0.2 * np.random.randn(n_sample)

y = 3 + 2 * x + 7 * x**2 + e

X = np.vstack([np.ones_like(x), x, x**2]).T

theta = np.linalg.inv(X.T @ X) @ X.T @ y

theta0, theta1, theta2 = theta[0], theta[1], theta[2]

print("Theta0 (intercept):", theta0)
print("Theta1 (slope for x):", theta1)
print("Theta2 (slope for x^2):", theta2)


### Assignment for "Calculus for Machine Learning"

This task is associated with the "Insufficient" criterion in ILO 7.0. 

You need to complete the [Differential Calculus](https://www.khanacademy.org/math/differential-calculus) course on Khan Academy and provide a link to the PDF file of the certificate of completion that you have put in your personal GitHub repository.

Khan academy screenshot links
https://github.com/BredaUniversityADSAI/2023-24b-fai1-adsai-CelineWu231265/blob/15da3dfe8b66d66c5b2a551fc8f9f5775ddea632/Deliverables%20/ILO%207%20/CalMacLea2_231265.png
https://github.com/BredaUniversityADSAI/2023-24b-fai1-adsai-CelineWu231265/blob/15da3dfe8b66d66c5b2a551fc8f9f5775ddea632/Deliverables%20/ILO%207%20/CalMacLea_231265.png

### Assignment for "DataLab: Python for Symbolic Mathematics"

This task is associated with the "Insufficient" criterion in ILO 7.0. 

You need to complete all the DataLab tasks (Tasks 1-5) at the end of [this page](https://adsai.buas.nl/Study%20Content/Advanced%20Python/28.SymbolicMathematicsDataLab.html). Provide your code in the following cells.

### Task 1 

In [None]:
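# Import everything the cells in this section use (including the sp alias)
import sympy as sp
from sympy import (symbols, Eq, solve, expand, limit, diff, integrate,
                   sin, cos, tan, cot, ln, exp, sqrt, pi)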
# Define symbolic variable
x, y = symbols('x y')

# Definition of the expression
ex1 = 2 * x**2 -x * y + 3
ex1

In [None]:
ex2 = (x * ex1 + (2 * x + y)) / (x**2 + y)
ex2

In [None]:
from sympy import symbols, Eq, solve
from sympy import expand
expand(ex2)

In [None]:
ex2.evalf(subs={x:-2, y:1})

In [None]:
# Define the variables
x, y = symbols('x y')

# Define the system of equations
equation1 = Eq(2*x + y, 5)
equation2 = Eq(x - 2*y, 15)

# Collect the equations in a list
system_of_equations = [equation1, equation2]

# Solve the system of equations
solution = solve(system_of_equations, (x, y))

print("Solution:", solution)

In [None]:
from sympy import symbols, Eq, solve, ln

# Define the equation
equation = Eq(2*x + 5, 11)

# Solve the equation
solution = solve(equation, x)
solution

#### Limit Computation (Optional)

In [None]:
from sympy import limit, symbols
from sympy import sin
from sympy import diff, cos, cot

# Define the function
f = sin(x) / x

# calculate the limit as x approaches 0
lim_result = limit(f, x, 0)
lim_result

#### Derivative Computation

In [None]:
# Define the variable 
x = symbols('x')

# Define the function
f = x**3 + 3 * x**2 + sin(x)

# Calculate the derivative
der_f = diff(f, x)

der_f

#### Integral Computation (Optional)

In [None]:
x = symbols('x')

# Define the function
f = x*sin(x)

# Compute the indefinite integral
indefinite_integral = integrate(f,x)
indefinite_integral 

In [None]:
x = symbols('x')

# Define the function
f = cos(x)

# Compute the definite integral
definite_integral = integrate(f, (x,0,pi/2)) 
definite_integral

#### Taylor Series (Optional)

In [None]:
x = symbols('x')
f = exp(x)

# Compute the terms of the Taylor series
taylor_series = f.series(x, 0, 4).removeO()

# Display the terms of the Taylor series
taylor_series

#### Least Squares Problem
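For a straight-line fit y = mx + c, setting the partial derivatives of the summed squared error to zero gives two linear equations: m·Σx² + c·Σx = Σxy and m·Σx + c·n = Σy. The following is a minimal sketch in plain Python that solves them in closed form for the four data points used in this section; the variable names are illustrative only.

```python
# The four data points used in this section.
points = [(1, 2), (2, 3), (3, 4), (4, 5)]

n = len(points)
sum_x = sum(px for px, _ in points)
sum_y = sum(py for _, py in points)
sum_xx = sum(px * px for px, _ in points)
sum_xy = sum(px * py for px, py in points)

# Solve the 2x2 system
#   m * sum_xx + c * sum_x = sum_xy
#   m * sum_x  + c * n     = sum_y
# with Cramer's rule.
det = n * sum_xx - sum_x ** 2
m_hat = (n * sum_xy - sum_x * sum_y) / det
c_hat = (sum_y * sum_xx - sum_x * sum_xy) / det

print("slope:", m_hat, "intercept:", c_hat)
```

Because these points lie exactly on a line, the fitted line passes through all of them.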

In [None]:
data_points = [(1,2), (2,3), (3,4), (4,5)]

# Variables for the linear equation: y = mx + c
m, c = sp.symbols('m c')

# Sum of squared differences between observed and predicted y-values
error = sum((m * x + c - y)**2 for x, y in data_points)

# Finding partial derivatives of the error function with respect to m and c
partial_m = sp.diff(error, m)
partial_c = sp.diff(error, c)

# Solving the system of equations to minimise the error (least squares solution)
solution = sp.solve((partial_m, partial_c), (m, c))

best_fit_m, best_fit_c = solution[m], solution[c]
best_fit_m, best_fit_c 

### Task 2

In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1358)
x = np.linspace(-3,3, 100)
y_true = 0.3 * x**4 -0.1 * x**3 - 2* x**2 - 0.8*x
y = y_true + np.random.randn(len(x))
plt.plot(x,y_true, '--')
plt.plot(x,y, '.')
plt.xlabel('x', size=14)
plt.ylabel('y', size=14)

Exercise 1

In [None]:
data_points = [(1,2), (2,3), (3,4), (4,5)]

#Defining the symbols
a1, a0 = sp.symbols('a1 a0')
ex1 = [a1 * x + a0 - y for x, y in data_points]

# Solving the linear system for the coefficients (linsolve solves it exactly rather than minimising an error)
solution = sp.linsolve(ex1, a1, a0)
solution

Exercise 2 

In [None]:
data_points = [(1,2), (2,3), (3,4), (4,5)]

#Defining the symbols
a2, a1, a0 = sp.symbols('a2 a1 a0')
ex2 = [a2 * x ** 2 + a1 * x + a0 - y for x, y in data_points]

# Solving the linear system for the coefficients (linsolve solves it exactly rather than minimising an error)
solution = sp.linsolve(ex2, a2, a1, a0)
solution

Exercise 3

In [None]:
data_points = [(1,2), (2,3), (3,4), (4,5)]

#Defining the symbols
a3, a2, a1, a0 = sp.symbols('a3 a2 a1 a0')
ex3 = [a3 * x ** 3 + a2 * x ** 2 + a1 * x + a0 - y for x, y in data_points]

# Solving the linear system for the coefficients (linsolve solves it exactly rather than minimising an error)
solution = sp.linsolve(ex3, a3, a2, a1, a0)
solution

Exercise 4

In [None]:
data_points = [(1,2), (2,3), (3,4), (4,5)]

#Defining the symbols
a4, a3, a2, a1, a0 = sp.symbols('a4 a3 a2 a1 a0')
ex4 = [a4 * x ** 4 + a3 * x ** 3 + a2 * x ** 2 + a1 * x + a0 - y for x, y in data_points]

# Solving the linear system for the coefficients (linsolve solves it exactly rather than minimising an error)
solution = sp.linsolve(ex4, a4, a3, a2, a1, a0)
solution

Exercise 5

In [None]:
data_points = [(1,2), (2,3), (3,4), (4,5)]

#Defining the symbols
a5, a4, a3, a2, a1, a0 = sp.symbols('a5 a4 a3 a2 a1 a0')
ex5 = [a5 * x ** 5 + a4 * x ** 4 + a3 * x ** 3 + a2 * x ** 2 + a1 * x + a0 - y for x, y in data_points]

# Solving the linear system for the coefficients (linsolve solves it exactly rather than minimising an error)
solution = sp.linsolve(ex5, a5, a4, a3, a2, a1, a0)
solution

### Task 3 
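The symbolic derivatives in this task can be sanity-checked numerically. The sketch below approximates a derivative with a central finite difference and compares it with the hand-derived result for two of the exercises; the helper name `central_difference` and the evaluation points are arbitrary choices made for illustration.

```python
import math


def central_difference(f, x0, h=1e-6):
    """Approximate f'(x0) with a symmetric finite difference."""
    return (f(x0 + h) - f(x0 - h)) / (2 * h)


def f1(t):
    """Exercise 1: f(x) = x**2 + 2x + 1, with derivative 2x + 2."""
    return t ** 2 + 2 * t + 1


def f5(t):
    """Exercise 5: f(x) = x - ln(x) + 7, with derivative 1 - 1/x."""
    return t - math.log(t) + 7


approx1 = central_difference(f1, 3.0)
approx5 = central_difference(f5, 2.0)

print(approx1, "vs exact", 2 * 3.0 + 2)
print(approx5, "vs exact", 1 - 1 / 2.0)
```

If a numerical value differs clearly from the symbolic result at the same point, the expression was probably entered incorrectly.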

Exercise 1

In [None]:
x = symbols('x')

# I did not put y = or - y at the end of the equation as this made no difference
ex1 = x**2 + 2 * x + 1

# Calculating the derivative
der_ex1 = diff(ex1, x)
der_ex1

Exercise 2

In [None]:
x = symbols('x')

# Define the function
ex2 = (3 * x - 5)** 3

# Calculating the derivative
der_ex2 = diff(ex2, x)
der_ex2

Exercise 3

In [None]:
x = symbols('x')

# Define the function
ex3 = sqrt(x - 1)** 2 - (x ** 2 + 1)** 4

# Calculating the derivative
der_ex3 = diff(ex3, x)
der_ex3

Exercise 4

In [None]:
x = symbols('x')

# Define the function
ex4 = 7 * cot(x) - 8 * cos(x)

# Calculating the derivative
der_ex4 = diff(ex4, x)
der_ex4

Exercise 5

In [None]:
x = symbols('x')

ex5 = x - ln(x) + 7

# Calculating the derivative
der_ex5 = diff(ex5, x)
der_ex5

Exercise 6 

In [None]:
from sympy import exp

x = symbols('x')

# e is Euler's number here, so e**x is written as exp(x)
ex6 = -10 * exp(x) + 5 ** x + x / 5

# Calculating the derivative with respect to x
der_ex6 = diff(ex6, x)
der_ex6

Exercise 7

In [None]:
x = symbols('x')

ex7 = (2 * sin(x)) / (sin(x) - cos(x))

# Calculating the derivative
der_ex7 = diff(ex7, x)
der_ex7

Exercise 8

In [None]:
x = symbols('x')

ex8 = (x ** 2 * ln(x)) / (1 - tan(x))

# Calculating the derivative
der_ex8 = diff(ex8, x)
der_ex8

### Assignment for "Multivariable Calculus"

This task is associated with the "Insufficient" criterion in ILO 7.0. You need to complete the assignments 1-4 at the end of [this page](https://adsai.buas.nl/Study%20Content/Advanced%20Python/27.MultivariableCalculus.html)

Provide a link to a PDF file, for assignments 1-3 in the following cell. 

In [None]:
# A link to a PDF file for assignments 1-3

Put your code  for assignment 4 in the following cell.

In [None]:
# Put your code for assignment 4 here.

### Assignments for "Optimization Algorithms"

This task is associated with the "Sufficient" criterion in ILO 7.0. 

Complete the assignments at the end of [this page](https://adsai.buas.nl/Study%20Content/Advanced%20Python/29.OptimizationAlgorithms.html). Then put your code in the following cell.

In [None]:
# Put your code here

### Assignments for "DataLab: Linear Regression with Gradient Descent"

This task is associated with the "Good" and "Excellent" criteria in ILO 7.0. 

Complete the assignment at the end of [this page](https://adsai.buas.nl/Study%20Content/Advanced%20Python/30.LinearRegressionGradientDescentDataLab.html). Then put your code in the following cell.

In [None]:
# Put your code here