# Project: Machine Learning

**Instructions for Students:**

Please carefully follow these steps to complete and submit your project:

1. **Completing the Project**: You are required to work on and complete all tasks in the provided project. Be disciplined and ensure that you thoroughly engage with each task.
   
2. **Creating a Google Drive Folder**: Each of you must create a new folder on your Google Drive if you haven't already. This will be the repository for all your completed assignment and project files, aiding you in keeping your work organized and accessible.
   
3. **Uploading Completed Project**: Upon completion of your project, upload all necessary files, including code, reports, and related documents, to the Google Drive folder you created. Save the folder link in the 'Student Identity' section and also pass it as the last parameter of the provided `submit` function.
   
4. **Sharing Folder Link**: You're required to share the link to your project Google Drive folder. This is crucial for the submission and evaluation of your project.
   
5. **Setting Permission to Public**: Please make sure your Google Drive folder is set to public. This allows your instructor to access your solutions and assess your work correctly.

Adhering to these procedures will facilitate a smooth project evaluation process for you and the reviewers.

## Student Identity

In [None]:
# @title #### Student Identity
student_id = "" # @param {type:"string"}
name = "" # @param {type:"string"}
drive_link = ""  # @param {type:"string"}

## Import Package

In [None]:
!pip install rggrader
from rggrader import submit, submit_image

## Project Description

In this Machine Learning Project, you will create your own supervised Machine Learning (ML) model. We will use the full FIFA21 Dataset and we will identify players that are above average.

We will use the column "Overall" with a threshold of 75 to define players that are 'Valuable'. This will become our target output, which we need for a supervised ML model.

This project will provide a comprehensive overview of your abilities in machine learning, from understanding the problem, choosing the right model, training, and optimizing it.

## Grading Criteria

Your score will be awarded based on the following criteria:
* 100: The model has an accuracy of more than 80% and an F1 score of more than 85%. This model is excellent and demonstrates a strong understanding of the task.
* 90: The model has an accuracy of more than 75% and an F1 score of more than 80%. This model is very good, with some room for improvement.
* 80: The model has an accuracy of more than 70% and an F1 score between 70% and 80%. This model is fairly good but needs improvement in balancing precision and recall.
* 70: The model has an accuracy of more than 65% and an F1 score between 60% and 70%. This model is below average and needs significant improvement.
* 60 or below: The model has an accuracy of less than 65% or an F1 score of less than 60%, or the student did not submit the accuracy and F1 score. This model is poor and needs considerable improvement.
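
The criteria above rely on both accuracy and F1, which can diverge when few players are 'Valuable'. The sketch below is a minimal, from-scratch illustration of how the two scores are computed for binary labels; the function name `classification_scores` and the toy labels are made up for this example and are not part of the grader.

```python
def classification_scores(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1


# Toy labels (made up): 2 of 10 players are 'Valuable' (1)
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_always_zero = [0] * 10                       # a "model" that never predicts 'Valuable'
y_model = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]       # finds one of the two valuable players

print(classification_scores(y_true, y_always_zero))
print(classification_scores(y_true, y_model))
```

In this toy case the always-zero predictor still reaches a high accuracy while its F1 is zero, which is why both scores are requested.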

Remember to make a copy of this notebook in your Google Drive and work in your own copy.

Happy modeling!

In [None]:
#Write any package/module installation that you need here


## Load the dataset and clean it

In this task, you will prepare and load your dataset. You need to download the full FIFA 21 Dataset from the link here: [Kaggle FIFA Player Stats Database](https://www.kaggle.com/datasets/bryanb/fifa-player-stats-database?resource=download&select=FIFA21_official_data.csv).

>Note: Make sure you download the FIFA 21 dataset.

After you download the dataset, import it and clean the data. For example, the dataset may contain empty cells that you need to fill, and some columns may need to be converted to numeric values before analysis. Identify the incomplete data and fix it.
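
As a rough illustration of these cleaning steps, the sketch below works on a tiny made-up DataFrame rather than the real file. It assumes currency columns use strings such as `€110.5M` and `€290K`; the helper name `parse_euro` and the toy values are hypothetical, so check the actual column formats before reusing it.

```python
import pandas as pd


def parse_euro(amount):
    """Convert a string such as '€110.5M' or '€290K' to a float number of euros."""
    text = str(amount).replace("€", "").strip()
    multipliers = {"M": 1_000_000, "K": 1_000}
    if text and text[-1] in multipliers:
        return float(text[:-1]) * multipliers[text[-1]]
    return float(text)


# Toy rows standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Value": ["€110.5M", "€290K", "€0"],
    "Volleys": [85.0, None, 40.0],
    "Club": ["Club A", None, "Club B"],
})

# Count missing cells per column, keeping only the columns that have any
missing = toy.isnull().sum()
print(missing[missing > 0])

# Numeric gap: fill with the column mean. Text gap: fill with a placeholder.
toy["Volleys"] = toy["Volleys"].fillna(toy["Volleys"].mean())
toy["Club"] = toy["Club"].fillna("Unknown")

# Currency strings: convert to plain euros so 'M' and 'K' values share one scale
toy["Value"] = toy["Value"].apply(parse_euro)
print(toy)
```

Scaling by the suffix matters: simply deleting `M` and `K` would leave millions and thousands on different scales in the same column.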

In [None]:
#Write your preprocessing and data cleaning here


## Build and Train your model

In this task, you will analyze the data and select the features that best predict whether a player is 'Valuable'.

The first step is to **define the target output** that you will use for training. Here's an example of how to create a target output:
- `df['OK Player'] = df['Overall'].apply(lambda x: 1 if x >= 50 else 0)  # Define the OK Player using a threshold of 50.`

Next you will **identify the features** that will best predict a 'Valuable' player. You are required to **submit the features you selected** in the Submit section below. Because we use the "Overall" as our target output, the use of "Overall" in your features is not allowed. You will automatically get 0 if you submit "Overall" in your features.
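
One way to avoid submitting a banned feature by accident is a small guard that runs before training. This is only a sketch: the function name `check_features` and the example feature lists are hypothetical and are not part of `rggrader`.

```python
BANNED_FEATURES = ("Overall",)  # the target is derived from this column


def check_features(features, banned=BANNED_FEATURES):
    """Return the feature list unchanged, or raise ValueError if it contains a banned column."""
    leaked = [name for name in features if name in banned]
    if leaked:
        raise ValueError(f"Banned feature(s) selected: {leaked}")
    return list(features)


# Example usage with a made-up feature list
features = check_features(["Age", "Potential", "Skill Moves"])
print(features)
```

Calling `check_features(["Age", "Overall"])` would raise a `ValueError` instead of letting the target leak into the features.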

Once you identify the features, you will then **split the data** into Training set and Testing/Validation set.
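
Because 'Valuable' players are likely a minority, it can help to keep the class ratio the same in both sets. The sketch below uses the `stratify` argument of scikit-learn's `train_test_split` on made-up toy data; the 80/20 split and the `random_state` value are illustrative choices, not requirements of the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 100 players with 2 features, 20 of them labelled 'Valuable' (1)
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 20 + [0] * 80)

# stratify=y keeps the share of each class the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Share of 'Valuable' in train:", y_train.mean())
print("Share of 'Valuable' in test:", y_test.mean())
```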

Depending on the features you selected, **you may need to scale the features**.

Now you will **train your model**. **Choose the algorithm** you are going to use carefully to make sure it gives the best result.

Once you have trained your model, you need to test the model's effectiveness. **Make predictions against your Testing/Validation set** and evaluate your model. You are required to **submit the Accuracy Score and F1 score** in the Submit section below.


In [None]:
# Write your code here


Task-3 Model Inference

1.1 Write your code in the block below

In [None]:
#Your code here

1.2 Submit

In [None]:
#submit code

Examples:

Supervised Learning

Project Description
2. FIFA21 Dataset:

The FIFA21 dataset contains information about professional soccer players, including their skills, potentials, etc.

Supervised Learning: If the dataset includes outputs (like the number of goals a player scores), you could build a regression model to predict this based on their other stats.

Here's the dataset: https://www.kaggle.com/datasets/bryanb/fifa-player-stats-database?resource=download&select=FIFA21_official_data.csv

Note: The list of features must be part of the submission; otherwise students could just use "Overall" and get an accuracy and F1 score of 100. The use of "Overall" is banned, which is why we need the list of features. Alternatively, I need to mix "Overall" with another column to create a new one.




The download button is here:
[insert image from screenshot]

===================================

In [None]:
#Let's see what we can do here with the FIFA21 dataset

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

!pip install rggrader
from rggrader import submit, submit_image


# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler

# Load data
df = pd.read_csv('FIFA21_official_data.csv')

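# Preprocess Value and Wage: convert strings such as '€110.5M' or '€290K' to euros
def convert_currency(amount):
    # Strip the euro sign, then scale by the 'M' (millions) or 'K' (thousands) suffix
    amount = str(amount).replace('€', '').strip()
    if amount.endswith('M'):
        return float(amount[:-1]) * 1_000_000
    if amount.endswith('K'):
        return float(amount[:-1]) * 1_000
    return float(amount)

for col in ['Value', 'Wage']:
    df[col] = df[col].apply(convert_currency)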

# Create a new binary target variable
df['GoodPlayer'] = df['Overall'].apply(lambda x: 1 if x >= 75 else 0)

# Select features
#features = ['Age', 'Potential', 'Value', 'Wage', 'International Reputation', 'Skill Moves']
features = ['Reactions', 'Best Overall Rating']
X = df[features]
y = df['GoodPlayer']

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions using the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print(f'F1 Score: {f1:.2f}')

       ID              Name  Age  \
0  176580         L. Suárez   33   
1  192985      K. De Bruyne   29   
2  212198   Bruno Fernandes   25   
3  194765      A. Griezmann   29   
4  224334          M. Acuña   28   

                                              Photo Nationality  \
0  https://cdn.sofifa.com/players/176/580/20_60.png     Uruguay   
1  https://cdn.sofifa.com/players/192/985/20_60.png     Belgium   
2  https://cdn.sofifa.com/players/212/198/20_60.png    Portugal   
3  https://cdn.sofifa.com/players/194/765/20_60.png      France   
4  https://cdn.sofifa.com/players/224/334/20_60.png   Argentina   

                                  Flag  Overall  Potential               Club  \
0  https://cdn.sofifa.com/flags/uy.png       87         87    Atlético Madrid   
1  https://cdn.sofifa.com/flags/be.png       91         91    Manchester City   
2  https://cdn.sofifa.com/flags/pt.png       87         90  Manchester United   
3  https://cdn.sofifa.com/flags/fr.png       87         

In [None]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

def convert_height(height):
    # Split the height string on the apostrophe
    feet, inches = height.split("'")
    # Convert height to centimeters (1 foot is 30.48 cm and 1 inch is 2.54 cm)
    height_cm = int(feet) * 30.48 + int(inches) * 2.54
    return height_cm

def convert_weight(weight):
    # Remove 'lbs' from the weight string and convert to integer
    weight_lbs = int(weight.replace('lbs', ''))
    # Convert weight to kilograms (1 pound is 0.453592 kg)
    weight_kg = weight_lbs * 0.453592
    return weight_kg

# Load data
df = pd.read_csv('FIFA21_official_data.csv')

# Preprocess missing data
#x = df.isnull().sum()
#x = x[x > 0]
#print(x)

# Fill missing categorical values with 'Unknown'
for col in ['Club', 'Body Type', 'Real Face', 'Position', 'Loaned From']:
    df[col] = df[col].fillna('Unknown')

# Fill missing 'Jersey Number' with 0 and missing 'Release Clause' with '€ 0'
df['Jersey Number'] = df['Jersey Number'].fillna(0)
df['Release Clause'] = df['Release Clause'].fillna('€ 0')

# Fill missing dates with placeholder values
df['Joined'] = df['Joined'].fillna('Jan 1, 1970')
df['Contract Valid Until'] = df['Contract Valid Until'].fillna('1970')

# Fill missing skill ratings with the mean of each column
skill_cols = ['Volleys', 'Curve', 'Agility', 'Balance', 'Jumping', 'Interceptions',
              'Positioning', 'Vision', 'Composure', 'Marking', 'SlidingTackle',
              'DefensiveAwareness']
for col in skill_cols:
    df[col] = df[col].fillna(df[col].mean())

# Preprocess Value, Wage, and Release Clause: convert strings such as '€110.5M' or '€290K' to euros
def convert_currency(amount):
    # Strip the euro sign, then scale by the 'M' (millions) or 'K' (thousands) suffix
    amount = str(amount).replace('€', '').strip()
    if amount.endswith('M'):
        return float(amount[:-1]) * 1_000_000
    if amount.endswith('K'):
        return float(amount[:-1]) * 1_000
    return float(amount)

for col in ['Value', 'Wage', 'Release Clause']:
    df[col] = df[col].apply(convert_currency)
# Apply the function to the Height column
df['Height'] = df['Height'].apply(convert_height)
# Apply the function to the Weight column
df['Weight'] = df['Weight'].apply(convert_weight)

#x = df.isnull().sum()
#x = x[x > 0]
#print(x)

# Create a new binary target variable
df['GoodPlayer'] = df['Overall'].apply(lambda x: 1 if x >= 75 else 0)

# Select features - only numeric ones and those that may make sense for your model
#features = ['Age', 'Potential', 'Value', 'Wage', 'International Reputation', 'Skill Moves']
features = ['Potential','Value', 'Wage', 'Special', 'International Reputation', 'Skill Moves', 'Jersey Number', 'Height', 'Weight', 'Crossing', 'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower', 'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression', 'Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure', 'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving', 'GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes', 'Best Overall Rating', 'Release Clause', 'DefensiveAwareness']
X = df[features]
y = df['GoodPlayer']

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a feature-selection and model pipeline
pipe = Pipeline([
  ('scale', StandardScaler()),
  ('select_k_best', SelectKBest(score_func=f_classif, k=2)),
  ('logistic', LogisticRegression())
])

# Train the pipeline
pipe.fit(X_train, y_train)

# Make predictions using the test set
y_pred = pipe.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print(f'F1 Score: {f1:.2f}')

# Get the SelectKBest step
selector = pipe.named_steps['select_k_best']

# Get the Boolean mask of selected features
mask = selector.get_support()

# Get the names of the selected features
selected_features = np.array(features)[mask]

# Print the names of the selected features
print(selected_features)

Accuracy: 0.98
F1 Score: 0.91
['Reactions' 'Best Overall Rating']
