# Introduction
In the world of football, strikers play a pivotal role in deciding the fate of matches and championships. Identifying the best strikers among a pool of talent involves a comprehensive analysis of various factors ranging from performance metrics to personal attributes. In this project, titled "Segmenting and Classifying the Best Strikers," we delve into a dataset containing information on 500 strikers, aiming to uncover patterns, insights, and classifications that distinguish top-performing strikers from the rest.

# Project Description
The project involves utilizing data analytics techniques to explore and understand the characteristics and performance metrics of strikers. By employing descriptive statistics, data visualization, feature engineering, and machine learning algorithms, we aim to identify the key attributes that contribute to a striker's success on the field and classify them into different categories based on their performance.

## Purpose
The primary purpose of this project is to provide a systematic framework for analyzing and categorizing strikers based on their performance metrics and personal attributes. By doing so, coaches, scouts, and football analysts can gain valuable insights into the characteristics of top-performing strikers and make informed decisions in team selection, recruitment, and strategic planning.


# Dataset Description
The dataset comprises various variables related to 500 strikers, encompassing both demographic information and performance metrics. Key variables include nationality, footedness, marital status, goals scored, assists, shot accuracy, dribbling success, and many more, providing a comprehensive overview of each striker's profile and on-field performance.

-> Striker ID: Unique identifiers assigned to each striker.

-> Nationality: The country of origin for each striker.

-> Footedness: Indicates whether the striker is right or left-footed.

-> Marital Status: Indicates whether the striker is married (yes) or unmarried (no).

-> Goals Scored: The total number of goals scored by the striker, a fundamental performance metric.

-> Assists: The number of assists provided by the striker, indicating their ability to create goal-scoring opportunities for teammates.

-> Shots on Target: The number of shots taken by the striker that hit the target, reflecting their ability to create scoring opportunities and test the goalkeeper.

-> Shot Accuracy: The percentage of shots on target out of total shots taken, showing the striker's precision and effectiveness.

-> Conversion Rate: The percentage of shots that result in goals, revealing the striker's efficiency in front of goal.

-> Dribbling Success: A metric indicating the striker's ability to bypass defenders and create goal-scoring opportunities through individual skill.

-> Movement off the Ball: Reflects how actively the striker moves to find space and create opportunities for themselves and teammates.

-> Hold-up Play: Measures the striker's ability to retain possession and bring teammates into play with passes or layoffs.

-> Aerial Duels Won: The number of aerial duels won by the striker; strength in the air can create scoring chances.

-> Defensive Contribution: Reflects the striker's defensive efforts such as tracking back, pressing opponents, and making interceptions.

-> Big Game Performance: Indicates the striker's performance in important matches, which can elevate their reputation.

-> Consistency: Reflects how regularly the striker performs at a high level over the course of a season or multiple seasons.

-> Versatility: Measures the striker's ability to adapt to different tactical systems and roles within the team.

-> Penalty Success Rate: The efficiency of the striker from the penalty spot, crucial in tight matches.

-> Impact on Team Performance: Reflects how the team's results and overall attacking play are influenced by the striker's presence.

-> Off-field Conduct: Measures the striker's professionalism, leadership, and behavior, which can impact their overall performance and value to the team.

# Required Tools
-> Python programming language

-> Jupyter Notebook

# Your Job - Questions to solve
Data Cleaning:

-> Download the attached dataset and load it into a Jupyter notebook.

-> Load all the packages needed for the required tasks.

-> Check every column for missing values and impute them with SimpleImputer. Use strategy 'median' for numeric columns and 'most_frequent' for nominal columns.

-> Check the data types and cast the following variables to integers: 'Goals Scored', 'Assists', 'Shots on Target', 'Movement off the Ball', 'Hold-up Play', 'Aerial Duels Won', 'Defensive Contribution', 'Big Game Performance', 'Impact on Team Performance', 'Off-field Conduct'.
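
The cleaning steps above can be sketched on a small made-up frame. This is only an illustration under stated assumptions: the `toy` DataFrame and its values are hypothetical stand-ins for the real dataset, and only three columns are shown.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame standing in for the strikers data (values are made up).
toy = pd.DataFrame({
    "Footedness": pd.Series(
        ["Left-footed", "Right-footed", np.nan, "Right-footed"], dtype=object
    ),
    "Goals Scored": [17.4, np.nan, 18.2, 22.6],
    "Assists": [10.7, 13.7, 3.8, np.nan],
})

numeric_cols = toy.select_dtypes(include="number").columns
nominal_cols = toy.select_dtypes(exclude="number").columns

# Median for numeric columns, most frequent value for nominal columns.
# The strategy name is spelled with an underscore: 'most_frequent'.
toy[numeric_cols] = SimpleImputer(strategy="median").fit_transform(toy[numeric_cols])
toy[nominal_cols] = SimpleImputer(strategy="most_frequent").fit_transform(toy[nominal_cols])

# Cast count-like columns to integers; astype(int) drops the fractional part.
int_cols = ["Goals Scored", "Assists"]
toy[int_cols] = toy[int_cols].astype(int)

print(toy)
print(toy.isnull().sum())
```

Selecting columns by dtype avoids hard-coding which columns to impute, which may be convenient when the missing values are spread over many columns.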

# Descriptive Analysis:

-> Perform descriptive analysis on the dataset. Round the output values to 2 decimal places.
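
A minimal sketch of this step, assuming a hypothetical two-column `toy` frame (the values are the first five rows shown later in this notebook, with goals already cast to integers):

```python
import pandas as pd

# Hypothetical subset of the data, used only to show the call pattern.
toy = pd.DataFrame({
    "Goals Scored": [17, 14, 18, 22, 13],
    "Conversion Rate": [0.166241, 0.192774, 0.160379, 0.184602, 0.105319],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = toy.describe().round(2)
print(summary)

# The 'max' row answers questions such as the highest goal tally.
print(summary.loc["max", "Goals Scored"])
```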

## Data Visualization:

-> Perform percentage analysis on the variable Footedness and create a pie chart on the output using matplotlib.

-> Visualize the distribution of players' footedness across nationalities using a seaborn countplot.
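
One possible way to produce both charts is sketched below. The `toy` frame is hypothetical, the `Agg` backend line is only there so the sketch runs without a display (it would normally be omitted inside a notebook), and the cross-tabulation is an extra that is not required by the task.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data with the two categorical columns of interest.
toy = pd.DataFrame({
    "Nationality": ["Spain", "France", "Germany", "France", "France", "Spain"],
    "Footedness": ["Left-footed", "Left-footed", "Left-footed",
                   "Right-footed", "Left-footed", "Right-footed"],
})

# Percentage analysis: share of each footedness category.
foot_pct = toy["Footedness"].value_counts(normalize=True) * 100

fig, (ax_pie, ax_count) = plt.subplots(1, 2, figsize=(10, 4))
ax_pie.pie(foot_pct, labels=foot_pct.index, autopct="%1.1f%%")
ax_pie.set_title("Footedness share")

# Countplot: one bar per nationality and footedness combination.
sns.countplot(data=toy, x="Nationality", hue="Footedness", ax=ax_count)
ax_count.set_title("Footedness by nationality")
fig.tight_layout()

# The same counts as a table, e.g. to read off left-footed players per country.
counts = pd.crosstab(toy["Nationality"], toy["Footedness"])
print(foot_pct.round(2))
print(counts)
```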

## Statistical Analysis:

-> Determine which nationality's strikers have the highest average number of goals scored.

-> Calculate the average conversion rate for players based on their footedness.

-> Find whether there is any significant difference in consistency rates among strikers from various nationalities. Check the assumptions of the chosen test before running it.

-> Check if there is any significant correlation between strikers' Hold-up play and consistency rate. Check the assumptions first.

-> Check if strikers' hold-up play significantly influences their consistency rate.
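
The tests above could be chained roughly as follows. This is a sketch on synthetic data, not the notebook's actual analysis: the `toy` frame is generated at random, `f_oneway` is assumed as the ANOVA, and `linregress` is assumed as a simple stand-in for the regression of consistency on hold-up play.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway, levene, linregress, pearsonr, shapiro

# Synthetic data: consistency depends weakly on hold-up play plus noise.
rng = np.random.default_rng(0)
n = 150
toy = pd.DataFrame({
    "Nationality": rng.choice(["Spain", "France", "Germany"], size=n),
    "Hold-up Play": rng.normal(60, 10, size=n),
})
toy["Consistency"] = 0.5 + 0.003 * toy["Hold-up Play"] + rng.normal(0, 0.05, size=n)

# Group averages, e.g. mean consistency per nationality.
print(toy.groupby("Nationality")["Consistency"].mean().round(2))

# Assumption checks: normality (Shapiro-Wilk) and equal variances (Levene).
groups = [g["Consistency"].to_numpy() for _, g in toy.groupby("Nationality")]
shapiro_stat, shapiro_p = shapiro(toy["Consistency"])
levene_stat, levene_p = levene(*groups)

# One-way ANOVA across nationalities.
anova_f, anova_p = f_oneway(*groups)

# Pearson correlation and a simple linear regression (slope = beta).
r, r_p = pearsonr(toy["Hold-up Play"], toy["Consistency"])
fit = linregress(toy["Hold-up Play"], toy["Consistency"])

print(f"Shapiro p={shapiro_p:.3f}, Levene p={levene_p:.3f}, ANOVA p={anova_p:.3f}")
print(f"Pearson r={r:.3f} (p={r_p:.3g}), beta={fit.slope:.4f}")
```

A p-value above 0.05 in the Shapiro-Wilk or Levene test is conventionally read as no evidence against normality or equal variances, respectively.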

# Feature Engineering:

Create a new feature - Total contribution score by summing up specific columns: 'Goals Scored', 'Assists', 'Shots on Target', 'Dribbling Success', 'Aerial Duels Won', 'Defensive Contribution', 'Big Game Performance', 'Consistency'.

Encode the Footedness and marital status by LabelEncoder.

Create dummy variables for Nationality and add them to the data.
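
A short sketch of these three steps, assuming a hypothetical `toy` frame and only three of the eight score columns:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; the real data has many more columns.
toy = pd.DataFrame({
    "Nationality": ["Spain", "France", "Germany"],
    "Footedness": ["Left-footed", "Right-footed", "Left-footed"],
    "Marital Status": ["No", "Yes", "No"],
    "Goals Scored": [17, 14, 18],
    "Assists": [10, 13, 3],
    "Consistency": [0.82, 0.80, 0.77],
})

# Row-wise sum of the selected columns (a subset of the eight, for brevity).
score_cols = ["Goals Scored", "Assists", "Consistency"]
toy["Total contribution score"] = toy[score_cols].sum(axis=1)

# LabelEncoder assigns integers in alphabetical order of the class names.
for col in ["Footedness", "Marital Status"]:
    toy[col] = LabelEncoder().fit_transform(toy[col])

# One dummy column per nationality, appended to the data.
dummies = pd.get_dummies(toy["Nationality"], dtype=int)
toy = pd.concat([toy.drop(columns="Nationality"), dummies], axis=1)

print(toy)
```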

## Clustering Analysis:

Perform KMeans clustering:

Select features by dropping the Striker_ID from the updated data.

Calculate the Within-Cluster-Sum-of-Squares (WCSS).

Visualize the elbow chart to select the optimal number of clusters (the elbow should appear at 2 clusters).

Build the KMeans cluster with the optimal number of clusters and add the labels into the data.

Calculate the average total contribution score by the value of clusters.

Assign the tag 'Best strikers' to cluster 0 and 'Regular strikers' to cluster 1 in a new column 'Strikers types', then drop the Clusters variable.

Use feature mapping to encode the new feature Strikers types numerically: 'Best strikers' as 1 and 'Regular strikers' as 0.
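
The clustering workflow might look like the sketch below. It runs on two synthetic, well-separated groups rather than the real data. One deliberate difference from the steps above: KMeans cluster numbers are arbitrary, so instead of assuming that cluster 0 holds the best strikers, the sketch tags whichever cluster has the higher average Total contribution score.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic data: 50 low-output and 50 high-output strikers.
rng = np.random.default_rng(42)
low = rng.normal(loc=[10, 5], scale=1.5, size=(50, 2))
high = rng.normal(loc=[25, 15], scale=1.5, size=(50, 2))
toy = pd.DataFrame(np.vstack([low, high]), columns=["Goals Scored", "Assists"])
toy["Total contribution score"] = toy["Goals Scored"] + toy["Assists"]

features = toy[["Goals Scored", "Assists", "Total contribution score"]]

# WCSS is exposed as KMeans.inertia_; plot wcss against k to read the elbow.
wcss = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(features).inertia_
    for k in range(1, 11)
]

# Fit the final model with the chosen number of clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
toy["Clusters"] = kmeans.fit_predict(features)

# Average score per cluster decides which cluster gets which tag.
cluster_means = toy.groupby("Clusters")["Total contribution score"].mean()
best_cluster = cluster_means.idxmax()
toy["Strikers types"] = np.where(
    toy["Clusters"] == best_cluster, "Best strikers", "Regular strikers"
)
toy = toy.drop(columns="Clusters")

# Feature mapping to a numeric target.
toy["Strikers types"] = toy["Strikers types"].map(
    {"Best strikers": 1, "Regular strikers": 0}
)
print(cluster_means.round(2))
```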

# Machine Learning Model:

Select the features into x and the target column Strikers types into y. Delete unnecessary columns (i.e., 'Striker_ID') while selecting the features.

Perform feature scaling with StandardScaler and split the data into train and test sets where the test data size will be 20%.

Build a logistic regression machine learning model to predict strikers type.

Make predictions and evaluate by calculating the accuracy percentage.

Create the confusion matrix and visualize it.
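
A minimal sketch of the modelling steps, assuming synthetic features `X` and labels `y` in place of the engineered strikers data. It follows the order given above (scale, then split); the `random_state` value is an arbitrary choice for reproducibility.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data: 0 = regular strikers, 1 = best strikers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 3)), rng.normal(3, 1, size=(100, 3))])
y = np.array([0] * 100 + [1] * 100)

# Standardize the features, then hold out 20% for testing.
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred) * 100
cm = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}%")
# Rows are true classes, columns are predicted classes:
# cm[0, 0] = regular strikers predicted correctly,
# cm[1, 0] = best strikers predicted as regular.
print(cm)
```

The matrix can be drawn with, for example, `sns.heatmap(cm, annot=True, fmt="d")`. Fitting the scaler on the training split only is a common alternative that keeps test-set statistics out of the preprocessing.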

Finally, answer the question asked in this assignment and you are done!


# Conclusion
Through a comprehensive analysis of the dataset, we've gained valuable insights into the characteristics and performance metrics of strikers. By segmenting and classifying the strikers based on their attributes and performance, we've provided a framework for identifying top-performing strikers and predicting their performance type. This project serves as a valuable resource for football professionals and enthusiasts alike, aiding in talent identification, team selection, and strategic planning.

## Questions for this assignment
What is the maximum number of goals scored by an individual striker?

What is the proportion of right-footed strikers within the dataset?

Which nationality's strikers have the highest average number of goals scored?

What is the average conversion rate for left-footed players?

How many left-footed players are from France?

What is the correlation coefficient between hold-up play and consistency score?

What is the p-value for the Shapiro-Wilk test of consistency score? Is it normally distributed?

What is the p-value for Levene's test in the ANOVA analysis? Should heteroscedasticity be assumed?

Is there any significant correlation between strikers' Hold-up play and consistency rate?

Describe the beta value of Hold-up Play you have found in your regression analysis.

What is the average Total contribution score you get for the best strikers?

What is the accuracy score of your logistic regression model? How many regular strikers did your model predict correctly? How many best strikers did your model predict incorrectly?

#### --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Import libraries

In [2]:
import pandas as pd
import matplotlib.pyplot as pt
import numpy as np
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from scipy.stats import shapiro, levene, pearsonr

### Load Dataset

In [10]:
data = pd.read_excel('Strikers_performance.xlsx', index_col = None)
data.head()

Unnamed: 0,Striker_ID,Nationality,Footedness,Marital Status,Goals Scored,Assists,Shots on Target,Shot Accuracy,Conversion Rate,Dribbling Success,Movement off the Ball,Hold-up Play,Aerial Duels Won,Defensive Contribution,Big Game Performance,Consistency,Penalty Success Rate,Impact on Team Performance,Off-field Conduct
0,1,Spain,Left-footed,No,17.483571,10.778533,34.795488,0.677836,0.166241,0.757061,50.921924,71.806409,15.682532,30.412215,6.152481,0.820314,0.922727,8.57037,11.451388
1,2,France,Left-footed,Yes,14.308678,13.72825,31.472436,0.544881,0.192774,0.796818,61.39615,53.726866,19.843983,26.474913,6.093172,0.803321,0.678984,3.444638,8.243689
2,3,Germany,Left-footed,No,18.238443,3.804297,25.417413,0.51818,0.160379,0.666869,65.863945,60.452227,20.090084,24.164116,3.408714,0.76654,0.843858,8.429491,9.506835
3,4,France,Right-footed,No,22.615149,9.688908,20.471443,0.599663,0.184602,0.638776,88.876877,60.511979,22.363152,44.129989,6.33982,0.611798,0.662997,6.532552,8.199653
4,5,France,Left-footed,Yes,13.829233,6.048072,29.887563,0.582982,0.105319,0.591485,75.565531,54.982158,13.165708,37.859323,8.465658,0.701638,0.906538,8.414915,6.665333


### Check for missing values

In [22]:
missing_values = data.isnull().sum()
print("Missing Values: \n", missing_values)

Missing Values: 
  Striker_ID                    0
Nationality                   0
Footedness                    0
Marital Status                0
Goals Scored                  0
Assists                       0
Shots on Target               0
Shot Accuracy                 0
Conversion Rate               0
Dribbling Success             0
Movement off the Ball         0
Hold-up Play                  0
Aerial Duels Won              0
Defensive Contribution        0
Big Game Performance          0
Consistency                   0
Penalty Success Rate          0
Impact on Team Performance    0
Off-field Conduct             0
dtype: int64


#### Columns with missing values:

-> Movement off the Ball: 6

-> Big Game Performance: 2

-> Penalty Success Rate: 5

### Impute missing values in the numeric columns with SimpleImputer (median strategy)

In [31]:
# The three numeric columns that contain missing values
cols_to_impute = ["Movement off the Ball", "Big Game Performance", "Penalty Success Rate"]
# Replace each missing entry with the median of its column
imputer = SimpleImputer(strategy = 'median')
data[cols_to_impute] = imputer.fit_transform(data[cols_to_impute])
missing_values = data.isnull().sum()
print("Missing Values: \n", missing_values)

Missing Values: 
 Striker_ID                    0
Nationality                   0
Footedness                    0
Marital Status                0
Goals Scored                  0
Assists                       0
Shots on Target               0
Shot Accuracy                 0
Conversion Rate               0
Dribbling Success             0
Movement off the Ball         0
Hold-up Play                  0
Aerial Duels Won              0
Defensive Contribution        0
Big Game Performance          0
Consistency                   0
Penalty Success Rate          0
Impact on Team Performance    0
Off-field Conduct             0
dtype: int64


### Ensure correct data types for the numeric columns

In [30]:
# Columns the assignment asks to store as integers
cols = ['Goals Scored', 'Assists', 'Shots on Target', 'Movement off the Ball', 'Hold-up Play', 'Aerial Duels Won', 'Defensive Contribution', 'Big Game Performance', 'Impact on Team Performance', 'Off-field Conduct']
# astype(int) drops the fractional part (e.g. 17.48 becomes 17)
data[cols] = data[cols].astype(int)