# FIFA

- Task 1: Prepare a complete data analysis report on the given data.

- Task 2: Explore football skills and cluster football players based on their attributes.

- Task 3: Explore the data and answer all of the questions below in a
  step-by-step manner:
  
    - Prepare a rank ordered list of top 10 countries with most players. Which
      countries are producing the most footballers that play at this level?
      
    - Plot the distribution of overall rating vs. age of players. Interpret what is the
      age after which a player stops improving?
      
    - Which type of offensive players tends to get paid the most: the striker, the
      right-winger, or the left-winger?

# Task 1: Prepare a complete data analysis report on the given data.

# BASIC LIBRARIES

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sweetviz as sv

import warnings
warnings.filterwarnings('ignore')

In [None]:
data =pd.read_csv('PRCP-1004-Fifa20.zip')
data

# BASIC CHECKS :

In [None]:
data.info(verbose=True)  # info() prints its summary and returns None, so nothing is assigned

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.columns.tolist() # lists all the features

- The dataset has 104 features in total: 61 numerical and 43 categorical.

In [None]:
data.describe().T   # describe() summarises only the numerical columns by default

In [None]:
numerical_col = data.describe().columns.tolist() # list of all numerical columns
numerical_col

In [None]:
categorical_col = data.describe(include=object).columns.tolist()
categorical_col   # list of the categorical columns
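
The same numerical/categorical split can also be made from the column dtypes directly. The sketch below uses a made-up three-column frame; `toy` and its values are assumptions for illustration, not rows from the FIFA data.

```python
import pandas as pd

# Hypothetical toy frame; the column names only mimic the FIFA data.
toy = pd.DataFrame({
    "age": [23, 31, 27],
    "overall": [78, 85, 80],
    "nationality": ["Spain", "Brazil", "Spain"],
})

# Numeric dtypes (int, float) go to one list, everything else to the other.
toy_numerical = toy.select_dtypes(include="number").columns.tolist()
toy_categorical = toy.select_dtypes(exclude="number").columns.tolist()

print(toy_numerical)
print(toy_categorical)
```

Unlike `describe()`, this does not compute any summary statistics, so it may be quicker on wide frames.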

# EDA PROCESS :

- Our dataset has 104 input features in total.
- Plotting every relationship and distribution with Matplotlib and seaborn would take a lot of time.
- In the upcoming analysis we use Matplotlib and seaborn only for the essential features.
- For the full set of 104 input features we use sweetviz to understand the dataset structure (correlations, distributions, missing values).

In [None]:
report = sv.analyze(data)

In [None]:
report.show_html('sweetviz_report.html')

# TASK 3 (i)

Prepare a rank ordered list of top 10 countries with most players. Which
countries are producing the most footballers that play at this level?

We need the following features from the dataset:

- Nationality - Identifies the player's country.
- Player ID (or Name) - Unique identifier for counting distinct players.

In [None]:
x = data.groupby('nationality')['short_name'].count()
# Group by nationality and count the player names in each group.

x = x.sort_values(ascending=False)  # sort the counts in descending order
x = x[:10]   # keep the ten countries with the most players
dict(x)

In [None]:
sns.barplot(x=list(x.keys()), y=list(x.values), palette="cividis")
# The x-axis tick labels already name each country, so no label= or legend is needed.
plt.xlabel('Country')
plt.ylabel('Number of players')
plt.xticks(rotation=30)
plt.title('Top 10 countries with most players')
plt.show()

# insights
- England produces the highest number of players, significantly ahead of other nations.
- Germany and Spain rank second and third, respectively.
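
A shorter way to get the same kind of ranking is `value_counts`, which counts and sorts in one step. The sketch below runs on invented nationalities; `toy_players` is a hypothetical frame, not the FIFA data.

```python
import pandas as pd

# Hypothetical nationalities, one entry per player.
toy_players = pd.DataFrame({
    "nationality": ["England", "Spain", "England", "Germany", "Spain", "England"],
})

# value_counts sorts by count in descending order by default.
top_countries = toy_players["nationality"].value_counts().head(2)
print(top_countries)
```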

# TASK 3 (ii)



To analyze the distribution of overall rating vs. age and determine the age when players stop 
improving, we need:

-  Age - The player's age.
-  Overall Rating - The overall skill rating of the player.

In [None]:
sns.set_theme(style='white')
sns.lineplot(x='age',y='overall',data=data,marker='o',markerfacecolor='red')


peak_age = data.groupby('age')['overall'].mean().idxmax()
# Age at which the mean overall rating is highest.


plt.axvline(peak_age,linestyle='--',label=f'Peak age {peak_age}',color='black')
# Draw a vertical line on the plot at the peak age.
             
plt.xlabel('Age')
plt.ylabel('Overall Rating')
plt.legend(bbox_to_anchor=(1,1),fontsize=8)
plt.title('Overall vs Age')
plt.grid(True)
plt.show()

In [None]:
peak_age

# insights
- From the plot above, the mean overall rating peaks at age 41 (the value of `peak_age`); by this measure, players stop improving after age 41.
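
The mean rating at an age with only a handful of players can be driven by one or two individuals. The sketch below, on invented numbers, compares the plain `idxmax` peak with a peak restricted to ages that have a minimum number of players. The threshold `min_players` is an assumed value, not something taken from the analysis above.

```python
import pandas as pd

# Hypothetical ratings: a single 41-year-old with a high rating.
toy = pd.DataFrame({
    "age":     [25, 25, 25, 28, 28, 28, 41],
    "overall": [70, 72, 74, 76, 78, 80, 82],
})

stats = toy.groupby("age")["overall"].agg(["mean", "count"])

naive_peak = stats["mean"].idxmax()      # ignores how many players each age has
min_players = 3                          # assumed threshold for a reliable mean
robust_peak = stats.loc[stats["count"] >= min_players, "mean"].idxmax()

print(naive_peak, robust_peak)
```

On the real data, checking `data['age'].value_counts()` for the oldest ages would show whether such a filter changes the peak.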

# TASK 3 (iii)


To determine which type of offensive player (Striker, Right-Winger, or Left-Winger) tends to get 
paid the most, we need the following columns:

- Position - Identifies if the player is a Striker (ST), Right-Winger (RW), or Left-Winger (LW).
- Wage (€ or any currency) - The player's salary (how much they get paid).


In [None]:
# Collect the wages of players whose positions include Striker (ST), Right-Winger (RW) or Left-Winger (LW).

st_col = []  # wage_eur of Striker players
rw_col = []  # wage_eur of Right-Winger (RW) players
lw_col = []  # wage_eur of Left-Winger (LW) players

x = ['ST', 'RW', 'LW']
for i in range(len(data)):
    for j in x:
        # Substring test: note that 'RW' also matches 'RWB' and 'LW' also matches 'LWB'.
        if j in str(data.player_positions[i]):
            if j == 'ST':
                st_col.append(data.wage_eur[i])
            elif j == 'RW':
                rw_col.append(data.wage_eur[i])
            elif j == 'LW':
                lw_col.append(data.wage_eur[i])

avg_st_pay = np.mean(st_col)  # average wage of Striker players
avg_rw_pay = np.mean(rw_col)  # average wage of Right-Winger players
avg_lw_pay = np.mean(lw_col)  # average wage of Left-Winger players
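
Because the loop above tests substrings, a wing-back listed as `RWB` or `LWB` is counted as a winger. A minimal sketch of exact matching is shown below. It assumes positions are separated by a comma and a space, as in `"RW, LW"`, and it runs on invented rows rather than the FIFA data.

```python
import pandas as pd

# Hypothetical players; the separator ", " is an assumption about the format.
toy = pd.DataFrame({
    "player_positions": ["ST, CF", "RWB, RB", "RW, LW", "LWB"],
    "wage_eur": [100, 20, 60, 10],
})

# Substring matching: "RW" is found inside "RWB" as well.
substring_hits = int(toy["player_positions"].str.contains("RW").sum())

# Exact matching: split "RW, LW" into ["RW", "LW"] and test list membership.
tokens = toy["player_positions"].str.split(", ")
avg_wage = {
    pos: toy.loc[tokens.apply(lambda t: pos in t), "wage_eur"].mean()
    for pos in ["ST", "RW", "LW"]
}

print(substring_hits, avg_wage)
```

Whether this changes the ranking of the three positions on the real data would need to be checked by rerunning the comparison.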


In [None]:
pay_data= pd.DataFrame({'Position':['Striker','Right-Winger(RW)','Left-Winger(LW)'],
                        'Average_Pay':[avg_st_pay,avg_rw_pay,avg_lw_pay]})
# Build a small dataframe of the averages for plotting.
pay_data = pay_data.sort_values(by='Average_Pay',ascending=False) # sort in descending order
pay_data

In [None]:
sns.barplot(x='Position',y='Average_Pay',data=pay_data,palette="coolwarm")
plt.show()

# insights
- The analysis above shows that, among offensive players, the Left-Winger (LW) tends to be paid the most compared with the Striker and the Right-Winger.


# DATA PREPROCESSING :

# (i) NULL VALUES HANDLING

In [None]:
x = data.isnull().sum()  # number of null values in each column
y = x[x>0]               # only the columns that contain null values
y

In [None]:
x[x>0].count()   # total number of columns that have null values

In [None]:
per_null = ((data.isnull().sum())/len(data))*100  
# percentage of null values in each column

In [None]:
pd.set_option('display.max_rows',None)
per_null[per_null>0]  # percentage of null values for each column that has any

### Separate the columns into categorical and numerical

## NUMERICAL COLUMNS

In [None]:
null_num_col =[]   # numerical columns that contain null values

for i in numerical_col:
    if i in y:
        null_num_col.append(i)

In [None]:
null_num_col # the numerical columns that contain null values

In [None]:
# Handle the null values in the numerical columns.

from sklearn.impute import KNNImputer

knn = KNNImputer(n_neighbors=3)

for i in null_num_col:

    per_null_num = data[i].isnull().mean() * 100  # percentage of nulls in this column

    if per_null_num <= 20:
        data[i] = data[i].fillna(data[i].mean())  # mean imputation

    elif per_null_num <= 80:
        # The imputer sees only this one column, so missing rows have no
        # neighbours to measure against and receive the column mean.
        data[[i]] = knn.fit_transform(data[[i]])

    else:
        data.drop(columns=[i], inplace=True)
        

#### A column with <= 20 percent null values is filled with the column mean.
#### A column with more than 20 and up to 80 percent null values is filled by the KNN imputer.
#### A column with more than 80 percent null values is dropped.
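
`KNNImputer` finds neighbours using the other columns of a row. When it is given a single column, a row with a missing value has nothing to measure distance on, and the fill value falls back to the column mean. The sketch below illustrates the difference on an invented two-column array; the numbers are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: two related columns, one missing value in the second.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [9.0, 90.0],
])

# Only the incomplete column: no distance can be computed for the missing
# row, so the imputed value is the column mean.
single = KNNImputer(n_neighbors=1).fit_transform(X[:, [1]])

# Both columns: the nearest row by the first column (2.0) donates its value.
both = KNNImputer(n_neighbors=1).fit_transform(X)

print(single[2, 0], both[2, 1])
```

On the real data, passing several related numerical columns to the imputer at once would let it use actual neighbours.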

## CATEGORICAL COLUMNS :

In [None]:
null_cate_col =[]  # categorical columns that contain null values

for j in categorical_col:
    if j in y:
        null_cate_col.append(j)

In [None]:
null_cate_col       # the categorical columns that contain null values

In [None]:
for j in null_cate_col:
    # Every categorical column is filled with its most frequent value (mode),
    # regardless of its null percentage.
    data[j] = data[j].fillna(data[j].mode()[0])

#### A categorical feature with null values is filled with its most frequent value (the mode).
#### A categorical feature with more than 80% null values should ideally be dropped; the code above fills it with the mode as well.

# (ii) DUPLICATE VALUES HANDLING

In [None]:
data.duplicated().sum() # number of duplicate rows in the dataset

#### There are no duplicate rows in our dataset.

# FEATURE TRANSFORMATION

# (iii)  ENCODING

In [None]:
categorical_col  # 43 categorical features in total

## insights
- The list above contains no ordinal features.
- So, we treat every categorical feature as nominal.
- Our dataset has more than 18,000 rows, so we avoid One-Hot Encoding for high-cardinality columns. Instead, we use Frequency Encoding to control feature explosion.

## FREQUENCY ENCODING TECHNIQUE

In [None]:
# Check the number of unique values in each categorical feature.

for i in categorical_col:
    uni_len = len(data[i].unique())  
    print(i,'-',uni_len)

In [None]:
for i in categorical_col:
    uni_len = len(data[i].unique())
    if uni_len <= 5000:
        fre_data = data[i].value_counts()  # absolute count of each category (not a probability)
        data[i] = data[i].map(fre_data)

    else:
        data.drop(columns=[i], inplace=True)

#### A categorical feature with at most 5000 unique values is handled by the Frequency Encoding technique.
#### A categorical feature with more than 5000 unique values is dropped.
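
Frequency encoding replaces each category by how often it occurs, so two different categories with the same count receive the same code. The sketch below shows this on an invented series; the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical column: Spain and Brazil both appear twice.
toy = pd.Series(["Spain", "Spain", "Brazil", "Brazil", "Italy"], name="nationality")

counts = toy.value_counts()      # absolute count per category
encoded = toy.map(counts)        # each value replaced by its category count

print(encoded.tolist())
```

After encoding, Spain and Brazil are indistinguishable, which is a trade-off of this technique compared with one-hot encoding.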

In [None]:
data.info(verbose=False) # confirms that no categorical features remain

# Find the correlation between the features in the dataset.
# Take the most correlated features to build the model for the task.

In [None]:
cor_data = data.corr()
cor_data

In [None]:
columns = cor_data.columns.to_list() # column labels
index = cor_data.index.to_list()   # row labels

cor = {}  # maps 'feature_a,feature_b' to the correlation of that pair

for i in columns:
    for j in index:
        # Skip self-correlations and pairs already stored in the reverse order.
        if i != j and f'{j},{i}' not in cor:
            cor[f'{i},{j}'] = data[i].corr(data[j])

cor # correlation value of every pair of features
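
The unique pairs can also be read straight from the correlation matrix by keeping only its upper triangle. This is a sketch of that alternative on an invented three-column frame, not the method used in this notebook; `toy` and its values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: b rises with a, c falls with a.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 9],
    "c": [4, 3, 2, 1],
})
corr = toy.corr()

# Keep the strict upper triangle so each pair appears once
# and self-correlations are excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

print(pairs)
```

`pairs.head(15)` and `pairs.tail(15)` would then give the strongest positive and negative pairs without a nested loop.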

In [None]:
len(cor) # length of correlation variable

In [None]:
# Separate the positive and negative correlation values.
x=cor.values() 
pos_cor = [] # positive values
neg_cor = [] # negative values

for i in x:
    if i>0:
        pos_cor.append(i)
    elif i<0:
        neg_cor.append(i)
        
top_pos_cor = sorted(pos_cor,reverse=True)[:15] # the 15 strongest positive correlations, for model building
top_neg_cor = sorted(neg_cor)[:15] # the 15 strongest negative correlations, for model building

In [None]:
top_neg_cor

In [None]:
a={} # feature pairs with the strongest positive correlations
b={} # feature pairs with the strongest negative correlations

for i in cor.keys():
    for j in top_pos_cor:
        if cor[i]==j and len(a)<=15:
            a[i]=j
    for k in top_neg_cor:
        if cor[i]==k:
            b[i]=k

In [None]:
a # feature pairs with the strongest positive correlations

In [None]:
b # feature pairs with the strongest negative correlations

In [None]:
features =[]  # unique feature names extracted from the dictionary keys
for i in a.keys():
    spl=i.split(',')
    if spl[0] not in features:
        features.append(spl[0])
    if spl[1] not in features:
        features.append(spl[1])
        
for j in b.keys():
    spl1=j.split(',')
    if spl1[0] not in features:
        features.append(spl1[0])
    if spl1[1] not in features:
        features.append(spl1[1])        

In [None]:
features  # names of the features with the strongest positive and negative correlations, for model building

In [None]:
len(features) # 29 highly correlated features in total are taken for model building

# HEATMAP
- It shows the correlation matrix of the given dataset

In [None]:
plt.figure(figsize=[50,50])
sns.heatmap(cor_data,annot=True,annot_kws={'size':5})

# (iv) SCALING PROCESS
- For encoding we used Frequency Encoding.
- It can assign large numerical values to categories, because the dataset has more than 18k rows.
- Before modeling, we need to scale the data.
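
Min-max scaling maps each column to the range [0, 1] with (x - min) / (max - min). A worked sketch on invented category counts follows; the numbers are assumptions for illustration.

```python
import numpy as np

# Hypothetical frequency-encoded column with very different magnitudes.
counts = np.array([1.0, 500.0, 1667.0])

# Smallest value becomes 0, largest becomes 1, the rest fall in between.
scaled = (counts - counts.min()) / (counts.max() - counts.min())
print(scaled)
```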

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
scaled_data = scaler.fit_transform(data)
scaled_data

In [None]:
data_scal = pd.DataFrame(scaled_data,columns=data.columns)
data_scal # dataframe of scaled data.

# TASK 2 : To explore football skills and cluster players based on their attributes.

We need to select features from the dataset.

In [None]:
skill_attributes = features
skill_attributes
# features selected based on correlation

- The features selected above have the highest correlations in the dataset, so we selected them.
- From these selected features we separate the numerical and categorical features.
- We already separated the categorical features (categorical_col) from the full set of input features.
- From categorical_col, we filter the categorical features that are needed here.

In [None]:
len(skill_attributes)

In [None]:
x=[]
y=[]
for i in data_scal.columns:
    if i in skill_attributes:
        x.append(i)
    else:
        y.append(i)

In [None]:
y # features to remove from the dataset

In [None]:
sel_fea = data_scal.drop(columns=y)
sel_fea  # selected features for model building

#### With a large number of features the model may not perform well.
#### So we apply PCA to extract new features for model building.

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.90)  # keep enough components to explain 90% of the variance
pca_data = pca.fit_transform(sel_fea)
pca_data

In [None]:
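pca_data = pd.DataFrame(pca_data, columns=[f'PC{n+1}' for n in range(pca_data.shape[1])])
# Name the columns PC1, PC2, ... for however many components PCA kept (five here).
pca_data  # dataframe of the extracted principal components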
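
With a fractional `n_components`, PCA keeps the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below reproduces that selection rule with a plain SVD on random data. The data, the seed and the column scales are assumptions for illustration, and this is not scikit-learn's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 4 features with very different variances.
X = rng.normal(size=(200, 4)) * np.array([10.0, 5.0, 1.0, 0.1])

Xc = X - X.mean(axis=0)                          # centre each feature
singular_values = np.linalg.svd(Xc, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)                # running share of variance

# Smallest number of components whose cumulative share reaches 90%.
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative.round(3), n_components)
```

On a fitted scikit-learn `PCA` object, `np.cumsum(pca.explained_variance_ratio_)` gives the same kind of running total.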

In [None]:
from sklearn.cluster import KMeans


# List to store inertia values
inertia = []

# Range of k values to try
k_values = range(2, 11)

for k in k_values:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(pca_data)  
    inertia.append(kmeans.inertia_)

# Plot inertia vs. k
plt.figure(figsize=(8, 5))
plt.plot(k_values, inertia, marker='o', linestyle='-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method to Determine Optimal k')
plt.xticks(k_values)
plt.grid(True)
plt.show()


## insights
- From the elbow chart above, any k value in [3, 4, 5] is a reasonable choice.
- We use k=4 to build the model, that is, to form the clusters.

In [None]:
inertia # inertia for each k value

In [None]:
# apply kmeans clustering

kmeans = KMeans(n_clusters=4)
cluster = kmeans.fit_predict(pca_data)

In [None]:
from sklearn.metrics import silhouette_score
sil_score = silhouette_score(pca_data,cluster)
sil_score           

## insights
- The KMeans silhouette score is 0.6360542212538013, which indicates that the clusters are well separated.
- To try to increase the silhouette score, we move on to KMeans++.
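
For each point, the silhouette coefficient is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the nearest other cluster. The function below is a plain NumPy sketch of that definition for small data. `silhouette_mean` is a hypothetical helper written for illustration, and the four points are invented.

```python
import numpy as np

def silhouette_mean(X, labels):
    """Mean silhouette coefficient; a plain NumPy sketch for small data."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i, own in enumerate(labels):
        same = labels == own
        same[i] = False                      # exclude the point itself
        if not same.any():                   # singleton cluster: score 0
            scores.append(0.0)
            continue
        a = dist[i, same].mean()             # mean distance within own cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two hypothetical, clearly separated groups of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
score = silhouette_mean(X, [0, 0, 1, 1])
print(round(score, 3))
```

Values near 1 mean tight, well-separated clusters; values near 0 mean overlapping clusters.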

# K-MEANS ++

In [None]:
from sklearn.cluster import KMeans


# List to store inertia values
inertia = []

# Range of k values to try
k_values = range(2, 11)

for k in k_values:
    kmeans_plus = KMeans(n_clusters=k,init='k-means++', random_state=21)
    kmeans_plus.fit(pca_data)  
    inertia.append(kmeans_plus.inertia_)

# Plot inertia vs. k
plt.figure(figsize=(8, 5))
plt.plot(k_values, inertia, marker='o', linestyle='-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method to Determine Optimal k')
plt.xticks(k_values)
plt.grid(True)
plt.show()


## insights
- From the elbow chart above, any k value in [3, 4, 5] is a reasonable choice.
- We use k=4 to build the model, that is, to form the clusters.

In [None]:
k_plus = KMeans(n_clusters=4,init='k-means++')
k_plus_predict = k_plus.fit_predict(pca_data)

In [None]:
sil_score_kplus = silhouette_score(pca_data,k_plus_predict)
sil_score_kplus      

- There is no improvement in the silhouette score. This is expected, because scikit-learn's `KMeans` already uses `init='k-means++'` by default.
- If the dataset has outliers, KMeans and KMeans++ do not perform well.
- Now we move on to a more advanced algorithm, DBSCAN.
- DBSCAN is well suited to detecting outliers, which it labels as noise.

# **DBSCAN**

Before applying **DBSCAN**, we need to determine key parameters such as **ε (Epsilon)** and **MinPts (Minimum Points).**  

## **1. Finding the MinPts**  
A commonly used heuristic for selecting MinPts is the **"Rule of Thumb"**, which states:  

### **Formula:**  
\[
\text{MinPts} = 2 \times (\text{number of dimensions})
\]


In [None]:
MinPts = 2*(len(pca_data.columns))
MinPts     # MinPts is 10 for the 5 principal components

##  2. Finding the eps:
- With the help of MinPts and NearestNeighbors we estimate the best eps value.

In [None]:
from sklearn.neighbors import NearestNeighbors


near = NearestNeighbors(n_neighbors=MinPts)   # Set up the Nearest Neighbors model
near.fit(pca_data)   # Fit the model to the data
distance,points = near.kneighbors(pca_data)  # Find distances to the nearest neighbors

In [None]:
k_distance = distance[:,-1]
k_distance   # distance from each point to its last (MinPts-th) returned neighbor; the point itself counts as the first

In [None]:
k_distance=np.sort(k_distance)
k_distance # sorted in ascending order, low to high
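
When the same data is used to fit and to query, each point is returned as its own nearest neighbour at distance 0. The NumPy sketch below shows this on four invented one-dimensional points. Here `k` is an assumed small value playing the role of MinPts.

```python
import numpy as np

# Hypothetical 1-D points.
X = np.array([[0.0], [1.0], [3.0], [7.0]])
k = 2  # assumed number of neighbours, playing the role of MinPts

dist = np.abs(X - X.T)                 # pairwise distances, shape (4, 4)
sorted_dist = np.sort(dist, axis=1)    # each row ascending; column 0 is the point itself

# Distance to the k-th returned neighbour, sorted for the k-distance plot.
k_dist = np.sort(sorted_dist[:, k - 1])
print(sorted_dist[:, 0], k_dist)
```

So with `k` returned neighbours, the last column holds the distance to the (k - 1)-th other point.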

In [None]:
# Read the eps value off the k-distance graph.

plt.plot(range(len(k_distance)),k_distance)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to the {MinPts}th returned neighbor')
plt.title("K-Distance Plot for DBSCAN eps Selection")
plt.grid(True)
plt.show()

## insights
- From the elbow of the curve above, we choose eps = 0.1 or 0.12.
- Now we have MinPts=10 and eps=0.12.
- We apply these parameter values in DBSCAN to build the model.

In [None]:
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.16,min_samples=MinPts)  # note: eps=0.16 here, above the 0.10-0.12 read off the k-distance plot
db_pred = dbscan.fit_predict(pca_data)

In [None]:
sil_score_DBSCAN = silhouette_score(pca_data,db_pred)
sil_score_DBSCAN             
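
DBSCAN labels noise points as -1, and `silhouette_score` treats -1 like any other cluster label. The sketch below compares the score with and without the noise points on invented data. The points, `eps` and `min_samples` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Hypothetical points: two tight groups and one far-away outlier.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10],
              [50, 50]], dtype=float)

labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(X)

noise = labels == -1                       # DBSCAN marks noise with -1
score_all = silhouette_score(X, labels)    # noise counted as its own cluster
score_clustered = silhouette_score(X[~noise], labels[~noise])

print(labels, score_all, score_clustered)
```

Checking `np.unique(db_pred, return_counts=True)` on the real result would show how many points DBSCAN treated as noise.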

#### There is no improvement in the silhouette score from the more advanced DBSCAN algorithm.
#### So, we build a semi-supervised model to show that the clustering model gives reliable predictions.

# SEMI SUPERVISED MODEL

In [None]:
labels = kmeans.labels_
labels  # cluster labels assigned by KMeans to each data point

In [None]:
# Count the number of data points in each cluster.
# Recorded counts: 1924 points in cluster 0, 14317 in cluster 1, 2037 in cluster 2.

pd.Series(labels).value_counts().sort_index()
  

In [None]:
label_fea = sel_fea.copy()  # copy, so adding the label column leaves sel_fea unchanged

In [None]:
label_fea['Labels']=labels # add the cluster label column to the dataset

In [None]:
label_fea

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import  accuracy_score ,classification_report

# DATA SPLITTING

In [None]:
x = label_fea.drop(columns=['Labels'])
y = label_fea.Labels

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.20,random_state=21)

In [None]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

# MODEL IMPLEMENTATION

In [None]:
model = LogisticRegression()
model.fit(x_train,y_train)

In [None]:
prediction = model.predict(x_test)

In [None]:
accuracy_score(y_test,prediction)

In [None]:
print(classification_report(y_test,prediction))

#### The outputs of the KMeans model are stored in the `labels` variable and added to the dataset, which is named label_fea.
#### label_fea is then used to build a semi-supervised model with Logistic Regression.
#### The resulting semi-supervised model gives an accuracy of 0.9991794310722101.