### Methodologies/Techniques in Data Science Process
CRISP-DM (Cross-Industry Standard Process for Data Mining) gives us a framework for tackling data analysis and machine learning projects.

It is divided into six phases:

1. **Business Understanding**
    * Define the business objective (what goal or problem does the business face?).
    * Create a data mining strategy (based on the goal or problem, determine the most viable dataset to address it).
    * Define the criteria for success.

2. **Data Understanding**
    * Acquire the data (know its format: does it come from a database or from files?).
    * Explore the data (EDA, visualizations and basic statistics).
    * Grade the quality of the data (how many nulls are present? Are there outliers?).

3. **Data Preparation**
    * Select the relevant columns to work with.
    * Normalize the data (bring the values onto one scale, e.g. length, width and height recorded in different units such as cm and mm).

4. **Modelling**
    Building machine learning models.
    * Select the ML algorithm (e.g. linear regression, a classifier, a neural network).
    * Split the data into train and test sets.
    * Fit the model on the train set.
    * Assess and improve the model.

5. **Evaluation**
    Compare the different models' performance and choose the best one for your business.

6. **Deployment**
    * Once you are comfortable with the model, you bring it into production.
    * Monitor and maintain the model over time.
    * Update the model.
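
Phases 3 to 5 can be sketched end to end on synthetic data. This is only a minimal illustration, not a recommended pipeline: the data, the meaning of the columns, the 80/20 split and the least-squares model are all assumptions made for the example, and only NumPy is used.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical data: a length recorded in cm and a width recorded in mm.
length_cm = rng.uniform(10, 100, size=200)
width_mm = rng.uniform(100, 1000, size=200)
# Hypothetical target with a known linear relationship plus noise.
y = 2.0 * length_cm + 0.05 * width_mm + rng.normal(0, 1, size=200)

# Data preparation: bring both features to one unit (cm), then min-max scale to [0, 1].
X = np.column_stack([length_cm, width_mm / 10.0])
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Modelling: shuffle the row indices, then split 80% train / 20% test.
idx = rng.permutation(len(X_scaled))
split = int(0.8 * len(idx))
train_idx, test_idx = idx[:split], idx[split:]


def add_intercept(features):
    """Prepend a column of ones so the model can learn an intercept."""
    return np.column_stack([np.ones(len(features)), features])


# Fit ordinary least squares on the training set only.
coef, *_ = np.linalg.lstsq(add_intercept(X_scaled[train_idx]), y[train_idx], rcond=None)

# Evaluation: measure the error on rows the model has not seen.
pred = add_intercept(X_scaled[test_idx]) @ coef
rmse = float(np.sqrt(np.mean((pred - y[test_idx]) ** 2)))
print(f"Test RMSE: {rmse:.2f}")
```

Keeping the test rows out of the fitting step is what makes the final error a fair estimate of how the model behaves on new data.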

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
df = pd.read_csv('./data/premier-player-23-24.csv')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 580 entries, 0 to 579
Data columns (total 34 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Player       580 non-null    object 
 1   Nation       580 non-null    object 
 2   Pos          580 non-null    object 
 3   Age          580 non-null    float64
 4   MP           580 non-null    int64  
 5   Starts       580 non-null    int64  
 6   Min          580 non-null    float64
 7   90s          580 non-null    float64
 8   Gls          580 non-null    float64
 9   Ast          580 non-null    float64
 10  G+A          580 non-null    float64
 11  G-PK         580 non-null    float64
 12  PK           580 non-null    float64
 13  PKatt        580 non-null    float64
 14  CrdY         580 non-null    float64
 15  CrdR         580 non-null    float64
 16  xG           580 non-null    float64
 17  npxG         580 non-null    float64
 18  xAG          580 non-null    float64
 19  npxG+xAG

In [10]:
df.head()

Unnamed: 0,Player,Nation,Pos,Age,MP,Starts,Min,90s,Gls,Ast,...,Ast_90,G+A_90,G-PK_90,G+A-PK_90,xG_90,xAG_90,xG+xAG_90,npxG_90,npxG+xAG_90,Team
0,Rodri,es ESP,MF,27.0,34,34,2931.0,32.6,8.0,9.0,...,0.28,0.52,0.25,0.52,0.12,0.12,0.24,0.12,0.24,Manchester City
1,Phil Foden,eng ENG,"FW,MF",23.0,35,33,2857.0,31.7,19.0,8.0,...,0.25,0.85,0.6,0.85,0.33,0.26,0.59,0.33,0.59,Manchester City
2,Ederson,br BRA,GK,29.0,33,33,2785.0,30.9,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Manchester City
3,Julián Álvarez,ar ARG,"MF,FW",23.0,36,31,2647.0,29.4,11.0,8.0,...,0.27,0.65,0.31,0.58,0.44,0.22,0.66,0.39,0.61,Manchester City
4,Kyle Walker,eng ENG,DF,33.0,32,30,2767.0,30.7,0.0,4.0,...,0.13,0.13,0.0,0.13,0.01,0.09,0.1,0.01,0.1,Manchester City


In [11]:
df.describe()


Unnamed: 0,Age,MP,Starts,Min,90s,Gls,Ast,G+A,G-PK,PK,...,Gls_90,Ast_90,G+A_90,G-PK_90,G+A-PK_90,xG_90,xAG_90,xG+xAG_90,npxG_90,npxG+xAG_90
count,580.0,580.0,580.0,580.0,580.0,580.0,580.0,580.0,580.0,580.0,...,580.0,580.0,580.0,580.0,580.0,580.0,580.0,580.0,580.0,580.0
mean,24.906897,19.627586,14.413793,1294.584483,14.383448,2.063793,1.481034,3.544828,1.898276,0.165517,...,0.125259,0.091621,0.21681,0.118155,0.209638,0.144983,0.100707,0.245845,0.138431,0.239466
std,4.464593,11.832419,11.926422,1024.720358,11.385342,3.621238,2.360729,5.391389,3.189739,0.77983,...,0.223161,0.160703,0.297085,0.214342,0.287035,0.222225,0.210713,0.348004,0.213947,0.340631
min,15.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,21.0,9.0,3.0,342.75,3.775,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0175,0.01,0.06,0.0175,0.06
50%,25.0,20.0,13.0,1164.0,12.95,1.0,0.0,1.0,1.0,0.0,...,0.03,0.0,0.1,0.03,0.1,0.07,0.06,0.145,0.07,0.145
75%,28.0,30.0,25.0,2104.25,23.4,2.0,2.0,4.0,2.0,0.0,...,0.17,0.13,0.31,0.16,0.3,0.19,0.14,0.37,0.18,0.35
max,38.0,38.0,38.0,3420.0,38.0,27.0,13.0,33.0,20.0,9.0,...,2.65,1.7,2.65,2.65,2.65,3.23,4.44,5.54,3.23,5.54
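
The data-quality questions from the Data Understanding phase (how many nulls, any outliers?) can be answered with a few pandas calls. The sketch below uses a small made-up frame rather than the player file, and the 1.5 × IQR rule is just one common convention for flagging outliers, chosen here as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for two columns of a player table.
sample = pd.DataFrame({
    'Age': [27, 23, 29, np.nan, 33, 24],
    'Gls': [8, 19, 0, 11, 0, 95],  # 95 is a deliberately implausible value
})

# Number of nulls per column.
null_counts = sample.isna().sum()
print(null_counts)

# Flag outliers in one column with the 1.5 * IQR rule.
q1, q3 = sample['Gls'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = sample[(sample['Gls'] < lower) | (sample['Gls'] > upper)]
print(outliers)
```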


### Data Cleaning Techniques
Two methods for handling missing entries:
* `Dropping` - remove the rows that contain missing values
* `Impute` - fill in the missing values:
    * with the mean of that column
    * with the mode of that column
    * with a placeholder (**NaN** ----> `None`)

In [34]:
import pandas as pd
import numpy as np


data = {
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Jane', 'Peter'],
    'Age': [25, np.nan, 30, 22, 22, np.nan],
    'City': ['Nairobi', 'Kampala', None, 'Nairobi', 'Kampala', 'Nairobi']
}
data = pd.DataFrame(data)
data


Unnamed: 0,Name,Age,City
0,Alice,25.0,Nairobi
1,Bob,,Kampala
2,Charlie,30.0,
3,,22.0,Nairobi
4,Jane,22.0,Kampala
5,Peter,,Nairobi


In [4]:
dropped_data = data.dropna()

dropped_data

Unnamed: 0,Name,Age,City
0,Alice,25.0,Nairobi
4,Jane,22.0,Kampala


In [8]:
# Keep only the rows that have a City value.
people_with_cities = data.dropna(subset=['City'])

people_with_cities

Unnamed: 0,Name,Age,City
0,Alice,25.0,Nairobi
1,Bob,,Kampala
3,,22.0,Nairobi
4,Jane,22.0,Kampala
5,Peter,,Nairobi


Drop people without an age.

In [9]:
# Keep only the rows that have an Age value.
people_with_age = data.dropna(subset=['Age'])

people_with_age

Unnamed: 0,Name,Age,City
0,Alice,25.0,Nairobi
2,Charlie,30.0,
3,,22.0,Nairobi
4,Jane,22.0,Kampala


Drop based on two columns.

In [10]:
# Keep only the rows that have both a Name and an Age value.
people_with_age_and_name = data.dropna(subset=['Name', 'Age'])

people_with_age_and_name

Unnamed: 0,Name,Age,City
0,Alice,25.0,Nairobi
2,Charlie,30.0,
4,Jane,22.0,Kampala
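
By default `dropna` removes a row if any of the listed columns is null. Below is a small sketch of two other documented options, `how='all'` and `thresh`, on the same six-row frame (rebuilt here so the block runs on its own). The variable names are made up for the example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Jane', 'Peter'],
    'Age': [25, np.nan, 30, 22, 22, np.nan],
    'City': ['Nairobi', 'Kampala', None, 'Nairobi', 'Kampala', 'Nairobi'],
})

# Default (how='any'): drop a row if Name OR Age is missing.
any_missing = data.dropna(subset=['Name', 'Age'])

# how='all': drop a row only if Name AND Age are both missing.
all_missing = data.dropna(subset=['Name', 'Age'], how='all')

# thresh=3: keep only the rows with at least 3 non-null values.
complete_rows = data.dropna(thresh=3)

print(len(any_missing), len(all_missing), len(complete_rows))
```

No row in this frame is missing both `Name` and `Age`, so `how='all'` keeps every row here.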


Filling in null entries with a fixed value (a value we choose ourselves).

In [19]:
data_copy = data.copy()

data_copy['Name'] = data_copy['Name'].fillna('Brian')

data_copy

Unnamed: 0,Name,Age,City
0,Alice,25.0,Nairobi
1,Bob,,Kampala
2,Charlie,30.0,
3,Brian,22.0,Nairobi
4,Jane,22.0,Kampala
5,Peter,,Nairobi


In [21]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    5 non-null      object 
 1   Age     4 non-null      float64
 2   City    5 non-null      object 
dtypes: float64(1), object(2)
memory usage: 276.0+ bytes


In [26]:
data_copy = data.copy()

data_copy['Age'] = data_copy['Age'].fillna(22)

data_copy


Unnamed: 0,Name,Age,City
0,Alice,25.0,Nairobi
1,Bob,22.0,Kampala
2,Charlie,30.0,
3,,22.0,Nairobi
4,Jane,22.0,Kampala
5,Peter,22.0,Nairobi


Filling in null entries with the mean.

In [29]:
data_copy = data.copy()

# Mean of the non-null ages, rounded to one decimal place.
mean_age = round(data_copy['Age'].mean(), 1)

data_copy['Age'] = data_copy['Age'].fillna(mean_age)

data_copy

Unnamed: 0,Name,Age,City
0,Alice,25.0,Nairobi
1,Bob,24.8,Kampala
2,Charlie,30.0,
3,,22.0,Nairobi
4,Jane,22.0,Kampala
5,Peter,24.8,Nairobi


Using the mode to fill in null entries.

In [38]:
data_copy = data.copy()

city_mode = data_copy['City'].mode()[0]

data_copy['City'] = data_copy['City'].fillna(city_mode)

data_copy

Unnamed: 0,Name,Age,City
0,Alice,25.0,Nairobi
1,Bob,,Kampala
2,Charlie,30.0,Nairobi
3,,22.0,Nairobi
4,Jane,22.0,Kampala
5,Peter,,Nairobi


In [40]:
data_copy = data.copy()

mode_age = data_copy['Age'].mode()[0]

data_copy['Age'] = data_copy['Age'].fillna(mode_age)

data_copy

Unnamed: 0,Name,Age,City
0,Alice,25.0,Nairobi
1,Bob,22.0,Kampala
2,Charlie,30.0,
3,,22.0,Nairobi
4,Jane,22.0,Kampala
5,Peter,22.0,Nairobi
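
The mean and mode fills can also be combined in one call by passing `fillna` a dictionary that maps column names to fill values. Here is a minimal sketch on the same frame (rebuilt so the block is self-contained); filling `Name` with the placeholder `'Unknown'` is an arbitrary choice made for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Jane', 'Peter'],
    'Age': [25, np.nan, 30, 22, 22, np.nan],
    'City': ['Nairobi', 'Kampala', None, 'Nairobi', 'Kampala', 'Nairobi'],
})

fill_values = {
    'Name': 'Unknown',                     # placeholder, hypothetical choice
    'Age': round(data['Age'].mean(), 1),   # mean of the non-null ages
    'City': data['City'].mode()[0],        # most frequent city
}

# Each column is filled with its own value; the original frame is left untouched.
filled = data.fillna(value=fill_values)
print(filled)
```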


Using the next valid row to fill in the null entries (backward fill).

In [42]:
data_copy = data.copy()

# Backward fill: each null takes the next valid value below it.
# Series.bfill() replaces the deprecated fillna(method='bfill').
data_copy['Name'] = data_copy['Name'].bfill()

data_copy


Unnamed: 0,Name,Age,City
0,Alice,25.0,Nairobi
1,Bob,,Kampala
2,Charlie,30.0,
3,Jane,22.0,Nairobi
4,Jane,22.0,Kampala
5,Peter,,Nairobi
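
`bfill` copies the next valid value upward, while `ffill` copies the previous valid value downward. A short sketch comparing the two on the `Name` values (the small Series is rebuilt here so the block is self-contained):

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', 'Charlie', None, 'Jane', 'Peter'])

forward = names.ffill()   # index 3 takes the previous value, 'Charlie'
backward = names.bfill()  # index 3 takes the next value, 'Jane'

print(pd.DataFrame({'original': names, 'ffill': forward, 'bfill': backward}))
```

Either method only makes sense when the row order carries meaning (for example, time-ordered records); for unordered rows like these it is shown purely to illustrate the mechanics.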


In [12]:
player_stats = pd.read_csv('epl_player_stats_24_25.csv')

player_stats.head()

Unnamed: 0,Player Name,Club,Nationality,Position,Appearances,Minutes,Goals,Assists,Shots,Shots On Target,...,Fouls,Yellow Cards,Red Cards,Saves,Saves %,Penalties Saved,Clearances Off Line,Punches,High Claims,Goals Prevented
0,Ben White,Arsenal,England,DEF,17,1198,0,2,9,12,...,10,2,0,0,0%,0,0,0,0,0.0
1,Bukayo Saka,Arsenal,England,MID,25,1735,6,10,67,2,...,15,3,0,0,0%,0,0,0,0,0.0
2,David Raya,Arsenal,Spain,GKP,38,3420,0,0,0,0,...,1,3,0,86,72%,0,0,8,53,2.1
3,Declan Rice,Arsenal,England,MID,35,2833,4,7,48,18,...,21,5,1,0,0%,0,0,0,0,0.0
4,Ethan Nwaneri,Arsenal,England,MID,26,889,4,0,24,0,...,9,1,0,0,0%,0,0,0,0,0.0


In [13]:
player_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 562 entries, 0 to 561
Data columns (total 57 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Player Name                562 non-null    object 
 1   Club                       562 non-null    object 
 2   Nationality                562 non-null    object 
 3   Position                   562 non-null    object 
 4   Appearances                562 non-null    int64  
 5   Minutes                    562 non-null    int64  
 6   Goals                      562 non-null    int64  
 7   Assists                    562 non-null    int64  
 8   Shots                      562 non-null    int64  
 9   Shots On Target            562 non-null    int64  
 10  Conversion %               562 non-null    object 
 11  Big Chances Missed         562 non-null    int64  
 12  Hit Woodwork               562 non-null    int64  
 13  Offsides                   562 non-null    int64  

### To-do list
* Write a function to remove the percent sign (`%`) from columns such as `Conversion %` and `Saves %` so they can be converted to numbers.
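
One possible way to tackle this to-do item is sketched below. The helper name `percent_to_float` is hypothetical, and the sketch assumes the percentage columns hold strings such as `'72%'`, as seen in `Saves %` above. The small frame stands in for `player_stats`, and its `Conversion %` values are made up.

```python
import pandas as pd


def percent_to_float(column: pd.Series) -> pd.Series:
    """Strip a trailing '%' and convert to float; unparseable values become NaN."""
    cleaned = column.astype(str).str.strip().str.rstrip('%')
    return pd.to_numeric(cleaned, errors='coerce').astype(float)


# Hypothetical stand-in for two percentage columns of player_stats.
stats = pd.DataFrame({
    'Saves %': ['0%', '72%', '0%'],
    'Conversion %': ['0%', '9%', 'n/a'],  # made-up values, including one bad entry
})

for col in ['Saves %', 'Conversion %']:
    stats[col] = percent_to_float(stats[col])

print(stats.dtypes)
```

Using `errors='coerce'` turns anything that is not a number into `NaN` instead of raising, so unexpected entries show up later as nulls that can be counted and handled.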

In [14]:
player_stats['Club'].value_counts()

Club
Southampton                34
Wolverhampton Wanderers    32
Leicester City             31
Tottenham Hotspur          31
Manchester City            31
Manchester United          30
Ipswich Town               30
Bournemouth                29
Chelsea                    28
Brighton & Hove Albion     28
Brentford                  28
West Ham United            28
Aston Villa                27
Crystal Palace             26
Everton                    26
Fulham                     26
Arsenal                    25
Liverpool                  24
Newcastle United           23
Nottingham Forest          23
Brighton                    2
Name: count, dtype: int64
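
The counts show both `Brighton & Hove Albion` (28) and `Brighton` (2), which suggests one club recorded under two labels. Assuming that is the case, a mapping passed to `replace` would merge them. The small Series below stands in for `player_stats['Club']`, and the name `club_aliases` is made up for the example.

```python
import pandas as pd

clubs = pd.Series(['Brighton & Hove Albion', 'Brighton', 'Arsenal', 'Brighton'])

# Assumption: 'Brighton' is a variant label for 'Brighton & Hove Albion'.
club_aliases = {'Brighton': 'Brighton & Hove Albion'}

# replace() with a dict matches whole values, so the longer name is left as it is.
clubs_clean = clubs.replace(club_aliases)

print(clubs_clean.value_counts())
```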