# Data cleaning and logistic regression

We will use the **FIFA 22** dataset. [Link](https://www.kaggle.com/datasets/bryanb/fifa-player-stats-database?resource=download&select=FIFA22_official_data.csv)

> The dataset contains +17k unique players and more than 60 columns, general information and all KPIs the famous videogame offers. As the esport scene keeps rising, especially on FIFA, I thought it can be useful for the community (kagglers and/or gamers)
>
> Context
>
> The data was retrieved thanks to a crawler that I implemented to retrieve:
>
> - Aggregated data such as name of the players, age, country
> - Detailed data such as offensive potential, defense, acceleration
>
> I like football a lot and this dataset is for me the opportunity to bring my contribution for the realization of projects that can go from simple analysis to elaboration of strategies on optimal composition under constraints…


The goal of this **mini** assignment is to predict each player's position using four of the available features. Two linear classifiers must be trained: **Logistic Regression** and **Perceptron**.


## Importing libraries

The first step, as always, is to import the libraries we will use. **You may need to import more libraries to complete the exercise.**

In [2]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model

import pandas as pd
import numpy as np

## Reading the data

We read the data with the ``pandas`` library. The file is the one you downloaded from Kaggle.

In [3]:
df = pd.read_csv("./dades.csv")
df.head()

Unnamed: 0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,Club Logo,...,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Best Position,Best Overall Rating,Release Clause,DefensiveAwareness
0,212198,Bruno Fernandes,26,https://cdn.sofifa.com/players/212/198/22_60.png,Portugal,https://cdn.sofifa.com/flags/pt.png,88,89,Manchester United,https://cdn.sofifa.com/teams/11/30.png,...,65.0,12.0,14.0,15.0,8.0,14.0,CAM,88.0,€206.9M,72.0
1,209658,L. Goretzka,26,https://cdn.sofifa.com/players/209/658/22_60.png,Germany,https://cdn.sofifa.com/flags/de.png,87,88,FC Bayern München,https://cdn.sofifa.com/teams/21/30.png,...,77.0,13.0,8.0,15.0,11.0,9.0,CM,87.0,€160.4M,74.0
2,176580,L. Suárez,34,https://cdn.sofifa.com/players/176/580/22_60.png,Uruguay,https://cdn.sofifa.com/flags/uy.png,88,88,Atlético de Madrid,https://cdn.sofifa.com/teams/240/30.png,...,38.0,27.0,25.0,31.0,33.0,37.0,ST,88.0,€91.2M,42.0
3,192985,K. De Bruyne,30,https://cdn.sofifa.com/players/192/985/22_60.png,Belgium,https://cdn.sofifa.com/flags/be.png,91,91,Manchester City,https://cdn.sofifa.com/teams/10/30.png,...,53.0,15.0,13.0,5.0,10.0,13.0,CM,91.0,€232.2M,68.0
4,224334,M. Acuña,29,https://cdn.sofifa.com/players/224/334/22_60.png,Argentina,https://cdn.sofifa.com/flags/ar.png,84,84,Sevilla FC,https://cdn.sofifa.com/teams/481/30.png,...,82.0,8.0,14.0,13.0,13.0,14.0,LB,84.0,€77.7M,80.0


## EDA: *Exploratory data analysis*


## Data cleaning

### Dropping columns

In [4]:
# Drop image-URL columns, which carry no predictive information
df.drop(columns=['Photo', 'Flag', 'Club Logo'], inplace=True)

df.head()

Unnamed: 0,ID,Name,Age,Nationality,Overall,Potential,Club,Value,Wage,Special,...,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Best Position,Best Overall Rating,Release Clause,DefensiveAwareness
0,212198,Bruno Fernandes,26,Portugal,88,89,Manchester United,€107.5M,€250K,2341,...,65.0,12.0,14.0,15.0,8.0,14.0,CAM,88.0,€206.9M,72.0
1,209658,L. Goretzka,26,Germany,87,88,FC Bayern München,€93M,€140K,2314,...,77.0,13.0,8.0,15.0,11.0,9.0,CM,87.0,€160.4M,74.0
2,176580,L. Suárez,34,Uruguay,88,88,Atlético de Madrid,€44.5M,€135K,2307,...,38.0,27.0,25.0,31.0,33.0,37.0,ST,88.0,€91.2M,42.0
3,192985,K. De Bruyne,30,Belgium,91,91,Manchester City,€125.5M,€350K,2304,...,53.0,15.0,13.0,5.0,10.0,13.0,CM,91.0,€232.2M,68.0
4,224334,M. Acuña,29,Argentina,84,84,Sevilla FC,€37M,€45K,2292,...,82.0,8.0,14.0,13.0,13.0,14.0,LB,84.0,€77.7M,80.0


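### Changing the data type

Remove the currency signs and expand the `M` and `K` suffixes to millions and thousands respectively. Note that appending zeros as text only works for whole numbers: `€107.5M` becomes the string `107.5000000`, which parses as `107.5`, as the output of this cell shows.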

In [5]:
# Value
df['Value'] = df['Value'].str.replace('€', '') # remove €
df['Value'] = df['Value'].str.replace('M', '000000') # replace M with 000000
df['Value'] = df['Value'].str.replace('K', '000') # replace K with 000

# Wage
df['Wage'] = df['Wage'].str.replace('€', '') # remove €
df['Wage'] = df['Wage'].str.replace('K', '000') # replace K with 000
df['Wage'] = df['Wage'].str.replace('M', '000000') # replace M with 000000

# Release Clause
df['Release Clause'] = df['Release Clause'].str.replace('€', '') # remove €
df['Release Clause'] = df['Release Clause'].str.replace('M', '000000') # replace M with 000000
df['Release Clause'] = df['Release Clause'].str.replace('K', '000') # replace K with 000

# Result
df['Value'] = df['Value'].astype(float)
df['Wage'] = df['Wage'].astype(float)
df['Release Clause'] = df['Release Clause'].astype(float)

df.head()

Unnamed: 0,ID,Name,Age,Nationality,Overall,Potential,Club,Value,Wage,Special,...,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Best Position,Best Overall Rating,Release Clause,DefensiveAwareness
0,212198,Bruno Fernandes,26,Portugal,88,89,Manchester United,107.5,250000.0,2341,...,65.0,12.0,14.0,15.0,8.0,14.0,CAM,88.0,206.9,72.0
1,209658,L. Goretzka,26,Germany,87,88,FC Bayern München,93000000.0,140000.0,2314,...,77.0,13.0,8.0,15.0,11.0,9.0,CM,87.0,160.4,74.0
2,176580,L. Suárez,34,Uruguay,88,88,Atlético de Madrid,44.5,135000.0,2307,...,38.0,27.0,25.0,31.0,33.0,37.0,ST,88.0,91.2,42.0
3,192985,K. De Bruyne,30,Belgium,91,91,Manchester City,125.5,350000.0,2304,...,53.0,15.0,13.0,5.0,10.0,13.0,CM,91.0,232.2,68.0
4,224334,M. Acuña,29,Argentina,84,84,Sevilla FC,37000000.0,45000.0,2292,...,82.0,8.0,14.0,13.0,13.0,14.0,LB,84.0,77.7,80.0
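
In the output above, values with a decimal point are mis-scaled (for example, `€107.5M` ends up as `107.5`). One way to avoid this, sketched here under assumptions, is to split each value into its numeric part and its suffix and multiply instead of appending zeros. The helper name `parse_money` and the small `sample` frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def parse_money(series: pd.Series) -> pd.Series:
    """Convert strings such as '€107.5M' or '€250K' to floats in euros."""
    # Capture the numeric part and an optional K/M suffix separately.
    parts = series.astype(str).str.extract(
        r"^€?(?P<number>\d+(?:\.\d+)?)(?P<suffix>[KM]?)$"
    )
    # Map each suffix to its multiplier; values that did not match stay NaN.
    multiplier = parts["suffix"].map({"K": 1e3, "M": 1e6, "": 1.0}).astype(float)
    return parts["number"].astype(float) * multiplier


# Hypothetical sample using a few values shown in the tables above
sample = pd.DataFrame(
    {
        "Value": ["€107.5M", "€93M", "€44.5M"],
        "Wage": ["€250K", "€140K", "€135K"],
    }
)
for column in ["Value", "Wage"]:
    sample[column] = parse_money(sample[column])

print(sample)
```

Because the input is cast to `str` first, calling the helper again on an already converted column should leave the values unchanged rather than raise.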



### Abstracting the code

The previous code block can be abstracted into a function that takes the columns to change and applies the conversions to the dataframe.

In [9]:
def mod_column(
    df, columns, type_to_convert=float, chars_to_remove=(), chars_to_replace=()
):
    """Clean the given string columns of df and cast them to type_to_convert."""
    for column in columns:
        # Remove unwanted characters such as the currency sign
        for char in chars_to_remove:
            df[column] = df[column].str.replace(char, "", regex=False)

        # Replace each (old, new) pair, e.g. ("K", "000")
        for old, new in chars_to_replace:
            df[column] = df[column].str.replace(old, new, regex=False)

        df[column] = df[column].astype(type_to_convert)

    return df


df = mod_column(
    df,
    columns=['Value', 'Wage', 'Release Clause'],
    type_to_convert=float,
    chars_to_remove=["€"],
    chars_to_replace=[("K", "000"), ("M", "000000")],
)

df.head()


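AttributeError: Can only use .str accessor with string values!

The call fails because the earlier cell already converted `Value`, `Wage` and `Release Clause` to `float`, so the `.str` accessor is no longer available on those columns. Reloading the CSV before calling the function avoids the error.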
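
Notebook cells are often re-run, so it can help to make the cleaning step safe to apply twice. The sketch below is one possible variant, not the original function: it skips columns that are already numeric and works on a copy. The name `mod_column_safe` and the `wages` frame are assumptions for illustration, and it keeps the text-replacement strategy used above, so it is only shown on whole-number values.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def mod_column_safe(
    df, columns, type_to_convert=float, chars_to_remove=(), chars_to_replace=()
):
    """Like the cleaning function above, but skips already-numeric columns."""
    result = df.copy()  # leave the caller's dataframe untouched
    for column in columns:
        if is_numeric_dtype(result[column]):
            continue  # already converted, nothing to clean
        cleaned = result[column]
        for char in chars_to_remove:
            cleaned = cleaned.str.replace(char, "", regex=False)
        for old, new in chars_to_replace:
            cleaned = cleaned.str.replace(old, new, regex=False)
        result[column] = cleaned.astype(type_to_convert)
    return result


# Hypothetical sample with two wages taken from the tables above
wages = pd.DataFrame({"Wage": ["€250K", "€140K"]})
kwargs = dict(chars_to_remove=["€"], chars_to_replace=[("K", "000")])

once = mod_column_safe(wages, ["Wage"], **kwargs)
twice = mod_column_safe(once, ["Wage"], **kwargs)  # second call is a no-op
print(twice)
```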

## Training
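
The assignment asks for two linear classifiers trained on four features. The following is a minimal sketch of what that step could look like, using synthetic stand-in data from `make_classification` rather than the real FIFA 22 columns; the choice of three classes, the scaling step, and all hyperparameters are assumptions, not the expected solution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: four numeric features and a three-class label.
# With the real data, X would hold four chosen player attributes and
# y a position label.
X, y = make_classification(
    n_samples=500,
    n_features=4,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Both models are linear; scaling the features first usually helps them converge.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "perceptron": make_pipeline(StandardScaler(), Perceptron(random_state=0)),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "trained on", X_train.shape[0], "samples")
```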

## Evaluation
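
A simple way to compare the two classifiers is to score them on a held-out test split. The sketch below again uses synthetic stand-in data, so the numbers it prints say nothing about the FIFA 22 dataset; accuracy and the confusion matrix are just one reasonable choice of metrics, assumed here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for four player features and a position label
X, y = make_classification(
    n_samples=500,
    n_features=4,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

candidates = [
    ("logistic_regression", LogisticRegression(max_iter=1000)),
    ("perceptron", Perceptron(random_state=0)),
]

scores = {}
for name, model in candidates:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Fraction of test samples whose predicted class matches the true class
    scores[name] = accuracy_score(y_test, y_pred)
    print(name, "accuracy:", round(scores[name], 3))
    # Rows are true classes, columns are predicted classes
    print(confusion_matrix(y_test, y_pred))
```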