# Milestone 1 

## Dataset
International football results from 1872 to 2022. The dataset consists of two CSV files, `results.csv` and `shootouts.csv`.

### About the dataset
This dataset includes `43,919` results of international football matches, from the very first official match in 1872 up to 2022. The matches range from the FIFA World Cup to the FIFI Wild Cup to regular friendly matches.

### Dataset columns

`results.csv`

| date | home_team | away_team | home_score | away_score | tournament | city | country | neutral |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Date of the match | Name of the home team | Name of the away team | Full-time home team score, including extra time but not penalty shootouts | Full-time away team score, including extra time but not penalty shootouts | Name of the tournament | Name of the city/town/administrative unit where the match was played | Name of the country where the match was played | Whether the match was played at a neutral venue |
| `yyyy-mm-dd` format date | string | string | number | number | string | string | string | TRUE or FALSE |
| `quantitative/discrete`|`qualitative/nominal`| `qualitative/nominal`| `quantitative/discrete` | `quantitative/discrete` | `qualitative/nominal` | `qualitative/nominal` | `qualitative/nominal` | `qualitative/nominal` |


`shootouts.csv`

| date | home_team | away_team | winner |
| --- | --- | --- | --- |
| Date of the match | Name of the home team | Name of the away team | Winner of the penalty shootout |
| `yyyy-mm-dd` format date | string | string | string |
| `qualitative/nominal` | `qualitative/nominal` | `qualitative/nominal` | `qualitative/nominal`|


## Possible missing values
- **Goals**: per-goal data (scorer and minute) would give more insight into the players: who the top scorers are in each competition, which players tend to score against certain teams, and at which minutes a goal is most likely in matches involving certain teams.
- **Continent**: would show whether a team is more likely to win or lose when playing on certain continents, which could inform expectations for the next World Cup.
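
As a rough illustration of how a `continent` column could be derived from the existing `country` column, here is a minimal sketch. The lookup dictionary `country_to_continent` is hypothetical and deliberately incomplete, and the small frame stands in for `results.csv`.

```python
import pandas

# Small stand-in for results.csv; only the column needed here.
sample = pandas.DataFrame({
    "country": ["Austria", "Ghana", "Jamaica", "Switzerland"],
})

# Hypothetical, incomplete lookup. A real one would need an entry
# for every host country that appears in the file.
country_to_continent = {
    "Austria": "Europe",
    "Ghana": "Africa",
    "Jamaica": "North America",
}

# Countries missing from the lookup become NaN instead of raising an error.
sample["continent"] = sample["country"].map(country_to_continent)

# List the countries that still need a continent assigned.
unmapped = sample.loc[sample["continent"].isna(), "country"].unique()
print(sample)
print("Unmapped countries:", list(unmapped))
```

Using `map` rather than a join keeps the gaps visible: every country without an entry shows up in `unmapped`.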
  



## Let's start

### Importing libraries

In [159]:
import pandas 
import numpy 
import matplotlib.pyplot as plot
import seaborn 
import random

### Load datasets

In [160]:
results = pandas.read_csv('results.csv')
results['total_score'] = results["away_score"] + results['home_score']
shootouts = pandas.read_csv('shootouts.csv')

In [161]:
results.sample(5)

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,total_score
20765,1997-04-30,Austria,Estonia,2,0,FIFA World Cup qualification,Vienna,Austria,False,2
5561,1963-12-01,Ghana,Sudan,3,0,African Cup of Nations,Accra,Ghana,False,3
28881,2006-09-03,Antigua and Barbuda,Dominica,1,0,Friendly,St. John's,Antigua and Barbuda,False,1
17477,1992-09-09,Switzerland,Scotland,3,1,FIFA World Cup qualification,Berne,Switzerland,False,4
2975,1949-05-07,Jamaica,Trinidad and Tobago,1,2,Friendly,Kingston,Jamaica,False,3


In [162]:
shootouts.sample(5)

Unnamed: 0,date,home_team,away_team,winner
411,2015-07-19,Trinidad and Tobago,Panama,Panama
174,1995-11-28,Mauritania,Sierra Leone,Sierra Leone
338,2006-02-04,Cameroon,Ivory Coast,Ivory Coast
204,1998-01-31,Iran,Chile,Iran
329,2005-07-24,United States,Panama,United States


### Dimensions of the data

In [163]:
print(f'Results: {results.shape[0]} Rows, {results.shape[1]} Columns')
print(f'Shootouts: {shootouts.shape[0]} Rows, {shootouts.shape[1]} Columns')



Results: 43919 Rows, 10 Columns
Shootouts: 503 Rows, 4 Columns


### Data types available 

#### Results:

In [164]:
results.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43919 entries, 0 to 43918
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   date         43919 non-null  object
 1   home_team    43919 non-null  object
 2   away_team    43919 non-null  object
 3   home_score   43919 non-null  int64 
 4   away_score   43919 non-null  int64 
 5   tournament   43919 non-null  object
 6   city         43919 non-null  object
 7   country      43919 non-null  object
 8   neutral      43919 non-null  bool  
 9   total_score  43919 non-null  int64 
dtypes: bool(1), int64(3), object(6)
memory usage: 3.1+ MB


#### Shootouts

In [165]:
shootouts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   date       503 non-null    object
 1   home_team  503 non-null    object
 2   away_team  503 non-null    object
 3   winner     503 non-null    object
dtypes: object(4)
memory usage: 15.8+ KB
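
Both `info()` outputs report `date` as `object`, meaning the dates are stored as plain strings. A minimal sketch of converting them to real datetimes, using a small stand-in frame instead of the CSV files, could look like this. The `year` column is an illustrative addition, not part of the dataset.

```python
import pandas

# Tiny stand-in for results.csv so the sketch runs on its own.
sample = pandas.DataFrame({
    "date": ["1872-11-30", "1997-04-30", "2006-09-03"],
    "home_team": ["Scotland", "Austria", "Antigua and Barbuda"],
})

# Parse the yyyy-mm-dd strings into datetime64 values.
sample["date"] = pandas.to_datetime(sample["date"], format="%Y-%m-%d")

# With a datetime column, date parts are available through the .dt accessor.
sample["year"] = sample["date"].dt.year

print(sample.dtypes)
```

The same conversion can also be requested at load time with the `parse_dates` argument of `pandas.read_csv`.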


### Missing values, duplicate entries, and outliers

##### Missing values

In [166]:
results.isna().sum()

date           0
home_team      0
away_team      0
home_score     0
away_score     0
tournament     0
city           0
country        0
neutral        0
total_score    0
dtype: int64

In [167]:
shootouts.isna().sum()

date         0
home_team    0
away_team    0
winner       0
dtype: int64

##### Duplicated entries

In [168]:
print(f'Results duplicated entries: {results.duplicated().sum()}')
print(f'Shootouts duplicated entries: {shootouts.duplicated().sum()}')

Results duplicated entries: 0
Shootouts duplicated entries: 0
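
`duplicated()` without arguments only flags rows that match in every column. A stricter check, sketched below under the assumption that a match is identified by `date`, `home_team` and `away_team`, compares only those key columns. The rows and city spellings are illustrative values, not taken from the file.

```python
import pandas

# Illustrative rows: the first two describe the same match,
# but the city is spelled differently.
sample = pandas.DataFrame({
    "date": ["2006-02-04", "2006-02-04", "2015-07-19"],
    "home_team": ["Cameroon", "Cameroon", "Trinidad and Tobago"],
    "away_team": ["Ivory Coast", "Ivory Coast", "Panama"],
    "city": ["Cairo", "Al Qahirah", "East Rutherford"],
})

# Assumed match key; adjust if two teams can meet twice on one day.
key = ["date", "home_team", "away_team"]

full_dupes = sample.duplicated().sum()           # identical in every column
key_dupes = sample.duplicated(subset=key).sum()  # identical on the key only

print(f"Full-row duplicates: {full_dupes}")
print(f"Key duplicates: {key_dupes}")
```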


### Outliers - most common and uncommon results

In [169]:
results_unique_score = results.groupby(['home_score', 'away_score'], as_index=False).size()
seaborn.relplot(data= results_unique_score, x= 'home_score', y= 'away_score',size='size', sizes=(5,400), alpha= 0.5)

<seaborn.axisgrid.FacetGrid at 0x13a8449a0>
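
The bubble plot shows the pattern visually; the same grouping can also be ranked to list the most and least common scorelines explicitly. This is a minimal sketch on a small made-up sample, and the cut-off `max_count` for calling a scoreline rare is a hypothetical choice.

```python
import pandas

# Made-up scores standing in for results.csv.
sample = pandas.DataFrame({
    "home_score": [1, 1, 2, 1, 0, 31, 1, 2],
    "away_score": [0, 0, 1, 0, 0, 0, 1, 1],
})

# Count each (home_score, away_score) pair, most frequent first.
counts = (
    sample.groupby(["home_score", "away_score"], as_index=False)
    .size()
    .sort_values("size", ascending=False)
)
counts["share"] = counts["size"] / counts["size"].sum()

# Hypothetical threshold: scorelines seen at most this many times are "rare".
max_count = 1
rare = counts[counts["size"] <= max_count]

print(counts.head(3))  # most common scorelines
print(rare)            # candidates for closer inspection
```

A rare scoreline is not necessarily an error, so the `rare` frame is best read as a list of rows to inspect rather than rows to drop.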

### Descriptive statistics of your data

In [170]:
results.describe()

Unnamed: 0,home_score,away_score,total_score
count,43919.0,43919.0,43919.0
mean,1.741365,1.17892,2.920285
std,1.748843,1.395755,2.081746
min,0.0,0.0,0.0
25%,1.0,0.0,1.0
50%,1.0,1.0,3.0
75%,2.0,2.0,4.0
max,31.0,21.0,31.0


### Get an overview of the distribution of the quantitative variables

In [171]:
seaborn.displot(data=results, x="home_score", kind = "hist", discrete=True, aspect= 2 )

<seaborn.axisgrid.FacetGrid at 0x1396b5180>

In [172]:
seaborn.displot(data=results, x="away_score", kind = "hist", discrete=True, aspect= 2 )

<seaborn.axisgrid.FacetGrid at 0x138552ef0>

In [173]:
seaborn.displot(data=results, x="total_score", kind = "hist", discrete=True, aspect= 2 )

<seaborn.axisgrid.FacetGrid at 0x13d6cb070>

### Better ways to show qualitative data in your dataset

Prepare the dataset by adding a `winner` column and a `winner_condition` column.

In [174]:
home_win = results['home_score'] > results['away_score']
away_win = results['home_score'] < results['away_score']
# Winner is the team name, or 'draw' when neither side scored more.
results['winner'] = numpy.select([home_win, away_win], [results['home_team'], results['away_team']], default='draw')
results['winner_condition'] = numpy.select([home_win, away_win], ['home', 'away'], default='draw')

results.head(10)


Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,total_score,winner,winner_condition
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False,0,draw,draw
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False,6,England,home
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False,3,Scotland,home
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False,4,draw,draw
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False,3,Scotland,home
5,1876-03-25,Scotland,Wales,4,0,Friendly,Glasgow,Scotland,False,4,Scotland,home
6,1877-03-03,England,Scotland,1,3,Friendly,London,England,False,4,Scotland,away
7,1877-03-05,Wales,Scotland,0,2,Friendly,Wrexham,Wales,False,2,Scotland,away
8,1878-03-02,Scotland,England,7,2,Friendly,Glasgow,Scotland,False,9,Scotland,home
9,1878-03-23,Scotland,Wales,9,0,Friendly,Glasgow,Scotland,False,9,Scotland,home
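
Since the scores exclude penalty shootouts, some matches labelled `draw` were in fact decided by a shootout. One possible way to bring that in is a left merge with the shootouts table on `date`, `home_team` and `away_team`. The sketch below uses small stand-in frames; the 1-1 score for the first match and the column names `shootout_winner` and `decided_by_shootout` are illustrative assumptions.

```python
import pandas

# Stand-ins for results.csv and shootouts.csv.
results_sample = pandas.DataFrame({
    "date": ["2006-02-04", "1997-04-30"],
    "home_team": ["Cameroon", "Austria"],
    "away_team": ["Ivory Coast", "Estonia"],
    "home_score": [1, 2],  # first score is illustrative
    "away_score": [1, 0],
})
shootouts_sample = pandas.DataFrame({
    "date": ["2006-02-04"],
    "home_team": ["Cameroon"],
    "away_team": ["Ivory Coast"],
    "winner": ["Ivory Coast"],
})

key = ["date", "home_team", "away_team"]

# A left merge keeps every match; matches without a shootout get NaN.
merged = results_sample.merge(
    shootouts_sample.rename(columns={"winner": "shootout_winner"}),
    on=key,
    how="left",
)
merged["decided_by_shootout"] = merged["shootout_winner"].notna()

print(merged)
```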


Teams with the most wins in history, split into home wins and away wins

In [175]:
figure, axes = plot.subplots(
    1,
    1,
    sharex=False,
    figsize=(20, 180),
)
seaborn.countplot(data=results, y='winner', order=results.winner.value_counts().index, hue='winner_condition', palette = ['blue', '#32a852', '#e8685f'])

<AxesSubplot:xlabel='count', ylabel='winner'>
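
Plotting every team produces a very tall figure, and `draw` appears as if it were a team. A possible alternative, sketched here on a small made-up sample, is to drop the draws and keep only the top teams before plotting. The name `top_n` and its value are hypothetical.

```python
import pandas

# Made-up rows with the two derived columns from the cell above.
sample = pandas.DataFrame({
    "winner": ["Scotland", "England", "Scotland", "draw",
               "Scotland", "Wales", "England", "draw"],
    "winner_condition": ["home", "home", "home", "draw",
                         "away", "home", "away", "draw"],
})

top_n = 2  # hypothetical cut-off; something like 20 may suit the full data

# Exclude draws so that only real teams are ranked.
decided = sample[sample["winner"] != "draw"]

# value_counts sorts descending, so head() gives the teams with most wins.
top_teams = decided["winner"].value_counts().head(top_n).index
top_results = decided[decided["winner"].isin(top_teams)]

print(list(top_teams))
print(top_results["winner_condition"].value_counts())
```

`top_results` can then be passed to `seaborn.countplot` with `order=top_teams` and `hue='winner_condition'`, using a much smaller `figsize`.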

### Play around with the correlation between the features
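
A minimal starting point, assuming only the numeric score columns are of interest, is a Pearson correlation matrix computed with `DataFrame.corr()`. The sample scores below are the first rows shown in `results.head(10)`.

```python
import pandas

# First six matches from the head() output above.
sample = pandas.DataFrame({
    "home_score": [0, 4, 2, 2, 3, 4],
    "away_score": [0, 2, 1, 2, 0, 0],
})
sample["total_score"] = sample["home_score"] + sample["away_score"]

# Pairwise Pearson correlation between the numeric columns.
corr = sample[["home_score", "away_score", "total_score"]].corr()

print(corr.round(2))
```

Because `total_score` is the sum of the other two columns, it correlates with both by construction; the `home_score` versus `away_score` entry is the more informative one. The matrix can be passed to `seaborn.heatmap(corr, annot=True)` for a visual version.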