# Milestone 1 

## Dataset
International football results from 1872 to 2022. The dataset consists of two CSV files, `results.csv` and `shootouts.csv`.

### About the dataset
This dataset includes `43,751` results of international football matches, starting from the very first official match in 1872 up to 2022. The matches range from the FIFA World Cup to the FIFI Wild Cup to regular friendly matches.

### Dataset columns

`results.csv`

| date | home_team | away_team | home_score | away_score | tournament | city | country | neutral |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Date of the match | Name of the home team | Name of the away team | Full-time home team score including extra time, not including penalty shootouts | Full-time away team score including extra time, not including penalty shootouts | Name of the tournament | Name of the city/town/administrative unit where the match was played | Name of the country where the match was played | Column indicating whether the match was played at a neutral venue |
| `yyyy-mm-dd` format date | string | string | number | number | string | string | string | TRUE or FALSE |
| `quantitative/discrete`|`qualitative/nominal`| `qualitative/nominal`| `quantitative/discrete` | `quantitative/discrete` | `qualitative/nominal` | `qualitative/nominal` | `qualitative/nominal` | `qualitative/nominal` |


`shootouts.csv`

| date | home_team | away_team | winner |
| --- | --- | --- | --- |
| Date of the match | Name of the home team | Name of the away team | Winner of the penalty shootout |
| string | string | string | string |
| `qualitative/nominal` | `qualitative/nominal` | `qualitative/nominal` | `qualitative/nominal`|
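Because both files share the `date`, `home_team`, and `away_team` columns, they can be joined to see which matches went to a penalty shootout. The following is a minimal sketch, assuming those three columns identify a match uniquely; the small frames are illustrative stand-ins for the CSV files, and the `went_to_shootout` column name is hypothetical.

```python
import pandas

# Illustrative stand-ins; in the notebook these frames come from the CSV files.
results = pandas.DataFrame({
    "date": ["1994-07-10", "1992-10-14"],
    "home_team": ["Romania", "Belgium"],
    "away_team": ["Sweden", "Romania"],
    "home_score": [2, 1],  # the first row's scores are illustrative values
    "away_score": [2, 0],
})
shootouts = pandas.DataFrame({
    "date": ["1994-07-10"],
    "home_team": ["Romania"],
    "away_team": ["Sweden"],
    "winner": ["Sweden"],
})

# Assumption: (date, home_team, away_team) identifies a match in both files.
keys = ["date", "home_team", "away_team"]

# A left join keeps every result; matches without a shootout get NaN in `winner`.
merged = results.merge(shootouts, on=keys, how="left")
merged["went_to_shootout"] = merged["winner"].notna()
print(merged)
```

A left join is used here so that matches without a shootout are not dropped from the results.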


## Possible missing values
- **Goals**: would give us more insight into the players: who the top scorers are in each competition and which players are more or less likely to score against certain teams. It would also show in which minutes a goal is most likely in matches involving certain teams.
- **Continent**: would show whether a team is more prone to win or lose when playing on certain continents, or in the next World Cup.
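A continent column could be derived from the existing `country` column with a lookup table. The sketch below assumes a hand-written, hypothetical mapping (`CONTINENT_BY_COUNTRY`) that covers only a few countries; a real mapping would have to cover every country in the data.

```python
import pandas

# Hypothetical lookup table, written by hand for illustration only.
CONTINENT_BY_COUNTRY = {
    "Moldova": "Europe",
    "Ivory Coast": "Africa",
    "Belgium": "Europe",
    "Australia": "Oceania",
    "Costa Rica": "North America",
}

# Illustrative stand-in for the `country` column of results.csv.
matches = pandas.DataFrame({"country": ["Moldova", "Ivory Coast", "Australia", "Fiji"]})

# Countries missing from the lookup become NaN, so gaps are easy to spot.
matches["continent"] = matches["country"].map(CONTINENT_BY_COUNTRY)
unmapped = matches.loc[matches["continent"].isna(), "country"].unique().tolist()
print(matches)
print("Countries without a continent:", unmapped)
```

Listing the unmapped countries first makes it clear how much of the lookup still has to be filled in before the new column can be used.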
  



## Let's start

### Importing libraries

In [18]:
import pandas 
import numpy 
import matplotlib.pyplot as plot
import seaborn 
import random

### Load datasets

In [19]:
results = pandas.read_csv('results.csv')
shootouts = pandas.read_csv('shootouts.csv')

In [20]:
results.sample(5)

| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 39565 | 2017-09-05 | Moldova | Wales | 0 | 2 | FIFA World Cup qualification | Chișinău | Moldova | False |
| 23149 | 2000-04-23 | Ivory Coast | Rwanda | 2 | 0 | FIFA World Cup qualification | Abidjan | Ivory Coast | False |
| 17539 | 1992-10-14 | Belgium | Romania | 1 | 0 | FIFA World Cup qualification | Brussels | Belgium | False |
| 26890 | 2004-06-02 | Australia | Fiji | 6 | 1 | FIFA World Cup qualification | Adelaide | Australia | False |
| 16849 | 1991-06-25 | Costa Rica | Colombia | 0 | 1 | Friendly | San José | Costa Rica | False |


In [21]:
shootouts.sample(5)

| | date | home_team | away_team | winner |
| --- | --- | --- | --- | --- |
| 90 | 1987-03-02 | Senegal | Sierra Leone | Senegal |
| 288 | 2003-02-04 | Uruguay | Iran | Uruguay |
| 160 | 1994-07-10 | Romania | Sweden | Sweden |
| 92 | 1987-06-21 | South Korea | Australia | South Korea |
| 140 | 1992-06-22 | Netherlands | Denmark | Denmark |


### Dimensions of the data

In [22]:
print(f'Results: {results.shape[0]} Rows, {results.shape[1]} Columns')
print(f'Shootouts: {shootouts.shape[0]} Rows, {shootouts.shape[1]} Columns')



Results: 43919 Rows, 9 Columns
Shootouts: 503 Rows, 4 Columns


### Data types available 

#### Results:

In [23]:
results.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43919 entries, 0 to 43918
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        43919 non-null  object
 1   home_team   43919 non-null  object
 2   away_team   43919 non-null  object
 3   home_score  43919 non-null  int64 
 4   away_score  43919 non-null  int64 
 5   tournament  43919 non-null  object
 6   city        43919 non-null  object
 7   country     43919 non-null  object
 8   neutral     43919 non-null  bool  
dtypes: bool(1), int64(2), object(6)
memory usage: 2.7+ MB


#### Shootouts

In [24]:
shootouts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   date       503 non-null    object
 1   home_team  503 non-null    object
 2   away_team  503 non-null    object
 3   winner     503 non-null    object
dtypes: object(4)
memory usage: 15.8+ KB
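In both frames `date` is read as `object` (plain strings). For time-based analysis it may help to convert it to a datetime type. This is a minimal sketch, assuming every value follows the `yyyy-mm-dd` format described above; the `year` column is a hypothetical addition for illustration.

```python
import pandas

# Illustrative stand-in for the `date` column of results.csv.
results = pandas.DataFrame({"date": ["2017-09-05", "2000-04-23", "1992-10-14"]})

# Parse the strings into datetimes; an explicit format fails loudly on malformed dates.
results["date"] = pandas.to_datetime(results["date"], format="%Y-%m-%d")

# With a datetime column, parts such as the year are available through `.dt`.
results["year"] = results["date"].dt.year
print(results.dtypes)
```

The same conversion could also be done at load time with the `parse_dates` argument of `pandas.read_csv`.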


### Missing values, duplicate entries, and outliers

##### Missing values

In [25]:
results.isna().sum()

date          0
home_team     0
away_team     0
home_score    0
away_score    0
tournament    0
city          0
country       0
neutral       0
dtype: int64

In [26]:
shootouts.isna().sum()

date         0
home_team    0
away_team    0
winner       0
dtype: int64

##### Duplicated entries

In [27]:
print(f'Results duplicated entries: {results.duplicated().sum()}')
print(f'Shootouts duplicated entries: {shootouts.duplicated().sum()}')

Results duplicated entries: 0
Shootouts duplicated entries: 0
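`duplicated()` with no arguments only flags rows that are identical in every column. A stricter check looks for repeated matches on a subset of columns. The sketch below assumes that `date`, `home_team`, and `away_team` together should identify a match; the rows are illustrative, with the second one differing from the first only in `city`.

```python
import pandas

# Illustrative rows: the first two describe the same fixture with a different city value.
results = pandas.DataFrame({
    "date": ["1991-06-25", "1991-06-25", "2004-06-02"],
    "home_team": ["Costa Rica", "Costa Rica", "Australia"],
    "away_team": ["Colombia", "Colombia", "Fiji"],
    "city": ["San José", "San Jose", "Adelaide"],
})

# Assumption: these three columns together identify a match.
key_columns = ["date", "home_team", "away_team"]

full_dupes = int(results.duplicated().sum())                    # identical in every column
key_dupes = int(results.duplicated(subset=key_columns).sum())   # same fixture, any other values
print(f"Full-row duplicates: {full_dupes}, key duplicates: {key_dupes}")
```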


### Outliers - most common and uncommon results

In [28]:
results_unique_score = results.groupby(['home_score', 'away_score'], as_index=False).size()
seaborn.relplot(data=results_unique_score, x='home_score', y='away_score', size='size', sizes=(5, 400), alpha=0.5)

<seaborn.axisgrid.FacetGrid at 0x13487ba90>
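The same grouped counts can be read as a table to name the most common and the rarest scorelines directly. This is a minimal sketch on illustrative scores; treating a scoreline that occurs exactly once as "rare" is an assumption, not a rule from the dataset.

```python
import pandas

# Illustrative scores; in the notebook `results` is the full results.csv frame.
results = pandas.DataFrame({
    "home_score": [1, 1, 1, 2, 0, 31],
    "away_score": [0, 0, 0, 1, 1, 0],
})

# Count how often each (home_score, away_score) pair occurs, most frequent first.
score_counts = (
    results.groupby(["home_score", "away_score"], as_index=False)
    .size()
    .sort_values("size", ascending=False)
)

most_common = score_counts.head(1)
# Assumption: a scoreline seen exactly once is treated as rare.
rare = score_counts[score_counts["size"] == 1]
print(most_common)
print(rare)
```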

### Descriptive statistics of the data

In [29]:
results.describe()

| | home_score | away_score |
| --- | --- | --- |
| count | 43919.0 | 43919.0 |
| mean | 1.741365 | 1.17892 |
| std | 1.748843 | 1.395755 |
| min | 0.0 | 0.0 |
| 25% | 1.0 | 0.0 |
| 50% | 1.0 | 1.0 |
| 75% | 2.0 | 2.0 |
| max | 31.0 | 21.0 |
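The mean home score is higher than the mean away score, which could be explored further by labelling each match with its outcome. The sketch below uses the five sampled rows shown earlier; the `outcome` column and its labels are hypothetical names chosen for illustration.

```python
import numpy
import pandas

# Scores of the five sampled matches shown earlier in this notebook.
results = pandas.DataFrame({
    "home_score": [0, 2, 1, 6, 0],
    "away_score": [2, 0, 0, 1, 1],
})

# Positive difference: home win; negative: away win; zero: draw.
diff = results["home_score"] - results["away_score"]
results["outcome"] = numpy.select(
    [diff > 0, diff < 0], ["home_win", "away_win"], default="draw"
)

# Share of each outcome among the matches.
share = results["outcome"].value_counts(normalize=True)
print(share)
```

Note that scores exclude penalty shootouts, so a match labelled as a draw here may still have had a shootout winner.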


### Get an overview of the distribution of the quantitative variables

In [30]:
seaborn.kdeplot(data=results, x="home_score")

<AxesSubplot:xlabel='home_score', ylabel='Density'>

In [31]:
seaborn.kdeplot(data=results, x="away_score")

<AxesSubplot:xlabel='away_score', ylabel='Density'>
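Since the scores are whole numbers, a KDE smooths over values that cannot occur (for example 1.5 goals). A relative-frequency table per score value may be a useful complement. This is a minimal sketch on the five sampled rows shown earlier; the variable name `home_dist` is illustrative.

```python
import pandas

# Scores of the five sampled matches shown earlier in this notebook.
results = pandas.DataFrame({
    "home_score": [0, 2, 1, 6, 0],
    "away_score": [2, 0, 0, 1, 1],
})

# Relative frequency of each distinct home score, ordered by score value.
home_dist = results["home_score"].value_counts(normalize=True).sort_index()
print(home_dist)
```

For a plot of the same information, `seaborn.histplot(data=results, x="home_score", discrete=True)` draws one bar per integer value.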