### Angelo Faella
Last update: Feb 2020

# What is More Important to Win the Serie A League: Attack or Defence?

This is the project I made to complete the [Applied Plotting, Charting & Data Representation in Python](https://www.coursera.org/learn/python-plotting?specialization=data-science-python) course, which is the second course in the "Applied Data Science With Python Specialization" by the University of Michigan on Coursera.

If you want more info about this project, check out the [article]() I wrote about it.

## Assignment
*This assignment requires that you find **at least** two datasets on the web which are related, and that you visualize these datasets to answer a question with the broad topic of **sport and athletics** (see below) for the region of **Italy**.*

*You can merge these datasets with data from different regions if you like!*
*You are welcome to upload datasets of your own as well, and link to them using a third party repository such as github, bitbucket, pastebin, etc. Please be aware of the Coursera terms of service with respect to intellectual property.*

*As this assignment is for the whole course, you must incorporate principles discussed in the first week, such as having a high data-ink ratio (Tufte) and aligning with Cairo’s principles of truth, beauty, function, and insight.*

Here are the assignment instructions:

 * State the region and the domain category that your data sets are about (e.g., **Italy** and **sport and athletics**).
 * You must state a question about the domain category and region that you identified as being interesting.
 * You must provide at least two links to available datasets. These could be links to files such as CSV or Excel files, or links to websites which might have data in tabular form, such as Wikipedia pages.
 * You must upload an image which addresses the research question you stated. In addition to addressing the question, this visual should follow Cairo's principles of truthfulness, functionality, beauty, and insightfulness.
 * You must contribute a short (1-2 paragraph) written justification of how your visualization addresses your stated research question.

# The Question
What comes to mind when you think of the words sport and Italy? **Football.**

Like most Italians, I am also a football fan. So I decided to find an answer once and for all to one of the most frequently asked questions in this sport:

***What is more important to win the Serie A league: attack or defence?***

# Getting the Data
The data were scraped from **legaseriea.it**: I collected the final league table for each of the last 20 seasons (1999-00 to 2018-19). [Here]() is the scraper.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns

%matplotlib notebook

These are the sources:

In [None]:
links = ['http://www.legaseriea.it/it/serie-a/classifica/2018-19',
        'http://www.legaseriea.it/it/serie-a/classifica/2017-18',
        'http://www.legaseriea.it/it/serie-a/classifica/2016-17',
        'http://www.legaseriea.it/it/serie-a/classifica/2015-16',
        'http://www.legaseriea.it/it/serie-a/classifica/2014-15',
        'http://www.legaseriea.it/it/serie-a/classifica/2013-14',
        'http://www.legaseriea.it/it/serie-a/classifica/2012-13',
        'http://www.legaseriea.it/it/serie-a/classifica/2011-12',
        'http://www.legaseriea.it/it/serie-a/classifica/2010-11',
        'http://www.legaseriea.it/it/serie-a/classifica/2009-10',
        'http://www.legaseriea.it/it/serie-a/classifica/2008-09',
        'http://www.legaseriea.it/it/serie-a/classifica/2007-08',
        'http://www.legaseriea.it/it/serie-a/classifica/2006-07',
        'http://www.legaseriea.it/it/serie-a/classifica/2005-06',
        'http://www.legaseriea.it/it/serie-a/classifica/2004-05',
        'http://www.legaseriea.it/it/serie-a/classifica/2003-04',
        'http://www.legaseriea.it/it/serie-a/classifica/2002-03',
        'http://www.legaseriea.it/it/serie-a/classifica/2001-02',
        'http://www.legaseriea.it/it/serie-a/classifica/2000-01',
        'http://www.legaseriea.it/it/serie-a/classifica/1999-00']

## Loading the Data

In [None]:
frames = []
for link in links[-1::-1]:
    frames.append(pd.read_csv('./serie_A_'+link[-7:]+'.csv', header=None))

It will be useful to add a 'Season' column to distinguish the 20 seasons.

In [None]:
for df,link in zip(frames,links[-1::-1]):
    df.columns = ['Pos','Name','PT','G','V','N','P','G1','V1','N1','P1','G2','V2','N2','P2','F','S']
    df['Season'] = link[-7:]

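The season label used above is simply the last seven characters of each URL. Here is a minimal sketch of that slicing, where `season_from_link` and `csv_path` are hypothetical helper names that are not part of the original notebook:

```python
def season_from_link(link):
    """Return the trailing season label of a league-table URL, e.g. '2018-19'."""
    # assumes every URL ends with a 7-character season such as '2018-19'
    return link[-7:]


def csv_path(link):
    """Build the local CSV path used when loading the scraped tables."""
    return './serie_A_' + season_from_link(link) + '.csv'


link = 'http://www.legaseriea.it/it/serie-a/classifica/2018-19'
print(season_from_link(link))  # 2018-19
print(csv_path(link))          # ./serie_A_2018-19.csv
```
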
Let's merge all the dataframes into one.

In [None]:
df = pd.concat(frames)
df.head()

# Data Cleaning and Processing

We have data on the goals scored and conceded for each team. This allows us to understand the offensive and defensive strength of a team. A statistic often used in football to represent these two characteristics in a single dimension is the **goal difference**, that is, the difference between these two factors. Let's add it as a new column.

In [None]:
df['GD'] = df['F'] - df['S']
df.head()

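To make the arithmetic concrete, here is a small sketch on a toy table. The team names and numbers are made up for illustration and are not real Serie A data; only the column names `F` (goals scored) and `S` (goals conceded) follow the notebook.

```python
import pandas as pd

# Toy rows with hypothetical teams and made-up numbers (not real Serie A data)
toy = pd.DataFrame({
    'Name': ['Team A', 'Team B', 'Team C'],
    'F': [70, 55, 30],   # goals scored
    'S': [25, 55, 60],   # goals conceded
})

# goal difference = goals scored - goals conceded
toy['GD'] = toy['F'] - toy['S']
print(toy['GD'].tolist())  # [45, 0, -30]
```
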
We must focus on the winning teams of the league. Let's create a new dataframe containing only the winning teams over the past 20 years. Furthermore, we'll discard the columns we don't need.

In [None]:
# keep only the first-placed team of each season
winners = df[df['Pos']==1].reset_index(drop=True)
# drop the match-record columns; only Name, F, S, Season and GD remain
winners = winners.drop(['Pos','PT','G','V','N','P','G1','V1','N1','P1','G2','V2','N2','P2'], axis=1)
winners.head()

# Exploratory Data Analysis

How has the goal difference changed over time for these teams?

In [None]:
plt.figure()
plt.plot(winners['Season'],winners['GD'], label='Goal Difference', zorder=1, linestyle='-')

plt.ylabel('Goal Difference', fontsize=10, alpha=0.75)
plt.xticks(rotation=60)
plt.subplots_adjust(bottom=0.19)

As we can see from the graph above, the goal difference of the winning teams has grown over the seasons. This may be due to:
* An increase in goals scored
* Fewer goals conceded
* Both of the above

Which of these factors was therefore decisive? Let's view the trend of the goals scored and conceded.

In [None]:
plt.figure()
plt.plot(winners['Season'],winners['F'], label='Goals Scored', zorder=1, linestyle='-')
plt.plot(winners['Season'],winners['S'], label='Goals Conceded', zorder=1, linestyle='-')

plt.ylabel('Goals', fontsize=10, alpha=0.75)
plt.legend(fontsize=8)
plt.xticks(rotation=60)
plt.subplots_adjust(bottom=0.19)

There seems to have been both an increase in goals scored and a decrease in goals conceded. Interesting. It suggests that the winning teams have improved both their offensive and defensive game.

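The trends above are read off the plots by eye. One hedged way to put a number on them is the slope of a least-squares line through the values. This is a sketch with made-up figures, and `trend_per_season` is a hypothetical helper that is not part of the original analysis:

```python
import numpy as np


def trend_per_season(values):
    """Slope of the least-squares line through the values, in units per season."""
    x = np.arange(len(values))
    slope, _intercept = np.polyfit(x, values, 1)
    return slope


# Made-up numbers for illustration only (not the real winners' figures)
scored = [60, 64, 62, 70, 75]
conceded = [30, 29, 27, 24, 22]

print(trend_per_season(scored) > 0)    # True: goals scored trend upwards
print(trend_per_season(conceded) < 0)  # True: goals conceded trend downwards
```
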
But which of these two factors is more decisive for the final victory? We have not yet answered the question. To find out, we must first determine, for each season, whether the winning team finished with the best defence and/or the best attack in the league.

In [None]:
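best = {}

for season, table in df.groupby('Season'):
    # row of the league winner; .iloc[0] is positional, so it does not
    # depend on the winner's row carrying the index label 0
    first = table[table['Pos']==1].iloc[0]

    # best attack = most goals scored, best defence = fewest goals conceded
    ba = bool(first['F'] == table['F'].max())
    bd = bool(first['S'] == table['S'].min())

    best[season] = {'BA':ba, 'BD':bd}

attack_defense = pd.DataFrame(best).T
attack_defense.head()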

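As an alternative to the loop, the same flags can be computed in a vectorized way with `groupby(...).transform`. This is only a sketch on a toy table with hypothetical teams and made-up numbers; `flag_best` is not part of the original notebook. As in the loop, a winner that ties for the most goals scored or the fewest conceded still counts as best.

```python
import pandas as pd

# Toy league tables for two hypothetical seasons (made-up numbers)
toy = pd.DataFrame({
    'Season': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2'],
    'Pos':    [1, 2, 3, 1, 2, 3],
    'Name':   ['A', 'B', 'C', 'B', 'A', 'C'],
    'F':      [70, 75, 40, 80, 60, 50],   # goals scored
    'S':      [25, 30, 60, 35, 30, 55],   # goals conceded
})


def flag_best(table):
    """Flag, for each season's winner, best attack (BA) and best defence (BD)."""
    out = table.copy()
    # compare every team with its own season's maximum / minimum
    out['BA'] = out['F'] == out.groupby('Season')['F'].transform('max')
    out['BD'] = out['S'] == out.groupby('Season')['S'].transform('min')
    winners_only = out.loc[out['Pos'] == 1, ['Season', 'Name', 'BA', 'BD']]
    return winners_only.reset_index(drop=True)


flags = flag_best(toy)
print(flags)
```
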
At this point we just have to merge this dataframe with *winners* to get the definitive one. 

In [None]:
winners = pd.merge(winners, attack_defense, left_on='Season', right_index=True)
winners.head()

This frame contains the answer we are looking for. Let's find out directly by creating the final visual.

# Visualization

The idea is to draw a line to show the **goal difference** for the winning team of every year. This will help the reader understand the attacking and defending strength of the teams. 

On top of the line, we will draw a scatter plot to indicate whether the team won with the **best defence** of the league (blue), the **best attack** of the league (orange), or both (green).

In [None]:
plt.figure()
plt.plot(winners['Season'],winners['GD'], c='#393939', zorder=1, linestyle='-',alpha=0.9, linewidth=0.7)

# boolean masks for the three kinds of winner
only_ba = (winners['BA'] == True) & (winners['BD'] == False)
only_bd = (winners['BD'] == True) & (winners['BA'] == False)
both = (winners['BD'] == True) & (winners['BA'] == True)

ba_x, ba_y = winners.loc[only_ba, 'Season'], winners.loc[only_ba, 'GD']
bd_x, bd_y = winners.loc[only_bd, 'Season'], winners.loc[only_bd, 'GD']
bdba_x, bdba_y = winners.loc[both, 'Season'], winners.loc[both, 'GD']


plt.scatter(bdba_x, bdba_y, c='#1BAD6D', s = 50, zorder=3, label='Winner had both the best attack and the best defence of the league')
plt.scatter(ba_x, ba_y, c='#F08A4B', s = 50, zorder=2, label='Winner had the best attack of the league')
plt.scatter(bd_x, bd_y, c='#36558F', s = 50, zorder=2, label='Winner had the best defence of the league') 

## Formatting

First of all, we need a title, labels, and a legend. Then we can make some adjustments to follow Cairo's principles of truthfulness, functionality, beauty, and insightfulness.

In [None]:
plt.gcf().suptitle("What's More Important To Win the Serie A League?\n Attack VS Defence", fontsize=11, alpha=0.8)
plt.ylabel('Goal Difference', fontsize=10, alpha=0.75)
plt.xlabel('Season', fontsize=10, alpha=0.75)

# format ticks
plt.xticks(rotation=60)
plt.tick_params(axis='both', labelsize = 7)

# remove chart box
for spine in plt.gca().spines.values():
    spine.set_visible(False)
               
# add legend
plt.legend(fontsize = 6, loc = 'upper left')
plt.subplots_adjust(bottom=0.19)

Finally, let's add horizontal lines to make the values on the Y-axis easier to read.

In [None]:
# draw horizontal lines: keep every other y tick, then the 2nd to 4th of those
y_lines = plt.yticks()[0][::2][1:4]

for y in y_lines:
    plt.gca().axhline(y, c='grey', alpha=0.3, linewidth = 0.5, linestyle = '--')

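The tick selection above can be checked without a figure. This is a small sketch with hypothetical tick values (the real ones come from `plt.yticks()[0]`), and `gridline_positions` is an illustrative name only:

```python
import numpy as np


def gridline_positions(ticks):
    """Keep every other tick, then the 2nd to 4th of the ticks that remain."""
    return np.asarray(ticks)[::2][1:4].tolist()


# Hypothetical tick values for illustration
print(gridline_positions([0, 10, 20, 30, 40, 50, 60, 70, 80]))  # [20, 40, 60]
```

With nine ticks, only three guide lines are drawn, which keeps the data-ink ratio high.
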
# The Answer

The visual clearly shows the answer to our question. 

**Having a great defence is critical to the final victory.**

In fact, as can be seen, in 15 of the past 20 years the winning team had the best defence in the league, including:
* 8 times with only the best defence (blue dots)
* 7 times with both best attack and best defence (green dots). 

Only 5 times in the last 20 years (1999–00, 2000–01, 2003–04, 2005–06, 2006–07) did the winner not have the best defence in the league.

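Counts like the ones above can also be derived from the `BA` and `BD` flags rather than read off the chart. Here is a minimal sketch on a toy frame with made-up flags (not the real results); `tally_winners` is a hypothetical helper:

```python
import pandas as pd


def tally_winners(winners):
    """Count winners by whether they had the best attack and/or best defence."""
    ba = winners['BA'].astype(bool)
    bd = winners['BD'].astype(bool)
    return {
        'both': int((ba & bd).sum()),
        'only_defence': int((~ba & bd).sum()),
        'only_attack': int((ba & ~bd).sum()),
        'neither': int((~ba & ~bd).sum()),
    }


# Made-up flags for five hypothetical seasons
toy_winners = pd.DataFrame({
    'BA': [True, False, True, False, False],
    'BD': [True, True, False, True, False],
})

counts = tally_winners(toy_winners)
print(counts)  # {'both': 1, 'only_defence': 2, 'only_attack': 1, 'neither': 1}
```
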
<img src="Final_chart.png">