# Introduction to Modeling

Field: Data Science 

Purpose: Teaching

Goal: Illustrate the "Data Science process" and the meaning of modelling through an example. 

* We will define what a model is.

* We will go through a first example of modelling. We will perform a series of necessary steps of the "Data Science Process" that we'll later explain. These include: asking a business question, retrieving real data, preparing the data, and applying a model.

* Based on this exercise, at the end we will introduce, review and discuss the Data Science Process. 

## 1. Definition of a model

This is my preferred definition of a model so far:

A model is a **systematic procedure** to explain possible outcomes of an unknown event or property from available data. 

Models can be used to try to explain: 

* Will it rain tomorrow? 

* How long will it take to arrive from my home to work this morning? 

* How many products X will be purchased in the next month? 

* Will the customer be interested in this ad? 

Explaining the possible outcomes of an unknown event is sometimes called *making a prediction* (but treat this word with care!). 

Models make assumptions about how the known data can be used to make predictions. The process of using the available data to prepare the model to make predictions is called *fitting* (sometimes also *training* or *learning*). If a model improves when it is fed more data, we say the model *learns*.

After training, some models no longer need to retain all the data (these are called "parametric models"); the information they do retain is called the "parameters". So, technically, model = assumptions + parameters.
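To make "model = assumptions + parameters" concrete, here is a minimal sketch of a parametric model for the rain question. The class name `RainModel`, the list `past_days` and its values are hypothetical, invented only for illustration. The assumption is that every day is rainy with the same fixed probability; the single parameter is that probability.

```python
from dataclasses import dataclass


@dataclass
class RainModel:
    """Assumption: every day is rainy with the same fixed probability."""
    p_rain: float = 0.5  # the only parameter of the model

    def fit(self, observations):
        # observations: list of booleans, True = it rained on that day
        self.p_rain = sum(observations) / len(observations)
        return self

    def predict(self):
        # Estimated probability of rain tomorrow
        return self.p_rain


# Hypothetical observations for ten past days (3 rainy days out of 10)
past_days = [True, False, False, True, False, False, False, True, False, False]
model = RainModel().fit(past_days)
print(model.predict())  # 0.3
```

After fitting, the ten observations can be thrown away: the model only keeps `p_rain`.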


## 2. A first example of modelling

Before starting, let's do some necessary configuration. 




In [1]:
import numpy as np
import pandas as pd

### a) Business question: will FC Barcelona win the League this year? 

### b)  Data understanding

We found an online source of soccer data. We download the file corresponding to the current season. 

In [2]:
matches = pd.read_csv('season-1819.csv')

In [3]:
matches.head()

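| | Date | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HTHG | HTAG | HTR | HS | ... | HST | AST | HF | AF | HC | AC | HY | AY | HR | AR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-08-17 | Betis | Levante | 0 | 3 | A | 0 | 1 | A | 22 | ... | 8 | 4 | 10 | 10 | 5 | 3 | 0 | 2 | 0 | 0 |
| 1 | 2018-08-17 | Girona | Valladolid | 0 | 0 | D | 0 | 0 | D | 13 | ... | 1 | 1 | 21 | 20 | 3 | 2 | 1 | 1 | 0 | 0 |
| 2 | 2018-08-18 | Barcelona | Alaves | 3 | 0 | H | 0 | 0 | D | 25 | ... | 9 | 0 | 6 | 13 | 7 | 1 | 0 | 2 | 0 | 0 |
| 3 | 2018-08-18 | Celta | Espanol | 1 | 1 | D | 0 | 1 | A | 12 | ... | 2 | 5 | 13 | 14 | 8 | 7 | 3 | 2 | 0 | 0 |
| 4 | 2018-08-18 | Villarreal | Sociedad | 1 | 2 | A | 1 | 1 | D | 16 | ... | 7 | 4 | 16 | 10 | 4 | 6 | 2 | 3 | 0 | 0 |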


Data understanding would also include:

* Checking the correctness and integrity of the data.
* Understanding the columns and performing basic tests.
* Checking whether there is enough "signal" for a model to answer the question above.
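The checks listed above could look roughly like the following sketch. It uses a three-row frame named `sample` (a hypothetical name) that copies a few columns from the first rows shown above, and it assumes that `FTHG` and `FTAG` are the full-time home and away goals and that `FTR` is the full-time result ('H', 'D' or 'A'). That reading of the column names is an assumption, not something the file itself states.

```python
import pandas as pd

# Small sample with the same column names as the downloaded file
sample = pd.DataFrame({
    'HomeTeam': ['Betis', 'Girona', 'Barcelona'],
    'AwayTeam': ['Levante', 'Valladolid', 'Alaves'],
    'FTHG': [0, 0, 3],
    'FTAG': [3, 0, 0],
    'FTR': ['A', 'D', 'H'],
})

# Integrity: no missing values, and only the three expected result codes
no_missing = not sample.isna().any().any()
valid_codes = sample['FTR'].isin(['H', 'D', 'A']).all()


# Consistency: the result code must agree with the goals scored
def expected_result(row):
    if row['FTHG'] > row['FTAG']:
        return 'H'
    if row['FTHG'] < row['FTAG']:
        return 'A'
    return 'D'


consistent = (sample.apply(expected_result, axis=1) == sample['FTR']).all()
print(no_missing, valid_codes, consistent)
```

The same three checks can be run on the full `matches` frame by replacing `sample`.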

Having looked at this data, we decide to apply a simple model: simulate the remaining match results (Win/Draw/Lose) based on the ratios of Win/Draw/Lose obtained so far.

### c) Data Preparation: 

Convert the 'raw data' to the inputs of interest. 

The following function computes the number of matches won, drawn and lost by a given team:

In [4]:
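def results(frm, team):
  """Return (wins, draws, losses) of `team` over the matches in `frm`."""

  # Boolean masks: matches that the team played away / at home
  away = frm.AwayTeam.values == team
  home = frm.HomeTeam.values == team

  # FTR is the full-time result: 'H' = home win, 'A' = away win, 'D' = draw
  nwins = np.sum(frm[home]['FTR'] == 'H') + np.sum(frm[away]['FTR'] == 'A')
  ndraw = np.sum(frm[home | away]['FTR'] == 'D')
  nlose = np.sum(frm[home]['FTR'] == 'A') + np.sum(frm[away]['FTR'] == 'H')

  return nwins, ndraw, nlose

results(matches, 'Barcelona')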

(15, 4, 2)

Now do that for all teams. 

In [5]:
# Get all teams
teams = np.unique(matches.HomeTeam)

In [6]:
# Get results for all teams and add to "board" list
board = []
for team in teams: 
  w, d, l = results(matches, team)
  board.append([team, w, d, l])

In [7]:
# Build a DataFrame with the current standings
frm_board = pd.DataFrame(board, columns=['Team', 'W', 'D', 'L'])
# 3 points per win, 1 point per draw
frm_board['Points'] = frm_board.W*3 + frm_board.D
frm_board['Played'] = frm_board.W + frm_board.D + frm_board.L
frm_board.sort_values('Points', ascending=False)

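| | Team | W | D | L | Points | Played |
|---|---|---|---|---|---|---|
| 3 | Barcelona | 15 | 4 | 2 | 49 | 21 |
| 2 | Ath Madrid | 12 | 8 | 1 | 44 | 21 |
| 13 | Real Madrid | 12 | 3 | 6 | 39 | 21 |
| 14 | Sevilla | 10 | 6 | 5 | 36 | 21 |
| 0 | Alaves | 9 | 5 | 7 | 32 | 21 |
| 8 | Getafe | 8 | 7 | 6 | 31 | 21 |
| 4 | Betis | 8 | 5 | 8 | 29 | 21 |
| 16 | Valencia | 6 | 11 | 4 | 29 | 21 |
| 15 | Sociedad | 7 | 6 | 8 | 27 | 21 |
| 12 | Levante | 7 | 5 | 9 | 26 | 21 |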


### d) Model: Simulation

Now, with this data, we can simulate future matches based on past matches.

We will use a model for estimation called the "bootstrap", which generates future (unknown) outcomes by randomizing the past (known) outcomes. 
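As a rough illustration of the idea, the sketch below resamples a made-up record of past results with replacement. The array `past_points` and the choice of 5 future matches are hypothetical and only serve the example; the random generator is seeded just to make the sketch repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only for repeatability

# Hypothetical record: points earned in 10 past matches
# (3 = win, 1 = draw, 0 = loss)
past_points = np.array([3, 3, 1, 0, 3, 3, 1, 3, 0, 3])

# One draw: sample 5 "future" matches from the past ones, with replacement
future_points = rng.choice(past_points, size=5, replace=True)

# Repeating the draw many times gives a distribution of possible totals
totals = np.array([rng.choice(past_points, size=5).sum() for _ in range(1000)])
print(future_points, totals.mean())
```

The average past result here is 2 points per match, so the simulated totals over 5 matches should cluster around 10.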

#### Simulate single end of League

In [8]:
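# This function generates new results by resampling previous ones
def simulate(nplayed, w, d, l, nmatches=38):
  
  # "Roulette" of past results in points: 3 = win, 1 = draw, 0 = loss
  ruleta = [3]*int(w) + [1]*int(d) + [0]*int(l)
  
  # nmatches is the number of matches per team in a season (38 here)
  sim = np.random.choice(ruleta, nmatches-nplayed)
  return sim

# Example: Simulate for one team
# .iloc[0] selects the single matching row, so each field is a scalar
sub = frm_board[frm_board.Team == 'Barcelona'].iloc[0]
np.sum(simulate(int(sub['Played']), int(sub['W']), int(sub['D']), int(sub['L'])))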


36
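The season length of 38 used in `simulate` does not need to be typed in by hand. Assuming a double round-robin format (each team plays every other team once at home and once away) and a 20-team league, it can be derived from the number of teams. The helper name `matches_per_team` is hypothetical.

```python
def matches_per_team(n_teams):
    # Double round-robin: every other team is faced twice (home and away)
    return 2 * (n_teams - 1)


print(matches_per_team(20))  # 38
```

In the notebook, `matches_per_team(len(teams))` would give the same number without hard-coding it.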

In [9]:
# This function generalizes the simulator to any team
# It is built to be applied to each row of a DataFrame
simulator = lambda row: np.sum(simulate(int(row['Played']), int(row['W']), int(row['D']), int(row['L'])))


In [10]:
# So apply it to every row
# Work on a copy so that frm_board keeps only the real standings
frm_board0 = frm_board.copy()
# A single apply() already simulates every team, so no loop over teams is needed
frm_board0['Simulated'] = frm_board0.apply(simulator, axis=1)
frm_board0['Final_Simulated'] = frm_board0['Points'] + frm_board0['Simulated']
  
# We have simulated one possible end of the league!
frm_board0.sort_values('Final_Simulated', ascending=False)

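| | Team | W | D | L | Points | Played | Simulated | Final_Simulated |
|---|---|---|---|---|---|---|---|---|
| 3 | Barcelona | 15 | 4 | 2 | 49 | 21 | 41 | 90 |
| 2 | Ath Madrid | 12 | 8 | 1 | 44 | 21 | 45 | 89 |
| 13 | Real Madrid | 12 | 3 | 6 | 39 | 21 | 34 | 73 |
| 14 | Sevilla | 10 | 6 | 5 | 36 | 21 | 33 | 69 |
| 0 | Alaves | 9 | 5 | 7 | 32 | 21 | 27 | 59 |
| 8 | Getafe | 8 | 7 | 6 | 31 | 21 | 25 | 56 |
| 1 | Ath Bilbao | 5 | 11 | 5 | 26 | 21 | 27 | 53 |
| 4 | Betis | 8 | 5 | 8 | 29 | 21 | 23 | 52 |
| 16 | Valencia | 6 | 11 | 4 | 29 | 21 | 23 | 52 |
| 6 | Eibar | 6 | 8 | 7 | 26 | 21 | 23 | 49 |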


In [11]:
# Get the winner
frm_board0.sort_values('Final_Simulated', ascending=False).iloc[0,0]

'Barcelona'

#### Now simulate N ends of league

In [12]:
# Now repeat the procedure n times

def simulate_all(n, frm_init):
  
  winners = []
  for _ in range(n):  
    # Work on a copy so that frm_init is left untouched between runs
    frm_board0 = frm_init.copy()
    # A single apply() simulates the rest of the season for every team
    frm_board0['Simulated'] = frm_board0.apply(simulator, axis=1)
    frm_board0['Final_Simulated'] = frm_board0['Points'] + frm_board0['Simulated']
    # The team with the most final points wins this simulated league
    winner = frm_board0.sort_values('Final_Simulated', ascending=False).iloc[0,0]
    winners.append(winner)
    
  return winners

In [13]:
winners = simulate_all(100, frm_board)

### e) Result and conclusion

In [14]:
frm_winners = pd.DataFrame(winners, columns=['winner'])
frm_winners['aux'] = 1
frm_winners.groupby('winner').count()

| winner | aux |
|---|---|
| Ath Madrid | 6 |
| Barcelona | 94 |
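The counts above can be read as an estimated probability. Because only 100 leagues were simulated, that estimate is itself noisy. A minimal sketch of the arithmetic, using the standard error of a proportion, is shown below. The variable names are hypothetical, and the calculation only measures simulation noise: it says nothing about whether the model's assumptions are right.

```python
import math

n_sims = 100          # number of simulated leagues
barcelona_wins = 94   # simulated leagues won by Barcelona

# Estimated probability and its standard error: sqrt(p * (1 - p) / n)
p_hat = barcelona_wins / n_sims
std_err = math.sqrt(p_hat * (1 - p_hat) / n_sims)
print(round(p_hat, 2), round(std_err, 3))  # 0.94 0.024
```

Running more simulations shrinks this standard error, but does not make the underlying model any more realistic.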


## 3. The Data Science Process

![CRISP-DM process diagram](https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png)

*By Kenneth Jensen - Own work, based on: ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf (Figure 1), CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24930610*

We will use this example to discuss the "Data Science Process".

* Business Understanding:
  * Which question(s) do we want to answer?
  * Form a hypothesis with *current* knowledge.

* Data Understanding:
  * What data is needed? What data is available? 
  * Is the data complete / correct / reliable? 
  * Visualization / Data Exploration. 
  
* Data Preparation: 
  * Convert raw data into the attributes or inputs needed for the analysis. 

* Modeling:
  * Model = Systematic procedure to explain non-available information from available information.
  * Systematic: Could have been used for any other league.
  * Assumptions + Data --> Model
    * Which assumptions did our model make?
    * Which other assumptions could it have made?
  * A model improves with more information.
  * A model compresses the information.
  * Modeling has an arrow back to Data Preparation:
    * Which other data could our model have used?
    
* Evaluation:
  * How can we check quantitatively that the assumptions are valid (typically: good enough)?
  * Which of two models is better?

* Deployment:
  * How do we implement the model as a (normally software) process?
  * A complex and hot topic, not covered here.
  
An arrow is missing: from Evaluation or Deployment back to Business Understanding. At the end of the process you have more knowledge.

Comments: 

* Data-driven decisions can also be made without models (e.g. in a non-systematic way). Some data science projects omit models and use dashboards, visualizations, or descriptive statistics instead.

Exercises

What about:

* Taking into account whether a match is played at home or away?
* Using other data, e.g. the winners of previous years?


