---
title: 'Final Project: Predicting World Cup Winner'
author: "Rodrigue Beleho"
date: "December 10, 2018"
output: html_document
---
### Introduction ###

#We decided to use the FIFA World Cup Matches dataset to predict the winner of the 2018 World Cup.
#Predicting this variable is relevant because we want to see whether there are criteria a team needs to fulfill in order to win the World Cup, or whether the outcome is completely random.
#The dataset used in this paper comes from the FIFA World Cup Archive website. The target variable ("Stage") is a categorical variable describing the stage at which a team is playing a game.
#The original dataset consists of 4573 observations across 21 variables that break down the information for each match played in the FIFA World Cup. The input variables include "Year", the year in which the match was played; "Datetime", the exact date on which the match was played; "Stage", the phase in which the match was played; and "Attendance", the number of people in the stadium during the match, as well as many others. A full list and description of the variables in the dataset is given in Appendix A.
#Eight of the 21 variables have been removed from the dataset: "RoundID", "MatchID", "Referee", "Assistant.1", "Assistant.2", "Datetime", "Home.Team.Initials", and "Away.Team.Initials". "Datetime", "Home.Team.Initials", and "Away.Team.Initials" were removed because they provide the same information as other columns such as "Year" or "Home.Team.Name". "RoundID", "MatchID", "Referee", "Assistant.1", and "Assistant.2" were removed because they are not useful for the purpose of this paper.
#The categorical variable "Stage" originally had 24 levels; these were reduced to 5 in order to improve the accuracy of the models.


##Reading the Dataset
```{r}
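# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0),
# which the levels() calls on Stage further down rely on
Iaquinta = read.csv("C:\\Users\\student\\Desktop\\MATH 421\\Math 421 Final Project\\WorldCupMatches.csv", stringsAsFactors = TRUE)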
library(ggplot2)
library(caret)
library(rpart)
library(rattle)
library(lattice)
summary(Iaquinta)
```


##Removing useless columns
```{r}
Iaquinta[,"RoundID"] = NULL
Iaquinta[,"MatchID"] = NULL
Iaquinta[,"Referee"] = NULL
Iaquinta[,"Assistant.1"] = NULL
Iaquinta[,"Assistant.2"] = NULL
Iaquinta[,"Datetime"] = NULL
Iaquinta[,"Home.Team.Initials"] = NULL
Iaquinta[,"Away.Team.Initials"] = NULL
```

##Checking for missing values
```{r}
sum(is.na(Iaquinta))
```
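
#A single total says little about where the missing values sit. The sketch below, on a small made-up data frame (`toy` and its values are hypothetical, not taken from WorldCupMatches.csv), shows how the count could be broken down by column and how fully empty rows could be counted.

```{r}
# Toy data frame (hypothetical values, not from the real dataset)
toy <- data.frame(
  Year       = c(1930, 1934, NA, 1950),
  Stage      = c("Final", NA, NA, "Group 1"),
  Attendance = c(4444, 30000, NA, 18346)
)

total_missing  <- sum(is.na(toy))       # one overall count, as in the chunk above
missing_by_col <- colSums(is.na(toy))   # named vector: one count per column
all_na_rows    <- sum(rowSums(is.na(toy)) == ncol(toy))  # rows with no data at all

total_missing
missing_by_col
all_na_rows
```

#If most of the missing values come from rows that are empty in every column, dropping those rows may be a more natural choice than imputing them.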


### DISCUSSION ON MISSING VALUES ###


##Handling missing values (1) - Missing values of categorical variables are replaced by the most frequent category in the variables
```{r}
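AL=function(x){
  for (i in 1:ncol(x)){
    if (is.numeric(x[,i])){
      # Numeric column: replace NA by the mean of the observed values
      x[,i][is.na(x[,i])]=mean(x[,i], na.rm=TRUE)
    }else{
      # Categorical column: replace NA by the most frequent observed category.
      # NA is excluded from the candidates so it can never be chosen as the mode.
      cats=unique(x[,i][!is.na(x[,i])])
      x[,i][is.na(x[,i])]=cats[which.max(tabulate(match(x[,i], cats)))]
    }
  }
  return (x)
}
Iaquinta <- AL(Iaquinta)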

```

#Commenting on the result
```{r}

sum(is.na(Iaquinta))


#We had 22322 missing values in the first place
#This method brings the number of missing values to 0
```
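
#To make the effect of this strategy easier to check, here is a minimal sketch on a made-up data frame. The helper `impute_simple` and the data `toy` are hypothetical names used only for illustration; the helper follows the same mean/mode idea as the function above but is not the function used in this project.

```{r}
# Hypothetical helper: mean for numeric columns, most frequent value otherwise
impute_simple <- function(x) {
  for (i in seq_len(ncol(x))) {
    missing <- is.na(x[[i]])
    if (is.numeric(x[[i]])) {
      x[[i]][missing] <- mean(x[[i]], na.rm = TRUE)
    } else {
      counts <- table(x[[i]])   # table() leaves NA out by default
      x[[i]][missing] <- names(counts)[which.max(counts)]
    }
  }
  x
}

# Toy data (hypothetical values)
toy <- data.frame(
  Goals = c(1, 3, NA, 2),
  Stage = c("Prelim", "Prelim", NA, "Final"),
  stringsAsFactors = FALSE
)

toy_imputed <- impute_simple(toy)
toy_imputed
sum(is.na(toy_imputed))
```

#In the toy data the missing goal count becomes the mean of the observed values and the missing stage becomes the most frequent stage, so no missing values remain.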



##Handling Missing Values (2) - Input a data frame and return a data frame with numeric missing values being replaced by the mean of the corresponding column.  
```{r}
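AL2 <- function(x) {
  lee <- ncol(x)
  for (i in 1: lee) {
    if(is.numeric(x[[i]]) == TRUE) {
      # Replace the missing entries of column i by the mean of its observed values
      x[[i]][is.na(x[[i]])] <- mean(x[[i]], na.rm = TRUE)
    }
  }
  return(x)
}
Iaquinta <- AL2(Iaquinta)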
```


```{r}
sum(is.na(Iaquinta))
#The number of missing values does not change, meaning that we do not have any missing numeric values left
```


##Handling Missing Values (3) - Missing values of numeric variables are replaced by the means of the non-missing values in the variables 
```{r}
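Iaquinta22=function(x){
  for (i in 1:ncol(x)){
    if (is.numeric(x[,i])){
      # Numeric column: replace NA by the mean of the non-missing values
      x[,i][is.na(x[,i])]=mean(x[,i], na.rm=TRUE)
    }else{
      # Categorical column: replace NA by the most frequent observed category.
      # NA is excluded from the candidates so it can never be chosen as the mode.
      cats=unique(x[,i][!is.na(x[,i])])
      x[,i][is.na(x[,i])]=cats[which.max(tabulate(match(x[,i], cats)))]
    }
  }
  return (x)
}
Iaquinta <- Iaquinta22(Iaquinta)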

```

```{r}
sum(is.na(Iaquinta))
#We go from 22322 to 4507 missing values
```



#Taking Care of the levels
```{r}
levels(Iaquinta$Stage)
levels(Iaquinta$Stage)=c("Prelim", "Final", "Prelim", "Prelim","Prelim","Prelim","Prelim","Prelim","Prelim","Prelim","Prelim","Prelim","Prelim","Prelim","Prelim","Prelim","Prelim","Semi_Final","Semi_Final","Prelim","Quarter_Final","Round_16","Semi_Final","Semi_Final")
levels(Iaquinta$Stage)
```
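
#The positional assignment above depends on the alphabetical order of the 24 original levels. As an alternative sketch, base R also accepts a named list on the left-hand side of `levels()`, which maps old labels to new ones by name. The stage labels in the toy factor below are hypothetical and stand in for the real ones.

```{r}
# Toy factor with hypothetical stage labels (the real file has 24 levels)
stage <- factor(c("Group 1", "Group 2", "Final", "Semi-finals",
                  "Quarter-finals", "Round of 16", "Group 1"))

# Each list name is a new level; its value lists the old labels merged into it.
levels(stage) <- list(
  Prelim        = c("Group 1", "Group 2"),
  Round_16      = "Round of 16",
  Quarter_Final = "Quarter-finals",
  Semi_Final    = "Semi-finals",
  Final         = "Final"
)

table(stage)
```

#With this form, any old label that is not listed becomes NA, which makes a forgotten level easy to spot.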



### Part 4 - Encoding/Recoding Categorical Variables


##Recoding categorical variable using one hot encoding (dummy encoding)
```{r}
dummies_model <- dummyVars(Year ~., data=Iaquinta)
trainData_mat <- predict(dummies_model, newdata =Iaquinta)

trainData <- data.frame(trainData_mat)
trainData$Year <- Iaquinta$Year
```


#This helps the model assign each year to its corresponding World Cup.
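
#To illustrate what one-hot encoding produces, here is a minimal sketch on made-up data. It uses base R's `model.matrix` rather than caret's `dummyVars`, so that it runs without extra packages; the data frame `toy` and its values are hypothetical.

```{r}
# Toy data (hypothetical values)
toy <- data.frame(
  Stage = factor(c("Prelim", "Final", "Prelim", "Semi_Final")),
  Goals = c(2, 1, 0, 3)
)

# "~ . - 1" drops the intercept, so the factor gets one indicator
# column per level instead of a reference-level coding.
encoded <- model.matrix(~ . - 1, data = toy)

colnames(encoded)
encoded
```

#Each row has exactly one 1 among the Stage indicator columns, and the numeric column is passed through unchanged.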


##Recoding categorical variable using one hot encoding with a different variable
```{r}
dummies_model <- dummyVars(Away.Team.Goals ~., data=Iaquinta)
trainData_mat <- predict(dummies_model, newdata =Iaquinta)

trainData <- data.frame(trainData_mat)
trainData$Away.Team.Goals <- Iaquinta$Away.Team.Goals
```

#Based on the value, this helps the model detect whether a given number of goals belongs to a Home Team or an Away Team.



### VISUALIZATION AND GRAPHS ###
```{r}
library(ggplot2)
ggplot(data = Iaquinta) + geom_density(mapping = aes(x = Attendance, fill = Stage)) + facet_wrap(~Stage)
```

#This graph shows that attendance for the preliminary rounds is mostly below 50,000 people.
#For the round of 16, attendance is fairly spread out, but it is concentrated between 25,000 and 75,000.
#For the quarter-finals, attendance is also very spread out, and no single value stands out more than the others.
#For the semi-finals, attendance is mostly around 70,000.
#For the final, attendance peaks at 75,000.
#Overall, attendance varies with the teams that are playing and the capacity of the stadium.

```{r}
library(ggplot2)
ggplot(data = Iaquinta) + geom_density(mapping = aes(x = Home.Team.Goals, fill = Stage)) + facet_wrap(~Stage)
```

#This graph shows the number of goals the Home Team scores during the match.

```{r}
library(ggplot2)
ggplot(data = Iaquinta) + geom_density(mapping = aes(x = Away.Team.Goals, fill = Stage)) + facet_wrap(~Stage)
```

#This graph shows the number of goals the Away team has scored during the match.

```{r}
library(ggplot2)
ggplot(data = Iaquinta) + geom_density(mapping = aes(x = Attendance, fill = Year)) + facet_wrap(~Year)
```

#This graph shows the overall attendance during each edition of the World Cup; we can see that from 1930 to 1938 the World Cup became more popular. Because of the Second World War, there was no World Cup in 1942 and 1946; the tournament restarted slowly in 1950 and regained popularity afterward.

```{r}
library(ggplot2)
ggplot(data = Iaquinta) + geom_density(mapping = aes(x = Half.time.Home.Goals, fill = Stage)) + facet_wrap(~Stage)
```

#This graph shows the number of goals the Home teams have scored after 45 minutes.

```{r}
library(ggplot2)
ggplot(data = Iaquinta) + geom_density(mapping = aes(x = Half.time.Away.Goals, fill = Stage)) + facet_wrap(~Stage)
```

#This graph shows the number of goals the away teams have scored after 45 minutes.


```{r}
Al22 <- function(Iaquinta,var1,var2) {
  rt = ggplot(data=Iaquinta) + geom_bar(mapping = aes(x = Iaquinta[,var1], fill = Iaquinta[,var2]), position = "dodge")
  return(rt)
}

Al22(Iaquinta, 2, 2)

```


#This graph shows the number of observations we have by stage; as expected, there is more information for the preliminary rounds than for any other stage, and there are fewer observations for the final than for any other stage.

```{r}
Al23 <- function(Iaquinta,var1,var2) {
  rt = ggplot(data=Iaquinta) + geom_density(mapping = aes(x = Iaquinta[,var1], fill = Iaquinta[,var2]), position = "dodge")
  return(rt)
}

Al23(Iaquinta, 2, 2)
```

#This graph shows that, in terms of the density of the observations, the prelims far outweigh every other category.


```{r}
Al24 <- function(Iaquinta,var1,var2) {
  rt = ggplot(data=Iaquinta) + geom_histogram(mapping = aes(x = Iaquinta[,var1], fill = Iaquinta[,var2]), position = "dodge")
  return(rt)
}

Al24(Iaquinta, 11, 2)
```

#As before, but from another angle, this graph shows that, in terms of the density of the observations, the prelims far outweigh every other category.


```{r}
Al25=function(x){
  for (i in 1:ncol(x)){
    if (is.numeric(x[,i])){
      print(ggplot(data=x)+geom_density(mapping=aes(x=x[,i]))+xlab(names(x)[i]))
    }
  }
}
Al25(Iaquinta)

```

#These graphs show the density of each numeric variable, such as attendance, Home Team goals, and Away Team goals. It is interesting to see that during a match the Home Team most frequently scores 2 goals while the Away Team scores 1.


```{r}
Al26= function(x){
  for (i in 1:ncol(x)){
    if (is.numeric(x[,i])){
      print(ggplot(data=x)+geom_histogram(mapping=aes(x=x[,i]),fill="red")+xlab(names(x)[i]))
    }
  }
}
Al26(Iaquinta)
```

#These graphs show the number of observations for each numeric variable, such as the number of goals scored by the home teams after 45 and 90 minutes and the number of goals scored by the away teams after 45 and 90 minutes.



### Model Training and Model Tuning ###

#Random Forest
```{r}
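# Tuning grid with a single candidate combination
AL5 = expand.grid(mtry = 3, splitrule = c("gini"),
                     min.node.size = 5)
library(caret)
# Assumption: the target is Stage and the training data is the cleaned Iaquinta
# data frame, as in the random forest models further down
AL6 <- train(Stage ~ ., data = Iaquinta, method = "ranger",
               trControl = trainControl(method ="cv", 
                                        number = 3, verboseIter = TRUE),
               tuneGrid = AL5)
confusionMatrix(AL6)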

#Accuracy (average) : 0.9452
```


#GLMNET
```{r}
myGrid = expand.grid(alpha = 0.1,
                     lambda = 0.1)

myControl = trainControl(method = "cv", number = 5)

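# Assumption: the target is Stage and the training data is the cleaned Iaquinta
# data frame, as in the random forest models further down
model2 = train(Stage ~ ., data = Iaquinta, method = "glmnet", 
               trControl = myControl,
               tuneGrid = myGrid)
confusionMatrix(model2)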

#Accuracy (average) : 0.9184
```


#random forest with 10-fold cross validation
```{r}
myGrid = expand.grid(mtry = c(1:2), splitrule = c("gini"),
                     min.node.size = c(1:2))

rf_Iaquinta10 <- train(Stage~.,data = Iaquinta, method = "ranger", 
               trControl = trainControl(method ="cv", number = 10, verboseIter = TRUE),
               tuneGrid = myGrid)

rf_Iaquinta10

#Best Accuracy = 0.9534136 for mtry=2, min.node.size=1
#Tuning parameter 'splitrule' was held constant at a value of gini.
#Accuracy was used to select the optimal model using the largest value.
#The final values used for the model were mtry = 1, splitrule = gini
# and min.node.size = 1.
```
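
#As a quick check on how much work the tuning step involves, the sketch below rebuilds the same grid and counts the candidate combinations. The names `grid` and `n_fits` are introduced here for illustration only.

```{r}
# Same tuning grid as in the chunk above
grid <- expand.grid(mtry = 1:2, splitrule = "gini", min.node.size = 1:2)
grid

# Each combination is fitted once per fold during 10-fold cross-validation;
# caret then refits the selected combination once on the full data.
n_fits <- nrow(grid) * 10
nrow(grid)
n_fits
```

#The grid therefore covers only four candidate settings, which keeps tuning fast but explores a narrow range of `mtry` and `min.node.size`.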


#random forest with 7-fold cross validation
```{r}
myGrid = expand.grid(mtry = c(1:2), splitrule = c("gini"),
                     min.node.size = c(1:2))

rf_Iaquinta7 <- train(Stage~.,data = Iaquinta, method = "ranger", 
               trControl = trainControl(method ="cv", number = 7, verboseIter = TRUE),
               tuneGrid = myGrid)

rf_Iaquinta7

#Best Accuracy = 0.9534151 for mtry=1, min.node.size=1 -> This is the best accuracy of all the models.
#Tuning parameter 'splitrule' was held constant at a value of gini.
#Accuracy was used to select the optimal model using the largest value.
#The final values used for the model were mtry = 1, splitrule = gini
# and min.node.size = 1.

#Overall, each model does a good job of predicting the winner of the World Cup, but it is important to keep
#in mind that anyone who is knowledgeable about football can more or less predict the top favorites.
```
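
#Because most matches belong to the preliminary rounds, an accuracy figure is easier to judge next to the accuracy of always guessing the most common stage. The sketch below computes that baseline on a made-up outcome vector; the counts are hypothetical and are not the real distribution of Stage.

```{r}
# Toy outcome vector (hypothetical counts, not the real Stage distribution)
stage <- factor(c(rep("Prelim", 90), rep("Round_16", 4),
                  rep("Quarter_Final", 3), rep("Semi_Final", 2), "Final"))

class_share <- prop.table(table(stage))   # share of each stage
majority_class <- names(class_share)[which.max(class_share)]
baseline_accuracy <- max(class_share)     # accuracy of always guessing the mode

majority_class
baseline_accuracy
```

#Running the same two lines on `Iaquinta$Stage` would give the baseline for the real data, against which the cross-validated accuracies above could be compared.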

