# Football Data Analysis Project

This project analyzes football match data to study team performance and scoring patterns. The data covers matches, teams, goals, and seasons.

## Data Description

The dataset consists of the following columns:

- **Match_ID**: Unique identifier of the match.
- **Div**: Division of the league in which the match took place.
- **Season**: Season in which the match was played.
- **Date**: Date of the match.
- **HomeTeam**: Home team.
- **AwayTeam**: Away team.
- **FTHG**: Number of goals scored by the home team at the end of the match.
- **FTAG**: Number of goals scored by the away team at the end of the match.
- **FTR**: Final result of the match: 'H' (home team victory), 'A' (away team victory), or 'D' (draw).
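
The three result columns are related: the sketch below assumes that FTR is fully determined by FTHG and FTAG, as the column descriptions imply. The helper name `full_time_result` is hypothetical and is not part of the project code.

```python
def full_time_result(fthg: int, ftag: int) -> str:
    """Return 'H', 'A', or 'D' from the full-time goal counts (assumed rule)."""
    if fthg > ftag:
        return "H"  # home team scored more goals
    if fthg < ftag:
        return "A"  # away team scored more goals
    return "D"      # equal goals: draw


print(full_time_result(2, 1))
```

A check like this could be used to spot rows where FTR disagrees with the recorded goals.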

## Data Analysis Steps

The analysis is performed with PySpark, the Python API for Apache Spark. The key steps are:

1. **Calculating Home Team Points**: 
   - Filter the matches to select those from Division "D1" and seasons between 2007 and 2016.
   - Group the matches by HomeTeam and Season.
   - Calculate the sum of points based on the match result (H, D, or A).
   - Order the results by Season and, within each season, by points in descending order.

2. **Calculating Away Team Points**: 
   - Filter the matches to select those from Division "D1" and seasons between 2007 and 2016.
   - Group the matches by AwayTeam and Season.
   - Calculate the sum of points based on the match result (H, D, or A).
   - Order the results by Season and, within each season, by points in descending order.

3. **Total Points Calculation**: 
   - Combine the home team points and away team points using the union operation to create a unified points table.

4. **Classification of Teams**: 
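   - Group the total points table by team and Season. After the union, both home and away team names sit in the HomeTeam column.
   - Calculate the sum of points for each team in each season.
   - Order the results by Season and, within each season, by total points in descending order.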

5. **Adding Position Column**: 
   - Define a window partitioned by Season and ordered by total points in descending order.
   - Add the "Position" column to the classification table using the row_number function over the window.
   - Display the classification table, showing the team, season, total points, and position.

6. **Selecting Winner Teams**: 
   - Add the "row_number" column to the classification table using the row_number function over the window.
   - Filter the classification table to select the teams with the highest total points in each season (where row_number is equal to 1).
   - Select the team, season, and total points columns.
   - Drop the "row_number" column.
   - Display the winners' table, showing the team, season, and total points.

7. **Scoring Goals Analysis**:
   - Calculate the home goals scored by each team in each season.
   - Group the matches by HomeTeam and Season.
   - Calculate the sum of FTHG (home goals).
   - Order the results by home goals in descending order.
   - Calculate the away goals scored by each team in each season.
   - Group the matches by AwayTeam and Season.
   - Calculate the sum of FTAG (away goals).
   - Order the results by away goals in descending order.

8. **Total Goals Calculation**:
   - Join the home goals and away goals tables on the team and season columns using an inner join.
   - Add the "Total Goals" column by summing the home goals and away goals.
   - Display the results, showing the team, home goals, away goals, season, and total goals.
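
The points rule behind steps 1 to 4 can be sketched in plain Python, without Spark. The match rows and team names below are hypothetical placeholders, not values from the dataset; the 3/1/0 scoring follows the `when` expressions used in the notebook cells.

```python
from collections import defaultdict

# Hypothetical rows: (HomeTeam, AwayTeam, Season, FTR). Not real data.
matches = [
    ("Team A", "Team B", 2007, "H"),
    ("Team B", "Team A", 2007, "D"),
    ("Team A", "Team C", 2007, "A"),
]


def home_points(ftr: str) -> int:
    """Points earned by the home side: 3 for 'H', 1 for 'D', 0 for 'A'."""
    return {"H": 3, "D": 1}.get(ftr, 0)


def away_points(ftr: str) -> int:
    """Points earned by the away side: 3 for 'A', 1 for 'D', 0 for 'H'."""
    return {"A": 3, "D": 1}.get(ftr, 0)


# Accumulate home and away points per (team, season), like union + groupBy + sum
points = defaultdict(int)
for home, away, season, ftr in matches:
    points[(home, season)] += home_points(ftr)
    points[(away, season)] += away_points(ftr)

for (team, season), total in sorted(points.items()):
    print(team, season, total)
```

Each match contributes to two teams at once, which is why the notebook computes a home table and an away table before combining them.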

In [0]:
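# Import only the functions the notebook uses; note that this `sum` is Spark's
# column aggregate and shadows Python's built-in sum
from pyspark.sql.functions import col, desc, row_number, sum, when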
from pyspark.sql import Window

In [0]:
# Load a Delta table from the given path into a DataFrame
def loadDf(fileName):
    # Delta tables carry their own schema, so a CSV-style header option is not needed
    return spark.read.format('delta').load(fileName)


In [0]:
# Matches

dtMatches = loadDf("dbfs:/user/hive/warehouse/matches")

# dtMatches.display()

# Match_ID: Unique identifier of the match.
# Div: Division of the league in which the game took place.
# Season: Season in which the game was held.
# Date: Date of the game.
# HomeTeam: Home team.
# AwayTeam: Away team.
# FTHG: Number of goals scored by the home team (HomeTeam) at the end of the match.
# FTAG: Number of goals scored by the away team (AwayTeam) at the end of the match.
# FTR: Final result of the match. It can be 'H' (home team victory), 'A' (away team victory), or 'D' (draw).



In [0]:
# Filter matches where "Div" is "D1" and "Season" is between 2007 and 2016
dtHomePoints = (
    dtMatches
    .filter((col("Div") == "D1") & col("Season").between(2007, 2016))
    .groupBy("HomeTeam", "Season")
    .agg(
        sum(when(col("FTR") == "H", 3).when(col("FTR") == "D", 1).otherwise(0)).alias("Points")
    )
    .orderBy("Season", desc("Points"))
    .select("HomeTeam", "Season", "Points")
)

# Display dtHomePoints table
display(dtHomePoints)



HomeTeam,Season,Points
Bayern Munich,2007,41
Werder Bremen,2007,39
Stuttgart,2007,38
Schalke 04,2007,34
Hamburg,2007,32
Leverkusen,2007,31
Hertha,2007,30
Hannover,2007,29
Ein Frankfurt,2007,28
Wolfsburg,2007,27


In [0]:
# Filter matches where "Div" is "D1" and "Season" is between 2007 and 2016
dtAwayPoints = (
    dtMatches
    .filter((col("Div") == "D1") & col("Season").between(2007, 2016))
    .groupBy("AwayTeam", "Season")
    .agg(
        sum(when(col("FTR") == "A", 3).when(col("FTR") == "D", 1).otherwise(0)).alias("Points")
    )
    .orderBy("Season", desc("Points"))
    .select("AwayTeam", "Season", "Points")
)

# Display dtAwayPoints table
display(dtAwayPoints)



AwayTeam,Season,Points
Bayern Munich,2007,35
Schalke 04,2007,30
Werder Bremen,2007,27
Wolfsburg,2007,27
Hamburg,2007,22
Leverkusen,2007,20
Hannover,2007,20
Karlsruhe,2007,19
Ein Frankfurt,2007,18
Bochum,2007,17
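
As a sanity check, the home and away tables above can be combined by hand. The sketch below is plain Python rather than Spark, and uses only 2007 values copied from the two output tables above, limited to teams that appear in both listings. It mirrors a union followed by a grouped sum.

```python
from collections import Counter

# 2007 home points, copied from the dtHomePoints output above
home = {"Bayern Munich": 41, "Werder Bremen": 39, "Schalke 04": 34,
        "Hamburg": 32, "Wolfsburg": 27}

# 2007 away points, copied from the dtAwayPoints output above
away = {"Bayern Munich": 35, "Schalke 04": 30, "Werder Bremen": 27,
        "Wolfsburg": 27, "Hamburg": 22}

# Counter.update adds values for matching keys, like groupBy(...).sum("Points")
total = Counter(home)
total.update(away)

# Sort by total points, highest first
for team, pts in sorted(total.items(), key=lambda kv: -kv[1]):
    print(team, pts)
```

In Spark, `union` matches columns by position rather than by name, so the AwayTeam values end up under the HomeTeam column of the combined table.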


In [0]:
# Union of all points earned at home and away.
# union matches columns by position, so AwayTeam values land in the HomeTeam column.
dtTotalPoints = dtHomePoints.union(dtAwayPoints)

# Group by "HomeTeam" and "Season", aggregate sum of "Points" as "Total Points"
# Order by "Season" and "Total Points" in descending order
# Select columns "HomeTeam" as "Team", "Season", and "Total Points"

dtClassification = (
    dtTotalPoints
    .groupBy("HomeTeam", "Season")
    .agg(sum("Points").alias("Total Points"))
    .orderBy("Season", desc("Total Points"))
    .select(col("HomeTeam").alias("Team"), "Season", "Total Points")
)

# Display dtClassification table
dtClassification.display()

# Define a window partitioned by "Season" and ordered by "Total Points" in descending order
window = Window.partitionBy("Season").orderBy(col("Total Points").desc())

# Add a "row_number" column and keep only the first row of each season (the winner)
# Select columns "Team", "Season", and "Total Points"
# Drop the column "row_number"

winnerDt = (
    dtClassification
    .withColumn("row_number", row_number().over(window))
    .filter(col("row_number") == 1)
    .select("Team", "Season", "Total Points")
    .drop("row_number")
)

# Display winnerDt table without truncating column values
winnerDt.display(truncate=False)


Team,Season,Total Points
Bayern Munich,2007,76
Werder Bremen,2007,66
Schalke 04,2007,64
Hamburg,2007,54
Wolfsburg,2007,54
Stuttgart,2007,52
Leverkusen,2007,51
Hannover,2007,49
Ein Frankfurt,2007,46
Hertha,2007,44


Team,Season,Total Points
Bayern Munich,2007,76
Wolfsburg,2008,69
Bayern Munich,2009,70
Dortmund,2010,75
Dortmund,2011,81
Bayern Munich,2012,91
Bayern Munich,2013,90
Bayern Munich,2014,79
Bayern Munich,2015,88
Bayern Munich,2016,82
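
The 2007 classification above contains a tie (Hamburg and Wolfsburg, both on 54 points). Spark's `row_number` gives tied rows distinct numbers, and the order among them is not guaranteed unless the window ordering includes a tiebreaker. The plain-Python sketch below contrasts row-number-style and rank-style numbering; breaking ties by team name is an assumption made for illustration, not the rule used by the notebook or by any league.

```python
# (Team, Total Points) for 2007, copied from the classification output above
table = [
    ("Bayern Munich", 76), ("Werder Bremen", 66), ("Schalke 04", 64),
    ("Hamburg", 54), ("Wolfsburg", 54), ("Stuttgart", 52),
]

# Row-number style: points descending, then team name as an explicit tiebreaker
ordered = sorted(table, key=lambda row: (-row[1], row[0]))
row_numbers = {team: i for i, (team, _) in enumerate(ordered, start=1)}

# Rank style: teams on equal points share a position and the next position is skipped
ranks = {}
for i, (team, pts) in enumerate(ordered, start=1):
    prev_team, prev_pts = ordered[i - 2] if i > 1 else (None, None)
    ranks[team] = ranks[prev_team] if pts == prev_pts else i

for team, _ in ordered:
    print(team, row_numbers[team], ranks[team])
```

With an explicit secondary sort key the numbering is reproducible from run to run, which matters when the position is later used as a filter.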


In [0]:
# Filter dtMatches for Div "D1", group by HomeTeam and Season, and calculate the sum of FTHG as "Home Goals"

dtScoredGoalsHome = (
    dtMatches
    .filter(col("Div") == "D1")
    .groupBy("HomeTeam", "Season")
    .agg(
        sum(col("FTHG")).alias("Home Goals"),
    )
    .select(
        col("HomeTeam").alias("Team"),  # Rename HomeTeam column to Team
        col("Home Goals"),
        col("Season")
    )
    .orderBy(desc("Home Goals"))  # Order by Home Goals in descending order
)

# Display dtScoredGoalsHome table
dtScoredGoalsHome.display()


Team,Home Goals,Season
Bayern Munich,56,2012
Bayern Munich,55,2016
Wolfsburg,51,2008
Bayern Munich,51,2015
Werder Bremen,50,2005
Bayern Munich,50,1999
Bayern Munich,49,2011
Dortmund,49,2015
Werder Bremen,48,2007
Bayern Munich,48,2010


In [0]:
# Filter dtMatches for Div equal to "D1", group by "AwayTeam" and "Season", aggregate sum of "FTAG" as "Away Goals"
# Select columns "AwayTeam" as "Team", "Away Goals", and "Season"
# Order by "Away Goals" in descending order

dtScoredGoalsAway = (
    dtMatches
    .filter(col("Div") == "D1")
    .groupBy("AwayTeam", "Season")
    .agg(
        sum(col("FTAG")).alias("Away Goals"),
    )
    .select(
        col("AwayTeam").alias("Team"),
        col("Away Goals"),
        col("Season")
    )
    .orderBy(desc("Away Goals"))
)

# Display dtScoredGoalsAway table
dtScoredGoalsAway.display()


Team,Away Goals,Season
Bayern Munich,46,2013
Werder Bremen,43,2006
Bayern Munich,42,2012
Dortmund,41,2012
Dortmund,39,2013
Werder Bremen,37,2003
Werder Bremen,37,2009
Dortmund,36,2011
Werder Bremen,35,2004
Bayern Munich,34,2016


In [0]:
# Filter dtScoredGoalsHome for Season not equal to 2017, select columns "Team", "Home Goals", "Season"
# Join with dtScoredGoalsAway on columns "Team" and "Season" using inner join
# Add a new column "Total Goals" by summing "Home Goals" and "Away Goals"

dtScoredGoals = (
    dtScoredGoalsHome
    .filter(col("Season") != 2017)
    .select("Team", "Home Goals", "Season")
    .join(dtScoredGoalsAway, ["Team", "Season"], "inner")
    .withColumn("Total Goals", col("Home Goals") + col("Away Goals"))
)

# Display dtScoredGoals table
dtScoredGoals.display()


Team,Season,Home Goals,Away Goals,Total Goals
M'gladbach,1996,34,12,46
Bayern Munich,2011,49,28,77
Ein Frankfurt,1995,29,14,43
M'gladbach,2013,38,21,59
FC Koln,1994,31,23,54
Werder Bremen,2003,42,37,79
Augsburg,2015,18,24,42
M'gladbach,1994,37,29,66
M'gladbach,2012,27,18,45
Bielefeld,2007,21,14,35
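
The inner join on "Team" and "Season" can be illustrated in plain Python. The Bayern Munich and Werder Bremen values are copied from the output above; the "Hypothetical FC" entry is invented to show that a key present on only one side is dropped by an inner join.

```python
# Home goals keyed by (Team, Season)
home_goals = {
    ("Bayern Munich", 2011): 49,
    ("Werder Bremen", 2003): 42,
    ("Hypothetical FC", 2011): 10,  # invented row with no away-goals match
}

# Away goals keyed by (Team, Season)
away_goals = {
    ("Bayern Munich", 2011): 28,
    ("Werder Bremen", 2003): 37,
}

# Inner join: keep only keys present on both sides, then add the total
total_goals = {
    key: (home_goals[key], away_goals[key], home_goals[key] + away_goals[key])
    for key in home_goals.keys() & away_goals.keys()
}

for (team, season), (home, away, total) in sorted(total_goals.items()):
    print(team, season, home, away, total)
```

A left or full outer join would keep the unmatched row instead, with the missing side left empty.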


In [0]:
# Define a window partitioned by "Season" and ordered by "Total Points" in descending order
window = Window.partitionBy("Season").orderBy(col("Total Points").desc())

# Adds the Column "Position" to dtClassification
dtClassification = dtClassification.withColumn("Position", row_number().over(window))

# Show the Position column
dtClassification.display(truncate=True)

# Joins dtScoredGoals with dtClassification and orders by Season and Total Points in descending order
dtFinalTable = dtScoredGoals.join(dtClassification, ["Team", "Season"]).orderBy("Season", desc("Total Points"))

# Selects all columns from dtFinalTable except 'Position'
columns = [c for c in dtFinalTable.columns if c != 'Position']

# Reorders the columns with 'Position' as the first column
dtFinalTable = dtFinalTable.select(col('Position'), *columns)

# Displays the final table
dtFinalTable.display()


Team,Season,Total Points,Position
Bayern Munich,2007,76,1
Werder Bremen,2007,66,2
Schalke 04,2007,64,3
Hamburg,2007,54,4
Wolfsburg,2007,54,5
Stuttgart,2007,52,6
Leverkusen,2007,51,7
Hannover,2007,49,8
Ein Frankfurt,2007,46,9
Hertha,2007,44,10


Position,Team,Season,Home Goals,Away Goals,Total Goals,Total Points
1,Bayern Munich,2007,41,27,68,76
2,Werder Bremen,2007,48,27,75,66
3,Schalke 04,2007,29,26,55,64
4,Hamburg,2007,30,17,47,54
5,Wolfsburg,2007,28,30,58,54
6,Stuttgart,2007,39,18,57,52
7,Leverkusen,2007,32,25,57,51
8,Hannover,2007,32,22,54,49
9,Ein Frankfurt,2007,24,19,43,46
10,Hertha,2007,21,18,39,44
