## Expected Goals (xG) Model - Data Splitting Pipeline

This notebook splits the imported data into competition-specific CSV files.<br>
This step is optional: you can also work with the events file directly.<br>
<i>**Note**</i>: MLS appears in this notebook because this is an older version.

In [2]:
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder.appName("Splitting_Data").getOrCreate()

### 1. Splitting matches data

In [None]:
matches_df = spark.read.csv("../data/raw/matches.csv", header=True, inferSchema=True)

In [None]:
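dataframes = {}
# One filtered DataFrame per competition, largest competition first.
# groupBy().count() already yields one row per competition, so no distinct() is needed.
for competition in matches_df.groupBy('competition').count().orderBy('count', ascending=False).collect():
    dataframes[competition[0]] = matches_df.filter(matches_df.competition == competition[0])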

In [9]:
for competition, df in dataframes.items():
    print(competition, ':', df.count())

Spain - La Liga : 590
France - Ligue 1 : 435
Italy - Serie A : 380
England - Premier League : 380
Germany - 1. Bundesliga : 340
International - FIFA World Cup : 128
Europe - UEFA Euro : 102
Africa - African Cup of Nations : 52
South America - Copa America : 32
United States of America - Major League Soccer : 6
Europe - Champions League : 5


In [None]:
for competition, df in dataframes.items():
    df.toPandas().to_csv(f"../data/split_data/matches/{competition}.csv", index=False)
    print(f"{competition} has been written to CSV.")

Spain - La Liga has been written to CSV.
France - Ligue 1 has been written to CSV.
Italy - Serie A has been written to CSV.
England - Premier League has been written to CSV.
Germany - 1. Bundesliga has been written to CSV.
International - FIFA World Cup has been written to CSV.
Europe - UEFA Euro has been written to CSV.
Africa - African Cup of Nations has been written to CSV.
South America - Copa America has been written to CSV.
United States of America - Major League Soccer has been written to CSV.
Europe - Champions League has been written to CSV.
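
The competition labels are used verbatim as file names above, so the resulting paths contain spaces and, for the Bundesliga, an extra dot. If that turns out to be awkward for shell tools or downstream loaders, the labels could be normalised first. The helper below is a minimal sketch: `safe_filename` is a hypothetical name that does not exist in this project, and any notebook that later reads the split files would have to apply the same transformation.

```python
import re


def safe_filename(competition: str) -> str:
    """Turn a competition label into a lowercase, underscore-separated file stem."""
    # Collapse every run of non-alphanumeric characters into a single underscore.
    stem = re.sub(r"[^0-9A-Za-z]+", "_", competition)
    # Drop underscores left at either end, then lowercase the result.
    return stem.strip("_").lower()


print(safe_filename("Germany - 1. Bundesliga"))
print(safe_filename("Spain - La Liga"))
```

With this helper, the write call would use `f"../data/split_data/matches/{safe_filename(competition)}.csv"` instead of the raw label.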


### 2. Splitting events data

In [None]:
events_df = spark.read.csv("../data/raw/events.csv", header=True, inferSchema=True, sep=";")

                                                                                

In [None]:
events_df.groupBy('competition').count().orderBy('count', ascending=False).show(truncate=False)



+----------------------------------------------+-------+
|competition                                   |count  |
+----------------------------------------------+-------+
|Spain - La Liga                               |2099452|
|France - Ligue 1                              |1589854|
|Italy - Serie A                               |1353739|
|England - Premier League                      |1313783|
|Germany - 1. Bundesliga                       |1207631|
|International - FIFA World Cup                |462501 |
|Europe - UEFA Euro                            |380550 |
|Africa - African Cup of Nations               |162910 |
|South America - Copa America                  |100305 |
|United States of America - Major League Soccer|21786  |
|Europe - Champions League                     |18203  |
+----------------------------------------------+-------+



                                                                                

In [None]:
dataframes = {}
# One filtered events DataFrame per competition, largest competition first.
for competition in events_df.groupBy('competition').count().orderBy('count', ascending=False).collect():
    dataframes[competition[0]] = events_df.filter(events_df.competition == competition[0])

                                                                                

In [None]:
for competition, df in dataframes.items():
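    # Spark writes a directory named "{competition}.csv" that holds a single part file, not a flat CSV.
    df.coalesce(1).write.csv(f"../data/split_data/events/{competition}.csv", header=True, sep=";")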
    print(f"{competition} events have been written to CSV.")

                                                                                

Spain - La Liga events have been written to CSV.
France - Ligue 1 events have been written to CSV.
Italy - Serie A events have been written to CSV.
England - Premier League events have been written to CSV.
Germany - 1. Bundesliga events have been written to CSV.
International - FIFA World Cup events have been written to CSV.
Europe - UEFA Euro events have been written to CSV.
Africa - African Cup of Nations events have been written to CSV.
South America - Copa America events have been written to CSV.
United States of America - Major League Soccer events have been written to CSV.
Europe - Champions League events have been written to CSV.
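
Spark's `write.csv` produces a directory at the given path (here one named `{competition}.csv`) containing a `part-*.csv` file plus sidecar files such as `_SUCCESS`, rather than a single flat file. If a plain CSV is preferred, the part file can be moved out afterwards. The sketch below assumes a local filesystem and exactly one part file (which `coalesce(1)` should give); `flatten_spark_csv` is a hypothetical helper, and the demo data is made up.

```python
import shutil
import tempfile
from pathlib import Path


def flatten_spark_csv(output_dir: str) -> Path:
    """Replace a Spark CSV output directory with the single part file it contains."""
    out = Path(output_dir)
    parts = sorted(out.glob("part-*.csv"))
    if len(parts) != 1:
        # coalesce(1) should leave exactly one part file; anything else is unexpected.
        raise ValueError(f"expected one part file in {out}, found {len(parts)}")
    tmp = out.with_name(out.name + ".tmp")
    shutil.move(str(parts[0]), str(tmp))  # park the data next to the directory
    shutil.rmtree(out)                    # drop the directory and its sidecar files
    tmp.rename(out)                       # the path now points at a plain CSV file
    return out


# Demo on a throwaway directory that mimics Spark's output layout (made-up content).
with tempfile.TemporaryDirectory() as tmp_root:
    out_dir = Path(tmp_root) / "Europe - UEFA Euro.csv"
    out_dir.mkdir()
    (out_dir / "part-00000-demo-c000.csv").write_text("match_id;type\n1;Shot\n")
    (out_dir / "_SUCCESS").touch()
    flat = flatten_spark_csv(str(out_dir))
    demo_is_file = flat.is_file()
    demo_text = flat.read_text()
```

In the loop above, this would be called once per competition, right after the corresponding `write.csv`.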


                                                                                

### 3. Splitting frames data

In [None]:
matches_df = spark.read.csv("../data/raw/matches.csv", header=True, inferSchema=True)

In [None]:
frames_df = spark.read.csv("../data/raw/frames.csv", header=True, inferSchema=True)

                                                                                

In [None]:
df_frames = {}
for comp in matches_df.groupBy('competition').count().distinct().collect():
    print(comp[0])
    # Collect this competition's match IDs, then keep only the frames of those matches.
    match_rows = matches_df.filter(matches_df.competition == comp[0]).select('match_id').collect()
    df_ids = [row[0] for row in match_rows]
    df_frames[comp[0]] = frames_df.filter(frames_df.match_id.isin(df_ids))

South America - Copa America
France - Ligue 1
Italy - Serie A
Europe - Champions League
International - FIFA World Cup
Spain - La Liga
Africa - African Cup of Nations
United States of America - Major League Soccer
Europe - UEFA Euro
England - Premier League
Germany - 1. Bundesliga


In [86]:
for comp, df in df_frames.items():
    print(comp, ':', df.count())

                                                                                

South America - Copa America : 0
France - Ligue 1 : 0
Italy - Serie A : 0
Europe - Champions League : 0
International - FIFA World Cup : 3084876
Spain - La Liga : 0
Africa - African Cup of Nations : 0
United States of America - Major League Soccer : 0
Europe - UEFA Euro : 5221376
England - Premier League : 0
Germany - 1. Bundesliga : 1953182
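
The counts above show that only three competitions have any frames. Counting every filtered DataFrame scans millions of rows just to learn that; comparing match-ID sets on the driver gives the same coverage picture far more cheaply. The sketch below is an assumption about how that check could look, not part of the original pipeline: `frame_coverage` is a hypothetical helper, and the IDs in the demo are made up.

```python
from typing import Dict, Hashable, Iterable, Set


def frame_coverage(
    ids_by_competition: Dict[str, Iterable[Hashable]],
    frame_match_ids: Set[Hashable],
) -> Dict[str, int]:
    """Count, per competition, the matches that have at least one frame."""
    return {
        comp: len(set(ids) & frame_match_ids)
        for comp, ids in ids_by_competition.items()
    }


# In the notebook the inputs could come from Spark, for example (assumption):
#   frame_match_ids = {r[0] for r in frames_df.select('match_id').distinct().collect()}
# with ids_by_competition built from per-competition ID lists like df_ids above.
# Made-up IDs, for illustration only:
coverage = frame_coverage({"League A": [1, 2, 3], "League B": [4]}, {2, 3, 9})
print(coverage)  # {'League A': 2, 'League B': 0}
```

A competition whose coverage is 0 can then be skipped before any frames DataFrame is filtered or written.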


                                                                                

In [None]:
for comp, df in df_frames.items():
    # take(1) returns an empty list for an empty DataFrame, so this avoids a full count.
    if df.take(1):
        df.coalesce(1).write.csv(f"../data/split_data/frames/{comp}.csv", header=True)
        print(f"{comp} frames have been written to CSV.")

                                                                                

International - FIFA World Cup frames have been written to CSV.
Europe - UEFA Euro frames have been written to CSV.
Germany - 1. Bundesliga frames have been written to CSV.


                                                                                

### 4. Splitting lineups

In [None]:
matches_df = spark.read.csv("../data/raw/matches.csv", header=True, inferSchema=True)

In [None]:
lineups_df = spark.read.csv("../data/raw/lineups.csv", header=True, inferSchema=True)

In [None]:
df_lineups = {}
for comp in matches_df.groupBy('competition').count().distinct().collect():
    print(comp[0])
    # Collect this competition's match IDs, then keep only the lineups of those matches.
    match_rows = matches_df.filter(matches_df.competition == comp[0]).select('match_id').collect()
    df_ids = [row[0] for row in match_rows]
    df_lineups[comp[0]] = lineups_df.filter(lineups_df.match_id.isin(df_ids))

South America - Copa America
France - Ligue 1
Italy - Serie A
Europe - Champions League
International - FIFA World Cup
Spain - La Liga
Africa - African Cup of Nations
United States of America - Major League Soccer
Europe - UEFA Euro
England - Premier League
Germany - 1. Bundesliga


In [100]:
for comp, df in df_lineups.items():
    print(comp, ':', df.count())

South America - Copa America : 1618
France - Ligue 1 : 15866
Italy - Serie A : 16750
Europe - Champions League : 190
International - FIFA World Cup : 6130
Spain - La Liga : 21512
Africa - African Cup of Nations : 2374
United States of America - Major League Soccer : 240
Europe - UEFA Euro : 4932
England - Premier League : 13678
Germany - 1. Bundesliga : 12340


In [None]:
for comp, df in df_lineups.items():
    df.coalesce(1).write.csv(f"../data/split_data/lineups/{comp}.csv", header=True)
    print(f"{comp} lineups have been written to CSV.")

South America - Copa America lineups have been written to CSV.
France - Ligue 1 lineups have been written to CSV.
Italy - Serie A lineups have been written to CSV.
Europe - Champions League lineups have been written to CSV.
International - FIFA World Cup lineups have been written to CSV.
Spain - La Liga lineups have been written to CSV.
Africa - African Cup of Nations lineups have been written to CSV.
United States of America - Major League Soccer lineups have been written to CSV.
Europe - UEFA Euro lineups have been written to CSV.
England - Premier League lineups have been written to CSV.
Germany - 1. Bundesliga lineups have been written to CSV.
