# <font color='gray'>Large Scale Data Analytics using Jupyter Notebooks</font>

###### In this preliminary data analysis, we inspect the column names, the shape and data types of each dataset, and the null values. <br>The data is stored in 9 CSV files: appearances, club_games, clubs, competitions, game_events, game_lineups, games, player_valuations and players.


In [1]:
import pandas as pd
import numpy as np
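
###### The checks listed above (data types and null values per column) can be collected in one small helper. This is only a sketch: the function name `summarize` and the tiny `demo` frame are hypothetical and stand in for any of the nine tables.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, number of nulls, share of nulls."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_null": df.isna().sum(),
        "null_share": df.isna().mean(),
    })


# Made-up frame standing in for one of the CSV files (hypothetical values).
demo = pd.DataFrame({
    "competition_id": ["A1", "B2", "C3", "D4"],
    "country_name": ["Italy", None, "Spain", None],
})

summary = summarize(demo)
print(demo.shape)
print(summary)
```

###### On a real table the same call would be, for example, `summarize(cpt)` once the CSV has been loaded.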

## <font color='green'>competitions</font> table 
###### This table contains information about competitions, such as the competition name, the competition country, and the competition type.

In [2]:
cpt = pd.read_csv('../csv/competitions.csv')
cpt.head()



FileNotFoundError: [Errno 2] No such file or directory: '../csv/competitions.csv'

In [None]:
cpt.columns

In [None]:
cpt.dtypes

In [None]:
cpt.shape

In [None]:
cpt.isna()

In [None]:
cpt.isna().sum()

###### Let's see which tuples have a null value in the country_name column.


In [None]:
cpt[cpt['country_name'].isnull()]

## <font color='green'>games</font> table
###### This table contains details about individual matches, such as the home team, the away team, the competition, the season, the date, the stadium and the number of goals.

In [None]:
gms = pd.read_csv('../csv/games.csv')
gms.head()

In [None]:
gms.columns

In [None]:
gms.dtypes

In [None]:
gms.shape

In [None]:
gms['date']

###### The date column can be converted to a proper datetime type.

In [None]:
gms['date'] = pd.to_datetime(gms['date'])
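
###### If some entries cannot be parsed, `pd.to_datetime` raises by default. A more defensive variant is sketched below on made-up values; the `format` string is an assumption about how dates are written in the file, not something confirmed by the data.

```python
import pandas as pd

# Made-up sample: two valid dates, one malformed entry, one missing value.
raw = pd.Series(["2023-08-19", "2023-08-20", "not a date", None])

# errors="coerce" turns unparseable entries into NaT instead of raising.
# The format string is an assumption, adjust it to the actual file.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Entries that were present in the input but could not be parsed.
n_failed = int(parsed.isna().sum() - raw.isna().sum())
print(parsed.dtype, n_failed)
```

###### Comparing the null counts before and after the conversion shows how many values were lost to parsing problems rather than being missing from the start.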

###### In some tuples, there is missing information about the manager, the stadium, the lineup, and the home team. This could be due to matches without a home team, for example, matches played on neutral grounds or in other circumstances. Additionally, in many tuples, the home team's position is not known.


In [None]:
gms.isna().sum()

###### Here we observe that the number of different competitions is 43.

In [None]:
gms['competition_id'].unique()
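
###### `unique()` returns the distinct values themselves, so the count has to be read off the output. `nunique()` gives the number directly. A small sketch on made-up identifiers (the frame `games_demo` is hypothetical):

```python
import pandas as pd

# Made-up identifiers, including one missing value.
games_demo = pd.DataFrame(
    {"competition_id": ["IT1", "ES1", "IT1", "GB1", None]}
)

# nunique() counts distinct values and ignores missing ones by default.
n_competitions = games_demo["competition_id"].nunique()

# unique() keeps the missing value as its own entry, so its length can differ.
n_with_missing = len(games_demo["competition_id"].unique())

print(n_competitions, n_with_missing)
```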

In [None]:
gms[gms["home_club_formation"].isnull()]
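
###### To see whether the missing formations are concentrated in particular competitions, the share of null values can be computed per competition_id. This is a sketch on made-up rows; `games_demo` and its values are hypothetical.

```python
import pandas as pd

# Made-up rows: formations are missing only in the "CUP" competition.
games_demo = pd.DataFrame({
    "competition_id": ["IT1", "IT1", "CUP", "CUP"],
    "home_club_formation": ["4-3-3", "4-4-2", None, None],
})

# Share of rows with a missing formation, per competition, largest first.
missing_by_competition = (
    games_demo["home_club_formation"]
    .isna()
    .groupby(games_demo["competition_id"])
    .mean()
    .sort_values(ascending=False)
)
print(missing_by_competition)
```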

## <font color='green'>clubs</font> table
###### The clubs table contains statistical information about a particular team, such as the number of players, the market value, the number of foreign players, the home stadium, and the coach.

In [None]:
clb = pd.read_csv('../csv/clubs.csv')
clb.head()


###### The missing information is the market value, the average age, and the percentage of foreigners. The latter two pieces of information could be missing due to incomplete information about the team composition, while the market value information seems to be missing altogether because it is also missing for well-known teams (e.g., AS Roma).

In [None]:
clb.isna().sum()

In [None]:
clb.shape

In [None]:
clb[clb["total_market_value"].isnull()]

###### The data types seem correct. In particular, the three columns that can hold fractional values are of type float.

In [None]:
clb.dtypes

## <font color='green'>club_games</font> table
###### Here we find information about the team's performance in individual matches.

In [None]:
clb_games = pd.read_csv('../csv/club_games.csv')
clb_games.head()


In [None]:
clb_games.shape

In [None]:
clb_games.columns

In [None]:
clb_games.dtypes

###### The missing information refers to the teams' positions and the managers' names.

In [None]:
clb_games.isna().sum()

## <font color='green'>players</font> table
###### This table contains information about individual players, including personal details such as date of birth, country of origin, height and preferred foot, as well as the maximum market value reached and the agent name.

In [None]:
ply = pd.read_csv('../csv/players.csv')
ply.head()


In [None]:
ply.columns

In [None]:
ply.dtypes

In [None]:
ply['date_of_birth'] = pd.to_datetime(ply['date_of_birth'])
ply['contract_expiration_date'] = pd.to_datetime(ply['contract_expiration_date'])
ply.dtypes

###### The missing information is mostly personal in nature, such as place of birth, height, preferred foot, and market value. The most important field that is unavailable for many players is first_name. In this case, the name column only contains the value of last_name. However, since we have player_id as an identifier, we can still associate the player with the data even if the first_name is unknown.

In [None]:
ply.isna().sum()

In [None]:
ply[ply['first_name'].isnull()]
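
###### The claim that name falls back to last_name when first_name is missing, and that player_id can still identify the player, can be checked directly. The sketch below uses made-up players; `players_demo` and its values are hypothetical.

```python
import pandas as pd

# Made-up players: two of them have no first_name.
players_demo = pd.DataFrame({
    "player_id": [1, 2, 3],
    "first_name": ["Anna", None, None],
    "last_name": ["Alpha", "Beta", "Gamma"],
    "name": ["Anna Alpha", "Beta", "Gamma"],
})

# Rows without a first name.
no_first = players_demo[players_demo["first_name"].isna()]

# True only if name equals last_name in every such row.
all_match = bool((no_first["name"] == no_first["last_name"]).all())

# player_id is usable as a key only if it has no duplicates.
id_is_unique = bool(players_demo["player_id"].is_unique)

print(len(no_first), all_match, id_is_unique)
```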

In [None]:
ply.shape

## <font color='green'>appearances</font> table
###### In this table, we find information about players based on individual matches, such as the number of yellow/red cards, the number of goals, assists, and minutes played.

In [None]:
apr = pd.read_csv('../csv/appearances.csv')
apr.head()

In [None]:
apr.columns

In [None]:
apr.dtypes

In [None]:
apr['date'] = pd.to_datetime(apr['date'])

In [None]:
apr.dtypes

In [None]:
apr.isna().sum()

## <font color='green'>game_events</font> table
###### The game_events table contains detailed information about specific events that occur during a football match. Some examples of events include a player entering the field (substitution), an assist, a foul, a yellow or red card, and other relevant events during the match.

In [None]:
gme = pd.read_csv('../csv/game_events.csv')
gme.head()

In [None]:
gme.columns


In [None]:
gme.dtypes

In [None]:
gme['date'] = pd.to_datetime(gme['date'])

In [None]:
gme.dtypes

###### The missing information concerns the description, the player who came on as a substitute (probably when no substitution took place), and the player who provided the assist.


In [None]:
gme.isna().sum()
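
###### One way to test the substitution hypothesis is to count the missing values per event type. The column names `type` and `player_in_id` are assumptions made for this sketch, as are the values in `events_demo`; they should be replaced with the names shown by `gme.columns`.

```python
import pandas as pd

# Made-up events; the column names "type" and "player_in_id" are assumed.
events_demo = pd.DataFrame({
    "type": ["Goals", "Substitutions", "Cards", "Substitutions"],
    "player_in_id": [None, 10.0, None, 11.0],
})

# Number of rows with a missing incoming player, per event type.
missing_by_type = (
    events_demo["player_in_id"]
    .isna()
    .groupby(events_demo["type"])
    .sum()
)
print(missing_by_type)
```

###### If the hypothesis holds, the count for substitution events should be zero or close to it.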

In [None]:
gme.shape

## <font color='green'>player_valuations</font> table
###### This table contains the valuations of each player across different seasons, including the market value and the current team.

In [None]:
plv = pd.read_csv('../csv/player_valuations.csv')


In [None]:
plv.columns


In [None]:
plv.dtypes

In [None]:
plv.head()

In [None]:
plv.shape

In [None]:
plv['date'] = pd.to_datetime(plv['date'])
plv['dateweek'] = pd.to_datetime(plv['dateweek'])
plv['datetime'] = pd.to_datetime(plv['datetime'])

In [None]:
plv.tail()

In [None]:
plv['last_season'].nunique()

In [None]:
plv['datetime'].nunique()

In [None]:
plv['date'].nunique()

###### <font color="red">The number of distinct datetime values matches the number of distinct date values, so the two columns appear to carry the same information and it might make sense to delete one of them.</font>
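
###### Equal distinct counts do not by themselves prove that two columns are identical, so an element-wise comparison is a safer check before dropping one. A minimal sketch on made-up rows (`val_demo` and its values are hypothetical):

```python
import pandas as pd

# Made-up valuations: date and datetime hold the same timestamps.
val_demo = pd.DataFrame({
    "player_id": [1, 2],
    "date": pd.to_datetime(["2020-01-01", "2020-06-15"]),
    "datetime": pd.to_datetime(["2020-01-01 00:00:00", "2020-06-15 00:00:00"]),
})

# Row-by-row comparison: True only if every pair of values is equal.
same = bool((val_demo["date"] == val_demo["datetime"]).all())

# Drop the redundant column only when the check passes.
if same:
    val_demo = val_demo.drop(columns="datetime")

print(same, list(val_demo.columns))
```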

## <font color='green'>game_lineups</font> table
###### This table collects the role of each player of a particular team in a particular match, including who the captain is.

In [None]:
gml = pd.read_csv('../csv/game_lineups.csv')



In [None]:
gml.head()


In [None]:
gml.columns

In [None]:
gml.shape

In [None]:
gml.dtypes

In [None]:
gml.isna().sum()