# Attributes of Winning Teams
## Exploring the Data of the EU Soccer League

The goal of this analysis is to see whether there are common attributes among the top-performing teams in the European soccer leagues. The data is from a Kaggle post, [European Soccer Database](https://www.kaggle.com/hugomathien/soccer) by Hugomathien.

The process of discovery will be the following:
1. Create SQL queries to gather/organize the original data
    * Which teams the players belong to (most recent year)
    * Join the player attributes to those teams (most recent performance scores)
    * Which teams placed at the top of their league (most recent year)
2. Read the SQL Queries into Pandas for further manipulation
3. Provide exploratory visualization showing the combined attributes of each top-performing team
4. Judge whether there are common attributes that matter most to success

## Setting Up Environment

First I want to make sure I have my environments set up. I have a workstation with three monitors, but I am not always there, so I want to be able to do analysis on the fly on my Chromebook. To do this I have set up a cloud instance with [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html). I started the project on my workstation and then created my [GitHub Repo](https://github.com/GarrettMarkScott/European_Soccer_Exploring). In the process I quickly learned that the 300 MB SQLite database was too large for a simple git push command. This forced me to create a .gitignore file and list the database in it. I was then able to clone the git repo into my Jupyter cloud instance and simply upload the SQLite database into the same directory there. Now I can push/pull from both machines without large file transfers.

I used the code below to check that my environment was working.

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import sqlite3

conn = sqlite3.connect('soccer_data.sqlite')
countries = pd.read_sql_query("SELECT * FROM Country;", conn)
countries

Unnamed: 0,id,name
0,1,Belgium
1,1729,England
2,4769,France
3,7809,Germany
4,10257,Italy
5,13274,Netherlands
6,15722,Poland
7,17642,Portugal
8,19694,Scotland
9,21518,Spain
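
Beyond reading one table, a quick way to confirm the database loaded correctly is to list every table it contains. The snippet below is a minimal sketch of that check. The helper name `list_tables` is hypothetical (it is not part of the original notebook), and it runs against a small in-memory stand-in database rather than `soccer_data.sqlite` so it can be executed anywhere.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in a SQLite database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in database for illustration only; in the notebook this would be
# sqlite3.connect('soccer_data.sqlite') instead.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Country (id INTEGER PRIMARY KEY, name TEXT);")
demo_conn.execute("CREATE TABLE Team (id INTEGER PRIMARY KEY, team_long_name TEXT);")

tables = list_tables(demo_conn)
print(tables)
```

Calling `list_tables(conn)` on the real connection should show whatever tables the Kaggle database ships with, which is a handy sanity check after uploading the file to a new machine.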


## Matching Players to Teams

Next, it was time to create the SQL queries to organize the raw data a bit so it was easier to work with. I was able to borrow my girlfriend's Mac and load the SQLite database into DB Browser for SQLite for quick SQL queries (since the Chromebook can't run programs).

I spent quite a bit of time trying to figure out how to match the players to teams. As seen in the screenshot below, there was no common key between player and team. My assumption is that this is due to players switching teams over their careers.

![Table Screenshot](https://i.imgur.com/KCpqgS5.jpg)

I checked the *player_fifa_api_id*, but that had 11,060 DISTINCT rows. Instead, I was able to create the query below, which uses the Match table to see which players were associated with each match. Because substitutions happen throughout a match, I assume that the 11 players listed are the starting lineup for each team. The query covers home matches dated between 2015-05-01 and 2016-05-25, roughly the 2015-2016 season.

This turned out to be a pretty big challenge because I did not realize that SQLite does not have a formal date type; the dates in this database are stored as text in a *time string* format. If anyone has any information on the best/cleanest way to deal with this, I have made a [Stack Overflow Post](https://stackoverflow.com/questions/56202599/converting-timestrings-to-datetime-db-browser).
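
To illustrate the date issue, here is a minimal sketch using a tiny in-memory table. The table contents and variable names are made up for the example. It shows that the values are stored as text, that SQLite's `DATE()` function reduces them to `YYYY-MM-DD` strings that can be compared against string literals, and that the text can be parsed into real datetimes on the Python side.

```python
import sqlite3
from datetime import datetime

# Tiny stand-in for the Match table; only the date column matters here.
date_demo = sqlite3.connect(":memory:")
date_demo.execute("CREATE TABLE Match (id INTEGER PRIMARY KEY, date TEXT);")
date_demo.executemany(
    "INSERT INTO Match (date) VALUES (?);",
    [("2015-04-30 00:00:00",), ("2015-05-10 00:00:00",), ("2016-02-13 00:00:00",)],
)

# typeof() reports the storage class SQLite actually used for each value.
storage = date_demo.execute("SELECT DISTINCT typeof(date) FROM Match;").fetchall()

# DATE() returns 'YYYY-MM-DD' text, so BETWEEN is a plain string comparison
# that happens to sort chronologically for this format.
in_range = date_demo.execute(
    "SELECT date FROM Match "
    "WHERE DATE(date) BETWEEN '2015-05-01' AND '2016-05-25' "
    "ORDER BY date;"
).fetchall()

# Convert the text values into Python datetime objects after the query.
parsed = [datetime.strptime(d, "%Y-%m-%d %H:%M:%S") for (d,) in in_range]
print(storage, parsed)
```

When reading into Pandas, passing `parse_dates=['date']` to `pd.read_sql_query` or calling `pd.to_datetime` on the column afterwards should give a proper datetime column as well.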

In [12]:
# Home matches from the most recent season and their starting players (as player IDs)
matches = pd.read_sql_query('''SELECT t.team_long_name,
                                      m.home_player_1, m.home_player_2, m.home_player_3, m.home_player_4,
                                      m.home_player_5, m.home_player_6, m.home_player_7, m.home_player_8,
                                      m.home_player_9, m.home_player_10, m.home_player_11, m.date
                                 FROM Team t
                                 JOIN Match m ON t.team_api_id = m.home_team_api_id
                                WHERE DATE(m.date) BETWEEN '2015-05-01' AND '2016-05-25'
                                ORDER BY t.team_long_name;''', conn)
matches

Unnamed: 0,team_long_name,home_player_1,home_player_2,home_player_3,home_player_4,home_player_5,home_player_6,home_player_7,home_player_8,home_player_9,home_player_10,home_player_11,date
0,1. FC Köln,212815.0,36395.0,36934.0,264221.0,307210.0,36086.0,167589.0,127945.0,459493.0,166449.0,196366.0,2015-05-10 00:00:00
1,1. FC Köln,212815.0,36395.0,36934.0,264221.0,22824.0,450976.0,36086.0,127945.0,459493.0,196366.0,166449.0,2015-05-23 00:00:00
2,1. FC Köln,212815.0,127945.0,231753.0,303800.0,307210.0,166449.0,36086.0,167589.0,459493.0,177941.0,71605.0,2015-10-31 00:00:00
3,1. FC Köln,212815.0,127945.0,36934.0,303800.0,307210.0,166449.0,36086.0,167589.0,260470.0,177941.0,71605.0,2015-11-21 00:00:00
4,1. FC Köln,212815.0,127945.0,36934.0,303800.0,307210.0,238438.0,36086.0,167589.0,260470.0,177941.0,71605.0,2015-12-05 00:00:00
5,1. FC Köln,212815.0,212377.0,231753.0,36934.0,303800.0,307210.0,450976.0,36086.0,22824.0,127945.0,238438.0,2015-12-19 00:00:00
6,1. FC Köln,212815.0,212377.0,36934.0,303800.0,307210.0,167589.0,36086.0,127945.0,166449.0,260470.0,71605.0,2016-01-23 00:00:00
7,1. FC Köln,212815.0,212377.0,231753.0,303800.0,307210.0,127945.0,167589.0,36086.0,260470.0,238438.0,71605.0,2015-08-22 00:00:00
8,1. FC Köln,212815.0,36934.0,28435.0,303800.0,127945.0,450976.0,307210.0,259439.0,238438.0,260470.0,71605.0,2016-02-13 00:00:00
9,1. FC Köln,212815.0,36934.0,28435.0,303800.0,127945.0,36086.0,307210.0,259439.0,450976.0,71605.0,260470.0,2016-02-26 00:00:00


The next step of my process is to collect all the player IDs for each team and store them as a list or dictionary. Then we will have a list of all the players that are considered "starters" for each team. Following that, we can begin investigating their attributes!
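
As a rough sketch of that next step (not the final implementation), the function below builds a dictionary mapping each team name to the set of player IDs that appeared in any of its lineups. The name `collect_starters` is hypothetical, and the function assumes each row is a tuple of the team name followed by the player ID columns, with missing players given as `None` or `NaN`. The two sample rows are copied from the first two lines of the query output above.

```python
import math
from collections import defaultdict


def collect_starters(rows):
    """Map each team name to the set of player IDs seen in its lineups.

    Each row is (team_name, player_1, ..., player_n); missing players
    (None or NaN) are skipped.
    """
    starters = defaultdict(set)
    for team, *players in rows:
        for pid in players:
            if pid is None or (isinstance(pid, float) and math.isnan(pid)):
                continue
            starters[team].add(int(pid))
    return dict(starters)


# First two rows of the query output above, without the date column.
sample_rows = [
    ("1. FC Köln", 212815.0, 36395.0, 36934.0, 264221.0, 307210.0, 36086.0,
     167589.0, 127945.0, 459493.0, 166449.0, 196366.0),
    ("1. FC Köln", 212815.0, 36395.0, 36934.0, 264221.0, 22824.0, 450976.0,
     36086.0, 127945.0, 459493.0, 196366.0, 166449.0),
]

starters = collect_starters(sample_rows)
print(len(starters["1. FC Köln"]))
```

With the real DataFrame, the rows could come from `matches.drop(columns='date').itertuples(index=False, name=None)`. Using a set means a player who starts in many matches is only counted once per team.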