
# DA&ML Basics - Assignment 1: Create an EDA (max. 10p)

The goal of this assignment is to create an **EDA** for the given dataset of football players.

**EDA** (*Exploratory Data Analysis*) is an important initial step in data analysis and is performed before more complex data preparation or modelling techniques.
EDA involves examining and visualizing the data in order to understand the characteristics of the dataset.

You can find the player dataset in the *data/* subdirectory:
[players.csv](data/players.csv).
Use this dataset in Assignment 1.

### Add your information

In [1]:
# TODO: Replace with your name or names
student_name = 'Juuso Leppänen'
student_email = 'AD1885@student.jamk.fi'

Some hints to follow in this assignment:

* Follow the Markdown structure in this document.
* Do not modify the Markdown lines of the assignment, or at least do not delete them. You can always add your own short comments.
* There is never a completely perfect answer to this kind of assignment, so use your imagination and creativity and try different techniques.
* Finally, before returning the assignment, run the Jupyter Notebook document from start to finish once. Check that the numbering of the output fields starts from one and continues increasing by one.

## Sub-assignments

Study Project 1 has **five** sub-assignments.

* 1.1: Statistics (max. 2p)
* 1.2: Visualization (max. 2p)
* 1.3: Grouping (max. 2p)
* 1.4: Correlations (max. 2p)
* 1.5: Modeling (max. 2p)

## Evaluation

For each sub-assignment you can get **0-2** points, according to a simple scale:

* **0** points: **Fail**
* **1** point: **Moderate**
* **2** points: **Good**

Note! The points given for each sub-assignment can also be an intermediate grade between 0 and 2, such as `0.5, 1-, 1+, 1.5, 2-`.

## Assignment 1.1: Statistics (max. 2p)

Sub-tasks in this assignment are:

* Load data to create a DataFrame.
* Print summary rows.
* Print data types of the dataset.
* Calculate different statistics for the data.
* Calculate summary statistics.


### Create a DataFrame

Load the sample data into a DataFrame from the file `data/players.csv`.

In [13]:
# TODO: load data to create a DataFrame and print a few lines
import pandas as pd
import numpy as np
import sklearn.linear_model

df = pd.read_csv("data/players.csv")
# df.head()


### Summary rows

Print summary rows of the DataFrame.

In [16]:
# TODO: Print summary rows
df.head()  # not displayed: a notebook cell renders only its last expression
df.tail()

| | Player Name | Age | Height | Weight | Position | Goals | Assists | Pass Accuracy | Shots on Target | Tackles | Interceptions |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 29 | Jussi Jääskeläinen | 24 | 185 | 83 | Goalkeeper | 0 | 0 | 68.0 | 0 | 0 | 0 |
| 30 | Niki Mäenpää | 36 | 188 | 88 | Goalkeeper | 0 | 0 | 73.0 | 0 | 0 | 0 |
| 31 | Paulus Arajuuri | 33 | 190 | 84 | Defender | 2 | 0 | 85.0 | 8 | 18 | 10 |
| 32 | Daniel O'Shaughnessy | 27 | 191 | 87 | Defender | 1 | 0 | 83.0 | 5 | 15 | 12 |
| 33 | Leo Väisänen | 24 | 192 | 88 | Defender | 0 | 0 | 88.0 | 0 | 12 | 18 |


### Data types

Print data types of the dataset.
* You can also print the index (rows, columns) information of the dataset.

In [21]:
# TODO: print data types and index information
print(df.dtypes)
print("\nShape:", df.shape)
print("\nColumns:", df.columns.tolist())

Player Name         object
Age                  int64
Height               int64
Weight               int64
Position            object
Goals                int64
Assists              int64
Pass Accuracy      float64
Shots on Target      int64
Tackles              int64
Interceptions        int64
dtype: object

Shape: (34, 11)

Columns: ['Player Name', 'Age', 'Height', 'Weight', 'Position', 'Goals', 'Assists', 'Pass Accuracy', 'Shots on Target', 'Tackles', 'Interceptions']


### Calculate Different statistics

Calculate different statistics (such as the minimum, mean, and median) from the data, using only the numerical columns.

In [5]:
# TODO: Calculate different statistics
num_df = df.select_dtypes(include = [np.number])

print("Min:\n", num_df.min(), "\n")
print("Mean:\n", num_df.mean(), "\n")
print("Median:\n", num_df.median(), "\n")
print("Max:\n", num_df.max(), "\n")

### Summary statistics

Calculate different summary statistics for the data (numerical columns only).

In [6]:
# TODO: Calculate summary statistics
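
One possible approach, sketched here, is `DataFrame.describe()`, which reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column in a single call. The `sample` frame is only a stand-in built from rows in the `df.tail()` output shown earlier, so that the snippet runs on its own; in the notebook the same call would be made on the full numeric DataFrame.

```python
import pandas as pd

# Stand-in data: values copied from the df.tail() output shown earlier.
sample = pd.DataFrame({
    "Age": [24, 36, 33, 27, 24],
    "Height": [185, 188, 190, 191, 192],
    "Weight": [83, 88, 84, 87, 88],
    "Position": ["Goalkeeper", "Goalkeeper", "Defender", "Defender", "Defender"],
    "Pass Accuracy": [68.0, 73.0, 85.0, 83.0, 88.0],
})

# Keep the numeric columns only, then compute the summary table.
summary = sample.select_dtypes(include="number").describe()
print(summary.round(2))
```

The text column `Position` is dropped by `select_dtypes`, so every row of the summary refers to a numeric feature.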

## Assignment 1.2: Visualization (max. 2p)

Sub-tasks in this assignment are:

* Draw histograms
* Draw bar plots
* Draw scatter plots
* Draw some other diagrams

### Draw Histograms

Draw histograms of the data. Select only some of the columns for the histograms and for the other plots in the following sub-tasks.

In [7]:
# TODO: Draw histograms
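
A minimal sketch of one way to draw histograms for a few selected columns with Matplotlib. The choice of columns and the number of bins are assumptions made for illustration, and the `sample` frame only holds the rows from the `df.tail()` output shown earlier.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data: values copied from the df.tail() output shown earlier.
sample = pd.DataFrame({
    "Age": [24, 36, 33, 27, 24],
    "Height": [185, 188, 190, 191, 192],
    "Weight": [83, 88, 84, 87, 88],
})

columns = ["Age", "Height", "Weight"]  # assumed selection of columns
fig, axes = plt.subplots(1, len(columns), figsize=(12, 3))
for ax, column in zip(axes, columns):
    # One histogram per selected column, each on its own axis.
    ax.hist(sample[column], bins=5, edgecolor="black")
    ax.set_title(column)
    ax.set_xlabel(column)
    ax.set_ylabel("Players")
fig.tight_layout()
```

With the full dataset a larger number of bins may show the shape of each distribution more clearly.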

### Draw Bar Plots

In [8]:
# TODO: Bar plots

### Draw Scatter plots

In [9]:
# TODO: Scatter Plots
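
As an illustration, a scatter plot can show how two numeric columns relate while colour separates a categorical column. The sketch below plots `Height` against `Weight` per `Position`; the column pair is an assumed choice, and the `sample` frame only holds the rows from the `df.tail()` output shown earlier.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data: values copied from the df.tail() output shown earlier.
sample = pd.DataFrame({
    "Height": [185, 188, 190, 191, 192],
    "Weight": [83, 88, 84, 87, 88],
    "Position": ["Goalkeeper", "Goalkeeper", "Defender", "Defender", "Defender"],
})

fig, ax = plt.subplots()
# Draw one scatter series per position so that each gets its own colour.
for position, group in sample.groupby("Position"):
    ax.scatter(group["Height"], group["Weight"], label=position)
ax.set_xlabel("Height")
ax.set_ylabel("Weight")
ax.set_title("Height vs. weight by position")
ax.legend(title="Position")
```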

### Draw Other Useful Diagrams

Think of the different diagram types and then select some suitable diagrams for this dataset.

In [10]:
# TODO: Other useful diagram

## Assignment 1.3: Grouping (max. 2p)

Sub-tasks in this assignment are:

* Group the data by player's position and calculate the average performance metrics
* Are there other grouping possibilities in the data?
* Visualize results

### Group data by player's position

Group the data by player's position and calculate the average performance metrics

In [11]:
# TODO: Group the data by player's position and calculate means
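
A minimal sketch of grouping by position with `groupby` and averaging a few performance columns. The list of metrics is an assumption, and the `sample` frame only holds the rows from the `df.tail()` output shown earlier, so just two positions appear.

```python
import pandas as pd

# Stand-in data: values copied from the df.tail() output shown earlier.
sample = pd.DataFrame({
    "Position": ["Goalkeeper", "Goalkeeper", "Defender", "Defender", "Defender"],
    "Goals": [0, 0, 2, 1, 0],
    "Pass Accuracy": [68.0, 73.0, 85.0, 83.0, 88.0],
    "Tackles": [0, 0, 18, 15, 12],
    "Interceptions": [0, 0, 10, 12, 18],
})

metrics = ["Goals", "Pass Accuracy", "Tackles", "Interceptions"]  # assumed selection
# One row per position, one column per averaged metric.
position_means = sample.groupby("Position")[metrics].mean()
print(position_means.round(2))
```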


### Group Data Based on Other Columns

Are there other grouping possibilities in the data?

In [12]:
# TODO: Group data based on other columns
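
One other possibility, sketched below, is to turn a numeric column into categories with `pd.cut` and group by those. The age boundaries and labels are hypothetical choices for illustration, and the `sample` frame only holds the rows from the `df.tail()` output shown earlier.

```python
import pandas as pd

# Stand-in data: values copied from the df.tail() output shown earlier.
sample = pd.DataFrame({
    "Age": [24, 36, 33, 27, 24],
    "Height": [185, 188, 190, 191, 192],
    "Pass Accuracy": [68.0, 73.0, 85.0, 83.0, 88.0],
})

# Hypothetical age bands: (0, 25], (25, 30], (30, 100].
age_group = pd.cut(
    sample["Age"],
    bins=[0, 25, 30, 100],
    labels=["25 or under", "26-30", "over 30"],
)

# observed=True keeps only the age bands that actually occur in the data.
by_age = sample.groupby(age_group, observed=True)[["Height", "Pass Accuracy"]].mean()
print(by_age)
```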


### Visualize grouped data

Visualize results after grouping in the previous sub-task.

In [13]:
# TODO: Visualize grouped data
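
Grouped averages can be drawn directly with the pandas plotting wrapper around Matplotlib. The sketch below shows a grouped bar chart of two defensive metrics per position; the metric selection is an assumption, and the `sample` frame only holds the rows from the `df.tail()` output shown earlier.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import pandas as pd

# Stand-in data: values copied from the df.tail() output shown earlier.
sample = pd.DataFrame({
    "Position": ["Goalkeeper", "Goalkeeper", "Defender", "Defender", "Defender"],
    "Tackles": [0, 0, 18, 15, 12],
    "Interceptions": [0, 0, 10, 12, 18],
})

position_means = sample.groupby("Position")[["Tackles", "Interceptions"]].mean()

# One group of bars per position, one bar per metric.
ax = position_means.plot(kind="bar", rot=0)
ax.set_ylabel("Average per player")
ax.set_title("Average defensive metrics by position")
```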


## Assignment 1.4: Correlations (max. 2p)

Sub-tasks in this assignment are:

* Compute correlation matrix.
* Calculate lower and upper limits for each numeric column.
* Calculate the 5th and 95th percentiles.
* Identify potential outliers.
* Visualize correlations.

### Correlation Matrix

In [14]:
# TODO: Compute correlation matrix
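
A minimal sketch of computing the correlation matrix with `DataFrame.corr()`, which uses the Pearson coefficient by default. The `sample` frame only holds the rows from the `df.tail()` output shown earlier, so the coefficients printed here say nothing reliable about the full dataset.

```python
import pandas as pd

# Stand-in data: values copied from the df.tail() output shown earlier.
sample = pd.DataFrame({
    "Age": [24, 36, 33, 27, 24],
    "Height": [185, 188, 190, 191, 192],
    "Weight": [83, 88, 84, 87, 88],
    "Pass Accuracy": [68.0, 73.0, 85.0, 83.0, 88.0],
    "Tackles": [0, 0, 18, 15, 12],
})

# Pairwise Pearson correlations between all numeric columns.
corr = sample.corr()
print(corr.round(2))
```

A column with a single constant value has no defined correlation and shows up as `NaN`, so such columns are best left out.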


### Calculate Lower and Upper Limits for Columns

Calculate limits for numerical columns only.


In [15]:
# TODO: Calculate lower and upper limits
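
The task does not say which limits are meant. A common convention, assumed here, is the interquartile-range (IQR) rule: lower limit = Q1 - 1.5 * IQR and upper limit = Q3 + 1.5 * IQR. The sketch applies it to a stand-in `sample` frame holding the rows from the `df.tail()` output shown earlier.

```python
import pandas as pd

# Stand-in data: values copied from the df.tail() output shown earlier.
sample = pd.DataFrame({
    "Age": [24, 36, 33, 27, 24],
    "Height": [185, 188, 190, 191, 192],
    "Weight": [83, 88, 84, 87, 88],
})

q1 = sample.quantile(0.25)  # first quartile of each column
q3 = sample.quantile(0.75)  # third quartile of each column
iqr = q3 - q1               # interquartile range of each column

# Assumed convention: fences at 1.5 * IQR beyond the quartiles.
limits = pd.DataFrame({"lower": q1 - 1.5 * iqr, "upper": q3 + 1.5 * iqr})
print(limits)
```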


### Calculate the 5th and 95th percentiles

* Calculate the 5th and 95th percentiles for multiple features of the DataFrame.
* Try also to visualize the results per column.

In [16]:
# TODO: Calculate the 5th and 95th percentiles with visualization
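
`DataFrame.quantile` accepts a list of quantiles, so both percentiles can be computed for several columns at once. The sketch below uses the default linear interpolation and a stand-in `sample` frame holding the rows from the `df.tail()` output shown earlier.

```python
import pandas as pd

# Stand-in data: values copied from the df.tail() output shown earlier.
sample = pd.DataFrame({
    "Age": [24, 36, 33, 27, 24],
    "Height": [185, 188, 190, 191, 192],
    "Weight": [83, 88, 84, 87, 88],
})

# Rows are the requested quantiles, columns are the features.
percentiles = sample.quantile([0.05, 0.95])
percentiles.index = ["5th", "95th"]
print(percentiles)
```

Transposing the result (`percentiles.T`) gives one row per feature, which is a convenient shape for a per-column bar chart.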


### Identify Outliers

Identify potential outliers in the dataset.


In [17]:
# TODO: identify potential outliers
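
One way to flag potential outliers, sketched here under the assumption that the IQR rule is acceptable for this data, is to mark every value lying more than 1.5 * IQR outside the quartiles. The helper name `iqr_outliers` and the `toy` values are hypothetical; the values are made up for illustration and are not taken from `players.csv`.

```python
import pandas as pd


def iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return a boolean frame that is True where a value lies outside the IQR fences."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    return (frame < q1 - k * iqr) | (frame > q3 + k * iqr)


# Hypothetical values (not from players.csv): one player scores far more than the rest.
toy = pd.DataFrame({
    "Goals": [0, 1, 2, 1, 0, 2, 15],
    "Age": [24, 36, 33, 27, 24, 29, 30],
})

mask = iqr_outliers(toy)
# Show only the rows that contain at least one flagged value.
print(toy[mask.any(axis=1)])
```

A flagged value is only a candidate: in football data a very high goal count may simply belong to a striker, so each case deserves a look before anything is removed.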


### Visualize Correlations

Visualize correlations to understand the relationships between features (columns).


In [18]:
# TODO: Visualize correlation results
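
A heatmap is a common way to read a correlation matrix at a glance. The sketch below draws one with plain Matplotlib (`imshow`); the colour map and the fixed -1 to 1 colour range are assumed choices, and the `sample` frame only holds the rows from the `df.tail()` output shown earlier.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data: values copied from the df.tail() output shown earlier.
sample = pd.DataFrame({
    "Age": [24, 36, 33, 27, 24],
    "Height": [185, 188, 190, 191, 192],
    "Pass Accuracy": [68.0, 73.0, 85.0, 83.0, 88.0],
    "Tackles": [0, 0, 18, 15, 12],
})

corr = sample.corr()

fig, ax = plt.subplots(figsize=(5, 4))
# Fix the colour range to [-1, 1] so that colours are comparable between plots.
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax, label="Pearson correlation")
fig.tight_layout()
```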


## Assignment 1.5: Modeling (max. 2p)

Sub-tasks in this assignment are:

* Create a machine learning (ML) model with a selected classification or regression method
* Train the ML model on the training data
* Make predictions that are based on the testing data
* Calculate metrics
* Visualize results

Create a model with a selected classification or regression method to predict a certain feature.

Possible options for modelling are:
* Classification: Decision Tree, Random Forests, kNN
* Regression: Linear Regression, Decision Tree Regressor

Please select only **one ML method** for this assignment.

### Create a Model 

Create a machine learning model with a selected classification or regression method.
Select only one machine learning method in this sub-assignment.

In [19]:
# TODO: Create a model

### Train the Model

Train the model with training data only. Select 20-25 percent of the data for testing.

In [20]:
# TODO: Train the model
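
A minimal sketch of the split-and-train step, assuming Linear Regression is the selected method and `Weight` is predicted from `Height`; both choices are only illustrative. The five rows shown earlier in the notebook are too few to split, so the snippet generates a synthetic 34-row stand-in with made-up values instead of reading `players.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in (made-up values, not players.csv): 34 players whose
# weight depends roughly linearly on height, plus random noise.
rng = np.random.default_rng(0)
height = rng.normal(185, 6, size=34)
weight = 0.9 * height - 85 + rng.normal(0, 3, size=34)
players = pd.DataFrame({"Height": height, "Weight": weight})

X = players[["Height"]]  # feature matrix (2-D)
y = players["Weight"]    # target vector (1-D)

# Hold out 25 percent of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit on the training rows only, so the test rows stay unseen.
model = LinearRegression()
model.fit(X_train, y_train)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
```

Predictions for the held-out rows can then be obtained with `model.predict(X_test)`.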

### Make Predictions

In this task, use the previously created model to make predictions on the testing dataset.

In [21]:
# TODO: Make predictions

### Calculate Other Metrics

Calculate suitable metrics for your model, such as errors, accuracy, precision, F1-score, Mean Squared Error, or R^2.

In [22]:
# TODO: Calculate other metrics
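
For a regression model, `sklearn.metrics` provides the usual error measures. The sketch below computes three of them on hypothetical true and predicted values, which are made up for illustration and are not results from any model in this notebook. For a classification model, functions such as `accuracy_score`, `precision_score`, and `f1_score` from the same module would be the counterparts.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical values for illustration only (not model output).
y_true = [80.0, 85.0, 90.0, 75.0]
y_pred = [82.0, 84.0, 88.0, 77.0]

mae = mean_absolute_error(y_true, y_pred)  # average absolute error
mse = mean_squared_error(y_true, y_pred)   # average squared error
r2 = r2_score(y_true, y_pred)              # share of variance explained

print(f"MAE: {mae:.3f}  MSE: {mse:.3f}  R^2: {r2:.3f}")
```

In the notebook, `y_true` would be the test targets and `y_pred` the predictions made on the test features.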

### Visualize Modelling Results

Visualize results of classification or regression models with suitable diagrams.

In [23]:
# TODO: Visualize results