# Project 5 - Machine Learning Model in Python

Artificial intelligence has been with us for a long time, but it has never developed at such a dizzying pace as it does today. Its models accompany us at every step, from phone assistants and online stores that recommend applications or products to complex language models such as ChatGPT or Bard.

In this project, I would like to use a popular set of tools in Python, one of the most popular programming languages, not only to better understand the relationships in the data used in previous projects, but also to build a model that predicts the scores awarded to gymnastics competitors!

Tools used:

- Jupyter notebook

- Python 3.11.2

## STEP 1 - Prepare the environment in which we will work!

In [93]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

OK, the environment is ready. Now we need data!

In [94]:
df_junior_qualification = pd.read_csv("E:/Gymnastics on GitHub!/Gymnastics-on-GitHub/Project 5 - ML model in Python/Data used in this project/2nd_Junior_World_Championship_2023/Qualification.csv", sep=';', decimal=",")
df_junior_AA = pd.read_csv("E:/Gymnastics on GitHub!/Gymnastics-on-GitHub/Project 5 - ML model in Python/Data used in this project/2nd_Junior_World_Championship_2023/AA_Final.csv", sep=';', decimal=",")
df_junior_FX = pd.read_csv("E:/Gymnastics on GitHub!/Gymnastics-on-GitHub/Project 5 - ML model in Python/Data used in this project/2nd_Junior_World_Championship_2023/FX_Final.csv", sep=';', decimal=",")
df_junior_PH = pd.read_csv("E:/Gymnastics on GitHub!/Gymnastics-on-GitHub/Project 5 - ML model in Python/Data used in this project/2nd_Junior_World_Championship_2023/PH_Final.csv", sep=';', decimal=",")
df_junior_SR = pd.read_csv("E:/Gymnastics on GitHub!/Gymnastics-on-GitHub/Project 5 - ML model in Python/Data used in this project/2nd_Junior_World_Championship_2023/SR_Final.csv", sep=';', decimal=",")
df_junior_VT = pd.read_csv("E:/Gymnastics on GitHub!/Gymnastics-on-GitHub/Project 5 - ML model in Python/Data used in this project/2nd_Junior_World_Championship_2023/VT_Final.csv", sep=';', decimal=",")
df_junior_PB = pd.read_csv("E:/Gymnastics on GitHub!/Gymnastics-on-GitHub/Project 5 - ML model in Python/Data used in this project/2nd_Junior_World_Championship_2023/PB_Final.csv", sep=';', decimal=",")
df_junior_HB = pd.read_csv("E:/Gymnastics on GitHub!/Gymnastics-on-GitHub/Project 5 - ML model in Python/Data used in this project/2nd_Junior_World_Championship_2023/HB_Final.csv", sep=';', decimal=",")
df_senior_qualification = pd.read_csv("E:/Gymnastics on GitHub!/Gymnastics-on-GitHub/Project 5 - ML model in Python/Data used in this project/51_FIG_World_Championship_2022/Qualification.csv", sep=';', decimal=",")
df_senior_AA = pd.read_csv("E:/Gymnastics on GitHub!/Gymnastics-on-GitHub/Project 5 - ML model in Python/Data used in this project/51_FIG_World_Championship_2022/AA_Final.csv", sep=';', decimal=",")
df_senior_FX = pd.read_csv("E:/Gymnastics on GitHub!/Gymnastics-on-GitHub/Project 5 - ML model in Python/Data used in this project/51_FIG_World_Championship_2022/FX_Final.csv", sep=';', decimal=",")
df_senior_PH = pd.read_csv("E:/Gymnastics on GitHub!/Gymnastics-on-GitHub/Project 5 - ML model in Python/Data used in this project/51_FIG_World_Championship_2022/PH_Final.csv", sep=';', decimal=",")
df_senior_SR = pd.read_csv("E:/Gymnastics on GitHub!/Gymnastics-on-GitHub/Project 5 - ML model in Python/Data used in this project/51_FIG_World_Championship_2022/SR_Final.csv", sep=';', decimal=",")
df_senior_VT = pd.read_csv("E:/Gymnastics on GitHub!/Gymnastics-on-GitHub/Project 5 - ML model in Python/Data used in this project/51_FIG_World_Championship_2022/VT_Final.csv", sep=';', decimal=",")
df_senior_PB = pd.read_csv("E:/Gymnastics on GitHub!/Gymnastics-on-GitHub/Project 5 - ML model in Python/Data used in this project/51_FIG_World_Championship_2022/PB_Final.csv", sep=';', decimal=",")
df_senior_HB = pd.read_csv("E:/Gymnastics on GitHub!/Gymnastics-on-GitHub/Project 5 - ML model in Python/Data used in this project/51_FIG_World_Championship_2022/HB_Final.csv", sep=';', decimal=",")

Because the data for Polish competitors is incomplete (it lacks a penalty column), I decided to use results from international competitions for this project.
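The files loaded above use a semicolon as the field separator and a comma as the decimal mark, which is why every `read_csv` call passes `sep=';'` and `decimal=","`. The following is a minimal sketch of what those two parameters do, run on a small in-memory table instead of the real files. The column names and values here are made up for illustration and are not taken from the competition data.

```python
import io

import pandas as pd

# Made-up stand-in for one of the result files (illustrative values only).
raw = (
    "Rk;Name;NOC;D_Score;E_Score\n"
    "1;A. Example;AAA;5,6;8,733\n"
    "2;B. Example;BBB;5,2;8,5\n"
)

# sep=";" splits fields on semicolons; decimal="," parses "8,733" as 8.733.
toy = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")

# The score columns should come out as floats rather than as text.
print(toy.dtypes)
```

Without `decimal=","`, values such as `8,733` would be read as strings, and the numeric analysis later in the notebook would not work on them.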

## STEP 2 - Data cleaning

Okay, we have the data. Before we start working with it, let's look at the tables and their column names and check whether the data has been loaded correctly.

In [95]:
# List of all data frames loaded above

all_data_frames = [df_junior_qualification, df_junior_AA, df_junior_FX, df_junior_PH, df_junior_SR, df_junior_VT, df_junior_PB, df_junior_HB,
                   df_senior_qualification, df_senior_AA, df_senior_FX, df_senior_PH, df_senior_SR, df_senior_VT, df_senior_PB, df_senior_HB]

# Look at the column names and data types
# (info() prints its summary itself and returns None, so it is not wrapped in print)

for frame in all_data_frames:
    frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 28 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Rk              120 non-null    int64  
 1   Name            120 non-null    object 
 2   NOC             120 non-null    object 
 3   FX_D_Score      118 non-null    float64
 4   FX_E_Score      120 non-null    float64
 5   FX_Total_Score  120 non-null    float64
 6   FX_Penalty      120 non-null    float64
 7   PH_D_Score      119 non-null    float64
 8   PH_E_Score      120 non-null    float64
 9   PH_Total_Score  120 non-null    float64
 10  PH_Penalty      120 non-null    float64
 11  SR_D_Score      120 non-null    float64
 12  SR_E_Score      120 non-null    float64
 13  SR_Total_Score  120 non-null    float64
 14  SR_Penalty      120 non-null    float64
 15  VT_D_Score      120 non-null    float64
 16  VT_E_Score      120 non-null    float64
 17  VT_Total_Score  120 non-null    flo

Looking at the output, we can see two important problems:

1. __Column names__ - they are not uniform across the tables.

2. __Table format__ - the tables differ in layout, which would get in the way of examining the relationships between individual scores, so the layout needs to be unified first.

Let's start with the table format. To study the relationships between scores, we can drop the apparatus-by-apparatus layout typical of gymnastics results and reduce every table to the same columns: _Rank, Name, Country, DScore, EScore, Penalty, Total Score_
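One way to bring differently named columns to this common set is an explicit rename map followed by a column selection, which also fixes the column order. This is only a sketch on a toy frame: the source headers (`Rk`, `NOC`, `D_Score`, ...) and all values are assumptions for illustration, and the notebook itself assigns the new names by position instead.

```python
import pandas as pd

# Toy single-apparatus table with hypothetical source headers and made-up values.
toy = pd.DataFrame({
    "Rk": [1, 2],
    "Name": ["A. Example", "B. Example"],
    "NOC": ["AAA", "BBB"],
    "D_Score": [5.6, 5.2],
    "E_Score": [8.7, 8.5],
    "Total_Score": [14.3, 13.6],
    "Penalty": [0.0, 0.1],
})

# Map each source header to the common name; "Name" and "Penalty" already match.
rename_map = {
    "Rk": "Rank",
    "NOC": "Country",
    "D_Score": "DScore",
    "E_Score": "EScore",
    "Total_Score": "Total Score",
}
target_cols = ["Rank", "Name", "Country", "DScore", "EScore", "Penalty", "Total Score"]

# Rename by name, then select in the target order.
tidy = toy.rename(columns=rename_map)[target_cols]
print(tidy.columns.tolist())
```

Renaming by name rather than by position means that a table whose `Penalty` and `Total Score` columns arrive in a different order still ends up with the values under the correct headers.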


In [96]:
# Lists of frames that share the same format

apparatus_data_frame = [df_junior_FX, df_junior_PH, df_junior_SR, df_junior_PB, df_junior_HB,
                        df_senior_FX, df_senior_PH, df_senior_SR, df_senior_PB, df_senior_HB]

aa_data_frame = [df_junior_qualification, df_junior_AA,
                 df_senior_qualification, df_senior_AA]

vt_data_frame = [df_senior_VT, df_junior_VT]

# List with the correct column names

col_names = ['Rank', 'Name', 'Country', 'DScore', 'EScore', 'Penalty', 'Total Score']

# Give the single-apparatus frames their final column names

for frame in apparatus_data_frame:
    frame.columns = col_names

# Look at the results

for frame in apparatus_data_frame:
    print(frame.columns)

Index(['Rank', 'Name', 'Country', 'DScore', 'EScore', 'Penalty',
       'Total Score'],
      dtype='object')
Index(['Rank', 'Name', 'Country', 'DScore', 'EScore', 'Penalty',
       'Total Score'],
      dtype='object')
Index(['Rank', 'Name', 'Country', 'DScore', 'EScore', 'Penalty',
       'Total Score'],
      dtype='object')
Index(['Rank', 'Name', 'Country', 'DScore', 'EScore', 'Penalty',
       'Total Score'],
      dtype='object')
Index(['Rank', 'Name', 'Country', 'DScore', 'EScore', 'Penalty',
       'Total Score'],
      dtype='object')
Index(['Rank', 'Name', 'Country', 'DScore', 'EScore', 'Penalty',
       'Total Score'],
      dtype='object')
Index(['Rank', 'Name', 'Country', 'DScore', 'EScore', 'Penalty',
       'Total Score'],
      dtype='object')
Index(['Rank', 'Name', 'Country', 'DScore', 'EScore', 'Penalty',
       'Total Score'],
      dtype='object')
Index(['Rank', 'Name', 'Country', 'DScore', 'EScore', 'Penalty',
       'Total Score'],
      dtype='object')
Index(['Ra

OK, all the tables that only needed renaming are ready. Next, we will deal with those whose layout needs to be changed, starting with the vault results.

In [97]:
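# Drop the last column, which will not be needed in the further part of the calculations (see the end of the "Data cleaning" chapter).
# Reassigning a loop variable would leave the original frames unchanged, so each frame is reassigned explicitly.

df_junior_VT = df_junior_VT.iloc[:, :-1]
df_senior_VT = df_senior_VT.iloc[:, :-1]

# Transform vault table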

vt_junior1 = df_junior_VT[["Rank","Name","NOC","VT1_Dscore","VT1_Escore","VT1_Penalty","VT1_Total_Score"]]
vt_junior1.columns = col_names

vt_junior2 = df_junior_VT[["Rank","Name","NOC","VT2_Dscore","VT2_Escore","VT2_Penalty","VT2_Total_Score"]]
vt_junior2.columns = col_names

vt_senior1 = df_senior_VT[["Rank","Name","NOC","VT1_Dscore","VT1_Escore","VT1_Penalty","VT1_Total_Score"]]
vt_senior1.columns = col_names

vt_senior2 = df_senior_VT[["Rank","Name","NOC","VT2_Dscore","VT2_Escore","VT2_Penalty","VT2_Total_Score"]]
vt_senior2.columns = col_names

# Final form of vault table

df_junior_VT = pd.concat([vt_junior1,vt_junior2])
df_senior_VT = pd.concat([vt_senior1,vt_senior2])
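
The vault reshaping above can be illustrated on a toy table. The sketch below follows the same idea (select the columns of each vault attempt, give them the common names, stack the two parts), but writes it as a loop. All values are made up, and the trailing `Final_Score` column is a hypothetical stand-in for the extra column that is not carried over.

```python
import pandas as pd

# Toy vault table: two attempts per gymnast, made-up values.
toy_vt = pd.DataFrame({
    "Rank": [1, 2],
    "Name": ["A. Example", "B. Example"],
    "NOC": ["AAA", "BBB"],
    "VT1_Dscore": [5.6, 5.2],
    "VT1_Escore": [9.0, 8.8],
    "VT1_Penalty": [0.0, 0.1],
    "VT1_Total_Score": [14.6, 13.9],
    "VT2_Dscore": [5.2, 5.0],
    "VT2_Escore": [9.1, 8.9],
    "VT2_Penalty": [0.0, 0.0],
    "VT2_Total_Score": [14.3, 13.9],
    "Final_Score": [14.45, 13.9],  # hypothetical extra column, not selected below
})

common = ["Rank", "Name", "Country", "DScore", "EScore", "Penalty", "Total Score"]

parts = []
for attempt in (1, 2):
    prefix = f"VT{attempt}_"
    cols = ["Rank", "Name", "NOC"] + [
        prefix + s for s in ("Dscore", "Escore", "Penalty", "Total_Score")
    ]
    part = toy_vt[cols].copy()  # copy so renaming does not touch toy_vt
    part.columns = common
    parts.append(part)

# Stack both attempts; ignore_index=True gives a fresh 0..n-1 index.
long_vt = pd.concat(parts, ignore_index=True)
print(long_vt)
```

Each gymnast now appears once per vault, in the same seven-column layout as the other apparatus tables. Note that `ignore_index=True` is a choice made for this sketch; the cell above keeps the original row labels.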

In [98]:
# Transform the all-around data frames using the pd.wide_to_long function

aa_col_names = ['Rank', 'Name', 'Country', 
                'DScore_FX', 'EScore_FX', 'Total Score_FX', 'Penalty_FX', 
                'DScore_PH', 'EScore_PH', 'Total Score_PH', 'Penalty_PH', 
                'DScore_SR', 'EScore_SR', 'Total Score_SR', 'Penalty_SR', 
                'DScore_VT', 'EScore_VT', 'Total Score_VT', 'Penalty_VT', 
                'DScore_PB', 'EScore_PB', 'Total Score_PB', 'Penalty_PB', 
                'DScore_HB', 'EScore_HB', 'Total Score_HB', 'Penalty_HB', 
                'AllAround_AA']

# Loop over the frames in the aa_data_frame list and store each reshaped frame back in the list

for i, frame in enumerate(aa_data_frame):
    frame.columns = aa_col_names
    aa_data_frame[i] = pd.wide_to_long(
        frame,
        stubnames=['DScore', 'EScore', 'Total Score', 'Penalty'],
        i=['Rank', 'Name', 'Country'],
        j='Apparatus',
        sep="_",
        suffix=r'\D+',  # raw string: the apparatus codes (FX, PH, ...) are non-digit suffixes
    ).drop(columns=['AllAround_AA']).reset_index()
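
To see what `pd.wide_to_long` does to tables like these, here is a small sketch on a toy frame with only two apparatus and two score types. The values are made up for illustration; the parameters mirror the ones used in the cell above.

```python
import pandas as pd

# Toy all-around table in wide form: one row per gymnast, one column per
# score type and apparatus (made-up values).
toy_aa = pd.DataFrame({
    "Rank": [1, 2],
    "Name": ["A. Example", "B. Example"],
    "Country": ["AAA", "BBB"],
    "DScore_FX": [5.6, 5.2],
    "EScore_FX": [8.7, 8.5],
    "DScore_PH": [5.9, 5.4],
    "EScore_PH": [8.4, 8.6],
})

long_aa = pd.wide_to_long(
    toy_aa,
    stubnames=["DScore", "EScore"],   # column-name prefixes to collect
    i=["Rank", "Name", "Country"],    # columns that identify a gymnast
    j="Apparatus",                    # new column holding the suffix (FX, PH)
    sep="_",
    suffix=r"\D+",                    # suffixes are non-digit strings
).reset_index()

print(long_aa)
```

The result has one row per gymnast and apparatus, so the two gymnasts and two apparatus in the toy frame produce four rows, each with its own `DScore` and `EScore`.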