# Load Clean Files to Database

Three files need to be loaded into the database, NFL. The database must be created and its tables defined before loading. 
A few things of note: 
1. I will be using SQL Server this time, which means that TINYINT is unsigned and ranges from 0 to 255, instead of the signed -128 to 127 range found in, for example, MySQL. Any column that can hold negative values will therefore have to use SMALLINT instead.
2. The tracking data files will be concatenated, since they include only the plays involving injuries.
3. The machine learning datasets will be separated into Concussions and Injuries, since each has columns the other lacks, and a combined table would be excessively large and filled with null values for either analysis. Additionally, the concussion set has only one play type, punt plays, which is only a minor subset of the injuries list.
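
The TINYINT point in item 1 can be checked mechanically before writing the table definitions. The following is a minimal sketch, not part of the original pipeline: the helper name `smallest_int_type` and the example ranges are hypothetical, while the type ranges are the documented SQL Server ones.

```python
# SQL Server integer types and their inclusive ranges, narrowest first.
SQL_SERVER_INT_TYPES = [
    ("TINYINT", 0, 255),  # unsigned in SQL Server
    ("SMALLINT", -32768, 32767),
    ("INT", -2147483648, 2147483647),
    ("BIGINT", -9223372036854775808, 9223372036854775807),
]


def smallest_int_type(min_value, max_value):
    """Return the narrowest SQL Server integer type covering [min_value, max_value]."""
    for name, low, high in SQL_SERVER_INT_TYPES:
        if low <= min_value and max_value <= high:
            return name
    raise ValueError("range does not fit in any SQL Server integer type")


# A 0/1 flag such as IsInjured fits in TINYINT.
flag_type = smallest_int_type(0, 1)

# A hypothetical column whose minimum dips below zero needs SMALLINT.
signed_type = smallest_int_type(-10, 105)

print(flag_type, signed_type)
```

In practice the two arguments would come from the column's observed minimum and maximum in the clean data.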

I will be using SQLAlchemy to push and pull the data to and from the database. While I have been using polars for this analysis, the write step here goes through pandas' `to_sql`, which accepts a SQLAlchemy connection but operates on a pandas DataFrame, not on a parquet file or a polars DataFrame. Thus, there will be a single step before the write process that converts to a pandas df prior to appending the data. 

Up to this point, I have been reading the files in directly. However, since this is intended to be a pipeline, later stages will pull the clean data from the database as their first step, to maintain data fidelity and reduce the chance of data corruption. 
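
As a rough illustration of that write-then-read pattern, here is a minimal sketch. It assumes an in-memory SQLite database as a stand-in for the SQL Server NFL database, and the table name `Demo_Summary` and its rows are made up for the example; only the column names echo the real summary tables.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Assumption: in-memory SQLite stands in for the SQL Server NFL database.
engine = create_engine("sqlite://")

# Made-up rows, not real plays.
clean = pd.DataFrame({"PlayKey": ["demo-1", "demo-2"], "IsInjured": [0, 1]})

# Write step: append the pandas DataFrame; engine.begin() commits on exit.
with engine.begin() as connection:
    clean.to_sql("Demo_Summary", connection, if_exists="append", index=False)

# Read step: a later pipeline stage starts from the database, not the file.
with engine.connect() as connection:
    loaded = pd.read_sql(
        text("SELECT PlayKey, IsInjured FROM Demo_Summary"), connection
    )

print(loaded)
```

Against SQL Server, only the engine URL would change; the `to_sql` and `read_sql` calls stay the same.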


***
#### All Files Now

Currently, the files include: 
- All_Tracking - Contains time, position, and physics data per PlayKey for all injuries and concussions, including the opponent key when needed
- Full_Summary_Concussions - Contains all of the summary data plus the descriptive stats of the concussion plays, both with and without injuries, per PlayKey
- Full_Summary_Injuries - Same as above, but for the injury set rather than the concussion set, including the IsSevere column
- OpponentPlays - Temp file, adds opponent PlayKeys to the concussion PlayKeys
- OptimizedTrackData - Temp file, very large file from the Injuries data
- QualitativeConcussions - Temp file, qualitative data collected prior to adding to All_Tracking
- QualitativeInjuries - Temp file, qualitative data collected prior to adding to All_Tracking
- TrackingInjuries - Temp file, injury equivalent of OpponentPlays

I need to determine what each of these files represents, and then which of them need to be loaded to a database prior to machine learning and viz production.


#### Loading: 

I will be loading All_Tracking for the vizzes.
Additionally, I will be loading Full_Summary_Concussions and Full_Summary_Injuries for the machine learning applications. 
If I want to include surface types or other parameters in the vizzes, I can join the tracking table to a summary table and output those data as needed. For now, there's no need to increase the size of the data.
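
A join of that kind might look like the sketch below. It uses Python's built-in `sqlite3` as a stand-in for SQL Server, cut-down versions of the Tracking_Data and Injury_Summary tables, and made-up rows and FieldType values. The only thing taken from the real schema is that both tables share PlayKey.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for SQL Server; rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Tracking_Data (PlayKey TEXT, time REAL, x REAL, y REAL);
    CREATE TABLE Injury_Summary (PlayKey TEXT, FieldType TEXT);
    INSERT INTO Tracking_Data VALUES
        ('demo-1', 0.0, 10.0, 20.0),
        ('demo-1', 0.1, 10.5, 20.2),
        ('demo-2', 0.0, 55.0, 30.0);
    INSERT INTO Injury_Summary VALUES
        ('demo-1', 'Synthetic'),
        ('demo-2', 'Natural');
""")

# Attach the surface type from the summary table to every tracking row.
rows = conn.execute("""
    SELECT t.PlayKey, t.time, t.x, t.y, s.FieldType
    FROM Tracking_Data AS t
    JOIN Injury_Summary AS s ON s.PlayKey = t.PlayKey
    ORDER BY t.PlayKey, t.time
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

Because the summary tables hold one row per PlayKey, the join adds columns without multiplying the tracking rows.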

***

### Creating the Database and Tables

In [None]:
CREATE DATABASE NFL;
GO

USE NFL;
GO

CREATE TABLE Tracking_Data (
    PlayKey VARCHAR(20),
    time FLOAT,
    x FLOAT,
    y FLOAT,
    dir FLOAT,
    o FLOAT,
    Angle_Diff FLOAT,
    Displacement FLOAT,
    Speed FLOAT,
    vx FLOAT,
    vy FLOAT,
    omega_dir FLOAT,
    omega_o FLOAT,
    omega_diff FLOAT,
    Position VARCHAR(50),
    Height_m FLOAT,
    Weight_kg FLOAT,
    Chest_rad_m FLOAT,
    px FLOAT,
    py FLOAT,
    moment FLOAT,
    moment_upper FLOAT,
    p_magnitude FLOAT,
    L_dir FLOAT,
    L_diff FLOAT,
    Jx FLOAT,
    Jy FLOAT,
    J_magnitude FLOAT,
    torque FLOAT,
    torque_internal FLOAT,
    InjuryType VARCHAR(50),
    GSISID INT,
    Player_Activity_Derived VARCHAR(50),
    Primary_Impact_Type VARCHAR(50),
    Primary_Partner_GSISID VARCHAR(20),
    Primary_Partner_Activity_Derived VARCHAR(50),
    OpponentKey VARCHAR(20)
);

CREATE TABLE Concussion_Summary (
    PlayKey VARCHAR(20),
    Position VARCHAR(50),
    Role VARCHAR(50),
    Play_Type VARCHAR(50),
    Poss_Team VARCHAR(50),
    Game_Site VARCHAR(50),
    HomeTeamCode VARCHAR(50),
    VisitTeamCode VARCHAR(50),
    StadiumType VARCHAR(50),
    FieldType VARCHAR(50),
    Weather VARCHAR(50),
    Temperature FLOAT,
    Player_Activity_Derived VARCHAR(50),
    Primary_Impact_Type VARCHAR(50),
    Primary_Partner_Activity_Derived VARCHAR(50),
    Primary_Partner_GSISID INT,
    OpponentKey VARCHAR(20),
    IsInjured TINYINT,
    Home_Score TINYINT,
    Visiting_Score TINYINT,
    Score_Difference SMALLINT, -- SMALLINT in case the difference is negative (TINYINT is unsigned in SQL Server)
    Position_right VARCHAR(50),
    Distance FLOAT,
    Displacement FLOAT,
    Path_Diff FLOAT,
    Max_Angle_Diff FLOAT,
    Mean_Angle_Diff FLOAT,
    Max_Speed FLOAT,
    Mean_Speed FLOAT,
    Max_Impulse FLOAT,
    Mean_Impulse FLOAT,
    Max_Torque FLOAT,
    Mean_Torque FLOAT,
    Max_Int_Torque FLOAT,
    Mean_Int_Torque FLOAT
);

CREATE TABLE Injury_Summary (
    PlayKey VARCHAR(20),
    Position VARCHAR(50),
    StadiumType VARCHAR(50),
    FieldType VARCHAR(50),
    Temperature SMALLINT,
    Weather VARCHAR(50),
    PlayType VARCHAR(50),
    BodyPart VARCHAR(50),
    DM_M1 TINYINT,
    DM_M7 TINYINT,
    DM_M28 TINYINT,
    DM_M42 TINYINT,
    IsInjured TINYINT,
    IsSevere TINYINT,
    Position_right VARCHAR(50),
    Distance FLOAT,
    Displacement FLOAT,
    Path_Diff FLOAT,
    Max_Angle_Diff FLOAT,
    Mean_Angle_Diff FLOAT,
    Max_Speed FLOAT,
    Mean_Speed FLOAT,
    Max_Impulse FLOAT,
    Mean_Impulse FLOAT,
    Max_Torque FLOAT,
    Mean_Torque FLOAT,
    Max_Int_Torque FLOAT,
    Mean_Int_Torque FLOAT
);



### Populating the Tables in the Database using SQLAlchemy

In [1]:
import polars as pl
from sqlalchemy import create_engine
import pyarrow as pa
import pyarrow.parquet as pq
import os

path = "F:/Data/Processing_data/"
file1 = "All_Tracking.parquet"
file2 = "Full_Summary_Concussions.parquet"
file3 = "Full_Summary_Injuries.parquet"

In [2]:
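# Dictionary-encoded columns with uint32 indices (e.g. polars Categorical columns)
# can fail in to_pandas(), so cast them to plain strings first.
def convert_uint32_dict_columns(table):
    """Return a copy of `table` with uint32-indexed dictionary columns cast to string."""
    new_columns = []
    for col in table.columns:
        if pa.types.is_dictionary(col.type) and col.type.index_type == pa.uint32():
            new_col = col.cast(pa.string())
        else:
            # Leave every other column untouched
            new_col = col
        new_columns.append(new_col)
    return pa.Table.from_arrays(new_columns, names=table.column_names)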

In [None]:
# Read Parquet file using Polars
df = pl.read_parquet(os.path.join(path, file1))

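# Create SQLAlchemy engine; the raw ODBC string must be URL-encoded for odbc_connect
from urllib.parse import quote_plus
odbc_str = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=GOOSEBOX;DATABASE=NFL;Trusted_Connection=yes;"
engine = create_engine(f"mssql+pyodbc:///?odbc_connect={quote_plus(odbc_str)}")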

# Convert Polars DataFrame to PyArrow Table
arrow_table = df.to_arrow()

# Convert problematic columns
converted_table = convert_uint32_dict_columns(arrow_table)

# Write to SQL via pandas; engine.begin() commits the transaction on exit
with engine.begin() as connection:
    # Convert PyArrow table to pandas DataFrame
    pandas_df = converted_table.to_pandas()
    
    # Append to the existing table
    pandas_df.to_sql('Tracking_Data', connection, if_exists='append', index=False)

In [None]:
# Read Parquet file using Polars
df = pl.read_parquet(os.path.join(path, file2))

# Convert Polars DataFrame to PyArrow Table
arrow_table = df.to_arrow()

# Convert problematic columns
converted_table = convert_uint32_dict_columns(arrow_table)

# Write to SQL via pandas; engine.begin() commits the transaction on exit
with engine.begin() as connection:
    # Convert PyArrow table to pandas DataFrame
    pandas_df = converted_table.to_pandas()
    
    # Append to the existing table
    pandas_df.to_sql('Concussion_Summary', connection, if_exists='append', index=False)

In [None]:
# Read Parquet file using Polars
df = pl.read_parquet(os.path.join(path, file3))

# Convert Polars DataFrame to PyArrow Table
arrow_table = df.to_arrow()

# Convert problematic columns
converted_table = convert_uint32_dict_columns(arrow_table)

# Write to SQL via pandas; engine.begin() commits the transaction on exit
with engine.begin() as connection:
    # Convert PyArrow table to pandas DataFrame
    pandas_df = converted_table.to_pandas()
    
    # Append to the existing table
    pandas_df.to_sql('Injury_Summary', connection, if_exists='append', index=False)