# Designing & Creating a Database

In this project I will work with a file of [Major League Baseball](https://en.wikipedia.org/wiki/Major_League_Baseball) game data from [Retrosheet](https://www.retrosheet.org). The goals of the project are to:

* Import data into SQLite
* Design a normalized database schema
* Create tables for our schema
* Insert data into our schema

Retrosheet compiles detailed statistics on baseball games from the 1800s through to today. The main file we will be working from, game_log.csv, was produced by combining 127 separate CSV files from Retrosheet and has been pre-cleaned to remove some inconsistencies. The game log has hundreds of data points on each game, which we will normalize into several separate tables using SQL, providing a robust database of game-level statistics.

Since we are trying to create a normalized database, our focus should be on:

* Becoming familiar, at a high level, with the meaning of each column in each file.
* Thinking about the relationships between columns within each file.
* Thinking about the relationships between columns across different files.

**Disclaimer:** This project was prepared as one of the guided projects on [Dataquest](http://dataquest.io/). Most of the content, including the normalized database design, was presented and discussed in the Dataquest project mission. Nevertheless, some of the code and analysis belongs to the author. This project was done for learning purposes.

## Data Exploration

Setting the options below after importing pandas is recommended: given the size of the main game log file, they prevent the DataFrame output from being truncated. Let's also read in the data and explore it. To better understand the columns, we can use the [game_log_fields.txt](data/game_log_fields.txt) file, which explains the fields included in the main file.

In [1]:
# load libs
import pandas as pd
import sqlite3

# set pandas options
pd.set_option('display.max_columns', 180)
pd.set_option('display.max_rows', 200000)
pd.set_option('display.max_colwidth', 5000)

# read dataset
game_log = pd.read_csv('data/game_log.csv')
park_codes = pd.read_csv('data/park_codes.csv')
person_codes = pd.read_csv('data/person_codes.csv')
team_codes = pd.read_csv('data/team_codes.csv')


In [2]:
print("Game Log", game_log.shape)
print("Park Codes", park_codes.shape)
print("Person Codes", person_codes.shape)
print("Team Codes", team_codes.shape)

Game Log (171907, 161)
Park Codes (252, 9)
Person Codes (20494, 7)
Team Codes (150, 8)


In [3]:
game_log.head()

Unnamed: 0,date,number_of_game,day_of_week,v_name,v_league,v_game_number,h_name,h_league,h_game_number,v_score,h_score,length_outs,day_night,completion,forefeit,protest,park_id,attendance,length_minutes,v_line_score,h_line_score,v_at_bats,v_hits,v_doubles,v_triples,v_homeruns,v_rbi,v_sacrifice_hits,v_sacrifice_flies,v_hit_by_pitch,v_walks,v_intentional_walks,v_strikeouts,v_stolen_bases,v_caught_stealing,v_grounded_into_double,v_first_catcher_interference,v_left_on_base,v_pitchers_used,v_individual_earned_runs,v_team_earned_runs,v_wild_pitches,v_balks,v_putouts,v_assists,v_errors,v_passed_balls,v_double_plays,v_triple_plays,h_at_bats,h_hits,h_doubles,h_triples,h_homeruns,h_rbi,h_sacrifice_hits,h_sacrifice_flies,h_hit_by_pitch,h_walks,h_intentional_walks,h_strikeouts,h_stolen_bases,h_caught_stealing,h_grounded_into_double,h_first_catcher_interference,h_left_on_base,h_pitchers_used,h_individual_earned_runs,h_team_earned_runs,h_wild_pitches,h_balks,h_putouts,h_assists,h_errors,h_passed_balls,h_double_plays,h_triple_plays,hp_umpire_id,hp_umpire_name,1b_umpire_id,1b_umpire_name,2b_umpire_id,2b_umpire_name,3b_umpire_id,3b_umpire_name,lf_umpire_id,lf_umpire_name,rf_umpire_id,rf_umpire_name,v_manager_id,v_manager_name,h_manager_id,h_manager_name,winning_pitcher_id,winning_pitcher_name,losing_pitcher_id,losing_pitcher_name,saving_pitcher_id,saving_pitcher_name,winning_rbi_batter_id,winning_rbi_batter_id_name,v_starting_pitcher_id,v_starting_pitcher_name,h_starting_pitcher_id,h_starting_pitcher_name,v_player_1_id,v_player_1_name,v_player_1_def_pos,v_player_2_id,v_player_2_name,v_player_2_def_pos,v_player_3_id,v_player_3_name,v_player_3_def_pos,v_player_4_id,v_player_4_name,v_player_4_def_pos,v_player_5_id,v_player_5_name,v_player_5_def_pos,v_player_6_id,v_player_6_name,v_player_6_def_pos,v_player_7_id,v_player_7_name,v_player_7_def_pos,v_player_8_id,v_player_8_name,v_player_8_def_pos,v_player_9_id,v_player_9_name,v_player_9_def_pos,h_player_1_id,h_player_1_name,h_player_1_def_
pos,h_player_2_id,h_player_2_name,h_player_2_def_pos,h_player_3_id,h_player_3_name,h_player_3_def_pos,h_player_4_id,h_player_4_name,h_player_4_def_pos,h_player_5_id,h_player_5_name,h_player_5_def_pos,h_player_6_id,h_player_6_name,h_player_6_def_pos,h_player_7_id,h_player_7_name,h_player_7_def_pos,h_player_8_id,h_player_8_name,h_player_8_def_pos,h_player_9_id,h_player_9_name,h_player_9_def_pos,additional_info,acquisition_info
0,18710504,0,Thu,CL1,,1,FW1,,1,0,2,54.0,D,,,,FOR01,200.0,120.0,0,10010000,30.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,6.0,1.0,,-1.0,,4.0,1.0,1.0,1.0,0.0,0.0,27.0,9.0,0.0,3.0,0.0,0.0,31.0,4.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,,0.0,0.0,,-1.0,,3.0,1.0,0.0,0.0,0.0,0.0,27.0,3.0,3.0,1.0,1.0,0.0,boakj901,John Boake,,,,,,,,,,,paboc101,Charlie Pabor,lennb101,Bill Lennon,mathb101,Bobby Mathews,prata101,Al Pratt,,,,,prata101,Al Pratt,mathb101,Bobby Mathews,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,selmf101,Frank Sellman,5.0,mathb101,Bobby Mathews,1.0,foraj101,Jim Foran,3.0,goldw101,Wally Goldsmith,6.0,lennb101,Bill Lennon,2.0,caret101,Tom Carey,4.0,mince101,Ed Mincher,7.0,mcdej101,James McDermott,8.0,kellb105,Bill Kelly,9.0,,Y
1,18710505,0,Fri,BS1,,1,WS3,,1,20,18,54.0,D,,,,WAS01,5000.0,145.0,107000435,640113030,41.0,13.0,1.0,2.0,0.0,13.0,0.0,0.0,0.0,18.0,,5.0,3.0,,-1.0,,12.0,1.0,6.0,6.0,1.0,0.0,27.0,13.0,10.0,1.0,2.0,0.0,49.0,14.0,2.0,0.0,0.0,11.0,0.0,0.0,0.0,10.0,,2.0,1.0,,-1.0,,14.0,1.0,7.0,7.0,0.0,0.0,27.0,20.0,10.0,2.0,3.0,0.0,dobsh901,Henry Dobson,,,,,,,,,,,wrigh101,Harry Wright,younn801,Nick Young,spala101,Al Spalding,braia102,Asa Brainard,,,,,spala101,Al Spalding,braia102,Asa Brainard,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0,birdd102,Dave Birdsall,9.0,mcvec101,Cal McVey,2.0,wrigh101,Harry Wright,8.0,goulc101,Charlie Gould,3.0,schah101,Harry Schafer,5.0,conef101,Fred Cone,7.0,spala101,Al Spalding,1.0,watef102,Fred Waterman,5.0,forcd101,Davy Force,6.0,mille105,Everett Mills,3.0,allid101,Doug Allison,2.0,hallg101,George Hall,7.0,leona101,Andy Leonard,4.0,braia102,Asa Brainard,1.0,burrh101,Henry Burroughs,9.0,berth101,Henry Berthrong,8.0,HTBF,Y
2,18710506,0,Sat,CL1,,2,RC1,,1,12,4,54.0,D,,,,RCK01,1000.0,140.0,610020003,10020100,49.0,11.0,1.0,1.0,0.0,8.0,0.0,0.0,0.0,0.0,,1.0,0.0,,-1.0,,10.0,1.0,0.0,0.0,2.0,0.0,27.0,12.0,8.0,5.0,0.0,0.0,36.0,7.0,2.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,,3.0,5.0,,-1.0,,5.0,1.0,3.0,3.0,1.0,0.0,27.0,12.0,13.0,3.0,0.0,0.0,mawnj901,J.H. Manny,,,,,,,,,,,paboc101,Charlie Pabor,hasts101,Scott Hastings,prata101,Al Pratt,fishc102,Cherokee Fisher,,,,,prata101,Al Pratt,fishc102,Cherokee Fisher,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,mackd101,Denny Mack,3.0,addyb101,Bob Addy,4.0,fishc102,Cherokee Fisher,1.0,hasts101,Scott Hastings,8.0,ham-r101,Ralph Ham,5.0,ansoc101,Cap Anson,2.0,sagep101,Pony Sager,6.0,birdg101,George Bird,7.0,stirg101,Gat Stires,9.0,,Y
3,18710508,0,Mon,CL1,,3,CH1,,1,12,14,54.0,D,,,,CHI01,5000.0,150.0,101403111,77000000,46.0,15.0,2.0,1.0,2.0,10.0,0.0,0.0,0.0,0.0,,1.0,0.0,,-1.0,,7.0,1.0,6.0,6.0,0.0,0.0,27.0,15.0,11.0,6.0,0.0,0.0,43.0,11.0,2.0,0.0,0.0,8.0,0.0,0.0,0.0,4.0,,2.0,1.0,,-1.0,,6.0,1.0,4.0,4.0,0.0,0.0,27.0,14.0,7.0,2.0,0.0,0.0,willg901,Gardner Willard,,,,,,,,,,,paboc101,Charlie Pabor,woodj106,Jimmy Wood,zettg101,George Zettlein,prata101,Al Pratt,,,,,prata101,Al Pratt,zettg101,George Zettlein,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,mcatb101,Bub McAtee,3.0,kingm101,Marshall King,8.0,hodec101,Charlie Hodes,2.0,woodj106,Jimmy Wood,4.0,simmj101,Joe Simmons,9.0,folet101,Tom Foley,7.0,duffe101,Ed Duffy,6.0,pinke101,Ed Pinkham,5.0,zettg101,George Zettlein,1.0,,Y
4,18710509,0,Tue,BS1,,2,TRO,,1,9,5,54.0,D,,,,TRO01,3250.0,145.0,2232,101003000,46.0,17.0,4.0,1.0,0.0,6.0,0.0,0.0,0.0,2.0,,0.0,1.0,,-1.0,,12.0,1.0,2.0,2.0,0.0,0.0,27.0,12.0,5.0,0.0,1.0,0.0,36.0,9.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,3.0,,0.0,2.0,,-1.0,,7.0,1.0,3.0,3.0,1.0,0.0,27.0,11.0,7.0,3.0,0.0,0.0,leroi901,Isaac Leroy,,,,,,,,,,,wrigh101,Harry Wright,pikel101,Lip Pike,spala101,Al Spalding,mcmuj101,John McMullin,,,,,spala101,Al Spalding,mcmuj101,John McMullin,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0,birdd102,Dave Birdsall,9.0,mcvec101,Cal McVey,2.0,wrigh101,Harry Wright,8.0,goulc101,Charlie Gould,3.0,schah101,Harry Schafer,5.0,conef101,Fred Cone,7.0,spala101,Al Spalding,1.0,flync101,Clipper Flynn,9.0,mcgem101,Mike McGeary,2.0,yorkt101,Tom York,8.0,mcmuj101,John McMullin,1.0,kings101,Steve King,7.0,beave101,Edward Beavens,4.0,bells101,Steve Bellan,5.0,pikel101,Lip Pike,3.0,cravb101,Bill Craver,6.0,HTBF,Y


Let's explore the dataset above, starting with what each defensive position number represents. Columns such as `h_player_1_def_pos` and `v_player_9_def_pos` indicate the defensive positions of home player 1 and visiting player 9 respectively. The defensive positions are numbered 1 to 9. These are the positions with their respective codes [(source)](https://en.wikipedia.org/wiki/Baseball_positions):

Code. Position
1. Pitcher
2. Catcher
3. First Baseman
4. Second Baseman
5. Third Baseman
6. Shortstop
7. Left Fielder
8. Center Fielder
9. Right Fielder
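
The codes above can be captured in a small lookup table. This is only an illustrative sketch: the names `DEF_POSITIONS` and `position_name` are hypothetical and not part of the original project. Because the game log stores the codes as floats such as `2.0`, and leaves some of them missing, the helper converts to `int` and returns `None` for missing or unknown values.

```python
# Lookup table for the numeric defensive-position codes listed above.
# (Hypothetical helper names, introduced only for illustration.)
DEF_POSITIONS = {
    1: "Pitcher",
    2: "Catcher",
    3: "First Baseman",
    4: "Second Baseman",
    5: "Third Baseman",
    6: "Shortstop",
    7: "Left Fielder",
    8: "Center Fielder",
    9: "Right Fielder",
}


def position_name(code):
    """Return the position name for a code such as 2 or 2.0.

    Returns None when the code is missing (None or NaN) or not in 1-9.
    """
    # NaN is the only value that is not equal to itself.
    if code is None or code != code:
        return None
    return DEF_POSITIONS.get(int(code))


print(position_name(2.0))
```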

The image below nicely visualizes these positions.

<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Baseball_positions.svg/300px-Baseball_positions.svg.png'>

---
Let's explore the league information. Columns 4-5 and 7-8 (`v_name`, `v_league`, `h_name`, `h_league`) identify the visiting and home teams and their leagues. We will look at the league values.

In [4]:
game_log.h_league.value_counts(dropna=False)

NL     88867
AL     74712
AA      5039
FL      1243
NaN     1086
PL       532
UA       428
Name: h_league, dtype: int64

In [5]:
game_log.v_league.value_counts(dropna=False)

NL     88866
AL     74713
AA      5039
FL      1243
NaN     1086
PL       532
UA       428
Name: v_league, dtype: int64

We can observe the list of leagues and also notice that the majority of games have league information for both teams. The leagues and their meanings:

* NL - National League
* AL - American League
* AA - American Association
* FL - Federal League
* PL - Players League
* UA - Union Association

Next, let's look at the park, person, and team code files to get familiar with them.

In [6]:
park_codes.head()

Unnamed: 0,park_id,name,aka,city,state,start,end,league,notes
0,ALB01,Riverside Park,,Albany,NY,09/11/1880,05/30/1882,NL,TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1,ALT01,Columbia Park,,Altoona,PA,04/30/1884,05/31/1884,UA,
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,04/19/1966,,AL,
3,ARL01,Arlington Stadium,,Arlington,TX,04/21/1972,10/03/1993,AL,
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,04/11/1994,,AL,


In [7]:
person_codes.head()

Unnamed: 0,id,last,first,player_debut,mgr_debut,coach_debut,ump_debut
0,aardd001,Aardsma,David,04/06/2004,,,
1,aaroh101,Aaron,Hank,04/13/1954,,,
2,aarot101,Aaron,Tommie,04/10/1962,,04/06/1979,
3,aased001,Aase,Don,07/26/1977,,,
4,abada001,Abad,Andy,09/10/2001,,,


In [8]:
team_codes.head()

Unnamed: 0,team_id,league,start,end,city,nickname,franch_id,seq
0,ALT,UA,1884,1884,Altoona,Mountain Cities,ALT,1
1,ARI,NL,1998,0,Arizona,Diamondbacks,ARI,1
2,BFN,NL,1879,1885,Buffalo,Bisons,BFN,1
3,BFP,PL,1890,1890,Buffalo,Bisons,BFP,1
4,BL1,,1872,1874,Baltimore,Canaries,BL1,1


## Importing Data into SQLite

To insert data into a normalized database, we need to come up with a primary key for the game log table. On the [Retrosheet site](https://www.retrosheet.org/eventfile.htm), we can find a data dictionary for their event files, which list every event within each game. It includes the following description:

*__id__: Each game begins with a twelve character ID record which identifies the date, home team, and number of the game. For example, ATL198304080 should be read as follows. The first three characters identify the home team (the Braves). The next four are the year (1983). The next two are the month (April) using the standard numeric notation, 04, followed by the day (08). The last digit indicates if this is a single game (0), first game (1) or second game (2) if more than one game is played during a day, usually a double header. The id record starts the description of a game thus ending the description of the preceding game in the file.*

This is essentially what we need: for our primary key we will use the composite key described above. It concatenates the home team, the date, and the number of the game (the `h_name`, `date`, and `number_of_game` columns), in that order.
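
To make the ID layout concrete, here is a minimal sketch that builds and parses such an ID in plain Python. The function names `make_game_id` and `parse_game_id` are hypothetical and not part of the original project. The sketch assumes the home team code is always three characters and the date is in `YYYYMMDD` form, as in the Retrosheet description.

```python
def make_game_id(home_team, date, number_of_game):
    """Concatenate home team code, YYYYMMDD date, and game number."""
    return f"{home_team}{date}{number_of_game}"


def parse_game_id(game_id):
    """Split a twelve-character game ID into its components."""
    if len(game_id) != 12:
        raise ValueError(f"expected 12 characters, got {len(game_id)}")
    return {
        "home_team": game_id[:3],            # e.g. ATL
        "year": int(game_id[3:7]),           # e.g. 1983
        "month": int(game_id[7:9]),          # e.g. 04 -> 4
        "day": int(game_id[9:11]),           # e.g. 08 -> 8
        "number_of_game": int(game_id[11]),  # 0, 1 or 2
    }


# First row of the game log: home team FW1, date 18710504, game number 0.
print(make_game_id("FW1", 18710504, 0))
print(parse_game_id("ATL198304080"))
```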

Our next task is to import the data into SQLite. There are multiple ways to do that.


* 1. Using the Python SQLite library

The [Python SQLite library](https://docs.python.org/3/library/sqlite3.html) gives us ultimate control when importing data. We will first need to get the data into Python - we might choose to use the [csv module](https://docs.python.org/3/library/csv.html) for this. Next, we would use the [`Cursor.execute()` method](https://docs.python.org/3/library/sqlite3.html#sqlite3.Cursor.execute) to create a table for our data.

We should take advantage of the `?` placeholder syntax instead of using [Python string formatting](https://pyformat.info/), both to prevent SQL injection attacks (like the hilarious [XKCD 'Bobby Tables' comic](https://xkcd.com/327/) example) and to maintain the correct data types. Even though we won't be running any external user code in this project, this is an extremely good habit to get into. Here's what our syntax could look like for the last step:

```
my_list_of_lists = [
    [4, 4, 8, 2],
    [5, 1, 6, 3],
    [5, 2, 4, 6]
]
c = """
INSERT INTO table_name (
    column_one,
    column_two,
    [...]
) VALUES (
    ?,
    ?
    [...]
);
"""
cur.executemany(c, my_list_of_lists)
```
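
For a concrete, runnable version of the template above, the sketch below reads a few rows with the `csv` module, creates a table with `Cursor.execute()`, and inserts the rows with `Cursor.executemany()` and `?` placeholders. The table name `team_sample`, its three columns, and the in-memory database are assumptions made for illustration; the sample rows are taken from the first lines of the team codes file shown earlier.

```python
import csv
import io
import sqlite3

# A small in-memory sample standing in for a real CSV file on disk
# (hypothetical subset of the team codes columns).
sample_csv = io.StringIO(
    "team_id,league,city\n"
    "ALT,UA,Altoona\n"
    "ARI,NL,Arizona\n"
    "BFN,NL,Buffalo\n"
)

reader = csv.reader(sample_csv)
header = next(reader)  # first row holds the column names
rows = list(reader)    # remaining rows are the data

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE team_sample ("
    "team_id TEXT PRIMARY KEY, league TEXT, city TEXT);"
)

# One ? placeholder per column; sqlite3 binds the values safely.
insert_sql = "INSERT INTO team_sample (team_id, league, city) VALUES (?, ?, ?);"
cur.executemany(insert_sql, rows)
conn.commit()

inserted = cur.execute(
    "SELECT team_id, league, city FROM team_sample ORDER BY team_id;"
).fetchall()
print(inserted)
conn.close()
```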

* 2. Using pandas

The pandas library includes a handy `DataFrame.to_sql()` method that we can use to send the contents of a dataframe to a SQLite connection object. We can either create the table first using the method above, or if the table does not exist, pandas will create it for us. Here's an example of what that looks like:

```
my_dataframe.to_sql('table_name', sqlite_connection_object, index=False)
```

Most of the time, we'll want to use `index=False`; otherwise pandas will create an extra column for the pandas index.

The advantage of this method is that it can often be done with a line or two of code. The disadvantage is that pandas may alter the data as it reads it in and converts the columns to types automatically. Additionally, this requires the data to be small enough to be able to be stored in-memory using pandas.

We will use the pandas `DataFrame.to_sql()` method for the import.

In [9]:
# database name
db_name = 'mlb.db'

# helper functions
def run_query(q):
    with sqlite3.connect(db_name) as conn:
        return pd.read_sql(q, conn)
    
def run_command(q):
    with sqlite3.connect(db_name) as conn:
        conn.isolation_level = None
        conn.execute(q)

# show the tables
def show_tables():
    q = """
        SELECT name, type
        FROM sqlite_master
        WHERE type IN ('table', 'view');
    """
    return run_query(q)

In [10]:
# import dataframe into sqlite database
with sqlite3.connect(db_name) as conn:
    game_log.to_sql('game_log', conn, if_exists='replace', index=False)
    park_codes.to_sql('park_codes', conn, if_exists='replace', index=False)
    person_codes.to_sql('person_codes', conn, if_exists='replace', index=False)
    team_codes.to_sql('team_codes', conn, if_exists='replace', index=False)

In [11]:
# check tables
show_tables()

Unnamed: 0,name,type
0,game_log,table
1,park_codes,table
2,person_codes,table
3,team_codes,table


We will create a new column called `game_id` in the `game_log` table and populate it with the composite key discussed above: home team, `date`, and number of the game.
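
Before running this against the full table, the same add-then-populate pattern can be tried on a tiny in-memory table, together with a check that the resulting IDs are unique. This is a sketch under assumptions: the table name `game_log_sample` is hypothetical, and only the three columns needed for the key are included, using values from rows shown in the game log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical miniature of the game log with only the key columns.
conn.execute(
    "CREATE TABLE game_log_sample ("
    "date INTEGER, number_of_game INTEGER, h_name TEXT);"
)
conn.executemany(
    "INSERT INTO game_log_sample VALUES (?, ?, ?);",
    [
        (18710504, 0, "FW1"),
        (18710505, 0, "WS3"),
        (18710513, 0, "CL1"),
        (18710513, 0, "FW1"),  # same date as above, different home team
    ],
)

# Add the new column, then fill it by concatenating the three parts.
conn.execute("ALTER TABLE game_log_sample ADD COLUMN game_id TEXT;")
conn.execute(
    "UPDATE game_log_sample SET game_id = h_name || date || number_of_game;"
)

game_ids = [
    row[0]
    for row in conn.execute(
        "SELECT game_id FROM game_log_sample ORDER BY date, h_name;"
    )
]

# A usable primary key must not repeat: list any ID that appears twice.
duplicates = conn.execute(
    "SELECT game_id, COUNT(*) FROM game_log_sample "
    "GROUP BY game_id HAVING COUNT(*) > 1;"
).fetchall()

print(game_ids)
print(duplicates)
conn.close()
```

An empty duplicates list is what we would hope to see on the full `game_log` table as well before declaring `game_id` a primary key.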

In [12]:
q = 'SELECT * FROM game_log LIMIT 15'
run_query(q)

Unnamed: 0,date,number_of_game,day_of_week,v_name,v_league,v_game_number,h_name,h_league,h_game_number,v_score,h_score,length_outs,day_night,completion,forefeit,protest,park_id,attendance,length_minutes,v_line_score,h_line_score,v_at_bats,v_hits,v_doubles,v_triples,v_homeruns,v_rbi,v_sacrifice_hits,v_sacrifice_flies,v_hit_by_pitch,v_walks,v_intentional_walks,v_strikeouts,v_stolen_bases,v_caught_stealing,v_grounded_into_double,v_first_catcher_interference,v_left_on_base,v_pitchers_used,v_individual_earned_runs,v_team_earned_runs,v_wild_pitches,v_balks,v_putouts,v_assists,v_errors,v_passed_balls,v_double_plays,v_triple_plays,h_at_bats,h_hits,h_doubles,h_triples,h_homeruns,h_rbi,h_sacrifice_hits,h_sacrifice_flies,h_hit_by_pitch,h_walks,h_intentional_walks,h_strikeouts,h_stolen_bases,h_caught_stealing,h_grounded_into_double,h_first_catcher_interference,h_left_on_base,h_pitchers_used,h_individual_earned_runs,h_team_earned_runs,h_wild_pitches,h_balks,h_putouts,h_assists,h_errors,h_passed_balls,h_double_plays,h_triple_plays,hp_umpire_id,hp_umpire_name,1b_umpire_id,1b_umpire_name,2b_umpire_id,2b_umpire_name,3b_umpire_id,3b_umpire_name,lf_umpire_id,lf_umpire_name,rf_umpire_id,rf_umpire_name,v_manager_id,v_manager_name,h_manager_id,h_manager_name,winning_pitcher_id,winning_pitcher_name,losing_pitcher_id,losing_pitcher_name,saving_pitcher_id,saving_pitcher_name,winning_rbi_batter_id,winning_rbi_batter_id_name,v_starting_pitcher_id,v_starting_pitcher_name,h_starting_pitcher_id,h_starting_pitcher_name,v_player_1_id,v_player_1_name,v_player_1_def_pos,v_player_2_id,v_player_2_name,v_player_2_def_pos,v_player_3_id,v_player_3_name,v_player_3_def_pos,v_player_4_id,v_player_4_name,v_player_4_def_pos,v_player_5_id,v_player_5_name,v_player_5_def_pos,v_player_6_id,v_player_6_name,v_player_6_def_pos,v_player_7_id,v_player_7_name,v_player_7_def_pos,v_player_8_id,v_player_8_name,v_player_8_def_pos,v_player_9_id,v_player_9_name,v_player_9_def_pos,h_player_1_id,h_player_1_name,h_player_1_def_
pos,h_player_2_id,h_player_2_name,h_player_2_def_pos,h_player_3_id,h_player_3_name,h_player_3_def_pos,h_player_4_id,h_player_4_name,h_player_4_def_pos,h_player_5_id,h_player_5_name,h_player_5_def_pos,h_player_6_id,h_player_6_name,h_player_6_def_pos,h_player_7_id,h_player_7_name,h_player_7_def_pos,h_player_8_id,h_player_8_name,h_player_8_def_pos,h_player_9_id,h_player_9_name,h_player_9_def_pos,additional_info,acquisition_info
0,18710504,0,Thu,CL1,,1,FW1,,1,0,2,54.0,D,,,,FOR01,200.0,120.0,000000000,10010000,30.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,6.0,1.0,,-1.0,,4.0,1.0,1.0,1.0,0.0,0.0,27.0,9.0,0.0,3.0,0.0,0.0,31.0,4.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,,0.0,0.0,,-1.0,,3.0,1.0,0.0,0.0,0.0,0.0,27.0,3.0,3.0,1.0,1.0,0.0,boakj901,John Boake,,,,,,,,,,,paboc101,Charlie Pabor,lennb101,Bill Lennon,mathb101,Bobby Mathews,prata101,Al Pratt,,,,,prata101,Al Pratt,mathb101,Bobby Mathews,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,selmf101,Frank Sellman,5.0,mathb101,Bobby Mathews,1.0,foraj101,Jim Foran,3.0,goldw101,Wally Goldsmith,6.0,lennb101,Bill Lennon,2.0,caret101,Tom Carey,4.0,mince101,Ed Mincher,7.0,mcdej101,James McDermott,8.0,kellb105,Bill Kelly,9.0,,Y
1,18710505,0,Fri,BS1,,1,WS3,,1,20,18,54.0,D,,,,WAS01,5000.0,145.0,107000435,640113030,41.0,13.0,1.0,2.0,0.0,13.0,0.0,0.0,0.0,18.0,,5.0,3.0,,-1.0,,12.0,1.0,6.0,6.0,1.0,0.0,27.0,13.0,10.0,1.0,2.0,0.0,49.0,14.0,2.0,0.0,0.0,11.0,0.0,0.0,0.0,10.0,,2.0,1.0,,-1.0,,14.0,1.0,7.0,7.0,0.0,0.0,27.0,20.0,10.0,2.0,3.0,0.0,dobsh901,Henry Dobson,,,,,,,,,,,wrigh101,Harry Wright,younn801,Nick Young,spala101,Al Spalding,braia102,Asa Brainard,,,,,spala101,Al Spalding,braia102,Asa Brainard,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0,birdd102,Dave Birdsall,9.0,mcvec101,Cal McVey,2.0,wrigh101,Harry Wright,8.0,goulc101,Charlie Gould,3.0,schah101,Harry Schafer,5.0,conef101,Fred Cone,7.0,spala101,Al Spalding,1.0,watef102,Fred Waterman,5.0,forcd101,Davy Force,6.0,mille105,Everett Mills,3.0,allid101,Doug Allison,2.0,hallg101,George Hall,7.0,leona101,Andy Leonard,4.0,braia102,Asa Brainard,1.0,burrh101,Henry Burroughs,9.0,berth101,Henry Berthrong,8.0,HTBF,Y
2,18710506,0,Sat,CL1,,2,RC1,,1,12,4,54.0,D,,,,RCK01,1000.0,140.0,610020003,10020100,49.0,11.0,1.0,1.0,0.0,8.0,0.0,0.0,0.0,0.0,,1.0,0.0,,-1.0,,10.0,1.0,0.0,0.0,2.0,0.0,27.0,12.0,8.0,5.0,0.0,0.0,36.0,7.0,2.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,,3.0,5.0,,-1.0,,5.0,1.0,3.0,3.0,1.0,0.0,27.0,12.0,13.0,3.0,0.0,0.0,mawnj901,J.H. Manny,,,,,,,,,,,paboc101,Charlie Pabor,hasts101,Scott Hastings,prata101,Al Pratt,fishc102,Cherokee Fisher,,,,,prata101,Al Pratt,fishc102,Cherokee Fisher,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,mackd101,Denny Mack,3.0,addyb101,Bob Addy,4.0,fishc102,Cherokee Fisher,1.0,hasts101,Scott Hastings,8.0,ham-r101,Ralph Ham,5.0,ansoc101,Cap Anson,2.0,sagep101,Pony Sager,6.0,birdg101,George Bird,7.0,stirg101,Gat Stires,9.0,,Y
3,18710508,0,Mon,CL1,,3,CH1,,1,12,14,54.0,D,,,,CHI01,5000.0,150.0,101403111,77000000,46.0,15.0,2.0,1.0,2.0,10.0,0.0,0.0,0.0,0.0,,1.0,0.0,,-1.0,,7.0,1.0,6.0,6.0,0.0,0.0,27.0,15.0,11.0,6.0,0.0,0.0,43.0,11.0,2.0,0.0,0.0,8.0,0.0,0.0,0.0,4.0,,2.0,1.0,,-1.0,,6.0,1.0,4.0,4.0,0.0,0.0,27.0,14.0,7.0,2.0,0.0,0.0,willg901,Gardner Willard,,,,,,,,,,,paboc101,Charlie Pabor,woodj106,Jimmy Wood,zettg101,George Zettlein,prata101,Al Pratt,,,,,prata101,Al Pratt,zettg101,George Zettlein,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,mcatb101,Bub McAtee,3.0,kingm101,Marshall King,8.0,hodec101,Charlie Hodes,2.0,woodj106,Jimmy Wood,4.0,simmj101,Joe Simmons,9.0,folet101,Tom Foley,7.0,duffe101,Ed Duffy,6.0,pinke101,Ed Pinkham,5.0,zettg101,George Zettlein,1.0,,Y
4,18710509,0,Tue,BS1,,2,TRO,,1,9,5,54.0,D,,,,TRO01,3250.0,145.0,000002232,101003000,46.0,17.0,4.0,1.0,0.0,6.0,0.0,0.0,0.0,2.0,,0.0,1.0,,-1.0,,12.0,1.0,2.0,2.0,0.0,0.0,27.0,12.0,5.0,0.0,1.0,0.0,36.0,9.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,3.0,,0.0,2.0,,-1.0,,7.0,1.0,3.0,3.0,1.0,0.0,27.0,11.0,7.0,3.0,0.0,0.0,leroi901,Isaac Leroy,,,,,,,,,,,wrigh101,Harry Wright,pikel101,Lip Pike,spala101,Al Spalding,mcmuj101,John McMullin,,,,,spala101,Al Spalding,mcmuj101,John McMullin,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0,birdd102,Dave Birdsall,9.0,mcvec101,Cal McVey,2.0,wrigh101,Harry Wright,8.0,goulc101,Charlie Gould,3.0,schah101,Harry Schafer,5.0,conef101,Fred Cone,7.0,spala101,Al Spalding,1.0,flync101,Clipper Flynn,9.0,mcgem101,Mike McGeary,2.0,yorkt101,Tom York,8.0,mcmuj101,John McMullin,1.0,kings101,Steve King,7.0,beave101,Edward Beavens,4.0,bells101,Steve Bellan,5.0,pikel101,Lip Pike,3.0,cravb101,Bill Craver,6.0,HTBF,Y
5,18710511,0,Thu,CH1,,2,CL1,,4,18,10,48.0,D,,V,,CLE01,2500.0,120.0,12120534,1410004,41.0,15.0,1.0,3.0,3.0,10.0,0.0,0.0,0.0,8.0,,1.0,0.0,,-1.0,,7.0,1.0,4.0,4.0,0.0,0.0,24.0,11.0,4.0,3.0,0.0,0.0,39.0,13.0,1.0,2.0,1.0,7.0,0.0,0.0,0.0,0.0,,0.0,0.0,,-1.0,,5.0,2.0,10.0,10.0,2.0,0.0,24.0,7.0,5.0,2.0,0.0,0.0,haynj901,J.H. Haynie,,,,,,,,,,,woodj106,Jimmy Wood,paboc101,Charlie Pabor,zettg101,George Zettlein,prata101,Al Pratt,,,,,zettg101,George Zettlein,prata101,Al Pratt,mcatb101,Bub McAtee,3.0,kingm101,Marshall King,7.0,hodec101,Charlie Hodes,2.0,woodj106,Jimmy Wood,4.0,simmj101,Joe Simmons,8.0,folet101,Tom Foley,9.0,duffe101,Ed Duffy,6.0,pinke101,Ed Pinkham,5.0,zettg101,George Zettlein,1.0,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,,Y
6,18710513,0,Sat,WS3,,2,CL1,,5,12,8,54.0,D,,,,CIN01,1200.0,150.0,141020004,4100012,42.0,9.0,2.0,0.0,0.0,5.0,0.0,0.0,0.0,1.0,,1.0,1.0,,-1.0,,4.0,1.0,2.0,2.0,0.0,0.0,27.0,9.0,6.0,3.0,1.0,0.0,39.0,11.0,1.0,1.0,0.0,5.0,0.0,0.0,0.0,0.0,,1.0,0.0,,-1.0,,4.0,1.0,3.0,3.0,3.0,0.0,27.0,6.0,8.0,1.0,0.0,0.0,drapj901,Doc Draper,,,,,,,,,,,younn801,Nick Young,paboc101,Charlie Pabor,braia102,Asa Brainard,prata101,Al Pratt,,,,,braia102,Asa Brainard,prata101,Al Pratt,watef102,Fred Waterman,2.0,forcd101,Davy Force,6.0,mille105,Everett Mills,3.0,glenj102,John Glenn,9.0,burrh101,Henry Burroughs,5.0,leona101,Andy Leonard,7.0,braia102,Asa Brainard,1.0,hallg101,George Hall,8.0,berth101,Henry Berthrong,4.0,whitd102,Deacon White,2.0,allia101,Art Allison,8.0,paboc101,Charlie Pabor,7.0,carlj102,Jim Carleton,3.0,kimbg101,Gene Kimball,4.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,white104,Elmer White,9.0,bassj101,John Bass,6.0,,Y
7,18710513,0,Sat,CH1,,3,FW1,,2,14,5,54.0,D,,,,FOR01,1500.0,105.0,053210012,200002001,45.0,17.0,5.0,2.0,0.0,10.0,0.0,0.0,0.0,1.0,,1.0,2.0,,-1.0,,5.0,1.0,2.0,2.0,0.0,0.0,27.0,8.0,4.0,2.0,0.0,0.0,33.0,5.0,1.0,2.0,0.0,3.0,0.0,0.0,0.0,3.0,,1.0,0.0,,-1.0,,4.0,1.0,6.0,6.0,0.0,0.0,27.0,8.0,7.0,3.0,0.0,0.0,haynj901,J.H. Haynie,,,,,,,,,,,woodj106,Jimmy Wood,lennb101,Bill Lennon,zettg101,George Zettlein,mathb101,Bobby Mathews,,,,,zettg101,George Zettlein,mathb101,Bobby Mathews,mcatb101,Bub McAtee,3.0,kingm101,Marshall King,8.0,hodec101,Charlie Hodes,2.0,woodj106,Jimmy Wood,4.0,simmj101,Joe Simmons,9.0,folet101,Tom Foley,7.0,duffe101,Ed Duffy,6.0,pinke101,Ed Pinkham,5.0,zettg101,George Zettlein,1.0,selmf101,Frank Sellman,5.0,mathb101,Bobby Mathews,1.0,foraj101,Jim Foran,3.0,goldw101,Wally Goldsmith,6.0,lennb101,Bill Lennon,2.0,caret101,Tom Carey,4.0,mince101,Ed Mincher,7.0,donnp101,Pete Donnelly,8.0,kellb105,Bill Kelly,9.0,,Y
8,18710515,0,Mon,WS3,,3,FW1,,3,6,12,54.0,D,,,,FOR01,,140.0,030100101,3300123,42.0,8.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,,1.0,1.0,,-1.0,,10.0,1.0,4.0,4.0,0.0,0.0,27.0,13.0,5.0,6.0,0.0,0.0,49.0,20.0,5.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0,,1.0,1.0,,-1.0,,10.0,1.0,1.0,1.0,0.0,0.0,27.0,6.0,9.0,2.0,0.0,0.0,holls901,Sam Holley,,,,,,,,,,,younn801,Nick Young,lennb101,Bill Lennon,mathb101,Bobby Mathews,braia102,Asa Brainard,,,,,braia102,Asa Brainard,mathb101,Bobby Mathews,watef102,Fred Waterman,2.0,forcd101,Davy Force,6.0,mille105,Everett Mills,3.0,glenj102,John Glenn,9.0,burrh101,Henry Burroughs,5.0,leona101,Andy Leonard,7.0,braia102,Asa Brainard,1.0,hallg101,George Hall,8.0,berth101,Henry Berthrong,4.0,selmf101,Frank Sellman,5.0,mathb101,Bobby Mathews,1.0,foraj101,Jim Foran,3.0,goldw101,Wally Goldsmith,6.0,lennb101,Bill Lennon,2.0,caret101,Tom Carey,4.0,mince101,Ed Mincher,7.0,donnp101,Pete Donnelly,9.0,kellb105,Bill Kelly,8.0,,Y
9,18710516,0,Tue,TRO,,2,BS1,,3,29,14,54.0,D,,,,BOS01,2500.0,,302604(11)30,610020221,64.0,26.0,3.0,1.0,0.0,26.0,0.0,0.0,0.0,2.0,,0.0,3.0,,-1.0,,10.0,1.0,4.0,4.0,0.0,0.0,27.0,11.0,8.0,3.0,3.0,0.0,43.0,13.0,3.0,0.0,0.0,9.0,0.0,0.0,0.0,4.0,,1.0,3.0,,-1.0,,6.0,1.0,10.0,10.0,0.0,0.0,27.0,17.0,15.0,2.0,1.0,0.0,rogem901,Mort Rogers,,,,,,,,,,,pikel101,Lip Pike,wrigh101,Harry Wright,mcmuj101,John McMullin,spala101,Al Spalding,,,,,mcmuj101,John McMullin,spala101,Al Spalding,flync101,Clipper Flynn,9.0,mcgem101,Mike McGeary,2.0,yorkt101,Tom York,8.0,mcmuj101,John McMullin,1.0,kings101,Steve King,7.0,beave101,Edward Beavens,4.0,bells101,Steve Bellan,5.0,pikel101,Lip Pike,3.0,cravb101,Bill Craver,6.0,barnr102,Ross Barnes,4.0,birdd102,Dave Birdsall,9.0,mcvec101,Cal McVey,2.0,wrigh101,Harry Wright,6.0,goulc101,Charlie Gould,3.0,schah101,Harry Schafer,5.0,conef101,Fred Cone,7.0,spala101,Al Spalding,1.0,jacks101,Sam Jackson,8.0,HTBF,Y


In [13]:
# query to add the new column
q = 'ALTER TABLE game_log ADD COLUMN game_id VARCHAR'
run_command(q)

In [14]:
# query to populate the column with concatenation
q = 'UPDATE game_log SET game_id = h_name || date || number_of_game'
run_command(q)

In [15]:
# check to see the results
q = 'SELECT game_id, * FROM game_log LIMIT 5'
run_query(q)

Unnamed: 0,game_id,date,number_of_game,day_of_week,v_name,v_league,v_game_number,h_name,h_league,h_game_number,v_score,h_score,length_outs,day_night,completion,forefeit,protest,park_id,attendance,length_minutes,v_line_score,h_line_score,v_at_bats,v_hits,v_doubles,v_triples,v_homeruns,v_rbi,v_sacrifice_hits,v_sacrifice_flies,v_hit_by_pitch,v_walks,v_intentional_walks,v_strikeouts,v_stolen_bases,v_caught_stealing,v_grounded_into_double,v_first_catcher_interference,v_left_on_base,v_pitchers_used,v_individual_earned_runs,v_team_earned_runs,v_wild_pitches,v_balks,v_putouts,v_assists,v_errors,v_passed_balls,v_double_plays,v_triple_plays,h_at_bats,h_hits,h_doubles,h_triples,h_homeruns,h_rbi,h_sacrifice_hits,h_sacrifice_flies,h_hit_by_pitch,h_walks,h_intentional_walks,h_strikeouts,h_stolen_bases,h_caught_stealing,h_grounded_into_double,h_first_catcher_interference,h_left_on_base,h_pitchers_used,h_individual_earned_runs,h_team_earned_runs,h_wild_pitches,h_balks,h_putouts,h_assists,h_errors,h_passed_balls,h_double_plays,h_triple_plays,hp_umpire_id,hp_umpire_name,1b_umpire_id,1b_umpire_name,2b_umpire_id,2b_umpire_name,3b_umpire_id,3b_umpire_name,lf_umpire_id,lf_umpire_name,rf_umpire_id,rf_umpire_name,v_manager_id,v_manager_name,h_manager_id,h_manager_name,winning_pitcher_id,winning_pitcher_name,losing_pitcher_id,losing_pitcher_name,saving_pitcher_id,saving_pitcher_name,winning_rbi_batter_id,winning_rbi_batter_id_name,v_starting_pitcher_id,v_starting_pitcher_name,h_starting_pitcher_id,h_starting_pitcher_name,v_player_1_id,v_player_1_name,v_player_1_def_pos,v_player_2_id,v_player_2_name,v_player_2_def_pos,v_player_3_id,v_player_3_name,v_player_3_def_pos,v_player_4_id,v_player_4_name,v_player_4_def_pos,v_player_5_id,v_player_5_name,v_player_5_def_pos,v_player_6_id,v_player_6_name,v_player_6_def_pos,v_player_7_id,v_player_7_name,v_player_7_def_pos,v_player_8_id,v_player_8_name,v_player_8_def_pos,v_player_9_id,v_player_9_name,v_player_9_def_pos,h_player_1_id,h_player_1_name,h_playe
r_1_def_pos,h_player_2_id,h_player_2_name,h_player_2_def_pos,h_player_3_id,h_player_3_name,h_player_3_def_pos,h_player_4_id,h_player_4_name,h_player_4_def_pos,h_player_5_id,h_player_5_name,h_player_5_def_pos,h_player_6_id,h_player_6_name,h_player_6_def_pos,h_player_7_id,h_player_7_name,h_player_7_def_pos,h_player_8_id,h_player_8_name,h_player_8_def_pos,h_player_9_id,h_player_9_name,h_player_9_def_pos,additional_info,acquisition_info,game_id.1
0,FW1187105040,18710504,0,Thu,CL1,,1,FW1,,1,0,2,54.0,D,,,,FOR01,200.0,120.0,0,10010000,30.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,6.0,1.0,,-1.0,,4.0,1.0,1.0,1.0,0.0,0.0,27.0,9.0,0.0,3.0,0.0,0.0,31.0,4.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,,0.0,0.0,,-1.0,,3.0,1.0,0.0,0.0,0.0,0.0,27.0,3.0,3.0,1.0,1.0,0.0,boakj901,John Boake,,,,,,,,,,,paboc101,Charlie Pabor,lennb101,Bill Lennon,mathb101,Bobby Mathews,prata101,Al Pratt,,,,,prata101,Al Pratt,mathb101,Bobby Mathews,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,selmf101,Frank Sellman,5.0,mathb101,Bobby Mathews,1.0,foraj101,Jim Foran,3.0,goldw101,Wally Goldsmith,6.0,lennb101,Bill Lennon,2.0,caret101,Tom Carey,4.0,mince101,Ed Mincher,7.0,mcdej101,James McDermott,8.0,kellb105,Bill Kelly,9.0,,Y,FW1187105040
1,WS3187105050,18710505,0,Fri,BS1,,1,WS3,,1,20,18,54.0,D,,,,WAS01,5000.0,145.0,107000435,640113030,41.0,13.0,1.0,2.0,0.0,13.0,0.0,0.0,0.0,18.0,,5.0,3.0,,-1.0,,12.0,1.0,6.0,6.0,1.0,0.0,27.0,13.0,10.0,1.0,2.0,0.0,49.0,14.0,2.0,0.0,0.0,11.0,0.0,0.0,0.0,10.0,,2.0,1.0,,-1.0,,14.0,1.0,7.0,7.0,0.0,0.0,27.0,20.0,10.0,2.0,3.0,0.0,dobsh901,Henry Dobson,,,,,,,,,,,wrigh101,Harry Wright,younn801,Nick Young,spala101,Al Spalding,braia102,Asa Brainard,,,,,spala101,Al Spalding,braia102,Asa Brainard,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0,birdd102,Dave Birdsall,9.0,mcvec101,Cal McVey,2.0,wrigh101,Harry Wright,8.0,goulc101,Charlie Gould,3.0,schah101,Harry Schafer,5.0,conef101,Fred Cone,7.0,spala101,Al Spalding,1.0,watef102,Fred Waterman,5.0,forcd101,Davy Force,6.0,mille105,Everett Mills,3.0,allid101,Doug Allison,2.0,hallg101,George Hall,7.0,leona101,Andy Leonard,4.0,braia102,Asa Brainard,1.0,burrh101,Henry Burroughs,9.0,berth101,Henry Berthrong,8.0,HTBF,Y,WS3187105050
2,RC1187105060,18710506,0,Sat,CL1,,2,RC1,,1,12,4,54.0,D,,,,RCK01,1000.0,140.0,610020003,10020100,49.0,11.0,1.0,1.0,0.0,8.0,0.0,0.0,0.0,0.0,,1.0,0.0,,-1.0,,10.0,1.0,0.0,0.0,2.0,0.0,27.0,12.0,8.0,5.0,0.0,0.0,36.0,7.0,2.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,,3.0,5.0,,-1.0,,5.0,1.0,3.0,3.0,1.0,0.0,27.0,12.0,13.0,3.0,0.0,0.0,mawnj901,J.H. Manny,,,,,,,,,,,paboc101,Charlie Pabor,hasts101,Scott Hastings,prata101,Al Pratt,fishc102,Cherokee Fisher,,,,,prata101,Al Pratt,fishc102,Cherokee Fisher,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,mackd101,Denny Mack,3.0,addyb101,Bob Addy,4.0,fishc102,Cherokee Fisher,1.0,hasts101,Scott Hastings,8.0,ham-r101,Ralph Ham,5.0,ansoc101,Cap Anson,2.0,sagep101,Pony Sager,6.0,birdg101,George Bird,7.0,stirg101,Gat Stires,9.0,,Y,RC1187105060
3,CH1187105080,18710508,0,Mon,CL1,,3,CH1,,1,12,14,54.0,D,,,,CHI01,5000.0,150.0,101403111,77000000,46.0,15.0,2.0,1.0,2.0,10.0,0.0,0.0,0.0,0.0,,1.0,0.0,,-1.0,,7.0,1.0,6.0,6.0,0.0,0.0,27.0,15.0,11.0,6.0,0.0,0.0,43.0,11.0,2.0,0.0,0.0,8.0,0.0,0.0,0.0,4.0,,2.0,1.0,,-1.0,,6.0,1.0,4.0,4.0,0.0,0.0,27.0,14.0,7.0,2.0,0.0,0.0,willg901,Gardner Willard,,,,,,,,,,,paboc101,Charlie Pabor,woodj106,Jimmy Wood,zettg101,George Zettlein,prata101,Al Pratt,,,,,prata101,Al Pratt,zettg101,George Zettlein,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,mcatb101,Bub McAtee,3.0,kingm101,Marshall King,8.0,hodec101,Charlie Hodes,2.0,woodj106,Jimmy Wood,4.0,simmj101,Joe Simmons,9.0,folet101,Tom Foley,7.0,duffe101,Ed Duffy,6.0,pinke101,Ed Pinkham,5.0,zettg101,George Zettlein,1.0,,Y,CH1187105080
4,TRO187105090,18710509,0,Tue,BS1,,2,TRO,,1,9,5,54.0,D,,,,TRO01,3250.0,145.0,2232,101003000,46.0,17.0,4.0,1.0,0.0,6.0,0.0,0.0,0.0,2.0,,0.0,1.0,,-1.0,,12.0,1.0,2.0,2.0,0.0,0.0,27.0,12.0,5.0,0.0,1.0,0.0,36.0,9.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,3.0,,0.0,2.0,,-1.0,,7.0,1.0,3.0,3.0,1.0,0.0,27.0,11.0,7.0,3.0,0.0,0.0,leroi901,Isaac Leroy,,,,,,,,,,,wrigh101,Harry Wright,pikel101,Lip Pike,spala101,Al Spalding,mcmuj101,John McMullin,,,,,spala101,Al Spalding,mcmuj101,John McMullin,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0,birdd102,Dave Birdsall,9.0,mcvec101,Cal McVey,2.0,wrigh101,Harry Wright,8.0,goulc101,Charlie Gould,3.0,schah101,Harry Schafer,5.0,conef101,Fred Cone,7.0,spala101,Al Spalding,1.0,flync101,Clipper Flynn,9.0,mcgem101,Mike McGeary,2.0,yorkt101,Tom York,8.0,mcmuj101,John McMullin,1.0,kings101,Steve King,7.0,beave101,Edward Beavens,4.0,bells101,Steve Bellan,5.0,pikel101,Lip Pike,3.0,cravb101,Bill Craver,6.0,HTBF,Y,TRO187105090


## Looking for Normalization Opportunities

Looking at the tables above, we can spot several opportunities to normalize our data and, eventually, the database.

__Repetition in Columns:__

In the segment of data below, player information is repeated across groups of columns: each lineup slot has its own `id`, `name` and `def_pos` columns, and the slot number itself encodes the offensive position (`off_pos`). We can normalize this information by storing players in a separate table.

In [16]:
# check the above mentioned fragment
q = '''
SELECT v_player_1_id, v_player_1_name, v_player_1_def_pos,
        v_player_2_id, v_player_2_name, v_player_2_def_pos
FROM game_log LIMIT 10
'''
run_query(q)

Unnamed: 0,v_player_1_id,v_player_1_name,v_player_1_def_pos,v_player_2_id,v_player_2_name,v_player_2_def_pos
0,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0
1,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0
2,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0
3,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0
4,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0
5,mcatb101,Bub McAtee,3.0,kingm101,Marshall King,7.0
6,watef102,Fred Waterman,2.0,forcd101,Davy Force,6.0
7,mcatb101,Bub McAtee,3.0,kingm101,Marshall King,8.0
8,watef102,Fred Waterman,2.0,forcd101,Davy Force,6.0
9,flync101,Clipper Flynn,9.0,mcgem101,Mike McGeary,2.0


To normalize the data, we can convert it into a table such as this one.

|id|name|def_pos|off_pos|
|---|---|---|---|
|villj001|Jonathan Villar|5.0|1.0|
|granc001|Curtis Granderson|8.0|1.0|
|kendh001|Howie Kendrick|7.0|1.0|
|jasoj001|John Jaso|3.0|1.0|
|gordd002|Dee Gordon|4.0|1.0|
|genns001|Scooter Gennett|4.0|2.0|
|cabra002|Asdrubal Cabrera|6.0|2.0|
|turnj001|Justin Turner|5.0|2.0|
|polag001|Gregory Polanco|9.0|2.0|
|telit001|Tomas Telis|2.0|2.0|

We could build such a table from our `game_log` table, but our `person_codes` table already contains the `id` and `name` of each player. We can therefore remove the player names from `game_log` and keep only the player IDs, which can be looked up in `person_codes`.

The same approach applies to the rest of `game_log`: we can remove the following name columns and keep only the ID column that precedes each of them:

* `hp_umpire_name`
* `1b_umpire_name`
* `2b_umpire_name`
* `3b_umpire_name`
* `lf_umpire_name`
* `rf_umpire_name`
* `v_manager_name`
* `h_manager_name`
* `winning_pitcher_name`
* `losing_pitcher_name`
* `saving_pitcher_name`
* `winning_rbi_batter_id_name`
* `v_starting_pitcher_name`
* `h_starting_pitcher_name`

As discussed above, all `v_player_{num}_name` and `h_player_{num}_name` columns would also be removed, keeping the associated IDs.
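As a rough illustration of how the repeated lineup columns could be reshaped into the long form shown in the table above, here is a minimal sketch. It uses an in-memory database and a hypothetical `game_log_demo` table holding only two lineup slots; the table name and the reduced column set are assumptions made for the example, not part of the project database.

```python
import sqlite3

# Toy stand-in for game_log (hypothetical table): two games, two lineup slots.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE game_log_demo (
        game_id TEXT,
        v_player_1_id TEXT, v_player_1_def_pos REAL,
        v_player_2_id TEXT, v_player_2_def_pos REAL
    )
""")
conn.executemany(
    "INSERT INTO game_log_demo VALUES (?, ?, ?, ?, ?)",
    [
        ("FW1187105040", "whitd102", 2.0, "kimbg101", 4.0),
        ("WS3187105050", "wrigg101", 6.0, "barnr102", 4.0),
    ],
)

# One SELECT per lineup slot; the slot number becomes the off_pos value.
long_rows = conn.execute("""
    SELECT game_id, v_player_1_id AS id, v_player_1_def_pos AS def_pos, 1 AS off_pos
    FROM game_log_demo
    UNION ALL
    SELECT game_id, v_player_2_id, v_player_2_def_pos, 2
    FROM game_log_demo
    ORDER BY game_id, off_pos
""").fetchall()
conn.close()

for row in long_rows:
    print(row)
```

Each wide row turns into one row per lineup slot, which is the shape a players table needs.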

In [17]:
# check the table columns
q = 'PRAGMA table_info(game_log);'
# run_query(q)

__Redundant Data__

We want to ensure that our database doesn't contain redundant information, that is, data that we can either find in another table or derive. One example can be found in `park_codes`. Let's check the first few rows of that table.

In [18]:
q = 'SELECT * FROM park_codes LIMIT 5'
run_query(q)

Unnamed: 0,park_id,name,aka,city,state,start,end,league,notes
0,ALB01,Riverside Park,,Albany,NY,09/11/1880,05/30/1882,NL,TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1,ALT01,Columbia Park,,Altoona,PA,04/30/1884,05/31/1884,UA,
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,04/19/1966,,AL,
3,ARL01,Arlington Stadium,,Arlington,TX,04/21/1972,10/03/1993,AL,
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,04/11/1994,,AL,


The `start` and `end` columns show the first and last games played at each park. However, we will be able to derive this information from the park recorded for each game. Similarly, the league information is going to be available elsewhere in our database.
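To make the "derivable" claim concrete, the sketch below recovers the first and last game per park from game-level rows. It assumes a hypothetical `game_demo` table with dates stored as `YYYYMMDD` text (so that `MIN` and `MAX` sort chronologically); the third row is invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE game_demo (game_id TEXT PRIMARY KEY, date TEXT, park_id TEXT)"
)
conn.executemany(
    "INSERT INTO game_demo VALUES (?, ?, ?)",
    [
        ("FW1187105040", "18710504", "FOR01"),
        ("WS3187105050", "18710505", "WAS01"),
        ("FW1187105130", "18710513", "FOR01"),  # made-up game for the example
    ],
)

# First and last game date per park, derived from the games themselves.
park_span = conn.execute("""
    SELECT park_id, MIN(date) AS first_game, MAX(date) AS last_game
    FROM game_demo
    GROUP BY park_id
    ORDER BY park_id
""").fetchall()
conn.close()

print(park_span)
```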

## Planning a Normalized Schema

In this section, we plan to prepare a database schema for our new database. We will use the [DbDesigner](https://www.dbdesigner.net/) tool to design a new schema. Below is the image of the proposed normalized schema.

<img src='mlb_schema.svg'>

To simplify the database and reduce data redundancy, we made the following normalization choices:

* `person`
    * The 'debut' columns have been omitted since they represent the date when an individual started their career, which we can deduce from other tables.
    * Since the game log doesn't contain information on coaches, we omitted this data.
* `park`
    * The start, end, and league columns contain data that is found in the main game log and can be removed.
* `league`
    * Because some of the older leagues are not well known, we will create a table to store league names.
* `appearance_type`
    * Our appearance table will include data on players with positions, umpires, managers, and awards (like winning pitcher). This table will store information on what different types of appearances are available.

## Creating Tables Without Foreign Key Relations

Let's start creating our tables.

In [19]:
# drop person table if exists
q = """DROP TABLE IF EXISTS person;"""
run_command(q)

# query to create `person` table
q = """
    CREATE TABLE person(
        person_id CHAR PRIMARY KEY,
        first_name CHAR (25),
        last_name CHAR (25)
    );
"""
run_command(q)

# insert from person_codes to person
q = """
    INSERT INTO person
    SELECT id, first, last FROM person_codes;
"""
run_command(q)

# check results
q = """
    SELECT * FROM person LIMIT 5
"""
run_query(q)

Unnamed: 0,person_id,first_name,last_name
0,aardd001,David,Aardsma
1,aaroh101,Hank,Aaron
2,aarot101,Tommie,Aaron
3,aased001,Don,Aase
4,abada001,Andy,Abad


Next, let's create the `park` table and insert the data from `park_codes`.

In [20]:
# drop park table if exists
q = """DROP TABLE IF EXISTS park;"""
run_command(q)

# query to create `park` table
q = """
    CREATE TABLE park(
        park_id CHAR PRIMARY KEY,
        name CHAR (25),
        nickname CHAR (25),
        city CHAR (25),
        state CHAR (25),
        notes CHAR (25)
    );
"""
run_command(q)

# insert from park_codes to park
q = """
    INSERT INTO park
    SELECT park_id, name, aka, city, state, notes FROM park_codes;
"""
run_command(q)

# check results
q = """
    SELECT * FROM park LIMIT 5
"""
run_query(q)

Unnamed: 0,park_id,name,nickname,city,state,notes
0,ALB01,Riverside Park,,Albany,NY,TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1,ALT01,Columbia Park,,Altoona,PA,
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,
3,ARL01,Arlington Stadium,,Arlington,TX,
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,


Next, we do the same for the `league` table. Here we will use league names that we looked up manually and insert the following values:

* AL - American League
* AA - American Association
* FL - Federal League
* PL - Players League
* UA - Union Association
* NL - National League

In [21]:
# drop league table if exists
q = """DROP TABLE IF EXISTS league;"""
run_command(q)

# query to create `league` table
q = """
    CREATE TABLE league(
        league_id CHAR PRIMARY KEY,
        name CHAR (25)
    );
"""
run_command(q)

# insert from league data
q = """
    INSERT INTO league
    VALUES 
        ('AL', 'American League'),
        ('AA', 'American Association'),
        ('FL', 'Federal League'),
        ('PL', 'Players League'),
        ('UA', 'Union Association'),
        ('NL', 'National League')
"""
run_command(q)

# check results
q = """
    SELECT * FROM league LIMIT 5
"""
run_query(q)

Unnamed: 0,league_id,name
0,AL,American League
1,AA,American Association
2,FL,Federal League
3,PL,Players League
4,UA,Union Association


Next, we will create the `appearance_type` table and insert the data from appearance_type.csv.

In [22]:
# read the csv file
appearance_type = pd.read_csv("data/appearance_type.csv")

# save the dataframe to database
with sqlite3.connect('mlb.db') as conn:
    appearance_type.to_sql(name='appearance_type_temp', con=conn, if_exists='replace', index=False)
    
# copy the information to a new table with a primary key
q = """
    CREATE TABLE appearance_type(
        appearance_type_id CHAR PRIMARY KEY,
        name TEXT,
        category TEXT
    );
"""
run_command(q)

# insert the values from the temp table
q = """
    INSERT INTO appearance_type
    SELECT * FROM appearance_type_temp;
"""
run_command(q)
    
# drop appearance_type_temp table 
q = """DROP TABLE appearance_type_temp;"""
run_command(q)
    
# check results
q = """
    SELECT * FROM appearance_type LIMIT 5
"""
run_query(q)

Unnamed: 0,appearance_type_id,name,category
0,O1,Batter 1,offense
1,O2,Batter 2,offense
2,O3,Batter 3,offense
3,O4,Batter 4,offense
4,O5,Batter 5,offense


## Adding The Team and Game Tables

Now that we have added all of the tables that don't have foreign key relationships, let's add the next two tables. The `game` and `team` tables need to exist before our two appearance tables are created. Here is the schema of these tables, along with the two tables they have foreign key relations to:

<img src="mlb_schema_2.svg">

Here are some notes on the normalization choices made with each of these tables:
* `team`
    * The start, end, and sequence columns can be derived from the game level data.
* `game`
    * We have chosen to include all columns for the game log that don't refer to one specific team or player, instead putting those in two appearance tables.
    * We have removed the column with the day of the week, as this can be derived from the date.
    * We have changed the `day_night` column to `day`, with the intention of making this a boolean column. Even though SQLite doesn't support the `BOOLEAN` type, we can use this when creating our table and SQLite will manage the underlying types behind the scenes (for more on how this works, [refer to the SQLite documentation](https://www.sqlite.org/datatype3.html)). This means that anyone querying the schema of our database in the future understands how that column is intended to be used.
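The point about `BOOLEAN` can be checked with a small experiment. The sketch below is only an illustration on a hypothetical `demo` table in an in-memory database: SQLite keeps the declared type name in the schema, while the values themselves are stored as integers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# BOOLEAN is not a native storage class, but SQLite accepts it as a declared type.
conn.execute("CREATE TABLE demo (game_id TEXT PRIMARY KEY, day BOOLEAN)")
conn.execute("INSERT INTO demo VALUES ('g1', 1), ('g2', 0)")

# The declared type is preserved in the schema...
declared = [row[2] for row in conn.execute("PRAGMA table_info(demo)")]

# ...while typeof() reveals how the values are actually stored.
stored = conn.execute(
    "SELECT game_id, day, typeof(day) FROM demo ORDER BY game_id"
).fetchall()
conn.close()

print(declared)
print(stored)
```

Someone reading the schema sees `BOOLEAN` and knows the intent, even though queries still compare against `1` and `0`.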
    
Let's create `team` and `game` tables.

In [23]:
# update run_command function to enforce foreign key constraints
def run_command(c):
    with sqlite3.connect(db_name) as conn:
        conn.execute('PRAGMA foreign_keys = ON;')
        conn.isolation_level = None
        conn.execute(c)

In [24]:
# check results
q = """
    SELECT * FROM team_codes LIMIT 5
"""
run_query(q)

Unnamed: 0,team_id,league,start,end,city,nickname,franch_id,seq
0,ALT,UA,1884,1884,Altoona,Mountain Cities,ALT,1
1,ARI,NL,1998,0,Arizona,Diamondbacks,ARI,1
2,BFN,NL,1879,1885,Buffalo,Bisons,BFN,1
3,BFP,PL,1890,1890,Buffalo,Bisons,BFP,1
4,BL1,,1872,1874,Baltimore,Canaries,BL1,1


Before we insert the data into the `team` table, note that `team_codes` contains duplicate rows. We could drop them first; instead, we will use `INSERT OR IGNORE` so that only one row per `team_id` gets recorded.

Let's insert the `team_codes` data into `team`.
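As a quick illustration of how `INSERT OR IGNORE` behaves with duplicate keys, here is a minimal sketch on a hypothetical `team_demo` table in an in-memory database. The duplicate row is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE team_demo (team_id TEXT PRIMARY KEY, city TEXT)")

rows = [
    ("ALT", "Altoona"),
    ("ARI", "Arizona"),
    ("ALT", "Altoona duplicate"),  # same primary key as the first row
]
# Rows that would violate the primary key are skipped instead of raising an error.
conn.executemany("INSERT OR IGNORE INTO team_demo VALUES (?, ?)", rows)

kept = conn.execute(
    "SELECT team_id, city FROM team_demo ORDER BY team_id"
).fetchall()
conn.close()

print(kept)
```

The first row seen for a given key wins, so it is worth checking that the duplicates really are equivalent before relying on this.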

In [25]:
# drop team table if exists
q = """DROP TABLE IF EXISTS team;"""
run_command(q)

# query to create `team` table
q = """
    CREATE TABLE team(
        team_id CHAR PRIMARY KEY,
        league_id CHAR (25),
        city CHAR (25),
        nickname CHAR (25),
        franch_id CHAR (25),
        FOREIGN KEY(league_id) REFERENCES league(league_id)
    );
"""
run_command(q)

# insert from team_codes to team
q = """
    INSERT OR IGNORE INTO "team"
    SELECT team_id, league, city, nickname, franch_id FROM team_codes;
"""
run_command(q)

# check results
q = """
    SELECT * FROM team LIMIT 5
"""
run_query(q)

Unnamed: 0,team_id,league_id,city,nickname,franch_id
0,ALT,UA,Altoona,Mountain Cities,ALT
1,ARI,NL,Arizona,Diamondbacks,ARI
2,BFN,NL,Buffalo,Bisons,BFN
3,BFP,PL,Buffalo,Bisons,BFP
4,BL1,,Baltimore,Canaries,BL1


Next, let's create the `game` table and insert the values from game_log.

In [26]:
# drop game table if exists
q = """DROP TABLE IF EXISTS game;"""
run_command(q)

# query to create `game` table
q = """
    CREATE TABLE game(
        game_id CHAR PRIMARY KEY,
        date DATE,
        number_of_game CHAR (25),
        park_id CHAR (25),
        length_outs CHAR (25),
        day INTEGER,
        completion CHAR (255),
        forefeit CHAR (1),
        protest CHAR (1),
        attendance FLOAT,
        length_minutes FLOAT,
        additional_info CHAR (255),
        acquisition_info CHAR (255),
        FOREIGN KEY(park_id) REFERENCES park(park_id)
    );
"""
run_command(q)

In [27]:
# insert from game_log to game
q = """
    INSERT INTO game
    SELECT game_id, date, number_of_game, park_id, length_outs, day_night, completion,
            forefeit, protest, attendance, length_minutes, additional_info, acquisition_info FROM game_log;
"""
run_command(q)

# update the day column to hold 0 or 1
q = """
    UPDATE game SET day =
    CASE WHEN day = 'D' THEN 1 ELSE 0 END
"""
run_command(q)

In [28]:
# check results
q = """
    SELECT * FROM game LIMIT 5
"""
run_query(q)

Unnamed: 0,game_id,date,number_of_game,park_id,length_outs,day,completion,forefeit,protest,attendance,length_minutes,additional_info,acquisition_info
0,FW1187105040,18710504,0,FOR01,54.0,1,,,,200.0,120.0,,Y
1,WS3187105050,18710505,0,WAS01,54.0,1,,,,5000.0,145.0,HTBF,Y
2,RC1187105060,18710506,0,RCK01,54.0,1,,,,1000.0,140.0,,Y
3,CH1187105080,18710508,0,CHI01,54.0,1,,,,5000.0,150.0,,Y
4,TRO187105090,18710509,0,TRO01,54.0,1,,,,3250.0,145.0,HTBF,Y


## Adding the Team Appearance Table

At this point, because we have told SQLite to enforce foreign key constraints and have inserted data that obeys these constraints, we'll get an error if we try to drop a table or delete rows that other tables reference. For example, we might try running `DELETE FROM park WHERE park_id = "FOR01";`. If we get stuck, one option is to run `!rm mlb.db` in its own Jupyter cell to delete the database file, so that we can rerun all our cells to recreate the database file, tables and data.

Next, we create the `team_appearance` table. Here is the schema with all its foreign keys.
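The kind of error described above can be reproduced on a small scale. The following sketch uses hypothetical `park_demo` and `game_demo` tables in an in-memory database; it also shows one possible alternative to deleting the whole file, namely removing the referencing rows first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON;")
conn.execute("CREATE TABLE park_demo (park_id TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE game_demo (
        game_id TEXT PRIMARY KEY,
        park_id TEXT,
        FOREIGN KEY(park_id) REFERENCES park_demo(park_id)
    )
""")
conn.execute("INSERT INTO park_demo VALUES ('FOR01')")
conn.execute("INSERT INTO game_demo VALUES ('FW1187105040', 'FOR01')")

# Deleting a park that a game still references violates the foreign key.
try:
    conn.execute("DELETE FROM park_demo WHERE park_id = 'FOR01'")
    delete_blocked = False
except sqlite3.IntegrityError:
    delete_blocked = True

# Deleting the referencing rows first makes the parent delete legal.
conn.execute("DELETE FROM game_demo WHERE park_id = 'FOR01'")
conn.execute("DELETE FROM park_demo WHERE park_id = 'FOR01'")
remaining = conn.execute("SELECT COUNT(*) FROM park_demo").fetchone()[0]
conn.close()

print(delete_blocked, remaining)
```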

<img src="mlb_schema_3.svg">

The `team_appearance` table has a compound primary key composed of the team ID and the game ID. In addition, a boolean column `home` is used to differentiate between the home and the away team. The rest of the columns are scores or statistics that in our original game log are repeated for each of the home and away teams.

In [29]:
# drop team_appearance table if exists
q = """DROP TABLE IF EXISTS team_appearance;"""
run_command(q)

# query to create `team_appearance` table
q = """
    CREATE TABLE team_appearance(
        team_id CHAR,
        game_id CHAR,
        home CHAR (1),
        league_id CHAR,
        score INTEGER,
        line_score TEXT, 
        at_bats REAL,
        hits REAL,
        doubles REAL,
        triples REAL,
        homeruns REAL,
        rbi REAL,
        sacrifice_hits REAL,
        sacrifice_flies REAL,
        hit_by_pitch REAL,
        walks REAL,
        intentional_walks REAL,
        strikeouts REAL,
        stolen_bases REAL,
        caught_stealing REAL,
        grounded_into_double REAL,
        first_catcher_interference REAL,
        left_on_base REAL,
        pitchers_used REAL,
        individual_earned_runs REAL,
        team_earned_runs REAL,
        wild_pitches REAL,
        balks REAL,
        putouts REAL,
        assists REAL,
        errors REAL,
        passed_balls REAL,
        double_plays REAL,
        triple_plays REAL,
        FOREIGN KEY(team_id) REFERENCES team(team_id),
        FOREIGN KEY(game_id) REFERENCES game(game_id),
        FOREIGN KEY(league_id) REFERENCES league(league_id),
        PRIMARY KEY (team_id, game_id)
    );
"""
run_command(q)

In [30]:
# insert values from game_log to team_appearance
q = """
    INSERT INTO team_appearance
        SELECT
            h_name,
            game_id,
            1 AS home,
            h_league,
            h_score,
            h_line_score,
            h_at_bats,
            h_hits,
            h_doubles,
            h_triples,
            h_homeruns,
            h_rbi,
            h_sacrifice_hits,
            h_sacrifice_flies,
            h_hit_by_pitch,
            h_walks,
            h_intentional_walks,
            h_strikeouts,
            h_stolen_bases,
            h_caught_stealing,
            h_grounded_into_double,
            h_first_catcher_interference,
            h_left_on_base,
            h_pitchers_used,
            h_individual_earned_runs,
            h_team_earned_runs,
            h_wild_pitches,
            h_balks,
            h_putouts,
            h_assists,
            h_errors,
            h_passed_balls,
            h_double_plays,
            h_triple_plays
        FROM game_log
    UNION
        SELECT 
            v_name,
            game_id,
            0 AS home,
            v_league,
            v_score,
            v_line_score,
            v_at_bats,
            v_hits,
            v_doubles,
            v_triples,
            v_homeruns,
            v_rbi,
            v_sacrifice_hits,
            v_sacrifice_flies,
            v_hit_by_pitch,
            v_walks,
            v_intentional_walks,
            v_strikeouts,
            v_stolen_bases,
            v_caught_stealing,
            v_grounded_into_double,
            v_first_catcher_interference,
            v_left_on_base,
            v_pitchers_used,
            v_individual_earned_runs,
            v_team_earned_runs,
            v_wild_pitches,
            v_balks,
            v_putouts,
            v_assists,
            v_errors,
            v_passed_balls,
            v_double_plays,
            v_triple_plays
        FROM game_log;
"""
run_command(q)

In [31]:
# check results
run_query("""SELECT * FROM team_appearance ORDER BY line_score DESC LIMIT 5""")

Unnamed: 0,team_id,game_id,home,league_id,score,line_score,at_bats,hits,doubles,triples,homeruns,rbi,sacrifice_hits,sacrifice_flies,hit_by_pitch,walks,intentional_walks,strikeouts,stolen_bases,caught_stealing,grounded_into_double,first_catcher_interference,left_on_base,pitchers_used,individual_earned_runs,team_earned_runs,wild_pitches,balks,putouts,assists,errors,passed_balls,double_plays,triple_plays
0,PHI,PHI198506110,1,NL,26,97005140x,50.0,27.0,10.0,2.0,2.0,25.0,0.0,1.0,1.0,7.0,1.0,3.0,2.0,0.0,2.0,0.0,9.0,3.0,6.0,6.0,1.0,1.0,27.0,11.0,1.0,0.0,1.0,0.0
1,CHA,NYA200006180,0,AL,17,930040100,45.0,18.0,5.0,0.0,1.0,17.0,0.0,0.0,0.0,12.0,0.0,9.0,1.0,0.0,0.0,0.0,13.0,3.0,4.0,4.0,0.0,0.0,27.0,4.0,0.0,0.0,1.0,0.0
2,CIN,PHI193508242,0,NL,13,930001000,40.0,14.0,5.0,0.0,0.0,11.0,1.0,0.0,2.0,9.0,1.0,2.0,0.0,0.0,0.0,0.0,12.0,1.0,2.0,2.0,0.0,0.0,27.0,9.0,0.0,0.0,1.0,0.0
3,HOU,KCA201606240,0,AL,13,930000010,43.0,14.0,1.0,1.0,2.0,12.0,0.0,0.0,0.0,7.0,0.0,7.0,0.0,0.0,1.0,0.0,10.0,2.0,3.0,3.0,0.0,0.0,27.0,14.0,1.0,0.0,2.0,0.0
4,PHA,PHA192907250,1,AL,21,92315001x,46.0,25.0,6.0,0.0,4.0,20.0,2.0,0.0,0.0,4.0,,0.0,4.0,0.0,,0.0,7.0,2.0,3.0,0.0,0.0,0.0,27.0,16.0,0.0,0.0,3.0,0.0


In [34]:
# check results
run_query("""SELECT count(*) FROM game_log""")

Unnamed: 0,count(*)
0,171907


In [35]:
# check results
run_query("""SELECT count(*) FROM team_appearance""")

Unnamed: 0,count(*)
0,343814


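`team_appearance` has 343,814 rows, exactly twice the 171,907 rows in `game_log` (one home row and one visiting row per game), so we have successfully imported all of the data from `game_log` into `team_appearance`.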
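Comparing row counts is a coarse check. A slightly stricter one, sketched below on a hypothetical `team_appearance_demo` table with only the key columns, is to confirm that every game has exactly two rows and exactly one of them is flagged as home. The assumption here is that `home` holds `1` or `0`, as in the insert query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE team_appearance_demo (
        team_id TEXT,
        game_id TEXT,
        home INTEGER,
        PRIMARY KEY (team_id, game_id)
    )
""")
conn.executemany(
    "INSERT INTO team_appearance_demo VALUES (?, ?, ?)",
    [
        ("FW1", "FW1187105040", 1),
        ("CL1", "FW1187105040", 0),
        ("WS3", "WS3187105050", 1),
        ("BS1", "WS3187105050", 0),
    ],
)

# Any game listed here breaks the "one home row, one away row" expectation.
bad_games = conn.execute("""
    SELECT game_id
    FROM team_appearance_demo
    GROUP BY game_id
    HAVING COUNT(*) != 2 OR SUM(home) != 1
""").fetchall()
conn.close()

print(bad_games)
```

An empty result means every game passed the check.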

## Adding the Person Appearance Table

The last table we need to create is `person_appearance`. Here is the schema of the table and the four tables it has foreign key relations to:

<img src='mlb_schema_4.svg'>

The `person_appearance` table will be used to store information on appearances in games by managers, players, and umpires as detailed in the `appearance_type` table.

We'll need to use a similar technique to insert data as we used with the `team_appearance` table. However, we will have to write much larger queries: one for each column instead of one for each team as before. For each column, we will need to work out what the `appearance_type_id` will be by cross-referencing the columns with the `appearance_type` table.

We have decided to create an integer primary key for this table, because making every column part of a compound primary key quickly becomes cumbersome when writing queries. In SQLite, if you have an integer primary key and don't specify a value for this column when inserting rows, [SQLite will autoincrement this column for you](https://sqlite.org/autoinc.html).
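The automatic key assignment can be seen in a minimal sketch on a hypothetical `appearance_demo` table. Only `person_id` is supplied on insert; SQLite fills in the integer key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE appearance_demo (
        appearance_id INTEGER PRIMARY KEY,
        person_id TEXT
    )
""")
# The key column is omitted from the insert, so SQLite assigns it.
conn.executemany(
    "INSERT INTO appearance_demo (person_id) VALUES (?)",
    [("curte801",), ("murpj104",)],
)

ids = [
    row[0]
    for row in conn.execute(
        "SELECT appearance_id FROM appearance_demo ORDER BY appearance_id"
    )
]
conn.close()

print(ids)
```

The `AUTOINCREMENT` keyword is not required for this behaviour; per the linked SQLite page, adding it only prevents key values from being reused after rows are deleted.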

In [36]:
run_query("""SELECT sql FROM sqlite_master
WHERE name = "appearance_type"
  AND type = "table";""")

Unnamed: 0,sql
0,"CREATE TABLE appearance_type(\n appearance_type_id CHAR PRIMARY KEY,\n name TEXT,\n category TEXT\n )"


In [43]:
# drop person_appearance table if exists
q = """DROP TABLE IF EXISTS person_appearance;"""
run_command(q)

# query to create `person_appearance` table
q = """
    CREATE TABLE person_appearance(
        appearance_id INTEGER PRIMARY KEY AUTOINCREMENT,
        person_id CHAR,
        team_id CHAR,
        game_id CHAR,
        appearance_type_id TEXT,
        FOREIGN KEY(person_id) REFERENCES person(person_id),
        FOREIGN KEY(team_id) REFERENCES team(team_id),
        FOREIGN KEY(game_id) REFERENCES game(game_id),
        FOREIGN KEY(appearance_type_id) REFERENCES appearance_type(appearance_type_id)
    );
"""
run_command(q)

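Next, let's add the managers, umpires, winning pitchers and so on. Note that the query below covers only the left-field and right-field umpires; the `hp_umpire_id`, `1b_umpire_id`, `2b_umpire_id` and `3b_umpire_id` columns would need analogous `SELECT` blocks.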

In [44]:
q = """
    INSERT INTO person_appearance (
        game_id,
        team_id,
        person_id,
        appearance_type_id
    )
        SELECT
            game_id,
            NULL,
            lf_umpire_id,
            "ULF"
        FROM game_log
        WHERE lf_umpire_id IS NOT NULL

    UNION

        SELECT
            game_id,
            NULL,
            rf_umpire_id,
            "URF"
        FROM game_log
        WHERE rf_umpire_id IS NOT NULL

    UNION

        SELECT
            game_id,
            v_name,
            v_manager_id,
            "MM"
        FROM game_log
        WHERE v_manager_id IS NOT NULL

    UNION

        SELECT
            game_id,
            h_name,
            h_manager_id,
            "MM"
        FROM game_log
        WHERE h_manager_id IS NOT NULL

    UNION

        SELECT
            game_id,
            CASE
                WHEN h_score > v_score THEN h_name
                ELSE v_name
                END,
            winning_pitcher_id,
            "AWP"
        FROM game_log
        WHERE winning_pitcher_id IS NOT NULL

    UNION
        
        SELECT
            game_id,
            CASE
                WHEN h_score < v_score THEN h_name
                ELSE v_name
                END,
            losing_pitcher_id,
            "ALP"
        FROM game_log
        WHERE losing_pitcher_id IS NOT NULL
        
    UNION
        
        SELECT
            game_id,
            CASE
                WHEN h_score > v_score THEN h_name
                ELSE v_name
                END,
            saving_pitcher_id,
            "ASP"
        FROM game_log
        WHERE saving_pitcher_id IS NOT NULL
        
    UNION
    
        SELECT
            game_id,
            CASE
                WHEN h_score > v_score THEN h_name
                ELSE v_name
                END,
            winning_rbi_batter_id,
            "AWB"
        FROM game_log
        WHERE winning_rbi_batter_id IS NOT NULL
        
    UNION
    
        SELECT
            game_id,
            v_name,
            v_starting_pitcher_id,
            "PSP"
        FROM game_log
        WHERE v_starting_pitcher_id IS NOT NULL
        
    UNION
    
        SELECT
            game_id,
            h_name,
            h_starting_pitcher_id,
            "PSP"
        FROM game_log
        WHERE h_starting_pitcher_id IS NOT NULL;
"""
run_command(q)

In [45]:
# insert data from other tables and mainly from game_log into person_appearance
# to save time we will use a loop and python string formatting to generate the queries
template = """
INSERT INTO person_appearance (
    game_id,
    team_id,
    person_id,
    appearance_type_id
) 
    SELECT
        game_id,
        {hv}_name,
        {hv}_player_{num}_id,
        "O{num}"
    FROM game_log
    WHERE {hv}_player_{num}_id IS NOT NULL

UNION

    SELECT
        game_id,
        {hv}_name,
        {hv}_player_{num}_id,
        "D" || CAST({hv}_player_{num}_def_pos AS INT)
    FROM game_log
    WHERE {hv}_player_{num}_id IS NOT NULL;
"""

for hv in ["h","v"]:
    for num in range(1,10):
        query_vars = {
            "hv": hv,
            "num": num
        }
        # run_command is a helper function which runs
        # a query against our database.
        run_command(template.format(**query_vars))
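To see what the loop actually sends to the database, it can help to render the queries without executing them. The sketch below uses a shortened, hypothetical version of the template (only the offensive half) and collects the generated strings in a list.

```python
# Shortened stand-in for the template above (offense part only).
template = """
SELECT game_id, {hv}_name, {hv}_player_{num}_id, "O{num}"
FROM game_log
WHERE {hv}_player_{num}_id IS NOT NULL;
"""

# Same iteration order as the loop above: all home slots, then all visiting slots.
queries = [
    template.format(hv=hv, num=num)
    for hv in ["h", "v"]
    for num in range(1, 10)
]

print(len(queries))
print(queries[0])
```

Printing one rendered query before running the full loop is a cheap way to catch a misspelled column name.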

In [46]:
# check results
run_query("""SELECT * FROM person_appearance LIMIT 5""")

Unnamed: 0,appearance_id,person_id,team_id,game_id,appearance_type_id
0,1,curte801,ALT,ALT188404300,MM
1,2,murpj104,ALT,ALT188404300,PSP
2,3,hodnc101,SLU,ALT188404300,PSP
3,4,sullt101,SLU,ALT188404300,MM
4,5,curte801,ALT,ALT188405020,MM


As we can observe, we have successfully added all of the necessary tables and inserted the required information.

## Removing the Original Tables

Lastly, we need to remove the original tables from our database and finalize the project.

In [50]:
# drop the original tables
q1 = """ DROP TABLE IF EXISTS game_log"""
q2 = """ DROP TABLE IF EXISTS park_codes"""
q3 = """ DROP TABLE IF EXISTS team_codes"""
q4 = """ DROP TABLE IF EXISTS person_codes"""
run_command(q1)
run_command(q2)
run_command(q3)
run_command(q4)
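After dropping the source tables, it is worth confirming what is left. The sketch below shows the idea on an in-memory database with two hypothetical tables; against the real database, the same `sqlite_master` query should list only the normalized tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE game_log_demo (game_id TEXT)")
conn.execute("CREATE TABLE game_demo (game_id TEXT PRIMARY KEY)")

# Drop the stand-in for the original import table.
conn.execute("DROP TABLE IF EXISTS game_log_demo")

# sqlite_master lists every remaining table in the database.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
conn.close()

print(tables)
```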

## Summary

In this project we set out to:

* Import data into SQLite
* Design a normalized database schema
* Create tables for our schema
* Insert data into our schema

And we have successfully cleaned, normalized and imported the information into the new database.